From nobody Sun Feb 8 02:26:45 2026
From: weilin.wang@intel.com
To: weilin.wang@intel.com, Ian Rogers, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Adrian Hunter, Kan Liang
Cc: linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org,
	Perry Taylor, Samantha Alt, Caleb Biggers
Subject: [RFC PATCH v3 1/6] perf stat: Parse and find tpebs events when
	parsing metrics to prepare for perf record sampling
Date: Fri, 1 Mar 2024 19:11:33 -0500
Message-ID: <20240302001139.604829-2-weilin.wang@intel.com>
In-Reply-To: <20240302001139.604829-1-weilin.wang@intel.com>
References: <20240302001139.604829-1-weilin.wang@intel.com>

From: Weilin Wang <weilin.wang@intel.com>

Metrics that use TPEBS values reference the :retire_latency keyword in
their formulas. Collect all such events into a list and pass that list
to perf record so it can sample their retire latency values.
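
For instance (hypothetical event name EVENT_A; this mirrors the string
rewriting done in metricgroup__build_event_string() below), an event id
carrying the retire_latency modifier is turned into a precise-sampling
event for perf record, while the original id is kept as tpebs_name so
the sampled result can be matched back to the metric:

	EVENT_A:R           ->  EVENT_A:p
	cpu_core@EVENT_A@R  ->  cpu_core/EVENT_A/p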

Signed-off-by: Weilin Wang <weilin.wang@intel.com>
---
 tools/perf/builtin-stat.c     | 38 +++++++++++++---
 tools/perf/util/metricgroup.c | 81 +++++++++++++++++++++++++++++------
 tools/perf/util/metricgroup.h | 10 ++++-
 tools/perf/util/stat.h        |  2 +
 4 files changed, 112 insertions(+), 19 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 6bba1a89d030..5a3093541cff 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -162,6 +162,7 @@ static struct perf_stat_config stat_config = {
 	.ctl_fd			= -1,
 	.ctl_fd_ack		= -1,
 	.iostat_run		= false,
+	.tpebs_event_size	= 0,
 };
 
 static bool cpus_map_matched(struct evsel *a, struct evsel *b)
@@ -686,6 +687,12 @@ static enum counter_recovery stat_handle_error(struct evsel *counter)
 	return COUNTER_FATAL;
 }
 
+static int __run_perf_record(void)
+{
+	pr_debug("Prepare perf record for retire_latency\n");
+	return 0;
+}
+
 static int __run_perf_stat(int argc, const char **argv, int run_idx)
 {
 	int interval = stat_config.interval;
@@ -703,6 +710,15 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
 	int err;
 	bool second_pass = false;
 
+	//Prepare perf record for sampling event retire_latency before fork and prepare workload
+	if (stat_config.tpebs_event_size > 0) {
+		int ret;
+
+		ret = __run_perf_record();
+		if (ret)
+			return ret;
+	}
+
 	if (forks) {
 		if (evlist__prepare_workload(evsel_list, &target, argv, is_pipe, workload_exec_failed_signal) < 0) {
 			perror("failed to prepare workload");
@@ -2106,7 +2122,9 @@ static int add_default_attributes(void)
 						stat_config.metric_no_threshold,
 						stat_config.user_requested_cpu_list,
 						stat_config.system_wide,
-						&stat_config.metric_events);
+						&stat_config.metric_events,
+						&stat_config.tpebs_events,
+						&stat_config.tpebs_event_size);
 	}
 
 	if (smi_cost) {
@@ -2139,7 +2157,9 @@ static int add_default_attributes(void)
 						stat_config.metric_no_threshold,
 						stat_config.user_requested_cpu_list,
 						stat_config.system_wide,
-						&stat_config.metric_events);
+						&stat_config.metric_events,
+						&stat_config.tpebs_events,
+						&stat_config.tpebs_event_size);
 	}
 
 	if (topdown_run) {
@@ -2173,7 +2193,9 @@ static int add_default_attributes(void)
 					/*metric_no_threshold=*/true,
 					stat_config.user_requested_cpu_list,
 					stat_config.system_wide,
-					&stat_config.metric_events) < 0)
+					&stat_config.metric_events,
+					&stat_config.tpebs_events,
+					&stat_config.tpebs_event_size) < 0)
 			return -1;
 	}
 
@@ -2214,7 +2236,9 @@ static int add_default_attributes(void)
 					/*metric_no_threshold=*/true,
 					stat_config.user_requested_cpu_list,
 					stat_config.system_wide,
-					&stat_config.metric_events) < 0)
+					&stat_config.metric_events,
+					/*&stat_config.tpebs_events=*/NULL,
+					/*stat_config.tpebs_event_size=*/0) < 0)
 			return -1;
 
 		evlist__for_each_entry(metric_evlist, metric_evsel) {
@@ -2736,6 +2760,8 @@ int cmd_stat(int argc, const char **argv)
 		}
 	}
 
+	INIT_LIST_HEAD(&stat_config.tpebs_events);
+
 	/*
 	 * Metric parsing needs to be delayed as metrics may optimize events
 	 * knowing the target is system-wide.
@@ -2748,7 +2774,9 @@ int cmd_stat(int argc, const char **argv)
 					    stat_config.metric_no_threshold,
 					    stat_config.user_requested_cpu_list,
 					    stat_config.system_wide,
-					    &stat_config.metric_events);
+					    &stat_config.metric_events,
+					    &stat_config.tpebs_events,
+					    &stat_config.tpebs_event_size);
 
 		zfree(&metrics);
 		if (ret) {
diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c
index b24a1c177a80..5ee96d2f16c9 100644
--- a/tools/perf/util/metricgroup.c
+++ b/tools/perf/util/metricgroup.c
@@ -277,7 +277,8 @@ static bool contains_metric_id(struct evsel **metric_events, int num_events,
  */
 static int setup_metric_events(const char *pmu, struct hashmap *ids,
 			       struct evlist *metric_evlist,
-			       struct evsel ***out_metric_events)
+			       struct evsel ***out_metric_events,
+			       size_t tpebs_event_size)
 {
 	struct evsel **metric_events;
 	const char *metric_id;
@@ -286,7 +287,7 @@ static int setup_metric_events(const char *pmu, struct hashmap *ids,
 	bool all_pmus = !strcmp(pmu, "all") || perf_pmus__num_core_pmus() == 1 || !is_pmu_core(pmu);
 
 	*out_metric_events = NULL;
-	ids_size = hashmap__size(ids);
+	ids_size = hashmap__size(ids) - tpebs_event_size;
 
 	metric_events = calloc(ids_size + 1, sizeof(void *));
 	if (!metric_events)
@@ -323,6 +324,7 @@ static int setup_metric_events(const char *pmu, struct hashmap *ids,
 		}
 	}
 	if (matched_events < ids_size) {
+		pr_debug("Error: matched_events = %lu, ids_size = %lu\n", matched_events, ids_size);
 		free(metric_events);
 		return -EINVAL;
 	}
@@ -670,7 +672,9 @@ static int decode_all_metric_ids(struct evlist *perf_evlist, const char *modifie
 static int metricgroup__build_event_string(struct strbuf *events,
 					   const struct expr_parse_ctx *ctx,
 					   const char *modifier,
-					   bool group_events)
+					   bool group_events,
+					   struct list_head *tpebs_events __maybe_unused,
+					   size_t *tpebs_event_size)
 {
 	struct hashmap_entry *cur;
 	size_t bkt;
@@ -683,8 +687,47 @@ static int metricgroup__build_event_string(struct strbuf *events,
 	hashmap__for_each_entry(ctx->ids, cur, bkt) {
 		const char *sep, *rsep, *id = cur->pkey;
 		enum perf_tool_event ev;
+		char *p = strstr(id, ":R");
+		char *p1 = strstr(id, "@R");
+
+		if (p == NULL && p1) {
+			p = strstr(p1+1, "@R");
+			if (p == NULL)
+				p = p1;
+			p = p+1;
+		}
+
+		if (p) {
+			struct tpebs_event *new_event = malloc(sizeof(struct tpebs_event));
+			char *name;
+			char *at;
 
-		pr_debug("found event %s\n", id);
+			pr_debug("event name %s\n", id);
+			new_event->tpebs_name = strdup(id);
+			*p = '\0';
+			name = malloc(strlen(id) + 2);
+			if (!name)
+				return -ENOMEM;
+
+			at = strchr(id, '@');
+			if (at != NULL) {
+				pr_debug("tpebs name %s\n", id);
+				*at = '/';
+				at = strchr(id, '@');
+				*at = '/';
+				strcpy(name, id);
+				strcat(name, "p");
+			} else {
+				strcpy(name, id);
+				strcat(name, ":p");
+			}
+			new_event->name = name;
+			*tpebs_event_size += 1;
+			pr_debug("retire_latency required, tpebs_event_size=%lu, new_event=%s\n",
+				 *tpebs_event_size, new_event->name);
+			list_add_tail(&new_event->nd, tpebs_events);
+			continue;
+		}
 
 		/* Always move tool events outside of the group.
 		 */
 		ev = perf_tool_event__from_str(id);
@@ -1449,7 +1492,8 @@ static int build_combined_expr_ctx(const struct list_head *metric_list,
 static int parse_ids(bool metric_no_merge, struct perf_pmu *fake_pmu,
 		     struct expr_parse_ctx *ids, const char *modifier,
 		     bool group_events, const bool tool_events[PERF_TOOL_MAX],
-		     struct evlist **out_evlist)
+		     struct evlist **out_evlist, struct list_head *tpebs_events,
+		     size_t *tpebs_event_size)
 {
 	struct parse_events_error parse_error;
 	struct evlist *parsed_evlist;
@@ -1492,7 +1536,7 @@ static int parse_ids(bool metric_no_merge, struct perf_pmu *fake_pmu,
 		}
 	}
 	ret = metricgroup__build_event_string(&events, ids, modifier,
-					      group_events);
+					      group_events, tpebs_events, tpebs_event_size);
 	if (ret)
 		return ret;
 
@@ -1531,7 +1575,9 @@ static int parse_groups(struct evlist *perf_evlist,
 			bool system_wide,
 			struct perf_pmu *fake_pmu,
 			struct rblist *metric_events_list,
-			const struct pmu_metrics_table *table)
+			const struct pmu_metrics_table *table,
+			struct list_head *tpebs_events,
+			size_t *tpebs_event_size)
 {
 	struct evlist *combined_evlist = NULL;
 	LIST_HEAD(metric_list);
@@ -1563,7 +1609,8 @@ static int parse_groups(struct evlist *perf_evlist,
 					/*modifier=*/NULL,
 					/*group_events=*/false,
 					tool_events,
-					&combined_evlist);
+					&combined_evlist,
+					tpebs_events, tpebs_event_size);
 	}
 	if (combined)
 		expr__ctx_free(combined);
@@ -1618,14 +1665,15 @@ static int parse_groups(struct evlist *perf_evlist,
 		}
 		if (!metric_evlist) {
 			ret = parse_ids(metric_no_merge, fake_pmu, m->pctx, m->modifier,
-					m->group_events, tool_events, &m->evlist);
+					m->group_events, tool_events, &m->evlist,
+					tpebs_events, tpebs_event_size);
 			if (ret)
 				goto out;
 
 			metric_evlist = m->evlist;
 		}
 		ret = setup_metric_events(fake_pmu ? "all" : m->pmu, m->pctx->ids,
-					  metric_evlist, &metric_events);
+					  metric_evlist, &metric_events, *tpebs_event_size);
 		if (ret) {
 			pr_err("Cannot resolve IDs for %s: %s\n",
 				m->metric_name, m->metric_expr);
@@ -1692,16 +1740,21 @@ int metricgroup__parse_groups(struct evlist *perf_evlist,
 			      bool metric_no_threshold,
 			      const char *user_requested_cpu_list,
 			      bool system_wide,
-			      struct rblist *metric_events)
+			      struct rblist *metric_events,
+			      struct list_head *tpebs_events,
+			      size_t *tpebs_event_size)
 {
 	const struct pmu_metrics_table *table = pmu_metrics_table__find();
 
+	pr_debug("Test debugging\n");
+
 	if (!table)
 		return -EINVAL;
 
 	return parse_groups(perf_evlist, pmu, str, metric_no_group, metric_no_merge,
 			    metric_no_threshold, user_requested_cpu_list, system_wide,
-			    /*fake_pmu=*/NULL, metric_events, table);
+			    /*fake_pmu=*/NULL, metric_events, table, tpebs_events,
+			    tpebs_event_size);
 }
 
 int metricgroup__parse_groups_test(struct evlist *evlist,
@@ -1715,7 +1768,9 @@ int metricgroup__parse_groups_test(struct evlist *evlist,
 				  /*metric_no_threshold=*/false,
 				  /*user_requested_cpu_list=*/NULL,
 				  /*system_wide=*/false,
-				  &perf_pmu__fake, metric_events, table);
+				  &perf_pmu__fake, metric_events, table,
+				  /*tpebs_events=*/NULL,
+				  /*tpebs_event_size=*/0);
 }
 
 struct metricgroup__has_metric_data {
diff --git a/tools/perf/util/metricgroup.h b/tools/perf/util/metricgroup.h
index d5325c6ec8e1..7c24ed768ff3 100644
--- a/tools/perf/util/metricgroup.h
+++ b/tools/perf/util/metricgroup.h
@@ -66,6 +66,12 @@ struct metric_expr {
 	int runtime;
 };
 
+struct tpebs_event {
+	struct list_head nd;
+	const char *name;
+	const char *tpebs_name;
+};
+
 struct metric_event *metricgroup__lookup(struct rblist *metric_events,
 					 struct evsel *evsel,
 					 bool create);
@@ -77,7 +83,9 @@ int metricgroup__parse_groups(struct evlist *perf_evlist,
 			      bool metric_no_threshold,
 			      const char *user_requested_cpu_list,
 			      bool system_wide,
-			      struct rblist *metric_events);
+			      struct rblist *metric_events,
+			      struct list_head *tpebs_events,
+			      size_t *tpebs_event_size);
 int metricgroup__parse_groups_test(struct evlist *evlist,
 				   const struct pmu_metrics_table *table,
 				   const char *str,
diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h
index d6e5c8787ba2..b987960df3c5 100644
--- a/tools/perf/util/stat.h
+++ b/tools/perf/util/stat.h
@@ -109,6 +109,8 @@ struct perf_stat_config {
 	struct cpu_aggr_map	*cpus_aggr_map;
 	u64			*walltime_run;
 	struct rblist		 metric_events;
+	struct list_head	 tpebs_events;
+	size_t			 tpebs_event_size;
 	int			 ctl_fd;
 	int			 ctl_fd_ack;
 	bool			 ctl_fd_close;
-- 
2.43.0

From nobody Sun Feb 8 02:26:45 2026
From: weilin.wang@intel.com
To: weilin.wang@intel.com, Ian Rogers, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Adrian Hunter, Kan Liang
Cc: linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org,
	Perry Taylor, Samantha Alt, Caleb Biggers
Subject: [RFC PATCH v3 2/6] perf stat: Fork and launch perf record when
	perf stat needs to get retire latency value for a metric.
Date: Fri, 1 Mar 2024 19:11:34 -0500
Message-ID: <20240302001139.604829-3-weilin.wang@intel.com>
In-Reply-To: <20240302001139.604829-1-weilin.wang@intel.com>
References: <20240302001139.604829-1-weilin.wang@intel.com>

From: Weilin Wang <weilin.wang@intel.com>

When a retire_latency value is used in a metric formula, perf stat
forks a perf record process with the "-e" and "-W" options. perf record
collects the required retire_latency values in parallel while perf stat
is collecting counting values.

When perf stat stops counting, it sends SIGTERM to the perf record
process and receives the sampling data back from perf record through a
pipe. perf stat then processes the received data to get the retire
latency data and calculate the metric results.
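
With system-wide counting and two hypothetical tpebs events, the argv
assembled by __run_perf_record() below is equivalent to running:

	perf record -W -a -e EVENT_A:p -e EVENT_B:p -o -

where "-o -" makes perf record write the sampled data to the pipe that
perf stat reads back.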

Signed-off-by: Weilin Wang <weilin.wang@intel.com>
---
 tools/perf/builtin-stat.c     | 179 +++++++++++++++++++++++++++++++++-
 tools/perf/util/data.c        |   4 +
 tools/perf/util/data.h        |   1 +
 tools/perf/util/metricgroup.h |   7 ++
 tools/perf/util/stat.h        |   3 +
 5 files changed, 191 insertions(+), 3 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 5a3093541cff..3890a579349e 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -94,8 +94,13 @@
 #include
 #include
 
+#include "util/sample.h"
+#include
+#include
+
 #define DEFAULT_SEPARATOR	" "
 #define FREEZE_ON_SMI_PATH	"devices/cpu/freeze_on_smi"
+#define PERF_DATA		"-"
 
 static void print_counters(struct timespec *ts, int argc, const char **argv);
 
@@ -162,7 +167,8 @@ static struct perf_stat_config stat_config = {
 	.ctl_fd			= -1,
 	.ctl_fd_ack		= -1,
 	.iostat_run		= false,
-	.tpebs_event_size	= 0,
+	.tpebs_event_size	= 0,
+	.tpebs_pid		= -1,
 };
 
 static bool cpus_map_matched(struct evsel *a, struct evsel *b)
@@ -687,12 +693,163 @@ static enum counter_recovery stat_handle_error(struct evsel *counter)
 	return COUNTER_FATAL;
 }
 
-static int __run_perf_record(void)
+static int __run_perf_record(const char **record_argv)
 {
+	int i = 0;
+	struct tpebs_event *e;
+
 	pr_debug("Prepare perf record for retire_latency\n");
+
+	record_argv[i++] = "perf";
+	record_argv[i++] = "record";
+	record_argv[i++] = "-W";
+
+	if (stat_config.user_requested_cpu_list) {
+		record_argv[i++] = "-C";
+		record_argv[i++] = stat_config.user_requested_cpu_list;
+	}
+
+	if (stat_config.system_wide)
+		record_argv[i++] = "-a";
+
+	list_for_each_entry(e, &stat_config.tpebs_events, nd) {
+		record_argv[i++] = "-e";
+		record_argv[i++] = e->name;
+	}
+
+	record_argv[i++] = "-o";
+	record_argv[i++] = PERF_DATA;
+
 	return 0;
 }
 
+static void prepare_run_command(struct child_process *cmd,
+				const char **argv)
+{
+	memset(cmd, 0, sizeof(*cmd));
+	cmd->argv = argv;
+	cmd->out = -1;
+}
+
+static int prepare_perf_record(struct child_process *cmd)
+{
+	const char **record_argv;
+
+	record_argv = calloc(10 + 2 * stat_config.tpebs_event_size, sizeof(char *));
+	if (!record_argv)
+		return -1;
+
+	__run_perf_record(record_argv);
+
+	prepare_run_command(cmd, record_argv);
+	return start_command(cmd);
+}
+
+struct perf_script {
+	struct perf_tool	tool;
+	struct perf_session	*session;
+	struct evswitch		evswitch;
+	struct perf_cpu_map	*cpus;
+	struct perf_thread_map	*threads;
+	int			name_width;
+};
+
+static void tpebs_data__delete(void)
+{
+	struct tpebs_retire_lat *r, *rtmp;
+	struct tpebs_event *e, *etmp;
+
+	list_for_each_entry_safe(r, rtmp, &stat_config.tpebs_results, nd) {
+		list_del_init(&r->nd);
+		free(r);
+	}
+	list_for_each_entry_safe(e, etmp, &stat_config.tpebs_events, nd) {
+		list_del_init(&e->nd);
+		free(e);
+	}
+}
+
+static int process_sample_event(struct perf_tool *tool,
+				union perf_event *event __maybe_unused,
+				struct perf_sample *sample,
+				struct evsel *evsel,
+				struct machine *machine __maybe_unused)
+{
+	struct perf_script *script = container_of(tool, struct perf_script, tool);
+	int ret = 0;
+	const char *evname;
+	struct tpebs_retire_lat *t;
+
+	pr_debug("entering function %s\n ", __func__);
+	evname = evsel__name(evsel);
+
+	pr_debug("[%03d] ", sample->cpu);
+	pr_debug("%*s: ", script->name_width, evname ?: "[unknown]");
+	pr_debug("%16" PRIu16, sample->retire_lat);
+	pr_debug("\n");
+
+	// Need to handle per core results?
+	// We are assuming average retire latency value will be used. Save the number of
+	// samples and the sum of retire latency value for each event.
+	list_for_each_entry(t, &stat_config.tpebs_results, nd) {
+		if (!strcmp(evname, t->name)) {
+			t->count += 1;
+			t->sum += sample->retire_lat;
+			break;
+		}
+	}
+
+	return ret;
+}
+
+static int process_feature_event(struct perf_session *session,
+				 union perf_event *event)
+{
+	if (event->feat.feat_id < HEADER_LAST_FEATURE)
+		return perf_event__process_feature(session, event);
+	return 0;
+}
+
+static int __cmd_script(struct child_process *cmd __maybe_unused)
+{
+	int err = 0;
+	struct perf_session *session;
+	struct perf_data data = {
+		.mode = PERF_DATA_MODE_READ,
+		.path = PERF_DATA,
+		.fd = cmd->out,
+	};
+	struct perf_script script = {
+		.tool = {
+			.sample			= process_sample_event,
+			.ordered_events		= true,
+			.ordering_requires_timestamps = true,
+			.feature		= process_feature_event,
+			.attr			= perf_event__process_attr,
+		},
+	};
+	struct tpebs_event *e;
+
+	list_for_each_entry(e, &stat_config.tpebs_events, nd) {
+		struct tpebs_retire_lat *new = malloc(sizeof(struct tpebs_retire_lat));
+
+		if (!new)
+			return -1;
+		new->name = strdup(e->name);
+		new->tpebs_name = strdup(e->tpebs_name);
+		new->count = 0;
+		new->sum = 0;
+		list_add_tail(&new->nd, &stat_config.tpebs_results);
+	}
+
+	kill(cmd->pid, SIGTERM);
+	session = perf_session__new(&data, &script.tool);
+	if (IS_ERR(session))
+		return PTR_ERR(session);
+	script.session = session;
+	err = perf_session__process_events(session);
+	perf_session__delete(session);
+
+	return err;
+}
+
 static int __run_perf_stat(int argc, const char **argv, int run_idx)
 {
 	int interval = stat_config.interval;
@@ -709,12 +866,14 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
 	struct affinity saved_affinity, *affinity = NULL;
 	int err;
 	bool second_pass = false;
+	struct child_process cmd;
 
 	//Prepare perf record for sampling event retire_latency before fork and prepare workload
 	if (stat_config.tpebs_event_size > 0) {
 		int ret;
 
-		ret = __run_perf_record();
+		pr_debug("perf stat pid = %d\n", getpid());
+		ret = prepare_perf_record(&cmd);
 		if (ret)
 			return ret;
 	}
@@ -924,6 +1083,17 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
 
 	t1 = rdclock();
 
+	if (stat_config.tpebs_event_size > 0) {
+		int ret;
+
+		pr_debug("pid = %d\n", getpid());
+		pr_debug("cmd.pid = %d\n", cmd.pid);
+
+		ret = __cmd_script(&cmd);
+		close(cmd.out);
+		pr_debug("%d\n", ret);
+	}
+
 	if (stat_config.walltime_run_table)
 		stat_config.walltime_run[run_idx] = t1 - t0;
 
@@ -2761,6 +2931,7 @@ int cmd_stat(int argc, const char **argv)
 	}
 
 	INIT_LIST_HEAD(&stat_config.tpebs_events);
+	INIT_LIST_HEAD(&stat_config.tpebs_results);
 
 	/*
 	 * Metric parsing needs to be delayed as metrics may optimize events
@@ -2972,5 +3143,7 @@ int cmd_stat(int argc, const char **argv)
 	metricgroup__rblist_exit(&stat_config.metric_events);
 	evlist__close_control(stat_config.ctl_fd, stat_config.ctl_fd_ack, &stat_config.ctl_fd_close);
 
+	tpebs_data__delete();
+
 	return status;
 }
diff --git a/tools/perf/util/data.c b/tools/perf/util/data.c
index 08c4bfbd817f..2e2a20fc5c30 100644
--- a/tools/perf/util/data.c
+++ b/tools/perf/util/data.c
@@ -185,6 +185,10 @@ static bool check_pipe(struct perf_data *data)
 	int fd = perf_data__is_read(data) ?
 		 STDIN_FILENO : STDOUT_FILENO;
 
+	if (data->fd > 0) {
+		fd = data->fd;
+	}
+
 	if (!data->path) {
 		if (!fstat(fd, &st) && S_ISFIFO(st.st_mode))
 			is_pipe = true;
diff --git a/tools/perf/util/data.h b/tools/perf/util/data.h
index 110f3ebde30f..720638116ca0 100644
--- a/tools/perf/util/data.h
+++ b/tools/perf/util/data.h
@@ -28,6 +28,7 @@ struct perf_data_file {
 
 struct perf_data {
 	const char		*path;
+	int			 fd;
 	struct perf_data_file	 file;
 	bool			 is_pipe;
 	bool			 is_dir;
diff --git a/tools/perf/util/metricgroup.h b/tools/perf/util/metricgroup.h
index 7c24ed768ff3..1fa12cc3294e 100644
--- a/tools/perf/util/metricgroup.h
+++ b/tools/perf/util/metricgroup.h
@@ -71,6 +71,13 @@ struct tpebs_event {
 	const char *name;
 	const char *tpebs_name;
 };
+struct tpebs_retire_lat {
+	struct list_head nd;
+	const char *name;
+	const char *tpebs_name;
+	size_t count;
+	int sum;
+};
 
 struct metric_event *metricgroup__lookup(struct rblist *metric_events,
 					 struct evsel *evsel,
diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h
index b987960df3c5..2387c6289f93 100644
--- a/tools/perf/util/stat.h
+++ b/tools/perf/util/stat.h
@@ -111,6 +111,9 @@ struct perf_stat_config {
 	struct rblist		 metric_events;
 	struct list_head	 tpebs_events;
 	size_t			 tpebs_event_size;
+	struct list_head	 tpebs_results;
+	pid_t			 tpebs_pid;
+	int			 tpebs_pipe;
 	int			 ctl_fd;
 	int			 ctl_fd_ack;
 	bool			 ctl_fd_close;
-- 
2.43.0

From nobody Sun Feb 8 02:26:45 2026
From: weilin.wang@intel.com
To: weilin.wang@intel.com, Ian Rogers, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Adrian Hunter, Kan Liang
Cc: linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org,
	Perry Taylor, Samantha Alt, Caleb Biggers
Subject: [RFC PATCH v3 3/6] perf stat: Add retire latency values into the
	expr_parse_ctx to prepare for final metric calculation
Date: Fri, 1 Mar 2024 19:11:35 -0500
Message-ID: <20240302001139.604829-4-weilin.wang@intel.com>
In-Reply-To: <20240302001139.604829-1-weilin.wang@intel.com>
References: <20240302001139.604829-1-weilin.wang@intel.com>

From: Weilin Wang <weilin.wang@intel.com>

Retire latency values of events are used in metric formulas. This
update adds code to process the perf record data and extract the
required retire latency values.
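
Only an average is kept per event: for example (illustrative numbers),
an event that accumulated sum = 3000 retire-latency cycles over
count = 100 samples contributes val = 30 to the metric expression.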

Signed-off-by: Weilin Wang <weilin.wang@intel.com>
---
 tools/perf/builtin-stat.c     |  1 +
 tools/perf/util/metricgroup.h |  1 +
 tools/perf/util/stat-shadow.c | 18 ++++++++++++++++++
 3 files changed, 20 insertions(+)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 3890a579349e..3e5865572266 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -792,6 +792,7 @@ static int process_sample_event(struct perf_tool *tool,
 		if (!strcmp(evname, t->name)) {
 			t->count += 1;
 			t->sum += sample->retire_lat;
+			t->val = t->count > 0 ? t->sum/t->count : 0;
 			break;
 		}
 	}
diff --git a/tools/perf/util/metricgroup.h b/tools/perf/util/metricgroup.h
index 1fa12cc3294e..08af0f447550 100644
--- a/tools/perf/util/metricgroup.h
+++ b/tools/perf/util/metricgroup.h
@@ -77,6 +77,7 @@ struct tpebs_retire_lat {
 	const char *tpebs_name;
 	size_t count;
 	int sum;
+	double val;
 };
 
 struct metric_event *metricgroup__lookup(struct rblist *metric_events,
diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
index 3466aa952442..c63ba52004fc 100644
--- a/tools/perf/util/stat-shadow.c
+++ b/tools/perf/util/stat-shadow.c
@@ -355,6 +355,19 @@ static void print_nsecs(struct perf_stat_config *config,
 	print_metric(config, ctxp, NULL, NULL, "CPUs utilized", 0);
 }
 
+static int prepare_retire_lat(struct expr_parse_ctx *pctx,
+			      struct list_head *retire_lats)
+{
+	int ret = 0;
+	struct tpebs_retire_lat *t;
+
+	list_for_each_entry(t, retire_lats, nd) {
+		ret = expr__add_id_val(pctx, strdup(t->tpebs_name), t->val);
+		if (ret < 0)
+			return ret;
+	}
+	return ret;
+}
+
 static int prepare_metric(const struct metric_expr *mexp,
 			  const struct evsel *evsel,
 			  struct expr_parse_ctx *pctx,
@@ -486,6 +499,11 @@ static void generic_metric(struct perf_stat_config *config,
 	pctx->sctx.user_requested_cpu_list = strdup(config->user_requested_cpu_list);
 	pctx->sctx.runtime = runtime;
 	pctx->sctx.system_wide = config->system_wide;
+	i = prepare_retire_lat(pctx, &config->tpebs_results);
+	if (i < 0) {
+		expr__ctx_free(pctx);
+		return;
+	}
 	i = prepare_metric(mexp, evsel, pctx, aggr_idx);
 	if (i < 0) {
 		expr__ctx_free(pctx);
-- 
2.43.0

From nobody Sun Feb 8 02:26:45 2026
From: weilin.wang@intel.com
To: weilin.wang@intel.com, Ian Rogers, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Adrian Hunter, Kan Liang
Cc: linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org,
	Perry Taylor, Samantha Alt, Caleb Biggers
Subject: [RFC PATCH v3 4/6] perf stat: Create another thread for sample
	data processing
Date: Fri, 1 Mar 2024 19:11:36 -0500
Message-ID: <20240302001139.604829-5-weilin.wang@intel.com>
In-Reply-To: <20240302001139.604829-1-weilin.wang@intel.com>
References: <20240302001139.604829-1-weilin.wang@intel.com>

From: Weilin Wang <weilin.wang@intel.com>

Another thread is required to synchronize perf stat and perf record
when the sample data is passed between them through a pipe.
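
The flow is easier to see in isolation. Below is a minimal, standalone
sketch of the same synchronization pattern (generic stand-ins only; the
real code drives perf record via start_command()/finish_command() and
drains the pipe with perf_session__process_events()):

	#include <pthread.h>
	#include <signal.h>
	#include <sys/types.h>
	#include <sys/wait.h>
	#include <unistd.h>

	static void *drain_pipe(void *arg)
	{
		int fd = *(int *)arg;
		char buf[4096];

		/* Consume the producer's output until EOF. */
		while (read(fd, buf, sizeof(buf)) > 0)
			;
		return NULL;
	}

	int main(void)
	{
		int fds[2];
		pid_t pid;
		pthread_t reader;

		if (pipe(fds))
			return 1;

		pid = fork();
		if (pid == 0) {		/* child streams data, like perf record -o - */
			close(fds[0]);
			while (write(fds[1], "sample\n", 7) > 0)
				;
			_exit(0);
		}
		close(fds[1]);

		/* The reader thread drains the pipe while the parent works. */
		pthread_create(&reader, NULL, drain_pipe, &fds[0]);
		sleep(1);		/* stand-in for running the workload */

		kill(pid, SIGTERM);	/* stop the producer... */
		pthread_join(reader, NULL); /* ...and wait for the reader to hit EOF */
		close(fds[0]);
		waitpid(pid, NULL, 0);
		return 0;
	}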

Signed-off-by: Weilin Wang <weilin.wang@intel.com>
---
 tools/perf/builtin-stat.c | 58 +++++++++++++++++++++++----------------
 1 file changed, 34 insertions(+), 24 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 3e5865572266..2372175d3408 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -777,7 +777,6 @@ static int process_sample_event(struct perf_tool *tool,
 	const char *evname;
 	struct tpebs_retire_lat *t;
 
-	pr_debug("entering function %s\n ", __func__);
 	evname = evsel__name(evsel);
 
 	pr_debug("[%03d] ", sample->cpu);
@@ -808,9 +807,9 @@ static int process_feature_event(struct perf_session *session,
 	return 0;
 }
 
-static int __cmd_script(struct child_process *cmd __maybe_unused)
+static void *__cmd_script(void *arg __maybe_unused)
 {
-	int err = 0;
+	struct child_process *cmd = arg;
 	struct perf_session *session;
 	struct perf_data data = {
 		.mode = PERF_DATA_MODE_READ,
@@ -826,29 +825,15 @@ static int __cmd_script(struct child_process *cmd __maybe_unused)
 			.attr = perf_event__process_attr,
 		},
 	};
-	struct tpebs_event *e;
-
-	list_for_each_entry(e, &stat_config.tpebs_events, nd) {
-		struct tpebs_retire_lat *new = malloc(sizeof(struct tpebs_retire_lat));
-
-		if (!new)
-			return -1;
-		new->name = strdup(e->name);
-		new->tpebs_name = strdup(e->tpebs_name);
-		new->count = 0;
-		new->sum = 0;
-		list_add_tail(&new->nd, &stat_config.tpebs_results);
-	}
 
-	kill(cmd->pid, SIGTERM);
 	session = perf_session__new(&data, &script.tool);
 	if (IS_ERR(session))
-		return PTR_ERR(session);
+		return NULL;
 	script.session = session;
-	err = perf_session__process_events(session);
+	perf_session__process_events(session);
 	perf_session__delete(session);
 
-	return err;
+	return NULL;
 }
 
 static int __run_perf_stat(int argc, const char **argv, int run_idx)
@@ -868,15 +853,37 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
 	int err;
 	bool second_pass = false;
 	struct child_process cmd;
+	pthread_t thread_script;
 
 	//Prepare perf record for sampling event retire_latency before fork and prepare workload
 	if (stat_config.tpebs_event_size > 0) {
 		int ret;
 
+		struct tpebs_event *e;
 		pr_debug("perf stat pid = %d\n", getpid());
 		ret = prepare_perf_record(&cmd);
 		if (ret)
 			return ret;
+
+		list_for_each_entry(e, &stat_config.tpebs_events, nd) {
+			struct tpebs_retire_lat *new = malloc(sizeof(struct tpebs_retire_lat));
+
+			if (!new)
+				return -1;
+			new->name = strdup(e->name);
+			new->tpebs_name = strdup(e->tpebs_name);
+			new->count = 0;
+			new->sum = 0;
+			list_add_tail(&new->nd, &stat_config.tpebs_results);
+		}
+
+		if (pthread_create(&thread_script, NULL, __cmd_script, &cmd)) {
+			kill(cmd.pid, SIGTERM);
+			close(cmd.out);
+			pr_err("Could not create thread to process sample data.\n");
+			return -1;
+		}
+		sleep(2);
 	}
 
 	if (forks) {
@@ -1087,12 +1094,15 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
 	if (stat_config.tpebs_event_size > 0) {
 		int ret;
 
-		pr_debug("pid = %d\n", getpid());
-		pr_debug("cmd.pid = %d\n", cmd.pid);
+		pr_debug("Workload finished, finishing record\n");
+		pr_debug("Perf stat pid = %d, Perf record pid = %d\n", getpid(), cmd.pid);
 
-		ret = __cmd_script(&cmd);
+		kill(cmd.pid, SIGTERM);
+		pthread_join(thread_script, NULL);
 		close(cmd.out);
-		pr_debug("%d\n", ret);
+		ret = finish_command(&cmd);
+		if (ret != -ERR_RUN_COMMAND_WAITPID_SIGNAL)
+			return ret;
 	}
 
 	if (stat_config.walltime_run_table)
-- 
2.43.0

From nobody Sun Feb 8 02:26:45 2026
From: weilin.wang@intel.com
To: weilin.wang@intel.com, Ian Rogers, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Adrian Hunter, Kan Liang
Cc: linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org,
	Perry Taylor, Samantha Alt, Caleb Biggers
Subject: [RFC PATCH v3 5/6] perf stat: Add retire latency print functions
	to print out at the very end of print out
Date: Fri, 1 Mar 2024 19:11:37 -0500
Message-ID: <20240302001139.604829-6-weilin.wang@intel.com>
In-Reply-To: <20240302001139.604829-1-weilin.wang@intel.com>
References: <20240302001139.604829-1-weilin.wang@intel.com>

From: Weilin Wang <weilin.wang@intel.com>

Add printout functions so that users can read the collected retire
latency values.
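
In the default (std) mode, the functions added below print one line per
event with the average retire latency, the event name, the number of
samples, and the sum, roughly like this (hypothetical event and values,
spacing approximate):

	            30.00 EVENT_A:p                            100     3000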

Signed-off-by: Weilin Wang <weilin.wang@intel.com>
---
 tools/perf/util/stat-display.c | 65 ++++++++++++++++++++++++++++++++++
 1 file changed, 65 insertions(+)

diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
index bfc1d705f437..e2996ba67bfe 100644
--- a/tools/perf/util/stat-display.c
+++ b/tools/perf/util/stat-display.c
@@ -21,6 +21,7 @@
 #include "iostat.h"
 #include "pmu.h"
 #include "pmus.h"
+#include "metricgroup.h"
 
 #define CNTR_NOT_SUPPORTED	""
 #define CNTR_NOT_COUNTED	""
@@ -34,6 +35,7 @@
 #define COMM_LEN		16
 #define PID_LEN			7
 #define CPUS_LEN		4
+#define RETIRE_LEN		8
 
 static int aggr_header_lens[] = {
 	[AGGR_CORE] = 18,
@@ -426,6 +428,67 @@ static void print_metric_std(struct perf_stat_config *config,
 	fprintf(out, " %-*s", METRIC_LEN - n - 1, unit);
 }
 
+static void print_retire_lat_std(struct perf_stat_config *config,
+				 struct outstate *os)
+{
+	FILE *out = os->fh;
+	bool newline = os->newline;
+	struct tpebs_retire_lat *t;
+	struct list_head *retire_lats = &config->tpebs_results;
+
+	list_for_each_entry(t, retire_lats, nd) {
+		if (newline)
+			do_new_line_std(config, os);
+		fprintf(out, "%'*.2f %-*s", COUNTS_LEN, t->val, EVNAME_LEN, t->name);
+		fprintf(out, "%*ld %*d\n", RETIRE_LEN, t->count,
+			RETIRE_LEN, t->sum);
+	}
+}
+
+static void print_retire_lat_csv(struct perf_stat_config *config,
+				 struct outstate *os)
+{
+	FILE *out = os->fh;
+	struct tpebs_retire_lat *t;
+	struct list_head *retire_lats = &config->tpebs_results;
+	const char *sep = config->csv_sep;
+
+	list_for_each_entry(t, retire_lats, nd) {
+		fprintf(out, "%f%s%s%s%s%ld%s%d\n", t->val, sep, sep, t->name, sep,
+			t->count, sep, t->sum);
+	}
+}
+
+static void print_retire_lat_json(struct perf_stat_config *config,
+				  struct outstate *os)
+{
+	FILE *out = os->fh;
+	struct tpebs_retire_lat *t;
+	struct list_head *retire_lats = &config->tpebs_results;
+
+	fprintf(out, "{");
+	list_for_each_entry(t, retire_lats, nd) {
+		fprintf(out, "\"retire_latency-value\" : \"%f\", ", t->val);
+		fprintf(out, "\"event-name\" : \"%s\", ", t->name);
+		fprintf(out, "\"sample-counts\" : \"%ld\", ", t->count);
+		fprintf(out, "\"retire_latency-sum\" : \"%d\"", t->sum);
+	}
+	fprintf(out, "}");
+}
+
+static void print_retire_lat(struct perf_stat_config *config,
+			     struct outstate *os)
+{
+	if (list_empty(&config->tpebs_results))
+		return;
+	if (config->json_output)
+		print_retire_lat_json(config, os);
+	else if (config->csv_output)
+		print_retire_lat_csv(config, os);
+	else
+		print_retire_lat_std(config, os);
+}
+
 static void new_line_csv(struct perf_stat_config *config, void *ctx)
 {
 	struct outstate *os = ctx;
@@ -1609,6 +1672,8 @@ void evlist__print_counters(struct evlist *evlist, struct perf_stat_config *conf
 		break;
 	}
 
+	print_retire_lat(config, &os);
+
 	print_footer(config);
 
 	fflush(config->output);
-- 
2.43.0

From nobody Sun Feb 8 02:26:45 2026
From: weilin.wang@intel.com
To: weilin.wang@intel.com, Ian Rogers, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Adrian Hunter, Kan Liang
Cc: linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org,
	Perry Taylor, Samantha Alt, Caleb Biggers
Subject: [RFC PATCH v3 6/6] perf vendor events intel: Add MTL metric json
	files
Date: Fri, 1 Mar 2024 19:11:38 -0500
Message-ID: <20240302001139.604829-7-weilin.wang@intel.com>
In-Reply-To: <20240302001139.604829-1-weilin.wang@intel.com>
References: <20240302001139.604829-1-weilin.wang@intel.com>

From: Weilin Wang <weilin.wang@intel.com>

Add the MTL metric JSON files at TMA 4.7 [1]. Some of the metric
formulas on MTL use TPEBS retire_latency values.
[1] https://lore.kernel.org/all/20240214011820.644458-1-irogers@google.com/ Signed-off-by: Weilin Wang --- .../arch/x86/meteorlake/metricgroups.json | 127 + .../arch/x86/meteorlake/mtl-metrics.json | 2547 +++++++++++++++++ 2 files changed, 2674 insertions(+) create mode 100644 tools/perf/pmu-events/arch/x86/meteorlake/metricgroups.= json create mode 100644 tools/perf/pmu-events/arch/x86/meteorlake/mtl-metrics.j= son diff --git a/tools/perf/pmu-events/arch/x86/meteorlake/metricgroups.json b/= tools/perf/pmu-events/arch/x86/meteorlake/metricgroups.json new file mode 100644 index 000000000000..7a03835f262c --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/meteorlake/metricgroups.json @@ -0,0 +1,127 @@ +{ + "Backend": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "Bad": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "BadSpec": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "BigFootprint": "Grouping from Top-down Microarchitecture Analysis Met= rics spreadsheet", + "BrMispredicts": "Grouping from Top-down Microarchitecture Analysis Me= trics spreadsheet", + "Branches": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "C0Wait": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", + "CacheHits": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "CodeGen": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "Compute": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "Cor": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "DSB": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "DSBmiss": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "DataSharing": "Grouping from Top-down Microarchitecture Analysis Metr= ics spreadsheet", + "Fed": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "FetchBW": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "FetchLat": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "Flops": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "FpScalar": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "FpVector": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "Frontend": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "HPC": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "IcMiss": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", + "InsType": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "IntVector": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "MachineClears": "Grouping from Top-down Microarchitecture Analysis Me= trics spreadsheet", + "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis M= etrics spreadsheet", + "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "MemOffcore": "Grouping from Top-down Microarchitecture Analysis Metri= cs spreadsheet", + "MemoryBW": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "MemoryBound": 
"Grouping from Top-down Microarchitecture Analysis Metr= ics spreadsheet", + "MemoryLat": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "MemoryTLB": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "Memory_BW": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "Memory_Lat": "Grouping from Top-down Microarchitecture Analysis Metri= cs spreadsheet", + "MicroSeq": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "OS": "Grouping from Top-down Microarchitecture Analysis Metrics sprea= dsheet", + "Offcore": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "PGO": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "Pipeline": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "PortsUtil": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "Power": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "Prefetches": "Grouping from Top-down Microarchitecture Analysis Metri= cs spreadsheet", + "Ret": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "Retire": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", + "SMT": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "Server": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", + "Snoop": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "SoC": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "Summary": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "TmaL1": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "TmaL2": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "TmaL3mem": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "TopdownL1": "Metrics for top-down breakdown at level 1", + "TopdownL2": "Metrics for top-down breakdown at level 2", + "TopdownL3": "Metrics for top-down breakdown at level 3", + "TopdownL4": "Metrics for top-down breakdown at level 4", + "TopdownL5": "Metrics for top-down breakdown at level 5", + "TopdownL6": "Metrics for top-down breakdown at level 6", + "tma_L1_group": "Metrics for top-down breakdown at level 1", + "tma_L2_group": "Metrics for top-down breakdown at level 2", + "tma_L3_group": "Metrics for top-down breakdown at level 3", + "tma_L4_group": "Metrics for top-down breakdown at level 4", + "tma_L5_group": "Metrics for top-down breakdown at level 5", + "tma_L6_group": "Metrics for top-down breakdown at level 6", + "tma_alu_op_utilization_group": "Metrics contributing to tma_alu_op_ut= ilization category", + "tma_assists_group": "Metrics contributing to tma_assists category", + "tma_backend_bound_aux_group": "Metrics contributing to tma_backend_bo= und_aux category", + "tma_backend_bound_group": "Metrics contributing to tma_backend_bound = category", + "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculat= ion category", + "tma_base_group": "Metrics contributing to tma_base category", + "tma_branch_mispredicts_group": "Metrics contributing to tma_branch_mi= spredicts category", + "tma_branch_resteers_group": "Metrics contributing to tma_branch_reste= ers category", + "tma_core_bound_group": "Metrics contributing to tma_core_bound catego= ry", + "tma_dram_bound_group": "Metrics contributing to 
+    "tma_dram_bound_group": "Metrics contributing to tma_dram_bound category",
+    "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category",
+    "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store category",
+    "tma_fetch_bandwidth_group": "Metrics contributing to tma_fetch_bandwidth category",
+    "tma_fetch_latency_group": "Metrics contributing to tma_fetch_latency category",
+    "tma_fp_arith_group": "Metrics contributing to tma_fp_arith category",
+    "tma_fp_vector_group": "Metrics contributing to tma_fp_vector category",
+    "tma_frontend_bound_group": "Metrics contributing to tma_frontend_bound category",
+    "tma_heavy_operations_group": "Metrics contributing to tma_heavy_operations category",
+    "tma_int_operations_group": "Metrics contributing to tma_int_operations category",
+    "tma_issue2P": "Metrics related by the issue $issue2P",
+    "tma_issueBM": "Metrics related by the issue $issueBM",
+    "tma_issueBW": "Metrics related by the issue $issueBW",
+    "tma_issueComp": "Metrics related by the issue $issueComp",
+    "tma_issueD0": "Metrics related by the issue $issueD0",
+    "tma_issueFB": "Metrics related by the issue $issueFB",
+    "tma_issueFL": "Metrics related by the issue $issueFL",
+    "tma_issueL1": "Metrics related by the issue $issueL1",
+    "tma_issueLat": "Metrics related by the issue $issueLat",
+    "tma_issueMC": "Metrics related by the issue $issueMC",
+    "tma_issueMS": "Metrics related by the issue $issueMS",
+    "tma_issueMV": "Metrics related by the issue $issueMV",
+    "tma_issueRFO": "Metrics related by the issue $issueRFO",
+    "tma_issueSL": "Metrics related by the issue $issueSL",
+    "tma_issueSO": "Metrics related by the issue $issueSO",
+    "tma_issueSmSt": "Metrics related by the issue $issueSmSt",
+    "tma_issueSpSt": "Metrics related by the issue $issueSpSt",
+    "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn",
+    "tma_issueTLB": "Metrics related by the issue $issueTLB",
+    "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category",
+    "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category",
+    "tma_light_operations_group": "Metrics contributing to tma_light_operations category",
+    "tma_load_op_utilization_group": "Metrics contributing to tma_load_op_utilization category",
+    "tma_machine_clears_group": "Metrics contributing to tma_machine_clears category",
+    "tma_mem_latency_group": "Metrics contributing to tma_mem_latency category",
+    "tma_mem_scheduler_group": "Metrics contributing to tma_mem_scheduler category",
+    "tma_memory_bound_group": "Metrics contributing to tma_memory_bound category",
+    "tma_microcode_sequencer_group": "Metrics contributing to tma_microcode_sequencer category",
+    "tma_mite_group": "Metrics contributing to tma_mite category",
+    "tma_nuke_group": "Metrics contributing to tma_nuke category",
+    "tma_other_light_ops_group": "Metrics contributing to tma_other_light_ops category",
+    "tma_ports_utilization_group": "Metrics contributing to tma_ports_utilization category",
+    "tma_ports_utilized_0_group": "Metrics contributing to tma_ports_utilized_0 category",
+    "tma_ports_utilized_3m_group": "Metrics contributing to tma_ports_utilized_3m category",
+    "tma_resource_bound_group": "Metrics contributing to tma_resource_bound category",
+    "tma_retiring_group": "Metrics contributing to tma_retiring category",
+    "tma_serializing_operation_group": "Metrics contributing to tma_serializing_operation category",
+    "tma_store_bound_group": "Metrics contributing to tma_store_bound category",
+    "tma_store_op_utilization_group": "Metrics contributing to tma_store_op_utilization category"
+}
diff --git a/tools/perf/pmu-events/arch/x86/meteorlake/mtl-metrics.json b/tools/perf/pmu-events/arch/x86/meteorlake/mtl-metrics.json
new file mode 100644
index 000000000000..aa13e1ce563c
--- /dev/null
+++ b/tools/perf/pmu-events/arch/x86/meteorlake/mtl-metrics.json
@@ -0,0 +1,2547 @@
+[
+    {
+        "BriefDescription": "C10 residency percent per package",
+        "MetricExpr": "cstate_pkg@c10\\-residency@ / TSC",
+        "MetricGroup": "Power",
+        "MetricName": "C10_Pkg_Residency",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "C1 residency percent per core",
+        "MetricExpr": "cstate_core@c1\\-residency@ / TSC",
+        "MetricGroup": "Power",
+        "MetricName": "C1_Core_Residency",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "C2 residency percent per package",
+        "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC",
+        "MetricGroup": "Power",
+        "MetricName": "C2_Pkg_Residency",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "C3 residency percent per package",
+        "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC",
+        "MetricGroup": "Power",
+        "MetricName": "C3_Pkg_Residency",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "C6 residency percent per core",
+        "MetricExpr": "cstate_core@c6\\-residency@ / TSC",
+        "MetricGroup": "Power",
+        "MetricName": "C6_Core_Residency",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "C6 residency percent per package",
+        "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC",
+        "MetricGroup": "Power",
+        "MetricName": "C6_Pkg_Residency",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "C7 residency percent per core",
+        "MetricExpr": "cstate_core@c7\\-residency@ / TSC",
+        "MetricGroup": "Power",
+        "MetricName": "C7_Core_Residency",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "C7 residency percent per package",
+        "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC",
+        "MetricGroup": "Power",
+        "MetricName": "C7_Pkg_Residency",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "C8 residency percent per package",
+        "MetricExpr": "cstate_pkg@c8\\-residency@ / TSC",
+        "MetricGroup": "Power",
+        "MetricName": "C8_Pkg_Residency",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "C9 residency percent per package",
+        "MetricExpr": "cstate_pkg@c9\\-residency@ / TSC",
+        "MetricGroup": "Power",
+        "MetricName": "C9_Pkg_Residency",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "Percentage of cycles spent in System Management Interrupts.",
+        "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0 else 0)",
+        "MetricGroup": "smi",
+        "MetricName": "smi_cycles",
+        "MetricThreshold": "smi_cycles > 0.1",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "Number of SMI interrupts.",
+        "MetricExpr": "msr@smi@",
+        "MetricGroup": "smi",
+        "MetricName": "smi_num",
+        "ScaleUnit": "1SMI#"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to certain allocation restrictions.",
+        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS@ / tma_info_core_slots",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group",
+        "MetricName": "tma_alloc_restriction",
+        "MetricThreshold": "tma_alloc_restriction > 0.1",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
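As a purely illustrative aside, the evaluation model for these entries is simple: MetricExpr is arithmetic over event counts collected in the same interval, and ScaleUnit only affects presentation. A Python sketch with hypothetical counter values (perf evaluates these in its own expression parser; this only makes the arithmetic concrete):

  # Hypothetical raw counts over one measurement interval.
  c6_residency = 7_200_000_000   # cstate_pkg/c6-residency/
  tsc = 9_000_000_000            # TSC

  value = c6_residency / tsc                       # MetricExpr
  print(f"C6_Pkg_Residency: {100 * value:.1f}%")   # ScaleUnit "100%" -> 80.0%
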
"tma_backend_bound", + "MetricThreshold": "tma_backend_bound > 0.1", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "Counts the total number of issue slots that= were not consumed by the backend due to backend stalls. Note that uops mu= st be available for consumption in order for this event to count. If a uop= is not available (IQ is empty), this event will not count. The rest of t= hese subevents count backend stalls, in cycles, due to an outstanding reque= st which is memory bound vs core bound. The subevents are not slot based = events and therefore can not be precisely added or subtracted from the Back= end_Bound_Aux subevents which are slot based.", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the total number of issue slots that = were not consumed by the backend due to backend stalls", + "MetricExpr": "tma_backend_bound", + "MetricGroup": "TopdownL1;tma_L1_group", + "MetricName": "tma_backend_bound_aux", + "MetricThreshold": "tma_backend_bound_aux > 0.2", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "Counts the total number of issue slots that= were not consumed by the backend due to backend stalls. Note that UOPS mu= st be available for consumption in order for this event to count. If a uop= is not available (IQ is empty), this event will not count. All of these s= ubevents count backend stalls, in slots, due to a resource limitation. Th= ese are not cycle based events and therefore can not be precisely added or = subtracted from the Backend_Bound subevents which are cycle based. These s= ubevents are supplementary to Backend_Bound and can be used to analyze resu= lts from a resource perspective at allocation.", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the total number of issue slots that w= ere not consumed by the backend because allocation is stalled due to a misp= redicted jump or a machine clear", + "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.ALL@ / tma_info_co= re_slots", + "MetricGroup": "TopdownL1;tma_L1_group", + "MetricName": "tma_bad_speculation", + "MetricThreshold": "tma_bad_speculation > 0.15", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "Counts the total number of issue slots that = were not consumed by the backend because allocation is stalled due to a mis= predicted jump or a machine clear. Only issue slots wasted due to fast nuke= s such as memory ordering nukes are counted. Other nukes are not accounted = for. Counts all issue slots blocked during this recovery window including r= elevant microcode flows and while uops are not yet available in the instruc= tion queue (IQ). 
+    {
+        "BriefDescription": "Counts the total number of issue slots that were not consumed by the backend because allocation is stalled due to a mispredicted jump or a machine clear",
+        "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.ALL@ / tma_info_core_slots",
+        "MetricGroup": "TopdownL1;tma_L1_group",
+        "MetricName": "tma_bad_speculation",
+        "MetricThreshold": "tma_bad_speculation > 0.15",
+        "MetricgroupNoGroup": "TopdownL1",
+        "PublicDescription": "Counts the total number of issue slots that were not consumed by the backend because allocation is stalled due to a mispredicted jump or a machine clear. Only issue slots wasted due to fast nukes such as memory ordering nukes are counted. Other nukes are not accounted for. Counts all issue slots blocked during this recovery window including relevant microcode flows and while uops are not yet available in the instruction queue (IQ). Also includes the issue slots that were consumed by the backend but were thrown away because they were younger than the mispredict or machine clear.",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of uops that are not from the microsequencer.",
+        "MetricExpr": "(cpu_atom@TOPDOWN_RETIRING.ALL@ - cpu_atom@UOPS_RETIRED.MS@) / tma_info_core_slots",
+        "MetricGroup": "TopdownL2;tma_L2_group;tma_retiring_group",
+        "MetricName": "tma_base",
+        "MetricThreshold": "tma_base > 0.6",
+        "MetricgroupNoGroup": "TopdownL2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to BACLEARS, which occurs when the Branch Target Buffer (BTB) prediction or lack thereof, was corrected by a later branch predictor in the frontend",
+        "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.BRANCH_DETECT@ / tma_info_core_slots",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_fetch_latency_group",
+        "MetricName": "tma_branch_detect",
+        "MetricThreshold": "tma_branch_detect > 0.05",
+        "PublicDescription": "Counts the number of issue slots that were not delivered by the frontend due to BACLEARS, which occurs when the Branch Target Buffer (BTB) prediction or lack thereof, was corrected by a later branch predictor in the frontend. Includes BACLEARS due to all branch types including conditional and unconditional jumps, returns, and indirect branches.",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to branch mispredicts.",
+        "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.MISPREDICT@ / tma_info_core_slots",
+        "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group",
+        "MetricName": "tma_branch_mispredicts",
+        "MetricThreshold": "tma_branch_mispredicts > 0.05",
+        "MetricgroupNoGroup": "TopdownL2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to BTCLEARS, which occurs when the Branch Target Buffer (BTB) predicts a taken branch.",
+        "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.BRANCH_RESTEER@ / tma_info_core_slots",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_fetch_latency_group",
+        "MetricName": "tma_branch_resteer",
+        "MetricThreshold": "tma_branch_resteer > 0.05",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to the microcode sequencer (MS).",
+        "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.CISC@ / tma_info_core_slots",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
+        "MetricName": "tma_cisc",
+        "MetricThreshold": "tma_cisc > 0.05",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles due to backend bound stalls that are core execution bound and not attributed to outstanding demand load or store stalls.",
+        "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)",
+        "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group",
+        "MetricName": "tma_core_bound",
+        "MetricThreshold": "tma_core_bound > 0.1",
+        "MetricgroupNoGroup": "TopdownL2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to decode stalls.",
"cpu_atom@TOPDOWN_FE_BOUND.DECODE@ / tma_info_core_s= lots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_decode", + "MetricThreshold": "tma_decode > 0.05", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears relative = to the number of nuke slots due to memory disambiguation.", + "MetricExpr": "tma_nuke * (cpu_atom@MACHINE_CLEARS.DISAMBIGUATION@= / cpu_atom@MACHINE_CLEARS.SLOW@)", + "MetricGroup": "TopdownL4;tma_L4_group;tma_nuke_group", + "MetricName": "tma_disambiguation", + "MetricThreshold": "tma_disambiguation > 0.02", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles the core is stall= ed due to a demand load miss which hit in DRAM or MMIO (Non-DRAM).", + "MetricExpr": "cpu_atom@MEM_BOUND_STALLS_LOAD.LLC_MISS@ / tma_info= _core_clks - max((cpu_atom@MEM_BOUND_STALLS_LOAD.ALL@ - cpu_atom@LD_HEAD.L1= _MISS_AT_RET@) / tma_info_core_clks, 0) * cpu_atom@MEM_BOUND_STALLS_LOAD.LL= C_MISS@ / cpu_atom@MEM_BOUND_STALLS_LOAD.ALL@", + "MetricGroup": "TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_dram_bound", + "MetricThreshold": "tma_dram_bound > 0.1", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were n= ot consumed by the backend due to a machine clear classified as a fast nuke= due to memory ordering, memory disambiguation and memory renaming.", + "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.FASTNUKE@ / tma_in= fo_core_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", + "MetricName": "tma_fast_nuke", + "MetricThreshold": "tma_fast_nuke > 0.05", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were n= ot delivered by the frontend due to frontend bandwidth restrictions due to = decode, predecode, cisc, and other limitations.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH@ / tma= _info_core_slots", + "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_fetch_bandwidth", + "MetricThreshold": "tma_fetch_bandwidth > 0.1", + "MetricgroupNoGroup": "TopdownL2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were n= ot delivered by the frontend due to frontend bandwidth restrictions due to = decode, predecode, cisc, and other limitations.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.FRONTEND_LATENCY@ / tma_i= nfo_core_slots", + "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_fetch_latency", + "MetricThreshold": "tma_fetch_latency > 0.15", + "MetricgroupNoGroup": "TopdownL2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears relative = to the number of nuke slots due to FP assists.", + "MetricExpr": "tma_nuke * (cpu_atom@MACHINE_CLEARS.FP_ASSIST@ / cp= u_atom@MACHINE_CLEARS.SLOW@)", + "MetricGroup": "TopdownL4;tma_L4_group;tma_nuke_group", + "MetricName": "tma_fp_assist", + "MetricThreshold": "tma_fp_assist > 0.02", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of floating point divide op= erations per uop.", + "MetricExpr": "cpu_atom@UOPS_RETIRED.FPDIV@ / tma_info_core_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_base_group", + "MetricName": "tma_fpdiv_uops", + "MetricThreshold": "tma_fpdiv_uops 
> 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were n= ot consumed by the backend due to frontend stalls.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.ALL@ / tma_info_core_slot= s", + "MetricGroup": "TopdownL1;tma_L1_group", + "MetricName": "tma_frontend_bound", + "MetricThreshold": "tma_frontend_bound > 0.2", + "MetricgroupNoGroup": "TopdownL1", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were n= ot delivered by the frontend due to instruction cache misses.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.ICACHE@ / tma_info_core_s= lots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_icache_misses", + "MetricThreshold": "tma_icache_misses > 0.05", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "", + "MetricExpr": "cpu_atom@CPU_CLK_UNHALTED.CORE@", + "MetricName": "tma_info_core_clks", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "", + "MetricExpr": "cpu_atom@CPU_CLK_UNHALTED.CORE_P@", + "MetricName": "tma_info_core_clks_p", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Cycles Per Instruction", + "MetricExpr": "tma_info_core_clks / INST_RETIRED.ANY", + "MetricName": "tma_info_core_cpi", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions Per Cycle", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / tma_info_core_clks", + "MetricName": "tma_info_core_ipc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "", + "MetricExpr": "6 * tma_info_core_clks", + "MetricName": "tma_info_core_slots", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "cpu_atom@UOPS_RETIRED.ALL@ / INST_RETIRED.ANY", + "MetricName": "tma_info_core_upi", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percent of instruction miss cost that hit in = DRAM", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_IFETCH.LLC_MISS@ / = cpu_atom@MEM_BOUND_STALLS_IFETCH.ALL@", + "MetricName": "tma_info_frontend_inst_miss_cost_dramhit_percent", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percent of instruction miss cost that hit in = the L2", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_IFETCH.L2_HIT@ / cp= u_atom@MEM_BOUND_STALLS_IFETCH.ALL@", + "MetricName": "tma_info_frontend_inst_miss_cost_l2hit_percent", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percent of instruction miss cost that hit in = the L3", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_IFETCH.LLC_HIT@ / c= pu_atom@MEM_BOUND_STALLS_IFETCH.ALL@", + "MetricName": "tma_info_frontend_inst_miss_cost_l3hit_percent", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Ratio of all branches which mispredict", + "MetricExpr": "cpu_atom@BR_MISP_RETIRED.ALL_BRANCHES@ / BR_INST_RE= TIRED.ALL_BRANCHES", + "MetricName": "tma_info_inst_mix_branch_mispredict_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Ratio between Mispredicted branches and unkno= wn branches", + "MetricExpr": "cpu_atom@BR_MISP_RETIRED.ALL_BRANCHES@ / BACLEARS.A= NY", + "MetricName": "tma_info_inst_mix_branch_mispredict_to_unknown_bran= ch_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of all uops which are FPDiv uops", + "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.FPDIV@ / UOPS_RETIRED.A= LL", + "MetricName": "tma_info_inst_mix_fpdiv_uop_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of all uops which are IDiv uops", + "MetricExpr": "100 * 
+    {
+        "BriefDescription": "Percentage of all uops which are IDiv uops",
+        "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.IDIV@ / UOPS_RETIRED.ALL",
+        "MetricName": "tma_info_inst_mix_idiv_uop_ratio",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)",
+        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / BR_INST_RETIRED.ALL_BRANCHES",
+        "MetricName": "tma_info_inst_mix_ipbranch",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)",
+        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / BR_INST_RETIRED.NEAR_CALL",
+        "MetricName": "tma_info_inst_mix_ipcall",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per Far Branch",
+        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / (cpu_atom@BR_INST_RETIRED.FAR_BRANCH@ / 2)",
+        "MetricName": "tma_info_inst_mix_ipfarbranch",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per Load",
+        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / MEM_UOPS_RETIRED.ALL_LOADS",
+        "MetricName": "tma_info_inst_mix_ipload",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per retired conditional Branch Misprediction where the branch was not taken",
+        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / (cpu_atom@BR_MISP_RETIRED.COND@ - cpu_atom@BR_MISP_RETIRED.COND_TAKEN@)",
+        "MetricName": "tma_info_inst_mix_ipmisp_cond_ntaken",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per retired conditional Branch Misprediction where the branch was taken",
+        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / BR_MISP_RETIRED.COND_TAKEN",
+        "MetricName": "tma_info_inst_mix_ipmisp_cond_taken",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per retired indirect call or jump Branch Misprediction",
+        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / BR_MISP_RETIRED.INDIRECT",
+        "MetricName": "tma_info_inst_mix_ipmisp_indirect",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per retired return Branch Misprediction",
+        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / BR_MISP_RETIRED.RETURN",
+        "MetricName": "tma_info_inst_mix_ipmisp_ret",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per retired Branch Misprediction",
+        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / BR_MISP_RETIRED.ALL_BRANCHES",
+        "MetricName": "tma_info_inst_mix_ipmispredict",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per Store",
+        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / MEM_UOPS_RETIRED.ALL_STORES",
+        "MetricName": "tma_info_inst_mix_ipstore",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of all uops which are ucode ops",
+        "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.MS@ / UOPS_RETIRED.ALL",
+        "MetricName": "tma_info_inst_mix_microcode_uop_ratio",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of all uops which are x87 uops",
+        "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.X87@ / UOPS_RETIRED.ALL",
+        "MetricName": "tma_info_inst_mix_x87_uop_ratio",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of total non-speculative loads with an address aliasing block",
+        "MetricExpr": "100 * cpu_atom@LD_BLOCKS.ADDRESS_ALIAS@ / MEM_UOPS_RETIRED.ALL_LOADS",
+        "MetricName": "tma_info_l1_bound_address_alias_blocks",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of total non-speculative loads that are splits",
+        "MetricExpr": "100 * cpu_atom@MEM_UOPS_RETIRED.SPLIT_LOADS@ / MEM_UOPS_RETIRED.ALL_LOADS",
+        "MetricName": "tma_info_l1_bound_load_splits",
+        "Unit": "cpu_atom"
+    },
"BriefDescription": "Percentage of total non-speculative loads wit= h a store forward or unknown store address block", + "MetricExpr": "100 * cpu_atom@LD_BLOCKS.DATA_UNKNOWN@ / MEM_UOPS_R= ETIRED.ALL_LOADS", + "MetricName": "tma_info_l1_bound_store_fwd_blocks", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Cycle cost per L2 hit", + "MetricExpr": "cpu_atom@MEM_BOUND_STALLS_LOAD.L2_HIT@ / MEM_LOAD_U= OPS_RETIRED.L2_HIT", + "MetricName": "tma_info_memory_cycles_per_demand_load_l2_hit", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Cycle cost per LLC hit", + "MetricExpr": "cpu_atom@MEM_BOUND_STALLS_LOAD.LLC_HIT@ / MEM_LOAD_= UOPS_RETIRED.L3_HIT", + "MetricName": "tma_info_memory_cycles_per_demand_load_l3_hit", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "load ops retired per 1000 instruction", + "MetricExpr": "1e3 * cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@ / INST_R= ETIRED.ANY", + "MetricName": "tma_info_memory_memloadpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average CPU Utilization", + "MetricExpr": "cpu_atom@CPU_CLK_UNHALTED.REF_TSC@ / TSC", + "MetricName": "tma_info_system_cpu_utilization", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of cycles spent in Kernel mode", + "MetricExpr": "cpu_atom@CPU_CLK_UNHALTED.CORE@k / CPU_CLK_UNHALTED= .CORE", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_kernel_utilization", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", + "MetricExpr": "tma_info_core_clks / CPU_CLK_UNHALTED.REF_TSC", + "MetricGroup": "Power", + "MetricName": "tma_info_system_turbo_utilization", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were n= ot delivered by the frontend due to Instruction Table Lookaside Buffer (ITL= B) misses.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.ITLB_MISS@ / tma_info_cor= e_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_fetch_latency_group", + "MetricName": "tma_itlb_misses", + "MetricThreshold": "tma_itlb_misses > 0.05", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the oldest l= oad of the load buffer is stalled at retirement due to a load block.", + "MetricExpr": "cpu_atom@LD_HEAD.L1_BOUND_AT_RET@ / tma_info_core_c= lks", + "MetricGroup": "TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_l1_bound", + "MetricThreshold": "tma_l1_bound > 0.1", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles a core is stalled= due to a demand load which hit in the L2 Cache.", + "MetricExpr": "cpu_atom@MEM_BOUND_STALLS_LOAD.L2_HIT@ / tma_info_c= ore_clks - max((cpu_atom@MEM_BOUND_STALLS_LOAD.ALL@ - cpu_atom@LD_HEAD.L1_M= ISS_AT_RET@) / tma_info_core_clks, 0) * cpu_atom@MEM_BOUND_STALLS_LOAD.L2_H= IT@ / cpu_atom@MEM_BOUND_STALLS_LOAD.ALL@", + "MetricGroup": "TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_l2_bound", + "MetricThreshold": "tma_l2_bound > 0.1", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles a core is stalled= due to a demand load which hit in the Last Level Cache (LLC) or other core= with HITE/F/M.", + "MetricExpr": "cpu_atom@MEM_BOUND_STALLS_LOAD.LLC_HIT@ / tma_info_= core_clks - max((cpu_atom@MEM_BOUND_STALLS_LOAD.ALL@ - cpu_atom@LD_HEAD.L1_= MISS_AT_RET@) / tma_info_core_clks, 0) * cpu_atom@MEM_BOUND_STALLS_LOAD.LLC= _HIT@ / cpu_atom@MEM_BOUND_STALLS_LOAD.ALL@", + 
"MetricGroup": "TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_l3_bound", + "MetricThreshold": "tma_l3_bound > 0.1", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles, relative to the = number of mem_scheduler slots, in which uops are blocked due to load buffer= full", + "MetricExpr": "tma_mem_scheduler * cpu_atom@MEM_SCHEDULER_BLOCK.LD= _BUF@ / MEM_SCHEDULER_BLOCK.ALL", + "MetricGroup": "TopdownL4;tma_L4_group;tma_mem_scheduler_group", + "MetricName": "tma_ld_buffer", + "MetricThreshold": "tma_ld_buffer > 0.05", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the total number of issue slots that w= ere not consumed by the backend because allocation is stalled due to a mach= ine clear (nuke) of any kind including memory ordering and memory disambigu= ation.", + "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.MACHINE_CLEARS@ / = tma_info_core_slots", + "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", + "MetricName": "tma_machine_clears", + "MetricThreshold": "tma_machine_clears > 0.05", + "MetricgroupNoGroup": "TopdownL2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were n= ot consumed by the backend due to memory reservation stalls in which a sche= duler is not able to accept uops.", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.MEM_SCHEDULER@ / tma_info= _core_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_mem_scheduler", + "MetricThreshold": "tma_mem_scheduler > 0.1", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles the core is stall= ed due to stores or loads.", + "MetricExpr": "min(tma_backend_bound, cpu_atom@LD_HEAD.ANY_AT_RET@= / tma_info_core_clks + tma_store_bound)", + "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_memory_bound", + "MetricThreshold": "tma_memory_bound > 0.2", + "MetricgroupNoGroup": "TopdownL2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears relative = to the number of nuke slots due to memory ordering.", + "MetricExpr": "tma_nuke * (cpu_atom@MACHINE_CLEARS.MEMORY_ORDERING= @ / cpu_atom@MACHINE_CLEARS.SLOW@)", + "MetricGroup": "TopdownL4;tma_L4_group;tma_nuke_group", + "MetricName": "tma_memory_ordering", + "MetricThreshold": "tma_memory_ordering > 0.02", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of uops that are from the c= omplex flows issued by the micro-sequencer (MS)", + "MetricExpr": "cpu_atom@UOPS_RETIRED.MS@ / tma_info_core_slots", + "MetricGroup": "TopdownL2;tma_L2_group;tma_retiring_group", + "MetricName": "tma_ms_uops", + "MetricThreshold": "tma_ms_uops > 0.05", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "Counts the number of uops that are from the = complex flows issued by the micro-sequencer (MS). 
+    {
+        "BriefDescription": "Counts the number of uops that are from the complex flows issued by the micro-sequencer (MS)",
+        "MetricExpr": "cpu_atom@UOPS_RETIRED.MS@ / tma_info_core_slots",
+        "MetricGroup": "TopdownL2;tma_L2_group;tma_retiring_group",
+        "MetricName": "tma_ms_uops",
+        "MetricThreshold": "tma_ms_uops > 0.05",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "Counts the number of uops that are from the complex flows issued by the micro-sequencer (MS). This includes uops from flows due to complex instructions, faults, assists, and inserted flows.",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to IEC or FPC RAT stalls, which can be due to FIQ or IEC reservation stalls in which the integer, floating point or SIMD scheduler is not able to accept uops.",
+        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.NON_MEM_SCHEDULER@ / tma_info_core_slots",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group",
+        "MetricName": "tma_non_mem_scheduler",
+        "MetricThreshold": "tma_non_mem_scheduler > 0.1",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to a machine clear (slow nuke).",
+        "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.NUKE@ / tma_info_core_slots",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group",
+        "MetricName": "tma_nuke",
+        "MetricThreshold": "tma_nuke > 0.05",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to other common frontend stalls not categorized.",
+        "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.OTHER@ / tma_info_core_slots",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
+        "MetricName": "tma_other_fb",
+        "MetricThreshold": "tma_other_fb > 0.05",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles that the oldest load of the load buffer is stalled at retirement due to a number of other load blocks.",
+        "MetricExpr": "cpu_atom@LD_HEAD.OTHER_AT_RET@ / tma_info_core_clks",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
+        "MetricName": "tma_other_l1",
+        "MetricThreshold": "tma_other_l1 > 0.05",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles the core is stalled due to a demand load miss which hits in the L2, LLC, DRAM or MMIO (Non-DRAM) but could not be correctly attributed or cycles in which the load miss is waiting on a request buffer.",
+        "MetricExpr": "max(0, tma_memory_bound - (tma_store_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound))",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_memory_bound_group",
+        "MetricName": "tma_other_load_store",
+        "MetricThreshold": "tma_other_load_store > 0.1",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of uops retired excluding ms and fp div uops.",
+        "MetricExpr": "(cpu_atom@TOPDOWN_RETIRING.ALL@ - cpu_atom@UOPS_RETIRED.MS@ - cpu_atom@UOPS_RETIRED.FPDIV@) / tma_info_core_slots",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_base_group",
+        "MetricName": "tma_other_ret",
+        "MetricThreshold": "tma_other_ret > 0.3",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of machine clears relative to the number of nuke slots due to page faults.",
+        "MetricExpr": "tma_nuke * (cpu_atom@MACHINE_CLEARS.PAGE_FAULT@ / cpu_atom@MACHINE_CLEARS.SLOW@)",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_nuke_group",
+        "MetricName": "tma_page_fault",
+        "MetricThreshold": "tma_page_fault > 0.02",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to wrong predecodes.",
"cpu_atom@TOPDOWN_FE_BOUND.PREDECODE@ / tma_info_cor= e_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_predecode", + "MetricThreshold": "tma_predecode > 0.05", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were n= ot consumed by the backend due to the physical register file unable to acce= pt an entry (marble stalls).", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.REGISTER@ / tma_info_core= _slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_register", + "MetricThreshold": "tma_register > 0.1", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were n= ot consumed by the backend due to the reorder buffer being full (ROB stalls= ).", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.REORDER_BUFFER@ / tma_inf= o_core_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_reorder_buffer", + "MetricThreshold": "tma_reorder_buffer > 0.1", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the total number of issue slots that = were not consumed by the backend due to backend stalls", + "MetricExpr": "tma_backend_bound", + "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_aux_group= ", + "MetricName": "tma_resource_bound", + "MetricThreshold": "tma_resource_bound > 0.2", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "Counts the total number of issue slots that= were not consumed by the backend due to backend stalls. Note that uops mu= st be available for consumption in order for this event to count. If a uop= is not available (IQ is empty), this event will not count.", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that result= in retirement slots.", + "MetricExpr": "cpu_atom@TOPDOWN_RETIRING.ALL@ / tma_info_core_slot= s", + "MetricGroup": "TopdownL1;tma_L1_group", + "MetricName": "tma_retiring", + "MetricThreshold": "tma_retiring > 0.75", + "MetricgroupNoGroup": "TopdownL1", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles, relative to the = number of mem_scheduler slots, in which uops are blocked due to RSV full re= lative", + "MetricExpr": "tma_mem_scheduler * cpu_atom@MEM_SCHEDULER_BLOCK.RS= V@ / MEM_SCHEDULER_BLOCK.ALL", + "MetricGroup": "TopdownL4;tma_L4_group;tma_mem_scheduler_group", + "MetricName": "tma_rsv", + "MetricThreshold": "tma_rsv > 0.05", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were n= ot consumed by the backend due to scoreboards from the instruction queue (I= Q), jump execution unit (JEU), or microcode sequencer (MS).", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.SERIALIZATION@ / tma_info= _core_slots", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_serialization", + "MetricThreshold": "tma_serialization > 0.1", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears relative = to the number of nuke slots due to SMC.", + "MetricExpr": "tma_nuke * (cpu_atom@MACHINE_CLEARS.SMC@ / cpu_atom= @MACHINE_CLEARS.SLOW@)", + "MetricGroup": "TopdownL4;tma_L4_group;tma_nuke_group", + "MetricName": "tma_smc", + "MetricThreshold": "tma_smc > 0.02", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { 
+ "BriefDescription": "Counts the number of cycles, relative to the = number of mem_scheduler slots, in which uops are blocked due to store buffe= r full", + "MetricExpr": "tma_store_bound", + "MetricGroup": "TopdownL4;tma_L4_group;tma_mem_scheduler_group", + "MetricName": "tma_st_buffer", + "MetricThreshold": "tma_st_buffer > 0.05", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the oldest l= oad of the load buffer is stalled at retirement due to a first level TLB mi= ss.", + "MetricExpr": "cpu_atom@LD_HEAD.DTLB_MISS_AT_RET@ / tma_info_core_= clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_stlb_hit", + "MetricThreshold": "tma_stlb_hit > 0.05", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the oldest l= oad of the load buffer is stalled at retirement due to a second level TLB m= iss requiring a page walk.", + "MetricExpr": "cpu_atom@LD_HEAD.PGWALK_AT_RET@ / tma_info_core_clk= s", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_stlb_miss", + "MetricThreshold": "tma_stlb_miss > 0.05", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles the core is stall= ed due to store buffer full.", + "MetricExpr": "tma_mem_scheduler * (cpu_atom@MEM_SCHEDULER_BLOCK.S= T_BUF@ / cpu_atom@MEM_SCHEDULER_BLOCK.ALL@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_store_bound", + "MetricThreshold": "tma_store_bound > 0.1", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the oldest l= oad of the load buffer is stalled at retirement due to a store forward bloc= k.", + "MetricExpr": "cpu_atom@LD_HEAD.ST_ADDR_AT_RET@ / tma_info_core_cl= ks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_store_fwd_blk", + "MetricThreshold": "tma_store_fwd_blk > 0.05", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Uncore frequency per die [GHZ]", + "MetricExpr": "tma_info_system_socket_clks / #num_dies / duration_= time / 1e9", + "MetricGroup": "SoC", + "MetricName": "UNCORE_FREQ", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", + "MetricExpr": "(cpu_core@UOPS_DISPATCHED.PORT_0@ + cpu_core@UOPS_D= ISPATCHED.PORT_1@ + cpu_core@UOPS_DISPATCHED.PORT_5_11@ + cpu_core@UOPS_DIS= PATCHED.PORT_6@) / (5 * tma_info_core_core_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", + "MetricName": "tma_alu_op_utilization", + "MetricThreshold": "tma_alu_op_utilization > 0.4", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops delivered by the Microcode_Sequencer as a result of Assists= ", + "MetricExpr": "78 * cpu_core@ASSISTS.ANY@ / tma_info_thread_slots", + "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", + "MetricName": "tma_assists", + "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer >= 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. 
+        "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: ASSISTS.ANY",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of slots the CPU retired uops as a result of handing SSE to AVX* or AVX* to SSE transition Assists.",
+        "MetricExpr": "63 * cpu_core@ASSISTS.SSE_AVX_MIX@ / tma_info_thread_slots",
+        "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group",
+        "MetricName": "tma_avx_assists",
+        "MetricThreshold": "tma_avx_assists > 0.1",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
+        "MetricExpr": "cpu_core@topdown\\-be\\-bound@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
+        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+        "MetricName": "tma_backend_bound",
+        "MetricThreshold": "tma_backend_bound > 0.2",
+        "MetricgroupNoGroup": "TopdownL1",
+        "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
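On the cpu_core side, the L1 fractions come from the fixed-counter perf-metrics (topdown\-*) values rather than discrete events: each component is normalized by the sum of all four, and the trailing "+ 0 * tma_info_thread_slots" appears to exist only to pull the slots event into the same group without changing the value. Sketch with invented counts:

  # Hypothetical raw perf-metrics component counts (cpu_core).
  fe_bound, bad_spec, retiring, be_bound = 2.1e9, 0.6e9, 3.3e9, 2.0e9

  total = fe_bound + bad_spec + retiring + be_bound
  tma_backend_bound = be_bound / total   # 0.25: fraction of all slots
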
+    {
+        "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
+        "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
+        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+        "MetricName": "tma_bad_speculation",
+        "MetricThreshold": "tma_bad_speculation > 0.15",
+        "MetricgroupNoGroup": "TopdownL1",
+        "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This includes slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction",
+        "MetricExpr": "cpu_core@topdown\\-br\\-mispredict@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
+        "MetricGroup": "BadSpec;BrMispredicts;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
+        "MetricName": "tma_branch_mispredicts",
+        "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers",
+        "MetricExpr": "cpu_core@INT_MISC.CLEAR_RESTEER_CYCLES@ / tma_info_thread_clks + tma_unknown_branches",
+        "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group",
+        "MetricName": "tma_branch_resteers",
+        "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to staying in C0.1 power-performance optimized state (Faster wakeup time; Smaller power savings).",
+        "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.C01@ / tma_info_thread_clks",
+        "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_operation_group",
+        "MetricName": "tma_c01_wait",
+        "MetricThreshold": "tma_c01_wait > 0.05 & (tma_serializing_operation > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to staying in C0.2 power-performance optimized state (Slower wakeup time; Larger power savings).",
+        "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.C02@ / tma_info_thread_clks",
+        "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_operation_group",
+        "MetricName": "tma_c02_wait",
+        "MetricThreshold": "tma_c02_wait > 0.05 & (tma_serializing_operation > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction",
+        "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
+        "MetricName": "tma_cisc",
+        "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources. Sample with: FRONTEND_RETIRED.MS_FLOWS",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
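tma_cisc just above is a residual: the part of the microcode-sequencer cost not explained by assists, clamped at zero so noise between the two estimates cannot produce a negative node. The same max(0, parent - child) shape recurs in tma_core_bound and tma_fetch_bandwidth further down. For illustration:

  def residual(parent, child):
      # Clamp at zero: the two inputs are independently estimated fractions.
      return max(0.0, parent - child)

  tma_microcode_sequencer = 0.08   # hypothetical
  tma_assists = 0.03               # hypothetical
  tma_cisc = residual(tma_microcode_sequencer, tma_assists)  # 0.05
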
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears",
+        "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) * cpu_core@INT_MISC.CLEAR_RESTEER_CYCLES@ / tma_info_thread_clks",
+        "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueMC",
+        "MetricName": "tma_clears_resteers",
+        "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses",
+        "MetricExpr": "(cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@ * min(cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@R, 24 * tma_info_system_core_frequency) + cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@ * min(cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R, 25 * tma_info_system_core_frequency) * (cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM@ / (cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM@ + cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD@))) * (1 + cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_MISS@ / 2) / tma_info_thread_clks",
+        "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
+        "MetricName": "tma_contested_accesses",
+        "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were a bottleneck",
+        "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)",
+        "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
+        "MetricName": "tma_core_bound",
+        "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses",
+        "MetricExpr": "(cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@ * min(cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@R, 24 * tma_info_system_core_frequency) + cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@ * min(cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R, 24 * tma_info_system_core_frequency) * (1 - cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM@ / (cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM@ + cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD@))) * (1 + cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_MISS@ / 2) / tma_info_thread_clks",
+        "MetricGroup": "Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
+        "MetricName": "tma_data_sharing",
+        "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder",
+        "MetricExpr": "(cpu_core@INST_DECODED.DECODERS\\,cmask\\=1@ - cpu_core@INST_DECODED.DECODERS\\,cmask\\=2@) / tma_info_core_core_clks / 2",
+        "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0;tma_mite_group",
+        "MetricName": "tma_decoder0_alone",
+        "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & tma_fetch_bandwidth > 0.2)",
+        "PublicDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder. Related metrics: tma_few_uops_instructions",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
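tma_decoder0_alone above leans on counter masks: INST_DECODED.DECODERS with cmask=1 counts cycles where at least one decoder produced uops, cmask=2 cycles where at least two did, so their difference is cycles where decoder-0 was the only one active. Sketch with hypothetical counts (the final halving mirrors the expression's per-core /2):

  decoders_cmask1 = 500_000_000    # cycles with >= 1 decoder active
  decoders_cmask2 = 350_000_000    # cycles with >= 2 decoders active
  core_clks = 1_000_000_000

  only_decoder0 = decoders_cmask1 - decoders_cmask2
  tma_decoder0_alone = only_decoder0 / core_clks / 2
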
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active",
+        "MetricExpr": "cpu_core@ARITH.DIV_ACTIVE@ / tma_info_thread_clks",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group",
+        "MetricName": "tma_divider",
+        "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)",
+        "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_UOPS",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads",
+        "MetricExpr": "cpu_core@MEMORY_ACTIVITY.STALLS_L3_MISS@ / tma_info_thread_clks",
+        "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
+        "MetricName": "tma_dram_bound",
+        "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
+        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS_PS",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline",
+        "MetricExpr": "(cpu_core@IDQ.DSB_CYCLES_ANY@ - cpu_core@IDQ.DSB_CYCLES_OK@) / tma_info_core_core_clks / 2",
+        "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
+        "MetricName": "tma_dsb",
+        "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2",
+        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines",
+        "MetricExpr": "cpu_core@DSB2MITE_SWITCHES.PENALTY_CYCLES@ / tma_info_thread_clks",
+        "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
+        "MetricName": "tma_dsb_switches",
+        "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivers higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: t= ma_dtlb_store, tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_me= mory_synchronization", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", + "MetricExpr": "MEM_INST_RETIRED.STLB_HIT_STORES * min(MEM_INST_RET= IRED.STLB_HIT_STORES:R, 7) / tma_info_thread_clks + tma_store_stlb_miss", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_= store_bound_group", + "MetricName": "tma_dtlb_store", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load, = tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_memory_synchroniz= ation", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric roughly estimates how often CPU w= as handling synchronizations due to False Sharing", + "MetricExpr": "28 * tma_info_system_core_frequency * cpu_core@OCR.= DEMAND_RFO.L3_HIT.SNOOP_HITM@ / tma_info_thread_clks", + "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;t= ma_issueSyncxn;tma_store_bound_group", + "MetricName": "tma_false_sharing", + "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing= , tma_machine_clears, tma_remote_cache", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", + "MetricExpr": "cpu_core@L1D_PEND_MISS.FB_FULL@ / tma_info_thread_c= lks", + "MetricGroup": "MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_is= sueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricName": "tma_fb_full", + "MetricThreshold": "tma_fb_full > 0.3", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. 
The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints at approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_info_bottleneck_cache_memory_bandwi= dth, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store= _latency, tma_streaming_stores", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend bandwidth issues", + "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)", + "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_= frontend_bound_group;tma_issueFB", + "MetricName": "tma_fetch_bandwidth", + "MetricThreshold": "tma_fetch_bandwidth > 0.2", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LA= TENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_sw= itches, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_= info_inst_mix_iptb, tma_lcp", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend latency issues", + "MetricExpr": "cpu_core@topdown\\-fetch\\-lat@ / (cpu_core@topdown= \\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiri= ng@ + cpu_core@topdown\\-be\\-bound@) - cpu_core@INT_MISC.UOP_DROPPING@ / t= ma_info_thread_slots", + "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend= _bound_group", + "MetricName": "tma_fetch_latency", + "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that are decoded into two or up to ([SNB+] four; [ADL+] five) uops", + "MetricExpr": "max(0, tma_heavy_operations - tma_microcode_sequenc= er)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;= tma_issueD0", + "MetricName": "tma_few_uops_instructions", + "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_o= perations > 0.1", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that are decoded into two or up t= o ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number o= f uops in such instructions. 
Related metrics: tma_decoder0_alone", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents overall arithmetic flo= ating-point (FP) operations fraction the CPU has executed (retired)", + "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector", + "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", + "MetricName": "tma_fp_arith", + "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric roughly estimates fraction of slo= ts the CPU retired uops as a result of handling Floating Point (FP) Assists", + "MetricExpr": "30 * cpu_core@ASSISTS.FP@ / tma_info_thread_slots", + "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", + "MetricName": "tma_fp_assists", + "MetricThreshold": "tma_fp_assists > 0.1", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handling Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals).", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric approximates arithmetic floating-= point (FP) scalar uops fraction the CPU has retired", + "MetricExpr": "cpu_core@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umas= k\\=3D0x03@ / (tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", + "MetricName": "tma_fp_scalar", + "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_2= 56b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric approximates arithmetic floating-= point (FP) vector uops fraction the CPU has retired aggregated across all v= ector widths", + "MetricExpr": "cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\= ,umask\\=3D0x3c@ / (tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", + "MetricName": "tma_fp_vector", + "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. 
Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 128-bit wide vectors", + "MetricExpr": "(cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE@= + cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE@) / (tma_retiring * tm= a_info_thread_slots)", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", + "MetricName": "tma_fp_vector_128b", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vecto= r_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_= 2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 256-bit wide vectors", + "MetricExpr": "(cpu_core@FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE@= + cpu_core@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE@) / (tma_retiring * tm= a_info_thread_slots)", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", + "MetricName": "tma_fp_vector_256b", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vecto= r_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_= 2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This category represents fraction of slots wh= ere the processor's Frontend undersupplies its Backend", + "MetricExpr": "cpu_core@topdown\\-fe\\-bound@ / (cpu_core@topdown\= \-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retirin= g@ + cpu_core@topdown\\-be\\-bound@) - cpu_core@INT_MISC.UOP_DROPPING@ / tm= a_info_thread_slots", + "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_frontend_bound", + "MetricThreshold": "tma_frontend_bound > 0.15", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. 
For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4_PS", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions -- where one uop can represent mu= ltiple contiguous instructions", + "MetricExpr": "tma_light_operations * cpu_core@INST_RETIRED.MACRO_= FUSED@ / (tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Branches;Pipeline;TopdownL3;tma_L3_group;tma_light= _operations_group", + "MetricName": "tma_fused_instructions", + "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_opera= tions > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions -- where one uop can represent m= ultiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of = legacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Ot= her_Light_Ops in MTL!)}", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", + "MetricExpr": "cpu_core@topdown\\-heavy\\-ops@ / (cpu_core@topdown= \\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiri= ng@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", + "MetricName": "tma_heavy_operations", + "MetricThreshold": "tma_heavy_operations > 0.1", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. This highly-correlates with th= e uop length of these instructions/sequences. ([ICL+] Note this may overcou= nt due to approximation using indirect events; [ADL+] .). Sample with: UOPS= _RETIRED.HEAVY", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses", + "MetricExpr": "cpu_core@ICACHE_DATA.STALLS@ / tma_info_thread_clks= ", + "MetricGroup": "BigFootprint;FetchLat;IcMiss;TopdownL3;tma_L3_grou= p;tma_fetch_latency_group", + "MetricName": "tma_icache_misses", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Branch Misprediction Cost: Fraction of TMA sl= ots wasted per non-speculative branch misprediction (retired JEClear)", + "MetricExpr": "tma_info_bottleneck_mispredictions * tma_info_threa= d_slots / cpu_core@BR_MISP_RETIRED.ALL_BRANCHES@ / 100", + "MetricGroup": "Bad;BrMispredicts;tma_issueBM", + "MetricName": "tma_info_bad_spec_branch_misprediction_cost", + "PublicDescription": "Branch Misprediction Cost: Fraction of TMA s= lots wasted per non-speculative branch misprediction (retired JEClear). 
Rel= ated metrics: tma_branch_mispredicts, tma_info_bottleneck_mispredictions, t= ma_mispredicts_resteers", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per retired mispredict for cond= itional non-taken branches (lower number means higher occurrence rate).", + "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / BR_MISP_RETIRED.COND_N= TAKEN", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken", + "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per retired mispredict for cond= itional taken branches (lower number means higher occurrence rate).", + "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / BR_MISP_RETIRED.COND_T= AKEN", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmisp_cond_taken", + "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_taken < 200", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per retired mispredict for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", + "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / BR_MISP_RETIRED.INDIRE= CT", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmisp_indirect", + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per retired mispredict for retu= rn branches (lower number means higher occurrence rate).", + "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / BR_MISP_RETIRED.RET", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmisp_ret", + "MetricThreshold": "tma_info_bad_spec_ipmisp_ret < 500", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of Instructions per non-speculative Br= anch Misprediction (JEClear) (lower number means higher occurrence rate)", + "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / BR_MISP_RETIRED.ALL_BR= ANCHES", + "MetricGroup": "Bad;BadSpec;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmispredict", + "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Speculative to Retired ratio of all clears (c= overing mispredicts and nukes)", + "MetricExpr": "cpu_core@INT_MISC.CLEARS_COUNT@ / (cpu_core@BR_MISP= _RETIRED.ALL_BRANCHES@ + cpu_core@MACHINE_CLEARS.COUNT@)", + "MetricGroup": "BrMispredicts", + "MetricName": "tma_info_bad_spec_spec_clears_ratio", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Probability of Core Bound bottleneck hidden b= y SMT-profiling artifacts", + "MetricExpr": "(100 * (1 - tma_core_bound / tma_ports_utilization = if tma_core_bound < tma_ports_utilization else 1) if tma_info_system_smt_2t= _utilization > 0.5 else 0)", + "MetricGroup": "Cor;SMT", + "MetricName": "tma_info_botlnk_l0_core_bound_likely", + "MetricThreshold": "tma_info_botlnk_l0_core_bound_likely > 0.5", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + = tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tm= a_lsd + tma_mite))", + "MetricGroup": "DSBmiss;Fed;tma_issueFB", + "MetricName": "tma_info_botlnk_l2_dsb_misses", + "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", + "PublicDescription": "Total pipeline cost of DSB (uop cache) misse= s - subset of the 
Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb= _switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage, tma_info_in= st_mix_iptb, tma_lcp", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses += tma_lcp + tma_ms_switches))", + "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", + "MetricName": "tma_info_botlnk_l2_ic_misses", + "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", + "PublicDescription": "Total pipeline cost of Instruction Cache mis= ses - subset of the Big_Code Bottleneck. Related metrics: ", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of \"useful operations\" = - the baseline operations not covered by Branching_Overhead nor Irregular_O= verhead.", + "MetricExpr": "100 * (tma_retiring - (cpu_core@BR_INST_RETIRED.ALL= _BRANCHES@ + cpu_core@BR_INST_RETIRED.NEAR_CALL@) / tma_info_thread_slots -= tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_seque= ncer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "Ret", + "MetricName": "tma_info_bottleneck_base_non_br", + "MetricThreshold": "tma_info_bottleneck_base_non_br > 20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switch= es + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", + "MetricGroup": "BigFootprint;Fed;Frontend;IcMiss;MemoryTLB", + "MetricName": "tma_info_bottleneck_big_code", + "MetricThreshold": "tma_info_bottleneck_big_code > 20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of branch related instruc= tions (used for program control-flow including function calls)", + "MetricExpr": "100 * ((cpu_core@BR_INST_RETIRED.ALL_BRANCHES@ + cp= u_core@BR_INST_RETIRED.NEAR_CALL@) / tma_info_thread_slots)", + "MetricGroup": "Ret", + "MetricName": "tma_info_bottleneck_branching_overhead", + "MetricThreshold": "tma_info_bottleneck_branching_overhead > 5", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_= l3_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_fb_full + tma_lock= _latency + tma_split_loads + tma_store_fwd_blk)))", + "MetricGroup": "Mem;MemoryBW;Offcore;tma_issueBW", + "MetricName": "tma_info_bottleneck_cache_memory_bandwidth", + "MetricThreshold": "tma_info_bottleneck_cache_memory_bandwidth > 2= 0", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. 
Related metrics: tma_fb_full, tma_info_= system_dram_bw_use, tma_mem_bandwidth, tma_sq_full", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bou= nd + tma_store_bound) + tma_memory_bound * (tma_store_bound / (tma_dram_bou= nd + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_= store_latency / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tm= a_store_latency + tma_streaming_stores)))", + "MetricGroup": "Mem;MemoryLat;Offcore;tma_issueLat", + "MetricName": "tma_info_bottleneck_cache_memory_latency", + "MetricThreshold": "tma_info_bottleneck_cache_memory_latency > 20", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_= mem_latency", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_ports_utilization + tma_serializing_operation) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_ports_utilization + tma_serializin= g_operation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricGroup": "Cor;tma_issueComp", + "MetricName": "tma_info_bottleneck_compute_bound_est", + "MetricThreshold": "tma_info_bottleneck_compute_bound_est > 20", + "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy. 
Related metrics: ", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches = + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - (1 - c= pu_core@INST_RETIRED.REP_ITERATION@ / cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D= 1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_cl= ears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_bran= ch_mispredicts) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unk= nown_branches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_miss= es + tma_itlb_misses + tma_lcp + tma_ms_switches))) - tma_info_bottleneck_b= ig_code", + "MetricGroup": "Fed;FetchBW;Frontend", + "MetricName": "tma_info_bottleneck_instruction_fetch_bw", + "MetricThreshold": "tma_info_bottleneck_instruction_fetch_bw > 20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of irregular execution (e= .g", + "MetricExpr": "100 * ((1 - cpu_core@INST_RETIRED.REP_ITERATION@ / = cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D1@) * (tma_fetch_latency * (tma_ms_swi= tches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_restee= rs * tma_other_mispredicts / tma_branch_mispredicts) / (tma_clears_resteers= + tma_mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers= + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_m= s_switches)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_b= ranch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other= _nukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + cp= u_core@RS.EMPTY\\,umask\\=3D1@ / tma_info_thread_clks * tma_ports_utilized_= 0) / (tma_divider + tma_ports_utilization + tma_serializing_operation) + tm= a_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequence= r) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "Bad;Cor;Ret;tma_issueMS", + "MetricName": "tma_info_bottleneck_irregular_overhead", + "MetricThreshold": "tma_info_bottleneck_irregular_overhead > 10", + "PublicDescription": "Total pipeline cost of irregular execution (= e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). 
Related metri= cs: tma_microcode_sequencer, tma_ms_switches", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", + "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / (tma_dram= _bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (= tma_dtlb_load / (tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split= _loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound / (tma_d= ram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) = * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_split_stores = + tma_store_latency + tma_streaming_stores)))", + "MetricGroup": "Mem;MemoryTLB;Offcore;tma_issueTLB", + "MetricName": "tma_info_bottleneck_memory_data_tlbs", + "MetricThreshold": "tma_info_bottleneck_memory_data_tlbs > 20", + "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store, tma_info_bottleneck_memory_synchronization", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", + "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_dram= _bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * = tma_false_sharing / (tma_dtlb_store + tma_false_sharing + tma_split_stores = + tma_store_latency + tma_streaming_stores - tma_store_latency)) + tma_mach= ine_clears * (1 - tma_other_nukes / tma_other_nukes))", + "MetricGroup": "Mem;Offcore;tma_issueTLB", + "MetricName": "tma_info_bottleneck_memory_synchronization", + "MetricThreshold": "tma_info_bottleneck_memory_synchronization > 1= 0", + "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_dtlb_load, tma_dtlb_store, tma_info_bottleneck_me= mory_data_tlbs", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switc= hes + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", + "MetricGroup": "Bad;BadSpec;BrMispredicts;tma_issueBM", + "MetricName": "tma_info_bottleneck_mispredictions", + "MetricThreshold": "tma_info_bottleneck_mispredictions > 20", + "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. 
Related metrics: tma_branch_mispredicts, tma_info_bad_= spec_branch_misprediction_cost, tma_mispredicts_resteers", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of remaining bottlenecks = (apart from those listed in the Info.Bottlenecks metrics class)", + "MetricExpr": "100 - (tma_info_bottleneck_big_code + tma_info_bott= leneck_instruction_fetch_bw + tma_info_bottleneck_mispredictions + tma_info= _bottleneck_cache_memory_bandwidth + tma_info_bottleneck_cache_memory_laten= cy + tma_info_bottleneck_memory_data_tlbs + tma_info_bottleneck_memory_sync= hronization + tma_info_bottleneck_compute_bound_est + tma_info_bottleneck_i= rregular_overhead + tma_info_bottleneck_branching_overhead + tma_info_bottl= eneck_base_non_br)", + "MetricGroup": "Cor;Offcore", + "MetricName": "tma_info_bottleneck_other_bottlenecks", + "MetricThreshold": "tma_info_bottleneck_other_bottlenecks > 20", + "PublicDescription": "Total pipeline cost of remaining bottlenecks= (apart from those listed in the Info.Bottlenecks metrics class). Examples = include data-dependencies (Core Bound when Low ILP) and other unlisted memo= ry-related stalls.", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of branches that are CALL or RET", + "MetricExpr": "(cpu_core@BR_INST_RETIRED.NEAR_CALL@ + cpu_core@BR_= INST_RETIRED.NEAR_RETURN@) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_branches_callret", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of branches that are non-taken condi= tionals", + "MetricExpr": "cpu_core@BR_INST_RETIRED.COND_NTAKEN@ / BR_INST_RET= IRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches;CodeGen;PGO", + "MetricName": "tma_info_branches_cond_nt", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of branches that are taken condition= als", + "MetricExpr": "cpu_core@BR_INST_RETIRED.COND_TAKEN@ / BR_INST_RETI= RED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches;CodeGen;PGO", + "MetricName": "tma_info_branches_cond_tk", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of branches that are unconditional (= direct or indirect) jumps", + "MetricExpr": "(cpu_core@BR_INST_RETIRED.NEAR_TAKEN@ - cpu_core@BR= _INST_RETIRED.COND_TAKEN@ - 2 * cpu_core@BR_INST_RETIRED.NEAR_CALL@) / BR_I= NST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_branches_jump", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of branches of other types (not indi= vidually covered by other metrics in Info.Branches group)", + "MetricExpr": "1 - (tma_info_branches_cond_nt + tma_info_branches_= cond_tk + tma_info_branches_callret + tma_info_branches_jump)", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_branches_other_branches", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Core actual clocks when any Logical Processor= is active on the Physical Core", + "MetricExpr": "(cpu_core@CPU_CLK_UNHALTED.DISTRIBUTED@ if #SMT_on = else tma_info_thread_clks)", + "MetricGroup": "SMT", + "MetricName": "tma_info_core_core_clks", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions Per Cycle across hyper-threads (= per physical core)", + "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / tma_info_core_core_clk= s", + "MetricGroup": "Ret;SMT;TmaL1;tma_L1_group", + "MetricName": "tma_info_core_coreipc", + "Unit": "cpu_core" + }, + { + "BriefDescription": "uops Executed per Cycle", + "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / tma_info_thread_cl= ks", + "MetricGroup": "Power", 
+ "MetricName": "tma_info_core_epc", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Floating Point Operations Per Cycle", + "MetricExpr": "(cpu_core@FP_ARITH_INST_RETIRED.SCALAR@ + 2 * cpu_c= ore@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE@ + 4 * cpu_core@FP_ARITH_INST_= RETIRED.4_FLOPS@ + 8 * cpu_core@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE@) = / tma_info_core_core_clks", + "MetricGroup": "Flops;Ret", + "MetricName": "tma_info_core_flopc", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Actual per-core usage of the Floating Point n= on-X87 execution units (regardless of precision or vector-width)", + "MetricExpr": "(cpu_core@FP_ARITH_DISPATCHED.PORT_0@ + cpu_core@FP= _ARITH_DISPATCHED.PORT_1@ + cpu_core@FP_ARITH_DISPATCHED.PORT_5@) / (2 * tm= a_info_core_core_clks)", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_core_fp_arith_utilization", + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n).", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", + "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / cpu_core@UOPS_EXEC= UTED.THREAD\\,cmask\\=3D1@", + "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", + "MetricName": "tma_info_core_ilp", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of Uops delivered by the DSB (aka De= coded ICache; or Uop Cache)", + "MetricExpr": "cpu_core@IDQ.DSB_UOPS@ / cpu_core@UOPS_ISSUED.ANY@", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", + "MetricName": "tma_info_frontend_dsb_coverage", + "MetricThreshold": "tma_info_frontend_dsb_coverage < 0.7 & tma_inf= o_thread_ipc / 6 > 0.35", + "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_inst_mix_iptb, tma_lcp", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= .", + "MetricExpr": "cpu_core@DSB2MITE_SWITCHES.PENALTY_CYCLES@ / cpu_co= re@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=3D1\\,edge@", + "MetricGroup": "DSBmiss", + "MetricName": "tma_info_frontend_dsb_switch_cost", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", + "MetricExpr": "cpu_core@UOPS_ISSUED.ANY@ / cpu_core@UOPS_ISSUED.AN= Y\\,cmask\\=3D1@", + "MetricGroup": "Fed;FetchBW", + "MetricName": "tma_info_frontend_fetch_upc", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average Latency for L1 instruction cache miss= es", + "MetricExpr": "cpu_core@ICACHE_DATA.STALLS@ / cpu_core@ICACHE_DATA= .STALLS\\,cmask\\=3D1\\,edge@", + "MetricGroup": "Fed;FetchLat;IcMiss", + "MetricName": "tma_info_frontend_icache_miss_latency", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per non-speculative DSB miss (lo= wer number means higher occurrence rate)", + "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / FRONTEND_RETIRED.ANY_D= SB_MISS", + "MetricGroup": "DSBmiss;Fed", + "MetricName": "tma_info_frontend_ipdsb_miss_ret", + "MetricThreshold": "tma_info_frontend_ipdsb_miss_ret < 50", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per speculative Unknown Branch M= isprediction (BAClear) (lower number means higher occurrence rate)", + "MetricExpr": "tma_info_inst_mix_instructions / BACLEARS.ANY", + "MetricGroup": "Fed", + "MetricName": "tma_info_frontend_ipunknown_branch", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache true code cacheline misses per kilo = instruction", + "MetricExpr": "1e3 * cpu_core@FRONTEND_RETIRED.L2_MISS@ / INST_RET= IRED.ANY", + "MetricGroup": "IcMiss", + "MetricName": "tma_info_frontend_l2mpki_code", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache speculative code cacheline misses pe= r kilo instruction", + "MetricExpr": "1e3 * cpu_core@L2_RQSTS.CODE_RD_MISS@ / INST_RETIRE= D.ANY", + "MetricGroup": "IcMiss", + "MetricName": "tma_info_frontend_l2mpki_code_all", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of Uops delivered by the LSD (Loop S= tream Detector; aka Loop Cache)", + "MetricExpr": "cpu_core@LSD.UOPS@ / cpu_core@UOPS_ISSUED.ANY@", + "MetricGroup": "Fed;LSD", + "MetricName": "tma_info_frontend_lsd_coverage", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average number of cycles the front-end was de= layed due to an Unknown Branch detection", + "MetricExpr": "cpu_core@INT_MISC.UNKNOWN_BRANCH_CYCLES@ / cpu_core= @INT_MISC.UNKNOWN_BRANCH_CYCLES\\,cmask\\=3D1\\,edge@", + "MetricGroup": "Fed", + "MetricName": "tma_info_frontend_unknown_branch_cost", + "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. 
See Unknown_Branches node.", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Branch instructions per taken branch.", + "MetricExpr": "cpu_core@BR_INST_RETIRED.ALL_BRANCHES@ / BR_INST_RE= TIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_inst_mix_bptkbranch", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total number of retired Instructions", + "MetricExpr": "cpu_core@INST_RETIRED.ANY@", + "MetricGroup": "Summary;TmaL1;tma_L1_group", + "MetricName": "tma_info_inst_mix_instructions", + "PublicDescription": "Total number of retired Instructions. Sample= with: INST_RETIRED.PREC_DIST", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per FP Arithmetic instruction (l= ower number means higher occurrence rate)", + "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / (cpu_core@FP_ARITH_INS= T_RETIRED.SCALAR_SINGLE\\,umask\\=3D0x03@ + cpu_core@FP_ARITH_INST_RETIRED.= 128B_PACKED_DOUBLE\\,umask\\=3D0x3c@)", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_inst_mix_iparith", + "MetricThreshold": "tma_info_inst_mix_iparith < 10", + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW.", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", + "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / (cpu_core@FP_ARITH_INS= T_RETIRED.128B_PACKED_DOUBLE@ + cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_= SINGLE@)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_inst_mix_iparith_avx128", + "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting.", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", + "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / (cpu_core@FP_ARITH_INS= T_RETIRED.256B_PACKED_DOUBLE@ + cpu_core@FP_ARITH_INST_RETIRED.256B_PACKED_= SINGLE@)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_inst_mix_iparith_avx256", + "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting.", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / FP_ARITH_INST_RETIRED.= SCALAR_DOUBLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_inst_mix_iparith_scalar_dp", + "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting.", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / FP_ARITH_INST_RETIRED.= SCALAR_SINGLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_inst_mix_iparith_scalar_sp", + "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting.", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", + "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / BR_INST_RETIRED.ALL_BR= ANCHES", + "MetricGroup": "Branches;Fed;InsType", + "MetricName": "tma_info_inst_mix_ipbranch", + "MetricThreshold": "tma_info_inst_mix_ipbranch < 8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per (near) call (lower number me= ans higher occurrence rate)", + "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / BR_INST_RETIRED.NEAR_C= ALL", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_inst_mix_ipcall", + "MetricThreshold": "tma_info_inst_mix_ipcall < 200", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per Floating Point (FP) Operatio= n (lower number means higher occurrence rate)", + "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / (cpu_core@FP_ARITH_INS= T_RETIRED.SCALAR@ + 2 * cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE@ = + 4 * cpu_core@FP_ARITH_INST_RETIRED.4_FLOPS@ + 8 * cpu_core@FP_ARITH_INST_= RETIRED.256B_PACKED_SINGLE@)", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_inst_mix_ipflop", + "MetricThreshold": "tma_info_inst_mix_ipflop < 10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per Load (lower number means hig= her occurrence rate)", + "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / MEM_INST_RETIRED.ALL_L= OADS", + "MetricGroup": "InsType", + "MetricName": "tma_info_inst_mix_ipload", + "MetricThreshold": "tma_info_inst_mix_ipload < 3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per PAUSE (lower number means hi= gher occurrence rate)", + "MetricExpr": "tma_info_inst_mix_instructions / CPU_CLK_UNHALTED.P= AUSE_INST", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_inst_mix_ippause", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per Store (lower number means hi= gher occurrence rate)", + "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / MEM_INST_RETIRED.ALL_S= TORES", + "MetricGroup": "InsType", + "MetricName": "tma_info_inst_mix_ipstore", + "MetricThreshold": "tma_info_inst_mix_ipstore < 8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per Software prefetch instructio= n (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrenc= e rate)", + "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@SW_PREFETCH_A= CCESS.T0\\,umask\\=3D0xF@", + "MetricGroup": "Prefetches", + "MetricName": "tma_info_inst_mix_ipswpf", + "MetricThreshold": "tma_info_inst_mix_ipswpf < 100", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instruction per taken branch", + "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / BR_INST_RETIRED.NEAR_T= AKEN", + "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", + "MetricName": "tma_info_inst_mix_iptb", + 
"MetricThreshold": "tma_info_inst_mix_iptb < 13", + "PublicDescription": "Instruction per taken branch. Related metric= s: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tm= a_info_frontend_dsb_coverage, tma_lcp", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average per-core data fill bandwidth to the L= 1 data cache [GB / sec]", + "MetricExpr": "tma_info_memory_l1d_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_memory_core_l1d_cache_fill_bw_2t", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average per-core data fill bandwidth to the L= 2 cache [GB / sec]", + "MetricExpr": "tma_info_memory_l2_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_memory_core_l2_cache_fill_bw_2t", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average per-core data access bandwidth to the= L3 cache [GB / sec]", + "MetricExpr": "tma_info_memory_l3_cache_access_bw", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_memory_core_l3_cache_access_bw_2t", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average per-core data fill bandwidth to the L= 3 cache [GB / sec]", + "MetricExpr": "tma_info_memory_l3_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_memory_core_l3_cache_fill_bw_2t", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fill Buffer (FB) hits per kilo instructions f= or retired demand loads (L1D misses that merge into ongoing miss-handling e= ntries)", + "MetricExpr": "1e3 * cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / INST_RETI= RED.ANY", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_fb_hpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "", + "MetricExpr": "64 * cpu_core@L1D.REPLACEMENT@ / 1e9 / duration_tim= e", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_memory_l1d_cache_fill_bw", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L1 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * cpu_core@MEM_LOAD_RETIRED.L1_MISS@ / INST_RET= IRED.ANY", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l1mpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L1 cache true misses per kilo instruction for= all demand loads (including speculative)", + "MetricExpr": "1e3 * cpu_core@L2_RQSTS.ALL_DEMAND_DATA_RD@ / INST_= RETIRED.ANY", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l1mpki_load", + "Unit": "cpu_core" + }, + { + "BriefDescription": "", + "MetricExpr": "64 * cpu_core@L2_LINES_IN.ALL@ / 1e9 / duration_tim= e", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_memory_l2_cache_fill_bw", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache hits per kilo instruction for all re= quest types (including speculative)", + "MetricExpr": "1e3 * (cpu_core@L2_RQSTS.REFERENCES@ - cpu_core@L2_= RQSTS.MISS@) / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l2hpki_all", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache hits per kilo instruction for all de= mand loads (including speculative)", + "MetricExpr": "1e3 * cpu_core@L2_RQSTS.DEMAND_DATA_RD_HIT@ / INST_= RETIRED.ANY", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l2hpki_load", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * cpu_core@MEM_LOAD_RETIRED.L2_MISS@ / INST_RET= IRED.ANY", + "MetricGroup": "Backend;CacheHits;Mem", 
+ "MetricName": "tma_info_memory_l2mpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all request types (including speculative)", + "MetricExpr": "1e3 * cpu_core@L2_RQSTS.MISS@ / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem;Offcore", + "MetricName": "tma_info_memory_l2mpki_all", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all demand loads (including speculative)", + "MetricExpr": "1e3 * cpu_core@L2_RQSTS.DEMAND_DATA_RD_MISS@ / INST= _RETIRED.ANY", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l2mpki_load", + "Unit": "cpu_core" + }, + { + "BriefDescription": "", + "MetricExpr": "64 * cpu_core@OFFCORE_REQUESTS.ALL_REQUESTS@ / 1e9 = / duration_time", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_memory_l3_cache_access_bw", + "Unit": "cpu_core" + }, + { + "BriefDescription": "", + "MetricExpr": "64 * cpu_core@LONGEST_LAT_CACHE.MISS@ / 1e9 / durat= ion_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_memory_l3_cache_fill_bw", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L3 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * cpu_core@MEM_LOAD_RETIRED.L3_MISS@ / INST_RET= IRED.ANY", + "MetricGroup": "Mem", + "MetricName": "tma_info_memory_l3mpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average Parallel L2 cache miss data reads", + "MetricExpr": "cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DATA_RD@ / OF= FCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_memory_latency_data_l2_mlp", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average Latency for L2 cache miss demand Load= s", + "MetricExpr": "cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_R= D@ / OFFCORE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "Memory_Lat;Offcore", + "MetricName": "tma_info_memory_latency_load_l2_miss_latency", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average Parallel L2 cache miss demand Loads", + "MetricExpr": "cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_R= D@ / cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D1@", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_memory_latency_load_l2_mlp", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average Latency for L3 cache miss demand Load= s", + "MetricExpr": "cpu_core@OFFCORE_REQUESTS_OUTSTANDING.L3_MISS_DEMAN= D_DATA_RD@ / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", + "MetricGroup": "Memory_Lat;Offcore", + "MetricName": "tma_info_memory_latency_load_l3_miss_latency", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Actual Average Latency for L1 data-cache miss= demand load operations (in core cycles)", + "MetricExpr": "cpu_core@L1D_PEND_MISS.PENDING@ / MEM_LOAD_COMPLETE= D.L1_MISS_ANY", + "MetricGroup": "Mem;MemoryBound;MemoryLat", + "MetricName": "tma_info_memory_load_miss_real_latency", + "Unit": "cpu_core" + }, + { + "BriefDescription": "\"Bus lock\" per kilo instruction", + "MetricExpr": "1e3 * cpu_core@SQ_MISC.BUS_LOCK@ / INST_RETIRED.ANY= ", + "MetricGroup": "Mem", + "MetricName": "tma_info_memory_mix_bus_lock_pki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Un-cacheable retired load per kilo instructio= n", + "MetricExpr": "1e3 * cpu_core@MEM_LOAD_MISC_RETIRED.UC@ / INST_RET= IRED.ANY", + "MetricGroup": "Mem", + "MetricName": "tma_info_memory_mix_uc_load_pki", + "Unit": "cpu_core" + }, + 
{ + "BriefDescription": "Memory-Level-Parallelism (average number of L= 1 miss demand load when there is at least one such miss", + "MetricExpr": "cpu_core@L1D_PEND_MISS.PENDING@ / L1D_PEND_MISS.PEN= DING_CYCLES", + "MetricGroup": "Mem;MemoryBW;MemoryBound", + "MetricName": "tma_info_memory_mlp", + "PublicDescription": "Memory-Level-Parallelism (average number of = L1 miss demand load when there is at least one such miss. Per-Logical Proce= ssor)", + "Unit": "cpu_core" + }, + { + "BriefDescription": "STLB (2nd level TLB) code speculative misses = per kilo instruction (misses of any page-size that complete the page walk)", + "MetricExpr": "1e3 * cpu_core@ITLB_MISSES.WALK_COMPLETED@ / INST_R= ETIRED.ANY", + "MetricGroup": "Fed;MemoryTLB", + "MetricName": "tma_info_memory_tlb_code_stlb_mpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "STLB (2nd level TLB) data load speculative mi= sses per kilo instruction (misses of any page-size that complete the page w= alk)", + "MetricExpr": "1e3 * cpu_core@DTLB_LOAD_MISSES.WALK_COMPLETED@ / I= NST_RETIRED.ANY", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_load_stlb_mpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Utilization of the core's Page Walker(s) serv= ing STLB misses triggered by instruction/Load/Store accesses", + "MetricExpr": "(cpu_core@ITLB_MISSES.WALK_PENDING@ + cpu_core@DTLB= _LOAD_MISSES.WALK_PENDING@ + cpu_core@DTLB_STORE_MISSES.WALK_PENDING@) / (4= * tma_info_core_core_clks)", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_page_walks_utilization", + "MetricThreshold": "tma_info_memory_tlb_page_walks_utilization > 0= .5", + "Unit": "cpu_core" + }, + { + "BriefDescription": "STLB (2nd level TLB) data store speculative m= isses per kilo instruction (misses of any page-size that complete the page = walk)", + "MetricExpr": "1e3 * cpu_core@DTLB_STORE_MISSES.WALK_COMPLETED@ / = INST_RETIRED.ANY", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_store_stlb_mpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "", + "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / (cpu_core@UOPS_EXE= CUTED.CORE_CYCLES_GE_1@ / 2 if #SMT_on else cpu_core@UOPS_EXECUTED.THREAD\\= ,cmask\\=3D1@)", + "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", + "MetricName": "tma_info_pipeline_execute", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per a microcode Assist invocatio= n", + "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / ASSISTS.ANY", + "MetricGroup": "MicroSeq;Pipeline;Ret;Retire", + "MetricName": "tma_info_pipeline_ipassist", + "MetricThreshold": "tma_info_pipeline_ipassist < 100e3", + "PublicDescription": "Instructions per a microcode Assist invocati= on. 
See Assists tree node for details (lower number means higher occurrence= rate)", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", + "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu_core@UOP= S_RETIRED.SLOTS\\,cmask\\=3D1@", + "MetricGroup": "Pipeline;Ret", + "MetricName": "tma_info_pipeline_retire", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Estimated fraction of retirement-cycles deali= ng with repeat instructions", + "MetricExpr": "cpu_core@INST_RETIRED.REP_ITERATION@ / cpu_core@UOP= S_RETIRED.SLOTS\\,cmask\\=3D1@", + "MetricGroup": "MicroSeq;Pipeline;Ret", + "MetricName": "tma_info_pipeline_strings_cycles", + "MetricThreshold": "tma_info_pipeline_strings_cycles > 0.1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of cycles the processor is waiting y= et unhalted; covering legacy PAUSE instruction, as well as C0.1 / C0.2 powe= r-performance optimized states", + "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.C0_WAIT@ / tma_info_threa= d_clks", + "MetricGroup": "C0Wait", + "MetricName": "tma_info_system_c0_wait", + "MetricThreshold": "tma_info_system_c0_wait > 0.05", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / dur= ation_time", + "MetricGroup": "Power;Summary", + "MetricName": "tma_info_system_core_frequency", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average CPU Utilization (percentage)", + "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.REF_TSC@ / TSC", + "MetricGroup": "HPC;Summary", + "MetricName": "tma_info_system_cpu_utilization", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average number of utilized CPUs", + "MetricExpr": "#num_cpus_online * tma_info_system_cpu_utilization", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_cpus_utilized", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", + "MetricExpr": "64 * (UNC_HAC_ARB_TRK_REQUESTS.ALL + UNC_HAC_ARB_CO= H_TRK_REQUESTS.ALL) / 1e9 / duration_time", + "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW", + "MetricName": "tma_info_system_dram_bw_use", + "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_fb_full, tma_info_bottlenec= k_cache_memory_bandwidth, tma_mem_bandwidth, tma_sq_full", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Giga Floating Point Operations Per Second", + "MetricExpr": "(cpu_core@FP_ARITH_INST_RETIRED.SCALAR@ + 2 * cpu_c= ore@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE@ + 4 * cpu_core@FP_ARITH_INST_= RETIRED.4_FLOPS@ + 8 * cpu_core@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE@) = / 1e9 / duration_time", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_system_gflops", + "PublicDescription": "Giga Floating Point Operations Per Second. 
A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per Far Branch ( Far Branches ap= ply upon transition from application to operating system, handling interrup= ts, exceptions) [lower number means higher occurrence rate]", + "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_INST_RETIR= ED.FAR_BRANCH@u", + "MetricGroup": "Branches;OS", + "MetricName": "tma_info_system_ipfarbranch", + "MetricThreshold": "tma_info_system_ipfarbranch < 1e6", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / cpu_core@INST_RETIRED= .ANY_P@k", + "MetricGroup": "OS", + "MetricName": "tma_info_system_kernel_cpi", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THRE= AD", + "MetricGroup": "OS", + "MetricName": "tma_info_system_kernel_utilization", + "MetricThreshold": "tma_info_system_kernel_utilization > 0.05", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average number of parallel data read requests= to external memory", + "MetricExpr": "UNC_ARB_DAT_OCCUPANCY.RD / UNC_ARB_DAT_OCCUPANCY.RD= @cmask\\=3D1@", + "MetricGroup": "Mem;MemoryBW;SoC", + "MetricName": "tma_info_system_mem_parallel_reads", + "PublicDescription": "Average number of parallel data read request= s to external memory. Accounts for demand loads and L1/L2 prefetches", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", + "MetricExpr": "(1 - cpu_core@CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE@ /= cpu_core@CPU_CLK_UNHALTED.REF_DISTRIBUTED@ if #SMT_on else 0)", + "MetricGroup": "SMT", + "MetricName": "tma_info_system_smt_2t_utilization", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Socket actual clocks when any core is active = on that socket", + "MetricExpr": "UNC_CLOCK.SOCKET", + "MetricGroup": "SoC", + "MetricName": "tma_info_system_socket_clks", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", + "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", + "MetricGroup": "Power", + "MetricName": "tma_info_system_turbo_utilization", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", + "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.THREAD@", + "MetricGroup": "Pipeline", + "MetricName": "tma_info_thread_clks", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", + "MetricExpr": "1 / tma_info_thread_ipc", + "MetricGroup": "Mem;Pipeline", + "MetricName": "tma_info_thread_cpi", + "Unit": "cpu_core" + }, + { + "BriefDescription": "The ratio of Executed- by Issued-Uops", + "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / UOPS_ISSUED.ANY", + "MetricGroup": "Cor;Pipeline", + "MetricName": "tma_info_thread_execute_per_issue", + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. 
Ratio < 1 suggest high rate o= f \"execute\" at rename stage.", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", + "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / tma_info_thread_clks", + "MetricGroup": "Ret;Summary", + "MetricName": "tma_info_thread_ipc", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total issue-pipeline slots (per-Physical Core= till ICL; per-Logical Processor ICL onward)", + "MetricExpr": "cpu_core@TOPDOWN.SLOTS@", + "MetricGroup": "TmaL1;tma_L1_group", + "MetricName": "tma_info_thread_slots", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of Physical Core issue-slots utilize= d by this Logical Processor", + "MetricExpr": "(tma_info_thread_slots / (cpu_core@TOPDOWN.SLOTS@ /= 2) if #SMT_on else 1)", + "MetricGroup": "SMT;TmaL1;tma_L1_group", + "MetricName": "tma_info_thread_slots_utilization", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "tma_retiring * tma_info_thread_slots / INST_RETIRED= .ANY", + "MetricGroup": "Pipeline;Ret;Retire", + "MetricName": "tma_info_thread_uoppi", + "MetricThreshold": "tma_info_thread_uoppi > 1.05", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instruction per taken branch", + "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETI= RED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW", + "MetricName": "tma_info_thread_uptb", + "MetricThreshold": "tma_info_thread_uptb < 9", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents overall Integer (Int) = select operations fraction the CPU has executed (retired)", + "MetricExpr": "tma_int_vector_128b + tma_int_vector_256b", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", + "MetricName": "tma_int_operations", + "MetricThreshold": "tma_int_operations > 0.1 & tma_light_operation= s > 0.6", + "PublicDescription": "This metric represents overall Integer (Int)= select operations fraction the CPU has executed (retired). Vector/Matrix I= nt operations and shuffles are counted. Note this metric's value may exceed= its parent due to use of \"Uops\" CountDomain.", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents 128-bit vector Integer= ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the= CPU has retired", + "MetricExpr": "(cpu_core@INT_VEC_RETIRED.ADD_128@ + cpu_core@INT_V= EC_RETIRED.VNNI_128@) / (tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;= tma_int_operations_group;tma_issue2P", + "MetricName": "tma_int_vector_128b", + "MetricThreshold": "tma_int_vector_128b > 0.1 & (tma_int_operation= s > 0.1 & tma_light_operations > 0.6)", + "PublicDescription": "This metric represents 128-bit vector Intege= r ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction th= e CPU has retired. 
Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_ve= ctor_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_256b, tma= _port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents 256-bit vector Integer= ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction= the CPU has retired", + "MetricExpr": "(cpu_core@INT_VEC_RETIRED.ADD_256@ + cpu_core@INT_V= EC_RETIRED.MUL_256@ + cpu_core@INT_VEC_RETIRED.VNNI_256@) / (tma_retiring *= tma_info_thread_slots)", + "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;= tma_int_operations_group;tma_issue2P", + "MetricName": "tma_int_vector_256b", + "MetricThreshold": "tma_int_vector_256b > 0.1 & (tma_int_operation= s > 0.1 & tma_light_operations > 0.6)", + "PublicDescription": "This metric represents 256-bit vector Intege= r ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fractio= n the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_f= p_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b,= tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Instruction TLB (ITLB) misses", + "MetricExpr": "cpu_core@ICACHE_TAG.STALLS@ / tma_info_thread_clks", + "MetricGroup": "BigFootprint;FetchLat;MemoryTLB;TopdownL3;tma_L3_g= roup;tma_fetch_latency_group", + "MetricName": "tma_itlb_misses", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 data cache", + "MetricExpr": "max((cpu_core@EXE_ACTIVITY.BOUND_ON_LOADS@ - cpu_co= re@MEMORY_ACTIVITY.STALLS_L1D_MISS@) / tma_info_thread_clks, 0)", + "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", + "MetricName": "tma_l1_bound", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 data cache. The L1 data cache typical= ly has the shortest latency. However; in certain cases like loads blocked = on older stores; a load might suffer due to high latency even though it is = being satisfied by the L1. Another example is loads who miss in the TLB. Th= ese cases are characterized by execution unit stalls; while some non-comple= ted demand load lives in the machine without having that demand load missin= g the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB= _HIT_PS. 
Related metrics: tma_clears_resteers, tma_machine_clears, tma_micr= ocode_sequencer, tma_ms_switches, tma_ports_utilized_1", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates how often the CPU was s= talled due to L2 cache accesses by loads", + "MetricExpr": "(cpu_core@MEMORY_ACTIVITY.STALLS_L1D_MISS@ - cpu_co= re@MEMORY_ACTIVITY.STALLS_L2_MISS@) / tma_info_thread_clks", + "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", + "MetricName": "tma_l2_bound", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_RETIRED.L2_HIT_PS", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates how often the CPU was s= talled due to loads accesses to L3 cache or contended with a sibling Core", + "MetricExpr": "(cpu_core@MEMORY_ACTIVITY.STALLS_L2_MISS@ - cpu_cor= e@MEMORY_ACTIVITY.STALLS_L3_MISS@) / tma_info_thread_clks", + "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", + "MetricName": "tma_l3_bound", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", + "MetricExpr": "MEM_LOAD_RETIRED.L3_HIT * min(MEM_LOAD_RETIRED.L3_H= IT:R, 9 * tma_info_system_core_frequency) * (1 + cpu_core@MEM_LOAD_RETIRED.= FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_MISS@ / 2) / tma_info_thread_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_= l3_bound_group", + "MetricName": "tma_l3_hit_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tm= a_info_bottleneck_cache_memory_latency, tma_mem_latency", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU= was stalled due to Length Changing Prefixes (LCPs)", + "MetricExpr": "cpu_core@DECODE.LCP@ / tma_info_thread_clks", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", + "MetricName": "tma_lcp", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). 
Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, t= ma_info_inst_mix_iptb", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", + "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", + "MetricName": "tma_light_operations", + "MetricThreshold": "tma_light_operations > 0.6", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIS= T", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port for Load operations", + "MetricExpr": "cpu_core@UOPS_DISPATCHED.PORT_2_3_10@ / (3 * tma_in= fo_core_core_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", + "MetricName": "tma_load_op_utilization", + "MetricThreshold": "tma_load_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port for Load operations. 
Sample with: = UOPS_DISPATCHED.PORT_2_3_10", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) DTLB was missed by load accesses, that late= r on hit in second-level TLB (STLB)", + "MetricExpr": "max(0, tma_dtlb_load - tma_load_stlb_miss)", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", + "MetricName": "tma_load_stlb_hit", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.= 1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by load accesses, performing a= hardware page walk", + "MetricExpr": "cpu_core@DTLB_LOAD_MISSES.WALK_ACTIVE@ / tma_info_t= hread_clks", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", + "MetricName": "tma_load_stlb_miss", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0= .1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.= 2)))", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", + "MetricExpr": "MEM_INST_RETIRED.LOCK_LOADS * MEM_INST_RETIRED.LOCK= _LOADS:R / tma_info_thread_clks", + "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1= _bound_group", + "MetricName": "tma_lock_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS_PS. Related metrics: tma_store_latency", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to LSD (Loop Stream Detector) unit", + "MetricExpr": "(cpu_core@LSD.CYCLES_ACTIVE@ - cpu_core@LSD.CYCLES_= OK@) / tma_info_core_core_clks / 2", + "MetricGroup": "FetchBW;LSD;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", + "MetricName": "tma_lsd", + "MetricThreshold": "tma_lsd > 0.15 & tma_fetch_bandwidth > 0.2", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure.", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Machine Clears", + "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts= )", + "MetricGroup": "BadSpec;MachineClears;TmaL2;TopdownL2;tma_L2_group= ;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn", + "MetricName": "tma_machine_clears", + "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. 
These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sh= aring, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_c= ache", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", + "MetricExpr": "min(cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFF= CORE_REQUESTS_OUTSTANDING.DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_b= ound_group;tma_issueBW", + "MetricName": "tma_mem_bandwidth", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_fb_full, tma_info_bottleneck_cache= _memory_bandwidth, tma_info_system_dram_bw_use, tma_sq_full", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of cycles wher= e the performance was likely hurt due to latency from external memory - DRA= M ([SPR-HBM] and/or HBM)", + "MetricExpr": "min(cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFF= CORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD@) / tma_info_thread_clks - tm= a_mem_bandwidth", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_= bound_group;tma_issueLat", + "MetricName": "tma_mem_latency", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). 
Related metrics: tma_info_bottleneck_cache_memory_latency, tma_l3_hit_la= tency", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots the = Memory subsystem within the Backend was a bottleneck", + "MetricExpr": "cpu_core@topdown\\-mem\\-bound@ / (cpu_core@topdown= \\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiri= ng@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", + "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_b= ound_group", + "MetricName": "tma_memory_bound", + "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions.", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "13 * cpu_core@MISC2_RETIRED.LFENCE@ / tma_info_thre= ad_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", + "MetricName": "tma_memory_fence", + "MetricThreshold": "tma_memory_fence > 0.05 & (tma_serializing_ope= ration > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", + "MetricExpr": "tma_light_operations * cpu_core@MEM_UOP_RETIRED.ANY= @ / (tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", + "MetricName": "tma_memory_operations", + "MetricThreshold": "tma_memory_operations > 0.1 & tma_light_operat= ions > 0.6", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", + "MetricExpr": "cpu_core@UOPS_RETIRED.MS@ / tma_info_thread_slots", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operatio= ns_group;tma_issueMC;tma_issueMS", + "MetricName": "tma_microcode_sequencer", + "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_ope= rations > 0.1", + "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: UOPS_RETIRED.MS. 
Related metrics: tma_clears_reste= ers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_clea= rs, tma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers as a result of Branch Misprediction= at execution stage", + "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * cpu_= core@INT_MISC.CLEAR_RESTEER_CYCLES@ / tma_info_thread_clks", + "MetricGroup": "BadSpec;BrMispredicts;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueBM", + "MetricName": "tma_mispredicts_resteers", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost= , tma_info_bottleneck_mispredictions", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the MITE pipeline (the legacy deco= de pipeline)", + "MetricExpr": "(cpu_core@IDQ.MITE_CYCLES_ANY@ - cpu_core@IDQ.MITE_= CYCLES_OK@) / tma_info_core_core_clks / 2", + "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_b= andwidth_group", + "MetricName": "tma_mite", + "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to the MITE pipeline (the legacy dec= ode pipeline). This pipeline is used for code that was not pre-cached in th= e DSB or LSD. For example; inefficiencies due to asymmetric decoders; use o= f long immediate or LCP can manifest as MITE fetch bandwidth bottleneck. Sa= mple with: FRONTEND_RETIRED.ANY_DSB_MISS", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued -- the Count D= omain; [ADL+] cycles)", + "MetricExpr": "160 * cpu_core@ASSISTS.SSE_AVX_MIX@ / tma_info_thre= ad_clks", + "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", + "MetricName": "tma_mixing_vectors", + "MetricThreshold": "tma_mixing_vectors > 0.05", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued -- the Count = Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investiga= ting. Read more in Appendix B1 of the Optimizations Guide for this topic. 
R= elated metrics: tma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = when the CPU was stalled due to switches of uop delivery to the Microcode S= equencer (MS)", + "MetricExpr": "3 * cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D1\\,edge@ = / (cpu_core@UOPS_RETIRED.SLOTS@ / cpu_core@UOPS_ISSUED.ANY@) / tma_info_thr= ead_clks", + "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", + "MetricName": "tma_ms_switches", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: FRONTEND_RETIRED.MS_FLOWS. Related metrics: tm= a_clears_resteers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tm= a_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializ= ing_operation", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring branch instructions that were not fused", + "MetricExpr": "tma_light_operations * (cpu_core@BR_INST_RETIRED.AL= L_BRANCHES@ - cpu_core@INST_RETIRED.MACRO_FUSED@) / (tma_retiring * tma_inf= o_thread_slots)", + "MetricGroup": "Branches;Pipeline;TopdownL3;tma_L3_group;tma_light= _operations_group", + "MetricName": "tma_non_fused_branches", + "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_opera= tions > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused.", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring NOP (no op) instructions", + "MetricExpr": "tma_light_operations * cpu_core@INST_RETIRED.NOP@ /= (tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Pipeline;TopdownL4;tma_L4_group;tma_other_light_op= s_group", + "MetricName": "tma_nop_instructions", + "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_= ops > 0.3 & tma_light_operations > 0.6)", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. 
Sample with: INST_RETIRED.NOP", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents the remaining light uo= ps fraction the CPU has executed - remaining means not covered by other sib= ling nodes", + "MetricExpr": "max(0, tma_light_operations - (tma_fp_arith + tma_i= nt_operations + tma_memory_operations + tma_fused_instructions + tma_non_fu= sed_branches))", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", + "MetricName": "tma_other_light_ops", + "MetricThreshold": "tma_other_light_ops > 0.3 & tma_light_operatio= ns > 0.6", + "PublicDescription": "This metric represents the remaining light u= ops fraction the CPU has executed - remaining means not covered by other si= bling nodes. May undercount due to FMA double counting", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types).", + "MetricExpr": "max(tma_branch_mispredicts * (1 - cpu_core@BR_MISP_= RETIRED.ALL_BRANCHES@ / (cpu_core@INT_MISC.CLEARS_COUNT@ - cpu_core@MACHINE= _CLEARS.COUNT@)), 0.0001)", + "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", + "MetricName": "tma_other_mispredicts", + "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mis= predicts > 0.1 & tma_bad_speculation > 0.15)", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= .", + "MetricExpr": "max(tma_machine_clears * (1 - cpu_core@MACHINE_CLEA= RS.MEMORY_ORDERING@ / cpu_core@MACHINE_CLEARS.COUNT@), 0.0001)", + "MetricGroup": "Machine_Clears;TopdownL3;tma_L3_group;tma_machine_= clears_group", + "MetricName": "tma_other_nukes", + "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears >= 0.1 & tma_bad_speculation > 0.15)", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric roughly estimates fraction of slo= ts the CPU retired uops as a result of handing Page Faults", + "MetricExpr": "99 * cpu_core@ASSISTS.PAGE_FAULT@ / tma_info_thread= _slots", + "MetricGroup": "TopdownL5;tma_L5_group;tma_assists_group", + "MetricName": "tma_page_faults", + "MetricThreshold": "tma_page_faults > 0.05", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Page Faults. A Page Fault m= ay apply on first application access to a memory page. Note operating syste= m handling of page faults accounts for the majority of its cost.", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd b= ranch)", + "MetricExpr": "cpu_core@UOPS_DISPATCHED.PORT_0@ / tma_info_core_co= re_clks", + "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group;tma_issue2P", + "MetricName": "tma_port_0", + "MetricThreshold": "tma_port_0 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. 
Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512= b, tma_int_vector_128b, tma_int_vector_256b, tma_port_1, tma_port_5, tma_po= rt_6, tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 1 (ALU)", + "MetricExpr": "cpu_core@UOPS_DISPATCHED.PORT_1@ / tma_info_core_co= re_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", + "MetricName": "tma_port_1", + "MetricThreshold": "tma_port_1 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_ve= ctor_256b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simple= ALU)", + "MetricExpr": "cpu_core@UOPS_DISPATCHED.PORT_6@ / tma_info_core_co= re_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", + "MetricName": "tma_port_6", + "MetricThreshold": "tma_port_6 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_6. Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_por= t_5, tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU performance was potentially limited due to Core computation issues (non= divider-related)", + "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (cp= u_core@EXE_ACTIVITY.1_PORTS_UTIL@ + tma_retiring * cpu_core@EXE_ACTIVITY.2_= PORTS_UTIL\\,umask\\=3D0xc@)) / tma_info_thread_clks if cpu_core@ARITH.DIV_= ACTIVE@ < cpu_core@CYCLE_ACTIVITY.STALLS_TOTAL@ - cpu_core@EXE_ACTIVITY.BOU= ND_ON_LOADS@ else (cpu_core@EXE_ACTIVITY.1_PORTS_UTIL@ + tma_retiring * cpu= _core@EXE_ACTIVITY.2_PORTS_UTIL\\,umask\\=3D0xc@) / tma_info_thread_clks)", + "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", + "MetricName": "tma_ports_utilization", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. 
For example; when there are too many multiply operations.", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU= executed no uops on any execution port (Logical Processor cycles since ICL= , Physical Core cycles otherwise)", + "MetricExpr": "max((cpu_core@EXE_ACTIVITY.3_PORTS_UTIL\\,umask\\= =3D0x80@ + cpu_core@RS.EMPTY_RESOURCE@) / tma_info_thread_clks, 1) * (cpu_c= ore@CYCLE_ACTIVITY.STALLS_TOTAL@ - cpu_core@EXE_ACTIVITY.BOUND_ON_LOADS@) /= tma_info_thread_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", + "MetricName": "tma_ports_utilized_0", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the CPU executed total of 1 uop per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "cpu_core@EXE_ACTIVITY.1_PORTS_UTIL@ / tma_info_thre= ad_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", + "MetricName": "tma_ports_utilized_1", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related m= etrics: tma_l1_bound", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 2 uops per cycle on all execution ports (Logical Process= or cycles since ICL, Physical Core cycles otherwise)", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "cpu_core@EXE_ACTIVITY.2_PORTS_UTIL@ / tma_info_thre= ad_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", + "MetricName": "tma_ports_utilized_2", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. 
Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "cpu_core@UOPS_EXECUTED.CYCLES_GE_3@ / tma_info_thre= ad_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", + "MetricName": "tma_ports_utilized_3m", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). Sample with:= UOPS_EXECUTED.CYCLES_GE_3", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. issued uops that eventually get retired", + "MetricExpr": "cpu_core@topdown\\-retiring@ / (cpu_core@topdown\\-= fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@= + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_retiring", + "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots u= tilized by useful work i.e. issued uops that eventually get retired. Ideall= y; all pipeline slots would be attributed to the Retiring category. Retiri= ng of 100% would indicate the maximum Pipeline_Width throughput was achieve= d. Maximizing Retiring typically increases the Instructions-per-cycle (see= IPC metric). Note that a high Retiring value does not necessary mean there= is no room for more performance. For example; Heavy-operations or Microco= de Assists are categorized under Retiring. They often indicate suboptimal p= erformance and can often be optimized or avoided. Sample with: UOPS_RETIRED= .SLOTS", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU issue-pipeline was stalled due to serializing operations", + "MetricExpr": "cpu_core@RESOURCE_STALLS.SCOREBOARD@ / tma_info_thr= ead_clks + tma_c02_wait", + "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup;tma_issueSO", + "MetricName": "tma_serializing_operation", + "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bo= und > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. 
Related metri= cs: tma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring Shuffle operations of 256-bit vector size (FP or Int= eger)", + "MetricExpr": "tma_light_operations * cpu_core@INT_VEC_RETIRED.SHU= FFLES@ / (tma_retiring * tma_info_thread_slots)", + "MetricGroup": "HPC;Pipeline;TopdownL4;tma_L4_group;tma_other_ligh= t_ops_group", + "MetricName": "tma_shuffles_256b", + "MetricThreshold": "tma_shuffles_256b > 0.1 & (tma_other_light_ops= > 0.3 & tma_light_operations > 0.6)", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). Shuffles may incur slow cross \"vector lane\" data transfers.", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to PAUSE Instructions", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.PAUSE@ / tma_info_thread_= clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", + "MetricName": "tma_slow_pause", + "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_opera= tion > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. Sample with: CPU_CLK_UNHALTED.= PAUSE_INST", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of cycles hand= ling memory load split accesses - load that cross 64-byte cache line bounda= ry", + "MetricExpr": "MEM_INST_RETIRED.SPLIT_LOADS * min(MEM_INST_RETIRED= .SPLIT_LOADS:R, tma_info_memory_load_miss_real_latency) / tma_info_thread_c= lks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_split_loads", + "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents rate of split store ac= cesses", + "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES * min(MEM_INST_RETIRE= D.SPLIT_STORES:R, 1) / tma_info_thread_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", + "MetricName": "tma_split_stores", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. 
Related metrics: tma_port_= 4", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", + "MetricExpr": "(cpu_core@XQ.FULL_CYCLES@ + cpu_core@L1D_PEND_MISS.= L2_STALLS@) / tma_info_thread_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueB= W;tma_l3_bound_group", + "MetricName": "tma_sq_full", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_fb_full= , tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, = tma_mem_bandwidth", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates how often CPU was stall= ed due to RFO store memory accesses; RFO store issue a read-for-ownership = request before the write", + "MetricExpr": "cpu_core@EXE_ACTIVITY.BOUND_ON_STORES@ / tma_info_t= hread_clks", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", + "MetricName": "tma_store_bound", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric roughly estimates fraction of cyc= les when the memory subsystem had loads blocked since they could not forwar= d data from earlier (in program order) overlapping stores", + "MetricExpr": "13 * cpu_core@LD_BLOCKS.STORE_FORWARD@ / tma_info_t= hread_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_store_fwd_blk", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. 
For example; when the prior st= ore is writing a smaller region than the load is reading.", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", + "MetricExpr": "(cpu_core@MEM_STORE_RETIRED.L2_HIT@ * 10 * (1 - cpu= _core@MEM_INST_RETIRED.LOCK_LOADS@ / cpu_core@MEM_INST_RETIRED.ALL_STORES@)= + (1 - cpu_core@MEM_INST_RETIRED.LOCK_LOADS@ / cpu_core@MEM_INST_RETIRED.A= LL_STORES@) * min(cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFFCORE_REQUE= STS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO@)) / tma_info_thread_clks", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issue= RFO;tma_issueSL;tma_store_bound_group", + "MetricName": "tma_store_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port for Store operations", + "MetricExpr": "(cpu_core@UOPS_DISPATCHED.PORT_4_9@ + cpu_core@UOPS= _DISPATCHED.PORT_7_8@) / (4 * tma_info_core_core_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", + "MetricName": "tma_store_op_utilization", + "MetricThreshold": "tma_store_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port for Store operations. 
Sample with:= UOPS_DISPATCHED.PORT_7_8", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the TLB was missed by store accesses, hitting in the second-l= evel TLB (STLB)", + "MetricExpr": "max(0, tma_dtlb_store - tma_store_stlb_miss)", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", + "MetricName": "tma_store_stlb_hit", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > = 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound= > 0.2)))", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = where the STLB was missed by store accesses, performing a hardware page wal= k", + "MetricExpr": "cpu_core@DTLB_STORE_MISSES.WALK_ACTIVE@ / tma_info_= core_core_clks", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", + "MetricName": "tma_store_stlb_miss", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store >= 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_boun= d > 0.2)))", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates how often CPU was stall= ed due to Streaming store memory accesses; Streaming store optimize out a = read request required by RFO stores", + "MetricExpr": "9 * cpu_core@OCR.STREAMING_WR.ANY_RESPONSE@ / tma_i= nfo_thread_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueS= mSt;tma_store_bound_group", + "MetricName": "tma_streaming_stores", + "MetricThreshold": "tma_streaming_stores > 0.2 & (tma_store_bound = > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates how often CPU was stal= led due to Streaming store memory accesses; Streaming store optimize out a= read request required by RFO stores. Even though store accesses do not typ= ically stall out-of-order CPUs; there are few cases where stores can lead t= o actual stalls. This metric will be flagged should Streaming stores be a b= ottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma= _fb_full", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to new branch address clears", + "MetricExpr": "cpu_core@INT_MISC.UNKNOWN_BRANCH_CYCLES@ / tma_info= _thread_clks", + "MetricGroup": "BigFootprint;FetchLat;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group", + "MetricName": "tma_unknown_branches", + "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. 
Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric serves as an approximation of leg= acy x87 usage", + "MetricExpr": "tma_retiring * cpu_core@UOPS_EXECUTED.X87@ / UOPS_E= XECUTED.THREAD", + "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", + "MetricName": "tma_x87_use", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint.", + "ScaleUnit": "100%", + "Unit": "cpu_core" + } +] --=20 2.43.0