From: Daniel Bristot de Oliveira
To: linux-kernel@vger.kernel.org, Ingo Molnar, Peter Zijlstra
Cc: Daniel Bristot de Oliveira, Juri Lelli, Vincent Guittot,
 Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
 Daniel Bristot de Oliveira, Valentin Schneider, Joe Mario
Subject: [PATCH] sched/idle: Make idle poll dynamic per-cpu
Date: Thu, 12 Jan 2023 17:24:26 +0100
Message-Id: <20230112162426.217522-1-bristot@kernel.org>

idle=poll is frequently used on ultra-low-latency systems. Examples of
such systems are high-performance trading and 5G NVRAM. The performance
gain comes from avoiding the idle driver machinery and from keeping the
CPU always in an active state, which avoids (odd) hardware heuristics
that are out of the control of the OS.

Currently, idle=poll is an all-or-nothing static option defined at boot
time. There are two motivations for making this option dynamic and
per-cpu:

  1) Reduce the power usage/heat by allowing only selected CPUs to
     do idle polling;
  2) Allow multi-tenant systems (e.g., Kubernetes) to enable idle
     polling only when ultra-low-latency applications are present
     on specific CPUs.

Joe Mario did some experiments with this option enabled, and the results
were significant. For example, by using dynamic idle polling on
selected CPUs, cyclictest performance is optimal (like when using
idle=poll), but cpu power consumption drops from 381 to 233 watts.

Also, limiting idle=poll to the set of CPUs that benefit from it allows
other CPUs to benefit from frequency boosts. Joe also showed that the
improvement can be on the order of 80 nsec per round trip when
system-wide idle=poll was not used.

The user can enable idle polling with this command:
  # echo 1 > /sys/devices/system/cpu/cpu{CPU_ID}/idle_poll

And disable it via:
  # echo 0 > /sys/devices/system/cpu/cpu{CPU_ID}/idle_poll

By default, all CPUs have idle polling disabled (the current behavior).
A static key avoids the CPU mask check overhead when no idle polling
is enabled.

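The same toggle can be driven from a program instead of a shell. What
follows is only a minimal userspace sketch, not part of the patch: the
helper name set_idle_poll() and its sysfs_root parameter are
hypothetical, and the idle_poll file exists only on a kernel that
carries this change. The root directory is a parameter so the helper
can be tried against a scratch directory.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not part of the patch): write "1" or "0" to
 * <sysfs_root>/cpu<cpu>/idle_poll. On a patched kernel sysfs_root
 * would be "/sys/devices/system/cpu".
 *
 * Returns 0 on success, -1 on any error.
 */
static int set_idle_poll(const char *sysfs_root, int cpu, int enable)
{
	char path[256];
	FILE *f;
	int n;

	n = snprintf(path, sizeof(path), "%s/cpu%d/idle_poll", sysfs_root, cpu);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;	/* path did not fit in the buffer */

	f = fopen(path, "w");
	if (!f)
		return -1;	/* no such CPU, or attribute not present */

	if (fputs(enable ? "1\n" : "0\n", f) == EOF) {
		fclose(f);
		return -1;
	}

	/* Buffered write errors are only reported when the stream is flushed. */
	return fclose(f) == 0 ? 0 : -1;
}
```

Assuming a patched kernel, set_idle_poll("/sys/devices/system/cpu", 3, 1)
would correspond to the "echo 1" command above for CPU 3, and passing 0
as the last argument to the "echo 0" command.
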
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Mel Gorman
Cc: Daniel Bristot de Oliveira
Cc: Valentin Schneider
Cc: Joe Mario
Signed-off-by: Daniel Bristot de Oliveira
---
 kernel/sched/idle.c | 97 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 93 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index f26ab2675f7d..c6ef1322d549 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -10,6 +10,91 @@
 /* Linker adds these: start and end of __cpuidle functions */
 extern char __cpuidle_text_start[], __cpuidle_text_end[];
 
+/*
+ * per-cpu idle polling selector.
+ */
+static struct cpumask cpu_poll_mask;
+DEFINE_STATIC_KEY_FALSE(cpu_poll_enabled);
+
+/*
+ * Protects the mask/static key relation.
+ */
+DEFINE_MUTEX(cpu_poll_mutex);
+
+static ssize_t idle_poll_store(struct device *dev, struct device_attribute *attr,
+			       const char *buf, size_t count)
+{
+	int cpu = dev->id;
+	int retval, set;
+	bool val;
+
+	retval = kstrtobool(buf, &val);
+	if (retval)
+		return retval;
+
+	mutex_lock(&cpu_poll_mutex);
+
+	if (val) {
+		set = cpumask_test_and_set_cpu(cpu, &cpu_poll_mask);
+
+		/*
+		 * If the CPU was already on, do not increase the static key usage.
+		 */
+		if (!set)
+			static_branch_inc(&cpu_poll_enabled);
+	} else {
+		set = cpumask_test_and_clear_cpu(cpu, &cpu_poll_mask);
+
+		/*
+		 * If the CPU was already off, do not decrease the static key usage.
+		 */
+		if (set)
+			static_branch_dec(&cpu_poll_enabled);
+	}
+
+	mutex_unlock(&cpu_poll_mutex);
+
+	return count;
+}
+
+static ssize_t idle_poll_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%d\n", cpumask_test_cpu(dev->id, &cpu_poll_mask));
+}
+
+static DEVICE_ATTR_RW(idle_poll);
+
+static const struct attribute *idle_poll_attrs[] = {
+	&dev_attr_idle_poll.attr,
+	NULL
+};
+
+static int __init idle_poll_sysfs_init(void)
+{
+	int cpu, retval;
+
+	for_each_possible_cpu(cpu) {
+		struct device *dev = get_cpu_device(cpu);
+
+		if (!dev)
+			continue;
+		retval = sysfs_create_files(&dev->kobj, idle_poll_attrs);
+		if (retval)
+			return retval;
+	}
+
+	return 0;
+}
+device_initcall(idle_poll_sysfs_init);
+
+static int is_cpu_idle_poll(int cpu)
+{
+	if (static_branch_unlikely(&cpu_poll_enabled))
+		return cpumask_test_cpu(cpu, &cpu_poll_mask);
+
+	return 0;
+}
+
 /**
  * sched_idle_set_state - Record idle state for the current CPU.
  * @idle_state: State to record.
@@ -51,18 +136,21 @@ __setup("hlt", cpu_idle_nopoll_setup);
 
 static noinline int __cpuidle cpu_idle_poll(void)
 {
-	trace_cpu_idle(0, smp_processor_id());
+	int cpu = smp_processor_id();
+
+	trace_cpu_idle(0, cpu);
 	stop_critical_timings();
 	ct_idle_enter();
 	local_irq_enable();
 
 	while (!tif_need_resched() &&
-	       (cpu_idle_force_poll || tick_check_broadcast_expired()))
+	       (cpu_idle_force_poll || tick_check_broadcast_expired()
+		|| is_cpu_idle_poll(cpu)))
 		cpu_relax();
 
 	ct_idle_exit();
 	start_critical_timings();
-	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
+	trace_cpu_idle(PWR_EVENT_EXIT, cpu);
 
 	return 1;
 }
@@ -296,7 +384,8 @@ static void do_idle(void)
 		 * broadcast device expired for us, we don't want to go deep
 		 * idle as we know that the IPI is going to arrive right away.
 		 */
-		if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
+		if (cpu_idle_force_poll || tick_check_broadcast_expired()
+		    || is_cpu_idle_poll(cpu)) {
 			tick_nohz_idle_restart_tick();
 			cpu_idle_poll();
 		} else {
-- 
2.38.1