From nobody Sun Feb 8 00:34:52 2026 Received: from mail-ed1-f74.google.com (mail-ed1-f74.google.com [209.85.208.74]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6CE0F34DB45 for ; Wed, 12 Nov 2025 19:24:25 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.208.74 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1762975468; cv=none; b=frRIUkcIBF6CgZvbSzBxTXlYX+tm+UarTrFu7Gl1p639Y/wFkvrDoK6Y0wg9zog9vkSLLnCVpjIWGBSIm9kG6yc0m99aiurvxsMaMFVNOF4GzgfNnlkPVJNl2lSViEeJdvmfjkbjzTBr7sXPxvg92zlM/1guUH700IiGGUf4Rjg= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1762975468; c=relaxed/simple; bh=ld2fsmdlt2T8HxwLNOlU6qnNXPyRkodM5rwY68LS6VY=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=cKp0TojMwl6LSGUow848N0SvRCZAcWpCEKxQbw2isnRzx4pWzMOdP2efaQEX8X+w22q89eID9Mo1mUo1s5aLPlYiHD6/+AvRCH2w+QbT+jwum9U1AeB2c9hkeXuVqSM5uqzuT8ua0iOClg2UWcszpACn5ZTAYNZZz/gJ1V3PPgw= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--lrizzo.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=QT224pSc; arc=none smtp.client-ip=209.85.208.74 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--lrizzo.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="QT224pSc" Received: by mail-ed1-f74.google.com with SMTP id 4fb4d7f45d1cf-6417b2fae83so7787a12.1 for ; Wed, 12 Nov 2025 11:24:25 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1762975464; x=1763580264; 
darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=9RboGJiDgx522e/0FZwsBBrKlE2sz8rmng8iRZf/blE=; b=QT224pScIFR88bQmg7aQrgi3AlC9h/oTvQdUNyM8rADBrAmBOYnfMlQOleNh4Unzch GQTJ0LL6lHjdIcI3NP0BW+zaYDtEfSNOruNTpeoGg4iaffKUNmZUVS5VjAZMAACb77CD HvkPbdiLFRdGQfWaHa06WbHIcxeSX3C33j7FRdkp11ETmQgsHAJWfVJM9sb++rD2hE27 3B2SgAJMXpL6QumY939W9SGtVH+GUcj81fc7s4aqZNmYt5Xhx9Tu9hUWcub1c2GsXI+9 qJKd26LrLNCo2ZOYcNQgXwYawch0V9TkMiiVsZu/TNAc2e2ECwj4pglFl0yblbrRTtxz 0AEA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1762975464; x=1763580264; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=9RboGJiDgx522e/0FZwsBBrKlE2sz8rmng8iRZf/blE=; b=pNZC7qPAx5gCSZvEpaEoiWHGFb3YHrbkWKDLrA10d/QHe3x7nUuiXMkvbuqMoUIyK2 lJZI1QmGDTNEQSZy4D/gItvnhmlzrZt4ub93snomdqaytl0O8tBn5O2pF5/YVSIc2SQf Xh8+sKjU+PbyHkmI0PZ/efpmMtHsHchlrHFnJsAQxeXU08U9c54O+rOZ/bR6DdJzW+pz +El5ys0J0Wi2BR78GEhCPD4ibaY5gX+sGtFGDWxnrVj/Au9csccfItnvq25KcOJD2raW k0/NjTix8fQskHSuKMz6I3HmrEYqlGL1SqDbqJdx0aB8nfl+bWKsG+oHJbVPIoAkomni kFpw== X-Gm-Message-State: AOJu0YzCihcAXA9TpxoSiWWGNW0rZgQ7O9GY0Y83MX+n+3rid0QhEbvt SICEmrxznC+Ietc0SC/etFPBVfMjubaM2rAzYCicSsf8BU5AHp45QEbph0w/e5oYOiUry2KSGMk 7PG7emA== X-Google-Smtp-Source: AGHT+IG1SKOI3GKLgrjPgVXOjy4ln+GVaRLJ1ECeOnc4FwvAXPRn2TxDI3BCwzytmIsWQ8IixfxCReuXBz0= X-Received: from edo21.prod.google.com ([2002:a05:6402:52d5:b0:641:4eeb:43cd]) (user=lrizzo job=prod-delivery.src-stubby-dispatcher) by 2002:a05:6402:27c8:b0:63c:690d:6a46 with SMTP id 4fb4d7f45d1cf-6431a4b65d3mr4184716a12.13.1762975463640; Wed, 12 Nov 2025 11:24:23 -0800 (PST) Date: Wed, 12 Nov 2025 19:24:03 +0000 In-Reply-To: <20251112192408.3646835-1-lrizzo@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: 
<20251112192408.3646835-1-lrizzo@google.com>
X-Mailer: git-send-email 2.51.2.1041.gc1ab5b90ca-goog
Message-ID: <20251112192408.3646835-2-lrizzo@google.com>
Subject: [PATCH 1/6] genirq: platform wide interrupt moderation: Documentation, Kconfig, irq_desc
From: Luigi Rizzo
To: Thomas Gleixner , Marc Zyngier , Luigi Rizzo , Paolo Abeni , Andrew Morton , Sean Christopherson , Jacob Pan
Cc: linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, Bjorn Helgaas , Willem de Bruijn , Luigi Rizzo
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Platform wide software interrupt moderation ("soft_moderation" in this
patch series) specifically addresses a limitation of platforms from many
vendors whose I/O performance drops significantly when the total rate of
MSI-X interrupts is too high (e.g. 1-3M intr/s depending on the platform).

Conventional interrupt moderation operates separately on each source,
hence the configuration should target the worst case. On large servers
with hundreds of interrupt sources, keeping the total rate bounded would
require delays of 100-200us; and adaptive moderation would have to reach
those delays with as little as 10K intr/s per source. These values are
unacceptable for RPC or transactional workloads.

To address this problem, this code efficiently measures the total and
per-CPU interrupt rates, so that individual moderation delays can be
adjusted based on actual global and local load. This way, the system
controls both global interrupt rates and individual CPU load, and tunes
delays so they are normally 0 or very small except during actual
local/global overload.

Configuration is easy and robust. System administrators specify the
maximum targets (moderation delay; interrupt rate; and fraction of time
spent in hardirq), and per-CPU control loops adjust actual delays to try
and keep metrics within the bounds.
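As a rough illustration of the worst-case sizing argument above, here is a minimal user-space C sketch (not kernel code and not part of this series). It assumes each source raises at most one interrupt per moderation delay, so N sources with delay D generate at most N/D interrupts per second. The helper names worst_case_delay_us() and max_rate_per_source() are hypothetical.

```c
#include <stdint.h>

#define USEC_PER_SEC_SKETCH 1000000ULL

/*
 * Smallest fixed per-source delay (in us) that keeps the platform total
 * at or below max_total_rate (intr/s), assuming every source fires at
 * most once per delay: nr_sources / delay <= max_total_rate.
 * Returns 0 when no rate limit is given.
 */
uint64_t worst_case_delay_us(uint64_t nr_sources, uint64_t max_total_rate)
{
	if (max_total_rate == 0)
		return 0;
	/* round up so the bound is never exceeded */
	return (nr_sources * USEC_PER_SEC_SKETCH + max_total_rate - 1) /
	       max_total_rate;
}

/* Highest interrupt rate (intr/s) a single source can reach with that delay */
uint64_t max_rate_per_source(uint64_t delay_us)
{
	return delay_us ? USEC_PER_SEC_SKETCH / delay_us : 0;
}
```

Under these assumptions, 100 sources with a 1M intr/s platform budget need a 100us delay, which caps each source at 10K intr/s; 200 sources need 200us. This is the range that per-source moderation would have to apply even when only one source is active.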
There is no need for exact targets, because the system is adaptive; the
defaults (delay_us=100, target_irq_rate=1000000 intr/s,
hardirq_percent=70) are good almost everywhere.

The system does not rely on any special hardware feature except the
MSI-X Pending Bit Array (PBA), a mandatory component of MSI-X.

Boot defaults are set via module parameters (/sys/module/irq_moderation
and /sys/module/${DRIVER}); they can be changed at runtime via
/proc/irq/soft_moderation, which is also used to export statistics.
Moderation on an individual irq can be turned on/off via
/proc/irq/NN/soft_moderation.

This initial patch adds Documentation, a Kconfig option, two fields in
struct irq_desc, and prototypes in include/linux/interrupt.h.

No functional impact. Enabling the option will just extend struct
irq_desc with two fields.

CONFIG_IRQ_SOFT_MODERATION=y
---
 Documentation/core-api/irq/index.rst          |   1 +
 Documentation/core-api/irq/irq-moderation.rst | 215 ++++++++++++++++++
 include/linux/interrupt.h                     |  15 ++
 include/linux/irqdesc.h                       |   5 +
 kernel/irq/Kconfig                            |  11 +
 5 files changed, 247 insertions(+)
 create mode 100644 Documentation/core-api/irq/irq-moderation.rst

diff --git a/Documentation/core-api/irq/index.rst b/Documentation/core-api/irq/index.rst
index 0d65d11e54200..b5a6e2ade69bb 100644
--- a/Documentation/core-api/irq/index.rst
+++ b/Documentation/core-api/irq/index.rst
@@ -9,3 +9,4 @@ IRQs
    irq-affinity
    irq-domain
    irqflags-tracing
+   irq-moderation
diff --git a/Documentation/core-api/irq/irq-moderation.rst b/Documentation/core-api/irq/irq-moderation.rst
new file mode 100644
index 0000000000000..ff12dbabc701b
--- /dev/null
+++ b/Documentation/core-api/irq/irq-moderation.rst
@@ -0,0 +1,215 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+===========================================
+Platform wide software interrupt moderation
+===========================================
+
+:Author: Luigi Rizzo
+
+.. contents:: :depth: 2
+
+Introduction
+------------
+
+Platform-wide software interrupt moderation is a variant of moderation
+that adjusts the delay based on platform-wide metrics, instead of
+considering each source separately. It then uses hrtimers to implement
+adaptive, per-CPU moderation in software, without requiring any specific
+hardware support other than the Pending Bit Array, a standard feature
+of MSI-X.
+
+To understand the motivation for this feature, we start with some
+background on interrupt moderation.
+
+* An **interrupt** is a mechanism to **notify** the CPU of **events**
+  that should be handled by software, for example, **completions**
+  of I/O requests (network tx/rx, disk reads/writes...).
+
+* Each event typically issues one interrupt, which is then processed by
+  software before the next interrupt can be issued. If more events fire
+  in the meantime, the next interrupt notifies all of them. This is called
+  **coalescing**, and it can happen unintentionally, as in the example.
+
+* Coalescing amortizes the fixed costs of processing one interrupt over
+  multiple events, so it improves efficiency. This suggested the idea
+  that we could intentionally make coalescing more likely. This is called
+  interrupt **moderation**.
+
+* The most common and robust implementation of moderation enforces
+  some minimum **delay** between consecutive interrupts, using a timer
+  in the device or in software. Most NICs support programmable hardware
+  moderation, with timer granularity down to 1us or so. NVMe also
+  specifies hardware moderation timers, with 100us granularity.
+
+* One downside of moderation, especially with a **fixed** delay, is that
+  even with moderate load, the notification latency can increase by as
+  much as the moderation delay. This is undesirable for transactional
+  workloads. At high load the extra delay is less problematic, because
+  the queueing delays that occur can be one or more orders of magnitude
+  bigger.
+
+* To address this problem, software can dynamically adjust the delay, making
+  it proportional to the I/O rate. This is called **adaptive** moderation,
+  and it is commonly implemented in network device drivers.
+
+In summary, interrupt moderation, as normally implemented, is a very
+effective way to reduce interrupt processing costs on a per-source
+basis. The modest compromise on the extra latency can be removed with
+adaptive moderation.
+
+MOTIVATION
+~~~~~~~~~~
+
+There is one aspect that per-source moderation does not address.
+
+Several Systems-on-Chip (SoCs) from all vendors (Intel, AMD, ARM) show a
+huge reduction in I/O throughput (up to 3-4x slower for high speed
+NICs or SSDs) in the presence of high MSI-X interrupt rates across the entire
+platform (1-3M intr/s total, depending on the SoC). Note that unaffected
+SoCs can sustain 20-30M intr/s from MSI-X without performance degradation.
+
+The above performance degradation is not caused by overload of individual
+CPUs. What matters is the total MSI-X interrupt rate, across either an
+individual PCIe root port or the entire SoC. The specific root cause
+depends on the SoC, but generally some internal block (in the PCIe root
+port, or in the IOMMU block) applies excessive serialization around
+MSI-X writes. This in turn causes huge delays in other PCIe transactions,
+leading to the observed performance drop.
+
+EXISTING MITIGATIONS AND WHY THEY ARE INSUFFICIENT:
+===================================================
+
+* As mentioned, traditional interrupt moderation operates on individual sources.
+  Large server platforms can have hundreds of sources (NIC or NVMe
+  queues). To stay within the total platform limit (e.g. 1-3M intr/s)
+  one would need very large delays (100-200us), which are undesirable
+  for transactional workloads (RPC, database).
+
+* Per-source adaptive moderation has the same problem. The adaptive control
+  cannot tell whether other sources are active, so in order to be effective
+  it must assume the worst case and jump to large delays with as little
+  as 10K intr/s, even if no other sources are active.
+
+* Back in 2020 we addressed this very problem for network devices with
+  the ``napi_defer_hard_irqs`` mechanism: after the first interrupt,
+  NAPI does not rearm the queue interrupt, and instead runs the next
+  softirq handler from an hrtimer. It keeps using the timer as interrupt
+  source until one or a few handler calls generate no activity, at which
+  point interrupts are re-enabled.
+
+  That way, under load, device interrupts (which are problematic for
+  the platform) are greatly reduced, and replaced with less problematic
+  timer interrupts. There are a few downsides though:
+
+  * the feature is only available for NAPI devices
+
+  * the timer delay is not adaptive, and must still be tuned based on the number
+    of active sources
+
+  * the system generates extra calls to the NAPI handler
+
+  * it interacts non-intuitively with devices that share tx/rx interrupts
+    or implement optimizations that expect interrupts to be enabled.
+
+PLATFORM WIDE ADAPTIVE MODERATION
+---------------------------------
+
+Platform-wide adaptive interrupt moderation is designed to overcome the
+above limitations. The system operates as follows:
+
+* on each interrupt, the handling CPU increments a per-CPU interrupt counter
+
+* opportunistically, every ms or so, each CPU scalably accumulates
+  the values across the entire system, so it can compute the total and
+  per-CPU interrupt rate, and the number of CPUs actively processing
+  interrupts.
+
+* with the above information and a system-wide, configurable
+  ``target_irq_rate``, each CPU computes whether it is processing more
+  or less than its fair share of interrupts. A simple control loop then
+  adjusts up/down its per-CPU moderation delay. The value varies from 0 (disabled)
+  to a configurable upper bound ``delay_us``.
+
+* when the per-CPU moderation delay goes above a threshold (e.g. 10us), the first
+  interrupt served by that CPU will start an hrtimer to fire after the
+  adaptive delay. All interrupt sources served by that CPU are then
+  disabled as their interrupts arrive.
+
+* when the timer fires, all disabled sources are re-enabled. The Pending
+  Bit Array feature of MSI-X ensures that interrupts generated while
+  a source is disabled are not lost.
+
+This scheme is effective in keeping the total interrupt rate under
+control as long as the configuration parameters are sensible
+(``delay_us >= #CPUs / target_irq_rate``, with the rate in interrupts per microsecond).
+
+It also lends itself to some extensions, specifically:
+
+* **protect against hardirq overload**. It is possible for one CPU
+  handling interrupts to be overwhelmed by hardirq processing. The
+  control loop can be extended to declare an overload situation when the
+  percentage of time spent in hardirq is above a configurable threshold
+  ``hardirq_percent``. Moderation can thus kick in to keep the load within bounds.
+
+* **reduce latency using timer-based polling**. Similar to ``napi_defer_hard_irqs``
+  described earlier, once interrupts are disabled and we have an hrtimer active,
+  we keep the timer active for a few rounds and run the handler from a timer callback
+  instead of waiting for an interrupt. The ``timer_rounds`` parameter controls this behavior.
+
+  Say the control loop settles on a 120us delay to stay within the global MSI-X rate limit.
+  By setting ``timer_rounds=2``, each time we have a hardware interrupt, the handler
+  will be called two more times by the timer. As a consequence, in the same conditions,
+  the same global MSI-X rate will be reached with just 120/3 = 40us delay, thus improving
+  latency significantly (note that those handler calls do cause extra CPU work, so we
+  may lose some of the efficiency gains coming from large delays).
+
+CONFIGURATION
+-------------
+
+Configuration of this system is done via module parameters
+``irq_moderation.${name}=${value}`` (for boot-time defaults)
+or by running ``echo "${name}=${value}" > /proc/irq/soft_moderation``
+for run-time configuration.
+
+Here are the existing module parameters:
+
+* ``delay_us`` (0: off, range 0-500)
+
+  The maximum moderation delay. 0 means moderation is globally disabled.
+
+* ``target_irq_rate`` (0: off, range 0-50000000)
+
+  The maximum irq rate across the entire platform. The adaptive algorithm will adjust
+  delays to stay within the target. Use 0 to disable this control.
+
+* ``hardirq_percent`` (0: off, range 0-100)
+
+  The maximum percentage of CPU time spent in hardirq. The adaptive algorithm will adjust
+  delays to stay within the target. Use 0 to disable this control.
+
+* ``timer_rounds`` (0: off, range 0-20)
+
+  Once the moderation timer is activated, how many extra timer rounds to do before
+  re-enabling interrupts.
+
+* ``update_ms`` (default 1, range 1-100)
+
+  How often the adaptive control should adjust delays. The default value (1ms) should be good
+  in most circumstances.
+
+Interrupt moderation can be enabled/disabled on individual IRQs as follows:
+
+* module parameter ``${driver}.soft_moderation=1`` (default 0) selects
+  whether to use moderation at device probe time.
+
+* ``echo 1 > /proc/irq/*/${irq_name}/../soft_moderation`` (default 0, disabled) toggles
+  moderation on/off for specific IRQs once they are attached.
+
+**INTEL SPECIFIC**
+
+Recent Intel CPUs support a kernel feature, enabled via boot parameter ``intremap=posted_msi``,
+that routes all interrupts targeting one CPU via a special interrupt, called **posted_msi**,
+whose handler in turn calls the individual interrupt handlers.
+
+The ``posted_msi`` kernel feature always uses moderation when it is globally enabled
+(``delay_us > 0``), so individual IRQs do not need to be enabled one by one.
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 51b6484c04934..007201c8db6dd 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -872,6 +872,21 @@ extern int early_irq_init(void);
 extern int arch_probe_nr_irqs(void);
 extern int arch_early_irq_init(void);
 
+#ifdef CONFIG_IRQ_SOFT_MODERATION
+
+void irq_moderation_percpu_init(void);
+void irq_moderation_init_fields(struct irq_desc *desc);
+/* add/remove /proc/irq/NN/soft_moderation */
+void irq_moderation_procfs_entry(struct irq_desc *desc, umode_t umode);
+
+#else /* empty stubs to avoid conditional compilation */
+
+static inline void irq_moderation_percpu_init(void) {}
+static inline void irq_moderation_init_fields(struct irq_desc *desc) {}
+static inline void irq_moderation_procfs_entry(struct irq_desc *desc, umode_t umode) {}
+
+#endif
+
 /*
  * We want to know which function is an entrypoint of a hardirq or a softirq.
  */
diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h
index fd091c35d5721..4eb05bc456abe 100644
--- a/include/linux/irqdesc.h
+++ b/include/linux/irqdesc.h
@@ -112,6 +112,11 @@ struct irq_desc {
 #endif
 	struct mutex request_mutex;
 	int parent_irq;
+#ifdef CONFIG_IRQ_SOFT_MODERATION
+	/* mode: 0: off, 1: disable_irq_nosync() */
+	u8 moderation_mode; /* written in procfs, read on irq */
+	struct list_head ms_node; /* per-CPU list of moderated irq_desc */
+#endif
 	struct module *owner;
 	const char *name;
 #ifdef CONFIG_HARDIRQS_SW_RESEND
diff --git a/kernel/irq/Kconfig b/kernel/irq/Kconfig
index 1b4254d19a73e..fb9c78b1deea8 100644
--- a/kernel/irq/Kconfig
+++ b/kernel/irq/Kconfig
@@ -155,6 +155,17 @@ config IRQ_KUNIT_TEST
 
 endmenu
 
+# Support platform wide software interrupt moderation
+config IRQ_SOFT_MODERATION
+	bool "Enable platform wide software interrupt moderation"
+	help
+	  Enable platform wide software interrupt moderation.
+	  Uses a local timer to delay interrupts in configurable ways
+	  and depending on various global system load indicators
+	  and targets.
+
+	  If you don't know what to do here, say N.
+ config GENERIC_IRQ_MULTI_HANDLER bool help --=20 2.51.2.1041.gc1ab5b90ca-goog From nobody Sun Feb 8 00:34:52 2026 Received: from mail-ed1-f73.google.com (mail-ed1-f73.google.com [209.85.208.73]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 682BA34F48F for ; Wed, 12 Nov 2025 19:24:27 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.208.73 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1762975470; cv=none; b=Y75URT87FMupHyaJaHGZeWBGs0U6qRSMBR1nZVXYjDTAvpe3JeEftKjBIh1M/88EYkVrK96rI4YJKkkhmw0Z6nQNsYWNu9+tcAKs4x484wkOmZwIYhDTaE07p1Xj4BRPxlvcOqT4/Qufo9NTh4lQBmKrqJZBain/jfS9XIs29rM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1762975470; c=relaxed/simple; bh=JboPOrrlrY4Ko6LJbn94SzHonmcSfVDYwJHknxK4H3Y=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=nUsFFYg/m54Q/quo7idF+yRr11g6H/UG0s+uoJ7/M7bXyOZ+mRYOAOr3RIMyvfouP6T3dT1yNDyi7rhap3/mXz1gU7FqXI9S98MyYtyWz0lU2UnMEHv9eOMwnbBJA0LOV5xi1uMUgKLATjJINo4ZxSJVvIYfmj6WtDBYtSQ7bek= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--lrizzo.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=wR96Uzqw; arc=none smtp.client-ip=209.85.208.73 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--lrizzo.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="wR96Uzqw" Received: by mail-ed1-f73.google.com with SMTP id 4fb4d7f45d1cf-640cc916e65so11501a12.1 for ; Wed, 12 Nov 2025 11:24:27 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; 
c=relaxed/relaxed; d=google.com; s=20230601; t=1762975466; x=1763580266; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=B24mQnSZMVRgdG5U9hRXkGt9TnaPoRUEHUCFM9E7GFE=; b=wR96UzqwAPrKEfTJIjFEVuwe0xevl/xEFkWYQ04EmhA0Pko+qnZym6fQbCoU8qywgI 9nQqadI0HKcqQv95b6MzoQ4LWozWP1p/7S3fagw92TRI0hleW/OggPd+8yhv8L77xX3c /TNfetYUp+6WjoLoBxXkGTxR/eLtCdfr1255T+1m1IOzgEK/KXx34zm98Emuck2GhmVi h18zaGQvNIfI3sROC2RMWjk8Sir0Ium06Q0NDpamZWjUdq/tAIkA/hqZ6oLwAwWaSVho qt0+1lfbjSyDX3CPX5Vyk7sAeQoqNJjEfhM2Sm4Lrn3RwGzToLE70MyrOZE3dh/Mgq4w xPDg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1762975466; x=1763580266; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=B24mQnSZMVRgdG5U9hRXkGt9TnaPoRUEHUCFM9E7GFE=; b=OnnNtsALZQpgbTQe8Qv5EIeRHx83mm5FM19yT+Ig1CZOIVUgDTiPV/JUoljKDLEXjY 7KxL93WzT3fNATEEalr7C65QUOCLVS5Dh2jjIbamCXuxJjnmVuDYcSC6H5Uc/LcYxRIs xmFUDGwYJfdWePx5Km6GpkUQL2toQ/8IkyzqwM0eakRHPmQOMmY920Jt1r/4q/kTauUb 5Ct/qFytXZyKkPeY2GQG9sx2N/zrjyah9+GvG95Yw8LzDuY2mC5T3TTdnZBOQrrYNgEr lcTNW+hpTLmMar5iEQSLVK1tl10siVw7GTLU76ZJczGY/poV25K4s5jW+tsMPXPn/jES QTzQ== X-Gm-Message-State: AOJu0YzULie2GsI8DuFAEbszShzmqfMBqLoBHlPnClniR3kDBh3kWVJi Z+7c5dJm0R03N7LCY81QYY4kpGmXTYVOTEX4MzEjLUw7hu4u5BxKhKPWuTRrTHsPoJh5b90CfzK fQ2553A== X-Google-Smtp-Source: AGHT+IHRbH0+L2jlw8eibE1pxIXW03f0w5TdjuwI8vS9PEphIgcgFYF8p42VznymTU0p0+Uaeai1DRIrV5Y= X-Received: from ejczi12.prod.google.com ([2002:a17:907:e98c:b0:b6d:7901:e54f]) (user=lrizzo job=prod-delivery.src-stubby-dispatcher) by 2002:a17:907:1c25:b0:b72:7cd3:d55b with SMTP id a640c23a62f3a-b733195f534mr450314366b.12.1762975465792; Wed, 12 Nov 2025 11:24:25 -0800 (PST) Date: Wed, 12 Nov 2025 19:24:04 +0000 In-Reply-To: <20251112192408.3646835-1-lrizzo@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: 
List-Subscribe:
List-Unsubscribe:
Mime-Version: 1.0
References: <20251112192408.3646835-1-lrizzo@google.com>
X-Mailer: git-send-email 2.51.2.1041.gc1ab5b90ca-goog
Message-ID: <20251112192408.3646835-3-lrizzo@google.com>
Subject: [PATCH 2/6] genirq: soft_moderation: add base files, procfs hooks
From: Luigi Rizzo
To: Thomas Gleixner , Marc Zyngier , Luigi Rizzo , Paolo Abeni , Andrew Morton , Sean Christopherson , Jacob Pan
Cc: linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, Bjorn Helgaas , Willem de Bruijn , Luigi Rizzo
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Add the main core files that implement soft_moderation, limited to
static moderation, plus related small changes to
include/linux/interrupt.h, kernel/irq/Makefile, and kernel/irq/proc.c

- include/linux/irq_moderation.h has the two main structs, prototypes
  and inline hooks

- kernel/irq/irq_moderation.c has the procfs handlers

The code is not yet hooked to the interrupt handler, so the feature is
disabled, but we can see the module parameters in
/sys/module/irq_moderation/parameters and read/write the procfs entries
/proc/irq/soft_moderation and /proc/irq/NN/soft_moderation.
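The static moderation decision added by this patch can be sketched in user-space C as follows. This mirrors the logic of irq_moderation_needed() in include/linux/irq_moderation.h, with the hrtimer state replaced by a plain boolean. The struct and function names below are hypothetical stand-ins, and the 10000 ns threshold is the constant used in the patch; resetting sleep_ns after the timer fires is outside the quoted code and is not modeled here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Threshold below which the patch does not bother moderating (10us) */
#define MIN_SLEEP_NS_SKETCH 10000u

/* Hypothetical user-space stand-in for the relevant irq_mod_state fields */
struct mod_state_sketch {
	bool timer_queued;	/* stands in for hrtimer_is_queued(&ms->timer) */
	uint32_t mod_ns;	/* current moderation delay */
	uint32_t sleep_ns;	/* delay accumulated since moderation last ran */
};

/*
 * Return true if the interrupt being handled should be moderated:
 * either the timer is already pending, or the accumulated delay has
 * grown large enough to be worth arming it.
 */
bool moderation_needed_sketch(struct mod_state_sketch *ms)
{
	if (!ms->timer_queued) {
		ms->sleep_ns += ms->mod_ns;	/* accumulate sleep time */
		if (ms->sleep_ns < MIN_SLEEP_NS_SKETCH)
			return false;		/* too small, handle normally */
	}
	return true;
}
```

With mod_ns set to 4000, the first two interrupts are handled normally and the third one crosses the threshold, so small delays are batched across interrupts instead of arming a very short timer each time.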
Examples:

  cat /proc/irq/soft_moderation
  echo "delay_us=345" > /proc/irq/soft_moderation
  echo 1 | tee /proc/irq/*/nvme*/../soft_moderation

Change-Id: I472d9b5b31770aa2787f062f7fe5d411882be60e
---
 include/linux/irq_moderation.h | 196 ++++++++++++++++++++
 kernel/irq/Makefile            |   1 +
 kernel/irq/irq_moderation.c    | 315 +++++++++++++++++++++++++++++++++
 kernel/irq/proc.c              |   2 +
 4 files changed, 514 insertions(+)
 create mode 100644 include/linux/irq_moderation.h
 create mode 100644 kernel/irq/irq_moderation.c

diff --git a/include/linux/irq_moderation.h b/include/linux/irq_moderation.h
new file mode 100644
index 0000000000000..4d90d7c4ca26b
--- /dev/null
+++ b/include/linux/irq_moderation.h
@@ -0,0 +1,196 @@
+/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */
+
+#ifndef _LINUX_IRQ_MODERATION_H
+#define _LINUX_IRQ_MODERATION_H
+
+/*
+ * Platform wide software interrupt moderation, see
+ * Documentation/core-api/irq/irq-moderation.rst
+ */
+
+#include
+#include
+#include
+#include
+
+#ifdef CONFIG_IRQ_SOFT_MODERATION
+
+/* Global configuration parameters and state */
+struct irq_mod_info {
+	/* These fields are written to by all CPUs */
+	____cacheline_aligned
+	atomic_long_t total_intrs; /* running count updated every update_ms */
+	atomic_long_t total_cpus; /* as above, active CPUs in this interval */
+
+	/* These are mostly read (frequently), so use a different cacheline */
+	____cacheline_aligned
+	u64 procfs_write_ns; /* last write to /proc/irq/soft_moderation */
+	uint delay_us; /* fixed delay, or maximum for adaptive */
+	uint target_irq_rate; /* target interrupt rate */
+	uint hardirq_percent; /* target maximum hardirq percentage */
+	uint timer_rounds; /* how many timer polls once moderation fires */
+	uint update_ms; /* how often to update delay/rate/fraction */
+	uint scale_cpus; /* (percent) scale factor to estimate active CPUs */
+	uint count_timer_calls; /* count timer calls for irq limits */
+	uint count_msi_calls; /* count calls from posted_msi for irq limits */
+	uint decay_factor; /* keep it at 16 */
+	uint grow_factor; /* keep it at 8 */
+	int pad[] ____cacheline_aligned;
+};
+
+extern struct irq_mod_info irq_mod_info;
+
+/* Per-CPU moderation state */
+struct irq_mod_state {
+	struct hrtimer timer; /* moderation timer */
+	struct list_head descs; /* moderated irq_desc on this CPU */
+
+	/* Counters on last time we updated moderation delay */
+	u64 last_ns; /* time of last update */
+	u64 last_irqtime; /* from cpustat[CPUTIME_IRQ] */
+	u64 last_total_irqs;
+	u64 last_total_cpus;
+
+	bool in_posted_msi; /* don't suppress handle_irq, set in posted_msi handler */
+	bool kick_posted_msi; /* kick posted_msi from the timer callback */
+
+	u32 cycles; /* calls since last ktime_get_ns() */
+	s32 irq_count; /* irqs in the last cycle, signed as we also decrement */
+	u32 delay_ns; /* fetched from irq_mod_info */
+	u32 mod_ns; /* recomputed every update_ms */
+	u32 sleep_ns; /* accumulated time for actual delay */
+	s32 rounds_left; /* how many rounds left for moderation */
+
+	/* Statistics */
+	u32 irq_rate; /* smoothed global irq rate */
+	u32 my_irq_rate; /* smoothed irq rate for this CPU */
+	u32 cpu_count; /* smoothed CPU count (scaled) */
+	u32 src_count; /* smoothed irq sources count (scaled) */
+	u32 irq_high; /* how many times above each threshold */
+	u32 my_irq_high;
+	u32 hardirq_high;
+	u32 timer_set; /* counters for various events */
+	u32 timer_fire;
+	u32 disable_irq;
+	u32 enable_irq;
+	u32 timer_calls;
+	u32 from_posted_msi;
+	u32 stray_irq;
+	int pad[] ____cacheline_aligned;
+};
+
+DECLARE_PER_CPU_ALIGNED(struct irq_mod_state, irq_mod_state);
+
+static inline void irq_moderation_start_timer(struct irq_mod_state *ms)
+{
+	ms->timer_set++;
+	ms->rounds_left = clamp(READ_ONCE(irq_mod_info.timer_rounds), 0u, 20u) + 1;
+	hrtimer_start_range_ns(&ms->timer, ns_to_ktime(ms->sleep_ns),
+			       /*range*/2000, HRTIMER_MODE_REL_PINNED_HARD);
+}
+
+static inline bool irq_moderation_enabled(void)
+{
+	return READ_ONCE(irq_mod_info.delay_us);
+}
+
+static inline uint irq_moderation_get_update_ms(void)
+{
+	return clamp(READ_ONCE(irq_mod_info.update_ms), 1u, 100u);
+}
+
+/* Called on each interrupt for adaptive moderation delay adjustment */
+static inline void irq_moderation_adjust_delay(struct irq_mod_state *ms)
+{
+	u64 now, delta_time, update_ns;
+
+	ms->irq_count++;
+	if (ms->cycles++ < 16) /* ktime_get_ns() is expensive, don't do too often */
+		return;
+	ms->cycles = 0;
+	now = ktime_get_ns();
+	delta_time = now - ms->last_ns;
+	update_ns = irq_moderation_get_update_ms() * NSEC_PER_MSEC;
+
+	/* Run approximately every update_ns, a little bit early is ok */
+	if (delta_time < update_ns - 5000)
+		return;
+
+	/* Fetch important state */
+	ms->delay_ns = clamp(irq_mod_info.delay_us, 1u, 500u) * NSEC_PER_USEC;
+
+	ms->last_ns = now;
+	ms->mod_ns = ms->delay_ns;
+}
+
+/* Return true if timer is active or delay is large enough to require moderation */
+static inline bool irq_moderation_needed(struct irq_mod_state *ms)
+{
+	if (!hrtimer_is_queued(&ms->timer)) {
+		ms->sleep_ns += ms->mod_ns; /* accumulate sleep time */
+		if (ms->sleep_ns < 10000) /* no moderation if too small */
+			return false;
+	}
+	return true;
+}
+
+void disable_irq_nosync(unsigned int irq);
+
+/*
+ * Use in handle_irq_event() before calling the handler. Decide whether this
+ * desc should be moderated, and if so, disable the irq and add the desc to
+ * the list for this CPU.
+ */
+static inline void irq_moderation_hook(struct irq_desc *desc)
+{
+	struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
+
+	if (!irq_moderation_enabled())
+		return;
+
+	if (!READ_ONCE(desc->moderation_mode))
+		return;
+
+	irq_moderation_adjust_delay(ms);
+
+	if (!list_empty(&desc->ms_node)) {
+		/*
+		 * Very unlikely, stray interrupt while the desc is moderated.
+		 * Unfortunately we cannot ignore it, just count it.
+		 */
+		ms->stray_irq++;
+		return;
+	}
+
+	if (!irq_moderation_needed(ms))
+		return;
+
+	list_add(&desc->ms_node, &ms->descs); /* Add to list of moderated desc */
+	/*
+	 * Disable the irq. This will also cause irq_can_handle() to return false
+	 * (through irq_can_handle_actions()), and that will prevent a handler
+	 * instance from running again while the descriptor is being moderated.
+	 *
+	 * irq_moderation_epilogue() will then start the timer if needed.
+	 */
+	ms->disable_irq++;
+	disable_irq_nosync(desc->irq_data.irq); /* desc must be unlocked */
+}
+
+/* After the handler, if desc is moderated, make sure the timer is active. */
+static inline void irq_moderation_epilogue(const struct irq_desc *desc)
+{
+	struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
+
+	if (!list_empty(&desc->ms_node) && !hrtimer_is_queued(&ms->timer))
+		irq_moderation_start_timer(ms);
+}
+
+#else /* empty stubs to avoid conditional compilation */
+
+static inline void irq_moderation_hook(struct irq_desc *desc) {}
+static inline void irq_moderation_epilogue(const struct irq_desc *desc) {}
+
+#endif /* CONFIG_IRQ_SOFT_MODERATION */
+
+#endif /* _LINUX_IRQ_MODERATION_H */
diff --git a/kernel/irq/Makefile b/kernel/irq/Makefile
index 6ab3a40556670..c06da43d644f2 100644
--- a/kernel/irq/Makefile
+++ b/kernel/irq/Makefile
@@ -9,6 +9,7 @@ obj-$(CONFIG_GENERIC_IRQ_CHIP) += generic-chip.o
 obj-$(CONFIG_GENERIC_IRQ_PROBE) += autoprobe.o
 obj-$(CONFIG_IRQ_DOMAIN) += irqdomain.o
 obj-$(CONFIG_IRQ_SIM) += irq_sim.o
+obj-$(CONFIG_IRQ_SOFT_MODERATION) += irq_moderation.o
 obj-$(CONFIG_PROC_FS) += proc.o
 obj-$(CONFIG_GENERIC_PENDING_IRQ) += migration.o
 obj-$(CONFIG_GENERIC_IRQ_MIGRATION) += cpuhotplug.o
diff --git a/kernel/irq/irq_moderation.c b/kernel/irq/irq_moderation.c
new file mode 100644
index 0000000000000..a9d2bdcf4d8c7
--- /dev/null
+++ b/kernel/irq/irq_moderation.c
@@ -0,0 +1,315 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause
+
+#include
+#include
+#include
+
+#include
+#include /* interrupt.h, kcpustat_this_cpu */ +#include "internals.h" + +/* + * Platform-wide software interrupt moderation. + * + * see Documentation/core-api/irq/irq-moderation.rst + * + * =3D=3D=3D MOTIVATION AND OPERATION =3D=3D=3D + * + * Some platforms show reduced I/O performance when the total device inter= rupt + * rate across the entire platform becomes too high. This code implements + * per-CPU adaptive moderation based on the total interrupt rate, as oppos= ed + * to conventional moderation that operates separately on each source. + * + * It computes the total interrupt rate and number of sources, and uses the + * information to adaptively disable individual interrupts for small amoun= ts + * of time using per-CPU hrtimers and MSI-X Pending Bit Array. Specificall= y: + * + * - a hook in handle_irq_event(), which applies only on sources configured + * to use moderation, updates statistics and check whether we need + * moderation on that CPU/irq. If so, calls disable_irq_nosync() and sta= rts + * an hrtimer with appropriate delay. + * + * - the timer callback calls enable_irq() for all disabled interrupts on = that + * CPU. That in turn will generate interrupts if there are pending event= s. + * + * =3D=3D=3D CONFIGURATION =3D=3D=3D + * + * The following can be controlled at boot time via module parameters + * + * irq_moderation.${NAME}=3D${VALUE} + * + * or at runtime by writing + * + * echo "${NAME}=3D${VALUE}" > /proc/irq/soft_moderation + * + * delay_us (default 50, range 10..500, 0 DISABLES MODERATION) + * Fixed or maximum moderation delay. A reasonable range is 20..100, = higher + * values can be useful if the hardirq handler is performing a signifi= cant + * amount of work. + * + * FIXED MODERATION mode requires target_irq_rate=3D0, hardirq_percent= =3D0 + * + * target_irq_rate (default 1M, 0 off, range 0..50M) + * the total irq rate above which moderation kicks in. 
+ * Not particularly critical, a value in the 500K-1M range is usually = ok + * + * hardirq_percent (default 70, 0 off, range 10..100) + * the hardirq percentage above which moderation kicks in. + * 50-90 is a reasonable range. + * + * timer_rounds (default 0, max 20) + * once moderation triggers, periodically run handler zero or more + * times using a timer rather than interrupts. This is similar to + * napi_defer_hard_irqs on NICs. + * A small value may help control load in interrupt-challenged platfor= ms. + * + * update_ms (default 1, range 1...100) + * how often the load is measured and moderation delay updated. + * + * Moderation can be enabled/disabled for individual interrupts with + * + * echo "on" > /proc/irq/NN/soft_moderation # use "off" to disable + * + * =3D=3D=3D MONITORING =3D=3D=3D + * + * cat /proc/irq/soft_moderation shows per-CPU and global statistics. + * + */ + +static_assert(offsetof(struct irq_mod_info, procfs_write_ns) =3D=3D 64); + +struct irq_mod_info irq_mod_info ____cacheline_aligned =3D { + .delay_us =3D 100, + .update_ms =3D 1, + .count_timer_calls =3D true, +}; + +module_param_named(delay_us, irq_mod_info.delay_us, uint, 0444); +MODULE_PARM_DESC(delay_us, "Max moderation delay us, 0 =3D moderation off,= range 10..500."); + +module_param_named(timer_rounds, irq_mod_info.timer_rounds, uint, 0444); +MODULE_PARM_DESC(timer_rounds, "How many extra timer polls once moderation= triggered."); + +DEFINE_PER_CPU_ALIGNED(struct irq_mod_state, irq_mod_state); + +static void smooth_avg(u32 *dst, u32 val, u32 steps) +{ + *dst =3D ((64 - steps) * *dst + steps * val) / 64; +} + +/* moderation timer handler, called in hardintr context */ +static enum hrtimer_restart moderation_timer_cb(struct hrtimer *timer) +{ + struct irq_mod_state *ms =3D this_cpu_ptr(&irq_mod_state); + struct irq_desc *desc, *next; + uint srcs =3D 0; + + ms->timer_fire++; + WARN_ONCE(ms->timer_set !=3D ms->timer_fire, + "CPU %d timer set %d fire %d (lost events?)\n", + 
smp_processor_id(), ms->timer_set, ms->timer_fire);
+
+	ms->rounds_left--;
+
+	if (ms->rounds_left > 0) {
+		/* Timer still alive. Just call the handlers */
+		list_for_each_entry_safe(desc, next, &ms->descs, ms_node) {
+			ms->irq_count += irq_mod_info.count_timer_calls;
+			ms->timer_calls++;
+			handle_irq_event_percpu(desc);
+		}
+		ms->timer_set++;
+		hrtimer_forward_now(&ms->timer, ms->sleep_ns);
+		return HRTIMER_RESTART;
+	}
+
+	/* Last round, remove from list and enable_irq() */
+	list_for_each_entry_safe(desc, next, &ms->descs, ms_node) {
+		list_del(&desc->ms_node);
+		INIT_LIST_HEAD(&desc->ms_node);
+		srcs++;
+		ms->enable_irq++;
+		enable_irq(desc->irq_data.irq); /* ok if the sync_lock/unlock are NULL */
+	}
+	smooth_avg(&ms->src_count, srcs * 256, 1);
+
+	ms->sleep_ns = 0; /* prepare to accumulate next moderation delay */
+
+	WARN_ONCE(ms->disable_irq != ms->enable_irq,
+		  "CPU %d irq disable %d enable %d (%s)\n",
+		  smp_processor_id(), ms->disable_irq, ms->enable_irq,
+		  "bookkeeping error, some irq will be stuck");
+
+	return HRTIMER_NORESTART;
+}
+
+/* Initialize moderation state in desc_set_defaults() */
+void irq_moderation_init_fields(struct irq_desc *desc)
+{
+	INIT_LIST_HEAD(&desc->ms_node);
+	desc->moderation_mode = 0;
+}
+
+/* Per-CPU state initialization */
+void irq_moderation_percpu_init(void)
+{
+	struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
+
+	hrtimer_setup(&ms->timer, moderation_timer_cb, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED_HARD);
+	INIT_LIST_HEAD(&ms->descs);
+}
+
+static void set_moderation_mode(struct irq_desc *desc, bool mode)
+{
+	if (desc) {
+		struct irq_chip *chip = desc->irq_data.chip;
+
+		/* Make sure this is MSI and we can run enable_irq() from irq context */
+		mode &= desc->handle_irq == handle_edge_irq && chip && chip->irq_bus_lock == NULL &&
+			chip->irq_bus_sync_unlock == NULL;
+		if (mode != desc->moderation_mode)
+			desc->moderation_mode = mode;
+	}
+}
+
+#pragma clang diagnostic error
"-Wformat" +/* Print statistics */ +static int moderation_show(struct seq_file *p, void *v) +{ + ulong irq_rate =3D 0, irq_high =3D 0, my_irq_high =3D 0, hardirq_high =3D= 0; + uint delay_us =3D irq_mod_info.delay_us; + struct irq_mod_state *ms; + u64 now =3D ktime_get_ns(); + int j, active_cpus =3D 0; + struct irq_desc *desc =3D p->private; + + if (desc) { + seq_printf(p, "%s\n", desc->moderation_mode ? "on" : "off"); + return 0; + } + + seq_puts(p, "# CPU irq/s my_irq/s cpus srcs delay_ns irq_hi= my_irq_hi" + " hardirq_hi timer_set disable_irq from_msi timer_calls stra= y_irq\n"); + for_each_online_cpu(j) { + ms =3D per_cpu_ptr(&irq_mod_state, j); + if (now - ms->last_ns > NSEC_PER_SEC) { + ms->my_irq_rate =3D 0; + ms->irq_rate =3D 0; + ms->cpu_count =3D 0; + } else { /* average irq_rate over active CPUs */ + active_cpus++; + irq_rate +=3D ms->irq_rate; + } + /* compute totals */ + irq_high +=3D ms->irq_high; + my_irq_high +=3D ms->my_irq_high; + hardirq_high +=3D ms->hardirq_high; + + seq_printf(p, "%5u %8u %8u %4u %4u %8u %11u %11u %11u %11u %11= u %11u %11u %9u\n", + j, ms->irq_rate, ms->my_irq_rate, (ms->cpu_count + 128) / 256, + (ms->src_count + 128) / 256, ms->mod_ns, ms->irq_high, ms->my_irq_hi= gh, + ms->hardirq_high, ms->timer_set, ms->disable_irq, + ms->from_posted_msi, ms->timer_calls, ms->stray_irq); + } + + seq_printf(p, "\n" + "enabled %s\n" + "delay_us %u\n" + "timer_rounds %u\n" + "count_timer_calls %s\n", + str_yes_no(delay_us), + delay_us, + irq_mod_info.timer_rounds, + str_yes_no(irq_mod_info.count_timer_calls)); + + return 0; +} + +static int moderation_open(struct inode *inode, struct file *file) +{ + return single_open(file, moderation_show, pde_data(inode)); +} + +/* helpers to set values */ +struct var_names { + const char *name; + uint *val; + int min; + int max; +} var_names[] =3D { + { "delay_us", &irq_mod_info.delay_us, 0, 500 }, + { "timer_rounds", &irq_mod_info.timer_rounds, 0, 50 }, + { "count_timer_calls", 
&irq_mod_info.count_timer_calls, 0, 1 },
+	{}
+};
+
+static int set_parameter(const char *buf, int len)
+{
+	struct var_names *n;
+	int l, val;
+
+	for (n = var_names; n->name; n++) {
+		l = strlen(n->name);
+		if (len >= l + 2 && !strncmp(buf, n->name, l) && buf[l] == '=')
+			break;
+	}
+	if (!n->name || !n->val)
+		return -EINVAL;
+	if (kstrtoint(buf + l + 1, 0, &val)) /* val is int; negative input clamps to min */
+		return -EINVAL;
+	WRITE_ONCE(*(n->val), clamp(val, n->min, n->max));
+	irq_mod_info.procfs_write_ns = ktime_get_ns();
+	return len;
+}
+
+static ssize_t moderation_write(struct file *f, const char __user *buf, size_t count, loff_t *ppos)
+{
+	char val[40]; /* bounded string size */
+	struct irq_desc *desc = (struct irq_desc *)pde_data(file_inode(f));
+
+	if (count == 0 || count + 1 > sizeof(val))
+		return -EINVAL;
+	if (copy_from_user(val, buf, count))
+		return -EFAULT;
+	val[count] = '\0';
+	if (val[count - 1] == '\n')
+		val[count - 1] = '\0';
+	if (!desc)
+		return set_parameter(val, count);
+
+	if (!strcmp(val, "off") || !strcmp(val, "0"))
+		set_moderation_mode(desc, false);
+	else if (!strcmp(val, "on") || !strcmp(val, "1"))
+		set_moderation_mode(desc, true);
+	else
+		return -EINVAL;
+	return count; /* consume all */
+}
+
+static const struct proc_ops proc_ops = {
+	.proc_open = moderation_open,
+	.proc_read = seq_read,
+	.proc_lseek = seq_lseek,
+	.proc_release = single_release,
+	.proc_write = moderation_write,
+};
+
+void irq_moderation_procfs_entry(struct irq_desc *desc, umode_t umode)
+{
+	if (umode)
+		proc_create_data("soft_moderation", umode, desc->dir, &proc_ops, desc);
+	else
+		remove_proc_entry("soft_moderation", desc->dir);
+}
+
+static int __init init_irq_moderation(void)
+{
+	proc_create_data("irq/soft_moderation", 0644, NULL, &proc_ops, (void *)0);
+	return 0;
+}
+
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_VERSION("1.0");
+MODULE_AUTHOR("Luigi Rizzo ");
+MODULE_DESCRIPTION("Platform wide software interrupt moderation");
+module_init(init_irq_moderation); diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c index 29c2404e743be..3d2b583b9443e 100644 --- a/kernel/irq/proc.c +++ b/kernel/irq/proc.c @@ -374,6 +374,7 @@ void register_irq_proc(unsigned int irq, struct irq_des= c *desc) proc_create_single_data("effective_affinity_list", 0444, desc->dir, irq_effective_aff_list_proc_show, irqp); # endif + irq_moderation_procfs_entry(desc, 0644); #endif proc_create_single_data("spurious", 0444, desc->dir, irq_spurious_proc_show, (void *)(long)irq); @@ -395,6 +396,7 @@ void unregister_irq_proc(unsigned int irq, struct irq_d= esc *desc) remove_proc_entry("effective_affinity", desc->dir); remove_proc_entry("effective_affinity_list", desc->dir); # endif + irq_moderation_procfs_entry(desc, 0); /* remove */ #endif remove_proc_entry("spurious", desc->dir); =20 --=20 2.51.2.1041.gc1ab5b90ca-goog From nobody Sun Feb 8 00:34:52 2026 Received: from mail-ed1-f73.google.com (mail-ed1-f73.google.com [209.85.208.73]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 20F3E3502B0 for ; Wed, 12 Nov 2025 19:24:29 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.208.73 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1762975470; cv=none; b=LkZvNaaDfAwUd6mOXGMQvuS10g00Rx8AWv5VDmQqTKFi3PNfpURDXyOcZQ4z7MTHihVnF7A3HgDVhbbBGdCXXEx8Hgx1yoRuE3IdTl2/74dXKdIEparDAX6YTivNyYrjflFdq3JnP7kqmi6sDZMAHTmXXXCAlWz70cQOx81c43Q= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1762975470; c=relaxed/simple; bh=U1wRohHj26D11My7hyn5Fjdv4sejotMbBGmzGh0UhXc=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=Xw6bjRyrSbosR75FORXEiVs9MFKpxVAfnaGdyq+5kKQd76AXP71aEbhy+vUk5tGgiCF52Ntc3JusbFdYAmCYPZ1CumalBcV+6rfic/rj+lnZ4CzcjkHGhsrhquejHorcYBeJ5Kxrlp9b23oYOvQCyehX7/7aXep6CFDSFAjNVAY= 
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--lrizzo.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=i8cG4LGT; arc=none smtp.client-ip=209.85.208.73 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--lrizzo.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="i8cG4LGT" Received: by mail-ed1-f73.google.com with SMTP id 4fb4d7f45d1cf-6408222225eso1402a12.3 for ; Wed, 12 Nov 2025 11:24:28 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1762975467; x=1763580267; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=CIoXEE+S06QHrqc9zlnHrp1tKevTOCIMWDXUdlSULxc=; b=i8cG4LGTFBIlpmu5dMCuJZ7Q48y2UjORI1YkwIz3f8jUKd92CDbSpPTIB7HuuF7enN vVJaQAZeDWTepgMuxGM8hnZlRfQ1dZMFvOkLbOkyhur91IZYVEdiP0+4iUtaFH3wJFb+ I2faD2Bccfo5cQqS+Tq8lQ21c20lu4Xez4BS7f2fcYfC3Prx59MlMkZBGUrHLfUxooKO Qjw+wDMUPF7QTyeiVamr5Mx8UveqPGz7gtHpa2Ie8Cy4e1Iv0Js3JQ2EjYlkeCNbwChi RjI3V95c492yczXz9Plm/5l+mLRhwpu3X1imTd09RzmP00I1EeN1Frc2CpHBuh64447M Ut4A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1762975467; x=1763580267; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=CIoXEE+S06QHrqc9zlnHrp1tKevTOCIMWDXUdlSULxc=; b=D1TMOWzwkNPFma8x0eE1ZGhunjhXIecPbECUo+fH9kyqKqqwcsM6wkOzomMgGJU/mm G6Dioq5yumNvjX2IuvBwAa1TCsnD4D2BKh+4T2vwzjrTH0dj4ovHrxuWjvOrVQQCmkrJ hShLaCokn996iOrbboxu4zCBfqVR4BXyQo8+CLdMRTrGa1HL9XG2Fer6lvLH1c1ymO3F k/O++3DVZqpk1d7aFZItQBiefss0eyi9If3L6gJ2EtMKjRZmmCKQV2BrBLHXz88SY9tr 
7P2fyGw/vYYUqDD8+U3fcUWUYp+DN3PCUFOnh8fR/92xszNSapeFpnfclgi56+wzTCl+ logA== X-Gm-Message-State: AOJu0Yy+WixM1rq80BOV4vxA651asyNMb7yeG0AYblOIXfv+qZ7PphtK SYqhCrSqqGtAvJy3Ucd9yj388+8v5Sq4HZhLE5u2mnmnS49W2Xzb/LKd285CFoj1ZLsj9l9GF/Q v4BJqAw== X-Google-Smtp-Source: AGHT+IFKilkduTNedmFbVR6dbNG+KpfIItVMFlp40IXZ01vb1nyeI/Op4xDBXuM1bNYEeMsk9RIyypHkw1k= X-Received: from edok5.prod.google.com ([2002:aa7:c045:0:b0:640:9a27:321e]) (user=lrizzo job=prod-delivery.src-stubby-dispatcher) by 2002:a05:6402:909:b0:63c:1170:656a with SMTP id 4fb4d7f45d1cf-6431a586af8mr3787278a12.37.1762975467428; Wed, 12 Nov 2025 11:24:27 -0800 (PST) Date: Wed, 12 Nov 2025 19:24:05 +0000 In-Reply-To: <20251112192408.3646835-1-lrizzo@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20251112192408.3646835-1-lrizzo@google.com> X-Mailer: git-send-email 2.51.2.1041.gc1ab5b90ca-goog Message-ID: <20251112192408.3646835-4-lrizzo@google.com> Subject: [PATCH 3/6] genirq: soft_moderation: activate hooks in handle_irq_event() From: Luigi Rizzo To: Thomas Gleixner , Marc Zyngier , Luigi Rizzo , Paolo Abeni , Andrew Morton , Sean Christopherson , Jacob Pan Cc: linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, Bjorn Helgaas , Willem de Bruijn , Luigi Rizzo Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Activate soft_moderation via the hooks in handle_irq_event() and per-CPU and irq_desc initialization. This change only implements fixed moderation. It needs to be explicitly enabled at runtime on individual interrupts. 
Example (kernel built with CONFIG_IRQ_SOFT_MODERATION=y)

  # enable fixed moderation
  echo "delay_us=400" > /proc/irq/soft_moderation

  # enable on network interrupts (change name as appropriate)
  echo on | tee /proc/irq/*/*eth*/../soft_moderation

  # show it works by looking at counters in /proc/irq/soft_moderation
  cat /proc/irq/soft_moderation

  # Show runtime impact on ping times changing delay_us
  ping -n -f -q -c 1000 ${some_nearby_host}
  echo "delay_us=100" > /proc/irq/soft_moderation
  ping -n -f -q -c 1000 ${some_nearby_host}

Configuration is via module parameters (irq_moderation.${name}=${value})
or at runtime (echo "${name}=${value}" > /proc/irq/soft_moderation).

delay_us  0=off, range 1-500, default 100
    How long an interrupt is disabled after it fires. Small values are
    accumulated until they are large enough, e.g. 10us. As an example,
    a 2us value means that the timer is set only every 5 interrupts.

timer_rounds  0-20, default 0
    How many extra timer runs before re-enabling interrupts. This allows
    reducing the number of MSI interrupts while keeping delay_us small.
    This is similar to the "napi_defer_hard_irqs" option in NAPI, but with
    some subtle differences (e.g. here the number of rounds is
    deterministic, and interrupts are disabled at MSI level).
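
The accumulation rule for small delay_us values can be checked with a
user-space model. This is only a sketch: struct mod_model and
interrupts_until_timer() are hypothetical names that do not exist in the
series, and the model covers only the branch of irq_moderation_needed()
taken when the timer is not queued, with the 10000 ns threshold copied
from that function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Threshold used by irq_moderation_needed(): below 10us no timer is armed. */
#define MIN_SLEEP_NS 10000u

/* Hypothetical stand-in for the two irq_mod_state fields involved. */
struct mod_model {
	uint32_t mod_ns;   /* current per-interrupt moderation delay */
	uint32_t sleep_ns; /* delay accumulated since the last timer run */
};

/* Models irq_moderation_needed() when the hrtimer is not queued. */
static bool model_moderation_needed(struct mod_model *m)
{
	m->sleep_ns += m->mod_ns; /* accumulate sleep time */
	return m->sleep_ns >= MIN_SLEEP_NS;
}

/* Number of interrupts handled up to and including the one that arms the timer. */
static unsigned int interrupts_until_timer(uint32_t mod_ns)
{
	struct mod_model m = { .mod_ns = mod_ns, .sleep_ns = 0 };
	unsigned int n = 1;

	assert(mod_ns > 0); /* a zero delay never reaches the threshold */
	while (!model_moderation_needed(&m))
		n++;
	return n;
}
```

With mod_ns = 2000 (delay_us=2 in fixed mode) the model arms the timer on
the fifth interrupt, matching the description above; any delay of 10us or
more arms it on the first one.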
Change-Id: I47c5059ad537fcb9561f924620cf68e1d648aae6 --- arch/x86/kernel/cpu/common.c | 1 + drivers/irqchip/irq-gic-v3.c | 2 ++ kernel/irq/handle.c | 3 +++ kernel/irq/irqdesc.c | 1 + 4 files changed, 7 insertions(+) diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 02d97834a1d4d..1953419fde6ff 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -2440,6 +2440,7 @@ void cpu_init(void) =20 intel_posted_msi_init(); } + irq_moderation_percpu_init(); =20 mmgrab(&init_mm); cur->active_mm =3D &init_mm; diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c index 3de351e66ee84..902bcbf9d85d8 100644 --- a/drivers/irqchip/irq-gic-v3.c +++ b/drivers/irqchip/irq-gic-v3.c @@ -1226,6 +1226,8 @@ static void gic_cpu_sys_reg_init(void) WARN_ON(gic_dist_security_disabled() !=3D cpus_have_security_disabled); } =20 + irq_moderation_percpu_init(); + /* * Some firmwares hand over to the kernel with the BPR changed from * its reset value (and with a value large enough to prevent diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c index e103451243a0b..2cacceaaea9d0 100644 --- a/kernel/irq/handle.c +++ b/kernel/irq/handle.c @@ -12,6 +12,7 @@ #include #include #include +#include #include =20 #include @@ -254,9 +255,11 @@ irqreturn_t handle_irq_event(struct irq_desc *desc) irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS); raw_spin_unlock(&desc->lock); =20 + irq_moderation_hook(desc); /* may disable irq so must run unlocked */ ret =3D handle_irq_event_percpu(desc); =20 raw_spin_lock(&desc->lock); + irq_moderation_epilogue(desc); /* start moderation timer if needed */ irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS); return ret; } diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c index db714d3014b5f..e3efbecf5b937 100644 --- a/kernel/irq/irqdesc.c +++ b/kernel/irq/irqdesc.c @@ -134,6 +134,7 @@ static void desc_set_defaults(unsigned int irq, struct = irq_desc *desc, int node, desc->tot_count =3D 0; desc->name =3D 
NULL; desc->owner =3D owner; + irq_moderation_init_fields(desc); for_each_possible_cpu(cpu) *per_cpu_ptr(desc->kstat_irqs, cpu) =3D (struct irqstat) { }; desc_smp_init(desc, node, affinity); --=20 2.51.2.1041.gc1ab5b90ca-goog From nobody Sun Feb 8 00:34:52 2026 Received: from mail-ed1-f74.google.com (mail-ed1-f74.google.com [209.85.208.74]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2ECB0350A2A for ; Wed, 12 Nov 2025 19:24:30 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.208.74 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1762975473; cv=none; b=lGeVbuskI0Sp4Rn1r65F0lhZ667pmStN5pnx3aoJyiUvVmpqtlEBqX28UKunnOeQG9bs6KR3cQns1WUr0cwaTjYz6eOjq9qpTLTKWOPGnauHthsTxJ5R1R4IIJ/I+uiXmpGXnPruRBpWyj9HIQnOjvSK//e3ZXX+k2/WjCNeI9Y= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1762975473; c=relaxed/simple; bh=97ZKBa3Mxok7Ixq5KQmA3/oV4Kbeh4dCnPI534aVGCs=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=O2FERdbWfrCtM7o5PAkDddOHEb7U3M8cSzanWUeNC4vqtM6wVCEaLlAPDvro2e1nixBv5Mpl82E6ZSNYnMuQqBre0Akp+x1QNPl78skAawAGAmQkqOoJf9WQMGdLMwVsNtqzwrucBmqCXJDTJ7AdIsWu9Ac2pa+YHFNR2Igc+io= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--lrizzo.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=dNiGLOrG; arc=none smtp.client-ip=209.85.208.74 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--lrizzo.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="dNiGLOrG" Received: by 
mail-ed1-f74.google.com with SMTP id 4fb4d7f45d1cf-6418122dd7bso14136a12.0 for ; Wed, 12 Nov 2025 11:24:30 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1762975469; x=1763580269; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=4QyPyRpqYf57hBi/oo2GFY86WIPxhm1a6MpX+EQajfk=; b=dNiGLOrGPnhTXioWgUTfX1qZwcB/llIBCsA6x0/FZWQx4Zd8rguFW3q3peMHvXZSr6 f0KqeOlZXjXeMSIysk8lkjB6BFLwZMTjAC/WR/h9hmnxRjyxX4mWGfoFgLvlR3O4hJYE D5iSk9HOZO/KzPQ0v3PO3jZF9JgFpXW0xHtInuogIonTtWUtOl9jQSTjvqp76clLYBtz 1w1Ba3WraeOWGvjeli5HU1+zZcSa7lLrsjSi+Nb0Ww9VbAMWDt7GggZTFqR19n8oBejd fs4SWpKyMwZwxHDqqTJYy9fjckHNBTi7rOV0dH2clHtXVmv12FQxqgws2wthmeBK4ENI d02A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1762975469; x=1763580269; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=4QyPyRpqYf57hBi/oo2GFY86WIPxhm1a6MpX+EQajfk=; b=Nex9eQvTj4dYjY7cpXBUhnkegNvaAvxUSK59gRlINxyl2YcXOlWI+UIwU8XHvcIIY/ Ar/+jIIGoYloJ1bXZW6R+fT+kw9eg9vu85nz7xnmIxtWUif/b1/jQLEkafQEUqFd+VD6 w37AJWFYksyGpem1xCUKrlsOCALI0eOLF3EgYSJDxnbNtXVJnmxJNJddljxdIo7VUFZj ECl0UiN9oBcLfiIX6B3ANbQTMAjQTPEq6bNfJneq+2cWUZJQpl/wVp0ToDYqlMv8ILuH vzNVbGOZhW/Vqdg+/f1XWRtmc1sC0Diwl0UhZKLl8W0tpv+G9OZNmvbrfbJR+7mFIR90 OQog== X-Gm-Message-State: AOJu0YysFdecAv7bh2CQBXBbF9fZdt2RBI0jXUheKbdYIB4pEl9Zi7Ke SQEnHgZDTWVvCx5yzgGY/p8KfQcd40ye/8Vel3dMW/R5roHL612s41Zs+tPBlZ71ywogsrp0CA7 PZbnFLw== X-Google-Smtp-Source: AGHT+IFy70lDf/WZnWuSVrEkhLrRAG32Pal9GLoXRspx3Ed8ppMs2VQXtd1kFgrTgnPzuZI/FsqN/nSPurw= X-Received: from edc23.prod.google.com ([2002:a05:6402:4617:b0:640:b66f:1e57]) (user=lrizzo job=prod-delivery.src-stubby-dispatcher) by 2002:a05:6402:26c4:b0:640:b497:bf71 with SMTP id 4fb4d7f45d1cf-6431a4bf92cmr3903806a12.8.1762975469487; Wed, 12 Nov 2025 11:24:29 -0800 (PST) Date: Wed, 12 Nov 2025 
19:24:06 +0000 In-Reply-To: <20251112192408.3646835-1-lrizzo@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20251112192408.3646835-1-lrizzo@google.com> X-Mailer: git-send-email 2.51.2.1041.gc1ab5b90ca-goog Message-ID: <20251112192408.3646835-5-lrizzo@google.com> Subject: [PATCH 4/6] genirq: soft_moderation: implement adaptive moderation From: Luigi Rizzo To: Thomas Gleixner , Marc Zyngier , Luigi Rizzo , Paolo Abeni , Andrew Morton , Sean Christopherson , Jacob Pan Cc: linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, Bjorn Helgaas , Willem de Bruijn , Luigi Rizzo Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"

Add two control parameters (target_irq_rate and hardirq_percent)
to indicate the desired maximum values for these two metrics.

Every update_ms the hook in handle_irq_event() recomputes the total and
local interrupt rate and the amount of time spent in hardirq, compares
the values with the targets, and adjusts the moderation delay up or down.

The interrupt rate is computed in a scalable way by counting interrupts
per-CPU, and aggregating the value in a global variable only every
update_ms. Only CPUs that actively process interrupts access the shared
variable, so the system scales well even on very large servers.

EXAMPLE TESTING

Testing needs a workload that can exceed the limits, such as heavy
network or disk traffic. One can also use very low thresholds (e.g.
target_irq_rate=50000, hardirq_percent=10) to make it easier to go
above the limit.

  # configure maximum delay (which is also the fixed moderation delay)
  echo "delay_us=400" > /proc/irq/soft_moderation

  # enable on network interrupts (change name as appropriate)
  echo on | tee /proc/irq/*/*eth*/../soft_moderation

  # ping times should reflect the 400us
  ping -n -f -q -c 1000 ${some_nearby_host}

  # show actual per-cpu delays and statistics
  less /proc/irq/soft_moderation

  # configure adaptive bounds. The control loop will adjust values
  # based on actual load
  echo "target_irq_rate=1000000" > /proc/irq/soft_moderation
  echo "hardirq_percent=70" > /proc/irq/soft_moderation

  # ping times now should be much lower
  ping -n -f -q -c 1000 ${some_nearby_host}

  # show actual per-cpu delays and statistics
  less /proc/irq/soft_moderation

By generating high interrupt or hardirq load, one can also test
the effectiveness of the control scheme and the sensitivity to
control parameters.

NEW PARAMETERS

target_irq_rate  0 off, 0-50000000, default 0
    the total maximum acceptable interrupt rate.

hardirq_percent  0 off, 0-100, default 0
    the maximum acceptable percentage of time spent in hardirq.

update_ms  1-100, default 1
    how often the control loop will readjust the delay.
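
The control loop described above can be modelled in user space to get a
feel for how fast the delay reacts. This is a sketch under assumptions:
rate_below_target() and next_mod_ns() are hypothetical names, the decay
and grow factors are fixed at the defaults set in irq_mod_info (16 and 8),
and only the target_irq_rate check is modelled (the hardirq_percent check
is left out).

```c
#include <stdbool.h>
#include <stdint.h>

/* Defaults from irq_mod_info in this patch (assumed fixed in this model). */
#define DECAY_FACTOR 16u
#define GROW_FACTOR   8u

/*
 * Rate check modelled on __irq_moderation_adjust_delay(): a CPU is left
 * alone if either the global rate is below target, or its own rate is
 * below the fair share target_rate / num_cpus.
 */
static bool rate_below_target(uint64_t irq_rate, uint64_t my_irq_rate,
			      uint64_t num_cpus, uint64_t target_rate)
{
	bool global_rate_ok = irq_rate < target_rate;
	bool my_rate_ok = my_irq_rate * num_cpus < target_rate;

	return global_rate_ok || my_rate_ok;
}

/* One controller step: shrink mod_ns when below target, grow it otherwise. */
static uint32_t next_mod_ns(uint32_t mod_ns, uint32_t delay_ns,
			    bool below_target, uint32_t steps)
{
	if (below_target) {
		mod_ns -= mod_ns * steps / (steps + DECAY_FACTOR);
		if (mod_ns < 100)
			mod_ns = 0;         /* too small to matter, turn off */
	} else if (mod_ns < 500) {
		mod_ns = 500;               /* restart from a minimum delay */
	} else {
		mod_ns += mod_ns * steps / (steps + GROW_FACTOR);
		if (mod_ns > delay_ns)
			mod_ns = delay_ns;  /* cap to the configured maximum */
	}
	return mod_ns;
}
```

With steps = 1 each update above target grows the delay by about 1/9,
and each update below target shrinks it by about 1/17, so the delay
rises faster than it falls.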
Change-Id: I3cdc72041be1e3c793013d8804f484cdcbb455ab --- include/linux/irq_moderation.h | 9 ++- kernel/irq/irq_moderation.c | 143 ++++++++++++++++++++++++++++++++- 2 files changed, 147 insertions(+), 5 deletions(-) diff --git a/include/linux/irq_moderation.h b/include/linux/irq_moderation.h index 4d90d7c4ca26b..45df60230e42e 100644 --- a/include/linux/irq_moderation.h +++ b/include/linux/irq_moderation.h @@ -89,6 +89,8 @@ static inline void irq_moderation_start_timer(struct irq_= mod_state *ms) /*range*/2000, HRTIMER_MODE_REL_PINNED_HARD); } =20 +void __irq_moderation_adjust_delay(struct irq_mod_state *ms, u64 delta_tim= e, u64 update_ns); + static inline bool irq_moderation_enabled(void) { return READ_ONCE(irq_mod_info.delay_us); @@ -119,8 +121,13 @@ static inline void irq_moderation_adjust_delay(struct = irq_mod_state *ms) /* Fetch important state */ ms->delay_ns =3D clamp(irq_mod_info.delay_us, 1u, 500u) * NSEC_PER_USEC; =20 + /* If config changed, restart from the highest delay */ + if (ktime_compare(irq_mod_info.procfs_write_ns, ms->last_ns) > 0) + ms->mod_ns =3D ms->delay_ns; + ms->last_ns =3D now; - ms->mod_ns =3D ms->delay_ns; + /* Do the expensive processing */ + __irq_moderation_adjust_delay(ms, delta_time, update_ns); } =20 /* Return true if timer is active or delay is large enough to require mode= ration */ diff --git a/kernel/irq/irq_moderation.c b/kernel/irq/irq_moderation.c index a9d2bdcf4d8c7..0229697a6a95a 100644 --- a/kernel/irq/irq_moderation.c +++ b/kernel/irq/irq_moderation.c @@ -81,22 +81,127 @@ static_assert(offsetof(struct irq_mod_info, procfs_wri= te_ns) =3D=3D 64); struct irq_mod_info irq_mod_info ____cacheline_aligned =3D { .delay_us =3D 100, .update_ms =3D 1, + .scale_cpus =3D 100, .count_timer_calls =3D true, + .decay_factor =3D 16, + .grow_factor =3D 8, }; =20 module_param_named(delay_us, irq_mod_info.delay_us, uint, 0444); MODULE_PARM_DESC(delay_us, "Max moderation delay us, 0 =3D moderation off,= range 10..500."); =20 
+module_param_named(hardirq_percent, irq_mod_info.hardirq_percent, uint, 04= 44); +MODULE_PARM_DESC(hardirq_percent, "Target max hardirq percentage, 0 off."); + +module_param_named(target_irq_rate, irq_mod_info.target_irq_rate, uint, 04= 44); +MODULE_PARM_DESC(target_irq_rate, "Target max interrupt rate, 0 off."); + module_param_named(timer_rounds, irq_mod_info.timer_rounds, uint, 0444); MODULE_PARM_DESC(timer_rounds, "How many extra timer polls once moderation= triggered."); =20 +module_param_named(update_ms, irq_mod_info.update_ms, uint, 0444); +MODULE_PARM_DESC(update_ms, "Update interval in milliseconds, range 1-100"= ); + DEFINE_PER_CPU_ALIGNED(struct irq_mod_state, irq_mod_state); =20 +static inline uint get_grow_factor(void) { return clamp(irq_mod_info.grow_= factor, 8u, 64u); } +static inline uint get_decay_factor(void) { return clamp(irq_mod_info.deca= y_factor, 8u, 64u); } +static inline uint get_scale_cpus(void) { return clamp(irq_mod_info.scale_= cpus, 50u, 1000u); } + static void smooth_avg(u32 *dst, u32 val, u32 steps) { *dst =3D ((64 - steps) * *dst + steps * val) / 64; } =20 +/* Adjust the moderation delay, called at most every update_ns */ +void __irq_moderation_adjust_delay(struct irq_mod_state *ms, u64 delta_tim= e, u64 update_ns) +{ + /* Fetch configuration */ + u32 target_rate =3D clamp(irq_mod_info.target_irq_rate, 0u, 50000000u); + u32 hardirq_percent =3D clamp(irq_mod_info.hardirq_percent, 0u, 100u); + bool below_target =3D true; + /* Compute decay steps based on elapsed time */ + u32 steps =3D delta_time > 10 * update_ns ? 
10 : 1 + (delta_time / update_ns);
+
+	if (target_rate == 0 && hardirq_percent == 0) {	/* use fixed delay */
+		ms->mod_ns = ms->delay_ns;
+		ms->irq_rate = 0;
+		ms->my_irq_rate = 0;
+		ms->cpu_count = 0;
+		return;
+	}
+
+	if (target_rate > 0) {	/* control total and individual CPU rate */
+		u64 irq_rate, my_irq_rate, tmp, delta_irqs, num_cpus;
+		bool my_rate_ok, global_rate_ok;
+
+		/* Update global number of interrupts */
+		if (ms->irq_count < 1)	/* make sure it is always > 0 */
+			ms->irq_count = 1;
+		tmp = atomic_long_add_return(ms->irq_count, &irq_mod_info.total_intrs);
+		delta_irqs = tmp - ms->last_total_irqs;
+
+		/* Compute global rate, check if we are ok */
+		irq_rate = (delta_irqs * NSEC_PER_SEC) / delta_time;
+		global_rate_ok = irq_rate < target_rate;
+
+		ms->last_total_irqs = tmp;
+
+		/*
+		 * num_cpus is the number of CPUs actively handling interrupts
+		 * in the last interval. CPUs handling less than the fair share
+		 * target_rate / num_cpus do not need to be throttled.
+		 */
+		tmp = atomic_long_add_return(1, &irq_mod_info.total_cpus);
+		num_cpus = tmp - ms->last_total_cpus;
+		/* scale proportionally to time, reduce errors if we are idle for too long */
+		num_cpus = 1 + (num_cpus * update_ns + delta_time / 2) / delta_time;
+
+		/* Short intervals may underestimate sources. Apply a scale factor */
+		num_cpus = num_cpus * get_scale_cpus() / 100;
+
+		/* Compute our rate, check if we are ok */
+		my_irq_rate = (ms->irq_count * NSEC_PER_SEC) / delta_time;
+		my_rate_ok = my_irq_rate * num_cpus < target_rate;
+
+		ms->irq_count = 1;	/* reset for next cycle */
+		ms->last_total_cpus = tmp;
+
+		/* Use instantaneous rates to react. */
+		below_target = global_rate_ok || my_rate_ok;
+
+		/* Statistics (rates are smoothed averages) */
+		smooth_avg(&ms->irq_rate, irq_rate, steps);
+		smooth_avg(&ms->my_irq_rate, my_irq_rate, steps);
+		smooth_avg(&ms->cpu_count, num_cpus * 256, steps); /* scaled */
+		ms->my_irq_high += !my_rate_ok;
+		ms->irq_high += !global_rate_ok;
+	}
+
+	if (hardirq_percent > 0) {	/* control time spent in hardirq */
+		u64 cur = kcpustat_this_cpu->cpustat[CPUTIME_IRQ];
+		u64 irqtime = cur - ms->last_irqtime;
+		bool hardirq_ok = irqtime * 100 < delta_time * hardirq_percent;
+
+		below_target &= hardirq_ok;
+		ms->last_irqtime = cur;
+		ms->hardirq_high += !hardirq_ok;	/* statistics */
+	}
+
+	/* Controller: move mod_ns up/down if we are above/below target */
+	if (below_target) {
+		ms->mod_ns -= ms->mod_ns * steps / (steps + get_decay_factor());
+		if (ms->mod_ns < 100)
+			ms->mod_ns = 0;
+	} else if (ms->mod_ns < 500) {
+		ms->mod_ns = 500;
+	} else {
+		ms->mod_ns += ms->mod_ns * steps / (steps + get_grow_factor());
+		if (ms->mod_ns > ms->delay_ns)
+			ms->mod_ns = ms->delay_ns;	/* cap to delay_ns */
+	}
+}
+
 /* moderation timer handler, called in hardintr context */
 static enum hrtimer_restart moderation_timer_cb(struct hrtimer *timer)
 {
@@ -172,6 +277,13 @@ static void set_moderation_mode(struct irq_desc *desc, bool mode)
 	}
 }
 
+/* irq_to_desc is not exported. Wrap it in this function for a specific use. */
+void irq_moderation_set_mode(int irq, bool mode)
+{
+	set_moderation_mode(irq_to_desc(irq), mode);
+}
+EXPORT_SYMBOL(irq_moderation_set_mode);
+
 #pragma clang diagnostic error "-Wformat"
 /* Print statistics */
 static int moderation_show(struct seq_file *p, void *v)
@@ -215,12 +327,32 @@ static int moderation_show(struct seq_file *p, void *v)
 	seq_printf(p, "\n"
 		   "enabled %s\n"
 		   "delay_us %u\n"
+		   "target_irq_rate %u\n"
+		   "hardirq_percent %u\n"
 		   "timer_rounds %u\n"
-		   "count_timer_calls %s\n",
+		   "update_ms %u\n"
+		   "scale_cpus %u\n"
+		   "count_timer_calls %s\n"
+		   "decay_factor %u\n"
+		   "grow_factor %u\n",
 		   str_yes_no(delay_us),
-		   delay_us,
-		   irq_mod_info.timer_rounds,
-		   str_yes_no(irq_mod_info.count_timer_calls));
+		   delay_us, irq_mod_info.target_irq_rate, irq_mod_info.hardirq_percent,
+		   irq_mod_info.timer_rounds, irq_mod_info.update_ms,
+		   irq_mod_info.scale_cpus,
+		   str_yes_no(irq_mod_info.count_timer_calls),
+		   get_decay_factor(), get_grow_factor());
+
+	seq_printf(p,
+		   "irq_rate %lu\n"
+		   "irq_high %lu\n"
+		   "my_irq_high %lu\n"
+		   "hardirq_percent_high %lu\n"
+		   "total_interrupts %lu\n"
+		   "total_cpus %lu\n",
+		   active_cpus ? irq_rate / active_cpus : 0,
+		   irq_high, my_irq_high, hardirq_high,
+		   READ_ONCE(*((ulong *)&irq_mod_info.total_intrs)),
+		   READ_ONCE(*((ulong *)&irq_mod_info.total_cpus)));
 
 	return 0;
 }
@@ -238,7 +370,10 @@ struct var_names {
 	int max;
 } var_names[] = {
 	{ "delay_us", &irq_mod_info.delay_us, 0, 500 },
+	{ "target_irq_rate", &irq_mod_info.target_irq_rate, 0, 50000000 },
+	{ "hardirq_percent", &irq_mod_info.hardirq_percent, 0, 100 },
 	{ "timer_rounds", &irq_mod_info.timer_rounds, 0, 50 },
+	{ "update_ms", &irq_mod_info.update_ms, 1, 100 },
 	{ "count_timer_calls", &irq_mod_info.count_timer_calls, 0, 1 },
 	{}
 };
-- 
2.51.2.1041.gc1ab5b90ca-goog

From nobody Sun Feb 8 00:34:52 2026
Date: Wed, 12 Nov 2025 19:24:07 +0000
In-Reply-To: <20251112192408.3646835-1-lrizzo@google.com>
References: <20251112192408.3646835-1-lrizzo@google.com>
X-Mailer: git-send-email 2.51.2.1041.gc1ab5b90ca-goog
Message-ID: <20251112192408.3646835-6-lrizzo@google.com>
Subject: [PATCH 5/6] x86/irq: soft_moderation: add support for posted_msi (intel)
From: Luigi Rizzo
To: Thomas Gleixner, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
	Andrew Morton, Sean Christopherson, Jacob Pan
Cc: linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	Bjorn Helgaas, Willem de Bruijn, Luigi Rizzo
Content-Type: text/plain; charset="utf-8"

On recent Intel CPUs, with a kernel built with CONFIG_X86_POSTED_MSI=y
and booted with "intremap=posted_msi", all MSI interrupts that hit a CPU
issue a single POSTED_MSI interrupt, processed by
sysvec_posted_msi_notification(), instead of being delivered as separate
interrupts.

Add soft moderation hooks to that handler. Soft moderation on posted_msi
does not require per-source enable; irq_moderation.delay_us > 0 suffices.

To test, run a kernel with the above options and enable moderation by
setting delay_us > 0. The "from_msi" column in /proc/irq/soft_moderation
will then show a non-zero value.
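
The rearm decision added to the handler can be modelled in user space.
The sketch below is illustrative only and is not part of the patch:
struct msi_model, model_should_rearm() and model_selfcheck() are
hypothetical names, and the moderation_needed field is an assumed
stand-in for the result of irq_moderation_needed().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the per-CPU struct irq_mod_state. */
struct msi_model {
	int rounds_left;	/* > 0 while the moderation timer is pending */
	bool moderation_needed;	/* assumed stand-in for irq_moderation_needed() */
	bool kick_posted_msi;	/* timer callback should re-raise the vector */
	bool timer_started;	/* models the call to irq_moderation_start_timer() */
};

/* Mirrors the decision order of posted_msi_should_rearm() in this patch. */
static bool model_should_rearm(struct msi_model *ms, bool work_done)
{
	if (ms->rounds_left > 0)	/* timer pending, no rearm */
		return false;
	if (!work_done)			/* no work done, can rearm */
		return true;
	if (!ms->moderation_needed)	/* moderation not needed, rearm */
		return true;
	ms->kick_posted_msi = true;	/* defer the kick to the timer callback */
	ms->timer_started = true;
	return false;			/* timer now active, no rearm */
}

/* Walk each of the four outcomes once. */
static void model_selfcheck(void)
{
	struct msi_model pending = { .rounds_left = 2 };
	struct msi_model idle = { .moderation_needed = true };
	struct msi_model light = { 0 };
	struct msi_model busy = { .moderation_needed = true };

	assert(!model_should_rearm(&pending, true));
	assert(model_should_rearm(&idle, false));
	assert(model_should_rearm(&light, true));
	assert(!model_should_rearm(&busy, true));
	assert(busy.kick_posted_msi && busy.timer_started);
}
```

The pending-timer check comes first, so while a timer is outstanding the
notification is left disarmed and the timer callback re-raises the
vector instead.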

Change-Id: Idcd6005f05048c4b3f9d002c8587039b46bc9d73
---
 arch/x86/kernel/irq.c          | 12 +++++++
 include/linux/irq_moderation.h | 62 ++++++++++++++++++++++++++++++++++
 kernel/irq/irq_moderation.c    | 39 ++++++++++++++++-----
 3 files changed, 104 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 10721a1252269..241d57fadc30c 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -4,6 +4,7 @@
  */
 #include
 #include
+#include
 #include
 #include
 #include
@@ -448,6 +449,13 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification)
 	inc_irq_stat(posted_msi_notification_count);
 	irq_enter();
 
+	if (posted_msi_moderation_enabled()) {
+		if (posted_msi_should_rearm(handle_pending_pir(pid->pir, regs)))
+			goto rearm;
+		else
+			goto common_end;
+	}
+
 	/*
 	 * Max coalescing count includes the extra round of handle_pending_pir
 	 * after clearing the outstanding notification bit. Hence, at most
@@ -458,6 +466,7 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification)
 			break;
 	}
 
+rearm:
 	/*
 	 * Clear outstanding notification bit to allow new IRQ notifications,
 	 * do this last to maximize the window of interrupt coalescing.
@@ -471,6 +480,9 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification)
 	 */
 	handle_pending_pir(pid->pir, regs);
 
+common_end:
+	posted_msi_moderation_epilogue();
+
 	apic_eoi();
 	irq_exit();
 	set_irq_regs(old_regs);
diff --git a/include/linux/irq_moderation.h b/include/linux/irq_moderation.h
index 45df60230e42e..aabcfaba1aefc 100644
--- a/include/linux/irq_moderation.h
+++ b/include/linux/irq_moderation.h
@@ -155,6 +155,14 @@ static inline void irq_moderation_hook(struct irq_desc *desc)
 	if (!irq_moderation_enabled())
 		return;
 
+#ifdef CONFIG_X86_POSTED_MSI
+	if (ms->in_posted_msi) {	/* these calls are not moderated */
+		ms->from_posted_msi++;
+		ms->irq_count += irq_mod_info.count_msi_calls;
+		return;
+	}
+#endif
+
 	if (!READ_ONCE(desc->moderation_mode))
 		return;
 
@@ -193,11 +201,65 @@ static inline void irq_moderation_epilogue(const struct irq_desc *desc)
 	irq_moderation_start_timer(ms);
 }
 
+#ifdef CONFIG_X86_POSTED_MSI
+/*
+ * Helpers for sysvec_posted_msi_notification(), use as follows
+ *
+ *	if (posted_msi_moderation_enabled()) {
+ *		if (posted_msi_should_rearm(handle_pending_pir(pid->pir, regs)))
+ *			goto rearm;
+ *		else
+ *			goto common_end;
+ *	}
+ *	...
+ * common_end:
+ *	posted_msi_moderation_epilogue();
+ */
+static inline bool posted_msi_moderation_enabled(void)
+{
+	struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
+
+	if (!irq_moderation_enabled())
+		return false;
+	irq_moderation_adjust_delay(ms);
+	ms->in_posted_msi = true;	/* tell handler to not throttle these calls */
+	return true;
+}
+
+/* Decide whether or not to rearm posted_msi */
+static inline bool posted_msi_should_rearm(bool work_done)
+{
+	struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
+
+	if (ms->rounds_left > 0)	/* timer pending, no rearm */
+		return false;
+	if (!work_done)			/* no work done, can rearm */
+		return true;
+	if (!irq_moderation_needed(ms))	/* moderation not needed, rearm */
+		return true;
+	ms->kick_posted_msi = true;	/* do kick in timer callback */
+	irq_moderation_start_timer(ms);
+	return false;			/* timer now active, no rearm */
+}
+
+/* Cleanup state set in posted_msi_moderation_enabled() */
+static inline void posted_msi_moderation_epilogue(void)
+{
+	this_cpu_ptr(&irq_mod_state)->in_posted_msi = false;
+}
+#endif
+
 #else /* empty stubs to avoid conditional compilation */
 
 static inline void irq_moderation_hook(struct irq_desc *desc) {}
 static inline void irq_moderation_epilogue(const struct irq_desc *desc) {}
 
+#ifdef CONFIG_X86_POSTED_MSI
+static inline bool posted_msi_moderation_enabled(void) { return false; }
+static inline bool posted_msi_should_rearm(bool work_done) { return false; }
+static inline void posted_msi_moderation_epilogue(void) {}
+#endif
+
 #endif /* CONFIG_IRQ_SOFT_MODERATION */
 
 #endif /* _LINUX_IRQ_MODERATION_H */
diff --git a/kernel/irq/irq_moderation.c b/kernel/irq/irq_moderation.c
index 0229697a6a95a..672e161ecf29e 100644
--- a/kernel/irq/irq_moderation.c
+++ b/kernel/irq/irq_moderation.c
@@ -7,6 +7,15 @@
 #include
 #include /* interrupt.h, kcpustat_this_cpu */
 #include "internals.h"
+#ifdef CONFIG_X86
+#include
+#endif
+
+#ifdef CONFIG_IRQ_REMAP
+extern bool enable_posted_msi;
+#else
+static bool enable_posted_msi;
+#endif
 
 /*
  * Platform-wide software interrupt moderation.
@@ -29,6 +38,10 @@
  *   moderation on that CPU/irq. If so, calls disable_irq_nosync() and starts
  *   an hrtimer with appropriate delay.
  *
+ * - Intel only: using "intremap=posted_msi", all the above is done in
+ *   sysvec_posted_msi_notification(). In this case all host device interrupts
+ *   are subject to moderation.
+ *
  * - the timer callback calls enable_irq() for all disabled interrupts on that
  *   CPU. That in turn will generate interrupts if there are pending events.
  *
@@ -82,6 +95,7 @@ struct irq_mod_info irq_mod_info ____cacheline_aligned = {
 	.delay_us = 100,
 	.update_ms = 1,
 	.scale_cpus = 100,
+	.count_msi_calls = true,
 	.count_timer_calls = true,
 	.decay_factor = 16,
 	.grow_factor = 8,
@@ -216,6 +230,17 @@ static enum hrtimer_restart moderation_timer_cb(struct hrtimer *timer)
 
 	ms->rounds_left--;
 
+#ifdef CONFIG_X86_POSTED_MSI
+	if (ms->kick_posted_msi) {
+		if (ms->rounds_left <= 0)
+			ms->kick_posted_msi = false;
+		/* Next call will be from timer, count it conditionally */
+		ms->irq_count -= !irq_mod_info.count_timer_calls;
+		ms->timer_calls++;
+		apic->send_IPI_self(POSTED_MSI_NOTIFICATION_VECTOR);
+	}
+#endif
+
 	if (ms->rounds_left > 0) {
 		/* Timer still alive. Just call the handlers */
 		list_for_each_entry_safe(desc, next, &ms->descs, ms_node) {
@@ -277,13 +302,6 @@ static void set_moderation_mode(struct irq_desc *desc, bool mode)
 	}
 }
 
-/* irq_to_desc is not exported. Wrap it in this function for a specific use. */
-void irq_moderation_set_mode(int irq, bool mode)
-{
-	set_moderation_mode(irq_to_desc(irq), mode);
-}
-EXPORT_SYMBOL(irq_moderation_set_mode);
-
 #pragma clang diagnostic error "-Wformat"
 /* Print statistics */
 static int moderation_show(struct seq_file *p, void *v)
@@ -325,7 +343,7 @@ static int moderation_show(struct seq_file *p, void *v)
 	}
 
 	seq_printf(p, "\n"
-		   "enabled %s\n"
+		   "enabled %s%s\n"
 		   "delay_us %u\n"
 		   "target_irq_rate %u\n"
 		   "hardirq_percent %u\n"
@@ -333,13 +351,15 @@ static int moderation_show(struct seq_file *p, void *v)
 		   "update_ms %u\n"
 		   "scale_cpus %u\n"
 		   "count_timer_calls %s\n"
+		   "count_msi_calls %s\n"
 		   "decay_factor %u\n"
 		   "grow_factor %u\n",
-		   str_yes_no(delay_us),
+		   str_yes_no(delay_us), enable_posted_msi ? " (also on posted_msi)" : "",
 		   delay_us, irq_mod_info.target_irq_rate, irq_mod_info.hardirq_percent,
 		   irq_mod_info.timer_rounds, irq_mod_info.update_ms,
 		   irq_mod_info.scale_cpus,
 		   str_yes_no(irq_mod_info.count_timer_calls),
+		   str_yes_no(irq_mod_info.count_msi_calls),
 		   get_decay_factor(), get_grow_factor());
 
 	seq_printf(p,
@@ -375,6 +395,7 @@ struct var_names {
 	{ "timer_rounds", &irq_mod_info.timer_rounds, 0, 50 },
 	{ "update_ms", &irq_mod_info.update_ms, 1, 100 },
 	{ "count_timer_calls", &irq_mod_info.count_timer_calls, 0, 1 },
+	{ "count_msi_calls", &irq_mod_info.count_msi_calls, 0, 1 },
 	{}
 };
 
-- 
2.51.2.1041.gc1ab5b90ca-goog

From nobody Sun Feb 8 00:34:52 2026
Date: Wed, 12 Nov 2025 19:24:08 +0000
In-Reply-To: <20251112192408.3646835-1-lrizzo@google.com>
References: <20251112192408.3646835-1-lrizzo@google.com>
X-Mailer: git-send-email 2.51.2.1041.gc1ab5b90ca-goog
Message-ID: <20251112192408.3646835-7-lrizzo@google.com>
Subject: [PATCH 6/6] genirq: soft_moderation: implement per-driver defaults (nvme and vfio)
From: Luigi Rizzo
To: Thomas Gleixner, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
	Andrew Morton, Sean Christopherson, Jacob Pan
Cc: linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	Bjorn Helgaas, Willem de Bruijn, Luigi Rizzo
Content-Type: text/plain; charset="utf-8"

Introduce helpers to implement per-driver module parameters that enable
moderation at boot/probe time. As an example, use the helpers in the
nvme and vfio drivers.

To test, boot a kernel with

	${driver}.soft_moderation=1	# enables moderation

and verify with "cat /proc/irq/soft_moderation" that the counters
increase.

Change-Id: Iaad4110977deb96df845501895e0043bd93fc350
---
 drivers/nvme/host/pci.c           |  3 +++
 drivers/vfio/pci/vfio_pci_intrs.c |  3 +++
 include/linux/interrupt.h         | 13 +++++++++++++
 kernel/irq/irq_moderation.c       | 11 +++++++++++
 4 files changed, 30 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 72fb675a696f4..b9d7bce30061f 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -72,6 +72,8 @@
 static_assert(MAX_PRP_RANGE / NVME_CTRL_PAGE_SIZE <=
 	(1 /* prp1 */ + NVME_MAX_NR_DESCRIPTORS * PRPS_PER_PAGE));
 
+DEFINE_IRQ_MODERATION_MODE_PARAMETER;
+
 static int use_threaded_interrupts;
 module_param(use_threaded_interrupts, int, 0444);
 
@@ -1989,6 +1991,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
 		result = queue_request_irq(nvmeq);
 		if (result < 0)
 			goto release_sq;
+		IRQ_MODERATION_SET_DEFAULT_MODE(pci_irq_vector(to_pci_dev(dev->dev), vector));
 	}
 
 	set_bit(NVMEQ_ENABLED, &nvmeq->flags);
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 30d3e921cb0de..e54d88cfe601d 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -22,6 +22,8 @@
 
 #include "vfio_pci_priv.h"
 
+DEFINE_IRQ_MODERATION_MODE_PARAMETER;
+
 struct vfio_pci_irq_ctx {
 	struct vfio_pci_core_device *vdev;
 	struct eventfd_ctx *trigger;
@@ -317,6 +319,7 @@ static int vfio_intx_enable(struct vfio_pci_core_device *vdev,
 		vfio_irq_ctx_free(vdev, ctx, 0);
 		return ret;
 	}
+	IRQ_MODERATION_SET_DEFAULT_MODE(pdev->irq);
 
 	return 0;
 }
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 007201c8db6dd..c7d68d8ec49d7 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -879,12 +879,25 @@ void irq_moderation_init_fields(struct irq_desc *desc);
 /* add/remove /proc/irq/NN/soft_moderation */
 void irq_moderation_procfs_entry(struct irq_desc *desc, umode_t umode);
 
+/* helpers for per-driver moderation mode settings */
+#define DEFINE_IRQ_MODERATION_MODE_PARAMETER \
+	static bool soft_moderation; \
+	module_param(soft_moderation, bool, 0644); \
+	MODULE_PARM_DESC(soft_moderation, "0: off, 1: disable_irq")
+
+void irq_moderation_set_mode(int irq, bool mode);
+#define IRQ_MODERATION_SET_DEFAULT_MODE(_irq) \
+	irq_moderation_set_mode(_irq, READ_ONCE(soft_moderation))
+
 #else /* empty stubs to avoid conditional compilation */
 
 static inline void irq_moderation_percpu_init(void) {}
 static inline void irq_moderation_init_fields(struct irq_desc *desc) {}
 static inline void irq_moderation_procfs_entry(struct irq_desc *desc, umode_t umode) {};
 
+#define DEFINE_IRQ_MODERATION_MODE_PARAMETER
+#define IRQ_MODERATION_SET_DEFAULT_MODE(_irq)
+
 #endif
 
 /*
diff --git a/kernel/irq/irq_moderation.c b/kernel/irq/irq_moderation.c
index 672e161ecf29e..3b3962dae33d1 100644
--- a/kernel/irq/irq_moderation.c
+++ b/kernel/irq/irq_moderation.c
@@ -83,6 +83,10 @@ static bool enable_posted_msi;
  *
  *     echo "on" > /proc/irq/NN/soft_moderation # use "off" to disable
  *
+ * For selected drivers, the default can also be supplied via module parameters
+ *
+ *     ${DRIVER}.soft_moderation=1
+ *
  * === MONITORING ===
  *
  * cat /proc/irq/soft_moderation shows per-CPU and global statistics.
@@ -302,6 +306,13 @@ static void set_moderation_mode(struct irq_desc *desc, bool mode)
 	}
 }
 
+/* irq_to_desc is not exported. Wrap it in this function for a specific use. */
+void irq_moderation_set_mode(int irq, bool mode)
+{
+	set_moderation_mode(irq_to_desc(irq), mode);
+}
+EXPORT_SYMBOL(irq_moderation_set_mode);
+
 #pragma clang diagnostic error "-Wformat"
 /* Print statistics */
 static int moderation_show(struct seq_file *p, void *v)
-- 
2.51.2.1041.gc1ab5b90ca-goog
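
The mod_ns controller in irq_moderation_adjust_delay() can be tried out
in user space with a small model. The sketch below is illustrative only
and is not part of the series: struct delay_model, model_adjust() and
model_adjust_selfcheck() are hypothetical names, the decay and grow
factors are passed in directly rather than read through
get_decay_factor()/get_grow_factor(), and delay_ns = 100000 assumes that
the default delay_us = 100 is simply converted to nanoseconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model of the two fields the controller touches. */
struct delay_model {
	uint64_t mod_ns;	/* current moderation delay */
	uint64_t delay_ns;	/* upper bound (assumed: delay_us * 1000) */
};

/* One controller step, following the update rule in the kernel code. */
static void model_adjust(struct delay_model *m, bool below_target,
			 uint64_t steps, uint64_t decay, uint64_t grow)
{
	if (below_target) {
		m->mod_ns -= m->mod_ns * steps / (steps + decay);
		if (m->mod_ns < 100)	/* small delays collapse to zero */
			m->mod_ns = 0;
	} else if (m->mod_ns < 500) {
		m->mod_ns = 500;	/* floor when moderation kicks in */
	} else {
		m->mod_ns += m->mod_ns * steps / (steps + grow);
		if (m->mod_ns > m->delay_ns)
			m->mod_ns = m->delay_ns;	/* cap to delay_ns */
	}
}

/* Three steps with the default factors (decay 16, grow 8). */
static void model_adjust_selfcheck(void)
{
	struct delay_model m = { .mod_ns = 0, .delay_ns = 100000 };

	model_adjust(&m, false, 1, 16, 8);	/* above target: jump to floor */
	assert(m.mod_ns == 500);
	model_adjust(&m, false, 1, 16, 8);	/* 500 + 500 * 1 / 9 */
	assert(m.mod_ns == 555);
	model_adjust(&m, true, 1, 16, 8);	/* 555 - 555 * 1 / 17 */
	assert(m.mod_ns == 523);
}
```

With steps = 1 and these factors, one step shrinks mod_ns by about 1/17
or grows it by about 1/9, so in this model the delay ramps up faster
than it decays.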