From nobody Mon Feb  9 04:58:57 2026
Date: Wed, 12 Nov 2025 19:24:06 +0000
In-Reply-To: <20251112192408.3646835-1-lrizzo@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
Mime-Version: 1.0
References: <20251112192408.3646835-1-lrizzo@google.com>
X-Mailer: git-send-email 2.51.2.1041.gc1ab5b90ca-goog
Message-ID: <20251112192408.3646835-5-lrizzo@google.com>
Subject: [PATCH 4/6] genirq: soft_moderation: implement adaptive moderation
From: Luigi Rizzo <lrizzo@google.com>
To: Thomas Gleixner <tglx@linutronix.de>, Marc Zyngier <maz@kernel.org>,
	Luigi Rizzo <rizzo.unipi@gmail.com>, Paolo Abeni <pabeni@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
 Sean Christopherson <seanjc@google.com>,
	Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	Bjorn Helgaas <bhelgaas@google.com>, Willem de Bruijn <willemb@google.com>,
	Luigi Rizzo <lrizzo@google.com>
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Add two control parameters, target_irq_rate and hardirq_percent, that
set the desired maximum interrupt rate and the desired maximum
percentage of time spent in hardirq.

Every update_ms the hook in handle_irq_event() recomputes the total and
local interrupt rate and the amount of time spent in hardirq, compares
the values with the targets, and adjusts the moderation delay up or down.
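
As an illustration only, the up/down adjustment can be sketched in
user-space C. The helper name next_mod_ns() and the 32-bit types are
assumptions made for this sketch; the arithmetic follows the controller
at the end of __irq_moderation_adjust_delay() in this patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: one controller step.
 * mod_ns is the current moderation delay, delay_ns the configured maximum.
 * steps grows with the time elapsed since the previous update.
 */
uint32_t next_mod_ns(uint32_t mod_ns, uint32_t delay_ns, int below_target,
		     uint32_t steps, uint32_t decay_factor, uint32_t grow_factor)
{
	assert(decay_factor > 0 && grow_factor > 0);

	if (below_target) {
		/* Below target: shrink the delay, snap small values to 0 */
		mod_ns -= (uint64_t)mod_ns * steps / (steps + decay_factor);
		return mod_ns < 100 ? 0 : mod_ns;
	}
	if (mod_ns < 500)
		return 500;	/* above target: restart from a minimum delay */

	/* Above target: grow the delay, capped at delay_ns */
	mod_ns += (uint64_t)mod_ns * steps / (steps + grow_factor);
	return mod_ns > delay_ns ? delay_ns : mod_ns;
}
```

With the defaults in this patch (decay_factor 16, grow_factor 8) and
steps = 1, a delay of 1700 ns decays to 1600 ns and a delay of 1000 ns
grows to 1111 ns.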

The interrupt rate is computed in a scalable way by counting interrupts
per-CPU, and aggregating the value in a global variable only every
update_ms. Only CPUs that actively process interrupts are actually
accessing the shared variable, so the system scales well even on very
large servers.
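
A rough user-space sketch of this counting scheme, using C11 atomics
in place of the kernel primitives. The names cpu_state, count_irq(),
flush_and_get_rate() and demo_global_rate() are made up for the
example, and the sketch resets the local counter to 0 where the patch
resets it to 1.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

struct cpu_state {		/* hypothetical stand-in for per-CPU state */
	uint64_t irq_count;	/* interrupts seen locally since last flush */
	uint64_t last_total;	/* global total observed at last flush */
};

/* Fast path: touches only CPU-local data */
void count_irq(struct cpu_state *cs)
{
	cs->irq_count++;
}

/* Slow path, once per update interval: one atomic add on shared data */
uint64_t flush_and_get_rate(atomic_uint_fast64_t *total_intrs,
			    struct cpu_state *cs, uint64_t delta_ns)
{
	uint64_t total, delta_irqs;

	assert(delta_ns > 0);
	total = atomic_fetch_add(total_intrs, cs->irq_count) + cs->irq_count;
	delta_irqs = total - cs->last_total;	/* includes other CPUs' flushes */
	cs->last_total = total;
	cs->irq_count = 0;
	return delta_irqs * NSEC_PER_SEC / delta_ns;	/* interrupts per second */
}

/* Two simulated CPUs flushing after a 1 ms interval */
uint64_t demo_global_rate(void)
{
	atomic_uint_fast64_t total_intrs;
	struct cpu_state cpu0 = { 0, 0 }, cpu1 = { 0, 0 };
	int i;

	atomic_init(&total_intrs, 0);
	for (i = 0; i < 1000; i++)
		count_irq(&cpu0);
	for (i = 0; i < 500; i++)
		count_irq(&cpu1);
	flush_and_get_rate(&total_intrs, &cpu0, 1000000);
	/* cpu1 sees its own 500 plus the 1000 flushed by cpu0 */
	return flush_and_get_rate(&total_intrs, &cpu1, 1000000);
}
```

The shared counter is written once per CPU per interval rather than
once per interrupt, which is what keeps cache-line contention low.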

EXAMPLE TESTING

Testing requires a workload that can exceed the limits, such as heavy
network or disk traffic. To make the limits easier to exceed, one can
use very low thresholds (e.g. target_irq_rate=50000,
hardirq_percent=10).

  # configure maximum delay (which is also the fixed moderation delay)
  echo "delay_us=400" > /proc/irq/soft_moderation

  # enable on network interrupts (change name as appropriate)
  echo on | tee /proc/irq/*/*eth*/../soft_moderation

  # ping times should reflect the 400us
  ping -n -f -q -c 1000 ${some_nearby_host}
  # show actual per-cpu delays and statistics
  less /proc/irq/soft_moderation

  # configure adaptive bounds. The control loop will adjust values
  # based on actual load
  echo "target_irq_rate=1000000" > /proc/irq/soft_moderation
  echo "hardirq_percent=70" > /proc/irq/soft_moderation

  # ping times now should be much lower
  ping -n -f -q -c 1000 ${some_nearby_host}

  # show actual per-cpu delays and statistics
  less /proc/irq/soft_moderation

By generating high interrupt or hardirq load, one can also test
the effectiveness of the control scheme and the sensitivity to
control parameters.

NEW PARAMETERS

target_irq_rate   0 off, 0-50000000, default 0
  the maximum acceptable total interrupt rate, in interrupts per second.

hardirq_percent   0 off, 0-100, default 0
  the maximum acceptable percentage of time spent in hardirq.

update_ms         1-100, default 1
  how often, in milliseconds, the control loop readjusts the delay.
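
To show how the two targets combine, here is a small user-space sketch
of the decision made on each update. The function name below_target()
and its argument list are assumptions for illustration; the comparisons
mirror those in __irq_moderation_adjust_delay(). The case where both
targets are 0 (fixed delay) is handled separately in the patch and is
not modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true when the delay may be decreased.
 * Rates are in interrupts per second, times in nanoseconds.
 */
bool below_target(uint64_t irq_rate, uint64_t my_irq_rate, uint64_t num_cpus,
		  uint32_t target_irq_rate,
		  uint64_t irqtime_ns, uint64_t delta_ns, uint32_t hardirq_percent)
{
	bool ok = true;

	assert(hardirq_percent <= 100);

	if (target_irq_rate > 0) {
		/* OK if the total rate is low, or this CPU is under its fair share */
		bool global_rate_ok = irq_rate < target_irq_rate;
		bool my_rate_ok = my_irq_rate * num_cpus < target_irq_rate;

		ok = global_rate_ok || my_rate_ok;
	}
	if (hardirq_percent > 0)
		/* Time spent in hardirq must also stay under the configured share */
		ok = ok && (irqtime_ns * 100 < delta_ns * hardirq_percent);

	return ok;
}
```

For example, with target_irq_rate=1000000 and four active CPUs, a CPU
handling 100000 interrupts/s is not throttled even when the total rate
is 2000000, because it is below its fair share of 250000.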

Change-Id: I3cdc72041be1e3c793013d8804f484cdcbb455ab
---
 include/linux/irq_moderation.h |   9 ++-
 kernel/irq/irq_moderation.c    | 143 ++++++++++++++++++++++++++++++++-
 2 files changed, 147 insertions(+), 5 deletions(-)

diff --git a/include/linux/irq_moderation.h b/include/linux/irq_moderation.h
index 4d90d7c4ca26b..45df60230e42e 100644
--- a/include/linux/irq_moderation.h
+++ b/include/linux/irq_moderation.h
@@ -89,6 +89,8 @@ static inline void irq_moderation_start_timer(struct irq_mod_state *ms)
 			       /*range*/2000, HRTIMER_MODE_REL_PINNED_HARD);
 }
 
+void __irq_moderation_adjust_delay(struct irq_mod_state *ms, u64 delta_time, u64 update_ns);
+
 static inline bool irq_moderation_enabled(void)
 {
 	return READ_ONCE(irq_mod_info.delay_us);
@@ -119,8 +121,13 @@ static inline void irq_moderation_adjust_delay(struct irq_mod_state *ms)
 	/* Fetch important state */
 	ms->delay_ns = clamp(irq_mod_info.delay_us, 1u, 500u) * NSEC_PER_USEC;
 
+	/* If config changed, restart from the highest delay */
+	if (ktime_compare(irq_mod_info.procfs_write_ns, ms->last_ns) > 0)
+		ms->mod_ns = ms->delay_ns;
+
 	ms->last_ns = now;
-	ms->mod_ns = ms->delay_ns;
+	/* Do the expensive processing */
+	__irq_moderation_adjust_delay(ms, delta_time, update_ns);
 }
 
 /* Return true if timer is active or delay is large enough to require moderation */
diff --git a/kernel/irq/irq_moderation.c b/kernel/irq/irq_moderation.c
index a9d2bdcf4d8c7..0229697a6a95a 100644
--- a/kernel/irq/irq_moderation.c
+++ b/kernel/irq/irq_moderation.c
@@ -81,22 +81,127 @@ static_assert(offsetof(struct irq_mod_info, procfs_write_ns) == 64);
 struct irq_mod_info irq_mod_info ____cacheline_aligned = {
 	.delay_us = 100,
 	.update_ms = 1,
+	.scale_cpus = 100,
 	.count_timer_calls = true,
+	.decay_factor = 16,
+	.grow_factor = 8,
 };
 
 module_param_named(delay_us, irq_mod_info.delay_us, uint, 0444);
 MODULE_PARM_DESC(delay_us, "Max moderation delay us, 0 = moderation off, range 10..500.");
 
+module_param_named(hardirq_percent, irq_mod_info.hardirq_percent, uint, 0444);
+MODULE_PARM_DESC(hardirq_percent, "Target max hardirq percentage, 0 off.");
+
+module_param_named(target_irq_rate, irq_mod_info.target_irq_rate, uint, 0444);
+MODULE_PARM_DESC(target_irq_rate, "Target max interrupt rate, 0 off.");
+
 module_param_named(timer_rounds, irq_mod_info.timer_rounds, uint, 0444);
 MODULE_PARM_DESC(timer_rounds, "How many extra timer polls once moderation triggered.");
 
+module_param_named(update_ms, irq_mod_info.update_ms, uint, 0444);
+MODULE_PARM_DESC(update_ms, "Update interval in milliseconds, range 1-100");
+
 DEFINE_PER_CPU_ALIGNED(struct irq_mod_state, irq_mod_state);
 
+static inline uint get_grow_factor(void) { return clamp(irq_mod_info.grow_factor, 8u, 64u); }
+static inline uint get_decay_factor(void) { return clamp(irq_mod_info.decay_factor, 8u, 64u); }
+static inline uint get_scale_cpus(void) { return clamp(irq_mod_info.scale_cpus, 50u, 1000u); }
+
 static void smooth_avg(u32 *dst, u32 val, u32 steps)
 {
 	*dst = ((64 - steps) * *dst + steps * val) / 64;
 }
 
+/* Adjust the moderation delay, called at most every update_ns */
+void __irq_moderation_adjust_delay(struct irq_mod_state *ms, u64 delta_time, u64 update_ns)
+{
+	/* Fetch configuration */
+	u32 target_rate = clamp(irq_mod_info.target_irq_rate, 0u, 50000000u);
+	u32 hardirq_percent = clamp(irq_mod_info.hardirq_percent, 0u, 100u);
+	bool below_target = true;
+	/* Compute decay steps based on elapsed time */
+	u32 steps = delta_time > 10 * update_ns ? 10 : 1 + (delta_time / update_ns);
+
+	if (target_rate == 0 && hardirq_percent == 0) {	/* use fixed delay */
+		ms->mod_ns = ms->delay_ns;
+		ms->irq_rate = 0;
+		ms->my_irq_rate = 0;
+		ms->cpu_count = 0;
+		return;
+	}
+
+	if (target_rate > 0) {	/* control total and individual CPU rate */
+		u64 irq_rate, my_irq_rate, tmp, delta_irqs, num_cpus;
+		bool my_rate_ok, global_rate_ok;
+
+		/* Update global number of interrupts */
+		if (ms->irq_count < 1)	/* make sure it is always > 0 */
+			ms->irq_count = 1;
+		tmp = atomic_long_add_return(ms->irq_count, &irq_mod_info.total_intrs);
+		delta_irqs = tmp - ms->last_total_irqs;
+
+		/* Compute global rate, check if we are ok */
+		irq_rate = (delta_irqs * NSEC_PER_SEC) / delta_time;
+		global_rate_ok = irq_rate < target_rate;
+
+		ms->last_total_irqs = tmp;
+
+		/*
+		 * num_cpus is the number of CPUs actively handling interrupts
+		 * in the last interval. CPUs handling less than the fair share
+		 * target_rate / num_cpus do not need to be throttled.
+		 */
+		tmp = atomic_long_add_return(1, &irq_mod_info.total_cpus);
+		num_cpus = tmp - ms->last_total_cpus;
+		/* scale proportionally to time, reduce errors if we are idle for too long */
+		num_cpus = 1 + (num_cpus * update_ns + delta_time / 2) / delta_time;
+
+		/* Short intervals may underestimate sources. Apply a scale factor */
+		num_cpus = num_cpus * get_scale_cpus() / 100;
+
+		/* Compute our rate, check if we are ok */
+		my_irq_rate = (ms->irq_count * NSEC_PER_SEC) / delta_time;
+		my_rate_ok = my_irq_rate * num_cpus < target_rate;
+
+		ms->irq_count = 1;	/* reset for next cycle */
+		ms->last_total_cpus = tmp;
+
+		/* Use instantaneous rates to react. */
+		below_target = global_rate_ok || my_rate_ok;
+
+		/* Statistics (rates are smoothed averages) */
+		smooth_avg(&ms->irq_rate, irq_rate, steps);
+		smooth_avg(&ms->my_irq_rate, my_irq_rate, steps);
+		smooth_avg(&ms->cpu_count, num_cpus * 256, steps); /* scaled */
+		ms->my_irq_high += !my_rate_ok;
+		ms->irq_high += !global_rate_ok;
+	}
+
+	if (hardirq_percent > 0) {		/* control time spent in hardirq */
+		u64 cur = kcpustat_this_cpu->cpustat[CPUTIME_IRQ];
+		u64 irqtime = cur - ms->last_irqtime;
+		bool hardirq_ok = irqtime * 100 < delta_time * hardirq_percent;
+
+		below_target &= hardirq_ok;
+		ms->last_irqtime = cur;
+		ms->hardirq_high += !hardirq_ok;	/* statistics */
+	}
+
+	/* Controller: move mod_ns up/down if we are above/below target */
+	if (below_target) {
+		ms->mod_ns -= ms->mod_ns * steps / (steps + get_decay_factor());
+		if (ms->mod_ns < 100)
+			ms->mod_ns = 0;
+	} else if (ms->mod_ns < 500) {
+		ms->mod_ns = 500;
+	} else {
+		ms->mod_ns += ms->mod_ns * steps / (steps + get_grow_factor());
+		if (ms->mod_ns > ms->delay_ns)
+			ms->mod_ns = ms->delay_ns;	/* cap to delay_ns */
+	}
+}
+
 /* moderation timer handler, called in hardintr context */
 static enum hrtimer_restart moderation_timer_cb(struct hrtimer *timer)
 {
@@ -172,6 +277,13 @@ static void set_moderation_mode(struct irq_desc *desc, bool mode)
 	}
 }
 
+/* irq_to_desc is not exported. Wrap it in this function for a specific use. */
+void irq_moderation_set_mode(int irq, bool mode)
+{
+	set_moderation_mode(irq_to_desc(irq), mode);
+}
+EXPORT_SYMBOL(irq_moderation_set_mode);
+
 #pragma clang diagnostic error "-Wformat"
 /* Print statistics */
 static int moderation_show(struct seq_file *p, void *v)
@@ -215,12 +327,32 @@ static int moderation_show(struct seq_file *p, void *v)
 	seq_printf(p, "\n"
 		   "enabled              %s\n"
 		   "delay_us             %u\n"
+		   "target_irq_rate      %u\n"
+		   "hardirq_percent      %u\n"
 		   "timer_rounds         %u\n"
-		   "count_timer_calls    %s\n",
+		   "update_ms            %u\n"
+		   "scale_cpus           %u\n"
+		   "count_timer_calls    %s\n"
+		   "decay_factor         %u\n"
+		   "grow_factor          %u\n",
 		   str_yes_no(delay_us),
-		   delay_us,
-		   irq_mod_info.timer_rounds,
-		   str_yes_no(irq_mod_info.count_timer_calls));
+		   delay_us, irq_mod_info.target_irq_rate, irq_mod_info.hardirq_percent,
+		   irq_mod_info.timer_rounds, irq_mod_info.update_ms,
+		   irq_mod_info.scale_cpus,
+		   str_yes_no(irq_mod_info.count_timer_calls),
+		   get_decay_factor(), get_grow_factor());
+
+	seq_printf(p,
+		   "irq_rate             %lu\n"
+		   "irq_high             %lu\n"
+		   "my_irq_high          %lu\n"
+		   "hardirq_percent_high %lu\n"
+		   "total_interrupts     %lu\n"
+		   "total_cpus           %lu\n",
+		   active_cpus ? irq_rate / active_cpus : 0,
+		   irq_high, my_irq_high, hardirq_high,
+		   READ_ONCE(*((ulong *)&irq_mod_info.total_intrs)),
+		   READ_ONCE(*((ulong *)&irq_mod_info.total_cpus)));
 
 	return 0;
 }
@@ -238,7 +370,10 @@ struct var_names {
 	int max;
 } var_names[] = {
 	{ "delay_us", &irq_mod_info.delay_us, 0, 500 },
+	{ "target_irq_rate", &irq_mod_info.target_irq_rate, 0, 50000000 },
+	{ "hardirq_percent", &irq_mod_info.hardirq_percent, 0, 100 },
 	{ "timer_rounds", &irq_mod_info.timer_rounds, 0, 50 },
+	{ "update_ms", &irq_mod_info.update_ms, 1, 100 },
 	{ "count_timer_calls", &irq_mod_info.count_timer_calls, 0, 1 },
 	{}
 };
-- 
2.51.2.1041.gc1ab5b90ca-goog