From nobody Thu Apr 9 17:58:04 2026
From: "Herton R.
Krzesinski"
To: linux-kernel@vger.kernel.org
Cc: herton@redhat.com, frederic@kernel.org, peterz@infradead.org, mingo@kernel.org, paulmck@linux.vnet.ibm.com, tglx@linutronix.de, anna-maria@linutronix.de, kyin@redhat.com, jaeshin@redhat.com
Subject: [RFC] Processing of raised_list can stall if an IPI/interrupt is missed
Date: Tue, 3 Mar 2026 16:07:15 -0300
Message-ID: <20260303190715.935867-1-herton@redhat.com>

Hello,

I recently saw a report where a system went down after it stopped processing
irq work items in raised_list (from kernel/irq_work.c). The system in
question, from the vmcore data I got, is a Linux guest under VMware (on an
x86_64 host). It seems to be a very rare occurrence; as far as I know, only
two different users have reported it so far. While it was reported on an old
RHEL-based kernel (4.18), I believe the issue could still happen in newer
kernels, since the processing of raised_list has in principle not changed.

On an x86_64 system, from my understanding of the code, there are two ways
raised_list can be consumed: either through irq_work_tick() or through the
irq work interrupt/IPI. If the system has a working APIC, raised_list items
are only consumed through the interrupt/IPI, with irq_work_run() being
called from arch/x86/kernel/irq_work.c; in this case irq_work_tick() will
not call irq_work_run_list(raised) because of the
arch_irq_work_has_interrupt() check.

So in this specific case, if the interrupt/IPI is somehow missed, processing
of items in raised_list can stall forever: __irq_work_queue_local() calls
llist_add() and only raises the IPI when the list was empty before the add,
so if the list was not consumed due to a missed interrupt/IPI, it will never
call irq_work_raise() again.
This is what I saw in the vmcore from one of the reports mentioned above,
where the system died after some time. From it we got some pending irq work
items in raised_list on CPU 2:

crash> pd raised_list:all
per_cpu(raised_list, 0) = $1 = {
  first = 0x0
}
per_cpu(raised_list, 1) = $2 = {
  first = 0x0
}
per_cpu(raised_list, 2) = $3 = {
  first = 0xffffbb22d1609020
}
...
crash> list 0xffffbb22d1609020
ffffbb22d1609020
ffffbb233d06b020
ffffbb233901d020
ffffbb2324ec1020
ffffbb232cf59020
ffffbb2328f0d020
ffffbb2320e7d020
ffffbb2334fd1020
ffffbb2330f95020
ffffbb231ce39020
ffffbb2318da5020
ffffbb2314d29020
ffffbb22e45f4020
ffffbb23007c5020
ffffbb2310cdd020
ffffbb230c8b1020
ffffbb2308857020
ffffbb2304821020
ffffbb22fc789020
ffffbb22e05f0020
ffffbb22f8715020
ffffbb22f46db020
ffffbb22f06ad020
ffffbb22e8635020
ffffbb22ec673020
ffff93d3a6a1efe0
ffff93d65151e6d0
crash> list 0xffffbb22d1609020 | wc -l
27

All other CPUs had no items, only CPU 2. These pending items look to have
caused some cascade effects which led to soft lockups and the system dying
(e.g. a work item doesn't run, holds up resources, and several tasks end up
stuck...).

It appears that relying on the IPI alone may be too strict in a case like
this, although I don't know whether a system missing an IPI/interrupt is
something that can be expected. It looks to me like we could have a
virtualization bug/issue in this specific case (since it's running under
VMware), but maybe we should add a fallback for when something like this
happens? For example, making it less strict and allowing irq_work_tick() to
also process the list?
Like below:

diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 73f7e1fd4ab4..e47d64b56a38 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -188,9 +188,8 @@ bool irq_work_needs_cpu(void)
 	raised = this_cpu_ptr(&raised_list);
 	lazy = this_cpu_ptr(&lazy_list);
 
-	if (llist_empty(raised) || arch_irq_work_has_interrupt())
-		if (llist_empty(lazy))
-			return false;
+	if (llist_empty(raised) && llist_empty(lazy))
+		return false;
 
 	/* All work should have been flushed before going offline */
 	WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
@@ -270,7 +269,7 @@ void irq_work_tick(void)
 {
 	struct llist_head *raised = this_cpu_ptr(&raised_list);
 
-	if (!llist_empty(raised) && !arch_irq_work_has_interrupt())
+	if (!llist_empty(raised))
 		irq_work_run_list(raised);
 
 	if (!IS_ENABLED(CONFIG_PREEMPT_RT))

However, the above essentially reverts commit
76a33061b9323b7fdb220ae5fa116c10833ec22e ("irq_work: Force raised irq work
to run on irq work interrupt") and could reintroduce the issue it fixed.
Since nohz_full_kick_func() (which is the renamed
nohz_full_kick_work_func()) is empty now, though, maybe it is OK to no
longer be strict about raised_list only running from the irq work
interrupt? Or maybe it's not worth changing this at all, since the problem
is rare and a missed self-IPI should not be expected?