From nobody Sat Feb  7 03:09:40 2026
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.129.124])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1029613F44E
	for <linux-kernel@vger.kernel.org>; Tue, 18 Jun 2024 17:37:55 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=170.10.129.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1718732279; cv=none;
 b=QZxz8wHNdQXTldtJKsHM6eTN3F8134Ilxxnd423yLXWD0cQGZuvSH1LArF2NRCHHw1DmR+G81ZMOjbwJ95lmXp+b/nu9pcHwZ1yyEJKKHiKbumRWFX+kIicvn5/j4tQxIkCsZ4B+ax3SRT3W/P86r2pux6gcLUQXNQRD7a4w9Gg=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1718732279; c=relaxed/simple;
	bh=xTOlixnOjCXBRUTnU3x6iTX7lhvk0rmAcejVRJ/KGE8=;
	h=Date:From:To:Cc:Subject:Message-ID:MIME-Version:Content-Type:
	 Content-Disposition;
 b=fnDNZeqRpWrPU0Qqpq6qxcb4dyMZ8raSFACkOffEkYy0b+uv3XZqnKqwesYI8Pg06D7li7eri25BHviOfJgXrCLeXlYGHrtTvsGTQoywpzRZUS1T5ZvhJ2KFO1tEnw2MhuftM525VJx3uDc43umrYJyV+lkjB/Dda0T12IEag24=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=redhat.com;
 spf=pass smtp.mailfrom=redhat.com;
 dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=WwizVx6A; arc=none smtp.client-ip=170.10.129.124
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="WwizVx6A"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1718732275;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type;
	bh=mgYcOe2C5KVnL45iO9/OIJYGaLtsqSULC2wkwzh/GbQ=;
	b=WwizVx6ABor93qHj9rM5k/ukN2lYv1/3WYu5QqfFpQOi+FdKpRN7XTu0z1KRo+g2ArOK9G
	BPt3mcEuFt7IUeS05pUq1X4g8Fmn3pNTYcMvtTtwxa8Wxi5M0Bm/64Zra5gFQAQZJEV8CE
	ASC2ryFNTQ2KgyJTbo6/UHLDl4CJAiY=
Received: from mx-prod-mc-04.mail-002.prod.us-west-2.aws.redhat.com
 (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by
 relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3,
 cipher=TLS_AES_256_GCM_SHA384) id us-mta-599-1B6QQANmMuaP7PMGP_CnMA-1; Tue,
 18 Jun 2024 13:37:49 -0400
X-MC-Unique: 1B6QQANmMuaP7PMGP_CnMA-1
Received: from mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com
 (mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.17])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest
 SHA256)
	(No client certificate requested)
	by mx-prod-mc-04.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS
 id 4DAE019560BB;
	Tue, 18 Jun 2024 17:37:48 +0000 (UTC)
Received: from debarbos-thinkpadt14sgen2i.remote.csb (unknown [10.22.8.160])
	by mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with
 ESMTPS id ED5751956048;
	Tue, 18 Jun 2024 17:37:44 +0000 (UTC)
Date: Tue, 18 Jun 2024 13:37:42 -0400
From: Derek Barbosa <debarbos@redhat.com>
To: pmladek@suse.com, rostedt@goodmis.org, john.ogness@linutronix.de,
	senozhatsky@chromium.org, linux-rt-users@vger.kernel.org,
	linux-kernel@vger.kernel.org
Cc: williams@redhat.com, jlelli@redhat.com, ahalaney@redhat.com,
	lgoncalv@redhat.com, jwyatt@redhat.com, aubaker@redhat.com
Subject: [BUG] printk/nbcon.c: watchdog BUG: softlockup - CPU#x stuck for 78s
Message-ID: <ZnHF5j1DUDjN1kkq@debarbos-thinkpadt14sgen2i.remote.csb>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Disposition: inline
X-Scanned-By: MIMEDefang 3.0 on 10.30.177.17
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Hi,

Apologies for the resend. My email client borked the headers...

The realtime team at Red Hat has recently backported the latest printk chan=
ges
present in 6.6-rt stable (HEAD at 20fd4439f644 printk: nbcon: move locked_p=
ort flag to
struct uart_port) to CentOS Stream 9 for performance improvements and
printk-related bugfixes.

Since merging this said code, we've hit an interesting bug during testing,
specifically, on larger systems, a softlockup may be reported by the watchd=
og
when there is a heavy amount of printing to tty devices (whether it be thro=
ugh
printk, /dev/kmsg, etc).

We have a modicum of reasonable suspicion to believe that nbcon_reacquire, =
or
some other nbcon mechanism *may* be causing such behavior.

Since we've succesfully reproduced this in the Kernel-ARK/Fedora-ELN
(osbuild-rt), and linux-rt-devel 6.10.rc4-rt6, we are reporting this bug
upstream.

Anyway, here is a more in-depth description, along with some call traces.

Description:

On x86 systems with a large amount of logical cores (nproc ~> 60), a softlo=
ckup can
be observed with accompanying call trace when a large amount of "printing"
activity is taking place.

As shown in the call traces appended below, during some kind of numa
balancing/numa_migration after a task_numa_fault --where a set of processes=
s are being migrated/swapped
between two CPUs-- there is a busy thread that is being waited on (in the o=
rder
of seconds), causing a livelock. Additional investigation of collected vmco=
res
by toggling panic on softlockup shows that the waiting thread may be waitin=
g for
a thread looping with nbcon_reacquire.

I suspect that some logic within nbcon_context_try_acquire may be a good pl=
ace
to start. My understanding of the code becomes a bit fuzzy here, so apologi=
es
in advance for any erroneous statements. As I see it, there may be something
happening during migration (or under heavy load) in which nbcon_reacquire()=
 is in
a non-migratable or non-preemtible state as it is attempting to regain acce=
ss to
a lost/taken console. It could very well be a situation in which context
was forcefully taken from the printing thread.

Alternatively, Andrew Halaney <ahalaney@redhat.com> suspects that it is the=
 loop within
nbcon_kthread_func() -- since there is nothing that would yield the task in=
 said
loop (cant_migrate()), the migrate code would be essentially waiting foreve=
r for
the aforementioned loop to "finish". I believe in PREEMPT_RT, there would b=
e a
preemption point here. Furthermore, in his investigation, there were signs =
that the
loop was just iterating up until the crash, leaving reason to
believe that task would be the culprit.

In fact, with the following diff, we noticed this output:

```
ahalaney@x1gen2nano ~/git/linux-rt-devel (git)-[tags/v6.10-rc4-rt6-rebase] =
% git diff | cat
diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
index bb9689f94d30..d716b72bf2f8 100644
--- a/kernel/printk/nbcon.c
+++ b/kernel/printk/nbcon.c
@@ -1075,6 +1075,7 @@ static int nbcon_kthread_func(void *__console)
      *
      * This pairs with rcuwait_has_sleeper:A and nbcon_kthread_wake:A.
      */
+    trace_printk("Before rcuwait_wait_event()\n");
     ret =3D rcuwait_wait_event(&con->rcuwait,
                  nbcon_kthread_should_wakeup(con, ctxt),
                  TASK_INTERRUPTIBLE); /* LMM(nbcon_kthread_func:A) */
@@ -1086,7 +1087,10 @@ static int nbcon_kthread_func(void *__console)
     if (ret)
         goto wait_for_event;

+    trace_printk("Starting do while(backlog) loop \n");
+    unsigned long int loop =3D 0;
     do {
+        trace_printk("On loop %lu\n", ++loop);
         backlog =3D false;

         cookie =3D console_srcu_read_lock();
@@ -1121,6 +1125,7 @@ static int nbcon_kthread_func(void *__console)
         console_srcu_read_unlock(cookie);

     } while (backlog);
+    trace_printk("End of while(backlog) loop, looped %lu times \n", loop);

     goto wait_for_event;
```

The output:

```
[ 1681.309720] pr/ttyS0-18       21.N... 893365994us : nbcon_kthread_func: =
On loop 2117
[ 1681.309727] pr/ttyS0-18       21.N... 893374418us : nbcon_kthread_func: =
On loop 2118
[ 1681.309734] pr/ttyS0-18       21.N... 893382860us : nbcon_kthread_func: =
On loop 2119
(...)
[ 1681.396193] pr/ttyS0-18       21.N... 954571399us : nbcon_kthread_func: =
On loop 14025
[ 1681.396200] pr/ttyS0-18       21.N... 954576525us : nbcon_kthread_func: =
On loop 14026
[ 1681.396207] pr/ttyS0-18       21.N... 954581561us : nbcon_kthread_func: =
On loop 14027
[ 1681.396213] pr/ttyS0-18       21.N... 954584954us : nbcon_kthread_func: =
On loop 14028
[ 1681.396220] pr/ttyS0-18       21.N... 954590111us : nbcon_kthread_func: =
On loop 14029
[ 1681.396223] ---------------------------------
[ 1681.396230] Kernel panic - not syncing: softlockup: hung tasks
(...)
```

Demonstrating evidence that the nbcon kthread function is invoked continuou=
sly
up until the point of panic. I do believe that this approach is more sound =
than
my initial hints (or it could be more than a handful of threads). Some trac=
es
around serial8250_console_write_thread() also denote continuous calls witho=
ut
any holdups.

As a sidenote, we are seeing the softlockup with !CONFIG_PREEMPT_RT

Some questions arise from this, as we've never encountered this in our test=
ing
with John Ogness' console_blast (kudos to <jwyatt@redhat.com>) and other pr=
intk torture
tests that have been compiled [here](https://gitlab.com/debarbos/printktort=
ure).

We are curious to understand why is it that the printing thread is chosen b=
y the
NUMA balancer for migration/swap, and how that interaction is handled withi=
n the
code (in other words, how else would nbcon handle a migrated printing threa=
d?).

Our next round of tests aim to disable numa balancing on
large-multi-core-systems to determine whether it is the NUMA mechanisms + n=
bcon
interactions are at fault here.


How to reproduce:

Provision an x86 machine with a large amount of logical cores with a kernel=
 that
contains the latest printk code, such as linux-rt-devel 6.10 rc3-rt5
(for reference, we are able to consistently reproduce this on machines with
nproc =3D=3D 128).

Additionally, use LTP pty03 && 06. from ltp/testcase/kernel/pty{03,06}


Use the following reproducer

```

#!/bin/bash

echo 1 > /proc/sys/kernel/softlockup_panic

while true; do ./ltp/testcases/kernel/pty/pty03; ./ltp/testcases/kernel/pty=
/pty06; done

```

Alternatively, anything that places a great deal of strain on /dev/kmsg may=
 also
do the trick.


Attachments:

Appended below are the call traces collected across various machines.

The first trace contains some additional backtraces and dissassembly.

The second trace was collected with kernel.softlockup_panic set to 1

The third trace was collected without any additional cmdline arguments.

Finally, fourth trace was collected with kernel.softlockup_all_cpu_backtrace
set to 1.

Please let me know if you would like for me to append the configs for said
kernels as well.

---- FIRST TRACE ----

kudos to Audra Baker <aubaker@redhat.com> for this investigation:
=20
[42895.891462] pps pps2: removed
[42923.776618] watchdog: BUG: soft lockup - CPU#32 stuck for 26s! [migratio=
n/32:212]
[42923.776621] Modules linked in: n_gsm pps_ldisc ppp_synctty n_hdlc ppp_as=
ync ppp_generic slhc serport rfkill intel_
rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp co=
retemp kvm_intel kvm mgag200 i2c_algo_bit=20
drm_shmem_helper drm_kms_helper syscopyarea sysfillrect sysimgblt irqbypass=
 fb_sys_fops rapl iTCO_wdt iTCO_vendor_sup
port joydev pcspkr intel_cstate intel_uncore ioatdma mei_me i2c_i801 i2c_sm=
bus lpc_ich ipmi_si ipmi_devintf ipmi_msgh
andler mei acpi_pad acpi_power_meter mxm_wmi drm fuse xfs libcrc32c sd_mod =
t10_pi sg ahci libahci crct10dif_pclmul cr
c32_pclmul ixgbe libata crc32c_intel megaraid_sas ghash_clmulni_intel mdio =
dca wmi dm_mirror dm_region_hash dm_log dm
_mod
[42923.776655] CPU: 32 PID: 212 Comm: migration/32 Kdump: loaded Not tainte=
d 5.14.0-452.el9.x86_64 #1
[42923.776657] Hardware name: Intel Corporation S2600CWR/S2600CWR, BIOS SE5=
C610.86B.01.01.0018.072020161249 07/20/201
6
[42923.776658] Stopper: multi_cpu_stop+0x0/0x100 <- migrate_swap+0xb4/0x110
[42923.776666] RIP: 0010:stop_machine_yield+0x2/0x10
[42923.776669] Code: cc cc cc b8 fe ff ff ff eb df e8 d9 c3 a8 00 66 0f 1f =
84 00 00 00 00 00 90 90 90 90 90 90 90 90=20
90 90 90 90 90 90 90 90 f3 90 <c3> cc cc cc cc 66 0f 1f 84 00 00 00 00 00 9=
0 90 90 90 90 90 90 90
[42923.776670] RSP: 0018:ffffab0906ca7e70 EFLAGS: 00000246
[42923.776672] RAX: 0000000000000000 RBX: ffffab0927fef948 RCX: 00000000000=
00002
[42923.776673] RDX: 0000000000000004 RSI: ffffab0927fef9a0 RDI: ffffffff98a=
371c0
[42923.776674] RBP: ffffab0927fef96c R08: ffff9d55ffc241d0 R09: ffff9d46d36=
fec00
[42923.776675] R10: 0000000000000193 R11: 0000000000000000 R12: 00000000000=
00001
[42923.776676] R13: ffffffff98a371c0 R14: 0000000000000000 R15: 00000000000=
00001
[42923.776677] FS:  0000000000000000(0000) GS:ffff9d55ffc00000(0000) knlGS:=
0000000000000000
[42923.776679] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[42923.776680] CR2: 00007ff24c9190e8 CR3: 00000010a06a2004 CR4: 00000000001=
706f0
[42923.776681] Call Trace:
[42923.776682]  <IRQ>
[42923.776684]  ? show_trace_log_lvl+0x1c4/0x2df
[42923.776686]  ? show_trace_log_lvl+0x1c4/0x2df
[42923.776688]  ? multi_cpu_stop+0x94/0x100
[42923.776689]  ? watchdog_timer_fn+0x1ad/0x210
[42923.776692]  ? __pfx_watchdog_timer_fn+0x10/0x10
[42923.776694]  ? __hrtimer_run_queues+0x112/0x2b0
[42923.776698]  ? hrtimer_interrupt+0xfc/0x210
[42923.776701]  ? __sysvec_apic_timer_interrupt+0x4e/0x100
[42923.776704]  ? sysvec_apic_timer_interrupt+0x6d/0x90
[42923.776708]  </IRQ>
[42923.776709]  <TASK>
[42923.776709]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[42923.776713]  ? stop_machine_yield+0x2/0x10
[42923.776714]  multi_cpu_stop+0x94/0x100
[42923.776716]  ? __pfx_multi_cpu_stop+0x10/0x10
[42923.776717]  cpu_stopper_thread+0x93/0x140
[42923.776719]  ? __pfx_smpboot_thread_fn+0x10/0x10
[42923.776723]  smpboot_thread_fn+0xd6/0x1a0
[42923.776725]  kthread+0xe0/0x100
[42923.776728]  ? __pfx_kthread+0x10/0x10
[42923.776729]  ret_from_fork+0x2c/0x50
[42923.776734]  </TASK>
=20
Looking at the soft lock up- when we hit the soft lock up we were in stop_m=
achine_yield().=20
This expects a cpumask to be passed as the first argument:
=20
notrace void __weak stop_machine_yield(const struct cpumask *cpumask)
{
cpu_relax();
}
=20
Looks like we are waiting on CPU 27
=20
crash> struct cpumask ffffffff98a371c0
struct cpumask {
  bits =3D \{0x8000000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0=
x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0=
x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0=
x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0=
x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0=
x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0=
x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0=
x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0=
x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
}
=20
crash> eval -b 0x8000000,
eval: invalid expression: (0x8000000,)
crash> eval -b 0x8000000
hexadecimal: 8000000  (128MB)
    decimal: 134217728 =20
      octal: 1000000000
     binary: 00000000000000000000000000000000000010000000000000000000000000=
00
   bits set: 27=20
=20
=20
The function that calls stop_machine_yield() is multi_cpu_stop() which expe=
cts a data pointer as its argument. Looking at the assembly, we can work ou=
t that this should be our multi_stop_data. Looks like we were migrating two=
 threads, one is already done:
=20
crash> struct multi_stop_data ffffab0927fef948=20
struct multi_stop_data {
  fn =3D 0xffffffff97b516c0 <migrate_swap_stop>,
  data =3D 0xffffab0927fefa08,
  num_threads =3D 0x2,
  active_cpus =3D 0xffffffff98a371c0 <cpu_bit_bitmap+28672>,
  state =3D MULTI_STOP_PREPARE,
  thread_ack =3D {
    counter =3D 0x1
  }
}
=20
Here is the task we were waiting on. Its been running for a long while:
=20
crash> ps -m 17
[0 00:01:50.403] [RU]  PID: 17       TASK: ffff9d46c0b9d580  CPU: 27   COMM=
AND: "pr/ttyS0"
=20
crash> bt  -c 27
PID: 17       TASK: ffff9d46c0b9d580  CPU: 27   COMMAND: "pr/ttyS0"
#0 [fffffe5a17a51bd8] crash_nmi_callback at ffffffff97a6a581
#1 [fffffe5a17a51be0] nmi_panic_self_stop at ffffffff97a6a9f1
#2 [fffffe5a17a51bf8] watchdog_overflow_callback.cold at ffffffff98654315
#3 [fffffe5a17a51c08] __perf_event_overflow at ffffffff97d19a75
#4 [fffffe5a17a51c38] handle_pmi_common at ffffffff97a12bd8
#5 [fffffe5a17a51e08] intel_pmu_handle_irq at ffffffff97a13243
#6 [fffffe5a17a51e48] perf_event_nmi_handler at ffffffff97a04d38
#7 [fffffe5a17a51e68] nmi_handle at ffffffff97a306be
#8 [fffffe5a17a51eb0] default_do_nmi at ffffffff986a00e0
#9 [fffffe5a17a51ed0] exc_nmi at ffffffff986a02f2
#10 [fffffe5a17a51ef0] end_repeat_nmi at ffffffff988015e7
    [exception RIP: nbcon_reacquire+0x11]
    RIP: ffffffff97b96621  RSP: ffffab090431fdf0  RFLAGS: 00000046
    RAX: 0000000000000000  RBX: ffffab090431feb0  RCX: 00000000000006cc
    RDX: 000000000000001b  RSI: ffffffff99ad5360  RDI: ffffab090431feb0
    RBP: ffffffff99ad5360   R8: 0000000000000000   R9: 0000000000000001
    R10: 0000000000000001  R11: 6465646461207070  R12: ffffffff9a7ad120
    R13: 0000000000000032  R14: 0000000000000000  R15: 000000000000001a
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
#11 [ffffab090431fdf0] nbcon_reacquire at ffffffff97b96621
#12 [ffffab090431fe00] serial8250_console_write_thread at ffffffff981ce209
#13 [ffffab090431fe40] nbcon_emit_next_record at ffffffff97b96baf
#14 [ffffab090431fea0] nbcon_kthread_func at ffffffff97b96e36
#15 [ffffab090431ff18] kthread at ffffffff97b36430
#16 [ffffab090431ff50] ret_from_fork at ffffffff97a02d0c
=20
We could be looping within nbcon_reacquire:
=20
/usr/src/debug/kernel-5.14.0-452.el9/linux-5.14.0-452.el9.x86_64/kernel/pri=
ntk/nbcon.c: 850
0xffffffff97b96610 <nbcon_reacquire>:   nopl   0x0(%rax,%rax,1) [FTRACE NOP]
/usr/src/debug/kernel-5.14.0-452.el9/linux-5.14.0-452.el9.x86_64/kernel/pri=
ntk/nbcon.c: 852
0xffffffff97b96615 <nbcon_reacquire+0x5>:       push   %rbp
0xffffffff97b96616 <nbcon_reacquire+0x6>:       push   %rbx
/usr/src/debug/kernel-5.14.0-452.el9/linux-5.14.0-452.el9.x86_64/kernel/pri=
ntk/nbcon.c: 850
0xffffffff97b96617 <nbcon_reacquire+0x7>:       mov    %rdi,%rbx
/usr/src/debug/kernel-5.14.0-452.el9/linux-5.14.0-452.el9.x86_64/kernel/pri=
ntk/nbcon.c: 852
0xffffffff97b9661a <nbcon_reacquire+0xa>:       mov    (%rdi),%rbp
/usr/src/debug/kernel-5.14.0-452.el9/linux-5.14.0-452.el9.x86_64/kernel/pri=
ntk/nbcon.c: 855
0xffffffff97b9661d <nbcon_reacquire+0xd>:       jmp    0xffffffff97b96621 <=
nbcon_reacquire+0x11>
/usr/src/debug/kernel-5.14.0-452.el9/linux-5.14.0-452.el9.x86_64/./arch/x86=
/include/asm/vdso/processor.h: 13
0xffffffff97b9661f <nbcon_reacquire+0xf>:       pause =20
/usr/src/debug/kernel-5.14.0-452.el9/linux-5.14.0-452.el9.x86_64/kernel/pri=
ntk/nbcon.c: 855
0xffffffff97b96621 <nbcon_reacquire+0x11>:      mov    %rbx,%rdi
0xffffffff97b96624 <nbcon_reacquire+0x14>:      call   0xffffffff97b962e0 <=
nbcon_context_try_acquire>
0xffffffff97b96629 <nbcon_reacquire+0x19>:      test   %al,%al
0xffffffff97b9662b <nbcon_reacquire+0x1b>:      je     0xffffffff97b9661f <=
nbcon_reacquire+0xf>
=20
=20
We know that the watchdog timer gets touched at the start of serial8250_con=
sole_write_thread:
=20
bool serial8250_console_write_thread(struct uart_8250_port *up,
                             struct nbcon_write_context *wctxt)
{
struct uart_8250_em485 *em485 =3D up->em485;
struct uart_port *port =3D &up->port;
bool done =3D false;
unsigned int ier;multi_stop_data
=20
touch_nmi_watchdog();
=20
Looking at the other tasks, it looks like the other task that was part of t=
he migration was pid 30259- this matches the data pointer in the multi_stop=
_data as the address is within pid 30259's stack. It doesn't look like this=
 is part of the problem because we have been in blocked state for 1 min 49 =
sec- which is just under the 1 min 50 seconds CPU 27 has been running on th=
e CPU. =20
=20
crash> ps -S
  RU: 70
  IN: 314
  ID: 347
  UN: 3
 =20
crash> ps -m | grep UN
[0 00:00:53.425] [UN]  PID: 30307    TASK: ffff9d5646760000  CPU: 57   COMM=
AND: "pty03"
[0 00:01:46.341] [UN]  PID: 477      TASK: ffff9d46c54b8000  CPU: 38   COMM=
AND: "khugepaged"
[0 00:01:49.401] [UN]  PID: 30259    TASK: ffff9d56733c1c80  CPU: 0    COMM=
AND: "pty03"
=20
crash> bt 30259
PID: 30259    TASK: ffff9d56733c1c80  CPU: 0    COMMAND: "pty03"
#0 [ffffab0927fef7b8] __schedule at ffffffff986ae36b
#1 [ffffab0927fef820] schedule at ffffffff986ae6de
#2 [ffffab0927fef838] schedule_timeout at ffffffff986b431f
#3 [ffffab0927fef8a0] __wait_for_common at ffffffff986af393
#4 [ffffab0927fef918] stop_two_cpus at ffffffff97c16d77
#5 [ffffab0927fefa00] migrate_swap at ffffffff97b51df4
#6 [ffffab0927fefa28] task_numa_migrate at ffffffff97b5fcaa
#7 [ffffab0927fefb28] task_numa_fault at ffffffff97b63e4d
#8 [ffffab0927fefb80] do_numa_page at ffffffff97d8abf2
#9 [ffffab0927fefbd0] __handle_mm_fault at ffffffff97d8da1b
#10 [ffffab0927fefcb0] handle_mm_fault at ffffffff97d8de9c
#11 [ffffab0927fefce8] do_user_addr_fault at ffffffff97a8b984
#12 [ffffab0927fefd38] exc_page_fault at ffffffff986a2d52
#13 [ffffab0927fefd60] asm_exc_page_fault at ffffffff98800b52
    [exception RIP: __get_user_8+0xd]
    RIP: ffffffff9808dc9d  RSP: ffffab0927fefe10  RFLAGS: 00010206
    RAX: 00007ff24c9190e8  RBX: ffff9d56733c1c80  RCX: 00000074a1ab6020
    RDX: 0000000000000000  RSI: ffffab0927fefe40  RDI: ffff9d56733c1c80
    RBP: ffffab0927fefeb8   R8: ffff9d476014e400   R9: ffffffff99afa120
    R10: ffffab0927fefd38  R11: ffffab0927fefd40  R12: 0000000000000002
    R13: 00007ff24c70357b  R14: ffffab0927feff58  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#14 [ffffab0927fefe10] rseq_get_rseq_cs at ffffffff97d30a9d
#15 [ffffab0927fefe38] rseq_ip_fixup at ffffffff97d30d16
#16 [ffffab0927fefec0] __rseq_handle_notify_resume at ffffffff97d31216
#17 [ffffab0927fefed8] exit_to_user_mode_loop at ffffffff97bc3f09
#18 [ffffab0927fefef8] exit_to_user_mode_prepare at ffffffff97bc4026
#19 [ffffab0927feff10] syscall_exit_to_user_mode at ffffffff986a31f2
#20 [ffffab0927feff20] do_syscall_64 at ffffffff9869e46b
#21 [ffffab0927feff50] entry_SYSCALL_64_after_hwframe at ffffffff988000e6
    RIP: 00007ff24c70357b  RSP: 00007ffde2a53378  RFLAGS: 00000246
    RAX: 0000000000000000  RBX: 0000000000434790  RCX: 00007ff24c70357b
    RDX: 0000000004a75c54  RSI: 0000000000005437  RDI: 000000000000000a
    RBP: 0000000000000007   R8: 00000000295d6934   R9: 0000000000000000
    R10: 00007ffde2b13170  R11: 0000000000000246  R12: 000000000000000a
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000003
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b

---- SECOND TRACE ----

[ 2680.882153] watchdog: BUG: soft lockup - CPU#5 stuck for 26s! [migration=
/5:49]=20
[ 2680.882157] Modules linked in: n_gsm pps_ldisc slcan can_dev ppp_synctty=
 n_hdlc ppp_async ppp_generic slip slhc serport nfsv3 nfs_acl tun brd overl=
ay exfat vfat fat loop ext4 mbcache jbd2 tls rpcsec_gss_krb5 auth_rpcgss nf=
sv4 dns_resolver nfs lockd grace fscache netfs sunrpc intel_rapl_msr intel_=
rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_inte=
l dell_wmi_descriptor sparse_keymap kvm ipmi_ssif rfkill video mgag200 i2c_=
algo_bit drm_shmem_helper drm_kms_helper cdc_ether iTCO_wdt rapl syscopyare=
a usbnet ipmi_si iTCO_vendor_support sysfillrect ipmi_devintf mii intel_cst=
ate mei_me ipmi_msghandler dcdbas sysimgblt mei lpc_ich intel_uncore fb_sys=
_fops mxm_wmi ioatdma pcspkr acpi_power_meter drm fuse xfs libcrc32c ahci q=
ede libahci sd_mod t10_pi crct10dif_pclmul crc32_pclmul sg ixgbe qed libata=
 crc32c_intel megaraid_sas tg3 ghash_clmulni_intel mdio crc8 dca wmi dm_mir=
ror dm_region_hash dm_log dm_mod [last unloaded: hwpoison_inject]=20
[ 2680.882211] CPU: 5 PID: 49 Comm: migration/5 Kdump: loaded Tainted: G   =
        OE  X  -------  ---  5.14.0-458.4417_1319993482.el9.x86_64 #1=20
[ 2680.882214] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.12.1 =
12/04/2020=20
[ 2680.882215] Stopper: multi_cpu_stop+0x0/0x100 <- migrate_swap+0xb4/0x110=20
[ 2680.882225] RIP: 0010:multi_cpu_stop+0x94/0x100=20
[ 2680.882228] Code: ff 4d 00 75 0f 8b 43 20 8b 4b 10 83 c0 01 89 4b 24 89 =
43 20 e8 0d 21 fa ff 41 83 ff 04 74 2e 45 89 fc 4c 89 ef e8 4c ff ff ff <44=
> 8b 7b 20 45 39 fc 75 aa 41 83 ff 01 76 0a e8 e8 09 02 00 e8 b3=20
[ 2680.882230] RSP: 0000:ffffade1066d7e78 EFLAGS: 00000246=20
[ 2680.882232] RAX: 0000000000000000 RBX: ffffade108f57988 RCX: 00000000000=
00002=20
[ 2680.882233] RDX: 0000000000000004 RSI: ffffade108f579b0 RDI: ffffffffac4=
32e60=20
[ 2680.882235] RBP: ffffade108f579ac R08: ffff897b7f8a41d0 R09: ffff896bd16=
f3800=20
[ 2680.882236] R10: 00000000000002f8 R11: 0000000000000001 R12: 00000000000=
00001=20
[ 2680.882237] R13: ffffffffac432e60 R14: 0000000000000000 R15: 00000000000=
00001=20
[ 2680.882238] FS:  0000000000000000(0000) GS:ffff897b7f880000(0000) knlGS:=
0000000000000000=20
[ 2680.882240] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033=20
[ 2680.882241] CR2: 00007f0069887fe8 CR3: 000000108a7ba004 CR4: 00000000001=
706f0=20
[ 2680.882243] Call Trace:=20
[ 2680.882245]  <IRQ>=20
[ 2680.882246]  ? show_trace_log_lvl+0x1c4/0x2df=20
[ 2680.882250]  ? show_trace_log_lvl+0x1c4/0x2df=20
[ 2680.882252]  ? cpu_stopper_thread+0x93/0x140=20
[ 2680.882255]  ? watchdog_timer_fn+0x1ad/0x210=20
[ 2680.882258]  ? __pfx_watchdog_timer_fn+0x10/0x10=20
[ 2680.882261]  ? __hrtimer_run_queues+0x112/0x2b0=20
[ 2680.882266]  ? hrtimer_interrupt+0xfc/0x210=20
[ 2680.882270]  ? __sysvec_apic_timer_interrupt+0x4e/0x100=20
[ 2680.882274]  ? sysvec_apic_timer_interrupt+0x6d/0x90=20
[ 2680.882278]  </IRQ>=20
[ 2680.882279]  <TASK>=20
[ 2680.882280]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20=20
[ 2680.882285]  ? multi_cpu_stop+0x94/0x100=20
[ 2680.882287]  ? multi_cpu_stop+0x94/0x100=20
[ 2680.882289]  ? __pfx_multi_cpu_stop+0x10/0x10=20
[ 2680.882292]  cpu_stopper_thread+0x93/0x140=20
[ 2680.882294]  ? __pfx_smpboot_thread_fn+0x10/0x10=20
[ 2680.882299]  smpboot_thread_fn+0xd6/0x1a0=20
[ 2680.882302]  kthread+0xe0/0x100=20
[ 2680.882305]  ? __pfx_kthread+0x10/0x10=20
[ 2680.882308]  ret_from_fork+0x2c/0x50=20
[ 2680.882312]  </TASK>

---- THIRD TRACE ----
 7602.894128] Modules linked in: n_gsm pps_ldisc slcan can_dev ppp_synctty =
n_hdlc ppp_async ppp_generic slip slhc serport rfkill intel_rapl_msr intel_=
rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_ifs =
i10nm_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kv=
m_intel vfat fat ipmi_ssif kvm rapl pmt_telemetry iTCO_wdt dax_hmem intel_c=
state mei_me iTCO_vendor_support pmt_class intel_sdsi cxl_acpi dcdbas mgag2=
00 dell_smbios cxl_core isst_if_mbox_pci isst_if_mmio i2c_i801 acpi_power_m=
eter intel_uncore mei einj dell_wmi_descriptor pcspkr wmi_bmof intel_vsec i=
2c_ismt i2c_algo_bit isst_if_common i2c_smbus ipmi_si acpi_ipmi ipmi_devint=
f ipmi_msghandler fuse xfs iaa_crypto nvme ahci crct10dif_pclmul libahci cr=
c32_pclmul nvme_core crc32c_intel idxd libata ghash_clmulni_intel idxd_bus =
tg3 nvme_auth wmi t10_pi pinctrl_emmitsburg dm_mirror dm_region_hash dm_log=
 dm_mod
[ 7602.894175] CPU: 63 PID: 402 Comm: migration/63 Kdump: loaded Not tainte=
d 6.10.0-0.rc3.f8dba31b0a82.33.test.eln137.x86_64 #1
[ 7602.894178] Hardware name: Dell Inc. PowerEdge R760/030NDG, BIOS 1.4.4 0=
5/15/2023
[ 7602.894179] Stopper: multi_cpu_stop+0x0/0x100 <- migrate_swap+0xa9/0x130
[ 7602.894188] RIP: 0010:stop_machine_yield+0x6/0x10
[ 7602.894191] Code: ff ff eb df e8 bb 85 bb 00 66 66 2e 0f 1f 84 00 00 00 =
00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 f3 90 <c3=
> cc cc cc cc 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90
[ 7602.894193] RSP: 0000:ff75cb01445efe38 EFLAGS: 00000246
[ 7602.894195] RAX: 0000000000000000 RBX: ff75cb014c103b78 RCX: ff4186fcaeb=
a5650
[ 7602.894197] RDX: ff4186fcaeb80000 RSI: ff75cb014c103bd0 RDI: ffffffff8e0=
39ce0
[ 7602.894197] RBP: ff75cb014c103b9c R08: 00000000000001cc R09: 00000000000=
00000
[ 7602.894198] R10: 0000000000000001 R11: 0000000000000001 R12: 00000000000=
00001
[ 7602.894199] R13: ffffffff8e039ce0 R14: 0000000000000001 R15: ffffffff8d0=
44700
[ 7602.894200] FS:  0000000000000000(0000) GS:ff4186fcaeb80000(0000) knlGS:=
0000000000000000
[ 7602.894201] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7602.894203] CR2: 0000000000434ea4 CR3: 0000000159c08004 CR4: 0000000000f=
71ef0
[ 7602.894204] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 00000000000=
00000
[ 7602.894204] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 00000000000=
00400
[ 7602.894205] PKRU: 55555554
[ 7602.894206] Call Trace:
[ 7602.894207]  <IRQ>
[ 7602.894209]  ? show_trace_log_lvl+0x1b0/0x2f0
[ 7602.894212]  ? show_trace_log_lvl+0x1b0/0x2f0
[ 7602.894215]  ? multi_cpu_stop+0x99/0x100
[ 7602.894217]  ? watchdog_timer_fn.cold+0x3d/0xa0
[ 7602.894220]  ? __pfx_watchdog_timer_fn+0x10/0x10
[ 7602.894223]  ? __hrtimer_run_queues+0x139/0x2a0
[ 7602.894226]  ? hrtimer_interrupt+0xfa/0x210
[ 7602.894228]  ? __sysvec_apic_timer_interrupt+0x52/0x100
[ 7602.894230]  ? sysvec_apic_timer_interrupt+0x6c/0x90
[ 7602.894234]  </IRQ>
[ 7602.894234]  <TASK>
[ 7602.894235]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 7602.894239]  ? stop_one_cpu+0xb0/0xe0
[ 7602.894242]  ? stop_machine_yield+0x6/0x10
[ 7602.894245]  multi_cpu_stop+0x99/0x100
[ 7602.894247]  ? __pfx_multi_cpu_stop+0x10/0x10
[ 7602.894249]  cpu_stopper_thread+0x94/0x150
[ 7602.894252]  ? __pfx_smpboot_thread_fn+0x10/0x10
[ 7602.894255]  smpboot_thread_fn+0xda/0x1d0
[ 7602.894257]  kthread+0xcf/0x100
[ 7602.894260]  ? __pfx_kthread+0x10/0x10
[ 7602.894261]  ret_from_fork+0x31/0x50
[ 7602.894265]  ? __pfx_kthread+0x10/0x10
[ 7602.894267]  ret_from_fork_asm+0x1a/0x30
[ 7602.894270]  </TASK>
[ 7630.894558] watchdog: BUG: soft lockup - CPU#63 stuck for 48s! [migratio=
n/63:402]
[ 7630.894561] Modules linked in: n_gsm pps_ldisc slcan can_dev ppp_synctty=
 n_hdlc ppp_async ppp_generic slip slhc serport rfkill intel_rapl_msr intel=
_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_ifs=
 i10nm_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp k=
vm_intel vfat fat ipmi_ssif kvm rapl pmt_telemetry iTCO_wdt dax_hmem intel_=
cstate mei_me iTCO_vendor_support pmt_class intel_sdsi cxl_acpi dcdbas mgag=
200 dell_smbios cxl_core isst_if_mbox_pci isst_if_mmio i2c_i801 acpi_power_=
meter intel_uncore mei einj dell_wmi_descriptor pcspkr wmi_bmof intel_vsec =
i2c_ismt i2c_algo_bit isst_if_common i2c_smbus ipmi_si acpi_ipmi ipmi_devin=
tf ipmi_msghandler fuse xfs iaa_crypto nvme ahci crct10dif_pclmul libahci c=
rc32_pclmul nvme_core crc32c_intel idxd libata ghash_clmulni_intel idxd_bus=
 tg3 nvme_auth wmi t10_pi pinctrl_emmitsburg dm_mirror dm_region_hash dm_lo=
g dm_mod
[ 7630.894585] CPU: 63 PID: 402 Comm: migration/63 Kdump: loaded Tainted: G=
             L    -------  ---  6.10.0-0.rc3.f8dba31b0a82.33.test.eln137.x8=
6_64 #1
[ 7630.894587] Hardware name: Dell Inc. PowerEdge R760/030NDG, BIOS 1.4.4 0=
5/15/2023
[ 7630.894587] Stopper: multi_cpu_stop+0x0/0x100 <- migrate_swap+0xa9/0x130
[ 7630.894592] RIP: 0010:stop_machine_yield+0x6/0x10
[ 7630.894594] Code: ff ff eb df e8 bb 85 bb 00 66 66 2e 0f 1f 84 00 00 00 =
00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 f3 90 <c3=
> cc cc cc cc 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90
[ 7630.894595] RSP: 0000:ff75cb01445efe38 EFLAGS: 00000246
[ 7630.894597] RAX: 0000000000000000 RBX: ff75cb014c103b78 RCX: ff4186fcaeb=
a5650
[ 7630.894598] RDX: ff4186fcaeb80000 RSI: ff75cb014c103bd0 RDI: ffffffff8e0=
39ce0
[ 7630.894599] RBP: ff75cb014c103b9c R08: 00000000000001cc R09: 00000000000=
00000
[ 7630.894600] R10: 0000000000000001 R11: 0000000000000001 R12: 00000000000=
00001
[ 7630.894601] R13: ffffffff8e039ce0 R14: 0000000000000001 R15: ffffffff8d0=
44700
[ 7630.894602] FS:  0000000000000000(0000) GS:ff4186fcaeb80000(0000) knlGS:=
0000000000000000
[ 7630.894603] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7630.894604] CR2: 0000000000434ea4 CR3: 0000000159c08004 CR4: 0000000000f=
71ef0
[ 7630.894605] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 00000000000=
00000
[ 7630.894606] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 00000000000=
00400
[ 7630.894607] PKRU: 55555554
[ 7630.894607] Call Trace:
[ 7630.894608]  <IRQ>
[ 7630.894608]  ? show_trace_log_lvl+0x1b0/0x2f0
[ 7630.894611]  ? show_trace_log_lvl+0x1b0/0x2f0
[ 7630.894613]  ? multi_cpu_stop+0x99/0x100
[ 7630.894615]  ? watchdog_timer_fn.cold+0x3d/0xa0
[ 7630.894617]  ? __pfx_watchdog_timer_fn+0x10/0x10
[ 7630.894620]  ? __hrtimer_run_queues+0x139/0x2a0
[ 7630.894621]  ? hrtimer_interrupt+0xfa/0x210
[ 7630.894623]  ? __sysvec_apic_timer_interrupt+0x52/0x100
[ 7630.894625]  ? sysvec_apic_timer_interrupt+0x6c/0x90
[ 7630.894627]  </IRQ>
[ 7630.894628]  <TASK>
[ 7630.894628]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 7630.894631]  ? stop_one_cpu+0xb0/0xe0
[ 7630.894634]  ? stop_machine_yield+0x6/0x10
[ 7630.894636]  multi_cpu_stop+0x99/0x100
[ 7630.894638]  ? __pfx_multi_cpu_stop+0x10/0x10
[ 7630.894640]  cpu_stopper_thread+0x94/0x150
[ 7630.894642]  ? __pfx_smpboot_thread_fn+0x10/0x10
[ 7630.894644]  smpboot_thread_fn+0xda/0x1d0
[ 7630.894646]  kthread+0xcf/0x100
[ 7630.894647]  ? __pfx_kthread+0x10/0x10
[ 7630.894649]  ret_from_fork+0x31/0x50
[ 7630.894651]  ? __pfx_kthread+0x10/0x10
[ 7630.894653]  ret_from_fork_asm+0x1a/0x30
[ 7630.894655]  </TASK>
[ 7658.894993] watchdog: BUG: soft lockup - CPU#63 stuck for 74s! [migratio=
n/63:402]
[ 7658.894996] Modules linked in: n_gsm pps_ldisc slcan can_dev ppp_synctty=
 n_hdlc ppp_async ppp_generic slip slhc serport rfkill intel_rapl_msr intel=
_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_ifs=
 i10nm_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp k=
vm_intel vfat fat ipmi_ssif kvm rapl pmt_telemetry iTCO_wdt dax_hmem intel_=
cstate mei_me iTCO_vendor_support pmt_class intel_sdsi cxl_acpi dcdbas mgag=
200 dell_smbios cxl_core isst_if_mbox_pci isst_if_mmio i2c_i801 acpi_power_=
meter intel_uncore mei einj dell_wmi_descriptor pcspkr wmi_bmof intel_vsec =
i2c_ismt i2c_algo_bit isst_if_common i2c_smbus ipmi_si acpi_ipmi ipmi_devin=
tf ipmi_msghandler fuse xfs iaa_crypto nvme ahci crct10dif_pclmul libahci c=
rc32_pclmul nvme_core crc32c_intel idxd libata ghash_clmulni_intel idxd_bus=
 tg3 nvme_auth wmi t10_pi pinctrl_emmitsburg dm_mirror dm_region_hash dm_lo=
g dm_mod
[ 7658.895019] CPU: 63 PID: 402 Comm: migration/63 Kdump: loaded Tainted: G=
             L    -------  ---  6.10.0-0.rc3.f8dba31b0a82.33.test.eln137.x8=
6_64 #1
[ 7658.895021] Hardware name: Dell Inc. PowerEdge R760/030NDG, BIOS 1.4.4 0=
5/15/2023
[ 7658.895022] Stopper: multi_cpu_stop+0x0/0x100 <- migrate_swap+0xa9/0x130
[ 7658.895025] RIP: 0010:rcu_momentary_dyntick_idle+0xc/0x50
[ 7658.895027] Code: e9 e9 ee fe ff 66 0f 1f 84 00 00 00 00 00 90 90 90 90 =
90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 65 c6 05 18 a4 05 73 00 <65=
> 48 8b 15 14 d6 03 73 b8 08 00 00 00 f0 0f c1 82 c8 05 03 00 a8
[ 7658.895029] RSP: 0000:ff75cb01445efe38 EFLAGS: 00000246
[ 7658.895030] RAX: 0000000000000000 RBX: ff75cb014c103b78 RCX: ff4186fcaeb=
a5650
[ 7658.895031] RDX: ff4186fcaeb80000 RSI: ff75cb014c103bd0 RDI: ffffffff8e0=
39ce0
[ 7658.895032] RBP: ff75cb014c103b9c R08: 00000000000001cc R09: 00000000000=
00000
[ 7658.895032] R10: 0000000000000001 R11: 0000000000000001 R12: 00000000000=
00001
[ 7658.895033] R13: ffffffff8e039ce0 R14: 0000000000000001 R15: ffffffff8d0=
44700
[ 7658.895034] FS:  0000000000000000(0000) GS:ff4186fcaeb80000(0000) knlGS:=
0000000000000000
[ 7658.895035] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7658.895036] CR2: 0000000000434ea4 CR3: 0000000159c08004 CR4: 0000000000f=
71ef0
[ 7658.895037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 00000000000=
00000
[ 7658.895037] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 00000000000=
00400
[ 7658.895038] PKRU: 55555554
[ 7658.895039] Call Trace:
[ 7658.895039]  <IRQ>
[ 7658.895040]  ? show_trace_log_lvl+0x1b0/0x2f0
[ 7658.895042]  ? show_trace_log_lvl+0x1b0/0x2f0
[ 7658.895044]  ? multi_cpu_stop+0xb7/0x100
[ 7658.895047]  ? watchdog_timer_fn.cold+0x3d/0xa0
[ 7658.895048]  ? __pfx_watchdog_timer_fn+0x10/0x10
[ 7658.895051]  ? __hrtimer_run_queues+0x139/0x2a0
[ 7658.895052]  ? hrtimer_interrupt+0xfa/0x210
[ 7658.895054]  ? __sysvec_apic_timer_interrupt+0x52/0x100
[ 7658.895055]  ? sysvec_apic_timer_interrupt+0x6c/0x90
[ 7658.895058]  </IRQ>
[ 7658.895058]  <TASK>
[ 7658.895059]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 7658.895061]  ? stop_one_cpu+0xb0/0xe0
[ 7658.895064]  ? rcu_momentary_dyntick_idle+0xc/0x50
[ 7658.895066]  multi_cpu_stop+0xb7/0x100
[ 7658.895068]  ? __pfx_multi_cpu_stop+0x10/0x10
[ 7658.895070]  cpu_stopper_thread+0x94/0x150
[ 7658.895072]  ? __pfx_smpboot_thread_fn+0x10/0x10
[ 7658.895074]  smpboot_thread_fn+0xda/0x1d0
[ 7658.895075]  kthread+0xcf/0x100
[ 7658.895077]  ? __pfx_kthread+0x10/0x10
[ 7658.895079]  ret_from_fork+0x31/0x50
[ 7658.895081]  ? __pfx_kthread+0x10/0x10
[ 7658.895083]  ret_from_fork_asm+0x1a/0x30
[ 7658.895085]  </TASK>

---- FOURTH TRACE ----
kernel.softlockup_all_cpu_backtrace set to 1

kudos to Andrew Halaney <ahalaney@redhat.com> for this trace + investigatio=
n=20

[30435.183265] Hardware name: Intel Corporation S2600CWR/S2600CWR, BIOS SE5=
C610.86B.01.01.0018.072020161249 07/20/2016
[30435.183266] Stopper: multi_cpu_stop+0x0/0x100 <- migrate_swap+0xa9/0x130
[30435.183276] RIP: 0010:rcu_momentary_dyntick_idle+0x21/0x50
[30435.183279] Code: 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 65 c6 05 38 c4 =
45 47 00 65 48 8b 15 34 f6 43 47 b8 08 00 00 00 f0 0f c1 82 c8 05 03 00 <a8=
> 04 74 0d 65 48 8b 3d 03 a5 45 47 e9 ce fe ff ff 0f 0b 65 48 8b
[30435.183282] RSP: 0000:ffffa96f469dbe38 EFLAGS: 00000212
[30435.183284] RAX: 00000000457705fc RBX: ffffa96f6bebbb78 RCX: ffff8c81bef=
25650
[30435.183286] RDX: ffff8c81bef00000 RSI: ffffa96f6bebbbd0 RDI: ffffffffb9a=
328e0
[30435.183287] RBP: ffffa96f6bebbb9c R08: 0000000000000155 R09: 00000000000=
00000
[30435.183288] R10: 0000000000000001 R11: 0000000000000001 R12: 00000000000=
00001
[30435.183289] R13: ffffffffb9a328e0 R14: 0000000000000001 R15: ffffffffb8c=
41a00
[30435.183290] FS:  0000000000000000(0000) GS:ffff8c81bef00000(0000) knlGS:=
0000000000000000
[30435.183292] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[30435.183293] CR2: 0000000000434ea4 CR3: 0000000284088006 CR4: 00000000001=
706f0
[30435.183294] Call Trace:
[30435.183296]  <IRQ>
[30435.183297]  ? show_trace_log_lvl+0x1b0/0x2f0
[30435.183302]  ? show_trace_log_lvl+0x1b0/0x2f0
[30435.183305]  ? multi_cpu_stop+0xb7/0x100
[30435.183307]  ? watchdog_timer_fn.cold+0x3d/0xa0
[30435.183311]  ? __pfx_watchdog_timer_fn+0x10/0x10
[30435.183315]  ? __hrtimer_run_queues+0x13c/0x2a0
[30435.183318]  ? hrtimer_interrupt+0xfa/0x210
[30435.183321]  ? __sysvec_apic_timer_interrupt+0x55/0x100
[30435.183323]  ? sysvec_apic_timer_interrupt+0x6c/0x90
[30435.183327]  </IRQ>
[30435.183328]  <TASK>
[30435.183329]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[30435.183339]  ? __pfx_multi_cpu_stop+0x10/0x10
[30435.183342]  ? rcu_momentary_dyntick_idle+0x21/0x50
[30435.183344]  multi_cpu_stop+0xb7/0x100
[30435.183359]  ? __pfx_multi_cpu_stop+0x10/0x10
[30435.183361]  cpu_stopper_thread+0x97/0x150
[30435.183363]  ? __pfx_smpboot_thread_fn+0x10/0x10
[30435.183366]  smpboot_thread_fn+0xdd/0x1d0
[30435.183368]  kthread+0xd2/0x100
[30435.183371]  ? __pfx_kthread+0x10/0x10
[30435.183373]  ret_from_fork+0x34/0x50
[30435.183377]  ? __pfx_kthread+0x10/0x10
[30435.183379]  ret_from_fork_asm+0x1a/0x30
[30435.183382]  </TASK>
[30435.183384] Sending NMI from CPU 18 to CPUs 0-17,19-63:
[30435.183389] NMI backtrace for cpu 15
[30435.183392] CPU: 15 PID: 350429 Comm: pty03 Kdump: loaded Not tainted 6.=
10.0-0.rc2.28nbcontest.eln137.x86_64 #1
[30435.183396] Hardware name: Intel Corporation S2600CWR/S2600CWR, BIOS SE5=
C610.86B.01.01.0018.072020161249 07/20/2016
[30435.183397] RIP: 0033:0x405f4f
[30435.183431] Code: 45 84 c0 0f 85 5c 01 00 00 8b 15 68 ef 02 00 8b 05 66 =
ef 02 00 39 c2 7d 22 8b 05 60 ef 02 00 85 c0 75 18 83 05 29 ef 02 00 01 <8b=
> 15 47 ef 02 00 8b 05 45 ef 02 00 39 c2 7c df 90 44 89 e1 31 d2
[30435.183433] RSP: 002b:00007fffd5cc4130 EFLAGS: 00000206
[30435.183436] RAX: 0000000000000000 RBX: 00000000004347f0 RCX: 00000000028=
4a074
[30435.183439] RDX: 00000000000d15ca RSI: 0000000000434e28 RDI: 00007f4de8e=
5f1b8
[30435.183440] RBP: 0000000000000009 R08: 0000000000000000 R09: 00000000fff=
fffff
[30435.183442] R10: 0000000000000000 R11: 00007f4de8e5f178 R12: 00000000000=
00008
[30435.183444] R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000=
00002
[30435.183445] FS:  00007f4de8e53740 GS:  0000000000000000
[30435.183448] NMI backtrace for cpu 40 skipped: idling at intel_idle+0x59/=
0xa0
...
[30435.183608] NMI backtrace for cpu 9
[30435.183610] CPU: 9 PID: 18 Comm: pr/ttyS0 Kdump: loaded Not tainted 6.10=
.0-0.rc2.28nbcontest.eln137.x86_64 #1
[30435.183613] Hardware name: Intel Corporation S2600CWR/S2600CWR, BIOS SE5=
C610.86B.01.01.0018.072020161249 07/20/2016
[30435.183615] RIP: 0010:io_serial_in+0x1a/0x30
[30435.183620] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f =
1e fa 0f 1f 44 00 00 0f b6 87 c1 00 00 00 c4 e2 79 f7 d6 66 03 57 08 ec <0f=
> b6 c0 c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00
[30435.183622] RSP: 0018:ffffa96f4431fd58 EFLAGS: 00000002
[30435.183624] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000000=
00009
[30435.183625] RDX: 00000000000003fd RSI: 0000000000000005 RDI: ffffffffbb9=
90b60
[30435.183626] RBP: 000000000000270a R08: 00000000000008f7 R09: 00000000000=
00009
[30435.183628] R10: ffffa96f4431fd38 R11: 6465646461207070 R12: 00000000000=
00020
[30435.183629] R13: ffffffffbb990b60 R14: 0000000000000005 R15: 00000000000=
00032
[30435.183630] FS:  0000000000000000(0000) GS:ffff8c71bf480000(0000) knlGS:=
0000000000000000
[30435.183632] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[30435.183633] CR2: 00007f16ec050330 CR3: 000000000b420005 CR4: 00000000001=
706f0
[30435.183634] Call Trace:
[30435.183636]  <NMI>
[30435.183637]  ? show_trace_log_lvl+0x1b0/0x2f0
[30435.183642]  ? show_trace_log_lvl+0x1b0/0x2f0
[30435.183645]  ? wait_for_xmitr+0x4a/0xc0
[30435.183647]  ? nmi_cpu_backtrace.cold+0x32/0x63
[30435.183651]  ? nmi_cpu_backtrace_handler+0x11/0x20
[30435.183655]  ? nmi_handle+0x61/0x120
[30435.183659]  ? default_do_nmi+0x40/0x130
[30435.183662]  ? exc_nmi+0x103/0x180
[30435.183665]  ? end_repeat_nmi+0xf/0x53
[30435.183668]  ? io_serial_in+0x1a/0x30
[30435.183671]  ? io_serial_in+0x1a/0x30
[30435.183674]  ? io_serial_in+0x1a/0x30
[30435.183676]  </NMI>
[30435.183677]  <TASK>
[30435.183677]  wait_for_xmitr+0x4a/0xc0
[30435.183679]  serial8250_console_putchar+0x1a/0x40
[30435.183681]  ? __pfx_serial8250_console_putchar+0x10/0x10
[30435.183682]  uart_console_write+0x47/0x60
[30435.183686]  serial8250_console_write_thread+0xec/0x1d0
[30435.183688]  nbcon_emit_next_record+0x1b9/0x270
[30435.183692]  nbcon_kthread_func+0x1ff/0x260
[30435.183695]  ? __pfx_nbcon_kthread_func+0x10/0x10
[30435.183697]  kthread+0xd2/0x100
[30435.183700]  ? __pfx_kthread+0x10/0x10
[30435.183702]  ret_from_fork+0x34/0x50
[30435.183705]  ? __pfx_kthread+0x10/0x10
[30435.183707]  ret_from_fork_asm+0x1a/0x30
[30435.183710]  </TASK>
...
[30435.183893] NMI backtrace for cpu 50 skipped: idling at intel_idle+0x59/=
0xa0
[30435.183899] NMI backtrace for cpu 17
[30435.183903] CPU: 17 PID: 1312 Comm: systemd-udevd Kdump: loaded Not tain=
ted 6.10.0-0.rc2.28nbcontest.eln137.x86_64 #1
[30435.183907] Hardware name: Intel Corporation S2600CWR/S2600CWR, BIOS SE5=
C610.86B.01.01.0018.072020161249 07/20/2016
[30435.183909] RIP: 0033:0x5635e2eee50e
[30435.183928] Code: 8b 9b 80 00 00 00 83 78 10 01 75 e0 48 8b 85 78 ff ff =
ff 48 8b 08 48 85 c9 0f 84 2d 08 00 00 48 8b 85 78 ff ff ff 48 8b 50 30 <48=
> 8b 40 28 48 39 c2 0f 87 35 08 00 00 48 8b b5 78 ff ff ff 48 83
[30435.183931] RSP: 002b:00007ffc66ab47f0 EFLAGS: 00000206
[30435.183934] RAX: 0000563600dcc7e0 RBX: 0000563600797130 RCX: 00005636006=
cda60
[30435.183936] RDX: 0000000011d07ae3 RSI: 0000000000000000 RDI: 00000000000=
00007
[30435.183938] RBP: 00007ffc66ab4890 R08: 00005635e2f1eef0 R09: 00007ffc66a=
b46d0
[30435.183940] R10: 0000000000000000 R11: 0000000000000000 R12: 00005636013=
6bee0
[30435.183942] R13: 0000563601d38610 R14: 000056360097a330 R15: 00005635e2f=
1d95b
[30435.183944] FS:  00007fa9bcb24b40 GS:  0000000000000000
[30435.184389] Kernel panic - not syncing: softlockup: hung tasks
[30509.409049] CPU: 18 PID: 130 Comm: migration/18 Kdump: loaded Tainted: G=
             L    -------  ---  6.10.0-0.rc2.28nbcontest.eln137.x86_64 #1
[30509.423608] Hardware name: Intel Corporation S2600CWR/S2600CWR, BIOS SE5=
C610.86B.01.01.0018.072020161249 07/20/2016
[30509.435255] Stopper: multi_cpu_stop+0x0/0x100 <- migrate_swap+0xa9/0x130
[30509.442738] Call Trace:
[30509.445463]  <IRQ>
[30509.447705]  dump_stack_lvl+0x4e/0x70
[30509.451785]  panic+0x113/0x2c8
[30509.455194]  watchdog_timer_fn.cold+0x99/0xa0
[30509.460047]  ? __pfx_watchdog_timer_fn+0x10/0x10
[30509.465200]  __hrtimer_run_queues+0x13c/0x2a0
[30509.470061]  hrtimer_interrupt+0xfa/0x210
[30509.474542]  __sysvec_apic_timer_interrupt+0x55/0x100
[30509.480178]  sysvec_apic_timer_interrupt+0x6c/0x90
[30509.485537]  </IRQ>
[30509.487873]  <TASK>
[30509.490210]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
[30509.495947] RIP: 0010:rcu_momentary_dyntick_idle+0x21/0x50
[30509.502068] Code: 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 65 c6 05 38 c4 =
45 47 00 65 48 8b 15 34 f6 43 47 b8 08 00 00 00 f0 0f c1 82 c8 05 03 00 <a8=
> 04 74 0d 65 48 8b 3d 03 a5 45 47 e9 ce fe ff ff 0f 0b 65 48 8b
[30509.523024] RSP: 0000:ffffa96f469dbe38 EFLAGS: 00000212
[30509.528854] RAX: 00000000457705fc RBX: ffffa96f6bebbb78 RCX: ffff8c81bef=
25650
[30509.536821] RDX: ffff8c81bef00000 RSI: ffffa96f6bebbbd0 RDI: ffffffffb9a=
328e0
[30509.544794] RBP: ffffa96f6bebbb9c R08: 0000000000000155 R09: 00000000000=
00000
[30509.552755] R10: 0000000000000001 R11: 0000000000000001 R12: 00000000000=
00001
[30509.560716] R13: ffffffffb9a328e0 R14: 0000000000000001 R15: ffffffffb8c=
41a00
[30509.568678]  ? __pfx_multi_cpu_stop+0x10/0x10
[30509.573542]  multi_cpu_stop+0xb7/0x100
[30509.577735]  ? __pfx_multi_cpu_stop+0x10/0x10
[30509.582594]  cpu_stopper_thread+0x97/0x150
[30509.587163]  ? __pfx_smpboot_thread_fn+0x10/0x10
[30509.592315]  smpboot_thread_fn+0xdd/0x1d0
[30509.596795]  kthread+0xd2/0x100
[30509.600298]  ? __pfx_kthread+0x10/0x10
[30509.604479]  ret_from_fork+0x34/0x50
[30509.608468]  ? __pfx_kthread+0x10/0x10
[30509.612649]  ret_from_fork_asm+0x1a/0x30
[30509.617025]  </TASK>