[PATCH 1/2] xfs: take a breath in xlog_ioend_work()

Posted by alexjlzheng@gmail.com 5 days ago
From: Jinliang Zheng <alexjlzheng@tencent.com>

The xlog_ioend_work() function contains several nested loops with
fairly complex operations, which may lead to:

  PID: 2604722  TASK: ffff88c08306b1c0  CPU: 263  COMMAND: "kworker/263:0H"
   #0 [ffffc9001cbf8d58] machine_kexec at ffffffff9d086081
   #1 [ffffc9001cbf8db8] __crash_kexec at ffffffff9d20817a
   #2 [ffffc9001cbf8e78] panic at ffffffff9d107d8f
   #3 [ffffc9001cbf8ef8] watchdog_timer_fn at ffffffff9d243511
   #4 [ffffc9001cbf8f28] __hrtimer_run_queues at ffffffff9d1e62ff
   #5 [ffffc9001cbf8f80] hrtimer_interrupt at ffffffff9d1e73d4
   #6 [ffffc9001cbf8fd8] __sysvec_apic_timer_interrupt at ffffffff9d07bb29
   #7 [ffffc9001cbf8ff0] sysvec_apic_timer_interrupt at ffffffff9dd689f9
  --- <IRQ stack> ---
   #8 [ffffc900460a7c28] asm_sysvec_apic_timer_interrupt at ffffffff9de00e86
      [exception RIP: slab_free_freelist_hook.constprop.0+107]
      RIP: ffffffff9d3ef74b  RSP: ffffc900460a7cd0  RFLAGS: 00000286
      RAX: ffff89ea4de06b00  RBX: ffff89ea4de06a00  RCX: ffff89ea4de06a00
      RDX: 0000000000000100  RSI: ffffc900460a7d28  RDI: ffff888100044c80
      RBP: ffff888100044c80   R8: 0000000000000000   R9: ffffffffc21e8500
      R10: ffff88c867e93200  R11: 0000000000000001  R12: ffff89ea4de06a00
      R13: ffffc900460a7d28  R14: ffff89ea4de06a00  R15: ffffc900460a7d30
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
   #9 [ffffc900460a7d18] __kmem_cache_free at ffffffff9d3f65a0
  #10 [ffffc900460a7d70] xlog_cil_committed at ffffffffc21e85af [xfs]
  #11 [ffffc900460a7da0] xlog_cil_process_committed at ffffffffc21e9747 [xfs]
  #12 [ffffc900460a7dd0] xlog_state_do_iclog_callbacks at ffffffffc21e41eb [xfs]
  #13 [ffffc900460a7e28] xlog_state_do_callback at ffffffffc21e436f [xfs]
  #14 [ffffc900460a7e50] xlog_ioend_work at ffffffffc21e6e1c [xfs]
  #15 [ffffc900460a7e70] process_one_work at ffffffff9d12de69
  #16 [ffffc900460a7ea8] worker_thread at ffffffff9d12e79b
  #17 [ffffc900460a7ef8] kthread at ffffffff9d1378fc
  #18 [ffffc900460a7f30] ret_from_fork at ffffffff9d042dd0
  #19 [ffffc900460a7f50] ret_from_fork_asm at ffffffff9d007e2b

Add cond_resched() calls to these loops to avoid soft lockups like
the one shown above.

Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
---
 fs/xfs/xfs_log_cil.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 778ac47adb8c..c51c24f98acc 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -843,6 +843,8 @@ xlog_cil_ail_insert(
 					LOG_ITEM_BATCH_SIZE, ctx->start_lsn);
 			i = 0;
 		}
+
+		cond_resched();
 	}
 
 	/* make sure we insert the remainder! */
@@ -925,6 +927,7 @@ xlog_cil_process_committed(
 			struct xfs_cil_ctx, iclog_entry))) {
 		list_del(&ctx->iclog_entry);
 		xlog_cil_committed(ctx);
+		cond_resched();
 	}
 }
 
-- 
2.49.0
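
For context, the change applies a common kernel pattern: call
cond_resched() inside a long-running loop so the worker yields the CPU
between batches of work. Below is a minimal, generic sketch of that
pattern; struct item, process_batch() and BATCH_SIZE are hypothetical
stand-ins for illustration, not the actual xfs_log_cil.c code.

#include <linux/list.h>
#include <linux/sched.h>

#define BATCH_SIZE	32			/* hypothetical batch size */

struct item {
	struct list_head	list;
	/* ... payload ... */
};

/* hypothetical helper that does the expensive per-batch work */
static void process_batch(struct item **batch, int count);

static void process_items(struct list_head *items)
{
	struct item *ip, *n;
	struct item *batch[BATCH_SIZE];
	int i = 0;

	list_for_each_entry_safe(ip, n, items, list) {
		batch[i++] = ip;
		if (i == BATCH_SIZE) {
			process_batch(batch, i);
			i = 0;
		}

		/*
		 * Yield the CPU if needed so a long list cannot keep
		 * this worker running past the soft lockup threshold.
		 */
		cond_resched();
	}

	/* make sure we process the remainder! */
	if (i)
		process_batch(batch, i);
}
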
Re: [PATCH 1/2] xfs: take a breath in xlog_ioend_work()
Posted by Dave Chinner 4 days, 22 hours ago
On Thu, Feb 05, 2026 at 04:26:20PM +0800, alexjlzheng@gmail.com wrote:
> From: Jinliang Zheng <alexjlzheng@tencent.com>
> 
> The xlog_ioend_work() function contains several nested loops with
> fairly complex operations, which may lead to:
> 
>   PID: 2604722  TASK: ffff88c08306b1c0  CPU: 263  COMMAND: "kworker/263:0H"
>    #0 [ffffc9001cbf8d58] machine_kexec at ffffffff9d086081
>    #1 [ffffc9001cbf8db8] __crash_kexec at ffffffff9d20817a
>    #2 [ffffc9001cbf8e78] panic at ffffffff9d107d8f
>    #3 [ffffc9001cbf8ef8] watchdog_timer_fn at ffffffff9d243511
>    #4 [ffffc9001cbf8f28] __hrtimer_run_queues at ffffffff9d1e62ff
>    #5 [ffffc9001cbf8f80] hrtimer_interrupt at ffffffff9d1e73d4
>    #6 [ffffc9001cbf8fd8] __sysvec_apic_timer_interrupt at ffffffff9d07bb29
>    #7 [ffffc9001cbf8ff0] sysvec_apic_timer_interrupt at ffffffff9dd689f9
>   --- <IRQ stack> ---
>    #8 [ffffc900460a7c28] asm_sysvec_apic_timer_interrupt at ffffffff9de00e86
>       [exception RIP: slab_free_freelist_hook.constprop.0+107]
>       RIP: ffffffff9d3ef74b  RSP: ffffc900460a7cd0  RFLAGS: 00000286
>       RAX: ffff89ea4de06b00  RBX: ffff89ea4de06a00  RCX: ffff89ea4de06a00
>       RDX: 0000000000000100  RSI: ffffc900460a7d28  RDI: ffff888100044c80
>       RBP: ffff888100044c80   R8: 0000000000000000   R9: ffffffffc21e8500
>       R10: ffff88c867e93200  R11: 0000000000000001  R12: ffff89ea4de06a00
>       R13: ffffc900460a7d28  R14: ffff89ea4de06a00  R15: ffffc900460a7d30
>       ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>    #9 [ffffc900460a7d18] __kmem_cache_free at ffffffff9d3f65a0
>   #10 [ffffc900460a7d70] xlog_cil_committed at ffffffffc21e85af [xfs]
>   #11 [ffffc900460a7da0] xlog_cil_process_committed at ffffffffc21e9747 [xfs]
>   #12 [ffffc900460a7dd0] xlog_state_do_iclog_callbacks at ffffffffc21e41eb [xfs]
>   #13 [ffffc900460a7e28] xlog_state_do_callback at ffffffffc21e436f [xfs]
>   #14 [ffffc900460a7e50] xlog_ioend_work at ffffffffc21e6e1c [xfs]
>   #15 [ffffc900460a7e70] process_one_work at ffffffff9d12de69
>   #16 [ffffc900460a7ea8] worker_thread at ffffffff9d12e79b
>   #17 [ffffc900460a7ef8] kthread at ffffffff9d1378fc
>   #18 [ffffc900460a7f30] ret_from_fork at ffffffff9d042dd0
>   #19 [ffffc900460a7f50] ret_from_fork_asm at ffffffff9d007e2b
> 
> Add cond_resched() calls to these loops to avoid soft lockups like
> the one shown above.

You've elided the soft lockup messages that tell us how long this
task was holding the CPU. What is the soft lockup timer set to?
What workload causes this to happen? How do we reproduce it?

FWIW, yes, there might be several tens of thousands of objects to
move to the AIL in this journal IO completion path, but if this
takes more than a couple of hundred milliseconds of processing time
then there is something else going wrong....

-Dave.
-- 
Dave Chinner
david@fromorbit.com
Re: [PATCH 1/2] xfs: take a breath in xlog_ioend_work()
Posted by Jinliang Zheng 4 days, 20 hours ago
On Thu, 5 Feb 2026 21:54:49 +1100, david@fromorbit.com wrote:
> On Thu, Feb 05, 2026 at 04:26:20PM +0800, alexjlzheng@gmail.com wrote:
> > From: Jinliang Zheng <alexjlzheng@tencent.com>
> > 
> > The xlog_ioend_work() function contains several nested loops with
> > fairly complex operations, which may lead to:
> > 
> >   PID: 2604722  TASK: ffff88c08306b1c0  CPU: 263  COMMAND: "kworker/263:0H"
> >    #0 [ffffc9001cbf8d58] machine_kexec at ffffffff9d086081
> >    #1 [ffffc9001cbf8db8] __crash_kexec at ffffffff9d20817a
> >    #2 [ffffc9001cbf8e78] panic at ffffffff9d107d8f
> >    #3 [ffffc9001cbf8ef8] watchdog_timer_fn at ffffffff9d243511
> >    #4 [ffffc9001cbf8f28] __hrtimer_run_queues at ffffffff9d1e62ff
> >    #5 [ffffc9001cbf8f80] hrtimer_interrupt at ffffffff9d1e73d4
> >    #6 [ffffc9001cbf8fd8] __sysvec_apic_timer_interrupt at ffffffff9d07bb29
> >    #7 [ffffc9001cbf8ff0] sysvec_apic_timer_interrupt at ffffffff9dd689f9
> >   --- <IRQ stack> ---
> >    #8 [ffffc900460a7c28] asm_sysvec_apic_timer_interrupt at ffffffff9de00e86
> >       [exception RIP: slab_free_freelist_hook.constprop.0+107]
> >       RIP: ffffffff9d3ef74b  RSP: ffffc900460a7cd0  RFLAGS: 00000286
> >       RAX: ffff89ea4de06b00  RBX: ffff89ea4de06a00  RCX: ffff89ea4de06a00
> >       RDX: 0000000000000100  RSI: ffffc900460a7d28  RDI: ffff888100044c80
> >       RBP: ffff888100044c80   R8: 0000000000000000   R9: ffffffffc21e8500
> >       R10: ffff88c867e93200  R11: 0000000000000001  R12: ffff89ea4de06a00
> >       R13: ffffc900460a7d28  R14: ffff89ea4de06a00  R15: ffffc900460a7d30
> >       ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> >    #9 [ffffc900460a7d18] __kmem_cache_free at ffffffff9d3f65a0
> >   #10 [ffffc900460a7d70] xlog_cil_committed at ffffffffc21e85af [xfs]
> >   #11 [ffffc900460a7da0] xlog_cil_process_committed at ffffffffc21e9747 [xfs]
> >   #12 [ffffc900460a7dd0] xlog_state_do_iclog_callbacks at ffffffffc21e41eb [xfs]
> >   #13 [ffffc900460a7e28] xlog_state_do_callback at ffffffffc21e436f [xfs]
> >   #14 [ffffc900460a7e50] xlog_ioend_work at ffffffffc21e6e1c [xfs]
> >   #15 [ffffc900460a7e70] process_one_work at ffffffff9d12de69
> >   #16 [ffffc900460a7ea8] worker_thread at ffffffff9d12e79b
> >   #17 [ffffc900460a7ef8] kthread at ffffffff9d1378fc
> >   #18 [ffffc900460a7f30] ret_from_fork at ffffffff9d042dd0
> >   #19 [ffffc900460a7f50] ret_from_fork_asm at ffffffff9d007e2b
> > 
> > Add cond_resched() calls to these loops to avoid soft lockups like
> > the one shown above.
> 
> You've elided the soft lockup messages that tell us how long this
> task was holding the CPU. What is the soft lockup timer set to?
> What workload causes this to happen? How do we reproduce it?

Thanks for your reply. :)

The soft lockup timer is set to 20s, and the task held the CPU for 22s.

The workload is a test suite combining stress-ng, LTP, and fio,
executed concurrently. The issue is probabilistic and does not
reproduce reliably.

thanks,
Jinliang Zheng. :)

> 
> FWIW, yes, there might be several tens of thousands of objects to
> move to the AIL in this journal IO completion path, but if this
> takes more than a couple of hundred milliseconds of processing time
> then there is something else going wrong....

Is it possible that the system's CPU and memory were under heavy load,
so that each loop iteration took longer and the total time
accumulated to over 20 seconds?

> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Re: [PATCH 1/2] xfs: take a breath in xlog_ioend_work()
Posted by Dave Chinner 4 days, 12 hours ago
On Thu, Feb 05, 2026 at 08:49:38PM +0800, Jinliang Zheng wrote:
> On Thu, 5 Feb 2026 21:54:49 +1100, david@fromorbit.com wrote:
> > On Thu, Feb 05, 2026 at 04:26:20PM +0800, alexjlzheng@gmail.com wrote:
> > > From: Jinliang Zheng <alexjlzheng@tencent.com>
> > > 
> > > The xlog_ioend_work() function contains several nested loops with
> > > fairly complex operations, which may lead to:
> > > 
> > >   PID: 2604722  TASK: ffff88c08306b1c0  CPU: 263  COMMAND: "kworker/263:0H"
> > >    #0 [ffffc9001cbf8d58] machine_kexec at ffffffff9d086081
> > >    #1 [ffffc9001cbf8db8] __crash_kexec at ffffffff9d20817a
> > >    #2 [ffffc9001cbf8e78] panic at ffffffff9d107d8f
> > >    #3 [ffffc9001cbf8ef8] watchdog_timer_fn at ffffffff9d243511
> > >    #4 [ffffc9001cbf8f28] __hrtimer_run_queues at ffffffff9d1e62ff
> > >    #5 [ffffc9001cbf8f80] hrtimer_interrupt at ffffffff9d1e73d4
> > >    #6 [ffffc9001cbf8fd8] __sysvec_apic_timer_interrupt at ffffffff9d07bb29
> > >    #7 [ffffc9001cbf8ff0] sysvec_apic_timer_interrupt at ffffffff9dd689f9
> > >   --- <IRQ stack> ---
> > >    #8 [ffffc900460a7c28] asm_sysvec_apic_timer_interrupt at ffffffff9de00e86
> > >       [exception RIP: slab_free_freelist_hook.constprop.0+107]
> > >       RIP: ffffffff9d3ef74b  RSP: ffffc900460a7cd0  RFLAGS: 00000286
> > >       RAX: ffff89ea4de06b00  RBX: ffff89ea4de06a00  RCX: ffff89ea4de06a00
> > >       RDX: 0000000000000100  RSI: ffffc900460a7d28  RDI: ffff888100044c80
> > >       RBP: ffff888100044c80   R8: 0000000000000000   R9: ffffffffc21e8500
> > >       R10: ffff88c867e93200  R11: 0000000000000001  R12: ffff89ea4de06a00
> > >       R13: ffffc900460a7d28  R14: ffff89ea4de06a00  R15: ffffc900460a7d30
> > >       ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> > >    #9 [ffffc900460a7d18] __kmem_cache_free at ffffffff9d3f65a0
> > >   #10 [ffffc900460a7d70] xlog_cil_committed at ffffffffc21e85af [xfs]
> > >   #11 [ffffc900460a7da0] xlog_cil_process_committed at ffffffffc21e9747 [xfs]
> > >   #12 [ffffc900460a7dd0] xlog_state_do_iclog_callbacks at ffffffffc21e41eb [xfs]
> > >   #13 [ffffc900460a7e28] xlog_state_do_callback at ffffffffc21e436f [xfs]
> > >   #14 [ffffc900460a7e50] xlog_ioend_work at ffffffffc21e6e1c [xfs]
> > >   #15 [ffffc900460a7e70] process_one_work at ffffffff9d12de69
> > >   #16 [ffffc900460a7ea8] worker_thread at ffffffff9d12e79b
> > >   #17 [ffffc900460a7ef8] kthread at ffffffff9d1378fc
> > >   #18 [ffffc900460a7f30] ret_from_fork at ffffffff9d042dd0
> > >   #19 [ffffc900460a7f50] ret_from_fork_asm at ffffffff9d007e2b
> > > 
> > > Add cond_resched() calls to these loops to avoid soft lockups like
> > > the one shown above.
> > 
> > You've elided the soft lockup messages that tell us how long this
> > task was holding the CPU. What is the soft lockup timer set to?
> > What workload causes this to happen? How do we reproduce it?
> 
> Thanks for your reply. :)
> 
> The soft lockup timer is set to 20s, and the task held the CPU for 22s.

Yep, something else must be screwed up here - an iclog completion
should never have enough items attached to it that it takes this long
to process them without yielding the CPU.

We can only loop once around the iclogs in this path these days,
because the iclog we are completing runs xlog_ioend_work() with
the iclog->ic_sema held. This locks out new IO being issued on that
iclog whilst we are processing the completion, and hence the
iclogbuf ring will stall trying to write new items to this iclog.

Hence whilst we are processing an iclog completion, the entire
journal (and hence filesystem) can stall waiting for the completion
to finish and release the iclog->ic_sema.

This also means we can have, at most, 8 iclogs worth of journal
writes to complete in xlog_ioend_work(). That greatly limits the
number of items we are processing in the xlog_state_do_callback()
loops. Yes, it can be tens of thousands of items, but it is bound by
journal size and the checkpoint pipeline depth (maximum of 4
checkpoints in flight at once).

So I don't see how the number of items that we are asking to be
processed during journal completion, by itself, can cause such a
long processing time. We typically process upwards of a thousand
items per millisecond here...
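
(As a minimal sketch of that gating, with hypothetical helper names
rather than the real fs/xfs/xfs_log.c code: the semaphore is taken when
an iclog is submitted for IO and released only after its completion
callbacks have run, so completion work is bounded by the number of
iclogs.)

#include <linux/semaphore.h>

struct iclog {
	struct semaphore	ic_sema;	/* held from IO submission to completion */
	/* ... */
};

static void submit_iclog_io(struct iclog *iclog);		/* hypothetical */
static void run_completion_callbacks(struct iclog *iclog);	/* hypothetical */

/* hypothetical submission path */
static void iclog_write(struct iclog *iclog)
{
	/* Blocks while a previous completion is still being processed. */
	down(&iclog->ic_sema);
	submit_iclog_io(iclog);
}

/* hypothetical completion path, run from xlog_ioend_work() context */
static void iclog_ioend(struct iclog *iclog)
{
	run_completion_callbacks(iclog);	/* AIL insertion, freeing, ... */

	/* Only now can the iclog be reused for new journal writes. */
	up(&iclog->ic_sema);
}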

> The workload is a test suite combining stress-ng, LTP, and fio,
> executed concurrently. The issue is probabilistic and does not
> reproduce reliably.
> 
> thanks,
> Jinliang Zheng. :)
> 
> > 
> > FWIW, yes, there might be several tens of thousands of objects to
> > move to the AIL in this journal IO completion path, but if this
> > takes more than a couple of hundred milliseconds of processing time
> > then there is something else going wrong....
> 
> Is it possible that the system's CPU and memory were under heavy load,
> so that each loop iteration took longer and the total time
> accumulated to over 20 seconds?

That's a characteristic behaviour of catastrophic spin lock
contention...

-Dave.
-- 
Dave Chinner
david@fromorbit.com