xfs: snapshot current CIL sequence under xc_push_lock

[PATCH] xfs: snapshot current CIL sequence under xc_push_lock

Posted by Cen Zhang 1 month, 1 week ago

xlog_cil_force() and xlog_cil_flush() both use the current CIL
checkpoint sequence to request an immediate push.  They currently read
xc_current_sequence before xlog_cil_push_now() takes xc_push_lock.

The CIL push worker advances xc_current_sequence under xc_push_lock while
switching to a new checkpoint context.  If a current-checkpoint force reads
an old sequence and then publishes it as xc_push_seq after the worker has
moved to the next context, xlog_cil_push_work() can treat the request as a
previously pushed sequence and skip it via the push_seq < ctx->sequence
check.  That can leave the current dirty checkpoint unqueued by a
force/flush request.

Make xlog_cil_push_now() resolve push_seq == 0 to the current sequence
while holding xc_push_lock, and return the resolved sequence for tracing
and waiting.  Route xlog_cil_force() and xlog_cil_flush() through that path
so the current-sequence snapshot and xc_push_seq publication are serialized
with the context switch.

Signed-off-by: Cen Zhang <zzzccc427@gmail.com>
---
 fs/xfs/xfs_log_cil.c  | 34 +++++++++++++++++-----------------
 fs/xfs/xfs_log_priv.h |  2 +-
 2 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index edc368938f30..6a1ced3c9314 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -1707,24 +1707,25 @@ xlog_cil_push_background(
  * mechanism. Hence in this case we need to pass a flag to the push work to
  * indicate it needs to flush the commit record itself.
  */
-static void
+static xfs_csn_t
 xlog_cil_push_now(
 	struct xlog	*log,
-	xfs_lsn_t	push_seq,
+	xfs_csn_t	push_seq,
 	bool		async)
 {
 	struct xfs_cil	*cil = log->l_cilp;
 
 	if (!cil)
-		return;
-
-	ASSERT(push_seq && push_seq <= cil->xc_current_sequence);
+		return 0;
 
 	/* start on any pending background push to minimise wait time on it */
 	if (!async)
 		flush_workqueue(cil->xc_push_wq);
 
 	spin_lock(&cil->xc_push_lock);
+	if (!push_seq)
+		push_seq = cil->xc_current_sequence;
+	ASSERT(push_seq && push_seq <= cil->xc_current_sequence);
 
 	/*
 	 * If this is an async flush request, we always need to set the
@@ -1742,12 +1743,13 @@ xlog_cil_push_now(
 	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags) ||
 	    push_seq <= cil->xc_push_seq) {
 		spin_unlock(&cil->xc_push_lock);
-		return;
+		return push_seq;
 	}
 
 	cil->xc_push_seq = push_seq;
 	queue_work(cil->xc_push_wq, &cil->xc_ctx->push_work);
 	spin_unlock(&cil->xc_push_lock);
+	return push_seq;
 }
 
 bool
@@ -1880,16 +1882,16 @@ void
 xlog_cil_flush(
 	struct xlog	*log)
 {
-	xfs_csn_t	seq = log->l_cilp->xc_current_sequence;
+	struct xfs_cil	*cil = log->l_cilp;
+	xfs_csn_t	seq = xlog_cil_push_now(log, 0, true);
 
 	trace_xfs_log_force(log->l_mp, seq, _RET_IP_);
-	xlog_cil_push_now(log, seq, true);
 
 	/*
 	 * If the CIL is empty, make sure that any previous checkpoint that may
 	 * still be in an active iclog is pushed to stable storage.
 	 */
-	if (test_bit(XLOG_CIL_EMPTY, &log->l_cilp->xc_flags))
+	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags))
 		xfs_log_force(log->l_mp, 0);
 }
 
@@ -1911,12 +1913,7 @@ xlog_cil_force_seq(
 	struct xfs_cil		*cil = log->l_cilp;
 	struct xfs_cil_ctx	*ctx;
 	xfs_lsn_t		commit_lsn = NULLCOMMITLSN;
-
-	ASSERT(sequence <= cil->xc_current_sequence);
-
-	if (!sequence)
-		sequence = cil->xc_current_sequence;
-	trace_xfs_log_force(log->l_mp, sequence, _RET_IP_);
+	bool			traced = false;
 
 	/*
 	 * check to see if we need to force out the current context.
@@ -1924,7 +1921,11 @@ xlog_cil_force_seq(
 	 * so no need to deal with it here.
 	 */
 restart:
-	xlog_cil_push_now(log, sequence, false);
+	sequence = xlog_cil_push_now(log, sequence, false);
+	if (!traced) {
+		trace_xfs_log_force(log->l_mp, sequence, _RET_IP_);
+		traced = true;
+	}
 
 	/*
 	 * See if we can find a previous sequence still committing.
@@ -2066,4 +2067,3 @@ xlog_cil_destroy(
 	destroy_workqueue(cil->xc_push_wq);
 	kfree(cil);
 }
-
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index cf1e4ce61a8c..4ae98d3800ea 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -582,7 +582,7 @@ xfs_lsn_t xlog_cil_force_seq(struct xlog *log, xfs_csn_t sequence);
 static inline void
 xlog_cil_force(struct xlog *log)
 {
-	xlog_cil_force_seq(log, log->l_cilp->xc_current_sequence);
+	xlog_cil_force_seq(log, 0);
 }
 
 /*

Re: [PATCH] xfs: snapshot current CIL sequence under xc_push_lock

Posted by Dave Chinner 1 month, 1 week ago

On Wed, May 06, 2026 at 09:58:11AM +0800, Cen Zhang wrote:
> xlog_cil_force() and xlog_cil_flush() both use the current CIL
> checkpoint sequence to request an immediate push.  They currently read
> xc_current_sequence before xlog_cil_push_now() takes xc_push_lock.
> 
> The CIL push worker advances xc_current_sequence under xc_push_lock while
> switching to a new checkpoint context.  If a current-checkpoint force reads
> an old sequence and then publishes it as xc_push_seq after the worker has
> moved to the next context, xlog_cil_push_work() can treat the request as a
> previously pushed sequence and skip it via the push_seq < ctx->sequence
> check.  That can leave the current dirty checkpoint unqueued by a
> force/flush request.

Which does not matter at all.

Integrity in XFS is defined by completion-to-submission semantics.
IOWs, the only thing that a CIL -force- guarantees is that it will
ensure the current CIL context containing fully completed
transactions when it is called.

See, for example, __xfs_trans_commit() -> xfs_log_force() for
XFS_TRANS_SYNC transactions.

So if xlog_cil_force() races with a push switching to a new sequence
number, it means that all the transactions that were completed at
the time xfs_log_force() was called are -already being pushed-.
Hence all we need to do is wait for the CIL context at that sequence
to be committed. The new sequence number, by definition, contains
modifications that completed -after- the log force was started, and
so are outside the scope of the current force operation.

IOWs, we can safely skip the new sequence number in teh case of a
push/force race, and the force will still correctly wait on the old
sequence number being forced to disk. The race is benign, and not a
bug.

As for xlog_cil_flush(), it is a latency reduction mechanism and not
a integrity mechanism. We just don't care about racing, because the
caller will call it again if sufficient progress wasn't made by the
previous call.

Hence I don't think there is a bug that needs fixing here. Yes, the
fact that deducing it is ok requires understanding
competion-to-submission integrity semantics, but that's kinda
assumed knowledge because it's exactly the same semantics as we
relyon for data/metadata ordering and integrity w/ O_SYNC/O_DSYNC
operations...

> Make xlog_cil_push_now() resolve push_seq == 0 to the current sequence
> while holding xc_push_lock, and return the resolved sequence for tracing
> and waiting.  Route xlog_cil_force() and xlog_cil_flush() through that path
> so the current-sequence snapshot and xc_push_seq publication are serialized
> with the context switch.

So the change is likely correct, but somewhat nasty in places.
Regardless, I'm inclined to reject it based on "if it isn't broken,
don't fix it" principles.

In future, when reporting problems and fixes like this, it is very
important that you say how the issue was found, whether it can be
reproduced, hardware configs, etc. Presenting multiple fixes in a
short period for a wide variety of different conditions in extremely
complex code without any context of how/when they occur or what
impact they have on systems smells like AI slop more than
anything....

So, please document what tools/AI are you using to find these
problems and how they can be reproduced. I haven't commented on the
other two "fixes" you've just posted because I haven't got hours to
dig into them right now. You also posted a fix a month ago for a
race condition that wasn't a race condition in buf log item
handling, so something is clearly directing you at them....

-Dave.
-- 
Dave Chinner
dgc@kernel.org

Re: [PATCH] xfs: snapshot current CIL sequence under xc_push_lock

Posted by Cen Zhang 1 month, 1 week ago

Dear Chinner

>
> Which does not matter at all.
>
> Integrity in XFS is defined by completion-to-submission semantics.
> IOWs, the only thing that a CIL -force- guarantees is that it will
> ensure the current CIL context containing fully completed
> transactions when it is called.
>
> See, for example, __xfs_trans_commit() -> xfs_log_force() for
> XFS_TRANS_SYNC transactions.
>
> So if xlog_cil_force() races with a push switching to a new sequence
> number, it means that all the transactions that were completed at
> the time xfs_log_force() was called are -already being pushed-.
> Hence all we need to do is wait for the CIL context at that sequence
> to be committed. The new sequence number, by definition, contains
> modifications that completed -after- the log force was started, and
> so are outside the scope of the current force operation.
>
> IOWs, we can safely skip the new sequence number in teh case of a
> push/force race, and the force will still correctly wait on the old
> sequence number being forced to disk. The race is benign, and not a
> bug.
>
> As for xlog_cil_flush(), it is a latency reduction mechanism and not
> a integrity mechanism. We just don't care about racing, because the
> caller will call it again if sufficient progress wasn't made by the
> previous call.
>
> Hence I don't think there is a bug that needs fixing here. Yes, the
> fact that deducing it is ok requires understanding
> competion-to-submission integrity semantics, but that's kinda
> assumed knowledge because it's exactly the same semantics as we
> relyon for data/metadata ordering and integrity w/ O_SYNC/O_DSYNC
> operations...
>
>
> So the change is likely correct, but somewhat nasty in places.
> Regardless, I'm inclined to reject it based on "if it isn't broken,
> don't fix it" principles.
>
> In future, when reporting problems and fixes like this, it is very
> important that you say how the issue was found, whether it can be
> reproduced, hardware configs, etc. Presenting multiple fixes in a
> short period for a wide variety of different conditions in extremely
> complex code without any context of how/when they occur or what
> impact they have on systems smells like AI slop more than
> anything....
>
> So, please document what tools/AI are you using to find these
> problems and how they can be reproduced. I haven't commented on the
> other two "fixes" you've just posted because I haven't got hours to
> dig into them right now. You also posted a fix a month ago for a
> race condition that wasn't a race condition in buf log item
> handling, so something is clearly directing you at them....
>

Thanks for the detailed explanation.

You are right. My analysis of the XFS CIL checkpoint semantics was not
careful enough, and I overstated the possible effect of this window.

The report originally came from dynamic testing with a customized
syzkaller setup, which observed an unsynchronised access pattern at
runtime. I then tried to analyse whether that pattern could lead to a
real correctness problem, with some AI assistance during the reasoning
process. However, I did not validate that analysis sufficiently against
the actual XFS CIL semantics, and that led to an incorrect conclusion on
my side.

I am sorry for the noise and for taking your time on this.

In the past, when I sent more direct/raw reports from my testing, some of
them were reasonably treated as bot-like reports. Since then I have been
trying to better understand the reports before sending them, so that I
can submit more useful and actionable issues or patches. Clearly I still
got this wrong here.

I will take your comments seriously and be more careful in future
reports, especially when reasoning about subsystem-specific semantics
such as XFS log/CIL behavior.

Thanks again for the explanation, and sorry again for the noise.

Best regards
Cen

Re: [PATCH] xfs: snapshot current CIL sequence under xc_push_lock

Posted by Dave Chinner 1 month ago

On Thu, May 07, 2026 at 12:34:23PM +0800, Cen Zhang wrote:
> In the past, when I sent more direct/raw reports from my testing, some of
> them were reasonably treated as bot-like reports. Since then I have been
> trying to better understand the reports before sending them, so that I
> can submit more useful and actionable issues or patches. Clearly I still
> got this wrong here.

The important thing is to explain how you found the issue, how it
can be reproduced the impact of the bug being fixed, etc.
Describing a race condition and it's fix purely in theoretical terms
and omitting all other context makes it feel very "bot driven" as
they tend to lack all context other than code analysis and the
change itself...

Think about how you'd describe the impact of the bug to someone, and
how'd they'd diagnose the problem if it were occuring on their
system. A good commit message should allow a user to identify the
problem (and the fix) from it's contents...

> I will take your comments seriously and be more careful in future
> reports, especially when reasoning about subsystem-specific semantics
> such as XFS log/CIL behavior.

You don't need to get it 100% right before you post a fix - it's
often much faster to post an RFC or ask a question and get immediate
feedback than to try to be perfect on the first submission.

> Thanks again for the explanation, and sorry again for the noise.

It's not noise when people are listening and learning - that makes
it time well spent IMO. :)

-Dave,
-- 
Dave Chinner
dgc@kernel.org