jbd2/ext4/ocfs2: READ_ONCE for lockless jinode reads

[PATCH 3/3] ocfs2: use READ_ONCE for lockless jinode reads

Posted by Li Chen 1 week, 2 days ago

ocfs2 journal commit callback reads jbd2_inode dirty range fields without
holding journal->j_list_lock.

Use READ_ONCE() for these reads to correct the concurrency assumptions.

Suggested-by: Jan Kara <jack@suse.com>
Signed-off-by: Li Chen <me@linux.beauty>
---
 fs/ocfs2/journal.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
index 85239807dec7..7032284cdbd6 100644
--- a/fs/ocfs2/journal.c
+++ b/fs/ocfs2/journal.c
@@ -902,8 +902,11 @@ int ocfs2_journal_alloc(struct ocfs2_super *osb)
 
 static int ocfs2_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
 {
-	return filemap_fdatawrite_range(jinode->i_vfs_inode->i_mapping,
-			jinode->i_dirty_start, jinode->i_dirty_end);
+	struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
+	loff_t dirty_start = READ_ONCE(jinode->i_dirty_start);
+	loff_t dirty_end = READ_ONCE(jinode->i_dirty_end);
+
+	return filemap_fdatawrite_range(mapping, dirty_start, dirty_end);
 }
 
 int ocfs2_journal_init(struct ocfs2_super *osb, int *dirty)
-- 
2.52.0

Re: [PATCH 3/3] ocfs2: use READ_ONCE for lockless jinode reads

Posted by Matthew Wilcox 1 week, 2 days ago

On Fri, Jan 30, 2026 at 11:12:32AM +0800, Li Chen wrote:
> ocfs2 journal commit callback reads jbd2_inode dirty range fields without
> holding journal->j_list_lock.
> 
> Use READ_ONCE() for these reads to correct the concurrency assumptions.

I don't think this is the right solution to the problem.  If it is,
there needs to be much better argumentation in the commit message.

As I understand it, jbd2_journal_file_inode() initialises jinode,
then adds it to the t_inode_list, then drops the j_list_lock.  So the
actual problem we need to address is that there's no memory barrier
between the store to i_dirty_start and the list_add().  Once that's
added, there's no need for a READ_ONCE here.

Or have I misunderstood the problem?

> Suggested-by: Jan Kara <jack@suse.com>
> Signed-off-by: Li Chen <me@linux.beauty>
> ---
>  fs/ocfs2/journal.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
> index 85239807dec7..7032284cdbd6 100644
> --- a/fs/ocfs2/journal.c
> +++ b/fs/ocfs2/journal.c
> @@ -902,8 +902,11 @@ int ocfs2_journal_alloc(struct ocfs2_super *osb)
>  
>  static int ocfs2_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
>  {
> -	return filemap_fdatawrite_range(jinode->i_vfs_inode->i_mapping,
> -			jinode->i_dirty_start, jinode->i_dirty_end);
> +	struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
> +	loff_t dirty_start = READ_ONCE(jinode->i_dirty_start);
> +	loff_t dirty_end = READ_ONCE(jinode->i_dirty_end);
> +
> +	return filemap_fdatawrite_range(mapping, dirty_start, dirty_end);
>  }
>  
>  int ocfs2_journal_init(struct ocfs2_super *osb, int *dirty)
> -- 
> 2.52.0
>

Re: [PATCH 3/3] ocfs2: use READ_ONCE for lockless jinode reads

Posted by Li Chen 1 week, 2 days ago

Hi Matthew,

 > On Fri, Jan 30, 2026 at 11:12:32AM +0800, Li Chen wrote:
 > > ocfs2 journal commit callback reads jbd2_inode dirty range fields without
 > > holding journal->j_list_lock.
 > > 
 > > Use READ_ONCE() for these reads to correct the concurrency assumptions.
 > 
 > I don't think this is the right solution to the problem.  If it is,
 > there needs to be much better argumentation in the commit message.
 > 
 > As I understand it, jbd2_journal_file_inode() initialises jinode,
 > then adds it to the t_inode_list, then drops the j_list_lock.  So the
 > actual problem we need to address is that there's no memory barrier
 > between the store to i_dirty_start and the list_add().  Once that's
 > added, there's no need for a READ_ONCE here.
 > 
 > Or have I misunderstood the problem?

Thanks for the review.

My understanding of your point is that you're worried about a missing
"publish" ordering in jbd2_journal_file_inode(): we store
jinode->i_dirty_start/end and then list_add() the jinode to
t_inode_list, and a core which observes the list entry might miss the prior
i_dirty_* stores. Is that the issue you had in mind?

If so, for the normal commit path where the list is walked under
journal->j_list_lock (e.g. journal_submit_data_buffers() in
fs/jbd2/commit.c), spin_lock()/spin_unlock() should already provide the
necessary ordering, since both the i_dirty_* updates and the list_add()
happen inside the same critical section.

The ocfs2 case I was aiming at is different: the filesystem callback is
invoked after unlocking journal->j_list_lock and may sleep, so it can't hold
j_list_lock but it still reads jinode->i_dirty_start/end while other
threads update these fields under the lock. Adding a barrier between the
stores and list_add() would not address that concurrent update window.

So the itent of READ_ONCE() in ocfs2 is to take a single snapshot of the
dirty range values from memory (avoid compiler to reuse a value kept in a
register or fold multiple reads). I'm not trying to claim any additional
memory ordering from this change.

I'll respin and adjust the commit message accordingly. The updated part will
say along the lines of:

"ocfs2 reads jinode->i_dirty_start/end without journal->j_list_lock
(callback may sleep); these fields are updated under j_list_lock in jbd2.
Use READ_ONCE() so the callback takes a single snapshot via actual loads
from the variable (i.e. don't let the compiler reuse a value kept in a register
or fold multiple reads)."

Does that match your understanding? 

Regards,
Li

 > > Suggested-by: Jan Kara <jack@suse.com>
 > > Signed-off-by: Li Chen <me@linux.beauty>
 > > ---
 > >  fs/ocfs2/journal.c | 7 +++++--
 > >  1 file changed, 5 insertions(+), 2 deletions(-)
 > > 
 > > diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
 > > index 85239807dec7..7032284cdbd6 100644
 > > --- a/fs/ocfs2/journal.c
 > > +++ b/fs/ocfs2/journal.c
 > > @@ -902,8 +902,11 @@ int ocfs2_journal_alloc(struct ocfs2_super *osb)
 > >  
 > >  static int ocfs2_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
 > >  {
 > > -    return filemap_fdatawrite_range(jinode->i_vfs_inode->i_mapping,
 > > -            jinode->i_dirty_start, jinode->i_dirty_end);
 > > +    struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
 > > +    loff_t dirty_start = READ_ONCE(jinode->i_dirty_start);
 > > +    loff_t dirty_end = READ_ONCE(jinode->i_dirty_end);
 > > +
 > > +    return filemap_fdatawrite_range(mapping, dirty_start, dirty_end);
 > >  }
 > >  
 > >  int ocfs2_journal_init(struct ocfs2_super *osb, int *dirty)
 > > -- 
 > > 2.52.0
 > > 
 >

Re: [PATCH 3/3] ocfs2: use READ_ONCE for lockless jinode reads

Posted by Matthew Wilcox 1 week, 2 days ago

On Fri, Jan 30, 2026 at 08:26:40PM +0800, Li Chen wrote:
> Hi Matthew,
> 
>  > On Fri, Jan 30, 2026 at 11:12:32AM +0800, Li Chen wrote:
>  > > ocfs2 journal commit callback reads jbd2_inode dirty range fields without
>  > > holding journal->j_list_lock.
>  > > 
>  > > Use READ_ONCE() for these reads to correct the concurrency assumptions.
>  > 
>  > I don't think this is the right solution to the problem.  If it is,
>  > there needs to be much better argumentation in the commit message.
>  > 
>  > As I understand it, jbd2_journal_file_inode() initialises jinode,
>  > then adds it to the t_inode_list, then drops the j_list_lock.  So the
>  > actual problem we need to address is that there's no memory barrier
>  > between the store to i_dirty_start and the list_add().  Once that's
>  > added, there's no need for a READ_ONCE here.
>  > 
>  > Or have I misunderstood the problem?
> 
> Thanks for the review.
> 
> My understanding of your point is that you're worried about a missing
> "publish" ordering in jbd2_journal_file_inode(): we store
> jinode->i_dirty_start/end and then list_add() the jinode to
> t_inode_list, and a core which observes the list entry might miss the prior
> i_dirty_* stores. Is that the issue you had in mind?

I think that's the only issue that exists ...

> If so, for the normal commit path where the list is walked under
> journal->j_list_lock (e.g. journal_submit_data_buffers() in
> fs/jbd2/commit.c), spin_lock()/spin_unlock() should already provide the
> necessary ordering, since both the i_dirty_* updates and the list_add()
> happen inside the same critical section.

I don't think that's true.  I think what you're asserting is that:

	int *pi;
	int **ppi;

	spin_lock(&lock);
	*pi = 1;
	*ppi = pi;
	spin_unlock(&lock);

that the store to *pi must be observed before the store to *ppi, and
that's not true for a reader which doesn't read the value of lock.
The store to *ppi needs a store barrier before it.

> The ocfs2 case I was aiming at is different: the filesystem callback is
> invoked after unlocking journal->j_list_lock and may sleep, so it can't hold
> j_list_lock but it still reads jinode->i_dirty_start/end while other
> threads update these fields under the lock. Adding a barrier between the
> stores and list_add() would not address that concurrent update window.

I don't think that race exists.  If it does exist, the READ_ONCE will
not help (on 32 bit platforms) because it's a 64-bit quantity and 32-bit
platforms do not, in general, have a way to do an atomic 64-bit load
(look at the implementation of i_size_read() for the gyrations we go
through to assure a non-torn read of that value).

> "ocfs2 reads jinode->i_dirty_start/end without journal->j_list_lock
> (callback may sleep); these fields are updated under j_list_lock in jbd2.
> Use READ_ONCE() so the callback takes a single snapshot via actual loads
> from the variable (i.e. don't let the compiler reuse a value kept in a register
> or fold multiple reads)."

I think the prevention of this race occurs at a higher level than
"it's updated under a lock".  That is, jbd2_journal_file_inode()
is never called for a jinode which is currently being operated on by
j_submit_inode_data_buffers().  Now, I'm not an expert on the jbd code,
so I may be wrong here.

Re: [PATCH 3/3] ocfs2: use READ_ONCE for lockless jinode reads

Posted by Jan Kara 6 days, 4 hours ago

On Fri 30-01-26 16:36:28, Matthew Wilcox wrote:
> On Fri, Jan 30, 2026 at 08:26:40PM +0800, Li Chen wrote:
> > Hi Matthew,
> > 
> >  > On Fri, Jan 30, 2026 at 11:12:32AM +0800, Li Chen wrote:
> >  > > ocfs2 journal commit callback reads jbd2_inode dirty range fields without
> >  > > holding journal->j_list_lock.
> >  > > 
> >  > > Use READ_ONCE() for these reads to correct the concurrency assumptions.
> >  > 
> >  > I don't think this is the right solution to the problem.  If it is,
> >  > there needs to be much better argumentation in the commit message.
> >  > 
> >  > As I understand it, jbd2_journal_file_inode() initialises jinode,
> >  > then adds it to the t_inode_list, then drops the j_list_lock.  So the
> >  > actual problem we need to address is that there's no memory barrier
> >  > between the store to i_dirty_start and the list_add().  Once that's
> >  > added, there's no need for a READ_ONCE here.
> >  > 
> >  > Or have I misunderstood the problem?
> > 
> > Thanks for the review.
> > 
> > My understanding of your point is that you're worried about a missing
> > "publish" ordering in jbd2_journal_file_inode(): we store
> > jinode->i_dirty_start/end and then list_add() the jinode to
> > t_inode_list, and a core which observes the list entry might miss the prior
> > i_dirty_* stores. Is that the issue you had in mind?
> 
> I think that's the only issue that exists ...
> 
> > If so, for the normal commit path where the list is walked under
> > journal->j_list_lock (e.g. journal_submit_data_buffers() in
> > fs/jbd2/commit.c), spin_lock()/spin_unlock() should already provide the
> > necessary ordering, since both the i_dirty_* updates and the list_add()
> > happen inside the same critical section.
> 
> I don't think that's true.  I think what you're asserting is that:
> 
> 	int *pi;
> 	int **ppi;
> 
> 	spin_lock(&lock);
> 	*pi = 1;
> 	*ppi = pi;
> 	spin_unlock(&lock);
> 
> that the store to *pi must be observed before the store to *ppi, and
> that's not true for a reader which doesn't read the value of lock.
> The store to *ppi needs a store barrier before it.

Well, the above reasonably accurately describes the code making jinode
visible. The reader code is like:

	spin_lock(&lock);
        pi = *ppi;
	spin_unlock(&lock);

	work with pi

so it is guaranteed to see pi properly initialized. The problem is that
"work with pi" can race with other code updating the content of pi which is
what this patch is trying to deal with.

> > The ocfs2 case I was aiming at is different: the filesystem callback is
> > invoked after unlocking journal->j_list_lock and may sleep, so it can't hold
> > j_list_lock but it still reads jinode->i_dirty_start/end while other
> > threads update these fields under the lock. Adding a barrier between the
> > stores and list_add() would not address that concurrent update window.
> 
> I don't think that race exists.  If it does exist, the READ_ONCE will
> not help (on 32 bit platforms) because it's a 64-bit quantity and 32-bit
> platforms do not, in general, have a way to do an atomic 64-bit load
> (look at the implementation of i_size_read() for the gyrations we go
> through to assure a non-torn read of that value).

Sadly the race does exist - journal_submit_data_buffers() on the committing
transaction can run in parallel with jbd2_journal_file_inode() in the
running transaction. There's nothing preventing that. The problems arising
out of that are mostly theoretical but they do exist. In particular you're
correct that on 32-bit platforms this will be racy even with READ_ONCE /
WRITE_ONCE which I didn't realize.

Li, the best way to address this concern would be to modify jbd2_inode to
switch i_dirty_start / i_dirty_end to account in PAGE_SIZE units instead of
bytes and be of type pgoff_t. jbd2_journal_file_inode() just needs to round
the passed ranges properly...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

Re: [PATCH 3/3] ocfs2: use READ_ONCE for lockless jinode reads

Posted by Li Chen 5 days, 9 hours ago

Hi Jan,

 ---- On Tue, 03 Feb 2026 01:17:49 +0800  Jan Kara <jack@suse.cz> wrote --- 
 > On Fri 30-01-26 16:36:28, Matthew Wilcox wrote:
 > > On Fri, Jan 30, 2026 at 08:26:40PM +0800, Li Chen wrote:
 > > > Hi Matthew,
 > > > 
 > > >  > On Fri, Jan 30, 2026 at 11:12:32AM +0800, Li Chen wrote:
 > > >  > > ocfs2 journal commit callback reads jbd2_inode dirty range fields without
 > > >  > > holding journal->j_list_lock.
 > > >  > > 
 > > >  > > Use READ_ONCE() for these reads to correct the concurrency assumptions.
 > > >  > 
 > > >  > I don't think this is the right solution to the problem.  If it is,
 > > >  > there needs to be much better argumentation in the commit message.
 > > >  > 
 > > >  > As I understand it, jbd2_journal_file_inode() initialises jinode,
 > > >  > then adds it to the t_inode_list, then drops the j_list_lock.  So the
 > > >  > actual problem we need to address is that there's no memory barrier
 > > >  > between the store to i_dirty_start and the list_add().  Once that's
 > > >  > added, there's no need for a READ_ONCE here.
 > > >  > 
 > > >  > Or have I misunderstood the problem?
 > > > 
 > > > Thanks for the review.
 > > > 
 > > > My understanding of your point is that you're worried about a missing
 > > > "publish" ordering in jbd2_journal_file_inode(): we store
 > > > jinode->i_dirty_start/end and then list_add() the jinode to
 > > > t_inode_list, and a core which observes the list entry might miss the prior
 > > > i_dirty_* stores. Is that the issue you had in mind?
 > > 
 > > I think that's the only issue that exists ...
 > > 
 > > > If so, for the normal commit path where the list is walked under
 > > > journal->j_list_lock (e.g. journal_submit_data_buffers() in
 > > > fs/jbd2/commit.c), spin_lock()/spin_unlock() should already provide the
 > > > necessary ordering, since both the i_dirty_* updates and the list_add()
 > > > happen inside the same critical section.
 > > 
 > > I don't think that's true.  I think what you're asserting is that:
 > > 
 > >     int *pi;
 > >     int **ppi;
 > > 
 > >     spin_lock(&lock);
 > >     *pi = 1;
 > >     *ppi = pi;
 > >     spin_unlock(&lock);
 > > 
 > > that the store to *pi must be observed before the store to *ppi, and
 > > that's not true for a reader which doesn't read the value of lock.
 > > The store to *ppi needs a store barrier before it.
 > 
 > Well, the above reasonably accurately describes the code making jinode
 > visible. The reader code is like:
 > 
 >     spin_lock(&lock);
 >         pi = *ppi;
 >     spin_unlock(&lock);
 > 
 >     work with pi
 > 
 > so it is guaranteed to see pi properly initialized. The problem is that
 > "work with pi" can race with other code updating the content of pi which is
 > what this patch is trying to deal with.
 > 
 > > > The ocfs2 case I was aiming at is different: the filesystem callback is
 > > > invoked after unlocking journal->j_list_lock and may sleep, so it can't hold
 > > > j_list_lock but it still reads jinode->i_dirty_start/end while other
 > > > threads update these fields under the lock. Adding a barrier between the
 > > > stores and list_add() would not address that concurrent update window.
 > > 
 > > I don't think that race exists.  If it does exist, the READ_ONCE will
 > > not help (on 32 bit platforms) because it's a 64-bit quantity and 32-bit
 > > platforms do not, in general, have a way to do an atomic 64-bit load
 > > (look at the implementation of i_size_read() for the gyrations we go
 > > through to assure a non-torn read of that value).
 > 
 > Sadly the race does exist - journal_submit_data_buffers() on the committing
 > transaction can run in parallel with jbd2_journal_file_inode() in the
 > running transaction. There's nothing preventing that. The problems arising
 > out of that are mostly theoretical but they do exist. In particular you're
 > correct that on 32-bit platforms this will be racy even with READ_ONCE /
 > WRITE_ONCE which I didn't realize.
 > 
 > Li, the best way to address this concern would be to modify jbd2_inode to
 > switch i_dirty_start / i_dirty_end to account in PAGE_SIZE units instead of
 > bytes and be of type pgoff_t. jbd2_journal_file_inode() just needs to round
 > the passed ranges properly...

  Thank you, Jan. I will update the series accordingly.

  It seems this won’t break large files on 32-bit either: MAX_LFS_FILESIZE there is
  ULONG_MAX << PAGE_SHIFT (i.e. ULONG_MAX pages, ~16TB with 4K pages), and pgoff_t
  (unsigned long) can represent the same ULONG_MAX-page range.

Regards,
Li

Re: [PATCH 3/3] ocfs2: use READ_ONCE for lockless jinode reads

Posted by Jan Kara 5 days, 9 hours ago

On Tue 03-02-26 20:10:42, Li Chen wrote:
> Hi Jan,
> 
>  ---- On Tue, 03 Feb 2026 01:17:49 +0800  Jan Kara <jack@suse.cz> wrote --- 
>  > On Fri 30-01-26 16:36:28, Matthew Wilcox wrote:
>  > > On Fri, Jan 30, 2026 at 08:26:40PM +0800, Li Chen wrote:
>  > > > Hi Matthew,
>  > > > 
>  > > >  > On Fri, Jan 30, 2026 at 11:12:32AM +0800, Li Chen wrote:
>  > > >  > > ocfs2 journal commit callback reads jbd2_inode dirty range fields without
>  > > >  > > holding journal->j_list_lock.
>  > > >  > > 
>  > > >  > > Use READ_ONCE() for these reads to correct the concurrency assumptions.
>  > > >  > 
>  > > >  > I don't think this is the right solution to the problem.  If it is,
>  > > >  > there needs to be much better argumentation in the commit message.
>  > > >  > 
>  > > >  > As I understand it, jbd2_journal_file_inode() initialises jinode,
>  > > >  > then adds it to the t_inode_list, then drops the j_list_lock.  So the
>  > > >  > actual problem we need to address is that there's no memory barrier
>  > > >  > between the store to i_dirty_start and the list_add().  Once that's
>  > > >  > added, there's no need for a READ_ONCE here.
>  > > >  > 
>  > > >  > Or have I misunderstood the problem?
>  > > > 
>  > > > Thanks for the review.
>  > > > 
>  > > > My understanding of your point is that you're worried about a missing
>  > > > "publish" ordering in jbd2_journal_file_inode(): we store
>  > > > jinode->i_dirty_start/end and then list_add() the jinode to
>  > > > t_inode_list, and a core which observes the list entry might miss the prior
>  > > > i_dirty_* stores. Is that the issue you had in mind?
>  > > 
>  > > I think that's the only issue that exists ...
>  > > 
>  > > > If so, for the normal commit path where the list is walked under
>  > > > journal->j_list_lock (e.g. journal_submit_data_buffers() in
>  > > > fs/jbd2/commit.c), spin_lock()/spin_unlock() should already provide the
>  > > > necessary ordering, since both the i_dirty_* updates and the list_add()
>  > > > happen inside the same critical section.
>  > > 
>  > > I don't think that's true.  I think what you're asserting is that:
>  > > 
>  > >     int *pi;
>  > >     int **ppi;
>  > > 
>  > >     spin_lock(&lock);
>  > >     *pi = 1;
>  > >     *ppi = pi;
>  > >     spin_unlock(&lock);
>  > > 
>  > > that the store to *pi must be observed before the store to *ppi, and
>  > > that's not true for a reader which doesn't read the value of lock.
>  > > The store to *ppi needs a store barrier before it.
>  > 
>  > Well, the above reasonably accurately describes the code making jinode
>  > visible. The reader code is like:
>  > 
>  >     spin_lock(&lock);
>  >         pi = *ppi;
>  >     spin_unlock(&lock);
>  > 
>  >     work with pi
>  > 
>  > so it is guaranteed to see pi properly initialized. The problem is that
>  > "work with pi" can race with other code updating the content of pi which is
>  > what this patch is trying to deal with.
>  > 
>  > > > The ocfs2 case I was aiming at is different: the filesystem callback is
>  > > > invoked after unlocking journal->j_list_lock and may sleep, so it can't hold
>  > > > j_list_lock but it still reads jinode->i_dirty_start/end while other
>  > > > threads update these fields under the lock. Adding a barrier between the
>  > > > stores and list_add() would not address that concurrent update window.
>  > > 
>  > > I don't think that race exists.  If it does exist, the READ_ONCE will
>  > > not help (on 32 bit platforms) because it's a 64-bit quantity and 32-bit
>  > > platforms do not, in general, have a way to do an atomic 64-bit load
>  > > (look at the implementation of i_size_read() for the gyrations we go
>  > > through to assure a non-torn read of that value).
>  > 
>  > Sadly the race does exist - journal_submit_data_buffers() on the committing
>  > transaction can run in parallel with jbd2_journal_file_inode() in the
>  > running transaction. There's nothing preventing that. The problems arising
>  > out of that are mostly theoretical but they do exist. In particular you're
>  > correct that on 32-bit platforms this will be racy even with READ_ONCE /
>  > WRITE_ONCE which I didn't realize.
>  > 
>  > Li, the best way to address this concern would be to modify jbd2_inode to
>  > switch i_dirty_start / i_dirty_end to account in PAGE_SIZE units instead of
>  > bytes and be of type pgoff_t. jbd2_journal_file_inode() just needs to round
>  > the passed ranges properly...
> 
>   Thank you, Jan. I will update the series accordingly.
> 
>   It seems this won’t break large files on 32-bit either: MAX_LFS_FILESIZE there is
>   ULONG_MAX << PAGE_SHIFT (i.e. ULONG_MAX pages, ~16TB with 4K pages), and pgoff_t
>   (unsigned long) can represent the same ULONG_MAX-page range.

Exactly. That's the reason why I've proposed this but I should have
probably stated that explicitely :).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

Re: [PATCH 3/3] ocfs2: use READ_ONCE for lockless jinode reads

Posted by Li Chen 1 week ago

Hi Matthew,

Thank you very much for the detailed explanation and for your patience.

On Sat, 31 Jan 2026 00:36:28 +0800,
Matthew Wilcox wrote:
> 
> On Fri, Jan 30, 2026 at 08:26:40PM +0800, Li Chen wrote:
> > Hi Matthew,
> > 
> >  > On Fri, Jan 30, 2026 at 11:12:32AM +0800, Li Chen wrote:
> >  > > ocfs2 journal commit callback reads jbd2_inode dirty range fields without
> >  > > holding journal->j_list_lock.
> >  > > 
> >  > > Use READ_ONCE() for these reads to correct the concurrency assumptions.
> >  > 
> >  > I don't think this is the right solution to the problem.  If it is,
> >  > there needs to be much better argumentation in the commit message.
> >  > 
> >  > As I understand it, jbd2_journal_file_inode() initialises jinode,
> >  > then adds it to the t_inode_list, then drops the j_list_lock.  So the
> >  > actual problem we need to address is that there's no memory barrier
> >  > between the store to i_dirty_start and the list_add().  Once that's
> >  > added, there's no need for a READ_ONCE here.
> >  > 
> >  > Or have I misunderstood the problem?
> > 
> > Thanks for the review.
> > 
> > My understanding of your point is that you're worried about a missing
> > "publish" ordering in jbd2_journal_file_inode(): we store
> > jinode->i_dirty_start/end and then list_add() the jinode to
> > t_inode_list, and a core which observes the list entry might miss the prior
> > i_dirty_* stores. Is that the issue you had in mind?
> 
> I think that's the only issue that exists ...

Understood.

> 
> > If so, for the normal commit path where the list is walked under
> > journal->j_list_lock (e.g. journal_submit_data_buffers() in
> > fs/jbd2/commit.c), spin_lock()/spin_unlock() should already provide the
> > necessary ordering, since both the i_dirty_* updates and the list_add()
> > happen inside the same critical section.
> 
> I don't think that's true.  I think what you're asserting is that:
> 
> 	int *pi;
> 	int **ppi;
> 
> 	spin_lock(&lock);
> 	*pi = 1;
> 	*ppi = pi;
> 	spin_unlock(&lock);
> 
> that the store to *pi must be observed before the store to *ppi, and
> that's not true for a reader which doesn't read the value of lock.
> The store to *ppi needs a store barrier before it.

Yes, agreed ― thank you. I was implicitly assuming the reader had taken the same lock
at some point, which is not a valid assumption for a lockless reader. 

> > The ocfs2 case I was aiming at is different: the filesystem callback is
> > invoked after unlocking journal->j_list_lock and may sleep, so it can't hold
> > j_list_lock but it still reads jinode->i_dirty_start/end while other
> > threads update these fields under the lock. Adding a barrier between the
> > stores and list_add() would not address that concurrent update window.
> 
> I don't think that race exists.  If it does exist, the READ_ONCE will
> not help (on 32 bit platforms) because it's a 64-bit quantity and 32-bit
> platforms do not, in general, have a way to do an atomic 64-bit load
> (look at the implementation of i_size_read() for the gyrations we go
> through to assure a non-torn read of that value).
> 
> > "ocfs2 reads jinode->i_dirty_start/end without journal->j_list_lock
> > (callback may sleep); these fields are updated under j_list_lock in jbd2.
> > Use READ_ONCE() so the callback takes a single snapshot via actual loads
> > from the variable (i.e. don't let the compiler reuse a value kept in a register
> > or fold multiple reads)."
> 
> I think the prevention of this race occurs at a higher level than
> "it's updated under a lock".  That is, jbd2_journal_file_inode()
> is never called for a jinode which is currently being operated on by
> j_submit_inode_data_buffers().  Now, I'm not an expert on the jbd code,
> so I may be wrong here.

Thanks. I tried to sanity-check whether that “never called” invariant holds
in practice.

I added a small local-only tracepoint (not for upstream) which fires from
jbd2_journal_file_inode() when it observes JI_COMMIT_RUNNING already set
on the same jinode:

/* fs/jbd2/transaction.c */
if (unlikely(jinode->i_flags & JI_COMMIT_RUNNING))
      trace_jbd2_file_inode_commit_running(...);

The trace event prints dev, ino, current tid, jinode flags, and the
i_transaction / i_next_transaction tids.

With an ext4 test (ordered mode) I do see repeated hits. Trace output:

... jbd2_submit_inode_data: dev 7,0 ino 20
... jbd2_file_inode_commit_running: dev 7,0 ino 20 tid 3 op 0x6 i_flags 0x7
    j_tid 2 j_next 3 ... comm python3

So it looks like jbd2_journal_file_inode() can run while JI_COMMIT_RUNNING
is set for that inode, i.e. during the window where the commit thread drops
j_list_lock around ->j_submit_inode_data_buffers() / ->j_finish_inode_data_buffers().

Given this, would you prefer the series to move towards something like:

1. taking a snapshot of i_dirty_start/end under j_list_lock in the commit path and passing the snapshot
   to the filesystem callback (so callbacks never read jinode->i_dirty_* locklessly), or
2. introducing a real synchronization mechanism for the dirty range itself (seqcount/atomic64/etc)?
3. something else.

I’d be very grateful for guidance on what you consider the most appropriate direction or point out something I'm wrong.

Thanks again.

Regards,
Li

[PATCH 1/3] jbd2: use READ_ONCE for lockless jinode reads
[PATCH 2/3] ext4: use READ_ONCE for lockless jinode reads
[PATCH 3/3] ocfs2: use READ_ONCE for lockless jinode reads