Suppose there are 2 CPUs racing inode hash lookup func (say ilookup5())
and unlock_new_inode().
In principle the latter can clear the I_NEW flag before prior stores
into the inode were made visible.
The former can in turn observe I_NEW is cleared and proceed to use the
inode, while possibly reading from not-yet-published areas.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
I don't think this is a serious bug in the sense I doubt anyone ever ran
into it, but this is an issue on paper.
I'm doing some changes in the area and I figured I'll get this bit out
of the way.
fs/dcache.c | 4 ++++
fs/inode.c | 8 ++++++++
include/linux/writeback.h | 4 ++++
3 files changed, 16 insertions(+)
diff --git a/fs/dcache.c b/fs/dcache.c
index a067fa0a965a..806d6a665124 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1981,6 +1981,10 @@ void d_instantiate_new(struct dentry *entry, struct inode *inode)
spin_lock(&inode->i_lock);
__d_instantiate(entry, inode);
WARN_ON(!(inode->i_state & I_NEW));
+ /*
+ * Pairs with smp_rmb in wait_on_inode().
+ */
+ smp_wmb();
inode->i_state &= ~I_NEW & ~I_CREATING;
/*
* Pairs with the barrier in prepare_to_wait_event() to make sure
diff --git a/fs/inode.c b/fs/inode.c
index ec9339024ac3..842ee973c8b6 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1181,6 +1181,10 @@ void unlock_new_inode(struct inode *inode)
lockdep_annotate_inode_mutex_key(inode);
spin_lock(&inode->i_lock);
WARN_ON(!(inode->i_state & I_NEW));
+ /*
+ * Pairs with smp_rmb in wait_on_inode().
+ */
+ smp_wmb();
inode->i_state &= ~I_NEW & ~I_CREATING;
/*
* Pairs with the barrier in prepare_to_wait_event() to make sure
@@ -1198,6 +1202,10 @@ void discard_new_inode(struct inode *inode)
lockdep_annotate_inode_mutex_key(inode);
spin_lock(&inode->i_lock);
WARN_ON(!(inode->i_state & I_NEW));
+ /*
+ * Pairs with smp_rmb in wait_on_inode().
+ */
+ smp_wmb();
inode->i_state &= ~I_NEW;
/*
* Pairs with the barrier in prepare_to_wait_event() to make sure
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 22dd4adc5667..e1e1231a6830 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -194,6 +194,10 @@ static inline void wait_on_inode(struct inode *inode)
{
wait_var_event(inode_state_wait_address(inode, __I_NEW),
!(READ_ONCE(inode->i_state) & I_NEW));
+ /*
+ * Pairs with routines clearing I_NEW.
+ */
+ smp_rmb();
}
#ifdef CONFIG_CGROUP_WRITEBACK
--
2.34.1
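To make the window concrete, here is a standalone C11 model of the race (illustrative only: publisher() stands in for unlock_new_inode(), lookup() for the wait_on_inode() fast path, and payload/i_new are invented names; the atomic_thread_fence() calls play the role of the kernel's smp_wmb()/smp_rmb()):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int payload;             /* stands in for inode fields initialized while I_NEW is set */
static atomic_int i_new = 1;    /* stands in for the I_NEW bit in inode->i_state */

static void *publisher(void *arg)
{
	payload = 42;                           /* plain stores into the new inode */
	/* Without this fence (the patch's smp_wmb()), the clear below may
	 * become visible before the payload store. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&i_new, 0, memory_order_relaxed);
	return NULL;
}

static void *lookup(void *arg)
{
	/* Fast path: observe I_NEW already clear and never sleep. */
	while (atomic_load_explicit(&i_new, memory_order_relaxed))
		;
	/* Without this fence (the patch's smp_rmb()), the read below is not
	 * ordered after the i_new check and may observe a stale value. */
	atomic_thread_fence(memory_order_acquire);
	printf("payload = %d\n", payload);      /* with both fences this is always 42 */
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, publisher, NULL);
	pthread_create(&b, NULL, lookup, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}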
On Mon 06-10-25 01:15:26, Mateusz Guzik wrote:
> Suppose there are 2 CPUs racing inode hash lookup func (say ilookup5())
> and unlock_new_inode().
>
> In principle the latter can clear the I_NEW flag before prior stores
> into the inode were made visible.
>
> The former can in turn observe I_NEW is cleared and proceed to use the
> inode, while possibly reading from not-yet-published areas.
>
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> ---
>
> I don't think this is a serious bug in the sense I doubt anyone ever ran
> into it, but this is an issue on paper.
>
> I'm doing some changes in the area and I figured I'll get this bit out
> of the way.
Yeah, good spotting.
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -1981,6 +1981,10 @@ void d_instantiate_new(struct dentry *entry, struct inode *inode)
> spin_lock(&inode->i_lock);
> __d_instantiate(entry, inode);
> WARN_ON(!(inode->i_state & I_NEW));
> + /*
> + * Pairs with smp_rmb in wait_on_inode().
> + */
> + smp_wmb();
> inode->i_state &= ~I_NEW & ~I_CREATING;
Hum, why not smp_store_release() here (and below) and...
> /*
> * Pairs with the barrier in prepare_to_wait_event() to make sure
> diff --git a/fs/inode.c b/fs/inode.c
> index ec9339024ac3..842ee973c8b6 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1181,6 +1181,10 @@ void unlock_new_inode(struct inode *inode)
> lockdep_annotate_inode_mutex_key(inode);
> spin_lock(&inode->i_lock);
> WARN_ON(!(inode->i_state & I_NEW));
> + /*
> + * Pairs with smp_rmb in wait_on_inode().
> + */
> + smp_wmb();
> inode->i_state &= ~I_NEW & ~I_CREATING;
> /*
> * Pairs with the barrier in prepare_to_wait_event() to make sure
> @@ -1198,6 +1202,10 @@ void discard_new_inode(struct inode *inode)
> lockdep_annotate_inode_mutex_key(inode);
> spin_lock(&inode->i_lock);
> WARN_ON(!(inode->i_state & I_NEW));
> + /*
> + * Pairs with smp_rmb in wait_on_inode().
> + */
> + smp_wmb();
> inode->i_state &= ~I_NEW;
> /*
> * Pairs with the barrier in prepare_to_wait_event() to make sure
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 22dd4adc5667..e1e1231a6830 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -194,6 +194,10 @@ static inline void wait_on_inode(struct inode *inode)
> {
> wait_var_event(inode_state_wait_address(inode, __I_NEW),
> !(READ_ONCE(inode->i_state) & I_NEW));
> + /*
> + * Pairs with routines clearing I_NEW.
> + */
> + smp_rmb();
... smp_load_acquire() instead of READ_ONCE? That would seem like a more
"modern" way to fix this?
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
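For reference, a rough sketch of the release/acquire variant Jan is suggesting (untested and illustrative only; the i_lock usage and I_CREATING handling stay exactly as in the patch):

	/* unlock_new_inode() / d_instantiate_new() side: the release store
	 * publishes all prior inode initialization before the cleared I_NEW
	 * can be observed. */
	smp_store_release(&inode->i_state, inode->i_state & ~I_NEW & ~I_CREATING);

	/* wait_on_inode() side: the acquire load orders later reads of the
	 * inode after the I_NEW check, replacing the separate smp_rmb(). */
	wait_var_event(inode_state_wait_address(inode, __I_NEW),
		       !(smp_load_acquire(&inode->i_state) & I_NEW));

As Mateusz notes below, with the pending i_state accessor work this would take the shape of inode_state_read_once_acquire()/inode_state_clear_release() helpers instead.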
On Mon, Oct 6, 2025 at 2:15 PM Jan Kara <jack@suse.cz> wrote:
>
> On Mon 06-10-25 01:15:26, Mateusz Guzik wrote:
> > diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> > index 22dd4adc5667..e1e1231a6830 100644
> > --- a/include/linux/writeback.h
> > +++ b/include/linux/writeback.h
> > @@ -194,6 +194,10 @@ static inline void wait_on_inode(struct inode *inode)
> > {
> > wait_var_event(inode_state_wait_address(inode, __I_NEW),
> > !(READ_ONCE(inode->i_state) & I_NEW));
> > + /*
> > + * Pairs with routines clearing I_NEW.
> > + */
> > + smp_rmb();
>
> ... smp_load_acquire() instead of READ_ONCE? That would seem like a more
> "modern" way to fix this?
>
Now that the merge window flurry has died down I'll be posting an
updated i_state accessor patchset.
Then I would need to add inode_state_read_once_acquire() and
inode_state_clear_release() to keep up with this.
I figured I'll spare it for the time being, worst case can be added later.
That aside I have a wip patch to not require fences here and instead
take advantage of the i_lock held earlier, so I expect this to go away
anyway.
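For illustration of the lock-based direction only (this is not the WIP patch, just a sketch of the general idea, and wait_on_inode_locked() is an invented name): rechecking I_NEW under inode->i_lock lets the lock acquisition pair with the spin_unlock() performed after clearing I_NEW, so the initialization is ordered without explicit fences:

static inline void wait_on_inode_locked(struct inode *inode)
{
	spin_lock(&inode->i_lock);
	while (inode->i_state & I_NEW) {
		spin_unlock(&inode->i_lock);
		wait_var_event(inode_state_wait_address(inode, __I_NEW),
			       !(READ_ONCE(inode->i_state) & I_NEW));
		spin_lock(&inode->i_lock);
	}
	/*
	 * Seeing I_NEW clear while holding i_lock orders this CPU after the
	 * spin_unlock() in unlock_new_inode(), which itself comes after the
	 * inode was fully set up.
	 */
	spin_unlock(&inode->i_lock);
}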
On Mon 06-10-25 14:20:25, Mateusz Guzik wrote:
> On Mon, Oct 6, 2025 at 2:15 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Mon 06-10-25 01:15:26, Mateusz Guzik wrote:
> > > diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> > > index 22dd4adc5667..e1e1231a6830 100644
> > > --- a/include/linux/writeback.h
> > > +++ b/include/linux/writeback.h
> > > @@ -194,6 +194,10 @@ static inline void wait_on_inode(struct inode *inode)
> > > {
> > > wait_var_event(inode_state_wait_address(inode, __I_NEW),
> > > !(READ_ONCE(inode->i_state) & I_NEW));
> > > + /*
> > > + * Pairs with routines clearing I_NEW.
> > > + */
> > > + smp_rmb();
> >
> > ... smp_load_acquire() instead of READ_ONCE? That would seem like a more
> > "modern" way to fix this?
> >
>
> Now that the merge window flurry has died down I'll be posting an
> updated i_state accessor patchset.
>
> Then I would need to add inode_state_read_once_acquire() and
> inode_state_clear_release() to keep up with this.
>
> I figured I'll spare it for the time being, worst case can be added later.
>
> That aside I have a wip patch to not require fences here and instead
> take advantage of the i_lock held earlier, so I expect this to go away
> anyway.
Fair enough, I was just wondering :). Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On Mon, 06 Oct 2025 01:15:26 +0200, Mateusz Guzik wrote:
> Suppose there are 2 CPUs racing inode hash lookup func (say ilookup5())
> and unlock_new_inode().
>
> In principle the latter can clear the I_NEW flag before prior stores
> into the inode were made visible.
>
> The former can in turn observe I_NEW is cleared and proceed to use the
> inode, while possibly reading from not-yet-published areas.
>
> [...]
Applied to the vfs-6.19.inode branch of the vfs/vfs.git tree.
Patches in the vfs-6.19.inode branch should appear in linux-next soon.
Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.
It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.
Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.
tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.19.inode
[1/1] fs: add missing fences to I_NEW handling
https://git.kernel.org/vfs/vfs/c/fb43e3dde8ec
On Mon, 6 Oct 2025 01:15:26 +0200 Mateusz Guzik wrote:
> Suppose there are 2 CPUs racing inode hash lookup func (say ilookup5())
> and unlock_new_inode().
>
> In principle the latter can clear the I_NEW flag before prior stores
> into the inode were made visible.
>
Given difficulty following up here, could you specify why the current
mem barrier [1] in unlock_new_inode() is not enough?
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/inode.c#n1190
> The former can in turn observe I_NEW is cleared and proceed to use the
> inode, while possibly reading from not-yet-published areas.
>
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> ---
>
> I don't think this is a serious bug in the sense I doubt anyone ever ran
> into it, but this is an issue on paper.
>
> I'm doing some changes in the area and I figured I'll get this bit out
> of the way.
>
> fs/dcache.c | 4 ++++
> fs/inode.c | 8 ++++++++
> include/linux/writeback.h | 4 ++++
> 3 files changed, 16 insertions(+)
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index a067fa0a965a..806d6a665124 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -1981,6 +1981,10 @@ void d_instantiate_new(struct dentry *entry, struct inode *inode)
> spin_lock(&inode->i_lock);
> __d_instantiate(entry, inode);
> WARN_ON(!(inode->i_state & I_NEW));
> + /*
> + * Pairs with smp_rmb in wait_on_inode().
> + */
> + smp_wmb();
> inode->i_state &= ~I_NEW & ~I_CREATING;
> /*
> * Pairs with the barrier in prepare_to_wait_event() to make sure
> diff --git a/fs/inode.c b/fs/inode.c
> index ec9339024ac3..842ee973c8b6 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1181,6 +1181,10 @@ void unlock_new_inode(struct inode *inode)
> lockdep_annotate_inode_mutex_key(inode);
> spin_lock(&inode->i_lock);
> WARN_ON(!(inode->i_state & I_NEW));
> + /*
> + * Pairs with smp_rmb in wait_on_inode().
> + */
> + smp_wmb();
> inode->i_state &= ~I_NEW & ~I_CREATING;
> /*
> * Pairs with the barrier in prepare_to_wait_event() to make sure
> @@ -1198,6 +1202,10 @@ void discard_new_inode(struct inode *inode)
> lockdep_annotate_inode_mutex_key(inode);
> spin_lock(&inode->i_lock);
> WARN_ON(!(inode->i_state & I_NEW));
> + /*
> + * Pairs with smp_rmb in wait_on_inode().
> + */
> + smp_wmb();
> inode->i_state &= ~I_NEW;
> /*
> * Pairs with the barrier in prepare_to_wait_event() to make sure
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 22dd4adc5667..e1e1231a6830 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -194,6 +194,10 @@ static inline void wait_on_inode(struct inode *inode)
> {
> wait_var_event(inode_state_wait_address(inode, __I_NEW),
> !(READ_ONCE(inode->i_state) & I_NEW));
> + /*
> + * Pairs with routines clearing I_NEW.
> + */
> + smp_rmb();
> }
Why is this needed, as nobody cares about I_NEW after the wait?
>
> #ifdef CONFIG_CGROUP_WRITEBACK
> --
> 2.34.1
On Mon, Oct 6, 2025 at 2:57 AM Hillf Danton <hdanton@sina.com> wrote:
>
> On Mon, 6 Oct 2025 01:15:26 +0200 Mateusz Guzik wrote:
> > Suppose there are 2 CPUs racing inode hash lookup func (say ilookup5())
> > and unlock_new_inode().
> >
> > In principle the latter can clear the I_NEW flag before prior stores
> > into the inode were made visible.
> >
> Given difficulty following up here, could you specify why the current
> mem barrier [1] in unlock_new_inode() is not enough?
>
That fence synchronizes against threads which went to sleep.
In the example I'm providing this did not happen.
193 static inline void wait_on_inode(struct inode *inode)
194 {
195 wait_var_event(inode_state_wait_address(inode, __I_NEW),
196 !(READ_ONCE(inode->i_state) & I_NEW));
303 #define wait_var_event(var, condition) \
304 do { \
305 might_sleep(); \
306 if (condition) \
307 break; \
I_NEW is tested here without any locks or fences.
308 __wait_var_event(var, condition); \
309 } while (0)
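Spelled out, the fast-path interleaving of concern looks like this (i_blocks
is just an example of a field initialized before I_NEW is cleared):

  CPU0 (init + unlock_new_inode())      CPU1 (ilookup5() -> wait_on_inode())

    inode->i_blocks = ...;                if (!(READ_ONCE(inode->i_state) & I_NEW))
    /* more initialization */                     break;   /* fast path, never sleeps */
    inode->i_state &= ~I_NEW;             ... reads inode->i_blocks ...

Without the smp_wmb()/smp_rmb() pair, CPU0's I_NEW clear may become visible
before its earlier stores, and CPU1's later reads may be satisfied before its
i_state check, so CPU1 can use a not-yet-published inode.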
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/inode.c#n1190
>
> > The former can in turn observe I_NEW is cleared and proceed to use the
> > inode, while possibly reading from not-yet-published areas.
> >
> > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> > ---
> >
> > I don't think this is a serious bug in the sense I doubt anyone ever ran
> > into it, but this is an issue on paper.
> >
> > I'm doing some changes in the area and I figured I'll get this bit out
> > of the way.
> >
> > fs/dcache.c | 4 ++++
> > fs/inode.c | 8 ++++++++
> > include/linux/writeback.h | 4 ++++
> > 3 files changed, 16 insertions(+)
> >
> > diff --git a/fs/dcache.c b/fs/dcache.c
> > index a067fa0a965a..806d6a665124 100644
> > --- a/fs/dcache.c
> > +++ b/fs/dcache.c
> > @@ -1981,6 +1981,10 @@ void d_instantiate_new(struct dentry *entry, struct inode *inode)
> > spin_lock(&inode->i_lock);
> > __d_instantiate(entry, inode);
> > WARN_ON(!(inode->i_state & I_NEW));
> > + /*
> > + * Pairs with smp_rmb in wait_on_inode().
> > + */
> > + smp_wmb();
> > inode->i_state &= ~I_NEW & ~I_CREATING;
> > /*
> > * Pairs with the barrier in prepare_to_wait_event() to make sure
> > diff --git a/fs/inode.c b/fs/inode.c
> > index ec9339024ac3..842ee973c8b6 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -1181,6 +1181,10 @@ void unlock_new_inode(struct inode *inode)
> > lockdep_annotate_inode_mutex_key(inode);
> > spin_lock(&inode->i_lock);
> > WARN_ON(!(inode->i_state & I_NEW));
> > + /*
> > + * Pairs with smp_rmb in wait_on_inode().
> > + */
> > + smp_wmb();
> > inode->i_state &= ~I_NEW & ~I_CREATING;
> > /*
> > * Pairs with the barrier in prepare_to_wait_event() to make sure
> > @@ -1198,6 +1202,10 @@ void discard_new_inode(struct inode *inode)
> > lockdep_annotate_inode_mutex_key(inode);
> > spin_lock(&inode->i_lock);
> > WARN_ON(!(inode->i_state & I_NEW));
> > + /*
> > + * Pairs with smp_rmb in wait_on_inode().
> > + */
> > + smp_wmb();
> > inode->i_state &= ~I_NEW;
> > /*
> > * Pairs with the barrier in prepare_to_wait_event() to make sure
> > diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> > index 22dd4adc5667..e1e1231a6830 100644
> > --- a/include/linux/writeback.h
> > +++ b/include/linux/writeback.h
> > @@ -194,6 +194,10 @@ static inline void wait_on_inode(struct inode *inode)
> > {
> > wait_var_event(inode_state_wait_address(inode, __I_NEW),
> > !(READ_ONCE(inode->i_state) & I_NEW));
> > + /*
> > + * Pairs with routines clearing I_NEW.
> > + */
> > + smp_rmb();
> > }
>
> Why is this needed, as nobody cares about I_NEW after the wait?
>
> >
> > #ifdef CONFIG_CGROUP_WRITEBACK
> > --
> > 2.34.1
On Mon, 6 Oct 2025 03:16:43 +0200 Mateusz Guzik wrote:
> On Mon, Oct 6, 2025 at 2:57 AM Hillf Danton <hdanton@sina.com> wrote:
> > On Mon, 6 Oct 2025 01:15:26 +0200 Mateusz Guzik wrote:
> > > Suppose there are 2 CPUs racing inode hash lookup func (say ilookup5())
> > > and unlock_new_inode().
> > >
> > > In principle the latter can clear the I_NEW flag before prior stores
> > > into the inode were made visible.
> > >
> > Given difficulty following up here, could you specify why the current
> > mem barrier [1] in unlock_new_inode() is not enough?
> >
> That fence synchronizes against threads which went to sleep.
>
> In the example I'm providing this did not happen.
>
> 193 static inline void wait_on_inode(struct inode *inode)
> 194 {
> 195 wait_var_event(inode_state_wait_address(inode, __I_NEW),
> 196 !(READ_ONCE(inode->i_state) & I_NEW));
>
> 303 #define wait_var_event(var, condition) \
> 304 do { \
> 305 might_sleep(); \
> 306 if (condition) \
> 307 break; \
>
> I_NEW is tested here without any locks or fences.
>
Thanks, got it but given the comment of the current mem barrier in
unlock_new_inode(), why did peterZ/Av leave such a huge chance behind?
The condition check in wait_var_event() matches waitqueue_active() in
__wake_up_bit(), and both make it run fast on both the waiter and the
waker sides, no?
On Mon, Oct 6, 2025 at 9:21 AM Hillf Danton <hdanton@sina.com> wrote:
>
> On Mon, 6 Oct 2025 03:16:43 +0200 Mateusz Guzik wrote:
> > That fence synchronizes against threads which went to sleep.
> >
> > In the example I'm providing this did not happen.
> >
> > 193 static inline void wait_on_inode(struct inode *inode)
> > 194 {
> > 195 wait_var_event(inode_state_wait_address(inode, __I_NEW),
> > 196 !(READ_ONCE(inode->i_state) & I_NEW));
> >
> > 303 #define wait_var_event(var, condition) \
> > 304 do { \
> > 305 might_sleep(); \
> > 306 if (condition) \
> > 307 break; \
> >
> > I_NEW is tested here without any locks or fences.
> >
> Thanks, got it but given the comment of the current mem barrier in
> unlock_new_inode(), why did peterZ/Av leave such a huge chance behind?
>
My guess is nobody is perfect -- mistakes happen.
> The condition check in waitqueue_active() matches waitqueue_active() in
> __wake_up_bit(), and both make it run fast on both the waiter and the
> waker sides, no?
So happens the commentary above wait_var_event() explicitly mentions
you want an acquire fence:
299  * The condition should normally use smp_load_acquire() or a similarly
300  * ordered access to ensure that any changes to memory made before the
301  * condition became true will be visible after the wait completes.
The commentary about waitqueue_active() says you want an even
stronger fence (smp_mb) for that one:
104  * Use either while holding wait_queue_head::lock or when used for wakeups
105  * with an extra smp_mb() like::
106  *
107  *      CPU0 - waker                    CPU1 - waiter
108  *
109  *                                      for (;;) {
110  *        @cond = true;                   prepare_to_wait(&wq_head, &wait, state);
111  *        smp_mb();                       // smp_mb() from set_current_state()
112  *        if (waitqueue_active(wq_head))         if (@cond)
113  *          wake_up(wq_head);                      break;
114  *                                        schedule();
115  *                                      }
116  *      finish_wait(&wq_head, &wait);
On Mon, 6 Oct 2025 12:30:01 +0200 Mateusz Guzik wrote:
> On Mon, Oct 6, 2025 at 9:21 AM Hillf Danton <hdanton@sina.com> wrote:
> > On Mon, 6 Oct 2025 03:16:43 +0200 Mateusz Guzik wrote:
> > > That fence synchronizes against threads which went to sleep.
> > >
> > > In the example I'm providing this did not happen.
> > >
> > > 193 static inline void wait_on_inode(struct inode *inode)
> > > 194 {
> > > 195 wait_var_event(inode_state_wait_address(inode, __I_NEW),
> > > 196 !(READ_ONCE(inode->i_state) & I_NEW));
> > >
> > > 303 #define wait_var_event(var, condition) \
> > > 304 do { \
> > > 305 might_sleep(); \
> > > 306 if (condition) \
> > > 307 break; \
> > >
> > > I_NEW is tested here without any locks or fences.
> > >
> > Thanks, got it but given the comment of the current mem barrier in
> > unlock_new_inode(), why did peterZ/Av leave such a huge chance behind?
> >
> My guess is nobody is perfect -- mistakes happen.
>
I do not think you are so lucky -- the code has been there for quite a while.
> > The condition check in waitqueue_active() matches waitqueue_active() in
> > __wake_up_bit(), and both make it run fast on both the waiter and the
> > waker sides, no?
>
Your quotation here is different from my reply [2]. Weird.
The condition check in wait_var_event() matches waitqueue_active() in
__wake_up_bit(), and both make it run fast on both the waiter and the
waker sides, no?
[2] https://lore.kernel.org/lkml/20251006072136.8236-1-hdanton@sina.com/
> So happens the commentary above wait_var_event() explicitly mentions
> you want an acquire fence:
> 299 * The condition should normally use smp_load_acquire() or a similarly
> 300 * ordered access to ensure that any changes to memory made before the
> 301 * condition became true will be visible after the wait completes.
What is missed is the recheck -- adding self onto the wait queue with a true
condition makes no sense.
* Wait for a @condition to be true, only re-checking when a wake up is
* received for the given @var (an arbitrary kernel address which need
* not be directly related to the given condition, but usually is).
>
> The commentary about waitqueue_active() says you want an even
> stronger fence (smp_mb) for that one:
Yes, it matches the current code, so your change is not needed.
> 104 * Use either while holding wait_queue_head::lock or when used for wakeups
> 105 * with an extra smp_mb() like::
> 106 *
> 107 * CPU0 - waker CPU1 - waiter
> 108 *
> 109 * for (;;) {
> 110 * @cond = true; prepare_to_wait(&wq_head, &wait, state);
> 111 * smp_mb(); // smp_mb() from set_current_state()
> 112 * if (waitqueue_active(wq_head)) if (@cond)
> 113 * wake_up(wq_head); break;
> 114 * schedule();
> 115 * }
> 116 * finish_wait(&wq_head, &wait);