fs/proc/inode.c | 4 +--- fs/proc/proc_sysctl.c | 4 +--- 2 files changed, 2 insertions(+), 6 deletions(-)
The rcu_dereference() call in proc_sys_compare() is problematic as
->d_compare is not guaranteed to be called with rcu_read_lock() held and
rcu_dereference() can cause a warning when used without that lock.
Specifically d_alloc_parallel() will call ->d_compare() without
rcu_read_lock(), but with ->d_lock to ensure stability. In this case
->d_inode is usually NULL so the rcu_dereference() will normally not be
reached, but it is possible that ->d_inode was set while waiting for
->d_lock which could lead to the warning.
The rcu_dereference() isn't really needed - ->sysctl isn't used in a
pattern which normally requires RCU protection. In particular it is
never updated. It is assigned a value in proc_sys_make_inode() before
the inode is generally visible, and the value is freed (using
rcu_free()) only after any possible access to the inode must have
completed.
Even though the value stored at ->sysctl is not freed, the ->sysctl
pointer itself is set to NULL in proc_evict_inode(). This necessitates
proc_sys_compare() taking care, reading the pointer with READ_ONCE()
(currently via rcu_dereference()) and checking for NULL. If we drop the
assignment to NULL, this care becomes unnecessary.
This patch removes the assignment of NULL in proc_evict_inode() so that
for the entire (public) life of a proc_sysctl inode, the ->sysctl
pointer is stable and points to a valid value. It then changes
proc_sys_compare() to simply use ->sysctl without any concern for it
changing or being NULL.
Signed-off-by: NeilBrown <neil@brown.name>
---
fs/proc/inode.c | 4 +---
fs/proc/proc_sysctl.c | 4 +---
2 files changed, 2 insertions(+), 6 deletions(-)
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index a3eb3b740f76..e0f984c44523 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -41,10 +41,8 @@ static void proc_evict_inode(struct inode *inode)
proc_pid_evict_inode(ei);
head = ei->sysctl;
- if (head) {
- RCU_INIT_POINTER(ei->sysctl, NULL);
+ if (head)
proc_sys_evict_inode(inode, head);
- }
}
static struct kmem_cache *proc_inode_cachep __ro_after_init;
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index cc9d74a06ff0..5358327ee640 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -915,7 +915,6 @@ static int sysctl_is_seen(struct ctl_table_header *p)
static int proc_sys_compare(const struct dentry *dentry,
unsigned int len, const char *str, const struct qstr *name)
{
- struct ctl_table_header *head;
struct inode *inode;
/* Although proc doesn't have negative dentries, rcu-walk means
@@ -928,8 +927,7 @@ static int proc_sys_compare(const struct dentry *dentry,
return 1;
if (memcmp(name->name, str, len))
return 1;
- head = rcu_dereference(PROC_I(inode)->sysctl);
- return !head || !sysctl_is_seen(head);
+ return !sysctl_is_seen(PROC_I(inode)->sysctl);
}
static const struct dentry_operations proc_sys_dentry_operations = {
--
2.49.0
On Mon, Jun 16, 2025 at 09:00:39AM +1000, NeilBrown wrote: > > The rcu_dereference() call in proc_sys_compare() is problematic as > ->d_compare is not guaranteed to be called with rcu_read_lock() held and > rcu_dereference() can cause a warning when used without that lock. > > Specifically d_alloc_parallel() will call ->d_compare() without > rcu_read_lock(), but with ->d_lock to ensure stability. In this case > ->d_inode is usually NULL so the rcu_dereference() will normally not be > reached, but it is possible that ->d_inode was set while waiting for > ->d_lock which could lead to the warning. Huh? There are two call sites of d_same_name() in d_alloc_parallel() - one in the loop (under rcu_read_lock()) and another after the thing we are comparing has ceased to be in-lookup. The latter is under ->d_lock, stabilizing everything (and it really can't run into NULL ->d_inode for /proc/sys/ stuff). ->d_compare() instances are guaranteed dentry->d_lock or rcu_read_lock(); in the latter case we'll either recheck or validate on previously sampled ->d_seq. And the second call in d_alloc_parallel() is just that - recheck under ->d_lock. Just use rcu_dereference_check(...., spin_is_locked(&dentry->d_lock)) and be done with that... The part where we have a somewhat wrong behaviour is not the second call in d_alloc_parallel() - it's the first one. Something like this static int proc_sys_compare(const struct dentry *dentry, unsigned int len, const char *str, const struct qstr *name) { struct ctl_table_header *head; struct inode *inode; if (name->len != len) return 1; if (memcmp(name->name, str, len)) return 1; // false positive is fine here - we'll recheck anyway if (d_in_lookup(dentry)) return 0; inode = d_inode_rcu(dentry); // we just might have run into dentry in the middle of __dentry_kill() if (!inode) return 1; head = rcu_dereference_check(PROC_I(inode)->sysctl, spin_is_locked(&dentry->d_lock)); return !head || !sysctl_is_seen(head); }
On Mon, 16 Jun 2025, Al Viro wrote: > On Mon, Jun 16, 2025 at 09:00:39AM +1000, NeilBrown wrote: > > > > The rcu_dereference() call in proc_sys_compare() is problematic as > > ->d_compare is not guaranteed to be called with rcu_read_lock() held and > > rcu_dereference() can cause a warning when used without that lock. > > > > Specifically d_alloc_parallel() will call ->d_compare() without > > rcu_read_lock(), but with ->d_lock to ensure stability. In this case > > ->d_inode is usually NULL so the rcu_dereference() will normally not be > > reached, but it is possible that ->d_inode was set while waiting for > > ->d_lock which could lead to the warning. > > Huh? > > There are two call sites of d_same_name() in d_alloc_parallel() - one > in the loop (under rcu_read_lock()) and another after the thing we > are comparing has ceased to be in-lookup. The latter is under ->d_lock, > stabilizing everything (and it really can't run into NULL ->d_inode > for /proc/sys/ stuff). Ok, so ->d_inode will always be non-NULL here, so the rcu_dereference() will always cause a warning if that code is reached for a proc_sysctl dentry. > > ->d_compare() instances are guaranteed dentry->d_lock or rcu_read_lock(); > in the latter case we'll either recheck or validate on previously sampled > ->d_seq. And the second call in d_alloc_parallel() is just that - recheck > under ->d_lock. > > Just use rcu_dereference_check(...., spin_is_locked(&dentry->d_lock)) and > be done with that... We could - but that would be misleading. And it would still cause a sparse warning because ->sysctl isn't marked __rcu. The reality is that ->sysctl does not need rcu protection. There is no concurrent update except that it can be set to NULL which is pointless. For the entire public life of the inode - whenever ->d_compare could possibly run - there is precisely one "struct ctl_table_header" associated with the inode. Once we remove the unnecessary RCU_INIT_POINTER, the ->sysctl pointer is completely stable and not needing any protection at all. So it would be misleading to leave the rcu_dereference{_check}() there. > > The part where we have a somewhat wrong behaviour is not the second call > in d_alloc_parallel() - it's the first one. Something like this > > static int proc_sys_compare(const struct dentry *dentry, > unsigned int len, const char *str, const struct qstr *name) > { > struct ctl_table_header *head; > struct inode *inode; > > if (name->len != len) > return 1; > if (memcmp(name->name, str, len)) > return 1; > > // false positive is fine here - we'll recheck anyway > if (d_in_lookup(dentry)) > return 0; I wonder if it would be good to document that d_compare on a d_in_lookup() dentry will always be re-checked. I agree this is a good way to avoid the possible duplicate dentries. But this is fixing a different problem than the one I'm trying to fix. I'm just trying to remove a possible warning and to do so in a way the makes the code consistent. Thanks, NeilBrown > > inode = d_inode_rcu(dentry); > // we just might have run into dentry in the middle of __dentry_kill() > if (!inode) > return 1; > head = rcu_dereference_check(PROC_I(inode)->sysctl, > spin_is_locked(&dentry->d_lock)); > return !head || !sysctl_is_seen(head); > } >
On Mon, Jun 16, 2025 at 12:49:51PM +1000, NeilBrown wrote: > The reality is that ->sysctl does not need rcu protection. There is no > concurrent update except that it can be set to NULL which is pointless. I would rather *not* leave a dangling pointer there, and yes, it can end up being dangling. kfree_rcu() from inside the ->evict_inode() may very well happen earlier than (also RCU-delayed) freeing of struct inode itself. What we can do is WRITE_ONCE() to set it to NULL on the evict_inode side and READ_ONCE() in the proc_sys_compare(). The reason why the latter is memory-safe is that ->d_compare() for non-in-lookup dentries is called either under rcu_read_lock() (in which case observing non-NULL means that kfree_rcu() couldn't have gotten to freeing the sucker) *or* under ->d_lock, in which case the inode can't reach ->evict_inode() until we are done. So this predicate is very much relevant. Have that fucker called with neither rcu_read_lock() nor ->d_lock, and you might very well end up with dereferencing an already freed ctl_table_header.
On Fri, 04 Jul 2025, Al Viro wrote: > On Mon, Jun 16, 2025 at 12:49:51PM +1000, NeilBrown wrote: > > > The reality is that ->sysctl does not need rcu protection. There is no > > concurrent update except that it can be set to NULL which is pointless. > > I would rather *not* leave a dangling pointer there, and yes, it can > end up being dangling. kfree_rcu() from inside the ->evict_inode() > may very well happen earlier than (also RCU-delayed) freeing of struct > inode itself. In that case could we move the proc_sys_evict_inode() call from proc_evict_inode() to proc_free_inode(), and replace kfree_rcu() with kfree()? Or does the inode need to be deleted from ->sibling_inodes earlier than free_inode? > > What we can do is WRITE_ONCE() to set it to NULL on the evict_inode > side and READ_ONCE() in the proc_sys_compare(). That is likely the simplest change. Thanks, NeilBrown > > The reason why the latter is memory-safe is that ->d_compare() for > non-in-lookup dentries is called either under rcu_read_lock() (in which > case observing non-NULL means that kfree_rcu() couldn't have gotten to > freeing the sucker) *or* under ->d_lock, in which case the inode can't > reach ->evict_inode() until we are done. > > So this predicate is very much relevant. Have that fucker called with > neither rcu_read_lock() nor ->d_lock, and you might very well end up > with dereferencing an already freed ctl_table_header. >
On Fri, Jul 04, 2025 at 11:39:16AM +1000, NeilBrown wrote: > On Fri, 04 Jul 2025, Al Viro wrote: > > On Mon, Jun 16, 2025 at 12:49:51PM +1000, NeilBrown wrote: > > > > > The reality is that ->sysctl does not need rcu protection. There is no > > > concurrent update except that it can be set to NULL which is pointless. > > > > I would rather *not* leave a dangling pointer there, and yes, it can > > end up being dangling. kfree_rcu() from inside the ->evict_inode() > > may very well happen earlier than (also RCU-delayed) freeing of struct > > inode itself. > > In that case could we move the proc_sys_evict_inode() call from > proc_evict_inode() to proc_free_inode(), and replace kfree_rcu() with > kfree()? proc_free_inode() can be called from softirq context, so we'd need to touch all sysctl_lock users for that. Definitely a larger patch that way, if nothing else...
On Fri, Jul 04, 2025 at 12:43:13AM +0100, Al Viro wrote: > I would rather *not* leave a dangling pointer there, and yes, it can > end up being dangling. kfree_rcu() from inside the ->evict_inode() > may very well happen earlier than (also RCU-delayed) freeing of struct > inode itself. > > What we can do is WRITE_ONCE() to set it to NULL on the evict_inode > side and READ_ONCE() in the proc_sys_compare(). > > The reason why the latter is memory-safe is that ->d_compare() for > non-in-lookup dentries is called either under rcu_read_lock() (in which > case observing non-NULL means that kfree_rcu() couldn't have gotten to > freeing the sucker) *or* under ->d_lock, in which case the inode can't > reach ->evict_inode() until we are done. > > So this predicate is very much relevant. Have that fucker called with > neither rcu_read_lock() nor ->d_lock, and you might very well end up > with dereferencing an already freed ctl_table_header. IOW, I would prefer to do this: [PATCH] fix proc_sys_compare() handling of in-lookup dentries There's one case where ->d_compare() can be called for an in-lookup dentry; usually that's nothing special from ->d_compare() point of view, but... proc_sys_compare() is weird. The thing is, /proc/sys subdirectories can look differently for different processes. Up to and including having the same name resolve to different dentries - all of them hashed. The way it's done is ->d_compare() refusing to admit a match unless this dentry is supposed to be visible to this caller. The information needed to discriminate between them is stored in inode; it is set during proc_sys_lookup() and until it's done d_splice_alias() we really can't tell who should that dentry be visible for. Normally there's no negative dentries in /proc/sys; we can run into a dying dentry in RCU dcache lookup, but those can be safely rejected. However, ->d_compare() is also called for in-lookup dentries, before they get positive - or hashed, for that matter. In case of match we will wait until dentry leaves in-lookup state and repeat ->d_compare() afterwards. In other words, the right behaviour is to treat the name match as sufficient for in-lookup dentries; if dentry is not for us, we'll see that when we recheck once proc_sys_lookup() is done with it. While we are at it, fix the misspelled READ_ONCE and WRITE_ONCE there. Fixes: d9171b934526 ("parallel lookups machinery, part 4 (and last)") Reported-by: NeilBrown <neilb@brown.name> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: NeilBrown <neil@brown.name> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- diff --git a/fs/proc/inode.c b/fs/proc/inode.c index a3eb3b740f76..3604b616311c 100644 --- a/fs/proc/inode.c +++ b/fs/proc/inode.c @@ -42,7 +42,7 @@ static void proc_evict_inode(struct inode *inode) head = ei->sysctl; if (head) { - RCU_INIT_POINTER(ei->sysctl, NULL); + WRITE_ONCE(ei->sysctl, NULL); proc_sys_evict_inode(inode, head); } } diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c index cc9d74a06ff0..08b78150cdde 100644 --- a/fs/proc/proc_sysctl.c +++ b/fs/proc/proc_sysctl.c @@ -918,17 +918,21 @@ static int proc_sys_compare(const struct dentry *dentry, struct ctl_table_header *head; struct inode *inode; - /* Although proc doesn't have negative dentries, rcu-walk means - * that inode here can be NULL */ - /* AV: can it, indeed? */ - inode = d_inode_rcu(dentry); - if (!inode) - return 1; if (name->len != len) return 1; if (memcmp(name->name, str, len)) return 1; - head = rcu_dereference(PROC_I(inode)->sysctl); + + // false positive is fine here - we'll recheck anyway + if (d_in_lookup(dentry)) + return 0; + + inode = d_inode_rcu(dentry); + // we just might have run into dentry in the middle of __dentry_kill() + if (!inode) + return 1; + + head = READ_ONCE(PROC_I(inode)->sysctl); return !head || !sysctl_is_seen(head); }
On Fri, 04 Jul 2025, Al Viro wrote: > On Fri, Jul 04, 2025 at 12:43:13AM +0100, Al Viro wrote: > > > I would rather *not* leave a dangling pointer there, and yes, it can > > end up being dangling. kfree_rcu() from inside the ->evict_inode() > > may very well happen earlier than (also RCU-delayed) freeing of struct > > inode itself. > > > > What we can do is WRITE_ONCE() to set it to NULL on the evict_inode > > side and READ_ONCE() in the proc_sys_compare(). > > > > The reason why the latter is memory-safe is that ->d_compare() for > > non-in-lookup dentries is called either under rcu_read_lock() (in which > > case observing non-NULL means that kfree_rcu() couldn't have gotten to > > freeing the sucker) *or* under ->d_lock, in which case the inode can't > > reach ->evict_inode() until we are done. > > > > So this predicate is very much relevant. Have that fucker called with > > neither rcu_read_lock() nor ->d_lock, and you might very well end up > > with dereferencing an already freed ctl_table_header. > > IOW, I would prefer to do this: Looks good - thanks, NeilBrown > > [PATCH] fix proc_sys_compare() handling of in-lookup dentries > > There's one case where ->d_compare() can be called for an in-lookup > dentry; usually that's nothing special from ->d_compare() point of > view, but... proc_sys_compare() is weird. > > The thing is, /proc/sys subdirectories can look differently for > different processes. Up to and including having the same name > resolve to different dentries - all of them hashed. > > The way it's done is ->d_compare() refusing to admit a match unless > this dentry is supposed to be visible to this caller. The information > needed to discriminate between them is stored in inode; it is set > during proc_sys_lookup() and until it's done d_splice_alias() we really > can't tell who should that dentry be visible for. > > Normally there's no negative dentries in /proc/sys; we can run into > a dying dentry in RCU dcache lookup, but those can be safely rejected. > > However, ->d_compare() is also called for in-lookup dentries, before > they get positive - or hashed, for that matter. In case of match > we will wait until dentry leaves in-lookup state and repeat ->d_compare() > afterwards. In other words, the right behaviour is to treat the > name match as sufficient for in-lookup dentries; if dentry is not > for us, we'll see that when we recheck once proc_sys_lookup() is > done with it. > > While we are at it, fix the misspelled READ_ONCE and WRITE_ONCE there. > > Fixes: d9171b934526 ("parallel lookups machinery, part 4 (and last)") > Reported-by: NeilBrown <neilb@brown.name> > Reviewed-by: Christian Brauner <brauner@kernel.org> > Reviewed-by: NeilBrown <neil@brown.name> > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- > diff --git a/fs/proc/inode.c b/fs/proc/inode.c > index a3eb3b740f76..3604b616311c 100644 > --- a/fs/proc/inode.c > +++ b/fs/proc/inode.c > @@ -42,7 +42,7 @@ static void proc_evict_inode(struct inode *inode) > > head = ei->sysctl; > if (head) { > - RCU_INIT_POINTER(ei->sysctl, NULL); > + WRITE_ONCE(ei->sysctl, NULL); > proc_sys_evict_inode(inode, head); > } > } > diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c > index cc9d74a06ff0..08b78150cdde 100644 > --- a/fs/proc/proc_sysctl.c > +++ b/fs/proc/proc_sysctl.c > @@ -918,17 +918,21 @@ static int proc_sys_compare(const struct dentry *dentry, > struct ctl_table_header *head; > struct inode *inode; > > - /* Although proc doesn't have negative dentries, rcu-walk means > - * that inode here can be NULL */ > - /* AV: can it, indeed? */ > - inode = d_inode_rcu(dentry); > - if (!inode) > - return 1; > if (name->len != len) > return 1; > if (memcmp(name->name, str, len)) > return 1; > - head = rcu_dereference(PROC_I(inode)->sysctl); > + > + // false positive is fine here - we'll recheck anyway > + if (d_in_lookup(dentry)) > + return 0; > + > + inode = d_inode_rcu(dentry); > + // we just might have run into dentry in the middle of __dentry_kill() > + if (!inode) > + return 1; > + > + head = READ_ONCE(PROC_I(inode)->sysctl); > return !head || !sysctl_is_seen(head); > } > >
On Fri, Jul 04, 2025 at 11:39:52AM +1000, NeilBrown wrote: > On Fri, 04 Jul 2025, Al Viro wrote: > > On Fri, Jul 04, 2025 at 12:43:13AM +0100, Al Viro wrote: > > > > > I would rather *not* leave a dangling pointer there, and yes, it can > > > end up being dangling. kfree_rcu() from inside the ->evict_inode() > > > may very well happen earlier than (also RCU-delayed) freeing of struct > > > inode itself. > > > > > > What we can do is WRITE_ONCE() to set it to NULL on the evict_inode > > > side and READ_ONCE() in the proc_sys_compare(). > > > > > > The reason why the latter is memory-safe is that ->d_compare() for > > > non-in-lookup dentries is called either under rcu_read_lock() (in which > > > case observing non-NULL means that kfree_rcu() couldn't have gotten to > > > freeing the sucker) *or* under ->d_lock, in which case the inode can't > > > reach ->evict_inode() until we are done. > > > > > > So this predicate is very much relevant. Have that fucker called with > > > neither rcu_read_lock() nor ->d_lock, and you might very well end up > > > with dereferencing an already freed ctl_table_header. > > > > IOW, I would prefer to do this: > > Looks good - thanks, > NeilBrown See viro/vfs.git #fixes...
© 2016 - 2025 Red Hat, Inc.