[v5] bpf path iterator

[PATCH v5 bpf-next 0/5] bpf path iterator

Posted by Song Liu 3 months, 3 weeks ago

In security use cases, it is common to apply rules to VFS subtrees.
However, filtering files in a subtree is not straightforward [1].

One solution to this problem is to start from a path and walk up the VFS
tree (towards the root). Among in-tree LSMs, Landlock uses this solution.

BPF LSM solutions, such like Tetragon [2], also use similar approaches.
However, due to lack of proper helper/kfunc support, BPF LSM solutions
usually do the path walk with probe read, which is racy.

This patchset introduces a new helper path_walk_parent, which walks
path to its VFS parent. The helper is used in Landlock.

A new BPF iterator, path iterator, is introduced to do the path walking.
The BPF path iterator uses the new path_walk_parent help to walk the VFS
tree.

Changes v4 => v5:
1. Minor changes to path_walk_parent per suggestions by Neil Brown and
   Tingmao Wang.

v4: https://lore.kernel.org/bpf/20250611220220.3681382-1-song@kernel.org/

Changes v3 => v4:
1. Rewrite path_walk_parent based on suggestion from Neil Brown.
2. When path_walk_parent() returns false, it call path_put on @path and
   then zeros @path.

v3: https://lore.kernel.org/bpf/20250606213015.255134-1-song@kernel.org/

Changes v2 => v3:
1. Fix an issue with path_walk_parent.
2. Move bpf path iterator to fs/bpf_fs_kfuncs.c
3. Optimize bpf path iterator (less memory).
4. Add more selftests.
5. Add more comments.

v2: https://lore.kernel.org/bpf/20250603065920.3404510-1-song@kernel.org/

Changes v1 => v2:
1. Rename path_parent => path_walk_parent.
2. Remove path_connected check in path_walk_parent.
3. Fix is_access_to_paths_allowed().
4. Remove mode for path iterator, add a flag instead.

v1: https://lore.kernel.org/bpf/20250528222623.1373000-1-song@kernel.org/

[1] https://lpc.events/event/18/contributions/1940/
[2] https://github.com/cilium/tetragon/

Song Liu (5):
  namei: Introduce new helper function path_walk_parent()
  landlock: Use path_walk_parent()
  bpf: Introduce path iterator
  selftests/bpf: Add tests for bpf path iterator
  selftests/bpf: Path walk test

 fs/bpf_fs_kfuncs.c                            |  72 +++++++++
 fs/namei.c                                    |  95 +++++++++---
 include/linux/namei.h                         |   2 +
 kernel/bpf/verifier.c                         |   5 +
 security/landlock/fs.c                        |  30 +---
 .../testing/selftests/bpf/bpf_experimental.h  |   6 +
 .../selftests/bpf/prog_tests/path_iter.c      | 111 ++++++++++++++
 tools/testing/selftests/bpf/progs/path_iter.c | 145 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/path_walk.c |  59 +++++++
 9 files changed, 485 insertions(+), 40 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/path_iter.c
 create mode 100644 tools/testing/selftests/bpf/progs/path_iter.c
 create mode 100644 tools/testing/selftests/bpf/progs/path_walk.c

--
2.47.1

Re: [PATCH v5 bpf-next 0/5] bpf path iterator

Posted by Song Liu 3 months, 2 weeks ago

Hi Christian, Mickaël, and folks,

Could you please share your comments on this version? Does this
look sane?

Thanks,
Song

On Mon, Jun 16, 2025 at 11:11 PM Song Liu <song@kernel.org> wrote:
>
> In security use cases, it is common to apply rules to VFS subtrees.
> However, filtering files in a subtree is not straightforward [1].
>
> One solution to this problem is to start from a path and walk up the VFS
> tree (towards the root). Among in-tree LSMs, Landlock uses this solution.
>

[...]

Re: [PATCH v5 bpf-next 0/5] bpf path iterator

Posted by Mickaël Salaün 3 months, 2 weeks ago

On Fri, Jun 20, 2025 at 02:59:17PM -0700, Song Liu wrote:
> Hi Christian, Mickaël, and folks,
> 
> Could you please share your comments on this version? Does this
> look sane?

This looks good to me but we need to know what is the acceptable next
step to support RCU.  If we can go with another _rcu helper, I'm good
with the current approach, otherwise we need to figure out a way to
leverage the current helper to make it compatible with callers being in
a RCU read-side critical section while leveraging safe path walk (i.e.
several calls to path_walk_parent).

> 
> Thanks,
> Song
> 
> On Mon, Jun 16, 2025 at 11:11 PM Song Liu <song@kernel.org> wrote:
> >
> > In security use cases, it is common to apply rules to VFS subtrees.
> > However, filtering files in a subtree is not straightforward [1].
> >
> > One solution to this problem is to start from a path and walk up the VFS
> > tree (towards the root). Among in-tree LSMs, Landlock uses this solution.
> >
> 
> [...]
>

Re: [PATCH v5 bpf-next 0/5] bpf path iterator

Posted by Song Liu 3 months, 1 week ago

Hi Christian,

On Tue, Jun 24, 2025 at 11:46 AM Mickaël Salaün <mic@digikod.net> wrote:
>
> On Fri, Jun 20, 2025 at 02:59:17PM -0700, Song Liu wrote:
> > Hi Christian, Mickaël, and folks,
> >
> > Could you please share your comments on this version? Does this
> > look sane?
>
> This looks good to me but we need to know what is the acceptable next
> step to support RCU.  If we can go with another _rcu helper, I'm good
> with the current approach, otherwise we need to figure out a way to
> leverage the current helper to make it compatible with callers being in
> a RCU read-side critical section while leveraging safe path walk (i.e.
> several calls to path_walk_parent).

Could you please share your suggestions on this topic? RCU
protected path walk out of fs/ seems controversial in multiple
ways. Do we have to let this set wait indefinitely for a solution
of RCU protected path walk? I would like to highlight that this
set doesn't add any persistent APIs. path_walk_parent() is not
in the UAPI, nor exported. If a newer and better API is created,
we can refactor bpf and landlock code and deprecate
path_walk_parent().

Thanks,
Song

Re: [PATCH v5 bpf-next 0/5] bpf path iterator

Posted by NeilBrown 3 months, 2 weeks ago

On Wed, 25 Jun 2025, Mickaël Salaün wrote:
> On Fri, Jun 20, 2025 at 02:59:17PM -0700, Song Liu wrote:
> > Hi Christian, Mickaël, and folks,
> > 
> > Could you please share your comments on this version? Does this
> > look sane?
> 
> This looks good to me but we need to know what is the acceptable next
> step to support RCU.  If we can go with another _rcu helper, I'm good
> with the current approach, otherwise we need to figure out a way to
> leverage the current helper to make it compatible with callers being in
> a RCU read-side critical section while leveraging safe path walk (i.e.
> several calls to path_walk_parent).

Can you spell out the minimum that you need?

My vague impression is that you want to search up from a given strut path,
no further then some other given path, looking for a dentry that matches
some rule.  Is that correct?

In general, the original dentry could be moved away from under the
dentry you find moments after the match is reported.  What mechanisms do
you have in place to ensure this doesn't happen, or that it doesn't
matter?

Would it be sufficient to have an iterator which reported successive
ancestors in turn, or reported that you need to restart because something
changed?  Would you need to know that a restart happened or would it be
acceptable to transparently start again at the parent of the starting
point?

Or do you really need a "one step at a time" interface?

Do you need more complex movements around the tree, or is just walking
up sufficient?

If this has been discussed or documented elsewhere I'd be happy for you
just to provide a reference, and I can come back with follow-up
questions if needed.

Thanks,
NeilBrown

> 
> > 
> > Thanks,
> > Song
> > 
> > On Mon, Jun 16, 2025 at 11:11 PM Song Liu <song@kernel.org> wrote:
> > >
> > > In security use cases, it is common to apply rules to VFS subtrees.
> > > However, filtering files in a subtree is not straightforward [1].
> > >
> > > One solution to this problem is to start from a path and walk up the VFS
> > > tree (towards the root). Among in-tree LSMs, Landlock uses this solution.
> > >
> > 
> > [...]
> > 
>

Re: [PATCH v5 bpf-next 0/5] bpf path iterator

Posted by Mickaël Salaün 3 months, 2 weeks ago

On Wed, Jun 25, 2025 at 07:38:53AM +1000, NeilBrown wrote:
> On Wed, 25 Jun 2025, Mickaël Salaün wrote:
> > On Fri, Jun 20, 2025 at 02:59:17PM -0700, Song Liu wrote:
> > > Hi Christian, Mickaël, and folks,
> > > 
> > > Could you please share your comments on this version? Does this
> > > look sane?
> > 
> > This looks good to me but we need to know what is the acceptable next
> > step to support RCU.  If we can go with another _rcu helper, I'm good
> > with the current approach, otherwise we need to figure out a way to
> > leverage the current helper to make it compatible with callers being in
> > a RCU read-side critical section while leveraging safe path walk (i.e.
> > several calls to path_walk_parent).
> 
> Can you spell out the minimum that you need?

Sure.  We'd like to call this new helper in a RCU
read-side critical section and leverage this capability to speed up path
walk when there is no concurrent hierarchy modification.  This use case
is similar to handle_dots() with LOOKUP_RCU calling follow_dotdot_rcu().

The main issue with this approach is to keep some state of the path walk
to know if the next call to "path_walk_parent_rcu()" would be valid
(i.e. something like a very light version of nameidata, mainly sequence
integers), and to get back to the non-RCU version otherwise.

> 
> My vague impression is that you want to search up from a given strut path,
> no further then some other given path, looking for a dentry that matches
> some rule.  Is that correct?

Yes

> 
> In general, the original dentry could be moved away from under the
> dentry you find moments after the match is reported.  What mechanisms do
> you have in place to ensure this doesn't happen, or that it doesn't
> matter?

In the case of Landlock, by default, a set of access rights are denied
and can only be allowed by an element in the file hierarchy.  The goal
is to only allow access to files under a specific directory (or directly
a specific file).  That's why we only care of the file hierarchy at the
time of access check.  It's not an issue if the file/directory was
moved or is being moved as long as we can walk its "current" hierarchy.
Furthermore, a sandboxed process is restricted from doing arbitrary
mounts (and renames/links are controlled with the
LANDLOCK_ACCESS_FS_REFER right).

However, we need to get a valid "snapshot" of the set of dentries that
(could) lead to the evaluated file/directory.

> 
> Would it be sufficient to have an iterator which reported successive
> ancestors in turn, or reported that you need to restart because something
> changed?  Would you need to know that a restart happened or would it be
> acceptable to transparently start again at the parent of the starting
> point?

If the path walk is being invalidated, we need to reset the collected
access right and start again the path walk to get all the access rights
from a consistent/real file hierarchy.

> 
> Or do you really need a "one step at a time" interface?

We need to check each component of the path walk, so either we call an
helper to get each of them and we do our check after that (we should be
able to do that in RCU), or we provide a callback function which is
called by the path walk helper.

> 
> Do you need more complex movements around the tree, or is just walking
> up sufficient?

Just walking up.

> 
> If this has been discussed or documented elsewhere I'd be happy for you
> just to provide a reference, and I can come back with follow-up
> questions if needed.

Tingmao initially described the goal here:
https://lore.kernel.org/all/afe77383-fe56-4029-848e-1401e3297139@maowtm.org/

and she sent an RFC to illustrate that:
https://lore.kernel.org/all/cover.1748997840.git.m@maowtm.org/

The discussion mainly raised two questions:
- Should we have one or two APIs?
- How to store the state of the walk without exposing VFS internals to
  the rest of the kernel?

Thanks

> 
> Thanks,
> NeilBrown
> 
> 
> > 
> > > 
> > > Thanks,
> > > Song
> > > 
> > > On Mon, Jun 16, 2025 at 11:11 PM Song Liu <song@kernel.org> wrote:
> > > >
> > > > In security use cases, it is common to apply rules to VFS subtrees.
> > > > However, filtering files in a subtree is not straightforward [1].
> > > >
> > > > One solution to this problem is to start from a path and walk up the VFS
> > > > tree (towards the root). Among in-tree LSMs, Landlock uses this solution.
> > > >
> > > 
> > > [...]
> > > 
> > 
> 
>

Re: [PATCH v5 bpf-next 0/5] bpf path iterator

Posted by NeilBrown 3 months, 2 weeks ago

On Wed, 25 Jun 2025, Mickaël Salaün wrote:
> On Wed, Jun 25, 2025 at 07:38:53AM +1000, NeilBrown wrote:
> > 
> > Can you spell out the minimum that you need?
> 
> Sure.  We'd like to call this new helper in a RCU
> read-side critical section and leverage this capability to speed up path
> walk when there is no concurrent hierarchy modification.  This use case
> is similar to handle_dots() with LOOKUP_RCU calling follow_dotdot_rcu().
> 
> The main issue with this approach is to keep some state of the path walk
> to know if the next call to "path_walk_parent_rcu()" would be valid
> (i.e. something like a very light version of nameidata, mainly sequence
> integers), and to get back to the non-RCU version otherwise.
> 
> > 
> > My vague impression is that you want to search up from a given strut path,
> > no further then some other given path, looking for a dentry that matches
> > some rule.  Is that correct?
> 
> Yes
> 
> > 
> > In general, the original dentry could be moved away from under the
> > dentry you find moments after the match is reported.  What mechanisms do
> > you have in place to ensure this doesn't happen, or that it doesn't
> > matter?
> 
> In the case of Landlock, by default, a set of access rights are denied
> and can only be allowed by an element in the file hierarchy.  The goal
> is to only allow access to files under a specific directory (or directly
> a specific file).  That's why we only care of the file hierarchy at the
> time of access check.  It's not an issue if the file/directory was
> moved or is being moved as long as we can walk its "current" hierarchy.
> Furthermore, a sandboxed process is restricted from doing arbitrary
> mounts (and renames/links are controlled with the
> LANDLOCK_ACCESS_FS_REFER right).
> 
> However, we need to get a valid "snapshot" of the set of dentries that
> (could) lead to the evaluated file/directory.

A "snapshot" is an interesting idea - though looking at the landlock
code you one need inodes, not dentries.
I imagine an interface where you give it a starting path, a root, and
and array of inode pointers, and it fills in the pointers with the path
- all under rcu so no references are needed.
But you would need some fallback if the array isn't big enough, so maybe
that isn't a good idea.

Based on the comments by Al and Christian, I think the only viable
approach is to pass a callback to some vfs function that does the
walking.

   vfs_walk_ancestors(struct path *path, struct path *root,
		      int (*walk_cb)(struct path *ancestor, void *data),
		      void *data)

where walk_cb() returns a negative number if it wants to abort, and is
given a NULL ancestor if vfs_walk_ancestors() needed to restart.

vfs_walk_ancestors() would initialise a "struct nameidata" and
effectively call handle_dots(&nd, LAST_DOTDOT) repeatedly, calling
    walk_cb(&nd.path, data)
each time.

How would you feel about that sort of interface?

NeilBrown

Re: [PATCH v5 bpf-next 0/5] bpf path iterator

Posted by Tingmao Wang 3 months, 2 weeks ago

On 6/26/25 00:04, NeilBrown wrote:
> On Wed, 25 Jun 2025, Mickaël Salaün wrote:
>> On Wed, Jun 25, 2025 at 07:38:53AM +1000, NeilBrown wrote:
>>>
>>> Can you spell out the minimum that you need?
>>
>> Sure.  We'd like to call this new helper in a RCU
>> read-side critical section and leverage this capability to speed up path
>> walk when there is no concurrent hierarchy modification.  This use case
>> is similar to handle_dots() with LOOKUP_RCU calling follow_dotdot_rcu().
>>
>> The main issue with this approach is to keep some state of the path walk
>> to know if the next call to "path_walk_parent_rcu()" would be valid
>> (i.e. something like a very light version of nameidata, mainly sequence
>> integers), and to get back to the non-RCU version otherwise.
>>
>>>
>>> My vague impression is that you want to search up from a given strut path,
>>> no further then some other given path, looking for a dentry that matches
>>> some rule.  Is that correct?
>>
>> Yes
>>
>>>
>>> In general, the original dentry could be moved away from under the
>>> dentry you find moments after the match is reported.  What mechanisms do
>>> you have in place to ensure this doesn't happen, or that it doesn't
>>> matter?
>>
>> In the case of Landlock, by default, a set of access rights are denied
>> and can only be allowed by an element in the file hierarchy.  The goal
>> is to only allow access to files under a specific directory (or directly
>> a specific file).  That's why we only care of the file hierarchy at the
>> time of access check.  It's not an issue if the file/directory was
>> moved or is being moved as long as we can walk its "current" hierarchy.
>> Furthermore, a sandboxed process is restricted from doing arbitrary
>> mounts (and renames/links are controlled with the
>> LANDLOCK_ACCESS_FS_REFER right).
>>
>> However, we need to get a valid "snapshot" of the set of dentries that
>> (could) lead to the evaluated file/directory.
> 
> A "snapshot" is an interesting idea - though looking at the landlock
> code you one need inodes, not dentries.
> I imagine an interface where you give it a starting path, a root, and
> and array of inode pointers, and it fills in the pointers with the path
> - all under rcu so no references are needed.
> But you would need some fallback if the array isn't big enough, so maybe
> that isn't a good idea.
> 
> Based on the comments by Al and Christian, I think the only viable
> approach is to pass a callback to some vfs function that does the
> walking.
> 
>    vfs_walk_ancestors(struct path *path, struct path *root,
> 		      int (*walk_cb)(struct path *ancestor, void *data),
> 		      void *data)
> 
> where walk_cb() returns a negative number if it wants to abort, and is
> given a NULL ancestor if vfs_walk_ancestors() needed to restart.
> 
> vfs_walk_ancestors() would initialise a "struct nameidata" and
> effectively call handle_dots(&nd, LAST_DOTDOT) repeatedly, calling
>     walk_cb(&nd.path, data)
> each time.

handle_dots semantically does more than dget_parent + choose_mountpoint
tho (which is what Landlock currently does, and is also what Song's
iterator will do).  There is the step_into which will step into
mountpoints (there is also code to handle symlinks, although I'm not sure
if that's relevant for following ".."), and it will also return ENOENT if
the path is disconnected.

Also I guess we might not need to have an entire nameidata?  In theory it
only needs to do what follow_dotdot_rcu does without the path_connected
check.  So it seems like given we have path and root as function argument,
it would only need nd->{seq,m_seq}.

I might be wrong tho, but certainly the behaviour is different.

> 
> How would you feel about that sort of interface?

I can't speak for Mickaël, but a callback-based interface is less flexible
(and _maybe_ less performant?).  Also, probably we will want to fallback
to a reference-taking walk if the walk fails (rather than, say, retry
infinitely), and this should probably use Song's proposed iterator.  I'm
not sure if Song would be keen to rewrite this iterator patch series in
callback style (to be clear, it doesn't necessarily seem like a good idea
to me, and I'm not asking him to), which means that we will end up with
the reference walk API being a "call this function repeatedly", and the
rcu walk API taking a callback.  I think it is still workable (after all,
if Landlock wants to reuse the code in the callback it can just call the
callback function itself when doing the reference walk), but it seems a
bit "ugly" to me.

But this is just my opinion, and if there is a stronger desire to not
expose any VFS seqcount integers then maybe we will just need to work with
a callback.

Quick note in case anyone reading this has not seen it, a while ago I made
a POC of a non-callback style API for rcu parent walk based on Song's
series:
https://lore.kernel.org/all/dbc7ee0f1f483b7bc2ec9757672a38d99015e9ae.1749402769@maowtm.org/#iZ31fs:namei.c

> 
> NeilBrown
>

Re: [PATCH v5 bpf-next 0/5] bpf path iterator

Posted by NeilBrown 3 months, 2 weeks ago

On Thu, 26 Jun 2025, Tingmao Wang wrote:
> On 6/26/25 00:04, NeilBrown wrote:
> > On Wed, 25 Jun 2025, Mickaël Salaün wrote:
> >> On Wed, Jun 25, 2025 at 07:38:53AM +1000, NeilBrown wrote:
> >>>
> >>> Can you spell out the minimum that you need?
> >>
> >> Sure.  We'd like to call this new helper in a RCU
> >> read-side critical section and leverage this capability to speed up path
> >> walk when there is no concurrent hierarchy modification.  This use case
> >> is similar to handle_dots() with LOOKUP_RCU calling follow_dotdot_rcu().
> >>
> >> The main issue with this approach is to keep some state of the path walk
> >> to know if the next call to "path_walk_parent_rcu()" would be valid
> >> (i.e. something like a very light version of nameidata, mainly sequence
> >> integers), and to get back to the non-RCU version otherwise.
> >>
> >>>
> >>> My vague impression is that you want to search up from a given strut path,
> >>> no further then some other given path, looking for a dentry that matches
> >>> some rule.  Is that correct?
> >>
> >> Yes
> >>
> >>>
> >>> In general, the original dentry could be moved away from under the
> >>> dentry you find moments after the match is reported.  What mechanisms do
> >>> you have in place to ensure this doesn't happen, or that it doesn't
> >>> matter?
> >>
> >> In the case of Landlock, by default, a set of access rights are denied
> >> and can only be allowed by an element in the file hierarchy.  The goal
> >> is to only allow access to files under a specific directory (or directly
> >> a specific file).  That's why we only care of the file hierarchy at the
> >> time of access check.  It's not an issue if the file/directory was
> >> moved or is being moved as long as we can walk its "current" hierarchy.
> >> Furthermore, a sandboxed process is restricted from doing arbitrary
> >> mounts (and renames/links are controlled with the
> >> LANDLOCK_ACCESS_FS_REFER right).
> >>
> >> However, we need to get a valid "snapshot" of the set of dentries that
> >> (could) lead to the evaluated file/directory.
> > 
> > A "snapshot" is an interesting idea - though looking at the landlock
> > code you one need inodes, not dentries.
> > I imagine an interface where you give it a starting path, a root, and
> > and array of inode pointers, and it fills in the pointers with the path
> > - all under rcu so no references are needed.
> > But you would need some fallback if the array isn't big enough, so maybe
> > that isn't a good idea.
> > 
> > Based on the comments by Al and Christian, I think the only viable
> > approach is to pass a callback to some vfs function that does the
> > walking.
> > 
> >    vfs_walk_ancestors(struct path *path, struct path *root,
> > 		      int (*walk_cb)(struct path *ancestor, void *data),
> > 		      void *data)
> > 
> > where walk_cb() returns a negative number if it wants to abort, and is
> > given a NULL ancestor if vfs_walk_ancestors() needed to restart.
> > 
> > vfs_walk_ancestors() would initialise a "struct nameidata" and
> > effectively call handle_dots(&nd, LAST_DOTDOT) repeatedly, calling
> >     walk_cb(&nd.path, data)
> > each time.
> 
> handle_dots semantically does more than dget_parent + choose_mountpoint
> tho (which is what Landlock currently does, and is also what Song's
> iterator will do).  There is the step_into which will step into
> mountpoints (there is also code to handle symlinks, although I'm not sure
> if that's relevant for following ".."), and it will also return ENOENT if
> the path is disconnected.

Is any of this a problem for you?

> 
> Also I guess we might not need to have an entire nameidata?  In theory it
> only needs to do what follow_dotdot_rcu does without the path_connected
> check.  So it seems like given we have path and root as function argument,
> it would only need nd->{seq,m_seq}.

Those are implementation details internal to namei.c.  Certainly this
function wouldn't use all of the fields in nameidata, but it doesn't
hurt to have a few fields in a struct on the stack which don't get used.
Keeping the code simple and uniform is much more important.  Using
nameidata would help achieve that.

> 
> I might be wrong tho, but certainly the behaviour is different.

Here we get back to the question of "precisely what behaviour do you
need?".
"The same as what a previous patch did" is not a reasonable answer.

If, from userspace, you repeatedly did chdir("..") and then examined the
current directory you would get exactly the sequence of directories
provided by repeatedly calling handle_dots(..., LAST_DOTDOT).  If there
is some circumstance where that would be not acceptable for your use
case, you need to explain (and we need to document) what differences you
need and why use need it.

> 
> > 
> > How would you feel about that sort of interface?
> 
> I can't speak for Mickaël, but a callback-based interface is less flexible
> (and _maybe_ less performant?).  Also, probably we will want to fallback
> to a reference-taking walk if the walk fails (rather than, say, retry
> infinitely), and this should probably use Song's proposed iterator.  I'm
> not sure if Song would be keen to rewrite this iterator patch series in
> callback style (to be clear, it doesn't necessarily seem like a good idea
> to me, and I'm not asking him to), which means that we will end up with
> the reference walk API being a "call this function repeatedly", and the
> rcu walk API taking a callback.  I think it is still workable (after all,
> if Landlock wants to reuse the code in the callback it can just call the
> callback function itself when doing the reference walk), but it seems a
> bit "ugly" to me.

call-back can have a performance impact (less opportunity for compiler
optimisation and CPU speculation), though less than taking spinlock and
references.  However Al and Christian have drawn a hard line against
making seq numbers visible outside VFS code so I think it is the
approach most likely to be accepted.

Certainly vfs_walk_ancestors() would fallback to ref-walk if rcu-walk
resulted in -ECHILD - just like all other path walking code in namei.c.
This would be largely transparent to the caller - the caller would only
see that the callback received a NULL path indicating a restart.  It
wouldn't need to know why.

NeilBrown

Re: [PATCH v5 bpf-next 0/5] bpf path iterator

Posted by Christian Brauner 3 months ago

> Those are implementation details internal to namei.c.  Certainly this
> function wouldn't use all of the fields in nameidata, but it doesn't
> hurt to have a few fields in a struct on the stack which don't get used.
> Keeping the code simple and uniform is much more important.  Using

Exactly.

> Certainly vfs_walk_ancestors() would fallback to ref-walk if rcu-walk
> resulted in -ECHILD - just like all other path walking code in namei.c.
> This would be largely transparent to the caller - the caller would only
> see that the callback received a NULL path indicating a restart.  It
> wouldn't need to know why.

Yes, that's also what I mentioned in an earlier mail.

Re: [PATCH v5 bpf-next 0/5] bpf path iterator

Posted by Song Liu 3 months, 2 weeks ago


> On Jun 25, 2025, at 6:05 PM, NeilBrown <neil@brown.name> wrote:

[...]

>> 
>> I can't speak for Mickaël, but a callback-based interface is less flexible
>> (and _maybe_ less performant?).  Also, probably we will want to fallback
>> to a reference-taking walk if the walk fails (rather than, say, retry
>> infinitely), and this should probably use Song's proposed iterator.  I'm
>> not sure if Song would be keen to rewrite this iterator patch series in
>> callback style (to be clear, it doesn't necessarily seem like a good idea
>> to me, and I'm not asking him to), which means that we will end up with
>> the reference walk API being a "call this function repeatedly", and the
>> rcu walk API taking a callback.  I think it is still workable (after all,
>> if Landlock wants to reuse the code in the callback it can just call the
>> callback function itself when doing the reference walk), but it seems a
>> bit "ugly" to me.
> 
> call-back can have a performance impact (less opportunity for compiler
> optimisation and CPU speculation), though less than taking spinlock and
> references.  However Al and Christian have drawn a hard line against
> making seq numbers visible outside VFS code so I think it is the
> approach most likely to be accepted.
> 
> Certainly vfs_walk_ancestors() would fallback to ref-walk if rcu-walk
> resulted in -ECHILD - just like all other path walking code in namei.c.
> This would be largely transparent to the caller - the caller would only
> see that the callback received a NULL path indicating a restart.  It
> wouldn't need to know why.

I guess I misunderstood the proposal of vfs_walk_ancestors() 
initially, so some clarification:

I think vfs_walk_ancestors() is good for the rcu-walk, and some 
rcu-then-ref-walk. However, I don’t think it fits all use cases. 
A reliable step-by-step ref-walk, like this set, works well with 
BPF, and we want to keep it. 

Can we ship this set as-is (or after fixing the comment reported
by kernel test robot)? I really don’t think we need figure out 
all details about the rcu-walk here. 

Once this is landed, we can try implementing the rcu-walk
(vfs_walk_ancestors or some variation). If no one volunteers, I
can give it a try. 

Thanks,
Song

Re: [PATCH v5 bpf-next 0/5] bpf path iterator

Posted by NeilBrown 3 months, 2 weeks ago

On Thu, 26 Jun 2025, Song Liu wrote:
> 
> 
> > On Jun 25, 2025, at 6:05 PM, NeilBrown <neil@brown.name> wrote:
> 
> [...]
> 
> >> 
> >> I can't speak for Mickaël, but a callback-based interface is less flexible
> >> (and _maybe_ less performant?).  Also, probably we will want to fallback
> >> to a reference-taking walk if the walk fails (rather than, say, retry
> >> infinitely), and this should probably use Song's proposed iterator.  I'm
> >> not sure if Song would be keen to rewrite this iterator patch series in
> >> callback style (to be clear, it doesn't necessarily seem like a good idea
> >> to me, and I'm not asking him to), which means that we will end up with
> >> the reference walk API being a "call this function repeatedly", and the
> >> rcu walk API taking a callback.  I think it is still workable (after all,
> >> if Landlock wants to reuse the code in the callback it can just call the
> >> callback function itself when doing the reference walk), but it seems a
> >> bit "ugly" to me.
> > 
> > call-back can have a performance impact (less opportunity for compiler
> > optimisation and CPU speculation), though less than taking spinlock and
> > references.  However Al and Christian have drawn a hard line against
> > making seq numbers visible outside VFS code so I think it is the
> > approach most likely to be accepted.
> > 
> > Certainly vfs_walk_ancestors() would fallback to ref-walk if rcu-walk
> > resulted in -ECHILD - just like all other path walking code in namei.c.
> > This would be largely transparent to the caller - the caller would only
> > see that the callback received a NULL path indicating a restart.  It
> > wouldn't need to know why.
> 
> I guess I misunderstood the proposal of vfs_walk_ancestors() 
> initially, so some clarification:
> 
> I think vfs_walk_ancestors() is good for the rcu-walk, and some 
> rcu-then-ref-walk. However, I don’t think it fits all use cases. 
> A reliable step-by-step ref-walk, like this set, works well with 
> BPF, and we want to keep it. 

The distinction between rcu-walk and ref-walk is an internal
implementation detail.  You as a caller shouldn't need to think about
the difference.  You just want to walk.  Note that LOOKUP_RCU is
documented in namei.h as "semi-internal".  The only uses outside of
core-VFS code is in individual filesystem's d_revalidate handler - they
are checking if they are allowed to sleep or not.  You should never
expect to pass LOOKUP_RCU to an VFS API - no other code does.

It might be reasonable for you as a caller to have some control over
whether the call can sleep or not.  LOOKUP_CACHED is a bit like that.
But for dotdot lookup the code will never sleep - so that is not
relevant.

I strongly suggest you stop thinking about rcu-walk vs ref-walk.  Think
about the needs of your code.  If you need a high-performance API, then
ask for a high-performance API, don't assume what form it will take or
what the internal implementation details will be.

I think you already have a clear answer that a step-by-step API will not
be read-only on the dcache (i.e.  it will adjust refcounts) and so will
not be high performance.  If you want high performance, you need to
accept a different style of API.

NeilBrown

Re: [PATCH v5 bpf-next 0/5] bpf path iterator

Posted by Song Liu 3 months, 2 weeks ago


> On Jun 26, 2025, at 3:22 AM, NeilBrown <neil@brown.name> wrote:

[...]

>> I guess I misunderstood the proposal of vfs_walk_ancestors() 
>> initially, so some clarification:
>> 
>> I think vfs_walk_ancestors() is good for the rcu-walk, and some 
>> rcu-then-ref-walk. However, I don’t think it fits all use cases. 
>> A reliable step-by-step ref-walk, like this set, works well with 
>> BPF, and we want to keep it.
> 
> The distinction between rcu-walk and ref-walk is an internal
> implementation detail.  You as a caller shouldn't need to think about
> the difference.  You just want to walk.  Note that LOOKUP_RCU is
> documented in namei.h as "semi-internal".  The only uses outside of
> core-VFS code is in individual filesystem's d_revalidate handler - they
> are checking if they are allowed to sleep or not.  You should never
> expect to pass LOOKUP_RCU to an VFS API - no other code does.
> 
> It might be reasonable for you as a caller to have some control over
> whether the call can sleep or not.  LOOKUP_CACHED is a bit like that.
> But for dotdot lookup the code will never sleep - so that is not
> relevant.

Unfortunately, the BPF use case is more complicated. In some cases, 
the callback function cannot be call in rcu critical sections. For 
example, the callback may need to read xatter. For these cases, we
we cannot use RCU walk at all. 

> I strongly suggest you stop thinking about rcu-walk vs ref-walk.  Think
> about the needs of your code.  If you need a high-performance API, then
> ask for a high-performance API, don't assume what form it will take or
> what the internal implementation details will be.

At the moment, we need a ref-walk API on the BPF side. The RCU walk
is a totally separate topic. 

> I think you already have a clear answer that a step-by-step API will not
> be read-only on the dcache (i.e.  it will adjust refcounts) and so will
> not be high performance.  If you want high performance, you need to
> accept a different style of API.

Agreed. 

Thanks,
Song

Re: [PATCH v5 bpf-next 0/5] bpf path iterator

Posted by NeilBrown 3 months, 2 weeks ago

On Fri, 27 Jun 2025, Song Liu wrote:
> 
> 
> > On Jun 26, 2025, at 3:22 AM, NeilBrown <neil@brown.name> wrote:
> 
> [...]
> 
> >> I guess I misunderstood the proposal of vfs_walk_ancestors() 
> >> initially, so some clarification:
> >> 
> >> I think vfs_walk_ancestors() is good for the rcu-walk, and some 
> >> rcu-then-ref-walk. However, I don’t think it fits all use cases. 
> >> A reliable step-by-step ref-walk, like this set, works well with 
> >> BPF, and we want to keep it.
> > 
> > The distinction between rcu-walk and ref-walk is an internal
> > implementation detail.  You as a caller shouldn't need to think about
> > the difference.  You just want to walk.  Note that LOOKUP_RCU is
> > documented in namei.h as "semi-internal".  The only uses outside of
> > core-VFS code is in individual filesystem's d_revalidate handler - they
> > are checking if they are allowed to sleep or not.  You should never
> > expect to pass LOOKUP_RCU to an VFS API - no other code does.
> > 
> > It might be reasonable for you as a caller to have some control over
> > whether the call can sleep or not.  LOOKUP_CACHED is a bit like that.
> > But for dotdot lookup the code will never sleep - so that is not
> > relevant.
> 
> Unfortunately, the BPF use case is more complicated. In some cases, 
> the callback function cannot be call in rcu critical sections. For 
> example, the callback may need to read xatter. For these cases, we
> we cannot use RCU walk at all. 

I really think you should stop using the terms RCU walk and ref-walk.  I
think they might be focusing your thinking in an unhelpful direction.

The key issue about reading xattrs is that it might need to sleep.
Focusing on what might need to sleep and what will never need to sleep
is a useful approach - the distinction is wide spread in the kernel and
several function take a flag indicating if they are permitted to sleep,
or if failure when sleeping would be required.

So your above observation is better described as 

   The vfs_walk_ancestors() API has an (implicit) requirement that the
   callback mustn't sleep.  This is a problem for some use-cases
   where the call back might need to sleep - e.g. for accessing xattrs.

That is a good and useful observation.  I can see three possibly
responses:

1/ Add a vfs_walk_ancestors_maysleep() API for which the callback is
   always allowed to sleep.  I don't particularly like this approach.

2/ Use repeated calls to vfs_walk_parent() when the handling of each
   ancestor might need to sleep.  I see no problem with supporting both
   vfs_walk_ancestors() and vfs_walk_parent().  There is plenty of
   precedent for having different  interfaces for different use cases.

3/ Extend vfs_walk_ancestors() to pass a "may sleep" flag to the callback.
   If the callback finds that it needs to sleep but that "may sleep"
   isn't set, it returns some well known status, like -EWOULDBLOCK (or
   -ECHILD).  It can expect to be called again but with "may sleep" set.
   This is my preferred approach. There is precedent with the
   d_revalidate callbacks which works like this.
   I suspect that accessing xattrs might often be possible without
   sleeping.  It is conceivable that we could add a "may sleep" argument
   to vfs_getxattr() so that it could still often be used without
   requiring vfs_walk_ancestors() to permit sleeping.
   This would almost certainly require a clear demonstration that 
   there was a performance cost in not having the option of non-sleeping
   vfs_getxattr().

> 
> > I strongly suggest you stop thinking about rcu-walk vs ref-walk.  Think
> > about the needs of your code.  If you need a high-performance API, then
> > ask for a high-performance API, don't assume what form it will take or
> > what the internal implementation details will be.
> 
> At the moment, we need a ref-walk API on the BPF side. The RCU walk
> is a totally separate topic. 

Do you mean "we need step-by-step walking" or do you mean "we need to
potentially sleep for each ancestor"?  These are conceptually different
requirements, but I cannot tell which you mean when you talk about "RCU
walk".

Thanks,
NeilBrown

> 
> > I think you already have a clear answer that a step-by-step API will not
> > be read-only on the dcache (i.e.  it will adjust refcounts) and so will
> > not be high performance.  If you want high performance, you need to
> > accept a different style of API.
> 
> Agreed. 
> 
> Thanks,
> Song
> 
>

Re: [PATCH v5 bpf-next 0/5] bpf path iterator

Posted by Christian Brauner 3 months ago

On Fri, Jun 27, 2025 at 08:51:21AM +1000, NeilBrown wrote:
> On Fri, 27 Jun 2025, Song Liu wrote:
> > 
> > 
> > > On Jun 26, 2025, at 3:22 AM, NeilBrown <neil@brown.name> wrote:
> > 
> > [...]
> > 
> > >> I guess I misunderstood the proposal of vfs_walk_ancestors() 
> > >> initially, so some clarification:
> > >> 
> > >> I think vfs_walk_ancestors() is good for the rcu-walk, and some 
> > >> rcu-then-ref-walk. However, I don’t think it fits all use cases. 
> > >> A reliable step-by-step ref-walk, like this set, works well with 
> > >> BPF, and we want to keep it.
> > > 
> > > The distinction between rcu-walk and ref-walk is an internal
> > > implementation detail.  You as a caller shouldn't need to think about
> > > the difference.  You just want to walk.  Note that LOOKUP_RCU is
> > > documented in namei.h as "semi-internal".  The only uses outside of
> > > core-VFS code is in individual filesystem's d_revalidate handler - they
> > > are checking if they are allowed to sleep or not.  You should never
> > > expect to pass LOOKUP_RCU to an VFS API - no other code does.
> > > 
> > > It might be reasonable for you as a caller to have some control over
> > > whether the call can sleep or not.  LOOKUP_CACHED is a bit like that.
> > > But for dotdot lookup the code will never sleep - so that is not
> > > relevant.
> > 
> > Unfortunately, the BPF use case is more complicated. In some cases, 
> > the callback function cannot be call in rcu critical sections. For 
> > example, the callback may need to read xatter. For these cases, we
> > we cannot use RCU walk at all. 
> 
> I really think you should stop using the terms RCU walk and ref-walk.  I
> think they might be focusing your thinking in an unhelpful direction.

Thank you! I really appreciate you helping to shape this API and it
aligns a lot with my thinking.

> The key issue about reading xattrs is that it might need to sleep.
> Focusing on what might need to sleep and what will never need to sleep
> is a useful approach - the distinction is wide spread in the kernel and
> several function take a flag indicating if they are permitted to sleep,
> or if failure when sleeping would be required.
> 
> So your above observation is better described as 
> 
>    The vfs_walk_ancestors() API has an (implicit) requirement that the
>    callback mustn't sleep.  This is a problem for some use-cases
>    where the call back might need to sleep - e.g. for accessing xattrs.
> 
> That is a good and useful observation.  I can see three possibly
> responses:
> 
> 1/ Add a vfs_walk_ancestors_maysleep() API for which the callback is
>    always allowed to sleep.  I don't particularly like this approach.

Agreed.

> 
> 2/ Use repeated calls to vfs_walk_parent() when the handling of each
>    ancestor might need to sleep.  I see no problem with supporting both
>    vfs_walk_ancestors() and vfs_walk_parent().  There is plenty of
>    precedent for having different  interfaces for different use cases.

Meh.

> 
> 3/ Extend vfs_walk_ancestors() to pass a "may sleep" flag to the callback.

I think that's fine.

Re: [PATCH v5 bpf-next 0/5] bpf path iterator

Posted by Christian Brauner 3 months ago

On Mon, Jul 07, 2025 at 12:46:41PM +0200, Christian Brauner wrote:
> On Fri, Jun 27, 2025 at 08:51:21AM +1000, NeilBrown wrote:
> > On Fri, 27 Jun 2025, Song Liu wrote:
> > > 
> > > 
> > > > On Jun 26, 2025, at 3:22 AM, NeilBrown <neil@brown.name> wrote:
> > > 
> > > [...]
> > > 
> > > >> I guess I misunderstood the proposal of vfs_walk_ancestors() 
> > > >> initially, so some clarification:
> > > >> 
> > > >> I think vfs_walk_ancestors() is good for the rcu-walk, and some 
> > > >> rcu-then-ref-walk. However, I don’t think it fits all use cases. 
> > > >> A reliable step-by-step ref-walk, like this set, works well with 
> > > >> BPF, and we want to keep it.
> > > > 
> > > > The distinction between rcu-walk and ref-walk is an internal
> > > > implementation detail.  You as a caller shouldn't need to think about
> > > > the difference.  You just want to walk.  Note that LOOKUP_RCU is
> > > > documented in namei.h as "semi-internal".  The only uses outside of
> > > > core-VFS code is in individual filesystem's d_revalidate handler - they
> > > > are checking if they are allowed to sleep or not.  You should never
> > > > expect to pass LOOKUP_RCU to an VFS API - no other code does.
> > > > 
> > > > It might be reasonable for you as a caller to have some control over
> > > > whether the call can sleep or not.  LOOKUP_CACHED is a bit like that.
> > > > But for dotdot lookup the code will never sleep - so that is not
> > > > relevant.
> > > 
> > > Unfortunately, the BPF use case is more complicated. In some cases, 
> > > the callback function cannot be call in rcu critical sections. For 
> > > example, the callback may need to read xatter. For these cases, we
> > > we cannot use RCU walk at all. 
> > 
> > I really think you should stop using the terms RCU walk and ref-walk.  I
> > think they might be focusing your thinking in an unhelpful direction.
> 
> Thank you! I really appreciate you helping to shape this API and it
> aligns a lot with my thinking.
> 
> > The key issue about reading xattrs is that it might need to sleep.
> > Focusing on what might need to sleep and what will never need to sleep
> > is a useful approach - the distinction is wide spread in the kernel and
> > several function take a flag indicating if they are permitted to sleep,
> > or if failure when sleeping would be required.
> > 
> > So your above observation is better described as 
> > 
> >    The vfs_walk_ancestors() API has an (implicit) requirement that the
> >    callback mustn't sleep.  This is a problem for some use-cases
> >    where the call back might need to sleep - e.g. for accessing xattrs.
> > 
> > That is a good and useful observation.  I can see three possibly
> > responses:
> > 
> > 1/ Add a vfs_walk_ancestors_maysleep() API for which the callback is
> >    always allowed to sleep.  I don't particularly like this approach.
> 
> Agreed.
> 
> > 
> > 2/ Use repeated calls to vfs_walk_parent() when the handling of each
> >    ancestor might need to sleep.  I see no problem with supporting both
> >    vfs_walk_ancestors() and vfs_walk_parent().  There is plenty of
> >    precedent for having different  interfaces for different use cases.
> 
> Meh.
> 
> > 
> > 3/ Extend vfs_walk_ancestors() to pass a "may sleep" flag to the callback.
> 
> I think that's fine.

Ok, sorry for the delay but there's a lot of different things going on
right now and this one isn't exactly an easy thing to solve.

I mentioned this before and so did Neil: the lookup implementation
supports two modes sleeping and non-sleeping. That api is abstracted
away as heavily as possible by the VFS so that non-core code will not be
exposed to it other than in exceptional circumstances and doesn't have
to care about it.

It is a conceptual dead-end to expose these two modes via separate APIs
and leak this implementation detail into non-core code. It will not
happen as far as I'm concerned.

I very much understand the urge to get the refcount step-by-step thing
merged asap. Everyone wants their APIs merged fast. And if it's
reasonable to move fast we will (see the kernfs xattr thing).

But here are two use-cases that ask for the same thing with different
constraints that closely mirror our unified approach. Merging one
quickly just to have something and then later bolting the other one on
top, augmenting, or replacing, possible having to deprecate the old API
is just objectively nuts. That's how we end up with a spaghetthi helper
collection. We want as little helper fragmentation as possible.

We need a unified API that serves both use-cases. I dislike
callback-based APIs generally but we have precedent in the VFS for this
for cases where the internal state handling is delicate enough that it
should not be exposed (see __iterate_supers() which does exactly work
like Neil suggested down to the flag argument itself I added).

So I'm open to the callback solution.

(Note for really absurd perf requirements you could even make it work
with static calls I'm pretty sure.)

Re: [PATCH v5 bpf-next 0/5] bpf path iterator

Posted by Song Liu 3 months ago

Hi Christian, 

Thanks for your comments! 

> On Jul 7, 2025, at 4:17 AM, Christian Brauner <brauner@kernel.org> wrote:

[...]

>>> 3/ Extend vfs_walk_ancestors() to pass a "may sleep" flag to the callback.
>> 
>> I think that's fine.
> 
> Ok, sorry for the delay but there's a lot of different things going on
> right now and this one isn't exactly an easy thing to solve.
> 
> I mentioned this before and so did Neil: the lookup implementation
> supports two modes sleeping and non-sleeping. That api is abstracted
> away as heavily as possible by the VFS so that non-core code will not be
> exposed to it other than in exceptional circumstances and doesn't have
> to care about it.
> 
> It is a conceptual dead-end to expose these two modes via separate APIs
> and leak this implementation detail into non-core code. It will not
> happen as far as I'm concerned.
> 
> I very much understand the urge to get the refcount step-by-step thing
> merged asap. Everyone wants their APIs merged fast. And if it's
> reasonable to move fast we will (see the kernfs xattr thing).
> 
> But here are two use-cases that ask for the same thing with different
> constraints that closely mirror our unified approach. Merging one
> quickly just to have something and then later bolting the other one on
> top, augmenting, or replacing, possible having to deprecate the old API
> is just objectively nuts. That's how we end up with a spaghetthi helper
> collection. We want as little helper fragmentation as possible.
> 
> We need a unified API that serves both use-cases. I dislike
> callback-based APIs generally but we have precedent in the VFS for this
> for cases where the internal state handling is delicate enough that it
> should not be exposed (see __iterate_supers() which does exactly work
> like Neil suggested down to the flag argument itself I added).
> 
> So I'm open to the callback solution.
> 
> (Note for really absurd perf requirements you could even make it work
> with static calls I'm pretty sure.)

I guess we will go with Mickaël’s idea:

> int vfs_walk_ancestors(struct path *path,
>                       bool (*walk_cb)(const struct path *ancestor, void *data),
>                       void *data, int flags)
> 
> The walk continue while walk_cb() returns true.  walk_cb() can then
> check if @ancestor is equal to a @root, or other properties.  The
> walk_cb() return value (if not bool) should not be returned by
> vfs_walk_ancestors() because a walk stop doesn't mean an error.

If necessary, we hide “root" inside @data. This is good. 

> @path would be updated with latest ancestor path (e.g. @root).

Update @path to the last ancestor and hold proper references. 
I missed this part earlier. With this feature, vfs_walk_ancestors 
should work usable with open-codeed bpf path iterator. 

I have a question about this behavior with RCU walk. IIUC, RCU 
walk does not hold reference to @ancestor when calling walk_cb().
If walk_cb() returns false, shall vfs_walk_ancestors() then
grab a reference on @ancestor? This feels a bit weird to me. 
Maybe “updating @path to the last ancestor” should only apply to
LOOKUP_RCU==false case? 

> @flags could contain LOOKUP_RCU or not, which enables us to have
> walk_cb() not-RCU compatible.
> 
> When passing LOOKUP_RCU, if the first call to vfs_walk_ancestors()
> failed with -ECHILD, the caller can restart the walk by calling
> vfs_walk_ancestors() again but without LOOKUP_RCU.


Given we want callers to handle -ECHILD and call vfs_walk_ancestors
again without LOOKUP_RCU, I think we should keep @path not changed
With LOOKUP_RCU==true, and only update it to the last ancestor 
when LOOKUP_RCU==false. 

With this behavior, landlock code will be like:


/* Assume we hold reference on “path”. 
 * With LOOKUP_RCU, path will not change, we don’t need 
 * extra reference on “path”.
 */
err = vfs_walk_ancestors(path, ll_cb, data, LOOKUP_RCU);
/* 
 * At this point, whether err is 0 or not, path is not 
 * changed.
 */

if (err == -ECHILD) {
	struct path walk_path = *path;

	/* reset any data changed by the walk */
	reset_data(data);

	/* get a reference on walk_path. */
	path_get(&walk_path);

	err = vfs_walk_ancestors(&walk_path, ll_cb, data, 0);
	/* Now, walk_path might be updated */

	/* Always release reference on walk_path */
	path_put(&walk_path);
}


BPF path iterator sode will look like:

static bool bpf_cb(const struct path *ancestor, void *data)
{
	return false;
}

struct path *bpf_iter_path_next(struct bpf_iter_path *it)
{
	struct bpf_iter_path_kern *kit = (void *)it;

	if (vfs_walk_ancestors(&kit->path, bpf_cb, NULL))
		return NULL;
	return &kit->path;
}


Does this sound reasonable to every body?

Thanks,
Song

Re: [PATCH v5 bpf-next 0/5] bpf path iterator

Posted by NeilBrown 3 months ago

On Tue, 08 Jul 2025, Song Liu wrote:
> Hi Christian, 
> 
> Thanks for your comments! 
> 
> > On Jul 7, 2025, at 4:17 AM, Christian Brauner <brauner@kernel.org> wrote:
> 
> [...]
> 
> >>> 3/ Extend vfs_walk_ancestors() to pass a "may sleep" flag to the callback.
> >> 
> >> I think that's fine.
> > 
> > Ok, sorry for the delay but there's a lot of different things going on
> > right now and this one isn't exactly an easy thing to solve.
> > 
> > I mentioned this before and so did Neil: the lookup implementation
> > supports two modes sleeping and non-sleeping. That api is abstracted
> > away as heavily as possible by the VFS so that non-core code will not be
> > exposed to it other than in exceptional circumstances and doesn't have
> > to care about it.
> > 
> > It is a conceptual dead-end to expose these two modes via separate APIs
> > and leak this implementation detail into non-core code. It will not
> > happen as far as I'm concerned.
> > 
> > I very much understand the urge to get the refcount step-by-step thing
> > merged asap. Everyone wants their APIs merged fast. And if it's
> > reasonable to move fast we will (see the kernfs xattr thing).
> > 
> > But here are two use-cases that ask for the same thing with different
> > constraints that closely mirror our unified approach. Merging one
> > quickly just to have something and then later bolting the other one on
> > top, augmenting, or replacing, possible having to deprecate the old API
> > is just objectively nuts. That's how we end up with a spaghetthi helper
> > collection. We want as little helper fragmentation as possible.
> > 
> > We need a unified API that serves both use-cases. I dislike
> > callback-based APIs generally but we have precedent in the VFS for this
> > for cases where the internal state handling is delicate enough that it
> > should not be exposed (see __iterate_supers() which does exactly work
> > like Neil suggested down to the flag argument itself I added).
> > 
> > So I'm open to the callback solution.
> > 
> > (Note for really absurd perf requirements you could even make it work
> > with static calls I'm pretty sure.)
> 
> I guess we will go with Mickaël’s idea:
> 
> > int vfs_walk_ancestors(struct path *path,
> >                       bool (*walk_cb)(const struct path *ancestor, void *data),
> >                       void *data, int flags)
> > 
> > The walk continue while walk_cb() returns true.  walk_cb() can then
> > check if @ancestor is equal to a @root, or other properties.  The
> > walk_cb() return value (if not bool) should not be returned by
> > vfs_walk_ancestors() because a walk stop doesn't mean an error.
> 
> If necessary, we hide “root" inside @data. This is good. 
> 
> > @path would be updated with latest ancestor path (e.g. @root).
> 
> Update @path to the last ancestor and hold proper references. 
> I missed this part earlier. With this feature, vfs_walk_ancestors 
> should work usable with open-codeed bpf path iterator. 

I don't think path should be updated.  That adds complexity which might
not be needed.  The original (landlock) requirements were only to look
at each ancestor, not to take a reference to any of them.

If the caller needs a reference to any of the ancestors I think that
walk_cb() needs to take that reference and store it in data.
Note that attempting to take the reference might fail.  See
legitimize_path() in fs/namei.c.

It isn't yet clear to me what would be a good API for requesting the
reference.
One option would be for vfs_walk_ancestors() to pass another void* to
walk_cb(), and it passed it on to vfs_legitimize_path() which extracts
the seq numbers from there.
Another might be that the path passed to walk_cb is always
nameidata.path, and so when that is passed to vfs_legitimize_path() path
it can use container_of() to find the seq numbers.

If vfs_legitimize_path() fail, walk_cb() might want to ask for the walk
to be restarted.

> 
> I have a question about this behavior with RCU walk. IIUC, RCU 
> walk does not hold reference to @ancestor when calling walk_cb().
> If walk_cb() returns false, shall vfs_walk_ancestors() then
> grab a reference on @ancestor? This feels a bit weird to me. 
> Maybe “updating @path to the last ancestor” should only apply to
> LOOKUP_RCU==false case? 
> 
> > @flags could contain LOOKUP_RCU or not, which enables us to have
> > walk_cb() not-RCU compatible.
> > 
> > When passing LOOKUP_RCU, if the first call to vfs_walk_ancestors()
> > failed with -ECHILD, the caller can restart the walk by calling
> > vfs_walk_ancestors() again but without LOOKUP_RCU.
> 
> 
> Given we want callers to handle -ECHILD and call vfs_walk_ancestors
> again without LOOKUP_RCU, I think we should keep @path not changed
> With LOOKUP_RCU==true, and only update it to the last ancestor 
> when LOOKUP_RCU==false. 

No, we really don't want to pass a LOOKUP_RCU() flag to
vfs_walk_ancestors().
vfs_walk_ancestors() might choose to pass that flag to walk_cb().

NeilBrown

Re: [PATCH v5 bpf-next 0/5] bpf path iterator

Posted by Song Liu 3 months ago


> On Jul 9, 2025, at 3:14 PM, NeilBrown <neil@brown.name> wrote:
> 
> On Tue, 08 Jul 2025, Song Liu wrote:
>> Hi Christian, 
>> 
>> Thanks for your comments! 
>> 
>>> On Jul 7, 2025, at 4:17 AM, Christian Brauner <brauner@kernel.org> wrote:
>> 
>> [...]
>> 
>>>>> 3/ Extend vfs_walk_ancestors() to pass a "may sleep" flag to the callback.
>>>> 
>>>> I think that's fine.
>>> 
>>> Ok, sorry for the delay but there's a lot of different things going on
>>> right now and this one isn't exactly an easy thing to solve.
>>> 
>>> I mentioned this before and so did Neil: the lookup implementation
>>> supports two modes sleeping and non-sleeping. That api is abstracted
>>> away as heavily as possible by the VFS so that non-core code will not be
>>> exposed to it other than in exceptional circumstances and doesn't have
>>> to care about it.
>>> 
>>> It is a conceptual dead-end to expose these two modes via separate APIs
>>> and leak this implementation detail into non-core code. It will not
>>> happen as far as I'm concerned.
>>> 
>>> I very much understand the urge to get the refcount step-by-step thing
>>> merged asap. Everyone wants their APIs merged fast. And if it's
>>> reasonable to move fast we will (see the kernfs xattr thing).
>>> 
>>> But here are two use-cases that ask for the same thing with different
>>> constraints that closely mirror our unified approach. Merging one
>>> quickly just to have something and then later bolting the other one on
>>> top, augmenting, or replacing, possible having to deprecate the old API
>>> is just objectively nuts. That's how we end up with a spaghetthi helper
>>> collection. We want as little helper fragmentation as possible.
>>> 
>>> We need a unified API that serves both use-cases. I dislike
>>> callback-based APIs generally but we have precedent in the VFS for this
>>> for cases where the internal state handling is delicate enough that it
>>> should not be exposed (see __iterate_supers() which does exactly work
>>> like Neil suggested down to the flag argument itself I added).
>>> 
>>> So I'm open to the callback solution.
>>> 
>>> (Note for really absurd perf requirements you could even make it work
>>> with static calls I'm pretty sure.)
>> 
>> I guess we will go with Mickaël’s idea:
>> 
>>> int vfs_walk_ancestors(struct path *path,
>>>                      bool (*walk_cb)(const struct path *ancestor, void *data),
>>>                      void *data, int flags)
>>> 
>>> The walk continue while walk_cb() returns true.  walk_cb() can then
>>> check if @ancestor is equal to a @root, or other properties.  The
>>> walk_cb() return value (if not bool) should not be returned by
>>> vfs_walk_ancestors() because a walk stop doesn't mean an error.
>> 
>> If necessary, we hide “root" inside @data. This is good. 
>> 
>>> @path would be updated with latest ancestor path (e.g. @root).
>> 
>> Update @path to the last ancestor and hold proper references. 
>> I missed this part earlier. With this feature, vfs_walk_ancestors 
>> should work usable with open-codeed bpf path iterator.
> 
> I don't think path should be updated.  That adds complexity which might
> not be needed.  The original (landlock) requirements were only to look
> at each ancestor, not to take a reference to any of them.

I think this is the ideal case that landlock wants in the long term. 
But we may need to take references when the attempt fails. Also, 
current landlock code takes reference at each step. 

> If the caller needs a reference to any of the ancestors I think that
> walk_cb() needs to take that reference and store it in data.
> Note that attempting to take the reference might fail.  See
> legitimize_path() in fs/namei.c.
> 
> It isn't yet clear to me what would be a good API for requesting the
> reference.
> One option would be for vfs_walk_ancestors() to pass another void* to
> walk_cb(), and it passed it on to vfs_legitimize_path() which extracts
> the seq numbers from there.
> Another might be that the path passed to walk_cb is always
> nameidata.path, and so when that is passed to vfs_legitimize_path() path
> it can use container_of() to find the seq numbers.

Letting walk_cb() call vfs_legitimize_path() seems suboptimal to me. 
I think the original goal is to have vfs_walk_ancestors() to:
  1. Try to walk the ancestors without taking any references;
  2. Detect when the not-taking-reference walk is not reliable;
  3. Maybe, retry the walk from beginning, but takes references on 
     each step. 

With walk_cb() calling vfs_legitimize_path(), we are moving #2 above 
to walk_cb(). I think this is not what we want? 

> If vfs_legitimize_path() fail, walk_cb() might want to ask for the walk
> to be restarted.
> 
>> 
>> I have a question about this behavior with RCU walk. IIUC, RCU 
>> walk does not hold reference to @ancestor when calling walk_cb().
>> If walk_cb() returns false, shall vfs_walk_ancestors() then
>> grab a reference on @ancestor? This feels a bit weird to me. 
>> Maybe “updating @path to the last ancestor” should only apply to
>> LOOKUP_RCU==false case? 
>> 
>>> @flags could contain LOOKUP_RCU or not, which enables us to have
>>> walk_cb() not-RCU compatible.
>>> 
>>> When passing LOOKUP_RCU, if the first call to vfs_walk_ancestors()
>>> failed with -ECHILD, the caller can restart the walk by calling
>>> vfs_walk_ancestors() again but without LOOKUP_RCU.
>> 
>> 
>> Given we want callers to handle -ECHILD and call vfs_walk_ancestors
>> again without LOOKUP_RCU, I think we should keep @path not changed
>> With LOOKUP_RCU==true, and only update it to the last ancestor 
>> when LOOKUP_RCU==false.
> 
> No, we really don't want to pass a LOOKUP_RCU() flag to
> vfs_walk_ancestors().
> vfs_walk_ancestors() might choose to pass that flag to walk_cb().

In this case, we will need vfs_walk_ancestors to handle #3 above. 

Thanks,
Song

Re: [PATCH v5 bpf-next 0/5] bpf path iterator

Posted by NeilBrown 3 months ago

On Thu, 10 Jul 2025, Song Liu wrote:
> 
> 
> > On Jul 9, 2025, at 3:14 PM, NeilBrown <neil@brown.name> wrote:
> > 
> > On Tue, 08 Jul 2025, Song Liu wrote:
> >> Hi Christian, 
> >> 
> >> Thanks for your comments! 
> >> 
> >>> On Jul 7, 2025, at 4:17 AM, Christian Brauner <brauner@kernel.org> wrote:
> >> 
> >> [...]
> >> 
> >>>>> 3/ Extend vfs_walk_ancestors() to pass a "may sleep" flag to the callback.
> >>>> 
> >>>> I think that's fine.
> >>> 
> >>> Ok, sorry for the delay but there's a lot of different things going on
> >>> right now and this one isn't exactly an easy thing to solve.
> >>> 
> >>> I mentioned this before and so did Neil: the lookup implementation
> >>> supports two modes sleeping and non-sleeping. That api is abstracted
> >>> away as heavily as possible by the VFS so that non-core code will not be
> >>> exposed to it other than in exceptional circumstances and doesn't have
> >>> to care about it.
> >>> 
> >>> It is a conceptual dead-end to expose these two modes via separate APIs
> >>> and leak this implementation detail into non-core code. It will not
> >>> happen as far as I'm concerned.
> >>> 
> >>> I very much understand the urge to get the refcount step-by-step thing
> >>> merged asap. Everyone wants their APIs merged fast. And if it's
> >>> reasonable to move fast we will (see the kernfs xattr thing).
> >>> 
> >>> But here are two use-cases that ask for the same thing with different
> >>> constraints that closely mirror our unified approach. Merging one
> >>> quickly just to have something and then later bolting the other one on
> >>> top, augmenting, or replacing, possible having to deprecate the old API
> >>> is just objectively nuts. That's how we end up with a spaghetthi helper
> >>> collection. We want as little helper fragmentation as possible.
> >>> 
> >>> We need a unified API that serves both use-cases. I dislike
> >>> callback-based APIs generally but we have precedent in the VFS for this
> >>> for cases where the internal state handling is delicate enough that it
> >>> should not be exposed (see __iterate_supers() which does exactly work
> >>> like Neil suggested down to the flag argument itself I added).
> >>> 
> >>> So I'm open to the callback solution.
> >>> 
> >>> (Note for really absurd perf requirements you could even make it work
> >>> with static calls I'm pretty sure.)
> >> 
> >> I guess we will go with Mickaël’s idea:
> >> 
> >>> int vfs_walk_ancestors(struct path *path,
> >>>                      bool (*walk_cb)(const struct path *ancestor, void *data),
> >>>                      void *data, int flags)
> >>> 
> >>> The walk continue while walk_cb() returns true.  walk_cb() can then
> >>> check if @ancestor is equal to a @root, or other properties.  The
> >>> walk_cb() return value (if not bool) should not be returned by
> >>> vfs_walk_ancestors() because a walk stop doesn't mean an error.
> >> 
> >> If necessary, we hide “root" inside @data. This is good. 
> >> 
> >>> @path would be updated with latest ancestor path (e.g. @root).
> >> 
> >> Update @path to the last ancestor and hold proper references. 
> >> I missed this part earlier. With this feature, vfs_walk_ancestors 
> >> should work usable with open-codeed bpf path iterator.
> > 
> > I don't think path should be updated.  That adds complexity which might
> > not be needed.  The original (landlock) requirements were only to look
> > at each ancestor, not to take a reference to any of them.
> 
> I think this is the ideal case that landlock wants in the long term. 
> But we may need to take references when the attempt fails. Also, 
> current landlock code takes reference at each step. 

Why may be need to?
Yes, current landlock code takes references, but I don't think that is
because it needs references, only because the API requires it to take
references. 

> 
> > If the caller needs a reference to any of the ancestors I think that
> > walk_cb() needs to take that reference and store it in data.
> > Note that attempting to take the reference might fail.  See
> > legitimize_path() in fs/namei.c.
> > 
> > It isn't yet clear to me what would be a good API for requesting the
> > reference.
> > One option would be for vfs_walk_ancestors() to pass another void* to
> > walk_cb(), and it passed it on to vfs_legitimize_path() which extracts
> > the seq numbers from there.
> > Another might be that the path passed to walk_cb is always
> > nameidata.path, and so when that is passed to vfs_legitimize_path() path
> > it can use container_of() to find the seq numbers.
> 
> Letting walk_cb() call vfs_legitimize_path() seems suboptimal to me. 
> I think the original goal is to have vfs_walk_ancestors() to:
>   1. Try to walk the ancestors without taking any references;
>   2. Detect when the not-taking-reference walk is not reliable;
>   3. Maybe, retry the walk from beginning, but takes references on 
>      each step. 
> 
> With walk_cb() calling vfs_legitimize_path(), we are moving #2 above 
> to walk_cb(). I think this is not what we want? 

I think you are looking at this the wrong way around.  Focus on the
needs for the caller, not on how you think it might be implemented.

If the caller needs a reference, there should be a way for it to get a
reference.  This is quite separate from the choices vfs_walk_ancestors()
makes about how it is going to walk the list of dentries.

NeilBrown

Re: [PATCH v5 bpf-next 0/5] bpf path iterator

Posted by Mickaël Salaün 3 months ago

On Mon, Jul 07, 2025 at 06:50:12PM +0000, Song Liu wrote:
> Hi Christian, 
> 
> Thanks for your comments! 
> 
> > On Jul 7, 2025, at 4:17 AM, Christian Brauner <brauner@kernel.org> wrote:
> 
> [...]
> 
> >>> 3/ Extend vfs_walk_ancestors() to pass a "may sleep" flag to the callback.
> >> 
> >> I think that's fine.
> > 
> > Ok, sorry for the delay but there's a lot of different things going on
> > right now and this one isn't exactly an easy thing to solve.
> > 
> > I mentioned this before and so did Neil: the lookup implementation
> > supports two modes sleeping and non-sleeping. That api is abstracted
> > away as heavily as possible by the VFS so that non-core code will not be
> > exposed to it other than in exceptional circumstances and doesn't have
> > to care about it.
> > 
> > It is a conceptual dead-end to expose these two modes via separate APIs
> > and leak this implementation detail into non-core code. It will not
> > happen as far as I'm concerned.
> > 
> > I very much understand the urge to get the refcount step-by-step thing
> > merged asap. Everyone wants their APIs merged fast. And if it's
> > reasonable to move fast we will (see the kernfs xattr thing).
> > 
> > But here are two use-cases that ask for the same thing with different
> > constraints that closely mirror our unified approach. Merging one
> > quickly just to have something and then later bolting the other one on
> > top, augmenting, or replacing, possible having to deprecate the old API
> > is just objectively nuts. That's how we end up with a spaghetthi helper
> > collection. We want as little helper fragmentation as possible.
> > 
> > We need a unified API that serves both use-cases. I dislike
> > callback-based APIs generally but we have precedent in the VFS for this
> > for cases where the internal state handling is delicate enough that it
> > should not be exposed (see __iterate_supers() which does exactly work
> > like Neil suggested down to the flag argument itself I added).
> > 
> > So I'm open to the callback solution.
> > 
> > (Note for really absurd perf requirements you could even make it work
> > with static calls I'm pretty sure.)
> 
> I guess we will go with Mickaël’s idea:
> 
> > int vfs_walk_ancestors(struct path *path,
> >                       bool (*walk_cb)(const struct path *ancestor, void *data),

> >                       void *data, int flags)
> > 
> > The walk continue while walk_cb() returns true.  walk_cb() can then
> > check if @ancestor is equal to a @root, or other properties.  The
> > walk_cb() return value (if not bool) should not be returned by
> > vfs_walk_ancestors() because a walk stop doesn't mean an error.
> 
> If necessary, we hide “root" inside @data. This is good. 
> 
> > @path would be updated with latest ancestor path (e.g. @root).
> 
> Update @path to the last ancestor and hold proper references. 
> I missed this part earlier. With this feature, vfs_walk_ancestors 
> should work usable with open-codeed bpf path iterator. 
> 
> I have a question about this behavior with RCU walk. IIUC, RCU 
> walk does not hold reference to @ancestor when calling walk_cb().

I think a reference to the mount should be held, but not necessarily to
the dentry if we are still in the same mount as the original path.

> If walk_cb() returns false, shall vfs_walk_ancestors() then
> grab a reference on @ancestor? This feels a bit weird to me. 

If walk_cb() checks for a root, it will return false when the path will
match, and the caller would expect to get this root path, right?

In general, it's safer to always have the same behavior when holding or
releasing a reference.  I think the caller should then always call
path_put() after vfs_walk_ancestors() whatever the return code is.

> Maybe “updating @path to the last ancestor” should only apply to
> LOOKUP_RCU==false case? 
> 
> > @flags could contain LOOKUP_RCU or not, which enables us to have
> > walk_cb() not-RCU compatible.
> > 
> > When passing LOOKUP_RCU, if the first call to vfs_walk_ancestors()
> > failed with -ECHILD, the caller can restart the walk by calling
> > vfs_walk_ancestors() again but without LOOKUP_RCU.
> 
> 
> Given we want callers to handle -ECHILD and call vfs_walk_ancestors
> again without LOOKUP_RCU, I think we should keep @path not changed
> With LOOKUP_RCU==true, and only update it to the last ancestor 
> when LOOKUP_RCU==false. 

As Neil said, we don't want to explicitly pass LOOKUP_RCU as a public
flag.  Instead, walk_cb() should never sleep (and then potentially be
called under RCU by the vfs_walk_ancestors() implementation).

> 
> With this behavior, landlock code will be like:
> 
> 
> /* Assume we hold reference on “path”. 
>  * With LOOKUP_RCU, path will not change, we don’t need 
>  * extra reference on “path”.
>  */
> err = vfs_walk_ancestors(path, ll_cb, data, LOOKUP_RCU);
> /* 
>  * At this point, whether err is 0 or not, path is not 
>  * changed.
>  */
> 
> if (err == -ECHILD) {
> 	struct path walk_path = *path;
> 
> 	/* reset any data changed by the walk */
> 	reset_data(data);
> 
> 	/* get a reference on walk_path. */
> 	path_get(&walk_path);
> 
> 	err = vfs_walk_ancestors(&walk_path, ll_cb, data, 0);
> 	/* Now, walk_path might be updated */
> 
> 	/* Always release reference on walk_path */
> 	path_put(&walk_path);
> }
> 
> 
> BPF path iterator sode will look like:
> 
> static bool bpf_cb(const struct path *ancestor, void *data)
> {
> 	return false;
> }

Instead of this callback, we could just always return if walk_cb is
NULL.

> 
> struct path *bpf_iter_path_next(struct bpf_iter_path *it)
> {
> 	struct bpf_iter_path_kern *kit = (void *)it;
> 
> 	if (vfs_walk_ancestors(&kit->path, bpf_cb, NULL))
> 		return NULL;
> 	return &kit->path;
> }
> 
> 
> Does this sound reasonable to every body?
> 
> Thanks,
> Song
>

Re: [PATCH v5 bpf-next 0/5] bpf path iterator

Posted by Song Liu 3 months ago


> On Jul 9, 2025, at 9:06 AM, Mickaël Salaün <mic@digikod.net> wrote:\

[...]

>> If necessary, we hide “root" inside @data. This is good. 
>> 
>>> @path would be updated with latest ancestor path (e.g. @root).
>> 
>> Update @path to the last ancestor and hold proper references. 
>> I missed this part earlier. With this feature, vfs_walk_ancestors 
>> should work usable with open-codeed bpf path iterator. 
>> 
>> I have a question about this behavior with RCU walk. IIUC, RCU 
>> walk does not hold reference to @ancestor when calling walk_cb().
> 
> I think a reference to the mount should be held, but not necessarily to
> the dentry if we are still in the same mount as the original path.

If we update @path and do path_put() after the walk, we have to hold 
reference to both the mnt and the dentry, no? 

> 
>> If walk_cb() returns false, shall vfs_walk_ancestors() then
>> grab a reference on @ancestor? This feels a bit weird to me.
> 
> If walk_cb() checks for a root, it will return false when the path will
> match, and the caller would expect to get this root path, right?

If the user want to walk to the global root, walk_cb() may not 
return false at all, IIUC. walk_cb() may also return false on 
other conditions. 

> 
> In general, it's safer to always have the same behavior when holding or
> releasing a reference.  I think the caller should then always call
> path_put() after vfs_walk_ancestors() whatever the return code is.
> 
>> Maybe “updating @path to the last ancestor” should only apply to
>> LOOKUP_RCU==false case? 
>> 
>>> @flags could contain LOOKUP_RCU or not, which enables us to have
>>> walk_cb() not-RCU compatible.
>>> 
>>> When passing LOOKUP_RCU, if the first call to vfs_walk_ancestors()
>>> failed with -ECHILD, the caller can restart the walk by calling
>>> vfs_walk_ancestors() again but without LOOKUP_RCU.
>> 
>> 
>> Given we want callers to handle -ECHILD and call vfs_walk_ancestors
>> again without LOOKUP_RCU, I think we should keep @path not changed
>> With LOOKUP_RCU==true, and only update it to the last ancestor 
>> when LOOKUP_RCU==false.
> 
> As Neil said, we don't want to explicitly pass LOOKUP_RCU as a public
> flag.  Instead, walk_cb() should never sleep (and then potentially be
> called under RCU by the vfs_walk_ancestors() implementation).

How should the user handle -ECHILD without LOOKUP_RCU flag? Say the
following code in landlocked:

/* Try RCU walk first */
err = vfs_walk_ancestors(path, ll_cb, data, LOOKUP_RCU);

if (err == -ECHILD) {
	struct path walk_path = *path;

	/* reset any data changed by the walk */
	reset_data(data);
	
	/* now do ref-walk */
	err = vfs_walk_ancestors(&walk_path, ll_cb, data, 0);
}

Or do you mean vfs_walk_ancestors will never return -ECHILD?
Then we need vfs_walk_ancestors to call reset_data logic, right?

Thanks,
Song

Re: [PATCH v5 bpf-next 0/5] bpf path iterator

Posted by NeilBrown 3 months ago

On Thu, 10 Jul 2025, Song Liu wrote:
> 
> 
> > On Jul 9, 2025, at 9:06 AM, Mickaël Salaün <mic@digikod.net> wrote:\
> 
> [...]
> 
> >> If necessary, we hide “root" inside @data. This is good. 
> >> 
> >>> @path would be updated with latest ancestor path (e.g. @root).
> >> 
> >> Update @path to the last ancestor and hold proper references. 
> >> I missed this part earlier. With this feature, vfs_walk_ancestors 
> >> should work usable with open-codeed bpf path iterator. 
> >> 
> >> I have a question about this behavior with RCU walk. IIUC, RCU 
> >> walk does not hold reference to @ancestor when calling walk_cb().
> > 
> > I think a reference to the mount should be held, but not necessarily to
> > the dentry if we are still in the same mount as the original path.
> 
> If we update @path and do path_put() after the walk, we have to hold 
> reference to both the mnt and the dentry, no? 
> 
> > 
> >> If walk_cb() returns false, shall vfs_walk_ancestors() then
> >> grab a reference on @ancestor? This feels a bit weird to me.
> > 
> > If walk_cb() checks for a root, it will return false when the path will
> > match, and the caller would expect to get this root path, right?
> 
> If the user want to walk to the global root, walk_cb() may not 
> return false at all, IIUC. walk_cb() may also return false on 
> other conditions. 
> 
> > 
> > In general, it's safer to always have the same behavior when holding or
> > releasing a reference.  I think the caller should then always call
> > path_put() after vfs_walk_ancestors() whatever the return code is.
> > 
> >> Maybe “updating @path to the last ancestor” should only apply to
> >> LOOKUP_RCU==false case? 
> >> 
> >>> @flags could contain LOOKUP_RCU or not, which enables us to have
> >>> walk_cb() not-RCU compatible.
> >>> 
> >>> When passing LOOKUP_RCU, if the first call to vfs_walk_ancestors()
> >>> failed with -ECHILD, the caller can restart the walk by calling
> >>> vfs_walk_ancestors() again but without LOOKUP_RCU.
> >> 
> >> 
> >> Given we want callers to handle -ECHILD and call vfs_walk_ancestors
> >> again without LOOKUP_RCU, I think we should keep @path not changed
> >> With LOOKUP_RCU==true, and only update it to the last ancestor 
> >> when LOOKUP_RCU==false.
> > 
> > As Neil said, we don't want to explicitly pass LOOKUP_RCU as a public
> > flag.  Instead, walk_cb() should never sleep (and then potentially be
> > called under RCU by the vfs_walk_ancestors() implementation).
> 
> How should the user handle -ECHILD without LOOKUP_RCU flag? Say the
> following code in landlocked:
> 
> /* Try RCU walk first */
> err = vfs_walk_ancestors(path, ll_cb, data, LOOKUP_RCU);
> 
> if (err == -ECHILD) {
> 	struct path walk_path = *path;
> 
> 	/* reset any data changed by the walk */
> 	reset_data(data);
> 	
> 	/* now do ref-walk */
> 	err = vfs_walk_ancestors(&walk_path, ll_cb, data, 0);
> }
> 
> Or do you mean vfs_walk_ancestors will never return -ECHILD?
> Then we need vfs_walk_ancestors to call reset_data logic, right?

It isn't clear to me that vfs_walk_ancestors() needs to return anything.
All the communication happens through walk_cb()

walk_cb() is called with a path, the data, and a "may_sleep" flag.
If it needs to sleep but may_sleep is not set, it returns "-ECHILD"
which causes the walk to restart and use refcounts.
If it wants to stop, it returns 0.
If it wants to continue, it returns 1.
If it wants a reference to the path then it can use (new)
vfs_legitimize_path() which might fail.
If it wants a reference to the path and may_sleep is true, it can use
path_get() which won't fail.

When returning -ECHILD (either because of a need to sleep or because
vfs_legitimize_path() fails), walk_cb() would reset_data().

NeilBrown

Re: [PATCH v5 bpf-next 0/5] bpf path iterator

Posted by Song Liu 3 months ago


> On Jul 9, 2025, at 3:24 PM, NeilBrown <neil@brown.name> wrote:
[...]
>> 
>> How should the user handle -ECHILD without LOOKUP_RCU flag? Say the
>> following code in landlocked:
>> 
>> /* Try RCU walk first */
>> err = vfs_walk_ancestors(path, ll_cb, data, LOOKUP_RCU);
>> 
>> if (err == -ECHILD) {
>> struct path walk_path = *path;
>> 
>> /* reset any data changed by the walk */
>> reset_data(data);
>> 
>> /* now do ref-walk */
>> err = vfs_walk_ancestors(&walk_path, ll_cb, data, 0);
>> }
>> 
>> Or do you mean vfs_walk_ancestors will never return -ECHILD?
>> Then we need vfs_walk_ancestors to call reset_data logic, right?
> 
> It isn't clear to me that vfs_walk_ancestors() needs to return anything.
> All the communication happens through walk_cb()
> 
> walk_cb() is called with a path, the data, and a "may_sleep" flag.
> If it needs to sleep but may_sleep is not set, it returns "-ECHILD"
> which causes the walk to restart and use refcounts.
> If it wants to stop, it returns 0.
> If it wants to continue, it returns 1.
> If it wants a reference to the path then it can use (new)
> vfs_legitimize_path() which might fail.
> If it wants a reference to the path and may_sleep is true, it can use
> path_get() which won't fail.
> 
> When returning -ECHILD (either because of a need to sleep or because
> vfs_legitimize_path() fails), walk_cb() would reset_data().

This might actually work. 

My only concern is with vfs_legitimize_path. It is probably safer if 
we only allow taking references with may_sleep==true, so that path_get
won’t fail. In this case, we will not need walk_cb() to call 
vfs_legitimize_path. If the user want a reference, the walk_cb will 
first return -ECHILD, and call path_get when may_sleep is true. 

Does this make sense? Did I miss any cases? 

Thanks,
Song

Re: [PATCH v5 bpf-next 0/5] bpf path iterator

Posted by NeilBrown 3 months ago

On Thu, 10 Jul 2025, Song Liu wrote:
> 
> 
> > On Jul 9, 2025, at 3:24 PM, NeilBrown <neil@brown.name> wrote:
> [...]
> >> 
> >> How should the user handle -ECHILD without LOOKUP_RCU flag? Say the
> >> following code in landlocked:
> >> 
> >> /* Try RCU walk first */
> >> err = vfs_walk_ancestors(path, ll_cb, data, LOOKUP_RCU);
> >> 
> >> if (err == -ECHILD) {
> >> struct path walk_path = *path;
> >> 
> >> /* reset any data changed by the walk */
> >> reset_data(data);
> >> 
> >> /* now do ref-walk */
> >> err = vfs_walk_ancestors(&walk_path, ll_cb, data, 0);
> >> }
> >> 
> >> Or do you mean vfs_walk_ancestors will never return -ECHILD?
> >> Then we need vfs_walk_ancestors to call reset_data logic, right?
> > 
> > It isn't clear to me that vfs_walk_ancestors() needs to return anything.
> > All the communication happens through walk_cb()
> > 
> > walk_cb() is called with a path, the data, and a "may_sleep" flag.
> > If it needs to sleep but may_sleep is not set, it returns "-ECHILD"
> > which causes the walk to restart and use refcounts.
> > If it wants to stop, it returns 0.
> > If it wants to continue, it returns 1.
> > If it wants a reference to the path then it can use (new)
> > vfs_legitimize_path() which might fail.
> > If it wants a reference to the path and may_sleep is true, it can use
> > path_get() which won't fail.
> > 
> > When returning -ECHILD (either because of a need to sleep or because
> > vfs_legitimize_path() fails), walk_cb() would reset_data().
> 
> This might actually work. 
> 
> My only concern is with vfs_legitimize_path. It is probably safer if 
> we only allow taking references with may_sleep==true, so that path_get
> won’t fail. In this case, we will not need walk_cb() to call 
> vfs_legitimize_path. If the user want a reference, the walk_cb will 
> first return -ECHILD, and call path_get when may_sleep is true. 

What is your concern with vfs_legitimize_path() ??

I've since realised that always restarting in response to -ECHILD isn't
necessary and isn't how normal path-walk works.  Restarting might be
needed, but the first response to -ECHILD is to try legitimize_path().
If that succeeds, then it is safe to sleep.
So returning -ECHILD might just result in vfs_walk_ancestors() calling
legitimize_path() and then calling walk_cb() again.  Why not have
walk_cb() do the vfs_legitimize_path() call (which will almost always
succeed in practice).

NeilBrown


> 
> Does this make sense? Did I miss any cases? 
> 
> Thanks,
> Song
> 
>

Re: [PATCH v5 bpf-next 0/5] bpf path iterator

Posted by Song Liu 3 months ago

> On Jul 9, 2025, at 5:58 PM, NeilBrown <neil@brown.name> wrote:
> 
> On Thu, 10 Jul 2025, Song Liu wrote:
>> 
>> 
>>> On Jul 9, 2025, at 3:24 PM, NeilBrown <neil@brown.name> wrote:
>> [...]
>>>> 
>>>> How should the user handle -ECHILD without LOOKUP_RCU flag? Say the
>>>> following code in landlocked:
>>>> 
>>>> /* Try RCU walk first */
>>>> err = vfs_walk_ancestors(path, ll_cb, data, LOOKUP_RCU);
>>>> 
>>>> if (err == -ECHILD) {
>>>> struct path walk_path = *path;
>>>> 
>>>> /* reset any data changed by the walk */
>>>> reset_data(data);
>>>> 
>>>> /* now do ref-walk */
>>>> err = vfs_walk_ancestors(&walk_path, ll_cb, data, 0);
>>>> }
>>>> 
>>>> Or do you mean vfs_walk_ancestors will never return -ECHILD?
>>>> Then we need vfs_walk_ancestors to call reset_data logic, right?
>>> 
>>> It isn't clear to me that vfs_walk_ancestors() needs to return anything.
>>> All the communication happens through walk_cb()
>>> 
>>> walk_cb() is called with a path, the data, and a "may_sleep" flag.
>>> If it needs to sleep but may_sleep is not set, it returns "-ECHILD"
>>> which causes the walk to restart and use refcounts.
>>> If it wants to stop, it returns 0.
>>> If it wants to continue, it returns 1.
>>> If it wants a reference to the path then it can use (new)
>>> vfs_legitimize_path() which might fail.
>>> If it wants a reference to the path and may_sleep is true, it can use
>>> path_get() which won't fail.
>>> 
>>> When returning -ECHILD (either because of a need to sleep or because
>>> vfs_legitimize_path() fails), walk_cb() would reset_data().
>> 
>> This might actually work. 
>> 
>> My only concern is with vfs_legitimize_path. It is probably safer if 
>> we only allow taking references with may_sleep==true, so that path_get
>> won’t fail. In this case, we will not need walk_cb() to call 
>> vfs_legitimize_path. If the user want a reference, the walk_cb will 
>> first return -ECHILD, and call path_get when may_sleep is true.
> 
> What is your concern with vfs_legitimize_path() ??
> 
> I've since realised that always restarting in response to -ECHILD isn't
> necessary and isn't how normal path-walk works.  Restarting might be
> needed, but the first response to -ECHILD is to try legitimize_path().
> If that succeeds, then it is safe to sleep.
> So returning -ECHILD might just result in vfs_walk_ancestors() calling
> legitimize_path() and then calling walk_cb() again.  Why not have
> walk_cb() do the vfs_legitimize_path() call (which will almost always
> succeed in practice).

After reading the emails and the code more, I think I misunderstood 
why we need to call vfs_legitimize_path(). The goal of “legitimize” 
is to get a reference on @path, so a reference-less walk may not
need legitimize_path() at all. Do I get this right this time? 

However, I still have some concern with legitimize_path: it requires
m_seq and r_seq recorded at the beginning of the walk, do we want
to pass those to walk_cb()? IIUC, one of the reason we prefer a 
callback based solution is that it doesn’t expose nameidata (or a
subset of it). Letting walk_cb to call legitimize_path appears to 
defeat this benefit, no? 

A separate question below. 

I still have some question about how vfs_walk_ancestors() and the 
walk_cb() interact. Let’s look at the landlock use case: the user 
(landlock) just want to look at each ancestor, but doesn’t need to 
take any references. walk_cb() will check @path against @root, and 
return 0 when @path is the same as @root. 

IIUC, in this case, we will record m_seq and r_seq at the beginning
of vfs_walk_ancestors(), and check them against mount_lock and 
rename_lock at the end of the walk. (Maybe we also need to check 
them at some points before the end of the walk?) If either seq
changed during the walk, we need to restart the walk, and take
reference on each step. Did I get this right so far? 

If the above is right, here are my questions about the 
reference-less walk above: 

1. Which function (vfs_walk_ancestors or walk_cb) will check m_seq 
   and r_seq? I think vfs_walk_ancestors should check them. 
2. When either seq changes, which function will call reset_data?
   I think there are 3 options here:
  2.a: vfs_walk_ancestors calls reset_data, which will be another
       callback function the caller passes to vfs_walk_ancestors. 
  2.b: walk_cb will call reset_data(), but we need a mechanism to
       tell walk_cb to do it, maybe a “restart” flag?
  2.c: Caller of vfs_walk_ancestors will call reset_data(). In 
       this case, vfs_walk_ancestors will return -ECHILD to its
       caller. But I think this option is NACKed. 

I think the right solution is to have vfs_walk_ancestors check
m_seq and r_seq, and have walk_cb call reset_data. But this is
Different to the proposal above. 

Do my questions above make any sense? Or maybe I totally 
misunderstood something?

Thanks,
Song

Re: [PATCH v5 bpf-next 0/5] bpf path iterator

Posted by Song Liu 2 months, 3 weeks ago

> On Jul 9, 2025, at 11:28 PM, Song Liu <songliubraving@meta.com> wrote:

[...]

>>>> It isn't clear to me that vfs_walk_ancestors() needs to return anything.
>>>> All the communication happens through walk_cb()
>>>> 
>>>> walk_cb() is called with a path, the data, and a "may_sleep" flag.
>>>> If it needs to sleep but may_sleep is not set, it returns "-ECHILD"
>>>> which causes the walk to restart and use refcounts.
>>>> If it wants to stop, it returns 0.
>>>> If it wants to continue, it returns 1.
>>>> If it wants a reference to the path then it can use (new)
>>>> vfs_legitimize_path() which might fail.
>>>> If it wants a reference to the path and may_sleep is true, it can use
>>>> path_get() which won't fail.
>>>> 
>>>> When returning -ECHILD (either because of a need to sleep or because
>>>> vfs_legitimize_path() fails), walk_cb() would reset_data().
>>> 
>>> This might actually work. 
>>> 
>>> My only concern is with vfs_legitimize_path. It is probably safer if 
>>> we only allow taking references with may_sleep==true, so that path_get
>>> won’t fail. In this case, we will not need walk_cb() to call 
>>> vfs_legitimize_path. If the user want a reference, the walk_cb will 
>>> first return -ECHILD, and call path_get when may_sleep is true.
>> 
>> What is your concern with vfs_legitimize_path() ??
>> 
>> I've since realised that always restarting in response to -ECHILD isn't
>> necessary and isn't how normal path-walk works.  Restarting might be
>> needed, but the first response to -ECHILD is to try legitimize_path().
>> If that succeeds, then it is safe to sleep.
>> So returning -ECHILD might just result in vfs_walk_ancestors() calling
>> legitimize_path() and then calling walk_cb() again.  Why not have
>> walk_cb() do the vfs_legitimize_path() call (which will almost always
>> succeed in practice).
> 
> After reading the emails and the code more, I think I misunderstood 
> why we need to call vfs_legitimize_path(). The goal of “legitimize” 
> is to get a reference on @path, so a reference-less walk may not
> need legitimize_path() at all. Do I get this right this time? 
> 
> However, I still have some concern with legitimize_path: it requires
> m_seq and r_seq recorded at the beginning of the walk, do we want
> to pass those to walk_cb()? IIUC, one of the reason we prefer a 
> callback based solution is that it doesn’t expose nameidata (or a
> subset of it). Letting walk_cb to call legitimize_path appears to 
> defeat this benefit, no? 
> 
> 
> A separate question below. 
> 
> I still have some question about how vfs_walk_ancestors() and the 
> walk_cb() interact. Let’s look at the landlock use case: the user 
> (landlock) just want to look at each ancestor, but doesn’t need to 
> take any references. walk_cb() will check @path against @root, and 
> return 0 when @path is the same as @root. 
> 
> IIUC, in this case, we will record m_seq and r_seq at the beginning
> of vfs_walk_ancestors(), and check them against mount_lock and 
> rename_lock at the end of the walk. (Maybe we also need to check 
> them at some points before the end of the walk?) If either seq
> changed during the walk, we need to restart the walk, and take
> reference on each step. Did I get this right so far? 
> 
> If the above is right, here are my questions about the 
> reference-less walk above: 
> 
> 1. Which function (vfs_walk_ancestors or walk_cb) will check m_seq 
>   and r_seq? I think vfs_walk_ancestors should check them. 
> 2. When either seq changes, which function will call reset_data?
>   I think there are 3 options here:
>  2.a: vfs_walk_ancestors calls reset_data, which will be another
>       callback function the caller passes to vfs_walk_ancestors. 
>  2.b: walk_cb will call reset_data(), but we need a mechanism to
>       tell walk_cb to do it, maybe a “restart” flag?
>  2.c: Caller of vfs_walk_ancestors will call reset_data(). In 
>       this case, vfs_walk_ancestors will return -ECHILD to its
>       caller. But I think this option is NACKed. 
> 
> I think the right solution is to have vfs_walk_ancestors check
> m_seq and r_seq, and have walk_cb call reset_data. But this is
> Different to the proposal above. 
> 
> Do my questions above make any sense? Or maybe I totally 
> misunderstood something?

Hi Neil, 

Did my questions/comments above make sense? I am hoping we can 
agree on some design soon. 

Christian and Mickaël, 

Could you please also share your thoughts on this?

Current requirements from BPF side is straightforward: we just
need a mechanism to “walk up one level and hold reference”. So
most of the requirement comes from LandLock side. 

Thanks,
Song

Re: [PATCH v5 bpf-next 0/5] bpf path iterator

Posted by Mickaël Salaün 2 months, 2 weeks ago

On Mon, Jul 14, 2025 at 09:09:42PM +0000, Song Liu wrote:
> 
> > On Jul 9, 2025, at 11:28 PM, Song Liu <songliubraving@meta.com> wrote:
> 
> [...]
> 
> >>>> It isn't clear to me that vfs_walk_ancestors() needs to return anything.
> >>>> All the communication happens through walk_cb()
> >>>> 
> >>>> walk_cb() is called with a path, the data, and a "may_sleep" flag.
> >>>> If it needs to sleep but may_sleep is not set, it returns "-ECHILD"
> >>>> which causes the walk to restart and use refcounts.
> >>>> If it wants to stop, it returns 0.
> >>>> If it wants to continue, it returns 1.
> >>>> If it wants a reference to the path then it can use (new)
> >>>> vfs_legitimize_path() which might fail.
> >>>> If it wants a reference to the path and may_sleep is true, it can use
> >>>> path_get() which won't fail.
> >>>> 
> >>>> When returning -ECHILD (either because of a need to sleep or because
> >>>> vfs_legitimize_path() fails), walk_cb() would reset_data().
> >>> 
> >>> This might actually work. 
> >>> 
> >>> My only concern is with vfs_legitimize_path. It is probably safer if 
> >>> we only allow taking references with may_sleep==true, so that path_get
> >>> won’t fail. In this case, we will not need walk_cb() to call 
> >>> vfs_legitimize_path. If the user want a reference, the walk_cb will 
> >>> first return -ECHILD, and call path_get when may_sleep is true.
> >> 
> >> What is your concern with vfs_legitimize_path() ??
> >> 
> >> I've since realised that always restarting in response to -ECHILD isn't
> >> necessary and isn't how normal path-walk works.  Restarting might be
> >> needed, but the first response to -ECHILD is to try legitimize_path().
> >> If that succeeds, then it is safe to sleep.
> >> So returning -ECHILD might just result in vfs_walk_ancestors() calling
> >> legitimize_path() and then calling walk_cb() again.  Why not have
> >> walk_cb() do the vfs_legitimize_path() call (which will almost always
> >> succeed in practice).
> > 
> > After reading the emails and the code more, I think I misunderstood 
> > why we need to call vfs_legitimize_path(). The goal of “legitimize” 
> > is to get a reference on @path, so a reference-less walk may not
> > need legitimize_path() at all. Do I get this right this time? 
> > 
> > However, I still have some concern with legitimize_path: it requires
> > m_seq and r_seq recorded at the beginning of the walk, do we want
> > to pass those to walk_cb()? IIUC, one of the reason we prefer a 
> > callback based solution is that it doesn’t expose nameidata (or a
> > subset of it). Letting walk_cb to call legitimize_path appears to 
> > defeat this benefit, no? 

Yes, walk_cb() should be very light and non-blocking/non-sleepable.  If
the caller cannot give these guarantees, then it can just pass NULL
instead of a valid walk_cb(), and continue the walk (if needed) by
calling the vfs_walk_ancentors() helper again, which would not benefit
from the RCU optimization in this case.

Before this patch series land, handling of disconnected directories
should be well defined, or at least let the caller deal with it.  How do
you plan to handle disconnected directories for the eBPF use case?  See
https://lore.kernel.org/all/20250719104204.545188-1-mic@digikod.net/
Unfortunately, this issue is not solved for Landlock yet.

> > 
> > 
> > A separate question below. 
> > 
> > I still have some question about how vfs_walk_ancestors() and the 
> > walk_cb() interact. Let’s look at the landlock use case: the user 
> > (landlock) just want to look at each ancestor, but doesn’t need to 
> > take any references. walk_cb() will check @path against @root, and 
> > return 0 when @path is the same as @root. 
> > 
> > IIUC, in this case, we will record m_seq and r_seq at the beginning
> > of vfs_walk_ancestors(), and check them against mount_lock and 
> > rename_lock at the end of the walk. (Maybe we also need to check 
> > them at some points before the end of the walk?) If either seq
> > changed during the walk, we need to restart the walk, and take
> > reference on each step. Did I get this right so far? 

I think so.  You should get some inspiration from prepend_path().

> > 
> > If the above is right, here are my questions about the 
> > reference-less walk above: 
> > 
> > 1. Which function (vfs_walk_ancestors or walk_cb) will check m_seq 
> >   and r_seq? I think vfs_walk_ancestors should check them. 

Yes, walk_cb() should be as simple as possible: the simpler version
should just return a constant.

> > 2. When either seq changes, which function will call reset_data?
> >   I think there are 3 options here:
> >  2.a: vfs_walk_ancestors calls reset_data, which will be another
> >       callback function the caller passes to vfs_walk_ancestors. 
> >  2.b: walk_cb will call reset_data(), but we need a mechanism to
> >       tell walk_cb to do it, maybe a “restart” flag?
> >  2.c: Caller of vfs_walk_ancestors will call reset_data(). In 
> >       this case, vfs_walk_ancestors will return -ECHILD to its
> >       caller. But I think this option is NACKed. 
> > 
> > I think the right solution is to have vfs_walk_ancestors check
> > m_seq and r_seq, and have walk_cb call reset_data. But this is
> > Different to the proposal above. 

I'm not sure a reset_data() would be useful if walk_cb() never sleep.

If we really need such reset_data(), a fourth option would be for
walk_cb() to return a specific value (an enum instead of a bool) to
trigger the reset.

> > 
> > Do my questions above make any sense? Or maybe I totally 
> > misunderstood something?
> 
> Hi Neil, 
> 
> Did my questions/comments above make sense? I am hoping we can 
> agree on some design soon. 
> 
> Christian and Mickaël, 
> 
> Could you please also share your thoughts on this?
> 
> Current requirements from BPF side is straightforward: we just
> need a mechanism to “walk up one level and hold reference”. So
> most of the requirement comes from LandLock side. 

Have you thought about how to handle disconnected directories?

> 
> Thanks,
> Song
>

Re: [PATCH v5 bpf-next 0/5] bpf path iterator

Posted by Song Liu 2 months, 2 weeks ago


> On Jul 25, 2025, at 1:35 AM, Mickaël Salaün <mic@digikod.net> wrote:
[...]
>>> 
>>> Do my questions above make any sense? Or maybe I totally 
>>> misunderstood something?
>> 
>> Hi Neil, 
>> 
>> Did my questions/comments above make sense? I am hoping we can 
>> agree on some design soon. 
>> 
>> Christian and Mickaël, 
>> 
>> Could you please also share your thoughts on this?
>> 
>> Current requirements from BPF side is straightforward: we just
>> need a mechanism to “walk up one level and hold reference”. So
>> most of the requirement comes from LandLock side.
> 
> Have you thought about how to handle disconnected directories?

In the case of open-coded path iterator, the iterator will 
return a special value for disconnected roots and disconnected 
directories. Then the BPF program need to handle them based on 
the policy. 

Thanks,
Song

Re: [PATCH v5 bpf-next 0/5] bpf path iterator

Posted by Song Liu 3 months, 2 weeks ago


> On Jun 26, 2025, at 3:51 PM, NeilBrown <neil@brown.name> wrote:

[...]

>> Unfortunately, the BPF use case is more complicated. In some cases, 
>> the callback function cannot be call in rcu critical sections. For 
>> example, the callback may need to read xatter. For these cases, we
>> we cannot use RCU walk at all.
> 
> I really think you should stop using the terms RCU walk and ref-walk.  I
> think they might be focusing your thinking in an unhelpful direction.
> 
> The key issue about reading xattrs is that it might need to sleep.
> Focusing on what might need to sleep and what will never need to sleep
> is a useful approach - the distinction is wide spread in the kernel and
> several function take a flag indicating if they are permitted to sleep,
> or if failure when sleeping would be required.
> 
> So your above observation is better described as 
> 
>   The vfs_walk_ancestors() API has an (implicit) requirement that the
>   callback mustn't sleep.  This is a problem for some use-cases
>   where the call back might need to sleep - e.g. for accessing xattrs.
> 
> That is a good and useful observation.  I can see three possibly
> responses:
> 
> 1/ Add a vfs_walk_ancestors_maysleep() API for which the callback is
>   always allowed to sleep.  I don't particularly like this approach.
> 
> 2/ Use repeated calls to vfs_walk_parent() when the handling of each
>   ancestor might need to sleep.  I see no problem with supporting both
>   vfs_walk_ancestors() and vfs_walk_parent().  There is plenty of
>   precedent for having different  interfaces for different use cases.

I prefer option 2. 

> 
> 3/ Extend vfs_walk_ancestors() to pass a "may sleep" flag to the callback.
>   If the callback finds that it needs to sleep but that "may sleep"
>   isn't set, it returns some well known status, like -EWOULDBLOCK (or
>   -ECHILD).  It can expect to be called again but with "may sleep" set.
>   This is my preferred approach. There is precedent with the
>   d_revalidate callbacks which works like this.
>   I suspect that accessing xattrs might often be possible without
>   sleeping.  It is conceivable that we could add a "may sleep" argument
>   to vfs_getxattr() so that it could still often be used without
>   requiring vfs_walk_ancestors() to permit sleeping.
>   This would almost certainly require a clear demonstration that 
>   there was a performance cost in not having the option of non-sleeping
>   vfs_getxattr().

For built-in kernel code, I can see this works. However, I don’t see 
why it is necessary to introduce the extra complexity of -EWOULDBLOCK, 
and vfs_get_xattr_cannot_sleep, etc. A separate step-by-step walking
API is much cleaner.

> 
>>> I strongly suggest you stop thinking about rcu-walk vs ref-walk.  Think
>>> about the needs of your code.  If you need a high-performance API, then
>>> ask for a high-performance API, don't assume what form it will take or
>>> what the internal implementation details will be.
>> 
>> At the moment, we need a ref-walk API on the BPF side. The RCU walk
>> is a totally separate topic.
> 
> Do you mean "we need step-by-step walking" or do you mean "we need to
> potentially sleep for each ancestor"?  These are conceptually different
> requirements, but I cannot tell which you mean when you talk about "RCU
> walk”.

To be extra clear, I mean we need "step-by-step and 
take-reference-on-each-step walking”, for existing use cases. 

In the future, if it is possible to have a “do-not-take-reference, 
cannot-sleep, callback-based walking”. We may try to use that for 
some use cases. But that won’t replace step-by-step walking for 
all users. 

Thanks,
Song

Re: [PATCH v5 bpf-next 0/5] bpf path iterator

Posted by Mickaël Salaün 3 months, 2 weeks ago

On Thu, Jun 26, 2025 at 05:52:50AM +0000, Song Liu wrote:
> 
> 
> > On Jun 25, 2025, at 6:05 PM, NeilBrown <neil@brown.name> wrote:
> 
> [...]
> 
> >> 
> >> I can't speak for Mickaël, but a callback-based interface is less flexible
> >> (and _maybe_ less performant?).  Also, probably we will want to fallback
> >> to a reference-taking walk if the walk fails (rather than, say, retry
> >> infinitely), and this should probably use Song's proposed iterator.  I'm
> >> not sure if Song would be keen to rewrite this iterator patch series in
> >> callback style (to be clear, it doesn't necessarily seem like a good idea
> >> to me, and I'm not asking him to), which means that we will end up with
> >> the reference walk API being a "call this function repeatedly", and the
> >> rcu walk API taking a callback.  I think it is still workable (after all,
> >> if Landlock wants to reuse the code in the callback it can just call the
> >> callback function itself when doing the reference walk), but it seems a
> >> bit "ugly" to me.
> > 
> > call-back can have a performance impact (less opportunity for compiler
> > optimisation and CPU speculation), though less than taking spinlock and
> > references.  However Al and Christian have drawn a hard line against
> > making seq numbers visible outside VFS code so I think it is the
> > approach most likely to be accepted.
> > 
> > Certainly vfs_walk_ancestors() would fallback to ref-walk if rcu-walk
> > resulted in -ECHILD - just like all other path walking code in namei.c.
> > This would be largely transparent to the caller - the caller would only
> > see that the callback received a NULL path indicating a restart.  It
> > wouldn't need to know why.

Given the constraints this looks good to me.  Here is an updated API
with two extra consts, an updated walk_cb() signature, and a new
"flags" and without @root:

int vfs_walk_ancestors(struct path *path,
                       bool (*walk_cb)(const struct path *ancestor, void *data),
                       void *data, int flags)

The walk continue while walk_cb() returns true.  walk_cb() can then
check if @ancestor is equal to a @root, or other properties.  The
walk_cb() return value (if not bool) should not be returned by
vfs_walk_ancestors() because a walk stop doesn't mean an error.

@path would be updated with latest ancestor path (e.g. @root).
@flags could contain LOOKUP_RCU or not, which enables us to have
walk_cb() not-RCU compatible.

When passing LOOKUP_RCU, if the first call to vfs_walk_ancestors()
failed with -ECHILD, the caller can restart the walk by calling
vfs_walk_ancestors() again but without LOOKUP_RCU.

> 
> I guess I misunderstood the proposal of vfs_walk_ancestors() 
> initially, so some clarification:
> 
> I think vfs_walk_ancestors() is good for the rcu-walk, and some 
> rcu-then-ref-walk. However, I don’t think it fits all use cases. 
> A reliable step-by-step ref-walk, like this set, works well with 
> BPF, and we want to keep it. 

The above updated API should work for both use cases: if the caller wants
to walk only one level, walk_cb() can just always return false (and
potentially save that it was called) and then stop the walk the first
time it is called.  This makes it possible to write an eBPF helper with
the same API as path_walk_parent(), while making the kernel API more
flexible.

> 
> Can we ship this set as-is (or after fixing the comment reported
> by kernel test robot)? I really don’t think we need figure out 
> all details about the rcu-walk here. 
> 
> Once this is landed, we can try implementing the rcu-walk
> (vfs_walk_ancestors or some variation). If no one volunteers, I
> can give it a try. 

My understanding is that Christian only wants one helper (that should
handle both use cases).  I think this updated API should be good enough
for everyone.  Most of your code should stay the same.  What do you
think?

> 
> Thanks,
> Song
>

Re: [PATCH v5 bpf-next 0/5] bpf path iterator

Posted by Song Liu 3 months, 2 weeks ago


> On Jun 26, 2025, at 2:43 AM, Mickaël Salaün <mic@digikod.net> wrote:

[...]

> Given the constraints this looks good to me.  Here is an updated API
> with two extra consts, an updated walk_cb() signature, and a new
> "flags" and without @root:
> 
> int vfs_walk_ancestors(struct path *path,
>                       bool (*walk_cb)(const struct path *ancestor, void *data),
>                       void *data, int flags)
> 
> The walk continue while walk_cb() returns true.  walk_cb() can then
> check if @ancestor is equal to a @root, or other properties.  The
> walk_cb() return value (if not bool) should not be returned by
> vfs_walk_ancestors() because a walk stop doesn't mean an error.
> 
> @path would be updated with latest ancestor path (e.g. @root).
> @flags could contain LOOKUP_RCU or not, which enables us to have
> walk_cb() not-RCU compatible.
> 
> When passing LOOKUP_RCU, if the first call to vfs_walk_ancestors()
> failed with -ECHILD, the caller can restart the walk by calling
> vfs_walk_ancestors() again but without LOOKUP_RCU.

IIUC, Neil is against using LOOKUP_RCU as input, and VFS folks
may share same the concerns. 

>> 
>> I guess I misunderstood the proposal of vfs_walk_ancestors() 
>> initially, so some clarification:
>> 
>> I think vfs_walk_ancestors() is good for the rcu-walk, and some 
>> rcu-then-ref-walk. However, I don’t think it fits all use cases. 
>> A reliable step-by-step ref-walk, like this set, works well with 
>> BPF, and we want to keep it.
> 
> The above updated API should work for both use cases: if the caller wants
> to walk only one level, walk_cb() can just always return false (and
> potentially save that it was called) and then stop the walk the first
> time it is called.  This makes it possible to write an eBPF helper with
> the same API as path_walk_parent(), while making the kernel API more
> flexible.

I don’t think this will be the same. Current path_walk_parent() 
holds a reference on parent path, and returns control to the caller. 
However, when vfs_walk_ancestors() returns, it should not hold any
extra reference, right? IOW, vfs_walk_ancestors may hold some 
reference between callbacks, but is expected to release these 
references before finally returning to the caller.

> Can we ship this set as-is (or after fixing the comment reported
>> by kernel test robot)? I really don’t think we need figure out 
>> all details about the rcu-walk here. 
>> 
>> Once this is landed, we can try implementing the rcu-walk
>> (vfs_walk_ancestors or some variation). If no one volunteers, I
>> can give it a try.
> 
> My understanding is that Christian only wants one helper (that should
> handle both use cases).  I think this updated API should be good enough
> for everyone.  Most of your code should stay the same.  What do you
> think?

Christian, could you please clarify this requirement?

Given different expectations in how the references are handled (see 
above), I don’t think we can fit all use cases in one API. However, 
if we find such an API in the future, which works for all cases, we 
can refactor BPF side code to use that instead. Therefore, even we 
have a “one API only” requirement, it is not necessary to delay this
set for a RCU-walk API. 

Thanks,
Song

Re: [PATCH v5 bpf-next 0/5] bpf path iterator

Posted by Song Liu 3 months, 2 weeks ago

On Wed, Jun 25, 2025 at 4:05 PM NeilBrown <neil@brown.name> wrote:
>
> On Wed, 25 Jun 2025, Mickaël Salaün wrote:
> > On Wed, Jun 25, 2025 at 07:38:53AM +1000, NeilBrown wrote:
> > >
> > > Can you spell out the minimum that you need?
> >
> > Sure.  We'd like to call this new helper in a RCU
> > read-side critical section and leverage this capability to speed up path
> > walk when there is no concurrent hierarchy modification.  This use case
> > is similar to handle_dots() with LOOKUP_RCU calling follow_dotdot_rcu().
> >
> > The main issue with this approach is to keep some state of the path walk
> > to know if the next call to "path_walk_parent_rcu()" would be valid
> > (i.e. something like a very light version of nameidata, mainly sequence
> > integers), and to get back to the non-RCU version otherwise.
> >
> > >
> > > My vague impression is that you want to search up from a given strut path,
> > > no further then some other given path, looking for a dentry that matches
> > > some rule.  Is that correct?
> >
> > Yes
> >
> > >
> > > In general, the original dentry could be moved away from under the
> > > dentry you find moments after the match is reported.  What mechanisms do
> > > you have in place to ensure this doesn't happen, or that it doesn't
> > > matter?
> >
> > In the case of Landlock, by default, a set of access rights are denied
> > and can only be allowed by an element in the file hierarchy.  The goal
> > is to only allow access to files under a specific directory (or directly
> > a specific file).  That's why we only care of the file hierarchy at the
> > time of access check.  It's not an issue if the file/directory was
> > moved or is being moved as long as we can walk its "current" hierarchy.
> > Furthermore, a sandboxed process is restricted from doing arbitrary
> > mounts (and renames/links are controlled with the
> > LANDLOCK_ACCESS_FS_REFER right).
> >
> > However, we need to get a valid "snapshot" of the set of dentries that
> > (could) lead to the evaluated file/directory.
>
> A "snapshot" is an interesting idea - though looking at the landlock
> code you one need inodes, not dentries.
> I imagine an interface where you give it a starting path, a root, and
> and array of inode pointers, and it fills in the pointers with the path
> - all under rcu so no references are needed.
> But you would need some fallback if the array isn't big enough, so maybe
> that isn't a good idea.
>
> Based on the comments by Al and Christian, I think the only viable
> approach is to pass a callback to some vfs function that does the
> walking.
>
>    vfs_walk_ancestors(struct path *path, struct path *root,
>                       int (*walk_cb)(struct path *ancestor, void *data),
>                       void *data)

I like this idea.

Maybe we want "struct path *ancestor" of walk_cb to be const.
walk_cb should only change "data", so that we can undo all the
changes when the rcu walk fails.

Thanks,
Song