fs/bpf_fs_kfuncs.c | 72 +++++++++ fs/namei.c | 95 +++++++++--- include/linux/namei.h | 2 + kernel/bpf/verifier.c | 5 + security/landlock/fs.c | 30 +--- .../testing/selftests/bpf/bpf_experimental.h | 6 + .../selftests/bpf/prog_tests/path_iter.c | 111 ++++++++++++++ tools/testing/selftests/bpf/progs/path_iter.c | 145 ++++++++++++++++++ tools/testing/selftests/bpf/progs/path_walk.c | 59 +++++++ 9 files changed, 485 insertions(+), 40 deletions(-) create mode 100644 tools/testing/selftests/bpf/prog_tests/path_iter.c create mode 100644 tools/testing/selftests/bpf/progs/path_iter.c create mode 100644 tools/testing/selftests/bpf/progs/path_walk.c
In security use cases, it is common to apply rules to VFS subtrees. However, filtering files in a subtree is not straightforward [1]. One solution to this problem is to start from a path and walk up the VFS tree (towards the root). Among in-tree LSMs, Landlock uses this solution. BPF LSM solutions, such like Tetragon [2], also use similar approaches. However, due to lack of proper helper/kfunc support, BPF LSM solutions usually do the path walk with probe read, which is racy. This patchset introduces a new helper path_walk_parent, which walks path to its VFS parent. The helper is used in Landlock. A new BPF iterator, path iterator, is introduced to do the path walking. The BPF path iterator uses the new path_walk_parent help to walk the VFS tree. Changes v4 => v5: 1. Minor changes to path_walk_parent per suggestions by Neil Brown and Tingmao Wang. v4: https://lore.kernel.org/bpf/20250611220220.3681382-1-song@kernel.org/ Changes v3 => v4: 1. Rewrite path_walk_parent based on suggestion from Neil Brown. 2. When path_walk_parent() returns false, it call path_put on @path and then zeros @path. v3: https://lore.kernel.org/bpf/20250606213015.255134-1-song@kernel.org/ Changes v2 => v3: 1. Fix an issue with path_walk_parent. 2. Move bpf path iterator to fs/bpf_fs_kfuncs.c 3. Optimize bpf path iterator (less memory). 4. Add more selftests. 5. Add more comments. v2: https://lore.kernel.org/bpf/20250603065920.3404510-1-song@kernel.org/ Changes v1 => v2: 1. Rename path_parent => path_walk_parent. 2. Remove path_connected check in path_walk_parent. 3. Fix is_access_to_paths_allowed(). 4. Remove mode for path iterator, add a flag instead. v1: https://lore.kernel.org/bpf/20250528222623.1373000-1-song@kernel.org/ [1] https://lpc.events/event/18/contributions/1940/ [2] https://github.com/cilium/tetragon/ Song Liu (5): namei: Introduce new helper function path_walk_parent() landlock: Use path_walk_parent() bpf: Introduce path iterator selftests/bpf: Add tests for bpf path iterator selftests/bpf: Path walk test fs/bpf_fs_kfuncs.c | 72 +++++++++ fs/namei.c | 95 +++++++++--- include/linux/namei.h | 2 + kernel/bpf/verifier.c | 5 + security/landlock/fs.c | 30 +--- .../testing/selftests/bpf/bpf_experimental.h | 6 + .../selftests/bpf/prog_tests/path_iter.c | 111 ++++++++++++++ tools/testing/selftests/bpf/progs/path_iter.c | 145 ++++++++++++++++++ tools/testing/selftests/bpf/progs/path_walk.c | 59 +++++++ 9 files changed, 485 insertions(+), 40 deletions(-) create mode 100644 tools/testing/selftests/bpf/prog_tests/path_iter.c create mode 100644 tools/testing/selftests/bpf/progs/path_iter.c create mode 100644 tools/testing/selftests/bpf/progs/path_walk.c -- 2.47.1
Hi Christian, Mickaël, and folks, Could you please share your comments on this version? Does this look sane? Thanks, Song On Mon, Jun 16, 2025 at 11:11 PM Song Liu <song@kernel.org> wrote: > > In security use cases, it is common to apply rules to VFS subtrees. > However, filtering files in a subtree is not straightforward [1]. > > One solution to this problem is to start from a path and walk up the VFS > tree (towards the root). Among in-tree LSMs, Landlock uses this solution. > [...]
On Fri, Jun 20, 2025 at 02:59:17PM -0700, Song Liu wrote: > Hi Christian, Mickaël, and folks, > > Could you please share your comments on this version? Does this > look sane? This looks good to me but we need to know what is the acceptable next step to support RCU. If we can go with another _rcu helper, I'm good with the current approach, otherwise we need to figure out a way to leverage the current helper to make it compatible with callers being in a RCU read-side critical section while leveraging safe path walk (i.e. several calls to path_walk_parent). > > Thanks, > Song > > On Mon, Jun 16, 2025 at 11:11 PM Song Liu <song@kernel.org> wrote: > > > > In security use cases, it is common to apply rules to VFS subtrees. > > However, filtering files in a subtree is not straightforward [1]. > > > > One solution to this problem is to start from a path and walk up the VFS > > tree (towards the root). Among in-tree LSMs, Landlock uses this solution. > > > > [...] >
Hi Christian, On Tue, Jun 24, 2025 at 11:46 AM Mickaël Salaün <mic@digikod.net> wrote: > > On Fri, Jun 20, 2025 at 02:59:17PM -0700, Song Liu wrote: > > Hi Christian, Mickaël, and folks, > > > > Could you please share your comments on this version? Does this > > look sane? > > This looks good to me but we need to know what is the acceptable next > step to support RCU. If we can go with another _rcu helper, I'm good > with the current approach, otherwise we need to figure out a way to > leverage the current helper to make it compatible with callers being in > a RCU read-side critical section while leveraging safe path walk (i.e. > several calls to path_walk_parent). Could you please share your suggestions on this topic? RCU protected path walk out of fs/ seems controversial in multiple ways. Do we have to let this set wait indefinitely for a solution of RCU protected path walk? I would like to highlight that this set doesn't add any persistent APIs. path_walk_parent() is not in the UAPI, nor exported. If a newer and better API is created, we can refactor bpf and landlock code and deprecate path_walk_parent(). Thanks, Song
On Wed, 25 Jun 2025, Mickaël Salaün wrote: > On Fri, Jun 20, 2025 at 02:59:17PM -0700, Song Liu wrote: > > Hi Christian, Mickaël, and folks, > > > > Could you please share your comments on this version? Does this > > look sane? > > This looks good to me but we need to know what is the acceptable next > step to support RCU. If we can go with another _rcu helper, I'm good > with the current approach, otherwise we need to figure out a way to > leverage the current helper to make it compatible with callers being in > a RCU read-side critical section while leveraging safe path walk (i.e. > several calls to path_walk_parent). Can you spell out the minimum that you need? My vague impression is that you want to search up from a given strut path, no further then some other given path, looking for a dentry that matches some rule. Is that correct? In general, the original dentry could be moved away from under the dentry you find moments after the match is reported. What mechanisms do you have in place to ensure this doesn't happen, or that it doesn't matter? Would it be sufficient to have an iterator which reported successive ancestors in turn, or reported that you need to restart because something changed? Would you need to know that a restart happened or would it be acceptable to transparently start again at the parent of the starting point? Or do you really need a "one step at a time" interface? Do you need more complex movements around the tree, or is just walking up sufficient? If this has been discussed or documented elsewhere I'd be happy for you just to provide a reference, and I can come back with follow-up questions if needed. Thanks, NeilBrown > > > > > Thanks, > > Song > > > > On Mon, Jun 16, 2025 at 11:11 PM Song Liu <song@kernel.org> wrote: > > > > > > In security use cases, it is common to apply rules to VFS subtrees. > > > However, filtering files in a subtree is not straightforward [1]. > > > > > > One solution to this problem is to start from a path and walk up the VFS > > > tree (towards the root). Among in-tree LSMs, Landlock uses this solution. > > > > > > > [...] > > >
On Wed, Jun 25, 2025 at 07:38:53AM +1000, NeilBrown wrote: > On Wed, 25 Jun 2025, Mickaël Salaün wrote: > > On Fri, Jun 20, 2025 at 02:59:17PM -0700, Song Liu wrote: > > > Hi Christian, Mickaël, and folks, > > > > > > Could you please share your comments on this version? Does this > > > look sane? > > > > This looks good to me but we need to know what is the acceptable next > > step to support RCU. If we can go with another _rcu helper, I'm good > > with the current approach, otherwise we need to figure out a way to > > leverage the current helper to make it compatible with callers being in > > a RCU read-side critical section while leveraging safe path walk (i.e. > > several calls to path_walk_parent). > > Can you spell out the minimum that you need? Sure. We'd like to call this new helper in a RCU read-side critical section and leverage this capability to speed up path walk when there is no concurrent hierarchy modification. This use case is similar to handle_dots() with LOOKUP_RCU calling follow_dotdot_rcu(). The main issue with this approach is to keep some state of the path walk to know if the next call to "path_walk_parent_rcu()" would be valid (i.e. something like a very light version of nameidata, mainly sequence integers), and to get back to the non-RCU version otherwise. > > My vague impression is that you want to search up from a given strut path, > no further then some other given path, looking for a dentry that matches > some rule. Is that correct? Yes > > In general, the original dentry could be moved away from under the > dentry you find moments after the match is reported. What mechanisms do > you have in place to ensure this doesn't happen, or that it doesn't > matter? In the case of Landlock, by default, a set of access rights are denied and can only be allowed by an element in the file hierarchy. The goal is to only allow access to files under a specific directory (or directly a specific file). That's why we only care of the file hierarchy at the time of access check. It's not an issue if the file/directory was moved or is being moved as long as we can walk its "current" hierarchy. Furthermore, a sandboxed process is restricted from doing arbitrary mounts (and renames/links are controlled with the LANDLOCK_ACCESS_FS_REFER right). However, we need to get a valid "snapshot" of the set of dentries that (could) lead to the evaluated file/directory. > > Would it be sufficient to have an iterator which reported successive > ancestors in turn, or reported that you need to restart because something > changed? Would you need to know that a restart happened or would it be > acceptable to transparently start again at the parent of the starting > point? If the path walk is being invalidated, we need to reset the collected access right and start again the path walk to get all the access rights from a consistent/real file hierarchy. > > Or do you really need a "one step at a time" interface? We need to check each component of the path walk, so either we call an helper to get each of them and we do our check after that (we should be able to do that in RCU), or we provide a callback function which is called by the path walk helper. > > Do you need more complex movements around the tree, or is just walking > up sufficient? Just walking up. > > If this has been discussed or documented elsewhere I'd be happy for you > just to provide a reference, and I can come back with follow-up > questions if needed. Tingmao initially described the goal here: https://lore.kernel.org/all/afe77383-fe56-4029-848e-1401e3297139@maowtm.org/ and she sent an RFC to illustrate that: https://lore.kernel.org/all/cover.1748997840.git.m@maowtm.org/ The discussion mainly raised two questions: - Should we have one or two APIs? - How to store the state of the walk without exposing VFS internals to the rest of the kernel? Thanks > > Thanks, > NeilBrown > > > > > > > > > > Thanks, > > > Song > > > > > > On Mon, Jun 16, 2025 at 11:11 PM Song Liu <song@kernel.org> wrote: > > > > > > > > In security use cases, it is common to apply rules to VFS subtrees. > > > > However, filtering files in a subtree is not straightforward [1]. > > > > > > > > One solution to this problem is to start from a path and walk up the VFS > > > > tree (towards the root). Among in-tree LSMs, Landlock uses this solution. > > > > > > > > > > [...] > > > > > > >
On Wed, 25 Jun 2025, Mickaël Salaün wrote: > On Wed, Jun 25, 2025 at 07:38:53AM +1000, NeilBrown wrote: > > > > Can you spell out the minimum that you need? > > Sure. We'd like to call this new helper in a RCU > read-side critical section and leverage this capability to speed up path > walk when there is no concurrent hierarchy modification. This use case > is similar to handle_dots() with LOOKUP_RCU calling follow_dotdot_rcu(). > > The main issue with this approach is to keep some state of the path walk > to know if the next call to "path_walk_parent_rcu()" would be valid > (i.e. something like a very light version of nameidata, mainly sequence > integers), and to get back to the non-RCU version otherwise. > > > > > My vague impression is that you want to search up from a given strut path, > > no further then some other given path, looking for a dentry that matches > > some rule. Is that correct? > > Yes > > > > > In general, the original dentry could be moved away from under the > > dentry you find moments after the match is reported. What mechanisms do > > you have in place to ensure this doesn't happen, or that it doesn't > > matter? > > In the case of Landlock, by default, a set of access rights are denied > and can only be allowed by an element in the file hierarchy. The goal > is to only allow access to files under a specific directory (or directly > a specific file). That's why we only care of the file hierarchy at the > time of access check. It's not an issue if the file/directory was > moved or is being moved as long as we can walk its "current" hierarchy. > Furthermore, a sandboxed process is restricted from doing arbitrary > mounts (and renames/links are controlled with the > LANDLOCK_ACCESS_FS_REFER right). > > However, we need to get a valid "snapshot" of the set of dentries that > (could) lead to the evaluated file/directory. A "snapshot" is an interesting idea - though looking at the landlock code you one need inodes, not dentries. I imagine an interface where you give it a starting path, a root, and and array of inode pointers, and it fills in the pointers with the path - all under rcu so no references are needed. But you would need some fallback if the array isn't big enough, so maybe that isn't a good idea. Based on the comments by Al and Christian, I think the only viable approach is to pass a callback to some vfs function that does the walking. vfs_walk_ancestors(struct path *path, struct path *root, int (*walk_cb)(struct path *ancestor, void *data), void *data) where walk_cb() returns a negative number if it wants to abort, and is given a NULL ancestor if vfs_walk_ancestors() needed to restart. vfs_walk_ancestors() would initialise a "struct nameidata" and effectively call handle_dots(&nd, LAST_DOTDOT) repeatedly, calling walk_cb(&nd.path, data) each time. How would you feel about that sort of interface? NeilBrown
On 6/26/25 00:04, NeilBrown wrote: > On Wed, 25 Jun 2025, Mickaël Salaün wrote: >> On Wed, Jun 25, 2025 at 07:38:53AM +1000, NeilBrown wrote: >>> >>> Can you spell out the minimum that you need? >> >> Sure. We'd like to call this new helper in a RCU >> read-side critical section and leverage this capability to speed up path >> walk when there is no concurrent hierarchy modification. This use case >> is similar to handle_dots() with LOOKUP_RCU calling follow_dotdot_rcu(). >> >> The main issue with this approach is to keep some state of the path walk >> to know if the next call to "path_walk_parent_rcu()" would be valid >> (i.e. something like a very light version of nameidata, mainly sequence >> integers), and to get back to the non-RCU version otherwise. >> >>> >>> My vague impression is that you want to search up from a given strut path, >>> no further then some other given path, looking for a dentry that matches >>> some rule. Is that correct? >> >> Yes >> >>> >>> In general, the original dentry could be moved away from under the >>> dentry you find moments after the match is reported. What mechanisms do >>> you have in place to ensure this doesn't happen, or that it doesn't >>> matter? >> >> In the case of Landlock, by default, a set of access rights are denied >> and can only be allowed by an element in the file hierarchy. The goal >> is to only allow access to files under a specific directory (or directly >> a specific file). That's why we only care of the file hierarchy at the >> time of access check. It's not an issue if the file/directory was >> moved or is being moved as long as we can walk its "current" hierarchy. >> Furthermore, a sandboxed process is restricted from doing arbitrary >> mounts (and renames/links are controlled with the >> LANDLOCK_ACCESS_FS_REFER right). >> >> However, we need to get a valid "snapshot" of the set of dentries that >> (could) lead to the evaluated file/directory. > > A "snapshot" is an interesting idea - though looking at the landlock > code you one need inodes, not dentries. > I imagine an interface where you give it a starting path, a root, and > and array of inode pointers, and it fills in the pointers with the path > - all under rcu so no references are needed. > But you would need some fallback if the array isn't big enough, so maybe > that isn't a good idea. > > Based on the comments by Al and Christian, I think the only viable > approach is to pass a callback to some vfs function that does the > walking. > > vfs_walk_ancestors(struct path *path, struct path *root, > int (*walk_cb)(struct path *ancestor, void *data), > void *data) > > where walk_cb() returns a negative number if it wants to abort, and is > given a NULL ancestor if vfs_walk_ancestors() needed to restart. > > vfs_walk_ancestors() would initialise a "struct nameidata" and > effectively call handle_dots(&nd, LAST_DOTDOT) repeatedly, calling > walk_cb(&nd.path, data) > each time. handle_dots semantically does more than dget_parent + choose_mountpoint tho (which is what Landlock currently does, and is also what Song's iterator will do). There is the step_into which will step into mountpoints (there is also code to handle symlinks, although I'm not sure if that's relevant for following ".."), and it will also return ENOENT if the path is disconnected. Also I guess we might not need to have an entire nameidata? In theory it only needs to do what follow_dotdot_rcu does without the path_connected check. So it seems like given we have path and root as function argument, it would only need nd->{seq,m_seq}. I might be wrong tho, but certainly the behaviour is different. > > How would you feel about that sort of interface? I can't speak for Mickaël, but a callback-based interface is less flexible (and _maybe_ less performant?). Also, probably we will want to fallback to a reference-taking walk if the walk fails (rather than, say, retry infinitely), and this should probably use Song's proposed iterator. I'm not sure if Song would be keen to rewrite this iterator patch series in callback style (to be clear, it doesn't necessarily seem like a good idea to me, and I'm not asking him to), which means that we will end up with the reference walk API being a "call this function repeatedly", and the rcu walk API taking a callback. I think it is still workable (after all, if Landlock wants to reuse the code in the callback it can just call the callback function itself when doing the reference walk), but it seems a bit "ugly" to me. But this is just my opinion, and if there is a stronger desire to not expose any VFS seqcount integers then maybe we will just need to work with a callback. Quick note in case anyone reading this has not seen it, a while ago I made a POC of a non-callback style API for rcu parent walk based on Song's series: https://lore.kernel.org/all/dbc7ee0f1f483b7bc2ec9757672a38d99015e9ae.1749402769@maowtm.org/#iZ31fs:namei.c > > NeilBrown >
On Thu, 26 Jun 2025, Tingmao Wang wrote: > On 6/26/25 00:04, NeilBrown wrote: > > On Wed, 25 Jun 2025, Mickaël Salaün wrote: > >> On Wed, Jun 25, 2025 at 07:38:53AM +1000, NeilBrown wrote: > >>> > >>> Can you spell out the minimum that you need? > >> > >> Sure. We'd like to call this new helper in a RCU > >> read-side critical section and leverage this capability to speed up path > >> walk when there is no concurrent hierarchy modification. This use case > >> is similar to handle_dots() with LOOKUP_RCU calling follow_dotdot_rcu(). > >> > >> The main issue with this approach is to keep some state of the path walk > >> to know if the next call to "path_walk_parent_rcu()" would be valid > >> (i.e. something like a very light version of nameidata, mainly sequence > >> integers), and to get back to the non-RCU version otherwise. > >> > >>> > >>> My vague impression is that you want to search up from a given strut path, > >>> no further then some other given path, looking for a dentry that matches > >>> some rule. Is that correct? > >> > >> Yes > >> > >>> > >>> In general, the original dentry could be moved away from under the > >>> dentry you find moments after the match is reported. What mechanisms do > >>> you have in place to ensure this doesn't happen, or that it doesn't > >>> matter? > >> > >> In the case of Landlock, by default, a set of access rights are denied > >> and can only be allowed by an element in the file hierarchy. The goal > >> is to only allow access to files under a specific directory (or directly > >> a specific file). That's why we only care of the file hierarchy at the > >> time of access check. It's not an issue if the file/directory was > >> moved or is being moved as long as we can walk its "current" hierarchy. > >> Furthermore, a sandboxed process is restricted from doing arbitrary > >> mounts (and renames/links are controlled with the > >> LANDLOCK_ACCESS_FS_REFER right). > >> > >> However, we need to get a valid "snapshot" of the set of dentries that > >> (could) lead to the evaluated file/directory. > > > > A "snapshot" is an interesting idea - though looking at the landlock > > code you one need inodes, not dentries. > > I imagine an interface where you give it a starting path, a root, and > > and array of inode pointers, and it fills in the pointers with the path > > - all under rcu so no references are needed. > > But you would need some fallback if the array isn't big enough, so maybe > > that isn't a good idea. > > > > Based on the comments by Al and Christian, I think the only viable > > approach is to pass a callback to some vfs function that does the > > walking. > > > > vfs_walk_ancestors(struct path *path, struct path *root, > > int (*walk_cb)(struct path *ancestor, void *data), > > void *data) > > > > where walk_cb() returns a negative number if it wants to abort, and is > > given a NULL ancestor if vfs_walk_ancestors() needed to restart. > > > > vfs_walk_ancestors() would initialise a "struct nameidata" and > > effectively call handle_dots(&nd, LAST_DOTDOT) repeatedly, calling > > walk_cb(&nd.path, data) > > each time. > > handle_dots semantically does more than dget_parent + choose_mountpoint > tho (which is what Landlock currently does, and is also what Song's > iterator will do). There is the step_into which will step into > mountpoints (there is also code to handle symlinks, although I'm not sure > if that's relevant for following ".."), and it will also return ENOENT if > the path is disconnected. Is any of this a problem for you? > > Also I guess we might not need to have an entire nameidata? In theory it > only needs to do what follow_dotdot_rcu does without the path_connected > check. So it seems like given we have path and root as function argument, > it would only need nd->{seq,m_seq}. Those are implementation details internal to namei.c. Certainly this function wouldn't use all of the fields in nameidata, but it doesn't hurt to have a few fields in a struct on the stack which don't get used. Keeping the code simple and uniform is much more important. Using nameidata would help achieve that. > > I might be wrong tho, but certainly the behaviour is different. Here we get back to the question of "precisely what behaviour do you need?". "The same as what a previous patch did" is not a reasonable answer. If, from userspace, you repeatedly did chdir("..") and then examined the current directory you would get exactly the sequence of directories provided by repeatedly calling handle_dots(..., LAST_DOTDOT). If there is some circumstance where that would be not acceptable for your use case, you need to explain (and we need to document) what differences you need and why use need it. > > > > > How would you feel about that sort of interface? > > I can't speak for Mickaël, but a callback-based interface is less flexible > (and _maybe_ less performant?). Also, probably we will want to fallback > to a reference-taking walk if the walk fails (rather than, say, retry > infinitely), and this should probably use Song's proposed iterator. I'm > not sure if Song would be keen to rewrite this iterator patch series in > callback style (to be clear, it doesn't necessarily seem like a good idea > to me, and I'm not asking him to), which means that we will end up with > the reference walk API being a "call this function repeatedly", and the > rcu walk API taking a callback. I think it is still workable (after all, > if Landlock wants to reuse the code in the callback it can just call the > callback function itself when doing the reference walk), but it seems a > bit "ugly" to me. call-back can have a performance impact (less opportunity for compiler optimisation and CPU speculation), though less than taking spinlock and references. However Al and Christian have drawn a hard line against making seq numbers visible outside VFS code so I think it is the approach most likely to be accepted. Certainly vfs_walk_ancestors() would fallback to ref-walk if rcu-walk resulted in -ECHILD - just like all other path walking code in namei.c. This would be largely transparent to the caller - the caller would only see that the callback received a NULL path indicating a restart. It wouldn't need to know why. NeilBrown
> Those are implementation details internal to namei.c. Certainly this > function wouldn't use all of the fields in nameidata, but it doesn't > hurt to have a few fields in a struct on the stack which don't get used. > Keeping the code simple and uniform is much more important. Using Exactly. > Certainly vfs_walk_ancestors() would fallback to ref-walk if rcu-walk > resulted in -ECHILD - just like all other path walking code in namei.c. > This would be largely transparent to the caller - the caller would only > see that the callback received a NULL path indicating a restart. It > wouldn't need to know why. Yes, that's also what I mentioned in an earlier mail.
> On Jun 25, 2025, at 6:05 PM, NeilBrown <neil@brown.name> wrote: [...] >> >> I can't speak for Mickaël, but a callback-based interface is less flexible >> (and _maybe_ less performant?). Also, probably we will want to fallback >> to a reference-taking walk if the walk fails (rather than, say, retry >> infinitely), and this should probably use Song's proposed iterator. I'm >> not sure if Song would be keen to rewrite this iterator patch series in >> callback style (to be clear, it doesn't necessarily seem like a good idea >> to me, and I'm not asking him to), which means that we will end up with >> the reference walk API being a "call this function repeatedly", and the >> rcu walk API taking a callback. I think it is still workable (after all, >> if Landlock wants to reuse the code in the callback it can just call the >> callback function itself when doing the reference walk), but it seems a >> bit "ugly" to me. > > call-back can have a performance impact (less opportunity for compiler > optimisation and CPU speculation), though less than taking spinlock and > references. However Al and Christian have drawn a hard line against > making seq numbers visible outside VFS code so I think it is the > approach most likely to be accepted. > > Certainly vfs_walk_ancestors() would fallback to ref-walk if rcu-walk > resulted in -ECHILD - just like all other path walking code in namei.c. > This would be largely transparent to the caller - the caller would only > see that the callback received a NULL path indicating a restart. It > wouldn't need to know why. I guess I misunderstood the proposal of vfs_walk_ancestors() initially, so some clarification: I think vfs_walk_ancestors() is good for the rcu-walk, and some rcu-then-ref-walk. However, I don’t think it fits all use cases. A reliable step-by-step ref-walk, like this set, works well with BPF, and we want to keep it. Can we ship this set as-is (or after fixing the comment reported by kernel test robot)? I really don’t think we need figure out all details about the rcu-walk here. Once this is landed, we can try implementing the rcu-walk (vfs_walk_ancestors or some variation). If no one volunteers, I can give it a try. Thanks, Song
On Thu, 26 Jun 2025, Song Liu wrote: > > > > On Jun 25, 2025, at 6:05 PM, NeilBrown <neil@brown.name> wrote: > > [...] > > >> > >> I can't speak for Mickaël, but a callback-based interface is less flexible > >> (and _maybe_ less performant?). Also, probably we will want to fallback > >> to a reference-taking walk if the walk fails (rather than, say, retry > >> infinitely), and this should probably use Song's proposed iterator. I'm > >> not sure if Song would be keen to rewrite this iterator patch series in > >> callback style (to be clear, it doesn't necessarily seem like a good idea > >> to me, and I'm not asking him to), which means that we will end up with > >> the reference walk API being a "call this function repeatedly", and the > >> rcu walk API taking a callback. I think it is still workable (after all, > >> if Landlock wants to reuse the code in the callback it can just call the > >> callback function itself when doing the reference walk), but it seems a > >> bit "ugly" to me. > > > > call-back can have a performance impact (less opportunity for compiler > > optimisation and CPU speculation), though less than taking spinlock and > > references. However Al and Christian have drawn a hard line against > > making seq numbers visible outside VFS code so I think it is the > > approach most likely to be accepted. > > > > Certainly vfs_walk_ancestors() would fallback to ref-walk if rcu-walk > > resulted in -ECHILD - just like all other path walking code in namei.c. > > This would be largely transparent to the caller - the caller would only > > see that the callback received a NULL path indicating a restart. It > > wouldn't need to know why. > > I guess I misunderstood the proposal of vfs_walk_ancestors() > initially, so some clarification: > > I think vfs_walk_ancestors() is good for the rcu-walk, and some > rcu-then-ref-walk. However, I don’t think it fits all use cases. > A reliable step-by-step ref-walk, like this set, works well with > BPF, and we want to keep it. The distinction between rcu-walk and ref-walk is an internal implementation detail. You as a caller shouldn't need to think about the difference. You just want to walk. Note that LOOKUP_RCU is documented in namei.h as "semi-internal". The only uses outside of core-VFS code is in individual filesystem's d_revalidate handler - they are checking if they are allowed to sleep or not. You should never expect to pass LOOKUP_RCU to an VFS API - no other code does. It might be reasonable for you as a caller to have some control over whether the call can sleep or not. LOOKUP_CACHED is a bit like that. But for dotdot lookup the code will never sleep - so that is not relevant. I strongly suggest you stop thinking about rcu-walk vs ref-walk. Think about the needs of your code. If you need a high-performance API, then ask for a high-performance API, don't assume what form it will take or what the internal implementation details will be. I think you already have a clear answer that a step-by-step API will not be read-only on the dcache (i.e. it will adjust refcounts) and so will not be high performance. If you want high performance, you need to accept a different style of API. NeilBrown
> On Jun 26, 2025, at 3:22 AM, NeilBrown <neil@brown.name> wrote: [...] >> I guess I misunderstood the proposal of vfs_walk_ancestors() >> initially, so some clarification: >> >> I think vfs_walk_ancestors() is good for the rcu-walk, and some >> rcu-then-ref-walk. However, I don’t think it fits all use cases. >> A reliable step-by-step ref-walk, like this set, works well with >> BPF, and we want to keep it. > > The distinction between rcu-walk and ref-walk is an internal > implementation detail. You as a caller shouldn't need to think about > the difference. You just want to walk. Note that LOOKUP_RCU is > documented in namei.h as "semi-internal". The only uses outside of > core-VFS code is in individual filesystem's d_revalidate handler - they > are checking if they are allowed to sleep or not. You should never > expect to pass LOOKUP_RCU to an VFS API - no other code does. > > It might be reasonable for you as a caller to have some control over > whether the call can sleep or not. LOOKUP_CACHED is a bit like that. > But for dotdot lookup the code will never sleep - so that is not > relevant. Unfortunately, the BPF use case is more complicated. In some cases, the callback function cannot be call in rcu critical sections. For example, the callback may need to read xatter. For these cases, we we cannot use RCU walk at all. > I strongly suggest you stop thinking about rcu-walk vs ref-walk. Think > about the needs of your code. If you need a high-performance API, then > ask for a high-performance API, don't assume what form it will take or > what the internal implementation details will be. At the moment, we need a ref-walk API on the BPF side. The RCU walk is a totally separate topic. > I think you already have a clear answer that a step-by-step API will not > be read-only on the dcache (i.e. it will adjust refcounts) and so will > not be high performance. If you want high performance, you need to > accept a different style of API. Agreed. Thanks, Song
On Fri, 27 Jun 2025, Song Liu wrote: > > > > On Jun 26, 2025, at 3:22 AM, NeilBrown <neil@brown.name> wrote: > > [...] > > >> I guess I misunderstood the proposal of vfs_walk_ancestors() > >> initially, so some clarification: > >> > >> I think vfs_walk_ancestors() is good for the rcu-walk, and some > >> rcu-then-ref-walk. However, I don’t think it fits all use cases. > >> A reliable step-by-step ref-walk, like this set, works well with > >> BPF, and we want to keep it. > > > > The distinction between rcu-walk and ref-walk is an internal > > implementation detail. You as a caller shouldn't need to think about > > the difference. You just want to walk. Note that LOOKUP_RCU is > > documented in namei.h as "semi-internal". The only uses outside of > > core-VFS code is in individual filesystem's d_revalidate handler - they > > are checking if they are allowed to sleep or not. You should never > > expect to pass LOOKUP_RCU to an VFS API - no other code does. > > > > It might be reasonable for you as a caller to have some control over > > whether the call can sleep or not. LOOKUP_CACHED is a bit like that. > > But for dotdot lookup the code will never sleep - so that is not > > relevant. > > Unfortunately, the BPF use case is more complicated. In some cases, > the callback function cannot be call in rcu critical sections. For > example, the callback may need to read xatter. For these cases, we > we cannot use RCU walk at all. I really think you should stop using the terms RCU walk and ref-walk. I think they might be focusing your thinking in an unhelpful direction. The key issue about reading xattrs is that it might need to sleep. Focusing on what might need to sleep and what will never need to sleep is a useful approach - the distinction is wide spread in the kernel and several function take a flag indicating if they are permitted to sleep, or if failure when sleeping would be required. So your above observation is better described as The vfs_walk_ancestors() API has an (implicit) requirement that the callback mustn't sleep. This is a problem for some use-cases where the call back might need to sleep - e.g. for accessing xattrs. That is a good and useful observation. I can see three possibly responses: 1/ Add a vfs_walk_ancestors_maysleep() API for which the callback is always allowed to sleep. I don't particularly like this approach. 2/ Use repeated calls to vfs_walk_parent() when the handling of each ancestor might need to sleep. I see no problem with supporting both vfs_walk_ancestors() and vfs_walk_parent(). There is plenty of precedent for having different interfaces for different use cases. 3/ Extend vfs_walk_ancestors() to pass a "may sleep" flag to the callback. If the callback finds that it needs to sleep but that "may sleep" isn't set, it returns some well known status, like -EWOULDBLOCK (or -ECHILD). It can expect to be called again but with "may sleep" set. This is my preferred approach. There is precedent with the d_revalidate callbacks which works like this. I suspect that accessing xattrs might often be possible without sleeping. It is conceivable that we could add a "may sleep" argument to vfs_getxattr() so that it could still often be used without requiring vfs_walk_ancestors() to permit sleeping. This would almost certainly require a clear demonstration that there was a performance cost in not having the option of non-sleeping vfs_getxattr(). > > > I strongly suggest you stop thinking about rcu-walk vs ref-walk. Think > > about the needs of your code. If you need a high-performance API, then > > ask for a high-performance API, don't assume what form it will take or > > what the internal implementation details will be. > > At the moment, we need a ref-walk API on the BPF side. The RCU walk > is a totally separate topic. Do you mean "we need step-by-step walking" or do you mean "we need to potentially sleep for each ancestor"? These are conceptually different requirements, but I cannot tell which you mean when you talk about "RCU walk". Thanks, NeilBrown > > > I think you already have a clear answer that a step-by-step API will not > > be read-only on the dcache (i.e. it will adjust refcounts) and so will > > not be high performance. If you want high performance, you need to > > accept a different style of API. > > Agreed. > > Thanks, > Song > >
On Fri, Jun 27, 2025 at 08:51:21AM +1000, NeilBrown wrote: > On Fri, 27 Jun 2025, Song Liu wrote: > > > > > > > On Jun 26, 2025, at 3:22 AM, NeilBrown <neil@brown.name> wrote: > > > > [...] > > > > >> I guess I misunderstood the proposal of vfs_walk_ancestors() > > >> initially, so some clarification: > > >> > > >> I think vfs_walk_ancestors() is good for the rcu-walk, and some > > >> rcu-then-ref-walk. However, I don’t think it fits all use cases. > > >> A reliable step-by-step ref-walk, like this set, works well with > > >> BPF, and we want to keep it. > > > > > > The distinction between rcu-walk and ref-walk is an internal > > > implementation detail. You as a caller shouldn't need to think about > > > the difference. You just want to walk. Note that LOOKUP_RCU is > > > documented in namei.h as "semi-internal". The only uses outside of > > > core-VFS code is in individual filesystem's d_revalidate handler - they > > > are checking if they are allowed to sleep or not. You should never > > > expect to pass LOOKUP_RCU to an VFS API - no other code does. > > > > > > It might be reasonable for you as a caller to have some control over > > > whether the call can sleep or not. LOOKUP_CACHED is a bit like that. > > > But for dotdot lookup the code will never sleep - so that is not > > > relevant. > > > > Unfortunately, the BPF use case is more complicated. In some cases, > > the callback function cannot be call in rcu critical sections. For > > example, the callback may need to read xatter. For these cases, we > > we cannot use RCU walk at all. > > I really think you should stop using the terms RCU walk and ref-walk. I > think they might be focusing your thinking in an unhelpful direction. Thank you! I really appreciate you helping to shape this API and it aligns a lot with my thinking. > The key issue about reading xattrs is that it might need to sleep. > Focusing on what might need to sleep and what will never need to sleep > is a useful approach - the distinction is wide spread in the kernel and > several function take a flag indicating if they are permitted to sleep, > or if failure when sleeping would be required. > > So your above observation is better described as > > The vfs_walk_ancestors() API has an (implicit) requirement that the > callback mustn't sleep. This is a problem for some use-cases > where the call back might need to sleep - e.g. for accessing xattrs. > > That is a good and useful observation. I can see three possibly > responses: > > 1/ Add a vfs_walk_ancestors_maysleep() API for which the callback is > always allowed to sleep. I don't particularly like this approach. Agreed. > > 2/ Use repeated calls to vfs_walk_parent() when the handling of each > ancestor might need to sleep. I see no problem with supporting both > vfs_walk_ancestors() and vfs_walk_parent(). There is plenty of > precedent for having different interfaces for different use cases. Meh. > > 3/ Extend vfs_walk_ancestors() to pass a "may sleep" flag to the callback. I think that's fine.
On Mon, Jul 07, 2025 at 12:46:41PM +0200, Christian Brauner wrote: > On Fri, Jun 27, 2025 at 08:51:21AM +1000, NeilBrown wrote: > > On Fri, 27 Jun 2025, Song Liu wrote: > > > > > > > > > > On Jun 26, 2025, at 3:22 AM, NeilBrown <neil@brown.name> wrote: > > > > > > [...] > > > > > > >> I guess I misunderstood the proposal of vfs_walk_ancestors() > > > >> initially, so some clarification: > > > >> > > > >> I think vfs_walk_ancestors() is good for the rcu-walk, and some > > > >> rcu-then-ref-walk. However, I don’t think it fits all use cases. > > > >> A reliable step-by-step ref-walk, like this set, works well with > > > >> BPF, and we want to keep it. > > > > > > > > The distinction between rcu-walk and ref-walk is an internal > > > > implementation detail. You as a caller shouldn't need to think about > > > > the difference. You just want to walk. Note that LOOKUP_RCU is > > > > documented in namei.h as "semi-internal". The only uses outside of > > > > core-VFS code is in individual filesystem's d_revalidate handler - they > > > > are checking if they are allowed to sleep or not. You should never > > > > expect to pass LOOKUP_RCU to an VFS API - no other code does. > > > > > > > > It might be reasonable for you as a caller to have some control over > > > > whether the call can sleep or not. LOOKUP_CACHED is a bit like that. > > > > But for dotdot lookup the code will never sleep - so that is not > > > > relevant. > > > > > > Unfortunately, the BPF use case is more complicated. In some cases, > > > the callback function cannot be call in rcu critical sections. For > > > example, the callback may need to read xatter. For these cases, we > > > we cannot use RCU walk at all. > > > > I really think you should stop using the terms RCU walk and ref-walk. I > > think they might be focusing your thinking in an unhelpful direction. > > Thank you! I really appreciate you helping to shape this API and it > aligns a lot with my thinking. > > > The key issue about reading xattrs is that it might need to sleep. > > Focusing on what might need to sleep and what will never need to sleep > > is a useful approach - the distinction is wide spread in the kernel and > > several function take a flag indicating if they are permitted to sleep, > > or if failure when sleeping would be required. > > > > So your above observation is better described as > > > > The vfs_walk_ancestors() API has an (implicit) requirement that the > > callback mustn't sleep. This is a problem for some use-cases > > where the call back might need to sleep - e.g. for accessing xattrs. > > > > That is a good and useful observation. I can see three possibly > > responses: > > > > 1/ Add a vfs_walk_ancestors_maysleep() API for which the callback is > > always allowed to sleep. I don't particularly like this approach. > > Agreed. > > > > > 2/ Use repeated calls to vfs_walk_parent() when the handling of each > > ancestor might need to sleep. I see no problem with supporting both > > vfs_walk_ancestors() and vfs_walk_parent(). There is plenty of > > precedent for having different interfaces for different use cases. > > Meh. > > > > > 3/ Extend vfs_walk_ancestors() to pass a "may sleep" flag to the callback. > > I think that's fine. Ok, sorry for the delay but there's a lot of different things going on right now and this one isn't exactly an easy thing to solve. I mentioned this before and so did Neil: the lookup implementation supports two modes sleeping and non-sleeping. That api is abstracted away as heavily as possible by the VFS so that non-core code will not be exposed to it other than in exceptional circumstances and doesn't have to care about it. It is a conceptual dead-end to expose these two modes via separate APIs and leak this implementation detail into non-core code. It will not happen as far as I'm concerned. I very much understand the urge to get the refcount step-by-step thing merged asap. Everyone wants their APIs merged fast. And if it's reasonable to move fast we will (see the kernfs xattr thing). But here are two use-cases that ask for the same thing with different constraints that closely mirror our unified approach. Merging one quickly just to have something and then later bolting the other one on top, augmenting, or replacing, possible having to deprecate the old API is just objectively nuts. That's how we end up with a spaghetthi helper collection. We want as little helper fragmentation as possible. We need a unified API that serves both use-cases. I dislike callback-based APIs generally but we have precedent in the VFS for this for cases where the internal state handling is delicate enough that it should not be exposed (see __iterate_supers() which does exactly work like Neil suggested down to the flag argument itself I added). So I'm open to the callback solution. (Note for really absurd perf requirements you could even make it work with static calls I'm pretty sure.)
Hi Christian, Thanks for your comments! > On Jul 7, 2025, at 4:17 AM, Christian Brauner <brauner@kernel.org> wrote: [...] >>> 3/ Extend vfs_walk_ancestors() to pass a "may sleep" flag to the callback. >> >> I think that's fine. > > Ok, sorry for the delay but there's a lot of different things going on > right now and this one isn't exactly an easy thing to solve. > > I mentioned this before and so did Neil: the lookup implementation > supports two modes sleeping and non-sleeping. That api is abstracted > away as heavily as possible by the VFS so that non-core code will not be > exposed to it other than in exceptional circumstances and doesn't have > to care about it. > > It is a conceptual dead-end to expose these two modes via separate APIs > and leak this implementation detail into non-core code. It will not > happen as far as I'm concerned. > > I very much understand the urge to get the refcount step-by-step thing > merged asap. Everyone wants their APIs merged fast. And if it's > reasonable to move fast we will (see the kernfs xattr thing). > > But here are two use-cases that ask for the same thing with different > constraints that closely mirror our unified approach. Merging one > quickly just to have something and then later bolting the other one on > top, augmenting, or replacing, possible having to deprecate the old API > is just objectively nuts. That's how we end up with a spaghetthi helper > collection. We want as little helper fragmentation as possible. > > We need a unified API that serves both use-cases. I dislike > callback-based APIs generally but we have precedent in the VFS for this > for cases where the internal state handling is delicate enough that it > should not be exposed (see __iterate_supers() which does exactly work > like Neil suggested down to the flag argument itself I added). > > So I'm open to the callback solution. > > (Note for really absurd perf requirements you could even make it work > with static calls I'm pretty sure.) I guess we will go with Mickaël’s idea: > int vfs_walk_ancestors(struct path *path, > bool (*walk_cb)(const struct path *ancestor, void *data), > void *data, int flags) > > The walk continue while walk_cb() returns true. walk_cb() can then > check if @ancestor is equal to a @root, or other properties. The > walk_cb() return value (if not bool) should not be returned by > vfs_walk_ancestors() because a walk stop doesn't mean an error. If necessary, we hide “root" inside @data. This is good. > @path would be updated with latest ancestor path (e.g. @root). Update @path to the last ancestor and hold proper references. I missed this part earlier. With this feature, vfs_walk_ancestors should work usable with open-codeed bpf path iterator. I have a question about this behavior with RCU walk. IIUC, RCU walk does not hold reference to @ancestor when calling walk_cb(). If walk_cb() returns false, shall vfs_walk_ancestors() then grab a reference on @ancestor? This feels a bit weird to me. Maybe “updating @path to the last ancestor” should only apply to LOOKUP_RCU==false case? > @flags could contain LOOKUP_RCU or not, which enables us to have > walk_cb() not-RCU compatible. > > When passing LOOKUP_RCU, if the first call to vfs_walk_ancestors() > failed with -ECHILD, the caller can restart the walk by calling > vfs_walk_ancestors() again but without LOOKUP_RCU. Given we want callers to handle -ECHILD and call vfs_walk_ancestors again without LOOKUP_RCU, I think we should keep @path not changed With LOOKUP_RCU==true, and only update it to the last ancestor when LOOKUP_RCU==false. With this behavior, landlock code will be like: /* Assume we hold reference on “path”. * With LOOKUP_RCU, path will not change, we don’t need * extra reference on “path”. */ err = vfs_walk_ancestors(path, ll_cb, data, LOOKUP_RCU); /* * At this point, whether err is 0 or not, path is not * changed. */ if (err == -ECHILD) { struct path walk_path = *path; /* reset any data changed by the walk */ reset_data(data); /* get a reference on walk_path. */ path_get(&walk_path); err = vfs_walk_ancestors(&walk_path, ll_cb, data, 0); /* Now, walk_path might be updated */ /* Always release reference on walk_path */ path_put(&walk_path); } BPF path iterator sode will look like: static bool bpf_cb(const struct path *ancestor, void *data) { return false; } struct path *bpf_iter_path_next(struct bpf_iter_path *it) { struct bpf_iter_path_kern *kit = (void *)it; if (vfs_walk_ancestors(&kit->path, bpf_cb, NULL)) return NULL; return &kit->path; } Does this sound reasonable to every body? Thanks, Song
On Tue, 08 Jul 2025, Song Liu wrote: > Hi Christian, > > Thanks for your comments! > > > On Jul 7, 2025, at 4:17 AM, Christian Brauner <brauner@kernel.org> wrote: > > [...] > > >>> 3/ Extend vfs_walk_ancestors() to pass a "may sleep" flag to the callback. > >> > >> I think that's fine. > > > > Ok, sorry for the delay but there's a lot of different things going on > > right now and this one isn't exactly an easy thing to solve. > > > > I mentioned this before and so did Neil: the lookup implementation > > supports two modes sleeping and non-sleeping. That api is abstracted > > away as heavily as possible by the VFS so that non-core code will not be > > exposed to it other than in exceptional circumstances and doesn't have > > to care about it. > > > > It is a conceptual dead-end to expose these two modes via separate APIs > > and leak this implementation detail into non-core code. It will not > > happen as far as I'm concerned. > > > > I very much understand the urge to get the refcount step-by-step thing > > merged asap. Everyone wants their APIs merged fast. And if it's > > reasonable to move fast we will (see the kernfs xattr thing). > > > > But here are two use-cases that ask for the same thing with different > > constraints that closely mirror our unified approach. Merging one > > quickly just to have something and then later bolting the other one on > > top, augmenting, or replacing, possible having to deprecate the old API > > is just objectively nuts. That's how we end up with a spaghetthi helper > > collection. We want as little helper fragmentation as possible. > > > > We need a unified API that serves both use-cases. I dislike > > callback-based APIs generally but we have precedent in the VFS for this > > for cases where the internal state handling is delicate enough that it > > should not be exposed (see __iterate_supers() which does exactly work > > like Neil suggested down to the flag argument itself I added). > > > > So I'm open to the callback solution. > > > > (Note for really absurd perf requirements you could even make it work > > with static calls I'm pretty sure.) > > I guess we will go with Mickaël’s idea: > > > int vfs_walk_ancestors(struct path *path, > > bool (*walk_cb)(const struct path *ancestor, void *data), > > void *data, int flags) > > > > The walk continue while walk_cb() returns true. walk_cb() can then > > check if @ancestor is equal to a @root, or other properties. The > > walk_cb() return value (if not bool) should not be returned by > > vfs_walk_ancestors() because a walk stop doesn't mean an error. > > If necessary, we hide “root" inside @data. This is good. > > > @path would be updated with latest ancestor path (e.g. @root). > > Update @path to the last ancestor and hold proper references. > I missed this part earlier. With this feature, vfs_walk_ancestors > should work usable with open-codeed bpf path iterator. I don't think path should be updated. That adds complexity which might not be needed. The original (landlock) requirements were only to look at each ancestor, not to take a reference to any of them. If the caller needs a reference to any of the ancestors I think that walk_cb() needs to take that reference and store it in data. Note that attempting to take the reference might fail. See legitimize_path() in fs/namei.c. It isn't yet clear to me what would be a good API for requesting the reference. One option would be for vfs_walk_ancestors() to pass another void* to walk_cb(), and it passed it on to vfs_legitimize_path() which extracts the seq numbers from there. Another might be that the path passed to walk_cb is always nameidata.path, and so when that is passed to vfs_legitimize_path() path it can use container_of() to find the seq numbers. If vfs_legitimize_path() fail, walk_cb() might want to ask for the walk to be restarted. > > I have a question about this behavior with RCU walk. IIUC, RCU > walk does not hold reference to @ancestor when calling walk_cb(). > If walk_cb() returns false, shall vfs_walk_ancestors() then > grab a reference on @ancestor? This feels a bit weird to me. > Maybe “updating @path to the last ancestor” should only apply to > LOOKUP_RCU==false case? > > > @flags could contain LOOKUP_RCU or not, which enables us to have > > walk_cb() not-RCU compatible. > > > > When passing LOOKUP_RCU, if the first call to vfs_walk_ancestors() > > failed with -ECHILD, the caller can restart the walk by calling > > vfs_walk_ancestors() again but without LOOKUP_RCU. > > > Given we want callers to handle -ECHILD and call vfs_walk_ancestors > again without LOOKUP_RCU, I think we should keep @path not changed > With LOOKUP_RCU==true, and only update it to the last ancestor > when LOOKUP_RCU==false. No, we really don't want to pass a LOOKUP_RCU() flag to vfs_walk_ancestors(). vfs_walk_ancestors() might choose to pass that flag to walk_cb(). NeilBrown
> On Jul 9, 2025, at 3:14 PM, NeilBrown <neil@brown.name> wrote: > > On Tue, 08 Jul 2025, Song Liu wrote: >> Hi Christian, >> >> Thanks for your comments! >> >>> On Jul 7, 2025, at 4:17 AM, Christian Brauner <brauner@kernel.org> wrote: >> >> [...] >> >>>>> 3/ Extend vfs_walk_ancestors() to pass a "may sleep" flag to the callback. >>>> >>>> I think that's fine. >>> >>> Ok, sorry for the delay but there's a lot of different things going on >>> right now and this one isn't exactly an easy thing to solve. >>> >>> I mentioned this before and so did Neil: the lookup implementation >>> supports two modes sleeping and non-sleeping. That api is abstracted >>> away as heavily as possible by the VFS so that non-core code will not be >>> exposed to it other than in exceptional circumstances and doesn't have >>> to care about it. >>> >>> It is a conceptual dead-end to expose these two modes via separate APIs >>> and leak this implementation detail into non-core code. It will not >>> happen as far as I'm concerned. >>> >>> I very much understand the urge to get the refcount step-by-step thing >>> merged asap. Everyone wants their APIs merged fast. And if it's >>> reasonable to move fast we will (see the kernfs xattr thing). >>> >>> But here are two use-cases that ask for the same thing with different >>> constraints that closely mirror our unified approach. Merging one >>> quickly just to have something and then later bolting the other one on >>> top, augmenting, or replacing, possible having to deprecate the old API >>> is just objectively nuts. That's how we end up with a spaghetthi helper >>> collection. We want as little helper fragmentation as possible. >>> >>> We need a unified API that serves both use-cases. I dislike >>> callback-based APIs generally but we have precedent in the VFS for this >>> for cases where the internal state handling is delicate enough that it >>> should not be exposed (see __iterate_supers() which does exactly work >>> like Neil suggested down to the flag argument itself I added). >>> >>> So I'm open to the callback solution. >>> >>> (Note for really absurd perf requirements you could even make it work >>> with static calls I'm pretty sure.) >> >> I guess we will go with Mickaël’s idea: >> >>> int vfs_walk_ancestors(struct path *path, >>> bool (*walk_cb)(const struct path *ancestor, void *data), >>> void *data, int flags) >>> >>> The walk continue while walk_cb() returns true. walk_cb() can then >>> check if @ancestor is equal to a @root, or other properties. The >>> walk_cb() return value (if not bool) should not be returned by >>> vfs_walk_ancestors() because a walk stop doesn't mean an error. >> >> If necessary, we hide “root" inside @data. This is good. >> >>> @path would be updated with latest ancestor path (e.g. @root). >> >> Update @path to the last ancestor and hold proper references. >> I missed this part earlier. With this feature, vfs_walk_ancestors >> should work usable with open-codeed bpf path iterator. > > I don't think path should be updated. That adds complexity which might > not be needed. The original (landlock) requirements were only to look > at each ancestor, not to take a reference to any of them. I think this is the ideal case that landlock wants in the long term. But we may need to take references when the attempt fails. Also, current landlock code takes reference at each step. > If the caller needs a reference to any of the ancestors I think that > walk_cb() needs to take that reference and store it in data. > Note that attempting to take the reference might fail. See > legitimize_path() in fs/namei.c. > > It isn't yet clear to me what would be a good API for requesting the > reference. > One option would be for vfs_walk_ancestors() to pass another void* to > walk_cb(), and it passed it on to vfs_legitimize_path() which extracts > the seq numbers from there. > Another might be that the path passed to walk_cb is always > nameidata.path, and so when that is passed to vfs_legitimize_path() path > it can use container_of() to find the seq numbers. Letting walk_cb() call vfs_legitimize_path() seems suboptimal to me. I think the original goal is to have vfs_walk_ancestors() to: 1. Try to walk the ancestors without taking any references; 2. Detect when the not-taking-reference walk is not reliable; 3. Maybe, retry the walk from beginning, but takes references on each step. With walk_cb() calling vfs_legitimize_path(), we are moving #2 above to walk_cb(). I think this is not what we want? > If vfs_legitimize_path() fail, walk_cb() might want to ask for the walk > to be restarted. > >> >> I have a question about this behavior with RCU walk. IIUC, RCU >> walk does not hold reference to @ancestor when calling walk_cb(). >> If walk_cb() returns false, shall vfs_walk_ancestors() then >> grab a reference on @ancestor? This feels a bit weird to me. >> Maybe “updating @path to the last ancestor” should only apply to >> LOOKUP_RCU==false case? >> >>> @flags could contain LOOKUP_RCU or not, which enables us to have >>> walk_cb() not-RCU compatible. >>> >>> When passing LOOKUP_RCU, if the first call to vfs_walk_ancestors() >>> failed with -ECHILD, the caller can restart the walk by calling >>> vfs_walk_ancestors() again but without LOOKUP_RCU. >> >> >> Given we want callers to handle -ECHILD and call vfs_walk_ancestors >> again without LOOKUP_RCU, I think we should keep @path not changed >> With LOOKUP_RCU==true, and only update it to the last ancestor >> when LOOKUP_RCU==false. > > No, we really don't want to pass a LOOKUP_RCU() flag to > vfs_walk_ancestors(). > vfs_walk_ancestors() might choose to pass that flag to walk_cb(). In this case, we will need vfs_walk_ancestors to handle #3 above. Thanks, Song
On Thu, 10 Jul 2025, Song Liu wrote: > > > > On Jul 9, 2025, at 3:14 PM, NeilBrown <neil@brown.name> wrote: > > > > On Tue, 08 Jul 2025, Song Liu wrote: > >> Hi Christian, > >> > >> Thanks for your comments! > >> > >>> On Jul 7, 2025, at 4:17 AM, Christian Brauner <brauner@kernel.org> wrote: > >> > >> [...] > >> > >>>>> 3/ Extend vfs_walk_ancestors() to pass a "may sleep" flag to the callback. > >>>> > >>>> I think that's fine. > >>> > >>> Ok, sorry for the delay but there's a lot of different things going on > >>> right now and this one isn't exactly an easy thing to solve. > >>> > >>> I mentioned this before and so did Neil: the lookup implementation > >>> supports two modes sleeping and non-sleeping. That api is abstracted > >>> away as heavily as possible by the VFS so that non-core code will not be > >>> exposed to it other than in exceptional circumstances and doesn't have > >>> to care about it. > >>> > >>> It is a conceptual dead-end to expose these two modes via separate APIs > >>> and leak this implementation detail into non-core code. It will not > >>> happen as far as I'm concerned. > >>> > >>> I very much understand the urge to get the refcount step-by-step thing > >>> merged asap. Everyone wants their APIs merged fast. And if it's > >>> reasonable to move fast we will (see the kernfs xattr thing). > >>> > >>> But here are two use-cases that ask for the same thing with different > >>> constraints that closely mirror our unified approach. Merging one > >>> quickly just to have something and then later bolting the other one on > >>> top, augmenting, or replacing, possible having to deprecate the old API > >>> is just objectively nuts. That's how we end up with a spaghetthi helper > >>> collection. We want as little helper fragmentation as possible. > >>> > >>> We need a unified API that serves both use-cases. I dislike > >>> callback-based APIs generally but we have precedent in the VFS for this > >>> for cases where the internal state handling is delicate enough that it > >>> should not be exposed (see __iterate_supers() which does exactly work > >>> like Neil suggested down to the flag argument itself I added). > >>> > >>> So I'm open to the callback solution. > >>> > >>> (Note for really absurd perf requirements you could even make it work > >>> with static calls I'm pretty sure.) > >> > >> I guess we will go with Mickaël’s idea: > >> > >>> int vfs_walk_ancestors(struct path *path, > >>> bool (*walk_cb)(const struct path *ancestor, void *data), > >>> void *data, int flags) > >>> > >>> The walk continue while walk_cb() returns true. walk_cb() can then > >>> check if @ancestor is equal to a @root, or other properties. The > >>> walk_cb() return value (if not bool) should not be returned by > >>> vfs_walk_ancestors() because a walk stop doesn't mean an error. > >> > >> If necessary, we hide “root" inside @data. This is good. > >> > >>> @path would be updated with latest ancestor path (e.g. @root). > >> > >> Update @path to the last ancestor and hold proper references. > >> I missed this part earlier. With this feature, vfs_walk_ancestors > >> should work usable with open-codeed bpf path iterator. > > > > I don't think path should be updated. That adds complexity which might > > not be needed. The original (landlock) requirements were only to look > > at each ancestor, not to take a reference to any of them. > > I think this is the ideal case that landlock wants in the long term. > But we may need to take references when the attempt fails. Also, > current landlock code takes reference at each step. Why may be need to? Yes, current landlock code takes references, but I don't think that is because it needs references, only because the API requires it to take references. > > > If the caller needs a reference to any of the ancestors I think that > > walk_cb() needs to take that reference and store it in data. > > Note that attempting to take the reference might fail. See > > legitimize_path() in fs/namei.c. > > > > It isn't yet clear to me what would be a good API for requesting the > > reference. > > One option would be for vfs_walk_ancestors() to pass another void* to > > walk_cb(), and it passed it on to vfs_legitimize_path() which extracts > > the seq numbers from there. > > Another might be that the path passed to walk_cb is always > > nameidata.path, and so when that is passed to vfs_legitimize_path() path > > it can use container_of() to find the seq numbers. > > Letting walk_cb() call vfs_legitimize_path() seems suboptimal to me. > I think the original goal is to have vfs_walk_ancestors() to: > 1. Try to walk the ancestors without taking any references; > 2. Detect when the not-taking-reference walk is not reliable; > 3. Maybe, retry the walk from beginning, but takes references on > each step. > > With walk_cb() calling vfs_legitimize_path(), we are moving #2 above > to walk_cb(). I think this is not what we want? I think you are looking at this the wrong way around. Focus on the needs for the caller, not on how you think it might be implemented. If the caller needs a reference, there should be a way for it to get a reference. This is quite separate from the choices vfs_walk_ancestors() makes about how it is going to walk the list of dentries. NeilBrown
On Mon, Jul 07, 2025 at 06:50:12PM +0000, Song Liu wrote: > Hi Christian, > > Thanks for your comments! > > > On Jul 7, 2025, at 4:17 AM, Christian Brauner <brauner@kernel.org> wrote: > > [...] > > >>> 3/ Extend vfs_walk_ancestors() to pass a "may sleep" flag to the callback. > >> > >> I think that's fine. > > > > Ok, sorry for the delay but there's a lot of different things going on > > right now and this one isn't exactly an easy thing to solve. > > > > I mentioned this before and so did Neil: the lookup implementation > > supports two modes sleeping and non-sleeping. That api is abstracted > > away as heavily as possible by the VFS so that non-core code will not be > > exposed to it other than in exceptional circumstances and doesn't have > > to care about it. > > > > It is a conceptual dead-end to expose these two modes via separate APIs > > and leak this implementation detail into non-core code. It will not > > happen as far as I'm concerned. > > > > I very much understand the urge to get the refcount step-by-step thing > > merged asap. Everyone wants their APIs merged fast. And if it's > > reasonable to move fast we will (see the kernfs xattr thing). > > > > But here are two use-cases that ask for the same thing with different > > constraints that closely mirror our unified approach. Merging one > > quickly just to have something and then later bolting the other one on > > top, augmenting, or replacing, possible having to deprecate the old API > > is just objectively nuts. That's how we end up with a spaghetthi helper > > collection. We want as little helper fragmentation as possible. > > > > We need a unified API that serves both use-cases. I dislike > > callback-based APIs generally but we have precedent in the VFS for this > > for cases where the internal state handling is delicate enough that it > > should not be exposed (see __iterate_supers() which does exactly work > > like Neil suggested down to the flag argument itself I added). > > > > So I'm open to the callback solution. > > > > (Note for really absurd perf requirements you could even make it work > > with static calls I'm pretty sure.) > > I guess we will go with Mickaël’s idea: > > > int vfs_walk_ancestors(struct path *path, > > bool (*walk_cb)(const struct path *ancestor, void *data), > > void *data, int flags) > > > > The walk continue while walk_cb() returns true. walk_cb() can then > > check if @ancestor is equal to a @root, or other properties. The > > walk_cb() return value (if not bool) should not be returned by > > vfs_walk_ancestors() because a walk stop doesn't mean an error. > > If necessary, we hide “root" inside @data. This is good. > > > @path would be updated with latest ancestor path (e.g. @root). > > Update @path to the last ancestor and hold proper references. > I missed this part earlier. With this feature, vfs_walk_ancestors > should work usable with open-codeed bpf path iterator. > > I have a question about this behavior with RCU walk. IIUC, RCU > walk does not hold reference to @ancestor when calling walk_cb(). I think a reference to the mount should be held, but not necessarily to the dentry if we are still in the same mount as the original path. > If walk_cb() returns false, shall vfs_walk_ancestors() then > grab a reference on @ancestor? This feels a bit weird to me. If walk_cb() checks for a root, it will return false when the path will match, and the caller would expect to get this root path, right? In general, it's safer to always have the same behavior when holding or releasing a reference. I think the caller should then always call path_put() after vfs_walk_ancestors() whatever the return code is. > Maybe “updating @path to the last ancestor” should only apply to > LOOKUP_RCU==false case? > > > @flags could contain LOOKUP_RCU or not, which enables us to have > > walk_cb() not-RCU compatible. > > > > When passing LOOKUP_RCU, if the first call to vfs_walk_ancestors() > > failed with -ECHILD, the caller can restart the walk by calling > > vfs_walk_ancestors() again but without LOOKUP_RCU. > > > Given we want callers to handle -ECHILD and call vfs_walk_ancestors > again without LOOKUP_RCU, I think we should keep @path not changed > With LOOKUP_RCU==true, and only update it to the last ancestor > when LOOKUP_RCU==false. As Neil said, we don't want to explicitly pass LOOKUP_RCU as a public flag. Instead, walk_cb() should never sleep (and then potentially be called under RCU by the vfs_walk_ancestors() implementation). > > With this behavior, landlock code will be like: > > > /* Assume we hold reference on “path”. > * With LOOKUP_RCU, path will not change, we don’t need > * extra reference on “path”. > */ > err = vfs_walk_ancestors(path, ll_cb, data, LOOKUP_RCU); > /* > * At this point, whether err is 0 or not, path is not > * changed. > */ > > if (err == -ECHILD) { > struct path walk_path = *path; > > /* reset any data changed by the walk */ > reset_data(data); > > /* get a reference on walk_path. */ > path_get(&walk_path); > > err = vfs_walk_ancestors(&walk_path, ll_cb, data, 0); > /* Now, walk_path might be updated */ > > /* Always release reference on walk_path */ > path_put(&walk_path); > } > > > BPF path iterator sode will look like: > > static bool bpf_cb(const struct path *ancestor, void *data) > { > return false; > } Instead of this callback, we could just always return if walk_cb is NULL. > > struct path *bpf_iter_path_next(struct bpf_iter_path *it) > { > struct bpf_iter_path_kern *kit = (void *)it; > > if (vfs_walk_ancestors(&kit->path, bpf_cb, NULL)) > return NULL; > return &kit->path; > } > > > Does this sound reasonable to every body? > > Thanks, > Song >
> On Jul 9, 2025, at 9:06 AM, Mickaël Salaün <mic@digikod.net> wrote:\ [...] >> If necessary, we hide “root" inside @data. This is good. >> >>> @path would be updated with latest ancestor path (e.g. @root). >> >> Update @path to the last ancestor and hold proper references. >> I missed this part earlier. With this feature, vfs_walk_ancestors >> should work usable with open-codeed bpf path iterator. >> >> I have a question about this behavior with RCU walk. IIUC, RCU >> walk does not hold reference to @ancestor when calling walk_cb(). > > I think a reference to the mount should be held, but not necessarily to > the dentry if we are still in the same mount as the original path. If we update @path and do path_put() after the walk, we have to hold reference to both the mnt and the dentry, no? > >> If walk_cb() returns false, shall vfs_walk_ancestors() then >> grab a reference on @ancestor? This feels a bit weird to me. > > If walk_cb() checks for a root, it will return false when the path will > match, and the caller would expect to get this root path, right? If the user want to walk to the global root, walk_cb() may not return false at all, IIUC. walk_cb() may also return false on other conditions. > > In general, it's safer to always have the same behavior when holding or > releasing a reference. I think the caller should then always call > path_put() after vfs_walk_ancestors() whatever the return code is. > >> Maybe “updating @path to the last ancestor” should only apply to >> LOOKUP_RCU==false case? >> >>> @flags could contain LOOKUP_RCU or not, which enables us to have >>> walk_cb() not-RCU compatible. >>> >>> When passing LOOKUP_RCU, if the first call to vfs_walk_ancestors() >>> failed with -ECHILD, the caller can restart the walk by calling >>> vfs_walk_ancestors() again but without LOOKUP_RCU. >> >> >> Given we want callers to handle -ECHILD and call vfs_walk_ancestors >> again without LOOKUP_RCU, I think we should keep @path not changed >> With LOOKUP_RCU==true, and only update it to the last ancestor >> when LOOKUP_RCU==false. > > As Neil said, we don't want to explicitly pass LOOKUP_RCU as a public > flag. Instead, walk_cb() should never sleep (and then potentially be > called under RCU by the vfs_walk_ancestors() implementation). How should the user handle -ECHILD without LOOKUP_RCU flag? Say the following code in landlocked: /* Try RCU walk first */ err = vfs_walk_ancestors(path, ll_cb, data, LOOKUP_RCU); if (err == -ECHILD) { struct path walk_path = *path; /* reset any data changed by the walk */ reset_data(data); /* now do ref-walk */ err = vfs_walk_ancestors(&walk_path, ll_cb, data, 0); } Or do you mean vfs_walk_ancestors will never return -ECHILD? Then we need vfs_walk_ancestors to call reset_data logic, right? Thanks, Song
On Thu, 10 Jul 2025, Song Liu wrote: > > > > On Jul 9, 2025, at 9:06 AM, Mickaël Salaün <mic@digikod.net> wrote:\ > > [...] > > >> If necessary, we hide “root" inside @data. This is good. > >> > >>> @path would be updated with latest ancestor path (e.g. @root). > >> > >> Update @path to the last ancestor and hold proper references. > >> I missed this part earlier. With this feature, vfs_walk_ancestors > >> should work usable with open-codeed bpf path iterator. > >> > >> I have a question about this behavior with RCU walk. IIUC, RCU > >> walk does not hold reference to @ancestor when calling walk_cb(). > > > > I think a reference to the mount should be held, but not necessarily to > > the dentry if we are still in the same mount as the original path. > > If we update @path and do path_put() after the walk, we have to hold > reference to both the mnt and the dentry, no? > > > > >> If walk_cb() returns false, shall vfs_walk_ancestors() then > >> grab a reference on @ancestor? This feels a bit weird to me. > > > > If walk_cb() checks for a root, it will return false when the path will > > match, and the caller would expect to get this root path, right? > > If the user want to walk to the global root, walk_cb() may not > return false at all, IIUC. walk_cb() may also return false on > other conditions. > > > > > In general, it's safer to always have the same behavior when holding or > > releasing a reference. I think the caller should then always call > > path_put() after vfs_walk_ancestors() whatever the return code is. > > > >> Maybe “updating @path to the last ancestor” should only apply to > >> LOOKUP_RCU==false case? > >> > >>> @flags could contain LOOKUP_RCU or not, which enables us to have > >>> walk_cb() not-RCU compatible. > >>> > >>> When passing LOOKUP_RCU, if the first call to vfs_walk_ancestors() > >>> failed with -ECHILD, the caller can restart the walk by calling > >>> vfs_walk_ancestors() again but without LOOKUP_RCU. > >> > >> > >> Given we want callers to handle -ECHILD and call vfs_walk_ancestors > >> again without LOOKUP_RCU, I think we should keep @path not changed > >> With LOOKUP_RCU==true, and only update it to the last ancestor > >> when LOOKUP_RCU==false. > > > > As Neil said, we don't want to explicitly pass LOOKUP_RCU as a public > > flag. Instead, walk_cb() should never sleep (and then potentially be > > called under RCU by the vfs_walk_ancestors() implementation). > > How should the user handle -ECHILD without LOOKUP_RCU flag? Say the > following code in landlocked: > > /* Try RCU walk first */ > err = vfs_walk_ancestors(path, ll_cb, data, LOOKUP_RCU); > > if (err == -ECHILD) { > struct path walk_path = *path; > > /* reset any data changed by the walk */ > reset_data(data); > > /* now do ref-walk */ > err = vfs_walk_ancestors(&walk_path, ll_cb, data, 0); > } > > Or do you mean vfs_walk_ancestors will never return -ECHILD? > Then we need vfs_walk_ancestors to call reset_data logic, right? It isn't clear to me that vfs_walk_ancestors() needs to return anything. All the communication happens through walk_cb() walk_cb() is called with a path, the data, and a "may_sleep" flag. If it needs to sleep but may_sleep is not set, it returns "-ECHILD" which causes the walk to restart and use refcounts. If it wants to stop, it returns 0. If it wants to continue, it returns 1. If it wants a reference to the path then it can use (new) vfs_legitimize_path() which might fail. If it wants a reference to the path and may_sleep is true, it can use path_get() which won't fail. When returning -ECHILD (either because of a need to sleep or because vfs_legitimize_path() fails), walk_cb() would reset_data(). NeilBrown
> On Jul 9, 2025, at 3:24 PM, NeilBrown <neil@brown.name> wrote: [...] >> >> How should the user handle -ECHILD without LOOKUP_RCU flag? Say the >> following code in landlocked: >> >> /* Try RCU walk first */ >> err = vfs_walk_ancestors(path, ll_cb, data, LOOKUP_RCU); >> >> if (err == -ECHILD) { >> struct path walk_path = *path; >> >> /* reset any data changed by the walk */ >> reset_data(data); >> >> /* now do ref-walk */ >> err = vfs_walk_ancestors(&walk_path, ll_cb, data, 0); >> } >> >> Or do you mean vfs_walk_ancestors will never return -ECHILD? >> Then we need vfs_walk_ancestors to call reset_data logic, right? > > It isn't clear to me that vfs_walk_ancestors() needs to return anything. > All the communication happens through walk_cb() > > walk_cb() is called with a path, the data, and a "may_sleep" flag. > If it needs to sleep but may_sleep is not set, it returns "-ECHILD" > which causes the walk to restart and use refcounts. > If it wants to stop, it returns 0. > If it wants to continue, it returns 1. > If it wants a reference to the path then it can use (new) > vfs_legitimize_path() which might fail. > If it wants a reference to the path and may_sleep is true, it can use > path_get() which won't fail. > > When returning -ECHILD (either because of a need to sleep or because > vfs_legitimize_path() fails), walk_cb() would reset_data(). This might actually work. My only concern is with vfs_legitimize_path. It is probably safer if we only allow taking references with may_sleep==true, so that path_get won’t fail. In this case, we will not need walk_cb() to call vfs_legitimize_path. If the user want a reference, the walk_cb will first return -ECHILD, and call path_get when may_sleep is true. Does this make sense? Did I miss any cases? Thanks, Song
On Thu, 10 Jul 2025, Song Liu wrote: > > > > On Jul 9, 2025, at 3:24 PM, NeilBrown <neil@brown.name> wrote: > [...] > >> > >> How should the user handle -ECHILD without LOOKUP_RCU flag? Say the > >> following code in landlocked: > >> > >> /* Try RCU walk first */ > >> err = vfs_walk_ancestors(path, ll_cb, data, LOOKUP_RCU); > >> > >> if (err == -ECHILD) { > >> struct path walk_path = *path; > >> > >> /* reset any data changed by the walk */ > >> reset_data(data); > >> > >> /* now do ref-walk */ > >> err = vfs_walk_ancestors(&walk_path, ll_cb, data, 0); > >> } > >> > >> Or do you mean vfs_walk_ancestors will never return -ECHILD? > >> Then we need vfs_walk_ancestors to call reset_data logic, right? > > > > It isn't clear to me that vfs_walk_ancestors() needs to return anything. > > All the communication happens through walk_cb() > > > > walk_cb() is called with a path, the data, and a "may_sleep" flag. > > If it needs to sleep but may_sleep is not set, it returns "-ECHILD" > > which causes the walk to restart and use refcounts. > > If it wants to stop, it returns 0. > > If it wants to continue, it returns 1. > > If it wants a reference to the path then it can use (new) > > vfs_legitimize_path() which might fail. > > If it wants a reference to the path and may_sleep is true, it can use > > path_get() which won't fail. > > > > When returning -ECHILD (either because of a need to sleep or because > > vfs_legitimize_path() fails), walk_cb() would reset_data(). > > This might actually work. > > My only concern is with vfs_legitimize_path. It is probably safer if > we only allow taking references with may_sleep==true, so that path_get > won’t fail. In this case, we will not need walk_cb() to call > vfs_legitimize_path. If the user want a reference, the walk_cb will > first return -ECHILD, and call path_get when may_sleep is true. What is your concern with vfs_legitimize_path() ?? I've since realised that always restarting in response to -ECHILD isn't necessary and isn't how normal path-walk works. Restarting might be needed, but the first response to -ECHILD is to try legitimize_path(). If that succeeds, then it is safe to sleep. So returning -ECHILD might just result in vfs_walk_ancestors() calling legitimize_path() and then calling walk_cb() again. Why not have walk_cb() do the vfs_legitimize_path() call (which will almost always succeed in practice). NeilBrown > > Does this make sense? Did I miss any cases? > > Thanks, > Song > >
> On Jul 9, 2025, at 5:58 PM, NeilBrown <neil@brown.name> wrote: > > On Thu, 10 Jul 2025, Song Liu wrote: >> >> >>> On Jul 9, 2025, at 3:24 PM, NeilBrown <neil@brown.name> wrote: >> [...] >>>> >>>> How should the user handle -ECHILD without LOOKUP_RCU flag? Say the >>>> following code in landlocked: >>>> >>>> /* Try RCU walk first */ >>>> err = vfs_walk_ancestors(path, ll_cb, data, LOOKUP_RCU); >>>> >>>> if (err == -ECHILD) { >>>> struct path walk_path = *path; >>>> >>>> /* reset any data changed by the walk */ >>>> reset_data(data); >>>> >>>> /* now do ref-walk */ >>>> err = vfs_walk_ancestors(&walk_path, ll_cb, data, 0); >>>> } >>>> >>>> Or do you mean vfs_walk_ancestors will never return -ECHILD? >>>> Then we need vfs_walk_ancestors to call reset_data logic, right? >>> >>> It isn't clear to me that vfs_walk_ancestors() needs to return anything. >>> All the communication happens through walk_cb() >>> >>> walk_cb() is called with a path, the data, and a "may_sleep" flag. >>> If it needs to sleep but may_sleep is not set, it returns "-ECHILD" >>> which causes the walk to restart and use refcounts. >>> If it wants to stop, it returns 0. >>> If it wants to continue, it returns 1. >>> If it wants a reference to the path then it can use (new) >>> vfs_legitimize_path() which might fail. >>> If it wants a reference to the path and may_sleep is true, it can use >>> path_get() which won't fail. >>> >>> When returning -ECHILD (either because of a need to sleep or because >>> vfs_legitimize_path() fails), walk_cb() would reset_data(). >> >> This might actually work. >> >> My only concern is with vfs_legitimize_path. It is probably safer if >> we only allow taking references with may_sleep==true, so that path_get >> won’t fail. In this case, we will not need walk_cb() to call >> vfs_legitimize_path. If the user want a reference, the walk_cb will >> first return -ECHILD, and call path_get when may_sleep is true. > > What is your concern with vfs_legitimize_path() ?? > > I've since realised that always restarting in response to -ECHILD isn't > necessary and isn't how normal path-walk works. Restarting might be > needed, but the first response to -ECHILD is to try legitimize_path(). > If that succeeds, then it is safe to sleep. > So returning -ECHILD might just result in vfs_walk_ancestors() calling > legitimize_path() and then calling walk_cb() again. Why not have > walk_cb() do the vfs_legitimize_path() call (which will almost always > succeed in practice). After reading the emails and the code more, I think I misunderstood why we need to call vfs_legitimize_path(). The goal of “legitimize” is to get a reference on @path, so a reference-less walk may not need legitimize_path() at all. Do I get this right this time? However, I still have some concern with legitimize_path: it requires m_seq and r_seq recorded at the beginning of the walk, do we want to pass those to walk_cb()? IIUC, one of the reason we prefer a callback based solution is that it doesn’t expose nameidata (or a subset of it). Letting walk_cb to call legitimize_path appears to defeat this benefit, no? A separate question below. I still have some question about how vfs_walk_ancestors() and the walk_cb() interact. Let’s look at the landlock use case: the user (landlock) just want to look at each ancestor, but doesn’t need to take any references. walk_cb() will check @path against @root, and return 0 when @path is the same as @root. IIUC, in this case, we will record m_seq and r_seq at the beginning of vfs_walk_ancestors(), and check them against mount_lock and rename_lock at the end of the walk. (Maybe we also need to check them at some points before the end of the walk?) If either seq changed during the walk, we need to restart the walk, and take reference on each step. Did I get this right so far? If the above is right, here are my questions about the reference-less walk above: 1. Which function (vfs_walk_ancestors or walk_cb) will check m_seq and r_seq? I think vfs_walk_ancestors should check them. 2. When either seq changes, which function will call reset_data? I think there are 3 options here: 2.a: vfs_walk_ancestors calls reset_data, which will be another callback function the caller passes to vfs_walk_ancestors. 2.b: walk_cb will call reset_data(), but we need a mechanism to tell walk_cb to do it, maybe a “restart” flag? 2.c: Caller of vfs_walk_ancestors will call reset_data(). In this case, vfs_walk_ancestors will return -ECHILD to its caller. But I think this option is NACKed. I think the right solution is to have vfs_walk_ancestors check m_seq and r_seq, and have walk_cb call reset_data. But this is Different to the proposal above. Do my questions above make any sense? Or maybe I totally misunderstood something? Thanks, Song
> On Jul 9, 2025, at 11:28 PM, Song Liu <songliubraving@meta.com> wrote: [...] >>>> It isn't clear to me that vfs_walk_ancestors() needs to return anything. >>>> All the communication happens through walk_cb() >>>> >>>> walk_cb() is called with a path, the data, and a "may_sleep" flag. >>>> If it needs to sleep but may_sleep is not set, it returns "-ECHILD" >>>> which causes the walk to restart and use refcounts. >>>> If it wants to stop, it returns 0. >>>> If it wants to continue, it returns 1. >>>> If it wants a reference to the path then it can use (new) >>>> vfs_legitimize_path() which might fail. >>>> If it wants a reference to the path and may_sleep is true, it can use >>>> path_get() which won't fail. >>>> >>>> When returning -ECHILD (either because of a need to sleep or because >>>> vfs_legitimize_path() fails), walk_cb() would reset_data(). >>> >>> This might actually work. >>> >>> My only concern is with vfs_legitimize_path. It is probably safer if >>> we only allow taking references with may_sleep==true, so that path_get >>> won’t fail. In this case, we will not need walk_cb() to call >>> vfs_legitimize_path. If the user want a reference, the walk_cb will >>> first return -ECHILD, and call path_get when may_sleep is true. >> >> What is your concern with vfs_legitimize_path() ?? >> >> I've since realised that always restarting in response to -ECHILD isn't >> necessary and isn't how normal path-walk works. Restarting might be >> needed, but the first response to -ECHILD is to try legitimize_path(). >> If that succeeds, then it is safe to sleep. >> So returning -ECHILD might just result in vfs_walk_ancestors() calling >> legitimize_path() and then calling walk_cb() again. Why not have >> walk_cb() do the vfs_legitimize_path() call (which will almost always >> succeed in practice). > > After reading the emails and the code more, I think I misunderstood > why we need to call vfs_legitimize_path(). The goal of “legitimize” > is to get a reference on @path, so a reference-less walk may not > need legitimize_path() at all. Do I get this right this time? > > However, I still have some concern with legitimize_path: it requires > m_seq and r_seq recorded at the beginning of the walk, do we want > to pass those to walk_cb()? IIUC, one of the reason we prefer a > callback based solution is that it doesn’t expose nameidata (or a > subset of it). Letting walk_cb to call legitimize_path appears to > defeat this benefit, no? > > > A separate question below. > > I still have some question about how vfs_walk_ancestors() and the > walk_cb() interact. Let’s look at the landlock use case: the user > (landlock) just want to look at each ancestor, but doesn’t need to > take any references. walk_cb() will check @path against @root, and > return 0 when @path is the same as @root. > > IIUC, in this case, we will record m_seq and r_seq at the beginning > of vfs_walk_ancestors(), and check them against mount_lock and > rename_lock at the end of the walk. (Maybe we also need to check > them at some points before the end of the walk?) If either seq > changed during the walk, we need to restart the walk, and take > reference on each step. Did I get this right so far? > > If the above is right, here are my questions about the > reference-less walk above: > > 1. Which function (vfs_walk_ancestors or walk_cb) will check m_seq > and r_seq? I think vfs_walk_ancestors should check them. > 2. When either seq changes, which function will call reset_data? > I think there are 3 options here: > 2.a: vfs_walk_ancestors calls reset_data, which will be another > callback function the caller passes to vfs_walk_ancestors. > 2.b: walk_cb will call reset_data(), but we need a mechanism to > tell walk_cb to do it, maybe a “restart” flag? > 2.c: Caller of vfs_walk_ancestors will call reset_data(). In > this case, vfs_walk_ancestors will return -ECHILD to its > caller. But I think this option is NACKed. > > I think the right solution is to have vfs_walk_ancestors check > m_seq and r_seq, and have walk_cb call reset_data. But this is > Different to the proposal above. > > Do my questions above make any sense? Or maybe I totally > misunderstood something? Hi Neil, Did my questions/comments above make sense? I am hoping we can agree on some design soon. Christian and Mickaël, Could you please also share your thoughts on this? Current requirements from BPF side is straightforward: we just need a mechanism to “walk up one level and hold reference”. So most of the requirement comes from LandLock side. Thanks, Song
On Mon, Jul 14, 2025 at 09:09:42PM +0000, Song Liu wrote: > > > On Jul 9, 2025, at 11:28 PM, Song Liu <songliubraving@meta.com> wrote: > > [...] > > >>>> It isn't clear to me that vfs_walk_ancestors() needs to return anything. > >>>> All the communication happens through walk_cb() > >>>> > >>>> walk_cb() is called with a path, the data, and a "may_sleep" flag. > >>>> If it needs to sleep but may_sleep is not set, it returns "-ECHILD" > >>>> which causes the walk to restart and use refcounts. > >>>> If it wants to stop, it returns 0. > >>>> If it wants to continue, it returns 1. > >>>> If it wants a reference to the path then it can use (new) > >>>> vfs_legitimize_path() which might fail. > >>>> If it wants a reference to the path and may_sleep is true, it can use > >>>> path_get() which won't fail. > >>>> > >>>> When returning -ECHILD (either because of a need to sleep or because > >>>> vfs_legitimize_path() fails), walk_cb() would reset_data(). > >>> > >>> This might actually work. > >>> > >>> My only concern is with vfs_legitimize_path. It is probably safer if > >>> we only allow taking references with may_sleep==true, so that path_get > >>> won’t fail. In this case, we will not need walk_cb() to call > >>> vfs_legitimize_path. If the user want a reference, the walk_cb will > >>> first return -ECHILD, and call path_get when may_sleep is true. > >> > >> What is your concern with vfs_legitimize_path() ?? > >> > >> I've since realised that always restarting in response to -ECHILD isn't > >> necessary and isn't how normal path-walk works. Restarting might be > >> needed, but the first response to -ECHILD is to try legitimize_path(). > >> If that succeeds, then it is safe to sleep. > >> So returning -ECHILD might just result in vfs_walk_ancestors() calling > >> legitimize_path() and then calling walk_cb() again. Why not have > >> walk_cb() do the vfs_legitimize_path() call (which will almost always > >> succeed in practice). > > > > After reading the emails and the code more, I think I misunderstood > > why we need to call vfs_legitimize_path(). The goal of “legitimize” > > is to get a reference on @path, so a reference-less walk may not > > need legitimize_path() at all. Do I get this right this time? > > > > However, I still have some concern with legitimize_path: it requires > > m_seq and r_seq recorded at the beginning of the walk, do we want > > to pass those to walk_cb()? IIUC, one of the reason we prefer a > > callback based solution is that it doesn’t expose nameidata (or a > > subset of it). Letting walk_cb to call legitimize_path appears to > > defeat this benefit, no? Yes, walk_cb() should be very light and non-blocking/non-sleepable. If the caller cannot give these guarantees, then it can just pass NULL instead of a valid walk_cb(), and continue the walk (if needed) by calling the vfs_walk_ancentors() helper again, which would not benefit from the RCU optimization in this case. Before this patch series land, handling of disconnected directories should be well defined, or at least let the caller deal with it. How do you plan to handle disconnected directories for the eBPF use case? See https://lore.kernel.org/all/20250719104204.545188-1-mic@digikod.net/ Unfortunately, this issue is not solved for Landlock yet. > > > > > > A separate question below. > > > > I still have some question about how vfs_walk_ancestors() and the > > walk_cb() interact. Let’s look at the landlock use case: the user > > (landlock) just want to look at each ancestor, but doesn’t need to > > take any references. walk_cb() will check @path against @root, and > > return 0 when @path is the same as @root. > > > > IIUC, in this case, we will record m_seq and r_seq at the beginning > > of vfs_walk_ancestors(), and check them against mount_lock and > > rename_lock at the end of the walk. (Maybe we also need to check > > them at some points before the end of the walk?) If either seq > > changed during the walk, we need to restart the walk, and take > > reference on each step. Did I get this right so far? I think so. You should get some inspiration from prepend_path(). > > > > If the above is right, here are my questions about the > > reference-less walk above: > > > > 1. Which function (vfs_walk_ancestors or walk_cb) will check m_seq > > and r_seq? I think vfs_walk_ancestors should check them. Yes, walk_cb() should be as simple as possible: the simpler version should just return a constant. > > 2. When either seq changes, which function will call reset_data? > > I think there are 3 options here: > > 2.a: vfs_walk_ancestors calls reset_data, which will be another > > callback function the caller passes to vfs_walk_ancestors. > > 2.b: walk_cb will call reset_data(), but we need a mechanism to > > tell walk_cb to do it, maybe a “restart” flag? > > 2.c: Caller of vfs_walk_ancestors will call reset_data(). In > > this case, vfs_walk_ancestors will return -ECHILD to its > > caller. But I think this option is NACKed. > > > > I think the right solution is to have vfs_walk_ancestors check > > m_seq and r_seq, and have walk_cb call reset_data. But this is > > Different to the proposal above. I'm not sure a reset_data() would be useful if walk_cb() never sleep. If we really need such reset_data(), a fourth option would be for walk_cb() to return a specific value (an enum instead of a bool) to trigger the reset. > > > > Do my questions above make any sense? Or maybe I totally > > misunderstood something? > > Hi Neil, > > Did my questions/comments above make sense? I am hoping we can > agree on some design soon. > > Christian and Mickaël, > > Could you please also share your thoughts on this? > > Current requirements from BPF side is straightforward: we just > need a mechanism to “walk up one level and hold reference”. So > most of the requirement comes from LandLock side. Have you thought about how to handle disconnected directories? > > Thanks, > Song >
> On Jul 25, 2025, at 1:35 AM, Mickaël Salaün <mic@digikod.net> wrote: [...] >>> >>> Do my questions above make any sense? Or maybe I totally >>> misunderstood something? >> >> Hi Neil, >> >> Did my questions/comments above make sense? I am hoping we can >> agree on some design soon. >> >> Christian and Mickaël, >> >> Could you please also share your thoughts on this? >> >> Current requirements from BPF side is straightforward: we just >> need a mechanism to “walk up one level and hold reference”. So >> most of the requirement comes from LandLock side. > > Have you thought about how to handle disconnected directories? In the case of open-coded path iterator, the iterator will return a special value for disconnected roots and disconnected directories. Then the BPF program need to handle them based on the policy. Thanks, Song
> On Jun 26, 2025, at 3:51 PM, NeilBrown <neil@brown.name> wrote: [...] >> Unfortunately, the BPF use case is more complicated. In some cases, >> the callback function cannot be call in rcu critical sections. For >> example, the callback may need to read xatter. For these cases, we >> we cannot use RCU walk at all. > > I really think you should stop using the terms RCU walk and ref-walk. I > think they might be focusing your thinking in an unhelpful direction. > > The key issue about reading xattrs is that it might need to sleep. > Focusing on what might need to sleep and what will never need to sleep > is a useful approach - the distinction is wide spread in the kernel and > several function take a flag indicating if they are permitted to sleep, > or if failure when sleeping would be required. > > So your above observation is better described as > > The vfs_walk_ancestors() API has an (implicit) requirement that the > callback mustn't sleep. This is a problem for some use-cases > where the call back might need to sleep - e.g. for accessing xattrs. > > That is a good and useful observation. I can see three possibly > responses: > > 1/ Add a vfs_walk_ancestors_maysleep() API for which the callback is > always allowed to sleep. I don't particularly like this approach. > > 2/ Use repeated calls to vfs_walk_parent() when the handling of each > ancestor might need to sleep. I see no problem with supporting both > vfs_walk_ancestors() and vfs_walk_parent(). There is plenty of > precedent for having different interfaces for different use cases. I prefer option 2. > > 3/ Extend vfs_walk_ancestors() to pass a "may sleep" flag to the callback. > If the callback finds that it needs to sleep but that "may sleep" > isn't set, it returns some well known status, like -EWOULDBLOCK (or > -ECHILD). It can expect to be called again but with "may sleep" set. > This is my preferred approach. There is precedent with the > d_revalidate callbacks which works like this. > I suspect that accessing xattrs might often be possible without > sleeping. It is conceivable that we could add a "may sleep" argument > to vfs_getxattr() so that it could still often be used without > requiring vfs_walk_ancestors() to permit sleeping. > This would almost certainly require a clear demonstration that > there was a performance cost in not having the option of non-sleeping > vfs_getxattr(). For built-in kernel code, I can see this works. However, I don’t see why it is necessary to introduce the extra complexity of -EWOULDBLOCK, and vfs_get_xattr_cannot_sleep, etc. A separate step-by-step walking API is much cleaner. > >>> I strongly suggest you stop thinking about rcu-walk vs ref-walk. Think >>> about the needs of your code. If you need a high-performance API, then >>> ask for a high-performance API, don't assume what form it will take or >>> what the internal implementation details will be. >> >> At the moment, we need a ref-walk API on the BPF side. The RCU walk >> is a totally separate topic. > > Do you mean "we need step-by-step walking" or do you mean "we need to > potentially sleep for each ancestor"? These are conceptually different > requirements, but I cannot tell which you mean when you talk about "RCU > walk”. To be extra clear, I mean we need "step-by-step and take-reference-on-each-step walking”, for existing use cases. In the future, if it is possible to have a “do-not-take-reference, cannot-sleep, callback-based walking”. We may try to use that for some use cases. But that won’t replace step-by-step walking for all users. Thanks, Song
On Thu, Jun 26, 2025 at 05:52:50AM +0000, Song Liu wrote: > > > > On Jun 25, 2025, at 6:05 PM, NeilBrown <neil@brown.name> wrote: > > [...] > > >> > >> I can't speak for Mickaël, but a callback-based interface is less flexible > >> (and _maybe_ less performant?). Also, probably we will want to fallback > >> to a reference-taking walk if the walk fails (rather than, say, retry > >> infinitely), and this should probably use Song's proposed iterator. I'm > >> not sure if Song would be keen to rewrite this iterator patch series in > >> callback style (to be clear, it doesn't necessarily seem like a good idea > >> to me, and I'm not asking him to), which means that we will end up with > >> the reference walk API being a "call this function repeatedly", and the > >> rcu walk API taking a callback. I think it is still workable (after all, > >> if Landlock wants to reuse the code in the callback it can just call the > >> callback function itself when doing the reference walk), but it seems a > >> bit "ugly" to me. > > > > call-back can have a performance impact (less opportunity for compiler > > optimisation and CPU speculation), though less than taking spinlock and > > references. However Al and Christian have drawn a hard line against > > making seq numbers visible outside VFS code so I think it is the > > approach most likely to be accepted. > > > > Certainly vfs_walk_ancestors() would fallback to ref-walk if rcu-walk > > resulted in -ECHILD - just like all other path walking code in namei.c. > > This would be largely transparent to the caller - the caller would only > > see that the callback received a NULL path indicating a restart. It > > wouldn't need to know why. Given the constraints this looks good to me. Here is an updated API with two extra consts, an updated walk_cb() signature, and a new "flags" and without @root: int vfs_walk_ancestors(struct path *path, bool (*walk_cb)(const struct path *ancestor, void *data), void *data, int flags) The walk continue while walk_cb() returns true. walk_cb() can then check if @ancestor is equal to a @root, or other properties. The walk_cb() return value (if not bool) should not be returned by vfs_walk_ancestors() because a walk stop doesn't mean an error. @path would be updated with latest ancestor path (e.g. @root). @flags could contain LOOKUP_RCU or not, which enables us to have walk_cb() not-RCU compatible. When passing LOOKUP_RCU, if the first call to vfs_walk_ancestors() failed with -ECHILD, the caller can restart the walk by calling vfs_walk_ancestors() again but without LOOKUP_RCU. > > I guess I misunderstood the proposal of vfs_walk_ancestors() > initially, so some clarification: > > I think vfs_walk_ancestors() is good for the rcu-walk, and some > rcu-then-ref-walk. However, I don’t think it fits all use cases. > A reliable step-by-step ref-walk, like this set, works well with > BPF, and we want to keep it. The above updated API should work for both use cases: if the caller wants to walk only one level, walk_cb() can just always return false (and potentially save that it was called) and then stop the walk the first time it is called. This makes it possible to write an eBPF helper with the same API as path_walk_parent(), while making the kernel API more flexible. > > Can we ship this set as-is (or after fixing the comment reported > by kernel test robot)? I really don’t think we need figure out > all details about the rcu-walk here. > > Once this is landed, we can try implementing the rcu-walk > (vfs_walk_ancestors or some variation). If no one volunteers, I > can give it a try. My understanding is that Christian only wants one helper (that should handle both use cases). I think this updated API should be good enough for everyone. Most of your code should stay the same. What do you think? > > Thanks, > Song >
> On Jun 26, 2025, at 2:43 AM, Mickaël Salaün <mic@digikod.net> wrote: [...] > Given the constraints this looks good to me. Here is an updated API > with two extra consts, an updated walk_cb() signature, and a new > "flags" and without @root: > > int vfs_walk_ancestors(struct path *path, > bool (*walk_cb)(const struct path *ancestor, void *data), > void *data, int flags) > > The walk continue while walk_cb() returns true. walk_cb() can then > check if @ancestor is equal to a @root, or other properties. The > walk_cb() return value (if not bool) should not be returned by > vfs_walk_ancestors() because a walk stop doesn't mean an error. > > @path would be updated with latest ancestor path (e.g. @root). > @flags could contain LOOKUP_RCU or not, which enables us to have > walk_cb() not-RCU compatible. > > When passing LOOKUP_RCU, if the first call to vfs_walk_ancestors() > failed with -ECHILD, the caller can restart the walk by calling > vfs_walk_ancestors() again but without LOOKUP_RCU. IIUC, Neil is against using LOOKUP_RCU as input, and VFS folks may share same the concerns. >> >> I guess I misunderstood the proposal of vfs_walk_ancestors() >> initially, so some clarification: >> >> I think vfs_walk_ancestors() is good for the rcu-walk, and some >> rcu-then-ref-walk. However, I don’t think it fits all use cases. >> A reliable step-by-step ref-walk, like this set, works well with >> BPF, and we want to keep it. > > The above updated API should work for both use cases: if the caller wants > to walk only one level, walk_cb() can just always return false (and > potentially save that it was called) and then stop the walk the first > time it is called. This makes it possible to write an eBPF helper with > the same API as path_walk_parent(), while making the kernel API more > flexible. I don’t think this will be the same. Current path_walk_parent() holds a reference on parent path, and returns control to the caller. However, when vfs_walk_ancestors() returns, it should not hold any extra reference, right? IOW, vfs_walk_ancestors may hold some reference between callbacks, but is expected to release these references before finally returning to the caller. > Can we ship this set as-is (or after fixing the comment reported >> by kernel test robot)? I really don’t think we need figure out >> all details about the rcu-walk here. >> >> Once this is landed, we can try implementing the rcu-walk >> (vfs_walk_ancestors or some variation). If no one volunteers, I >> can give it a try. > > My understanding is that Christian only wants one helper (that should > handle both use cases). I think this updated API should be good enough > for everyone. Most of your code should stay the same. What do you > think? Christian, could you please clarify this requirement? Given different expectations in how the references are handled (see above), I don’t think we can fit all use cases in one API. However, if we find such an API in the future, which works for all cases, we can refactor BPF side code to use that instead. Therefore, even we have a “one API only” requirement, it is not necessary to delay this set for a RCU-walk API. Thanks, Song
On Wed, Jun 25, 2025 at 4:05 PM NeilBrown <neil@brown.name> wrote: > > On Wed, 25 Jun 2025, Mickaël Salaün wrote: > > On Wed, Jun 25, 2025 at 07:38:53AM +1000, NeilBrown wrote: > > > > > > Can you spell out the minimum that you need? > > > > Sure. We'd like to call this new helper in a RCU > > read-side critical section and leverage this capability to speed up path > > walk when there is no concurrent hierarchy modification. This use case > > is similar to handle_dots() with LOOKUP_RCU calling follow_dotdot_rcu(). > > > > The main issue with this approach is to keep some state of the path walk > > to know if the next call to "path_walk_parent_rcu()" would be valid > > (i.e. something like a very light version of nameidata, mainly sequence > > integers), and to get back to the non-RCU version otherwise. > > > > > > > > My vague impression is that you want to search up from a given strut path, > > > no further then some other given path, looking for a dentry that matches > > > some rule. Is that correct? > > > > Yes > > > > > > > > In general, the original dentry could be moved away from under the > > > dentry you find moments after the match is reported. What mechanisms do > > > you have in place to ensure this doesn't happen, or that it doesn't > > > matter? > > > > In the case of Landlock, by default, a set of access rights are denied > > and can only be allowed by an element in the file hierarchy. The goal > > is to only allow access to files under a specific directory (or directly > > a specific file). That's why we only care of the file hierarchy at the > > time of access check. It's not an issue if the file/directory was > > moved or is being moved as long as we can walk its "current" hierarchy. > > Furthermore, a sandboxed process is restricted from doing arbitrary > > mounts (and renames/links are controlled with the > > LANDLOCK_ACCESS_FS_REFER right). > > > > However, we need to get a valid "snapshot" of the set of dentries that > > (could) lead to the evaluated file/directory. > > A "snapshot" is an interesting idea - though looking at the landlock > code you one need inodes, not dentries. > I imagine an interface where you give it a starting path, a root, and > and array of inode pointers, and it fills in the pointers with the path > - all under rcu so no references are needed. > But you would need some fallback if the array isn't big enough, so maybe > that isn't a good idea. > > Based on the comments by Al and Christian, I think the only viable > approach is to pass a callback to some vfs function that does the > walking. > > vfs_walk_ancestors(struct path *path, struct path *root, > int (*walk_cb)(struct path *ancestor, void *data), > void *data) I like this idea. Maybe we want "struct path *ancestor" of walk_cb to be const. walk_cb should only change "data", so that we can undo all the changes when the rcu walk fails. Thanks, Song
© 2016 - 2025 Red Hat, Inc.