fs/fsopen.c | 4 ++++ 1 file changed, 4 insertions(+)
From: Chen Linxuan <me@black-desk.cn>
When using fsconfig(..., FSCONFIG_CMD_CREATE, ...), the filesystem
context is retrieved from the file descriptor. Since the file structure
persists across syscall restarts, the context state is preserved:
// fs/fsopen.c
SYSCALL_DEFINE5(fsconfig, ...)
{
...
fc = fd_file(f)->private_data;
...
ret = vfs_fsconfig_locked(fc, cmd, ¶m);
...
}
In vfs_cmd_create(), the context phase is transitioned to
FS_CONTEXT_CREATING before calling vfs_get_tree():
// fs/fsopen.c
static int vfs_cmd_create(struct fs_context *fc, bool exclusive)
{
...
fc->phase = FS_CONTEXT_CREATING;
...
ret = vfs_get_tree(fc);
...
}
However, vfs_get_tree() may return -ERESTARTNOINTR if the filesystem
implementation needs to restart the syscall. For example, cgroup v1 does
this when it encounters a race condition where the root is dying:
// kernel/cgroup/cgroup-v1.c
int cgroup1_get_tree(struct fs_context *fc)
{
...
if (unlikely(ret > 0)) {
msleep(10);
return restart_syscall();
}
return ret;
}
If the syscall is restarted, fsconfig() is called again and retrieves
the *same* fs_context. However, vfs_cmd_create() rejects the call
because the phase was left as FS_CONTEXT_CREATING during the first
attempt:
if (fc->phase != FS_CONTEXT_CREATE_PARAMS)
return -EBUSY;
Fix this by resetting fc->phase back to FS_CONTEXT_CREATE_PARAMS if
vfs_get_tree() returns -ERESTARTNOINTR.
Cc: stable@vger.kernel.org
Signed-off-by: Chen Linxuan <me@black-desk.cn>
---
fs/fsopen.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/fs/fsopen.c b/fs/fsopen.c
index f645c99204eb..8a7cb031af50 100644
--- a/fs/fsopen.c
+++ b/fs/fsopen.c
@@ -229,6 +229,10 @@ static int vfs_cmd_create(struct fs_context *fc, bool exclusive)
fc->exclusive = exclusive;
ret = vfs_get_tree(fc);
+ if (ret == -ERESTARTNOINTR) {
+ fc->phase = FS_CONTEXT_CREATE_PARAMS;
+ return ret;
+ }
if (ret) {
fc->phase = FS_CONTEXT_FAILED;
return ret;
---
base-commit: 187d0801404f415f22c0b31531982c7ea97fa341
change-id: 20251213-mount-ebusy-8ee3888a7e4f
Best regards,
--
Chen Linxuan <me@black-desk.cn>
On Sat 13-12-25 02:03:56, Chen Linxuan via B4 Relay wrote:
> From: Chen Linxuan <me@black-desk.cn>
>
> When using fsconfig(..., FSCONFIG_CMD_CREATE, ...), the filesystem
> context is retrieved from the file descriptor. Since the file structure
> persists across syscall restarts, the context state is preserved:
>
> // fs/fsopen.c
> SYSCALL_DEFINE5(fsconfig, ...)
> {
> ...
> fc = fd_file(f)->private_data;
> ...
> ret = vfs_fsconfig_locked(fc, cmd, ¶m);
> ...
> }
>
> In vfs_cmd_create(), the context phase is transitioned to
> FS_CONTEXT_CREATING before calling vfs_get_tree():
>
> // fs/fsopen.c
> static int vfs_cmd_create(struct fs_context *fc, bool exclusive)
> {
> ...
> fc->phase = FS_CONTEXT_CREATING;
> ...
> ret = vfs_get_tree(fc);
> ...
> }
>
> However, vfs_get_tree() may return -ERESTARTNOINTR if the filesystem
> implementation needs to restart the syscall. For example, cgroup v1 does
> this when it encounters a race condition where the root is dying:
>
> // kernel/cgroup/cgroup-v1.c
> int cgroup1_get_tree(struct fs_context *fc)
> {
> ...
> if (unlikely(ret > 0)) {
> msleep(10);
> return restart_syscall();
> }
> return ret;
> }
>
> If the syscall is restarted, fsconfig() is called again and retrieves
> the *same* fs_context. However, vfs_cmd_create() rejects the call
> because the phase was left as FS_CONTEXT_CREATING during the first
> attempt:
Well, not quite. The phase is actually set to FS_CONTEXT_FAILED if
vfs_get_tree() returns any error. Still the effect is the same.
> if (fc->phase != FS_CONTEXT_CREATE_PARAMS)
> return -EBUSY;
>
> Fix this by resetting fc->phase back to FS_CONTEXT_CREATE_PARAMS if
> vfs_get_tree() returns -ERESTARTNOINTR.
>
> Cc: stable@vger.kernel.org
> Signed-off-by: Chen Linxuan <me@black-desk.cn>
> ---
> fs/fsopen.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/fs/fsopen.c b/fs/fsopen.c
> index f645c99204eb..8a7cb031af50 100644
> --- a/fs/fsopen.c
> +++ b/fs/fsopen.c
> @@ -229,6 +229,10 @@ static int vfs_cmd_create(struct fs_context *fc, bool exclusive)
> fc->exclusive = exclusive;
>
> ret = vfs_get_tree(fc);
> + if (ret == -ERESTARTNOINTR) {
> + fc->phase = FS_CONTEXT_CREATE_PARAMS;
> + return ret;
> + }
> if (ret) {
> fc->phase = FS_CONTEXT_FAILED;
> return ret;
Thanks for the patch. It looks good to me. I'd slightly prefer style like:
if (ret) {
if (ret == -ERESTARTNOINTR)
fc->phase = FS_CONTEXT_CREATE_PARAMS;
else
fc->phase = FS_CONTEXT_FAILED;
return ret;
}
but I can live with what you propose as well. So feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On Mon, Dec 15, 2025 at 09:46:19AM +0100, Jan Kara wrote:
> On Sat 13-12-25 02:03:56, Chen Linxuan via B4 Relay wrote:
> > From: Chen Linxuan <me@black-desk.cn>
> >
> > When using fsconfig(..., FSCONFIG_CMD_CREATE, ...), the filesystem
> > context is retrieved from the file descriptor. Since the file structure
> > persists across syscall restarts, the context state is preserved:
> >
> > // fs/fsopen.c
> > SYSCALL_DEFINE5(fsconfig, ...)
> > {
> > ...
> > fc = fd_file(f)->private_data;
> > ...
> > ret = vfs_fsconfig_locked(fc, cmd, ¶m);
> > ...
> > }
> >
> > In vfs_cmd_create(), the context phase is transitioned to
> > FS_CONTEXT_CREATING before calling vfs_get_tree():
> >
> > // fs/fsopen.c
> > static int vfs_cmd_create(struct fs_context *fc, bool exclusive)
> > {
> > ...
> > fc->phase = FS_CONTEXT_CREATING;
> > ...
> > ret = vfs_get_tree(fc);
> > ...
> > }
> >
> > However, vfs_get_tree() may return -ERESTARTNOINTR if the filesystem
> > implementation needs to restart the syscall. For example, cgroup v1 does
> > this when it encounters a race condition where the root is dying:
> >
> > // kernel/cgroup/cgroup-v1.c
> > int cgroup1_get_tree(struct fs_context *fc)
> > {
> > ...
> > if (unlikely(ret > 0)) {
> > msleep(10);
> > return restart_syscall();
> > }
> > return ret;
> > }
> >
> > If the syscall is restarted, fsconfig() is called again and retrieves
> > the *same* fs_context. However, vfs_cmd_create() rejects the call
> > because the phase was left as FS_CONTEXT_CREATING during the first
> > attempt:
>
> Well, not quite. The phase is actually set to FS_CONTEXT_FAILED if
> vfs_get_tree() returns any error. Still the effect is the same.
Uh, I'm not sure we should do this. If this only affects cgroup v1 then
I say we should simply not care at all. It's a deprecated api and anyone
using it uses something that is inherently broken and a big portion of
userspace has already migrated. The current or upcoming systemd release
has dropped all cgroup v1 support.
Generally, making fsconfig() restartable is not as trivial as it looks
because once you called into the filesystem the config that was setup
might have already been consumed. That's definitely the case for stuff
in overlayfs and others. So no, that patch won't work and btw, I
remembered that we already had that discussion a few years ago and I was
right:
https://lore.kernel.org/20200923201958.b27ecda5a1e788fb5f472bcd@virtuozzo.com
On Mon, Dec 15, 2025 at 7:55 PM Christian Brauner <brauner@kernel.org> wrote: > Uh, I'm not sure we should do this. If this only affects cgroup v1 then > I say we should simply not care at all. It's a deprecated api and anyone > using it uses something that is inherently broken and a big portion of > userspace has already migrated. The current or upcoming systemd release > has dropped all cgroup v1 support. I am not quite sure if nfs will be affected by this or not. It looks like vfs_get_tree -> nfs_get_tree -> nfs_try_get_tree -> nfs_try_get_tree -> nfs_request_mount -> nfs_mount -> rpc_call_sync -> rpc_run_task -> rpc_execute -> __rpc_execute -> rpc_wait_bit_killable might return -ERESTARTSYS.
On Mon 15-12-25 12:55:12, Christian Brauner wrote:
> On Mon, Dec 15, 2025 at 09:46:19AM +0100, Jan Kara wrote:
> > On Sat 13-12-25 02:03:56, Chen Linxuan via B4 Relay wrote:
> > > From: Chen Linxuan <me@black-desk.cn>
> > >
> > > When using fsconfig(..., FSCONFIG_CMD_CREATE, ...), the filesystem
> > > context is retrieved from the file descriptor. Since the file structure
> > > persists across syscall restarts, the context state is preserved:
> > >
> > > // fs/fsopen.c
> > > SYSCALL_DEFINE5(fsconfig, ...)
> > > {
> > > ...
> > > fc = fd_file(f)->private_data;
> > > ...
> > > ret = vfs_fsconfig_locked(fc, cmd, ¶m);
> > > ...
> > > }
> > >
> > > In vfs_cmd_create(), the context phase is transitioned to
> > > FS_CONTEXT_CREATING before calling vfs_get_tree():
> > >
> > > // fs/fsopen.c
> > > static int vfs_cmd_create(struct fs_context *fc, bool exclusive)
> > > {
> > > ...
> > > fc->phase = FS_CONTEXT_CREATING;
> > > ...
> > > ret = vfs_get_tree(fc);
> > > ...
> > > }
> > >
> > > However, vfs_get_tree() may return -ERESTARTNOINTR if the filesystem
> > > implementation needs to restart the syscall. For example, cgroup v1 does
> > > this when it encounters a race condition where the root is dying:
> > >
> > > // kernel/cgroup/cgroup-v1.c
> > > int cgroup1_get_tree(struct fs_context *fc)
> > > {
> > > ...
> > > if (unlikely(ret > 0)) {
> > > msleep(10);
> > > return restart_syscall();
> > > }
> > > return ret;
> > > }
> > >
> > > If the syscall is restarted, fsconfig() is called again and retrieves
> > > the *same* fs_context. However, vfs_cmd_create() rejects the call
> > > because the phase was left as FS_CONTEXT_CREATING during the first
> > > attempt:
> >
> > Well, not quite. The phase is actually set to FS_CONTEXT_FAILED if
> > vfs_get_tree() returns any error. Still the effect is the same.
>
> Uh, I'm not sure we should do this. If this only affects cgroup v1 then
> I say we should simply not care at all. It's a deprecated api and anyone
> using it uses something that is inherently broken and a big portion of
> userspace has already migrated. The current or upcoming systemd release
> has dropped all cgroup v1 support.
>
> Generally, making fsconfig() restartable is not as trivial as it looks
> because once you called into the filesystem the config that was setup
> might have already been consumed. That's definitely the case for stuff
> in overlayfs and others. So no, that patch won't work and btw, I
> remembered that we already had that discussion a few years ago and I was
> right:
>
> https://lore.kernel.org/20200923201958.b27ecda5a1e788fb5f472bcd@virtuozzo.com
I see. I was assuming that if the filesystem returns ERESTART* it will make
sure to not consume the context it wants to restart with. But I can see why
that might be too errorprone and not really worth the hassle in this case.
So fine by me if you don't want to go that route.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On Mon, Dec 15, 2025 at 4:46 PM Jan Kara <jack@suse.cz> wrote:
>
> On Sat 13-12-25 02:03:56, Chen Linxuan via B4 Relay wrote:
> > If the syscall is restarted, fsconfig() is called again and retrieves
> > the *same* fs_context. However, vfs_cmd_create() rejects the call
> > because the phase was left as FS_CONTEXT_CREATING during the first
> > attempt:
>
> Well, not quite. The phase is actually set to FS_CONTEXT_FAILED if
> vfs_get_tree() returns any error. Still the effect is the same.
Oh, that's a mistake.
> Thanks for the patch. It looks good to me. I'd slightly prefer style like:
>
> if (ret) {
> if (ret == -ERESTARTNOINTR)
> fc->phase = FS_CONTEXT_CREATE_PARAMS;
> else
> fc->phase = FS_CONTEXT_FAILED;
> return ret;
> }
Will be applied in v2.
© 2016 - 2026 Red Hat, Inc.