Previously, commit d53cd891f0e4 ("erofs: limit the level of fs stacking
for file-backed mounts") bumped `s_stack_depth` by one to avoid kernel
stack overflow, but it breaks composefs mounts, which sometimes need
erofs+ovl^2 (and such setups have already been used in production for
quite a long time), since `s_stack_depth` can reach 3 (i.e.,
FILESYSTEM_MAX_STACK_DEPTH would need to change from 2 to 3).
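For illustration, a minimal userspace sketch (not kernel code; overlayfs
actually takes the maximum depth across all of its layers before adding
one, simplified here to a single chain) of how the depths added up under
the old behavior:

#include <stdio.h>

#define FILESYSTEM_MAX_STACK_DEPTH	2

/* each stacked mount sits one level above its backing fs */
static int stack_on(int backing_depth) { return backing_depth + 1; }

int main(void)
{
	int xfs   = 0;			/* plain block-backed fs */
	int erofs = stack_on(xfs);	/* file-backed mount: 1 (old code) */
	int ovl1  = stack_on(erofs);	/* composefs overlay: 2 */
	int ovl2  = stack_on(ovl1);	/* outer overlay: 3 */

	printf("erofs+ovl^2: depth %d, limit %d -> %s\n", ovl2,
	       FILESYSTEM_MAX_STACK_DEPTH,
	       ovl2 > FILESYSTEM_MAX_STACK_DEPTH ? "mount fails" : "ok");
	return 0;
}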
After a long discussion on GitHub issues [1] about possible solutions,
one conclusion is that there is no need to support nesting file-backed
mounts (especially if that means increasing FILESYSTEM_MAX_STACK_DEPTH to 3).
So let's disallow this right now, since there is always a way to use
loopback devices as a fallback.
Then, I started to wonder about an alternative EROFS quick fix to
address the composefs mounts directly for this cycle: since EROFS is the
only fs to support file-backed mounts, and other stacked fses would just
bump up `FILESYSTEM_MAX_STACK_DEPTH`, just check that `s_stack_depth`
is 0 and that the backing inode is not from EROFS instead.
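A rough userspace model of the resulting rule (hypothetical names,
illustration only): a file-backed mount is accepted only if the backing
inode lives on a plain, unstacked, non-EROFS fs, and the new superblock
deliberately leaves its own `s_stack_depth` at 0:

#include <stdbool.h>
#include <stdio.h>

struct sb { int stack_depth; bool is_erofs; };

static bool fileio_backing_ok(const struct sb *backing)
{
	return !backing->is_erofs && backing->stack_depth == 0;
}

int main(void)
{
	struct sb xfs = { .stack_depth = 0, .is_erofs = false };
	struct sb ovl = { .stack_depth = 2, .is_erofs = false };
	struct sb ero = { .stack_depth = 0, .is_erofs = true };

	printf("backing on xfs:   %s\n", fileio_backing_ok(&xfs) ? "ok" : "-ENOTBLK");
	printf("backing on ovl:   %s\n", fileio_backing_ok(&ovl) ? "ok" : "-ENOTBLK");
	printf("backing on erofs: %s\n", fileio_backing_ok(&ero) ? "ok" : "-ENOTBLK");
	return 0;
}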
At least it works for all known file-backed mount use cases (composefs,
containerd, and Android APEX for some Android vendors), and the fix is
self-contained.
Let's defer increasing FILESYSTEM_MAX_STACK_DEPTH for now.
Fixes: d53cd891f0e4 ("erofs: limit the level of fs stacking for file-backed mounts")
Closes: https://github.com/coreos/fedora-coreos-tracker/issues/2087 [1]
Closes: https://lore.kernel.org/r/CAFHtUiYv4+=+JP_-JjARWjo6OwcvBj1wtYN=z0QXwCpec9sXtg@mail.gmail.com
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Alexander Larsson <alexl@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
---
fs/erofs/super.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index 937a215f626c..0cf41ed7ced8 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -644,14 +644,20 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
* fs contexts (including its own) due to self-controlled RO
* accesses/contexts and no side-effect changes that need to
* context save & restore so it can reuse the current thread
- * context. However, it still needs to bump `s_stack_depth` to
- * avoid kernel stack overflow from nested filesystems.
+ * context.
+ * However, we still need to prevent kernel stack overflow due
+ * to filesystem nesting: just ensure that s_stack_depth is 0
+ * to disallow mounting EROFS on stacked filesystems.
+ * Note: s_stack_depth is not incremented here for now, since
+ * EROFS is the only fs supporting file-backed mounts for now.
+ * It MUST change if another fs plans to support them, which
+ * may also require adjusting FILESYSTEM_MAX_STACK_DEPTH.
*/
if (erofs_is_fileio_mode(sbi)) {
- sb->s_stack_depth =
- file_inode(sbi->dif0.file)->i_sb->s_stack_depth + 1;
- if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
- erofs_err(sb, "maximum fs stacking depth exceeded");
+ inode = file_inode(sbi->dif0.file);
+ if (inode->i_sb->s_op == &erofs_sops ||
+ inode->i_sb->s_stack_depth) {
+ erofs_err(sb, "file-backed mounts cannot be applied to stacked fses");
return -ENOTBLK;
}
}
--
2.43.5
On Thu, 2026-01-01 at 04:42 +0800, Gao Xiang wrote:
> [...]
> ---
Acked-by: Alexander Larsson <alexl@redhat.com>
--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 Alexander Larsson                                          Red Hat, Inc
       alexl@redhat.com            alexander.larsson@gmail.com
He's an all-American dishevelled card sharp searching for his wife's true
killer. She's a scantily clad insomniac bounty hunter with an incredible
destiny. They fight crime!
On Wed, Dec 31, 2025 at 9:42 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
> [...]
> ---
Acked-by: Amir Goldstein <amir73il@gmail.com>
But you forgot to include details of the stack usage analysis you ran
with the erofs+ovl^2 setup.
I am guessing people will want to see this information before relaxing
s_stack_depth in this case.
Thanks,
Amir.
Hi Amir,
On 2026/1/1 23:52, Amir Goldstein wrote:
> On Wed, Dec 31, 2025 at 9:42 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>
>> [...]
>> ---
>
> Acked-by: Amir Goldstein <amir73il@gmail.com>
>
> But you forgot to include details of the stack usage analysis you ran
> with the erofs+ovl^2 setup.
>
> I am guessing people will want to see this information before relaxing
> s_stack_depth in this case.
Sorry, I didn't check emails these days. I'm not sure if posting
detailed stack traces is useful; how about adding the following
words:
Note: Here are some observations from evaluating the erofs + ovl^2
setup with an XFS backing fs:
- Regular RW workloads traverse only one overlayfs layer regardless of
the value of FILESYSTEM_MAX_STACK_DEPTH, because `upperdir=` cannot
point to another overlayfs. Therefore, for pure RW workloads, the
typical stack is always just:
overlayfs + upper fs + underlay storage
- For read-only workloads and the copy-up read part (ovl_splice_read),
the difference can lie in how many overlays are nested.
The stack just looks like either:
ovl + ovl [+ erofs] + backing fs + underlay storage
or
ovl [+ erofs] + ext4/xfs + underlay storage
- The fs reclaim path should be entered only once, so the writeback
path will not re-enter.
Sorry about my English, and I'm not sure if this is enough (e.g., the
FUSE passthrough part). I will wait for your further input (and other
acks) before sending this patch upstream.
(Also, btw, I'm not sure if it's possible to optimize read_iter and
splice_read stack usage even further in overlayfs, e.g., by handling
the real file/path directly in the top overlayfs instead of recursing,
since the permission check is already done when opening the file.)
Thanks,
Gao Xiang
>
> Thanks,
> Amir.
[+fsdevel][+overlayfs]
On Sun, Jan 4, 2026 at 4:56 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> [...]
> Sorry, I didn't check emails these days. I'm not sure if posting
> detailed stack traces is useful; how about adding the following
> words:
I didn't mean detailed stack traces, but you did run some tests with the
new possible setup and reached stack usage < 8K, so I think this is
something worth mentioning.
>
> [...]
>
> Sorry about my English, and I'm not sure if this is enough (e.g., the
> FUSE passthrough part). I will wait for your further input (and other
> acks) before sending this patch upstream.
>
I think that most people will have problems understanding this
rationale not because of the English, but because of the tech ;)
this is a bit too hand-wavy IMO.
> (Also, btw, I'm not sure if it's possible to optimize read_iter and
> splice_read stack usage even further in overlayfs, e.g., by handling
> the real file/path directly in the top overlayfs instead of recursing,
> since the permission check is already done when opening the file.)
Maybe so, but the LSM permission-to-open hook is not the same hook
as permission to read/write.
Thanks,
Amir.
On 2026/1/4 18:01, Amir Goldstein wrote:
> [...]
> I didn't mean detailed stack traces, but you did run some tests with the
> new possible setup and reached stack usage < 8K, so I think this is
The issue is that my limited stress test setup cannot cover
every case:

- I cannot find a way to reliably trigger direct reclaim from the
deepest memory allocations; is there some suggestion on this?

- I'm not sure what the preferred way is to evaluate the worst
stack usage below the block layer, but I guess we should care more
about the increase in delta from just one more overlayfs?

I can only say that the peak stack usage I've seen from my
fsstress runs on an erofs+ovl^2 setup on x86_64 is < 8K (7184
bytes, though I don't think the peak value by itself is very
useful). That exercises RW workloads in the upperdir, and for
such workloads the stack depth isn't impacted by
FILESYSTEM_MAX_STACK_DEPTH, so I don't see how such a workload
is harmful.
Then I manually copied up some files (because I didn't find any
available tool to stress overlayfs copy-ups), and I could see the
following delta (I think "ovl_copy_up_" is the only path that does
copy-ups):
0) 6688 48 mempool_alloc_slab+0x9/0x20
1) 6640 56 mempool_alloc_noprof+0x65/0xd0
2) 6584 72 __sg_alloc_table+0x128/0x190
3) 6512 40 sg_alloc_table_chained+0x46/0xa0
4) 6472 64 scsi_alloc_sgtables+0x91/0x2c0
5) 6408 72 sd_init_command+0x263/0x930
6) 6336 88 scsi_queue_rq+0x54a/0xb70
7) 6248 144 blk_mq_dispatch_rq_list+0x265/0x6c0
8) 6104 144 __blk_mq_sched_dispatch_requests+0x399/0x5c0
9) 5960 16 blk_mq_sched_dispatch_requests+0x2d/0x70
10) 5944 56 blk_mq_run_hw_queue+0x208/0x290
11) 5888 96 blk_mq_dispatch_list+0x13f/0x460
12) 5792 48 blk_mq_flush_plug_list+0x4b/0x180
13) 5744 32 blk_add_rq_to_plug+0x3d/0x160
14) 5712 136 blk_mq_submit_bio+0x4f4/0x760
15) 5576 120 __submit_bio+0x9b/0x240
16) 5456 88 submit_bio_noacct_nocheck+0x271/0x330
17) 5368 72 iomap_bio_read_folio_range+0xde/0x1d0
18) 5296 112 iomap_read_folio_iter+0x1ee/0x2d0
19) 5184 264 iomap_readahead+0xb9/0x290
20) 4920 48 xfs_vm_readahead+0x4a/0x70
21) 4872 112 read_pages+0x6c/0x1b0
22) 4760 104 page_cache_ra_unbounded+0x12c/0x210
23) 4656 80 filemap_readahead.isra.0+0x78/0xb0
24) 4576 192 filemap_get_pages+0x3a6/0x820
25) 4384 376 filemap_read+0xde/0x380
26) 4008 32 xfs_file_buffered_read+0xa6/0xd0
27) 3976 16 xfs_file_read_iter+0x6a/0xd0
28) 3960 48 vfs_iocb_iter_read+0xdb/0x140
29) 3912 88 erofs_fileio_rq_submit+0x136/0x190
30) 3824 368 z_erofs_runqueue+0x1ce/0x9f0
31) 3456 232 z_erofs_readahead+0x16c/0x220
32) 3224 112 read_pages+0x6c/0x1b0
33) 3112 104 page_cache_ra_unbounded+0x12c/0x210
34) 3008 80 filemap_readahead.isra.0+0x78/0xb0
35) 2928 192 filemap_get_pages+0x3a6/0x820
36) 2736 400 filemap_splice_read+0x12c/0x2f0
37) 2336 48 backing_file_splice_read+0x3f/0x90
38) 2288 128 ovl_splice_read+0xef/0x170
39) 2160 104 splice_direct_to_actor+0xb9/0x260
40) 2056 88 do_splice_direct+0x76/0xc0
41) 1968 120 ovl_copy_up_file+0x1a8/0x2b0
42) 1848 840 ovl_copy_up_one+0x14b0/0x1610
43) 1008 72 ovl_copy_up_flags+0xd7/0x110
44) 936 56 ovl_open+0x72/0x110
45) 880 56 do_dentry_open+0x16c/0x480
46) 824 40 vfs_open+0x2e/0xf0
47) 784 152 path_openat+0x80a/0x12e0
48) 632 296 do_filp_open+0xb8/0x160
49) 336 80 do_sys_openat2+0x72/0xf0
50) 256 40 __x64_sys_openat+0x57/0xa0
51) 216 40 do_syscall_64+0xa4/0x310
52) 176 176 entry_SYSCALL_64_after_hwframe+0x77/0x7f
And it's still far from overflowing the 16k stacks, because the
difference seems to be only how many (ovl_splice_read +
backing_file_splice_read) pairs are nested, and each layer only
takes hundreds of bytes.
Finally, I used my own rostress tool to stress RO workloads, and
the deepest stack so far is as below (5456 bytes):
0) 5456 48 arch_scale_cpu_capacity+0x9/0x30
1) 5408 16 cpu_util.constprop.0+0x7e/0xe0
2) 5392 392 sched_balance_find_src_group+0x29f/0xd30
3) 5000 280 sched_balance_rq+0x1b2/0xf10
4) 4720 120 pick_next_task_fair+0x23b/0x7b0
5) 4600 104 __schedule+0x2bc/0xda0
6) 4496 16 schedule+0x27/0xd0
7) 4480 24 io_schedule+0x46/0x70
8) 4456 120 blk_mq_get_tag+0x11b/0x280
9) 4336 96 __blk_mq_alloc_requests+0x2a1/0x410
10) 4240 136 blk_mq_submit_bio+0x59c/0x760
11) 4104 120 __submit_bio+0x9b/0x240
12) 3984 88 submit_bio_noacct_nocheck+0x271/0x330
13) 3896 72 iomap_bio_read_folio_range+0xde/0x1d0
14) 3824 112 iomap_read_folio_iter+0x1ee/0x2d0
15) 3712 264 iomap_readahead+0xb9/0x290
16) 3448 48 xfs_vm_readahead+0x4a/0x70
17) 3400 112 read_pages+0x6c/0x1b0
18) 3288 104 page_cache_ra_unbounded+0x12c/0x210
19) 3184 80 filemap_readahead.isra.0+0x78/0xb0
20) 3104 192 filemap_get_pages+0x3a6/0x820
21) 2912 376 filemap_read+0xde/0x380
22) 2536 32 xfs_file_buffered_read+0xa6/0xd0
23) 2504 16 xfs_file_read_iter+0x6a/0xd0
24) 2488 48 vfs_iocb_iter_read+0xdb/0x140
25) 2440 88 erofs_fileio_rq_submit+0x136/0x190
26) 2352 368 z_erofs_runqueue+0x1ce/0x9f0
27) 1984 232 z_erofs_readahead+0x16c/0x220
28) 1752 112 read_pages+0x6c/0x1b0
29) 1640 104 page_cache_ra_unbounded+0x12c/0x210
30) 1536 40 force_page_cache_ra+0x96/0xc0
31) 1496 192 filemap_get_pages+0x123/0x820
32) 1304 376 filemap_read+0xde/0x380
33) 928 72 do_iter_readv_writev+0x1b9/0x220
34) 856 56 vfs_iter_read+0xde/0x140
35) 800 64 backing_file_read_iter+0x193/0x1e0
36) 736 56 ovl_read_iter+0x98/0xa0
37) 680 72 do_iter_readv_writev+0x1b9/0x220
38) 608 56 vfs_iter_read+0xde/0x140
39) 552 64 backing_file_read_iter+0x193/0x1e0
40) 488 56 ovl_read_iter+0x98/0xa0
41) 432 152 vfs_read+0x21a/0x350
42) 280 64 __x64_sys_pread64+0x92/0xc0
43) 216 40 do_syscall_64+0xa4/0x310
44) 176 176 entry_SYSCALL_64_after_hwframe+0x77/0x7f
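To put rough numbers on the per-layer cost, summing the forwarding
frames from the two traces above (x86_64, this particular build; a
trivial sketch, not a rigorous bound):

#include <stdio.h>

int main(void)
{
	/* splice path, from the copy-up trace above */
	int splice = 48		/* backing_file_splice_read */
		   + 128;	/* ovl_splice_read */
	/* read_iter path, from the RO trace above */
	int riter = 72		/* do_iter_readv_writev */
		  + 56		/* vfs_iter_read */
		  + 64		/* backing_file_read_iter */
		  + 56;		/* ovl_read_iter */

	printf("extra stack per nested overlay: splice ~%d B, read_iter ~%d B\n",
	       splice, riter);
	return 0;
}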
> something worth mentioning.
>
>> [...]
>
> I think that most people will have problems understanding this
> rationale not because of the English, but because of the tech ;)
> this is a bit too hand-wavy IMO.
Honestly, I don't have a better way to describe it; I think we'd
better focus on the increment from one more overlayfs:

FILESYSTEM_MAX_STACK_DEPTH 2 already works with 8k kstacks on
32-bit arches, so I don't think going from FILESYSTEM_MAX_STACK_DEPTH
2 to 3, which adds a few hundred bytes of stack usage for the
intermediate overlayfs with 16k kstacks on 64-bit arches, is
harmful (and only RO workloads and copy-ups are impacted).

And if a few hundred bytes of additional stack usage can overflow
the 16k kstack, then I think the kernel stack could be overflowed
randomly everywhere in the storage stack, not just because of this
FILESYSTEM_MAX_STACK_DEPTH modification.
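In budget terms (a crude sketch using the numbers above; it ignores
IRQ and reclaim headroom, of course):

#include <stdio.h>

int main(void)
{
	int kstack = 16384;	/* x86_64 kernel stack (THREAD_SIZE) */
	int peak = 7184;	/* fsstress peak observed above */
	int per_layer = 248;	/* worst per-overlay delta from the traces */

	printf("headroom after observed peak: %d bytes; one more overlay costs ~%d\n",
	       kstack - peak, per_layer);
	return 0;
}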
Thanks,
Gao Xiang
On Sun, Jan 4, 2026 at 11:42 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> [...]
>
> Honestly, I don't have a better way to describe it; I think we'd
> better focus on the increment from one more overlayfs:
>
ok. but are we talking about one more overlayfs?
This patch is adding just one erofs, so what am I missing?
> FILESYSTEM_MAX_STACK_DEPTH 2 already works with 8k kstacks on
> 32-bit arches, so I don't think going from FILESYSTEM_MAX_STACK_DEPTH
> 2 to 3, which adds a few hundred bytes of stack usage for the
> intermediate overlayfs with 16k kstacks on 64-bit arches, is
> harmful (and only RO workloads and copy-ups are impacted).
>
> And if a few hundred bytes of additional stack usage can overflow
> the 16k kstack, then I think the kernel stack could be overflowed
> randomly everywhere in the storage stack, not just because of this
> FILESYSTEM_MAX_STACK_DEPTH modification.
>
Fine by me, but does that mean that you only want to allow
erofs backing files with >8K stack size?
Otherwise, I do not follow your argument.
Thanks,
Amir.
On 2026/1/5 02:44, Amir Goldstein wrote:
> [...]
>>
>> Honestly, I don't have a better way to describe it; I think we'd
>> better focus on the increment from one more overlayfs:
>>
>
> ok. but are we talking about one more overlayfs?
> This patch is adding just one erofs, so what am I missing?
Sorry, I didn't describe it accurately: first, I tested erofs+ovl^2.
It's the last overlayfs mount that fails, and the stack traces start
from the last overlayfs. So compared with the normal erofs+ovl setup
(which can already be mounted correctly upstream without this patch),
I mean it's one more overlayfs.
>
>> FILESYSTEM_MAX_STACK_DEPTH 2 already works with 8k kstacks on
>> 32-bit arches, so I don't think going from FILESYSTEM_MAX_STACK_DEPTH
>> 2 to 3, which adds a few hundred bytes of stack usage for the
>> intermediate overlayfs with 16k kstacks on 64-bit arches, is
>> harmful (and only RO workloads and copy-ups are impacted).
>>
>> And if a few hundred bytes of additional stack usage can overflow
>> the 16k kstack, then I think the kernel stack could be overflowed
>> randomly everywhere in the storage stack, not just because of this
>> FILESYSTEM_MAX_STACK_DEPTH modification.
>>
>
> Fine by me, but does that mean that you only want to allow
> erofs backing files with >8K stack size?
With FILESYSTEM_MAX_STACK_DEPTH 2 and without this patch, erofs+ovl
can still succeed (so erofs+ovl should be guaranteed to always be
fine); compared with that, with FILESYSTEM_MAX_STACK_DEPTH 3 the
extra stack is always one more overlayfs, I think (either
erofs+ovl^2 or ovl^3)?
Thanks,
Gao Xiang
>
> Otherwise, I do not follow your argument.
>
> Thanks,
> Amir.
Previously, commit d53cd891f0e4 ("erofs: limit the level of fs stacking
for file-backed mounts") bumped `s_stack_depth` by one to avoid kernel
stack overflow when stacking an unlimited number of EROFS on top of
each other.
This fix breaks composefs mounts, which need EROFS+ovl^2 sometimes
(and such setups are already used in production for quite a long time).
One way to fix this regression is to bump FILESYSTEM_MAX_STACK_DEPTH
from 2 to 3, but proving that this is safe in general is a high bar.
After a long discussion on GitHub issues [1] about possible solutions,
one conclusion is that there is no need to support nesting file-backed
EROFS mounts on stacked filesystems, because there is always the option
to use loopback devices as a fallback.
As a quick fix for the composefs regression for this cycle, instead of
bumping `s_stack_depth` for file-backed EROFS mounts, we disallow
nesting file-backed EROFS over EROFS and over filesystems with
`s_stack_depth` > 0.
This works for all known file-backed mount use cases (composefs,
containerd, and Android APEX for some Android vendors), and the fix is
self-contained.
Essentially, we are allowing one extra unaccounted fs stacking level of
EROFS below stacking filesystems, but EROFS can only be used in the read
path (i.e. overlayfs lower layers), which typically has much lower stack
usage than the write path.
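For illustration, a tiny userspace sketch (not kernel code) of the
resulting depth accounting for the composefs chain:

#include <stdio.h>

int main(void)
{
	/* the backing fs (e.g. xfs) is 0, the file-backed EROFS stays
	 * at 0 (the unaccounted level), and each overlay adds one */
	int erofs = 0, ovl1 = erofs + 1, ovl2 = ovl1 + 1;

	printf("outer overlay: s_stack_depth = %d (limit 2) -> mounts fine\n", ovl2);
	return 0;
}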
We can consider increasing FILESYSTEM_MAX_STACK_DEPTH later, after more
stack usage analysis or using alternative approaches, such as splitting
the `s_stack_depth` limitation according to different combinations of
stacking.
Fixes: d53cd891f0e4 ("erofs: limit the level of fs stacking for file-backed mounts")
Reported-by: Dusty Mabe <dusty@dustymabe.com>
Reported-by: Timothée Ravier <tim@siosm.fr>
Closes: https://github.com/coreos/fedora-coreos-tracker/issues/2087 [1]
Reported-by: "Alekséi Naidénov" <an@digitaltide.io>
Closes: https://lore.kernel.org/r/CAFHtUiYv4+=+JP_-JjARWjo6OwcvBj1wtYN=z0QXwCpec9sXtg@mail.gmail.com
Acked-by: Amir Goldstein <amir73il@gmail.com>
Cc: Alexander Larsson <alexl@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Sheng Yong <shengyong1@xiaomi.com>
Cc: Zhiguo Niu <niuzhiguo84@gmail.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
---
v2:
- Update commit message (suggested by Amir in 1-on-1 talk);
- Add proper `Reported-by:`.
fs/erofs/super.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index 937a215f626c..0cf41ed7ced8 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -644,14 +644,20 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
* fs contexts (including its own) due to self-controlled RO
* accesses/contexts and no side-effect changes that need to
* context save & restore so it can reuse the current thread
- * context. However, it still needs to bump `s_stack_depth` to
- * avoid kernel stack overflow from nested filesystems.
+ * context.
+ * However, we still need to prevent kernel stack overflow due
+ * to filesystem nesting: just ensure that s_stack_depth is 0
+ * to disallow mounting EROFS on stacked filesystems.
+ * Note: s_stack_depth is not incremented here for now, since
+ * EROFS is the only fs supporting file-backed mounts for now.
+ * It MUST change if another fs plans to support them, which
+ * may also require adjusting FILESYSTEM_MAX_STACK_DEPTH.
*/
if (erofs_is_fileio_mode(sbi)) {
- sb->s_stack_depth =
- file_inode(sbi->dif0.file)->i_sb->s_stack_depth + 1;
- if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
- erofs_err(sb, "maximum fs stacking depth exceeded");
+ inode = file_inode(sbi->dif0.file);
+ if (inode->i_sb->s_op == &erofs_sops ||
+ inode->i_sb->s_stack_depth) {
+ erofs_err(sb, "file-backed mounts cannot be applied to stacked fses");
return -ENOTBLK;
}
}
--
2.43.5
On 1/7/26 01:05, Gao Xiang wrote:
> [...]
> if (erofs_is_fileio_mode(sbi)) {
> - sb->s_stack_depth =
> - file_inode(sbi->dif0.file)->i_sb->s_stack_depth + 1;
> - if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
> - erofs_err(sb, "maximum fs stacking depth exceeded");
> + inode = file_inode(sbi->dif0.file);
> + if (inode->i_sb->s_op == &erofs_sops ||
Hi Xiang,
In the Android APEX scenario, APEX images formatted as EROFS are packed
into system.img, which is also EROFS. As a result, an APEX file-backed
mount will always fail since `inode->i_sb->s_op == &erofs_sops' is true.
Any thoughts on how to handle this scenario?
thanks,
shengyong
> + inode->i_sb->s_stack_depth) {
> + erofs_err(sb, "file-backed mounts cannot be applied to stacked fses");
> return -ENOTBLK;
> }
> }
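To make the failure mode concrete, here is a sketch (editorial, not
from the thread) of how the check above evaluates for the APEX layout
Sheng describes, where system.img is a bdev-backed EROFS mount:

	/*
	 * erofs (apex.img, file-backed)
	 *   -> backing inode lives on erofs (system.img, bdev-backed)
	 */
	inode = file_inode(sbi->dif0.file);	/* apex.img on system.img */
	inode->i_sb->s_op == &erofs_sops	/* true: backing fs is EROFS */
	inode->i_sb->s_stack_depth		/* 0: bdev-backed, unstacked */

so the first clause alone rejects a backing fs that cannot actually
recurse any further.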
Hi Sheng,
On 2026/1/8 10:26, Sheng Yong wrote:
> On 1/7/26 01:05, Gao Xiang wrote:
>> [...]
>> if (erofs_is_fileio_mode(sbi)) {
>> - sb->s_stack_depth =
>> - file_inode(sbi->dif0.file)->i_sb->s_stack_depth + 1;
>> - if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
>> - erofs_err(sb, "maximum fs stacking depth exceeded");
>> + inode = file_inode(sbi->dif0.file);
>> + if (inode->i_sb->s_op == &erofs_sops ||
>
> Hi Xiang,
>
> In the Android APEX scenario, APEX images formatted as EROFS are packed
> into system.img, which is also EROFS. As a result, an APEX file-backed
> mount will always fail since `inode->i_sb->s_op == &erofs_sops' is true.
> Any thoughts on how to handle this scenario?
Sorry, I forgot this popular case. I think it can be simply resolved
by the following diff:
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index 0cf41ed7ced8..e93264034b5d 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -655,7 +655,7 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
*/
if (erofs_is_fileio_mode(sbi)) {
inode = file_inode(sbi->dif0.file);
- if (inode->i_sb->s_op == &erofs_sops ||
+ if ((inode->i_sb->s_op == &erofs_sops && !sb->s_bdev) ||
inode->i_sb->s_stack_depth) {
erofs_err(sb, "file-backed mounts cannot be applied to stacked fses");
return -ENOTBLK;
"!sb->s_bdev" covers file-backed EROFS mounts and
(deprecated) fscache EROFS mounts, I will send v3 soon.
Thanks,
Gao Xiang
>
> thanks,
> shengyong
>
>> + inode->i_sb->s_stack_depth) {
>> + erofs_err(sb, "file-backed mounts cannot be applied to stacked fses");
>> return -ENOTBLK;
>> }
>> }
On 2026/1/8 10:32, Gao Xiang wrote:
> Hi Sheng,
>
> On 2026/1/8 10:26, Sheng Yong wrote:
>> [...]
>
> Sorry, I forgot this popular case. I think it can be simply resolved
> by the following diff:
>
> diff --git a/fs/erofs/super.c b/fs/erofs/super.c
> index 0cf41ed7ced8..e93264034b5d 100644
> --- a/fs/erofs/super.c
> +++ b/fs/erofs/super.c
> @@ -655,7 +655,7 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
> */
> if (erofs_is_fileio_mode(sbi)) {
> inode = file_inode(sbi->dif0.file);
> - if (inode->i_sb->s_op == &erofs_sops ||
> + if ((inode->i_sb->s_op == &erofs_sops && !sb->s_bdev) ||
Sorry, it should be `!inode->i_sb->s_bdev`; I've
fixed it in v3 RESEND:
https://lore.kernel.org/r/20260108030709.3305545-1-hsiangkao@linux.alibaba.com
Thanks,
Gao Xiang
> inode->i_sb->s_stack_depth) {
> erofs_err(sb, "file-backed mounts cannot be applied to stacked fses");
> return -ENOTBLK;
>
> "!sb->s_bdev" covers file-backed EROFS mounts and
> (deprecated) fscache EROFS mounts, I will send v3 soon.
>
> Thanks,
> Gao Xiang
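Putting the correction together, the resulting v3 condition reads as
below; the clause annotations are editorial, not from the patch:

	inode = file_inode(sbi->dif0.file);
	if ((inode->i_sb->s_op == &erofs_sops &&  /* backing fs is EROFS */
	     !inode->i_sb->s_bdev) ||             /* ...and has no bdev, i.e.
	                                             file-backed or fscache:
	                                             reject nested EROFS, while
	                                             APEX on bdev EROFS passes */
	    inode->i_sb->s_stack_depth) {         /* backing fs is stacked
	                                             (e.g. overlayfs): reject */
		erofs_err(sb, "file-backed mounts cannot be applied to stacked fses");
		return -ENOTBLK;
	}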
On Thu, Jan 8, 2026 at 4:10 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
>
>
> On 2026/1/8 10:32, Gao Xiang wrote:
> > Hi Sheng,
> >
> > On 2026/1/8 10:26, Sheng Yong wrote:
> >> [...]
> >
> > Sorry, I forgot this popular case. I think it can be simply resolved
> > by the following diff:
> >
> > diff --git a/fs/erofs/super.c b/fs/erofs/super.c
> > index 0cf41ed7ced8..e93264034b5d 100644
> > --- a/fs/erofs/super.c
> > +++ b/fs/erofs/super.c
> > @@ -655,7 +655,7 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
> > */
> > if (erofs_is_fileio_mode(sbi)) {
> > inode = file_inode(sbi->dif0.file);
> > - if (inode->i_sb->s_op == &erofs_sops ||
> > + if ((inode->i_sb->s_op == &erofs_sops && !sb->s_bdev) ||
>
> Sorry, it should be `!inode->i_sb->s_bdev`; I've
> fixed it in v3 RESEND:
A RESEND implies no changes since v3, so this is bad practice.
> https://lore.kernel.org/r/20260108030709.3305545-1-hsiangkao@linux.alibaba.com
>
Ouch! If the erofs maintainer got this condition wrong... twice...
Maybe better to use the helper instead of open-coding this non-trivial check?
if ((inode->i_sb->s_op == &erofs_sops &&
erofs_is_fileio_mode(EROFS_I_SB(inode)))
Thanks,
Amir.
Hi Amir,
On 2026/1/8 16:02, Amir Goldstein wrote:
> On Thu, Jan 8, 2026 at 4:10 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
...
>> [...]
>
> A RESEND implies no changes since v3, so this is bad practice.
>
>> https://lore.kernel.org/r/20260108030709.3305545-1-hsiangkao@linux.alibaba.com
>>
>
> Ouch! If the erofs maintainer got this condition wrong... twice...
> Maybe better to use the helper instead of open-coding this non-trivial check?
>
> if ((inode->i_sb->s_op == &erofs_sops &&
> erofs_is_fileio_mode(EROFS_I_SB(inode)))
I thought about using that, but it excludes fscache as the
backing fs, so I suggest using !s_bdev directly to
cover both the file-backed mount and fscache cases.
Thanks,
Gao Xiang
>
> Thanks,
> Amir.
On Thu, 8 Jan 2026 16:05:03 +0800
Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> Hi Amir,
>
> On 2026/1/8 16:02, Amir Goldstein wrote:
> > On Thu, Jan 8, 2026 at 4:10 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
> ...
>
> > [...]
>
> I thought about using that, but it excludes fscache as the
> backing fs, so I suggest using !s_bdev directly to
> cover both the file-backed mount and fscache cases.
Is it worth just allocating each fs a 'stack needed' value and then
allowing the mount if the total is low enough?
This is equivalent to counting the recursion depth, but lets erofs only
add (say) 0.5.
Ideally you'd want to do static analysis to find the value to add,
but 'inspired guesswork' is probably good enough.
Isn't there also a big difference between recursive mounts (which need
to do read/write on the underlying file) and overlay mounts (which just
pass the request onto the lower filesystem)?
David
>
> Thanks,
> Gao Xiang
>
> >
> > Thanks,
> > Amir.
>
>
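As a toy model of David's fractional 'stack needed' idea (purely
hypothetical; no such mechanism exists in the kernel, and all names
and values below are illustrative guesses):

	#define FS_STACK_BUDGET		20	/* 2.0 classic stacking layers */
	#define FS_COST_OVERLAYFS	10	/* 1.0: a full stacked fs */
	#define FS_COST_EROFS_FILE	5	/* 0.5: read-only, read path only */

	/* would replace today's "depth = lower + 1, compare to max" rule */
	static int fs_stack_charge(int lower_cost, int my_cost)
	{
		if (lower_cost + my_cost > FS_STACK_BUDGET)
			return -EINVAL;		/* stacking budget exceeded */
		return lower_cost + my_cost;	/* new accumulated cost */
	}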
Hi David,
On 2026/1/8 18:26, David Laight wrote:
> On Thu, 8 Jan 2026 16:05:03 +0800
> Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
>> Hi Amir,
>>
>> On 2026/1/8 16:02, Amir Goldstein wrote:
>>> On Thu, Jan 8, 2026 at 4:10 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>
>> ...
>>
>> [...]
>
> Is it worth just allocating each fs a 'stack needed' value and then
> allowing the mount if the total is low enough?
> This is equivalent to counting the recursion depth, but lets erofs only
> add (say) 0.5.
> Ideally you'd want to do static analysis to find the value to add,
> but 'inspired guesswork' is probably good enough.
That is a good alternative, but there are also practical issues,
such as how to evaluate stack usage under the block layer.
Also, the rule exposed to userspace becomes complex if we do it
that way.
>
> Isn't there also a big difference between recursive mounts (which need
> to do read/write on the underlying file) and overlay mounts (which just
> pass the request onto the lower filesystem)?
As for EROFS, we only care about reads, which are safe enough;
I won't speak for write paths (sb_writers and journal nesting,
for example, and I don't want to broaden the discussion since
that's largely unrelated to the topic).
I agree, but as I said above, it makes the rule more complex,
and users would have no idea when a mount is allowed and when
it is not.
Anyway, I think that with the current 16k kernel stack,
FILESYSTEM_MAX_STACK_DEPTH = 3 is safe enough to leave an ample
margin for the underlying storage stack.
I have no idea how to prove it strictly, but it's roughly
demonstrable by measuring the remaining stack size when the real
backing fs is reached; FILESYSTEM_MAX_STACK_DEPTH = 2 was an
arbitrary choice too.
Thanks,
Gao Xiang
>
> David
>
>>
>> Thanks,
>> Gao Xiang
>>
>>>
>>> Thanks,
>>> Amir.
>>
>>
On Thu, Jan 8, 2026 at 9:05 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
> Hi Amir,
>
> On 2026/1/8 16:02, Amir Goldstein wrote:
> > On Thu, Jan 8, 2026 at 4:10 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
> ...
>
> > [...]
>
> I thought about using that, but it excludes fscache as the
> backing fs, so I suggest using !s_bdev directly to
> cover both the file-backed mount and fscache cases.
Your fs, your decision.
But what are you actually saying?
Are you saying that reading from file-backed fscache has similar
stack usage to reading from file-backed erofs?
Isn't fscache doing async file IO?
If we regard fscache as an extra unaccounted layer, because of all the
sync operations that it does, then we already allowed this setup a long
time ago, e.g. fscache+nfs+ovl^2.
This could be an argument to support the claim that stack usage of
file+erofs+ovl^2 should also be fine.
Thanks,
Amir.
On 2026/1/8 16:24, Amir Goldstein wrote:
> On Thu, Jan 8, 2026 at 9:05 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>
>> Hi Amir,
>>
>> On 2026/1/8 16:02, Amir Goldstein wrote:
>>> On Thu, Jan 8, 2026 at 4:10 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>
>> ...
>>
>>> [...]
>>
>> I thought about using that, but it excludes fscache as the
>> backing fs, so I suggest using !s_bdev directly to
>> cover both the file-backed mount and fscache cases.
>
> Your fs, your decision.
>
> But what are you actually saying?
> Are you saying that reading from file-backed fscache has similar
> stack usage to reading from file-backed erofs?
Nope, I just don't want to be bothered with fscache in any case
since it's already deprecated; IOW, I don't want a setup like
erofs (file-backed) + erofs (fscache) + ...
to work. I just want to allow the
erofs (APEX) + erofs (bdev) + ...
cases, since Android users rely on them, in addition to
ovl^2 + erofs + ext4 / xfs / ... (composefs, containerd, and ...).
Does that make sense?
> Isn't fscache doing async file IO?
But as I said, AIO is not a must; it can still
fall back to sync I/O.
>
> If we regard fscache an extra unaccounted layer, because of all the
> sync operations that it does, then we already allowed this setup a long
> time ago, e.g. fscache+nfs+ovl^2.
>
> This could be an argument to support the claim that stack usage of
> file+erofs+ovl^2 should also be fine.
Anyway, I'm not sure how many users really use that, so
I won't speak to it.
Thanks,
Gao Xiang
>
> Thanks,
> Amir.
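Mapped onto the v3 condition (posted below), that intent works out as
follows (an editorial summary, not from the thread):

	/*
	 * erofs(file) on erofs(fscache): s_op == &erofs_sops, !s_bdev -> reject
	 * erofs(APEX) on erofs(bdev):    s_op == &erofs_sops, s_bdev  -> allow
	 * erofs(file) on ext4/xfs:       s_op != &erofs_sops, depth 0 -> allow
	 * erofs(file) on ovl:            s_stack_depth > 0            -> reject
	 *
	 * ovl^2 on top of an allowed erofs mount is then governed by ovl's
	 * own s_stack_depth accounting, which this patch leaves untouched.
	 */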
Previously, commit d53cd891f0e4 ("erofs: limit the level of fs stacking
for file-backed mounts") bumped `s_stack_depth` by one to avoid kernel
stack overflow when stacking an unlimited number of EROFS on top of
each other.
This fix breaks composefs mounts, which need EROFS+ovl^2 sometimes
(and such setups are already used in production for quite a long time).
One way to fix this regression is to bump FILESYSTEM_MAX_STACK_DEPTH
from 2 to 3, but proving that this is safe in general is a high bar.
After a long discussion on GitHub issues [1] about possible solutions,
one conclusion is that there is no need to support nesting file-backed
EROFS mounts on stacked filesystems, because there is always the option
to use loopback devices as a fallback.
As a quick fix for the composefs regression for this cycle, instead of
bumping `s_stack_depth` for file-backed EROFS mounts, we disallow
nesting file-backed EROFS over EROFS and over filesystems with
`s_stack_depth` > 0.
This works for all known file-backed mount use cases (composefs,
containerd, and Android APEX for some Android vendors), and the fix is
self-contained.
Essentially, we are allowing one extra unaccounted fs stacking level of
EROFS below stacking filesystems, but EROFS can only be used in the read
path (i.e. overlayfs lower layers), which typically has much lower stack
usage than the write path.
We can consider increasing FILESYSTEM_MAX_STACK_DEPTH later, after more
stack usage analysis or using alternative approaches, such as splitting
the `s_stack_depth` limitation according to different combinations of
stacking.
Fixes: d53cd891f0e4 ("erofs: limit the level of fs stacking for file-backed mounts")
Reported-and-tested-by: Dusty Mabe <dusty@dustymabe.com>
Reported-by: Timothée Ravier <tim@siosm.fr>
Closes: https://github.com/coreos/fedora-coreos-tracker/issues/2087 [1]
Reported-by: "Alekséi Naidénov" <an@digitaltide.io>
Closes: https://lore.kernel.org/r/CAFHtUiYv4+=+JP_-JjARWjo6OwcvBj1wtYN=z0QXwCpec9sXtg@mail.gmail.com
Acked-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Alexander Larsson <alexl@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Sheng Yong <shengyong1@xiaomi.com>
Cc: Zhiguo Niu <niuzhiguo84@gmail.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
---
v2->v3 RESEND:
- Exclude bdev-backed EROFS mounts, since such a mount is a real terminal
fs, as pointed out by Sheng Yong (APEX will rely on this);
- Preserve the previous "Acked-by:" and "Tested-by:" since the change is
trivial.
fs/erofs/super.c | 19 +++++++++++++------
1 file changed, 13 insertions(+), 6 deletions(-)
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index 937a215f626c..5136cda5972a 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -644,14 +644,21 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
* fs contexts (including its own) due to self-controlled RO
* accesses/contexts and no side-effect changes that need to
* context save & restore so it can reuse the current thread
- * context. However, it still needs to bump `s_stack_depth` to
- * avoid kernel stack overflow from nested filesystems.
+ * context.
+ * However, we still need to prevent kernel stack overflow due
+ * to filesystem nesting: just ensure that s_stack_depth is 0
+ * to disallow mounting EROFS on stacked filesystems.
+ * Note: s_stack_depth is not incremented here for now, since
+ * EROFS is the only fs supporting file-backed mounts for now.
+ * It MUST change if another fs plans to support them, which
+ * may also require adjusting FILESYSTEM_MAX_STACK_DEPTH.
*/
if (erofs_is_fileio_mode(sbi)) {
- sb->s_stack_depth =
- file_inode(sbi->dif0.file)->i_sb->s_stack_depth + 1;
- if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
- erofs_err(sb, "maximum fs stacking depth exceeded");
+ inode = file_inode(sbi->dif0.file);
+ if ((inode->i_sb->s_op == &erofs_sops &&
+ !inode->i_sb->s_bdev) ||
+ inode->i_sb->s_stack_depth) {
+ erofs_err(sb, "file-backed mounts cannot be applied to stacked fses");
return -ENOTBLK;
}
}
--
2.43.5
On Thu, Jan 08, 2026 at 11:07:09AM +0800, Gao Xiang wrote:
> [...]
Acked-by: Christian Brauner <brauner@kernel.org>
On 1/8/2026 11:07 AM, Gao Xiang wrote:
> [...]
Reviewed-by: Chao Yu <chao@kernel.org>
Thanks,
Gao Xiang <hsiangkao@linux.alibaba.com> wrote on Thu, Jan 8, 2026 at 11:07:
>
> [...]
> if (erofs_is_fileio_mode(sbi)) {
> - sb->s_stack_depth =
> - file_inode(sbi->dif0.file)->i_sb->s_stack_depth + 1;
> - if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
> - erofs_err(sb, "maximum fs stacking depth exceeded");
> + inode = file_inode(sbi->dif0.file);
> + if ((inode->i_sb->s_op == &erofs_sops &&
> + !inode->i_sb->s_bdev) ||
> + inode->i_sb->s_stack_depth) {
> + erofs_err(sb, "file-backed mounts cannot be applied to stacked fses");
Hi Xiang
Do we need to print s_stack_depth here to distinguish which specific
problem case it is?
Otherwise LGTM based on my basic test, so:
Reviewed-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Thanks!
> return -ENOTBLK;
> }
> }
> --
> 2.43.5
>
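For reference, the variant Zhiguo seems to be suggesting would look
roughly like this (hypothetical, never applied):

	erofs_err(sb, "file-backed mounts cannot be applied to stacked fses (backing s_stack_depth=%d)",
		  inode->i_sb->s_stack_depth);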
On 2026/1/8 17:28, Zhiguo Niu wrote:
> Gao Xiang <hsiangkao@linux.alibaba.com> wrote on Thu, Jan 8, 2026 at 11:07:
>>
>> [...]
>> if (erofs_is_fileio_mode(sbi)) {
>> - sb->s_stack_depth =
>> - file_inode(sbi->dif0.file)->i_sb->s_stack_depth + 1;
>> - if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
>> - erofs_err(sb, "maximum fs stacking depth exceeded");
>> + inode = file_inode(sbi->dif0.file);
>> + if ((inode->i_sb->s_op == &erofs_sops &&
>> + !inode->i_sb->s_bdev) ||
>> + inode->i_sb->s_stack_depth) {
>> + erofs_err(sb, "file-backed mounts cannot be applied to stacked fses");
> Hi Xiang
> Do we need to print s_stack_depth here to distinguish which specific
> problem case it is?
... I don't want to complicate it (since it's just a short-term
solution and erofs is unaccounted, so s_stack_depth really means
nothing) unless it's really needed by Android vendors?
> Otherwise LGTM based on my basic test, so:
>
> Reviewed-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Thanks for this too.
Thanks,
Gao Xiang
> Thanks!
>> return -ENOTBLK;
>> }
>> }
>> --
>> 2.43.5
>>
On 1/8/26 11:07, Gao Xiang wrote:
> [...]
Reviewed-and-tested-by: Sheng Yong <shengyong1@xiaomi.com>
I tested the APEX scenario on an Android phone. APEX images are
file-backed mounted correctly, and for a stacked APEX test case,
it reports an error as expected.
thanks,
shengyong
> [...]
Hi Sheng,
On 2026/1/8 17:14, Sheng Yong wrote:
> On 1/8/26 11:07, Gao Xiang wrote:
>> [...]
>
> Reviewed-and-tested-by: Sheng Yong <shengyong1@xiaomi.com>
>
> I tested the APEX scenario on an Android phone. APEX images are
> mounted correctly in file-backed mode.
> And for a stacked APEX test case, it reports an error as expected.
Just to make sure: it's an invalid case (one that should not be
used on Android), yes? If so, thanks for the test on the APEX side.
Thanks,
Gao Xiang
>
> thanks,
> shengyong
On 1/8/26 17:25, Gao Xiang wrote:
> Hi Sheng,
>
> On 2026/1/8 17:14, Sheng Yong wrote:
>> On 1/8/26 11:07, Gao Xiang wrote:
>>> [...]
>>
>> Reviewed-and-tested-by: Sheng Yong <shengyong1@xiaomi.com>
>>
>> I tested the APEX scenario on an Android phone. APEX images are
>> mounted correctly in file-backed mode.
>>
>> And for a stacked APEX test case, it reports an error as expected.
>
Hi Xiang,
> Just to make sure: it's an invalid case (one that should not be
> used on Android), yes? If so, thanks for the test on the APEX side.
No, it's not a real use case, just an invalid one used only to
test the error-handling path.
thanks,
shengyong
>
> Thanks,
> Gao Xiang
>
>>
>> thanks,
>> shengyong
Previously, commit d53cd891f0e4 ("erofs: limit the level of fs stacking
for file-backed mounts") bumped `s_stack_depth` by one to avoid kernel
stack overflow when stacking an unlimited number of EROFS on top of
each other.
This fix breaks composefs mounts, which sometimes need EROFS+ovl^2
(and such setups have already been used in production for quite a long time).
One way to fix this regression is to bump FILESYSTEM_MAX_STACK_DEPTH
from 2 to 3, but proving that this is safe in general is a high bar.
After a long discussion on GitHub issues [1] about possible solutions,
one conclusion is that there is no need to support nesting file-backed
EROFS mounts on stacked filesystems, because there is always the option
to use loopback devices as a fallback.
As a quick fix for the composefs regression for this cycle, instead of
bumping `s_stack_depth` for file-backed EROFS mounts, we disallow
nesting file-backed EROFS over file-backed EROFS and over filesystems
with `s_stack_depth` > 0.
This works for all known file-backed mount use cases (composefs,
containerd, and Android APEX for some Android vendors), and the fix is
self-contained.
Essentially, we are allowing one extra unaccounted fs stacking level of
EROFS below stacking filesystems, but EROFS can only be used in the read
path (i.e. overlayfs lower layers), which typically has much lower stack
usage than the write path.
We can consider increasing FILESYSTEM_MAX_STACK_DEPTH later, after more
stack usage analysis or using alternative approaches, such as splitting
the `s_stack_depth` limitation according to different combinations of
stacking.
Fixes: d53cd891f0e4 ("erofs: limit the level of fs stacking for file-backed mounts")
Reported-and-tested-by: Dusty Mabe <dusty@dustymabe.com>
Reported-by: Timothée Ravier <tim@siosm.fr>
Closes: https://github.com/coreos/fedora-coreos-tracker/issues/2087 [1]
Reported-by: "Alekséi Naidénov" <an@digitaltide.io>
Closes: https://lore.kernel.org/r/CAFHtUiYv4+=+JP_-JjARWjo6OwcvBj1wtYN=z0QXwCpec9sXtg@mail.gmail.com
Acked-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Alexander Larsson <alexl@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Sheng Yong <shengyong1@xiaomi.com>
Cc: Zhiguo Niu <niuzhiguo84@gmail.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
---
v3:
- Exclude bdev-backed EROFS mounts, since they act as real terminal fses,
as pointed out by Sheng Yong (APEX will rely on this);
- Preserve the previous "Acked-by:" and "Tested-by:" since the change is trivial.
fs/erofs/super.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index 937a215f626c..e93264034b5d 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -644,14 +644,20 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
* fs contexts (including its own) due to self-controlled RO
* accesses/contexts and no side-effect changes that need to
* context save & restore so it can reuse the current thread
- * context. However, it still needs to bump `s_stack_depth` to
- * avoid kernel stack overflow from nested filesystems.
+ * context.
+ * However, we still need to prevent kernel stack overflow due
+ * to filesystem nesting: disallow file-backed EROFS on stacked
+ * fses (s_stack_depth != 0) and on another file-backed EROFS.
+ * Note: s_stack_depth is not incremented here for now, since
+ * EROFS is currently the only fs supporting file-backed mounts.
+ * It MUST change if another fs plans to support them, which
+ * may also require adjusting FILESYSTEM_MAX_STACK_DEPTH.
*/
if (erofs_is_fileio_mode(sbi)) {
- sb->s_stack_depth =
- file_inode(sbi->dif0.file)->i_sb->s_stack_depth + 1;
- if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
- erofs_err(sb, "maximum fs stacking depth exceeded");
+ inode = file_inode(sbi->dif0.file);
+ if ((inode->i_sb->s_op == &erofs_sops && !inode->i_sb->s_bdev) ||
+ inode->i_sb->s_stack_depth) {
+ erofs_err(sb, "file-backed mounts cannot be applied to stacked fses");
return -ENOTBLK;
}
}
--
2.43.5
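To illustrate the rule the hunk above enforces, here is a minimal
user-space sketch. It is not the kernel code: struct super_block is cut
down to the three fields involved, erofs_sops_tag stands in for the
kernel's &erofs_sops, and file_backed_mount_ok() is a hypothetical
helper invented for this example.

#include <stdbool.h>
#include <stdio.h>

struct super_block {
	const void *s_op;	/* identifies the filesystem type */
	void *s_bdev;		/* non-NULL for bdev-backed mounts */
	int s_stack_depth;	/* stacking depth; 0 for plain fses */
};

static const char erofs_sops_tag;	/* stand-in for &erofs_sops */

/* May a file-backed EROFS image residing on @backing be mounted? */
static bool file_backed_mount_ok(const struct super_block *backing)
{
	/* No file-backed EROFS over file-backed EROFS: that would add
	 * an unaccounted extra stacking level. */
	if (backing->s_op == &erofs_sops_tag && !backing->s_bdev)
		return false;
	/* No stacked fses (overlayfs etc.) underneath either. */
	if (backing->s_stack_depth)
		return false;
	/* bdev-backed EROFS (the APEX case) and plain fses are fine. */
	return true;
}

int main(void)
{
	struct super_block ext4       = { "ext4", (void *)1, 0 };
	struct super_block ovl        = { "ovl", NULL, 1 };
	struct super_block erofs_bdev = { &erofs_sops_tag, (void *)1, 0 };
	struct super_block erofs_file = { &erofs_sops_tag, NULL, 0 };

	printf("on ext4:              %d\n", file_backed_mount_ok(&ext4));
	printf("on overlayfs:         %d\n", file_backed_mount_ok(&ovl));
	printf("on bdev-backed EROFS: %d\n", file_backed_mount_ok(&erofs_bdev));
	printf("on file-backed EROFS: %d\n", file_backed_mount_ok(&erofs_file));
	return 0;
}

This prints 1, 0, 1, 0: only stacked fses and file-backed EROFS are
rejected as backing filesystems, matching the v3 note that bdev-backed
EROFS remains a valid terminal fs.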
On 1/6/26 12:05 PM, Gao Xiang wrote:
> Previously, commit d53cd891f0e4 ("erofs: limit the level of fs stacking
> for file-backed mounts") bumped `s_stack_depth` by one to avoid kernel
> stack overflow when stacking an unlimited number of EROFS on top of
> each other.
>
> This fix breaks composefs mounts, which sometimes need EROFS+ovl^2
> (and such setups have already been used in production for quite a long time).
>
> One way to fix this regression is to bump FILESYSTEM_MAX_STACK_DEPTH
> from 2 to 3, but proving that this is safe in general is a high bar.
>
> After a long discussion on GitHub issues [1] about possible solutions,
> one conclusion is that there is no need to support nesting file-backed
> EROFS mounts on stacked filesystems, because there is always the option
> to use loopback devices as a fallback.
>
> As a quick fix for the composefs regression for this cycle, instead of
> bumping `s_stack_depth` for file-backed EROFS mounts, we disallow
> nesting file-backed EROFS over EROFS and over filesystems with
> `s_stack_depth` > 0.
>
> This works for all known file-backed mount use cases (composefs,
> containerd, and Android APEX for some Android vendors), and the fix is
> self-contained.
>
> Essentially, we are allowing one extra unaccounted fs stacking level of
> EROFS below stacking filesystems, but EROFS can only be used in the read
> path (i.e. overlayfs lower layers), which typically has much lower stack
> usage than the write path.
>
> We can consider increasing FILESYSTEM_MAX_STACK_DEPTH later, after more
> stack usage analysis or using alternative approaches, such as splitting
> the `s_stack_depth` limitation according to different combinations of
> stacking.
>
> Fixes: d53cd891f0e4 ("erofs: limit the level of fs stacking for file-backed mounts")
> Reported-by: Dusty Mabe <dusty@dustymabe.com>
> Reported-by: Timothée Ravier <tim@siosm.fr>
> Closes: https://github.com/coreos/fedora-coreos-tracker/issues/2087 [1]
> Reported-by: "Alekséi Naidénov" <an@digitaltide.io>
> Closes: https://lore.kernel.org/r/CAFHtUiYv4+=+JP_-JjARWjo6OwcvBj1wtYN=z0QXwCpec9sXtg@mail.gmail.com
> Acked-by: Amir Goldstein <amir73il@gmail.com>
> Cc: Alexander Larsson <alexl@redhat.com>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Miklos Szeredi <mszeredi@redhat.com>
> Cc: Sheng Yong <shengyong1@xiaomi.com>
> Cc: Zhiguo Niu <niuzhiguo84@gmail.com>
> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Tested-by: Dusty Mabe <dusty@dustymabe.com>
I tested that this fixes the problem we observed in our Fedora CoreOS CI,
documented at https://github.com/coreos/fedora-coreos-tracker/issues/2087
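For the record, here is the depth arithmetic behind the CoreOS failure
above, as a small standalone sketch. It assumes overlayfs computes its
depth as the maximum depth of its layers plus one and rejects anything
above FILESYSTEM_MAX_STACK_DEPTH; treat that model (and the numbers) as
an illustration, not as a quote of the overlayfs code.

#include <stdio.h>

#define FILESYSTEM_MAX_STACK_DEPTH 2

static const char *verdict(int depth)
{
	return depth > FILESYSTEM_MAX_STACK_DEPTH ? "rejected" : "ok";
}

int main(void)
{
	/* With d53cd891f0e4: a file-backed EROFS bumped the depth. */
	int ext4  = 0;             /* plain fs holding the image    */
	int erofs = ext4 + 1;      /* file-backed EROFS: depth 1    */
	int ovl1  = erofs + 1;     /* overlayfs over EROFS: 2       */
	int ovl2  = ovl1 + 1;      /* overlayfs over overlayfs: 3   */
	printf("old accounting: EROFS+ovl^2 depth %d -> %s\n",
	       ovl2, verdict(ovl2));

	/* With this fix: file-backed EROFS leaves its depth at 0. */
	erofs = 0;
	ovl1 = erofs + 1;
	ovl2 = ovl1 + 1;           /* EROFS+ovl^2 tops out at 2     */
	printf("new accounting: EROFS+ovl^2 depth %d -> %s\n",
	       ovl2, verdict(ovl2));
	return 0;
}

The first line reports depth 3 (rejected, the composefs regression); the
second reports depth 2 (ok), at the cost of the one unaccounted EROFS
level discussed in the commit message.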