Previously, commit d53cd891f0e4 ("erofs: limit the level of fs stacking
for file-backed mounts") bumped `s_stack_depth` by one to avoid kernel
stack overflow, but it breaks composefs mounts, which sometimes need
erofs+ovl^2 (and such setups have already been used in production for
quite a long time), since `s_stack_depth` can reach 3 (i.e.,
FILESYSTEM_MAX_STACK_DEPTH would need to change from 2 to 3).
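For illustration, a minimal userspace sketch (not kernel code; overlayfs
actually takes the maximum depth across all of its layers before adding
one, simplified here to a single chain) of how the depths added up under
the old behavior:

#include <stdio.h>

#define FILESYSTEM_MAX_STACK_DEPTH	2

/* each stacked mount sits one level above its backing fs */
static int stack_on(int backing_depth) { return backing_depth + 1; }

int main(void)
{
	int xfs   = 0;			/* plain block-backed fs */
	int erofs = stack_on(xfs);	/* file-backed mount: 1 (old code) */
	int ovl1  = stack_on(erofs);	/* composefs overlay: 2 */
	int ovl2  = stack_on(ovl1);	/* outer overlay: 3 */

	printf("erofs+ovl^2: depth %d, limit %d -> %s\n", ovl2,
	       FILESYSTEM_MAX_STACK_DEPTH,
	       ovl2 > FILESYSTEM_MAX_STACK_DEPTH ? "mount fails" : "ok");
	return 0;
}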
After a long discussion on GitHub issues [1] about possible solutions,
one conclusion is that there is no need to support nesting file-backed
mounts (especially if that means increasing FILESYSTEM_MAX_STACK_DEPTH to 3).
So let's disallow this right now, since there is always a way to use
loopback devices as a fallback.
Then, I started to wonder about an alternative EROFS quick fix to
address the composefs mounts directly for this cycle: since EROFS is the
only fs to support file-backed mounts, and other stacked fses would just
bump up `FILESYSTEM_MAX_STACK_DEPTH`, just check that `s_stack_depth`
is 0 and that the backing inode is not from EROFS instead.
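A rough userspace model of the resulting rule (hypothetical names,
illustration only): a file-backed mount is accepted only if the backing
inode lives on a plain, unstacked, non-EROFS fs, and the new superblock
deliberately leaves its own `s_stack_depth` at 0:

#include <stdbool.h>
#include <stdio.h>

struct sb { int stack_depth; bool is_erofs; };

static bool fileio_backing_ok(const struct sb *backing)
{
	return !backing->is_erofs && backing->stack_depth == 0;
}

int main(void)
{
	struct sb xfs = { .stack_depth = 0, .is_erofs = false };
	struct sb ovl = { .stack_depth = 2, .is_erofs = false };
	struct sb ero = { .stack_depth = 0, .is_erofs = true };

	printf("backing on xfs:   %s\n", fileio_backing_ok(&xfs) ? "ok" : "-ENOTBLK");
	printf("backing on ovl:   %s\n", fileio_backing_ok(&ovl) ? "ok" : "-ENOTBLK");
	printf("backing on erofs: %s\n", fileio_backing_ok(&ero) ? "ok" : "-ENOTBLK");
	return 0;
}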
At least it works for all known file-backed mount use cases (composefs,
containerd, and Android APEX for some Android vendors), and the fix is
self-contained.
Let's defer increasing FILESYSTEM_MAX_STACK_DEPTH for now.
Fixes: d53cd891f0e4 ("erofs: limit the level of fs stacking for file-backed mounts")
Closes: https://github.com/coreos/fedora-coreos-tracker/issues/2087 [1]
Closes: https://lore.kernel.org/r/CAFHtUiYv4+=+JP_-JjARWjo6OwcvBj1wtYN=z0QXwCpec9sXtg@mail.gmail.com
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Alexander Larsson <alexl@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
---
fs/erofs/super.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index 937a215f626c..0cf41ed7ced8 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -644,14 +644,20 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
* fs contexts (including its own) due to self-controlled RO
* accesses/contexts and no side-effect changes that need to
* context save & restore so it can reuse the current thread
- * context. However, it still needs to bump `s_stack_depth` to
- * avoid kernel stack overflow from nested filesystems.
+ * context.
+ * However, we still need to prevent kernel stack overflow due
+ * to filesystem nesting: just ensure that s_stack_depth is 0
+ * to disallow mounting EROFS on stacked filesystems.
+ * Note: s_stack_depth is not incremented here for now, since
+ * EROFS is the only fs supporting file-backed mounts for now.
+ * It MUST change if another fs plans to support them, which
+ * may also require adjusting FILESYSTEM_MAX_STACK_DEPTH.
*/
if (erofs_is_fileio_mode(sbi)) {
- sb->s_stack_depth =
- file_inode(sbi->dif0.file)->i_sb->s_stack_depth + 1;
- if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
- erofs_err(sb, "maximum fs stacking depth exceeded");
+ inode = file_inode(sbi->dif0.file);
+ if (inode->i_sb->s_op == &erofs_sops ||
+ inode->i_sb->s_stack_depth) {
+ erofs_err(sb, "file-backed mounts cannot be applied to stacked fses");
return -ENOTBLK;
}
}
--
2.43.5
On Thu, 2026-01-01 at 04:42 +0800, Gao Xiang wrote:
> [...]
> ---
Acked-by: Alexander Larsson <alexl@redhat.com>
--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 Alexander Larsson                                          Red Hat, Inc
       alexl@redhat.com            alexander.larsson@gmail.com
He's an all-American dishevelled card sharp searching for his wife's true
killer. She's a scantily clad insomniac bounty hunter with an incredible
destiny. They fight crime!
On Wed, Dec 31, 2025 at 9:42 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
> [...]
> ---
Acked-by: Amir Goldstein <amir73il@gmail.com>
But you forgot to include details of the stack usage analysis you ran
with the erofs+ovl^2 setup.
I am guessing people will want to see this information before relaxing
s_stack_depth in this case.
Thanks,
Amir.
Hi Amir,
On 2026/1/1 23:52, Amir Goldstein wrote:
> On Wed, Dec 31, 2025 at 9:42 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>
>> [...]
>> ---
>
> Acked-by: Amir Goldstein <amir73il@gmail.com>
>
> But you forgot to include details of the stack usage analysis you ran
> with the erofs+ovl^2 setup.
>
> I am guessing people will want to see this information before relaxing
> s_stack_depth in this case.
Sorry, I didn't check emails these days. I'm not sure if posting
detailed stack traces is useful; how about adding the following
words:
Note: Here are some observations from evaluating the erofs + ovl^2
setup with an XFS backing fs:
- Regular RW workloads traverse only one overlayfs layer regardless of
the value of FILESYSTEM_MAX_STACK_DEPTH, because `upperdir=` cannot
point to another overlayfs. Therefore, for pure RW workloads, the
typical stack is always just:
overlayfs + upper fs + underlay storage
- For read-only workloads and the copy-up read part (ovl_splice_read),
the difference can lie in how many overlays are nested.
The stack just looks like either:
ovl + ovl [+ erofs] + backing fs + underlay storage
or
ovl [+ erofs] + ext4/xfs + underlay storage
- The fs reclaim path should be entered only once, so the writeback
path will not re-enter.
Sorry about my English, and I'm not sure if this is enough (e.g., the
FUSE passthrough part). I will wait for your further input (and other
acks) before sending this patch upstream.
(Also, btw, I'm not sure if it's possible to optimize read_iter and
splice_read stack usage even further in overlayfs, e.g., by handling
the real file/path directly in the top overlayfs instead of recursing,
since the permission check is already done when opening the file.)
Thanks,
Gao Xiang
>
> Thanks,
> Amir.
[+fsdevel][+overlayfs]
On Sun, Jan 4, 2026 at 4:56 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> [...]
> Sorry, I didn't check emails these days. I'm not sure if posting
> detailed stack traces is useful; how about adding the following
> words:
I didn't mean detailed stack traces, but you did run some tests with the
new possible setup and reached stack usage < 8K, so I think this is
something worth mentioning.
>
> [...]
>
> Sorry about my English, and I'm not sure if this is enough (e.g., the
> FUSE passthrough part). I will wait for your further input (and other
> acks) before sending this patch upstream.
>
I think that most people will have problems understanding this
rationale not because of the English, but because of the tech ;)
this is a bit too hand-wavy IMO.
> (Also, btw, I'm not sure if it's possible to optimize read_iter and
> splice_read stack usage even further in overlayfs, e.g., by handling
> the real file/path directly in the top overlayfs instead of recursing,
> since the permission check is already done when opening the file.)
Maybe so, but the LSM permission-to-open hook is not the same hook
as permission to read/write.
Thanks,
Amir.
On 2026/1/4 18:01, Amir Goldstein wrote:
> [...]
> I didn't mean detailed stack traces, but you did run some tests with the
> new possible setup and reached stack usage < 8K, so I think this is
The issue is that my limited stress test setup cannot cover
every case:

- I cannot find a way to reliably trigger direct reclaim from the
deepest memory allocations; is there some suggestion on this?

- I'm not sure what the preferred way is to evaluate the worst
stack usage below the block layer, but I guess we should care more
about the increase in delta from just one more overlayfs?

I can only say that the peak stack usage I've seen from my
fsstress runs on an erofs+ovl^2 setup on x86_64 is < 8K (7184
bytes, though I don't think the peak value by itself is very
useful). That exercises RW workloads in the upperdir, and for
such workloads the stack depth isn't impacted by
FILESYSTEM_MAX_STACK_DEPTH, so I don't see how such a workload
is harmful.
Then I manually copied up some files (because I didn't find any
available tool to stress overlayfs copy-ups), and I could see the
following delta (I think "ovl_copy_up_" is the only path that does
copy-ups):
0) 6688 48 mempool_alloc_slab+0x9/0x20
1) 6640 56 mempool_alloc_noprof+0x65/0xd0
2) 6584 72 __sg_alloc_table+0x128/0x190
3) 6512 40 sg_alloc_table_chained+0x46/0xa0
4) 6472 64 scsi_alloc_sgtables+0x91/0x2c0
5) 6408 72 sd_init_command+0x263/0x930
6) 6336 88 scsi_queue_rq+0x54a/0xb70
7) 6248 144 blk_mq_dispatch_rq_list+0x265/0x6c0
8) 6104 144 __blk_mq_sched_dispatch_requests+0x399/0x5c0
9) 5960 16 blk_mq_sched_dispatch_requests+0x2d/0x70
10) 5944 56 blk_mq_run_hw_queue+0x208/0x290
11) 5888 96 blk_mq_dispatch_list+0x13f/0x460
12) 5792 48 blk_mq_flush_plug_list+0x4b/0x180
13) 5744 32 blk_add_rq_to_plug+0x3d/0x160
14) 5712 136 blk_mq_submit_bio+0x4f4/0x760
15) 5576 120 __submit_bio+0x9b/0x240
16) 5456 88 submit_bio_noacct_nocheck+0x271/0x330
17) 5368 72 iomap_bio_read_folio_range+0xde/0x1d0
18) 5296 112 iomap_read_folio_iter+0x1ee/0x2d0
19) 5184 264 iomap_readahead+0xb9/0x290
20) 4920 48 xfs_vm_readahead+0x4a/0x70
21) 4872 112 read_pages+0x6c/0x1b0
22) 4760 104 page_cache_ra_unbounded+0x12c/0x210
23) 4656 80 filemap_readahead.isra.0+0x78/0xb0
24) 4576 192 filemap_get_pages+0x3a6/0x820
25) 4384 376 filemap_read+0xde/0x380
26) 4008 32 xfs_file_buffered_read+0xa6/0xd0
27) 3976 16 xfs_file_read_iter+0x6a/0xd0
28) 3960 48 vfs_iocb_iter_read+0xdb/0x140
29) 3912 88 erofs_fileio_rq_submit+0x136/0x190
30) 3824 368 z_erofs_runqueue+0x1ce/0x9f0
31) 3456 232 z_erofs_readahead+0x16c/0x220
32) 3224 112 read_pages+0x6c/0x1b0
33) 3112 104 page_cache_ra_unbounded+0x12c/0x210
34) 3008 80 filemap_readahead.isra.0+0x78/0xb0
35) 2928 192 filemap_get_pages+0x3a6/0x820
36) 2736 400 filemap_splice_read+0x12c/0x2f0
37) 2336 48 backing_file_splice_read+0x3f/0x90
38) 2288 128 ovl_splice_read+0xef/0x170
39) 2160 104 splice_direct_to_actor+0xb9/0x260
40) 2056 88 do_splice_direct+0x76/0xc0
41) 1968 120 ovl_copy_up_file+0x1a8/0x2b0
42) 1848 840 ovl_copy_up_one+0x14b0/0x1610
43) 1008 72 ovl_copy_up_flags+0xd7/0x110
44) 936 56 ovl_open+0x72/0x110
45) 880 56 do_dentry_open+0x16c/0x480
46) 824 40 vfs_open+0x2e/0xf0
47) 784 152 path_openat+0x80a/0x12e0
48) 632 296 do_filp_open+0xb8/0x160
49) 336 80 do_sys_openat2+0x72/0xf0
50) 256 40 __x64_sys_openat+0x57/0xa0
51) 216 40 do_syscall_64+0xa4/0x310
52) 176 176 entry_SYSCALL_64_after_hwframe+0x77/0x7f
And it's still far from overflowing the 16k stacks, because the
difference seems to be only how many (ovl_splice_read +
backing_file_splice_read) pairs are nested, and each layer only
takes hundreds of bytes.
Finally, I used my own rostress tool to stress RO workloads, and
the deepest stack so far is as below (5456 bytes):
0) 5456 48 arch_scale_cpu_capacity+0x9/0x30
1) 5408 16 cpu_util.constprop.0+0x7e/0xe0
2) 5392 392 sched_balance_find_src_group+0x29f/0xd30
3) 5000 280 sched_balance_rq+0x1b2/0xf10
4) 4720 120 pick_next_task_fair+0x23b/0x7b0
5) 4600 104 __schedule+0x2bc/0xda0
6) 4496 16 schedule+0x27/0xd0
7) 4480 24 io_schedule+0x46/0x70
8) 4456 120 blk_mq_get_tag+0x11b/0x280
9) 4336 96 __blk_mq_alloc_requests+0x2a1/0x410
10) 4240 136 blk_mq_submit_bio+0x59c/0x760
11) 4104 120 __submit_bio+0x9b/0x240
12) 3984 88 submit_bio_noacct_nocheck+0x271/0x330
13) 3896 72 iomap_bio_read_folio_range+0xde/0x1d0
14) 3824 112 iomap_read_folio_iter+0x1ee/0x2d0
15) 3712 264 iomap_readahead+0xb9/0x290
16) 3448 48 xfs_vm_readahead+0x4a/0x70
17) 3400 112 read_pages+0x6c/0x1b0
18) 3288 104 page_cache_ra_unbounded+0x12c/0x210
19) 3184 80 filemap_readahead.isra.0+0x78/0xb0
20) 3104 192 filemap_get_pages+0x3a6/0x820
21) 2912 376 filemap_read+0xde/0x380
22) 2536 32 xfs_file_buffered_read+0xa6/0xd0
23) 2504 16 xfs_file_read_iter+0x6a/0xd0
24) 2488 48 vfs_iocb_iter_read+0xdb/0x140
25) 2440 88 erofs_fileio_rq_submit+0x136/0x190
26) 2352 368 z_erofs_runqueue+0x1ce/0x9f0
27) 1984 232 z_erofs_readahead+0x16c/0x220
28) 1752 112 read_pages+0x6c/0x1b0
29) 1640 104 page_cache_ra_unbounded+0x12c/0x210
30) 1536 40 force_page_cache_ra+0x96/0xc0
31) 1496 192 filemap_get_pages+0x123/0x820
32) 1304 376 filemap_read+0xde/0x380
33) 928 72 do_iter_readv_writev+0x1b9/0x220
34) 856 56 vfs_iter_read+0xde/0x140
35) 800 64 backing_file_read_iter+0x193/0x1e0
36) 736 56 ovl_read_iter+0x98/0xa0
37) 680 72 do_iter_readv_writev+0x1b9/0x220
38) 608 56 vfs_iter_read+0xde/0x140
39) 552 64 backing_file_read_iter+0x193/0x1e0
40) 488 56 ovl_read_iter+0x98/0xa0
41) 432 152 vfs_read+0x21a/0x350
42) 280 64 __x64_sys_pread64+0x92/0xc0
43) 216 40 do_syscall_64+0xa4/0x310
44) 176 176 entry_SYSCALL_64_after_hwframe+0x77/0x7f
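To put rough numbers on the per-layer cost, summing the forwarding
frames from the two traces above (x86_64, this particular build; a
trivial sketch, not a rigorous bound):

#include <stdio.h>

int main(void)
{
	/* splice path, from the copy-up trace above */
	int splice = 48		/* backing_file_splice_read */
		   + 128;	/* ovl_splice_read */
	/* read_iter path, from the RO trace above */
	int riter = 72		/* do_iter_readv_writev */
		  + 56		/* vfs_iter_read */
		  + 64		/* backing_file_read_iter */
		  + 56;		/* ovl_read_iter */

	printf("extra stack per nested overlay: splice ~%d B, read_iter ~%d B\n",
	       splice, riter);
	return 0;
}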
> something worth mentioning.
>
>> [...]
>
> I think that most people will have problems understanding this
> rationale not because of the English, but because of the tech ;)
> this is a bit too hand-wavy IMO.
Honestly, I don't have a better way to describe it; I think we'd
better focus on the increment from one more overlayfs:

FILESYSTEM_MAX_STACK_DEPTH 2 already works with 8k kstacks on
32-bit arches, so I don't think going from FILESYSTEM_MAX_STACK_DEPTH
2 to 3, which adds a few hundred bytes of stack usage for the
intermediate overlayfs with 16k kstacks on 64-bit arches, is
harmful (and only RO workloads and copy-ups are impacted).

And if a few hundred bytes of additional stack usage can overflow
the 16k kstack, then I think the kernel stack could be overflowed
randomly everywhere in the storage stack, not just because of this
FILESYSTEM_MAX_STACK_DEPTH modification.
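In budget terms (a crude sketch using the numbers above; it ignores
IRQ and reclaim headroom, of course):

#include <stdio.h>

int main(void)
{
	int kstack = 16384;	/* x86_64 kernel stack (THREAD_SIZE) */
	int peak = 7184;	/* fsstress peak observed above */
	int per_layer = 248;	/* worst per-overlay delta from the traces */

	printf("headroom after observed peak: %d bytes; one more overlay costs ~%d\n",
	       kstack - peak, per_layer);
	return 0;
}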
Thanks,
Gao Xiang
On Sun, Jan 4, 2026 at 11:42 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> [...]
>
> Honestly, I don't have a better way to describe it; I think we'd
> better focus on the increment from one more overlayfs:
>
ok. but are we talking about one more overlayfs?
This patch is adding just one erofs, so what am I missing?
> FILESYSTEM_MAX_STACK_DEPTH 2 already works with 8k kstacks on
> 32-bit arches, so I don't think going from FILESYSTEM_MAX_STACK_DEPTH
> 2 to 3, which adds a few hundred bytes of stack usage for the
> intermediate overlayfs with 16k kstacks on 64-bit arches, is
> harmful (and only RO workloads and copy-ups are impacted).
>
> And if a few hundred bytes of additional stack usage can overflow
> the 16k kstack, then I think the kernel stack could be overflowed
> randomly everywhere in the storage stack, not just because of this
> FILESYSTEM_MAX_STACK_DEPTH modification.
>
Fine by me, but does that mean that you only want to allow
erofs backing files with >8K stack size?
Otherwise, I do not follow your argument.
Thanks,
Amir.
On 2026/1/5 02:44, Amir Goldstein wrote:
> [...]
>>
>> Honestly, I don't have a better way to describe it; I think we'd
>> better focus on the increment from one more overlayfs:
>>
>
> ok. but are we talking about one more overlayfs?
> This patch is adding just one erofs, so what am I missing?
Sorry, I didn't describe it accurately: first, I tested erofs+ovl^2.
It's the last overlayfs mount that fails, and the stack traces start
from the last overlayfs. So compared with the normal erofs+ovl setup
(which can already be mounted correctly upstream without this patch),
I mean it's one more overlayfs.
>
>> FILESYSTEM_MAX_STACK_DEPTH 2 already works with 8k kstacks on
>> 32-bit arches, so I don't think going from FILESYSTEM_MAX_STACK_DEPTH
>> 2 to 3, which adds a few hundred bytes of stack usage for the
>> intermediate overlayfs with 16k kstacks on 64-bit arches, is
>> harmful (and only RO workloads and copy-ups are impacted).
>>
>> And if a few hundred bytes of additional stack usage can overflow
>> the 16k kstack, then I think the kernel stack could be overflowed
>> randomly everywhere in the storage stack, not just because of this
>> FILESYSTEM_MAX_STACK_DEPTH modification.
>>
>
> Fine by me, but does that mean that you only want to allow
> erofs backing files with >8K stack size?
With FILESYSTEM_MAX_STACK_DEPTH 2 and without this patch, erofs+ovl
can still succeed (so erofs+ovl should be guaranteed to always be
fine); compared with that, with FILESYSTEM_MAX_STACK_DEPTH 3 the
extra stack is always one more overlayfs, I think (either
erofs+ovl^2 or ovl^3)?
Thanks,
Gao Xiang
>
> Otherwise, I do not follow your argument.
>
> Thanks,
> Amir.
Previously, commit d53cd891f0e4 ("erofs: limit the level of fs stacking
for file-backed mounts") bumped `s_stack_depth` by one to avoid kernel
stack overflow when stacking an unlimited number of EROFS on top of
each other.
This fix breaks composefs mounts, which need EROFS+ovl^2 sometimes
(and such setups are already used in production for quite a long time).
One way to fix this regression is to bump FILESYSTEM_MAX_STACK_DEPTH
from 2 to 3, but proving that this is safe in general is a high bar.
After a long discussion on GitHub issues [1] about possible solutions,
one conclusion is that there is no need to support nesting file-backed
EROFS mounts on stacked filesystems, because there is always the option
to use loopback devices as a fallback.
As a quick fix for the composefs regression for this cycle, instead of
bumping `s_stack_depth` for file-backed EROFS mounts, we disallow
nesting file-backed EROFS over EROFS and over filesystems with
`s_stack_depth` > 0.
This works for all known file-backed mount use cases (composefs,
containerd, and Android APEX for some Android vendors), and the fix is
self-contained.
Essentially, we are allowing one extra unaccounted fs stacking level of
EROFS below stacking filesystems, but EROFS can only be used in the read
path (i.e. overlayfs lower layers), which typically has much lower stack
usage than the write path.
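For illustration, a tiny userspace sketch (not kernel code) of the
resulting depth accounting for the composefs chain:

#include <stdio.h>

int main(void)
{
	/* the backing fs (e.g. xfs) is 0, the file-backed EROFS stays
	 * at 0 (the unaccounted level), and each overlay adds one */
	int erofs = 0, ovl1 = erofs + 1, ovl2 = ovl1 + 1;

	printf("outer overlay: s_stack_depth = %d (limit 2) -> mounts fine\n", ovl2);
	return 0;
}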
We can consider increasing FILESYSTEM_MAX_STACK_DEPTH later, after more
stack usage analysis or using alternative approaches, such as splitting
the `s_stack_depth` limitation according to different combinations of
stacking.
Fixes: d53cd891f0e4 ("erofs: limit the level of fs stacking for file-backed mounts")
Reported-by: Dusty Mabe <dusty@dustymabe.com>
Reported-by: Timothée Ravier <tim@siosm.fr>
Closes: https://github.com/coreos/fedora-coreos-tracker/issues/2087 [1]
Reported-by: "Alekséi Naidénov" <an@digitaltide.io>
Closes: https://lore.kernel.org/r/CAFHtUiYv4+=+JP_-JjARWjo6OwcvBj1wtYN=z0QXwCpec9sXtg@mail.gmail.com
Acked-by: Amir Goldstein <amir73il@gmail.com>
Cc: Alexander Larsson <alexl@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Sheng Yong <shengyong1@xiaomi.com>
Cc: Zhiguo Niu <niuzhiguo84@gmail.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
---
v2:
- Update commit message (suggested by Amir in 1-on-1 talk);
- Add proper `Reported-by:`.
fs/erofs/super.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index 937a215f626c..0cf41ed7ced8 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -644,14 +644,20 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
* fs contexts (including its own) due to self-controlled RO
* accesses/contexts and no side-effect changes that need to
* context save & restore so it can reuse the current thread
- * context. However, it still needs to bump `s_stack_depth` to
- * avoid kernel stack overflow from nested filesystems.
+ * context.
+ * However, we still need to prevent kernel stack overflow due
+ * to filesystem nesting: just ensure that s_stack_depth is 0
+ * to disallow mounting EROFS on stacked filesystems.
+ * Note: s_stack_depth is not incremented here for now, since
+ * EROFS is the only fs supporting file-backed mounts for now.
+ * It MUST change if another fs plans to support them, which
+ * may also require adjusting FILESYSTEM_MAX_STACK_DEPTH.
*/
if (erofs_is_fileio_mode(sbi)) {
- sb->s_stack_depth =
- file_inode(sbi->dif0.file)->i_sb->s_stack_depth + 1;
- if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
- erofs_err(sb, "maximum fs stacking depth exceeded");
+ inode = file_inode(sbi->dif0.file);
+ if (inode->i_sb->s_op == &erofs_sops ||
+ inode->i_sb->s_stack_depth) {
+ erofs_err(sb, "file-backed mounts cannot be applied to stacked fses");
return -ENOTBLK;
}
}
--
2.43.5
On 1/7/26 01:05, Gao Xiang wrote:
> [...]
> if (erofs_is_fileio_mode(sbi)) {
> - sb->s_stack_depth =
> - file_inode(sbi->dif0.file)->i_sb->s_stack_depth + 1;
> - if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
> - erofs_err(sb, "maximum fs stacking depth exceeded");
> + inode = file_inode(sbi->dif0.file);
> + if (inode->i_sb->s_op == &erofs_sops ||
Hi Xiang,
In the Android APEX scenario, APEX images formatted as EROFS are packed
into system.img, which is also EROFS. As a result, an APEX file-backed
mount will always fail since `inode->i_sb->s_op == &erofs_sops' is true.
Any thoughts on how to handle this scenario?
thanks,
shengyong
> + inode->i_sb->s_stack_depth) {
> + erofs_err(sb, "file-backed mounts cannot be applied to stacked fses");
> return -ENOTBLK;
> }
> }
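To make the failure mode concrete, here is a sketch (editorial, not
from the thread) of how the check above evaluates for the APEX layout
Sheng describes, where system.img is a bdev-backed EROFS mount:

	/*
	 * erofs (apex.img, file-backed)
	 *   -> backing inode lives on erofs (system.img, bdev-backed)
	 */
	inode = file_inode(sbi->dif0.file);	/* apex.img on system.img */
	inode->i_sb->s_op == &erofs_sops	/* true: backing fs is EROFS */
	inode->i_sb->s_stack_depth		/* 0: bdev-backed, unstacked */

so the first clause alone rejects a backing fs that cannot actually
recurse any further.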
Hi Sheng,
On 2026/1/8 10:26, Sheng Yong wrote:
> On 1/7/26 01:05, Gao Xiang wrote:
>> [...]
>> if (erofs_is_fileio_mode(sbi)) {
>> - sb->s_stack_depth =
>> - file_inode(sbi->dif0.file)->i_sb->s_stack_depth + 1;
>> - if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
>> - erofs_err(sb, "maximum fs stacking depth exceeded");
>> + inode = file_inode(sbi->dif0.file);
>> + if (inode->i_sb->s_op == &erofs_sops ||
>
> Hi Xiang,
>
> In the Android APEX scenario, APEX images formatted as EROFS are packed
> into system.img, which is also EROFS. As a result, an APEX file-backed
> mount will always fail since `inode->i_sb->s_op == &erofs_sops' is true.
> Any thoughts on how to handle this scenario?
Sorry, I forgot this popular case. I think it can be simply resolved
by the following diff:
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index 0cf41ed7ced8..e93264034b5d 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -655,7 +655,7 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
*/
if (erofs_is_fileio_mode(sbi)) {
inode = file_inode(sbi->dif0.file);
- if (inode->i_sb->s_op == &erofs_sops ||
+ if ((inode->i_sb->s_op == &erofs_sops && !sb->s_bdev) ||
inode->i_sb->s_stack_depth) {
erofs_err(sb, "file-backed mounts cannot be applied to stacked fses");
return -ENOTBLK;
"!sb->s_bdev" covers file-backed EROFS mounts and
(deprecated) fscache EROFS mounts, I will send v3 soon.
Thanks,
Gao Xiang
>
> thanks,
> shengyong
>
>> + inode->i_sb->s_stack_depth) {
>> + erofs_err(sb, "file-backed mounts cannot be applied to stacked fses");
>> return -ENOTBLK;
>> }
>> }
On 2026/1/8 10:32, Gao Xiang wrote:
> Hi Sheng,
>
> On 2026/1/8 10:26, Sheng Yong wrote:
>> [...]
>
> Sorry, I forgot this popular case. I think it can be simply resolved
> by the following diff:
>
> diff --git a/fs/erofs/super.c b/fs/erofs/super.c
> index 0cf41ed7ced8..e93264034b5d 100644
> --- a/fs/erofs/super.c
> +++ b/fs/erofs/super.c
> @@ -655,7 +655,7 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
> */
> if (erofs_is_fileio_mode(sbi)) {
> inode = file_inode(sbi->dif0.file);
> - if (inode->i_sb->s_op == &erofs_sops ||
> + if ((inode->i_sb->s_op == &erofs_sops && !sb->s_bdev) ||
Sorry, it should be `!inode->i_sb->s_bdev`; I've
fixed it in v3 RESEND:
https://lore.kernel.org/r/20260108030709.3305545-1-hsiangkao@linux.alibaba.com
Thanks,
Gao Xiang
> inode->i_sb->s_stack_depth) {
> erofs_err(sb, "file-backed mounts cannot be applied to stacked fses");
> return -ENOTBLK;
>
> "!sb->s_bdev" covers file-backed EROFS mounts and
> (deprecated) fscache EROFS mounts, I will send v3 soon.
>
> Thanks,
> Gao Xiang
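Putting the correction together, the resulting v3 condition reads as
below; the clause annotations are editorial, not from the patch:

	inode = file_inode(sbi->dif0.file);
	if ((inode->i_sb->s_op == &erofs_sops &&  /* backing fs is EROFS */
	     !inode->i_sb->s_bdev) ||             /* ...and has no bdev, i.e.
	                                             file-backed or fscache:
	                                             reject nested EROFS, while
	                                             APEX on bdev EROFS passes */
	    inode->i_sb->s_stack_depth) {         /* backing fs is stacked
	                                             (e.g. overlayfs): reject */
		erofs_err(sb, "file-backed mounts cannot be applied to stacked fses");
		return -ENOTBLK;
	}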
On Thu, Jan 8, 2026 at 4:10 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
>
>
> On 2026/1/8 10:32, Gao Xiang wrote:
> > Hi Sheng,
> >
> > On 2026/1/8 10:26, Sheng Yong wrote:
> >> [...]
> >
> > Sorry, I forgot this popular case. I think it can be simply resolved
> > by the following diff:
> >
> > diff --git a/fs/erofs/super.c b/fs/erofs/super.c
> > index 0cf41ed7ced8..e93264034b5d 100644
> > --- a/fs/erofs/super.c
> > +++ b/fs/erofs/super.c
> > @@ -655,7 +655,7 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
> > */
> > if (erofs_is_fileio_mode(sbi)) {
> > inode = file_inode(sbi->dif0.file);
> > - if (inode->i_sb->s_op == &erofs_sops ||
> > + if ((inode->i_sb->s_op == &erofs_sops && !sb->s_bdev) ||
>
> Sorry, it should be `!inode->i_sb->s_bdev`; I've
> fixed it in v3 RESEND:
A RESEND implies no changes since v3, so this is bad practice.
> https://lore.kernel.org/r/20260108030709.3305545-1-hsiangkao@linux.alibaba.com
>
Ouch! If the erofs maintainer got this condition wrong... twice...
Maybe better to use the helper instead of open-coding this non-trivial check?
if ((inode->i_sb->s_op == &erofs_sops &&
erofs_is_fileio_mode(EROFS_I_SB(inode)))
Thanks,
Amir.
Hi Amir,
On 2026/1/8 16:02, Amir Goldstein wrote:
> On Thu, Jan 8, 2026 at 4:10 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
...
>> [...]
>
> A RESEND implies no changes since v3, so this is bad practice.
>
>> https://lore.kernel.org/r/20260108030709.3305545-1-hsiangkao@linux.alibaba.com
>>
>
> Ouch! If the erofs maintainer got this condition wrong... twice...
> Maybe better to use the helper instead of open-coding this non-trivial check?
>
> if ((inode->i_sb->s_op == &erofs_sops &&
> erofs_is_fileio_mode(EROFS_I_SB(inode)))
I thought about using that, but it excludes fscache as the
backing fs, so I suggest using !s_bdev directly to
cover both the file-backed mount and fscache cases.
Thanks,
Gao Xiang
>
> Thanks,
> Amir.
On Thu, 8 Jan 2026 16:05:03 +0800
Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> Hi Amir,
>
> On 2026/1/8 16:02, Amir Goldstein wrote:
> > On Thu, Jan 8, 2026 at 4:10 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
> ...
>
> > [...]
>
> I thought about using that, but it excludes fscache as the
> backing fs, so I suggest using !s_bdev directly to
> cover both the file-backed mount and fscache cases.
Is it worth just allocating each fs a 'stack needed' value and then
allowing the mount if the total is low enough?
This is equivalent to counting the recursion depth, but lets erofs only
add (say) 0.5.
Ideally you'd want to do static analysis to find the value to add,
but 'inspired guesswork' is probably good enough.
Isn't there also a big difference between recursive mounts (which need
to do read/write on the underlying file) and overlay mounts (which just
pass the request onto the lower filesystem)?
David
>
> Thanks,
> Gao Xiang
>
> >
> > Thanks,
> > Amir.
>
>
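As a toy model of David's fractional 'stack needed' idea (purely
hypothetical; no such mechanism exists in the kernel, and all names
and values below are illustrative guesses):

	#define FS_STACK_BUDGET		20	/* 2.0 classic stacking layers */
	#define FS_COST_OVERLAYFS	10	/* 1.0: a full stacked fs */
	#define FS_COST_EROFS_FILE	5	/* 0.5: read-only, read path only */

	/* would replace today's "depth = lower + 1, compare to max" rule */
	static int fs_stack_charge(int lower_cost, int my_cost)
	{
		if (lower_cost + my_cost > FS_STACK_BUDGET)
			return -EINVAL;		/* stacking budget exceeded */
		return lower_cost + my_cost;	/* new accumulated cost */
	}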
Hi David,
On 2026/1/8 18:26, David Laight wrote:
> On Thu, 8 Jan 2026 16:05:03 +0800
> Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
>> Hi Amir,
>>
>> On 2026/1/8 16:02, Amir Goldstein wrote:
>>> On Thu, Jan 8, 2026 at 4:10 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>
>> ...
>>
>> [...]
>
> Is it worth just allocating each fs a 'stack needed' value and then
> allowing the mount if the total is low enough?
> This is equivalent to counting the recursion depth, but lets erofs only
> add (say) 0.5.
> Ideally you'd want to do static analysis to find the value to add,
> but 'inspired guesswork' is probably good enough.
That is a good alternative, but there are also practical issues,
such as how to evaluate stack usage under the block layer.
Also, the rule exposed to userspace becomes complex if we do it
that way.
>
> Isn't there also a big difference between recursive mounts (which need
> to do read/write on the underlying file) and overlay mounts (which just
> pass the request onto the lower filesystem)?
As for EROFS, we only care about reads, which are safe enough;
I won't speak for write paths (sb_writers and journal nesting,
for example, and I don't want to broaden the discussion since
that's largely unrelated to the topic).
I agree, but as I said above, it makes the rule more complex,
and users would have no idea when a mount is allowed and when
it is not.
Anyway, I think that with the current 16k kernel stack,
FILESYSTEM_MAX_STACK_DEPTH = 3 is safe enough to leave an ample
margin for the underlying storage stack.
I have no idea how to prove it strictly, but it's roughly
demonstrable by measuring the remaining stack size when the real
backing fs is reached; FILESYSTEM_MAX_STACK_DEPTH = 2 was an
arbitrary choice too.
Thanks,
Gao Xiang
>
> David
>
>>
>> Thanks,
>> Gao Xiang
>>
>>>
>>> Thanks,
>>> Amir.
>>
>>
On Thu, Jan 8, 2026 at 9:05 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
> Hi Amir,
>
> On 2026/1/8 16:02, Amir Goldstein wrote:
> > On Thu, Jan 8, 2026 at 4:10 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
> ...
>
> > [...]
>
> I thought about using that, but it excludes fscache as the
> backing fs, so I suggest using !s_bdev directly to
> cover both the file-backed mount and fscache cases.
Your fs, your decision.
But what are you actually saying?
Are you saying that reading from file-backed fscache has similar
stack usage to reading from file-backed erofs?
Isn't fscache doing async file IO?
If we regard fscache as an extra unaccounted layer, because of all the
sync operations that it does, then we already allowed this setup a long
time ago, e.g. fscache+nfs+ovl^2.
This could be an argument to support the claim that stack usage of
file+erofs+ovl^2 should also be fine.
Thanks,
Amir.
On 2026/1/8 16:24, Amir Goldstein wrote:
> On Thu, Jan 8, 2026 at 9:05 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>
>> Hi Amir,
>>
>> On 2026/1/8 16:02, Amir Goldstein wrote:
>>> On Thu, Jan 8, 2026 at 4:10 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>
>> ...
>>
>>> [...]
>>
>> I thought about using that, but it excludes fscache as the
>> backing fs, so I suggest using !s_bdev directly to
>> cover both the file-backed mount and fscache cases.
>
> Your fs, your decision.
>
> But what are you actually saying?
> Are you saying that reading from file-backed fscache has similar
> stack usage to reading from file-backed erofs?
Nope, I just don't want to be bothered with fscache in any case
since it's already deprecated; IOW, I don't want a setup like
erofs (file-backed) + erofs (fscache) + ...
to work. I just want to allow the
erofs (APEX) + erofs (bdev) + ...
cases, since Android users rely on them, in addition to
ovl^2 + erofs + ext4 / xfs / ... (composefs, containerd, and ...).
Does that make sense?
> Isn't fscache doing async file IO?
But as I said, AIO is not a must; it can still
fall back to sync I/O.
>
> If we regard fscache an extra unaccounted layer, because of all the
> sync operations that it does, then we already allowed this setup a long
> time ago, e.g. fscache+nfs+ovl^2.
>
> This could be an argument to support the claim that stack usage of
> file+erofs+ovl^2 should also be fine.
Anyway, I'm not sure how many users really use that, so
I won't speak to it.
Thanks,
Gao Xiang
>
> Thanks,
> Amir.
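Mapped onto the v3 condition (posted below), that intent works out as
follows (an editorial summary, not from the thread):

	/*
	 * erofs(file) on erofs(fscache): s_op == &erofs_sops, !s_bdev -> reject
	 * erofs(APEX) on erofs(bdev):    s_op == &erofs_sops, s_bdev  -> allow
	 * erofs(file) on ext4/xfs:       s_op != &erofs_sops, depth 0 -> allow
	 * erofs(file) on ovl:            s_stack_depth > 0            -> reject
	 *
	 * ovl^2 on top of an allowed erofs mount is then governed by ovl's
	 * own s_stack_depth accounting, which this patch leaves untouched.
	 */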
Previously, commit d53cd891f0e4 ("erofs: limit the level of fs stacking
for file-backed mounts") bumped `s_stack_depth` by one to avoid kernel
stack overflow when stacking an unlimited number of EROFS on top of
each other.
This fix breaks composefs mounts, which need EROFS+ovl^2 sometimes
(and such setups are already used in production for quite a long time).
One way to fix this regression is to bump FILESYSTEM_MAX_STACK_DEPTH
from 2 to 3, but proving that this is safe in general is a high bar.
After a long discussion on GitHub issues [1] about possible solutions,
one conclusion is that there is no need to support nesting file-backed
EROFS mounts on stacked filesystems, because there is always the option
to use loopback devices as a fallback.
As a quick fix for the composefs regression for this cycle, instead of
bumping `s_stack_depth` for file-backed EROFS mounts, we disallow
nesting file-backed EROFS over EROFS and over filesystems with
`s_stack_depth` > 0.
This works for all known file-backed mount use cases (composefs,
containerd, and Android APEX for some Android vendors), and the fix is
self-contained.
Essentially, we are allowing one extra unaccounted fs stacking level of
EROFS below stacking filesystems, but EROFS can only be used in the read
path (i.e. overlayfs lower layers), which typically has much lower stack
usage than the write path.
We can consider increasing FILESYSTEM_MAX_STACK_DEPTH later, after more
stack usage analysis or using alternative approaches, such as splitting
the `s_stack_depth` limitation according to different combinations of
stacking.
Fixes: d53cd891f0e4 ("erofs: limit the level of fs stacking for file-backed mounts")
Reported-and-tested-by: Dusty Mabe <dusty@dustymabe.com>
Reported-by: Timothée Ravier <tim@siosm.fr>
Closes: https://github.com/coreos/fedora-coreos-tracker/issues/2087 [1]
Reported-by: "Alekséi Naidénov" <an@digitaltide.io>
Closes: https://lore.kernel.org/r/CAFHtUiYv4+=+JP_-JjARWjo6OwcvBj1wtYN=z0QXwCpec9sXtg@mail.gmail.com
Acked-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Alexander Larsson <alexl@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Sheng Yong <shengyong1@xiaomi.com>
Cc: Zhiguo Niu <niuzhiguo84@gmail.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
---
v2->v3 RESEND:
- Exclude bdev-backed EROFS mounts, since such a mount is a real terminal
fs, as pointed out by Sheng Yong (APEX will rely on this);
- Preserve the previous "Acked-by:" and "Tested-by:" since the change is
trivial.
fs/erofs/super.c | 19 +++++++++++++------
1 file changed, 13 insertions(+), 6 deletions(-)
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index 937a215f626c..5136cda5972a 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -644,14 +644,21 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
* fs contexts (including its own) due to self-controlled RO
* accesses/contexts and no side-effect changes that need to
* context save & restore so it can reuse the current thread
- * context. However, it still needs to bump `s_stack_depth` to
- * avoid kernel stack overflow from nested filesystems.
+ * context.
+ * However, we still need to prevent kernel stack overflow due
+ * to filesystem nesting: just ensure that s_stack_depth is 0
+ * to disallow mounting EROFS on stacked filesystems.
+ * Note: s_stack_depth is not incremented here for now, since
+ * EROFS is the only fs supporting file-backed mounts for now.
+ * It MUST change if another fs plans to support them, which
+ * may also require adjusting FILESYSTEM_MAX_STACK_DEPTH.
*/
if (erofs_is_fileio_mode(sbi)) {
- sb->s_stack_depth =
- file_inode(sbi->dif0.file)->i_sb->s_stack_depth + 1;
- if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
- erofs_err(sb, "maximum fs stacking depth exceeded");
+ inode = file_inode(sbi->dif0.file);
+ if ((inode->i_sb->s_op == &erofs_sops &&
+ !inode->i_sb->s_bdev) ||
+ inode->i_sb->s_stack_depth) {
+ erofs_err(sb, "file-backed mounts cannot be applied to stacked fses");
return -ENOTBLK;
}
}
--
2.43.5
On Thu, Jan 08, 2026 at 11:07:09AM +0800, Gao Xiang wrote:
> [...]
Acked-by: Christian Brauner <brauner@kernel.org>
On 1/8/2026 11:07 AM, Gao Xiang wrote:
> [...]
Reviewed-by: Chao Yu <chao@kernel.org>
Thanks,
Gao Xiang <hsiangkao@linux.alibaba.com> wrote on Thu, Jan 8, 2026 at 11:07:
>
> [...]
> if (erofs_is_fileio_mode(sbi)) {
> - sb->s_stack_depth =
> - file_inode(sbi->dif0.file)->i_sb->s_stack_depth + 1;
> - if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
> - erofs_err(sb, "maximum fs stacking depth exceeded");
> + inode = file_inode(sbi->dif0.file);
> + if ((inode->i_sb->s_op == &erofs_sops &&
> + !inode->i_sb->s_bdev) ||
> + inode->i_sb->s_stack_depth) {
> + erofs_err(sb, "file-backed mounts cannot be applied to stacked fses");
Hi Xiang
Do we need to print s_stack_depth here to distinguish which specific
problem case it is?
Otherwise LGTM based on my basic test, so:
Reviewed-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Thanks!
> return -ENOTBLK;
> }
> }
> --
> 2.43.5
>
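For reference, the variant Zhiguo seems to be suggesting would look
roughly like this (hypothetical, never applied):

	erofs_err(sb, "file-backed mounts cannot be applied to stacked fses (backing s_stack_depth=%d)",
		  inode->i_sb->s_stack_depth);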
On 2026/1/8 17:28, Zhiguo Niu wrote:
> Gao Xiang <hsiangkao@linux.alibaba.com> wrote on Thu, Jan 8, 2026 at 11:07:
>>
>> [...]
>> if (erofs_is_fileio_mode(sbi)) {
>> - sb->s_stack_depth =
>> - file_inode(sbi->dif0.file)->i_sb->s_stack_depth + 1;
>> - if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
>> - erofs_err(sb, "maximum fs stacking depth exceeded");
>> + inode = file_inode(sbi->dif0.file);
>> + if ((inode->i_sb->s_op == &erofs_sops &&
>> + !inode->i_sb->s_bdev) ||
>> + inode->i_sb->s_stack_depth) {
>> + erofs_err(sb, "file-backed mounts cannot be applied to stacked fses");
> Hi Xiang
> Do we need to print s_stack_depth here to distinguish which specific
> problem case it is?
... I don't want to complicate it (since it's just a short-term
solution and erofs is unaccounted, so s_stack_depth really means
nothing) unless it's really needed by Android vendors?
> Otherwise LGTM based on my basic test, so:
>
> Reviewed-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Thanks for this too.
Thanks,
Gao Xiang
> Thanks!
>> return -ENOTBLK;
>> }
>> }
>> --
>> 2.43.5
>>
On 1/8/26 11:07, Gao Xiang wrote:
> [...]
Reviewed-and-tested-by: Sheng Yong <shengyong1@xiaomi.com>
I tested the APEX scenario on an Android phone. APEX images are
file-backed mounted correctly, and for a stacked APEX test case,
it reports an error as expected.
thanks,
shengyong
> [...]
Hi Sheng,
On 2026/1/8 17:14, Sheng Yong wrote:
> On 1/8/26 11:07, Gao Xiang wrote:
>> [...]
>
> Reviewed-and-tested-by: Sheng Yong <shengyong1@xiaomi.com>
>
> I tested the APEX scenario on an Android phone. APEX images are
> mounted correctly in file-backed mode.
> And for a stacked APEX test case, it reports an error as expected.
Just to make sure: it's an invalid case (one that should not be
used on Android), yes? If so, thanks for the test on the APEX side.
Thanks,
Gao Xiang
>
> thanks,
> shengyong
On 1/8/26 17:25, Gao Xiang wrote:
> Hi Sheng,
>
> On 2026/1/8 17:14, Sheng Yong wrote:
>> On 1/8/26 11:07, Gao Xiang wrote:
>>> [...]
>>
>> Reviewed-and-tested-by: Sheng Yong <shengyong1@xiaomi.com>
>>
>> I tested the APEX scenario on an Android phone. APEX images are
>> mounted correctly in file-backed mode.
>>
>> And for a stacked APEX test case, it reports an error as expected.
>
Hi Xiang,
> Just to make sure: it's an invalid case (one that should not be
> used on Android), yes? If so, thanks for the test on the APEX side.
No, it's not a real use case, just an invalid one used only to
test the error-handling path.
thanks,
shengyong
>
> Thanks,
> Gao Xiang
>
>>
>> thanks,
>> shengyong
Previously, commit d53cd891f0e4 ("erofs: limit the level of fs stacking
for file-backed mounts") bumped `s_stack_depth` by one to avoid kernel
stack overflow when stacking an unlimited number of EROFS on top of
each other.
This fix breaks composefs mounts, which sometimes need EROFS+ovl^2
(and such setups have already been used in production for quite a long time).
One way to fix this regression is to bump FILESYSTEM_MAX_STACK_DEPTH
from 2 to 3, but proving that this is safe in general is a high bar.
After a long discussion on GitHub issues [1] about possible solutions,
one conclusion is that there is no need to support nesting file-backed
EROFS mounts on stacked filesystems, because there is always the option
to use loopback devices as a fallback.
As a quick fix for the composefs regression for this cycle, instead of
bumping `s_stack_depth` for file-backed EROFS mounts, we disallow
nesting file-backed EROFS over file-backed EROFS and over filesystems
with `s_stack_depth` > 0.
This works for all known file-backed mount use cases (composefs,
containerd, and Android APEX for some Android vendors), and the fix is
self-contained.
Essentially, we are allowing one extra unaccounted fs stacking level of
EROFS below stacking filesystems, but EROFS can only be used in the read
path (i.e. overlayfs lower layers), which typically has much lower stack
usage than the write path.
We can consider increasing FILESYSTEM_MAX_STACK_DEPTH later, after more
stack usage analysis or using alternative approaches, such as splitting
the `s_stack_depth` limitation according to different combinations of
stacking.
Fixes: d53cd891f0e4 ("erofs: limit the level of fs stacking for file-backed mounts")
Reported-and-tested-by: Dusty Mabe <dusty@dustymabe.com>
Reported-by: Timothée Ravier <tim@siosm.fr>
Closes: https://github.com/coreos/fedora-coreos-tracker/issues/2087 [1]
Reported-by: "Alekséi Naidénov" <an@digitaltide.io>
Closes: https://lore.kernel.org/r/CAFHtUiYv4+=+JP_-JjARWjo6OwcvBj1wtYN=z0QXwCpec9sXtg@mail.gmail.com
Acked-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Alexander Larsson <alexl@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Sheng Yong <shengyong1@xiaomi.com>
Cc: Zhiguo Niu <niuzhiguo84@gmail.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
---
v3:
- Exclude bdev-backed EROFS mounts, since they act as real terminal fses,
as pointed out by Sheng Yong (APEX will rely on this);
- Preserve the previous "Acked-by:" and "Tested-by:" since the change is trivial.
fs/erofs/super.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index 937a215f626c..e93264034b5d 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -644,14 +644,20 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
* fs contexts (including its own) due to self-controlled RO
* accesses/contexts and no side-effect changes that need to
* context save & restore so it can reuse the current thread
- * context. However, it still needs to bump `s_stack_depth` to
- * avoid kernel stack overflow from nested filesystems.
+ * context.
+ * However, we still need to prevent kernel stack overflow due
+ * to filesystem nesting: disallow file-backed EROFS on stacked
+ * fses (s_stack_depth != 0) and on another file-backed EROFS.
+ * Note: s_stack_depth is not incremented here for now, since
+ * EROFS is currently the only fs supporting file-backed mounts.
+ * It MUST change if another fs plans to support them, which
+ * may also require adjusting FILESYSTEM_MAX_STACK_DEPTH.
*/
if (erofs_is_fileio_mode(sbi)) {
- sb->s_stack_depth =
- file_inode(sbi->dif0.file)->i_sb->s_stack_depth + 1;
- if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
- erofs_err(sb, "maximum fs stacking depth exceeded");
+ inode = file_inode(sbi->dif0.file);
+ if ((inode->i_sb->s_op == &erofs_sops && !inode->i_sb->s_bdev) ||
+ inode->i_sb->s_stack_depth) {
+ erofs_err(sb, "file-backed mounts cannot be applied to stacked fses");
return -ENOTBLK;
}
}
--
2.43.5
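To illustrate the rule the hunk above enforces, here is a minimal
user-space sketch. It is not the kernel code: struct super_block is cut
down to the three fields involved, erofs_sops_tag stands in for the
kernel's &erofs_sops, and file_backed_mount_ok() is a hypothetical
helper invented for this example.

#include <stdbool.h>
#include <stdio.h>

struct super_block {
	const void *s_op;	/* identifies the filesystem type */
	void *s_bdev;		/* non-NULL for bdev-backed mounts */
	int s_stack_depth;	/* stacking depth; 0 for plain fses */
};

static const char erofs_sops_tag;	/* stand-in for &erofs_sops */

/* May a file-backed EROFS image residing on @backing be mounted? */
static bool file_backed_mount_ok(const struct super_block *backing)
{
	/* No file-backed EROFS over file-backed EROFS: that would add
	 * an unaccounted extra stacking level. */
	if (backing->s_op == &erofs_sops_tag && !backing->s_bdev)
		return false;
	/* No stacked fses (overlayfs etc.) underneath either. */
	if (backing->s_stack_depth)
		return false;
	/* bdev-backed EROFS (the APEX case) and plain fses are fine. */
	return true;
}

int main(void)
{
	struct super_block ext4       = { "ext4", (void *)1, 0 };
	struct super_block ovl        = { "ovl", NULL, 1 };
	struct super_block erofs_bdev = { &erofs_sops_tag, (void *)1, 0 };
	struct super_block erofs_file = { &erofs_sops_tag, NULL, 0 };

	printf("on ext4:              %d\n", file_backed_mount_ok(&ext4));
	printf("on overlayfs:         %d\n", file_backed_mount_ok(&ovl));
	printf("on bdev-backed EROFS: %d\n", file_backed_mount_ok(&erofs_bdev));
	printf("on file-backed EROFS: %d\n", file_backed_mount_ok(&erofs_file));
	return 0;
}

This prints 1, 0, 1, 0: only stacked fses and file-backed EROFS are
rejected as backing filesystems, matching the v3 note that bdev-backed
EROFS remains a valid terminal fs.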
On 1/6/26 12:05 PM, Gao Xiang wrote:
> Previously, commit d53cd891f0e4 ("erofs: limit the level of fs stacking
> for file-backed mounts") bumped `s_stack_depth` by one to avoid kernel
> stack overflow when stacking an unlimited number of EROFS on top of
> each other.
>
> This fix breaks composefs mounts, which sometimes need EROFS+ovl^2
> (and such setups have already been used in production for quite a long time).
>
> One way to fix this regression is to bump FILESYSTEM_MAX_STACK_DEPTH
> from 2 to 3, but proving that this is safe in general is a high bar.
>
> After a long discussion on GitHub issues [1] about possible solutions,
> one conclusion is that there is no need to support nesting file-backed
> EROFS mounts on stacked filesystems, because there is always the option
> to use loopback devices as a fallback.
>
> As a quick fix for the composefs regression for this cycle, instead of
> bumping `s_stack_depth` for file-backed EROFS mounts, we disallow
> nesting file-backed EROFS over EROFS and over filesystems with
> `s_stack_depth` > 0.
>
> This works for all known file-backed mount use cases (composefs,
> containerd, and Android APEX for some Android vendors), and the fix is
> self-contained.
>
> Essentially, we are allowing one extra unaccounted fs stacking level of
> EROFS below stacking filesystems, but EROFS can only be used in the read
> path (i.e. overlayfs lower layers), which typically has much lower stack
> usage than the write path.
>
> We can consider increasing FILESYSTEM_MAX_STACK_DEPTH later, after more
> stack usage analysis or using alternative approaches, such as splitting
> the `s_stack_depth` limitation according to different combinations of
> stacking.
>
> Fixes: d53cd891f0e4 ("erofs: limit the level of fs stacking for file-backed mounts")
> Reported-by: Dusty Mabe <dusty@dustymabe.com>
> Reported-by: Timothée Ravier <tim@siosm.fr>
> Closes: https://github.com/coreos/fedora-coreos-tracker/issues/2087 [1]
> Reported-by: "Alekséi Naidénov" <an@digitaltide.io>
> Closes: https://lore.kernel.org/r/CAFHtUiYv4+=+JP_-JjARWjo6OwcvBj1wtYN=z0QXwCpec9sXtg@mail.gmail.com
> Acked-by: Amir Goldstein <amir73il@gmail.com>
> Cc: Alexander Larsson <alexl@redhat.com>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Miklos Szeredi <mszeredi@redhat.com>
> Cc: Sheng Yong <shengyong1@xiaomi.com>
> Cc: Zhiguo Niu <niuzhiguo84@gmail.com>
> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Tested-by: Dusty Mabe <dusty@dustymabe.com>
I tested that this fixes the problem we observed in our Fedora CoreOS CI,
documented at https://github.com/coreos/fedora-coreos-tracker/issues/2087
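For the record, here is the depth arithmetic behind the CoreOS failure
above, as a small standalone sketch. It assumes overlayfs computes its
depth as the maximum depth of its layers plus one and rejects anything
above FILESYSTEM_MAX_STACK_DEPTH; treat that model (and the numbers) as
an illustration, not as a quote of the overlayfs code.

#include <stdio.h>

#define FILESYSTEM_MAX_STACK_DEPTH 2

static const char *verdict(int depth)
{
	return depth > FILESYSTEM_MAX_STACK_DEPTH ? "rejected" : "ok";
}

int main(void)
{
	/* With d53cd891f0e4: a file-backed EROFS bumped the depth. */
	int ext4  = 0;             /* plain fs holding the image    */
	int erofs = ext4 + 1;      /* file-backed EROFS: depth 1    */
	int ovl1  = erofs + 1;     /* overlayfs over EROFS: 2       */
	int ovl2  = ovl1 + 1;      /* overlayfs over overlayfs: 3   */
	printf("old accounting: EROFS+ovl^2 depth %d -> %s\n",
	       ovl2, verdict(ovl2));

	/* With this fix: file-backed EROFS leaves its depth at 0. */
	erofs = 0;
	ovl1 = erofs + 1;
	ovl2 = ovl1 + 1;           /* EROFS+ovl^2 tops out at 2     */
	printf("new accounting: EROFS+ovl^2 depth %d -> %s\n",
	       ovl2, verdict(ovl2));
	return 0;
}

The first line reports depth 3 (rejected, the composefs regression); the
second reports depth 2 (ok), at the cost of the one unaccounted EROFS
level discussed in the commit message.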