[v1] ceph: CephFS writeback correctness and performance fixes

[PATCH 5/5] ceph: Fix write storm on fscrypted files

Posted by Sam Edwards 1 month, 1 week ago

CephFS stores file data across multiple RADOS objects. An object is the
atomic unit of storage, so the writeback code must clean only folios
that belong to the same object with each OSD request.

CephFS also supports RAID0-style striping of file contents: if enabled,
each object stores multiple unbroken "stripe units" covering different
portions of the file; if disabled, a "stripe unit" is simply the whole
object. The stripe unit is (usually) reported as the inode's block size.

Though the writeback logic could, in principle, lock all dirty folios
belonging to the same object, its current design is to lock only a
single stripe unit at a time. Ever since this code was first written,
it has determined this size by checking the inode's block size.
However, the relatively new fscrypt support needed to reduce the block
size for encrypted inodes to the crypto block size (see 'fixes' commit),
which causes an unnecessarily high number of write operations (~1024x as
many, with 4MiB objects) and grossly degraded performance.

Fix this (and clarify intent) by using i_layout.stripe_unit directly in
ceph_define_write_size() so that encrypted inodes are written back with
the same number of operations as if they were unencrypted.

Fixes: 94af0470924c ("ceph: add some fscrypt guardrails")
Cc: stable@vger.kernel.org
Signed-off-by: Sam Edwards <CFSworks@gmail.com>
---
 fs/ceph/addr.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index b3569d44d510..cb1da8e27c2b 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1000,7 +1000,8 @@ unsigned int ceph_define_write_size(struct address_space *mapping)
 {
 	struct inode *inode = mapping->host;
 	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
-	unsigned int wsize = i_blocksize(inode);
+	struct ceph_inode_info *ci = ceph_inode(inode);
+	unsigned int wsize = ci->i_layout.stripe_unit;
 
 	if (fsc->mount_options->wsize < wsize)
 		wsize = fsc->mount_options->wsize;
-- 
2.51.2

Re: [PATCH 5/5] ceph: Fix write storm on fscrypted files

Posted by Viacheslav Dubeyko 1 month ago

On Tue, 2025-12-30 at 18:43 -0800, Sam Edwards wrote:
> CephFS stores file data across multiple RADOS objects. An object is the
> atomic unit of storage, so the writeback code must clean only folios
> that belong to the same object with each OSD request.
> 
> CephFS also supports RAID0-style striping of file contents: if enabled,
> each object stores multiple unbroken "stripe units" covering different
> portions of the file; if disabled, a "stripe unit" is simply the whole
> object. The stripe unit is (usually) reported as the inode's block size.
> 
> Though the writeback logic could, in principle, lock all dirty folios
> belonging to the same object, its current design is to lock only a
> single stripe unit at a time. Ever since this code was first written,
> it has determined this size by checking the inode's block size.
> However, the relatively new fscrypt support needed to reduce the block
> size for encrypted inodes to the crypto block size (see 'fixes' commit),
> which causes an unnecessarily high number of write operations (~1024x as
> many, with 4MiB objects) and grossly degraded performance.

Do you have any benchmarking results that prove your point?

Thanks,
Slava.

> 
> Fix this (and clarify intent) by using i_layout.stripe_unit directly in
> ceph_define_write_size() so that encrypted inodes are written back with
> the same number of operations as if they were unencrypted.
> 
> Fixes: 94af0470924c ("ceph: add some fscrypt guardrails")
> Cc: stable@vger.kernel.org
> Signed-off-by: Sam Edwards <CFSworks@gmail.com>
> ---
>  fs/ceph/addr.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index b3569d44d510..cb1da8e27c2b 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -1000,7 +1000,8 @@ unsigned int ceph_define_write_size(struct address_space *mapping)
>  {
>  	struct inode *inode = mapping->host;
>  	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
> -	unsigned int wsize = i_blocksize(inode);
> +	struct ceph_inode_info *ci = ceph_inode(inode);
> +	unsigned int wsize = ci->i_layout.stripe_unit;
>  
>  	if (fsc->mount_options->wsize < wsize)
>  		wsize = fsc->mount_options->wsize;

Re: [PATCH 5/5] ceph: Fix write storm on fscrypted files

Posted by Sam Edwards 1 month ago

On Mon, Jan 5, 2026 at 2:34 PM Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
>
> On Tue, 2025-12-30 at 18:43 -0800, Sam Edwards wrote:
> > CephFS stores file data across multiple RADOS objects. An object is the
> > atomic unit of storage, so the writeback code must clean only folios
> > that belong to the same object with each OSD request.
> >
> > CephFS also supports RAID0-style striping of file contents: if enabled,
> > each object stores multiple unbroken "stripe units" covering different
> > portions of the file; if disabled, a "stripe unit" is simply the whole
> > object. The stripe unit is (usually) reported as the inode's block size.
> >
> > Though the writeback logic could, in principle, lock all dirty folios
> > belonging to the same object, its current design is to lock only a
> > single stripe unit at a time. Ever since this code was first written,
> > it has determined this size by checking the inode's block size.
> > However, the relatively new fscrypt support needed to reduce the block
> > size for encrypted inodes to the crypto block size (see 'fixes' commit),
> > which causes an unnecessarily high number of write operations (~1024x as
> > many, with 4MiB objects) and grossly degraded performance.

Hi Slava,

> Do you have any benchmarking results that prove your point?

I haven't done any "real" benchmarking for this change. On my setup
(closer to a home server than a typical Ceph deployment), sequential
write throughput increased from ~1.7 MB/s to ~66 MB/s with this patch
applied. I don't consider this single datapoint representative, so
rather than presenting it as a general benchmark in the commit
message, I chose the qualitative wording "grossly degraded
performance." Actual impact will vary depending on workload, disk
type, OSD count, etc.

Those curious about the bug's performance impact in their environment
can approximate it without enabling fscrypt by capping the write size
at mount time: mount -o wsize=4096

However, the core rationale for my claim is based on principles, not
on measurements: batching writes into fewer operations necessarily
spreads per-operation overhead across more bytes. So this change
removes an artificial per-op bottleneck on sequential write
performance. The exact impact varies, but the patch does improve
(fscrypt-enabled) write throughput in nearly every case.

Warm regards,
Sam


>
> Thanks,
> Slava.
>
> >
> > Fix this (and clarify intent) by using i_layout.stripe_unit directly in
> > ceph_define_write_size() so that encrypted inodes are written back with
> > the same number of operations as if they were unencrypted.
> >
> > Fixes: 94af0470924c ("ceph: add some fscrypt guardrails")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Sam Edwards <CFSworks@gmail.com>
> > ---
> >  fs/ceph/addr.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > index b3569d44d510..cb1da8e27c2b 100644
> > --- a/fs/ceph/addr.c
> > +++ b/fs/ceph/addr.c
> > @@ -1000,7 +1000,8 @@ unsigned int ceph_define_write_size(struct address_space *mapping)
> >  {
> >       struct inode *inode = mapping->host;
> >       struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
> > -     unsigned int wsize = i_blocksize(inode);
> > +     struct ceph_inode_info *ci = ceph_inode(inode);
> > +     unsigned int wsize = ci->i_layout.stripe_unit;
> >
> >       if (fsc->mount_options->wsize < wsize)
> >               wsize = fsc->mount_options->wsize;

[PATCH 1/5] ceph: Do not propagate page array emplacement errors as batch errors
[PATCH 2/5] ceph: Remove error return from ceph_process_folio_batch()
[PATCH 3/5] ceph: Free page array when ceph_submit_write fails
[PATCH 4/5] ceph: Assert writeback loop invariants
[PATCH 5/5] ceph: Fix write storm on fscrypted files