ceph: do not fill fscache for RWF_DONTCACHE writeback

[PATCH] ceph: do not fill fscache for RWF_DONTCACHE writeback

Posted by Max Kellermann 2 months, 1 week ago

Avoid populating the local fscache with writeback from dropbehind
folios.

At the moment, buffered RWF_DONTCACHE writes still go through the
usual Ceph writeback path, which mirrors the written data into
fscache.  The data is dropped from the page cache, but we still spend
local I/O and local cache space to retain a copy in fscache.

The DONTCACHE documentation is only about the page cache and the
intent is to avoid caching data that will not be needed again soon.
I believe skipping fscache writes during Ceph writeback on such pages
would follow the same spirit: commit the write to permanent storage,
but otherwise get it out of the way quickly.

Use folio_test_dropbehind() to treat such folios as non-cacheable for
the purposes of Ceph's write-side fscache population.  This skips both
ceph_set_page_fscache() and the corresponding write-to-cache operation
for dropbehind folios.

The writepages path can batch together folios with different cacheability,
so track cacheable subranges separately and only submit fscache writes
for contiguous non-dropbehind spans.

This keeps normal buffered writeback unchanged, while making
RWF_DONTCACHE a better match for its intended "don't retain this
locally" behavior and avoiding unnecessary local cache traffic.

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
---
Note: this is an additional feature on top of my Ceph-DONTCACHE patch,
see https://lore.kernel.org/ceph-devel/20260401053109.1861724-1-max.kellermann@ionos.com/
---
 fs/ceph/addr.c | 34 ++++++++++++++++++++++++++++++----
 1 file changed, 30 insertions(+), 4 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 2090fc78529c..9612a1d8ccb2 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -576,6 +576,21 @@ static inline void ceph_fscache_write_to_cache(struct inode *inode, u64 off, u64
 }
 #endif /* CONFIG_CEPH_FSCACHE */
 
+static inline bool ceph_folio_is_cacheable(const struct folio *folio, bool caching)
+{
+	/* Dropbehind writeback should not populate the local fscache. */
+	return caching && !folio_test_dropbehind(folio);
+}
+
+static inline void ceph_flush_fscache_write(struct inode *inode, u64 off, u64 *len)
+{
+	if (!*len)
+		return;
+
+	ceph_fscache_write_to_cache(inode, off, *len, true);
+	*len = 0;
+}
+
 struct ceph_writeback_ctl
 {
 	loff_t i_size;
@@ -730,7 +745,7 @@ static int write_folio_nounlock(struct folio *folio,
 	struct ceph_writeback_ctl ceph_wbc;
 	struct ceph_osd_client *osdc = &fsc->client->osdc;
 	struct ceph_osd_request *req;
-	bool caching = ceph_is_cache_enabled(inode);
+	bool caching = ceph_folio_is_cacheable(folio, ceph_is_cache_enabled(inode));
 	struct page *bounce_page = NULL;
 
 	doutc(cl, "%llx.%llx folio %p idx %lu\n", ceph_vinop(inode), folio,
@@ -1412,11 +1427,14 @@ int ceph_submit_write(struct address_space *mapping,
 	bool caching = ceph_is_cache_enabled(inode);
 	u64 offset;
 	u64 len;
+	u64 cache_offset, cache_len;
 	unsigned i;
 
 new_request:
 	offset = ceph_fscrypt_page_offset(ceph_wbc->pages[0]);
 	len = ceph_wbc->wsize;
+	cache_offset = 0;
+	cache_len = 0;
 
 	req = ceph_osdc_new_request(&fsc->client->osdc,
 				    &ci->i_layout, vino,
@@ -1477,9 +1495,11 @@ int ceph_submit_write(struct address_space *mapping,
 	ceph_wbc->op_idx = 0;
 	for (i = 0; i < ceph_wbc->locked_pages; i++) {
 		u64 cur_offset;
+		bool cache_page;
 
 		page = ceph_fscrypt_pagecache_page(ceph_wbc->pages[i]);
 		cur_offset = page_offset(page);
+		cache_page = ceph_folio_is_cacheable(page_folio(page), caching);
 
 		/*
 		 * Discontinuity in page range? Ceph can handle that by just passing
@@ -1491,7 +1511,7 @@ int ceph_submit_write(struct address_space *mapping,
 				break;
 
 			/* Kick off an fscache write with what we have so far. */
-			ceph_fscache_write_to_cache(inode, offset, len, caching);
+			ceph_flush_fscache_write(inode, cache_offset, &cache_len);
 
 			/* Start a new extent */
 			osd_req_op_extent_dup_last(req, ceph_wbc->op_idx,
@@ -1514,13 +1534,19 @@ int ceph_submit_write(struct address_space *mapping,
 
 		set_page_writeback(page);
 
-		if (caching)
+		if (cache_page) {
+			if (!cache_len)
+				cache_offset = cur_offset;
 			ceph_set_page_fscache(page);
+			cache_len += thp_size(page);
+		} else {
+			ceph_flush_fscache_write(inode, cache_offset, &cache_len);
+		}
 
 		len += thp_size(page);
 	}
 
-	ceph_fscache_write_to_cache(inode, offset, len, caching);
+	ceph_flush_fscache_write(inode, cache_offset, &cache_len);
 
 	if (ceph_wbc->size_stable) {
 		len = min(len, ceph_wbc->i_size - offset);
-- 
2.47.3

Re: [PATCH] ceph: do not fill fscache for RWF_DONTCACHE writeback

Posted by Viacheslav Dubeyko 2 months, 1 week ago

On Wed, 2026-04-01 at 22:56 +0200, Max Kellermann wrote:
> Avoid populating the local fscache with writeback from dropbehind
> folios.
> 

The idea sounds reasonable enough. However, this patch cannot be standalone
because it depends on another one.

I assume that a filesystem must declare DONTCACHE feature support by setting
FOP_DONTCACHE in its file_operations.fop_flags. Am I right here?

And what about IOCB_DONTCACHE? As far as I can see,
write_begin_get_folio() translates IOCB_DONTCACHE into FGP_DONTCACHE:

static inline struct folio *write_begin_get_folio(const struct kiocb *iocb,
		  struct address_space *mapping, pgoff_t index, size_t len)
{
        fgf_t fgp_flags = FGP_WRITEBEGIN;

        fgp_flags |= fgf_set_order(len);

        if (iocb && iocb->ki_flags & IOCB_DONTCACHE)
                fgp_flags |= FGP_DONTCACHE;

        return __filemap_get_folio(mapping, index, fgp_flags,
                                   mapping_gfp_mask(mapping));
}

The Ceph write_begin path calls netfs_write_begin() but does not pass
IOCB_DONTCACHE through to trigger __folio_set_dropbehind. So,
folio_test_dropbehind() would never be true on the Ceph write path right now.
Does it make sense?
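
For context, the userspace side of this path can be sketched as below: a
write issued through pwritev2() with RWF_DONTCACHE is what the kernel turns
into IOCB_DONTCACHE. This is a minimal sketch, not code from the patch. The
helper names write_dontcache() and dontcache_roundtrip() are hypothetical,
the flag value is the one from the Linux UAPI header, and the fallback
assumes that a kernel or filesystem without DONTCACHE support rejects the
flag with EOPNOTSUPP (EINVAL is also accepted here to be safe).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Fallback for older userspace headers; the value matches the Linux UAPI
 * definition in <linux/fs.h>. */
#ifndef RWF_DONTCACHE
#define RWF_DONTCACHE 0x00000080
#endif

/*
 * Hypothetical helper (not part of the patch): write buf at off and ask the
 * kernel to drop the page cache copy once writeback completes.  If the flag
 * is rejected, retry as a normal buffered write.
 */
static ssize_t write_dontcache(int fd, const void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	ssize_t n;

	assert(buf != NULL);
	n = pwritev2(fd, &iov, 1, off, RWF_DONTCACHE);
	if (n < 0 && (errno == EOPNOTSUPP || errno == EINVAL))
		n = pwritev2(fd, &iov, 1, off, 0);
	return n;
}

/* Round-trip check on a temporary file; returns 0 on success, -1 on error. */
static int dontcache_roundtrip(void)
{
	char path[] = "/tmp/dontcache-XXXXXX";
	const char msg[] = "dropbehind";
	char back[sizeof(msg)] = { 0 };
	int fd = mkstemp(path);
	int ret = -1;

	if (fd < 0)
		return -1;
	if (write_dontcache(fd, msg, sizeof(msg), 0) == (ssize_t)sizeof(msg) &&
	    pread(fd, back, sizeof(back), 0) == (ssize_t)sizeof(back) &&
	    memcmp(msg, back, sizeof(msg)) == 0)
		ret = 0;
	close(fd);
	unlink(path);
	return ret;
}
```

Whether such a write actually produces dropbehind folios on Ceph depends on
the filesystem accepting the flag, which is the dependency raised above.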

> At the moment, buffered RWF_DONTCACHE writes still go through the
> usual Ceph writeback path, which mirrors the written data into
> fscache.  The data is dropped from the page cache, but we still spend
> local I/O and local cache space to retain a copy in fscache.
> 
> The DONTCACHE documentation is only about the page cache and the
> intent is to avoid caching data that will not be needed again soon.
> I believe skipping fscache writes during Ceph writeback on such pages
> would follow the same spirit: commit the write to permanent storage,
> but otherwise get it out of the way quickly.
> 
> Use folio_test_dropbehind() to treat such folios as non-cacheable for
> the purposes of Ceph's write-side fscache population.  This skips both
> ceph_set_page_fscache() and the corresponding write-to-cache operation
> for dropbehind folios.
> 
> The writepages path can batch together folios with different cacheability,
> so track cacheable subranges separately and only submit fscache writes
> for contiguous non-dropbehind spans.
> 
> This keeps normal buffered writeback unchanged, while making
> RWF_DONTCACHE a better match for its intended "don't retain this
> locally" behavior and avoiding unnecessary local cache traffic.
> 
> Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
> ---
> Note: this is an additional feature on top of my Ceph-DONTCACHE patch,
> see https://lore.kernel.org/ceph-devel/20260401053109.1861724-1-max.kellermann@ionos.com/
> ---
>  fs/ceph/addr.c | 34 ++++++++++++++++++++++++++++++----
>  1 file changed, 30 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 2090fc78529c..9612a1d8ccb2 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -576,6 +576,21 @@ static inline void ceph_fscache_write_to_cache(struct inode *inode, u64 off, u64
>  }
>  #endif /* CONFIG_CEPH_FSCACHE */
>  
> +static inline bool ceph_folio_is_cacheable(const struct folio *folio, bool caching)
> +{
> +	/* Dropbehind writeback should not populate the local fscache. */
> +	return caching && !folio_test_dropbehind(folio);
> +}
> +
> +static inline void ceph_flush_fscache_write(struct inode *inode, u64 off, u64 *len)
> +{
> +	if (!*len)
> +		return;
> +
> +	ceph_fscache_write_to_cache(inode, off, *len, true);

Are you sure that caching should always be true? All other callers check
ceph_is_cache_enabled():

bool caching = ceph_is_cache_enabled(inode);

> +	*len = 0;
> +}


ceph_folio_is_cacheable() and ceph_flush_fscache_write() are outside the
CONFIG_CEPH_FSCACHE guard. That doesn't look right.

> +
>  struct ceph_writeback_ctl
>  {
>  	loff_t i_size;
> @@ -730,7 +745,7 @@ static int write_folio_nounlock(struct folio *folio,
>  	struct ceph_writeback_ctl ceph_wbc;
>  	struct ceph_osd_client *osdc = &fsc->client->osdc;
>  	struct ceph_osd_request *req;
> -	bool caching = ceph_is_cache_enabled(inode);
> +	bool caching = ceph_folio_is_cacheable(folio, ceph_is_cache_enabled(inode));
>  	struct page *bounce_page = NULL;
>  
>  	doutc(cl, "%llx.%llx folio %p idx %lu\n", ceph_vinop(inode), folio,
> @@ -1412,11 +1427,14 @@ int ceph_submit_write(struct address_space *mapping,
>  	bool caching = ceph_is_cache_enabled(inode);
>  	u64 offset;
>  	u64 len;
> +	u64 cache_offset, cache_len;

Why do you need to introduce the cache_offset and cache_len? We already have
offset and len.

>  	unsigned i;
>  
>  new_request:
>  	offset = ceph_fscrypt_page_offset(ceph_wbc->pages[0]);
>  	len = ceph_wbc->wsize;
> +	cache_offset = 0;

Is this the correct initialization? Frankly speaking, I don't quite follow why
this initialization is needed.

Thanks,
Slava.

> +	cache_len = 0;
>  
>  	req = ceph_osdc_new_request(&fsc->client->osdc,
>  				    &ci->i_layout, vino,
> @@ -1477,9 +1495,11 @@ int ceph_submit_write(struct address_space *mapping,
>  	ceph_wbc->op_idx = 0;
>  	for (i = 0; i < ceph_wbc->locked_pages; i++) {
>  		u64 cur_offset;
> +		bool cache_page;
>  
>  		page = ceph_fscrypt_pagecache_page(ceph_wbc->pages[i]);
>  		cur_offset = page_offset(page);
> +		cache_page = ceph_folio_is_cacheable(page_folio(page), caching);
>  
>  		/*
>  		 * Discontinuity in page range? Ceph can handle that by just passing
> @@ -1491,7 +1511,7 @@ int ceph_submit_write(struct address_space *mapping,
>  				break;
>  
>  			/* Kick off an fscache write with what we have so far. */
> -			ceph_fscache_write_to_cache(inode, offset, len, caching);
> +			ceph_flush_fscache_write(inode, cache_offset, &cache_len);
>  
>  			/* Start a new extent */
>  			osd_req_op_extent_dup_last(req, ceph_wbc->op_idx,
> @@ -1514,13 +1534,19 @@ int ceph_submit_write(struct address_space *mapping,
>  
>  		set_page_writeback(page);
>  
> -		if (caching)
> +		if (cache_page) {
> +			if (!cache_len)
> +				cache_offset = cur_offset;
>  			ceph_set_page_fscache(page);
> +			cache_len += thp_size(page);
> +		} else {
> +			ceph_flush_fscache_write(inode, cache_offset, &cache_len);
> +		}
>  
>  		len += thp_size(page);
>  	}
>  
> -	ceph_fscache_write_to_cache(inode, offset, len, caching);
> +	ceph_flush_fscache_write(inode, cache_offset, &cache_len);
>  
>  	if (ceph_wbc->size_stable) {
>  		len = min(len, ceph_wbc->i_size - offset);

Re: [PATCH] ceph: do not fill fscache for RWF_DONTCACHE writeback

Posted by Max Kellermann 2 months, 1 week ago

On Thu, Apr 2, 2026 at 9:44 PM Viacheslav Dubeyko <vdubeyko@redhat.com> wrote:
> The Ceph write_begin path calls netfs_write_begin() but does not pass
> IOCB_DONTCACHE through to trigger __folio_set_dropbehind. So,
> folio_test_dropbehind() would never be true on the Ceph write path right now.
> Does it make sense?

Yes, see:

> > ---
> > Note: this is an additional feature on top of my Ceph-DONTCACHE patch,
> > see https://lore.kernel.org/ceph-devel/20260401053109.1861724-1-max.kellermann@ionos.com/

The code in this patch is not reachable until my RWF_DONTCACHE patch
is merged as well.

> Are you sure that caching should always be true? All other callers check
> ceph_is_cache_enabled():
>
> bool caching = ceph_is_cache_enabled(inode);

This function is only called if caching is enabled.

>
> > +     *len = 0;
> > +}
>
>
> ceph_folio_is_cacheable() and ceph_flush_fscache_write() are outside the
> CONFIG_CEPH_FSCACHE guard. That doesn't look right.

All of the old code is outside CONFIG_CEPH_FSCACHE, too. Does the old
code look right?

> > @@ -1412,11 +1427,14 @@ int ceph_submit_write(struct address_space *mapping,
> >       bool caching = ceph_is_cache_enabled(inode);
> >       u64 offset;
> >       u64 len;
> > +     u64 cache_offset, cache_len;
>
> Why do you need to introduce the cache_offset and cache_len? We already have
> offset and len.

These keep track of the region that should be submitted to fscache.
Folios with "dropbehind" set need to be excluded from that.

> >  new_request:
> >       offset = ceph_fscrypt_page_offset(ceph_wbc->pages[0]);
> >       len = ceph_wbc->wsize;
> > +     cache_offset = 0;
>
> Is this the correct initialization? Frankly speaking, I don't quite follow why
> this initialization is needed.

Technically, cache_offset does not need to be initialized as long as
cache_len is zero because then its value is never used. Would you feel
more comfortable if I remove the unnecessary initializer? I wasn't
sure which approach would raise fewer eyebrows.

-- 
Max Kellermann
Principal Architect
Hosting Technology

cm4all | Im Mediapark 6a | 50670 Köln | Germany
General information about the company can be found here:
https://www.cm4all.com/impressum
A member of the IONOS Group

Re: [PATCH] ceph: do not fill fscache for RWF_DONTCACHE writeback

Posted by Viacheslav Dubeyko 2 months, 1 week ago

On Fri, 2026-04-03 at 08:52 +0200, Max Kellermann wrote:
> On Thu, Apr 2, 2026 at 9:44 PM Viacheslav Dubeyko <vdubeyko@redhat.com> wrote:
> > The Ceph write_begin path calls netfs_write_begin() but does not pass
> > IOCB_DONTCACHE through to trigger __folio_set_dropbehind. So,
> > folio_test_dropbehind() would never be true on the Ceph write path right now.
> > Does it make sense?
> 
> Yes, see:
> 
> > > ---
> > > Note: this is an additional feature on top of my Ceph-DONTCACHE patch,
> > > see https://lore.kernel.org/ceph-devel/20260401053109.1861724-1-max.kellermann@ionos.com/
> 
> The code in this patch is not reachable until my RWF_DONTCACHE patch
> is merged as well.
> 
> > Are you sure that caching should always be true? All other callers check
> > ceph_is_cache_enabled():
> > 
> > bool caching = ceph_is_cache_enabled(inode);
> 
> This function is only called if caching is enabled.

I think an interface like this would be cleaner and safer:

static inline void ceph_flush_fscache_write(struct inode *inode, u64 off,
					    u64 *len, bool caching)
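
A minimal userspace sketch of that suggested interface, under stated
assumptions: struct demo_inode and demo_fscache_write_to_cache() are
hypothetical stand-ins for the kernel inode and for
ceph_fscache_write_to_cache(), and only the control flow is modelled. The
point is that the helper forwards the caller's caching decision instead of
hard-coding true.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the inode; records what reached "fscache". */
struct demo_inode {
	uint64_t last_off;
	uint64_t last_len;
	unsigned int writes;
};

/* Stand-in for ceph_fscache_write_to_cache(): a no-op unless caching is set. */
static void demo_fscache_write_to_cache(struct demo_inode *inode, uint64_t off,
					uint64_t len, bool caching)
{
	if (!caching)
		return;
	inode->last_off = off;
	inode->last_len = len;
	inode->writes++;
}

/*
 * Flush the pending cacheable range, if any, and reset its length.  The
 * caching flag is passed through rather than assumed to be true.
 */
static void demo_flush_fscache_write(struct demo_inode *inode, uint64_t off,
				     uint64_t *len, bool caching)
{
	if (!*len)
		return;

	demo_fscache_write_to_cache(inode, off, *len, caching);
	*len = 0;
}
```

With this shape, a caller that passes caching == false still gets its
pending length reset, but nothing is written to the cache.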

> 
> > 
> > > +     *len = 0;
> > > +}
> > 
> > 
> > ceph_folio_is_cacheable() and ceph_flush_fscache_write() are outside the
> > CONFIG_CEPH_FSCACHE guard. That doesn't look right.
> 
> All of the old code is outside CONFIG_CEPH_FSCACHE, too. Does the old
> code look right?

As far as I can see, all fscache code is guarded by the CONFIG_CEPH_FSCACHE
compile-time option. If the old code has issues, then it makes sense to fix
them. But this code is fscache-related, so from my point of view it should be
guarded by CONFIG_CEPH_FSCACHE as well. Moreover, the other fscache-related
helpers just above these functions are already inside that guard.
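
One way to read this request is the usual guard-plus-stub pattern, sketched
below in userspace. DEMO_FSCACHE is a hypothetical stand-in for
CONFIG_CEPH_FSCACHE and demo_folio_is_cacheable() is an illustrative name,
not the patch's helper; the dropbehind state is passed as a plain bool
because there is no struct folio outside the kernel.

```c
#include <stdbool.h>

/* DEMO_FSCACHE is a hypothetical stand-in for CONFIG_CEPH_FSCACHE. */
#ifdef DEMO_FSCACHE
/* Cache support built in: dropbehind data must not populate the cache. */
static inline bool demo_folio_is_cacheable(bool dropbehind, bool caching)
{
	return caching && !dropbehind;
}
#else
/* Cache support compiled out: nothing is ever cacheable. */
static inline bool demo_folio_is_cacheable(bool dropbehind, bool caching)
{
	(void)dropbehind;
	(void)caching;
	return false;
}
#endif
```

Callers look the same in both configurations; in the stub configuration no
cacheable range is ever accumulated, so nothing is flushed.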

> 
> > > @@ -1412,11 +1427,14 @@ int ceph_submit_write(struct address_space *mapping,
> > >       bool caching = ceph_is_cache_enabled(inode);
> > >       u64 offset;
> > >       u64 len;
> > > +     u64 cache_offset, cache_len;
> > 
> > Why do you need to introduce the cache_offset and cache_len? We already have
> > offset and len.
> 
> These keep track of the region that should be submitted to fscache.
> Folios with "dropbehind" set need to be excluded from that.
> 
> > >  new_request:
> > >       offset = ceph_fscrypt_page_offset(ceph_wbc->pages[0]);
> > >       len = ceph_wbc->wsize;
> > > +     cache_offset = 0;
> > 
> > Is this the correct initialization? Frankly speaking, I don't quite follow why
> > this initialization is needed.
> 
> Technically, cache_offset does not need to be initialized as long as
> cache_len is zero because then its value is never used. Would you feel
> more comfortable if I remove the unnecessary initializer? I wasn't
> sure which approach would raise fewer eyebrows.

I am simply trying to understand why we need cache_offset. We currently use
offset:

/* Kick off an fscache write with what we have so far. */
ceph_fscache_write_to_cache(inode, offset, len, caching);


Why is offset not good enough?

Thanks,
Slava.

Re: [PATCH] ceph: do not fill fscache for RWF_DONTCACHE writeback

Posted by Max Kellermann 2 months, 1 week ago

On Fri, Apr 3, 2026 at 7:18 PM Viacheslav Dubeyko <vdubeyko@redhat.com> wrote:
> Why is offset not good enough?

Because the "offset" variable tracks the whole write, including the
folios that are supposed to be omitted from the fscache.
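
The difference can be illustrated with a small userspace model of the
writepages loop. This is a sketch under stated assumptions, not the kernel
code: struct demo_page, struct demo_log and the demo_* functions are
hypothetical, all pages are assumed to be contiguous and DEMO_PAGE_SIZE
bytes long, and the discontinuity branch of the real loop is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u
#define DEMO_MAX_SPANS 8

struct demo_page {
	uint64_t offset;
	bool dropbehind;
};

struct demo_span {
	uint64_t off;
	uint64_t len;
};

/* Records every range that would be handed to fscache. */
struct demo_log {
	struct demo_span spans[DEMO_MAX_SPANS];
	size_t count;
};

/* Emit the pending cacheable span, if any, and reset its length. */
static void demo_flush(struct demo_log *log, uint64_t off, uint64_t *len)
{
	if (!*len)
		return;
	if (log->count < DEMO_MAX_SPANS)
		log->spans[log->count++] = (struct demo_span){ off, *len };
	*len = 0;
}

/*
 * Walk the batch once.  The return value plays the role of "len" (the
 * whole write), while cache_off/cache_len only cover runs of
 * non-dropbehind pages.
 */
static uint64_t demo_submit(const struct demo_page *pages, size_t n,
			    struct demo_log *log)
{
	uint64_t len = 0, cache_off = 0, cache_len = 0;

	for (size_t i = 0; i < n; i++) {
		if (!pages[i].dropbehind) {
			if (!cache_len)
				cache_off = pages[i].offset;
			cache_len += DEMO_PAGE_SIZE;
		} else {
			/* A dropbehind page ends the current cacheable run. */
			demo_flush(log, cache_off, &cache_len);
		}
		len += DEMO_PAGE_SIZE;
	}
	demo_flush(log, cache_off, &cache_len);
	return len;
}
```

For four contiguous pages where only the second is dropbehind, the model
reports a 16384-byte write but logs two cache spans: 4096 bytes at offset 0
and 8192 bytes at offset 8192. Submitting (offset, len) instead would also
cover the dropbehind page.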
