Avoid populating the local fscache with writeback from dropbehind
folios.

At the moment, buffered RWF_DONTCACHE writes still go through the
usual Ceph writeback path, which mirrors the written data into
fscache. The data is dropped from the page cache, but we still spend
local I/O and local cache space to retain a copy in fscache.

The DONTCACHE documentation is only about the page cache and the
intent is to avoid caching data that will not be needed again soon.
I believe skipping fscache writes during Ceph writeback on such pages
would follow the same spirit: commit the write to permanent storage,
but otherwise get it out of the way quickly.

Use folio_test_dropbehind() to treat such folios as non-cacheable for
the purposes of Ceph's write-side fscache population. This skips both
ceph_set_page_fscache() and the corresponding write-to-cache operation
for dropbehind folios.

The writepages path can batch together folios with different cacheability,
so track cacheable subranges separately and only submit fscache writes
for contiguous non-dropbehind spans.

This keeps normal buffered writeback unchanged, while making
RWF_DONTCACHE a better match for its intended "don't retain this
locally" behavior and avoiding unnecessary local cache traffic.

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
---
Note: this is an additional feature on top of my Ceph-DONTCACHE patch,
see https://lore.kernel.org/ceph-devel/20260401053109.1861724-1-max.kellermann@ionos.com/
---
fs/ceph/addr.c | 34 ++++++++++++++++++++++++++++++----
1 file changed, 30 insertions(+), 4 deletions(-)
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 2090fc78529c..9612a1d8ccb2 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -576,6 +576,21 @@ static inline void ceph_fscache_write_to_cache(struct inode *inode, u64 off, u64
}
#endif /* CONFIG_CEPH_FSCACHE */
+static inline bool ceph_folio_is_cacheable(const struct folio *folio, bool caching)
+{
+ /* Dropbehind writeback should not populate the local fscache. */
+ return caching && !folio_test_dropbehind(folio);
+}
+
+static inline void ceph_flush_fscache_write(struct inode *inode, u64 off, u64 *len)
+{
+ if (!*len)
+ return;
+
+ ceph_fscache_write_to_cache(inode, off, *len, true);
+ *len = 0;
+}
+
struct ceph_writeback_ctl
{
loff_t i_size;
@@ -730,7 +745,7 @@ static int write_folio_nounlock(struct folio *folio,
struct ceph_writeback_ctl ceph_wbc;
struct ceph_osd_client *osdc = &fsc->client->osdc;
struct ceph_osd_request *req;
- bool caching = ceph_is_cache_enabled(inode);
+ bool caching = ceph_folio_is_cacheable(folio, ceph_is_cache_enabled(inode));
struct page *bounce_page = NULL;
doutc(cl, "%llx.%llx folio %p idx %lu\n", ceph_vinop(inode), folio,
@@ -1412,11 +1427,14 @@ int ceph_submit_write(struct address_space *mapping,
bool caching = ceph_is_cache_enabled(inode);
u64 offset;
u64 len;
+ u64 cache_offset, cache_len;
unsigned i;
new_request:
offset = ceph_fscrypt_page_offset(ceph_wbc->pages[0]);
len = ceph_wbc->wsize;
+ cache_offset = 0;
+ cache_len = 0;
req = ceph_osdc_new_request(&fsc->client->osdc,
&ci->i_layout, vino,
@@ -1477,9 +1495,11 @@ int ceph_submit_write(struct address_space *mapping,
ceph_wbc->op_idx = 0;
for (i = 0; i < ceph_wbc->locked_pages; i++) {
u64 cur_offset;
+ bool cache_page;
page = ceph_fscrypt_pagecache_page(ceph_wbc->pages[i]);
cur_offset = page_offset(page);
+ cache_page = ceph_folio_is_cacheable(page_folio(page), caching);
/*
* Discontinuity in page range? Ceph can handle that by just passing
@@ -1491,7 +1511,7 @@ int ceph_submit_write(struct address_space *mapping,
break;
/* Kick off an fscache write with what we have so far. */
- ceph_fscache_write_to_cache(inode, offset, len, caching);
+ ceph_flush_fscache_write(inode, cache_offset, &cache_len);
/* Start a new extent */
osd_req_op_extent_dup_last(req, ceph_wbc->op_idx,
@@ -1514,13 +1534,19 @@ int ceph_submit_write(struct address_space *mapping,
set_page_writeback(page);
- if (caching)
+ if (cache_page) {
+ if (!cache_len)
+ cache_offset = cur_offset;
ceph_set_page_fscache(page);
+ cache_len += thp_size(page);
+ } else {
+ ceph_flush_fscache_write(inode, cache_offset, &cache_len);
+ }
len += thp_size(page);
}
- ceph_fscache_write_to_cache(inode, offset, len, caching);
+ ceph_flush_fscache_write(inode, cache_offset, &cache_len);
if (ceph_wbc->size_stable) {
len = min(len, ceph_wbc->i_size - offset);
--
2.47.3
On Wed, 2026-04-01 at 22:56 +0200, Max Kellermann wrote:
> Avoid populating the local fscache with writeback from dropbehind
> folios.
>
The idea sounds reasonable enough. However, this patch cannot stand alone
because it depends on another one.

I assume that a filesystem must declare DONTCACHE support by setting
FOP_DONTCACHE in its file_operations.fop_flags. Am I right here?

And what about IOCB_DONTCACHE? As far as I can see,
write_begin_get_folio() translates IOCB_DONTCACHE into FGP_DONTCACHE:

static inline struct folio *write_begin_get_folio(const struct kiocb *iocb,
		struct address_space *mapping, pgoff_t index, size_t len)
{
	fgf_t fgp_flags = FGP_WRITEBEGIN;

	fgp_flags |= fgf_set_order(len);

	if (iocb && iocb->ki_flags & IOCB_DONTCACHE)
		fgp_flags |= FGP_DONTCACHE;

	return __filemap_get_folio(mapping, index, fgp_flags,
				   mapping_gfp_mask(mapping));
}

The Ceph write_begin path calls netfs_write_begin() but does not pass
IOCB_DONTCACHE through to trigger __folio_set_dropbehind. So,
folio_test_dropbehind() would never be true on the Ceph write path right now.
Does it make sense?
> At the moment, buffered RWF_DONTCACHE writes still go through the
> usual Ceph writeback path, which mirrors the written data into
> fscache. The data is dropped from the page cache, but we still spend
> local I/O and local cache space to retain a copy in fscache.
>
> The DONTCACHE documentation is only about the page cache and the
> intent is to avoid caching data that will not be needed again soon.
> I believe skipping fscache writes during Ceph writeback on such pages
> would follow the same spirit: commit the write to permanent storage,
> but otherwise get it out of the way quickly.
>
> Use folio_test_dropbehind() to treat such folios as non-cacheable for
> the purposes of Ceph's write-side fscache population. This skips both
> ceph_set_page_fscache() and the corresponding write-to-cache operation
> for dropbehind folios.
>
> The writepages path can batch together folios with different cacheability,
> so track cacheable subranges separately and only submit fscache writes
> for contiguous non-dropbehind spans.
>
> This keeps normal buffered writeback unchanged, while making
> RWF_DONTCACHE a better match for its intended "don't retain this
> locally" behavior and avoiding unnecessary local cache traffic.
>
> Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
> ---
> Note: this is an additional feature on top of my Ceph-DONTCACHE patch,
> see https://lore.kernel.org/ceph-devel/20260401053109.1861724-1-max.kellermann@ionos.com/
> ---
> fs/ceph/addr.c | 34 ++++++++++++++++++++++++++++++----
> 1 file changed, 30 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 2090fc78529c..9612a1d8ccb2 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -576,6 +576,21 @@ static inline void ceph_fscache_write_to_cache(struct inode *inode, u64 off, u64
> }
> #endif /* CONFIG_CEPH_FSCACHE */
>
> +static inline bool ceph_folio_is_cacheable(const struct folio *folio, bool caching)
> +{
> + /* Dropbehind writeback should not populate the local fscache. */
> + return caching && !folio_test_dropbehind(folio);
> +}
> +
> +static inline void ceph_flush_fscache_write(struct inode *inode, u64 off, u64 *len)
> +{
> + if (!*len)
> + return;
> +
> + ceph_fscache_write_to_cache(inode, off, *len, true);
Are you sure that caching should be always true? All other calls checks that
ceph_is_cache_enabled():
bool caching = ceph_is_cache_enabled(inode);
> + *len = 0;
> +}
The ceph_folio_is_cacheable() and ceph_flush_fscache_write() are out of
CONFIG_CEPH_FSCACHE. It doesn't look right.
> +
> struct ceph_writeback_ctl
> {
> loff_t i_size;
> @@ -730,7 +745,7 @@ static int write_folio_nounlock(struct folio *folio,
> struct ceph_writeback_ctl ceph_wbc;
> struct ceph_osd_client *osdc = &fsc->client->osdc;
> struct ceph_osd_request *req;
> - bool caching = ceph_is_cache_enabled(inode);
> + bool caching = ceph_folio_is_cacheable(folio, ceph_is_cache_enabled(inode));
> struct page *bounce_page = NULL;
>
> doutc(cl, "%llx.%llx folio %p idx %lu\n", ceph_vinop(inode), folio,
> @@ -1412,11 +1427,14 @@ int ceph_submit_write(struct address_space *mapping,
> bool caching = ceph_is_cache_enabled(inode);
> u64 offset;
> u64 len;
> + u64 cache_offset, cache_len;
Why do you need to introduce the cache_offset and cache_len? We already have
offset and len.
> unsigned i;
>
> new_request:
> offset = ceph_fscrypt_page_offset(ceph_wbc->pages[0]);
> len = ceph_wbc->wsize;
> + cache_offset = 0;
Is it correct initialization? Frankly speaking, I don't quite follow why we need
such initialization.
Thanks,
Slava.
> + cache_len = 0;
>
> req = ceph_osdc_new_request(&fsc->client->osdc,
> &ci->i_layout, vino,
> @@ -1477,9 +1495,11 @@ int ceph_submit_write(struct address_space *mapping,
> ceph_wbc->op_idx = 0;
> for (i = 0; i < ceph_wbc->locked_pages; i++) {
> u64 cur_offset;
> + bool cache_page;
>
> page = ceph_fscrypt_pagecache_page(ceph_wbc->pages[i]);
> cur_offset = page_offset(page);
> + cache_page = ceph_folio_is_cacheable(page_folio(page), caching);
>
> /*
> * Discontinuity in page range? Ceph can handle that by just passing
> @@ -1491,7 +1511,7 @@ int ceph_submit_write(struct address_space *mapping,
> break;
>
> /* Kick off an fscache write with what we have so far. */
> - ceph_fscache_write_to_cache(inode, offset, len, caching);
> + ceph_flush_fscache_write(inode, cache_offset, &cache_len);
>
> /* Start a new extent */
> osd_req_op_extent_dup_last(req, ceph_wbc->op_idx,
> @@ -1514,13 +1534,19 @@ int ceph_submit_write(struct address_space *mapping,
>
> set_page_writeback(page);
>
> - if (caching)
> + if (cache_page) {
> + if (!cache_len)
> + cache_offset = cur_offset;
> ceph_set_page_fscache(page);
> + cache_len += thp_size(page);
> + } else {
> + ceph_flush_fscache_write(inode, cache_offset, &cache_len);
> + }
>
> len += thp_size(page);
> }
>
> - ceph_fscache_write_to_cache(inode, offset, len, caching);
> + ceph_flush_fscache_write(inode, cache_offset, &cache_len);
>
> if (ceph_wbc->size_stable) {
> len = min(len, ceph_wbc->i_size - offset);
On Thu, Apr 2, 2026 at 9:44 PM Viacheslav Dubeyko <vdubeyko@redhat.com> wrote:
> The Ceph write_begin path calls netfs_write_begin() but does not pass
> IOCB_DONTCACHE through to trigger __folio_set_dropbehind. So,
> folio_test_dropbehind() would never be true on the Ceph write path right now.
> Does it make sense?

Yes, see:

> > ---
> > Note: this is an additional feature on top of my Ceph-DONTCACHE patch,
> > see https://lore.kernel.org/ceph-devel/20260401053109.1861724-1-max.kellermann@ionos.com/

The code in this patch is not reachable until my RWF_DONTCACHE patch
is merged as well.

> Are you sure that caching should be always true? All other calls checks that
> ceph_is_cache_enabled():
>
> bool caching = ceph_is_cache_enabled(inode);

This function is only called if caching is enabled.

> > + *len = 0;
> > +}
>
> The ceph_folio_is_cacheable() and ceph_flush_fscache_write() are out of
> CONFIG_CEPH_FSCACHE. It doesn't look right.

All of the old code is out of CONFIG_CEPH_FSCACHE, too. Does the old
code look right?

> > @@ -1412,11 +1427,14 @@ int ceph_submit_write(struct address_space *mapping,
> > bool caching = ceph_is_cache_enabled(inode);
> > u64 offset;
> > u64 len;
> > + u64 cache_offset, cache_len;
>
> Why do you need to introduce the cache_offset and cache_len? We already have
> offset and len.

These keep track of the region that should be submitted to fscache.
Folios without "dropbehind" need to be excluded from that.

> > new_request:
> > offset = ceph_fscrypt_page_offset(ceph_wbc->pages[0]);
> > len = ceph_wbc->wsize;
> > + cache_offset = 0;
>
> Is it correct initialization? Frankly speaking, I don't quite follow why we need
> such initialization.

Technically, cache_offset does not need to be initialized as long as
cache_len is zero because then its value is never used. Would you feel
more comfortable if I remove the unnecessary initializer? I wasn't
sure which approach would raise fewer eyebrows.

--
Max Kellermann
Principal Architect Hosting Technology
cm4all | Im Mediapark 6a | 50670 Köln | Germany
General information about the company can be found here: https://www.cm4all.com/impressum
A member of the IONOS Group
On Fri, 2026-04-03 at 08:52 +0200, Max Kellermann wrote:
> On Thu, Apr 2, 2026 at 9:44 PM Viacheslav Dubeyko <vdubeyko@redhat.com> wrote:
> > The Ceph write_begin path calls netfs_write_begin() but does not pass
> > IOCB_DONTCACHE through to trigger __folio_set_dropbehind. So,
> > folio_test_dropbehind() would never be true on the Ceph write path right now.
> > Does it make sense?
>
> Yes, see:
>
> > > ---
> > > Note: this is an additional feature on top of my Ceph-DONTCACHE patch,
> > > see https://lore.kernel.org/ceph-devel/20260401053109.1861724-1-max.kellermann@ionos.com/
>
> The code in this patch is not reachable until my RWF_DONTCACHE patch
> is merged as well.
>
> > Are you sure that caching should be always true? All other calls checks that
> > ceph_is_cache_enabled():
> >
> > bool caching = ceph_is_cache_enabled(inode);
>
> This function is only called if caching is enabled.

I think that such interface will be more clean and safe:

static inline void ceph_flush_fscache_write(struct inode *inode, u64 off, u64 *len, bool caching)

> > > + *len = 0;
> > > +}
> >
> > The ceph_folio_is_cacheable() and ceph_flush_fscache_write() are out of
> > CONFIG_CEPH_FSCACHE. It doesn't look right.
>
> All of the old code is out of CONFIG_CEPH_FSCACHE, too. Does the old
> code look right?

As far as I can see, all fscache code is under CONFIG_CEPH_FSCACHE compilation
option. If we have some issues with old code, then it makes sense to fix it.
But this code is fscache related and it should be under CONFIG_CEPH_FSCACHE
protection, from my point of view. Moreover, other fscache related code is
under CONFIG_CEPH_FSCACHE protection pretty above the code of these functions.

> > > @@ -1412,11 +1427,14 @@ int ceph_submit_write(struct address_space *mapping,
> > > bool caching = ceph_is_cache_enabled(inode);
> > > u64 offset;
> > > u64 len;
> > > + u64 cache_offset, cache_len;
> >
> > Why do you need to introduce the cache_offset and cache_len? We already have
> > offset and len.
>
> These keep track of the region that should be submitted to fscache.
> Folios without "dropbehind" need to be excluded from that.
>
> > > new_request:
> > > offset = ceph_fscrypt_page_offset(ceph_wbc->pages[0]);
> > > len = ceph_wbc->wsize;
> > > + cache_offset = 0;
> >
> > Is it correct initialization? Frankly speaking, I don't quite follow why we need
> > such initialization.
>
> Technically, cache_offset does not need to be initialized as long as
> cache_len is zero because then its value is never used. Would you feel
> more comfortable if I remove the unnecessary initializer? I wasn't
> sure which approach would raise fewer eyebrows.

I am simply trying to follow why we need in cache_offset. We are using the
offset currently:

/* Kick off an fscache write with what we have so far. */
ceph_fscache_write_to_cache(inode, offset, len, caching);

Why the offset is not good enough?

Thanks,
Slava.
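
For readers following the interface discussion above, here is a minimal userspace sketch of the two review suggestions taken together: the flush helper takes an explicit caching argument, and the cache write itself compiles to a no-op stub when the feature is configured out. It is not code from the patch or from fs/ceph/addr.c. DEMO_FSCACHE, demo_write_to_cache(), demo_flush_cache_write() and demo_bytes_cached are all hypothetical names invented for illustration, with DEMO_FSCACHE merely standing in for CONFIG_CEPH_FSCACHE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical switch standing in for CONFIG_CEPH_FSCACHE. */
#ifndef DEMO_FSCACHE
#define DEMO_FSCACHE 1
#endif

/* Counts the bytes handed to the pretend local cache. */
static uint64_t demo_bytes_cached;

#if DEMO_FSCACHE
/* Pretend cache write: only does work when caching is enabled. */
static inline void demo_write_to_cache(uint64_t off, uint64_t len, bool caching)
{
	(void)off;
	if (caching)
		demo_bytes_cached += len;
}
#else
/* Stub used when the cache is configured out; callers need no #ifdef. */
static inline void demo_write_to_cache(uint64_t off, uint64_t len, bool caching)
{
	(void)off;
	(void)len;
	(void)caching;
}
#endif

/*
 * Flush helper with the extra "caching" parameter proposed in review.
 * It forwards the pending range, if any, and resets the pending length.
 */
static inline void demo_flush_cache_write(uint64_t off, uint64_t *len, bool caching)
{
	if (!*len)
		return;

	demo_write_to_cache(off, *len, caching);
	*len = 0;
}
```

With this shape the helper can live outside the guarded region and still cost nothing when the cache is configured out, because only the innermost write is stubbed. Whether that matches the layout the reviewers want for the real helpers is a question the thread leaves open.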
On Fri, Apr 3, 2026 at 7:18 PM Viacheslav Dubeyko <vdubeyko@redhat.com> wrote:
> Why the offset is not good enough?

Because the "offset" variable tracks the whole write, including the
folios that are supposed to be omitted from the fscache.

--
Max Kellermann
Principal Architect Hosting Technology
cm4all | Im Mediapark 6a | 50670 Köln | Germany
General information about the company can be found here: https://www.cm4all.com/impressum
A member of the IONOS Group
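
The distinction between the whole-write range and the cacheable subranges can be illustrated with a small userspace model. This is only a sketch under simplifying assumptions, not the kernel code: struct fake_folio, struct span, flush_span() and collect_cache_spans() are hypothetical names, the dropbehind field models what folio_test_dropbehind() would report, and a gap in the file range simply ends the pending span (the real loop in ceph_submit_write() either starts a new extent or stops at that point).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one folio in a writeback batch. */
struct fake_folio {
	uint64_t offset;	/* byte offset in the file */
	uint64_t size;		/* folio size in bytes */
	bool dropbehind;	/* models folio_test_dropbehind() */
};

/* One contiguous range that would be handed to the local cache. */
struct span {
	uint64_t off;
	uint64_t len;
};

/* Record the pending span, if any, and reset the pending length. */
static void flush_span(struct span *out, size_t *n, uint64_t off, uint64_t *len)
{
	if (!*len)
		return;

	out[*n].off = off;
	out[*n].len = *len;
	(*n)++;
	*len = 0;
}

/*
 * Walk a batch and collect the contiguous non-dropbehind spans.
 * "out" must have room for "count" entries. Returns the span count.
 */
static size_t collect_cache_spans(const struct fake_folio *folios, size_t count,
				  bool caching, struct span *out)
{
	uint64_t cache_off = 0, cache_len = 0;
	uint64_t next = 0;
	size_t n = 0;

	for (size_t i = 0; i < count; i++) {
		const struct fake_folio *f = &folios[i];
		bool cacheable = caching && !f->dropbehind;

		/* A gap in the file range ends the pending span. */
		if (i > 0 && f->offset != next)
			flush_span(out, &n, cache_off, &cache_len);

		if (cacheable) {
			/* cache_off is only meaningful while cache_len != 0. */
			if (!cache_len)
				cache_off = f->offset;
			cache_len += f->size;
		} else {
			/* A dropbehind folio splits the cacheable range. */
			flush_span(out, &n, cache_off, &cache_len);
		}

		next = f->offset + f->size;
	}

	flush_span(out, &n, cache_off, &cache_len);
	return n;
}
```

For a batch of four 4096-byte folios at offsets 0, 4096, 8192 and 12288 where only the second is dropbehind, the whole write covers bytes 0 to 16383, while this model yields two cache spans: 4096 bytes at offset 0 and 8192 bytes at offset 8192. A single offset/len pair cannot describe that, which is the point made in the reply above.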