Here's the simplest fix for cifs_writepages_region() that gets it to work.
Fix the cifs_writepages_region() to just skip over members of the batch that
have been cleaned up rather than retrying them. I'm not entirely sure why it
fixes it, though. It's also not the most efficient as, in the common case,
this is going to happen a lot because cifs_extend_writeback() is going to
clean up the contiguous pages in the batch - and then this skip will occur for
those.
Fixes: 3822a7c40997 ("Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm")
Signed-off-by: David Howells <dhowells@redhat.com>
---
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 5365a3299088..ebfcaae8c437 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2893,8 +2893,9 @@ static int cifs_writepages_region(struct address_space *mapping,
if (folio_mapping(folio) != mapping ||
!folio_test_dirty(folio)) {
+ start += folio_size(folio);
folio_unlock(folio);
- goto skip_write;
+ continue;
}
if (folio_test_writeback(folio) ||
On Fri, Feb 24, 2023 at 02:31:15PM +0000, David Howells wrote:
> Here's the simplest fix for cifs_writepages_region() that gets it to work.
>
> Fix the cifs_writepages_region() to just skip over members of the batch that
> have been cleaned up rather than retrying them. I'm not entirely sure why it
> fixes it, though. It's also not the most efficient as, in the common case,
> this is going to happen a lot because cifs_extend_writeback() is going to
> clean up the contiguous pages in the batch - and then this skip will occur
> for those.

Why are you doing it this way? What's wrong with using
write_cache_pages() to push all the contiguous dirty folios into a single
I/O object, submitting it when the folios turn out not to be contiguous,
or when we run out of a batch?

You've written an awful lot of code here and it's a different model from
every other filesystem. Why is it better?
Matthew Wilcox <willy@infradead.org> wrote:

> Why are you doing it this way? What's wrong with using
> write_cache_pages() to push all the contiguous dirty folios into a single
> I/O object, submitting it when the folios turn out not to be contiguous,
> or when we run out of a batch?
>
> You've written an awful lot of code here and it's a different model from
> every other filesystem. Why is it better?

Because write_cache_pages():

 (1) Takes no account of fscache. I can't just build knowledge of PG_fscache
     into it because PG_private_2 may be used differently by other filesystems
     (btrfs, for example). (I'm also trying to phase out the use of
     PG_private_2 and instead use PG_writeback to cover both and the
     difference will be recorded elsewhere - but that's not there yet).

 (2) Calls ->writepage() individually for each folio - which is excessive. In
     AFS's implementation, we locate the first folio, then race through the
     following folios without ever waiting until we hit something that's
     locked or a gap and then stop and submit.

     write_cache_pages(), otoh, calls us with the next folio already undirtied
     and set for writeback when we find out that we don't want it yet.

 (3) Skips over holes, but at some point in the future we're going to need to
     schedule adjacent clean pages (before and after) for writeback too to
     handle transport compression and fscache updates if the granule size for
     either is larger than the folio size.

It might be better to take what's in cifs, generalise it and replace
write_cache_pages() with it, then have a "->submit_write()" aop that takes an
ITER_XARRAY iterator to write from.

David
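To make the shape of that suggestion a little more concrete: the idea is that
the generic walker would find a contiguous dirty span and then hand the
filesystem an iterator over the pagecache covering it. Nothing like this
exists in the tree - the op name, its signature and the struct below are
purely hypothetical - the sketch only shows how an ITER_XARRAY iterator over
the pagecache would be set up (ITER_SOURCE is the 6.2+ spelling of the WRITE
direction):

#include <linux/fs.h>
#include <linux/uio.h>
#include <linux/writeback.h>

/* Hypothetical op: write out [start, start + len) of the file, pulling the
 * data straight out of the pagecache via 'iter'.
 */
struct submit_write_sketch {
	ssize_t (*submit_write)(struct address_space *mapping,
				struct writeback_control *wbc,
				struct iov_iter *iter,
				loff_t start, size_t len);
};

static ssize_t sketch_write_span(struct address_space *mapping,
				 struct writeback_control *wbc,
				 const struct submit_write_sketch *ops,
				 loff_t start, size_t len)
{
	struct iov_iter iter;

	/* Describe the folios covering [start, start + len) without copying
	 * them; the transport can then send directly from the pagecache.
	 */
	iov_iter_xarray(&iter, ITER_SOURCE, &mapping->i_pages, start, len);

	return ops->submit_write(mapping, wbc, &iter, start, len);
}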
On Fri, Feb 24, 2023 at 05:15:41PM +0000, David Howells wrote:
> Matthew Wilcox <willy@infradead.org> wrote:
>
> > Why are you doing it this way? What's wrong with using
> > write_cache_pages() to push all the contiguous dirty folios into a single
> > I/O object, submitting it when the folios turn out not to be contiguous,
> > or when we run out of a batch?
> >
> > You've written an awful lot of code here and it's a different model from
> > every other filesystem. Why is it better?
>
> Because write_cache_pages():
>
> (1) Takes no account of fscache. I can't just build knowledge of PG_fscache
>     into it because PG_private_2 may be used differently by other filesystems
>     (btrfs, for example). (I'm also trying to phase out the use of
>     PG_private_2 and instead use PG_writeback to cover both and the
>     difference will be recorded elsewhere - but that's not there yet).

I don't understand why waiting for fscache here is even the right thing
to do. The folio is dirty and needs to be written back to the server.
Why should beginning that write wait for the current write to the
fscache to complete?

> (2) Calls ->writepage() individually for each folio - which is excessive. In
>     AFS's implementation, we locate the first folio, then race through the
>     following folios without ever waiting until we hit something that's
>     locked or a gap and then stop and submit.
>
>     write_cache_pages(), otoh, calls us with the next folio already undirtied
>     and set for writeback when we find out that we don't want it yet.

I think you're so convinced that your way is better that you don't see
what write_cache_pages() is actually doing. Yes, the writepage callback
is called once for each folio, but that doesn't actually submit a write.
Instead, they're accumulated into the equivalent of a wdata and the wdata
is submitted when there's a discontiguity or we've accumulated enough to
satisfy the constraints of the wbc.

> (3) Skips over holes, but at some point in the future we're going to need to
>     schedule adjacent clean pages (before and after) for writeback too to
>     handle transport compression and fscache updates if the granule size for
>     either is larger than the folio size.

Then we'll need it for other filesystems too, so should enhance
write_cache_pages().

> It might be better to take what's in cifs, generalise it and replace
> write_cache_pages() with it, then have a "->submit_write()" aop that takes an
> ITER_XARRAY iterator to write from.

->writepages _is_ ->submit_write. Should write_cache_pages() be better?
Maybe! But it works a whole lot better than what AFS was doing and you've
now ladled into cifs.
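For readers following along, the model being described here is roughly the
following: write_cache_pages() does the tag walk, locking and dirty/writeback
checks, and calls the filesystem back once per dirty folio; the callback only
accumulates contiguous folios into a pending request and submits it when a gap
appears, or at the end. This is a minimal sketch under stated assumptions, not
the cifs or afs code: the request struct and helpers are made-up names,
completion handling is omitted, and the callback signature shown is the
folio-based one (older kernels pass a struct page):

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/writeback.h>

/* Made-up accumulator, standing in for something like cifs's wdata. */
struct demo_wreq {
	loff_t start;	/* file position of the first folio in the run */
	size_t len;	/* bytes accumulated so far */
};

static void demo_submit(struct address_space *mapping, struct demo_wreq *req)
{
	/* A real filesystem would build one I/O covering [start, start + len)
	 * here and call folio_end_writeback() on each folio when it completes.
	 */
	req->len = 0;
}

/* Called by write_cache_pages() with the folio locked and its dirty bit
 * already cleared for I/O; mark it for writeback and batch it up.
 */
static int demo_writepage(struct folio *folio, struct writeback_control *wbc,
			  void *data)
{
	struct demo_wreq *req = data;
	loff_t pos = folio_pos(folio);

	if (req->len && pos != req->start + req->len)
		demo_submit(folio->mapping, req);	/* discontiguous: flush */

	if (!req->len)
		req->start = pos;
	req->len += folio_size(folio);

	folio_start_writeback(folio);
	folio_unlock(folio);
	return 0;
}

static int demo_writepages(struct address_space *mapping,
			   struct writeback_control *wbc)
{
	struct demo_wreq req = {};
	int ret = write_cache_pages(mapping, wbc, demo_writepage, &req);

	if (req.len)
		demo_submit(mapping, &req);	/* flush the final run */
	return ret;
}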
Matthew Wilcox <willy@infradead.org> wrote:

> I don't understand why waiting for fscache here is even the right thing
> to do.

Then why do we have to wait for PG_writeback to complete?

David
On Fri, Feb 24, 2023 at 12:14 PM David Howells <dhowells@redhat.com> wrote:
>
> Then why do we have to wait for PG_writeback to complete?

At least for PG_writeback, it's about "the _previous_ dirty write is
still under way, but - since PG_dirty is set again - the page has been
dirtied since".

So we have to start _another_ writeback, because while the current
writeback *might* have written the updated data, that is not at all
certain or clear.

I'm not sure what the fscache rules are.

Linus
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > Then why do we have to wait for PG_writeback to complete?
>
> At least for PG_writeback, it's about "the _previous_ dirty write is
> still under way, but - since PG_dirty is set again - the page has been
> dirtied since".
>
> So we have to start _another_ writeback, because while the current
> writeback *might* have written the updated data, that is not at all
> certain or clear.

As I understand it, it's also about serialising writes from the same page to
the same backing store. We don't want them to end up out-of-order. I'm not
sure what guarantees, for instance, the block layer gives if two I/O requests
go to the same place.

> I'm not sure what the fscache rules are.

I'm now using PG_fscache in exactly the same way: the previous write to the
cache is still under way. I don't want to start another DIO write to the
cache for the same pages. Hence the waits/checks on PG_fscache I've added
anywhere we need to wait/check on PG_writeback.

As I mentioned I'm looking at the possibility of making PG_dirty and
PG_writeback cover *both* cases and recording the difference elsewhere -
thereby returning PG_private_2 to the VM folks who'd like their bit back.
This means, for instance, when we read from the server and find we need to
write it to the cache, we set a note in the aforementioned elsewhere, mark
the page dirty and leave it to writepages() to effect the write to the cache.

It could get tricky because we have two different places to write to, with
very different characteristics (e.g. ~6000km away server vs local SSD) with
their own queueing, scheduling, bandwidth, etc. - and the local disk might
have to share with the system.

David
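Spelled out, the rule David describes - treat a write in flight to the cache
exactly like a write in flight to the server - amounts to something like the
following before a new write covering a folio may begin. This is a sketch of
the pattern only (the helper name is invented), not the actual cifs code:

#include <linux/netfs.h>
#include <linux/pagemap.h>

/* Before starting a new write that covers this folio, wait for any write
 * already in flight to either destination, so that two writes to the same
 * place cannot complete out of order.
 */
static void demo_wait_for_prior_writes(struct folio *folio)
{
	/* Previous write to the server still in progress. */
	folio_wait_writeback(folio);

	/* Previous write to the local cache (PG_fscache, currently an alias
	 * for PG_private_2) still in progress.
	 */
	folio_wait_fscache(folio);
}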
On Fri, Feb 24, 2023 at 12:16:49PM -0800, Linus Torvalds wrote:
> On Fri, Feb 24, 2023 at 12:14 PM David Howells <dhowells@redhat.com> wrote:
> >
> > Then why do we have to wait for PG_writeback to complete?
>
> At least for PG_writeback, it's about "the _previous_ dirty write is
> still under way, but - since PG_dirty is set again - the page has been
> dirtied since".
>
> So we have to start _another_ writeback, because while the current
> writeback *might* have written the updated data, that is not at all
> certain or clear.

also, we only have a writeback bit, not a writeback count. And when
the current writeback completes, it'll clear that bit. We're also
being kind to our backing store and not writing to the same block twice
at the same time.

> I'm not sure what the fscache rules are.

My understanding is that the fscache bit is set under several
circumstances, but if the folio is dirty _and_ the fscache bit
is set, it means the folio is currently being written to the cache
device. I don't see a conflict there; we can write to the backing
store and the cache device at the same time.
Matthew Wilcox <willy@infradead.org> wrote:

> On Fri, Feb 24, 2023 at 12:16:49PM -0800, Linus Torvalds wrote:
> > On Fri, Feb 24, 2023 at 12:14 PM David Howells <dhowells@redhat.com> wrote:
> > >
> > > Then why do we have to wait for PG_writeback to complete?
> >
> > At least for PG_writeback, it's about "the _previous_ dirty write is
> > still under way, but - since PG_dirty is set again - the page has been
> > dirtied since".
> >
> > So we have to start _another_ writeback, because while the current
> > writeback *might* have written the updated data, that is not at all
> > certain or clear.
>
> also, we only have a writeback bit, not a writeback count. And when
> the current writeback completes, it'll clear that bit. We're also
> being kind to our backing store and not writing to the same block twice
> at the same time.

It's not so much being kind to the backing store, I think, as avoiding the
possibility that the writes happen out of order.

> > I'm not sure what the fscache rules are.
>
> My understanding is that the fscache bit is set under several
> circumstances, but if the folio is dirty _and_ the fscache bit
> is set, it means the folio is currently being written to the cache
> device. I don't see a conflict there; we can write to the backing
> store and the cache device at the same time.

The dirty bit is nothing to do with it. If the fscache bit is set, then the
page is currently being written to the cache - and we need to wait before
starting another write. Sometimes we start a write to the cache from a clean
page (e.g. we just read it from the server) and sometimes we start a write to
the cache from writepages (e.g. the data is dirty and we're writing it to the
server as well).

Things will become more 'interesting' should we ever get around to
implementing disconnected operation. Then we might end up staging dirty data
through the cache.

David
On Fri, Feb 24, 2023 at 6:31 AM David Howells <dhowells@redhat.com> wrote:
>
> Here's the simplest fix for cifs_writepages_region() that gets it to work.

Hmm. The commit message for this is wrong.

> Fix the cifs_writepages_region() to just skip over members of the batch that
> have been cleaned up rather than retrying them.

It never retried them. The "skip_write" code did that same

        start += folio_size(folio);
        continue;

that your patch does, but it *also* had that

        if (skips >= 5 || need_resched()) {

thing to just stop writing entirely.

> I'm not entirely sure why it fixes it, though.

Yes. Strange. Because it does the exact same thing as the "Oh, the trylock
worked, but it was still under writeback or fscache" did. I just merged all
the "skip write" cases.

But the code is clearly (a) not working and (b) the whole skip count and
need_resched() logic is a bit strange to begin with.

Can you humor me, and try if just removing that skip count thing instead?
IOW, this attached patch?

Because that whole "let's stop writing if we need to reschedule" sounds truly
odd (we have a cond_resched(), although it's per folio batch, not per-folio),
and the skip count logic doesn't make much sense to me either.

SteveF?

Linus
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Can you humor me, and try if just removing that skip count thing
> instead? IOW, this attached patch?

That works too.

> Because that whole "let's stop writing if we need to reschedule" sounds
> truly odd (we have a cond_resched(), although it's per folio batch, not
> per-folio), and the skip count logic doesn't make much sense to me either.

The skip thing, in my code, is only used in WB_SYNC_NONE mode. If we hit 5
things in progress or rescheduling is required, we return to the caller on
the basis that conflicting flushes appear to be happening in other threads.

David
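For reference, the behaviour being described reconstructs to roughly the
following (an approximation for illustration, with an invented helper name,
not the exact cifs code): in background writeback a busy folio is taken as
evidence that another thread is already flushing the range, so after a few
such collisions - or if we ought to reschedule - the region walk returns to
the caller, whereas data-integrity writeback has to wait.

#include <linux/netfs.h>
#include <linux/pagemap.h>
#include <linux/sched.h>
#include <linux/writeback.h>

/* Returns true if the caller should stop writing this region and return. */
static bool demo_should_give_up(struct folio *folio,
				struct writeback_control *wbc,
				unsigned int *skips)
{
	if (!folio_test_writeback(folio) && !folio_test_fscache(folio))
		return false;			/* folio is idle, carry on */

	if (wbc->sync_mode != WB_SYNC_NONE) {
		/* Data-integrity writeback: wait until the folio goes idle,
		 * then the caller can retry it.
		 */
		folio_wait_writeback(folio);
		folio_wait_fscache(folio);
		return false;
	}

	/* Background writeback: a busy folio suggests a conflicting flush is
	 * in progress in another thread; after a few collisions, or if we
	 * ought to reschedule, leave the rest of the region to them.
	 */
	return ++*skips >= 5 || need_resched();
}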
On Fri, Feb 24, 2023 at 9:19 AM David Howells <dhowells@redhat.com> wrote:
>
> The skip thing, in my code, is only used in WB_SYNC_NONE mode. If we hit 5
> things in progress or rescheduling is required, we return to the caller on
> the basis that conflicting flushes appear to be happening in other threads.

Ahh. *That* is the difference, and I didn't realize.

I made all the skip-write cases the same, and I really meant for that case to
only trigger for WB_SYNC_NONE, but I stupidly didn't notice that the whole
folio_test_dirty() re-test case was done without that WB_SYNC_NONE case that
all the other cases had.

Mea culpa, mea maxima culpa. That was just me being stupid. So that case
isn't actually a "skip write" case at all, it's actually a "no write needed
at all" case.

Your original patch is the right fix, and I was just being silly for having
not realized.

I'll apply that minimal fix for now - I think the right thing to do is your
bigger patch, but that needs more thinking (or at least splitting up).

Sorry for the confusion, and thanks for setting me straight,

Linus
On Fri, Feb 24, 2023 at 10:58 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I'll apply that minimal fix for now - I think the right thing to do is
> your bigger patch, but that needs more thinking (or at least splitting
> up).

Minimal fix applied - that way I can drop this for now, and we can discuss
the whole "maybe we can just use write_cache_pages()" instead.

Because that _would_ be lovely, even if it's possible that the generic helper
might need some extra love to work better for cifs/afs.

Linus
Have been doing additional testing of these - and also verified that David's
most recent patch, now in my for-next branch ("cifs: Fix memory leak in
direct I/O") fixes the remaining problem (the issue with xfstest 208 that
Murphy pointed out).

Will send the three small cifs fixes from David later this week, along with
some unrelated fixes from Paulo and Shyam.

-----Original Message-----
From: Linus Torvalds <torvalds@linux-foundation.org>
Sent: Friday, February 24, 2023 1:06 PM
To: David Howells <dhowells@redhat.com>
Subject: Re: [RFC][PATCH] cifs: Fix cifs_writepages_region()

> Minimal fix applied - that way I can drop this for now, and we can discuss
> the whole "maybe we can just use write_cache_pages()" instead.
>
> Because that _would_ be lovely, even if it's possible that the generic
> helper might need some extra love to work better for cifs/afs.
>
> Linus
[This is in addition to the "cifs: Fix cifs_writepages_region()" patch that I
posted]
The inefficiency arising from filemap_get_folios_tag() getting a batch of
contiguous folios in Vishal's change to afs (which got copied into cifs) can
be reduced by skipping over those folios that the start position has already
passed, rather than going through the process of locking, checking and
trying to write them.
A similar change would need to be made in afs, in addition to fixing the bugs
there.
There's also a fix in cifs_write_back_from_locked_folio(): it wasn't
returning the amount of data dispatched to the server, because
->async_writev() just returns 0 on success.
Signed-off-by: David Howells <dhowells@redhat.com>
---
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index ebfcaae8c437..bae1a9709e32 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2839,6 +2839,7 @@ static ssize_t cifs_write_back_from_locked_folio(struct address_space *mapping,
free_xid(xid);
if (rc == 0) {
wbc->nr_to_write = count;
+ rc = len;
} else if (is_retryable_error(rc)) {
cifs_pages_write_redirty(inode, start, len);
} else {
@@ -2873,6 +2874,13 @@ static int cifs_writepages_region(struct address_space *mapping,
for (int i = 0; i < nr; i++) {
ssize_t ret;
struct folio *folio = fbatch.folios[i];
+ unsigned long long fstart;
+
+ fstart = folio_pos(folio); /* May go backwards with THPs */
+ if (fstart < start &&
+ folio_size(folio) <= start - fstart)
+ continue;
+ start = fstart;
redo_folio:
start = folio_pos(folio); /* May regress with THPs */
On Fri, Feb 24, 2023 at 7:13 AM David Howells <dhowells@redhat.com> wrote:
>
> The inefficiency arising from filemap_get_folios_tag() getting a batch of
> contiguous folios in Vishal's change to afs (which got copied into cifs)
> can be reduced by skipping over those folios that the start position has
> already passed, rather than going through the process of locking, checking
> and trying to write them.

This patch just makes me go "Ugh".

There's something wrong with this code for it to need these games. That just
makes me convinced that your other patch that just gets rid of the batching
entirely is the right one.

Of course, I'd be even happier if Willy is right and the code could use the
generic write_cache_pages() and avoid all of these things entirely. I'm not
clear on why cifs and afs are being so different in the first place, and some
of the differences are just odd (like that skip count).

Linus
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Of course, I'd be even happier if Willy is right and the code could
> use the generic write_cache_pages() and avoid all of these things
> entirely. I'm not clear on why cifs and afs are being so different in
> the first place, and some of the differences are just odd (like that
> skip count).

The main reason is that write_cache_pages() doesn't (and can't) check
PG_fscache (btrfs uses PG_private_2 for other purposes). NFS, 9p and ceph,
for the moment, don't cache files that are open for writing, but I'm
intending to change that at some point.

The intention is to unify the writepages code for at least 9p, afs, ceph and
cifs in netfslib in the future.

David