On Friday, 5 December 2025 14:03:12 CET Dominique Martinet wrote:
> Christian Schoenebeck wrote on Fri, Dec 05, 2025 at 11:47:10AM +0100:
> > > Oh, hang on. You're passing a kmalloc'ed page to
> > > iov_iter_get_pages_alloc(). That's not allowed ...
> > >
> > > see
> > > https://lore.kernel.org/all/20250310142750.1209192-1-willy@infradead.org
> > > /
> >
> > I'm confused. Looking at p9_get_mapped_pages(),
> > iov_iter_get_pages_alloc2() is only called for user space iovec data,
> > isn't it?
>
> I think that doesn't hold true when mounting a filesystem with -o
> loop -- afaiu the kernel issues IOs to 9p from the XFS subsystem, so
> these can come from kmalloc or whatever it is they get memory with.
>
> As to what to do with this, given Willy's last reply (thanks!), I'm
> honestly still not sure... If we can detect the pages are coming from
> somewhere we don't like I guess we can return EIO instead as a stop-gap
> measure (better than a crash)?
> If we check early enough (in client.c generic code) perhaps we could
> route these to the non-zc path that doesn't require this, so let's start
> by trying to figure out how to check what kind of page we got...?
Haven't checked yet, but shouldn't it be like this?
diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index 0b8086f58ad5..5ff4bfe25a3e 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -318,7 +318,7 @@ static int p9_get_mapped_pages(struct virtio_chan *chan,
 	if (!iov_iter_count(data))
 		return 0;
 
-	if (!iov_iter_is_kvec(data)) {
+	if (iov_iter_is_iovec(data)) {
 		int n;
 		/*
 		 * We allow only p9_max_pages pinned. We wait for the
Christian Schoenebeck wrote on Fri, Dec 05, 2025 at 02:36:41PM +0100:
> Haven't checked yet, but shouldn't it be like this?
>
> diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
> index 0b8086f58ad5..5ff4bfe25a3e 100644
> --- a/net/9p/trans_virtio.c
> +++ b/net/9p/trans_virtio.c
> @@ -318,7 +318,7 @@ static int p9_get_mapped_pages(struct virtio_chan *chan,
> if (!iov_iter_count(data))
> return 0;
>
> - if (!iov_iter_is_kvec(data)) {
> + if (iov_iter_is_iovec(data)) {
Hmm, __iov_iter_get_pages_alloc() is supposed to handle any of these:
user_backed_iter() (iter_is_ubuf() || iter_is_iovec()),
iov_iter_is_bvec(),
iov_iter_is_folioq(),
iov_iter_is_xarray()
so it would EFAULT on stuff like is_discard() and is_kvec(); but in our
case (folio) it also depends on what backs the folio, and the type of
the iov alone isn't enough to tell.
Your patch will appear to work (folioq path won't go there so the page
won't be pinned), but I'm not sure just being a folio is enough of a
guarantee here? for example is a folio coming from page cache
(e.g. readahead) guaranteed to be stable while it is being read? Can
something (try to) kill that thread while the IO is in progress and
reclaim the memory?
--
Dominique Martinet | Asmadeus
On Fri, Dec 05, 2025 at 10:48:55PM +0900, Dominique Martinet wrote:
> Your patch will appear to work (folioq path won't go there so the page
> won't be pinned), but I'm not sure just being a folio is enough of a
> guarantee here? for example is a folio coming from page cache
> (e.g. readahead) guaranteed to be stable while it is being read? Can
> something (try to) kill that thread while the IO is in progress and
> reclaim the memory?

In readahead, we allocate a folio, lock it and add it to the page cache.
We then submit it to the filesystem for read.  It cannot be truncated
from the page cache until the filesystem unlocks it (generally by calling
folio_end_read() but some filesystems explicitly call folio_unlock()
instead).  So you don't need to take an extra reference to it.
Chris,
I'm not sure why but I can't reproduce with your .config either :/
If you can still reproduce this reliably, could you try with the
following diff applied (which is basically the same as what Christian
suggested a couple of days ago with also ubuf, whatever that is)
------------
diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index 10c2dd486438..f7ee1f864b03 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -318,7 +318,7 @@ static int p9_get_mapped_pages(struct virtio_chan *chan,
 	if (!iov_iter_count(data))
 		return 0;
 
-	if (!iov_iter_is_kvec(data)) {
+	if (user_backed_iter(data)) {
 		int n;
 		/*
 		 * We allow only p9_max_pages pinned. We wait for the
-----------
Willy,
Matthew Wilcox wrote on Sun, Dec 07, 2025 at 07:18:02AM +0000:
> In readahead, we allocate a folio, lock it and add it to the page cache.
> We then submit it to the filesystem for read. It cannot be truncated
> from the page cache until the filesystem unlocks it (generally by calling
> folio_end_read() but some filesystems explicitly call folio_unlock()
> instead). So you don't need to take an extra reference to it.
Thanks.
My main problem with all this is that trans_virtio adds the buffers to
the virtio virtqueue but does nothing to take them off if
wait_event_killable() in virtio_request() gets killed; and looking at
it, even the code path that takes a ref will happily drop it before
the flush is over, so I guess there's no reason to actively try to pin
kernel pages...
I'd sleep better if there was a way to remove (detach?) the buffer from
the virtqueue but I can't see how to do that without breaking something
else, so I guess we'll have to live with that behavior unless someone
knows better.
--
Dominique Martinet | Asmadeus
On Sunday, 7 December 2025 14:49:31 CET Dominique Martinet wrote:
> Chris,
>
> I'm not sure why but I can't reproduce with your .config either :/
> If you can still reproduce this reliably, could you try with the
> following diff applied (which is basically the same as what Christian
> suggested a couple of days ago with also ubuf, whatever that is)
> ------------
> diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
> index 10c2dd486438..f7ee1f864b03 100644
> --- a/net/9p/trans_virtio.c
> +++ b/net/9p/trans_virtio.c
> @@ -318,7 +318,7 @@ static int p9_get_mapped_pages(struct virtio_chan *chan,
> if (!iov_iter_count(data))
> return 0;
>
> - if (!iov_iter_is_kvec(data)) {
> + if (user_backed_iter(data)) {
> int n;
> /*
> * We allow only p9_max_pages pinned. We wait for the
> -----------
Right, that's clearly better than what I suggested. Feel free to add:
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Honestly I think those iov_iter functions and/or iter_type enums deserve
some API doc comments. At least I don't find it self-explanatory what
the differences between the individual iterator types are.
/Christian
> Willy,
>
> Matthew Wilcox wrote on Sun, Dec 07, 2025 at 07:18:02AM +0000:
> > In readahead, we allocate a folio, lock it and add it to the page cache.
> > We then submit it to the filesystem for read. It cannot be truncated
> > from the page cache until the filesystem unlocks it (generally by calling
> > folio_end_read() but some filesystems explicitly call folio_unlock()
> > instead). So you don't need to take an extra reference to it.
>
> Thanks.
>
> My main problem with all this is that trans_virtio adds the buffers to
> the virtio virtqueue but does nothing to take them off if
> wait_event_killable() in virtio_request() gets killed; and looking at
> it, even the code path that takes a ref will happily drop it before
> the flush is over, so I guess there's no reason to actively try to pin
> kernel pages...
>
> I'd sleep better if there was a way to remove (detach?) the buffer from
> the virtqueue but I can't see how to do that without breaking something
> else, so I guess we'll have to live with that behavior unless someone
> knows better.
On 2025-12-07 22:49:31, Dominique Martinet wrote:
> Chris,
>
> I'm not sure why but I can't reproduce with your .config either :/
> If you can still reproduce this reliably, could you try with the
> following diff applied (which is basically the same as what Christian
> suggested a couple of days ago with also ubuf, whatever that is)
> ------------
> diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
> index 10c2dd486438..f7ee1f864b03 100644
> --- a/net/9p/trans_virtio.c
> +++ b/net/9p/trans_virtio.c
> @@ -318,7 +318,7 @@ static int p9_get_mapped_pages(struct virtio_chan *chan,
> if (!iov_iter_count(data))
> return 0;
>
> - if (!iov_iter_is_kvec(data)) {
> + if (user_backed_iter(data)) {
> int n;
> /*
> * We allow only p9_max_pages pinned. We wait for the
> -----------
>
Dominique,
I can still reproduce this easily. If I apply this patch, I no longer get
the crash. If this patch ends up being applied, feel free to add:
Tested-by: Chris Arges <carges@cloudflare.com>
Happy to test other iterations as well.
--chris
>
>
> Willy,
>
> Matthew Wilcox wrote on Sun, Dec 07, 2025 at 07:18:02AM +0000:
> > In readahead, we allocate a folio, lock it and add it to the page cache.
> > We then submit it to the filesystem for read. It cannot be truncated
> > from the page cache until the filesystem unlocks it (generally by calling
> > folio_end_read() but some filesystems explicitly call folio_unlock()
> > instead). So you don't need to take an extra reference to it.
>
> Thanks.
>
> My main problem with all this is that trans_virtio adds the buffers to
> the virtio virtqueue but does nothing to take them off if
> wait_event_killable() in virtio_request() gets killed; and looking at
> it, even the code path that takes a ref will happily drop it before
> the flush is over, so I guess there's no reason to actively try to pin
> kernel pages...
>
> I'd sleep better if there was a way to remove (detach?) the buffer from
> the virtqueue but I can't see how to do that without breaking something
> else, so I guess we'll have to live with that behavior unless someone
> knows better.
>
> --
> Dominique Martinet | Asmadeus