Barry said [1]:
"""
mm doesn't support non-blockable __GFP_NOFAIL allocation. Because
__GFP_NOFAIL without direct reclamation may just result in a busy
loop within non-sleepable contexts.
""“
Unfortuantely, we do that under read lock. A possible way to fix that
is to move the pages allocation out of the lock into the caller, but
having to allocate a huge number of pages and auxiliary page array
seems to be problematic as well per Tetsuon [2]:
"""
You should implement proper error handling instead of using
__GFP_NOFAIL if count can become large.
"""
So I choose another way, which does not release the kernel bounce pages
when the user tries to register userspace bounce pages. Then we don't
need to do allocation in a path which is not expected to fail (e.g. in
the release). We pay for this by using more memory, but further
optimizations could be done on top.
[1] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m3caef86a66ea6318ef94f9976ddb3a0ccfe6fcf8
[2] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m7ad10eaba48ade5abf2d572f24e185d9fb146480
Fixes: 6c77ed22880d ("vduse: Support using userspace pages as bounce buffer")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
drivers/vdpa/vdpa_user/iova_domain.c | 18 ++++++++++--------
drivers/vdpa/vdpa_user/iova_domain.h | 1 +
2 files changed, 11 insertions(+), 8 deletions(-)
diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
index 791d38d6284c..933d2f7cd49a 100644
--- a/drivers/vdpa/vdpa_user/iova_domain.c
+++ b/drivers/vdpa/vdpa_user/iova_domain.c
@@ -162,6 +162,7 @@ static void vduse_domain_bounce(struct vduse_iova_domain *domain,
enum dma_data_direction dir)
{
struct vduse_bounce_map *map;
+ struct page *page;
unsigned int offset;
void *addr;
size_t sz;
@@ -178,7 +179,10 @@ static void vduse_domain_bounce(struct vduse_iova_domain *domain,
map->orig_phys == INVALID_PHYS_ADDR))
return;
- addr = kmap_local_page(map->bounce_page);
+ page = domain->user_bounce_pages ?
+ map->user_bounce_page : map->bounce_page;
+
+ addr = kmap_local_page(page);
do_bounce(map->orig_phys + offset, addr + offset, sz, dir);
kunmap_local(addr);
size -= sz;
@@ -270,9 +274,8 @@ int vduse_domain_add_user_bounce_pages(struct vduse_iova_domain *domain,
memcpy_to_page(pages[i], 0,
page_address(map->bounce_page),
PAGE_SIZE);
- __free_page(map->bounce_page);
}
- map->bounce_page = pages[i];
+ map->user_bounce_page = pages[i];
get_page(pages[i]);
}
domain->user_bounce_pages = true;
@@ -297,17 +300,16 @@ void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
struct page *page = NULL;
map = &domain->bounce_maps[i];
- if (WARN_ON(!map->bounce_page))
+ if (WARN_ON(!map->user_bounce_page))
continue;
/* Copy user page to kernel page if it's in use */
if (map->orig_phys != INVALID_PHYS_ADDR) {
- page = alloc_page(GFP_ATOMIC | __GFP_NOFAIL);
+ page = map->bounce_page;
memcpy_from_page(page_address(page),
- map->bounce_page, 0, PAGE_SIZE);
+ map->user_bounce_page, 0, PAGE_SIZE);
}
- put_page(map->bounce_page);
- map->bounce_page = page;
+ put_page(map->user_bounce_page);
}
domain->user_bounce_pages = false;
out:
diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
index f92f22a7267d..7f3f0928ec78 100644
--- a/drivers/vdpa/vdpa_user/iova_domain.h
+++ b/drivers/vdpa/vdpa_user/iova_domain.h
@@ -21,6 +21,7 @@
struct vduse_bounce_map {
struct page *bounce_page;
+ struct page *user_bounce_page;
u64 orig_phys;
};
--
2.31.1
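For quick orientation, the bounce path after this patch boils down to the
following (a paraphrase of the hunks above; the surrounding iteration and
locking are elided):

        /* Paraphrased from vduse_domain_bounce() after the patch: the
         * kernel bounce page is always kept, and the userspace page is
         * only used while userspace bounce pages are registered. */
        struct page *page = domain->user_bounce_pages ?
                            map->user_bounce_page : map->bounce_page;

        addr = kmap_local_page(page);
        do_bounce(map->orig_phys + offset, addr + offset, sz, dir);
        kunmap_local(addr);

So the remove path no longer needs to allocate anything: it copies back into
the kernel page that was never freed and simply drops the reference on the
userspace page.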
On Mon, Aug 5, 2024 at 4:21 PM Jason Wang <jasowang@redhat.com> wrote:
>
> Barry said [1]:
>
> """
> mm doesn't support non-blockable __GFP_NOFAIL allocation. Because
> __GFP_NOFAIL without direct reclamation may just result in a busy
> loop within non-sleepable contexts.
> ""“
>
> Unfortuantely, we do that under read lock. A possible way to fix that
> is to move the pages allocation out of the lock into the caller, but
> having to allocate a huge number of pages and auxiliary page array
> seems to be problematic as well per Tetsuon [2]:
>
> """
> You should implement proper error handling instead of using
> __GFP_NOFAIL if count can become large.
> """
>
> So I choose another way, which does not release kernel bounce pages
> when user tries to register usersapce bounce pages. Then we don't need
> to do allocation in the path which is not expected to be fail (e.g in
> the release). We pay this for more memory usage but further
> optimizations could be done on top.
>
> [1] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m3caef86a66ea6318ef94f9976ddb3a0ccfe6fcf8
> [2] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m7ad10eaba48ade5abf2d572f24e185d9fb146480
>
> Fixes: 6c77ed22880d ("vduse: Support using userspace pages as bounce buffer")
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
Reviewed-by: Xie Yongji <xieyongji@bytedance.com>
Tested-by: Xie Yongji <xieyongji@bytedance.com>
Have tested it with qemu-storage-daemon [1]:
$ qemu-storage-daemon \
--chardev socket,id=charmonitor,path=/tmp/qmp.sock,server=on,wait=off \
--monitor chardev=charmonitor \
--blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0 \
--export type=vduse-blk,id=vduse-test,name=vduse-test,node-name=disk0,writable=on
[1] https://github.com/bytedance/qemu/tree/vduse-umem
> drivers/vdpa/vdpa_user/iova_domain.c | 18 ++++++++++--------
> drivers/vdpa/vdpa_user/iova_domain.h | 1 +
> 2 files changed, 11 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
> index 791d38d6284c..933d2f7cd49a 100644
> --- a/drivers/vdpa/vdpa_user/iova_domain.c
> +++ b/drivers/vdpa/vdpa_user/iova_domain.c
> @@ -162,6 +162,7 @@ static void vduse_domain_bounce(struct vduse_iova_domain *domain,
> enum dma_data_direction dir)
> {
> struct vduse_bounce_map *map;
> + struct page *page;
> unsigned int offset;
> void *addr;
> size_t sz;
> @@ -178,7 +179,10 @@ static void vduse_domain_bounce(struct vduse_iova_domain *domain,
> map->orig_phys == INVALID_PHYS_ADDR))
> return;
>
> - addr = kmap_local_page(map->bounce_page);
> + page = domain->user_bounce_pages ?
> + map->user_bounce_page : map->bounce_page;
> +
> + addr = kmap_local_page(page);
> do_bounce(map->orig_phys + offset, addr + offset, sz, dir);
> kunmap_local(addr);
> size -= sz;
> @@ -270,9 +274,8 @@ int vduse_domain_add_user_bounce_pages(struct vduse_iova_domain *domain,
> memcpy_to_page(pages[i], 0,
> page_address(map->bounce_page),
> PAGE_SIZE);
> - __free_page(map->bounce_page);
> }
> - map->bounce_page = pages[i];
> + map->user_bounce_page = pages[i];
> get_page(pages[i]);
> }
> domain->user_bounce_pages = true;
> @@ -297,17 +300,16 @@ void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
> struct page *page = NULL;
>
> map = &domain->bounce_maps[i];
> - if (WARN_ON(!map->bounce_page))
> + if (WARN_ON(!map->user_bounce_page))
> continue;
>
> /* Copy user page to kernel page if it's in use */
> if (map->orig_phys != INVALID_PHYS_ADDR) {
> - page = alloc_page(GFP_ATOMIC | __GFP_NOFAIL);
> + page = map->bounce_page;
> memcpy_from_page(page_address(page),
> - map->bounce_page, 0, PAGE_SIZE);
> + map->user_bounce_page, 0, PAGE_SIZE);
> }
> - put_page(map->bounce_page);
> - map->bounce_page = page;
> + put_page(map->user_bounce_page);
map->user_bounce_page = NULL?
Thanks,
Yongji
On Wed, Aug 7, 2024 at 2:52 PM Yongji Xie <xieyongji@bytedance.com> wrote:
>
> On Mon, Aug 5, 2024 at 4:21 PM Jason Wang <jasowang@redhat.com> wrote:
> >
> > Barry said [1]:
> >
> > """
> > mm doesn't support non-blockable __GFP_NOFAIL allocation. Because
> > __GFP_NOFAIL without direct reclamation may just result in a busy
> > loop within non-sleepable contexts.
> > ""“
> >
> > Unfortuantely, we do that under read lock. A possible way to fix that
> > is to move the pages allocation out of the lock into the caller, but
> > having to allocate a huge number of pages and auxiliary page array
> > seems to be problematic as well per Tetsuon [2]:
> >
> > """
> > You should implement proper error handling instead of using
> > __GFP_NOFAIL if count can become large.
> > """
> >
> > So I choose another way, which does not release kernel bounce pages
> > when user tries to register usersapce bounce pages. Then we don't need
> > to do allocation in the path which is not expected to be fail (e.g in
> > the release). We pay this for more memory usage but further
> > optimizations could be done on top.
> >
> > [1] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m3caef86a66ea6318ef94f9976ddb3a0ccfe6fcf8
> > [2] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m7ad10eaba48ade5abf2d572f24e185d9fb146480
> >
> > Fixes: 6c77ed22880d ("vduse: Support using userspace pages as bounce buffer")
> > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > ---
>
> Reviewed-by: Xie Yongji <xieyongji@bytedance.com>
> Tested-by: Xie Yongji <xieyongji@bytedance.com>
Thanks.
>
> Have tested it with qemu-storage-daemon [1]:
>
> $ qemu-storage-daemon \
> --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server=on,wait=off \
> --monitor chardev=charmonitor \
> --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0
> \
> --export type=vduse-blk,id=vduse-test,name=vduse-test,node-name=disk0,writable=on
>
> [1] https://github.com/bytedance/qemu/tree/vduse-umem
Great, would you want to post them to QEMU?
>
> > drivers/vdpa/vdpa_user/iova_domain.c | 18 ++++++++++--------
> > drivers/vdpa/vdpa_user/iova_domain.h | 1 +
> > 2 files changed, 11 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
> > index 791d38d6284c..933d2f7cd49a 100644
> > --- a/drivers/vdpa/vdpa_user/iova_domain.c
> > +++ b/drivers/vdpa/vdpa_user/iova_domain.c
> > @@ -162,6 +162,7 @@ static void vduse_domain_bounce(struct vduse_iova_domain *domain,
> > enum dma_data_direction dir)
> > {
> > struct vduse_bounce_map *map;
> > + struct page *page;
> > unsigned int offset;
> > void *addr;
> > size_t sz;
> > @@ -178,7 +179,10 @@ static void vduse_domain_bounce(struct vduse_iova_domain *domain,
> > map->orig_phys == INVALID_PHYS_ADDR))
> > return;
> >
> > - addr = kmap_local_page(map->bounce_page);
> > + page = domain->user_bounce_pages ?
> > + map->user_bounce_page : map->bounce_page;
> > +
> > + addr = kmap_local_page(page);
> > do_bounce(map->orig_phys + offset, addr + offset, sz, dir);
> > kunmap_local(addr);
> > size -= sz;
> > @@ -270,9 +274,8 @@ int vduse_domain_add_user_bounce_pages(struct vduse_iova_domain *domain,
> > memcpy_to_page(pages[i], 0,
> > page_address(map->bounce_page),
> > PAGE_SIZE);
> > - __free_page(map->bounce_page);
> > }
> > - map->bounce_page = pages[i];
> > + map->user_bounce_page = pages[i];
> > get_page(pages[i]);
> > }
> > domain->user_bounce_pages = true;
> > @@ -297,17 +300,16 @@ void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
> > struct page *page = NULL;
> >
> > map = &domain->bounce_maps[i];
> > - if (WARN_ON(!map->bounce_page))
> > + if (WARN_ON(!map->user_bounce_page))
> > continue;
> >
> > /* Copy user page to kernel page if it's in use */
> > if (map->orig_phys != INVALID_PHYS_ADDR) {
> > - page = alloc_page(GFP_ATOMIC | __GFP_NOFAIL);
> > + page = map->bounce_page;
> > memcpy_from_page(page_address(page),
> > - map->bounce_page, 0, PAGE_SIZE);
> > + map->user_bounce_page, 0, PAGE_SIZE);
> > }
> > - put_page(map->bounce_page);
> > - map->bounce_page = page;
> > + put_page(map->user_bounce_page);
>
> map->user_bounce_page = NULL?
For safety, it might be better. Let me add this and post a V2.
Thanks
>
> Thanks,
> Yongji
>
On Thu, Aug 8, 2024 at 10:58 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, Aug 7, 2024 at 2:52 PM Yongji Xie <xieyongji@bytedance.com> wrote:
> >
> > On Mon, Aug 5, 2024 at 4:21 PM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > Barry said [1]:
> > >
> > > """
> > > mm doesn't support non-blockable __GFP_NOFAIL allocation. Because
> > > __GFP_NOFAIL without direct reclamation may just result in a busy
> > > loop within non-sleepable contexts.
> > > ""“
> > >
> > > Unfortuantely, we do that under read lock. A possible way to fix that
> > > is to move the pages allocation out of the lock into the caller, but
> > > having to allocate a huge number of pages and auxiliary page array
> > > seems to be problematic as well per Tetsuon [2]:
> > >
> > > """
> > > You should implement proper error handling instead of using
> > > __GFP_NOFAIL if count can become large.
> > > """
> > >
> > > So I choose another way, which does not release kernel bounce pages
> > > when user tries to register usersapce bounce pages. Then we don't need
> > > to do allocation in the path which is not expected to be fail (e.g in
> > > the release). We pay this for more memory usage but further
> > > optimizations could be done on top.
> > >
> > > [1] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m3caef86a66ea6318ef94f9976ddb3a0ccfe6fcf8
> > > [2] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m7ad10eaba48ade5abf2d572f24e185d9fb146480
> > >
> > > Fixes: 6c77ed22880d ("vduse: Support using userspace pages as bounce buffer")
> > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > ---
> >
> > Reviewed-by: Xie Yongji <xieyongji@bytedance.com>
> > Tested-by: Xie Yongji <xieyongji@bytedance.com>
>
> Thanks.
>
> >
> > Have tested it with qemu-storage-daemon [1]:
> >
> > $ qemu-storage-daemon \
> > --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server=on,wait=off \
> > --monitor chardev=charmonitor \
> > --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0
> > \
> > --export type=vduse-blk,id=vduse-test,name=vduse-test,node-name=disk0,writable=on
> >
> > [1] https://github.com/bytedance/qemu/tree/vduse-umem
>
> Great, would you want to post them to the Qemu?
>
Looks like qemu-storage-daemon would not benefit from this feature
which is designed for some hugepage users such as SPDK/DPDK.
Thanks,
Yongji
On Thu, Aug 8, 2024 at 6:52 PM Yongji Xie <xieyongji@bytedance.com> wrote:
>
> On Thu, Aug 8, 2024 at 10:58 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Wed, Aug 7, 2024 at 2:52 PM Yongji Xie <xieyongji@bytedance.com> wrote:
> > >
> > > On Mon, Aug 5, 2024 at 4:21 PM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > Barry said [1]:
> > > >
> > > > """
> > > > mm doesn't support non-blockable __GFP_NOFAIL allocation. Because
> > > > __GFP_NOFAIL without direct reclamation may just result in a busy
> > > > loop within non-sleepable contexts.
> > > > ""“
> > > >
> > > > Unfortuantely, we do that under read lock. A possible way to fix that
> > > > is to move the pages allocation out of the lock into the caller, but
> > > > having to allocate a huge number of pages and auxiliary page array
> > > > seems to be problematic as well per Tetsuon [2]:
> > > >
> > > > """
> > > > You should implement proper error handling instead of using
> > > > __GFP_NOFAIL if count can become large.
> > > > """
> > > >
> > > > So I choose another way, which does not release kernel bounce pages
> > > > when user tries to register usersapce bounce pages. Then we don't need
> > > > to do allocation in the path which is not expected to be fail (e.g in
> > > > the release). We pay this for more memory usage but further
> > > > optimizations could be done on top.
> > > >
> > > > [1] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m3caef86a66ea6318ef94f9976ddb3a0ccfe6fcf8
> > > > [2] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m7ad10eaba48ade5abf2d572f24e185d9fb146480
> > > >
> > > > Fixes: 6c77ed22880d ("vduse: Support using userspace pages as bounce buffer")
> > > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > > ---
> > >
> > > Reviewed-by: Xie Yongji <xieyongji@bytedance.com>
> > > Tested-by: Xie Yongji <xieyongji@bytedance.com>
> >
> > Thanks.
> >
> > >
> > > Have tested it with qemu-storage-daemon [1]:
> > >
> > > $ qemu-storage-daemon \
> > > --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server=on,wait=off \
> > > --monitor chardev=charmonitor \
> > > --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0
> > > \
> > > --export type=vduse-blk,id=vduse-test,name=vduse-test,node-name=disk0,writable=on
> > >
> > > [1] https://github.com/bytedance/qemu/tree/vduse-umem
> >
> > Great, would you want to post them to the Qemu?
> >
>
> Looks like qemu-storage-daemon would not benefit from this feature
> which is designed for some hugepage users such as SPDK/DPDK.
Yes, but maybe for testing purposes like here?
Thanks
>
> Thanks,
> Yongji
>
On Mon, Aug 12, 2024 at 3:00 PM Jason Wang <jasowang@redhat.com> wrote:
>
> On Thu, Aug 8, 2024 at 6:52 PM Yongji Xie <xieyongji@bytedance.com> wrote:
> >
> > On Thu, Aug 8, 2024 at 10:58 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Wed, Aug 7, 2024 at 2:52 PM Yongji Xie <xieyongji@bytedance.com> wrote:
> > > >
> > > > On Mon, Aug 5, 2024 at 4:21 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > > Barry said [1]:
> > > > >
> > > > > """
> > > > > mm doesn't support non-blockable __GFP_NOFAIL allocation. Because
> > > > > __GFP_NOFAIL without direct reclamation may just result in a busy
> > > > > loop within non-sleepable contexts.
> > > > > ""“
> > > > >
> > > > > Unfortuantely, we do that under read lock. A possible way to fix that
> > > > > is to move the pages allocation out of the lock into the caller, but
> > > > > having to allocate a huge number of pages and auxiliary page array
> > > > > seems to be problematic as well per Tetsuon [2]:
> > > > >
> > > > > """
> > > > > You should implement proper error handling instead of using
> > > > > __GFP_NOFAIL if count can become large.
> > > > > """
> > > > >
> > > > > So I choose another way, which does not release kernel bounce pages
> > > > > when user tries to register usersapce bounce pages. Then we don't need
> > > > > to do allocation in the path which is not expected to be fail (e.g in
> > > > > the release). We pay this for more memory usage but further
> > > > > optimizations could be done on top.
> > > > >
> > > > > [1] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m3caef86a66ea6318ef94f9976ddb3a0ccfe6fcf8
> > > > > [2] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m7ad10eaba48ade5abf2d572f24e185d9fb146480
> > > > >
> > > > > Fixes: 6c77ed22880d ("vduse: Support using userspace pages as bounce buffer")
> > > > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > > > ---
> > > >
> > > > Reviewed-by: Xie Yongji <xieyongji@bytedance.com>
> > > > Tested-by: Xie Yongji <xieyongji@bytedance.com>
> > >
> > > Thanks.
> > >
> > > >
> > > > Have tested it with qemu-storage-daemon [1]:
> > > >
> > > > $ qemu-storage-daemon \
> > > > --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server=on,wait=off \
> > > > --monitor chardev=charmonitor \
> > > > --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0
> > > > \
> > > > --export type=vduse-blk,id=vduse-test,name=vduse-test,node-name=disk0,writable=on
> > > >
> > > > [1] https://github.com/bytedance/qemu/tree/vduse-umem
> > >
> > > Great, would you want to post them to the Qemu?
> > >
> >
> > Looks like qemu-storage-daemon would not benefit from this feature
> > which is designed for some hugepage users such as SPDK/DPDK.
>
> Yes, but maybe for testing purposes like here?
>
OK for me.
Thanks,
Yongji
On Mon, Aug 5, 2024 at 8:21 PM Jason Wang <jasowang@redhat.com> wrote:
>
> Barry said [1]:
>
> """
> mm doesn't support non-blockable __GFP_NOFAIL allocation. Because
> __GFP_NOFAIL without direct reclamation may just result in a busy
> loop within non-sleepable contexts.
> ""“
the current code will result in returning a NULL pointer but
not a busy-loop.
static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
struct alloc_context *ac)
{
...
/*
* Make sure that __GFP_NOFAIL request doesn't leak out and make sure
* we always retry
*/
if (gfp_mask & __GFP_NOFAIL) {
/*
* All existing users of the __GFP_NOFAIL are blockable, so warn
* of any new users that actually require GFP_NOWAIT
*/
if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
goto fail;
...
}
...
fail:
warn_alloc(gfp_mask, ac->nodemask,
"page allocation failure: order:%u", order);
got_pg:
return page;
}
We have two choices to address the issue:
1. busy-loop
2. BUG_ON
The patch below chose 2:
https://lore.kernel.org/linux-mm/20240731000155.109583-5-21cnbao@gmail.com/
>
> Unfortuantely, we do that under read lock. A possible way to fix that
> is to move the pages allocation out of the lock into the caller, but
> having to allocate a huge number of pages and auxiliary page array
> seems to be problematic as well per Tetsuon [2]:
>
> """
> You should implement proper error handling instead of using
> __GFP_NOFAIL if count can become large.
> """
>
> So I choose another way, which does not release kernel bounce pages
> when user tries to register usersapce bounce pages. Then we don't need
> to do allocation in the path which is not expected to be fail (e.g in
> the release). We pay this for more memory usage but further
> optimizations could be done on top.
>
> [1] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m3caef86a66ea6318ef94f9976ddb3a0ccfe6fcf8
> [2] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m7ad10eaba48ade5abf2d572f24e185d9fb146480
>
> Fixes: 6c77ed22880d ("vduse: Support using userspace pages as bounce buffer")
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
> drivers/vdpa/vdpa_user/iova_domain.c | 18 ++++++++++--------
> drivers/vdpa/vdpa_user/iova_domain.h | 1 +
> 2 files changed, 11 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
> index 791d38d6284c..933d2f7cd49a 100644
> --- a/drivers/vdpa/vdpa_user/iova_domain.c
> +++ b/drivers/vdpa/vdpa_user/iova_domain.c
> @@ -162,6 +162,7 @@ static void vduse_domain_bounce(struct vduse_iova_domain *domain,
> enum dma_data_direction dir)
> {
> struct vduse_bounce_map *map;
> + struct page *page;
> unsigned int offset;
> void *addr;
> size_t sz;
> @@ -178,7 +179,10 @@ static void vduse_domain_bounce(struct vduse_iova_domain *domain,
> map->orig_phys == INVALID_PHYS_ADDR))
> return;
>
> - addr = kmap_local_page(map->bounce_page);
> + page = domain->user_bounce_pages ?
> + map->user_bounce_page : map->bounce_page;
> +
> + addr = kmap_local_page(page);
> do_bounce(map->orig_phys + offset, addr + offset, sz, dir);
> kunmap_local(addr);
> size -= sz;
> @@ -270,9 +274,8 @@ int vduse_domain_add_user_bounce_pages(struct vduse_iova_domain *domain,
> memcpy_to_page(pages[i], 0,
> page_address(map->bounce_page),
> PAGE_SIZE);
> - __free_page(map->bounce_page);
> }
> - map->bounce_page = pages[i];
> + map->user_bounce_page = pages[i];
> get_page(pages[i]);
> }
> domain->user_bounce_pages = true;
> @@ -297,17 +300,16 @@ void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
> struct page *page = NULL;
>
> map = &domain->bounce_maps[i];
> - if (WARN_ON(!map->bounce_page))
> + if (WARN_ON(!map->user_bounce_page))
> continue;
>
> /* Copy user page to kernel page if it's in use */
> if (map->orig_phys != INVALID_PHYS_ADDR) {
> - page = alloc_page(GFP_ATOMIC | __GFP_NOFAIL);
> + page = map->bounce_page;
> memcpy_from_page(page_address(page),
> - map->bounce_page, 0, PAGE_SIZE);
> + map->user_bounce_page, 0, PAGE_SIZE);
> }
> - put_page(map->bounce_page);
> - map->bounce_page = page;
> + put_page(map->user_bounce_page);
> }
> domain->user_bounce_pages = false;
> out:
> diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
> index f92f22a7267d..7f3f0928ec78 100644
> --- a/drivers/vdpa/vdpa_user/iova_domain.h
> +++ b/drivers/vdpa/vdpa_user/iova_domain.h
> @@ -21,6 +21,7 @@
>
> struct vduse_bounce_map {
> struct page *bounce_page;
> + struct page *user_bounce_page;
> u64 orig_phys;
> };
>
> --
> 2.31.1
>
On Tue, Aug 6, 2024 at 10:30 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Mon, Aug 5, 2024 at 8:21 PM Jason Wang <jasowang@redhat.com> wrote:
> >
> > Barry said [1]:
> >
> > """
> > mm doesn't support non-blockable __GFP_NOFAIL allocation. Because
> > __GFP_NOFAIL without direct reclamation may just result in a busy
> > loop within non-sleepable contexts.
> > ""“
>
> the current code will result in returning a NULL pointer but
> not a busy-loop.
>
> static inline struct page *
> __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> struct alloc_context *ac)
> {
> ...
> /*
> * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
> * we always retry
> */
> if (gfp_mask & __GFP_NOFAIL) {
> /*
> * All existing users of the __GFP_NOFAIL are blockable, so warn
> * of any new users that actually require GFP_NOWAIT
> */
> if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> goto fail;
> ...
> }
> ...
> fail:
> warn_alloc(gfp_mask, ac->nodemask,
> "page allocation failure: order:%u", order);
> got_pg:
> return page;
> }
>
>
> We have two choices to address the issue:
> 1. busy-loop
> 2. BUG_ON
>
> the below patch chose 2:
> https://lore.kernel.org/linux-mm/20240731000155.109583-5-21cnbao@gmail.com/
>
I will add those to V2.
Thanks
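To make the failure mode Barry describes concrete: given the allocator
behaviour quoted above, the pre-patch removal path could dereference a NULL
page despite __GFP_NOFAIL, because a GFP_ATOMIC request cannot direct-reclaim
(a sketch of the pre-patch pattern, taken from the removed lines in the diff):

        /* Pre-patch pattern: GFP_ATOMIC cannot direct-reclaim, so the
         * __GFP_NOFAIL request warns and may still return NULL ... */
        page = alloc_page(GFP_ATOMIC | __GFP_NOFAIL);
        /* ... and the following copy would then dereference a NULL page. */
        memcpy_from_page(page_address(page), map->bounce_page, 0, PAGE_SIZE);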
On Mon, Aug 05, 2024 at 04:21:06PM +0800, Jason Wang wrote:
> Barry said [1]:
>
> """
> mm doesn't support non-blockable __GFP_NOFAIL allocation. Because
> __GFP_NOFAIL without direct reclamation may just result in a busy
> loop within non-sleepable contexts.
> ""“
>
> Unfortuantely, we do that under read lock. A possible way to fix that
> is to move the pages allocation out of the lock into the caller, but
> having to allocate a huge number of pages and auxiliary page array
> seems to be problematic as well per Tetsuon [2]:
>
> """
> You should implement proper error handling instead of using
> __GFP_NOFAIL if count can become large.
> """
>
> So I choose another way, which does not release kernel bounce pages
> when user tries to register usersapce bounce pages. Then we don't need
userspace
> to do allocation in the path which is not expected to be fail (e.g in
> the release). We pay this for more memory usage but further
what does "we pay this for more memory usage" mean?
Do you mean "we pay for this by using more memory"?
How much more?
> optimizations could be done on top.
>
> [1] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m3caef86a66ea6318ef94f9976ddb3a0ccfe6fcf8
> [2] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m7ad10eaba48ade5abf2d572f24e185d9fb146480
>
> Fixes: 6c77ed22880d ("vduse: Support using userspace pages as bounce buffer")
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
> drivers/vdpa/vdpa_user/iova_domain.c | 18 ++++++++++--------
> drivers/vdpa/vdpa_user/iova_domain.h | 1 +
> 2 files changed, 11 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
> index 791d38d6284c..933d2f7cd49a 100644
> --- a/drivers/vdpa/vdpa_user/iova_domain.c
> +++ b/drivers/vdpa/vdpa_user/iova_domain.c
> @@ -162,6 +162,7 @@ static void vduse_domain_bounce(struct vduse_iova_domain *domain,
> enum dma_data_direction dir)
> {
> struct vduse_bounce_map *map;
> + struct page *page;
> unsigned int offset;
> void *addr;
> size_t sz;
> @@ -178,7 +179,10 @@ static void vduse_domain_bounce(struct vduse_iova_domain *domain,
> map->orig_phys == INVALID_PHYS_ADDR))
> return;
>
> - addr = kmap_local_page(map->bounce_page);
> + page = domain->user_bounce_pages ?
> + map->user_bounce_page : map->bounce_page;
> +
> + addr = kmap_local_page(page);
> do_bounce(map->orig_phys + offset, addr + offset, sz, dir);
> kunmap_local(addr);
> size -= sz;
> @@ -270,9 +274,8 @@ int vduse_domain_add_user_bounce_pages(struct vduse_iova_domain *domain,
> memcpy_to_page(pages[i], 0,
> page_address(map->bounce_page),
> PAGE_SIZE);
> - __free_page(map->bounce_page);
> }
> - map->bounce_page = pages[i];
> + map->user_bounce_page = pages[i];
> get_page(pages[i]);
> }
> domain->user_bounce_pages = true;
> @@ -297,17 +300,16 @@ void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
> struct page *page = NULL;
>
> map = &domain->bounce_maps[i];
> - if (WARN_ON(!map->bounce_page))
> + if (WARN_ON(!map->user_bounce_page))
> continue;
>
> /* Copy user page to kernel page if it's in use */
> if (map->orig_phys != INVALID_PHYS_ADDR) {
> - page = alloc_page(GFP_ATOMIC | __GFP_NOFAIL);
> + page = map->bounce_page;
> memcpy_from_page(page_address(page),
> - map->bounce_page, 0, PAGE_SIZE);
> + map->user_bounce_page, 0, PAGE_SIZE);
> }
> - put_page(map->bounce_page);
> - map->bounce_page = page;
> + put_page(map->user_bounce_page);
> }
> domain->user_bounce_pages = false;
> out:
> diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
> index f92f22a7267d..7f3f0928ec78 100644
> --- a/drivers/vdpa/vdpa_user/iova_domain.h
> +++ b/drivers/vdpa/vdpa_user/iova_domain.h
> @@ -21,6 +21,7 @@
>
> struct vduse_bounce_map {
> struct page *bounce_page;
> + struct page *user_bounce_page;
> u64 orig_phys;
> };
>
> --
> 2.31.1
On Mon, Aug 5, 2024 at 4:26 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Mon, Aug 05, 2024 at 04:21:06PM +0800, Jason Wang wrote:
> > So I choose another way, which does not release kernel bounce pages
> > when user tries to register usersapce bounce pages. Then we don't need
>
> userspace
>
> > to do allocation in the path which is not expected to be fail (e.g in
> > the release). We pay this for more memory usage but further
>
> what does "we pay this for more memory usage" mean?
> Do you mean "we pay for this by using more memory"?

Yes.

> How much more?

Depends on the workload, but basically, it's just the maximum size of
the bounce buffer:

Default size 64M:

#define VDUSE_BOUNCE_SIZE (64 * 1024 * 1024)

Maximum 1G:

#define VDUSE_MAX_BOUNCE_SIZE (1024 * 1024 * 1024)

Thanks
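For scale, a back-of-the-envelope check of that cost (assuming 4 KiB pages;
the sizes are the defines quoted above):

        #include <stdio.h>

        int main(void)
        {
                const unsigned long page_size = 4096;
                const unsigned long bounce_default = 64UL << 20; /* VDUSE_BOUNCE_SIZE */
                const unsigned long bounce_max = 1UL << 30;      /* VDUSE_MAX_BOUNCE_SIZE */

                /* Extra kernel pages kept alive while user pages are registered. */
                printf("default: %lu pages (64 MiB)\n", bounce_default / page_size);
                printf("maximum: %lu pages (1 GiB)\n", bounce_max / page_size);
                return 0;
        }

which works out to 16384 extra kernel pages kept around by default, and up to
262144 pages at the 1 GiB maximum.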
On Mon, Aug 5, 2024 at 4:21 PM Jason Wang <jasowang@redhat.com> wrote:
>
> Barry said [1]:
>
> """
> mm doesn't support non-blockable __GFP_NOFAIL allocation. Because
> __GFP_NOFAIL without direct reclamation may just result in a busy
> loop within non-sleepable contexts.
> ""“
>
> Unfortuantely, we do that under read lock. A possible way to fix that
> is to move the pages allocation out of the lock into the caller, but
> having to allocate a huge number of pages and auxiliary page array
> seems to be problematic as well per Tetsuon [2]:
>
> """
> You should implement proper error handling instead of using
> __GFP_NOFAIL if count can become large.
> """
>
> So I choose another way, which does not release kernel bounce pages
> when user tries to register usersapce bounce pages. Then we don't need
> to do allocation in the path which is not expected to be fail (e.g in
> the release). We pay this for more memory usage but further
> optimizations could be done on top.
>
> [1] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m3caef86a66ea6318ef94f9976ddb3a0ccfe6fcf8
> [2] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m7ad10eaba48ade5abf2d572f24e185d9fb146480
>
> Fixes: 6c77ed22880d ("vduse: Support using userspace pages as bounce buffer")
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
Note for YongJi:
I can only test it without userspace bounce pages, as neither DPDK nor
libvduse users use that. Would you be willing to review and test this?
Thanks
On Mon, Aug 5, 2024 at 4:24 PM Jason Wang <jasowang@redhat.com> wrote:
>
> On Mon, Aug 5, 2024 at 4:21 PM Jason Wang <jasowang@redhat.com> wrote:
> >
> > Barry said [1]:
> >
> > """
> > mm doesn't support non-blockable __GFP_NOFAIL allocation. Because
> > __GFP_NOFAIL without direct reclamation may just result in a busy
> > loop within non-sleepable contexts.
> > ""“
> >
> > Unfortuantely, we do that under read lock. A possible way to fix that
> > is to move the pages allocation out of the lock into the caller, but
> > having to allocate a huge number of pages and auxiliary page array
> > seems to be problematic as well per Tetsuon [2]:
> >
> > """
> > You should implement proper error handling instead of using
> > __GFP_NOFAIL if count can become large.
> > """
> >
I think the problem is it's hard to do the error handling in
fops->release() currently.
So can we temporarily hold the user page refcount, and release it when
vduse_dev_open()/vduse_domain_release() is executed? The kernel page
allocation and memcpy can be done in vduse_dev_open(), which allows
some error handling.
> > So I choose another way, which does not release kernel bounce pages
> > when user tries to register usersapce bounce pages. Then we don't need
> > to do allocation in the path which is not expected to be fail (e.g in
> > the release). We pay this for more memory usage but further
> > optimizations could be done on top.
> >
> > [1] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m3caef86a66ea6318ef94f9976ddb3a0ccfe6fcf8
> > [2] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m7ad10eaba48ade5abf2d572f24e185d9fb146480
> >
> > Fixes: 6c77ed22880d ("vduse: Support using userspace pages as bounce buffer")
> > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > ---
>
> Note for YongJi:
>
> I can only test it without usersapce bounce pages as neither DPDK nor
> libvduse users use that. Would you want to review and have a test for
> this?
>
I can do some tests for it.
Thanks,
Yongji
On Mon, Aug 5, 2024 at 6:42 PM Yongji Xie <xieyongji@bytedance.com> wrote:
>
> On Mon, Aug 5, 2024 at 4:24 PM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Mon, Aug 5, 2024 at 4:21 PM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > Barry said [1]:
> > >
> > > """
> > > mm doesn't support non-blockable __GFP_NOFAIL allocation. Because
> > > __GFP_NOFAIL without direct reclamation may just result in a busy
> > > loop within non-sleepable contexts.
> > > ""“
> > >
> > > Unfortuantely, we do that under read lock. A possible way to fix that
> > > is to move the pages allocation out of the lock into the caller, but
> > > having to allocate a huge number of pages and auxiliary page array
> > > seems to be problematic as well per Tetsuon [2]:
> > >
> > > """
> > > You should implement proper error handling instead of using
> > > __GFP_NOFAIL if count can become large.
> > > """
> > >
>
> I think the problem is it's hard to do the error handling in
> fops->release() currently.
vduse_dev_dereg_umem() should be the same, it's very hard to allow it to fail.
>
> So can we temporarily hold the user page refcount, and release it when
> vduse_dev_open()/vduse_domain_release() is executed. The kernel page
> allocation and memcpy can be done in vduse_dev_open() which allows
> some error handling.
Just to make sure I understand this, the free is probably not the big
issue but the allocation itself.
And if we do the memcpy() in open(), it seems to be a subtle userspace
noticeable change? (Or I don't get how copying in vduse_dev_open() can
help here).
>
> > > So I choose another way, which does not release kernel bounce pages
> > > when user tries to register usersapce bounce pages. Then we don't need
> > > to do allocation in the path which is not expected to be fail (e.g in
> > > the release). We pay this for more memory usage but further
> > > optimizations could be done on top.
> > >
> > > [1] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m3caef86a66ea6318ef94f9976ddb3a0ccfe6fcf8
> > > [2] https://lore.kernel.org/all/CACGkMEtcOJAA96SF9B8m-nZ1X04-XZr+nq8ZQ2saLnUdfOGOLg@mail.gmail.com/T/#m7ad10eaba48ade5abf2d572f24e185d9fb146480
> > >
> > > Fixes: 6c77ed22880d ("vduse: Support using userspace pages as bounce buffer")
> > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > ---
> >
> > Note for YongJi:
> >
> > I can only test it without usersapce bounce pages as neither DPDK nor
> > libvduse users use that. Would you want to review and have a test for
> > this?
> >
>
> I can do some tests for it.
Great.
>
> Thanks,
> Yongji
>
Thanks
On Tue, Aug 6, 2024 at 10:28 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Mon, Aug 5, 2024 at 6:42 PM Yongji Xie <xieyongji@bytedance.com> wrote:
> >
> > I think the problem is it's hard to do the error handling in
> > fops->release() currently.
>
> vduse_dev_dereg_umem() should be the same, it's very hard to allow it to fail.
>
> > So can we temporarily hold the user page refcount, and release it when
> > vduse_dev_open()/vduse_domain_release() is executed. The kernel page
> > allocation and memcpy can be done in vduse_dev_open() which allows
> > some error handling.
>
> Just to make sure I understand this, the free is probably not the big
> issue but the allocation itself.

Yes, so defer the allocation might be a solution.

> And if we do the memcpy() in open(), it seems to be a subtle userspace
> noticeable change? (Or I don't get how copying in vduse_dev_open() can
> help here).

Maybe we don't need to do the copy in open(). We can hold the user
page refcount until the inflight I/O is completed. That means the
allocation of new kernel pages can be done in
vduse_domain_map_bounce_page() and the release of old user pages can
be done in vduse_domain_unmap_bounce_page(). Of course, we still have
a copy (old user page -> new user page) if the daemon calls
vduse_dev_reg_umem() again.

Thanks,
Yongji
On Tue, Aug 6, 2024 at 11:10 AM Yongji Xie <xieyongji@bytedance.com> wrote:
>
> On Tue, Aug 6, 2024 at 10:28 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > Just to make sure I understand this, the free is probably not the big
> > issue but the allocation itself.
>
> Yes, so defer the allocation might be a solution.

Would you mind posting a patch for this?

> > And if we do the memcpy() in open(), it seems to be a subtle userspace
> > noticeable change? (Or I don't get how copying in vduse_dev_open() can
> > help here).
>
> Maybe we don't need to do the copy in open(). We can hold the user
> page refcount until the inflight I/O is completed. That means the
> allocation of new kernel pages can be done in
> vduse_domain_map_bounce_page() and the release of old user pages can
> be done in vduse_domain_unmap_bounce_page().

This seems to be a subtle userspace noticeable behaviour?

> Of course, we still have
> a copy (old user page -> new user page) if the daemon calls
> vduse_dev_reg_umem() again.
>
> Thanks,
> Yongji

Thanks
On Wed, Aug 7, 2024 at 10:39 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Tue, Aug 6, 2024 at 11:10 AM Yongji Xie <xieyongji@bytedance.com> wrote:
> >
> > Maybe we don't need to do the copy in open(). We can hold the user
> > page refcount until the inflight I/O is completed. That means the
> > allocation of new kernel pages can be done in
> > vduse_domain_map_bounce_page() and the release of old user pages can
> > be done in vduse_domain_unmap_bounce_page().
>
> This seems to be a subtle userspace noticeable behaviour?

Yes, userspace needs to ensure that it does not reuse the old user
pages for other purposes before vduse_dev_dereg_umem() returns
successfully. vduse_dev_dereg_umem() will only return successfully
when there is no inflight I/O, which means we don't need to allocate
extra kernel pages to store data. If we can't accept this, then your
current patch might be the most suitable. I will test this patch
first.

Thanks,
Yongji
On Wed, Aug 7, 2024 at 11:13 AM Yongji Xie <xieyongji@bytedance.com> wrote:
>
> On Wed, Aug 7, 2024 at 10:39 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > This seems to be a subtle userspace noticeable behaviour?
>
> Yes, userspace needs to ensure that it does not reuse the old user
> pages for other purposes before vduse_dev_dereg_umem() returns
> successfully. vduse_dev_dereg_umem() will only return successfully
> when there is no inflight I/O, which means we don't need to allocate
> extra kernel pages to store data. If we can't accept this, then your
> current patch might be the most suitable.

It might be better to not break.

Actually, during my testing, the read_lock in the do_bounce path slows
down the performance. Removing the read_lock or using rcu_read_lock()
gives a 20% improvement in PPS. I do want to get rid of it (e.g. by
moving the copying somewhere else), but that seems to hit exactly the
"issue" here, since it introduces some behaviour change...

> I will test this patch
> first.

Great.

Thanks

> Thanks,
> Yongji
On Wed, Aug 7, 2024 at 12:38 PM Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, Aug 7, 2024 at 11:13 AM Yongji Xie <xieyongji@bytedance.com> wrote:
> >
> > Yes, userspace needs to ensure that it does not reuse the old user
> > pages for other purposes before vduse_dev_dereg_umem() returns
> > successfully. vduse_dev_dereg_umem() will only return successfully
> > when there is no inflight I/O, which means we don't need to allocate
> > extra kernel pages to store data. If we can't accept this, then your
> > current patch might be the most suitable.
>
> It might be better to not break.
>
> Actually, during my testing, the read_lock in the do_bounce path slows
> down the performance. Removing the read_lock or using rcu_read_lock()
> gives a 20% improvement in PPS.

Looks like rcu_read_lock() should be OK here.

Thanks,
Yongji
On Wed, Aug 7, 2024 at 2:54 PM Yongji Xie <xieyongji@bytedance.com> wrote:
>
> On Wed, Aug 7, 2024 at 12:38 PM Jason Wang <jasowang@redhat.com> wrote:
> >
> > It might be better to not break.
> >
> > Actually, during my testing, the read_lock in the do_bounce path slows
> > down the performance. Removing the read_lock or using rcu_read_lock()
> > gives a 20% improvement in PPS.
>
> Looks like rcu_read_lock() should be OK here.

The tricky part is that we may still end up with behaviour changes (or
lose some of the synchronization between the kernel and bounce pages):

RCU allows the reader to run in parallel with the writer. So bouncing
could be done in parallel with vduse_domain_add_user_bounce_pages(),
and there would be a race between the two memcpy()s.

Thanks

> Thanks,
> Yongji
On Thu, Aug 8, 2024 at 1:50 PM Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, Aug 7, 2024 at 2:54 PM Yongji Xie <xieyongji@bytedance.com> wrote:
> >
> > [...]
> >
> > Looks like rcu_read_lock() should be OK here.
>
> The tricky part is that we may still end up with behaviour changes (or
> lose some of the synchronization between the kernel and bounce pages):
>
> RCU allows the read side to run in parallel with the writer. So bouncing
> could be done in parallel with vduse_domain_add_user_bounce_pages(), and
> the two memcpy operations would race.
>

Hmm... this is a problem. We may still need some userspace-noticeable
behaviour change, e.g. only allowing reg_umem/dereg_umem when the device
is not started.

Thanks,
Yongji
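A rough sketch of what that restriction could look like, assuming the
device's virtio status byte is mirrored in dev->status and the caller
holds the appropriate device lock; the helper name is illustrative:

/*
 * Sketch only: reject umem (de)registration once the driver has set
 * DRIVER_OK, so the bounce pages cannot be swapped while I/O may be
 * inflight. Assumes dev->status tracks the virtio status byte.
 */
static int vduse_dev_umem_allowed(struct vduse_dev *dev)
{
        if (dev->status & VIRTIO_CONFIG_S_DRIVER_OK)
                return -EBUSY;  /* device already started */
        return 0;
}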
On Thu, Aug 8, 2024 at 7:09 PM Yongji Xie <xieyongji@bytedance.com> wrote:
>
> On Thu, Aug 8, 2024 at 1:50 PM Jason Wang <jasowang@redhat.com> wrote:
> >
> > [...]
> >
> > RCU allows the read side to run in parallel with the writer. So bouncing
> > could be done in parallel with vduse_domain_add_user_bounce_pages(), and
> > the two memcpy operations would race.
> >
>
> Hmm... this is a problem. We may still need some userspace-noticeable
> behaviour change, e.g. only allowing reg_umem/dereg_umem when the device
> is not started.

Exactly, maybe add a new userspace flag for this.

Thanks

>
> Thanks,
> Yongji
>
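Purely as an illustration of that idea, a hypothetical opt-in flag might
look like the following; the flag name and the api_flags field are invented
here and are not part of the existing VDUSE UAPI:

/*
 * Hypothetical flag, not in the current VDUSE UAPI: userspace opts in to
 * the stricter semantics where umem can only be (de)registered while the
 * device is stopped; old binaries that don't set it keep today's behaviour.
 */
#define VDUSE_DEV_FLAG_UMEM_STOPPED_ONLY        (1ULL << 0)

static bool vduse_umem_needs_stopped_dev(struct vduse_dev *dev)
{
        /* dev->api_flags is assumed to hold flags negotiated at create time */
        return dev->api_flags & VDUSE_DEV_FLAG_UMEM_STOPPED_ONLY;
}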