[PATCH 1/6] workqueue: Inherit NOIO and NOFS alloc flags

Håkon Bugge posted 6 patches 1 year, 9 months ago
There is a newer version of this series
[PATCH 1/6] workqueue: Inherit NOIO and NOFS alloc flags
Posted by Håkon Bugge 1 year, 9 months ago
For drivers/modules running inside a
memalloc_{noio,nofs}_{save,restore} region, if a work-queue is
created, we make sure work executed on the work-queue inherits the
same flag(s).

This in order to conditionally enable drivers to work aligned with
block I/O devices. This commit makes sure that any work queued later
on work-queues created during module initialization, when current's
flags has PF_MEMALLOC_{NOIO,NOFS} set, will inherit the same flags.

We do this in order to enable drivers to be used as a network block
I/O device. This in order to support XFS or other file-systems on top
of a raw block device which uses said drivers as the network transport
layer.

Under intense memory pressure, we get memory reclaims. Assume the
file-system reclaims memory, goes to the raw block device, which calls
into said drivers. Now, if regular GFP_KERNEL allocations in the
drivers require reclaims to be fulfilled, we end up in a circular
dependency.

We break this circular dependency by:

1. Force all allocations in the drivers to use GFP_NOIO, by means of a
   parenthetic use of memalloc_noio_{save,restore} on all relevant
   entry points.

2. Make sure work-queues inherits current->flags
   wrt. PF_MEMALLOC_{NOIO,NOFS}, such that work executed on the
   work-queue inherits the same flag(s). That is what this commit
   contributes with.

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
---
 include/linux/workqueue.h |  2 ++
 kernel/workqueue.c        | 17 +++++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 158784dd189ab..09ecc692ffcae 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -398,6 +398,8 @@ enum wq_flags {
 	__WQ_DRAINING		= 1 << 16, /* internal: workqueue is draining */
 	__WQ_ORDERED		= 1 << 17, /* internal: workqueue is ordered */
 	__WQ_LEGACY		= 1 << 18, /* internal: create*_workqueue() */
+	__WQ_NOIO               = 1 << 19, /* internal: execute work with NOIO */
+	__WQ_NOFS               = 1 << 20, /* internal: execute work with NOFS */
 
 	/* BH wq only allows the following flags */
 	__WQ_BH_ALLOWS		= WQ_BH | WQ_HIGHPRI,
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index d2dbe099286b9..a1d166a7c0f85 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -51,6 +51,7 @@
 #include <linux/uaccess.h>
 #include <linux/sched/isolation.h>
 #include <linux/sched/debug.h>
+#include <linux/sched/mm.h>
 #include <linux/nmi.h>
 #include <linux/kvm_para.h>
 #include <linux/delay.h>
@@ -3172,6 +3173,10 @@ __acquires(&pool->lock)
 	unsigned long work_data;
 	int lockdep_start_depth, rcu_start_depth;
 	bool bh_draining = pool->flags & POOL_BH_DRAINING;
+	bool use_noio_allocs = pwq->wq->flags & __WQ_NOIO;
+	bool use_nofs_allocs = pwq->wq->flags & __WQ_NOFS;
+	unsigned long noio_flags;
+	unsigned long nofs_flags;
 #ifdef CONFIG_LOCKDEP
 	/*
 	 * It is permissible to free the struct work_struct from
@@ -3184,6 +3189,12 @@ __acquires(&pool->lock)
 
 	lockdep_copy_map(&lockdep_map, &work->lockdep_map);
 #endif
+	/* Set inherited alloc flags */
+	if (use_noio_allocs)
+		noio_flags = memalloc_noio_save();
+	if (use_nofs_allocs)
+		nofs_flags = memalloc_nofs_save();
+
 	/* ensure we're on the correct CPU */
 	WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
 		     raw_smp_processor_id() != pool->cpu);
@@ -3320,6 +3331,12 @@ __acquires(&pool->lock)
 
 	/* must be the last step, see the function comment */
 	pwq_dec_nr_in_flight(pwq, work_data);
+
+	/* Restore alloc flags */
+	if (use_nofs_allocs)
+		memalloc_nofs_restore(nofs_flags);
+	if (use_noio_allocs)
+		memalloc_noio_restore(noio_flags);
 }
 
 /**
-- 
2.39.3

Re: [PATCH 1/6] workqueue: Inherit NOIO and NOFS alloc flags
Posted by Tejun Heo 1 year, 9 months ago
Hello,

On Mon, May 13, 2024 at 02:53:41PM +0200, Håkon Bugge wrote:
> diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
> index 158784dd189ab..09ecc692ffcae 100644
> --- a/include/linux/workqueue.h
> +++ b/include/linux/workqueue.h
> @@ -398,6 +398,8 @@ enum wq_flags {
>  	__WQ_DRAINING		= 1 << 16, /* internal: workqueue is draining */
>  	__WQ_ORDERED		= 1 << 17, /* internal: workqueue is ordered */
>  	__WQ_LEGACY		= 1 << 18, /* internal: create*_workqueue() */
> +	__WQ_NOIO               = 1 << 19, /* internal: execute work with NOIO */
> +	__WQ_NOFS               = 1 << 20, /* internal: execute work with NOFS */

I don't quite understand how this is supposed to be used. The flags are
marked internal but nothing actually sets them. Looking at later patches, I
don't see any usages either. What am I missing?

> @@ -3184,6 +3189,12 @@ __acquires(&pool->lock)
>  
>  	lockdep_copy_map(&lockdep_map, &work->lockdep_map);
>  #endif
> +	/* Set inherited alloc flags */
> +	if (use_noio_allocs)
> +		noio_flags = memalloc_noio_save();
> +	if (use_nofs_allocs)
> +		nofs_flags = memalloc_nofs_save();
> +
>  	/* ensure we're on the correct CPU */
>  	WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
>  		     raw_smp_processor_id() != pool->cpu);
> @@ -3320,6 +3331,12 @@ __acquires(&pool->lock)
>  
>  	/* must be the last step, see the function comment */
>  	pwq_dec_nr_in_flight(pwq, work_data);
> +
> +	/* Restore alloc flags */
> +	if (use_nofs_allocs)
> +		memalloc_nofs_restore(nofs_flags);
> +	if (use_noio_allocs)
> +		memalloc_noio_restore(noio_flags);

Also, this looks like something that the work function can do on entry and
before exit, no?

Thanks.

-- 
tejun
Re: [PATCH 1/6] workqueue: Inherit NOIO and NOFS alloc flags
Posted by Haakon Bugge 1 year, 9 months ago
> Hello,
> 
> On Mon, May 13, 2024 at 02:53:41PM +0200, Håkon Bugge wrote:
> diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
> index 158784dd189ab..09ecc692ffcae 100644
> --- a/include/linux/workqueue.h
> +++ b/include/linux/workqueue.h
> @@ -398,6 +398,8 @@ enum wq_flags {
> 	__WQ_DRAINING		= 1 << 16, /* internal: workqueue is draining */
> 	__WQ_ORDERED		= 1 << 17, /* internal: workqueue is ordered */
> 	__WQ_LEGACY		= 1 << 18, /* internal: create*_workqueue() */
> +	__WQ_NOIO               = 1 << 19, /* internal: execute work with NOIO */
> +	__WQ_NOFS               = 1 << 20, /* internal: execute work with NOFS */
> 
> I don't quite understand how this is supposed to be used. The flags are
> marked internal but nothing actually sets them. Looking at later patches, I
> don't see any usages either. What am I missing?

Hi Tejun,


Thank you for so quickly looking into this. You are quite right. During re-basing, I missed a hunk from alloc_workqueue(), where I sample current->flags for PF_MALLOC_{NOIO,NOFS} bits and sets the corresponding bits in wq->flags. Fixed in v2.

> @@ -3184,6 +3189,12 @@ __acquires(&pool->lock)
> 
> 	lockdep_copy_map(&lockdep_map, &work->lockdep_map);
> #endif
> +	/* Set inherited alloc flags */
> +	if (use_noio_allocs)
> +		noio_flags = memalloc_noio_save();
> +	if (use_nofs_allocs)
> +		nofs_flags = memalloc_nofs_save();
> +
> 	/* ensure we're on the correct CPU */
> 	WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
> 		     raw_smp_processor_id() != pool->cpu);
> @@ -3320,6 +3331,12 @@ __acquires(&pool->lock)
> 
> 	/* must be the last step, see the function comment */
> 	pwq_dec_nr_in_flight(pwq, work_data);
> +
> +	/* Restore alloc flags */
> +	if (use_nofs_allocs)
> +		memalloc_nofs_restore(nofs_flags);
> +	if (use_noio_allocs)
> +		memalloc_noio_restore(noio_flags);
> 
> Also, this looks like something that the work function can do on entry and
> before exit, no?

It _can_ be done in the work functions, but that will be a code sprawl. Only in RDS, we have the following worker functions:

rds_ib_odp_mr_worker();
rds_ib_mr_pool_flush_worker()
rds_ib_odp_mr_worker()
rds_tcp_accept_worker()
rds_connect_worker()
rds_send_worker()
rds_recv_worker()
rds_shutdown_worker()

adding the ones from ib_cm, rdma_cm, mlx5_ib, and mlx5_core, I strongly prefer to have it in one place.


Thxs, Håkon

Re: [PATCH 1/6] workqueue: Inherit NOIO and NOFS alloc flags
Posted by Tejun Heo 1 year, 9 months ago
Hello,

On Tue, May 14, 2024 at 01:48:24PM +0000, Haakon Bugge wrote:
> > Also, this looks like something that the work function can do on entry and
> > before exit, no?
> 
> It _can_ be done in the work functions, but that will be a code sprawl.
> Only in RDS, we have the following worker functions:
> 
> rds_ib_odp_mr_worker();
> rds_ib_mr_pool_flush_worker()
> rds_ib_odp_mr_worker()
> rds_tcp_accept_worker()
> rds_connect_worker()
> rds_send_worker()
> rds_recv_worker()
> rds_shutdown_worker()
> 
> adding the ones from ib_cm, rdma_cm, mlx5_ib, and mlx5_core, I strongly
> prefer to have it in one place.

I haven't seen the code yet, so can't tell for sure but if you're
automatically inherting these flags from the scheduling site, I don't think
that's gonna work. Note that getting a different, more permissive,
allocation context is one of reasons why one might want to use workqueues,
so it'd have to be explicit whether it's in workqueue or in its users.

Thanks.

-- 
tejun
Re: [PATCH 1/6] workqueue: Inherit NOIO and NOFS alloc flags
Posted by Haakon Bugge 1 year, 9 months ago
Hi Tejun,


> On 14 May 2024, at 18:49, Tejun Heo <tj@kernel.org> wrote:
> 
> Hello,
> 
> On Tue, May 14, 2024 at 01:48:24PM +0000, Haakon Bugge wrote:
> 
> I haven't seen the code yet, so can't tell for sure but if you're
> automatically inherting these flags from the scheduling site, I don't think
> that's gonna work. Note that getting a different, more permissive,
> allocation context is one of reasons why one might want to use workqueues,
> so it'd have to be explicit whether it's in workqueue or in its users.

I wanted to hold off sending the v2 if it came in more comments, but I have sent it now.


Thxs, Håkon