workqueue: introduce queue_delayed_work_on_offline_safe

[PATCH v3] workqueue: introduce queue_delayed_work_on_offline_safe

Posted by Imran Khan 1 year ago

Currently, users of queue_delayed_work_on() need to ensure
that the specified CPU is and remains online. Failure to
do so may result in the delayed_work being queued on an
offlined CPU and hence never being executed.

The current users of queue_delayed_work_on() seem to meet
this requirement, but any users, whether unknown existing
ones or new ones, that can't conform to it need another
interface.

So introduce queue_delayed_work_on_offline_safe, which
is a wrapper around queue_delayed_work_on to ensure that
the specified cpu is and remains online.

Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
Acked-by: Haakon Bugge <haakon.bugge@oracle.com>
---
v2 --> v3:
  - Corrected a couple of typos spotted by Haakon
  - Collected Acked-by, from Haakon

v1 --> v2:
  - Remove RFC tag
  - For cases where dwork can't be put on specified CPU,
    let caller decide the next CPU to try with.

 include/linux/workqueue.h |  3 +++
 kernel/workqueue.c        | 42 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 45 insertions(+)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index b0dc957c3e560..cefcf9e89be6f 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -589,6 +589,9 @@ extern bool queue_work_node(int node, struct workqueue_struct *wq,
 			    struct work_struct *work);
 extern bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
 			struct delayed_work *work, unsigned long delay);
+extern bool queue_delayed_work_on_offline_safe(int cpu,
+			struct workqueue_struct *wq, struct delayed_work *work,
+			unsigned long delay, bool *online);
 extern bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq,
 			struct delayed_work *dwork, unsigned long delay);
 extern bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9362484a653c4..b3c030e6c6b17 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2565,6 +2565,48 @@ bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
 }
 EXPORT_SYMBOL(queue_delayed_work_on);
 
+/**
+ * queue_delayed_work_on_offline_safe - queue work on specific online CPU after
+ *					delay
+ *
+ * @cpu: CPU number to execute work on
+ * @wq: workqueue to use
+ * @dwork: work to queue
+ * @delay: number of jiffies to wait before queueing
+ * @online: online status of @cpu, for caller
+ *
+ * A wrapper around queue_delayed_work_on() that checks that the specified
+ * @cpu is online. If @cpu is found to be offline, or if its online status
+ * can't be reliably determined, set @online to false and return false,
+ * leaving the choice of a new CPU for the delayed work to the
+ * caller.
+ *
+ * If the caller sees @online as false, it can try submitting the work on a
+ * different CPU. If it sees @online as true, it can check the return
+ * value to determine whether the work was actually submitted.
+ */
+bool queue_delayed_work_on_offline_safe(int cpu, struct workqueue_struct *wq,
+			   struct delayed_work *dwork, unsigned long delay,
+			   bool *online)
+{
+	bool ret = false;
+	int locked = cpus_read_trylock();
+
+	if (locked && cpu_online(cpu)) {
+		ret = queue_delayed_work_on(cpu, wq, dwork, delay);
+		*online = true;
+	} else {
+		*online = false;
+	}
+
+	if (locked)
+		cpus_read_unlock();
+
+	return ret;
+}
+EXPORT_SYMBOL(queue_delayed_work_on_offline_safe);
+
+
 /**
  * mod_delayed_work_on - modify delay of or queue a delayed work on specific CPU
  * @cpu: CPU number to execute work on

base-commit: 5bc55a333a2f7316b58edc7573e8e893f7acb532
-- 
2.34.1
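
Editor's sketch: the caller-side pattern described in the kernel-doc comment above (check @online, then pick another CPU) can be modelled in plain userspace C. Everything prefixed with model_ is hypothetical and only stands in for kernel state; the round-robin fallback policy is an assumption made for illustration, since the patch deliberately leaves that choice to the caller.

```c
/* Userspace model only: none of these are the kernel's real definitions. */
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4

/* Stand-in for the kernel's online mask; index = CPU number. */
static bool model_cpu_online[MODEL_NR_CPUS] = { true, false, true, true };

/* When true, the modelled cpus_read_trylock() fails. */
static bool model_hotplug_write_locked;

/* CPU the single modelled work item is pending on, or -1 if not pending. */
static int model_queued_on = -1;

/* Mirrors the control flow of the wrapper in the patch (unlock omitted). */
static bool model_queue_offline_safe(int cpu, bool *online)
{
	bool ret = false;
	bool locked = !model_hotplug_write_locked;

	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);

	if (locked && model_cpu_online[cpu]) {
		/* Like queue_delayed_work_on(): false if already pending. */
		ret = (model_queued_on < 0);
		if (ret)
			model_queued_on = cpu;
		*online = true;
	} else {
		*online = false;
	}
	return ret;
}

/*
 * Hypothetical caller policy: start at @preferred and walk the CPUs in
 * order until one is reported online. Returns the CPU used, or -1.
 */
static int model_queue_with_fallback(int preferred)
{
	for (int i = 0; i < MODEL_NR_CPUS; i++) {
		int cpu = (preferred + i) % MODEL_NR_CPUS;
		bool online;

		if (model_queue_offline_safe(cpu, &online))
			return cpu;
		if (online)
			return -1;	/* CPU was online, work already pending */
	}
	return -1;			/* no CPU could be confirmed online */
}
```

In this model, a false return with @online set to true means the work was already pending, so the caller stops instead of trying further CPUs. A failed trylock is reported the same way as an offline CPU, which matches the "can't be reliably determined" case in the comment.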

Re: [PATCH v3] workqueue: introduce queue_delayed_work_on_offline_safe

Posted by Tejun Heo 1 year ago

On Tue, Feb 04, 2025 at 10:36:35PM +1100, Imran Khan wrote:
> Currently, users of queue_delayed_work_on() need to ensure
> that the specified CPU is and remains online. Failure to
> do so may result in the delayed_work being queued on an
> offlined CPU and hence never being executed.
> 
> The current users of queue_delayed_work_on() seem to meet
> this requirement, but any users, whether unknown existing
> ones or new ones, that can't conform to it need another
> interface.
> 
> So introduce queue_delayed_work_on_offline_safe, which
> is a wrapper around queue_delayed_work_on to ensure that
> the specified cpu is and remains online.
> 
> Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
> Acked-by: Haakon Bugge <haakon.bugge@oracle.com>

So, idk, do we really need this? Can't we just add a debug warning which
triggers when CPU goes down with delayed works queued on it?

Thanks.

-- 
tejun

Re: [PATCH v3] workqueue: introduce queue_delayed_work_on_offline_safe

Posted by imran.f.khan@oracle.com 1 year ago

Hello Tejun,
Thanks for getting back on this.
On 5/2/2025 6:17 am, Tejun Heo wrote:
> On Tue, Feb 04, 2025 at 10:36:35PM +1100, Imran Khan wrote:
>> Currently, users of queue_delayed_work_on() need to ensure
>> that the specified CPU is and remains online. Failure to
>> do so may result in the delayed_work being queued on an
>> offlined CPU and hence never being executed.
>>
>> The current users of queue_delayed_work_on() seem to meet
>> this requirement, but any users, whether unknown existing
>> ones or new ones, that can't conform to it need another
>> interface.
>>
>> So introduce queue_delayed_work_on_offline_safe, which
>> is a wrapper around queue_delayed_work_on to ensure that
>> the specified cpu is and remains online.
>>
>> Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
>> Acked-by: Haakon Bugge <haakon.bugge@oracle.com>
> 
> So, idk, do we really need this? Can't we just add a debug warning which
> triggers when CPU goes down with delayed works queued on it?
> 
Actually, we are fine in cases where a CPU goes offline with delayed
works queued on it, because the associated timers will migrate to another
CPU (BP).
The problem is with cases where a CPU is already offline, or is in the
middle of being offlined and is past the timers_dead_cpu callback. If
someone puts a delayed work on this CPU in that state, we have a problem.
The WARN_ON_ONCE in [1] can indicate this, but the dwork's timer would
still end up on the offlined CPU and would not be migrated (since the CPU
was past the timer_dead state when the dwork was queued).

One way to avoid this is to ask callers to take the necessary steps (hotplug
lock, hotplug callbacks) and ensure that the dwork does not end up on such an
offlined CPU. The other way (as attempted in this patch) would be to give such
users an interface that ensures the dwork never ends up on an offlined CPU.

Thanks,
Imran

[1]: https://lore.kernel.org/all/20250109232711.2081259-1-imran.f.khan@oracle.com/
> Thanks.
>
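
Editor's sketch: the window described in the message above can be illustrated with a small userspace model. It assumes, as the message states, that timers are moved off a dying CPU once (at the timers_dead_cpu step) and that nothing moves timers armed on that CPU afterwards. The struct and the model_ functions are hypothetical and are not kernel APIs.

```c
/* Userspace model of the migration window; not kernel code. */
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 2

struct model_cpu {
	bool timers_dead;	/* the one-shot migration step already ran */
	int nr_timers;		/* timers currently armed on this CPU */
};

static struct model_cpu model_cpus[MODEL_NR_CPUS];

/* Arm one timer on @cpu, as queueing a delayed work item would. */
static void model_arm_timer(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	model_cpus[cpu].nr_timers++;
}

/* One-shot migration: move every timer from @dying to @target. */
static void model_timers_dead_cpu(int dying, int target)
{
	model_cpus[target].nr_timers += model_cpus[dying].nr_timers;
	model_cpus[dying].nr_timers = 0;
	model_cpus[dying].timers_dead = true;
}

/* Timers armed on a CPU after its migration step will never fire. */
static int model_stranded_timers(int cpu)
{
	return model_cpus[cpu].timers_dead ? model_cpus[cpu].nr_timers : 0;
}
```

A timer armed on CPU 1 before model_timers_dead_cpu(1, 0) ends up on CPU 0. A timer armed on CPU 1 after that call is counted as stranded, which is the case the proposed interface is meant to prevent.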

Re: [PATCH v3] workqueue: introduce queue_delayed_work_on_offline_safe

Posted by Tejun Heo 1 year ago

On Wed, Feb 05, 2025 at 11:54:20AM +1100, imran.f.khan@oracle.com wrote:
...
> The problem is with cases where a CPU is already offline, or is in the
> middle of being offlined and is past the timers_dead_cpu callback. If
> someone puts a delayed work on this CPU in that state, we have a problem.
> The WARN_ON_ONCE in [1] can indicate this, but the dwork's timer would
> still end up on the offlined CPU and would not be migrated (since the CPU
> was past the timer_dead state when the dwork was queued).
> 
> One way to avoid this is to ask callers to take the necessary steps (hotplug
> lock, hotplug callbacks) and ensure that the dwork does not end up on such an
> offlined CPU. The other way (as attempted in this patch) would be to give such
> users an interface that ensures the dwork never ends up on an offlined CPU.

I don't think introducing a new interface makes sense here. Either the WARN
is enough or we can follow what queue_work_on() does and ensure that the
delayed work item gets executed *somewhere*.

Thanks.

-- 
tejun
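
Editor's sketch: one possible reading of "ensure that the delayed work item gets executed *somewhere*" is that the queueing path itself, rather than the caller, redirects the timer when the requested CPU is offline. The userspace model below shows only that idea. The lowest-numbered-online-CPU policy and all model_ names are assumptions, and this is not a description of how queue_work_on() is implemented.

```c
/* Userspace model of an "execute somewhere" fallback; not kernel code. */
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4
#define MODEL_CPU_NONE (-1)

/* Stand-in for the online mask; CPU 2 is offline in this model. */
static bool model_online[MODEL_NR_CPUS] = { true, true, false, true };

/*
 * Return the CPU on which the delayed work's timer would be armed:
 * the requested CPU if it is online, otherwise the lowest-numbered
 * online CPU (an arbitrary policy chosen for this sketch).
 */
static int model_pick_timer_cpu(int requested)
{
	assert(requested >= 0 && requested < MODEL_NR_CPUS);

	if (model_online[requested])
		return requested;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		if (model_online[cpu])
			return cpu;
	}
	return MODEL_CPU_NONE;
}
```

With this shape the caller's interface stays unchanged: a request for an offline CPU is still accepted, and only the CPU affinity is lost.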