This reverts commit 0263f92fadbb9d294d5971ac57743f882c93b2b3.
The lock was removed because the nvme-pci driver's reset handler
attempted to acquire the CPU read lock while CPU hotplug offlining
(which holds the CPU write lock) was in progress. Consequently, the
block layer's offline notifier callback could not make progress because
in-flight requests were detected.
Since then, in-flight detection has been improved, and the nvme-pci
driver now explicitly updates the hctx state when it is safe to ignore
detected in-flight requests. As a result, it's possible to reintroduce
the CPU read lock in group_cpus_evenly.
Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
lib/group_cpus.c | 21 +++++----------------
1 file changed, 5 insertions(+), 16 deletions(-)
diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index e6e18d7a49bb..533c722b5c2c 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -510,25 +510,13 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks)
if (!masks)
goto fail_node_to_cpumask;
+ /* Stabilize the cpumasks */
+ cpus_read_lock();
build_node_to_cpumask(node_to_cpumask);
- /*
- * Make a local cache of 'cpu_present_mask', so the two stages
- * spread can observe consistent 'cpu_present_mask' without holding
- * cpu hotplug lock, then we can reduce deadlock risk with cpu
- * hotplug code.
- *
- * Here CPU hotplug may happen when reading `cpu_present_mask`, and
- * we can live with the case because it only affects that hotplug
- * CPU is handled in the 1st or 2nd stage, and either way is correct
- * from API user viewpoint since 2-stage spread is sort of
- * optimization.
- */
- cpumask_copy(npresmsk, data_race(cpu_present_mask));
-
/* grouping present CPUs first */
ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
- npresmsk, nmsk, masks);
+ cpu_present_mask, nmsk, masks);
if (ret < 0)
goto fail_node_to_cpumask;
nr_present = ret;
@@ -543,13 +531,14 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks)
curgrp = 0;
else
curgrp = nr_present;
- cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk);
+ cpumask_andnot(npresmsk, cpu_possible_mask, cpu_present_mask);
ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
npresmsk, nmsk, masks);
if (ret >= 0)
nr_others = ret;
fail_node_to_cpumask:
+ cpus_read_unlock();
free_node_to_cpumask(node_to_cpumask);
fail_npresmsk:
--
2.53.0
On Thu, Feb 26, 2026 at 02:40:37PM +0100, Daniel Wagner wrote:
> This reverts commit 0263f92fadbb9d294d5971ac57743f882c93b2b3.
>
> The reason the lock was removed was that the nvme-pci driver reset
> handler attempted to acquire the CPU read lock during CPU hotplug
> offlining (holds the CPU write lock). Consequently, the block layer
> offline notifier callback could not progress because in-flight requests
> were detected.
>
> Since then, in-flight detection has been improved, and the nvme-pci
> driver now explicitly updates the hctx state when it is safe to ignore
> detected in-flight requests. As a result, it's possible to reintroduce
> the CPU read lock in group_cpus_evenly.

Can you explain your motivation a bit? Especially since adding back the
lock makes the API hard to use. Any benefit?

Thanks,
Ming
Hi Ming,

Sorry for the late response. Last week the mail server took a break...

On Thu, Feb 26, 2026 at 10:04:18PM +0800, Ming Lei wrote:
> On Thu, Feb 26, 2026 at 02:40:37PM +0100, Daniel Wagner wrote:
> > This reverts commit 0263f92fadbb9d294d5971ac57743f882c93b2b3.
> >
> > [...]
> >
> > Since then, in-flight detection has been improved, and the nvme-pci
> > driver now explicitly updates the hctx state when it is safe to ignore
> > detected in-flight requests. As a result, it's possible to reintroduce
> > the CPU read lock in group_cpus_evenly.
>
> Can you explain your motivation a bit? Especially since adding back the
> lock makes the API hard to use. Any benefit?

Sure. I would like to add the lock back to group_cpus_evenly so it's
possible to add support for the isolcpus use case. For the isolcpus
case, it's necessary to access the cpu_online_mask when creating a
housekeeping cpu mask. I failed to find a good solution which doesn't
introduce horrible hacks (see Thomas' feedback on this [1]).

Anyway, I am not totally set on this solution, but having a proper lock
in this code path would make the isolcpus extension way cleaner, I
think.

What exactly do you mean by 'API hard to use'? The problem that the
caller/driver has to make sure it doesn't do anything like the nvme-pci
driver did?

[1] https://lore.kernel.org/linux-nvme/87cy7vrbc4.ffs@tglx/

Thanks,
Daniel
On Mon, Mar 2, 2026 at 10:05 PM Daniel Wagner <dwagner@suse.de> wrote:
>
> Hi Ming,
>
> Sorry for the late response. Last week the mail server took a break...
>
> On Thu, Feb 26, 2026 at 10:04:18PM +0800, Ming Lei wrote:
> > On Thu, Feb 26, 2026 at 02:40:37PM +0100, Daniel Wagner wrote:
> > > This reverts commit 0263f92fadbb9d294d5971ac57743f882c93b2b3.
> > >
> > > [...]
> >
> > Can you explain your motivation a bit? Especially since adding back
> > the lock makes the API hard to use. Any benefit?
>
> Sure. I would like to add the lock back to group_cpus_evenly so it's
> possible to add support for the isolcpus use case. For the isolcpus
> case, it's necessary to access the cpu_online_mask when creating a
> housekeeping cpu mask. I failed to find a good solution which doesn't
> introduce horrible hacks (see Thomas' feedback on this [1]).
>
> Anyway, I am not totally set on this solution, but having a proper lock
> in this code path would make the isolcpus extension way cleaner, I
> think.

Then please include this patch with an explanation in your isolcpus
patch set.

> What exactly do you mean by 'API hard to use'? The problem that the
> caller/driver has to make sure it doesn't do anything like the nvme-pci
> driver did?

This API is usually called in a slow path, in which subsystem locks are
often required, so a lock dependency against cpus_read_lock is added.

Thanks,
Ming
On Mon, Mar 02, 2026 at 10:12:49PM +0800, Ming Lei wrote:
> > Sure. I would like to add the lock back to group_cpus_evenly so it's
> > possible to add support for the isolcpus use case. For the isolcpus
> > case, it's necessary to access the cpu_online_mask when creating a
> > housekeeping cpu mask. I failed to find a good solution which doesn't
> > introduce horrible hacks (see Thomas' feedback on this [1]).
> >
> > Anyway, I am not totally set on this solution, but having a proper
> > lock in this code path would make the isolcpus extension way cleaner,
> > I think.
>
> Then please include this patch with an explanation in your isolcpus
> patch set.

I didn't add it to the commit message because the code is not there yet,
and thus only mentioned it in the cover letter. But sure, I'll add this
info.

> > What exactly do you mean by 'API hard to use'? The problem that the
> > caller/driver has to make sure it doesn't do anything like the
> > nvme-pci driver did?
>
> This API is usually called in a slow path, in which subsystem locks are
> often required, so a lock dependency against cpus_read_lock is added.

Yes, that's the very reason I came up with this handshake protocol,
which only covers the block layer subsystem.

I wonder if it would be possible to do a lock-free version with a retry
check at the end: when a CPU hotplug event happened during the
calculation, start over. For this, some sort of generation count for CPU
hotplug events would be handy. Just thinking out loud.