The io worker threads are userland threads that just never exit to
userland. As such, they are also assigned to a cgroup (the group of
the creating task).
When creating a new io worker, this worker should inherit the cpuset
of the cgroup.
Fixes: da64d6db3bd3 ("io_uring: One wqe per wq")
Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com>
---
io_uring/io-wq.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index c7055a8895d7..a38f36b68060 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -1168,7 +1168,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 
 	if (!alloc_cpumask_var(&wq->cpu_mask, GFP_KERNEL))
 		goto err;
-	cpumask_copy(wq->cpu_mask, cpu_possible_mask);
+	cpuset_cpus_allowed(data->task, wq->cpu_mask);
 	wq->acct[IO_WQ_ACCT_BOUND].max_workers = bounded;
 	wq->acct[IO_WQ_ACCT_UNBOUND].max_workers =
 					task_rlimit(current, RLIMIT_NPROC);
--
2.39.2
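
For context, the mask set in this hunk is not just bookkeeping: every
worker task io-wq spawns later takes its affinity from wq->cpu_mask. A
paraphrased sketch of that path (from io_uring/io-wq.c; details vary by
kernel version, so treat this as illustrative rather than the literal
source):

/* Paraphrased, not the literal kernel source: a freshly created
 * worker task inherits the per-wq CPU mask, so seeding wq->cpu_mask
 * from the creator's cpuset confines every future iou-wrk thread. */
static void io_init_new_worker(struct io_wq *wq, struct io_worker *worker,
			       struct task_struct *tsk)
{
	tsk->worker_private = worker;
	worker->task = tsk;
	set_cpus_allowed_ptr(tsk, wq->cpu_mask);	/* <- the mask above */
	/* ... worker list insertion and wakeup elided ... */
}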
On 9/10/24 13:11, Felix Moessbauer wrote:
> The io worker threads are userland threads that just never exit to
> userland. As such, they are also assigned to a cgroup (the group of
> the creating task).

The io-wq task is not actually assigned to a cgroup. To belong to a
cgroup, its pid has to be present in the cgroup.procs of the
corresponding cgroup, which is not the case here. My understanding is
that you are just restricting the CPU affinity to follow the cpuset of
the corresponding user task that creates it. The CPU affinity (cpumask)
is just one of the many resources controlled by a cgroup. That probably
needs to be clarified.

Besides the cpumask, the cpuset controller also controls the node mask
of the allowed memory nodes.

Cheers,
Longman

>
> When creating a new io worker, this worker should inherit the cpuset
> of the cgroup.
>
> Fixes: da64d6db3bd3 ("io_uring: One wqe per wq")
> Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com>
> ---
>   io_uring/io-wq.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
> index c7055a8895d7..a38f36b68060 100644
> --- a/io_uring/io-wq.c
> +++ b/io_uring/io-wq.c
> @@ -1168,7 +1168,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
>
> 	if (!alloc_cpumask_var(&wq->cpu_mask, GFP_KERNEL))
> 		goto err;
> -	cpumask_copy(wq->cpu_mask, cpu_possible_mask);
> +	cpuset_cpus_allowed(data->task, wq->cpu_mask);
> 	wq->acct[IO_WQ_ACCT_BOUND].max_workers = bounded;
> 	wq->acct[IO_WQ_ACCT_UNBOUND].max_workers =
> 					task_rlimit(current, RLIMIT_NPROC);
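
To illustrate the distinction Longman draws, a hedged kernel-context
sketch (the helper names are real, declared in <linux/cpuset.h>; the
wrapper function and its printout are made up for illustration):
cpuset_cpus_allowed() fills a caller-provided cpumask with the CPUs the
task's cpuset permits, while cpuset_mems_allowed() returns the allowed
memory nodes. The patch only propagates the former.

#include <linux/cpumask.h>
#include <linux/cpuset.h>
#include <linux/gfp.h>
#include <linux/nodemask.h>
#include <linux/printk.h>
#include <linux/sched.h>

/* Hypothetical helper, names borrowed from the diff above: query both
 * resources the cpuset controller manages for the creating task. */
static int sketch_query_cpuset(struct task_struct *task)
{
	cpumask_var_t cpus;
	nodemask_t mems;

	if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
		return -ENOMEM;

	cpuset_cpus_allowed(task, cpus);	/* CPUs the cpuset allows */
	mems = cpuset_mems_allowed(task);	/* memory nodes it allows */

	pr_info("cpus=%*pbl mems=%*pbl\n",
		cpumask_pr_args(cpus), nodemask_pr_args(&mems));
	free_cpumask_var(cpus);
	return 0;
}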
On Tue, 2024-09-10 at 13:42 -0400, Waiman Long wrote:
> 
> On 9/10/24 13:11, Felix Moessbauer wrote:
> > The io worker threads are userland threads that just never exit to
> > userland. As such, they are also assigned to a cgroup (the group of
> > the creating task).
> 
> The io-wq task is not actually assigned to a cgroup. To belong to a
> cgroup, its pid has to be present in the cgroup.procs of the
> corresponding cgroup, which is not the case here.

Hi, thanks for jumping in. As said, I'm not too familiar with the
internals of the io worker threads. Nonetheless, the kernel presents
the cgroup assignment quite consistently, which contradicts your
statement from above. Example:

    pid    tid
    648460 648460 SCHED_OTHER 20 S 0 0-1 ./test/wq-aff.t
    648460 648461 SCHED_OTHER 20 S 1 1   iou-sqp-648460
    648460 648462 SCHED_OTHER 20 S 0 0   iou-wrk-648461

When I now check cgroup.procs, I just see 648460, which is expected as
this is the process (with its main thread). Checking cgroup.threads
shows all three tids. When checking the other way round, I get the
same information:

$ cat /proc/648460/task/648461/cgroup
0::/user.slice/user-1000.slice/session-1.scope
$ cat /proc/648460/task/648462/cgroup
0::/user.slice/user-1000.slice/session-1.scope

Now I'm wondering if it is just presented incorrectly, or if these
tasks indeed belong to the mentioned cgroup?

> My understanding is that you are just restricting the CPU affinity
> to follow the cpuset of the corresponding user task that creates it.
> The CPU affinity (cpumask) is just one of the many resources
> controlled by a cgroup. That probably needs to be clarified.

That's clear. Looking at the bigger picture, I want to ensure that the
io workers do not break out of the cgroup limits (I called it
"ambient" before, similar to the capabilities), because this breaks
the isolation assumption. In our case, we are mostly interested in not
leaving the cpuset, as we use that to perform system partitioning into
realtime and non-realtime parts.

> 
> Besides the cpumask, the cpuset controller also controls the node
> mask of the allowed memory nodes.

Yes, and that is especially important as some memory can be "closer"
to the IOs than others.

Best regards,
Felix

-- 
Siemens AG, Technology
Linux Expert Center
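
The affinity claim above can be checked from userland. A minimal sketch
(illustrative only; this is not the wq-aff.t test named in the output
above, and finding the worker is timing-dependent since idle io-wq
workers exit after a while): pin the creating task to CPU 0 as a
stand-in for a cpuset, punt one request to io-wq via IOSQE_ASYNC, then
print each thread's name and allowed CPUs. With the patch applied, the
iou-wrk-* threads should stay inside the creator's mask. Build with
-luring.

#define _GNU_SOURCE
#include <dirent.h>
#include <liburing.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	/* Pin the creating task to CPU 0 (stand-in for a cpuset). */
	cpu_set_t set;
	CPU_ZERO(&set);
	CPU_SET(0, &set);
	sched_setaffinity(0, sizeof(set), &set);

	struct io_uring ring;
	if (io_uring_queue_init(8, &ring, 0) < 0)
		return 1;

	/* IOSQE_ASYNC punts even a no-op to an io-wq worker thread. */
	struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
	io_uring_prep_nop(sqe);
	sqe->flags |= IOSQE_ASYNC;
	io_uring_submit(&ring);

	struct io_uring_cqe *cqe;
	io_uring_wait_cqe(&ring, &cqe);
	io_uring_cqe_seen(&ring, cqe);

	/* Walk our threads; iou-wrk-* entries are the io-wq workers
	 * (they usually linger briefly after the request completes). */
	DIR *d = opendir("/proc/self/task");
	struct dirent *de;
	while (d && (de = readdir(d))) {
		if (de->d_name[0] == '.')
			continue;
		char path[256], comm[32] = "?", line[256];
		FILE *f;

		snprintf(path, sizeof(path), "/proc/self/task/%s/comm",
			 de->d_name);
		if ((f = fopen(path, "r"))) {
			if (fgets(comm, sizeof(comm), f))
				comm[strcspn(comm, "\n")] = '\0';
			fclose(f);
		}
		snprintf(path, sizeof(path), "/proc/self/task/%s/status",
			 de->d_name);
		if ((f = fopen(path, "r"))) {
			while (fgets(line, sizeof(line), f))
				if (!strncmp(line, "Cpus_allowed_list:", 18))
					printf("tid %s (%s): %s",
					       de->d_name, comm, line);
			fclose(f);
		}
	}
	if (d)
		closedir(d);
	io_uring_queue_exit(&ring);
	return 0;
}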