From: Doug Berger
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Florian Fainelli, linux-kernel@vger.kernel.org, Doug Berger
Subject: [RFC] sched/deadline: only mark active cpu as free
Date: Fri, 10 Jan 2025 15:30:10 -0800
Message-Id: <20250110233010.2339521-1-opendmb@gmail.com>

There is a hazard in the deadline scheduler where an offlined CPU can have
its free_cpus bit left set in the def_root_domain when the schedutil
cpufreq governor is used. This can allow a deadline thread to be pushed to
the runqueue of a powered-down CPU, which breaks scheduling.

This commit works around the issue by only setting the free_cpus bit for a
CPU when it is "active". If this approach has merit, the ordering of
sched_set_rq_online() and set_cpu_active() at the end of the
sched_cpu_deactivate() function should likely be revisited.

Signed-off-by: Doug Berger
---
Coffee is recommended before proceeding.

While stress testing CPU hotplug on a quad-core arm64 architecture system
I encountered a deadlock. My specific deadlock appears to depend on the
system having three or more cores and using the schedutil cpufreq
governor, which uses a deadline-scheduled thread named "sugov:n", where n
is the CPU number.

The scenario I observe is as follows. Initially, CPU0 and CPU1 are active,
and CPU2 and CPU3 have been previously offlined so their runqueues are
attached to the def_root_domain.
1) A hot plug is initiated on CPU2.
2) The cpuhp/2 thread invokes the cpufreq governor driver during the
   CPUHP_AP_ONLINE_DYN step.
3) The schedutil cpufreq governor creates the "sugov:2" thread to execute
   on CPU2 with the deadline scheduler.
4) The deadline scheduler clears the free_cpus bit for CPU2 within the
   def_root_domain when "sugov:2" is scheduled.
5) When the "sugov:2" thread blocks, cpudl_clear() is called to clear the
   deadline, which sets the free_cpus bit for CPU2 within the
   def_root_domain.
6) When cpuhp/2 reaches the CPUHP_AP_ACTIVE step, a new scheduling domain
   is created to include CPU0, CPU1, and CPU2.
   o detach_destroy_domains() invokes rq_attach_root() for CPU0 and CPU1,
     which offlines their runqueues and detaches their current dynamic
     scheduling domain (clearing their deadline free_cpus bits there),
     then attaches the def_root_domain and onlines their runqueues
     (setting their deadline free_cpus bits there).
   o build_sched_domains() invokes rq_attach_root() for CPU0, CPU1, and
     CPU2.
     - Since only CPU0 and CPU1 are online in the def_root_domain,
       set_rq_offline() is only called for them to offline their runqueues
       and detach the def_root_domain (clearing their deadline free_cpus
       bits there).
     - The free_cpus bit for CPU2 in the def_root_domain is allowed to
       remain set.
     - The newly created dynamic scheduling domain is attached to the
       CPU0, CPU1, and CPU2 runqueues, and set_rq_online() is used to
       online their runqueues (setting their deadline free_cpus bits
       there).
7) The cpuhp/2 thread also invokes sched_set_rq_online() in the
   CPUHP_AP_ACTIVE step, but since the runqueues are already online
   essentially nothing happens.
8) Some time later, CPU2 is hot unplugged.
9) At the CPUHP_AP_ACTIVE step, cpuhp/2 marks CPU2 not active and invokes
   balance_push_set() for CPU2, which migrates "sugov:2" to a different
   CPU through fallback.
10) Also at this step, cpuhp/2 invokes sched_set_rq_offline() for CPU2,
    which takes its runqueue offline and clears its deadline free_cpus bit
    in the current dynamic scheduling domain.
11) Also at this step, cpuhp/2 updates the scheduling domain to remove
    CPU2.
    o detach_destroy_domains() invokes rq_attach_root() for CPU0, CPU1,
      and CPU2 to move them back to the def_root_domain.
      - Since only CPU0 and CPU1 are online in the current dynamic
        scheduling domain (CPU2 was removed at step 10 above),
        set_rq_offline() is only called for them to clear their deadline
        free_cpus bits.
      - The def_root_domain is attached to the CPU0, CPU1, and CPU2
        runqueues, and since only CPU0 and CPU1 are marked active,
        set_rq_online() is used to online only their runqueues (setting
        their deadline free_cpus bits there).
      - The free_cpus bit for CPU2 in the def_root_domain is allowed to
        remain set.
    o build_sched_domains() invokes rq_attach_root() for CPU0 and CPU1,
      which offlines their runqueues (clearing their deadline free_cpus
      bits in the def_root_domain), attaches a new dynamic scheduling
      domain, and onlines their runqueues (setting their deadline
      free_cpus bits there).
12) The cpuhp/2 thread invokes the cpufreq governor driver during the
    CPUHP_AP_ONLINE_DYN step, which attempts to stop the "sugov:2"
    kthread by calling kthread_flush_worker() followed by kthread_stop().
13) The "sugov:2" thread can likely be successfully deadline scheduled on
    CPU0 or CPU1, allowing the cpuhp/2 thread to complete offlining CPU2
    and power it off.
14) Some time later, CPU1 is hot unplugged.
15) At the CPUHP_AP_ACTIVE step, cpuhp/1 marks CPU1 not active and invokes
    balance_push_set() for CPU1, which migrates "sugov:1" to CPU0 through
    fallback.
16) Also at this step, cpuhp/1 invokes sched_set_rq_offline() for CPU1,
    which takes its runqueue offline and clears its deadline free_cpus bit
    in the current dynamic scheduling domain.
17) Also at this step, cpuhp/1 updates the scheduling domain to remove
    CPU1.
    o detach_destroy_domains() invokes rq_attach_root() for CPU0 and CPU1
      to move them back to the def_root_domain.
      - Since only CPU0 is online in the current dynamic scheduling domain
        (CPU1 was removed at step 16 above), set_rq_offline() is only
        called for it to clear its deadline free_cpus bit.
      - The def_root_domain is attached to the CPU0 and CPU1 runqueues,
        and since only CPU0 is marked active, set_rq_online() is used to
        online only its runqueue (setting its deadline free_cpus bit
        there).
      - The free_cpus bit for CPU1 is untouched in the def_root_domain.
      - The free_cpus bit for CPU2 in the def_root_domain remains set from
        the preceding sequence.
18) If CPU0 executes the "sugov:0" deadline thread at this time, it may
    see that the "sugov:1" deadline thread is also on its runqueue and may
    call push_dl_task() to attempt to push it to a different CPU.
19) The search for a later runqueue finds the stale free_cpus bit of CPU2
    in the currently attached def_root_domain and migrates the "sugov:1"
    thread to the runqueue of the powered-down CPU2, where it can never be
    scheduled.
20) The cpuhp/1 thread invokes the cpufreq governor driver during the
    CPUHP_AP_ONLINE_DYN step, which attempts to stop the "sugov:1"
    kthread by calling kthread_flush_worker() followed by kthread_stop().
    Since "sugov:1" never gets scheduled, cpuhp/1 remains blocked waiting
    on completion events.

Steps 1-13 amount to setting a trap by allowing a free_cpus bit in the
deadline scheduler's def_root_domain to remain set for a CPU that is
powered off. The trap can be sprung during the narrow timing hazard when
the def_root_domain is transitionally attached while changing scheduling
domains, if the deadline scheduler pushes a queued task to the powered-off
CPU.

This problem appears to have been initially introduced by commit
120455c514f7 ("sched: Fix hotplug vs CPU bandwidth control"), which moved
the set_rq_offline() handling from sched_cpu_dying() to
sched_cpu_deactivate().
The original sequence allowed the free_cpus bit to be forcibly cleared in
the def_root_domain after all of the scheduler dust settled. The new
location makes sched_set_rq_offline() essentially meaningless for the
deadline scheduler, since the management of changed scheduling domains
happens later.

There are likely many different approaches to address this issue, and I'm
hopeful that someone more familiar with the scheduler than I am can
propose a better solution than the one suggested here.

Thank you for reading this far. Any advice is appreciated.
    -Doug

 kernel/sched/cpudeadline.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index 95baa12a1029..6896bbe0e9ae 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -195,7 +195,8 @@ void cpudl_clear(struct cpudl *cp, int cpu)
 		cp->elements[cpu].idx = IDX_INVALID;
 		cpudl_heapify(cp, old_idx);
 
-		cpumask_set_cpu(cpu, cp->free_cpus);
+		if (cpu_active(cpu))
+			cpumask_set_cpu(cpu, cp->free_cpus);
 	}
 	raw_spin_unlock_irqrestore(&cp->lock, flags);
 }
-- 
2.34.1