From: Qais Yousef
To: Ingo Molnar, Peter Zijlstra, "Rafael J. Wysocki", Viresh Kumar, Vincent Guittot, Dietmar Eggemann
Cc: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org, Lukasz Luba, Wei Wang, Rick Yiu, Chung-Kai Mei, Hongyan Xia, Qais Yousef
Subject: [PATCH 4/4] sched/documentation: Remove reference to max aggregation
Date: Fri, 8 Dec 2023 01:52:42 +0000
Message-Id: <20231208015242.385103-5-qyousef@layalina.io>
In-Reply-To: <20231208015242.385103-1-qyousef@layalina.io>
References: <20231208015242.385103-1-qyousef@layalina.io>

Now that max aggregation complexity was removed, update the doc to reflect
the new working design. And since we fixed one of the limitations, remove it
as well.

Signed-off-by: Qais Yousef (Google)
---
 Documentation/scheduler/sched-util-clamp.rst | 239 ++++---------------
 1 file changed, 45 insertions(+), 194 deletions(-)

diff --git a/Documentation/scheduler/sched-util-clamp.rst b/Documentation/scheduler/sched-util-clamp.rst
index 74d5b7c6431d..642e5f386f8e 100644
--- a/Documentation/scheduler/sched-util-clamp.rst
+++ b/Documentation/scheduler/sched-util-clamp.rst
@@ -129,172 +129,50 @@ and the scheduler needs to select a suitable CPU for it to run on.
 
 Since the goal of util clamp is to allow requesting a minimum and maximum
 performance point for a task to run on, it must be able to influence the
-frequency selection as well as task placement to be most effective. Both of
-which have implications on the utilization value at CPU runqueue (rq for short)
-level, which brings us to the main design challenge.
-
-When a task wakes up on an rq, the utilization signal of the rq will be
-affected by the uclamp settings of all the tasks enqueued on it. For example if
-a task requests to run at UTIL_MIN = 512, then the util signal of the rq needs
-to respect to this request as well as all other requests from all of the
-enqueued tasks.
-
-To be able to aggregate the util clamp value of all the tasks attached to the
-rq, uclamp must do some housekeeping at every enqueue/dequeue, which is the
-scheduler hot path. Hence care must be taken since any slow down will have
-significant impact on a lot of use cases and could hinder its usability in
-practice.
-
-The way this is handled is by dividing the utilization range into buckets
-(struct uclamp_bucket) which allows us to reduce the search space from every
-task on the rq to only a subset of tasks on the top-most bucket.
-
-When a task is enqueued, the counter in the matching bucket is incremented,
-and on dequeue it is decremented. This makes keeping track of the effective
-uclamp value at rq level a lot easier.
-
-As tasks are enqueued and dequeued, we keep track of the current effective
-uclamp value of the rq. See :ref:`section 2.1 <uclamp-buckets>` for details on
-how this works.
-
-Later at any path that wants to identify the effective uclamp value of the rq,
-it will simply need to read this effective uclamp value of the rq at that exact
-moment of time it needs to take a decision.
+frequency selection as well as task placement to be most effective.
 
 For task placement case, only Energy Aware and Capacity Aware Scheduling
 (EAS/CAS) make use of uclamp for now, which implies that it is applied on
-heterogeneous systems only.
-When a task wakes up, the scheduler will look at the current effective uclamp
-value of every rq and compare it with the potential new value if the task were
-to be enqueued there. Favoring the rq that will end up with the most energy
-efficient combination.
+heterogeneous systems only. When a task wakes up, the scheduler will look at
+the effective uclamp values of the woken task to check if it will fit the
+capacity of the CPU. Favouring the most energy efficient CPU that fits the
+performant request hints. Enabling it to favour a bigger CPU if uclamp_min is
+high even if the utilization of the task is low. Or enable it to run on
+a smaller CPU if UCLAMP_MAX is low, even if the utilization of the task is
+high.
 
 Similarly in schedutil, when it needs to make a frequency update it will look
-at the current effective uclamp value of the rq which is influenced by the set
-of tasks currently enqueued there and select the appropriate frequency that
-will satisfy constraints from requests.
+at the effective uclamp values of the current running task on the rq and select
+the appropriate frequency that will satisfy the performance request hints of
+the task, taking into account the current utilization of the rq.
 
 Other paths like setting overutilization state (which effectively disables EAS)
 make use of uclamp as well. Such cases are considered necessary housekeeping to
 allow the 2 main use cases above and will not be covered in detail here as they
 could change with implementation details.
 
-.. _uclamp-buckets:
+2.1. Frequency hints are applied at context switch
+--------------------------------------------------
 
-2.1. Buckets
-------------
+At context switch, we tell schedutil of the new uclamp values (or min/max perf
+requirements) of the newly RUNNING task. It is up to the governor to try its
+best to honour these requests.
 
-::
-
-           [struct rq]
-
-  (bottom)                                                    (top)
-
-    0                                                          1024
-    |                                                           |
-    +-----------+-----------+-----------+----   ----+-----------+
-    |  Bucket 0 |  Bucket 1 |  Bucket 2 |    ...    |  Bucket N |
-    +-----------+-----------+-----------+----   ----+-----------+
-       :           :                                   :
-       +- p0       +- p3                               +- p4
-       :                                               :
-       +- p1                                           +- p5
-       :
-       +- p2
-
-
-.. note::
-  The diagram above is an illustration rather than a true depiction of the
-  internal data structure.
-
-To reduce the search space when trying to decide the effective uclamp value of
-an rq as tasks are enqueued/dequeued, the whole utilization range is divided
-into N buckets where N is configured at compile time by setting
-CONFIG_UCLAMP_BUCKETS_COUNT. By default it is set to 5.
-
-The rq has a bucket for each uclamp_id tunables: [UCLAMP_MIN, UCLAMP_MAX].
-
-The range of each bucket is 1024/N. For example, for the default value of
-5 there will be 5 buckets, each of which will cover the following range:
-
-::
-
-        DELTA = round_closest(1024/5) = 204.8 = 205
-
-        Bucket 0: [0:204]
-        Bucket 1: [205:409]
-        Bucket 2: [410:614]
-        Bucket 3: [615:819]
-        Bucket 4: [820:1024]
-
-When a task p with following tunable parameters
-
-::
-
-        p->uclamp[UCLAMP_MIN] = 300
-        p->uclamp[UCLAMP_MAX] = 1024
-
-is enqueued into the rq, bucket 1 will be incremented for UCLAMP_MIN and bucket
-4 will be incremented for UCLAMP_MAX to reflect the fact the rq has a task in
-this range.
-
-The rq then keeps track of its current effective uclamp value for each
-uclamp_id.
-
-When a task p is enqueued, the rq value changes to:
-
-::
-
-        // update bucket logic goes here
-        rq->uclamp[UCLAMP_MIN] = max(rq->uclamp[UCLAMP_MIN], p->uclamp[UCLAMP_MIN])
-        // repeat for UCLAMP_MAX
+For uclamp to be effective, it is desired to have a hardware with reasonably
+fast DVFS hardware (rate_limit_us is short).
 
-Similarly, when p is dequeued the rq value changes to:
-
-::
-
-        // update bucket logic goes here
-        rq->uclamp[UCLAMP_MIN] = search_top_bucket_for_highest_value()
-        // repeat for UCLAMP_MAX
-
-When all buckets are empty, the rq uclamp values are reset to system defaults.
-See :ref:`section 3.4 ` for details on default values.
+It is believed that most modern hardware (including lower end ones) has
+rate_limit_us <= 2ms which should be good enough. Having 500us or below would
+be ideal so the hardware can reasonably catch up with each task's performance
+constraints.
 
+Schedutil might ignore the task's performance request if its historical runtime
+has been shorter than the rate_limit_us.
 
-2.2. Max aggregation
---------------------
+See :ref:`Section 5.2 <schedutil_response_time_issues>` for more details on
+schedutil related limitations.
 
-Util clamp is tuned to honour the request for the task that requires the
-highest performance point.
-
-When multiple tasks are attached to the same rq, then util clamp must make sure
-the task that needs the highest performance point gets it even if there's
-another task that doesn't need it or is disallowed from reaching this point.
-
-For example, if there are multiple tasks attached to an rq with the following
-values:
-
-::
-
-        p0->uclamp[UCLAMP_MIN] = 300
-        p0->uclamp[UCLAMP_MAX] = 900
-
-        p1->uclamp[UCLAMP_MIN] = 500
-        p1->uclamp[UCLAMP_MAX] = 500
-
-then assuming both p0 and p1 are enqueued to the same rq, both UCLAMP_MIN
-and UCLAMP_MAX become:
-
-::
-
-        rq->uclamp[UCLAMP_MIN] = max(300, 500) = 500
-        rq->uclamp[UCLAMP_MAX] = max(900, 500) = 900
-
-As we shall see in :ref:`section 5.1 <uclamp-capping-fail>`, this max
-aggregation is the cause of one of limitations when using util clamp, in
-particular for UCLAMP_MAX hint when user space would like to save power.
-
-2.3. Hierarchical aggregation
+2.2. Hierarchical aggregation
 -----------------------------
 
 As stated earlier, util clamp is a property of every task in the system. But
@@ -324,7 +202,7 @@ In other words, this aggregation will not cause an error when a task changes
 its uclamp values, but rather the system may not be able to satisfy requests
 based on those factors.
 
-2.4. Range
+2.3. Range
 ----------
 
 Uclamp performance request has the range of 0 to 1024 inclusive.
@@ -332,6 +210,14 @@ Uclamp performance request has the range of 0 to 1024 inclusive.
 For cgroup interface percentage is used (that is 0 to 100 inclusive).
 Just like other cgroup interfaces, you can use 'max' instead of 100.
 
+2.4. Older design
+-----------------
+
+Older design was behaving differently due to what was called max-aggregation rule
+which was adding high complexity and had some limitations. Please consult the
+docs corresponding to your kernel version as part of this doc might reflect how
+it works on your kernel.
+
 .. _uclamp-interfaces:
 
 3. Interfaces
@@ -594,39 +480,7 @@ possible.
 5. Limitations
 ==============
 
-.. _uclamp-capping-fail:
-
-5.1. Capping frequency with uclamp_max fails under certain conditions
----------------------------------------------------------------------
-
-If task p0 is capped to run at 512:
-
-::
-
-        p0->uclamp[UCLAMP_MAX] = 512
-
-and it shares the rq with p1 which is free to run at any performance point:
-
-::
-
-        p1->uclamp[UCLAMP_MAX] = 1024
-
-then due to max aggregation the rq will be allowed to reach max performance
-point:
-
-::
-
-        rq->uclamp[UCLAMP_MAX] = max(512, 1024) = 1024
-
-Assuming both p0 and p1 have UCLAMP_MIN = 0, then the frequency selection for
-the rq will depend on the actual utilization value of the tasks.
-
-If p1 is a small task but p0 is a CPU intensive task, then due to the fact that
-both are running at the same rq, p1 will cause the frequency capping to be left
-from the rq although p1, which is allowed to run at any performance point,
-doesn't actually need to run at that frequency.
-
-5.2. UCLAMP_MAX can break PELT (util_avg) signal
+5.1. UCLAMP_MAX can break PELT (util_avg) signal
 ------------------------------------------------
 
 PELT assumes that frequency will always increase as the signals grow to ensure
@@ -650,10 +504,6 @@ CPU is capable of. The max CPU frequency (Fmax) matters here as well, since it
 designates the shortest computational time to finish the task's work on
 this CPU.
 
-::
-
-        rq->uclamp[UCLAMP_MAX] = 0
-
 If the ratio of Fmax/Fmin is 3, then maximum value will be:
 
 ::
@@ -689,9 +539,8 @@ If task p1 wakes up on this CPU, which have:
         p1->util_avg = 200
         p1->uclamp[UCLAMP_MAX] = 1024
 
-then the effective UCLAMP_MAX for the CPU will be 1024 according to max
-aggregation rule. But since the capped p0 task was running and throttled
-severely, then the rq->util_avg will be:
+Since the capped p0 task was running and throttled severely, then the
+rq->util_avg will be:
 
 ::
 
@@ -699,9 +548,9 @@ severely, then the rq->util_avg will be:
         p1->util_avg = 200
 
         rq->util_avg = 1024
-        rq->uclamp[UCLAMP_MAX] = 1024
 
-Hence lead to a frequency spike since if p0 wasn't throttled we should get:
+Hence lead to a frequency spike when p1 is running. If p0 wasn't throttled we
+should get:
 
 ::
 
@@ -710,9 +559,11 @@ Hence lead to a frequency spike since if p0 wasn't throttled we should get:
 
         rq->util_avg = 500
 
-and run somewhere near mid performance point of that CPU, not the Fmax we get.
+and run somewhere near mid performance point of that CPU, not the Fmax p1 gets.
+
+.. _schedutil_response_time_issues:
 
-5.3. Schedutil response time issues
+5.2. Schedutil response time issues
 -----------------------------------
 
 schedutil has three limitations:
-- 
2.34.1