From nobody Sun Dec 7 08:33:41 2025
From: Ryan Newton
To: linux-kernel@vger.kernel.org
Cc: sched-ext@lists.linux.dev, tj@kernel.org, arighi@nvidia.com, rrnewton@gmail.com, newton@meta.com
Subject: [PATCH v3 1/2] sched_ext: Add lockless peek operation for DSQs
Date: Mon, 6 Oct 2025 13:04:02 -0400
Message-ID: <20251006170403.3584204-2-rrnewton@gmail.com>
In-Reply-To: <20251006170403.3584204-1-rrnewton@gmail.com>
References: <20251006170403.3584204-1-rrnewton@gmail.com>

From: Ryan Newton

The builtin DSQ queue data structures are meant to be used by a wide
range of different sched_ext schedulers with different demands on these
data structures. They might be per-cpu with low-contention, or
high-contention shared queues. Unfortunately, DSQs have a coarse-grained
lock around the whole data structure. Without going all the way to a
lock-free, more scalable implementation, a small step we can take to
reduce lock contention is to allow a lockless, small-fixed-cost peek at
the head of the queue.

This change allows certain custom SCX schedulers to cheaply peek at
queues, e.g. during load balancing, before locking them. But it
represents a few extra memory operations to update the pointer each
time the DSQ is modified, including a memory barrier on ARM so the write
appears correctly ordered.

This commit adds a first_task pointer field which is updated atomically
when the DSQ is modified, and allows any thread to peek at the head of
the queue without holding the lock.

Signed-off-by: Ryan Newton
Reviewed-by: Andrea Righi
---
 include/linux/sched/ext.h                |  1 +
 kernel/sched/ext.c                       | 54 +++++++++++++++++++++++-
 tools/sched_ext/include/scx/common.bpf.h |  1 +
 tools/sched_ext/include/scx/compat.bpf.h | 19 +++++++++
 4 files changed, 73 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index d82b7a9b0658..81478d4ae782 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -58,6 +58,7 @@ enum scx_dsq_id_flags {
  */
 struct scx_dispatch_q {
 	raw_spinlock_t lock;
+	struct task_struct __rcu *first_task; /* lockless peek at head */
 	struct list_head list; /* tasks in dispatch order */
 	struct rb_root priq; /* used to order by p->scx.dsq_vtime */
 	u32 nr;
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 2b0e88206d07..6d3537e65001 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -944,8 +944,11 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
 				container_of(rbp, struct task_struct,
 					     scx.dsq_priq);
 			list_add(&p->scx.dsq_list.node, &prev->scx.dsq_list.node);
+			/* first task unchanged - no update needed */
 		} else {
 			list_add(&p->scx.dsq_list.node, &dsq->list);
+			/* not builtin and new task is at head - use fastpath */
+			rcu_assign_pointer(dsq->first_task, p);
 		}
 	} else {
 		/* a FIFO DSQ shouldn't be using PRIQ enqueuing */
@@ -953,10 +956,19 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
 			scx_error(sch, "DSQ ID 0x%016llx already had PRIQ-enqueued tasks",
 				  dsq->id);
 
-		if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT))
+		if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT)) {
 			list_add(&p->scx.dsq_list.node, &dsq->list);
-		else
+			/* new task inserted at head - use fastpath */
+			if (!(dsq->id & SCX_DSQ_FLAG_BUILTIN))
+				rcu_assign_pointer(dsq->first_task, p);
+		} else {
+			bool was_empty;
+
+			was_empty = list_empty(&dsq->list);
 			list_add_tail(&p->scx.dsq_list.node, &dsq->list);
+			if (was_empty && !(dsq->id & SCX_DSQ_FLAG_BUILTIN))
+				rcu_assign_pointer(dsq->first_task, p);
+		}
 	}
 
 	/* seq records the order tasks are queued, used by BPF DSQ iterator */
@@ -1011,6 +1023,13 @@ static void task_unlink_from_dsq(struct task_struct *p,
 		p->scx.dsq_flags &= ~SCX_TASK_DSQ_ON_PRIQ;
 	}
 
+	if (!(dsq->id & SCX_DSQ_FLAG_BUILTIN) && dsq->first_task == p) {
+		struct task_struct *first_task;
+
+		first_task = nldsq_next_task(dsq, p, false); /* @p still linked */
+		rcu_assign_pointer(dsq->first_task, first_task);
+	}
+
 	list_del_init(&p->scx.dsq_list.node);
 	dsq_mod_nr(dsq, -1);
 }
@@ -6084,6 +6103,36 @@ __bpf_kfunc void bpf_iter_scx_dsq_destroy(struct bpf_iter_scx_dsq *it)
 	kit->dsq = NULL;
 }
 
+/**
+ * scx_bpf_dsq_peek - Lockless peek at the first element.
+ * @dsq_id: DSQ to examine.
+ *
+ * Read the first element in the DSQ. This is semantically equivalent to using
+ * the DSQ iterator, but is lockfree.
+ *
+ * Returns the pointer, or NULL indicates an empty queue OR internal error.
+ */
+__bpf_kfunc struct task_struct *scx_bpf_dsq_peek(u64 dsq_id)
+{
+	struct scx_sched *sch;
+	struct scx_dispatch_q *dsq;
+
+	sch = rcu_dereference(scx_root);
+	if (unlikely(!sch))
+		return NULL;
+	if (unlikely(dsq_id & SCX_DSQ_FLAG_BUILTIN)) {
+		scx_error(sch, "peek disallowed on builtin DSQ 0x%llx", dsq_id);
+		return NULL;
+	}
+
+	dsq = find_user_dsq(sch, dsq_id);
+	if (unlikely(!dsq)) {
+		scx_error(sch, "peek on non-existent DSQ 0x%llx", dsq_id);
+		return NULL;
+	}
+	return rcu_dereference(dsq->first_task);
+}
+
 __bpf_kfunc_end_defs();
 
 static s32 __bstr_format(struct scx_sched *sch, u64 *data_buf, char *line_buf,
@@ -6641,6 +6690,7 @@ BTF_KFUNCS_START(scx_kfunc_ids_any)
 BTF_ID_FLAGS(func, scx_bpf_kick_cpu)
 BTF_ID_FLAGS(func, scx_bpf_dsq_nr_queued)
 BTF_ID_FLAGS(func, scx_bpf_destroy_dsq)
+BTF_ID_FLAGS(func, scx_bpf_dsq_peek, KF_RCU_PROTECTED | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_iter_scx_dsq_new, KF_ITER_NEW | KF_RCU_PROTECTED)
 BTF_ID_FLAGS(func, bpf_iter_scx_dsq_next, KF_ITER_NEXT | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_iter_scx_dsq_destroy, KF_ITER_DESTROY)
diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index 06e2551033cb..fbf3e7f9526c 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -75,6 +75,7 @@ u32 scx_bpf_reenqueue_local(void) __ksym;
 void scx_bpf_kick_cpu(s32 cpu, u64 flags) __ksym;
 s32 scx_bpf_dsq_nr_queued(u64 dsq_id) __ksym;
 void scx_bpf_destroy_dsq(u64 dsq_id) __ksym;
+struct task_struct *scx_bpf_dsq_peek(u64 dsq_id) __ksym __weak;
 int bpf_iter_scx_dsq_new(struct bpf_iter_scx_dsq *it, u64 dsq_id, u64 flags) __ksym __weak;
 struct task_struct *bpf_iter_scx_dsq_next(struct bpf_iter_scx_dsq *it) __ksym __weak;
 void bpf_iter_scx_dsq_destroy(struct bpf_iter_scx_dsq *it) __ksym __weak;
diff --git a/tools/sched_ext/include/scx/compat.bpf.h b/tools/sched_ext/include/scx/compat.bpf.h
index dd9144624dc9..97b10c184b2c 100644
--- a/tools/sched_ext/include/scx/compat.bpf.h
+++ b/tools/sched_ext/include/scx/compat.bpf.h
@@ -130,6 +130,25 @@ int bpf_cpumask_populate(struct cpumask *dst, void *src, size_t src__sz) __ksym
 		false; \
 })
 
+
+/*
+ * v6.19: Introduce lockless peek API for user DSQs.
+ *
+ * Preserve the following macro until v6.21.
+ */
+static inline struct task_struct *__COMPAT_scx_bpf_dsq_peek(u64 dsq_id)
+{
+	struct task_struct *p = NULL;
+	struct bpf_iter_scx_dsq it;
+
+	if (bpf_ksym_exists(scx_bpf_dsq_peek))
+		return scx_bpf_dsq_peek(dsq_id);
+	if (!bpf_iter_scx_dsq_new(&it, dsq_id, 0))
+		p = bpf_iter_scx_dsq_next(&it);
+	bpf_iter_scx_dsq_destroy(&it);
+	return p;
+}
+
 /**
  * __COMPAT_is_enq_cpu_selected - Test if SCX_ENQ_CPU_SELECTED is on
  * in a compatible way. We will preserve this __COMPAT helper until v6.16.
-- 
2.51.0

From nobody Sun Dec 7 08:33:41 2025
From: Ryan Newton
To: linux-kernel@vger.kernel.org
Cc: sched-ext@lists.linux.dev, tj@kernel.org, arighi@nvidia.com, rrnewton@gmail.com, newton@meta.com
Subject: [PATCH v3 2/2] sched_ext: Add a selftest for scx_bpf_dsq_peek
Date: Mon, 6 Oct 2025 13:04:03 -0400
Message-ID: <20251006170403.3584204-3-rrnewton@gmail.com>
In-Reply-To: <20251006170403.3584204-1-rrnewton@gmail.com>
References: <20251006170403.3584204-1-rrnewton@gmail.com>

From: Ryan Newton

Perform the most basic unit test: make sure an empty queue peeks as
empty, and when we put one element in the queue, make sure peek returns
that element.

However, even this simple test is a little complicated by the different
behavior of scx_bpf_dsq_insert in different calling contexts:
 - insert is for direct dispatch in enqueue
 - insert is delayed when called from select_cpu

In this case we split the insert and the peek that verifies the result
between enqueue/dispatch. As a second phase, we stress test by
performing many peeks on an array of user DSQs.

Note: An alternative would be to call `scx_bpf_dsq_move_to_local` on an
empty queue, which in turn calls `flush_dispatch_buf`, in order to flush
the buffered insert. Unfortunately, this is not viable within the
enqueue path, as it attempts a voluntary context switch within an RCU
read-side critical section.

Signed-off-by: Ryan Newton
---
 kernel/sched/ext.c                            |   2 +
 tools/testing/selftests/sched_ext/Makefile    |   1 +
 .../selftests/sched_ext/peek_dsq.bpf.c        | 265 ++++++++++++++++++
 tools/testing/selftests/sched_ext/peek_dsq.c  | 230 +++++++++++++++
 4 files changed, 498 insertions(+)
 create mode 100644 tools/testing/selftests/sched_ext/peek_dsq.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/peek_dsq.c

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 6d3537e65001..ec7e791cd4c8 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -6120,6 +6120,7 @@ __bpf_kfunc struct task_struct *scx_bpf_dsq_peek(u64 dsq_id)
 	sch = rcu_dereference(scx_root);
 	if (unlikely(!sch))
 		return NULL;
+
 	if (unlikely(dsq_id & SCX_DSQ_FLAG_BUILTIN)) {
 		scx_error(sch, "peek disallowed on builtin DSQ 0x%llx", dsq_id);
 		return NULL;
@@ -6130,6 +6131,7 @@ __bpf_kfunc struct task_struct *scx_bpf_dsq_peek(u64 dsq_id)
 		scx_error(sch, "peek on non-existent DSQ 0x%llx", dsq_id);
 		return NULL;
 	}
+
 	return rcu_dereference(dsq->first_task);
 }
 
diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile
index 9d9d6b4c38b0..5fe45f9c5f8f 100644
--- a/tools/testing/selftests/sched_ext/Makefile
+++ b/tools/testing/selftests/sched_ext/Makefile
@@ -174,6 +174,7 @@ auto-test-targets := \
 	minimal \
 	numa \
 	allowed_cpus \
+	peek_dsq \
 	prog_run \
 	reload_loop \
 	select_cpu_dfl \
diff --git a/tools/testing/selftests/sched_ext/peek_dsq.bpf.c b/tools/testing/selftests/sched_ext/peek_dsq.bpf.c
new file mode 100644
index 000000000000..8d179d4c7efb
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/peek_dsq.bpf.c
@@ -0,0 +1,265 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * A BPF program for testing DSQ operations including create, destroy,
+ * and peek operations. Uses a hybrid approach:
+ * - Syscall program for DSQ lifecycle (create/destroy)
+ * - Struct ops scheduler for task insertion/dequeue testing
+ *
+ * Copyright (c) 2025 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2025 Ryan Newton
+ */
+
+#include
+#include
+
+char _license[] SEC("license") = "GPL";
+
+#define MAX_SAMPLES 100
+#define MAX_CPUS 512
+#define DSQ_POOL_SIZE 8
+int max_samples = MAX_SAMPLES;
+int max_cpus = MAX_CPUS;
+int dsq_pool_size = DSQ_POOL_SIZE;
+
+/* Global variables to store test results */
+int dsq_create_result = -1;
+int dsq_destroy_result = -1;
+int dsq_peek_result1 = -1;
+long dsq_inserted_pid = -1;
+int insert_test_cpu = -1; /* Set to the cpu that performs the test */
+long dsq_peek_result2 = -1;
+long dsq_peek_result2_pid = -1;
+long dsq_peek_result2_expected = -1;
+int test_dsq_id = 1234; /* Use a simple ID like create_dsq example */
+int real_dsq_id = 1235; /* DSQ for normal operation */
+int enqueue_count = -1;
+int dispatch_count = -1;
+int debug_ksym_exists = -1;
+
+/* DSQ pool for stress testing */
+int dsq_pool_base_id = 2000;
+int phase1_complete = -1;
+int total_peek_attempts = -1;
+int successful_peeks = -1;
+
+/* BPF map for sharing peek results with userspace */
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, MAX_SAMPLES);
+	__type(key, u32);
+	__type(value, long);
+} peek_results SEC(".maps");
+
+/* Test if we're actually using the native or compat version */
+int check_dsq_insert_ksym(void)
+{
+	return bpf_ksym_exists(scx_bpf_dsq_insert) ? 1 : 0;
+}
+
+int check_dsq_peek_ksym(void)
+{
+	return bpf_ksym_exists(scx_bpf_dsq_peek) ? 1 : 0;
+}
+
+static inline int get_random_dsq_id(void)
+{
+	u64 time = bpf_ktime_get_ns();
+
+	return dsq_pool_base_id + (time % DSQ_POOL_SIZE);
+}
+
+static inline void record_peek_result(long pid)
+{
+	u32 slot_key;
+	long *slot_pid_ptr;
+	int ix;
+
+	if (pid <= 0)
+		return;
+
+	/* Find an empty slot or one with the same PID */
+	bpf_for(ix, 0, 10) {
+		slot_key = (pid + ix) % MAX_SAMPLES;
+		slot_pid_ptr = bpf_map_lookup_elem(&peek_results, &slot_key);
+		if (!slot_pid_ptr)
+			continue;
+
+		if (*slot_pid_ptr == -1 || *slot_pid_ptr == pid) {
+			*slot_pid_ptr = pid;
+			break;
+		}
+	}
+}
+
+/* Scan all DSQs in the pool and try to move a task to local */
+static inline int scan_dsq_pool(void)
+{
+	struct task_struct *task;
+	int moved = 0;
+	int i;
+
+	bpf_for(i, 0, DSQ_POOL_SIZE) {
+		int dsq_id = dsq_pool_base_id + i;
+
+		total_peek_attempts++;
+
+		task = __COMPAT_scx_bpf_dsq_peek(dsq_id);
+		if (task) {
+			successful_peeks++;
+			record_peek_result(task->pid);
+
+			/* Try to move this task to local */
+			if (!moved && scx_bpf_dsq_move_to_local(dsq_id) == 0) {
+				moved = 1;
+				break;
+			}
+		}
+	}
+	return moved;
+}
+
+/* Struct_ops scheduler for testing DSQ peek operations */
+void BPF_STRUCT_OPS(peek_dsq_enqueue, struct task_struct *p, u64 enq_flags)
+{
+	struct task_struct *peek_result;
+	int last_insert_test_cpu, cpu;
+
+	enqueue_count++;
+	cpu = bpf_get_smp_processor_id();
+	last_insert_test_cpu = __sync_val_compare_and_swap(
+		&insert_test_cpu, -1, cpu);
+
+	/* Phase 1: Simple insert-then-peek test (only on first task) */
+	if (last_insert_test_cpu == -1) {
+		bpf_printk("peek_dsq_enqueue beginning phase 1 peek test on cpu %d\n", cpu);
+
+		/* Test 1: Peek empty DSQ - should return NULL */
+		peek_result = __COMPAT_scx_bpf_dsq_peek(test_dsq_id);
+		dsq_peek_result1 = (long)peek_result; /* Should be 0 (NULL) */
+
+		/* Test 2: Insert task into test DSQ for testing in dispatch callback */
+		dsq_inserted_pid = p->pid;
+		scx_bpf_dsq_insert(p, test_dsq_id, 0, enq_flags);
+		dsq_peek_result2_expected = (long)p; /* Expected the task we just inserted */
+	} else if (!phase1_complete) {
+		/* Still in phase 1, use real DSQ */
+		scx_bpf_dsq_insert(p, real_dsq_id, 0, enq_flags);
+	} else {
+		/* Phase 2: Random DSQ insertion for stress testing */
+		int random_dsq_id = get_random_dsq_id();
+
+		scx_bpf_dsq_insert(p, random_dsq_id, 0, enq_flags);
+	}
+}
+
+void BPF_STRUCT_OPS(peek_dsq_dispatch, s32 cpu, struct task_struct *prev)
+{
+	dispatch_count++;
+
+	/* Phase 1: Complete the simple peek test if we inserted a task but
+	 * haven't tested peek yet
+	 */
+	if (insert_test_cpu == cpu && dsq_peek_result2 == -1) {
+		struct task_struct *peek_result;
+
+		bpf_printk("peek_dsq_dispatch completing phase 1 peek test on cpu %d\n", cpu);
+
+		/* Test 3: Peek DSQ after insert - should return the task we inserted */
+		peek_result = __COMPAT_scx_bpf_dsq_peek(test_dsq_id);
+		/* Store the PID of the peeked task for comparison */
+		dsq_peek_result2 = (long)peek_result;
+		dsq_peek_result2_pid = peek_result ? peek_result->pid : -1;
+
+		/* Now consume the task since we've peeked at it */
+		scx_bpf_dsq_move_to_local(test_dsq_id);
+
+		/* Mark phase 1 as complete */
+		phase1_complete = 1;
+		bpf_printk("Phase 1 complete, starting phase 2 stress testing\n");
+	} else if (!phase1_complete) {
+		/* Still in phase 1, use real DSQ */
+		scx_bpf_dsq_move_to_local(real_dsq_id);
+	} else {
+		/* Phase 2: Scan all DSQs in the pool and try to move a task */
+		if (!scan_dsq_pool()) {
+			/* No tasks found in DSQ pool, fall back to real DSQ */
+			scx_bpf_dsq_move_to_local(real_dsq_id);
+		}
+	}
+}
+
+s32 BPF_STRUCT_OPS_SLEEPABLE(peek_dsq_init)
+{
+	s32 err;
+	int i;
+
+	/* Always set debug values so we can see which version we're using */
+	debug_ksym_exists = check_dsq_peek_ksym();
+
+	/* Initialize state first */
+	insert_test_cpu = -1;
+	enqueue_count = 0;
+	dispatch_count = 0;
+	phase1_complete = 0;
+	total_peek_attempts = 0;
+	successful_peeks = 0;
+	dsq_create_result = 0; /* Reset to 0 before attempting */
+
+	/* Create the test and real DSQs */
+	err = scx_bpf_create_dsq(test_dsq_id, -1);
+	if (!err)
+		err = scx_bpf_create_dsq(real_dsq_id, -1);
+	if (err) {
+		dsq_create_result = err;
+		scx_bpf_error("Failed to create primary DSQ %d: %d", test_dsq_id, err);
+		return err;
+	}
+
+	/* Create the DSQ pool for stress testing */
+	bpf_for(i, 0, DSQ_POOL_SIZE) {
+		int dsq_id = dsq_pool_base_id + i;
+
+		err = scx_bpf_create_dsq(dsq_id, -1);
+		if (err) {
+			dsq_create_result = err;
+			scx_bpf_error("Failed to create DSQ pool entry %d: %d", dsq_id, err);
+			return err;
+		}
+	}
+
+	dsq_create_result = 1; /* Success */
+
+	/* Initialize the peek results map */
+	bpf_for(i, 0, MAX_SAMPLES) {
+		u32 key = i;
+		long pid = -1;
+
+		bpf_map_update_elem(&peek_results, &key, &pid, BPF_ANY);
+	}
+
+	return 0;
+}
+
+void BPF_STRUCT_OPS(peek_dsq_exit, struct scx_exit_info *ei)
+{
+	int i;
+
+	scx_bpf_destroy_dsq(test_dsq_id);
+	scx_bpf_destroy_dsq(real_dsq_id);
+	bpf_for(i, 0, DSQ_POOL_SIZE) {
+		int dsq_id = dsq_pool_base_id + i;
+
+		scx_bpf_destroy_dsq(dsq_id);
+	}
+
+	dsq_destroy_result = 1;
+}
+
+SEC(".struct_ops.link")
+struct sched_ext_ops peek_dsq_ops = {
+	.enqueue = (void *)peek_dsq_enqueue,
+	.dispatch = (void *)peek_dsq_dispatch,
+	.init = (void *)peek_dsq_init,
+	.exit = (void *)peek_dsq_exit,
+	.name = "peek_dsq",
+};
diff --git a/tools/testing/selftests/sched_ext/peek_dsq.c b/tools/testing/selftests/sched_ext/peek_dsq.c
new file mode 100644
index 000000000000..182dbdce2400
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/peek_dsq.c
@@ -0,0 +1,230 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test for DSQ operations including create, destroy, and peek operations.
+ *
+ * Copyright (c) 2025 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2025 Ryan Newton
+ */
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include "peek_dsq.bpf.skel.h"
+#include "scx_test.h"
+
+#define NUM_WORKERS 100
+
+static bool workload_running = true;
+static pthread_t workload_threads[NUM_WORKERS];
+
+/**
+ * Background workload thread that sleeps and wakes rapidly to exercise
+ * the scheduler's enqueue operations and ensure DSQ operations get tested.
+ */
+static void *workload_thread_fn(void *arg)
+{
+	while (workload_running) {
+		/* Sleep for a very short time to trigger scheduler activity */
+		usleep(1000); /* 1ms sleep */
+		sched_yield();
+	}
+	return NULL;
+}
+
+static enum scx_test_status setup(void **ctx)
+{
+	struct peek_dsq *skel;
+
+	skel = peek_dsq__open();
+	SCX_FAIL_IF(!skel, "Failed to open");
+	SCX_ENUM_INIT(skel);
+	SCX_FAIL_IF(peek_dsq__load(skel), "Failed to load skel");
+
+	*ctx = skel;
+
+	return SCX_TEST_PASS;
+}
+
+static int print_observed_pids(struct bpf_map *map, int max_samples, const char *dsq_name)
+{
+	long count = 0;
+
+	printf("Observed %s DSQ peek pids:\n", dsq_name);
+	for (int i = 0; i < max_samples; i++) {
+		long pid;
+		int err;
+
+		err = bpf_map_lookup_elem(bpf_map__fd(map), &i, &pid);
+		if (err == 0) {
+			if (pid == 0) {
+				printf(" Sample %d: NULL peek\n", i);
+			} else if (pid > 0) {
+				printf(" Sample %d: pid %ld\n", i, pid);
+				count++;
+			}
+		} else {
+			printf(" Sample %d: error reading pid (err=%d)\n", i, err);
+		}
+	}
+	printf("Observed ~%ld pids in the %s DSQ(s)\n", count, dsq_name);
+	return count;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+	struct peek_dsq *skel = ctx;
+	bool failed = false;
+	int seconds = 3;
+	int err;
+
+	printf("Enabling scheduler to test DSQ insert operations...\n");
+
+	struct bpf_link *link =
+		bpf_map__attach_struct_ops(skel->maps.peek_dsq_ops);
+
+	if (!link) {
+		SCX_ERR("Failed to attach struct_ops");
+		return SCX_TEST_FAIL;
+	}
+
+	printf("Starting %d background workload threads...\n", NUM_WORKERS);
+	workload_running = true;
+	for (int i = 0; i < NUM_WORKERS; i++) {
+		err = pthread_create(&workload_threads[i], NULL, workload_thread_fn, NULL);
+		if (err) {
+			SCX_ERR("Failed to create workload thread %d: %s", i, strerror(err));
+			/* Stop already-created threads */
+			workload_running = false;
+			for (int j = 0; j < i; j++)
+				pthread_join(workload_threads[j], NULL);
+			bpf_link__destroy(link);
+			return SCX_TEST_FAIL;
+		}
+	}
+
+	printf("Waiting for enqueue events.\n");
+	sleep(seconds);
+	while (skel->data->enqueue_count <= 0) {
+		printf(".");
+		fflush(stdout);
+		sleep(1);
+		seconds++;
+		if (seconds >= 30) {
+			printf("\n\u2717 Timeout waiting for enqueue events\n");
+			/* Stop workload threads and cleanup */
+			workload_running = false;
+			for (int i = 0; i < NUM_WORKERS; i++)
+				pthread_join(workload_threads[i], NULL);
+			bpf_link__destroy(link);
+			return SCX_TEST_FAIL;
+		}
+	}
+
+	workload_running = false;
+	for (int i = 0; i < NUM_WORKERS; i++) {
+		err = pthread_join(workload_threads[i], NULL);
+		if (err) {
+			SCX_ERR("Failed to join workload thread %d: %s", i, strerror(err));
+			bpf_link__destroy(link);
+			return SCX_TEST_FAIL;
+		}
+	}
+	printf("Background workload threads stopped.\n");
+
+	/* Detach the scheduler */
+	bpf_link__destroy(link);
+
+	if (skel->data->dsq_create_result != 1) {
+		printf("\u2717 DSQ create failed: got %d, expected 1\n",
+		       skel->data->dsq_create_result);
+		failed = true;
+	} else {
+		printf("\u2713 DSQ create succeeded\n");
+	}
+
+	printf("Enqueue/dispatch count over %d seconds: %d / %d\n", seconds,
+	       skel->data->enqueue_count, skel->data->dispatch_count);
+	printf("Debug: ksym_exists=%d\n",
+	       skel->data->debug_ksym_exists);
+
+	printf("DSQ insert test done on cpu: %d\n", skel->data->insert_test_cpu);
+	if (skel->data->insert_test_cpu != -1)
+		printf("\u2713 DSQ insert succeeded !\n");
+	else {
+		printf("\u2717 DSQ insert failed or not attempted\n");
+		failed = true;
+	}
+
+	printf(" DSQ peek result 1 (before insert): %d\n",
+	       skel->data->dsq_peek_result1);
+	if (skel->data->dsq_peek_result1 == 0)
+		printf("\u2713 DSQ peek verification success: peek returned NULL!\n");
+	else {
+		printf("\u2717 DSQ peek verification failed\n");
+		failed = true;
+	}
+
+	printf(" DSQ peek result 2 (after insert): %ld\n",
+	       skel->data->dsq_peek_result2);
+	printf(" DSQ peek result 2, expected: %ld\n",
+	       skel->data->dsq_peek_result2_expected);
+	if (skel->data->dsq_peek_result2 ==
+	    skel->data->dsq_peek_result2_expected)
+		printf("\u2713 DSQ peek verification success: peek returned the inserted task!\n");
+	else {
+		printf("\u2717 DSQ peek verification failed\n");
+		failed = true;
+	}
+
+	printf(" Inserted test task -> pid: %ld\n", skel->data->dsq_inserted_pid);
+	printf(" DSQ peek result 2 -> pid: %ld\n", skel->data->dsq_peek_result2_pid);
+
+	if (skel->data->dsq_destroy_result != 1) {
+		printf("\u2717 DSQ destroy failed: got %d, expected 1\n",
+		       skel->data->dsq_destroy_result);
+		failed = true;
+	}
+
+	int pid_count;
+
+	pid_count = print_observed_pids(skel->maps.peek_results,
+					skel->data->max_samples, "DSQ pool");
+
+	if (skel->data->debug_ksym_exists && pid_count == 0) {
+		printf("\u2717 DSQ pool test failed: no successful peeks in native mode\n");
+		failed = true;
+	}
+	if (skel->data->debug_ksym_exists && pid_count > 0)
+		printf("\u2713 DSQ pool test success: observed successful peeks in native mode\n");
+
+	if (failed)
+		return SCX_TEST_FAIL;
+	else
+		return SCX_TEST_PASS;
+}
+
+static void cleanup(void *ctx)
+{
+	struct peek_dsq *skel = ctx;
+
+	if (workload_running) {
+		workload_running = false;
+		for (int i = 0; i < NUM_WORKERS; i++)
+			pthread_join(workload_threads[i], NULL);
+	}
+
+	peek_dsq__destroy(skel);
+}
+
+struct scx_test peek_dsq = {
+	.name = "peek_dsq",
+	.description =
+		"Test DSQ create/destroy operations and future peek functionality",
+	.setup = setup,
+	.run = run,
+	.cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&peek_dsq)
-- 
2.51.0
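
For readers who want to see the first_task idea from patch 1/2 in isolation,
the following is a minimal userspace sketch, not kernel code. Every name in
it (struct node, struct queue, queue_push_tail, queue_pop_head, queue_peek)
is hypothetical and made up for illustration. It assumes a pthread mutex in
place of the DSQ raw spinlock and C11 release/acquire atomics in place of
rcu_assign_pointer()/rcu_dereference(), and it does not model RCU lifetime
protection of the peeked object at all.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for a task and a DSQ; not kernel types. */
struct node {
	int pid;
	struct node *next;
};

struct queue {
	pthread_mutex_t lock;          /* plays the role of dsq->lock */
	struct node *head, *tail;      /* only touched with lock held */
	_Atomic(struct node *) first;  /* plays the role of dsq->first_task */
};

static void queue_init(struct queue *q)
{
	pthread_mutex_init(&q->lock, NULL);
	q->head = q->tail = NULL;
	atomic_init(&q->first, NULL);
}

/* FIFO enqueue: the cached head only changes if the queue was empty. */
static void queue_push_tail(struct queue *q, struct node *n)
{
	assert(n != NULL);
	pthread_mutex_lock(&q->lock);
	n->next = NULL;
	if (q->head == NULL) {
		q->head = q->tail = n;
		atomic_store_explicit(&q->first, n, memory_order_release);
	} else {
		q->tail->next = n;
		q->tail = n;
	}
	pthread_mutex_unlock(&q->lock);
}

/* Dequeue: unlink the old head first, then publish the new head. */
static struct node *queue_pop_head(struct queue *q)
{
	struct node *n;

	pthread_mutex_lock(&q->lock);
	n = q->head;
	if (n != NULL) {
		q->head = n->next;
		if (q->head == NULL)
			q->tail = NULL;
		atomic_store_explicit(&q->first, q->head, memory_order_release);
	}
	pthread_mutex_unlock(&q->lock);
	return n;
}

/* Lockless peek: a single acquire load, no lock taken. */
static struct node *queue_peek(struct queue *q)
{
	return atomic_load_explicit(&q->first, memory_order_acquire);
}
```

In this sketch the value returned by queue_peek() is only a hint: another
thread may dequeue the node right after the load, so a caller that wants to
act on it still has to take the lock and re-check. The sketch builds with a
C11 compiler and -pthread.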