From nobody Sun Sep 14 06:36:59 2025
From: Waiman Long
To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
Cc: linux-kernel@vger.kernel.org, john.p.donnelly@oracle.com, Hillf Danton, Mukesh Ojha, Ting11 Wang 王婷, Waiman Long, stable@vger.kernel.org
Subject: [PATCH v7 1/4] locking/rwsem: Prevent non-first waiter from spinning in down_write() slowpath
Date: Wed, 25 Jan 2023 19:36:25 -0500
Message-Id: <20230126003628.365092-2-longman@redhat.com>
In-Reply-To: <20230126003628.365092-1-longman@redhat.com>
References: <20230126003628.365092-1-longman@redhat.com>

A non-first waiter can potentially spin in the for loop of
rwsem_down_write_slowpath() without sleeping but fail to acquire the
lock even if the rwsem is free, if the following sequence happens:

  Non-first RT waiter    First waiter      Lock holder
  -------------------    ------------      -----------
  Acquire wait_lock
  rwsem_try_write_lock():
    Set handoff bit if RT or
      wait too long
    Set waiter->handoff_set
  Release wait_lock
                         Acquire wait_lock
                         Inherit waiter->handoff_set
                         Release wait_lock
                                           Clear owner
                                           Release lock
  if (waiter.handoff_set) {
    rwsem_spin_on_owner();
    if (OWNER_NULL)
      goto trylock_again;
  }
  trylock_again:
  Acquire wait_lock
  rwsem_try_write_lock():
    if (first->handoff_set && (waiter != first))
      return false;
  Release wait_lock

A non-first waiter cannot really acquire the rwsem even if it
mistakenly believes that it can spin on the OWNER_NULL value. If that
waiter happens to be an RT task running on the same CPU as the first
waiter, it can block the first waiter from acquiring the rwsem,
leading to a live lock.

Fix this problem by making sure that a non-first waiter cannot spin in
the slowpath loop without sleeping.
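For reference, the wait loop in question, reduced to a minimal sketch
(illustrative only; the real loop in rwsem_down_write_slowpath() also
carries signal handling, task-state changes and lock-event accounting):

  for (;;) {
          if (rwsem_try_write_lock(sem, &waiter))
                  break;                       /* lock acquired */
          raw_spin_unlock_irq(&sem->wait_lock);

          if (waiter.handoff_set) {            /* can also be true for a non-first waiter */
                  if (rwsem_spin_on_owner(sem) == OWNER_NULL)
                          goto trylock_again;  /* loop again without sleeping */
          }
          schedule();
  trylock_again:
          raw_spin_lock_irq(&sem->wait_lock);
  }

Because handoff_set could be set for a waiter that is not at the head
of the wait queue, rwsem_try_write_lock() keeps failing for it while
the goto path keeps it from ever reaching schedule().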
Fixes: d257cc8cb8d5 ("locking/rwsem: Make handoff bit handling more consistent")
Reviewed-and-tested-by: Mukesh Ojha
Signed-off-by: Waiman Long
Cc: stable@vger.kernel.org
---
 kernel/locking/rwsem.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 44873594de03..be2df9ea7c30 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -624,18 +624,16 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
 			 */
 			if (first->handoff_set && (waiter != first))
 				return false;
-
-			/*
-			 * First waiter can inherit a previously set handoff
-			 * bit and spin on rwsem if lock acquisition fails.
-			 */
-			if (waiter == first)
-				waiter->handoff_set = true;
 		}
 
 		new = count;
 
 		if (count & RWSEM_LOCK_MASK) {
+			/*
+			 * A waiter (first or not) can set the handoff bit
+			 * if it is an RT task or wait in the wait queue
+			 * for too long.
+			 */
 			if (has_handoff || (!rt_task(waiter->task) &&
 					    !time_after(jiffies, waiter->timeout)))
 				return false;
@@ -651,11 +649,12 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
 	} while (!atomic_long_try_cmpxchg_acquire(&sem->count, &count, new));
 
 	/*
-	 * We have either acquired the lock with handoff bit cleared or
-	 * set the handoff bit.
+	 * We have either acquired the lock with handoff bit cleared or set
+	 * the handoff bit. Only the first waiter can have its handoff_set
+	 * set here to enable optimistic spinning in slowpath loop.
 	 */
 	if (new & RWSEM_FLAG_HANDOFF) {
-		waiter->handoff_set = true;
+		first->handoff_set = true;
 		lockevent_inc(rwsem_wlock_handoff);
 		return false;
 	}
-- 
2.31.1
From nobody Sun Sep 14 06:36:59 2025
From: Waiman Long
To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
Cc: linux-kernel@vger.kernel.org, john.p.donnelly@oracle.com, Hillf Danton, Mukesh Ojha, Ting11 Wang 王婷, Waiman Long
Subject: [PATCH v7 2/4] locking/rwsem: Disable preemption at all down_read*() and up_read() code paths
Date: Wed, 25 Jan 2023 19:36:26 -0500
Message-Id: <20230126003628.365092-3-longman@redhat.com>
In-Reply-To: <20230126003628.365092-1-longman@redhat.com>
References: <20230126003628.365092-1-longman@redhat.com>

Commit 91d2a812dfb9 ("locking/rwsem: Make handoff writer optimistically
spin on owner") assumes that when the owner field is changed to NULL,
the lock will become free soon. Commit 48dfb5d2560d ("locking/rwsem:
Disable preemption while trying for rwsem lock") disables preemption
when acquiring the rwsem for write. However, preemption is not yet
disabled when acquiring a read lock on an rwsem.

So a reader can add RWSEM_READER_BIAS to the count without setting the
owner field to signal a reader, and then get preempted by an RT task.
That RT task then spins in the writer slowpath, because the owner
remains NULL, leading to a live lock.

One easy way to fix this problem is to disable preemption at all the
down_read*() and up_read() code paths, as implemented in this patch.
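The window can be seen in a condensed sketch of the reader acquisition
path (illustrative only; down_read_sketch() is a made-up name, the real
logic lives in rwsem_read_trylock() and __down_read_common()):

  static inline void down_read_sketch(struct rw_semaphore *sem)
  {
          long cnt;

          /* Make the rwsem look read-locked... */
          cnt = atomic_long_add_return_acquire(RWSEM_READER_BIAS, &sem->count);

          /*
           * <-- Preemption by an RT task here leaves owner == NULL while
           *     RWSEM_READER_BIAS is already in the count.  A writer
           *     spinning in the slowpath keeps seeing a NULL owner and,
           *     if it runs on the same CPU, the preempted reader never
           *     gets to finish or release the lock: a live lock.
           */

          if (!(cnt & RWSEM_READ_FAILED_MASK))
                  rwsem_set_reader_owned(sem);    /* ...before recording an owner */
  }

Disabling preemption across the whole sequence, as the diff below does,
closes that window.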
Fixes: 91d2a812dfb9 ("locking/rwsem: Make handoff writer optimistically spin on owner")
Reported-by: Mukesh Ojha
Suggested-by: Peter Zijlstra
Signed-off-by: Waiman Long
---
 kernel/locking/rwsem.c | 30 ++++++++++++++++++++++++------
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index be2df9ea7c30..84d5b649b95f 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1091,7 +1091,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, unsigned int stat
 			/* Ordered by sem->wait_lock against rwsem_mark_wake(). */
 			break;
 		}
-		schedule();
+		schedule_preempt_disabled();
 		lockevent_inc(rwsem_sleep_reader);
 	}
 
@@ -1253,14 +1253,20 @@ static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
  */
 static inline int __down_read_common(struct rw_semaphore *sem, int state)
 {
+	int ret = 0;
 	long count;
 
+	preempt_disable();
 	if (!rwsem_read_trylock(sem, &count)) {
-		if (IS_ERR(rwsem_down_read_slowpath(sem, count, state)))
-			return -EINTR;
+		if (IS_ERR(rwsem_down_read_slowpath(sem, count, state))) {
+			ret = -EINTR;
+			goto out;
+		}
 		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	}
-	return 0;
+out:
+	preempt_enable();
+	return ret;
 }
 
 static inline void __down_read(struct rw_semaphore *sem)
@@ -1280,19 +1286,23 @@ static inline int __down_read_killable(struct rw_semaphore *sem)
 
 static inline int __down_read_trylock(struct rw_semaphore *sem)
 {
+	int ret = 0;
 	long tmp;
 
 	DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);
 
+	preempt_disable();
 	tmp = atomic_long_read(&sem->count);
 	while (!(tmp & RWSEM_READ_FAILED_MASK)) {
 		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
 						    tmp + RWSEM_READER_BIAS)) {
 			rwsem_set_reader_owned(sem);
-			return 1;
+			ret = 1;
+			break;
 		}
 	}
-	return 0;
+	preempt_enable();
+	return ret;
 }
 
 /*
@@ -1334,6 +1344,7 @@ static inline void __up_read(struct rw_semaphore *sem)
 	DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);
 	DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 
+	preempt_disable();
 	rwsem_clear_reader_owned(sem);
 	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
 	DEBUG_RWSEMS_WARN_ON(tmp < 0, sem);
@@ -1342,6 +1353,7 @@ static inline void __up_read(struct rw_semaphore *sem)
 		clear_nonspinnable(sem);
 		rwsem_wake(sem);
 	}
+	preempt_enable();
 }
 
 /*
@@ -1661,6 +1673,12 @@ void down_read_non_owner(struct rw_semaphore *sem)
 {
 	might_sleep();
 	__down_read(sem);
+	/*
+	 * The owner value for a reader-owned lock is mostly for debugging
+	 * purpose only and is not critical to the correct functioning of
+	 * rwsem. So it is perfectly fine to set it in a preempt-enabled
+	 * context here.
+	 */
 	__rwsem_set_reader_owned(sem, NULL);
 }
 EXPORT_SYMBOL(down_read_non_owner);
-- 
2.31.1
From nobody Sun Sep 14 06:36:59 2025
From: Waiman Long
To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
Cc: linux-kernel@vger.kernel.org, john.p.donnelly@oracle.com, Hillf Danton, Mukesh Ojha, Ting11 Wang 王婷, Waiman Long
Subject: [PATCH v7 3/4] locking/rwsem: Disable preemption at all down_write*() and up_write() code paths
Date: Wed, 25 Jan 2023 19:36:27 -0500
Message-Id: <20230126003628.365092-4-longman@redhat.com>
In-Reply-To: <20230126003628.365092-1-longman@redhat.com>
References: <20230126003628.365092-1-longman@redhat.com>

The previous patch has disabled preemption at all the down_read() and
up_read() code paths. For symmetry, this patch extends commit
48dfb5d2560d ("locking/rwsem: Disable preemption while trying for rwsem
lock") to have preemption disabled at all the down_write() and
up_write() code paths, including downgrade_write().
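With preemption now disabled at the outermost write-lock entry points,
the inner helpers (rwsem_write_trylock(), rwsem_can_spin_on_owner(),
rwsem_optimistic_spin() and rwsem_spin_on_owner()) can drop their own
preempt_disable()/preempt_enable() pairs, as the diff below does. The
resulting shape of __down_write_common(), shown as a minimal sketch of
the pattern:

  static inline int __down_write_common(struct rw_semaphore *sem, int state)
  {
          int ret = 0;

          preempt_disable();      /* covers trylock, optimistic spin and slowpath */
          if (unlikely(!rwsem_write_trylock(sem))) {
                  if (IS_ERR(rwsem_down_write_slowpath(sem, state)))
                          ret = -EINTR;
          }
          preempt_enable();
          return ret;
  }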
Suggested-by: Peter Zijlstra
Signed-off-by: Waiman Long
---
 kernel/locking/rwsem.c | 38 +++++++++++++++++++-------------------
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 84d5b649b95f..acb5a50309a1 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -256,16 +256,13 @@ static inline bool rwsem_read_trylock(struct rw_semaphore *sem, long *cntp)
 static inline bool rwsem_write_trylock(struct rw_semaphore *sem)
 {
 	long tmp = RWSEM_UNLOCKED_VALUE;
-	bool ret = false;
 
-	preempt_disable();
 	if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp, RWSEM_WRITER_LOCKED)) {
 		rwsem_set_owner(sem);
-		ret = true;
+		return true;
 	}
 
-	preempt_enable();
-	return ret;
+	return false;
 }
 
 /*
@@ -716,7 +713,6 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 		return false;
 	}
 
-	preempt_disable();
 	/*
 	 * Disable preemption is equal to the RCU read-side crital section,
 	 * thus the task_strcut structure won't go away.
@@ -728,7 +724,6 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 	if ((flags & RWSEM_NONSPINNABLE) ||
 	    (owner && !(flags & RWSEM_READER_OWNED) && !owner_on_cpu(owner)))
 		ret = false;
-	preempt_enable();
 
 	lockevent_cond_inc(rwsem_opt_fail, !ret);
 	return ret;
@@ -828,8 +823,6 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 	int loop = 0;
 	u64 rspin_threshold = 0;
 
-	preempt_disable();
-
 	/* sem->wait_lock should not be held when doing optimistic spinning */
 	if (!osq_lock(&sem->osq))
 		goto done;
@@ -937,7 +930,6 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 	}
 	osq_unlock(&sem->osq);
 done:
-	preempt_enable();
 	lockevent_cond_inc(rwsem_opt_fail, !taken);
 	return taken;
 }
@@ -1178,15 +1170,12 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 		if (waiter.handoff_set) {
 			enum owner_state owner_state;
 
-			preempt_disable();
 			owner_state = rwsem_spin_on_owner(sem);
-			preempt_enable();
-
 			if (owner_state == OWNER_NULL)
 				goto trylock_again;
 		}
 
-		schedule();
+		schedule_preempt_disabled();
 		lockevent_inc(rwsem_sleep_writer);
 		set_current_state(state);
 trylock_again:
@@ -1310,12 +1299,15 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
  */
 static inline int __down_write_common(struct rw_semaphore *sem, int state)
 {
+	int ret = 0;
+
+	preempt_disable();
 	if (unlikely(!rwsem_write_trylock(sem))) {
 		if (IS_ERR(rwsem_down_write_slowpath(sem, state)))
-			return -EINTR;
+			ret = -EINTR;
 	}
-
-	return 0;
+	preempt_enable();
+	return ret;
 }
 
 static inline void __down_write(struct rw_semaphore *sem)
@@ -1330,8 +1322,14 @@ static inline int __down_write_killable(struct rw_semaphore *sem)
 
 static inline int __down_write_trylock(struct rw_semaphore *sem)
 {
+	int ret;
+
+	preempt_disable();
 	DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);
-	return rwsem_write_trylock(sem);
+	ret = rwsem_write_trylock(sem);
+	preempt_enable();
+
+	return ret;
 }
 
 /*
@@ -1374,9 +1372,9 @@ static inline void __up_write(struct rw_semaphore *sem)
 	preempt_disable();
 	rwsem_clear_owner(sem);
 	tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
-	preempt_enable();
 	if (unlikely(tmp & RWSEM_FLAG_WAITERS))
 		rwsem_wake(sem);
+	preempt_enable();
 }
 
 /*
@@ -1394,11 +1392,13 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
 	 * write side. As such, rely on RELEASE semantics.
 	 */
 	DEBUG_RWSEMS_WARN_ON(rwsem_owner(sem) != current, sem);
+	preempt_disable();
 	tmp = atomic_long_fetch_add_release(
 		-RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
 	rwsem_set_reader_owned(sem);
 	if (tmp & RWSEM_FLAG_WAITERS)
 		rwsem_downgrade_wake(sem);
+	preempt_enable();
 }
 
 #else /* !CONFIG_PREEMPT_RT */
-- 
2.31.1
From nobody Sun Sep 14 06:36:59 2025
From: Waiman Long
To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
Cc: linux-kernel@vger.kernel.org, john.p.donnelly@oracle.com, Hillf Danton, Mukesh Ojha, Ting11 Wang 王婷, Waiman Long
Subject: [PATCH v7 4/4] locking/rwsem: Enable direct rwsem lock handoff
Date: Wed, 25 Jan 2023 19:36:28 -0500
Message-Id: <20230126003628.365092-5-longman@redhat.com>
In-Reply-To: <20230126003628.365092-1-longman@redhat.com>
References: <20230126003628.365092-1-longman@redhat.com>

The lock handoff provided in rwsem isn't a true handoff like that in
the mutex. Instead, it is more like a quiescent state where optimistic
spinning and lock stealing are disabled to make it easier for the first
waiter to acquire the lock.

For mutex, lock handoff is done at unlock time as the owner value and
the handoff bit are in the same lock word and can be updated
atomically. That is not the case for rwsem, which has a count value
for locking and a separate owner value for storing the lock owner. In
addition, the handoff processing differs depending on whether the first
waiter is a writer or a reader. We can only make that waiter-type
determination after acquiring the wait_lock. Together with the fact
that the RWSEM_FLAG_HANDOFF bit is stable while holding the wait_lock,
the most convenient place to do the handoff is at rwsem_wake(), where
the wait_lock has to be acquired anyway.

Since a lot can happen between unlock time and after acquiring the
wait_lock in rwsem_wake(), we have to reconfirm the presence of the
handoff bit and that the lock is free before doing the handoff.

Handing off to a reader is already handled well by rwsem_mark_wake();
we don't need to do anything extra other than disabling optimistic
spinning. For a writer, additional code is added to pass the lock
ownership to it. The waiter is removed from the wait queue and
waiter->task is cleared in this case to signal that the handoff has
happened. This is similar to what rwsem_mark_wake() does for readers
whether a handoff has happened or not.

Running a 96-thread rwsem locking test on a 96-thread x86-64 system,
the locking throughput increases slightly from 588 kops/s to 592 kops/s
with this change.
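Condensed, the handoff path added to rwsem_wake() looks as follows (a
simplified excerpt of the diff below; the surrounding empty-list check,
irq-safe locking and wake_q processing are omitted):

  /* Called with sem->wait_lock held, so RWSEM_FLAG_HANDOFF is stable. */
  count = atomic_long_read(&sem->count);
  if (!(count & RWSEM_LOCK_MASK) && (count & RWSEM_FLAG_HANDOFF)) {
          struct rwsem_waiter *waiter = rwsem_first_waiter(sem);

          if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
                  /* Grant the write lock directly to the first waiter. */
                  atomic_long_set(&sem->owner, (long)waiter->task);
                  atomic_long_add(RWSEM_WRITER_LOCKED - RWSEM_FLAG_HANDOFF,
                                  &sem->count);
                  wake_q_add(&wake_q, waiter->task);
                  rwsem_del_waiter(sem, waiter);
                  waiter->task = NULL;    /* tell the waiter the handoff happened */
          }
  }

The woken writer then notices the cleared waiter.task in
rwsem_try_write_lock() or after rwsem_spin_on_owner() and knows that it
already owns the lock.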
Signed-off-by: Waiman Long
---
 kernel/locking/rwsem.c | 74 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 63 insertions(+), 11 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index acb5a50309a1..2cf1e0bfdaa5 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -40,7 +40,7 @@
  *
  * When the rwsem is reader-owned and a spinning writer has timed out,
  * the nonspinnable bit will be set to disable optimistic spinning.
-
+ *
  * When a writer acquires a rwsem, it puts its task_struct pointer
  * into the owner field. It is cleared after an unlock.
  *
@@ -430,6 +430,10 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 			 * Mark writer at the front of the queue for wakeup.
 			 * Until the task is actually later awoken later by
 			 * the caller, other writers are able to steal it.
+			 *
+			 * *Unless* HANDOFF is set, in which case only the
+			 * first waiter is allowed to take it.
+			 *
 			 * Readers, on the other hand, will block as they
 			 * will notice the queued writer.
 			 */
@@ -467,7 +471,12 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 				adjustment -= RWSEM_FLAG_HANDOFF;
 				lockevent_inc(rwsem_rlock_handoff);
 			}
+			/*
+			 * With HANDOFF set for reader, we must
+			 * terminate all spinning.
+			 */
 			waiter->handoff_set = true;
+			rwsem_set_nonspinnable(sem);
 
 			atomic_long_add(-adjustment, &sem->count);
@@ -609,6 +618,12 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
 
 	lockdep_assert_held(&sem->wait_lock);
 
+	if (!waiter->task) {
+		/* Write lock handed off */
+		smp_acquire__after_ctrl_dep();
+		return true;
+	}
+
 	count = atomic_long_read(&sem->count);
 	do {
 		bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
@@ -754,6 +769,10 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
 
 	owner = rwsem_owner_flags(sem, &flags);
 	state = rwsem_owner_state(owner, flags);
+
+	if (owner == current)
+		return OWNER_NONSPINNABLE;	/* Handoff granted */
+
 	if (state != OWNER_WRITER)
 		return state;
 
@@ -844,7 +863,6 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 		 * Try to acquire the lock
 		 */
 		taken = rwsem_try_write_lock_unqueued(sem);
-
 		if (taken)
 			break;
 
@@ -1168,21 +1186,23 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 		 * without sleeping.
 		 */
 		if (waiter.handoff_set) {
-			enum owner_state owner_state;
-
-			owner_state = rwsem_spin_on_owner(sem);
-			if (owner_state == OWNER_NULL)
-				goto trylock_again;
+			rwsem_spin_on_owner(sem);
+			if (!READ_ONCE(waiter.task)) {
+				/* Write lock handed off */
+				smp_acquire__after_ctrl_dep();
+				set_current_state(TASK_RUNNING);
+				goto out;
+			}
 		}
 
 		schedule_preempt_disabled();
 		lockevent_inc(rwsem_sleep_writer);
 		set_current_state(state);
-trylock_again:
 		raw_spin_lock_irq(&sem->wait_lock);
 	}
 	__set_current_state(TASK_RUNNING);
 	raw_spin_unlock_irq(&sem->wait_lock);
+out:
 	lockevent_inc(rwsem_wlock);
 	trace_contention_end(sem, 0);
 	return sem;
@@ -1190,6 +1210,11 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 out_nolock:
 	__set_current_state(TASK_RUNNING);
 	raw_spin_lock_irq(&sem->wait_lock);
+	if (!waiter.task) {
+		smp_acquire__after_ctrl_dep();
+		raw_spin_unlock_irq(&sem->wait_lock);
+		goto out;
+	}
 	rwsem_del_wake_waiter(sem, &waiter, &wake_q);
 	lockevent_inc(rwsem_wlock_fail);
 	trace_contention_end(sem, -EINTR);
@@ -1202,14 +1227,41 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
  */
 static struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
 {
-	unsigned long flags;
 	DEFINE_WAKE_Q(wake_q);
+	unsigned long flags;
+	unsigned long count;
 
 	raw_spin_lock_irqsave(&sem->wait_lock, flags);
 
-	if (!list_empty(&sem->wait_list))
-		rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+	if (list_empty(&sem->wait_list))
+		goto unlock_out;
+
+	/*
+	 * If the rwsem is free and handoff flag is set with wait_lock held,
+	 * no other CPUs can take an active lock.
+	 */
+	count = atomic_long_read(&sem->count);
+	if (!(count & RWSEM_LOCK_MASK) && (count & RWSEM_FLAG_HANDOFF)) {
+		/*
+		 * Since rwsem_mark_wake() will handle the handoff to reader
+		 * properly, we don't need to do anything extra for reader.
+		 * Special handoff processing will only be needed for writer.
+		 */
+		struct rwsem_waiter *waiter = rwsem_first_waiter(sem);
+		long adj = RWSEM_WRITER_LOCKED - RWSEM_FLAG_HANDOFF;
+
+		if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
+			atomic_long_set(&sem->owner, (long)waiter->task);
+			atomic_long_add(adj, &sem->count);
+			wake_q_add(&wake_q, waiter->task);
+			rwsem_del_waiter(sem, waiter);
+			waiter->task = NULL;	/* Signal the handoff */
+			goto unlock_out;
+		}
+	}
+	rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
 
+unlock_out:
 	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
 	wake_up_q(&wake_q);
 
-- 
2.31.1