From: Waiman Long <longman@redhat.com>
To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
Cc: linux-kernel@vger.kernel.org, Hillf Danton, Waiman Long
Subject: [PATCH v2 1/3] locking/rwsem: Minor code refactoring in rwsem_mark_wake()
Date: Thu, 16 Feb 2023 16:09:31 -0500
Message-Id: <20230216210933.1169097-2-longman@redhat.com>
In-Reply-To: <20230216210933.1169097-1-longman@redhat.com>
References: <20230216210933.1169097-1-longman@redhat.com>

Rename "oldcount" to "count" since it does not always hold the old
count value. Also do some minor code refactoring to reduce
indentation. There is no functional change.
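
For illustration, the refactoring follows the usual guard-clause
pattern. A minimal userspace sketch of the reshaping (hypothetical
stand-in names, not the actual kernel code):

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for the writer/reader wakeup work. */
static void wake_writer(void)      { puts("wake one writer"); }
static void wake_all_readers(void) { puts("wake readers"); }

/* Before: the writer-wakeup work sits two indentation levels deep. */
static void mark_wake_nested(bool first_is_writer, bool wake_any)
{
	if (first_is_writer) {
		if (wake_any)
			wake_writer();
		return;
	}
	wake_all_readers();
}

/* After: branch to a label early, pulling the common path leftward. */
static void mark_wake_flat(bool first_is_writer, bool wake_any)
{
	if (!first_is_writer)
		goto wake_readers;

	if (wake_any)
		wake_writer();
	return;

wake_readers:
	wake_all_readers();
}

int main(void)
{
	mark_wake_nested(true, true);	/* prints "wake one writer" */
	mark_wake_flat(false, false);	/* prints "wake readers" */
	return 0;
}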

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 44 +++++++++++++++++++++---------------------
 1 file changed, 22 insertions(+), 22 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index acb5a50309a1..e589f69793df 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -40,7 +40,7 @@
  *
  * When the rwsem is reader-owned and a spinning writer has timed out,
  * the nonspinnable bit will be set to disable optimistic spinning.
-
+ *
  * When a writer acquires a rwsem, it puts its task_struct pointer
  * into the owner field. It is cleared after an unlock.
  *
@@ -413,7 +413,7 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 			    struct wake_q_head *wake_q)
 {
 	struct rwsem_waiter *waiter, *tmp;
-	long oldcount, woken = 0, adjustment = 0;
+	long count, woken = 0, adjustment = 0;
 	struct list_head wlist;
 
 	lockdep_assert_held(&sem->wait_lock);
@@ -424,22 +424,23 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 	 */
 	waiter = rwsem_first_waiter(sem);
 
-	if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
-		if (wake_type == RWSEM_WAKE_ANY) {
-			/*
-			 * Mark writer at the front of the queue for wakeup.
-			 * Until the task is actually later awoken later by
-			 * the caller, other writers are able to steal it.
-			 * Readers, on the other hand, will block as they
-			 * will notice the queued writer.
-			 */
-			wake_q_add(wake_q, waiter->task);
-			lockevent_inc(rwsem_wake_writer);
-		}
+	if (waiter->type != RWSEM_WAITING_FOR_WRITE)
+		goto wake_readers;
 
-		return;
+	if (wake_type == RWSEM_WAKE_ANY) {
+		/*
+		 * Mark writer at the front of the queue for wakeup.
+		 * Until the task is actually awoken later by
+		 * the caller, other writers are able to steal it.
+		 * Readers, on the other hand, will block as they
+		 * will notice the queued writer.
+		 */
+		wake_q_add(wake_q, waiter->task);
+		lockevent_inc(rwsem_wake_writer);
 	}
+	return;
 
+wake_readers:
 	/*
 	 * No reader wakeup if there are too many of them already.
 	 */
@@ -455,15 +456,15 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 		struct task_struct *owner;
 
 		adjustment = RWSEM_READER_BIAS;
-		oldcount = atomic_long_fetch_add(adjustment, &sem->count);
-		if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
+		count = atomic_long_fetch_add(adjustment, &sem->count);
+		if (unlikely(count & RWSEM_WRITER_MASK)) {
 			/*
 			 * When we've been waiting "too" long (for writers
 			 * to give up the lock), request a HANDOFF to
 			 * force the issue.
 			 */
 			if (time_after(jiffies, waiter->timeout)) {
-				if (!(oldcount & RWSEM_FLAG_HANDOFF)) {
+				if (!(count & RWSEM_FLAG_HANDOFF)) {
 					adjustment -= RWSEM_FLAG_HANDOFF;
 					lockevent_inc(rwsem_rlock_handoff);
 				}
@@ -524,21 +525,21 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 	adjustment = woken * RWSEM_READER_BIAS - adjustment;
 	lockevent_cond_inc(rwsem_wake_reader, woken);
 
-	oldcount = atomic_long_read(&sem->count);
+	count = atomic_long_read(&sem->count);
 	if (list_empty(&sem->wait_list)) {
 		/*
 		 * Combined with list_move_tail() above, this implies
 		 * rwsem_del_waiter().
 		 */
 		adjustment -= RWSEM_FLAG_WAITERS;
-		if (oldcount & RWSEM_FLAG_HANDOFF)
+		if (count & RWSEM_FLAG_HANDOFF)
 			adjustment -= RWSEM_FLAG_HANDOFF;
 	} else if (woken) {
 		/*
 		 * When we've woken a reader, we no longer need to force
 		 * writers to give up the lock and we can clear HANDOFF.
 		 */
-		if (oldcount & RWSEM_FLAG_HANDOFF)
+		if (count & RWSEM_FLAG_HANDOFF)
 			adjustment -= RWSEM_FLAG_HANDOFF;
 	}
 
@@ -844,7 +845,6 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 		 * Try to acquire the lock
 		 */
 		taken = rwsem_try_write_lock_unqueued(sem);
-
 		if (taken)
 			break;
 
-- 
2.31.1

From: Waiman Long <longman@redhat.com>
To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
Cc: linux-kernel@vger.kernel.org, Hillf Danton, Waiman Long
Subject: [PATCH v2 2/3] locking/rwsem: Enable early rwsem writer lock handoff
Date: Thu, 16 Feb 2023 16:09:32 -0500
Message-Id: <20230216210933.1169097-3-longman@redhat.com>
In-Reply-To: <20230216210933.1169097-1-longman@redhat.com>
References: <20230216210933.1169097-1-longman@redhat.com>

The lock handoff provided in rwsem isn't a true handoff like that in
the mutex. Instead, it is more like a quiescent state where optimistic
spinning and lock stealing are disabled to make it easier for the
first waiter to acquire the lock.

For readers, setting the HANDOFF bit prevents writers from stealing
the lock. The actual handoff is done at rwsem_wake() time after taking
the wait_lock.
There isn't much we need to improve here other than setting the
RWSEM_NONSPINNABLE bit in owner.

For writers, setting the HANDOFF bit does not guarantee that the
writer can acquire the rwsem in a subsequent rwsem_try_write_lock()
after setting the bit there. A reader can come in and temporarily add
RWSEM_READER_BIAS, which can spoil the takeover of the rwsem in
rwsem_try_write_lock() and lead to additional delay.

For the mutex, lock handoff is done at unlock time since the owner
value and the handoff bit are in the same lock word and can be updated
atomically. That is not the case for rwsem, which has a count value
for locking and a separate owner value for storing the lock owner. In
addition, the handoff processing differs depending on whether the
first waiter is a writer or a reader. We can only make that waiter
type determination after acquiring the wait_lock.

Together with the fact that the RWSEM_FLAG_HANDOFF bit is stable while
holding the wait_lock, the most convenient place to do the early
handoff is at rwsem_mark_wake(), where the wait_lock has to be
acquired anyway. There isn't much additional cost in doing this check
and early handoff there, and it increases the chance that a lock
handoff will already have succeeded by the time the HANDOFF-setting
writer wakes up, without the writer needing to take the wait_lock at
all.

Note that if the early handoff fails to happen in rwsem_mark_wake(),
a late handoff can still happen when the awoken writer calls
rwsem_try_write_lock().

The kernel test robot noticed a 19.3% improvement of
will-it-scale.per_thread_ops with an earlier version of this
commit [1].
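
As a rough model of the waker-to-sleeper protocol this patch adds:
with wait_lock held, a free lock word plus a set HANDOFF bit lets the
waker grant the lock and clear waiter->task, which the sleeping writer
then tests. A standalone C sketch follows; the flag values are
patterned after rwsem's count word, but all names here are
illustrative, and the real code uses atomics, the owner field, and
wait-queue removal:

#include <stdbool.h>
#include <stdio.h>

/* Flag layout patterned after rwsem's count word (values illustrative). */
#define RWSEM_WRITER_LOCKED	(1UL << 0)
#define RWSEM_FLAG_WAITERS	(1UL << 1)
#define RWSEM_FLAG_HANDOFF	(1UL << 2)
#define RWSEM_READER_BIAS	(1UL << 8)
#define RWSEM_READER_MASK	(~(RWSEM_READER_BIAS - 1))
#define RWSEM_LOCK_MASK		(RWSEM_WRITER_LOCKED | RWSEM_READER_MASK)

struct waiter {
	const char *task;	/* non-NULL: still waiting for the lock */
};

/*
 * Waker side, modeled as running with wait_lock held: if the lock word
 * is free and HANDOFF is set, no other CPU can take an active lock, so
 * the lock can be granted to the first queued writer directly.
 * Clearing waiter->task is what tells the sleeping writer that it
 * already owns the lock.
 */
static bool try_early_handoff(unsigned long *count, struct waiter *w)
{
	if ((*count & RWSEM_LOCK_MASK) || !(*count & RWSEM_FLAG_HANDOFF))
		return false;

	*count += RWSEM_WRITER_LOCKED - RWSEM_FLAG_HANDOFF;
	w->task = NULL;		/* handoff granted */
	return true;
}

/* Sleeper side: a cleared task pointer means the lock was handed over. */
static bool lock_was_handed_off(const struct waiter *w)
{
	return !w->task;
}

int main(void)
{
	unsigned long free_count = RWSEM_FLAG_WAITERS | RWSEM_FLAG_HANDOFF;
	unsigned long busy_count = RWSEM_READER_BIAS | RWSEM_FLAG_HANDOFF;
	struct waiter w1 = { .task = "writer-1" };
	struct waiter w2 = { .task = "writer-2" };

	printf("free rwsem:     handoff=%d\n", try_early_handoff(&free_count, &w1)); /* 1 */
	printf("                handed_off=%d\n", lock_was_handed_off(&w1));          /* 1 */
	printf("reader present: handoff=%d\n", try_early_handoff(&busy_count, &w2)); /* 0 */
	return 0;
}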

[1] https://lore.kernel.org/lkml/202302122155.87699b56-oliver.sang@intel.com/

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/lock_events_list.h |  1 +
 kernel/locking/rwsem.c            | 71 +++++++++++++++++++++++++++----
 2 files changed, 64 insertions(+), 8 deletions(-)

diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index 97fb6f3f840a..fd80f5828f24 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -67,3 +67,4 @@ LOCK_EVENT(rwsem_rlock_handoff)	/* # of read lock handoffs		*/
 LOCK_EVENT(rwsem_wlock)		/* # of write locks acquired		*/
 LOCK_EVENT(rwsem_wlock_fail)	/* # of failed write lock acquisitions	*/
 LOCK_EVENT(rwsem_wlock_handoff)	/* # of write lock handoffs		*/
+LOCK_EVENT(rwsem_wlock_ehandoff) /* # of write lock early handoffs	*/

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index e589f69793df..fc3961ceabe8 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -412,8 +412,9 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 			    enum rwsem_wake_type wake_type,
 			    struct wake_q_head *wake_q)
 {
+	long count = atomic_long_read(&sem->count);
 	struct rwsem_waiter *waiter, *tmp;
-	long count, woken = 0, adjustment = 0;
+	long woken = 0, adjustment = 0;
 	struct list_head wlist;
 
 	lockdep_assert_held(&sem->wait_lock);
@@ -432,19 +433,39 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 		 * Mark writer at the front of the queue for wakeup.
 		 * Until the task is actually awoken later by
 		 * the caller, other writers are able to steal it.
+		 *
+		 * *Unless* HANDOFF is set, in which case only the
+		 * first waiter is allowed to take it.
+		 *
 		 * Readers, on the other hand, will block as they
 		 * will notice the queued writer.
 		 */
 		wake_q_add(wake_q, waiter->task);
 		lockevent_inc(rwsem_wake_writer);
+
+		if ((count & RWSEM_LOCK_MASK) || !(count & RWSEM_FLAG_HANDOFF))
+			return;
+
+		/*
+		 * If the rwsem is free and handoff flag is set with wait_lock
+		 * held, no other CPUs can take an active lock. We can do an
+		 * early handoff.
+		 */
+		adjustment = RWSEM_WRITER_LOCKED - RWSEM_FLAG_HANDOFF;
+		atomic_long_set(&sem->owner, (long)waiter->task);
+		waiter->task = NULL;
+		atomic_long_add(adjustment, &sem->count);
+		rwsem_del_waiter(sem, waiter);
+		lockevent_inc(rwsem_wlock_ehandoff);
 	}
 	return;
 
 wake_readers:
 	/*
-	 * No reader wakeup if there are too many of them already.
+	 * No reader wakeup if there are too many of them already or
+	 * something has gone wrong.
 	 */
-	if (unlikely(atomic_long_read(&sem->count) < 0))
+	if (unlikely(count < 0))
 		return;
 
@@ -468,7 +489,12 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 				adjustment -= RWSEM_FLAG_HANDOFF;
 				lockevent_inc(rwsem_rlock_handoff);
 			}
+			/*
+			 * With HANDOFF set for the reader, we must
+			 * terminate all spinning.
+			 */
 			waiter->handoff_set = true;
+			rwsem_set_nonspinnable(sem);
 		}
 
 		atomic_long_add(-adjustment, &sem->count);
@@ -610,6 +636,12 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
 
 	lockdep_assert_held(&sem->wait_lock);
 
+	if (!waiter->task) {
+		/* Write lock handed off */
+		smp_acquire__after_ctrl_dep();
+		return true;
+	}
+
 	count = atomic_long_read(&sem->count);
 	do {
 		bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
@@ -755,6 +787,10 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
 
 	owner = rwsem_owner_flags(sem, &flags);
 	state = rwsem_owner_state(owner, flags);
+
+	if (owner == current)
+		return OWNER_NONSPINNABLE;	/* Handoff granted */
+
 	if (state != OWNER_WRITER)
 		return state;
 
@@ -1164,32 +1200,51 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 		 * the lock, attempt to spin on owner to accelerate lock
 		 * transfer. If the previous owner is a on-cpu writer and it
 		 * has just released the lock, OWNER_NULL will be returned.
-		 * In this case, we attempt to acquire the lock again
-		 * without sleeping.
+		 * In this case, the waker may be in the process of early
+		 * lock handoff. Use the wait_lock to synchronize with that
+		 * before checking for handoff.
 		 */
 		if (waiter.handoff_set) {
 			enum owner_state owner_state;
 
 			owner_state = rwsem_spin_on_owner(sem);
-			if (owner_state == OWNER_NULL)
-				goto trylock_again;
+			if ((owner_state == OWNER_NULL) &&
+			    READ_ONCE(waiter.task)) {
+				raw_spin_lock_irq(&sem->wait_lock);
+				raw_spin_unlock_irq(&sem->wait_lock);
+			}
+			if (!READ_ONCE(waiter.task))
+				goto handed_off;
 		}
 
 		schedule_preempt_disabled();
 		lockevent_inc(rwsem_sleep_writer);
+		if (!READ_ONCE(waiter.task))
+			goto handed_off;
+
 		set_current_state(state);
-trylock_again:
 		raw_spin_lock_irq(&sem->wait_lock);
 	}
 	__set_current_state(TASK_RUNNING);
 	raw_spin_unlock_irq(&sem->wait_lock);
+out:
 	lockevent_inc(rwsem_wlock);
 	trace_contention_end(sem, 0);
 	return sem;
 
+handed_off:
+	/* Write lock handed off */
+	set_current_state(TASK_RUNNING);	/* smp_mb() */
+	goto out;
+
 out_nolock:
 	__set_current_state(TASK_RUNNING);
 	raw_spin_lock_irq(&sem->wait_lock);
+	if (!waiter.task) {
+		smp_acquire__after_ctrl_dep();
+		raw_spin_unlock_irq(&sem->wait_lock);
+		goto out;
+	}
 	rwsem_del_wake_waiter(sem, &waiter, &wake_q);
 	lockevent_inc(rwsem_wlock_fail);
 	trace_contention_end(sem, -EINTR);
-- 
2.31.1

From: Waiman Long <longman@redhat.com>
To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
Cc: linux-kernel@vger.kernel.org, Hillf Danton, Waiman Long
Subject: [PATCH v2 3/3] locking/rwsem: Wake up all readers for wait queue waker
Date: Thu, 16 Feb 2023 16:09:33 -0500
Message-Id: <20230216210933.1169097-4-longman@redhat.com>
In-Reply-To: <20230216210933.1169097-1-longman@redhat.com>
References: <20230216210933.1169097-1-longman@redhat.com>

As noted in commit 54c1ee4d614d ("locking/rwsem: Conditionally wake
waiters in reader/writer slowpaths"), it was possible for a
reader-owned rwsem to have many readers waiting in the wait queue but
no writer.

Recently, it was found that one way to cause this condition is to have
a highly contended rwsem with many readers, like mmap_sem. There can
be hundreds of readers waiting in the wait queue of a writer-owned
mmap_sem. The rwsem_wake() call from the up_write() call of the
rwsem-owning writer can hit the 256-reader wakeup limit and leave the
rest of the readers in the wait queue. The reason for the limit is to
avoid excessive delay in doing other useful work.

With commit 54c1ee4d614d ("locking/rwsem: Conditionally wake waiters
in reader/writer slowpaths"), a new incoming reader should wake up
another batch of up to 256 readers. However, these incoming readers
or writers will have to wait in the wait queue, and there is nothing
else they can do until it is their turn to be woken up.

This patch renames rwsem_mark_wake() to __rwsem_mark_wake() and adds
an in_waitq argument to indicate that the waker is in the wait queue
and can ignore the wakeup limit. A rwsem_mark_wake() helper is added
that keeps the original semantics.
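
The effect of the new in_waitq argument can be sketched with a
standalone model of the reader-wakeup loop (hypothetical names;
MAX_READERS_WAKEUP mirrors the kernel's 256-reader limit):

#include <stdbool.h>
#include <stdio.h>

#define MAX_READERS_WAKEUP	256	/* mirrors the kernel's limit */

/*
 * Model of the reader-wakeup loop: an out-of-queue waker (e.g. the
 * writer doing up_write()) stops at the limit to bound the time it
 * spends waking others; a waker that is itself queued has nothing
 * better to do while it waits, so it may drain every waiting reader.
 */
static int wake_readers(int nr_waiting, bool in_waitq)
{
	int woken = 0;

	while (nr_waiting--) {
		woken++;	/* stand-in for wake_q_add() on one reader */
		if (!in_waitq && (woken >= MAX_READERS_WAKEUP))
			break;
	}
	return woken;
}

int main(void)
{
	printf("up_write() waker: %d readers woken\n", wake_readers(1000, false)); /* 256 */
	printf("queued waker:     %d readers woken\n", wake_readers(1000, true));  /* 1000 */
	return 0;
}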

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index fc3961ceabe8..35b4adf8ea55 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -408,9 +408,9 @@ rwsem_del_waiter(struct rw_semaphore *sem, struct rwsem_waiter *waiter)
  *
  * Implies rwsem_del_waiter() for all woken readers.
  */
-static void rwsem_mark_wake(struct rw_semaphore *sem,
-			    enum rwsem_wake_type wake_type,
-			    struct wake_q_head *wake_q)
+static void __rwsem_mark_wake(struct rw_semaphore *sem,
+			      enum rwsem_wake_type wake_type,
+			      struct wake_q_head *wake_q, bool in_waitq)
 {
 	long count = atomic_long_read(&sem->count);
 	struct rwsem_waiter *waiter, *tmp;
@@ -542,9 +542,10 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 		list_move_tail(&waiter->list, &wlist);
 
 		/*
-		 * Limit # of readers that can be woken up per wakeup call.
+		 * Limit # of readers that can be woken up per wakeup call
+		 * unless the waker is waiting in the wait queue.
 		 */
-		if (unlikely(woken >= MAX_READERS_WAKEUP))
+		if (unlikely(!in_waitq && (woken >= MAX_READERS_WAKEUP)))
 			break;
 	}
 
@@ -594,6 +595,13 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 	}
 }
 
+static inline void rwsem_mark_wake(struct rw_semaphore *sem,
+				   enum rwsem_wake_type wake_type,
+				   struct wake_q_head *wake_q)
+{
+	__rwsem_mark_wake(sem, wake_type, wake_q, false);
+}
+
 /*
  * Remove a waiter and try to wake up other waiters in the wait queue
  * This function is called from the out_nolock path of both the reader and
@@ -1022,7 +1030,7 @@ static inline void rwsem_cond_wake_waiter(struct rw_semaphore *sem, long count,
 		wake_type = RWSEM_WAKE_ANY;
 		clear_nonspinnable(sem);
 	}
-	rwsem_mark_wake(sem, wake_type, wake_q);
+	__rwsem_mark_wake(sem, wake_type, wake_q, true);
 }
 
 /*
-- 
2.31.1