From: Peng Wang <peng_wang@linux.alibaba.com>
To: peterz@infradead.org, mingo@redhat.com, will@kernel.org, boqun@kernel.org, longman@redhat.com, dbueso@suse.de
Cc: linux-kernel@vger.kernel.org, Peng Wang <peng_wang@linux.alibaba.com>
Subject: [PATCH] locking/rwsem: Remove reader optimistic lock stealing
Date: Thu, 21 May 2026 17:59:26 +0800
Message-ID: <20260521095926.29363-1-peng_wang@linux.alibaba.com>
X-Mailer: git-send-email 2.47.0

Reader optimistic lock stealing, introduced by commit 1a728dff855a
("locking/rwsem: Enable reader optimistic lock stealing") and made more
aggressive by commit 617f3ef95177 ("locking/rwsem: Remove reader
optimistic spinning"), allows a reader entering the slowpath to bypass
the wait queue and acquire the lock directly when the WRITER_LOCKED and
HANDOFF bits are not set.

This causes severe writer starvation in workloads where readers hold the
lock for extended periods, such as Direct I/O operations which hold
inode->i_rwsem for the entire duration of iomap_dio_rw(). A common
example is log-structured storage where one thread appends via DIO
writes while another thread tails the log via DIO reads -- a pattern
seen in database redo-log replay and shared-storage replication.

The problem occurs because:

1. A reader entering the slowpath (due to RWSEM_FLAG_WAITERS being set)
   can still steal the lock, as the steal condition checks only
   WRITER_LOCKED and HANDOFF, not WAITERS. This is inconsistent with
   the fast path, which already blocks readers when WAITERS is set
   (via RWSEM_READ_FAILED_MASK).

2. In the window between the last reader releasing the lock and the
   waiting writer being scheduled (~10-100us), a new reader can steal
   the lock in ~50-100ns. This race is structurally inevitable due to
   the 1000x speed difference between atomic operations and context
   switching.

3. Each stolen read lock is held for the full DIO duration (potentially
   milliseconds), and the pattern repeats until the 4ms HANDOFF
   timeout. This effectively taxes every write operation with a ~4ms
   penalty.

Performance impact measured with a DIO mixed read/write workload
(1 writer + 1 reader, O_DIRECT, ext4):

NVMe SSD:
                            Before      After
  Write-only baseline:      397 MB/s    397 MB/s  (no change)
  Mixed write throughput:    11 MB/s    350 MB/s  (+31x)
  Mixed write latency:      880 us       23 us    (-38x)
  Mixed read throughput:     95 MB/s     95 MB/s  (no change)

Fixes: 617f3ef95177 ("locking/rwsem: Remove reader optimistic spinning")
Signed-off-by: Peng Wang <peng_wang@linux.alibaba.com>
---
 kernel/locking/lock_events_list.h |  1 -
 kernel/locking/rwsem.c            | 33 ---------------------------------
 2 files changed, 34 deletions(-)

diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index 97fb6f3f840a..35b45576bee4 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -65,7 +65,6 @@ LOCK_EVENT(rwsem_opt_lock) /* # of opt-acquired write locks */
 LOCK_EVENT(rwsem_opt_fail) /* # of failed optspins */
 LOCK_EVENT(rwsem_opt_nospin) /* # of disabled optspins */
 LOCK_EVENT(rwsem_rlock) /* # of read locks acquired */
-LOCK_EVENT(rwsem_rlock_steal) /* # of read locks by lock stealing */
 LOCK_EVENT(rwsem_rlock_fast) /* # of fast read locks acquired */
 LOCK_EVENT(rwsem_rlock_fail) /* # of failed read lock acquisitions */
 LOCK_EVENT(rwsem_rlock_handoff) /* # of read lock handoffs */
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index bda5577339c0..40b141c5765f 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1017,42 +1017,9 @@ static struct rw_semaphore __sched *
 rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, unsigned int state)
 {
 	long adjustment = -RWSEM_READER_BIAS;
-	long rcnt = (count >> RWSEM_READER_SHIFT);
 	struct rwsem_waiter waiter, *first;
 	DEFINE_WAKE_Q(wake_q);
 
-	/*
-	 * To prevent a constant stream of readers from starving a sleeping
-	 * writer, don't attempt optimistic lock stealing if the lock is
-	 * very likely owned by readers.
-	 */
-	if ((atomic_long_read(&sem->owner) & RWSEM_READER_OWNED) &&
-	    (rcnt > 1) && !(count & RWSEM_WRITER_LOCKED))
-		goto queue;
-
-	/*
-	 * Reader optimistic lock stealing.
-	 */
-	if (!(count & (RWSEM_WRITER_LOCKED | RWSEM_FLAG_HANDOFF))) {
-		rwsem_set_reader_owned(sem);
-		lockevent_inc(rwsem_rlock_steal);
-
-		/*
-		 * Wake up other readers in the wait queue if it is
-		 * the first reader.
-		 */
-		if ((rcnt == 1) && (count & RWSEM_FLAG_WAITERS)) {
-			raw_spin_lock_irq(&sem->wait_lock);
-			if (sem->first_waiter)
-				rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED,
-						&wake_q);
-			raw_spin_unlock_irq(&sem->wait_lock);
-			wake_up_q(&wake_q);
-		}
-		return sem;
-	}
-
-queue:
 	waiter.task = current;
 	waiter.type = RWSEM_WAITING_FOR_READ;
 	waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
-- 
2.39.3
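
Points 1 and 2 of the commit message describe a mismatch between the
fast-path check and the removed slowpath steal check. The userspace
sketch below models it. It is not kernel code: the bit values are
assumed from kernel/locking/rwsem.c, every model_* name is hypothetical,
and the fast-path mask is simplified to the three low flag bits (the
kernel's RWSEM_READ_FAILED_MASK is assumed to cover more than that).

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed bit layout of sem->count, modeled on kernel/locking/rwsem.c. */
#define RWSEM_WRITER_LOCKED	(1UL << 0)
#define RWSEM_FLAG_WAITERS	(1UL << 1)
#define RWSEM_FLAG_HANDOFF	(1UL << 2)
#define RWSEM_READER_SHIFT	8
#define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)

/* Simplified stand-in for RWSEM_READ_FAILED_MASK (hypothetical name). */
#define MODEL_READ_FAILED_MASK \
	(RWSEM_WRITER_LOCKED | RWSEM_FLAG_WAITERS | RWSEM_FLAG_HANDOFF)

/*
 * Fast path: 'count' is the value after the reader added its bias.
 * The read lock is granted only if no flag in the mask is set, so a
 * queued waiter (WAITERS) already sends the reader to the slowpath.
 */
static bool model_fastpath_read_ok(unsigned long count)
{
	return !(count & MODEL_READ_FAILED_MASK);
}

/*
 * Slowpath steal decision as expressed by the code this patch removes.
 * 'reader_owned' stands in for the RWSEM_READER_OWNED bit of sem->owner.
 */
static bool model_slowpath_steal_ok(unsigned long count, bool reader_owned)
{
	unsigned long rcnt = count >> RWSEM_READER_SHIFT;

	/* Guard: other readers still hold the lock, so queue instead. */
	if (reader_owned && rcnt > 1 && !(count & RWSEM_WRITER_LOCKED))
		return false;

	/* Steal check: WAITERS is not consulted here. */
	return !(count & (RWSEM_WRITER_LOCKED | RWSEM_FLAG_HANDOFF));
}

/* The window from point 2: last reader gone, writer queued but asleep. */
static void model_check_steal_window(void)
{
	unsigned long count = RWSEM_FLAG_WAITERS;

	count += RWSEM_READER_BIAS;	/* new reader's fast-path increment */

	assert(!model_fastpath_read_ok(count));		/* WAITERS blocks it */
	assert(model_slowpath_steal_ok(count, true));	/* yet it may steal */
}
```

Calling model_check_steal_window() from a test harness exercises the
case where the fast path refuses the reader and the old slowpath still
hands it the lock. With the steal block removed, the same reader would
go on to the wait queue behind the writer.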