From nobody Tue Nov 26 19:41:08 2024
From: lizhe.67@bytedance.com
To: peterz@infradead.org, mingo@redhat.com, will@kernel.org,
	longman@redhat.com, boqun.feng@gmail.com, akpm@linux-foundation.org
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	lizhe.67@bytedance.com
Subject: [RFC 1/2] rwsem: introduce upgrade_read interface
Date: Wed, 16 Oct 2024 12:35:59 +0800
Message-ID: <20241016043600.35139-2-lizhe.67@bytedance.com>
X-Mailer: git-send-email 2.45.2
In-Reply-To: <20241016043600.35139-1-lizhe.67@bytedance.com>
References: <20241016043600.35139-1-lizhe.67@bytedance.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Li Zhe

Introduce a new rwsem interface, upgrade_read(). A task that already
holds the read lock can call it to upgrade to the write lock. The
interface waits for all other readers to exit before it obtains the
write lock. In addition, the upgrading task takes priority over any
task waiting for the write lock and over any task that subsequently
tries to obtain the read lock.

Signed-off-by: Li Zhe
---
 include/linux/rwsem.h  |  1 +
 kernel/locking/rwsem.c | 87 ++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 85 insertions(+), 3 deletions(-)

diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index c8b543d428b0..90183ab5ea79 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -249,6 +249,7 @@ DEFINE_GUARD_COND(rwsem_write, _try, down_write_trylock(_T))
  * downgrade write lock to read lock
  */
 extern void downgrade_write(struct rw_semaphore *sem);
+extern int upgrade_read(struct rw_semaphore *sem);
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 /*
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 2bbb6eca5144..0583e1be3dbf 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -37,6 +37,7 @@
  * meanings when set.
  * - Bit 0: RWSEM_READER_OWNED - rwsem may be owned by readers (just a hint)
  * - Bit 1: RWSEM_NONSPINNABLE - Cannot spin on a reader-owned lock
+ * - Bit 2: RWSEM_UPGRADING - doing upgrade read process
  *
  * When the rwsem is reader-owned and a spinning writer has timed out,
  * the nonspinnable bit will be set to disable optimistic spinning.
@@ -62,7 +63,8 @@
  */
 #define RWSEM_READER_OWNED	(1UL << 0)
 #define RWSEM_NONSPINNABLE	(1UL << 1)
-#define RWSEM_OWNER_FLAGS_MASK	(RWSEM_READER_OWNED | RWSEM_NONSPINNABLE)
+#define RWSEM_UPGRADING		(1UL << 2)
+#define RWSEM_OWNER_FLAGS_MASK	(RWSEM_READER_OWNED | RWSEM_NONSPINNABLE | RWSEM_UPGRADING)
 
 #ifdef CONFIG_DEBUG_RWSEMS
 # define DEBUG_RWSEMS_WARN_ON(c, sem)	do {			\
@@ -93,7 +95,8 @@
  * Bit 0 - writer locked bit
  * Bit 1 - waiters present bit
  * Bit 2 - lock handoff bit
- * Bits 3-7 - reserved
+ * Bit 3 - upgrade read bit
+ * Bits 4-7 - reserved
  * Bits 8-30 - 23-bit reader count
  * Bit 31 - read fail bit
  *
@@ -117,6 +120,7 @@
 #define RWSEM_WRITER_LOCKED	(1UL << 0)
 #define RWSEM_FLAG_WAITERS	(1UL << 1)
 #define RWSEM_FLAG_HANDOFF	(1UL << 2)
+#define RWSEM_FLAG_UPGRADE_READ	(1UL << 3)
 #define RWSEM_FLAG_READFAIL	(1UL << (BITS_PER_LONG - 1))
 
 #define RWSEM_READER_SHIFT	8
@@ -143,6 +147,13 @@ static inline void rwsem_set_owner(struct rw_semaphore *sem)
 	atomic_long_set(&sem->owner, (long)current);
 }
 
+static inline void rwsem_set_owner_upgrade(struct rw_semaphore *sem)
+{
+	lockdep_assert_preemption_disabled();
+	atomic_long_set(&sem->owner, (long)current | RWSEM_UPGRADING |
+			RWSEM_READER_OWNED | RWSEM_NONSPINNABLE);
+}
+
 static inline void rwsem_clear_owner(struct rw_semaphore *sem)
 {
 	lockdep_assert_preemption_disabled();
@@ -201,7 +212,7 @@ static inline bool is_rwsem_reader_owned(struct rw_semaphore *sem)
 	 */
 	long count = atomic_long_read(&sem->count);
 
-	if (count & RWSEM_WRITER_MASK)
+	if ((count & RWSEM_WRITER_MASK) && !(count & RWSEM_FLAG_UPGRADE_READ))
 		return false;
 	return rwsem_test_oflags(sem, RWSEM_READER_OWNED);
 }
@@ -1336,6 +1347,8 @@ static inline int __down_write_trylock(struct rw_semaphore *sem)
 static inline void __up_read(struct rw_semaphore *sem)
 {
 	long tmp;
+	unsigned long flags;
+	struct task_struct *owner;
 
 	DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);
 	DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
@@ -1349,6 +1362,9 @@ static inline void __up_read(struct rw_semaphore *sem)
 		clear_nonspinnable(sem);
 		rwsem_wake(sem);
 	}
+	owner = rwsem_owner_flags(sem, &flags);
+	if (unlikely(!(tmp & RWSEM_READER_MASK) && (flags & RWSEM_UPGRADING)))
+		wake_up_process(owner);
 	preempt_enable();
 }
 
@@ -1641,6 +1657,71 @@ void downgrade_write(struct rw_semaphore *sem)
 }
 EXPORT_SYMBOL(downgrade_write);
 
+static inline void rwsem_clear_upgrade_flag(struct rw_semaphore *sem)
+{
+	atomic_long_andnot(RWSEM_FLAG_UPGRADE_READ, &sem->count);
+}
+
+/*
+ * upgrade read lock to write lock
+ */
+static inline int __upgrade_read(struct rw_semaphore *sem)
+{
+	long tmp;
+
+	preempt_disable();
+
+	tmp = atomic_long_read(&sem->count);
+	do {
+		if (tmp & (RWSEM_WRITER_MASK | RWSEM_FLAG_UPGRADE_READ)) {
+			preempt_enable();
+			return -EBUSY;
+		}
+	} while (!atomic_long_try_cmpxchg(&sem->count, &tmp,
+		tmp + RWSEM_FLAG_UPGRADE_READ + RWSEM_WRITER_LOCKED - RWSEM_READER_BIAS));
+
+	if ((tmp & RWSEM_READER_MASK) == RWSEM_READER_BIAS) {
+		/* fast path */
+		DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);
+		rwsem_clear_upgrade_flag(sem);
+		rwsem_set_owner(sem);
+		preempt_enable();
+		return 0;
+	}
+	/* slow path */
+	raw_spin_lock_irq(&sem->wait_lock);
+	rwsem_set_owner_upgrade(sem);
+
+	set_current_state(TASK_UNINTERRUPTIBLE);
+
+	for (;;) {
+		if (!(atomic_long_read(&sem->count) & RWSEM_READER_MASK))
+			break;
+		raw_spin_unlock_irq(&sem->wait_lock);
+		schedule_preempt_disabled();
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		raw_spin_lock_irq(&sem->wait_lock);
+	}
+
+	rwsem_clear_upgrade_flag(sem);
+	rwsem_set_owner(sem);
+	__set_current_state(TASK_RUNNING);
+	raw_spin_unlock_irq(&sem->wait_lock);
+	preempt_enable();
+	return 0;
+}
+
+/*
+ * upgrade read lock to write lock
+ *
+ * Return: 0 on success, error code on failure
+ */
+int upgrade_read(struct rw_semaphore *sem)
+{
+	return __upgrade_read(sem);
+}
+EXPORT_SYMBOL(upgrade_read);
+
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 
 void down_read_nested(struct rw_semaphore *sem, int subclass)
-- 
2.20.1

From nobody Tue Nov 26 19:41:08 2024
From: lizhe.67@bytedance.com
To: peterz@infradead.org, mingo@redhat.com, will@kernel.org,
	longman@redhat.com, boqun.feng@gmail.com, akpm@linux-foundation.org
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	lizhe.67@bytedance.com
Subject: [RFC 2/2] khugepaged: use upgrade_read() to optimize collapse_huge_page
Date: Wed, 16 Oct 2024 12:36:00 +0800
Message-ID: <20241016043600.35139-3-lizhe.67@bytedance.com>
X-Mailer: git-send-email 2.45.2
In-Reply-To: <20241016043600.35139-1-lizhe.67@bytedance.com>
References: <20241016043600.35139-1-lizhe.67@bytedance.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Li Zhe

In collapse_huge_page(), we drop the mmap read lock and then take the
mmap write lock to prevent most accesses to the page tables. Between
the two operations there is a small window in which other tasks can
acquire the mmap lock, so the vma and pmd have to be revalidated
afterwards. With upgrade_read(), the lock is not released when the
upgrade succeeds, so in most cases we no longer need to check the vma
and pmd again.

Signed-off-by: Li Zhe
---
 mm/khugepaged.c | 36 +++++++++++++++++++-----------------
 1 file changed, 19 insertions(+), 17 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index f9c39898eaff..934051274f7a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1142,23 +1142,25 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		goto out_nolock;
 	}
 
-	mmap_read_unlock(mm);
-	/*
-	 * Prevent all access to pagetables with the exception of
-	 * gup_fast later handled by the ptep_clear_flush and the VM
-	 * handled by the anon_vma lock + PG_lock.
-	 *
-	 * UFFDIO_MOVE is prevented to race as well thanks to the
-	 * mmap_lock.
-	 */
-	mmap_write_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
-	if (result != SCAN_SUCCEED)
-		goto out_up_write;
-	/* check if the pmd is still valid */
-	result = check_pmd_still_valid(mm, address, pmd);
-	if (result != SCAN_SUCCEED)
-		goto out_up_write;
+	if (upgrade_read(&mm->mmap_lock)) {
+		mmap_read_unlock(mm);
+		/*
+		 * Prevent all access to pagetables with the exception of
+		 * gup_fast later handled by the ptep_clear_flush and the VM
+		 * handled by the anon_vma lock + PG_lock.
+		 *
+		 * UFFDIO_MOVE is prevented to race as well thanks to the
+		 * mmap_lock.
+		 */
+		mmap_write_lock(mm);
+		result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
+		if (result != SCAN_SUCCEED)
+			goto out_up_write;
+		/* check if the pmd is still valid */
+		result = check_pmd_still_valid(mm, address, pmd);
+		if (result != SCAN_SUCCEED)
+			goto out_up_write;
+	}
 
 	vma_start_write(vma);
 	anon_vma_lock_write(vma->anon_vma);
-- 
2.20.1
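
The count-word transition that __upgrade_read() performs in patch 1/2 can be
modelled in user space. What follows is a minimal sketch, not kernel code. It
assumes the bit positions used in the patch (writer bit 0, upgrade bit 3,
reader count from bit 8) and substitutes C11 atomics for
atomic_long_try_cmpxchg(). The names model_upgrade_read(), model_demo() and
the MODEL_* macros are hypothetical. Sleeping, owner tracking and the
wait_lock are left out, so only the compare-and-swap arithmetic and the
-EBUSY check are shown.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Bit layout copied from the patch; the MODEL_ prefix marks it as illustrative. */
#define MODEL_WRITER_LOCKED	(1L << 0)
#define MODEL_FLAG_UPGRADE_READ	(1L << 3)
#define MODEL_READER_SHIFT	8
#define MODEL_READER_BIAS	(1L << MODEL_READER_SHIFT)
#define MODEL_READER_MASK	(~(MODEL_READER_BIAS - 1))

/*
 * Hypothetical user-space model of the count update in __upgrade_read().
 * Returns 0 if the caller was the only reader (fast path), 1 if other
 * readers remain (the kernel code would sleep at this point), and -EBUSY
 * if the writer bit or the upgrade bit is already set.
 */
static int model_upgrade_read(atomic_long *count)
{
	long tmp = atomic_load(count);

	do {
		if (tmp & (MODEL_WRITER_LOCKED | MODEL_FLAG_UPGRADE_READ))
			return -EBUSY;
		/* Trade our reader bias for the writer bit plus the upgrade flag. */
	} while (!atomic_compare_exchange_weak(count, &tmp,
			tmp + MODEL_FLAG_UPGRADE_READ + MODEL_WRITER_LOCKED -
			MODEL_READER_BIAS));

	if ((tmp & MODEL_READER_MASK) == MODEL_READER_BIAS) {
		/* Sole reader: drop the upgrade flag, leaving a plain write lock. */
		atomic_fetch_and(count, ~MODEL_FLAG_UPGRADE_READ);
		return 0;
	}
	return 1;
}

/* Walks through the three outcomes; callable from any main() or test harness. */
static void model_demo(void)
{
	atomic_long one_reader = MODEL_READER_BIAS;
	atomic_long two_readers = 2 * MODEL_READER_BIAS;

	/* A lone reader upgrades immediately and ends up write-locked. */
	assert(model_upgrade_read(&one_reader) == 0);
	assert(atomic_load(&one_reader) == MODEL_WRITER_LOCKED);

	/* With a second reader present, the upgrader would have to wait. */
	assert(model_upgrade_read(&two_readers) == 1);

	/* A second upgrade attempt on the same lock is refused. */
	assert(model_upgrade_read(&two_readers) == -EBUSY);
}
```

In this model the third call returns -EBUSY because the first upgrader has
already set both the writer bit and the upgrade bit. That is the case in which
collapse_huge_page() in patch 2/2 falls back to dropping the read lock, taking
the write lock and revalidating the vma and pmd.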