From nobody Fri Jun 19 16:16:36 2026 Received: from mail-dl1-f74.google.com (mail-dl1-f74.google.com [74.125.82.74]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8DD5336403B for ; Fri, 24 Apr 2026 19:16:50 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=74.125.82.74 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777058211; cv=none; b=EkfVUzgpPm/hTRaQhS1UHm5zc8g3872AF+VkREkfp7/y1/Gy81WFJe8BNqalol3hqQbqYGY52BdJ8UTY9xg9M8y7yv8uwcVZrTdy/S1j7fItoNFdhmnU0cyXA3hxl/fdn11PZXDq8sa2Wrr647puA1ZbIeXT3fDZOJpj8xpaSuQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777058211; c=relaxed/simple; bh=4lHh4KX8MVjhhmFqL81gYEMXbFoUy/2yrwleQDVO/a0=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=nqdQCnxB6FGsVuCj4hKHho2JiuofQW7gr42cl+5h+pQtnKYWvuM+gCE2sSZri7E33BLWpnUXAhdhSJ/KAjt9+WhLXbMAuWFF7fI/rvnzWlvIr/ets0DGHHoK98TzY+QnS1m19Z2HJYObHP1VBKKv5YB2kNPt10TA+ZGB0uUiQUc= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--stevensd.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=f2t9MeHS; arc=none smtp.client-ip=74.125.82.74 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--stevensd.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="f2t9MeHS" Received: by mail-dl1-f74.google.com with SMTP id a92af1059eb24-12db37213daso12312757c88.1 for ; Fri, 24 Apr 2026 12:16:50 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20251104; t=1777058210; x=1777663010; 
darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=uhV9bgboNhnyawRNXgnugsEye6zSCci68cPGbUTJPQs=; b=f2t9MeHSo7xmKroe6wdsX6S9UzLhVMluvsOQ1aIMvgFTIEKNinWfg3y783zaw5wHn3 EiDOf6QtcnPf3HIw4t/W5y6+C/5tU+yVfEbhpnk3+gA5gqJ1cGISyfGMSwQIOeIKVGY5 Xly1xYrGp3B7kQA2TBXz3LbMWN7/ZytmyGBDlwezAUOR8OYzFXQnmQ8QpIUtiHmyABEW zeehGkbCzqArlOpuJI1LFMG/uuX9TwCIMRuszkSgxh7ocQ0VoxkN5pkLDWooVHXnwlQZ sq0nNnzw+gAgWWXmnD6j8g4X6N3CshEznoBd8kVczBTiHBKQ8qe/jjuJNR3EG9gj6Kzc F4+A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1777058210; x=1777663010; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=uhV9bgboNhnyawRNXgnugsEye6zSCci68cPGbUTJPQs=; b=LbFcZ34KJb0Z4lgCP0wGUwTjFxUpmFHLYWtxn6Iua70MO8F8dTZngJm55dR0pxVacC WRNI04X/WI3R027iR9eeknf9CaTd5jyaHQf4IwdeMrXSJdVCnsA3qX/wkfV0tMYYa+OO WysHzPeOcjbAgj6wNuaZ2du/CHerM3r21DaiZuVCOyQAe7o+T3yQKH7mrG5sNXSv3w6x Ymz192EA+OIgZdBpIL9vUh9hSZgiNjrIwTJwIkpeax4GKtnPWlxCWtjyezYgVGxyZP00 WBAT/YjtqJQdRZA1RmmD889aQU1st/KuzLT6+Y8jQ50Ps81ebFMdlyFcq9E9DT3ESAuD AV9w== X-Forwarded-Encrypted: i=1; AFNElJ/71kooqe3pE2rYwnzsYsw+oiGm3lo4MH2srFM7YHE3aK/uLAP+FA6U6Ju3dYDTqV/T7bSOuaADgm0dkh8=@vger.kernel.org X-Gm-Message-State: AOJu0Yw1TZQ79MFFjTAQcy/rcLK6XpE92GtSCRspe22jyHepJdAtM2Sp SEGcOpUYhdqfovWX+6u+54EyLs3gaU9i121VLs8KI4nHQ9ybhCfv7AFUiUaSdk+ioYmlY4pkjxY MPlp50whkuBiN0g== X-Received: from dldz11-n2.prod.google.com ([2002:a05:701b:418b:20b0:12d:c368:c23a]) (user=stevensd job=prod-delivery.src-stubby-dispatcher) by 2002:a05:7023:b0a:b0:12d:b7e5:a67b with SMTP id a92af1059eb24-12db7e5aaafmr8877334c88.14.1777058209503; Fri, 24 Apr 2026 12:16:49 -0700 (PDT) Date: Fri, 24 Apr 2026 12:14:44 -0700 In-Reply-To: <20260424191456.2679717-1-stevensd@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: 
Mime-Version: 1.0 References: <20260424191456.2679717-1-stevensd@google.com> X-Mailer: git-send-email 2.54.0.rc2.544.gc7ae2d5bb8-goog Message-ID: <20260424191456.2679717-2-stevensd@google.com> Subject: [PATCH v2 01/13] fork: Remove assumption that vm_area->nr_pages equals to THREAD_SIZE From: David Stevens To: Pasha Tatashin , Linus Walleij , Will Deacon , Quentin Perret , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org, "H. Peter Anvin" , Andy Lutomirski , Xin Li , Peter Zijlstra , Andrew Morton , David Hildenbrand , Lorenzo Stoakes , "Liam R. Howlett" , Vlastimil Babka , Mike Rapoport , Suren Baghdasaryan , Michal Hocko , Uladzislau Rezki , Kees Cook Cc: David Stevens , linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"

From: Pasha Tatashin

In many places the number of pages in the stack is determined via
(THREAD_SIZE / PAGE_SIZE). There is also a BUG_ON() that ensures that
(THREAD_SIZE / PAGE_SIZE) is indeed equal to vm_area->nr_pages.

However, with dynamic stacks, the number of pages in vm_area will grow
with the stack. Therefore, use vm_area->nr_pages to determine the actual
number of pages allocated in the stack.
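As a rough user-space illustration of the point above (not kernel code), the sketch below bounds the loop by the number of pages actually present instead of by a fixed THREAD_SIZE / PAGE_SIZE. Every name prefixed with sketch_ or SKETCH_ is hypothetical, and the constants are illustrative values rather than those of any particular architecture.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative constants only; real values are architecture-specific. */
#define SKETCH_PAGE_SIZE   4096UL
#define SKETCH_THREAD_SIZE (4 * SKETCH_PAGE_SIZE)

/* Hypothetical stand-in for the two struct vm_struct fields used here. */
struct sketch_vm_area {
	unsigned int nr_pages;	/* pages currently backing the stack */
	int *pages;		/* one flag per backing page; 1 = charged */
};

/* Charge every page that is actually present; return how many were charged. */
static unsigned int sketch_charge_stack(struct sketch_vm_area *area)
{
	unsigned int i;

	assert(area->pages != NULL);

	/* Bounded by nr_pages, not by SKETCH_THREAD_SIZE / SKETCH_PAGE_SIZE. */
	for (i = 0; i < area->nr_pages; i++)
		area->pages[i] = 1;

	return i;
}

/* Number of pages the old, fixed loop bound would have visited. */
static unsigned int sketch_fixed_bound(void)
{
	return SKETCH_THREAD_SIZE / SKETCH_PAGE_SIZE;
}
```

With a stack that has only one of its four possible pages populated, the fixed bound would walk three entries that do not exist yet, while the nr_pages bound visits exactly one.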
Signed-off-by: Pasha Tatashin
[Rebased, also skipped intermediary helper variable nr_pages]
Signed-off-by: Linus Walleij
Signed-off-by: David Stevens
---
 kernel/fork.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index bc2bf58b93b6..8961b895bf05 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -312,9 +312,7 @@ static int memcg_charge_kernel_stack(struct vm_struct *vm_area)
 	int ret;
 	int nr_charged = 0;
 
-	BUG_ON(vm_area->nr_pages != THREAD_SIZE / PAGE_SIZE);
-
-	for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) {
+	for (i = 0; i < vm_area->nr_pages; i++) {
 		ret = memcg_kmem_charge_page(vm_area->pages[i], GFP_KERNEL, 0);
 		if (ret)
 			goto err;
@@ -484,7 +482,7 @@ static void account_kernel_stack(struct task_struct *tsk, int account)
 		struct vm_struct *vm_area = task_stack_vm_area(tsk);
 		int i;
 
-		for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
+		for (i = 0; i < vm_area->nr_pages; i++)
 			mod_lruvec_page_state(vm_area->pages[i], NR_KERNEL_STACK_KB,
 					      account * (PAGE_SIZE / 1024));
 	} else {
@@ -505,7 +503,7 @@ void exit_task_stack_account(struct task_struct *tsk)
 		int i;
 
 		vm_area = task_stack_vm_area(tsk);
-		for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
+		for (i = 0; i < vm_area->nr_pages; i++)
 			memcg_kmem_uncharge_page(vm_area->pages[i], 0);
 	}
 }
-- 
2.54.0.rc2.544.gc7ae2d5bb8-goog

From nobody Fri Jun 19 16:16:36 2026 Received: from mail-dy1-f201.google.com (mail-dy1-f201.google.com [74.125.82.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E603436D4F3 for ; Fri, 24 Apr 2026 19:16:52 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=74.125.82.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777058214; cv=none; 
b=Jri+zJMKPujSyKjcNDS+xWef2A79NNku/dCdgk5JAam1NZiy/KjmeuUIPVpOWAToQ2jpweH8oBANCvS8oW4csOKgX5YD3FNAx9aTc5npUOO9DEy4Y32J6jQxFezF8nOzSTK0uppN16GjsZONhmZgoWjZvWB+iuvRclujrojc1Ws= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777058214; c=relaxed/simple; bh=IfRal2g4eax2qipQgHftBuobG7FTGdvuNT9mvVr/54c=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=Y7H7MxgpWYui0ul4LSKvy4rUfeH3za82HdFzplQ9xCXI5I7oDG+gfXJKQjYuI0oHWDbEyDo3FBoxOKo7zW3Bek2xXtyE7DnNrmtE79lSAbd/elBxPTxdRDSC9sBFPaLBehXWZc0FHp1CN8dMSwOH95r4B4QFsz//3tpHRJgMG1Y= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--stevensd.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=eHoa3mTQ; arc=none smtp.client-ip=74.125.82.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--stevensd.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="eHoa3mTQ" Received: by mail-dy1-f201.google.com with SMTP id 5a478bee46e88-2da19227bc1so17661135eec.1 for ; Fri, 24 Apr 2026 12:16:52 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20251104; t=1777058212; x=1777663012; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=1KThmIRA5a/VHEqSyhfTRNnbsq/aoBAcBxWfiBbrQ6E=; b=eHoa3mTQXWlmhKG+5czoZE+DxotQANMmkbmngVUPsSy+uhfjWBoiwt0J42z6A30kvY oXT1Q8/ksqwEpxxT7jssLaBGrpU4zWmUZqvNKxQvhOGSBJtZSazVsuqzrtuZnQhB7cV6 kZMk0u4ADOZ4YunR9fcBN8rruLFOjBocokDedfFrpEGO98aTBSmF+DipX8mlOlyXZ2y2 WBQqTOfw42ZT56W3NSB4Wq/n8Agvu8nZ3CzERPwOBnx/tojyzQlewedZxBVdszkv90X8 
3gCrHjXyppDLe4meAqciagG2+or+tfC+HFxNLgbut505fBxyAQd+lLNlTlZDHtw/6OxJ swgw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1777058212; x=1777663012; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=1KThmIRA5a/VHEqSyhfTRNnbsq/aoBAcBxWfiBbrQ6E=; b=XSqRwBFbxcHMo4hpPdGhUbJf3SIUGJBA3iPAQNjhARFqN3iUlqnamQyZknH5wBqvKo IEz5QFjVK9vfGfADnTLv54wtVX6ewVq2A+0/5/3PO6zkj9SODMZbigEYLLEmsd6WveVS RwDveXGbqWmqxaSPul7PvVJ/G1VTU2D3bV3U4HzEjBlOzI/q8vouZVOgqAuBqwp6eZbB sbXuR/JrKBiuecPJP91vU7RW9syWJv3WOR6VGwyPqRqVEMp6FpgehAyx4YKxHUCDuQWV 20UZjPKOIDuVddlbJ1mKG9UnwdQAE1AcgAhu1cmakOgtSlDshbkvrwpvG12mqGLLB52d Xzew== X-Forwarded-Encrypted: i=1; AFNElJ8OaI1jATw/YmEHqdaaHoDpVV2s4m9xaBoT9c2+MPj64SgNnL6J0+94kfQ/Y3dMfYjKGCo3dun244J6ruk=@vger.kernel.org X-Gm-Message-State: AOJu0YzCTO5ud2i2MNYjLJJGCJzGjDPp+IxEflRX/Ks0cfAvqDDBNJIH Hj3PnchIXavDMF3nHkgIGVcZGQxaLNPhR3kGdo7CTzOChBYlf7PDdtrJbcGb7TnzYhR2Fh1DaFt 2O/1DqGhwPCcA5A== X-Received: from dlae7.prod.google.com ([2002:a05:701b:2307:b0:12c:211d:3e86]) (user=stevensd job=prod-delivery.src-stubby-dispatcher) by 2002:a05:7022:eac8:b0:12b:ec96:c936 with SMTP id a92af1059eb24-12c73f70bc8mr18556370c88.14.1777058211531; Fri, 24 Apr 2026 12:16:51 -0700 (PDT) Date: Fri, 24 Apr 2026 12:14:45 -0700 In-Reply-To: <20260424191456.2679717-1-stevensd@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20260424191456.2679717-1-stevensd@google.com> X-Mailer: git-send-email 2.54.0.rc2.544.gc7ae2d5bb8-goog Message-ID: <20260424191456.2679717-3-stevensd@google.com> Subject: [PATCH v2 02/13] fork: Don't assume fully populated stack during reuse From: David Stevens To: Pasha Tatashin , Linus Walleij , Will Deacon , Quentin Perret , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org, "H. 
Peter Anvin" , Andy Lutomirski , Xin Li , Peter Zijlstra , Andrew Morton , David Hildenbrand , Lorenzo Stoakes , "Liam R. Howlett" , Vlastimil Babka , Mike Rapoport , Suren Baghdasaryan , Michal Hocko , Uladzislau Rezki , Kees Cook Cc: David Stevens , linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"

In preparation for dynamic kernel stacks, don't assume that
vm_area->nr_pages matches THREAD_SIZE when clearing a stack for reuse.

Signed-off-by: David Stevens
---
 kernel/fork.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 8961b895bf05..50772c0cc5da 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -332,6 +332,8 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 
 	vm_area = alloc_thread_stack_node_from_cache(tsk, node);
 	if (vm_area) {
+		unsigned long memset_offset = 0;
+
 		if (memcg_charge_kernel_stack(vm_area)) {
 			vfree(vm_area->addr);
 			return -ENOMEM;
@@ -343,7 +345,9 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 		stack = kasan_reset_tag(vm_area->addr);
 
 		/* Clear stale pointers from reused stack. */
-		memset(stack, 0, THREAD_SIZE);
+		if (!IS_ENABLED(CONFIG_STACK_GROWSUP))
+			memset_offset = THREAD_SIZE - vm_area->nr_pages * PAGE_SIZE;
+		memset(stack + memset_offset, 0, vm_area->nr_pages * PAGE_SIZE);
 
 		tsk->stack_vm_area = vm_area;
 		tsk->stack = stack;
-- 
2.54.0.rc2.544.gc7ae2d5bb8-goog

From nobody Fri Jun 19 16:16:36 2026 Received: from mail-dy1-f201.google.com (mail-dy1-f201.google.com [74.125.82.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9586236A03B for ; Fri, 24 Apr 2026 19:16:54 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=74.125.82.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777058215; cv=none; b=ZHDQeleQ0p4dYkEJ+sgZ0oPSeZrhYvlML1nfsuozQWlMyp3PZPb9EJcK1Rkfj+rPlY+nq3aIudrxx1JDz+QbPH0EnyASNVmH5AN7W2eGOuhdj0EzLNBOIh283XN1zR9Yfd23lFLT3zuyNJRXRFhiVwU3CBYt8sAKBy98bkuSnpU= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777058215; c=relaxed/simple; bh=nQS6eXBrY3KPwkyShBSngvf9zYZxFsF3HnayMzV/8RU=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=ddAFI6yC3iwKXApWT+56+TYO12UKsfjvWnx3tXcJjgLsoYkkidDwsFaVBIQbx2XTu7SAY2JNa0tFHSUHraI8hQHokdhqPnGRiez3876CB4OP4dF86SSS5Hahy1bJmrkKAp55j/RCNORqkSMAE4kWmUEfACd4PFx6eoh9aylKUPU= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--stevensd.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=cpEuia10; arc=none smtp.client-ip=74.125.82.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--stevensd.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) 
header.d=google.com header.i=@google.com header.b="cpEuia10" Received: by mail-dy1-f201.google.com with SMTP id 5a478bee46e88-2ba9a744f7dso10721360eec.0 for ; Fri, 24 Apr 2026 12:16:54 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20251104; t=1777058214; x=1777663014; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=9JB7NKoMb7Kcix3lOCCD+xHQKh1dK9y1oER4KZfFKa0=; b=cpEuia10ThlYXXyhnavjYiCqDCVGYHkK1Sr7rKLQddh+7Rs5lVSzdt3d4uc4vWqXya /uaYblfVzL1lujAQgt90RKTStTw5npglUaWYGwkcmOVXdYuAHPWfqzW+OFkUZawp1R5w aWKg9oYIgYZVabaXLtYhrf0wZWyA+Z++f2G4AYWughDHbqzVV2aAEu6Co+P4w+H4XqVy 1MTz1n4mUZeGvlmkXGoGWAsPVZm3Bd1VjQdqzwzdnXpn0KdeYecNR+TzTg7TfXV8Jtv+ NZ0Jtv0h0Niyx88UOxIjTOYIWS3sbsXQG4Iy6oWwlDzUH4hW/c2g+dr9vduJGN+ClOJd Es4Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1777058214; x=1777663014; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=9JB7NKoMb7Kcix3lOCCD+xHQKh1dK9y1oER4KZfFKa0=; b=nXMXuk3QY45L5geKWNOwFmeNEb/CFzBT1IYafAPOwp2Q2rCRrf2qKkIOcnsWVxmqR1 FVD6A119Pnpjc8UkiRlN3zQWACPnxxd7knoF9o9A2R2qIWluMNBvl0VZIUd7U8iRt0Xv SZPnNulUt9dakYqINxj8D7yHfCc5k3Mmf8E2Ou4a+jNP6t10kkitIlU8N3YeDwJyCTmO 1buqB0KeNoBZWqGSE/RIc5uK8hpwClTNpRcfCiNbiBr/VCVSM30RHWusmNZZflt1xLR4 +DvmLK4NyWkU4OONceRMpBNbgbGOJxgGhgFsmr1LoGn3mh1GEYbP11jQ7qsKPShjLYSf cveQ== X-Forwarded-Encrypted: i=1; AFNElJ+iD+Ydh6Tdiw5pqtvw/MLlV5mlznnVivNIVM/nJsEaaA6QHAuEDQ80aRP5/KZU1PK8f2oOHPq4rJ2YY90=@vger.kernel.org X-Gm-Message-State: AOJu0YzaCNyEFoE/RErIQS4AYYWsId9uAorFUr35huXpnvy5rwMJTSWt RwoEI/EiGBiXtphYg+5nCFOWu0UzVAVE4t8g8Bqxl4sF+G+l/dH894oZ5/2tFQYUdw4JwWJONv0 SHKrjugj0KofqKw== X-Received: from dlam10.prod.google.com ([2002:a05:701b:208a:b0:12c:912f:7d3f]) (user=stevensd job=prod-delivery.src-stubby-dispatcher) by 2002:a05:7022:2602:b0:128:d23d:81a2 with SMTP id 
a92af1059eb24-12c73f9ae5cmr16959290c88.29.1777058213588; Fri, 24 Apr 2026 12:16:53 -0700 (PDT) Date: Fri, 24 Apr 2026 12:14:46 -0700 In-Reply-To: <20260424191456.2679717-1-stevensd@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20260424191456.2679717-1-stevensd@google.com> X-Mailer: git-send-email 2.54.0.rc2.544.gc7ae2d5bb8-goog Message-ID: <20260424191456.2679717-4-stevensd@google.com> Subject: [PATCH v2 03/13] fork: Move vm_stack to the beginning of the stack From: David Stevens To: Pasha Tatashin , Linus Walleij , Will Deacon , Quentin Perret , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org, "H. Peter Anvin" , Andy Lutomirski , Xin Li , Peter Zijlstra , Andrew Morton , David Hildenbrand , Lorenzo Stoakes , "Liam R. Howlett" , Vlastimil Babka , Mike Rapoport , Suren Baghdasaryan , Michal Hocko , Uladzislau Rezki , Kees Cook Cc: David Stevens , linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" The vm_stack struct used to free stacks via an RCU callback is stored directly in the stack being freed. Make sure it's stored at the beginning of the stack regardless of stack growth direction, to avoid faults on partially allocated dynamic stacks. 
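The placement rule described above can be sketched in plain user-space C as an offset calculation. This is an illustration under stated assumptions, not the kernel implementation: sketch_vm_stack is a hypothetical stand-in for struct vm_stack, SKETCH_THREAD_SIZE is an example value, and the growth direction is passed as a runtime flag instead of being taken from CONFIG_STACK_GROWSUP.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Example value only; THREAD_SIZE is architecture-specific. */
#define SKETCH_THREAD_SIZE 16384UL

/* Hypothetical stand-in for struct vm_stack (an RCU head plus a pointer). */
struct sketch_vm_stack {
	void *rcu_next;
	void (*rcu_func)(void *);
	void *stack_vm_area;
};

/*
 * Byte offset, from the lowest stack address, at which the record is placed.
 * An upward-growing stack starts at the lowest address, so the record goes
 * at offset 0. A downward-growing stack starts at the highest address, so
 * the record goes at the very end of the mapping, which is the part of a
 * partially populated stack that is backed first.
 */
static size_t sketch_vm_stack_offset(bool grows_up)
{
	assert(sizeof(struct sketch_vm_stack) <= SKETCH_THREAD_SIZE);

	if (grows_up)
		return 0;

	return SKETCH_THREAD_SIZE - sizeof(struct sketch_vm_stack);
}
```

In both cases the record ends up in the page where stack usage begins, so writing it should not touch a page that has not been populated yet.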
Signed-off-by: David Stevens
---
 kernel/fork.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 50772c0cc5da..72c081db492c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -282,7 +282,12 @@ static void thread_stack_free_rcu(struct rcu_head *rh)
 
 static void thread_stack_delayed_free(struct task_struct *tsk)
 {
-	struct vm_stack *vm_stack = tsk->stack;
+	struct vm_stack *vm_stack;
+
+	if (IS_ENABLED(CONFIG_STACK_GROWSUP))
+		vm_stack = tsk->stack;
+	else
+		vm_stack = tsk->stack + THREAD_SIZE - sizeof(*vm_stack);
 
 	vm_stack->stack_vm_area = tsk->stack_vm_area;
 	call_rcu(&vm_stack->rcu, thread_stack_free_rcu);
-- 
2.54.0.rc2.544.gc7ae2d5bb8-goog

From nobody Fri Jun 19 16:16:36 2026 Received: from mail-dy1-f201.google.com (mail-dy1-f201.google.com [74.125.82.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 868AC39478D for ; Fri, 24 Apr 2026 19:16:56 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=74.125.82.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777058217; cv=none; b=Naqj9CAroV12DgONuKM0o6p23oUJsc0TTatNHvBOWiUNMAraT6yreo//oaVcJXic70KjNzAfld2mlW1NmDyUJGD2M9z3A0DTizl/5pVhNj8n/Y3ARTCg52jPNfhWHzu20hARb6n2kolQiZUWDQSHo6rtjBsAgM8nRKor1Qj73WA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777058217; c=relaxed/simple; bh=G8gbRKcKZA1SdzIsqc/3W0/7Pcug5kFAxa35ZzPmKy0=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=WBVij89FEj2Ard5jiH4/j3sj8Y9GxxffWieiPMHfWLiaoxvc2YMZsPgLpDKc1V2xIM3wC7UGKRqftbueZVPNfccZHOcMBKPjR+iiFRoBh5JnaH3z6CyJb4dXnVhGwfJm1c4H5fcutg4oqNiHn1i0c4s86DX8I1YseqsN2f2dMqo= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass 
smtp.mailfrom=flex--stevensd.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=akHopOdb; arc=none smtp.client-ip=74.125.82.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--stevensd.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="akHopOdb" Received: by mail-dy1-f201.google.com with SMTP id 5a478bee46e88-2d8a677cdfaso8717421eec.1 for ; Fri, 24 Apr 2026 12:16:56 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20251104; t=1777058216; x=1777663016; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=MLdDN0AHHw4MrI7CUBGciac9s6fN21aSVhtjr+LOvQ0=; b=akHopOdbeARJ+mC76cv+kWGZAT0OpslrGnJvYHWuoDrzOMM6LpeL+/sTSV+YrjhYCj JkgwsXcs6fCVQwmi4q3nzB70Gjdn6H65KIMPHJXradbKEeaB6NLoobB/K4xfDyHxqJLl iwiNqcVhyG36Tzq243byhX7mL3mvqpMxf4eV3JfalNn6/CnzEglsltWrXppqk9DNHIeh cPa1+FtSOxGQJpG8ngPtdABY3oR7sJsPC9alD2sQiSg/kr3BL8a3kk+GdsEl/jKYijrH KcDzX4cqudeViXxTqlzB3pUGqGfIh6BF4UCwsC/GFg/iQ1VzwFGwMskJ81GRnzT7Q3sq TYHA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1777058216; x=1777663016; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=MLdDN0AHHw4MrI7CUBGciac9s6fN21aSVhtjr+LOvQ0=; b=hFeCFGhiWQsCEO6RF/xQKexzl+FLo+jIa6opQy/uSTNyevZEB2W/uxVFxViSS4bUTB I4O2QTDb/YKsxA7EIL9xRyFbwZjbi0/RLfp97ePi7vJGVyPs3SHUZMxF9SXfmP+CiZtg dtyF/e/LkAfgyOTTXnXcx7ypxaMpkGfVYdpyDY1ayKayVCA7mOigN091I7OgFY02OGor LLmp6o3t0YGhis5ahvRNa1KXtQmOoqYJ2tWtSiCbS7crOiVAqBc0Mk80XJeJ+hLF5NI1 N1coi4lFKiAjIfUOrgW8ezZd+dX0w7kKFsperYCDpa73xSvg8h6sR6bzlIwChz3VuCPD fN6g== X-Forwarded-Encrypted: i=1; 
AFNElJ8r0fawZnrd8GLKanq2wBkoX13o0eN0ZD0EamqesDcAHbldXZquryQegD1BY26JwKe/I4ID8zn9175lzQ4=@vger.kernel.org X-Gm-Message-State: AOJu0Yyi5RpGL6cmJisQaCtNsFYkcTyP0wFXYJEuuZ26sgt69RfXhRSy h6tyT9bWdgiw/Vz4P5nWj847o2VycTdWknvtx87WAHXptrjSLynYQfv2YQFSo3Lrv07eIkdrBaG VPLxp+nz4Q9Qdfw== X-Received: from dycoz1.prod.google.com ([2002:a05:7301:fc81:b0:2d9:1564:c80c]) (user=stevensd job=prod-delivery.src-stubby-dispatcher) by 2002:a05:7300:7495:b0:2df:71f0:e5b3 with SMTP id 5a478bee46e88-2e478c1ef91mr17887668eec.20.1777058215502; Fri, 24 Apr 2026 12:16:55 -0700 (PDT) Date: Fri, 24 Apr 2026 12:14:47 -0700 In-Reply-To: <20260424191456.2679717-1-stevensd@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20260424191456.2679717-1-stevensd@google.com> X-Mailer: git-send-email 2.54.0.rc2.544.gc7ae2d5bb8-goog Message-ID: <20260424191456.2679717-5-stevensd@google.com> Subject: [PATCH v2 04/13] fork: separate vmap stack allocation and free calls From: David Stevens To: Pasha Tatashin , Linus Walleij , Will Deacon , Quentin Perret , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org, "H. Peter Anvin" , Andy Lutomirski , Xin Li , Peter Zijlstra , Andrew Morton , David Hildenbrand , Lorenzo Stoakes , "Liam R. Howlett" , Vlastimil Babka , Mike Rapoport , Suren Baghdasaryan , Michal Hocko , Uladzislau Rezki , Kees Cook Cc: David Stevens , linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"

From: Pasha Tatashin

In preparation for dynamic stacks, separate out the __vmalloc_node() and
vfree() calls from the vmap-based stack allocations. The dynamic stacks
will use their own variants of these functions.
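The shape of this refactoring can be shown with a small user-space analogue. It is only a sketch: sketch_area, sketch_alloc_stack() and sketch_free_stack() are hypothetical names, and calloc()/free() stand in for the kernel allocator calls purely for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical descriptor, loosely playing the role of struct vm_struct. */
struct sketch_area {
	void *addr;	/* start of the stack memory */
	size_t size;	/* size of the stack memory in bytes */
};

/* Single allocation entry point: callers never call the allocator directly. */
static struct sketch_area *sketch_alloc_stack(size_t size)
{
	struct sketch_area *area = malloc(sizeof(*area));

	if (!area)
		return NULL;

	area->addr = calloc(1, size);
	if (!area->addr) {
		free(area);
		return NULL;
	}
	area->size = size;
	return area;
}

/* Single free entry point, taking the descriptor rather than a raw address. */
static void sketch_free_stack(struct sketch_area *area)
{
	assert(area != NULL);
	free(area->addr);
	free(area);
}
```

Because every caller goes through the two wrappers, a later change can swap in a different backing allocator by editing only these two functions.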
Signed-off-by: Pasha Tatashin
[Fix a bug in original patch: free_vmap_stack(vm_area->addr)]
Signed-off-by: Linus Walleij
[Add missing free_vmap_stack conversion, fix typos, rebase]
Signed-off-by: David Stevens
---
 kernel/fork.c | 40 ++++++++++++++++++++++++----------------
 1 file changed, 24 insertions(+), 16 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 72c081db492c..8bf32815f422 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -269,6 +269,21 @@ static bool try_release_thread_stack_to_cache(struct vm_struct *vm_area)
 	return false;
 }
 
+static inline struct vm_struct *alloc_vmap_stack(int node)
+{
+	void *stack;
+
+	stack = __vmalloc_node(THREAD_SIZE, THREAD_ALIGN, GFP_VMAP_STACK,
+			       node, __builtin_return_address(0));
+
+	return stack ? find_vm_area(stack) : NULL;
+}
+
+static inline void free_vmap_stack(struct vm_struct *vm_area)
+{
+	vfree(vm_area->addr);
+}
+
 static void thread_stack_free_rcu(struct rcu_head *rh)
 {
 	struct vm_stack *vm_stack = container_of(rh, struct vm_stack, rcu);
@@ -277,7 +292,7 @@ static void thread_stack_free_rcu(struct rcu_head *rh)
 	if (try_release_thread_stack_to_cache(vm_stack->stack_vm_area))
 		return;
 
-	vfree(vm_area->addr);
+	free_vmap_stack(vm_area);
 }
 
 static void thread_stack_delayed_free(struct task_struct *tsk)
@@ -304,7 +319,7 @@ static int free_vm_stack_cache(unsigned int cpu)
 		if (!vm_area)
 			continue;
 
-		vfree(vm_area->addr);
+		free_vmap_stack(vm_area);
 		cached_vm_stack_areas[i] = NULL;
 	}
 
@@ -333,41 +348,35 @@ static int memcg_charge_kernel_stack(struct vm_struct *vm_area)
 static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 {
 	struct vm_struct *vm_area;
-	void *stack;
 
 	vm_area = alloc_thread_stack_node_from_cache(tsk, node);
 	if (vm_area) {
 		unsigned long memset_offset = 0;
 
 		if (memcg_charge_kernel_stack(vm_area)) {
-			vfree(vm_area->addr);
+			free_vmap_stack(vm_area);
 			return -ENOMEM;
 		}
 
 		/* Reset stack metadata. */
 		kasan_unpoison_range(vm_area->addr, THREAD_SIZE);
-
-		stack = kasan_reset_tag(vm_area->addr);
+		tsk->stack = kasan_reset_tag(vm_area->addr);
 
 		/* Clear stale pointers from reused stack. */
 		if (!IS_ENABLED(CONFIG_STACK_GROWSUP))
 			memset_offset = THREAD_SIZE - vm_area->nr_pages * PAGE_SIZE;
-		memset(stack + memset_offset, 0, vm_area->nr_pages * PAGE_SIZE);
+		memset(tsk->stack + memset_offset, 0, vm_area->nr_pages * PAGE_SIZE);
 
 		tsk->stack_vm_area = vm_area;
-		tsk->stack = stack;
 		return 0;
 	}
 
-	stack = __vmalloc_node(THREAD_SIZE, THREAD_ALIGN,
-			       GFP_VMAP_STACK,
-			       node, __builtin_return_address(0));
-	if (!stack)
+	vm_area = alloc_vmap_stack(node);
+	if (!vm_area)
 		return -ENOMEM;
 
-	vm_area = find_vm_area(stack);
 	if (memcg_charge_kernel_stack(vm_area)) {
-		vfree(stack);
+		free_vmap_stack(vm_area);
 		return -ENOMEM;
 	}
 	/*
@@ -376,8 +385,7 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 	 * so cache the vm_struct.
 	 */
 	tsk->stack_vm_area = vm_area;
-	stack = kasan_reset_tag(stack);
-	tsk->stack = stack;
+	tsk->stack = kasan_reset_tag(vm_area->addr);
 	return 0;
 }
 
-- 
2.54.0.rc2.544.gc7ae2d5bb8-goog

From nobody Fri Jun 19 16:16:36 2026 Received: from mail-dy1-f201.google.com (mail-dy1-f201.google.com [74.125.82.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7E92E36BCDA for ; Fri, 24 Apr 2026 19:16:58 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=74.125.82.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777058223; cv=none; b=HfxVfTih+/SDVUFfZq7PN2Z/utr7q+03PQE2jbOudSGB3I0RF2BshaVVvKqhB4zYISRhD2KbG7b3dpDeU6i2jULipZ8/sE5WCxSIO+nKplLwFW3PRyCcPkA3CriyWvrPfZjX/rCJwbfdP90u7T9unsIXXOe87YwpMpLiSj1JiVs= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777058223; c=relaxed/simple; 
bh=i/9oPU2HocRmndLUSICVriaABeQQaYNmpuirr7o4KnQ=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=TYazLRCa9eLQKAdqIZ88txpA+HWogtVQjwWjborPPky7m+6csB3o1EIJDsBU3PSB6y/28wrucwt0EOFL0yF81DzPpP077cQg7XZsGRB6Ob3TUTrDN5mMG4+BsM5KK7DTfvCR8WFt/GmnoTUX3OQM+Tfa2+pWxO1LcxTlPfzWGpU= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--stevensd.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=gHktjsQC; arc=none smtp.client-ip=74.125.82.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--stevensd.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="gHktjsQC" Received: by mail-dy1-f201.google.com with SMTP id 5a478bee46e88-2dd6fb4c867so12584010eec.0 for ; Fri, 24 Apr 2026 12:16:58 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20251104; t=1777058218; x=1777663018; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=oo/pdV2AAzpZoUJTB8KwJow4oYxSpXiK/Ch5kfUiDOU=; b=gHktjsQCdSubH4o6ezURtPzrdzAucuGQcKDfXbnQPpIKpTUZ/5JqiIn9CRlXVFCVrL hqmQqrxnWaITJdBxh1iON/QCUTFh5dsHag/oPG7XZGOYsi41T7M3WG6Q2B/RF/Ma8pTP bVPIGTQqsrGtBvC7J7xNnIlTBa/Rn/HiD6m0VPHh3ziuNGH9OSVtydqnPjCFf4joo0mc NJzWDkEznWtKTFhvV/9nMbEhuXFwqQs/Wt3nUjZ+xJuWODPodI4zbHWwYgIIkG4Yqw0t S8a2x7c0Hi+duEhzmhpvR+i9fvuy7duBV2K4seQuG6OzFS9C6nRQoSgpJEJOELtlrLYx EYQA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1777058218; x=1777663018; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; 
bh=oo/pdV2AAzpZoUJTB8KwJow4oYxSpXiK/Ch5kfUiDOU=; b=sAC1jM4BQG++69vWLIChdhybmxUzMi3OxQR9ehXSrrcaUfWASTVR6xn6+rt4UPrbn7 gaVy+4l8fwcx9EgntZ5goUBtSwvLyeeOhvCGUflEZfTqyxEWQYhMOUIMPWzXBDUA7LZ6 HSCOwkypzxpHIShiQSYsLhznnZ7ZyfX5Ym0HvJaBZOPMAsyNgWbLvMqHwrJZAY1liUP4 bovE1K/5XXa4wNpMJfJCC2hcE8b4Y8607qn+x4AAWjVWvMeD8+Qx7aZvbq9SC4InQEAw AI2jrhbfpdLC7mWkQeS7PMGZPDeNSjUH/r18ENuVqAmUfcw7odmzpkopZVKm5K7z/2Fr p6Dg== X-Forwarded-Encrypted: i=1; AFNElJ97K2E+gXpgsC57i2lytpQrkMLxEwa0mqQ0W7Ny7TmHyF9uEv4Bc5bDqfd3g0PebV8WvjugKSFpjbMIv1M=@vger.kernel.org X-Gm-Message-State: AOJu0YyNm5pPrI6eMW1GbCkGr0TUaE4BZ8yya0MFBt0HGxn0OyJHlF33 nvMj5boa2mGO3/oiVNV/l9DuUU9eHgUcdwrTP0qX09nv7MKqKKQNUoQ1YkTTkmebxjK+to1wmT2 Go8J4rWmAr1o86Q== X-Received: from dlbcf24.prod.google.com ([2002:a05:7022:4598:b0:12d:b2ba:b551]) (user=stevensd job=prod-delivery.src-stubby-dispatcher) by 2002:a05:701b:270f:b0:12d:b993:c68f with SMTP id a92af1059eb24-12db993c9b2mr4770751c88.4.1777058217479; Fri, 24 Apr 2026 12:16:57 -0700 (PDT) Date: Fri, 24 Apr 2026 12:14:48 -0700 In-Reply-To: <20260424191456.2679717-1-stevensd@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20260424191456.2679717-1-stevensd@google.com> X-Mailer: git-send-email 2.54.0.rc2.544.gc7ae2d5bb8-goog Message-ID: <20260424191456.2679717-6-stevensd@google.com> Subject: [PATCH v2 05/13] mm/vmalloc: Add a get_vm_area_node() and vmap_pages_range() public functions From: David Stevens To: Pasha Tatashin , Linus Walleij , Will Deacon , Quentin Perret , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org, "H. Peter Anvin" , Andy Lutomirski , Xin Li , Peter Zijlstra , Andrew Morton , David Hildenbrand , Lorenzo Stoakes , "Liam R. 
Howlett" , Vlastimil Babka , Mike Rapoport , Suren Baghdasaryan , Michal Hocko , Uladzislau Rezki , Kees Cook Cc: David Stevens , linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"

From: Pasha Tatashin

get_vm_area_node()
  Unlike the other public get_vm_area_* variants, this one accepts the
  node from which to allocate the data structure, as well as an
  alignment, which allows creating a VM area with a specific alignment.

  Dynamic stacks will use this call to ensure that the stack VM area has
  a specific alignment, and that even if only one page is mapped, no
  page table allocations are needed to map the other stack pages.

vmap_pages_range()
  kernel/fork.c needs this function to map the initial stack pages, so
  export it and add a declaration to the linux/vmalloc.h header.

Signed-off-by: Pasha Tatashin
Signed-off-by: Linus Walleij
[Switched to vmap_pages_range instead of noflush variant, fix typos]
Signed-off-by: David Stevens
---
 include/linux/vmalloc.h | 14 ++++++++++++++
 mm/vmalloc.c            | 25 +++++++++++++++++++++++++
 2 files changed, 39 insertions(+)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index e8e94f90d686..7b56a0b998ab 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -250,6 +250,9 @@ extern struct vm_struct *__get_vm_area_caller(unsigned long size,
 					unsigned long flags,
 					unsigned long start, unsigned long end,
 					const void *caller);
+struct vm_struct *get_vm_area_node(unsigned long size, unsigned long align,
+				   unsigned long flags, int node, gfp_t gfp,
+				   const void *caller);
 void free_vm_area(struct vm_struct *area);
 extern struct vm_struct *remove_vm_area(const void *addr);
 extern struct vm_struct *find_vm_area(const void *addr);
@@ -301,11 +304,22 @@ static inline void set_vm_flush_reset_perms(void *addr)
 	if (vm)
 		vm->flags |= VM_FLUSH_RESET_PERMS;
 }
+
+int __must_check vmap_pages_range(unsigned long addr, unsigned long end,
+		pgprot_t prot, struct page **pages, unsigned int page_shift);
+
 #else /* !CONFIG_MMU */
 #define VMALLOC_TOTAL 0UL
 
 static inline unsigned long vmalloc_nr_pages(void) { return 0; }
 static inline void set_vm_flush_reset_perms(void *addr) {}
+static inline
+int __must_check vmap_pages_range(unsigned long addr, unsigned long end,
+		pgprot_t prot, struct page **pages, unsigned int page_shift)
+{
+	return -EINVAL;
+}
+
 #endif /* CONFIG_MMU */
 
 #if defined(CONFIG_MMU) && defined(CONFIG_SMP)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 61caa55a4402..39b7e118cbce 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -722,6 +722,7 @@ int vmap_pages_range(unsigned long addr, unsigned long end,
 {
 	return __vmap_pages_range(addr, end, prot, pages, page_shift, GFP_KERNEL);
 }
+EXPORT_SYMBOL_GPL(vmap_pages_range);
 
 static int check_sparse_vm_area(struct vm_struct *area, unsigned long start,
 				unsigned long end)
@@ -3285,6 +3286,30 @@ struct vm_struct *get_vm_area_caller(unsigned long size, unsigned long flags,
 				  NUMA_NO_NODE, GFP_KERNEL, caller);
 }
 
+/**
+ * get_vm_area_node - reserve a contiguous and aligned kernel virtual area
+ * @size: size of the area
+ * @align: alignment of the start address of the area
+ * @flags: %VM_IOREMAP for I/O mappings
+ * @node: NUMA node from which to allocate the area data structure
+ * @gfp: Flags to pass to the allocator
+ * @caller: Caller to be stored in the vm area data structure
+ *
+ * Search for an area of @size/align in the kernel virtual mapping area and
+ * reserve it for our purposes. Returns the area descriptor on success or %NULL
+ * on failure.
+ *
+ * Return: the area descriptor on success or %NULL on failure.
+ */
+struct vm_struct *get_vm_area_node(unsigned long size, unsigned long align,
+				   unsigned long flags, int node, gfp_t gfp,
+				   const void *caller)
+{
+	return __get_vm_area_node(size, align, PAGE_SHIFT, flags,
+				  VMALLOC_START, VMALLOC_END,
+				  node, gfp, caller);
+}
+
 /**
  * find_vm_area - find a continuous kernel virtual area
  * @addr: base address
-- 
2.54.0.rc2.544.gc7ae2d5bb8-goog

From nobody Fri Jun 19 16:16:36 2026 Received: from mail-dy1-f201.google.com (mail-dy1-f201.google.com [74.125.82.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A886136403B for ; Fri, 24 Apr 2026 19:17:00 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=74.125.82.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777058224; cv=none; b=VTzu+k9mZbWt3rsWW71z3my9G6V2KY7UbPvc8IXxZTElfmRnNVIzY4hh4TbCqkw/q2K5vYnihGbWt6bxjzl59CLdcvAvpjMODE58ttSQFm4ZdzQeYKW7kaHPc65jZlFNcDXcCsSqPOLsSpOZXWXoNkXQYz9woTR992wFopeFHrw= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777058224; c=relaxed/simple; bh=tNuwyKpalaIl3acwaoWFy9weySepbDKU/1IbYZVQRVA=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=Ib6i7IwnO2bnimrIDrBVO0CWcM38Xxf5zvXriVQgOwxX/FJf9HQFHcFlOFtYjBqS0tT1j5QTYb3pGobGhE8U4sXqeSrhWGaEvRmTZgROX5Gb/f4626NzYPBMt9BbfUiJroCN4dgWu32wW1G0/pVlxNek0WKX4M5q73SWag5wmpU= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--stevensd.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=P8b2jLMC; arc=none smtp.client-ip=74.125.82.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass 
smtp.mailfrom=flex--stevensd.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="P8b2jLMC" Received: by mail-dy1-f201.google.com with SMTP id 5a478bee46e88-2bdf6fe90a9so11988007eec.1 for ; Fri, 24 Apr 2026 12:17:00 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20251104; t=1777058220; x=1777663020; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=/N+N4eCtpYxHixfaIFLv9JBOjdUyGT3kJeE3HTW8Oq0=; b=P8b2jLMCNaAA0VqZQ2YUEU0Y1s5VdbeYDj1Ie1ixAtzF6z6qlSDMRZw2BipyE5ogNx +wBaS007xmsbx5Ixc4GtBGhNf+gN9IUGQcy6H1h8AyECeHmkf60836nxOpZ+uBGasbYX u5x+TNgsa4x7aQL1P90wStIKxGkka9i3o/S/1iVIsCCabpHplY5D7Ndl7uzWc2YDz72f pkf+2eXwquhz9pkZWL02lTpDoWalaxfNT45gFDVwvKbEP0OdP0fkb69dQhpSQX1EsilF WqWcXHp4BeW1qOZkZTyFbXCTyb+QF+vFLLFmpemxlKfDQv02p71NrIJxgu/D+hx23vkN lyfg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1777058220; x=1777663020; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=/N+N4eCtpYxHixfaIFLv9JBOjdUyGT3kJeE3HTW8Oq0=; b=lMuhwrL2bIyA0QWBAELOyhzzGuimBIrGEOGtTsBxPmcxWzzUzmj9DVM+bEX2hHh1G3 4dCRhqs1zvWklesGvm/mDTwZDAJRczIaBmCMEVwtn7A88Ix90xayjhW29oD4dKKjc8om okCTUb88gtPoOVJoo5M2F6Ygg+a8C5HMDwQV4KELJdPrRxsmOQjzAcjr2s5dKTB/XsR+ ZIirtfSw36kfIhZEAECuN/38wZFrkKM/h92htZj2sltfL13pkwY4e2EXmapXYidD7WyV GxXyWiOlh/m+k4DLsYEgPqQMwId1XcHaJ7b9RQwaYilLui1eCzObKgTlv8DzKIzhk10h k1JQ== X-Forwarded-Encrypted: i=1; AFNElJ8q3BHyHIw2uWLC8HhHZ/uoqnjGE5HoAJbPdEHAdtV9VjJrqm3Q7Zcm8LMlUKPmJ7ikGnHqjm0U5tG09Z8=@vger.kernel.org X-Gm-Message-State: AOJu0YxbSl0+RfpgBX3Q3sHe7DhqgbcLe0YIirR+if27IBAsV7pAhF6/ FXdVXuEcZ1FwHMkxm/JbrIoAbGbL+1csQf1Xx64/D80RNO/u6Modpz8olQrNKKNb8FMbaJM8tKH /NXkKtQvY08nkvQ== X-Received: from dly15-n1.prod.google.com 
([2002:a05:701b:204f:10b0:12a:83c5:a16f]) (user=stevensd job=prod-delivery.src-stubby-dispatcher) by 2002:a05:7022:50d:b0:12a:747e:5b5c with SMTP id a92af1059eb24-12c73fa3af3mr17422533c88.24.1777058219493; Fri, 24 Apr 2026 12:16:59 -0700 (PDT) Date: Fri, 24 Apr 2026 12:14:49 -0700 In-Reply-To: <20260424191456.2679717-1-stevensd@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20260424191456.2679717-1-stevensd@google.com> X-Mailer: git-send-email 2.54.0.rc2.544.gc7ae2d5bb8-goog Message-ID: <20260424191456.2679717-7-stevensd@google.com> Subject: [PATCH v2 06/13] fork: Move vmap stack freeing to work queue From: David Stevens To: Pasha Tatashin , Linus Walleij , Will Deacon , Quentin Perret , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org, "H. Peter Anvin" , Andy Lutomirski , Xin Li , Peter Zijlstra , Andrew Morton , David Hildenbrand , Lorenzo Stoakes , "Liam R. Howlett" , Vlastimil Babka , Mike Rapoport , Suren Baghdasaryan , Michal Hocko , Uladzislau Rezki , Kees Cook Cc: David Stevens , linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" For vmap stacks not immediately released into the stack cache, free them in a workqueue instead of via call_rcu(). In an RCU context, vfree already schedules the actual freeing on the per-cpu system workqueue, so this change only affects when exactly the second attempt to put the stack into the stack cache occurs. Moving freeing to a workqueue will allow for freeing dynamic stacks in a sleepable context (for remove_vm_area), rather than relying on vfree dispatching to a workqueue via vfree_atomic. 
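
A rough userspace model of the callback conversion described above may help; it is a sketch, not kernel code and not part of the patch. The struct definitions are simplified stand-ins (the kernel's struct rcu_work also carries an rcu_head and a workqueue pointer), and stack_area_from_work() is a hypothetical helper introduced only for illustration. It shows how a work_struct callback can get back to the enclosing vm_stack through to_rcu_work() and container_of(), where the call_rcu() callback needed only a single container_of() on the rcu_head.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace equivalent of the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified stand-ins for the kernel types (assumptions, not the real layouts). */
struct work_struct { int pending; };
struct rcu_work { struct work_struct work; };
struct vm_struct { void *addr; };

/* Mirrors the layout used by the patch: the work item is embedded in vm_stack. */
struct vm_stack {
	struct rcu_work work;
	struct vm_struct *stack_vm_area;
};

/* Same idea as the kernel helper: step from the work_struct to its rcu_work. */
static struct rcu_work *to_rcu_work(struct work_struct *work)
{
	return container_of(work, struct rcu_work, work);
}

/*
 * Hypothetical helper: recover the vm_struct from the work_struct pointer
 * that a workqueue callback receives, using two container_of() steps.
 */
static struct vm_struct *stack_area_from_work(struct work_struct *work)
{
	struct vm_stack *vm_stack;

	assert(work != NULL);
	vm_stack = container_of(to_rcu_work(work), struct vm_stack, work);
	return vm_stack->stack_vm_area;
}
```

In the kernel, a callback of this shape is registered with INIT_RCU_WORK() and queued with queue_rcu_work(), so it runs in process context after an RCU grace period has elapsed.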
Signed-off-by: David Stevens --- kernel/fork.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/kernel/fork.c b/kernel/fork.c index 8bf32815f422..01e0bf4f4b02 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -205,7 +205,7 @@ static DEFINE_PER_CPU(struct vm_struct *, cached_stacks= [NR_CACHED_STACKS]); #define GFP_VMAP_STACK (GFP_KERNEL | __GFP_ZERO) =20 struct vm_stack { - struct rcu_head rcu; + struct rcu_work work; struct vm_struct *stack_vm_area; }; =20 @@ -284,9 +284,9 @@ static inline void free_vmap_stack(struct vm_struct *vm= _area) vfree(vm_area->addr); } =20 -static void thread_stack_free_rcu(struct rcu_head *rh) +static void thread_stack_free_work(struct work_struct *work) { - struct vm_stack *vm_stack =3D container_of(rh, struct vm_stack, rcu); + struct vm_stack *vm_stack =3D container_of(to_rcu_work(work), struct vm_s= tack, work); struct vm_struct *vm_area =3D vm_stack->stack_vm_area; =20 if (try_release_thread_stack_to_cache(vm_stack->stack_vm_area)) @@ -305,7 +305,8 @@ static void thread_stack_delayed_free(struct task_struc= t *tsk) vm_stack =3D tsk->stack + THREAD_SIZE - sizeof(*vm_stack); =20 vm_stack->stack_vm_area =3D tsk->stack_vm_area; - call_rcu(&vm_stack->rcu, thread_stack_free_rcu); + INIT_RCU_WORK(&vm_stack->work, thread_stack_free_work); + queue_rcu_work(system_wq, &vm_stack->work); } =20 static int free_vm_stack_cache(unsigned int cpu) --=20 2.54.0.rc2.544.gc7ae2d5bb8-goog From nobody Fri Jun 19 16:16:36 2026 Received: from mail-dy1-f202.google.com (mail-dy1-f202.google.com [74.125.82.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id DF7763644DE for ; Fri, 24 Apr 2026 19:17:02 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=74.125.82.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777058225; cv=none; 
b=rxT5qmnTEeXX68FnfRk0fnwi03UiNUeNUjLf0as6leogwWEW8yzmFvajDdmgYm4gkaifyTu6LZSe5uwrHruNsMvgPm11yd84P8S27GqSZuSMmkW+RyMkB0X26e88hjW8+qr4UjY5NlklxErfaNvn+Xuh8/hBjxcbQQwLqKz3eas= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777058225; c=relaxed/simple; bh=3TdQ761WIfw5ZjRhl2c4mwAnJj1+uNqJiPgyA61YXgg=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=HUjR4KT4VZyfSANuoOLorwLRs7NO9Go1JBsBppArPfjCsXKQ7SEcI80+Gy09l1Ddpu8/rZUtE6QiS/xksPfnNc2M0q/jl4hLYp9pUR6oCVFCYwUei16t9mSORLAa4xpKTdO3ptlNRXNgzjzpTmfJoaUp5bdcUIn1Z4wFMQuYz78= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--stevensd.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=X1UQE5tW; arc=none smtp.client-ip=74.125.82.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--stevensd.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="X1UQE5tW" Received: by mail-dy1-f202.google.com with SMTP id 5a478bee46e88-2dd6fb4c867so12584189eec.0 for ; Fri, 24 Apr 2026 12:17:02 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20251104; t=1777058222; x=1777663022; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=XtvA7IpLXk0S/YWb/vbFA+ZmsgjoMJHMRLoR+QNRjhA=; b=X1UQE5tWFYdXd1lri2fH4y8P0RDWig6fL7VJJ+S+0MLGwR1DPfvfemfXYoLgFTNxhc YXhx2YFIPv9wLsbFhvEFsN78od0BWRyGbrzrQP6f2d6W0wwrtssG9aoHKB6dsqLgSl5n cS2AJziHZmSJ+/vy2736H/BuiX74kBlManRwS9Psm6CrdF8BVn/3Vp+r/1yUuzDZ4JK1 atj647lKaXfc2+o7yzbDlEx1uCpo+QQUUoHJaC8hlM3UODDR7VJ3D2nTyJHE7bAwaWzj 
20tjhDfKuX79LKAwc5lV0wyk3r25tJykov2j2w/cO2xBIa0Rbc3hV6qEW6gYt58HhDJb nIQA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1777058222; x=1777663022; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=XtvA7IpLXk0S/YWb/vbFA+ZmsgjoMJHMRLoR+QNRjhA=; b=QwDzVW7/KAsOU6e3SkZ1avOoqhnznP+EZsLabo5T6m2lVQaWbzhDFxkI/KWfgWPLhX Vmoxhw7FzPW5eXwokzHJMjP73GaD9FipvVCyHhRnaGFw8Dj+et+CjiW/ZqPu+rtaVfOM 3R7DIK3mAnPdNGkgl/5Q6vGM03uhP+8DHWHISkgXATgmUyfOk1HKHIHfzv4NO5sI4TGL oQY59AS6k1YRS+tUsOw+YM/Eo70Oi+BBAYa+t4ATWzawyfBDevilvgZQva24W6wAQl1o iHeKcfOYqwJlBFAiPdMp/6bA0Fx+pyviN5Tk9+gICbH2N2mfA/LJMCFnEFpkIzvUs+wd AxHQ== X-Forwarded-Encrypted: i=1; AFNElJ/wCQMAGdZ4FkQBSgIjx7ric7HmbEwiKN1V5O+sUR31BhzhA07oFvT44eTuKt4sSeg+Oo5ohUOjhJFL0Wg=@vger.kernel.org X-Gm-Message-State: AOJu0Yw21PPoIoXZRTaMgGA0OA3NH37CXXWoF5FRFKXXlw4IYbu/lL9T /JhRIAMkWWhMEj58isd0j1p5REoLb+GpQB0HyggcBS6dIGzZ4rGWO76u5i7YGoseIBqBM8iMr77 TbiiWRvba6jRXew== X-Received: from dlbut4.prod.google.com ([2002:a05:7022:7e04:b0:12b:e83a:8d31]) (user=stevensd job=prod-delivery.src-stubby-dispatcher) by 2002:a05:7022:69a:b0:11b:7970:ea3f with SMTP id a92af1059eb24-12c73f9f519mr16491319c88.25.1777058221615; Fri, 24 Apr 2026 12:17:01 -0700 (PDT) Date: Fri, 24 Apr 2026 12:14:50 -0700 In-Reply-To: <20260424191456.2679717-1-stevensd@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20260424191456.2679717-1-stevensd@google.com> X-Mailer: git-send-email 2.54.0.rc2.544.gc7ae2d5bb8-goog Message-ID: <20260424191456.2679717-8-stevensd@google.com> Subject: [PATCH v2 07/13] fork: Dynamic Kernel Stacks From: David Stevens To: Pasha Tatashin , Linus Walleij , Will Deacon , Quentin Perret , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org, "H. 
Peter Anvin" , Andy Lutomirski , Xin Li , Peter Zijlstra , Andrew Morton , David Hildenbrand , Lorenzo Stoakes , "Liam R. Howlett" , Vlastimil Babka , Mike Rapoport , Suren Baghdasaryan , Michal Hocko , Uladzislau Rezki , Kees Cook Cc: David Stevens , linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"

From: Pasha Tatashin

The core implementation of dynamic kernel stacks.

Unlike traditional kernel stacks, these stacks auto-grow as they are used. This can save a significant amount of memory in fleet environments. It also potentially allows the default kernel stack size to be increased to prevent stack overflows without compromising the overall memory overhead.

The dynamic kernel stacks interface provides two global functions:

1. dynamic_stack_fault()
   Architectures that support dynamic kernel stacks must call this function to handle a fault in the stack. It allocates and maps new pages into the stack. The pages are maintained in a per-cpu data structure.

2. dynamic_stack()
   Must be called as a thread leaves the CPU, to check whether the thread has allocated dynamic stack pages (i.e. whether PF_DYNAMIC_STACK is set in tsk->flags). If so, two things need to be done:
   a. Charge the thread for the allocated stack pages.
   b. Refill the per-cpu array so the next thread can also fault.

Dynamic kernel stacks do not support "STACK_END_MAGIC", as the last page does not have to be faulted in. However, since they are based on vmap stacks, the guard pages always protect the dynamic kernel stacks from overflow.

The average stack depth of a kernel thread depends on the workload, profiling, virtualization, compiler optimizations, and driver implementations. Therefore, the numbers should be tested for a specific workload.
From my tests, I found the following values on freshly booted, idle machines:

CPU            #Cores  #Stacks  Regular(kb)  Dynamic(kb)
AMD Genoa         384     5786        92576        23388
Intel Skylake     112     3182        50912        12860
AMD Rome          128     3401        54416        14784
AMD Rome          256     4908        78528        20876
Intel Haswell      72     2644        42304        10624

On all machines, dynamic kernel stacks take about 25% of the original stack memory. Only 5% of active tasks performed a stack page fault in their life cycles.

Signed-off-by: Pasha Tatashin
[Rebased, used vm_area->nr_pages directly in one instance]
[Depends on !PREEMPT_RT]
Signed-off-by: Linus Walleij
[Fix races around accounting]
[Use GFP_ATOMIC when executing in the scheduler]
[Depend on INIT_STACK_ALL_* config]
[Fix bugs in some error paths and edge cases]
[Don't cache partially faulted stacks]
[Added out-var to tell if address is on target stack]
Signed-off-by: David Stevens
--- arch/Kconfig | 39 ++++ include/linux/sched.h | 11 +- include/linux/sched/task_stack.h | 47 +++- init/init_task.c | 4 + kernel/fork.c | 357 +++++++++++++++++++++++++++++-- kernel/sched/core.c | 1 + 6 files changed, 439 insertions(+), 20 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index 102ddbd4298e..95ded79f0825 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -1515,6 +1515,45 @@ config VMAP_STACK backing virtual mappings with real shadow memory, and KASAN_VMALLOC must be enabled. =20 +config HAVE_ARCH_DYNAMIC_STACK + def_bool n + help + An arch should select this symbol if it can support kernel stacks + that grow dynamically. + + - Arch must have support for HAVE_ARCH_VMAP_STACK, in order to handle + stack related page faults. + + - Arch must be able to fault from interrupt context. + + - Arch must allow the kernel to handle stack faults gracefully, even + during interrupt handling. + + - Exceptions such as no pages available should be handled the same + in the consistent and predictable way. I.e.
the exception should be + handled the same as when stack overflow occurs when guard pages are + touched with extra information about the allocation error. + +config DYNAMIC_STACK + default y + bool "Dynamically grow kernel stacks" + depends on THREAD_INFO_IN_TASK + depends on HAVE_ARCH_DYNAMIC_STACK + depends on VMAP_STACK + depends on INIT_STACK_ALL_ZERO || INIT_STACK_ALL_PATTERN + depends on !KASAN + depends on !DEBUG_STACK_USAGE + depends on !STACK_GROWSUP + depends on !PREEMPT_RT + help + Dynamic kernel stacks allow to save memory on machines with a lot of + threads by starting with small stacks, and grow them only when needed. + On workloads where most of the stack depth do not reach over one page + the memory saving can be substantial. The feature requires virtually + mapped kernel stacks in order to handle page faults. It requires stack + initialization to preclude one thread from faulting on another thread's + stack. + config HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET def_bool n help diff --git a/include/linux/sched.h b/include/linux/sched.h index 5a5d3dbc9cdf..7aa06233afd5 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -836,7 +836,11 @@ struct task_struct { */ randomized_struct_fields_start =20 +#ifdef CONFIG_DYNAMIC_STACK + unsigned long packed_stack; +#else void *stack; +#endif refcount_t usage; /* Per task flags (PF_*), defined further below: */ unsigned int flags; @@ -1563,6 +1567,11 @@ struct task_struct { struct timer_list oom_reaper_timer; #endif #ifdef CONFIG_VMAP_STACK + /* + * We can't call find_vm_area() in interrupt context, and + * free_thread_stack() can be called in interrupt context, + * so cache the vm_struct. + */ struct vm_struct *stack_vm_area; #endif #ifdef CONFIG_THREAD_INFO_IN_TASK @@ -1773,7 +1782,7 @@ extern struct pid *cad_pid; * I am cleaning dirty pages from some other bdi. 
*/ #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */ -#define PF__HOLE__00800000 0x00800000 +#define PF_DYNAMIC_STACK 0x00800000 /* This thread allocated dynamic stack= pages */ #define PF__HOLE__01000000 0x01000000 #define PF__HOLE__02000000 0x02000000 #define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle = with cpus_mask */ diff --git a/include/linux/sched/task_stack.h b/include/linux/sched/task_st= ack.h index 1fab7e9043a3..7dcff2836d7e 100644 --- a/include/linux/sched/task_stack.h +++ b/include/linux/sched/task_stack.h @@ -13,6 +13,10 @@ =20 #ifdef CONFIG_THREAD_INFO_IN_TASK =20 +#ifdef CONFIG_DYNAMIC_STACK +#define DYNAMIC_STACK_MAX_ACCOUNT_MASK ((1 << (THREAD_SIZE_ORDER + 1)) - = 1) +#endif + /* * When accessing the stack of a non-current task that might exit, use * try_get_task_stack() instead. task_stack_page will return a pointer @@ -20,7 +24,11 @@ */ static __always_inline void *task_stack_page(const struct task_struct *tas= k) { +#ifdef CONFIG_DYNAMIC_STACK + return (void *)(task->packed_stack & ~DYNAMIC_STACK_MAX_ACCOUNT_MASK); +#else return task->stack; +#endif } =20 #define setup_thread_stack(new,old) do { } while(0) @@ -30,7 +38,7 @@ static __always_inline unsigned long *end_of_stack(const = struct task_struct *tas #ifdef CONFIG_STACK_GROWSUP return (unsigned long *)((unsigned long)task->stack + THREAD_SIZE) - 1; #else - return task->stack; + return task_stack_page(task); #endif } =20 @@ -83,9 +91,45 @@ static inline void put_task_stack(struct task_struct *ts= k) {} =20 void exit_task_stack_account(struct task_struct *tsk); =20 +#ifdef CONFIG_DYNAMIC_STACK + +#define task_stack_end_corrupted(task) 0 + +#ifndef THREAD_PREALLOC_PAGES +#define THREAD_PREALLOC_PAGES 1 +#endif + +#define THREAD_DYNAMIC_PAGES \ + ((THREAD_SIZE >> PAGE_SHIFT) - THREAD_PREALLOC_PAGES) + +void dynamic_stack_refill_pages(void); +unsigned long dynamic_stack_accounting(struct 
task_struct *tsk, bool final= ize); +bool dynamic_stack_fault(struct task_struct *tsk, unsigned long address, b= ool *on_stack); + +/* + * Refill and charge for the used pages. + */ +static inline void dynamic_stack(struct task_struct *tsk) +{ + if (unlikely(tsk->flags & PF_DYNAMIC_STACK)) { + dynamic_stack_refill_pages(); + dynamic_stack_accounting(tsk, false); + tsk->flags &=3D ~PF_DYNAMIC_STACK; + } +} + +static inline void set_task_stack_end_magic(struct task_struct *tsk) {} + +#else /* !CONFIG_DYNAMIC_STACK */ + #define task_stack_end_corrupted(task) \ (*(end_of_stack(task)) !=3D STACK_END_MAGIC) =20 +void set_task_stack_end_magic(struct task_struct *tsk); +static inline void dynamic_stack(struct task_struct *tsk) {} + +#endif /* CONFIG_DYNAMIC_STACK */ + static inline int object_is_on_stack(const void *obj) { void *stack =3D task_stack_page(current); @@ -104,7 +148,6 @@ static inline unsigned long stack_not_used(struct task_= struct *p) return 0; } #endif -extern void set_task_stack_end_magic(struct task_struct *tsk); =20 static inline int kstack_end(void *addr) { diff --git a/init/init_task.c b/init/init_task.c index 5c838757fc10..e3645ec4ab02 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -99,7 +99,11 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = =3D { .stack_refcount =3D REFCOUNT_INIT(1), #endif .__state =3D 0, +#ifdef CONFIG_DYNAMIC_STACK + .packed_stack =3D (unsigned long)init_stack, +#else .stack =3D init_stack, +#endif .usage =3D REFCOUNT_INIT(2), .flags =3D PF_KTHREAD, .prio =3D MAX_PRIO - 20, diff --git a/kernel/fork.c b/kernel/fork.c index 01e0bf4f4b02..e615ef736dc0 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -202,7 +202,10 @@ static DEFINE_PER_CPU(struct vm_struct *, cached_stack= s[NR_CACHED_STACKS]); * accounting is performed by the code assigning/releasing stacks to tasks. * We need a zeroed memory without __GFP_ACCOUNT. 
*/ -#define GFP_VMAP_STACK (GFP_KERNEL | __GFP_ZERO) +static gfp_t vmap_stack_gfp(bool is_atomic) +{ + return (is_atomic ? GFP_ATOMIC : GFP_KERNEL) | __GFP_ZERO; +} =20 struct vm_stack { struct rcu_work work; @@ -241,6 +244,18 @@ static bool try_release_thread_stack_to_cache(struct v= m_struct *vm_area) unsigned int i; int nid; =20 +#ifdef CONFIG_DYNAMIC_STACK + /* + * Skip the cache for populated dynamic stacks to avoid punishing a + * memcg with a larger charge just because it happened to pick up a + * dynamic stack that's been partially faulted in. We may get a lower + * number of cache hits, but stacks with dynamically faulted pages + * should be fairly uncommon. + */ + if (vm_area->nr_pages !=3D THREAD_PREALLOC_PAGES) + return false; +#endif /* CONFIG_DYNAMIC_STACK */ + /* * Don't cache stacks if any of the pages don't match the local domain, u= nless * there is no local memory to begin with. @@ -269,11 +284,285 @@ static bool try_release_thread_stack_to_cache(struct= vm_struct *vm_area) return false; } =20 +#ifdef CONFIG_DYNAMIC_STACK + +/* + * There is a window between when a thread refills the page pool and when = it + * actually gets scheduled out where it can still consume pages from the p= ool. + * To guarantee the next thread has enough pages to fully populate its sta= ck, + * double the size of the page pool. 
+ */ +#define DYNSTK_PAGE_POOL_NR (THREAD_DYNAMIC_PAGES * 2) + +static DEFINE_PER_CPU(struct page *, dynamic_stack_pages[DYNSTK_PAGE_POOL_= NR]); + +static void link_vmap_stack_to_task(struct task_struct *tsk, struct vm_str= uct *vm_area) +{ + tsk->stack_vm_area =3D vm_area; + tsk->packed_stack =3D (unsigned long)kasan_reset_tag(vm_area->addr); +} + +static void free_vmap_stack(struct vm_struct *vm_area) +{ + int i; + + remove_vm_area(vm_area->addr); + + for (i =3D 0; i < vm_area->nr_pages; i++) + __free_page(vm_area->pages[i]); + + kfree(vm_area->pages); + kfree(vm_area); +} + +static struct vm_struct *alloc_vmap_stack(int node) +{ + gfp_t gfp =3D vmap_stack_gfp(false); + unsigned long addr, end; + struct vm_struct *vm_area; + int err, i; + + /* + * Paranoid check to guarantee we never straddle a page table, so + * that virt_to_kpte() is always valid in dynamic_stack_fault(). + */ + BUILD_BUG_ON((PMD_SIZE % THREAD_SIZE) || (THREAD_ALIGN % THREAD_SIZE)); + + vm_area =3D get_vm_area_node(THREAD_SIZE, THREAD_ALIGN, VM_MAP, node, + gfp, __builtin_return_address(0)); + if (!vm_area) + return NULL; + + vm_area->pages =3D kmalloc_node(sizeof(void *) * + (THREAD_SIZE >> PAGE_SHIFT), gfp, node); + if (!vm_area->pages) + goto cleanup_err; + + for (i =3D 0; i < THREAD_PREALLOC_PAGES; i++) { + vm_area->pages[i] =3D alloc_pages(gfp, 0); + if (!vm_area->pages[i]) + goto cleanup_err; + vm_area->nr_pages++; + } + + addr =3D (unsigned long)vm_area->addr + + (THREAD_DYNAMIC_PAGES << PAGE_SHIFT); + end =3D (unsigned long)vm_area->addr + THREAD_SIZE; + err =3D vmap_pages_range(addr, end, PAGE_KERNEL, vm_area->pages, PAGE_SHI= FT); + if (err) + goto cleanup_err; + + return vm_area; +cleanup_err: + free_vmap_stack(vm_area); + return NULL; +} + +static struct page *noinstr dynamic_stack_get_page(void) +{ + struct page **pages =3D this_cpu_ptr(dynamic_stack_pages); + int i; + + for (i =3D 0; i < DYNSTK_PAGE_POOL_NR; i++) { + struct page *page =3D pages[i]; + + if (!page) + continue; + 
pages[i] =3D NULL; + return page; + } + + return NULL; +} + +static int dynamic_stack_refill_pages_cpu(unsigned int cpu) +{ + struct page **pages =3D per_cpu_ptr(dynamic_stack_pages, cpu); + int i; + + for (i =3D 0; i < DYNSTK_PAGE_POOL_NR; i++) { + if (pages[i]) + continue; + pages[i] =3D alloc_pages(vmap_stack_gfp(false), 0); + if (unlikely(!pages[i])) { + pr_err("failed to allocate dynamic stack page for cpu[%d]\n", + cpu); + break; + } + } + + return 0; +} + +static int dynamic_stack_free_pages_cpu(unsigned int cpu) +{ + struct page **pages =3D per_cpu_ptr(dynamic_stack_pages, cpu); + int i; + + for (i =3D 0; i < DYNSTK_PAGE_POOL_NR; i++) { + if (!pages[i]) + continue; + __free_page(pages[i]); + pages[i] =3D NULL; + } + + return 0; +} + +void dynamic_stack_refill_pages(void) +{ + struct page **pages =3D this_cpu_ptr(dynamic_stack_pages); + int i; + + for (i =3D 0; i < DYNSTK_PAGE_POOL_NR; i++) { + struct page *page =3D pages[i]; + + if (page) + continue; + + /* + * This is called during context switch, so we can't take any + * sleeping locks. As such, we need to use GFP_ATOMIC. + */ + page =3D alloc_pages(vmap_stack_gfp(true), 0); + if (unlikely(!page)) + pr_err_ratelimited("failed to refill per-cpu dynamic stack\n"); + pages[i] =3D page; + } +} + +unsigned long dynamic_stack_accounting(struct task_struct *tsk, bool final= ize) +{ + struct vm_struct *vm_area =3D tsk->stack_vm_area; + unsigned long nr_accounted, i; + + cant_sleep(); + + /* Verify enough low order bits in the page-aligned stack pointer. 
*/ + BUILD_BUG_ON(THREAD_PREALLOC_PAGES =3D=3D 0 || + PAGE_SIZE - 1 <=3D DYNAMIC_STACK_MAX_ACCOUNT_MASK); + + nr_accounted =3D tsk->packed_stack & DYNAMIC_STACK_MAX_ACCOUNT_MASK; + + if (nr_accounted =3D=3D DYNAMIC_STACK_MAX_ACCOUNT_MASK) { + WARN_ON_ONCE(finalize); + return 0; + } + + for (i =3D THREAD_PREALLOC_PAGES + nr_accounted; i < vm_area->nr_pages; i= ++) { + struct page *page =3D vm_area->pages[i]; + + int ret =3D memcg_kmem_charge_page(page, GFP_ATOMIC, 0); + /* + * XXX Since stack pages were already allocated, we should never + * fail charging. Therefore, we should probably induce force + * charge and oom killing if charge fails. + */ + if (unlikely(ret)) + pr_warn_ratelimited("dynamic stack: charge for allocated page failed\n"= ); + + mod_lruvec_page_state(page, NR_KERNEL_STACK_KB, + PAGE_SIZE / 1024); + } + + if (finalize) { + tsk->packed_stack |=3D DYNAMIC_STACK_MAX_ACCOUNT_MASK; + } else { + tsk->packed_stack &=3D ~DYNAMIC_STACK_MAX_ACCOUNT_MASK; + tsk->packed_stack |=3D (i - THREAD_PREALLOC_PAGES); + } + + return i; +} + +bool noinstr dynamic_stack_fault(struct task_struct *tsk, unsigned long ad= dress, bool *on_stack) +{ + unsigned long stack, hole_end, addr; + struct vm_struct *vm_area; + struct page *page; + int nr_pages; + pte_t *pte; + + cant_sleep(); + + if (WARN_ON(in_nmi())) { + *on_stack =3D false; + return false; + } + + /* check if address is inside the kernel stack area */ + stack =3D (unsigned long)task_stack_page(tsk); + if (address < stack || address >=3D stack + THREAD_SIZE) { + *on_stack =3D false; + return false; + } + *on_stack =3D true; + + vm_area =3D tsk->stack_vm_area; + if (WARN_ON_ONCE(!vm_area)) + return false; + + nr_pages =3D vm_area->nr_pages; + + /* Check if fault address is within the stack hole */ + hole_end =3D stack + THREAD_SIZE - (nr_pages << PAGE_SHIFT); + if (address >=3D hole_end) + return false; + + /* + * Most likely we faulted in the page right next to the last mapped + * page in the stack, however, it is 
possible (but very unlikely) that + * the faulted page is actually skips some pages in the stack. Make sure + * we do not create more than one holes in the stack, and map every + * page between the current fault address and the last page that is + * mapped in the stack. + */ + address =3D PAGE_ALIGN_DOWN(address); + for (addr =3D hole_end - PAGE_SIZE; addr >=3D address; addr -=3D PAGE_SIZ= E) { + /* Take the next page from the per-cpu list */ + page =3D dynamic_stack_get_page(); + if (!page) { + instrumentation_begin(); + pr_emerg("Failed to allocate a page during kernel_stack_fault\n"); + instrumentation_end(); + return false; + } + + /* Add the new page entry to the page table */ + pte =3D virt_to_kpte(addr); + if (!pte) { + instrumentation_begin(); + pr_emerg("The PTE page table for a kernel stack is not found\n"); + instrumentation_end(); + return false; + } + + /* Make sure there are no existing mappings at this address */ + if (pte_present(*pte)) { + instrumentation_begin(); + pr_emerg("The PTE contains a mapping\n"); + instrumentation_end(); + return false; + } + set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL)); + + /* Store the new page in the stack's vm_area */ + vm_area->pages[nr_pages] =3D page; + vm_area->nr_pages =3D ++nr_pages; + } + + /* Refill the pcp stack pages during context switch */ + tsk->flags |=3D PF_DYNAMIC_STACK; + + return true; +} + +#else /* !CONFIG_DYNAMIC_STACK */ static inline struct vm_struct *alloc_vmap_stack(int node) { void *stack; =20 - stack =3D __vmalloc_node(THREAD_SIZE, THREAD_ALIGN, GFP_VMAP_STACK, + stack =3D __vmalloc_node(THREAD_SIZE, THREAD_ALIGN, vmap_stack_gfp(false), node, __builtin_return_address(0)); =20 return stack ? 
find_vm_area(stack) : NULL; @@ -284,6 +573,13 @@ static inline void free_vmap_stack(struct vm_struct *v= m_area) vfree(vm_area->addr); } =20 +static void link_vmap_stack_to_task(struct task_struct *tsk, struct vm_str= uct *vm_area) +{ + tsk->stack_vm_area =3D vm_area; + tsk->stack =3D kasan_reset_tag(vm_area->addr); +} +#endif /* CONFIG_DYNAMIC_STACK */ + static void thread_stack_free_work(struct work_struct *work) { struct vm_stack *vm_stack =3D container_of(to_rcu_work(work), struct vm_s= tack, work); @@ -300,9 +596,9 @@ static void thread_stack_delayed_free(struct task_struc= t *tsk) struct vm_stack *vm_stack; =20 if (IS_ENABLED(CONFIG_STACK_GROWSUP)) - vm_stack =3D tsk->stack; + vm_stack =3D task_stack_page(tsk); else - vm_stack =3D tsk->stack + THREAD_SIZE - sizeof(*vm_stack); + vm_stack =3D task_stack_page(tsk) + THREAD_SIZE - sizeof(*vm_stack); =20 vm_stack->stack_vm_area =3D tsk->stack_vm_area; INIT_RCU_WORK(&vm_stack->work, thread_stack_free_work); @@ -361,14 +657,13 @@ static int alloc_thread_stack_node(struct task_struct= *tsk, int node) =20 /* Reset stack metadata. */ kasan_unpoison_range(vm_area->addr, THREAD_SIZE); - tsk->stack =3D kasan_reset_tag(vm_area->addr); + link_vmap_stack_to_task(tsk, vm_area); =20 /* Clear stale pointers from reused stack. */ if (!IS_ENABLED(CONFIG_STACK_GROWSUP)) memset_offset =3D THREAD_SIZE - vm_area->nr_pages * PAGE_SIZE; - memset(tsk->stack + memset_offset, 0, vm_area->nr_pages * PAGE_SIZE); + memset(task_stack_page(tsk) + memset_offset, 0, vm_area->nr_pages * PAGE= _SIZE); =20 - tsk->stack_vm_area =3D vm_area; return 0; } =20 @@ -380,22 +675,20 @@ static int alloc_thread_stack_node(struct task_struct= *tsk, int node) free_vmap_stack(vm_area); return -ENOMEM; } - /* - * We can't call find_vm_area() in interrupt context, and - * free_thread_stack() can be called in interrupt context, - * so cache the vm_struct. 
- */ - tsk->stack_vm_area =3D vm_area; - tsk->stack =3D kasan_reset_tag(vm_area->addr); + link_vmap_stack_to_task(tsk, vm_area); return 0; } =20 static void free_thread_stack(struct task_struct *tsk) { - if (!try_release_thread_stack_to_cache(tsk->stack_vm_area)) + if (!try_release_thread_stack_to_cache(task_stack_vm_area(tsk))) thread_stack_delayed_free(tsk); =20 +#ifdef CONFIG_DYNAMIC_STACK + tsk->packed_stack =3D 0; +#else tsk->stack =3D NULL; +#endif tsk->stack_vm_area =3D NULL; } =20 @@ -498,9 +791,27 @@ static void account_kernel_stack(struct task_struct *t= sk, int account) { if (IS_ENABLED(CONFIG_VMAP_STACK)) { struct vm_struct *vm_area =3D task_stack_vm_area(tsk); - int i; + int i, nr_accounted; =20 - for (i =3D 0; i < vm_area->nr_pages; i++) +#ifdef CONFIG_DYNAMIC_STACK + /* + * For the exit path, resolve any pending accounting to avoid + * underflow. Finalize to skip accounting for any faults that + * happen between here and this thread's final __schedule() + * call in do_task_dead(). + */ + if (account < 0) { + preempt_disable(); + nr_accounted =3D dynamic_stack_accounting(tsk, true); + preempt_enable(); + } else { + nr_accounted =3D THREAD_PREALLOC_PAGES; + } +#else + nr_accounted =3D vm_area->nr_pages; +#endif + + for (i =3D 0; i < nr_accounted; i++) mod_lruvec_page_state(vm_area->pages[i], NR_KERNEL_STACK_KB, account * (PAGE_SIZE / 1024)); } else { @@ -901,6 +1212,16 @@ void __init fork_init(void) NULL, free_vm_stack_cache); #endif =20 +#ifdef CONFIG_DYNAMIC_STACK + cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "fork:dynamic_stack", + dynamic_stack_refill_pages_cpu, + dynamic_stack_free_pages_cpu); + /* + * Fill the dynamic stack pages for the boot CPU, others will be filled + * as CPUs are onlined. 
+ */ + dynamic_stack_refill_pages_cpu(smp_processor_id()); +#endif scs_init(); =20 lockdep_init_task(&init_task); @@ -914,6 +1235,7 @@ int __weak arch_dup_task_struct(struct task_struct *ds= t, return 0; } =20 +#ifndef CONFIG_DYNAMIC_STACK void set_task_stack_end_magic(struct task_struct *tsk) { unsigned long *stackend; @@ -921,6 +1243,7 @@ void set_task_stack_end_magic(struct task_struct *tsk) stackend =3D end_of_stack(tsk); *stackend =3D STACK_END_MAGIC; /* for overflow detection */ } +#endif =20 static struct task_struct *dup_task_struct(struct task_struct *orig, int n= ode) { diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 496dff740dca..417269a86973 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6783,6 +6783,7 @@ static void __sched notrace __schedule(int sched_mode) rq =3D cpu_rq(cpu); prev =3D rq->curr; =20 + dynamic_stack(prev); schedule_debug(prev, preempt); =20 if (sched_feat(HRTICK) || sched_feat(HRTICK_DL)) --=20 2.54.0.rc2.544.gc7ae2d5bb8-goog From nobody Fri Jun 19 16:16:36 2026 Received: from mail-dy1-f201.google.com (mail-dy1-f201.google.com [74.125.82.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E52FA366820 for ; Fri, 24 Apr 2026 19:17:04 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=74.125.82.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777058226; cv=none; b=jAxO8bm56jyx4xOkJj7UUVwJfssx4KSpGQ2BezuHCFXZ5N0N+L35QMRc1w2SGqtO5FJNoq3KZLWpXQRY5Xs5onlKsRib0HhYdMu1EOzZlp3DOPIHpCRjzBWo5+kfsADe4F4PKtiDGcNiypEHZMqAFN1o8C83soYgA2oUPfhEqCg= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777058226; c=relaxed/simple; bh=MryoLFz+gAPvMVLjt/bhg2ZLuhTMPgGMCzfEsk6T/Ag=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; 
b=RFa5bOdxZXnB/kSCFx/d8kBSqIF2E8pGeefjzUrgsE8ALLenS5xKc9E9NeRlV5rW9eeWrBpZHw2CmYog6Rpfx9gmYWuBZNHV7BBBcU5drqHAS/SwTGb/yOWvJvVp9zvHnIMb8sKTTEHmrColCHW2v0oKwFA7jjfZuOLou1LmPbE= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--stevensd.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=LJ2mFNkG; arc=none smtp.client-ip=74.125.82.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--stevensd.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="LJ2mFNkG" Received: by mail-dy1-f201.google.com with SMTP id 5a478bee46e88-2d93379001eso16843532eec.1 for ; Fri, 24 Apr 2026 12:17:04 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20251104; t=1777058224; x=1777663024; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=bJ923Ki6R5fPmM4gLIIn14hi8wnIAPg7CegKahU1aFY=; b=LJ2mFNkG5i+ePoxWlPeWbi4CHwiaRCLWKB6/j9IEjr2utnDbvsjvdnJa24Xi2aZTeX dShHIDe0B8m+J1fnAijCVLaMQyupDDksLfthz4iaN74dwf5NkK4bmXoPGX/VX5T3/jeN koVsFjTQ3WFArklEmDf/G8lG5Ibr+b6b25OGTNJkokaK5/1grXhjlniNqRTrRO43WxQp fnszr0rzVq9gddvdiulfND+n15Q/HW+MWAdXdoVOOea0FPOCtc/Jle69i27ebQ6NhA1N /sjb4HFtAFZMUeAD6ExKQuYM0EsgxDaAP0joWMnsG9Tnt6lvBTrqW7wW9TmomqTz4ewI uRag== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1777058224; x=1777663024; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=bJ923Ki6R5fPmM4gLIIn14hi8wnIAPg7CegKahU1aFY=; b=XMfo8oE+BGOINxqysuvWA4QpA8uNXEmld12D1n49cYxVs6T09Vyhpv3+nDzlvJ2f85 
RHCI85n5YIMfWz/0BFf6qpxLYGq2+gT7S2v4QiNX3pUqqyxu4QsNr4UdjTBdj4DBR2el s5n4SbJMwBRztAXARGz2Pm5usPez6KOBHUPVFxuUw9viEwQUt9p4BG3gpE/PhQ6da/8K uklnM+58KHWDZl/Z/t0mBynhZ0GoA27KwucNscM44uIWAtGq3L91wL1eGxAxeMVbVtL4 Wiq51eljsKr19JwRZtAhZr6iZRZMtRxabJfiPVXa34p3h+G7i7Kl1FA0bwkyO2rkznwL 98uQ== X-Forwarded-Encrypted: i=1; AFNElJ9e98MyXLPwWpghEBdPy/2QzBZTnJwu4vTinf1QoDyhn78il54lizWXfZfJoKlP7HdP/n1IdZpsBAtdvto=@vger.kernel.org X-Gm-Message-State: AOJu0YwM5lhEFC5KRJ76x1DV8zzAd+j12NBcycZ5mS7B/kMAq47T3LVo Xa3YJp6hinPQx/jee7dqShZ1RSlj2bKwgZ7HTOm8dXbXyjMZbNXM1cNonZH//3IPjUNOrjTl09d T+VxhtXfHnomvxg== X-Received: from dycoy24.prod.google.com ([2002:a05:7301:fc18:b0:2e2:488c:4eba]) (user=stevensd job=prod-delivery.src-stubby-dispatcher) by 2002:a05:7300:8628:b0:2d8:1efe:51dc with SMTP id 5a478bee46e88-2e464ea8c23mr20065493eec.6.1777058223573; Fri, 24 Apr 2026 12:17:03 -0700 (PDT) Date: Fri, 24 Apr 2026 12:14:51 -0700 In-Reply-To: <20260424191456.2679717-1-stevensd@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20260424191456.2679717-1-stevensd@google.com> X-Mailer: git-send-email 2.54.0.rc2.544.gc7ae2d5bb8-goog Message-ID: <20260424191456.2679717-9-stevensd@google.com> Subject: [PATCH v2 08/13] task_stack.h: Add stack_not_used() support for dynamic stack From: David Stevens To: Pasha Tatashin , Linus Walleij , Will Deacon , Quentin Perret , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org, "H. Peter Anvin" , Andy Lutomirski , Xin Li , Peter Zijlstra , Andrew Morton , David Hildenbrand , Lorenzo Stoakes , "Liam R. 
Howlett" , Vlastimil Babka , Mike Rapoport , Suren Baghdasaryan , Michal Hocko , Uladzislau Rezki , Kees Cook Cc: David Stevens , linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Pasha Tatashin CONFIG_DEBUG_STACK_USAGE is enabled by default on most architectures. Its purpose is to determine and print the maximum stack depth on thread exit. It works by starting from the bottom of the stack and searching for the first non-zero word. With dynamic stacks this does not work well, because the scan faults in every page of every stack. Instead, add a specific version of stack_not_used() for dynamic stacks that starts from the last page mapped in the stack rather than from the bottom of the stack. In addition to avoiding unnecessary page faults, the search is faster because it skips pages that contain only zeros. Also, because a dynamic stack does not end with STACK_END_MAGIC, there is no need to skip the bottommost word in the stack. 
Signed-off-by: Pasha Tatashin [Rebased, Kasan oneliner needed preserving, rewrote a bit due to bugs] Signed-off-by: Linus Walleij [Handle init_task's use of init_stack, fix typos] Signed-off-by: David Stevens --- arch/Kconfig | 1 - kernel/exit.c | 22 ++++++++++++++++++++++ 2 files changed, 22 insertions(+), 1 deletion(-) diff --git a/arch/Kconfig b/arch/Kconfig index 95ded79f0825..beffe7e01296 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -1542,7 +1542,6 @@ config DYNAMIC_STACK depends on VMAP_STACK depends on INIT_STACK_ALL_ZERO || INIT_STACK_ALL_PATTERN depends on !KASAN - depends on !DEBUG_STACK_USAGE depends on !STACK_GROWSUP depends on !PREEMPT_RT help diff --git a/kernel/exit.c b/kernel/exit.c index ede3117fa7d4..6caf4030e8f4 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -71,6 +71,7 @@ #include #include #include +#include =20 #include =20 @@ -791,6 +792,26 @@ unsigned long stack_not_used(struct task_struct *p) return (unsigned long)end_of_stack(p) - (unsigned long)n; } #else /* !CONFIG_STACK_GROWSUP */ +#ifdef CONFIG_DYNAMIC_STACK +unsigned long stack_not_used(struct task_struct *p) +{ + struct vm_struct *vm_area =3D task_stack_vm_area(p); + unsigned long stack =3D (unsigned long)task_stack_page(p); + unsigned long alloc_size, *n; + + /* This is NULL only for init_task, where init_stack is fully allocated. 
= */ + if (likely(vm_area)) + alloc_size =3D vm_area->nr_pages << PAGE_SHIFT; + else + alloc_size =3D THREAD_SIZE; + n =3D (unsigned long *)(stack + THREAD_SIZE - alloc_size); + + while (!*n) + n++; + + return (unsigned long)n - stack; +} +#else unsigned long stack_not_used(struct task_struct *p) { unsigned long *n =3D end_of_stack(p); @@ -801,6 +822,7 @@ unsigned long stack_not_used(struct task_struct *p) =20 return (unsigned long)n - (unsigned long)end_of_stack(p); } +#endif /* CONFIG_DYNAMIC_STACK */ #endif /* CONFIG_STACK_GROWSUP */ =20 /* Count the maximum pages reached in kernel stacks */ --=20 2.54.0.rc2.544.gc7ae2d5bb8-goog From nobody Fri Jun 19 16:16:36 2026 Received: from mail-dl1-f73.google.com (mail-dl1-f73.google.com [74.125.82.73]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 0440B3E7169 for ; Fri, 24 Apr 2026 19:17:06 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=74.125.82.73 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777058228; cv=none; b=YErNFJW5A5OQxzNgHBZZrQ2nJhwMi86976UdVPpl5DpYYMno1Dk7pE1wEfxq0JQaE/IQRRryAgFCduO8bHwzD+4mof1eUb7PgDopHBoyYkRNa3y+piMxU12N80MewpLwP4Lau0Zy9gjpt4kggGCb90IOG85O0IdD7jhIr82vNsY= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777058228; c=relaxed/simple; bh=RhALqMblpZH4uENCiafOIg3H3ixsF5M678eBW2fmFrc=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=ancShQvhJkgPMOUGWYVEhQp/u/FT9oew87FSvbDVSdXADBhHzCAeZmaZabmoWpa2RvNv0sZ5ziCqCpCVLaEyI2Vw2n74hYoxRykGYCtDPkO7JUEDnI7+jOUHP4bKjP4tSSrRoIMU+Bfg7POlvcQlWt41gEjVgdAAI/x53UJBtFo= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--stevensd.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com 
header.b=u6036bLN; arc=none smtp.client-ip=74.125.82.73 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--stevensd.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="u6036bLN" Received: by mail-dl1-f73.google.com with SMTP id a92af1059eb24-12dbf4f678eso13754584c88.0 for ; Fri, 24 Apr 2026 12:17:06 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20251104; t=1777058226; x=1777663026; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=RwTIWWS6PoMEJey7EJLHssvqaQ5r36vA2Bmz8LkSSZY=; b=u6036bLNzxOeSBknuKPpRLLQA00vuB7jgJmf2nngpAUjjbkXjRdlnrEfmCTjgZyBKL Ui/z2qspBwY1jujRXsEbVFAQf+mLQOyj5hn014XRUGojkw2mtJhH2OzDuC3S1+G+q1rK z0j6UWSlDoLncWMVSoBPOGPJL4yZ/h8DAHqD7ZwCh7ntTYEeM7HsAn70Qu5qkD3FDhob Ri424gjPA3W3b109L0rFuo23IJCHb8na8XMlhSMY6HQYGfQm+bVt420gZlkq3i7pLCd9 RhS1st4uKgNfW0dc7VSGWkmrtLPIiMz0/ySwG6AB7aMyBuGL5NjMhlJuai7CWlTmhIMM VFPA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1777058226; x=1777663026; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=RwTIWWS6PoMEJey7EJLHssvqaQ5r36vA2Bmz8LkSSZY=; b=C6AGVYKUHupCUXzI+Kk9Loim+96iGlkLVp7JPrDkifrr32XnXpAypg3JSSAo01saW/ Ba6NmEECj5IwpeTNfGbC5mQK97D/stTJvcSFljoMtDyQtaJecrUIUu0DGjXOF9FvugGh h1y6tTg1TTjtduqurnaMbLrHLEytTdnJy0VIh5/mipo2ItG3TKPiqQHavt73WNDCdzDG QbdIfdEzfQ14RJrQkrjmLCWqN7VKiFOff/xlAeautt+RTDpRlca6k79nlyAI5QVJg79s pmvoxglJMVCxdTozh1SUQPpoful7quYmhifZ4W9e2N53gJK4z+akWD2ZDyo1O6quQtBY Sfuw== X-Forwarded-Encrypted: i=1; AFNElJ9c7GQ/qelbGL1IkpOvFAO+EBtTiAwFjSkVgu5wnsg9yYaZoeiuXSgx27B8mlObbnEQwS3NmR032bLuBEE=@vger.kernel.org X-Gm-Message-State: 
AOJu0YxjTS8cuAyd0XRBfvlqU7OQXaiiVNCBnxbWisR1cHCilp99s0SW hf2873aF6RLwuttx7kCN842MTtQSJ25iF7fUmF5ZOIAzuz1j5IbeWd2GM+BwY9B0bAZGc5S0KOp 7ypcnWEbZn/BCZA== X-Received: from dled1-n2.prod.google.com ([2002:a05:701b:42c1:20b0:12c:5513:fb7d]) (user=stevensd job=prod-delivery.src-stubby-dispatcher) by 2002:a05:7022:3d0e:b0:12d:ca31:f19d with SMTP id a92af1059eb24-12dca31f3f7mr1182426c88.28.1777058225702; Fri, 24 Apr 2026 12:17:05 -0700 (PDT) Date: Fri, 24 Apr 2026 12:14:52 -0700 In-Reply-To: <20260424191456.2679717-1-stevensd@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20260424191456.2679717-1-stevensd@google.com> X-Mailer: git-send-email 2.54.0.rc2.544.gc7ae2d5bb8-goog Message-ID: <20260424191456.2679717-10-stevensd@google.com> Subject: [PATCH v2 09/13] fork: Dynamic Kernel Stack accounting From: David Stevens To: Pasha Tatashin , Linus Walleij , Will Deacon , Quentin Perret , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org, "H. Peter Anvin" , Andy Lutomirski , Xin Li , Peter Zijlstra , Andrew Morton , David Hildenbrand , Lorenzo Stoakes , "Liam R. Howlett" , Vlastimil Babka , Mike Rapoport , Suren Baghdasaryan , Michal Hocko , Uladzislau Rezki , Kees Cook Cc: David Stevens , linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Pasha Tatashin Add accounting for the number of stack pages that have been faulted in and are currently in use. Example use case: $ cat /proc/vmstat | grep stack nr_kernel_stack 18684 nr_dynamic_stacks_faults 156 The above shows that the kernel stacks use a total of 18684 KiB, of which 156 KiB were faulted in. Given that the pre-allocated stacks are 4 KiB each, we can determine the total number of tasks: tasks =3D (nr_kernel_stack - nr_dynamic_stacks_faults) / 4 =3D 4632. 
The amount of kernel stack memory without dynamic stack on this machine would be: 4632 * 16 KiB =3D 74,112 KiB Therefore, in this example dynamic stacks save: 55,428 KiB Signed-off-by: Pasha Tatashin [Rebased] Signed-off-by: Linus Walleij [add to memcg stats, fix typos] Signed-off-by: David Stevens --- include/linux/mmzone.h | 3 +++ kernel/fork.c | 12 +++++++++++- mm/memcontrol.c | 10 ++++++++++ mm/vmstat.c | 3 +++ 4 files changed, 27 insertions(+), 1 deletion(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 3e51190a55e4..4458fa7016a1 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -221,6 +221,9 @@ enum node_stat_item { NR_FOLL_PIN_ACQUIRED, /* via: pin_user_page(), gup flag: FOLL_PIN */ NR_FOLL_PIN_RELEASED, /* pages returned via unpin_user_page() */ NR_KERNEL_STACK_KB, /* measured in KiB */ +#ifdef CONFIG_DYNAMIC_STACK + NR_DYNAMIC_STACKS_FAULTS_KB, /* KiB of faulted kernel stack memory */ +#endif #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK) NR_KERNEL_SCS_KB, /* measured in KiB */ #endif diff --git a/kernel/fork.c b/kernel/fork.c index e615ef736dc0..9ac9d23f5f4b 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -463,6 +463,8 @@ unsigned long dynamic_stack_accounting(struct task_stru= ct *tsk, bool finalize) =20 mod_lruvec_page_state(page, NR_KERNEL_STACK_KB, PAGE_SIZE / 1024); + mod_lruvec_page_state(page, NR_DYNAMIC_STACKS_FAULTS_KB, + PAGE_SIZE / 1024); } =20 if (finalize) { @@ -811,9 +813,17 @@ static void account_kernel_stack(struct task_struct *t= sk, int account) nr_accounted =3D vm_area->nr_pages; #endif =20 - for (i =3D 0; i < nr_accounted; i++) + for (i =3D 0; i < nr_accounted; i++) { mod_lruvec_page_state(vm_area->pages[i], NR_KERNEL_STACK_KB, account * (PAGE_SIZE / 1024)); +#ifdef CONFIG_DYNAMIC_STACK + if (i >=3D THREAD_PREALLOC_PAGES) { + mod_lruvec_page_state(vm_area->pages[i], + NR_DYNAMIC_STACKS_FAULTS_KB, + account * (PAGE_SIZE / 1024)); + } +#endif + } } else { void *stack =3D task_stack_page(tsk); =20 diff 
--git a/mm/memcontrol.c b/mm/memcontrol.c index 772bac21d155..cd2195a735ab 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -318,6 +318,9 @@ static const unsigned int memcg_node_stat_items[] =3D { NR_FILE_THPS, NR_ANON_THPS, NR_KERNEL_STACK_KB, +#ifdef CONFIG_DYNAMIC_STACK + NR_DYNAMIC_STACKS_FAULTS_KB, +#endif NR_PAGETABLE, NR_SECONDARY_PAGETABLE, #ifdef CONFIG_SWAP @@ -1403,6 +1406,10 @@ static const struct memory_stat memory_stats[] =3D { #ifdef CONFIG_NUMA_BALANCING { "pgpromote_success", PGPROMOTE_SUCCESS }, #endif + +#ifdef CONFIG_DYNAMIC_STACK + { "dynamic_stack_faults", NR_DYNAMIC_STACKS_FAULTS_KB }, +#endif }; =20 /* The actual unit of the state item, not the same as the output unit */ @@ -1415,6 +1422,9 @@ static int memcg_page_state_unit(int item) case NR_SLAB_UNRECLAIMABLE_B: return 1; case NR_KERNEL_STACK_KB: +#ifdef CONFIG_DYNAMIC_STACK + case NR_DYNAMIC_STACKS_FAULTS_KB: +#endif return SZ_1K; default: return PAGE_SIZE; diff --git a/mm/vmstat.c b/mm/vmstat.c index 86b14b0f77b5..8fa1c7bcbaea 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1256,6 +1256,9 @@ const char * const vmstat_text[] =3D { [I(NR_FOLL_PIN_ACQUIRED)] =3D "nr_foll_pin_acquired", [I(NR_FOLL_PIN_RELEASED)] =3D "nr_foll_pin_released", [I(NR_KERNEL_STACK_KB)] =3D "nr_kernel_stack", +#ifdef CONFIG_DYNAMIC_STACK + [I(NR_DYNAMIC_STACKS_FAULTS_KB)] =3D "nr_dynamic_stacks_faults", +#endif #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK) [I(NR_KERNEL_SCS_KB)] =3D "nr_shadow_call_stack", #endif --=20 2.54.0.rc2.544.gc7ae2d5bb8-goog From nobody Fri Jun 19 16:16:36 2026 Received: from mail-dl1-f73.google.com (mail-dl1-f73.google.com [74.125.82.73]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id CB1183DC4BE for ; Fri, 24 Apr 2026 19:17:08 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=74.125.82.73 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; 
s=arc-20240116; t=1777058232; cv=none; b=MHjf8vmeYZlNSYieDFJBJdKwrsbYGelM7PNQ7QEN16K6SFq/UXw54Pv8edv4MrWyYL7+Rqi8QUkj4IVC+L3KY3nwB6UdwRnLUZURdRs9e2p9XIp3tQVm7XTnNXeAODIGffsml+1vBSmY8GdfvyWqHrR9rLaCB57YaW4q+s43xVU= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777058232; c=relaxed/simple; bh=8CwW+GpF5BAaieLCinbMNp90S6l0U17fNDo9y9xNgcA=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=ZzBUuqKC+8nAe9m+I69ssb3iUJXEl9BL7MNgrKvdJJ/6vky51kj5FJvQON2o7SEGBnm+bU/Wg1vCcoTw0diXOe4bd8h5sWZhCBy5GyrlYy9DeCSiaMLkuGmmEjbG+NuHd0+lwIkQ6BfQU5t7O9ckQjpaXZpjK+qD+dUL6i8rLEc= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--stevensd.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=jr4mBAMg; arc=none smtp.client-ip=74.125.82.73 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--stevensd.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="jr4mBAMg" Received: by mail-dl1-f73.google.com with SMTP id a92af1059eb24-12dba1e866dso2820951c88.1 for ; Fri, 24 Apr 2026 12:17:08 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20251104; t=1777058228; x=1777663028; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=LVunvLNRu5wU/C5dXUrSh/20awEeZOSQNdDKwzJM0Dk=; b=jr4mBAMg2ncalSL0odGZ6BGOa7uuq3zkItyKWt+dHXb6eOHW4n6n7XAw8WvZQhX37U VZHdEyaFWB/9Q3XT0RaYI0wnfuqp3ka+ZNuV/kunMazjQiEfGUgSqPEOSqNY4Mc4BFrr 3XvHesXQohVlJ1Xpb3O/nAs1fUbpxAWkP1AsU2QbMoPcDK9gWPmyhwT6UrpJD/7f/dGd fDd7ezm1x4sWub1Y2Zuy5GdxvMwRGICXKiQ8DeObuGTmeUWNm9qqj66cHWgx/HqbqfcZ 
IdFt/engqfZGmQ1BkbHR4jMwznHf4l5lh+rNZMTWjgCH9IMHQ/1R3LC/ZjHFRqjT/9xd +s7Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1777058228; x=1777663028; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=LVunvLNRu5wU/C5dXUrSh/20awEeZOSQNdDKwzJM0Dk=; b=FRUWuRAklf8TPirqrWSLaJwR3sibGwDF5FhKw9MV+yHWZPftVX+0OXrZHa6RnfLemP KBedJIlAteS/IZinq39yH6tchH3cqpOYyvqcM+jaIcFsh5A7oz5a/YEg1dePyeQqFg2Z w707l7W46GOAYoFW87ku57kDaud6i377iGBiFt2TyPm8LPXJpUi39rfjaA0b2bEyCAHI DsJ2k96da+SejpWBfHNNVe1mODFpZGr3vEcMvBMmgYL4V7+I5UZqxwkRPJ6R4h7FqaMW R4oFYFkEtjeWXMQ+UMGRKHSJmbcjU5Z1sab2vaUecK9TNBGRbvyVOWjMFZKJpyOORP6E K60g== X-Forwarded-Encrypted: i=1; AFNElJ8OwPvg/7u3ZfD5rC8hZawvfGAu6N2SMxkYQyaO9DYFqH7UIDDxc2kIUIUmZzSrO4Z0SWsxI4RIKdWc6Ig=@vger.kernel.org X-Gm-Message-State: AOJu0Yx7RWrfotKEKc5Kpx4L2W9c4sUTh669iA57BgeHX/+j6dpTSecV 967LNqRpALluiET7/hMrC7zMDHM141wOb7P8xzeX09r+abUXyMppgKUmj9xQEcNGib07T5FVsw7 7XjhrOKfomW8Z2A== X-Received: from dybnj5.prod.google.com ([2002:a05:7300:d085:b0:2dd:4573:2897]) (user=stevensd job=prod-delivery.src-stubby-dispatcher) by 2002:a05:7023:b0d:b0:12c:8eb:80b9 with SMTP id a92af1059eb24-12c73afa081mr12946449c88.6.1777058227667; Fri, 24 Apr 2026 12:17:07 -0700 (PDT) Date: Fri, 24 Apr 2026 12:14:53 -0700 In-Reply-To: <20260424191456.2679717-1-stevensd@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20260424191456.2679717-1-stevensd@google.com> X-Mailer: git-send-email 2.54.0.rc2.544.gc7ae2d5bb8-goog Message-ID: <20260424191456.2679717-11-stevensd@google.com> Subject: [PATCH v2 10/13] fork: Store task pointer in unpopulated stack ptes From: David Stevens To: Pasha Tatashin , Linus Walleij , Will Deacon , Quentin Perret , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org, "H. 
Peter Anvin" , Andy Lutomirski , Xin Li , Peter Zijlstra , Andrew Morton , David Hildenbrand , Lorenzo Stoakes , "Liam R. Howlett" , Vlastimil Babka , Mike Rapoport , Suren Baghdasaryan , Michal Hocko , Uladzislau Rezki , Kees Cook Cc: David Stevens , linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Store the task pointer in the ptes of the unpopulated pages of dynamic stacks, to allow the vm_struct pointer to be retrieved without relying on any locks or current. This relies on being able to pack the struct task_struct pointer into a pte. Since the struct is 64 byte aligned, that gives 5 bits of leeway, which should be viable on most architectures. Any architecture which enables dynamic thread stacks must provide make_data_kpte() and unpack_data_kpte(), which pack/unpack a right shifted pointer value into/from a pte. Signed-off-by: David Stevens --- include/linux/sched/task_stack.h | 1 + kernel/fork.c | 74 +++++++++++++++++++++++++++++--- mm/vmalloc.c | 2 +- 3 files changed, 69 insertions(+), 8 deletions(-) diff --git a/include/linux/sched/task_stack.h b/include/linux/sched/task_st= ack.h index 7dcff2836d7e..7cf00ce97f7c 100644 --- a/include/linux/sched/task_stack.h +++ b/include/linux/sched/task_stack.h @@ -105,6 +105,7 @@ void exit_task_stack_account(struct task_struct *tsk); void dynamic_stack_refill_pages(void); unsigned long dynamic_stack_accounting(struct task_struct *tsk, bool final= ize); bool dynamic_stack_fault(struct task_struct *tsk, unsigned long address, b= ool *on_stack); +struct task_struct *task_from_stack_address(unsigned long address); =20 /* * Refill and charge for the used pages. 
diff --git a/kernel/fork.c b/kernel/fork.c index 9ac9d23f5f4b..733fc1f58b8b 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -296,16 +296,40 @@ static bool try_release_thread_stack_to_cache(struct = vm_struct *vm_area) =20 static DEFINE_PER_CPU(struct page *, dynamic_stack_pages[DYNSTK_PAGE_POOL_= NR]); =20 +#define TASK_PTR_SHIFT (ilog2(__alignof__(struct task_struct))) + static void link_vmap_stack_to_task(struct task_struct *tsk, struct vm_str= uct *vm_area) { + int i; + unsigned long addr; + pte_t *ptep, pte; + + pte =3D make_data_kpte(((unsigned long)tsk) >> TASK_PTR_SHIFT); + tsk->stack_vm_area =3D vm_area; tsk->packed_stack =3D (unsigned long)kasan_reset_tag(vm_area->addr); + + addr =3D (unsigned long)vm_area->addr; + ptep =3D virt_to_kpte(addr); + for (i =3D vm_area->nr_pages; i < THREAD_SIZE >> PAGE_SHIFT; + i++, addr +=3D PAGE_SIZE, ptep++) + set_pte_at(&init_mm, addr, ptep, pte); } =20 -static void free_vmap_stack(struct vm_struct *vm_area) +static void free_vmap_stack(struct vm_struct *vm_area, bool was_mapped) { int i; =20 + /* Clear data kptes since vunmap expects present or none. 
*/ + if (was_mapped) { + unsigned long addr =3D (unsigned long)vm_area->addr; + pte_t *ptep =3D virt_to_kpte(addr); + unsigned int nr_to_clear =3D (THREAD_SIZE >> PAGE_SHIFT) - vm_area->nr_p= ages; + + if (nr_to_clear) + clear_ptes(&init_mm, addr, ptep, nr_to_clear); + } + remove_vm_area(vm_area->addr); =20 for (i =3D 0; i < vm_area->nr_pages; i++) @@ -354,7 +378,7 @@ static struct vm_struct *alloc_vmap_stack(int node) =20 return vm_area; cleanup_err: - free_vmap_stack(vm_area); + free_vmap_stack(vm_area, false); return NULL; } =20 @@ -477,6 +501,42 @@ unsigned long dynamic_stack_accounting(struct task_str= uct *tsk, bool finalize) return i; } =20 +noinstr struct task_struct *task_from_stack_address(unsigned long address) +{ + pgd_t *pgd; + p4d_t *p4d; + pud_t *pud; + pmd_t *pmd; + pte_t *pte; + + BUILD_BUG_ON((BITS_PER_LONG - TASK_PTR_SHIFT) > KPTE_AVAILABLE_DATA_BITS); + + if (!is_vmalloc_addr((void *)address)) + return NULL; + + pgd =3D pgd_offset_k(address); + if (pgd_none(*pgd) || pgd_leaf(*pgd)) + return NULL; + + p4d =3D p4d_offset(pgd, address); + if (p4d_none(*p4d) || p4d_leaf(*p4d)) + return NULL; + + pud =3D pud_offset(p4d, address); + if (pud_none(*pud) || pud_leaf(*pud)) + return NULL; + + pmd =3D pmd_offset(pud, address); + if (pmd_none(*pmd) || pmd_leaf(*pmd)) + return NULL; + + pte =3D pte_offset_kernel(pmd, address); + if (pte_present(*pte) || pte_none(*pte)) + return NULL; + + return (struct task_struct *)(unpack_data_kpte(*pte) << TASK_PTR_SHIFT); +} + bool noinstr dynamic_stack_fault(struct task_struct *tsk, unsigned long ad= dress, bool *on_stack) { unsigned long stack, hole_end, addr; @@ -570,7 +630,7 @@ static inline struct vm_struct *alloc_vmap_stack(int no= de) return stack ? 
find_vm_area(stack) : NULL; } =20 -static inline void free_vmap_stack(struct vm_struct *vm_area) +static inline void free_vmap_stack(struct vm_struct *vm_area, bool was_map= ped) { vfree(vm_area->addr); } @@ -590,7 +650,7 @@ static void thread_stack_free_work(struct work_struct *= work) if (try_release_thread_stack_to_cache(vm_stack->stack_vm_area)) return; =20 - free_vmap_stack(vm_area); + free_vmap_stack(vm_area, true); } =20 static void thread_stack_delayed_free(struct task_struct *tsk) @@ -618,7 +678,7 @@ static int free_vm_stack_cache(unsigned int cpu) if (!vm_area) continue; =20 - free_vmap_stack(vm_area); + free_vmap_stack(vm_area, true); cached_vm_stack_areas[i] =3D NULL; } =20 @@ -653,7 +713,7 @@ static int alloc_thread_stack_node(struct task_struct *= tsk, int node) unsigned long memset_offset =3D 0; =20 if (memcg_charge_kernel_stack(vm_area)) { - free_vmap_stack(vm_area); + free_vmap_stack(vm_area, true); return -ENOMEM; } =20 @@ -674,7 +734,7 @@ static int alloc_thread_stack_node(struct task_struct *= tsk, int node) return -ENOMEM; =20 if (memcg_charge_kernel_stack(vm_area)) { - free_vmap_stack(vm_area); + free_vmap_stack(vm_area, true); return -ENOMEM; } link_vmap_stack_to_task(tsk, vm_area); diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 39b7e118cbce..76955c101180 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -76,7 +76,7 @@ early_param("nohugevmalloc", set_nohugevmalloc); static const bool vmap_allow_huge =3D false; #endif /* CONFIG_HAVE_ARCH_HUGE_VMALLOC */ =20 -bool is_vmalloc_addr(const void *x) +noinstr bool is_vmalloc_addr(const void *x) { unsigned long addr =3D (unsigned long)kasan_reset_tag(x); =20 --=20 2.54.0.rc2.544.gc7ae2d5bb8-goog From nobody Fri Jun 19 16:16:36 2026 Received: from mail-dl1-f73.google.com (mail-dl1-f73.google.com [74.125.82.73]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C1B823659FD for ; Fri, 24 Apr 2026 
19:17:10 +0000 (UTC) Date: Fri, 24 Apr 2026 12:14:54 -0700 In-Reply-To: <20260424191456.2679717-1-stevensd@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org Mime-Version: 1.0 References: <20260424191456.2679717-1-stevensd@google.com> X-Mailer: git-send-email 2.54.0.rc2.544.gc7ae2d5bb8-goog Message-ID: <20260424191456.2679717-12-stevensd@google.com> Subject: [PATCH v2 11/13] x86/entry/fred: encode frame pointer on entry From:
David Stevens To: Pasha Tatashin , Linus Walleij , Will Deacon , Quentin Perret , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org, "H. Peter Anvin" , Andy Lutomirski , Xin Li , Peter Zijlstra , Andrew Morton , David Hildenbrand , Lorenzo Stoakes , "Liam R. Howlett" , Vlastimil Babka , Mike Rapoport , Suren Baghdasaryan , Michal Hocko , Uladzislau Rezki , Kees Cook Cc: David Stevens , linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Add the missing ENCODE_FRAME_POINTER invocation to the FRED_ENTER macro, to prevent the unwinder from encountering a NULL stack frame pointer when CONFIG_UNWINDER_FRAME_POINTER is enabled. Fixes: 14619d912b65 ("x86/fred: FRED entry/exit and dispatch code") Signed-off-by: David Stevens Acked-by: H. Peter Anvin (Intel) --- arch/x86/entry/entry_64_fred.S | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S index 894f7f16eb80..119b8214748e 100644 --- a/arch/x86/entry/entry_64_fred.S +++ b/arch/x86/entry/entry_64_fred.S @@ -7,6 +7,7 @@ #include =20 #include +#include #include #include =20 @@ -19,6 +20,7 @@ UNWIND_HINT_END_OF_STACK ANNOTATE_NOENDBR PUSH_AND_CLEAR_REGS + ENCODE_FRAME_POINTER movq %rsp, %rdi /* %rdi -> pt_regs */ .endm =20 --=20 2.54.0.rc2.544.gc7ae2d5bb8-goog From nobody Fri Jun 19 16:16:36 2026 Received: from mail-dy1-f202.google.com (mail-dy1-f202.google.com [74.125.82.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 503D13644DE for ; Fri, 24 Apr 2026 19:17:12 +0000 (UTC) Date: Fri, 24 Apr 2026 12:14:55 -0700 In-Reply-To: <20260424191456.2679717-1-stevensd@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org Mime-Version: 1.0 References: <20260424191456.2679717-1-stevensd@google.com> X-Mailer: git-send-email 2.54.0.rc2.544.gc7ae2d5bb8-goog Message-ID: <20260424191456.2679717-13-stevensd@google.com> Subject: [PATCH v2 12/13] x86: Add support for dynamic kernel stacks via FRED From: David Stevens To: Pasha Tatashin , Linus Walleij , Will Deacon , Quentin Perret , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org, "H.
Peter Anvin" , Andy Lutomirski , Xin Li , Peter Zijlstra , Andrew Morton , David Hildenbrand , Lorenzo Stoakes , "Liam R. Howlett" , Vlastimil Babka , Mike Rapoport , Suren Baghdasaryan , Michal Hocko , Uladzislau Rezki , Kees Cook Cc: David Stevens , linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Add support for dynamic kernel stack faults by handling #PFs from CPL 0 on stack level 1. Since we can't sleep while on a per-CPU stack, any page faults that didn't originate in an atomic context need to be bounced back to the originating stack. With dynamic kernel stacks, the processor pushing data onto the kernel thread stack can cause a page fault. The SDM says in the #DF section that the processor should be able to handle these exceptions serially. However, this does not seem to actually be handled reliably. With KVM, I've observed timer interrupts dropped. The corresponding bit in VIRR is cleared and the ISR bit in the APIC is set before the #PF is delivered, but the interrupt handler is not invoked after the kernel stack fault is resolved. On bare metal, I've observed frequent hangs due to threads getting stuck on folio_wait_bit_common. I haven't traced this to an exact interrupt being lost, but moving interrupts to stack level 1 reduces boot failures from >10% to 0 in 1000s of attempts. To work around this, external interrupts are also moved to stack level 1, and unconditionally bounced back to the originating stack. Bouncing page faults and external interrupts through stack level 1 while in CPL 0 adds a small but non-trivial overhead to those paths. The shared entry point for events received in CPL 0 also becomes slightly more expensive, due to the need to detect page faults and external interrupts. 
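The detection step mentioned above can be sketched in C as follows. This is only an illustration, assuming the layout implied by the masks used in the entry code of this patch: in the upper 32 bits of the saved SS slot, bits 7:0 hold the event vector and bits 19:16 hold the event type, with type 3 meaning hardware exception and type 0 meaning external interrupt. The macro and helper names below are hypothetical and are not part of the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout of the upper 32 bits of the FRED augmented SS slot. */
#define SS_HI_VECTOR_MASK 0x000000ffu /* bits 7:0: event vector */
#define SS_HI_TYPE_MASK   0x000f0000u /* bits 19:16: event type */
#define SS_HI_TYPE_SHIFT  16

#define TYPE_EXTINT 0u  /* external interrupt */
#define TYPE_HWEXC  3u  /* hardware exception */
#define VECTOR_PF   14u /* page fault */

/* Hypothetical helper: true if the event is a hardware-raised #PF. */
static inline bool is_hw_page_fault(uint32_t ss_hi)
{
	/* Keep only type and vector, as "andl $0x000f00ff" does. */
	uint32_t key = ss_hi & (SS_HI_TYPE_MASK | SS_HI_VECTOR_MASK);

	/* Same comparison as "cmpl $0x0003000e". */
	return key == ((TYPE_HWEXC << SS_HI_TYPE_SHIFT) | VECTOR_PF);
}

/* Hypothetical helper: true if the event is an external interrupt. */
static inline bool is_external_interrupt(uint32_t ss_hi)
{
	return ((ss_hi & SS_HI_TYPE_MASK) >> SS_HI_TYPE_SHIFT) == TYPE_EXTINT;
}
```

Both checks need only one load and a couple of mask/compare operations, which is consistent with the "slightly more expensive" cost described for the shared CPL 0 entry point.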
Since enabling HAVE_ARCH_DYNAMIC_STACK requires unconditional support, enabling the config is done in the next patch that adds dynamic stack support for traditional interrupt delivery. Signed-off-by: David Stevens --- arch/x86/entry/entry_64_fred.S | 55 +++++++++++++++++++++++++++++++ arch/x86/include/asm/pgtable_64.h | 36 ++++++++++++++++++++ arch/x86/include/asm/traps.h | 5 +++ arch/x86/kernel/fred.c | 20 ++++++++--- arch/x86/mm/dump_pagetables.c | 14 +++++--- arch/x86/mm/fault.c | 53 +++++++++++++++++++++++++++++ 6 files changed, 174 insertions(+), 9 deletions(-) diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S index 119b8214748e..7202655ef662 100644 --- a/arch/x86/entry/entry_64_fred.S +++ b/arch/x86/entry/entry_64_fred.S @@ -54,7 +54,62 @@ SYM_CODE_END(asm_fred_entrypoint_user) .org asm_fred_entrypoint_user + 256, 0xcc SYM_CODE_START_NOALIGN(asm_fred_entrypoint_kernel) FRED_ENTER + +#ifdef CONFIG_DYNAMIC_STACK + /* Extract event type and vector from augmented SS. */ + movl (SS + 4)(%rsp), %esi + andl $0x000f00ff, %esi + + /* Check if event type is hardware exception and vector is #PF. */ + cmpl $0x0003000e, %esi + jne .Lcheck_for_extint + + call handle_dynamic_stack_kernel_faults + testq %rax, %rax + jz .Lentrypoint_done + cmpq %rax, %rsp + je .Lskip_stack_switch + jmp .Ldo_stack_switch + +.Lcheck_for_extint: + /* Check if event type is external interrupt. */ + andl $0xf0000, %esi + testl %esi, %esi + jne .Lcall_primary_entry + call switch_to_kstack + +.Ldo_stack_switch: +#ifdef CONFIG_DEBUG_ENTRY + /* + * We should only do a stack switch for an external interrupt or a page + * fault in a non-atomic context. These should only ever happen in user + * space or from a regular kernel stack (i.e. CSL =3D=3D 0). 
+ */ + movw (CS + 2)(%rsp), %si + testw $0x3, %si + jz .Lcsl_ok + ud2 +.Lcsl_ok: +#endif + movq %rax, %rsp + + UNWIND_HINT_REGS + ENCODE_FRAME_POINTER + + mov $MSR_IA32_FRED_CONFIG, %ecx + rdmsr + andl $~0x3, %eax + wrmsr + + movq %rsp, %rdi +#endif + +.Lskip_stack_switch: + movq %rsp, %rdi +.Lcall_primary_entry: call fred_entry_from_kernel + +.Lentrypoint_done: FRED_EXIT ERETS SYM_CODE_END(asm_fred_entrypoint_kernel) diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtab= le_64.h index ce45882ccd07..fbb042c89d13 100644 --- a/arch/x86/include/asm/pgtable_64.h +++ b/arch/x86/include/asm/pgtable_64.h @@ -237,6 +237,42 @@ static inline void native_pgd_clear(pgd_t *pgd) #define __swp_entry_to_pte(x) (__pte((x).val)) #define __swp_entry_to_pmd(x) (__pmd((x).val)) =20 +#ifdef CONFIG_DYNAMIC_STACK + +/* + * Skip the present bit. And skip dirty and accessed bits due to + * erratum where they can be incorrectly set on non-present ptes. + * + * Also skip bit 8, which is used for pte_present for PROT_NONE. This + * isn't necessary in the strictest sense since PROT_NONE doesn't apply + * to kernel PTEs, but it's easier to let pte_present just continue + * to work. 
+ */ +#define KPTE_AVAILABLE_DATA_BITS 58 + +static inline pte_t make_data_kpte(unsigned long val) +{ + unsigned long low_part, mid_part, high_part; + + low_part =3D (val & 0xf) << 1; + mid_part =3D (val & 0x10) << 3; + high_part =3D (val & ~0x1f) << 4; + + return __pte(low_part | mid_part | high_part); +} + +static inline unsigned long unpack_data_kpte(pte_t pte) +{ + unsigned long val =3D pte_val(pte), high_part, mid_part, low_part; + + low_part =3D (val >> 1) & 0xf; + mid_part =3D (val >> 3) & 0x10; + high_part =3D (val >> 4) & ~0x1f; + + return low_part | mid_part | high_part; +} +#endif /* CONFIG_DYNAMIC_STACK */ + extern void cleanup_highmap(void); =20 #define HAVE_ARCH_UNMAPPED_AREA diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h index 3f24cc472ce9..6b55eb91aea6 100644 --- a/arch/x86/include/asm/traps.h +++ b/arch/x86/include/asm/traps.h @@ -15,6 +15,11 @@ asmlinkage __visible notrace struct pt_regs *sync_regs(s= truct pt_regs *eregs); asmlinkage __visible notrace struct pt_regs *fixup_bad_iret(struct pt_regs *bad_regs); asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_r= egs *eregs); + +#ifdef CONFIG_DYNAMIC_STACK +asmlinkage __visible noinstr unsigned long switch_to_kstack(struct pt_regs= *regs); +asmlinkage __visible noinstr unsigned long handle_dynamic_stack_kernel_fau= lts(struct pt_regs *regs); +#endif #endif =20 extern int ibt_selftest(void); diff --git a/arch/x86/kernel/fred.c b/arch/x86/kernel/fred.c index e736b19e18de..01d727420d1f 100644 --- a/arch/x86/kernel/fred.c +++ b/arch/x86/kernel/fred.c @@ -9,6 +9,8 @@ =20 /* #DB in the kernel would imply the use of a kernel debugger. 
*/ #define FRED_DB_STACK_LEVEL 1UL +#define FRED_PF_STACK_LEVEL 1UL +#define FRED_INT_STACK_LEVEL 1UL #define FRED_NMI_STACK_LEVEL 2UL #define FRED_MC_STACK_LEVEL 2UL /* @@ -25,6 +27,11 @@ DEFINE_PER_CPU(unsigned long, fred_rsp0); EXPORT_PER_CPU_SYMBOL(fred_rsp0); =20 +#define FRED_CONFIG_VAL(int_stklvl) \ + (FRED_CONFIG_REDZONE /* Reserve for CALL emulation */ | \ + FRED_CONFIG_INT_STKLVL(int_stklvl) | \ + FRED_CONFIG_ENTRYPOINT(asm_fred_entrypoint_user)) + void cpu_init_fred_exceptions(void) { /* When FRED is enabled by default, remove this log message */ @@ -44,11 +51,7 @@ void cpu_init_fred_exceptions(void) */ loadsegment(ss, __KERNEL_DS); =20 - wrmsrq(MSR_IA32_FRED_CONFIG, - /* Reserve for CALL emulation */ - FRED_CONFIG_REDZONE | - FRED_CONFIG_INT_STKLVL(0) | - FRED_CONFIG_ENTRYPOINT(asm_fred_entrypoint_user)); + wrmsrq(MSR_IA32_FRED_CONFIG, FRED_CONFIG_VAL(0)); =20 wrmsrq(MSR_IA32_FRED_STKLVLS, 0); =20 @@ -84,8 +87,15 @@ void cpu_init_fred_rsps(void) FRED_STKLVL(X86_TRAP_DB, FRED_DB_STACK_LEVEL) | FRED_STKLVL(X86_TRAP_NMI, FRED_NMI_STACK_LEVEL) | FRED_STKLVL(X86_TRAP_MC, FRED_MC_STACK_LEVEL) | +#ifdef CONFIG_DYNAMIC_STACK + FRED_STKLVL(X86_TRAP_PF, FRED_PF_STACK_LEVEL) | +#endif FRED_STKLVL(X86_TRAP_DF, FRED_DF_STACK_LEVEL)); =20 +#ifdef CONFIG_DYNAMIC_STACK + wrmsrq(MSR_IA32_FRED_CONFIG, FRED_CONFIG_VAL(FRED_INT_STACK_LEVEL)); +#endif + /* The FRED equivalents to IST stacks... */ wrmsrq(MSR_IA32_FRED_RSP1, __this_cpu_ist_top_va(DB)); wrmsrq(MSR_IA32_FRED_RSP2, __this_cpu_ist_top_va(NMI)); diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c index 2afa7a23340e..5c33c33e93fe 100644 --- a/arch/x86/mm/dump_pagetables.c +++ b/arch/x86/mm/dump_pagetables.c @@ -306,11 +306,17 @@ static void note_page(struct ptdump_state *pt_st, uns= igned long addr, int level, static const char units[] =3D "BKMGTPE"; struct seq_file *m =3D st->seq; =20 - new_prot =3D val & PTE_FLAGS_MASK; - if (!val) + /* Ignore prot/eff from data kptes. 
*/ + if (val & _PAGE_PRESENT || addr < address_markers[KERNEL_SPACE_NR].start_= address) { + new_prot =3D val & PTE_FLAGS_MASK; + if (!val) + new_eff =3D 0; + else + new_eff =3D st->prot_levels[level]; + } else { + new_prot =3D 0; new_eff =3D 0; - else - new_eff =3D st->prot_levels[level]; + } =20 /* * If we have a "break" in the series, we need to flush the state that diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index b83a06739b51..40d518d9f562 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1480,6 +1480,59 @@ handle_page_fault(struct pt_regs *regs, unsigned lon= g error_code, local_irq_disable(); } =20 +#ifdef CONFIG_DYNAMIC_STACK + +static noinstr unsigned long copy_stack_data(struct pt_regs *regs) +{ + unsigned long new_sp; + unsigned long data_len; + + new_sp =3D regs->sp - (FRED_CONFIG_REDZONE_AMOUNT << 6); + new_sp &=3D FRED_STACK_FRAME_RSP_MASK; + data_len =3D sizeof(struct fred_frame); + new_sp -=3D data_len; + + memcpy((void *)new_sp, regs, data_len); + + return new_sp; +} + +__visible noinstr unsigned long switch_to_kstack(struct pt_regs *regs) +{ + return copy_stack_data(regs); +} + +#define ALIGN_TO_STACK(addr) ((addr) & ~(THREAD_ALIGN - 1)) + +__visible noinstr unsigned long handle_dynamic_stack_kernel_faults(struct = pt_regs *regs) +{ + unsigned long address; + struct task_struct *tsk; + bool on_stack; + + address =3D fred_event_data(regs); + if (fault_in_kernel_space(address) && !in_nmi()) { + tsk =3D task_from_stack_address(address); + + if (tsk && dynamic_stack_fault(tsk, address, &on_stack)) { + WARN_ON_ONCE(tsk !=3D current && + ALIGN_TO_STACK(regs->sp) !=3D ALIGN_TO_STACK(address)); + return 0; + } + } + + /* + * The regular fault handler won't sleep when executing in an + * atomic context, so we can complete the #PF directly on the + * #PF stack. 
+ */ + if (in_atomic()) + return (unsigned long)regs; + else + return copy_stack_data(regs); +} +#endif + DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault) { irqentry_state_t state; --=20 2.54.0.rc2.544.gc7ae2d5bb8-goog From nobody Fri Jun 19 16:16:36 2026 Received: from mail-dl1-f74.google.com (mail-dl1-f74.google.com [74.125.82.74]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id F0CD23F9F34 for ; Fri, 24 Apr 2026 19:17:14 +0000 (UTC) Date: Fri, 24 Apr 2026 12:14:56 -0700 In-Reply-To: <20260424191456.2679717-1-stevensd@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org Mime-Version: 1.0 References: <20260424191456.2679717-1-stevensd@google.com> X-Mailer: git-send-email 2.54.0.rc2.544.gc7ae2d5bb8-goog Message-ID: <20260424191456.2679717-14-stevensd@google.com> Subject: [PATCH v2 13/13] x86: Add support for dynamic kernel stacks via IST From: David Stevens To: Pasha Tatashin , Linus Walleij , Will Deacon , Quentin Perret , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org, "H. Peter Anvin" , Andy Lutomirski , Xin Li , Peter Zijlstra , Andrew Morton , David Hildenbrand , Lorenzo Stoakes , "Liam R. Howlett" , Vlastimil Babka , Mike Rapoport , Suren Baghdasaryan , Michal Hocko , Uladzislau Rezki , Kees Cook Cc: David Stevens , linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" On hardware that doesn't support FRED, use ISTs to support dynamic kernel stacks. As with FRED, any regular #PF is manually moved back onto the original stack. Additionally, to avoid issues with interrupt re-delivery, we take the same approach as with FRED and handle external interrupts on an IST stack. The fact that IST stacks aren't reentrant means we have to be very careful to avoid triggering a #PF while the #PF IST is being used. Since NMIs can trigger #PFs, we have the NMI handler temporarily install a secondary #PF IST stack if it detects it came from the #PF IST stack, to avoid clobbering that stack. Note that although iret unmasking of NMIs can cause us to get a second NMI while an NMI is on the #PF IST stack, the actual handling of that secondary NMI will be delayed until after the original NMI (and thus the #PF) is resolved.
As such, one extra #PF IST stack is sufficient to resolve reentrancy issues with respect to NMIs. For #DB exceptions, we make sure that all code that executes on the #PF IST stack is noinstr. Unfortunately this is not 100% bulletproof, since the handler needs to access data outside of cpu_entry_area (e.g. current, current's stack, vmap stack page tables), and the user could have set hardware breakpoints on accesses to those addresses. Rather than handle this edge case that should only occur during manual debugging, we just detect reentrancy on the #PF IST and abort. It is possible for #MCE to occur on the #PF IST stack, but the #MCE handler shouldn't generate new #PFs. The reentrancy check on the #PF stack will trigger if any recoverable #MCEs do generate #PFs - if there are actually reports of it happening, we can address it then. Bouncing all #PF and external interrupts through IST stacks adds some overhead. However, such events from userspace already had to bounce through the CPU entry stack, so introducing ISTs only adds notable overhead for #PFs and external interrupts that occur while in CPL 0. 
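The reentrancy detection on the #PF IST described earlier in this message can be modeled roughly in C as below. This is a sketch of the idea only, assuming a single canary word reserved at the top of the IST stack (the slot set aside by IST_ENTRY_OFFSET in this patch). The type and function names are hypothetical; the actual check is done in assembly in idtentry_body, which aborts with ud2 rather than returning an error.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the one-word canary at the top of an IST stack. */
struct ist_canary {
	uint64_t in_use; /* nonzero while a handler is running on the stack */
};

/*
 * Claim the IST stack. Returns false if the stack is already in use,
 * i.e. a second exception was delivered onto it before the first
 * handler finished (the assembly aborts via ud2 in that case).
 */
static inline bool ist_canary_enter(struct ist_canary *c)
{
	if (c->in_use)
		return false;
	c->in_use = 1;
	return true;
}

/* Release the IST stack once the handler has left it. */
static inline void ist_canary_exit(struct ist_canary *c)
{
	c->in_use = 0;
}
```

The model leaves out the NMI case, which the patch handles separately by installing a secondary #PF IST stack rather than by relying on the canary.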
Signed-off-by: David Stevens --- arch/x86/Kconfig | 1 + arch/x86/entry/entry_64.S | 49 +++++++++++++++++-- arch/x86/include/asm/cpu_entry_area.h | 18 +++++++ arch/x86/include/asm/idtentry.h | 38 ++++++++++++++- arch/x86/include/asm/page_64_types.h | 10 +++- arch/x86/include/asm/processor.h | 6 +++ arch/x86/kernel/cpu/common.c | 11 +++++ arch/x86/kernel/dumpstack_64.c | 10 +++- arch/x86/kernel/idt.c | 57 +++++++++++++--------- arch/x86/kernel/nmi.c | 9 ++++ arch/x86/lib/usercopy.c | 9 ++++ arch/x86/mm/cpu_entry_area.c | 17 +++++++ arch/x86/mm/fault.c | 70 ++++++++++++++++++++++----- 13 files changed, 262 insertions(+), 43 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index e2df1b147184..182fda721b0d 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -212,6 +212,7 @@ config X86 select HAVE_ARCH_USERFAULTFD_WP if X86_64 && USERFAULTFD select HAVE_ARCH_USERFAULTFD_MINOR if X86_64 && USERFAULTFD select HAVE_ARCH_VMAP_STACK if X86_64 + select HAVE_ARCH_DYNAMIC_STACK if X86_64 && !XEN_PV select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET select HAVE_ARCH_WITHIN_STACK_FRAMES select HAVE_ASM_MODVERSIONS diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 42447b1e1dff..02dbd00cc4bb 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -286,7 +286,7 @@ SYM_CODE_END(xen_error_entry) * @cfunc: C function to be called * @has_error_code: Hardware pushed error code on stack */ -.macro idtentry_body cfunc has_error_code:req +.macro idtentry_body cfunc has_error_code:req kernel_reentry_fn=3D =20 /* * Call error_entry() and switch to the task stack if from userspace. @@ -302,6 +302,38 @@ SYM_CODE_END(xen_error_entry) ENCODE_FRAME_POINTER UNWIND_HINT_REGS =20 +#ifdef CONFIG_DYNAMIC_STACK +.ifnb \kernel_reentry_fn + /* + * For entry from userspace, we've also already moved off of + * the IST after calling error_entry above. 
+ */ + testb $3, CS(%rsp) + jnz .Lregular_fault_\cfunc + + /* Check and set the reentry canary reserved by IST_ENTRY_OFFSET. */ + cmpq $0, (SS + 8)(%rsp) + jne .List_reentry_abort_\cfunc + movq $1, (SS + 8)(%rsp) + + movq %rsp, %rdi + call \kernel_reentry_fn + + movq $0, (SS + 8)(%rsp) + + testq %rax, %rax + jnz .Lchange_stack_\cfunc + jmp error_return + +.Lchange_stack_\cfunc: + movq %rax, %rsp + + ENCODE_FRAME_POINTER + UNWIND_HINT_REGS +.Lregular_fault_\cfunc: +.endif +#endif + movq %rsp, %rdi /* pt_regs pointer into 1st argument*/ =20 .if \has_error_code =3D=3D 1 @@ -314,6 +346,13 @@ SYM_CODE_END(xen_error_entry) call \cfunc =20 jmp error_return + +#ifdef CONFIG_DYNAMIC_STACK +.ifnb \kernel_reentry_fn +.List_reentry_abort_\cfunc: + ud2 +.endif +#endif .endm =20 /** @@ -322,11 +361,13 @@ SYM_CODE_END(xen_error_entry) * @asmsym: ASM symbol for the entry point * @cfunc: C function to be called * @has_error_code: Hardware pushed error code on stack + * @kernel_reentry_fn: If set, C function to be called on re-entry from + * kernel space before the main handler is invoked. * * The macro emits code to set up the kernel context for straight forward * and simple IDT entries. No IST stack, no paranoid entry checks. 
  */
-.macro idtentry vector asmsym cfunc has_error_code:req
+.macro idtentry vector asmsym cfunc has_error_code:req kernel_reentry_fn=
 SYM_CODE_START(\asmsym)
 
 	.if \vector == X86_TRAP_BP
@@ -358,7 +399,7 @@ SYM_CODE_START(\asmsym)
 .Lfrom_usermode_no_gap_\@:
 	.endif
 
-	idtentry_body \cfunc \has_error_code
+	idtentry_body \cfunc \has_error_code \kernel_reentry_fn
 
 _ASM_NOKPROBE(\asmsym)
 SYM_CODE_END(\asmsym)
@@ -375,7 +416,7 @@ SYM_CODE_END(\asmsym)
  */
 .macro idtentry_irq vector cfunc
 	.p2align CONFIG_X86_L1_CACHE_SHIFT
-	idtentry \vector asm_\cfunc \cfunc has_error_code=1
+	idtentry \vector asm_\cfunc \cfunc has_error_code=1 kernel_reentry_fn=switch_to_kstack
 .endm
 
 /**
diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index 462fc34f1317..5bce3259edee 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -26,6 +26,12 @@
 	char DB_stack[EXCEPTION_STKSZ]; \
 	char MCE_stack_guard[guardsize]; \
 	char MCE_stack[EXCEPTION_STKSZ]; \
+	char PF_stack_guard[guardsize]; \
+	char PF_stack[EXCEPTION_STKSZ]; \
+	char PF2_stack_guard[guardsize]; \
+	char PF2_stack[EXCEPTION_STKSZ]; \
+	char UDI_stack_guard[guardsize]; \
+	char UDI_stack[EXCEPTION_STKSZ]; \
 	char VC_stack_guard[guardsize]; \
 	char VC_stack[optional_stack_size]; \
 	char VC2_stack_guard[guardsize]; \
@@ -50,6 +56,9 @@ enum exception_stack_ordering {
 	ESTACK_NMI,
 	ESTACK_DB,
 	ESTACK_MCE,
+	ESTACK_PF,
+	ESTACK_PF2,
+	ESTACK_UDI,
 	ESTACK_VC,
 	ESTACK_VC2,
 	N_EXCEPTION_STACKS
@@ -144,6 +153,15 @@ static __always_inline struct entry_stack *cpu_entry_stack(int cpu)
 	return &get_cpu_entry_area(cpu)->entry_stack_page.stack;
 }
 
+#ifdef CONFIG_DYNAMIC_STACK
+bool is_pf_ist_stack(unsigned long addr);
+#else
+static inline bool is_pf_ist_stack(unsigned long addr)
+{
+	return false;
+}
+#endif
+
 #define __this_cpu_ist_top_va(name) \
 	CEA_ESTACK_TOP(__this_cpu_read(cea_exception_stacks), name)
 
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 42bf6a58ec36..d8c846d28a1d 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -163,6 +163,16 @@ noinstr void fred_##func(struct pt_regs *regs)
 #define DECLARE_IDTENTRY_RAW_ERRORCODE(vector, func) \
 	DECLARE_IDTENTRY_ERRORCODE(vector, func)
 
+/**
+ * DECLARE_IDTENTRY_PF - Declare functions for page fault entry point
+ * @vector: Vector number (ignored for C)
+ * @func: Function name of the entry point
+ *
+ * Maps to @DECLARE_IDTENTRY_ERRORCODE().
+ */
+#define DECLARE_IDTENTRY_PF(vector, func) \
+	DECLARE_IDTENTRY_RAW_ERRORCODE(vector, func)
+
 /**
  * DEFINE_IDTENTRY_RAW_ERRORCODE - Emit code for raw IDT entry points
  * @func: Function name of the entry point
@@ -391,6 +401,15 @@ static __always_inline void __##func(struct pt_regs *regs)
 #define DEFINE_IDTENTRY_DF(func) \
 	DEFINE_IDTENTRY_RAW_ERRORCODE(func)
 
+/**
+ * DEFINE_IDTENTRY_PF - Emit code for page fault
+ * @func: Function name of the entry point
+ *
+ * Maps to DEFINE_IDTENTRY_RAW_ERRORCODE
+ */
+#define DEFINE_IDTENTRY_PF(func) \
+	DEFINE_IDTENTRY_RAW_ERRORCODE(func)
+
 /**
  * DEFINE_IDTENTRY_VC_KERNEL - Emit code for VMM communication handler
  * when raised from kernel mode
@@ -480,6 +499,15 @@ void fred_install_sysvec(unsigned int vector, const idtentry_t function);
 #define DECLARE_IDTENTRY_ERRORCODE(vector, func) \
 	idtentry vector asm_##func func has_error_code=1
 
+#ifdef CONFIG_DYNAMIC_STACK
+#define DECLARE_IDTENTRY_PF(vector, func) \
+	idtentry vector asm_##func func has_error_code=1 \
+	kernel_reentry_fn=handle_dynamic_stack_kernel_faults
+#else
+#define DECLARE_IDTENTRY_PF(vector, func) \
+	DECLARE_IDTENTRY_RAW_ERRORCODE(vector, func)
+#endif
+
 /* Special case for 32bit IRET 'trap'. Do not emit ASM code */
 #define DECLARE_IDTENTRY_SW(vector, func)
 
@@ -494,8 +522,14 @@ void fred_install_sysvec(unsigned int vector, const idtentry_t function);
 	idtentry_irq vector func
 
 /* System vector entries */
+#ifdef CONFIG_DYNAMIC_STACK
+#define DECLARE_IDTENTRY_SYSVEC(vector, func) \
+	idtentry vector asm_##func func has_error_code=0 \
+	kernel_reentry_fn=switch_to_kstack
+#else
 #define DECLARE_IDTENTRY_SYSVEC(vector, func) \
 	DECLARE_IDTENTRY(vector, func)
+#endif
 
 #ifdef CONFIG_X86_64
 # define DECLARE_IDTENTRY_MCE(vector, func) \
@@ -615,7 +649,7 @@ DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_AC, exc_alignment_check);
 /* Raw exception entries which need extra work */
 DECLARE_IDTENTRY_RAW(X86_TRAP_UD, exc_invalid_op);
 DECLARE_IDTENTRY_RAW(X86_TRAP_BP, exc_int3);
-DECLARE_IDTENTRY_RAW_ERRORCODE(X86_TRAP_PF, exc_page_fault);
+DECLARE_IDTENTRY_PF(X86_TRAP_PF, exc_page_fault);
 
 #if defined(CONFIG_IA32_EMULATION)
 DECLARE_IDTENTRY_RAW(IA32_SYSCALL_VECTOR, int80_emulation);
@@ -699,7 +733,7 @@ DECLARE_IDTENTRY_SYSVEC(X86_PLATFORM_IPI_VECTOR, sysvec_x86_platform_ipi);
 #endif
 
 #ifdef CONFIG_SMP
-DECLARE_IDTENTRY(RESCHEDULE_VECTOR, sysvec_reschedule_ipi);
+DECLARE_IDTENTRY_SYSVEC(RESCHEDULE_VECTOR, sysvec_reschedule_ipi);
 DECLARE_IDTENTRY_SYSVEC(REBOOT_VECTOR, sysvec_reboot);
 DECLARE_IDTENTRY_SYSVEC(CALL_FUNCTION_SINGLE_VECTOR, sysvec_call_function_single);
 DECLARE_IDTENTRY_SYSVEC(CALL_FUNCTION_VECTOR, sysvec_call_function);
diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index 7400dab373fe..b0b60f83a531 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -28,7 +28,15 @@
 #define IST_INDEX_NMI 1
 #define IST_INDEX_DB 2
 #define IST_INDEX_MCE 3
-#define IST_INDEX_VC 4
+#define IST_INDEX_PF 4
+#define IST_INDEX_UDI 5
+#define IST_INDEX_VC 6
+
+/*
+ * Offset used for some IST stacks to reserve a slot for re-entry
+ * canary. At the very top of the stack for cache friendliness.
+ */
+#define IST_ENTRY_OFFSET 8
 
 /*
  * Set __PAGE_OFFSET to the most negative possible address +
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index a24c7805acdb..fa790731dea0 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -573,6 +573,12 @@ static inline void load_sp0(unsigned long sp0)
 
 #endif /* CONFIG_PARAVIRT_XXL */
 
+#ifdef CONFIG_DYNAMIC_STACK
+void install_nmi_pf_stack(bool use_nmi_pf_stack);
+#else
+static inline void install_nmi_pf_stack(bool use_nmi_pf_stack) {}
+#endif
+
 unsigned long __get_wchan(struct task_struct *p);
 
 extern void select_idle_routine(void);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index ec0670114efa..d90a01e2fdd2 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -2377,6 +2377,8 @@ static inline void tss_setup_ist(struct tss_struct *tss)
 	tss->x86_tss.ist[IST_INDEX_NMI] = __this_cpu_ist_top_va(NMI);
 	tss->x86_tss.ist[IST_INDEX_DB] = __this_cpu_ist_top_va(DB);
 	tss->x86_tss.ist[IST_INDEX_MCE] = __this_cpu_ist_top_va(MCE);
+	tss->x86_tss.ist[IST_INDEX_PF] = __this_cpu_ist_top_va(PF) - IST_ENTRY_OFFSET;
+	tss->x86_tss.ist[IST_INDEX_UDI] = __this_cpu_ist_top_va(UDI) - IST_ENTRY_OFFSET;
 	/* Only mapped when SEV-ES is active */
 	tss->x86_tss.ist[IST_INDEX_VC] = __this_cpu_ist_top_va(VC);
 }
@@ -2665,3 +2667,12 @@ void __init arch_cpu_finalize_init(void)
 	 */
 	mem_encrypt_init();
 }
+
+#ifdef CONFIG_DYNAMIC_STACK
+noinstr void install_nmi_pf_stack(bool use_nmi_pf_stack)
+{
+	unsigned long stack = use_nmi_pf_stack ? __this_cpu_ist_top_va(PF2)
+		: __this_cpu_ist_top_va(PF);
+	this_cpu_write(cpu_tss_rw.x86_tss.ist[IST_INDEX_PF], stack - IST_ENTRY_OFFSET);
+}
+#endif
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 6c5defd6569a..6784d31d3eb3 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -24,13 +24,16 @@ static const char * const exception_stack_names[] = {
 		[ ESTACK_NMI ] = "NMI",
 		[ ESTACK_DB ] = "#DB",
 		[ ESTACK_MCE ] = "#MC",
+		[ ESTACK_PF ] = "#PF",
+		[ ESTACK_PF2 ] = "#PF2",
+		[ ESTACK_UDI ] = "#UDI",
 		[ ESTACK_VC ] = "#VC",
 		[ ESTACK_VC2 ] = "#VC2",
 };
 
 const char *stack_type_name(enum stack_type type)
 {
-	BUILD_BUG_ON(N_EXCEPTION_STACKS != 6);
+	BUILD_BUG_ON(N_EXCEPTION_STACKS != 9);
 
 	if (type == STACK_TYPE_TASK)
 		return "TASK";
@@ -87,6 +90,9 @@ struct estack_pages estack_pages[CEA_ESTACK_PAGES] ____cacheline_aligned = {
 	EPAGERANGE(NMI),
 	EPAGERANGE(DB),
 	EPAGERANGE(MCE),
+	EPAGERANGE(PF),
+	EPAGERANGE(PF2),
+	EPAGERANGE(UDI),
 	EPAGERANGE(VC),
 	EPAGERANGE(VC2),
 };
@@ -98,7 +104,7 @@ static __always_inline bool in_exception_stack(unsigned long *stack, struct stac
 	struct pt_regs *regs;
 	unsigned int k;
 
-	BUILD_BUG_ON(N_EXCEPTION_STACKS != 6);
+	BUILD_BUG_ON(N_EXCEPTION_STACKS != 9);
 
 	begin = (unsigned long)__this_cpu_read(cea_exception_stacks);
 	/*
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index 260456588756..7626fa7adfb3 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -116,6 +116,10 @@ static const __initconst struct idt_data def_idts[] = {
 	ISTG(X86_TRAP_VC, asm_exc_vmm_communication, IST_INDEX_VC),
 #endif
 
+#ifdef CONFIG_DYNAMIC_STACK
+	ISTG(X86_TRAP_PF, asm_exc_page_fault, IST_INDEX_PF),
+#endif
+
 	SYSG(X86_TRAP_OF, asm_exc_overflow),
 };
 
@@ -127,47 +131,55 @@ static const struct idt_data ia32_idt[] __initconst = {
 #endif
 };
 
+#ifdef CONFIG_DYNAMIC_STACK
+#define EXTERNAL_INTR(_vector, _addr) ISTG(_vector, _addr, IST_INDEX_UDI)
+#define EXTERNAL_INTR_IST_VALUE (IST_INDEX_UDI + 1)
+#else
+#define EXTERNAL_INTR(_vector, _addr) INTG(_vector, _addr)
+#define EXTERNAL_INTR_IST_VALUE 0
+#endif
+
 /*
  * The APIC and SMP idt entries
  */
 static const __initconst struct idt_data apic_idts[] = {
 #ifdef CONFIG_SMP
-	INTG(RESCHEDULE_VECTOR, asm_sysvec_reschedule_ipi),
-	INTG(CALL_FUNCTION_VECTOR, asm_sysvec_call_function),
-	INTG(CALL_FUNCTION_SINGLE_VECTOR, asm_sysvec_call_function_single),
-	INTG(REBOOT_VECTOR, asm_sysvec_reboot),
+	EXTERNAL_INTR(RESCHEDULE_VECTOR, asm_sysvec_reschedule_ipi),
+	EXTERNAL_INTR(CALL_FUNCTION_VECTOR, asm_sysvec_call_function),
+	EXTERNAL_INTR(CALL_FUNCTION_SINGLE_VECTOR, asm_sysvec_call_function_single),
+	EXTERNAL_INTR(REBOOT_VECTOR, asm_sysvec_reboot),
 #endif
 
 #ifdef CONFIG_X86_THERMAL_VECTOR
-	INTG(THERMAL_APIC_VECTOR, asm_sysvec_thermal),
+	EXTERNAL_INTR(THERMAL_APIC_VECTOR, asm_sysvec_thermal),
 #endif
 
 #ifdef CONFIG_X86_MCE_THRESHOLD
-	INTG(THRESHOLD_APIC_VECTOR, asm_sysvec_threshold),
+	EXTERNAL_INTR(THRESHOLD_APIC_VECTOR, asm_sysvec_threshold),
 #endif
 
 #ifdef CONFIG_X86_MCE_AMD
-	INTG(DEFERRED_ERROR_VECTOR, asm_sysvec_deferred_error),
+	EXTERNAL_INTR(DEFERRED_ERROR_VECTOR, asm_sysvec_deferred_error),
 #endif
 
 #ifdef CONFIG_X86_LOCAL_APIC
-	INTG(LOCAL_TIMER_VECTOR, asm_sysvec_apic_timer_interrupt),
-	INTG(X86_PLATFORM_IPI_VECTOR, asm_sysvec_x86_platform_ipi),
+	EXTERNAL_INTR(LOCAL_TIMER_VECTOR, asm_sysvec_apic_timer_interrupt),
+	EXTERNAL_INTR(X86_PLATFORM_IPI_VECTOR, asm_sysvec_x86_platform_ipi),
 # if IS_ENABLED(CONFIG_KVM)
-	INTG(POSTED_INTR_VECTOR, asm_sysvec_kvm_posted_intr_ipi),
-	INTG(POSTED_INTR_WAKEUP_VECTOR, asm_sysvec_kvm_posted_intr_wakeup_ipi),
-	INTG(POSTED_INTR_NESTED_VECTOR, asm_sysvec_kvm_posted_intr_nested_ipi),
+	EXTERNAL_INTR(POSTED_INTR_VECTOR, asm_sysvec_kvm_posted_intr_ipi),
+	EXTERNAL_INTR(POSTED_INTR_WAKEUP_VECTOR, asm_sysvec_kvm_posted_intr_wakeup_ipi),
+	EXTERNAL_INTR(POSTED_INTR_NESTED_VECTOR, asm_sysvec_kvm_posted_intr_nested_ipi),
 # endif
 #ifdef CONFIG_GUEST_PERF_EVENTS
 	INTG(PERF_GUEST_MEDIATED_PMI_VECTOR, asm_sysvec_perf_guest_mediated_pmi_handler),
 #endif
 # ifdef CONFIG_IRQ_WORK
-	INTG(IRQ_WORK_VECTOR, asm_sysvec_irq_work),
+	EXTERNAL_INTR(IRQ_WORK_VECTOR, asm_sysvec_irq_work),
 # endif
-	INTG(SPURIOUS_APIC_VECTOR, asm_sysvec_spurious_apic_interrupt),
-	INTG(ERROR_APIC_VECTOR, asm_sysvec_error_interrupt),
+	EXTERNAL_INTR(SPURIOUS_APIC_VECTOR, asm_sysvec_spurious_apic_interrupt),
+	EXTERNAL_INTR(ERROR_APIC_VECTOR, asm_sysvec_error_interrupt),
 # ifdef CONFIG_X86_POSTED_MSI
-	INTG(POSTED_MSI_NOTIFICATION_VECTOR, asm_sysvec_posted_msi_notification),
+	EXTERNAL_INTR(POSTED_MSI_NOTIFICATION_VECTOR, asm_sysvec_posted_msi_notification),
 # endif
 #endif
 };
@@ -206,11 +218,12 @@ idt_setup_from_table(gate_desc *idt, const struct idt_data *t, int size, bool sy
 	}
 }
 
-static __init void set_intr_gate(unsigned int n, const void *addr)
+static __init void set_intr_gate(unsigned int n, const void *addr, int ist)
 {
 	struct idt_data data;
 
 	init_idt_data(&data, n, addr);
+	data.bits.ist = ist;
 
 	idt_setup_from_table(idt_table, &data, 1, false);
 }
@@ -293,7 +306,7 @@ void __init idt_setup_apic_and_irq_gates(void)
 
 	for_each_clear_bit_from(i, system_vectors, FIRST_SYSTEM_VECTOR) {
 		entry = irq_entries_start + IDT_ALIGN * (i - FIRST_EXTERNAL_VECTOR);
-		set_intr_gate(i, entry);
+		set_intr_gate(i, entry, EXTERNAL_INTR_IST_VALUE);
 	}
 
 #ifdef CONFIG_X86_LOCAL_APIC
@@ -304,7 +317,7 @@ void __init idt_setup_apic_and_irq_gates(void)
 		 * /proc/interrupts.
 		 */
 		entry = spurious_entries_start + IDT_ALIGN * (i - FIRST_SYSTEM_VECTOR);
-		set_intr_gate(i, entry);
+		set_intr_gate(i, entry, EXTERNAL_INTR_IST_VALUE);
 	}
 #endif
 	/* Map IDT into CPU entry area and reload it. */
@@ -325,10 +338,10 @@ void __init idt_setup_early_handler(void)
 	int i;
 
 	for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
-		set_intr_gate(i, early_idt_handler_array[i]);
+		set_intr_gate(i, early_idt_handler_array[i], DEFAULT_STACK);
 #ifdef CONFIG_X86_32
 	for ( ; i < NR_VECTORS; i++)
-		set_intr_gate(i, early_ignore_irq);
+		set_intr_gate(i, early_ignore_irq, DEFAULT_STACK);
 #endif
 	load_idt(&idt_descr);
 }
@@ -352,5 +365,5 @@ void __init idt_install_sysvec(unsigned int n, const void *function)
 		return;
 
 	if (!WARN_ON(test_and_set_bit(n, system_vectors)))
-		set_intr_gate(n, function);
+		set_intr_gate(n, function, EXTERNAL_INTR_IST_VALUE);
 }
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 3d239ed12744..a2444b9d5b71 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -37,6 +37,7 @@
 #include
 #include
 #include
+#include
 
 #define CREATE_TRACE_POINTS
 #include
@@ -581,6 +582,11 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
 	if (IS_ENABLED(CONFIG_NMI_CHECK_CPU) && ignore_nmis) {
 		WRITE_ONCE(nsp->idt_ignored, nsp->idt_ignored + 1);
 	} else if (!ignore_nmis) {
+		bool protect_pf_ist_stack = is_pf_ist_stack(regs->sp);
+
+		if (protect_pf_ist_stack)
+			install_nmi_pf_stack(true);
+
 		if (IS_ENABLED(CONFIG_NMI_CHECK_CPU)) {
 			WRITE_ONCE(nsp->idt_nmi_seq, nsp->idt_nmi_seq + 1);
 			WARN_ON_ONCE(!(nsp->idt_nmi_seq & 0x1));
@@ -590,6 +596,9 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
 			WRITE_ONCE(nsp->idt_nmi_seq, nsp->idt_nmi_seq + 1);
 			WARN_ON_ONCE(nsp->idt_nmi_seq & 0x1);
 		}
+
+		if (protect_pf_ist_stack)
+			install_nmi_pf_stack(false);
 	}
 
 	irqentry_nmi_exit(regs, irq_state);
diff --git a/arch/x86/lib/usercopy.c b/arch/x86/lib/usercopy.c
index 24b48af27417..75b9f851f428 100644
--- a/arch/x86/lib/usercopy.c
+++ b/arch/x86/lib/usercopy.c
@@ -9,6 +9,7 @@
 #include
 
 #include
+#include
 
 /**
  * copy_from_user_nmi - NMI safe copy from user
@@ -39,6 +40,14 @@ copy_from_user_nmi(void *to, const void __user *from, unsigned long n)
 	if (!nmi_uaccess_okay())
 		return n;
 
+	/*
+	 * IST stacks aren't reentrant, so bail before the possibility of
+	 * a #PF. While on the #PF IST stack, we should only need this
+	 * function for stack dumps (WARN/panic/etc).
+	 */
+	if (is_pf_ist_stack(current_stack_pointer))
+		return n;
+
 	/*
 	 * Even though this function is typically called from NMI/IRQ context
 	 * disable pagefaults so that its behaviour is consistent even when
diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c
index 575f863f3c75..97ac91c109ed 100644
--- a/arch/x86/mm/cpu_entry_area.c
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -156,6 +156,12 @@ static void __init percpu_setup_exception_stacks(unsigned int cpu)
 	cea_map_stack(DB);
 	cea_map_stack(MCE);
 
+	if (IS_ENABLED(CONFIG_DYNAMIC_STACK)) {
+		cea_map_stack(PF);
+		cea_map_stack(PF2);
+		cea_map_stack(UDI);
+	}
+
 	if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
 		if (cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT)) {
 			cea_map_stack(VC);
@@ -173,6 +179,17 @@ static void __init percpu_setup_exception_stacks(unsigned int cpu)
 }
 #endif
 
+#ifdef CONFIG_DYNAMIC_STACK
+bool noinstr is_pf_ist_stack(unsigned long addr)
+{
+	struct cea_exception_stacks *cs = __this_cpu_read(cea_exception_stacks);
+	unsigned long top = CEA_ESTACK_TOP(cs, PF2);
+	unsigned long bot = CEA_ESTACK_BOT(cs, PF);
+
+	return addr >= bot && addr < top;
+}
+#endif
+
 /* Setup the fixmap mappings only once per-processor */
 static void __init setup_cpu_entry_area(unsigned int cpu)
 {
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 40d518d9f562..48ef50982c06 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1482,16 +1482,61 @@ handle_page_fault(struct pt_regs *regs, unsigned long error_code,
 
 #ifdef CONFIG_DYNAMIC_STACK
 
-static noinstr unsigned long copy_stack_data(struct pt_regs *regs)
+static noinstr unsigned long copy_stack_data(struct pt_regs *regs, bool is_dynamic_stack_fault)
 {
 	unsigned long new_sp;
 	unsigned long data_len;
+	bool must_avoid_dynamic_stack_fault;
 
-	new_sp = regs->sp - (FRED_CONFIG_REDZONE_AMOUNT << 6);
-	new_sp &= FRED_STACK_FRAME_RSP_MASK;
-	data_len = sizeof(struct fred_frame);
+	if (cpu_feature_enabled(X86_FEATURE_FRED)) {
+		new_sp = regs->sp - (FRED_CONFIG_REDZONE_AMOUNT << 6);
+		new_sp &= FRED_STACK_FRAME_RSP_MASK;
+		data_len = sizeof(struct fred_frame);
+		must_avoid_dynamic_stack_fault = false;
+	} else {
+		// Hardware aligns sp to a 16 byte boundary when going through the IDT.
+		new_sp = ALIGN_DOWN(regs->sp, 16);
+		data_len = sizeof(struct pt_regs);
+		must_avoid_dynamic_stack_fault = is_dynamic_stack_fault;
+	}
 	new_sp -= data_len;
 
+	if (must_avoid_dynamic_stack_fault) {
+		bool new_sp_on_stack;
+
+		/*
+		 * We don't have to worry about the window where current_task
+		 * is inconsistent during a context switch because interrupts
+		 * are disabled during that window and the only #PF that can
+		 * happen there is a dynamic stack fault, in which case we
+		 * return directly from handle_dynamic_stack_kernel_faults().
+		 */
+		if (!in_nmi())
+			dynamic_stack_fault(current, new_sp, &new_sp_on_stack);
+		else
+			new_sp_on_stack = false;
+
+		/*
+		 * If new_sp isn't on the current task's stack, verify that it's
+		 * on an exception/irq/entry stack. This is a little expensive,
+		 * but #PFs in those contexts should be rare.
+		 */
+		if (!new_sp_on_stack) {
+			struct stack_info info, info2;
+
+			if (!get_stack_info_noinstr((void *)new_sp, current, &info)) {
+				instrumentation_begin();
+				if (get_stack_info_noinstr((void *)(new_sp - PAGE_SIZE),
+						current, &info2)) {
+					pr_emerg("Stack overflow during stack switch\n");
+					handle_stack_overflow(regs, new_sp, &info2);
+				} else {
+					die("Stack switch back to unknown stack", regs, 0);
+				}
+			}
+		}
+	}
+
 	memcpy((void *)new_sp, regs, data_len);
 
 	return new_sp;
@@ -1499,7 +1544,7 @@ static noinstr unsigned long copy_stack_data(struct pt_regs *regs)
 
 __visible noinstr unsigned long switch_to_kstack(struct pt_regs *regs)
 {
-	return copy_stack_data(regs);
+	return copy_stack_data(regs, false);
 }
 
 #define ALIGN_TO_STACK(addr) ((addr) & ~(THREAD_ALIGN - 1))
@@ -1510,7 +1555,7 @@ __visible noinstr unsigned long handle_dynamic_stack_kernel_faults(struct pt_reg
 	struct task_struct *tsk;
 	bool on_stack;
 
-	address = fred_event_data(regs);
+	address = cpu_feature_enabled(X86_FEATURE_FRED) ? fred_event_data(regs) : read_cr2();
 	if (fault_in_kernel_space(address) && !in_nmi()) {
 		tsk = task_from_stack_address(address);
 
@@ -1522,18 +1567,19 @@ __visible noinstr unsigned long handle_dynamic_stack_kernel_faults(struct pt_reg
 	}
 
 	/*
-	 * The regular fault handler won't sleep when executing in an
-	 * atomic context, so we can complete the #PF directly on the
-	 * #PF stack.
+	 * The regular fault handler won't sleep when executing in an atomic
+	 * context, so we can complete the #PF directly on the #PF stack.
+	 * However, IST doesn't support nested exceptions, so we need to avoid
+	 * running any non-noinstr code on the IST #PF stack.
 	 */
-	if (in_atomic())
+	if (in_atomic() && cpu_feature_enabled(X86_FEATURE_FRED))
 		return (unsigned long)regs;
 	else
-		return copy_stack_data(regs);
+		return copy_stack_data(regs, true);
 }
 #endif
 
-DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
+DEFINE_IDTENTRY_PF(exc_page_fault)
 {
 	irqentry_state_t state;
 	unsigned long address;
-- 
2.54.0.rc2.544.gc7ae2d5bb8-goog
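
For readers following the stack layout: is_pf_ist_stack() uses a single range check from the bottom of PF to the top of PF2, which works because the two stacks are declared back to back in the exception stack area. The userspace sketch below models that check and the IST_ENTRY_OFFSET adjustment applied in tss_setup_ist(). It is only an illustration under stated assumptions: struct model_stacks, the MODEL_* sizes, and the model_* function names are hypothetical and are not part of the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative sizes only; the kernel uses EXCEPTION_STKSZ and guardsize. */
#define MODEL_STKSZ		8192
#define MODEL_GUARDSZ		4096
#define MODEL_IST_ENTRY_OFFSET	8	/* mirrors IST_ENTRY_OFFSET in the patch */

/* Hypothetical stand-in for the PF/PF2 members of the exception stack area. */
struct model_stacks {
	char PF_stack_guard[MODEL_GUARDSZ];
	char PF_stack[MODEL_STKSZ];
	char PF2_stack_guard[MODEL_GUARDSZ];
	char PF2_stack[MODEL_STKSZ];
};

/* Lowest address of a stack array. */
static uintptr_t stack_bot(const char *stack)
{
	return (uintptr_t)stack;
}

/* One past the highest address of a stack array. */
static uintptr_t stack_top(const char *stack, size_t size)
{
	return (uintptr_t)stack + size;
}

/*
 * Same shape as is_pf_ist_stack(): one half-open interval from the
 * bottom of PF to the top of PF2. The PF2 guard region that sits
 * between the two stacks falls inside the interval as well.
 */
bool model_is_pf_ist_stack(const struct model_stacks *cs, uintptr_t addr)
{
	uintptr_t bot = stack_bot(cs->PF_stack);
	uintptr_t top = stack_top(cs->PF2_stack, sizeof(cs->PF2_stack));

	return addr >= bot && addr < top;
}

/* Value that would be written to the IST slot: top minus the reserved slot. */
uintptr_t model_ist_entry_sp(const char *stack, size_t size)
{
	return stack_top(stack, size) - MODEL_IST_ENTRY_OFFSET;
}
```

In this model, an address on either PF stack is reported as being on the #PF IST stack, while the PF guard region below PF_stack and anything at or above the top of PF2_stack are not.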