From nobody Fri Dec 19 03:44:59 2025
From: "Fabio M. De Francesco"
To: Jonathan Corbet, Jonathan Cameron, Linus Walleij, Mike Rapoport,
 linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: "Fabio M. De Francesco", Andrew Morton, Ira Weiny, Matthew Wilcox,
 Randy Dunlap
Subject: [PATCH v2] Documentation/page_tables: Add info about MMU/TLB and Page Faults
Date: Sun, 13 Aug 2023 20:25:42 +0200
Message-ID: <20230813182552.31792-1-fmdefrancesco@gmail.com>

Extend page_tables.rst by adding a section about the role of the MMU and the
TLB in translating between virtual addresses and physical page frames.
Furthermore, explain the concept behind page faults and how the Linux kernel
handles TLB misses. Finally, briefly explain how and why to disable the page
fault handler.
Cc: Andrew Morton
Cc: Ira Weiny
Cc: Jonathan Cameron
Cc: Jonathan Corbet
Cc: Linus Walleij
Cc: Matthew Wilcox
Cc: Mike Rapoport
Cc: Randy Dunlap
Signed-off-by: Fabio M. De Francesco
Reviewed-by: Linus Walleij
---

v1 -> v2: This version takes into account the comments provided by Mike
(thanks!). I hope I haven't overlooked anything he suggested :-)
https://lore.kernel.org/all/20230807105010.GK2607694@kernel.org/

Furthermore, v2 adds some more information about swapping which was not
present in v1.

Before becoming a "real" patch, this was an RFC PATCH in its 2nd version for a
week or so, until I received comments and suggestions from Jonathan Cameron
(thanks!); then it morphed into a real patch.

The link to the thread with the RFC PATCH v2 and the messages between Jonathan
and me starts at
https://lore.kernel.org/all/20230723120721.7139-1-fmdefrancesco@gmail.com/#r

 Documentation/mm/page_tables.rst | 128 +++++++++++++++++++++++++++++++
 1 file changed, 128 insertions(+)

diff --git a/Documentation/mm/page_tables.rst b/Documentation/mm/page_tables.rst
index 7840c1891751..ad9e52f2d7f1 100644
--- a/Documentation/mm/page_tables.rst
+++ b/Documentation/mm/page_tables.rst
@@ -152,3 +152,131 @@ Page table handling code that wishes to be architecture-neutral, such as
 the virtual memory manager, will need to be written so that it traverses all of
 the currently five levels. This style should also be preferred for
 architecture-specific code, so as to be robust to future changes.
+
+
+MMU, TLB, and Page Faults
+=========================
+
+The `Memory Management Unit (MMU)` is a hardware component that handles virtual
+to physical address translations. It may use relatively small caches in
+hardware called `Translation Lookaside Buffers (TLBs)` and `Page Walk Caches`
+to speed up these translations.
+
+When the CPU accesses a memory location, it provides its virtual address to the
+MMU, which checks whether a translation already exists in the TLB or in the
+Page Walk Caches (on architectures that support them). If no translation is
+found, the MMU performs a page table walk to determine the physical address
+and create the mapping.
+
+Each page of memory has associated permission and dirty bits. The dirty bit is
+set (i.e., turned on) when the page is written to; it indicates that the page
+has been modified since it was loaded into memory.
+
+If nothing prevents it, eventually the physical memory can be accessed and the
+requested operation on the physical frame is performed.
+
+There are several reasons why the MMU can't find certain translations. It could
+happen because the CPU is trying to access memory that the current task is not
+permitted to, or because the data is not present in physical memory.
+
+When these conditions happen, the MMU triggers page faults, which are types of
+exceptions that signal the CPU to pause the current execution and run a special
+function to handle them.
+
+Page faults may be caused by code bugs or by maliciously crafted addresses that
+the CPU is instructed to dereference and access. A thread of a process could
+use an instruction to address (non-shared) memory which does not belong to its
+own address space, or could try to execute an instruction that wants to write
+to a read-only location.
+
+If the above-mentioned conditions happen in user space, the kernel sends a
+`Segmentation Fault` (SIGSEGV) signal to the current thread. That signal
+usually causes the termination of the thread and of the process it belongs to.
+
+However, there are also other common and expected causes of page faults. These
+are triggered by process management optimization techniques called "Lazy
+Allocation" and "Copy-on-Write".
+Page faults may also happen when frames have been swapped out to persistent
+storage (swap partition or file) and evicted from their physical locations.
+
+These techniques improve memory efficiency, reduce latency, and minimize space
+occupation. This document won't go deeper into the details of "Lazy Allocation"
+and "Copy-on-Write" because these subjects are out of scope: they belong to
+Process Address Management.
+
+Swapping differs from the other techniques mentioned above because it's not so
+desirable: it's performed as a means to reduce memory usage under heavy
+pressure.
+
+Swapping can't work for memory mapped by kernel logical addresses. These are a
+subset of the kernel virtual space that directly maps a contiguous range of
+physical memory. Given any logical address, its physical address is determined
+with simple arithmetic on an offset. Accesses to logical addresses are fast
+because they avoid the need for complex page table lookups, at the expense of
+those frames not being evictable or pageable out.
+
+If everything fails to make room for the data that must be present in physical
+frames, the kernel invokes the out-of-memory (OOM) killer to make room by
+terminating lower-priority processes until the pressure drops below a safe
+threshold.
+
+This document gives a simplified, high-altitude view of how the Linux kernel
+handles these page faults: it creates page tables and table entries, checks
+whether the data is present in memory and, if not, requests that it be loaded
+from persistent storage or from other devices, and updates the MMU and its
+caches.
+
+The first steps are architecture-dependent. Most architectures jump to
+`do_page_fault()`, whereas the x86 interrupt handler is defined by the
+`DEFINE_IDTENTRY_RAW_ERRORCODE()` macro, which calls `handle_page_fault()`.
+
+Whatever the route, all architectures end up invoking `handle_mm_fault()`
+which, in turn, (likely) ends up calling `__handle_mm_fault()` to carry out the
+actual work of allocating the page tables.
+
+The unfortunate case of not being able to call `__handle_mm_fault()` means
+that the virtual address is pointing to areas of physical memory which are not
+permitted to be accessed (at least from the current context). This
+condition resolves to the kernel sending the above-mentioned SIGSEGV signal
+to the process and leads to the consequences already explained.
+
+`__handle_mm_fault()` carries out its work by calling several functions to
+find the entry's offsets of the upper layers of the page tables and allocate
+the tables that it may need.
+
+The functions that look for the offsets have names like `*_offset()`, where
+the "*" is for pgd, p4d, pud, pmd, pte; the functions that allocate the
+corresponding tables, layer by layer, are instead called `*_alloc`, using the
+above-mentioned convention to name them after the corresponding types of
+tables in the hierarchy.
+
+The page table walk may end at one of the middle or upper layers (PMD, PUD).
+
+Linux supports larger page sizes than the usual 4KB (i.e., the so-called
+`huge pages`). When using these kinds of larger pages, higher-level page
+entries can directly map them, with no need to use lower-level page entries
+(PTE). Huge pages contain large contiguous physical regions that usually
+range from 2MB to 1GB. They are respectively mapped by the PMD and PUD page
+entries.
+
+Huge pages bring with them several benefits like reduced TLB pressure,
+reduced page table overhead, memory allocation efficiency, and performance
+improvement for certain workloads. However, these benefits come with
+trade-offs, like wasted memory and allocation challenges. Huge pages are out
+of the scope of the present document, which therefore won't go into further
+details.
+
+At the very end of the walk with allocations, if it didn't return errors,
+`__handle_mm_fault()` finally calls `handle_pte_fault()`, which via
+`do_fault()` performs one of `do_read_fault()`, `do_cow_fault()`,
+`do_shared_fault()`. "read", "cow", "shared" give hints about the reasons and
+the kind of fault it's handling.
+
+The actual implementation of the workflow is very complex. Its design allows
+Linux to handle page faults in a way that is tailored to the specific
+characteristics of each architecture, while still sharing a common overall
+structure.
+
+To conclude this brief overview from very high altitude of how Linux handles
+page faults, let's add that the page fault handler can be disabled and
+enabled, respectively, with `pagefault_disable()` and `pagefault_enable()`.
+
+Several code paths make use of the latter two functions because they need to
+disable traps into the page fault handler, mostly to prevent deadlocks.
-- 
2.41.0