From nobody Tue Sep 16 01:02:11 2025 Received: from debian.me (subs03-180-214-233-26.three.co.id.
[180.214.233.26]) by smtp.gmail.com with ESMTPSA id e18-20020aa79812000000b0058119caa82csm5598727pfl.205.2023.01.09.01.51.15 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 09 Jan 2023 01:51:15 -0800 (PST) Received: by debian.me (Postfix, from userid 1000) id 81A9E104957; Mon, 9 Jan 2023 16:51:11 +0700 (WIB) From: Bagas Sanjaya To: Jonathan Corbet , Yann Sionneau Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, Clement Leger , Guillaume Thouvenin , Bagas Sanjaya Subject: [PATCH 8/8] Documentation: kvx: reword Date: Mon, 9 Jan 2023 16:51:08 +0700 Message-Id: <20230109095108.21229-9-bagasdotme@gmail.com> X-Mailer: git-send-email 2.39.0 In-Reply-To: <20230109095108.21229-1-bagasdotme@gmail.com> References: <874jt7fqxt.fsf@meer.lwn.net> <20230109095108.21229-1-bagasdotme@gmail.com> MIME-Version: 1.0 X-Developer-Signature: v=1; a=openpgp-sha256; l=60293; i=bagasdotme@gmail.com; h=from:subject; bh=qhvIpICM+km+udNBRMt23gZQwKaL8C7q+O94O0Am7Nc=; b=owGbwMvMwCX2bWenZ2ig32LG02pJDMm7H3db7E171Be30+8fVynTozKdKQtOPRf0vqXH3lRgsojx Z19DRykLgxgXg6yYIsukRL6m07uMRC60r3WEmcPKBDKEgYtTACZSf5iRYYZU9h0ht+51BxX82M9t1F 3E+e+91aa4dzlOSWdvKW9jUmVkuJK0YtWE53+vzD+aeLXc8b+foti+okdz+JdO03pzUKMhhgkA X-Developer-Key: i=bagasdotme@gmail.com; a=openpgp; fpr=701B806FDCA5D3A58FFB8F7D7C276C64A5E44A1D Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" Improve the documentation wording to be clearer and effective. In most cases, third-person perspective ("we") is avoided unless absolutely necessary. Also, monospacize programming keywords (like variable and function names). Signed-off-by: Bagas Sanjaya --- Documentation/kvx/kvx-exceptions.rst | 114 +++++++------- Documentation/kvx/kvx-iommu.rst | 124 +++++++-------- Documentation/kvx/kvx-mmu.rst | 227 ++++++++++++++------------- Documentation/kvx/kvx-smp.rst | 29 ++-- Documentation/kvx/kvx.rst | 209 ++++++++++++------------ 5 files changed, 351 insertions(+), 352 deletions(-) diff --git a/Documentation/kvx/kvx-exceptions.rst b/Documentation/kvx/kvx-e= xceptions.rst index 15692f14b9219d..2ce7a62174a40a 100644 --- a/Documentation/kvx/kvx-exceptions.rst +++ b/Documentation/kvx/kvx-exceptions.rst @@ -2,11 +2,10 @@ Exception handling in kvx =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D =20 -On kvx, handlers are set using $ev (exception vector) register which -specifies a base address. -An offset is added to $ev upon exception and the result is used as -"Next $pc". -The offset depends on which exception vector the cpu wants to jump to: +On kvx, handlers are set using $ev (exception vector) register which speci= fies +a base address. An offset is added to $ev upon exception and the result is= used +as "Next $pc". The offset depends on which exception vector the cpu wants = to +jump to: =20 * $ev + 0x00 for debug * $ev + 0x40 for trap @@ -30,53 +29,52 @@ Then, handlers are laid in the following order:: BASE -> +-------------+ v =20 =20 -Interrupts, and traps are serviced similarly, ie: +Interrupts and traps are serviced similarly, ie: =20 - Jump to handler - Save all registers - Prepare the call (do_IRQ or trap_handler) -- restore all registers -- return from exception +- Restore all registers +- Return from exception =20 -entry.S file is (as for other architectures) the entry point into the kern= el. +``entry.S`` is (as for other architectures) the entry point into the kerne= l. 
It contains all assembly routines related to interrupts/traps/syscall. =20 Syscall handling =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D =20 When executing a syscall, it must be done using "scall $r6" -where $r6 contains the syscall number. Using this convention allow to +where $r6 contains the syscall number. This convention allows to modify and restart a syscall from the kernel. =20 Syscalls are handled differently than interrupts/exceptions. From an ABI -point of view, scalls are like function calls: any caller saved register +point of view, syscalls are like function calls: any caller-saved register can be clobbered by the syscall. However, syscall parameters are passed using registers r0 through r7. These registers must be preserved to avoid -cloberring them before the actual syscall function. +clobbering them before the actual syscall function. =20 -On syscall from userspace (scall instruction), the processor will put +On syscall from userspace (``scall`` instruction), the processor will put the syscall number in $es.sn and switch from user to kernel privilege -mode. kvx_syscall_handler will be called in kernel mode. +mode. ``kvx_syscall_handler`` will then be called in kernel mode. =20 -The following steps are then taken: +Below is the path when executing syscall: =20 -- Switch to kernel stack -- Extract syscall number -- Check that the syscall number is not bogus. - If so, set syscall func to a not implemented one +- Switch to kernel stack. +- Extract syscall number. +- Check that the syscall number is not bogus. If so, set syscall func to t= he + unimplemented one. =20 -- Check if tracing is enabled. - If so, jump to trace_syscall_enter, then: +- Check if tracing is enabled. If so, jump to ``trace_syscall_enter``, the= n: =20 - - Save syscall arguments (r0 -> r7) on stack in pt_regs - - Call do_trace_syscall_enter function + - Save syscall arguments (r0 -> r7) on stack in pt_regs. + - Call ``do_trace_syscall_enter`` function. =20 -- Restore syscall arguments since they have been modified by C call -- Call the syscall function -- Save $r0 in pt_regs since it can be cloberred afterward -- If tracing was enabled, call trace_syscall_exit -- Call work_pending -- Return to user ! +- Restore syscall arguments since they have been modified by C function ca= ll. +- Call the ``syscall`` function. +- Save $r0 in ``pt_regs`` since it can be clobbered afterward. +- If tracing is enabled, call ``trace_syscall_exit``. +- Call ``work_pending``. +- Return to user =20 The trace call is handled out of the fast path. All slow path handling is done in another part of code to avoid messing with the cache. @@ -85,18 +83,18 @@ Signals =3D=3D=3D=3D=3D=3D=3D =20 Signals are handled when exiting kernel before returning to user. -When handling a signal, the path is the following: +When handling a signal, the execution path is: =20 1. User application is executing normally, then exception occurs (syscall, interrupt, trap) -2. The exception handling path is taken - and before returning to user, pending signals are checked. +2. The exception handling path is taken and before returning to user, pend= ing + signals are checked. =20 3. The signal handling path is as follows: =20 - * Signals are handled by do_signal. + * Signals are handled by ``do_signal``. * Registers are saved and a special part of the stack is modified - to create a trampoline to call rt_sigreturn. + to create a trampoline to call ``rt_sigreturn``. * $spc is modified to jump to user signal handler. 
* $ra is modified to jump to sigreturn trampoline directly after returning from user signal handler. @@ -104,9 +102,9 @@ When handling a signal, the path is the following: 4. User signal handler is called after rfe from exception. When returning, $ra is retored to $pc, resulting in a call to the syscall trampoline. -5. syscall trampoline is executed, leading to rt_sigreturn syscall -6. rt_sigreturn syscall is executed. - Previous registers are restored to allow returning to user correctly +5. syscall trampoline is executed, leading to ``rt_sigreturn`` syscall +6. ``rt_sigreturn`` syscall is executed. Previous registers are restored to + allow returning to user correctly 7. User application is restored at the exact point it was interrupted before. =20 @@ -175,18 +173,18 @@ Registers handling =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D =20 MMU is disabled in all exceptions paths, during register save and restorat= ion. -This will prevent from triggering MMU fault (such as TLB miss) which could +This will prevent triggering MMU fault (such as TLB miss) which could clobber the current register state. Such event can occurs when RWX mode is enabled and the memory accessed to save register can trigger a TLB miss. Aside from that which is common for all exceptions path, registers are sav= ed -differently regarding the type of exception. +differently according to exception type. =20 Interrupts and traps -------------------- =20 -When interrupt and traps are triggered, we only save the caller-saved regi= sters. +When interrupt and traps are triggered, only caller-saved registers are sa= ved. Indeed, we rely on the fact that C code will save and restore callee-saved= and -hence, there is no need to save them. This path is the following:: +hence, there is no need to save them. The path is:: =20 +------------+ +-----------+ +---------------+ IT | Save caller| C Call | Execute C | Ret | Restore caller| Ret = from IT @@ -194,12 +192,13 @@ hence, there is no need to save them. This path is th= e following:: | registers | +-----------+ | registers | +------------+ +---------------+ =20 -However, when returning to user, we check if there is work_pending. If a s= ignal -is pending and there is a signal handler to be called, then we need all -registers to be saved on the stack in the pt_regs before executing the sig= nal -handler and restored after that. Since we only saved caller-saved register= s, we -need to also save callee-saved registers to restore them correctly when -returning to user. This path is the following (a bit more complicated !):: +However, when returning to user, we check if there is ``work_pending``. If= a +signal is pending and there is a signal handler to be called, then all +registers are needed to be saved on the stack in ``pt_regs`` before execut= ing +the signal handler and restored after that. Since only caller-saved regist= ers +are saved, we need to also save callee-saved registers to restore them +correctly when returning to user. The path will be (note: a bit more +complicated!):: =20 +------------+ | Save caller| +-----------+ Ret +------------+ @@ -246,13 +245,14 @@ returning to user. This path is the following (a bit = more complicated !):: =20 Syscalls -------- -As explained before, for syscalls, we can use whatever callee-saved regist= ers -we want since syscall are seen as a "classic" call from ABI pov. -Only different path is the one for clone. 
For this path, since the child expects -to find same callee-registers content than his parent, we must save them before -executing the clone syscall and restore them after that for the child. This is -done via a redefinition of __sys_clone in assembly which will be called in place -of the standard sys_clone. This new call will save callee saved registers -in pt_regs. Parent will return using the syscall standard path. Freshly spawned -child however will be woken up via ret_from_fork which will restore all -registers (even if caller saved are not needed). +As explained before, for syscalls, any callee-saved register can be +used since syscalls are seen as a "classic" call from the ABI point of view. The only different +path is the one for :manpage:`clone(2)`. For this path, since the child expects +to find the same callee-saved register content as its parent, they must be saved +before executing the :manpage:`clone(2)` syscall and restored after that +for the child. This is done via a redefinition of ``__sys_clone`` in assembly +which will be called in place of the standard ``sys_clone``. This new call will +save callee-saved registers in ``pt_regs``. Parent will return using the +syscall standard path. Freshly spawned child however will be woken up via +``ret_from_fork`` which will restore all registers (even if caller-saved +registers are not needed). diff --git a/Documentation/kvx/kvx-iommu.rst b/Documentation/kvx/kvx-iommu.rst index c95d9231d5b665..dc642ff20d8f67 100644 --- a/Documentation/kvx/kvx-iommu.rst +++ b/Documentation/kvx/kvx-iommu.rst @@ -4,28 +4,28 @@ IOMMU in kvx General Overview ---------------- =20 -To exchange data between device and users through memory, the driver has to -set up a buffer by doing some kernel allocation. The address of the buffer is -virtual and the physical one is obtained through the MMU. When the device wants -to access the same physical memory space it uses a bus address. This address is -obtained by using the DMA mapping API. The Coolidge SoC includes several IOMMUs for clusters, -PCIe peripherals, SoC peripherals, and more; that will translate this "bus address" -into a physical one during DMA operations. +To exchange data between device and users through memory, the driver has to set +up a buffer by doing some kernel memory allocation. The address of the buffer +is virtual and the physical one is obtained through the MMU. When the device +wants to access the same physical memory space it uses a bus address, which is +obtained by using the DMA mapping API. The Coolidge SoC includes several IOMMUs +for clusters, PCIe peripherals, SoC peripherals, and more; that will translate +this "bus address" into a physical one during DMA operations. =20 The bus addresses are IOVA (I/O Virtual Address) or DMA addresses. This addresses can be obtained by calling the allocation functions of the DMA APIs. -It can also be obtained through classical kernel allocation of physical +It can also be obtained through classical allocation of physical contiguous memory and then calling mapping functions of the DMA API. =20 -In order to be able to use the kvx IOMMU we have implemented the IOMMU DMA -interface in arch/kvx/mm/dma-mapping.c. DMA functions are registered by -implementing arch_setup_dma_ops() and generic IOMMU functions. Generic IOMMU -are calling our specific IOMMU functions that adding or removing mappings +In order to be able to use the kvx IOMMU, the necessary IOMMU DMA interface is +implemented in ``arch/kvx/mm/dma-mapping.c``.
DMA functions are registered= by +implementing ``arch_setup_dma_ops()`` and generic IOMMU functions. Generic +IOMMU are calling our specific IOMMU functions that adding or removing map= pings between DMA addresses and physical addresses in the IOMMU TLB. =20 -Specifics IOMMU functions are defined in the kvx IOMMU driver. A kvx IOMMU -driver is managing two physical hardware IOMMU used for TX and RX. In the = next -section we described the HW IOMMUs. +Specifics IOMMU functions are defined in the kvx IOMMU driver. It manages = two +physical hardware IOMMU used for TX and RX. In the next section we describ= ed +the HW IOMMUs. =20 =20 Cluster IOMMUs @@ -45,9 +45,9 @@ SoC peripherals IOMMUs ---------------------- =20 Since SoC peripherals are connected to an AXI bus, two IOMMUs are used: on= e for -each AXI channel (read and write). These two IOMMUs are shared between all= master -devices and DMA. These two IOMMUs will have the same entries but need to b= e configured -independently. +each AXI channel (read and write). These two IOMMUs are shared between all +master devices and DMA. These two IOMMUs will have the same entries but ne= ed to +be configured independently. =20 PCIe IOMMUs ----------- @@ -56,8 +56,8 @@ There is a slave IOMMU (read and write from the MPPA to t= he PCIe endpoint) and a master IOMMU (read and write from a PCIe endpoint to system DDR). The PCIe root complex and the MSI/MSI-X controller have been designed to u= se the IOMMU feature when enabled. (For example for supporting endpoint that -support only 32 bits addresses and allow them to access any memory in a -64 bits address space). For security reason it is highly recommended to +support only 32-bit addresses and allow them to access any memory in the +64-bit address space). For security reason it is highly recommended to activate the IOMMU for PCIe. =20 IOMMU implementation @@ -99,13 +99,12 @@ and translations that occurs between memory and devices= :: +--------------+ =20 =20 -There is also an IOMMU dedicated to the crypto module but this module will= not +There is also an IOMMU dedicated to the crypto module but the module will = not be accessed by the operating system. =20 -We will provide one driver to manage IOMMUs RX/TX. All of them will be -described in the device tree to be able to get their particularities. See -the example below that describes the relation between IOMMU, DMA and NoC in -the cluster. +The kernel provides one driver to manage IOMMUs RX/TX. All of them are +described in the device tree in detail. See the example below that describ= es +the relation between IOMMU, DMA and NoC in the cluster. =20 IOMMU is related to a specific bus like PCIe we will be able to specify th= at all peripherals will go through this IOMMU. @@ -113,38 +112,38 @@ all peripherals will go through this IOMMU. IOMMU Page table ~~~~~~~~~~~~~~~~ =20 -We need to be able to know which IO virtual addresses (IOVA) are mapped in= the -TLB in order to be able to remove entries when a device finishes a transfe= r and -release memory. This information could be extracted when needed by computi= ng all -sets used by the memory and then reads all sixteen ways and compare them t= o the -IOVA but it won't be efficient. We also need to be able to translate an IO= VA -to a physical address as required by the iova_to_phys IOMMU ops that is us= ed -by DMA. Like previously it can be done by extracting the set from the addr= ess -and comparing the IOVA to each sixteen entries of the given set. 
+It is necessary to know which IO virtual addresses (IOVA) are mapped in the TLB +in order to be able to remove entries when a device finishes a transfer and +release memory. This information could be extracted when needed by computing +all sets used by the memory and then reading all 16 entries of each set and comparing them to +the IOVA, but it won't be efficient. It is also necessary to translate an IOVA +to a physical address as required by the ``iova_to_phys`` IOMMU ops that is +used by DMA. Again, it can be done by extracting the set from the address and +comparing the IOVA to each of the sixteen entries of the given set. =20 -A solution is to keep a page table for the IOMMU. But this method is not -efficient for reloading an entry of the TLB without the help of an hardware -page table. So to prevent the need of a refill we will update the TLB when a -device request access to memory and if there is no more slot available in the -TLB we will just fail and the device will have to try again later. It is not -efficient but at least we won't need to manage the refill of the TLB. +A possible solution is to keep a page table for the IOMMU. However, this method +is not efficient for reloading an entry of the TLB without the help of a +hardware page table. So, to avoid the need for a refill, the TLB is updated when +a device requests access to memory; if there is no more slot available in +the TLB, the request will just fail and the device will have to try again later. +It is not efficient but at least managing TLB refill can be avoided. =20 This leads to an issue with the memory that can be used for transfer between -device and memory (see Limitations below). As we only support 4Ko page size we -can only map 8Mo. To be able to manage bigger transfer we can implement the -huge page table in the Linux kernel and use a page table that match the size of -huge page table for a given IOMMU (typically the PCIe IOMMU). +device and memory (see Limitations below). As the kernel only supports 4KB page +size, only 8MB can be mapped. In order to be able to manage bigger +transfer sizes, it is required to implement the huge page table in the Linux +kernel and use a page table that matches the size of the huge page table for a given +IOMMU (typically the PCIe IOMMU). =20 -As we won't refill the TLB we know that we won't have more than 128*16 entries. -In this case we can simply keep a table with all possible entries. +Consequently, the maximum number of page table entries is 128*16 (2048) and the approach +chosen to manage the IOMMU TLB is to keep a table with all possible entries. =20 Maintenance interface ~~~~~~~~~~~~~~~~~~~~~ =20 It is possible to have several "maintainers" for the same IOMMU. The driver is -using two of them. One that writes the TLB and another interface reads TLB. For -debug purpose it is possible to display the content of the tlb by using the -following command in gdb:: +using two of them: one that writes the TLB and another interface that reads it. +For debug purposes it is possible to display the TLB content in gdb by:: =20 gdb> p kvx_iommu_dump_tlb( , 0) =20 @@ -155,34 +154,35 @@ Interrupts ~~~~~~~~~~ =20 IOMMU can have 3 kind of interrupts that corresponds to 3 different types of -errors (no mapping. protection, parity). When the IOMMU is shared between -clusters (SoC periph and PCIe) then fifteen IRQs are generated according to the -configuration of an association table.
The association table is indexed by= the -ASN number (9 bits) and the entry of the table is a subscription mask with= one -bit per destination. Currently this is not managed by the driver. +errors: no mapping, protection, and parity. When the IOMMU is shared betwe= en +clusters (SoC periph and PCIe), 15 IRQs are generated corresponding to +association table configuration. The association table is indexed by the A= SN +number (9 bits) and the entry of the table is a subscription mask with one= bit +per destination. Currently this is not managed by the driver. =20 The driver is only managing interrupts for the cluster. The mode used is t= he -stall one. So when an interrupt occurs it is managed by the driver. All ot= hers -interrupts that occurs are stored and the IOMMU is stalled. When driver cl= eans -the first interrupt others will be managed one by one. +stall one. Thus, when an interrupt occurs it is managed by the driver. All +others interrupts that occurs are stored and the IOMMU is stalled. When the +driver cleans up the first interrupt, other interrupts will be managed +sequentially. =20 ASN (Address Space Number) ~~~~~~~~~~~~~~~~~~~~~~~~~~ =20 This is also know as ASID in some other architecture. Each device will hav= e a given ASN that will be given through the device tree. As address space is -managed at the IOMMU domain level we will use one group and one domain per= ID. +managed at the IOMMU domain level one group and one domain per ID is used. ASN are coded on 9 bits. =20 Device tree ----------- =20 -Relationships between devices, DMAs and IOMMUs are described in the -device tree (see ``Documentation/devicetree/bindings/iommu/kalray,kvx-iomm= u.txt`` -for more details). +Relationships between devices, DMAs and IOMMUs are described in the device= tree +(see ``Documentation/devicetree/bindings/iommu/kalray,kvx-iommu.txt`` for = more +details). =20 Limitations ----------- =20 -Only supporting 4 KB page size will limit the size of mapped memory to 8 MB -because the IOMMU TLB can have at most 128*16 entries. +kvx kernel only supports 4 KB page size, which will limit the size of mapp= ed +memory to 8 MB because the IOMMU TLB can have at most 128*16 (2048) entrie= s. diff --git a/Documentation/kvx/kvx-mmu.rst b/Documentation/kvx/kvx-mmu.rst index 05b9bc111e02db..edad3c52caf47f 100644 --- a/Documentation/kvx/kvx-mmu.rst +++ b/Documentation/kvx/kvx-mmu.rst @@ -2,26 +2,26 @@ kvx Memory Management Unit =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D =20 -Virtual addresses are on 41 bits for kvx when using 64-bit mode. -To differentiate kernel from user space, we use the high order bit -(bit 40). When bit 40 is set, then the higher remaining bits must also be = set to -1. The virtual address must be extended with 1 when the bit 40 is set, -if not the address must be zero extended. Bit 40 is set for kernel space -mappings and not set for user space mappings. +Virtual addresses are on 41 bits for kvx when using 64-bit mode. To +differentiate kernel from user space, the high order bit (bit 40) is used.= If +it is set, then the higher remaining bits must also be set to 1. The virtu= al +address must be extended with 1 when the bit 40 is set, if not the address= must +be zero extended. Bit 40 is set for kernel space mappings and not set for = user +space mappings. =20 Memory Map =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D =20 -In Linux physical memories are arranged into banks according to the cost o= f an -access in term of distance to a memory. 
As we are UMA architecture we only= have -one bank and thus one node. +In Linux physical memories are arranged into "banks" according to the cost= of +an access in term of distance to a memory. As kvx is an UMA architecture t= here +is only one bank and thus one node. =20 A node is divided into several kind of zone. For example if DMA can only a= ccess -a specific area in the physical memory we will define a ZONE_DMA for this = purpose. -In our case we are considering that DMA can access all DDR so we don't hav= e a specific -zone for this. On 64 bit architecture all DDR can be mapped in virtual ker= nel space -so there is no need for a ZONE_HIGHMEM. That means that in our case there = is -only one ZONE_NORMAL. This will be updated if DMA cannot access all memory. +a specific area in the physical memory, the region is called ``ZONE_DMA``.= In +kvx we assume that DMA can access all memory so we don't have a specific z= one +for this purpose. On 64-bit architecture all memory can be mapped in virtu= al +kernel space so ``ZONE_HIGHMEM`` is unnecessary. This implies that there is +only ``ZONE_NORMAL``. This can change if DMA cannot access all memory. =20 Currently, the memory mapping is the following for 4KB page: =20 @@ -46,92 +46,94 @@ Currently, the memory mapping is the following for 4KB = page: Enable the MMU =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D =20 -All kernel functions and symbols are in virtual memory except for kvx_star= t() -function which is loaded at 0x0 in physical memory. -To be able to switch from physical addresses to virtual addresses we choos= e to +All kernel functions and symbols are in virtual memory except for +``kvx_start()`` function which is loaded at 0x0 in physical memory. To be = able +to switch from physical addresses to virtual addresses, the decision is to setup the TLB at the very beginning of the boot process to be able to map = both -pieces of code. For this we added two entries in the LTLB. The first one, -LTLB[0], contains the mapping between virtual memory and DDR. Its size is = 512MB. -The second entry, LTLB[1], contains a flat mapping of the first 2MB of the= SMEM. -Once those two entries are present we can enable the MMU. LTLB[1] will be -removed during paging_init() because once we are really running in virtual= space -it will not be used anymore. -In order to access more than 512MB DDR memory, the remaining memory (> 512= MB) is -refill using a comparison in kernel_perf_refill that does not walk the ker= nel -page table, thus having a faster refill time for kernel. These entries are -inserted into the LTLB for easier computation (4 LTLB entries). The drawba= ck of -this approach is that mapped entries are using RWX protection attributes, -leading to no protection at all. +pieces of code. For this two entries in the LTLB are added. The first one, +LTLB[0], contains the mapping between virtual and physical memory. Its siz= e is +512MB. The second entry, LTLB[1], contains a flat mapping of the first 2MB= of +the SMEM. Once those two entries are present the MMU can be enabled. LTLB[= 1] +will be removed during paging_init() because once we are really running in +virtual space it will not be used anymore. + +In order to access more than 512MB of physical memory, the remaining memor= y (> +512MB) is refilled using a comparison in ``kernel_perf_refill`` that does = not +walk the kernel page table, thus having a faster refill time for kernel. T= hese +entries are inserted into the LTLB for easier computation (4 LTLB entries)= . 
The +drawback of this approach is that mapped entries use RWX protection attributes, hence +there is no protection at all and anything can happen. =20 Kernel strict RWX =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D =20 -CONFIG_STRICT_KERNEL_RWX is enabled by default in default_defconfig. -Once booted, if CONFIG_STRICT_KERNEL_RWX is enable, the kernel text and memory -will be mapped in the init_mm page table. Once mapped, the refill routine for -the kernel is patched to always do a page table walk, bypassing the faster -comparison but enforcing page protection attributes when refilling. -Finally, the LTLB[0] entry is replaced by a 4K one, mapping only exceptions with -RX protection. It allows us to never trigger nomapping on nomapping refill -routine which would (obviously) not work... Once this is done, we can flush the -4 LTLB entries for kernel refill in order to be sure there is no stalled -entries and that new entries inserted in JTLB will apply. +``CONFIG_STRICT_KERNEL_RWX`` is enabled by default. Once booted, if +``CONFIG_STRICT_KERNEL_RWX`` is enabled, the kernel text and memory will be +mapped in the init_mm page table. Once mapped, the refill routine for the +kernel is patched to always walk the page table, bypassing the faster +comparison but enforcing page protection attributes when refilling. Finally, +the LTLB[0] entry is replaced by a 4K one, mapping only the exceptions with RX +protection. It ensures that the nomapping refill routine itself never triggers +a nomapping fault, which would (obviously) not work. Once this is done, the 4 +LTLB entries used for kernel refill can be flushed in order to be sure there +are no stale entries and that new entries inserted in the JTLB will apply. =20 By default, the following policy is applied on vmlinux sections: =20 - init_data: RW -- init_text: RX (or RWX if parameter rodata=3Doff) -- text: RX (or RWX if parameter rodata=3Doff) +- init_text: RX (or RWX if parameter ``rodata=3Doff`` is specified) +- text: RX (or RWX if parameter ``rodata=3Doff`` is specified) - rodata: RW before init, RO after init - sdata: RW =20 -Kernel RWX mode can then be switched on/off using /sys/kvx/kernel_rwx file. +Kernel RWX mode can then be switched on/off with ``/sys/kvx/kernel_rwx``. =20 Privilege Level =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D -Since we are using privilege levels on kvx, we make use of the virtual -spaces to be in the same space as the user. The kernel will have the -$ps.mmup set in kernel (PL1) and unset for user (PL2). -As said in kvx documentation, we have two cases when the kernel is -booted: +Since kvx uses privilege levels, virtual spaces are leveraged so that the +kernel is in the same space as the user. The kernel will have the $ps.mmup set +in kernel (PL1) and unset for user (PL2). As said in kvx documentation, there +are two cases when the kernel is booted: =20 - Boot via intermediaries (bootloader, hypervisor, etc) - Direct boot from flash =20 -In both cases, we will use the virtual space 0. Indeed, if we are alone -on the core, then it means nobody is using the MMU and we can take the -first virtual space. If not alone, then when writing an entry to the tlb -using writetlb instruction, the hypervisor will catch it and change the +In both cases, the virtual space 0 is used. Indeed, if only the kernel is +running on the core, nothing else is using the MMU and the first virtual space +can be used directly by the kernel.
Otherwise, when writing an entry to th= e tlb +using ``writetlb`` instruction, the hypervisor will catch it and change the virtual space accordingly. =20 Memblock =3D=3D=3D=3D=3D=3D=3D=3D =20 When the kernel starts there is no memory allocator available. One of the = first -step in the kernel is to detect the amount of DDR available by getting this -information in the device tree and initialize the low-level "memblock" all= ocator. +step in the kernel is to detect the amount of available memory by getting = this +information in the device tree and initialize the low-level "memblock" +allocator. =20 -We start by reserving memory for the whole kernel. For instance with a dev= ice -tree containing 512Mo of DDR you could see the following boot messages: +Memory initialization starts by reserving memory for the whole kernel. For +instance with a 512MB RAM device dmesg will print:: =20 -setup_bootmem: Memory : 0x100000000 - 0x120000000 -setup_bootmem: Reserved: 0x10001f000 - 0x1002d1bc0 + setup_bootmem: Memory : 0x100000000 - 0x120000000 + setup_bootmem: Reserved: 0x10001f000 - 0x1002d1bc0 =20 During the paging init we need to set: =20 - - min_low_pfn that is the lowest PFN available in the system - - max_low_pfn that indicates the end if NORMAL zone - - max_pfn that is the number of pages in the system + - ``min_low_pfn`` - the lowest PFN available in the system + - ``max_low_pfn`` - the end of ``ZONE_NORMAL`` + - ``max_pfn that`` - the number of pages in the system =20 -This setting is used for dividing memory into pages and for configuring the -zone. See the memory map section for more information about ZONE. +This scheme is used for dividing memory into pages and for configuring the +zone. See the memory map section for more details. =20 -Zones are configured in free_area_init_core(). During start_kernel() other -allocations are done for command line, cpu areas, PID hash table, different -caches for VFS. This allocator is used until mem_init() is called. +Zones are configured in ``free_area_init_core()``. During ``start_kernel()= `` +other allocations are done for command line, cpu areas, PID hash table, +different caches for VFS. The memblock allocator is used until ``mem_init(= )`` +is called. =20 -mem_init() is provided by the architecture. For MPPA we just call -free_all_bootmem() that will go through all pages that are not used by the +``mem_init()`` is provided by the architecture. For MPPA we just call +``free_all_bootmem()`` that will go through all pages that are not used by= the low level allocator and mark them as not used. So physical pages that are reserved for the kernel are still used and remain in physical memory. All = pages released will now be used by the buddy allocator. @@ -146,20 +148,20 @@ LTLB Usage =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D =20 LTLB is used to add resident mapping which allows for faster MMU lookup. -Currently, the LTLB is used to map some mandatory kernel pages and to allo= w fast -accesses to l2 cache (mailbox and registers). -When CONFIG_STRICT_KERNEL_RWX is disabled, 4 entries are reserved for kern= el -TLB refill using 512MB pages. When CONFIG_STRICT_KERNEL_RWX is enabled, th= ese +Currently, the LTLB is used to map some mandatory kernel pages and to allow +fast accesses to l2 cache (mailbox and registers). When +``CONFIG_STRICT_KERNEL_RWX`` is disabled, 4 entries are reserved for kerne= l TLB +refill using 512MB pages. 
When ``CONFIG_STRICT_KERNEL_RWX`` is enabled, th= ese entries are unused since kernel is paginated using the same mecanism than = for user (page walking and entries in JTLB) =20 Page Table =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D =20 -We only support three levels for the page table and 4KB for page size. +Only three-level page table and 4KB page size are supported. =20 -3 levels page table -------------------- +3-level page table +------------------ =20 :: =20 @@ -172,16 +174,16 @@ We only support three levels for the page table and 4= KB for page size. | +-----------------------> [29:21] PMD offset (9 bit= s) +----------------------------------> [39:30] PGD offset (10 bi= ts) =20 -Bits 40 to 64 are signed extended according to bit 39. If bit 39 is equal = to 1 -we are in kernel space. +Bits 40 to 64 are signed extended according to bit 39. If this bit is equa= l to +1 the process is in kernel space. =20 -As 10 bits are used for PGD we need to allocate 2 pages. +As 10 bits are used for PGD 2 pages are needed to be allocated. =20 PTE format =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D =20 -About the format of the PTE entry, as we are not forced by hardware for ch= oices, -we choose to follow the format described in the RiscV implementation as a +For PTE entry format, instead of being forced by hardware constraints, +we choose to follow the format described in the RISC-V implementation as a starting point:: =20 +---------+--------+----+--------+---+---+---+---+---+---+------+---+---+ @@ -207,16 +209,16 @@ starting point:: Huge bit must be somewhere in the first 12 bits to be able to detect it when reading the PMD entry. =20 -PageSZ must be on bit 10 and 11 because it matches the TEL.PS bits. And -by doing that it is easier in assembly to set the TEL.PS to PageSZ. +PageSZ must be on bit 10 and 11 because it matches the TEL.PS bits. As suc= h, +it is easier in assembly to set the TEL.PS to PageSZ. =20 Fast TLB refill =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D =20 -kvx core does not feature a hardware page walker. This work must be done -by the core in software. In order to optimize TLB refill, a special fast -path is taken when entering in kernel space. -In order to speed up the process, the following actions are taken: +kvx core does not feature a hardware page walker. Instead, page walking mu= st be +done by the core in software. In order to optimize TLB refill, a special f= ast +path is utilized when entering in kernel space. In order to speed up the +process, TLB refill is done by: =20 1. Save some registers in a per process scratchpad 2. If the trap is a nomapping then try the fastpath @@ -224,21 +226,22 @@ In order to speed up the process, the following actio= ns are taken: 4. Check if faulting address is a memory direct mapping one. If entry is a direct mapping one and RWX is not enabled, add an entry into LTLB. Otherwise, continue -5. Try to walk the page table. If entry is not present, take the slowpath = (do_page_fault) +5. Try to walk the page table. If entry is not present, take the slowpath + (``do_page_fault``) 6. Refill the tlb properly 7. Exit by restoring only a few registers =20 ASN Handling =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D =20 -Disclaimer: Some part of this are taken from ARC architecture. +.. note:: + Some part of ASN handling is inspired from ARC architecture. =20 kvx MMU provides 9-bit ASN (Address Space Number) in order to tag TLB entr= ies. It allows for multiple process with the same virtual space to cohabit with= out -the need to flush TLB everytime we context switch. 
-kvx implementation to use them is based on other architectures (such as arc -or xtensa) and uses a wrapping ASN counter containing both cycle/generatio= n and -asn. +the need to flush TLB everytime we context switch. kvx implementation to u= se +them is based on other architectures (such as arc or xtensa) and uses a +wrapping ASN counter containing both cycle/generation and asn. =20 :: =20 @@ -250,27 +253,27 @@ asn. This ASN counter is incremented monotonously to allocate new ASNs. When the counter reaches 511 (9 bit), TLB is completely flushed and a new cycle is started. A new allocation cycle, post rollover, could potentially reassign= an -ASN to a different task. Thus the rule is to reassign an ASN when the curr= ent -context cycles does not match the allocation cycle. -The 64 bit @cpu_asn_cache (and mm->asn) have 9 bits MMU ASN and rest 55 bi= ts -serve as cycle/generation indicator and natural 64 bit unsigned math -automagically increments the generation when lower 9 bits rollover. -When the counter completely wraps, we reset the counter to first cycle val= ue -(ie cycle =3D 1). This allows to distinguish context without any ASN and o= ld cycle -generated value with the same operation (XOR on cycle). +ASN to a different task, hence the rule is to reassign an ASN when the cur= rent +context cycles does not match the allocation cycle. The 64 bit +``@cpu_asn_cache`` (and ``mm->asn``) have 9 bits MMU ASN and rest 55 bits = serve +as cycle/generation indicator and natural 64 bit unsigned math automagical= ly +increments the generation when lower 9 bits rollover. When the counter +completely wraps, we reset the counter to first cycle value (ie cycle =3D = 1). +This allows to distinguish context without any ASN and old cycle generated +value with the same operation (XOR on cycle). =20 Huge page =3D=3D=3D=3D=3D=3D=3D=3D=3D =20 -Currently only 3 level page table has been implemented for 4Ko base page s= ize. -So the page shift is 12 bits, the pmd shift is 21 and the pgdir shift is 30 -bits. This choice implies that for 4Ko base page size if we use a PMD as a= huge -page the size will be 2Mo and if we use a PUD as a huge page it will be 1G= o. +Currently only 3-level page table has been implemented for 4Ko base page s= ize. +As such, the page shift is 12 bits, the pmd shift is 21 and the pgdir shif= t is +30 bits. This also implies that for 4Ko base page size, if PMD is used as a +huge page the size will be 2Mo and if we use a PUD as a huge page it will = be +1Go. =20 -To support other huge page sizes (64Ko and 512Mo) we need to use several -contiguous entries in the page table. For huge page of 64Ko we will need to -use 16 entries in the PTE and for a huge page of 512Mo it means that 256 -entries in PMD will be used. +To support other huge page sizes (64KB and 512MB) it is necessary to use +several contiguous entries in the page table. For 64KB page size 16 entrie= s in +the PTE are needed whereas for 512MB page size it requires 256 entries in = PMD. 
=20 Debug =3D=3D=3D=3D=3D @@ -278,14 +281,14 @@ Debug In order to debug the page table and tlb entries, gdb scripts contains com= mands which allows to dump the page table: =20 -- lx-kvx-page-table-walk +- ``lx-kvx-page-table-walk`` Display the current process page table by default -- lx-kvx-tlb-decode +- ``lx-kvx-tlb-decode`` Display the content of $tel and $teh into something readable =20 -Other commands available in kvx-gdb are the following: +Other commands available in kvx-gdb are: =20 -- mppa-dump-tlb +- ``mppa-dump-tlb`` Display the content of TLBs (JTLB and LTLB) -- mppa-lookup-addr +- ``mppa-lookup-addr`` Find physical address matching a virtual one diff --git a/Documentation/kvx/kvx-smp.rst b/Documentation/kvx/kvx-smp.rst index f170bc48ea5f7f..dbb02207beaff0 100644 --- a/Documentation/kvx/kvx-smp.rst +++ b/Documentation/kvx/kvx-smp.rst @@ -2,30 +2,29 @@ Symmetric Multiprocessing Implementation in kvx =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D =20 -On kvx, 5 clusters are organized as groups of 16 processors + 1 -secure core (RM) for each cluster. These 17 processors are L1$ coherent -for TCM (tightly Coupled Memory). A mixed hw/sw L2$ is present to have -cache coherency on DDR as well as TCM. -The RM manager is not meant to run Linux so, 16 processors are available -for SMP. +On kvx, 5 clusters are organized as groups of 16 processors + 1 secure core +(RM) for each cluster. These 17 processors are L1$ coherent for TCM (tight= ly +Coupled Memory). There is also a mixed hw/sw L2$ to provide cache coherenc= y on +DDR as well as TCM. As the secure core (RM) is not meant to run the kerne= l, +the rest 16 processors are available for SMP. =20 Booting =3D=3D=3D=3D=3D=3D=3D =20 -When booting the kvx processor, only the RM is woken up. This RM will -execute a portion of code located in a section named .rm_firmware. -By default, a simple power off code is embedded in this section. -To avoid embedding the firmware in kernel sources, the section is patched -using external tools to add the L2$ firmware (and replace the default firm= ware). -Before executing this firmware, the RM boots the PE0. PE0 will then enable= L2 +When booting the kvx processor, only the RM is woken up. This secure core = will +execute a portion of code located in a section named ``.rm_firmware``. By +default, a simple power off code is embedded in this section. To avoid +embedding the firmware in kernel sources, the section is patched using ext= ernal +tools to add the L2$ firmware (and replace the default firmware). Before +executing this firmware, the RM boots the PE0. PE0 will then enable L2 coherency and request will be stalled until RM boots the L2$ firmware. =20 Locking primitives =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D =20 -spinlock/rwlock are using the kernel standard queued spinlock/rwlocks. -These primitives are based on cmpxch and xchg. More particularly, it uses = xchg16 -which is implemented as a read modify write with acswap on 32 bit word sin= ce +spinlock/rwlock are using the kernel standard queued spinlock/rwlocks. The= se +primitives are based on cmpxch and xchg. More particularly, it uses xchg16 +which is implemented as a read-modify-write with acswap on 32 bit word sin= ce kvx does not have cmpxchg for size < 32bits. 
=20 IPI diff --git a/Documentation/kvx/kvx.rst b/Documentation/kvx/kvx.rst index 5385e1e3d30187..c2b26ed7b06a8b 100644 --- a/Documentation/kvx/kvx.rst +++ b/Documentation/kvx/kvx.rst @@ -5,18 +5,19 @@ kvx Core Implementation This documents will try to explain any architecture choice for the kvx linux port. =20 -Regarding the peripheral, we MUST use device tree to describe ALL -peripherals. The bindings should always start with "kalray,kvx" for all -core related peripherals (watchdog, timer, etc) +Regarding peripherals, devicetree must be used to describe ALL +peripherals. The bindings should always start with ``kalray,kvx`` for all +core-related peripherals (watchdog, timer, etc) =20 System Architecture =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D =20 -On kvx, we have 4 levels of privilege level starting from 0 (most -privileged one) to 3 (less privilege one). A system of owners allows -to delegate ownership of resources by using specials system registers. +On kvx, there are 4 privilege levels, starting from 0 (most +privileged) to 3 (least privileged). An ownership system allows delegating +ownership of resources by using special system registers. =20 -The 2 main software stacks for Linux Kernel are the following:: +The 2 main software stacks for Linux Kernel are bare metal and hypervisor. +Below are simple privilege level diagrams of both stacks:: =20 +-------------+ +-------------+ | PL0: Debug | | PL0: Debug | @@ -32,104 +33,100 @@ In both cases, the kvx support for privileges has been designed using only relative PL and thus should work on both configurations without any modifications. =20 -When booting, the CPU is executing in PL0 and owns all the privileges. -This level is almost dedicated to the debug routines for the debugguer. -It only needs to own few privileges (breakpoint 0 and watchpoint 0) to -be able to debug a system executing in PL1 to PL3. -Debug routines are not always there for instance when the kernel is -executing alone (booted from flash). -In order to ease the load of debug routines, software convention is to -jump directly to PL1 and let PL0 for the debug. -When the kernel boots, it checks if the current privilege level is 0 -($ps.pl is the only absolute value). If so, then it will delegate -almost all resources to PL1 and use a RFE to lower its execution -privilege level (see asm_delegate_pl in head.S). -If the current PL is already different from 0, then it means somebody -is above us and we need to request resource to inform it we need them. It will -then either delegate them to us directly or virtualize the delegation. -All privileges levels have their set of banked registers (ps, ea, sps, -sr, etc) which contain privilege level specific values. -$sr (system reserved) is banked and will hold the current task_struct. -This register is reserved and should not be touched by any other code. +When booting, the CPU is executing in PL0 and owns all the privileges. This +level is almost dedicated to the debug routines for the debugger. It only +needs to own a few privileges (breakpoint 0 and watchpoint 0) to be able to debug +a system executing in PL1 to PL3. Debug routines are not always there, for +instance when the kernel is executing alone (directly booted from flash). In +order to ease loading them, the software convention is to jump directly to +PL1 and leave PL0 for the debug. When the kernel boots, it checks if the current +privilege level is 0 (note that ``$ps.pl`` is the only absolute value).
+If so, then it will delegate almost all resources to PL1 and use a RFE to = lower +its execution privilege level (see ``asm_delegate_pl`` in ``head.S``). If = the +current PL is already different from 0, then it means that there is someth= ing +in PL 0 and it is necessary to request resource in order to inform it that= the +privilege is needed. It will then either delegate them to the kernel direc= tly +or virtualize the delegation. All privileges levels have their set of bank= ed +registers (ps, ea, sps, sr, etc) which contain privilege level specific va= lues. +$sr (system reserved) is banked and hold the current task_struct. This reg= ister +is reserved and should not be touched by any other code. + For more information, refer to the kvx system level architecture manual. =20 Boot =3D=3D=3D=3D =20 -On kvx, the RM (Secure Core) of Cluster 0 will boot first. It will then be= able -to boot a firmware. This firmware is stored in the rm_firmware section. -The first argument ($r0) of this firmware will be a pointer to a function = with -the following prototype: void firmware_init_done(uint64_t features). This -function is responsible of describing the features supported by the firmwa= re and -will start the first PE after that. -By default, the rm_firmware function act as the "default" firmware. This -function does nothing except calling firmware_init_done and then goes to s= leep. -In order to add another firmware, the rm_firmware section is patched using -objcopy. The content of this section is then replaced by the provided firm= ware. -This firmware will do an init and then call firmware_init_done before runn= ing -the main loop. +On kvx, the RM (Secure Core) of Cluster 0 will boot first. It will be used= to +boot a firmware. This firmware is stored in the ``rm_firmware`` section. T= he +first argument ($r0) of this firmware will be a pointer to a function with +``void firmware_init_done(uint64_t features)`` prototype. This function is +responsible of describing the features supported by the firmware and will = start +the first PE after that. By default, the ``rm_firmware`` function act as t= he +"default" firmware. This function does nothing except calling +``firmware_init_done`` and then goes to sleep. In order to add another +firmware, the ``rm_firmware`` section is patched using ``objcopy``. The co= ntent +of this section is then replaced by the provided firmware. This firmware w= ill +be initialized and then call firmware_init_done before running the main lo= op. When the PE boots, it will check for the firmware features to enable or di= sable specific core features (L2$ for instance). =20 -When entering the C (kvx_lowlevel_start) the kernel will look for a special -magic in $r0 (0x494C314B). This magic tells the kernel if there is argumen= ts -passed by a bootloader. -Currently, the following values are passed through registers: +When entering the C code (``kvx_lowlevel_start``) the kernel will look for= a +special magic in $r0 (0x494C314B). It tells the kernel if there are argume= nts +passed by a bootloader. Currently, the following values are passed through +registers: =20 - r1: pointer to command line setup by bootloader - r2: device tree =20 -If this magic is not set, then, the command line will be the one -provided in the device tree (see bootargs). The default device tree is -not builtin but will be patched by the runner used (simulator or jtag) in = the -dtb section. 
+If this magic is not set, then, the command line will be the one provided = in +the device tree (see ``bootargs``). The default devicetree is not builtin = but +will be patched by the runner used (simulator or jtag) in the dtb section. =20 -A default stdout-path is desirable to allow early printk. +The default stdout path is sufficient to allow early printk. =20 Boot Memory Allocator =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D =20 The boot memory allocator is used to allocate memory before paging is enab= led. It is initialized with DDR and also with the shared memory. This first one= is -initialized during the setup_bootmem() and the second one when calling -early_init_fdt_scan_reserved_mem(). +initialized during ``setup_bootmem()`` and the second one is initialized w= hen +calling ``early_init_fdt_scan_reserved_mem()``. =20 =20 Virtual and physical memory =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D =20 The mapping used and the memory management is described in -Documentation/kvx/kvx-mmu.txt. -Our Kernel is compiled using virtual addresses that starts at -0xffffff0000000000. But when it is started the kernel uses physical addres= ses. -Before calling the first function arch_low_level_start() we configure 2 en= tries -of the LTLB. +Documentation/kvx/kvx-mmu.rst. The kernel is compiled using virtual addres= ses +that starts at 0xffffff0000000000, however when it is started it uses phys= ical +addresses. Before calling the first function ``arch_low_level_start()``, 2= LTLB +entries are configured first. =20 -The first entry will map the first 1G of virtual address space to the first -1G of DDR: +The first one maps the first 1G of virtual address space to the first 1G of +DDR: =20 - TLB[0]: 0xffffff0000000000 -> 0x100000000 (size 512Mo) =20 -The second entry will be a flat mapping of the first 512 Ko of the SMEM. It +The second one is a flat mapping of the first 512 Ko of the SMEM. It is required to have this flat mapping because there is still code located = at this address that needs to be executed: =20 - TLB[1]: 0x0 -> 0x0 (size 512Ko) =20 -Once virtual space reached the second entry is removed. +Once virtual memory space is reached the second entry is removed. =20 -To be able to set breakpoints when MMU is enabled we added a label called -gdb_mmu_enabled. If you try to set a breakpoint on a function that is in -virtual memory before the activation of the MMU this address as no signifi= cation -for GDB. So, for example, if you want to break on the function start_kerne= l() -you will need to run:: +To be able to set breakpoints when MMU is enabled, ``gdb_mmu_enabled`` lab= el is +added. If you try to set a breakpoint on a function that is in virtual mem= ory +before the activation of the MMU it will be unhelpful for GDB. Thus, for +example, if you want to break on the function ``start_kernel()`` you will = need +to do:: =20 kvx-gdb -silent path_to/vmlinux \ -ex 'tbreak gdb_mmu_enabled' -ex 'run' \ -ex 'break start_kernel' \ -ex 'continue' =20 -We will also add an option to kvx-gdb to simplify this step. +In the future there will be an option to kvx-gdb to simplify this step. =20 Timers =3D=3D=3D=3D=3D=3D @@ -137,7 +134,7 @@ Timers The free-runinng clock (clocksource) is based on the DSU. This clock is not interruptible and never stops even if core go into idle. =20 -Regarding the tick (clockevent), we use the timer 0 available on the core. +Regarding the tick (clockevent), the timer 0 available on the core is used. 
 This timer allows to set a periodic tick which will be used as the main
 tick for each core. Note that this clock is percpu.
 
@@ -149,60 +146,58 @@ stop the cycle counter)
 Context switching
 =================
 
-context switching is done in entry.S. When spawning a fresh thread,
-copy_thread is called. During this call, we setup callee saved register
-r20 and r21 to special values containing the function to call.
+Context switching is done in ``entry.S``. When spawning a fresh thread,
+``copy_thread`` is called. During this call, callee-saved registers r20 and
+r21 are set to special values containing the function to call.
 
-The normal path for a kernel thread will be the following:
+The normal path for a kernel thread is:
 
-  1. Enter copy_thread_tls and setup callee saved registers which will
-     be restored in __switch_to.
-  2. set r20 and r21 (in thread_struct) to function and argument and
-     ra to ret_from_kernel_thread.
-     These callee saved will be restored in switch_to.
-  3. Call _switch_to at some point.
-  4. Save all callee saved register since switch_to is seen as a
+  1. Enter ``copy_thread_tls`` and set up callee-saved registers which will
+     be restored in ``__switch_to``.
+  2. Set r20 and r21 (in ``thread_struct``) to the function and argument and
+     ra to ``ret_from_kernel_thread``. These callee-saved registers will be
+     restored in ``switch_to``.
+  3. Call ``_switch_to`` at some point.
+  4. Save all callee-saved registers since ``switch_to`` is seen as a
      standard function call by the caller.
   5. Change stack pointer to the new stack
   6. At the end of switch to, set sr0 to the new task and use ret to
-     jump to ret_from_kernel_thread (address restored from ra).
-  7. In ret_from_kernel_thread, execute the function with arguments by
+     jump to ``ret_from_kernel_thread`` (address restored from ra).
+  7. In ``ret_from_kernel_thread``, execute the function with arguments by
      using r20, r21 and we are done
 
-For more explanation, you can refer to https://lwn.net/Articles/520227/
+For more explanation, see https://lwn.net/Articles/520227/
 
 User thread creation
 ====================
 
-We are using almost the same path as copy_thread to create it.
-The detailed path is the following:
+A path almost identical to ``copy_thread`` is used to create user threads:
 
-  1. Call start_thread which will setup user pc and stack pointer in
-     task regs. We also set sps and clear privilege mode bit.
-     When returning from exception, it will "flip" to user mode.
-  2. Enter copy_thread_tls and setup callee saved registers which will
-     be restored in __switch_to. Also, set the "return" function to be
-     ret_from_fork which will be called at end of switch_to
-  3. set r20 (in thread_struct) with tracing information.
-     (simply by lazyness to avoid computing it in assembly...)
-  4. Call _switch_to at some point.
-  5. The current pc will then be restored to be ret_from fork.
-  6. Ret from fork calls schedule_tail and then check if tracing is
-     enabled. If so call syscall_trace_exit
-  7. Finally, instead of returning to kernel, we restore all registers
-     that have been setup by start_thread by restoring regs stored on
-     stack
+  1. Call ``start_thread`` which will set up user pc and stack pointer in
+     task regs. sps is also set and the privilege mode bit is cleared.
+     When returning from exception, it will "flip" to user mode.
+  2. Enter ``copy_thread_tls`` and set up callee-saved registers which will
+     be restored in ``__switch_to``. Also, set the "return" function to be
+     ``ret_from_fork``, which will be called at the end of ``switch_to``.
+  3. Set r20 (in ``thread_struct``) with tracing information.
+     (this is done simply to avoid computing it in assembly)
+  4. Call ``_switch_to`` at some point.
+  5. The current pc will then be restored to be ``ret_from_fork``.
+  6. ``ret_from_fork`` calls ``schedule_tail`` and then checks if tracing is
+     enabled. If so, it calls ``syscall_trace_exit``.
+  7. Finally, instead of returning to the kernel, all registers that were
+     set up by ``start_thread`` are restored from the regs stored on the stack.
 
 L2$ handling
 ============
 
 On kvx, the L2$ is handled by a firmware running on the RM. This firmware needs
 various information to be aware of its configuration and communicate with the
-kernel. In order to do that, when firmware is starting, the device tree is given
-as parameter along with the "registers" zone. This zone is simply a memory area
-where data are exchanged between kernel <-> L2$. When some commands are written
-to it, the kernel sends an interrupt using a mailbox.
-If the L2$ node is not present in the device tree, then, the RM will directly go
+kernel. In order to do that, when the firmware is starting, the device tree is
+given as a parameter along with the "registers" zone. This zone is simply a
+memory area where data are exchanged between the kernel and the L2$. When
+firmware commands are written to it, the kernel sends an interrupt using a
+mailbox. If the L2$ node is absent from the device tree, the RM will directly go
 into sleeping.
 
 Boot diagram::
@@ -246,14 +241,16 @@ Boot diagram::
    +------------+          +           v
 
 
-Since this driver is started early (before SMP boot), A lot of drivers are not
+Since this driver is started early (before SMP boot), a lot of drivers are not
 yet probed (mailboxes, iommu, etc) and thus can not be used.
 
 Building
 ========
 
-In order to build the kernel, you will need a complete kvx toolchain.
-First, setup the config using the following command line::
+In order to build the kernel, you will need a kvx cross toolchain and have
+it somewhere in your ``PATH``.
+
+First, prepare the config with::
 
    $ make ARCH=kvx O=your_directory default_defconfig
 
@@ -261,11 +258,11 @@ Adjust any configuration option you may need and then, build the kernel::
 
   $ make ARCH=kvx O=your_directory -j12
 
-You will finally have a vmlinux image ready to be run::
+You will finally have a vmlinux image which can be run with::
 
   $ kvx-mppa -- vmlinux
 
-Additionally, you may want to debug it. To do so, use kvx-gdb::
+If you need to debug the kernel, simply launch::
 
   $ kvx-gdb vmlinux
 
-- 
An old man doll... just what I always wanted! - Clara