[PATCH 2/4] ARM: Disable HIGHPTE on PREEMPT_RT kernels

Arnd Bergmann posted 4 patches 1 year ago
[PATCH 2/4] ARM: Disable HIGHPTE on PREEMPT_RT kernels
Posted by Arnd Bergmann 1 year ago
From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

gup_pgd_range() is invoked with disabled interrupts and invokes
__kmap_local_page_prot() via pte_offset_map(), gup_p4d_range().
With HIGHPTE enabled, __kmap_local_page_prot() invokes kmap_high_get()
which uses a spinlock_t via lock_kmap_any(). This leads to an
sleeping-while-atomic error on PREEMPT_RT because spinlock_t becomes a
sleeping lock and must not be acquired in atomic context.

The loop in map_new_virtual() uses wait_queue_head_t for wake up which
also is using a spinlock_t.

Since HIGHPTE is rarely needed at all, turn it off for PREEMPT_RT
to allow the use of get_user_pages_fast().

[arnd: rework patch to turn off HIGHPTE instead of HAVE_PAST_GUP]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
There is an open question about whether HIGHPTE is still needed
at all, given how rare 32-bit machines with more than 4GB
are on any architecture. If we instead decide to remove HIGHPTE
altogether, this patch is no longer needed.
---
 arch/arm/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index ed850cc0ed3c..4de4e5697bdf 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1231,7 +1231,7 @@ config HIGHMEM
 
 config HIGHPTE
 	bool "Allocate 2nd-level pagetables from highmem" if EXPERT
-	depends on HIGHMEM
+	depends on HIGHMEM && !PREEMPT_RT
 	default y
 	help
 	  The VM uses one page of physical memory for each page table.
-- 
2.39.5
Re: [PATCH 2/4] ARM: Disable HIGHPTE on PREEMPT_RT kernels
Posted by Sebastian Andrzej Siewior 1 year ago
On 2024-12-10 17:05:54 [+0100], Arnd Bergmann wrote:
> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> 
> gup_pgd_range() is invoked with disabled interrupts and invokes
> __kmap_local_page_prot() via pte_offset_map(), gup_p4d_range().

s@gup_pgd_range@gup_fast_pgd_range@
s@gup_p4d_range@gup_fast_p4d_range@

The functions got renamed…

> With HIGHPTE enabled, __kmap_local_page_prot() invokes kmap_high_get()
> which uses a spinlock_t via lock_kmap_any(). This leads to an
> sleeping-while-atomic error on PREEMPT_RT because spinlock_t becomes a
> sleeping lock and must not be acquired in atomic context.
> 
> The loop in map_new_virtual() uses wait_queue_head_t for wake up which
> also is using a spinlock_t.
> 
> Since HIGHPTE is rarely needed at all, turn it off for PREEMPT_RT
> to allow the use of get_user_pages_fast().
> 
> [arnd: rework patch to turn off HIGHPTE instead of HAVE_PAST_GUP]
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

This version works, too. Thanks.

> ---
> There is an open question about whether HIGHPTE is still needed
> at all, given how rare 32-bit machines with more than 4GB
> are on any architecture. If we instead decide to remove HIGHPTE
> altogether, this patch is no longer needed.

HIGHPTE isn't much about 4GiB+ but about the page-table which is
offloaded to HIGHMEM. Maybe it is more likely to be needed with 4GiB+ of
memory. No idea. X86 had support for up to 64GiB of memory and is the
only architecture supporting HIGHPTE :)

I guess if you have boxes with 4GiB+ and can proof that the performance
improves without HIGHPTE (since you don't have to map the page table).
The question is then how much of low mem has to be used instead and when
does it start to hurt.

Sebastian
Re: [PATCH 2/4] ARM: Disable HIGHPTE on PREEMPT_RT kernels
Posted by Sebastian Andrzej Siewior 1 year ago
On 2024-12-11 14:48:11 [+0100], To Arnd Bergmann wrote:
> I guess if you have boxes with 4GiB+ and can proof that the performance
> improves without HIGHPTE (since you don't have to map the page table).
> The question is then how much of low mem has to be used instead and when
> does it start to hurt.

Some numbers have been been documented in commit
   14315592009c1 ("x86, mm: Allow highmem user page tables to be disabled at boot time")

and I would like cite:
| We could probably handwave up an argument for a threshold at 16G of total
| RAM.

which means HIGHPTE would make sense with >= 16GiB of memory.

Sebastian
Re: [PATCH 2/4] ARM: Disable HIGHPTE on PREEMPT_RT kernels
Posted by Russell King (Oracle) 1 year ago
On Wed, Dec 11, 2024 at 03:04:02PM +0100, Sebastian Andrzej Siewior wrote:
> On 2024-12-11 14:48:11 [+0100], To Arnd Bergmann wrote:
> > I guess if you have boxes with 4GiB+ and can proof that the performance
> > improves without HIGHPTE (since you don't have to map the page table).
> > The question is then how much of low mem has to be used instead and when
> > does it start to hurt.
> 
> Some numbers have been been documented in commit
>    14315592009c1 ("x86, mm: Allow highmem user page tables to be disabled at boot time")
> 
> and I would like cite:
> | We could probably handwave up an argument for a threshold at 16G of total
> | RAM.
> 
> which means HIGHPTE would make sense with >= 16GiB of memory.

However, there is more to consider.

32-bit Arm works out at the same for this:

    Assuming 768M of lowmem we have 196608 potential lowmem PTE
    pages. Each page can map 2M of RAM in a PAE-enabled configuration,
    meaning a maximum of 384G of RAM could potentially be mapped using
    lowmem PTEs.

because, presumably, x86 uses 8 bytes per PTE entry, whereas on Arm we
still use 4 bytes, but because we keep two copies of a PTE (one for
hardware, the other for the kernel) it works out that we're the same
there - one PTE page can also map 2M of RAM.

However, what is quite different is the L1 page tables. On x86,
everything is nice and easy, and each page table is one 4k page.
On 32-bit Arm, this is not the case - we need to grab a 16k page for
the L1, and the more immovable allocations we have in lowmem, the
harder it will be to satisfy this. Failing to grab a 16k page
leads to fork() failing and an unusable system.

So, we want to keep as many immovable allocations out of lowmem as
possible - which is an additional constraint x86 doesn't have, and
shouldn't be overlooked without ensuring that the probability of it
happening remains acceptably low.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
Re: [PATCH 2/4] ARM: Disable HIGHPTE on PREEMPT_RT kernels
Posted by Arnd Bergmann 12 months ago
TL;DR: I would still like to merge patch 3 and 4, but let's
keep that separate from the RT patches.

(just realized I had never sent my reply here)

On Wed, Dec 11, 2024, at 16:55, Russell King (Oracle) wrote:
> On Wed, Dec 11, 2024 at 03:04:02PM +0100, Sebastian Andrzej Siewior wrote:
>
> However, what is quite different is the L1 page tables. On x86,
> everything is nice and easy, and each page table is one 4k page.
> On 32-bit Arm, this is not the case - we need to grab a 16k page for
> the L1, and the more immovable allocations we have in lowmem, the
> harder it will be to satisfy this. Failing to grab a 16k page
> leads to fork() failing and an unusable system.
>
> So, we want to keep as many immovable allocations out of lowmem as
> possible - which is an additional constraint x86 doesn't have, and
> shouldn't be overlooked without ensuring that the probability of it
> happening remains acceptably low.

Right, non-LPAE systems indeed have a bigger problem with
fragmentation here, I hadn't thought of that.  Most of the
systems with 4GB RAM (and all of those with more) support
LPAE though, which means they can use the same page table
layout as x86 when when using CONFIG_LPAE=y.

Almost all SoCs prior to ARMv7VE are limited to 2GB of RAM
or less, the only exceptions I can think of are three SoCs
that can use close to 4GB (a small physical address range 
is used for MMIO):

- Calxeda highbank
- Hisilicon hip01
- NXP i.MX6D/DP/Q/QP

We can throw the first two under the bus, I'm sure nobody
cares enough, but the i.MX6Q was actually shipped in a
handful of products with 4GB that some users still have:

- Bunnie's Novena Laptop (856 original backers on crowdsupply)
- Solidrun CuBox i4x4 (not the more common i4P)
- Solidrun Hummingboard (early revisions only, later 2GB)

A couple of other i.MX6Q boards were advertised as supporting
"up to 4GB", but if you tried to buy those, the only ones in
stock anywhere seemed to be limited to 1 or 2 GB. Examples:

- armstone A9r4 (dts not upstream)
- DFI FS053 (dts not  upstream)
- VAR-SOM-MX6

[Part of the problem here apparently was availability
and cost of qualified 8Gbit DDR3 chips.]

Do you know of any other SoCs or boards in that category?

My feeling so far is that none of these are show-stoppers:

- The 4GB/8GB boards with LPAE don't have the added problem
  of fragmentation with order-2 page table allocations.
  They can run low on lowmem with CONFIG_VMSPLIT_3G, but
  as Sebastian cited, HIGHPTE is not likely to be the
  breaking point for them. These were mainly built around
  2012-2013 (before 64-bit became available) as high-end
  SoCs and are reaching the end of their commercial life.
- The 4GB boards without LPAE are also 10+ years old by now
  and were fairly rare even then. These would suffer the
  most from the fragmentation though.
- The 2GB boards with or without LPAE can theoretically
  avoid HIGHMEM entirely by using CONFIG_VMSPLIT_2G_OPT.
  These are still very common today, but at least a third
  of the total memory is going to be available as lowmem
  even with VMSPLIT_3G.

The case that is less clear to me is the one where memory
is sparse enough to lead to exhausting lowmem without
having a lot of total RAM. These are a little harder
to find, but I think e.g. Renesas has some chips that
need this, and Realview PBX had a custom __phys_to_virt
in earlier versions to work around this. This should
be solved when we get the "densemem" support that Linus
mentioned.

     Arnd
Re: [PATCH 2/4] ARM: Disable HIGHPTE on PREEMPT_RT kernels
Posted by Arnd Bergmann 1 year ago
On Wed, Dec 11, 2024, at 15:04, Sebastian Andrzej Siewior wrote:
> On 2024-12-11 14:48:11 [+0100], To Arnd Bergmann wrote:
>> I guess if you have boxes with 4GiB+ and can proof that the performance
>> improves without HIGHPTE (since you don't have to map the page table).
>> The question is then how much of low mem has to be used instead and when
>> does it start to hurt.
>
> Some numbers have been been documented in commit
>    14315592009c1 ("x86, mm: Allow highmem user page tables to be 
> disabled at boot time")
>
> and I would like cite:
> | We could probably handwave up an argument for a threshold at 16G of total
> | RAM.
>
> which means HIGHPTE would make sense with >= 16GiB of memory.

Very useful, thanks!

On x86, that means we can definitely remove HIGHPTE along with
CONFIG_HIGHMEM64G on x86.

On 32-bit ARM, we still need to support LPAE for systems that
require 64-bit addressing. LPAE supports 36 bits of addressing
(up to 64GB), but the largest actual size I've seen mentioned
is 16GB (Hisilicon HiP04, Calxeda Midway servers) and I'm
certain nobody actually requires these to perform well
given that they are no longer useful for the workloads they
were designed for.

There are also a small number of embedded systems with 8GB
(Ti Keystone2, NVidia Tegra3, Marvell Armada XP), but they
are rare enough that turning off HIGHPTE is completely safe.

      Arnd
Re: [PATCH 2/4] ARM: Disable HIGHPTE on PREEMPT_RT kernels
Posted by Linus Walleij 1 year ago
On Tue, Dec 10, 2024 at 5:06 PM Arnd Bergmann <arnd@kernel.org> wrote:

> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>
> gup_pgd_range() is invoked with disabled interrupts and invokes

There is no gup_pgd_range() in the kernel, is this patch a bit
old?

There is gup_fast_pgd_range().
See 23babe1934d7637b598e4c9d9f3876e318fa63a4
gup.c contains:

  get_user_pages_fast attempts to pin user pages by
  walking the page tables directly and avoids taking locks.
(...)
   Let's consistently call the "fast-only" part of GUP "GUP-fast"
   and rename all relevant internal functions to start with
   "gup_fast", to make it clearer that this is not ordinary GUP.
   The current mixture of "lockless",  "gup" and "gup_fast" is
   confusing.

So fast GUP is supposed to be lockless, and should just not
have this problem. So it can't be addressing gup_fast_pgd_range()
right?

> __kmap_local_page_prot() via pte_offset_map(), gup_p4d_range().
> With HIGHPTE enabled, __kmap_local_page_prot() invokes kmap_high_get()
> which uses a spinlock_t via lock_kmap_any(). This leads to an
> sleeping-while-atomic error on PREEMPT_RT because spinlock_t becomes a
> sleeping lock and must not be acquired in atomic context.

I think this needs to be inspected by David Hildenbrand, if he consistently
rename the GPU functions to be "fast" and there is a lock somewhere
deep in there, something must be wrong and violating the API
contract.

> The loop in map_new_virtual() uses wait_queue_head_t for wake up which
> also is using a spinlock_t.
>
> Since HIGHPTE is rarely needed at all, turn it off for PREEMPT_RT
> to allow the use of get_user_pages_fast().
>
> [arnd: rework patch to turn off HIGHPTE instead of HAVE_PAST_GUP]

HAVE_FAST_GUP

I'm still confused, how can something that is supposed to be
lockless "fast" acquire a spinlock? Something is odd here.

> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> There is an open question about whether HIGHPTE is still needed
> at all, given how rare 32-bit machines with more than 4GB
> are on any architecture. If we instead decide to remove HIGHPTE
> altogether, this patch is no longer needed.

I'm more asking if HIGHPTE even acquires a spinlock anymore
as it is supposed to be "fast"/lockless. If it does, it is clearly violating
the "fast" promise of the fast GUP API and should not exist.

Yours,
Linus Walleij
Re: [PATCH 2/4] ARM: Disable HIGHPTE on PREEMPT_RT kernels
Posted by Sebastian Andrzej Siewior 1 year ago
On 2024-12-11 14:29:29 [+0100], Linus Walleij wrote:
> So fast GUP is supposed to be lockless, and should just not
> have this problem. So it can't be addressing gup_fast_pgd_range()
> right?
…
> I'm more asking if HIGHPTE even acquires a spinlock anymore
> as it is supposed to be "fast"/lockless. If it does, it is clearly violating
> the "fast" promise of the fast GUP API and should not exist.

This is lockless on x86. The problem is ARM's
arch_kmap_local_high_get(). This is where the lock is from.

> Yours,
> Linus Walleij

Sebastian
Re: [PATCH 2/4] ARM: Disable HIGHPTE on PREEMPT_RT kernels
Posted by Linus Walleij 1 year ago
On Wed, Dec 11, 2024 at 4:22 PM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
> On 2024-12-11 14:29:29 [+0100], Linus Walleij wrote:
> > So fast GUP is supposed to be lockless, and should just not
> > have this problem. So it can't be addressing gup_fast_pgd_range()
> > right?
> …
> > I'm more asking if HIGHPTE even acquires a spinlock anymore
> > as it is supposed to be "fast"/lockless. If it does, it is clearly violating
> > the "fast" promise of the fast GUP API and should not exist.
>
> This is lockless on x86. The problem is ARM's
> arch_kmap_local_high_get(). This is where the lock is from.

Aha that calls down to kmap_high_get() that that issues
lock_kmap_any(flags).

But is it really sound that the "fast" API does this? It feels
like a violation of the whole design of the fast stuff.

Yours,
Linus Walleij
Re: [PATCH 2/4] ARM: Disable HIGHPTE on PREEMPT_RT kernels
Posted by Russell King (Oracle) 1 year ago
On Fri, Dec 13, 2024 at 01:27:00AM +0100, Linus Walleij wrote:
> On Wed, Dec 11, 2024 at 4:22 PM Sebastian Andrzej Siewior
> <bigeasy@linutronix.de> wrote:
> > On 2024-12-11 14:29:29 [+0100], Linus Walleij wrote:
> > > So fast GUP is supposed to be lockless, and should just not
> > > have this problem. So it can't be addressing gup_fast_pgd_range()
> > > right?
> > …
> > > I'm more asking if HIGHPTE even acquires a spinlock anymore
> > > as it is supposed to be "fast"/lockless. If it does, it is clearly violating
> > > the "fast" promise of the fast GUP API and should not exist.
> >
> > This is lockless on x86. The problem is ARM's
> > arch_kmap_local_high_get(). This is where the lock is from.
> 
> Aha that calls down to kmap_high_get() that that issues
> lock_kmap_any(flags).
> 
> But is it really sound that the "fast" API does this? It feels
> like a violation of the whole design of the fast stuff.

If there's no way to do it lockless, then it has to take the lock.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
Re: [PATCH 2/4] ARM: Disable HIGHPTE on PREEMPT_RT kernels
Posted by Matthew Wilcox 1 year ago
On Fri, Dec 13, 2024 at 09:11:57AM +0000, Russell King (Oracle) wrote:
> On Fri, Dec 13, 2024 at 01:27:00AM +0100, Linus Walleij wrote:
> > On Wed, Dec 11, 2024 at 4:22 PM Sebastian Andrzej Siewior
> > <bigeasy@linutronix.de> wrote:
> > > On 2024-12-11 14:29:29 [+0100], Linus Walleij wrote:
> > > > So fast GUP is supposed to be lockless, and should just not
> > > > have this problem. So it can't be addressing gup_fast_pgd_range()
> > > > right?
> > > …
> > > > I'm more asking if HIGHPTE even acquires a spinlock anymore
> > > > as it is supposed to be "fast"/lockless. If it does, it is clearly violating
> > > > the "fast" promise of the fast GUP API and should not exist.
> > >
> > > This is lockless on x86. The problem is ARM's
> > > arch_kmap_local_high_get(). This is where the lock is from.
> > 
> > Aha that calls down to kmap_high_get() that that issues
> > lock_kmap_any(flags).
> > 
> > But is it really sound that the "fast" API does this? It feels
> > like a violation of the whole design of the fast stuff.
> 
> If there's no way to do it lockless, then it has to take the lock.

Well, no, it's allowed to fail.  It could trylock and fail if it can't
get the lock.