[PATCH] docs/mm: add more warnings around page table access

Jann Horn posted 1 patch 1 week, 1 day ago
There is a newer version of this series
Documentation/mm/process_addrs.rst | 21 +++++++++++++++++++--
1 file changed, 19 insertions(+), 2 deletions(-)
[PATCH] docs/mm: add more warnings around page table access
Posted by Jann Horn 1 week, 1 day ago
Make it clearer that holding the mmap lock in read mode is not enough
to traverse page tables, and that just having a stable VMA is not enough
to read PTEs.

Suggested-by: Matteo Rizzo <matteorizzo@google.com>
Signed-off-by: Jann Horn <jannh@google.com>
---
@akpm: Please don't put this in your tree before Lorenzo has replied.

@Lorenzo:
This is intended to go on top of your documentation patch.
If you think this is a sensible change, do you prefer to squash it into
your patch or do you prefer having akpm take this as a separate patch?
IDK what works better...
---
 Documentation/mm/process_addrs.rst | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
index 1bf7ad010fc063d003bb857bb3b695a3eafa0b55..9bdf073d0c3ebea1707812508a309aa4a6163660 100644
--- a/Documentation/mm/process_addrs.rst
+++ b/Documentation/mm/process_addrs.rst
@@ -339,6 +339,16 @@ When **installing** page table entries, the mmap or VMA lock must be held to
 keep the VMA stable. We explore why this is in the page table locking details
 section below.
 
+.. warning:: Taking the mmap lock in read mode **is not sufficient** for
+             traversing page tables; you must also ensure that a VMA exists that
+             covers the range being accessed.
+             This ensures you can't race with concurrent page table removal
+             which happens with the mmap lock in read mode, in regions whose
+             VMAs are no longer present in the VMA tree.
+
+             (Alternatively, the mmap lock can be taken in write mode, but that
+             is heavy-handed and almost never the right choice.)
+
 **Freeing** page tables is an entirely internal memory management operation and
 has special requirements (see the page freeing section below for more details).
 
@@ -450,6 +460,9 @@ the time of writing of this document.
 Locking Implementation Details
 ------------------------------
 
+.. warning:: Locking rules for PTE-level page tables are very different from
+             locking rules for page tables at other levels.
+
 Page table locking details
 --------------------------
 
@@ -470,8 +483,12 @@ additional locks dedicated to page tables:
 These locks represent the minimum required to interact with each page table
 level, but there are further requirements.
 
-Importantly, note that on a **traversal** of page tables, no such locks are
-taken. Whether care is taken on reading the page table entries depends on the
+Importantly, note that on a **traversal** of page tables, sometimes no such
+locks are taken. However, at the PTE level, at least concurrent page table
+deletion must be prevented (using RCU) and the page table must be mapped into
+high memory, see below.
+
+Whether care is taken on reading the page table entries depends on the
 architecture, see the section on atomicity below.
 
 Locking rules

---
base-commit: 1e96a63d3022403e06cdda0213c7849b05973cd5
change-id: 20241114-vma-docs-addition1-onv3-32df4e6dffcf

-- 
Jann Horn <jannh@google.com>
Re: [PATCH] docs/mm: add more warnings around page table access
Posted by Lorenzo Stoakes 1 week ago
On Thu, Nov 14, 2024 at 10:12:00PM +0100, Jann Horn wrote:
> Make it clearer that holding the mmap lock in read mode is not enough
> to traverse page tables, and that just having a stable VMA is not enough
> to read PTEs.
>
> Suggested-by: Matteo Rizzo <matteorizzo@google.com>
> Signed-off-by: Jann Horn <jannh@google.com>

Have some queries before we move forward so would like a little more
clarification/perhaps putting some extra meat on the bones first.

Broadly very glad you have done this however so it's just sorting details
first! :>)

> ---
> @akpm: Please don't put this in your tree before Lorenzo has replied.
>
> @Lorenzo:
> This is intended to go on top of your documentation patch.
> If you think this is a sensible change, do you prefer to squash it into
> your patch or do you prefer having akpm take this as a separate patch?
> IDK what works better...

I think a new patch is better, as I'd like the original to settle down now
and the whole point of this doc is that it's a living thing that many
people can contribute to, update, etc.

For instance, Suren is updating as part of one of his series to correct
things that he changes in that series, which is really nice.

> ---
>  Documentation/mm/process_addrs.rst | 21 +++++++++++++++++++--
>  1 file changed, 19 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
> index 1bf7ad010fc063d003bb857bb3b695a3eafa0b55..9bdf073d0c3ebea1707812508a309aa4a6163660 100644
> --- a/Documentation/mm/process_addrs.rst
> +++ b/Documentation/mm/process_addrs.rst
> @@ -339,6 +339,16 @@ When **installing** page table entries, the mmap or VMA lock must be held to
>  keep the VMA stable. We explore why this is in the page table locking details
>  section below.
>
> +.. warning:: Taking the mmap lock in read mode **is not sufficient** for
> +             traversing page tables; you must also ensure that a VMA exists that
> +             covers the range being accessed.

Hm, but we say later we don't need _any_ locks for traversal, and here we
say we need mmap read lock. Do you mean installing page table entries?

Or do you mean to say, that if you don't span a VMA, you must acquire a
write lock at least to preclude this?

This seems quite unclear.

I kind of didn't want to touch on the horrors of fiddling about without a
VMA, so I'd rather this very clearly say something like 'it is unusual to
manipulate page tables wihch are not spanned by a VMA, and there are
special requirements for this operation' etc. et.c otherwise this just adds
more noise and confusion I think.

> +             This ensures you can't race with concurrent page table removal
> +             which happens with the mmap lock in read mode, in regions whose
> +             VMAs are no longer present in the VMA tree.
> +
> +             (Alternatively, the mmap lock can be taken in write mode, but that
> +             is heavy-handed and almost never the right choice.)

You kind of need to expand on why that is I think!

> +
>  **Freeing** page tables is an entirely internal memory management operation and
>  has special requirements (see the page freeing section below for more details).
>
> @@ -450,6 +460,9 @@ the time of writing of this document.
>  Locking Implementation Details
>  ------------------------------
>
> +.. warning:: Locking rules for PTE-level page tables are very different from
> +             locking rules for page tables at other levels.
> +
>  Page table locking details
>  --------------------------
>
> @@ -470,8 +483,12 @@ additional locks dedicated to page tables:
>  These locks represent the minimum required to interact with each page table
>  level, but there are further requirements.
>
> -Importantly, note that on a **traversal** of page tables, no such locks are
> -taken. Whether care is taken on reading the page table entries depends on the
> +Importantly, note that on a **traversal** of page tables, sometimes no such
> +locks are taken. However, at the PTE level, at least concurrent page table
> +deletion must be prevented (using RCU) and the page table must be mapped into
> +high memory, see below.

Ugh I really do hate that we have to think about high memory. I'd like to
sort of deny it exists. But I suppose that's not an option.

As for the RCU thing, I guess this is why pte_offset_map_lock() is taking
it? Maybe worth mentioning something there or updating that 'interestingly'
block... :>)

Or am I mistaken? I wasn't aware of this requirement, is this sort of
implied by the gup_fast() IRQ disabling stuff?

Please expand :)

> +
> +Whether care is taken on reading the page table entries depends on the
>  architecture, see the section on atomicity below.
>
>  Locking rules
>
> ---
> base-commit: 1e96a63d3022403e06cdda0213c7849b05973cd5
> change-id: 20241114-vma-docs-addition1-onv3-32df4e6dffcf
>
> --
> Jann Horn <jannh@google.com>
>

Thanks for this, your input is hugely appreciated both in the review and
now this, you're a gem!

Cheers, Lorenzo
Re: [PATCH] docs/mm: add more warnings around page table access
Posted by Jann Horn 1 week ago
On Fri, Nov 15, 2024 at 7:02 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> On Thu, Nov 14, 2024 at 10:12:00PM +0100, Jann Horn wrote:
> > Make it clearer that holding the mmap lock in read mode is not enough
> > to traverse page tables, and that just having a stable VMA is not enough
> > to read PTEs.
> >
> > Suggested-by: Matteo Rizzo <matteorizzo@google.com>
> > Signed-off-by: Jann Horn <jannh@google.com>
>
> Have some queries before we move forward so would like a little more
> clarification/perhaps putting some extra meat on the bones first.
>
> Broadly very glad you have done this however so it's just sorting details
> first! :>)
>
> > ---
> > @akpm: Please don't put this in your tree before Lorenzo has replied.
> >
> > @Lorenzo:
> > This is intended to go on top of your documentation patch.
> > If you think this is a sensible change, do you prefer to squash it into
> > your patch or do you prefer having akpm take this as a separate patch?
> > IDK what works better...
>
> I think a new patch is better, as I'd like the original to settle down now
> and the whole point of this doc is that it's a living thing that many
> people can contribute to, update, etc.
>
> For instance, Suren is updating as part of one of his series to correct
> things that he changes in that series, which is really nice.
>
> > ---
> >  Documentation/mm/process_addrs.rst | 21 +++++++++++++++++++--
> >  1 file changed, 19 insertions(+), 2 deletions(-)
> >
> > diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
> > index 1bf7ad010fc063d003bb857bb3b695a3eafa0b55..9bdf073d0c3ebea1707812508a309aa4a6163660 100644
> > --- a/Documentation/mm/process_addrs.rst
> > +++ b/Documentation/mm/process_addrs.rst
> > @@ -339,6 +339,16 @@ When **installing** page table entries, the mmap or VMA lock must be held to
> >  keep the VMA stable. We explore why this is in the page table locking details
> >  section below.
> >
> > +.. warning:: Taking the mmap lock in read mode **is not sufficient** for
> > +             traversing page tables; you must also ensure that a VMA exists that
> > +             covers the range being accessed.
>
> Hm, but we say later we don't need _any_ locks for traversal, and here we
> say we need mmap read lock. Do you mean installing page table entries?
>
> Or do you mean to say, that if you don't span a VMA, you must acquire a
> write lock at least to preclude this?

Yeah, this is what I meant with the "you must also ensure that a VMA
exists" part. (Context is that I was looking at some non-upstream code
that was trying to do exactly this - take a userspace-supplied address
and walk down the page tables at that address. Upstream also once had
a UAF in pagemap_walk() because walk_page_range() was used while
holding only the mmap read lock, see
<https://project-zero.issues.chromium.org/42451485>.)

> This seems quite unclear.
>
> I kind of didn't want to touch on the horrors of fiddling about without a
> VMA, so I'd rather this very clearly say something like 'it is unusual to
> manipulate page tables wihch are not spanned by a VMA, and there are
> special requirements for this operation' etc. et.c otherwise this just adds
> more noise and confusion I think.

I guess maybe we could replace this entire warning block with
something like this?

".. warning:: Page tables are normally only traversed in regions
covered by VMAs. If you want to traverse page tables in areas that
might not be covered by VMAs, heavier locking is required. See
:c:func:`!walk_page_range_novma` for details."

And walk_page_range_novma() already has a comment block talking about
why an mmap read lock isn't enough to traverse user virtual address
space outside VMAs.

> > +             This ensures you can't race with concurrent page table removal
> > +             which happens with the mmap lock in read mode, in regions whose
> > +             VMAs are no longer present in the VMA tree.
> > +
> > +             (Alternatively, the mmap lock can be taken in write mode, but that
> > +             is heavy-handed and almost never the right choice.)
>
> You kind of need to expand on why that is I think!
>
> > +
> >  **Freeing** page tables is an entirely internal memory management operation and
> >  has special requirements (see the page freeing section below for more details).
> >
> > @@ -450,6 +460,9 @@ the time of writing of this document.
> >  Locking Implementation Details
> >  ------------------------------
> >
> > +.. warning:: Locking rules for PTE-level page tables are very different from
> > +             locking rules for page tables at other levels.
> > +
> >  Page table locking details
> >  --------------------------
> >
> > @@ -470,8 +483,12 @@ additional locks dedicated to page tables:
> >  These locks represent the minimum required to interact with each page table
> >  level, but there are further requirements.
> >
> > -Importantly, note that on a **traversal** of page tables, no such locks are
> > -taken. Whether care is taken on reading the page table entries depends on the
> > +Importantly, note that on a **traversal** of page tables, sometimes no such
> > +locks are taken. However, at the PTE level, at least concurrent page table
> > +deletion must be prevented (using RCU) and the page table must be mapped into
> > +high memory, see below.
>
> Ugh I really do hate that we have to think about high memory. I'd like to
> sort of deny it exists. But I suppose that's not an option.

(FWIW from a quick look I think only arm and x86 actually have this behavior.)

I mean... we could try to just make userspace page tables live in
non-high-memory based on the assumption that nobody nowadays is going
to run a 32-bit kernel on a device with significantly more than 4G or
8G or so of RAM, and probably most memory is used for stuff like
anon/file pages, not for page tables... but I don't think it actually
matters much for code complexity now that we need RCU for page table
access anyway.

> As for the RCU thing, I guess this is why pte_offset_map_lock() is taking
> it? Maybe worth mentioning something there or updating that 'interestingly'
> block... :>)

Yeah, and not just pte_offset_map_lock() but also pte_offset_map() and
such which don't lock the page table.

> Or am I mistaken? I wasn't aware of this requirement, is this sort of
> implied by the gup_fast() IRQ disabling stuff?

This is to protect against khugepaged/MADV_COLLAPSE deleting page
tables while holding only an rmap lock (in read mode), the pmd lock,
and the pte lock. So a concurrent call to MADV_COLLAPSE (potentially
in another process) can (via retract_page_tables()) remove the page
table while we're traversing it, but the page table will only actually
be freed via RCU (thanks to pte_free_defer), and so this doesn't cause
use-after-free. pte_offset_map_lock() ensures that on success, it has
locked the current page table (holding the pte lock prevents the page
table being removed in the future, and rechecking the pmd entry
ensures it hasn't already been removed), so we never end up installing
PTEs into page tables that have already been removed. pte_offset_map()
does not ensure that you always see the latest page table; it could be
that by the time you're looking at a PTE inside it, the page table
you're looking at has been removed, and a new page table has been
installed in its place and populated with PTEs. But removed page
tables are always empty, so the only possible consequence of such a
race is that you wrongly think that there is no PTE at a given
address.

The fourth paragraph of the comment block above
__pte_offset_map_lock() sorta explains this.

> Please expand :)
Re: [PATCH] docs/mm: add more warnings around page table access
Posted by Lorenzo Stoakes 4 days, 8 hours ago
On Fri, Nov 15, 2024 at 08:04:52PM +0100, Jann Horn wrote:
> On Fri, Nov 15, 2024 at 7:02 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > On Thu, Nov 14, 2024 at 10:12:00PM +0100, Jann Horn wrote:
> > > Make it clearer that holding the mmap lock in read mode is not enough
> > > to traverse page tables, and that just having a stable VMA is not enough
> > > to read PTEs.
> > >
> > > Suggested-by: Matteo Rizzo <matteorizzo@google.com>
> > > Signed-off-by: Jann Horn <jannh@google.com>
> >
> > Have some queries before we move forward so would like a little more
> > clarification/perhaps putting some extra meat on the bones first.
> >
> > Broadly very glad you have done this however so it's just sorting details
> > first! :>)
> >
> > > ---
> > > @akpm: Please don't put this in your tree before Lorenzo has replied.
> > >
> > > @Lorenzo:
> > > This is intended to go on top of your documentation patch.
> > > If you think this is a sensible change, do you prefer to squash it into
> > > your patch or do you prefer having akpm take this as a separate patch?
> > > IDK what works better...
> >
> > I think a new patch is better, as I'd like the original to settle down now
> > and the whole point of this doc is that it's a living thing that many
> > people can contribute to, update, etc.
> >
> > For instance, Suren is updating as part of one of his series to correct
> > things that he changes in that series, which is really nice.
> >
> > > ---
> > >  Documentation/mm/process_addrs.rst | 21 +++++++++++++++++++--
> > >  1 file changed, 19 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
> > > index 1bf7ad010fc063d003bb857bb3b695a3eafa0b55..9bdf073d0c3ebea1707812508a309aa4a6163660 100644
> > > --- a/Documentation/mm/process_addrs.rst
> > > +++ b/Documentation/mm/process_addrs.rst
> > > @@ -339,6 +339,16 @@ When **installing** page table entries, the mmap or VMA lock must be held to
> > >  keep the VMA stable. We explore why this is in the page table locking details
> > >  section below.
> > >
> > > +.. warning:: Taking the mmap lock in read mode **is not sufficient** for
> > > +             traversing page tables; you must also ensure that a VMA exists that
> > > +             covers the range being accessed.
> >
> > Hm, but we say later we don't need _any_ locks for traversal, and here we
> > say we need mmap read lock. Do you mean installing page table entries?
> >
> > Or do you mean to say, that if you don't span a VMA, you must acquire a
> > write lock at least to preclude this?
>
> Yeah, this is what I meant with the "you must also ensure that a VMA
> exists" part. (Context is that I was looking at some non-upstream code
> that was trying to do exactly this - take a userspace-supplied address
> and walk down the page tables at that address. Upstream also once had
> a UAF in pagemap_walk() because walk_page_range() was used while
> holding only the mmap read lock, see
> <https://project-zero.issues.chromium.org/42451485>.)
>
> > This seems quite unclear.
> >
> > I kind of didn't want to touch on the horrors of fiddling about without a
> > VMA, so I'd rather this very clearly say something like 'it is unusual to
> > manipulate page tables wihch are not spanned by a VMA, and there are
> > special requirements for this operation' etc. et.c otherwise this just adds
> > more noise and confusion I think.
>
> I guess maybe we could replace this entire warning block with
> something like this?
>
> ".. warning:: Page tables are normally only traversed in regions
> covered by VMAs. If you want to traverse page tables in areas that
> might not be covered by VMAs, heavier locking is required. See
> :c:func:`!walk_page_range_novma` for details."

Yes I prefer this.

>
> And walk_page_range_novma() already has a comment block talking about
> why an mmap read lock isn't enough to traverse user virtual address
> space outside VMAs.
>
> > > +             This ensures you can't race with concurrent page table removal
> > > +             which happens with the mmap lock in read mode, in regions whose
> > > +             VMAs are no longer present in the VMA tree.
> > > +
> > > +             (Alternatively, the mmap lock can be taken in write mode, but that
> > > +             is heavy-handed and almost never the right choice.)
> >
> > You kind of need to expand on why that is I think!
> >
> > > +
> > >  **Freeing** page tables is an entirely internal memory management operation and
> > >  has special requirements (see the page freeing section below for more details).
> > >
> > > @@ -450,6 +460,9 @@ the time of writing of this document.
> > >  Locking Implementation Details
> > >  ------------------------------
> > >
> > > +.. warning:: Locking rules for PTE-level page tables are very different from
> > > +             locking rules for page tables at other levels.
> > > +
> > >  Page table locking details
> > >  --------------------------
> > >
> > > @@ -470,8 +483,12 @@ additional locks dedicated to page tables:
> > >  These locks represent the minimum required to interact with each page table
> > >  level, but there are further requirements.
> > >
> > > -Importantly, note that on a **traversal** of page tables, no such locks are
> > > -taken. Whether care is taken on reading the page table entries depends on the
> > > +Importantly, note that on a **traversal** of page tables, sometimes no such
> > > +locks are taken. However, at the PTE level, at least concurrent page table
> > > +deletion must be prevented (using RCU) and the page table must be mapped into
> > > +high memory, see below.
> >
> > Ugh I really do hate that we have to think about high memory. I'd like to
> > sort of deny it exists. But I suppose that's not an option.
>
> (FWIW from a quick look I think only arm and x86 actually have this behavior.)
>
> I mean... we could try to just make userspace page tables live in
> non-high-memory based on the assumption that nobody nowadays is going
> to run a 32-bit kernel on a device with significantly more than 4G or
> 8G or so of RAM, and probably most memory is used for stuff like
> anon/file pages, not for page tables... but I don't think it actually
> matters much for code complexity now that we need RCU for page table
> access anyway.

Right, I just am not a fan of 32-bit kernels :P not your fault!

>
> > As for the RCU thing, I guess this is why pte_offset_map_lock() is taking
> > it? Maybe worth mentioning something there or updating that 'interestingly'
> > block... :>)
>
> Yeah, and not just pte_offset_map_lock() but also pte_offset_map() and
> such which don't lock the page table.

Could you update that 'interestingly' block then also?

>
> > Or am I mistaken? I wasn't aware of this requirement, is this sort of
> > implied by the gup_fast() IRQ disabling stuff?
>
> This is to protect against khugepaged/MADV_COLLAPSE deleting page
> tables while holding only an rmap lock (in read mode), the pmd lock,
> and the pte lock. So a concurrent call to MADV_COLLAPSE (potentially
> in another process) can (via retract_page_tables()) remove the page
> table while we're traversing it, but the page table will only actually
> be freed via RCU (thanks to pte_free_defer), and so this doesn't cause
> use-after-free. pte_offset_map_lock() ensures that on success, it has
> locked the current page table (holding the pte lock prevents the page
> table being removed in the future, and rechecking the pmd entry
> ensures it hasn't already been removed), so we never end up installing
> PTEs into page tables that have already been removed. pte_offset_map()
> does not ensure that you always see the latest page table; it could be
> that by the time you're looking at a PTE inside it, the page table
> you're looking at has been removed, and a new page table has been
> installed in its place and populated with PTEs. But removed page
> tables are always empty, so the only possible consequence of such a
> race is that you wrongly think that there is no PTE at a given
> address.

It'd be good to include this detail in the doc also.

>
> The fourth paragraph of the comment block above
> __pte_offset_map_lock() sorta explains this.
>
> > Please expand :)

Thanks!