[PATCH v1 1/2] arm64/mm: Move PTE_PROT_NONE and PMD_PRESENT_INVALID
Posted by Ryan Roberts 1 year, 9 months ago
Previously PTE_PROT_NONE was occupying bit 58, one of the bits reserved
for SW use when the PTE is valid. This is a waste of those precious SW
bits since PTE_PROT_NONE can only ever be set when valid is clear.
Instead let's overlay it on what would be a HW bit if valid was set.

We need to be careful about which HW bit to choose since some of them
must be preserved; when pte_present() is true (as it is for a
PTE_PROT_NONE pte), it is legitimate for the core to call various
accessors, e.g. pte_dirty(), pte_write() etc. There are also some
accessors that are private to the arch which must continue to be
honoured, e.g. pte_user(), pte_user_exec() etc.

So we choose to overlay PTE_UXN; this effectively means that whenever a
pte has PTE_PROT_NONE set, it will always report pte_user_exec() ==
false, which is obviously always correct.

As a result of this change, we must shuffle the layout of the
arch-specific swap pte so that PTE_PROT_NONE is always zero and does not
overlap any other field. Consequently, there is no way to keep the
`type` field contiguous without conflicting with PMD_PRESENT_INVALID
(bit 59), which must also be 0 for a swap pte. So let's move
PMD_PRESENT_INVALID to bit 60.

In the end, this frees up bit 58 for future use as a proper SW bit (e.g.
soft-dirty or uffd-wp).

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable-prot.h |  4 ++--
 arch/arm64/include/asm/pgtable.h      | 16 +++++++++-------
 2 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index dd9ee67d1d87..ef952d69fd04 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -18,14 +18,14 @@
 #define PTE_DIRTY		(_AT(pteval_t, 1) << 55)
 #define PTE_SPECIAL		(_AT(pteval_t, 1) << 56)
 #define PTE_DEVMAP		(_AT(pteval_t, 1) << 57)
-#define PTE_PROT_NONE		(_AT(pteval_t, 1) << 58) /* only when !PTE_VALID */
+#define PTE_PROT_NONE		(PTE_UXN)		 /* Reuse PTE_UXN; only when !PTE_VALID */
 
 /*
  * This bit indicates that the entry is present i.e. pmd_page()
  * still points to a valid huge page in memory even if the pmd
  * has been invalidated.
  */
-#define PMD_PRESENT_INVALID	(_AT(pteval_t, 1) << 59) /* only when !PMD_SECT_VALID */
+#define PMD_PRESENT_INVALID	(_AT(pteval_t, 1) << 60) /* only when !PMD_SECT_VALID */
 
 #define _PROT_DEFAULT		(PTE_TYPE_PAGE | PTE_AF | PTE_SHARED)
 #define _PROT_SECT_DEFAULT	(PMD_TYPE_SECT | PMD_SECT_AF | PMD_SECT_S)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index afdd56d26ad7..23aabff4fa6f 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1248,20 +1248,22 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
  * Encode and decode a swap entry:
  *	bits 0-1:	present (must be zero)
  *	bits 2:		remember PG_anon_exclusive
- *	bits 3-7:	swap type
- *	bits 8-57:	swap offset
- *	bit  58:	PTE_PROT_NONE (must be zero)
+ *	bits 4-53:	swap offset
+ *	bit  54:	PTE_PROT_NONE (overlays PTE_UXN) (must be zero)
+ *	bits 55-59:	swap type
+ *	bit  60:	PMD_PRESENT_INVALID (must be zero)
  */
-#define __SWP_TYPE_SHIFT	3
+#define __SWP_TYPE_SHIFT	55
 #define __SWP_TYPE_BITS		5
-#define __SWP_OFFSET_BITS	50
 #define __SWP_TYPE_MASK		((1 << __SWP_TYPE_BITS) - 1)
-#define __SWP_OFFSET_SHIFT	(__SWP_TYPE_BITS + __SWP_TYPE_SHIFT)
+#define __SWP_OFFSET_SHIFT	4
+#define __SWP_OFFSET_BITS	50
 #define __SWP_OFFSET_MASK	((1UL << __SWP_OFFSET_BITS) - 1)
 
 #define __swp_type(x)		(((x).val >> __SWP_TYPE_SHIFT) & __SWP_TYPE_MASK)
 #define __swp_offset(x)		(((x).val >> __SWP_OFFSET_SHIFT) & __SWP_OFFSET_MASK)
-#define __swp_entry(type,offset) ((swp_entry_t) { ((type) << __SWP_TYPE_SHIFT) | ((offset) << __SWP_OFFSET_SHIFT) })
+#define __swp_entry(type, offset) ((swp_entry_t) { ((unsigned long)(type) << __SWP_TYPE_SHIFT) | \
+						   ((unsigned long)(offset) << __SWP_OFFSET_SHIFT) })
 
 #define __pte_to_swp_entry(pte)	((swp_entry_t) { pte_val(pte) })
 #define __swp_entry_to_pte(swp)	((pte_t) { (swp).val })
-- 
2.25.1
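
For illustration, here is how the new layout encodes and decodes a swap entry
(a sketch using the macros above, in the style of a kernel snippet; the values
are arbitrary):

/*
 * Sketch: round-tripping a swap entry with the new layout
 * (type in bits 55-59, offset in bits 4-53).
 */
swp_entry_t e = __swp_entry(5, 0x1234);

/* e.val == (5UL << 55) | (0x1234UL << 4); bit 54 (PTE_PROT_NONE,
 * overlaying PTE_UXN) and bit 60 (PMD_PRESENT_INVALID) stay zero. */
WARN_ON(__swp_type(e) != 5);
WARN_ON(__swp_offset(e) != 0x1234);
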
Re: [PATCH v1 1/2] arm64/mm: Move PTE_PROT_NONE and PMD_PRESENT_INVALID
Posted by David Hildenbrand 1 year, 9 months ago
On 24.04.24 13:10, Ryan Roberts wrote:
> Previously PTE_PROT_NONE was occupying bit 58, one of the bits reserved
> for SW use when the PTE is valid. This is a waste of those precious SW
> bits since PTE_PROT_NONE can only ever be set when valid is clear.
> Instead let's overlay it on what would be a HW bit if valid was set.
> 
> We need to be careful about which HW bit to choose since some of them
> must be preserved; when pte_present() is true (as it is for a
> PTE_PROT_NONE pte), it is legitimate for the core to call various
> accessors, e.g. pte_dirty(), pte_write() etc. There are also some
> accessors that are private to the arch which must continue to be
> honoured, e.g. pte_user(), pte_user_exec() etc.
> 
> So we choose to overlay PTE_UXN; This effectively means that whenever a
> pte has PTE_PROT_NONE set, it will always report pte_user_exec() ==
> false, which is obviously always correct.
> 
> As a result of this change, we must shuffle the layout of the
> arch-specific swap pte so that PTE_PROT_NONE is always zero and not
> overlapping with any other field. As a result of this, there is no way
> to keep the `type` field contiguous without conflicting with
> PMD_PRESENT_INVALID (bit 59), which must also be 0 for a swap pte. So
> let's move PMD_PRESENT_INVALID to bit 60.

A note that some archs split/re-combine type and/or offset, to make use 
of every bit possible :) But that's mostly relevant for 32bit.

(and as long as PFNs can still fit into the swp offset for migration 
entries etc.)
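
For illustration, a split-offset encoding along those lines might look like
this (hypothetical macros, not actual arm64 code), at the cost of the extra
mask and shift Ryan mentions below:

/* Hypothetical split-offset encoding: the offset is spread across two
 * ranges so that a fixed bit in the middle can stay zero. */
#define __SWP_OFF_LO_SHIFT	3
#define __SWP_OFF_LO_BITS	5			/* pte bits 3-7 */
#define __SWP_OFF_HI_SHIFT	9			/* pte bits 9+; bit 8 kept free */
#define __SWP_OFF_LO_MASK	((1UL << __SWP_OFF_LO_BITS) - 1)

#define __swp_offset(x)						\
	((((x).val >> __SWP_OFF_LO_SHIFT) & __SWP_OFF_LO_MASK) |	\
	 (((x).val >> __SWP_OFF_HI_SHIFT) << __SWP_OFF_LO_BITS))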

> 
> In the end, this frees up bit 58 for future use as a proper SW bit (e.g.
> soft-dirty or uffd-wp).

I was briefly confused about how you would use these bits as SW bits for 
swap PTEs (which you can't as they overlay the type). See below 
regarding bit 3.

I would have said here "proper SW bit for present PTEs".

> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>   arch/arm64/include/asm/pgtable-prot.h |  4 ++--
>   arch/arm64/include/asm/pgtable.h      | 16 +++++++++-------
>   2 files changed, 11 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
> index dd9ee67d1d87..ef952d69fd04 100644
> --- a/arch/arm64/include/asm/pgtable-prot.h
> +++ b/arch/arm64/include/asm/pgtable-prot.h
> @@ -18,14 +18,14 @@
>   #define PTE_DIRTY		(_AT(pteval_t, 1) << 55)
>   #define PTE_SPECIAL		(_AT(pteval_t, 1) << 56)
>   #define PTE_DEVMAP		(_AT(pteval_t, 1) << 57)
> -#define PTE_PROT_NONE		(_AT(pteval_t, 1) << 58) /* only when !PTE_VALID */
> +#define PTE_PROT_NONE		(PTE_UXN)		 /* Reuse PTE_UXN; only when !PTE_VALID */
>   
>   /*
>    * This bit indicates that the entry is present i.e. pmd_page()
>    * still points to a valid huge page in memory even if the pmd
>    * has been invalidated.
>    */
> -#define PMD_PRESENT_INVALID	(_AT(pteval_t, 1) << 59) /* only when !PMD_SECT_VALID */
> +#define PMD_PRESENT_INVALID	(_AT(pteval_t, 1) << 60) /* only when !PMD_SECT_VALID */
>   
>   #define _PROT_DEFAULT		(PTE_TYPE_PAGE | PTE_AF | PTE_SHARED)
>   #define _PROT_SECT_DEFAULT	(PMD_TYPE_SECT | PMD_SECT_AF | PMD_SECT_S)
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index afdd56d26ad7..23aabff4fa6f 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1248,20 +1248,22 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
>    * Encode and decode a swap entry:
>    *	bits 0-1:	present (must be zero)
>    *	bits 2:		remember PG_anon_exclusive
> - *	bits 3-7:	swap type
> - *	bits 8-57:	swap offset
> - *	bit  58:	PTE_PROT_NONE (must be zero)

Reading this patch alone: what happened to bit 3? Please mention that
it will be used as a swap pte metadata bit (uffd-wp).

> + *	bits 4-53:	swap offset

So we'll still have 50bit for the offset, good. We could even use 61-63 
if ever required to store bigger PFNs.

LGTM

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb
Re: [PATCH v1 1/2] arm64/mm: Move PTE_PROT_NONE and PMD_PRESENT_INVALID
Posted by Ryan Roberts 1 year, 9 months ago
On 25/04/2024 10:16, David Hildenbrand wrote:
> On 24.04.24 13:10, Ryan Roberts wrote:
>> Previously PTE_PROT_NONE was occupying bit 58, one of the bits reserved
>> for SW use when the PTE is valid. This is a waste of those precious SW
>> bits since PTE_PROT_NONE can only ever be set when valid is clear.
>> Instead let's overlay it on what would be a HW bit if valid was set.
>>
>> We need to be careful about which HW bit to choose since some of them
>> must be preserved; when pte_present() is true (as it is for a
>> PTE_PROT_NONE pte), it is legitimate for the core to call various
>> accessors, e.g. pte_dirty(), pte_write() etc. There are also some
>> accessors that are private to the arch which must continue to be
>> honoured, e.g. pte_user(), pte_user_exec() etc.
>>
>> So we choose to overlay PTE_UXN; This effectively means that whenever a
>> pte has PTE_PROT_NONE set, it will always report pte_user_exec() ==
>> false, which is obviously always correct.
>>
>> As a result of this change, we must shuffle the layout of the
>> arch-specific swap pte so that PTE_PROT_NONE is always zero and not
>> overlapping with any other field. As a result of this, there is no way
>> to keep the `type` field contiguous without conflicting with
>> PMD_PRESENT_INVALID (bit 59), which must also be 0 for a swap pte. So
>> let's move PMD_PRESENT_INVALID to bit 60.
> 
> A note that some archs split/re-combine type and/or offset, to make use of every
> bit possible :) But that's mostly relevant for 32bit.
> 
> (and as long as PFNs can still fit into the swp offset for migration entries etc.)

Yeah, I considered splitting the type or offset field to avoid moving
PMD_PRESENT_INVALID, but thought it was better to avoid the extra mask and shift.

> 
>>
>> In the end, this frees up bit 58 for future use as a proper SW bit (e.g.
>> soft-dirty or uffd-wp).
> 
> I was briefly confused about how you would use these bits as SW bits for swap
> PTEs (which you can't as they overlay the type). See below regarding bit 3.
> 
> I would have said here "proper SW bit for present PTEs".

Yes; I'll clarify in the next version.

> 
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>   arch/arm64/include/asm/pgtable-prot.h |  4 ++--
>>   arch/arm64/include/asm/pgtable.h      | 16 +++++++++-------
>>   2 files changed, 11 insertions(+), 9 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable-prot.h
>> b/arch/arm64/include/asm/pgtable-prot.h
>> index dd9ee67d1d87..ef952d69fd04 100644
>> --- a/arch/arm64/include/asm/pgtable-prot.h
>> +++ b/arch/arm64/include/asm/pgtable-prot.h
>> @@ -18,14 +18,14 @@
>>   #define PTE_DIRTY        (_AT(pteval_t, 1) << 55)
>>   #define PTE_SPECIAL        (_AT(pteval_t, 1) << 56)
>>   #define PTE_DEVMAP        (_AT(pteval_t, 1) << 57)
>> -#define PTE_PROT_NONE        (_AT(pteval_t, 1) << 58) /* only when !PTE_VALID */
>> +#define PTE_PROT_NONE        (PTE_UXN)         /* Reuse PTE_UXN; only when
>> !PTE_VALID */
>>     /*
>>    * This bit indicates that the entry is present i.e. pmd_page()
>>    * still points to a valid huge page in memory even if the pmd
>>    * has been invalidated.
>>    */
>> -#define PMD_PRESENT_INVALID    (_AT(pteval_t, 1) << 59) /* only when
>> !PMD_SECT_VALID */
>> +#define PMD_PRESENT_INVALID    (_AT(pteval_t, 1) << 60) /* only when
>> !PMD_SECT_VALID */
>>     #define _PROT_DEFAULT        (PTE_TYPE_PAGE | PTE_AF | PTE_SHARED)
>>   #define _PROT_SECT_DEFAULT    (PMD_TYPE_SECT | PMD_SECT_AF | PMD_SECT_S)
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index afdd56d26ad7..23aabff4fa6f 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -1248,20 +1248,22 @@ static inline pmd_t pmdp_establish(struct
>> vm_area_struct *vma,
>>    * Encode and decode a swap entry:
>>    *    bits 0-1:    present (must be zero)
>>    *    bits 2:        remember PG_anon_exclusive
>> - *    bits 3-7:    swap type
>> - *    bits 8-57:    swap offset
>> - *    bit  58:    PTE_PROT_NONE (must be zero)
> 
> Reading this patch alone: what happened to bit 3? Please mention that it
> will be used as a swap pte metadata bit (uffd-wp).

Will do. It's all a bit arbitrary though. I could have put the offset in bits
3-52, and then bit 53 would have been spare for uffd-wp. I'm not sure there is
any advantage to either option.

> 
>> + *    bits 4-53:    swap offset
> 
> So we'll still have 50bit for the offset, good. We could even use 61-63 if ever
> required to store bigger PFNs.

yep, or more sw bits.

> 
> LGTM
> 
> Reviewed-by: David Hildenbrand <david@redhat.com>

Thanks!


Re: [PATCH v1 1/2] arm64/mm: Move PTE_PROT_NONE and PMD_PRESENT_INVALID
Posted by Ryan Roberts 1 year, 9 months ago
On 25/04/2024 11:29, Ryan Roberts wrote:
> On 25/04/2024 10:16, David Hildenbrand wrote:
>> On 24.04.24 13:10, Ryan Roberts wrote:
>>> Previously PTE_PROT_NONE was occupying bit 58, one of the bits reserved
>>> for SW use when the PTE is valid. This is a waste of those precious SW
>>> bits since PTE_PROT_NONE can only ever be set when valid is clear.
>>> Instead let's overlay it on what would be a HW bit if valid was set.
>>>
>>> We need to be careful about which HW bit to choose since some of them
>>> must be preserved; when pte_present() is true (as it is for a
>>> PTE_PROT_NONE pte), it is legitimate for the core to call various
>>> accessors, e.g. pte_dirty(), pte_write() etc. There are also some
>>> accessors that are private to the arch which must continue to be
>>> honoured, e.g. pte_user(), pte_user_exec() etc.
>>>
>>> So we choose to overlay PTE_UXN; This effectively means that whenever a
>>> pte has PTE_PROT_NONE set, it will always report pte_user_exec() ==
>>> false, which is obviously always correct.
>>>
>>> As a result of this change, we must shuffle the layout of the
>>> arch-specific swap pte so that PTE_PROT_NONE is always zero and not
>>> overlapping with any other field. As a result of this, there is no way
>>> to keep the `type` field contiguous without conflicting with
>>> PMD_PRESENT_INVALID (bit 59), which must also be 0 for a swap pte. So
>>> let's move PMD_PRESENT_INVALID to bit 60.
>>
>> A note that some archs split/re-combine type and/or offset, to make use of every
>> bit possible :) But that's mostly relevant for 32bit.
>>
>> (and as long as PFNs can still fit into the swp offset for migration entries etc.)
> 
> Yeah, I considered splitting the type or offset field to avoid moving
> PMD_PRESENT_INVALID, but thought it was better to avoid the extra mask and shift.

Also, IMHO we shouldn't really need to reserve PMD_PRESENT_INVALID for swap
ptes; it would be cleaner to have one bit that defines "present" when valid is
clear (similar to PTE_PROT_NONE today), and then another bit which is only
defined when "present && !valid" and which tells us whether this is
PTE_PROT_NONE or PMD_PRESENT_INVALID (I don't think you can ever have both at
the same time?).

But there is a problem with this: __split_huge_pmd_locked() calls
pmdp_invalidate() for a pmd before it determines that it is pmd_present(). So
PMD_PRESENT_INVALID can be set in a swap pte today. That feels wrong to me,
but I was trying to avoid the whole thing unravelling so didn't pursue it.
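
Roughly, the ordering in question looks like this (a simplified sketch of
__split_huge_pmd_locked(), not the verbatim mm/huge_memory.c code):

/* The pmd is invalidated before we know whether it was present, so
 * pmd_mkinvalid() can be applied to a swap/migration pmd: */
old_pmd = pmdp_invalidate(vma, haddr, pmd);	/* may set PMD_PRESENT_INVALID */
pmd_migration = is_pmd_migration_entry(old_pmd);
if (!pmd_migration) {
	/* ... only now do we know the old pmd was actually present ... */
}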

> 
>>
>>>
>>> In the end, this frees up bit 58 for future use as a proper SW bit (e.g.
>>> soft-dirty or uffd-wp).
>>
>> I was briefly confused about how you would use these bits as SW bits for swap
>> PTEs (which you can't as they overlay the type). See below regarding bit 3.
>>
>> I would have said here "proper SW bit for present PTEs".
> 
> Yes; I'll clarify in the next version.
> 
>>
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>>   arch/arm64/include/asm/pgtable-prot.h |  4 ++--
>>>   arch/arm64/include/asm/pgtable.h      | 16 +++++++++-------
>>>   2 files changed, 11 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/pgtable-prot.h
>>> b/arch/arm64/include/asm/pgtable-prot.h
>>> index dd9ee67d1d87..ef952d69fd04 100644
>>> --- a/arch/arm64/include/asm/pgtable-prot.h
>>> +++ b/arch/arm64/include/asm/pgtable-prot.h
>>> @@ -18,14 +18,14 @@
>>>   #define PTE_DIRTY        (_AT(pteval_t, 1) << 55)
>>>   #define PTE_SPECIAL        (_AT(pteval_t, 1) << 56)
>>>   #define PTE_DEVMAP        (_AT(pteval_t, 1) << 57)
>>> -#define PTE_PROT_NONE        (_AT(pteval_t, 1) << 58) /* only when !PTE_VALID */
>>> +#define PTE_PROT_NONE        (PTE_UXN)         /* Reuse PTE_UXN; only when
>>> !PTE_VALID */
>>>     /*
>>>    * This bit indicates that the entry is present i.e. pmd_page()
>>>    * still points to a valid huge page in memory even if the pmd
>>>    * has been invalidated.
>>>    */
>>> -#define PMD_PRESENT_INVALID    (_AT(pteval_t, 1) << 59) /* only when
>>> !PMD_SECT_VALID */
>>> +#define PMD_PRESENT_INVALID    (_AT(pteval_t, 1) << 60) /* only when
>>> !PMD_SECT_VALID */
>>>     #define _PROT_DEFAULT        (PTE_TYPE_PAGE | PTE_AF | PTE_SHARED)
>>>   #define _PROT_SECT_DEFAULT    (PMD_TYPE_SECT | PMD_SECT_AF | PMD_SECT_S)
>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>> index afdd56d26ad7..23aabff4fa6f 100644
>>> --- a/arch/arm64/include/asm/pgtable.h
>>> +++ b/arch/arm64/include/asm/pgtable.h
>>> @@ -1248,20 +1248,22 @@ static inline pmd_t pmdp_establish(struct
>>> vm_area_struct *vma,
>>>    * Encode and decode a swap entry:
>>>    *    bits 0-1:    present (must be zero)
>>>    *    bits 2:        remember PG_anon_exclusive
>>> - *    bits 3-7:    swap type
>>> - *    bits 8-57:    swap offset
>>> - *    bit  58:    PTE_PROT_NONE (must be zero)
>>
>> Reading this patch alone: what happened to bit 3? Please mention that it
>> will be used as a swap pte metadata bit (uffd-wp).
> 
> Will do. It's all a bit arbitrary though. I could have put offset in 3-52, and
> then 53 would have been spare for uffd-wp. I'm not sure there is any advantage
> to either option.
> 
>>
>>> + *    bits 4-53:    swap offset
>>
>> So we'll still have 50bit for the offset, good. We could even use 61-63 if ever
>> required to store bigger PFNs.
> 
> yep, or more sw bits.
> 
>>
>> LGTM
>>
>> Reviewed-by: David Hildenbrand <david@redhat.com>
> 
> Thanks!
> 
> 

Re: [PATCH v1 1/2] arm64/mm: Move PTE_PROT_NONE and PMD_PRESENT_INVALID
Posted by Catalin Marinas 1 year, 9 months ago
On Thu, Apr 25, 2024 at 11:37:42AM +0100, Ryan Roberts wrote:
> Also, IMHO we shouldn't really need to reserve PMD_PRESENT_INVALID for swap
> ptes; it would be cleaner to have one bit that defines "present" when valid is
> clear (similar to PTE_PROT_NONE today) then another bit which is only defined
> when "present && !valid" which tells us if this is PTE_PROT_NONE or
> PMD_PRESENT_INVALID (I don't think you can ever have both at the same time?).

I think this makes sense; maybe rename the above to PTE_PRESENT_INVALID
and use it for both ptes and pmds.
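
i.e. something like this (a hypothetical sketch; the exact bit position is
still open):

#define PTE_PRESENT_INVALID	(_AT(pteval_t, 1) << 59) /* only when !PTE_VALID */

#define pte_present_invalid(pte) \
	((pte_val(pte) & (PTE_VALID | PTE_PRESENT_INVALID)) == PTE_PRESENT_INVALID)
#define pte_present(pte) \
	(!!(pte_val(pte) & (PTE_VALID | PTE_PRESENT_INVALID)))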

> But there is a problem with this: __split_huge_pmd_locked() calls
> pmdp_invalidate() for a pmd before it determines that it is pmd_present(). So
> the PMD_PRESENT_INVALID can be set in a swap pte today. That feels wrong to me,
> but I was trying to avoid the whole thing unravelling so didn't pursue it.

Maybe what's wrong is the arm64 implementation setting this bit on a
swap/migration pmd (though we could handle this in the core code as
well, it depends what the other architectures do). The only check for
the PMD_PRESENT_INVALID bit is in the arm64 code and it can be absorbed
into the pmd_present() check. I think it is currently broken as
pmd_present() can return true for a swap pmd after pmd_mkinvalid().

So I don't think we lose anything if pmd_mkinvalid() skips any bit
setting when !PTE_VALID. Maybe it even fixes some corner case we've never
hit yet (like pmd_present() on a swap/migration+invalid pmd).
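
i.e. something like this (a sketch of the suggestion, not necessarily the
final fix):

static inline pmd_t pmd_mkinvalid(pmd_t pmd)
{
	/* Leave non-valid (swap/migration) pmds untouched. */
	if (!(pmd_val(pmd) & PMD_SECT_VALID))
		return pmd;

	pmd = set_pmd_bit(pmd, __pgprot(PMD_PRESENT_INVALID));
	pmd = clear_pmd_bit(pmd, __pgprot(PMD_SECT_VALID));

	return pmd;
}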

-- 
Catalin
Re: [PATCH v1 1/2] arm64/mm: Move PTE_PROT_NONE and PMD_PRESENT_INVALID
Posted by Ryan Roberts 1 year, 9 months ago
On 26/04/2024 15:48, Catalin Marinas wrote:
> On Thu, Apr 25, 2024 at 11:37:42AM +0100, Ryan Roberts wrote:
>> Also, IMHO we shouldn't really need to reserve PMD_PRESENT_INVALID for swap
>> ptes; it would be cleaner to have one bit that defines "present" when valid is
>> clear (similar to PTE_PROT_NONE today) then another bit which is only defined
>> when "present && !valid" which tells us if this is PTE_PROT_NONE or
>> PMD_PRESENT_INVALID (I don't think you can ever have both at the same time?).
> 
> I think this makes sense, maybe rename the above to PTE_PRESENT_INVALID
> and use it for both ptes and pmds.

Yep, sounds good. I've already got a patch to do this, but it's exposed a bug
in core-mm, so I'll fix that before I can validate my change; see
https://lore.kernel.org/linux-arm-kernel/ZiuyGXt0XWwRgFh9@x1n/

With this in place, I'm proposing to remove PTE_PROT_NONE entirely and instead
represent PROT_NONE as a present but invalid pte (PTE_VALID=0, PTE_INVALID=1)
with both PTE_WRITE=0 and PTE_RDONLY=0.

While the HW would interpret PTE_WRITE=0/PTE_RDONLY=0 as "RW without dirty bit
modification", this is not a problem as the pte is invalid, so the HW doesn't
interpret it. And SW always uses the PTE_WRITE bit to interpret the writability
of the pte. So PTE_WRITE=0/PTE_RDONLY=0 was previously an unused combination
that we now repurpose for PROT_NONE.

This will subtly change behaviour in an edge case though. Imagine:

pte_t pte;

pte = pte_modify(pte, PAGE_NONE);
pte = pte_mkwrite_novma(pte);
WARN_ON(pte_protnone(pte));

Should that warning fire or not? Previously, because we had a dedicated bit
for PTE_PROT_NONE, it would fire. With my proposed change it will not fire. To
me it's more intuitive if it doesn't fire. Regardless, there is no core code
that ever does this. Once you have a protnone pte, it's terminal - nothing
ever modifies it with these helpers AFAICS.

Personally I think this is a nice tidy-up that saves a SW bit in both present
and swap ptes. What do you think? (I'll just post the series if it's easier to
provide feedback in that context).

> 
>> But there is a problem with this: __split_huge_pmd_locked() calls
>> pmdp_invalidate() for a pmd before it determines that it is pmd_present(). So
>> the PMD_PRESENT_INVALID can be set in a swap pte today. That feels wrong to me,
>> but I was trying to avoid the whole thing unravelling so didn't pursue it.
> 
> Maybe what's wrong is the arm64 implementation setting this bit on a
> swap/migration pmd (though we could handle this in the core code as
> well, it depends what the other architectures do). The only check for
> the PMD_PRESENT_INVALID bit is in the arm64 code and it can be absorbed
> into the pmd_present() check. I think it is currently broken as
> pmd_present() can return true for a swap pmd after pmd_mkinvalid().

I've posted a fix here:
https://lore.kernel.org/linux-mm/20240425170704.3379492-1-ryan.roberts@arm.com/

My position is that you shouldn't be calling pmd_mkinvalid() on a non-present pmd.

> 
> So I don't think we lose anything if pmd_mkinvalid() skips any bit
> setting when !PTE_VALID. Maybe it even fixes some corner case we never
> hit yet (like pmd_present() on a swap/migration+invalid pmd).
>
Re: [PATCH v1 1/2] arm64/mm: Move PTE_PROT_NONE and PMD_PRESENT_INVALID
Posted by Catalin Marinas 1 year, 9 months ago
On Mon, Apr 29, 2024 at 11:04:53AM +0100, Ryan Roberts wrote:
> On 26/04/2024 15:48, Catalin Marinas wrote:
> > On Thu, Apr 25, 2024 at 11:37:42AM +0100, Ryan Roberts wrote:
> >> Also, IMHO we shouldn't really need to reserve PMD_PRESENT_INVALID for swap
> >> ptes; it would be cleaner to have one bit that defines "present" when valid is
> >> clear (similar to PTE_PROT_NONE today) then another bit which is only defined
> >> when "present && !valid" which tells us if this is PTE_PROT_NONE or
> >> PMD_PRESENT_INVALID (I don't think you can ever have both at the same time?).
> > 
> > I think this makes sense, maybe rename the above to PTE_PRESENT_INVALID
> > and use it for both ptes and pmds.
> 
> Yep, sounds good. I've already got a patch to do this, but it's exposed a bug in
> core-mm so will now fix that before I can validate my change. see
> https://lore.kernel.org/linux-arm-kernel/ZiuyGXt0XWwRgFh9@x1n/
> 
> With this in place, I'm proposing to remove PTE_PROT_NONE entirely and instead
> represent PROT_NONE as a present but invalid pte (PTE_VALID=0, PTE_INVALID=1)
> with both PTE_WRITE=0 and PTE_RDONLY=0.
> 
> While the HW would interpret PTE_WRITE=0/PTE_RDONLY=0 as "RW without dirty bit
> modification", this is not a problem as the pte is invalid, so the HW doesn't
> interpret it. And SW always uses the PTE_WRITE bit to interpret the writability
> of the pte. So PTE_WRITE=0/PTE_RDONLY=0 was previously an unused combination
> that we now repurpose for PROT_NONE.

Why not just keep the bits currently in PAGE_NONE (PTE_RDONLY would be
set) and check PTE_USER|PTE_UXN == 0b01, which is a unique combination
for PAGE_NONE (bar the kernel mappings)?

For ptes, it doesn't matter, we can assume that PTE_PRESENT_INVALID
means pte_protnone(). For pmds, however, we can end up with
pmd_protnone(pmd_mkinvalid(pmd)) == true for any of the PAGE_*
permissions encoded into a valid pmd. That's where a dedicated
PTE_PROT_NONE bit helped.

Let's say a CPU starts splitting a pmd and does a pmdp_invalidate*()
first to set PTE_PRESENT_INVALID. A different CPU gets a fault and, since
the pmd is present, it goes and checks pmd_protnone(), which returns
true, ending up on the do_huge_pmd_numa_page() path. Maybe some locks
help, but it looks fragile to rely on them.

So I think for protnone we need to check some other bits (like USER and
UXN) in addition to PTE_PRESENT_INVALID.
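
Something along these lines (a sketch, assuming the pte_present_invalid()
helper discussed earlier in the thread):

static inline int pte_protnone(pte_t pte)
{
	/*
	 * PTE_USER|PTE_UXN == 0b01 (user bit clear, UXN set) is unique to
	 * PAGE_NONE among user mappings, so an invalidated valid pmd can
	 * no longer be mistaken for protnone.
	 */
	return pte_present_invalid(pte) && !pte_user(pte) && !pte_user_exec(pte);
}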

> This will subtly change behaviour in an edge case though. Imagine:
> 
> pte_t pte;
> 
> pte = pte_modify(pte, PAGE_NONE);
> pte = pte_mkwrite_novma(pte);
> WARN_ON(pte_protnone(pte));
> 
> Should that warning fire or not? Previously, because we had a dedicated bit for
> PTE_PROT_NONE it would fire. With my proposed change it will not fire. To me
> it's more intuitive if it doesn't fire. Regardless there is no core code that
> ever does this. Once you have a protnone pte, it's terminal - nothing ever
> modifies it with these helpers AFAICS.

I don't think any core code should try to make a PAGE_NONE pte
writeable.

> Personally I think this is a nice tidy up that saves a SW bit in both present
> and swap ptes. What do you think? (I'll just post the series if it's easier to
> provide feedback in that context).

It would be nice to tidy this up and get rid of PTE_PROT_NONE as long as
it doesn't affect the pmd case I mentioned above.

> >> But there is a problem with this: __split_huge_pmd_locked() calls
> >> pmdp_invalidate() for a pmd before it determines that it is pmd_present(). So
> >> the PMD_PRESENT_INVALID can be set in a swap pte today. That feels wrong to me,
> >> but I was trying to avoid the whole thing unravelling so didn't pursue it.
> > 
> > Maybe what's wrong is the arm64 implementation setting this bit on a
> > swap/migration pmd (though we could handle this in the core code as
> > well, it depends what the other architectures do). The only check for
> > the PMD_PRESENT_INVALID bit is in the arm64 code and it can be absorbed
> > into the pmd_present() check. I think it is currently broken as
> > pmd_present() can return true for a swap pmd after pmd_mkinvalid().
> 
> I've posted a fix here:
> https://lore.kernel.org/linux-mm/20240425170704.3379492-1-ryan.roberts@arm.com/
> 
> My position is that you shouldn't be calling pmd_mkinvalid() on a non-present pmd.

I agree, thanks.

-- 
Catalin
Re: [PATCH v1 1/2] arm64/mm: Move PTE_PROT_NONE and PMD_PRESENT_INVALID
Posted by Ryan Roberts 1 year, 9 months ago
On 29/04/2024 13:38, Catalin Marinas wrote:
> On Mon, Apr 29, 2024 at 11:04:53AM +0100, Ryan Roberts wrote:
>> On 26/04/2024 15:48, Catalin Marinas wrote:
>>> On Thu, Apr 25, 2024 at 11:37:42AM +0100, Ryan Roberts wrote:
>>>> Also, IMHO we shouldn't really need to reserve PMD_PRESENT_INVALID for swap
>>>> ptes; it would be cleaner to have one bit that defines "present" when valid is
>>>> clear (similar to PTE_PROT_NONE today) then another bit which is only defined
>>>> when "present && !valid" which tells us if this is PTE_PROT_NONE or
>>>> PMD_PRESENT_INVALID (I don't think you can ever have both at the same time?).
>>>
>>> I think this makes sense, maybe rename the above to PTE_PRESENT_INVALID
>>> and use it for both ptes and pmds.
>>
>> Yep, sounds good. I've already got a patch to do this, but it's exposed a bug in
>> core-mm so will now fix that before I can validate my change. see
>> https://lore.kernel.org/linux-arm-kernel/ZiuyGXt0XWwRgFh9@x1n/
>>
>> With this in place, I'm proposing to remove PTE_PROT_NONE entirely and instead
>> represent PROT_NONE as a present but invalid pte (PTE_VALID=0, PTE_INVALID=1)
>> with both PTE_WRITE=0 and PTE_RDONLY=0.
>>
>> While the HW would interpret PTE_WRITE=0/PTE_RDONLY=0 as "RW without dirty bit
>> modification", this is not a problem as the pte is invalid, so the HW doesn't
>> interpret it. And SW always uses the PTE_WRITE bit to interpret the writability
>> of the pte. So PTE_WRITE=0/PTE_RDONLY=0 was previously an unused combination
>> that we now repurpose for PROT_NONE.
> 
> Why not just keep the bits currently in PAGE_NONE (PTE_RDONLY would be
> set) and check PTE_USER|PTE_UXN == 0b01 which is a unique combination
> for PAGE_NONE (bar the kernel mappings).

Yes I guess that works. I personally prefer my proposal because it is more
intuitive; you have an R bit and a W bit, and you encode RO, WR, and NONE. But
if you think reusing the kernel mapping check (PTE_USER|PTE_UXN == 0b01) is
preferable, then I'll go with that.

> 
> For ptes, it doesn't matter, we can assume that PTE_PRESENT_INVALID
> means pte_protnone(). For pmds, however, we can end up with
> pmd_protnone(pmd_mkinvalid(pmd)) == true for any of the PAGE_*
> permissions encoded into a valid pmd. That's where a dedicated
> PTE_PROT_NONE bit helped.

Yes agreed.

> 
> Let's say a CPU starts splitting a pmd and does a pmdp_invalidate*()
> first to set PTE_PRESENT_INVALID. A different CPU gets a fault and since
> the pmd is present, it goes and checks pmd_protnone() which returns
> true, ending up on do_huge_pmd_numa_page() path. Maybe some locks help
> but it looks fragile to rely on them.
> 
> So I think for protnone we need to check some other bits (like USER and
> UXN) in addition to PTE_PRESENT_INVALID.

Yes 100% agree. But using PTE_WRITE|PTE_RDONLY==0b00 is just as valid for that
purpose, I think?

> 
>> This will subtly change behaviour in an edge case though. Imagine:
>>
>> pte_t pte;
>>
>> pte = pte_modify(pte, PAGE_NONE);
>> pte = pte_mkwrite_novma(pte);
>> WARN_ON(pte_protnone(pte));
>>
>> Should that warning fire or not? Previously, because we had a dedicated bit for
>> PTE_PROT_NONE it would fire. With my proposed change it will not fire. To me
>> it's more intuitive if it doesn't fire. Regardless there is no core code that
>> ever does this. Once you have a protnone pte, it's terminal - nothing ever
>> modifies it with these helpers AFAICS.
> 
> I don't think any core code should try to make a PAGE_NONE pte
> writeable.

I looked at some other arches; some (at least alpha and hexagon) will not fire
this warning because they have R and W bits and 0b00 means NONE. Others (x86)
will fire it because they have an explicit NONE bit and don't remove it on
permission change. So I conclude it's UB and fine to do either.

> 
>> Personally I think this is a nice tidy up that saves a SW bit in both present
>> and swap ptes. What do you think? (I'll just post the series if it's easier to
>> provide feedback in that context).
> 
> It would be nice to tidy this up and get rid of PTE_PROT_NONE as long as
> it doesn't affect the pmd case I mentioned above.
> 
>>>> But there is a problem with this: __split_huge_pmd_locked() calls
>>>> pmdp_invalidate() for a pmd before it determines that it is pmd_present(). So
>>>> the PMD_PRESENT_INVALID can be set in a swap pte today. That feels wrong to me,
>>>> but I was trying to avoid the whole thing unravelling so didn't pursue it.
>>>
>>> Maybe what's wrong is the arm64 implementation setting this bit on a
>>> swap/migration pmd (though we could handle this in the core code as
>>> well, it depends what the other architectures do). The only check for
>>> the PMD_PRESENT_INVALID bit is in the arm64 code and it can be absorbed
>>> into the pmd_present() check. I think it is currently broken as
>>> pmd_present() can return true for a swap pmd after pmd_mkinvalid().
>>
>> I've posted a fix here:
>> https://lore.kernel.org/linux-mm/20240425170704.3379492-1-ryan.roberts@arm.com/
>>
>> My position is that you shouldn't be calling pmd_mkinvalid() on a non-present pmd.
> 
> I agree, thanks.
>
Re: [PATCH v1 1/2] arm64/mm: Move PTE_PROT_NONE and PMD_PRESENT_INVALID
Posted by Ryan Roberts 1 year, 9 months ago
On 29/04/2024 14:01, Ryan Roberts wrote:
> On 29/04/2024 13:38, Catalin Marinas wrote:
>> On Mon, Apr 29, 2024 at 11:04:53AM +0100, Ryan Roberts wrote:
>>> On 26/04/2024 15:48, Catalin Marinas wrote:
>>>> On Thu, Apr 25, 2024 at 11:37:42AM +0100, Ryan Roberts wrote:
>>>>> Also, IMHO we shouldn't really need to reserve PMD_PRESENT_INVALID for swap
>>>>> ptes; it would be cleaner to have one bit that defines "present" when valid is
>>>>> clear (similar to PTE_PROT_NONE today) then another bit which is only defined
>>>>> when "present && !valid" which tells us if this is PTE_PROT_NONE or
>>>>> PMD_PRESENT_INVALID (I don't think you can ever have both at the same time?).
>>>>
>>>> I think this makes sense, maybe rename the above to PTE_PRESENT_INVALID
>>>> and use it for both ptes and pmds.
>>>
>>> Yep, sounds good. I've already got a patch to do this, but it's exposed a bug in
>>> core-mm so will now fix that before I can validate my change. see
>>> https://lore.kernel.org/linux-arm-kernel/ZiuyGXt0XWwRgFh9@x1n/
>>>
>>> With this in place, I'm proposing to remove PTE_PROT_NONE entirely and instead
>>> represent PROT_NONE as a present but invalid pte (PTE_VALID=0, PTE_INVALID=1)
>>> with both PTE_WRITE=0 and PTE_RDONLY=0.
>>>
>>> While the HW would interpret PTE_WRITE=0/PTE_RDONLY=0 as "RW without dirty bit
>>> modification", this is not a problem as the pte is invalid, so the HW doesn't
>>> interpret it. And SW always uses the PTE_WRITE bit to interpret the writability
>>> of the pte. So PTE_WRITE=0/PTE_RDONLY=0 was previously an unused combination
>>> that we now repurpose for PROT_NONE.
>>
>> Why not just keep the bits currently in PAGE_NONE (PTE_RDONLY would be
>> set) and check PTE_USER|PTE_UXN == 0b01 which is a unique combination
>> for PAGE_NONE (bar the kernel mappings).
> 
> Yes I guess that works. I personally prefer my proposal because it is more
> intuitive; you have an R bit and a W bit, and you encode RO, WR, and NONE. But
> if you think reusing the kernel mapping check (PTE_USER|PTE_UXN == 0b01) is
> preferable, then I'll go with that.

Ignore this - I looked at your proposed approach and agree it's better. I'll use
`PTE_USER|PTE_UXN==0b01`. Posting shortly...

> 
>>
>> For ptes, it doesn't matter, we can assume that PTE_PRESENT_INVALID
>> means pte_protnone(). For pmds, however, we can end up with
>> pmd_protnone(pmd_mkinvalid(pmd)) == true for any of the PAGE_*
>> permissions encoded into a valid pmd. That's where a dedicated
>> PTE_PROT_NONE bit helped.
> 
> Yes agreed.
> 
>>
>> Let's say a CPU starts splitting a pmd and does a pmdp_invalidate*()
>> first to set PTE_PRESENT_INVALID. A different CPU gets a fault and since
>> the pmd is present, it goes and checks pmd_protnone() which returns
>> true, ending up on do_huge_pmd_numa_page() path. Maybe some locks help
>> but it looks fragile to rely on them.
>>
>> So I think for protnone we need to check some other bits (like USER and
>> UXN) in addition to PTE_PRESENT_INVALID.
> 
> Yes 100% agree. But using PTE_WRITE|PTE_RDONLY==0b00 is just as valid for that
> purpose, I think?
> 
>>
>>> This will subtly change behaviour in an edge case though. Imagine:
>>>
>>> pte_t pte;
>>>
>>> pte = pte_modify(pte, PAGE_NONE);
>>> pte = pte_mkwrite_novma(pte);
>>> WARN_ON(pte_protnone(pte));
>>>
>>> Should that warning fire or not? Previously, because we had a dedicated bit for
>>> PTE_PROT_NONE it would fire. With my proposed change it will not fire. To me
>>> it's more intuitive if it doesn't fire. Regardless there is no core code that
>>> ever does this. Once you have a protnone pte, it's terminal - nothing ever
>>> modifies it with these helpers AFAICS.
>>
>> I don't think any core code should try to make a PAGE_NONE pte
>> writeable.
> 
> I looked at some other arches; some (at least alpha and hexagon) will not fire
> this warning because they have R and W bits and 0b00 means NONE. Others (x86)
> will fire it because they have an explicit NONE bit and don't remove it on
> permission change. So I conclude it's UB and fine to do either.
> 
>>
>>> Personally I think this is a nice tidy up that saves a SW bit in both present
>>> and swap ptes. What do you think? (I'll just post the series if it's easier to
>>> provide feedback in that context).
>>
>> It would be nice to tidy this up and get rid of PTE_PROT_NONE as long as
>> it doesn't affect the pmd case I mentioned above.
>>
>>>>> But there is a problem with this: __split_huge_pmd_locked() calls
>>>>> pmdp_invalidate() for a pmd before it determines that it is pmd_present(). So
>>>>> the PMD_PRESENT_INVALID can be set in a swap pte today. That feels wrong to me,
>>>>> but I was trying to avoid the whole thing unravelling so didn't pursue it.
>>>>
>>>> Maybe what's wrong is the arm64 implementation setting this bit on a
>>>> swap/migration pmd (though we could handle this in the core code as
>>>> well, it depends what the other architectures do). The only check for
>>>> the PMD_PRESENT_INVALID bit is in the arm64 code and it can be absorbed
>>>> into the pmd_present() check. I think it is currently broken as
>>>> pmd_present() can return true for a swap pmd after pmd_mkinvalid().
>>>
>>> I've posted a fix here:
>>> https://lore.kernel.org/linux-mm/20240425170704.3379492-1-ryan.roberts@arm.com/
>>>
>>> My position is that you shouldn't be calling pmd_mkinvalid() on a non-present pmd.
>>
>> I agree, thanks.
>>
>
Re: [PATCH v1 1/2] arm64/mm: Move PTE_PROT_NONE and PMD_PRESENT_INVALID
Posted by Catalin Marinas 1 year, 9 months ago
On Mon, Apr 29, 2024 at 02:23:35PM +0100, Ryan Roberts wrote:
> On 29/04/2024 14:01, Ryan Roberts wrote:
> > On 29/04/2024 13:38, Catalin Marinas wrote:
> >> On Mon, Apr 29, 2024 at 11:04:53AM +0100, Ryan Roberts wrote:
> >>> On 26/04/2024 15:48, Catalin Marinas wrote:
> >>>> On Thu, Apr 25, 2024 at 11:37:42AM +0100, Ryan Roberts wrote:
> >>>>> Also, IMHO we shouldn't really need to reserve PMD_PRESENT_INVALID for swap
> >>>>> ptes; it would be cleaner to have one bit that defines "present" when valid is
> >>>>> clear (similar to PTE_PROT_NONE today) then another bit which is only defined
> >>>>> when "present && !valid" which tells us if this is PTE_PROT_NONE or
> >>>>> PMD_PRESENT_INVALID (I don't think you can ever have both at the same time?).
> >>>>
> >>>> I think this makes sense, maybe rename the above to PTE_PRESENT_INVALID
> >>>> and use it for both ptes and pmds.
> >>>
> >>> Yep, sounds good. I've already got a patch to do this, but it's exposed a bug in
> >>> core-mm so will now fix that before I can validate my change. see
> >>> https://lore.kernel.org/linux-arm-kernel/ZiuyGXt0XWwRgFh9@x1n/
> >>>
> >>> With this in place, I'm proposing to remove PTE_PROT_NONE entirely and instead
> >>> represent PROT_NONE as a present but invalid pte (PTE_VALID=0, PTE_INVALID=1)
> >>> with both PTE_WRITE=0 and PTE_RDONLY=0.
> >>>
> >>> While the HW would interpret PTE_WRITE=0/PTE_RDONLY=0 as "RW without dirty bit
> >>> modification", this is not a problem as the pte is invalid, so the HW doesn't
> >>> interpret it. And SW always uses the PTE_WRITE bit to interpret the writability
> >>> of the pte. So PTE_WRITE=0/PTE_RDONLY=0 was previously an unused combination
> >>> that we now repurpose for PROT_NONE.
> >>
> >> Why not just keep the bits currently in PAGE_NONE (PTE_RDONLY would be
> >> set) and check PTE_USER|PTE_UXN == 0b01 which is a unique combination
> >> for PAGE_NONE (bar the kernel mappings).
> > 
> > Yes I guess that works. I personally prefer my proposal because it is more
> > intuitive; you have an R bit and a W bit, and you encode RO, WR, and NONE. But
> > if you think reusing the kernel mapping check (PTE_USER|PTE_UXN == 0b01) is
> > preferable, then I'll go with that.
> 
> Ignore this - I looked at your proposed approach and agree it's better. I'll use
> `PTE_USER|PTE_UXN==0b01`. Posting shortly...

You nearly convinced me until I read your second reply ;). The
PTE_WRITE|PTE_RDONLY == 0b00 scheme still has the mkwrite problem, if we
care about it (I don't think it can happen though).

-- 
Catalin
Re: [PATCH v1 1/2] arm64/mm: Move PTE_PROT_NONE and PMD_PRESENT_INVALID
Posted by Ryan Roberts 1 year, 9 months ago
On 29/04/2024 15:18, Catalin Marinas wrote:
> On Mon, Apr 29, 2024 at 02:23:35PM +0100, Ryan Roberts wrote:
>> On 29/04/2024 14:01, Ryan Roberts wrote:
>>> On 29/04/2024 13:38, Catalin Marinas wrote:
>>>> On Mon, Apr 29, 2024 at 11:04:53AM +0100, Ryan Roberts wrote:
>>>>> On 26/04/2024 15:48, Catalin Marinas wrote:
>>>>>> On Thu, Apr 25, 2024 at 11:37:42AM +0100, Ryan Roberts wrote:
>>>>>>> Also, IMHO we shouldn't really need to reserve PMD_PRESENT_INVALID for swap
>>>>>>> ptes; it would be cleaner to have one bit that defines "present" when valid is
>>>>>>> clear (similar to PTE_PROT_NONE today) then another bit which is only defined
>>>>>>> when "present && !valid" which tells us if this is PTE_PROT_NONE or
>>>>>>> PMD_PRESENT_INVALID (I don't think you can ever have both at the same time?).
>>>>>>
>>>>>> I think this makes sense, maybe rename the above to PTE_PRESENT_INVALID
>>>>>> and use it for both ptes and pmds.
>>>>>
>>>>> Yep, sounds good. I've already got a patch to do this, but it's exposed a bug in
>>>>> core-mm so will now fix that before I can validate my change. see
>>>>> https://lore.kernel.org/linux-arm-kernel/ZiuyGXt0XWwRgFh9@x1n/
>>>>>
>>>>> With this in place, I'm proposing to remove PTE_PROT_NONE entirely and instead
>>>>> represent PROT_NONE as a present but invalid pte (PTE_VALID=0, PTE_INVALID=1)
>>>>> with both PTE_WRITE=0 and PTE_RDONLY=0.
>>>>>
>>>>> While the HW would interpret PTE_WRITE=0/PTE_RDONLY=0 as "RW without dirty bit
>>>>> modification", this is not a problem as the pte is invalid, so the HW doesn't
>>>>> interpret it. And SW always uses the PTE_WRITE bit to interpret the writability
>>>>> of the pte. So PTE_WRITE=0/PTE_RDONLY=0 was previously an unused combination
>>>>> that we now repurpose for PROT_NONE.
>>>>
>>>> Why not just keep the bits currently in PAGE_NONE (PTE_RDONLY would be
>>>> set) and check PTE_USER|PTE_UXN == 0b01 which is a unique combination
>>>> for PAGE_NONE (bar the kernel mappings).
>>>
>>> Yes I guess that works. I personally prefer my proposal because it is more
>>> intuitive; you have an R bit and a W bit, and you encode RO, WR, and NONE. But
>>> if you think reusing the kernel mapping check (PTE_USER|PTE_UXN == 0b01) is
>>> preferable, then I'll go with that.
>>
>> Ignore this - I looked at your proposed approach and agree it's better. I'll use
>> `PTE_USER|PTE_UXN==0b01`. Posting shortly...
> 
> You nearly convinced me until I read your second reply ;). The
> PTE_WRITE|PTE_RDONLY == 0b00 still has the mkwrite problem if we care
> about (I don't think it can happen though).

Yes, just to clearly enumerate the reasons I prefer your approach:

 - PTE_RDONLY is also used for the HW dirty bit. I had to add a conditional to
   pte_mkclean() for my scheme to prevent pte_mkclean() on a PROT_NONE pte
   erroneously making it RO (see the sketch below). No such problem with your
   scheme.

 - With my scheme, we have the mkwrite problem, as you call it. Although, as I
   said, some arches already have this semantic, so I don't think it's a
   problem. But with your scheme we keep the existing arm64 semantics, so it
   reduces the risk of a problem in a corner I overlooked.
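
A hypothetical sketch of that pte_mkclean() conditional under the rejected
scheme (PROT_NONE encoded as PTE_WRITE=0/PTE_RDONLY=0):

static inline pte_t pte_mkclean(pte_t pte)
{
	pte = clear_pte_bit(pte, __pgprot(PTE_DIRTY));
	/* Setting PTE_RDONLY (HW clean) would turn PROT_NONE into RO. */
	if (!pte_protnone(pte))
		pte = set_pte_bit(pte, __pgprot(PTE_RDONLY));

	return pte;
}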

Anyway, I've posted the v2. Take a look when you get time - perhaps we can get
it into v6.10?
Re: [PATCH v1 1/2] arm64/mm: Move PTE_PROT_NONE and PMD_PRESENT_INVALID
Posted by Catalin Marinas 1 year, 9 months ago
On Wed, Apr 24, 2024 at 12:10:16PM +0100, Ryan Roberts wrote:
> Previously PTE_PROT_NONE was occupying bit 58, one of the bits reserved
> for SW use when the PTE is valid. This is a waste of those precious SW
> bits since PTE_PROT_NONE can only ever be set when valid is clear.
> Instead let's overlay it on what would be a HW bit if valid was set.
> 
> We need to be careful about which HW bit to choose since some of them
> must be preserved; when pte_present() is true (as it is for a
> PTE_PROT_NONE pte), it is legitimate for the core to call various
> accessors, e.g. pte_dirty(), pte_write() etc. There are also some
> accessors that are private to the arch which must continue to be
> honoured, e.g. pte_user(), pte_user_exec() etc.
> 
> So we choose to overlay PTE_UXN; This effectively means that whenever a
> pte has PTE_PROT_NONE set, it will always report pte_user_exec() ==
> false, which is obviously always correct.
> 
> As a result of this change, we must shuffle the layout of the
> arch-specific swap pte so that PTE_PROT_NONE is always zero and not
> overlapping with any other field. As a result of this, there is no way
> to keep the `type` field contiguous without conflicting with
> PMD_PRESENT_INVALID (bit 59), which must also be 0 for a swap pte. So
> let's move PMD_PRESENT_INVALID to bit 60.

I think we discussed but forgot the details. What was the reason for not
using, say, bit 60 for PTE_PROT_NONE to avoid all the swap bits
reshuffling? Clearing or setting of the PTE_PROT_NONE bit is done via
pte_modify() and this gets all the new permission bits anyway. With POE
support (on the list for now), PTE_PROT_NONE would overlap with
POIndex[0] but I don't think we ever plan to read this field (other than
maybe ptdump). The POIndex field is set from the vma->vm_page_prot (Joey
may need to adjust vm_get_page_prot() in his patches to avoid setting a
pkey on a PROT_NONE mapping).

-- 
Catalin
Re: [PATCH v1 1/2] arm64/mm: Move PTE_PROT_NONE and PMD_PRESENT_INVALID
Posted by Ryan Roberts 1 year, 9 months ago
On 24/04/2024 17:43, Catalin Marinas wrote:
> On Wed, Apr 24, 2024 at 12:10:16PM +0100, Ryan Roberts wrote:
>> Previously PTE_PROT_NONE was occupying bit 58, one of the bits reserved
>> for SW use when the PTE is valid. This is a waste of those precious SW
>> bits since PTE_PROT_NONE can only ever be set when valid is clear.
>> Instead let's overlay it on what would be a HW bit if valid was set.
>>
>> We need to be careful about which HW bit to choose since some of them
>> must be preserved; when pte_present() is true (as it is for a
>> PTE_PROT_NONE pte), it is legitimate for the core to call various
>> accessors, e.g. pte_dirty(), pte_write() etc. There are also some
>> accessors that are private to the arch which must continue to be
>> honoured, e.g. pte_user(), pte_user_exec() etc.
>>
>> So we choose to overlay PTE_UXN; This effectively means that whenever a
>> pte has PTE_PROT_NONE set, it will always report pte_user_exec() ==
>> false, which is obviously always correct.
>>
>> As a result of this change, we must shuffle the layout of the
>> arch-specific swap pte so that PTE_PROT_NONE is always zero and not
>> overlapping with any other field. As a result of this, there is no way
>> to keep the `type` field contiguous without conflicting with
>> PMD_PRESENT_INVALID (bit 59), which must also be 0 for a swap pte. So
>> let's move PMD_PRESENT_INVALID to bit 60.
> 
> I think we discussed but forgot the details. What was the reason for not
> using, say, bit 60 for PTE_PROT_NONE to avoid all the swap bits
> reshuffling? Clearing or setting of the PTE_PROT_NONE bit is done via
> pte_modify() and this gets all the new permission bits anyway. With POE
> support (on the list for now), PTE_PROT_NONE would overlap with
> POIndex[0] but I don't think we ever plan to read this field (other than
> maybe ptdump). The POIndex field is set from the vma->vm_page_prot (Joey
> may need to adjust vm_get_page_prot() in his patches to avoid setting a
> pkey on a PROT_NONE mapping).
> 

Copy/pasting your comment from the other patch into this one since it's easier
to discuss it all together:

  Ah, I did not realise we need to free up bit 3 from the swap pte as
  well. Though maybe patch 1 is fine as is but for the record, it would be
  good to justify the decision to go with PTE_UXN.

While we need a new bit in the swap pte for uffd-wp, it's just a SW bit - it
could go anywhere. I chose bit 3 because it was free after all the other
shuffling.

As for choosing PTE_UXN for PTE_PROT_NONE, I wanted to choose a bit that would
definitely never lead to confusion if ever interpreted as its HW meaning, since
as far as the core-mm is concerned, the pte is either present or it's not, and
if it is present, then it is completely valid to call all the pte_*() helpers.
By definition, if PTE_PROT_NONE is set, then the PTE is not executable in user
space, so any helpers that continue to interpret the bit position as UXN will
still give sensible answers.
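
Concretely (roughly the existing arm64 helper, quoted for illustration):

#define pte_user_exec(pte)	(!(pte_val(pte) & PTE_UXN))

/* With PTE_PROT_NONE == PTE_UXN, any protnone pte reports
 * pte_user_exec() == false, which is always the right answer. */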

Yes, I could have just put PTE_PROT_NONE in bit position 60 and avoided all the
shuffling. But in the past you have pushed back on using the PBHA bits due to
out of tree patches using them. I thought it was better to just sidestep having
to think about it by not using them. Additionally, as you point out, I didn't
want to risk overlapping with the POIndex and that causing subtle bugs.

But then... PMD_PRESENT_INVALID, which already turns out to violate my above
considerations. Ugh. I considered moving that to NS, but it would have required
splitting the offset field into 2 discontiguous parts in the swap pte. In the
end, I decided it's ok in any position because it's transient; it's just a temp
marker and the pte will soon get set again from scratch, so it doesn't matter
if adding the marker is destructive.

Personally I think there is less risk of a future/subtle bug by putting
PTE_PROT_NONE over PTE_UXN. But if you prefer to reduce churn by putting it at
bit 60 then I'm ok with that.