[v2] xen/riscv: introduce p2m functionality

[PATCH v2 15/17] xen/riscv: Implement superpage splitting for p2m mappings

Posted by Oleksii Kurochko 8 months ago

Add support for down large memory mappings ("superpages") in the RISC-V
p2m mapping so that smaller, more precise mappings ("finer-grained entries")
can be inserted into lower levels of the page table hierarchy.

To implement that the following is done:
- Introduce p2m_split_superpage(): Recursively shatters a superpage into
  smaller page table entries down to the target level, preserving original
  permissions and attributes.
- __p2m_set_entry() updated to invoke superpage splitting when inserting
  entries at lower levels within a superpage-mapped region.

This implementation is based on the ARM code, with modifications to the part
that follows the BBM (break-before-make) approach. Unlike ARM, RISC-V does
not require BBM, so there is no need to invalidate the PTE and flush the
TLB before updating it with the newly created, split page table.
Additionally, the page table walk logic has been adjusted, as ARM uses the
opposite walk order compared to RISC-V.

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V2:
 - New patch. It was a part of a big patch "xen/riscv: implement p2m mapping
   functionality" which was splitted to smaller.
 - Update the commit above the cycle which creates new page table as
   RISC-V travserse page tables in an opposite to ARM order.
 - RISC-V doesn't require BBM so there is no needed for invalidating
   and TLB flushing before updating PTE.
---
 xen/arch/riscv/p2m.c | 102 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 101 insertions(+), 1 deletion(-)

diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index 87dd636b80..79c4473f1f 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -743,6 +743,77 @@ static void p2m_free_entry(struct p2m_domain *p2m,
     p2m_free_page(p2m->domain, pg);
 }
 
+static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
+                                unsigned int level, unsigned int target,
+                                const unsigned int *offsets)
+{
+    struct page_info *page;
+    unsigned int i;
+    pte_t pte, *table;
+    bool rv = true;
+
+    /* Convenience aliases */
+    mfn_t mfn = pte_get_mfn(*entry);
+    unsigned int next_level = level - 1;
+    unsigned int level_order = XEN_PT_LEVEL_ORDER(next_level);
+
+    /*
+     * This should only be called with target != level and the entry is
+     * a superpage.
+     */
+    ASSERT(level > target);
+    ASSERT(p2me_is_superpage(p2m, *entry, level));
+
+    page = p2m_alloc_page(p2m->domain);
+    if ( !page )
+        return false;
+
+    page_list_add(page, &p2m->pages);
+    table = __map_domain_page(page);
+
+    /*
+     * We are either splitting a second level 1G page into 512 first level
+     * 2M pages, or a first level 2M page into 512 zero level 4K pages.
+     */
+    for ( i = 0; i < XEN_PT_ENTRIES; i++ )
+    {
+        pte_t *new_entry = table + i;
+
+        /*
+         * Use the content of the superpage entry and override
+         * the necessary fields. So the correct permission are kept.
+         */
+        pte = *entry;
+        pte_set_mfn(&pte, mfn_add(mfn, i << level_order));
+
+        write_pte(new_entry, pte);
+    }
+
+    /*
+     * Shatter superpage in the page to the level we want to make the
+     * changes.
+     * This is done outside the loop to avoid checking the offset to
+     * know whether the entry should be shattered for every entry.
+     */
+    if ( next_level != target )
+        rv = p2m_split_superpage(p2m, table + offsets[next_level],
+                                 level - 1, target, offsets);
+
+    /* TODO: why it is necessary to have clean here? Not somewhere in the caller */
+    if ( p2m->clean_pte )
+        clean_dcache_va_range(table, PAGE_SIZE);
+
+    unmap_domain_page(table);
+
+    /*
+     * Even if we failed, we should install the newly allocated PTE
+     * entry. The caller will be in charge to free the sub-tree.
+     */
+    p2m_write_pte(entry, page_to_p2m_table(p2m, page), p2m->clean_pte);
+
+    return rv;
+}
+
 /*
  * Insert an entry in the p2m. This should be called with a mapping
  * equal to a page/superpage.
@@ -806,7 +877,36 @@ static int __p2m_set_entry(struct p2m_domain *p2m,
      */
     if ( level > target )
     {
-        panic("Shattering isn't implemented\n");
+        /* We need to split the original page. */
+        pte_t split_pte = *entry;
+
+        ASSERT(p2me_is_superpage(p2m, *entry, level));
+
+        if ( !p2m_split_superpage(p2m, &split_pte, level, target, offsets) )
+        {
+            /* Free the allocated sub-tree */
+            p2m_free_entry(p2m, split_pte, level);
+
+            rc = -ENOMEM;
+            goto out;
+        }
+
+        p2m_write_pte(entry, split_pte, p2m->clean_pte);
+
+        /* Then move to the level we want to make real changes */
+        for ( ; level < target; level++ )
+        {
+            rc = p2m_next_level(p2m, true, level, &table, offsets[level]);
+
+            /*
+             * The entry should be found and either be a table
+             * or a superpage if level 0 is not targeted
+             */
+            ASSERT(rc == GUEST_TABLE_NORMAL ||
+                   (rc == GUEST_TABLE_SUPER_PAGE && target > 0));
+        }
+
+        entry = table + offsets[level];
     }
 
     /*
-- 
2.49.0

Re: [PATCH v2 15/17] xen/riscv: Implement superpage splitting for p2m mappings

Posted by Jan Beulich 7 months, 1 week ago

On 10.06.2025 15:05, Oleksii Kurochko wrote:
> Add support for down large memory mappings ("superpages") in the RISC-V
> p2m mapping so that smaller, more precise mappings ("finer-grained entries")
> can be inserted into lower levels of the page table hierarchy.
> 
> To implement that the following is done:
> - Introduce p2m_split_superpage(): Recursively shatters a superpage into
>   smaller page table entries down to the target level, preserving original
>   permissions and attributes.
> - __p2m_set_entry() updated to invoke superpage splitting when inserting
>   entries at lower levels within a superpage-mapped region.
> 
> This implementation is based on the ARM code, with modifications to the part
> that follows the BBM (break-before-make) approach. Unlike ARM, RISC-V does
> not require BBM, so there is no need to invalidate the PTE and flush the
> TLB before updating it with the newly created, split page table.

But some flushing is going to be necessary. As long as you only ever do
global flushes, the one after the individual PTE modification (within the
split table) will do (if BBM isn't required, see below), but once you move
to more fine-grained flushing, that's not going to be enough anymore. Not
sure it's a good idea to leave such a pitfall.

As to (no need for) BBM: I couldn't find anything to that effect in the
privileged spec. Can you provide some pointer? What I found instead is e.g.
this sentence: "To ensure that implicit reads observe writes to the same
memory locations, an SFENCE.VMA instruction must be executed after the
writes to flush the relevant cached translations." And this: "Accessing the
same location using different cacheability attributes may cause loss of
coherence." (This may not only occur when the same physical address is
mapped twice at different VAs, but also after the shattering of a superpage
when the new entry differs in cacheability.)

> Additionally, the page table walk logic has been adjusted, as ARM uses the
> opposite walk order compared to RISC-V.

I think you used some similar wording already in an earlier patch. I find
this confusing: Walk order is, aiui, the same. It's merely the numbering
of levels that is the opposite way round, isn't it?

> Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
> ---
> Changes in V2:
>  - New patch. It was a part of a big patch "xen/riscv: implement p2m mapping
>    functionality" which was splitted to smaller.
>  - Update the commit above the cycle which creates new page table as
>    RISC-V travserse page tables in an opposite to ARM order.
>  - RISC-V doesn't require BBM so there is no needed for invalidating
>    and TLB flushing before updating PTE.
> ---
>  xen/arch/riscv/p2m.c | 102 ++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 101 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
> index 87dd636b80..79c4473f1f 100644
> --- a/xen/arch/riscv/p2m.c
> +++ b/xen/arch/riscv/p2m.c
> @@ -743,6 +743,77 @@ static void p2m_free_entry(struct p2m_domain *p2m,
>      p2m_free_page(p2m->domain, pg);
>  }
>  
> +static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
> +                                unsigned int level, unsigned int target,
> +                                const unsigned int *offsets)
> +{
> +    struct page_info *page;
> +    unsigned int i;
> +    pte_t pte, *table;
> +    bool rv = true;
> +
> +    /* Convenience aliases */
> +    mfn_t mfn = pte_get_mfn(*entry);
> +    unsigned int next_level = level - 1;
> +    unsigned int level_order = XEN_PT_LEVEL_ORDER(next_level);
> +
> +    /*
> +     * This should only be called with target != level and the entry is
> +     * a superpage.
> +     */
> +    ASSERT(level > target);
> +    ASSERT(p2me_is_superpage(p2m, *entry, level));
> +
> +    page = p2m_alloc_page(p2m->domain);
> +    if ( !page )
> +        return false;
> +
> +    page_list_add(page, &p2m->pages);

Is there a reason this list maintenance isn't done in p2m_alloc_page()?

> +    table = __map_domain_page(page);
> +
> +    /*
> +     * We are either splitting a second level 1G page into 512 first level
> +     * 2M pages, or a first level 2M page into 512 zero level 4K pages.
> +     */
> +    for ( i = 0; i < XEN_PT_ENTRIES; i++ )
> +    {
> +        pte_t *new_entry = table + i;
> +
> +        /*
> +         * Use the content of the superpage entry and override
> +         * the necessary fields. So the correct permission are kept.
> +         */
> +        pte = *entry;
> +        pte_set_mfn(&pte, mfn_add(mfn, i << level_order));

While okay as long as you only permit superpages up to 1G, this is another
trap for someone to fall into: Imo i would better be unsigned long right
away, considering that RISC-V permits large pages at all levels.

> +        write_pte(new_entry, pte);
> +    }
> +
> +    /*
> +     * Shatter superpage in the page to the level we want to make the
> +     * changes.
> +     * This is done outside the loop to avoid checking the offset to
> +     * know whether the entry should be shattered for every entry.
> +     */
> +    if ( next_level != target )
> +        rv = p2m_split_superpage(p2m, table + offsets[next_level],
> +                                 level - 1, target, offsets);

I don't understand the comment: Under what conditions would every entry
need (further) shattering? And where's that happening? Or is this merely
a word ordering issue in the sentence, and "for every entry" wants
moving ahead? (In that case I'm unconvinced this is in need of commenting
upon.)

> +    /* TODO: why it is necessary to have clean here? Not somewhere in the caller */
> +    if ( p2m->clean_pte )
> +        clean_dcache_va_range(table, PAGE_SIZE);
> +
> +    unmap_domain_page(table);

Again likely not something that wants taking care of right away, but there
again is an inefficiency here: The caller almost certainly wants to map
the same page again, to update the one entry that caused the request to
shatter the page.

> +    /*
> +     * Even if we failed, we should install the newly allocated PTE
> +     * entry. The caller will be in charge to free the sub-tree.
> +     */
> +    p2m_write_pte(entry, page_to_p2m_table(p2m, page), p2m->clean_pte);

Why would it be wrong to free the page right here, vacating the entry at
the same time (or leaving just that to the caller)? (IOW - if this is an
implementation decision of yours, I think the word "should" would want
dropping.) After all, the caller invoking p2m_free_entry() on the thus
split PTE is less efficient (needs to iterate over all entries) than on
the original one (where it's just a single superpage).

> @@ -806,7 +877,36 @@ static int __p2m_set_entry(struct p2m_domain *p2m,
>       */
>      if ( level > target )

This condition is likely too strong, unless you actually mean to also
split a superpage if it really wouldn't need splitting (new entry written
still fitting with the superpage mapping, i.e. suitable MFN and same
attributes).

>      {
> -        panic("Shattering isn't implemented\n");
> +        /* We need to split the original page. */
> +        pte_t split_pte = *entry;
> +
> +        ASSERT(p2me_is_superpage(p2m, *entry, level));
> +
> +        if ( !p2m_split_superpage(p2m, &split_pte, level, target, offsets) )
> +        {
> +            /* Free the allocated sub-tree */
> +            p2m_free_entry(p2m, split_pte, level);
> +
> +            rc = -ENOMEM;
> +            goto out;
> +        }
> +
> +        p2m_write_pte(entry, split_pte, p2m->clean_pte);
> +
> +        /* Then move to the level we want to make real changes */
> +        for ( ; level < target; level++ )

Don't you mean to move downwards here? At which point I wonder: Did you test
this code?

> +        {
> +            rc = p2m_next_level(p2m, true, level, &table, offsets[level]);
> +
> +            /*
> +             * The entry should be found and either be a table
> +             * or a superpage if level 0 is not targeted
> +             */
> +            ASSERT(rc == GUEST_TABLE_NORMAL ||
> +                   (rc == GUEST_TABLE_SUPER_PAGE && target > 0));
> +        }

This, too, is inefficient (but likely good enough as a starting point): You walk
tables twice - first when splitting, and then again when finding the target level.

Considering the enclosing if(), this also again is a do/while() candidate.

Jan

Re: [PATCH v2 15/17] xen/riscv: Implement superpage splitting for p2m mappings

Posted by Oleksii Kurochko 6 months, 3 weeks ago

On 7/2/25 11:25 AM, Jan Beulich wrote:
> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>> Add support for down large memory mappings ("superpages") in the RISC-V
>> p2m mapping so that smaller, more precise mappings ("finer-grained entries")
>> can be inserted into lower levels of the page table hierarchy.
>>
>> To implement that the following is done:
>> - Introduce p2m_split_superpage(): Recursively shatters a superpage into
>>    smaller page table entries down to the target level, preserving original
>>    permissions and attributes.
>> - __p2m_set_entry() updated to invoke superpage splitting when inserting
>>    entries at lower levels within a superpage-mapped region.
>>
>> This implementation is based on the ARM code, with modifications to the part
>> that follows the BBM (break-before-make) approach. Unlike ARM, RISC-V does
>> not require BBM, so there is no need to invalidate the PTE and flush the
>> TLB before updating it with the newly created, split page table.
> But some flushing is going to be necessary. As long as you only ever do
> global flushes, the one after the individual PTE modification (within the
> split table) will do (if BBM isn't required, see below), but once you move
> to more fine-grained flushing, that's not going to be enough anymore. Not
> sure it's a good idea to leave such a pitfall.

I think that I don't fully understand what is an issue.

>
> As to (no need for) BBM: I couldn't find anything to that effect in the
> privileged spec. Can you provide some pointer? What I found instead is e.g.
> this sentence: "To ensure that implicit reads observe writes to the same
> memory locations, an SFENCE.VMA instruction must be executed after the
> writes to flush the relevant cached translations." And this: "Accessing the
> same location using different cacheability attributes may cause loss of
> coherence." (This may not only occur when the same physical address is
> mapped twice at different VAs, but also after the shattering of a superpage
> when the new entry differs in cacheability.)

I also couldn't find that RISC-V spec mandates BBM explicitly as Arm does it.

What I meant that on RISC-V can do:
- Write new PTE
- Flush TLB

While on Arm it is almost always needed to do:
- Write zero to PTE
- Flush TLB
- Write new PTE

For example, the common CoW code path where you copy from a read only page to
a new page, then map that new page as writable just works on RISC-V without
extra considerations and on Arm it requires BBM.

It seems to me that BBM is mandated for Arm only because that TLB is shared
among cores, so there is no any guarantee that no prior entry for same VA
remains in TLB. In case of RISC-V's TLB isn't shared and after a flush it is
guaranteed that no prior entry for the same VA remains in the TLB.

But in the same time it could be cases, I guess, where BBM will be needed for
RISC-V too. Even the case of CoW mentioned above will need some kind of BBM,
but nothing that'll the CPU misbehave by doing CoW without BBM on RISC-V.

>
>> Additionally, the page table walk logic has been adjusted, as ARM uses the
>> opposite walk order compared to RISC-V.
> I think you used some similar wording already in an earlier patch. I find
> this confusing: Walk order is, aiui, the same. It's merely the numbering
> of levels that is the opposite way round, isn't it?

Yes, the numbering of levels is different and I counted that as a different
walk order. If it is too confusing, I will reword it and use numbering of levels.

>
>> Signed-off-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
>> ---
>> Changes in V2:
>>   - New patch. It was a part of a big patch "xen/riscv: implement p2m mapping
>>     functionality" which was splitted to smaller.
>>   - Update the commit above the cycle which creates new page table as
>>     RISC-V travserse page tables in an opposite to ARM order.
>>   - RISC-V doesn't require BBM so there is no needed for invalidating
>>     and TLB flushing before updating PTE.
>> ---
>>   xen/arch/riscv/p2m.c | 102 ++++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 101 insertions(+), 1 deletion(-)
>>
>> diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
>> index 87dd636b80..79c4473f1f 100644
>> --- a/xen/arch/riscv/p2m.c
>> +++ b/xen/arch/riscv/p2m.c
>> @@ -743,6 +743,77 @@ static void p2m_free_entry(struct p2m_domain *p2m,
>>       p2m_free_page(p2m->domain, pg);
>>   }
>>   
>> +static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
>> +                                unsigned int level, unsigned int target,
>> +                                const unsigned int *offsets)
>> +{
>> +    struct page_info *page;
>> +    unsigned int i;
>> +    pte_t pte, *table;
>> +    bool rv = true;
>> +
>> +    /* Convenience aliases */
>> +    mfn_t mfn = pte_get_mfn(*entry);
>> +    unsigned int next_level = level - 1;
>> +    unsigned int level_order = XEN_PT_LEVEL_ORDER(next_level);
>> +
>> +    /*
>> +     * This should only be called with target != level and the entry is
>> +     * a superpage.
>> +     */
>> +    ASSERT(level > target);
>> +    ASSERT(p2me_is_superpage(p2m, *entry, level));
>> +
>> +    page = p2m_alloc_page(p2m->domain);
>> +    if ( !page )
>> +        return false;
>> +
>> +    page_list_add(page, &p2m->pages);
> Is there a reason this list maintenance isn't done in p2m_alloc_page()?

No there is no any reason, I will move that inside p2m_alloc_page().

>
>> +    table = __map_domain_page(page);
>> +
>> +    /*
>> +     * We are either splitting a second level 1G page into 512 first level
>> +     * 2M pages, or a first level 2M page into 512 zero level 4K pages.
>> +     */
>> +    for ( i = 0; i < XEN_PT_ENTRIES; i++ )
>> +    {
>> +        pte_t *new_entry = table + i;
>> +
>> +        /*
>> +         * Use the content of the superpage entry and override
>> +         * the necessary fields. So the correct permission are kept.
>> +         */
>> +        pte = *entry;
>> +        pte_set_mfn(&pte, mfn_add(mfn, i << level_order));
> While okay as long as you only permit superpages up to 1G, this is another
> trap for someone to fall into: Imo i would better be unsigned long right
> away, considering that RISC-V permits large pages at all levels.
>
>> +        write_pte(new_entry, pte);
>> +    }
>> +
>> +    /*
>> +     * Shatter superpage in the page to the level we want to make the
>> +     * changes.
>> +     * This is done outside the loop to avoid checking the offset to
>> +     * know whether the entry should be shattered for every entry.
>> +     */
>> +    if ( next_level != target )
>> +        rv = p2m_split_superpage(p2m, table + offsets[next_level],
>> +                                 level - 1, target, offsets);
> I don't understand the comment: Under what conditions would every entry
> need (further) shattering? And where's that happening? Or is this merely
> a word ordering issue in the sentence, and "for every entry" wants
> moving ahead? (In that case I'm unconvinced this is in need of commenting
> upon.)

It is wording question. It should be something like:
+    /*
+     * Shatter superpage in the page to the level we want to make the
+     * changes.
+     * This is done outside the loop to avoid checking the offset for every
+     * entry (of new page table) to know whether the entry should be shattered.
+     */


>
>> +    /* TODO: why it is necessary to have clean here? Not somewhere in the caller */
>> +    if ( p2m->clean_pte )
>> +        clean_dcache_va_range(table, PAGE_SIZE);
>> +
>> +    unmap_domain_page(table);
> Again likely not something that wants taking care of right away, but there
> again is an inefficiency here: The caller almost certainly wants to map
> the same page again, to update the one entry that caused the request to
> shatter the page.

I'll add TODO.

>
>> +    /*
>> +     * Even if we failed, we should install the newly allocated PTE
>> +     * entry. The caller will be in charge to free the sub-tree.
>> +     */
>> +    p2m_write_pte(entry, page_to_p2m_table(p2m, page), p2m->clean_pte);
> Why would it be wrong to free the page right here, vacating the entry at
> the same time (or leaving just that to the caller)? (IOW - if this is an
> implementation decision of yours, I think the word "should" would want
> dropping.) After all, the caller invoking p2m_free_entry() on the thus
> split PTE is less efficient (needs to iterate over all entries) than on
> the original one (where it's just a single superpage).

I think that I didn't get your idea.

>
>> @@ -806,7 +877,36 @@ static int __p2m_set_entry(struct p2m_domain *p2m,
>>        */
>>       if ( level > target )
> This condition is likely too strong, unless you actually mean to also
> split a superpage if it really wouldn't need splitting (new entry written
> still fitting with the superpage mapping, i.e. suitable MFN and same
> attributes).

I am not really sure that I fully understand.
My understanding is if level != target then the splitting is needed, I am
not really get the part "split a superpage if it really wouldn't need splitting".

>
>>       {
>> -        panic("Shattering isn't implemented\n");
>> +        /* We need to split the original page. */
>> +        pte_t split_pte = *entry;
>> +
>> +        ASSERT(p2me_is_superpage(p2m, *entry, level));
>> +
>> +        if ( !p2m_split_superpage(p2m, &split_pte, level, target, offsets) )
>> +        {
>> +            /* Free the allocated sub-tree */
>> +            p2m_free_entry(p2m, split_pte, level);
>> +
>> +            rc = -ENOMEM;
>> +            goto out;
>> +        }
>> +
>> +        p2m_write_pte(entry, split_pte, p2m->clean_pte);
>> +
>> +        /* Then move to the level we want to make real changes */
>> +        for ( ; level < target; level++ )
> Don't you mean to move downwards here? At which point I wonder: Did you test
> this code?

No as the test for this case was missed. I will add one.

>
>> +        {
>> +            rc = p2m_next_level(p2m, true, level, &table, offsets[level]);
>> +
>> +            /*
>> +             * The entry should be found and either be a table
>> +             * or a superpage if level 0 is not targeted
>> +             */
>> +            ASSERT(rc == GUEST_TABLE_NORMAL ||
>> +                   (rc == GUEST_TABLE_SUPER_PAGE && target > 0));
>> +        }
> This, too, is inefficient (but likely good enough as a starting point): You walk
> tables twice - first when splitting, and then again when finding the target level.
>
> Considering the enclosing if(), this also again is a do/while() candidate.

I will add TODO to make that part more efficient. And based on your reply regarding
statement inside if(), I'll likely to use do/while().

Thanks.

~ Oleksii

Re: [PATCH v2 15/17] xen/riscv: Implement superpage splitting for p2m mappings

Posted by Jan Beulich 6 months, 3 weeks ago

On 17.07.2025 18:37, Oleksii Kurochko wrote:
> On 7/2/25 11:25 AM, Jan Beulich wrote:
>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>> Add support for down large memory mappings ("superpages") in the RISC-V
>>> p2m mapping so that smaller, more precise mappings ("finer-grained entries")
>>> can be inserted into lower levels of the page table hierarchy.
>>>
>>> To implement that the following is done:
>>> - Introduce p2m_split_superpage(): Recursively shatters a superpage into
>>>    smaller page table entries down to the target level, preserving original
>>>    permissions and attributes.
>>> - __p2m_set_entry() updated to invoke superpage splitting when inserting
>>>    entries at lower levels within a superpage-mapped region.
>>>
>>> This implementation is based on the ARM code, with modifications to the part
>>> that follows the BBM (break-before-make) approach. Unlike ARM, RISC-V does
>>> not require BBM, so there is no need to invalidate the PTE and flush the
>>> TLB before updating it with the newly created, split page table.
>> But some flushing is going to be necessary. As long as you only ever do
>> global flushes, the one after the individual PTE modification (within the
>> split table) will do (if BBM isn't required, see below), but once you move
>> to more fine-grained flushing, that's not going to be enough anymore. Not
>> sure it's a good idea to leave such a pitfall.
> 
> I think that I don't fully understand what is an issue.

Whether a flush is necessary after solely breaking up a superpage is arch-
defined. I don't know how much restrictions the spec on possible hardware
behavior for RISC-V. However, the eventual change of (at least) one entry
of fulfill the original request will surely require a flush. What I was
trying to say is that this required flush would better not also cover for
the flushes that may or may not be required by the spec. IOW if the spec
leaves any room for flushes to possibly be needed, those flushes would
better be explicit.

>> As to (no need for) BBM: I couldn't find anything to that effect in the
>> privileged spec. Can you provide some pointer? What I found instead is e.g.
>> this sentence: "To ensure that implicit reads observe writes to the same
>> memory locations, an SFENCE.VMA instruction must be executed after the
>> writes to flush the relevant cached translations." And this: "Accessing the
>> same location using different cacheability attributes may cause loss of
>> coherence." (This may not only occur when the same physical address is
>> mapped twice at different VAs, but also after the shattering of a superpage
>> when the new entry differs in cacheability.)
> 
> I also couldn't find that RISC-V spec mandates BBM explicitly as Arm does it.
> 
> What I meant that on RISC-V can do:
> - Write new PTE
> - Flush TLB
> 
> While on Arm it is almost always needed to do:
> - Write zero to PTE
> - Flush TLB
> - Write new PTE
> 
> For example, the common CoW code path where you copy from a read only page to
> a new page, then map that new page as writable just works on RISC-V without
> extra considerations and on Arm it requires BBM.

CoW is a specific sub-case with increasing privilege.

> It seems to me that BBM is mandated for Arm only because that TLB is shared
> among cores, so there is no any guarantee that no prior entry for same VA
> remains in TLB. In case of RISC-V's TLB isn't shared and after a flush it is
> guaranteed that no prior entry for the same VA remains in the TLB.

Not sure that's the sole reason. But again the question is: Is this written
down explicitly anywhere? Generally there can be multiple levels of TLBs, and
while some of them may be per-core, others may be shared.

>>> +    /*
>>> +     * Even if we failed, we should install the newly allocated PTE
>>> +     * entry. The caller will be in charge to free the sub-tree.
>>> +     */
>>> +    p2m_write_pte(entry, page_to_p2m_table(p2m, page), p2m->clean_pte);
>> Why would it be wrong to free the page right here, vacating the entry at
>> the same time (or leaving just that to the caller)? (IOW - if this is an
>> implementation decision of yours, I think the word "should" would want
>> dropping.) After all, the caller invoking p2m_free_entry() on the thus
>> split PTE is less efficient (needs to iterate over all entries) than on
>> the original one (where it's just a single superpage).
> 
> I think that I didn't get your idea.

Well, first and foremost this was a question to you, as it's not clear to
me whether "should" is the correct word to use here. It would be
appropriate if the spec mandated this behavior. It would seem less
appropriate if this arrangement was an implementation choice of yours.
And it looks to me as if the latter was the case here.

>>> @@ -806,7 +877,36 @@ static int __p2m_set_entry(struct p2m_domain *p2m,
>>>        */
>>>       if ( level > target )
>> This condition is likely too strong, unless you actually mean to also
>> split a superpage if it really wouldn't need splitting (new entry written
>> still fitting with the superpage mapping, i.e. suitable MFN and same
>> attributes).
> 
> I am not really sure that I fully understand.
> My understanding is if level != target then the splitting is needed, I am
> not really get the part "split a superpage if it really wouldn't need splitting".

Hmm, maybe I was wrong here. The caller determines at what level the
actual change needs to occur? In which case what you have may be right.

Jan

Re: [PATCH v2 15/17] xen/riscv: Implement superpage splitting for p2m mappings

Posted by Oleksii Kurochko 6 months, 3 weeks ago

On 7/21/25 3:34 PM, Jan Beulich wrote:
> On 17.07.2025 18:37, Oleksii Kurochko wrote:
>> On 7/2/25 11:25 AM, Jan Beulich wrote:
>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>> Add support for down large memory mappings ("superpages") in the RISC-V
>>>> p2m mapping so that smaller, more precise mappings ("finer-grained entries")
>>>> can be inserted into lower levels of the page table hierarchy.
>>>>
>>>> To implement that the following is done:
>>>> - Introduce p2m_split_superpage(): Recursively shatters a superpage into
>>>>     smaller page table entries down to the target level, preserving original
>>>>     permissions and attributes.
>>>> - __p2m_set_entry() updated to invoke superpage splitting when inserting
>>>>     entries at lower levels within a superpage-mapped region.
>>>>
>>>> This implementation is based on the ARM code, with modifications to the part
>>>> that follows the BBM (break-before-make) approach. Unlike ARM, RISC-V does
>>>> not require BBM, so there is no need to invalidate the PTE and flush the
>>>> TLB before updating it with the newly created, split page table.
>>> But some flushing is going to be necessary. As long as you only ever do
>>> global flushes, the one after the individual PTE modification (within the
>>> split table) will do (if BBM isn't required, see below), but once you move
>>> to more fine-grained flushing, that's not going to be enough anymore. Not
>>> sure it's a good idea to leave such a pitfall.
>> I think that I don't fully understand what is an issue.
> Whether a flush is necessary after solely breaking up a superpage is arch-
> defined. I don't know how much restrictions the spec on possible hardware
> behavior for RISC-V. However, the eventual change of (at least) one entry
> of fulfill the original request will surely require a flush. What I was
> trying to say is that this required flush would better not also cover for
> the flushes that may or may not be required by the spec. IOW if the spec
> leaves any room for flushes to possibly be needed, those flushes would
> better be explicit.

I think that I still don't understand why what I wrote above will work as long
as global flushes is working and will stop to work when more fine-grained flushing
is used.

Inside p2m_split_superpage() we don't need any kind of TLB flush operation as
it is allocation a new page for a table and works with it, so no one could use
this page at the moment and cache it in TLB.

The only question is that if it is needed BBM before staring using splitted entry:
         ....
         if ( !p2m_split_superpage(p2m, &split_pte, level, target, offsets) )
         {
             /* Free the allocated sub-tree */
             p2m_free_subtree(p2m, split_pte, level);

             rc = -ENOMEM;
             goto out;
         }

------> /* Should be BBM used here ? */
         p2m_write_pte(entry, split_pte, p2m->clean_pte);

And I can't find anything in the spec what tells me to use BBM here like Arm
does:
         /*
          * Follow the break-before-sequence to update the entry.
          * For more details see (D4.7.1 in ARM DDI 0487A.j).
          */
         p2m_remove_pte(entry, p2m->clean_pte);
         p2m_force_tlb_flush_sync(p2m);

         p2m_write_pte(entry, split_pte, p2m->clean_pte);

I agree that even BBM isn't mandated explicitly sometime it could be useful, but
here I am not really sure what is the point to do TLB flush before p2m_write_pte()
as nothing serious will happen if and old mapping will be used for a some time
as we are keeping the same rights for smaller (splited) mapping.
The reason why Arm does p2m_remove_pte() & p2m_force_tlb_flush() before updating
an entry here as it is doesn't guarantee that everything will be okay and they
explicitly tells:
  This situation can possibly break coherency and ordering guarantees, leading to
  inconsistent unwanted behavior in our Processing Element (PE).


>>> As to (no need for) BBM: I couldn't find anything to that effect in the
>>> privileged spec. Can you provide some pointer? What I found instead is e.g.
>>> this sentence: "To ensure that implicit reads observe writes to the same
>>> memory locations, an SFENCE.VMA instruction must be executed after the
>>> writes to flush the relevant cached translations." And this: "Accessing the
>>> same location using different cacheability attributes may cause loss of
>>> coherence." (This may not only occur when the same physical address is
>>> mapped twice at different VAs, but also after the shattering of a superpage
>>> when the new entry differs in cacheability.)
>> I also couldn't find that RISC-V spec mandates BBM explicitly as Arm does it.
>>
>> What I meant that on RISC-V can do:
>> - Write new PTE
>> - Flush TLB
>>
>> While on Arm it is almost always needed to do:
>> - Write zero to PTE
>> - Flush TLB
>> - Write new PTE
>>
>> For example, the common CoW code path where you copy from a read only page to
>> a new page, then map that new page as writable just works on RISC-V without
>> extra considerations and on Arm it requires BBM.
> CoW is a specific sub-case with increasing privilege.

Could you please explain why it matters increasing of privilege in terms of TLB
flushing and PTE clearing before writing a new PTE?

>
>> It seems to me that BBM is mandated for Arm only because that TLB is shared
>> among cores, so there is no any guarantee that no prior entry for same VA
>> remains in TLB. In case of RISC-V's TLB isn't shared and after a flush it is
>> guaranteed that no prior entry for the same VA remains in the TLB.
> Not sure that's the sole reason. But again the question is: Is this written
> down explicitly anywhere? Generally there can be multiple levels of TLBs, and
> while some of them may be per-core, others may be shared.

Spec. mentions that:
   If a hart employs an address-translation cache, that cache must appear to be
   private to that hart.

>
>>>> +    /*
>>>> +     * Even if we failed, we should install the newly allocated PTE
>>>> +     * entry. The caller will be in charge to free the sub-tree.
>>>> +     */
>>>> +    p2m_write_pte(entry, page_to_p2m_table(p2m, page), p2m->clean_pte);
>>> Why would it be wrong to free the page right here, vacating the entry at
>>> the same time (or leaving just that to the caller)? (IOW - if this is an
>>> implementation decision of yours, I think the word "should" would want
>>> dropping.) After all, the caller invoking p2m_free_entry() on the thus
>>> split PTE is less efficient (needs to iterate over all entries) than on
>>> the original one (where it's just a single superpage).
>> I think that I didn't get your idea.
> Well, first and foremost this was a question to you, as it's not clear to
> me whether "should" is the correct word to use here. It would be
> appropriate if the spec mandated this behavior. It would seem less
> appropriate if this arrangement was an implementation choice of yours.
> And it looks to me as if the latter was the case here.

No, the spec doesn't mandate such behavior, it is just implementation specific.
I can mention that in the comment.

>
>>>> @@ -806,7 +877,36 @@ static int __p2m_set_entry(struct p2m_domain *p2m,
>>>>         */
>>>>        if ( level > target )
>>> This condition is likely too strong, unless you actually mean to also
>>> split a superpage if it really wouldn't need splitting (new entry written
>>> still fitting with the superpage mapping, i.e. suitable MFN and same
>>> attributes).
>> I am not really sure that I fully understand.
>> My understanding is if level != target then the splitting is needed, I am
>> not really get the part "split a superpage if it really wouldn't need splitting".
> Hmm, maybe I was wrong here. The caller determines at what level the
> actual change needs to occur? In which case what you have may be right.

Yes, the caller determines expected level and then asks __p2m_set_entry() to map on
this level.

~ Oleksii

Re: [PATCH v2 15/17] xen/riscv: Implement superpage splitting for p2m mappings

Posted by Jan Beulich 6 months, 3 weeks ago

On 22.07.2025 16:57, Oleksii Kurochko wrote:
> 
> On 7/21/25 3:34 PM, Jan Beulich wrote:
>> On 17.07.2025 18:37, Oleksii Kurochko wrote:
>>> On 7/2/25 11:25 AM, Jan Beulich wrote:
>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>> Add support for down large memory mappings ("superpages") in the RISC-V
>>>>> p2m mapping so that smaller, more precise mappings ("finer-grained entries")
>>>>> can be inserted into lower levels of the page table hierarchy.
>>>>>
>>>>> To implement that the following is done:
>>>>> - Introduce p2m_split_superpage(): Recursively shatters a superpage into
>>>>>     smaller page table entries down to the target level, preserving original
>>>>>     permissions and attributes.
>>>>> - __p2m_set_entry() updated to invoke superpage splitting when inserting
>>>>>     entries at lower levels within a superpage-mapped region.
>>>>>
>>>>> This implementation is based on the ARM code, with modifications to the part
>>>>> that follows the BBM (break-before-make) approach. Unlike ARM, RISC-V does
>>>>> not require BBM, so there is no need to invalidate the PTE and flush the
>>>>> TLB before updating it with the newly created, split page table.
>>>> But some flushing is going to be necessary. As long as you only ever do
>>>> global flushes, the one after the individual PTE modification (within the
>>>> split table) will do (if BBM isn't required, see below), but once you move
>>>> to more fine-grained flushing, that's not going to be enough anymore. Not
>>>> sure it's a good idea to leave such a pitfall.
>>> I think that I don't fully understand what is an issue.
>> Whether a flush is necessary after solely breaking up a superpage is arch-
>> defined. I don't know how much restrictions the spec on possible hardware
>> behavior for RISC-V. However, the eventual change of (at least) one entry
>> of fulfill the original request will surely require a flush. What I was
>> trying to say is that this required flush would better not also cover for
>> the flushes that may or may not be required by the spec. IOW if the spec
>> leaves any room for flushes to possibly be needed, those flushes would
>> better be explicit.
> 
> I think that I still don't understand why what I wrote above will work as long
> as global flushes is working and will stop to work when more fine-grained flushing
> is used.
> 
> Inside p2m_split_superpage() we don't need any kind of TLB flush operation as
> it is allocation a new page for a table and works with it, so no one could use
> this page at the moment and cache it in TLB.
> 
> The only question is that if it is needed BBM before staring using splitted entry:
>          ....
>          if ( !p2m_split_superpage(p2m, &split_pte, level, target, offsets) )
>          {
>              /* Free the allocated sub-tree */
>              p2m_free_subtree(p2m, split_pte, level);
> 
>              rc = -ENOMEM;
>              goto out;
>          }
> 
> ------> /* Should be BBM used here ? */
>          p2m_write_pte(entry, split_pte, p2m->clean_pte);
> 
> And I can't find anything in the spec what tells me to use BBM here like Arm
> does:

But what you need is a statement in the spec that you can get away without. Imo
at least.

>          /*
>           * Follow the break-before-sequence to update the entry.
>           * For more details see (D4.7.1 in ARM DDI 0487A.j).
>           */
>          p2m_remove_pte(entry, p2m->clean_pte);
>          p2m_force_tlb_flush_sync(p2m);
> 
>          p2m_write_pte(entry, split_pte, p2m->clean_pte);
> 
> I agree that even BBM isn't mandated explicitly sometime it could be useful, but
> here I am not really sure what is the point to do TLB flush before p2m_write_pte()
> as nothing serious will happen if and old mapping will be used for a some time
> as we are keeping the same rights for smaller (splited) mapping.
> The reason why Arm does p2m_remove_pte() & p2m_force_tlb_flush() before updating
> an entry here as it is doesn't guarantee that everything will be okay and they
> explicitly tells:
>   This situation can possibly break coherency and ordering guarantees, leading to
>   inconsistent unwanted behavior in our Processing Element (PE).
> 
> 
>>>> As to (no need for) BBM: I couldn't find anything to that effect in the
>>>> privileged spec. Can you provide some pointer? What I found instead is e.g.
>>>> this sentence: "To ensure that implicit reads observe writes to the same
>>>> memory locations, an SFENCE.VMA instruction must be executed after the
>>>> writes to flush the relevant cached translations." And this: "Accessing the
>>>> same location using different cacheability attributes may cause loss of
>>>> coherence." (This may not only occur when the same physical address is
>>>> mapped twice at different VAs, but also after the shattering of a superpage
>>>> when the new entry differs in cacheability.)
>>> I also couldn't find that RISC-V spec mandates BBM explicitly as Arm does it.
>>>
>>> What I meant that on RISC-V can do:
>>> - Write new PTE
>>> - Flush TLB
>>>
>>> While on Arm it is almost always needed to do:
>>> - Write zero to PTE
>>> - Flush TLB
>>> - Write new PTE
>>>
>>> For example, the common CoW code path where you copy from a read only page to
>>> a new page, then map that new page as writable just works on RISC-V without
>>> extra considerations and on Arm it requires BBM.
>> CoW is a specific sub-case with increasing privilege.
> 
> Could you please explain why it matters increasing of privilege in terms of TLB
> flushing and PTE clearing before writing a new PTE?

Some architectures automatically re-walk page tables when encountering a
permission violation based on TLB contents. Hence increasing privilege
can be a special case.

>>> It seems to me that BBM is mandated for Arm only because that TLB is shared
>>> among cores, so there is no any guarantee that no prior entry for same VA
>>> remains in TLB. In case of RISC-V's TLB isn't shared and after a flush it is
>>> guaranteed that no prior entry for the same VA remains in the TLB.
>> Not sure that's the sole reason. But again the question is: Is this written
>> down explicitly anywhere? Generally there can be multiple levels of TLBs, and
>> while some of them may be per-core, others may be shared.
> 
> Spec. mentions that:
>    If a hart employs an address-translation cache, that cache must appear to be
>    private to that hart.

Hmm, okay, that indeed pretty much excludes shared TLBs.

Jan

Re: [PATCH v2 15/17] xen/riscv: Implement superpage splitting for p2m mappings

Posted by Oleksii Kurochko 6 months, 2 weeks ago

On 7/22/25 6:02 PM, Jan Beulich wrote:
> On 22.07.2025 16:57, Oleksii Kurochko wrote:
>> On 7/21/25 3:34 PM, Jan Beulich wrote:
>>> On 17.07.2025 18:37, Oleksii Kurochko wrote:
>>>> On 7/2/25 11:25 AM, Jan Beulich wrote:
>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>> Add support for down large memory mappings ("superpages") in the RISC-V
>>>>>> p2m mapping so that smaller, more precise mappings ("finer-grained entries")
>>>>>> can be inserted into lower levels of the page table hierarchy.
>>>>>>
>>>>>> To implement that the following is done:
>>>>>> - Introduce p2m_split_superpage(): Recursively shatters a superpage into
>>>>>>      smaller page table entries down to the target level, preserving original
>>>>>>      permissions and attributes.
>>>>>> - __p2m_set_entry() updated to invoke superpage splitting when inserting
>>>>>>      entries at lower levels within a superpage-mapped region.
>>>>>>
>>>>>> This implementation is based on the ARM code, with modifications to the part
>>>>>> that follows the BBM (break-before-make) approach. Unlike ARM, RISC-V does
>>>>>> not require BBM, so there is no need to invalidate the PTE and flush the
>>>>>> TLB before updating it with the newly created, split page table.
>>>>> But some flushing is going to be necessary. As long as you only ever do
>>>>> global flushes, the one after the individual PTE modification (within the
>>>>> split table) will do (if BBM isn't required, see below), but once you move
>>>>> to more fine-grained flushing, that's not going to be enough anymore. Not
>>>>> sure it's a good idea to leave such a pitfall.
>>>> I think that I don't fully understand what is an issue.
>>> Whether a flush is necessary after solely breaking up a superpage is arch-
>>> defined. I don't know how much restrictions the spec on possible hardware
>>> behavior for RISC-V. However, the eventual change of (at least) one entry
>>> of fulfill the original request will surely require a flush. What I was
>>> trying to say is that this required flush would better not also cover for
>>> the flushes that may or may not be required by the spec. IOW if the spec
>>> leaves any room for flushes to possibly be needed, those flushes would
>>> better be explicit.
>> I think that I still don't understand why what I wrote above will work as long
>> as global flushes is working and will stop to work when more fine-grained flushing
>> is used.
>>
>> Inside p2m_split_superpage() we don't need any kind of TLB flush operation as
>> it is allocation a new page for a table and works with it, so no one could use
>> this page at the moment and cache it in TLB.
>>
>> The only question is that if it is needed BBM before staring using splitted entry:
>>           ....
>>           if ( !p2m_split_superpage(p2m, &split_pte, level, target, offsets) )
>>           {
>>               /* Free the allocated sub-tree */
>>               p2m_free_subtree(p2m, split_pte, level);
>>
>>               rc = -ENOMEM;
>>               goto out;
>>           }
>>
>> ------> /* Should be BBM used here ? */
>>           p2m_write_pte(entry, split_pte, p2m->clean_pte);
>>
>> And I can't find anything in the spec what tells me to use BBM here like Arm
>> does:
> But what you need is a statement in the spec that you can get away without. Imo
> at least.

In the spec. it is mentioned that:
   It is permitted for multiple address-translation cache entries to co-exist for the same
   address. This represents the fact that in a conventional TLB hierarchy, it is possible for
   multiple entries to match a single address if, for example, a page is upgraded to a
   superpage without first clearing the original non-leaf PTE’s valid bit and executing an
   SFENCE.VMA with rs1=x0, or if multiple TLBs exist in parallel at a given level of the
   hierarchy. In this case, just as if an SFENCE.VMA is not executed between a write to the
   memory-management tables and subsequent implicit read of the same address: it is
   unpredictable whether the old non-leaf PTE or the new leaf PTE is used, but the behavior is
   otherwise well defined.
The phrase*"but the behavior is otherwise well defined"* emphasizes that even if the TLB sees
two versions (the old and the new), the architecture guarantees stability, and the behavior
remains safe — though unpredictable in terms of which translation will be used.
And I think that this unpredictability is okay, at least, in the case if superpage splitting
and therefore TLB flushing can be deferred since the old pages (which are used for old mapping)
still exist and the permissions of the new entries match those of the original ones.
Also, it seems like there clearing PTE before TLB flushing isn't need too.

Does it make sense?

Re: [PATCH v2 15/17] xen/riscv: Implement superpage splitting for p2m mappings

Posted by Jan Beulich 6 months, 2 weeks ago

On 23.07.2025 21:51, Oleksii Kurochko wrote:
> 
> On 7/22/25 6:02 PM, Jan Beulich wrote:
>> On 22.07.2025 16:57, Oleksii Kurochko wrote:
>>> On 7/21/25 3:34 PM, Jan Beulich wrote:
>>>> On 17.07.2025 18:37, Oleksii Kurochko wrote:
>>>>> On 7/2/25 11:25 AM, Jan Beulich wrote:
>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>> Add support for down large memory mappings ("superpages") in the RISC-V
>>>>>>> p2m mapping so that smaller, more precise mappings ("finer-grained entries")
>>>>>>> can be inserted into lower levels of the page table hierarchy.
>>>>>>>
>>>>>>> To implement that the following is done:
>>>>>>> - Introduce p2m_split_superpage(): Recursively shatters a superpage into
>>>>>>>      smaller page table entries down to the target level, preserving original
>>>>>>>      permissions and attributes.
>>>>>>> - __p2m_set_entry() updated to invoke superpage splitting when inserting
>>>>>>>      entries at lower levels within a superpage-mapped region.
>>>>>>>
>>>>>>> This implementation is based on the ARM code, with modifications to the part
>>>>>>> that follows the BBM (break-before-make) approach. Unlike ARM, RISC-V does
>>>>>>> not require BBM, so there is no need to invalidate the PTE and flush the
>>>>>>> TLB before updating it with the newly created, split page table.
>>>>>> But some flushing is going to be necessary. As long as you only ever do
>>>>>> global flushes, the one after the individual PTE modification (within the
>>>>>> split table) will do (if BBM isn't required, see below), but once you move
>>>>>> to more fine-grained flushing, that's not going to be enough anymore. Not
>>>>>> sure it's a good idea to leave such a pitfall.
>>>>> I think that I don't fully understand what is an issue.
>>>> Whether a flush is necessary after solely breaking up a superpage is arch-
>>>> defined. I don't know how much restrictions the spec on possible hardware
>>>> behavior for RISC-V. However, the eventual change of (at least) one entry
>>>> of fulfill the original request will surely require a flush. What I was
>>>> trying to say is that this required flush would better not also cover for
>>>> the flushes that may or may not be required by the spec. IOW if the spec
>>>> leaves any room for flushes to possibly be needed, those flushes would
>>>> better be explicit.
>>> I think that I still don't understand why what I wrote above will work as long
>>> as global flushes is working and will stop to work when more fine-grained flushing
>>> is used.
>>>
>>> Inside p2m_split_superpage() we don't need any kind of TLB flush operation as
>>> it is allocation a new page for a table and works with it, so no one could use
>>> this page at the moment and cache it in TLB.
>>>
>>> The only question is that if it is needed BBM before staring using splitted entry:
>>>           ....
>>>           if ( !p2m_split_superpage(p2m, &split_pte, level, target, offsets) )
>>>           {
>>>               /* Free the allocated sub-tree */
>>>               p2m_free_subtree(p2m, split_pte, level);
>>>
>>>               rc = -ENOMEM;
>>>               goto out;
>>>           }
>>>
>>> ------> /* Should be BBM used here ? */
>>>           p2m_write_pte(entry, split_pte, p2m->clean_pte);
>>>
>>> And I can't find anything in the spec what tells me to use BBM here like Arm
>>> does:
>> But what you need is a statement in the spec that you can get away without. Imo
>> at least.
> 
> In the spec. it is mentioned that:
>    It is permitted for multiple address-translation cache entries to co-exist for the same
>    address. This represents the fact that in a conventional TLB hierarchy, it is possible for
>    multiple entries to match a single address if, for example, a page is upgraded to a
>    superpage without first clearing the original non-leaf PTE’s valid bit and executing an
>    SFENCE.VMA with rs1=x0, or if multiple TLBs exist in parallel at a given level of the
>    hierarchy. In this case, just as if an SFENCE.VMA is not executed between a write to the
>    memory-management tables and subsequent implicit read of the same address: it is
>    unpredictable whether the old non-leaf PTE or the new leaf PTE is used, but the behavior is
>    otherwise well defined.
> The phrase*"but the behavior is otherwise well defined"* emphasizes that even if the TLB sees
> two versions (the old and the new), the architecture guarantees stability, and the behavior
> remains safe — though unpredictable in terms of which translation will be used.
> And I think that this unpredictability is okay, at least, in the case if superpage splitting
> and therefore TLB flushing can be deferred since the old pages (which are used for old mapping)
> still exist and the permissions of the new entries match those of the original ones.
> Also, it seems like there clearing PTE before TLB flushing isn't need too.
> 
> Does it make sense?

Yes, I think this indeed is sufficient as a spec requirement.

Jan