Instruct the remote harts to execute one or more HFENCE.GVMA instructions,
covering the range of guest physical addresses between start_addr and
start_addr + size for all the guests.
The remote fence operation applies to the entire address space if either:
- start_addr and size are both 0, or
- size is equal to 2^XLEN-1.
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V2:
- New patch.
---
xen/arch/riscv/include/asm/sbi.h | 21 +++++++++++++++++++++
xen/arch/riscv/sbi.c             |  9 +++++++++
2 files changed, 30 insertions(+)
diff --git a/xen/arch/riscv/include/asm/sbi.h b/xen/arch/riscv/include/asm/sbi.h
index 527d773277..8e346347af 100644
--- a/xen/arch/riscv/include/asm/sbi.h
+++ b/xen/arch/riscv/include/asm/sbi.h
@@ -89,6 +89,27 @@ bool sbi_has_rfence(void);
int sbi_remote_sfence_vma(const cpumask_t *cpu_mask, vaddr_t start,
                          size_t size);

+/*
+ * Instructs the remote harts to execute one or more HFENCE.GVMA
+ * instructions, covering the range of guest physical addresses
+ * between start_addr and start_addr + size for all the guests.
+ * This function call is only valid for harts implementing
+ * the hypervisor extension.
+ *
+ * Returns 0 if the IPI was sent to all the targeted harts successfully,
+ * or a negative value if start_addr or size is not valid.
+ *
+ * The remote fence operation applies to the entire address space if either:
+ * - start_addr and size are both 0, or
+ * - size is equal to 2^XLEN-1.
+ *
+ * @param cpu_mask a cpu mask containing all the target CPUs (in Xen space).
+ * @param start guest physical address start
+ * @param size guest physical address range size
+ */
+int sbi_remote_hfence_gvma(const cpumask_t *cpu_mask, vaddr_t start,
+                           size_t size);
+
/*
* Initialize SBI library
*
diff --git a/xen/arch/riscv/sbi.c b/xen/arch/riscv/sbi.c
index 4209520389..0613ad1cb0 100644
--- a/xen/arch/riscv/sbi.c
+++ b/xen/arch/riscv/sbi.c
@@ -258,6 +258,15 @@ int sbi_remote_sfence_vma(const cpumask_t *cpu_mask, vaddr_t start,
                      cpu_mask, start, size, 0, 0);
}

+int sbi_remote_hfence_gvma(const cpumask_t *cpu_mask, vaddr_t start,
+                           size_t size)
+{
+    ASSERT(sbi_rfence);
+
+    return sbi_rfence(SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA,
+                      cpu_mask, start, size, 0, 0);
+}
+
/* This function must always succeed. */
#define sbi_get_spec_version() \
    sbi_ext_base_func(SBI_EXT_BASE_GET_SPEC_VERSION)
--
2.49.0
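
For illustration, a hypothetical caller of the new helper might look like the
sketch below. The p2m_flush_* wrappers and the use of d->dirty_cpumask are
assumptions made only for this sketch (they are not part of the patch); they
merely show the ranged form and the whole-address-space form of the call
described above.

static int p2m_flush_range(struct domain *d, vaddr_t start, size_t size)
{
    /* Fence only the harts that may hold cached translations for d. */
    return sbi_remote_hfence_gvma(d->dirty_cpumask, start, size);
}

static int p2m_flush_all(struct domain *d)
{
    /* start == 0 && size == 0 requests a fence of the entire address space. */
    return sbi_remote_hfence_gvma(d->dirty_cpumask, 0, 0);
}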
On 10.06.2025 15:05, Oleksii Kurochko wrote:
> Instruct the remote harts to execute one or more HFENCE.GVMA instructions,
> covering the range of guest physical addresses between start_addr and
> start_addr + size for all the guests.
Here and in the code comment: Why "for all the guests"? Under what conditions
would you require such a broad (guest) TLB flush?
> --- a/xen/arch/riscv/sbi.c
> +++ b/xen/arch/riscv/sbi.c
> @@ -258,6 +258,15 @@ int sbi_remote_sfence_vma(const cpumask_t *cpu_mask, vaddr_t start,
> cpu_mask, start, size, 0, 0);
> }
>
> +int sbi_remote_hfence_gvma(const cpumask_t *cpu_mask, vaddr_t start,
> + size_t size)
> +{
> + ASSERT(sbi_rfence);
As previously indicated, I question the usefulness of such assertions. If the
pointer is still NULL, ...
> + return sbi_rfence(SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA,
> + cpu_mask, start, size, 0, 0);
... you'll crash here anyway (much like you will in a release build).
Jan
On 6/18/25 5:15 PM, Jan Beulich wrote:
> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>> Instruct the remote harts to execute one or more HFENCE.GVMA instructions,
>> covering the range of guest physical addresses between start_addr and
>> start_addr + size for all the guests.
> Here and in the code comment: Why "for all the guests"? Under what conditions
> would you require such a broad (guest) TLB flush?
Hmm, it seems like KVM always does such a broad (guest) TLB flush during detection
of VMIDLEN:
void __init kvm_riscv_gstage_vmid_detect(void)
{
    unsigned long old;

    /* Figure-out number of VMID bits in HW */
    old = csr_read(CSR_HGATP);
    csr_write(CSR_HGATP, old | HGATP_VMID);
    vmid_bits = csr_read(CSR_HGATP);
    vmid_bits = (vmid_bits & HGATP_VMID) >> HGATP_VMID_SHIFT;
    vmid_bits = fls_long(vmid_bits);
    csr_write(CSR_HGATP, old);

    /* We polluted local TLB so flush all guest TLB */
    kvm_riscv_local_hfence_gvma_all();

    /* We don't use VMID bits if they are not sufficient */
    if ((1UL << vmid_bits) < num_possible_cpus())
        vmid_bits = 0;
}
It is actually not clear why it is so broad, and why not hfence_gvma_vmid(vmid_bits).
And I am not really 100% sure that any hfence_gvma() is needed here, as I don't see
what could pollute the local guest TLB between the csr_write() calls.
RISC-V spec. says that:
Note that writing hgatp does not imply any ordering constraints between page-table updates and
subsequent G-stage address translations. If the new virtual machine’s guest physical page tables have
been modified, or if a VMID is reused, it may be necessary to execute an HFENCE.GVMA instruction
(see Section 18.3.2) before or after writing hgatp.
But we don't modify the VM's guest physical page tables. We could potentially reuse a VMID
between the csr_write() calls, but the old value is written back and we don't switch to a
guest with this "new" VMID, so it isn't really used.
Do you have any thoughts about that?
~ Oleksii
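
For comparison, a minimal sketch of what an equivalent Xen-side probe could
look like (HGATP_VMID_MASK, HGATP_VMID_SHIFT and local_hfence_gvma_all() are
assumed names here, not existing Xen code; the flush is kept only to mirror
KVM and is exactly the step whose necessity is questioned above):

static unsigned int __init vmidlen_detect(void)
{
    unsigned long old = csr_read(CSR_HGATP);
    unsigned long vmid;

    /* Write all-ones into the VMID field and read back what sticks. */
    csr_write(CSR_HGATP, old | HGATP_VMID_MASK);
    vmid = (csr_read(CSR_HGATP) & HGATP_VMID_MASK) >> HGATP_VMID_SHIFT;
    csr_write(CSR_HGATP, old);

    /* Mirrors KVM's kvm_riscv_local_hfence_gvma_all(); possibly unneeded. */
    local_hfence_gvma_all();

    /* Number of implemented VMID bits. */
    return flsl(vmid);
}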
On 24.06.2025 12:33, Oleksii Kurochko wrote:
> On 6/18/25 5:15 PM, Jan Beulich wrote:
>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>> Instruct the remote harts to execute one or more HFENCE.GVMA instructions,
>>> covering the range of guest physical addresses between start_addr and
>>> start_addr + size for all the guests.
>> Here and in the code comment: Why "for all the guests"? Under what conditions
>> would you require such a broad (guest) TLB flush?
>
> Hmm, it seems like KVM always does such a broad (guest) TLB flush during detection
> of VMIDLEN:
> void __init kvm_riscv_gstage_vmid_detect(void)
> {
>     unsigned long old;
>
>     /* Figure-out number of VMID bits in HW */
>     old = csr_read(CSR_HGATP);
>     csr_write(CSR_HGATP, old | HGATP_VMID);
>     vmid_bits = csr_read(CSR_HGATP);
>     vmid_bits = (vmid_bits & HGATP_VMID) >> HGATP_VMID_SHIFT;
>     vmid_bits = fls_long(vmid_bits);
>     csr_write(CSR_HGATP, old);
>
>     /* We polluted local TLB so flush all guest TLB */
>     kvm_riscv_local_hfence_gvma_all();
>
>     /* We don't use VMID bits if they are not sufficient */
>     if ((1UL << vmid_bits) < num_possible_cpus())
>         vmid_bits = 0;
> }
>
> It is actually not clear why it is so broad, and why not hfence_gvma_vmid(vmid_bits).
>
> And I am not really 100% sure that any hfence_gvma() is needed here, as I don't see
> what could pollute the local guest TLB between the csr_write() calls.
>
> RISC-V spec. says that:
> Note that writing hgatp does not imply any ordering constraints between page-table updates and
> subsequent G-stage address translations. If the new virtual machine’s guest physical page tables have
> been modified, or if a VMID is reused, it may be necessary to execute an HFENCE.GVMA instruction
> (see Section 18.3.2) before or after writing hgatp.
>
> But we don't modify the VM's guest physical page tables. We could potentially reuse a VMID
> between the csr_write() calls, but the old value is written back and we don't switch to a
> guest with this "new" VMID, so it isn't really used.
That would be my expectation, too. Yet I don't know if RISC-V has any
peculiarities there.
Jan
On 6/18/25 5:15 PM, Jan Beulich wrote:
> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>> Instruct the remote harts to execute one or more HFENCE.GVMA instructions,
>> covering the range of guest physical addresses between start_addr and
>> start_addr + size for all the guests.
> Here and in the code comment: Why "for all the guests"? Under what conditions
> would you require such a broad (guest) TLB flush?
Originally, it came from Andrew's reply:
```
TLB flushing needs to happen for each pCPU which potentially has cached
a mapping.
In other arches, this is tracked by d->dirty_cpumask which is the bitmap
of pCPUs where this domain is scheduled.
CPUs need to flush their TLBs before removing themselves from
d->dirty_cpumask, which is typically done during context switch, but it
means that to flush the P2M, you only need to IPI a subset of CPUs.
```
But this function was specifically introduced to handle the case where there is no VMID
support, as we can't distinguish which TLB entries belong to which domain. As a result,
we have no choice but to flush the entire TLB to avoid incorrect translations.
However, this patch may no longer be necessary, as VMID support has been
introduced and sbi_remote_hfence_gvma_vmid() will be used instead.
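
For reference, the VMID-scoped wrapper mentioned above could be a near-copy of
the helper in this patch. A rough sketch (the
SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA_VMID function ID and passing vmid as the
extra sbi_rfence() argument follow the SBI spec and the existing sfence
wrapper, but are assumptions as far as the Xen tree is concerned):

int sbi_remote_hfence_gvma_vmid(const cpumask_t *cpu_mask, vaddr_t start,
                                size_t size, unsigned long vmid)
{
    /* Same remote fence path, but restricted to a single VMID. */
    return sbi_rfence(SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA_VMID,
                      cpu_mask, start, size, vmid, 0);
}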
>
>> --- a/xen/arch/riscv/sbi.c
>> +++ b/xen/arch/riscv/sbi.c
>> @@ -258,6 +258,15 @@ int sbi_remote_sfence_vma(const cpumask_t *cpu_mask, vaddr_t start,
>> cpu_mask, start, size, 0, 0);
>> }
>>
>> +int sbi_remote_hfence_gvma(const cpumask_t *cpu_mask, vaddr_t start,
>> + size_t size)
>> +{
>> + ASSERT(sbi_rfence);
> As previously indicated, I question the usefulness of such assertions. If the
> pointer is still NULL, ...
>
>> + return sbi_rfence(SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA,
>> + cpu_mask, start, size, 0, 0);
> ... you'll crash here anyway (much like you will in a release build).
I will drop ASSERT() for rfence functions.
Thanks.
~ Oleksii
On 23.06.2025 16:31, Oleksii Kurochko wrote:
> On 6/18/25 5:15 PM, Jan Beulich wrote:
>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>> Instruct the remote harts to execute one or more HFENCE.GVMA instructions,
>>> covering the range of guest physical addresses between start_addr and
>>> start_addr + size for all the guests.
>> Here and in the code comment: Why "for all the guests"? Under what conditions
>> would you require such a broad (guest) TLB flush?
>
> Originally, it came from Andrew's reply:
> ```
> TLB flushing needs to happen for each pCPU which potentially has cached
> a mapping.
>
> In other arches, this is tracked by d->dirty_cpumask which is the bitmap
> of pCPUs where this domain is scheduled.
>
> CPUs need to flush their TLBs before removing themselves from
> d->dirty_cpumask, which is typically done during context switch, but it
> means that to flush the P2M, you only need to IPI a subset of CPUs.
> ```

Hmm, but the word "guest" doesn't even appear there. "Each pCPU" isn't quite
the same as "all guests".

> But this function was specifically introduced to handle the case where there
> is no VMID support, as we can't distinguish which TLB entries belong to which
> domain. As a result, we have no choice but to flush the entire TLB to avoid
> incorrect translations.
>
> However, this patch may no longer be necessary, as VMID support has been
> introduced and sbi_remote_hfence_gvma_vmid() will be used instead.

Good.

Jan
On 6/23/25 4:39 PM, Jan Beulich wrote:
> On 23.06.2025 16:31, Oleksii Kurochko wrote:
>> On 6/18/25 5:15 PM, Jan Beulich wrote:
>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>> Instruct the remote harts to execute one or more HFENCE.GVMA instructions,
>>>> covering the range of guest physical addresses between start_addr and
>>>> start_addr + size for all the guests.
>>> Here and in the code comment: Why "for all the guests"? Under what conditions
>>> would you require such a broad (guest) TLB flush?
>> Originally, it came from Andrew's reply:
>> ```
>> TLB flushing needs to happen for each pCPU which potentially has cached
>> a mapping.
>>
>> In other arches, this is tracked by d->dirty_cpumask which is the bitmap
>> of pCPUs where this domain is scheduled.
>>
>> CPUs need to flush their TLBs before removing themselves from
>> d->dirty_cpumask, which is typically done during context switch, but it
>> means that to flush the P2M, you only need to IPI a subset of CPUs.
>> ```
> Hmm, but the word "guest" doesn't even appear there. "Each pCPU" isn't quite
> the same as "all guests".

Agree, it is just the SBI spec's wording... "pCPU" here is the first argument
of sbi_remote_hfence_gvma(unsigned long hart_mask, ...).

It is even more confusing because, based on the explanation in the RISC-V
privileged spec, hfence.gvma is used to flush the G-stage (stage-2) TLB while
hfence.vvma is the VS-stage (stage-1) TLB flush.

~ Oleksii
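
To make the distinction concrete, local (single-hart) forms of the two fences
could be written roughly as below; the helper names are made up for
illustration, and the mnemonics require an assembler that knows the hypervisor
extension:

static inline void local_hfence_gvma_all(void)
{
    /* Flush all G-stage (guest-physical -> host-physical) translations. */
    asm volatile ( "hfence.gvma zero, zero" ::: "memory" );
}

static inline void local_hfence_vvma_all(void)
{
    /*
     * Flush VS-stage (guest-virtual -> guest-physical) translations
     * for the current VMID.
     */
    asm volatile ( "hfence.vvma zero, zero" ::: "memory" );
}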