qemu riscv, thead c906, Linux boot regression

Björn Töpel posted 1 patch 10 months, 1 week ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/874jf2rj4g.fsf@all.your.base.are.belong.to.us
Maintainers: Palmer Dabbelt <palmer@dabbelt.com>, Alistair Francis <alistair.francis@wdc.com>, Bin Meng <bin.meng@windriver.com>, Weiwei Li <liwei1518@gmail.com>, Daniel Henrique Barboza <dbarboza@ventanamicro.com>, Liu Zhiwei <zhiwei_liu@linux.alibaba.com>
qemu riscv, thead c906, Linux boot regression
Posted by Björn Töpel 10 months, 1 week ago
Hi!

I bumped the RISC-V Linux kernel CI to use qemu 8.2.0, and realized that
thead c906 didn't boot anymore. Bisection points to commit d6a427e2c0b2
("target/riscv/cpu.c: restrict 'marchid' value")

Reverting that commit, or the hack below solves the boot issue:

--8<--
diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
index 8cbfc7e781ad..e18596c8a55a 100644
--- a/target/riscv/cpu.c
+++ b/target/riscv/cpu.c
@@ -505,6 +505,9 @@ static void rv64_thead_c906_cpu_init(Object *obj)
     cpu->cfg.ext_xtheadsync = true;
 
     cpu->cfg.mvendorid = THEAD_VENDOR_ID;
+    cpu->cfg.marchid = ((QEMU_VERSION_MAJOR << 16) |
+                        (QEMU_VERSION_MINOR << 8)  |
+                        (QEMU_VERSION_MICRO));
 #ifndef CONFIG_USER_ONLY
     set_satp_mode_max_supported(cpu, VM_1_10_SV39);
 #endif
--8<--

I'm unsure what the correct qemu way of adding a default value is,
or if c906 should have a proper marchid.

Maybe Christoph or Zhiwei can answer?

qemu command-line:
qemu-system-riscv64 -nodefaults -nographic -machine virt,acpi=off \
   -cpu thead-c906 ...


Thanks,
Björn
Re: qemu riscv, thead c906, Linux boot regression
Posted by LIU Zhiwei 10 months ago
On 2024/1/24 20:49, Björn Töpel wrote:
> Hi!
>
> I bumped the RISC-V Linux kernel CI to use qemu 8.2.0, and realized that
> thead c906 didn't boot anymore. Bisection points to commit d6a427e2c0b2
> ("target/riscv/cpu.c: restrict 'marchid' value")
>
> Reverting that commit, or the hack below solves the boot issue:
>
> --8<--
> diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
> index 8cbfc7e781ad..e18596c8a55a 100644
> --- a/target/riscv/cpu.c
> +++ b/target/riscv/cpu.c
> @@ -505,6 +505,9 @@ static void rv64_thead_c906_cpu_init(Object *obj)
>       cpu->cfg.ext_xtheadsync = true;
>   
>       cpu->cfg.mvendorid = THEAD_VENDOR_ID;
> +    cpu->cfg.marchid = ((QEMU_VERSION_MAJOR << 16) |
> +                        (QEMU_VERSION_MINOR << 8)  |
> +                        (QEMU_VERSION_MICRO));
>   #ifndef CONFIG_USER_ONLY
>       set_satp_mode_max_supported(cpu, VM_1_10_SV39);
>   #endif
> --8<--
>
> I'm unsure what the correct qemu way of adding a default value is,
> or if c906 should have a proper marchid.
>
> Maybe Christoph or Zhiwei can answer?
>
> qemu command-line:
> qemu-system-riscv64 -nodefaults -nographic -machine virt,acpi=off \
>     -cpu thead-c906 ...

Hi  Björn,

I think it is caused by an mmu extension(named XTheadMaee) not 
implemented on QEMU which is conflicts with Svpbmt, which is the reason 
for error-ta in Linux.

I will try to fix this on QEMU and at the same time give a way to 
implement vendor custom CSR(XTheadMaee is controlled by an CSR named 
mexstatus)  on QEMU.

Thanks,
Zhiwei

>
> Thanks,
> Björn

Re: qemu riscv, thead c906, Linux boot regression
Posted by Daniel Henrique Barboza 10 months, 1 week ago

On 1/24/24 09:49, Björn Töpel wrote:
> Hi!
> 
> I bumped the RISC-V Linux kernel CI to use qemu 8.2.0, and realized that
> thead c906 didn't boot anymore. Bisection points to commit d6a427e2c0b2
> ("target/riscv/cpu.c: restrict 'marchid' value")
> 
> Reverting that commit, or the hack below solves the boot issue:
> 
> --8<--
> diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
> index 8cbfc7e781ad..e18596c8a55a 100644
> --- a/target/riscv/cpu.c
> +++ b/target/riscv/cpu.c
> @@ -505,6 +505,9 @@ static void rv64_thead_c906_cpu_init(Object *obj)
>       cpu->cfg.ext_xtheadsync = true;
>   
>       cpu->cfg.mvendorid = THEAD_VENDOR_ID;
> +    cpu->cfg.marchid = ((QEMU_VERSION_MAJOR << 16) |
> +                        (QEMU_VERSION_MINOR << 8)  |
> +                        (QEMU_VERSION_MICRO));
>   #ifndef CONFIG_USER_ONLY
>       set_satp_mode_max_supported(cpu, VM_1_10_SV39);
>   #endif
> --8<--
> 
> I'm unsure what the correct qemu way of adding a default value is,
> or if c906 should have a proper marchid.

In case you need to set a 'marchid' different than zero for c906, this hack would
be a proper fix. As mentioned in the commit msg of the patch you mentioned:

"Named CPUs should set 'marchid' to a meaningful value instead, and generic
  CPUs can set to any valid value."

That means that any specific marchid value that the CPU uses must to be set
in its own cpu_init() function.


Thanks,

Daniel


> 
> Maybe Christoph or Zhiwei can answer?
> 
> qemu command-line:
> qemu-system-riscv64 -nodefaults -nographic -machine virt,acpi=off \
>     -cpu thead-c906 ...
> 
> 
> Thanks,
> Björn

Re: qemu riscv, thead c906, Linux boot regression
Posted by Björn Töpel 10 months ago
Daniel Henrique Barboza <dbarboza@ventanamicro.com> writes:

> On 1/24/24 09:49, Björn Töpel wrote:
>> Hi!
>> 
>> I bumped the RISC-V Linux kernel CI to use qemu 8.2.0, and realized that
>> thead c906 didn't boot anymore. Bisection points to commit d6a427e2c0b2
>> ("target/riscv/cpu.c: restrict 'marchid' value")
>> 
>> Reverting that commit, or the hack below solves the boot issue:
>> 
>> --8<--
>> diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
>> index 8cbfc7e781ad..e18596c8a55a 100644
>> --- a/target/riscv/cpu.c
>> +++ b/target/riscv/cpu.c
>> @@ -505,6 +505,9 @@ static void rv64_thead_c906_cpu_init(Object *obj)
>>       cpu->cfg.ext_xtheadsync = true;
>>   
>>       cpu->cfg.mvendorid = THEAD_VENDOR_ID;
>> +    cpu->cfg.marchid = ((QEMU_VERSION_MAJOR << 16) |
>> +                        (QEMU_VERSION_MINOR << 8)  |
>> +                        (QEMU_VERSION_MICRO));
>>   #ifndef CONFIG_USER_ONLY
>>       set_satp_mode_max_supported(cpu, VM_1_10_SV39);
>>   #endif
>> --8<--
>> 
>> I'm unsure what the correct qemu way of adding a default value is,
>> or if c906 should have a proper marchid.
>
> In case you need to set a 'marchid' different than zero for c906, this hack would
> be a proper fix. As mentioned in the commit msg of the patch you mentioned:
>
> "Named CPUs should set 'marchid' to a meaningful value instead, and generic
>   CPUs can set to any valid value."
>
> That means that any specific marchid value that the CPU uses must to be set
> in its own cpu_init() function.

Got it. Thanks, Daniel!

For completeness (since it came up on the weekly PW call); Conor pointed
out that zero *is* indeed the right marchid for c906, and in fact, the
non-zero marchid pre commit d6a427e2c0b2 was incorrect.

Post commit d6a427e2c0b2, the correct alternative is picked up, and
ERRATA_THEAD_PBMT (using non-standard memory type bits in
page-table-entries) kicks in. AFAIU, that's not implemented by qemu's
c906 support, which then traps.

That's the theory. Maybe Christoph knows if the non-standard bits are
implemented or not?

Regardless; I removed booting Qemu T-head c906 from the CI, and the
build/boot passes nicely ;-) [1]


Björn

[1] https://github.com/linux-riscv/linux-riscv/actions/runs/7641764759/job/20819801235?pr=447
Re: qemu riscv, thead c906, Linux boot regression
Posted by Daniel Henrique Barboza 10 months ago

On 1/24/24 16:26, Björn Töpel wrote:
> Daniel Henrique Barboza <dbarboza@ventanamicro.com> writes:
> 
>> On 1/24/24 09:49, Björn Töpel wrote:
>>> Hi!
>>>
>>> I bumped the RISC-V Linux kernel CI to use qemu 8.2.0, and realized that
>>> thead c906 didn't boot anymore. Bisection points to commit d6a427e2c0b2
>>> ("target/riscv/cpu.c: restrict 'marchid' value")
>>>
>>> Reverting that commit, or the hack below solves the boot issue:
>>>
>>> --8<--
>>> diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
>>> index 8cbfc7e781ad..e18596c8a55a 100644
>>> --- a/target/riscv/cpu.c
>>> +++ b/target/riscv/cpu.c
>>> @@ -505,6 +505,9 @@ static void rv64_thead_c906_cpu_init(Object *obj)
>>>        cpu->cfg.ext_xtheadsync = true;
>>>    
>>>        cpu->cfg.mvendorid = THEAD_VENDOR_ID;
>>> +    cpu->cfg.marchid = ((QEMU_VERSION_MAJOR << 16) |
>>> +                        (QEMU_VERSION_MINOR << 8)  |
>>> +                        (QEMU_VERSION_MICRO));
>>>    #ifndef CONFIG_USER_ONLY
>>>        set_satp_mode_max_supported(cpu, VM_1_10_SV39);
>>>    #endif
>>> --8<--
>>>
>>> I'm unsure what the correct qemu way of adding a default value is,
>>> or if c906 should have a proper marchid.
>>
>> In case you need to set a 'marchid' different than zero for c906, this hack would
>> be a proper fix. As mentioned in the commit msg of the patch you mentioned:
>>
>> "Named CPUs should set 'marchid' to a meaningful value instead, and generic
>>    CPUs can set to any valid value."
>>
>> That means that any specific marchid value that the CPU uses must to be set
>> in its own cpu_init() function.
> 
> Got it. Thanks, Daniel!
> 
> For completeness (since it came up on the weekly PW call); Conor pointed
> out that zero *is* indeed the right marchid for c906, and in fact, the
> non-zero marchid pre commit d6a427e2c0b2 was incorrect.
> 
> Post commit d6a427e2c0b2, the correct alternative is picked up, and
> ERRATA_THEAD_PBMT (using non-standard memory type bits in
> page-table-entries) kicks in. AFAIU, that's not implemented by qemu's
> c906 support, which then traps.


This looks like a very good reason to actually push what you called 'hack' as
a fix. Yeah, in theory that commit did nothing wrong, but the side effect
(missing support for non-standard memory type bits) is kind of a QEMU problem.

You're welcome to format that hack into a patch, explaining in the commit msg why
we need to set marchid for c906 to that specific value. I'd even add a TODO
tag in rv64_thead_c906_cpu_init() to remind us that this is a band-aid and
that we should remove it once we implement the needed support.



> 
> That's the theory. Maybe Christoph knows if the non-standard bits are
> implemented or not?
> 
> Regardless; I removed booting Qemu T-head c906 from the CI, and the
> build/boot passes nicely ;-) [1]

I vote for setting marchid in c906 cpu_init and re-enable it in the CI.


Thanks,

Daniel

> 
> 
> Björn
> 
> [1] https://github.com/linux-riscv/linux-riscv/actions/runs/7641764759/job/20819801235?pr=447

Re: qemu riscv, thead c906, Linux boot regression
Posted by Björn Töpel 10 months ago
Daniel Henrique Barboza <dbarboza@ventanamicro.com> writes:

> On 1/24/24 16:26, Björn Töpel wrote:
>> Daniel Henrique Barboza <dbarboza@ventanamicro.com> writes:
>> 
>>> On 1/24/24 09:49, Björn Töpel wrote:
>>>> Hi!
>>>>
>>>> I bumped the RISC-V Linux kernel CI to use qemu 8.2.0, and realized that
>>>> thead c906 didn't boot anymore. Bisection points to commit d6a427e2c0b2
>>>> ("target/riscv/cpu.c: restrict 'marchid' value")
>>>>
>>>> Reverting that commit, or the hack below solves the boot issue:
>>>>
>>>> --8<--
>>>> diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
>>>> index 8cbfc7e781ad..e18596c8a55a 100644
>>>> --- a/target/riscv/cpu.c
>>>> +++ b/target/riscv/cpu.c
>>>> @@ -505,6 +505,9 @@ static void rv64_thead_c906_cpu_init(Object *obj)
>>>>        cpu->cfg.ext_xtheadsync = true;
>>>>    
>>>>        cpu->cfg.mvendorid = THEAD_VENDOR_ID;
>>>> +    cpu->cfg.marchid = ((QEMU_VERSION_MAJOR << 16) |
>>>> +                        (QEMU_VERSION_MINOR << 8)  |
>>>> +                        (QEMU_VERSION_MICRO));
>>>>    #ifndef CONFIG_USER_ONLY
>>>>        set_satp_mode_max_supported(cpu, VM_1_10_SV39);
>>>>    #endif
>>>> --8<--
>>>>
>>>> I'm unsure what the correct qemu way of adding a default value is,
>>>> or if c906 should have a proper marchid.
>>>
>>> In case you need to set a 'marchid' different than zero for c906, this hack would
>>> be a proper fix. As mentioned in the commit msg of the patch you mentioned:
>>>
>>> "Named CPUs should set 'marchid' to a meaningful value instead, and generic
>>>    CPUs can set to any valid value."
>>>
>>> That means that any specific marchid value that the CPU uses must to be set
>>> in its own cpu_init() function.
>> 
>> Got it. Thanks, Daniel!
>> 
>> For completeness (since it came up on the weekly PW call); Conor pointed
>> out that zero *is* indeed the right marchid for c906, and in fact, the
>> non-zero marchid pre commit d6a427e2c0b2 was incorrect.
>> 
>> Post commit d6a427e2c0b2, the correct alternative is picked up, and
>> ERRATA_THEAD_PBMT (using non-standard memory type bits in
>> page-table-entries) kicks in. AFAIU, that's not implemented by qemu's
>> c906 support, which then traps.
>
>
> This looks like a very good reason to actually push what you called 'hack' as
> a fix. Yeah, in theory that commit did nothing wrong, but the side effect
> (missing support for non-standard memory type bits) is kind of a QEMU problem.

For me, it'd be weird to add the hack (setting marchid to non-zero).
Claiming that it's a "thead-c906 emulation" in qemu, but w/o the proper
page-bit support. That's just cpu rv64 plus some extra instructions --
not the c906.


Björn
Re: qemu riscv, thead c906, Linux boot regression
Posted by Conor Dooley 10 months, 1 week ago
On Wed, Jan 24, 2024 at 01:49:51PM +0100, Björn Töpel wrote:
> Hi!
> 
> I bumped the RISC-V Linux kernel CI to use qemu 8.2.0, and realized that
> thead c906 didn't boot anymore. Bisection points to commit d6a427e2c0b2
> ("target/riscv/cpu.c: restrict 'marchid' value")
> 
> Reverting that commit, or the hack below solves the boot issue:
> 
> --8<--
> diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
> index 8cbfc7e781ad..e18596c8a55a 100644
> --- a/target/riscv/cpu.c
> +++ b/target/riscv/cpu.c
> @@ -505,6 +505,9 @@ static void rv64_thead_c906_cpu_init(Object *obj)
>      cpu->cfg.ext_xtheadsync = true;
>  
>      cpu->cfg.mvendorid = THEAD_VENDOR_ID;
> +    cpu->cfg.marchid = ((QEMU_VERSION_MAJOR << 16) |
> +                        (QEMU_VERSION_MINOR << 8)  |
> +                        (QEMU_VERSION_MICRO));
>  #ifndef CONFIG_USER_ONLY
>      set_satp_mode_max_supported(cpu, VM_1_10_SV39);
>  #endif
> --8<--
> 
> I'm unsure what the correct qemu way of adding a default value is,
> or if c906 should have a proper marchid.

The "correct" marchid/mimpid values for the c906 are zero.

I haven't looked into the code at all, so I am "assuming" that it is
being zero intialised at present. Linux applies the errata fixups for
the c906 when archid and impid are both zero - so your patch will avoid
these fixups being applied.
Do you think that perhaps the emulation in QEMU does not support what
the kernel uses once then errata fixups are enabled?

> 
> Maybe Christoph or Zhiwei can answer?
> 
> qemu command-line:
> qemu-system-riscv64 -nodefaults -nographic -machine virt,acpi=off \
>    -cpu thead-c906 ...
> 
> 
> Thanks,
> Björn
> 
Re: qemu riscv, thead c906, Linux boot regression
Posted by Björn Töpel 10 months, 1 week ago
Conor Dooley <conor@kernel.org> writes:

> On Wed, Jan 24, 2024 at 01:49:51PM +0100, Björn Töpel wrote:
>> Hi!
>> 
>> I bumped the RISC-V Linux kernel CI to use qemu 8.2.0, and realized that
>> thead c906 didn't boot anymore. Bisection points to commit d6a427e2c0b2
>> ("target/riscv/cpu.c: restrict 'marchid' value")
>> 
>> Reverting that commit, or the hack below solves the boot issue:
>> 
>> --8<--
>> diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
>> index 8cbfc7e781ad..e18596c8a55a 100644
>> --- a/target/riscv/cpu.c
>> +++ b/target/riscv/cpu.c
>> @@ -505,6 +505,9 @@ static void rv64_thead_c906_cpu_init(Object *obj)
>>      cpu->cfg.ext_xtheadsync = true;
>>  
>>      cpu->cfg.mvendorid = THEAD_VENDOR_ID;
>> +    cpu->cfg.marchid = ((QEMU_VERSION_MAJOR << 16) |
>> +                        (QEMU_VERSION_MINOR << 8)  |
>> +                        (QEMU_VERSION_MICRO));
>>  #ifndef CONFIG_USER_ONLY
>>      set_satp_mode_max_supported(cpu, VM_1_10_SV39);
>>  #endif
>> --8<--
>> 
>> I'm unsure what the correct qemu way of adding a default value is,
>> or if c906 should have a proper marchid.
>
> The "correct" marchid/mimpid values for the c906 are zero.

Ok! Thanks for clearing that up for me.

> I haven't looked into the code at all, so I am "assuming" that it is
> being zero intialised at present. Linux applies the errata fixups for
> the c906 when archid and impid are both zero - so your patch will avoid
> these fixups being applied.

I'm also assuming 0, -- will double-check. Hmm, that means that the
*previous* marchid was incorrect (pre d6a427e2c0b2).

> Do you think that perhaps the emulation in QEMU does not support what
> the kernel uses once then errata fixups are enabled?

Did a quick look at the c906 "in_asm,int" logs:

| 0x80201040:  12000073          sfence.vma              zero,zero
| 0x80201044:  18051073          csrrw                   zero,satp,a0
| 
| riscv_cpu_do_interrupt: hart:0, async:0, cause:000000000000000c, epc:0x0000000080201048, tval:0x0000000080201048, desc=exec_page_fault
| riscv_cpu_do_interrupt: hart:0, async:0, cause:000000000000000c, epc:0xffffffff80001048, tval:0xffffffff80001048, desc=exec_page_fault
| ...cont forever

So it looks like we're tripping over the page tables, when we're turning
on paging.

Hmm, maybe it's not qemu, but the c906 that has been broken for a while?

I'll disable it temporarily from CI anyhow, and will continue digging.


Thanks for the pointers/clarifications, Conor!
Björn
Re: qemu riscv, thead c906, Linux boot regression
Posted by Conor Dooley 10 months, 1 week ago
On Wed, Jan 24, 2024 at 02:27:10PM +0100, Björn Töpel wrote:
> Conor Dooley <conor@kernel.org> writes:
> 
> > On Wed, Jan 24, 2024 at 01:49:51PM +0100, Björn Töpel wrote:
> >> Hi!
> >> 
> >> I bumped the RISC-V Linux kernel CI to use qemu 8.2.0, and realized that
> >> thead c906 didn't boot anymore. Bisection points to commit d6a427e2c0b2
> >> ("target/riscv/cpu.c: restrict 'marchid' value")
> >> 
> >> Reverting that commit, or the hack below solves the boot issue:
> >> 
> >> --8<--
> >> diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
> >> index 8cbfc7e781ad..e18596c8a55a 100644
> >> --- a/target/riscv/cpu.c
> >> +++ b/target/riscv/cpu.c
> >> @@ -505,6 +505,9 @@ static void rv64_thead_c906_cpu_init(Object *obj)
> >>      cpu->cfg.ext_xtheadsync = true;
> >>  
> >>      cpu->cfg.mvendorid = THEAD_VENDOR_ID;
> >> +    cpu->cfg.marchid = ((QEMU_VERSION_MAJOR << 16) |
> >> +                        (QEMU_VERSION_MINOR << 8)  |
> >> +                        (QEMU_VERSION_MICRO));
> >>  #ifndef CONFIG_USER_ONLY
> >>      set_satp_mode_max_supported(cpu, VM_1_10_SV39);
> >>  #endif
> >> --8<--
> >> 
> >> I'm unsure what the correct qemu way of adding a default value is,
> >> or if c906 should have a proper marchid.
> >
> > The "correct" marchid/mimpid values for the c906 are zero.
> 
> Ok! Thanks for clearing that up for me.
> 
> > I haven't looked into the code at all, so I am "assuming" that it is
> > being zero intialised at present. Linux applies the errata fixups for
> > the c906 when archid and impid are both zero - so your patch will avoid
> > these fixups being applied.
> 
> I'm also assuming 0, -- will double-check. Hmm, that means that the
> *previous* marchid was incorrect (pre d6a427e2c0b2).
> 
> > Do you think that perhaps the emulation in QEMU does not support what
> > the kernel uses once then errata fixups are enabled?
> 
> Did a quick look at the c906 "in_asm,int" logs:
> 
> | 0x80201040:  12000073          sfence.vma              zero,zero
> | 0x80201044:  18051073          csrrw                   zero,satp,a0
> | 
> | riscv_cpu_do_interrupt: hart:0, async:0, cause:000000000000000c, epc:0x0000000080201048, tval:0x0000000080201048, desc=exec_page_fault
> | riscv_cpu_do_interrupt: hart:0, async:0, cause:000000000000000c, epc:0xffffffff80001048, tval:0xffffffff80001048, desc=exec_page_fault
> | ...cont forever
> 
> So it looks like we're tripping over the page tables, when we're turning
> on paging.
> 
> Hmm, maybe it's not qemu, but the c906 that has been broken for a while?

I didn't know what you mean by "not qemu, but the c906", so I went and
boot tested my d1 nezha. On today's next (6.8.0-rc1-next-20240124) it
booted into my initramfs with no problems. Obivously though my config is
unlikely to match yours, but that seems like a core thing that should be
hit regardless of config.
So perhaps this is a c906-in-QEMU problem? Lacking emulation for
something the kernel uses perhaps? I know nothing about the capabilities
of its emulation in QEMU, so I am of no help.

Cheers,
Conor.

> 
> I'll disable it temporarily from CI anyhow, and will continue digging.
> 
> 
> Thanks for the pointers/clarifications, Conor!
> Björn
> 
> _______________________________________________
> linux-riscv mailing list
> linux-riscv@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv