[v4] Fix accurate exception reporting in SPARC assembly

[PATCH v4 2/5] sparc: fix accurate exception reporting in copy_{from_to}_user for UltraSPARC III

Posted by Michael Karcher 4 weeks ago

Anthony Yznaga tracked down that a BUG_ON in ext4 code with large folios
enabled resulted from copy_from_user() returning impossibly large values
greater than the size to be copied. This led to __copy_from_iter()
returning impossible values instead of the actual number of bytes it was
able to copy.

The BUG_ON has been reported in
https://lore.kernel.org/r/b14f55642207e63e907965e209f6323a0df6dcee.camel@physik.fu-berlin.de

The referenced commit introduced exception handlers on user-space memory
references in copy_from_user and copy_to_user. These handlers return from
the respective function and calculate the remaining bytes left to copy
using the current register contents. The exception handlers expect that
%o2 has already been masked during the bulk copy loop, but the masking was
performed after that loop. Move the masking in front of the loop, which
fixes the return value of copy_from_user and copy_to_user in the faulting
case. The behaviour of memcpy stays unchanged.
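
The effect of the late masking can be illustrated with a small, simplified
model in C. This is only a sketch under stated assumptions, not the code in
U3memcpy.S: it assumes the handler reports "bytes the 64-byte block loop
still owes" plus the current %o2, it ignores the alignment work done before
the loop, and all function names are hypothetical.

```c
#include <assert.h>

/*
 * Hypothetical model of a fault handler for the 64-byte block loop:
 * bytes not copied = bytes the loop still owes + whatever is in %o2.
 * The real handlers in U3memcpy.S may add further constants.
 */
static unsigned long remaining_on_fault(unsigned long block_bytes_left,
					unsigned long o2)
{
	return block_bytes_left + o2;
}

/* Before the fix: %o2 still holds the full length inside the loop. */
static unsigned long reported_before_fix(unsigned long len,
					 unsigned long copied)
{
	unsigned long block_bytes = len & ~0x3fUL;

	/* The model only covers faults on a 64-byte block boundary. */
	assert(copied % 64 == 0 && copied <= block_bytes);
	return remaining_on_fault(block_bytes - copied, len);
}

/* After the fix: %o2 is masked to the sub-block tail before the loop. */
static unsigned long reported_after_fix(unsigned long len,
					unsigned long copied)
{
	unsigned long block_bytes = len & ~0x3fUL;

	assert(copied % 64 == 0 && copied <= block_bytes);
	return remaining_on_fault(block_bytes - copied, len & 0x3f);
}
```

In this model, a 1000-byte copy that faults after 128 bytes reports 1832
uncopied bytes with the late masking, which is more than the requested
length, and 872 (that is, 1000 - 128) with the masking done first.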

Fixes: ee841d0aff64 ("sparc64: Convert U3copy_{from,to}_user to accurate exception reporting.")
Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> # on Sun Netra 240
Reviewed-by: Anthony Yznaga <anthony.yznaga@oracle.com>
Tested-by: René Rebe <rene@exactcode.com> # on UltraSparc III+ and UltraSparc IIIi
Signed-off-by: Michael Karcher <kernel@mkarcher.dialup.fu-berlin.de>
---
 arch/sparc/lib/U3memcpy.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/sparc/lib/U3memcpy.S b/arch/sparc/lib/U3memcpy.S
index 9248d59c734ce200f1f55e6d9913277f18715a87..bace3a18f836f1428ae0ed72b27aa1e00374089e 100644
--- a/arch/sparc/lib/U3memcpy.S
+++ b/arch/sparc/lib/U3memcpy.S
@@ -267,6 +267,7 @@ FUNC_NAME:	/* %o0=dst, %o1=src, %o2=len */
 	faligndata	%f10, %f12, %f26
 	EX_LD_FP(LOAD(ldd, %o1 + 0x040, %f0), U3_retl_o2)
 
+	and		%o2, 0x3f, %o2
 	subcc		GLOBAL_SPARE, 0x80, GLOBAL_SPARE
 	add		%o1, 0x40, %o1
 	bgu,pt		%XCC, 1f
@@ -336,7 +337,6 @@ FUNC_NAME:	/* %o0=dst, %o1=src, %o2=len */
 	 * Also notice how this code is careful not to perform a
 	 * load past the end of the src buffer.
 	 */
-	and		%o2, 0x3f, %o2
 	andcc		%o2, 0x38, %g2
 	be,pn		%XCC, 2f
 	 subcc		%g2, 0x8, %g2

-- 
2.50.1

Re: [PATCH v4 2/5] sparc: fix accurate exception reporting in copy_{from_to}_user for UltraSPARC III

Posted by John Paul Adrian Glaubitz 3 weeks, 4 days ago

Hi Michael,

On Fri, 2025-09-05 at 00:03 +0200, Michael Karcher wrote:
> [...]

It looks like the fix isn't actually complete for UltraSPARC III.

There still seem to be edge cases where this bug is triggered. It happens
when configuring the systemd-timesyncd package and is reproducible every
time:

[  125.301353] systemd-sysv-generator[1042]: Please update package to include a native systemd unit file.
[  125.424703] systemd-sysv-generator[1042]: ⚠ This compatibility logic is deprecated, expect removal soon. ⚠
[  127.206268] get_swap_device: Bad swap offset entry 808000000
[  127.354181] get_swap_device: Bad swap offset entry 808000000
[  127.449735] get_swap_device: Bad swap offset entry 808000000
[  127.553698] get_swap_device: Bad swap offset entry 808000000
[  127.701748] get_swap_device: Bad swap offset entry 808000000
[  127.821914] get_swap_device: Bad swap offset entry 808000000
[  127.939392] Unable to handle kernel paging request at virtual address 00000001108ca000
[  128.043605] tsk->{mm,active_mm}->context = 0000000000000555
[  128.116890] tsk->{mm,active_mm}->pgd = fff0000009fd0000
[  128.185604]               \|/ ____ \|/
[  128.185604]               "@'/ .. \`@"
[  128.185604]               /_| \__/ |_\
[  128.185604]                  \__U_/
[  128.378914] systemd-tty-ask(1054): Oops [#1]
[  128.435046] CPU: 0 UID: 0 PID: 1054 Comm: systemd-tty-ask Not tainted 6.17.0-rc4+ #11 NONE 
[  128.544945] TSTATE: 0000000011001606 TPC: 00000000007a5800 TNPC: 00000000007a5804 Y: 00000000    Not tainted
[  128.674196] TPC: <lookup_swap_cgroup_id+0x40/0x80>
[  128.737194] g0: fff000023f800040 g1: 0000000010000000 g2: 00000001008ca000 g3: 000000000153a8b8
[  128.851572] g4: fff0000008d1b700 g5: fff000023e336000 g6: fff00000140f4000 g7: fff0000101934000
[  128.965946] o0: fff0000008e6c180 o1: 0000000000000000 o2: 0000000000001000 o3: 0000000000000001
[  129.080321] o4: 00000000000001ff o5: 0000000000000555 sp: fff00000140f6c81 ret_pc: 0000000000000000
[  129.199272] RPC: <0x0>
[  129.230149] l0: 0000000000000000 l1: fff0000008e6c180 l2: 0000000000000000 l3: 03ffffffffffffff
[  129.344528] l4: 0000000000000004 l5: 0000000000000000 l6: 0000000000000001 l7: 0000000000000014
[  129.458902] i0: 0000000080000000 i1: fff0000101900000 i2: fff00000140f75d8 i3: ffffffffffffffff
[  129.573283] i4: 0000000000001000 i5: 0000000000000000 i6: fff00000140f6d31 i7: 00000000007173e0
[  129.687653] I7: <swap_pte_batch+0x40/0x160>
[  129.742653] Call Trace:
[  129.774671] [<00000000007173e0>] swap_pte_batch+0x40/0x160
[  129.846733] [<0000000000719998>] unmap_page_range+0x718/0x1200
[  129.923366] [<000000000071a4f8>] unmap_single_vma.constprop.0+0x78/0xe0
[  130.010289] [<000000000071a5b0>] unmap_vmas+0x50/0x160
[  130.077767] [<00000000007288bc>] exit_mmap+0xbc/0x460
[  130.144108] [<000000000047aec4>] mmput+0x64/0x180
[  130.205867] [<0000000000483b38>] do_exit+0x218/0xb80
[  130.271067] [<0000000000484664>] do_group_exit+0x24/0xa0
[  130.340830] [<0000000000494848>] get_signal+0x948/0x9a0
[  130.409458] [<000000000043eb68>] do_notify_resume+0xc8/0x5c0
[  130.483802] [<0000000000404b48>] __handle_signal+0xc/0x30
[  130.554715] Disabling lock debugging due to kernel taint
[  130.624483] Caller[00000000007173e0]: swap_pte_batch+0x40/0x160
[  130.702257] Caller[0000000000719998]: unmap_page_range+0x718/0x1200
[  130.784610] Caller[000000000071a4f8]: unmap_single_vma.constprop.0+0x78/0xe0
[  130.877252] Caller[000000000071a5b0]: unmap_vmas+0x50/0x160
[  130.950452] Caller[00000000007288bc]: exit_mmap+0xbc/0x460
[  131.022508] Caller[000000000047aec4]: mmput+0x64/0x180
[  131.089986] Caller[0000000000483b38]: do_exit+0x218/0xb80
[  131.160901] Caller[0000000000484664]: do_group_exit+0x24/0xa0
[  131.236387] Caller[0000000000494848]: get_signal+0x948/0x9a0
[  131.310736] Caller[000000000043eb68]: do_notify_resume+0xc8/0x5c0
[  131.390795] Caller[0000000000404b48]: __handle_signal+0xc/0x30
[  131.467427] Caller[fff0000101600238]: 0xfff0000101600238
[  131.537197] Instruction DUMP:
[  131.537201]  c458c002 
[  131.576079]  83287002 
[  131.606963]  b12e2004 
[  131.637839] <c2008001>
[  131.668723]  b1304018 
[  131.699603]  b12e3030 
[  131.730486]  81cfe008 
[  131.761364]  91323030 
[  131.792249]  b0102000 
[  131.823130] 
[  131.873450] Fixing recursive fault but reboot is needed!
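
As a cross-check on the oops above, the bracketed word in the instruction
dump, c2008001, can be decoded by hand. The sketch below is illustrative C,
not kernel code; the struct and function names are made up for this example.
It splits a SPARC format-3 instruction word into its fields and, for the
register+register form, adds the two source registers.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a SPARC format-3 (load/store/arithmetic) instruction word. */
struct sparc_fmt3 {
	unsigned op;	/* bits 31:30 */
	unsigned rd;	/* bits 29:25 */
	unsigned op3;	/* bits 24:19 */
	unsigned rs1;	/* bits 18:14 */
	unsigned i;	/* bit 13: 0 = rs2 operand, 1 = immediate */
	unsigned rs2;	/* bits 4:0, only meaningful when i == 0 */
};

static struct sparc_fmt3 decode_fmt3(uint32_t insn)
{
	struct sparc_fmt3 f;

	f.op  = insn >> 30;
	f.rd  = (insn >> 25) & 0x1f;
	f.op3 = (insn >> 19) & 0x3f;
	f.rs1 = (insn >> 14) & 0x1f;
	f.i   = (insn >> 13) & 0x1;
	f.rs2 = insn & 0x1f;
	return f;
}

/*
 * Effective address for the register+register form only.
 * regs[] is indexed g0..g7, o0..o7, l0..l7, i0..i7; the hard-wired
 * zero of %g0 is not modelled here.
 */
static uint64_t effective_address(struct sparc_fmt3 f, const uint64_t regs[32])
{
	assert(f.i == 0);
	return regs[f.rs1] + regs[f.rs2];
}
```

For c2008001 this yields op = 3 and op3 = 0 (a 32-bit load) with rs1 = %g2,
rs2 = %g1 and rd = %g1. With the dumped values g2 = 00000001008ca000 and
g1 = 0000000010000000, the sum is 00000001108ca000, which matches the
address in the "Unable to handle kernel paging request" line.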

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-    GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913

Re: [PATCH v4 2/5] sparc: fix accurate exception reporting in copy_{from_to}_user for UltraSPARC III

Posted by John Paul Adrian Glaubitz 3 weeks, 4 days ago

Hi,

On Sun, 2025-09-07 at 19:02 +0200, John Paul Adrian Glaubitz wrote:
> It looks like the fix isn't actually complete for UltraSPARC III.
> 
> There still seem to be edge-cases where this bug is triggered and that
> actually happens when configuring the systemd-timesyncd package and it's
> reproducible in 100% of the cases:
> 
> [...]

Off-list, Michael suggested switching to the generic copy_{to,from}_user
code to verify this:

diff --git a/arch/sparc/kernel/head_64.S b/arch/sparc/kernel/head_64.S
index c305486501dc..cd1a96a918b3 100644
--- a/arch/sparc/kernel/head_64.S
+++ b/arch/sparc/kernel/head_64.S
@@ -687,7 +687,7 @@ cheetah_tlb_fixup:
        stw     %g2, [%g1 + %lo(tlb_type)]
 
        /* Patch copy/page operations to cheetah optimized versions. */
-       call    cheetah_patch_copyops
+       call    generic_patch_copyops
         nop
        call    cheetah_patch_copy_page
         nop

The kernel still crashes, even when using the generic code.

So, this particular issue is not rooted in the U3_copy_{to,from}_user code.

Adrian


Re: [PATCH v4 2/5] sparc: fix accurate exception reporting in copy_{from_to}_user for UltraSPARC III

Posted by John Paul Adrian Glaubitz 3 weeks, 4 days ago

Hi,

On Sun, 2025-09-07 at 19:49 +0200, John Paul Adrian Glaubitz wrote:
> Michael suggested switching to the generic copy_{to,from}_user code offlist
> to verify this:
> 
> diff --git a/arch/sparc/kernel/head_64.S b/arch/sparc/kernel/head_64.S
> index c305486501dc..cd1a96a918b3 100644
> --- a/arch/sparc/kernel/head_64.S
> +++ b/arch/sparc/kernel/head_64.S
> @@ -687,7 +687,7 @@ cheetah_tlb_fixup:
>         stw     %g2, [%g1 + %lo(tlb_type)]
>  
>         /* Patch copy/page operations to cheetah optimized versions. */
> -       call    cheetah_patch_copyops
> +       call    generic_patch_copyops
>          nop
>         call    cheetah_patch_copy_page
>          nop
> 
> The kernel still crashes, even when using the generic code.
> 
> So, this particular issue is not rooted in the U3_copy_{to,from}_user code.

Replacing "call cheetah_patch_copy_page" with a nop doesn't help either:

diff --git a/arch/sparc/kernel/head_64.S b/arch/sparc/kernel/head_64.S
index c305486501dc..ed859bae5175 100644
--- a/arch/sparc/kernel/head_64.S
+++ b/arch/sparc/kernel/head_64.S
@@ -689,7 +689,7 @@ cheetah_tlb_fixup:
        /* Patch copy/page operations to cheetah optimized versions. */
        call    cheetah_patch_copyops
         nop
-       call    cheetah_patch_copy_page
+        nop
         nop
        call    cheetah_patch_cachetlbops
         nop

[  140.207051] systemd-sysv-generator[1037]: SysV service '/etc/init.d/buildd' lacks a native systemd unit file, automatically generating a unit file for compatibility.
[  140.401791] systemd-sysv-generator[1037]: Please update package to include a native systemd unit file.
[  140.525028] systemd-sysv-generator[1037]: ⚠ This compatibility logic is deprecated, expect removal soon. ⚠
[  147.718747] systemd-sysv-generator[1093]: SysV service '/etc/init.d/buildd' lacks a native systemd unit file, automatically generating a unit file for compatibility.
[  147.913402] systemd-sysv-generator[1093]: Please update package to include a native systemd unit file.
[  148.036530] systemd-sysv-generator[1093]: ⚠ This compatibility logic is deprecated, expect removal soon. ⚠
[  149.208409] Unable to handle kernel NULL pointer dereference
[  149.282820] tsk->{mm,active_mm}->context = 00000000000000ab
[  149.356117] tsk->{mm,active_mm}->pgd = fff0000008830000
[  149.424819]               \|/ ____ \|/
[  149.424819]               "@'/ .. \`@"
[  149.424819]               /_| \__/ |_\
[  149.424819]                  \__U_/
[  149.618139] systemd(1): Oops [#1]
[  149.661684] CPU: 0 UID: 0 PID: 1 Comm: systemd Not tainted 6.17.0-rc4+ #16 NONE 
[  149.758917] TSTATE: 0000004411001602 TPC: 00000000006260a4 TNPC: 00000000006260a8 Y: ffffffff    Not tainted
[  149.888258] TPC: <bpf_patch_insn_data+0x204/0x2e0>
[  149.951255] g0: 0000000000000000 g1: 0000000000000000 g2: 0000000000000036 g3: fff0000012178b28
[  150.065638] g4: fff0000000236300 g5: fff000023e336000 g6: fff000000026c000 g7: 0000000000000001
[  150.180010] o0: 0000000100880000 o1: 0000000000000000 o2: 0000000000000001 o3: 0000000000000001
[  150.294387] o4: fff00000046f42a0 o5: 0000000000000001 sp: fff000000026efb1 ret_pc: 0000000000626058
[  150.413336] RPC: <bpf_patch_insn_data+0x1b8/0x2e0>
[  150.476236] l0: fff0000012178000 l1: 0000000100874048 l2: 0000000000000001 l3: 0000000100880000
[  150.590616] l4: 0000000100874068 l5: 0000000000000005 l6: 0000000000000000 l7: fff000001217e128
[  150.704994] i0: 0000000100874000 i1: 0000000000000004 i2: 0000000000000005 i3: 0000000000000002
[  150.819434] i4: 0000000100888000 i5: fff0000012178ae8 i6: fff000000026f061 i7: 000000000064b0e8
[  150.933878] I7: <bpf_check+0x1988/0x34a0>
[  150.986575] Call Trace:
[  151.018687] [<000000000064b0e8>] bpf_check+0x1988/0x34a0
[  151.088456] [<000000000061bf2c>] bpf_prog_load+0x8ec/0xc80
[  151.160510] [<000000000061db44>] __sys_bpf+0xd04/0x25a0
[  151.229138] [<000000000061f9f8>] sys_bpf+0x18/0x60
[  151.292041] [<0000000000406274>] linux_sparc_syscall+0x34/0x44
[  151.368678] Disabling lock debugging due to kernel taint
[  151.438440] Caller[000000000064b0e8]: bpf_check+0x1988/0x34a0
[  151.513936] Caller[000000000061bf2c]: bpf_prog_load+0x8ec/0xc80
[  151.591704] Caller[000000000061db44]: __sys_bpf+0xd04/0x25a0
[  151.666051] Caller[000000000061f9f8]: sys_bpf+0x18/0x60
[  151.734676] Caller[0000000000406274]: linux_sparc_syscall+0x34/0x44
[  151.817025] Caller[fff000010099b80c]: 0xfff000010099b80c
[  151.886791] Instruction DUMP:
[  151.886795]  326ffffa 
[  151.925677]  c4004000 
[  151.956558]  c25e2038 
[  151.987440] <c4006118>
[  152.018326]  80a0a000 
[  152.049204]  04400014 
[  152.080083]  c2586100 
[  152.110960]  8400bfff 
[  152.141845]  8e00606c 
[  152.172726] 
[  152.223054] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
[  152.323706] Press Stop-A (L1-A) from sun keyboard or send break
[  152.323706] twice on console to return to the boot prom
[  152.470098] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 ]---

Replacing all calls with nops already triggers crashes during boot:

diff --git a/arch/sparc/kernel/head_64.S b/arch/sparc/kernel/head_64.S
index c305486501dc..1e2737649d46 100644
--- a/arch/sparc/kernel/head_64.S
+++ b/arch/sparc/kernel/head_64.S
@@ -687,11 +687,11 @@ cheetah_tlb_fixup:
        stw     %g2, [%g1 + %lo(tlb_type)]
 
        /* Patch copy/page operations to cheetah optimized versions. */
-       call    cheetah_patch_copyops
         nop
-       call    cheetah_patch_copy_page
         nop
-       call    cheetah_patch_cachetlbops
+        nop
+        nop
+        nop
         nop
 
        ba,a,pt %xcc, tlb_fixup_done

[   42.061355] decompression failed with status 7
[   42.172976] SCSI subsystem initialized
[   42.254511] decompression failed with status 7
[   42.462903] Unable to handle kernel NULL pointer dereference
[   42.537392] tsk->{mm,active_mm}->context = 000000000000004d
[   42.610625] tsk->{mm,active_mm}->pgd = fff0000008954000
[   42.679246]               \|/ ____ \|/
[   42.679246]               "@'/ .. \`@"
[   42.679246]               /_| \__/ |_\
[   42.679246]                  \__U_/
[   42.872571] (udev-worker)(96): Oops [#1]
[   42.924111] CPU: 0 UID: 0 PID: 96 Comm: (udev-worker) Not tainted 6.17.0-rc4+ #14 NONE 
[   43.029343] TSTATE: 0000000011001601 TPC: 0000000000f6875c TNPC: 0000000000f68760 Y: 00000000    Not tainted
[   43.158584] TPC: <strcmp+0x1c/0x60>
[   43.204430] g0: 0000000000000000 g1: 0000000000000000 g2: 000000000000006f g3: 000000001001a130
[   43.318825] g4: fff000000873cd00 g5: fff000023e336000 g6: fff00000088c4000 g7: 000000001001a058
[   43.433291] o0: 00000009e2fc2857 o1: 0000000000000000 o2: 0000000000000001 o3: 0000000000000000
[   43.547667] o4: 0000000000000dc0 o5: 0000000000000dc0 sp: fff00000088c6f21 ret_pc: 00000000005a1fe4
[   43.666617] RPC: <trace_clock_local+0x4/0x20>
[   43.723797] l0: fff000000004c798 l1: 0000000000000001 l2: fff000000004c510 l3: 0000000000000000
[   43.838177] l4: 0000000000000000 l5: 00000000014da748 l6: 00000000012a7ef8 l7: 0000000000000000
[   43.952553] i0: 0000000010076e97 i1: 0000000000000000 i2: 00000000015370f8 i3: 0000000000000000
[   44.066928] i4: 0000000000000000 i5: 0000000000000dc0 i6: fff00000088c6fd1 i7: 000000000053a2d0
[   44.181303] I7: <cmp_name+0x10/0x20>
[   44.228190] Call Trace:
[   44.260219] [<000000000053a2d0>] cmp_name+0x10/0x20
[   44.324268] [<0000000000a20dc0>] bsearch+0x20/0x60
[   44.387173] [<000000000053a45c>] find_exported_symbol_in_section+0x5c/0xc0
[   44.477532] [<000000000053ba50>] find_symbol+0xd0/0x160
[   44.546153] [<000000000053e76c>] load_module+0x1acc/0x22c0
[   44.618211] [<000000000053f16c>] init_module_from_file+0x6c/0xc0
[   44.697130] [<000000000053f3cc>] sys_finit_module+0x1ac/0x300
[   44.772618] [<0000000000406274>] linux_sparc_syscall+0x34/0x44
[   44.849248] Disabling lock debugging due to kernel taint
[   44.919018] Caller[000000000053a2d0]: cmp_name+0x10/0x20
[   44.988784] Caller[0000000000a20dc0]: bsearch+0x20/0x60
[   45.057412] Caller[000000000053a45c]: find_exported_symbol_in_section+0x5c/0xc0
[   45.153487] Caller[000000000053ba50]: find_symbol+0xd0/0x160
[   45.227831] Caller[000000000053e76c]: load_module+0x1acc/0x22c0
[   45.305604] Caller[000000000053f16c]: init_module_from_file+0x6c/0xc0
[   45.390244] Caller[000000000053f3cc]: sys_finit_module+0x1ac/0x300
[   45.471447] Caller[0000000000406274]: linux_sparc_syscall+0x34/0x44
[   45.553799] Caller[fff000010470e2fc]: 0xfff000010470e2fc
[   45.623566] Instruction DUMP:
[   45.623569]  2240000b 
[   45.662452]  b0102000 
[   45.693333]  c40e0001 
[   45.724211] <c60e4001>
[   45.755093]  80a08003 
[   45.785978]  024ffffa 
[   45.816857]  82006001 
[   45.847737]  b0102001 
[   45.878620]  b16567ff 
[   45.909502] 
[   63.467354] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[   63.545187] rcu:     (detected by 0, t=5252 jiffies, g=-87, q=21 ncpus=1)
[   63.630966] rcu: All QSes seen, last rcu_sched kthread activity 5252 (4294906056-4294900804), jiffies_till_next_fqs=1, root ->qsmask 0x0
[   63.792241] rcu: rcu_sched kthread starved for 5252 jiffies! g-87 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
[   63.922625] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[   64.040431] rcu: RCU grace-period kthread stack dump:
[   64.106766] task:rcu_sched       state:R  running task     stack:0     pid:15    tgid:15    ppid:2      task_flags:0x208040 flags:0x07000000
[   64.272615] Call Trace:
[   64.304632] [<0000000000f8857c>] schedule+0x1c/0x180
[   64.369827] [<0000000000f8f61c>] schedule_timeout+0x5c/0xe0
[   64.443027] [<0000000000529550>] rcu_gp_fqs_loop+0x130/0x540
[   64.517372] [<000000000052e6f4>] rcu_gp_kthread+0x174/0x200
[   64.590571] [<00000000004aa700>] kthread+0xe0/0x280
[   64.654620] [<00000000004060c8>] ret_from_fork+0x1c/0x2c
[   64.724391] [<0000000000000000>] 0x0
[   64.771284] rcu: Stack dump where RCU GP kthread last ran:
[   64.843343] CPU: 0 UID: 0 PID: 96 Comm: (udev-worker) Tainted: G      D             6.17.0-rc4+ #14 NONE 
[   64.969158] Tainted: [D]=DIE
[   65.006896] TSTATE: 0000008080001606 TPC: 00000000007a0fa0 TNPC: 00000000007a0fa4 Y: 00000000    Tainted: G      D            
[   65.156733] TPC: <count_memcg_events+0x100/0x200>
[   65.218489] g0: fff000023f804f78 g1: 00000000014e1f00 g2: 00000000014e6340 g3: 00000000014e2100
[   65.332869] g4: fff000000873cd00 g5: fff000023e336000 g6: fff00000088c4000 g7: fff000023f81c350
[   65.447243] o0: fff000000025a880 o1: 0000000000000000 o2: fff000000825a0c8 o3: 80000002026d6fb2
[   65.561617] o4: 0000000000000000 o5: 0000000000000000 sp: fff00000088c6491 ret_pc: 00000000007a0f94
[   65.680568] RPC: <count_memcg_events+0xf4/0x200>
[   65.741182] l0: 0000000000100073 l1: fff000023f804f38 l2: fff000023f804f78 l3: fff000023f804fb8
[   65.855563] l4: 0000000000000000 l5: 0000000000000005 l6: 0000000000000000 l7: 0000000000000008
[   65.969937] i0: fff000000025a880 i1: 000000000000000e i2: 0000000000000001 i3: fff0000008256820
[   66.084313] i4: 0000000000000001 i5: 00000000014f9000 i6: fff00000088c6541 i7: 0000000000722890
[   66.198686] I7: <handle_mm_fault+0x190/0x2e0>
[   66.255870] Call Trace:
[   66.287895] [<0000000000722890>] handle_mm_fault+0x190/0x2e0
[   66.362241] [<0000000000f92e00>] do_sparc64_fault+0x6c0/0xb20
[   66.437727] [<0000000000407714>] sparc64_realfault_common+0x10/0x20
[   66.520077] [<0000000000562070>] exit_robust_list+0x10/0x120
[   66.594422] [<0000000000562710>] futex_exit_release+0x70/0xc0
[   66.669910] [<000000000047b48c>] exit_mm_release+0xc/0x40
[   66.740821] [<0000000000483ab8>] do_exit+0x198/0xb80
[   66.806014] [<0000000000484528>] make_task_dead+0x88/0x160
[   66.878070] [<0000000000428374>] die_if_kernel+0x260/0x26c
[   66.950126] [<0000000000f9271c>] unhandled_fault+0x88/0xac
[   67.022184] [<0000000000f92af0>] do_sparc64_fault+0x3b0/0xb20
[   67.097670] [<0000000000407714>] sparc64_realfault_common+0x10/0x20
[   67.180022] [<0000000000f6875c>] strcmp+0x1c/0x60
[   67.241784] [<000000000053a2d0>] cmp_name+0x10/0x20
[   67.305833] [<0000000000a20dc0>] bsearch+0x20/0x60
[   67.368740] [<000000000053a45c>] find_exported_symbol_in_section+0x5c/0xc0

I assume that cheetah_patch_cachetlbops has to be invoked on UltraSPARC III
since there is other code depending on it. On the other hand, the TLB code
on UltraSPARC III was heavily overhauled in 2016 [1], which was followed by
a bug fix [2].

Chances are there are still bugs in the code introduced in [1].

Adrian

[1] https://github.com/torvalds/linux/commit/a74ad5e660a9ee1d071665e7e8ad822784a2dc7f
[2] https://github.com/torvalds/linux/commit/d3c976c14ad8af421134c428b0a89ff8dd3bd8f8


Re: [PATCH v4 2/5] sparc: fix accurate exception reporting in copy_{from_to}_user for UltraSPARC III

Posted by John Paul Adrian Glaubitz 3 weeks, 4 days ago

Hi,

On Sun, 2025-09-07 at 20:33 +0200, John Paul Adrian Glaubitz wrote:
> I assume that cheetah_patch_cachetlbops has to be invoked on UltraSPARC III
> since there is other code depending on it. On the other hand, the TLB code
> on UltraSPARC III was heavily overhauled in 2016 [1] which was also followed
> by a bug fix [2].
> 
> Chances are there are still bugs in the code introduced in [1].
> 
> > [1] https://github.com/torvalds/linux/commit/a74ad5e660a9ee1d071665e7e8ad822784a2dc7f
> > [2] https://github.com/torvalds/linux/commit/d3c976c14ad8af421134c428b0a89ff8dd3bd8f8

I have reverted both commits. The machine boots until it tries to start
systemd, at which point it locks up. So I guess that if there is a bug in
the TLB code, it needs to be diagnosed differently.

Adrian


Re: [PATCH v4 2/5] sparc: fix accurate exception reporting in copy_{from_to}_user for UltraSPARC III

Posted by John Paul Adrian Glaubitz 3 weeks, 3 days ago

Hi,

On Sun, 2025-09-07 at 23:31 +0200, John Paul Adrian Glaubitz wrote:
> Hi,
> 
> On Sun, 2025-09-07 at 20:33 +0200, John Paul Adrian Glaubitz wrote:
> > I assume that cheetah_patch_cachetlbops has to be invoked on UltraSPARC III
> > since there is other code depending on it. On the other hand, the TLB code
> > on UltraSPARC III was heavily overhauled in 2016 [1] which was also followed
> > by a bug fix [2].
> > 
> > Chances are there are still bugs in the code introduced in [1].
> > 
> > > [1] https://github.com/torvalds/linux/commit/a74ad5e660a9ee1d071665e7e8ad822784a2dc7f
> > > [2] https://github.com/torvalds/linux/commit/d3c976c14ad8af421134c428b0a89ff8dd3bd8f8
> 
> I have reverted both commits. The machine boots until it tries to start
> systemd when it locks up. So, I guess if there is a bug in the TLB code
> it needs to be diagnosed differently.

Another test, this time with a kernel source rebased to 6.17-rc5+, the following patch
by Anthony Yznaga applied, and CONFIG_SMP disabled:

diff --git a/arch/sparc/mm/ultra.S b/arch/sparc/mm/ultra.S
index 70e658d107e0..b323db303de1 100644
--- a/arch/sparc/mm/ultra.S
+++ b/arch/sparc/mm/ultra.S
@@ -347,6 +347,7 @@ __cheetah_flush_tlb_kernel_range:	/* 31 insns */
  	membar		#Sync
  	stxa		%g0, [%o4] ASI_IMMU_DEMAP
  	membar		#Sync
+	flush
  	retl
  	 nop
  	nop
@@ -355,7 +356,6 @@ __cheetah_flush_tlb_kernel_range:	/* 31 insns */
  	nop
  	nop
  	nop
-	nop

  #ifdef DCACHE_ALIASING_POSSIBLE
  __cheetah_flush_dcache_page: /* 11 insns */

Still crashes:

[  139.236744] tsk->{mm,active_mm}->context = 00000000000000ab
[  139.310042] tsk->{mm,active_mm}->pgd = fff0000007db8000
[  139.378747]               \|/ ____ \|/
[  139.378747]               "@'/ .. \`@"
[  139.378747]               /_| \__/ |_\
[  139.378747]                  \__U_/
[  139.572059] systemd(1): Oops [#1]
[  139.615613] CPU: 0 UID: 0 PID: 1 Comm: systemd Not tainted 6.17.0-rc5+ #19 NONE 
[  139.712832] TSTATE: 0000004411001602 TPC: 00000000005e29e4 TNPC: 00000000005e29e8 Y: 00000000    Not tainted
[  139.842076] TPC: <bpf_patch_insn_data+0x204/0x2e0>
[  139.905077] g0: ffffffffffffffff g1: 0000000000000000 g2: 0000000000000065 g3: fff0000009618b28
[  140.019460] g4: fff00000001f9500 g5: 0000000000657300 g6: fff000000022c000 g7: 0000000000000001
[  140.133837] o0: 0000000100058000 o1: 0000000000000000 o2: 0000000000000001 o3: 0000000000000002
[  140.248208] o4: fff00000045ec900 o5: 0000000000000002 sp: fff000000022f031 ret_pc: 00000000005e2998
[  140.367158] RPC: <bpf_patch_insn_data+0x1b8/0x2e0>
[  140.430057] l0: fff0000009618000 l1: 0000000100046048 l2: 0000000000000001 l3: 0000000100058000
[  140.544437] l4: 0000000100046068 l5: 0000000000000005 l6: 0000000000000000 l7: fff000000961e128
[  140.658810] i0: 0000000100046000 i1: 0000000000000004 i2: 0000000000000005 i3: 0000000000000002
[  140.773189] i4: 0000000100066000 i5: fff0000009618ae8 i6: fff000000022f0e1 i7: 0000000000607a08
[  140.887561] I7: <bpf_check+0x1988/0x34a0>
[  140.940171] Call Trace:
[  140.972191] [<0000000000607a08>] bpf_check+0x1988/0x34a0
[  141.041963] [<00000000005d862c>] bpf_prog_load+0x8ec/0xc80
[  141.114021] [<00000000005d9be4>] __sys_bpf+0x724/0x28a0
[  141.182646] [<00000000005dc338>] sys_bpf+0x18/0x60
[  141.245551] [<0000000000406174>] linux_sparc_syscall+0x34/0x44
[  141.322185] Disabling lock debugging due to kernel taint
[  141.391952] Caller[0000000000607a08]: bpf_check+0x1988/0x34a0
[  141.467440] Caller[00000000005d862c]: bpf_prog_load+0x8ec/0xc80
[  141.545212] Caller[00000000005d9be4]: __sys_bpf+0x724/0x28a0
[  141.619558] Caller[00000000005dc338]: sys_bpf+0x18/0x60
[  141.688179] Caller[0000000000406174]: linux_sparc_syscall+0x34/0x44
[  141.770535] Caller[fff000010089b80c]: 0xfff000010089b80c
[  141.840301] Instruction DUMP:
[  141.840305]  326ffffa 
[  141.879185]  c4004000 
[  141.910065]  c25e2038 
[  141.940945] <c4006108>
[  141.971827]  80a0a000 
[  142.002709]  04400014 
[  142.033589]  c25860f0 
[  142.064474]  8400bfff 
[  142.095354]  8e00606c 
[  142.126234] 
[  142.176560] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
[  142.277218] Press Stop-A (L1-A) from sun keyboard or send break
[  142.277218] twice on console to return to the boot prom
[  142.423608] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 ]---

Adrian

Re: [PATCH v4 2/5] sparc: fix accurate exception reporting in copy_{from_to}_user for UltraSPARC III

Posted by John Paul Adrian Glaubitz 3 weeks, 3 days ago

Hi,

Disabling support for Transparent Huge Pages (CONFIG_THP) avoids the crash.

Adrian

Re: [PATCH v4 2/5] sparc: fix accurate exception reporting in copy_{from_to}_user for UltraSPARC III

Posted by John Paul Adrian Glaubitz 3 weeks, 3 days ago

On Mon, 2025-09-08 at 08:47 +0200, John Paul Adrian Glaubitz wrote:
> Disabling support for Transparent Huge Pages (CONFIG_THP) avoids the crash.

Sorry, the option is called CONFIG_TRANSPARENT_HUGEPAGE, of course.

My suspicion is that it's related to the flushing for D-cache alias handling, which is
enabled for small pages only:

https://elixir.bootlin.com/linux/v6.16.5/source/arch/sparc/mm/ultra.S#L1016

and:

https://elixir.bootlin.com/linux/v6.16.5/source/arch/sparc/include/asm/page_64.h#L9

Interestingly, while running the reproducer with CONFIG_TRANSPARENT_HUGEPAGE disabled,
I'm also getting this kernel warning, but the kernel does not crash:

[  108.733686] CPU[0]: Cheetah+ D-cache parity error at TPC[00000000005d78b4]
[  108.824096] TPC<bpf_prog_load+0x394/0xc80>

Could it be that we need to enable the code guarded by DCACHE_ALIASING_POSSIBLE
unconditionally?

Adrian

Re: [PATCH v4 2/5] sparc: fix accurate exception reporting in copy_{from_to}_user for UltraSPARC III

Posted by Anthony Yznaga 3 weeks, 3 days ago


On 9/7/25 11:53 PM, John Paul Adrian Glaubitz wrote:
> On Mon, 2025-09-08 at 08:47 +0200, John Paul Adrian Glaubitz wrote:
>> Disabling support for Transparent Huge Pages (CONFIG_THP) avoids the crash.
> 
> Sorry, the option is called CONFIG_TRANSPARENT_HUGEPAGE, of course.
> 
> My suspicion is that it's related to the flushing for D-cache alias handling, which is
> enabled for small pages only:
> 
> https://elixir.bootlin.com/linux/v6.16.5/source/arch/sparc/mm/ultra.S#L1016
> 
> and:
> 
> https://elixir.bootlin.com/linux/v6.16.5/source/arch/sparc/include/asm/page_64.h#L9
> 
> Interestingly, while running the reproducer with CONFIG_TRANSPARENT_HUGEPAGE disabled,
> I'm also getting this kernel warning, but the kernel does not crash:
> 
> [  108.733686] CPU[0]: Cheetah+ D-cache parity error at TPC[00000000005d78b4]
> [  108.824096] TPC<bpf_prog_load+0x394/0xc80>
> 
> Could it be that we need to enable the code guarded by DCACHE_ALIASING_POSSIBLE
> unconditionally?

It's already essentially enabled unconditionally. PAGE_SHIFT will always 
be 13 on sparc64 systems.

The flushing should be happening for folios of any size. See
flush_dcache_folio()/flush_dcache_folio_all().

You could try setting page_poison=1 on the kernel command line to see if 
the kernel detects any freed pages being used.

Is this a different Cheetah+-based system than the one I borrowed? 
Definitely some sort of memory corruption happening, but the system I 
used seemed much more stable than this.

Anthony

Re: [PATCH v4 2/5] sparc: fix accurate exception reporting in copy_{from_to}_user for UltraSPARC III

Posted by John Paul Adrian Glaubitz 3 weeks, 2 days ago

Hi Anthony,

On Mon, 2025-09-08 at 15:47 -0700, Anthony Yznaga wrote:
> > Could it be that we need to enable the code guarded by DCACHE_ALIASING_POSSIBLE
> > unconditionally?
> 
> It's already essentially enabled unconditionally. PAGE_SHIFT will always 
> be 13 on sparc64 systems.

I see.

I was confused by this comment:

/* Flushing for D-cache alias handling is only needed if
 * the page size is smaller than 16K.
 */
#if PAGE_SHIFT < 14
#define DCACHE_ALIASING_POSSIBLE
#endif

My thinking was that a flush might be skipped when using transparent
huge pages, which would cause these crashes.
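
To make the condition in that comment concrete, here is a minimal sketch in plain C (not kernel code). It assumes that the 16K figure in the comment is the span of the D-cache that is indexed by virtual address, so with 8K pages exactly one index bit lies above the page offset. DCACHE_INDEX_SHIFT, dcache_alias_mask() and dcache_may_alias() are hypothetical names used only for illustration.

```c
#include <stdint.h>

#define PAGE_SHIFT         13  /* 8K base pages, as stated for sparc64 in this thread */
#define DCACHE_INDEX_SHIFT 14  /* assumption: the 16K threshold from the quoted comment */

#if PAGE_SHIFT < DCACHE_INDEX_SHIFT
#define DCACHE_ALIASING_POSSIBLE
#endif

/* Virtual-address bits that select a cache index but are not part of the page offset. */
static inline uint64_t dcache_alias_mask(void)
{
#ifdef DCACHE_ALIASING_POSSIBLE
	return (UINT64_C(1) << DCACHE_INDEX_SHIFT) - (UINT64_C(1) << PAGE_SHIFT);
#else
	return 0;
#endif
}

/* Two mappings of the same physical page can alias if they differ in those bits. */
static inline int dcache_may_alias(uint64_t va1, uint64_t va2)
{
	return ((va1 ^ va2) & dcache_alias_mask()) != 0;
}
```

With PAGE_SHIFT fixed at 13 the mask is never empty, which matches the statement that the guarded code is effectively always built.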

> The flushing should be happening for folios of any size. See
> flush_dcache_folio()/flush_dcache_folio_all().

OK, I'll have a look and maybe add a printk() there.

> You could try setting page_poison=1 on the kernel command line to see if 
> the kernel detects any freed pages being used.

Ah, good to know. Thanks.

> Is this a different Cheetah+-based system than the one I borrowed? 
> Definitely some sort of memory corruption happening, but the system I 
> used seemed much more stable than this.

No, it's the same Sun Netra 240. It had been running stably for days, but
upgrading to the latest systemd version caused these new reproducible crashes,
which started my new investigation.

Replacing "call cheetah_patch_copyops" with "call generic_patch_copyops" did
not help in this situation, which indicates that the bug is not in Michael's
patch set. That is a good sign, at least for Michael's work.

Adrian

Re: [PATCH v4 2/5] sparc: fix accurate exception reporting in copy_{from_to}_user for UltraSPARC III

Posted by René Rebe 3 weeks, 4 days ago

Hi Adrian,

> On 7. Sep 2025, at 19:02, John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> wrote:
> 
> Hi Michael,
> 
> On Fri, 2025-09-05 at 00:03 +0200, Michael Karcher wrote:
>> Anthony Yznaga tracked down that a BUG_ON in ext4 code with large folios
>> enabled resulted from copy_from_user() returning impossibly large values
>> greater than the size to be copied. This lead to __copy_from_iter()
>> returning impossible values instead of the actual number of bytes it was
>> able to copy.
>> 
>> The BUG_ON has been reported in
>> https://lore.kernel.org/r/b14f55642207e63e907965e209f6323a0df6dcee.camel@physik.fu-berlin.de
>> 
>> The referenced commit introduced exception handlers on user-space memory
>> references in copy_from_user and copy_to_user. These handlers return from
>> the respective function and calculate the remaining bytes left to copy
>> using the current register contents. The exception handlers expect that
>> %o2 has already been masked during the bulk copy loop, but the masking was
>> performed after that loop. This will fix the return value of copy_from_user
>> and copy_to_user in the faulting case. The behaviour of memcpy stays
>> unchanged.
>> 
>> Fixes: ee841d0aff64 ("sparc64: Convert U3copy_{from,to}_user to accurate exception reporting.")
>> Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> # on Sun Netra 240
>> Reviewed-by: Anthony Yznaga <anthony.yznaga@oracle.com>
>> Tested-by: René Rebe <rene@exactcode.com> # on UltraSparc III+ and UltraSparc IIIi
>> Signed-off-by: Michael Karcher <kernel@mkarcher.dialup.fu-berlin.de>
>> ---
>> arch/sparc/lib/U3memcpy.S | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/arch/sparc/lib/U3memcpy.S b/arch/sparc/lib/U3memcpy.S
>> index 9248d59c734ce200f1f55e6d9913277f18715a87..bace3a18f836f1428ae0ed72b27aa1e00374089e 100644
>> --- a/arch/sparc/lib/U3memcpy.S
>> +++ b/arch/sparc/lib/U3memcpy.S
>> @@ -267,6 +267,7 @@ FUNC_NAME: /* %o0=dst, %o1=src, %o2=len */
>> faligndata %f10, %f12, %f26
>> EX_LD_FP(LOAD(ldd, %o1 + 0x040, %f0), U3_retl_o2)
>> 
>> + and %o2, 0x3f, %o2
>> subcc GLOBAL_SPARE, 0x80, GLOBAL_SPARE
>> add %o1, 0x40, %o1
>> bgu,pt %XCC, 1f
>> @@ -336,7 +337,6 @@ FUNC_NAME: /* %o0=dst, %o1=src, %o2=len */
>> * Also notice how this code is careful not to perform a
>> * load past the end of the src buffer.
>> */
>> - and %o2, 0x3f, %o2
>> andcc %o2, 0x38, %g2
>> be,pn %XCC, 2f
>> subcc %g2, 0x8, %g2
> 
> It looks like the fix isn't actually complete for UltraSPARC III.
> 
> There still seem to be edge-cases where this bug is triggered and that
> actually happens when configuring the systemd-timesyncd package and it's
> reproducible in 100% of the cases:

It is probably a good time to mention that there is likely some other major
TLB (or similar) bug on U3. For example, I could never boot any Linux kernel
(probably ever) with 8GB installed in my Sun Blade 1000; it would hit a NULL
pointer dereference very early:

Unable to handle kernel NULL pointer dereference
tsk->{mm,active_mm}->context = 0000000000000000
tsk->{mm,active_mm}->pgd = fff00001ff002000
              \|/ ____ \|/
              "@'/ .. \`@"
              /_| \__/ |_\
                 \__U_/
swapper(0): Oops [#1]
CPU: 0 PID: 0 Comm: swapper Not tainted 6.10.11-t2 #45
TSTATE: 0000009980e01602 TPC: 0000000000c5ec98 TNPC: 0000000000c5ec9c Y: f9e69dcb    Not tainted
TPC: <subsection_map_init+0x50/0x98>
g0: 0000000010d7d038 g1: 0000000000000000 g2: 0000000000000000 g3: 0000000000000001
g4: 0000000000b95d80 g5: 0000000000000000 g6: 0000000000b84000 g7: 0000000000000000
o0: 0000000000000000 o1: ffffffffffffffff o2: 0000000000000000 o3: 0000000000ae1c40
o4: 0000000000b87ae8 o5: 0000000000000000 sp: 0000000000b871b1 ret_pc: 0000000000c5ec90
RPC: <subsection_map_init+0x48/0x98>
l0: 0000000000020000 l1: 0000000000020000 l2: 0000000000b0d730 l3: 0000000000000001
l4: 0000000000000000 l5: 0000000000000001 l6: 0000000001dfefbf l7: 00000001ff8f3110
i0: 0000000000000000 i1: 00000000000ff7ff i2: 0000000000000001 i3: 000000000001ffff
i4: 0000000000000000 i5: 0000000000000007 i6: 0000000000b87261 i7: 0000000000c5a4c8
I7: <free_area_init+0x58c/0xc78>
Call Trace:
[<0000000000c5a4c8>] free_area_init+0x58c/0xc78
[<0000000000c509c8>] paging_init+0xd1c/0xdd0
[<0000000000c4b848>] setup_arch+0x110/0x774
[<0000000000c48664>] start_kernel+0x58/0x778
[<0000000000c4b584>] start_early_boot+0x78/0x22c
[<00000000009b5264>] tlb_fixup_done+0x4c/0x54
[<00000000002f1a28>] 0x2f1a28
Disabling lock debugging due to kernel taint
Caller[0000000000c5a4c8]: free_area_init+0x58c/0xc78
Caller[0000000000c509c8]: paging_init+0xd1c/0xdd0
Caller[0000000000c4b848]: setup_arch+0x110/0x774
Caller[0000000000c48664]: start_kernel+0x58/0x778
Caller[0000000000c4b584]: start_early_boot+0x78/0x22c
Caller[00000000009b5264]: tlb_fixup_done+0x4c/0x54

I tried to analyze this last year, but had to put it back into my TODO file
after spending a weekend on it. I have a very crude workaround hack for that in
T2, see below.

I’m not saying that this is related; I only want to help the bug hunting with
another heads-up data point. What is strange is that, IIRC, this does not
trigger for a user with an Ultra 45. I will double-check on my Ultra 25 now
that I have 8GB in it, too. Probably a good time to finally tackle the root cause,
too:

--- linux-6.10/arch/sparc/kernel/ktlb.S	2024-07-15 00:43:32.000000000 +0200
+++ b/arch/sparc/kernel/ktlb.S	2024-09-24 20:18:35.373344860 +0200
@@ -144,7 +144,7 @@
 	brgez,pn	%g4, kvmap_dtlb_nonlinear
 	 nop
 
-#ifdef CONFIG_DEBUG_PAGEALLOC
+#if 1 /* def CONFIG_DEBUG_PAGEALLOC */
 	/* Index through the base page size TSB even for linear
 	 * mappings when using page allocation debugging.
 	 */
--- linux-6.10/arch/sparc/mm/init_64.c	2024-07-15 00:43:32.000000000 +0200
+++ b/arch/sparc/mm/init_64.c	2024-09-24 20:35:22.566682546 +0200
@@ -1891,11 +1891,22 @@
 static void __init kernel_physical_mapping_init(void)
 {
 	unsigned long i, mem_alloced = 0UL;
+	unsigned long phys_mem = 0UL;
 	bool use_huge = true;
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
 	use_huge = false;
 #endif
+
+	if (tlb_type == cheetah_plus) {
+		for (i = 0; i < pall_ents; i++)
+			phys_mem += pall[i].reg_size;
+		printk("phys_mem: %ld\n", phys_mem);
+
+		if (phys_mem > 4294967296)
+			use_huge = false;
+	}
+
 	for (i = 0; i < pall_ents; i++) {
 		unsigned long phys_start, phys_end;
 

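The decision the init_64.c hunk above adds can be sketched in isolation as follows. This is a minimal user-space model, not the kernel code: struct mem_region is a hypothetical stand-in for the pall[] entries, the two boolean parameters stand in for the tlb_type and CONFIG_DEBUG_PAGEALLOC checks, and FOUR_GIB is just the literal 4294967296 from the hack written as a shift.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry of the kernel's pall[] array. */
struct mem_region {
	unsigned long long phys_addr;
	unsigned long long reg_size;
};

#define FOUR_GIB (1ULL << 32)	/* 4294967296, the threshold used in the hack */

/* Sum the sizes of all physical memory regions. */
static inline unsigned long long total_phys_mem(const struct mem_region *r, size_t n)
{
	unsigned long long sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += r[i].reg_size;
	return sum;
}

/* Mirror of the workaround: no huge linear mappings on Cheetah+ above 4 GiB. */
static inline bool use_huge_linear_mappings(bool is_cheetah_plus, bool debug_pagealloc,
					    const struct mem_region *r, size_t n)
{
	if (debug_pagealloc)
		return false;
	if (is_cheetah_plus && total_phys_mem(r, n) > FOUR_GIB)
		return false;
	return true;
}
```

Note that the comparison is strict, so a machine with exactly 4 GiB would still use huge mappings under this hack; only larger configurations such as the 8GB one fall back to base pages.
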
> [  125.301353] systemd-sysv-generator[1042]: Please update package to include a native systemd unit file.
> [  125.424703] systemd-sysv-generator[1042]: ⚠ This compatibility logic is deprecated, expect removal soon. ⚠
> [  127.206268] get_swap_device: Bad swap offset entry 808000000
> [  127.354181] get_swap_device: Bad swap offset entry 808000000
> [  127.449735] get_swap_device: Bad swap offset entry 808000000
> [  127.553698] get_swap_device: Bad swap offset entry 808000000
> [  127.701748] get_swap_device: Bad swap offset entry 808000000
> [  127.821914] get_swap_device: Bad swap offset entry 808000000
> [  127.939392] Unable to handle kernel paging request at virtual address 00000001108ca000
> [  128.043605] tsk->{mm,active_mm}->context = 0000000000000555
> [  128.116890] tsk->{mm,active_mm}->pgd = fff0000009fd0000
> [  128.185604]               \|/ ____ \|/
> [  128.185604]               "@'/ .. \`@"
> [  128.185604]               /_| \__/ |_\
> [  128.185604]                  \__U_/
> [  128.378914] systemd-tty-ask(1054): Oops [#1]
> [  128.435046] CPU: 0 UID: 0 PID: 1054 Comm: systemd-tty-ask Not tainted 6.17.0-rc4+ #11 NONE 
> [  128.544945] TSTATE: 0000000011001606 TPC: 00000000007a5800 TNPC: 00000000007a5804 Y: 00000000    Not tainted
> [  128.674196] TPC: <lookup_swap_cgroup_id+0x40/0x80>
> [  128.737194] g0: fff000023f800040 g1: 0000000010000000 g2: 00000001008ca000 g3: 000000000153a8b8
> [  128.851572] g4: fff0000008d1b700 g5: fff000023e336000 g6: fff00000140f4000 g7: fff0000101934000
> [  128.965946] o0: fff0000008e6c180 o1: 0000000000000000 o2: 0000000000001000 o3: 0000000000000001
> [  129.080321] o4: 00000000000001ff o5: 0000000000000555 sp: fff00000140f6c81 ret_pc: 0000000000000000
> [  129.199272] RPC: <0x0>
> [  129.230149] l0: 0000000000000000 l1: fff0000008e6c180 l2: 0000000000000000 l3: 03ffffffffffffff
> [  129.344528] l4: 0000000000000004 l5: 0000000000000000 l6: 0000000000000001 l7: 0000000000000014
> [  129.458902] i0: 0000000080000000 i1: fff0000101900000 i2: fff00000140f75d8 i3: ffffffffffffffff
> [  129.573283] i4: 0000000000001000 i5: 0000000000000000 i6: fff00000140f6d31 i7: 00000000007173e0
> [  129.687653] I7: <swap_pte_batch+0x40/0x160>
> [  129.742653] Call Trace:
> [  129.774671] [<00000000007173e0>] swap_pte_batch+0x40/0x160
> [  129.846733] [<0000000000719998>] unmap_page_range+0x718/0x1200
> [  129.923366] [<000000000071a4f8>] unmap_single_vma.constprop.0+0x78/0xe0
> [  130.010289] [<000000000071a5b0>] unmap_vmas+0x50/0x160
> [  130.077767] [<00000000007288bc>] exit_mmap+0xbc/0x460
> [  130.144108] [<000000000047aec4>] mmput+0x64/0x180
> [  130.205867] [<0000000000483b38>] do_exit+0x218/0xb80
> [  130.271067] [<0000000000484664>] do_group_exit+0x24/0xa0
> [  130.340830] [<0000000000494848>] get_signal+0x948/0x9a0
> [  130.409458] [<000000000043eb68>] do_notify_resume+0xc8/0x5c0
> [  130.483802] [<0000000000404b48>] __handle_signal+0xc/0x30
> [  130.554715] Disabling lock debugging due to kernel taint
> [  130.624483] Caller[00000000007173e0]: swap_pte_batch+0x40/0x160
> [  130.702257] Caller[0000000000719998]: unmap_page_range+0x718/0x1200
> [  130.784610] Caller[000000000071a4f8]: unmap_single_vma.constprop.0+0x78/0xe0
> [  130.877252] Caller[000000000071a5b0]: unmap_vmas+0x50/0x160
> [  130.950452] Caller[00000000007288bc]: exit_mmap+0xbc/0x460
> [  131.022508] Caller[000000000047aec4]: mmput+0x64/0x180
> [  131.089986] Caller[0000000000483b38]: do_exit+0x218/0xb80
> [  131.160901] Caller[0000000000484664]: do_group_exit+0x24/0xa0
> [  131.236387] Caller[0000000000494848]: get_signal+0x948/0x9a0
> [  131.310736] Caller[000000000043eb68]: do_notify_resume+0xc8/0x5c0
> [  131.390795] Caller[0000000000404b48]: __handle_signal+0xc/0x30
> [  131.467427] Caller[fff0000101600238]: 0xfff0000101600238
> [  131.537197] Instruction DUMP:
> [  131.537201]  c458c002 
> [  131.576079]  83287002 
> [  131.606963]  b12e2004 
> [  131.637839] <c2008001>
> [  131.668723]  b1304018 
> [  131.699603]  b12e3030 
> [  131.730486]  81cfe008 
> [  131.761364]  91323030 
> [  131.792249]  b0102000 
> [  131.823130] 
> [  131.873450] Fixing recursive fault but reboot is needed!
> 
> Adrian
> 
> -- 
>  .''`.  John Paul Adrian Glaubitz
> : :' :  Debian Developer
> `. `'   Physicist
>   `-    GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913

-- 
https://exactco.de - https://t2linux.com - https://rene.rebe.de

Re: [PATCH v4 2/5] sparc: fix accurate exception reporting in copy_{from_to}_user for UltraSPARC III

Posted by John Paul Adrian Glaubitz 3 weeks, 4 days ago

Hi Rene,

On Sun, 2025-09-07 at 19:32 +0200, René Rebe wrote:
> It is probably a good time to mention that there is likely some other major
> TLB (or so) bug on U3. For example, I could never boot any Linux kernel
> (probably ever) with 8GB installed in my Sun Blade 1000 - it would NULL ptr
> deref very early:

Have a look at arch/sparc/kernel/head_64.S:

	/* Patch copy/page operations to cheetah optimized versions. */
	call	cheetah_patch_copyops
	 nop
	call	cheetah_patch_copy_page
	 nop
	call	cheetah_patch_cachetlbops
	 nop

These patch in UltraSPARC-III-optimized versions of copy_{to,from}_user,
copy_page and the cache/TLB operations. Replacing these calls with "nop"
might tell us whether any of those is broken.

I have already replaced cheetah_patch_copyops with generic_patch_copyops
and was able to verify that the recently discovered OOPS is not related
to copy_{to,from}_user, so Michael's UltraSPARC III patch may still be
correct and complete.

I am testing the other two now, as I think they are good candidates.

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-    GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913

[PATCH v4 1/5] sparc: fix accurate exception reporting in copy_{from_to}_user for UltraSPARC
[PATCH v4 2/5] sparc: fix accurate exception reporting in copy_{from_to}_user for UltraSPARC III
[PATCH v4 3/5] sparc: fix accurate exception reporting in copy_{from_to}_user for Niagara
[PATCH v4 4/5] sparc: fix accurate exception reporting in copy_to_user for Niagara 4
[PATCH v4 5/5] sparc: fix accurate exception reporting in copy_{from,to}_user for M7