[REGRESSION 00/04] Crash during resume of pcie bridge

Bert Karwatzki posted 4 patches 2 months, 1 week ago
Posted by Bert Karwatzki 2 months, 1 week ago
Since Linux v6.15 I have been experiencing random crashes on my MSI Alpha 15
laptop running Debian trixie (amd64). The first such crash happened around the
middle of June, and as there were no useful log messages, not even via
netconsole, I suspected faulty hardware. So I ran memtest86+, found a faulty
address line and replaced the memory (unfortunately going from 64G to 16G).
But the crashes occurred again, so I did a thorough investigation.

The crashes occur after 30min to 33h (yes, hours) of uptime and consist of a
sudden reboot after which the PCI bridge at 00:02.4 and the nvme device 
connected to it are missing. If there's sound running during the crash then the
first sign of the crash is the sound looping like a broken record for about 2s,
after which the reboot happens. With the missing nvme device the reboot drops to
a rescue shell. Using "shutdown -h now" from that shell and starting the laptop
with the power button restores the missing PCI bridge and nvme device.
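
Whether the bridge and the nvme device are really gone can be checked from the
rescue shell by looking for their sysfs entries. The following is only a
minimal sketch: the function name missing_pci_devices is made up for
illustration, and it assumes the usual sysfs layout under /sys/bus/pci/devices.

```bash
#!/bin/bash
# Hypothetical helper: print every PCI address from the argument list
# that has no entry below the given sysfs directory.
missing_pci_devices() {
    local root=$1 dev
    shift
    for dev in "$@"; do
        [[ -e $root/$dev ]] || echo "$dev"
    done
}

# Example call for the bridge at 00:02.4 and the nvme device behind it:
# missing_pci_devices /sys/bus/pci/devices 0000:00:02.4 0000:07:00.0
```

An empty output would mean that both devices are present.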

The hardware is the following (it's a dual GPU laptop where the GUI
runs on the built-in GPU):

$ cat /proc/cpuinfo 
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 25
model		: 80
model name	: AMD Ryzen 7 5800H with Radeon Graphics
stepping	: 0
microcode	: 0xa50000c
cpu MHz		: 3394.238
cache size	: 512 KB
physical id	: 0
siblings	: 16
core id		: 0
cpu cores	: 8
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 16
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap
bugs		: sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass srso ibpb_no_ret
bogomips	: 6388.57
TLB size	: 2560 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]

$ lspci -nn
00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne Root Complex [1022:1630]
00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne IOMMU [1022:1631]
00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe GPP Bridge [1022:1633]
00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
00:02.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634]
00:02.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634]
00:02.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634]
00:02.4 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634]
00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir Internal PCIe GPP Bridge to Bus [1022:1635]
00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 51)
00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 0 [1022:166a]
00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 1 [1022:166b]
00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 2 [1022:166c]
00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 3 [1022:166d]
00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 4 [1022:166e]
00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 5 [1022:166f]
00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 6 [1022:1670]
00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 7 [1022:1671]
01:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch [1002:1478] (rev c3)
02:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch [1002:1479]
03:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] [1002:73ff] (rev c3)
03:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller [1002:ab28]
04:00.0 Network controller [0280]: MEDIATEK Corp. MT7921K (RZ608) Wi-Fi 6E 80MHz [14c3:0608]
05:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 15)
06:00.0 Non-Volatile memory controller [0108]: Kingston Technology Company, Inc. KC3000/FURY Renegade NVMe SSD [E18] [2646:5013] (rev 01)
07:00.0 Non-Volatile memory controller [0108]: Micron/Crucial Technology P1 NVMe PCIe SSD[Frampton] [c0a9:2263] (rev 03)
08:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] [1002:1638] (rev c5)
08:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Renoir Radeon High Definition Audio Controller [1002:1637]
08:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) Platform Security Processor [1022:15df]
08:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1 [1022:1639]
08:00.4 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1 [1022:1639]
08:00.5 Multimedia controller [0480]: Advanced Micro Devices, Inc. [AMD] Audio Coprocessor [1022:15e2] (rev 01)
08:00.6 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Family 17h/19h/1ah HD Audio Controller [1022:15e3]
08:00.7 Signal processing controller [1180]: Advanced Micro Devices, Inc. [AMD] Sensor Fusion Hub [1022:15e4]

These devices are attached to the PCI bus like this:

$ lspci -t
-[0000:00]-+-00.0
           +-00.2
           +-01.0
           +-01.1-[01-03]----00.0-[02-03]----00.0-[03]--+-00.0 // This is the bridge which causes the crash
           |                                            \-00.1
           +-02.0
           +-02.1-[04]----00.0
           +-02.2-[05]----00.0
           +-02.3-[06]----00.0
           +-02.4-[07]----00.0 // These are the bridge and nvme device which disappear after the crash.
           +-08.0
           +-08.1-[08]--+-00.0
           |            +-00.1
           |            +-00.2
           |            +-00.3
           |            +-00.4
           |            +-00.5
           |            +-00.6
           |            \-00.7
           +-14.0
           +-14.3
           +-18.0
           +-18.1
           +-18.2
           +-18.3
           +-18.4
           +-18.5
           +-18.6
           \-18.7

I tried to bisect this between v6.14 and v6.15, but due to the wildly varying time
it takes to trigger the bug the bisections were not successful. Nevertheless they
gave lots of data about affected and non-affected versions of the Linux kernel,
and it's quite likely that version v6.14 is indeed free of the bug.

Here's an almost complete list of tested versions, (somewhat) sorted by kernel
version (the 6.14.0-rc* kernels are from attempted bisections between v6.14 and
v6.15):
v6.14.0							no crash after 16h
v6.14.11						no crash after 7.5h
6.14.0-rc1-bisect-00003-g541ddf31e300			booted 12:24, 22.8.2025, no crash after {48h, 17h}
6.14.0-rc1-mystery-00134-gcc28c0e5e725			booted 11:42, 5.8.2025, no crash after 10.5h
6.14.0-rc1-mystery-00198-gd7f6f07ecec9			booted 22:27, 5.8.2025, no crash after 12h
6.14.0-rc4-mystery-01022-gab498828fad7			booted 21:04, 3.8.2025, no crash after {14h, 24h} 
6.14.0-rc4-mystery-01427-g7547510d4a91			booted 11:11, 4.8.2025, no crash after {13h, 23h}
6.14.0-rc6-mystery-01641-g0f04462874e1			booted 00:26, 5.8.2025, no crash after {11h, 24h}
6.14.0-mystery-00826-g327ecdbc0fda			no crash after {16h, 17h, 6.5h}
############## here the crashes start (time to each crash, crashes do not always occur) ########
6.14.0-bisect-01053-gebfb94d87b35			booted 10:15, 20.8.2025 crash after ~33h
6.14.0-mystery-09584-g7d06015d936c			crash 20.44 3.8.2025 after 7h
6.14.0-mystery-11703-geb0ece16027f      		crash 13.22 3.8.2025 after 1.75h
6.15.0							crashed around 15-17.6.2025, unknown uptime (This is the first crash!)
6.15.0-nort  						crash after 6.75h
6.16-rc4 (next-20250627)				crash after ~4h
6.16-rc4 (next-20250630)				crash after ~5h
6.16-rc4 (next-20250703) 				crash after ~2.5h (sound buffer repeated for ~1s before restarting) 	
6.16-rc6 (next-20250718)				crash after {2h, 2h}
6.16-rc7 (next-20250721)				crash after {~30min, 2h, 5.5h}
6.16.0-nortlockdep					crash after 4h
6.17.0-rc4-next-20250902-master				booted 8:36, 3.9.2025, crash after ~3.5h
6.17.0-rc5-next-20250908-master				booted 10:25, 9.9.2025, crash after {~6.5h, 14h}
6.17.0-rc6-next-20250917-acpidebug 			booted 12:41, 20.9.2025, crash 15:22, 20.9.2025 (~3h, 647 GPP notifies)
The versions below contain additional debugging printk()s and dev_info()s.
The details of these debugging statements are explained below.
6.17.0-rc6-next-20250917-gpudebug-00018-g7a38b625a003	booted 12:58, 26.9.2025, crash 12:01, 27.9.2025 (~23h, 1500 GPP notifies)
6.17.0-rc6-next-20250917-gpudebug-00021-gab98d880e3c8	booted 23:52, 28.9.2025, crash 2:25, 30.9.2025 (26.5h, 1504GPP0, 889GPP2)
6.17.0-rc6-next-20250917-gpudebug-00024-g5c6b49b810db	booted 9:10, 2.10.2025, 60h 3093 GPP0 notifies without crash (too many printk()s?)
6.17.0-rc6-next-20250917-gpudebug-00028-gf99cf81b1da7	booted 21:21, 4.10.2025 first try stopped after 77min due to hung tasks
6.17.0-rc6-next-20250917-gpudebug-00028-gf99cf81b1da7	booted 23:37, 4.10.2025 crash 4:52, 6.10.2025 (~27.5h)
6.17.0-rc6-next-20250917-gpudebug-00029-ge797f42363d1	booted 13:00, 6.10.2025 currently testing
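
How much a "no crash after N hours" entry is worth depends on how the time to
crash is distributed. As a rough sketch, and purely as an assumption that is
not backed by the data above, one can model the time to crash on an affected
kernel as exponentially distributed with the mean of the observed crash times.
The chance that an affected kernel survives t hours is then exp(-t/mean). The
function name survival_probability is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper, assuming an exponential time-to-crash model.
# Usage: survival_probability HOURS CRASH_TIME [CRASH_TIME...]
# Prints the probability that an affected kernel runs HOURS without a crash,
# using the mean of the given crash times (in hours) as the model mean.
survival_probability() {
    local hours=$1
    shift
    (( $# > 0 )) || return 1
    printf '%s\n' "$@" | LC_ALL=C awk -v t="$hours" '
        { sum += $1; n++ }
        END { printf "%.3f\n", exp(-t / (sum / n)) }'
}

# Example: survival_probability 16 7 1.75 6.75 4 5 2.5
```

The smaller the printed value, the less likely it is that a run of that length
on an affected kernel was misclassified as good. If the real distribution has a
long tail (like the ~33h crash above), this model will be too optimistic.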

Since the bisections were not successful, I tried to monitor the crash using
netconsole, CONFIG_ACPI_DEBUG and "acpi.debug_layer=0xf acpi.debug_level=0x107"
as command line parameters. With this, the last message on netconsole before
the crash is usually:

[21465.639279] [    T251]    evmisc-0132 ev_queue_notify_reques: Dispatching Notify on [GPP0] (Device) Value 0x00 (Bus Check) Node 00000000f81f36b8
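
The notify counts given in the list of tested versions can be obtained by
counting these messages in the saved netconsole output. A minimal sketch,
assuming the log was written to a file (or is piped in on stdin); the function
name count_notifies is made up for illustration:

```bash
#!/bin/bash
# Hypothetical helper: count "Dispatching Notify" messages for one ACPI
# object name in a saved log. Reads stdin if no file is given.
count_notifies() {
    local name=$1 log=${2:--}
    # -F: match the string literally, so the brackets are not a regex class.
    grep -c -F -- "Dispatching Notify on [$name]" "$log"
}

# Example: count_notifies GPP0 netconsole.log
```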

GPP0 is the ACPI name of this PCI bridge (at least that's my best guess):

00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe GPP Bridge [1022:1633]

to which the discrete GPU is connected

03:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] [1002:73ff] (rev c3)

via the pci express switch

01:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch [1002:1478] (rev c3)
02:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch [1002:1479]
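
The guess about GPP0 could be verified via sysfs: for PCI devices that have an
ACPI companion, the firmware_node link contains a "path" attribute with the
full ACPI path. A minimal sketch, assuming that layout; the function name
pci_dev_for_acpi_name is hypothetical:

```bash
#!/bin/bash
# Hypothetical helper: print the PCI addresses whose ACPI path ends in
# the given object name (e.g. GPP0).
pci_dev_for_acpi_name() {
    local name=$1 root=${2:-/sys/bus/pci/devices} dev path
    for dev in "$root"/*; do
        # Devices without an ACPI companion have no firmware_node/path.
        path=$(cat "$dev/firmware_node/path" 2>/dev/null) || continue
        [[ $path == *".$name" ]] && echo "${dev##*/}"
    done
    return 0
}

# Example: pci_dev_for_acpi_name GPP0
```

If the guess is right, this should print 0000:00:01.1.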

While the GUI (xfce on xorg) on my laptop runs on the built-in GPU, the discrete
GPU usually wakes up quite often, e.g. when a window is opened or when scrolling
down on YouTube.

A somewhat reliable method to generate GPP0 notifies is putting on a YouTube
video and then periodically starting evolution with this script:

#!/bin/bash
for i in {0..1000}
do
	echo $i
	evolution &
	sleep 5
	killall evolution
	sleep 55
done

This is also the method I used to test the debug kernel in the following mails.

Bert Karwatzki
Re: [REGRESSION 00/04] Crash during resume of pcie bridge
Posted by Mario Limonciello 2 months, 1 week ago
On 10/6/25 7:09 AM, Bert Karwatzki wrote:
> [...]

Given that the perpetrator and the victim here don't share a common upstream
root port (the only thing they have in common is the root complex), I wonder
if this is actually an issue with something non-obvious like the IOMMU.

Can you still reproduce with amd_iommu=off?
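
When running such a test it is worth confirming that the parameter actually
reached the kernel, since a long run with the wrong command line costs days. A
minimal sketch that checks /proc/cmdline; the function name has_boot_param is
made up for illustration:

```bash
#!/bin/bash
# Hypothetical helper: succeed if the exact parameter (e.g. amd_iommu=off)
# appears as a separate word on the kernel command line.
has_boot_param() {
    local want=$1 cmdline_file=${2:-/proc/cmdline} word
    local -a words
    read -r -a words < "$cmdline_file"
    for word in "${words[@]}"; do
        [[ $word == "$want" ]] && return 0
    done
    return 1
}

# Example:
# has_boot_param amd_iommu=off && echo "IOMMU disabled on the command line"
```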

Re: [REGRESSION 00/04] Crash during resume of pcie bridge
Posted by Bert Karwatzki 2 months ago
On Tuesday, 07.10.2025 at 16:33 -0500, Mario Limonciello wrote:
> 
> Can you still reproduce with amd_iommu=off?

Reproducing this at all is very difficult, so I'll first try to find the exact spot
where things break (i.e. where the pci bus breaks and no more messages are transmitted
via netconsole). The current state of this search is that the crash occurs in
pci_pm_runtime_resume(), before pci_fixup_device() is called:

static int pci_pm_runtime_resume(struct device *dev)
{
	struct pci_dev *pci_dev = to_pci_dev(dev);
	const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
	pci_power_t prev_state = pci_dev->current_state;
	int error = 0;
	// dev_info(dev, "%s = %px\n", __func__, (void *) pci_pm_runtime_resume); // remove this so we don't get too much delay
										  // This was still printed in the case of a crash
										  // so the crash must happen below

	/*
	 * Restoring config space is necessary even if the device is not bound
	 * to a driver because although we left it in D0, it may have gone to
	 * D3cold when the bridge above it runtime suspended.
	 */
	pci_pm_default_resume_early(pci_dev);
	if (!strcmp(dev_name(dev), "0000:00:01.1")) // This is the current test.
		dev_info(dev, "%s %d\n", __func__, __LINE__);
	pci_resume_ptm(pci_dev);

	if (!pci_dev->driver)
		return 0;

	//if (!strcmp(dev_name(dev), "0000:00:01.1"))         // This was not printed when 6.17.0-rc6-next-20250917-gpudebug-00036-g4f7b4067c9ce
	//	dev_info(dev, "%s %d\n", __func__, __LINE__); // crashed, so the crash must happen above
	pci_fixup_device(pci_fixup_resume_early, pci_dev);
	pci_pm_default_resume(pci_dev);

	if (prev_state == PCI_D3cold)
		pci_pm_bridge_power_up_actions(pci_dev);

	if (pm && pm->runtime_resume)
		error = pm->runtime_resume(dev);

	return error;
}


Bert Karwatzki
Re: [REGRESSION 00/04] Crash during resume of pcie bridge
Posted by Mario Limonciello 2 months ago
On 10/13/25 11:29 AM, Bert Karwatzki wrote:
> On Tuesday, 07.10.2025 at 16:33 -0500, Mario Limonciello wrote:
>>
>> Can you still reproduce with amd_iommu=off?
> 
> Reproducing this at all is very difficult, so I'll first try to find the exact spot
> where things break (i.e. where the pci bus breaks and no more messages are transmitted
> via netconsole). The current state of this search is that the crash occurs in
> pci_pm_runtime_resume(), before pci_fixup_device() is called:
> 

One other (unfortunate) possibility is that the timing of this crash 
occurring is not deterministic.

As an idea for debugging this issue, do you think maybe using kdumpst 
[1] might be helpful to get more information on the state during the crash?

Since NVME is missing you might need to boot off of USB or SD though so 
that kdumpst is able to save the vmcore out of RAM.

Link: https://blogs.igalia.com/gpiccoli/2024/07/presenting-kdumpst-or-how-to-collect-kernel-crash-logs-on-arch-linux/ [1]
Re: [REGRESSION 00/04] Crash during resume of pcie bridge
Posted by Christian König 2 months ago
On 13.10.25 20:51, Mario Limonciello wrote:
> On 10/13/25 11:29 AM, Bert Karwatzki wrote:
>> On Tuesday, 07.10.2025 at 16:33 -0500, Mario Limonciello wrote:
>>>
>>> Can you still reproduce with amd_iommu=off?
>>
>> Reproducing this at all is very difficult, so I'll first try to find the exact spot
>> where things break (i.e. where the pci bus breaks and no more messages are transmitted
>> via netconsole). The current state of this search is that the crash occurs in
>> pci_pm_runtime_resume(), before pci_fixup_device() is called:
>>
> 
> One other (unfortunate) possibility is that the timing of this crash occurring is not deterministic.

Yeah, completely agree.

The exact spot where things break is actually pretty uninteresting, I think. The reason is that it is most likely not the spot which caused the issue.

Instead, what happens is that something in the HW times out and you see a spontaneous reboot because of this.

I would rather try to narrow down which operation or combination of things is causing the issue.

Maybe also double check whether runtime PM is actually working on the good kernel. The issue might be that somebody fixed runtime PM and you are now seeing problems because you happen to have problematic HW which we need to add to the blacklist.
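
One way to check this is to compare the runtime PM attributes of the bridge in sysfs on the good and the bad kernel. A minimal sketch, assuming the standard power/ attributes (control, runtime_status, runtime_suspended_time); the function name runtime_pm_state is made up for illustration:

```bash
#!/bin/bash
# Hypothetical helper: print the runtime PM attributes of one PCI device.
# Attributes that cannot be read are shown as "?".
runtime_pm_state() {
    local dev=$1 root=${2:-/sys/bus/pci/devices} attr
    for attr in control runtime_status runtime_suspended_time; do
        printf '%s=%s\n' "$attr" \
            "$(cat "$root/$dev/power/$attr" 2>/dev/null || echo '?')"
    done
}

# Example for the bridge the discrete GPU is connected to:
# runtime_pm_state 0000:00:01.1
```

If runtime_suspended_time stays at 0 on the good kernel while the discrete GPU is idle, that would suggest the bridge never runtime suspends there.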

Regards,
Christian.

Re: [REGRESSION 00/04] Crash during resume of pcie bridge
Posted by Christian König 2 months, 1 week ago
On 06.10.25 14:09, Bert Karwatzki wrote:
> Since linux version v6.15 I experience random crashes on my MSI Alpha 15 Laptop
> running debian trixie (amd64). The first such crash happened around the middle
> of June, and as there were no useful log messages and even using netconsole
> gave no useful message I suspected faulty hardware. So I ran memtest86+ and
> found a faulty address line and replaced the memory (unfortunately 64G to 16G).
> But the crashes occurred again and so I did a thorough investigation.
> 
> The crashes occur after 30min to 33h (yes, hours) of uptime and consist of a
> sudden reboot after which the PCI bridge at 00:02.4 and the nvme device 
> connected to it are missing. If there's sound running during the crash then the
> first sign of the crash is the sound looping like a broken record for about 2s,
> after which the reboot happens. With the missing nvme device the reboot drops to
> a rescue shell. Using "shutdown -h now" from that shell and starting the laptop
> with the power button restores the missing PCI bridge and nvme device.

Oh well, it sounds like some PCIe device is dropping off the bus and taking its upstream bridge with it.

> As the bisections were not successful I tried to monitor the crash using
> netconsole and CONFIG_ACPI_DEBUG and "acpi.debug_layer=0xf acpi.debug_level=0x107"
> as command line parameters. With this the last message on netconsole before
> the crash is usually:
> 
> [21465.639279] [    T251]    evmisc-0132 ev_queue_notify_reques: Dispatching Notify on [GPP0] (Device) Value 0x00 (Bus Check) Node 00000000f81f36b8

A full dump of that might be helpful. That sounds like the dGPU is powering up/down.

> 
> GPP0 is the ACPI name of this PCI bridge (at least that's my best guess):
> 
> 00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe GPP Bridge [1022:1633]
> 
> to which the discrete GPU is connected
> 
> 03:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] [1002:73ff] (rev c3)
> 
> via the pci express switch
> 
> 01:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch [1002:1478] (rev c3)
> 02:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch [1002:1479]
> 
> While the GUI (xfce on xorg) on my laptop runs on the built-in GPU the discrete 
> GPU usually wakes up quite often, e.g. when a window is opened or when scrolling down on youtube.

Yeah, that is a known issue and we are working on it.

Basically an application enumerates the possible render or video decode devices in the system and that wakes up the dGPU even when it isn't actually used.

> A somewhat reliable method to generate GPP0 notifies is putting on a youtube
> video and then periodically starting evolution with this script:
> 
> #!/bin/bash
> for i in {0..1000}
> do
> 	echo $i
> 	evolution &
> 	sleep 5
> 	killall evolution
> 	sleep 55
> done
> 
> This is also the method I used to test the debug kernel in the following mails.

To further narrow down the issue please run your laptop with amdgpu.runpm=0 on the kernel command line for a while and see if that is stable or not.
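
As a sketch, this is one way to confirm after the reboot that the option actually took effect. It assumes the amdgpu module is loaded and exposes its `runpm` parameter read-only under `/sys/module`; the function name `check_runpm` and its two optional path arguments are only for illustration:

```shell
#!/bin/bash
# Report whether amdgpu runtime PM was disabled on the kernel command line.
# $1 = command line file (default /proc/cmdline)
# $2 = module parameter file (default /sys/module/amdgpu/parameters/runpm)
check_runpm() {
	local cmdline=${1:-/proc/cmdline}
	local param=${2:-/sys/module/amdgpu/parameters/runpm}
	# Split the command line into one option per line and match exactly.
	if tr ' ' '\n' < "$cmdline" | grep -qx 'amdgpu\.runpm=0'; then
		echo "cmdline: amdgpu.runpm=0 present"
	else
		echo "cmdline: amdgpu.runpm=0 missing"
	fi
	if [ -r "$param" ]; then
		echo "module: runpm=$(cat "$param")"
	else
		echo "module: runpm unknown (amdgpu not loaded?)"
	fi
}
```

Calling `check_runpm` without arguments reads the live system. With runtime PM disabled the module line should read `runpm=0`.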

Thanks,
Christian.

> 
> Bert Karwatzki
Re: [REGRESSION 00/04] Crash during resume of pcie bridge
Posted by Bert Karwatzki 2 months, 1 week ago
On Monday, 06.10.2025 at 14:39 +0200, Christian König wrote:
> On 06.10.25 14:09, Bert Karwatzki wrote:
> > Since linux version v6.15 I experience random crashes on my MSI Alpha 15 Laptop
> > running debian trixie (amd64). The first such crash happened around the middle
> > of June, and as there were no useful log messages and even using netconsole
> > gave no useful message I suspected faulty hardware. So I ran memtest86+ and
> > found a faulty address line and replaced the memory (unfortunately 64G to 16G).
> > But the crashes occurred again and so I did a thorough investigation.
> > 
> > The crashes occur after 30min to 33h (yes, hours) of uptime and consist of a
> > sudden reboot after which the PCI bridge at 00:02.4 and the nvme device 
> > connected to it are missing. If there's sound running during the crash then the
> > first sign of the crash is the sound looping like a broken record for about 2s,
> > after which the reboot happens. With the missing nvme device the reboot drops to
> > a rescue shell. Using "shutdown -h now" from that shell and starting the laptop
> > with the power button restores the missing PCI bridge and nvme device.
> 
> Oh well, it sounds like some PCIe device is dropping off the bus and taking its upstream bridge with it.
> 
> > As the bisections were not successful I tried to monitor the crash using
> > netconsole and CONFIG_ACPI_DEBUG and "acpi.debug_layer=0xf acpi.debug_level=0x107"
> > as command line parameters. With this the last message on netconsole before
> > the crash is usually:
> > 
> > [21465.639279] [    T251]    evmisc-0132 ev_queue_notify_reques: Dispatching Notify on [GPP0] (Device) Value 0x00 (Bus Check) Node 00000000f81f36b8
> 
> A full dump of that might be helpful. That sounds like the dGPU is powering up/down.

Yes, that's what's happening.

> 
> > 
> > GPP0 is the ACPI name of this PCI bridge (at least that's my best guess):
> > 
> > 00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe GPP Bridge [1022:1633]
> > 
> > to which the discrete GPU is connected
> > 
> > 03:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] [1002:73ff] (rev c3)
> > 
> > via the pci express switch
> > 
> > 01:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch [1002:1478] (rev c3)
> > 02:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch [1002:1479]
> > 
> > While the GUI (xfce on xorg) on my laptop runs on the built-in GPU the discrete 
> > GPU usually wakes up quite often, e.g. when a window is opened or when scrolling down on youtube.
> 
> Yeah, that is a known issue and we are working on it.

Until linux v6.15 this didn't cause any harm.

> 
> Basically an application enumerates the possible render or video decode devices in the system and that wakes up the dGPU even when it isn't actually used.
> 
> > A somewhat reliable method to generate GPP0 notifies is putting on a youtube
> > video and then periodically starting evolution with this script:
> > 
> > #!/bin/bash
> > for i in {0..1000}
> > do
> > 	echo $i
> > 	evolution &
> > 	sleep 5
> > 	killall evolution
> > 	sleep 55
> > done
> > 
> > This is also the method I used to test the debug kernel in the following mails.
> 
> To further narrow down the issue please run your laptop with amdgpu.runpm=0 on the kernel command line for a while and see if that is stable or not.
> 
Even versions that did crash can be stable for 24h of uptime, so I think this
will take too long.
I think I've already chased down the crash to this part of rpm_resume()
(I'm currently doing a test run with more dev_info()s in this part):

 skip_parent:

	if (!strcmp(dev_name(dev), "0000:00:01.1"))
		dev_info(dev, "%s %d\n", __func__, __LINE__); // this is the last reported line in netconsole
	if (dev->power.no_callbacks)
		goto no_callback;	/* Assume success. */

	__update_runtime_status(dev, RPM_RESUMING);

	callback = RPM_GET_CALLBACK(dev, runtime_resume);

	dev_pm_disable_wake_irq_check(dev, false);
	retval = rpm_callback(callback, dev);
	if (retval) {
		__update_runtime_status(dev, RPM_SUSPENDED);
		pm_runtime_cancel_pending(dev);
		dev_pm_enable_wake_irq_check(dev, false);
	} else {
 no_callback:


Bert Karwatzki
Re: [REGRESSION 00/04] Crash during resume of pcie bridge
Posted by Bert Karwatzki 2 months, 1 week ago
On Monday, 06.10.2025 at 18:22 +0200, Bert Karwatzki wrote:
> 
> > 
> Even versions that did crash can be stable for 24h of uptime so I think this 
> will take too long.
> I think I've already chased down the crash to this part of rpm_resume()
> (I'm currently doing a testrun with more dev_info()s in this part):
> 
>  skip_parent:
> 
> 	if (!strcmp(dev_name(dev), "0000:00:01.1"))
> 		dev_info(dev, "%s %d\n", __func__, __LINE__); // this is the last reported line in netconsole
> 	if (dev->power.no_callbacks)
> 		goto no_callback;	/* Assume success. */
> 
> 	__update_runtime_status(dev, RPM_RESUMING);
> 
> 	callback = RPM_GET_CALLBACK(dev, runtime_resume);
> 
> 	dev_pm_disable_wake_irq_check(dev, false);
> 	retval = rpm_callback(callback, dev);
> 	if (retval) {
> 		__update_runtime_status(dev, RPM_SUSPENDED);
> 		pm_runtime_cancel_pending(dev);
> 		dev_pm_enable_wake_irq_check(dev, false);
> 	} else {
>  no_callback:
> 
> 
> Bert Karwatzki

The test run has already finished: the crash occurred after 10h and ~700 GPP0 notifies.
The part of rpm_resume() above was instrumented like this:

 skip_parent:

	if (!strcmp(dev_name(dev), "0000:00:01.1"))
		dev_info(dev, "%s %d\n", __func__, __LINE__);
	if (dev->power.no_callbacks)
		goto no_callback;	/* Assume success. */

	if (!strcmp(dev_name(dev), "0000:00:01.1"))
		dev_info(dev, "%s %d\n", __func__, __LINE__);
	__update_runtime_status(dev, RPM_RESUMING);

	if (!strcmp(dev_name(dev), "0000:00:01.1"))
		dev_info(dev, "%s %d\n", __func__, __LINE__);
	callback = RPM_GET_CALLBACK(dev, runtime_resume);

	if (!strcmp(dev_name(dev), "0000:00:01.1"))
		dev_info(dev, "%s %d callback = %px\n", __func__, __LINE__, (void *) callback);
	dev_pm_disable_wake_irq_check(dev, false);
	if (!strcmp(dev_name(dev), "0000:00:01.1"))
		dev_info(dev, "%s %d\n", __func__, __LINE__);   // This is the last reported line!
	retval = rpm_callback(callback, dev);
	if (!strcmp(dev_name(dev), "0000:00:01.1"))
		dev_info(dev, "%s %d\n", __func__, __LINE__);
	if (retval) {
		if (!strcmp(dev_name(dev), "0000:00:01.1"))
			dev_info(dev, "%s %d\n", __func__, __LINE__);
		__update_runtime_status(dev, RPM_SUSPENDED);
		pm_runtime_cancel_pending(dev);
		dev_pm_enable_wake_irq_check(dev, false);
	} else {
 no_callback:

The result is that in the case of the crash rpm_callback() didn't return, so
I'll continue the investigation in rpm_callback().

The whole calltrace is:
acpiphp_check_bridge()->pm_runtime_get_sync()->__pm_runtime_resume()->rpm_resume()->rpm_callback()
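
For reference, a rough sketch of how the saved netconsole capture can be evaluated after such a run. It assumes the capture was written to a plain text file and that the dev_info() markers show up as `<driver> 0000:00:01.1: rpm_resume <line>`; the function names `last_marker` and `count_gpp0_notifies` are just illustrative helpers:

```shell
#!/bin/bash
# Print the source line number of the last "<func> <line>" marker logged
# for a device before the capture ends.
# $1 = log file, $2 = PCI address (default 0000:00:01.1),
# $3 = function name (default rpm_resume)
last_marker() {
	local log=$1 bdf=${2:-0000:00:01.1} func=${3:-rpm_resume}
	grep -o "$bdf: $func [0-9][0-9]*" "$log" | tail -n 1 | awk '{print $NF}'
}

# Count the ACPI notifies dispatched on GPP0 in the capture.
# $1 = log file
count_gpp0_notifies() {
	grep -c 'Dispatching Notify on \[GPP0\]' "$1"
}
```

`last_marker netconsole.log` then gives the line number of the last dev_info() that made it out before the reboot, and `count_gpp0_notifies netconsole.log` gives the number of notifies the run survived (`netconsole.log` being whatever file the capture was saved to).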

Bert Karwatzki