This series enables CPUFreq support on the X1E SoC using the SCMI perf protocol. This was originally part of the RFC: firmware: arm_scmi: Qualcomm Vendor Protocol [1]. I've split it up so that this part can land earlier. Warnings Introduced by the series are fixed by [2] V6: * Drop patches that are already picked up. * Rebase to the latest next-20241029. V5: * Fix build error reported by kernel test robot by adding 64BIT requirement to COMPILE_TEST * Pick Rbs V4: * Move val, flag and chan to local loop variables. [Jassi] * Add cpucp mailbox to the MAINTAINERS file. [Jassi] * Move to core_initcall. [Konrad] * Skip explicitly setting txdone_irq/txdone_poll to zero. [Konrad] V3: * Fix Maintainer info in cpucp mbox bindings. [Bjorn] * Fix copyright info in cpucp driver. [Bjorn] * Drop unused APSS_CPUCP_TX_MBOX_IDR, value init and drv_data. [Bjorn/Dmitry] * Convert to lower case hex. [Bjorn] * Convert irq and dev to local variables. [Bjorn] * Replace for and if with for_each_set_bit. [Bjorn] * Document the need for spinlock. [Bjorn] * Add space after " for aesthetics. [Bjorn] * Fix err in calc and add fixes tag. [Bjorn] * Include io.h and re-order platform_device.h * Use GENMASK_ULL to generate APSS_CPUCP_RX_MBOX_CMD_MASK. V2: * Fix series version number [Rob] * Pickup Rbs from Dimitry and Rob. * Use power-domain instead of clocks. [Sudeep/Ulf] * Rename sram sub-nodes according to schema. [Dmitry] * Use BIT() instead of manual shift. [Dmitry] * Define RX_MBOX_CMD to account for chan calculation. [Dmitry] * Clear the bit instead of the entire status within the spinlock. [Dmitry] * Use dev_err_probe instead. [Dmitry] * Drop superfluous error message while handling errors from get_irq. [Dmitry] * Use devm_mbox_controller_register and drop remove path. [Dmitry] * Define TX_MBOX_CMD to account for chan calculation. * Use cpucp->dev in probe path for conformity. RFC V1: * Use x1e80100 as the fallback for future SoCs using the cpucp-mbox controller. [Krzysztoff/Konrad/Rob] * Use chan->lock and chan->cl to detect if the channel is no longer Available. [Dmitry] * Use BIT() instead of using manual shifts. [Dmitry] * Don't use integer as a pointer value. [Dmitry] * Allow it to default to of_mbox_index_xlate. [Dmitry] * Use devm_of_iomap. [Dmitry] * Use module_platform_driver instead of module init/exit. [Dmitry] * Get channel number using mailbox core (like other drivers) and further simplify the driver by dropping setup_mbox func. [1]: https://lore.kernel.org/lkml/20240117173458.2312669-1-quic_sibis@quicinc.com/#r [2]: https://lore.kernel.org/lkml/20241030125512.2884761-1-quic_sibis@quicinc.com/ Other relevant Links: https://lore.kernel.org/lkml/be2e475a-349f-4e98-b238-262dd7117a4e@linaro.org/ base: next-20241029 Sibi Sankar (2): arm64: dts: qcom: x1e80100: Add cpucp mailbox and sram nodes arm64: dts: qcom: x1e80100: Enable cpufreq arch/arm64/boot/dts/qcom/x1e80100.dtsi | 89 +++++++++++++++++++------- 1 file changed, 65 insertions(+), 24 deletions(-) -- 2.34.1
[ +CC: Marc, who I think I saw reporting something similar even if I can seem to find where right now ] On Wed, Oct 30, 2024 at 06:38:38PM +0530, Sibi Sankar wrote: > This series enables CPUFreq support on the X1E SoC using the SCMI perf > protocol. This was originally part of the RFC: firmware: arm_scmi: > Qualcomm Vendor Protocol [1]. I've split it up so that this part can > land earlier. Warnings Introduced by the series are fixed by [2] Sibi Sankar (2): > arm64: dts: qcom: x1e80100: Add cpucp mailbox and sram nodes > arm64: dts: qcom: x1e80100: Enable cpufreq I've been running with v6 of these for a while now, without noticing any issues, and just updated to v7 to be able to provide a Tested-by tag. I wanted to run a compilation and see how the frequencies varied, but before I got around to that I just grepped the cpufreq sysfs attributes for CPU0 four times. And this triggered a reset of the machine (x1e80100 CRD). The last values output were: affected_cpus:0 1 2 3 cpuinfo_cur_freq:<unknown> cpuinfo_max_freq:3417600 cpuinfo_min_freq:710400 cpuinfo_transition_latency:30000 related_cpus:0 1 2 3 scaling_available_frequencies:710400 806400 998400 1190400 1440000 1670400 1920000 2188800 2515200 2707200 2976000 320 scaling_available_governors:ondemand userspace performance schedutil scaling_cur_freq:806400 scaling_driver:scmi scaling_governor:schedutil scaling_max_freq:3417600 scaling_min_freq:710400 scaling_setspeed:<unsupported> Notice the <unknown> current frequency (the previous greps said 710400 and 2515200). The last thing I see on the serial console, presumably just before the reset, is: [ 196.268025] arm-scmi arm-scmi.0.auto: timed out in resp(caller: do_xfer+0x164/0x564) I just rebooted and grepped again and it triggered on the first attempt (cur_freq also said '<unknown>'). Same error in the log, printed when grepping. Johan
On Fri, 01 Nov 2024 13:00:37 +0000, Johan Hovold <johan@kernel.org> wrote: > > [ +CC: Marc, who I think I saw reporting something similar even if I can > seem to find where right now ] It was on IRC. > > On Wed, Oct 30, 2024 at 06:38:38PM +0530, Sibi Sankar wrote: > > This series enables CPUFreq support on the X1E SoC using the SCMI perf > > protocol. This was originally part of the RFC: firmware: arm_scmi: > > Qualcomm Vendor Protocol [1]. I've split it up so that this part can > > land earlier. Warnings Introduced by the series are fixed by [2] > > Sibi Sankar (2): > > arm64: dts: qcom: x1e80100: Add cpucp mailbox and sram nodes > > arm64: dts: qcom: x1e80100: Enable cpufreq > > I've been running with v6 of these for a while now, without noticing any > issues, and just updated to v7 to be able to provide a Tested-by tag. > > I wanted to run a compilation and see how the frequencies varied, but > before I got around to that I just grepped the cpufreq sysfs attributes > for CPU0 four times. And this triggered a reset of the machine (x1e80100 > CRD). > > The last values output were: > > affected_cpus:0 1 2 3 > cpuinfo_cur_freq:<unknown> > cpuinfo_max_freq:3417600 > cpuinfo_min_freq:710400 > cpuinfo_transition_latency:30000 > related_cpus:0 1 2 3 > scaling_available_frequencies:710400 806400 998400 1190400 1440000 1670400 1920000 2188800 2515200 2707200 2976000 320 > scaling_available_governors:ondemand userspace performance schedutil > scaling_cur_freq:806400 > scaling_driver:scmi > scaling_governor:schedutil > scaling_max_freq:3417600 > scaling_min_freq:710400 > scaling_setspeed:<unsupported> > > Notice the <unknown> current frequency (the previous greps said 710400 > and 2515200). > > The last thing I see on the serial console, presumably just before > the reset, is: > > [ 196.268025] arm-scmi arm-scmi.0.auto: timed out in resp(caller: do_xfer+0x164/0x564) > > I just rebooted and grepped again and it triggered on the first attempt > (cur_freq also said '<unknown>'). Same error in the log, printed when > grepping. I'm seeing similar things indeed. Randomly grepping in cpufreq/policy* results in hard resets, although I don't get much on the serial console when that happens. Interestingly, I also see some errors in dmesg at boot time: maz@semi-fraudulent:~$ dmesg| grep -i scmi [ 0.966175] scmi_core: SCMI protocol bus registered [ 7.929710] arm-scmi arm-scmi.2.auto: Using scmi_mailbox_transport [ 7.939059] arm-scmi arm-scmi.2.auto: SCMI max-rx-timeout: 30ms [ 7.945567] arm-scmi arm-scmi.2.auto: SCMI RAW Mode initialized for instance 0 [ 7.958348] arm-scmi arm-scmi.2.auto: SCMI RAW Mode COEX enabled ! [ 7.978303] arm-scmi arm-scmi.2.auto: SCMI Notifications - Core Enabled. [ 7.985351] arm-scmi arm-scmi.2.auto: SCMI Protocol v2.0 'Qualcomm:' Firmware version 0x20000 [ 8.033774] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16 [ 8.033902] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16 [ 8.036528] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16 [ 8.036744] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16 [ 8.171232] scmi-perf-domain scmi_dev.4: Initialized 3 performance domains All these "Failed" are a bit worrying. Happy to put any theory to the test. Thanks, M. -- Without deviation from the norm, progress is not possible.
On Fri, Nov 01, 2024 at 02:08:24PM +0000, Marc Zyngier wrote: > I'm seeing similar things indeed. Randomly grepping in cpufreq/policy* > results in hard resets, although I don't get much on the serial > console when that happens. Interestingly, I also see some errors in > dmesg at boot time: > > maz@semi-fraudulent:~$ dmesg| grep -i scmi > [ 0.966175] scmi_core: SCMI protocol bus registered > [ 7.929710] arm-scmi arm-scmi.2.auto: Using scmi_mailbox_transport > [ 7.939059] arm-scmi arm-scmi.2.auto: SCMI max-rx-timeout: 30ms > [ 7.945567] arm-scmi arm-scmi.2.auto: SCMI RAW Mode initialized for instance 0 > [ 7.958348] arm-scmi arm-scmi.2.auto: SCMI RAW Mode COEX enabled ! > [ 7.978303] arm-scmi arm-scmi.2.auto: SCMI Notifications - Core Enabled. > [ 7.985351] arm-scmi arm-scmi.2.auto: SCMI Protocol v2.0 'Qualcomm:' Firmware version 0x20000 > [ 8.033774] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16 > [ 8.033902] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16 > [ 8.036528] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16 > [ 8.036744] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16 > [ 8.171232] scmi-perf-domain scmi_dev.4: Initialized 3 performance domains > > All these "Failed" are a bit worrying. Happy to put any theory to the > test. Yes, those warnings indeed look troubling. Fortunately they appear to be mostly benign and only indicate that the firmware is reporting duplicate OPPs, which the kernel is now ignoring without any other side effects than the warnings. The side-effects and these remaining warnings are addressed by this series: https://lore.kernel.org/all/20241030125512.2884761-1-quic_sibis@quicinc.com/ but I think we should try to make the warnings a bit more informative (and less scary) by printing something along the lines of: arm-scmi arm-scmi.0.auto: [Firmware Bug]: Ignoring duplicate OPP 3417600 for NCC instead. Johan
On Fri, 01 Nov 2024 14:19:54 +0000, Johan Hovold <johan@kernel.org> wrote: > > On Fri, Nov 01, 2024 at 02:08:24PM +0000, Marc Zyngier wrote: > > > I'm seeing similar things indeed. Randomly grepping in cpufreq/policy* > > results in hard resets, although I don't get much on the serial > > console when that happens. Interestingly, I also see some errors in > > dmesg at boot time: > > > > maz@semi-fraudulent:~$ dmesg| grep -i scmi > > [ 0.966175] scmi_core: SCMI protocol bus registered > > [ 7.929710] arm-scmi arm-scmi.2.auto: Using scmi_mailbox_transport > > [ 7.939059] arm-scmi arm-scmi.2.auto: SCMI max-rx-timeout: 30ms > > [ 7.945567] arm-scmi arm-scmi.2.auto: SCMI RAW Mode initialized for instance 0 > > [ 7.958348] arm-scmi arm-scmi.2.auto: SCMI RAW Mode COEX enabled ! > > [ 7.978303] arm-scmi arm-scmi.2.auto: SCMI Notifications - Core Enabled. > > [ 7.985351] arm-scmi arm-scmi.2.auto: SCMI Protocol v2.0 'Qualcomm:' Firmware version 0x20000 > > [ 8.033774] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16 > > [ 8.033902] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16 > > [ 8.036528] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16 > > [ 8.036744] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16 > > [ 8.171232] scmi-perf-domain scmi_dev.4: Initialized 3 performance domains > > > > All these "Failed" are a bit worrying. Happy to put any theory to the > > test. > > Yes, those warnings indeed look troubling. Fortunately they appear to be > mostly benign and only indicate that the firmware is reporting duplicate > OPPs, which the kernel is now ignoring without any other side effects > than the warnings. Right. Not something that would explain the hard reset behaviour then. > > The side-effects and these remaining warnings are addressed by this > series: > > https://lore.kernel.org/all/20241030125512.2884761-1-quic_sibis@quicinc.com/ > > but I think we should try to make the warnings a bit more informative > (and less scary) by printing something along the lines of: > > arm-scmi arm-scmi.0.auto: [Firmware Bug]: Ignoring duplicate OPP 3417600 for NCC > > instead. Indeed. Seeing [Firmware Bug] has a comforting feeling of familiarity... :) I wonder whether the same sort of reset happen on more "commercial" systems (such as some of the laptops). You expect that people look at the cpufreq stuff closely, and don't see things exploding like we are. M. -- Without deviation from the norm, progress is not possible.
On Fri, Nov 01, 2024 at 02:43:57PM +0000, Marc Zyngier wrote: > On Fri, 01 Nov 2024 14:19:54 +0000, > Johan Hovold <johan@kernel.org> wrote: > > The side-effects and these remaining warnings are addressed by this > > series: > > > > https://lore.kernel.org/all/20241030125512.2884761-1-quic_sibis@quicinc.com/ > > > > but I think we should try to make the warnings a bit more informative > > (and less scary) by printing something along the lines of: > > > > arm-scmi arm-scmi.0.auto: [Firmware Bug]: Ignoring duplicate OPP 3417600 for NCC > > > > instead. > > Indeed. Seeing [Firmware Bug] has a comforting feeling of > familiarity... :) > > I wonder whether the same sort of reset happen on more "commercial" > systems (such as some of the laptops). You expect that people look at > the cpufreq stuff closely, and don't see things exploding like we are. I finally got around to getting my Lenovo ThinkPad T14s to boot (it refuses to start the kernel when using GRUB, and it's not due to the known 64 GB memory issue as it only has 32 GB) and can confirm that it hard resets when accessing the cpufreq sysfs attributes as well. On the bright side, at least I don't see any warnings due to duplicate OPPs on this machine (x1e78100, latest UEFI fw). Johan
On Tue, 05 Nov 2024 16:57:07 +0000, Johan Hovold <johan@kernel.org> wrote: > > On Fri, Nov 01, 2024 at 02:43:57PM +0000, Marc Zyngier wrote: > > On Fri, 01 Nov 2024 14:19:54 +0000, > > Johan Hovold <johan@kernel.org> wrote: > > > > The side-effects and these remaining warnings are addressed by this > > > series: > > > > > > https://lore.kernel.org/all/20241030125512.2884761-1-quic_sibis@quicinc.com/ > > > > > > but I think we should try to make the warnings a bit more informative > > > (and less scary) by printing something along the lines of: > > > > > > arm-scmi arm-scmi.0.auto: [Firmware Bug]: Ignoring duplicate OPP 3417600 for NCC > > > > > > instead. > > > > Indeed. Seeing [Firmware Bug] has a comforting feeling of > > familiarity... :) > > > > I wonder whether the same sort of reset happen on more "commercial" > > systems (such as some of the laptops). You expect that people look at > > the cpufreq stuff closely, and don't see things exploding like we are. > > I finally got around to getting my Lenovo ThinkPad T14s to boot (it > refuses to start the kernel when using GRUB, and it's not due to the > known 64 GB memory issue as it only has 32 GB) <cry> I know the feeling. My devkit can't use GRUB either, so I added a hook to the GRUB config to generate EFI scripts that directly execute the kernel with initrd, dtb, and command line. This is probably the worse firmware I've seen in a very long while. </cry> > and can confirm that it > hard resets when accessing the cpufreq sysfs attributes as well. Right. So this also happens on non-abandonware machines. > On the bright side, at least I don't see any warnings due to duplicate > OPPs on this machine (x1e78100, latest UEFI fw). One bug fixed... M. -- Without deviation from the norm, progress is not possible.
© 2016 - 2024 Red Hat, Inc.