[PATCH] wifi: ath12k: fix CMA error and MHI state mismatch during resume

Posted by Saikiran 4 days, 12 hours ago
Commit 8d5f4da8d70b ("wifi: ath12k: support suspend/resume") introduced
system suspend/resume support but caused a critical regression where
CMA pages are corrupted during resume.

1. CMA page corruption:
   Calling mhi_unprepare_after_power_down() during suspend (via
   ATH12K_MHI_DEINIT) prematurely frees the fbc_image and rddm_image
   DMA buffers. When these pages are accessed during resume, the kernel
   detects corruption (Bad page state).

To fix this corruption, the driver must skip ATH12K_MHI_DEINIT during
suspend, preserving the DMA buffers. However, implementing this fix
exposes a second issue in the state machine:

2. Resume failure due to MHI state mismatch:
   When DEINIT is skipped during suspend to protect the memory, the
   ATH12K_MHI_INIT bit remains set. On resume, ath12k_mhi_start()
   blindly attempts to set INIT again, but the state machine rejects
   the transition:

   ath12k_wifi7_pci ...: failed to set mhi state INIT(0) in current
   mhi state (0x1)

Fix the corruption and enable the correct suspend flow by:

1. In ath12k_mhi_stop(), skipping ATH12K_MHI_DEINIT if suspending.
   This prevents the memory corruption by keeping the device context
   valid (MHI_POWER_OFF_KEEP_DEV).

2. In ath12k_mhi_start(), checking if MHI_INIT is already set.
   This accommodates the new suspend flow where the device remains
   initialized, allowing the driver to proceed directly to POWER_ON.

Tested with suspend/resume cycles on Qualcomm Snapdragon X Elite
(SC8380XP) with WCN7850 WiFi. No CMA corruption observed, WiFi resumes
successfully, and deep sleep works correctly.

Fixes: 8d5f4da8d70b ("wifi: ath12k: support suspend/resume")
Tested-on: WCN7850 hw2.0 PCI WLAN.HMT.1.1.c5-00302 (Lenovo Yoga Slim 7x)
Signed-off-by: Saikiran <bjsaikiran@gmail.com>
---
 drivers/net/wireless/ath/ath12k/mhi.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/drivers/net/wireless/ath/ath12k/mhi.c b/drivers/net/wireless/ath/ath12k/mhi.c
index 45c0f66dcc5e..1a0b3bcc6bbf 100644
--- a/drivers/net/wireless/ath/ath12k/mhi.c
+++ b/drivers/net/wireless/ath/ath12k/mhi.c
@@ -485,9 +485,14 @@ int ath12k_mhi_start(struct ath12k_pci *ab_pci)
 
 	ab_pci->mhi_ctrl->timeout_ms = MHI_TIMEOUT_DEFAULT_MS;
 
-	ret = ath12k_mhi_set_state(ab_pci, ATH12K_MHI_INIT);
-	if (ret)
-		goto out;
+	/* In case of suspend/resume, MHI INIT is already done.
+	 * So check if MHI INIT is set or not.
+	 */
+	if (!test_bit(ATH12K_MHI_INIT, &ab_pci->mhi_state)) {
+		ret = ath12k_mhi_set_state(ab_pci, ATH12K_MHI_INIT);
+		if (ret)
+			goto out;
+	}
 
 	ret = ath12k_mhi_set_state(ab_pci, ATH12K_MHI_POWER_ON);
 	if (ret)
@@ -501,16 +506,21 @@ int ath12k_mhi_start(struct ath12k_pci *ab_pci)
 
 void ath12k_mhi_stop(struct ath12k_pci *ab_pci, bool is_suspend)
 {
-	/* During suspend we need to use mhi_power_down_keep_dev()
-	 * workaround, otherwise ath12k_core_resume() will timeout
-	 * during resume.
+	/* During suspend, we need to use mhi_power_down_keep_dev()
+	 * and avoid calling MHI_DEINIT. The deinit frees BHIE tables
+	 * which causes memory corruption when those pages are
+	 * accessed/freed again during resume. We want to keep the
+	 * device prepared for resume, otherwise ath12k_core_resume()
+	 * will timeout.
 	 */
 	if (is_suspend)
 		ath12k_mhi_set_state(ab_pci, ATH12K_MHI_POWER_OFF_KEEP_DEV);
 	else
 		ath12k_mhi_set_state(ab_pci, ATH12K_MHI_POWER_OFF);
 
-	ath12k_mhi_set_state(ab_pci, ATH12K_MHI_DEINIT);
+	/* Only deinit when doing full power down, not during suspend */
+	if (!is_suspend)
+		ath12k_mhi_set_state(ab_pci, ATH12K_MHI_DEINIT);
 }
 
 void ath12k_mhi_suspend(struct ath12k_pci *ab_pci)
-- 
2.51.0
Re: [PATCH] wifi: ath12k: fix CMA error and MHI state mismatch during resume
Posted by Baochen Qiang 4 days ago

On 2/2/2026 11:17 PM, Saikiran wrote:
> Commit 8d5f4da8d70b ("wifi: ath12k: support suspend/resume") introduced
> system suspend/resume support but caused a critical regression where
> CMA pages are corrupted during resume.
> 
> 1. CMA page corruption:
>    Calling mhi_unprepare_after_power_down() during suspend (via
>    ATH12K_MHI_DEINIT) prematurely frees the fbc_image and rddm_image
>    DMA buffers. When these pages are accessed during resume, the kernel
>    detects corruption (Bad page state).

How, FBC image and RDDM image get re-allocated at resume, no?

> 
> To fix this corruption, the driver must skip ATH12K_MHI_DEINIT during
> suspend, preserving the DMA buffers. However, implementing this fix
> exposes a second issue in the state machine:
> 
> 2. Resume failure due to MHI state mismatch:
>    When DEINIT is skipped during suspend to protect the memory, the
>    ATH12K_MHI_INIT bit remains set. On resume, ath12k_mhi_start()
>    blindly attempts to set INIT again, but the state machine rejects
>    the transition:
> 
>    ath12k_wifi7_pci ...: failed to set mhi state INIT(0) in current
>    mhi state (0x1)
> 
> Fix the corruption and enable the correct suspend flow by:
> 
> 1. In ath12k_mhi_stop(), skipping ATH12K_MHI_DEINIT if suspending.
>    This prevents the memory corruption by keeping the device context
>    valid (MHI_POWER_OFF_KEEP_DEV).
> 
> 2. In ath12k_mhi_start(), checking if MHI_INIT is already set.
>    This accommodates the new suspend flow where the device remains
>    initialized, allowing the driver to proceed directly to POWER_ON.
> 
> Tested with suspend/resume cycles on Qualcomm Snapdragon X Elite
> (SC8380XP) with WCN7850 WiFi. No CMA corruption observed, WiFi resumes
> successfully, and deep sleep works correctly.
> 
> Fixes: 8d5f4da8d70b ("wifi: ath12k: support suspend/resume")
> Tested-on: WCN7850 hw2.0 PCI WLAN.HMT.1.1.c5-00302 (Lenovo Yoga Slim 7x)
> Signed-off-by: Saikiran <bjsaikiran@gmail.com>
> ---
>  drivers/net/wireless/ath/ath12k/mhi.c | 24 +++++++++++++++++-------
>  1 file changed, 17 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/net/wireless/ath/ath12k/mhi.c b/drivers/net/wireless/ath/ath12k/mhi.c
> index 45c0f66dcc5e..1a0b3bcc6bbf 100644
> --- a/drivers/net/wireless/ath/ath12k/mhi.c
> +++ b/drivers/net/wireless/ath/ath12k/mhi.c
> @@ -485,9 +485,14 @@ int ath12k_mhi_start(struct ath12k_pci *ab_pci)
>  
>  	ab_pci->mhi_ctrl->timeout_ms = MHI_TIMEOUT_DEFAULT_MS;
>  
> -	ret = ath12k_mhi_set_state(ab_pci, ATH12K_MHI_INIT);
> -	if (ret)
> -		goto out;
> +	/* In case of suspend/resume, MHI INIT is already done.
> +	 * So check if MHI INIT is set or not.
> +	 */
> +	if (!test_bit(ATH12K_MHI_INIT, &ab_pci->mhi_state)) {
> +		ret = ath12k_mhi_set_state(ab_pci, ATH12K_MHI_INIT);
> +		if (ret)
> +			goto out;
> +	}
>  
>  	ret = ath12k_mhi_set_state(ab_pci, ATH12K_MHI_POWER_ON);
>  	if (ret)
> @@ -501,16 +506,21 @@ int ath12k_mhi_start(struct ath12k_pci *ab_pci)
>  
>  void ath12k_mhi_stop(struct ath12k_pci *ab_pci, bool is_suspend)
>  {
> -	/* During suspend we need to use mhi_power_down_keep_dev()
> -	 * workaround, otherwise ath12k_core_resume() will timeout
> -	 * during resume.
> +	/* During suspend, we need to use mhi_power_down_keep_dev()
> +	 * and avoid calling MHI_DEINIT. The deinit frees BHIE tables
> +	 * which causes memory corruption when those pages are
> +	 * accessed/freed again during resume. We want to keep the
> +	 * device prepared for resume, otherwise ath12k_core_resume()
> +	 * will timeout.
>  	 */
>  	if (is_suspend)
>  		ath12k_mhi_set_state(ab_pci, ATH12K_MHI_POWER_OFF_KEEP_DEV);
>  	else
>  		ath12k_mhi_set_state(ab_pci, ATH12K_MHI_POWER_OFF);
>  
> -	ath12k_mhi_set_state(ab_pci, ATH12K_MHI_DEINIT);
> +	/* Only deinit when doing full power down, not during suspend */
> +	if (!is_suspend)
> +		ath12k_mhi_set_state(ab_pci, ATH12K_MHI_DEINIT);
>  }
>  
>  void ath12k_mhi_suspend(struct ath12k_pci *ab_pci)
Re: [PATCH] wifi: ath12k: fix CMA error and MHI state mismatch during resume
Posted by Jayasaikiran Banigallapati 3 days, 22 hours ago
On 2/3/26 08:21, Baochen Qiang wrote:
>
> On 2/2/2026 11:17 PM, Saikiran wrote:
>> Commit 8d5f4da8d70b ("wifi: ath12k: support suspend/resume") introduced
>> system suspend/resume support but caused a critical regression where
>> CMA pages are corrupted during resume.
>>
>> 1. CMA page corruption:
>>     Calling mhi_unprepare_after_power_down() during suspend (via
>>     ATH12K_MHI_DEINIT) prematurely frees the fbc_image and rddm_image
>>     DMA buffers. When these pages are accessed during resume, the kernel
>>     detects corruption (Bad page state).
> How, FBC image and RDDM image get re-allocated at resume, no?

To clarify, the BUG: Bad page state crash actually occurs during the
suspend phase, specifically when ath12k_mhi_stop() calls
mhi_unprepare_after_power_down().

The stack trace shows the panic happens inside mhi_free_bhie_table()
while trying to free the pages:

 mhi_free_bhie_table+0x50/0xa0 [mhi]
 mhi_unprepare_after_power_down+0x30/0x70 [mhi]
 ath12k_mhi_stop+0xf8/0x210 [ath12k]
 ath12k_core_suspend_late+0x94/0xc0 [ath12k]

The kernel reports nonzero _refcount when attempting to free the CMA
pages (fbc_image/rddm_image). This suggests that something is still
holding a reference to these pages when DEINIT attempts to free them,
causing the kernel to panic before we reach the resume stage.

Since the pages cannot be safely freed during suspend, skipping DEINIT
(and using MHI_POWER_OFF_KEEP_DEV) avoids this invalid free operation.
This also aligns with the existing comment in ath12k_mhi_stop which
suggests using mhi_power_down_keep_dev() for suspend.

Thanks & Regards,
Saikiran
Re: [PATCH] wifi: ath12k: fix CMA error and MHI state mismatch during resume
Posted by Baochen Qiang 3 days, 22 hours ago

On 2/3/2026 1:02 PM, Jayasaikiran Banigallapati wrote:
> 
> On 2/3/26 08:21, Baochen Qiang wrote:
>>
>> On 2/2/2026 11:17 PM, Saikiran wrote:
>>> Commit 8d5f4da8d70b ("wifi: ath12k: support suspend/resume") introduced
>>> system suspend/resume support but caused a critical regression where
>>> CMA pages are corrupted during resume.
>>>
>>> 1. CMA page corruption:
>>>     Calling mhi_unprepare_after_power_down() during suspend (via
>>>     ATH12K_MHI_DEINIT) prematurely frees the fbc_image and rddm_image
>>>     DMA buffers. When these pages are accessed during resume, the kernel
>>>     detects corruption (Bad page state).
>> How, FBC image and RDDM image get re-allocated at resume, no?
>>
>> To clarify, the BUG: Bad page state crash actually occurs during the suspend phase,
>> specifically when ath12k_mhi_stop() calls mhi_unprepare_after_power_down().
>>
>> The stack trace shows the panic happens inside mhi_free_bhie_table() while trying to
>> free the pages:
>>
>>  mhi_free_bhie_table+0x50/0xa0 [mhi]
>>  mhi_unprepare_after_power_down+0x30/0x70 [mhi]
>>  ath12k_mhi_stop+0xf8/0x210 [ath12k]
>>  ath12k_core_suspend_late+0x94/0xc0 [ath12k]
>>
>> The kernel reports nonzero _refcount when attempting to free the CMA pages (fbc_image/
>> rddm_image). This suggests that something is still holding a reference to these pages
>> when DEINIT attempts to free them, causing the kernel to panic before we reach the
>> resume stage.

This seems like a bug in either the MHI stack or the kernel DMA/MM subsystems, rather
than in ath12k.

>>
>> Since the pages cannot be safely freed during suspend, skipping DEINIT (and using
>> MHI_POWER_OFF_KEEP_DEV) avoids this invalid free operation. This also aligns with the
>> existing comment in ath12k_mhi_stop which suggests using mhi_power_down_keep_dev() for
>> suspend.

First of all, this is a workaround rather than a fix. Ideally we should try to root-cause
the issue and fix it in the right way.

Secondly, the workaround here seems problematic: you skip INIT during resume. However, note
that several hardware registers need to be re-programmed during this stage. How could the
target work if its power is cut off during suspend and the register context is not restored
during resume?

>>
>> Thanks & Regards,
>> Saikiran

Re: [PATCH] wifi: ath12k: fix CMA error and MHI state mismatch during resume
Posted by Jayasaikiran Banigallapati 3 days, 21 hours ago
On 2/3/26 11:00, Baochen Qiang wrote:
>
> On 2/3/2026 1:02 PM, Jayasaikiran Banigallapati wrote:
>> On 2/3/26 08:21, Baochen Qiang wrote:
>>> On 2/2/2026 11:17 PM, Saikiran wrote:
>>>> Commit 8d5f4da8d70b ("wifi: ath12k: support suspend/resume") introduced
>>>> system suspend/resume support but caused a critical regression where
>>>> CMA pages are corrupted during resume.
>>>>
>>>> 1. CMA page corruption:
>>>>      Calling mhi_unprepare_after_power_down() during suspend (via
>>>>      ATH12K_MHI_DEINIT) prematurely frees the fbc_image and rddm_image
>>>>      DMA buffers. When these pages are accessed during resume, the kernel
>>>>      detects corruption (Bad page state).
>>> How, FBC image and RDDM image get re-allocated at resume, no?
>>>
>>> To clarify, the BUG: Bad page state crash actually occurs during the suspend phase,
>>> specifically when ath12k_mhi_stop() calls mhi_unprepare_after_power_down().
>>>
>>> The stack trace shows the panic happens inside mhi_free_bhie_table() while trying to
>>> free the pages:
>>>
>>>   mhi_free_bhie_table+0x50/0xa0 [mhi]
>>>   mhi_unprepare_after_power_down+0x30/0x70 [mhi]
>>>   ath12k_mhi_stop+0xf8/0x210 [ath12k]
>>>   ath12k_core_suspend_late+0x94/0xc0 [ath12k]
>>>
>>> The kernel reports nonzero _refcount when attempting to free the CMA pages (fbc_image/
>>> rddm_image). This suggests that something is still holding a reference to these pages
>>> when DEINIT attempts to free them, causing the kernel to panic before we reach the
>>> resume stage.
> this seems like a bug either in MHI stack or in kernel DMA/MM subsystems, rather than in
> ath12k
>
>>> Since the pages cannot be safely freed during suspend, skipping DEINIT (and using
>>> MHI_POWER_OFF_KEEP_DEV) avoids this invalid free operation. This also aligns with the
>>> existing comment in ath12k_mhi_stop which suggests using mhi_power_down_keep_dev() for
>>> suspend.
> first of all, this is a workaround rather than fix. Ideally we should try to root cause
> the issue and fix it in the right way.


The original comment in existing code:

/* During suspend we need to use mhi_power_down_keep_dev()
 * workaround, otherwise ath12k_core_resume() will timeout
 * during resume.
 */

This patch aligns the code with this existing intent. The driver was
previously calling DEINIT (and freeing resources) despite the comment
advising to use keep_dev. If the intention of the driver authors was to
use keep_dev for suspend, then my understanding is DEINIT is incorrect
here (Correct me if I am wrong) regardless of the underlying MM behavior.

>
> Secondly the workaround here seems problematic: you skip INIT during resume. However note
> several hardware registers need to be re-programmed during this stage, how could the
> target work if its power is cutoff during suspend and the register context is not restored
> during resume?


In my testing, WiFi functionality was fully restored after resume. The
device associates and passes traffic immediately.

My understanding is that:

ATH12K_MHI_INIT primarily handles host memory allocation (which we
preserved by skipping DEINIT).

ATH12K_MHI_POWER_ON calls mhi_sync_power_up(). This function triggers
the MHI state machine, which handles the necessary BHI/BHIE programming
and firmware download (SBL) sequence.

Since mhi_sync_power_up() is still called during resume, the target is
correctly re-initialized and registers are programmed, even if we skip
the redundant host memory allocation step (INIT).

Thanks & Regards,
Saikiran

Re: [PATCH] wifi: ath12k: fix CMA error and MHI state mismatch during resume
Posted by Baochen Qiang 3 days, 21 hours ago

On 2/3/2026 1:51 PM, Jayasaikiran Banigallapati wrote:
> 
> On 2/3/26 11:00, Baochen Qiang wrote:
>>
>> On 2/3/2026 1:02 PM, Jayasaikiran Banigallapati wrote:
>>> On 2/3/26 08:21, Baochen Qiang wrote:
>>>> On 2/2/2026 11:17 PM, Saikiran wrote:
>>>>> Commit 8d5f4da8d70b ("wifi: ath12k: support suspend/resume") introduced
>>>>> system suspend/resume support but caused a critical regression where
>>>>> CMA pages are corrupted during resume.
>>>>>
>>>>> 1. CMA page corruption:
>>>>>      Calling mhi_unprepare_after_power_down() during suspend (via
>>>>>      ATH12K_MHI_DEINIT) prematurely frees the fbc_image and rddm_image
>>>>>      DMA buffers. When these pages are accessed during resume, the kernel
>>>>>      detects corruption (Bad page state).
>>>> How, FBC image and RDDM image get re-allocated at resume, no?
>>>>
>>>> To clarify, the BUG: Bad page state crash actually occurs during the suspend phase,
>>>> specifically when ath12k_mhi_stop() calls mhi_unprepare_after_power_down().
>>>>
>>>> The stack trace shows the panic happens inside mhi_free_bhie_table() while trying to
>>>> free the pages:
>>>>
>>>>   mhi_free_bhie_table+0x50/0xa0 [mhi]
>>>>   mhi_unprepare_after_power_down+0x30/0x70 [mhi]
>>>>   ath12k_mhi_stop+0xf8/0x210 [ath12k]
>>>>   ath12k_core_suspend_late+0x94/0xc0 [ath12k]
>>>>
>>>> The kernel reports nonzero _refcount when attempting to free the CMA pages (fbc_image/
>>>> rddm_image). This suggests that something is still holding a reference to these pages
>>>> when DEINIT attempts to free them, causing the kernel to panic before we reach the
>>>> resume stage.
>> this seems like a bug either in MHI stack or in kernel DMA/MM subsystems, rather than in
>> ath12k
>>
>>>> Since the pages cannot be safely freed during suspend, skipping DEINIT (and using
>>>> MHI_POWER_OFF_KEEP_DEV) avoids this invalid free operation. This also aligns with the
>>>> existing comment in ath12k_mhi_stop which suggests using mhi_power_down_keep_dev() for
>>>> suspend.
>> first of all, this is a workaround rather than fix. Ideally we should try to root cause
>> the issue and fix it in the right way.
> 
> 
> The original comment in existing code:
> 
> 
> /* During suspend we need to use mhi_power_down_keep_dev()
>  * workaround, otherwise ath12k_core_resume() will timeout
>  * during resume.
>  */
> 
> This patch aligns the code with this existing intent. The driver was previously
> calling DEINIT (and freeing resources) despite the comment advising to use keep_dev.
> If the intention of the driver authors was to use keep_dev for suspend,
> then my understanding is DEINIT is incorrect here (Correct me if I am wrong)
> regardless of the underlying MM behavior.

keep_dev means not to destroy the mhi_device instance while going to suspend. The purpose
is to get rid of the PROBE_DEFER problem in MHI during resume. You may want to check the
upstream discussion to learn about the history.

> 
>>
>> Secondly the workaround here seems problematic: you skip INIT during resume. However note
>> several hardware registers need to be re-programmed during this stage, how could the
>> target work if its power is cutoff during suspend and the register context is not restored
>> during resume?
> 
> 
> In my testing, WiFi functionality was fully restored after resume.
> The device associates and passes traffic immediately.

I can imagine two reasons: either the WLAN target's power is not cut off during suspend, or
you did not hit the issue scenario. For the latter, I mean you may need to trigger a
firmware crash to see whether RDDM works normally, since you skip the RDDM register context
restore during resume.

> 
> My understanding is that:
> 
> ATH12K_MHI_INIT primarily handles host memory allocation (which we preserved by skipping
> DEINIT).

In addition to memory allocation, there is also register programming. See
mhi_prepare_for_power_up() and mhi_rddm_prepare().

> 
> ATH12K_MHI_POWER_ON calls mhi_sync_power_up(). This function triggers the MHI state machine,
> which handles the necessary BHI/BHIE programming and firmware download (SBL) sequence.
> 
> Since mhi_sync_power_up() is still called during resume, the target is correctly
> re-initialized and registers are programmed, even if we skip the redundant host memory
> allocation step (INIT).
> 
> Thanks & Regards,
> Saikiran
>