A relatively unusual race condition occurs between host software
and hardware, where the host sees the updated destination ring head
pointer before the hardware updates the corresponding descriptor.
When this situation occurs, the length of the descriptor returns 0.
The current error handling method is to increment descriptor tail
pointer by 1, but 'sw_index' is not updated, causing descriptor and
skb to not correspond one-to-one, resulting in the following error:
ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1488, expected 1492
ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1460, expected 1484
To address this problem, temporarily skip processing the current
descriptor and handle it again next time. However, to prevent this
descriptor from continuously returning 0, use skb cb to set a flag.
If the length returns 0 again, this descriptor will be discarded.
Tested-on: QCA6698AQ hw2.1 PCI WLAN.HSP.1.1-04546-QCAHSPSWPL_V1_V2_SILICONZ_IOE-1
Reported-by: Johan Hovold <johan+linaro@kernel.org>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218623
Signed-off-by: Miaoqing Pan <quic_miaoqing@quicinc.com>
---
drivers/net/wireless/ath/ath11k/ce.c | 32 ++++++++++++++++++++------
drivers/net/wireless/ath/ath11k/core.h | 1 +
2 files changed, 26 insertions(+), 7 deletions(-)
diff --git a/drivers/net/wireless/ath/ath11k/ce.c b/drivers/net/wireless/ath/ath11k/ce.c
index e66e86bdec20..2573f8c7a994 100644
--- a/drivers/net/wireless/ath/ath11k/ce.c
+++ b/drivers/net/wireless/ath/ath11k/ce.c
@@ -1,7 +1,7 @@
// SPDX-License-Identifier: BSD-3-Clause-Clear
/*
* Copyright (c) 2018-2019 The Linux Foundation. All rights reserved.
- * Copyright (c) 2021-2023 Qualcomm Innovation Center, Inc. All rights reserved.
+ * Copyright (c) 2021-2023, 2025 Qualcomm Innovation Center, Inc. All rights reserved.
*/
#include "dp_rx.h"
@@ -387,18 +387,36 @@ static int ath11k_ce_completed_recv_next(struct ath11k_ce_pipe *pipe,
ath11k_hal_srng_access_begin(ab, srng);
- desc = ath11k_hal_srng_dst_get_next_entry(ab, srng);
+ desc = ath11k_hal_srng_dst_peek(ab, srng);
if (!desc) {
ret = -EIO;
goto err;
}
*nbytes = ath11k_hal_ce_dst_status_get_length(desc);
- if (*nbytes == 0) {
- ret = -EIO;
- goto err;
+ if (unlikely(*nbytes == 0)) {
+ struct ath11k_skb_rxcb *rxcb =
+ ATH11K_SKB_RXCB(pipe->dest_ring->skb[sw_index]);
+
+ /* A relatively unusual race condition occurs between host
+ * software and hardware, where the host sees the updated
+ * destination ring head pointer before the hardware updates
+ * the corresponding descriptor.
+ *
+ * Temporarily skip processing the current descriptor and handle
+ * it again next time. However, to prevent this descriptor from
+ * continuously returning 0, set 'is_desc_len0' flag. If the
+ * length returns 0 again, this descriptor will be discarded.
+ */
+ if (!rxcb->is_desc_len0) {
+ rxcb->is_desc_len0 = true;
+ ret = -EIO;
+ goto err;
+ }
}
+ ath11k_hal_srng_dst_next(ab, srng);
+
*skb = pipe->dest_ring->skb[sw_index];
pipe->dest_ring->skb[sw_index] = NULL;
@@ -430,8 +448,8 @@ static void ath11k_ce_recv_process_cb(struct ath11k_ce_pipe *pipe)
dma_unmap_single(ab->dev, ATH11K_SKB_RXCB(skb)->paddr,
max_nbytes, DMA_FROM_DEVICE);
- if (unlikely(max_nbytes < nbytes)) {
- ath11k_warn(ab, "rxed more than expected (nbytes %d, max %d)",
+ if (unlikely(max_nbytes < nbytes || !nbytes)) {
+ ath11k_warn(ab, "rxed invalid length (nbytes %d, max %d)",
nbytes, max_nbytes);
dev_kfree_skb_any(skb);
continue;
diff --git a/drivers/net/wireless/ath/ath11k/core.h b/drivers/net/wireless/ath/ath11k/core.h
index 1a3d0de4afde..c8614c5c6493 100644
--- a/drivers/net/wireless/ath/ath11k/core.h
+++ b/drivers/net/wireless/ath/ath11k/core.h
@@ -128,6 +128,7 @@ struct ath11k_skb_rxcb {
bool is_continuation;
bool is_mcbc;
bool is_eapol;
+ bool is_desc_len0;
struct hal_rx_desc *rx_desc;
u8 err_rel_src;
u8 err_code;
--
2.25.1
On Mon, Mar 10, 2025 at 09:02:17AM +0800, Miaoqing Pan wrote:
> A relatively unusual race condition occurs between host software
> and hardware, where the host sees the updated destination ring head
> pointer before the hardware updates the corresponding descriptor.
> When this situation occurs, the length of the descriptor returns 0.
I still think this description is too vague and it doesn't explain how
this race is even possible. It sounds like there's a bug somewhere in
the driver or firmware, but if this really is an indication the hardware
is broken as your reply here seems to suggest:
https://lore.kernel.org/lkml/bc187777-588c-4fa0-ba8c-847e91c78d43@quicinc.com/
then that too should be highlighted in the commit message (e.g. by
describing this as "working around broken hardware").
> The current error handling method is to increment descriptor tail
> pointer by 1, but 'sw_index' is not updated, causing descriptor and
> skb to not correspond one-to-one, resulting in the following error:
>
> ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1488, expected 1492
> ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1460, expected 1484
>
> To address this problem, temporarily skip processing the current
> descriptor and handle it again next time. However, to prevent this
> descriptor from continuously returning 0, use skb cb to set a flag.
> If the length returns 0 again, this descriptor will be discarded.
The ath12k ring-buffer handling looks very similar. Do you need a
corresponding workaround in ath12k_ce_completed_recv_next()? Or are you
sure that this (hardware) bug only affects ath11k devices?
> *nbytes = ath11k_hal_ce_dst_status_get_length(desc);
> - if (*nbytes == 0) {
> - ret = -EIO;
> - goto err;
> + if (unlikely(*nbytes == 0)) {
> + struct ath11k_skb_rxcb *rxcb =
> + ATH11K_SKB_RXCB(pipe->dest_ring->skb[sw_index]);
> +
> + /* A relatively unusual race condition occurs between host
> + * software and hardware, where the host sees the updated
> + * destination ring head pointer before the hardware updates
> + * the corresponding descriptor.
> + *
> + * Temporarily skip processing the current descriptor and handle
> + * it again next time. However, to prevent this descriptor from
> + * continuously returning 0, set 'is_desc_len0' flag. If the
> + * length returns 0 again, this descriptor will be discarded.
> + */
> + if (!rxcb->is_desc_len0) {
> + rxcb->is_desc_len0 = true;
> + ret = -EIO;
> + goto err;
> + }
> }
I'm still waiting for feedback from one user that can reproduce the
ring-buffer corruption very easily, but another user mentioned seeing
multiple zero-length descriptor warnings over the weekend when running
with this patch:
ath11k_pci 0006:01:00.0: rxed invalid length (nbytes 0, max 2048)
Are there ever any valid reasons for seeing a zero-length descriptor
(i.e. unrelated to the race at hand)? IIUC the warning would only be
printed when processing such descriptors a second time (i.e. when
is_desc_len0 is set).
Johan
On 3/10/2025 6:09 PM, Johan Hovold wrote:
> On Mon, Mar 10, 2025 at 09:02:17AM +0800, Miaoqing Pan wrote:
>> A relatively unusual race condition occurs between host software
>> and hardware, where the host sees the updated destination ring head
>> pointer before the hardware updates the corresponding descriptor.
>> When this situation occurs, the length of the descriptor returns 0.
>
> I still think this description is too vague and it doesn't explain how
> this race is even possible. It sounds like there's a bug somewhere in
> the driver or firmware, but if this really is an indication the hardware
> is broken as your reply here seems to suggest:
>
> https://lore.kernel.org/lkml/bc187777-588c-4fa0-ba8c-847e91c78d43@quicinc.com/
>
> then that too should be highlighted in the commit message (e.g. by
> describing this as "working around broken hardware").
>
Ok, will do.
>> The current error handling method is to increment descriptor tail
>> pointer by 1, but 'sw_index' is not updated, causing descriptor and
>> skb to not correspond one-to-one, resulting in the following error:
>>
>> ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1488, expected 1492
>> ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1460, expected 1484
>>
>> To address this problem, temporarily skip processing the current
>> descriptor and handle it again next time. However, to prevent this
>> descriptor from continuously returning 0, use skb cb to set a flag.
>> If the length returns 0 again, this descriptor will be discarded.
>
> The ath12k ring-buffer handling looks very similar. Do you need a
> corresponding workaround in ath12k_ce_completed_recv_next()? Or are you
> sure that this (hardware) bug only affects ath11k devices?
>
Yes, will submit a similar patch.
>> *nbytes = ath11k_hal_ce_dst_status_get_length(desc);
>> - if (*nbytes == 0) {
>> - ret = -EIO;
>> - goto err;
>> + if (unlikely(*nbytes == 0)) {
>> + struct ath11k_skb_rxcb *rxcb =
>> + ATH11K_SKB_RXCB(pipe->dest_ring->skb[sw_index]);
>> +
>> + /* A relatively unusual race condition occurs between host
>> + * software and hardware, where the host sees the updated
>> + * destination ring head pointer before the hardware updates
>> + * the corresponding descriptor.
>> + *
>> + * Temporarily skip processing the current descriptor and handle
>> + * it again next time. However, to prevent this descriptor from
>> + * continuously returning 0, set 'is_desc_len0' flag. If the
>> + * length returns 0 again, this descriptor will be discarded.
>> + */
>> + if (!rxcb->is_desc_len0) {
>> + rxcb->is_desc_len0 = true;
>> + ret = -EIO;
>> + goto err;
>> + }
>> }
>
> I'm still waiting for feedback from one user that can reproduce the
> ring-buffer corruption very easily, but another user mentioned seeing
> multiple zero-length descriptor warnings over the weekend when running
> with this patch:
>
> ath11k_pci 0006:01:00.0: rxed invalid length (nbytes 0, max 2048)
>
> Are there ever any valid reasons for seeing a zero-length descriptor
> (i.e. unrelated to the race at hand)? IIUC the warning would only be
> printed when processing such descriptors a second time (i.e. when
> is_desc_len0 is set).
>
That's exactly the logic, only can see the warning in a second time. For
the first time, ath12k_ce_completed_recv_next() returns '-EIO'.
On 3/11/2025 1:29 AM, Miaoqing Pan wrote: > On 3/10/2025 6:09 PM, Johan Hovold wrote: >> I'm still waiting for feedback from one user that can reproduce the >> ring-buffer corruption very easily, but another user mentioned seeing >> multiple zero-length descriptor warnings over the weekend when running >> with this patch: >> >> ath11k_pci 0006:01:00.0: rxed invalid length (nbytes 0, max 2048) >> >> Are there ever any valid reasons for seeing a zero-length descriptor >> (i.e. unrelated to the race at hand)? IIUC the warning would only be >> printed when processing such descriptors a second time (i.e. when >> is_desc_len0 is set). >> > > That's exactly the logic, only can see the warning in a second time. For > the first time, ath12k_ce_completed_recv_next() returns '-EIO'. That didn't answer Johan's first question: Are there ever any valid reasons for seeing a zero-length descriptor? We have an issue that there is a race condition where hardware updates the pointer before it has filled all the data. The current solution is just to read the data a second time. But if that second read also occurs before hardware has updated the data, then the issue isn't fixed. So should there be some forced delay before we read a second time? Or should we attempt to read more times?
On 3/11/2025 11:20 PM, Jeff Johnson wrote: > On 3/11/2025 1:29 AM, Miaoqing Pan wrote: >> On 3/10/2025 6:09 PM, Johan Hovold wrote: >>> I'm still waiting for feedback from one user that can reproduce the >>> ring-buffer corruption very easily, but another user mentioned seeing >>> multiple zero-length descriptor warnings over the weekend when running >>> with this patch: >>> >>> ath11k_pci 0006:01:00.0: rxed invalid length (nbytes 0, max 2048) >>> >>> Are there ever any valid reasons for seeing a zero-length descriptor >>> (i.e. unrelated to the race at hand)? IIUC the warning would only be >>> printed when processing such descriptors a second time (i.e. when >>> is_desc_len0 is set). >>> >> >> That's exactly the logic, only can see the warning in a second time. For >> the first time, ath12k_ce_completed_recv_next() returns '-EIO'. > > That didn't answer Johan's first question: > Are there ever any valid reasons for seeing a zero-length descriptor? > The events currently observed are all firmware logs. The discarded packets will not affect normal operation. I will adjust the logging to debug level. > We have an issue that there is a race condition where hardware updates the > pointer before it has filled all the data. The current solution is just to > read the data a second time. But if that second read also occurs before > hardware has updated the data, then the issue isn't fixed. > Thanks for the addition. > So should there be some forced delay before we read a second time? > Or should we attempt to read more times? > The initial fix was to keep waiting for the data to be ready. The observed phenomenon is that when the second read fails, subsequent reads will continue to fail until the firmware's CE2 ring is full and triggers an assert after timeout. However, this situation is relatively rare, and in most cases, the second read will succeed. Therefore, adding a delay or multiple read attempts is not useful.
On Wed, Mar 12, 2025 at 09:11:45AM +0800, Miaoqing Pan wrote:
> On 3/11/2025 11:20 PM, Jeff Johnson wrote:
> > On 3/11/2025 1:29 AM, Miaoqing Pan wrote:
> >> On 3/10/2025 6:09 PM, Johan Hovold wrote:
> >>> I'm still waiting for feedback from one user that can reproduce the
> >>> ring-buffer corruption very easily, but another user mentioned seeing
> >>> multiple zero-length descriptor warnings over the weekend when running
> >>> with this patch:
> >>>
> >>> ath11k_pci 0006:01:00.0: rxed invalid length (nbytes 0, max 2048)
> >>>
> >>> Are there ever any valid reasons for seeing a zero-length descriptor
> >>> (i.e. unrelated to the race at hand)? IIUC the warning would only be
> >>> printed when processing such descriptors a second time (i.e. when
> >>> is_desc_len0 is set).
> >>
> >> That's exactly the logic, only can see the warning in a second time. For
> >> the first time, ath12k_ce_completed_recv_next() returns '-EIO'.
> >
> > That didn't answer Johan's first question:
> > Are there ever any valid reasons for seeing a zero-length descriptor?
>
> The events currently observed are all firmware logs. The discarded
> packets will not affect normal operation. I will adjust the logging to
> debug level.
That still does not answer the question whether there are ever any valid
reasons for seeing zero-length descriptors. I assume there are none?
> > We have an issue that there is a race condition where hardware updates the
> > pointer before it has filled all the data. The current solution is just to
> > read the data a second time. But if that second read also occurs before
> > hardware has updated the data, then the issue isn't fixed.
>
> Thanks for the addition.
>
> > So should there be some forced delay before we read a second time?
> > Or should we attempt to read more times?
>
> The initial fix was to keep waiting for the data to be ready. The
> observed phenomenon is that when the second read fails, subsequent reads
> will continue to fail until the firmware's CE2 ring is full and triggers
> an assert after timeout. However, this situation is relatively rare, and
> in most cases, the second read will succeed. Therefore, adding a delay
> or multiple read attempts is not useful.
The proposed fix is broken since ath11k_hal_ce_dst_status_get_length()
not just reads the length but also sets it to zero so that the updated
length may indeed never be seen.
I've taken a closer look at the driver and it seems like we're missing a
read barrier to make sure that the updated descriptor is not read until
after the head pointer.
Miaoqing, could you try the below patch with your reproducer and see if
it is enough to fix the corruption?
If so I can resend with the warning removed and include a corresponding
fix for ath12k (it looks like there are further places where barriers
are missing too).
Johan
From 656dbd0894741445aeb16ee8357e6fef51b6084c Mon Sep 17 00:00:00 2001
From: Johan Hovold <johan+linaro@kernel.org>
Date: Wed, 12 Mar 2025 16:49:20 +0100
Subject: [PATCH] wifi: ath11k: fix ring-buffer corruption
Users of the Lenovo ThinkPad X13s have reported that Wi-Fi sometimes
breaks and the log fills up with errors like:
ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1484, expected 1492
ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1460, expected 1484
which based on a quick look at the driver seemed to indicate some kind
of ring-buffer corruption.
Miaoqing Pan tracked it down to the host seeing the updated destination
ring head pointer before the updated descriptor, and the error handling
for that in turn leaves the ring buffer in an inconsistent state.
Add the missing the read barrier to make sure that the descriptor is
read after the head pointer to address the root cause of the corruption.
The error handling can be fixed separately in case there can ever be
actual zero-length descriptors.
FIXME: remove WARN_ON_ONCE() added for verification purposes
Tested-on: WCN6855 hw2.1 WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.41
Fixes: d5c65159f289 ("ath11k: driver for Qualcomm IEEE 802.11ax devices")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218623
Link: https://lore.kernel.org/20250310010217.3845141-3-quic_miaoqing@quicinc.com
Cc: Miaoqing Pan <quic_miaoqing@quicinc.com>
Cc: stable@vger.kernel.org # 5.6
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
---
drivers/net/wireless/ath/ath11k/ce.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/net/wireless/ath/ath11k/ce.c b/drivers/net/wireless/ath/ath11k/ce.c
index e66e86bdec20..423b970e288c 100644
--- a/drivers/net/wireless/ath/ath11k/ce.c
+++ b/drivers/net/wireless/ath/ath11k/ce.c
@@ -393,8 +393,12 @@ static int ath11k_ce_completed_recv_next(struct ath11k_ce_pipe *pipe,
goto err;
}
+ /* Make sure descriptor is read after the head pointer. */
+ dma_rmb();
+
*nbytes = ath11k_hal_ce_dst_status_get_length(desc);
if (*nbytes == 0) {
+ WARN_ON_ONCE(1); // FIXME: remove
ret = -EIO;
goto err;
}
--
2.48.1
On 3/13/2025 12:43 AM, Johan Hovold wrote:
> On Wed, Mar 12, 2025 at 09:11:45AM +0800, Miaoqing Pan wrote:
>> On 3/11/2025 11:20 PM, Jeff Johnson wrote:
>>> On 3/11/2025 1:29 AM, Miaoqing Pan wrote:
>>>> On 3/10/2025 6:09 PM, Johan Hovold wrote:
>>>>> I'm still waiting for feedback from one user that can reproduce the
>>>>> ring-buffer corruption very easily, but another user mentioned seeing
>>>>> multiple zero-length descriptor warnings over the weekend when running
>>>>> with this patch:
>>>>>
>>>>> ath11k_pci 0006:01:00.0: rxed invalid length (nbytes 0, max 2048)
>>>>>
>>>>> Are there ever any valid reasons for seeing a zero-length descriptor
>>>>> (i.e. unrelated to the race at hand)? IIUC the warning would only be
>>>>> printed when processing such descriptors a second time (i.e. when
>>>>> is_desc_len0 is set).
>>>>
>>>> That's exactly the logic, only can see the warning in a second time. For
>>>> the first time, ath12k_ce_completed_recv_next() returns '-EIO'.
>>>
>>> That didn't answer Johan's first question:
>>> Are there ever any valid reasons for seeing a zero-length descriptor?
>>
>> The events currently observed are all firmware logs. The discarded
>> packets will not affect normal operation. I will adjust the logging to
>> debug level.
>
> That still does not answer the question whether there are ever any valid
> reasons for seeing zero-length descriptors. I assume there are none?
>
>>> We have an issue that there is a race condition where hardware updates the
>>> pointer before it has filled all the data. The current solution is just to
>>> read the data a second time. But if that second read also occurs before
>>> hardware has updated the data, then the issue isn't fixed.
>>
>> Thanks for the addition.
>>
>>> So should there be some forced delay before we read a second time?
>>> Or should we attempt to read more times?
>>
>> The initial fix was to keep waiting for the data to be ready. The
>> observed phenomenon is that when the second read fails, subsequent reads
>> will continue to fail until the firmware's CE2 ring is full and triggers
>> an assert after timeout. However, this situation is relatively rare, and
>> in most cases, the second read will succeed. Therefore, adding a delay
>> or multiple read attempts is not useful.
>
> The proposed fix is broken since ath11k_hal_ce_dst_status_get_length()
> not just reads the length but also sets it to zero so that the updated
> length may indeed never be seen.
>
> I've taken a closer look at the driver and it seems like we're missing a
> read barrier to make sure that the updated descriptor is not read until
> after the head pointer.
>
> Miaoqing, could you try the below patch with your reproducer and see if
> it is enough to fix the corruption?
>
> If so I can resend with the warning removed and include a corresponding
> fix for ath12k (it looks like there are further places where barriers
> are missing too).
>
> Johan
>
>
> From 656dbd0894741445aeb16ee8357e6fef51b6084c Mon Sep 17 00:00:00 2001
> From: Johan Hovold <johan+linaro@kernel.org>
> Date: Wed, 12 Mar 2025 16:49:20 +0100
> Subject: [PATCH] wifi: ath11k: fix ring-buffer corruption
>
> Users of the Lenovo ThinkPad X13s have reported that Wi-Fi sometimes
> breaks and the log fills up with errors like:
>
> ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1484, expected 1492
> ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1460, expected 1484
>
> which based on a quick look at the driver seemed to indicate some kind
> of ring-buffer corruption.
>
> Miaoqing Pan tracked it down to the host seeing the updated destination
> ring head pointer before the updated descriptor, and the error handling
> for that in turn leaves the ring buffer in an inconsistent state.
>
> Add the missing the read barrier to make sure that the descriptor is
> read after the head pointer to address the root cause of the corruption.
>
> The error handling can be fixed separately in case there can ever be
> actual zero-length descriptors.
>
> FIXME: remove WARN_ON_ONCE() added for verification purposes
>
> Tested-on: WCN6855 hw2.1 WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.41
>
> Fixes: d5c65159f289 ("ath11k: driver for Qualcomm IEEE 802.11ax devices")
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218623
> Link: https://lore.kernel.org/20250310010217.3845141-3-quic_miaoqing@quicinc.com
> Cc: Miaoqing Pan <quic_miaoqing@quicinc.com>
> Cc: stable@vger.kernel.org # 5.6
> Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
> ---
> drivers/net/wireless/ath/ath11k/ce.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/drivers/net/wireless/ath/ath11k/ce.c b/drivers/net/wireless/ath/ath11k/ce.c
> index e66e86bdec20..423b970e288c 100644
> --- a/drivers/net/wireless/ath/ath11k/ce.c
> +++ b/drivers/net/wireless/ath/ath11k/ce.c
> @@ -393,8 +393,12 @@ static int ath11k_ce_completed_recv_next(struct ath11k_ce_pipe *pipe,
> goto err;
> }
>
> + /* Make sure descriptor is read after the head pointer. */
> + dma_rmb();
> +
> *nbytes = ath11k_hal_ce_dst_status_get_length(desc);
> if (*nbytes == 0) {
> + WARN_ON_ONCE(1); // FIXME: remove
> ret = -EIO;
> goto err;
> }
This issue can still be reproduced.
[ 3283.687469] WARNING: CPU: 0 PID: 0 at
/drivers/net/wireless/ath/ath11k/ce.c:405
ath11k_ce_per_engine_service+0x228/0x3e4 [ath11k]
[ 3283.688685] Call trace:
[ 3283.688692] ath11k_ce_per_engine_service+0x228/0x3e4 [ath11k]
[ 3283.688807] ath11k_pcic_ce_tasklet+0x34/0x54 [ath11k]
[ 3283.688920] tasklet_action_common.isra.0+0xec/0x338
[ 3283.688958] tasklet_action+0x28/0x34
[ 3283.688972] handle_softirqs+0x120/0x36c
On Thu, Mar 13, 2025 at 09:31:56PM +0800, Miaoqing Pan wrote:
> On 3/13/2025 12:43 AM, Johan Hovold wrote:
> > On Wed, Mar 12, 2025 at 09:11:45AM +0800, Miaoqing Pan wrote:
> >> On 3/11/2025 11:20 PM, Jeff Johnson wrote:
> >>> That didn't answer Johan's first question:
> >>> Are there ever any valid reasons for seeing a zero-length descriptor?
> >>
> >> The events currently observed are all firmware logs. The discarded
> >> packets will not affect normal operation. I will adjust the logging to
> >> debug level.
Have you looked at the device side of things as well? Could it be that
the firmware is doing something wrong when forwarding the logs?
How sure are you that you only see this with firmware logs?
> > I've taken a closer look at the driver and it seems like we're missing a
> > read barrier to make sure that the updated descriptor is not read until
> > after the head pointer.
> >
> > Miaoqing, could you try the below patch with your reproducer and see if
> > it is enough to fix the corruption?
> > + /* Make sure descriptor is read after the head pointer. */
> > + dma_rmb();
> > +
> > *nbytes = ath11k_hal_ce_dst_status_get_length(desc);
> > if (*nbytes == 0) {
> > + WARN_ON_ONCE(1); // FIXME: remove
> > ret = -EIO;
> > goto err;
> > }
>
> This issue can still be reproduced.
>
> [ 3283.687469] WARNING: CPU: 0 PID: 0 at
> /drivers/net/wireless/ath/ath11k/ce.c:405
> ath11k_ce_per_engine_service+0x228/0x3e4 [ath11k]
Thanks for verifying.
Which platform are you testing on and which kernel are you using?
I'm still waiting to hear back from some people testing my patch on the
X13s (sc8280xp).
Johan
On 3/14/2025 12:14 AM, Johan Hovold wrote:
> On Thu, Mar 13, 2025 at 09:31:56PM +0800, Miaoqing Pan wrote:
>> On 3/13/2025 12:43 AM, Johan Hovold wrote:
>>> On Wed, Mar 12, 2025 at 09:11:45AM +0800, Miaoqing Pan wrote:
>>>> On 3/11/2025 11:20 PM, Jeff Johnson wrote:
>
>>>>> That didn't answer Johan's first question:
>>>>> Are there ever any valid reasons for seeing a zero-length descriptor?
>>>>
>>>> The events currently observed are all firmware logs. The discarded
>>>> packets will not affect normal operation. I will adjust the logging to
>>>> debug level.
>
> Have you looked at the device side of things as well? Could it be that
> the firmware is doing something wrong when forwarding the logs?
>
> How sure are you that you only see this with firmware logs?
>
Checked, no issues have been found so far.
>>> I've taken a closer look at the driver and it seems like we're missing a
>>> read barrier to make sure that the updated descriptor is not read until
>>> after the head pointer.
>>>
>>> Miaoqing, could you try the below patch with your reproducer and see if
>>> it is enough to fix the corruption?
>
>>> + /* Make sure descriptor is read after the head pointer. */
>>> + dma_rmb();
>>> +
>>> *nbytes = ath11k_hal_ce_dst_status_get_length(desc);
>>> if (*nbytes == 0) {
>>> + WARN_ON_ONCE(1); // FIXME: remove
>>> ret = -EIO;
>>> goto err;
>>> }
>>
>> This issue can still be reproduced.
>>
>> [ 3283.687469] WARNING: CPU: 0 PID: 0 at
>> /drivers/net/wireless/ath/ath11k/ce.c:405
>> ath11k_ce_per_engine_service+0x228/0x3e4 [ath11k]
>
> Thanks for verifying.
>
> Which platform are you testing on and which kernel are you using?
>
> I'm still waiting to hear back from some people testing my patch on the
> X13s (sc8280xp).
>
> Johan
qcs615-ride, kernel 6.6.65.
I think the hardware has already ensured synchronization between
descriptor and head pointer, which isn't difficult to achieve. The issue
is likely caused by something else and requires further debugging.
On Fri, Mar 14, 2025 at 09:01:36AM +0800, Miaoqing Pan wrote:
> On 3/14/2025 12:14 AM, Johan Hovold wrote:
> > On Thu, Mar 13, 2025 at 09:31:56PM +0800, Miaoqing Pan wrote:
> >> On 3/13/2025 12:43 AM, Johan Hovold wrote:
> >>> + /* Make sure descriptor is read after the head pointer. */
> >>> + dma_rmb();
> >>> +
> >>> *nbytes = ath11k_hal_ce_dst_status_get_length(desc);
> >>> if (*nbytes == 0) {
> >>> + WARN_ON_ONCE(1); // FIXME: remove
> >>> ret = -EIO;
> >>> goto err;
> >>> }
> >>
> >> This issue can still be reproduced.
> >>
> >> [ 3283.687469] WARNING: CPU: 0 PID: 0 at
> >> /drivers/net/wireless/ath/ath11k/ce.c:405
> >> ath11k_ce_per_engine_service+0x228/0x3e4 [ath11k]
> >
> > Thanks for verifying.
> >
> > Which platform are you testing on and which kernel are you using?
> >
> > I'm still waiting to hear back from some people testing my patch on the
> > X13s (sc8280xp).
> qcs615-ride, kernel 6.6.65.
Ok, so a downstream vendor kernel?
qcs615-ride does not even have PCIe enabled in mainline yet, but I
assume that's what you use here?
> I think the hardware has already ensured synchronization between
> descriptor and head pointer, which isn't difficult to achieve. The issue
> is likely caused by something else and requires further debugging.
Yeah, but you still need memory barriers on the kernel side.
It could be that we are looking at two different causes for those
zero-length descriptors.
The error handling for that obviously needs to be fixed either way, but
I haven't heard anyone hitting the corruption with the memory barriers
in place on the X13s yet (even if we'd need some more time to test
this).
Johan
On 3/14/2025 4:06 PM, Johan Hovold wrote:
> On Fri, Mar 14, 2025 at 09:01:36AM +0800, Miaoqing Pan wrote:
>> On 3/14/2025 12:14 AM, Johan Hovold wrote:
>>> On Thu, Mar 13, 2025 at 09:31:56PM +0800, Miaoqing Pan wrote:
>>>> On 3/13/2025 12:43 AM, Johan Hovold wrote:
>
>>>>> + /* Make sure descriptor is read after the head pointer. */
>>>>> + dma_rmb();
>>>>> +
>>>>> *nbytes = ath11k_hal_ce_dst_status_get_length(desc);
>>>>> if (*nbytes == 0) {
>>>>> + WARN_ON_ONCE(1); // FIXME: remove
>>>>> ret = -EIO;
>>>>> goto err;
>>>>> }
>>>>
>>>> This issue can still be reproduced.
>>>>
>>>> [ 3283.687469] WARNING: CPU: 0 PID: 0 at
>>>> /drivers/net/wireless/ath/ath11k/ce.c:405
>>>> ath11k_ce_per_engine_service+0x228/0x3e4 [ath11k]
>>>
>>> Thanks for verifying.
>>>
>>> Which platform are you testing on and which kernel are you using?
>>>
>>> I'm still waiting to hear back from some people testing my patch on the
>>> X13s (sc8280xp).
>
>> qcs615-ride, kernel 6.6.65.
>
> Ok, so a downstream vendor kernel?
>
> qcs615-ride does not even have PCIe enabled in mainline yet, but I
> assume that's what you use here?
>
>> I think the hardware has already ensured synchronization between
>> descriptor and head pointer, which isn't difficult to achieve. The issue
>> is likely caused by something else and requires further debugging.
>
> Yeah, but you still need memory barriers on the kernel side.
>
> It could be that we are looking at two different causes for those
> zero-length descriptors.
>
> The error handling for that obviously needs to be fixed either way, but
> I haven't heard anyone hitting the corruption with the memory barriers
> in place on the X13s yet (even if we'd need some more time to test
> this).
>
> Johan
After multiple and prolonged verifications, adding dma_rmb() did not
improve the issue at all. I think this Status Descriptor is updated by
hardware (Copy Engine) controlled by another system, not involving DMA
or out-of-order CPU access within the same system, so memory barriers do
not take effect.
On Mon, Mar 17, 2025 at 01:52:15PM +0800, Miaoqing Pan wrote: > On 3/14/2025 4:06 PM, Johan Hovold wrote: > > On Fri, Mar 14, 2025 at 09:01:36AM +0800, Miaoqing Pan wrote: > >> I think the hardware has already ensured synchronization between > >> descriptor and head pointer, which isn't difficult to achieve. The issue > >> is likely caused by something else and requires further debugging. > > > > Yeah, but you still need memory barriers on the kernel side. > > > > It could be that we are looking at two different causes for those > > zero-length descriptors. > > > > The error handling for that obviously needs to be fixed either way, but > > I haven't heard anyone hitting the corruption with the memory barriers > > in place on the X13s yet (even if we'd need some more time to test > > this). > After multiple and prolonged verifications, adding dma_rmb() did not > improve the issue at all. I think this Status Descriptor is updated by > hardware (Copy Engine) controlled by another system, not involving DMA > or out-of-order CPU access within the same system, so memory barriers do > not take effect. Then it seems we are looking at two separate root causes for the corruption as the memory barrier appears to be all that is needed to fix the X13s issue. A user who hit the corruption after 2 h without the fix has been running over the weekend with the memory barrier without any problems. I'll ask further users to test, but it certainly looks like it is working as intended. And the memory barrier is de-facto missing as the head pointer and descriptor are accessed through (two separate) coherent mappings so there are no ordering guarantees without explicit barriers. Now obviously there are further issues in your system, which we should make sure we understand before adding workarounds to the driver. Do you have a pointer to the downstream kernel sources you are testing with? Or even better, can you reproduce the issue with mainline after adding the PCIe patches that were posted to the lists for these platforms? Apparently the descriptors can also be passed in non-coherent memory for some devices (.alloc_cacheable_memory / HAL_SRNG_FLAGS_CACHED). That implementation looks suspicious and could possibly result in similar problems. Are you using .alloc_cacheable_memory in your setup? Does it make any difference if you use a full rmb() barrier? And after modifying ath11k_hal_ce_dst_status_get_length() so that it does not clear the length, how many times you need to retry? Does it always work on the second try? Johan
On 3/17/2025 9:04 PM, Johan Hovold wrote: > On Mon, Mar 17, 2025 at 01:52:15PM +0800, Miaoqing Pan wrote: >> On 3/14/2025 4:06 PM, Johan Hovold wrote: >>> On Fri, Mar 14, 2025 at 09:01:36AM +0800, Miaoqing Pan wrote: > >>>> I think the hardware has already ensured synchronization between >>>> descriptor and head pointer, which isn't difficult to achieve. The issue >>>> is likely caused by something else and requires further debugging. >>> >>> Yeah, but you still need memory barriers on the kernel side. >>> >>> It could be that we are looking at two different causes for those >>> zero-length descriptors. >>> >>> The error handling for that obviously needs to be fixed either way, but >>> I haven't heard anyone hitting the corruption with the memory barriers >>> in place on the X13s yet (even if we'd need some more time to test >>> this). > >> After multiple and prolonged verifications, adding dma_rmb() did not >> improve the issue at all. I think this Status Descriptor is updated by >> hardware (Copy Engine) controlled by another system, not involving DMA >> or out-of-order CPU access within the same system, so memory barriers do >> not take effect. > > Then it seems we are looking at two separate root causes for the > corruption as the memory barrier appears to be all that is needed to fix > the X13s issue. > > A user who hit the corruption after 2 h without the fix has been running > over the weekend with the memory barrier without any problems. I'll ask > further users to test, but it certainly looks like it is working as > intended. > > And the memory barrier is de-facto missing as the head pointer and > descriptor are accessed through (two separate) coherent mappings so > there are no ordering guarantees without explicit barriers. > This situation should occur when there is only one descriptor in the ring. If, as you mentioned, the CPU tries to load the descriptor first, but the descriptor fetch fails before the HP load because the ring returns empty, it won't trigger the current issue. The Copy Engine hardware module copies the metadata to the Status Descriptor after the DMA is complete, then updates the HP to trigger an interrupt. I think there might be some issues in this process, such as the lack of a wmb instruction after the copy is complete, causing the HP to be updated first. > Now obviously there are further issues in your system, which we should > make sure we understand before adding workarounds to the driver. > > Do you have a pointer to the downstream kernel sources you are testing > with? Or even better, can you reproduce the issue with mainline after > adding the PCIe patches that were posted to the lists for these > platforms? > https://github.com/qualcomm-linux/meta-qcom-hwe/blob/scarthgap/recipes-kernel/linux/linux-qcom-base_6.6.bb > Apparently the descriptors can also be passed in non-coherent memory > for some devices (.alloc_cacheable_memory / HAL_SRNG_FLAGS_CACHED). That > implementation looks suspicious and could possibly result in similar > problems. Are you using .alloc_cacheable_memory in your setup? > No. > Does it make any difference if you use a full rmb() barrier? > I've also tried rmb() and mb(), but they didn't work either. > And after modifying ath11k_hal_ce_dst_status_get_length() so that it > does not clear the length, how many times you need to retry? Does it> always work on the second try? > Yes, the test has been running continuously for over 48 hours, always work on the second try, updated in patch v4.
On Tue, Mar 18, 2025 at 03:53:39PM +0800, Miaoqing Pan wrote: > On 3/17/2025 9:04 PM, Johan Hovold wrote: > > Then it seems we are looking at two separate root causes for the > > corruption as the memory barrier appears to be all that is needed to fix > > the X13s issue. > > > > A user who hit the corruption after 2 h without the fix has been running > > over the weekend with the memory barrier without any problems. I'll ask > > further users to test, but it certainly looks like it is working as > > intended. > > > > And the memory barrier is de-facto missing as the head pointer and > > descriptor are accessed through (two separate) coherent mappings so > > there are no ordering guarantees without explicit barriers. > > This situation should occur when there is only one descriptor in the > ring. If, as you mentioned, the CPU tries to load the descriptor first, > but the descriptor fetch fails before the HP load because the ring > returns empty, it won't trigger the current issue. It could if the CPU observes the updates out of order due to the missing barrier. The driver could be processing an earlier interrupt when the new descriptor is added and head pointer updated. If for example the CPU speculatively fetches the descriptor before the head pointer is updated, then the descriptor length may be zero when the CPU sees the updated head pointer. This seems to be what is happening on the X13s since adding the memory barrier makes the zero-length descriptors go away. > The Copy Engine hardware module copies the metadata to the Status > Descriptor after the DMA is complete, then updates the HP to trigger an > interrupt. I think there might be some issues in this process, such as > the lack of a wmb instruction after the copy is complete, causing the HP > to be updated first. Yeah, possibly. At least it seems there are more issues than the missing barrier on the machines you test. > > Now obviously there are further issues in your system, which we should > > make sure we understand before adding workarounds to the driver. > > > > Do you have a pointer to the downstream kernel sources you are testing > > with? Or even better, can you reproduce the issue with mainline after > > adding the PCIe patches that were posted to the lists for these > > platforms? > > > https://github.com/qualcomm-linux/meta-qcom-hwe/blob/scarthgap/recipes-kernel/linux/linux-qcom-base_6.6.bb Thanks for the pointer. That's a lot of out-of-tree patches on top of stable so not that easy to check the state of the resulting tree. > > Does it make any difference if you use a full rmb() barrier? > > > I've also tried rmb() and mb(), but they didn't work either. Thanks for checking. Just to be sure, you did add the barrier in the same place as my patch (i.e. just before the descriptor read)? > > And after modifying ath11k_hal_ce_dst_status_get_length() so that it > > does not clear the length, how many times you need to retry? Does it > > always work on the second try? > > Yes, the test has been running continuously for over 48 hours, always > work on the second try, updated in patch v4. Good, at least the descriptor-length-sometimes-never-updated issue is solved. Johan
On 3/19/2025 1:42 AM, Johan Hovold wrote: > On Tue, Mar 18, 2025 at 03:53:39PM +0800, Miaoqing Pan wrote: >> On 3/17/2025 9:04 PM, Johan Hovold wrote: > >>> Then it seems we are looking at two separate root causes for the >>> corruption as the memory barrier appears to be all that is needed to fix >>> the X13s issue. >>> >>> A user who hit the corruption after 2 h without the fix has been running >>> over the weekend with the memory barrier without any problems. I'll ask >>> further users to test, but it certainly looks like it is working as >>> intended. >>> >>> And the memory barrier is de-facto missing as the head pointer and >>> descriptor are accessed through (two separate) coherent mappings so >>> there are no ordering guarantees without explicit barriers. >> >> This situation should occur when there is only one descriptor in the >> ring. If, as you mentioned, the CPU tries to load the descriptor first, >> but the descriptor fetch fails before the HP load because the ring >> returns empty, it won't trigger the current issue. > > It could if the CPU observes the updates out of order due to the missing > barrier. The driver could be processing an earlier interrupt when the > new descriptor is added and head pointer updated. If for example the CPU > speculatively fetches the descriptor before the head pointer is updated, > then the descriptor length may be zero when the CPU sees the updated > head pointer. > Sorry, I still think this situation won't happen. Please see the following code. ath11k_hal_srng_access_begin(ab, srng); => srng->u.dst_ring.cached_hp = *srng->u.dst_ring.hp_addr; desc = ath11k_hal_srng_dst_get_next_entry(ab, srng); => if (srng->u.dst_ring.tp == srng->u.dst_ring.cached_hp) return NULL; //dma_rmb(); *nbytes = ath11k_hal_ce_dst_status_get_length(desc); If the condition 'srng->u.dst_ring.tp == srng->u.dst_ring.cached_hp' is true, the descriptor retrieval fails. > This seems to be what is happening on the X13s since adding the memory > barrier makes the zero-length descriptors go away. > Hmm, it is indeed a bit strange. Could it be that dma_rmb() introduces some delay ? >> The Copy Engine hardware module copies the metadata to the Status >> Descriptor after the DMA is complete, then updates the HP to trigger an >> interrupt. I think there might be some issues in this process, such as >> the lack of a wmb instruction after the copy is complete, causing the HP >> to be updated first. > > Yeah, possibly. At least it seems there are more issues than the missing > barrier on the machines you test. > >>> Now obviously there are further issues in your system, which we should >>> make sure we understand before adding workarounds to the driver. >>> >>> Do you have a pointer to the downstream kernel sources you are testing >>> with? Or even better, can you reproduce the issue with mainline after >>> adding the PCIe patches that were posted to the lists for these >>> platforms? >>> >> https://github.com/qualcomm-linux/meta-qcom-hwe/blob/scarthgap/recipes-kernel/linux/linux-qcom-base_6.6.bb > > Thanks for the pointer. That's a lot of out-of-tree patches on top of > stable so not that easy to check the state of the resulting tree. > Yes, but there are only a few patches for ath11k. >>> Does it make any difference if you use a full rmb() barrier? >>> >> I've also tried rmb() and mb(), but they didn't work either. > > Thanks for checking. > > Just to be sure, you did add the barrier in the same place as my patch > (i.e. just before the descriptor read)? > Yes. >>> And after modifying ath11k_hal_ce_dst_status_get_length() so that it >>> does not clear the length, how many times you need to retry? Does it >>> always work on the second try? >> >> Yes, the test has been running continuously for over 48 hours, always >> work on the second try, updated in patch v4. > > Good, at least the descriptor-length-sometimes-never-updated issue is > solved. > > Johan Yeah, thanks for pointing out the issue.
On Wed, Mar 19, 2025 at 02:47:12PM +0800, Miaoqing Pan wrote: > On 3/19/2025 1:42 AM, Johan Hovold wrote: > > It could if the CPU observes the updates out of order due to the missing > > barrier. The driver could be processing an earlier interrupt when the > > new descriptor is added and head pointer updated. If for example the CPU > > speculatively fetches the descriptor before the head pointer is updated, > > then the descriptor length may be zero when the CPU sees the updated > > head pointer. > > Sorry, I still think this situation won't happen. Please see the > following code. > > ath11k_hal_srng_access_begin(ab, srng); > => srng->u.dst_ring.cached_hp = *srng->u.dst_ring.hp_addr; > desc = ath11k_hal_srng_dst_get_next_entry(ab, srng); > => if (srng->u.dst_ring.tp == srng->u.dst_ring.cached_hp) return NULL; > //dma_rmb(); > *nbytes = ath11k_hal_ce_dst_status_get_length(desc); > > If the condition 'srng->u.dst_ring.tp == srng->u.dst_ring.cached_hp' is > true, the descriptor retrieval fails. The CPU can still speculate that this condition will be false and load the descriptor. If the speculation later turns out to be correct, then the descriptor may have stale values from before the head pointer was updated. > > This seems to be what is happening on the X13s since adding the memory > > barrier makes the zero-length descriptors go away. > > Hmm, it is indeed a bit strange. Could it be that dma_rmb() introduces > some delay ? It's only expected since you must use memory barriers on weakly ordered architectures like aarch64 to guarantee the ordering. > >> The Copy Engine hardware module copies the metadata to the Status > >> Descriptor after the DMA is complete, then updates the HP to trigger an > >> interrupt. I think there might be some issues in this process, such as > >> the lack of a wmb instruction after the copy is complete, causing the HP > >> to be updated first. > > > > Yeah, possibly. At least it seems there are more issues than the missing > > barrier on the machines you test. > > > >>> Now obviously there are further issues in your system, which we should > >>> make sure we understand before adding workarounds to the driver. > >>> > >>> Do you have a pointer to the downstream kernel sources you are testing > >>> with? Or even better, can you reproduce the issue with mainline after > >>> adding the PCIe patches that were posted to the lists for these > >>> platforms? > >>> > >> https://github.com/qualcomm-linux/meta-qcom-hwe/blob/scarthgap/recipes-kernel/linux/linux-qcom-base_6.6.bb > > > > Thanks for the pointer. That's a lot of out-of-tree patches on top of > > stable so not that easy to check the state of the resulting tree. > > Yes, but there are only a few patches for ath11k. Sure, but there are other components that come into play here such as the PCIe controller driver. A colleague of yours recently submitted an updated patch that overrides the no_snoop bit for qcs8300: https://lore.kernel.org/lkml/20250318053836.tievnd5ohzl7bmox@thinkpad/ but that flag appears not to be set in your downstream tree: https://github.com/qualcomm-linux/meta-qcom-hwe/blob/scarthgap/recipes-kernel/linux/linux-qcom-base-6.6/drivers/qcs8300/0004-PCI-qcom-Add-QCS8300-PCIe-support.patch Something like that may prevent a cached descriptor from being invalidated when the controller updates it. Similarly, the PCIe controllers are marked as dma-coherent in your devicetrees. A misconfiguration there could also cause problems. I suggest we merge my fix that adds the missing memory barrier, and which users have now been testing for a week without hitting the corruption (which they used to see several times a day). Then we can continue to track down why you are having coherency issues on qcs615 and qcs8300. You really want to make sure that that is fixed properly as it may lead to subtle bugs elsewhere too. Johan
On 3/21/2025 5:35 PM, Johan Hovold wrote: > On Wed, Mar 19, 2025 at 02:47:12PM +0800, Miaoqing Pan wrote: >> On 3/19/2025 1:42 AM, Johan Hovold wrote: > >>> It could if the CPU observes the updates out of order due to the missing >>> barrier. The driver could be processing an earlier interrupt when the >>> new descriptor is added and head pointer updated. If for example the CPU >>> speculatively fetches the descriptor before the head pointer is updated, >>> then the descriptor length may be zero when the CPU sees the updated >>> head pointer. >> >> Sorry, I still think this situation won't happen. Please see the >> following code. >> >> ath11k_hal_srng_access_begin(ab, srng); >> => srng->u.dst_ring.cached_hp = *srng->u.dst_ring.hp_addr; >> desc = ath11k_hal_srng_dst_get_next_entry(ab, srng); >> => if (srng->u.dst_ring.tp == srng->u.dst_ring.cached_hp) return NULL; >> //dma_rmb(); >> *nbytes = ath11k_hal_ce_dst_status_get_length(desc); >> >> If the condition 'srng->u.dst_ring.tp == srng->u.dst_ring.cached_hp' is >> true, the descriptor retrieval fails. > > The CPU can still speculate that this condition will be false and load > the descriptor. > > If the speculation later turns out to be correct, then the descriptor > may have stale values from before the head pointer was updated. > >>> This seems to be what is happening on the X13s since adding the memory >>> barrier makes the zero-length descriptors go away. >> >> Hmm, it is indeed a bit strange. Could it be that dma_rmb() introduces >> some delay ? > > It's only expected since you must use memory barriers on weakly ordered > architectures like aarch64 to guarantee the ordering. > >>>> The Copy Engine hardware module copies the metadata to the Status >>>> Descriptor after the DMA is complete, then updates the HP to trigger an >>>> interrupt. I think there might be some issues in this process, such as >>>> the lack of a wmb instruction after the copy is complete, causing the HP >>>> to be updated first. >>> >>> Yeah, possibly. At least it seems there are more issues than the missing >>> barrier on the machines you test. >>> >>>>> Now obviously there are further issues in your system, which we should >>>>> make sure we understand before adding workarounds to the driver. >>>>> >>>>> Do you have a pointer to the downstream kernel sources you are testing >>>>> with? Or even better, can you reproduce the issue with mainline after >>>>> adding the PCIe patches that were posted to the lists for these >>>>> platforms? >>>>> >>>> https://github.com/qualcomm-linux/meta-qcom-hwe/blob/scarthgap/recipes-kernel/linux/linux-qcom-base_6.6.bb >>> >>> Thanks for the pointer. That's a lot of out-of-tree patches on top of >>> stable so not that easy to check the state of the resulting tree. >> >> Yes, but there are only a few patches for ath11k. > > Sure, but there are other components that come into play here such as > the PCIe controller driver. > > A colleague of yours recently submitted an updated patch that overrides > the no_snoop bit for qcs8300: > > https://lore.kernel.org/lkml/20250318053836.tievnd5ohzl7bmox@thinkpad/ > > but that flag appears not to be set in your downstream tree: > > https://github.com/qualcomm-linux/meta-qcom-hwe/blob/scarthgap/recipes-kernel/linux/linux-qcom-base-6.6/drivers/qcs8300/0004-PCI-qcom-Add-QCS8300-PCIe-support.patch > > Something like that may prevent a cached descriptor from being > invalidated when the controller updates it. > > Similarly, the PCIe controllers are marked as dma-coherent in your > devicetrees. A misconfiguration there could also cause problems. Thank you very much for these suggestions. I tried setting no_snoop and disabling relaxed_ordering, but unfortunately, these did not work. > > I suggest we merge my fix that adds the missing memory barrier, and > which users have now been testing for a week without hitting the > corruption (which they used to see several times a day). > Agreed, I previously mistakenly thought that the status descriptor was not updated by DMA. > Then we can continue to track down why you are having coherency issues > on qcs615 and qcs8300. You really want to make sure that that is fixed > properly as it may lead to subtle bugs elsewhere too. > The same WLAN card is attached to qcs615, qcs8300 and sa8775p, and the issue is never be seen on sa8775p, maybe I can compare the PCIE settings to track down the root cause.
On 3/14/2025 4:06 PM, Johan Hovold wrote:
> On Fri, Mar 14, 2025 at 09:01:36AM +0800, Miaoqing Pan wrote:
>> On 3/14/2025 12:14 AM, Johan Hovold wrote:
>>> On Thu, Mar 13, 2025 at 09:31:56PM +0800, Miaoqing Pan wrote:
>>>> On 3/13/2025 12:43 AM, Johan Hovold wrote:
>
>>>>> + /* Make sure descriptor is read after the head pointer. */
>>>>> + dma_rmb();
>>>>> +
>>>>> *nbytes = ath11k_hal_ce_dst_status_get_length(desc);
>>>>> if (*nbytes == 0) {
>>>>> + WARN_ON_ONCE(1); // FIXME: remove
>>>>> ret = -EIO;
>>>>> goto err;
>>>>> }
>>>>
>>>> This issue can still be reproduced.
>>>>
>>>> [ 3283.687469] WARNING: CPU: 0 PID: 0 at
>>>> /drivers/net/wireless/ath/ath11k/ce.c:405
>>>> ath11k_ce_per_engine_service+0x228/0x3e4 [ath11k]
>>>
>>> Thanks for verifying.
>>>
>>> Which platform are you testing on and which kernel are you using?
>>>
>>> I'm still waiting to hear back from some people testing my patch on the
>>> X13s (sc8280xp).
>
>> qcs615-ride, kernel 6.6.65.
>
> Ok, so a downstream vendor kernel?
>
Yes.
> qcs615-ride does not even have PCIe enabled in mainline yet, but I
> assume that's what you use here?
>
Yes, also reproduced on qcs8300-ride.
>> I think the hardware has already ensured synchronization between
>> descriptor and head pointer, which isn't difficult to achieve. The issue
>> is likely caused by something else and requires further debugging.
>
> Yeah, but you still need memory barriers on the kernel side.
>
> It could be that we are looking at two different causes for those
> zero-length descriptors.
>
> The error handling for that obviously needs to be fixed either way, but
> I haven't heard anyone hitting the corruption with the memory barriers
> in place on the X13s yet (even if we'd need some more time to test
> this).
>
> Johan
Agreed with you, let's waiting for the feedback.
On 3/13/2025 12:43 AM, Johan Hovold wrote:
> On Wed, Mar 12, 2025 at 09:11:45AM +0800, Miaoqing Pan wrote:
>> On 3/11/2025 11:20 PM, Jeff Johnson wrote:
>>> On 3/11/2025 1:29 AM, Miaoqing Pan wrote:
>>>> On 3/10/2025 6:09 PM, Johan Hovold wrote:
>>>>> I'm still waiting for feedback from one user that can reproduce the
>>>>> ring-buffer corruption very easily, but another user mentioned seeing
>>>>> multiple zero-length descriptor warnings over the weekend when running
>>>>> with this patch:
>>>>>
>>>>> ath11k_pci 0006:01:00.0: rxed invalid length (nbytes 0, max 2048)
>>>>>
>>>>> Are there ever any valid reasons for seeing a zero-length descriptor
>>>>> (i.e. unrelated to the race at hand)? IIUC the warning would only be
>>>>> printed when processing such descriptors a second time (i.e. when
>>>>> is_desc_len0 is set).
>>>>
>>>> That's exactly the logic, only can see the warning in a second time. For
>>>> the first time, ath12k_ce_completed_recv_next() returns '-EIO'.
>>>
>>> That didn't answer Johan's first question:
>>> Are there ever any valid reasons for seeing a zero-length descriptor?
>>
>> The events currently observed are all firmware logs. The discarded
>> packets will not affect normal operation. I will adjust the logging to
>> debug level.
>
> That still does not answer the question whether there are ever any valid
> reasons for seeing zero-length descriptors. I assume there are none?
>
>>> We have an issue that there is a race condition where hardware updates the
>>> pointer before it has filled all the data. The current solution is just to
>>> read the data a second time. But if that second read also occurs before
>>> hardware has updated the data, then the issue isn't fixed.
>>
>> Thanks for the addition.
>>
>>> So should there be some forced delay before we read a second time?
>>> Or should we attempt to read more times?
>>
>> The initial fix was to keep waiting for the data to be ready. The
>> observed phenomenon is that when the second read fails, subsequent reads
>> will continue to fail until the firmware's CE2 ring is full and triggers
>> an assert after timeout. However, this situation is relatively rare, and
>> in most cases, the second read will succeed. Therefore, adding a delay
>> or multiple read attempts is not useful.
>
> The proposed fix is broken since ath11k_hal_ce_dst_status_get_length()
> not just reads the length but also sets it to zero so that the updated
> length may indeed never be seen.
>
Good point!
> I've taken a closer look at the driver and it seems like we're missing a
> read barrier to make sure that the updated descriptor is not read until
> after the head pointer.
>
> Miaoqing, could you try the below patch with your reproducer and see if
> it is enough to fix the corruption?
>
Sure, the stress test is running.
> If so I can resend with the warning removed and include a corresponding
> fix for ath12k (it looks like there are further places where barriers
> are missing too).
>
> Johan
>
>
If the DMA read barrier works, do you think my submitted patch series is
still needed? Because the error handling is incorrect.
> From 656dbd0894741445aeb16ee8357e6fef51b6084c Mon Sep 17 00:00:00 2001
> From: Johan Hovold <johan+linaro@kernel.org>
> Date: Wed, 12 Mar 2025 16:49:20 +0100
> Subject: [PATCH] wifi: ath11k: fix ring-buffer corruption
>
> Users of the Lenovo ThinkPad X13s have reported that Wi-Fi sometimes
> breaks and the log fills up with errors like:
>
> ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1484, expected 1492
> ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1460, expected 1484
>
> which based on a quick look at the driver seemed to indicate some kind
> of ring-buffer corruption.
>
> Miaoqing Pan tracked it down to the host seeing the updated destination
> ring head pointer before the updated descriptor, and the error handling
> for that in turn leaves the ring buffer in an inconsistent state.
>
> Add the missing the read barrier to make sure that the descriptor is
> read after the head pointer to address the root cause of the corruption.
>
> The error handling can be fixed separately in case there can ever be
> actual zero-length descriptors.
>
> FIXME: remove WARN_ON_ONCE() added for verification purposes
>
> Tested-on: WCN6855 hw2.1 WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.41
>
> Fixes: d5c65159f289 ("ath11k: driver for Qualcomm IEEE 802.11ax devices")
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218623
> Link: https://lore.kernel.org/20250310010217.3845141-3-quic_miaoqing@quicinc.com
> Cc: Miaoqing Pan <quic_miaoqing@quicinc.com>
> Cc: stable@vger.kernel.org # 5.6
> Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
> ---
> drivers/net/wireless/ath/ath11k/ce.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/drivers/net/wireless/ath/ath11k/ce.c b/drivers/net/wireless/ath/ath11k/ce.c
> index e66e86bdec20..423b970e288c 100644
> --- a/drivers/net/wireless/ath/ath11k/ce.c
> +++ b/drivers/net/wireless/ath/ath11k/ce.c
> @@ -393,8 +393,12 @@ static int ath11k_ce_completed_recv_next(struct ath11k_ce_pipe *pipe,
> goto err;
> }
>
> + /* Make sure descriptor is read after the head pointer. */
> + dma_rmb();
> +
> *nbytes = ath11k_hal_ce_dst_status_get_length(desc);
> if (*nbytes == 0) {
> + WARN_ON_ONCE(1); // FIXME: remove
> ret = -EIO;
> goto err;
> }
On Thu, Mar 13, 2025 at 09:41:51AM +0800, Miaoqing Pan wrote: > On 3/13/2025 12:43 AM, Johan Hovold wrote: > > I've taken a closer look at the driver and it seems like we're missing a > > read barrier to make sure that the updated descriptor is not read until > > after the head pointer. > > > > Miaoqing, could you try the below patch with your reproducer and see if > > it is enough to fix the corruption? > > Sure, the stress test is running. Thanks. > > If so I can resend with the warning removed and include a corresponding > > fix for ath12k (it looks like there are further places where barriers > > are missing too). > If the DMA read barrier works, do you think my submitted patch series is > still needed? Because the error handling is incorrect. Yeah, it would still be good to fix up the error handling even if you don't expect to ever see a descriptor with length 0. But unless the device is doing something wrong here, there shouldn't be a need for peeking at the descriptor and retrying. Johan
On 3/13/2025 11:57 PM, Johan Hovold wrote: > On Thu, Mar 13, 2025 at 09:41:51AM +0800, Miaoqing Pan wrote: >> On 3/13/2025 12:43 AM, Johan Hovold wrote: > >>> I've taken a closer look at the driver and it seems like we're missing a >>> read barrier to make sure that the updated descriptor is not read until >>> after the head pointer. >>> >>> Miaoqing, could you try the below patch with your reproducer and see if >>> it is enough to fix the corruption? >> >> Sure, the stress test is running. > > Thanks. > >>> If so I can resend with the warning removed and include a corresponding >>> fix for ath12k (it looks like there are further places where barriers >>> are missing too). > >> If the DMA read barrier works, do you think my submitted patch series is >> still needed? Because the error handling is incorrect. > > Yeah, it would still be good to fix up the error handling even if you > don't expect to ever see a descriptor with length 0. > > But unless the device is doing something wrong here, there shouldn't be > a need for peeking at the descriptor and retrying. > > Johan New version will be submitted based on the previous discussion.
© 2016 - 2026 Red Hat, Inc.