[PATCH] wifi: ath11k: fix rx completion meta data corruption

Johan Hovold posted 1 patch 9 months ago
drivers/net/wireless/ath/ath11k/dp_rx.c | 25 ++++++++++++++++---------
1 file changed, 16 insertions(+), 9 deletions(-)
Posted by Johan Hovold 9 months ago
Add the missing memory barrier to make sure that the REO dest ring
descriptor is read after the head pointer to avoid using stale data on
weakly ordered architectures like aarch64.

This may fix the ring-buffer corruption worked around by commit
f9fff67d2d7c ("wifi: ath11k: Fix SKB corruption in REO destination
ring") by silently discarding data, and may possibly also address user
reported errors like:

	ath11k_pci 0006:01:00.0: msdu_done bit in attention is not set

Tested-on: WCN6855 hw2.1 WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.41

Fixes: d5c65159f289 ("ath11k: driver for Qualcomm IEEE 802.11ax devices")
Cc: stable@vger.kernel.org	# 5.6
Link: https://bugzilla.kernel.org/show_bug.cgi?id=218005
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
---

As I reported here:

	https://lore.kernel.org/lkml/Z9G5zEOcTdGKm7Ei@hovoldconsulting.com/

the ath11k and ath12k appear to be missing a number of memory barriers
that are required on weakly ordered architectures like aarch64 to avoid
memory corruption issues.

Here's a fix for one more such case which people already seem to be
hitting.

Note that I've seen one "msdu_done" bit not set warning also with this
patch so whether it helps with that at all remains to be seen. I'm CCing
Jens and Steev that see these warnings frequently and that may be able
to help out with testing.

Johan


 drivers/net/wireless/ath/ath11k/dp_rx.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/drivers/net/wireless/ath/ath11k/dp_rx.c b/drivers/net/wireless/ath/ath11k/dp_rx.c
index 029ecf51c9ef..0a57b337e4c6 100644
--- a/drivers/net/wireless/ath/ath11k/dp_rx.c
+++ b/drivers/net/wireless/ath/ath11k/dp_rx.c
@@ -2646,7 +2646,7 @@ int ath11k_dp_process_rx(struct ath11k_base *ab, int ring_id,
 	struct ath11k *ar;
 	struct hal_reo_dest_ring *desc;
 	enum hal_reo_dest_ring_push_reason push_reason;
-	u32 cookie;
+	u32 cookie, info0, rx_msdu_info0, rx_mpdu_info0;
 	int i;
 
 	for (i = 0; i < MAX_RADIOS; i++)
@@ -2659,11 +2659,14 @@ int ath11k_dp_process_rx(struct ath11k_base *ab, int ring_id,
 try_again:
 	ath11k_hal_srng_access_begin(ab, srng);
 
+	/* Make sure descriptor is read after the head pointer. */
+	dma_rmb();
+
 	while (likely(desc =
 	      (struct hal_reo_dest_ring *)ath11k_hal_srng_dst_get_next_entry(ab,
 									     srng))) {
 		cookie = FIELD_GET(BUFFER_ADDR_INFO1_SW_COOKIE,
-				   desc->buf_addr_info.info1);
+				   READ_ONCE(desc->buf_addr_info.info1));
 		buf_id = FIELD_GET(DP_RXDMA_BUF_COOKIE_BUF_ID,
 				   cookie);
 		mac_id = FIELD_GET(DP_RXDMA_BUF_COOKIE_PDEV_ID, cookie);
@@ -2692,8 +2695,9 @@ int ath11k_dp_process_rx(struct ath11k_base *ab, int ring_id,
 
 		num_buffs_reaped[mac_id]++;
 
+		info0 = READ_ONCE(desc->info0);
 		push_reason = FIELD_GET(HAL_REO_DEST_RING_INFO0_PUSH_REASON,
-					desc->info0);
+					info0);
 		if (unlikely(push_reason !=
 			     HAL_REO_DEST_RING_PUSH_REASON_ROUTING_INSTRUCTION)) {
 			dev_kfree_skb_any(msdu);
@@ -2701,18 +2705,21 @@ int ath11k_dp_process_rx(struct ath11k_base *ab, int ring_id,
 			continue;
 		}
 
-		rxcb->is_first_msdu = !!(desc->rx_msdu_info.info0 &
+		rx_msdu_info0 = READ_ONCE(desc->rx_msdu_info.info0);
+		rx_mpdu_info0 = READ_ONCE(desc->rx_mpdu_info.info0);
+
+		rxcb->is_first_msdu = !!(rx_msdu_info0 &
 					 RX_MSDU_DESC_INFO0_FIRST_MSDU_IN_MPDU);
-		rxcb->is_last_msdu = !!(desc->rx_msdu_info.info0 &
+		rxcb->is_last_msdu = !!(rx_msdu_info0 &
 					RX_MSDU_DESC_INFO0_LAST_MSDU_IN_MPDU);
-		rxcb->is_continuation = !!(desc->rx_msdu_info.info0 &
+		rxcb->is_continuation = !!(rx_msdu_info0 &
 					   RX_MSDU_DESC_INFO0_MSDU_CONTINUATION);
 		rxcb->peer_id = FIELD_GET(RX_MPDU_DESC_META_DATA_PEER_ID,
-					  desc->rx_mpdu_info.meta_data);
+					  READ_ONCE(desc->rx_mpdu_info.meta_data));
 		rxcb->seq_no = FIELD_GET(RX_MPDU_DESC_INFO0_SEQ_NUM,
-					 desc->rx_mpdu_info.info0);
+					 rx_mpdu_info0);
 		rxcb->tid = FIELD_GET(HAL_REO_DEST_RING_INFO0_RX_QUEUE_NUM,
-				      desc->info0);
+				      info0);
 
 		rxcb->mac_id = mac_id;
 		__skb_queue_tail(&msdu_list[mac_id], msdu);
-- 
2.48.1
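
For readers unfamiliar with the ordering problem described in the commit
message, here is a minimal userspace sketch of the same pattern using C11
atomics. It is not the ath11k code: struct demo_ring, struct demo_desc,
demo_ring_push() and demo_ring_pop() are hypothetical names made up for
illustration, and the acquire fence only stands in for the role that
dma_rmb() plays in the patch. The consumer loads the head pointer first,
issues the barrier, and only then copies the descriptor out, so it cannot
observe a new head value together with stale descriptor contents.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 8u	/* hypothetical ring depth, must be a power of two */

static_assert((RING_SIZE & (RING_SIZE - 1)) == 0, "RING_SIZE must be a power of two");

/* Hypothetical descriptor; the real hal_reo_dest_ring has more fields. */
struct demo_desc {
	uint32_t info0;
	uint32_t cookie;
};

struct demo_ring {
	struct demo_desc entries[RING_SIZE];
	_Atomic uint32_t head;	/* advanced by the producer (the "device") */
	_Atomic uint32_t tail;	/* advanced by the consumer (the "driver") */
};

/* Producer: fill in the descriptor first, then publish it via the head. */
static int demo_ring_push(struct demo_ring *r, uint32_t info0, uint32_t cookie)
{
	uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
	uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	if (head - tail == RING_SIZE)
		return -1;	/* ring full */

	r->entries[head % RING_SIZE] =
		(struct demo_desc){ .info0 = info0, .cookie = cookie };

	/* Release: the descriptor writes become visible before the new head. */
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return 0;
}

/* Consumer: read the head, then the barrier, then the descriptor. */
static int demo_ring_pop(struct demo_ring *r, struct demo_desc *out)
{
	uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);

	if (tail == head)
		return -1;	/* ring empty */

	/*
	 * Make sure the descriptor is read after the head pointer. Without
	 * this fence a weakly ordered CPU may satisfy the descriptor loads
	 * before the head load.
	 */
	atomic_thread_fence(memory_order_acquire);

	/* Snapshot the descriptor once and work on the local copy only. */
	*out = r->entries[tail % RING_SIZE];

	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
	return 0;
}
```

Copying the descriptor into a local before decoding it mirrors the second
half of the patch, which reads each descriptor word once and decodes the
fields from that local copy rather than re-reading ring memory.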
Re: [PATCH] wifi: ath11k: fix rx completion meta data corruption
Posted by Jeff Johnson 7 months ago
On Fri, 21 Mar 2025 15:53:02 +0100, Johan Hovold wrote:
> Add the missing memory barrier to make sure that the REO dest ring
> descriptor is read after the head pointer to avoid using stale data on
> weakly ordered architectures like aarch64.
> 
> This may fix the ring-buffer corruption worked around by commit
> f9fff67d2d7c ("wifi: ath11k: Fix SKB corruption in REO destination
> ring") by silently discarding data, and may possibly also address user
> reported errors like:
> 
> [...]

Applied, thanks!

[1/1] wifi: ath11k: fix rx completion meta data corruption
      commit: ab52e3e44fe9b666281752e2481d11e25b0e3fdd

Best regards,
-- 
Jeff Johnson <jeff.johnson@oss.qualcomm.com>
Re: [PATCH] wifi: ath11k: fix rx completion meta data corruption
Posted by Jeff Johnson 8 months, 3 weeks ago
On 3/21/2025 7:53 AM, Johan Hovold wrote:
> Add the missing memory barrier to make sure that the REO dest ring
> descriptor is read after the head pointer to avoid using stale data on
> weakly ordered architectures like aarch64.
> 
> This may fix the ring-buffer corruption worked around by commit
> f9fff67d2d7c ("wifi: ath11k: Fix SKB corruption in REO destination
> ring") by silently discarding data, and may possibly also address user
> reported errors like:
> 
> 	ath11k_pci 0006:01:00.0: msdu_done bit in attention is not set
> 
> Tested-on: WCN6855 hw2.1 WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.41
> 
> Fixes: d5c65159f289 ("ath11k: driver for Qualcomm IEEE 802.11ax devices")
> Cc: stable@vger.kernel.org	# 5.6
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=218005
> Signed-off-by: Johan Hovold <johan+linaro@kernel.org>

Does this supersede:
[PATCH] wifi: ath11k: fix ring-buffer corruption
Re: [PATCH] wifi: ath11k: fix rx completion meta data corruption
Posted by Johan Hovold 8 months, 3 weeks ago
On Mon, Mar 24, 2025 at 08:03:15AM -0700, Jeff Johnson wrote:
> On 3/21/2025 7:53 AM, Johan Hovold wrote:
> > Add the missing memory barrier to make sure that the REO dest ring
> > descriptor is read after the head pointer to avoid using stale data on
> > weakly ordered architectures like aarch64.
> > 
> > This may fix the ring-buffer corruption worked around by commit
> > f9fff67d2d7c ("wifi: ath11k: Fix SKB corruption in REO destination
> > ring") by silently discarding data, and may possibly also address user
> > reported errors like:
> > 
> > 	ath11k_pci 0006:01:00.0: msdu_done bit in attention is not set
> > 
> > Tested-on: WCN6855 hw2.1 WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.41
> > 
> > Fixes: d5c65159f289 ("ath11k: driver for Qualcomm IEEE 802.11ax devices")
> > Cc: stable@vger.kernel.org	# 5.6
> > Link: https://bugzilla.kernel.org/show_bug.cgi?id=218005
> > Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
> 
> Does this supersede:
> [PATCH] wifi: ath11k: fix ring-buffer corruption

No, this is a separate fix. There are more places where barriers are
missing; I'll try to send some further fixes during the week as well.

Johan
Re: [PATCH] wifi: ath11k: fix rx completion meta data corruption
Posted by Clayton Craft 8 months, 4 weeks ago
On 3/21/25 07:53, Johan Hovold wrote:
> Add the missing memory barrier to make sure that the REO dest ring
> descriptor is read after the head pointer to avoid using stale data on
> weakly ordered architectures like aarch64.
> 
> This may fix the ring-buffer corruption worked around by commit
> f9fff67d2d7c ("wifi: ath11k: Fix SKB corruption in REO destination
> ring") by silently discarding data, and may possibly also address user
> reported errors like:
> 
> 	ath11k_pci 0006:01:00.0: msdu_done bit in attention is not set
> 
> Tested-on: WCN6855 hw2.1 WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.41
> 
> Fixes: d5c65159f289 ("ath11k: driver for Qualcomm IEEE 802.11ax devices")
> Cc: stable@vger.kernel.org	# 5.6
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=218005
> Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
> ---
> 
> As I reported here:
> 
> 	https://lore.kernel.org/lkml/Z9G5zEOcTdGKm7Ei@hovoldconsulting.com/
> 
> the ath11k and ath12k appear to be missing a number of memory barriers
> that are required on weakly ordered architectures like aarch64 to avoid
> memory corruption issues.
> 
> Here's a fix for one more such case which people already seem to be
> hitting.
> 
> Note that I've seen one "msdu_done" bit not set warning also with this
> patch so whether it helps with that at all remains to be seen. I'm CCing
> Jens and Steev that see these warnings frequently and that may be able
> to help out with testing.
> 

Before this patch I was seeing this "msdu_done bit" an average of about 
40 times per hour... e.g. a recent boot period of 43hrs saw 1600 of 
these msgs. I've been testing this patch for about 10 hours now 
connected to the same network etc, and haven't seen this "msdu_done bit" 
message once. So, even if it's not completely resolving this for 
everyone, it seems to be a huge improvement for me.

0006:01:00.0 Network controller: Qualcomm Technologies, Inc QCNFA765 
Wireless Network Adapter (rev 01)
ath11k_pci 0006:01:00.0: chip_id 0x2 chip_family 0xb board_id 0x8c 
soc_id 0x400c0210
ath11k_pci 0006:01:00.0: fw_version 0x11088c35 fw_build_timestamp 
2024-04-17 08:34 fw_build_id 
WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.41

Tested-by: Clayton Craft <clayton@craftyguy.net>
Re: [PATCH] wifi: ath11k: fix rx completion meta data corruption
Posted by Johan Hovold 8 months, 4 weeks ago
On Sun, Mar 23, 2025 at 11:15:54PM -0700, Clayton Craft wrote:
> On 3/21/25 07:53, Johan Hovold wrote:
> > Add the missing memory barrier to make sure that the REO dest ring
> > descriptor is read after the head pointer to avoid using stale data on
> > weakly ordered architectures like aarch64.
> > 
> > This may fix the ring-buffer corruption worked around by commit
> > f9fff67d2d7c ("wifi: ath11k: Fix SKB corruption in REO destination
> > ring") by silently discarding data, and may possibly also address user
> > reported errors like:
> > 
> > 	ath11k_pci 0006:01:00.0: msdu_done bit in attention is not set
> > 
> > Tested-on: WCN6855 hw2.1 WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.41
> > 
> > Fixes: d5c65159f289 ("ath11k: driver for Qualcomm IEEE 802.11ax devices")
> > Cc: stable@vger.kernel.org	# 5.6
> > Link: https://bugzilla.kernel.org/show_bug.cgi?id=218005
> > Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
> > ---
> > 
> > As I reported here:
> > 
> > 	https://lore.kernel.org/lkml/Z9G5zEOcTdGKm7Ei@hovoldconsulting.com/
> > 
> > the ath11k and ath12k appear to be missing a number of memory barriers
> > that are required on weakly ordered architectures like aarch64 to avoid
> > memory corruption issues.
> > 
> > Here's a fix for one more such case which people already seem to be
> > hitting.
> > 
> > Note that I've seen one "msdu_done" bit not set warning also with this
> > patch so whether it helps with that at all remains to be seen. I'm CCing
> > Jens and Steev that see these warnings frequently and that may be able
> > to help out with testing.
> 
> Before this patch I was seeing this "msdu_done bit" an average of about 
> 40 times per hour... e.g. a recent boot period of 43hrs saw 1600 of 
> these msgs. I've been testing this patch for about 10 hours now 
> connected to the same network etc, and haven't seen this "msdu_done bit" 
> message once. So, even if it's not completely resolving this for 
> everyone, it seems to be a huge improvement for me.
> 
> 0006:01:00.0 Network controller: Qualcomm Technologies, Inc QCNFA765 
> Wireless Network Adapter (rev 01)
> ath11k_pci 0006:01:00.0: chip_id 0x2 chip_family 0xb board_id 0x8c 
> soc_id 0x400c0210
> ath11k_pci 0006:01:00.0: fw_version 0x11088c35 fw_build_timestamp 
> 2024-04-17 08:34 fw_build_id 
> WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.41
> 
> Tested-by: Clayton Craft <clayton@craftyguy.net>

Thanks for testing and confirming my suspicion.

Johan
Re: [PATCH] wifi: ath11k: fix rx completion meta data corruption
Posted by Steev Klimaszewski 8 months, 4 weeks ago
Hi Johan,

On Fri, Mar 21, 2025 at 9:55 AM Johan Hovold <johan+linaro@kernel.org> wrote:
>
> Add the missing memory barrier to make sure that the REO dest ring
> descriptor is read after the head pointer to avoid using stale data on
> weakly ordered architectures like aarch64.
>
> This may fix the ring-buffer corruption worked around by commit
> f9fff67d2d7c ("wifi: ath11k: Fix SKB corruption in REO destination
> ring") by silently discarding data, and may possibly also address user
> reported errors like:
>
>         ath11k_pci 0006:01:00.0: msdu_done bit in attention is not set
>
> Tested-on: WCN6855 hw2.1 WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.41
>
> Fixes: d5c65159f289 ("ath11k: driver for Qualcomm IEEE 802.11ax devices")
> Cc: stable@vger.kernel.org      # 5.6
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=218005
> Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
> ---
>
> As I reported here:
>
>         https://lore.kernel.org/lkml/Z9G5zEOcTdGKm7Ei@hovoldconsulting.com/
>
> the ath11k and ath12k appear to be missing a number of memory barriers
> that are required on weakly ordered architectures like aarch64 to avoid
> memory corruption issues.
>
> Here's a fix for one more such case which people already seem to be
> hitting.
>
> Note that I've seen one "msdu_done" bit not set warning also with this
> patch so whether it helps with that at all remains to be seen. I'm CCing
> Jens and Steev that see these warnings frequently and that may be able
> to help out with testing.
>
> Johan
>
>
>  drivers/net/wireless/ath/ath11k/dp_rx.c | 25 ++++++++++++++++---------
>  1 file changed, 16 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/net/wireless/ath/ath11k/dp_rx.c b/drivers/net/wireless/ath/ath11k/dp_rx.c
> index 029ecf51c9ef..0a57b337e4c6 100644
> --- a/drivers/net/wireless/ath/ath11k/dp_rx.c
> +++ b/drivers/net/wireless/ath/ath11k/dp_rx.c
> @@ -2646,7 +2646,7 @@ int ath11k_dp_process_rx(struct ath11k_base *ab, int ring_id,
>         struct ath11k *ar;
>         struct hal_reo_dest_ring *desc;
>         enum hal_reo_dest_ring_push_reason push_reason;
> -       u32 cookie;
> +       u32 cookie, info0, rx_msdu_info0, rx_mpdu_info0;
>         int i;
>
>         for (i = 0; i < MAX_RADIOS; i++)
> @@ -2659,11 +2659,14 @@ int ath11k_dp_process_rx(struct ath11k_base *ab, int ring_id,
>  try_again:
>         ath11k_hal_srng_access_begin(ab, srng);
>
> +       /* Make sure descriptor is read after the head pointer. */
> +       dma_rmb();
> +
>         while (likely(desc =
>               (struct hal_reo_dest_ring *)ath11k_hal_srng_dst_get_next_entry(ab,
>                                                                              srng))) {
>                 cookie = FIELD_GET(BUFFER_ADDR_INFO1_SW_COOKIE,
> -                                  desc->buf_addr_info.info1);
> +                                  READ_ONCE(desc->buf_addr_info.info1));
>                 buf_id = FIELD_GET(DP_RXDMA_BUF_COOKIE_BUF_ID,
>                                    cookie);
>                 mac_id = FIELD_GET(DP_RXDMA_BUF_COOKIE_PDEV_ID, cookie);
> @@ -2692,8 +2695,9 @@ int ath11k_dp_process_rx(struct ath11k_base *ab, int ring_id,
>
>                 num_buffs_reaped[mac_id]++;
>
> +               info0 = READ_ONCE(desc->info0);
>                 push_reason = FIELD_GET(HAL_REO_DEST_RING_INFO0_PUSH_REASON,
> -                                       desc->info0);
> +                                       info0);
>                 if (unlikely(push_reason !=
>                              HAL_REO_DEST_RING_PUSH_REASON_ROUTING_INSTRUCTION)) {
>                         dev_kfree_skb_any(msdu);
> @@ -2701,18 +2705,21 @@ int ath11k_dp_process_rx(struct ath11k_base *ab, int ring_id,
>                         continue;
>                 }
>
> -               rxcb->is_first_msdu = !!(desc->rx_msdu_info.info0 &
> +               rx_msdu_info0 = READ_ONCE(desc->rx_msdu_info.info0);
> +               rx_mpdu_info0 = READ_ONCE(desc->rx_mpdu_info.info0);
> +
> +               rxcb->is_first_msdu = !!(rx_msdu_info0 &
>                                          RX_MSDU_DESC_INFO0_FIRST_MSDU_IN_MPDU);
> -               rxcb->is_last_msdu = !!(desc->rx_msdu_info.info0 &
> +               rxcb->is_last_msdu = !!(rx_msdu_info0 &
>                                         RX_MSDU_DESC_INFO0_LAST_MSDU_IN_MPDU);
> -               rxcb->is_continuation = !!(desc->rx_msdu_info.info0 &
> +               rxcb->is_continuation = !!(rx_msdu_info0 &
>                                            RX_MSDU_DESC_INFO0_MSDU_CONTINUATION);
>                 rxcb->peer_id = FIELD_GET(RX_MPDU_DESC_META_DATA_PEER_ID,
> -                                         desc->rx_mpdu_info.meta_data);
> +                                         READ_ONCE(desc->rx_mpdu_info.meta_data));
>                 rxcb->seq_no = FIELD_GET(RX_MPDU_DESC_INFO0_SEQ_NUM,
> -                                        desc->rx_mpdu_info.info0);
> +                                        rx_mpdu_info0);
>                 rxcb->tid = FIELD_GET(HAL_REO_DEST_RING_INFO0_RX_QUEUE_NUM,
> -                                     desc->info0);
> +                                     info0);
>
>                 rxcb->mac_id = mac_id;
>                 __skb_queue_tail(&msdu_list[mac_id], msdu);
> --
> 2.48.1
>

While the fix is definitely a fix, it does not seem to help with the
`msdu_done bit in attention is not set` message as I have seen it 43
times in the last 12 hours.

-- Steev
Re: [PATCH] wifi: ath11k: fix rx completion meta data corruption
Posted by Johan Hovold 8 months, 4 weeks ago
On Sat, Mar 22, 2025 at 03:32:08PM -0500, Steev Klimaszewski wrote:
> On Fri, Mar 21, 2025 at 9:55 AM Johan Hovold <johan+linaro@kernel.org> wrote:
> >
> > Add the missing memory barrier to make sure that the REO dest ring
> > descriptor is read after the head pointer to avoid using stale data on
> > weakly ordered architectures like aarch64.
> >
> > This may fix the ring-buffer corruption worked around by commit
> > f9fff67d2d7c ("wifi: ath11k: Fix SKB corruption in REO destination
> > ring") by silently discarding data, and may possibly also address user
> > reported errors like:
> >
> >         ath11k_pci 0006:01:00.0: msdu_done bit in attention is not set
> >
> > Tested-on: WCN6855 hw2.1 WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.41
> >
> > Fixes: d5c65159f289 ("ath11k: driver for Qualcomm IEEE 802.11ax devices")
> > Cc: stable@vger.kernel.org      # 5.6
> > Link: https://bugzilla.kernel.org/show_bug.cgi?id=218005
> > Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
> > ---
> >
> > As I reported here:
> >
> >         https://lore.kernel.org/lkml/Z9G5zEOcTdGKm7Ei@hovoldconsulting.com/
> >
> > the ath11k and ath12k appear to be missing a number of memory barriers
> > that are required on weakly ordered architectures like aarch64 to avoid
> > memory corruption issues.
> >
> > Here's a fix for one more such case which people already seem to be
> > hitting.
> >
> > Note that I've seen one "msdu_done" bit not set warning also with this
> > patch so whether it helps with that at all remains to be seen. I'm CCing
> > Jens and Steev that see these warnings frequently and that may be able
> > to help out with testing.

> While the fix is definitely a fix, it does not seem to help with the
> `msdu_done bit in attention is not set` message as I have seen it 43
> times in the last 12 hours.

Thanks for testing, Steev. Given that the patch makes the warnings
go away in Clayton's setup, it seems we may be dealing with more than
one root cause here.

Johan