[DONOTAPPLY RFC PATCH v2 3/4] DONOTMERGE: net: mwifiex: fix timeouts with the SD8777 chip

Karel Balej posted 4 patches 3 months, 2 weeks ago
[DONOTAPPLY RFC PATCH v2 3/4] DONOTMERGE: net: mwifiex: fix timeouts with the SD8777 chip
Posted by Karel Balej 3 months, 2 weeks ago
It has been observed with the samsung,coreprimevelte smartphone which
uses this wireless chip that the driver would irrecoverably fail upon
certain operations, usually after timing out on a FW command such as
with these logs:

	[ 2101.209913] mwifiex_sdio mmc2:0001:1: mwifiex_cmd_timeout_func: Timeout cmd id = 0xa4, act = 0x0
	[ 2101.209941] mwifiex_sdio mmc2:0001:1: num_data_h2c_failure = 0
	[ 2101.209949] mwifiex_sdio mmc2:0001:1: num_cmd_h2c_failure = 0
	[ 2101.209957] mwifiex_sdio mmc2:0001:1: is_cmd_timedout = 1
	[ 2101.209964] mwifiex_sdio mmc2:0001:1: num_tx_timeout = 0
	[ 2101.209971] mwifiex_sdio mmc2:0001:1: last_cmd_index = 1
	[ 2101.209978] mwifiex_sdio mmc2:0001:1: last_cmd_id: 16 00 a4 00 75 00 a4 00 7f 00
	[ 2101.209988] mwifiex_sdio mmc2:0001:1: last_cmd_act: 00 00 00 00 02 00 00 00 00 00
	[ 2101.209995] mwifiex_sdio mmc2:0001:1: last_cmd_resp_index = 0
	[ 2101.210003] mwifiex_sdio mmc2:0001:1: last_cmd_resp_id: 16 80 a4 80 75 80 a4 80 7f 80
	[ 2101.210010] mwifiex_sdio mmc2:0001:1: last_event_index = 4
	[ 2101.210018] mwifiex_sdio mmc2:0001:1: last_event: 0b 00 0a 00 0b 00 0a 00 1c 00
	[ 2101.210025] mwifiex_sdio mmc2:0001:1: data_sent=1 cmd_sent=1
	[ 2101.210033] mwifiex_sdio mmc2:0001:1: ps_mode=1 ps_state=0
	[ 2101.210089] mwifiex_sdio mmc2:0001:1: failed to get signal information
	[ 2101.210761] mwifiex_sdio mmc2:0001:1: PREP_CMD: FW is in bad state
	[ 2101.210786] mwifiex_sdio mmc2:0001:1: failed to get signal information
	[ 2101.211162] mwifiex_sdio mmc2:0001:1: ===mwifiex driverinfo dump start===
	[ 2101.211178] mwifiex_sdio mmc2:0001:1: info: MWIFIEX VERSION: mwifiex 1.0 (14.75.33.p119)
	[ 2101.211202] mwifiex_sdio mmc2:0001:1: SDIO register dump start
	[ 2101.211482] mwifiex_sdio mmc2:0001:1: SDIO Func0 (0x0-0x9): 32 02 02 02 03 00 00 02 03 00
	[ 2101.211649] mwifiex_sdio mmc2:0001:1: SDIO Func1 (0x0-0x9): 02 3f 03 00 00 00 00 00 92 00
	[ 2101.211740] mwifiex_sdio mmc2:0001:1: SDIO Func1: (0x28) 00 (0x30) 08 (0x34) 07 (0x38) 11 (0x3c) 00
	[ 2101.211921] mwifiex_sdio mmc2:0001:1: SDIO Func1 (0x60-0x6a): dc fe 5f 81 ca 04 00 79 79 00 30
	[ 2101.314135] mwifiex_sdio mmc2:0001:1: SDIO Func1 (0x60-0x6a): dc fe 5f 81 ca 04 00 79 79 00 30
	[ 2101.314168] mwifiex_sdio mmc2:0001:1: SDIO register dump end
	[ 2101.314300] mwifiex_sdio mmc2:0001:1: ===mwifiex driverinfo dump end===
	[ 2101.314313] mwifiex_sdio mmc2:0001:1: == mwifiex dump information to /sys/class/devcoredump start
	[ 2101.314586] mwifiex_sdio mmc2:0001:1: == mwifiex dump information to /sys/class/devcoredump end
	[ 2101.314610] mwifiex_sdio mmc2:0001:1: PREP_CMD: FW is in bad state
	[ 2101.314638] mwifiex_sdio mmc2:0001:1: PREP_CMD: FW is in bad state
	[ 2101.317997] mwifiex_sdio mmc2:0001:1: PREP_CMD: card is removed
	[ 2101.318029] mwifiex_sdio mmc2:0001:1: deleting the crypto keys
	[ 2101.318037] mwifiex_sdio mmc2:0001:1: PREP_CMD: card is removed
	[ 2101.318044] mwifiex_sdio mmc2:0001:1: deleting the crypto keys
	[ 2101.318051] mwifiex_sdio mmc2:0001:1: PREP_CMD: card is removed
	[ 2101.318057] mwifiex_sdio mmc2:0001:1: deleting the crypto keys
	[ 2101.318064] mwifiex_sdio mmc2:0001:1: PREP_CMD: card is removed
	[ 2101.318071] mwifiex_sdio mmc2:0001:1: deleting the crypto keys
	[ 2101.318078] mwifiex_sdio mmc2:0001:1: PREP_CMD: card is removed
	[ 2101.318084] mwifiex_sdio mmc2:0001:1: deleting the crypto keys
	[ 2101.318091] mwifiex_sdio mmc2:0001:1: PREP_CMD: card is removed
	[ 2101.318098] mwifiex_sdio mmc2:0001:1: deleting the crypto keys
	[ 2101.321278] mwifiex_sdio mmc2:0001:1: info: shutdown mwifiex...
	[ 2101.323214] mwifiex_sdio mmc2:0001:1: PREP_CMD: card is removed
	[ 2101.323250] mwifiex_sdio mmc2:0001:1: PREP_CMD: card is removed
	[ 2101.324427] mwifiex_sdio mmc2:0001:1: PREP_CMD: card is removed
	[ 2101.419786] mmc2: queuing unknown CIS tuple 0x50 [40 1e fd d1 c0 46 70 47 00 b5 23 48 24 49 01 60 24 48 24 49 01 60 24 49 08 47 1f 48 24 49 01 60] (32 bytes)
	[ 2101.460850] mmc2: queuing unknown CIS tuple 0x70 [53 f0 21 e3 1e ff 2f e1 10 1f 11 ee 00 00 50 e3 02 2a a0 e3 02 10 c1 01 02 10 81 11 10 1f 01 ee 1e ff 2f e1 01 00 a0 e3 f6 ff ff eb ea ff ff fa 48 00 9f e5 54 10 9f e5 54 20 9f e5 ef ff ff eb] (71 bytes)
	[ 2101.532495] mmc2: queuing unknown CIS tuple 0xe8 [2f 07 ee 1e ff 2f e1 0e 30 a0 e1 00 00 a0 e3 ea ff ff eb e5 ff ff fa 03 e0 a0 e1 2c 30 9f e5 13 ff 2f e1 00 10 80 e5 1a 9f 00 ee 12 ff 2f e1 04 21 00 80 c0 00 10 80 04 22 00 80 06 0a 46 02 48] (144 bytes)
	[ 2101.598922] mmc2: queuing unknown CIS tuple 0x9d [29 15 1d 01 00 4d 61 72 76 65 6c 6c 20 42 6c 75 65 74 6f 6f 74 68 20 44 65 76 69 63 65 00 00 ff 20 04 df 02 32 91 21 02 0c 00 22 2a 01 01 00 00 00 00 00 00 00 00 00 00 00 02 00 00 00 00 00 00] (162 bytes)
	[ 2101.599647] mmc2: tried to HW reset card, got error -2
	[ 2101.599699] mwifiex_sdio mmc2:0001:1: SDIO HW reset failed: -2

(besides cmd id 0xa4 also 0xce is often seen in these dumps).

One of these operations which would cause the timeout reliably was
initiating a SSH connection into the phone over the WiFi and this was
used for tracking down the issue. The other known ones are heavier usage
of the XMPP client Dino on the phone and any more than basic browsing in
Firefox.

It was however verified that the mwifiex driver works without issues on
the vendor Samsung kernel with the proper support for the chip added,
although the stock Android on the phone doesn't use mwifiex but instead
a specialized driver for the chip which however is very similar to
mwifiex.

A support for the phone and it's components essential for the WiFi
function were thus backported to the v3.14 kernel on whose stable branch
the stock kernel was based and it was verified that the WiFi (namely SSH
over WiFi) also works there without timeouts. Afterwards, a bisect was
performed which first indicated the commit 808bbebcc8fc ("mwifiex: add
Tx status support for EAPOL packets") introduced in the v3.18-v3.19
cycle.

Reverting this commit (and the following one, commit 18ca43823f3c
("mwifiex: add Tx status support for ACTION frames"), to facilitate a
clean revert) fixed the timeouts for v3.19, but during the next cycle,
v3.19-v4.0, another breakage was introduced via commit 84b313b35f81
("mwifiex: make tx packet 64 byte DMA aligned").

Reverting all three commits fixed the timeouts on the current mainline
kernel also. This patch contains the minimal changes needed to achieve
that derived from the full revert commits.

Note that a few timeouts were still observed on mainline Linux even with
these changes, however they were all on a particular network and don't
have a known reproducer. On other networks, the chip performs flawlessly
as far as it was tested so far.

These changes also don't mitigate the observed firmware loading issues:

	[   68.698674] mwifiex_sdio mmc2:0001:1: FW download, write iomem (0) failed @ 208016
	[   68.698711] mwifiex_sdio mmc2:0001:1: prog_fw failed ret=0xffffffff

These were however also seen with the old kernels and as such are likely
caused either by some inherent incompatibility of the driver with the
chip in this regard or, more likely given their occasional nature, by an
issue at some other level, such as the stability of the SDIO (the
occurrence of the problem does seem to loosely correlate with the load
exerted on the phone while the FW loading is performed (i. e. mainly on
startup)).

Note that Bluetooth on the phone (provided by the same chip) has also
been seen crashing in a similar way as the WiFi without this patch,
although seemingly less frequently/deterministically, and a similar
change might thus be needed in the btmrvl driver if it is to be used for
the phone's BT without issues.

Signed-off-by: Karel Balej <balejk@matfyz.cz>
---
 drivers/net/wireless/marvell/mwifiex/fw.h     |  4 +---
 drivers/net/wireless/marvell/mwifiex/sta_tx.c | 10 ++--------
 2 files changed, 3 insertions(+), 11 deletions(-)

diff --git a/drivers/net/wireless/marvell/mwifiex/fw.h b/drivers/net/wireless/marvell/mwifiex/fw.h
index e9e896606912..5c4c3363c7de 100644
--- a/drivers/net/wireless/marvell/mwifiex/fw.h
+++ b/drivers/net/wireless/marvell/mwifiex/fw.h
@@ -690,9 +690,7 @@ struct txpd {
 	u8 priority;
 	u8 flags;
 	u8 pkt_delay_2ms;
-	u8 reserved1[2];
-	u8 tx_token_id;
-	u8 reserved[2];
+	u8 reserved1;
 } __packed;
 
 struct rxpd {
diff --git a/drivers/net/wireless/marvell/mwifiex/sta_tx.c b/drivers/net/wireless/marvell/mwifiex/sta_tx.c
index 9d0ef04ebe02..857eb22f4c24 100644
--- a/drivers/net/wireless/marvell/mwifiex/sta_tx.c
+++ b/drivers/net/wireless/marvell/mwifiex/sta_tx.c
@@ -41,8 +41,8 @@ void mwifiex_process_sta_txpd(struct mwifiex_private *priv,
 
 	pkt_type = mwifiex_is_skb_mgmt_frame(skb) ? PKT_TYPE_MGMT : 0;
 
-	pad = ((uintptr_t)skb->data - (sizeof(*local_tx_pd) + hroom)) &
-	       (MWIFIEX_DMA_ALIGN_SZ - 1);
+	pad = (4 - (((void *)skb->data - NULL) & 0x3)) % 4;
+
 	skb_push(skb, sizeof(*local_tx_pd) + pad);
 
 	local_tx_pd = (struct txpd *) skb->data;
@@ -58,12 +58,6 @@ void mwifiex_process_sta_txpd(struct mwifiex_private *priv,
 	local_tx_pd->pkt_delay_2ms =
 				mwifiex_wmm_compute_drv_pkt_delay(priv, skb);
 
-	if (tx_info->flags & MWIFIEX_BUF_FLAG_EAPOL_TX_STATUS ||
-	    tx_info->flags & MWIFIEX_BUF_FLAG_ACTION_TX_STATUS) {
-		local_tx_pd->tx_token_id = tx_info->ack_frame_id;
-		local_tx_pd->flags |= MWIFIEX_TXPD_FLAGS_REQ_TX_STATUS;
-	}
-
 	if (local_tx_pd->priority <
 	    ARRAY_SIZE(priv->wmm.user_pri_pkt_tx_ctrl))
 		/*
-- 
2.51.1
Re: [DONOTAPPLY RFC PATCH v2 3/4] DONOTMERGE: net: mwifiex: fix timeouts with the SD8777 chip
Posted by Brian Norris 2 months, 1 week ago
Hi,

On Sun, Oct 26, 2025 at 07:20:40PM +0100, Karel Balej wrote:
> 	[ 2101.211178] mwifiex_sdio mmc2:0001:1: info: MWIFIEX VERSION: mwifiex 1.0 (14.75.33.p119)
... 
> Afterwards, a bisect was
> performed which first indicated the commit 808bbebcc8fc ("mwifiex: add
> Tx status support for EAPOL packets") introduced in the v3.18-v3.19
> cycle.
> 
> Reverting this commit (and the following one, commit 18ca43823f3c
> ("mwifiex: add Tx status support for ACTION frames"), to facilitate a
> clean revert) fixed the timeouts for v3.19, but during the next cycle,
> v3.19-v4.0, another breakage was introduced via commit 84b313b35f81
> ("mwifiex: make tx packet 64 byte DMA aligned").
> 
> Reverting all three commits fixed the timeouts on the current mainline
> kernel also. This patch contains the minimal changes needed to achieve
> that derived from the full revert commits.
...

(Trimmed the commit message down to the breaking commits, and the
version info)

From the looks of it, you're dealing with incompatible changes made in
the Marvell firmware API. It seems that you have a "version 14"
firmware, and the timeline of these mwifiex changes (~2014) is approx
when linux-firmware started seeing v15 and v16 firmware. It *might* be
OK to try add some versioning to these structs and padding changes, and
make a choice based on adapter->fw_release_number or
adapter->fw_cap_info. It might be ugly and error-prone, but possible...

Or if the FW versioning doesn't work out, it's possible we could
specifically flag these quirks for SD8777 somehow.

> Signed-off-by: Karel Balej <balejk@matfyz.cz>
> ---
>  drivers/net/wireless/marvell/mwifiex/fw.h     |  4 +---
>  drivers/net/wireless/marvell/mwifiex/sta_tx.c | 10 ++--------
>  2 files changed, 3 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/net/wireless/marvell/mwifiex/fw.h b/drivers/net/wireless/marvell/mwifiex/fw.h
> index e9e896606912..5c4c3363c7de 100644
> --- a/drivers/net/wireless/marvell/mwifiex/fw.h
> +++ b/drivers/net/wireless/marvell/mwifiex/fw.h
> @@ -690,9 +690,7 @@ struct txpd {
>  	u8 priority;
>  	u8 flags;
>  	u8 pkt_delay_2ms;
> -	u8 reserved1[2];
> -	u8 tx_token_id;
> -	u8 reserved[2];
> +	u8 reserved1;

I'm inferring that 'sizeof(struct txpd)' (also spelled
'sizeof(*local_tx_pd)' below) is relevant, and that this struct probably
should retain the smaller size for FW version 14.

Maybe you need a new 'struct txpd_v14' layout, and embed that inside
'struct txpd'.

>  } __packed;
>  
>  struct rxpd {
> diff --git a/drivers/net/wireless/marvell/mwifiex/sta_tx.c b/drivers/net/wireless/marvell/mwifiex/sta_tx.c
> index 9d0ef04ebe02..857eb22f4c24 100644
> --- a/drivers/net/wireless/marvell/mwifiex/sta_tx.c
> +++ b/drivers/net/wireless/marvell/mwifiex/sta_tx.c
> @@ -41,8 +41,8 @@ void mwifiex_process_sta_txpd(struct mwifiex_private *priv,
>  
>  	pkt_type = mwifiex_is_skb_mgmt_frame(skb) ? PKT_TYPE_MGMT : 0;
>  
> -	pad = ((uintptr_t)skb->data - (sizeof(*local_tx_pd) + hroom)) &
> -	       (MWIFIEX_DMA_ALIGN_SZ - 1);
> +	pad = (4 - (((void *)skb->data - NULL) & 0x3)) % 4;

It's not clear to me whether your v14 FW doesn't like the 64-byte
alignment, or if it didn't like the new txpd header size/layout, or
both. But obviously this line won't fly, with magic numbers and all. It
will need to be expressed in terms of macros (MWIFIEX_DMA_ALIGN_SZ, or a
"V14" version of that; and sizeof(...)).

> +
>  	skb_push(skb, sizeof(*local_tx_pd) + pad);
>  
>  	local_tx_pd = (struct txpd *) skb->data;
> @@ -58,12 +58,6 @@ void mwifiex_process_sta_txpd(struct mwifiex_private *priv,
>  	local_tx_pd->pkt_delay_2ms =
>  				mwifiex_wmm_compute_drv_pkt_delay(priv, skb);
>  
> -	if (tx_info->flags & MWIFIEX_BUF_FLAG_EAPOL_TX_STATUS ||
> -	    tx_info->flags & MWIFIEX_BUF_FLAG_ACTION_TX_STATUS) {
> -		local_tx_pd->tx_token_id = tx_info->ack_frame_id;
> -		local_tx_pd->flags |= MWIFIEX_TXPD_FLAGS_REQ_TX_STATUS;
> -	}

Rather than dropping this block, would it work to also check:

	adapter->fw_api_ver >= MWIFIEX_FW_V15

?

Brian

> -
>  	if (local_tx_pd->priority <
>  	    ARRAY_SIZE(priv->wmm.user_pri_pkt_tx_ctrl))
>  		/*
> -- 
> 2.51.1
>