[v2] stmmac crash/stall fixes when under memory pressure

[PATCH net v2 1/2] net: stmmac: Prevent NULL deref when RX memory exhausted

Posted by Sam Edwards 2 weeks, 3 days ago

The CPU receives frames from the MAC through conventional DMA: the CPU
allocates buffers for the MAC, then the MAC fills them and returns
ownership to the CPU. For each hardware RX queue, the CPU and MAC
coordinate through a shared ring array of DMA descriptors: one
descriptor per DMA buffer. Each descriptor includes the buffer's
physical address and a status flag ("OWN") indicating which side owns
the buffer: OWN=0 for CPU, OWN=1 for MAC. The CPU is only allowed to set
the flag and the MAC is only allowed to clear it, and both must move
through the ring in sequence: thus the ring is used for both
"submissions" and "completions."

In the stmmac driver, stmmac_rx() bookmarks its position in the ring
with the `cur_rx` index. The main receive loop in that function checks
for rx_descs[cur_rx].own=0, gives the corresponding buffer to the
network stack (NULLing the pointer), and increments `cur_rx` modulo the
ring size. After the loop exits, stmmac_rx_refill(), which bookmarks its
position with `dirty_rx`, allocates fresh buffers and rearms the
descriptors (setting OWN=1). If it fails any allocation, it simply stops
early (leaving OWN=0) and will retry where it left off when next called.

This means descriptors have a three-stage lifecycle (terms my own):
- `empty` (OWN=1, buffer valid)
- `full` (OWN=0, buffer valid and populated)
- `dirty` (OWN=0, buffer NULL)

But because stmmac_rx() only checks OWN, it confuses `full`/`dirty`. In
the past (see 'Fixes:'), there was a bug where the loop could cycle
`cur_rx` all the way back to the first descriptor it dirtied, resulting
in a NULL dereference when mistaken for `full`. The aforementioned
commit resolved that *specific* failure by capping the loop's iteration
limit at `dma_rx_size - 1`, but this is only a partial fix: if the
previous stmmac_rx_refill() didn't complete, then there are leftover
`dirty` descriptors that the loop might encounter without needing to
cycle fully around. The current code therefore panics (see 'Closes:')
when stmmac_rx_refill() is memory-starved long enough for `cur_rx` to
catch up to `dirty_rx`.

Fix this by further tightening the clamp from `dma_rx_size - 1` to
`dma_rx_size - stmmac_rx_dirty() - 1`, subtracting any remnant dirty
entries and limiting the loop so that `cur_rx` cannot catch back up to
`dirty_rx`. This carries no risk of arithmetic underflow: since the
maximum possible return value of stmmac_rx_dirty() is `dma_rx_size - 1`,
the worst the clamp can do is prevent the loop from running at all.

Fixes: b6cb4541853c7 ("net: stmmac: avoid rx queue overrun")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=221010
Cc: stable@vger.kernel.org
Signed-off-by: Sam Edwards <CFSworks@gmail.com>
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 6827c99bde8c..f98b070073c0 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -5609,7 +5609,8 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
 
 	dma_dir = page_pool_get_dma_dir(rx_q->page_pool);
 	bufsz = DIV_ROUND_UP(priv->dma_conf.dma_buf_sz, PAGE_SIZE) * PAGE_SIZE;
-	limit = min(priv->dma_conf.dma_rx_size - 1, (unsigned int)limit);
+	limit = min(priv->dma_conf.dma_rx_size - stmmac_rx_dirty(priv, queue) - 1,
+		    (unsigned int)limit);
 
 	if (netif_msg_rx_status(priv)) {
 		void *rx_head;
-- 
2.52.0

Re: [PATCH net v2 1/2] net: stmmac: Prevent NULL deref when RX memory exhausted

Posted by Russell King (Oracle) 2 weeks, 3 days ago

On Thu, Mar 19, 2026 at 11:40:30AM -0700, Sam Edwards wrote:
> The CPU receives frames from the MAC through conventional DMA: the CPU
> allocates buffers for the MAC, then the MAC fills them and returns
> ownership to the CPU. For each hardware RX queue, the CPU and MAC
> coordinate through a shared ring array of DMA descriptors: one
> descriptor per DMA buffer. Each descriptor includes the buffer's
> physical address and a status flag ("OWN") indicating which side owns
> the buffer: OWN=0 for CPU, OWN=1 for MAC. The CPU is only allowed to set
> the flag and the MAC is only allowed to clear it, and both must move
> through the ring in sequence: thus the ring is used for both
> "submissions" and "completions."
> 
> In the stmmac driver, stmmac_rx() bookmarks its position in the ring
> with the `cur_rx` index. The main receive loop in that function checks
> for rx_descs[cur_rx].own=0, gives the corresponding buffer to the
> network stack (NULLing the pointer), and increments `cur_rx` modulo the
> ring size. After the loop exits, stmmac_rx_refill(), which bookmarks its
> position with `dirty_rx`, allocates fresh buffers and rearms the
> descriptors (setting OWN=1). If it fails any allocation, it simply stops
> early (leaving OWN=0) and will retry where it left off when next called.
> 
> This means descriptors have a three-stage lifecycle (terms my own):
> - `empty` (OWN=1, buffer valid)
> - `full` (OWN=0, buffer valid and populated)
> - `dirty` (OWN=0, buffer NULL)
> 
> But because stmmac_rx() only checks OWN, it confuses `full`/`dirty`. In
> the past (see 'Fixes:'), there was a bug where the loop could cycle
> `cur_rx` all the way back to the first descriptor it dirtied, resulting
> in a NULL dereference when mistaken for `full`. The aforementioned
> commit resolved that *specific* failure by capping the loop's iteration
> limit at `dma_rx_size - 1`, but this is only a partial fix: if the
> previous stmmac_rx_refill() didn't complete, then there are leftover
> `dirty` descriptors that the loop might encounter without needing to
> cycle fully around. The current code therefore panics (see 'Closes:')
> when stmmac_rx_refill() is memory-starved long enough for `cur_rx` to
> catch up to `dirty_rx`.
> 
> Fix this by further tightening the clamp from `dma_rx_size - 1` to
> `dma_rx_size - stmmac_rx_dirty() - 1`, subtracting any remnant dirty
> entries and limiting the loop so that `cur_rx` cannot catch back up to
> `dirty_rx`. This carries no risk of arithmetic underflow: since the
> maximum possible return value of stmmac_rx_dirty() is `dma_rx_size - 1`,
> the worst the clamp can do is prevent the loop from running at all.

Isn't the limit equivalent to:

	space = CIRC_SPACE(rx_q->cur_rx, rx_q->dirty_rx,
			   priv->dma_conf.dma_rx_size);
	limit = min_t(unsigned int, space, limit);

(Think of the "full" and "empty" cases as "space" which can be
consumed to provide entries for the refiller to action.)

I think the same applies for patch 2 - when the above returns zero
it means we have no entries in the ring that aren't due for refill.

Have you checked whether stmmac_rx_zc() also buggy in this respect?

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

Re: [PATCH net v2 1/2] net: stmmac: Prevent NULL deref when RX memory exhausted

Posted by Sam Edwards 2 weeks, 3 days ago

On Thu, Mar 19, 2026 at 2:59 PM Russell King (Oracle)
<linux@armlinux.org.uk> wrote:

Hi Russell,

> Isn't the limit equivalent to:
>
>         space = CIRC_SPACE(rx_q->cur_rx, rx_q->dirty_rx,
>                            priv->dma_conf.dma_rx_size);
>         limit = min_t(unsigned int, space, limit);
>

I had to think about the edge cases for a few minutes -- yes, you're
right. However...

> (Think of the "full" and "empty" cases as "space" which can be
> consumed to provide entries for the refiller to action.)

...this seems like a double negative: We're looking at the ring from
the refiller's perspective (i.e. considering "dirty" the meaningful
state), and *then* looking at the "unused space" (i.e. "not dirty").
If I used it, I'd at least add a comment explaining that CIRC_SPACE
is, counterintuitively, not measuring "space."

After the bugs are fixed, I'll follow up with a net-next patch [1]
that removes the `min` altogether and makes the loop check for 'dirty'
directly, so my focus here is "minimal, obviously correct" and not
necessarily "tidiest." Would it be acceptable if I simplify
stmmac_rx_dirty() to use CIRC_CNT() then, instead of using
CIRC_SPACE() now?

> I think the same applies for patch 2 - when the above returns zero
> it means we have no entries in the ring that aren't due for refill.

I'm not completely satisfied with the `stmmac_rx_dirty() ==
dma_rx_size - 1` expression either. One might argue we should keep
trying until dirty==0, because tolerating *any* dirtiness risks not
having enough buffers for the next traffic burst, causing avoidable rx
drops. But that's a discussion for patch 2. :)

> Have you checked whether stmmac_rx_zc() also buggy in this respect?

I skimmed it. I saw that it checks the return status of
stmmac_rx_refill_zc() and on failure, returns the NAPI budget to keep
polling -- which told me that the ZC author(s) at least considered
these problems.

Looking at it more thoroughly now, there are a few code smells around
the unconditional `dirty = 0;`, and `dirty` being used as the "budget"
for stmmac_rx_refill_zc() (why have a budget at all then?) so it looks
like something's there, but it's a separate bug if so.

Best,
Sam

[1] https://lore.kernel.org/netdev/20260316021009.262358-4-CFSworks@gmail.com/

[PATCH net v2 1/2] net: stmmac: Prevent NULL deref when RX memory exhausted
[PATCH net v2 2/2] net: stmmac: Prevent indefinite RX stall on buffer exhaustion