[PATCH v2 3/4] gpu: nova-core: gsp: fix improper handling of empty slot in cmdq

Eliot Courtney posted 4 patches 2 weeks, 1 day ago
There is a newer version of this series
[PATCH v2 3/4] gpu: nova-core: gsp: fix improper handling of empty slot in cmdq
Posted by Eliot Courtney 2 weeks, 1 day ago
The current code hands out buffers that go all the way up to and
including `rx - 1`, but we need to maintain an empty slot to prevent the
ring buffer from wrapping around into having 'tx == rx', which means
empty.

Also add more rigorous no-panic proofs.

Fixes: 75f6b1de8133 ("gpu: nova-core: gsp: Add GSP command queue bindings and handling")
Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
---
 drivers/gpu/nova-core/gsp/cmdq.rs | 33 +++++++++++++++++++--------------
 1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/nova-core/gsp/cmdq.rs b/drivers/gpu/nova-core/gsp/cmdq.rs
index 09c28eeb6f12..aa8758fc7723 100644
--- a/drivers/gpu/nova-core/gsp/cmdq.rs
+++ b/drivers/gpu/nova-core/gsp/cmdq.rs
@@ -227,21 +227,26 @@ fn new(dev: &device::Device<device::Bound>) -> Result<Self> {
         // PANIC: per the invariant of `cpu_write_ptr`, `tx` is `< MSGQ_NUM_PAGES`.
         let (before_tx, after_tx) = gsp_mem.cpuq.msgq.data.split_at_mut(tx);
 
-        if rx <= tx {
-            // The area from `tx` up to the end of the ring, and from the beginning of the ring up
-            // to `rx`, minus one unit, belongs to the driver.
-            if rx == 0 {
-                let last = after_tx.len() - 1;
-                (&mut after_tx[..last], &mut before_tx[0..0])
-            } else {
-                (after_tx, &mut before_tx[..rx])
-            }
+        // The area starting at `tx` and ending at `rx - 2` modulo MSGQ_NUM_PAGES, inclusive,
+        // belongs to the driver for writing.
+        if rx == 0 {
+            // Since `rx` is zero, leave an empty slot at end of the buffer.
+            let last = after_tx.len() - 1;
+            (&mut after_tx[..last], &mut before_tx[0..0])
+        } else if rx > tx {
+            // The area is contiguous and we leave an empty slot before `rx`.
+            // PANIC:
+            // - The index `rx - tx - 1` is non-negative because `rx > tx` in this branch.
+            // - The index does not exceed `after_tx.len()` (which is `MSGQ_NUM_PAGES - tx`)
+            //   because `rx < MSGQ_NUM_PAGES` by the `gsp_read_ptr` invariant.
+            (&mut after_tx[..(rx - tx - 1)], &mut before_tx[0..0])
         } else {
-            // The area from `tx` to `rx`, minus one unit, belongs to the driver.
-            //
-            // PANIC: per the invariants of `cpu_write_ptr` and `gsp_read_ptr`, `rx` and `tx` are
-            // `<= MSGQ_NUM_PAGES`, and the test above ensured that `rx > tx`.
-            (after_tx.split_at_mut(rx - tx).0, &mut before_tx[0..0])
+            // The area is discontiguous and we leave an empty slot before `rx`.
+            // PANIC:
+            // - The index `rx - 1` is non-negative because `rx != 0` in this branch.
+            // - The index does not exceed `before_tx.len()` (which equals `tx`) because
+            //   `rx <= tx` in this branch.
+            (after_tx, &mut before_tx[..(rx - 1)])
         }
     }
 

-- 
2.52.0
Re: [PATCH v2 3/4] gpu: nova-core: gsp: fix improper handling of empty slot in cmdq
Posted by Alexandre Courbot 1 week, 3 days ago
On Fri Jan 23, 2026 at 9:12 PM JST, Eliot Courtney wrote:
> The current code hands out buffers that go all the way up to and
> including `rx - 1`, but we need to maintain an empty slot to prevent the
> ring buffer from wrapping around into having 'tx == rx', which means
> empty.
>
> Also add more rigorous no-panic proofs.
>
> Fixes: 75f6b1de8133 ("gpu: nova-core: gsp: Add GSP command queue bindings and handling")
> Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
> ---
>  drivers/gpu/nova-core/gsp/cmdq.rs | 33 +++++++++++++++++++--------------
>  1 file changed, 19 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/gpu/nova-core/gsp/cmdq.rs b/drivers/gpu/nova-core/gsp/cmdq.rs
> index 09c28eeb6f12..aa8758fc7723 100644
> --- a/drivers/gpu/nova-core/gsp/cmdq.rs
> +++ b/drivers/gpu/nova-core/gsp/cmdq.rs
> @@ -227,21 +227,26 @@ fn new(dev: &device::Device<device::Bound>) -> Result<Self> {
>          // PANIC: per the invariant of `cpu_write_ptr`, `tx` is `< MSGQ_NUM_PAGES`.
>          let (before_tx, after_tx) = gsp_mem.cpuq.msgq.data.split_at_mut(tx);
>  
> -        if rx <= tx {
> -            // The area from `tx` up to the end of the ring, and from the beginning of the ring up
> -            // to `rx`, minus one unit, belongs to the driver.
> -            if rx == 0 {
> -                let last = after_tx.len() - 1;
> -                (&mut after_tx[..last], &mut before_tx[0..0])
> -            } else {
> -                (after_tx, &mut before_tx[..rx])

Indeed, the comment clearly states "minus one unit" but the second
branch (and the one below) ignores that. >_<

> -            }
> +        // The area starting at `tx` and ending at `rx - 2` modulo MSGQ_NUM_PAGES, inclusive,
> +        // belongs to the driver for writing.

I have trouble parsing this comment - it says `rx - 2` but nowhere does
that expression appear in the code. It's particularly confusing because
it apparently applies to all 3 blocks, but the position suggests it is
only the first.  I'd add a blank line after it to signal that it is
relevant to the rest of the method.

> +        if rx == 0 {
> +            // Since `rx` is zero, leave an empty slot at end of the buffer.
> +            let last = after_tx.len() - 1;
> +            (&mut after_tx[..last], &mut before_tx[0..0])
> +        } else if rx > tx {

In the original code, the `rx < tx` case appears first. Can we preserve
this order so the corresponding diffs also appear next to one another?

> +            // The area is contiguous and we leave an empty slot before `rx`.
> +            // PANIC:
> +            // - The index `rx - tx - 1` is non-negative because `rx > tx` in this branch.
> +            // - The index does not exceed `after_tx.len()` (which is `MSGQ_NUM_PAGES - tx`)
> +            //   because `rx < MSGQ_NUM_PAGES` by the `gsp_read_ptr` invariant.
> +            (&mut after_tx[..(rx - tx - 1)], &mut before_tx[0..0])
>          } else {
> -            // The area from `tx` to `rx`, minus one unit, belongs to the driver.
> -            //
> -            // PANIC: per the invariants of `cpu_write_ptr` and `gsp_read_ptr`, `rx` and `tx` are
> -            // `<= MSGQ_NUM_PAGES`, and the test above ensured that `rx > tx`.
> -            (after_tx.split_at_mut(rx - tx).0, &mut before_tx[0..0])
> +            // The area is discontiguous and we leave an empty slot before `rx`.
> +            // PANIC:
> +            // - The index `rx - 1` is non-negative because `rx != 0` in this branch.
> +            // - The index does not exceed `before_tx.len()` (which equals `tx`) because
> +            //   `rx <= tx` in this branch.
> +            (after_tx, &mut before_tx[..(rx - 1)])

In any case, nice catch! This would have bitten us sooner or later.
Re: [PATCH v2 3/4] gpu: nova-core: gsp: fix improper handling of empty slot in cmdq
Posted by Gary Guo 1 week, 5 days ago
On Fri Jan 23, 2026 at 12:12 PM GMT, Eliot Courtney wrote:
> The current code hands out buffers that go all the way up to and
> including `rx - 1`, but we need to maintain an empty slot to prevent the
> ring buffer from wrapping around into having 'tx == rx', which means
> empty.
>
> Also add more rigorous no-panic proofs.
>
> Fixes: 75f6b1de8133 ("gpu: nova-core: gsp: Add GSP command queue bindings and handling")
> Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
> ---
>  drivers/gpu/nova-core/gsp/cmdq.rs | 33 +++++++++++++++++++--------------
>  1 file changed, 19 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/gpu/nova-core/gsp/cmdq.rs b/drivers/gpu/nova-core/gsp/cmdq.rs
> index 09c28eeb6f12..aa8758fc7723 100644
> --- a/drivers/gpu/nova-core/gsp/cmdq.rs
> +++ b/drivers/gpu/nova-core/gsp/cmdq.rs
> @@ -227,21 +227,26 @@ fn new(dev: &device::Device<device::Bound>) -> Result<Self> {
>          // PANIC: per the invariant of `cpu_write_ptr`, `tx` is `< MSGQ_NUM_PAGES`.
>          let (before_tx, after_tx) = gsp_mem.cpuq.msgq.data.split_at_mut(tx);
>  
> -        if rx <= tx {
> -            // The area from `tx` up to the end of the ring, and from the beginning of the ring up
> -            // to `rx`, minus one unit, belongs to the driver.
> -            if rx == 0 {
> -                let last = after_tx.len() - 1;
> -                (&mut after_tx[..last], &mut before_tx[0..0])
> -            } else {
> -                (after_tx, &mut before_tx[..rx])
> -            }
> +        // The area starting at `tx` and ending at `rx - 2` modulo MSGQ_NUM_PAGES, inclusive,
> +        // belongs to the driver for writing.
> +        if rx == 0 {
> +            // Since `rx` is zero, leave an empty slot at end of the buffer.
> +            let last = after_tx.len() - 1;
> +            (&mut after_tx[..last], &mut before_tx[0..0])

Does the address actually matter? Otherwise I would find `&mut []` easier to
understand than an empty indexing.

> +        } else if rx > tx {
> +            // The area is contiguous and we leave an empty slot before `rx`.
> +            // PANIC:
> +            // - The index `rx - tx - 1` is non-negative because `rx > tx` in this branch.
> +            // - The index does not exceed `after_tx.len()` (which is `MSGQ_NUM_PAGES - tx`)
> +            //   because `rx < MSGQ_NUM_PAGES` by the `gsp_read_ptr` invariant.
> +            (&mut after_tx[..(rx - tx - 1)], &mut before_tx[0..0])
>          } else {
> -            // The area from `tx` to `rx`, minus one unit, belongs to the driver.
> -            //
> -            // PANIC: per the invariants of `cpu_write_ptr` and `gsp_read_ptr`, `rx` and `tx` are
> -            // `<= MSGQ_NUM_PAGES`, and the test above ensured that `rx > tx`.
> -            (after_tx.split_at_mut(rx - tx).0, &mut before_tx[0..0])
> +            // The area is discontiguous and we leave an empty slot before `rx`.
> +            // PANIC:
> +            // - The index `rx - 1` is non-negative because `rx != 0` in this branch.
> +            // - The index does not exceed `before_tx.len()` (which equals `tx`) because
> +            //   `rx <= tx` in this branch.
> +            (after_tx, &mut before_tx[..(rx - 1)])

If this is written with get_disjoint_mut, the indices would be so much easier to
understand... To bad that that API is only available from 1.86 onwards.

Best,
Gary

>          }
>      }
>  
Re: [PATCH v2 3/4] gpu: nova-core: gsp: fix improper handling of empty slot in cmdq
Posted by Alexandre Courbot 1 week, 3 days ago
On Tue Jan 27, 2026 at 3:26 AM JST, Gary Guo wrote:
> On Fri Jan 23, 2026 at 12:12 PM GMT, Eliot Courtney wrote:
>> The current code hands out buffers that go all the way up to and
>> including `rx - 1`, but we need to maintain an empty slot to prevent the
>> ring buffer from wrapping around into having 'tx == rx', which means
>> empty.
>>
>> Also add more rigorous no-panic proofs.
>>
>> Fixes: 75f6b1de8133 ("gpu: nova-core: gsp: Add GSP command queue bindings and handling")
>> Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
>> ---
>>  drivers/gpu/nova-core/gsp/cmdq.rs | 33 +++++++++++++++++++--------------
>>  1 file changed, 19 insertions(+), 14 deletions(-)
>>
>> diff --git a/drivers/gpu/nova-core/gsp/cmdq.rs b/drivers/gpu/nova-core/gsp/cmdq.rs
>> index 09c28eeb6f12..aa8758fc7723 100644
>> --- a/drivers/gpu/nova-core/gsp/cmdq.rs
>> +++ b/drivers/gpu/nova-core/gsp/cmdq.rs
>> @@ -227,21 +227,26 @@ fn new(dev: &device::Device<device::Bound>) -> Result<Self> {
>>          // PANIC: per the invariant of `cpu_write_ptr`, `tx` is `< MSGQ_NUM_PAGES`.
>>          let (before_tx, after_tx) = gsp_mem.cpuq.msgq.data.split_at_mut(tx);
>>  
>> -        if rx <= tx {
>> -            // The area from `tx` up to the end of the ring, and from the beginning of the ring up
>> -            // to `rx`, minus one unit, belongs to the driver.
>> -            if rx == 0 {
>> -                let last = after_tx.len() - 1;
>> -                (&mut after_tx[..last], &mut before_tx[0..0])
>> -            } else {
>> -                (after_tx, &mut before_tx[..rx])
>> -            }
>> +        // The area starting at `tx` and ending at `rx - 2` modulo MSGQ_NUM_PAGES, inclusive,
>> +        // belongs to the driver for writing.
>> +        if rx == 0 {
>> +            // Since `rx` is zero, leave an empty slot at end of the buffer.
>> +            let last = after_tx.len() - 1;
>> +            (&mut after_tx[..last], &mut before_tx[0..0])
>
> Does the address actually matter? Otherwise I would find `&mut []` easier to
> understand than an empty indexing.

The address doesn't matter, and indeed I am not sure why we did that
(possibly a lifetime issue with some previous version of the code?).

Your suggestion seems to work fine, so Eliot feel free to include it in
your series (as a separate patch please to make sure we can isolate the
effects of both changes).

>
>> +        } else if rx > tx {
>> +            // The area is contiguous and we leave an empty slot before `rx`.
>> +            // PANIC:
>> +            // - The index `rx - tx - 1` is non-negative because `rx > tx` in this branch.
>> +            // - The index does not exceed `after_tx.len()` (which is `MSGQ_NUM_PAGES - tx`)
>> +            //   because `rx < MSGQ_NUM_PAGES` by the `gsp_read_ptr` invariant.
>> +            (&mut after_tx[..(rx - tx - 1)], &mut before_tx[0..0])
>>          } else {
>> -            // The area from `tx` to `rx`, minus one unit, belongs to the driver.
>> -            //
>> -            // PANIC: per the invariants of `cpu_write_ptr` and `gsp_read_ptr`, `rx` and `tx` are
>> -            // `<= MSGQ_NUM_PAGES`, and the test above ensured that `rx > tx`.
>> -            (after_tx.split_at_mut(rx - tx).0, &mut before_tx[0..0])
>> +            // The area is discontiguous and we leave an empty slot before `rx`.
>> +            // PANIC:
>> +            // - The index `rx - 1` is non-negative because `rx != 0` in this branch.
>> +            // - The index does not exceed `before_tx.len()` (which equals `tx`) because
>> +            //   `rx <= tx` in this branch.
>> +            (after_tx, &mut before_tx[..(rx - 1)])
>
> If this is written with get_disjoint_mut, the indices would be so much easier to
> understand... To bad that that API is only available from 1.86 onwards.

I for one am longing for `split_at_checked`... :)