[v1] gpu: nova-core: gsp: fix command queue ring buffer bugs

[PATCH 3/4] gpu: nova-core: gsp: fix improper handling of empty slot in cmdq

Posted by Eliot Courtney 2 weeks, 4 days ago

The current code hands out buffers that go all the way up to and
including `rx - 1`, but we need to maintain an empty slot to prevent the
ring buffer from wrapping around into having 'tx == rx', which means
empty.

Also add more rigorous no-panic proofs.

Fixes: 75f6b1de8133 ("gpu: nova-core: gsp: Add GSP command queue bindings and handling")
Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
---
 drivers/gpu/nova-core/gsp/cmdq.rs | 31 +++++++++++++++++--------------
 1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/nova-core/gsp/cmdq.rs b/drivers/gpu/nova-core/gsp/cmdq.rs
index 09c28eeb6f12..b6d6093e3ac0 100644
--- a/drivers/gpu/nova-core/gsp/cmdq.rs
+++ b/drivers/gpu/nova-core/gsp/cmdq.rs
@@ -227,21 +227,24 @@ fn new(dev: &device::Device<device::Bound>) -> Result<Self> {
         // PANIC: per the invariant of `cpu_write_ptr`, `tx` is `< MSGQ_NUM_PAGES`.
         let (before_tx, after_tx) = gsp_mem.cpuq.msgq.data.split_at_mut(tx);
 
-        if rx <= tx {
-            // The area from `tx` up to the end of the ring, and from the beginning of the ring up
-            // to `rx`, minus one unit, belongs to the driver.
-            if rx == 0 {
-                let last = after_tx.len() - 1;
-                (&mut after_tx[..last], &mut before_tx[0..0])
-            } else {
-                (after_tx, &mut before_tx[..rx])
-            }
+        // The area starting at `tx` and ending at `rx - 2` modulo MSGQ_NUM_PAGES, inclusive,
+        // belongs to the driver for writing.
+        if rx == 0 {
+            // Since `rx` is zero, leave an empty slot at end of the buffer.
+            let last = after_tx.len() - 1;
+            (&mut after_tx[..last], &mut before_tx[0..0])
+        } else if rx > tx {
+            // The area is contiguous and we leave an empty slot before `rx`.
+            // PANIC: since `rx > tx` we have `rx - tx - 1 >= 0`
+            // PANIC: since `tx < rx < MSGQ_NUM_PAGES && after_tx.len() == MSGQ_NUM_PAGES - tx`:
+            //   `rx - 1 <= MSGQ_NUM_PAGES` -> `rx - tx - 1 <= MSGQ_NUM_PAGES - tx`
+            //   -> `rx - tx - 1 <= after_tx.len()`
+            (&mut after_tx[..(rx - tx - 1)], &mut before_tx[0..0])
         } else {
-            // The area from `tx` to `rx`, minus one unit, belongs to the driver.
-            //
-            // PANIC: per the invariants of `cpu_write_ptr` and `gsp_read_ptr`, `rx` and `tx` are
-            // `<= MSGQ_NUM_PAGES`, and the test above ensured that `rx > tx`.
-            (after_tx.split_at_mut(rx - tx).0, &mut before_tx[0..0])
+            // The area is discontiguous and we leave an empty slot before `rx`.
+            // PANIC: since `rx != 0 && rx is unsigned` we have `rx - 1 >= 0`
+            // PANIC: since `rx <= tx && before_tx.len() == tx` we have `rx - 1 <= before_tx.len()`
+            (after_tx, &mut before_tx[..(rx - 1)])
         }
     }
 

-- 
2.52.0

Re: [PATCH 3/4] gpu: nova-core: gsp: fix improper handling of empty slot in cmdq

Posted by Gary Guo 2 weeks, 2 days ago

On Thu Jan 22, 2026 at 2:59 AM GMT, Eliot Courtney wrote:
> The current code hands out buffers that go all the way up to and
> including `rx - 1`, but we need to maintain an empty slot to prevent the
> ring buffer from wrapping around into having 'tx == rx', which means
> empty.

Doesn't this mean that you're effectively wasting a slot? You can easily
implementing a ring buffer that allows you to disambiguate empty/full while
still using all slots.

A common approach is to only do modulo/masking operation before accessing the
slot. Then `write_ptr.wrapping_sub(read_ptr)` will give you the accurate length of
things inside the ring buffer.

Best,
Gary

>
> Also add more rigorous no-panic proofs.
>
> Fixes: 75f6b1de8133 ("gpu: nova-core: gsp: Add GSP command queue bindings and handling")
> Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
> ---
>  drivers/gpu/nova-core/gsp/cmdq.rs | 31 +++++++++++++++++--------------
>  1 file changed, 17 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/gpu/nova-core/gsp/cmdq.rs b/drivers/gpu/nova-core/gsp/cmdq.rs
> index 09c28eeb6f12..b6d6093e3ac0 100644
> --- a/drivers/gpu/nova-core/gsp/cmdq.rs
> +++ b/drivers/gpu/nova-core/gsp/cmdq.rs
> @@ -227,21 +227,24 @@ fn new(dev: &device::Device<device::Bound>) -> Result<Self> {
>          // PANIC: per the invariant of `cpu_write_ptr`, `tx` is `< MSGQ_NUM_PAGES`.
>          let (before_tx, after_tx) = gsp_mem.cpuq.msgq.data.split_at_mut(tx);
>  
> -        if rx <= tx {
> -            // The area from `tx` up to the end of the ring, and from the beginning of the ring up
> -            // to `rx`, minus one unit, belongs to the driver.
> -            if rx == 0 {
> -                let last = after_tx.len() - 1;
> -                (&mut after_tx[..last], &mut before_tx[0..0])
> -            } else {
> -                (after_tx, &mut before_tx[..rx])
> -            }
> +        // The area starting at `tx` and ending at `rx - 2` modulo MSGQ_NUM_PAGES, inclusive,
> +        // belongs to the driver for writing.
> +        if rx == 0 {
> +            // Since `rx` is zero, leave an empty slot at end of the buffer.
> +            let last = after_tx.len() - 1;
> +            (&mut after_tx[..last], &mut before_tx[0..0])
> +        } else if rx > tx {
> +            // The area is contiguous and we leave an empty slot before `rx`.
> +            // PANIC: since `rx > tx` we have `rx - tx - 1 >= 0`
> +            // PANIC: since `tx < rx < MSGQ_NUM_PAGES && after_tx.len() == MSGQ_NUM_PAGES - tx`:
> +            //   `rx - 1 <= MSGQ_NUM_PAGES` -> `rx - tx - 1 <= MSGQ_NUM_PAGES - tx`
> +            //   -> `rx - tx - 1 <= after_tx.len()`
> +            (&mut after_tx[..(rx - tx - 1)], &mut before_tx[0..0])
>          } else {
> -            // The area from `tx` to `rx`, minus one unit, belongs to the driver.
> -            //
> -            // PANIC: per the invariants of `cpu_write_ptr` and `gsp_read_ptr`, `rx` and `tx` are
> -            // `<= MSGQ_NUM_PAGES`, and the test above ensured that `rx > tx`.
> -            (after_tx.split_at_mut(rx - tx).0, &mut before_tx[0..0])
> +            // The area is discontiguous and we leave an empty slot before `rx`.
> +            // PANIC: since `rx != 0 && rx is unsigned` we have `rx - 1 >= 0`
> +            // PANIC: since `rx <= tx && before_tx.len() == tx` we have `rx - 1 <= before_tx.len()`
> +            (after_tx, &mut before_tx[..(rx - 1)])
>          }
>      }
>

Re: [PATCH 3/4] gpu: nova-core: gsp: fix improper handling of empty slot in cmdq

Posted by Eliot Courtney 2 weeks ago

On Sat Jan 24, 2026 at 3:31 AM JST, Gary Guo wrote:
> On Thu Jan 22, 2026 at 2:59 AM GMT, Eliot Courtney wrote:
>> The current code hands out buffers that go all the way up to and
>> including `rx - 1`, but we need to maintain an empty slot to prevent the
>> ring buffer from wrapping around into having 'tx == rx', which means
>> empty.
>
> Doesn't this mean that you're effectively wasting a slot? You can easily
> implementing a ring buffer that allows you to disambiguate empty/full while
> still using all slots.
>
> A common approach is to only do modulo/masking operation before accessing the
> slot. Then `write_ptr.wrapping_sub(read_ptr)` will give you the accurate length of
> things inside the ring buffer.

Yes, we are wasting a slot (the current code also does this). But,
this is the protocol between the CPU and GSP, so the GSP actually
expects this and we can't change the semantics easily, IIUC.

Re: [PATCH 3/4] gpu: nova-core: gsp: fix improper handling of empty slot in cmdq

Posted by John Hubbard 2 weeks, 4 days ago

On 1/21/26 6:59 PM, Eliot Courtney wrote:
> +            // The area is contiguous and we leave an empty slot before `rx`.
> +            // PANIC: since `rx > tx` we have `rx - tx - 1 >= 0`
> +            // PANIC: since `tx < rx < MSGQ_NUM_PAGES && after_tx.len() == MSGQ_NUM_PAGES - tx`:
> +            //   `rx - 1 <= MSGQ_NUM_PAGES` -> `rx - tx - 1 <= MSGQ_NUM_PAGES - tx`
> +            //   -> `rx - tx - 1 <= after_tx.len()`

Hi Eliot,

Documentation nit: the proofs are great, but the above just does
not go into my head easily, because it's a proof, rather than a
sentence.

Can you please reword these PANIC comments so that they are complete
sentences, along the lines of:

    // PANIC: a > b and therefore c cannot overflow, therefore this
    // cannot panic.

And please also use words "and", "therefore", "because", instead of 
symbols such as "&&".

Same for the other patch with PANIC comments.

I did a quick search in Rust for Linux, before writing this, in order to
ensure that what I'm recommending is How It Is Done. (Not to claim that
Nova in particular is fully correct, though.)

thanks,
-- 
John Hubbard

Re: [PATCH 3/4] gpu: nova-core: gsp: fix improper handling of empty slot in cmdq

Posted by Eliot Courtney 2 weeks, 3 days ago

On Thu Jan 22, 2026 at 12:26 PM JST, John Hubbard wrote:
> On 1/21/26 6:59 PM, Eliot Courtney wrote:
>> +            // The area is contiguous and we leave an empty slot before `rx`.
>> +            // PANIC: since `rx > tx` we have `rx - tx - 1 >= 0`
>> +            // PANIC: since `tx < rx < MSGQ_NUM_PAGES && after_tx.len() == MSGQ_NUM_PAGES - tx`:
>> +            //   `rx - 1 <= MSGQ_NUM_PAGES` -> `rx - tx - 1 <= MSGQ_NUM_PAGES - tx`
>> +            //   -> `rx - tx - 1 <= after_tx.len()`
>
> Hi Eliot,
>
> Documentation nit: the proofs are great, but the above just does
> not go into my head easily, because it's a proof, rather than a
> sentence.
I had a look now and I agree that it looks like plain English is the
defacto standard for the PANIC comments, so I will update them.

But, I wonder what people think about this. IMO it makes sense to have
SAFETY and PANIC comments as rigorous proofs (where practical and
possible) to match the level of work the compiler does for us in the
infalliable areas of the code - if an issue occurs, unsafe or panicking
code is often the root cause IMO. Writing these in plain English is
easier to read but also harder to verify that the proof is correct and
harder to verify if there are any implicit assumptions.

I see there are some guidelines about SAFETY: comments but not about
PANIC: comments in Documentation/rust/coding-guidelines.rst.

Re: [PATCH 3/4] gpu: nova-core: gsp: fix improper handling of empty slot in cmdq

Posted by Alexandre Courbot 2 weeks, 3 days ago

On Thu Jan 22, 2026 at 2:07 PM JST, Eliot Courtney wrote:
> On Thu Jan 22, 2026 at 12:26 PM JST, John Hubbard wrote:
>> On 1/21/26 6:59 PM, Eliot Courtney wrote:
>>> +            // The area is contiguous and we leave an empty slot before `rx`.
>>> +            // PANIC: since `rx > tx` we have `rx - tx - 1 >= 0`
>>> +            // PANIC: since `tx < rx < MSGQ_NUM_PAGES && after_tx.len() == MSGQ_NUM_PAGES - tx`:
>>> +            //   `rx - 1 <= MSGQ_NUM_PAGES` -> `rx - tx - 1 <= MSGQ_NUM_PAGES - tx`
>>> +            //   -> `rx - tx - 1 <= after_tx.len()`
>>
>> Hi Eliot,
>>
>> Documentation nit: the proofs are great, but the above just does
>> not go into my head easily, because it's a proof, rather than a
>> sentence.
> I had a look now and I agree that it looks like plain English is the
> defacto standard for the PANIC comments, so I will update them.
>
> But, I wonder what people think about this. IMO it makes sense to have
> SAFETY and PANIC comments as rigorous proofs (where practical and
> possible) to match the level of work the compiler does for us in the
> infalliable areas of the code - if an issue occurs, unsafe or panicking
> code is often the root cause IMO. Writing these in plain English is
> easier to read but also harder to verify that the proof is correct and
> harder to verify if there are any implicit assumptions.
>
> I see there are some guidelines about SAFETY: comments but not about
> PANIC: comments in Documentation/rust/coding-guidelines.rst.

Would be interesting to have Miguel/Alice and the core team's opinion
on this IMHO.

Re: [PATCH 3/4] gpu: nova-core: gsp: fix improper handling of empty slot in cmdq

Posted by Miguel Ojeda 2 weeks, 2 days ago

On Thu, Jan 22, 2026 at 10:29 AM Alexandre Courbot <acourbot@nvidia.com> wrote:
>
> Would be interesting to have Miguel/Alice and the core team's opinion
> on this IMHO.

[ Re-replying to this so that it gets saved in the list. ]

This is something that we have discussed back and forth, including
making it way more formal, even using special notation, etc.

Please take a look at Benno’s proposal at:

https://lore.kernel.org/all/20240717221133.459589-1-benno.lossin@proton.me/

(He may have a newer link)

Cheers,
Miguel