[PATCH 4/7] gpu: nova-core: falcon: use dma::Coherent

Alexandre Courbot posted 7 patches 1 week, 6 days ago
There is a newer version of this series
[PATCH 4/7] gpu: nova-core: falcon: use dma::Coherent
Posted by Alexandre Courbot 1 week, 6 days ago
Replace the nova-core local `DmaObject` with a `Coherent` that can
fulfill the same role.

Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
---
 drivers/gpu/nova-core/falcon.rs | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/nova-core/falcon.rs b/drivers/gpu/nova-core/falcon.rs
index 5bf8da8760bf..f6239c44dd80 100644
--- a/drivers/gpu/nova-core/falcon.rs
+++ b/drivers/gpu/nova-core/falcon.rs
@@ -10,6 +10,7 @@
         Device, //
     },
     dma::{
+        Coherent,
         DmaAddress,
         DmaMask, //
     },
@@ -20,7 +21,6 @@
 };
 
 use crate::{
-    dma::DmaObject,
     driver::Bar0,
     falcon::hal::LoadMethod,
     gpu::Chipset,
@@ -636,7 +636,7 @@ pub(crate) fn pio_load<F: FalconFirmware<Target = E> + FalconPioLoadable>(
     fn dma_wr(
         &self,
         bar: &Bar0,
-        dma_obj: &DmaObject,
+        dma_obj: &Coherent<[u8]>,
         target_mem: FalconMem,
         load_offsets: FalconDmaLoadTarget,
     ) -> Result {
@@ -740,7 +740,7 @@ fn dma_load<F: FalconFirmware<Target = E> + FalconDmaLoadable>(
         fw: &F,
     ) -> Result {
         // Create DMA object with firmware content as the source of the DMA engine.
-        let dma_obj = DmaObject::from_data(dev, fw.as_slice())?;
+        let dma_obj = Coherent::from_slice(dev, fw.as_slice(), GFP_KERNEL)?;
 
         self.dma_reset(bar);
         regs::NV_PFALCON_FBIF_TRANSCFG::update(bar, &E::ID, 0, |v| {

-- 
2.53.0
Re: [PATCH 4/7] gpu: nova-core: falcon: use dma::Coherent
Posted by Eliot Courtney 1 week, 2 days ago
On Sat Mar 21, 2026 at 10:36 PM JST, Alexandre Courbot wrote:
> Replace the nova-core local `DmaObject` with a `Coherent` that can
> fulfill the same role.
>
> Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
> ---
>  drivers/gpu/nova-core/falcon.rs | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/nova-core/falcon.rs b/drivers/gpu/nova-core/falcon.rs
> index 5bf8da8760bf..f6239c44dd80 100644
> --- a/drivers/gpu/nova-core/falcon.rs
> +++ b/drivers/gpu/nova-core/falcon.rs
> @@ -10,6 +10,7 @@
>          Device, //
>      },
>      dma::{
> +        Coherent,
>          DmaAddress,
>          DmaMask, //
>      },
> @@ -20,7 +21,6 @@
>  };
>  
>  use crate::{
> -    dma::DmaObject,
>      driver::Bar0,
>      falcon::hal::LoadMethod,
>      gpu::Chipset,
> @@ -636,7 +636,7 @@ pub(crate) fn pio_load<F: FalconFirmware<Target = E> + FalconPioLoadable>(
>      fn dma_wr(
>          &self,
>          bar: &Bar0,
> -        dma_obj: &DmaObject,
> +        dma_obj: &Coherent<[u8]>,
>          target_mem: FalconMem,
>          load_offsets: FalconDmaLoadTarget,
>      ) -> Result {
> @@ -740,7 +740,7 @@ fn dma_load<F: FalconFirmware<Target = E> + FalconDmaLoadable>(
>          fw: &F,
>      ) -> Result {
>          // Create DMA object with firmware content as the source of the DMA engine.
> -        let dma_obj = DmaObject::from_data(dev, fw.as_slice())?;
> +        let dma_obj = Coherent::from_slice(dev, fw.as_slice(), GFP_KERNEL)?;

Is it guaranteed that fw.as_slice() is a multiple of 256 in size?
In `dma_wr` it breaks this up into 256 byte transfers. Since this
no longer pads out to a page boundary, it means that it could now error
(around "DMA transfer goes beyond range of DMA object") if the Dmem 
section's size is not divisible by 256. But tbh, I find it odd that 
`dma_wr` doesn't check that FalconDmaLoadTarget's length is a
multiple of 256 anyway, because it looks like it'll write a bunch of
unrelated bytes (since it rounds up to the nearest 256 to copy).

Maybe we should enforce that `FalconDmaLoadTarget` length is divisible
by 256?

For this series if for all firmwares it's divisible by 256 then I think
it's fine to leave this as is for now, but I do find the lack of
checking in `dma_wr` (or anywhere else for FalconDmaLoadTarget) a bit
odd.

>  
>          self.dma_reset(bar);
>          regs::NV_PFALCON_FBIF_TRANSCFG::update(bar, &E::ID, 0, |v| {
Re: [PATCH 4/7] gpu: nova-core: falcon: use dma::Coherent
Posted by Alexandre Courbot 1 week, 1 day ago
On Wed Mar 25, 2026 at 11:14 AM JST, Eliot Courtney wrote:
> On Sat Mar 21, 2026 at 10:36 PM JST, Alexandre Courbot wrote:
>> Replace the nova-core local `DmaObject` with a `Coherent` that can
>> fulfill the same role.
>>
>> Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
>> ---
>>  drivers/gpu/nova-core/falcon.rs | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/nova-core/falcon.rs b/drivers/gpu/nova-core/falcon.rs
>> index 5bf8da8760bf..f6239c44dd80 100644
>> --- a/drivers/gpu/nova-core/falcon.rs
>> +++ b/drivers/gpu/nova-core/falcon.rs
>> @@ -10,6 +10,7 @@
>>          Device, //
>>      },
>>      dma::{
>> +        Coherent,
>>          DmaAddress,
>>          DmaMask, //
>>      },
>> @@ -20,7 +21,6 @@
>>  };
>>  
>>  use crate::{
>> -    dma::DmaObject,
>>      driver::Bar0,
>>      falcon::hal::LoadMethod,
>>      gpu::Chipset,
>> @@ -636,7 +636,7 @@ pub(crate) fn pio_load<F: FalconFirmware<Target = E> + FalconPioLoadable>(
>>      fn dma_wr(
>>          &self,
>>          bar: &Bar0,
>> -        dma_obj: &DmaObject,
>> +        dma_obj: &Coherent<[u8]>,
>>          target_mem: FalconMem,
>>          load_offsets: FalconDmaLoadTarget,
>>      ) -> Result {
>> @@ -740,7 +740,7 @@ fn dma_load<F: FalconFirmware<Target = E> + FalconDmaLoadable>(
>>          fw: &F,
>>      ) -> Result {
>>          // Create DMA object with firmware content as the source of the DMA engine.
>> -        let dma_obj = DmaObject::from_data(dev, fw.as_slice())?;
>> +        let dma_obj = Coherent::from_slice(dev, fw.as_slice(), GFP_KERNEL)?;
>
> Is it guaranteed that fw.as_slice() is a multiple of 256 in size?
> In `dma_wr` it breaks this up into 256 byte transfers. Since this
> no longer pads out to a page boundary, it means that it could now error
> (around "DMA transfer goes beyond range of DMA object") if the Dmem 
> section's size is not divisible by 256. But tbh, I find it odd that 
> `dma_wr` doesn't check that FalconDmaLoadTarget's length is a
> multiple of 256 anyway, because it looks like it'll write a bunch of
> unrelated bytes (since it rounds up to the nearest 256 to copy).
>
> Maybe we should enforce that `FalconDmaLoadTarget` length is divisible
> by 256?
>
> For this series if for all firmwares it's divisible by 256 then I think
> it's fine to leave this as is for now, but I do find the lack of
> checking in `dma_wr` (or anywhere else for FalconDmaLoadTarget) a bit
> odd.

All coherent allocations are page-aligned (and use full pages), so we
are safe in terms of overflows.

Also `dma_wr` uses `div_ceil(256)` which will skip the last data block
entirely if it is not a multiple of 256. It might be a bit more robust
to explicitly check that the size is a multiple of 256 and return an
error if that is not the case indeed.
Re: [PATCH 4/7] gpu: nova-core: falcon: use dma::Coherent
Posted by Gary Guo 1 week, 1 day ago
On Thu Mar 26, 2026 at 3:04 PM GMT, Alexandre Courbot wrote:
> On Wed Mar 25, 2026 at 11:14 AM JST, Eliot Courtney wrote:
>> On Sat Mar 21, 2026 at 10:36 PM JST, Alexandre Courbot wrote:
>>> Replace the nova-core local `DmaObject` with a `Coherent` that can
>>> fulfill the same role.
>>>
>>> Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
>>> ---
>>>  drivers/gpu/nova-core/falcon.rs | 6 +++---
>>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/gpu/nova-core/falcon.rs b/drivers/gpu/nova-core/falcon.rs
>>> index 5bf8da8760bf..f6239c44dd80 100644
>>> --- a/drivers/gpu/nova-core/falcon.rs
>>> +++ b/drivers/gpu/nova-core/falcon.rs
>>> @@ -10,6 +10,7 @@
>>>          Device, //
>>>      },
>>>      dma::{
>>> +        Coherent,
>>>          DmaAddress,
>>>          DmaMask, //
>>>      },
>>> @@ -20,7 +21,6 @@
>>>  };
>>>  
>>>  use crate::{
>>> -    dma::DmaObject,
>>>      driver::Bar0,
>>>      falcon::hal::LoadMethod,
>>>      gpu::Chipset,
>>> @@ -636,7 +636,7 @@ pub(crate) fn pio_load<F: FalconFirmware<Target = E> + FalconPioLoadable>(
>>>      fn dma_wr(
>>>          &self,
>>>          bar: &Bar0,
>>> -        dma_obj: &DmaObject,
>>> +        dma_obj: &Coherent<[u8]>,
>>>          target_mem: FalconMem,
>>>          load_offsets: FalconDmaLoadTarget,
>>>      ) -> Result {
>>> @@ -740,7 +740,7 @@ fn dma_load<F: FalconFirmware<Target = E> + FalconDmaLoadable>(
>>>          fw: &F,
>>>      ) -> Result {
>>>          // Create DMA object with firmware content as the source of the DMA engine.
>>> -        let dma_obj = DmaObject::from_data(dev, fw.as_slice())?;
>>> +        let dma_obj = Coherent::from_slice(dev, fw.as_slice(), GFP_KERNEL)?;
>>
>> Is it guaranteed that fw.as_slice() is a multiple of 256 in size?
>> In `dma_wr` it breaks this up into 256 byte transfers. Since this
>> no longer pads out to a page boundary, it means that it could now error
>> (around "DMA transfer goes beyond range of DMA object") if the Dmem 
>> section's size is not divisible by 256. But tbh, I find it odd that 
>> `dma_wr` doesn't check that FalconDmaLoadTarget's length is a
>> multiple of 256 anyway, because it looks like it'll write a bunch of
>> unrelated bytes (since it rounds up to the nearest 256 to copy).
>>
>> Maybe we should enforce that `FalconDmaLoadTarget` length is divisible
>> by 256?
>>
>> For this series if for all firmwares it's divisible by 256 then I think
>> it's fine to leave this as is for now, but I do find the lack of
>> checking in `dma_wr` (or anywhere else for FalconDmaLoadTarget) a bit
>> odd.
>
> All coherent allocations are page-aligned (and use full pages), so we
> are safe in terms of overflows.

Let's not rely on this behaviour. There is no guarantee on what's at the end
of allocation whatsoever. There's no guarantee that it will be initialized.
Even with __GFP_ZERO only the size provided will be zeroed.

If the GPU is going to read beyond ranges covered by `Coherent` (not just rely
on the alignment), let's align up the allocation.

>
> Also `dma_wr` uses `div_ceil(256)` which will skip the last data block
> entirely if it is not a multiple of 256. It might be a bit more robust
> to explicitly check that the size is a multiple of 256 and return an
> error if that is not the case indeed.

div_ceil will not skip the last block, it will over-read beyond the end.
div_floor would have skipped the block.

Best,
Gary
Re: [PATCH 4/7] gpu: nova-core: falcon: use dma::Coherent
Posted by Alexandre Courbot 6 days, 11 hours ago
On Fri Mar 27, 2026 at 12:35 AM JST, Gary Guo wrote:
> On Thu Mar 26, 2026 at 3:04 PM GMT, Alexandre Courbot wrote:
>> On Wed Mar 25, 2026 at 11:14 AM JST, Eliot Courtney wrote:
>>> On Sat Mar 21, 2026 at 10:36 PM JST, Alexandre Courbot wrote:
>>>> Replace the nova-core local `DmaObject` with a `Coherent` that can
>>>> fulfill the same role.
>>>>
>>>> Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
>>>> ---
>>>>  drivers/gpu/nova-core/falcon.rs | 6 +++---
>>>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/nova-core/falcon.rs b/drivers/gpu/nova-core/falcon.rs
>>>> index 5bf8da8760bf..f6239c44dd80 100644
>>>> --- a/drivers/gpu/nova-core/falcon.rs
>>>> +++ b/drivers/gpu/nova-core/falcon.rs
>>>> @@ -10,6 +10,7 @@
>>>>          Device, //
>>>>      },
>>>>      dma::{
>>>> +        Coherent,
>>>>          DmaAddress,
>>>>          DmaMask, //
>>>>      },
>>>> @@ -20,7 +21,6 @@
>>>>  };
>>>>  
>>>>  use crate::{
>>>> -    dma::DmaObject,
>>>>      driver::Bar0,
>>>>      falcon::hal::LoadMethod,
>>>>      gpu::Chipset,
>>>> @@ -636,7 +636,7 @@ pub(crate) fn pio_load<F: FalconFirmware<Target = E> + FalconPioLoadable>(
>>>>      fn dma_wr(
>>>>          &self,
>>>>          bar: &Bar0,
>>>> -        dma_obj: &DmaObject,
>>>> +        dma_obj: &Coherent<[u8]>,
>>>>          target_mem: FalconMem,
>>>>          load_offsets: FalconDmaLoadTarget,
>>>>      ) -> Result {
>>>> @@ -740,7 +740,7 @@ fn dma_load<F: FalconFirmware<Target = E> + FalconDmaLoadable>(
>>>>          fw: &F,
>>>>      ) -> Result {
>>>>          // Create DMA object with firmware content as the source of the DMA engine.
>>>> -        let dma_obj = DmaObject::from_data(dev, fw.as_slice())?;
>>>> +        let dma_obj = Coherent::from_slice(dev, fw.as_slice(), GFP_KERNEL)?;
>>>
>>> Is it guaranteed that fw.as_slice() is a multiple of 256 in size?
>>> In `dma_wr` it breaks this up into 256 byte transfers. Since this
>>> no longer pads out to a page boundary, it means that it could now error
>>> (around "DMA transfer goes beyond range of DMA object") if the Dmem 
>>> section's size is not divisible by 256. But tbh, I find it odd that 
>>> `dma_wr` doesn't check that FalconDmaLoadTarget's length is a
>>> multiple of 256 anyway, because it looks like it'll write a bunch of
>>> unrelated bytes (since it rounds up to the nearest 256 to copy).
>>>
>>> Maybe we should enforce that `FalconDmaLoadTarget` length is divisible
>>> by 256?
>>>
>>> For this series if for all firmwares it's divisible by 256 then I think
>>> it's fine to leave this as is for now, but I do find the lack of
>>> checking in `dma_wr` (or anywhere else for FalconDmaLoadTarget) a bit
>>> odd.
>>
>> All coherent allocations are page-aligned (and use full pages), so we
>> are safe in terms of overflows.
>
> Let's not rely on this behaviour. There is no guarantee on what's at the end
> of allocation whatsoever. There's no guarantee that it will be initialized.
> Even with __GFP_ZERO only the size provided will be zeroed.
>
> If the GPU is going to read beyond ranges covered by `Coherent` (not just rely
> on the alignment), let's align up the allocation.
>
>>
>> Also `dma_wr` uses `div_ceil(256)` which will skip the last data block
>> entirely if it is not a multiple of 256. It might be a bit more robust
>> to explicitly check that the size is a multiple of 256 and return an
>> error if that is not the case indeed.
>
> div_ceil will not skip the last block, it will over-read beyond the end.
> div_floor would have skipped the block.

Ooopsie yes, of course. Making `dma_wr` check that the data is a
multiple of 256 is the simplest, I'll send a patch for that (with maybe
some padding code as I think I remember Turing at least did not always
follow the 256-alignment requirement).