Several firmware files loaded from userspace feature a common header
that describes their payload. Add basic support for it so subsequent
patches can leverage it.
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
---
drivers/gpu/nova-core/firmware.rs | 62 +++++++++++++++++++++++++++++++++++++++
1 file changed, 62 insertions(+)
diff --git a/drivers/gpu/nova-core/firmware.rs b/drivers/gpu/nova-core/firmware.rs
index 2931912ddba0ea1fe6d027ccec70b39cdb40344a..ccb4d19f8fa76b0e844252dede5f50b37c590571 100644
--- a/drivers/gpu/nova-core/firmware.rs
+++ b/drivers/gpu/nova-core/firmware.rs
@@ -4,11 +4,13 @@
//! to be loaded into a given execution unit.
use core::marker::PhantomData;
+use core::mem::size_of;
use kernel::device;
use kernel::firmware;
use kernel::prelude::*;
use kernel::str::CString;
+use kernel::transmute::FromBytes;
use crate::dma::DmaObject;
use crate::falcon::FalconFirmware;
@@ -150,6 +152,66 @@ fn no_patch_signature(self) -> FirmwareDmaObject<F, Signed> {
}
}
+/// Header common to most firmware files.
+#[repr(C)]
+#[derive(Debug, Clone)]
+struct BinHdr {
+ /// Magic number, must be `0x10de`.
+ bin_magic: u32,
+ /// Version of the header.
+ bin_ver: u32,
+ /// Size in bytes of the binary (to be ignored).
+ bin_size: u32,
+ /// Offset of the start of the application-specific header.
+ header_offset: u32,
+ /// Offset of the start of the data payload.
+ data_offset: u32,
+ /// Size in bytes of the data payload.
+ data_size: u32,
+}
+
+// SAFETY: all bit patterns are valid for this type, and it doesn't use interior mutability.
+unsafe impl FromBytes for BinHdr {}
+
+// A firmware blob starting with a `BinHdr`.
+struct BinFirmware<'a> {
+ hdr: BinHdr,
+ fw: &'a [u8],
+}
+
+#[expect(dead_code)]
+impl<'a> BinFirmware<'a> {
+ /// Interpret `fw` as a firmware image starting with a [`BinHdr`], and returns the
+ /// corresponding [`BinFirmware`] that can be used to extract its payload.
+ fn new(fw: &'a firmware::Firmware) -> Result<Self> {
+ const BIN_MAGIC: u32 = 0x10de;
+ let fw = fw.data();
+
+ fw.get(0..size_of::<BinHdr>())
+ // Extract header.
+ .and_then(BinHdr::from_bytes_copy)
+ // Validate header.
+ .and_then(|hdr| {
+ if hdr.bin_magic == BIN_MAGIC {
+ Some(hdr)
+ } else {
+ None
+ }
+ })
+ .map(|hdr| Self { hdr, fw })
+ .ok_or(EINVAL)
+ }
+
+ /// Returns the data payload of the firmware, or `None` if the data range is out of bounds of
+ /// the firmware image.
+ fn data(&self) -> Option<&[u8]> {
+ let fw_start = self.hdr.data_offset as usize;
+ let fw_size = self.hdr.data_size as usize;
+
+ self.fw.get(fw_start..fw_start + fw_size)
+ }
+}
+
pub(crate) struct ModInfoBuilder<const N: usize>(firmware::ModInfoBuilder<N>);
impl<const N: usize> ModInfoBuilder<N> {
--
2.50.1
On 8/25/25 9:07 PM, Alexandre Courbot wrote: > Several firmware files loaded from userspace feature a common header > that describes their payload. Add basic support for it so subsequent > patches can leverage it. > > Signed-off-by: Alexandre Courbot <acourbot@nvidia.com> > --- > drivers/gpu/nova-core/firmware.rs | 62 +++++++++++++++++++++++++++++++++++++++ > 1 file changed, 62 insertions(+) > > diff --git a/drivers/gpu/nova-core/firmware.rs b/drivers/gpu/nova-core/firmware.rs > index 2931912ddba0ea1fe6d027ccec70b39cdb40344a..ccb4d19f8fa76b0e844252dede5f50b37c590571 100644 > --- a/drivers/gpu/nova-core/firmware.rs > +++ b/drivers/gpu/nova-core/firmware.rs > @@ -4,11 +4,13 @@ > //! to be loaded into a given execution unit. > > use core::marker::PhantomData; > +use core::mem::size_of; > > use kernel::device; > use kernel::firmware; > use kernel::prelude::*; > use kernel::str::CString; > +use kernel::transmute::FromBytes; > > use crate::dma::DmaObject; > use crate::falcon::FalconFirmware; > @@ -150,6 +152,66 @@ fn no_patch_signature(self) -> FirmwareDmaObject<F, Signed> { > } > } > > +/// Header common to most firmware files. > +#[repr(C)] > +#[derive(Debug, Clone)] > +struct BinHdr { > + /// Magic number, must be `0x10de`. How about: /// Magic number, required to be equal to BIN_MAGIC ...and see below. > + bin_magic: u32, > + /// Version of the header. > + bin_ver: u32, > + /// Size in bytes of the binary (to be ignored). > + bin_size: u32, > + /// Offset of the start of the application-specific header. > + header_offset: u32, > + /// Offset of the start of the data payload. > + data_offset: u32, > + /// Size in bytes of the data payload. > + data_size: u32, > +} > + > +// SAFETY: all bit patterns are valid for this type, and it doesn't use interior mutability. > +unsafe impl FromBytes for BinHdr {} > + > +// A firmware blob starting with a `BinHdr`. > +struct BinFirmware<'a> { > + hdr: BinHdr, > + fw: &'a [u8], > +} > + > +#[expect(dead_code)] > +impl<'a> BinFirmware<'a> { > + /// Interpret `fw` as a firmware image starting with a [`BinHdr`], and returns the > + /// corresponding [`BinFirmware`] that can be used to extract its payload. > + fn new(fw: &'a firmware::Firmware) -> Result<Self> { > + const BIN_MAGIC: u32 = 0x10de; Let's promote this to approximately file scope and put it just above the BinHdr definition. Then we end up with one definition at the ideal scope, and the comment docs take better care of themselves. > + let fw = fw.data(); > + > + fw.get(0..size_of::<BinHdr>()) > + // Extract header. > + .and_then(BinHdr::from_bytes_copy) > + // Validate header. > + .and_then(|hdr| { > + if hdr.bin_magic == BIN_MAGIC { > + Some(hdr) > + } else { > + None > + } > + }) > + .map(|hdr| Self { hdr, fw }) > + .ok_or(EINVAL) > + } > + > + /// Returns the data payload of the firmware, or `None` if the data range is out of bounds of > + /// the firmware image. > + fn data(&self) -> Option<&[u8]> { > + let fw_start = self.hdr.data_offset as usize; > + let fw_size = self.hdr.data_size as usize; > + > + self.fw.get(fw_start..fw_start + fw_size) This worries me a bit, because we never checked that these bounds are reasonable: within the range of the firmware, and not overflowing (.checked_add() for example), that sort of thing. Thoughts? > + } > +} > + > pub(crate) struct ModInfoBuilder<const N: usize>(firmware::ModInfoBuilder<N>); > > impl<const N: usize> ModInfoBuilder<N> { > thanks, -- John Hubbard
On Wed Aug 27, 2025 at 10:34 AM JST, John Hubbard wrote: <snip> >> + /// Returns the data payload of the firmware, or `None` if the data range is out of bounds of >> + /// the firmware image. >> + fn data(&self) -> Option<&[u8]> { >> + let fw_start = self.hdr.data_offset as usize; >> + let fw_size = self.hdr.data_size as usize; >> + >> + self.fw.get(fw_start..fw_start + fw_size) > > This worries me a bit, because we never checked that these bounds > are reasonable: within the range of the firmware, and not overflowing > (.checked_add() for example), that sort of thing. > > Thoughts? `get` returns `None` if the requested slice is out of bounds, so there should be no risk of panicking here. However, `fw_start + fw_size` can panic in debug configuration if it overflows. In a release build I believe it will just happily wrap, and `get` should consequently return `None` at the invalid range... Although we can also get unlucky and produce a valid, yet incorrect, one. This is actually something I've been thinking about while writing this series and could not really decide upon: how to deal with operands and functions in Rust that can potentially panic. Using `checked` operands everywhere is a bit tedious, and even with great care there is no way to guarantee that no panic occurs in a given function. Panics are a big no-no in the kernel, yet I don't feel like we have the proper tools to ensure they do not happen. User-space has some crates like `no_panic`, but even these feel more like hacks than anything else. Something at the compiler level would be nice. Maybe that would be a good discussion topic for the Plumber Microconference?
On Wed, Aug 27, 2025 at 10:47 AM Alexandre Courbot <acourbot@nvidia.com> wrote: > > However, `fw_start + fw_size` can panic in debug configuration if it > overflows. In a release build I believe it will just happily wrap, and In the kernel, it is a panic in the default configuration, not just a debug one. We have debug assertions too -- and those are disabled by default, but they are separate from the overflow checking, which is the one enabled by default. So, any use of those operators is limited to cases where one knows, somehow, that it will not overflow. And e.g. user-controlled inputs cannot use them at all. So, conceptually, something like this: - Static assert if the compiler knows it cannot fail. - Build assert if the optimizer knows it cannot fail. - Unfallible (like the possibly panicking operators) if the developer knows it cannot fail. - Fallible/wrapping/saturating/... if the developer isn't sure or it simply cannot be known until runtime. User-derived inputs must use this option (or rarely the unsafe one). - Unsafe if the developer knows it cannot fail and the other options are not acceptable for some reason. Ideally paired with a debug assertion (the compiler adds these already for many unsafe preconditions). In the past I requested upstream Rust a way to have a "third mode" ("report and continue") for the operators so that it would wrap (like the non-panicking mode) but allowing us to add a customization point so that we can e.g. `WARN_ON_ONCE`. That would be ideal, because it is a nice middle ground that is not as punishing if a developer gets it wrong, especially for "normal users", where panics typically mean lost reports etc. And other kernel users, like cloud operators, can keep the panicking mode. As for discussing no-panic, sure! Cheers, Miguel
Hi Miguel, sorry for the delay in replying! On Thu Aug 28, 2025 at 8:26 PM JST, Miguel Ojeda wrote: > On Wed, Aug 27, 2025 at 10:47 AM Alexandre Courbot <acourbot@nvidia.com> wrote: >> >> However, `fw_start + fw_size` can panic in debug configuration if it >> overflows. In a release build I believe it will just happily wrap, and > > In the kernel, it is a panic in the default configuration, not just a debug one. > > We have debug assertions too -- and those are disabled by default, but > they are separate from the overflow checking, which is the one enabled > by default. > > So, any use of those operators is limited to cases where one knows, > somehow, that it will not overflow. And e.g. user-controlled inputs > cannot use them at all. > > So, conceptually, something like this: > > - Static assert if the compiler knows it cannot fail. > - Build assert if the optimizer knows it cannot fail. > - Unfallible (like the possibly panicking operators) if the > developer knows it cannot fail. > - Fallible/wrapping/saturating/... if the developer isn't sure or it > simply cannot be known until runtime. User-derived inputs must use > this option (or rarely the unsafe one). > - Unsafe if the developer knows it cannot fail and the other options > are not acceptable for some reason. Ideally paired with a debug > assertion (the compiler adds these already for many unsafe > preconditions). > > In the past I requested upstream Rust a way to have a "third mode" > ("report and continue") for the operators so that it would wrap (like > the non-panicking mode) but allowing us to add a customization point > so that we can e.g. `WARN_ON_ONCE`. That would be nice, but also wouldn't cover all the cases where implicit panics can happen, like out-of-bounds slice accesses - we can't have a "report-and-continue" mode for these. And that's really the elephant in the room IMHO: such panic sites can be introduced implicitly, without the programmer realizing it, potentially resulting in more runtime panics for Rust modules than one might expect from a language whose main selling point is safety. I understand that the previous sentence is a bit fallacious, since such panics indicate bugs in the code that would likely go unnoticed in C (which is arguably worse). But perception matters, and such crashes can be damaging to the reputation of the project. In user-space, crates like `no_panic` can provide a compile-time guarantee that a given function cannot panic. I don't know how that would translate to the kernel, but ideally we could have some support from tooling (compiler and/or LSP?) to warn us of sites introduced in the code. After all, since the compiler inserts these panic sites, it should also be able to tell us where they are, allowing us to evaluate (and hopefully remove) them before the code ships to users. Most of them could then be eliminated by constraining inputs or using checked variants. I am not suggesting we should mandate that ALL Rust kernel code be proven panic-free at compile time, however since I started writing kernel code in Rust, I've often wished I had a simple way to check whether my carefully-crafted function processing user-space data really *is* panic-free. > As for discussing no-panic, sure! Writing a uC topic proposal for Plumbers right now. :)
On Wed, Sep 10, 2025 at 7:45 AM Alexandre Courbot <acourbot@nvidia.com> wrote: > > That would be nice, but also wouldn't cover all the cases where implicit > panics can happen, like out-of-bounds slice accesses - we can't have a > "report-and-continue" mode for these. In principle, it could cover OOBs (even if it is a bad idea). > But perception matters, and such crashes can be damaging to the reputation of the project. Yes, we are well aware -- we have had it in our wish list for upstream Rust for a long time. We are tackling these things as we go -- e.g. we solved the `alloc` panics and the ball on the report-and-continue mode for overflows started moving. Part of Rust for Linux is about making Rust the best language for kernel development it can be, after all, and so far upstream Rust has been quite helpful on giving us the features we need -- we meet with them every two weeks, please join if you have time! (Side note: the "safety" that Rust "sells" isn't really about avoiding panics, although obviously it would be a nice feature to have.) > Writing a uC topic proposal for Plumbers right now. :) I see it there, thanks! I can briefly mention the topic in Kangrejos, since we will have Rust representation, including from the language team. I don't think the discussion should focus much on "Do we need this?" but rather more on "What exactly do we want? Would we be OK with a local solution? Do we need/want a global one? Would we be OK with LSP? Would we be OK with no panics after optimizations, e.g. a link time check? Or do we want full support in the language for guaranteed non-panicking functions? Do we need exceptional carve-outs on such checking for particular language constructs?" and so on. And, of course, "Who has time to write an RFC and implement an experiment upstream if an approach is decided". Getting data on "in practice, how much of an issue it is on the Rust side" would help too -- those with actual users running Rust kernel code probably can tell us something. What I would personally expect to happen is that, over time, we understand better what are the worst cases we must tackle. Cheers, Miguel
On Wed Sep 10, 2025 at 7:00 PM JST, Miguel Ojeda wrote: > On Wed, Sep 10, 2025 at 7:45 AM Alexandre Courbot <acourbot@nvidia.com> wrote: >> >> That would be nice, but also wouldn't cover all the cases where implicit >> panics can happen, like out-of-bounds slice accesses - we can't have a >> "report-and-continue" mode for these. > > In principle, it could cover OOBs (even if it is a bad idea). > >> But perception matters, and such crashes can be damaging to the reputation of the project. > > Yes, we are well aware -- we have had it in our wish list for upstream > Rust for a long time. > > We are tackling these things as we go -- e.g. we solved the `alloc` > panics and the ball on the report-and-continue mode for overflows > started moving. > > Part of Rust for Linux is about making Rust the best language for > kernel development it can be, after all, and so far upstream Rust has > been quite helpful on giving us the features we need -- we meet with > them every two weeks, please join if you have time! > > (Side note: the "safety" that Rust "sells" isn't really about avoiding > panics, although obviously it would be a nice feature to have.) That's right, these panics are actually the last line of safety to prevent a program for doing something damaging. It is just that the consequences in a regular program are not as heavy as in the kernel. The only two options are either allowing user-space to crash the kernel through a module with a missing bound check, or letting it tamper with data it is not supposed to access. While the first option is terrible, the second one is unacceptable - so at the end of the day what we likely want is to keep the panic behavior and limit these occurrences as much as possible through information to the programmer. Build errors on such panic sites insertions, with the option to relax the rule locally if a justifying SAFETY comment is provided? And as you said, what do we do if a panic can be removed through a particular optimization - does it become mandatory to build the kernel? Is it applicable to all architectures and (in the future) all supported compilers? I suspect it will take more than Plumbers to get to the bottom of this. :) > >> Writing a uC topic proposal for Plumbers right now. :) > > I see it there, thanks! I can briefly mention the topic in Kangrejos, > since we will have Rust representation, including from the language > team. > > I don't think the discussion should focus much on "Do we need this?" > but rather more on "What exactly do we want? Would we be OK with a > local solution? Do we need/want a global one? Would we be OK with LSP? > Would we be OK with no panics after optimizations, e.g. a link time > check? Or do we want full support in the language for guaranteed > non-panicking functions? Do we need exceptional carve-outs on such > checking for particular language constructs?" and so on. And, of > course, "Who has time to write an RFC and implement an experiment > upstream if an approach is decided". > > Getting data on "in practice, how much of an issue it is on the Rust > side" would help too -- those with actual users running Rust kernel > code probably can tell us something. > > What I would personally expect to happen is that, over time, we > understand better what are the worst cases we must tackle. Thanks, these are great directions to explore. I see that some thinking has already been done on this, do we have a bug or tracking issue so I can catch up with the discussions that have already taken place?
On 9/10/25 6:54 AM, Alexandre Courbot wrote: > On Wed Sep 10, 2025 at 7:00 PM JST, Miguel Ojeda wrote: >> On Wed, Sep 10, 2025 at 7:45 AM Alexandre Courbot <acourbot@nvidia.com> wrote: ...> Build errors on such panic sites insertions, with the option to relax > the rule locally if a justifying SAFETY comment is provided? And as you > said, what do we do if a panic can be removed through a particular > optimization - does it become mandatory to build the kernel? Is it > applicable to all architectures and (in the future) all supported > compilers? We have a very solid precedent for that: building the kernel already requires -O2 optimization for *functional* correctness. So Rust for Linux certainly can specify a certain set of build options that are required--including options that are normally considered optimizations. thanks, -- John Hubbard
On Wed, Sep 10, 2025 at 3:55 PM Alexandre Courbot <acourbot@nvidia.com> wrote: > > The only two options are either allowing user-space to crash the kernel > through a module with a missing bound check, or letting it tamper with So we are definitely not aiming to allow that on purpose, i.e. I know you said "missing", but just to clarify: they would of course be considered a bug, just like hitting similar facilities in C code. (In general, if we get into the situation where we can actually crash the kernel from userspace, that is a CVE with the current rules (modulo exceptional cases, e.g. the recent discussion about debugging tools, for instance), regardless of whether it was due to a Rust panic or not.) The Rust panic is, as you say, the last line of defense, attempting to limit the damage. I think that is worth it for at least some users that want it (like cloud providers) and some parts of the kernel. Sometimes it can be worse, though, i.e. perhaps the OOB or the overflow was not actually a "big issue" "in practice" for a particular case. I think it will likely depend on the kind of code we are talking about -- not all panics are the same, and not all code is the same. For random modules, for instance, yes, we should definitely prevent developers from writing panics as much as possible (or not at all); the same way we try to prevent them from writing unsafe code -- after all, panicking operations and unsafe code are both an statement that we are sure "something cannot happen". So the more we prevent APIs that do not need to rely on panicking (nor unsafe code) on drivers etc., the better. I also think forcing to have no panics at all (i.e. globally) would force us to handle way more truly impossible cases than C does (unsafely so), which isn't great and has downsides too (increased complexity, for one, which can also lead to more bugs). And we don't want to go memory unsafe for all those either (I mean, it could be an option, but I wouldn't recommend it, and it would still be a bug if hit). So for certain cases panicking may be a reasonable option in practice -- that is where I wanted us to get more experience to know how we fare vs. C here and thus the data request. It will also depend on what upstream Rust can give us (more on that below). For instance, having an enforced way with carve-outs that need to be annotated would at least make it stand out to reviewers and make developers think thrice. > Thanks, these are great directions to explore. I see that some thinking > has already been done on this, do we have a bug or tracking issue so I > can catch up with the discussions that have already taken place? In general, I do my best to track this kind of thing in the issues linked from issue #2 ("Rust features" in this case: https://github.com/Rust-for-Linux/linux/issues/354), and you can see some pointers for the overflow part. For no panics there is also an entry for a long time, but I don't think there has been any "deep" discussion on it -- I mean between Rust and the kernel (there are of course discussions in upstream Rust, and there also have been in our side in the mailing list). I just added a couple recent pointers, I can look for more. In that meeting I mentioned with them, I raised this a month or two ago, and I did so today too, and mentioned that you wanted to discuss it in LPC. If you (or e.g. someone from NVIDIA) have time to work on this, including in upstream Rust (e.g. writing an RFC, implementing an experiment...), then please let me know. They are open to the possibility of having a Rust project goal for the next round on no panics (2026H1), so it is a good chance to move things forward. We had today a discussion on potential forms it could take (an attribute on a function, guaranteed or not, escape hatches or not, lints, an enforced comment, Klint, the report-and-continue feature...) and related ongoing efforts (having a `nopanic` effect, `target_feature(enable =`...). I hope that helps, and thanks! Cheers, Miguel
On 8/27/25 1:47 AM, Alexandre Courbot wrote: > On Wed Aug 27, 2025 at 10:34 AM JST, John Hubbard wrote: > <snip> >>> + /// Returns the data payload of the firmware, or `None` if the data range is out of bounds of >>> + /// the firmware image. >>> + fn data(&self) -> Option<&[u8]> { >>> + let fw_start = self.hdr.data_offset as usize; >>> + let fw_size = self.hdr.data_size as usize; >>> + >>> + self.fw.get(fw_start..fw_start + fw_size) >> >> This worries me a bit, because we never checked that these bounds >> are reasonable: within the range of the firmware, and not overflowing >> (.checked_add() for example), that sort of thing. >> >> Thoughts? > > `get` returns `None` if the requested slice is out of bounds, so there > should be no risk of panicking here. I was wondering about the bounds themselves, though. Couldn't they be wrong? (Do we care?) > > However, `fw_start + fw_size` can panic in debug configuration if it > overflows. In a release build I believe it will just happily wrap, and > `get` should consequently return `None` at the invalid range... Although > we can also get unlucky and produce a valid, yet incorrect, one. > > This is actually something I've been thinking about while writing this > series and could not really decide upon: how to deal with operands and > functions in Rust that can potentially panic. Using `checked` operands > everywhere is a bit tedious, and even with great care there is no way to > guarantee that no panic occurs in a given function. Yes, .checked_add() all over the place is just awful, would like to avoid that for sure. > > Panics are a big no-no in the kernel, yet I don't feel like we have the > proper tools to ensure they do not happen. > > User-space has some crates like `no_panic`, but even these feel more > like hacks than anything else. Something at the compiler level would be > nice. > > Maybe that would be a good discussion topic for the Plumber > Microconference? Yes. And maybe even for Kangrejos. thanks, -- John Hubbard
On Thu Aug 28, 2025 at 6:50 AM JST, John Hubbard wrote: > On 8/27/25 1:47 AM, Alexandre Courbot wrote: >> On Wed Aug 27, 2025 at 10:34 AM JST, John Hubbard wrote: >> <snip> >>>> + /// Returns the data payload of the firmware, or `None` if the data range is out of bounds of >>>> + /// the firmware image. >>>> + fn data(&self) -> Option<&[u8]> { >>>> + let fw_start = self.hdr.data_offset as usize; >>>> + let fw_size = self.hdr.data_size as usize; >>>> + >>>> + self.fw.get(fw_start..fw_start + fw_size) >>> >>> This worries me a bit, because we never checked that these bounds >>> are reasonable: within the range of the firmware, and not overflowing >>> (.checked_add() for example), that sort of thing. >>> >>> Thoughts? >> >> `get` returns `None` if the requested slice is out of bounds, so there >> should be no risk of panicking here. > > I was wondering about the bounds themselves, though. Couldn't they > be wrong? (Do we care?) Not sure what you mean by wrong bounds here? Do you mean what if the header data is incorrect?
On 8/28/25 12:08 AM, Alexandre Courbot wrote: ... >>>> This worries me a bit, because we never checked that these bounds >>>> are reasonable: within the range of the firmware, and not overflowing >>>> (.checked_add() for example), that sort of thing. >>>> >>>> Thoughts? >>> >>> `get` returns `None` if the requested slice is out of bounds, so there >>> should be no risk of panicking here. >> >> I was wondering about the bounds themselves, though. Couldn't they >> be wrong? (Do we care?) > > Not sure what you mean by wrong bounds here? Do you mean what if the > header data is incorrect? Yes, that's what I meant. And I'm mainly trying to get some perspective about what kinds of checking we should be doing. In this case, it seems that we don't actually need anything more than what you already have, so we're all good here. thanks, -- John Hubbard
© 2016 - 2025 Red Hat, Inc.