From: Wedson Almeida Filho <wedsonaf@google.com>
The `kernel` crate currently includes all the abstractions that wrap
kernel features written in C.
These abstractions call the C side of the kernel via the generated
bindings with the `bindgen` tool. Modules developed in Rust should
never call the bindings themselves.
In the future, as the abstractions grow in number, we may need
to split this crate into several, possibly following a similar
subdivision in subsystems as the kernel itself and/or moving
the code to the actual subsystems.
Co-developed-by: Alex Gaynor <alex.gaynor@gmail.com>
Signed-off-by: Alex Gaynor <alex.gaynor@gmail.com>
Co-developed-by: Geoffrey Thomas <geofft@ldpreload.com>
Signed-off-by: Geoffrey Thomas <geofft@ldpreload.com>
Co-developed-by: Finn Behrens <me@kloenk.de>
Signed-off-by: Finn Behrens <me@kloenk.de>
Co-developed-by: Adam Bratschi-Kaye <ark.email@gmail.com>
Signed-off-by: Adam Bratschi-Kaye <ark.email@gmail.com>
Co-developed-by: Sven Van Asbroeck <thesven73@gmail.com>
Signed-off-by: Sven Van Asbroeck <thesven73@gmail.com>
Co-developed-by: Gary Guo <gary@garyguo.net>
Signed-off-by: Gary Guo <gary@garyguo.net>
Co-developed-by: Boris-Chengbiao Zhou <bobo1239@web.de>
Signed-off-by: Boris-Chengbiao Zhou <bobo1239@web.de>
Co-developed-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Co-developed-by: Fox Chen <foxhlchen@gmail.com>
Signed-off-by: Fox Chen <foxhlchen@gmail.com>
Co-developed-by: Viktor Garske <viktor@v-gar.de>
Signed-off-by: Viktor Garske <viktor@v-gar.de>
Co-developed-by: Dariusz Sosnowski <dsosnowski@dsosnowski.pl>
Signed-off-by: Dariusz Sosnowski <dsosnowski@dsosnowski.pl>
Co-developed-by: Léo Lanteri Thauvin <leseulartichaut@gmail.com>
Signed-off-by: Léo Lanteri Thauvin <leseulartichaut@gmail.com>
Co-developed-by: Niklas Mohrin <dev@niklasmohrin.de>
Signed-off-by: Niklas Mohrin <dev@niklasmohrin.de>
Co-developed-by: Milan Landaverde <milan@mdaverde.com>
Signed-off-by: Milan Landaverde <milan@mdaverde.com>
Co-developed-by: Morgan Bartlett <mjmouse9999@gmail.com>
Signed-off-by: Morgan Bartlett <mjmouse9999@gmail.com>
Co-developed-by: Maciej Falkowski <m.falkowski@samsung.com>
Signed-off-by: Maciej Falkowski <m.falkowski@samsung.com>
Co-developed-by: Nándor István Krácser <bonifaido@gmail.com>
Signed-off-by: Nándor István Krácser <bonifaido@gmail.com>
Co-developed-by: David Gow <davidgow@google.com>
Signed-off-by: David Gow <davidgow@google.com>
Co-developed-by: John Baublitz <john.m.baublitz@gmail.com>
Signed-off-by: John Baublitz <john.m.baublitz@gmail.com>
Co-developed-by: Björn Roy Baron <bjorn3_gh@protonmail.com>
Signed-off-by: Björn Roy Baron <bjorn3_gh@protonmail.com>
Signed-off-by: Wedson Almeida Filho <wedsonaf@google.com>
Co-developed-by: Miguel Ojeda <ojeda@kernel.org>
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
---
rust/kernel/allocator.rs | 64 +++++++++++++
rust/kernel/error.rs | 59 ++++++++++++
rust/kernel/lib.rs | 78 +++++++++++++++
rust/kernel/prelude.rs | 20 ++++
rust/kernel/print.rs | 198 +++++++++++++++++++++++++++++++++++++++
rust/kernel/str.rs | 72 ++++++++++++++
6 files changed, 491 insertions(+)
create mode 100644 rust/kernel/allocator.rs
create mode 100644 rust/kernel/error.rs
create mode 100644 rust/kernel/lib.rs
create mode 100644 rust/kernel/prelude.rs
create mode 100644 rust/kernel/print.rs
create mode 100644 rust/kernel/str.rs
diff --git a/rust/kernel/allocator.rs b/rust/kernel/allocator.rs
new file mode 100644
index 000000000000..397a3dd57a9b
--- /dev/null
+++ b/rust/kernel/allocator.rs
@@ -0,0 +1,64 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! Allocator support.
+
+use core::alloc::{GlobalAlloc, Layout};
+use core::ptr;
+
+use crate::bindings;
+
+struct KernelAllocator;
+
+unsafe impl GlobalAlloc for KernelAllocator {
+ unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
+ // `krealloc()` is used instead of `kmalloc()` because the latter is
+ // an inline function and cannot be bound to as a result.
+ unsafe { bindings::krealloc(ptr::null(), layout.size(), bindings::GFP_KERNEL) as *mut u8 }
+ }
+
+ unsafe fn dealloc(&self, ptr: *mut u8, _layout: Layout) {
+ unsafe {
+ bindings::kfree(ptr as *const core::ffi::c_void);
+ }
+ }
+}
+
+#[global_allocator]
+static ALLOCATOR: KernelAllocator = KernelAllocator;
+
+// `rustc` only generates these for some crate types. Even then, we would need
+// to extract the object file that has them from the archive. For the moment,
+// let's generate them ourselves instead.
+//
+// Note that `#[no_mangle]` implies exported too, nowadays.
+#[no_mangle]
+fn __rust_alloc(size: usize, _align: usize) -> *mut u8 {
+ unsafe { bindings::krealloc(core::ptr::null(), size, bindings::GFP_KERNEL) as *mut u8 }
+}
+
+#[no_mangle]
+fn __rust_dealloc(ptr: *mut u8, _size: usize, _align: usize) {
+ unsafe { bindings::kfree(ptr as *const core::ffi::c_void) };
+}
+
+#[no_mangle]
+fn __rust_realloc(ptr: *mut u8, _old_size: usize, _align: usize, new_size: usize) -> *mut u8 {
+ unsafe {
+ bindings::krealloc(
+ ptr as *const core::ffi::c_void,
+ new_size,
+ bindings::GFP_KERNEL,
+ ) as *mut u8
+ }
+}
+
+#[no_mangle]
+fn __rust_alloc_zeroed(size: usize, _align: usize) -> *mut u8 {
+ unsafe {
+ bindings::krealloc(
+ core::ptr::null(),
+ size,
+ bindings::GFP_KERNEL | bindings::__GFP_ZERO,
+ ) as *mut u8
+ }
+}
diff --git a/rust/kernel/error.rs b/rust/kernel/error.rs
new file mode 100644
index 000000000000..466b2a8fe569
--- /dev/null
+++ b/rust/kernel/error.rs
@@ -0,0 +1,59 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! Kernel errors.
+//!
+//! C header: [`include/uapi/asm-generic/errno-base.h`](../../../include/uapi/asm-generic/errno-base.h)
+
+use alloc::collections::TryReserveError;
+
+/// Contains the C-compatible error codes.
+pub mod code {
+ /// Out of memory.
+ pub const ENOMEM: super::Error = super::Error(-(crate::bindings::ENOMEM as i32));
+}
+
+/// Generic integer kernel error.
+///
+/// The kernel defines a set of integer generic error codes based on C and
+/// POSIX ones. These codes may have a more specific meaning in some contexts.
+///
+/// # Invariants
+///
+/// The value is a valid `errno` (i.e. `>= -MAX_ERRNO && < 0`).
+#[derive(Clone, Copy, PartialEq, Eq)]
+pub struct Error(core::ffi::c_int);
+
+impl Error {
+ /// Returns the kernel error code.
+ pub fn to_kernel_errno(self) -> core::ffi::c_int {
+ self.0
+ }
+}
+
+impl From<TryReserveError> for Error {
+ fn from(_: TryReserveError) -> Error {
+ code::ENOMEM
+ }
+}
+
+/// A [`Result`] with an [`Error`] error type.
+///
+/// To be used as the return type for functions that may fail.
+///
+/// # Error codes in C and Rust
+///
+/// In C, it is common that functions indicate success or failure through
+/// their return value; modifying or returning extra data through non-`const`
+/// pointer parameters. In particular, in the kernel, functions that may fail
+/// typically return an `int` that represents a generic error code. We model
+/// those as [`Error`].
+///
+/// In Rust, it is idiomatic to model functions that may fail as returning
+/// a [`Result`]. Since in the kernel many functions return an error code,
+/// [`Result`] is a type alias for a [`core::result::Result`] that uses
+/// [`Error`] as its error type.
+///
+/// Note that even if a function does not return anything when it succeeds,
+/// it should still be modeled as returning a `Result` rather than
+/// just an [`Error`].
+pub type Result<T = ()> = core::result::Result<T, Error>;
diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
new file mode 100644
index 000000000000..abd46261d385
--- /dev/null
+++ b/rust/kernel/lib.rs
@@ -0,0 +1,78 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! The `kernel` crate.
+//!
+//! This crate contains the kernel APIs that have been ported or wrapped for
+//! usage by Rust code in the kernel and is shared by all of them.
+//!
+//! In other words, all the rest of the Rust code in the kernel (e.g. kernel
+//! modules written in Rust) depends on [`core`], [`alloc`] and this crate.
+//!
+//! If you need a kernel C API that is not ported or wrapped yet here, then
+//! do so first instead of bypassing this crate.
+
+#![no_std]
+#![feature(core_ffi_c)]
+
+// Ensure conditional compilation based on the kernel configuration works;
+// otherwise we may silently break things like initcall handling.
+#[cfg(not(CONFIG_RUST))]
+compile_error!("Missing kernel configuration for conditional compilation");
+
+#[cfg(not(test))]
+#[cfg(not(testlib))]
+mod allocator;
+pub mod error;
+pub mod prelude;
+pub mod print;
+pub mod str;
+
+#[doc(hidden)]
+pub use bindings;
+pub use macros;
+
+/// Prefix to appear before log messages printed from within the `kernel` crate.
+const __LOG_PREFIX: &[u8] = b"rust_kernel\0";
+
+/// The top level entrypoint to implementing a kernel module.
+///
+/// For any teardown or cleanup operations, your type may implement [`Drop`].
+pub trait Module: Sized + Sync {
+ /// Called at module initialization time.
+ ///
+ /// Use this method to perform whatever setup or registration your module
+ /// should do.
+ ///
+ /// Equivalent to the `module_init` macro in the C API.
+ fn init(module: &'static ThisModule) -> error::Result<Self>;
+}
+
+/// Equivalent to `THIS_MODULE` in the C API.
+///
+/// C header: `include/linux/export.h`
+pub struct ThisModule(*mut bindings::module);
+
+// SAFETY: `THIS_MODULE` may be used from all threads within a module.
+unsafe impl Sync for ThisModule {}
+
+impl ThisModule {
+ /// Creates a [`ThisModule`] given the `THIS_MODULE` pointer.
+ ///
+ /// # Safety
+ ///
+ /// The pointer must be equal to the right `THIS_MODULE`.
+ pub const unsafe fn from_ptr(ptr: *mut bindings::module) -> ThisModule {
+ ThisModule(ptr)
+ }
+}
+
+#[cfg(not(any(testlib, test)))]
+#[panic_handler]
+fn panic(info: &core::panic::PanicInfo<'_>) -> ! {
+ pr_emerg!("{}\n", info);
+ // SAFETY: FFI call.
+ unsafe { bindings::BUG() };
+ // Bindgen currently does not recognize `__noreturn` so `BUG` returns `()`
+ // instead of `!`. See <https://github.com/rust-lang/rust-bindgen/issues/2094>.
+ loop {}
+}
diff --git a/rust/kernel/prelude.rs b/rust/kernel/prelude.rs
new file mode 100644
index 000000000000..495e22250726
--- /dev/null
+++ b/rust/kernel/prelude.rs
@@ -0,0 +1,20 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! The `kernel` prelude.
+//!
+//! These are the most common items used by Rust code in the kernel,
+//! intended to be imported by all Rust code, for convenience.
+//!
+//! # Examples
+//!
+//! ```
+//! use kernel::prelude::*;
+//! ```
+
+pub use super::{
+ error::{Error, Result},
+ pr_emerg, pr_info, ThisModule,
+};
+pub use alloc::{boxed::Box, vec::Vec};
+pub use core::pin::Pin;
+pub use macros::module;
diff --git a/rust/kernel/print.rs b/rust/kernel/print.rs
new file mode 100644
index 000000000000..55db5a1ba752
--- /dev/null
+++ b/rust/kernel/print.rs
@@ -0,0 +1,198 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! Printing facilities.
+//!
+//! C header: [`include/linux/printk.h`](../../../../include/linux/printk.h)
+//!
+//! Reference: <https://www.kernel.org/doc/html/latest/core-api/printk-basics.html>
+
+use core::{
+ ffi::{c_char, c_void},
+ fmt,
+};
+
+use crate::str::RawFormatter;
+
+#[cfg(CONFIG_PRINTK)]
+use crate::bindings;
+
+// Called from `vsprintf` with format specifier `%pA`.
+#[no_mangle]
+unsafe fn rust_fmt_argument(buf: *mut c_char, end: *mut c_char, ptr: *const c_void) -> *mut c_char {
+ use fmt::Write;
+ // SAFETY: The C contract guarantees that `buf` is valid if it's less than `end`.
+ let mut w = unsafe { RawFormatter::from_ptrs(buf.cast(), end.cast()) };
+ let _ = w.write_fmt(unsafe { *(ptr as *const fmt::Arguments<'_>) });
+ w.pos().cast()
+}
+
+/// Format strings.
+///
+/// Public but hidden since it should only be used from public macros.
+#[doc(hidden)]
+pub mod format_strings {
+ use crate::bindings;
+
+ /// The length we copy from the `KERN_*` kernel prefixes.
+ const LENGTH_PREFIX: usize = 2;
+
+ /// The length of the fixed format strings.
+ pub const LENGTH: usize = 10;
+
+ /// Generates a fixed format string for the kernel's [`_printk`].
+ ///
+ /// The format string is always the same for a given level, i.e. for a
+ /// given `prefix`, which are the kernel's `KERN_*` constants.
+ ///
+ /// [`_printk`]: ../../../../include/linux/printk.h
+ const fn generate(is_cont: bool, prefix: &[u8; 3]) -> [u8; LENGTH] {
+ // Ensure the `KERN_*` macros are what we expect.
+ assert!(prefix[0] == b'\x01');
+ if is_cont {
+ assert!(prefix[1] == b'c');
+ } else {
+ assert!(prefix[1] >= b'0' && prefix[1] <= b'7');
+ }
+ assert!(prefix[2] == b'\x00');
+
+ let suffix: &[u8; LENGTH - LENGTH_PREFIX] = if is_cont {
+ b"%pA\0\0\0\0\0"
+ } else {
+ b"%s: %pA\0"
+ };
+
+ [
+ prefix[0], prefix[1], suffix[0], suffix[1], suffix[2], suffix[3], suffix[4], suffix[5],
+ suffix[6], suffix[7],
+ ]
+ }
+
+ // Generate the format strings at compile-time.
+ //
+ // This avoids the compiler generating the contents on the fly in the stack.
+ //
+ // Furthermore, `static` instead of `const` is used to share the strings
+ // for all the kernel.
+ pub static EMERG: [u8; LENGTH] = generate(false, bindings::KERN_EMERG);
+ pub static INFO: [u8; LENGTH] = generate(false, bindings::KERN_INFO);
+}
+
+/// Prints a message via the kernel's [`_printk`].
+///
+/// Public but hidden since it should only be used from public macros.
+///
+/// # Safety
+///
+/// The format string must be one of the ones in [`format_strings`], and
+/// the module name must be null-terminated.
+///
+/// [`_printk`]: ../../../../include/linux/_printk.h
+#[doc(hidden)]
+#[cfg_attr(not(CONFIG_PRINTK), allow(unused_variables))]
+pub unsafe fn call_printk(
+ format_string: &[u8; format_strings::LENGTH],
+ module_name: &[u8],
+ args: fmt::Arguments<'_>,
+) {
+ // `_printk` does not seem to fail in any path.
+ #[cfg(CONFIG_PRINTK)]
+ unsafe {
+ bindings::_printk(
+ format_string.as_ptr() as _,
+ module_name.as_ptr(),
+ &args as *const _ as *const c_void,
+ );
+ }
+}
+
+/// Performs formatting and forwards the string to [`call_printk`].
+///
+/// Public but hidden since it should only be used from public macros.
+#[doc(hidden)]
+#[cfg(not(testlib))]
+#[macro_export]
+#[allow(clippy::crate_in_macro_def)]
+macro_rules! print_macro (
+ // The non-continuation cases (most of them, e.g. `INFO`).
+ ($format_string:path, $($arg:tt)+) => (
+ // SAFETY: This hidden macro should only be called by the documented
+ // printing macros which ensure the format string is one of the fixed
+ // ones. All `__LOG_PREFIX`s are null-terminated as they are generated
+ // by the `module!` proc macro or fixed values defined in a kernel
+ // crate.
+ unsafe {
+ $crate::print::call_printk(
+ &$format_string,
+ crate::__LOG_PREFIX,
+ format_args!($($arg)+),
+ );
+ }
+ );
+);
+
+/// Stub for doctests
+#[cfg(testlib)]
+#[macro_export]
+macro_rules! print_macro (
+ ($format_string:path, $e:expr, $($arg:tt)+) => (
+ ()
+ );
+);
+
+// We could use a macro to generate these macros. However, doing so ends
+// up being a bit ugly: it requires the dollar token trick to escape `$` as
+// well as playing with the `doc` attribute. Furthermore, they cannot be easily
+// imported in the prelude due to [1]. So, for the moment, we just write them
+// manually, like in the C side; while keeping most of the logic in another
+// macro, i.e. [`print_macro`].
+//
+// [1]: https://github.com/rust-lang/rust/issues/52234
+
+/// Prints an emergency-level message (level 0).
+///
+/// Use this level if the system is unusable.
+///
+/// Equivalent to the kernel's [`pr_emerg`] macro.
+///
+/// Mimics the interface of [`std::print!`]. See [`core::fmt`] and
+/// `alloc::format!` for information about the formatting syntax.
+///
+/// [`pr_emerg`]: https://www.kernel.org/doc/html/latest/core-api/printk-basics.html#c.pr_emerg
+/// [`std::print!`]: https://doc.rust-lang.org/std/macro.print.html
+///
+/// # Examples
+///
+/// ```
+/// pr_emerg!("hello {}\n", "there");
+/// ```
+#[macro_export]
+macro_rules! pr_emerg (
+ ($($arg:tt)*) => (
+ $crate::print_macro!($crate::print::format_strings::EMERG, $($arg)*)
+ )
+);
+
+/// Prints an info-level message (level 6).
+///
+/// Use this level for informational messages.
+///
+/// Equivalent to the kernel's [`pr_info`] macro.
+///
+/// Mimics the interface of [`std::print!`]. See [`core::fmt`] and
+/// `alloc::format!` for information about the formatting syntax.
+///
+/// [`pr_info`]: https://www.kernel.org/doc/html/latest/core-api/printk-basics.html#c.pr_info
+/// [`std::print!`]: https://doc.rust-lang.org/std/macro.print.html
+///
+/// # Examples
+///
+/// ```
+/// pr_info!("hello {}\n", "there");
+/// ```
+#[macro_export]
+#[doc(alias = "print")]
+macro_rules! pr_info (
+ ($($arg:tt)*) => (
+ $crate::print_macro!($crate::print::format_strings::INFO, $($arg)*)
+ )
+);
diff --git a/rust/kernel/str.rs b/rust/kernel/str.rs
new file mode 100644
index 000000000000..e45ff220ae50
--- /dev/null
+++ b/rust/kernel/str.rs
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! String representations.
+
+use core::fmt;
+
+/// Allows formatting of [`fmt::Arguments`] into a raw buffer.
+///
+/// It does not fail if callers write past the end of the buffer so that they can calculate the
+/// size required to fit everything.
+///
+/// # Invariants
+///
+/// The memory region between `pos` (inclusive) and `end` (exclusive) is valid for writes if `pos`
+/// is less than `end`.
+pub(crate) struct RawFormatter {
+ // Use `usize` to use `saturating_*` functions.
+ #[allow(dead_code)]
+ beg: usize,
+ pos: usize,
+ end: usize,
+}
+
+impl RawFormatter {
+ /// Creates a new instance of [`RawFormatter`] with the given buffer pointers.
+ ///
+ /// # Safety
+ ///
+ /// If `pos` is less than `end`, then the region between `pos` (inclusive) and `end`
+ /// (exclusive) must be valid for writes for the lifetime of the returned [`RawFormatter`].
+ pub(crate) unsafe fn from_ptrs(pos: *mut u8, end: *mut u8) -> Self {
+ // INVARIANT: The safety requierments guarantee the type invariants.
+ Self {
+ beg: pos as _,
+ pos: pos as _,
+ end: end as _,
+ }
+ }
+
+ /// Returns the current insert position.
+ ///
+ /// N.B. It may point to invalid memory.
+ pub(crate) fn pos(&self) -> *mut u8 {
+ self.pos as _
+ }
+}
+
+impl fmt::Write for RawFormatter {
+ fn write_str(&mut self, s: &str) -> fmt::Result {
+ // `pos` value after writing `len` bytes. This does not have to be bounded by `end`, but we
+ // don't want it to wrap around to 0.
+ let pos_new = self.pos.saturating_add(s.len());
+
+ // Amount that we can copy. `saturating_sub` ensures we get 0 if `pos` goes past `end`.
+ let len_to_copy = core::cmp::min(pos_new, self.end).saturating_sub(self.pos);
+
+ if len_to_copy > 0 {
+ // SAFETY: If `len_to_copy` is non-zero, then we know `pos` has not gone past `end`
+ // yet, so it is valid for write per the type invariants.
+ unsafe {
+ core::ptr::copy_nonoverlapping(
+ s.as_bytes().as_ptr(),
+ self.pos as *mut u8,
+ len_to_copy,
+ )
+ };
+ }
+
+ self.pos = pos_new;
+ Ok(())
+ }
+}
--
2.37.1
> +unsafe impl GlobalAlloc for KernelAllocator {
> + unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
> + // `krealloc()` is used instead of `kmalloc()` because the latter is
> + // an inline function and cannot be bound to as a result.
> + unsafe { bindings::krealloc(ptr::null(), layout.size(), bindings::GFP_KERNEL) as *mut u8 }
> + }
> +
> + unsafe fn dealloc(&self, ptr: *mut u8, _layout: Layout) {
> + unsafe {
> + bindings::kfree(ptr as *const core::ffi::c_void);
> + }
> + }
> +}
I sense possible problems here. It's common for a kernel code to pass
flags during memory allocations.
For example:
struct bio *bio;
for (...) {
bio = bio_alloc_bioset(bdev, nr_vecs, opf, GFP_NOIO, bs);
if (!bio)
return -ENOMEM;
}
Without GFP_NOIO we can run into a deadlock, because the kernel will try
give us free memory by flushing the dirty pages and we need the memory
to actually do it and boom, deadlock.
Or we can be allocating some structs under spinlock (yeah, that happens too):
struct efc_vport *vport;
spin_lock_irqsave(...);
vport = kzalloc(sizeof(*vport), GFP_ATOMIC);
if (!vport) {
spin_unlock_irqrestore(...);
return NULL;
}
spin_unlock(...);
Same can (and probably will) happen to e.g. Vec elements. So some form
of flags passing should be supported in try_* variants:
let mut vec = Vec::try_new(GFP_ATOMIC)?;
vec.try_push(GFP_ATOMIC, 1)?;
vec.try_push(GFP_ATOMIC, 2)?;
vec.try_push(GFP_ATOMIC, 3)?;
On Sat, Aug 6, 2022 at 12:25 PM Konstantin Shelekhin
<k.shelekhin@yadro.com> wrote:
>
> I sense possible problems here. It's common for a kernel code to pass
> flags during memory allocations.
Yes, of course. We will support this, but how exactly it will look
like, to what extent upstream Rust's `alloc` could support our use
cases, etc. has been on discussion for a long time.
For instance, see https://github.com/Rust-for-Linux/linux/pull/815 for
a potential extension trait approach with no allocator carried on the
type that Andreas wrote after a discussion in the last informal call:
let a = Box::try_new_atomic(101)?;
Cheers,
Miguel
On Sat, Aug 06, 2022 at 01:22:52PM +0200, Miguel Ojeda wrote: > On Sat, Aug 6, 2022 at 12:25 PM Konstantin Shelekhin > <k.shelekhin@yadro.com> wrote: > > > > I sense possible problems here. It's common for a kernel code to pass > > flags during memory allocations. > > Yes, of course. We will support this, but how exactly it will look > like, to what extent upstream Rust's `alloc` could support our use > cases, etc. has been on discussion for a long time. > > For instance, see https://github.com/Rust-for-Linux/linux/pull/815 for > a potential extension trait approach with no allocator carried on the > type that Andreas wrote after a discussion in the last informal call: > > let a = Box::try_new_atomic(101)?; In my opinion, the rest of the thread clearly shows that the conservative approach is currently the only solid option. I suggest the following explicit API: let a = Box::try_new(size, flags)?; Vec::try_push(item, flags)?; etc. Whadda you think?
On Wed, Sep 21, 2022 at 02:23:42PM +0300, Konstantin Shelekhin wrote: > On Sat, Aug 06, 2022 at 01:22:52PM +0200, Miguel Ojeda wrote: > > On Sat, Aug 6, 2022 at 12:25 PM Konstantin Shelekhin > > <k.shelekhin@yadro.com> wrote: > > > > > > I sense possible problems here. It's common for a kernel code to pass > > > flags during memory allocations. > > > > Yes, of course. We will support this, but how exactly it will look > > like, to what extent upstream Rust's `alloc` could support our use > > cases, etc. has been on discussion for a long time. > > > > For instance, see https://github.com/Rust-for-Linux/linux/pull/815 for > > a potential extension trait approach with no allocator carried on the > > type that Andreas wrote after a discussion in the last informal call: > > > > let a = Box::try_new_atomic(101)?; > > In my opinion, the rest of the thread clearly shows that the > conservative approach is currently the only solid option. I suggest the > following explicit API: > > let a = Box::try_new(size, flags)?; > Vec::try_push(item, flags)?; > > etc. Whadda you think? Please, yes. This fits the current kernel memory allocation pattern and allows for proper propagation of the allocation flags as needed through the system. This is going to be required in any non-trivial kernel code anyway, might as well do it correct from the beginning. It also allows for flags to change over time, which also happens. thanks, greg k-h
On Sat, Aug 06, 2022 at 01:22:52PM +0200, Miguel Ojeda wrote:
> On Sat, Aug 6, 2022 at 12:25 PM Konstantin Shelekhin
> <k.shelekhin@yadro.com> wrote:
> >
> > I sense possible problems here. It's common for a kernel code to pass
> > flags during memory allocations.
>
> Yes, of course. We will support this, but how exactly it will look
> like, to what extent upstream Rust's `alloc` could support our use
> cases, etc. has been on discussion for a long time.
>
> For instance, see https://github.com/Rust-for-Linux/linux/pull/815 for
> a potential extension trait approach with no allocator carried on the
> type that Andreas wrote after a discussion in the last informal call:
>
> let a = Box::try_new_atomic(101)?;
Something I've been wondering about for a while is ...
struct task_struct {
...
+ gfp_t gfp_flags;
...
};
We've already done some work towards this with the scoped allocation
API for NOIO and NOFS, but having spin_lock() turn current->gfp_flags
into GFP_ATOMIC might not be the worst idea in the world.
On Sat, Aug 06, 2022 at 03:57:35PM +0100, Matthew Wilcox wrote:
> On Sat, Aug 06, 2022 at 01:22:52PM +0200, Miguel Ojeda wrote:
> > On Sat, Aug 6, 2022 at 12:25 PM Konstantin Shelekhin
> > <k.shelekhin@yadro.com> wrote:
> > >
> > > I sense possible problems here. It's common for a kernel code to pass
> > > flags during memory allocations.
> >
> > Yes, of course. We will support this, but how exactly it will look
> > like, to what extent upstream Rust's `alloc` could support our use
> > cases, etc. has been on discussion for a long time.
> >
> > For instance, see https://github.com/Rust-for-Linux/linux/pull/815 for
> > a potential extension trait approach with no allocator carried on the
> > type that Andreas wrote after a discussion in the last informal call:
> >
> > let a = Box::try_new_atomic(101)?;
>
> Something I've been wondering about for a while is ...
>
> struct task_struct {
> ...
> + gfp_t gfp_flags;
> ...
> };
For GFP_ATOMIC, we could use preempt_count except that it isn't always
enabled. Conveniently, it is already separated out into its own config.
How do people feel about removing CONFIG_PREEMPT_COUNT and having the
count always enabled?
We would then have a way to reliably detect when we are in atomic
context and we could catch other scenarios beyond allocation. For
example, I recently noticed that the following code (as a minimal
reproduction) does _not_ lead to a deadlock when CONFIG_PREEMPT=n:
rcu_read_lock();
synchronize_rcu();
Boqun explained to me that the reason is that synchronize_rcu() is not
supposed to be called from an rcu read-side critical section and the
current implementation takes advantage of this fact plus there is no way
to detect if we're in atomic context. This is all well and good, but if
one makes this mistake, the result is a potential user-after-free.
Always having preempt_count would allow us to behave differently here --
this is another conversation but at least we'll have to option to choose
what to do. (And it seems to me that detecting this and deadlocking or
BUG'ing would be preferable over the alternative [CC'ing Kees].)
Anyway, objections to a patch series that would amount to the changes
below?
Thanks,
-Wedson
diff --git a/arch/arm/include/asm/assembler.h b/arch/arm/include/asm/assembler.h
index 90fbe4a3f9c8..c88c932dba5b 100644
--- a/arch/arm/include/asm/assembler.h
+++ b/arch/arm/include/asm/assembler.h
@@ -224,7 +224,6 @@ THUMB( fpreg .req r7 )
/*
* Increment/decrement the preempt count.
*/
-#ifdef CONFIG_PREEMPT_COUNT
.macro inc_preempt_count, ti, tmp
ldr \tmp, [\ti, #TI_PREEMPT] @ get preempt count
add \tmp, \tmp, #1 @ increment it
@@ -241,16 +240,6 @@ THUMB( fpreg .req r7 )
get_thread_info \ti
dec_preempt_count \ti, \tmp
.endm
-#else
- .macro inc_preempt_count, ti, tmp
- .endm
-
- .macro dec_preempt_count, ti, tmp
- .endm
-
- .macro dec_preempt_count_ti, ti, tmp
- .endm
-#endif
#define USERL(l, x...) \
9999: x; \
diff --git a/arch/arm/kernel/iwmmxt.S b/arch/arm/kernel/iwmmxt.S
index d2b4ac06e4ed..ecdeaa4d5190 100644
--- a/arch/arm/kernel/iwmmxt.S
+++ b/arch/arm/kernel/iwmmxt.S
@@ -95,9 +95,7 @@ ENTRY(iwmmxt_task_enable)
mov r2, r2 @ cpwait
bl concan_save
-#ifdef CONFIG_PREEMPT_COUNT
get_thread_info r10
-#endif
4: dec_preempt_count r10, r3
ret r9 @ normal exit from exception
diff --git a/arch/xtensa/kernel/entry.S b/arch/xtensa/kernel/entry.S
index 272fff587907..8ad94e13d0f0 100644
--- a/arch/xtensa/kernel/entry.S
+++ b/arch/xtensa/kernel/entry.S
@@ -832,7 +832,7 @@ ENTRY(debug_exception)
* preemption if we have HW breakpoints to preserve DEBUGCAUSE.DBNUM
* meaning.
*/
-#if defined(CONFIG_PREEMPT_COUNT) && defined(CONFIG_HAVE_HW_BREAKPOINT)
+#if defined(CONFIG_HAVE_HW_BREAKPOINT)
GET_THREAD_INFO(a2, a1)
l32i a3, a2, TI_PRE_COUNT
addi a3, a3, 1
diff --git a/drivers/gpu/drm/i915/Kconfig.debug b/drivers/gpu/drm/i915/Kconfig.debug
index 47e845353ffa..811f34dbb80b 100644
--- a/drivers/gpu/drm/i915/Kconfig.debug
+++ b/drivers/gpu/drm/i915/Kconfig.debug
@@ -22,7 +22,6 @@ config DRM_I915_DEBUG
depends on EXPERT # only for developers
depends on !COMPILE_TEST # never built by robots
select DEBUG_FS
- select PREEMPT_COUNT
select I2C_CHARDEV
select STACKDEPOT
select DRM_DP_AUX_CHARDEV
diff --git a/drivers/gpu/drm/i915/i915_utils.h b/drivers/gpu/drm/i915/i915_utils.h
index c10d68cdc3ca..93686bdb9707 100644
--- a/drivers/gpu/drm/i915/i915_utils.h
+++ b/drivers/gpu/drm/i915/i915_utils.h
@@ -293,8 +293,7 @@ wait_remaining_ms_from_jiffies(unsigned long timestamp_jiffies, int to_wait_ms)
(Wmax))
#define wait_for(COND, MS) _wait_for((COND), (MS) * 1000, 10, 1000)
-/* If CONFIG_PREEMPT_COUNT is disabled, in_atomic() always reports false. */
-#if defined(CONFIG_DRM_I915_DEBUG) && defined(CONFIG_PREEMPT_COUNT)
+#if defined(CONFIG_DRM_I915_DEBUG)
# define _WAIT_FOR_ATOMIC_CHECK(ATOMIC) WARN_ON_ONCE((ATOMIC) && !in_atomic())
#else
# define _WAIT_FOR_ATOMIC_CHECK(ATOMIC) do { } while (0)
diff --git a/include/linux/bit_spinlock.h b/include/linux/bit_spinlock.h
index bbc4730a6505..1e03d54b0b6f 100644
--- a/include/linux/bit_spinlock.h
+++ b/include/linux/bit_spinlock.h
@@ -90,10 +90,8 @@ static inline int bit_spin_is_locked(int bitnum, unsigned long *addr)
{
#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
return test_bit(bitnum, addr);
-#elif defined CONFIG_PREEMPT_COUNT
- return preempt_count();
#else
- return 1;
+ return preempt_count();
#endif
}
diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 1f1099dac3f0..a05e40dccb0a 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -600,16 +600,14 @@ do { \
#define lockdep_assert_preemption_enabled() \
do { \
- WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_COUNT) && \
- __lockdep_enabled && \
+ WARN_ON_ONCE(__lockdep_enabled && \
(preempt_count() != 0 || \
!this_cpu_read(hardirqs_enabled))); \
} while (0)
#define lockdep_assert_preemption_disabled() \
do { \
- WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_COUNT) && \
- __lockdep_enabled && \
+ WARN_ON_ONCE(__lockdep_enabled && \
(preempt_count() == 0 && \
this_cpu_read(hardirqs_enabled))); \
} while (0)
diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 2e677e6ad09f..4cddcb17489a 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -271,9 +271,7 @@ static inline bool folio_ref_try_add_rcu(struct folio *folio, int count)
* context, so (on !SMP) we only need preemption to be disabled
* and TINY_RCU does that for us.
*/
-# ifdef CONFIG_PREEMPT_COUNT
VM_BUG_ON(!in_atomic() && !irqs_disabled());
-# endif
VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
folio_ref_add(folio, count);
#else
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index b4381f255a5c..77da73007375 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -56,8 +56,7 @@
#define PREEMPT_DISABLED (PREEMPT_DISABLE_OFFSET + PREEMPT_ENABLED)
/*
- * Disable preemption until the scheduler is running -- use an unconditional
- * value so that it also works on !PREEMPT_COUNT kernels.
+ * Disable preemption until the scheduler is running.
*
* Reset by start_kernel()->sched_init()->init_idle()->init_idle_preempt_count().
*/
@@ -69,7 +68,6 @@
*
* preempt_count() == 2*PREEMPT_DISABLE_OFFSET
*
- * Note: PREEMPT_DISABLE_OFFSET is 0 for !PREEMPT_COUNT kernels.
* Note: See finish_task_switch().
*/
#define FORK_PREEMPT_COUNT (2*PREEMPT_DISABLE_OFFSET + PREEMPT_ENABLED)
@@ -133,11 +131,7 @@ static __always_inline unsigned char interrupt_context_level(void)
/*
* The preempt_count offset after preempt_disable();
*/
-#if defined(CONFIG_PREEMPT_COUNT)
-# define PREEMPT_DISABLE_OFFSET PREEMPT_OFFSET
-#else
-# define PREEMPT_DISABLE_OFFSET 0
-#endif
+#define PREEMPT_DISABLE_OFFSET PREEMPT_OFFSET
/*
* The preempt_count offset after spin_lock()
@@ -154,7 +148,7 @@ static __always_inline unsigned char interrupt_context_level(void)
*
* spin_lock_bh()
*
- * Which need to disable both preemption (CONFIG_PREEMPT_COUNT) and
+ * Which need to disable both preemption and
* softirqs, such that unlock sequences of:
*
* spin_unlock();
@@ -196,8 +190,6 @@ extern void preempt_count_sub(int val);
#define preempt_count_inc() preempt_count_add(1)
#define preempt_count_dec() preempt_count_sub(1)
-#ifdef CONFIG_PREEMPT_COUNT
-
#define preempt_disable() \
do { \
preempt_count_inc(); \
@@ -263,27 +255,6 @@ do { \
__preempt_count_dec(); \
} while (0)
-#else /* !CONFIG_PREEMPT_COUNT */
-
-/*
- * Even if we don't have any preemption, we need preempt disable/enable
- * to be barriers, so that we don't have things like get_user/put_user
- * that can cause faults and scheduling migrate into our preempt-protected
- * region.
- */
-#define preempt_disable() barrier()
-#define sched_preempt_enable_no_resched() barrier()
-#define preempt_enable_no_resched() barrier()
-#define preempt_enable() barrier()
-#define preempt_check_resched() do { } while (0)
-
-#define preempt_disable_notrace() barrier()
-#define preempt_enable_no_resched_notrace() barrier()
-#define preempt_enable_notrace() barrier()
-#define preemptible() 0
-
-#endif /* CONFIG_PREEMPT_COUNT */
-
#ifdef MODULE
/*
* Modules have no business playing preemption tricks.
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 47e5d374c7eb..8761b85c7874 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -225,9 +225,6 @@ static inline bool pagefault_disabled(void)
*
* This function should only be used by the fault handlers. Other users should
* stick to pagefault_disabled().
- * Please NEVER use preempt_disable() to disable the fault handler. With
- * !CONFIG_PREEMPT_COUNT, this is like a NOP. So the handler won't be disabled.
- * in_atomic() will report different values based on !CONFIG_PREEMPT_COUNT.
*/
#define faulthandler_disabled() (pagefault_disabled() || in_atomic())
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index c2f1fd95a821..3d2f7ded0fee 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -86,12 +86,8 @@ config PREEMPT_RT
endchoice
-config PREEMPT_COUNT
- bool
-
config PREEMPTION
bool
- select PREEMPT_COUNT
config PREEMPT_DYNAMIC
bool "Preemption behaviour defined on boot"
diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug
index 1b0c41d490f0..34d395f007a7 100644
--- a/kernel/rcu/Kconfig.debug
+++ b/kernel/rcu/Kconfig.debug
@@ -122,7 +122,6 @@ config RCU_STRICT_GRACE_PERIOD
bool "Provide debug RCU implementation with short grace periods"
depends on DEBUG_KERNEL && RCU_EXPERT && NR_CPUS <= 4 && !TINY_RCU
default n
- select PREEMPT_COUNT if PREEMPT=n
help
Select this option to build an RCU variant that is strict about
grace periods, making them as short as it can. This limits
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 79aea7df4345..c21a83e7b534 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2478,7 +2478,7 @@ static __latent_entropy void rcu_core(void)
WARN_ON_ONCE(!rdp->beenonline);
/* Report any deferred quiescent states if preemption enabled. */
- if (IS_ENABLED(CONFIG_PREEMPT_COUNT) && (!(preempt_count() & PREEMPT_MASK))) {
+ if (!(preempt_count() & PREEMPT_MASK)) {
rcu_preempt_deferred_qs(current);
} else if (rcu_preempt_need_deferred_qs(current)) {
set_tsk_need_resched(current);
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 438ecae6bd7e..432aa6e6cff7 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -29,7 +29,7 @@ static bool rcu_rdp_is_offloaded(struct rcu_data *rdp)
(IS_ENABLED(CONFIG_HOTPLUG_CPU) && lockdep_is_cpus_held()) ||
rcu_lockdep_is_held_nocb(rdp) ||
(rdp == this_cpu_ptr(&rcu_data) &&
- !(IS_ENABLED(CONFIG_PREEMPT_COUNT) && preemptible())) ||
+ !preemptible()) ||
rcu_current_is_nocb_kthread(rdp)),
"Unsafe read of RCU_NOCB offloaded state"
);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ee28253c9ac0..938f41569ae9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5111,8 +5111,7 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
* finish_task_switch() for details.
*
* finish_task_switch() will drop rq->lock() and lower preempt_count
- * and the preempt_enable() will end up enabling preemption (on
- * PREEMPT_COUNT kernels).
+ * and the preempt_enable() will end up enabling preemption.
*/
finish_task_switch(prev);
@@ -9901,9 +9900,6 @@ void __cant_sleep(const char *file, int line, int preempt_offset)
if (irqs_disabled())
return;
- if (!IS_ENABLED(CONFIG_PREEMPT_COUNT))
- return;
-
if (preempt_count() > preempt_offset)
return;
@@ -9933,9 +9929,6 @@ void __cant_migrate(const char *file, int line)
if (is_migration_disabled(current))
return;
- if (!IS_ENABLED(CONFIG_PREEMPT_COUNT))
- return;
-
if (preempt_count() > 0)
return;
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index bcbe60d6c80c..3dd4c8ccc7b2 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1230,7 +1230,6 @@ config PROVE_LOCKING
select DEBUG_RWSEMS
select DEBUG_WW_MUTEX_SLOWPATH
select DEBUG_LOCK_ALLOC
- select PREEMPT_COUNT if !ARCH_NO_PREEMPT
select TRACE_IRQFLAGS
default n
help
@@ -1431,7 +1430,6 @@ config DEBUG_LOCKDEP
config DEBUG_ATOMIC_SLEEP
bool "Sleep inside atomic section checking"
- select PREEMPT_COUNT
depends on DEBUG_KERNEL
depends on !ARCH_NO_PREEMPT
help
diff --git a/mm/slub.c b/mm/slub.c
index 862dbd9af4f5..b01767226476 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3106,19 +3106,15 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
{
void *p;
-#ifdef CONFIG_PREEMPT_COUNT
/*
* We may have been preempted and rescheduled on a different
* cpu before disabling preemption. Need to reload cpu area
* pointer.
*/
c = slub_get_cpu_ptr(s->cpu_slab);
-#endif
-
p = ___slab_alloc(s, gfpflags, node, addr, c);
-#ifdef CONFIG_PREEMPT_COUNT
slub_put_cpu_ptr(s->cpu_slab);
-#endif
+
return p;
}
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/SRCU-T b/tools/testing/selftests/rcutorture/configs/rcu/SRCU-T
index c70cf0405f24..e332b9b4d8c3 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/SRCU-T
+++ b/tools/testing/selftests/rcutorture/configs/rcu/SRCU-T
@@ -9,4 +9,3 @@ CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_PROVE_LOCKING=y
CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
CONFIG_DEBUG_ATOMIC_SLEEP=y
-#CHECK#CONFIG_PREEMPT_COUNT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/SRCU-U b/tools/testing/selftests/rcutorture/configs/rcu/SRCU-U
index bc9eeabaa1b1..fac0047579c2 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/SRCU-U
+++ b/tools/testing/selftests/rcutorture/configs/rcu/SRCU-U
@@ -7,4 +7,3 @@ CONFIG_PREEMPT_DYNAMIC=n
CONFIG_RCU_TRACE=n
CONFIG_DEBUG_LOCK_ALLOC=n
CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
-CONFIG_PREEMPT_COUNT=n
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TINY01 b/tools/testing/selftests/rcutorture/configs/rcu/TINY01
index 0953c52fcfd7..8363b0b546b7 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TINY01
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TINY01
@@ -11,4 +11,3 @@ CONFIG_RCU_TRACE=n
#CHECK#CONFIG_RCU_STALL_COMMON=n
CONFIG_DEBUG_LOCK_ALLOC=n
CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
-CONFIG_PREEMPT_COUNT=n
diff --git a/tools/testing/selftests/rcutorture/doc/TINY_RCU.txt b/tools/testing/selftests/rcutorture/doc/TINY_RCU.txt
index a75b16991a92..4c596905b6b2 100644
--- a/tools/testing/selftests/rcutorture/doc/TINY_RCU.txt
+++ b/tools/testing/selftests/rcutorture/doc/TINY_RCU.txt
@@ -4,7 +4,6 @@ This document gives a brief rationale for the TINY_RCU test cases.
Kconfig Parameters:
CONFIG_DEBUG_LOCK_ALLOC -- Do all three and none of the three.
-CONFIG_PREEMPT_COUNT
CONFIG_RCU_TRACE
The theory here is that randconfig testing will hit the other six possible
diff --git a/tools/testing/selftests/rcutorture/doc/TREE_RCU-kconfig.txt b/tools/testing/selftests/rcutorture/doc/TREE_RCU-kconfig.txt
index 42acb1a64ce1..9e851c80c5eb 100644
--- a/tools/testing/selftests/rcutorture/doc/TREE_RCU-kconfig.txt
+++ b/tools/testing/selftests/rcutorture/doc/TREE_RCU-kconfig.txt
@@ -42,7 +42,6 @@ CONFIG_64BIT
Used only to check CONFIG_RCU_FANOUT value, inspection suffices.
-CONFIG_PREEMPT_COUNT
CONFIG_PREEMPT_RCU
Redundant with CONFIG_PREEMPT, ignore.
diff --git a/tools/testing/selftests/rcutorture/formal/srcu-cbmc/src/config.h b/tools/testing/selftests/rcutorture/formal/srcu-cbmc/src/config.h
index 283d7103334f..d0d485d48649 100644
--- a/tools/testing/selftests/rcutorture/formal/srcu-cbmc/src/config.h
+++ b/tools/testing/selftests/rcutorture/formal/srcu-cbmc/src/config.h
@@ -8,7 +8,6 @@
#undef CONFIG_HOTPLUG_CPU
#undef CONFIG_MODULES
#undef CONFIG_NO_HZ_FULL_SYSIDLE
-#undef CONFIG_PREEMPT_COUNT
#undef CONFIG_PREEMPT_RCU
#undef CONFIG_PROVE_RCU
#undef CONFIG_RCU_NOCB_CPU
On Mon, Sep 19, 2022 at 7:07 AM Wedson Almeida Filho <wedsonaf@gmail.com> wrote:
>
> For GFP_ATOMIC, we could use preempt_count except that it isn't always
> enabled. Conveniently, it is already separated out into its own config.
> How do people feel about removing CONFIG_PREEMPT_COUNT and having the
> count always enabled?
>
> We would then have a way to reliably detect when we are in atomic
> context [..]
No.
First off, it's not true. There are non-preempt atomic contexts too,
like interrupts disabled etc. Can you enumerate all those? Possibly.
But more importantly, doing "depending on context, I silently and
automatically do different things" is simply WRONG. Don't do it. It's
a disaster.
Doing that for *debugging* is fine. So having a
WARN_ON_ONCE(in_atomic_context());
is a perfectly fine debug check to find people who do bad bad things,
and we have lots of variations of that theme (ie might_sleep(), but
also things like lockdep_assert_held() and friends that assert other
constraints entirely).
But having *behavior changes* depending on context is a total
disaster. And that's invariably why people want this disgusting thing.
They want to do broken things like "I want to allocate memory, and I
don't want to care where I am, so I want the memory allocator to just
do the whole GFP_ATOMIC for me".
And that is FUNDAMENTALLY BROKEN.
If you want to allocate memory, and you don't want to care about what
context you are in, or whether you are holding spinlocks etc, then you
damn well shouldn't be doing kernel programming. Not in C, and not in
Rust.
It really is that simple. Contexts like this ("I am in a critical
region, I must not do memory allocation or use sleeping locks") is
*fundamental* to kernel programming. It has nothing to do with the
language, and everything to do with the problem space.
So don't go down this "let's have the allocator just know if you're in
an atomic context automatically" path. It's wrong. It's complete
garbage. It may generate kernel code that superficially "works", but
one that is fundamentally broken, and will fail and becaome unreliable
under memory pressure.
The thing is, when you do kernel programming, and you're holding a
spinlock and need to allocate memory, you generally shouldn't allocate
memory at all, you should go "Oh, maybe I need to do the allocation
*before* getting the lock".
And it's not just locks. It's also "I'm in a hardware interrupt", but
now the solution is fundamentally different. Maybe you still want to
do pre-allocation, but now you're a driver interrupt and the
pre-allocation has to happen in another code sequence entirely,
because obviously the interrupt itself is asynchronous.
But more commonly, you just want to use GFP_ATOMIC, and go "ok, I know
the VM layer tries to keep a _limited_ set of pre-allocated buffers
around".
But it needs to be explicit, because that GFP_ATOMIC pool of
allocations really is very limited, and you as the allocator need to
make it *explicit* that yeah, now you're not just doing a random
allocation, you are doing one of these *special* allocations that will
eat into that very limited global pool of allocations.
So no, you cannot and MUST NOT have an allocator that silently just
dips into that special pool, without the user being aware or
requesting it.
That really is very very fundamental. Allocators that "just work" in
different contexts are broken garbage within the context of a kernel.
Please just accept this, and really *internalize* it. Because this
isn't actually just about allocators. Allocators may be one very
common special case of this kind of issue, and they come up quite
often as a result, but this whole "your code needs to *understand* the
special restrictions that the kernel is under" is something that is
quite fundamental in general.
It shows up in various other situations too, like "Oh, this code can
run in an interrupt handler" (which is *different* from the atomicity
of just "while holding a lock", because it implies a level of random
nesting that is very relevant for locking).
Or sometimes it's subtler things than just correctness, ie "I'm
running under a critical lock, so I must be very careful because there
are scalability issues". The code may *work* just fine, but things
like tasklist_lock are very very special.
So embrace that kernel requirement. Kernels are special. We do things,
and we have constraints that pretty much no other environment has.
It is often what makes kernel programming interesting - because it's
challenging. We have a very limited stack. We have some very direct
and deep interactions with the CPU, with things like IO and
interrupts. Some constraints are always there, and others are
context-dependent.
The whole "really know what context this code is running within" is
really important. You may want to write very explicit comments about
it.
And you definitely should always write your code knowing about it.
Linus
On Mon, Sep 19, 2022 at 9:09 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The whole "really know what context this code is running within" is
> really important. You may want to write very explicit comments about
> it.
Side note: a corollary of this is that people should avoid "dynamic
context" things like the plague, because it makes for such pain when
the context isn't statically obvious.
So things like conditional locking should generally be avoided as much
as humanly possible. Either you take the lock or you don't - don't
write code where the lock context depends on some argument value or
flag, for example.
Code like this is fine:
if (some_condition) {
spin_lock(&mylock);
xyz();
spin_unlock(&mylock);
}
because 'xyz()' is always run in the same context. But avoid patterns like
if (some_condition)
spin_lock(&mylock);
xyz();
if (same_condition)
spin_unlock(&mylock);
where now 'xyz()' sometimes does something with the lock held, and
sometimes not. That way lies insanity.
Now, obviously, the context for helper functions (like the Rust kernel
crate is, pretty much by definition) obviously depends on the context
of the callers of said helpers, so in that sense the above kind of
"sometimes in locked context, sometimes not" will always be the case.
So those kinds of helper functions will generally need to be either
insensitive to context and usable in all contexts (best), or
documented - and verify with debug code like 'might_sleep()' - that
they only run in certain contexts.
And then in the worst case there's a gfp_flag that says "you can only
do these kinds of allocations" or whatever, but even then you should
strive to never have other dynamic behavior (ie please try to avoid
behavior like having a "already locked" argument and then taking a
lock depending on that).
Because if you follow those rules, at least you can statically see the
context from a call chain (so, for example, the stack trace of an oops
will make the context unambiguous, because there's hopefully no lock
or interrupt disabling or similar that has some dynamic behavior like
in that second example of "xyz()".
Do we have places in the kernel that do conditional locking? Yes we
do. Examples like that second case do exist. It's bad. Sometimes you
can't avoid it. But you can always *strive* to avoid it, and
minimizing those kinds of "context depends on other things"
situations.
And we should strive very hard to make those kinds of contexts very
clear and explicit and not dynamic exactly because it's so important
in the kernel, and it has subtle implications wrt other locking, and
memory allocations.
Linus
On Mon, Sep 19, 2022 at 10:20:52AM -0700, Linus Torvalds wrote: > On Mon, Sep 19, 2022 at 9:09 AM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > The whole "really know what context this code is running within" is > > really important. You may want to write very explicit comments about > > it. > > Side note: a corollary of this is that people should avoid "dynamic > context" things like the plague, because it makes for such pain when > the context isn't statically obvious. As you know, we're trying to guarantee the absence of undefined behaviour for code written in Rust. And the context is _really_ important, so important that leaving it up to comments isn't enough. I don't care as much about allocation flags as I do about sleeping in an rcu read-side critical region. When CONFIG_PREEMPT=n, if some CPU makes the mistake of sleeping between rcu_read_lock()/rcu_read_unlock(), RCU will take that as a quiescent state, which may cause unsuspecting code waiting for a grace period to wake up too early and potentially free memory that is still in use, which is obviously undefined behaviour. We generally have two routes to avoid undefined behaviour: detect at compile time (and fail compilation) or at runtime (and stop things before they go too far). The former, while feasible, would require some static analysi or passing tokens as arguments to guarantee that we're in sleepable context when sleeping (all ellided at compile time, so zero-cost in terms of run-time performance), but likely painful to program use. Always having preempt_count would allow us to detect such issues in RCU at runtime (for both C and Rust) and prevent user-after-frees. Do you have an opinion on the above? Cheers, -Wedson
On Mon, Sep 19, 2022 at 11:05 AM Wedson Almeida Filho
<wedsonaf@gmail.com> wrote:
>
> As you know, we're trying to guarantee the absence of undefined
> behaviour for code written in Rust. And the context is _really_
> important, so important that leaving it up to comments isn't enough.
You need to realize that
(a) reality trumps fantasy
(b) kernel needs trump any Rust needs
And the *reality* is that there are no absolute guarantees. Ever. The
"Rust is safe" is not some kind of absolute guarantee of code safety.
Never has been. Anybody who believes that should probably re-take
their kindergarten year, and stop believing in the Easter bunny and
Santa Claus.
Even "safe" rust code in user space will do things like panic when
things go wrong (overflows, allocation failures, etc). If you don't
realize that that is NOT some kind of true safely, I don't know what
to say.
Not completing the operation at all, is *not* really any better than
getting the wrong answer, it's only more debuggable.
In the kernel, "panic and stop" is not an option (it's actively worse
than even the wrong answer, since it's really not debugable), so the
kernel version of "panic" is "WARN_ON_ONCE()" and continue with the
wrong answer.
So this is something that I really *need* the Rust people to
understand. That whole reality of "safe" not being some absolute
thing, and the reality that the kernel side *requires* slightly
different rules than user space traditionally does.
> I don't care as much about allocation flags as I do about sleeping in an
> rcu read-side critical region. When CONFIG_PREEMPT=n, if some CPU makes
> the mistake of sleeping between rcu_read_lock()/rcu_read_unlock(), RCU
> will take that as a quiescent state, which may cause unsuspecting code
> waiting for a grace period to wake up too early and potentially free
> memory that is still in use, which is obviously undefined behaviour.
So?
You had a bug. Shit happens. We have a lot of debugging tools that
will give you a *HUGE* warning when said shit happens, including
sending automated reports to the distro maker. And then you fix the
bug.
Think of that "debugging tools give a huge warning" as being the
equivalent of std::panic in standard rust. Yes, the kernel will
continue (unless you have panic-on-warn set), because the kernel
*MUST* continue in order for that "report to upstream" to have a
chance of happening.
So it's technically a veryu different implementation from std:panic,
but you should basically see it as exactly that: a *technical*
difference, not a conceptual one. The rules for how the kernel deals
with bugs is just different, because we don't have core-files and
debuggers in the general case.
(And yes, you can have a kernel debugger, and you can just have the
WARN_ON_ONCE trigger the debugger, but think of all those billions of
devices that are in normal users hands).
And yes, in certain configurations, even those warnings will be turned
off because the state tracking isn't done. Again, that's just reality.
You don't need to use those configurations yourself if you don't like
them, but that does *NOT* mean that you get to say "nobody else gets
to use those configurations either".
Deal with it.
Or, you know, if you can't deal with the rules that the kernel
requires, then just don't do kernel programming.
Because in the end it really is that simple. I really need you to
understand that Rust in the kernel is dependent on *kernel* rules. Not
some other random rules that exist elsewhere.
Linus
On Mon, Sep 19, 2022 at 01:42:44PM -0700, Linus Torvalds wrote: > On Mon, Sep 19, 2022 at 11:05 AM Wedson Almeida Filho > <wedsonaf@gmail.com> wrote: > > > > As you know, we're trying to guarantee the absence of undefined > > behaviour for code written in Rust. And the context is _really_ > > important, so important that leaving it up to comments isn't enough. > > You need to realize that > > (a) reality trumps fantasy > > (b) kernel needs trump any Rust needs > > And the *reality* is that there are no absolute guarantees. Ever. The > "Rust is safe" is not some kind of absolute guarantee of code safety. > Never has been. Anybody who believes that should probably re-take > their kindergarten year, and stop believing in the Easter bunny and > Santa Claus. > > Even "safe" rust code in user space will do things like panic when > things go wrong (overflows, allocation failures, etc). If you don't > realize that that is NOT some kind of true safely, I don't know what > to say. No one is talking about absolute safety guarantees. I am talking about specific ones that Rust makes: these are well-documented and formally defined. > Not completing the operation at all, is *not* really any better than > getting the wrong answer, it's only more debuggable. > > In the kernel, "panic and stop" is not an option (it's actively worse > than even the wrong answer, since it's really not debugable), so the > kernel version of "panic" is "WARN_ON_ONCE()" and continue with the > wrong answer. > > So this is something that I really *need* the Rust people to > understand. That whole reality of "safe" not being some absolute > thing, and the reality that the kernel side *requires* slightly > different rules than user space traditionally does. > > > I don't care as much about allocation flags as I do about sleeping in an > > rcu read-side critical region. When CONFIG_PREEMPT=n, if some CPU makes > > the mistake of sleeping between rcu_read_lock()/rcu_read_unlock(), RCU > > will take that as a quiescent state, which may cause unsuspecting code > > waiting for a grace period to wake up too early and potentially free > > memory that is still in use, which is obviously undefined behaviour. > > So? > > You had a bug. Shit happens. We have a lot of debugging tools that > will give you a *HUGE* warning when said shit happens, including > sending automated reports to the distro maker. And then you fix the > bug. > > Think of that "debugging tools give a huge warning" as being the > equivalent of std::panic in standard rust. Yes, the kernel will > continue (unless you have panic-on-warn set), because the kernel > *MUST* continue in order for that "report to upstream" to have a > chance of happening. > > So it's technically a veryu different implementation from std:panic, > but you should basically see it as exactly that: a *technical* > difference, not a conceptual one. The rules for how the kernel deals > with bugs is just different, because we don't have core-files and > debuggers in the general case. > > (And yes, you can have a kernel debugger, and you can just have the > WARN_ON_ONCE trigger the debugger, but think of all those billions of > devices that are in normal users hands). > > And yes, in certain configurations, even those warnings will be turned > off because the state tracking isn't done. Again, that's just reality. > You don't need to use those configurations yourself if you don't like > them, but that does *NOT* mean that you get to say "nobody else gets > to use those configurations either". > > Deal with it. While I disagree with some of what you write, the point is taken. But I won't give up on Rust guarantees just yet, I'll try to find ergonomic ways to enforce them at compile time. Thanks, -Wedson
On Mon, Sep 19, 2022 at 3:35 PM Wedson Almeida Filho <wedsonaf@gmail.com> wrote:
>
> No one is talking about absolute safety guarantees. I am talking about
> specific ones that Rust makes: these are well-documented and formally
> defined.
If you cannot get over the fact that the kernel may have other
requirements that trump any language standards, we really can't work
together.
Those Rust rules may make sense in other environments. But the kernel
really does have hard requirements that you continue to limp along
even if some fundamental rule has been violated. Exactly because
there's often no separate environment outside the kernel that can deal
with it.
End result: a compiler - or language infrastructure - that says "my
rules are so ingrained that I cannot do that" is not one that is valid
for kernel work.
This is not really any different from the whole notion of "allocation
failures cannot panic" that Rust people seemed to readily understand
is a major kernel requirement, and that the kernel needed a graceful
failure return instead of a hard panic.
Also note that the kernel is perfectly willing to say "I will use
compiler flags that disable certain guarantees". We do it all the
time.
For example, the C standard has a lot of "the compiler is allowed to
make this assumption". And then we disagree with those, and so "kernel
C" is different.
For example, the standard says that dereferencing a NULL pointer is
undefined behavior, so a C compiler can see a dereference of a pointer
to be a guarantee that said pointer isn't NULL, and remove any
subsequent NULL pointer tests.
That turns out to be one of those "obviously true in a perfect world,
but problematic in a real world with bugs", and we tell the compiler
to not do that by passing it the '-fno-delete-null-pointer-checks'
flag, because the compiler _depending_ on undefined behavior and
changing code generation in the build ends up being a really bad idea
from a security standpoint.
Now, in C, most of these kinds of things come from the C standard
being very lax, and having much too many "this is undefined behavior"
rules. So in almost all cases we end up saying "we want the
well-defined implementation, not the 'strictly speaking, the language
specs allows the compiler to do Xyz".
Rust comes from a different direction than C, and it may well be that
we very much need some of the rules to be relaxed.
And hey, Rust people do know about "sometimes the rules have to be
relaxed". When it comes to integer overflows etc, there's a
"overflow-checks" flag, typically used for debug vs release builds.
The kernel has similar issues where sometimes you might want the
strict checking (lockdep etc), and sometimes you may end up being less
strict and miss a few rules (eg "we don't maintain a preempt count for
this config, so we can't check RCU mode violations").
> But I won't give up on Rust guarantees just yet, I'll try to find
> ergonomic ways to enforce them at compile time.
I think that compile-time static checking is wonderful, and as much as
possible should be done 100% statically so that people cannot write
incorrect programs.
But we all know that static checking is limited, and then the amount
of dynamic checking for violations is often something that will have
to depend on environment flags, because it may come with an exorbitant
price in the checking.
Exactly like integer overflow checking.
Linus
On Mon, Sep 19, 2022 at 04:39:56PM -0700, Linus Torvalds wrote: > On Mon, Sep 19, 2022 at 3:35 PM Wedson Almeida Filho <wedsonaf@gmail.com> wrote: > > > > No one is talking about absolute safety guarantees. I am talking about > > specific ones that Rust makes: these are well-documented and formally > > defined. > > If you cannot get over the fact that the kernel may have other > requirements that trump any language standards, we really can't work > together. > > Those Rust rules may make sense in other environments. But the kernel > really does have hard requirements that you continue to limp along > even if some fundamental rule has been violated. Exactly because > there's often no separate environment outside the kernel that can deal > with it. > > End result: a compiler - or language infrastructure - that says "my > rules are so ingrained that I cannot do that" is not one that is valid > for kernel work. > > This is not really any different from the whole notion of "allocation > failures cannot panic" that Rust people seemed to readily understand > is a major kernel requirement, and that the kernel needed a graceful > failure return instead of a hard panic. I am not a programming language/compiler person. My background is in kernel programming (Linux is not the only kernel in existence) so I have no difficulty whatsoever with kernel requirements; the graceful handling of allocation failures that you brought up is in fact something I added soon after I started started working on rust-for-linux and realised (in admittedly a bit of a shock) that userland Rust didn't have this at the time. > > Also note that the kernel is perfectly willing to say "I will use > compiler flags that disable certain guarantees". We do it all the > time. > > For example, the C standard has a lot of "the compiler is allowed to > make this assumption". And then we disagree with those, and so "kernel > C" is different. > > For example, the standard says that dereferencing a NULL pointer is > undefined behavior, so a C compiler can see a dereference of a pointer > to be a guarantee that said pointer isn't NULL, and remove any > subsequent NULL pointer tests. > > That turns out to be one of those "obviously true in a perfect world, > but problematic in a real world with bugs", and we tell the compiler > to not do that by passing it the '-fno-delete-null-pointer-checks' > flag, because the compiler _depending_ on undefined behavior and > changing code generation in the build ends up being a really bad idea > from a security standpoint. > > Now, in C, most of these kinds of things come from the C standard > being very lax, and having much too many "this is undefined behavior" > rules. So in almost all cases we end up saying "we want the > well-defined implementation, not the 'strictly speaking, the language > specs allows the compiler to do Xyz". > > Rust comes from a different direction than C, and it may well be that > we very much need some of the rules to be relaxed. Sure. In fact, we are likely to be able to influence the Rust language more and more quickly than C. So if things don't make sense, we may be able to change them. There are also opportunities here, for example, a compatible memory model and guaranteed honouring of dependency chains for more efficient synchronisation primitives. I'm not claiming this is surely going to happen or that it's easy, just that there's an opportunity to do better than C here. > And hey, Rust people do know about "sometimes the rules have to be > relaxed". When it comes to integer overflows etc, there's a > "overflow-checks" flag, typically used for debug vs release builds. > > The kernel has similar issues where sometimes you might want the > strict checking (lockdep etc), and sometimes you may end up being less > strict and miss a few rules (eg "we don't maintain a preempt count for > this config, so we can't check RCU mode violations"). > > > But I won't give up on Rust guarantees just yet, I'll try to find > > ergonomic ways to enforce them at compile time. > > I think that compile-time static checking is wonderful, and as much as > possible should be done 100% statically so that people cannot write > incorrect programs. > > But we all know that static checking is limited, and then the amount > of dynamic checking for violations is often something that will have > to depend on environment flags, because it may come with an exorbitant > price in the checking. If we can't find an ergonomic way to enforce safety guarantees, we have fallbacks like unsafe functions or runtime enforcement. And as you point out, if the runtime cost is too high in certain scenarios, we do allow users to give up safety with a config as in the integer overflow case you mentioned. We do try to minimise those. Cheers, -Wedson
I think there is some amount of talking past each other here. Rust's rules are that a function that's safe must not exhibit UB, no matter what arguments they're called with. This can be done with static checking or dynamic checking, with obvious trade offs between the two. We've had pretty good success, thus far, modeling various kernel subsystems with APIs that follow this rule. But when there's not a good way, consistent with the kernel's idioms, to expose a kernel API in Rust that's safe, that doesn't mean it's impossible! In those cases we expose an `unsafe fn` in Rust. This means that the caller (e.g., driver code) needs to ensure it meets the required pre-conditions for calling that function. This is how we square the circle of: How do we prioritize the kernel's goals, while also staying consistent with Rust's notion of safety? Wedson's point is that, when possible, finding ways to expose safe functions is better, since it puts less of a burden on driver authors. But, sadly, we all know we won't be able to find them in all circumstances -- we just want to have it as a goal to find them whenever possible. Regards, Alex On Mon, Sep 19, 2022 at 7:40 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Mon, Sep 19, 2022 at 3:35 PM Wedson Almeida Filho <wedsonaf@gmail.com> wrote: > > > > No one is talking about absolute safety guarantees. I am talking about > > specific ones that Rust makes: these are well-documented and formally > > defined. > > If you cannot get over the fact that the kernel may have other > requirements that trump any language standards, we really can't work > together. > > Those Rust rules may make sense in other environments. But the kernel > really does have hard requirements that you continue to limp along > even if some fundamental rule has been violated. Exactly because > there's often no separate environment outside the kernel that can deal > with it. > > End result: a compiler - or language infrastructure - that says "my > rules are so ingrained that I cannot do that" is not one that is valid > for kernel work. > > This is not really any different from the whole notion of "allocation > failures cannot panic" that Rust people seemed to readily understand > is a major kernel requirement, and that the kernel needed a graceful > failure return instead of a hard panic. > > Also note that the kernel is perfectly willing to say "I will use > compiler flags that disable certain guarantees". We do it all the > time. > > For example, the C standard has a lot of "the compiler is allowed to > make this assumption". And then we disagree with those, and so "kernel > C" is different. > > For example, the standard says that dereferencing a NULL pointer is > undefined behavior, so a C compiler can see a dereference of a pointer > to be a guarantee that said pointer isn't NULL, and remove any > subsequent NULL pointer tests. > > That turns out to be one of those "obviously true in a perfect world, > but problematic in a real world with bugs", and we tell the compiler > to not do that by passing it the '-fno-delete-null-pointer-checks' > flag, because the compiler _depending_ on undefined behavior and > changing code generation in the build ends up being a really bad idea > from a security standpoint. > > Now, in C, most of these kinds of things come from the C standard > being very lax, and having much too many "this is undefined behavior" > rules. So in almost all cases we end up saying "we want the > well-defined implementation, not the 'strictly speaking, the language > specs allows the compiler to do Xyz". > > Rust comes from a different direction than C, and it may well be that > we very much need some of the rules to be relaxed. > > And hey, Rust people do know about "sometimes the rules have to be > relaxed". When it comes to integer overflows etc, there's a > "overflow-checks" flag, typically used for debug vs release builds. > > The kernel has similar issues where sometimes you might want the > strict checking (lockdep etc), and sometimes you may end up being less > strict and miss a few rules (eg "we don't maintain a preempt count for > this config, so we can't check RCU mode violations"). > > > But I won't give up on Rust guarantees just yet, I'll try to find > > ergonomic ways to enforce them at compile time. > > I think that compile-time static checking is wonderful, and as much as > possible should be done 100% statically so that people cannot write > incorrect programs. > > But we all know that static checking is limited, and then the amount > of dynamic checking for violations is often something that will have > to depend on environment flags, because it may come with an exorbitant > price in the checking. > > Exactly like integer overflow checking. > > Linus -- All that is necessary for evil to succeed is for good people to do nothing.
On Mon, Sep 19, 2022 at 4:50 PM Alex Gaynor <alex.gaynor@gmail.com> wrote:
>
> Rust's rules are that a function that's safe must not exhibit UB, no
> matter what arguments they're called with. This can be done with
> static checking or dynamic checking, with obvious trade offs between
> the two.
I think you are missing just how many things are "unsafe" in certain
contexts and *cannot* be validated.
This is not some kind of "a few special things".
This is things like absolutely _anything_ that allocates memory, or
takes a lock, or does a number of other things.
Those things are simply not "safe" if you hold a spinlock, or if you
are in a RCU read-locked region.
And there is literally no way to check for it in certain configurations. None.
So are you going to mark every single function that takes a mutex as
being "unsafe"?
Or are you just going to accept and understand that "hey, exactly like
with integer overflows, sometimes it will be checked, and sometimes it
just won't be".
Because that is literally the reality of the kernel. Sometimes you
WILL NOT have the checks, and you literally CANNOT have the checks.
This is just how reality is. You don't get to choose the universe you live in.
Linus
On Mon, Sep 19, 2022 at 04:58:06PM -0700, Linus Torvalds wrote: > On Mon, Sep 19, 2022 at 4:50 PM Alex Gaynor <alex.gaynor@gmail.com> wrote: > > > > Rust's rules are that a function that's safe must not exhibit UB, no > > matter what arguments they're called with. This can be done with > > static checking or dynamic checking, with obvious trade offs between > > the two. > > I think you are missing just how many things are "unsafe" in certain > contexts and *cannot* be validated. > > This is not some kind of "a few special things". > > This is things like absolutely _anything_ that allocates memory, or > takes a lock, or does a number of other things. > > Those things are simply not "safe" if you hold a spinlock, or if you > are in a RCU read-locked region. > > And there is literally no way to check for it in certain configurations. None. > > So are you going to mark every single function that takes a mutex as > being "unsafe"? > In this early experiment stage, if something is unsafe per Rust safety requirements, maybe we should mark it as "unsafe"? Not because Rust safety needs trump kernel needs, but because we need showcases to the Rust langauge people the things that are not working very well in Rust today. I definitely agree that we CANNOT change kernel behaviors to solely fulfil Rust safety requirements, we (Rust-for-Linux people) should either find a way to check in compiler time or just mark it as "unsafe". Maybe I'm naive ;-) But keeping Rust safety requirements as they are helps communication with the people on the other side (Rust langauge/compiler): "Hey, I did everything per your safety requirements, and it ends like this, I'm not happy about it, could you figure out something helpful? After all, Rust is a *system programming" language, it should be able to handle things like these". Or we want to say "kernel is special, so please give me a option so that I don't need to worry about these UBs and deal with my real problems"? I don't have the first hand experience, but seems this is what we have been doing with C for many years. Do we want to try a new strategy? ;-) But perhaps it's not new, maybe it's done a few times already but didn't end well.. Anyway, if I really want to teach Rust language/compiler people "I know what I'm doing, the problem is that the language is not ready". What should I do? Regards, Boqun > Or are you just going to accept and understand that "hey, exactly like > with integer overflows, sometimes it will be checked, and sometimes it > just won't be". > > Because that is literally the reality of the kernel. Sometimes you > WILL NOT have the checks, and you literally CANNOT have the checks. > > This is just how reality is. You don't get to choose the universe you live in. > > Linus
I don't know if there was ever a direct answer to this question; I was hoping someone with more involvement in the project would jump in. > In this early experiment stage, if something is unsafe per Rust safety > requirements, maybe we should mark it as "unsafe"? Not because Rust > safety needs trump kernel needs, but because we need showcases to the > Rust langauge people the things that are not working very well in Rust > today. > > I definitely agree that we CANNOT change kernel behaviors to solely > fulfil Rust safety requirements, we (Rust-for-Linux people) should > either find a way to check in compiler time or just mark it as "unsafe". From the perspective of a Linux outsider, marking huge swaths of Rust code "unsafe" doesn't actually sound incorrect to me. If a function may exhibit undefined behavior, it's unsafe; thus, it should be marked as such. True, you lose some of Rust's guarantees within the code marked "unsafe", but this is what permits Rust to enforce those guarantees elsewhere. Having lots of uses of "unsafe" in kernel code is probably to be expected, and (in my opinion) not actually a sign that things "are not working very well in Rust." It's not even particularly verbose, if the callers of unsafe functions are generally also marked "unsafe"; in these cases, only the signatures are affected. The last sentence of Boqun's quote above captures the intent of "unsafe" perfectly: it simply indicates that code cannot be proven at compile time not to exhibit undefined behavior. It's important for systems languages to enable such code to be written, which is precisely why Rust has the "unsafe" keyword. On Mon, Sep 19, 2022 at 6:45 PM Boqun Feng <boqun.feng@gmail.com> wrote: > > On Mon, Sep 19, 2022 at 04:58:06PM -0700, Linus Torvalds wrote: > > On Mon, Sep 19, 2022 at 4:50 PM Alex Gaynor <alex.gaynor@gmail.com> wrote: > > > > > > Rust's rules are that a function that's safe must not exhibit UB, no > > > matter what arguments they're called with. This can be done with > > > static checking or dynamic checking, with obvious trade offs between > > > the two. > > > > I think you are missing just how many things are "unsafe" in certain > > contexts and *cannot* be validated. > > > > This is not some kind of "a few special things". > > > > This is things like absolutely _anything_ that allocates memory, or > > takes a lock, or does a number of other things. > > > > Those things are simply not "safe" if you hold a spinlock, or if you > > are in a RCU read-locked region. > > > > And there is literally no way to check for it in certain configurations. None. > > > > So are you going to mark every single function that takes a mutex as > > being "unsafe"? > > > > In this early experiment stage, if something is unsafe per Rust safety > requirements, maybe we should mark it as "unsafe"? Not because Rust > safety needs trump kernel needs, but because we need showcases to the > Rust langauge people the things that are not working very well in Rust > today. > > I definitely agree that we CANNOT change kernel behaviors to solely > fulfil Rust safety requirements, we (Rust-for-Linux people) should > either find a way to check in compiler time or just mark it as "unsafe". > > Maybe I'm naive ;-) But keeping Rust safety requirements as they are > helps communication with the people on the other side (Rust > langauge/compiler): "Hey, I did everything per your safety requirements, > and it ends like this, I'm not happy about it, could you figure out > something helpful? After all, Rust is a *system programming" language, > it should be able to handle things like these". > > Or we want to say "kernel is special, so please give me a option so that > I don't need to worry about these UBs and deal with my real problems"? > I don't have the first hand experience, but seems this is what we have > been doing with C for many years. Do we want to try a new strategy? ;-) > But perhaps it's not new, maybe it's done a few times already but didn't > end well.. > > Anyway, if I really want to teach Rust language/compiler people "I know > what I'm doing, the problem is that the language is not ready". What > should I do? > > Regards, > Boqun > > > Or are you just going to accept and understand that "hey, exactly like > > with integer overflows, sometimes it will be checked, and sometimes it > > just won't be". > > > > Because that is literally the reality of the kernel. Sometimes you > > WILL NOT have the checks, and you literally CANNOT have the checks. > > > > This is just how reality is. You don't get to choose the universe you live in. > > > > Linus
On Mon, Sep 19, 2022 at 4:58 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> This is not some kind of "a few special things".
>
> This is things like absolutely _anything_ that allocates memory, or
> takes a lock, or does a number of other things.
Examples of "number of other things" ends up being things like
"accessing user memory", which depending on what you are doing may be
very common too.
And btw, it's not only about the (multiple kinds of) atomic regions.
We have other context rules in the kernel too, like "does floating
point or vector unit calculations". Which you can actually do, but
only in a kernel_fpu_begin/kernel_fpu_end region.
Now, the floating point thing is rare enough that it's probably fine
to just say "no floating point at all in Rust code". It tends to be
very special code, so you'd write it in C or inline assembly, because
you're doing special things like using the vector unit for crypto
hashes using special CPU instructions.
Linus
Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Mon, Sep 19, 2022 at 4:58 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> This is not some kind of "a few special things".
>>
>> This is things like absolutely _anything_ that allocates memory, or
>> takes a lock, or does a number of other things.
>
> Examples of "number of other things" ends up being things like
> "accessing user memory", which depending on what you are doing may be
> very common too.
>
> And btw, it's not only about the (multiple kinds of) atomic regions.
>
> We have other context rules in the kernel too, like "does floating
> point or vector unit calculations". Which you can actually do, but
> only in a kernel_fpu_begin/kernel_fpu_end region.
>
> Now, the floating point thing is rare enough that it's probably fine
> to just say "no floating point at all in Rust code". It tends to be
> very special code, so you'd write it in C or inline assembly, because
> you're doing special things like using the vector unit for crypto
> hashes using special CPU instructions.
I just want to point out that there are ways of representing things like
the context you are running in during compile time. I won't argue they
are necessarily practical, but there are ways and by exploring those
ways some practical solutions may result.
Instead of saying:
spin_lock(&lock);
do_something();
spin_unlock(&lock);
It is possible to say:
with_spin_lock(&lock, do_something());
This can be taken a step farther and with_spin_lock can pass a ``token''
say a structure with no members that disappears at compile time that
let's the code know it has the spinlock held.
In C I would do:
struct have_spin_lock_x {
// nothing
};
do_something(struct have_spin_lock_x context_guarantee)
{
...;
}
I think most of the special contexts in the kernel can be represented in
a similar manner. A special parameter that can be passed and will
compile out.
I don't recall seeing anything like that tried in the kernel so I don't
know if it makes sense or if it would get too wordy to live, but it is
possible. If passing a free context parameter is not too wordy it would
catch silly errors, and would hopefully leave more mental space for
developers to pay attention to the other details of the problems they
are solving.
*Shrug* I don't know if rust allows for free parameters like that and
if it does I don't know if it would be cheap enough to live.
Eric
On Tue, 20 Sep 2022 10:55:27 -0500
"Eric W. Biederman" <ebiederm@xmission.com> wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
>
> > On Mon, Sep 19, 2022 at 4:58 PM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> >>
> >> This is not some kind of "a few special things".
> >>
> >> This is things like absolutely _anything_ that allocates memory, or
> >> takes a lock, or does a number of other things.
> >
> > Examples of "number of other things" ends up being things like
> > "accessing user memory", which depending on what you are doing may
> > be very common too.
> >
> > And btw, it's not only about the (multiple kinds of) atomic regions.
> >
> > We have other context rules in the kernel too, like "does floating
> > point or vector unit calculations". Which you can actually do, but
> > only in a kernel_fpu_begin/kernel_fpu_end region.
> >
> > Now, the floating point thing is rare enough that it's probably fine
> > to just say "no floating point at all in Rust code". It tends to be
> > very special code, so you'd write it in C or inline assembly,
> > because you're doing special things like using the vector unit for
> > crypto hashes using special CPU instructions.
>
> I just want to point out that there are ways of representing things
> like the context you are running in during compile time. I won't
> argue they are necessarily practical, but there are ways and by
> exploring those ways some practical solutions may result.
>
> Instead of saying:
> spin_lock(&lock);
> do_something();
> spin_unlock(&lock);
>
> It is possible to say:
> with_spin_lock(&lock, do_something());
>
> This can be taken a step farther and with_spin_lock can pass a
> ``token'' say a structure with no members that disappears at compile
> time that let's the code know it has the spinlock held.
>
> In C I would do:
> struct have_spin_lock_x {
> // nothing
> };
>
> do_something(struct have_spin_lock_x context_guarantee)
> {
> ...;
> }
>
> I think most of the special contexts in the kernel can be represented
> in a similar manner. A special parameter that can be passed and will
> compile out.
>
> I don't recall seeing anything like that tried in the kernel so I
> don't know if it makes sense or if it would get too wordy to live,
> but it is possible. If passing a free context parameter is not too
> wordy it would catch silly errors, and would hopefully leave more
> mental space for developers to pay attention to the other details of
> the problems they are solving.
>
> *Shrug* I don't know if rust allows for free parameters like that and
> if it does I don't know if it would be cheap enough to live.
I believe this was mentioned by Wedson in one of his previous emails.
This pattern is quite common in Rust code. It looks like this:
#[derive(Clone, Copy)]
pub struct Token<'a>(PhantomData<&'a ()>);
pub fn with_token<T>(f: impl for<'a> FnOnce(Token<'a>) -> T) ->
T { f(Token(PhantomData))
}
Any function that requires something can just take token by value, e.g.
with `token: Token<'_>`. Since Token is a zero-sized type (ZST), this
parameter will be omitted during code generation, so it won't affect
the ABI and this has no runtime cost.
Example on godbolt: https://godbolt.org/z/9n954cG4d, showing that the
token is actually all optimised out.
It should be noted however, atomic context is not something that a
token can represent. You can only use tokens to restrict what you *can*
do, but not what you *can't* do. There is no negative reasoning with
tokens, you can't create a function that can only be called when you
don't have token.
You can use tokens to represent non-atomic contexts, but that'll be
really painful because this requires carrying a token in almost all
functions. This kind of API also works well for FPU contexts.
Best,
Gary
> On Sep 20, 2022, at 6:39 PM, Gary Guo <gary@garyguo.net> wrote: > > It should be noted however, atomic context is not something that a > token can represent. You can only use tokens to restrict what you *can* > do, but not what you *can't* do. There is no negative reasoning with > tokens, you can't create a function that can only be called when you > don't have token. On the other hand, it ought to be feasible to implement that kind of ’negative reasoning' as a custom lint. It might not work as well as something built into the language, but it should work decently well, and could serve as a prototype for a future built-in feature.
On Wed, Sep 21, 2022 at 02:42:18AM -0400, comex wrote: > > > > On Sep 20, 2022, at 6:39 PM, Gary Guo <gary@garyguo.net> wrote: > > > > It should be noted however, atomic context is not something that a > > token can represent. You can only use tokens to restrict what you *can* > > do, but not what you *can't* do. There is no negative reasoning with > > tokens, you can't create a function that can only be called when you > > don't have token. > > On the other hand, it ought to be feasible to implement that kind of > ’negative reasoning' as a custom lint. It might not work as well as > something built into the language, but it should work decently well, > and could serve as a prototype for a future built-in feature. Interesting, do you have an example somewhere? Regards, Boqun
>> On the other hand, it ought to be feasible to implement that kind of
>> ’negative reasoning' as a custom lint. It might not work as well as
>> something built into the language, but it should work decently well,
>> and could serve as a prototype for a future built-in feature.
>
> Interesting, do you have an example somewhere?
>
> Regards,
> Boqun
After some searching, I found this, which someone wrote several years ago for a
very similar purpose:
https://github.com/thepowersgang/tag_safe/
> This is a linter designed originally for use with a kernel, where functions
> need to be marked as "IRQ safe" (meaning they are safe to call within an IRQ
> handler, and handle the case where they may interrupt themselves).
> If a function is annotated with #[req_safe(ident)] (where ident can be
> anything, and defines the type of safety) this linter will check that all
> functions called by that function are either annotated with the same
> annotation or #[is_safe(ident)], OR they do not call functions with the
> reverse #[is_unsafe(ident)] annotation.
Note that the code won't work as-is with recent rustc. rustc's API for custom
lints is not stable, and in fact rustc has deprecated linter plugins entirely
[1], though there are alternative approaches to using custom lints [2]. Still,
it's a good example of the approach.
One fundamental caveat is that it doesn't seem to have the sophistication
needed to be sound with respect to indirect calls.
For example, suppose you have a function that fetches a callback from some
structure and calls it. Whether this function is IRQ-safe depends on whether
the callback is expected to be IRQ-safe, so in order to safety-check this, you
would need an annotation on either the callback field or the function pointer
type. This is more complex than just putting annotations on function
definitions.
Or suppose you have the following code:
fn foo() {
bar(|| do_something_not_irq_safe());
}
If `foo` is expected to be IRQ-safe, this may or may not be sound, depending on
whether `bar` calls the callback immediately or saves it for later. If `bar`
saves it for later, then it could be marked unconditionally IRQ-safe. But if
`bar` calls it immediately, then it's neither IRQ-safe nor IRQ-unsafe, but
effectively generic over IRQ safety. You could pessimistically mark it
IRQ-unsafe, but Rust has tons of basic helper methods that accept callbacks and
call them immediately; not being able to use any of them in an IRQ-safe context
would be quite limiting.
In short, a fully sound approach requires not just checking which functions
call which, but having some kind of integration with the type system. This is
the kind of issue that I was thinking of when I said a custom lint may not work
as well as something built into the language.
However, I do think it's *possible* to handle it soundly from a lint,
especially if it focuses on typical use cases and relies on manual annotations
for the rest. Alternately, even an unsound lint would be a good first step.
It wouldn't really comport with Rust's ethos of making safety guarantees
ironclad rather than heuristic, but it would serve as a good proof of concept
for a future language feature, while likely being helpful in practice in the
short term.
[1] https://github.com/rust-lang/rust/pull/64675/files
[2] https://www.trailofbits.com/post/write-rust-lints-without-forking-clippy
On Sat, Aug 06, 2022 at 01:22:52PM +0200, Miguel Ojeda wrote: > > I sense possible problems here. It's common for a kernel code to pass > > flags during memory allocations. > > Yes, of course. We will support this, but how exactly it will look > like, to what extent upstream Rust's `alloc` could support our use > cases, etc. has been on discussion for a long time. > > For instance, see https://github.com/Rust-for-Linux/linux/pull/815 for > a potential extension trait approach with no allocator carried on the > type that Andreas wrote after a discussion in the last informal call: > > let a = Box::try_new_atomic(101)?; IMO it's just easier to always pass flags like this: let a = Box::try_new(GFP_KERNEL | GFP_DMA, 101)?; But if allocate_with_flags() will be somehow present in the API that's just what we need. P.S. Thanks for a quick reply!
© 2016 - 2026 Red Hat, Inc.