[RFC WIP 3/3] rust/drm: Add initial jobqueue skeleton

Philipp Stanner posted 3 patches 1 week, 6 days ago
[RFC WIP 3/3] rust/drm: Add initial jobqueue skeleton
Posted by Philipp Stanner 1 week, 6 days ago
DRM jobqueue is intended to become a load balancer, dependency manager
and timeout handler for GPU drivers with firmware scheduling.

The presented code shall give the reader an overview of the intended
architecture, notably of the API functions, DmaFence callbacks, job
lists and job control flow.

This code compiles (with warnings) but is incomplete. Notable missing
features are:
- Actually registering the fence callbacks
- Workqueue
- Timeout handling
- Actually calling the driver callback for job submissions

Moreover, the implementation of the waiting_jobs and running_jobs lists
is currently not operational because I have trouble getting it to work
with generic Job data. This is verifiable by commenting the push_job()
call in the submit_job() function back in.

Some WIP code is commented out, but is probably worth reading
nevertheless since it completes the picture.

Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
 rust/kernel/drm/jq.rs  | 398 +++++++++++++++++++++++++++++++++++++++++
 rust/kernel/drm/mod.rs |   2 +
 2 files changed, 400 insertions(+)
 create mode 100644 rust/kernel/drm/jq.rs

diff --git a/rust/kernel/drm/jq.rs b/rust/kernel/drm/jq.rs
new file mode 100644
index 000000000000..b3f7ab4655cf
--- /dev/null
+++ b/rust/kernel/drm/jq.rs
@@ -0,0 +1,398 @@
+// SPDX-License-Identifier: GPL-2.0
+//
+// Copyright (C) 2025 Red Hat Inc.:
+//   - Philipp Stanner <pstanner@redhat.com>
+//   - Danilo Krummrich <dakr@redhat.com>
+//   - David Airlie <airlied@redhat.com>
+
+//! DrmJobqueue. A load balancer, dependency manager and timeout handler for
+//! GPU job submissions.
+
+use crate::{
+    prelude::*,
+    types::ARef,
+};
+use kernel::sync::{Arc, SpinLock, new_spinlock, DmaFence, DmaFenceCtx, DmaFenceCb, DmaFenceCbFunc};
+use kernel::list::*;
+use kernel::revocable::Revocable;
+
+
+#[pin_data]
+pub struct Job<T: ?Sized> {
+    credits: u32,
+//    dependencies: List, // TODO implement dependency list
+    #[pin]
+    data: T,
+}
+
+impl<T> Job<T> {
+    /// Create a new job that can be submitted to [`Jobqueue`].
+    ///
+    /// Jobs contain driver data that will later be made available to the driver's
+    /// run_job() callback in which the job gets pushed to the GPU.
+    pub fn new(credits: u32, data: impl PinInit<T>) -> Result<Pin<KBox<Self>>> {
+        let job = pin_init!(Self {
+            credits,
+            data <- data,
+        });
+
+        KBox::pin_init(job, GFP_KERNEL)
+    }
+
+    /// Add a callback to the job. When the job gets submitted, all added callbacks will be
+    /// registered on the [`DmaFence`] the jobqueue returns for that job.
+    pub fn add_callback() -> Result {
+        Ok(())
+    }
+
+    /// Add a [`DmaFence`] or a [`DoneFence`] as this job's dependency. The job
+    /// will only be executed after that dependency has finished.
+    pub fn add_dependency() -> Result {
+        // TODO: Enqueue passed DmaFence into the job's dependency list.
+        Ok(())
+    }
+
+    /// Check if there are dependencies for this job and, if so, register the
+    /// jobqueue waker.
+    fn arm_deps() -> Result {
+        // TODO: Register DependencyWaker here if applicable.
+        Ok(())
+    }
+}
+
+// Dummy trait for the linked list.
+trait JobData {
+    fn access_data(&self) -> i32;
+}
+
+#[pin_data]
+struct EnqueuedJob<T: ?Sized> {
+    inner: Pin<KBox<Job<T>>>,
+    #[pin]
+    links: ListLinksSelfPtr<EnqueuedJob<dyn JobData>>,
+    done_fence: ARef<DmaFence<i32>>, // i32 is just dummy data. TODO: allow for replacing with `()`
+    // The hardware_fence can by definition only be set at an unknown point in
+    // time.
+    // TODO: Think about replacing this with a `struct RunningJob` which consumes
+    // an `EnqueuedJob`.
+    hardware_fence: Option<ARef<DmaFence<i32>>>, // i32 is dummy data until there's DmaFence
+                                                 // without data.
+    nr_of_deps: u32,
+}
+
+impl<T> EnqueuedJob<T> {
+    fn new(inner: Pin<KBox<Job<T>>>, fctx: &Arc<DmaFenceCtx>) -> Result<ListArc<Self>> {
+        let pseudo_data: i32 = 42;
+        let done_fence = fctx.as_arc_borrow().new_fence(pseudo_data)?;
+
+        ListArc::pin_init(try_pin_init!(Self {
+            inner,
+            links <- ListLinksSelfPtr::new(),
+            done_fence,
+            hardware_fence: None,
+            nr_of_deps: 0, // TODO implement
+        }), GFP_KERNEL)
+    }
+}
+
+impl_list_arc_safe! {
+    impl{T: ?Sized} ListArcSafe<0> for EnqueuedJob<T> { untracked; }
+}
+
+impl_list_item! {
+    impl ListItem<0> for EnqueuedJob<dyn JobData> { using ListLinksSelfPtr { self.links }; }
+}
+
+// Callback item for the hardware fences to wake / progress the jobqueue.
+struct HwFenceWaker<T> {
+    jobq: Arc<Revocable<SpinLock<InnerJobqueue>>>,
+    job: ListArc<EnqueuedJob<T>>,
+}
+
+impl<T> DmaFenceCbFunc for HwFenceWaker<T> {
+     fn callback(cb: Pin<KBox<DmaFenceCb<Self>>>) where Self: Sized {
+         // This prevents a deadlock. See Jobqueue's drop() for details.
+         let jq_guard = cb.data.jobq.try_access();
+         if jq_guard.is_none() {
+             return;
+         }
+         let jq_guard = jq_guard.unwrap();
+
+         // Take Jobqueue lock.
+         let jq = jq_guard.lock();
+         // Remove job from running list.
+         //let _ = unsafe { cb.data.job.remove() };
+         // Signal done_fence.
+         // TODO: It's more robust if the JQ makes sure that fences get signalled
+         // in order, even if the driver should signal them chaotically.
+         let _ = cb.data.job.done_fence.signal();
+         // Run more ready jobs if there's capacity.
+         //jq.start_submit_worker();
+     }
+}
+
+// Callback item for the dependency fences to wake / progress the jobqueue.
+struct DependencyWaker<T> {
+    jobq: Arc<Revocable<SpinLock<InnerJobqueue>>>,
+    job: ListArc<EnqueuedJob<T>>,
+}
+
+impl<T> DmaFenceCbFunc for DependencyWaker<T> {
+    fn callback(cb: Pin<KBox<DmaFenceCb<Self>>>) where Self: Sized {
+        // This prevents a deadlock. See Jobqueue's drop() for details.
+        let jq_guard = cb.data.jobq.try_access();
+        if jq_guard.is_none() {
+            return;
+        }
+        let jq_guard = jq_guard.unwrap();
+
+        // Take Jobqueue lock.
+        let jq = jq_guard.lock();
+
+        // TODO: Lock Contention
+        //
+        // Alright, so the Jobqueue is currently designed around a big central
+        // lock, which also protects the jobs. submit_job(), the JQ's cb on the
+        // hw_fences and its cbs on the (external) dependency fences compete for
+        // the lock. The first two should ever only run sequentially, so likely
+        // the lock. The first two should only ever run sequentially, so likely
+        //
+        // Dependency callbacks, however, could be registered and then signalled
+        // by the thousands and then all compete for the lock possibly for nothing.
+        //
+        // That can likely be improved. Maybe by just making the nr_of_deps
+        // counter atomic?
+
+        // Decrement dep counter.
+        // cb.data.job.nr_of_deps -= 1; // TODO needs to be DerefMut
+        // If counter == 0, a new job somewhere in the queue just got ready.
+        // Check if it was the head job and if yes, run all jobs possible.
+        if cb.data.job.nr_of_deps == 0 {
+//            jq.start_submit_worker();
+        }
+    }
+}
+
+struct InnerJobqueue {
+    capacity: u32,
+    waiting_jobs: List<EnqueuedJob<dyn JobData>>,
+    running_jobs: List<EnqueuedJob<dyn JobData>>,
+    submit_worker_active: bool,
+}
+
+impl InnerJobqueue {
+    fn new(capacity: u32) -> Self {
+        let waiting_jobs = List::<EnqueuedJob<dyn JobData>>::new();
+        let running_jobs = List::<EnqueuedJob<dyn JobData>>::new();
+
+        Self {
+            capacity,
+            waiting_jobs,
+            running_jobs,
+            submit_worker_active: false,
+        }
+    }
+
+    fn has_waiting_jobs(&self) -> bool {
+        !self.waiting_jobs.is_empty()
+    }
+
+    fn has_capacity_left(&self, cost: u32) -> bool {
+        let cost = cost as i64;
+        let capacity = self.capacity as i64;
+
+        if capacity - cost >= 0 {
+            return true;
+        }
+
+        false
+    }
+
+    fn update_capacity(&mut self, cost: u32) {
+        self.capacity -= cost;
+    }
+
+
+    // Called by the hw_fence callbacks, dependency callbacks, and submit_job().
+    // TODO: does submit_job() ever have to call it?
+    fn start_submit_worker(&mut self) {
+        if self.submit_worker_active {
+            return;
+        }
+
+        // TODO run submit work item
+
+        self.submit_worker_active = true;
+    }
+
+    /*
+
+    /// Push a job immediately.
+    ///
+    /// Returns true if the job ran immediately, false otherwise.
+    fn run_job(&mut self, job: &EnqueuedJob) -> bool {
+        // TODO remove job from waiting list.
+
+        // TODO Call the driver's run_job() callback.
+        let hardware_fence = run_job(&job);
+        job.hardware_fence = Some(hardware_fence);
+
+        // TODO check whether hardware_fence raced and is already signalled.
+
+        self.running_jobs.push_back(job);
+
+        // TODO Register HwFenceWaker on the hw_fence.
+    }
+
+    // Submits all ready jobs as long as there's capacity.
+    fn run_all_ready_jobs(&mut self) {
+        for job in self.waiting_jobs.reverse() {
+            if job.nr_of_deps > 0 {
+                return;
+            }
+
+            if self.has_capacity_left(job.credits) {
+                if !self.run_job(&job) {
+                    // run_job() didn't run the job immediately (because the
+                    // hw_fence did not race). Subtract the credits.
+                    self.update_capacity(job.credits);
+                }
+            } else {
+                return;
+            }
+        }
+    }
+    */
+}
+
+//#[pin_data]
+pub struct Jobqueue {
+    inner: Arc<Revocable<SpinLock<InnerJobqueue>>>,
+    fctx: Arc<DmaFenceCtx>, // TODO currently has a separate lock shared with fences
+//    #[pin]
+//    data: T,
+}
+
+impl Jobqueue {
+    /// Create a new [`Jobqueue`] with `capacity` space for jobs. `run_job` is
+    /// your driver's callback which the jobqueue will call to push a submitted
+    /// job to the hardware.
+    pub fn new<T, V>(capacity: u32, _run_job: fn(&Pin<KBox<Job<T>>>) -> ARef<DmaFence<V>>) -> Result<Self> {
+        let inner = Arc::pin_init(Revocable::new(new_spinlock!(InnerJobqueue::new(capacity))), GFP_KERNEL)?;
+        let fctx = DmaFenceCtx::new()?;
+
+        Ok(Self {
+            inner,
+            fctx,
+        })
+    }
+
+    /// Submit a job to the jobqueue.
+    ///
+    /// The jobqueue takes ownership of the job and later passes it back to the
+    /// driver by reference through the driver's run_job callback. Jobs are
+    /// passed back by reference instead of by value partially to allow for a
+    /// job resubmission mechanism to be added to [`Jobqueue`] later.
+    ///
+    /// Jobs get run and their done_fences get signalled in submission order.
+    ///
+    /// Returns the "done_fence" on success, which gets signalled once the
+    /// hardware has completed the job and the jobqueue is done with the job.
+    pub fn submit_job<U>(&self, job: Pin<KBox<Job<U>>>) -> Result<ARef<DmaFence<i32>>> {
+        let job_cost = job.credits;
+        // TODO: It would be nice if the done_fence's seqno actually matches the
+        // submission order. To do that, however, we'd need to protect job
+        // creation with InnerJobqueue's spinlock. Is that worth it?
+        let enq = EnqueuedJob::new(job, &self.fctx)?;
+        let done_fence = enq.done_fence.clone(); // Get the fence for the user.
+
+        // TODO register job's callbacks on done_fence.
+
+        let guard = self.inner.try_access();
+        if guard.is_none() {
+            // Can never happen. JQ gets only revoked when it drops.
+            return Err(ENODEV);
+        }
+        let jobq = guard.unwrap();
+
+        let jobq = jobq.lock();
+
+        // Check if there are dependencies and, if yes, register rewake
+        // callbacks on their fences. Must be done under the JQ lock's protection
+        // since the callbacks will access JQ data.
+        //job.arm_deps();
+        //jobq.waiting_jobs.push_back(job);
+
+        if jobq.has_waiting_jobs() {
+            // Jobs waiting means that there is either currently no capacity
+            // for more jobs, or the jobqueue is blocked by a job with
+            // unfulfilled dependencies. Either the hardware fences' callbacks
+            // or those of the dependency fences will pull in more jobs once
+            // there is capacity.
+            return Ok(done_fence);
+        } else if !jobq.submit_worker_active && jobq.has_capacity_left(job_cost) {
+            // This is the first waiting job. No one (i.e., no hw_fence) has
+            // woken the worker yet, but there is space. Wake it manually.
+            //jobq.start_submit_worker();
+        }
+
+        // If there was no capacity for the job, the callbacks registered on the
+        // already running jobs' hardware fences will check if there's space for
+        // the next job, guaranteeing progress.
+        //
+        // If no jobs were running, there was by definition still space and the
+        // job will get pushed by the worker.
+        //
+        // If a job couldn't be pushed because there were unfinished dependencies,
+        // then the hardware fences' callbacks mentioned above will detect that
+        // and not yet push the job.
+        //
+        // Each dependency's fence has its own callback which checks:
+        //   a) whether all other callbacks are fulfilled and if yes:
+        //   b) whether there are now enough credits available.
+        //
+        // If a) and b) are fulfilled, the job gets pushed.
+        //
+        // If there are no jobs currently running, credits must be available by
+        // definition.
+
+        Ok(done_fence)
+
+    }
+}
+
+impl Drop for Jobqueue {
+    fn drop(&mut self) {
+        // The hardware fences might outlive the jobqueue. So hw_fence callbacks
+        // could very well still call into job queue code, resulting in
+        // data UAF or, should the jobqueue code be unloaded, even code UAF.
+        //
+        // Thus, the jobqueue needs to be cleanly decoupled from the hardware
+        // fences when it drops, in other words, it needs to deregister all its
+        // hw_fence callbacks.
+        //
+        // This, however, could easily deadlock when a hw_fence signals:
+        //
+        // Step     |   Jobqueue step               |   hw_fence step
+        // ------------------------------------------------------------------
+        // 1        |   JQ starts drop              |   fence signals
+        // 2        |   JQ lock taken               |   fence lock taken
+        // 3        |   Tries to take fence lock    |   Tries to take JQ lock
+        // 4        |   ***DEADLOCK***              |   ***DEADLOCK***
+        //
+        // In order to prevent deadlock, we first have to revoke access to the
+        // JQ so that all fence callbacks can't try to take the lock anymore,
+        // and then deregister all JQ callbacks.
+        self.inner.revoke();
+
+        /*
+        let guard = self.inner.lock();
+        for job in self.inner.waiting_jobs {
+            job.deregister_dep_fences();
+        }
+        for job in self.inner.running_jobs {
+            job.deregister_hw_fence();
+        }
+        */
+    }
+}
diff --git a/rust/kernel/drm/mod.rs b/rust/kernel/drm/mod.rs
index 1b82b6945edf..803bed36231b 100644
--- a/rust/kernel/drm/mod.rs
+++ b/rust/kernel/drm/mod.rs
@@ -7,12 +7,14 @@
 pub mod file;
 pub mod gem;
 pub mod ioctl;
+pub mod jq;
 
 pub use self::device::Device;
 pub use self::driver::Driver;
 pub use self::driver::DriverInfo;
 pub use self::driver::Registration;
 pub use self::file::File;
+pub use self::jq::Jobqueue;
 
 pub(crate) mod private {
     pub trait Sealed {}
-- 
2.49.0
Re: [RFC WIP 3/3] rust/drm: Add initial jobqueue skeleton
Posted by Daniel Almeida 1 week ago
Hi Philipp,

> On 18 Nov 2025, at 10:25, Philipp Stanner <phasta@kernel.org> wrote:
> 
> DRM jobqueue is intended to become a load balancer, dependency manager
> and timeout handler for GPU drivers with firmware scheduling.
> 
> The presented code shall give the reader an overview over the intended
> architecture, notably over the API functions, DmaFence callbacks, job
> lists and job control flow.
> 
> This code compiles (with warnings) but is incomplete. Notable missing
> features are:
> - Actually registering the fence callbacks
> - workqueue
> - timeout handling
> - actually calling the driver callback for job submissions
> 
> Moreover, the implementation of the waiting_jobs and running_jobs lists
> is currently not operational because I've got trouble with getting it to
> work with generic Job data. Verifyable by commenting in the push_job()
> call in the submit_job() function.
> 
> Some WIP code is commented out, but is probably worth reading
> nevertheless since it completes the picture.
> 
> Signed-off-by: Philipp Stanner <phasta@kernel.org>
> ---
> rust/kernel/drm/jq.rs  | 398 +++++++++++++++++++++++++++++++++++++++++
> rust/kernel/drm/mod.rs |   2 +
> 2 files changed, 400 insertions(+)
> create mode 100644 rust/kernel/drm/jq.rs
> 
> diff --git a/rust/kernel/drm/jq.rs b/rust/kernel/drm/jq.rs
> new file mode 100644
> index 000000000000..b3f7ab4655cf
> --- /dev/null
> +++ b/rust/kernel/drm/jq.rs
> @@ -0,0 +1,398 @@
> +// SPDX-License-Identifier: GPL-2.0
> +//
> +// Copyright (C) 2025 Red Hat Inc.:
> +//   - Philipp Stanner <pstanner@redhat.com>
> +//   - Danilo Krummrich <dakr@redhat.com>
> +//   - David Airlie <airlied@redhat.com>
> +
> +//! DrmJobqueue. A load balancer, dependency manager and timeout handler for
> +//! GPU job submissions.
> +
> +use crate::{
> +    prelude::*,
> +    types::ARef,
> +};
> +use kernel::sync::{Arc, SpinLock, new_spinlock, DmaFence, DmaFenceCtx, DmaFenceCb, DmaFenceCbFunc};
> +use kernel::list::*;
> +use kernel::revocable::Revocable;
> +
> +
> +#[pin_data]
> +pub struct Job<T: ?Sized> {
> +    credits: u32,
> +//    dependencies: List, // TODO implement dependency list

I am assuming that this will be a list of callbacks?

> +    #[pin]
> +    data: T,
> +}
> +
> +impl<T> Job<T> {
> +    /// Create a new job that can be submitted to [`Jobqueue`].
> +    ///
> +    /// Jobs contain driver data that will later be made available to the driver's
> +    /// run_job() callback in which the job gets pushed to the GPU.
> +    pub fn new(credits: u32, data: impl PinInit<T>) -> Result<Pin<KBox<Self>>> {
> +        let job = pin_init!(Self {
> +            credits,
> +            data <- data,
> +        });
> +
> +        KBox::pin_init(job, GFP_KERNEL)
> +    }
> +
> +    /// Add a callback to the job. When the job gets submitted, all added callbacks will be
> +    /// registered on the [`DmaFence`] the jobqueue returns for that job.
> +    pub fn add_callback() -> Result {

Can’t we take all the callbacks at submission time?
> +        Ok(())
> +    }
> +
> +    /// Add a [`DmaFence`] or a [`DoneFence`] as this job's dependency. The job
> +    /// will only be executed after that dependency has been finished.
> +    pub fn add_dependency() -> Result {

Which would let us remove this ^

> +        // TODO: Enqueue passed DmaFence into the job's dependency list.
> +        Ok(())
> +    }
> +
> +    /// Check if there are dependencies for this job. Register the jobqueue
> +    /// waker if yes.
> +    fn arm_deps() -> Result {

I wonder if “check_dependencies” would be a better name? Or something
along these lines.

> +        // TODO: Register DependencyWaker here if applicable.
> +        Ok(())
> +    }
> +}
> +
> +// Dummy trait for the linked list.
> +trait JobData {

> +    fn access_data(&self) -> i32;

Can’t we dereference to the data?

> +}
> +
> +#[pin_data]
> +struct EnqueuedJob<T: ?Sized> {
> +    inner: Pin<KBox<Job<T>>>,
> +    #[pin]
> +    links: ListLinksSelfPtr<EnqueuedJob<dyn JobData>>,

Why not a KVec? A queue type can hold a KVec of enqueued jobs, and each
enqueued job can hold an Arc of the queue type. By extension, this ensures
that the queue does not die while we have enqueued jobs.
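
Roughly, what I have in mind (just a sketch, names made up):

struct JobQueueInner {
    // Protected by the JQ lock, as today:
    waiting_jobs: KVec<Arc<EnqueuedJob>>,
}

struct EnqueuedJob {
    queue: Arc<JobQueueInner>, // keeps the queue alive while the job exists
    // ...
}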


> +    done_fence: ARef<DmaFence<i32>>, // i32 is just dummy data. TODO: allow for replacing with `()`
> +    // The hardware_fence can by definition only be set at an unknown point in
> +    // time.
> +    // TODO: Think about replacing this with a `struct RunningJob` which consumes
> +    // an `EnqueuedJob`.
> +    hardware_fence: Option<ARef<DmaFence<i32>>>, // i32 is dummy data until there's DmaFence
> +                                                 // without data.
> +    nr_of_deps: u32,
> +}
> +
> +impl<T> EnqueuedJob<T> {
> +    fn new(inner: Pin<KBox<Job<T>>>, fctx: &Arc<DmaFenceCtx>) -> Result<ListArc<Self>> {
> +        let pseudo_data: i32 = 42;
> +        let done_fence = fctx.as_arc_borrow().new_fence(pseudo_data)?;
> +
> +        ListArc::pin_init(try_pin_init!(Self {
> +            inner,
> +            links <- ListLinksSelfPtr::new(),
> +            done_fence,
> +            hardware_fence: None,
> +            nr_of_deps: 0, // TODO implement
> +        }), GFP_KERNEL)
> +    }
> +}
> +
> +impl_list_arc_safe! {
> +    impl{T: ?Sized} ListArcSafe<0> for EnqueuedJob<T> { untracked; }
> +}
> +
> +impl_list_item! {
> +    impl ListItem<0> for EnqueuedJob<dyn JobData> { using ListLinksSelfPtr { self.links }; }
> +}
> +
> +// Callback item for the hardware fences to wake / progress the jobqueue.
> +struct HwFenceWaker<T> {
> +    jobq: Arc<Revocable<SpinLock<InnerJobqueue>>>,

Instead of a Revocable, why not keep an Arc of InnerJobQueue (which should
perhaps be called JobQueueInner)?

This way, the user can have this:

struct JobQueue(Arc<JobQueueInner>);

When the user drops the JobQueue, it will schedule whatever teardown
operations are needed, but the inner queue will not go out of scope,
guaranteeing that there is no UAF at least at this level.

You can create circular references to keep the JobQueueInner alive for as long
as the teardown operation is taking place:

struct SomeStructUsedForCleanup {
    queue: Arc<JobQueueInner>,
    // ... more stuff
}

struct JobQueueInner {
    cleanups: KVec<Arc<SomeStructUsedForCleanup>>,
}

Given this cycle, both the queue and whatever structs you need for cleanup will
remain alive indefinitely. At some point, once whatever cleanup completes, you
can break the cycle:

impl Drop for SomeStructUsedForCleanup {
  fn drop(...) {
    self.queue.cleanups.remove(self)
  }
}

Once all the cleanups complete, the JobQueueInner will drop.

Note that I'd expect this struct I “invented" to be a DmaFenceCb representing a
pending dependency or a job that is already on the ring.

> +    job: ListArc<EnqueuedJob<T>>,
> +}
> +
> +impl<T> DmaFenceCbFunc for HwFenceWaker<T> {
> +     fn callback(cb: Pin<KBox<DmaFenceCb<Self>>>) where Self: Sized {
> +         // This prevents against deadlock. See Jobqueue's drop() for details.
> +         let jq_guard = cb.data.jobq.try_access();
> +         if jq_guard.is_none() {
> +             return;
> +         }
> +         let jq_guard = jq_guard.unwrap();
> +
> +         // Take Jobqueue lock.
> +         let jq = jq_guard.lock();
> +         // Remove job from running list.
> +         //let _ = unsafe { cb.data.job.remove() };
> +         // Signal done_fence.
> +         // TODO: It's more robust if the JQ makes sure that fences get signalled
> +         // in order, even if the driver should signal them chaotically.
> +         let _ = cb.data.job.done_fence.signal();
> +         // Run more ready jobs if there's capacity.
> +         //jq.start_submit_worker();
> +     }
> +}
> +
> +// Callback item for the dependency fences to wake / progress the jobqueue.
> +struct DependencyWaker<T> {
> +    jobq: Arc<Revocable<SpinLock<InnerJobqueue>>>,
> +    job: ListArc<EnqueuedJob<T>>,
> +}
> +
> +impl<T> DmaFenceCbFunc for DependencyWaker<T> {
> +    fn callback(cb: Pin<KBox<DmaFenceCb<Self>>>) where Self: Sized {
> +        // This prevents against deadlock. See Jobqueue's drop() for details.
> +        let jq_guard = cb.data.jobq.try_access();
> +        if jq_guard.is_none() {
> +            return;
> +        }
> +        let jq_guard = jq_guard.unwrap();
> +
> +        // Take Jobqueue lock.
> +        let jq = jq_guard.lock();
> +
> +        // TODO: Lock Contention
> +        //
> +        // Alright, so the Jobqueue is currently designed around a big central
> +        // lock, which also protects the jobs. submit_job(), the JQ's cb on the
> +        // hw_fences and its cbs on the (external) dependency fences compete for
> +        // the lock. The first two should ever only run sequentially, so likely
> +        // aren't a problem.
> +        //
> +        // Dependency callbacks, however, could be registered and then signalled
> +        // by the thousands and then all compete for the lock possibly for nothing.
> +        //
> +        // That can likely be improved. Maybe by just making the nr_of_deps
> +        // counter atomic?
> +
> +        // Decrement dep counter.
> +        // cb.data.job.nr_of_deps -= 1; // TODO needs to be DerefMut
> +        // If counter == 0, a new job somewhere in the queue just got ready.
> +        // Check if it was the head job and if yes, run all jobs possible.
> +        if cb.data.job.nr_of_deps == 0 {
> +//            jq.start_submit_worker();
> +        }
> +    }
> +}
> +
> +struct InnerJobqueue {
> +    capacity: u32,
> +    waiting_jobs: List<EnqueuedJob<dyn JobData>>,
> +    running_jobs: List<EnqueuedJob<dyn JobData>>,
> +    submit_worker_active: bool,
> +}
> +
> +impl InnerJobqueue {
> +    fn new(capacity: u32) -> Self {
> +        let waiting_jobs = List::<EnqueuedJob<dyn JobData>>::new();
> +        let running_jobs = List::<EnqueuedJob<dyn JobData>>::new();
> +
> +        Self {
> +            capacity,
> +            waiting_jobs,
> +            running_jobs,
> +            submit_worker_active: false,
> +        }
> +    }
> +
> +    fn has_waiting_jobs(&self) -> bool {
> +        !self.waiting_jobs.is_empty()
> +    }
> +
> +    fn has_capacity_left(&self, cost: u32) -> bool {
> +        let cost = cost as i64;
> +        let capacity = self.capacity as i64;
> +
> +        if capacity - cost >= 0 {
> +            return true;
> +        }
> +
> +        false
> +    }
> +
> +    fn update_capacity(&mut self, cost: u32) {
> +        self.capacity -= cost;
> +    }
> +
> +
> +    // Called by the hw_fence callbacks, dependency callbacks, and submit_job().
> +    // TODO: does submit_job() ever have to call it?

Hm, yeah, I’d say so.

> +    fn start_submit_worker(&mut self) {
> +        if self.submit_worker_active {
> +            return;
> +        }
> +
> +        // TODO run submit work item
> +
> +        self.submit_worker_active = true;
> +    }
> +
> +    /*
> +
> +    /// Push a job immediately.
> +    ///
> +    /// Returns true if the job ran immediately, false otherwise.
> +    fn run_job(&mut self, job: &EnqueuedJob) -> bool {
> +        // TODO remove job from waiting list.
> +
> +        // TODO Call the driver's run_job() callback.
> +        let hardware_fence = run_job(&job);
> +        job.hardware_fence = Some(hardware_fence);
> +
> +        // TODO check whether hardware_fence raced and is already signalled.
> +
> +        self.running_jobs.push_back(job);
> +
> +        // TODO Register HwFenceWaker on the hw_fence.
> +    }
> +
> +    // Submits all ready jobs as long as there's capacity.
> +    fn run_all_ready_jobs(&mut self) {
> +        for job in self.waiting_jobs.reverse() {
> +            if job.nr_of_deps > 0 {
> +                return;
> +            }
> +
> +            if self.has_capacity_left(job.credits) {
> +                if !self.run_job(&job) {
> +                    // run_job() didn't run the job immediately (because the
> +                    // hw_fence did not race). Subtract the credits.
> +                    self.update_capacity(job.credits);
> +                }
> +            } else {
> +                return;
> +            }
> +        }
> +    }
> +    */
> +}
> +
> +//#[pin_data]
> +pub struct Jobqueue {
> +    inner: Arc<Revocable<SpinLock<InnerJobqueue>>>,
> +    fctx: Arc<DmaFenceCtx>, // TODO currently has a separate lock shared with fences
> +//    #[pin]
> +//    data: T,
> +}
> +
> +impl Jobqueue {
> +    /// Create a new [`Jobqueue`] with `capacity` space for jobs. `run_job` is
> +    /// your driver's callback which the jobqueue will call to push a submitted
> +    /// job to the hardware.
> +    pub fn new<T, V>(capacity: u32, _run_job: fn(&Pin<KBox<Job<T>>>) -> ARef<DmaFence<V>>) -> Result<Self> {
> +        let inner = Arc::pin_init(Revocable::new(new_spinlock!(InnerJobqueue::new(capacity))), GFP_KERNEL)?;
> +        let fctx = DmaFenceCtx::new()?;
> +
> +        Ok (Self {
> +            inner,
> +            fctx,
> +        })
> +    }
> +
> +    /// Submit a job to the jobqueue.
> +    ///
> +    /// The jobqueue takes ownership over the job and later passes it back to the
> +    /// driver by reference through the driver's run_job callback. Jobs are
> +    /// passed back by reference instead of by value partially to allow for later
> +    /// adding a job resubmission mechanism to be added to [`Jobqueue`].
> +    ///
> +    /// Jobs get run and their done_fences get signalled in submission order.
> +    ///
> +    /// Returns the "done_fence" on success, which gets signalled once the
> +    /// hardware has completed the job and once the jobqueue is done with a job.
> +    pub fn submit_job<U>(&self, job: Pin<KBox<Job<U>>>) -> Result<ARef<DmaFence<i32>>> {
> +        let job_cost = job.credits;
> +        // TODO: It would be nice if the done_fence's seqno actually matches the
> +        // submission order. To do that, however, we'd need to protect job
> +        // creation with InnerJobqueue's spinlock. Is that worth it?

Can you guarantee that the seqno will not go backwards?

> +        let enq = EnqueuedJob::new(job, &self.fctx)?;
> +        let done_fence = enq.done_fence.clone(); // Get the fence for the user.
> +
> +        // TODO register job's callbacks on done_fence.
> +
> +        let guard = self.inner.try_access();
> +        if guard.is_none() {
> +            // Can never happen. JQ gets only revoked when it drops.
> +            return Err(ENODEV);
> +        }
> +        let jobq = guard.unwrap();
> +
> +        let jobq = jobq.lock();
> +
> +        // Check if there are dependencies and, if yes, register rewake
> +        // callbacks on their fences. Must be done under the JQ lock's protection
> +        // since the callbacks will access JQ data.
> +        //job.arm_deps();
> +        //jobq.waiting_jobs.push_back(job);
> +
> +        if jobq.has_waiting_jobs() {
> +            // Jobs waiting means that there is either currently no capacity
> +            // for more jobs, or the jobqueue is blocked by a job with
> +            // unfullfilled dependencies. Either the hardware fences' callbacks
> +            // or those of the dependency fences will pull in more jobs once
> +            // there is capacity.
> +            return Ok(done_fence);
> +        } else if !jobq.submit_worker_active && jobq.has_capacity_left(job_cost) {
> +            // This is the first waiting job. No one (i.e., no hw_fence) has
> +            // woken the worker yet, but there is space. Awake it manually.
> +            //jobq.start_submit_worker();
> +        }
> +
> +        // If there was no capacity for the job, the callbacks registered on the
> +        // already running jobs' hardware fences will check if there's space for
> +        // the next job, guaranteeing progress.
> +        //
> +        // If no jobs were running, there was by definition still space and the
> +        // job will get pushed by the worker.
> +        //
> +        // If a job couldn't be pushed because there were unfinished dependencies,
> +        // then the hardware fences' callbacks mentioned above will detect that
> +        // and not yet push the job.
> +        //
> +        // Each dependency's fence has its own callback which checks:
> +        //   a) whether all other callbacks are fullfilled and if yes:
> +        //   b) whether there is now enough credits available.
> +        //
> +        // If a) and b) are fullfilled, the job gets pushed.
> +        //
> +        // If there are no jobs currently running, credits must be available by
> +        // definition.
> +
> +        Ok(done_fence)
> +
> +    }
> +}
> +
> +impl Drop for Jobqueue {
> +    fn drop(&mut self) {
> +        // The hardware fences might outlive the jobqueue. So hw_fence callbacks
> +        // could very well still call into job queue code, resulting in
> +        // data UAF or, should the jobqueue code be unloaded, even code UAF.

Not if they reference JobQueueInner as I proposed above.

> +        //
> +        // Thus, the jobqueue needs to be cleanly decoupled from the hardware
> +        // fences when it drops, in other words, it needs to deregister all its
> +        // hw_fence callbacks.
> +        //
> +        // This, however, could easily deadlock when a hw_fence signals:
> +        //
> +        // Step     |   Jobqueue step               |   hw_fence step
> +        // ------------------------------------------------------------------
> +        // 1        |   JQ starts drop              |   fence signals
> +        // 2        |   JQ lock taken               |   fence lock taken
> +        // 3        |   Tries to take fence lock    |   Tries to take JQ lock
> +        // 4        |   ***DEADLOCK***              |   ***DEADLOCK***
> +        //
> +        // In order to prevent deadlock, we first have to revoke access to the
> +        // JQ so that all fence callbacks can't try to take the lock anymore,
> +        // and then deregister all JQ callbacks.
> +        self.inner.revoke();
> +
> +        /*
> +        let guard = self.inner.lock();
> +        for job in self.inner.waiting_jobs {
> +            job.deregister_dep_fences();
> +        }
> +        for job in self.inner.running_jobs {
> +            job.deregister_hw_fence();
> +        }
> +        */

Under my proposal above, you can also wait on dependencies if you want: the
drop() thread will not be blocked.

> +    }
> +}
> diff --git a/rust/kernel/drm/mod.rs b/rust/kernel/drm/mod.rs
> index 1b82b6945edf..803bed36231b 100644
> --- a/rust/kernel/drm/mod.rs
> +++ b/rust/kernel/drm/mod.rs
> @@ -7,12 +7,14 @@
> pub mod file;
> pub mod gem;
> pub mod ioctl;
> +pub mod jq;
> 
> pub use self::device::Device;
> pub use self::driver::Driver;
> pub use self::driver::DriverInfo;
> pub use self::driver::Registration;
> pub use self::file::File;
> +pub use self::jq::Jobqueue;
> 
> pub(crate) mod private {
>     pub trait Sealed {}
> -- 
> 2.49.0
> 
> 

— Daniel
Re: [RFC WIP 3/3] rust/drm: Add initial jobqueue skeleton
Posted by Philipp Stanner 6 days, 13 hours ago
On Mon, 2025-11-24 at 10:58 -0300, Daniel Almeida wrote:
> Hi Phillip,
> 
> > On 18 Nov 2025, at 10:25, Philipp Stanner <phasta@kernel.org> wrote:
> > 
> > 

[…]

> > +use crate::{
> > +    prelude::*,
> > +    types::ARef,
> > +};
> > +use kernel::sync::{Arc, SpinLock, new_spinlock, DmaFence, DmaFenceCtx, DmaFenceCb, DmaFenceCbFunc};
> > +use kernel::list::*;
> > +use kernel::revocable::Revocable;
> > +
> > +
> > +#[pin_data]
> > +pub struct Job<T: ?Sized> {
> > +    credits: u32,
> > +//    dependencies: List, // TODO implement dependency list
> 
> I am assuming that this will be a list of callbacks?

That's supposed to become the list of DmaFences which are to be
treated as dependencies of this job.

Only once all fences in this list are signaled will the JQ push that
job.
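
Roughly, the idea is something like this (just a sketch, names and the
list plumbing are not final):

```
#[pin_data]
struct Dependency {
    fence: ARef<DmaFence<i32>>, // i32 dummy data, as elsewhere in this RFC
    #[pin]
    links: ListLinks<0>,        // plus the usual ListArc / impl_list_item! glue
}

// And in Job<T>:
//     dependencies: List<Dependency>, // filled by add_dependency()
```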

> 
> > +    #[pin]
> > +    data: T,
> > +}
> > +
> > +impl<T> Job<T> {
> > +    /// Create a new job that can be submitted to [`Jobqueue`].
> > +    ///
> > +    /// Jobs contain driver data that will later be made available to the driver's
> > +    /// run_job() callback in which the job gets pushed to the GPU.
> > +    pub fn new(credits: u32, data: impl PinInit<T>) -> Result<Pin<KBox<Self>>> {
> > +        let job = pin_init!(Self {
> > +            credits,
> > +            data <- data,
> > +        });
> > +
> > +        KBox::pin_init(job, GFP_KERNEL)
> > +    }
> > +
> > +    /// Add a callback to the job. When the job gets submitted, all added callbacks will be
> > +    /// registered on the [`DmaFence`] the jobqueue returns for that job.
> > +    pub fn add_callback() -> Result {
> 
> Can’t we take all the callbacks at submission time?

To clarify the terminology, a "callback" here would be a callback which
the JQ shall register on the done_fence returned by
Jobqueue::submit_job().

> > +        Ok(())
> > +    }
> > +
> > +    /// Add a [`DmaFence`] or a [`DoneFence`] as this job's dependency. The job
> > +    /// will only be executed after that dependency has been finished.
> > +    pub fn add_dependency() -> Result {
> 
> Which would let us remove this ^

It would allow for removing this function, but you'd then just have an
optional (some jobs have no dependencies) function parameter in
Jobqueue::submit_job().

The current idea looks like this:

```
let jobq = JobQueue::new(…);
let job = Job::new(driver_data);

job.add_dependency(done_fence_of_shader_in_another_context); // optional
job.add_callback(cb_that_will_wake_userspace_or_sth); // optional

let done_fence = jobq.submit_job(job)?;
```

The JQ eats the job (ownership transfer), so by design you have to set
all dependencies and specify everything that shall be done when the job
finishes _before_ submitting the job.

I think an API in this form makes the order of events very obvious to
the user?


What happens then behind the scenes is that the JQ registers all the
callbacks on the done_fence returned above. I'm not super sure about
this design idea; it's certainly optional. However, it has the
advantage of freeing the JQ user from dealing with races of done_fence.

Otherwise one would have to do something like

```
let done_fence = jobq.submit_job(job)?;

let err = done_fence.register_callback(my_drivers_cb);
if err.was_race_and_is_already_signaled() {
    execute_cb_code_myself_now();
}
```


> 
> > +        // TODO: Enqueue passed DmaFence into the job's dependency list.
> > +        Ok(())
> > +    }
> > +
> > +    /// Check if there are dependencies for this job. Register the jobqueue
> > +    /// waker if yes.
> > +    fn arm_deps() -> Result {
> 
> I wonder if “check_dependencies” would be a better name? Or something
> along these lines.

ACK.

> 
> > +        // TODO: Register DependencyWaker here if applicable.
> > +        Ok(())
> > +    }
> > +}
> > +
> > +// Dummy trait for the linked list.
> > +trait JobData {
> 
> > +    fn access_data(&self) -> i32;
> 
> Can’t we dereference to the data?

That's dummy code that only exists because so far I'm failing to even
get the basic List to work.

> 
> > +}
> > +
> > +#[pin_data]
> > +struct EnqueuedJob<T: ?Sized> {
> > +    inner: Pin<KBox<Job<T>>>,
> > +    #[pin]
> > +    links: ListLinksSelfPtr<EnqueuedJob<dyn JobData>>,
> 
> Why not a KVec? A queue type can hold a KVec of enqueued jobs, and this can
> hold an Arc of the queue type.

My understanding is that KVec is not intended to be the data structure
for this?

KVec is basically like a realloc() in C, an array of same-sized
elements.

The JQ, hypothetically, can hold an infinite number of members in its
waiting_list; only the running_list is limited by the credit count.


>  By extension, ensures that the queue does not
> die while we have enqueued jobs.

See below.

> 
> 
> > +    done_fence: ARef<DmaFence<i32>>, // i32 is just dummy data. TODO: allow for replacing with `()`
> > +    // The hardware_fence can by definition only be set at an unknown point in
> > +    // time.
> > +    // TODO: Think about replacing this with a `struct RunningJob` which consumes
> > +    // an `EnqueuedJob`.
> > +    hardware_fence: Option<ARef<DmaFence<i32>>>, // i32 is dummy data until there's DmaFence
> > +                                                 // without data.
> > +    nr_of_deps: u32,
> > +}
> > +
> > +impl<T> EnqueuedJob<T> {
> > +    fn new(inner: Pin<KBox<Job<T>>>, fctx: &Arc<DmaFenceCtx>) -> Result<ListArc<Self>> {
> > +        let pseudo_data: i32 = 42;
> > +        let done_fence = fctx.as_arc_borrow().new_fence(pseudo_data)?;
> > +
> > +        ListArc::pin_init(try_pin_init!(Self {
> > +            inner,
> > +            links <- ListLinksSelfPtr::new(),
> > +            done_fence,
> > +            hardware_fence: None,
> > +            nr_of_deps: 0, // TODO implement
> > +        }), GFP_KERNEL)
> > +    }
> > +}
> > +
> > +impl_list_arc_safe! {
> > +    impl{T: ?Sized} ListArcSafe<0> for EnqueuedJob<T> { untracked; }
> > +}
> > +
> > +impl_list_item! {
> > +    impl ListItem<0> for EnqueuedJob<dyn JobData> { using ListLinksSelfPtr { self.links }; }
> > +}
> > +
> > +// Callback item for the hardware fences to wake / progress the jobqueue.
> > +struct HwFenceWaker<T> {
> > +    jobq: Arc<Revocable<SpinLock<InnerJobqueue>>>,
> 
> Instead of a Revocable, why not keep an Arc of InnerJobQueue (which should
> perhaps be called JobQueueInner)?
> 
> This way, the user can have this:
> 
> struct JobQueue(Arc<JobqueueInner>);
> 
> When the user drops the JobQueue, it will schedule whatever teardown
> operations,
> 

What kind of operation would that be? Completing all running_jobs?
Completing all waiting_jobs? Completing all running_jobs and canceling
all waiting_jobs? etc.


>  but the inner queue will not go out of scope, guaranteeing that
> there is no UAF at least at this level.
> 
> You can create circular references to keep the JobQueueInner alive for as long
> as the teardown operation is taking place:
> 
> struct SomeStructUsedForCleanup {
>   Arc<JobQueueInner> queue;
>   // ... more stuff
> }
> 
> struct JobQueueInner {
>  KVec<Arc<SomeStructUsedForCleanup>> cleanups;
> }
> 
> Given this cycle, both the queue and whatever structs you need for cleanup will
> remain alive indefinitely. At some point, once whatever cleanup completes, you
> can break the cycle:
> 
> impl Drop for SomeStructUsedForCleanup {
>   fn drop(...) {
>     self.queue.cleanups.remove(self)
>   }
> }
> 
> Once all the cleanups complete, the JobQueueInner will drop.

Whether your design approach has advantages depends on the above
question of what "cleanup" means to you?

> 
> Note that I'd expect this struct I “invented" to be a DmaFenceCb representing a
> pending dependency or a job that is already on the ring.
> 
> > +    job: ListArc<EnqueuedJob<T>>,
> > +}
> > 

[…]

> > +    fn update_capacity(&mut self, cost: u32) {
> > +        self.capacity -= cost;
> > +    }
> > +
> > +
> > +    // Called by the hw_fence callbacks, dependency callbacks, and submit_job().
> > +    // TODO: does submit_job() ever have to call it?
> 
> Hm, yeah, I’d say so.

Yup. That comment is a relic.

> 
> > +    fn start_submit_worker(&mut self) {
> > +        if self.submit_worker_active {
> > +            return;
> > +        }
> > +
> > +        // TODO run submit work item
> > +
> > +        self.submit_worker_active = true;
> > +    }
> > 

[…]

> > +    /// Submit a job to the jobqueue.
> > +    ///
> > +    /// The jobqueue takes ownership over the job and later passes it back to the
> > +    /// driver by reference through the driver's run_job callback. Jobs are
> > +    /// passed back by reference instead of by value partially to allow for later
> > +    /// adding a job resubmission mechanism to be added to [`Jobqueue`].
> > +    ///
> > +    /// Jobs get run and their done_fences get signalled in submission order.
> > +    ///
> > +    /// Returns the "done_fence" on success, which gets signalled once the
> > +    /// hardware has completed the job and once the jobqueue is done with a job.
> > +    pub fn submit_job<U>(&self, job: Pin<KBox<Job<U>>>) -> Result<ARef<DmaFence<i32>>> {
> > +        let job_cost = job.credits;
> > +        // TODO: It would be nice if the done_fence's seqno actually matches the
> > +        // submission order. To do that, however, we'd need to protect job
> > +        // creation with InnerJobqueue's spinlock. Is that worth it?
> 
> Can you guarantee that the seqno will not go backwards?

As pointed out in the other thread, that could currently happen if a
driver calls submit_job() with >1 thread.

IOW, *done_fence* seqnos could end up being enqueued like this

42 43 45 44 46

By taking the lock that could be prevented. However, that's only a
nominal or tiny win, because jobs could then actually be submitted in
an order not desired by the driver, but with correct done_fence seqno
order.

JQ executes jobs in the order in which they were submitted. The
fundamental question is really: should the JQ care, and what should it
do, if a driver spams submit_job() asynchronously?

I tend to think that there is not really much we can do about that.


> > +impl Drop for Jobqueue {
> > +    fn drop(&mut self) {
> > +        // The hardware fences might outlive the jobqueue. So hw_fence callbacks
> > +        // could very well still call into job queue code, resulting in
> > +        // data UAF or, should the jobqueue code be unloaded, even code UAF.
> 
> Not if they reference JobQueueInner as I proposed above.
> 
> > +        //
> > +        // Thus, the jobqueue needs to be cleanly decoupled from the hardware
> > +        // fences when it drops, in other words, it needs to deregister all its
> > +        // hw_fence callbacks.
> > +        //
> > +        // This, however, could easily deadlock when a hw_fence signals:
> > +        //
> > +        // Step     |   Jobqueue step               |   hw_fence step
> > +        // ------------------------------------------------------------------
> > +        // 1        |   JQ starts drop              |   fence signals
> > +        // 2        |   JQ lock taken               |   fence lock taken
> > +        // 3        |   Tries to take fence lock    |   Tries to take JQ lock
> > +        // 4        |   ***DEADLOCK***              |   ***DEADLOCK***
> > +        //
> > +        // In order to prevent deadlock, we first have to revoke access to the
> > +        // JQ so that all fence callbacks can't try to take the lock anymore,
> > +        // and then deregister all JQ callbacks.
> > +        self.inner.revoke();
> > +
> > +        /*
> > +        let guard = self.inner.lock();
> > +        for job in self.inner.waiting_jobs {
> > +            job.deregister_dep_fences();
> > +        }
> > +        for job in self.inner.running_jobs {
> > +            job.deregister_hw_fence();
> > +        }
> > +        */
> 
> Under my proposal above, you can also wait on dependencies if you want: the
> drop() thread will not be blocked.

Maybe (I'd have to look deeper into the idea)

But what for? When someone drops his jobqueue, one would like to think
that he doesn't care about all pending jobs anymore anyway. So the
main thing you need to guarantee is that userspace gets unblocked by
signaling all fences.


Note that we had very similar discussions when solving the memory leaks
in drm_sched_fini(). The TL;DR of those discussions was:

 * Refcounting drm_sched so that it can outlive drm_sched_fini() means
   that it will continue calling into the driver with the driver
   callbacks -> UAF
 * Waiting could cause you to block SIGKILL
 * The sanest way to go was deemed to be to signal everything in the
   pending_list synchronously. Once you've done this, you know for sure
   that everything is done and clean.


AFAICS, your proposal might still have the problem of JQ continuously
calling into driver code?

I think the proposed solution is very clean: when you drop, decouple JQ
and driver by 100%, stop everything, tear everything down. At least
that's what drm_sched_fini() should have been from the beginning.


P.
Re: [RFC WIP 3/3] rust/drm: Add initial jobqueue skeleton
Posted by Daniel Almeida 4 days, 12 hours ago

> On 25 Nov 2025, at 10:20, Philipp Stanner <phasta@mailbox.org> wrote:
> 
> On Mon, 2025-11-24 at 10:58 -0300, Daniel Almeida wrote:
>> Hi Phillip,
>> 
>>> On 18 Nov 2025, at 10:25, Philipp Stanner <phasta@kernel.org> wrote:
>>> 
>>> 
> 
> […]
> 
>>> +use crate::{
>>> +    prelude::*,
>>> +    types::ARef,
>>> +};
>>> +use kernel::sync::{Arc, SpinLock, new_spinlock, DmaFence, DmaFenceCtx, DmaFenceCb, DmaFenceCbFunc};
>>> +use kernel::list::*;
>>> +use kernel::revocable::Revocable;
>>> +
>>> +
>>> +#[pin_data]
>>> +pub struct Job<T: ?Sized> {
>>> +    credits: u32,
>>> +//    dependencies: List, // TODO implement dependency list
>> 
>> I am assuming that this will be a list of callbacks?
> 
> That's supposed to become the list of DmaFence's which are to be
> treated as dependencies of this job.
> 
> Only once all fences in this list are signaled the JQ will push that
> job.

Ok, I was approaching this from the current DRM scheduler design, which (IIRC)
uses callbacks to represent dependencies. IOW: if you managed to register a
callback on a dependency, it means that it hasn’t signaled yet.

In any case, that was just me trying to understand this better. IMHO, feel free
to use whatever you think is best here, like the whole DmaFence struct.

> 
>> 
>>> +    #[pin]
>>> +    data: T,
>>> +}
>>> +
>>> +impl<T> Job<T> {
>>> +    /// Create a new job that can be submitted to [`Jobqueue`].
>>> +    ///
>>> +    /// Jobs contain driver data that will later be made available to the driver's
>>> +    /// run_job() callback in which the job gets pushed to the GPU.
>>> +    pub fn new(credits: u32, data: impl PinInit<T>) -> Result<Pin<KBox<Self>>> {
>>> +        let job = pin_init!(Self {
>>> +            credits,
>>> +            data <- data,
>>> +        });
>>> +
>>> +        KBox::pin_init(job, GFP_KERNEL)
>>> +    }
>>> +
>>> +    /// Add a callback to the job. When the job gets submitted, all added callbacks will be
>>> +    /// registered on the [`DmaFence`] the jobqueue returns for that job.
>>> +    pub fn add_callback() -> Result {
>> 
>> Can’t we take all the callbacks at submission time?
> 
> To clarify the terminology, a "callback" here would be callbacks which
> the JQ shall register on the done_fence returned by
> DmaFence::submit_job()

Ack.

> .
> 
>>> +        Ok(())
>>> +    }
>>> +
>>> +    /// Add a [`DmaFence`] or a [`DoneFence`] as this job's dependency. The job
>>> +    /// will only be executed after that dependency has been finished.
>>> +    pub fn add_dependency() -> Result {
>> 
>> Which would let us remove this ^
> 
> It would allow for removing this function, but you'd then just have an
> optional (some jobs have no dependencies) function parameter in
> DmaFence::submit_job().

> 
> The current idea looks like this:
> 
> ```
> let jobq = JobQueue::new(…);
> let job = Job::new(driver_data);
> 
> job.add_dependency(done_fence_of_shader_in_another_context); // optional
> job.add_callback(cb_that_will_wake_userspace_or_sth); // optional
> 
> let done_fence = jobq.submit_job(job)?;
> ```
> 
> The JQ eats the job (ownership transfer), so by design you have to set
> all dependencies and specify everything that shall be done when the job
> finishes _before_ submitting the job.
> 
> I think an API in this form makes the order of events very obvious to
> the user?
> 


You’d pass a

fn submit(…, dependencies: &[DmaFence], callbacks: &[Callback])

This way a user cannot submit a job without being explicit about dependencies
and callbacks, i.e.: it cannot be forgotten, while still being optional. 
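
For example (hypothetical call sites, just to show the shape):

let fence = jobq.submit(job, &[dep_fence], &[wake_cb])?;
let fence = jobq.submit(other_job, &[], &[])?; // no deps, no callbacks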

> What happens then behind the scenes is that the JQ registers all the
> callbacks on the done_fence returned above. I'm not super sure about
> this design idea; it's certainly optional. However, it has the
> advantage of freeing the JQ user from dealing with races of done_fence.
> 
> Otherwise one would have to do something like
> 
> ```
> let done_fence = jobq.submit_job(job)?;
> 
> let err = done_fence.register_callback(my_drivers_cb);
> if err.was_race_and_is_already_signaled() {
> execute_cb_code_myself_now();
> }
> ```
> 
> 
>> 
>>> +        // TODO: Enqueue passed DmaFence into the job's dependency list.
>>> +        Ok(())
>>> +    }
>>> +
>>> +    /// Check if there are dependencies for this job. Register the jobqueue
>>> +    /// waker if yes.
>>> +    fn arm_deps() -> Result {
>> 
>> I wonder if “check_dependencies” would be a better name? Or something
>> along these lines.
> 
> ACK.
> 
>> 
>>> +        // TODO: Register DependencyWaker here if applicable.
>>> +        Ok(())
>>> +    }
>>> +}
>>> +
>>> +// Dummy trait for the linked list.
>>> +trait JobData {
>> 
>>> +    fn access_data(&self) -> i32;
>> 
>> Can’t we dereference to the data?
> 
> That's dummy code that only exists because I so far am failing with
> even getting the basic List to work.
> 
>> 
>>> +}
>>> +
>>> +#[pin_data]
>>> +struct EnqueuedJob<T: ?Sized> {
>>> +    inner: Pin<KBox<Job<T>>>,
>>> +    #[pin]
>>> +    links: ListLinksSelfPtr<EnqueuedJob<dyn JobData>>,
>> 
>> Why not a KVec? A queue type can hold a KVec of enqueued jobs, and this can
>> hold an Arc of the queue type.
> 
> My understanding is that KVec is not intended to be the data structure
> for this?
> 
> KVec is basically like a realloc() in C, an array of same sized
> elements.
> 
> The JQ, hypothetically, can hold an infinite amount of members in its
> waiting_list, only the running_list is limited by the credit count.

Then I'd pre-allocate or realloc() as needed. You can reuse the empty slots, so
there won't be unbounded growth. realloc() also looks fine, because it will
happen outside of the signaling path.
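
To illustrate the slot reuse (rough sketch, not tested; names made up):

struct JobQueueInner {
    slots: KVec<Option<Arc<EnqueuedJob>>>,
}

impl JobQueueInner {
    fn enqueue(&mut self, job: Arc<EnqueuedJob>) -> Result {
        // Reuse a free slot if there is one; (re)allocate only otherwise,
        // and never in the signaling path.
        if let Some(slot) = self.slots.iter_mut().find(|s| s.is_none()) {
            *slot = Some(job);
            return Ok(());
        }
        self.slots.push(Some(job), GFP_KERNEL)?;
        Ok(())
    }
}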

My point is that writing your own data structure adds complexity. Your call,
though.

> 
> 
>> By extension, ensures that the queue does not
>> die while we have enqueued jobs.
> 
> See below.
> 
>> 
>> 
>>> +    done_fence: ARef<DmaFence<i32>>, // i32 is just dummy data. TODO: allow for replacing with `()`
>>> +    // The hardware_fence can by definition only be set at an unknown point in
>>> +    // time.
>>> +    // TODO: Think about replacing this with a `struct RunningJob` which consumes
>>> +    // an `EnqueuedJob`.
>>> +    hardware_fence: Option<ARef<DmaFence<i32>>>, // i32 is dummy data until there's DmaFence
>>> +                                                 // without data.
>>> +    nr_of_deps: u32,
>>> +}
>>> +
>>> +impl<T> EnqueuedJob<T> {
>>> +    fn new(inner: Pin<KBox<Job<T>>>, fctx: &Arc<DmaFenceCtx>) -> Result<ListArc<Self>> {
>>> +        let pseudo_data: i32 = 42;
>>> +        let done_fence = fctx.as_arc_borrow().new_fence(pseudo_data)?;
>>> +
>>> +        ListArc::pin_init(try_pin_init!(Self {
>>> +            inner,
>>> +            links <- ListLinksSelfPtr::new(),
>>> +            done_fence,
>>> +            hardware_fence: None,
>>> +            nr_of_deps: 0, // TODO implement
>>> +        }), GFP_KERNEL)
>>> +    }
>>> +}
>>> +
>>> +impl_list_arc_safe! {
>>> +    impl{T: ?Sized} ListArcSafe<0> for EnqueuedJob<T> { untracked; }
>>> +}
>>> +
>>> +impl_list_item! {
>>> +    impl ListItem<0> for EnqueuedJob<dyn JobData> { using ListLinksSelfPtr { self.links }; }
>>> +}
>>> +
>>> +// Callback item for the hardware fences to wake / progress the jobqueue.
>>> +struct HwFenceWaker<T> {
>>> +    jobq: Arc<Revocable<SpinLock<InnerJobqueue>>>,
>> 
>> Instead of a Revocable, why not keep an Arc of InnerJobQueue (which should
>> perhaps be called JobQueueInner)?
>> 
>> This way, the user can have this:
>> 
>> struct JobQueue(Arc<JobqueueInner>);
>> 
>> When the user drops the JobQueue, it will schedule whatever teardown
>> operations,
>> 
> 
> What kind of operation would that be? Completing all running_jobs?
> Completing all waiting_jobs? Completing all running_jobs and canceling
> all waiting_jobs? etc.
> 

The same as the current DRM scheduler, i.e.:

static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
					  struct dma_fence_cb *cb)

My understanding is that the JobQueue will follow a similar pattern?

> 
>> but the inner queue will not go out of scope, guaranteeing that
>> there is no UAF at least at this level.
>> 
>> You can create circular references to keep the JobQueueInner alive for as long
>> as the teardown operation is taking place:
>> 
>> struct SomeStructUsedForCleanup {
>>   Arc<JobQueueInner> queue;
>>   // ... more stuff
>> }
>> 
>> struct JobQueueInner {
>>  KVec<Arc<SomeStructUsedForCleanup>> cleanups;
>> }
>> 
>> Given this cycle, both the queue and whatever structs you need for cleanup will
>> remain alive indefinitely. At some point, once whatever cleanup completes, you
>> can break the cycle:
>> 
>> impl Drop for SomeStructUsedForCleanup {
>>   fn drop(...) {
>>     self.queue.cleanups.remove(self)
>>   }
>> }
>> 
>> Once all the cleanups complete, the JobQueueInner will drop.
> 
> Whether your design approach has advantages depends on the above
> question of what "cleanup" means to you?
> 
>> 
>> Note that I'd expect this struct I “invented" to be a DmaFenceCb representing a
>> pending dependency or a job that is already on the ring.
>> 
>>> +    job: ListArc<EnqueuedJob<T>>,
>>> +}
>>> 
> 
> […]
> 
>>> +    fn update_capacity(&mut self, cost: u32) {
>>> +        self.capacity -= cost;
>>> +    }
>>> +
>>> +
>>> +    // Called by the hw_fence callbacks, dependency callbacks, and submit_job().
>>> +    // TODO: does submit_job() ever have to call it?
>> 
>> Hm, yeah, I’d say so.
> 
> Yup. That comment is a relict.
> 
>> 
>>> +    fn start_submit_worker(&mut self) {
>>> +        if self.submit_worker_active {
>>> +            return;
>>> +        }
>>> +
>>> +        // TODO run submit work item
>>> +
>>> +        self.submit_worker_active = true;
>>> +    }
>>> 
> 
> […]
> 
>>> +    /// Submit a job to the jobqueue.
>>> +    ///
>>> +    /// The jobqueue takes ownership over the job and later passes it back to the
>>> +    /// driver by reference through the driver's run_job callback. Jobs are
>>> +    /// passed back by reference instead of by value partially to allow for later
>>> +    /// adding a job resubmission mechanism to be added to [`Jobqueue`].
>>> +    ///
>>> +    /// Jobs get run and their done_fences get signalled in submission order.
>>> +    ///
>>> +    /// Returns the "done_fence" on success, which gets signalled once the
>>> +    /// hardware has completed the job and once the jobqueue is done with a job.
>>> +    pub fn submit_job<U>(&self, job: Pin<KBox<Job<U>>>) -> Result<ARef<DmaFence<i32>>> {
>>> +        let job_cost = job.credits;
>>> +        // TODO: It would be nice if the done_fence's seqno actually matches the
>>> +        // submission order. To do that, however, we'd need to protect job
>>> +        // creation with InnerJobqueue's spinlock. Is that worth it?
>> 
>> Can you guarantee that the seqno will not go backwards?
> 
> As pointed out in the other thread, that could currently happen if a
> driver calls submit_job() with >1 thread.
> 
> IOW, *done_fence* seqnos could end up being enqueued like this
> 
> 42 43 45 44 46
> 
> By taking the lock that could be prevented. However, that's only a
> virtual or tiny win, because jobs could then actually be submitted in
> an order not desired by the driver, but with correct done_fence seqno
> order.
> 
> JQ executes jobs in the order they were submitted to. The fundamental
> question is really: should the JQ care and what should it do if a
> driver spams submit_job() asynchronously?
> 
> I tend to think that there is not really much we can do about that.

Ack

> 
> 
>>> +impl Drop for Jobqueue {
>>> +    fn drop(&mut self) {
>>> +        // The hardware fences might outlive the jobqueue. So hw_fence callbacks
>>> +        // could very well still call into job queue code, resulting in
>>> +        // data UAF or, should the jobqueue code be unloaded, even code UAF.
>> 
>> Not if they reference JobQueueInner as I proposed above.
>> 
>>> +        //
>>> +        // Thus, the jobqueue needs to be cleanly decoupled from the hardware
>>> +        // fences when it drops, in other words, it needs to deregister all its
>>> +        // hw_fence callbacks.
>>> +        //
>>> +        // This, however, could easily deadlock when a hw_fence signals:
>>> +        //
>>> +        // Step     |   Jobqueue step               |   hw_fence step
>>> +        // ------------------------------------------------------------------
>>> +        // 1        |   JQ starts drop              |   fence signals
>>> +        // 2        |   JQ lock taken               |   fence lock taken
>>> +        // 3        |   Tries to take fence lock    |   Tries to take JQ lock
>>> +        // 4        |   ***DEADLOCK***              |   ***DEADLOCK***
>>> +        //
>>> +        // In order to prevent deadlock, we first have to revoke access to the
>>> +        // JQ so that all fence callbacks can't try to take the lock anymore,
>>> +        // and then deregister all JQ callbacks.
>>> +        self.inner.revoke();
>>> +
>>> +        /*
>>> +        let guard = self.inner.lock();
>>> +        for job in self.inner.waiting_jobs {
>>> +            job.deregister_dep_fences();
>>> +        }
>>> +        for job in self.inner.running_jobs {
>>> +            job.deregister_hw_fence();
>>> +        }
>>> +        */
>> 
>> Under my proposal above, you can also wait on dependencies if you want: the
>> drop() thread will not be blocked.
> 
> Maybe (I'd have to look deeper into the idea)
> 
> But what for? When someone drops his jobqueue, one would like to think
> that he doesn't care about all pending jobs anymore anyways. So the
> main thing you need to guarantee is that userspace gets unblocked by
> signaling all fences.

I was basically trying to recreate this:

static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
					  struct dma_fence_cb *cb)
{
	struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
						 finish_cb);
	unsigned long index;

	dma_fence_put(f);

	/* Wait for all dependencies to avoid data corruptions */
	xa_for_each(&job->dependencies, index, f) {

> 
> 
> Note that we had very similar discussions when solving the memory leaks
> in drm_sched_fini(). The TL;DR of those discussions was:
> 
> * Refcounting drm_sched so that it can outlive drm_sched_fini() means
>   that it will continue calling into the driver with the driver
>   callbacks -> UAF
> * Waiting could cause you to block SIGKILL
> * The sanest way to go was deemed to be to signal everything in the
>   pending_list synchronously. Once you've done this, you know for sure
>   that everything is done and clean.
> 
> 
> AFAICS, your proposal might still have the problem of JQ continuously
> calling into driver code?

You’re basically calling wait() and signal(), but not run(). On top of
that, I don’t think the callbacks can actually reach the driver without
taking an extra refcount on some driver structure (iow: we should require the
callbacks to be 'static). So, IIUC, no, this would not call into the
driver.
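
To make the 'static point concrete, a tiny plain-Rust sketch (the
function name is made up): with such a bound the callback cannot borrow
driver data, it has to own (i.e. refcount) whatever it touches.

```rust
// Made-up registration function; the interesting part is the bound.
// `F: 'static` rules out closures that capture references with a
// shorter lifetime (e.g. borrows of driver structures), so the callback
// must own what it uses, typically via an Arc.
fn register_callback<F>(cb: F)
where
    F: FnMut() + Send + 'static,
{
    // A real implementation would store `cb` for later invocation.
    drop(cb);
}
```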



> 
> I think the proposed solution is very clean: when you drop, decouple JQ
> and driver by 100%, stop everything, tear everything down. At least
> that's what drm_sched_fini() should have been from the beginning.
> 
> 
> P.
> 

Sure, I am not saying what I proposed is superior. It's just an alternative
approach that you should consider. I also agree that revoking the queue looks
clean, but again, my question is how you then wait asynchronously like
drm_sched_entity_kill_jobs_cb does, if that’s needed at all.



— Daniel
Re: [RFC WIP 3/3] rust/drm: Add initial jobqueue sceleton
Posted by Philipp Stanner 3 days, 16 hours ago
On Thu, 2025-11-27 at 11:13 -0300, Daniel Almeida wrote:
> 
> 
> > On 25 Nov 2025, at 10:20, Philipp Stanner <phasta@mailbox.org> wrote:
> > 
> > On Mon, 2025-11-24 at 10:58 -0300, Daniel Almeida wrote:
> > > Hi Phillip,
> > > 
> > > > On 18 Nov 2025, at 10:25, Philipp Stanner <phasta@kernel.org> wrote:
> > > > 
> > > > 
> > 
> > […]
> > 
> > > > +use crate::{
> > > > +    prelude::*,
> > > > +    types::ARef,
> > > > +};
> > > > +use kernel::sync::{Arc, SpinLock, new_spinlock, DmaFence, DmaFenceCtx, DmaFenceCb, DmaFenceCbFunc};
> > > > +use kernel::list::*;
> > > > +use kernel::revocable::Revocable;
> > > > +
> > > > +
> > > > +#[pin_data]
> > > > +pub struct Job<T: ?Sized> {
> > > > +    credits: u32,
> > > > +//    dependencies: List, // TODO implement dependency list
> > > 
> > > I am assuming that this will be a list of callbacks?
> > 
> > That's supposed to become the list of DmaFence's which are to be
> > treated as dependencies of this job.
> > 
> > Only once all fences in this list are signaled the JQ will push that
> > job.
> 
> Ok, I was approaching this from the current DRM scheduler design, which (IIRC)
> uses callbacks to represent dependencies.
> 

Depending on what "represent" means, it's the same here. A dependency
is some external fence, by definition one that doesn't come from the
same jobqueue / entity. To know when the dependency has been fulfilled,
you have to register a callback of your own on the fence to wake
yourself.
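
Roughly along these lines (a plain-Rust sketch of the idea, not the
actual DmaFence abstraction; `PendingJob` and `kick_queue` are invented
names):

```rust
use std::sync::{
    atomic::{AtomicU32, Ordering},
    Arc,
};

// Sketch: one waker callback is registered per dependency fence. The
// last dependency to signal makes the job runnable.
struct PendingJob {
    nr_of_deps: AtomicU32,
}

// What the callback registered on one dependency fence would do.
fn dependency_signaled(job: &Arc<PendingJob>) {
    if job.nr_of_deps.fetch_sub(1, Ordering::AcqRel) == 1 {
        kick_queue(job);
    }
}

fn kick_queue(_job: &Arc<PendingJob>) {
    // In the real jobqueue this would start the submit worker.
}
```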

>  IOW: if you managed to register a
> callback on a dependency, it means that it hasn’t signaled yet.

Yep.

> 
> In any case, that was just me trying to understand this better. IMHO, feel free
> to use anything you think it’s best here, like the whole DmaFence struct.

I think what might be a bit confusing is that we're used to drm/sched,
where we have entities and the scheduler instance itself (and
runqueues). Jobqueue is entity and "scheduler" in one piece. So some
functionality that we were used to seeing in sched_entity is now in
Jobqueue.

> 

[…]

> > .
> > 
> > > > +        Ok(())
> > > > +    }
> > > > +
> > > > +    /// Add a [`DmaFence`] or a [`DoneFence`] as this job's dependency. The job
> > > > +    /// will only be executed after that dependency has been finished.
> > > > +    pub fn add_dependency() -> Result {
> > > 
> > > Which would let us remove this ^
> > 
> > It would allow for removing this function, but you'd then just have an
> > optional (some jobs have no dependencies) function parameter in
> > DmaFence::submit_job().
> 
> > 
> > The current idea looks like this:
> > 
> > ```
> > let jobq = JobQueue::new(…);
> > let job = Job::new(driver_data);
> > 
> > job.add_dependency(done_fence_of_shader_in_another_context); // optional
> > job.add_callback(cb_that_will_wake_userspace_or_sth); // optional
> > 
> > let done_fence = jobq.submit_job(job)?;
> > ```
> > 
> > The JQ eats the job (ownership transfer), so by design you have to set
> > all dependencies and specify everything that shall be done when the job
> > finishes _before_ submitting the job.
> > 
> > I think an API in this form makes the order of events very obvious to
> > the user?
> > 
> 
> 
> You’d pass a
> 
> fn submit(…, dependencies: &[DmaFence], callbacks: &[Callback])
> 
> This way a user cannot submit a job without being explicit about dependencies
> and callbacks, i.e.: it cannot be forgotten, while still being optional. 

Hm, yeah. IDK, IMO it's semantically cleaner to have a job and methods
that work on the job, defining its characteristics and its
consequences, plus a function that pushes jobs.

Your idea obviously works, and what you state might be an advantage. As
long as we're in the RFC phase I think we can keep my draft for now and
gather more feedback from the others. But it's probably no big deal
either way.

> 
> > What happens then behind the scenes is that the JQ registers all the
> > callbacks on the done_fence returned above. I'm not super sure about
> > this design idea; it's certainly optional. However, it has the
> > advantage of freeing the JQ user from dealing with races of done_fence.
> > 
> > Otherwise one would have to do something like
> > 
> > ```
> > let done_fence = jobq.submit_job(job)?;
> > 
> > let err = done_fence.register_callback(my_drivers_cb);
> > if err.was_race_and_is_already_signaled() {
> > execute_cb_code_myself_now();
> > }
> > ```
> > 
> > 
> > > 
> > > > +        // TODO: Enqueue passed DmaFence into the job's dependency list.
> > > > +        Ok(())
> > > > +    }
> > > > +
> > > > +    /// Check if there are dependencies for this job. Register the jobqueue
> > > > +    /// waker if yes.
> > > > +    fn arm_deps() -> Result {
> > > 
> > > I wonder if “check_dependencies” would be a better name? Or something
> > > along these lines.
> > 
> > ACK.
> > 
> > > 
> > > > +        // TODO: Register DependencyWaker here if applicable.
> > > > +        Ok(())
> > > > +    }
> > > > +}
> > > > +
> > > > +// Dummy trait for the linked list.
> > > > +trait JobData {
> > > 
> > > > +    fn access_data(&self) -> i32;
> > > 
> > > Can’t we dereference to the data?
> > 
> > That's dummy code that only exists because I so far am failing with
> > even getting the basic List to work.
> > 
> > > 
> > > > +}
> > > > +
> > > > +#[pin_data]
> > > > +struct EnqueuedJob<T: ?Sized> {
> > > > +    inner: Pin<KBox<Job<T>>>,
> > > > +    #[pin]
> > > > +    links: ListLinksSelfPtr<EnqueuedJob<dyn JobData>>,
> > > 
> > > Why not a KVec? A queue type can hold a KVec of enqueued jobs, and this can
> > > hold an Arc of the queue type.
> > 
> > My understanding is that KVec is not intended to be the data structure
> > for this?
> > 
> > KVec is basically like a realloc() in C, an array of same sized
> > elements.
> > 
> > The JQ, hypothetically, can hold an infinite amount of members in its
> > waiting_list, only the running_list is limited by the credit count.
> 
> Then I'd pre-allocate or realloc() as needed. You can reuse the empty slots, so
> there won't be unbounded growth. realloc() also looks fine, because it will
> happen outside of the signaling path.
> 
> My point is that writing your own data structure adds complexity. Your call,
> though.

My impression of the kernel is that it's uncommon to use arrays like
that.

I think your idea is charming because LinkedList is so astonishingly
complex (from my POV). I'm iterating on DmaFence right now. Once that
is done I want to pick up the List topic and investigate your proposal.

> 
> > 
> > 
> > > By extension, ensures that the queue does not
> > > die while we have enqueued jobs.
> > 
> > See below.
> > 
> > > 
> > > 
> > > > +    done_fence: ARef<DmaFence<i32>>, // i32 is just dummy data. TODO: allow for replacing with `()`
> > > > +    // The hardware_fence can by definition only be set at an unknown point in
> > > > +    // time.
> > > > +    // TODO: Think about replacing this with a `struct RunningJob` which consumes
> > > > +    // an `EnqueuedJob`.
> > > > +    hardware_fence: Option<ARef<DmaFence<i32>>>, // i32 is dummy data until there's DmaFence
> > > > +                                                 // without data.
> > > > +    nr_of_deps: u32,
> > > > +}
> > > > +
> > > > +impl<T> EnqueuedJob<T> {
> > > > +    fn new(inner: Pin<KBox<Job<T>>>, fctx: &Arc<DmaFenceCtx>) -> Result<ListArc<Self>> {
> > > > +        let pseudo_data: i32 = 42;
> > > > +        let done_fence = fctx.as_arc_borrow().new_fence(pseudo_data)?;
> > > > +
> > > > +        ListArc::pin_init(try_pin_init!(Self {
> > > > +            inner,
> > > > +            links <- ListLinksSelfPtr::new(),
> > > > +            done_fence,
> > > > +            hardware_fence: None,
> > > > +            nr_of_deps: 0, // TODO implement
> > > > +        }), GFP_KERNEL)
> > > > +    }
> > > > +}
> > > > +
> > > > +impl_list_arc_safe! {
> > > > +    impl{T: ?Sized} ListArcSafe<0> for EnqueuedJob<T> { untracked; }
> > > > +}
> > > > +
> > > > +impl_list_item! {
> > > > +    impl ListItem<0> for EnqueuedJob<dyn JobData> { using ListLinksSelfPtr { self.links }; }
> > > > +}
> > > > +
> > > > +// Callback item for the hardware fences to wake / progress the jobqueue.
> > > > +struct HwFenceWaker<T> {
> > > > +    jobq: Arc<Revocable<SpinLock<InnerJobqueue>>>,
> > > 
> > > Instead of a Revocable, why not keep an Arc of InnerJobQueue (which should
> > > perhaps be called JobQueueInner)?
> > > 
> > > This way, the user can have this:
> > > 
> > > struct JobQueue(Arc<JobqueueInner>);
> > > 
> > > When the user drops the JobQueue, it will schedule whatever teardown
> > > operations,
> > > 
> > 
> > What kind of operation would that be? Completing all running_jobs?
> > Completing all waiting_jobs? Completing all running_jobs and canceling
> > all waiting_jobs? etc.
> > 
> 
> The same as the current DRM scheduler, i.e.:
> 
> static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
>    struct dma_fence_cb *cb)
> 
> My understanding is that the JobQueue will follow a similar pattern?

We have to be careful about taking inspiration from drm/sched. We can
learn from it, and it probably contains a few good ideas, but there's a
reason why we're writing something new instead of putting Rust
abstractions on top :)

._.

The fundamental design problem, which spawns most of the other
problems, is that drm/sched keeps its jobs in two data structures with
different lifetimes: the entities and the pending_list (in the
scheduler itself). Add GPU resets and teardown ordering on top and
you've got an explosive cocktail.

Jobqueue now, as stated above, can contain all job lists in itself,
partly thanks to the 1:1 relationship between hardware rings and
jobqueues.

I'm not even entirely sure, but I think that
drm_sched_entity_kill_jobs_work() exists mostly because of another
drastic design mistake: the existence of the free_job() callback. You
cannot call a driver callback from the driver's own execution context
(same thread), mostly because of locking issues, I think.

It also signals the s_fence, which we don't have in that form; we just
have the done_fence.


I think this shows well why Jobqueue is such a great simplification
compared to drm/sched. We don't have separate entities with separate
lifetimes; we don't have jobs with hybrid ownership (created by the
driver, cleaned up by drm/sched) which make that entity-kill mechanism
necessary. All we need to do when the JQ gets killed (dropped) is:

   1. Signal all still waiting or running jobs' done_fences with an
      error.
   2. Decouple JQ from the hardware_fences and dependency_fences.
   3. Stop / clean up the JQ's resources (so far only a work_item)

Other, quite complicated cleanup work from drm_sched, such as getting
the refcounts right or free_job()ing the jobs, is done for us
automatically by Rust.
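
For illustration, a plain-Rust sketch of that teardown order (made-up
minimal types, not the actual jobqueue API):

```rust
// Made-up minimal types to illustrate the drop() steps listed above.
struct DoneFence;

impl DoneFence {
    fn signal_with_error(&self, _errno: i32) {
        // Would wake everyone waiting on this job's completion.
    }
}

const ECANCELED: i32 = 125;

struct JobqueueInner {
    waiting_jobs: Vec<DoneFence>,
    running_jobs: Vec<DoneFence>,
}

struct Jobqueue {
    inner: JobqueueInner,
}

impl Drop for Jobqueue {
    fn drop(&mut self) {
        // Step 2 actually comes first in the real code: decouple from
        // the hardware and dependency fences (the revoke()), so no fence
        // callback can race with the loop below.

        // Step 1: unblock everybody by signaling the remaining
        // done_fences with an error.
        for fence in self
            .inner
            .waiting_jobs
            .iter()
            .chain(self.inner.running_jobs.iter())
        {
            fence.signal_with_error(-ECANCELED);
        }

        // Step 3: stop the submit work item and release the remaining
        // resources; nothing left to model in this sketch.
    }
}
```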


> 
> > 
> > > 

[…]

> 
> > 
> > 
> > > > +impl Drop for Jobqueue {
> > > > +    fn drop(&mut self) {
> > > > +        // The hardware fences might outlive the jobqueue. So hw_fence callbacks
> > > > +        // could very well still call into job queue code, resulting in
> > > > +        // data UAF or, should the jobqueue code be unloaded, even code UAF.
> > > 
> > > Not if they reference JobQueueInner as I proposed above.
> > > 
> > > > +        //
> > > > +        // Thus, the jobqueue needs to be cleanly decoupled from the hardware
> > > > +        // fences when it drops, in other words, it needs to deregister all its
> > > > +        // hw_fence callbacks.
> > > > +        //
> > > > +        // This, however, could easily deadlock when a hw_fence signals:
> > > > +        //
> > > > +        // Step     |   Jobqueue step               |   hw_fence step
> > > > +        // ------------------------------------------------------------------
> > > > +        // 1        |   JQ starts drop              |   fence signals
> > > > +        // 2        |   JQ lock taken               |   fence lock taken
> > > > +        // 3        |   Tries to take fence lock    |   Tries to take JQ lock
> > > > +        // 4        |   ***DEADLOCK***              |   ***DEADLOCK***
> > > > +        //
> > > > +        // In order to prevent deadlock, we first have to revoke access to the
> > > > +        // JQ so that all fence callbacks can't try to take the lock anymore,
> > > > +        // and then deregister all JQ callbacks.
> > > > +        self.inner.revoke();
> > > > +
> > > > +        /*
> > > > +        let guard = self.inner.lock();
> > > > +        for job in self.inner.waiting_jobs {
> > > > +            job.deregister_dep_fences();
> > > > +        }
> > > > +        for job in self.inner.running_jobs {
> > > > +            job.deregister_hw_fence();
> > > > +        }
> > > > +        */
> > > 
> > > Under my proposal above, you can also wait on dependencies if you want: the
> > > drop() thread will not be blocked.
> > 
> > Maybe (I'd have to look deeper into the idea)
> > 
> > But what for? When someone drops his jobqueue, one would like to think
> > that he doesn't care about all pending jobs anymore anyways. So the
> > main thing you need to guarantee is that userspace gets unblocked by
> > signaling all fences.
> 
> I was basically trying to recreate this:
> 
> static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
>    struct dma_fence_cb *cb)
> {
>  struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
>  finish_cb);
>  unsigned long index;
> 
>  dma_fence_put(f);
> 
>  /* Wait for all dependencies to avoid data corruptions */
>  xa_for_each(&job->dependencies, index, f) {

We also decouple from the jobs' dependencies in drop(). As for other
parties that have jobs from this JQ as dependencies: they will get
unblocked by the done_fences being signaled with -ECANCELED.

What we could think about is whether something like

jq.wait_to_become_idle(timeout);

could make sense. In drm/sched, such a proposal was mostly rejected
because all drivers currently have their own solutions for that AFAIK
(Nouveau has a waitqueue, but I think Nouveau actually doesn't care
what happens to pending userspace jobs when drm_sched_fini() is reached
anyways; the waitqueue exists for page-table cleanup work).

Also, Christian pointed out that it could block SIGKILL when the
hardware is really slow. But an interruptible version with a timeout
could make sense, I think?
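
Something in this direction, sketched with std primitives (the real
thing would use a kernel waitqueue and an interruptible wait;
`wait_to_become_idle` is just the hypothetical name from above):

```rust
use std::sync::{Condvar, Mutex};
use std::time::Duration;

// Sketch of a timeout-capable idle wait. A completing job would
// decrement `outstanding_jobs` under the lock and notify the condvar.
struct IdleWait {
    outstanding_jobs: Mutex<usize>,
    idle: Condvar,
}

impl IdleWait {
    /// Returns true if the queue went idle within the timeout.
    fn wait_to_become_idle(&self, timeout: Duration) -> bool {
        let guard = self.outstanding_jobs.lock().unwrap();
        let (_guard, res) = self
            .idle
            .wait_timeout_while(guard, timeout, |jobs| *jobs > 0)
            .unwrap();
        !res.timed_out()
    }
}
```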

It'll depend a bit on when a Rust driver really drop()s its JQ. In
principle, the driver should signal all hw_fences before drop()ing it.


> 
> > 
> > 
> > Note that we had very similar discussions when solving the memory leaks
> > in drm_sched_fini(). The TL;DR of those discussions was:
> > 
> > * Refcounting drm_sched so that it can outlive drm_sched_fini() means
> >   that it will continue calling into the driver with the driver
> >   callbacks -> UAF
> > * Waiting could cause you to block SIGKILL
> > * The sanest way to go was deemed to be to signal everything in the
> >   pending_list synchronously. Once you've done this, you know for sure
> >   that everything is done and clean.
> > 
> > 
> > AFAICS, your proposal might still have the problem of JQ continuously
> > calling into driver code?
> 
> You’re basically calling wait() and signal(), but not run(). On top of
> that, I don’t think the callbacks can actually reach the driver without
> taking an extra refcount on some driver structure (iow: we should require the
> callbacks to be 'static). So, IIUC, no, this would not call into the
> driver.

Rust's safety guarantees are limited here AFAIU, because the language
doesn't understand that 'static stuff can be unloaded in the kernel.

So if the driver got unloaded but the JQ lived on, the JQ might, at the
very least, call into an unloaded text segment / unloaded code. Or am I
mistaken?


Anyways, I believe that synchronous solutions are always to be
preferred, because they are just simpler, provide a fixed
synchronization point, and are easier for programmers to wrap their
head around.

IMO we should only make that async if it's necessary. But currently,
drop() would only wait for an RCU grace period (the revoke()) and then
signal all done_fences. As far as I understand the dma_fence contract,
no one is allowed to register anything blocking on dma_fences, so that
should be fine.

But correct me if I'm wrong. dma_fences are very complicated...


Greetings,
P.