This adds the ability for the kernel to shut down devices asynchronously.
Only devices with drivers that enable it are shut down asynchronously.
This can dramatically reduce system shutdown/reboot time on systems that
have multiple devices that take many seconds to shut down (like certain
NVMe drives). On one system tested, the shutdown time went from 11
minutes without this patch to 55 seconds with the patch.

Changes from V9:
Address resource and timing issues when spawning a unique async thread
for every device during shutdown:
 * Make the asynchronous threads able to shut down multiple devices,
   instead of spawning a unique thread for every device.
 * Modify core kernel async code with a custom wake function so it
   doesn't wake up threads waiting to synchronize every time the cookie
   changes

Changes from V8:
Deal with shutdown hangs resulting when a parent/supplier device is
later in the devices_kset list than its children/consumers:
 * Ignore sync_state_only devlinks for shutdown dependencies
 * Ignore shutdown_after for devices that don't want async shutdown
 * Add a sanity check to revert to sync shutdown for any device that
   would otherwise wait for a child/consumer shutdown that hasn't
   already been scheduled

Changes from V7:
Do not expose driver async_shutdown_enable in sysfs.
Wrapped a long line.

Changes from V6:
Removed a sysfs attribute that allowed the async device shutdown to be
"on" (with driver opt-out), "safe" (driver opt-in), or "off"... what was
previously "safe" is now the only behavior, so drivers now only need the
option to enable or disable async shutdown.

Changes from V5:
Separated into multiple patches to make review easier.
Reworked some code to make it more readable.
Made devices wait for consumers to shut down, not just children
(suggested by David Jeffery).

Changes from V4:
Changed the code to use cookies for synchronization rather than async
domains.
Allowed async shutdown to be disabled via sysfs, and allowed driver
opt-in or opt-out of async shutdown (when not disabled), with the
ability to control driver opt-in/opt-out via sysfs.

Changes from V3:
Bug fix (used "parent" not "dev->parent" in device_shutdown).

Changes from V2:
Removed recursive functions to schedule children to be shut down before
parents, since the existing device_shutdown loop will already do this.

Changes from V1:
Rewritten using kernel async code (suggested by Lukas Wunner).

David Jeffery (1):
  kernel/async: streamline cookie synchronization

Stuart Hayes (4):
  driver core: don't always lock parent in shutdown
  driver core: separate function to shutdown one device
  driver core: shut down devices asynchronously
  nvme-pci: Make driver prefer asynchronous shutdown

 drivers/base/base.h           |   8 ++
 drivers/base/core.c           | 210 +++++++++++++++++++++++++++++-----
 drivers/nvme/host/pci.c       |   1 +
 include/linux/device/driver.h |   2 +
 kernel/async.c                |  42 ++++++-
 5 files changed, 236 insertions(+), 27 deletions(-)

-- 
2.39.3
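(As a rough illustration of the opt-in model described in the cover
letter: the flag name async_shutdown_enable comes from the V7 changelog
above, but its exact placement in struct device_driver and this
nvme-style initializer are assumptions for illustration, not the literal
patch.)

/*
 * Minimal sketch of the driver opt-in. A driver sets the flag; the
 * driver core is then free to run its shutdown callback from an async
 * thread, after the device's children/consumers have finished.
 */
#include <linux/device/driver.h>
#include <linux/pci.h>

static void my_nvme_shutdown(struct pci_dev *pdev)
{
	/* the slow controller shutdown that benefits from running async */
}

static struct pci_driver my_nvme_driver = {
	.name     = "my-nvme",
	.shutdown = my_nvme_shutdown,
	.driver   = {
		/* opt in: the core may shut this driver's devices down async */
		.async_shutdown_enable = true,
	},
};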
On Wed, Jun 25, 2025 at 03:18:48PM -0500, Stuart Hayes wrote:
> Address resource and timing issues when spawning a unique async thread
> for every device during shutdown:
>  * Make the asynchronous threads able to shut down multiple devices,
>    instead of spawning a unique thread for every device.
>  * Modify core kernel async code with a custom wake function so it
>    doesn't wake up threads waiting to synchronize every time the cookie
>    changes

Given all these thread spawning issues, why can't we just go back to the
approach that kicks off shutdown asynchronously and then waits for it
without spawning all these threads?
On Thu, Jul 3, 2025 at 7:47 AM Christoph Hellwig <hch@lst.de> wrote:
>
> Given all these thread spawning issues, why can't we just go back
> to the approach that kicks off shutdown asynchronously and then waits
> for it without spawning all these threads?

The async subsystem problem is something that should be fixed regardless
of async shutdown. Async shutdown's use just exposed its thundering herd
behavior, which is easily fixed.

Reducing the threads is just good optimization. Doing every callback in
its own thread adds extra overhead that isn't needed for either ordering
or the gains of async shutdown, and combining sync devices into a thread
where reasonable didn't add much complexity to the code.

The older non-thread approaches were unpopular because they still added
plenty of complexity to the shutdown logic while pushing either ugly
splits or their own thread creation down into the individual shutdown
routines of drivers.

David Jeffery
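(For readers following the thread: the "custom wake function" under
discussion has roughly the shape below. The names are illustrative
assumptions, not the literal patch; the real fix lives in
kernel/async.c.)

#include <linux/wait.h>
#include <linux/async.h>

/* Illustrative waiter tagged with the cookie it is synchronizing to. */
struct cookie_waiter {
	struct wait_queue_entry wq;
	async_cookie_t cookie;
};

/*
 * Called for every waiter on the queue when progress is made. Returning
 * 0 skips the wakeup (and the context switch) for waiters whose target
 * cookie hasn't completed yet, avoiding the thundering herd.
 */
static int cookie_wake_fn(struct wait_queue_entry *wq, unsigned int mode,
			  int sync, void *key)
{
	struct cookie_waiter *cw = container_of(wq, struct cookie_waiter, wq);
	async_cookie_t lowest_in_progress = *(async_cookie_t *)key;

	if (lowest_in_progress <= cw->cookie)
		return 0;
	return autoremove_wake_function(wq, mode, sync, key);
}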
On Fri, Jul 04, 2025 at 09:38:15AM -0400, David Jeffery wrote:
> The async subsystem problem is something that should be fixed
> regardless of async shutdown. Async shutdown's use just exposed its
> thundering herd behavior, which is easily fixed.

Great, then please submit that on its own and get the maintainer of that
subsystem to agree and accept it, as I have no way to judge that code at
all.

thanks,

greg k-h
On Fri, Jul 4, 2025 at 9:45 AM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> Great, then please submit that on its own and get the maintainer of
> that subsystem to agree and accept it, as I have no way to judge that
> code at all.

Unfortunately, it does not have a maintainer listed and sees limited
activity. Would CC-ing some of the recent developers of kernel/async.c
to ask them to review be recommended in this situation?

David Jeffery
On Fri, Jul 04, 2025 at 10:09:01AM -0400, David Jeffery wrote:
> Unfortunately, it does not have a maintainer listed and sees limited
> activity. Would CC-ing some of the recent developers of kernel/async.c
> to ask them to review be recommended in this situation?

Like any other piece of the kernel, yes :)
On 7/3/2025 6:46 AM, Christoph Hellwig wrote:
> Given all these thread spawning issues, why can't we just go back
> to the approach that kicks off shutdown asynchronously and then waits
> for it without spawning all these threads?

If you mean for drivers to have an extra call like shutdown_start()
that's called for all devices before waiting for any of them to finish,
it seems like that would just push the work of spawning a shutdown
thread onto the individual drivers (unless they just had a single
command they could issue to the device that takes all the time), and
synchronization might be more of an issue, since shutdown_start() could
be called on a device before that device's children or dependents have
finished shutting down.

With the async shutdown code from this patch set in place, any driver
can be fully enabled to shut down devices asynchronously with no work
other than setting a flag to allow it, and no device will start shutting
down until its children and dependents have finished.

I think the main issue with the previous version of this patch set was
just that it was taking too long on smaller systems to spawn a thread
for every single device shutdown, when most or all of those devices were
shutting down synchronously anyway. This version solves that issue: it
should only spawn on the order of roughly 2N threads on a system with N
async shutdown devices, and it improves the kernel async code to make
synchronizing to a cookie faster, too.
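(A sketch of the hypothetical two-phase interface Stuart describes
above; this is the alternative being argued against, not proposed code,
and both callback names are invented for illustration.)

#include <linux/device.h>

/*
 * Hypothetical two-phase shutdown: the core would call shutdown_start()
 * on every device first, then come back and wait on each one. Each
 * driver would have to split its shutdown into kick-off and wait
 * halves, or spawn its own thread.
 */
struct two_phase_shutdown_ops {
	void (*shutdown_start)(struct device *dev); /* kick off, return */
	void (*shutdown_wait)(struct device *dev);  /* block until done */
};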
On Thu, Jul 03, 2025 at 01:46:56PM +0200, Christoph Hellwig wrote:
> Given all these thread spawning issues, why can't we just go back
> to the approach that kicks off shutdown asynchronously and then waits
> for it without spawning all these threads?

It isn't just an nvme issue. Red Hat found the same issue with SCSI
devices.

My colleague Sultan Alsawaf posted a simpler fix for the earlier patch
here:

https://lists.infradead.org/pipermail/linux-nvme/2025-January/053666.html

Maybe this could be explored.
On Thu, Jul 3, 2025 at 12:13 PM Jeremy Allison <jra@samba.org> wrote:
>
> My colleague Sultan Alsawaf posted a simpler fix for the earlier patch
> here:
>
> https://lists.infradead.org/pipermail/linux-nvme/2025-January/053666.html
>
> Maybe this could be explored.

Unfortunately, this approach looks flawed. If I am reading it right, it
assumes async shutdown devices do not have dependencies on sync shutdown
devices. This is not a valid assumption, so it violates dependency
ordering. Maintaining all the dependencies is the core problem and the
source of the complexity of the async shutdown patches.

David Jeffery
On Fri, Jul 04, 2025 at 09:45:44AM -0400, David Jeffery wrote:
> Unfortunately, this approach looks flawed. If I am reading it right,
> it assumes async shutdown devices do not have dependencies on sync
> shutdown devices.

It does not make any such assumption. Dependency on a sync device is
handled through a combination of queue_device_async_shutdown() setting
an async device's shutdown_after and the synchronous shutdown loop
dispatching an "async" shutdown for a sync device when it encounters a
sync device that has a downstream async dependency.

> Maintaining all the dependencies is the core problem and the source of
> the complexity of the async shutdown patches.

I am acutely aware. Please take a closer look at my patch.

Sultan
On Fri, Jul 4, 2025 at 12:26 PM Sultan Alsawaf <sultan@kerneltoast.com> wrote:
>
> It does not make any such assumption. Dependency on a sync device is
> handled through a combination of queue_device_async_shutdown() setting
> an async device's shutdown_after and the synchronous shutdown loop
> dispatching an "async" shutdown for a sync device when it encounters a
> sync device that has a downstream async dependency.

Yes, but that is not what I think fails. This handles a sync parent
having an async child. It does not handle the reverse, a sync child
having an async parent.

For example, take a system with one PCI NVMe device. The NVMe device,
which is flagged for async shutdown, can have sync shutdown children as
well as a sync shutdown parent. The patch linked pulls the async device
off the shutdown list into a separate async list, then starts this lone
async device with queue_device_async_shutdown() from the async list. The
device is then passed to the async subsystem running
shutdown_one_device_async(), where it will immediately do shutdown due
to a zero-value shutdown_after. The patch sets shutdown_after for its
parent, but there is nothing connecting and ordering the async device
after its sync children, which will be shut down later from the original
device_shutdown() task.

> I am acutely aware. Please take a closer look at my patch.

I have, and it still looks incomplete to me.

David Jeffery
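(A compressed sketch of the linked patch's async path as David reads it;
shutdown_after is that patch's field, not a mainline struct device
member, and the bus-vs-driver shutdown dispatch is elided.)

#include <linux/async.h>
#include <linux/device.h>

static void shutdown_one_device_async(void *data, async_cookie_t cookie)
{
	struct device *dev = data;

	/*
	 * Waits for everything scheduled before shutdown_after. For the
	 * lone NVMe device in the example, shutdown_after is still zero,
	 * so this returns immediately: nothing orders this shutdown
	 * after the device's sync children, which device_shutdown()
	 * only handles later.
	 */
	async_synchronize_cookie(dev->shutdown_after);

	if (dev->driver && dev->driver->shutdown)
		dev->driver->shutdown(dev);
}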
On 7/7/2025 10:34 AM, David Jeffery wrote:
> On Fri, Jul 4, 2025 at 12:26 PM Sultan Alsawaf <sultan@kerneltoast.com> wrote:
> >
> > I am acutely aware. Please take a closer look at my patch.
>
> I have, and it still looks incomplete to me.

Also, the way it is implemented, it could hang if there isn't enough
memory to queue up all of the async device shutdowns before starting the
synchronous shutdowns.

When you call async_schedule_domain() and there's not enough memory to
allocate an async_entry, the async function will be run immediately. If
that happens when queuing up the async shutdowns, it could easily hang
if there are any dependencies requiring an async device shutdown to wait
for a synchronous device to shut down, since none of the synchronous
shutdown devices have been scheduled yet.

Before your patch, all device shutdowns are scheduled in order, such
that if the call to async_schedule_domain() runs the function
immediately, it will still be able to complete; it just won't get the
benefit of multiple threads shutting down devices concurrently. The V10
patch maintains that--it just lets one thread shut down multiple
synchronous devices instead of creating one thread for each synchronous
device shutdown.
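(For context, the fallback Stuart describes is mainline behavior;
paraphrased from async_schedule_node_domain() in kernel/async.c, this is
why an allocation failure turns a "scheduled" shutdown into an
immediate, in-context one:)

	entry = kzalloc(sizeof(struct async_entry), GFP_ATOMIC);
	if (!entry || atomic_read(&entry_count) > MAX_WORK) {
		kfree(entry);
		spin_lock_irqsave(&async_lock, flags);
		newcookie = next_cookie++;
		spin_unlock_irqrestore(&async_lock, flags);

		/* low on memory.. run synchronously */
		func(data, newcookie);
		return newcookie;
	}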
On Mon, Jul 07, 2025 at 03:49:44PM -0500, stuart hayes wrote:
> When you call async_schedule_domain() and there's not enough memory to
> allocate an async_entry, the async function will be run immediately. If
> that happens when queuing up the async shutdowns, it could easily hang
> if there are any dependencies requiring an async device shutdown to
> wait for a synchronous device to shut down, since none of the
> synchronous shutdown devices have been scheduled yet.

FWIW, I did consider this scenario when I wrote my patch, and this hang
cannot occur with the way I implemented async shutdown. The reason is
that my patch assumes async devices never need to wait on synchronous
devices to shut down, as David pointed out. My patch only assumes that
synchronous devices need to wait for async devices, so any allocation
failures in async_schedule_domain() won't cause the hang you described.

Sultan
On Mon, Jul 07, 2025 at 03:49:44PM -0500, stuart hayes wrote:
> On 7/7/2025 10:34 AM, David Jeffery wrote:
> > For example, take a system with one PCI NVMe device. The NVMe
> > device, which is flagged for async shutdown, can have sync shutdown
> > children as well as a sync shutdown parent. [...] The patch sets
> > shutdown_after for its parent, but there is nothing connecting and
> > ordering the async device after its sync children, which will be
> > shut down later from the original device_shutdown() task.
>
> Also, the way it is implemented, it could hang if there isn't enough
> memory to queue up all of the async device shutdowns before starting
> the synchronous shutdowns.
>
> When you call async_schedule_domain() and there's not enough memory to
> allocate an async_entry, the async function will be run immediately.

Understood. Thank you both for the clarifications.

Regarding an async device with a sync child: this case strikes me as
odd. What exactly makes the child device require "synchronous" shutdown?
Synchronous relative to what, specifically?

This also makes me question what, exactly, the criteria are for
determining that a device is safe to shut down "async". I think that all
children of an async shutdown device should be enrolled into async
shutdown, to isolate the entire async dependency chain. That way, the
async shutdown devices don't need to wait for synchronous shutdown of
unrelated devices. I'm happy to do this in a v3.

Regarding async_schedule_domain() running synchronously when async_entry
allocation fails: this is a bothersome constraint of the async helpers.
Although there is a lot of overlap between the async helpers and the
requirements for async device shutdown, there are other annoying
constraints that make the async helpers unfit as-is for async device
shutdown (like the arbitrary MAX_WORK limit).

The async helpers also don't do exclusive wakes, which leads to the
behavior you observed in "[PATCH v10 1/5] kernel/async: streamline
cookie synchronization". We could use exclusive wakes by isolating async
shutdown device dependency chains and creating a different async_domain
for each chain, which is faster than calling TTWU on all async waiters
and filtering out the wakeups ad hoc. These points make me think that
either the async helpers should be made far more generic or the driver
core code should have its own async infrastructure.

> Before your patch, all device shutdowns are scheduled in order, such
> that if the call to async_schedule_domain() runs the function
> immediately, it will still be able to complete; it just won't get the
> benefit of multiple threads shutting down devices concurrently. The
> V10 patch maintains that--it just lets one thread shut down multiple
> synchronous devices instead of creating one thread for each
> synchronous device shutdown.

This still dedicates threads to shutting down synchronous devices that
don't share a dependency chain with an async shutdown device. This
shouldn't be necessary.

(PS: fixed the CC of Jan Kiszka <jan.kiszka@siemens.com>, who appears to
have been CC'd with a typo'd email address for at least the entire v9
patchset and this v10 patchset. The CC list had siemens.com misspelled
as seimens.com.)

Sultan
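(A rough sketch of the per-chain domain idea using the existing async
API; grouping devices into chains is the part that would need new code,
and intra-chain ordering is elided.)

#include <linux/async.h>
#include <linux/device.h>

/*
 * Exclusive domains are skipped by async_synchronize_full(), so waiters
 * on this domain only contend with work from the same chain.
 */
static ASYNC_DOMAIN_EXCLUSIVE(chain_domain);

static void chain_dev_shutdown(void *data, async_cookie_t cookie)
{
	struct device *dev = data;

	/* intra-chain child-before-parent ordering (cookies) elided */
	if (dev->driver && dev->driver->shutdown)
		dev->driver->shutdown(dev);
}

static void shutdown_chain(struct device **devs, int n)
{
	int i;

	for (i = 0; i < n; i++)
		async_schedule_domain(chain_dev_shutdown, devs[i],
				      &chain_domain);

	/* wait only for this chain, not for unrelated devices */
	async_synchronize_full_domain(&chain_domain);
}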
On Mon, Jul 07, 2025 at 05:00:34PM -0700, Sultan Alsawaf wrote:
> The async helpers also don't do exclusive wakes, which leads to the
> behavior you observed in "[PATCH v10 1/5] kernel/async: streamline
> cookie synchronization". We could use exclusive wakes by isolating
> async shutdown device dependency chains and creating a different
> async_domain for each chain, which is faster than calling TTWU on all
> async waiters and filtering out the wakeups ad hoc.

Correcting myself here: the custom wake function filters out wakeups
_before_ TTWU, so what I said in this paragraph is incorrect and
therefore exclusive wakes aren't necessary or helpful. My apologies.

Sultan