drivers/gpu/drm/drm_gpuvm.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-)
Currently GPUVM relies on the owner implicitly holding a refcount to the
drm device, and it does not implicitly take a refcount on the drm
device. This design is error-prone, so take a refcount on the device.
Suggested-by: Danilo Krummrich <dakr@kernel.org>
Signed-off-by: Alice Ryhl <aliceryhl@google.com>
---
drivers/gpu/drm/drm_gpuvm.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/drm_gpuvm.c b/drivers/gpu/drm/drm_gpuvm.c
index 44acfe4120d2..000e7910a899 100644
--- a/drivers/gpu/drm/drm_gpuvm.c
+++ b/drivers/gpu/drm/drm_gpuvm.c
@@ -25,6 +25,7 @@
*
*/
+#include <drm/drm_drv.h>
#include <drm/drm_gpuvm.h>
#include <drm/drm_print.h>
@@ -1117,6 +1118,7 @@ drm_gpuvm_init(struct drm_gpuvm *gpuvm, const char *name,
gpuvm->drm = drm;
gpuvm->r_obj = r_obj;
+ drm_dev_get(drm);
drm_gem_object_get(r_obj);
drm_gpuvm_warn_check_overflow(gpuvm, start_offset, range);
@@ -1160,13 +1162,15 @@ static void
drm_gpuvm_free(struct kref *kref)
{
struct drm_gpuvm *gpuvm = container_of(kref, struct drm_gpuvm, kref);
+ struct drm_device *drm = gpuvm->drm;
drm_gpuvm_fini(gpuvm);
- if (drm_WARN_ON(gpuvm->drm, !gpuvm->ops->vm_free))
+ if (drm_WARN_ON(drm, !gpuvm->ops->vm_free))
return;
gpuvm->ops->vm_free(gpuvm);
+ drm_dev_put(drm);
}
/**
---
base-commit: 126c50bc2fb6ddfe5b7718de67bbd7592a1062bb
change-id: 20260416-gpuvm-drm-dev-get-5ded89c39bb3
Best regards,
--
Alice Ryhl <aliceryhl@google.com>
On Thu, 16 Apr 2026 13:10:54 +0000, Alice Ryhl wrote:
> [PATCH] drm/gpuvm: take refcount on DRM device
Applied, thanks!
Branch: drm-rust-next
Tree: https://gitlab.freedesktop.org/drm/rust/kernel.git
[1/1] drm/gpuvm: take refcount on DRM device
commit: c2d72717e0a9
The patch will appear in the next linux-next integration (typically within 24
hours on weekdays).
The patch is queued up for the upcoming merge window for the next major kernel
release.
Hi,
On Thu, 2026-04-16 at 13:10 +0000, Alice Ryhl wrote:
> Currently GPUVM relies on the owner implicitly holding a refcount to
> the
> drm device, and it does not implicitly take a refcount on the drm
> device. This design is error-prone, so take a refcount on the device.
>
> Suggested-by: Danilo Krummrich <dakr@kernel.org>
> Signed-off-by: Alice Ryhl <aliceryhl@google.com>
This is problematic since typically you also need a module reference
when taking a drm device reference.
The reason for this is that the devres reference on the drm device
expects to be the last one, since it might be called from the module
exit function of the driver. Now if there is an additional reference
held at that point the driver module can be unloaded with a dangling
reference to the drm device.
On the other hand, if you in addition take a module reference then that
blocks the driver module from being unloaded while held, just like a
drm file reference. This leads to complicated module release schemes
like the one in drm_pagemap where the module refcount is released from
a work item that is waited on in the drm_pagemap exit function.
I'm working to lift the module refcount requirement, but meanwhile I'd
recommend that in the file close callback, we'd make sure all
drm_gpuvms have called their drm_gpuvm_free() function, because then we
are sure that the drm_device is still alive and the module still
pinned.
Thanks,
Thomas
> ---
> drivers/gpu/drm/drm_gpuvm.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/drm_gpuvm.c
> b/drivers/gpu/drm/drm_gpuvm.c
> index 44acfe4120d2..000e7910a899 100644
> --- a/drivers/gpu/drm/drm_gpuvm.c
> +++ b/drivers/gpu/drm/drm_gpuvm.c
> @@ -25,6 +25,7 @@
> *
> */
>
> +#include <drm/drm_drv.h>
> #include <drm/drm_gpuvm.h>
> #include <drm/drm_print.h>
>
> @@ -1117,6 +1118,7 @@ drm_gpuvm_init(struct drm_gpuvm *gpuvm, const
> char *name,
> gpuvm->drm = drm;
> gpuvm->r_obj = r_obj;
>
> + drm_dev_get(drm);
> drm_gem_object_get(r_obj);
>
> drm_gpuvm_warn_check_overflow(gpuvm, start_offset, range);
> @@ -1160,13 +1162,15 @@ static void
> drm_gpuvm_free(struct kref *kref)
> {
> struct drm_gpuvm *gpuvm = container_of(kref, struct
> drm_gpuvm, kref);
> + struct drm_device *drm = gpuvm->drm;
>
> drm_gpuvm_fini(gpuvm);
>
> - if (drm_WARN_ON(gpuvm->drm, !gpuvm->ops->vm_free))
> + if (drm_WARN_ON(drm, !gpuvm->ops->vm_free))
> return;
>
> gpuvm->ops->vm_free(gpuvm);
> + drm_dev_put(drm);
> }
>
> /**
>
> ---
> base-commit: 126c50bc2fb6ddfe5b7718de67bbd7592a1062bb
> change-id: 20260416-gpuvm-drm-dev-get-5ded89c39bb3
>
> Best regards,
On Fri Apr 17, 2026 at 4:41 PM CEST, Thomas Hellström wrote: > This is problematic since typically you also need a module reference > when taking a drm device reference. > > The reason for this is that the devres reference on the drm device > expects to be the last one, since it might be called from the module > exit function of the driver. No, this is not how it works; if this would be true then drmm_* would be pretty pointless in the first place, as one could just use devm_* for everything. Citing the commit introducing drmm_* APIs: "The biggest wrong pattern is that developers use devm_, which ties the release action to the underlying struct device, whereas all the userspace visible stuff attached to a drm_device can long outlive that one (e.g. after a hotunplug while userspace has open files and mmap'ed buffers)." > Now if there is an additional reference held at that point the driver module > can be unloaded with a dangling reference to the drm device. > > On the other hand, if you in addition take a module reference then that > blocks the driver module from being unloaded while held, just like a > drm file reference. This leads to complicated module release schemes > like the one in drm_pagemap where the module refcount is released from > a work item that is waited on in the drm_pagemap exit function. > > I'm working to lift the module refcount requirement, but meanwhile I'd > recommend that in the file close callback, we'd make sure all > drm_gpuvms have called their drm_gpuvm_free() function, because then we > are sure that the drm_device is still alive and the module still > pinned. If GPUVM has a pointer to the DRM device, it implies shared ownership and hence GPUVM should account for this shared ownership and take a reference count. The fact that GPUVM must not outlive module unload when it has driver callbacks attached is an orthogonal requirement. The module lifetime / callback issue is a separate problem that exists regardless of whether you hold a device refcount. Not taking the refcount doesn't make the module problem go away, it just adds a second, independent bug. If struct drm_device itself, e.g. due to drm_dev_release() requires a module refcount, then this is on struct drm_device to ensure this constraint (or remove the requirement). IOW, if I get to choose between a DRM component that has a pointer to a DRM device stalls module unload and a DRM component that has a pointer to a DRM device oopses the kernel when used wrongly, I prefer the former. - Danilo
On Fri, 2026-04-17 at 21:33 +0200, Danilo Krummrich wrote: > On Fri Apr 17, 2026 at 4:41 PM CEST, Thomas Hellström wrote: > > This is problematic since typically you also need a module > > reference > > when taking a drm device reference. > > > > The reason for this is that the devres reference on the drm device > > expects to be the last one, since it might be called from the > > module > > exit function of the driver. > > No, this is not how it works; if this would be true then drmm_* would > be pretty > pointless in the first place, as one could just use devm_* for > everything. > > Citing the commit introducing drmm_* APIs: > > "The biggest wrong pattern is that developers use devm_, > which ties the > release action to the underlying struct device, whereas all > the > userspace visible stuff attached to a drm_device can long > outlive that > one (e.g. after a hotunplug while userspace has open files > and mmap'ed > buffers)." Yeah, I was a bit unclear and partly incorrect. This only happens *if there are no other holders of the driver module reference. (But see below WRT potential other holders)- driver_module_unload()->...->pci_dev_remove -> devm_release-> drm_dev_put-><module is unloaded>. So if, at this point there are additional drm device references, they'd point to dangling devices. > > > Now if there is an additional reference held at that point the > > driver module > > can be unloaded with a dangling reference to the drm device. > > > > On the other hand, if you in addition take a module reference then > > that > > blocks the driver module from being unloaded while held, just like > > a > > drm file reference. This leads to complicated module release > > schemes > > like the one in drm_pagemap where the module refcount is released > > from > > a work item that is waited on in the drm_pagemap exit function. > > > > I'm working to lift the module refcount requirement, but meanwhile > > I'd > > recommend that in the file close callback, we'd make sure all > > drm_gpuvms have called their drm_gpuvm_free() function, because > > then we > > are sure that the drm_device is still alive and the module still > > pinned. > > If GPUVM has a pointer to the DRM device, it implies shared ownership > and hence > GPUVM should account for this shared ownership and take a reference > count. > > The fact that GPUVM must not outlive module unload when it has driver > callbacks > attached is an orthogonal requirement. > > The module lifetime / callback issue is a separate problem that > exists > regardless of whether you hold a device refcount. Not taking the > refcount > doesn't make the module problem go away, it just adds a second, > independent bug. > > If struct drm_device itself, e.g. due to drm_dev_release() requires a > module > refcount, then this is on struct drm_device to ensure this constraint > (or remove > the requirement). > > IOW, if I get to choose between a DRM component that has a pointer to > a DRM > device stalls module unload and a DRM component that has a pointer to > a DRM > device oopses the kernel when used wrongly, I prefer the former. I agree with your reasoning here, but current fact is that most (if not all) holders of a drm device reference (files, pagemaps, dma-bufs) currently also hold a module reference to protect against this, and drm_gpuvm would be an outlier. To fix this properly (lifting that requirement) one could introduce a drm device count in the module and have the module exit function wait for it to become zero, *and* that the code that did the last decrement finished executing. https://patchwork.freedesktop.org/patch/712146/?series=163298&rev=1 Or one could also have the drm device hold a reference count on the driver module, but that would block unloading without previous unbind which is not typical driver behaviour and would likely be seen as a regression. /Thomas > > - Danilo
On Mon Apr 20, 2026 at 11:28 AM CEST, Thomas Hellström wrote: > I agree with your reasoning here, but current fact is that most (if not > all) holders of a drm device reference (files, pagemaps, dma-bufs) > currently also hold a module reference to protect against this, and > drm_gpuvm would be an outlier. I'm not convinced; if the DRM device has the requirement to not outlive the module it is associated with, then the DRM device code has to take care of this requirement, and not every caller of drm_dev_get(). Besides that, if GPUVM holds the module reference count on behalf of the DRM device, it has the same effect that you rightfully point out below -- it breaks rmmod. > To fix this properly (lifting that requirement) one could introduce a > drm device count in the module and have the module exit function wait > for it to become zero, *and* that the code that did the last decrement > finished executing. > > https://patchwork.freedesktop.org/patch/712146/?series=163298&rev=1 This looks like a reasonable fix to me. And it makes me conclude that we basically agree on everything. :) Regarding the reference count in the meantime, it remains that omitting it does not solve the underlying problem, i.e. I still think it is orthogonal. > Or one could also have the drm device hold a reference count on the > driver module, but that would block unloading without previous unbind > which is not typical driver behaviour and would likely be seen as a > regression.
On Mon, 2026-04-20 at 17:08 +0200, Danilo Krummrich wrote: > On Mon Apr 20, 2026 at 11:28 AM CEST, Thomas Hellström wrote: > > I agree with your reasoning here, but current fact is that most (if > > not > > all) holders of a drm device reference (files, pagemaps, dma-bufs) > > currently also hold a module reference to protect against this, and > > drm_gpuvm would be an outlier. > > I'm not convinced; if the DRM device has the requirement to not > outlive the > module it is associated with, then the DRM device code has to take > care of this > requirement, and not every caller of drm_dev_get(). > > Besides that, if GPUVM holds the module reference count on behalf of > the DRM > device, it has the same effect that you rightfully point out below -- > it breaks > rmmod. > > > To fix this properly (lifting that requirement) one could introduce > > a > > drm device count in the module and have the module exit function > > wait > > for it to become zero, *and* that the code that did the last > > decrement > > finished executing. > > > > https://patchwork.freedesktop.org/patch/712146/?series=163298&rev=1 > > This looks like a reasonable fix to me. And it makes me conclude that > we > basically agree on everything. :) Yes, unless we'd want to do a similar wait for gpuvms before returning from the close() callback: If we assume all GPUVMs are tied to an open drm file, that would conceptually be nicer IMO but I agree if gpuvm drivers implement something like the above per-driver device count, that would be unnecessary. Thanks, Thomas > > Regarding the reference count in the meantime, it remains that > omitting it does > not solve the underlying problem, i.e. I still think it is > orthogonal. > > > Or one could also have the drm device hold a reference count on the > > driver module, but that would block unloading without previous > > unbind > > which is not typical driver behaviour and would likely be seen as a > > regression.
On Mon, Apr 20, 2026 at 06:19:02PM +0200, Thomas Hellström wrote: > On Mon, 2026-04-20 at 17:08 +0200, Danilo Krummrich wrote: > > On Mon Apr 20, 2026 at 11:28 AM CEST, Thomas Hellström wrote: > > > I agree with your reasoning here, but current fact is that most (if > > > not > > > all) holders of a drm device reference (files, pagemaps, dma-bufs) > > > currently also hold a module reference to protect against this, and > > > drm_gpuvm would be an outlier. > > > > I'm not convinced; if the DRM device has the requirement to not > > outlive the > > module it is associated with, then the DRM device code has to take > > care of this > > requirement, and not every caller of drm_dev_get(). > > > > Besides that, if GPUVM holds the module reference count on behalf of > > the DRM > > device, it has the same effect that you rightfully point out below -- > > it breaks > > rmmod. > > > > > To fix this properly (lifting that requirement) one could introduce > > > a > > > drm device count in the module and have the module exit function > > > wait > > > for it to become zero, *and* that the code that did the last > > > decrement > > > finished executing. > > > > > > https://patchwork.freedesktop.org/patch/712146/?series=163298&rev=1 > > > > This looks like a reasonable fix to me. And it makes me conclude that > > we > > basically agree on everything. :) > > Yes, unless we'd want to do a similar wait for gpuvms before returning > from the close() callback: If we assume all GPUVMs are tied to an open > drm file, that would conceptually be nicer IMO but I agree if gpuvm > drivers implement something like the above per-driver device count, > that would be unnecessary. Just to confirm, it sounds like no changes to my patch are required here? Alice
On Mon, 2026-04-27 at 07:34 +0000, Alice Ryhl wrote: > On Mon, Apr 20, 2026 at 06:19:02PM +0200, Thomas Hellström wrote: > > On Mon, 2026-04-20 at 17:08 +0200, Danilo Krummrich wrote: > > > On Mon Apr 20, 2026 at 11:28 AM CEST, Thomas Hellström wrote: > > > > I agree with your reasoning here, but current fact is that most > > > > (if > > > > not > > > > all) holders of a drm device reference (files, pagemaps, dma- > > > > bufs) > > > > currently also hold a module reference to protect against this, > > > > and > > > > drm_gpuvm would be an outlier. > > > > > > I'm not convinced; if the DRM device has the requirement to not > > > outlive the > > > module it is associated with, then the DRM device code has to > > > take > > > care of this > > > requirement, and not every caller of drm_dev_get(). > > > > > > Besides that, if GPUVM holds the module reference count on behalf > > > of > > > the DRM > > > device, it has the same effect that you rightfully point out > > > below -- > > > it breaks > > > rmmod. > > > > > > > To fix this properly (lifting that requirement) one could > > > > introduce > > > > a > > > > drm device count in the module and have the module exit > > > > function > > > > wait > > > > for it to become zero, *and* that the code that did the last > > > > decrement > > > > finished executing. > > > > > > > > https://patchwork.freedesktop.org/patch/712146/?series=163298&rev=1 > > > > > > This looks like a reasonable fix to me. And it makes me conclude > > > that > > > we > > > basically agree on everything. :) > > > > Yes, unless we'd want to do a similar wait for gpuvms before > > returning > > from the close() callback: If we assume all GPUVMs are tied to an > > open > > drm file, that would conceptually be nicer IMO but I agree if gpuvm > > drivers implement something like the above per-driver device count, > > that would be unnecessary. > > Just to confirm, it sounds like no changes to my patch are required > here? I guess not, Although in the end it currently just moves one potential bug to another, but as long as people involved are aware, that's OK from my POW. Thanks, Thomas > > Alice
On Thu Apr 16, 2026 at 3:10 PM CEST, Alice Ryhl wrote:
> Currently GPUVM relies on the owner implicitly holding a refcount to the
> drm device, and it does not implicitly take a refcount on the drm
> device. This design is error-prone, so take a refcount on the device.
>
> Suggested-by: Danilo Krummrich <dakr@kernel.org>
> Signed-off-by: Alice Ryhl <aliceryhl@google.com>
Fixes: 546ca4d35dcc ("drm/gpuvm: convert WARN() to drm_WARN() variants")
© 2016 - 2026 Red Hat, Inc.