Recover sysfb after DRM probe failure

[PATCH 00/12] Recover sysfb after DRM probe failure

Posted by Zack Rusin 1 month, 1 week ago

Almost a rite of passage for every DRM developer and most Linux users
is upgrading a DRM driver, updating boot flags or changing some config
and having the DRM driver fail at probe, resulting in a blank screen.

Currently there's no way to recover from a DRM driver probe failure. PCI
DRM drivers explicitly throw out the existing sysfb to get exclusive
access to PCI resources, so if the probe fails the system is left without
a functioning display driver.

Add code to sysfb to recover the system framebuffer when a DRM driver's
probe fails. This means that a DRM driver that fails to load reloads the
system framebuffer driver.

This works best with simpledrm. Without it Xorg won't recover because
it still tries to load the vendor-specific driver, which usually ends up
not working at all. With simpledrm the system recovers really nicely,
ending up with a working console and not a blank screen.

There's a caveat in that some hardware might require some special magic
register write to recover the EFI display. I'd appreciate it a lot if
maintainers could introduce a temporary failure in their driver's
probe to validate that the sysfb recovers and they get a working console.
The easiest way to double check it is by adding:
 /* XXX: Temporary failure to test sysfb restore - REMOVE BEFORE COMMIT */
 dev_info(&pdev->dev, "Testing sysfb restore: forcing probe failure\n");
 ret = -EINVAL;
 goto out_error;
or similar right after the devm_aperture_remove_conflicting_pci_devices() call.

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Ce Sun <cesun102@amd.com>
Cc: Chia-I Wu <olvaffe@gmail.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Deepak Rawat <drawat.floss@gmail.com>
Cc: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Cc: dri-devel@lists.freedesktop.org
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Gurchetan Singh <gurchetansingh@chromium.org>
Cc: Hans de Goede <hansg@kernel.org>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Helge Deller <deller@gmx.de>
Cc: intel-gfx@lists.freedesktop.org
Cc: intel-xe@lists.freedesktop.org
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Javier Martinez Canillas <javierm@redhat.com>
Cc: Jocelyn Falempe <jfalempe@redhat.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Lijo Lazar <lijo.lazar@amd.com>
Cc: linux-efi@vger.kernel.org
Cc: linux-fbdev@vger.kernel.org
Cc: linux-hyperv@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: "Mario Limonciello (AMD)" <superm1@kernel.org>
Cc: Mario Limonciello <mario.limonciello@amd.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: nouveau@lists.freedesktop.org
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: spice-devel@lists.freedesktop.org
Cc: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: "Timur Kristóf" <timur.kristof@gmail.com>
Cc: Tvrtko Ursulin <tursulin@ursulin.net>
Cc: virtualization@lists.linux.dev
Cc: Vitaly Prosyak <vitaly.prosyak@amd.com>

Zack Rusin (12):
  video/aperture: Add sysfb restore on DRM probe failure
  drm/vmwgfx: Use devm aperture helpers for sysfb restore on probe
    failure
  drm/xe: Use devm aperture helpers for sysfb restore on probe failure
  drm/amdgpu: Use devm aperture helpers for sysfb restore on probe
    failure
  drm/virtio: Add sysfb restore on probe failure
  drm/nouveau: Use devm aperture helpers for sysfb restore on probe
    failure
  drm/qxl: Use devm aperture helpers for sysfb restore on probe failure
  drm/vboxvideo: Use devm aperture helpers for sysfb restore on probe
    failure
  drm/hyperv: Add sysfb restore on probe failure
  drm/ast: Use devm aperture helpers for sysfb restore on probe failure
  drm/radeon: Use devm aperture helpers for sysfb restore on probe
    failure
  drm/i915: Use devm aperture helpers for sysfb restore on probe failure

 drivers/firmware/efi/sysfb_efi.c           |   2 +-
 drivers/firmware/sysfb.c                   | 191 +++++++++++++--------
 drivers/firmware/sysfb_simplefb.c          |  10 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   9 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |   7 +
 drivers/gpu/drm/ast/ast_drv.c              |  13 +-
 drivers/gpu/drm/hyperv/hyperv_drm_drv.c    |  23 +++
 drivers/gpu/drm/i915/i915_driver.c         |  13 +-
 drivers/gpu/drm/nouveau/nouveau_drm.c      |  16 +-
 drivers/gpu/drm/qxl/qxl_drv.c              |  14 +-
 drivers/gpu/drm/radeon/radeon_drv.c        |  15 +-
 drivers/gpu/drm/vboxvideo/vbox_drv.c       |  13 +-
 drivers/gpu/drm/virtio/virtgpu_drv.c       |  29 ++++
 drivers/gpu/drm/vmwgfx/vmwgfx_drv.c        |  13 +-
 drivers/gpu/drm/xe/xe_device.c             |   7 +-
 drivers/gpu/drm/xe/xe_pci.c                |   7 +
 drivers/video/aperture.c                   |  54 ++++++
 include/linux/aperture.h                   |  14 ++
 include/linux/sysfb.h                      |   6 +
 19 files changed, 368 insertions(+), 88 deletions(-)

-- 
2.48.1

Re: [PATCH 00/12] Recover sysfb after DRM probe failure

Posted by Thomas Zimmermann 4 weeks, 1 day ago

Hi

Am 29.12.25 um 22:58 schrieb Zack Rusin:
> Almost a rite of passage for every DRM developer and most Linux users
> is upgrading your DRM driver/updating boot flags/changing some config
> and having DRM driver fail at probe resulting in a blank screen.
>
> Currently there's no way to recover from DRM driver probe failure. PCI
> DRM driver explicitly throw out the existing sysfb to get exclusive
> access to PCI resources so if the probe fails the system is left without
> a functioning display driver.
>
> Add code to sysfb to recover system framebuffer when DRM driver's probe
> fails. This means that a DRM driver that fails to load reloads the system
> framebuffer driver.
>
> This works best with simpledrm. Without it Xorg won't recover because
> it still tries to load the vendor specific driver which ends up usually
> not working at all. With simpledrm the system recovers really nicely
> ending up with a working console and not a blank screen.
>
> There's a caveat in that some hardware might require some special magic
> register write to recover EFI display. I'd appreciate it a lot if
> maintainers could introduce a temporary failure in their drivers
> probe to validate that the sysfb recovers and they get a working console.
> The easiest way to double check it is by adding:
>   /* XXX: Temporary failure to test sysfb restore - REMOVE BEFORE COMMIT */
>   dev_info(&pdev->dev, "Testing sysfb restore: forcing probe failure\n");
>   ret = -EINVAL;
>   goto out_error;
> or such right after the devm_aperture_remove_conflicting_pci_devices .

Recovering the display like that is guesswork and will at best work
with simple discrete devices where the framebuffer is always located in
a confined graphics aperture.

But the problem you're trying to solve is a real one.

What we'd want to do instead is to take the initial hardware state into 
account when we do the initial mode-setting operation.

The first step is to move each driver's remove_conflicting_devices call 
to the latest possible location in the probe function. We usually do it 
first, because that's easy. But on most hardware, it could happen much 
later. The native driver is free to examine hardware state while probing 
the device as long as it does not interfere with the pre-configured 
framebuffer mode/format/address. Hence it can set up its internal
structures while the sysfb device is still active.
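
The ordering described above can be illustrated with a small userspace model. This is only a sketch and not kernel code: struct mock_hw, mock_late_probe() and the step names are hypothetical, and the single boolean stands in for the whole sysfb device. It shows that a failure before the late removal leaves the firmware framebuffer in place, while a failure after it still loses the display.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a display device; not a kernel structure. */
struct mock_hw {
	bool sysfb_active;	/* firmware framebuffer driver still bound */
	bool native_active;	/* native driver finished probing */
};

/* Points at which the model can inject a probe error. */
enum probe_step { STEP_NONE, STEP_READ_STATE, STEP_MODESET };

/* Models the remove_conflicting_devices call: sysfb goes away. */
void mock_remove_conflicting_devices(struct mock_hw *hw)
{
	hw->sysfb_active = false;
}

/* Models a probe function that removes sysfb as late as possible. */
int mock_late_probe(struct mock_hw *hw, enum probe_step fail_at)
{
	assert(hw != NULL);

	/* Step 1: read-only examination; sysfb keeps scanning out. */
	if (fail_at == STEP_READ_STATE)
		return -EINVAL;	/* sysfb untouched, console survives */

	/* Step 2: only now take over the aperture. */
	mock_remove_conflicting_devices(hw);

	/* Step 3: first modeset; a failure here still loses the display. */
	if (fail_at == STEP_MODESET)
		return -EINVAL;

	hw->native_active = true;
	return 0;
}
```

Calling mock_late_probe() with STEP_READ_STATE leaves sysfb_active set, which is the benefit of moving the removal late. STEP_MODESET shows the window that remains open.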

The next step for the native driver is to load the pre-configured 
hardware state into its initial internal atomic state. Maxime has worked 
on that on and off. The last iteration I'm aware of is at [1].

After the state-readout, the sysfb device has to be unplugged. But as 
the underlying hardware config remains active, the native driver can now 
use and modify it. We currently do a drm_mode_config_reset(), which
clears the state and then lets the first client set a new display state.
But with state-readout, we could either pick up the existing framebuffer 
directly or do a proper modeset from existing state.

As DRM clients control the mode setting, they'd likely need some changes 
to handle state-readout. There's such code in i915's fbdev support AFAIK.

Best regards
Thomas

[1] 
https://lore.kernel.org/dri-devel/20250902-drm-state-readout-v1-0-14ad5315da3f@kernel.org/

-- 
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Frankenstr. 146, 90461 Nürnberg, Germany, www.suse.com
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich, (HRB 36809, AG Nürnberg)

Re: [PATCH 00/12] Recover sysfb after DRM probe failure

Posted by Zack Rusin 4 weeks ago

On Fri, Jan 9, 2026 at 5:34 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>
> Hi
>
> Am 29.12.25 um 22:58 schrieb Zack Rusin:
> > Almost a rite of passage for every DRM developer and most Linux users
> > is upgrading your DRM driver/updating boot flags/changing some config
> > and having DRM driver fail at probe resulting in a blank screen.
> >
> > Currently there's no way to recover from DRM driver probe failure. PCI
> > DRM driver explicitly throw out the existing sysfb to get exclusive
> > access to PCI resources so if the probe fails the system is left without
> > a functioning display driver.
> >
> > Add code to sysfb to recover system framebuffer when DRM driver's probe
> > fails. This means that a DRM driver that fails to load reloads the system
> > framebuffer driver.
> >
> > This works best with simpledrm. Without it Xorg won't recover because
> > it still tries to load the vendor specific driver which ends up usually
> > not working at all. With simpledrm the system recovers really nicely
> > ending up with a working console and not a blank screen.
> >
> > There's a caveat in that some hardware might require some special magic
> > register write to recover EFI display. I'd appreciate it a lot if
> > maintainers could introduce a temporary failure in their drivers
> > probe to validate that the sysfb recovers and they get a working console.
> > The easiest way to double check it is by adding:
> >   /* XXX: Temporary failure to test sysfb restore - REMOVE BEFORE COMMIT */
> >   dev_info(&pdev->dev, "Testing sysfb restore: forcing probe failure\n");
> >   ret = -EINVAL;
> >   goto out_error;
> > or such right after the devm_aperture_remove_conflicting_pci_devices .
>
> Recovering the display like that is guess work and will at best work
> with simple discrete devices where the framebuffer is always located in
> a confined graphics aperture.
>
> But the problem you're trying to solve is a real one.
>
> What we'd want to do instead is to take the initial hardware state into
> account when we do the initial mode-setting operation.
>
> The first step is to move each driver's remove_conflicting_devices call
> to the latest possible location in the probe function. We usually do it
> first, because that's easy. But on most hardware, it could happen much
> later.

Well, some drivers (vbox, vmwgfx, bochs and cirrus-qemu) do it because
they request PCI regions, which is going to fail otherwise. Because
grabbing the PCI resources is in general the very first thing that
those drivers need to do to set up anything, we call
remove_conflicting_devices first or at least very early.

I also don't think it's possible or even desirable for some drivers to
reuse the initial state. A good example here is vmwgfx: by default
some people will set up their VMs with e.g. 8 MB of VRAM. When vmwgfx
loads we allow scanning out from system memory, so you can set your VM
up with 8 MB of VRAM but still use 4K resolutions once the driver
loads. This way the suspend size of the VM is very predictable (tiny
VRAM plus whatever RAM was set up) while still allowing a lot of
flexibility.

In general I think that, however this is planned, it's two or three
separate series:
1) infrastructure to reload the sysfb driver (what this series is),
2) making sure that drivers that do want to recover cleanly actually
clean out all the state on exit properly,
3) abstracting at least some of that cleanup in some driver-independent way.

z

Re: [PATCH 00/12] Recover sysfb after DRM probe failure

Posted by Thomas Zimmermann 3 weeks, 2 days ago

Hi,

apologies for the delay. I wanted to reply and then forgot about it.

Am 10.01.26 um 05:52 schrieb Zack Rusin:
> On Fri, Jan 9, 2026 at 5:34 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>> Hi
>>
>> Am 29.12.25 um 22:58 schrieb Zack Rusin:
>>> Almost a rite of passage for every DRM developer and most Linux users
>>> is upgrading your DRM driver/updating boot flags/changing some config
>>> and having DRM driver fail at probe resulting in a blank screen.
>>>
>>> Currently there's no way to recover from DRM driver probe failure. PCI
>>> DRM driver explicitly throw out the existing sysfb to get exclusive
>>> access to PCI resources so if the probe fails the system is left without
>>> a functioning display driver.
>>>
>>> Add code to sysfb to recover system framebuffer when DRM driver's probe
>>> fails. This means that a DRM driver that fails to load reloads the system
>>> framebuffer driver.
>>>
>>> This works best with simpledrm. Without it Xorg won't recover because
>>> it still tries to load the vendor specific driver which ends up usually
>>> not working at all. With simpledrm the system recovers really nicely
>>> ending up with a working console and not a blank screen.
>>>
>>> There's a caveat in that some hardware might require some special magic
>>> register write to recover EFI display. I'd appreciate it a lot if
>>> maintainers could introduce a temporary failure in their drivers
>>> probe to validate that the sysfb recovers and they get a working console.
>>> The easiest way to double check it is by adding:
>>>    /* XXX: Temporary failure to test sysfb restore - REMOVE BEFORE COMMIT */
>>>    dev_info(&pdev->dev, "Testing sysfb restore: forcing probe failure\n");
>>>    ret = -EINVAL;
>>>    goto out_error;
>>> or such right after the devm_aperture_remove_conflicting_pci_devices .
>> Recovering the display like that is guess work and will at best work
>> with simple discrete devices where the framebuffer is always located in
>> a confined graphics aperture.
>>
>> But the problem you're trying to solve is a real one.
>>
>> What we'd want to do instead is to take the initial hardware state into
>> account when we do the initial mode-setting operation.
>>
>> The first step is to move each driver's remove_conflicting_devices call
>> to the latest possible location in the probe function. We usually do it
>> first, because that's easy. But on most hardware, it could happen much
>> later.
> Well, some drivers (vbox, vmwgfx, bochs and cirrus-qemu) do it because
> they request pci regions which is going to fail otherwise. Because
> grabbing the pci resources is in general the very first thing that
> those drivers need to do to setup anything, we
> remove_conflicting_devices first or at least very early.

To my knowledge, requesting resources is more about correctness than a 
hard requirement to use an I/O or memory range. Has this changed?


>
> I also don't think it's possible or even desirable by some drivers to
> reuse the initial state, good example here is vmwgfx where by default
> some people will setup their vm's with e.g. 8mb ram, when the vmwgfx
> loads we allow scanning out from system memory, so you can set your vm
> up with 8mb of vram but still use 4k resolutions when the driver
> loads, this way the suspend size of the vm is very predictable (tiny
> vram plus whatever ram was setup) while still allowing a lot of
> flexibility.

If there's no initial state to switch from, the first modeset can fail 
while leaving the display unusable. There's no way around that. Going 
back to the old state is not an option unless the driver has been 
written to support this.

The case of vmwgfx is special, but does not affect the overall problem.
For vmwgfx, it would be best to import that initial state and support a 
transparent modeset from vram to system memory (and back) at least 
during this initial state.


>
> In general I think however this is planned it's two or three separate series:
> 1) infrastructure to reload the sysfb driver (what this series is)
> 2) making sure that drivers that do want to recover cleanly actually
> clean out all the state on exit properly,
> 3) abstracting at least some of that cleanup in some driver independent way

That's really not going to work. For example, in the current series, you
invoke devm_aperture_remove_conflicting_pci_devices_done() after
drm_mode_config_reset(), drm_dev_register() and drm_client_setup(). Each
of these calls can modify hardware state. In the case of _register() and
_setup(), the DRM clients can perform a modeset, which destroys the
initial hardware state.

Patch 1 of this series removes the sysfb device/driver entirely. That
should be a no-go as it significantly complicates recovery. For example,
if the native driver failed from an allocation failure, the sysfb
device/driver is not likely to come back either.

As the very first thing, the series should state which failures it is
going to resolve:

 - failed hardware init,
 - invalid initial modesetting,
 - runtime errors (such as ENOMEM, failed firmware loading),
 - others?

And then specify how a recovery to sysfb could look in each supported
scenario.

In terms of implementation, make any transition between drivers gradual.
The native driver needs to acquire the hardware resources (framebuffer
and I/O apertures) without unloading the sysfb driver. Luckily there's
struct drm_device.unplug, which does that. [1] Flipping this field
disables hardware access for DRM drivers. All sysfb drivers support this.

To get the sysfb drivers ready, I suggest dedicated helpers for each
driver's aperture. The aperture helpers can use these callbacks to flip
the DRM driver off and on again. For example, efidrm could do this as a
minimum:

    int efidrm_aperture_suspend()
    {
        dev->unplug = true;
        remove_resource(/*framebuffer aperture*/);
        return 0;
    }

    int efidrm_aperture_resume()
    {
        insert_resource(/*framebuffer aperture*/);
        dev->unplug = false;
        return 0;
    }

    struct aperture_funcs efidrm_aperture_funcs = {
        .suspend = efidrm_aperture_suspend,
        .resume = efidrm_aperture_resume,
    };

Pass this struct when efidrm acquires the framebuffer aperture, so that
the aperture helpers can control the behavior of efidrm.

With this, a multi-step takeover from sysfb to native driver can be
tried. It's still a massive effort that requires an audit of each
driver's probing logic. There's no copy-paste pattern AFAICT. I suggest
picking one simple driver first and making a prototype.

Let me also say that I DO like the general idea you're proposing. But if
it was easy, we would likely have done it already.

Best regards
Thomas
>
> z
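
The suspend/resume callbacks suggested in this message can be modeled in userspace to show how a failed takeover would hand the aperture back. This is a sketch under assumptions, not the proposed kernel interface: struct mock_sysfb, struct mock_aperture_funcs and mock_takeover() are hypothetical names, and the two booleans stand in for drm_device.unplug and the framebuffer resource.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a sysfb driver instance. */
struct mock_sysfb {
	bool unplugged;		/* models the drm_device.unplug flag */
	bool owns_aperture;	/* models the inserted framebuffer resource */
};

/* Hypothetical callback table, shaped like the suggested aperture_funcs. */
struct mock_aperture_funcs {
	int (*suspend)(struct mock_sysfb *fb);
	int (*resume)(struct mock_sysfb *fb);
};

/* Stop hardware access first, then release the aperture. */
int mock_sysfb_suspend(struct mock_sysfb *fb)
{
	fb->unplugged = true;
	fb->owns_aperture = false;
	return 0;
}

/* Reclaim the aperture first, then allow hardware access again. */
int mock_sysfb_resume(struct mock_sysfb *fb)
{
	fb->owns_aperture = true;
	fb->unplugged = false;
	return 0;
}

const struct mock_aperture_funcs mock_sysfb_funcs = {
	.suspend = mock_sysfb_suspend,
	.resume = mock_sysfb_resume,
};

/* Two trivial native-driver init outcomes for the model. */
int mock_native_init_ok(void) { return 0; }
int mock_native_init_fail(void) { return -1; }

/*
 * Try to hand the device to the native driver. The sysfb driver is only
 * suspended, never unloaded, so a failed init can resume it.
 */
int mock_takeover(struct mock_sysfb *fb,
		  const struct mock_aperture_funcs *funcs,
		  int (*native_init)(void))
{
	int ret;

	assert(fb != NULL && funcs != NULL && native_init != NULL);

	ret = funcs->suspend(fb);
	if (ret)
		return ret;

	ret = native_init();
	if (ret)
		funcs->resume(fb);	/* roll back: sysfb drives the display again */

	return ret;
}
```

The model leaves out the hard part raised in the thread: resume only helps if the native driver has not changed hardware state that the sysfb driver depends on.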

Re: [PATCH 00/12] Recover sysfb after DRM probe failure

Posted by Zack Rusin 3 weeks, 1 day ago

On Thu, Jan 15, 2026 at 6:02 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>
> That's really not going to work. For example, in the current series, you
> invoke devm_aperture_remove_conflicting_pci_devices_done() after
> drm_mode_config_reset(), drm_dev_register() and drm_client_setup().

That's perfectly fine:
devm_aperture_remove_conflicting_pci_devices_done only removes the
reload behavior; it doesn't do anything else.

This series, essentially, just adds a "defer" statement to
aperture_remove_conflicting_pci_devices that says

"reload sysfb if this driver unloads".

devm_aperture_remove_conflicting_pci_devices_done just cancels that defer.

You could ask why have
devm_aperture_remove_conflicting_pci_devices_done at all then. It's
there because I didn't want to change the default behavior of anything.

There are three cases:
1) Driver fails to load before
aperture_remove_conflicting_pci_devices, in which case sysfb is still
active and there's no problem,
2) Driver fails to load after aperture_remove_conflicting_pci_devices,
in which case sysfb is gone and the screen is blank
3) Driver is unloaded after the probe succeeded. igt tests this too.
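
The three cases can be walked through with a small userspace model of the "defer" idea. This is a sketch only: the mock_* functions are hypothetical stand-ins named after the helpers discussed in this thread, they do not reproduce the series' actual code, and mock_driver_detach() models whatever runs when the driver goes away (probe failure or unbind).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one PCI display device. */
struct mock_dev {
	bool sysfb_active;	/* firmware framebuffer driver is bound */
	bool restore_armed;	/* deferred "reload sysfb" action is pending */
};

enum fail_point { FAIL_NEVER, FAIL_BEFORE_REMOVE, FAIL_AFTER_REMOVE };

/* Models the devm removal helper: kick out sysfb and arm the restore. */
void mock_remove_conflicting(struct mock_dev *d)
{
	d->sysfb_active = false;
	d->restore_armed = true;
}

/* Models the _done() helper: cancel the deferred restore, nothing else. */
void mock_remove_conflicting_done(struct mock_dev *d)
{
	d->restore_armed = false;
}

/* Models driver detach: a still-armed restore brings sysfb back. */
void mock_driver_detach(struct mock_dev *d)
{
	if (d->restore_armed) {
		d->sysfb_active = true;
		d->restore_armed = false;
	}
}

/* Models a probe function with an injectable failure. */
int mock_defer_probe(struct mock_dev *d, enum fail_point fp)
{
	assert(d != NULL);

	if (fp == FAIL_BEFORE_REMOVE)
		goto fail;		/* case 1: sysfb was never removed */

	mock_remove_conflicting(d);

	if (fp == FAIL_AFTER_REMOVE)
		goto fail;		/* case 2: restore is still armed */

	mock_remove_conflicting_done(d);	/* probe succeeded */
	return 0;

fail:
	mock_driver_detach(d);
	return -1;
}
```

In case 3 the probe has already cancelled the restore, so a later mock_driver_detach() leaves sysfb_active false, which matches the unchanged default behavior described in this message.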

Without devm_aperture_remove_conflicting_pci_devices_done we'd try to
reload sysfb in #3. In general that makes sense to me and I'd
probably remove the call in my drivers, but there might be people or
tests that depend specifically on the behavior of not having anything
driving the fb (again, igt does it and we don't need to flip-flop
between sysfb and the driver there), so I didn't want to change it.

So with this series the worst-case scenario is that the driver that
failed after aperture_remove_conflicting_pci_devices changed the
hardware state so much that sysfb can't recover and the fb is blank.
It was blank before as well, and this series can't fix it because the
driver's cleanup routine would need to do more unwinding for sysfb to
reload (i.e. we'd need an extra patch to unwind the driver state).
There also might be cases of some crazy behavior, e.g. a PCI BAR
resize in the driver makes the VGA hardware crash or something, in
which case, yeah, we should definitely skip this patch, at least until
those drivers properly clean up on exit.

z

Re: [PATCH 00/12] Recover sysfb after DRM probe failure

Posted by Thomas Zimmermann 3 weeks, 1 day ago

Hi

Am 16.01.26 um 04:59 schrieb Zack Rusin:
> On Thu, Jan 15, 2026 at 6:02 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>> That's really not going to work. For example, in the current series, you
>> invoke devm_aperture_remove_conflicting_pci_devices_done() after
>> drm_mode_config_reset(), drm_dev_register() and drm_client_setup().
> That's perfectly fine,
> devm_aperture_remove_conflicting_pci_devices_done is removing the
> reload behavior not doing anything.
>
> This series, essentially, just adds a "defer" statement to
> aperture_remove_conflicting_pci_devices that says
>
> "reload sysfb if this driver unloads".
>
> devm_aperture_remove_conflicting_pci_devices_done just cancels that defer.

Exactly. And if that reload happens after the hardware state has been 
changed, the result is undefined.

>
> You could ask why have
> devm_aperture_remove_conflicting_pci_devices_done at all then and it's
> because I didn't want to change the default behavior of anything.
>
> There are three cases:
> 1) Driver fails to load before
> aperture_remove_conflicting_pci_devices, in which case sysfb is still
> active and there's no problem,
> 2) Driver fails to load after aperture_remove_conflicting_pci_devices,
> in which case sysfb is gone and the screen is blank
> 3) Driver is unloaded after the probe succeeded. igt tests this too.
>
> Without devm_aperture_remove_conflicting_pci_devices_done we'd try to
> reload sysfb in #3, which, in general makes sense to me and I'd
> probably remove it in my drivers, but there might be people or tests
> (again, igt does it and we don't need to flip-flop between sysfb and
> the driver there) that depend on specifically that behavior of not
> having anything driving fb so I didn't want to change it.
>
> So with this series the worst case scenario is that the driver that
> failed after aperture_remove_conflicting_pci_devices changed the
> hardware state so much that sysfb can't recover and the fb is blank.
> So it was blank before and this series can't fix it because the driver
> in its cleanup routine will need to do more unwinding for sysfb to
> reload (i.e. we'd need an extra patch to unwind the driver state).

The current recovery/reload is not reliable in any case. A number of 
high-profile devs have also said that it doesn't work with their driver. 
The same is true for ast. So the current approach is not going to happen.

> There also might be the case of some crazy behavior, e.g. pci bar
> resize in the driver makes the vga hardware crash or something, in
> which case, yea, we should definitely skip this patch, at least until
> those drivers properly cleanup on exit.

There's nothing crazy here. It's standard probing code.

If you want to move forward, my suggestion is to look at the proposal
with the aperture_funcs callbacks that control sysfb device access. And 
from there, build a full prototype with one or two drivers.

Best regards
Thomas


>
> z

Re: [PATCH 00/12] Recover sysfb after DRM probe failure

Posted by Zack Rusin 3 weeks ago

On Fri, Jan 16, 2026 at 2:58 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>
> Hi
>
> Am 16.01.26 um 04:59 schrieb Zack Rusin:
> > On Thu, Jan 15, 2026 at 6:02 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
> >> That's really not going to work. For example, in the current series, you
> >> invoke devm_aperture_remove_conflicting_pci_devices_done() after
> >> drm_mode_config_reset(), drm_dev_register() and drm_client_setup().
> > That's perfectly fine,
> > devm_aperture_remove_conflicting_pci_devices_done is removing the
> > reload behavior not doing anything.
> >
> > This series, essentially, just adds a "defer" statement to
> > aperture_remove_conflicting_pci_devices that says
> >
> > "reload sysfb if this driver unloads".
> >
> > devm_aperture_remove_conflicting_pci_devices_done just cancels that defer.
>
> Exactly. And if that reload happens after the hardware state has been
> changed, the result is undefined.

This is all predicated on drivers actually cleaning up after
themselves. I don't think any amount of good will or api design is
going to fix device specific state mismatches.

> The current recovery/reload is not reliable in any case. A number of
> high-profile devs have also said that it doesn't work with their driver.
> The same is true for ast. So the current approach is not going to happen.
>
> > There also might be the case of some crazy behavior, e.g. pci bar
> > resize in the driver makes the vga hardware crash or something, in
> > which case, yea, we should definitely skip this patch, at least until
> > those drivers properly cleanup on exit.
>
> There's nothing crazy here. It's standard probing code.
>
> If you want to move forward, my suggestion is to look at the proposal
> with the aperture_funcs callbacks that control sysfb device access. And
> from there, build a full prototype with one or two drivers.

I don't think that approach is going to work. I don't think there's
anything that can be done if drivers didn't clean up everything they've
done that might have broken sysfb on unload. I'm going to drop it
then. It's obviously a shame, because it works fine with virtualized
drivers and they're the ones that would likely profit from this the most,
but I'm sceptical that I could do a full system state reset in a
generalized fashion for hw drivers, or that the work required would be
worth the payoff.

z

Re: [PATCH 00/12] Recover sysfb after DRM probe failure

Posted by Christian König 2 weeks, 5 days ago

On 1/17/26 07:02, Zack Rusin wrote:
> On Fri, Jan 16, 2026 at 2:58 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>>
>> Hi
>>
>> Am 16.01.26 um 04:59 schrieb Zack Rusin:
>>> On Thu, Jan 15, 2026 at 6:02 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>>>> That's really not going to work. For example, in the current series, you
>>>> invoke devm_aperture_remove_conflicting_pci_devices_done() after
>>>> drm_mode_config_reset(), drm_dev_register() and drm_client_setup().
>>> That's perfectly fine,
>>> devm_aperture_remove_conflicting_pci_devices_done is removing the
>>> reload behavior not doing anything.
>>>
>>> This series, essentially, just adds a "defer" statement to
>>> aperture_remove_conflicting_pci_devices that says
>>>
>>> "reload sysfb if this driver unloads".
>>>
>>> devm_aperture_remove_conflicting_pci_devices_done just cancels that defer.
>>
>> Exactly. And if that reload happens after the hardware state has been
>> changed, the result is undefined.
> 
> This is all predicated on drivers actually cleaning up after
> themselves. I don't think any amount of good will or api design is
> going to fix device specific state mismatches.
> 
>> The current recovery/reload is not reliable in any case. A number of
>> high-profile devs have also said that it doesn't work with their driver.
>> The same is true for ast. So the current approach is not going to happen.
>>
>>> There also might be the case of some crazy behavior, e.g. pci bar
>>> resize in the driver makes the vga hardware crash or something, in
>>> which case, yea, we should definitely skip this patch, at least until
>>> those drivers properly cleanup on exit.
>>
>> There's nothing crazy here. It's standard probing code.
>>
>> If you want to move forward, my suggestion is to look at the proposal
>> with the aperture_funcs callbacks that control sysfb device access. And
>> from there, build a full prototype with one or two drivers.
> 
> I don't think that approach is going to work. I don't think there's
> anything that can be done if drivers didn't cleanup everything they've
> done that might have broken sysfb on unload. I'm going to drop it
> then, it's obviously a shame because it works fine with virtualized
> drivers and they're ones that would likely profit from this the most
> but I'm sceptical that I could do full system state set reset in a
> generalized fashion for hw drivers or that the work required would be
> worth the payoff.

Well at least for PCI devices you could try doing a function level reset to get the HW back into some usable state.

This does *not* work for AMD HW since we have HW/FW bugs, but at least for your virtualized use case it might work.

All you need then is an EFI, VESA or int10 call to re-init the HW to the pre-driver load setup.

I know that is not the easiest thing to do, but still better than a black screen.
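
For illustration only (not part of the series): from userspace, a PCI device reset can be requested through the device's sysfs "reset" attribute. The minimal sketch below assumes that attribute exists for the device; the kernel picks the reset method, so this is not guaranteed to be a function level reset, and in-kernel code would use pci_reset_function() instead. The helper names pci_reset_path() and pci_request_reset() are made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the sysfs path of a PCI device's "reset" attribute.
 * bdf is the device address, e.g. "0000:00:02.0".
 * Returns 0 on success, -1 on an empty address or a too-small buffer.
 */
static int pci_reset_path(const char *bdf, char *buf, size_t len)
{
	int n;

	assert(bdf && buf && len > 0);
	if (strlen(bdf) == 0)
		return -1;

	n = snprintf(buf, len, "/sys/bus/pci/devices/%s/reset", bdf);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Ask the kernel to reset the device by writing "1" to the attribute.
 * Needs root; fails if the path cannot be opened for writing, e.g.
 * because the device exposes no reset method.
 */
static int pci_request_reset(const char *path)
{
	FILE *f = fopen(path, "w");
	int ret = 0;

	if (!f)
		return -1;
	if (fputs("1", f) < 0)
		ret = -1;
	if (fclose(f) != 0)
		ret = -1;
	return ret;
}
```

A reset alone does not bring the display back; as noted above, something still has to re-initialize the hardware to the pre-driver setup afterwards.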

Regards,
Christian.

> 
> z

Re: [PATCH 00/12] Recover sysfb after DRM probe failure

Posted by Christian König 3 weeks, 2 days ago

Sorry for being late, but I only now realized what you are doing here.

On 1/15/26 12:02, Thomas Zimmermann wrote:
> Hi,
> 
> apologies for the delay. I wanted to reply and then forgot about it.
> 
> Am 10.01.26 um 05:52 schrieb Zack Rusin:
>> On Fri, Jan 9, 2026 at 5:34 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>>> Hi
>>>
>>> Am 29.12.25 um 22:58 schrieb Zack Rusin:
>>>> Almost a rite of passage for every DRM developer and most Linux users
>>>> is upgrading your DRM driver/updating boot flags/changing some config
>>>> and having DRM driver fail at probe resulting in a blank screen.
>>>>
>>>> Currently there's no way to recover from DRM driver probe failure. PCI
>>>> DRM driver explicitly throw out the existing sysfb to get exclusive
>>>> access to PCI resources so if the probe fails the system is left without
>>>> a functioning display driver.
>>>>
>>>> Add code to sysfb to recover system framebuffer when DRM driver's probe
>>>> fails. This means that a DRM driver that fails to load reloads the system
>>>> framebuffer driver.
>>>>
>>>> This works best with simpledrm. Without it Xorg won't recover because
>>>> it still tries to load the vendor specific driver which ends up usually
>>>> not working at all. With simpledrm the system recovers really nicely
>>>> ending up with a working console and not a blank screen.
>>>>
>>>> There's a caveat in that some hardware might require some special magic
>>>> register write to recover EFI display. I'd appreciate it a lot if
>>>> maintainers could introduce a temporary failure in their drivers
>>>> probe to validate that the sysfb recovers and they get a working console.
>>>> The easiest way to double check it is by adding:
>>>>    /* XXX: Temporary failure to test sysfb restore - REMOVE BEFORE COMMIT */
>>>>    dev_info(&pdev->dev, "Testing sysfb restore: forcing probe failure\n");
>>>>    ret = -EINVAL;
>>>>    goto out_error;
>>>> or such right after the devm_aperture_remove_conflicting_pci_devices .
>>> Recovering the display like that is guess work and will at best work
>>> with simple discrete devices where the framebuffer is always located in
>>> a confined graphics aperture.
>>>
>>> But the problem you're trying to solve is a real one.
>>>
>>> What we'd want to do instead is to take the initial hardware state into
>>> account when we do the initial mode-setting operation.
>>>
>>> The first step is to move each driver's remove_conflicting_devices call
>>> to the latest possible location in the probe function. We usually do it
>>> first, because that's easy. But on most hardware, it could happen much
>>> later.
>> Well, some drivers (vbox, vmwgfx, bochs and cirrus-qemu) do it because
>> they request pci regions which is going to fail otherwise. Because
>> grabbing the pci resources is in general the very first thing that
>> those drivers need to do to setup anything, we
>> remove_conflicting_devices first or at least very early.
> 
> To my knowledge, requesting resources is more about correctness than a hard requirement to use an I/O or memory range. Has this changed?

Nope that is not correct.

At least for AMD GPUs remove_conflicting_devices() really early is necessary because otherwise some operations just result in a spontaneous system reboot.	

For example resizing the PCIe BAR giving access to VRAM or disabling VGA emulation (which AFAIK is used for EFI as well) is only possible when the VGA or EFI framebuffer driver is kicked out first.

And disabling VGA emulation is among the absolutely first steps you do to take over the scanout config.

So I absolutely clearly have to reject the amdgpu patch in this series, that will break tons of use cases.

Regards,
Christian.

>> I also don't think it's possible or even desirable by some drivers to
>> reuse the initial state, good example here is vmwgfx where by default
>> some people will setup their vm's with e.g. 8mb ram, when the vmwgfx
>> loads we allow scanning out from system memory, so you can set your vm
>> up with 8mb of vram but still use 4k resolutions when the driver
>> loads, this way the suspend size of the vm is very predictable (tiny
>> vram plus whatever ram was setup) while still allowing a lot of
>> flexibility.
> 
> If there's no initial state to switch from, the first modeset can fail while leaving the display unusable. There's no way around that. Going back to the old state is not an option unless the driver has been written to support this.
> 
> The case of vmwgfx is special, but does not affect the overall problem. For vmwgfx, it would be best to import that initial state and support a transparent modeset from vram to system memory (and back) at least during this initial state.
> 
> 
>>
>> In general I think however this is planned it's two or three separate series:
>> 1) infrastructure to reload the sysfb driver (what this series is)
>> 2) making sure that drivers that do want to recover cleanly actually
>> clean out all the state on exit properly,
>> 3) abstracting at least some of that cleanup in some driver independent way
> 
> That's really not going to work. For example, in the current series, you
> invoke devm_aperture_remove_conflicting_pci_devices_done() after
> drm_mode_reset(), drm_dev_register() and drm_client_setup(). Each of these
> calls can modify hardware state. In the case of _register() and _setup(),
> the DRM clients can perform a modeset, which destroys the initial hardware
> state.
>
> Patch 1 of this series removes the sysfb device/driver entirely. That
> should be a no-go as it significantly complicates recovery. For example,
> if the native driver failed from an allocation failure, the sysfb
> device/driver is not likely to come back either.
>
> As the very first thing, the series should state which failures it is
> going to resolve:
>
>  - failed hardware init,
>  - invalid initial modesetting,
>  - runtime errors (such as ENOMEM, failed firmware loading),
>  - others?
>
> And then specify how a recovery to sysfb could look in each supported
> scenario.
>
> In terms of implementation, make any transition between drivers gradual.
> The native driver needs to acquire the hardware resource (framebuffer and
> I/O apertures) without unloading the sysfb driver. Luckily there's struct
> drm_device.unplug, which does that. [1] Flipping this field disables
> hardware access for DRM drivers. All sysfb drivers support this.
>
> To get the sysfb drivers ready, I suggest dedicated helpers for each
> driver's aperture. The aperture helpers can use these callbacks to flip
> the DRM driver off and on again. For example, efidrm could do this as a
> minimum:
>
>     int efidrm_aperture_suspend()
>     {
>         dev->unplug = true;
>         remove_resource(/*framebuffer aperture*/);
>
>         return 0;
>     }
>
>     int efidrm_aperture_resume()
>     {
>         insert_resource(/*framebuffer aperture*/);
>         dev->unplug = false;
>
>         return 0;
>     }
>
>     struct aperture_funcs efidrm_aperture_funcs = {
>         .suspend = efidrm_aperture_suspend,
>         .resume = efidrm_aperture_resume,
>     };
>
> Pass this struct when efidrm acquires the framebuffer aperture, so that
> the aperture helpers can control the behavior of efidrm. With this, a
> multi-step takeover from sysfb to native driver can be tried.
>
> It's still a massive effort that requires an audit of each driver's
> probing logic. There's no copy-paste pattern AFAICT. I suggest to pick one
> simple driver first and make a prototype.
>
> Let me also say that I DO like the general idea you're proposing. But if
> it was easy, we would likely have done it already.
>
> Best regards
> Thomas
>>
>> z
>
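
For illustration only: the suspend/resume callbacks proposed in the quoted text above can be modeled in plain userspace C. This is a toy sketch, not kernel code. The toy_* names are hypothetical, the two booleans merely stand in for drm_device.unplug and for the framebuffer resource, and no aperture_funcs interface of this shape exists in the kernel today.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a sysfb driver instance such as efidrm. */
struct toy_sysfb {
	bool unplugged;     /* models drm_device.unplug */
	bool owns_aperture; /* models the framebuffer resource being inserted */
};

/* Toy version of the proposed callback table. */
struct toy_aperture_funcs {
	int (*suspend)(struct toy_sysfb *fb);
	int (*resume)(struct toy_sysfb *fb);
};

static int toy_efidrm_suspend(struct toy_sysfb *fb)
{
	fb->unplugged = true;      /* stop hardware access first */
	fb->owns_aperture = false; /* then release the aperture */
	return 0;
}

static int toy_efidrm_resume(struct toy_sysfb *fb)
{
	fb->owns_aperture = true;  /* reclaim the aperture first */
	fb->unplugged = false;     /* then allow hardware access again */
	return 0;
}

static const struct toy_aperture_funcs toy_efidrm_funcs = {
	.suspend = toy_efidrm_suspend,
	.resume = toy_efidrm_resume,
};

/* Two fake native-driver probes, one succeeding and one failing. */
static int toy_native_probe_ok(void)   { return 0; }
static int toy_native_probe_fail(void) { return -1; }

/*
 * Multi-step takeover: suspend sysfb, try the native probe, and
 * resume sysfb if the probe fails. Returns the probe's result.
 */
static int toy_takeover(struct toy_sysfb *fb,
			const struct toy_aperture_funcs *funcs,
			int (*native_probe)(void))
{
	int ret;

	assert(fb && funcs && native_probe);

	ret = funcs->suspend(fb);
	if (ret)
		return ret;

	ret = native_probe();
	if (ret)
		funcs->resume(fb); /* hand the display back to sysfb */

	return ret;
}
```

The model deliberately ignores the hard part raised in this thread: resuming is only meaningful if the failed probe has not already changed the hardware state that sysfb depends on.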

Re: [PATCH 00/12] Recover sysfb after DRM probe failure

Posted by Ville Syrjälä 3 weeks, 2 days ago

On Thu, Jan 15, 2026 at 03:39:00PM +0100, Christian König wrote:
> Sorry for being late, but I only now realized what you are doing here.
> 
> On 1/15/26 12:02, Thomas Zimmermann wrote:
> > Hi,
> > 
> > apologies for the delay. I wanted to reply and then forgot about it.
> > 
> > Am 10.01.26 um 05:52 schrieb Zack Rusin:
> >> On Fri, Jan 9, 2026 at 5:34 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
> >>> Hi
> >>>
> >>> Am 29.12.25 um 22:58 schrieb Zack Rusin:
> >>>> Almost a rite of passage for every DRM developer and most Linux users
> >>>> is upgrading your DRM driver/updating boot flags/changing some config
> >>>> and having DRM driver fail at probe resulting in a blank screen.
> >>>>
> >>>> Currently there's no way to recover from DRM driver probe failure. PCI
> >>>> DRM driver explicitly throw out the existing sysfb to get exclusive
> >>>> access to PCI resources so if the probe fails the system is left without
> >>>> a functioning display driver.
> >>>>
> >>>> Add code to sysfb to recover system framebuffer when DRM driver's probe
> >>>> fails. This means that a DRM driver that fails to load reloads the system
> >>>> framebuffer driver.
> >>>>
> >>>> This works best with simpledrm. Without it Xorg won't recover because
> >>>> it still tries to load the vendor specific driver which ends up usually
> >>>> not working at all. With simpledrm the system recovers really nicely
> >>>> ending up with a working console and not a blank screen.
> >>>>
> >>>> There's a caveat in that some hardware might require some special magic
> >>>> register write to recover EFI display. I'd appreciate it a lot if
> >>>> maintainers could introduce a temporary failure in their drivers
> >>>> probe to validate that the sysfb recovers and they get a working console.
> >>>> The easiest way to double check it is by adding:
> >>>>    /* XXX: Temporary failure to test sysfb restore - REMOVE BEFORE COMMIT */
> >>>>    dev_info(&pdev->dev, "Testing sysfb restore: forcing probe failure\n");
> >>>>    ret = -EINVAL;
> >>>>    goto out_error;
> >>>> or such right after the devm_aperture_remove_conflicting_pci_devices .
> >>> Recovering the display like that is guess work and will at best work
> >>> with simple discrete devices where the framebuffer is always located in
> >>> a confined graphics aperture.
> >>>
> >>> But the problem you're trying to solve is a real one.
> >>>
> >>> What we'd want to do instead is to take the initial hardware state into
> >>> account when we do the initial mode-setting operation.
> >>>
> >>> The first step is to move each driver's remove_conflicting_devices call
> >>> to the latest possible location in the probe function. We usually do it
> >>> first, because that's easy. But on most hardware, it could happen much
> >>> later.
> >> Well, some drivers (vbox, vmwgfx, bochs and cirrus-qemu) do it because
> >> they request pci regions which is going to fail otherwise. Because
> >> grabbing the pci resources is in general the very first thing that
> >> those drivers need to do to setup anything, we
> >> remove_conflicting_devices first or at least very early.
> > 
> > To my knowledge, requesting resources is more about correctness than a hard requirement to use an I/O or memory range. Has this changed?
> 
> Nope that is not correct.
> 
> At least for AMD GPUs remove_conflicting_devices() really early is necessary because otherwise some operations just result in a spontaneous system reboot.	
> 
> For example resizing the PCIe BAR giving access to VRAM or disabling VGA emulation (which AFAIK is used for EFI as well) is only possible when the VGA or EFI framebuffer driver is kicked out first.
> 
> And disabling VGA emulation is among the absolutely first steps you do to take over the scanout config.

It's similar for Intel. For us VGA emulation won't be used for
EFI boot, but we still can't have the previous driver poking
around in memory while the real driver is initializing. The
entire memory layout may get completely shuffled so there's
no telling where such memory accesses would land.

And I suppose reBAR is a concern for us as well.

-- 
Ville Syrjälä
Intel

Re: [PATCH 00/12] Recover sysfb after DRM probe failure

Posted by Thomas Zimmermann 3 weeks, 1 day ago

Hi

Am 15.01.26 um 16:10 schrieb Ville Syrjälä:
> On Thu, Jan 15, 2026 at 03:39:00PM +0100, Christian König wrote:
>> Sorry for being late, but I only now realized what you are doing here.
>>
>> On 1/15/26 12:02, Thomas Zimmermann wrote:
>>> Hi,
>>>
>>> apologies for the delay. I wanted to reply and then forgot about it.
>>>
>>> Am 10.01.26 um 05:52 schrieb Zack Rusin:
>>>> On Fri, Jan 9, 2026 at 5:34 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>>>>> Hi
>>>>>
>>>>> Am 29.12.25 um 22:58 schrieb Zack Rusin:
>>>>>> Almost a rite of passage for every DRM developer and most Linux users
>>>>>> is upgrading your DRM driver/updating boot flags/changing some config
>>>>>> and having DRM driver fail at probe resulting in a blank screen.
>>>>>>
>>>>>> Currently there's no way to recover from DRM driver probe failure. PCI
>>>>>> DRM driver explicitly throw out the existing sysfb to get exclusive
>>>>>> access to PCI resources so if the probe fails the system is left without
>>>>>> a functioning display driver.
>>>>>>
>>>>>> Add code to sysfb to recover system framebuffer when DRM driver's probe
>>>>>> fails. This means that a DRM driver that fails to load reloads the system
>>>>>> framebuffer driver.
>>>>>>
>>>>>> This works best with simpledrm. Without it Xorg won't recover because
>>>>>> it still tries to load the vendor specific driver which ends up usually
>>>>>> not working at all. With simpledrm the system recovers really nicely
>>>>>> ending up with a working console and not a blank screen.
>>>>>>
>>>>>> There's a caveat in that some hardware might require some special magic
>>>>>> register write to recover EFI display. I'd appreciate it a lot if
>>>>>> maintainers could introduce a temporary failure in their drivers
>>>>>> probe to validate that the sysfb recovers and they get a working console.
>>>>>> The easiest way to double check it is by adding:
>>>>>>     /* XXX: Temporary failure to test sysfb restore - REMOVE BEFORE COMMIT */
>>>>>>     dev_info(&pdev->dev, "Testing sysfb restore: forcing probe failure\n");
>>>>>>     ret = -EINVAL;
>>>>>>     goto out_error;
>>>>>> or such right after the devm_aperture_remove_conflicting_pci_devices .
>>>>> Recovering the display like that is guess work and will at best work
>>>>> with simple discrete devices where the framebuffer is always located in
>>>>> a confined graphics aperture.
>>>>>
>>>>> But the problem you're trying to solve is a real one.
>>>>>
>>>>> What we'd want to do instead is to take the initial hardware state into
>>>>> account when we do the initial mode-setting operation.
>>>>>
>>>>> The first step is to move each driver's remove_conflicting_devices call
>>>>> to the latest possible location in the probe function. We usually do it
>>>>> first, because that's easy. But on most hardware, it could happen much
>>>>> later.
>>>> Well, some drivers (vbox, vmwgfx, bochs and cirrus-qemu) do it because
>>>> they request pci regions which is going to fail otherwise. Because
>>>> grabbing the pci resources is in general the very first thing that
>>>> those drivers need to do to setup anything, we
>>>> remove_conflicting_devices first or at least very early.
>>> To my knowledge, requesting resources is more about correctness than a hard requirement to use an I/O or memory range. Has this changed?
>> Nope that is not correct.
>>
>> At least for AMD GPUs remove_conflicting_devices() really early is necessary because otherwise some operations just result in a spontaneous system reboot.	
>>
>> For example resizing the PCIe BAR giving access to VRAM or disabling VGA emulation (which AFAIK is used for EFI as well) is only possible when the VGA or EFI framebuffer driver is kicked out first.
>>
>> And disabling VGA emulation is among the absolutely first steps you do to take over the scanout config.
> It's similar for Intel. For us VGA emulation won't be used for
> EFI boot, but we still can't have the previous driver poking
> around in memory while the real driver is initializing. The
> entire memory layout may get completely shuffled so there's
> no telling where such memory accesses would land.

Isn't there code in display/intel_fbdev.c that reads back the old state 
from hardware before initializing fbdev? [1] How does that work then? 
Wouldn't the HW state be invalid already?

Best regards
Thomas

[1] 
https://elixir.bootlin.com/linux/v6.18.5/source/drivers/gpu/drm/i915/display/intel_fbdev.c#L356

>
> And I suppose reBAR is a concern for us as well.
>

-- 
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Frankenstr. 146, 90461 Nürnberg, Germany, www.suse.com
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich, (HRB 36809, AG Nürnberg)

Re: [PATCH 00/12] Recover sysfb after DRM probe failure

Posted by Gerd Hoffmann 3 weeks, 2 days ago

  Hi,

> > At least for AMD GPUs remove_conflicting_devices() really early is
> > necessary because otherwise some operations just result in a
> > spontaneous system reboot.	

> It's similar for Intel. For us VGA emulation won't be used for EFI
> boot, but we still can't have the previous driver poking around in
> memory while the real driver is initializing. The entire memory layout
> may get completely shuffled so there's no telling where such memory
> accesses would land.

Can you do stuff like checking which firmware is needed and whether
that can be loaded from the filesystem before calling
remove_conflicting_devices() ?

take care,
  Gerd

Re: [PATCH 00/12] Recover sysfb after DRM probe failure

Posted by Mario Limonciello 3 weeks, 2 days ago

On 1/15/26 10:36 AM, Gerd Hoffmann wrote:
>    Hi,
> 
>>> At least for AMD GPUs remove_conflicting_devices() really early is
>>> necessary because otherwise some operations just result in a
>>> spontaneous system reboot.	
> 
>> It's similar for Intel. For us VGA emulation won't be used for EFI
>> boot, but we still can't have the previous driver poking around in
>> memory while the real driver is initializing. The entire memory layout
>> may get completely shuffled so there's no telling where such memory
>> accesses would land.
> 
> Can you do stuff like checking which firmware is needed and whether
> that can be loaded from the filesystem before calling
> remove_conflicting_devices() ?
> 

That's something that I did in amdgpu a few years back.

I pushed the identification and ability to load firmware into early init 
stages.  It means that if you have a brand new GPU and run a modern 
kernel with an older linux-firmware snapshot amdgpu will fail probe and 
your framebuffer from EFI keeps working.
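
For illustration only: the ordering described above can be sketched as a toy userspace model. This is not amdgpu code. The toy_* names and the two boolean inputs are assumptions that stand in for a firmware lookup and for hardware bring-up. The point is that every check which does not touch the hardware runs before the firmware framebuffer is evicted.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Who currently drives the display in this toy model. */
enum display_owner { OWNER_SYSFB, OWNER_NATIVE, OWNER_NONE };

struct toy_gpu {
	enum display_owner owner;
	bool firmware_available; /* stands in for a successful firmware lookup */
	bool hw_init_ok;         /* stands in for hardware bring-up succeeding */
};

/* Hypothetical probe that validates firmware before evicting sysfb. */
static int toy_probe(struct toy_gpu *gpu)
{
	assert(gpu);

	/* Step 1: checks that do not touch the hardware. */
	if (!gpu->firmware_available)
		return -ENOENT; /* sysfb untouched, console keeps working */

	/* Step 2: point of no return, evict the firmware framebuffer. */
	gpu->owner = OWNER_NONE;

	/* Step 3: hardware bring-up. */
	if (!gpu->hw_init_ok)
		return -EIO; /* display is lost: the case this series targets */

	gpu->owner = OWNER_NATIVE;
	return 0;
}
```

With firmware missing the model fails early and sysfb stays the owner; only a failure after step 2 leaves the display without a driver.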

Re: [PATCH 00/12] Recover sysfb after DRM probe failure

Posted by Thomas Zimmermann 3 weeks, 2 days ago

Hi

Am 15.01.26 um 15:39 schrieb Christian König:
> Sorry for being late, but I only now realized what you are doing here.
>
> On 1/15/26 12:02, Thomas Zimmermann wrote:
>> Hi,
>>
>> apologies for the delay. I wanted to reply and then forgot about it.
>>
>> Am 10.01.26 um 05:52 schrieb Zack Rusin:
>>> On Fri, Jan 9, 2026 at 5:34 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>>>> Hi
>>>>
>>>> Am 29.12.25 um 22:58 schrieb Zack Rusin:
>>>>> Almost a rite of passage for every DRM developer and most Linux users
>>>>> is upgrading your DRM driver/updating boot flags/changing some config
>>>>> and having DRM driver fail at probe resulting in a blank screen.
>>>>>
>>>>> Currently there's no way to recover from DRM driver probe failure. PCI
>>>>> DRM driver explicitly throw out the existing sysfb to get exclusive
>>>>> access to PCI resources so if the probe fails the system is left without
>>>>> a functioning display driver.
>>>>>
>>>>> Add code to sysfb to recover system framebuffer when DRM driver's probe
>>>>> fails. This means that a DRM driver that fails to load reloads the system
>>>>> framebuffer driver.
>>>>>
>>>>> This works best with simpledrm. Without it Xorg won't recover because
>>>>> it still tries to load the vendor specific driver which ends up usually
>>>>> not working at all. With simpledrm the system recovers really nicely
>>>>> ending up with a working console and not a blank screen.
>>>>>
>>>>> There's a caveat in that some hardware might require some special magic
>>>>> register write to recover EFI display. I'd appreciate it a lot if
>>>>> maintainers could introduce a temporary failure in their drivers
>>>>> probe to validate that the sysfb recovers and they get a working console.
>>>>> The easiest way to double check it is by adding:
>>>>>     /* XXX: Temporary failure to test sysfb restore - REMOVE BEFORE COMMIT */
>>>>>     dev_info(&pdev->dev, "Testing sysfb restore: forcing probe failure\n");
>>>>>     ret = -EINVAL;
>>>>>     goto out_error;
>>>>> or such right after the devm_aperture_remove_conflicting_pci_devices .
>>>> Recovering the display like that is guess work and will at best work
>>>> with simple discrete devices where the framebuffer is always located in
>>>> a confined graphics aperture.
>>>>
>>>> But the problem you're trying to solve is a real one.
>>>>
>>>> What we'd want to do instead is to take the initial hardware state into
>>>> account when we do the initial mode-setting operation.
>>>>
>>>> The first step is to move each driver's remove_conflicting_devices call
>>>> to the latest possible location in the probe function. We usually do it
>>>> first, because that's easy. But on most hardware, it could happen much
>>>> later.
>>> Well, some drivers (vbox, vmwgfx, bochs and cirrus-qemu) do it because
>>> they request pci regions which is going to fail otherwise. Because
>>> grabbing the pci resources is in general the very first thing that
>>> those drivers need to do to setup anything, we
>>> remove_conflicting_devices first or at least very early.
>> To my knowledge, requesting resources is more about correctness than a hard requirement to use an I/O or memory range. Has this changed?
> Nope that is not correct.
>
> At least for AMD GPUs remove_conflicting_devices() really early is necessary because otherwise some operations just result in a spontaneous system reboot.	

Here I was only talking about avoiding calls to request_resource() and 
similar interfaces.

>
> For example resizing the PCIe BAR giving access to VRAM or disabling VGA emulation (which AFAIK is used for EFI as well) is only possible when the VGA or EFI framebuffer driver is kicked out first.

Yeah, that's what I expected.

>
> And disabling VGA emulation is among the absolutely first steps you do to take over the scanout config.

Assuming the driver (or driver author) is careful, is it possible to 
only read state from AMD hardware at such an early time?

We usually do remove_conflicting_devices() as the first thing in most
drivers' probe functions. As a first step, it would be helpful to
postpone it to a later point.

>
> So I absolutely clearly have to reject the amdgpu patch in this series, that will break tons of use cases.

Don't worry, we're still in the early ideation phase.

Best regards
Thomas

>
> Regards,
> Christian.
>
>>> I also don't think it's possible or even desirable by some drivers to
>>> reuse the initial state, good example here is vmwgfx where by default
>>> some people will setup their vm's with e.g. 8mb ram, when the vmwgfx
>>> loads we allow scanning out from system memory, so you can set your vm
>>> up with 8mb of vram but still use 4k resolutions when the driver
>>> loads, this way the suspend size of the vm is very predictable (tiny
>>> vram plus whatever ram was setup) while still allowing a lot of
>>> flexibility.
>> If there's no initial state to switch from, the first modeset can fail while leaving the display unusable. There's no way around that. Going back to the old state is not an option unless the driver has been written to support this.
>>
>> The case of vmwgfx is special, but does not affect the overall problem. For vmwgfx, it would be best to import that initial state and support a transparent modeset from vram to system memory (and back) at least during this initial state.
>>
>>
>>> In general I think however this is planned it's two or three separate series:
>>> 1) infrastructure to reload the sysfb driver (what this series is)
>>> 2) making sure that drivers that do want to recover cleanly actually
>>> clean out all the state on exit properly,
>>> 3) abstracting at least some of that cleanup in some driver independent way
>> That's really not going to work. For example, in the current series, you
>> invoke devm_aperture_remove_conflicting_pci_devices_done() after
>> drm_mode_reset(), drm_dev_register() and drm_client_setup(). Each of these
>> calls can modify hardware state. In the case of _register() and _setup(),
>> the DRM clients can perform a modeset, which destroys the initial hardware
>> state.
>>
>> Patch 1 of this series removes the sysfb device/driver entirely. That
>> should be a no-go as it significantly complicates recovery. For example,
>> if the native driver failed from an allocation failure, the sysfb
>> device/driver is not likely to come back either.
>>
>> As the very first thing, the series should state which failures it is
>> going to resolve:
>>
>>  - failed hardware init,
>>  - invalid initial modesetting,
>>  - runtime errors (such as ENOMEM, failed firmware loading),
>>  - others?
>>
>> And then specify how a recovery to sysfb could look in each supported
>> scenario.
>>
>> In terms of implementation, make any transition between drivers gradual.
>> The native driver needs to acquire the hardware resource (framebuffer and
>> I/O apertures) without unloading the sysfb driver. Luckily there's struct
>> drm_device.unplug, which does that. [1] Flipping this field disables
>> hardware access for DRM drivers. All sysfb drivers support this.
>>
>> To get the sysfb drivers ready, I suggest dedicated helpers for each
>> driver's aperture. The aperture helpers can use these callbacks to flip
>> the DRM driver off and on again. For example, efidrm could do this as a
>> minimum:
>>
>>     int efidrm_aperture_suspend()
>>     {
>>         dev->unplug = true;
>>         remove_resource(/*framebuffer aperture*/);
>>
>>         return 0;
>>     }
>>
>>     int efidrm_aperture_resume()
>>     {
>>         insert_resource(/*framebuffer aperture*/);
>>         dev->unplug = false;
>>
>>         return 0;
>>     }
>>
>>     struct aperture_funcs efidrm_aperture_funcs = {
>>         .suspend = efidrm_aperture_suspend,
>>         .resume = efidrm_aperture_resume,
>>     };
>>
>> Pass this struct when efidrm acquires the framebuffer aperture, so that
>> the aperture helpers can control the behavior of efidrm. With this, a
>> multi-step takeover from sysfb to native driver can be tried.
>>
>> It's still a massive effort that requires an audit of each driver's
>> probing logic. There's no copy-paste pattern AFAICT. I suggest to pick one
>> simple driver first and make a prototype.
>>
>> Let me also say that I DO like the general idea you're proposing. But if
>> it was easy, we would likely have done it already.
>>
>> Best regards
>> Thomas
>>> z

-- 
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Frankenstr. 146, 90461 Nürnberg, Germany, www.suse.com
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich, (HRB 36809, AG Nürnberg)

Re: [PATCH 00/12] Recover sysfb after DRM probe failure

Posted by Christian König 3 weeks, 2 days ago

On 1/15/26 15:54, Thomas Zimmermann wrote:
> Hi
> 
> Am 15.01.26 um 15:39 schrieb Christian König:
>> Sorry for being late, but I only now realized what you are doing here.
>>
>> On 1/15/26 12:02, Thomas Zimmermann wrote:
>>> Hi,
>>>
>>> apologies for the delay. I wanted to reply and then forgot about it.
>>>
>>> Am 10.01.26 um 05:52 schrieb Zack Rusin:
>>>> On Fri, Jan 9, 2026 at 5:34 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>>>>> Hi
>>>>>
>>>>> Am 29.12.25 um 22:58 schrieb Zack Rusin:
>>>>>> Almost a rite of passage for every DRM developer and most Linux users
>>>>>> is upgrading your DRM driver/updating boot flags/changing some config
>>>>>> and having DRM driver fail at probe resulting in a blank screen.
>>>>>>
>>>>>> Currently there's no way to recover from DRM driver probe failure. PCI
>>>>>> DRM driver explicitly throw out the existing sysfb to get exclusive
>>>>>> access to PCI resources so if the probe fails the system is left without
>>>>>> a functioning display driver.
>>>>>>
>>>>>> Add code to sysfb to recover system framebuffer when DRM driver's probe
>>>>>> fails. This means that a DRM driver that fails to load reloads the system
>>>>>> framebuffer driver.
>>>>>>
>>>>>> This works best with simpledrm. Without it Xorg won't recover because
>>>>>> it still tries to load the vendor specific driver which ends up usually
>>>>>> not working at all. With simpledrm the system recovers really nicely
>>>>>> ending up with a working console and not a blank screen.
>>>>>>
>>>>>> There's a caveat in that some hardware might require some special magic
>>>>>> register write to recover EFI display. I'd appreciate it a lot if
>>>>>> maintainers could introduce a temporary failure in their drivers
>>>>>> probe to validate that the sysfb recovers and they get a working console.
>>>>>> The easiest way to double check it is by adding:
>>>>>>     /* XXX: Temporary failure to test sysfb restore - REMOVE BEFORE COMMIT */
>>>>>>     dev_info(&pdev->dev, "Testing sysfb restore: forcing probe failure\n");
>>>>>>     ret = -EINVAL;
>>>>>>     goto out_error;
>>>>>> or such right after the devm_aperture_remove_conflicting_pci_devices .
>>>>> Recovering the display like that is guess work and will at best work
>>>>> with simple discrete devices where the framebuffer is always located in
>>>>> a confined graphics aperture.
>>>>>
>>>>> But the problem you're trying to solve is a real one.
>>>>>
>>>>> What we'd want to do instead is to take the initial hardware state into
>>>>> account when we do the initial mode-setting operation.
>>>>>
>>>>> The first step is to move each driver's remove_conflicting_devices call
>>>>> to the latest possible location in the probe function. We usually do it
>>>>> first, because that's easy. But on most hardware, it could happen much
>>>>> later.
>>>> Well, some drivers (vbox, vmwgfx, bochs and cirrus-qemu) do it because
>>>> they request PCI regions, which is going to fail otherwise. Because
>>>> grabbing the PCI resources is in general the very first thing that
>>>> those drivers need to do to set up anything, we call
>>>> remove_conflicting_devices first or at least very early.
>>> To my knowledge, requesting resources is more about correctness than a hard requirement to use an I/O or memory range. Has this changed?
>> Nope that is not correct.
>>
>> At least for AMD GPUs, calling remove_conflicting_devices() really early is necessary because otherwise some operations just result in a spontaneous system reboot.
> 
> Here I was only talking about avoiding calls to request_resource() and similar interfaces.
> 
>>
>> For example resizing the PCIe BAR giving access to VRAM or disabling VGA emulation (which AFAIK is used for EFI as well) is only possible when the VGA or EFI framebuffer driver is kicked out first.
> 
> Yeah, that's what I expected.
> 
>>
>> And disabling VGA emulation is among the absolutely first steps you do to take over the scanout config.
> 
> Assuming the driver (or driver author) is careful, is it possible to only read state from AMD hardware at such an early time?

I'm not an expert on that particular stuff, but I strongly doubt it.

Basically the VGA emulation is firmware which "owns" the CRTC registers and might modify them at any time unless it's turned off first.

So you can't even use data/index pairs of registers etc...

> We usually do remove_conflicting_devices() as the first thing in most drivers' probe functions. As a first step, it would be helpful to postpone it to a later point.

Well, from what I know that won't work in a lot of cases.

I mean, what we could do on non-AMD HW is to remove the conflicting driver, play with the HW and, if we find that this didn't work, reset the HW using a PCI function level reset and try to load the EFI or whatever driver again. But that has a rather low chance of working reliably, I would say.

The problem with AMD GPUs is that the PCI function level reset is broken to begin with (which already caused us tons of headaches in the case of pass through).
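
To make the fallback sequence described above concrete, here is a minimal userspace sketch of its control flow. It is a model only, not kernel code: struct model_dev, model_probe() and the reset_restores_display flag are hypothetical names invented for illustration, and the flag simply encodes the assumption that a function level reset brings the display back into a state the firmware framebuffer driver can use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a display device; not a kernel structure. */
enum model_owner { OWNER_SYSFB, OWNER_NONE, OWNER_NATIVE };

struct model_dev {
    enum model_owner owner;       /* which driver currently owns scanout */
    bool reset_restores_display;  /* assumption: a reset makes sysfb usable again */
    bool native_init_ok;          /* would the native driver's HW init succeed? */
};

/*
 * Returns 0 if the native driver took over, 1 if the firmware
 * framebuffer driver was restored, -1 if the display is lost.
 */
int model_probe(struct model_dev *dev)
{
    assert(dev != NULL);

    /* Step 1: remove the conflicting firmware framebuffer driver. */
    dev->owner = OWNER_NONE;

    /* Step 2: the native driver programs the hardware. */
    if (dev->native_init_ok) {
        dev->owner = OWNER_NATIVE;
        return 0;
    }

    /* Step 3: on failure, try a reset; give up if it does not help. */
    if (!dev->reset_restores_display)
        return -1;

    /* Step 4: load the firmware framebuffer driver again. */
    dev->owner = OWNER_SYSFB;
    return 1;
}
```

The -1 branch is the case described for hardware where the reset is broken: once the conflicting driver is gone and native init has failed, nothing is left to drive the display.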

Regards,
Christian.

> 
>>
>> So I absolutely clearly have to reject the amdgpu patch in this series, that will break tons of use cases.
> 
> Don't worry, we're still in the early ideation phase.
> 
> Best regards
> Thomas
> 
>>
>> Regards,
>> Christian.
>>
>>>> I also don't think it's possible or even desirable for some drivers to
>>>> reuse the initial state. A good example here is vmwgfx, where by default
>>>> some people will set up their VMs with e.g. 8mb ram. When vmwgfx
>>>> loads we allow scanning out from system memory, so you can set your VM
>>>> up with 8mb of vram but still use 4k resolutions when the driver
>>>> loads. This way the suspend size of the VM is very predictable (tiny
>>>> vram plus whatever ram was set up) while still allowing a lot of
>>>> flexibility.
>>> If there's no initial state to switch from, the first modeset can fail while leaving the display unusable. There's no way around that. Going back to the old state is not an option unless the driver has been written to support this.
>>>
>>> The case of vmwgfx is special, but does not affect the overall problem. For vmwgfx, it would be best to import that initial state and support a transparent modeset from vram to system memory (and back) at least during this initial state.
>>>
>>>
>>>> In general I think however this is planned it's two or three separate series:
>>>> 1) infrastructure to reload the sysfb driver (what this series is)
>>>> 2) making sure that drivers that do want to recover cleanly actually
>>>> clean out all the state on exit properly,
>>>> 3) abstracting at least some of that cleanup in some driver independent way
>>> That's really not going to work. For example, in the current series,
>>> you invoke devm_aperture_remove_conflicting_pci_devices_done() after
>>> drm_mode_reset(), drm_dev_register() and drm_client_setup(). Each of
>>> these calls can modify hardware state. In the case of _register() and
>>> _setup(), the DRM clients can perform a modeset, which destroys the
>>> initial hardware state.
>>>
>>> Patch 1 of this series removes the sysfb device/driver entirely. That
>>> should be a no-go as it significantly complicates recovery. For
>>> example, if the native driver failed because of an allocation failure,
>>> the sysfb device/driver is not likely to come back either.
>>>
>>> As the very first thing, the series should state which failures it is
>>> going to resolve,
>>>
>>>  - failed hardware init,
>>>  - invalid initial modesetting,
>>>  - runtime errors (such as ENOMEM, failed firmware loading),
>>>  - others?
>>>
>>> And then specify how a recovery to sysfb could look in each supported
>>> scenario.
>>>
>>> In terms of implementation, make any transition between drivers
>>> gradual. The native driver needs to acquire the hardware resource
>>> (framebuffer and I/O apertures) without unloading the sysfb driver.
>>> Luckily there's struct drm_device.unplug, which does that. [1] Flipping
>>> this field disables hardware access for DRM drivers. All sysfb drivers
>>> support this.
>>>
>>> To get the sysfb drivers ready, I suggest dedicated helpers for each
>>> driver's aperture. The aperture helpers can use these callbacks to flip
>>> the DRM driver off and on again. For example, efidrm could do this as a
>>> minimum:
>>>
>>>     int efidrm_aperture_suspend()
>>>     {
>>>         dev->unplug = true;
>>>         remove_resource(/*framebuffer aperture*/);
>>>         return 0;
>>>     }
>>>
>>>     int efidrm_aperture_resume()
>>>     {
>>>         insert_resource(/*framebuffer aperture*/);
>>>         dev->unplug = false;
>>>         return 0;
>>>     }
>>>
>>>     struct aperture_funcs efidrm_aperture_funcs {
>>>         .suspend = efidrm_aperture_suspend,
>>>         .resume = efidrm_aperture_resume,
>>>     };
>>>
>>> Pass this struct when efidrm acquires the framebuffer aperture, so that
>>> the aperture helpers can control the behavior of efidrm.
>>>
>>> With this, a multi-step takeover from sysfb to native driver can be
>>> tried. It's still a massive effort that requires an audit of each
>>> driver's probing logic. There's no copy-paste pattern AFAICT. I suggest
>>> to pick one simple driver first and make a prototype.
>>>
>>> Let me also say that I DO like the general idea you're proposing. But
>>> if it was easy, we would likely have done it already.
>>>
>>> Best regards
>>> Thomas
>>>> z
>
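
For reference, the suspend/resume handoff proposed in the quoted efidrm sketch can be modelled in plain userspace C as below. This is only an illustration under stated assumptions: struct model_sysfb, struct model_aperture_funcs, model_takeover() and the model_init_*() stand-ins are hypothetical names, not existing kernel interfaces, and the two booleans merely stand in for the unplug flag and the framebuffer resource from the quoted proposal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these types exist in the kernel. */
struct model_sysfb {
    bool unplugged;      /* plays the role of the unplug flag in the quoted sketch */
    bool owns_aperture;  /* true while the framebuffer range is claimed */
};

struct model_aperture_funcs {
    int (*suspend)(struct model_sysfb *fb);
    int (*resume)(struct model_sysfb *fb);
};

int model_suspend(struct model_sysfb *fb)
{
    fb->unplugged = true;       /* stop hardware access first */
    fb->owns_aperture = false;  /* then release the framebuffer range */
    return 0;
}

int model_resume(struct model_sysfb *fb)
{
    fb->owns_aperture = true;   /* reclaim the framebuffer range first */
    fb->unplugged = false;      /* then allow hardware access again */
    return 0;
}

const struct model_aperture_funcs model_funcs = {
    .suspend = model_suspend,
    .resume = model_resume,
};

/* Stand-ins for a native driver's init step: success and failure. */
int model_init_ok(void) { return 0; }
int model_init_fail(void) { return -1; }

/*
 * Suspend the sysfb model, run the native init step, and resume the
 * sysfb model if that step fails. Returns the native init result.
 */
int model_takeover(struct model_sysfb *fb,
                   const struct model_aperture_funcs *funcs,
                   int (*native_init)(void))
{
    int ret;

    assert(fb != NULL && funcs != NULL && native_init != NULL);

    ret = funcs->suspend(fb);
    if (ret)
        return ret;

    ret = native_init();
    if (ret)
        funcs->resume(fb);  /* hand the display back on failure */

    return ret;
}
```

The point of the model is the ordering: the sysfb side is parked rather than removed, so a failed native init can hand the display back. Whether real hardware is still in a usable state at that point is exactly the open question in this thread.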