[PATCH 0/4] PCI: Continue E820 vs host bridge window saga

Bjorn Helgaas posted 4 patches 1 year, 4 months ago
There is a newer version of this series
arch/x86/kernel/resource.c  |  7 +++++--
arch/x86/pci/acpi.c         |  2 +-
arch/x86/platform/efi/efi.c | 36 ++++++++++++++++++++++++++++++++++++
drivers/pci/bus.c           |  4 ++++
4 files changed, 46 insertions(+), 3 deletions(-)
[PATCH 0/4] PCI: Continue E820 vs host bridge window saga
Posted by Bjorn Helgaas 1 year, 4 months ago
From: Bjorn Helgaas <bhelgaas@google.com>

When allocating space for PCI BARs, Linux avoids allocating space mentioned
in the E820 map.  This was originally done by 4dc2287c1805 ("x86: avoid
E820 regions when allocating address space") to work around BIOS defects
that included unusable space in host bridge _CRS.

Some recent machines use EfiMemoryMappedIO for PCI MMCONFIG and host bridge
apertures, and bootloaders and EFI stubs convert those to E820 regions,
which means we can't allocate space for hot-added PCI devices (often a
dock) or for devices the BIOS didn't configure (often a touchpad)

The current strategy is to add DMI quirks that disable the E820 filtering
on these machines and to disable it entirely starting with 2023 BIOSes:

  d341838d776a ("x86/PCI: Disable E820 reserved region clipping via quirks")
  0ae084d5a674 ("x86/PCI: Disable E820 reserved region clipping starting in 2023")

But the quirks are problematic because it's really hard to list all the
machines that need them.

This series is an attempt at a more generic approach.  I'm told by firmware
folks that EfiMemoryMappedIO means "the OS should map this area so EFI
runtime services can use it in virtual mode," but does not prevent the OS
from using it.

The first patch removes any EfiMemoryMappedIO areas from the E820 map.
This doesn't affect any virtual mapping of those areas (that would have to
be done directly from the EFI memory map) but it means Linux can allocate
space for PCI MMIO.

The rest are basically cosmetic log message changes.

Bjorn Helgaas (4):
  efi/x86: Remove EfiMemoryMappedIO from E820 map
  PCI: Skip allocate_resource() if too little space available
  x86/PCI: Tidy E820 removal messages
  x86/PCI: Fix log message typo

 arch/x86/kernel/resource.c  |  7 +++++--
 arch/x86/pci/acpi.c         |  2 +-
 arch/x86/platform/efi/efi.c | 36 ++++++++++++++++++++++++++++++++++++
 drivers/pci/bus.c           |  4 ++++
 4 files changed, 46 insertions(+), 3 deletions(-)

-- 
2.25.1
Re: [PATCH 0/4] PCI: Continue E820 vs host bridge window saga
Posted by Hans de Goede 1 year, 4 months ago
Hi Bjorn,

On 12/2/22 22:18, Bjorn Helgaas wrote:
> From: Bjorn Helgaas <bhelgaas@google.com>
> 
> When allocating space for PCI BARs, Linux avoids allocating space mentioned
> in the E820 map.  This was originally done by 4dc2287c1805 ("x86: avoid
> E820 regions when allocating address space") to work around BIOS defects
> that included unusable space in host bridge _CRS.
> 
> Some recent machines use EfiMemoryMappedIO for PCI MMCONFIG and host bridge
> apertures, and bootloaders and EFI stubs convert those to E820 regions,
> which means we can't allocate space for hot-added PCI devices (often a
> dock) or for devices the BIOS didn't configure (often a touchpad)
> 
> The current strategy is to add DMI quirks that disable the E820 filtering
> on these machines and to disable it entirely starting with 2023 BIOSes:
> 
>   d341838d776a ("x86/PCI: Disable E820 reserved region clipping via quirks")
>   0ae084d5a674 ("x86/PCI: Disable E820 reserved region clipping starting in 2023")
> 
> But the quirks are problematic because it's really hard to list all the
> machines that need them.
> 
> This series is an attempt at a more generic approach.  I'm told by firmware
> folks that EfiMemoryMappedIO means "the OS should map this area so EFI
> runtime services can use it in virtual mode," but does not prevent the OS
> from using it.
> 
> The first patch removes any EfiMemoryMappedIO areas from the E820 map.
> This doesn't affect any virtual mapping of those areas (that would have to
> be done directly from the EFI memory map) but it means Linux can allocate
> space for PCI MMIO.
> 
> The rest are basically cosmetic log message changes.

Thank you for working on this. I'm a bit worried about this series though.

The 2 things which I worry about are:


1. I think this will not help when people boot in BIOS (CSM) mode rather
then UEFI mode which quite a few Linux users still do because they learned
to do this years ago when Linux EFI support (and EFI fw itself) was still
a bit in flux.

IIRC from the last time we looked at this in CSM mode the BIOS itself
translates the EfiMemoryMappedIO areas to reserved E820 regions. So when
people use the BIOS CSM mode to boot, then this patch will not help
since the kernel lacks the info to do the translation.\

We may also hit the same case when the bootloader has done the
translation which I believe is what upstream grub does. Fedora grub
has been patched to use the kernel EFI stub when booting a kernel
on EFI, so just an EFI equivalent of "exec" on the kernel EFI binary.

Where as upstream grub does a more BIOS like boot, where it skips the
EFI stub and instead does a whole bunch of stuff itself and then
jumps to the kernel's start vector. So this might also not work with
upstream grub, which is what I believe Ubuntu and Debian use.

Although I case in this case we will still have access to the EFI
memory map and I see that your patch removes the entries stemming
from the EfiMemoryMappedIO areas from the E820 map, rather then
never adding them. So I guess this will also work in the case
when the bootloader has done the translation (leaving just
the BIOS CSM case as an issue) ?

This won't cause regressions, but it does mean that e.g. the
touchpad i2c-controller / hotplugged PCI devices will still not
work when booted in BIOS (CSM) mode / through upstream grub.

I have asked the reporter of:

https://bugzilla.redhat.com/show_bug.cgi?id=1868899

To do a BIOS mode boot of a Fedora 37 live USB and then collect
dmesg output, then we can check if that indeed has the
EfiMemoryMappedIO areas as reserved E820 regions in a way where
we cannot identify them anymore since we don't have access to
the EFI memory map in this case.


2. I am afraid that now allowing PCI MMIO space to be allocated
in regions marked as EfiMemoryMappedIO will cause regressions
on some systems. Specifically when I tried something similar
the last time I looked at this (using the BIOS date cut-off
approach IIRC) there was a suspend/resume regression on
a Lenovo ThinkPad X1 carbon (20A7) model:

https://bugzilla.redhat.com/show_bug.cgi?id=2029207

Back then I came to the conclusion that the problem is that not
avoiding the EfiMemoryMappedIO regions caused PCI MMIO space to
be allocated in the 0xdfa00000 - 0xdfa10000 range which is
listed in the EFI memmap as:

[    0.000000] efi: mem46: [MMIO        |RUN|  |  |  |  |  |  |  |  |   |  |  |  |  ] range=[0x00000000dfa00000-0x00000000dfa0ffff] (0MB)

And with current kernels with the extra logging added for this
the following is logged related to this:

[    0.326504] acpi PNP0A08:00: clipped [mem 0xdfa00000-0xfebfffff window] to [mem 0xdfa10000-0xfebfffff window] for e820 entry [mem 0xdceff000-0xdfa0ffff]

I believe patch 1/4 of this set will make this clipping go away,
re-introducing the suspend/resume problem.

I will reach out to the reporter and see if I can get them to
test this patch-set.

Regards,

Hans
Re: [PATCH 0/4] PCI: Continue E820 vs host bridge window saga
Posted by Bjorn Helgaas 1 year, 4 months ago
On Sat, Dec 03, 2022 at 01:44:10PM +0100, Hans de Goede wrote:
> Hi Bjorn,
> 
> On 12/2/22 22:18, Bjorn Helgaas wrote:
> > From: Bjorn Helgaas <bhelgaas@google.com>
> > 
> > When allocating space for PCI BARs, Linux avoids allocating space mentioned
> > in the E820 map.  This was originally done by 4dc2287c1805 ("x86: avoid
> > E820 regions when allocating address space") to work around BIOS defects
> > that included unusable space in host bridge _CRS.
> > 
> > Some recent machines use EfiMemoryMappedIO for PCI MMCONFIG and host bridge
> > apertures, and bootloaders and EFI stubs convert those to E820 regions,
> > which means we can't allocate space for hot-added PCI devices (often a
> > dock) or for devices the BIOS didn't configure (often a touchpad)
> > 
> > The current strategy is to add DMI quirks that disable the E820 filtering
> > on these machines and to disable it entirely starting with 2023 BIOSes:
> > 
> >   d341838d776a ("x86/PCI: Disable E820 reserved region clipping via quirks")
> >   0ae084d5a674 ("x86/PCI: Disable E820 reserved region clipping starting in 2023")
> > 
> > But the quirks are problematic because it's really hard to list all the
> > machines that need them.
> > 
> > This series is an attempt at a more generic approach.  I'm told by firmware
> > folks that EfiMemoryMappedIO means "the OS should map this area so EFI
> > runtime services can use it in virtual mode," but does not prevent the OS
> > from using it.
> > 
> > The first patch removes any EfiMemoryMappedIO areas from the E820 map.
> > This doesn't affect any virtual mapping of those areas (that would have to
> > be done directly from the EFI memory map) but it means Linux can allocate
> > space for PCI MMIO.
> > 
> > The rest are basically cosmetic log message changes.
> 
> Thank you for working on this. I'm a bit worried about this series though.
> 
> The 2 things which I worry about are:
> 
> 
> 1. I think this will not help when people boot in BIOS (CSM) mode rather
> then UEFI mode which quite a few Linux users still do because they learned
> to do this years ago when Linux EFI support (and EFI fw itself) was still
> a bit in flux.
> 
> IIRC from the last time we looked at this in CSM mode the BIOS itself
> translates the EfiMemoryMappedIO areas to reserved E820 regions. So when
> people use the BIOS CSM mode to boot, then this patch will not help
> since the kernel lacks the info to do the translation.

Right, if BIOS CSM puts EfiMemoryMappedIO in the E820 map the same way
bootloaders do, and the kernel doesn't have the EFI memory map, this
series won't help.

> We may also hit the same case when the bootloader has done the
> translation which I believe is what upstream grub does. Fedora grub
> has been patched to use the kernel EFI stub when booting a kernel
> on EFI, so just an EFI equivalent of "exec" on the kernel EFI binary.
> 
> Where as upstream grub does a more BIOS like boot, where it skips the
> EFI stub and instead does a whole bunch of stuff itself and then
> jumps to the kernel's start vector. So this might also not work with
> upstream grub, which is what I believe Ubuntu and Debian use.
>
> Although I case in this case we will still have access to the EFI
> memory map and I see that your patch removes the entries stemming
> from the EfiMemoryMappedIO areas from the E820 map, rather then
> never adding them. So I guess this will also work in the case
> when the bootloader has done the translation (leaving just
> the BIOS CSM case as an issue)?
>
> This won't cause regressions, but it does mean that e.g. the
> touchpad i2c-controller / hotplugged PCI devices will still not
> work when booted in BIOS (CSM) mode / through upstream grub.

Yes, I agree CSM could still be broken if BIOS puts EfiMemoryMappedIO
in the E820 map.

I'm not clear on the grub cases.  Do some grub versions hide the EFI
memory map from the kernel?  If they do, they could be broken the same
way as CSM.

> 2. I am afraid that now allowing PCI MMIO space to be allocated
> in regions marked as EfiMemoryMappedIO will cause regressions
> on some systems. Specifically when I tried something similar
> the last time I looked at this (using the BIOS date cut-off
> approach IIRC) there was a suspend/resume regression on
> a Lenovo ThinkPad X1 carbon (20A7) model:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=2029207
> 
> Back then I came to the conclusion that the problem is that not
> avoiding the EfiMemoryMappedIO regions caused PCI MMIO space to
> be allocated in the 0xdfa00000 - 0xdfa10000 range which is
> listed in the EFI memmap as:
> 
> [    0.000000] efi: mem46: [MMIO        |RUN|  |  |  |  |  |  |  |  |   |  |  |  |  ] range=[0x00000000dfa00000-0x00000000dfa0ffff] (0MB)
> 
> And with current kernels with the extra logging added for this
> the following is logged related to this:
> 
> [    0.326504] acpi PNP0A08:00: clipped [mem 0xdfa00000-0xfebfffff window] to [mem 0xdfa10000-0xfebfffff window] for e820 entry [mem 0xdceff000-0xdfa0ffff]
> 
> I believe patch 1/4 of this set will make this clipping go away,
> re-introducing the suspend/resume problem.

Yes, I'm afraid you're right.  Comparing the logs at comment #31
(fails) and comment #38 (works):

  pci_bus 0000:00: root bus resource [mem 0xdfa00000-0xfebfffff window]
  pci 0000:00:1c.0: BAR 14: assigned [mem 0xdfa00000-0xdfbfffff] fails
  pci 0000:00:1c.0: BAR 14: assigned [mem 0xdfb00000-0xdfcfffff] works

Since 0xdfa00000 is included in the host bridge _CRS, but isn't
usable, my guess is this is a _CRS bug.

Or maybe BIOS is using the producer/consumer bit in _CRS to identify
this as register space as opposed to a window?  I thought we couldn't
rely on that bit, but it's been a long time since I looked at it.  An
acpidump might have a clue.

Bjorn
Re: [PATCH 0/4] PCI: Continue E820 vs host bridge window saga
Posted by Hans de Goede 1 year, 4 months ago
Hi Bjorn,

On 12/3/22 18:57, Bjorn Helgaas wrote:
> On Sat, Dec 03, 2022 at 01:44:10PM +0100, Hans de Goede wrote:
>> Hi Bjorn,
>>
>> On 12/2/22 22:18, Bjorn Helgaas wrote:
>>> From: Bjorn Helgaas <bhelgaas@google.com>
>>>
>>> When allocating space for PCI BARs, Linux avoids allocating space mentioned
>>> in the E820 map.  This was originally done by 4dc2287c1805 ("x86: avoid
>>> E820 regions when allocating address space") to work around BIOS defects
>>> that included unusable space in host bridge _CRS.
>>>
>>> Some recent machines use EfiMemoryMappedIO for PCI MMCONFIG and host bridge
>>> apertures, and bootloaders and EFI stubs convert those to E820 regions,
>>> which means we can't allocate space for hot-added PCI devices (often a
>>> dock) or for devices the BIOS didn't configure (often a touchpad)
>>>
>>> The current strategy is to add DMI quirks that disable the E820 filtering
>>> on these machines and to disable it entirely starting with 2023 BIOSes:
>>>
>>>   d341838d776a ("x86/PCI: Disable E820 reserved region clipping via quirks")
>>>   0ae084d5a674 ("x86/PCI: Disable E820 reserved region clipping starting in 2023")
>>>
>>> But the quirks are problematic because it's really hard to list all the
>>> machines that need them.
>>>
>>> This series is an attempt at a more generic approach.  I'm told by firmware
>>> folks that EfiMemoryMappedIO means "the OS should map this area so EFI
>>> runtime services can use it in virtual mode," but does not prevent the OS
>>> from using it.
>>>
>>> The first patch removes any EfiMemoryMappedIO areas from the E820 map.
>>> This doesn't affect any virtual mapping of those areas (that would have to
>>> be done directly from the EFI memory map) but it means Linux can allocate
>>> space for PCI MMIO.
>>>
>>> The rest are basically cosmetic log message changes.
>>
>> Thank you for working on this. I'm a bit worried about this series though.
>>
>> The 2 things which I worry about are:
>>
>>
>> 1. I think this will not help when people boot in BIOS (CSM) mode rather
>> then UEFI mode which quite a few Linux users still do because they learned
>> to do this years ago when Linux EFI support (and EFI fw itself) was still
>> a bit in flux.
>>
>> IIRC from the last time we looked at this in CSM mode the BIOS itself
>> translates the EfiMemoryMappedIO areas to reserved E820 regions. So when
>> people use the BIOS CSM mode to boot, then this patch will not help
>> since the kernel lacks the info to do the translation.
> 
> Right, if BIOS CSM puts EfiMemoryMappedIO in the E820 map the same way
> bootloaders do, and the kernel doesn't have the EFI memory map, this
> series won't help.

So I just got the requested dmesg in BIOS CSM mode from:
https://bugzilla.redhat.com/show_bug.cgi?id=1868899

And it says:

[    0.000000] BIOS-e820: [mem 0x000000004bc50000-0x00000000cfffffff] reserved
[    0.316140] pci_bus 0000:00: root bus resource [mem 0x65400000-0xbfffffff window]

So I'm afraid that I remembered correctly and the CSM adds
the EfiMemoryMappedIO regions to the E820 map as reserved :(

So as you said, this series won't help for people booting in
BIOS compatibility mode. Which means that we should at least keep
the current list of no_e820 quirks to avoid regressing those models
when booted in BIOS compatibility mode.

And maybe still add at least the Clevo model for which I recently
submitted a new no_e820 quirk so that that will work in BIOS CSM
mode too ?

Note I know you did not propose to drop the quirks in this series,
just covering all the bases here.

Regards,

Hans
Re: [PATCH 0/4] PCI: Continue E820 vs host bridge window saga
Posted by Werner Sembach 1 year, 4 months ago
Hi

Am 04.12.22 um 10:29 schrieb Hans de Goede:
> Hi Bjorn,
> 
> On 12/3/22 18:57, Bjorn Helgaas wrote:
>> On Sat, Dec 03, 2022 at 01:44:10PM +0100, Hans de Goede wrote:
>>> Hi Bjorn,
>>>
>>> On 12/2/22 22:18, Bjorn Helgaas wrote:
>>>> From: Bjorn Helgaas <bhelgaas@google.com>
>>>>
>>>> When allocating space for PCI BARs, Linux avoids allocating space mentioned
>>>> in the E820 map.  This was originally done by 4dc2287c1805 ("x86: avoid
>>>> E820 regions when allocating address space") to work around BIOS defects
>>>> that included unusable space in host bridge _CRS.
>>>>
>>>> Some recent machines use EfiMemoryMappedIO for PCI MMCONFIG and host bridge
>>>> apertures, and bootloaders and EFI stubs convert those to E820 regions,
>>>> which means we can't allocate space for hot-added PCI devices (often a
>>>> dock) or for devices the BIOS didn't configure (often a touchpad)
>>>>
>>>> The current strategy is to add DMI quirks that disable the E820 filtering
>>>> on these machines and to disable it entirely starting with 2023 BIOSes:
>>>>
>>>>    d341838d776a ("x86/PCI: Disable E820 reserved region clipping via quirks")
>>>>    0ae084d5a674 ("x86/PCI: Disable E820 reserved region clipping starting in 2023")
>>>>
>>>> But the quirks are problematic because it's really hard to list all the
>>>> machines that need them.
>>>>
>>>> This series is an attempt at a more generic approach.  I'm told by firmware
>>>> folks that EfiMemoryMappedIO means "the OS should map this area so EFI
>>>> runtime services can use it in virtual mode," but does not prevent the OS
>>>> from using it.
>>>>
>>>> The first patch removes any EfiMemoryMappedIO areas from the E820 map.
>>>> This doesn't affect any virtual mapping of those areas (that would have to
>>>> be done directly from the EFI memory map) but it means Linux can allocate
>>>> space for PCI MMIO.
>>>>
>>>> The rest are basically cosmetic log message changes.
>>>
>>> Thank you for working on this. I'm a bit worried about this series though.
>>>
>>> The 2 things which I worry about are:
>>>
>>>
>>> 1. I think this will not help when people boot in BIOS (CSM) mode rather
>>> then UEFI mode which quite a few Linux users still do because they learned
>>> to do this years ago when Linux EFI support (and EFI fw itself) was still
>>> a bit in flux.
>>>
>>> IIRC from the last time we looked at this in CSM mode the BIOS itself
>>> translates the EfiMemoryMappedIO areas to reserved E820 regions. So when
>>> people use the BIOS CSM mode to boot, then this patch will not help
>>> since the kernel lacks the info to do the translation.
>>
>> Right, if BIOS CSM puts EfiMemoryMappedIO in the E820 map the same way
>> bootloaders do, and the kernel doesn't have the EFI memory map, this
>> series won't help.
> 
> So I just got the requested dmesg in BIOS CSM mode from:
> https://bugzilla.redhat.com/show_bug.cgi?id=1868899
> 
> And it says:
> 
> [    0.000000] BIOS-e820: [mem 0x000000004bc50000-0x00000000cfffffff] reserved
> [    0.316140] pci_bus 0000:00: root bus resource [mem 0x65400000-0xbfffffff window]
> 
> So I'm afraid that I remembered correctly and the CSM adds
> the EfiMemoryMappedIO regions to the E820 map as reserved :(
> 
> So as you said, this series won't help for people booting in
> BIOS compatibility mode. Which means that we should at least keep
> the current list of no_e820 quirks to avoid regressing those models
> when booted in BIOS compatibility mode.
> 
> And maybe still add at least the Clevo model for which I recently
> submitted a new no_e820 quirk so that that will work in BIOS CSM
> mode too ?
Do you mean the X170KM-G? I don't think it has the option to switch to Legacy 
BIOS mode (At least i didn't found an option in the bios version i have)
> 
> Note I know you did not propose to drop the quirks in this series,
> just covering all the bases here.
> 
> Regards,
> 
> Hans
> 
> 
> 
> 
Kind regards,
Werner Sembach
Re: [PATCH 0/4] PCI: Continue E820 vs host bridge window saga
Posted by Hans de Goede 1 year, 4 months ago
Hi Werner,

On 12/5/22 14:27, Werner Sembach wrote:
> Hi
> 
> Am 04.12.22 um 10:29 schrieb Hans de Goede:
>> Hi Bjorn,
>>
>> On 12/3/22 18:57, Bjorn Helgaas wrote:
>>> On Sat, Dec 03, 2022 at 01:44:10PM +0100, Hans de Goede wrote:
>>>> Hi Bjorn,
>>>>
>>>> On 12/2/22 22:18, Bjorn Helgaas wrote:
>>>>> From: Bjorn Helgaas <bhelgaas@google.com>
>>>>>
>>>>> When allocating space for PCI BARs, Linux avoids allocating space mentioned
>>>>> in the E820 map.  This was originally done by 4dc2287c1805 ("x86: avoid
>>>>> E820 regions when allocating address space") to work around BIOS defects
>>>>> that included unusable space in host bridge _CRS.
>>>>>
>>>>> Some recent machines use EfiMemoryMappedIO for PCI MMCONFIG and host bridge
>>>>> apertures, and bootloaders and EFI stubs convert those to E820 regions,
>>>>> which means we can't allocate space for hot-added PCI devices (often a
>>>>> dock) or for devices the BIOS didn't configure (often a touchpad)
>>>>>
>>>>> The current strategy is to add DMI quirks that disable the E820 filtering
>>>>> on these machines and to disable it entirely starting with 2023 BIOSes:
>>>>>
>>>>>    d341838d776a ("x86/PCI: Disable E820 reserved region clipping via quirks")
>>>>>    0ae084d5a674 ("x86/PCI: Disable E820 reserved region clipping starting in 2023")
>>>>>
>>>>> But the quirks are problematic because it's really hard to list all the
>>>>> machines that need them.
>>>>>
>>>>> This series is an attempt at a more generic approach.  I'm told by firmware
>>>>> folks that EfiMemoryMappedIO means "the OS should map this area so EFI
>>>>> runtime services can use it in virtual mode," but does not prevent the OS
>>>>> from using it.
>>>>>
>>>>> The first patch removes any EfiMemoryMappedIO areas from the E820 map.
>>>>> This doesn't affect any virtual mapping of those areas (that would have to
>>>>> be done directly from the EFI memory map) but it means Linux can allocate
>>>>> space for PCI MMIO.
>>>>>
>>>>> The rest are basically cosmetic log message changes.
>>>>
>>>> Thank you for working on this. I'm a bit worried about this series though.
>>>>
>>>> The 2 things which I worry about are:
>>>>
>>>>
>>>> 1. I think this will not help when people boot in BIOS (CSM) mode rather
>>>> then UEFI mode which quite a few Linux users still do because they learned
>>>> to do this years ago when Linux EFI support (and EFI fw itself) was still
>>>> a bit in flux.
>>>>
>>>> IIRC from the last time we looked at this in CSM mode the BIOS itself
>>>> translates the EfiMemoryMappedIO areas to reserved E820 regions. So when
>>>> people use the BIOS CSM mode to boot, then this patch will not help
>>>> since the kernel lacks the info to do the translation.
>>>
>>> Right, if BIOS CSM puts EfiMemoryMappedIO in the E820 map the same way
>>> bootloaders do, and the kernel doesn't have the EFI memory map, this
>>> series won't help.
>>
>> So I just got the requested dmesg in BIOS CSM mode from:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1868899
>>
>> And it says:
>>
>> [    0.000000] BIOS-e820: [mem 0x000000004bc50000-0x00000000cfffffff] reserved
>> [    0.316140] pci_bus 0000:00: root bus resource [mem 0x65400000-0xbfffffff window]
>>
>> So I'm afraid that I remembered correctly and the CSM adds
>> the EfiMemoryMappedIO regions to the E820 map as reserved :(
>>
>> So as you said, this series won't help for people booting in
>> BIOS compatibility mode. Which means that we should at least keep
>> the current list of no_e820 quirks to avoid regressing those models
>> when booted in BIOS compatibility mode.
>>
>> And maybe still add at least the Clevo model for which I recently
>> submitted a new no_e820 quirk so that that will work in BIOS CSM
>> mode too ?
> Do you mean the X170KM-G? I don't think it has the option to switch to Legacy BIOS mode (At least i didn't found an option in the bios version i have)

I'm talking about this patch:

https://lore.kernel.org/linux-pci/20221010150206.142615-1-hdegoede@redhat.com/

Regards,

Hans

Re: [PATCH 0/4] PCI: Continue E820 vs host bridge window saga
Posted by Hans de Goede 1 year, 4 months ago
Hi Bjorn,

On 12/3/22 18:57, Bjorn Helgaas wrote:
> On Sat, Dec 03, 2022 at 01:44:10PM +0100, Hans de Goede wrote:
>> Hi Bjorn,
>>
>> On 12/2/22 22:18, Bjorn Helgaas wrote:
>>> From: Bjorn Helgaas <bhelgaas@google.com>
>>>
>>> When allocating space for PCI BARs, Linux avoids allocating space mentioned
>>> in the E820 map.  This was originally done by 4dc2287c1805 ("x86: avoid
>>> E820 regions when allocating address space") to work around BIOS defects
>>> that included unusable space in host bridge _CRS.
>>>
>>> Some recent machines use EfiMemoryMappedIO for PCI MMCONFIG and host bridge
>>> apertures, and bootloaders and EFI stubs convert those to E820 regions,
>>> which means we can't allocate space for hot-added PCI devices (often a
>>> dock) or for devices the BIOS didn't configure (often a touchpad)
>>>
>>> The current strategy is to add DMI quirks that disable the E820 filtering
>>> on these machines and to disable it entirely starting with 2023 BIOSes:
>>>
>>>   d341838d776a ("x86/PCI: Disable E820 reserved region clipping via quirks")
>>>   0ae084d5a674 ("x86/PCI: Disable E820 reserved region clipping starting in 2023")
>>>
>>> But the quirks are problematic because it's really hard to list all the
>>> machines that need them.
>>>
>>> This series is an attempt at a more generic approach.  I'm told by firmware
>>> folks that EfiMemoryMappedIO means "the OS should map this area so EFI
>>> runtime services can use it in virtual mode," but does not prevent the OS
>>> from using it.
>>>
>>> The first patch removes any EfiMemoryMappedIO areas from the E820 map.
>>> This doesn't affect any virtual mapping of those areas (that would have to
>>> be done directly from the EFI memory map) but it means Linux can allocate
>>> space for PCI MMIO.
>>>
>>> The rest are basically cosmetic log message changes.
>>
>> Thank you for working on this. I'm a bit worried about this series though.
>>
>> The 2 things which I worry about are:
>>
>>
>> 1. I think this will not help when people boot in BIOS (CSM) mode rather
>> then UEFI mode which quite a few Linux users still do because they learned
>> to do this years ago when Linux EFI support (and EFI fw itself) was still
>> a bit in flux.
>>
>> IIRC from the last time we looked at this in CSM mode the BIOS itself
>> translates the EfiMemoryMappedIO areas to reserved E820 regions. So when
>> people use the BIOS CSM mode to boot, then this patch will not help
>> since the kernel lacks the info to do the translation.
> 
> Right, if BIOS CSM puts EfiMemoryMappedIO in the E820 map the same way
> bootloaders do, and the kernel doesn't have the EFI memory map, this
> series won't help.
> 
>> We may also hit the same case when the bootloader has done the
>> translation which I believe is what upstream grub does. Fedora grub
>> has been patched to use the kernel EFI stub when booting a kernel
>> on EFI, so just an EFI equivalent of "exec" on the kernel EFI binary.
>>
>> Where as upstream grub does a more BIOS like boot, where it skips the
>> EFI stub and instead does a whole bunch of stuff itself and then
>> jumps to the kernel's start vector. So this might also not work with
>> upstream grub, which is what I believe Ubuntu and Debian use.
>>
>> Although I case in this case we will still have access to the EFI
>> memory map and I see that your patch removes the entries stemming
>> from the EfiMemoryMappedIO areas from the E820 map, rather then
>> never adding them. So I guess this will also work in the case
>> when the bootloader has done the translation (leaving just
>> the BIOS CSM case as an issue)?
>>
>> This won't cause regressions, but it does mean that e.g. the
>> touchpad i2c-controller / hotplugged PCI devices will still not
>> work when booted in BIOS (CSM) mode / through upstream grub.
> 
> Yes, I agree CSM could still be broken if BIOS puts EfiMemoryMappedIO
> in the E820 map.
> 
> I'm not clear on the grub cases.  Do some grub versions hide the EFI
> memory map from the kernel?  If they do, they could be broken the same
> way as CSM.

I don't think grub hides the EFI memory map, I started writing
the bit about grub because I believe that grub creates its own
emulated E820 map, rather then relying on the kernel's EFI stub
creating an emulated E820 map. But since you take the emulated
map and then remove the EfiMemoryMappedIO entries afterwards
I think this should be fine.

(versus how patching the stub to never add the EfiMemoryMappedIO
entries would _not_ be fine because the emulated E820 map does not
always come from the EFI stub).

>> 2. I am afraid that now allowing PCI MMIO space to be allocated
>> in regions marked as EfiMemoryMappedIO will cause regressions
>> on some systems. Specifically when I tried something similar
>> the last time I looked at this (using the BIOS date cut-off
>> approach IIRC) there was a suspend/resume regression on
>> a Lenovo ThinkPad X1 carbon (20A7) model:
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=2029207
>>
>> Back then I came to the conclusion that the problem is that not
>> avoiding the EfiMemoryMappedIO regions caused PCI MMIO space to
>> be allocated in the 0xdfa00000 - 0xdfa10000 range which is
>> listed in the EFI memmap as:
>>
>> [    0.000000] efi: mem46: [MMIO        |RUN|  |  |  |  |  |  |  |  |   |  |  |  |  ] range=[0x00000000dfa00000-0x00000000dfa0ffff] (0MB)
>>
>> And with current kernels with the extra logging added for this
>> the following is logged related to this:
>>
>> [    0.326504] acpi PNP0A08:00: clipped [mem 0xdfa00000-0xfebfffff window] to [mem 0xdfa10000-0xfebfffff window] for e820 entry [mem 0xdceff000-0xdfa0ffff]
>>
>> I believe patch 1/4 of this set will make this clipping go away,
>> re-introducing the suspend/resume problem.
> 
> Yes, I'm afraid you're right.  Comparing the logs at comment #31
> (fails) and comment #38 (works):
> 
>   pci_bus 0000:00: root bus resource [mem 0xdfa00000-0xfebfffff window]
>   pci 0000:00:1c.0: BAR 14: assigned [mem 0xdfa00000-0xdfbfffff] fails
>   pci 0000:00:1c.0: BAR 14: assigned [mem 0xdfb00000-0xdfcfffff] works
> 
> Since 0xdfa00000 is included in the host bridge _CRS, but isn't
> usable, my guess is this is a _CRS bug.

Ack.

So I was thinking to maybe limit the removal of EfiMemoryMappedIO
regions from the E820 map if they are big enough to cause troubles?

Looking at the EFI map MMIO regions on this Lenovo ThinkPad X1 carbon
(20A7) model, they are tiny. Where as the ones which we know cause
problems are huge. So maybe add a bit of heuristics to patch 1/4 based
on the EfiMemoryMappedIO region size and only remove the big ones
from the E820 map ?

I know that adding heuristics like this always feels a bit wrong,
because you end up putting a somewhat arbitrary cut off point in
the code on which to toggle behavior on/off, but I think that in
this case it should work nicely given how huge the EfiMemoryMappedIO
regions which are actually causing problems are.

> Or maybe BIOS is using the producer/consumer bit in _CRS to identify
> this as register space as opposed to a window?  I thought we couldn't
> rely on that bit, but it's been a long time since I looked at it.  An
> acpidump might have a clue.

Regards,

Hans
Re: [PATCH 0/4] PCI: Continue E820 vs host bridge window saga
Posted by Hans de Goede 1 year, 4 months ago
Hi Bjorn,

On 12/4/22 10:13, Hans de Goede wrote:

<snip>

>>> 2. I am afraid that now allowing PCI MMIO space to be allocated
>>> in regions marked as EfiMemoryMappedIO will cause regressions
>>> on some systems. Specifically when I tried something similar
>>> the last time I looked at this (using the BIOS date cut-off
>>> approach IIRC) there was a suspend/resume regression on
>>> a Lenovo ThinkPad X1 carbon (20A7) model:
>>>
>>> https://bugzilla.redhat.com/show_bug.cgi?id=2029207
>>>
>>> Back then I came to the conclusion that the problem is that not
>>> avoiding the EfiMemoryMappedIO regions caused PCI MMIO space to
>>> be allocated in the 0xdfa00000 - 0xdfa10000 range which is
>>> listed in the EFI memmap as:
>>>
>>> [    0.000000] efi: mem46: [MMIO        |RUN|  |  |  |  |  |  |  |  |   |  |  |  |  ] range=[0x00000000dfa00000-0x00000000dfa0ffff] (0MB)
>>>
>>> And with current kernels with the extra logging added for this
>>> the following is logged related to this:
>>>
>>> [    0.326504] acpi PNP0A08:00: clipped [mem 0xdfa00000-0xfebfffff window] to [mem 0xdfa10000-0xfebfffff window] for e820 entry [mem 0xdceff000-0xdfa0ffff]
>>>
>>> I believe patch 1/4 of this set will make this clipping go away,
>>> re-introducing the suspend/resume problem.
>>
>> Yes, I'm afraid you're right.  Comparing the logs at comment #31
>> (fails) and comment #38 (works):
>>
>>   pci_bus 0000:00: root bus resource [mem 0xdfa00000-0xfebfffff window]
>>   pci 0000:00:1c.0: BAR 14: assigned [mem 0xdfa00000-0xdfbfffff] fails
>>   pci 0000:00:1c.0: BAR 14: assigned [mem 0xdfb00000-0xdfcfffff] works
>>
>> Since 0xdfa00000 is included in the host bridge _CRS, but isn't
>> usable, my guess is this is a _CRS bug.
> 
> Ack.
> 
> So I was thinking to maybe limit the removal of EfiMemoryMappedIO
> regions from the E820 map if they are big enough to cause troubles?
> 
> Looking at the EFI map MMIO regions on this Lenovo ThinkPad X1 carbon
> (20A7) model, they are tiny. Where as the ones which we know cause
> problems are huge. So maybe add a bit of heuristics to patch 1/4 based
> on the EfiMemoryMappedIO region size and only remove the big ones
> from the E820 map ?
> 
> I know that adding heuristics like this always feels a bit wrong,
> because you end up putting a somewhat arbitrary cut off point in
> the code on which to toggle behavior on/off, but I think that in
> this case it should work nicely given how huge the EfiMemoryMappedIO
> regions which are actually causing problems are.

One of the original reporters of the suspend/resume problem has gotten
back to me in:

https://bugzilla.redhat.com/show_bug.cgi?id=2029207

And they are willing to run some more tests. I could give them
a 6.0 kernel with the 4 patches from this series added to test,
but I think we both agree that it is very likely that the suspend/resume
problem will resurface, so I'm not sure how useful that would be ?

>> Or maybe BIOS is using the producer/consumer bit in _CRS to identify
>> this as register space as opposed to a window?  I thought we couldn't
>> rely on that bit, but it's been a long time since I looked at it.  An
>> acpidump might have a clue.

I have asked the reporter of;

https://bugzilla.redhat.com/show_bug.cgi?id=2029207

To grab any acpidump.

Regards,

Hans




p.s.

Looking at the efi=debug output from:

https://bugzilla-attachments.redhat.com/attachment.cgi?id=1861035

The small MMIO regions which we most honor as reserved do
have the "RUN" (runtime) flag set in the EFI mmap.

But I'm afraid that the same applies to the troublesome
MMIO EFI regions which cause the failures to assign
PCI regions for devices not setup by the firmware:

https://bugzilla-attachments.redhat.com/attachment.cgi?id=1861407

So that "RUN" flag is of no use.
Re: [PATCH 0/4] PCI: Continue E820 vs host bridge window saga
Posted by Bjorn Helgaas 1 year, 4 months ago
On Wed, Dec 07, 2022 at 04:31:12PM +0100, Hans de Goede wrote:
> On 12/4/22 10:13, Hans de Goede wrote:
> 
> <snip>
> 
> >>> 2. I am afraid that now allowing PCI MMIO space to be allocated
> >>> in regions marked as EfiMemoryMappedIO will cause regressions
> >>> on some systems. Specifically when I tried something similar
> >>> the last time I looked at this (using the BIOS date cut-off
> >>> approach IIRC) there was a suspend/resume regression on
> >>> a Lenovo ThinkPad X1 carbon (20A7) model:
> >>>
> >>> https://bugzilla.redhat.com/show_bug.cgi?id=2029207
> >>>
> >>> Back then I came to the conclusion that the problem is that not
> >>> avoiding the EfiMemoryMappedIO regions caused PCI MMIO space to
> >>> be allocated in the 0xdfa00000 - 0xdfa10000 range which is
> >>> listed in the EFI memmap as:
> >>>
> >>> [    0.000000] efi: mem46: [MMIO        |RUN|  |  |  |  |  |  |  |  |   |  |  |  |  ] range=[0x00000000dfa00000-0x00000000dfa0ffff] (0MB)
> >>>
> >>> And with current kernels with the extra logging added for this
> >>> the following is logged related to this:
> >>>
> >>> [    0.326504] acpi PNP0A08:00: clipped [mem 0xdfa00000-0xfebfffff window] to [mem 0xdfa10000-0xfebfffff window] for e820 entry [mem 0xdceff000-0xdfa0ffff]
> >>>
> >>> I believe patch 1/4 of this set will make this clipping go away,
> >>> re-introducing the suspend/resume problem.
> >>
> >> Yes, I'm afraid you're right.  Comparing the logs at comment #31
> >> (fails) and comment #38 (works):
> >>
> >>   pci_bus 0000:00: root bus resource [mem 0xdfa00000-0xfebfffff window]
> >>   pci 0000:00:1c.0: BAR 14: assigned [mem 0xdfa00000-0xdfbfffff] fails
> >>   pci 0000:00:1c.0: BAR 14: assigned [mem 0xdfb00000-0xdfcfffff] works
> >>
> >> Since 0xdfa00000 is included in the host bridge _CRS, but isn't
> >> usable, my guess is this is a _CRS bug.
> > 
> > Ack.
> > 
> > So I was thinking to maybe limit the removal of EfiMemoryMappedIO
> > regions from the E820 map if they are big enough to cause troubles?
> > 
> > Looking at the EFI map MMIO regions on this Lenovo ThinkPad X1 carbon
> > (20A7) model, they are tiny. Where as the ones which we know cause
> > problems are huge. So maybe add a bit of heuristics to patch 1/4 based
> > on the EfiMemoryMappedIO region size and only remove the big ones
> > from the E820 map ?
> > 
> > I know that adding heuristics like this always feels a bit wrong,
> > because you end up putting a somewhat arbitrary cut off point in
> > the code on which to toggle behavior on/off, but I think that in
> > this case it should work nicely given how huge the EfiMemoryMappedIO
> > regions which are actually causing problems are.

I'll post a v2 that removes only regions 256KB or larger in a minute.

> Looking at the efi=debug output from:
> 
> https://bugzilla-attachments.redhat.com/attachment.cgi?id=1861035
> 
> The small MMIO regions which we most honor as reserved do
> have the "RUN" (runtime) flag set in the EFI mmap.

Just trying to follow along here, so not sure any of the following is
relevant ...

This attachment is from
https://bugzilla.redhat.com/show_bug.cgi?id=2029207, and it shows:

  efi: mem46: [MMIO|RUN|  ] range=[0xdfa00000-0xdfa0ffff] (0MB) [64K]
  efi: mem47: [MMIO|RUN|UC] range=[0xf80f8000-0xf80f8fff] (0MB)  [4K]
  pci_bus 0000:00: root bus resource [mem 0xdfa00000-0xfebfffff window]
  pci_bus 0000:00: root bus resource [mem 0xfed40000-0xfed4bfff window]

mem46 is included in the PNP0A08 _CRS, and Ivan has verified
experimentally that we have to avoid it.

mem47 is also included in the _CRS, but I don't have a clue what it
is.  Maybe some hidden device used by BIOS but not visible to us?

> But I'm afraid that the same applies to the troublesome
> MMIO EFI regions which cause the failures to assign
> PCI regions for devices not setup by the firmware:
> 
> https://bugzilla-attachments.redhat.com/attachment.cgi?id=1861407
> 
> So that "RUN" flag is of no use.

I don't know what bug this attachment is from.

Is the point here that you considered doing the E820 removal based on
the EFI_MEMORY_RUNTIME memory *attribute* instead of the
EFI_MEMORY_MAPPED_IO memory *type*?

I don't really know the details of EFI_MEMORY_MAPPED_IO vs
EFI_MEMORY_RUNTIME, but it looks like EFI_MEMORY_RUNTIME can be
applied to things like EFI_RUNTIME_SERVICES_CODE (not MMIO space) that
should stay in E820.

Bjorn
Re: [PATCH 0/4] PCI: Continue E820 vs host bridge window saga
Posted by Hans de Goede 1 year, 4 months ago
Hi Bjorn,

On 12/8/22 19:57, Bjorn Helgaas wrote:
> On Wed, Dec 07, 2022 at 04:31:12PM +0100, Hans de Goede wrote:
>> On 12/4/22 10:13, Hans de Goede wrote:
>>
>> <snip>
>>
>>>>> 2. I am afraid that now allowing PCI MMIO space to be allocated
>>>>> in regions marked as EfiMemoryMappedIO will cause regressions
>>>>> on some systems. Specifically when I tried something similar
>>>>> the last time I looked at this (using the BIOS date cut-off
>>>>> approach IIRC) there was a suspend/resume regression on
>>>>> a Lenovo ThinkPad X1 carbon (20A7) model:
>>>>>
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=2029207
>>>>>
>>>>> Back then I came to the conclusion that the problem is that not
>>>>> avoiding the EfiMemoryMappedIO regions caused PCI MMIO space to
>>>>> be allocated in the 0xdfa00000 - 0xdfa10000 range which is
>>>>> listed in the EFI memmap as:
>>>>>
>>>>> [    0.000000] efi: mem46: [MMIO        |RUN|  |  |  |  |  |  |  |  |   |  |  |  |  ] range=[0x00000000dfa00000-0x00000000dfa0ffff] (0MB)
>>>>>
>>>>> And with current kernels with the extra logging added for this
>>>>> the following is logged related to this:
>>>>>
>>>>> [    0.326504] acpi PNP0A08:00: clipped [mem 0xdfa00000-0xfebfffff window] to [mem 0xdfa10000-0xfebfffff window] for e820 entry [mem 0xdceff000-0xdfa0ffff]
>>>>>
>>>>> I believe patch 1/4 of this set will make this clipping go away,
>>>>> re-introducing the suspend/resume problem.
>>>>
>>>> Yes, I'm afraid you're right.  Comparing the logs at comment #31
>>>> (fails) and comment #38 (works):
>>>>
>>>>   pci_bus 0000:00: root bus resource [mem 0xdfa00000-0xfebfffff window]
>>>>   pci 0000:00:1c.0: BAR 14: assigned [mem 0xdfa00000-0xdfbfffff] fails
>>>>   pci 0000:00:1c.0: BAR 14: assigned [mem 0xdfb00000-0xdfcfffff] works
>>>>
>>>> Since 0xdfa00000 is included in the host bridge _CRS, but isn't
>>>> usable, my guess is this is a _CRS bug.
>>>
>>> Ack.
>>>
>>> So I was thinking to maybe limit the removal of EfiMemoryMappedIO
>>> regions from the E820 map if they are big enough to cause troubles?
>>>
>>> Looking at the EFI map MMIO regions on this Lenovo ThinkPad X1 carbon
>>> (20A7) model, they are tiny. Where as the ones which we know cause
>>> problems are huge. So maybe add a bit of heuristics to patch 1/4 based
>>> on the EfiMemoryMappedIO region size and only remove the big ones
>>> from the E820 map ?
>>>
>>> I know that adding heuristics like this always feels a bit wrong,
>>> because you end up putting a somewhat arbitrary cut off point in
>>> the code on which to toggle behavior on/off, but I think that in
>>> this case it should work nicely given how huge the EfiMemoryMappedIO
>>> regions which are actually causing problems are.
> 
> I'll post a v2 that removes only regions 256KB or larger in a minute.

Ok, may I ask why 256KB?

I see that that rules out then troublesome MMIO regions from the X1 carbon from:
https://bugzilla.redhat.com/show_bug.cgi?id=2029207 :
efi: mem46: [MMIO|RUN|  ] range=[0xdfa00000-0xdfa0ffff] (0MB) [64K]
which we know we need to avoid / keep reserved.

But OTOH the reservations which are causing the problems with assigning
resources to PCI devices by Linux look like this:
efi: mem50: [MMIO        |RUN|  |  |  |  |  |  |  |  |   |  |  |  |UC] range=[0x0000000065400000-0x00000000cfffffff] (1708MB)
which is significantly larger then 256KB.

So we could e.g. also put the cut-off point at 16MB and still
remove the above troublesome reservation from the E820 table.
Note just thinking out loud here. I have no idea if 16MB
would be better...


> 
>> Looking at the efi=debug output from:
>>
>> https://bugzilla-attachments.redhat.com/attachment.cgi?id=1861035
>>
>> The small MMIO regions which we most honor as reserved do
>> have the "RUN" (runtime) flag set in the EFI mmap.
> 
> Just trying to follow along here, so not sure any of the following is
> relevant ...
> 
> This attachment is from
> https://bugzilla.redhat.com/show_bug.cgi?id=2029207, and it shows:
> 
>   efi: mem46: [MMIO|RUN|  ] range=[0xdfa00000-0xdfa0ffff] (0MB) [64K]
>   efi: mem47: [MMIO|RUN|UC] range=[0xf80f8000-0xf80f8fff] (0MB)  [4K]
>   pci_bus 0000:00: root bus resource [mem 0xdfa00000-0xfebfffff window]
>   pci_bus 0000:00: root bus resource [mem 0xfed40000-0xfed4bfff window]
> 
> mem46 is included in the PNP0A08 _CRS, and Ivan has verified
> experimentally that we have to avoid it.

Ack.

> mem47 is also included in the _CRS, but I don't have a clue what it
> is.  Maybe some hidden device used by BIOS but not visible to us?

Could be, there is at least one hidden device called the P2SB on
most Intel systems.

>> But I'm afraid that the same applies to the troublesome
>> MMIO EFI regions which cause the failures to assign
>> PCI regions for devices not setup by the firmware:
>>
>> https://bugzilla-attachments.redhat.com/attachment.cgi?id=1861407
>>
>> So that "RUN" flag is of no use.
> 
> I don't know what bug this attachment is from.

It is from https://bugzilla.redhat.com/show_bug.cgi?id=1868899
which is the ideapad slim 3 with the touchpad issue caused by the:
efi: mem50: [MMIO        |RUN|  |  |  |  |  |  |  |  |   |  |  |  |UC] range=[0x0000000065400000-0x00000000cfffffff] (1708MB)
reservation getting in the way of assigning resources to
the i2c-controller.

> Is the point here that you considered doing the E820 removal based on
> the EFI_MEMORY_RUNTIME memory *attribute* instead of the
> EFI_MEMORY_MAPPED_IO memory *type*?
> 
> I don't really know the details of EFI_MEMORY_MAPPED_IO vs
> EFI_MEMORY_RUNTIME, but it looks like EFI_MEMORY_RUNTIME can be
> applied to things like EFI_RUNTIME_SERVICES_CODE (not MMIO space) that
> should stay in E820.

Sorry for the confusion. What I was trying to say is that I was interested
in seeing if we could use the "RUN" flag to differentiate between:

1. The big MMIO region which we want to remove from the e820 map:
   efi: mem50: [MMIO        |RUN|  |  |  |  |  |  |  |  |   |  |  |  |UC] range=[0x0000000065400000-0x00000000cfffffff] (1708MB)

2. The small MMIO region which we want to keep to avoid the reported suspend/resume issue:
   efi: mem46: [MMIO|RUN|  ] range=[0xdfa00000-0xdfa0ffff] (0MB) [64K]

But unfortunately both have the RUN flag set so the RUN flag is
of no use to us.

Regards,

Hans
Re: [PATCH 0/4] PCI: Continue E820 vs host bridge window saga
Posted by Bjorn Helgaas 1 year, 4 months ago
On Thu, Dec 08, 2022 at 08:16:31PM +0100, Hans de Goede wrote:
> Hi Bjorn,
> 
> On 12/8/22 19:57, Bjorn Helgaas wrote:
> > On Wed, Dec 07, 2022 at 04:31:12PM +0100, Hans de Goede wrote:
> >> On 12/4/22 10:13, Hans de Goede wrote:
> >>
> >> <snip>
> >>
> >>>>> 2. I am afraid that now allowing PCI MMIO space to be allocated
> >>>>> in regions marked as EfiMemoryMappedIO will cause regressions
> >>>>> on some systems. Specifically when I tried something similar
> >>>>> the last time I looked at this (using the BIOS date cut-off
> >>>>> approach IIRC) there was a suspend/resume regression on
> >>>>> a Lenovo ThinkPad X1 carbon (20A7) model:
> >>>>>
> >>>>> https://bugzilla.redhat.com/show_bug.cgi?id=2029207
> >>>>>
> >>>>> Back then I came to the conclusion that the problem is that not
> >>>>> avoiding the EfiMemoryMappedIO regions caused PCI MMIO space to
> >>>>> be allocated in the 0xdfa00000 - 0xdfa10000 range which is
> >>>>> listed in the EFI memmap as:
> >>>>>
> >>>>> [    0.000000] efi: mem46: [MMIO        |RUN|  |  |  |  |  |  |  |  |   |  |  |  |  ] range=[0x00000000dfa00000-0x00000000dfa0ffff] (0MB)
> >>>>>
> >>>>> And with current kernels with the extra logging added for this
> >>>>> the following is logged related to this:
> >>>>>
> >>>>> [    0.326504] acpi PNP0A08:00: clipped [mem 0xdfa00000-0xfebfffff window] to [mem 0xdfa10000-0xfebfffff window] for e820 entry [mem 0xdceff000-0xdfa0ffff]
> >>>>>
> >>>>> I believe patch 1/4 of this set will make this clipping go away,
> >>>>> re-introducing the suspend/resume problem.
> >>>>
> >>>> Yes, I'm afraid you're right.  Comparing the logs at comment #31
> >>>> (fails) and comment #38 (works):
> >>>>
> >>>>   pci_bus 0000:00: root bus resource [mem 0xdfa00000-0xfebfffff window]
> >>>>   pci 0000:00:1c.0: BAR 14: assigned [mem 0xdfa00000-0xdfbfffff] fails
> >>>>   pci 0000:00:1c.0: BAR 14: assigned [mem 0xdfb00000-0xdfcfffff] works
> >>>>
> >>>> Since 0xdfa00000 is included in the host bridge _CRS, but isn't
> >>>> usable, my guess is this is a _CRS bug.
> >>>
> >>> Ack.
> >>>
> >>> So I was thinking to maybe limit the removal of EfiMemoryMappedIO
> >>> regions from the E820 map if they are big enough to cause troubles?
> >>>
> >>> Looking at the EFI map MMIO regions on this Lenovo ThinkPad X1 carbon
> >>> (20A7) model, they are tiny. Where as the ones which we know cause
> >>> problems are huge. So maybe add a bit of heuristics to patch 1/4 based
> >>> on the EfiMemoryMappedIO region size and only remove the big ones
> >>> from the E820 map ?
> >>>
> >>> I know that adding heuristics like this always feels a bit wrong,
> >>> because you end up putting a somewhat arbitrary cut off point in
> >>> the code on which to toggle behavior on/off, but I think that in
> >>> this case it should work nicely given how huge the EfiMemoryMappedIO
> >>> regions which are actually causing problems are.
> > 
> > I'll post a v2 that removes only regions 256KB or larger in a minute.
> 
> Ok, may I ask why 256KB?
> 
> I see that that rules out then troublesome MMIO regions from the X1 carbon from:
> https://bugzilla.redhat.com/show_bug.cgi?id=2029207 :
> efi: mem46: [MMIO|RUN|  ] range=[0xdfa00000-0xdfa0ffff] (0MB) [64K]
> which we know we need to avoid / keep reserved.
> 
> But OTOH the reservations which are causing the problems with assigning
> resources to PCI devices by Linux look like this:
> efi: mem50: [MMIO        |RUN|  |  |  |  |  |  |  |  |   |  |  |  |UC] range=[0x0000000065400000-0x00000000cfffffff] (1708MB)
> which is significantly larger then 256KB.
> 
> So we could e.g. also put the cut-off point at 16MB and still
> remove the above troublesome reservation from the E820 table.
> Note just thinking out loud here. I have no idea if 16MB
> would be better...

No good reason for 256KB.  We know it needs to be at least 64KB for
the X1 Carbon.  I picked 4x bigger just for headroom, since I assume
the 64KB is platform-specific host bridge registers or something.  Do
you think a bigger number would be better, i.e., we would retain more
MMIO things in E820?

ECAM areas would be 1MB per bus, so between 1MB and 256MB.  Those areas
*should* be reserved by PNP0C02 _CRS, but IIRC the early MMCONFIG code
checks E820, and the late code checks for _CRS.  I guess one could
argue that ignoring those, e.g., by retaining anything 256MB or
smaller in E820, would reduce the amount of change.  

But if the host bridge _CRS includes 256MB of legitimate window that
EFI says is MMIO and is hence included in E820, that seems like kind
of a lot of usable window space to give up.

> ...
> Sorry for the confusion. What I was trying to say is that I was interested
> in seeing if we could use the "RUN" flag to differentiate between:
> 
> 1. The big MMIO region which we want to remove from the e820 map:
>    efi: mem50: [MMIO        |RUN|  |  |  |  |  |  |  |  |   |  |  |  |UC] range=[0x0000000065400000-0x00000000cfffffff] (1708MB)
> 
> 2. The small MMIO region which we want to keep to avoid the reported suspend/resume issue:
>    efi: mem46: [MMIO|RUN|  ] range=[0xdfa00000-0xdfa0ffff] (0MB) [64K]
> 
> But unfortunately both have the RUN flag set so the RUN flag is
> of no use to us.

Right, makes sense.

Bjorn
Re: [PATCH 0/4] PCI: Continue E820 vs host bridge window saga
Posted by Hans de Goede 1 year, 4 months ago
Hi,

On 12/8/22 21:06, Bjorn Helgaas wrote:
> On Thu, Dec 08, 2022 at 08:16:31PM +0100, Hans de Goede wrote:
>> Hi Bjorn,
>>
>> On 12/8/22 19:57, Bjorn Helgaas wrote:
>>> On Wed, Dec 07, 2022 at 04:31:12PM +0100, Hans de Goede wrote:
>>>> On 12/4/22 10:13, Hans de Goede wrote:
>>>>
>>>> <snip>
>>>>
>>>>>>> 2. I am afraid that now allowing PCI MMIO space to be allocated
>>>>>>> in regions marked as EfiMemoryMappedIO will cause regressions
>>>>>>> on some systems. Specifically when I tried something similar
>>>>>>> the last time I looked at this (using the BIOS date cut-off
>>>>>>> approach IIRC) there was a suspend/resume regression on
>>>>>>> a Lenovo ThinkPad X1 carbon (20A7) model:
>>>>>>>
>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=2029207
>>>>>>>
>>>>>>> Back then I came to the conclusion that the problem is that not
>>>>>>> avoiding the EfiMemoryMappedIO regions caused PCI MMIO space to
>>>>>>> be allocated in the 0xdfa00000 - 0xdfa10000 range which is
>>>>>>> listed in the EFI memmap as:
>>>>>>>
>>>>>>> [    0.000000] efi: mem46: [MMIO        |RUN|  |  |  |  |  |  |  |  |   |  |  |  |  ] range=[0x00000000dfa00000-0x00000000dfa0ffff] (0MB)
>>>>>>>
>>>>>>> And with current kernels with the extra logging added for this
>>>>>>> the following is logged related to this:
>>>>>>>
>>>>>>> [    0.326504] acpi PNP0A08:00: clipped [mem 0xdfa00000-0xfebfffff window] to [mem 0xdfa10000-0xfebfffff window] for e820 entry [mem 0xdceff000-0xdfa0ffff]
>>>>>>>
>>>>>>> I believe patch 1/4 of this set will make this clipping go away,
>>>>>>> re-introducing the suspend/resume problem.
>>>>>>
>>>>>> Yes, I'm afraid you're right.  Comparing the logs at comment #31
>>>>>> (fails) and comment #38 (works):
>>>>>>
>>>>>>   pci_bus 0000:00: root bus resource [mem 0xdfa00000-0xfebfffff window]
>>>>>>   pci 0000:00:1c.0: BAR 14: assigned [mem 0xdfa00000-0xdfbfffff] fails
>>>>>>   pci 0000:00:1c.0: BAR 14: assigned [mem 0xdfb00000-0xdfcfffff] works
>>>>>>
>>>>>> Since 0xdfa00000 is included in the host bridge _CRS, but isn't
>>>>>> usable, my guess is this is a _CRS bug.
>>>>>
>>>>> Ack.
>>>>>
>>>>> So I was thinking to maybe limit the removal of EfiMemoryMappedIO
>>>>> regions from the E820 map if they are big enough to cause troubles?
>>>>>
>>>>> Looking at the EFI map MMIO regions on this Lenovo ThinkPad X1 carbon
>>>>> (20A7) model, they are tiny. Where as the ones which we know cause
>>>>> problems are huge. So maybe add a bit of heuristics to patch 1/4 based
>>>>> on the EfiMemoryMappedIO region size and only remove the big ones
>>>>> from the E820 map ?
>>>>>
>>>>> I know that adding heuristics like this always feels a bit wrong,
>>>>> because you end up putting a somewhat arbitrary cut off point in
>>>>> the code on which to toggle behavior on/off, but I think that in
>>>>> this case it should work nicely given how huge the EfiMemoryMappedIO
>>>>> regions which are actually causing problems are.
>>>
>>> I'll post a v2 that removes only regions 256KB or larger in a minute.
>>
>> Ok, may I ask why 256KB?
>>
>> I see that that rules out then troublesome MMIO regions from the X1 carbon from:
>> https://bugzilla.redhat.com/show_bug.cgi?id=2029207 :
>> efi: mem46: [MMIO|RUN|  ] range=[0xdfa00000-0xdfa0ffff] (0MB) [64K]
>> which we know we need to avoid / keep reserved.
>>
>> But OTOH the reservations which are causing the problems with assigning
>> resources to PCI devices by Linux look like this:
>> efi: mem50: [MMIO        |RUN|  |  |  |  |  |  |  |  |   |  |  |  |UC] range=[0x0000000065400000-0x00000000cfffffff] (1708MB)
>> which is significantly larger then 256KB.
>>
>> So we could e.g. also put the cut-off point at 16MB and still
>> remove the above troublesome reservation from the E820 table.
>> Note just thinking out loud here. I have no idea if 16MB
>> would be better...
> 
> No good reason for 256KB.  We know it needs to be at least 64KB for
> the X1 Carbon.  I picked 4x bigger just for headroom, since I assume
> the 64KB is platform-specific host bridge registers or something.  Do
> you think a bigger number would be better, i.e., we would retain more
> MMIO things in E820?
> 
> ECAM areas would be 1MB per bus, so between 1MB and 256MB.  Those areas
> *should* be reserved by PNP0C02 _CRS, but IIRC the early MMCONFIG code
> checks E820, and the late code checks for _CRS.  I guess one could
> argue that ignoring those, e.g., by retaining anything 256MB or
> smaller in E820, would reduce the amount of change.  

Right, reducing how much we change what the E820 map looks like after
this would be the main reason to make the cut of point bigger then
256KB.

> But if the host bridge _CRS includes 256MB of legitimate window that
> EFI says is MMIO and is hence included in E820, that seems like kind
> of a lot of usable window space to give up.

Ack, I guess we can just go with 256KB for now and then see how things go.

Regards,

Hans