drivers/pci/pci-acpi.c | 71 ++++++++++++++++++++++-------------------- 1 file changed, 37 insertions(+), 34 deletions(-)
program_hpx_type2() is today unconditionally called, despite the fact
that when the _HPX was added to the ACPI spec. v3.0, the description
stated:
OSPM [1] will only evaluate _HPX with Setting Record – Type 2 if OSPM
is not controlling the PCI Express Advanced Error Reporting
capability.
Hence, we only call program_hpx_type2() when the OSPM owns the PCIe
hotplug capability but not the AER.
The Advanced Configuration and Power Interface (ACPI) Specification
version 6.6 has a provision that gives the OSPM the ability to control
the other PCIe Device Control bits any way. In a note in section
6.2.9, it is stated:
"OSPM may override the settings provided by the _HPX object's Type2
record (PCI Express Settings) or Type3 record (PCI Express Descriptor
Settings) when OSPM has assumed native control of the corresponding
feature."
So, in order to preserve the non-AER bits in PCIe Device Control, in
particular the performance sensitive ExtTag and RO, we make sure
program_hpx_type2() if called, doesn't modify any non-AER bits.
Also, when program_hpx_type2() is called, and any bits in the PCIe
Link Control register is set, we log a warning.
[1] Operating System-directed configuration and Power Management
Fixes: 40abb96c51bb ("[PATCH] pciehp: Fix programming hotplug parameters")
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
---
drivers/pci/pci-acpi.c | 71 ++++++++++++++++++++++--------------------
1 file changed, 37 insertions(+), 34 deletions(-)
diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c
index 9369377725fa0..f36b51f6721a6 100644
--- a/drivers/pci/pci-acpi.c
+++ b/drivers/pci/pci-acpi.c
@@ -271,21 +271,6 @@ static acpi_status decode_type1_hpx_record(union acpi_object *record,
return AE_OK;
}
-static bool pcie_root_rcb_set(struct pci_dev *dev)
-{
- struct pci_dev *rp = pcie_find_root_port(dev);
- u16 lnkctl;
-
- if (!rp)
- return false;
-
- pcie_capability_read_word(rp, PCI_EXP_LNKCTL, &lnkctl);
- if (lnkctl & PCI_EXP_LNKCTL_RCB)
- return true;
-
- return false;
-}
-
/* _HPX PCI Express Setting Record (Type 2) */
struct hpx_type2 {
u32 revision;
@@ -311,6 +296,7 @@ static void program_hpx_type2(struct pci_dev *dev, struct hpx_type2 *hpx)
{
int pos;
u32 reg32;
+ const struct pci_host_bridge *host;
if (!hpx)
return;
@@ -318,40 +304,57 @@ static void program_hpx_type2(struct pci_dev *dev, struct hpx_type2 *hpx)
if (!pci_is_pcie(dev))
return;
+ host = pci_find_host_bridge(dev->bus);
+
+ /* We only do the HP programming if we own the PCIe native
+ * hotplug and not the AER ownership
+ */
+ if (!host->native_pcie_hotplug || host->native_aer)
+ return;
+
if (hpx->revision > 1) {
pci_warn(dev, "PCIe settings rev %d not supported\n",
hpx->revision);
return;
}
- /*
- * Don't allow _HPX to change MPS or MRRS settings. We manage
- * those to make sure they're consistent with the rest of the
+ /* We only allow _HPX to program the AER registers, namely
+ * PCI_EXP_DEVCTL_CERE, PCI_EXP_DEVCTL_NFERE,
+ * PCI_EXP_DEVCTL_FERE, and PCI_EXP_DEVCTL_URRE.
+ *
+ * The other settings in PCIe DEVCTL are managed by OS in
+ * order to make sure they're consistent with the rest of the
* platform.
*/
- hpx->pci_exp_devctl_and |= PCI_EXP_DEVCTL_PAYLOAD |
- PCI_EXP_DEVCTL_READRQ;
- hpx->pci_exp_devctl_or &= ~(PCI_EXP_DEVCTL_PAYLOAD |
- PCI_EXP_DEVCTL_READRQ);
+ hpx->pci_exp_devctl_and |= PCI_EXP_DEVCTL_RELAX_EN |
+ PCI_EXP_DEVCTL_PAYLOAD |
+ PCI_EXP_DEVCTL_EXT_TAG |
+ PCI_EXP_DEVCTL_PHANTOM |
+ PCI_EXP_DEVCTL_AUX_PME |
+ PCI_EXP_DEVCTL_NOSNOOP_EN |
+ PCI_EXP_DEVCTL_READRQ |
+ PCI_EXP_DEVCTL_BCR_FLR;
+ hpx->pci_exp_devctl_or &= ~(PCI_EXP_DEVCTL_RELAX_EN |
+ PCI_EXP_DEVCTL_PAYLOAD |
+ PCI_EXP_DEVCTL_EXT_TAG |
+ PCI_EXP_DEVCTL_PHANTOM |
+ PCI_EXP_DEVCTL_AUX_PME |
+ PCI_EXP_DEVCTL_NOSNOOP_EN |
+ PCI_EXP_DEVCTL_READRQ |
+ PCI_EXP_DEVCTL_BCR_FLR);
/* Initialize Device Control Register */
pcie_capability_clear_and_set_word(dev, PCI_EXP_DEVCTL,
~hpx->pci_exp_devctl_and, hpx->pci_exp_devctl_or);
- /* Initialize Link Control Register */
+ /* Log the Link Control Register if any bits are set */
if (pcie_cap_has_lnkctl(dev)) {
+ u16 lnkctl;
- /*
- * If the Root Port supports Read Completion Boundary of
- * 128, set RCB to 128. Otherwise, clear it.
- */
- hpx->pci_exp_lnkctl_and |= PCI_EXP_LNKCTL_RCB;
- hpx->pci_exp_lnkctl_or &= ~PCI_EXP_LNKCTL_RCB;
- if (pcie_root_rcb_set(dev))
- hpx->pci_exp_lnkctl_or |= PCI_EXP_LNKCTL_RCB;
-
- pcie_capability_clear_and_set_word(dev, PCI_EXP_LNKCTL,
- ~hpx->pci_exp_lnkctl_and, hpx->pci_exp_lnkctl_or);
+ pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnkctl);
+ if (lnkctl)
+ pci_warn(dev, "Some bits in PCIe Link Control are set: 0x%04x\n",
+ lnkctl);
}
/* Find Advanced Error Reporting Enhanced Capability */
--
2.43.5
On Tue, Jan 13, 2026 at 06:15:20PM +0100, Håkon Bugge wrote:
> program_hpx_type2() is today unconditionally called, despite the fact
> that when the _HPX was added to the ACPI spec. v3.0, the description
> stated:
> + /* We only do the HP programming if we own the PCIe native
> + * hotplug and not the AER ownership
> + */
Conventional multi-line comments are:
/*
* We ...
*/
> + if (!host->native_pcie_hotplug || host->native_aer)
> + return;
> +
> if (hpx->revision > 1) {
> pci_warn(dev, "PCIe settings rev %d not supported\n",
> hpx->revision);
> return;
> }
>
> - /*
> - * Don't allow _HPX to change MPS or MRRS settings. We manage
> - * those to make sure they're consistent with the rest of the
> + /* We only allow _HPX to program the AER registers, namely
> + * PCI_EXP_DEVCTL_CERE, PCI_EXP_DEVCTL_NFERE,
> + * PCI_EXP_DEVCTL_FERE, and PCI_EXP_DEVCTL_URRE.
> + *
> + * The other settings in PCIe DEVCTL are managed by OS in
> + * order to make sure they're consistent with the rest of the
> * platform.
> */
> - hpx->pci_exp_devctl_and |= PCI_EXP_DEVCTL_PAYLOAD |
> - PCI_EXP_DEVCTL_READRQ;
> - hpx->pci_exp_devctl_or &= ~(PCI_EXP_DEVCTL_PAYLOAD |
> - PCI_EXP_DEVCTL_READRQ);
> + hpx->pci_exp_devctl_and |= PCI_EXP_DEVCTL_RELAX_EN |
> + PCI_EXP_DEVCTL_PAYLOAD |
> + PCI_EXP_DEVCTL_EXT_TAG |
> + PCI_EXP_DEVCTL_PHANTOM |
> + PCI_EXP_DEVCTL_AUX_PME |
> + PCI_EXP_DEVCTL_NOSNOOP_EN |
> + PCI_EXP_DEVCTL_READRQ |
> + PCI_EXP_DEVCTL_BCR_FLR;
> + hpx->pci_exp_devctl_or &= ~(PCI_EXP_DEVCTL_RELAX_EN |
> + PCI_EXP_DEVCTL_PAYLOAD |
> + PCI_EXP_DEVCTL_EXT_TAG |
> + PCI_EXP_DEVCTL_PHANTOM |
> + PCI_EXP_DEVCTL_AUX_PME |
> + PCI_EXP_DEVCTL_NOSNOOP_EN |
> + PCI_EXP_DEVCTL_READRQ |
> + PCI_EXP_DEVCTL_BCR_FLR);
Instead of listing the bits we *don't* want to touch, I think we
should explicitly *include* CERE, NFERE, FERE, URRE. Maybe we should
move the PCI_EXP_AER_FLAGS #define to drivers/pci/pci.h so we could
use it directly, e.g.,
hpx->pci_exp_devctl_and |= ~PCI_EXP_AER_FLAGS;
hpx->pci_exp_devctl_or &= PCI_EXP_AER_FLAGS;
> /* Initialize Device Control Register */
> pcie_capability_clear_and_set_word(dev, PCI_EXP_DEVCTL,
> ~hpx->pci_exp_devctl_and, hpx->pci_exp_devctl_or);
>
> - /* Initialize Link Control Register */
> + /* Log the Link Control Register if any bits are set */
> if (pcie_cap_has_lnkctl(dev)) {
> + u16 lnkctl;
>
> - /*
> - * If the Root Port supports Read Completion Boundary of
> - * 128, set RCB to 128. Otherwise, clear it.
> - */
> - hpx->pci_exp_lnkctl_and |= PCI_EXP_LNKCTL_RCB;
> - hpx->pci_exp_lnkctl_or &= ~PCI_EXP_LNKCTL_RCB;
> - if (pcie_root_rcb_set(dev))
> - hpx->pci_exp_lnkctl_or |= PCI_EXP_LNKCTL_RCB;
> -
> - pcie_capability_clear_and_set_word(dev, PCI_EXP_LNKCTL,
> - ~hpx->pci_exp_lnkctl_and, hpx->pci_exp_lnkctl_or);
> + pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnkctl);
> + if (lnkctl)
> + pci_warn(dev, "Some bits in PCIe Link Control are set: 0x%04x\n",
> + lnkctl);
Sorry, I wasn't clear about this. I meant that we could log the
LNKCTL AND/OR values from _HPX, not the values from PCI_EXP_LNKCTL
itself. There will definitely be bits set in PCI_EXP_LNKCTL in normal
operation, which is perfectly fine.
But if pci_exp_lnkctl_and or pci_exp_lnkctl_or are non-zero, the
platform is telling us to do something, and we're ignoring it.
*That's* what I think we might want to know about. pci_info() is
probably sufficient; the user doesn't need to *do* anything with it, I
just want it in case we need to debug an issue.
Bjorn
Thanks for the review, Bjørn!
>> On Tue, Jan 13, 2026 at 06:15:20PM +0100, Håkon Bugge wrote:
>> program_hpx_type2() is today unconditionally called, despite the fact
>> that when the _HPX was added to the ACPI spec. v3.0, the description
>> stated:
>>
>> + /* We only do the HP programming if we own the PCIe native
>> + * hotplug and not the AER ownership
>> + */
>>
>> Conventional multi-line comments are:
>
> /*
> * We ...
> */
The net convention lurked in ;-)
>> + if (!host->native_pcie_hotplug || host->native_aer)
>> + return;
>> +
>> if (hpx->revision > 1) {
>> pci_warn(dev, "PCIe settings rev %d not supported\n",
>> hpx->revision);
>> return;
>> }
>>
>> - /*
>> - * Don't allow _HPX to change MPS or MRRS settings. We manage
>> - * those to make sure they're consistent with the rest of the
>> + /* We only allow _HPX to program the AER registers, namely
>> + * PCI_EXP_DEVCTL_CERE, PCI_EXP_DEVCTL_NFERE,
>> + * PCI_EXP_DEVCTL_FERE, and PCI_EXP_DEVCTL_URRE.
>> + *
>> + * The other settings in PCIe DEVCTL are managed by OS in
>> + * order to make sure they're consistent with the rest of the
>> * platform.
>> */
>> - hpx->pci_exp_devctl_and |= PCI_EXP_DEVCTL_PAYLOAD |
>> - PCI_EXP_DEVCTL_READRQ;
>> - hpx->pci_exp_devctl_or &= ~(PCI_EXP_DEVCTL_PAYLOAD |
>> - PCI_EXP_DEVCTL_READRQ);
>> + hpx->pci_exp_devctl_and |= PCI_EXP_DEVCTL_RELAX_EN |
>> + PCI_EXP_DEVCTL_PAYLOAD |
>> + PCI_EXP_DEVCTL_EXT_TAG |
>> + PCI_EXP_DEVCTL_PHANTOM |
>> + PCI_EXP_DEVCTL_AUX_PME |
>> + PCI_EXP_DEVCTL_NOSNOOP_EN |
>> + PCI_EXP_DEVCTL_READRQ |
>> + PCI_EXP_DEVCTL_BCR_FLR;
>> + hpx->pci_exp_devctl_or &= ~(PCI_EXP_DEVCTL_RELAX_EN |
>> + PCI_EXP_DEVCTL_PAYLOAD |
>> + PCI_EXP_DEVCTL_EXT_TAG |
>> + PCI_EXP_DEVCTL_PHANTOM |
>> + PCI_EXP_DEVCTL_AUX_PME |
>> + PCI_EXP_DEVCTL_NOSNOOP_EN |
>> + PCI_EXP_DEVCTL_READRQ |
>> + PCI_EXP_DEVCTL_BCR_FLR);
>>
> Instead of listing the bits we *don't* want to touch, I think we
> should explicitly *include* CERE, NFERE, FERE, URRE. Maybe we should
> move the PCI_EXP_AER_FLAGS #define to drivers/pci/pci.h so we could
> use it directly, e.g.,
>
> hpx->pci_exp_devctl_and |= ~PCI_EXP_AER_FLAGS;
> hpx->pci_exp_devctl_or &= PCI_EXP_AER_FLAGS;
Good idea. But what about moving it to include/uapi/linux/pci_regs.h and also rename it from PCI_EXP_AER_FLAGS to PCI_EXP_DEVCTL_AER, to match the convention for DEVCTL in pci_regs.h?
>> /* Initialize Device Control Register */
>> pcie_capability_clear_and_set_word(dev, PCI_EXP_DEVCTL,
>> ~hpx->pci_exp_devctl_and, hpx->pci_exp_devctl_or);
>>
>> - /* Initialize Link Control Register */
>> + /* Log the Link Control Register if any bits are set */
>> if (pcie_cap_has_lnkctl(dev)) {
>> + u16 lnkctl;
>>
>> - /*
>> - * If the Root Port supports Read Completion Boundary of
>> - * 128, set RCB to 128. Otherwise, clear it.
>> - */
>> - hpx->pci_exp_lnkctl_and |= PCI_EXP_LNKCTL_RCB;
>> - hpx->pci_exp_lnkctl_or &= ~PCI_EXP_LNKCTL_RCB;
>> - if (pcie_root_rcb_set(dev))
>> - hpx->pci_exp_lnkctl_or |= PCI_EXP_LNKCTL_RCB;
>> -
>> - pcie_capability_clear_and_set_word(dev, PCI_EXP_LNKCTL,
>> - ~hpx->pci_exp_lnkctl_and, hpx->pci_exp_lnkctl_or);
>> + pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnkctl);
>> + if (lnkctl)
>> + pci_warn(dev, "Some bits in PCIe Link Control are set: 0x%04x\n",
>> + lnkctl);
>>
> Sorry, I wasn't clear about this. I meant that we could log the
> LNKCTL AND/OR values from _HPX, not the values from PCI_EXP_LNKCTL
> itself. There will definitely be bits set in PCI_EXP_LNKCTL in normal
> operation, which is perfectly fine.
>
> But if pci_exp_lnkctl_and or pci_exp_lnkctl_or are non-zero, the
> platform is telling us to do something, and we're ignoring it.
> *That's* what I think we might want to know about. pci_info() is
> probably sufficient; the user doesn't need to *do* anything with it, I
> just want it in case we need to debug an issue.
My bad, Yes, that makes more sense to me. And, you're OK with removing the RCB tweaking as well?
Thxs, Håkon
On Thu, Jan 15, 2026 at 03:39:21PM +0000, Haakon Bugge wrote:
> Thanks for the review, Bjørn!
> ...
> >> + hpx->pci_exp_devctl_or &= ~(PCI_EXP_DEVCTL_RELAX_EN |
> >> + PCI_EXP_DEVCTL_PAYLOAD |
> >> + PCI_EXP_DEVCTL_EXT_TAG |
> >> + PCI_EXP_DEVCTL_PHANTOM |
> >> + PCI_EXP_DEVCTL_AUX_PME |
> >> + PCI_EXP_DEVCTL_NOSNOOP_EN |
> >> + PCI_EXP_DEVCTL_READRQ |
> >> + PCI_EXP_DEVCTL_BCR_FLR);
> >>
> > Instead of listing the bits we *don't* want to touch, I think we
> > should explicitly *include* CERE, NFERE, FERE, URRE. Maybe we should
> > move the PCI_EXP_AER_FLAGS #define to drivers/pci/pci.h so we could
> > use it directly, e.g.,
> >
> > hpx->pci_exp_devctl_and |= ~PCI_EXP_AER_FLAGS;
> > hpx->pci_exp_devctl_or &= PCI_EXP_AER_FLAGS;
>
> Good idea. But what about moving it to include/uapi/linux/pci_regs.h
> and also rename it from PCI_EXP_AER_FLAGS to PCI_EXP_DEVCTL_AER, to
> match the convention for DEVCTL in pci_regs.h?
I suggested drivers/pci/pci.h because (so far) the only need for
PCI_EXP_AER_FLAGS is in drivers/pci, and that set of flags seems like
an OS policy. Most of pci_regs.h is basically translating the PCI
spec into #defines, without any real usage or policy parts. I'm not
sure whether PCI_EXP_AER_FLAGS would be useful to userspace.
> >> if (pcie_cap_has_lnkctl(dev)) {
> >> + u16 lnkctl;
> >>
> >> - /*
> >> - * If the Root Port supports Read Completion Boundary of
> >> - * 128, set RCB to 128. Otherwise, clear it.
> >> - */
> >> - hpx->pci_exp_lnkctl_and |= PCI_EXP_LNKCTL_RCB;
> >> - hpx->pci_exp_lnkctl_or &= ~PCI_EXP_LNKCTL_RCB;
> >> - if (pcie_root_rcb_set(dev))
> >> - hpx->pci_exp_lnkctl_or |= PCI_EXP_LNKCTL_RCB;
> >> -
> >> - pcie_capability_clear_and_set_word(dev, PCI_EXP_LNKCTL,
> >> - ~hpx->pci_exp_lnkctl_and, hpx->pci_exp_lnkctl_or);
> >> + pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnkctl);
> >> + if (lnkctl)
> >> + pci_warn(dev, "Some bits in PCIe Link Control are set: 0x%04x\n",
> >> + lnkctl);
> >>
> > Sorry, I wasn't clear about this. I meant that we could log the
> > LNKCTL AND/OR values from _HPX, not the values from PCI_EXP_LNKCTL
> > itself. There will definitely be bits set in PCI_EXP_LNKCTL in normal
> > operation, which is perfectly fine.
> >
> > But if pci_exp_lnkctl_and or pci_exp_lnkctl_or are non-zero, the
> > platform is telling us to do something, and we're ignoring it.
> > *That's* what I think we might want to know about. pci_info() is
> > probably sufficient; the user doesn't need to *do* anything with it, I
> > just want it in case we need to debug an issue.
>
> My bad, Yes, that makes more sense to me. And, you're OK with
> removing the RCB tweaking as well?
Good question. My hope is that the code here is just to make sure
that we don't *clear* PCI_EXP_LNKCTL_RCB when we want it set but a
type 2 record might clear it by mistake.
We should audit PCI_EXP_LNKCTL_RCB usage to be sure that if we remove
this code, PCI_EXP_LNKCTL_RCB will still be set whenever it needs to
be set. If we rely on the existence of an _HPX type 2 record for it
to be set, that would be completely wrong.
Bjorn
> On Thu, Jan 15, 2026 at 03:39:21PM +0000, Haakon Bugge wrote:
>> Thanks for the review, Bjørn!
>> ...
>
>>>> + hpx->pci_exp_devctl_or &= ~(PCI_EXP_DEVCTL_RELAX_EN |
>>>> + PCI_EXP_DEVCTL_PAYLOAD |
>>>> + PCI_EXP_DEVCTL_EXT_TAG |
>>>> + PCI_EXP_DEVCTL_PHANTOM |
>>>> + PCI_EXP_DEVCTL_AUX_PME |
>>>> + PCI_EXP_DEVCTL_NOSNOOP_EN |
>>>> + PCI_EXP_DEVCTL_READRQ |
>>>> + PCI_EXP_DEVCTL_BCR_FLR);
>>>>
>>> Instead of listing the bits we *don't* want to touch, I think we
>>> should explicitly *include* CERE, NFERE, FERE, URRE. Maybe we should
>>> move the PCI_EXP_AER_FLAGS #define to drivers/pci/pci.h so we could
>>> use it directly, e.g.,
>>>
>>> hpx->pci_exp_devctl_and |= ~PCI_EXP_AER_FLAGS;
>>> hpx->pci_exp_devctl_or &= PCI_EXP_AER_FLAGS;
>>
>> Good idea. But what about moving it to include/uapi/linux/pci_regs.h
>> and also rename it from PCI_EXP_AER_FLAGS to PCI_EXP_DEVCTL_AER, to
>> match the convention for DEVCTL in pci_regs.h?
>
> I suggested drivers/pci/pci.h because (so far) the only need for
> PCI_EXP_AER_FLAGS is in drivers/pci, and that set of flags seems like
> an OS policy. Most of pci_regs.h is basically translating the PCI
> spec into #defines, without any real usage or policy parts. I'm not
> sure whether PCI_EXP_AER_FLAGS would be useful to userspace.
Well. My thinking was that since PCI_EXP_DEVCTL_{CERE,NFERE,FERE,URRE} is already present in pci_regs.h, the OR of them won't hurt and we gain consistency in naming:
@@ -505,6 +505,7 @@
#define PCI_EXP_DEVCAP_FLR 0x10000000 /* Function Level Reset */
#define PCI_EXP_DEVCAP_TEE 0x40000000 /* TEE I/O (TDISP) Support */
#define PCI_EXP_DEVCTL 0x08 /* Device Control */
+#define PCI_EXP_DEVCTL_AER 0x000f /* AER Flags */
#define PCI_EXP_DEVCTL_CERE 0x0001 /* Correctable Error Reporting En. */
#define PCI_EXP_DEVCTL_NFERE 0x0002 /* Non-Fatal Error Reporting Enable */
#define PCI_EXP_DEVCTL_FERE 0x0004 /* Fatal Error Reporting Enable */
I am fine either way.
>>>> if (pcie_cap_has_lnkctl(dev)) {
>>>> + u16 lnkctl;
>>>>
>>>> - /*
>>>> - * If the Root Port supports Read Completion Boundary of
>>>> - * 128, set RCB to 128. Otherwise, clear it.
>>>> - */
>>>> - hpx->pci_exp_lnkctl_and |= PCI_EXP_LNKCTL_RCB;
>>>> - hpx->pci_exp_lnkctl_or &= ~PCI_EXP_LNKCTL_RCB;
>>>> - if (pcie_root_rcb_set(dev))
>>>> - hpx->pci_exp_lnkctl_or |= PCI_EXP_LNKCTL_RCB;
>>>> -
>>>> - pcie_capability_clear_and_set_word(dev, PCI_EXP_LNKCTL,
>>>> - ~hpx->pci_exp_lnkctl_and, hpx->pci_exp_lnkctl_or);
>>>> + pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnkctl);
>>>> + if (lnkctl)
>>>> + pci_warn(dev, "Some bits in PCIe Link Control are set: 0x%04x\n",
>>>> + lnkctl);
>>>>
>>> Sorry, I wasn't clear about this. I meant that we could log the
>>> LNKCTL AND/OR values from _HPX, not the values from PCI_EXP_LNKCTL
>>> itself. There will definitely be bits set in PCI_EXP_LNKCTL in normal
>>> operation, which is perfectly fine.
>>>
>>> But if pci_exp_lnkctl_and or pci_exp_lnkctl_or are non-zero, the
>>> platform is telling us to do something, and we're ignoring it.
>>> *That's* what I think we might want to know about. pci_info() is
>>> probably sufficient; the user doesn't need to *do* anything with it, I
>>> just want it in case we need to debug an issue.
>>
>> My bad, Yes, that makes more sense to me. And, you're OK with
>> removing the RCB tweaking as well?
>
> Good question. My hope is that the code here is just to make sure
> that we don't *clear* PCI_EXP_LNKCTL_RCB when we want it set but a
> type 2 record might clear it by mistake.
Commit e42010d8207f ("PCI: Set Read Completion Boundary to 128 iff Root Port supports it (_HPX)") fixes the "opposite" case, where _HPX sets the RCB even though the RC does not support it. That commit removes any RCB setting from the type 2 record from the equation, and sets RCB if the RC has the bit set. And to me, that seems to be the correct behaviour.
> We should audit PCI_EXP_LNKCTL_RCB usage to be sure that if we remove
> this code, PCI_EXP_LNKCTL_RCB will still be set whenever it needs to
> be set. If we rely on the existence of an _HPX type 2 record for it
> to be set, that would be completely wrong.
I'l keep the RCB tweaking as is and add the pci_info() logging if the type 2 record attempts to change any bits in the Link Control register.
Thxs, Håkon
[+cc Johannes (author of e42010d8207f ("PCI: Set Read Completion
Boundary to 128 iff Root Port supports it (_HPX)"), Myron; start of
thread:
https://lore.kernel.org/r/20260113171522.3446407-1-haakon.bugge@oracle.com]
On Fri, Jan 16, 2026 at 10:10:43AM +0000, Haakon Bugge wrote:
> > On Thu, Jan 15, 2026 at 03:39:21PM +0000, Haakon Bugge wrote:
> >> Thanks for the review, Bjørn!
> >> ...
I should have mentioned this earlier, but I think the commit log
should include something about the problem this change fixes. I
assume that the current code changes ExtTag and/or RO, and that causes
something bad. That's what is motivating this change.
> >>>> if (pcie_cap_has_lnkctl(dev)) {
> >>>> + u16 lnkctl;
> >>>>
> >>>> - /*
> >>>> - * If the Root Port supports Read Completion Boundary of
> >>>> - * 128, set RCB to 128. Otherwise, clear it.
> >>>> - */
> >>>> - hpx->pci_exp_lnkctl_and |= PCI_EXP_LNKCTL_RCB;
> >>>> - hpx->pci_exp_lnkctl_or &= ~PCI_EXP_LNKCTL_RCB;
> >>>> - if (pcie_root_rcb_set(dev))
> >>>> - hpx->pci_exp_lnkctl_or |= PCI_EXP_LNKCTL_RCB;
> >>>> -
> >>>> - pcie_capability_clear_and_set_word(dev, PCI_EXP_LNKCTL,
> >>>> - ~hpx->pci_exp_lnkctl_and, hpx->pci_exp_lnkctl_or);
> >>>> + pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnkctl);
> >>>> + if (lnkctl)
> >>>> + pci_warn(dev, "Some bits in PCIe Link Control are set: 0x%04x\n",
> >>>> + lnkctl);
> >>>>
> >>> Sorry, I wasn't clear about this. I meant that we could log the
> >>> LNKCTL AND/OR values from _HPX, not the values from
> >>> PCI_EXP_LNKCTL itself. There will definitely be bits set in
> >>> PCI_EXP_LNKCTL in normal operation, which is perfectly fine.
> >>>
> >>> But if pci_exp_lnkctl_and or pci_exp_lnkctl_or are non-zero, the
> >>> platform is telling us to do something, and we're ignoring it.
> >>> *That's* what I think we might want to know about. pci_info()
> >>> is probably sufficient; the user doesn't need to *do* anything
> >>> with it, I just want it in case we need to debug an issue.
> >>
> >> My bad, Yes, that makes more sense to me. And, you're OK with
> >> removing the RCB tweaking as well?
> >
> > Good question. My hope is that the code here is just to make sure
> > that we don't *clear* PCI_EXP_LNKCTL_RCB when we want it set but a
> > type 2 record might clear it by mistake.
>
> Commit e42010d8207f ("PCI: Set Read Completion Boundary to 128 iff
> Root Port supports it (_HPX)") fixes the "opposite" case, where _HPX
> sets the RCB even though the RC does not support it. That commit
> removes any RCB setting from the type 2 record from the equation,
> and sets RCB if the RC has the bit set. And to me, that seems to be
> the correct behaviour.
Thanks for digging into that. You're right that it looks like
e42010d8207f ("PCI: Set Read Completion Boundary to 128 iff Root Port
supports it (_HPX)") was motivated by a machine with a Root Port with
PCI_EXP_LNKCTL_RCB cleared, but an _HPX record telling us to set
PCI_EXP_LNKCTL_RCB.
If we had realized at the time that _HPX Type 2 records are only for
the specific case of an OS that owns PCIe native hotplug but not AER,
we likely could have dropped the PCI_EXP_LNKCTL update completely.
The current behavior is that *if* there is an _HPX Type 2 record, we
set RCB to match the Root Port's RCB, regardless of what the Type 2
record is telling us. Removing that PCI_EXP_LNKCTL update could
conceivably break something if:
- Platform has an _HPX Type 2 record, and
- Root Port has PCI_EXP_LNKCTL_RCB cleared (only supports 64-byte
Completions), and
- Endpoint has PCI_EXP_LNKCTL_RCB set (may return 128-byte
Completions), which would probably be a configuration error by
BIOS
Removing it would also give up a little performance if there's a Type
2 record, the Root Port has PCI_EXP_LNKCTL_RCB set (supports 128-byte
Completions), but the Endpoint has PCI_EXP_LNKCTL_RCB cleared (may
only return 64-byte Completions).
> > We should audit PCI_EXP_LNKCTL_RCB usage to be sure that if we
> > remove this code, PCI_EXP_LNKCTL_RCB will still be set whenever it
> > needs to be set. If we rely on the existence of an _HPX type 2
> > record for it to be set, that would be completely wrong.
>
> I'll keep the RCB tweaking as is and add the pci_info() logging if
> the type 2 record attempts to change any bits in the Link Control
> register.
I think the fact that we only tweak RCB when an _HPX Type 2 record
exists is bogus. As far as I can tell, RCB has nothing to do with AER
or hotplug.
I think we probably should add a pci_configure_rcb() called from
pci_configure_device() to always configure RCB to match the Root Port
RCB. I would do this in a separate patch before removing the update
from program_hpx_type2().
Bjorn
On 1/16/26 10:11 PM, Bjorn Helgaas wrote:
> [+cc Johannes (author of e42010d8207f ("PCI: Set Read Completion
> Boundary to 128 iff Root Port supports it (_HPX)"), Myron; start of
> thread:
> https://lore.kernel.org/r/20260113171522.3446407-1-haakon.bugge@oracle.com]
>
> On Fri, Jan 16, 2026 at 10:10:43AM +0000, Haakon Bugge wrote:
>>> On Thu, Jan 15, 2026 at 03:39:21PM +0000, Haakon Bugge wrote:
>>>> Thanks for the review, Bjørn!
>>>> ...
> I should have mentioned this earlier, but I think the commit log
> should include something about the problem this change fixes. I
> assume that the current code changes ExtTag and/or RO, and that causes
> something bad. That's what is motivating this change.
>
>>>>>> if (pcie_cap_has_lnkctl(dev)) {
>>>>>> + u16 lnkctl;
>>>>>>
>>>>>> - /*
>>>>>> - * If the Root Port supports Read Completion Boundary of
>>>>>> - * 128, set RCB to 128. Otherwise, clear it.
>>>>>> - */
>>>>>> - hpx->pci_exp_lnkctl_and |= PCI_EXP_LNKCTL_RCB;
>>>>>> - hpx->pci_exp_lnkctl_or &= ~PCI_EXP_LNKCTL_RCB;
>>>>>> - if (pcie_root_rcb_set(dev))
>>>>>> - hpx->pci_exp_lnkctl_or |= PCI_EXP_LNKCTL_RCB;
>>>>>> -
>>>>>> - pcie_capability_clear_and_set_word(dev, PCI_EXP_LNKCTL,
>>>>>> - ~hpx->pci_exp_lnkctl_and, hpx->pci_exp_lnkctl_or);
>>>>>> + pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnkctl);
>>>>>> + if (lnkctl)
>>>>>> + pci_warn(dev, "Some bits in PCIe Link Control are set: 0x%04x\n",
>>>>>> + lnkctl);
>>>>>>
>>>>> Sorry, I wasn't clear about this. I meant that we could log the
>>>>> LNKCTL AND/OR values from _HPX, not the values from
>>>>> PCI_EXP_LNKCTL itself. There will definitely be bits set in
>>>>> PCI_EXP_LNKCTL in normal operation, which is perfectly fine.
>>>>>
>>>>> But if pci_exp_lnkctl_and or pci_exp_lnkctl_or are non-zero, the
>>>>> platform is telling us to do something, and we're ignoring it.
>>>>> *That's* what I think we might want to know about. pci_info()
>>>>> is probably sufficient; the user doesn't need to *do* anything
>>>>> with it, I just want it in case we need to debug an issue.
>>>> My bad, Yes, that makes more sense to me. And, you're OK with
>>>> removing the RCB tweaking as well?
>>> Good question. My hope is that the code here is just to make sure
>>> that we don't *clear* PCI_EXP_LNKCTL_RCB when we want it set but a
>>> type 2 record might clear it by mistake.
>> Commit e42010d8207f ("PCI: Set Read Completion Boundary to 128 iff
>> Root Port supports it (_HPX)") fixes the "opposite" case, where _HPX
>> sets the RCB even though the RC does not support it. That commit
>> removes any RCB setting from the type 2 record from the equation,
>> and sets RCB if the RC has the bit set. And to me, that seems to be
>> the correct behaviour.
> Thanks for digging into that. You're right that it looks like
> e42010d8207f ("PCI: Set Read Completion Boundary to 128 iff Root Port
> supports it (_HPX)") was motivated by a machine with a Root Port with
> PCI_EXP_LNKCTL_RCB cleared, but an _HPX record telling us to set
> PCI_EXP_LNKCTL_RCB.
IIRC (this is nearly 10 years old) that's been the case. But back then
it clearly was a bios issue, but we decided to fix it in the kernel if
my memory serves me well.
> On 19 Jan 2026, at 12:56, Johannes Thumshirn <morbidrsa@gmail.com> wrote:
>
> On 1/16/26 10:11 PM, Bjorn Helgaas wrote:
>> [+cc Johannes (author of e42010d8207f ("PCI: Set Read Completion
>> Boundary to 128 iff Root Port supports it (_HPX)"), Myron; start of
>> thread:
>> https://lore.kernel.org/r/20260113171522.3446407-1-haakon.bugge@oracle.com]
>>
>> On Fri, Jan 16, 2026 at 10:10:43AM +0000, Haakon Bugge wrote:
>>>> On Thu, Jan 15, 2026 at 03:39:21PM +0000, Haakon Bugge wrote:
>>>>> Thanks for the review, Bjørn!
>>>>> ...
>> I should have mentioned this earlier, but I think the commit log
>> should include something about the problem this change fixes. I
>> assume that the current code changes ExtTag and/or RO, and that causes
>> something bad. That's what is motivating this change.
Indeed, the commit log contains:
So, in order to preserve the non-AER bits in PCIe Device Control, in
particular the performance sensitive ExtTag and RO, we make sure
program_hpx_type2() if called, doesn't modify any non-AER bits.
Should I be more specific?
I observed ~15% higher I/O rates when ExtTag and RO were not reset on the system that had the "broken" type 2 record.
>>>>>>> if (pcie_cap_has_lnkctl(dev)) {
>>>>>>> + u16 lnkctl;
>>>>>>>
>>>>>>> - /*
>>>>>>> - * If the Root Port supports Read Completion Boundary of
>>>>>>> - * 128, set RCB to 128. Otherwise, clear it.
>>>>>>> - */
>>>>>>> - hpx->pci_exp_lnkctl_and |= PCI_EXP_LNKCTL_RCB;
>>>>>>> - hpx->pci_exp_lnkctl_or &= ~PCI_EXP_LNKCTL_RCB;
>>>>>>> - if (pcie_root_rcb_set(dev))
>>>>>>> - hpx->pci_exp_lnkctl_or |= PCI_EXP_LNKCTL_RCB;
>>>>>>> -
>>>>>>> - pcie_capability_clear_and_set_word(dev, PCI_EXP_LNKCTL,
>>>>>>> - ~hpx->pci_exp_lnkctl_and, hpx->pci_exp_lnkctl_or);
>>>>>>> + pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnkctl);
>>>>>>> + if (lnkctl)
>>>>>>> + pci_warn(dev, "Some bits in PCIe Link Control are set: 0x%04x\n",
>>>>>>> + lnkctl);
>>>>>>>
>>>>>> Sorry, I wasn't clear about this. I meant that we could log the
>>>>>> LNKCTL AND/OR values from _HPX, not the values from
>>>>>> PCI_EXP_LNKCTL itself. There will definitely be bits set in
>>>>>> PCI_EXP_LNKCTL in normal operation, which is perfectly fine.
>>>>>>
>>>>>> But if pci_exp_lnkctl_and or pci_exp_lnkctl_or are non-zero, the
>>>>>> platform is telling us to do something, and we're ignoring it.
>>>>>> *That's* what I think we might want to know about. pci_info()
>>>>>> is probably sufficient; the user doesn't need to *do* anything
>>>>>> with it, I just want it in case we need to debug an issue.
>>>>> My bad, Yes, that makes more sense to me. And, you're OK with
>>>>> removing the RCB tweaking as well?
>>>> Good question. My hope is that the code here is just to make sure
>>>> that we don't *clear* PCI_EXP_LNKCTL_RCB when we want it set but a
>>>> type 2 record might clear it by mistake.
>>> Commit e42010d8207f ("PCI: Set Read Completion Boundary to 128 iff
>>> Root Port supports it (_HPX)") fixes the "opposite" case, where _HPX
>>> sets the RCB even though the RC does not support it. That commit
>>> removes any RCB setting from the type 2 record from the equation,
>>> and sets RCB if the RC has the bit set. And to me, that seems to be
>>> the correct behaviour.
>> Thanks for digging into that. You're right that it looks like
>> e42010d8207f ("PCI: Set Read Completion Boundary to 128 iff Root Port
>> supports it (_HPX)") was motivated by a machine with a Root Port with
>> PCI_EXP_LNKCTL_RCB cleared, but an _HPX record telling us to set
>> PCI_EXP_LNKCTL_RCB.
>
> IIRC (this is nearly 10 years old) that's been the case. But back then it clearly was a bios issue, but we decided to fix it in the kernel if my memory serves me well.
Yes, looks like we are aligned. Will send a v2, which will be a two commit series, and we take it from there.
Thxs, Håkon
>
© 2016 - 2026 Red Hat, Inc.