[PATCH v3 2/2] PCI/ACPI: Confine program_hpx_type2 to the AER bits

Håkon Bugge posted 2 patches 2 weeks, 3 days ago
There is a newer version of this series
[PATCH v3 2/2] PCI/ACPI: Confine program_hpx_type2 to the AER bits
Posted by Håkon Bugge 2 weeks, 3 days ago
program_hpx_type2() is today unconditionally called, despite the fact
that when the _HPX was added to the ACPI spec. v3.0, the description
stated:

 OSPM [1] will only evaluate _HPX with Setting Record – Type 2 if OSPM
 is not controlling the PCI Express Advanced Error Reporting
 capability.

Hence, we only call program_hpx_type2() when the OSPM owns the PCIe
hotplug capability but not the AER.

The Advanced Configuration and Power Interface (ACPI) Specification
version 6.6 has a provision that gives the OSPM the ability to control
the other PCIe Device Control bits any way. In a note in section
6.2.9, it is stated:

"OSPM may override the settings provided by the _HPX object's Type2
record (PCI Express Settings) or Type3 record (PCI Express Descriptor
Settings) when OSPM has assumed native control of the corresponding
feature."

So, in order to preserve the non-AER bits in PCIe Device Control, in
particular the performance sensitive ExtTag and RO, we make sure
program_hpx_type2() if called, doesn't modify any non-AER bits.

Also, when program_hpx_type2() is called, we completely avoid
modifying any bits in the Link Control register. However, if the _HPX
type 2 records contains bits indicating such modifications, we print
an info message.

[1] Operating System-directed configuration and Power Management

Fixes: 40abb96c51bb ("[PATCH] pciehp: Fix programming hotplug parameters")
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>

---

v2 -> v3:
   * No changes

v1 -> v2:
   * Fixed comment style
   * Simplified the and/or logic when programming the Device Control
     register
   * Fixed the incorrect and brutal warning about Link Control
     register bits set and changed it to an info message about _HPX
     attempting to set/reset bits therein.
   * Removed the RCB programming from program_hpx_type2()
   * Moved the PCI_EXP_AER_FLAGS definition from
     drivers/pci/pcie/aer.c to drivers/pci/pci.h
---
 drivers/pci/pci-acpi.c | 61 +++++++++++++++++++-----------------------
 drivers/pci/pci.h      |  3 +++
 drivers/pci/pcie/aer.c |  3 ---
 3 files changed, 30 insertions(+), 37 deletions(-)

diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c
index 9369377725fa0..34ea22f65a410 100644
--- a/drivers/pci/pci-acpi.c
+++ b/drivers/pci/pci-acpi.c
@@ -271,21 +271,6 @@ static acpi_status decode_type1_hpx_record(union acpi_object *record,
 	return AE_OK;
 }
 
-static bool pcie_root_rcb_set(struct pci_dev *dev)
-{
-	struct pci_dev *rp = pcie_find_root_port(dev);
-	u16 lnkctl;
-
-	if (!rp)
-		return false;
-
-	pcie_capability_read_word(rp, PCI_EXP_LNKCTL, &lnkctl);
-	if (lnkctl & PCI_EXP_LNKCTL_RCB)
-		return true;
-
-	return false;
-}
-
 /* _HPX PCI Express Setting Record (Type 2) */
 struct hpx_type2 {
 	u32 revision;
@@ -311,6 +296,7 @@ static void program_hpx_type2(struct pci_dev *dev, struct hpx_type2 *hpx)
 {
 	int pos;
 	u32 reg32;
+	const struct pci_host_bridge *host;
 
 	if (!hpx)
 		return;
@@ -318,6 +304,15 @@ static void program_hpx_type2(struct pci_dev *dev, struct hpx_type2 *hpx)
 	if (!pci_is_pcie(dev))
 		return;
 
+	host = pci_find_host_bridge(dev->bus);
+
+	/*
+	 * We only do the HP programming if we own the PCIe native
+	 * hotplug and not the AER ownership
+	 */
+	if (!host->native_pcie_hotplug || host->native_aer)
+		return;
+
 	if (hpx->revision > 1) {
 		pci_warn(dev, "PCIe settings rev %d not supported\n",
 			 hpx->revision);
@@ -325,33 +320,31 @@ static void program_hpx_type2(struct pci_dev *dev, struct hpx_type2 *hpx)
 	}
 
 	/*
-	 * Don't allow _HPX to change MPS or MRRS settings.  We manage
-	 * those to make sure they're consistent with the rest of the
+	 * We only allow _HPX to program the AER registers, namely
+	 * PCI_EXP_DEVCTL_CERE, PCI_EXP_DEVCTL_NFERE,
+	 * PCI_EXP_DEVCTL_FERE, and PCI_EXP_DEVCTL_URRE.
+	 *
+	 * The other settings in PCIe DEVCTL are managed by OS in
+	 * order to make sure they're consistent with the rest of the
 	 * platform.
 	 */
-	hpx->pci_exp_devctl_and |= PCI_EXP_DEVCTL_PAYLOAD |
-				    PCI_EXP_DEVCTL_READRQ;
-	hpx->pci_exp_devctl_or &= ~(PCI_EXP_DEVCTL_PAYLOAD |
-				    PCI_EXP_DEVCTL_READRQ);
+	hpx->pci_exp_devctl_and |= ~PCI_EXP_AER_FLAGS;
+	hpx->pci_exp_devctl_or &= PCI_EXP_AER_FLAGS;
 
 	/* Initialize Device Control Register */
 	pcie_capability_clear_and_set_word(dev, PCI_EXP_DEVCTL,
 			~hpx->pci_exp_devctl_and, hpx->pci_exp_devctl_or);
 
-	/* Initialize Link Control Register */
+	/* Log if _HPX attempts to modify PCIe Link Control register */
 	if (pcie_cap_has_lnkctl(dev)) {
-
-		/*
-		 * If the Root Port supports Read Completion Boundary of
-		 * 128, set RCB to 128.  Otherwise, clear it.
-		 */
-		hpx->pci_exp_lnkctl_and |= PCI_EXP_LNKCTL_RCB;
-		hpx->pci_exp_lnkctl_or &= ~PCI_EXP_LNKCTL_RCB;
-		if (pcie_root_rcb_set(dev))
-			hpx->pci_exp_lnkctl_or |= PCI_EXP_LNKCTL_RCB;
-
-		pcie_capability_clear_and_set_word(dev, PCI_EXP_LNKCTL,
-			~hpx->pci_exp_lnkctl_and, hpx->pci_exp_lnkctl_or);
+		if (hpx->pci_exp_lnkctl_and)
+			pci_info(dev,
+				 "_HPX attempts to reset the following bits in PCIe Link Control: 0x%04x\n",
+				 hpx->pci_exp_lnkctl_and);
+		if (hpx->pci_exp_lnkctl_or)
+			pci_info(dev,
+				 "_HPX attempts to set the following bits in PCIe Link Control: 0x%04x\n",
+				 hpx->pci_exp_lnkctl_or);
 	}
 
 	/* Find Advanced Error Reporting Enhanced Capability */
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 0e67014aa0013..f388d4414dd3a 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -88,6 +88,9 @@ struct pcie_tlp_log;
 #define PCI_BUS_BRIDGE_MEM_WINDOW	1
 #define PCI_BUS_BRIDGE_PREF_MEM_WINDOW	2
 
+#define	PCI_EXP_AER_FLAGS	(PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE | \
+				 PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)
+
 extern const unsigned char pcie_link_speed[];
 extern bool pci_early_dump;
 
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index e0bcaa896803c..9472d86cef552 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -239,9 +239,6 @@ void pcie_ecrc_get_policy(char *str)
 }
 #endif	/* CONFIG_PCIE_ECRC */
 
-#define	PCI_EXP_AER_FLAGS	(PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE | \
-				 PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)
-
 int pcie_aer_is_native(struct pci_dev *dev)
 {
 	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
-- 
2.43.5

Re: [PATCH v3 2/2] PCI/ACPI: Confine program_hpx_type2 to the AER bits
Posted by Bjorn Helgaas 1 week, 5 days ago
On Thu, Jan 22, 2026 at 02:09:54PM +0100, Håkon Bugge wrote:
> program_hpx_type2() is today unconditionally called, despite the fact
> that when the _HPX was added to the ACPI spec. v3.0, the description
> stated:
> 
>  OSPM [1] will only evaluate _HPX with Setting Record – Type 2 if OSPM
>  is not controlling the PCI Express Advanced Error Reporting
>  capability.
> 
> Hence, we only call program_hpx_type2() when the OSPM owns the PCIe
> hotplug capability but not the AER.
> 
> The Advanced Configuration and Power Interface (ACPI) Specification
> version 6.6 has a provision that gives the OSPM the ability to control
> the other PCIe Device Control bits any way. In a note in section
> 6.2.9, it is stated:
> 
> "OSPM may override the settings provided by the _HPX object's Type2
> record (PCI Express Settings) or Type3 record (PCI Express Descriptor
> Settings) when OSPM has assumed native control of the corresponding
> feature."
> 
> So, in order to preserve the non-AER bits in PCIe Device Control, in
> particular the performance sensitive ExtTag and RO, we make sure
> program_hpx_type2() if called, doesn't modify any non-AER bits.
> 
> Also, when program_hpx_type2() is called, we completely avoid
> modifying any bits in the Link Control register. However, if the _HPX
> type 2 records contains bits indicating such modifications, we print
> an info message.
> 
> [1] Operating System-directed configuration and Power Management

I looked at the specs again and pulled out some more details.  Here's
what seemed relevant to me:

    PCI/ACPI: Restrict program_hpx_type2() to AER bits
    
    Previously program_hpx_type2() applied PCIe settings unconditionally, which
    could incorrectly change bits like Extended Tag Field Enable and Enable
    Relaxed Ordering.
    
    When _HPX was added to ACPI r3.0, the intent of the PCIe Setting Record
    (Type 2) in sec 6.2.7.3 was to configure AER registers when the OS does not
    own the AER Capability:
    
      The PCI Express setting record contains ... [the AER] Uncorrectable Error
      Mask, Uncorrectable Error Severity, Correctable Error Mask ... to be used
      when configuring registers in the Advanced Error Reporting Extended
      Capability Structure ...
    
      OSPM will only evaluate _HPX with Setting Record – Type 2 if OSPM is not
      controlling the PCI Express Advanced Error Reporting capability.
    
    ACPI r3.0b, sec 6.2.7.3, added more AER registers, including registers in
    the PCIe Capability with AER-related bits, and the restriction that the OS
    use this only when it owns PCIe native hotplug:
    
      ... when configuring PCI Express registers in the Advanced Error
      Reporting Extended Capability Structure *or PCI Express Capability
      Structure* ...
    
      An OS that has assumed ownership of native hot plug but does not ... have
      ownership of the AER register set must use ... the Type 2 record to
      program the AER registers ...
    
      However, since the Type 2 record also includes register bits that have
      functions other than AER, the OS must ignore values ... that are not
      applicable.
    
    Restrict program_hpx_type2() to only the intended purpose:
    
      - Apply settings only when OS owns PCIe native hotplug but not AER,
    
      - Only touch the AER-related bits (Error Reporting Enables) in Device
        Control
    
      - Don't touch Link Control at all, since nothing there seems AER-related,
        but log _HPX settings for debugging purposes
    
    Note that Read Completion Boundary is now configured elsewhere, since it is
    unrelated to _HPX.

> +	/* Log if _HPX attempts to modify PCIe Link Control register */
>  	if (pcie_cap_has_lnkctl(dev)) {
> -
> -		/*
> -		 * If the Root Port supports Read Completion Boundary of
> -		 * 128, set RCB to 128.  Otherwise, clear it.
> -		 */
> -		hpx->pci_exp_lnkctl_and |= PCI_EXP_LNKCTL_RCB;
> -		hpx->pci_exp_lnkctl_or &= ~PCI_EXP_LNKCTL_RCB;
> -		if (pcie_root_rcb_set(dev))
> -			hpx->pci_exp_lnkctl_or |= PCI_EXP_LNKCTL_RCB;
> -
> -		pcie_capability_clear_and_set_word(dev, PCI_EXP_LNKCTL,
> -			~hpx->pci_exp_lnkctl_and, hpx->pci_exp_lnkctl_or);
> +		if (hpx->pci_exp_lnkctl_and)
> +			pci_info(dev,
> +				 "_HPX attempts to reset the following bits in PCIe Link Control: 0x%04x\n",
> +				 hpx->pci_exp_lnkctl_and);
> +		if (hpx->pci_exp_lnkctl_or)
> +			pci_info(dev,
> +				 "_HPX attempts to set the following bits in PCIe Link Control: 0x%04x\n",
> +				 hpx->pci_exp_lnkctl_or);

Again I'm afraid I suggested some over-engineering, and even worse, I
misinterpreted the pci_exp_lnkctl_and and pci_exp_lnkctl_or usage when
I said "if pci_exp_lnkctl_and or pci_exp_lnkctl_or are *non-zero*".

Per spec, we are supposed to AND the register with pci_exp_lnkctl_and,
so the interesting value would be anything other than 0xffff.  Since
we OR it with pci_exp_lnkctl_or, the interesting values there would be
anything non-zero.  So I think what we would want is something like
this:

+	/* Log if _HPX attempts to modify PCIe Link Control register */
 	if (pcie_cap_has_lnkctl(dev)) {
-
-		/*
-		 * If the Root Port supports Read Completion Boundary of
-		 * 128, set RCB to 128.  Otherwise, clear it.
-		 */
-		hpx->pci_exp_lnkctl_and |= PCI_EXP_LNKCTL_RCB;
-		hpx->pci_exp_lnkctl_or &= ~PCI_EXP_LNKCTL_RCB;
-		if (pcie_root_rcb_set(dev))
-			hpx->pci_exp_lnkctl_or |= PCI_EXP_LNKCTL_RCB;
-
-		pcie_capability_clear_and_set_word(dev, PCI_EXP_LNKCTL,
-			~hpx->pci_exp_lnkctl_and, hpx->pci_exp_lnkctl_or);
+		if (hpx->pci_exp_lnkctl_and != 0xffff ||
+		    hpx->pci_exp_lnkctl_or != 0)
+			pci_info(dev, "_HPX attempts Link Control setting (AND %#06x OR %#06x)\n",
+				 hpx->pci_exp_lnkctl_and,
+				 hpx->pci_exp_lnkctl_or);
 	}
Re: [PATCH v3 2/2] PCI/ACPI: Confine program_hpx_type2 to the AER bits
Posted by Haakon Bugge 1 week, 4 days ago
> On Thu, Jan 22, 2026 at 02:09:54PM +0100, Håkon Bugge wrote:
>> program_hpx_type2() is today unconditionally called, despite the fact
>> that when the _HPX was added to the ACPI spec. v3.0, the description
>> stated:
>> 
>> OSPM [1] will only evaluate _HPX with Setting Record – Type 2 if OSPM
>> is not controlling the PCI Express Advanced Error Reporting
>> capability.
>> 
>> Hence, we only call program_hpx_type2() when the OSPM owns the PCIe
>> hotplug capability but not the AER.
>> 
>> The Advanced Configuration and Power Interface (ACPI) Specification
>> version 6.6 has a provision that gives the OSPM the ability to control
>> the other PCIe Device Control bits any way. In a note in section
>> 6.2.9, it is stated:
>> 
>> "OSPM may override the settings provided by the _HPX object's Type2
>> record (PCI Express Settings) or Type3 record (PCI Express Descriptor
>> Settings) when OSPM has assumed native control of the corresponding
>> feature."
>> 
>> So, in order to preserve the non-AER bits in PCIe Device Control, in
>> particular the performance sensitive ExtTag and RO, we make sure
>> program_hpx_type2() if called, doesn't modify any non-AER bits.
>> 
>> Also, when program_hpx_type2() is called, we completely avoid
>> modifying any bits in the Link Control register. However, if the _HPX
>> type 2 records contains bits indicating such modifications, we print
>> an info message.
>> 
>> [1] Operating System-directed configuration and Power Management
> 
> I looked at the specs again and pulled out some more details.  Here's
> what seemed relevant to me:
> 
>    PCI/ACPI: Restrict program_hpx_type2() to AER bits
> 
>    Previously program_hpx_type2() applied PCIe settings unconditionally, which
>    could incorrectly change bits like Extended Tag Field Enable and Enable
>    Relaxed Ordering.
> 
>    When _HPX was added to ACPI r3.0, the intent of the PCIe Setting Record
>    (Type 2) in sec 6.2.7.3 was to configure AER registers when the OS does not
>    own the AER Capability:
> 
>      The PCI Express setting record contains ... [the AER] Uncorrectable Error
>      Mask, Uncorrectable Error Severity, Correctable Error Mask ... to be used
>      when configuring registers in the Advanced Error Reporting Extended
>      Capability Structure ...
> 
>      OSPM will only evaluate _HPX with Setting Record – Type 2 if OSPM is not
>      controlling the PCI Express Advanced Error Reporting capability.
> 
>    ACPI r3.0b, sec 6.2.7.3, added more AER registers, including registers in
>    the PCIe Capability with AER-related bits, and the restriction that the OS
>    use this only when it owns PCIe native hotplug:
> 
>      ... when configuring PCI Express registers in the Advanced Error
>      Reporting Extended Capability Structure *or PCI Express Capability
>      Structure* ...
> 
>      An OS that has assumed ownership of native hot plug but does not ... have
>      ownership of the AER register set must use ... the Type 2 record to
>      program the AER registers ...
> 
>      However, since the Type 2 record also includes register bits that have
>      functions other than AER, the OS must ignore values ... that are not
>      applicable.
> 
>    Restrict program_hpx_type2() to only the intended purpose:
> 
>      - Apply settings only when OS owns PCIe native hotplug but not AER,
> 
>      - Only touch the AER-related bits (Error Reporting Enables) in Device
>        Control
> 
>      - Don't touch Link Control at all, since nothing there seems AER-related,
>        but log _HPX settings for debugging purposes
> 
>    Note that Read Completion Boundary is now configured elsewhere, since it is
>    unrelated to _HPX.

Thanks for the word-smithing and improved accuracy!

>> +	/* Log if _HPX attempts to modify PCIe Link Control register */
>> 	if (pcie_cap_has_lnkctl(dev)) {
>> -
>> -		/*
>> -		 * If the Root Port supports Read Completion Boundary of
>> -		 * 128, set RCB to 128.  Otherwise, clear it.
>> -		 */
>> -		hpx->pci_exp_lnkctl_and |= PCI_EXP_LNKCTL_RCB;
>> -		hpx->pci_exp_lnkctl_or &= ~PCI_EXP_LNKCTL_RCB;
>> -		if (pcie_root_rcb_set(dev))
>> -			hpx->pci_exp_lnkctl_or |= PCI_EXP_LNKCTL_RCB;
>> -
>> -		pcie_capability_clear_and_set_word(dev, PCI_EXP_LNKCTL,
>> -			~hpx->pci_exp_lnkctl_and, hpx->pci_exp_lnkctl_or);

This was what confused me a lot, the bit-wise NOT above. That must be wrong, as pcie_capability_clear_and_set_word() inverts the "clear" argument.

>> +		if (hpx->pci_exp_lnkctl_and)
>> +			pci_info(dev,
>> +				 "_HPX attempts to reset the following bits in PCIe Link Control: 0x%04x\n",
>> +				 hpx->pci_exp_lnkctl_and);
>> +		if (hpx->pci_exp_lnkctl_or)
>> +			pci_info(dev,
>> +				 "_HPX attempts to set the following bits in PCIe Link Control: 0x%04x\n",
>> +				 hpx->pci_exp_lnkctl_or);
> 
> Again I'm afraid I suggested some over-engineering, and even worse, I
> misinterpreted the pci_exp_lnkctl_and and pci_exp_lnkctl_or usage when
> I said "if pci_exp_lnkctl_and or pci_exp_lnkctl_or are *non-zero*".
> 
> Per spec, we are supposed to AND the register with pci_exp_lnkctl_and,
> so the interesting value would be anything other than 0xffff.  Since
> we OR it with pci_exp_lnkctl_or, the interesting values there would be
> anything non-zero.  So I think what we would want is something like
> this:
> 
> +	/* Log if _HPX attempts to modify PCIe Link Control register */
> 	if (pcie_cap_has_lnkctl(dev)) {
> -
> -		/*
> -		 * If the Root Port supports Read Completion Boundary of
> -		 * 128, set RCB to 128.  Otherwise, clear it.
> -		 */
> -		hpx->pci_exp_lnkctl_and |= PCI_EXP_LNKCTL_RCB;
> -		hpx->pci_exp_lnkctl_or &= ~PCI_EXP_LNKCTL_RCB;
> -		if (pcie_root_rcb_set(dev))
> -			hpx->pci_exp_lnkctl_or |= PCI_EXP_LNKCTL_RCB;
> -
> -		pcie_capability_clear_and_set_word(dev, PCI_EXP_LNKCTL,
> -			~hpx->pci_exp_lnkctl_and, hpx->pci_exp_lnkctl_or);
> +		if (hpx->pci_exp_lnkctl_and != 0xffff ||
> +		    hpx->pci_exp_lnkctl_or != 0)
> +			pci_info(dev, "_HPX attempts Link Control setting (AND %#06x OR %#06x)\n",
> +				 hpx->pci_exp_lnkctl_and,
> +				 hpx->pci_exp_lnkctl_or);
> 	}

Since we do not want to fix incorrect firmware in this respect, the above makes perfect sense to me. A v4 is on its way.


Thxs, Håkon

Re: [PATCH v3 2/2] PCI/ACPI: Confine program_hpx_type2 to the AER bits
Posted by Haakon Bugge 1 week, 3 days ago
>>> Thanks for the word-smithing and improved accuracy!
>>> 
>>> +	/* Log if _HPX attempts to modify PCIe Link Control register */
>>> 	if (pcie_cap_has_lnkctl(dev)) {
>>> -
>>> -		/*
>>> -		 * If the Root Port supports Read Completion Boundary of
>>> -		 * 128, set RCB to 128.  Otherwise, clear it.
>>> -		 */
>>> -		hpx->pci_exp_lnkctl_and |= PCI_EXP_LNKCTL_RCB;
>>> -		hpx->pci_exp_lnkctl_or &= ~PCI_EXP_LNKCTL_RCB;
>>> -		if (pcie_root_rcb_set(dev))
>>> -			hpx->pci_exp_lnkctl_or |= PCI_EXP_LNKCTL_RCB;
>>> -
>>> -		pcie_capability_clear_and_set_word(dev, PCI_EXP_LNKCTL,
>>> -			~hpx->pci_exp_lnkctl_and, hpx->pci_exp_lnkctl_or);
>>> 
> This was what confused me a lot, the bit-wise NOT above. That must be wrong, as pcie_capability_clear_and_set_word() inverts the "clear" argument.

Have to correct myself here. ACPI states:

When configuring a given register, OSPM uses the following algorithm:
1. Read the register’s current value, which contains the register’s default value.
2. Perform a bit-wise AND operation with the “AND mask” from the table below.
[]

Because pcie_capability_clear_and_set_word() inverts the "clear" argument, the above bitwise NOT is of course correct. Two bitwise NOTs is a no-op and we follow the ACPI outlined algorithm.


Sorry for the noise, Håkon