[RFC PATCH v3 2/3] PCI: rockchip-host: Retry link training on failure without PERST#

Geraldo Nascimento posted 3 patches 4 months ago
Posted by Geraldo Nascimento 4 months ago
After almost 30 days of battling with RK3399 buggy PCIe on my Rock Pi
N10 through trial-and-error debugging, I finally got positive results
with enumeration on the PCI bus for both a Realtek 8111E NIC and a
Samsung PM981a SSD.

The NIC was connected to an M.2->PCIe x4 riser card and it would get
stuck on Polling.Compliance, without breaking electrical idle on the
Host RX side. The Samsung PM981a SSD is directly connected to the M.2
connector and that SSD is known to be quirky (OEM... no support)
and non-functional on the RK3399 platform.

The Samsung SSD was even worse than the NIC - it would get stuck on
Detect.Active like a bricked card, even though it was fully functional
via USB adapter.

It seems both devices benefit from retrying Link Training if - big if
here - PERST# is not toggled during retry.

For retry to work, the flow must be exactly as handled by the present
patch: we must cut power, disable the clocks, then re-enable both
clocks and power regulators and go through initialization without
touching PERST#. Then quirky devices are able to successfully
enumerate.

No functional change intended for already working devices.

Signed-off-by: Geraldo Nascimento <geraldogabriel@gmail.com>
---
 drivers/pci/controller/pcie-rockchip-host.c | 47 ++++++++++++++++++---
 1 file changed, 40 insertions(+), 7 deletions(-)

diff --git a/drivers/pci/controller/pcie-rockchip-host.c b/drivers/pci/controller/pcie-rockchip-host.c
index 2a1071cd3241..67b3b379d277 100644
--- a/drivers/pci/controller/pcie-rockchip-host.c
+++ b/drivers/pci/controller/pcie-rockchip-host.c
@@ -338,11 +338,14 @@ static int rockchip_pcie_set_vpcie(struct rockchip_pcie *rockchip)
 static int rockchip_pcie_host_init_port(struct rockchip_pcie *rockchip)
 {
 	struct device *dev = rockchip->dev;
-	int err, i = MAX_LANE_NUM;
+	int err, i = MAX_LANE_NUM, is_reinit = 0;
 	u32 status;
 
-	gpiod_set_value_cansleep(rockchip->perst_gpio, 0);
+	if (!is_reinit) {
+		gpiod_set_value_cansleep(rockchip->perst_gpio, 0);
+	}
 
+reinit:
 	err = rockchip_pcie_init_port(rockchip);
 	if (err)
 		return err;
@@ -369,16 +372,46 @@ static int rockchip_pcie_host_init_port(struct rockchip_pcie *rockchip)
 	rockchip_pcie_write(rockchip, PCIE_CLIENT_LINK_TRAIN_ENABLE,
 			    PCIE_CLIENT_CONFIG);
 
-	msleep(PCIE_T_PVPERL_MS);
-	gpiod_set_value_cansleep(rockchip->perst_gpio, 1);
-
-	msleep(PCIE_T_RRS_READY_MS);
+	if (!is_reinit) {
+		msleep(PCIE_T_PVPERL_MS);
+		gpiod_set_value_cansleep(rockchip->perst_gpio, 1);
+		msleep(PCIE_T_RRS_READY_MS);
+	}
 
 	/* 500ms timeout value should be enough for Gen1/2 training */
 	err = readl_poll_timeout(rockchip->apb_base + PCIE_CLIENT_BASIC_STATUS1,
 				 status, PCIE_LINK_UP(status), 20,
 				 500 * USEC_PER_MSEC);
-	if (err) {
+
+	if (err && !is_reinit) {
+		while (i--)
+			phy_power_off(rockchip->phys[i]);
+		i = MAX_LANE_NUM;
+		while (i--)
+			phy_exit(rockchip->phys[i]);
+		i = MAX_LANE_NUM;
+		is_reinit = 1;
+		dev_dbg(dev, "Will reinit PCIe without toggling PERST#\n");
+		if (!IS_ERR(rockchip->vpcie12v))
+			regulator_disable(rockchip->vpcie12v);
+		if (!IS_ERR(rockchip->vpcie3v3))
+			regulator_disable(rockchip->vpcie3v3);
+		regulator_disable(rockchip->vpcie1v8);
+		regulator_disable(rockchip->vpcie0v9);
+		rockchip_pcie_disable_clocks(rockchip);
+		err = rockchip_pcie_enable_clocks(rockchip);
+		if (err)
+			return err;
+		err = rockchip_pcie_set_vpcie(rockchip);
+		if (err) {
+			dev_err(dev, "failed to set vpcie regulator\n");
+			rockchip_pcie_disable_clocks(rockchip);
+			return err;
+		}
+		goto reinit;
+	}
+
+	else if (err) {
 		dev_err(dev, "PCIe link training gen1 timeout!\n");
 		goto err_power_off_phy;
 	}
-- 
2.49.0
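
The retry flow described in the commit message can be sketched as a
small userspace model. This is only an illustration under stated
assumptions: the step names, `struct trace`, `record()`, `count()` and
`bring_up()` are hypothetical and do not exist in pcie-rockchip-host.c.
The model records the order of operations instead of touching hardware,
and it allows exactly one retry, as the patch does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical step names for a userspace model; not driver symbols. */
enum step {
	PERST_ASSERT, INIT_PORT, PERST_DEASSERT, TRAIN,
	PHY_OFF, REGULATORS_OFF, CLOCKS_OFF, CLOCKS_ON, REGULATORS_ON,
};

#define MAX_STEPS 32

struct trace {
	enum step steps[MAX_STEPS];
	int n;
};

static void record(struct trace *t, enum step s)
{
	assert(t->n < MAX_STEPS);
	t->steps[t->n++] = s;
}

/* Count how often a given step appears in the recorded trace. */
static int count(const struct trace *t, enum step s)
{
	int i, hits = 0;

	for (i = 0; i < t->n; i++)
		if (t->steps[i] == s)
			hits++;
	return hits;
}

/*
 * succeeds_on_attempt: 1-based training attempt on which the simulated
 * link comes up; 0 means it never comes up.
 */
static bool bring_up(struct trace *t, int succeeds_on_attempt)
{
	bool is_reinit = false;
	int attempt = 0;

	record(t, PERST_ASSERT);
	for (;;) {
		record(t, INIT_PORT);
		if (!is_reinit)
			record(t, PERST_DEASSERT); /* only on the first pass */
		record(t, TRAIN);
		if (++attempt == succeeds_on_attempt)
			return true;
		if (is_reinit)
			return false; /* one retry only */

		/* Retry path: power-cycle everything except PERST#. */
		record(t, PHY_OFF);
		record(t, REGULATORS_OFF);
		record(t, CLOCKS_OFF);
		record(t, CLOCKS_ON);
		record(t, REGULATORS_ON);
		is_reinit = true;
	}
}
```

In this model a link that only comes up on the second attempt sees
PERST# asserted once and deasserted once, with a full power and clock
cycle in between.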
Re: [RFC PATCH v3 2/3] PCI: rockchip-host: Retry link training on failure without PERST#
Posted by Shawn Lin 2 months, 3 weeks ago
Hi Geraldo,

On Wed, 2025/06/11 at 3:05, Geraldo Nascimento wrote:
> After almost 30 days of battling with RK3399 buggy PCIe on my Rock Pi
> N10 through trial-and-error debugging, I finally got positive results
> with enumeration on the PCI bus for both a Realtek 8111E NIC and a
> Samsung PM981a SSD.
> 
> The NIC was connected to a M.2->PCIe x4 riser card and it would get
> stuck on Polling.Compliance, without breaking electrical idle on the
> Host RX side. The Samsung PM981a SSD is directly connected to M.2
> connector and that SSD is known to be quirky (OEM... no support)
> and non-functional on the RK3399 platform.
> 
> The Samsung SSD was even worse than the NIC - it would get stuck on
> Detect.Active like a bricked card, even though it was fully functional
> via USB adapter.
> 
> It seems both devices benefit from retrying Link Training if - big if
> here - PERST# is not toggled during retry.
> 

I haven't seen this error before, especially given that the RTL8111 NIC
is widely used by customers.

Could you help try this?
[1] apply your patch 3 first
[2] apply below changes

--- a/drivers/pci/controller/pcie-rockchip-host.c
+++ b/drivers/pci/controller/pcie-rockchip-host.c
@@ -314,7 +314,7 @@ static int rockchip_pcie_host_init_port(struct rockchip_pcie *rockchip)
         rockchip_pcie_write(rockchip, PCIE_CLIENT_LINK_TRAIN_ENABLE,
                             PCIE_CLIENT_CONFIG);

-       msleep(PCIE_T_PVPERL_MS);
+       msleep(500);
         gpiod_set_value_cansleep(rockchip->perst_gpio, 1);

         msleep(PCIE_RESET_CONFIG_WAIT_MS);
@@ -322,7 +322,7 @@ static int rockchip_pcie_host_init_port(struct rockchip_pcie *rockchip)
         /* 500ms timeout value should be enough for Gen1/2 training */
         err = readl_poll_timeout(rockchip->apb_base + PCIE_CLIENT_BASIC_STATUS1,
                                  status, PCIE_LINK_UP(status), 20,
-                                500 * USEC_PER_MSEC);
+                                5000 * USEC_PER_MSEC);
         if (err) {
                 dev_err(dev, "PCIe link training gen1 timeout!\n");
                 goto err_power_off_phy;
@@ -951,6 +951,8 @@ static int rockchip_pcie_probe(struct platform_device *pdev)
         if (err)
                 return err;

+       gpiod_set_value_cansleep(rockchip->perst_gpio, 0);
+
         err = rockchip_pcie_set_vpcie(rockchip);
         if (err) {
                 dev_err(dev, "failed to set vpcie regulator\n");
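
The second hunk above only widens the polling window. A minimal
userspace sketch of that kind of poll-until-timeout loop, assuming a
simulated clock and a hypothetical `struct fake_link` in place of the
PCIE_CLIENT_BASIC_STATUS1 register read, shows how a link that needs
longer than 500 ms is reported as a timeout with the old value and as up
with the 5000 ms one. It is not the kernel's readl_poll_timeout()
implementation, and `poll_link_up()` is a made-up name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the hardware: link is up once now >= ready. */
struct fake_link {
	uint64_t now_us;      /* simulated current time */
	uint64_t ready_at_us; /* time at which training would complete */
};

static bool link_up(const struct fake_link *l)
{
	return l->now_us >= l->ready_at_us;
}

/*
 * Poll every sleep_us until the link is up or timeout_us has elapsed.
 * Returns 0 on success and -1 on timeout (the kernel helper returns
 * -ETIMEDOUT instead).
 */
static int poll_link_up(struct fake_link *l, uint64_t sleep_us,
			uint64_t timeout_us)
{
	uint64_t deadline = l->now_us + timeout_us;

	assert(sleep_us > 0); /* a zero step would never advance the clock */
	for (;;) {
		if (link_up(l))
			return 0;
		if (l->now_us >= deadline)
			return -1;
		l->now_us += sleep_us; /* stands in for the sleep between reads */
	}
}
```

A longer window only helps a device that is slow to train; it does not
help one that never leaves Detect, which is what the experiment is meant
to tell apart.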


> For retry to work, flow must be exactly as handled by present patch,
> that is, we must cut power, disable the clocks, then re-enable
> both clocks and power regulators and go through initialization
> without touching PERST#. Then quirky devices are able to sucessfully
> enumerate.
> 
> No functional change intended for already working devices.
> 
> Signed-off-by: Geraldo Nascimento <geraldogabriel@gmail.com>
> ---
>   drivers/pci/controller/pcie-rockchip-host.c | 47 ++++++++++++++++++---
>   1 file changed, 40 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/pci/controller/pcie-rockchip-host.c b/drivers/pci/controller/pcie-rockchip-host.c
> index 2a1071cd3241..67b3b379d277 100644
> --- a/drivers/pci/controller/pcie-rockchip-host.c
> +++ b/drivers/pci/controller/pcie-rockchip-host.c
> @@ -338,11 +338,14 @@ static int rockchip_pcie_set_vpcie(struct rockchip_pcie *rockchip)
>   static int rockchip_pcie_host_init_port(struct rockchip_pcie *rockchip)
>   {
>   	struct device *dev = rockchip->dev;
> -	int err, i = MAX_LANE_NUM;
> +	int err, i = MAX_LANE_NUM, is_reinit = 0;
>   	u32 status;
>   
> -	gpiod_set_value_cansleep(rockchip->perst_gpio, 0);
> +	if (!is_reinit) {
> +		gpiod_set_value_cansleep(rockchip->perst_gpio, 0);
> +	}
>   
> +reinit:
>   	err = rockchip_pcie_init_port(rockchip);
>   	if (err)
>   		return err;
> @@ -369,16 +372,46 @@ static int rockchip_pcie_host_init_port(struct rockchip_pcie *rockchip)
>   	rockchip_pcie_write(rockchip, PCIE_CLIENT_LINK_TRAIN_ENABLE,
>   			    PCIE_CLIENT_CONFIG);
>   
> -	msleep(PCIE_T_PVPERL_MS);
> -	gpiod_set_value_cansleep(rockchip->perst_gpio, 1);
> -
> -	msleep(PCIE_T_RRS_READY_MS);
> +	if (!is_reinit) {
> +		msleep(PCIE_T_PVPERL_MS);
> +		gpiod_set_value_cansleep(rockchip->perst_gpio, 1);
> +		msleep(PCIE_T_RRS_READY_MS);
> +	}
>   
>   	/* 500ms timeout value should be enough for Gen1/2 training */
>   	err = readl_poll_timeout(rockchip->apb_base + PCIE_CLIENT_BASIC_STATUS1,
>   				 status, PCIE_LINK_UP(status), 20,
>   				 500 * USEC_PER_MSEC);
> -	if (err) {
> +
> +	if (err && !is_reinit) {
> +		while (i--)
> +			phy_power_off(rockchip->phys[i]);
> +		i = MAX_LANE_NUM;
> +		while (i--)
> +			phy_exit(rockchip->phys[i]);
> +		i = MAX_LANE_NUM;
> +		is_reinit = 1;
> +		dev_dbg(dev, "Will reinit PCIe without toggling PERST#");
> +		if (!IS_ERR(rockchip->vpcie12v))
> +			regulator_disable(rockchip->vpcie12v);
> +		if (!IS_ERR(rockchip->vpcie3v3))
> +			regulator_disable(rockchip->vpcie3v3);
> +		regulator_disable(rockchip->vpcie1v8);
> +		regulator_disable(rockchip->vpcie0v9);
> +		rockchip_pcie_disable_clocks(rockchip);
> +		err = rockchip_pcie_enable_clocks(rockchip);
> +		if (err)
> +			return err;
> +		err = rockchip_pcie_set_vpcie(rockchip);
> +		if (err) {
> +			dev_err(dev, "failed to set vpcie regulator\n");
> +			rockchip_pcie_disable_clocks(rockchip);
> +			return err;
> +		}
> +		goto reinit;
> +	}
> +
> +	else if (err) {
>   		dev_err(dev, "PCIe link training gen1 timeout!\n");
>   		goto err_power_off_phy;
>   	}

Re: [RFC PATCH v3 2/3] PCI: rockchip-host: Retry link training on failure without PERST#
Posted by Geraldo Nascimento 2 months, 3 weeks ago
On Fri, Jul 18, 2025 at 09:55:42AM +0800, Shawn Lin wrote:
> Hi Geraldo,
> 
> On Wed, 2025/06/11 at 3:05, Geraldo Nascimento wrote:
> > After almost 30 days of battling with RK3399 buggy PCIe on my Rock Pi
> > N10 through trial-and-error debugging, I finally got positive results
> > with enumeration on the PCI bus for both a Realtek 8111E NIC and a
> > Samsung PM981a SSD.
> > 
> > The NIC was connected to a M.2->PCIe x4 riser card and it would get
> > stuck on Polling.Compliance, without breaking electrical idle on the
> > Host RX side. The Samsung PM981a SSD is directly connected to M.2
> > connector and that SSD is known to be quirky (OEM... no support)
> > and non-functional on the RK3399 platform.
> > 
> > The Samsung SSD was even worse than the NIC - it would get stuck on
> > Detect.Active like a bricked card, even though it was fully functional
> > via USB adapter.
> > 
> > It seems both devices benefit from retrying Link Training if - big if
> > here - PERST# is not toggled during retry.
> > 
> 
> I didn't see this error before especially given RTL8111 NIC is widelly
> used by customers.

Hi Shawn, great to hear from you!

Note that my board exposes PCIe only via the NVMe connector, not
directly via a proper PCIe connector, so I have to adapt it with an
inexpensive riser card that exposes a proper PCIe connector.

I say this because, while I don't doubt that the RTL8111 NIC works
out-of-the-box on boards that directly expose a PCIe connector, the
combination of riser card plus NIC has an effect similar - though not
entirely equal, as described above - to that of connecting known-good
SSDs that simply refuse to work with Rockchip-IP PCIe.

I admit that patch 1 looks a little crazy, but it has the effect of
enabling use of presently non-working devices or combinations of
devices on this IP, at least on the board I have access to.

> 
> Could you help tried this?
> [1] apply your patch 3 first

Sure, I'm always open for testing, but could you clarify the patch 3
part? AFAIK this series of mine only has 2 patches, so I'm a little
confused about exactly which patch to apply as a preliminary step.

Also, since you're asking me to test some code, I think it is only fair
if I ask you to test my code, too. It shouldn't be too hard for you to
find an otherwise working NVMe SSD that refuses to complete link training
with current code. Connect this SSD please to a RK3399 board and let us
know if my proposed code change does anything to ameliorate the
long-standing issue of SSD that refuses to cooperate.

Thank you,
Geraldo Nascimento
Re: [RFC PATCH v3 2/3] PCI: rockchip-host: Retry link training on failure without PERST#
Posted by Shawn Lin 2 months, 3 weeks ago
On Fri, 2025/07/18 at 11:33, Geraldo Nascimento wrote:
> On Fri, Jul 18, 2025 at 09:55:42AM +0800, Shawn Lin wrote:
>> Hi Geraldo,
>>
>> On Wed, 2025/06/11 at 3:05, Geraldo Nascimento wrote:
>>> After almost 30 days of battling with RK3399 buggy PCIe on my Rock Pi
>>> N10 through trial-and-error debugging, I finally got positive results
>>> with enumeration on the PCI bus for both a Realtek 8111E NIC and a
>>> Samsung PM981a SSD.
>>>
>>> The NIC was connected to a M.2->PCIe x4 riser card and it would get
>>> stuck on Polling.Compliance, without breaking electrical idle on the
>>> Host RX side. The Samsung PM981a SSD is directly connected to M.2
>>> connector and that SSD is known to be quirky (OEM... no support)
>>> and non-functional on the RK3399 platform.
>>>
>>> The Samsung SSD was even worse than the NIC - it would get stuck on
>>> Detect.Active like a bricked card, even though it was fully functional
>>> via USB adapter.
>>>
>>> It seems both devices benefit from retrying Link Training if - big if
>>> here - PERST# is not toggled during retry.
>>>
>>
>> I didn't see this error before especially given RTL8111 NIC is widelly
>> used by customers.
> 
> Hi Shawn, great to hear from you!
> 
> Notice that my board exposes PCIe only via NVMe connector, and not
> directly via a proper PCIe connector, so it is necessary for me to
> adapt with inexpensive riser card that exposes proper PCIe connector.
> 
> I say this because while I don't doubt that the RTL8111 NIC works
> out-of-the-box for boards that directly expose PCIe connector, the
> combination of riser card plus NIC has a similar effect - though not
> entirely equal, as described above - of connecting known good SSDs
> that simply refuse to work with Rockchip-IP PCIe.
> 
> I admit that patch 1 looks a little crazy, but is has the effect of
> enabling use of presently non-working devices or combination of devices
> on this IP, at least on the board I have access to.
> 
>>
>> Could you help tried this?
>> [1] apply your patch 3 first
> 
> Sure, I'm always open for testing, but could you clarify the patch 3
> part? AFAIK this series of mine only has 2 patches, so I'm a little
> confused about exactly which patch to apply as a preliminary step.

Patch 3 refers to "arm64: dts: rockchip: drop PCIe 3v3 always-on and
boot-on", which lets the kernel fully control the power in case firmware
already enabled it in advance.

> 
> Also, since you're asking me to test some code, I think it is only fair
> if I ask you to test my code, too. It shouldn't be too hard for you to
> find a otherwise working NVMe SSD that refuses to complete link training
> with current code. Connect this SSD please to a RK3399 board and let us
> know if my proposed code change does anything to ameliorate the
> long-standing issue of SSD that refuses to cooperate.

Sure. I don't have a Samsung PM981a SSD now, but I could test all my
SSDs to see if I can find one that won't work.

> 
> Thank you,
> Geraldo Nascimento
> 

Re: [RFC PATCH v3 2/3] PCI: rockchip-host: Retry link training on failure without PERST#
Posted by Geraldo Nascimento 1 week ago
On Fri, Jul 18, 2025 at 11:46:33AM +0800, Shawn Lin wrote:
> On Fri, 2025/07/18 at 11:33, Geraldo Nascimento wrote:
> > 
> > Also, since you're asking me to test some code, I think it is only fair
> > if I ask you to test my code, too. It shouldn't be too hard for you to
> > find a otherwise working NVMe SSD that refuses to complete link training
> > with current code. Connect this SSD please to a RK3399 board and let us
> > know if my proposed code change does anything to ameliorate the
> > long-standing issue of SSD that refuses to cooperate.
> 
> Sure, I don't have Samsung PM981a SSD now, but I could try to test all
> my SSDs to find if I could pick up one that won't work.
>

Hi Shawn,

I haven't heard back from you, so I assume you tested with an SSD that
should work but does not, and that the test failed?

Thanks,
Geraldo Nascimento
Re: [RFC PATCH v3 2/3] PCI: rockchip-host: Retry link training on failure without PERST#
Posted by Geraldo Nascimento 2 months, 3 weeks ago
On Fri, Jul 18, 2025 at 11:46:33AM +0800, Shawn Lin wrote:
> On Fri, 2025/07/18 at 11:33, Geraldo Nascimento wrote:
> > On Fri, Jul 18, 2025 at 09:55:42AM +0800, Shawn Lin wrote:
> >> Could you help tried this?
> >> [1] apply your patch 3 first
> > 
> > Sure, I'm always open for testing, but could you clarify the patch 3
> > part? AFAIK this series of mine only has 2 patches, so I'm a little
> > confused about exactly which patch to apply as a preliminary step.
> 
> Patch 3 refers to "arm64: dts: rockchip: drop PCIe 3v3 always-on and
> boot-on" which let kernel fully controller the power in case firmware
> did it in advanced.

Hi Shawn,

I tested your patch but unfortunately it does not work: the PM981a SSD
"plays dead" and 2.5 GT/s training never completes, even with the bigger
timeout.

I hope you get the chance to test my patch soon, because once you share
your results there could be two possible scenarios:

1) Patch does not alleviate problem for you:
   If this is the case, then there's little I can do further and this
   becomes a wild goose chase, so no chance of upstreaming anything and
   I'll just move on to more useful work and leave everybody else to do
   their useful work too.

2) Patch works and previously non-working SSD is now working:
   In this case there's something serious going on and it is our mission
   to find a way to correctly upstream a fix.

Thanks,
Geraldo Nascimento
Re: [RFC PATCH v3 2/3] PCI: rockchip-host: Retry link training on failure without PERST#
Posted by Manivannan Sadhasivam 3 months, 2 weeks ago
On Tue, Jun 10, 2025 at 04:05:40PM -0300, Geraldo Nascimento wrote:
> After almost 30 days of battling with RK3399 buggy PCIe on my Rock Pi
> N10 through trial-and-error debugging, I finally got positive results
> with enumeration on the PCI bus for both a Realtek 8111E NIC and a
> Samsung PM981a SSD.
> 
> The NIC was connected to a M.2->PCIe x4 riser card and it would get
> stuck on Polling.Compliance, without breaking electrical idle on the
> Host RX side. The Samsung PM981a SSD is directly connected to M.2
> connector and that SSD is known to be quirky (OEM... no support)
> and non-functional on the RK3399 platform.
> 
> The Samsung SSD was even worse than the NIC - it would get stuck on
> Detect.Active like a bricked card, even though it was fully functional
> via USB adapter.
> 
> It seems both devices benefit from retrying Link Training if - big if
> here - PERST# is not toggled during retry.
> 
> For retry to work, flow must be exactly as handled by present patch,
> that is, we must cut power, disable the clocks, then re-enable
> both clocks and power regulators and go through initialization
> without touching PERST#. Then quirky devices are able to sucessfully
> enumerate.
> 

This sounds weird. PERST# is just an indication to the device that power and
refclk are applied or about to be removed. The device uses PERST# to prepare
for power removal on assert and to start functioning after deassert.

It looks like the PERST# polarity is inverted in your case. Could you please
change the 'ep-gpios' polarity to GPIO_ACTIVE_LOW and see if it fixes the issue
without this patch?

If that didn't work, could you please drop the 'ep-gpios' property and check?
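
The polarity experiment suggested above relies on how gpiolib maps the
logical value passed to gpiod_set_value_cansleep() onto the physical
line: with GPIO_ACTIVE_LOW in the device tree, a logical 1 drives the
wire low. A minimal userspace model of that mapping follows, assuming
the board's DT currently flags 'ep-gpios' as GPIO_ACTIVE_HIGH and that
PERST# is asserted when the wire is low. The helper names
`physical_level()` and `perst_asserted()` are hypothetical.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the logical-to-physical mapping applied by
 * gpiolib: an active-low line inverts the requested logical value.
 */
static bool physical_level(bool logical_value, bool active_low)
{
	return active_low ? !logical_value : logical_value;
}

/* PERST# is an active-low signal: it is asserted while the wire is low. */
static bool perst_asserted(bool logical_value, bool active_low)
{
	return !physical_level(logical_value, active_low);
}
```

Under these assumptions, the driver's initial write of 0 asserts PERST#
with an active-high flag and deasserts it with an active-low flag, so
flipping the flag swaps the meaning of every write in the init path.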

> No functional change intended for already working devices.
> 
> Signed-off-by: Geraldo Nascimento <geraldogabriel@gmail.com>
> ---
>  drivers/pci/controller/pcie-rockchip-host.c | 47 ++++++++++++++++++---
>  1 file changed, 40 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/pci/controller/pcie-rockchip-host.c b/drivers/pci/controller/pcie-rockchip-host.c
> index 2a1071cd3241..67b3b379d277 100644
> --- a/drivers/pci/controller/pcie-rockchip-host.c
> +++ b/drivers/pci/controller/pcie-rockchip-host.c
> @@ -338,11 +338,14 @@ static int rockchip_pcie_set_vpcie(struct rockchip_pcie *rockchip)
>  static int rockchip_pcie_host_init_port(struct rockchip_pcie *rockchip)
>  {
>  	struct device *dev = rockchip->dev;
> -	int err, i = MAX_LANE_NUM;
> +	int err, i = MAX_LANE_NUM, is_reinit = 0;
>  	u32 status;
>  
> -	gpiod_set_value_cansleep(rockchip->perst_gpio, 0);
> +	if (!is_reinit) {
> +		gpiod_set_value_cansleep(rockchip->perst_gpio, 0);
> +	}
>  
> +reinit:

So this reinit part only skips the PERST# assert, but calls
rockchip_pcie_init_port() which resets the Root Port including PHY. I don't
think it is safe to do it if PERST# is wired.

- Mani

-- 
மணிவண்ணன் சதாசிவம்
Re: [RFC PATCH v3 2/3] PCI: rockchip-host: Retry link training on failure without PERST#
Posted by Geraldo Nascimento 3 months, 2 weeks ago
On Mon, Jun 23, 2025 at 05:29:46AM -0600, Manivannan Sadhasivam wrote:
> On Tue, Jun 10, 2025 at 04:05:40PM -0300, Geraldo Nascimento wrote:
> > After almost 30 days of battling with RK3399 buggy PCIe on my Rock Pi
> > N10 through trial-and-error debugging, I finally got positive results
> > with enumeration on the PCI bus for both a Realtek 8111E NIC and a
> > Samsung PM981a SSD.
> > 
> > The NIC was connected to a M.2->PCIe x4 riser card and it would get
> > stuck on Polling.Compliance, without breaking electrical idle on the
> > Host RX side. The Samsung PM981a SSD is directly connected to M.2
> > connector and that SSD is known to be quirky (OEM... no support)
> > and non-functional on the RK3399 platform.
> > 
> > The Samsung SSD was even worse than the NIC - it would get stuck on
> > Detect.Active like a bricked card, even though it was fully functional
> > via USB adapter.
> > 
> > It seems both devices benefit from retrying Link Training if - big if
> > here - PERST# is not toggled during retry.
> > 
> > For retry to work, flow must be exactly as handled by present patch,
> > that is, we must cut power, disable the clocks, then re-enable
> > both clocks and power regulators and go through initialization
> > without touching PERST#. Then quirky devices are able to sucessfully
> > enumerate.
> > 
> 
> This sounds weird. PERST# is just an indication to the device that the power and
> refclk are applied or going to be removed. The devices uses PERST# to prepare
> for the power removal during assert and start functioning after deassert.

Hi Mani! Thank you for looking into this.

Yeah, tell me about it, it is beyond weird. I posted this RFC patch in
the hope that someone with access to a PCIe analyzer could take a deeper
look at what the heck is going on here - because it does work, but I
don't claim to understand how.

> 
> It looks like the PERST# polarity is inverted in your case. Could you please
> change the 'ep-gpios' polarity to GPIO_ACTIVE_LOW and see if it fixes the issue
> without this patch?
> 
> If that didn't work, could you please drop the 'ep-gpios' property and check?

Sorry to decline your request, but I assure you I have tried many
other combinations before reaching the present patch, including your
suggestion. It does nothing: it won't make an SSD that refuses to work
with RK3399 start working. Note that this isn't specific to my board -
RK3399 is infamous for being picky about devices.

> 
> > No functional change intended for already working devices.
> > 
> > Signed-off-by: Geraldo Nascimento <geraldogabriel@gmail.com>
> > ---
> >  drivers/pci/controller/pcie-rockchip-host.c | 47 ++++++++++++++++++---
> >  1 file changed, 40 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/pci/controller/pcie-rockchip-host.c b/drivers/pci/controller/pcie-rockchip-host.c
> > index 2a1071cd3241..67b3b379d277 100644
> > --- a/drivers/pci/controller/pcie-rockchip-host.c
> > +++ b/drivers/pci/controller/pcie-rockchip-host.c
> > @@ -338,11 +338,14 @@ static int rockchip_pcie_set_vpcie(struct rockchip_pcie *rockchip)
> >  static int rockchip_pcie_host_init_port(struct rockchip_pcie *rockchip)
> >  {
> >  	struct device *dev = rockchip->dev;
> > -	int err, i = MAX_LANE_NUM;
> > +	int err, i = MAX_LANE_NUM, is_reinit = 0;
> >  	u32 status;
> >  
> > -	gpiod_set_value_cansleep(rockchip->perst_gpio, 0);
> > +	if (!is_reinit) {
> > +		gpiod_set_value_cansleep(rockchip->perst_gpio, 0);
> > +	}
> >  
> > +reinit:
> 
> So this reinit part only skips the PERST# assert, but calls
> rockchip_pcie_init_port() which resets the Root Port including PHY. I don't
> think it is safe to do it if PERST# is wired.

I don't understand, could you be a bit more verbose on why you think
this is dangerous?

Thanks,
Geraldo Nascimento

> 
> - Mani
> 
> -- 
> மணிவண்ணன் சதாசிவம்
Re: [RFC PATCH v3 2/3] PCI: rockchip-host: Retry link training on failure without PERST#
Posted by Manivannan Sadhasivam 2 months, 3 weeks ago
On Mon, Jun 23, 2025 at 08:44:49AM GMT, Geraldo Nascimento wrote:
> On Mon, Jun 23, 2025 at 05:29:46AM -0600, Manivannan Sadhasivam wrote:
> > On Tue, Jun 10, 2025 at 04:05:40PM -0300, Geraldo Nascimento wrote:
> > > After almost 30 days of battling with RK3399 buggy PCIe on my Rock Pi
> > > N10 through trial-and-error debugging, I finally got positive results
> > > with enumeration on the PCI bus for both a Realtek 8111E NIC and a
> > > Samsung PM981a SSD.
> > > 
> > > The NIC was connected to a M.2->PCIe x4 riser card and it would get
> > > stuck on Polling.Compliance, without breaking electrical idle on the
> > > Host RX side. The Samsung PM981a SSD is directly connected to M.2
> > > connector and that SSD is known to be quirky (OEM... no support)
> > > and non-functional on the RK3399 platform.
> > > 
> > > The Samsung SSD was even worse than the NIC - it would get stuck on
> > > Detect.Active like a bricked card, even though it was fully functional
> > > via USB adapter.
> > > 
> > > It seems both devices benefit from retrying Link Training if - big if
> > > here - PERST# is not toggled during retry.
> > > 
> > > For retry to work, flow must be exactly as handled by present patch,
> > > that is, we must cut power, disable the clocks, then re-enable
> > > both clocks and power regulators and go through initialization
> > > without touching PERST#. Then quirky devices are able to sucessfully
> > > enumerate.
> > > 
> > 
> > This sounds weird. PERST# is just an indication to the device that the power and
> > refclk are applied or going to be removed. The devices uses PERST# to prepare
> > for the power removal during assert and start functioning after deassert.
> 
> Hi Mani! Thank you for looking into this.
> 
> Yeah, tell me about it, it is beyond weird. I posted RFC Patch in the
> hopes someone with access to PCIe Analyzer could have deeper look
> at what the heck is going on here - because it does work, but I don't
> claim to understand how.
> 

I was hoping that the Rockchip folks would chime in, but no reply from them so
far.

@Shawn: Could you please shed some light here?

> > 
> > It looks like the PERST# polarity is inverted in your case. Could you please
> > change the 'ep-gpios' polarity to GPIO_ACTIVE_LOW and see if it fixes the issue
> > without this patch?
> > 
> > If that didn't work, could you please drop the 'ep-gpios' property and check?
> 
> Sorry to decline your request, but I assure you I have tried many
> other combinations before reaching present patch, including your
> suggestion. It will do nothing. It won't work, won't make SSD that
> refuse to work with RK3399, working. Note that this isn't specific
> to my board - RK3399 is infamous for being picky about devices.
> 
> > 
> > > No functional change intended for already working devices.
> > > 
> > > Signed-off-by: Geraldo Nascimento <geraldogabriel@gmail.com>
> > > ---
> > >  drivers/pci/controller/pcie-rockchip-host.c | 47 ++++++++++++++++++---
> > >  1 file changed, 40 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/drivers/pci/controller/pcie-rockchip-host.c b/drivers/pci/controller/pcie-rockchip-host.c
> > > index 2a1071cd3241..67b3b379d277 100644
> > > --- a/drivers/pci/controller/pcie-rockchip-host.c
> > > +++ b/drivers/pci/controller/pcie-rockchip-host.c
> > > @@ -338,11 +338,14 @@ static int rockchip_pcie_set_vpcie(struct rockchip_pcie *rockchip)
> > >  static int rockchip_pcie_host_init_port(struct rockchip_pcie *rockchip)
> > >  {
> > >  	struct device *dev = rockchip->dev;
> > > -	int err, i = MAX_LANE_NUM;
> > > +	int err, i = MAX_LANE_NUM, is_reinit = 0;
> > >  	u32 status;
> > >  
> > > -	gpiod_set_value_cansleep(rockchip->perst_gpio, 0);
> > > +	if (!is_reinit) {
> > > +		gpiod_set_value_cansleep(rockchip->perst_gpio, 0);
> > > +	}
> > >  
> > > +reinit:
> > 
> > So this reinit part only skips the PERST# assert, but calls
> > rockchip_pcie_init_port() which resets the Root Port including PHY. I don't
> > think it is safe to do it if PERST# is wired.
> 
> I don't understand, could you be a bit more verbose on why do you
> think this is dangerous?
> 

When the Root Port and PHY get reset, there is a good chance that the refclk
will also be cut off. If that happens without PERST# being asserted, the device
has no chance to clean up its state machine. If the device gets its own refclk,
then it is a different story, but we should not make assumptions.

- Mani

-- 
மணிவண்ணன் சதாசிவம்
Re: [RFC PATCH v3 2/3] PCI: rockchip-host: Retry link training on failure without PERST#
Posted by Geraldo Nascimento 2 months, 3 weeks ago
On Thu, Jul 17, 2025 at 05:59:32PM +0530, Manivannan Sadhasivam wrote:
> On Mon, Jun 23, 2025 at 08:44:49AM GMT, Geraldo Nascimento wrote:
> > On Mon, Jun 23, 2025 at 05:29:46AM -0600, Manivannan Sadhasivam wrote:
> > > On Tue, Jun 10, 2025 at 04:05:40PM -0300, Geraldo Nascimento wrote:
> > > > +reinit:
> > > 
> > > So this reinit part only skips the PERST# assert, but calls
> > > rockchip_pcie_init_port() which resets the Root Port including PHY. I don't
> > > think it is safe to do it if PERST# is wired.
> > 
> > I don't understand, could you be a bit more verbose on why do you
> > think this is dangerous?
> > 
> 
> When the Root Port and PHY gets reset, there is a good chance that the refclk
> would also be cutoff. So if that happens without PERST# assert, then the device
> has no chance to clean its state machine. If the device gets its own refclk,
> then it is a different story, but we should not make assumptions.

Hi Mani, thank you for your time spent looking into this!

I'm not sure if the following information helps, but patch 2 of this
series disables the PCIe 3.3V always-on/boot-on through DT. That was
not incidental, and in fact it is required for patch 1 to work.

Then, if you follow the proposed code change, you will see that power
is effectively cut via disabling the power regulators, even before
disabling the clocks. So there's effectively zero chance of corrupting
the endpoint device state machine, since the device is power-cycled.
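
The two positions in this exchange can be put side by side in a tiny
userspace model. It is a sketch only: `struct ep_view` and
`refclk_cut_is_risky()` are hypothetical names, and the model assumes
that the regulators disabled by the patch really do remove power from
the endpoint, which depends on the board wiring and on the always-on DT
change mentioned above.

```c
#include <stdbool.h>

/* Hypothetical endpoint-side view; field names are illustrative only. */
struct ep_view {
	bool powered;          /* slot supply rails are up */
	bool perst_deasserted; /* device is out of reset */
};

/*
 * Returns true when removing refclk would hit a device that is both
 * powered and out of reset, i.e. the case the review comment warns
 * about. A device that is unpowered or held in reset is not at risk
 * in this model.
 */
static bool refclk_cut_is_risky(const struct ep_view *ep)
{
	return ep->powered && ep->perst_deasserted;
}
```

With the regulators off before the clocks are disabled, the model sees
an unpowered device and reports no risk; a Root Port or PHY reset
without that power cut is the risky case.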

While I understand we should not make assumptions in kernel work, and
that the patch is unmergeable in its current form (it's a goddamn hack),
it does empirically alleviate a very real problem: known-good working
devices refusing to cooperate with Rockchip-IP PCIe.

I agree we should wait on Shawn Lin's feedback.

Thank you,
Geraldo Nascimento

> 
> - Mani
> 
> -- 
> மணிவண்ணன் சதாசிவம்