[PATCH] net: atlantic: always deep reset on pm op, fixing null deref regression

Manuel Ullmann posted 1 patch 4 years ago
There is a newer version of this series
drivers/net/ethernet/aquantia/atlantic/aq_pci_func.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
[PATCH] net: atlantic: always deep reset on pm op, fixing null deref regression
Posted by Manuel Ullmann 4 years ago
From a3eccd32c618fe4b4f5c537cd83ba5611149623e Mon Sep 17 00:00:00 2001
Date: Wed, 4 May 2022 21:30:44 +0200

The impact of this regression is the same for resume that I saw on
thaw: the kernel hangs and nothing except SysRq rebooting can be done.

The null deref occurs at the same position as on thaw.
BUG: kernel NULL pointer dereference
RIP: aq_ring_rx_fill+0xcf/0x210 [atlantic]

Fixes regression in cbe6c3a8f8f4 ("net: atlantic: invert deep par in
pm functions, preventing null derefs"), where I disabled deep pm
resets in suspend and resume, trying to make sense of the
atl_resume_common deep parameter in the first place.

It turns out, that atlantic always has to deep reset on pm operations
and the parameter is useless. Even though I expected that and tested
resume, I screwed up by kexec-rebooting into an unpatched kernel, thus
missing the breakage.

This fixup obsoletes the deep parameter of atl_resume_common, but I
leave the cleanup for the maintainers to post to mainline.

PS: I'm very sorry for this regression.

Fixes: cbe6c3a8f8f4315b96e46e1a1c70393c06d95a4c
Link: https://lore.kernel.org/regressions/9-Ehc_xXSwdXcvZqKD5aSqsqeNj5Izco4MYEwnx5cySXVEc9-x_WC4C3kAoCqNTi-H38frroUK17iobNVnkLtW36V6VWGSQEOHXhmVMm5iQ=@protonmail.com/
Reported-by: Jordan Leppert <jordanleppert@protonmail.com>
Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
CC: <stable@vger.kernel.org> # 5.17.5
CC: <stable@vger.kernel.org> # 5.15.36
CC: <stable@vger.kernel.org> # 5.10.113
Signed-off-by: Manuel ULlmann <labre@posteo.de>
---
 drivers/net/ethernet/aquantia/atlantic/aq_pci_func.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_pci_func.c b/drivers/net/ethernet/aquantia/atlantic/aq_pci_func.c
index 3a529ee8c834..831833911a52 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_pci_func.c
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_pci_func.c
@@ -449,7 +449,7 @@ static int aq_pm_freeze(struct device *dev)
 
 static int aq_pm_suspend_poweroff(struct device *dev)
 {
-	return aq_suspend_common(dev, false);
+	return aq_suspend_common(dev, true);
 }
 
 static int aq_pm_thaw(struct device *dev)
@@ -459,7 +459,7 @@ static int aq_pm_thaw(struct device *dev)
 
 static int aq_pm_resume_restore(struct device *dev)
 {
-	return atl_resume_common(dev, false);
+	return atl_resume_common(dev, true);
 }
 
 static const struct dev_pm_ops aq_pm_ops = {

base-commit: 672c0c5173427e6b3e2a9bbb7be51ceeec78093a
-- 
2.35.1
Re: [EXT] [PATCH] net: atlantic: always deep reset on pm op, fixing null deref regression
Posted by Igor Russkikh 4 years ago
> The impact of this regression is the same for resume that I saw on
> thaw: the kernel hangs and nothing except SysRq rebooting can be done.
> 
> The null deref occurs at the same position as on thaw.
> BUG: kernel NULL pointer dereference
> RIP: aq_ring_rx_fill+0xcf/0x210 [atlantic]
> 
> Fixes regression in cbe6c3a8f8f4 ("net: atlantic: invert deep par in
> pm functions, preventing null derefs"), where I disabled deep pm
> resets in suspend and resume, trying to make sense of the
> atl_resume_common deep parameter in the first place.
> 
> It turns out, that atlantic always has to deep reset on pm operations
> and the parameter is useless. Even though I expected that and tested
> resume, I screwed up by kexec-rebooting into an unpatched kernel, thus
> missing the breakage.
> 
> This fixup obsoletes the deep parameter of atl_resume_common, but I
> leave the cleanup for the maintainers to post to mainline.
> 
> PS: I'm very sorry for this regression.

Hi Manuel,

Unfortunately I've missed to review and comment on previous patch - it was too quickly accepted.

I'm still in doubt on your fixes, even after rereading the original problem.
Is it possible for you to test this with all the possible combinations?
suspend/resume with device up/down,
hibernate/restore with device up/down?

I'll try to do the same on our side, but we don't have much resources for that now unfortunately..


> Fixes: cbe6c3a8f8f4315b96e46e1a1c70393c06d95a4c

That tag format is incorrect I think..

Igor
Re: [EXT] [PATCH] net: atlantic: always deep reset on pm op, fixing null deref regression
Posted by Jordan Leppert 4 years ago
With the proposed patch (deep parameter is always true), I've managed to test:
1. Hibernate/restore (with device down/up)
2. Suspend/resume (with device down/up)

I put the device down with the command:
sudo ip link set <connection> down

I hope that's correct, if not please let me know correct command.

Regards,
Jordan


------- Original Message -------
On Thursday, May 5th, 2022 at 08:04, Igor Russkikh <irusskikh@marvell.com> wrote:


>
>
> > The impact of this regression is the same for resume that I saw on
> > thaw: the kernel hangs and nothing except SysRq rebooting can be done.
> >
> > The null deref occurs at the same position as on thaw.
> > BUG: kernel NULL pointer dereference
> > RIP: aq_ring_rx_fill+0xcf/0x210 [atlantic]
> >
> > Fixes regression in cbe6c3a8f8f4 ("net: atlantic: invert deep par in
> > pm functions, preventing null derefs"), where I disabled deep pm
> > resets in suspend and resume, trying to make sense of the
> > atl_resume_common deep parameter in the first place.
> >
> > It turns out, that atlantic always has to deep reset on pm operations
> > and the parameter is useless. Even though I expected that and tested
> > resume, I screwed up by kexec-rebooting into an unpatched kernel, thus
> > missing the breakage.
> >
> > This fixup obsoletes the deep parameter of atl_resume_common, but I
> > leave the cleanup for the maintainers to post to mainline.
> >
> > PS: I'm very sorry for this regression.
>
>
> Hi Manuel,
>
> Unfortunately I've missed to review and comment on previous patch - it was too quickly accepted.
>
> I'm still in doubt on your fixes, even after rereading the original problem.
> Is it possible for you to test this with all the possible combinations?
> suspend/resume with device up/down,
> hibernate/restore with device up/down?
>
> I'll try to do the same on our side, but we don't have much resources for that now unfortunately..
>
> > Fixes: cbe6c3a8f8f4315b96e46e1a1c70393c06d95a4c
>
>
> That tag format is incorrect I think..
>
> Igor
Re: [EXT] [PATCH] net: atlantic: always deep reset on pm op, fixing null deref regression
Posted by Manuel Ullmann 4 years ago
>>
>>
>> > The impact of this regression is the same for resume that I saw on
>> > thaw: the kernel hangs and nothing except SysRq rebooting can be done.
>> >
>> > The null deref occurs at the same position as on thaw.
>> > BUG: kernel NULL pointer dereference
>> > RIP: aq_ring_rx_fill+0xcf/0x210 [atlantic]
>> >
>> > Fixes regression in cbe6c3a8f8f4 ("net: atlantic: invert deep par in
>> > pm functions, preventing null derefs"), where I disabled deep pm
>> > resets in suspend and resume, trying to make sense of the
>> > atl_resume_common deep parameter in the first place.
>> >
>> > It turns out, that atlantic always has to deep reset on pm operations
>> > and the parameter is useless. Even though I expected that and tested
>> > resume, I screwed up by kexec-rebooting into an unpatched kernel, thus
>> > missing the breakage.
>> >
>> > This fixup obsoletes the deep parameter of atl_resume_common, but I
>> > leave the cleanup for the maintainers to post to mainline.
>> >
>> > PS: I'm very sorry for this regression.
>>
>>
>> Hi Manuel,
>>
>> Unfortunately I've missed to review and comment on previous patch - it was too quickly accepted.
>>
>> I'm still in doubt on your fixes, even after rereading the original problem.
>> Is it possible for you to test this with all the possible combinations?
>> suspend/resume with device up/down,
>> hibernate/restore with device up/down?

I confirm that suspend/resume/hibernation/thaw keeps working in all
cases. Thaw would work without the original patch, if the device is down
before hibernation. I also originally described this behaviour on
bugzilla at

https://bugzilla.kernel.org/show_bug.cgi?id=215798

See also Jordan’s confirmation below.

I think, the main reason, why this could break, is, that the deep
parameter had no real impact until the breaking commit. So it was
practically untested, when the allocation/free functions were split.

Another thing, that I tested, was guarding all null pointer references
with null checks, which failed at first, because GCC optimized them
out. I think I have the atlantic tree for this (bad) fix attempt
floating around. I can try to rebase and create a patch from this and
post it to the Github issue, if you are interested.
https://github.com/Aquantia/AQtion/issues/32

Don’t have time for this before the weekend though.

>> I'll try to do the same on our side, but we don't have much resources for that now unfortunately..
>>
>> > Fixes: cbe6c3a8f8f4315b96e46e1a1c70393c06d95a4c
>>
>>
>> That tag format is incorrect I think..

Thanks for pointing that out. Also, are those stable Cc tags correct?
Because I figured, that the x in the documentation could be also the
branch name and not a placeholder. Should I resend the patch, fixing the
tags? Won’t get to it before tomorrow, though.

>> Igor

> With the proposed patch (deep parameter is always true), I've managed to test:
> 1. Hibernate/restore (with device down/up)
> 2. Suspend/resume (with device down/up)
>
> I put the device down with the command:
> sudo ip link set <connection> down
>
> I hope that's correct, if not please let me know correct command.

This should be the correct.

Manuel
Re: [PATCH] net: atlantic: always deep reset on pm op, fixing null deref regression
Posted by Holger Hoffstätte 4 years ago
On 2022-05-05 00:06, Manuel Ullmann wrote:
>  From a3eccd32c618fe4b4f5c537cd83ba5611149623e Mon Sep 17 00:00:00 2001
> Date: Wed, 4 May 2022 21:30:44 +0200
> 
> The impact of this regression is the same for resume that I saw on
> thaw: the kernel hangs and nothing except SysRq rebooting can be done.
> 
> The null deref occurs at the same position as on thaw.
> BUG: kernel NULL pointer dereference
> RIP: aq_ring_rx_fill+0xcf/0x210 [atlantic]
> 
> Fixes regression in cbe6c3a8f8f4 ("net: atlantic: invert deep par in
> pm functions, preventing null derefs"), where I disabled deep pm
> resets in suspend and resume, trying to make sense of the
> atl_resume_common deep parameter in the first place.
> 
> It turns out, that atlantic always has to deep reset on pm operations
> and the parameter is useless. Even though I expected that and tested
> resume, I screwed up by kexec-rebooting into an unpatched kernel, thus
> missing the breakage.
> 
> This fixup obsoletes the deep parameter of atl_resume_common, but I
> leave the cleanup for the maintainers to post to mainline.
> 
> PS: I'm very sorry for this regression.
> 
> Fixes: cbe6c3a8f8f4315b96e46e1a1c70393c06d95a4c
> Link: https://lore.kernel.org/regressions/9-Ehc_xXSwdXcvZqKD5aSqsqeNj5Izco4MYEwnx5cySXVEc9-x_WC4C3kAoCqNTi-H38frroUK17iobNVnkLtW36V6VWGSQEOHXhmVMm5iQ=@protonmail.com/
> Reported-by: Jordan Leppert <jordanleppert@protonmail.com>
> Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
> CC: <stable@vger.kernel.org> # 5.17.5
> CC: <stable@vger.kernel.org> # 5.15.36
> CC: <stable@vger.kernel.org> # 5.10.113
> Signed-off-by: Manuel ULlmann <labre@posteo.de>
> ---
>   drivers/net/ethernet/aquantia/atlantic/aq_pci_func.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_pci_func.c b/drivers/net/ethernet/aquantia/atlantic/aq_pci_func.c
> index 3a529ee8c834..831833911a52 100644
> --- a/drivers/net/ethernet/aquantia/atlantic/aq_pci_func.c
> +++ b/drivers/net/ethernet/aquantia/atlantic/aq_pci_func.c
> @@ -449,7 +449,7 @@ static int aq_pm_freeze(struct device *dev)
>   
>   static int aq_pm_suspend_poweroff(struct device *dev)
>   {
> -	return aq_suspend_common(dev, false);
> +	return aq_suspend_common(dev, true);
>   }
>   
>   static int aq_pm_thaw(struct device *dev)
> @@ -459,7 +459,7 @@ static int aq_pm_thaw(struct device *dev)
>   
>   static int aq_pm_resume_restore(struct device *dev)
>   {
> -	return atl_resume_common(dev, false);
> +	return atl_resume_common(dev, true);
>   }
>   
>   static const struct dev_pm_ops aq_pm_ops = {
> 
> base-commit: 672c0c5173427e6b3e2a9bbb7be51ceeec78093a
> 

As mentioned in the discusson thread this reliably restores resume
for me, so:

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>

Thanks!
Holger