[PATCH net v2 0/2] netconsole: Fix userdata race condition

Gustavo Luiz Duarte posted 2 patches 3 months, 2 weeks ago
drivers/net/netconsole.c                           |  5 ++
tools/testing/selftests/drivers/net/Makefile       |  1 +
.../selftests/drivers/net/netcons_race_userdata.sh | 87 ++++++++++++++++++++++
3 files changed, 93 insertions(+)
[PATCH net v2 0/2] netconsole: Fix userdata race condition
Posted by Gustavo Luiz Duarte 3 months, 2 weeks ago
This series fixes a race condition in netconsole's userdata handling
where concurrent message transmission could read partially updated
userdata fields, resulting in corrupted netconsole output.

The first patch adds a selftest that reproduces the race condition by
continuously sending messages while rapidly changing userdata values,
detecting any torn reads in the output.

The second patch fixes the issue by ensuring update_userdata() holds
the target_list_lock while updating both extradata_complete and
userdata_length, preventing readers from seeing inconsistent state.

This targets net tree as it fixes a bug introduced in commit df03f830d099
("net: netconsole: cache userdata formatted string in netconsole_target").

Signed-off-by: Gustavo Luiz Duarte <gustavold@gmail.com>

Changes in v2:
- Added testcase to Makefile.
- Reordered fix and testcase to avoid failure in CI.
- testcase: delay cleanup until child process are killed, plus shellcheck fixes.
- Link to v1: https://lore.kernel.org/all/20251020-netconsole-fix-race-v1-0-b775be30ee8a@gmail.com/

---
Gustavo Luiz Duarte (2):
      netconsole: Fix race condition in between reader and writer of userdata
      selftests: netconsole: Add race condition test for userdata corruption

 drivers/net/netconsole.c                           |  5 ++
 tools/testing/selftests/drivers/net/Makefile       |  1 +
 .../selftests/drivers/net/netcons_race_userdata.sh | 87 ++++++++++++++++++++++
 3 files changed, 93 insertions(+)
---
base-commit: d63f0391d6c7b75e1a847e1a26349fa8cad0004d
change-id: 20251020-netconsole-fix-race-f465f37b57ea

Best regards,
-- 
Gustavo Duarte <gustavold@meta.com>
Re: [PATCH net v2 0/2] netconsole: Fix userdata race condition
Posted by Jakub Kicinski 3 months, 2 weeks ago
On Wed, 22 Oct 2025 10:39:56 -0700 Gustavo Luiz Duarte wrote:
> This series fixes a race condition in netconsole's userdata handling
> where concurrent message transmission could read partially updated
> userdata fields, resulting in corrupted netconsole output.
> 
> The first patch adds a selftest that reproduces the race condition by
> continuously sending messages while rapidly changing userdata values,
> detecting any torn reads in the output.
> 
> The second patch fixes the issue by ensuring update_userdata() holds
> the target_list_lock while updating both extradata_complete and
> userdata_length, preventing readers from seeing inconsistent state.
> 
> This targets net tree as it fixes a bug introduced in commit df03f830d099
> ("net: netconsole: cache userdata formatted string in netconsole_target").

This test is skipping on debug kernel builds in netdev CI.

TAP version 13
1..1
# overriding timeout to 360
# selftests: drivers/net: netcons_race_userdata.sh
# socat died before we could check 10000 messages. Skipping test.
ok 1 selftests: drivers/net: netcons_race_userdata.sh # SKIP

We can't have skips for SW tests.

I think Breno was fighting with a similar problem in the past.
Not sure what he ended up doing. Maybe just leave it at the print?
Don't actually mark the test as skipped?

Slightly more advanced option is to only do that if KSFT_MACHINE_SLOW
per:
https://github.com/linux-netdev/nipa/wiki/How-to-run-netdev-selftests-CI-style#dealing-with-slow-runners-in-performancelatency-tests
-- 
pw-bot: cr
Re: [PATCH net v2 0/2] netconsole: Fix userdata race condition
Posted by Gustavo Luiz Duarte 3 months, 2 weeks ago
On Wed, Oct 22, 2025 at 10:01 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 22 Oct 2025 10:39:56 -0700 Gustavo Luiz Duarte wrote:
> > This series fixes a race condition in netconsole's userdata handling
> > where concurrent message transmission could read partially updated
> > userdata fields, resulting in corrupted netconsole output.
> >
> > The first patch adds a selftest that reproduces the race condition by
> > continuously sending messages while rapidly changing userdata values,
> > detecting any torn reads in the output.
> >
> > The second patch fixes the issue by ensuring update_userdata() holds
> > the target_list_lock while updating both extradata_complete and
> > userdata_length, preventing readers from seeing inconsistent state.
> >
> > This targets net tree as it fixes a bug introduced in commit df03f830d099
> > ("net: netconsole: cache userdata formatted string in netconsole_target").
>
> This test is skipping on debug kernel builds in netdev CI.
>
> TAP version 13
> 1..1
> # overriding timeout to 360
> # selftests: drivers/net: netcons_race_userdata.sh
> # socat died before we could check 10000 messages. Skipping test.
> ok 1 selftests: drivers/net: netcons_race_userdata.sh # SKIP
>
> We can't have skips for SW tests.
>
> I think Breno was fighting with a similar problem in the past.
> Not sure what he ended up doing. Maybe just leave it at the print?
> Don't actually mark the test as skipped?
>
> Slightly more advanced option is to only do that if KSFT_MACHINE_SLOW
> per:
> https://github.com/linux-netdev/nipa/wiki/How-to-run-netdev-selftests-CI-style#dealing-with-slow-runners-in-performancelatency-tests

There are two reasons for hitting this skip.
1. The hardcoded 2s timeout in listen_port_and_save_to() expired
2. socat died or failed to start for mysterious reasons

#1 should probably be a success (we ran the test for this long and no
corruption found), and for #2 we can try to return whatever exit code
socat give us.
Retrieving socat return code is a bit tricky because we are running it
in a subshell, but we can save it in a temp file.

I can also send a follow up patch to use a longer timeout in
listen_port_and_save_to() if KSFT_MACHINE_SLOW


> --
> pw-bot: cr
Re: [PATCH net v2 0/2] netconsole: Fix userdata race condition
Posted by Jakub Kicinski 3 months, 2 weeks ago
On Fri, 24 Oct 2025 18:10:06 -0300 Gustavo Luiz Duarte wrote:
> There are two reasons for hitting this skip.
> 1. The hardcoded 2s timeout in listen_port_and_save_to() expired
> 2. socat died or failed to start for mysterious reasons
> 
> #1 should probably be a success (we ran the test for this long and no
> corruption found), and for #2 we can try to return whatever exit code
> socat give us.
> Retrieving socat return code is a bit tricky because we are running it
> in a subshell, but we can save it in a temp file.
> 
> I can also send a follow up patch to use a longer timeout in
> listen_port_and_save_to() if KSFT_MACHINE_SLOW

Frankly I'm not sure this test is worth the compute cycles it will burn.
It's a direct repro for a very specific problem. The changes it will
occur again for the same field a pretty low. Maybe just repost patch 1?
Re: [PATCH net v2 0/2] netconsole: Fix userdata race condition
Posted by Gustavo Luiz Duarte 3 months, 1 week ago
On Fri, Oct 24, 2025 at 8:42 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> Frankly I'm not sure this test is worth the compute cycles it will burn.
> It's a direct repro for a very specific problem. The changes it will
> occur again for the same field a pretty low. Maybe just repost patch 1?

Fair enough. I will repost with only patch 1 then.