[PATCH v4 05/24] Revert "replay: stop us hanging in rr_wait_io_event"

Nicholas Piggin posted 24 patches 1 year, 9 months ago
Maintainers: Richard Henderson <richard.henderson@linaro.org>, Paolo Bonzini <pbonzini@redhat.com>, "Marc-André Lureau" <marcandre.lureau@redhat.com>, Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru>, "Michael S. Tsirkin" <mst@redhat.com>, Jason Wang <jasowang@redhat.com>, Nicholas Piggin <npiggin@gmail.com>, Daniel Henrique Barboza <danielhb413@gmail.com>, "Cédric Le Goater" <clg@kaod.org>, David Gibson <david@gibson.dropbear.id.au>, Harsh Prateek Bora <harshpb@linux.ibm.com>, Peter Xu <peterx@redhat.com>, Fabiano Rosas <farosas@suse.de>, John Snow <jsnow@redhat.com>, Cleber Rosa <crosa@redhat.com>, "Philippe Mathieu-Daudé" <philmd@linaro.org>, Wainer dos Santos Moschetta <wainersm@redhat.com>, Beraldo Leal <bleal@redhat.com>
There is a newer version of this series
[PATCH v4 05/24] Revert "replay: stop us hanging in rr_wait_io_event"
Posted by Nicholas Piggin 1 year, 9 months ago
This reverts commit 1f881ea4a444ef36a8b6907b0b82be4b3af253a2.

That commit causes reverse_debugging.py test failures, and does
not seem to solve the root cause of the problem: x86-64 still
hangs in record/replay tests.

The problem with short-cutting the iowait that was taken during
record phase is that related events will not get consumed at the
same points (e.g., reading the clock).

A hang with zero icount always seems to be a symptom of an earlier
problem that has caused the recording to become out of synch with
the execution and consumption of events by replay.
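
To make that concrete, here is a minimal, self-contained sketch (this
is not QEMU code -- the event kinds, log layout and all names below are
invented purely for illustration): the recording takes an iowait
checkpoint before the guest reads the clock, so a replay that skips the
equivalent wait pulls the wrong event out of the log.

  /*
   * Illustrative sketch only, not QEMU code: all names here are made up.
   * The "recording" below took an iowait (checkpoint event) before the
   * guest read the clock, so replay must consume the checkpoint at the
   * same point or the clock read returns the wrong log entry.
   */
  #include <assert.h>
  #include <stdio.h>

  enum ev_kind { EV_CHECKPOINT, EV_CLOCK };
  struct ev { enum ev_kind kind; long value; };

  static const struct ev log_events[] = {
      { EV_CHECKPOINT, 0 },     /* written while the recorded vCPU idled   */
      { EV_CLOCK, 12345 },      /* the clock value the guest then observed */
  };
  static unsigned log_pos;

  static const struct ev *next_event(void)
  {
      return &log_events[log_pos++];
  }

  /* Replay that waits at the same point consumes the checkpoint first... */
  static long replay_clock_in_sync(void)
  {
      const struct ev *cp = next_event();           /* consumed in iowait */
      assert(cp->kind == EV_CHECKPOINT);
      const struct ev *e = next_event();
      assert(e->kind == EV_CLOCK);
      return e->value;
  }

  /* ...but a replay that short-cuts the wait reads the wrong event here. */
  static long replay_clock_skipping_wait(void)
  {
      const struct ev *e = next_event();            /* gets EV_CHECKPOINT */
      return e->kind == EV_CLOCK ? e->value : -1;   /* out of sync */
  }

  int main(void)
  {
      printf("in sync:       %ld\n", replay_clock_in_sync());
      log_pos = 0;
      printf("skipping wait: %ld\n", replay_clock_skipping_wait());
      return 0;
  }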

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 include/sysemu/replay.h      |  5 -----
 accel/tcg/tcg-accel-ops-rr.c |  2 +-
 replay/replay.c              | 21 ---------------------
 3 files changed, 1 insertion(+), 27 deletions(-)

diff --git a/include/sysemu/replay.h b/include/sysemu/replay.h
index f229b2109c..8102fa54f0 100644
--- a/include/sysemu/replay.h
+++ b/include/sysemu/replay.h
@@ -73,11 +73,6 @@ int replay_get_instructions(void);
 /*! Updates instructions counter in replay mode. */
 void replay_account_executed_instructions(void);
 
-/**
- * replay_can_wait: check if we should pause for wait-io
- */
-bool replay_can_wait(void);
-
 /* Processing clocks and other time sources */
 
 /*! Save the specified clock */
diff --git a/accel/tcg/tcg-accel-ops-rr.c b/accel/tcg/tcg-accel-ops-rr.c
index 894e73e52c..a942442a33 100644
--- a/accel/tcg/tcg-accel-ops-rr.c
+++ b/accel/tcg/tcg-accel-ops-rr.c
@@ -109,7 +109,7 @@ static void rr_wait_io_event(void)
 {
     CPUState *cpu;
 
-    while (all_cpu_threads_idle() && replay_can_wait()) {
+    while (all_cpu_threads_idle()) {
         rr_stop_kick_timer();
         qemu_cond_wait_bql(first_cpu->halt_cond);
     }
diff --git a/replay/replay.c b/replay/replay.c
index b8564a4813..895fa6b67a 100644
--- a/replay/replay.c
+++ b/replay/replay.c
@@ -451,27 +451,6 @@ void replay_start(void)
     replay_enable_events();
 }
 
-/*
- * For none/record the answer is yes.
- */
-bool replay_can_wait(void)
-{
-    if (replay_mode == REPLAY_MODE_PLAY) {
-        /*
-         * For playback we shouldn't ever be at a point we wait. If
-         * the instruction count has reached zero and we have an
-         * unconsumed event we should go around again and consume it.
-         */
-        if (replay_state.instruction_count == 0 && replay_state.has_unread_data) {
-            return false;
-        } else {
-            replay_sync_error("Playback shouldn't have to iowait");
-        }
-    }
-    return true;
-}
-
-
 void replay_finish(void)
 {
     if (replay_mode == REPLAY_MODE_NONE) {
-- 
2.42.0
Re: [PATCH v4 05/24] Revert "replay: stop us hanging in rr_wait_io_event"
Posted by Alex Bennée 1 year, 9 months ago
Nicholas Piggin <npiggin@gmail.com> writes:

> This reverts commit 1f881ea4a444ef36a8b6907b0b82be4b3af253a2.
>
> That commit causes reverse_debugging.py test failures, and does
> not seem to solve the root cause of the problem x86-64 still
> hangs in record/replay tests.

I'm still finding the reverse debugging tests failing with this series.

> The problem with short-cutting the iowait that was taken during
> record phase is that related events will not get consumed at the
> same points (e.g., reading the clock).
>
> A hang with zero icount always seems to be a symptom of an earlier
> problem that has caused the recording to become out of synch with
> the execution and consumption of events by replay.

Would it be possible to still detect the failure mode rather than a full
revert?

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro
Re: [PATCH v4 05/24] Revert "replay: stop us hanging in rr_wait_io_event"
Posted by Nicholas Piggin 1 year, 9 months ago
On Tue Mar 12, 2024 at 11:33 PM AEST, Alex Bennée wrote:
> Nicholas Piggin <npiggin@gmail.com> writes:
>
> > This reverts commit 1f881ea4a444ef36a8b6907b0b82be4b3af253a2.
> >
> > That commit causes reverse_debugging.py test failures, and does
> > not seem to solve the root cause of the problem x86-64 still
> > hangs in record/replay tests.
>
> I'm still finding the reverse debugging tests failing with this series.

:(

In gitlab CI or your own testing? What are you running exactly?

I found 440fx must have some bug because it's quite bad, but
q35 was more stable.

>
> > The problem with short-cutting the iowait that was taken during
> > record phase is that related events will not get consumed at the
> > same points (e.g., reading the clock).
> >
> > A hang with zero icount always seems to be a symptom of an earlier
> > problem that has caused the recording to become out of synch with
> > the execution and consumption of events by replay.
>
> Would it be possible to still detect the failure mode rather than a full
> revert?

I'm not actually sure exactly how this patch causes test failures
(or how it was fixing the ones you saw). If I can reproduce the
ones you see I could look a bit deeper at it.

I have been looking at some ways to try to detect/report record/replay
errors more sanely, because it can be very hard to debug them at
the moment. No patches are quite ready to post yet, but yes, if there's
something we could do here to help, we should.
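
A hedged, untested sketch of that idea -- using only the replay_state
fields from the reverted code plus warn_report(), and assuming it would
live next to them in replay/replay.c -- is to keep the playback check
purely as a diagnostic instead of short-cutting the wait:

  /* Sketch only: would sit in replay/replay.c next to the reverted code. */
  static void replay_check_iowait(void)
  {
      if (replay_mode != REPLAY_MODE_PLAY) {
          return;
      }
      /*
       * Playback should not normally reach an iowait; report it rather
       * than changing the wait behaviour as the reverted commit did.
       */
      if (replay_state.instruction_count == 0 && replay_state.has_unread_data) {
          warn_report("replay: playback reached iowait with an unconsumed event");
      }
  }

rr_wait_io_event() could call this once before its while loop, so the
condition gets reported without changing when replay actually waits.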

Thanks,
Nick

Re: [PATCH v4 05/24] Revert "replay: stop us hanging in rr_wait_io_event"
Posted by Alex Bennée 1 year, 9 months ago
"Nicholas Piggin" <npiggin@gmail.com> writes:

> On Tue Mar 12, 2024 at 11:33 PM AEST, Alex Bennée wrote:
>> Nicholas Piggin <npiggin@gmail.com> writes:
>>
>> > This reverts commit 1f881ea4a444ef36a8b6907b0b82be4b3af253a2.
>> >
>> > That commit causes reverse_debugging.py test failures, and does
>> > not seem to solve the root cause of the problem x86-64 still
>> > hangs in record/replay tests.
>>
>> I'm still finding the reverse debugging tests failing with this series.
>
> :(
>
> In gitlab CI or your own testing? What are you running exactly?

My own - my mistake, I didn't get a clean build because of the format
bug. However, I'm seeing new failures:

  env QEMU_TEST_FLAKY_TESTS=1 AVOCADO_TIMEOUT_EXPECTED=1 ./pyvenv/bin/avocado run ./tests/avocado/reverse_debugging.py
  Fetching asset from ./tests/avocado/reverse_debugging.py:ReverseDebugging_AArch64.test_aarch64_virt
  JOB ID     : bd4b29f7afaa24dc6e32933ea9bc5e46bbc3a5a4
  JOB LOG    : /home/alex/avocado/job-results/job-2024-03-12T20.58-bd4b29f/job.log
   (1/5) ./tests/avocado/reverse_debugging.py:ReverseDebugging_X86_64.test_x86_64_pc: PASS (4.49 s)
   (2/5) ./tests/avocado/reverse_debugging.py:ReverseDebugging_X86_64.test_x86_64_q35: PASS (4.50 s)
   (3/5) ./tests/avocado/reverse_debugging.py:ReverseDebugging_AArch64.test_aarch64_virt: FAIL: Invalid PC (read ffff2d941e4d7f28 instead of ffff2d941e4d7f2c) (3.06 s)
   (4/5) ./tests/avocado/reverse_debugging.py:ReverseDebugging_ppc64.test_ppc64_pseries: PASS (16.48 s)
   (5/5) ./tests/avocado/reverse_debugging.py:ReverseDebugging_ppc64.test_ppc64_powernv: FAIL: Invalid PC (read 30020210 instead of 30020214) (2.80 s)
  RESULTS    : PASS 3 | ERROR 0 | FAIL 2 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0
  JOB TIME   : 32.18 s

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro
Re: [PATCH v4 05/24] Revert "replay: stop us hanging in rr_wait_io_event"
Posted by Nicholas Piggin 1 year, 9 months ago
On Wed Mar 13, 2024 at 7:03 AM AEST, Alex Bennée wrote:
> "Nicholas Piggin" <npiggin@gmail.com> writes:
>
> > On Tue Mar 12, 2024 at 11:33 PM AEST, Alex Bennée wrote:
> >> Nicholas Piggin <npiggin@gmail.com> writes:
> >>
> >> > This reverts commit 1f881ea4a444ef36a8b6907b0b82be4b3af253a2.
> >> >
> >> > That commit causes reverse_debugging.py test failures, and does
> >> > not seem to solve the root cause of the problem x86-64 still
> >> > hangs in record/replay tests.
> >>
> >> I'm still finding the reverse debugging tests failing with this series.
> >
> > :(
> >
> > In gitlab CI or your own testing? What are you running exactly?
>
> My own - my mistake I didn't get a clean build because of the format
> bug. However I'm seeing new failures:
>
>   env QEMU_TEST_FLAKY_TESTS=1 AVOCADO_TIMEOUT_EXPECTED=1 ./pyvenv/bin/avocado run ./tests/avocado/reverse_debugging.py
>   Fetching asset from ./tests/avocado/reverse_debugging.py:ReverseDebugging_AArch64.test_aarch64_virt
>   JOB ID     : bd4b29f7afaa24dc6e32933ea9bc5e46bbc3a5a4
>   JOB LOG    : /home/alex/avocado/job-results/job-2024-03-12T20.58-bd4b29f/job.log
>    (1/5) ./tests/avocado/reverse_debugging.py:ReverseDebugging_X86_64.test_x86_64_pc: PASS (4.49 s)
>    (2/5) ./tests/avocado/reverse_debugging.py:ReverseDebugging_X86_64.test_x86_64_q35: PASS (4.50 s)
>    (3/5) ./tests/avocado/reverse_debugging.py:ReverseDebugging_AArch64.test_aarch64_virt: FAIL: Invalid PC (read ffff2d941e4d7f28 instead of ffff2d941e4d7f2c) (3.06 s)

Okay, this is the new test I added. It runs for 1 second and then
reverse-steps from the end of the trace. aarch64 is flaky -- the PC is
at a different place at the same icount after the reverse-step (which
is basically the second replay). This indicates some non-determinism
in execution, or that something in machine reset or migration is not
restoring the state exactly.

aarch64 ran okay a few times, including in gitlab CI, before I posted
the series, but it turns out it does break quite often too.

x86 has a problem with this too, so I disabled it there. I'll disable
it for aarch64 as well for now.

x86 and aarch64 can run the replay_linux.py test quite well (after this
series), which is much longer and more complicated. The difference there
is that it is only a single replay; it never resets the machine or
loads the initial snapshot for reverse-debugging. So to me that
indicates that execution is probably deterministic, but that it's the
reset/reload that has the problem.

Thanks,
Nick
Re: [PATCH v4 05/24] Revert "replay: stop us hanging in rr_wait_io_event"
Posted by Nicholas Piggin 1 year, 9 months ago
On Wed Mar 13, 2024 at 7:03 AM AEST, Alex Bennée wrote:
> "Nicholas Piggin" <npiggin@gmail.com> writes:
>
> > On Tue Mar 12, 2024 at 11:33 PM AEST, Alex Bennée wrote:
> >> Nicholas Piggin <npiggin@gmail.com> writes:
> >>
> >> > This reverts commit 1f881ea4a444ef36a8b6907b0b82be4b3af253a2.
> >> >
> >> > That commit causes reverse_debugging.py test failures, and does
> >> > not seem to solve the root cause of the problem x86-64 still
> >> > hangs in record/replay tests.
> >>
> >> I'm still finding the reverse debugging tests failing with this series.
> >
> > :(
> >
> > In gitlab CI or your own testing? What are you running exactly?
>
> My own - my mistake I didn't get a clean build because of the format
> bug.

Ah, x86_64 is working for you now? Great.

> However I'm seeing new failures:
>
>   env QEMU_TEST_FLAKY_TESTS=1 AVOCADO_TIMEOUT_EXPECTED=1 ./pyvenv/bin/avocado run ./tests/avocado/reverse_debugging.py
>   Fetching asset from ./tests/avocado/reverse_debugging.py:ReverseDebugging_AArch64.test_aarch64_virt
>   JOB ID     : bd4b29f7afaa24dc6e32933ea9bc5e46bbc3a5a4
>   JOB LOG    : /home/alex/avocado/job-results/job-2024-03-12T20.58-bd4b29f/job.log
>    (1/5) ./tests/avocado/reverse_debugging.py:ReverseDebugging_X86_64.test_x86_64_pc: PASS (4.49 s)
>    (2/5) ./tests/avocado/reverse_debugging.py:ReverseDebugging_X86_64.test_x86_64_q35: PASS (4.50 s)
>    (3/5) ./tests/avocado/reverse_debugging.py:ReverseDebugging_AArch64.test_aarch64_virt: FAIL: Invalid PC (read ffff2d941e4d7f28 instead of ffff2d941e4d7f2c) (3.06 s)

I added a bunch of test cases that x86 has problems with for some
reason, so it skips them. Maybe aarch64 is the same. I'll take a
closer look at it.

>    (4/5) ./tests/avocado/reverse_debugging.py:ReverseDebugging_ppc64.test_ppc64_pseries: PASS (16.48 s)
>    (5/5) ./tests/avocado/reverse_debugging.py:ReverseDebugging_ppc64.test_ppc64_powernv: FAIL: Invalid PC (read 30020210 instead of 30020214) (2.80 s)

powernv is known to be flaky, so that's expected for now.

Thanks,
Nick