[v1] Migration 20230428 patches

[PULL 00/21] Migration 20230428 patches

Posted by Juan Quintela 1 year ago

The following changes since commit 05d50ba2d4668d43a835c5a502efdec9b92646e6:

  Merge tag 'migration-20230427-pull-request' of https://gitlab.com/juan.quintela/qemu into staging (2023-04-28 08:35:06 +0100)

are available in the Git repository at:

  https://gitlab.com/juan.quintela/qemu.git tags/migration-20230428-pull-request

for you to fetch changes up to 05ecac612ec6a4bdb358e68554b4406ac2c8e25a:

  migration: Initialize and cleanup decompression in migration.c (2023-04-28 20:54:53 +0200)

----------------------------------------------------------------
Migration Pull request (20230429 vintage)

Hi

In this series:
- compression code cleanup (lukas)
  nice, nice, nice.
- drop useless parameters from migration_tls* (juan)
- first part of remove QEMUFileHooks series (juan)

Please apply.

----------------------------------------------------------------

Juan Quintela (8):
  multifd: We already account for this packet on the multifd thread
  migration: Move ram_stats to its own file migration-stats.[ch]
  migration: Rename ram_counters to mig_stats
  migration: Rename RAMStats to MigrationAtomicStats
  migration/rdma: Split the zero page case from acct_update_position
  migration/rdma: Unfold last user of acct_update_position()
  migration: Drop unused parameter for migration_tls_get_creds()
  migration: Drop unused parameter for migration_tls_client_create()

Lukas Straub (13):
  qtest/migration-test.c: Add tests with compress enabled
  qtest/migration-test.c: Add postcopy tests with compress enabled
  ram.c: Let the compress threads return a CompressResult enum
  ram.c: Dont change param->block in the compress thread
  ram.c: Reset result after sending queued data
  ram.c: Do not call save_page_header() from compress threads
  ram.c: Call update_compress_thread_counts from
    compress_send_queued_data
  ram.c: Remove last ram.c dependency from the core compress code
  ram.c: Move core compression code into its own file
  ram.c: Move core decompression code into its own file
  ram compress: Assert that the file buffer matches the result
  ram-compress.c: Make target independent
  migration: Initialize and cleanup decompression in migration.c

 migration/meson.build        |   7 +-
 migration/migration-stats.c  |  17 ++
 migration/migration-stats.h  |  41 +++
 migration/migration.c        |  42 ++-
 migration/multifd.c          |  12 +-
 migration/postcopy-ram.c     |   2 +-
 migration/qemu-file.c        |  11 +
 migration/qemu-file.h        |   1 +
 migration/ram-compress.c     | 485 ++++++++++++++++++++++++++++++
 migration/ram-compress.h     |  70 +++++
 migration/ram.c              | 562 ++++-------------------------------
 migration/ram.h              |  24 --
 migration/rdma.c             |   9 +-
 migration/savevm.c           |   3 +-
 migration/tls.c              |  15 +-
 migration/tls.h              |   3 +-
 tests/qtest/migration-test.c | 126 ++++++++
 17 files changed, 870 insertions(+), 560 deletions(-)
 create mode 100644 migration/migration-stats.c
 create mode 100644 migration/migration-stats.h
 create mode 100644 migration/ram-compress.c
 create mode 100644 migration/ram-compress.h

-- 
2.40.0

Re: [PULL 00/21] Migration 20230428 patches

Posted by Richard Henderson 1 year ago

On 4/28/23 20:11, Juan Quintela wrote:
> The following changes since commit 05d50ba2d4668d43a835c5a502efdec9b92646e6:
> 
>    Merge tag 'migration-20230427-pull-request' of https://gitlab.com/juan.quintela/qemu into staging (2023-04-28 08:35:06 +0100)
> 
> are available in the Git repository at:
> 
>    https://gitlab.com/juan.quintela/qemu.git tags/migration-20230428-pull-request
> 
> for you to fetch changes up to 05ecac612ec6a4bdb358e68554b4406ac2c8e25a:
> 
>    migration: Initialize and cleanup decompression in migration.c (2023-04-28 20:54:53 +0200)
> 
> ----------------------------------------------------------------
> Migration Pull request (20230429 vintage)
> 
> Hi
> 
> In this series:
> - compression code cleanup (lukas)
>    nice, nice, nice.
> - drop useless parameters from migration_tls* (juan)
> - first part of remove QEMUFileHooks series (juan)
> 
> Please apply.
> 
> ----------------------------------------------------------------
> 
> Juan Quintela (8):
>    multifd: We already account for this packet on the multifd thread
>    migration: Move ram_stats to its own file migration-stats.[ch]
>    migration: Rename ram_counters to mig_stats
>    migration: Rename RAMStats to MigrationAtomicStats
>    migration/rdma: Split the zero page case from acct_update_position
>    migration/rdma: Unfold last user of acct_update_position()
>    migration: Drop unused parameter for migration_tls_get_creds()
>    migration: Drop unused parameter for migration_tls_client_create()
> 
> Lukas Straub (13):
>    qtest/migration-test.c: Add tests with compress enabled
>    qtest/migration-test.c: Add postcopy tests with compress enabled
>    ram.c: Let the compress threads return a CompressResult enum
>    ram.c: Dont change param->block in the compress thread
>    ram.c: Reset result after sending queued data
>    ram.c: Do not call save_page_header() from compress threads
>    ram.c: Call update_compress_thread_counts from
>      compress_send_queued_data
>    ram.c: Remove last ram.c dependency from the core compress code
>    ram.c: Move core compression code into its own file
>    ram.c: Move core decompression code into its own file
>    ram compress: Assert that the file buffer matches the result
>    ram-compress.c: Make target independent
>    migration: Initialize and cleanup decompression in migration.c

There are a bunch of migration failures in CI:

https://gitlab.com/qemu-project/qemu/-/jobs/4201998343#L375
https://gitlab.com/qemu-project/qemu/-/jobs/4201998342#L428
https://gitlab.com/qemu-project/qemu/-/jobs/4201998340#L459
https://gitlab.com/qemu-project/qemu/-/jobs/4201998336#L4883


r~

Re: [PULL 00/21] Migration 20230428 patches

Posted by Juan Quintela 1 year ago

Richard Henderson <richard.henderson@linaro.org> wrote:
> On 4/28/23 20:11, Juan Quintela wrote:
>> The following changes since commit 05d50ba2d4668d43a835c5a502efdec9b92646e6:
>>    Merge tag 'migration-20230427-pull-request' of
>> https://gitlab.com/juan.quintela/qemu into staging (2023-04-28
>> 08:35:06 +0100)
>> are available in the Git repository at:
>>    https://gitlab.com/juan.quintela/qemu.git
>> tags/migration-20230428-pull-request
>> for you to fetch changes up to
>> 05ecac612ec6a4bdb358e68554b4406ac2c8e25a:
>>    migration: Initialize and cleanup decompression in migration.c
>> (2023-04-28 20:54:53 +0200)
>> ----------------------------------------------------------------
>> Migration Pull request (20230429 vintage)
>> Hi
>> In this series:
>> - compression code cleanup (lukas)
>>    nice, nice, nice.
>> - drop useless parameters from migration_tls* (juan)
>> - first part of remove QEMUFileHooks series (juan)
>> Please apply.
>> ----------------------------------------------------------------
>> Juan Quintela (8):
>>    multifd: We already account for this packet on the multifd thread
>>    migration: Move ram_stats to its own file migration-stats.[ch]
>>    migration: Rename ram_counters to mig_stats
>>    migration: Rename RAMStats to MigrationAtomicStats
>>    migration/rdma: Split the zero page case from acct_update_position
>>    migration/rdma: Unfold last user of acct_update_position()
>>    migration: Drop unused parameter for migration_tls_get_creds()
>>    migration: Drop unused parameter for migration_tls_client_create()
>> Lukas Straub (13):
>>    qtest/migration-test.c: Add tests with compress enabled
>>    qtest/migration-test.c: Add postcopy tests with compress enabled
>>    ram.c: Let the compress threads return a CompressResult enum
>>    ram.c: Dont change param->block in the compress thread
>>    ram.c: Reset result after sending queued data
>>    ram.c: Do not call save_page_header() from compress threads
>>    ram.c: Call update_compress_thread_counts from
>>      compress_send_queued_data
>>    ram.c: Remove last ram.c dependency from the core compress code
>>    ram.c: Move core compression code into its own file
>>    ram.c: Move core decompression code into its own file
>>    ram compress: Assert that the file buffer matches the result
>>    ram-compress.c: Make target independent
>>    migration: Initialize and cleanup decompression in migration.c
>
> There are a bunch of migration failures in CI:
>
> https://gitlab.com/qemu-project/qemu/-/jobs/4201998343#L375

cfi-x86_64

> https://gitlab.com/qemu-project/qemu/-/jobs/4201998342#L428

opensuse aarch64?

> https://gitlab.com/qemu-project/qemu/-/jobs/4201998340#L459

debian i386

> https://gitlab.com/qemu-project/qemu/-/jobs/4201998336#L4883

x86_64 and aarch64 on a s390x host?

Dunno really what is going on here.

It works here: fedora 37 x86_64 host and both:

qemu-system-x86_64 (native kvm)
qemu-system-aarch64 (emulated)

my patches are only code movement and cleanups, so Lukas any clue?

Lukas, I am going to drop the compress code for now and pass the other
patches.  In the meanwhile, I am going to try to setup some machine
where I can run the upstream tests and see if I can reproduce there.
BTW, I would be happy if you double check that I did the rebase
correctly, they didn't apply correctly, but as said, the tests have been
running for two/three days on all my daily testing, so I thought that I
did the things correctly.

Richard, while we are here, one of the problems that we are having is
that the test is exiting with an abort, so we have no clue what is
happening.  Is there a way to get a backtrace, or at least the number

Later, Juan.

Re: [PULL 00/21] Migration 20230428 patches

Posted by Lukas Straub 12 months ago

On Tue, 02 May 2023 12:39:12 +0200
Juan Quintela <quintela@redhat.com> wrote:

> [...]
> 
> my patches are only code movement and cleanups, so Lukas any clue?
> 
> Lukas, I am going to drop the compress code for now and pass the other
> patches.  In the meanwhile, I am going to try to setup some machine
> where I can run the upstream tests and see if I can reproduce there.
> BTW, I would be happy if you double check that I did the rebase
> correctly, they didn't apply correctly, but as said, the tests have been
> running for two/three days on all my daily testing, so I thought that I
> did the things correctly.

Hi,
I rebased the series here and got exactly the same files as in this
pull request. And I can't reproduce these failures either.

Maybe you can run the CI just on the newly added compress tests and see
if it already blows up without the refactoring?

Anyway, this series is not so important anymore...

> Richard, once that we are here, one of the problem that we are having is
> that the test is exiting with an abort, so we have no clue what is
> happening.  Is there a way to get a backtrace, or at least the number
> 
> Later, Juan.
> 



--

Re: [PULL 00/21] Migration 20230428 patches

Posted by Juan Quintela 11 months, 4 weeks ago

Lukas Straub <lukasstraub2@web.de> wrote:
> On Tue, 02 May 2023 12:39:12 +0200
> Juan Quintela <quintela@redhat.com> wrote:
>
>> [...]
>> 
>> my patches are only code movement and cleanups, so Lukas any clue?
>> 
>> Lukas, I am going to drop the compress code for now and pass the other
>> patches.  In the meanwhile, I am going to try to setup some machine
>> where I can run the upstream tests and see if I can reproduce there.
>> BTW, I would be happy if you double check that I did the rebase
>> correctly, they didn't apply correctly, but as said, the tests have been
>> running for two/three days on all my daily testing, so I thought that I
>> did the things correctly.

Hi

> Hi,
> I rebased the series here and got exactly the same files as in this
> pull request. And I can't reproduce these failures either.

Nice

> Maybe you can run the CI just on the newly added compress tests and see
> if it already blows up without the refactoring?

It does, I don't have to check O:-)

The initial reason that I did the compression code on top of multifd was
that I was not able to get the old compression code to run "reliably"
on my testing.

> Anyway, this series is not so important anymore...

What about:
- I add the series as they are, because the code is better than what we
  have before (and being in a different file makes it easier to
  deprecate, not compile, ...)
- I just disable the tests until we find something that works.

Richard, Lukas?

Later, Juan.

>> Richard, once that we are here, one of the problem that we are having is
>> that the test is exiting with an abort, so we have no clue what is
>> happening.  Is there a way to get a backtrace, or at least the number
>> 
>> Later, Juan.
>>

Re: [PULL 00/21] Migration 20230428 patches

Posted by Lukas Straub 11 months, 4 weeks ago

On Mon, 08 May 2023 10:12:35 +0200
Juan Quintela <quintela@redhat.com> wrote:

> Lukas Straub <lukasstraub2@web.de> wrote:
> > On Tue, 02 May 2023 12:39:12 +0200
> > Juan Quintela <quintela@redhat.com> wrote:
> >
> >> [...]
> >> 
> >> my patches are only code movement and cleanups, so Lukas any clue?
> >> 
> >> Lukas, I am going to drop the compress code for now and pass the other
> >> patches.  In the meanwhile, I am going to try to setup some machine
> >> where I can run the upstream tests and see if I can reproduce there.
> >> BTW, I would be happy if you double check that I did the rebase
> >> correctly, they didn't apply correctly, but as said, the tests have been
> >> running for two/three days on all my daily testing, so I thought that I
> >> did the things correctly.
> 
> Hi
> 
> > Hi,
> > I rebased the series here and got exactly the same files as in this
> > pull request. And I can't reproduce these failures either.
> 
> Nice
> 
> > Maybe you can run the CI just on the newly added compress tests and see
> > if it already blows up without the refactoring?
> 
> It does, I don't have to check O:-)
> 
> The initial reason that I did the compression code on top of multifd was
> that I was not able to get the old compression code to run "reliably"
> on my testing.
> 
> > Anyway, this series is not so important anymore...
> 
> What about:
> - I add the series as they are, because the code is better than what we
>   have before (and being in a different file makes it easier to
>   deprecate, not compile, ...)
> - I just disable the tests until we find something that works.
> 
> Richard, Lukas?

That is fine with me.

> 
> Later, Juan.
> 
> >> Richard, once that we are here, one of the problem that we are having is
> >> that the test is exiting with an abort, so we have no clue what is
> >> happening.  Is there a way to get a backtrace, or at least the number
> >> 
> >> Later, Juan.
> >> 
>

Re: [PULL 00/21] Migration 20230428 patches

Posted by Peter Maydell 1 year ago

On Tue, 2 May 2023 at 11:39, Juan Quintela <quintela@redhat.com> wrote:
> Richard, once that we are here, one of the problem that we are having is
> that the test is exiting with an abort, so we have no clue what is
> happening.  Is there a way to get a backtrace, or at least the number

This has been consistently an issue with the migration tests.
As the owner of the tests, if they are not providing you with
the level of detail that you need to diagnose failures, I
think that is something that is in your court to address:
the CI system is always going to only be able to provide
you with what your tests are outputting to the logs.

For the specific case of backtraces from assertion failures,
I think Dan was looking at whether we could put something
together for that. It won't help with segfaults and the like, though.

You should be able to at least get the number of the subtest out of
the logs (either directly in the logs of the job, or else
from the more detailed log file that gets stored as a
job artefact in most cases).

thanks
-- PMM

Re: [PULL 00/21] Migration 20230428 patches

Posted by Juan Quintela 12 months ago

Peter Maydell <peter.maydell@linaro.org> wrote:
> On Tue, 2 May 2023 at 11:39, Juan Quintela <quintela@redhat.com> wrote:
>> Richard, once that we are here, one of the problem that we are having is
>> that the test is exiting with an abort, so we have no clue what is
>> happening.  Is there a way to get a backtrace, or at least the number
>
> This has been consistently an issue with the migration tests.
> As the owner of the tests, if they are not providing you with
> the level of detail that you need to diagnose failures, I
> think that is something that is in your court to address:
> the CI system is always going to only be able to provide
> you with what your tests are outputting to the logs.

Right now I would be happy just to see what test it is failing at.

Either I am doing something wrong, or the links that I see in Richard's
email do not lead anywhere where I can see the full logs.

> For the specific case of backtraces from assertion failures,
> I think Dan was looking at whether we could put something
> together for that. It won't help with segfaults and the like, though.

I am waiting for that O:-)

> You should be able to at least get the number of the subtest out of
> the logs (either directly in the logs of the job, or else
> from the more detailed log file that gets stored as a
> job artefact in most cases).

Also note that the test is stopping in an abort, with no diagnostic
message that I can see.  But I don't see where the abort comes from:

$ grep abort tests/qtest/migration-*
tests/qtest/migration-test.c:    visit_type_SocketAddressList(iv, NULL, &addrs, &error_abort);
tests/qtest/migration-test.c:     * In non-multifd case when client aborts due to mismatched
tests/qtest/migration-test.c:     * In multifd case when client aborts due to mismatched
tests/qtest/migration-test.c:     * to load migration state, and thus just aborts the migration
$

Later, Juan.

Re: [PULL 00/21] Migration 20230428 patches

Posted by Peter Maydell 12 months ago

On Wed, 3 May 2023 at 10:17, Juan Quintela <quintela@redhat.com> wrote:
>
> Peter Maydell <peter.maydell@linaro.org> wrote:
> > On Tue, 2 May 2023 at 11:39, Juan Quintela <quintela@redhat.com> wrote:
> >> Richard, once that we are here, one of the problem that we are having is
> >> that the test is exiting with an abort, so we have no clue what is
> >> happening.  Is there a way to get a backtrace, or at least the number
> >
> > This has been consistently an issue with the migration tests.
> > As the owner of the tests, if they are not providing you with
> > the level of detail that you need to diagnose failures, I
> > think that is something that is in your court to address:
> > the CI system is always going to only be able to provide
> > you with what your tests are outputting to the logs.
>
> Right now I would be happy just to see what test it is failing at.
>
> I am doing something wrong, or from the links that I see on richard
> email, I am not able to reach anywhere where I can see the full logs.
>
> > For the specific case of backtraces from assertion failures,
> > I think Dan was looking at whether we could put something
> > together for that. It won't help with segfaults and the like, though.
>
> I am waiting for that O:-)
>
> > You should be able to at least get the number of the subtest out of
> > the logs (either directly in the logs of the job, or else
> > from the more detailed log file that gets stored as a
> > job artefact in most cases).
>
> Also note that the test is stopping in an abort, with no diagnostic
> message that I can see.  But I don't see where the abort comes from:

So, as an example I took the check-system-opensuse log:
https://gitlab.com/qemu-project/qemu/-/jobs/4201998342

Use your browser's "search in web page" to look for "SIGABRT":
it'll show you the two errors (as well as the summary at
the bottom of the page which just says the tests aborted).
Here's one:

5/351 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test ERROR
246.12s killed by signal 6 SIGABRT
>>> QTEST_QEMU_BINARY=./qemu-system-x86_64 QTEST_QEMU_IMG=./qemu-img MALLOC_PERTURB_=48 QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon G_TEST_DBUS_DAEMON=/builds/qemu-project/qemu/tests/dbus-vmstate-daemon.sh /builds/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
――――――――――――――――――――――――――――――――――――― ✀ ―――――――――――――――――――――――――――――――――――――
stderr:
Could not access KVM kernel module: No such file or directory
Could not access KVM kernel module: No such file or directory
Could not access KVM kernel module: No such file or directory
Could not access KVM kernel module: No such file or directory
Could not access KVM kernel module: No such file or directory
Could not access KVM kernel module: No such file or directory
Could not access KVM kernel module: No such file or directory
Could not access KVM kernel module: No such file or directory
Could not access KVM kernel module: No such file or directory
Could not access KVM kernel module: No such file or directory
**
ERROR:../tests/qtest/migration-helpers.c:205:wait_for_migration_status:
assertion failed: (g_test_timer_elapsed() <
MIGRATION_STATUS_WAIT_TIMEOUT)
(test program exited with status code -6)
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
▶ 6/351 ERROR:../tests/qtest/migration-helpers.c:205:wait_for_migration_status:
assertion failed: (g_test_timer_elapsed() <
MIGRATION_STATUS_WAIT_TIMEOUT) ERROR
6/351 qemu:qtest+qtest-aarch64 / qtest-aarch64/migration-test ERROR
221.18s killed by signal 6 SIGABRT

Looks like it failed on a timeout in the test code.
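
For readers unfamiliar with this kind of failure, the pattern behind the
assertion can be sketched roughly as below. This is not the code from
tests/qtest/migration-helpers.c: wait_for_status(), reached_after_n_polls()
and the timeout values are hypothetical names chosen for illustration, and
plain POSIX timers stand in for g_test_timer_elapsed(). The only thing
taken from the log above is the idea of polling until a status is reached
and giving up once a time limit has passed.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: returns true once the awaited status is seen. */
typedef bool (*status_check_fn)(void *opaque);

/* Seconds from a monotonic clock, so wall-clock jumps do not affect the wait. */
static double monotonic_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/*
 * Poll reached() until it returns true or timeout_seconds have elapsed.
 * Returns true on success and false on timeout. Unlike the assertion in
 * the log, this sketch reports the timeout to the caller instead of
 * aborting, so the caller could print which test was running.
 */
static bool wait_for_status(status_check_fn reached, void *opaque,
                            double timeout_seconds)
{
    const struct timespec poll_interval = { .tv_sec = 0, .tv_nsec = 1000000 };
    double start = monotonic_seconds();

    while (!reached(opaque)) {
        if (monotonic_seconds() - start >= timeout_seconds) {
            return false;
        }
        nanosleep(&poll_interval, NULL); /* sleep about 1 ms between polls */
    }
    return true;
}

/* Example predicate: succeeds once the counter behind opaque reaches zero. */
static bool reached_after_n_polls(void *opaque)
{
    int *remaining = opaque;

    if (*remaining > 0) {
        (*remaining)--;
    }
    return *remaining == 0;
}
```

A caller would pass a predicate plus a limit, for example
wait_for_status(reached_after_n_polls, &polls, 120.0), and decide for
itself how to report a false return. On a slow or heavily loaded runner
the limit can be hit even though nothing is functionally wrong.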

I think there ought to be artefacts from the job which have a
copy of the full log, but I can't find them: not sure if this
is just because the gitlab UI is terrible, or if they really
didn't get generated.

thanks
-- PMM

Re: [PULL 00/21] Migration 20230428 patches

Posted by Juan Quintela 12 months ago

Peter Maydell <peter.maydell@linaro.org> wrote:
> On Wed, 3 May 2023 at 10:17, Juan Quintela <quintela@redhat.com> wrote:
>>
>> Peter Maydell <peter.maydell@linaro.org> wrote:
>> > On Tue, 2 May 2023 at 11:39, Juan Quintela <quintela@redhat.com> wrote:
>> >> Richard, once that we are here, one of the problem that we are having is
>> >> that the test is exiting with an abort, so we have no clue what is
>> >> happening.  Is there a way to get a backtrace, or at least the number
>> >
>> > This has been consistently an issue with the migration tests.
>> > As the owner of the tests, if they are not providing you with
>> > the level of detail that you need to diagnose failures, I
>> > think that is something that is in your court to address:
>> > the CI system is always going to only be able to provide
>> > you with what your tests are outputting to the logs.
>>
>> Right now I would be happy just to see what test it is failing at.
>>
>> I am doing something wrong, or from the links that I see on richard
>> email, I am not able to reach anywhere where I can see the full logs.
>>
>> > For the specific case of backtraces from assertion failures,
>> > I think Dan was looking at whether we could put something
>> > together for that. It won't help with segfaults and the like, though.
>>
>> I am waiting for that O:-)
>>
>> > You should be able to at least get the number of the subtest out of
>> > the logs (either directly in the logs of the job, or else
>> > from the more detailed log file that gets stored as a
>> > job artefact in most cases).
>>
>> Also note that the test is stopping in an abort, with no diagnostic
>> message that I can see.  But I don't see where the abort comes from:
>
> So, as an example I took the check-system-opensuse log:
> https://gitlab.com/qemu-project/qemu/-/jobs/4201998342
>
> Use your browser's "search in web page" to look for "SIGABRT":
> it'll show you the two errors (as well as the summary at
> the bottom of the page which just says the tests aborted).
> Here's one:
>
> 5/351 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test ERROR
> 246.12s killed by signal 6 SIGABRT
>>>> QTEST_QEMU_BINARY=./qemu-system-x86_64 QTEST_QEMU_IMG=./qemu-img
>>> MALLOC_PERTURB_=48
>>> QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon
>>> G_TEST_DBUS_DAEMON=/builds/qemu-project/qemu/tests/dbus-vmstate-daemon.sh
>>> /builds/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
> ――――――――――――――――――――――――――――――――――――― ✀ ―――――――――――――――――――――――――――――――――――――
> stderr:
> Could not access KVM kernel module: No such file or directory
> Could not access KVM kernel module: No such file or directory
> Could not access KVM kernel module: No such file or directory
> Could not access KVM kernel module: No such file or directory
> Could not access KVM kernel module: No such file or directory
> Could not access KVM kernel module: No such file or directory
> Could not access KVM kernel module: No such file or directory
> Could not access KVM kernel module: No such file or directory
> Could not access KVM kernel module: No such file or directory
> Could not access KVM kernel module: No such file or directory
> **
> ERROR:../tests/qtest/migration-helpers.c:205:wait_for_migration_status:
> assertion failed: (g_test_timer_elapsed() <
> MIGRATION_STATUS_WAIT_TIMEOUT)
> (test program exited with status code -6)
> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
> ▶ 6/351 ERROR:../tests/qtest/migration-helpers.c:205:wait_for_migration_status:
> assertion failed: (g_test_timer_elapsed() <
> MIGRATION_STATUS_WAIT_TIMEOUT) ERROR
> 6/351 qemu:qtest+qtest-aarch64 / qtest-aarch64/migration-test ERROR
> 221.18s killed by signal 6 SIGABRT
>
> Looks like it failed on a timeout in the test code.

Thanks.

> I think there ought to be artefacts from the job which have a
> copy of the full log, but I can't find them: not sure if this
> is just because the gitlab UI is terrible, or if they really
> didn't get generated.

So now we are between a rock and a hard place.

We have slowed down the bandwidth for migration test because on non
loaded machines, migration was too fast to need more than one pass.

And we slowed it so much that now we hit the timer that was set at 120
seconds.

So .....

It is going to be interesting.

BTW, what processor speed do those aarch64 machines have? Or are they so
loaded that they are effectively thrashing?

2 minutes for a pass looks like a bit too much.

I will try to get this test working by changing how we detect that we
don't move to the completion stage.

Thanks for the explanation on where to find the data.  The other issue
is that what I really want is to know which test failed.  I can't see a
way to get that info.  According to Daniel's answer, we don't upload those
files for tests that fail.

Later, Juan.

Re: [PULL 00/21] Migration 20230428 patches

Posted by Peter Maydell 12 months ago

On Wed, 3 May 2023 at 14:29, Juan Quintela <quintela@redhat.com> wrote:
> So now we are between a rock and a hard place.
>
> We have slowed down the bandwidth for migration test because on non
> loaded machines, migration was too fast to need more than one pass.
>
> And we slowed it so much that now we hit the timer that was set at 120
> seconds.
>
> So .....
>
> It is going to be interesting.
>
> BTW, what processor speed do those aarch64 machines have? Or are they so
> loaded that they are effectively thrashing?

The 4 failures listed in this thread are all jobs running
on Gitlab's standard x86-64 shared runners, not on the
private aarch64 runner.

thanks
-- PMM

Re: [PULL 00/21] Migration 20230428 patches

Posted by Daniel P. Berrangé 12 months ago

On Wed, May 03, 2023 at 11:17:33AM +0200, Juan Quintela wrote:
> Peter Maydell <peter.maydell@linaro.org> wrote:
> > On Tue, 2 May 2023 at 11:39, Juan Quintela <quintela@redhat.com> wrote:
> >> Richard, once that we are here, one of the problem that we are having is
> >> that the test is exiting with an abort, so we have no clue what is
> >> happening.  Is there a way to get a backtrace, or at least the number
> >
> > This has been consistently an issue with the migration tests.
> > As the owner of the tests, if they are not providing you with
> > the level of detail that you need to diagnose failures, I
> > think that is something that is in your court to address:
> > the CI system is always going to only be able to provide
> > you with what your tests are outputting to the logs.
> 
> Right now I would be happy just to see what test it is failing at.
> 
> I am doing something wrong, or from the links that I see on richard
> email, I am not able to reach anywhere where I can see the full logs.
> 
> > For the specific case of backtraces from assertion failures,
> > I think Dan was looking at whether we could put something
> > together for that. It won't help with segfaults and the like, though.
> 
> I am waiting for that O:-)

I did have a play, but didn't get anything satisfactory working.
I could only capture a trace from a single thread and the hard
problems that really want traces usually involve multiple threads.
There's really no good substitute for an OS level crash collector
like systemd-coredump or abrtd :-( This is really worth an RFE
to GitLab as a possible enhancement to their shared runners.

> > You should be able to at least get the number of the subtest out of
> > the logs (either directly in the logs of the job, or else
> > from the more detailed log file that gets stored as a
> > job artefact in most cases).
> 
> Also note that the test is stopping in an abort, with no diagnostic
> message that I can see.  But I don't see where the abort comes from:

Any g_assert failure will result in an abort, so there are many
possibilities.
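
As a rough illustration of why grepping for "abort" finds nothing, the
sketch below shows a failed assertion ending in SIGABRT without the word
"abort" appearing in the test code. It uses the standard C assert() as a
stand-in for GLib's g_assert(), on the assumption that both end in a call
to abort(). The helper run_in_child_and_get_signal() and the two check
functions are hypothetical names, not part of QEMU or GLib.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef void (*child_fn)(void);

/*
 * Run fn() in a forked child and report how the child ended:
 * the terminating signal number, 0 for a normal exit, -1 if
 * fork() or waitpid() failed.
 */
static int run_in_child_and_get_signal(child_fn fn)
{
    pid_t pid = fork();

    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        fn();
        _exit(0); /* only reached when fn() did not abort */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

/* A check that fails: the assertion calls abort(), which raises SIGABRT. */
static void failing_check(void)
{
    int elapsed = 130; /* illustrative values, not taken from the tests */
    int limit = 120;

    assert(elapsed < limit);
}

/* A check that passes, for comparison. */
static void passing_check(void)
{
    assert(1 + 1 == 2);
}
```

Calling run_in_child_and_get_signal(failing_check) should report SIGABRT,
which is the "killed by signal 6 SIGABRT" that the test harness prints.
Searching the test sources for assertion macros rather than for "abort"
is therefore more likely to locate the failing check. Note that the
standard assert() is compiled out when NDEBUG is defined.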

We should be uploading the testlog.txt from meson test to show which
one asserts, but I've just discovered we're lacking 'when: always'
on our 'artifacts' declarations. So the artifacts are only uploaded
when the job succeeds, which is the least useful scenario for our test
logs!

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [PULL 00/21] Migration 20230428 patches

Posted by Lukas Straub 1 year ago

On Sat, 29 Apr 2023 19:45:07 +0100
Richard Henderson <richard.henderson@linaro.org> wrote:

> There are a bunch of migration failures in CI:
> 
> https://gitlab.com/qemu-project/qemu/-/jobs/4201998343#L375
> https://gitlab.com/qemu-project/qemu/-/jobs/4201998342#L428
> https://gitlab.com/qemu-project/qemu/-/jobs/4201998340#L459
> https://gitlab.com/qemu-project/qemu/-/jobs/4201998336#L4883
> 
> 
> r~

Hmm, it doesn't always fail. Any way to get the testlog from the failed
jobs?


Re: [PULL 00/21] Migration 20230428 patches

Posted by Richard Henderson 1 year ago

On 4/29/23 21:14, Lukas Straub wrote:
> Hmm, it doesn't always fail. Any way to get the testlog from the failed
> jobs?

What you can get from the links above is all I have.
But they're consistent, and new.


r~