[PULL 00/18] migration queue

Dr. David Alan Gilbert (git) posted 18 patches 2 years, 2 months ago
[PULL 00/18] migration queue
Posted by Dr. David Alan Gilbert (git) 2 years, 2 months ago
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

The following changes since commit 64ada298b98a51eb2512607f6e6180cb330c47b1:

  Merge remote-tracking branch 'remotes/legoater/tags/pull-ppc-20220302' into staging (2022-03-02 12:38:46 +0000)

are available in the Git repository at:

  https://gitlab.com/dagrh/qemu.git tags/pull-migration-20220302b

for you to fetch changes up to 18621987027b1800f315fb9e29967e7b5398ef6f:

  migration: Remove load_state_old and minimum_version_id_old (2022-03-02 18:20:45 +0000)

----------------------------------------------------------------
Migration/HMP/Virtio pull 2022-03-02

A bit of a mix this time:
  * Minor fixes from myself, Hanna, and Jack
  * VNC password rework by Stefan and Fabian
  * Postcopy changes from Peter X that are
    the start of a larger series to come
  * Removing the prehistoric load_state_old
    code from Peter M

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

----------------------------------------------------------------
Dr. David Alan Gilbert (1):
      clock-vmstate: Add missing END_OF_LIST

Hanna Reitz (1):
      virtiofsd: Let meson check for statx.stx_mnt_id

Jack Wang (1):
      migration/rdma: set the REUSEADDR option for destination

Peter Maydell (1):
      migration: Remove load_state_old and minimum_version_id_old

Peter Xu (11):
      migration: Dump sub-cmd name in loadvm_process_command tp
      migration: Finer grained tracepoints for POSTCOPY_LISTEN
      migration: Tracepoint change in postcopy-run bottom half
      migration: Introduce postcopy channels on dest node
      migration: Dump ramblock and offset too when non-same-page detected
      migration: Add postcopy_thread_create()
      migration: Move static var in ram_block_from_stream() into global
      migration: Enlarge postcopy recovery to capture !-EIO too
      migration: postcopy_pause_fault_thread() never fails
      migration: Add migration_incoming_transport_cleanup()
      tests: Pass in MigrateStart** into test_migrate_start()

Stefan Reiter (3):
      monitor/hmp: add support for flag argument with value
      qapi/monitor: refactor set/expire_password with enums
      qapi/monitor: allow VNC display id in set/expire_password

 docs/devel/migration.rst         |  12 +---
 hmp-commands.hx                  |  24 ++++----
 hw/core/clock-vmstate.c          |   1 +
 hw/ssi/xlnx-versal-ospi.c        |   1 -
 include/migration/vmstate.h      |   2 -
 meson.build                      |  13 +++++
 migration/migration.c            |  26 +++++----
 migration/migration.h            |  48 ++++++++++++++--
 migration/postcopy-ram.c         | 108 ++++++++++++++++++++++-------------
 migration/postcopy-ram.h         |   4 ++
 migration/ram.c                  |  64 +++++++++++----------
 migration/rdma.c                 |   7 +++
 migration/savevm.c               |  46 ++++++++++-----
 migration/trace-events           |   7 +--
 migration/vmstate.c              |   6 --
 monitor/hmp-cmds.c               |  47 ++++++++++++++-
 monitor/hmp.c                    |  19 ++++++-
 monitor/monitor-internal.h       |   3 +-
 monitor/qmp-cmds.c               |  49 +++++-----------
 qapi/ui.json                     | 120 +++++++++++++++++++++++++++++++++------
 tests/qtest/migration-test.c     |  27 +++++----
 tools/virtiofsd/passthrough_ll.c |   2 +-
 22 files changed, 435 insertions(+), 201 deletions(-)
Re: [PULL 00/18] migration queue
Posted by Peter Maydell 2 years, 2 months ago
On Wed, 2 Mar 2022 at 18:32, Dr. David Alan Gilbert (git)
<dgilbert@redhat.com> wrote:
>
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> The following changes since commit 64ada298b98a51eb2512607f6e6180cb330c47b1:
>
>   Merge remote-tracking branch 'remotes/legoater/tags/pull-ppc-20220302' into staging (2022-03-02 12:38:46 +0000)
>
> are available in the Git repository at:
>
>   https://gitlab.com/dagrh/qemu.git tags/pull-migration-20220302b
>
> for you to fetch changes up to 18621987027b1800f315fb9e29967e7b5398ef6f:
>
>   migration: Remove load_state_old and minimum_version_id_old (2022-03-02 18:20:45 +0000)
>
> ----------------------------------------------------------------
> Migration/HMP/Virtio pull 2022-03-02
>
> A bit of a mix this time:
>   * Minor fixes from myself, Hanna, and Jack
>   * VNC password rework by Stefan and Fabian
>   * Postcopy changes from Peter X that are
>     the start of a larger series to come
>   * Removing the prehistoric load_state_old
>     code from Peter M
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>


Applied, thanks.

Please update the changelog at https://wiki.qemu.org/ChangeLog/7.0
for any user-visible changes.

-- PMM
Re: [PULL 00/18] migration queue
Posted by Philippe Mathieu-Daudé 2 years, 1 month ago
On 3/3/22 15:46, Peter Maydell wrote:
> On Wed, 2 Mar 2022 at 18:32, Dr. David Alan Gilbert (git)
> <dgilbert@redhat.com> wrote:
>>
>> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>>
>> The following changes since commit 64ada298b98a51eb2512607f6e6180cb330c47b1:
>>
>>    Merge remote-tracking branch 'remotes/legoater/tags/pull-ppc-20220302' into staging (2022-03-02 12:38:46 +0000)
>>
>> are available in the Git repository at:
>>
>>    https://gitlab.com/dagrh/qemu.git tags/pull-migration-20220302b
>>
>> for you to fetch changes up to 18621987027b1800f315fb9e29967e7b5398ef6f:
>>
>>    migration: Remove load_state_old and minimum_version_id_old (2022-03-02 18:20:45 +0000)
>>
>> ----------------------------------------------------------------
>> Migration/HMP/Virtio pull 2022-03-02
>>
>> A bit of a mix this time:
>>    * Minor fixes from myself, Hanna, and Jack
>>    * VNC password rework by Stefan and Fabian
>>    * Postcopy changes from Peter X that are
>>      the start of a larger series to come
>>    * Removing the prehistoric load_state_old
>>      code from Peter M

I'm seeing an error on the s390x runner:

▶  26/547 ERROR:../tests/qtest/migration-test.c:276:check_guests_ram: 
assertion failed: (bad == 0) ERROR

  26/547 qemu:qtest+qtest-i386 / qtest-i386/migration-test 
            ERROR          78.87s   killed by signal 6 SIGABRT

https://app.travis-ci.com/gitlab/qemu-project/qemu/jobs/562515884#L7848
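For context, that assertion counts guest pages whose contents no longer match the pattern the test program wrote before migration, so bad != 0 means RAM was corrupted in transit. A simplified Python model of that check (illustrative only; the real logic is C in tests/qtest/migration-test.c and tolerates pages the guest dirtied mid-pass):

```python
def count_bad_pages(pages, expected):
    """Count pages whose first byte deviates from the expected pattern.

    Loose model of the qtest's check_guests_ram(): the guest stamps the
    first byte of each page, and after migration the test walks memory
    and flags any page that no longer carries its stamp.
    """
    bad = 0
    for i, page in enumerate(pages):
        if page[0] != expected(i):
            bad += 1
    return bad
```

The test then asserts the equivalent of `bad == 0`, which is the assertion firing in the log above.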

Re: [PULL 00/18] migration queue
Posted by Dr. David Alan Gilbert 2 years, 1 month ago
* Philippe Mathieu-Daudé (philippe.mathieu.daude@gmail.com) wrote:
> On 3/3/22 15:46, Peter Maydell wrote:
> > On Wed, 2 Mar 2022 at 18:32, Dr. David Alan Gilbert (git)
> > <dgilbert@redhat.com> wrote:
> > > 
> > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > 
> > > The following changes since commit 64ada298b98a51eb2512607f6e6180cb330c47b1:
> > > 
> > >    Merge remote-tracking branch 'remotes/legoater/tags/pull-ppc-20220302' into staging (2022-03-02 12:38:46 +0000)
> > > 
> > > are available in the Git repository at:
> > > 
> > >    https://gitlab.com/dagrh/qemu.git tags/pull-migration-20220302b
> > > 
> > > for you to fetch changes up to 18621987027b1800f315fb9e29967e7b5398ef6f:
> > > 
> > >    migration: Remove load_state_old and minimum_version_id_old (2022-03-02 18:20:45 +0000)
> > > 
> > > ----------------------------------------------------------------
> > > Migration/HMP/Virtio pull 2022-03-02
> > > 
> > > A bit of a mix this time:
> > >    * Minor fixes from myself, Hanna, and Jack
> > >    * VNC password rework by Stefan and Fabian
> > >    * Postcopy changes from Peter X that are
> > >      the start of a larger series to come
> > >    * Removing the prehistoric load_state_old
> > >      code from Peter M
> 
> I'm seeing an error on the s390x runner:
> 
> ▶  26/547 ERROR:../tests/qtest/migration-test.c:276:check_guests_ram:
> assertion failed: (bad == 0) ERROR
> 
>  26/547 qemu:qtest+qtest-i386 / qtest-i386/migration-test            ERROR
> 78.87s   killed by signal 6 SIGABRT
> 
> https://app.travis-ci.com/gitlab/qemu-project/qemu/jobs/562515884#L7848

Yeh, thuth mentioned that, it seems to only be s390 which is odd.
I'm not seeing anything obviously architecture dependent in that set, or
for that matter that plays with the ram migration stream much.
Is this reliable enough that someone with a tame s390 could bisect?

Dave

-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


Re: [PULL 00/18] migration queue
Posted by Christian Borntraeger 2 years, 1 month ago
Am 08.03.22 um 19:47 schrieb Dr. David Alan Gilbert:
> * Philippe Mathieu-Daudé (philippe.mathieu.daude@gmail.com) wrote:
>> On 3/3/22 15:46, Peter Maydell wrote:
>>> On Wed, 2 Mar 2022 at 18:32, Dr. David Alan Gilbert (git)
>>> <dgilbert@redhat.com> wrote:
>>>>
>>>> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>>>>
>>>> The following changes since commit 64ada298b98a51eb2512607f6e6180cb330c47b1:
>>>>
>>>>     Merge remote-tracking branch 'remotes/legoater/tags/pull-ppc-20220302' into staging (2022-03-02 12:38:46 +0000)
>>>>
>>>> are available in the Git repository at:
>>>>
>>>>     https://gitlab.com/dagrh/qemu.git tags/pull-migration-20220302b
>>>>
>>>> for you to fetch changes up to 18621987027b1800f315fb9e29967e7b5398ef6f:
>>>>
>>>>     migration: Remove load_state_old and minimum_version_id_old (2022-03-02 18:20:45 +0000)
>>>>
>>>> ----------------------------------------------------------------
>>>> Migration/HMP/Virtio pull 2022-03-02
>>>>
>>>> A bit of a mix this time:
>>>>     * Minor fixes from myself, Hanna, and Jack
>>>>     * VNC password rework by Stefan and Fabian
>>>>     * Postcopy changes from Peter X that are
>>>>       the start of a larger series to come
>>>>     * Removing the prehistoric load_state_old
>>>>       code from Peter M
>>
>> I'm seeing an error on the s390x runner:
>>
>> ▶  26/547 ERROR:../tests/qtest/migration-test.c:276:check_guests_ram:
>> assertion failed: (bad == 0) ERROR
>>
>>   26/547 qemu:qtest+qtest-i386 / qtest-i386/migration-test            ERROR
>> 78.87s   killed by signal 6 SIGABRT
>>
>> https://app.travis-ci.com/gitlab/qemu-project/qemu/jobs/562515884#L7848
> 
> Yeh, thuth mentioned that, it seems to only be s390 which is odd.
> I'm not seeing anything obviously architecture dependent in that set, or
> for that matter that plays with the ram migration stream much.
> Is this reliable enough that someone with a tame s390 could bisect?

I just asked Peter to try with DFLTCC=0 to disable the hardware acceleration. Maybe
the zlib library still has a bug? (We are not aware of any problem right now).
In case DFLTCC makes a difference, this would be something for Ilya to look at.
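A sketch of such a reproduction run, assuming DFLTCC is honoured as a runtime environment switch by the patched s390x zlib (as the suggestion above implies) and using the standard QTEST_QEMU_BINARY variable; the paths and the guard are illustrative:

```python
import os
import subprocess

def run_with_env(cmd, extra_env):
    """Run cmd with extra environment variables merged over the current ones."""
    env = dict(os.environ, **extra_env)
    return subprocess.run(cmd, env=env)

# Hypothetical reproduction with the zlib DFLTCC acceleration disabled;
# only attempted when the test binary is actually present in the build tree.
if os.path.exists("tests/qtest/migration-test"):
    run_with_env(
        ["tests/qtest/migration-test", "-tap", "-k"],
        {"DFLTCC": "0", "QTEST_QEMU_BINARY": "./qemu-system-i386"},
    )
```

If the failure disappears with DFLTCC=0, that would point at the hardware-compression path rather than the migration code.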


Re: [PULL 00/18] migration queue
Posted by Peter Maydell 2 years, 1 month ago
On Tue, 8 Mar 2022 at 18:47, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
>
> * Philippe Mathieu-Daudé (philippe.mathieu.daude@gmail.com) wrote:
> > I'm seeing an error on the s390x runner:
> >
> > ▶  26/547 ERROR:../tests/qtest/migration-test.c:276:check_guests_ram:
> > assertion failed: (bad == 0) ERROR
> >
> >  26/547 qemu:qtest+qtest-i386 / qtest-i386/migration-test            ERROR
> > 78.87s   killed by signal 6 SIGABRT
> >
> > https://app.travis-ci.com/gitlab/qemu-project/qemu/jobs/562515884#L7848
>
> Yeh, thuth mentioned that, it seems to only be s390 which is odd.
> I'm not seeing anything obviously architecture dependent in that set, or
> for that matter that plays with the ram migration stream much.
> Is this reliable enough that someone with a tame s390 could bisect?

Didn't see a SIGABRT, but here's a gdb backtrace of a hang
in the migration test on s390 host. I have also observed the
migration test hanging on macos host, so I don't think this is
s390-specific.

Process tree:
migration-test(455775)-+-qemu-system-i38(456194)
                       |-qemu-system-i38(456200)
                       `-qemu-system-i38(456266)
===========================================================
PROCESS: 455775
linux1    455775  312266  5 14:36 pts/0    00:07:19
./tests/qtest/migration-test -tap -k
[New LWP 455776]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/s390x-linux-gnu/libthread_db.so.1".
__libc_read (nbytes=1, buf=0x3ffe69fd816, fd=4) at
../sysdeps/unix/sysv/linux/read.c:26
26      ../sysdeps/unix/sysv/linux/read.c: No such file or directory.

Thread 2 (Thread 0x3ff9c7ff900 (LWP 455776)):
#0  syscall () at ../sysdeps/unix/sysv/linux/s390/s390-64/syscall.S:37
#1  0x000002aa1e8d24ca in qemu_futex_wait (f=0x2aa1e920cbc
<rcu_call_ready_event>, val=4294967295) at
/home/linux1/qemu/include/qemu/futex.h:29
#2  0x000002aa1e8d276e in qemu_event_wait (ev=0x2aa1e920cbc
<rcu_call_ready_event>) at ../../util/qemu-thread-posix.c:481
#3  0x000002aa1e8e6ce2 in call_rcu_thread (opaque=0x0) at ../../util/rcu.c:261
#4  0x000002aa1e8d2998 in qemu_thread_start (args=0x2aa200e51e0) at
../../util/qemu-thread-posix.c:556
#5  0x000003ff9ca87e66 in start_thread (arg=0x3ff9c7ff900) at
pthread_create.c:477
#6  0x000003ff9c97cbf6 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 1 (Thread 0x3ff9ce75430 (LWP 455775)):
#0  __libc_read (nbytes=1, buf=0x3ffe69fd816, fd=4) at
../sysdeps/unix/sysv/linux/read.c:26
#1  __libc_read (fd=<optimized out>, buf=0x3ffe69fd816, nbytes=1) at
../sysdeps/unix/sysv/linux/read.c:24
#2  0x000002aa1e894652 in qmp_fd_receive (fd=4) at
../../tests/qtest/libqtest.c:613
#3  0x000002aa1e894816 in qtest_qmp_receive_dict (s=0x2aa200f6a20) at
../../tests/qtest/libqtest.c:648
#4  0x000002aa1e894782 in qtest_qmp_receive (s=0x2aa200f6a20) at
../../tests/qtest/libqtest.c:636
#5  0x000002aa1e894ede in qtest_vqmp (s=0x2aa200f6a20,
fmt=0x2aa1e8f140c "{ 'execute': 'query-migrate' }", ap=0x3ffe69fdb80)
at ../../tests/qtest/libqtest.c:749
#6  0x000002aa1e891ac0 in wait_command (who=0x2aa200f6a20,
command=0x2aa1e8f140c "{ 'execute': 'query-migrate' }") at
../../tests/qtest/migration-helpers.c:63
#7  0x000002aa1e891de8 in migrate_query (who=0x2aa200f6a20) at
../../tests/qtest/migration-helpers.c:107
#8  0x000002aa1e891e1a in migrate_query_status (who=0x2aa200f6a20) at
../../tests/qtest/migration-helpers.c:116
#9  0x000002aa1e891ef6 in check_migration_status (who=0x2aa200f6a20,
goal=0x2aa1e8f0f0e "cancelled", ungoals=0x0) at
../../tests/qtest/migration-helpers.c:132
#10 0x000002aa1e892150 in wait_for_migration_status
(who=0x2aa200f6a20, goal=0x2aa1e8f0f0e "cancelled", ungoals=0x0) at
../../tests/qtest/migration-helpers.c:156
#11 0x000002aa1e8910fa in test_multifd_tcp_cancel () at
../../tests/qtest/migration-test.c:1379
#12 0x000003ff9cc7e608 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#13 0x000003ff9cc7e392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#14 0x000003ff9cc7e392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#15 0x000003ff9cc7e392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#16 0x000003ff9cc7e392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#17 0x000003ff9cc7eada in g_test_run_suite () from
/lib/s390x-linux-gnu/libglib-2.0.so.0
#18 0x000003ff9cc7eb10 in g_test_run () from
/lib/s390x-linux-gnu/libglib-2.0.so.0
#19 0x000002aa1e891578 in main (argc=2, argv=0x3ffe69fece8) at
../../tests/qtest/migration-test.c:1491
[Inferior 1 (process 455775) detached]

===========================================================
PROCESS: 456194
linux1    456194  455775 85 14:39 pts/0    01:54:06 ./qemu-system-i386
-qtest unix:/tmp/qtest-455775.sock -qtest-log /dev/null -chardev
socket,path=/tmp/qtest-455775.qmp,id=char0 -mon
chardev=char0,mode=control -display none -accel kvm -accel tcg -name
source,debug-threads=on -m 150M -serial
file:/tmp/migration-test-dmqzpM/src_serial -drive
file=/tmp/migration-test-dmqzpM/bootsect,format=raw -accel qtest
[New LWP 456196]
[New LWP 456197]
[New LWP 456198]
[New LWP 456229]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/s390x-linux-gnu/libthread_db.so.1".
0x000003ff9a071c9c in __ppoll (fds=0x2aa2c46c2f0, nfds=5,
timeout=<optimized out>, sigmask=0x0) at
../sysdeps/unix/sysv/linux/ppoll.c:44
44      ../sysdeps/unix/sysv/linux/ppoll.c: No such file or directory.

Thread 5 (Thread 0x3fee0ff9900 (LWP 456229)):
#0  futex_abstimed_wait_cancelable (private=0, abstime=0x0, clockid=0,
expected=0, futex_word=0x2aa2c46e7e4) at
../sysdeps/nptl/futex-internal.h:320
#1  do_futex_wait (sem=sem@entry=0x2aa2c46e7e0, abstime=0x0,
clockid=0) at sem_waitcommon.c:112
#2  0x000003ff9a191870 in __new_sem_wait_slow (sem=0x2aa2c46e7e0,
abstime=0x0, clockid=0) at sem_waitcommon.c:184
#3  0x000003ff9a19190e in __new_sem_wait (sem=<optimized out>) at sem_wait.c:42
#4  0x000002aa2923da1e in qemu_sem_wait (sem=0x2aa2c46e7e0) at
../../util/qemu-thread-posix.c:358
#5  0x000002aa289483cc in multifd_send_sync_main (f=0x2aa2b5f92d0) at
../../migration/multifd.c:610
#6  0x000002aa28dfa30c in ram_save_iterate (f=0x2aa2b5f92d0,
opaque=0x2aa29bf75d0 <ram_state>) at ../../migration/ram.c:3049
#7  0x000002aa28958fee in qemu_savevm_state_iterate (f=0x2aa2b5f92d0,
postcopy=false) at ../../migration/savevm.c:1296
#8  0x000002aa28942d40 in migration_iteration_run (s=0x2aa2b3f9800) at
../../migration/migration.c:3607
#9  0x000002aa289434da in migration_thread (opaque=0x2aa2b3f9800) at
../../migration/migration.c:3838
#10 0x000002aa2923e020 in qemu_thread_start (args=0x2aa2b8b29e0) at
../../util/qemu-thread-posix.c:556
#11 0x000003ff9a187e66 in start_thread (arg=0x3fee0ff9900) at
pthread_create.c:477
#12 0x000003ff9a07cbf6 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 4 (Thread 0x3ff89f2f900 (LWP 456198)):
#0  env_neg (env=0x2aa2b5f5030) at /home/linux1/qemu/include/exec/cpu-all.h:478
#1  0x000002aa28f5376a in env_tlb (env=0x2aa2b5f5030) at
/home/linux1/qemu/include/exec/cpu-all.h:502
#2  0x000002aa28f538a8 in tlb_index (env=0x2aa2b5f5030, mmu_idx=2,
addr=73265152) at /home/linux1/qemu/include/exec/cpu_ldst.h:366
#3  0x000002aa28f574bc in tlb_set_page_with_attrs (cpu=0x2aa2b5ec750,
vaddr=73265152, paddr=73265152, attrs=..., prot=7, mmu_idx=2,
size=4096) at ../../accel/tcg/cputlb.c:1194
#4  0x000002aa28cdfd3e in handle_mmu_fault (cs=0x2aa2b5ec750,
addr=73265152, size=1, is_write1=0, mmu_idx=2) at
../../target/i386/tcg/sysemu/excp_helper.c:442
#5  0x000002aa28cdfe90 in x86_cpu_tlb_fill (cs=0x2aa2b5ec750,
addr=73265152, size=1, access_type=MMU_DATA_LOAD, mmu_idx=2,
probe=false, retaddr=4393820748608) at
../../target/i386/tcg/sysemu/excp_helper.c:468
#6  0x000002aa28f5794e in tlb_fill (cpu=0x2aa2b5ec750, addr=73265152,
size=1, access_type=MMU_DATA_LOAD, mmu_idx=2, retaddr=4393820748608)
at ../../accel/tcg/cputlb.c:1313
#7  0x000002aa28f59982 in load_helper (env=0x2aa2b5f5030,
addr=73265152, oi=3586, retaddr=4393820748608, op=MO_8,
code_read=false, full_load=0x2aa28f59db0 <full_ldub_mmu>) at
../../accel/tcg/cputlb.c:1934
#8  0x000002aa28f59e2e in full_ldub_mmu (env=0x2aa2b5f5030,
addr=73265152, oi=3586, retaddr=4393820748608) at
../../accel/tcg/cputlb.c:2025
#9  0x000002aa28f59e94 in helper_ret_ldub_mmu (env=0x2aa2b5f5030,
addr=73265152, oi=3586, retaddr=4393820748608) at
../../accel/tcg/cputlb.c:2031
#10 0x000003ff041ffbfa in code_gen_buffer ()
#11 0x000002aa28f3cfba in cpu_tb_exec (cpu=0x2aa2b5ec750,
itb=0x3ff441ffa00, tb_exit=0x3ff89f2af44) at
../../accel/tcg/cpu-exec.c:357
#12 0x000002aa28f3e47e in cpu_loop_exec_tb (cpu=0x2aa2b5ec750,
tb=0x3ff441ffa00, last_tb=0x3ff89f2af58, tb_exit=0x3ff89f2af44) at
../../accel/tcg/cpu-exec.c:847
#13 0x000002aa28f3e970 in cpu_exec (cpu=0x2aa2b5ec750) at
../../accel/tcg/cpu-exec.c:1006
#14 0x000002aa28f71a1e in tcg_cpus_exec (cpu=0x2aa2b5ec750) at
../../accel/tcg/tcg-accel-ops.c:68
#15 0x000002aa28f71efe in mttcg_cpu_thread_fn (arg=0x2aa2b5ec750) at
../../accel/tcg/tcg-accel-ops-mttcg.c:96
#16 0x000002aa2923e020 in qemu_thread_start (args=0x2aa2b60bb00) at
../../util/qemu-thread-posix.c:556
#17 0x000003ff9a187e66 in start_thread (arg=0x3ff89f2f900) at
pthread_create.c:477
#18 0x000003ff9a07cbf6 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 3 (Thread 0x3ff8a821900 (LWP 456197)):
#0  0x000003ff9a071b42 in __GI___poll (fds=0x3fefc003280, nfds=3,
timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x000003ff9c7d4386 in  () at /lib/s390x-linux-gnu/libglib-2.0.so.0
#2  0x000003ff9c7d4790 in g_main_loop_run () at
/lib/s390x-linux-gnu/libglib-2.0.so.0
#3  0x000002aa28fd9d56 in iothread_run (opaque=0x2aa2b339750) at
../../iothread.c:73
#4  0x000002aa2923e020 in qemu_thread_start (args=0x2aa2b2e7980) at
../../util/qemu-thread-posix.c:556
#5  0x000003ff9a187e66 in start_thread (arg=0x3ff8a821900) at
pthread_create.c:477
#6  0x000003ff9a07cbf6 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 2 (Thread 0x3ff8b1a4900 (LWP 456196)):
#0  syscall () at ../sysdeps/unix/sysv/linux/s390/s390-64/syscall.S:37
#1  0x000002aa2923db52 in qemu_futex_wait (f=0x2aa29c14244
<rcu_call_ready_event>, val=4294967295) at
/home/linux1/qemu/include/qemu/futex.h:29
#2  0x000002aa2923ddf6 in qemu_event_wait (ev=0x2aa29c14244
<rcu_call_ready_event>) at ../../util/qemu-thread-posix.c:481
#3  0x000002aa2924cbd2 in call_rcu_thread (opaque=0x0) at ../../util/rcu.c:261
#4  0x000002aa2923e020 in qemu_thread_start (args=0x2aa2b26ac90) at
../../util/qemu-thread-posix.c:556
#5  0x000003ff9a187e66 in start_thread (arg=0x3ff8b1a4900) at
pthread_create.c:477
#6  0x000003ff9a07cbf6 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 1 (Thread 0x3ff9d5fe440 (LWP 456194)):
#0  0x000003ff9a071c9c in __ppoll (fds=0x2aa2c46c2f0, nfds=5,
timeout=<optimized out>, sigmask=0x0) at
../sysdeps/unix/sysv/linux/ppoll.c:44
#1  0x000002aa2927a3e4 in qemu_poll_ns (fds=0x2aa2c46c2f0, nfds=5,
timeout=27206167) at ../../util/qemu-timer.c:348
#2  0x000002aa29272280 in os_host_main_loop_wait (timeout=27206167) at
../../util/main-loop.c:250
#3  0x000002aa29272434 in main_loop_wait (nonblocking=0) at
../../util/main-loop.c:531
#4  0x000002aa28901276 in qemu_main_loop () at ../../softmmu/runstate.c:727
#5  0x000002aa2887d2ce in main (argc=25, argv=0x3fff647eac8,
envp=0x3fff647eb98) at ../../softmmu/main.c:50
[Inferior 1 (process 456194) detached]

===========================================================
PROCESS: 456200
linux1    456200  455775  0 14:39 pts/0    00:00:00 [qemu-system-i38] <defunct>
/proc/456200/exe: No such file or directory.
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
warning: process 456200 is a zombie - the process has already terminated
ptrace: Operation not permitted.
/home/linux1/456200: No such file or directory.

===========================================================
PROCESS: 456266
linux1    456266  455775  0 14:39 pts/0    00:00:00 ./qemu-system-i386
-qtest unix:/tmp/qtest-455775.sock -qtest-log /dev/null -chardev
socket,path=/tmp/qtest-455775.qmp,id=char0 -mon
chardev=char0,mode=control -display none -accel kvm -accel tcg -name
target,debug-threads=on -m 150M -serial
file:/tmp/migration-test-dmqzpM/dest_serial -incoming defer -drive
file=/tmp/migration-test-dmqzpM/bootsect,format=raw -accel qtest
[New LWP 456268]
[New LWP 456269]
[New LWP 456270]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/s390x-linux-gnu/libthread_db.so.1".
0x000003ff9a271c9c in __ppoll (fds=0x2aa0435eb40, nfds=6,
timeout=<optimized out>, sigmask=0x0) at
../sysdeps/unix/sysv/linux/ppoll.c:44
44      ../sysdeps/unix/sysv/linux/ppoll.c: No such file or directory.

Thread 4 (Thread 0x3ff8a12f900 (LWP 456270)):
#0  futex_wait_cancelable (private=0, expected=0,
futex_word=0x2aa0459cbac) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0,
mutex=0x2aa02ddec88 <qemu_global_mutex>, cond=0x2aa0459cb80) at
pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x2aa0459cb80, mutex=0x2aa02ddec88
<qemu_global_mutex>) at pthread_cond_wait.c:638
#3  0x000002aa0243d498 in qemu_cond_wait_impl (cond=0x2aa0459cb80,
mutex=0x2aa02ddec88 <qemu_global_mutex>, file=0x2aa024e81e8
"../../softmmu/cpus.c", line=424) at
../../util/qemu-thread-posix.c:195
#4  0x000002aa01af4cc0 in qemu_wait_io_event (cpu=0x2aa0457d750) at
../../softmmu/cpus.c:424
#5  0x000002aa02172028 in mttcg_cpu_thread_fn (arg=0x2aa0457d750) at
../../accel/tcg/tcg-accel-ops-mttcg.c:124
#6  0x000002aa0243e020 in qemu_thread_start (args=0x2aa0459cbc0) at
../../util/qemu-thread-posix.c:556
#7  0x000003ff9a387e66 in start_thread (arg=0x3ff8a12f900) at
pthread_create.c:477
#8  0x000003ff9a27cbf6 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 3 (Thread 0x3ff8aa21900 (LWP 456269)):
#0  0x000003ff9a271b42 in __GI___poll (fds=0x3fefc003280, nfds=3,
timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x000003ff9c9d4386 in  () at /lib/s390x-linux-gnu/libglib-2.0.so.0
#2  0x000003ff9c9d4790 in g_main_loop_run () at
/lib/s390x-linux-gnu/libglib-2.0.so.0
#3  0x000002aa021d9d56 in iothread_run (opaque=0x2aa042ca750) at
../../iothread.c:73
#4  0x000002aa0243e020 in qemu_thread_start (args=0x2aa04278980) at
../../util/qemu-thread-posix.c:556
#5  0x000003ff9a387e66 in start_thread (arg=0x3ff8aa21900) at
pthread_create.c:477
#6  0x000003ff9a27cbf6 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 2 (Thread 0x3ff8b3a4900 (LWP 456268)):
#0  syscall () at ../sysdeps/unix/sysv/linux/s390/s390-64/syscall.S:37
#1  0x000002aa0243db52 in qemu_futex_wait (f=0x2aa02e14244
<rcu_call_ready_event>, val=4294967295) at
/home/linux1/qemu/include/qemu/futex.h:29
#2  0x000002aa0243ddf6 in qemu_event_wait (ev=0x2aa02e14244
<rcu_call_ready_event>) at ../../util/qemu-thread-posix.c:481
#3  0x000002aa0244cbd2 in call_rcu_thread (opaque=0x0) at ../../util/rcu.c:261
#4  0x000002aa0243e020 in qemu_thread_start (args=0x2aa041fbc90) at
../../util/qemu-thread-posix.c:556
#5  0x000003ff9a387e66 in start_thread (arg=0x3ff8b3a4900) at
pthread_create.c:477
#6  0x000003ff9a27cbf6 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 1 (Thread 0x3ff9d7fe440 (LWP 456266)):
#0  0x000003ff9a271c9c in __ppoll (fds=0x2aa0435eb40, nfds=6,
timeout=<optimized out>, sigmask=0x0) at
../sysdeps/unix/sysv/linux/ppoll.c:44
#1  0x000002aa0247a3e4 in qemu_poll_ns (fds=0x2aa0435eb40, nfds=6,
timeout=1000000000) at ../../util/qemu-timer.c:348
#2  0x000002aa02472280 in os_host_main_loop_wait (timeout=1000000000)
at ../../util/main-loop.c:250
#3  0x000002aa02472434 in main_loop_wait (nonblocking=0) at
../../util/main-loop.c:531
#4  0x000002aa01b01276 in qemu_main_loop () at ../../softmmu/runstate.c:727
#5  0x000002aa01a7d2ce in main (argc=27, argv=0x3ffe38fe7e8,
envp=0x3ffe38fe8c8) at ../../softmmu/main.c:50
[Inferior 1 (process 456266) detached]

thanks
-- PMM
Re: [PULL 00/18] migration queue
Posted by Daniel P. Berrangé 2 years, 1 month ago
On Mon, Mar 14, 2022 at 04:56:18PM +0000, Peter Maydell wrote:
> On Tue, 8 Mar 2022 at 18:47, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> >
> > * Philippe Mathieu-Daudé (philippe.mathieu.daude@gmail.com) wrote:
> > > I'm seeing an error on the s390x runner:
> > >
> > > ▶  26/547 ERROR:../tests/qtest/migration-test.c:276:check_guests_ram:
> > > assertion failed: (bad == 0) ERROR
> > >
> > >  26/547 qemu:qtest+qtest-i386 / qtest-i386/migration-test            ERROR
> > > 78.87s   killed by signal 6 SIGABRT
> > >
> > > https://app.travis-ci.com/gitlab/qemu-project/qemu/jobs/562515884#L7848
> >
> > Yeh, thuth mentioned that, it seems to only be s390 which is odd.
> > I'm not seeing anything obviously architecture dependent in that set, or
> > for that matter that plays with the ram migration stream much.
> > Is this reliable enough that someone with a tame s390 could bisect?
> 
> Didn't see a SIGABRT, but here's a gdb backtrace of a hang
> in the migration test on s390 host. I have also observed the
> migration test hanging on macos host, so I don't think this is
> s390-specific.
> 
> Process tree:
> migration-test(455775)-+-qemu-system-i38(456194)
>                        |-qemu-system-i38(456200)
>                        `-qemu-system-i38(456266)
> ===========================================================
> PROCESS: 455775
> linux1    455775  312266  5 14:36 pts/0    00:07:19
> ./tests/qtest/migration-test -tap -k
> [New LWP 455776]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/s390x-linux-gnu/libthread_db.so.1".
> __libc_read (nbytes=1, buf=0x3ffe69fd816, fd=4) at
> ../sysdeps/unix/sysv/linux/read.c:26
> 26      ../sysdeps/unix/sysv/linux/read.c: No such file or directory.


> 
> #5  0x000002aa1e894ede in qtest_vqmp (s=0x2aa200f6a20,
> fmt=0x2aa1e8f140c "{ 'execute': 'query-migrate' }", ap=0x3ffe69fdb80)
> at ../../tests/qtest/libqtest.c:749

So the test harness is waiting for a reply to 'query-migrate'.

This should be fast unless QEMU has hung in the main event
loop servicing monitor commands, or stopped.
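The exchange the harness is stuck in can be sketched with a client-side timeout, so a hang surfaces as an error instead of blocking forever; a minimal sketch only (the real harness is C in tests/qtest/libqtest.c and blocks in read() with no timeout, which is the wait visible in thread 1 above):

```python
import json
import socket

def qmp_query(path, command, timeout=30.0):
    """Send one QMP command over a Unix socket and return its reply.

    Raises socket.timeout if QEMU's main loop stops servicing the
    monitor within `timeout` seconds.
    """
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.settimeout(timeout)
    s.connect(path)
    f = s.makefile("rw", buffering=1)          # line-buffered text stream
    json.loads(f.readline())                   # server greeting banner
    f.write(json.dumps({"execute": "qmp_capabilities"}) + "\n")
    json.loads(f.readline())                   # capabilities ack
    f.write(json.dumps({"execute": command}) + "\n")
    return json.loads(f.readline())            # command reply
```

Against a live guest this would be called as `qmp_query("/tmp/qtest-455775.qmp", "query-migrate")`, mirroring the wait_command() call in the backtrace.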


> ===========================================================
> PROCESS: 456194
> linux1    456194  455775 85 14:39 pts/0    01:54:06 ./qemu-system-i386
> -qtest unix:/tmp/qtest-455775.sock -qtest-log /dev/null -chardev
> socket,path=/tmp/qtest-455775.qmp,id=char0 -mon
> chardev=char0,mode=control -display none -accel kvm -accel tcg -name
> source,debug-threads=on -m 150M -serial
> file:/tmp/migration-test-dmqzpM/src_serial -drive
> file=/tmp/migration-test-dmqzpM/bootsect,format=raw -accel qtest
> [New LWP 456196]
> [New LWP 456197]
> [New LWP 456198]
> [New LWP 456229]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/s390x-linux-gnu/libthread_db.so.1".
> 0x000003ff9a071c9c in __ppoll (fds=0x2aa2c46c2f0, nfds=5,
> timeout=<optimized out>, sigmask=0x0) at
> ../sysdeps/unix/sysv/linux/ppoll.c:44
> 44      ../sysdeps/unix/sysv/linux/ppoll.c: No such file or directory.
> 
> Thread 5 (Thread 0x3fee0ff9900 (LWP 456229)):
> #0  futex_abstimed_wait_cancelable (private=0, abstime=0x0, clockid=0,
> expected=0, futex_word=0x2aa2c46e7e4) at
> ../sysdeps/nptl/futex-internal.h:320
> #1  do_futex_wait (sem=sem@entry=0x2aa2c46e7e0, abstime=0x0,
> clockid=0) at sem_waitcommon.c:112
> #2  0x000003ff9a191870 in __new_sem_wait_slow (sem=0x2aa2c46e7e0,
> abstime=0x0, clockid=0) at sem_waitcommon.c:184
> #3  0x000003ff9a19190e in __new_sem_wait (sem=<optimized out>) at sem_wait.c:42
> #4  0x000002aa2923da1e in qemu_sem_wait (sem=0x2aa2c46e7e0) at
> ../../util/qemu-thread-posix.c:358
> #5  0x000002aa289483cc in multifd_send_sync_main (f=0x2aa2b5f92d0) at
> ../../migration/multifd.c:610
> #6  0x000002aa28dfa30c in ram_save_iterate (f=0x2aa2b5f92d0,
> opaque=0x2aa29bf75d0 <ram_state>) at ../../migration/ram.c:3049
> #7  0x000002aa28958fee in qemu_savevm_state_iterate (f=0x2aa2b5f92d0,
> postcopy=false) at ../../migration/savevm.c:1296
> #8  0x000002aa28942d40 in migration_iteration_run (s=0x2aa2b3f9800) at
> ../../migration/migration.c:3607
> #9  0x000002aa289434da in migration_thread (opaque=0x2aa2b3f9800) at
> ../../migration/migration.c:3838
> #10 0x000002aa2923e020 in qemu_thread_start (args=0x2aa2b8b29e0) at
> ../../util/qemu-thread-posix.c:556
> #11 0x000003ff9a187e66 in start_thread (arg=0x3fee0ff9900) at
> pthread_create.c:477
> #12 0x000003ff9a07cbf6 in thread_start () at
> ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 4 (Thread 0x3ff89f2f900 (LWP 456198)):
> #0  env_neg (env=0x2aa2b5f5030) at /home/linux1/qemu/include/exec/cpu-all.h:478
> #1  0x000002aa28f5376a in env_tlb (env=0x2aa2b5f5030) at
> /home/linux1/qemu/include/exec/cpu-all.h:502
> #2  0x000002aa28f538a8 in tlb_index (env=0x2aa2b5f5030, mmu_idx=2,
> addr=73265152) at /home/linux1/qemu/include/exec/cpu_ldst.h:366
> #3  0x000002aa28f574bc in tlb_set_page_with_attrs (cpu=0x2aa2b5ec750,
> vaddr=73265152, paddr=73265152, attrs=..., prot=7, mmu_idx=2,
> size=4096) at ../../accel/tcg/cputlb.c:1194
> #4  0x000002aa28cdfd3e in handle_mmu_fault (cs=0x2aa2b5ec750,
> addr=73265152, size=1, is_write1=0, mmu_idx=2) at
> ../../target/i386/tcg/sysemu/excp_helper.c:442
> #5  0x000002aa28cdfe90 in x86_cpu_tlb_fill (cs=0x2aa2b5ec750,
> addr=73265152, size=1, access_type=MMU_DATA_LOAD, mmu_idx=2,
> probe=false, retaddr=4393820748608) at
> ../../target/i386/tcg/sysemu/excp_helper.c:468
> #6  0x000002aa28f5794e in tlb_fill (cpu=0x2aa2b5ec750, addr=73265152,
> size=1, access_type=MMU_DATA_LOAD, mmu_idx=2, retaddr=4393820748608)
> at ../../accel/tcg/cputlb.c:1313
> #7  0x000002aa28f59982 in load_helper (env=0x2aa2b5f5030,
> addr=73265152, oi=3586, retaddr=4393820748608, op=MO_8,
> code_read=false, full_load=0x2aa28f59db0 <full_ldub_mmu>) at
> ../../accel/tcg/cputlb.c:1934
> #8  0x000002aa28f59e2e in full_ldub_mmu (env=0x2aa2b5f5030,
> addr=73265152, oi=3586, retaddr=4393820748608) at
> ../../accel/tcg/cputlb.c:2025
> #9  0x000002aa28f59e94 in helper_ret_ldub_mmu (env=0x2aa2b5f5030,
> addr=73265152, oi=3586, retaddr=4393820748608) at
> ../../accel/tcg/cputlb.c:2031
> #10 0x000003ff041ffbfa in code_gen_buffer ()
> #11 0x000002aa28f3cfba in cpu_tb_exec (cpu=0x2aa2b5ec750,
> itb=0x3ff441ffa00, tb_exit=0x3ff89f2af44) at
> ../../accel/tcg/cpu-exec.c:357
> #12 0x000002aa28f3e47e in cpu_loop_exec_tb (cpu=0x2aa2b5ec750,
> tb=0x3ff441ffa00, last_tb=0x3ff89f2af58, tb_exit=0x3ff89f2af44) at
> ../../accel/tcg/cpu-exec.c:847
> #13 0x000002aa28f3e970 in cpu_exec (cpu=0x2aa2b5ec750) at
> ../../accel/tcg/cpu-exec.c:1006
> #14 0x000002aa28f71a1e in tcg_cpus_exec (cpu=0x2aa2b5ec750) at
> ../../accel/tcg/tcg-accel-ops.c:68
> #15 0x000002aa28f71efe in mttcg_cpu_thread_fn (arg=0x2aa2b5ec750) at
> ../../accel/tcg/tcg-accel-ops-mttcg.c:96
> #16 0x000002aa2923e020 in qemu_thread_start (args=0x2aa2b60bb00) at
> ../../util/qemu-thread-posix.c:556
> #17 0x000003ff9a187e66 in start_thread (arg=0x3ff89f2f900) at
> pthread_create.c:477
> #18 0x000003ff9a07cbf6 in thread_start () at
> ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 3 (Thread 0x3ff8a821900 (LWP 456197)):
> #0  0x000003ff9a071b42 in __GI___poll (fds=0x3fefc003280, nfds=3,
> timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:29
> #1  0x000003ff9c7d4386 in  () at /lib/s390x-linux-gnu/libglib-2.0.so.0
> #2  0x000003ff9c7d4790 in g_main_loop_run () at
> /lib/s390x-linux-gnu/libglib-2.0.so.0
> #3  0x000002aa28fd9d56 in iothread_run (opaque=0x2aa2b339750) at
> ../../iothread.c:73
> #4  0x000002aa2923e020 in qemu_thread_start (args=0x2aa2b2e7980) at
> ../../util/qemu-thread-posix.c:556
> #5  0x000003ff9a187e66 in start_thread (arg=0x3ff8a821900) at
> pthread_create.c:477
> #6  0x000003ff9a07cbf6 in thread_start () at
> ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 2 (Thread 0x3ff8b1a4900 (LWP 456196)):
> #0  syscall () at ../sysdeps/unix/sysv/linux/s390/s390-64/syscall.S:37
> #1  0x000002aa2923db52 in qemu_futex_wait (f=0x2aa29c14244
> <rcu_call_ready_event>, val=4294967295) at
> /home/linux1/qemu/include/qemu/futex.h:29
> #2  0x000002aa2923ddf6 in qemu_event_wait (ev=0x2aa29c14244
> <rcu_call_ready_event>) at ../../util/qemu-thread-posix.c:481
> #3  0x000002aa2924cbd2 in call_rcu_thread (opaque=0x0) at ../../util/rcu.c:261
> #4  0x000002aa2923e020 in qemu_thread_start (args=0x2aa2b26ac90) at
> ../../util/qemu-thread-posix.c:556
> #5  0x000003ff9a187e66 in start_thread (arg=0x3ff8b1a4900) at
> pthread_create.c:477
> #6  0x000003ff9a07cbf6 in thread_start () at
> ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 1 (Thread 0x3ff9d5fe440 (LWP 456194)):
> #0  0x000003ff9a071c9c in __ppoll (fds=0x2aa2c46c2f0, nfds=5,
> timeout=<optimized out>, sigmask=0x0) at
> ../sysdeps/unix/sysv/linux/ppoll.c:44
> #1  0x000002aa2927a3e4 in qemu_poll_ns (fds=0x2aa2c46c2f0, nfds=5,
> timeout=27206167) at ../../util/qemu-timer.c:348
> #2  0x000002aa29272280 in os_host_main_loop_wait (timeout=27206167) at
> ../../util/main-loop.c:250
> #3  0x000002aa29272434 in main_loop_wait (nonblocking=0) at
> ../../util/main-loop.c:531
> #4  0x000002aa28901276 in qemu_main_loop () at ../../softmmu/runstate.c:727
> #5  0x000002aa2887d2ce in main (argc=25, argv=0x3fff647eac8,
> envp=0x3fff647eb98) at ../../softmmu/main.c:50
> [Inferior 1 (process 456194) detached]


No obvious sign of a hang here that would cause it to fail to reply
to 'query-migrate'.



> ===========================================================
> PROCESS: 456266
> linux1    456266  455775  0 14:39 pts/0    00:00:00 ./qemu-system-i386
> -qtest unix:/tmp/qtest-455775.sock -qtest-log /dev/null -chardev
> socket,path=/tmp/qtest-455775.qmp,id=char0 -mon
> chardev=char0,mode=control -display none -accel kvm -accel tcg -name
> target,debug-threads=on -m 150M -serial
> file:/tmp/migration-test-dmqzpM/dest_serial -incoming defer -drive
> file=/tmp/migration-test-dmqzpM/bootsect,format=raw -accel qtest
> [New LWP 456268]
> [New LWP 456269]
> [New LWP 456270]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/s390x-linux-gnu/libthread_db.so.1".
> 0x000003ff9a271c9c in __ppoll (fds=0x2aa0435eb40, nfds=6,
> timeout=<optimized out>, sigmask=0x0) at
> ../sysdeps/unix/sysv/linux/ppoll.c:44
> 44      ../sysdeps/unix/sysv/linux/ppoll.c: No such file or directory.
> 
> Thread 4 (Thread 0x3ff8a12f900 (LWP 456270)):
> #0  futex_wait_cancelable (private=0, expected=0,
> futex_word=0x2aa0459cbac) at ../sysdeps/nptl/futex-internal.h:183
> #1  __pthread_cond_wait_common (abstime=0x0, clockid=0,
> mutex=0x2aa02ddec88 <qemu_global_mutex>, cond=0x2aa0459cb80) at
> pthread_cond_wait.c:508
> #2  __pthread_cond_wait (cond=0x2aa0459cb80, mutex=0x2aa02ddec88
> <qemu_global_mutex>) at pthread_cond_wait.c:638
> #3  0x000002aa0243d498 in qemu_cond_wait_impl (cond=0x2aa0459cb80,
> mutex=0x2aa02ddec88 <qemu_global_mutex>, file=0x2aa024e81e8
> "../../softmmu/cpus.c", line=424) at
> ../../util/qemu-thread-posix.c:195
> #4  0x000002aa01af4cc0 in qemu_wait_io_event (cpu=0x2aa0457d750) at
> ../../softmmu/cpus.c:424
> #5  0x000002aa02172028 in mttcg_cpu_thread_fn (arg=0x2aa0457d750) at
> ../../accel/tcg/tcg-accel-ops-mttcg.c:124
> #6  0x000002aa0243e020 in qemu_thread_start (args=0x2aa0459cbc0) at
> ../../util/qemu-thread-posix.c:556
> #7  0x000003ff9a387e66 in start_thread (arg=0x3ff8a12f900) at
> pthread_create.c:477
> #8  0x000003ff9a27cbf6 in thread_start () at
> ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 3 (Thread 0x3ff8aa21900 (LWP 456269)):
> #0  0x000003ff9a271b42 in __GI___poll (fds=0x3fefc003280, nfds=3,
> timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:29
> #1  0x000003ff9c9d4386 in  () at /lib/s390x-linux-gnu/libglib-2.0.so.0
> #2  0x000003ff9c9d4790 in g_main_loop_run () at
> /lib/s390x-linux-gnu/libglib-2.0.so.0
> #3  0x000002aa021d9d56 in iothread_run (opaque=0x2aa042ca750) at
> ../../iothread.c:73
> #4  0x000002aa0243e020 in qemu_thread_start (args=0x2aa04278980) at
> ../../util/qemu-thread-posix.c:556
> #5  0x000003ff9a387e66 in start_thread (arg=0x3ff8aa21900) at
> pthread_create.c:477
> #6  0x000003ff9a27cbf6 in thread_start () at
> ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 2 (Thread 0x3ff8b3a4900 (LWP 456268)):
> #0  syscall () at ../sysdeps/unix/sysv/linux/s390/s390-64/syscall.S:37
> #1  0x000002aa0243db52 in qemu_futex_wait (f=0x2aa02e14244
> <rcu_call_ready_event>, val=4294967295) at
> /home/linux1/qemu/include/qemu/futex.h:29
> #2  0x000002aa0243ddf6 in qemu_event_wait (ev=0x2aa02e14244
> <rcu_call_ready_event>) at ../../util/qemu-thread-posix.c:481
> #3  0x000002aa0244cbd2 in call_rcu_thread (opaque=0x0) at ../../util/rcu.c:261
> #4  0x000002aa0243e020 in qemu_thread_start (args=0x2aa041fbc90) at
> ../../util/qemu-thread-posix.c:556
> #5  0x000003ff9a387e66 in start_thread (arg=0x3ff8b3a4900) at
> pthread_create.c:477
> #6  0x000003ff9a27cbf6 in thread_start () at
> ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 1 (Thread 0x3ff9d7fe440 (LWP 456266)):
> #0  0x000003ff9a271c9c in __ppoll (fds=0x2aa0435eb40, nfds=6,
> timeout=<optimized out>, sigmask=0x0) at
> ../sysdeps/unix/sysv/linux/ppoll.c:44
> #1  0x000002aa0247a3e4 in qemu_poll_ns (fds=0x2aa0435eb40, nfds=6,
> timeout=1000000000) at ../../util/qemu-timer.c:348
> #2  0x000002aa02472280 in os_host_main_loop_wait (timeout=1000000000)
> at ../../util/main-loop.c:250
> #3  0x000002aa02472434 in main_loop_wait (nonblocking=0) at
> ../../util/main-loop.c:531
> #4  0x000002aa01b01276 in qemu_main_loop () at ../../softmmu/runstate.c:727
> #5  0x000002aa01a7d2ce in main (argc=27, argv=0x3ffe38fe7e8,
> envp=0x3ffe38fe8c8) at ../../softmmu/main.c:50
> [Inferior 1 (process 456266) detached]

No obvious sign of trouble here either.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


Re: [PULL 00/18] migration queue
Posted by Peter Maydell 2 years, 1 month ago
On Mon, 14 Mar 2022 at 17:07, Daniel P. Berrangé <berrange@redhat.com> wrote:
> So the test harness is waiting for a reply to 'query-migrate'.
>
> This should be fast unless QEMU has hung in the main event
> loop servicing monitor commands, or stopped.

I was kind of loose with the terminology -- I don't remember whether
it was actually hung in the sense of stopped entirely, or just
"sat in a loop waiting for a migration state that never arrives".
I'll try to look more closely if I can catch it in the act again.

One thing that makes this bug investigation trickier, incidentally,
is that the migration-test code seems to depend on userfaultfd.
That means you can't run it under 'rr'.

-- PMM
Re: [PULL 00/18] migration queue
Posted by Peter Maydell 2 years, 1 month ago
On Mon, 14 Mar 2022 at 17:15, Peter Maydell <peter.maydell@linaro.org> wrote:
>
> On Mon, 14 Mar 2022 at 17:07, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > So the test harness is waiting for a reply to 'query-migrate'.
> >
> > This should be fast unless QEMU has hung in the main event
> > loop servicing monitor commands, or stopped.
>
> I was kind of loose with the terminology -- I don't remember whether
> it was actually hung in the sense of stopped entirely, or just
> "sat in a loop waiting for a migration state that never arrives".
> I'll try to look more closely if I can catch it in the act again.

I just hit the abort case, narrowing it down to the
/i386/migration/multifd/tcp/zlib case, which can hit this without
any other tests being run:

$ QTEST_QEMU_BINARY=./qemu-system-i386 ./tests/qtest/migration-test
-tap -k -p /i386/migration/multifd/tcp/zlib
# random seed: R02S37eab07b59417f6cd7e26d94df0d3908
# Start of i386 tests
# Start of migration tests
# Start of multifd tests
# Start of tcp tests
# starting QEMU: exec ./qemu-system-i386 -qtest
unix:/tmp/qtest-782502.sock -qtest-log /dev/null -chardev
socket,path=/tmp/qtest-782502.qmp,id=char0 -mon
chardev=char0,mode=control -display none -accel kvm -accel tcg -name
source,debug-threads=on -m 150M -serial
file:/tmp/migration-test-H8Ggsm/src_serial -drive
file=/tmp/migration-test-H8Ggsm/bootsect,format=raw    -accel qtest
# starting QEMU: exec ./qemu-system-i386 -qtest
unix:/tmp/qtest-782502.sock -qtest-log /dev/null -chardev
socket,path=/tmp/qtest-782502.qmp,id=char0 -mon
chardev=char0,mode=control -display none -accel kvm -accel tcg -name
target,debug-threads=on -m 150M -serial
file:/tmp/migration-test-H8Ggsm/dest_serial -incoming defer -drive
file=/tmp/migration-test-H8Ggsm/bootsect,format=raw    -accel qtest
Memory content inconsistency at 5f76000 first_byte = 2 last_byte = 1
current = 2 hit_edge = 1
Memory content inconsistency at 5f77000 first_byte = 2 last_byte = 1
current = 2 hit_edge = 1
Memory content inconsistency at 5f78000 first_byte = 2 last_byte = 1
current = 2 hit_edge = 1
Memory content inconsistency at 5f79000 first_byte = 2 last_byte = 1
current = 2 hit_edge = 1
Memory content inconsistency at 5f7a000 first_byte = 2 last_byte = 1
current = 2 hit_edge = 1
Memory content inconsistency at 5f7b000 first_byte = 2 last_byte = 1
current = 2 hit_edge = 1
Memory content inconsistency at 5f7c000 first_byte = 2 last_byte = 1
current = 2 hit_edge = 1
Memory content inconsistency at 5f7d000 first_byte = 2 last_byte = 1
current = 2 hit_edge = 1
Memory content inconsistency at 5f7e000 first_byte = 2 last_byte = 1
current = 2 hit_edge = 1
Memory content inconsistency at 5f7f000 first_byte = 2 last_byte = 1
current = 2 hit_edge = 1
and in another 17 pages**
ERROR:../../tests/qtest/migration-test.c:276:check_guests_ram:
assertion failed: (bad == 0)
Bail out! ERROR:../../tests/qtest/migration-test.c:276:check_guests_ram:
assertion failed: (bad == 0)
Aborted (core dumped)
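The consistency check behind those "Memory content inconsistency" messages can be sketched as follows (a hedged reconstruction of the idea, not the actual check_guests_ram code): the guest increments one marker byte per page in address order, so a consistent snapshot is `first_byte` for some prefix, a single "edge", then `last_byte` to the end.

```python
def check_guest_ram(pages):
    """Count inconsistent pages in a one-byte-per-page snapshot of guest
    RAM. Because the guest dirties pages sequentially, a valid snapshot
    holds first_byte up to exactly one edge, then last_byte
    (== first_byte - 1 mod 256) for the rest."""
    first_byte, last_byte = pages[0], pages[-1]
    hit_edge = False
    bad = 0
    for addr, current in enumerate(pages):
        # The single allowed transition: first sighting of last_byte.
        if not hit_edge and last_byte != first_byte and current == last_byte:
            hit_edge = True
        expected = last_byte if hit_edge else first_byte
        if current != expected:
            print(f"Memory content inconsistency at {addr:x} "
                  f"first_byte = {first_byte} last_byte = {last_byte} "
                  f"current = {current} hit_edge = {int(hit_edge)}")
            bad += 1
    return bad
```

A page showing `first_byte` again after the edge has passed (the `current = 2 hit_edge = 1` pattern in the log above) means a page whose newer contents were lost or corrupted in transfer.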

This test seems to fail fairly frequently. I'll try a bisect...

thanks
-- PMM
Re: [PULL 00/18] migration queue
Posted by Peter Maydell 2 years, 1 month ago
On Mon, 14 Mar 2022 at 18:58, Peter Maydell <peter.maydell@linaro.org> wrote:
>
> On Mon, 14 Mar 2022 at 17:15, Peter Maydell <peter.maydell@linaro.org> wrote:
> >
> > On Mon, 14 Mar 2022 at 17:07, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > So the test harness is waiting for a reply to 'query-migrate'.
> > >
> > > This should be fast unless QEMU has hung in the main event
> > > loop servicing monitor commands, or stopped.
> >
> > I was kind of loose with the terminology -- I don't remember whether
> > it was actually hung in the sense of stopped entirely, or just
> > "sat in a loop waiting for a migration state that never arrives".
> > I'll try to look more closely if I can catch it in the act again.
>
> I just hit the abort case, narrowing it down to the
> /i386/migration/multifd/tcp/zlib case, which can hit this without
> any other tests being run:

> This test seems to fail fairly frequently. I'll try a bisect...

On this s390 machine, this test has been intermittent since
it was first added in commit 7ec2c2b3c1 ("multifd: Add zlib compression
multifd support") in 2019. On that commit (after 31 successful
runs):

# random seed: R02S17937f515046216afcc72143266b3e1f
# Start of i386 tests
# Start of migration tests
# Start of multifd tests
# Start of tcp tests
# starting QEMU: exec ./build/i386/i386-softmmu/qemu-system-i386
-qtest unix:/tmp/qtest-861747.sock -qtest-log /dev/null -chardev
socket,path=/tmp/qtest-861747.qmp,id=char0 -mon
chardev=char0,mode=control -display none -accel kvm -accel tcg -name
source,debug-threads=on -m 150M -serial
file:/tmp/migration-test-7qODSs/src_serial -drive
file=/tmp/migration-test-7qODSs/bootsect,format=raw    -accel qtest
qemu-system-i386: -accel kvm: invalid accelerator kvm
qemu-system-i386: falling back to tcg
# starting QEMU: exec ./build/i386/i386-softmmu/qemu-system-i386
-qtest unix:/tmp/qtest-861747.sock -qtest-log /dev/null -chardev
socket,path=/tmp/qtest-861747.qmp,id=char0 -mon
chardev=char0,mode=control -display none -accel kvm -accel tcg -name
target,debug-threads=on -m 150M -serial
file:/tmp/migration-test-7qODSs/dest_serial -incoming defer -drive
file=/tmp/migration-test-7qODSs/bootsect,format=raw    -accel qtest
qemu-system-i386: -accel kvm: invalid accelerator kvm
qemu-system-i386: falling back to tcg
Memory content inconsistency at 5cff000 first_byte = 2 last_byte = 1
current = 0 hit_edge = 1
Memory content inconsistency at 5d00000 first_byte = 2 last_byte = 1
current = 0 hit_edge = 1
Memory content inconsistency at 5d01000 first_byte = 2 last_byte = 1
current = 0 hit_edge = 1
Memory content inconsistency at 5d02000 first_byte = 2 last_byte = 1
current = 0 hit_edge = 1
Memory content inconsistency at 5d03000 first_byte = 2 last_byte = 1
current = 0 hit_edge = 1
Memory content inconsistency at 5d04000 first_byte = 2 last_byte = 1
current = 0 hit_edge = 1
Memory content inconsistency at 5d05000 first_byte = 2 last_byte = 1
current = 0 hit_edge = 1
Memory content inconsistency at 5d06000 first_byte = 2 last_byte = 1
current = 0 hit_edge = 1
Memory content inconsistency at 5d07000 first_byte = 2 last_byte = 1
current = 0 hit_edge = 1
Memory content inconsistency at 5d08000 first_byte = 2 last_byte = 1
current = 0 hit_edge = 1
and in another 118 pages**
ERROR:/home/linux1/qemu/tests/qtest/migration-test.c:268:check_guests_ram:
assertion failed: (bad == 0)
Bail out! ERROR:/home/linux1/qemu/tests/qtest/migration-test.c:268:check_guests_ram:
assertion failed: (bad == 0)
Aborted (core dumped)

-- PMM
multifd/tcp/zlib intermittent abort (was: Re: [PULL 00/18] migration queue)
Posted by Peter Maydell 2 years, 1 month ago
On Mon, 14 Mar 2022 at 19:44, Peter Maydell <peter.maydell@linaro.org> wrote:
> On Mon, 14 Mar 2022 at 18:58, Peter Maydell <peter.maydell@linaro.org> wrote:
> > I just hit the abort case, narrowing it down to the
> > /i386/migration/multifd/tcp/zlib case, which can hit this without
> > any other tests being run:
>
> > This test seems to fail fairly frequently. I'll try a bisect...
>
> On this s390 machine, this test has been intermittent since
> it was first added in commit 7ec2c2b3c1 ("multifd: Add zlib compression
> multifd support") in 2019.

I have tried (on current master) runs of various of the other
migration tests, and:
 * /i386/migration/multifd/tcp/zstd completed 1170 iterations without
   failing
 * /i386/migration/precopy/tcp completed 4669 iterations without
   failing
 * /i386/migration/multifd/tcp/zlib fails usually within the first
   10 iterations (the most I ever saw it manage was 32)

So whatever this is, it seems like it might be specific to the
zlib code somehow?

thanks
-- PMM
Re: multifd/tcp/zlib intermittent abort (was: Re: [PULL 00/18] migration queue)
Posted by Peter Maydell 2 years, 1 month ago
On Tue, 15 Mar 2022 at 14:39, Peter Maydell <peter.maydell@linaro.org> wrote:
>
> On Mon, 14 Mar 2022 at 19:44, Peter Maydell <peter.maydell@linaro.org> wrote:
> > On Mon, 14 Mar 2022 at 18:58, Peter Maydell <peter.maydell@linaro.org> wrote:
> > > I just hit the abort case, narrowing it down to the
> > > /i386/migration/multifd/tcp/zlib case, which can hit this without
> > > any other tests being run:
> >
> > > This test seems to fail fairly frequently. I'll try a bisect...
> >
> > On this s390 machine, this test has been intermittent since
> > it was first added in commit 7ec2c2b3c1 ("multifd: Add zlib compression
> > multifd support") in 2019.
>
> I have tried (on current master) runs of various of the other
> migration tests, and:
>  * /i386/migration/multifd/tcp/zstd completed 1170 iterations without
>    failing
>  * /i386/migration/precopy/tcp completed 4669 iterations without
>    failing
>  * /i386/migration/multifd/tcp/zlib fails usually within the first
>    10 iterations (the most I ever saw it manage was 32)
>
> So whatever this is, it seems like it might be specific to the
> zlib code somehow ?

Maybe we're running into this bug
https://bugs.launchpad.net/ubuntu/+source/zlib/+bug/1961427
("zlib: compressBound() returns an incorrect result on z15") ?

That bug report claims it doesn't affect focal, though, which
is what we're running on this box (specifically, the zlib1g
package is version 1:1.2.11.dfsg-2ubuntu1.2).

A run with DFLTCC=0 has made it past 60 iterations so far, which
suggests that that does serve as a workaround for the bug.

thanks
-- PMM
Re: multifd/tcp/zlib intermittent abort (was: Re: [PULL 00/18] migration queue)
Posted by Peter Maydell 2 years, 1 month ago
On Tue, 15 Mar 2022 at 15:03, Peter Maydell <peter.maydell@linaro.org> wrote:
> Maybe we're running into this bug
> https://bugs.launchpad.net/ubuntu/+source/zlib/+bug/1961427
> ("zlib: compressBound() returns an incorrect result on z15") ?

Full repro info, since it's a bit hidden in this long thread:

Build an i386 guest QEMU; I used this configure command:

'../../configure' '--target-list=i386-softmmu' '--enable-debug'
'--with-pkgversion=pm215' '--disable-docs'

Then run the multifd/tcp/zlib test in a tight loop:

X=1; while QTEST_QEMU_BINARY=./build/i386/i386-softmmu/qemu-system-i386
./build/i386/tests/qtest/migration-test  -tap -k -p
/i386/migration/multifd/tcp/zlib ; do echo $X; X=$((X+1)); done

Without DFLTCC=0 it fails typically within 5 or so iterations;
the longest I've ever seen it go is about 32.

-- PMM
Re: multifd/tcp/zlib intermittent abort (was: Re: [PULL 00/18] migration queue)
Posted by Daniel P. Berrangé 2 years, 1 month ago
On Tue, Mar 15, 2022 at 03:30:27PM +0000, Peter Maydell wrote:
> On Tue, 15 Mar 2022 at 15:03, Peter Maydell <peter.maydell@linaro.org> wrote:
> > Maybe we're running into this bug
> > https://bugs.launchpad.net/ubuntu/+source/zlib/+bug/1961427
> > ("zlib: compressBound() returns an incorrect result on z15") ?
> 
> Full repro info, since it's a bit hidden in this long thread:
> 
> Build an i386 guest QEMU; I used this configure command:
> 
> '../../configure' '--target-list=i386-softmmu' '--enable-debug'
> '--with-pkgversion=pm215' '--disable-docs'
> 
> Then run the multifd/tcp/zlib test in a tight loop:
> 
> X=1; while QTEST_QEMU_BINARY=./build/i386/i386-softmmu/qemu-system-i386
> ./build/i386/tests/qtest/migration-test  -tap -k -p
> /i386/migration/multifd/tcp/zlib ; do echo $X; X=$((X+1)); done
> 
> Without DFLTCC=0 it fails typically within 5 or so iterations;
> the longest I've ever seen it go is about 32.

So if this is a host OS package bug, do we punt to the OS vendor to
fix it, and just apply a workaround in our CI? E.g.

$ git diff
diff --git a/.travis.yml b/.travis.yml
index c3c8048842..6da4c9f640 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -218,6 +218,7 @@ jobs:
         - TEST_CMD="make check check-tcg V=1"
         - CONFIG="--disable-containers --target-list=${MAIN_SOFTMMU_TARGETS},s390x-linux-user"
         - UNRELIABLE=true
+        - DFLTCC=0
       script:
         - BUILD_RC=0 && make -j${JOBS} || BUILD_RC=$?
         - |



Regards,
Daniel
Re: multifd/tcp/zlib intermittent abort (was: Re: [PULL 00/18] migration queue)
Posted by Peter Maydell 2 years, 1 month ago
On Tue, 15 Mar 2022 at 15:41, Daniel P. Berrangé <berrange@redhat.com> wrote:
> So if this is a host OS package bug we punt to OS vendor to fix,
> and just apply workaround in our CI ?  eg
>
> $ git diff
> diff --git a/.travis.yml b/.travis.yml
> index c3c8048842..6da4c9f640 100644
> --- a/.travis.yml
> +++ b/.travis.yml
> @@ -218,6 +218,7 @@ jobs:
>          - TEST_CMD="make check check-tcg V=1"
>          - CONFIG="--disable-containers --target-list=${MAIN_SOFTMMU_TARGETS},s390x-linux-user"
>          - UNRELIABLE=true
> +        - DFLTCC=0
>        script:
>          - BUILD_RC=0 && make -j${JOBS} || BUILD_RC=$?
>          - |
>

Yes, that seems like the best approach. We also need to
adjust the gitlab CI config for the s390-host jobs. (In that
case we control the system being used so if there's a fixed
zlib we could install it, but for the travis stuff we'll probably
need the workaround for some time.)

-- PMM
Re: multifd/tcp/zlib intermittent abort
Posted by Thomas Huth 2 years, 1 month ago
On 15/03/2022 16.40, Daniel P. Berrangé wrote:
> On Tue, Mar 15, 2022 at 03:30:27PM +0000, Peter Maydell wrote:
>> On Tue, 15 Mar 2022 at 15:03, Peter Maydell <peter.maydell@linaro.org> wrote:
>>> Maybe we're running into this bug
>>> https://bugs.launchpad.net/ubuntu/+source/zlib/+bug/1961427
>>> ("zlib: compressBound() returns an incorrect result on z15") ?
>>
>> Full repro info, since it's a bit hidden in this long thread:
>>
>> Build an i386 guest QEMU; I used this configure command:
>>
>> '../../configure' '--target-list=i386-softmmu' '--enable-debug'
>> '--with-pkgversion=pm215' '--disable-docs'
>>
>> Then run the multifd/tcp/zlib test in a tight loop:
>>
>> X=1; while QTEST_QEMU_BINARY=./build/i386/i386-softmmu/qemu-system-i386
>> ./build/i386/tests/qtest/migration-test  -tap -k -p
>> /i386/migration/multifd/tcp/zlib ; do echo $X; X=$((X+1)); done
>>
>> Without DFLTCC=0 it fails typically within 5 or so iterations;
>> the longest I've ever seen it go is about 32.
> 
> So if this is a host OS package bug we punt to OS vendor to fix,
> and just apply workaround in our CI ?  eg
> 
> $ git diff
> diff --git a/.travis.yml b/.travis.yml
> index c3c8048842..6da4c9f640 100644
> --- a/.travis.yml
> +++ b/.travis.yml
> @@ -218,6 +218,7 @@ jobs:
>           - TEST_CMD="make check check-tcg V=1"
>           - CONFIG="--disable-containers --target-list=${MAIN_SOFTMMU_TARGETS},s390x-linux-user"
>           - UNRELIABLE=true
> +        - DFLTCC=0
>         script:
>           - BUILD_RC=0 && make -j${JOBS} || BUILD_RC=$?
>           - |

Sounds like a good idea - but you should certainly add a proper comment 
here, too, so that we can later remind ourselves to remove the workaround again.
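For instance, the hunk with a comment attached might look like this (a sketch only; the wording and bug link are drawn from this thread, not from an applied patch):

```yaml
        - UNRELIABLE=true
        # Workaround: disable zlib's s390x hardware-accelerated deflate
        # (DFLTCC), which appears to corrupt multifd/tcp/zlib migration
        # streams. Remove once a fixed zlib is deployed, see:
        # https://bugs.launchpad.net/ubuntu/+source/zlib/+bug/1961427
        - DFLTCC=0
```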

  Thomas


Re: multifd/tcp/zlib intermittent abort
Posted by Daniel P. Berrangé 2 years, 1 month ago
On Tue, Mar 15, 2022 at 04:44:37PM +0100, Thomas Huth wrote:
> On 15/03/2022 16.40, Daniel P. Berrangé wrote:
> > On Tue, Mar 15, 2022 at 03:30:27PM +0000, Peter Maydell wrote:
> > > On Tue, 15 Mar 2022 at 15:03, Peter Maydell <peter.maydell@linaro.org> wrote:
> > > > Maybe we're running into this bug
> > > > https://bugs.launchpad.net/ubuntu/+source/zlib/+bug/1961427
> > > > ("zlib: compressBound() returns an incorrect result on z15") ?
> > > 
> > > Full repro info, since it's a bit hidden in this long thread:
> > > 
> > > Build an i386 guest QEMU; I used this configure command:
> > > 
> > > '../../configure' '--target-list=i386-softmmu' '--enable-debug'
> > > '--with-pkgversion=pm215' '--disable-docs'
> > > 
> > > Then run the multifd/tcp/zlib test in a tight loop:
> > > 
> > > X=1; while QTEST_QEMU_BINARY=./build/i386/i386-softmmu/qemu-system-i386
> > > ./build/i386/tests/qtest/migration-test  -tap -k -p
> > > /i386/migration/multifd/tcp/zlib ; do echo $X; X=$((X+1)); done
> > > 
> > > Without DFLTCC=0 it fails typically within 5 or so iterations;
> > > the longest I've ever seen it go is about 32.
> > 
> > So if this is a host OS package bug we punt to OS vendor to fix,
> > and just apply workaround in our CI ?  eg
> > 
> > $ git diff
> > diff --git a/.travis.yml b/.travis.yml
> > index c3c8048842..6da4c9f640 100644
> > --- a/.travis.yml
> > +++ b/.travis.yml
> > @@ -218,6 +218,7 @@ jobs:
> >           - TEST_CMD="make check check-tcg V=1"
> >           - CONFIG="--disable-containers --target-list=${MAIN_SOFTMMU_TARGETS},s390x-linux-user"
> >           - UNRELIABLE=true
> > +        - DFLTCC=0
> >         script:
> >           - BUILD_RC=0 && make -j${JOBS} || BUILD_RC=$?
> >           - |
> 
> Sounds like a good idea - but you should certainly add a proper comment
> here, too, so that we can later remind ourselves to remove the workaround
> again.

FYI, I don't have time to actually test this for real with Travis right
now, so I'll leave it to someone else to test and submit a formal patch.


With regards,
Daniel


Re: multifd/tcp/zlib intermittent abort (was: Re: [PULL 00/18] migration queue)
Posted by Dr. David Alan Gilbert 2 years, 1 month ago
* Peter Maydell (peter.maydell@linaro.org) wrote:
> On Tue, 15 Mar 2022 at 14:39, Peter Maydell <peter.maydell@linaro.org> wrote:
> >
> > On Mon, 14 Mar 2022 at 19:44, Peter Maydell <peter.maydell@linaro.org> wrote:
> > > On Mon, 14 Mar 2022 at 18:58, Peter Maydell <peter.maydell@linaro.org> wrote:
> > > > I just hit the abort case, narrowing it down to the
> > > > /i386/migration/multifd/tcp/zlib case, which can hit this without
> > > > any other tests being run:
> > >
> > > > This test seems to fail fairly frequently. I'll try a bisect...
> > >
> > > On this s390 machine, this test has been intermittent since
> > > it was first added in commit 7ec2c2b3c1 ("multifd: Add zlib compression
> > > multifd support") in 2019.
> >
> > I have tried (on current master) runs of various of the other
> > migration tests, and:
> >  * /i386/migration/multifd/tcp/zstd completed 1170 iterations without
> >    failing
> >  * /i386/migration/precopy/tcp completed 4669 iterations without
> >    failing
> >  * /i386/migration/multifd/tcp/zlib fails usually within the first
> >    10 iterations (the most I ever saw it manage was 32)
> >
> > So whatever this is, it seems like it might be specific to the
> > zlib code somehow ?
> 
> Maybe we're running into this bug
> https://bugs.launchpad.net/ubuntu/+source/zlib/+bug/1961427
> ("zlib: compressBound() returns an incorrect result on z15") ?

The initial description of compressBound being wrong doesn't
feel like it would cause that; it claims it would trigger an error
(I'm not sure how good we are at spotting that!); but then later
in the description it says:

'Mistakes in dfltcc_free_window OF and especially DEFLATE_BOUND_COMPLEN,
  (incl. the bit definitions), may cause various and unforseen defects'

Certainly looks like a 'various and unforseen defect'.

Dave

> That bug report claims it doesn't affect focal, though, which
> is what we're running on this box (specifically, the zlib1g
> package is version 1:1.2.11.dfsg-2ubuntu1.2).
> 
> A run with DFLTCC=0 has made it past 60 iterations so far, which
> suggests that that does serve as a workaround for the bug.
> 
> thanks
> -- PMM
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Re: multifd/tcp/zlib intermittent abort (was: Re: [PULL 00/18] migration queue)
Posted by Peter Maydell 2 years, 1 month ago
On Tue, 15 Mar 2022 at 16:15, Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
> The initial description of compressBound being wrong doesn't
> feel like it would cause that; it claims it would trigger an error
> (I'm not sure how good we are at spotting that!); but then later
> in the description it says:
>
> 'Mistakes in dfltcc_free_window OF and especially DEFLATE_BOUND_COMPLEN,
>   (incl. the bit definitions), may cause various and unforseen defects'
>
> Certainly looks like a 'various and unforseen defect'.

Mmm. I couldn't get the testcase in that bug to fail on the machine
I see the migration-test fails on, so it presumably is a different bug
(or just faintly possibly a QEMU bug that's only tickled by the
specifics of the accelerated zlib behaviour).

-- PMM
Re: [PULL 00/18] migration queue
Posted by Dr. David Alan Gilbert 2 years, 1 month ago
* Peter Maydell (peter.maydell@linaro.org) wrote:
> On Mon, 14 Mar 2022 at 17:07, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > So the test harness is waiting for a reply to 'query-migrate'.
> >
> > This should be fast unless QEMU has hung in the main event
> > loop servicing monitor commands, or stopped.
> 
> I was kind of loose with the terminology -- I don't remember whether
> it was actually hung in the sense of stopped entirely, or just
> "sat in a loop waiting for a migration state that never arrives".
> I'll try to look more closely if I can catch it in the act again.

Yeh, there's a big difference; still, if it's always in this test at
that point, then I think it's one for Juan: it looks like the multifd
cancel path.

> One thing that makes this bug investigation trickier, incidentally,
> is that the migration-test code seems to depend on userfaultfd.
> That means you can't run it under 'rr'.

That should only be the postcopy tests; the others shouldn't use that.

Dave

> 
> -- PMM
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Re: [PULL 00/18] migration queue
Posted by Peter Maydell 2 years, 1 month ago
On Mon, 14 Mar 2022 at 17:55, Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
>
> Peter Maydell (peter.maydell@linaro.org) wrote:
> > One thing that makes this bug investigation trickier, incidentally,
> > is that the migration-test code seems to depend on userfaultfd.
> > That means you can't run it under 'rr'.
>
> That should only be the postcopy tests; the others shouldn't use that.

tests/qtest/migration-test.c:main() exits immediately without adding
any of the test cases if ufd_version_check() fails, so no userfaultfd
means no tests run at all, currently.

-- PMM