[PATCH 0/2] migration: multifd live migration improvement

Li Zhang posted 2 patches 2 years, 5 months ago
Test checkpatch passed
git fetch https://github.com/patchew-project/qemu tags/patchew/20211126153154.25424-1-lizhang@suse.de
Maintainers: Juan Quintela <quintela@redhat.com>, "Dr. David Alan Gilbert" <dgilbert@redhat.com>
There is a newer version of this series
migration/multifd.c | 2 +-
migration/socket.c  | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
When testing live migration with multifd channels (8, 16, or more) and
starting the destination with qemu -incoming (without "defer"), a network
error (for example, one that triggers the kernel's SYN flooding detection)
makes the migration fail and the guest hang forever.

The test environment and command lines are as follows:

QEMU version: QEMU emulator version 6.2.91 (v6.2.0-rc1-47-gc5fbdd60cf)
Host OS: SLE 15 with kernel: 5.14.5-1-default
Network card: mlx5 100Gbps
Network card: Intel Corporation I350 Gigabit (1Gbps)

Source:
qemu-system-x86_64 -M q35 -smp 32 -nographic \
        -serial telnet:10.156.208.153:4321,server,nowait \
        -m 4096 -enable-kvm -hda /var/lib/libvirt/images/openSUSE-15.3.img \
        -monitor stdio
Dest:
qemu-system-x86_64 -M q35 -smp 32 -nographic \
        -serial telnet:10.156.208.154:4321,server,nowait \
        -m 4096 -enable-kvm -hda /var/lib/libvirt/images/openSUSE-15.3.img \
        -monitor stdio \
        -incoming tcp:1.0.8.154:4000

(qemu) migrate_set_parameter max-bandwidth 100G
(qemu) migrate_set_capability multifd on
(qemu) migrate_set_parameter multifd-channels 16

The guest hangs when executing the command: migrate -d tcp:1.0.8.154:4000.

When the network problem occurs, the destination does not receive the
TCP ACK and resets the connection with RST:

No.  Time      Source     Destination  Protocol  Length  Info
119  1.021169  1.0.8.153  1.0.8.154    TCP       1410    60166 → 4000 [PSH, ACK] Seq=65 Ack=1 Win=62720 Len=1344 TSval=1338662881 TSecr=1399531897
125  1.021181  1.0.8.154  1.0.8.153    TCP       54      4000 → 60166 [RST] Seq=1 Win=0 Len=0

kernel log:
[334520.229445] TCP: request_sock_TCP: Possible SYN flooding on port 4000. Sending cookies.  Check SNMP counters.
[334562.994919] TCP: request_sock_TCP: Possible SYN flooding on port 4000. Sending cookies.  Check SNMP counters.
[334695.519927] TCP: request_sock_TCP: Possible SYN flooding on port 4000. Sending cookies.  Check SNMP counters.
[334734.689511] TCP: request_sock_TCP: Possible SYN flooding on port 4000. Sending cookies.  Check SNMP counters.
[335687.740415] TCP: request_sock_TCP: Possible SYN flooding on port 4000. Sending cookies.  Check SNMP counters.
[335730.013598] TCP: request_sock_TCP: Possible SYN flooding on port 4000. Sending cookies.  Check SNMP counters.
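
For illustration only: the "Possible SYN flooding" message is typically
logged when a listening socket's queue fills up, which can happen when many
multifd channels connect at once to a listener with a small backlog. The
sketch below is plain POSIX code, not QEMU code. The helper name
listen_for_channels() and the "channels + 1" sizing (one slot per multifd
channel plus one for the main channel) are assumptions, not values taken
from the patches.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (not QEMU code): open a TCP listener on the loopback
 * address with a backlog sized for `channels` simultaneous connections.
 * Returns the listening fd, or -1 on error. Port 0 lets the kernel choose.
 */
static int listen_for_channels(unsigned short port, int channels)
{
    struct sockaddr_in addr;
    int one = 1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0) {
        return -1;
    }
    /* Best effort: allow quick rebinding of the port between test runs. */
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    /* Assumed sizing: one slot per multifd channel plus the main channel. */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0 ||
        listen(fd, channels + 1) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

On Linux the kernel silently caps the requested backlog at
net.core.somaxconn, so a larger backlog only reduces the chance of dropped
connection attempts; it does not remove it.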

There are two problems here:
1. When live migration fails, the guest hangs and no error is reported,
   even though an error has occurred.
2. The network problem causes live migration to fail when the channel
   number is 8, 16, or larger.

The two patches in this series address these two problems.

Li Zhang (2):
  multifd: use qemu_sem_timedwait in multifd_recv_thread to avoid
    waiting forever
  migration: Set the socket backlog number to reduce chance of live
    migration failure

 migration/multifd.c | 2 +-
 migration/socket.c  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

-- 
2.31.1