[PATCH 0/2] libs/light: fix a race when domain are destroied and created concurrently

Dario Faggioli posted 2 patches 2 years, 10 months ago
Test gitlab-ci failed
Patches applied successfully (tree, apply log)
git fetch https://gitlab.com/xen-project/patchew/xen tags/patchew/164200566223.24755.262723784847161301.stgit@work
tools/libs/light/libxl_domain.c | 14 ++++++++------
tools/libs/light/libxl_numa.c   |  4 +++-
2 files changed, 11 insertions(+), 7 deletions(-)
[PATCH 0/2] libs/light: fix a race when domain are destroied and created concurrently
Posted by Dario Faggioli 2 years, 10 months ago
It is possible to encounter a segfault in libxl during concurrent domain
create and destroy operations.

This is because Placement of existing domains on the host's CPUs is examined
when creating a new domain, but the existing logic does not tolerate well a
domain disappearing during the examination.

The race is quite difficult to trigger. Whan that happens, this is an
example of a backtrace:

#0  libxl_bitmap_dispose (map=map@entry=0x50) at libxl_utils.c:626
#1  0x00007fe72c993a32 in libxl_vcpuinfo_dispose (p=p@entry=0x38) at _libxl_types.c:692
#2  0x00007fe72c94e3c4 in libxl_vcpuinfo_list_free (list=0x0, nr=<optimized out>) at libxl_utils.c:1059
#3  0x00007fe72c9528bf in nr_vcpus_on_nodes (vcpus_on_node=0x7fe71000eb60, suitable_cpumap=0x7fe721df0d38, tinfo_elements=48, tinfo=0x7fe7101b3900, gc=0x7fe7101bbfa0) at libxl_numa.c:258
#4  libxl__get_numa_candidate (gc=gc@entry=0x7fe7100033a0, min_free_memkb=4233216, min_cpus=4, min_nodes=min_nodes@entry=0, max_nodes=max_nodes@entry=0, suitable_cpumap=suitable_cpumap@entry=0x7fe721df0d38, numa_cmpf=0x7fe72c940110 <numa_cmpf>, cndt_out=0x7fe721df0cf0, cndt_found=0x7fe721df0cb4) at libxl_numa.c:394
#5  0x00007fe72c94152b in numa_place_domain (d_config=0x7fe721df11b0, domid=975, gc=0x7fe7100033a0) at libxl_dom.c:209
#6  libxl__build_pre (gc=gc@entry=0x7fe7100033a0, domid=domid@entry=975, d_config=d_config@entry=0x7fe721df11b0, state=state@entry=0x7fe710077700) at libxl_dom.c:436
#7  0x00007fe72c92c4a5 in libxl__domain_build (gc=0x7fe7100033a0, d_config=d_config@entry=0x7fe721df11b0, domid=975, state=0x7fe710077700) at libxl_create.c:444
#8  0x00007fe72c92de8b in domcreate_bootloader_done (egc=0x7fe721df0f60, bl=0x7fe7100778c0, rc=<optimized out>) at libxl_create.c:1222
#9  0x00007fe72c980425 in libxl__bootloader_run (egc=egc@entry=0x7fe721df0f60, bl=bl@entry=0x7fe7100778c0) at libxl_bootloader.c:403
#10 0x00007fe72c92f281 in initiate_domain_create (egc=egc@entry=0x7fe721df0f60, dcs=dcs@entry=0x7fe7100771b0) at libxl_create.c:1159
#11 0x00007fe72c92f456 in do_domain_create (ctx=ctx@entry=0x7fe71001c840, d_config=d_config@entry=0x7fe721df11b0, domid=domid@entry=0x7fe721df10a8, restore_fd=restore_fd@entry=-1, send_back_fd=send_back_fd@entry=-1, params=params@entry=0x0, ao_how=0x0, aop_console_how=0x7fe721df10f0) at libxl_create.c:1856
#12 0x00007fe72c92f776 in libxl_domain_create_new (ctx=0x7fe71001c840, d_config=d_config@entry=0x7fe721df11b0, domid=domid@entry=0x7fe721df10a8, ao_how=ao_how@entry=0x0, aop_console_how=aop_console_how@entry=0x7fe721df10f0) at libxl_create.c:2075

Luckily, it is easy to close the race, by just making sure that
libvxl_list_vcpu() returns 0 as the number of vCPUs of the domain, when
it also returns NULL as the list of them.

Regards
---
Dario Faggioli (2):
      tools/libs/light: numa placement: don't try to free a NULL list of vcpus
      tools/libs/light: don't touch nr_vcpus_out if listing vcpus and returning NULL

 tools/libs/light/libxl_domain.c | 14 ++++++++------
 tools/libs/light/libxl_numa.c   |  4 +++-
 2 files changed, 11 insertions(+), 7 deletions(-)
--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)


Re: [PATCH 0/2] libs/light: fix a race when domain are destroied and created concurrently
Posted by Dario Faggioli 2 years, 10 months ago
On Wed, 2022-01-12 at 17:41 +0100, Dario Faggioli wrote:
> It is possible to encounter a segfault in libxl during concurrent
> domain
> create and destroy operations.
> 
> This is because Placement of existing domains on the host's CPUs is
> examined
> when creating a new domain, but the existing logic does not tolerate
> well a
> domain disappearing during the examination.
> 
Oh, one more thing. This has been first encountered and reproduced on
our SLE15.2, which ships Xen 4.13 so, if the fixes are accepted
upstream, I think they should also be backported (to all supported
branches, I would say).

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)