[PATCH] network: Fix a race condition when shutdown & start vm at the same time

Bingsong Si posted 1 patch 3 years, 10 months ago
Test syntax-check failed
Patches applied successfully
git fetch https://github.com/patchew-project/libvirt tags/patchew/20200611095802.9367-1-owen.si@ucloud.cn
There is a newer version of this series
src/qemu/qemu_process.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
[PATCH] network: Fix a race condition when shutdown & start vm at the same time
Posted by Bingsong Si 3 years, 10 months ago
When shutting down a VM, qemuProcessStop cleans up the virtual interface in two steps:
1. qemuProcessKill kills the qemu process, and the vif disappears
2. ovs-vsctl del-port removes the port from the bridge

If a VM is started between the two steps, the new VM will reuse the vif,
but the vif is then removed from the bridge by step 2

Signed-off-by: Bingsong Si <owen.si@ucloud.cn>
---
 src/qemu/qemu_process.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index d36088ba98..706248815a 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -7483,9 +7483,11 @@ void qemuProcessStop(virQEMUDriverPtr driver,
             if (vport->virtPortType == VIR_NETDEV_VPORT_PROFILE_MIDONET) {
                 ignore_value(virNetDevMidonetUnbindPort(vport));
             } else if (vport->virtPortType == VIR_NETDEV_VPORT_PROFILE_OPENVSWITCH) {
-                ignore_value(virNetDevOpenvswitchRemovePort(
-                                 virDomainNetGetActualBridgeName(net),
-                                 net->ifname));
+                virMacAddr mac;
+                if (virNetDevGetMAC(net->ifname, &mac) < 0 ||  !virMacAddrCmp(&mac, &net->mac))
+                    ignore_value(virNetDevOpenvswitchRemovePort(
+                                     virDomainNetGetActualBridgeName(net),
+                                     net->ifname));
             }
         }
 
-- 
2.18.2

Re: [PATCH] network: Fix a race condition when shutdown & start vm at the same time
Posted by Daniel Henrique Barboza 3 years, 10 months ago

On 6/11/20 6:58 AM, Bingsong Si wrote:
> when shutdown vm, the qemuProcessStop cleanup virtual interface in two steps:

s/when/When

> 1. qemuProcessKill kill qemu process, and vif disappeared
> 2. ovs-vsctl del-port from the brige
> 
> if start a vm in the middle of the two steps, the new vm will reused the vif,

s/if/If

> but removed from bridge by step 2
> 
> Signed-off-by: Bingsong Si <owen.si@ucloud.cn>
> ---
>   src/qemu/qemu_process.c | 8 +++++---
>   1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
> index d36088ba98..706248815a 100644
> --- a/src/qemu/qemu_process.c
> +++ b/src/qemu/qemu_process.c
> @@ -7483,9 +7483,11 @@ void qemuProcessStop(virQEMUDriverPtr driver,
>               if (vport->virtPortType == VIR_NETDEV_VPORT_PROFILE_MIDONET) {
>                   ignore_value(virNetDevMidonetUnbindPort(vport));
>               } else if (vport->virtPortType == VIR_NETDEV_VPORT_PROFILE_OPENVSWITCH) {
> -                ignore_value(virNetDevOpenvswitchRemovePort(
> -                                 virDomainNetGetActualBridgeName(net),
> -                                 net->ifname));
> +                virMacAddr mac;
> +                if (virNetDevGetMAC(net->ifname, &mac) < 0 ||  !virMacAddrCmp(&mac, &net->mac))

Extra space between "||" and "!virMacAddrCmp(.."


With these nits fixed:


Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>


> +                    ignore_value(virNetDevOpenvswitchRemovePort(
> +                                     virDomainNetGetActualBridgeName(net),
> +                                     net->ifname));
>               }
>           }
>   
> 

Re: [PATCH] network: Fix a race condition when shutdown & start vm at the same time
Posted by owen.si 3 years, 10 months ago
Thanks for your review; I will fix these in version 2.



From: Daniel Henrique Barboza
Sent: Tuesday, June 16, 2020, 2:36 AM
To: Bingsong Si; libvir-list@redhat.com
Subject: Re: [PATCH] network: Fix a race condition when shutdown & start vm at the same time




Re: [PATCH] network: Fix a race condition when shutdown & start vm at the same time
Posted by Laine Stump 3 years, 10 months ago
(BTW, this other patch is also trying to solve the same problem:


https://www.redhat.com/archives/libvir-list/2020-June/msg00525.html


I made comments there earlier, and have learned a bit more since then:


https://www.redhat.com/archives/libvir-list/2020-June/msg00634.html


On 6/15/20 2:36 PM, Daniel Henrique Barboza wrote:
>
>
> On 6/11/20 6:58 AM, Bingsong Si wrote:
>> when shutdown vm, the qemuProcessStop cleanup virtual interface in 
>> two steps:
>
> s/when/When
>
>> 1. qemuProcessKill kill qemu process, and vif disappeared
>> 2. ovs-vsctl del-port from the brige
>>
>> if start a vm in the middle of the two steps, the new vm will reused 
>> the vif,
>
> s/if/If
>
>> but removed from bridge by step 2
>>
>> Signed-off-by: Bingsong Si <owen.si@ucloud.cn>
>> ---
>>   src/qemu/qemu_process.c | 8 +++++---
>>   1 file changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
>> index d36088ba98..706248815a 100644
>> --- a/src/qemu/qemu_process.c
>> +++ b/src/qemu/qemu_process.c
>> @@ -7483,9 +7483,11 @@ void qemuProcessStop(virQEMUDriverPtr driver,
>>               if (vport->virtPortType == VIR_NETDEV_VPORT_PROFILE_MIDONET) {
>>                   ignore_value(virNetDevMidonetUnbindPort(vport));
>>               } else if (vport->virtPortType == VIR_NETDEV_VPORT_PROFILE_OPENVSWITCH) {
>> -                ignore_value(virNetDevOpenvswitchRemovePort(
>> -                                 virDomainNetGetActualBridgeName(net),
>> -                                 net->ifname));
>> +                virMacAddr mac;
>> +                if (virNetDevGetMAC(net->ifname, &mac) < 0 ||  !virMacAddrCmp(&mac, &net->mac))


(Before anything else - virNetDevGetMAC() will actually *log* an error 
in libvirt's logs if the device isn't found (which will be the case in 
nearly 100% of all instances). That will lead to people reporting it as 
a bug, which gets very time consuming and expensive for anyone providing 
commercial support for a product that uses libvirt. If it is really 
necessary to check the MAC address of a device that legitimately may or 
may not exist, then there will need to be a "Quiet" version of the 
function that doesn't log any errors.) (Update after thinking about it - 
I don't think we should be checking the MAC address anyway, as it 
doesn't reliably differentiate the "new" tap from the "old" tap.)


>
> Extra space between "||" and "!virMacAddrCmp(.."
>
>
> With these nits fixed:
>
>
> Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>



This patch is attempting to solve a race condition between one guest 
starting at the same time another is stopping, and the mess that results 
if the new guest uses the same name for its tap device as the old guest 
used. For example, let's say libvirt thread A / guest A is doing this:


A1) the QEMU process is terminated (and the tap device, e.g. "vnet0", is 
implicitly deleted) either by libvirtd or by some external force 
(including possibly the QEMU process itself)

A2) the tap's associated port (also "vnet0") is removed from the OVS 
switch by libvirt.


While libvirt thread B is doing this:

B1) a new tap device is created for a new QEMU process. If A1 has 
already happened, then the kernel will likely give the new tap the same 
name - "vnet0".

B2) the new tap device is attached to an OVS switch (or possibly a Linux 
host bridge).


The problem occurs if B2 happens before A2: B2 attaches the new tap to 
the OVS switch, and then A2 disconnects it from the switch. So libvirt 
thinks the new QEMU guest's tap is attached to the switch, but it isn't.
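The B2-before-A2 interleaving can be shown with a toy C model (for illustration only - the port table and every function name here are invented, not libvirt code; the one property modeled is that an OVS port row outlives its kernel tap device, and a new tap of the same name silently reattaches to it):

```c
#include <stdio.h>
#include <string.h>

/* Toy in-memory stand-in for the OVS port table. */
#define MAX_PORTS 8
#define NAMELEN 16
static char ports[MAX_PORTS][NAMELEN];

static int ovs_has_port(const char *name)
{
    for (int i = 0; i < MAX_PORTS; i++)
        if (strcmp(ports[i], name) == 0)
            return 1;
    return 0;
}

static void ovs_add_port(const char *name)
{
    if (ovs_has_port(name))
        return; /* the new tap silently reconnects to the existing row */
    for (int i = 0; i < MAX_PORTS; i++)
        if (ports[i][0] == '\0') {
            snprintf(ports[i], NAMELEN, "%s", name);
            return;
        }
}

static void ovs_del_port(const char *name)
{
    for (int i = 0; i < MAX_PORTS; i++)
        if (strcmp(ports[i], name) == 0)
            ports[i][0] = '\0';
}

/* Run the bad ordering: B2 (attach for new guest) before A2 (removal). */
static int race_leaves_port_detached(void)
{
    ovs_add_port("vnet0");  /* guest A running, port "vnet0" exists    */
                            /* A1: qemu dies, kernel tap "vnet0" gone  */
                            /* B1: new tap created, kernel reuses name */
    ovs_add_port("vnet0");  /* B2: new guest reattaches to the old row */
    ovs_del_port("vnet0");  /* A2: old guest's cleanup removes the port */
    return !ovs_has_port("vnet0"); /* 1: guest B's tap is now detached */
}
```

Running the four steps in this order leaves guest B's tap with no port at all, which is exactly the state libvirt cannot see.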


This patch attempts to eliminate the race by checking, prior to removing 
the "old tap"'s port on the switch, that 1) the tap device doesn't 
exist, or 2) if it does exist, that its MAC address is unchanged.


Assuming that the two guests would not use the same MAC address for 
their tap devices (which is probably the case, but isn't *required* to 
be true), this does significantly narrow the potential window for a race 
condition, and in particular makes sure that we never remove a port that 
has just been "re-added" by the new QEMU.


However, this just creates a smaller window for the race, and a 
different problem for the remainder of the time.

1) smaller window - it would still be possible for the following to happen:


    a) old qemu terminates, tap device "vnet0" is deleted

    b) libvirt checks the MAC address and learns that there is no device
       "vnet0", so it calls virNetDevOpenvswitchRemovePort(), but before
       ovs-vsctl can be called...

    c) libvirt creates a new tap device for the new QEMU, and the kernel
       names it "vnet0"

    d) libvirt calls virNetDevOpenvswitchAddPort() and adds the new tap
       device "vnet0" to the switch

    e) the ovs-vsctl from (b) is finally able to run, and removes "vnet0"
       from the switch.


Granted, this is *highly* unlikely, since there is nothing extra between 
checking the MAC address and removing the port, but there is nothing 
enforcing it.

2) New Problem - I think the testing here was done with two guests that 
both attached their taps to the same (or another) OVS switch. It's 
relying on the ability to attach the tap device to a new switch/bridge 
even if it is already attached to some other switch/bridge. Normally 
ovs-vsctl would refuse to add a port to a switch if a port by that same 
name was already on any OVS switch. I just looked it up though, and in 
this case libvirt is able to make this work by including "--if-exists 
del-port $ifname" in the ovs-vsctl command that *adds* the new port. 
However, if you have the same situation but the new switch device is 
instead a Linux bridge, the attempt to attach to the bridge fails.


So, if thread "B" has already created the new tap device by the time 
thread "A" is deciding whether or not to remove the old port, the port 
won't be removed, and if the new guest "B" is using a Linux host bridge, 
libvirt will fail to attach the new tap to the bridge.


So, a summary of the problems with this patch:


1) The race window is reduced (and may be gone in practical terms), but 
not guaranteed to be eliminated.


2) For the time between the "previous race window start" and the "new 
race window start", a new problem has been created - if the new guest 
uses a Linux host bridge, the connection will fail


3) virNetDevGetMAC() will put an error in the system logs if the device 
doesn't exist (and it almost always will *not* exist, so this will be 
significant)


4) Although it is almost always the case that two guests will not use 
the same MAC address for their network interfaces, there is nothing 
preventing it - we shouldn't assume that MAC addresses are unique. I 
think that check is actually superfluous, since the qemu process has 
always been terminated by the time we get to that place in the code, so 
the tap device should have been auto-deleted.


What do I think should be done? Good question. Possibly we could:


A) Call virNetDevTapReattachBridge() rather than 
virNetDevTapAttachBridge() in virNetDevTapCreateInBridgePort(). This 
would eliminate problem (2).


B) Instead of checking if the tap device MAC address matches, just call 
virNetDevExists() - if it exists, then skip the RemovePort() - this 
eliminates problems (3) and (4). (NB - this would fail if it turns out 
that tap device deletion isn't completed synchronously with qemu process 
termination!)


C) If we want to make it 100% sound, we need to make "check for 
interface existence + removeport" an atomic operation, and mutually 
exclusive with virNetDevTapCreate(). This would eliminate problem (1).
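Option (C) amounts to putting one lock around both sides. A minimal sketch (all names invented; the two `fake_*` booleans stand in for the kernel tap device and the OVS port state, and the real libvirt helpers are only named in comments):

```c
#include <pthread.h>
#include <stdbool.h>

/* One mutex makes "check interface existence + remove port" atomic and
 * mutually exclusive with tap creation, so a new tap can never appear
 * between the existence check and the port removal. */
static pthread_mutex_t tap_name_lock = PTHREAD_MUTEX_INITIALIZER;
static bool fake_dev_exists;    /* stand-in for virNetDevExists()  */
static bool fake_port_present;  /* stand-in for the OVS port state */

static void tap_create_locked(void)
{
    pthread_mutex_lock(&tap_name_lock);
    fake_dev_exists = true;    /* virNetDevTapCreate() would run here */
    fake_port_present = true;  /* ...followed by the AddPort step     */
    pthread_mutex_unlock(&tap_name_lock);
}

static void port_remove_if_stale(void)
{
    pthread_mutex_lock(&tap_name_lock);
    if (!fake_dev_exists)          /* device gone => port is stale      */
        fake_port_present = false; /* ...RemovePort() would run here    */
    pthread_mutex_unlock(&tap_name_lock);
}
```

With both paths serialized on the same lock, the thread stopping the old guest either sees no device and removes the port before the new tap can be created, or sees the new device and leaves the port alone.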



>
>> +                    ignore_value(virNetDevOpenvswitchRemovePort(
>> +                                     virDomainNetGetActualBridgeName(net),
>> +                                     net->ifname));
>>               }
>>           }
>>
>

Re: [PATCH] network: Fix a race condition when shutdown & start vm at the same time
Posted by owen.si 3 years, 10 months ago
Thanks for your detailed analysis; I will rework the patch.



From: Laine Stump
Sent: Wednesday, June 17, 2020, 7:46 AM
To: libvir-list@redhat.com
Cc: Daniel Henrique Barboza; Bingsong Si; Wei Gong
Subject: Re: [PATCH] network: Fix a race condition when shutdown & start vm at the same time


Re: [PATCH] network: Fix a race condition when shutdown & start vm at the same time
Posted by Michal Privoznik 3 years, 10 months ago
On 6/17/20 1:46 AM, Laine Stump wrote:

> 
> What do I think should be done? Good question. Possibly we could:
> 
> 
> A) Call virNetDevTapReattachBridge() rather than 
> virNetDevTapAttachBridge() in virNetDevTapCreateInBridgePort(). This 
> would eliminate problem (2).
> 
> 
> B) Instead of checking if the tap device MAC address matches, just call 
> virNetDevExists() - if it exists, then skip the RemovePort() - this 
> eliminates problems (3) and (4). (NB - this would fail if it turns out 
> that tap device deletion isn't completed synchronously with qemu process 
> termination!)
> 
> 
> C) If we want to make it 100% sound, we need to make "check for 
> interface existence + removeport" an atomic operation, and mutually 
> exclusive with virNetDevTapCreate(). This would eliminate problem (1)
> 

I still don't quite understand how there can be a race. I mean, from the 
system's POV, libvirt creates a TAP, plugs it into a bridge (when starting 
the first domain). And when shutting it down and starting the second 
domain in parallel, a new TAP device (with a different index and MAC) is 
created independently of the first TAP, no?

Michal

Re: [PATCH] network: Fix a race condition when shutdown & start vm at the same time
Posted by Laine Stump 3 years, 10 months ago
On 6/19/20 12:00 PM, Michal Privoznik wrote:
> On 6/17/20 1:46 AM, Laine Stump wrote:
>
>>
>> What do I think should be done? Good question. Possibly we could:
>>
>>
>> A) Call virNetDevTapReattachBridge() rather than 
>> virNetDevTapAttachBridge() in virNetDevTapCreateInBridgePort(). This 
>> would eliminate problem (2).
>>
>>
>> B) Instead of checking if the tap device MAC address matches, just 
>> call virNetDevExists() - if it exists, then skip the RemovePort() - 
>> this eliminates problems (3) and (4). (NB - this would fail if it 
>> turns out that tap device deletion isn't completed synchronously with 
>> qemu process termination!)
>>
>>
>> C) If we want to make it 100% sound, we need to make "check for 
>> interface existence + removeport" an atomic operation, and mutually 
>> exclusive with virNetDevTapCreate(). This would eliminate problem (1)
>>
>
> I still don't quite understand how there can be a race. I mean, from 
> system POV, libvirt creates a TAP, plugs it into a bridge (when 
> starting the first domain). And when shutting it down and starting the 
> second domain in parallel a new TAP device (with different index and 
> MAC) is created independent of the first TAP, no? 


Yes.


There's no problem with the tap device creation/deletion by itself. The 
problem is when that tap device is attached to an OVS switch. A "port" 
of the same name as the tap device is created in OVS, and that port 
continues to exist after the tap device itself is deleted. If a new tap 
device of the same name is created without first deleting the port from 
OVS, then the new tap device is automatically connected to the existing 
port of the same name (I don't know the exact details of how this is 
done, but OVS must have its claws somewhere in the kernel or udev to 
make this happen, as I've seen it (and can reproduce it) myself).


Even that, by itself, isn't a problem. But if you have a tap device 
attached to an OVS port, and you delete the tap, that frees up the name 
so that the next time someone asks to create a tap device they get that 
same name. But if that happens before the old port has been removed from 
OVS (which is a separate operation), then the new tap device is attached 
to the old switch. In that case, simple requests to attach the new tap 
to a different switch (or a Linux bridge) will fail.


We solve that problem when the new tap is being attached to an OVS 
switch by prepending "--if-exists del-port blah" to the ovs-vsctl 
command that attaches to the new switch, but that doesn't help in the 
case that the new tap is being attached to a Linux host bridge.


Performing items (1) and (2) in my list will probably get rid of all bad 
behavior in practice, but there is still technically the possibility 
that the thread (B) for the new tap device will create that tap device 
after the thread (A) for the old tap device checks that the device 
doesn't exist (i.e. that a new tap with the same name hasn't yet been 
created) but before removing the port (of the same name) from the OVS 
switch.


This would all be much simpler if the kernel would put a "FIN WAIT" 
state on all tap device names so that they couldn't be re-used for a few 
seconds after deletion, but it doesn't, so we have to work with what 
we've got.


Re: [PATCH] network: Fix a race condition when shutdown & start vm at the same time
Posted by Daniel P. Berrangé 3 years, 10 months ago
On Fri, Jun 19, 2020 at 12:19:05PM -0400, Laine Stump wrote:
> 
> This would all be much simpler if the kernel would put a "FIN WAIT" state on
> all tap device names so that they couldn't be re-used for a few seconds
> after deletion, but it doesn't, so we have to work with what we've got.

The problem is that the kernel reuses tap device names.

Can we just take the kernel out of the equation and do auto-assignment
of names ourselves?

Maintain a global "int nextTapID" counter, and just iterate on this.
NIC names can be up to 16 bytes, so we'll create billions of devices
before we have any chance of wrapping around and risking a collision
with a concurrently shutting-down guest.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [PATCH] network: Fix a race condition when shutdown & start vm at the same time
Posted by Laine Stump 3 years, 10 months ago
On 6/19/20 12:26 PM, Daniel P. Berrangé wrote:
> On Fri, Jun 19, 2020 at 12:19:05PM -0400, Laine Stump wrote:
>> This would all be much simpler if the kernel would put a "FIN WAIT" state on
>> all tap device names so that they couldn't be re-used for a few seconds
>> after deletion, but it doesn't, so we have to work with what we've got.
> The problem is that the kernel reuses tap device names.
>
> Can we just take the kernel out of the equation and do auto-assignment
> of names ourselves.


Yeah, that's what we do for macvtap. It is extra code, non-trivial, and 
therefore prone to errors (as in, "anything over 1 line long is prone to 
errors").


I *thought* I had remembered that this couldn't work properly for 
standard tap devices, due to the way tap devices are "created": you 
first open /dev/net/tun, then issue an ioctl(TUNSETIFF, "desiredname") 
on the handle returned by open. I had been concerned that if you did 
that with an existing name it would just connect you to the existing 
tap, but I tried it just now and got an error the 2nd time I tried to 
use the same tap device name, so perhaps/probably I'm again 
misremembering.
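For reference, that two-step creation sequence is roughly the following (a sketch of the standard Linux TUNSETIFF flow, not libvirt's actual helper; `tap_prep` and `tap_open` are invented names, and actually creating the device needs CAP_NET_ADMIN - the error observed above for an in-use name would come from the ioctl, typically EBUSY):

```c
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/if.h>
#include <linux/if_tun.h>

/* Fill an ifreq for TUNSETIFF; split out so the request itself can be
 * inspected without privileges. Returns -1 if the name won't fit. */
static int tap_prep(struct ifreq *ifr, const char *name)
{
    if (strlen(name) >= IFNAMSIZ)
        return -1;
    memset(ifr, 0, sizeof(*ifr));
    ifr->ifr_flags = IFF_TAP | IFF_NO_PI; /* L2 tap, no packet-info hdr */
    strncpy(ifr->ifr_name, name, IFNAMSIZ - 1);
    return 0;
}

/* Open the clone device, then name the tap. If "name" belongs to a
 * live tap held by someone else, the ioctl fails rather than silently
 * attaching to it. */
static int tap_open(const char *name)
{
    struct ifreq ifr;
    int fd;

    if (tap_prep(&ifr, name) < 0)
        return -1;
    if ((fd = open("/dev/net/tun", O_RDWR)) < 0)
        return -1;
    if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
        close(fd);
        return -1;
    }
    return fd; /* a non-persistent tap vanishes when this fd is closed */
}
```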


>
> Maintain a global "int nextTapID" counter, and just iterate on this.
> NIC names can be upto 16 bytes, so we'll create billions of devices
> before we have any chance of wrapping around and risking a collision
> with a concurrently shutting down guest.


I remember I tried doing a monotonically increasing "next" counter for 
macvtap (never actually pushed, but shown to [some users, I forget who, 
maybe ovirt devs?]), and they didn't like it because the devices ended 
up with long, difficult-for-a-human-to-remember/pick-out/recite names 
like macvtap42301 and macvtap43021. So instead we keep track of the 
names that are in use, and always look for the lowest available number 
when creating a new one. (Of course doing that would greatly increase 
the likelihood of exposing any race conditions, so...) Definitely if we 
change the behavior in this way we're going to hear about it, though :-)


Another issue I remember from macvtap is the possibility of a different 
process (not libvirt) having already created a macvtap device with the 
name that we are wanting to use. That is easily solved in the case of 
macvtap by just iterating through until we find one that we can create 
without failure. So basically you have the list/bitmap/whatever of 
devicenames currently in use, use that as a starting point to pick a 
name for the new device, but then also allow for that name to already be 
used, and retry.
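The lowest-free-number scheme with retry could be sketched like this (all names invented; the `already_taken` string stands in for probing the host for a device that some other process created, and the template comes from config as described below):

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Track which indices libvirt itself has handed out. */
#define MAX_IDX 64
static bool in_use[MAX_IDX];

/* Reserve the lowest free index, or -1 if exhausted. */
static int idx_reserve(void)
{
    for (int i = 0; i < MAX_IDX; i++)
        if (!in_use[i]) {
            in_use[i] = true;
            return i;
        }
    return -1;
}

static void idx_release(int i)
{
    if (i >= 0 && i < MAX_IDX)
        in_use[i] = false;
}

/* Pick a device name from a "vnet%d"-style template. If the candidate
 * name already exists on the host (created by someone other than us),
 * keep that index marked in-use so we don't retry it, and move on. */
static int pick_name(char *buf, size_t buflen, const char *name_template,
                     const char *already_taken /* stand-in for a host probe */)
{
    for (;;) {
        int i = idx_reserve();
        if (i < 0)
            return -1;
        snprintf(buf, buflen, name_template, i);
        if (strcmp(buf, already_taken) != 0)
            return i; /* success: caller later idx_release()s it */
        /* name stolen externally: leave index reserved and retry */
    }
}
```

`idx_release()` would be called when the guest's interface is torn down, so freed numbers become the lowest available again, preserving the short-name behavior users expect.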


Note that since macvtap and standard tap devices (and, actually *all* 
network devices) in the same namespace need to not have conflicting 
names, we could keep a single list of in-use names. Since we would never 
auto-generate a tap device name that conflicts with an auto-generated 
macvtap name, that is probably unnecessary, but just in case it makes 
things easier...


Also, in the name of backward compatibility, we will need to support tap 
device names from the config that have %d in the name, e.g. the default 
name we send is "vnet%d", but we allow for someone to specify "mytap%d".


In the end, if we can do something simple that preserves current 
behavior while covering the hole, that would probably be more expedient, 
but I have to say the thought of naming the devices ourselves has 
crossed my mind.


Re: [PATCH] network: Fix a race condition when shutdown & start vm at the same time
Posted by Daniel P. Berrangé 3 years, 10 months ago
On Fri, Jun 19, 2020 at 01:06:33PM -0400, Laine Stump wrote:
> On 6/19/20 12:26 PM, Daniel P. Berrangé wrote:
> > Maintain a global "int nextTapID" counter, and just iterate on this.
> > NIC names can be upto 16 bytes, so we'll create billions of devices
> > before we have any chance of wrapping around and risking a collision
> > with a concurrently shutting down guest.
> 
> 
> I remember I tried doing a monotonically increasing "next" counter for
> macvtap (never actually pushed, but shown to [some users, I forget who,
> maybe ovirt devs?], and they didn't like it because the devices ended up
> with long difficult-for-a-human-to-remember/pick-out/recite names like
> macvtap42301 and macvtap43021. So instead we keep track of the names that
> are in use, and always look for the lowest available number when creating a
> new one. (of course doing that would greatly increase the likelihood of
> exposing any race conditions, so...) Definitely if we change the behavior in
> this way we're going to hear about it, though :-)

People might complain, but I'm not convinced it really matters. Pid
numbers used to be nice & short until someone raised pid max and now
my processes have 7-digit long pids. It is surprising at first, but
it didn't really cause me any functional problems.

And there's still the option of just providing a fixed name if you
really need something predictable for a guest.

> Another issue I remember from macvtap is the possibility of a different
> process (not libvirt) having already created a macvtap device with the name
> that we are wanting to use. That is easily solved in the case of macvtap by
> just iterating through until we find one that we can create without failure.
> So basically you have the list/bitmap/whatever of device names currently in
> use, use that as a starting point to pick a name for the new device, but
> then also allow for that name to already be used, and retry.

I wasn't thinking we need a bitmap, literally just a single counter.
> We start where we left off, trying until we succeed, which is what
the kernel does anyway IIRC. Most other competing processes will
rely on the kernel auto-allocation so won't clash with us very
frequently if at all.

> Also, in the name of backward compatibility, we will need to support tap
> device names from the config that have %d in the name, e.g. the default name
> we send is "vnet%d", but we allow for someone to specify "mytap%d".

We could keep a counter per prefix, or just have a single global
counter.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [PATCH] network: Fix a race condition when shutdown & start vm at the same time
Posted by Laine Stump 3 years, 10 months ago
On 6/19/20 1:16 PM, Daniel P. Berrangé wrote:
> On Fri, Jun 19, 2020 at 01:06:33PM -0400, Laine Stump wrote:
>> On 6/19/20 12:26 PM, Daniel P. Berrangé wrote:
>>> Maintain a global "int nextTapID" counter, and just iterate on this.
>>> NIC names can be up to 16 bytes, so we'll create billions of devices
>>> before we have any chance of wrapping around and risking a collision
>>> with a concurrently shutting down guest.
>>
>> I remember I tried doing a monotonically increasing "next" counter for
>> macvtap (never actually pushed, but shown to [some users, I forget who,
>> maybe ovirt devs?], and they didn't like it because the devices ended up
>> with long difficult-for-a-human-to-remember/pick-out/recite names like
>> macvtap42301 and macvtap43021. So instead we keep track of the names that
>> are in use, and always look for the lowest available number when creating a
>> new one. (of course doing that would greatly increase the likelihood of
>> exposing any race conditions, so...) Definitely if we change the behavior in
>> this way we're going to hear about it, though :-)
> People might complain, but I'm not convinced it really matters. Pid
> numbers used to be nice & short until someone raised pid max and now
> my processes have 7-digit long pids. It is surprising at first, but
> it didn't really cause me any functional problems.


This is a good point. But even more convincing than that is the 
revelation that this isn't an issue just with OVS switch ports - the 
same race is also causing problems with libvirt nwfilter driver 
bindings, as reported here:


   https://bugzilla.redhat.com/show_bug.cgi?id=1837395


After seeing that report and realizing that this same race is the cause, 
I've decided to change libvirt's tap device creation to provide names 
based on a monotonically increasing counter as you suggest, rather than 
relying on the kernel. I'm hoping to have something to send to the list 
in a day or two.


>
> And there's still the option of just providing a fixed name if you
> really need something predictable for a guest.
>
>> Another issue I remember from macvtap is the possibility of a different
>> process (not libvirt) having already created a macvtap device with the name
>> that we are wanting to use. That is easily solved in the case of macvtap by
>> just iterating through until we find one that we can create without failure.
>> So basically you have the list/bitmap/whatever of device names currently in
>> use, use that as a starting point to pick a name for the new device, but
>> then also allow for that name to already be used, and retry.
> I wasn't thinking we need a bitmap, literally just a single counter.
> We start where we left off, trying until we succeed, which is what
> the kernel does anyway IIRC. Most other competing processes will
> rely on the kernel auto-allocation so won't clash with us very
> frequently if at all.
>
>> Also, in the name of backward compatibility, we will need to support tap
>> device names from the config that have %d in the name, e.g. the default name
>> we send is "vnet%d", but we allow for someone to specify "mytap%d".
> We could keep a counter per prefix, or just have a single global
> counter.
>
> Regards,
> Daniel