[Qemu-devel] [PATCH] pr-manager-helper: fix pr process been killed when reconectting

Jie Wang posted 1 patch 4 years, 11 months ago
Test s390x failed
Test checkpatch passed
Test asan passed
Test docker-mingw@fedora passed
Test docker-clang@ubuntu passed
Test FreeBSD passed
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/1559048796-57016-1-git-send-email-wangjie88@huawei.com
Maintainers: Paolo Bonzini <pbonzini@redhat.com>, Fam Zheng <fam@euphon.net>
scsi/pr-manager-helper.c | 1 +
1 file changed, 1 insertion(+)
[Qemu-devel] [PATCH] pr-manager-helper: fix pr process been killed when reconectting
Posted by Jie Wang 4 years, 11 months ago
if pr-helper been killed and qemu send disconnect event to libvirt
and libvirt started a new pr-helper process, the new pr-heleper
been killed again when qemu is connectting to the new pr-helper,
qemu will never send disconnect to libvirt, and libvirt will never
receive connected event.

Signed-off-by: Jie Wang <wangjie88@huawei.com>
---
 scsi/pr-manager-helper.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/scsi/pr-manager-helper.c b/scsi/pr-manager-helper.c
index 438380fced..b7341b8f47 100644
--- a/scsi/pr-manager-helper.c
+++ b/scsi/pr-manager-helper.c
@@ -120,6 +120,7 @@ static int pr_manager_helper_initialize(PRManagerHelper *pr_mgr,
     if (local_err) {
         object_unref(OBJECT(sioc));
         error_propagate(errp, local_err);
+        pr_manager_send_status_changed_event(pr_mgr);
         return -ENOTCONN;
     }
 
-- 
2.16.2.windows.1


Re: [Qemu-devel] [Qemu-block] [PATCH] pr-manager-helper: fix pr process been killed when reconectting
Posted by Paolo Bonzini 4 years, 11 months ago
On 28/05/19 15:06, Jie Wang wrote:
> if pr-helper been killed and qemu send disconnect event to libvirt
> and libvirt started a new pr-helper process, the new pr-heleper
> been killed again when qemu is connectting to the new pr-helper,
> qemu will never send disconnect to libvirt, and libvirt will never
> receive connected event.

I think this would let a guest "spam" events just by sending a lot PR
commands.  Also, as you said, in this case QEMU has never sent a
"connected" event, so I'm not sure why it should send a disconnection event.

Does libvirt monitor at all the pr-helper to check if it dies?  Or does
it rely exclusively on QEMU's events?

Paolo


> Signed-off-by: Jie Wang <wangjie88@huawei.com>
> ---
>  scsi/pr-manager-helper.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/scsi/pr-manager-helper.c b/scsi/pr-manager-helper.c
> index 438380fced..b7341b8f47 100644
> --- a/scsi/pr-manager-helper.c
> +++ b/scsi/pr-manager-helper.c
> @@ -120,6 +120,7 @@ static int pr_manager_helper_initialize(PRManagerHelper *pr_mgr,
>      if (local_err) {
>          object_unref(OBJECT(sioc));
>          error_propagate(errp, local_err);
> +        pr_manager_send_status_changed_event(pr_mgr);
>          return -ENOTCONN;
>      }
>  
> 


Re: [Qemu-devel] [Qemu-block] [PATCH] pr-manager-helper: fix pr process been killed when reconectting
Posted by Michal Privoznik 4 years, 11 months ago
On 5/28/19 7:45 PM, Paolo Bonzini wrote:
> On 28/05/19 15:06, Jie Wang wrote:
>> if pr-helper been killed and qemu send disconnect event to libvirt
>> and libvirt started a new pr-helper process, the new pr-heleper
>> been killed again when qemu is connectting to the new pr-helper,
>> qemu will never send disconnect to libvirt, and libvirt will never
>> receive connected event.
> 
> I think this would let a guest "spam" events just by sending a lot PR
> commands.  Also, as you said, in this case QEMU has never sent a
> "connected" event, so I'm not sure why it should send a disconnection event.

So pr manager is initialized on the first PR command and not when qemu 
is starting?

If a user inside the guest could somehow kill pr-helper process in the 
host then yes, they could spam libvirt/qemu. But if a user from inside a 
guest can kill a process in the host that is much bigger problem than 
spaming libvirt.

> 
> Does libvirt monitor at all the pr-helper to check if it dies?  Or does
> it rely exclusively on QEMU's events?

Libvirt relies solely on QEMU's events. Just like with qemu process 
itself, libvirt can't rely on SIGCHILD because the daemon might be 
restarted which would reparent all qemu and pr-helper processes 
rendering libvirt wait for SIGCHILD useless.

But there is an exception to this: when libvirt is spawning pr-helper it 
does so by following these steps:

1) Try to acquire (lock) pidfile
2) unlink(socket)
3) spawn pr-helper process (this yields child's PID)
4) wait some time until socket is created
5) some follow up work (move child's PID into same cgroup as qemu's main 
thread, relabel the socket so that qemu can access it)

If any of these steps fails then child is killed. However, the PID is 
not recorded anywhere and thus is forgotten once control jumps out of 
the function.

Michal

Re: [Qemu-devel] [Qemu-block] [PATCH] pr-manager-helper: fix pr process been killed when reconectting
Posted by Jie Wang 4 years, 11 months ago
Hi, Paolo and Michal:

Let me add some details about this problem.


reappear steps:

1. in Host, execute the following command many times quickly:

"ps aux|grep helper|grep -v grep|grep -v qemu-kvm|awk '{print $2}';ps aux|grep helper|grep -v grep|grep -v qemu-kvm|awk '{print $2}'|xargs -n1 kill -9"

2. at the same time , execute PR command continuously in Guest

just execute step 1 and 2 for a moment, the problem will appear.


when the problem appeared:

1. qemu will initialize pr-helper and connect to it cyclically, but always failed because no running pr-helper process to connect.

2. libvirt will always waiting for connected event, but will never to start new pr-helper process because not receive disconnect event.


I'm not found the best way to solve this problem, can you give me some suggestion?


On 2019/5/29 15:33, Michal Privoznik wrote:
> On 5/28/19 7:45 PM, Paolo Bonzini wrote:
>> On 28/05/19 15:06, Jie Wang wrote:
>>> if pr-helper been killed and qemu send disconnect event to libvirt
>>> and libvirt started a new pr-helper process, the new pr-heleper
>>> been killed again when qemu is connectting to the new pr-helper,
>>> qemu will never send disconnect to libvirt, and libvirt will never
>>> receive connected event.
>>
>> I think this would let a guest "spam" events just by sending a lot PR
>> commands.  Also, as you said, in this case QEMU has never sent a
>> "connected" event, so I'm not sure why it should send a disconnection event.
>
> So pr manager is initialized on the first PR command and not when qemu is starting?
>
> If a user inside the guest could somehow kill pr-helper process in the host then yes, they could spam libvirt/qemu. But if a user from inside a guest can kill a process in the host that is much bigger problem than spaming libvirt.
>
>>
>> Does libvirt monitor at all the pr-helper to check if it dies?  Or does
>> it rely exclusively on QEMU's events?
>
> Libvirt relies solely on QEMU's events. Just like with qemu process itself, libvirt can't rely on SIGCHILD because the daemon might be restarted which would reparent all qemu and pr-helper processes rendering libvirt wait for SIGCHILD useless.
>
> But there is an exception to this: when libvirt is spawning pr-helper it does so by following these steps:
>
> 1) Try to acquire (lock) pidfile
> 2) unlink(socket)
> 3) spawn pr-helper process (this yields child's PID)
> 4) wait some time until socket is created
> 5) some follow up work (move child's PID into same cgroup as qemu's main thread, relabel the socket so that qemu can access it)
>
> If any of these steps fails then child is killed. However, the PID is not recorded anywhere and thus is forgotten once control jumps out of the function.
>
> Michal
>
> .
>
Re: [Qemu-devel] [Qemu-block] [PATCH] pr-manager-helper: fix pr process been killed when reconectting
Posted by Paolo Bonzini 4 years, 11 months ago
On 29/05/19 10:37, Jie Wang wrote:
> when the problem appeared:
> 
> 1. qemu will initialize pr-helper and connect to it cyclically, but
> always failed because no running pr-helper process to connect.
> 
> 2. libvirt will always waiting for connected event, but will never to
> start new pr-helper process because not receive disconnect event.
> 
> I'm not found the best way to solve this problem, can you give me some
> suggestion?

I can't find a way that is better than your patch, either.  Another
possible problem is that this could cause libvirt to spawn two helpers
if you have a race like

	qemu: report DISCONNECTED
	libvirt: start pr-helper #1
	qemu: report DISCONNECTED
	libvirt: start pr-helper #2
	pr-helper #1: create socket
	pr-helper #2: fail to start

But it should not be an issue since one of the two pr-helpers will clean
up after itself.

Paolo

Re: [Qemu-devel] [Qemu-block] [PATCH] pr-manager-helper: fix pr process been killed when reconectting
Posted by Paolo Bonzini 4 years, 11 months ago
On 29/05/19 09:33, Michal Privoznik wrote:
> On 5/28/19 7:45 PM, Paolo Bonzini wrote:
>> On 28/05/19 15:06, Jie Wang wrote:
>>> if pr-helper been killed and qemu send disconnect event to libvirt
>>> and libvirt started a new pr-helper process, the new pr-heleper
>>> been killed again when qemu is connectting to the new pr-helper,
>>> qemu will never send disconnect to libvirt, and libvirt will never
>>> receive connected event.
>>
>> I think this would let a guest "spam" events just by sending a lot PR
>> commands.  Also, as you said, in this case QEMU has never sent a
>> "connected" event, so I'm not sure why it should send a disconnection
>> event.
> 
> So pr manager is initialized on the first PR command and not when qemu
> is starting?

It is initialized when QEMU is started, but if it dies, that's not
detected until the first PR command.  The command is retried for 5
seconds, which should give libvirt ample time to restart the PR manager
(and it's an exceptional situation anyway).

> If a user inside the guest could somehow kill pr-helper process in the
> host then yes, they could spam libvirt/qemu. But if a user from inside a
> guest can kill a process in the host that is much bigger problem than
> spaming libvirt.

This is true.

>> Does libvirt monitor at all the pr-helper to check if it dies?  Or does
>> it rely exclusively on QEMU's events?
> 
> Libvirt relies solely on QEMU's events. Just like with qemu process
> itself, libvirt can't rely on SIGCHILD because the daemon might be
> restarted which would reparent all qemu and pr-helper processes
> rendering libvirt wait for SIGCHILD useless.
> 
> But there is an exception to this: when libvirt is spawning pr-helper it
> does so by following these steps:
> 
> 1) Try to acquire (lock) pidfile
> 2) unlink(socket)
> 3) spawn pr-helper process (this yields child's PID)
> 4) wait some time until socket is created
> 5) some follow up work (move child's PID into same cgroup as qemu's main
> thread, relabel the socket so that qemu can access it)
> 
> If any of these steps fails then child is killed. However, the PID is
> not recorded anywhere and thus is forgotten once control jumps out of
> the function.

Note that qemu-pr-helper supports the systemd socket activation
protocol.  Would it help if libvirt used it?

Paolo

Re: [Qemu-devel] [Qemu-block] [PATCH] pr-manager-helper: fix pr process been killed when reconectting
Posted by Michal Privoznik 4 years, 10 months ago
On 5/29/19 11:34 AM, Paolo Bonzini wrote:
> On 29/05/19 09:33, Michal Privoznik wrote:
>> On 5/28/19 7:45 PM, Paolo Bonzini wrote:
>>> On 28/05/19 15:06, Jie Wang wrote:
>>>> if pr-helper been killed and qemu send disconnect event to libvirt
>>>> and libvirt started a new pr-helper process, the new pr-heleper
>>>> been killed again when qemu is connectting to the new pr-helper,
>>>> qemu will never send disconnect to libvirt, and libvirt will never
>>>> receive connected event.
>>>
>>> I think this would let a guest "spam" events just by sending a lot PR
>>> commands.  Also, as you said, in this case QEMU has never sent a
>>> "connected" event, so I'm not sure why it should send a disconnection
>>> event.
>>
>> So pr manager is initialized on the first PR command and not when qemu
>> is starting?
> 
> It is initialized when QEMU is started, but if it dies, that's not
> detected until the first PR command.  The command is retried for 5
> seconds, which should give libvirt ample time to restart the PR manager
> (and it's an exceptional situation anyway).

Ah yes, I recall vaguelly testing this whilst writing patches for libvirt.

> 
>> If a user inside the guest could somehow kill pr-helper process in the
>> host then yes, they could spam libvirt/qemu. But if a user from inside a
>> guest can kill a process in the host that is much bigger problem than
>> spaming libvirt.
> 
> This is true.
> 
>>> Does libvirt monitor at all the pr-helper to check if it dies?  Or does
>>> it rely exclusively on QEMU's events?
>>
>> Libvirt relies solely on QEMU's events. Just like with qemu process
>> itself, libvirt can't rely on SIGCHILD because the daemon might be
>> restarted which would reparent all qemu and pr-helper processes
>> rendering libvirt wait for SIGCHILD useless.
>>
>> But there is an exception to this: when libvirt is spawning pr-helper it
>> does so by following these steps:
>>
>> 1) Try to acquire (lock) pidfile
>> 2) unlink(socket)
>> 3) spawn pr-helper process (this yields child's PID)
>> 4) wait some time until socket is created
>> 5) some follow up work (move child's PID into same cgroup as qemu's main
>> thread, relabel the socket so that qemu can access it)
>>
>> If any of these steps fails then child is killed. However, the PID is
>> not recorded anywhere and thus is forgotten once control jumps out of
>> the function.
> 
> Note that qemu-pr-helper supports the systemd socket activation
> protocol.  Would it help if libvirt used it?

Thing is, libvirt creates a mount namespace for domains (one namespace 
for one domain). In this namespace a dummy /dev is mounted and only 
nodes that qemu is configured to have are created. For instance, you 
won't see /dev/sda there unless your domain has it as a disk. Then, 
libvirt moves pr-helper process into the same cgroups as the qemu's main 
thread. This is all done so that pr-helper has the same view of the 
system as qemu. I don't think that he same result can be achieved using 
socket activation.
Also, libvirt spawns one pr-helper per domain (so that the socket can be 
private and share seclabel with qemu process it's attached to).

Michal

Re: [Qemu-devel] [Qemu-block] [PATCH] pr-manager-helper: fix pr process been killed when reconectting
Posted by Paolo Bonzini 4 years, 10 months ago
On 30/05/19 12:08, Michal Privoznik wrote:
>>> 1) Try to acquire (lock) pidfile
>>> 2) unlink(socket)
>>> 3) spawn pr-helper process (this yields child's PID)
>>> 4) wait some time until socket is created
>>> 5) some follow up work (move child's PID into same cgroup as qemu's main
>>> thread, relabel the socket so that qemu can access it)
>>
>> Note that qemu-pr-helper supports the systemd socket activation
>> protocol.  Would it help if libvirt used it?
> 
> Thing is, libvirt creates a mount namespace for domains (one namespace
> for one domain). In this namespace a dummy /dev is mounted and only
> nodes that qemu is configured to have are created. For instance, you
> won't see /dev/sda there unless your domain has it as a disk. Then,
> libvirt moves pr-helper process into the same cgroups as the qemu's main
> thread. This is all done so that pr-helper has the same view of the
> system as qemu. I don't think that he same result can be achieved using
> socket activation.

Why?  The only difference with "normal" behavior and socket activation
is who creates the socket and calls listen() on it.  Everything else is
entirely the same.

> Also, libvirt spawns one pr-helper per domain (so that the socket can be
> private and share seclabel with qemu process it's attached to).

Yes, that is why I thought the socket could be moved in advance to the
right security label, prior to exec.  Also, perhaps could the child move
itself to the right cgroup before dropping privileges.  This would
remove the window between 3 and 5, by moving all the work *before*
qemu-pr-helper is exec-ed.

Paolo

Re: [Qemu-devel] [Qemu-block] [PATCH] pr-manager-helper: fix pr process been killed when reconectting
Posted by wangjie (P) 4 years, 10 months ago
Hi, Paolo and Michal:

Base on my patch, I found another problem is that:

     qemu: report DISCONNECTED

     libvirt: start pr-helper #1, but it will takes a while to complete 
this action

     qemu: reconnect to start pr-helper #1 immediately and failed, so 
report DISCONNECTED

     libvirt: begin to call qemuProcessStartManagedPRDaemon to start 
pr-helper #2, but virPidFileAcquirePath failed, so close fd and unlink 
pidfile by mistake


based on the above problem, I thought out two ways to fix this problem:

1. qemu: when call pr_manager_helper_write failed, sleep one second to 
make sure libvirt had started pr-helper before call 
pr_manager_helper_initialize.

2. libvirt: if virFileLock failed in virPidFileAcquirePath, not to close 
fd and unlink pidfile, because pr-helper #1 is using the pidfile.

what kind of the two above solutions is better? please give me some 
advice, thanks.


On 2019/5/30 18:59, Paolo Bonzini wrote:
> On 30/05/19 12:08, Michal Privoznik wrote:
>>>> 1) Try to acquire (lock) pidfile
>>>> 2) unlink(socket)
>>>> 3) spawn pr-helper process (this yields child's PID)
>>>> 4) wait some time until socket is created
>>>> 5) some follow up work (move child's PID into same cgroup as qemu's main
>>>> thread, relabel the socket so that qemu can access it)
>>> Note that qemu-pr-helper supports the systemd socket activation
>>> protocol.  Would it help if libvirt used it?
>> Thing is, libvirt creates a mount namespace for domains (one namespace
>> for one domain). In this namespace a dummy /dev is mounted and only
>> nodes that qemu is configured to have are created. For instance, you
>> won't see /dev/sda there unless your domain has it as a disk. Then,
>> libvirt moves pr-helper process into the same cgroups as the qemu's main
>> thread. This is all done so that pr-helper has the same view of the
>> system as qemu. I don't think that he same result can be achieved using
>> socket activation.
> Why?  The only difference with "normal" behavior and socket activation
> is who creates the socket and calls listen() on it.  Everything else is
> entirely the same.
>
>> Also, libvirt spawns one pr-helper per domain (so that the socket can be
>> private and share seclabel with qemu process it's attached to).
> Yes, that is why I thought the socket could be moved in advance to the
> right security label, prior to exec.  Also, perhaps could the child move
> itself to the right cgroup before dropping privileges.  This would
> remove the window between 3 and 5, by moving all the work *before*
> qemu-pr-helper is exec-ed.
>
> Paolo
>
> .
>

Re: [Qemu-devel] [Qemu-block] [PATCH] pr-manager-helper: fix pr process been killed when reconectting
Posted by Paolo Bonzini 4 years, 10 months ago
On 11/06/19 15:51, wangjie (P) wrote:
> Hi, Paolo and Michal:
> 
> Base on my patch, I found another problem is that:
> 
>     qemu: report DISCONNECTED
> 
>     libvirt: start pr-helper #1, but it will takes a while to complete
> this action
> 
>     qemu: reconnect to start pr-helper #1 immediately and failed, so
> report DISCONNECTED
> 
>     libvirt: begin to call qemuProcessStartManagedPRDaemon to start
> pr-helper #2, but virPidFileAcquirePath failed, so close fd and unlink
> pidfile by mistake
> 
> 
> based on the above problem, I thought out two ways to fix this problem:
> 
> 1. qemu: when call pr_manager_helper_write failed, sleep one second to
> make sure libvirt had started pr-helper before call
> pr_manager_helper_initialize.
> 
> 2. libvirt: if virFileLock failed in virPidFileAcquirePath, not to close
> fd and unlink pidfile, because pr-helper #1 is using the pidfile.

I think this could also be fixed by using the socket activation protocol
in libvirt, because QEMU will be able to connect.  libvirt will have to
discard events from QMP right after it calls listen() on the socket.

Paolo