[PATCH] vsock/virtio: Remove queued_replies pushback logic
Posted by Alexander Graf 1 week ago
Ever since its introduction, the virtio vsock driver has included
pushback logic that blocks it from taking any new RX packets until the
TX queue backlog becomes shallower than the virtqueue size.

This logic works fine when you connect a user space application on the
hypervisor with a virtio-vsock target, because the guest will stop
receiving data until the host has pulled all outstanding data from the
VM.

With Nitro Enclaves however, we connect 2 VMs directly via vsock:

  Parent      Enclave

    RX -------- TX
    TX -------- RX

This means we now have 2 virtio-vsock backends that both have the pushback
logic. If the parent's TX queue runs full at the same time as the
Enclave's, both virtio-vsock drivers fall into the pushback path and
no longer accept RX traffic. However, that RX traffic is TX traffic on
the other side which blocks that driver from making any forward
progress. We're now in a deadlock.

To resolve this, let's remove that pushback logic altogether and rely on
higher levels (like credits) to ensure we do not consume unbounded
memory.

Fixes: 0ea9e1d3a9e3 ("VSOCK: Introduce virtio_transport.ko")
Signed-off-by: Alexander Graf <graf@amazon.com>
---
 net/vmw_vsock/virtio_transport.c | 51 ++------------------------------
 1 file changed, 2 insertions(+), 49 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 64a07acfef12..53e79779886c 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -44,8 +44,6 @@ struct virtio_vsock {
 	struct work_struct send_pkt_work;
 	struct sk_buff_head send_pkt_queue;
 
-	atomic_t queued_replies;
-
 	/* The following fields are protected by rx_lock.  vqs[VSOCK_VQ_RX]
 	 * must be accessed with rx_lock held.
 	 */
@@ -171,17 +169,6 @@ virtio_transport_send_pkt_work(struct work_struct *work)
 
 		virtio_transport_deliver_tap_pkt(skb);
 
-		if (reply) {
-			struct virtqueue *rx_vq = vsock->vqs[VSOCK_VQ_RX];
-			int val;
-
-			val = atomic_dec_return(&vsock->queued_replies);
-
-			/* Do we now have resources to resume rx processing? */
-			if (val + 1 == virtqueue_get_vring_size(rx_vq))
-				restart_rx = true;
-		}
-
 		added = true;
 	}
 
@@ -218,9 +205,6 @@ virtio_transport_send_pkt(struct sk_buff *skb)
 		goto out_rcu;
 	}
 
-	if (virtio_vsock_skb_reply(skb))
-		atomic_inc(&vsock->queued_replies);
-
 	virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
 	queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
 
@@ -233,7 +217,7 @@ static int
 virtio_transport_cancel_pkt(struct vsock_sock *vsk)
 {
 	struct virtio_vsock *vsock;
-	int cnt = 0, ret;
+	int ret;
 
 	rcu_read_lock();
 	vsock = rcu_dereference(the_virtio_vsock);
@@ -242,17 +226,7 @@ virtio_transport_cancel_pkt(struct vsock_sock *vsk)
 		goto out_rcu;
 	}
 
-	cnt = virtio_transport_purge_skbs(vsk, &vsock->send_pkt_queue);
-
-	if (cnt) {
-		struct virtqueue *rx_vq = vsock->vqs[VSOCK_VQ_RX];
-		int new_cnt;
-
-		new_cnt = atomic_sub_return(cnt, &vsock->queued_replies);
-		if (new_cnt + cnt >= virtqueue_get_vring_size(rx_vq) &&
-		    new_cnt < virtqueue_get_vring_size(rx_vq))
-			queue_work(virtio_vsock_workqueue, &vsock->rx_work);
-	}
+	virtio_transport_purge_skbs(vsk, &vsock->send_pkt_queue);
 
 	ret = 0;
 
@@ -323,18 +297,6 @@ static void virtio_transport_tx_work(struct work_struct *work)
 		queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
 }
 
-/* Is there space left for replies to rx packets? */
-static bool virtio_transport_more_replies(struct virtio_vsock *vsock)
-{
-	struct virtqueue *vq = vsock->vqs[VSOCK_VQ_RX];
-	int val;
-
-	smp_rmb(); /* paired with atomic_inc() and atomic_dec_return() */
-	val = atomic_read(&vsock->queued_replies);
-
-	return val < virtqueue_get_vring_size(vq);
-}
-
 /* event_lock must be held */
 static int virtio_vsock_event_fill_one(struct virtio_vsock *vsock,
 				       struct virtio_vsock_event *event)
@@ -581,14 +543,6 @@ static void virtio_transport_rx_work(struct work_struct *work)
 			struct sk_buff *skb;
 			unsigned int len;
 
-			if (!virtio_transport_more_replies(vsock)) {
-				/* Stop rx until the device processes already
-				 * pending replies.  Leave rx virtqueue
-				 * callbacks disabled.
-				 */
-				goto out;
-			}
-
 			skb = virtqueue_get_buf(vq, &len);
 			if (!skb)
 				break;
@@ -735,7 +689,6 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
 
 	vsock->rx_buf_nr = 0;
 	vsock->rx_buf_max_nr = 0;
-	atomic_set(&vsock->queued_replies, 0);
 
 	mutex_init(&vsock->tx_lock);
 	mutex_init(&vsock->rx_lock);
-- 
2.40.1




Re: [PATCH] vsock/virtio: Remove queued_replies pushback logic
Posted by Stefano Garzarella 1 week ago
On Fri, Nov 15, 2024 at 10:30:16AM +0000, Alexander Graf wrote:
>Ever since the introduction of the virtio vsock driver, it included
>pushback logic that blocks it from taking any new RX packets until the
>TX queue backlog becomes shallower than the virtqueue size.
>
>This logic works fine when you connect a user space application on the
>hypervisor with a virtio-vsock target, because the guest will stop
>receiving data until the host pulled all outstanding data from the VM.

So, why not skip this only when talking with a sibling VM?

>
>With Nitro Enclaves however, we connect 2 VMs directly via vsock:
>
>  Parent      Enclave
>
>    RX -------- TX
>    TX -------- RX
>
>This means we now have 2 virtio-vsock backends that both have the pushback
>logic. If the parent's TX queue runs full at the same time as the
>Enclave's, both virtio-vsock drivers fall into the pushback path and
>no longer accept RX traffic. However, that RX traffic is TX traffic on
>the other side which blocks that driver from making any forward
>progress. We're now in a deadlock.
>
>To resolve this, let's remove that pushback logic altogether and rely on
>higher levels (like credits) to ensure we do not consume unbounded
>memory.

I spoke quickly with Stefan, who has been following the development
from the beginning; he pointed out that there might be problems with
the control packets, since the credit mechanism only covers data
packets, so removing this mechanism completely doesn't seem like a
good idea.

>
>Fixes: 0ea9e1d3a9e3 ("VSOCK: Introduce virtio_transport.ko")

I'm not sure we should add this Fixes tag; backporting this to stable
branches seems very risky IMHO.

If we cannot find a better mechanism that works both guest <-> host
and guest <-> guest, I would prefer to do this just for guest <-> guest
communication, because removing this completely seems too risky to me,
at least without proof that control packets are fine.

Thanks,
Stefano

>Signed-off-by: Alexander Graf <graf@amazon.com>
>---
> net/vmw_vsock/virtio_transport.c | 51 ++------------------------------
> 1 file changed, 2 insertions(+), 49 deletions(-)
>
>diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>index 64a07acfef12..53e79779886c 100644
>--- a/net/vmw_vsock/virtio_transport.c
>+++ b/net/vmw_vsock/virtio_transport.c
>@@ -44,8 +44,6 @@ struct virtio_vsock {
> 	struct work_struct send_pkt_work;
> 	struct sk_buff_head send_pkt_queue;
>
>-	atomic_t queued_replies;
>-
> 	/* The following fields are protected by rx_lock.  vqs[VSOCK_VQ_RX]
> 	 * must be accessed with rx_lock held.
> 	 */
>@@ -171,17 +169,6 @@ virtio_transport_send_pkt_work(struct work_struct *work)
>
> 		virtio_transport_deliver_tap_pkt(skb);
>
>-		if (reply) {
>-			struct virtqueue *rx_vq = vsock->vqs[VSOCK_VQ_RX];
>-			int val;
>-
>-			val = atomic_dec_return(&vsock->queued_replies);
>-
>-			/* Do we now have resources to resume rx processing? */
>-			if (val + 1 == virtqueue_get_vring_size(rx_vq))
>-				restart_rx = true;
>-		}
>-
> 		added = true;
> 	}
>
>@@ -218,9 +205,6 @@ virtio_transport_send_pkt(struct sk_buff *skb)
> 		goto out_rcu;
> 	}
>
>-	if (virtio_vsock_skb_reply(skb))
>-		atomic_inc(&vsock->queued_replies);
>-
> 	virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
> 	queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
>
>@@ -233,7 +217,7 @@ static int
> virtio_transport_cancel_pkt(struct vsock_sock *vsk)
> {
> 	struct virtio_vsock *vsock;
>-	int cnt = 0, ret;
>+	int ret;
>
> 	rcu_read_lock();
> 	vsock = rcu_dereference(the_virtio_vsock);
>@@ -242,17 +226,7 @@ virtio_transport_cancel_pkt(struct vsock_sock *vsk)
> 		goto out_rcu;
> 	}
>
>-	cnt = virtio_transport_purge_skbs(vsk, &vsock->send_pkt_queue);
>-
>-	if (cnt) {
>-		struct virtqueue *rx_vq = vsock->vqs[VSOCK_VQ_RX];
>-		int new_cnt;
>-
>-		new_cnt = atomic_sub_return(cnt, &vsock->queued_replies);
>-		if (new_cnt + cnt >= virtqueue_get_vring_size(rx_vq) &&
>-		    new_cnt < virtqueue_get_vring_size(rx_vq))
>-			queue_work(virtio_vsock_workqueue, &vsock->rx_work);
>-	}
>+	virtio_transport_purge_skbs(vsk, &vsock->send_pkt_queue);
>
> 	ret = 0;
>
>@@ -323,18 +297,6 @@ static void virtio_transport_tx_work(struct work_struct *work)
> 		queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
> }
>
>-/* Is there space left for replies to rx packets? */
>-static bool virtio_transport_more_replies(struct virtio_vsock *vsock)
>-{
>-	struct virtqueue *vq = vsock->vqs[VSOCK_VQ_RX];
>-	int val;
>-
>-	smp_rmb(); /* paired with atomic_inc() and atomic_dec_return() */
>-	val = atomic_read(&vsock->queued_replies);
>-
>-	return val < virtqueue_get_vring_size(vq);
>-}
>-
> /* event_lock must be held */
> static int virtio_vsock_event_fill_one(struct virtio_vsock *vsock,
> 				       struct virtio_vsock_event *event)
>@@ -581,14 +543,6 @@ static void virtio_transport_rx_work(struct work_struct *work)
> 			struct sk_buff *skb;
> 			unsigned int len;
>
>-			if (!virtio_transport_more_replies(vsock)) {
>-				/* Stop rx until the device processes already
>-				 * pending replies.  Leave rx virtqueue
>-				 * callbacks disabled.
>-				 */
>-				goto out;
>-			}
>-
> 			skb = virtqueue_get_buf(vq, &len);
> 			if (!skb)
> 				break;
>@@ -735,7 +689,6 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
>
> 	vsock->rx_buf_nr = 0;
> 	vsock->rx_buf_max_nr = 0;
>-	atomic_set(&vsock->queued_replies, 0);
>
> 	mutex_init(&vsock->tx_lock);
> 	mutex_init(&vsock->rx_lock);
>-- 
>2.40.1
>
Re: [PATCH] vsock/virtio: Remove queued_replies pushback logic
Posted by Alexander Graf 1 week ago
Hi Stefano,

On 15.11.24 12:59, Stefano Garzarella wrote:
>
> On Fri, Nov 15, 2024 at 10:30:16AM +0000, Alexander Graf wrote:
>> Ever since the introduction of the virtio vsock driver, it included
>> pushback logic that blocks it from taking any new RX packets until the
>> TX queue backlog becomes shallower than the virtqueue size.
>>
>> This logic works fine when you connect a user space application on the
>> hypervisor with a virtio-vsock target, because the guest will stop
>> receiving data until the host pulled all outstanding data from the VM.
>
> So, why not skipping this only when talking with a sibling VM?


I don't think there is a way to know, is there?


>
>>
>> With Nitro Enclaves however, we connect 2 VMs directly via vsock:
>>
>>  Parent      Enclave
>>
>>    RX -------- TX
>>    TX -------- RX
>>
>> This means we now have 2 virtio-vsock backends that both have the 
>> pushback
>> logic. If the parent's TX queue runs full at the same time as the
>> Enclave's, both virtio-vsock drivers fall into the pushback path and
>> no longer accept RX traffic. However, that RX traffic is TX traffic on
>> the other side which blocks that driver from making any forward
>> progress. We're now in a deadlock.
>>
>> To resolve this, let's remove that pushback logic altogether and rely on
>> higher levels (like credits) to ensure we do not consume unbounded
>> memory.
>
> I spoke quickly with Stefan who has been following the development from
> the beginning and actually pointed out that there might be problems
> with the control packets, since credits only covers data packets, so
> it doesn't seem like a good idea remove this mechanism completely.


Can you help me understand which situations the current mechanism really 
helps with, so we can look at alternatives?


>
>>
>> Fixes: 0ea9e1d3a9e3 ("VSOCK: Introduce virtio_transport.ko")
>
> I'm not sure we should add this Fixes tag, this seems very risky
> backporting on stable branches IMHO.


In which situations do you believe it would genuinely break anything?
As it stands today, if you run an upstream parent and enclave and
hammer them with vsock traffic, you get into a deadlock. Even without
the flow control, you will never hit a deadlock. But you may get a
brown-out-like situation while Linux is flushing its buffers.

Ideally we want to have actual flow control to mitigate the problem 
altogether. But I'm not quite sure how and where. Just blocking all 
receiving traffic causes problems.


> If we cannot find a better mechanism to replace this with something
> that works both guest <-> host and guest <-> guest, I would prefer
> to do this just for guest <-> guest communication.
> Because removing this completely seems too risky for me, at least
> without a proof that control packets are fine.


So your concern is that control packets would not receive pushback, so 
we would allow unbounded traffic to get queued up? Can you suggest 
options to help with that?


Alex




Re: [PATCH] vsock/virtio: Remove queued_replies pushback logic
Posted by Stefano Garzarella 4 days, 2 hours ago
On Fri, Nov 15, 2024 at 4:49 PM Alexander Graf <graf@amazon.com> wrote:
>
> Hi Stefano,
>
> On 15.11.24 12:59, Stefano Garzarella wrote:
> >
> > On Fri, Nov 15, 2024 at 10:30:16AM +0000, Alexander Graf wrote:
> >> Ever since the introduction of the virtio vsock driver, it included
> >> pushback logic that blocks it from taking any new RX packets until the
> >> TX queue backlog becomes shallower than the virtqueue size.
> >>
> >> This logic works fine when you connect a user space application on the
> >> hypervisor with a virtio-vsock target, because the guest will stop
> >> receiving data until the host pulled all outstanding data from the VM.
> >
> > So, why not skipping this only when talking with a sibling VM?
>
>
> I don't think there is a way to know, is there?
>

I thought about looking into the header and checking the dst_cid.
If it's > VMADDR_CID_HOST, we are talking with a sibling VM.
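
As a rough sketch of that idea (not part of the posted patch; the
helper name and its placement are assumptions, while virtio_vsock_hdr()
and VMADDR_CID_HOST already exist), the check could look something
like:

/* Hypothetical helper: is this packet addressed to a sibling VM rather
 * than to the host?  CIDs 0-2 are reserved, so anything above
 * VMADDR_CID_HOST addresses another guest.
 */
static bool virtio_transport_dst_is_sibling(struct sk_buff *skb)
{
	struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);

	return le64_to_cpu(hdr->dst_cid) > VMADDR_CID_HOST;
}

virtio_transport_send_pkt() could then skip the queued_replies
accounting (and the RX worker the corresponding check) whenever this
returns true, keeping the pushback for guest <-> host traffic only.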

>
> >
> >>
> >> With Nitro Enclaves however, we connect 2 VMs directly via vsock:
> >>
> >>  Parent      Enclave
> >>
> >>    RX -------- TX
> >>    TX -------- RX
> >>
> >> This means we now have 2 virtio-vsock backends that both have the
> >> pushback
> >> logic. If the parent's TX queue runs full at the same time as the
> >> Enclave's, both virtio-vsock drivers fall into the pushback path and
> >> no longer accept RX traffic. However, that RX traffic is TX traffic on
> >> the other side which blocks that driver from making any forward
> >> progress. We're now in a deadlock.
> >>
> >> To resolve this, let's remove that pushback logic altogether and rely on
> >> higher levels (like credits) to ensure we do not consume unbounded
> >> memory.
> >
> > I spoke quickly with Stefan who has been following the development from
> > the beginning and actually pointed out that there might be problems
> > with the control packets, since credits only covers data packets, so
> > it doesn't seem like a good idea remove this mechanism completely.
>
>
> Can you help me understand which situations the current mechanism really
> helps with, so we can look at alternatives?

Good question!
I didn't participate in the initial development, so what I'm telling
you is my understanding.
@Stefan feel free to correct me!

The driver uses a single workqueue (virtio_vsock_workqueue) on which
it queues several workers. The ones we are interested in are:
1. the one that handles avail buffers in the TX virtqueue (send_pkt_work)
2. the one for used buffers in the RX virtqueue (rx_work)

Assuming that the same kthread executes the different workers, this
seems to be more about making sure that the RX worker (i.e. rx_work)
does not consume all the execution time, leaving no room for TX
(send_pkt_work), especially when there are a lot of messages queued in
the TX queue that are considered replies to the host. (The threshold
seems to be the size of the virtqueue.)

That said, perhaps just adopting a technique like the one in vhost
(byte_weight in vhost_dev_init(), vhost_exceeds_weight(), etc.), where
after a certain number of packets/bytes handled the worker stops and
reschedules itself, could give us the same guarantees in a simpler
way.
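
For reference, the shape of that mechanism in vhost is roughly the
following (a simplified rendition from memory, with the per-packet
handling elided and 'vq' standing for the vhost TX virtqueue; see
vhost_exceeds_weight() in drivers/vhost/vhost.c and its callers in
drivers/vhost/vsock.c for the real code):

	int pkts = 0, total_len = 0;

	do {
		int len = 0;

		/* ... pop one request from the TX virtqueue 'vq', record
		 * its size in 'len' and handle it ...
		 */
		total_len += len;

		/* Stop once the packet/byte budget passed to
		 * vhost_dev_init() is exceeded; vhost_exceeds_weight()
		 * requeues the poll so the handler resumes later instead
		 * of starving other work.
		 */
	} while (likely(!vhost_exceeds_weight(vq, ++pkts, total_len)));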

>
>
> >
> >>
> >> Fixes: 0ea9e1d3a9e3 ("VSOCK: Introduce virtio_transport.ko")
> >
> > I'm not sure we should add this Fixes tag, this seems very risky
> > backporting on stable branches IMHO.
>
>
> Which situations do you believe it will genuinely break anything in?

The situation for which it was introduced (which I don't know
precisely, because I wasn't following vsock yet). Removing it
completely, without being sure that what it was developed for is still
okay, is risky to me.

Support for sibling VMs has only recently been introduced, so I'd be
happier making these changes just for that kind of communication.

That said, the idea of doing what vhost does might solve all our
problems, so in that case maybe it would be okay.

> As
> it stands today, if you run upstream parent and enclave and hammer them
> with vsock traffic, you get into a deadlock. Even without the flow
> control, you will never hit a deadlock. But you may get a brown-out like
> situation while Linux is flushing its buffers.
>
> Ideally we want to have actual flow control to mitigate the problem
> altogether. But I'm not quite sure how and where. Just blocking all
> receiving traffic causes problems.
>
>
> > If we cannot find a better mechanism to replace this with something
> > that works both guest <-> host and guest <-> guest, I would prefer
> > to do this just for guest <-> guest communication.
> > Because removing this completely seems too risky for me, at least
> > without a proof that control packets are fine.
>
>
> So your concern is that control packets would not receive pushback, so
> we would allow unbounded traffic to get queued up?

Right, most of the `reply` packets are control packets (reset and
response IIUC) that are not part of the credit mechanism, so I think
this confirms what Stefan was telling me.

> Can you suggest
> options to help with that?

Maybe mimicking the vhost approach, or something similar, would help.

That said, did you really encounter a real problem, or is this more of
a patch to avoid future problems?

It would be nice to have a test that reproduces this problem, which we
can use to check that everything is okay if we adopt something
different. The same goes for the problem that this mechanism wants to
avoid; I'll try to see if I have time to write a test so we can use
it.


Thanks,
Stefano
Re: [PATCH] vsock/virtio: Remove queued_replies pushback logic
Posted by Stefano Garzarella 3 days, 8 hours ago
On Mon, Nov 18, 2024 at 03:07:43PM +0100, Stefano Garzarella wrote:
>On Fri, Nov 15, 2024 at 4:49 PM Alexander Graf <graf@amazon.com> wrote:
>>
>> Hi Stefano,
>>
>> On 15.11.24 12:59, Stefano Garzarella wrote:
>> >
>> > On Fri, Nov 15, 2024 at 10:30:16AM +0000, Alexander Graf wrote:
>> >> Ever since the introduction of the virtio vsock driver, it included
>> >> pushback logic that blocks it from taking any new RX packets until the
>> >> TX queue backlog becomes shallower than the virtqueue size.
>> >>
>> >> This logic works fine when you connect a user space application on the
>> >> hypervisor with a virtio-vsock target, because the guest will stop
>> >> receiving data until the host pulled all outstanding data from the VM.
>> >
>> > So, why not skipping this only when talking with a sibling VM?
>>
>>
>> I don't think there is a way to know, is there?
>>
>
>I thought about looking into the header and check the dst_cid.
>If it's > VMADDR_CID_HOST, we are talking with a sibling VM.
>
>>
>> >
>> >>
>> >> With Nitro Enclaves however, we connect 2 VMs directly via vsock:
>> >>
>> >>  Parent      Enclave
>> >>
>> >>    RX -------- TX
>> >>    TX -------- RX
>> >>
>> >> This means we now have 2 virtio-vsock backends that both have the
>> >> pushback
>> >> logic. If the parent's TX queue runs full at the same time as the
>> >> Enclave's, both virtio-vsock drivers fall into the pushback path and
>> >> no longer accept RX traffic. However, that RX traffic is TX traffic on
>> >> the other side which blocks that driver from making any forward
>> >> progress. We're now in a deadlock.
>> >>
>> >> To resolve this, let's remove that pushback logic altogether and rely on
>> >> higher levels (like credits) to ensure we do not consume unbounded
>> >> memory.
>> >
>> > I spoke quickly with Stefan who has been following the development from
>> > the beginning and actually pointed out that there might be problems
>> > with the control packets, since credits only covers data packets, so
>> > it doesn't seem like a good idea remove this mechanism completely.
>>
>>
>> Can you help me understand which situations the current mechanism really
>> helps with, so we can look at alternatives?
>
>Good question!
>I didn't participate in the initial development, so what I'm telling
>you is my understanding.
>@Stefan feel free to correct me!
>
>The driver uses a single workqueue (virtio_vsock_workqueue) where it
>queues several workers. The ones we are interested in are:
>1. the one to handle avail buffers in the TX virtqueue (send_pkt_work)
>2. the one for used buffers in the RX virtqueue (rx_work)
>
>Assuming that the same kthread executes the different workers, it
>seems to be more about making sure that the RX worker (i.e. rx_work)
>does not consume all the execution time, leaving no room for TX
>(send_pkt_work). Especially when there are a lot of messages queued in
>the TX queue that are considered as response for the host. (The
>threshold seems to be the size of the virtqueue).
>
>That said, perhaps just adopting a technique like the one in vhost
>(byte_weight in vhost_dev_init(), vhost_exceeds_weight(), etc.) where
>after a certain number of packets/bytes handled, the worker terminates
>its work and reschedules, could give us the same guarantees, in a
>simpler way.

Thinking about it more, perhaps now I understand better why it was
introduced, and it should be related to the above.

In practice, "replies" are almost always queued in the intermediate
queue (send_pkt_queue) directly by the RX worker (rx_work) during its
execution. These are direct responses, handled by
virtio_transport_recv_pkt(), to requests coming from the other peer
(usually the host, or a sibling VM in your case). This happens, for
example, when calling virtio_transport_reset_no_sock() if a socket is
not found, or when sending back a VIRTIO_VSOCK_OP_RESPONSE packet to
ack a connection request that came in as a VIRTIO_VSOCK_OP_REQUEST
packet.

Because of this, if the number of these "replies" exceeds an arbitrary
threshold, the RX worker de-schedules itself to give the TX worker
(send_pkt_work) room to move those "replies" from the intermediate
queue into the TX virtqueue, at which point the TX worker reschedules
the RX worker to continue the job. So more than avoiding deadlock, it
seems to be a mechanism to avoid starvation.

Note that this is not necessary in vhost-vsock (which uses the same
functions as this driver, e.g. virtio_transport_recv_pkt), because both
workers already use vhost_exceeds_weight() to avoid this problem as
well.

At this point I think that doing something similar to vhost here as
well would not only avoid the problem that `queued_replies` is meant
to avoid, but also prevent the RX worker from monopolizing the
workqueue during data transfer.
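
To make that concrete, here is a minimal, untested sketch of what such
a budget could look like in virtio_transport_rx_work(); the
VSOCK_RX_WORK_BUDGET constant is made up and the per-packet handling is
elided:

#define VSOCK_RX_WORK_BUDGET	256	/* made-up value, illustration only */

static void virtio_transport_rx_work(struct work_struct *work)
{
	struct virtio_vsock *vsock =
		container_of(work, struct virtio_vsock, rx_work);
	struct virtqueue *vq = vsock->vqs[VSOCK_VQ_RX];
	int pkts = 0;

	mutex_lock(&vsock->rx_lock);

	if (!vsock->rx_run)
		goto out;

	do {
		virtqueue_disable_cb(vq);
		for (;;) {
			struct sk_buff *skb;
			unsigned int len;

			if (++pkts > VSOCK_RX_WORK_BUDGET) {
				/* Yield: requeue ourselves so that
				 * send_pkt_work (and the other workers)
				 * get a chance to run.
				 */
				queue_work(virtio_vsock_workqueue,
					   &vsock->rx_work);
				goto out;
			}

			skb = virtqueue_get_buf(vq, &len);
			if (!skb)
				break;

			/* ... existing per-packet handling unchanged ... */
		}
	} while (!virtqueue_enable_cb(vq));

out:
	/* ... existing rx buffer refill unchanged ... */
	mutex_unlock(&vsock->rx_lock);
}

Like the queued_replies check it would replace, this bails out of the
loop and relies on being rescheduled, but it is keyed on a fixed budget
rather than on the number of pending replies.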

Thanks,
Stefano

>
>>
>>
>> >
>> >>
>> >> Fixes: 0ea9e1d3a9e3 ("VSOCK: Introduce virtio_transport.ko")
>> >
>> > I'm not sure we should add this Fixes tag, this seems very risky
>> > backporting on stable branches IMHO.
>>
>>
>> Which situations do you believe it will genuinely break anything in?
>
>The situation for which it was introduced (which I don't know
>precisely because I wasn't following vsock yet).
>Removing it completely without being sure that what it was developed
>for is okay is risky to me.
>
>Support for sibling VMs has only recently been introduced, so I'd be
>happier making these changes just for that kind of communication.
>
>That said, the idea of doing like vhost might solve all our problems,
>so in that case maybe it might be okay.
>
>> As
>> it stands today, if you run upstream parent and enclave and hammer them
>> with vsock traffic, you get into a deadlock. Even without the flow
>> control, you will never hit a deadlock. But you may get a brown-out like
>> situation while Linux is flushing its buffers.
>>
>> Ideally we want to have actual flow control to mitigate the problem
>> altogether. But I'm not quite sure how and where. Just blocking all
>> receiving traffic causes problems.
>>
>>
>> > If we cannot find a better mechanism to replace this with something
>> > that works both guest <-> host and guest <-> guest, I would prefer
>> > to do this just for guest <-> guest communication.
>> > Because removing this completely seems too risky for me, at least
>> > without a proof that control packets are fine.
>>
>>
>> So your concern is that control packets would not receive pushback, so
>> we would allow unbounded traffic to get queued up?
>
>Right, most of `reply` are control packets (reset and response IIUC)
>that are not part of the credit mechanism, so I think this confirms
>what Stefan was telling me.
>
>> Can you suggest
>> options to help with that?
>
>Maybe mimic vhost approach should help, or something similar.
>
>That said, did you really encounter a real problem or is it more of a
>patch to avoid future problems.
>
>Because it would be nice to have a test that emphasizes this problem
>that we can use to check that everything is okay if we adopt something
>different. The same goes for the problem that this mechanism wants to
>avoid, I'll try to see if I have time to write a test so we can use
>it.
>
>
>Thanks,
>Stefano