From: Daniel P. Berrangé
To: qemu-devel@nongnu.org
Cc: Daniel P. Berrangé, Fabiano Rosas, Paolo Bonzini,
 Philippe Mathieu-Daudé, Marc-André Lureau, devel@lists.libvirt.org,
 Laurent Vivier, Manish Mishra, Tejus GK
Subject: [PULL 06/32] io: flush zerocopy socket error queue on sendmsg
 failure due to ENOBUF
Date: Mon, 3 Nov 2025 13:37:00 +0000
Message-ID: <20251103133727.423041-7-berrange@redhat.com>
In-Reply-To: <20251103133727.423041-1-berrange@redhat.com>
References: <20251103133727.423041-1-berrange@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Manish Mishra

For a zerocopy send, the kernel allocates extra metadata SKBs, which are
later used by zerocopy's notification mechanism. This metadata memory is
accounted for in the OPTMEM limit.
The kernel queues completion notifications on the socket error queue,
and this error queue is freed when userspace reads it.

Usually, with in-order processing, the kernel batches the notifications,
merges the metadata into a single SKB and frees the rest. As a result,
the queue never exceeds the OPTMEM limit. However, if there is any
out-of-order processing or there are intermittent zerocopy failures,
this error chain can grow significantly, exhausting the OPTMEM limit. As
a result, all new sendmsg requests fail to allocate any new SKB, leading
to an ENOBUFS error. Depending on the amount of data queued before the
flush (e.g., large live migration iterations), even large OPTMEM limits
are prone to failure.

To work around this, if we encounter an ENOBUFS error with a zerocopy
sendmsg, flush the error queue and retry once more.

Co-authored-by: Manish Mishra
Signed-off-by: Tejus GK
Reviewed-by: Daniel P. Berrangé
[DB: change TRUE/FALSE to true/false for 'bool' type;
     add more #ifdef QEMU_MSG_ZEROCOPY blocks]
Signed-off-by: Daniel P. Berrangé
---
 include/io/channel-socket.h |  5 +++
 io/channel-socket.c         | 84 ++++++++++++++++++++++++++++++-------
 2 files changed, 75 insertions(+), 14 deletions(-)

diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
index 26319fa98b..fcfd489c6c 100644
--- a/include/io/channel-socket.h
+++ b/include/io/channel-socket.h
@@ -50,6 +50,11 @@ struct QIOChannelSocket {
     ssize_t zero_copy_queued;
     ssize_t zero_copy_sent;
     bool blocking;
+    /**
+     * This flag indicates whether any new data was successfully sent with
+     * zerocopy since the last qio_channel_socket_flush() call.
+     */
+    bool new_zero_copy_sent_success;
 };
 
 
diff --git a/io/channel-socket.c b/io/channel-socket.c
index 8b30d5b7f7..3053b35ad8 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -37,6 +37,12 @@
 
 #define SOCKET_MAX_FDS 16
 
+#ifdef QEMU_MSG_ZEROCOPY
+static int qio_channel_socket_flush_internal(QIOChannel *ioc,
+                                             bool block,
+                                             Error **errp);
+#endif
+
 SocketAddress *
 qio_channel_socket_get_local_address(QIOChannelSocket *ioc,
                                      Error **errp)
@@ -66,6 +72,7 @@ qio_channel_socket_new(void)
     sioc->zero_copy_queued = 0;
     sioc->zero_copy_sent = 0;
     sioc->blocking = false;
+    sioc->new_zero_copy_sent_success = false;
 
     ioc = QIO_CHANNEL(sioc);
     qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN);
@@ -618,6 +625,10 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
     size_t fdsize = sizeof(int) * nfds;
     struct cmsghdr *cmsg;
     int sflags = 0;
+#ifdef QEMU_MSG_ZEROCOPY
+    bool blocking = sioc->blocking;
+    bool zerocopy_flushed_once = false;
+#endif
 
     memset(control, 0, CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS));
 
@@ -662,13 +673,30 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
             return QIO_CHANNEL_ERR_BLOCK;
         case EINTR:
             goto retry;
+#ifdef QEMU_MSG_ZEROCOPY
         case ENOBUFS:
             if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
-                error_setg_errno(errp, errno,
-                                 "Process can't lock enough memory for using MSG_ZEROCOPY");
-                return -1;
+                /**
+                 * Socket error queueing may exhaust the OPTMEM limit. Try
+                 * flushing the error queue once.
+                 */
+                if (!zerocopy_flushed_once) {
+                    ret = qio_channel_socket_flush_internal(ioc, blocking,
+                                                            errp);
+                    if (ret < 0) {
+                        return -1;
+                    }
+                    zerocopy_flushed_once = true;
+                    goto retry;
+                } else {
+                    error_setg_errno(errp, errno,
+                                     "Process can't lock enough memory for "
+                                     "using MSG_ZEROCOPY");
+                    return -1;
+                }
             }
             break;
+#endif
         }
 
         error_setg_errno(errp, errno,
@@ -777,8 +805,9 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
 
 
 #ifdef QEMU_MSG_ZEROCOPY
-static int qio_channel_socket_flush(QIOChannel *ioc,
-                                    Error **errp)
+static int qio_channel_socket_flush_internal(QIOChannel *ioc,
+                                             bool block,
+                                             Error **errp)
 {
     QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
     struct msghdr msg = {};
@@ -786,7 +815,6 @@ static int qio_channel_socket_flush(QIOChannel *ioc,
     struct cmsghdr *cm;
     char control[CMSG_SPACE(sizeof(*serr))];
     int received;
-    int ret;
 
     if (sioc->zero_copy_queued == sioc->zero_copy_sent) {
         return 0;
@@ -796,16 +824,25 @@ static int qio_channel_socket_flush(QIOChannel *ioc,
     msg.msg_controllen = sizeof(control);
     memset(control, 0, sizeof(control));
 
-    ret = 1;
-
     while (sioc->zero_copy_sent < sioc->zero_copy_queued) {
         received = recvmsg(sioc->fd, &msg, MSG_ERRQUEUE);
         if (received < 0) {
             switch (errno) {
             case EAGAIN:
-                /* Nothing on errqueue, wait until something is available */
-                qio_channel_wait(ioc, G_IO_ERR);
-                continue;
+                if (block) {
+                    /*
+                     * Nothing on errqueue, wait until something is
+                     * available.
+                     *
+                     * Use G_IO_ERR instead of G_IO_IN since MSG_ERRQUEUE reads
+                     * are signaled via POLLERR, not POLLIN, as the kernel
+                     * sets POLLERR when zero-copy notifications appear on the
+                     * socket error queue.
+                     */
+                    qio_channel_wait(ioc, G_IO_ERR);
+                    continue;
+                }
+                return 0;
             case EINTR:
                 continue;
             default:
@@ -843,13 +880,32 @@ static int qio_channel_socket_flush(QIOChannel *ioc,
         /* No errors, count successfully finished sendmsg()*/
         sioc->zero_copy_sent += serr->ee_data - serr->ee_info + 1;
 
-        /* If any sendmsg() succeeded using zero copy, return 0 at the end */
+        /* If any sendmsg() succeeded using zero copy, mark zerocopy success */
         if (serr->ee_code != SO_EE_CODE_ZEROCOPY_COPIED) {
-            ret = 0;
+            sioc->new_zero_copy_sent_success = true;
         }
     }
 
-    return ret;
+    return 0;
+}
+
+static int qio_channel_socket_flush(QIOChannel *ioc,
+                                    Error **errp)
+{
+    QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
+    int ret;
+
+    ret = qio_channel_socket_flush_internal(ioc, true, errp);
+    if (ret < 0) {
+        return ret;
+    }
+
+    if (sioc->new_zero_copy_sent_success) {
+        sioc->new_zero_copy_sent_success = false;
+        return 0;
+    }
+
+    return 1;
 }
 
 #endif /* QEMU_MSG_ZEROCOPY */
-- 
2.51.1
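
For readers who want to see the control flow in isolation, the
flush-once-and-retry idea from the commit message can be sketched
without any sockets. This is only an illustration under stated
assumptions, not QEMU code: struct toy_sock, toy_send(), toy_flush(),
toy_send_retry_once() and completed_in_range() are hypothetical names,
"limit" stands in for the OPTMEM budget, and "readable" models how many
notifications happen to be on the error queue when it is flushed. The
range helper mirrors the patch's ee_data - ee_info + 1 accounting.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a zerocopy socket; not a QEMU type. */
struct toy_sock {
    size_t pending;   /* completion notifications not yet read */
    size_t limit;     /* stand-in for the OPTMEM budget */
    size_t flushes;   /* how many times the error queue was drained */
};

/* Toy sendmsg(): fails with ENOBUFS once the notification budget is used. */
static int toy_send(struct toy_sock *s)
{
    if (s->pending >= s->limit) {
        errno = ENOBUFS;
        return -1;
    }
    s->pending++;
    return 0;
}

/* Toy flush: reads up to 'readable' notifications off the error queue. */
static void toy_flush(struct toy_sock *s, size_t readable)
{
    size_t n = readable < s->pending ? readable : s->pending;

    s->pending -= n;
    s->flushes++;
}

/*
 * Send with at most one flush-and-retry on ENOBUFS, in the spirit of the
 * zerocopy_flushed_once flag in the patch.
 */
static int toy_send_retry_once(struct toy_sock *s, size_t readable)
{
    bool flushed_once = false;

    for (;;) {
        if (toy_send(s) == 0) {
            return 0;
        }
        if (errno != ENOBUFS || flushed_once) {
            return -1;  /* other error, or ENOBUFS again after a flush */
        }
        toy_flush(s, readable);
        flushed_once = true;
    }
}

/* Number of sends covered by one inclusive notification range [lo, hi]. */
static unsigned completed_in_range(unsigned lo, unsigned hi)
{
    assert(hi >= lo);
    return hi - lo + 1;
}
```

In this toy model, a socket that is at its budget recovers after a single
flush if notifications are readable, and fails with ENOBUFS on the second
attempt if the flush could not free anything, which is the case where the
patch still reports the "can't lock enough memory" error.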