From: Cindy Lu
To: lulu@redhat.com, mst@redhat.com, jasowang@redhat.com, kvm@vger.kernel.org, virtualization@lists.linux.dev, netdev@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [RFC 3/3] vhost/net: add RX netfilter offload path
Date: Sun, 8 Feb 2026 22:32:24 +0800
Message-ID: <20260208143441.2177372-4-lulu@redhat.com>
In-Reply-To: <20260208143441.2177372-1-lulu@redhat.com>
References: <20260208143441.2177372-1-lulu@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="utf-8"

Route RX packets through the netfilter socket when configured.
Key points:
- Add VHOST_NET_FILTER_MAX_LEN upper bound for filter payload size
- Introduce vhost_net_filter_request() to send REQUEST to userspace
- Add handle_rx_filter() fast path for RX when filter is active
- Hook filter path in handle_rx() when filter_sock is set

Signed-off-by: Cindy Lu
---
 drivers/vhost/net.c | 229 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 229 insertions(+)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f02deff0e53c..aa9a5ed43eae 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -161,6 +161,13 @@ struct vhost_net {
 
 static unsigned vhost_net_zcopy_mask __read_mostly;
 
+/*
+ * Upper bound for a single packet payload on the filter path.
+ * Keep this large enough for the largest expected frame plus vnet headers,
+ * but still bounded to avoid unbounded allocations.
+ */
+#define VHOST_NET_FILTER_MAX_LEN (4096 + 65536)
+
 static void *vhost_net_buf_get_ptr(struct vhost_net_buf *rxq)
 {
 	if (rxq->tail != rxq->head)
@@ -1227,6 +1234,222 @@ static long vhost_net_set_filter(struct vhost_net *n, int fd)
 	return r;
 }
 
+/*
+ * Send a filter REQUEST message to userspace for a single packet.
+ *
+ * The caller provides a writable buffer; userspace may inspect the content and
+ * optionally modify it in place. We only accept the packet if the returned
+ * length matches the original length, otherwise the packet is dropped.
+ */
+static int vhost_net_filter_request(struct vhost_net *n, u16 direction,
+				    void *buf, u32 *len)
+{
+	struct vhost_net_filter_msg msg = {
+		.type = VHOST_NET_FILTER_MSG_REQUEST,
+		.direction = direction,
+		.len = *len,
+	};
+	struct msghdr msghdr = { 0 };
+	struct kvec iov[2] = {
+		{ .iov_base = &msg, .iov_len = sizeof(msg) },
+		{ .iov_base = buf, .iov_len = *len },
+	};
+	struct socket *sock;
+	struct file *sock_file = NULL;
+	int ret;
+
+	/*
+	 * Take a temporary file reference to guard against concurrent
+	 * filter socket replacement while we send the message.
+	 */
+	spin_lock(&n->filter_lock);
+	sock = n->filter_sock;
+	if (sock)
+		sock_file = get_file(sock->file);
+	spin_unlock(&n->filter_lock);
+
+	if (!sock) {
+		ret = -ENOTCONN;
+		goto out_put;
+	}
+
+	ret = kernel_sendmsg(sock, &msghdr, iov,
+			     *len ? 2 : 1, sizeof(msg) + *len);
+
+out_put:
+	if (sock_file)
+		fput(sock_file);
+
+	if (ret < 0)
+		return ret;
+	return 0;
+}
+
+/*
+ * RX fast path when filter offload is active.
+ *
+ * This mirrors handle_rx() but routes each RX packet through userspace
+ * netfilter. Packets are copied into a temporary buffer, sent to the filter
+ * socket as a REQUEST, and only delivered to the guest if userspace keeps the
+ * length unchanged. Any truncation or mismatch drops the packet.
+ */
+static void handle_rx_filter(struct vhost_net *net, struct socket *sock)
+{
+	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_RX];
+	struct vhost_virtqueue *vq = &nvq->vq;
+	bool in_order = vhost_has_feature(vq, VIRTIO_F_IN_ORDER);
+	unsigned int count = 0;
+	unsigned int in, log;
+	struct vhost_log *vq_log;
+	struct virtio_net_hdr hdr = {
+		.flags = 0,
+		.gso_type = VIRTIO_NET_HDR_GSO_NONE
+	};
+	struct msghdr msg = {
+		.msg_name = NULL,
+		.msg_namelen = 0,
+		.msg_control = NULL,
+		.msg_controllen = 0,
+		.msg_flags = MSG_DONTWAIT,
+	};
+	size_t total_len = 0;
+	int mergeable;
+	bool set_num_buffers;
+	size_t vhost_hlen, sock_hlen;
+	size_t vhost_len, sock_len;
+	bool busyloop_intr = false;
+	struct iov_iter fixup;
+	__virtio16 num_buffers;
+	int recv_pkts = 0;
+	unsigned int ndesc;
+	void *pkt;
+
+	pkt = kvmalloc(VHOST_NET_FILTER_MAX_LEN, GFP_KERNEL | __GFP_NOWARN);
+	if (!pkt) {
+		vhost_net_enable_vq(net, vq);
+		return;
+	}
+
+	vhost_hlen = nvq->vhost_hlen;
+	sock_hlen = nvq->sock_hlen;
+
+	vq_log = unlikely(vhost_has_feature(vq, VHOST_F_LOG_ALL)) ? vq->log : NULL;
+	mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);
+	set_num_buffers = mergeable || vhost_has_feature(vq, VIRTIO_F_VERSION_1);
+
+	do {
+		u32 pkt_len;
+		int err;
+		s16 headcount;
+		struct kvec iov;
+
+		sock_len = vhost_net_rx_peek_head_len(net, sock->sk,
+						      &busyloop_intr, &count);
+		if (!sock_len)
+			break;
+		sock_len += sock_hlen;
+		if (sock_len > VHOST_NET_FILTER_MAX_LEN) {
+			/* Consume and drop oversized packet. */
+			iov.iov_base = pkt;
+			iov.iov_len = 1;
+			kernel_recvmsg(sock, &msg, &iov, 1, 1,
+				       MSG_DONTWAIT | MSG_TRUNC);
+			continue;
+		}
+
+		vhost_len = sock_len + vhost_hlen;
+		headcount = get_rx_bufs(nvq, vq->heads + count,
+					vq->nheads + count, vhost_len, &in,
+					vq_log, &log,
+					likely(mergeable) ? UIO_MAXIOV : 1,
+					&ndesc);
+		if (unlikely(headcount < 0))
+			goto out;
+
+		if (!headcount) {
+			if (unlikely(busyloop_intr)) {
+				vhost_poll_queue(&vq->poll);
+			} else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
+				vhost_disable_notify(&net->dev, vq);
+				continue;
+			}
+			goto out;
+		}
+
+		busyloop_intr = false;
+
+		if (nvq->rx_ring)
+			msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
+
+		iov.iov_base = pkt;
+		iov.iov_len = sock_len;
+		err = kernel_recvmsg(sock, &msg, &iov, 1, sock_len,
+				     MSG_DONTWAIT | MSG_TRUNC);
+		if (unlikely(err != sock_len)) {
+			vhost_discard_vq_desc(vq, headcount, ndesc);
+			continue;
+		}
+
+		pkt_len = sock_len;
+		err = vhost_net_filter_request(net, VHOST_NET_FILTER_DIRECTION_TX,
+					       pkt, &pkt_len);
+		if (err < 0)
+			pkt_len = sock_len;
+		if (pkt_len != sock_len) {
+			vhost_discard_vq_desc(vq, headcount, ndesc);
+			continue;
+		}
+
+		iov_iter_init(&msg.msg_iter, ITER_DEST, vq->iov, in, vhost_len);
+		fixup = msg.msg_iter;
+		if (unlikely(vhost_hlen))
+			iov_iter_advance(&msg.msg_iter, vhost_hlen);
+
+		if (copy_to_iter(pkt, sock_len, &msg.msg_iter) != sock_len) {
+			vhost_discard_vq_desc(vq, headcount, ndesc);
+			goto out;
+		}
+
+		if (unlikely(vhost_hlen)) {
+			if (copy_to_iter(&hdr, sizeof(hdr),
+					 &fixup) != sizeof(hdr)) {
+				vhost_discard_vq_desc(vq, headcount, ndesc);
+				goto out;
+			}
+		} else {
+			iov_iter_advance(&fixup, sizeof(hdr));
+		}
+
+		num_buffers = cpu_to_vhost16(vq, headcount);
+		if (likely(set_num_buffers) &&
+		    copy_to_iter(&num_buffers, sizeof(num_buffers), &fixup) !=
+		    sizeof(num_buffers)) {
+			vhost_discard_vq_desc(vq, headcount, ndesc);
+			goto out;
+		}
+
+		nvq->done_idx += headcount;
+		count += in_order ? 1 : headcount;
+		if (nvq->done_idx > VHOST_NET_BATCH) {
+			vhost_net_signal_used(nvq, count);
+			count = 0;
+		}
+
+		if (unlikely(vq_log))
+			vhost_log_write(vq, vq_log, log, vhost_len, vq->iov, in);
+
+		total_len += vhost_len;
+	} while (likely(!vhost_exceeds_weight(vq, ++recv_pkts, total_len)));
+
+	if (unlikely(busyloop_intr))
+		vhost_poll_queue(&vq->poll);
+	else if (!sock_len)
+		vhost_net_enable_vq(net, vq);
+
+out:
+	vhost_net_signal_used(nvq, count);
+	kvfree(pkt);
+}
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_rx(struct vhost_net *net)
@@ -1281,6 +1504,11 @@ static void handle_rx(struct vhost_net *net)
 	set_num_buffers = mergeable ||
 			  vhost_has_feature(vq, VIRTIO_F_VERSION_1);
 
+	if (READ_ONCE(net->filter_sock)) {
+		handle_rx_filter(net, sock);
+		goto out_unlock;
+	}
+
 	do {
 		sock_len = vhost_net_rx_peek_head_len(net, sock->sk,
 						      &busyloop_intr, &count);
@@ -1383,6 +1611,7 @@ static void handle_rx(struct vhost_net *net)
 		vhost_net_enable_vq(net, vq);
 out:
 	vhost_net_signal_used(nvq, count);
+out_unlock:
 	mutex_unlock(&vq->mutex);
 }
 
-- 
2.52.0
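
Illustration only, not part of the patch: the REQUEST framing and the
length rules described in the comments above can be sketched in plain
userspace C. The layout of struct vhost_net_filter_msg is not visible in
this patch, so struct filter_msg below is a hypothetical stand-in; its
field widths and order, the FILTER_MSG_REQUEST value, and the helper
names filter_frame(), filter_len_ok() and filter_accept() are all
assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Same bound as VHOST_NET_FILTER_MAX_LEN in the patch. */
#define FILTER_MAX_LEN (4096 + 65536)

/*
 * Hypothetical stand-in for struct vhost_net_filter_msg. The patch only
 * shows that it has type, direction and len members; the widths, order
 * and numeric values used here are assumptions for illustration.
 */
struct filter_msg {
	uint16_t type;
	uint16_t direction;
	uint32_t len;
};

enum { FILTER_MSG_REQUEST = 1 };	/* assumed value */

/*
 * Fill iov[] the way vhost_net_filter_request() fills its kvec array:
 * header first, payload second, and only the header when len is zero.
 * Returns the number of iovec entries to send; *total gets the byte count.
 */
static int filter_frame(struct filter_msg *msg, uint16_t direction,
			void *buf, uint32_t len,
			struct iovec iov[2], size_t *total)
{
	msg->type = FILTER_MSG_REQUEST;
	msg->direction = direction;
	msg->len = len;

	iov[0].iov_base = msg;
	iov[0].iov_len = sizeof(*msg);
	iov[1].iov_base = buf;
	iov[1].iov_len = len;

	*total = sizeof(*msg) + len;
	return len ? 2 : 1;
}

/* Packets above the bound are consumed and dropped before filtering. */
static int filter_len_ok(size_t sock_len)
{
	return sock_len <= FILTER_MAX_LEN;
}

/* A packet is delivered to the guest only if its length is unchanged. */
static int filter_accept(uint32_t orig_len, uint32_t returned_len)
{
	return orig_len == returned_len;
}
```

A caller would pass the iov array and the returned entry count to
sendmsg() or writev() on the filter socket, which corresponds to the
kernel_sendmsg() call in vhost_net_filter_request().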