[PATCH v3 0/5] nvdimm: virtio_pmem: fix request lifetime and converge broken queue failures

Li Chen posted 5 patches 3 months, 2 weeks ago
There is a newer version of this series
Posted by Li Chen 3 months, 2 weeks ago
Hi,

The virtio-pmem flush path uses a virtqueue cookie/token to carry a
per-request context through completion. Under broken virtqueue / notify
failure conditions, the submitter can return and free the request object
while the host/backend may still complete the published request. The IRQ
completion handler then dereferences freed memory when waking waiters,
which is reported by KASAN as a slab-use-after-free and may manifest as
lock corruption (e.g. "BUG: spinlock already unlocked") without KASAN.

In addition, the flush path has two wait sites: one for virtqueue
descriptor availability (-ENOSPC from virtqueue_add_sgs()) and one for
request completion. If the virtqueue becomes broken, forward progress is
no longer guaranteed and these waiters may sleep indefinitely unless the
driver converges the failure and wakes all wait sites.
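
As an illustration of what "converging" a broken queue means here, the
following is a minimal single-threaded userspace sketch, not the driver
code. The names struct dev, struct req, dev_submit() and dev_mark_broken()
are hypothetical, and a real driver would hold a lock and call wake_up()
on its wait queues where the comments say so.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 4

/* Hypothetical request: "done" is what a completion waiter would poll. */
struct req {
	bool done;
	int status;
};

/* Hypothetical device state with a device-level broken flag. */
struct dev {
	bool broken;
	struct req *pending[MAX_PENDING];
	size_t npending;
};

/* Fail fast once broken; -ENOSPC means the caller would wait for a slot. */
static int dev_submit(struct dev *d, struct req *r)
{
	if (d->broken)
		return -EIO;
	if (d->npending == MAX_PENDING)
		return -ENOSPC;
	r->done = false;
	r->status = 0;
	d->pending[d->npending++] = r;
	return 0;
}

/* Converge: set the flag, then complete every outstanding request. */
static void dev_mark_broken(struct dev *d)
{
	d->broken = true;
	while (d->npending > 0) {
		struct req *r = d->pending[--d->npending];

		r->status = -EIO;
		r->done = true;	/* a real driver would wake the waiter here */
	}
}

/* Returns 0 if outstanding and new requests all end in -EIO. */
static int demo_converge(void)
{
	struct dev d = {0};
	struct req a, b, c;

	if (dev_submit(&d, &a) || dev_submit(&d, &b))
		return 1;
	dev_mark_broken(&d);
	if (!a.done || a.status != -EIO)
		return 2;
	if (!b.done || b.status != -EIO)
		return 3;
	if (dev_submit(&d, &c) != -EIO)
		return 4;
	return 0;
}
```

The point of the sketch is that no request is left without a final
status: everything already queued is completed with -EIO, and later
submissions never reach a wait site.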

This series addresses both issues:

1/5 nvdimm: virtio_pmem: always wake -ENOSPC waiters
Wake one -ENOSPC waiter for each reclaimed used buffer, decoupled from
token completion.

2/5 nvdimm: virtio_pmem: use READ_ONCE()/WRITE_ONCE() for wait flags
Use READ_ONCE()/WRITE_ONCE() for the wait_event() flags (done and
wq_buf_avail).

3/5 nvdimm: virtio_pmem: refcount requests for token lifetime
Refcount request objects so the token lifetime spans the window where it
is reachable through the virtqueue until completion/drain drops the
virtqueue reference.

4/5 nvdimm: virtio_pmem: converge broken virtqueue to -EIO
Track a device-level broken state to converge broken/notify failures to
-EIO: wake all waiters and drain/detach outstanding requests to complete
them with an error, and fail-fast new requests.

5/5 nvdimm: virtio_pmem: drain requests in freeze
Drain outstanding requests in freeze() before tearing down virtqueues so
waiters do not sleep indefinitely.
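
The lifetime rule in 3/5 can be illustrated with a small userspace sketch.
It is only an approximation under stated assumptions: struct flush_req,
req_get(), req_put() and req_complete() are hypothetical names, C11
atomics stand in for the kernel's refcounting helpers, and live_reqs is a
demo-only counter that is not thread-safe.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a flush request; not the driver's struct. */
struct flush_req {
	atomic_int refs;	/* submitter ref plus queue ref */
	atomic_bool done;	/* set by the completion side */
	int status;		/* 0 or a negative errno value */
};

static int live_reqs;	/* demo-only: requests not yet freed */

static struct flush_req *req_alloc(void)
{
	struct flush_req *req = calloc(1, sizeof(*req));

	if (!req)
		return NULL;
	atomic_init(&req->refs, 1);	/* reference held by the submitter */
	atomic_init(&req->done, false);
	live_reqs++;
	return req;
}

static void req_get(struct flush_req *req)
{
	atomic_fetch_add(&req->refs, 1);
}

/* Free the request only when the last reference is dropped. */
static void req_put(struct flush_req *req)
{
	int old = atomic_fetch_sub(&req->refs, 1);

	assert(old > 0);
	if (old == 1) {
		live_reqs--;
		free(req);
	}
}

/* Completion side: record the result, then drop the queue's reference. */
static void req_complete(struct flush_req *req, int status)
{
	req->status = status;
	atomic_store(&req->done, true);
	req_put(req);
}

/* Submitter bails out early; a late completion must still be safe. */
static int demo_late_completion(void)
{
	struct flush_req *req = req_alloc();
	int alive_before;

	if (!req)
		return -1;
	req_get(req);		/* queue holds a ref while the token is published */
	req_put(req);		/* submitter returns early, e.g. on notify failure */
	alive_before = live_reqs;	/* still 1: the queue ref keeps it alive */
	req_complete(req, -EIO);	/* late completion touches valid memory */
	return (alive_before == 1 && live_reqs == 0) ? 0 : 1;
}
```

Without the queue's reference, the early req_put() would free the object
and the late completion would touch freed memory, which is the
use-after-free described above.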

Testing was done on QEMU x86_64 with a virtio-pmem device exported as
/dev/pmem0, formatted with ext4 (-O fast_commit), mounted with DAX, and
stressed with fsync-heavy workloads.

Thanks,
Li Chen

Changelog:
v2->v3:
- Split patch 1 as suggested by Pankaj Gupta: keep the waiter wakeup
  ordering change in 1/5 and move READ_ONCE()/WRITE_ONCE() updates to
  2/5 (no functional change intended).
- Add the log report to the commit message.
- Fold the export fix into 4/5 to keep the series bisectable when
  CONFIG_VIRTIO_PMEM=m.
v1->v2: add the export patch to fix a compile issue.

Links:
v2: https://lore.kernel.org/all/20251225042915.334117-1-me@linux.beauty/

Li Chen (5):
  nvdimm: virtio_pmem: always wake -ENOSPC waiters
  nvdimm: virtio_pmem: use READ_ONCE()/WRITE_ONCE() for wait flags
  nvdimm: virtio_pmem: refcount requests for token lifetime
  nvdimm: virtio_pmem: converge broken virtqueue to -EIO
  nvdimm: virtio_pmem: drain requests in freeze

 drivers/nvdimm/nd_virtio.c   | 137 +++++++++++++++++++++++++++++------
 drivers/nvdimm/virtio_pmem.c |  14 ++++
 drivers/nvdimm/virtio_pmem.h |   6 ++
 3 files changed, 136 insertions(+), 21 deletions(-)

-- 
2.52.0
Re: [PATCH v3 0/5] nvdimm: virtio_pmem: fix request lifetime and converge broken queue failures
Posted by Alison Schofield 1 week, 2 days ago
On Thu, Feb 26, 2026 at 10:57:05AM +0800, Li Chen wrote:
> [...]

Hi Li Chen,

Today I took a look at this set, noting that it's been sitting idle
in our nvdimm backlog for a while. I'm not able to apply it. Can you
post a new rev that applies to 7.1-rc6?

Thanks,
Alison
Re: [PATCH v3 0/5] nvdimm: virtio_pmem: fix request lifetime and converge broken queue failures
Posted by Li Chen 1 day, 16 hours ago
Hi Alison,

 ---- On Tue, 02 Jun 2026 09:51:26 +0800  Alison Schofield <alison.schofield@intel.com> wrote --- 
 > On Thu, Feb 26, 2026 at 10:57:05AM +0800, Li Chen wrote:
 > > [...]
 > 
 > Hi Li Chen,
 > 
 > Today I took a look at this set, noting that it's been sitting idle 
 > in our nvdimm backlog for a while. I'm not able to apply it. Can you
 > post a new rev that applies to 7.1-rc6 ?
 > 
 > Thanks,
 > Alison

Sorry for the late reply. I have just sent v4
(https://lore.kernel.org/all/20260609120726.1714780-1-me@linux.beauty/),
which applies to 7.1-rc7. Thanks for your comment.

Regards,
Li