Fix a NULL pointer dereference that occurs in nvme_pci_prp_iter_next()
when SWIOTLB bounce buffering becomes active during runtime.
The issue occurs when SWIOTLB activation changes the device's DMA
mapping requirements at runtime, creating a mismatch between
iod->dma_vecs allocation and access logic.
The problem manifests when:
1. Device initially operates with dma_skip_sync=true
(coherent DMA assumed)
2. First SWIOTLB mapping occurs due to DMA address limitations,
memory encryption, or IOMMU bounce buffering requirements
3. SWIOTLB calls dma_reset_need_sync(), permanently setting
dma_skip_sync=false
4. Subsequent I/Os now have dma_need_unmap()=true, requiring
iod->dma_vecs
The issue arises from the timing of allocation versus access:
- nvme_pci_setup_data_prp() allocates iod->dma_vecs only when both
(!dma_use_iova() && dma_need_unmap()) conditions are met
- nvme_pci_prp_iter_next() assumes iod->dma_vecs is valid whenever
the same conditions are true, without NULL checking
- This creates a race where the device's DMA requirements change
dynamically after the initial allocation decision, leading to NULL
pointer access
Unable to handle kernel NULL pointer dereference at
virtual address 0000000000000000
pc : nvme_pci_prp_iter_next+0xe4/0x128 [nvme]
Call trace:
nvme_pci_prp_iter_next+0xe4/0x128 [nvme]
nvme_prep_rq+0x5f4/0xa6c [nvme]
nvme_queue_rqs+0xa8/0x18c [nvme]
blk_mq_dispatch_queue_requests.constprop.0+0x108/0x120
blk_mq_flush_plug_list+0x8c/0x174
__blk_flush_plug+0xe4/0x140
blk_finish_plug+0x38/0x4c
read_pages+0x184/0x288
page_cache_ra_order+0x1e0/0x3a4
filemap_fault+0x518/0xa90
__do_fault+0x3c/0x22c
__handle_mm_fault+0x10ec/0x19b8
handle_mm_fault+0xb4/0x294
Fix this by:
1. Initializing iod->dma_vecs to NULL in nvme_prep_rq()
2. Adding a NULL pointer check before accessing iod->dma_vecs in
nvme_pci_prp_iter_next()
3. Setting iod->dma_vecs to NULL after freeing, for defensive programming
Fixes: b8b7570a7ec8 ("nvme-pci: fix dma unmapping when using PRPs and not using the IOVA mapping")
Co-developed-by: Nitin Rawat <nitin.rawat@oss.qualcomm.com>
Signed-off-by: Nitin Rawat <nitin.rawat@oss.qualcomm.com>
Signed-off-by: Pradeep P V K <pradeep.pragallapati@oss.qualcomm.com>
---
drivers/nvme/host/pci.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 2a52cf46d960..e235654e7ee0 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -720,6 +720,7 @@ static void nvme_free_prps(struct request *req, unsigned int attrs)
dma_unmap_phys(nvmeq->dev->dev, iod->dma_vecs[i].addr,
iod->dma_vecs[i].len, rq_dma_dir(req), attrs);
mempool_free(iod->dma_vecs, nvmeq->dev->dmavec_mempool);
+ iod->dma_vecs = NULL;
}
static void nvme_free_sgls(struct request *req, struct nvme_sgl_desc *sge,
@@ -825,7 +826,7 @@ static bool nvme_pci_prp_iter_next(struct request *req, struct device *dma_dev,
return true;
if (!blk_rq_dma_map_iter_next(req, dma_dev, iter))
return false;
- if (!dma_use_iova(&iod->dma_state) && dma_need_unmap(dma_dev)) {
+ if (iod->dma_vecs && !dma_use_iova(&iod->dma_state) && dma_need_unmap(dma_dev)) {
iod->dma_vecs[iod->nr_dma_vecs].addr = iter->addr;
iod->dma_vecs[iod->nr_dma_vecs].len = iter->len;
iod->nr_dma_vecs++;
@@ -1218,6 +1219,7 @@ static blk_status_t nvme_prep_rq(struct request *req)
iod->nr_descriptors = 0;
iod->total_len = 0;
iod->meta_total_len = 0;
+ iod->dma_vecs = NULL;
ret = nvme_setup_cmd(req->q->queuedata, req);
if (ret)
--
2.34.1
On Mon, Feb 02, 2026 at 06:27:38PM +0530, Pradeep P V K wrote:
> @@ -720,6 +720,7 @@ static void nvme_free_prps(struct request *req, unsigned int attrs)
> dma_unmap_phys(nvmeq->dev->dev, iod->dma_vecs[i].addr,
> iod->dma_vecs[i].len, rq_dma_dir(req), attrs);
> mempool_free(iod->dma_vecs, nvmeq->dev->dmavec_mempool);
> + iod->dma_vecs = NULL;
> }
>
> static void nvme_free_sgls(struct request *req, struct nvme_sgl_desc *sge,
> @@ -825,7 +826,7 @@ static bool nvme_pci_prp_iter_next(struct request *req, struct device *dma_dev,
> return true;
> if (!blk_rq_dma_map_iter_next(req, dma_dev, iter))
> return false;
> - if (!dma_use_iova(&iod->dma_state) && dma_need_unmap(dma_dev)) {
> + if (iod->dma_vecs && !dma_use_iova(&iod->dma_state) && dma_need_unmap(dma_dev)) {
So the return of dma_need_unmap() may change after any call to
dma_map_*? Does it only go from false -> true, and never back to false?
Since we didn't allocate the dma_vecs here, doesn't that mean the
completion side is leaking the mapping?
On Mon, Feb 02, 2026 at 06:27:38PM +0530, Pradeep P V K wrote:
> Fix a NULL pointer dereference that occurs in nvme_pci_prp_iter_next()
> when SWIOTLB bounce buffering becomes active during runtime.
>
> The issue occurs when SWIOTLB activation changes the device's DMA
> mapping requirements at runtime,
>
> creating a mismatch between
> iod->dma_vecs allocation and access logic.
>
> The problem manifests when:
> 1. Device initially operates with dma_skip_sync=true
> (coherent DMA assumed)
> 2. First SWIOTLB mapping occurs due to DMA address limitations,
> memory encryption, or IOMMU bounce buffering requirements
> 3. SWIOTLB calls dma_reset_need_sync(), permanently setting
> dma_skip_sync=false
> 4. Subsequent I/Os now have dma_need_unmap()=true, requiring
> iod->dma_vecs

I think this patch just papers over the bug.  If dma_need_unmap
can't be trusted before the dma_map_* call, we've not saved
the unmap information and the unmap won't work properly.

So we'll need to extend the core code to tell if a mapping
will set dma_skip_sync=false before doing the mapping.
On Mon, Feb 02, 2026 at 03:35:48PM +0100, Christoph Hellwig wrote:
> On Mon, Feb 02, 2026 at 06:27:38PM +0530, Pradeep P V K wrote:
> > Fix a NULL pointer dereference that occurs in nvme_pci_prp_iter_next()
> > when SWIOTLB bounce buffering becomes active during runtime.
> >
> > The issue occurs when SWIOTLB activation changes the device's DMA
> > mapping requirements at runtime,
> >
> > creating a mismatch between
> > iod->dma_vecs allocation and access logic.
> >
> > The problem manifests when:
> > 1. Device initially operates with dma_skip_sync=true
> > (coherent DMA assumed)
> > 2. First SWIOTLB mapping occurs due to DMA address limitations,
> > memory encryption, or IOMMU bounce buffering requirements
> > 3. SWIOTLB calls dma_reset_need_sync(), permanently setting
> > dma_skip_sync=false
> > 4. Subsequent I/Os now have dma_need_unmap()=true, requiring
> > iod->dma_vecs
>
> I think this patch just papers over the bug.
Agree
> If dma_need_unmap can't be trusted before the dma_map_* call, we've not saved
> the unmap information and the unmap won't work properly.
>
> So we'll need to extend the core code to tell if a mapping
> will set dma_skip_sync=false before doing the mapping.
There are two paths that lead to SWIOTLB in dma_direct_map_phys().
The first is is_swiotlb_force_bounce(dev), which dma_need_unmap() can
easily evaluate. The second is more problematic, as it depends on
dma_addr and size, neither of which is available in dma_need_unmap():
102 if (unlikely(!dma_capable(dev, dma_addr, size, true)) ||
103 dma_kmalloc_needs_bounce(dev, size, dir)) {
104 if (is_swiotlb_active(dev))
What about the following change?
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 37163eb49f9f..1510b93a8791 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -461,6 +461,8 @@ bool dma_need_unmap(struct device *dev)
{
if (!dma_map_direct(dev, get_dma_ops(dev)))
return true;
+ if (is_swiotlb_force_bounce(dev) || is_swiotlb_active(dev))
+ return true;
if (!dev->dma_skip_sync)
return true;
return IS_ENABLED(CONFIG_DMA_API_DEBUG);
On 2026-02-02 3:22 pm, Leon Romanovsky wrote:
> On Mon, Feb 02, 2026 at 03:35:48PM +0100, Christoph Hellwig wrote:
>> On Mon, Feb 02, 2026 at 06:27:38PM +0530, Pradeep P V K wrote:
>>> Fix a NULL pointer dereference that occurs in nvme_pci_prp_iter_next()
>>> when SWIOTLB bounce buffering becomes active during runtime.
>>>
>>> The issue occurs when SWIOTLB activation changes the device's DMA
>>> mapping requirements at runtime,
>>>
>>> creating a mismatch between
>>> iod->dma_vecs allocation and access logic.
>>>
>>> The problem manifests when:
>>> 1. Device initially operates with dma_skip_sync=true
>>> (coherent DMA assumed)
>>> 2. First SWIOTLB mapping occurs due to DMA address limitations,
>>> memory encryption, or IOMMU bounce buffering requirements
>>> 3. SWIOTLB calls dma_reset_need_sync(), permanently setting
>>> dma_skip_sync=false
>>> 4. Subsequent I/Os now have dma_need_unmap()=true, requiring
>>> iod->dma_vecs
>>
>> I think this patch just papers over the bug.
>
> Agree
>
>> If dma_need_unmap can't be trusted before the dma_map_* call, we've not saved
>> the unmap information and the unmap won't work properly.
>>
>> So we'll need to extend the core code to tell if a mapping
>> will set dma_skip_sync=false before doing the mapping.
>
> There are two paths that lead to SWIOTLB in dma_direct_map_phys().
> The first is is_swiotlb_force_bounce(dev), which dma_need_unmap() can
> easily evaluate. The second is more problematic, as it depends on
> dma_addr and size, neither of which is available in dma_need_unmap():
>
> 102 if (unlikely(!dma_capable(dev, dma_addr, size, true)) ||
> 103 dma_kmalloc_needs_bounce(dev, size, dir)) {
> 104 if (is_swiotlb_active(dev))
>
> What about the following change?
>
> diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
> index 37163eb49f9f..1510b93a8791 100644
> --- a/kernel/dma/mapping.c
> +++ b/kernel/dma/mapping.c
> @@ -461,6 +461,8 @@ bool dma_need_unmap(struct device *dev)
> {
> if (!dma_map_direct(dev, get_dma_ops(dev)))
> return true;
> + if (is_swiotlb_force_bounce(dev) || is_swiotlb_active(dev))
> + return true;
This will always be true if a default SWIOTLB buffer exists at all, and
thus pretty much defeat the point of whatever optimisation the caller is
trying to make.
Thanks,
Robin.
> if (!dev->dma_skip_sync)
> return true;
> return IS_ENABLED(CONFIG_DMA_API_DEBUG);
>
On 2026-02-02 2:35 pm, Christoph Hellwig wrote:
> On Mon, Feb 02, 2026 at 06:27:38PM +0530, Pradeep P V K wrote:
>> Fix a NULL pointer dereference that occurs in nvme_pci_prp_iter_next()
>> when SWIOTLB bounce buffering becomes active during runtime.
>>
>> The issue occurs when SWIOTLB activation changes the device's DMA
>> mapping requirements at runtime,
>>
>> creating a mismatch between
>> iod->dma_vecs allocation and access logic.
>>
>> The problem manifests when:
>> 1. Device initially operates with dma_skip_sync=true
>> (coherent DMA assumed)
>> 2. First SWIOTLB mapping occurs due to DMA address limitations,
>> memory encryption, or IOMMU bounce buffering requirements
>> 3. SWIOTLB calls dma_reset_need_sync(), permanently setting
>> dma_skip_sync=false
>> 4. Subsequent I/Os now have dma_need_unmap()=true, requiring
>> iod->dma_vecs
>
> I think this patch just papers over the bug.  If dma_need_unmap
> can't be trusted before the dma_map_* call, we've not saved
> the unmap information and the unmap won't work properly.

The dma_need_unmap() kerneldoc says:

"This function must be called after all mappings that might
need to be unmapped have been performed."

Trying to infer anything from it beforehand is definitely a bug in the
caller.

> So we'll need to extend the core code to tell if a mapping
> will set dma_skip_sync=false before doing the mapping.

I don't see that being possible - at best we could reasonably infer that
a fully-coherent system with no sync ops, no SWIOTLB and no DMA_DEBUG
shouldn't ever set it to true, but as for the other way round, by the
time you've run through all the SWIOTLB logic to guess whether a
particular mapping would be bounced or not, you've basically performed
the mapping anyway. Thus at best, such an API to potentially do a whole
dry-run mapping before every actual mapping would seem like a pretty
pointless anti-optimisation.

Thanks,
Robin.
On Mon, Feb 02, 2026 at 03:16:50PM +0000, Robin Murphy wrote:
> On 2026-02-02 2:35 pm, Christoph Hellwig wrote:
> > On Mon, Feb 02, 2026 at 06:27:38PM +0530, Pradeep P V K wrote:
> > > Fix a NULL pointer dereference that occurs in nvme_pci_prp_iter_next()
> > > when SWIOTLB bounce buffering becomes active during runtime.
> > >
> > > The issue occurs when SWIOTLB activation changes the device's DMA
> > > mapping requirements at runtime,
> > >
> > > creating a mismatch between
> > > iod->dma_vecs allocation and access logic.
> > >
> > > The problem manifests when:
> > > 1. Device initially operates with dma_skip_sync=true
> > > (coherent DMA assumed)
> > > 2. First SWIOTLB mapping occurs due to DMA address limitations,
> > > memory encryption, or IOMMU bounce buffering requirements
> > > 3. SWIOTLB calls dma_reset_need_sync(), permanently setting
> > > dma_skip_sync=false
> > > 4. Subsequent I/Os now have dma_need_unmap()=true, requiring
> > > iod->dma_vecs
> >
> > I think this patch just papers over the bug.  If dma_need_unmap
> > can't be trusted before the dma_map_* call, we've not saved
> > the unmap information and the unmap won't work properly.
>
> The dma_need_unmap() kerneldoc says:
>
> "This function must be called after all mappings that might
> need to be unmapped have been performed."
>
> Trying to infer anything from it beforehand is definitely a bug in the
> caller.

Well that doesn't really make sense. No matter how many mappings the
driver has done, there will always be more. ?
On 2026-02-02 5:13 pm, Keith Busch wrote:
> On Mon, Feb 02, 2026 at 03:16:50PM +0000, Robin Murphy wrote:
>> On 2026-02-02 2:35 pm, Christoph Hellwig wrote:
>>> On Mon, Feb 02, 2026 at 06:27:38PM +0530, Pradeep P V K wrote:
>>>> Fix a NULL pointer dereference that occurs in nvme_pci_prp_iter_next()
>>>> when SWIOTLB bounce buffering becomes active during runtime.
>>>>
>>>> The issue occurs when SWIOTLB activation changes the device's DMA
>>>> mapping requirements at runtime,
>>>>
>>>> creating a mismatch between
>>>> iod->dma_vecs allocation and access logic.
>>>>
>>>> The problem manifests when:
>>>> 1. Device initially operates with dma_skip_sync=true
>>>> (coherent DMA assumed)
>>>> 2. First SWIOTLB mapping occurs due to DMA address limitations,
>>>> memory encryption, or IOMMU bounce buffering requirements
>>>> 3. SWIOTLB calls dma_reset_need_sync(), permanently setting
>>>> dma_skip_sync=false
>>>> 4. Subsequent I/Os now have dma_need_unmap()=true, requiring
>>>> iod->dma_vecs
>>>
>>> I think this patch just papers over the bug.  If dma_need_unmap
>>> can't be trusted before the dma_map_* call, we've not saved
>>> the unmap information and the unmap won't work properly.
>>
>> The dma_need_unmap() kerneldoc says:
>>
>> "This function must be called after all mappings that might
>> need to be unmapped have been performed."
>>
>> Trying to infer anything from it beforehand is definitely a bug in the
>> caller.
>
> Well that doesn't really make sense. No matter how many mappings the
> driver has done, there will always be more. ?

But equally the fact that none of the mappings made so far happened to
not need bouncing still doesn't mean that future ones won't. This is not
guaranteed to be a static property of the device, but nor is it really a
property of the *device* at all; it's a property of a set of one or more
DMA mappings with the same lifetime, there's just no suitable generic
notion of that temporal context in the DMA API to carry around and pass
as an explicit argument, so it's left implicit in the usage model.

Whatever higher-level thing it's doing, the driver must have some
context, so within "operation A" it makes some DMA mappings, checks
dma_need_unmap() and sees it's false, so can conclude that "operation A"
does not need to preserve DMA unmap state. However it may then start
"operation B", do some more mappings, check dma_need_unmap() and see
it's now returned true, so "operation B" *does* need to keep the DMA
data and explicitly unmap it when it finishes.

This is essentially the point I made at the time about it not
necessarily being as useful a thing as it seems, since if an "operation"
involves multiple mappings, it must still store the full state of those
mappings for at least long enough to finish them all and then call
dma_need_unmap(), to only then see if it might be OK to throw that state
away again.

Thanks,
Robin.
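As a rough, purely hypothetical illustration of the usage model Robin
describes above (no such driver exists; the struct and function names
are made up): all mappings belonging to one operation are made first,
and only then is dma_need_unmap() consulted to decide whether the saved
addresses must be kept around for unmap at completion.

#include <linux/dma-mapping.h>

/* Illustrative only; error unwinding of earlier mappings is omitted. */
struct my_op {
	struct page	*pages[8];
	dma_addr_t	addrs[8];
	int		nents;
	bool		needs_unmap;
};

static int my_op_map(struct device *dev, struct my_op *op)
{
	int i;

	for (i = 0; i < op->nents; i++) {
		op->addrs[i] = dma_map_page(dev, op->pages[i], 0, PAGE_SIZE,
					    DMA_TO_DEVICE);
		if (dma_mapping_error(dev, op->addrs[i]))
			return -ENOMEM;
	}

	/*
	 * Only meaningful now, after all mappings for this operation have
	 * been made; if false, op->addrs[] can be thrown away and the
	 * completion path may skip the dma_unmap_page() calls entirely.
	 */
	op->needs_unmap = dma_need_unmap(dev);
	return 0;
}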
On Mon, Feb 02, 2026 at 10:13:24AM -0700, Keith Busch wrote:
> > "This function must be called after all mappings that might
> > need to be unmapped have been performed."
> >
> > Trying to infer anything from it beforehand is definitely a bug in the
> > caller.
>
> Well that doesn't really make sense. No matter how many mappings the
> driver has done, there will always be more. ?
Yeah.  It's more like: if this returns true, so will all future calls,
plus the previous one (which might have caused this).  For that something
like the patch below should work in nvme. Totally untested as I'm
about to head away from the desk and prepare dinner.
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 2a52cf46d960..f944b747e1bd 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -816,6 +816,22 @@ static void nvme_unmap_data(struct request *req)
nvme_free_descriptors(req);
}
+static bool nvme_pci_alloc_dma_vecs(struct request *req,
+ struct blk_dma_iter *iter)
+{
+ struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+ struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
+
+ iod->dma_vecs = mempool_alloc(nvmeq->dev->dmavec_mempool,
+ GFP_ATOMIC);
+ if (!iod->dma_vecs)
+ return false;
+ iod->dma_vecs[0].addr = iter->addr;
+ iod->dma_vecs[0].len = iter->len;
+ iod->nr_dma_vecs = 1;
+ return true;
+}
+
static bool nvme_pci_prp_iter_next(struct request *req, struct device *dma_dev,
struct blk_dma_iter *iter)
{
@@ -826,6 +842,8 @@ static bool nvme_pci_prp_iter_next(struct request *req, struct device *dma_dev,
if (!blk_rq_dma_map_iter_next(req, dma_dev, iter))
return false;
if (!dma_use_iova(&iod->dma_state) && dma_need_unmap(dma_dev)) {
+ if (!iod->nr_dma_vecs && !nvme_pci_alloc_dma_vecs(req, iter))
+ return false;
iod->dma_vecs[iod->nr_dma_vecs].addr = iter->addr;
iod->dma_vecs[iod->nr_dma_vecs].len = iter->len;
iod->nr_dma_vecs++;
@@ -844,13 +862,8 @@ static blk_status_t nvme_pci_setup_data_prp(struct request *req,
__le64 *prp_list;
if (!dma_use_iova(&iod->dma_state) && dma_need_unmap(nvmeq->dev->dev)) {
- iod->dma_vecs = mempool_alloc(nvmeq->dev->dmavec_mempool,
- GFP_ATOMIC);
- if (!iod->dma_vecs)
+ if (!nvme_pci_alloc_dma_vecs(req, iter))
return BLK_STS_RESOURCE;
- iod->dma_vecs[0].addr = iter->addr;
- iod->dma_vecs[0].len = iter->len;
- iod->nr_dma_vecs = 1;
}
/*
On Mon, Feb 02, 2026 at 06:36:24PM +0100, Christoph Hellwig wrote:
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 2a52cf46d960..f944b747e1bd 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -816,6 +816,22 @@ static void nvme_unmap_data(struct request *req)
> nvme_free_descriptors(req);
> }
>
> +static bool nvme_pci_alloc_dma_vecs(struct request *req,
> + struct blk_dma_iter *iter)
> +{
> + struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> + struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
> +
> + iod->dma_vecs = mempool_alloc(nvmeq->dev->dmavec_mempool,
> + GFP_ATOMIC);
> + if (!iod->dma_vecs)
> + return false;
> + iod->dma_vecs[0].addr = iter->addr;
> + iod->dma_vecs[0].len = iter->len;
> + iod->nr_dma_vecs = 1;
> + return true;
> +}
> +
> static bool nvme_pci_prp_iter_next(struct request *req, struct device *dma_dev,
> struct blk_dma_iter *iter)
> {
> @@ -826,6 +842,8 @@ static bool nvme_pci_prp_iter_next(struct request *req, struct device *dma_dev,
> if (!blk_rq_dma_map_iter_next(req, dma_dev, iter))
> return false;
> if (!dma_use_iova(&iod->dma_state) && dma_need_unmap(dma_dev)) {
> + if (!iod->nr_dma_vecs && !nvme_pci_alloc_dma_vecs(req, iter))
> + return false;
In the case where this iteration caused dma_need_unmap() to toggle to
true, this is the iteration that allocates the dma_vecs, and it
initializes the first entry to this iter. But the next lines proceed to
save this iter in the next index, so it's doubly accounted for and
will get unmapped twice in the completion.
Also, if the allocation fails, we should set iter->status to
BLK_STS_RESOURCE so the callers know why the iteration can't continue.
Otherwise, the caller will think the request is badly formed if you
return false from here without setting iter->status.
Here's my quick take. Boot tested with swiotlb enabled, but haven't
tried to test the changing dma_need_unmap() scenario.
---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 9fc4a60280a07..233378faab9bd 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -816,6 +816,28 @@ static void nvme_unmap_data(struct request *req)
nvme_free_descriptors(req);
}
+static bool nvme_pci_prp_save_mapping(struct blk_dma_iter *iter,
+ struct request *req)
+{
+ struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+
+ if (!iod->dma_vecs) {
+ struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
+
+ iod->dma_vecs = mempool_alloc(nvmeq->dev->dmavec_mempool,
+ GFP_ATOMIC);
+ if (!iod->dma_vecs) {
+ iter->status = BLK_STS_RESOURCE;
+ return false;
+ }
+ }
+
+ iod->dma_vecs[iod->nr_dma_vecs].addr = iter->addr;
+ iod->dma_vecs[iod->nr_dma_vecs].len = iter->len;
+ iod->nr_dma_vecs++;
+ return true;
+}
+
static bool nvme_pci_prp_iter_next(struct request *req, struct device *dma_dev,
struct blk_dma_iter *iter)
{
@@ -825,11 +847,8 @@ static bool nvme_pci_prp_iter_next(struct request *req, struct device *dma_dev,
return true;
if (!blk_rq_dma_map_iter_next(req, dma_dev, iter))
return false;
- if (!dma_use_iova(&iod->dma_state) && dma_need_unmap(dma_dev)) {
- iod->dma_vecs[iod->nr_dma_vecs].addr = iter->addr;
- iod->dma_vecs[iod->nr_dma_vecs].len = iter->len;
- iod->nr_dma_vecs++;
- }
+ if (!dma_use_iova(&iod->dma_state) && dma_need_unmap(dma_dev))
+ return nvme_pci_prp_save_mapping(iter, req);
return true;
}
@@ -843,15 +862,9 @@ static blk_status_t nvme_pci_setup_data_prp(struct request *req,
unsigned int prp_len, i;
__le64 *prp_list;
- if (!dma_use_iova(&iod->dma_state) && dma_need_unmap(nvmeq->dev->dev)) {
- iod->dma_vecs = mempool_alloc(nvmeq->dev->dmavec_mempool,
- GFP_ATOMIC);
- if (!iod->dma_vecs)
- return BLK_STS_RESOURCE;
- iod->dma_vecs[0].addr = iter->addr;
- iod->dma_vecs[0].len = iter->len;
- iod->nr_dma_vecs = 1;
- }
+ if (!dma_use_iova(&iod->dma_state) && dma_need_unmap(nvmeq->dev->dev))
+ if (!nvme_pci_prp_save_mapping(iter, req))
+ return iter->status;
/*
* PRP1 always points to the start of the DMA transfers.
@@ -1218,6 +1231,8 @@ static blk_status_t nvme_prep_rq(struct request *req)
iod->nr_descriptors = 0;
iod->total_len = 0;
iod->meta_total_len = 0;
+ iod->nr_dma_vecs = 0;
+ iod->dma_vecs = NULL;
ret = nvme_setup_cmd(req->q->queuedata, req);
if (ret)
--
On Mon, Feb 02, 2026 at 11:59:04AM -0700, Keith Busch wrote:
> On Mon, Feb 02, 2026 at 06:36:24PM +0100, Christoph Hellwig wrote:
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index 2a52cf46d960..f944b747e1bd 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -816,6 +816,22 @@ static void nvme_unmap_data(struct request *req)
> > nvme_free_descriptors(req);
> > }
> >
> > +static bool nvme_pci_alloc_dma_vecs(struct request *req,
> > + struct blk_dma_iter *iter)
> > +{
> > + struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> > + struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
> > +
> > + iod->dma_vecs = mempool_alloc(nvmeq->dev->dmavec_mempool,
> > + GFP_ATOMIC);
> > + if (!iod->dma_vecs)
> > + return false;
> > + iod->dma_vecs[0].addr = iter->addr;
> > + iod->dma_vecs[0].len = iter->len;
> > + iod->nr_dma_vecs = 1;
> > + return true;
> > +}
> > +
> > static bool nvme_pci_prp_iter_next(struct request *req, struct device *dma_dev,
> > struct blk_dma_iter *iter)
> > {
> > @@ -826,6 +842,8 @@ static bool nvme_pci_prp_iter_next(struct request *req, struct device *dma_dev,
> > if (!blk_rq_dma_map_iter_next(req, dma_dev, iter))
> > return false;
> > if (!dma_use_iova(&iod->dma_state) && dma_need_unmap(dma_dev)) {
> > + if (!iod->nr_dma_vecs && !nvme_pci_alloc_dma_vecs(req, iter))
> > + return false;
>
> In the case where this iteration caused dma_need_unmap() to toggle to
> true, this is the iteration that allocates the dma_vecs, and it
> initializes the first entry to this iter. But the next lines proceed to
> the save this iter in the next index, so it's doubly accounted for and
> will get unmapped twice in the completion.
>
> Also, if the allocation fails, we should set iter->status to
> BLK_STS_RESOURCE so the callers know why the iteration can't continue.
> Otherwise, the caller will think the request is badly formed if you
> return false from here without setting iter->status.
>
> Here's my quick take. Boot tested with swiotlb enabled, but haven't
> tried to test the changing dma_need_unmap() scenario.
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 9fc4a60280a07..233378faab9bd 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -816,6 +816,28 @@ static void nvme_unmap_data(struct request *req)
> nvme_free_descriptors(req);
> }
>
> +static bool nvme_pci_prp_save_mapping(struct blk_dma_iter *iter,
> + struct request *req)
> +{
> + struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> +
> + if (!iod->dma_vecs) {
> + struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
> +
> + iod->dma_vecs = mempool_alloc(nvmeq->dev->dmavec_mempool,
> + GFP_ATOMIC);
> + if (!iod->dma_vecs) {
> + iter->status = BLK_STS_RESOURCE;
> + return false;
> + }
> + }
> +
> + iod->dma_vecs[iod->nr_dma_vecs].addr = iter->addr;
> + iod->dma_vecs[iod->nr_dma_vecs].len = iter->len;
> + iod->nr_dma_vecs++;
> + return true;
> +}
> +
> static bool nvme_pci_prp_iter_next(struct request *req, struct device *dma_dev,
> struct blk_dma_iter *iter)
> {
> @@ -825,11 +847,8 @@ static bool nvme_pci_prp_iter_next(struct request *req, struct device *dma_dev,
> return true;
> if (!blk_rq_dma_map_iter_next(req, dma_dev, iter))
> return false;
> - if (!dma_use_iova(&iod->dma_state) && dma_need_unmap(dma_dev)) {
> - iod->dma_vecs[iod->nr_dma_vecs].addr = iter->addr;
> - iod->dma_vecs[iod->nr_dma_vecs].len = iter->len;
> - iod->nr_dma_vecs++;
> - }
> + if (!dma_use_iova(&iod->dma_state) && dma_need_unmap(dma_dev))
Can dev->dma_skip_sync be modified in parallel with this check?
If so, dma_need_unmap() may return different results depending on the
time at which it is invoked.
> + return nvme_pci_prp_save_mapping(iter, req);
Thanks
On 2026-02-03 9:42 am, Leon Romanovsky wrote:
> On Mon, Feb 02, 2026 at 11:59:04AM -0700, Keith Busch wrote:
>> On Mon, Feb 02, 2026 at 06:36:24PM +0100, Christoph Hellwig wrote:
>>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>>> index 2a52cf46d960..f944b747e1bd 100644
>>> --- a/drivers/nvme/host/pci.c
>>> +++ b/drivers/nvme/host/pci.c
>>> @@ -816,6 +816,22 @@ static void nvme_unmap_data(struct request *req)
>>> nvme_free_descriptors(req);
>>> }
>>>
>>> +static bool nvme_pci_alloc_dma_vecs(struct request *req,
>>> + struct blk_dma_iter *iter)
>>> +{
>>> + struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
>>> + struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
>>> +
>>> + iod->dma_vecs = mempool_alloc(nvmeq->dev->dmavec_mempool,
>>> + GFP_ATOMIC);
>>> + if (!iod->dma_vecs)
>>> + return false;
>>> + iod->dma_vecs[0].addr = iter->addr;
>>> + iod->dma_vecs[0].len = iter->len;
>>> + iod->nr_dma_vecs = 1;
>>> + return true;
>>> +}
>>> +
>>> static bool nvme_pci_prp_iter_next(struct request *req, struct device *dma_dev,
>>> struct blk_dma_iter *iter)
>>> {
>>> @@ -826,6 +842,8 @@ static bool nvme_pci_prp_iter_next(struct request *req, struct device *dma_dev,
>>> if (!blk_rq_dma_map_iter_next(req, dma_dev, iter))
>>> return false;
>>> if (!dma_use_iova(&iod->dma_state) && dma_need_unmap(dma_dev)) {
>>> + if (!iod->nr_dma_vecs && !nvme_pci_alloc_dma_vecs(req, iter))
>>> + return false;
>>
>> In the case where this iteration caused dma_need_unmap() to toggle to
>> true, this is the iteration that allocates the dma_vecs, and it
>> initializes the first entry to this iter. But the next lines proceed to
>> the save this iter in the next index, so it's doubly accounted for and
>> will get unmapped twice in the completion.
>>
>> Also, if the allocation fails, we should set iter->status to
>> BLK_STS_RESOURCE so the callers know why the iteration can't continue.
>> Otherwise, the caller will think the request is badly formed if you
>> return false from here without setting iter->status.
>>
>> Here's my quick take. Boot tested with swiotlb enabled, but haven't
>> tried to test the changing dma_need_unmap() scenario.
>> ---
>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>> index 9fc4a60280a07..233378faab9bd 100644
>> --- a/drivers/nvme/host/pci.c
>> +++ b/drivers/nvme/host/pci.c
>> @@ -816,6 +816,28 @@ static void nvme_unmap_data(struct request *req)
>> nvme_free_descriptors(req);
>> }
>>
>> +static bool nvme_pci_prp_save_mapping(struct blk_dma_iter *iter,
>> + struct request *req)
>> +{
>> + struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
>> +
>> + if (!iod->dma_vecs) {
>> + struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
>> +
>> + iod->dma_vecs = mempool_alloc(nvmeq->dev->dmavec_mempool,
>> + GFP_ATOMIC);
>> + if (!iod->dma_vecs) {
>> + iter->status = BLK_STS_RESOURCE;
>> + return false;
>> + }
>> + }
>> +
>> + iod->dma_vecs[iod->nr_dma_vecs].addr = iter->addr;
>> + iod->dma_vecs[iod->nr_dma_vecs].len = iter->len;
>> + iod->nr_dma_vecs++;
>> + return true;
>> +}
>> +
>> static bool nvme_pci_prp_iter_next(struct request *req, struct device *dma_dev,
>> struct blk_dma_iter *iter)
>> {
>> @@ -825,11 +847,8 @@ static bool nvme_pci_prp_iter_next(struct request *req, struct device *dma_dev,
>> return true;
>> if (!blk_rq_dma_map_iter_next(req, dma_dev, iter))
>> return false;
>> - if (!dma_use_iova(&iod->dma_state) && dma_need_unmap(dma_dev)) {
>> - iod->dma_vecs[iod->nr_dma_vecs].addr = iter->addr;
>> - iod->dma_vecs[iod->nr_dma_vecs].len = iter->len;
>> - iod->nr_dma_vecs++;
>> - }
>> + if (!dma_use_iova(&iod->dma_state) && dma_need_unmap(dma_dev))
>
> Can dev->dma_skip_sync be modified in parallel with this check?
> If so, dma_need_unmap() may return different results depending on the
> time at which it is invoked.
It can if another thread is making mappings in parallel, however as
things currently stand that would only lead to the current thread
thinking it must save the unmap state for the mappings it's already made
even if it technically didn't need to.
In principle it could also change back the other way if another thread
reset the device's DMA mask, but doing that with active mappings would
fundamentally break things in regard to the dma_skip_sync mechanism anyway.
Thanks,
Robin.
>
>> + return nvme_pci_prp_save_mapping(iter, req);
>
> Thanks
On Tue, Feb 03, 2026 at 01:50:12PM +0000, Robin Murphy wrote:
> > Can dev->dma_skip_sync be modified in parallel with this check?
> > If so, dma_need_unmap() may return different results depending on the
> > time at which it is invoked.
>
> It can if another thread is making mappings in parallel, however as things
> currently stand that would only lead to the current thread thinking it must
> save the unmap state for the mappings it's already made even if it
> technically didn't need to.
>
> In principle it could also change back the other way if another thread reset
> the device's DMA mask, but doing that with active mappings would
> fundamentally break things in regard to the dma_skip_sync mechanism anyway.

We can handle a change in dma_need_unmap() from false -> true. Worst
case scenario is we save off a descriptor when we didn't really need to,
but it's just a small memory use for the lifetime of the IO.

We don't correctly handle a transition from true -> false, though. We'd
currently leak memory if that happened. It sounds like that transition
is broken for other reasons too, so I won't bother trying to handle it.
On Mon, Feb 02, 2026 at 11:59:04AM -0700, Keith Busch wrote:
> In the case where this iteration caused dma_need_unmap() to toggle to
> true, this is the iteration that allocates the dma_vecs, and it
> initializes the first entry to this iter. But the next lines proceed to
> the save this iter in the next index, so it's doubly accounted for and
> will get unmapped twice in the completion.

Yeah.

> Also, if the allocation fails, we should set iter->status to
> BLK_STS_RESOURCE so the callers know why the iteration can't continue.
> Otherwise, the caller will think the request is badly formed if you
> return false from here without setting iter->status.
>
> Here's my quick take. Boot tested with swiotlb enabled, but haven't
> tried to test the changing dma_need_unmap() scenario.

Looks much better.  Cosmetic nits below.

Pradeep, can you test this with your setup?

> +	if (!dma_use_iova(&iod->dma_state) && dma_need_unmap(dma_dev))
> +		return nvme_pci_prp_save_mapping(iter, req);

> +	if (!dma_use_iova(&iod->dma_state) && dma_need_unmap(nvmeq->dev->dev))
> +		if (!nvme_pci_prp_save_mapping(iter, req))
> +			return iter->status;

I'd move the dma_use_iova / dma_need_unmap checks into
nvme_pci_prp_save_mapping to simplify this a bit more.

>
>  	/*
>  	 * PRP1 always points to the start of the DMA transfers.
> @@ -1218,6 +1231,8 @@ static blk_status_t nvme_prep_rq(struct request *req)
>  	iod->nr_descriptors = 0;
>  	iod->total_len = 0;
>  	iod->meta_total_len = 0;
> +	iod->nr_dma_vecs = 0;
> +	iod->dma_vecs = NULL;

I don't think we need the dma_vecs initialization here, as everything
is keyed off nr_dma_vecs.
On 2/3/2026 10:57 AM, Christoph Hellwig wrote:
> On Mon, Feb 02, 2026 at 11:59:04AM -0700, Keith Busch wrote:
>> In the case where this iteration caused dma_need_unmap() to toggle to
>> true, this is the iteration that allocates the dma_vecs, and it
>> initializes the first entry to this iter. But the next lines proceed to
>> the save this iter in the next index, so it's doubly accounted for and
>> will get unmapped twice in the completion.
>
> Yeah.
>
>> Also, if the allocation fails, we should set iter->status to
>> BLK_STS_RESOURCE so the callers know why the iteration can't continue.
>> Otherwise, the caller will think the request is badly formed if you
>> return false from here without setting iter->status.
>>
>> Here's my quick take. Boot tested with swiotlb enabled, but haven't
>> tried to test the changing dma_need_unmap() scenario.
>
> Looks much better.  Cosmetic nits below.
>
> Pradeep, can you test this with your setup?

Sure, testing has started, and I will share the findings soon.
Also, I did not pick up the initialization of dma_vecs during testing.

>
>> +	if (!dma_use_iova(&iod->dma_state) && dma_need_unmap(dma_dev))
>> +		return nvme_pci_prp_save_mapping(iter, req);
>
>> +	if (!dma_use_iova(&iod->dma_state) && dma_need_unmap(nvmeq->dev->dev))
>> +		if (!nvme_pci_prp_save_mapping(iter, req))
>> +			return iter->status;
>
> I'd move the dma_use_iova / dma_need_unmap checks into
> nvme_pci_prp_save_mapping to simplify this a bit more.
>
>>
>>  	/*
>>  	 * PRP1 always points to the start of the DMA transfers.
>> @@ -1218,6 +1231,8 @@ static blk_status_t nvme_prep_rq(struct request *req)
>>  	iod->nr_descriptors = 0;
>>  	iod->total_len = 0;
>>  	iod->meta_total_len = 0;
>> +	iod->nr_dma_vecs = 0;
>> +	iod->dma_vecs = NULL;
>
> I don't think we need the dma_vecs initialization here, as everything
> is keyed off nr_dma_vecs.
On 2/3/2026 7:35 PM, Pradeep Pragallapati wrote:
>
>
> On 2/3/2026 10:57 AM, Christoph Hellwig wrote:
>> On Mon, Feb 02, 2026 at 11:59:04AM -0700, Keith Busch wrote:
>>> In the case where this iteration caused dma_need_unmap() to toggle to
>>> true, this is the iteration that allocates the dma_vecs, and it
>>> initializes the first entry to this iter. But the next lines proceed to
>>> the save this iter in the next index, so it's doubly accounted for and
>>> will get unmapped twice in the completion.
>>
>> Yeah.
>>
>>> Also, if the allocation fails, we should set iter->status to
>>> BLK_STS_RESOURCE so the callers know why the iteration can't continue.
>>> Otherwise, the caller will think the request is badly formed if you
>>> return false from here without setting iter->status.
>>>
>>> Here's my quick take. Boot tested with swiotlb enabled, but haven't
>>> tried to test the changing dma_need_unmap() scenario.
>>
>> Looks much better.  Cosmetic nits below.
>>
>> Pradeep, can you test this with your setup?
> Sure, testing has started, and I will share the findings soon.
> Also, I did not pick up the initialization of dma_vecs during testing.

I ran testing for over 20 hours and did not observe the issue on my setup.
It appears to be helping.

>
>>
>>> +	if (!dma_use_iova(&iod->dma_state) && dma_need_unmap(dma_dev))
>>> +		return nvme_pci_prp_save_mapping(iter, req);
>>
>>> +	if (!dma_use_iova(&iod->dma_state) && dma_need_unmap(nvmeq->dev->dev))
>>> +		if (!nvme_pci_prp_save_mapping(iter, req))
>>> +			return iter->status;
>>
>> I'd move the dma_use_iova / dma_need_unmap checks into
>> nvme_pci_prp_save_mapping to simplify this a bit more.
>>
>>>  	/*
>>>  	 * PRP1 always points to the start of the DMA transfers.
>>> @@ -1218,6 +1231,8 @@ static blk_status_t nvme_prep_rq(struct request *req)
>>>  	iod->nr_descriptors = 0;
>>>  	iod->total_len = 0;
>>>  	iod->meta_total_len = 0;
>>> +	iod->nr_dma_vecs = 0;
>>> +	iod->dma_vecs = NULL;
>>
>> I don't think we need the dma_vecs initialization here, as everything
>> is keyed off nr_dma_vecs.
>
On Wed, Feb 04, 2026 at 07:34:42PM +0530, Pradeep Pragallapati wrote:
> I ran testing for over 20 hours and did not observe the issue on my setup.
> It appears to be helping.

Thanks for the test! I'll put a real patch together with Christoph's
suggestions folded in.
On Tue, Feb 03, 2026 at 06:27:56AM +0100, Christoph Hellwig wrote:
> >  	iod->nr_descriptors = 0;
> >  	iod->total_len = 0;
> >  	iod->meta_total_len = 0;
> > +	iod->nr_dma_vecs = 0;
> > +	iod->dma_vecs = NULL;
>
> I don't think we need the dma_vecs initialization here, as everything
> is keyed off nr_dma_vecs.

Yes, we should definitely use nr_dma_vecs and skip the NULL setting. I'm
a big fan of removing unnecessary initialisations. Just a caution, my
suggested patch has this check:

	if (!iod->dma_vecs)

So we just need to update it to use 'iod->nr_dma_vecs' instead, which
would have been correct, too.
On Mon, Feb 02, 2026 at 11:14:23PM -0700, Keith Busch wrote:
> On Tue, Feb 03, 2026 at 06:27:56AM +0100, Christoph Hellwig wrote:
> > >  	iod->nr_descriptors = 0;
> > >  	iod->total_len = 0;
> > >  	iod->meta_total_len = 0;
> > > +	iod->nr_dma_vecs = 0;
> > > +	iod->dma_vecs = NULL;
> >
> > I don't think we need the dma_vecs initialization here, as everything
> > is keyed off nr_dma_vecs.
>
> Yes, we should definitely use nr_dma_vecs and skip the NULL setting. I'm
> a big fan removing unnecessary initialisations. Just a caution, my
> suggested patch has this check:
>
> 	if (!iod->dma_vecs)
>
> So we just need to update it to use 'iod->nr_dma_vecs' instead, which
> would have been correct, too.

Ah, right.
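For illustration only, an untested sketch of how the helper might look
with the comments above folded in (the dma_use_iova()/dma_need_unmap()
checks moved inside, and the allocation keyed off nr_dma_vecs rather
than a NULL dma_vecs pointer). This is an assumed shape of the follow-up,
not the committed patch:

static bool nvme_pci_prp_save_mapping(struct blk_dma_iter *iter,
		struct request *req)
{
	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;

	/* Nothing to save on the IOVA path or when no unmap is needed. */
	if (dma_use_iova(&iod->dma_state) || !dma_need_unmap(nvmeq->dev->dev))
		return true;

	/* First vector for this request: allocate the array lazily. */
	if (!iod->nr_dma_vecs) {
		iod->dma_vecs = mempool_alloc(nvmeq->dev->dmavec_mempool,
					      GFP_ATOMIC);
		if (!iod->dma_vecs) {
			iter->status = BLK_STS_RESOURCE;
			return false;
		}
	}

	iod->dma_vecs[iod->nr_dma_vecs].addr = iter->addr;
	iod->dma_vecs[iod->nr_dma_vecs].len = iter->len;
	iod->nr_dma_vecs++;
	return true;
}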
On Mon, Feb 02, 2026 at 03:16:50PM +0000, Robin Murphy wrote:
> On 2026-02-02 2:35 pm, Christoph Hellwig wrote:
> > On Mon, Feb 02, 2026 at 06:27:38PM +0530, Pradeep P V K wrote:
> > > Fix a NULL pointer dereference that occurs in nvme_pci_prp_iter_next()
> > > when SWIOTLB bounce buffering becomes active during runtime.
> > >
> > > The issue occurs when SWIOTLB activation changes the device's DMA
> > > mapping requirements at runtime,
> > >
> > > creating a mismatch between
> > > iod->dma_vecs allocation and access logic.
> > >
> > > The problem manifests when:
> > > 1. Device initially operates with dma_skip_sync=true
> > > (coherent DMA assumed)
> > > 2. First SWIOTLB mapping occurs due to DMA address limitations,
> > > memory encryption, or IOMMU bounce buffering requirements
> > > 3. SWIOTLB calls dma_reset_need_sync(), permanently setting
> > > dma_skip_sync=false
> > > 4. Subsequent I/Os now have dma_need_unmap()=true, requiring
> > > iod->dma_vecs
> >
> > I think this patch just papers over the bug.  If dma_need_unmap
> > can't be trusted before the dma_map_* call, we've not saved
> > the unmap information and the unmap won't work properly.
>
> The dma_need_unmap() kerneldoc says:
>
> "This function must be called after all mappings that might
> need to be unmapped have been performed."
>
> Trying to infer anything from it beforehand is definitely a bug in the
> caller.

At least for HMM, dma_need_unmap() works as expected. HMM doesn't work
with SWIOTLB.

Thanks