From: Michael Kelley <mhklinux@outlook.com>

Background
==========
Linux device drivers may make DMA map/unmap calls in contexts that cannot block, such as in an interrupt handler. Consequently, when a DMA map call must use a bounce buffer, the allocation of swiotlb memory must always succeed immediately. If swiotlb memory is exhausted, the DMA map call cannot wait for memory to be released. The call fails, which usually results in an I/O error.

Bounce buffers are usually used infrequently for a few corner cases, so the default swiotlb memory allocation of 64 MiB is more than sufficient to avoid running out and causing errors. However, recently introduced Confidential Computing (CoCo) VMs must use bounce buffers for all DMA I/O because the VM's memory is encrypted. In CoCo VMs a new heuristic allocates ~6% of the VM's memory, up to 1 GiB, for swiotlb memory. This large allocation reduces the likelihood of a spike in usage causing DMA map failures. Unfortunately for most workloads, this insurance against spikes comes at the cost of potentially "wasting" hundreds of MiB of the VM's memory, as swiotlb memory can't be used for other purposes.

Approach
========
The goal is to significantly reduce the amount of memory reserved as swiotlb memory in CoCo VMs, while not unduly increasing the risk of DMA map failures due to memory exhaustion.

To reach this goal, this patch set introduces the concept of swiotlb throttling, which can delay swiotlb allocation requests when swiotlb memory usage is high. This approach depends on the fact that some DMA map requests are made from contexts where it's OK to block. Throttling such requests is acceptable to spread out a spike in usage.

Because it's not possible to detect at runtime whether a DMA map call is made in a context that can block, the calls in key device drivers must be updated with a MAY_BLOCK attribute, if appropriate. When this attribute is set and swiotlb memory usage is above a threshold, the swiotlb allocation code can serialize swiotlb memory usage to help ensure that it is not exhausted.

In general, storage device drivers can take advantage of the MAY_BLOCK option, while network device drivers cannot. The Linux block layer already allows storage requests to block when the BLK_MQ_F_BLOCKING flag is present on the request queue. In a CoCo VM environment, relatively few device types are used for storage devices, and updating these drivers is feasible. This patch set updates the NVMe driver and the Hyper-V storvsc synthetic storage driver. A few other drivers might also need to be updated to handle the key CoCo VM storage devices.

Because network drivers generally cannot use swiotlb throttling, it is still possible for swiotlb memory to become exhausted. But blunting the maximum swiotlb memory used by storage devices can significantly reduce the peak usage, and a smaller amount of swiotlb memory can be allocated in a CoCo VM. Also, overall usage by storage drivers is likely to be larger than for network drivers, especially when large numbers of disk devices are in use, each with many I/O requests in-flight.

swiotlb throttling does not affect the context requirements of DMA unmap calls. These always complete without blocking, even if the corresponding DMA map call was throttled.

Patches
=======
Patches 1 and 2 implement the core of swiotlb throttling. They define the DMA attribute flag DMA_ATTR_MAY_BLOCK that device drivers use to indicate that a DMA map call is allowed to block, and therefore can be throttled. They update swiotlb_tbl_map_single() to detect this flag and implement the throttling. Similarly, swiotlb_tbl_unmap_single() is updated to handle a previously throttled request that has now freed its swiotlb memory.

Patch 3 adds the dma_recommend_may_block() call that device drivers can use to know if there's benefit in using the MAY_BLOCK option on DMA map calls. If not in a CoCo VM, this call returns "false" because swiotlb is not being used for all DMA I/O. This allows the driver to set the BLK_MQ_F_BLOCKING flag on blk-mq request queues only when there is benefit.

Patch 4 updates the SCSI-specific DMA map calls to add a "_attrs" variant to allow passing the MAY_BLOCK attribute.

Patch 5 adds the MAY_BLOCK option to the Hyper-V storvsc driver, which is used for storage in CoCo VMs in the Azure public cloud.

Patches 6 and 7 add the MAY_BLOCK option to the NVMe PCI host driver.

Discussion
==========
* Since swiotlb isn't visible to device drivers, I've specifically named the DMA attribute as MAY_BLOCK instead of MAY_THROTTLE or something swiotlb specific. While this patch set consumes MAY_BLOCK only on the DMA direct path to do throttling in the swiotlb code, there might be other uses in the future outside of CoCo VMs, or perhaps on the IOMMU path.

* The swiotlb throttling code in this patch set throttles by serializing the use of swiotlb memory when usage is above a designated threshold: i.e., only one new swiotlb request is allowed to proceed at a time. When the corresponding unmap is done to release its swiotlb memory, the next request is allowed to proceed. This serialization is global and without knowledge of swiotlb areas. From a storage I/O performance standpoint, the serialization is a bit restrictive, but the code isn't trying to optimize for being above the threshold. If a workload regularly runs above the threshold, the size of the swiotlb memory should be increased.

* Except for knowing how much swiotlb memory is currently allocated, throttle accounting is done without locking or atomic operations. For example, multiple requests could proceed in parallel when usage is just under the threshold, putting usage above the threshold by the aggregate size of the parallel requests. The threshold must already be set relatively conservatively because of drivers that can't enable throttling, so this slop in the accounting shouldn't be a problem. It's better than the potential bottleneck of a globally synchronized reservation mechanism.

* In a CoCo VM, mapping a scatter/gather list makes an independent swiotlb request for each entry. Throttling each independent request wouldn't really work, so the code throttles only the first SGL entry. Once that entry passes any throttle, subsequent entries in the SGL proceed without throttling. When the SGL is unmapped, entries 1 thru N-1 are unmapped first, then entry 0 is unmapped, allowing the next serialized request to proceed.

Open Topics
===========
1. swiotlb allocations from Xen and the IOMMU code don't make use of throttling. This could be added if beneficial.

2. The throttling values are currently exposed and adjustable in /sys/kernel/debug/swiotlb. Should any of this be moved so it is visible even without CONFIG_DEBUG_FS?

3. I have not changed the current heuristic for the swiotlb memory size in CoCo VMs. It's not clear to me how to link this to whether the key storage drivers have been updated to allow throttling. For now, the benefit of reduced swiotlb memory size must be realized using the swiotlb= kernel boot line option.

4. I need to update the swiotlb documentation to describe throttling.

This patch set is built against linux-next-20240816.
Michael Kelley (7):
  swiotlb: Introduce swiotlb throttling
  dma: Handle swiotlb throttling for SGLs
  dma: Add function for drivers to know if allowing blocking is useful
  scsi_lib_dma: Add _attrs variant of scsi_dma_map()
  scsi: storvsc: Enable swiotlb throttling
  nvme: Move BLK_MQ_F_BLOCKING indicator to struct nvme_ctrl
  nvme: Enable swiotlb throttling for NVMe PCI devices

 drivers/nvme/host/core.c    |   4 +-
 drivers/nvme/host/nvme.h    |   2 +-
 drivers/nvme/host/pci.c     |  18 ++++--
 drivers/nvme/host/tcp.c     |   3 +-
 drivers/scsi/scsi_lib_dma.c |  13 ++--
 drivers/scsi/storvsc_drv.c  |   9 ++-
 include/linux/dma-mapping.h |  13 ++++
 include/linux/swiotlb.h     |  15 ++++-
 include/scsi/scsi_cmnd.h    |   7 ++-
 kernel/dma/Kconfig          |  13 ++++
 kernel/dma/direct.c         |  41 +++++++++++--
 kernel/dma/direct.h         |   1 +
 kernel/dma/mapping.c        |  10 ++++
 kernel/dma/swiotlb.c        | 114 ++++++++++++++++++++++++++++++++----
 14 files changed, 227 insertions(+), 36 deletions(-)

--
2.25.1
On 2024-08-22 7:37 pm, mhkelley58@gmail.com wrote:
> From: Michael Kelley <mhklinux@outlook.com>
>
> Background
> ==========
> Linux device drivers may make DMA map/unmap calls in contexts that cannot block, such as in an interrupt handler. Consequently, when a DMA map call must use a bounce buffer, the allocation of swiotlb memory must always succeed immediately. If swiotlb memory is exhausted, the DMA map call cannot wait for memory to be released. The call fails, which usually results in an I/O error.
>
> Bounce buffers are usually used infrequently for a few corner cases, so the default swiotlb memory allocation of 64 MiB is more than sufficient to avoid running out and causing errors. However, recently introduced Confidential Computing (CoCo) VMs must use bounce buffers for all DMA I/O because the VM's memory is encrypted. In CoCo VMs a new heuristic allocates ~6% of the VM's memory, up to 1 GiB, for swiotlb memory. This large allocation reduces the likelihood of a spike in usage causing DMA map failures. Unfortunately for most workloads, this insurance against spikes comes at the cost of potentially "wasting" hundreds of MiB of the VM's memory, as swiotlb memory can't be used for other purposes.
>
> Approach
> ========
> The goal is to significantly reduce the amount of memory reserved as swiotlb memory in CoCo VMs, while not unduly increasing the risk of DMA map failures due to memory exhaustion.

Isn't that fundamentally the same thing that SWIOTLB_DYNAMIC was already meant to address? Of course the implementation of that is still young and has plenty of scope to be made more effective, and some of the ideas here could very much help with that, but I'm struggling a little to see what's really beneficial about having a completely disjoint mechanism for sitting around doing nothing in the precise circumstances where it would seem most possible to allocate a transient buffer and get on with it.

Thanks,
Robin.

> To reach this goal, this patch set introduces the concept of swiotlb throttling, which can delay swiotlb allocation requests when swiotlb memory usage is high. This approach depends on the fact that some DMA map requests are made from contexts where it's OK to block. Throttling such requests is acceptable to spread out a spike in usage.
>
> Because it's not possible to detect at runtime whether a DMA map call is made in a context that can block, the calls in key device drivers must be updated with a MAY_BLOCK attribute, if appropriate. When this attribute is set and swiotlb memory usage is above a threshold, the swiotlb allocation code can serialize swiotlb memory usage to help ensure that it is not exhausted.
>
> In general, storage device drivers can take advantage of the MAY_BLOCK option, while network device drivers cannot. The Linux block layer already allows storage requests to block when the BLK_MQ_F_BLOCKING flag is present on the request queue. In a CoCo VM environment, relatively few device types are used for storage devices, and updating these drivers is feasible. This patch set updates the NVMe driver and the Hyper-V storvsc synthetic storage driver. A few other drivers might also need to be updated to handle the key CoCo VM storage devices.
>
> Because network drivers generally cannot use swiotlb throttling, it is still possible for swiotlb memory to become exhausted. But blunting the maximum swiotlb memory used by storage devices can significantly reduce the peak usage, and a smaller amount of swiotlb memory can be allocated in a CoCo VM. Also, usage by storage drivers is likely to overall be larger than for network drivers, especially when large numbers of disk devices are in use, each with many I/O requests in-flight.
>
> swiotlb throttling does not affect the context requirements of DMA unmap calls. These always complete without blocking, even if the corresponding DMA map call was throttled.
>
> Patches
> =======
> Patches 1 and 2 implement the core of swiotlb throttling. They define DMA attribute flag DMA_ATTR_MAY_BLOCK that device drivers use to indicate that a DMA map call is allowed to block, and therefore can be throttled. They update swiotlb_tbl_map_single() to detect this flag and implement the throttling. Similarly, swiotlb_tbl_unmap_single() is updated to handle a previously throttled request that has now freed its swiotlb memory.
>
> Patch 3 adds the dma_recommend_may_block() call that device drivers can use to know if there's benefit in using the MAY_BLOCK option on DMA map calls. If not in a CoCo VM, this call returns "false" because swiotlb is not being used for all DMA I/O. This allows the driver to set the BLK_MQ_F_BLOCKING flag on blk-mq request queues only when there is benefit.
>
> Patch 4 updates the SCSI-specific DMA map calls to add a "_attrs" variant to allow passing the MAY_BLOCK attribute.
>
> Patch 5 adds the MAY_BLOCK option to the Hyper-V storvsc driver, which is used for storage in CoCo VMs in the Azure public cloud.
>
> Patches 6 and 7 add the MAY_BLOCK option to the NVMe PCI host driver.
>
> Discussion
> ==========
> * Since swiotlb isn't visible to device drivers, I've specifically named the DMA attribute as MAY_BLOCK instead of MAY_THROTTLE or something swiotlb specific. While this patch set consumes MAY_BLOCK only on the DMA direct path to do throttling in the swiotlb code, there might be other uses in the future outside of CoCo VMs, or perhaps on the IOMMU path.
>
> * The swiotlb throttling code in this patch set throttles by serializing the use of swiotlb memory when usage is above a designated threshold: i.e., only one new swiotlb request is allowed to proceed at a time. When the corresponding unmap is done to release its swiotlb memory, the next request is allowed to proceed. This serialization is global and without knowledge of swiotlb areas. From a storage I/O performance standpoint, the serialization is a bit restrictive, but the code isn't trying to optimize for being above the threshold. If a workload regularly runs above the threshold, the size of the swiotlb memory should be increased.
>
> * Except for knowing how much swiotlb memory is currently allocated, throttle accounting is done without locking or atomic operations. For example, multiple requests could proceed in parallel when usage is just under the threshold, putting usage above the threshold by the aggregate size of the parallel requests. The threshold must already be set relatively conservatively because of drivers that can't enable throttling, so this slop in the accounting shouldn't be a problem. It's better than the potential bottleneck of a globally synchronized reservation mechanism.
>
> * In a CoCo VM, mapping a scatter/gather list makes an independent swiotlb request for each entry. Throttling each independent request wouldn't really work, so the code throttles only the first SGL entry. Once that entry passes any throttle, subsequent entries in the SGL proceed without throttling. When the SGL is unmapped, entries 1 thru N-1 are unmapped first, then entry 0 is unmapped, allowing the next serialized request to proceed.
>
> Open Topics
> ===========
> 1. swiotlb allocations from Xen and the IOMMU code don't make use of throttling. This could be added if beneficial.
>
> 2. The throttling values are currently exposed and adjustable in /sys/kernel/debug/swiotlb. Should any of this be moved so it is visible even without CONFIG_DEBUG_FS?
>
> 3. I have not changed the current heuristic for the swiotlb memory size in CoCo VMs. It's not clear to me how to link this to whether the key storage drivers have been updated to allow throttling. For now, the benefit of reduced swiotlb memory size must be realized using the swiotlb= kernel boot line option.
>
> 4. I need to update the swiotlb documentation to describe throttling.
>
> This patch set is built against linux-next-20240816.
>
> Michael Kelley (7):
>   swiotlb: Introduce swiotlb throttling
>   dma: Handle swiotlb throttling for SGLs
>   dma: Add function for drivers to know if allowing blocking is useful
>   scsi_lib_dma: Add _attrs variant of scsi_dma_map()
>   scsi: storvsc: Enable swiotlb throttling
>   nvme: Move BLK_MQ_F_BLOCKING indicator to struct nvme_ctrl
>   nvme: Enable swiotlb throttling for NVMe PCI devices
>
>  drivers/nvme/host/core.c    |   4 +-
>  drivers/nvme/host/nvme.h    |   2 +-
>  drivers/nvme/host/pci.c     |  18 ++++--
>  drivers/nvme/host/tcp.c     |   3 +-
>  drivers/scsi/scsi_lib_dma.c |  13 ++--
>  drivers/scsi/storvsc_drv.c  |   9 ++-
>  include/linux/dma-mapping.h |  13 ++++
>  include/linux/swiotlb.h     |  15 ++++-
>  include/scsi/scsi_cmnd.h    |   7 ++-
>  kernel/dma/Kconfig          |  13 ++++
>  kernel/dma/direct.c         |  41 +++++++++++--
>  kernel/dma/direct.h         |   1 +
>  kernel/dma/mapping.c        |  10 ++++
>  kernel/dma/swiotlb.c        | 114 ++++++++++++++++++++++++++++++++----
>  14 files changed, 227 insertions(+), 36 deletions(-)
>
On Wed, 28 Aug 2024 13:02:31 +0100
Robin Murphy <robin.murphy@arm.com> wrote:

> On 2024-08-22 7:37 pm, mhkelley58@gmail.com wrote:
> > From: Michael Kelley <mhklinux@outlook.com>
> >
> > Background
> > ==========
> > Linux device drivers may make DMA map/unmap calls in contexts that cannot block, such as in an interrupt handler. Consequently, when a DMA map call must use a bounce buffer, the allocation of swiotlb memory must always succeed immediately. If swiotlb memory is exhausted, the DMA map call cannot wait for memory to be released. The call fails, which usually results in an I/O error.
> >
> > Bounce buffers are usually used infrequently for a few corner cases, so the default swiotlb memory allocation of 64 MiB is more than sufficient to avoid running out and causing errors. However, recently introduced Confidential Computing (CoCo) VMs must use bounce buffers for all DMA I/O because the VM's memory is encrypted. In CoCo VMs a new heuristic allocates ~6% of the VM's memory, up to 1 GiB, for swiotlb memory. This large allocation reduces the likelihood of a spike in usage causing DMA map failures. Unfortunately for most workloads, this insurance against spikes comes at the cost of potentially "wasting" hundreds of MiB of the VM's memory, as swiotlb memory can't be used for other purposes.
> >
> > Approach
> > ========
> > The goal is to significantly reduce the amount of memory reserved as swiotlb memory in CoCo VMs, while not unduly increasing the risk of DMA map failures due to memory exhaustion.
>
> Isn't that fundamentally the same thing that SWIOTLB_DYNAMIC was already meant to address? Of course the implementation of that is still young and has plenty of scope to be made more effective, and some of the ideas here could very much help with that, but I'm struggling a little to see what's really beneficial about having a completely disjoint mechanism for sitting around doing nothing in the precise circumstances where it would seem most possible to allocate a transient buffer and get on with it.

This question can probably best be answered by Michael, but let me give my understanding of the differences. First the similarity: Yes, one of the key new concepts is that swiotlb allocation may block, and I introduced a similar attribute in one of my dynamic SWIOTLB patches; it was later dropped, but dynamic SWIOTLB would still benefit from it.

More importantly, dynamic SWIOTLB may deplete memory following an I/O spike. I do have some ideas for how memory could be returned to the allocator, but the code is not ready (unlike this patch series). Moreover, it may still be a better idea to throttle the devices instead, because returning DMA'able memory is not always cheap. In a CoCo VM, this memory must be re-encrypted, and that requires a hypercall that I'm told is expensive.

In short, IIUC it is faster in a CoCo VM to delay some requests a bit than to grow the swiotlb.

Michael, please add your insights.

Petr T

> > To reach this goal, this patch set introduces the concept of swiotlb throttling, which can delay swiotlb allocation requests when swiotlb memory usage is high. This approach depends on the fact that some DMA map requests are made from contexts where it's OK to block. Throttling such requests is acceptable to spread out a spike in usage.
> >
> > Because it's not possible to detect at runtime whether a DMA map call is made in a context that can block, the calls in key device drivers must be updated with a MAY_BLOCK attribute, if appropriate. When this attribute is set and swiotlb memory usage is above a threshold, the swiotlb allocation code can serialize swiotlb memory usage to help ensure that it is not exhausted.
> >
> > In general, storage device drivers can take advantage of the MAY_BLOCK option, while network device drivers cannot. The Linux block layer already allows storage requests to block when the BLK_MQ_F_BLOCKING flag is present on the request queue. In a CoCo VM environment, relatively few device types are used for storage devices, and updating these drivers is feasible. This patch set updates the NVMe driver and the Hyper-V storvsc synthetic storage driver. A few other drivers might also need to be updated to handle the key CoCo VM storage devices.
> >
> > Because network drivers generally cannot use swiotlb throttling, it is still possible for swiotlb memory to become exhausted. But blunting the maximum swiotlb memory used by storage devices can significantly reduce the peak usage, and a smaller amount of swiotlb memory can be allocated in a CoCo VM. Also, usage by storage drivers is likely to overall be larger than for network drivers, especially when large numbers of disk devices are in use, each with many I/O requests in-flight.
> >
> > swiotlb throttling does not affect the context requirements of DMA unmap calls. These always complete without blocking, even if the corresponding DMA map call was throttled.
> >
> > Patches
> > =======
> > Patches 1 and 2 implement the core of swiotlb throttling. They define DMA attribute flag DMA_ATTR_MAY_BLOCK that device drivers use to indicate that a DMA map call is allowed to block, and therefore can be throttled. They update swiotlb_tbl_map_single() to detect this flag and implement the throttling. Similarly, swiotlb_tbl_unmap_single() is updated to handle a previously throttled request that has now freed its swiotlb memory.
> >
> > Patch 3 adds the dma_recommend_may_block() call that device drivers can use to know if there's benefit in using the MAY_BLOCK option on DMA map calls. If not in a CoCo VM, this call returns "false" because swiotlb is not being used for all DMA I/O. This allows the driver to set the BLK_MQ_F_BLOCKING flag on blk-mq request queues only when there is benefit.
> >
> > Patch 4 updates the SCSI-specific DMA map calls to add a "_attrs" variant to allow passing the MAY_BLOCK attribute.
> >
> > Patch 5 adds the MAY_BLOCK option to the Hyper-V storvsc driver, which is used for storage in CoCo VMs in the Azure public cloud.
> >
> > Patches 6 and 7 add the MAY_BLOCK option to the NVMe PCI host driver.
> >
> > Discussion
> > ==========
> > * Since swiotlb isn't visible to device drivers, I've specifically named the DMA attribute as MAY_BLOCK instead of MAY_THROTTLE or something swiotlb specific. While this patch set consumes MAY_BLOCK only on the DMA direct path to do throttling in the swiotlb code, there might be other uses in the future outside of CoCo VMs, or perhaps on the IOMMU path.
> >
> > * The swiotlb throttling code in this patch set throttles by serializing the use of swiotlb memory when usage is above a designated threshold: i.e., only one new swiotlb request is allowed to proceed at a time. When the corresponding unmap is done to release its swiotlb memory, the next request is allowed to proceed. This serialization is global and without knowledge of swiotlb areas. From a storage I/O performance standpoint, the serialization is a bit restrictive, but the code isn't trying to optimize for being above the threshold. If a workload regularly runs above the threshold, the size of the swiotlb memory should be increased.
> >
> > * Except for knowing how much swiotlb memory is currently allocated, throttle accounting is done without locking or atomic operations. For example, multiple requests could proceed in parallel when usage is just under the threshold, putting usage above the threshold by the aggregate size of the parallel requests. The threshold must already be set relatively conservatively because of drivers that can't enable throttling, so this slop in the accounting shouldn't be a problem. It's better than the potential bottleneck of a globally synchronized reservation mechanism.
> >
> > * In a CoCo VM, mapping a scatter/gather list makes an independent swiotlb request for each entry. Throttling each independent request wouldn't really work, so the code throttles only the first SGL entry. Once that entry passes any throttle, subsequent entries in the SGL proceed without throttling. When the SGL is unmapped, entries 1 thru N-1 are unmapped first, then entry 0 is unmapped, allowing the next serialized request to proceed.
> >
> > Open Topics
> > ===========
> > 1. swiotlb allocations from Xen and the IOMMU code don't make use of throttling. This could be added if beneficial.
> >
> > 2. The throttling values are currently exposed and adjustable in /sys/kernel/debug/swiotlb. Should any of this be moved so it is visible even without CONFIG_DEBUG_FS?
> >
> > 3. I have not changed the current heuristic for the swiotlb memory size in CoCo VMs. It's not clear to me how to link this to whether the key storage drivers have been updated to allow throttling. For now, the benefit of reduced swiotlb memory size must be realized using the swiotlb= kernel boot line option.
> >
> > 4. I need to update the swiotlb documentation to describe throttling.
> >
> > This patch set is built against linux-next-20240816.
> >
> > Michael Kelley (7):
> >   swiotlb: Introduce swiotlb throttling
> >   dma: Handle swiotlb throttling for SGLs
> >   dma: Add function for drivers to know if allowing blocking is useful
> >   scsi_lib_dma: Add _attrs variant of scsi_dma_map()
> >   scsi: storvsc: Enable swiotlb throttling
> >   nvme: Move BLK_MQ_F_BLOCKING indicator to struct nvme_ctrl
> >   nvme: Enable swiotlb throttling for NVMe PCI devices
> >
> >  drivers/nvme/host/core.c    |   4 +-
> >  drivers/nvme/host/nvme.h    |   2 +-
> >  drivers/nvme/host/pci.c     |  18 ++++--
> >  drivers/nvme/host/tcp.c     |   3 +-
> >  drivers/scsi/scsi_lib_dma.c |  13 ++--
> >  drivers/scsi/storvsc_drv.c  |   9 ++-
> >  include/linux/dma-mapping.h |  13 ++++
> >  include/linux/swiotlb.h     |  15 ++++-
> >  include/scsi/scsi_cmnd.h    |   7 ++-
> >  kernel/dma/Kconfig          |  13 ++++
> >  kernel/dma/direct.c         |  41 +++++++++++--
> >  kernel/dma/direct.h         |   1 +
> >  kernel/dma/mapping.c        |  10 ++++
> >  kernel/dma/swiotlb.c        | 114 ++++++++++++++++++++++++++++++++----
> >  14 files changed, 227 insertions(+), 36 deletions(-)
> >
On 2024-08-28 2:03 pm, Petr Tesařík wrote: > On Wed, 28 Aug 2024 13:02:31 +0100 > Robin Murphy <robin.murphy@arm.com> wrote: > >> On 2024-08-22 7:37 pm, mhkelley58@gmail.com wrote: >>> From: Michael Kelley <mhklinux@outlook.com> >>> >>> Background >>> ========== >>> Linux device drivers may make DMA map/unmap calls in contexts that >>> cannot block, such as in an interrupt handler. Consequently, when a >>> DMA map call must use a bounce buffer, the allocation of swiotlb >>> memory must always succeed immediately. If swiotlb memory is >>> exhausted, the DMA map call cannot wait for memory to be released. The >>> call fails, which usually results in an I/O error. >>> >>> Bounce buffers are usually used infrequently for a few corner cases, >>> so the default swiotlb memory allocation of 64 MiB is more than >>> sufficient to avoid running out and causing errors. However, recently >>> introduced Confidential Computing (CoCo) VMs must use bounce buffers >>> for all DMA I/O because the VM's memory is encrypted. In CoCo VMs >>> a new heuristic allocates ~6% of the VM's memory, up to 1 GiB, for >>> swiotlb memory. This large allocation reduces the likelihood of a >>> spike in usage causing DMA map failures. Unfortunately for most >>> workloads, this insurance against spikes comes at the cost of >>> potentially "wasting" hundreds of MiB's of the VM's memory, as swiotlb >>> memory can't be used for other purposes. >>> >>> Approach >>> ======== >>> The goal is to significantly reduce the amount of memory reserved as >>> swiotlb memory in CoCo VMs, while not unduly increasing the risk of >>> DMA map failures due to memory exhaustion. >> >> Isn't that fundamentally the same thing that SWIOTLB_DYNAMIC was already >> meant to address? 
>> Of course the implementation of that is still young and has plenty of scope to be made more effective, and some of the ideas here could very much help with that, but I'm struggling a little to see what's really beneficial about having a completely disjoint mechanism for sitting around doing nothing in the precise circumstances where it would seem most possible to allocate a transient buffer and get on with it.
>
> This question can probably best be answered by Michael, but let me give my understanding of the differences. First the similarity: yes, one of the key new concepts is that a swiotlb allocation may block, and I introduced a similar attribute in one of my dynamic SWIOTLB patches; it was later dropped, but dynamic SWIOTLB would still benefit from it.
>
> More importantly, dynamic SWIOTLB may deplete memory following an I/O spike. I do have some ideas about how memory could be returned to the allocator, but the code is not ready (unlike this patch series). Moreover, it may still be a better idea to throttle the devices instead, because returning DMA'able memory is not always cheap. In a CoCo VM, this memory must be re-encrypted, and that requires a hypercall that I'm told is expensive.

Sure, making a hypercall in order to progress is expensive relative to being able to progress without doing so, but waiting on a lock for an unbounded time in the hope that other drivers might release their DMA mappings soon represents a potentially unbounded expense, since it doesn't even carry any promise of progress at all - oops, userspace just filled up SWIOTLB with a misguided dma-buf import and now the OS has livelocked on stalled I/O threads fighting to retry :(

As soon as we start tracking thresholds etc., that should equally put us in a position to manage the lifecycle of both dynamic and transient pools more effectively - larger allocations which can be reused by multiple mappings until the I/O load drops again could amortise that initial cost quite a bit.

Furthermore, I'm not entirely convinced that the rationale for throttling being beneficial is all that sound. Serialising requests doesn't make them somehow use less memory; it just makes them use it... serially. If a single CPU is capable of queueing enough requests at once to fill the SWIOTLB, this is going to do absolutely nothing; if two CPUs are capable of queueing enough requests together to fill the SWIOTLB, making them take slightly longer to do so doesn't inherently mean anything more than reaching the same outcome more slowly. At worst, if a thread is blocked from polling for completion and releasing a bunch of mappings of already-finished descriptors because it's stuck on an unfair lock trying to get one last mapping submitted, then throttling has actively harmed the situation. AFAICS this is dependent on rather particular assumptions about driver behaviour in terms of DMA mapping patterns and interrupts, plus the overall I/O workload shape, and it's not clear to me how well that really generalises.

> In short, IIUC it is faster in a CoCo VM to delay some requests a bit than to grow the swiotlb.

I'm not necessarily disputing that for the cases where the assumptions do hold; it's more a question of why those two things should be separate and largely incompatible (I've only skimmed the patches here, but my impression is that they wouldn't play all that nicely together if both were enabled). To me it would make far more sense for this to be a tuneable policy of a more holistic SWIOTLB_DYNAMIC itself, i.e. blockable calls can opportunistically wait for free space up to a well-defined timeout, but then also fall back to synchronously allocating a new pool in order to assure a definite outcome of success or system-is-dying-level failure.

Thanks,
Robin.
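Robin's suggested policy - wait a bounded time for free space, then fall back to synchronously growing the pool - can be sketched as a small userspace model. All names here (`try_reserve`, `grow_pool`, `reserve_may_block`) are hypothetical, not kernel API, and the retry loop stands in for a real sleep on a completion with a timeout:

```c
#include <stdbool.h>

/* Hypothetical userspace model of the policy sketched above: a
 * blockable mapping request polls a bounded number of times for free
 * slots, then falls back to synchronously growing the pool so the
 * outcome is definite. None of these names are real kernel API. */

static unsigned long pool_slots;     /* total bounce-buffer slots */
static unsigned long used_slots;     /* slots currently mapped */

static void reset_pool(void)
{
    pool_slots = 8;
    used_slots = 0;
}

static bool try_reserve(unsigned long n)
{
    if (used_slots + n > pool_slots)
        return false;                /* no free space right now */
    used_slots += n;
    return true;
}

static void grow_pool(unsigned long n)
{
    /* stands in for synchronously allocating a new swiotlb pool */
    pool_slots += n;
}

static bool reserve_may_block(unsigned long n, int max_polls)
{
    int i;

    for (i = 0; i < max_polls; i++)  /* bounded wait for free space */
        if (try_reserve(n))
            return true;
    grow_pool(n);                    /* fallback: definite success */
    return try_reserve(n);
}
```

With an 8-slot pool fully reserved, `reserve_may_block(4, 3)` exhausts its polls and then grows the pool to 12 slots rather than failing, which is the "definite outcome" property Robin is after.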
On Thu, Aug 22, 2024 at 11:37:11AM -0700, mhkelley58@gmail.com wrote:
> Because it's not possible to detect at runtime whether a DMA map call is made in a context that can block, the calls in key device drivers must be updated with a MAY_BLOCK attribute, if appropriate. When this attribute is set and swiotlb memory usage is above a threshold, the swiotlb allocation code can serialize swiotlb memory usage to help ensure that it is not exhausted.

One thing I've been wanting to do for a while, but haven't gotten to due to my lack of semantic patching skills, is to split the few flags useful for dma_map* from DMA_ATTR_*, which largely applies only to dma_alloc. Only DMA_ATTR_WEAK_ORDERING (if we can't just kill it entirely) and, for now, DMA_ATTR_NO_WARN are used for both. DMA_ATTR_SKIP_CPU_SYNC and your new SLEEP/BLOCK attribute are only useful for mapping, and the rest is for allocation only. So I'd love to move to a DMA_MAP_* namespace for the mapping flags before adding more, potentially widely used, ones. With a little grace period we can then also phase out DMA_ATTR_NO_WARN for allocations, as the gfp_t can control that much better.

> In general, storage device drivers can take advantage of the MAY_BLOCK option, while network device drivers cannot. The Linux block layer already allows storage requests to block when the BLK_MQ_F_BLOCKING flag is present on the request queue.

Note that this in general also involves changes to the block drivers to set that flag, which is a bit annoying, but I guess there is no easy way around it without paying the price for the BLK_MQ_F_BLOCKING overhead everywhere.
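The DMA_MAP_* namespace split Christoph proposes might look something like the sketch below. The names and bit values are purely illustrative - they are not the kernel's actual definitions - but they show the intent: mapping-only flags get their own namespace, disjoint from the allocation-only DMA_ATTR_* flags:

```c
/* Hypothetical sketch of the proposed DMA_MAP_* mapping-flag
 * namespace. Values and names are illustrative only; they are not the
 * kernel's actual definitions. */
#define DMA_MAP_WEAK_ORDERING (1u << 0)  /* shared today; may be killed */
#define DMA_MAP_NO_WARN       (1u << 1)  /* shared today with dma_alloc */
#define DMA_MAP_SKIP_CPU_SYNC (1u << 2)  /* already mapping-only */
#define DMA_MAP_MAY_BLOCK     (1u << 3)  /* the new attribute */

/* A mapping call would then take these flags instead of DMA_ATTR_*,
 * e.g. conceptually:
 *   dma_map_page_attrs(dev, page, off, len, dir,
 *                      DMA_MAP_MAY_BLOCK | DMA_MAP_SKIP_CPU_SYNC);
 * while DMA_ATTR_NO_WARN for allocations is phased out in favor of
 * __GFP_NOWARN in the gfp_t argument. */
```

The remaining DMA_ATTR_* flags would stay allocation-only, so each namespace documents exactly where a flag is meaningful.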
Hi all,

upfront: I've had more time to consider this idea, because Michael kindly shared it with me back in February.

On Thu, 22 Aug 2024 11:37:11 -0700 mhkelley58@gmail.com wrote:
> From: Michael Kelley <mhklinux@outlook.com>
>
> Background
> ==========
> Linux device drivers may make DMA map/unmap calls in contexts that cannot block, such as in an interrupt handler. Consequently, when a DMA map call must use a bounce buffer, the allocation of swiotlb memory must always succeed immediately. If swiotlb memory is exhausted, the DMA map call cannot wait for memory to be released. The call fails, which usually results in an I/O error.

FTR most I/O errors are recoverable, but the recovery usually takes a lot of time. Plus, the errors are logged and usually treated as important by monitoring software. In short, I agree it's a poor choice.

> Bounce buffers are usually used infrequently for a few corner cases, so the default swiotlb memory allocation of 64 MiB is more than sufficient to avoid running out and causing errors. However, recently introduced Confidential Computing (CoCo) VMs must use bounce buffers for all DMA I/O because the VM's memory is encrypted. In CoCo VMs a new heuristic allocates ~6% of the VM's memory, up to 1 GiB, for swiotlb memory. This large allocation reduces the likelihood of a spike in usage causing DMA map failures. Unfortunately for most workloads, this insurance against spikes comes at the cost of potentially "wasting" hundreds of MiB's of the VM's memory, as swiotlb memory can't be used for other purposes.

It may be worth mentioning that page encryption state can be changed by a hypercall, but that's a costly (and non-atomic) operation. It's much faster to copy the data to a page which is already unencrypted (a bounce buffer).
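The copy-to-an-already-unencrypted-page pattern Petr describes is the essence of bounce buffering, and can be modeled in a few lines of userspace C. The names (`bounce_map`, `bounce_unmap`) and the single static pool are illustrative only - the point is that the expensive encryption-state change happens once at pool setup, after which every mapping is just a memcpy:

```c
#include <string.h>

/* Hypothetical userspace model of bounce buffering: the "shared"
 * (already-unencrypted) pool is set up once, so each mapping is just
 * a memcpy, with no per-I/O encryption-state hypercall. Names are
 * illustrative, not kernel API. */

#define BOUNCE_SIZE 4096
static char bounce_pool[BOUNCE_SIZE];  /* "decrypted" once at setup */

/* "map": copy the driver's data into the shared pool and hand the
 * device the pool address */
static char *bounce_map(const void *src, size_t len)
{
    if (len > BOUNCE_SIZE)
        return NULL;                   /* would fail the DMA map */
    memcpy(bounce_pool, src, len);
    return bounce_pool;
}

/* "unmap" for a device-to-memory transfer: copy the result back into
 * the driver's (encrypted) buffer */
static void bounce_unmap(void *dst, size_t len)
{
    memcpy(dst, bounce_pool, len);
}
```

A round trip through `bounce_map`/`bounce_unmap` preserves the data while the device only ever touches the shared pool.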
> Approach
> ========
> The goal is to significantly reduce the amount of memory reserved as swiotlb memory in CoCo VMs, while not unduly increasing the risk of DMA map failures due to memory exhaustion.
>
> To reach this goal, this patch set introduces the concept of swiotlb throttling, which can delay swiotlb allocation requests when swiotlb memory usage is high. This approach depends on the fact that some DMA map requests are made from contexts where it's OK to block. Throttling such requests is acceptable to spread out a spike in usage.
>
> Because it's not possible to detect at runtime whether a DMA map call is made in a context that can block, the calls in key device drivers must be updated with a MAY_BLOCK attribute, if appropriate.

Before somebody asks: the general agreement for decades has been that there should be no global state indicating whether the kernel is in atomic context. Instead, if a function needs to know, it should take an explicit parameter. IOW, this MAY_BLOCK attribute follows an unquestioned kernel design pattern.

> When this attribute is set and swiotlb memory usage is above a threshold, the swiotlb allocation code can serialize swiotlb memory usage to help ensure that it is not exhausted.
>
> In general, storage device drivers can take advantage of the MAY_BLOCK option, while network device drivers cannot. The Linux block layer already allows storage requests to block when the BLK_MQ_F_BLOCKING flag is present on the request queue. In a CoCo VM environment, relatively few device types are used for storage devices, and updating these drivers is feasible. This patch set updates the NVMe driver and the Hyper-V storvsc synthetic storage driver. A few other drivers might also need to be updated to handle the key CoCo VM storage devices.
>
> Because network drivers generally cannot use swiotlb throttling, it is still possible for swiotlb memory to become exhausted.
> But blunting the maximum swiotlb memory used by storage devices can significantly reduce the peak usage, and a smaller amount of swiotlb memory can be allocated in a CoCo VM. Also, usage by storage drivers is likely to be larger overall than for network drivers, especially when large numbers of disk devices are in use, each with many I/O requests in flight.

The system can also handle network packet loss much better than I/O errors, mainly because lost packets have always been part of normal operation, unlike I/O errors. After all, that's why we unmount all filesystems on removable media before physically unplugging (or ejecting) them.

> swiotlb throttling does not affect the context requirements of DMA unmap calls. These always complete without blocking, even if the corresponding DMA map call was throttled.
>
> Patches
> =======
> Patches 1 and 2 implement the core of swiotlb throttling. They define the DMA attribute flag DMA_ATTR_MAY_BLOCK that device drivers use to indicate that a DMA map call is allowed to block, and therefore can be throttled. They update swiotlb_tbl_map_single() to detect this flag and implement the throttling. Similarly, swiotlb_tbl_unmap_single() is updated to handle a previously throttled request that has now freed its swiotlb memory.
>
> Patch 3 adds the dma_recommend_may_block() call that device drivers can use to know if there's benefit in using the MAY_BLOCK option on DMA map calls. If not in a CoCo VM, this call returns "false" because swiotlb is not being used for all DMA I/O. This allows the driver to set the BLK_MQ_F_BLOCKING flag on blk-mq request queues only when there is benefit.
>
> Patch 4 updates the SCSI-specific DMA map calls to add an "_attrs" variant to allow passing the MAY_BLOCK attribute.
>
> Patch 5 adds the MAY_BLOCK option to the Hyper-V storvsc driver, which is used for storage in CoCo VMs in the Azure public cloud.
>
> Patches 6 and 7 add the MAY_BLOCK option to the NVMe PCI host driver.
>
> Discussion
> ==========
> * Since swiotlb isn't visible to device drivers, I've specifically named the DMA attribute as MAY_BLOCK instead of MAY_THROTTLE or something swiotlb-specific. While this patch set consumes MAY_BLOCK only on the DMA direct path to do throttling in the swiotlb code, there might be other uses in the future outside of CoCo VMs, or perhaps on the IOMMU path.

I once introduced a similar flag and called it MAY_SLEEP. I chose MAY_SLEEP because there is already a might_sleep() annotation, but I don't have a strong opinion unless your semantics are supposed to be different from might_sleep()'s. If they are, then I strongly prefer MAY_BLOCK, to prevent confusing the two.

> * The swiotlb throttling code in this patch set throttles by serializing the use of swiotlb memory when usage is above a designated threshold: i.e., only one new swiotlb request is allowed to proceed at a time. When the corresponding unmap is done to release its swiotlb memory, the next request is allowed to proceed. This serialization is global and without knowledge of swiotlb areas. From a storage I/O performance standpoint, the serialization is a bit restrictive, but the code isn't trying to optimize for being above the threshold. If a workload regularly runs above the threshold, the size of the swiotlb memory should be increased.

With CONFIG_SWIOTLB_DYNAMIC, this could happen automatically in the future. But let's get the basic functionality in first.

> * Except for knowing how much swiotlb memory is currently allocated, throttle accounting is done without locking or atomic operations. For example, multiple requests could proceed in parallel when usage is just under the threshold, putting usage above the threshold by the aggregate size of the parallel requests.
> The threshold must already be set relatively conservatively because of drivers that can't enable throttling, so this slop in the accounting shouldn't be a problem. It's better than the potential bottleneck of a globally synchronized reservation mechanism.

Agreed.

> * In a CoCo VM, mapping a scatter/gather list makes an independent swiotlb request for each entry. Throttling each independent request wouldn't really work, so the code throttles only the first SGL entry. Once that entry passes any throttle, subsequent entries in the SGL proceed without throttling. When the SGL is unmapped, entries 1 thru N-1 are unmapped first, then entry 0 is unmapped, allowing the next serialized request to proceed.
>
> Open Topics
> ===========
> 1. swiotlb allocations from Xen and the IOMMU code don't make use of throttling. This could be added if beneficial.
>
> 2. The throttling values are currently exposed and adjustable in /sys/kernel/debug/swiotlb. Should any of this be moved so it is visible even without CONFIG_DEBUG_FS?

Yes. It should be possible to control the thresholds through sysctl.

> 3. I have not changed the current heuristic for the swiotlb memory size in CoCo VMs. It's not clear to me how to link this to whether the key storage drivers have been updated to allow throttling. For now, the benefit of reduced swiotlb memory size must be realized using the swiotlb= kernel boot line option.

This sounds fine for now.

> 4. I need to update the swiotlb documentation to describe throttling.
>
> This patch set is built against linux-next-20240816.

OK, I'm going to try it out. Thank you for making this happen!

Petr T
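The serialize-above-a-threshold behavior discussed in the patch description can be modeled in userspace C. The counters, names, and the boolean "gate" below are illustrative stand-ins - a real implementation would sleep on a wait queue and be woken by the corresponding unmap rather than returning false - but the model captures the accounting: requests proceed freely below the threshold, blockable requests above it are admitted one at a time, and non-blockable requests bypass the gate entirely (which is why the threshold must be conservative):

```c
#include <stdbool.h>

/* Hypothetical userspace model of the throttling described above.
 * Names are illustrative, not kernel API. */

static unsigned long total = 64;          /* total swiotlb slots */
static unsigned long used;                /* slots currently mapped */
static unsigned long high_throttle = 44;  /* throttle threshold */
static bool gate_held;     /* one over-threshold request at a time */

/* Returns true if the request may proceed now; a real implementation
 * would sleep until woken by a release instead of returning false. */
static bool throttle_reserve(unsigned long n, bool may_block)
{
    if (used + n > total)
        return false;                /* exhausted: hard failure */
    if (may_block && used >= high_throttle) {
        if (gate_held)
            return false;            /* caller would block here */
        gate_held = true;            /* serialize over-threshold use */
    }
    used += n;
    return true;
}

static void throttle_release(unsigned long n)
{
    used -= n;
    if (gate_held && used < high_throttle)
        gate_held = false;           /* wake the next throttled waiter */
}
```

Note how the model also reflects the accounting slop Michael describes: two blockable requests just under the threshold can both proceed, pushing usage above it by their aggregate size, since there is no locked reservation step.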
On 8/22/24 11:37 AM, mhkelley58@gmail.com wrote:
> Linux device drivers may make DMA map/unmap calls in contexts that cannot block, such as in an interrupt handler.

Although I really appreciate your work, what alternatives have been considered? How many drivers perform DMA mapping from atomic context? Would it be feasible to modify these drivers such that DMA mapping always happens in a context in which sleeping is allowed?

Thanks,

Bart.