[PATCH] dma-debug: skip cacheline overlap tracking on cache-coherent architectures

Mikhail Gavrilov posted 1 patch 6 days, 17 hours ago
kernel/dma/debug.c | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)
Posted by Mikhail Gavrilov 6 days, 17 hours ago
The dma-debug cacheline overlap tracking emits two distinct warnings
when multiple DMA mappings share a cacheline:

  1. add_dma_entry() calls err_printk("cacheline tracking EEXIST,
     overlapping mappings aren't supported\n") on every -EEXIST from
     active_cacheline_insert().

  2. active_cacheline_inc_overlap() calls WARN_ONCE("exceeded %d
     overlapping mappings of cacheline %pa\n", ...) when the 3-bit
     per-cacheline overlap counter in the dma_active_cacheline radix
     tree would saturate past ACTIVE_CACHELINE_MAX_OVERLAP (= 7).

Commit 3d48c9fd78dd ("dma-debug: suppress cacheline overlap warning
when arch has no DMA alignment requirement") suppressed (1) on
architectures where hardware bus snooping makes cacheline-overlapping
DMA mappings safe.  The same reasoning applies to (2): the tracking is
pure overhead on those architectures, and (2) still fires under real
workloads, e.g. heavy NVMe block I/O on x86_64:

  DMA-API: exceeded 7 overlapping mappings of cacheline 0x...
  WARNING: kernel/dma/debug.c:465 at add_dma_entry+0x394/0x410
  Call Trace:
   add_dma_entry+...
   debug_dma_map_phys+...
   dma_map_phys+...
   blk_dma_map_iter_start+...
   nvme_map_data+...

The block layer routinely produces nine or more concurrent in-flight
mappings whose buffers share a single cacheline.  On hardware-coherent
systems this is harmless, but it saturates the tag-based overlap
counter and produces a splat indistinguishable from a real driver bug.

Extend the gate to skip the cacheline overlap tracking entirely on
cache-coherent architectures, mirroring the DMA_TO_DEVICE early-return
that already exists for the same "tracking is unnecessary" reason.  The
helper dma_debug_cacheline_tracking_needed() captures the condition and
is symmetric to the existing add_dma_entry() check.

The same DMA_BOUNCE_UNALIGNED_KMALLOC + SWIOTLB suppression that
commit 03521c892bb8 ("dma-debug: don't report false positives with
DMA_BOUNCE_UNALIGNED_KMALLOC") added to (1) applies here for the same
reason: unaligned kmalloc buffers are bounced through aligned swiotlb
buffers, so the original cacheline overlap never reaches DMA.  The
helper preserves both suppression conditions.

Reproducer (out-of-tree module): map a single 8-byte buffer with
dma_map_single(..., DMA_BIDIRECTIONAL) nine times in a row.  The 9th
call deterministically fires the WARN_ONCE on an unfixed kernel; with
this patch applied no warning is emitted regardless of the number of
overlapping mappings.

Without this patch (n_maps=9):
  DMA-API: exceeded 7 overlapping mappings of cacheline 0x00000000071d7dbe
  WARNING: kernel/dma/debug.c:465 at add_dma_entry+0x39e/0x410
  [...]

With this patch (n_maps=1000):
  dma_debug_overlap_repro: 1000/1000 mappings active
  [no warning]

Link: https://lore.kernel.org/all/ZwxzdWmYcBK27mUs@fedora/
Fixes: 3b7a6418c749 ("dma debug: account for cachelines and read-only mappings in overlap tracking")
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Signed-off-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
---
 kernel/dma/debug.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c
index 1a725edbbbf6..2d1609b9d362 100644
--- a/kernel/dma/debug.c
+++ b/kernel/dma/debug.c
@@ -474,6 +474,35 @@ static int active_cacheline_dec_overlap(phys_addr_t cln)
 	return active_cacheline_set_overlap(cln, --overlap);
 }
 
+/*
+ * Whether cacheline-overlap tracking is meaningful for @dev.
+ *
+ * Mirrors the suppression conditions add_dma_entry() already applies to
+ * the sibling "cacheline tracking EEXIST" err_printk:
+ *
+ *  - On architectures with hardware DMA cache coherence
+ *    (dma_get_cache_alignment() < L1_CACHE_BYTES, e.g. x86_64) bus
+ *    snooping makes overlapping cacheline mappings safe.
+ *
+ *  - With CONFIG_DMA_BOUNCE_UNALIGNED_KMALLOC and an active SWIOTLB,
+ *    unaligned kmalloc buffers are bounced through aligned swiotlb
+ *    buffers, so the original cacheline overlap never reaches DMA.
+ *    See commit 03521c892bb8 ("dma-debug: don't report false positives
+ *    with DMA_BOUNCE_UNALIGNED_KMALLOC").
+ *
+ * In both cases tracking is pure overhead and produces false-positive
+ * WARN_ONCEs.
+ */
+static bool dma_debug_cacheline_tracking_needed(struct device *dev)
+{
+	if (dma_get_cache_alignment() < L1_CACHE_BYTES)
+		return false;
+	if (IS_ENABLED(CONFIG_DMA_BOUNCE_UNALIGNED_KMALLOC) &&
+	    is_swiotlb_active(dev))
+		return false;
+	return true;
+}
+
 static int active_cacheline_insert(struct dma_debug_entry *entry,
 				   bool *overlap_cache_clean)
 {
@@ -490,6 +519,9 @@ static int active_cacheline_insert(struct dma_debug_entry *entry,
 	if (entry->direction == DMA_TO_DEVICE)
 		return 0;
 
+	if (!dma_debug_cacheline_tracking_needed(entry->dev))
+		return 0;
+
 	spin_lock_irqsave(&radix_lock, flags);
 	rc = radix_tree_insert(&dma_active_cacheline, cln, entry);
 	if (rc == -EEXIST) {
@@ -516,6 +548,9 @@ static void active_cacheline_remove(struct dma_debug_entry *entry)
 	if (entry->direction == DMA_TO_DEVICE)
 		return;
 
+	if (!dma_debug_cacheline_tracking_needed(entry->dev))
+		return;
+
 	spin_lock_irqsave(&radix_lock, flags);
 	/* since we are counting overlaps the final put of the
 	 * cacheline will occur when the overlap count is 0.
-- 
2.54.0
Re: [PATCH] dma-debug: skip cacheline overlap tracking on cache-coherent architectures
Posted by Leon Romanovsky 6 days, 16 hours ago
On Mon, May 18, 2026 at 04:32:51PM +0500, Mikhail Gavrilov wrote:
> The dma-debug cacheline overlap tracking emits two distinct warnings
> when multiple DMA mappings share a cacheline:
> 
>   1. add_dma_entry() calls err_printk("cacheline tracking EEXIST,
>      overlapping mappings aren't supported\n") on every -EEXIST from
>      active_cacheline_insert().
> 
>   2. active_cacheline_inc_overlap() calls WARN_ONCE("exceeded %d
>      overlapping mappings of cacheline %pa\n", ...) when the 3-bit
>      per-cacheline overlap counter in the dma_active_cacheline radix
>      tree would saturate past ACTIVE_CACHELINE_MAX_OVERLAP (= 7).
> 
> Commit 3d48c9fd78dd ("dma-debug: suppress cacheline overlap warning
> when arch has no DMA alignment requirement") suppressed (1) on
> architectures where hardware bus snooping makes cacheline-overlapping
> DMA mappings safe.  The same reasoning applies to (2): the tracking is
> pure overhead on those architectures, and (2) still fires under real
> workloads, e.g. heavy NVMe block I/O on x86_64:
> 
>   DMA-API: exceeded 7 overlapping mappings of cacheline 0x...
>   WARNING: kernel/dma/debug.c:465 at add_dma_entry+0x394/0x410
>   Call Trace:
>    add_dma_entry+...
>    debug_dma_map_phys+...
>    dma_map_phys+...
>    blk_dma_map_iter_start+...
>    nvme_map_data+...
> 
> The block layer routinely produces nine or more concurrent in-flight
> mappings whose buffers share a single cacheline.  On hardware-coherent
> systems this is harmless, but it saturates the tag-based overlap
> counter and produces a splat indistinguishable from a real driver bug.
> 
> Extend the gate to skip the cacheline overlap tracking entirely on
> cache-coherent architectures, mirroring the DMA_TO_DEVICE early-return
> that already exists for the same "tracking is unnecessary" reason.  The
> helper dma_debug_cacheline_tracking_needed() captures the condition and
> is symmetric to the existing add_dma_entry() check.
> 
> The same DMA_BOUNCE_UNALIGNED_KMALLOC + SWIOTLB suppression that
> commit 03521c892bb8 ("dma-debug: don't report false positives with
> DMA_BOUNCE_UNALIGNED_KMALLOC") added to (1) applies here for the same
> reason: unaligned kmalloc buffers are bounced through aligned swiotlb
> buffers, so the original cacheline overlap never reaches DMA.  The
> helper preserves both suppression conditions.
> 
> Reproducer (out-of-tree module): map a single 8-byte buffer with
> dma_map_single(..., DMA_BIDIRECTIONAL) nine times in a row.  The 9th
> call deterministically fires the WARN_ONCE on an unfixed kernel; with
> this patch applied no warning is emitted regardless of the number of
> overlapping mappings.

I would say this reproducer is incorrect. From what I recall, the only two
legitimate use cases for cacheline overlap are virtio and RDMA. The first
intentionally relies on it for small allocations, and the second exports the
cachelines to user space and cannot operate on non-coherent architectures.

Thanks
Re: [PATCH] dma-debug: skip cacheline overlap tracking on cache-coherent architectures
Posted by Mikhail Gavrilov 6 days, 16 hours ago
On Mon, May 18, 2026 at 5:10 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> I would say this reproducer is incorrect. From what I recall, the only two
> legitimate use cases for cacheline overlap are virtio and RDMA.

The wild trace in the commit message is NVMe block I/O -- neither virtio
nor RDMA:

  add_dma_entry -> debug_dma_map_phys -> dma_map_phys ->
  blk_dma_map_iter_start -> nvme_map_data

The block layer submits many concurrent in-flight requests; small
kmalloc'd buffers naturally land in the same cacheline under high IOPS,
which is incidental rather than intentional overlap.  Ming Lei's report
linked in the commit message [1] enumerates additional non-virtio /
non-RDMA cases hitting the same WARN: liburing iopoll tests, raid1,
dm-thin and other storage utilities.

> The first intentionally relies on it for small allocations, and the second exports the
> cachelines to the user space and cannot operate on non‑coherent architectures.

The reproducer isn't claiming to be either of those.  It deterministically
reaches the same state-based gate the wild NVMe trace hits
(!is_cache_clean && overlap > 7, with direction != DMA_TO_DEVICE, after
the v2 coherent-arch / SWIOTLB-bounce suppressions are evaluated).  Since
that gate has no subsystem-specific term, any caller -- synthetic or real
-- reaching it with those state values triggers the same WARN.
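
A minimal user-space sketch of that gate, for illustration only. It assumes the counter behaves as the commit message describes (a 3-bit value that cannot hold more than 7, with a one-shot warning); every name below (model_cacheline, model_map, model_first_warning) is hypothetical and not a kernel symbol.

```c
#include <stdbool.h>

/* Hypothetical model; names only loosely mirror kernel/dma/debug.c. */
#define MODEL_TAG_BITS    3                            /* three tag bits per slot */
#define MODEL_MAX_OVERLAP ((1 << MODEL_TAG_BITS) - 1)  /* = 7 */

struct model_cacheline {
	bool present;  /* the first mapping occupies the slot */
	int overlap;   /* additional mappings, capped at MODEL_MAX_OVERLAP */
	bool warned;   /* stands in for the one-shot WARN_ONCE */
};

/*
 * Model one map call that lands on this cacheline.
 * Returns true if this call would emit the "exceeded ... overlapping
 * mappings" warning.
 */
static bool model_map(struct model_cacheline *cl, bool to_device,
		      bool is_cache_clean)
{
	int next;

	if (to_device)
		return false;           /* DMA_TO_DEVICE is not tracked */

	if (!cl->present) {
		cl->present = true;     /* first mapping: plain insert */
		return false;
	}

	next = cl->overlap + 1;         /* models the -EEXIST path */
	if (next <= MODEL_MAX_OVERLAP)
		cl->overlap = next;     /* value still fits in the tag bits */

	if (!is_cache_clean && next > MODEL_MAX_OVERLAP && !cl->warned) {
		cl->warned = true;
		return true;
	}
	return false;
}

/* 1-based index of the first map call that warns, or 0 if none does. */
static int model_first_warning(int n_maps)
{
	struct model_cacheline cl = { 0 };
	int i;

	for (i = 1; i <= n_maps; i++)
		if (model_map(&cl, false, false))
			return i;
	return 0;
}
```

Under these assumptions the first mapping takes the slot, mappings 2 through 8 raise the counter from 1 to 7, and the 9th is the first one that cannot be represented, which matches the reproducer's n_maps=9. The model has no notion of which subsystem issued the mapping.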

If the broader concern is that the block layer should opt into your
coherency-attribute work rather than relying on debug-side suppression,
that's a reasonable longer-term direction.  But it's additive: even with
opt-in adoption, the WARN remains a false positive on coherent arches
for callers that don't annotate -- which is exactly what v2 (3d48c9fd78dd)
already established for the sibling "cacheline tracking EEXIST" err_printk.

[1] https://lore.kernel.org/all/ZwxzdWmYcBK27mUs@fedora/

-- 
Thanks,
Mikhail
Re: [PATCH] dma-debug: skip cacheline overlap tracking on cache-coherent architectures
Posted by David Laight 5 days, 19 hours ago
On Mon, 18 May 2026 17:23:15 +0500
Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> wrote:

> On Mon, May 18, 2026 at 5:10 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > I would say this reproducer is incorrect. From what I recall, the only two
> > legitimate use cases for cacheline overlap are virtio and RDMA.  
> 
> The wild trace in the commit message is NVMe block I/O -- neither virtio
> nor RDMA:
> 
>   add_dma_entry -> debug_dma_map_phys -> dma_map_phys ->
>   blk_dma_map_iter_start -> nvme_map_data
> 
> The block layer submits many concurrent in-flight requests; small
> kmalloc'd buffers naturally land in the same cacheline under high IOPS,

Isn't there a flag to kmalloc() that indicates the buffers will be used
for DMA and mustn't share a cache line with anything else writable?
(Which means the size must be rounded up to a multiple of the cache
line size.)
For DMA_FROM_DEVICE it is important that the CPU doesn't dirty the cache
lines.

This is probably worse on systems with 256 byte cache lines.

-- David

> which is incidental rather than intentional overlap.  Ming Lei's report
> linked in the commit message [1] enumerates additional non-virtio /
> non-RDMA cases hitting the same WARN: liburing iopoll tests, raid1,
> dm-thin and other storage utilities.
> 
> > The first intentionally relies on it for small allocations, and the second exports the
> > cachelines to the user space and cannot operate on non‑coherent architectures.  
> 
> The reproducer isn't claiming to be either of those.  It deterministically
> reaches the same state-based gate the wild NVMe trace hits
> (!is_cache_clean && overlap > 7, with direction != DMA_TO_DEVICE, after
> the v2 coherent-arch / SWIOTLB-bounce suppressions are evaluated).  Since
> that gate has no subsystem-specific term, any caller -- synthetic or real
> -- reaching it with those state values triggers the same WARN.
> 
> If the broader concern is that the block layer should opt into your
> coherency-attribute work rather than relying on debug-side suppression,
> that's a reasonable longer-term direction.  But it's additive: even with
> opt-in adoption, the WARN remains a false positive on coherent arches
> for callers that don't annotate -- which is exactly what v2 (3d48c9fd78dd)
> already established for the sibling "cacheline tracking EEXIST" err_printk.
> 
> [1] https://lore.kernel.org/all/ZwxzdWmYcBK27mUs@fedora/
> 
Re: [PATCH] dma-debug: skip cacheline overlap tracking on cache-coherent architectures
Posted by Mikhail Gavrilov 5 days, 19 hours ago
On Tue, May 19, 2026 at 2:06 PM David Laight
<david.laight.linux@gmail.com> wrote:
>
> Isn't there a flag to kmalloc() that indicates the buffers will be used
> for dma and mustn't share a cache line with anything else writable.
> (Which means the size must be rounded up to a multiple of the cache
> line size.)

Not a per-call flag, but ARCH_KMALLOC_MINALIGN does it at compile
time: when set to cache-line size (as it traditionally is on
non-coherent arches), kmalloc returns cache-line-aligned buffers so
independent kmalloc allocations can't share a cacheline.
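
A rough user-space sketch of the arithmetic behind that, under the simplifying assumption that an allocator places the next buffer at the first address after the previous one that satisfies the minimum alignment. The helpers (model_align_up, model_next_alloc, model_share_cacheline) and the addresses used with them are made up for illustration; this is not how kmalloc is implemented.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Round addr up to a multiple of align; align must be a power of two. */
static uint64_t model_align_up(uint64_t addr, uint64_t align)
{
	return (addr + align - 1) & ~(align - 1);
}

/* Start of the next allocation placed right after [prev, prev + prev_len). */
static uint64_t model_next_alloc(uint64_t prev, size_t prev_len,
				 uint64_t minalign)
{
	return model_align_up(prev + prev_len, minalign);
}

/*
 * True if [a, a + a_len) and [b, b + b_len) touch a common cacheline.
 * Both lengths must be non-zero; line_size must be a power of two.
 */
static bool model_share_cacheline(uint64_t a, size_t a_len,
				  uint64_t b, size_t b_len,
				  uint64_t line_size)
{
	uint64_t a_first = a / line_size;
	uint64_t a_last = (a + a_len - 1) / line_size;
	uint64_t b_first = b / line_size;
	uint64_t b_last = (b + b_len - 1) / line_size;

	return a_first <= b_last && b_first <= a_last;
}
```

With an 8-byte minimum alignment, two 8-byte buffers at 0x1000 and 0x1008 sit in the same 64-byte line. With a 64-byte minimum the second buffer moves to 0x1040 and the lines are distinct, but the two would still share a 256-byte line unless the minimum is raised to 256 as well.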

CONFIG_DMA_BOUNCE_UNALIGNED_KMALLOC (370645f41e6e) was added to allow
smaller kmalloc minimums on those arches by bouncing small unaligned
DMA mappings through SWIOTLB; this patch keeps the existing suppression
for that case, so the cacheline tracking is skipped when that bounce
path is active.

What still hits the WARN isn't independent allocations sharing a
cacheline -- those are already prevented above. It's:
- the same buffer used in multiple concurrent DMA mappings (raid1
  sync, io_uring iopoll buffer reuse, dm-thin, ...)
- userspace DIO buffers, where kmalloc alignment doesn't apply

> For DMA_FROM_DEVICE it is important that the cpu doesn't dirty the cache
> lines.

Right -- that's the real corruption concern on non-coherent arches and
why the WARN keeps firing there with this patch. On coherent arches
bus snooping handles it, which is the scope of the suppression.
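
For reference, the decision the helper in the patch makes can be modelled in user space as a pure function. This is only a sketch: the parameters stand in for dma_get_cache_alignment(), L1_CACHE_BYTES, IS_ENABLED(CONFIG_DMA_BOUNCE_UNALIGNED_KMALLOC) and is_swiotlb_active(dev), and the function name model_tracking_needed is hypothetical.

```c
#include <stdbool.h>

/*
 * Model of the patch's suppression conditions.
 *
 * dma_align      stands in for dma_get_cache_alignment()
 * l1_bytes       stands in for L1_CACHE_BYTES
 * bounce_cfg     stands in for IS_ENABLED(CONFIG_DMA_BOUNCE_UNALIGNED_KMALLOC)
 * swiotlb_active stands in for is_swiotlb_active(dev)
 *
 * Returns true if cacheline overlap tracking should run.
 */
static bool model_tracking_needed(unsigned int dma_align,
				  unsigned int l1_bytes,
				  bool bounce_cfg, bool swiotlb_active)
{
	/* No DMA alignment requirement: treated as hardware-coherent. */
	if (dma_align < l1_bytes)
		return false;

	/* Unaligned kmalloc buffers are bounced through swiotlb. */
	if (bounce_cfg && swiotlb_active)
		return false;

	return true;
}
```

With assumed example values, an alignment of 1 against a 64-byte line returns false (tracking skipped), while an alignment of 64 against a 64-byte line returns true unless both bounce conditions hold.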

> This is probably worse on systems with 256 byte cache lines.

Mechanically yes, but those systems size ARCH_KMALLOC_MINALIGN to
match, so kmalloc allocations still don't overlap each other. The
remaining cases (shared buffers, DIO) are what Christoph is looking at
via the DIO-path alignment requirement.

-- 
Thanks,
Mikhail
Re: [PATCH] dma-debug: skip cacheline overlap tracking on cache-coherent architectures
Posted by Leon Romanovsky 6 days, 15 hours ago
On Mon, May 18, 2026 at 05:23:15PM +0500, Mikhail Gavrilov wrote:
> On Mon, May 18, 2026 at 5:10 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > I would say this reproducer is incorrect. From what I recall, the only two
> > legitimate use cases for cacheline overlap are virtio and RDMA.
> 
> The wild trace in the commit message is NVMe block I/O -- neither virtio
> nor RDMA:
> 
>   add_dma_entry -> debug_dma_map_phys -> dma_map_phys ->
>   blk_dma_map_iter_start -> nvme_map_data
> 
> The block layer submits many concurrent in-flight requests; small
> kmalloc'd buffers naturally land in the same cacheline under high IOPS,
> which is incidental rather than intentional overlap.  Ming Lei's report
> linked in the commit message [1] enumerates additional non-virtio /
> non-RDMA cases hitting the same WARN: liburing iopoll tests, raid1,
> dm-thin and other storage utilities.

Actually, later in that thread, people agreed that this debug message
correctly pointed out the underlying issue in the code.
https://lore.kernel.org/all/20241015075418.GA25487@lst.de/

> 
> > The first intentionally relies on it for small allocations, and the second exports the
> > cachelines to the user space and cannot operate on non‑coherent architectures.
> 
> The reproducer isn't claiming to be either of those.  It deterministically
> reaches the same state-based gate the wild NVMe trace hits
> (!is_cache_clean && overlap > 7, with direction != DMA_TO_DEVICE, after
> the v2 coherent-arch / SWIOTLB-bounce suppressions are evaluated).  Since
> that gate has no subsystem-specific term, any caller -- synthetic or real
> -- reaching it with those state values triggers the same WARN.
> 
> If the broader concern is that the block layer should opt into your
> coherency-attribute work rather than relying on debug-side suppression,
> that's a reasonable longer-term direction.  But it's additive: even with
> opt-in adoption, the WARN remains a false positive on coherent arches
> for callers that don't annotate -- which is exactly what v2 (3d48c9fd78dd)
> already established for the sibling "cacheline tracking EEXIST" err_printk.

How difficult is it to annotate call sites?

Thanks

> 
> [1] https://lore.kernel.org/all/ZwxzdWmYcBK27mUs@fedora/
> 
> -- 
> Thanks,
> Mikhail
> 
Re: [PATCH] dma-debug: skip cacheline overlap tracking on cache-coherent architectures
Posted by Mikhail Gavrilov 6 days, 15 hours ago
On Mon, May 18, 2026 at 5:53 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> Actually, later in that thread, people agreed that this debug message
> correctly pointed out the underlying issue in the code.
> https://lore.kernel.org/all/20241015075418.GA25487@lst.de/

The full thread is more split than that. Christoph in the message
you linked says the warnings are "perfectly valid because the I/O
patterns will create data corruption on non-coherent architectures.
For direct I/O from userspace the kernel can't prevent it".

Dan Williams (original author of the cacheline tracking) earlier in
the same thread:

> I don't see an easy way out of this without instrumenting archs
> that can not support overlapping mappings to opt-in to bounce
> buffering for these cases.
>
> Archs that can support this can skip the opt-in and quiet this
> test, but some of the value is being able to catch boundary
> conditions on more widely available systems.

So Christoph scopes validity to non-coherent arches, and Dan
explicitly recognizes the "coherent arches skip the tracking" path
-- with the trade-off of losing boundary-condition catching on
widely available systems. That's the same trade-off 3d48c9fd78dd
already accepted for the sibling err_printk, which this patch
extends to (2). In both cases the production cost (spurious splats
on real workloads, e.g. NVMe block I/O) outweighs the diagnostic
value on coherent arches where bus snooping prevents the corruption
the warning is about.

> How difficult is it to annotate call sites?

For some callers it's tractable -- virtio via DMA_ATTR_CPU_CACHE_CLEAN,
RDMA via DMA_ATTR_REQUIRE_COHERENT. For others, Christoph himself in
the same thread:

> For direct I/O from userspace the kernel can't prevent it, but
> for raid1 we should be able to do something better. As raid1_
> sync_request is a convoluted and undocumented mess I don't have
> a straight shot answer to what it is doing (wrong) and how to
> fix it.

DIO from userspace is unfixable from the kernel side per that
message; raid1 acknowledged as needing a fix Christoph didn't have.
Two years on, those cases (plus dm-thin and io_uring polled tests
from Ming Lei's report) still don't have an annotation path. This
patch covers what annotation can't reach without preventing future
annotation work.

-- 
Thanks,
Mikhail
Re: [PATCH] dma-debug: skip cacheline overlap tracking on cache-coherent architectures
Posted by Christoph Hellwig 5 days, 21 hours ago
On Mon, May 18, 2026 at 06:29:11PM +0500, Mikhail Gavrilov wrote:
> DIO from userspace is unfixable from the kernel side per that
> message; raid1 acknowledged as needing a fix Christoph didn't have.
> Two years on, those cases (plus dm-thin and io_uring polled tests
> from Ming Lei's report) still don't have an annotation path. This
> patch covers what annotation can't reach without preventing future
> annotation work.

It is not.  We could require cache line alignment for direct I/O
to/from devices that are attached without DMA coherence.

Now that Keith pushed down the checking into the driver that's even
fairly easily doable.
Re: [PATCH] dma-debug: skip cacheline overlap tracking on cache-coherent architectures
Posted by Mikhail Gavrilov 5 days, 20 hours ago
On Tue, May 19, 2026 at 12:18 PM Christoph Hellwig <hch@lst.de> wrote:
>
> It is not.  We could require direct I/O to/from devices that are
> attached without DMA coherence to require cache line alignment.
>
> Now that Keith pushed down the checking into the driver that's even
> fairly easily doable.
>

Good to hear. Is there a posted series or branch I could look at?
If non-coherent DIO gets the alignment requirement, that addresses the
corruption case at the source rather than papering over it in debug,
which would be the cleaner outcome.

Happy to test on x86_64 when something is in flight (no non-coherent
hardware here, but at least the no-regression side is covered).

-- 
Best Regards,
Mike Gavrilov.
Re: [PATCH] dma-debug: skip cacheline overlap tracking on cache-coherent architectures
Posted by Christoph Hellwig 5 days, 19 hours ago
On Tue, May 19, 2026 at 01:03:08PM +0500, Mikhail Gavrilov wrote:
> On Tue, May 19, 2026 at 12:18 PM Christoph Hellwig <hch@lst.de> wrote:
> >
> > It is not.  We could require direct I/O to/from devices that are
> > attached without DMA coherence to require cache line alignment.
> >
> > Now that Keith pushed down the checking into the driver that's even
> > fairly easily doable.
> >
> 
> Good to hear. Is there a posted series or branch I could look at?

This work is upstream, IIRC it got merged around 6.18 or 6.19.

Re: [PATCH] dma-debug: skip cacheline overlap tracking on cache-coherent architectures
Posted by Mikhail Gavrilov 5 days, 18 hours ago
On Tue, May 19, 2026 at 2:28 PM Christoph Hellwig <hch@lst.de> wrote:
>
> This work is upstream, IIRC it got merged around 6.18 or 6.19.

I went through Keith's v6.18/v6.19 series -- the block-size alignment
infrastructure is there (20a0e6276edb and surrounding), but I didn't
find a commit that actually wires cache-line alignment as the
requirement for non-coherent DIO (no dma_get_cache_alignment() or
L1_CACHE_BYTES references under block/ or drivers/nvme/ in that
range). If that enforcement is still future work building on Keith's
infrastructure, it's orthogonal to this patch's coherent-arch
suppression.

-- 
Thanks,
Mikhail
Re: [PATCH] dma-debug: skip cacheline overlap tracking on cache-coherent architectures
Posted by Christoph Hellwig 5 days, 16 hours ago
On Tue, May 19, 2026 at 02:57:45PM +0500, Mikhail Gavrilov wrote:
> On Tue, May 19, 2026 at 2:28 PM Christoph Hellwig <hch@lst.de> wrote:
> >
> > This work is upstream, IIRC it got merged around 6.18 or 6.19.
> 
> I went through Keith's v6.18/v6.19 series -- the block-size alignment
> infrastructure is there (20a0e6276edb and surrounding), but I didn't
> find a commit that actually wires cache-line alignment as the
> requirement for non-coherent DIO (no dma_get_cache_alignment() or
> L1_CACHE_BYTES references under block/ or drivers/nvme/ in that
> range). If that enforcement is still future work building on Keith's
> infrastructure, it's orthogonal to this patch's coherent-arch
> suppression.

Sorry if I was misunderstood.  I meant that the changes to move the
alignment enforcement down to the drivers were merged.  There is no code
to factor the cache line size into that for non-coherent devices.  We'd need
someone who can actually test block and file system I/O on such devices
to help with that.
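
Purely as an illustration of what "factoring the cache line size in" could mean arithmetically, here is a user-space sketch. It is not taken from any posted series; the function names and the idea of passing a per-device coherence flag are assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * True if both the start address and the length are multiples of
 * line_size; line_size must be a power of two.
 */
static bool model_dio_cacheline_aligned(uint64_t addr, size_t len,
					uint64_t line_size)
{
	return ((addr | (uint64_t)len) & (line_size - 1)) == 0;
}

/*
 * Hypothetical policy: coherent devices accept any buffer, while
 * non-coherent devices only accept buffers that own whole cachelines.
 */
static bool model_dio_allowed(uint64_t addr, size_t len,
			      uint64_t line_size, bool dev_coherent)
{
	if (dev_coherent)
		return true;
	return model_dio_cacheline_aligned(addr, len, line_size);
}
```

A buffer that starts and ends on cacheline boundaries cannot share a line with anything else, which is the property the overlap tracking is trying to check after the fact.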