[Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes

Peter Xu posted 10 patches 7 years, 5 months ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20180504030811.28111-1-peterx@redhat.com
Test checkpatch failed
Test docker-build@min-glib passed
Test docker-mingw@fedora passed
Test s390x passed
There is a newer version of this series
[Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes
Posted by Peter Xu 7 years, 5 months ago
v2:
- fix patchew code style warnings
- interval tree: postpone malloc when inserting; simplify node removal
  a bit where appropriate [Jason]
- fix up comment and commit message for the iommu lock patch [Kevin]
- protect the context cache too using the iommu lock [Kevin, Jason]
- add an extensive comment in patch 8 to explain the modify-PTE problem
  [Jason, Kevin]

Online repo:

  https://github.com/xzpeter/qemu/tree/fix-vtd-dma

This series fixes several major problems that current code has:

- Issue 1: when handling very large PSI UNMAP invalidations, the
  current code may skip the notification entirely, while it should
  always be sent.

- Issue 2: the IOTLB is not thread safe, yet block dataplane threads
  can access and update it in parallel (a minimal locking sketch
  follows this list).

- Issue 3: for devices that registered UNMAP-only notifiers (e.g.,
  vhost), we don't really need to do a page walk for PSIs; we can
  deliver the notification directly.

- Issue 4: an unsafe window for MAP-notified devices like vfio-pci
  (and, in the future, vDPA as well).  The problem is that, currently,
  for domain invalidations we do the following to make sure the shadow
  page tables are correctly synced:

  1. unmap the whole address space
  2. replay the whole address space, mapping the existing pages

  However, between step 1 and step 2 there is a small window (it can be
  as long as 3ms) during which the shadow page table is either invalid
  or incomplete (since we are rebuilding it).  That is a fatal problem,
  since the device never knows this is happening and can still DMA into
  that memory.
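
As an illustration of what issue 2 calls for, here is a minimal
locking sketch.  It is not the actual patch 3 — the structure and
helper names below are made up for the example — it only shows the
idea of guarding an IOTLB hash table with a QemuMutex so that vCPU
and dataplane threads can use it concurrently:

  #include "qemu/osdep.h"
  #include "qemu/thread.h"

  /* Hypothetical container; the real IOTLB lives in IntelIOMMUState. */
  typedef struct VTDIOTLBCache {
      QemuMutex lock;          /* serializes vCPU vs. dataplane access */
      GHashTable *iotlb;       /* key: uint64_t gfn, value: uint64_t slpte */
  } VTDIOTLBCache;

  static void vtd_iotlb_cache_init(VTDIOTLBCache *c)
  {
      qemu_mutex_init(&c->lock);
      c->iotlb = g_hash_table_new_full(g_int64_hash, g_int64_equal,
                                       g_free, g_free);
  }

  /* Copy the cached entry out while holding the lock. */
  static bool vtd_iotlb_lookup(VTDIOTLBCache *c, uint64_t gfn,
                               uint64_t *slpte)
  {
      uint64_t *val;
      bool hit;

      qemu_mutex_lock(&c->lock);
      val = g_hash_table_lookup(c->iotlb, &gfn);
      hit = (val != NULL);
      if (hit) {
          *slpte = *val;
      }
      qemu_mutex_unlock(&c->lock);
      return hit;
  }

  /* Invalidations also take the lock before flushing the cache. */
  static void vtd_iotlb_reset(VTDIOTLBCache *c)
  {
      qemu_mutex_lock(&c->lock);
      g_hash_table_remove_all(c->iotlb);
      qemu_mutex_unlock(&c->lock);
  }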

Patch 1 fixes issue 1.  I put it first since it is picked from an old
post.

Patch 2 is a cleanup to remove useless IntelIOMMUNotifierNode struct.

Patch 3 fixes issue 2.

Patch 4 fixes issue 3.

Patches 5-9 fix issue 4.  Here a very simple interval tree is
implemented based on GTree.  It differs from a general interval tree
in that it does not allow the user to attach private data (e.g.,
translated addresses).  The benefit is that we can then merge adjacent
interval leaves, so hopefully we won't consume much memory even when
there are a lot of mappings (which happens for nested virt: when the
whole L2 guest RAM range is mapped, it can easily be GBs).
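
To illustrate the merge-on-insert idea, here is a small self-contained
sketch.  It is not the GTree-based code from patch 7 (which keeps the
ranges in a balanced tree); it only shows, with a sorted list and
made-up names, why dropping per-range private data lets overlapping or
adjacent ranges collapse into one:

  #include <glib.h>
  #include <stdint.h>

  typedef struct ITRange {
      uint64_t start, end;            /* inclusive range [start, end] */
  } ITRange;

  static gint it_range_cmp(gconstpointer a, gconstpointer b)
  {
      const ITRange *ra = a, *rb = b;
      return ra->start < rb->start ? -1 : (ra->start > rb->start ? 1 : 0);
  }

  /* Insert [start, end], merging with any overlapping or adjacent
   * ranges.  Since no private data is attached to a range, merging
   * loses no information. */
  static GList *it_insert(GList *ranges, uint64_t start, uint64_t end)
  {
      ITRange *r = g_new(ITRange, 1);
      GList *l;

      r->start = start;
      r->end = end;
      ranges = g_list_insert_sorted(ranges, r, it_range_cmp);
      l = g_list_find(ranges, r);

      /* Existing ranges never touch each other, so at most one
       * backward merge is needed. */
      if (l->prev) {
          ITRange *p = l->prev->data;
          if (p->end + 1 >= r->start) {
              r->start = p->start;
              if (p->end > r->end) {
                  r->end = p->end;
              }
              g_free(p);
              ranges = g_list_delete_link(ranges, l->prev);
          }
      }

      /* Merge forwards while following ranges overlap or are adjacent. */
      while (l->next) {
          ITRange *n = l->next->data;
          if (r->end + 1 < n->start) {
              break;
          }
          if (n->end > r->end) {
              r->end = n->end;
          }
          g_free(n);
          ranges = g_list_delete_link(ranges, l->next);
      }

      return ranges;
  }

With something like this, mapping a large contiguous region page by
page keeps only a handful of range nodes instead of one node per page.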

Patch 10 is another big cleanup that only works after patch 9.

Tests:

- device assignment to L1 and even L2 guests.  With this series applied
  (and the kernel IOMMU patches: https://lkml.org/lkml/2018/4/18/5),
  we can even nest the vIOMMU now, e.g., we can specify a vIOMMU in an
  L2 guest with assigned devices and things will work.  That was not
  possible before.

- vhost smoke test for regression.

Please review.  Thanks,

Peter Xu (10):
  intel-iommu: send PSI always even if across PDEs
  intel-iommu: remove IntelIOMMUNotifierNode
  intel-iommu: add iommu lock
  intel-iommu: only do page walk for MAP notifiers
  intel-iommu: introduce vtd_page_walk_info
  intel-iommu: pass in address space when page walk
  util: implement simple interval tree logic
  intel-iommu: maintain per-device iova ranges
  intel-iommu: don't unmap all for shadow page table
  intel-iommu: remove notify_unmap for page walk

 include/hw/i386/intel_iommu.h |  19 ++-
 include/qemu/interval-tree.h  | 130 +++++++++++++++
 hw/i386/intel_iommu.c         | 306 +++++++++++++++++++++++++---------
 util/interval-tree.c          | 208 +++++++++++++++++++++++
 hw/i386/trace-events          |   3 +-
 util/Makefile.objs            |   1 +
 6 files changed, 579 insertions(+), 88 deletions(-)
 create mode 100644 include/qemu/interval-tree.h
 create mode 100644 util/interval-tree.c

-- 
2.17.0


Re: [Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes
Posted by no-reply@patchew.org 7 years, 5 months ago
Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 20180504030811.28111-1-peterx@redhat.com
Subject: [Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes

=== TEST SCRIPT BEGIN ===
#!/bin/bash

BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0

git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram

commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
    echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
    if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
        failed=1
        echo
    fi
    n=$((n+1))
done

exit $failed
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 t [tag update]            patchew/20180503213458.2749.28566.stgit@gimli.home -> patchew/20180503213458.2749.28566.stgit@gimli.home
 * [new tag]               patchew/20180504030811.28111-1-peterx@redhat.com -> patchew/20180504030811.28111-1-peterx@redhat.com
Switched to a new branch 'test'
55b6c6b1f5 intel-iommu: remove notify_unmap for page walk
72ffe9314a intel-iommu: don't unmap all for shadow page table
7bb663682b intel-iommu: maintain per-device iova ranges
e77921e63f util: implement simple interval tree logic
356489cd59 intel-iommu: pass in address space when page walk
27fcd3062f intel-iommu: introduce vtd_page_walk_info
953a6db666 intel-iommu: only do page walk for MAP notifiers
0ccfedc1e7 intel-iommu: add iommu lock
47a29fd165 intel-iommu: remove IntelIOMMUNotifierNode
aacd552d3a intel-iommu: send PSI always even if across PDEs

=== OUTPUT BEGIN ===
Checking PATCH 1/10: intel-iommu: send PSI always even if across PDEs...
Checking PATCH 2/10: intel-iommu: remove IntelIOMMUNotifierNode...
Checking PATCH 3/10: intel-iommu: add iommu lock...
Checking PATCH 4/10: intel-iommu: only do page walk for MAP notifiers...
Checking PATCH 5/10: intel-iommu: introduce vtd_page_walk_info...
Checking PATCH 6/10: intel-iommu: pass in address space when page walk...
Checking PATCH 7/10: util: implement simple interval tree logic...
ERROR: space prohibited between function name and open parenthesis '('
#56: FILE: include/qemu/interval-tree.h:33:
+typedef gboolean (*it_tree_iterator)(ITValue start, ITValue end);

total: 1 errors, 0 warnings, 343 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

Checking PATCH 8/10: intel-iommu: maintain per-device iova ranges...
Checking PATCH 9/10: intel-iommu: don't unmap all for shadow page table...
Checking PATCH 10/10: intel-iommu: remove notify_unmap for page walk...
=== OUTPUT END ===

Test command exited with code: 1


---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@redhat.com
Re: [Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes
Posted by Peter Xu 7 years, 5 months ago
On Thu, May 03, 2018 at 08:20:33PM -0700, no-reply@patchew.org wrote:

[...]

> === OUTPUT BEGIN ===
> Checking PATCH 1/10: intel-iommu: send PSI always even if across PDEs...
> Checking PATCH 2/10: intel-iommu: remove IntelIOMMUNotifierNode...
> Checking PATCH 3/10: intel-iommu: add iommu lock...
> Checking PATCH 4/10: intel-iommu: only do page walk for MAP notifiers...
> Checking PATCH 5/10: intel-iommu: introduce vtd_page_walk_info...
> Checking PATCH 6/10: intel-iommu: pass in address space when page walk...
> Checking PATCH 7/10: util: implement simple interval tree logic...
> ERROR: space prohibited between function name and open parenthesis '('
> #56: FILE: include/qemu/interval-tree.h:33:
> +typedef gboolean (*it_tree_iterator)(ITValue start, ITValue end);
> 
> total: 1 errors, 0 warnings, 343 lines checked
> 
> Your patch has style problems, please review.  If any of these errors
> are false positives report them to the maintainer, see
> CHECKPATCH in MAINTAINERS.

This is a false positive, and should be resolved once the patch below
is merged (now in Paolo's queue):

  [PATCH v2] checkpatch.pl: add common glib defines to typelist

-- 
Peter Xu

Re: [Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes
Posted by Peter Xu 7 years, 5 months ago
On Fri, May 04, 2018 at 11:08:01AM +0800, Peter Xu wrote:
> v2:
> - fix patchew code style warnings
> - interval tree: postpone malloc when inserting; simplify node remove
>   a bit where proper [Jason]
> - fix up comment and commit message for iommu lock patch [Kevin]
> - protect context cache too using the iommu lock [Kevin, Jason]
> - add vast comment in patch 8 to explain the modify-PTE problem
>   [Jason, Kevin]

We can hold off a bit on reviewing this series.  Jintack reported a scp
DMAR issue that can happen even in an L1 guest with this series: the
scp can stall randomly after copying tens or hundreds of MBs.  I'm
still investigating the problem.  It should be related to the deferred
flushing in the VT-d kernel driver, since the problem goes away if we
use "intel_iommu=on,strict".  However, I'm still trying to figure out
what is happening behind the scenes with that deferred flushing
feature.

Meanwhile, during the investigation I found another "possibly valid"
use case for the modify-PTE problem that Jason mentioned, when deferred
flushing is in use:

         vcpu1                        vcpu2
      map page A
  explicitly send PSI for A
   queue unmap page A [1]
                                 map the same page A [2]
                                explicitly send PSI for A [3]
   flush unmap page A [4]

Due to deferred flushing, the UNMAP PSI might be postponed (or it may
end up as a DSI) from step [1] to step [4].  If we allocate the same
page somewhere else, we might trigger the modify-PTE case at [2], since
we haven't yet received the deferred PSI from vcpu1 to unmap A.

Note that this will not happen with the latest upstream Linux, since
the IOVA allocation algorithm in the current kernel makes sure that the
IOVA range won't be freed until [4], so we can't really allocate the
same page address at [2].  However, this makes me tend to agree with
Jason and Kevin's worry about potential future issues if this can be
triggered easily by common guest kernel bugs.  So now I'm considering
dropping my mergeable interval tree and instead using a simpler tree
that caches everything, including translated addresses.  The metadata
would possibly take about 2% of the managed memory in that case.
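
For illustration, a node in such a tree might look roughly like the
struct below.  This is purely hypothetical (not from the series), but
it gives a feel for the 2% figure: a few tens of bytes of payload per
node, plus the tree's own per-node overhead, is on the order of 1-2%
of memory that is mapped in 4 KiB pages.

  /* Hypothetical per-mapping node, keyed by guest IOVA.  Because the
   * translated address and permissions are cached here, adjacent nodes
   * can no longer be merged, so there is one node per mapping. */
  typedef struct VTDMapping {
      uint64_t iova;        /* guest IOVA of the mapping */
      uint64_t size;        /* mapping size (page or huge page) */
      uint64_t translated;  /* cached translated (guest physical) address */
      uint8_t  perm;        /* IOMMU_RO / IOMMU_WO / IOMMU_RW */
  } VTDMapping;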

-- 
Peter Xu

Re: [Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes
Posted by Jason Wang 7 years, 5 months ago

On 2018-05-16 14:30, Peter Xu wrote:
> On Fri, May 04, 2018 at 11:08:01AM +0800, Peter Xu wrote:
>> v2:
>> - fix patchew code style warnings
>> - interval tree: postpone malloc when inserting; simplify node remove
>>    a bit where proper [Jason]
>> - fix up comment and commit message for iommu lock patch [Kevin]
>> - protect context cache too using the iommu lock [Kevin, Jason]
>> - add vast comment in patch 8 to explain the modify-PTE problem
>>    [Jason, Kevin]
> We can hold a bit on reviewing this series.  Jintack reported a scp
> DMAR issue that might happen even with L1 guest with this series, and
> the scp can stall after copied tens or hundreds of MBs randomly.  I'm
> still investigating the problem.  This problem should be related to
> deferred flushing of VT-d kernel driver, since the problem will go
> away if we use "intel_iommu=on,strict".  However I'm still trying to
> figure out what's the thing behind the scene even with that deferred
> flushing feature.

I vaguely remember that recent upstream vfio supports delayed flushing;
maybe it's related.

>
> Meanwhile, during the investigation I found another "possibly valid"
> use case about the modify-PTE problem that Jason has mentioned when
> with deferred flushing:
>
>           vcpu1                        vcpu2
>        map page A
>    explicitly send PSI for A
>     queue unmap page A [1]
>                                   map the same page A [2]
>                                  explcitly send PSI for A [3]
>     flush unmap page A [4]
>
> Due to deferred flushing, the UNMAP PSI might be postponed (or it can
> be finally a DSI) from step [1] to step [4].  If we allocate the same
> page somewhere else, we might trigger this modify-PTE at [2] since we
> haven't yet received the deferred PSI to unmap A from vcpu1.
>
> Note that this will not happen with latest upstream Linux, since the
> IOVA allocation algorithm in current Linux kernel made sure that the
> IOVA range won't be freed until [4], so we can't really allocate the
> same page address at [2].

Yes, so the vfio + vIOMMU work will probably uncover more bugs in the
IOMMU driver (especially CM mode). I suspect CM mode has not had
sufficient testing (since it probably wasn't used in any production
environment before the vfio + vIOMMU work).

>    However this let me tend to agree with
> Jason and Kevin's worry on future potential issues if that can be
> triggered easily by common guest kernel bugs.  So now I'm considering
> to drop my mergable interval tree but just use a simpler tree to cache
> everything including translated addresses.  The metadata will possibly
> take 2% of managed memory if with that.
>

Good to know; we can start from the approach we know is correct for
sure, then add optimizations on top.

Thanks

Re: [Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes
Posted by Peter Xu 7 years, 5 months ago
On Wed, May 16, 2018 at 09:57:40PM +0800, Jason Wang wrote:
> 
> 
> On 2018-05-16 14:30, Peter Xu wrote:
> > On Fri, May 04, 2018 at 11:08:01AM +0800, Peter Xu wrote:
> > > v2:
> > > - fix patchew code style warnings
> > > - interval tree: postpone malloc when inserting; simplify node remove
> > >    a bit where proper [Jason]
> > > - fix up comment and commit message for iommu lock patch [Kevin]
> > > - protect context cache too using the iommu lock [Kevin, Jason]
> > > - add vast comment in patch 8 to explain the modify-PTE problem
> > >    [Jason, Kevin]
> > We can hold a bit on reviewing this series.  Jintack reported a scp
> > DMAR issue that might happen even with L1 guest with this series, and
> > the scp can stall after copied tens or hundreds of MBs randomly.  I'm
> > still investigating the problem.  This problem should be related to
> > deferred flushing of VT-d kernel driver, since the problem will go
> > away if we use "intel_iommu=on,strict".  However I'm still trying to
> > figure out what's the thing behind the scene even with that deferred
> > flushing feature.
> 
> I vaguely remember recent upstream vfio support delayed flush, maybe it's
> related.

I'm a bit confused about why vfio is related to the deferred flushing.
Could you provide a pointer for this?

> 
> > 
> > Meanwhile, during the investigation I found another "possibly valid"
> > use case about the modify-PTE problem that Jason has mentioned when
> > with deferred flushing:
> > 
> >           vcpu1                        vcpu2
> >        map page A
> >    explicitly send PSI for A
> >     queue unmap page A [1]
> >                                   map the same page A [2]
> >                                  explcitly send PSI for A [3]
> >     flush unmap page A [4]
> > 
> > Due to deferred flushing, the UNMAP PSI might be postponed (or it can
> > be finally a DSI) from step [1] to step [4].  If we allocate the same
> > page somewhere else, we might trigger this modify-PTE at [2] since we
> > haven't yet received the deferred PSI to unmap A from vcpu1.
> > 
> > Note that this will not happen with latest upstream Linux, since the
> > IOVA allocation algorithm in current Linux kernel made sure that the
> > IOVA range won't be freed until [4], so we can't really allocate the
> > same page address at [2].
> 
> Yes, so the vfio + vIOMMU work will probably uncover more bugs in the IOMMU
> driver (especially CM mode). I suspect CM mode does not have sufficient test
> (since it probably wasn't used in any production environment before the vfio
> + vIOMMU work).

Yes, maybe. I might continue investigating some of this after this
"functional" series.  So my plan is that first we make it work
functionally (we can even tolerate some trivial bugs, but the scp error
doesn't count as one), then we consider other things.

> 
> >    However this let me tend to agree with
> > Jason and Kevin's worry on future potential issues if that can be
> > triggered easily by common guest kernel bugs.  So now I'm considering
> > to drop my mergable interval tree but just use a simpler tree to cache
> > everything including translated addresses.  The metadata will possibly
> > take 2% of managed memory if with that.
> > 
> 
> Good to know this, we can start from the way we know correct for sure then
> optimizations on top.

Yes.

Btw, it's still not really "correct" here AFAIU - the correct behavior
would be to "modify" the PTE without any invalid DMA window.  Even if I
switch to the other method, we still need to unmap and remap, so we
will still have an invalid window for that DMA range.  For example, if
the Linux kernel freed IOVA ranges right after queueing the unmap in
the current VT-d driver (again, it does not do this, I'm just
assuming), then my current series would be affected while the other way
would not.  I'll still add a comment there to mark this as a TODO.

Thanks,

-- 
Peter Xu

Re: [Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes
Posted by Alex Williamson 7 years, 5 months ago
On Thu, 17 May 2018 10:45:44 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Wed, May 16, 2018 at 09:57:40PM +0800, Jason Wang wrote:
> > 
> > 
> > On 2018-05-16 14:30, Peter Xu wrote:
> > > On Fri, May 04, 2018 at 11:08:01AM +0800, Peter Xu wrote:  
> > > > v2:
> > > > - fix patchew code style warnings
> > > > - interval tree: postpone malloc when inserting; simplify node remove
> > > >    a bit where proper [Jason]
> > > > - fix up comment and commit message for iommu lock patch [Kevin]
> > > > - protect context cache too using the iommu lock [Kevin, Jason]
> > > > - add vast comment in patch 8 to explain the modify-PTE problem
> > > >    [Jason, Kevin]  
> > > We can hold a bit on reviewing this series.  Jintack reported a scp
> > > DMAR issue that might happen even with L1 guest with this series, and
> > > the scp can stall after copied tens or hundreds of MBs randomly.  I'm
> > > still investigating the problem.  This problem should be related to
> > > deferred flushing of VT-d kernel driver, since the problem will go
> > > away if we use "intel_iommu=on,strict".  However I'm still trying to
> > > figure out what's the thing behind the scene even with that deferred
> > > flushing feature.  
> > 
> > I vaguely remember recent upstream vfio support delayed flush, maybe it's
> > related.  
> 
> I'm a bit confused on why vfio is related to the deferred flushing.
> Could you provide a pointer for this?

Perhaps referring to this:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6bd06f5a486c06023a618a86e8153b91d26f75f4

Rather than calling iommu_unmap() for each chunk of a mapping we'll
make multiple calls to iommu_unmap_fast() and flush with
iommu_tlb_sync() to defer and batch the hardware flushing.  Thanks,

Alex
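
For reference, a rough sketch of the pattern Alex describes, assuming
the iommu_unmap_fast()/iommu_tlb_sync() kernel API of that era (the
real code in the commit above also batches the page unpinning, which
is omitted here):

  #include <linux/iommu.h>

  /* Illustration only: unmap a range chunk by chunk, deferring the
   * hardware IOTLB flush to a single batched sync at the end instead
   * of flushing after every chunk. */
  static size_t unmap_range_deferred(struct iommu_domain *domain,
                                     unsigned long iova, size_t size)
  {
      size_t unmapped = 0;

      while (unmapped < size) {
          size_t len = iommu_unmap_fast(domain, iova + unmapped,
                                        size - unmapped);

          if (!len) {
              break;
          }
          unmapped += len;
      }

      /* One hardware flush for everything unmapped above. */
      iommu_tlb_sync(domain);

      return unmapped;
  }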

Re: [Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes
Posted by Peter Xu 7 years, 5 months ago
On Wed, May 16, 2018 at 09:39:41PM -0600, Alex Williamson wrote:
> On Thu, 17 May 2018 10:45:44 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > On Wed, May 16, 2018 at 09:57:40PM +0800, Jason Wang wrote:
> > > 
> > > 
> > > On 2018-05-16 14:30, Peter Xu wrote:
> > > > On Fri, May 04, 2018 at 11:08:01AM +0800, Peter Xu wrote:  
> > > > > v2:
> > > > > - fix patchew code style warnings
> > > > > - interval tree: postpone malloc when inserting; simplify node remove
> > > > >    a bit where proper [Jason]
> > > > > - fix up comment and commit message for iommu lock patch [Kevin]
> > > > > - protect context cache too using the iommu lock [Kevin, Jason]
> > > > > - add vast comment in patch 8 to explain the modify-PTE problem
> > > > >    [Jason, Kevin]  
> > > > We can hold a bit on reviewing this series.  Jintack reported a scp
> > > > DMAR issue that might happen even with L1 guest with this series, and
> > > > the scp can stall after copied tens or hundreds of MBs randomly.  I'm
> > > > still investigating the problem.  This problem should be related to
> > > > deferred flushing of VT-d kernel driver, since the problem will go
> > > > away if we use "intel_iommu=on,strict".  However I'm still trying to
> > > > figure out what's the thing behind the scene even with that deferred
> > > > flushing feature.  
> > > 
> > > I vaguely remember recent upstream vfio support delayed flush, maybe it's
> > > related.  
> > 
> > I'm a bit confused on why vfio is related to the deferred flushing.
> > Could you provide a pointer for this?
> 
> Perhaps referring to this:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6bd06f5a486c06023a618a86e8153b91d26f75f4
> 
> Rather than calling iommu_unmap() for each chunk of a mapping we'll
> make multiple calls to iommu_unmap_fast() and flush with
> iommu_tlb_sync() to defer and batch the hardware flushing.  Thanks,

Thanks for the link!

It seems to be a good performance enhancement for vfio, but it might
not be related to the current problem.

My latest clue shows that the issue is possibly caused by replaying
each IOMMU region twice on a single DSI.  For example, when we get one
DSI we actually call vtd_page_walk() twice (the IOMMU region is split
by the MSI region 0xfeeXXXXX, so we now have two notifiers for each
vfio-pci device... and memory_region_iommu_replay_all() will call
vtd_page_walk() twice).  That confuses the IOVA tree a bit.  I'll
verify that and see how I can fix it up.

(PS: it seems that in the above patch unmapped_region_cnt is not needed
 in vfio_unmap_unpin, considering that we already have
 unmapped_region_list there?  If that's correct, then we can remove all
 the references too, e.g. we wouldn't need to pass unmapped_cnt into
 unmap_unpin_fast either.)

Thanks,

-- 
Peter Xu