intel-iommu: ignore SNP bit in scalable mode

[PATCH] intel-iommu: ignore SNP bit in scalable mode

Posted by Jason Wang 4 years, 2 months ago

When booting with scalable mode, I hit this error:

qemu-system-x86_64: vtd_iova_to_slpte: detected splte reserve non-zero iova=0xfffff002, level=0x1slpte=0x102681803)
qemu-system-x86_64: vtd_iommu_translate: detected translation failure (dev=01:00:00, iova=0xfffff002)
qemu-system-x86_64: New fault is not recorded due to compression of faults

This is because the SNP bit is set since Linux kernel commit
6c00612d0cba1 ("iommu/vt-d: Report right snoop capability when using
FL for IOVA") where SNP bit is set if scalable mode is on though this
seems to be an violation on the spec which said the SNP bit is
considered to be reserved if SC is not supported.

To unbreak the guest, ignore the SNP bit for scalable mode first. In
the future we may consider to add SC support.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 hw/i386/intel_iommu.c          | 18 ++++++++++++------
 hw/i386/intel_iommu_internal.h |  2 ++
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 294499ee20..3bcac56c3e 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -969,7 +969,8 @@ static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
 static uint64_t vtd_spte_rsvd[5];
 static uint64_t vtd_spte_rsvd_large[5];
 
-static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
+static bool vtd_slpte_nonzero_rsvd(IntelIOMMUState *s,
+                                   uint64_t slpte, uint32_t level)
 {
     uint64_t rsvd_mask = vtd_spte_rsvd[level];
 
@@ -979,6 +980,10 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
         rsvd_mask = vtd_spte_rsvd_large[level];
     }
 
+    if (s->scalable_mode) {
+        rsvd_mask &= ~VTD_SPTE_SNP;
+    }
+
     return slpte & rsvd_mask;
 }
 
@@ -1054,7 +1059,7 @@ static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
                               iova, level, slpte, is_write);
             return is_write ? -VTD_FR_WRITE : -VTD_FR_READ;
         }
-        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
+        if (vtd_slpte_nonzero_rsvd(s, slpte, level)) {
             error_report_once("%s: detected splte reserve non-zero "
                               "iova=0x%" PRIx64 ", level=0x%" PRIx32
                               "slpte=0x%" PRIx64 ")", __func__, iova,
@@ -1185,7 +1190,8 @@ static int vtd_page_walk_one(IOMMUTLBEvent *event, vtd_page_walk_info *info)
  * @write: whether parent level has write permission
  * @info: constant information for the page walk
  */
-static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
+static int vtd_page_walk_level(IntelIOMMUState *s,
+                               dma_addr_t addr, uint64_t start,
                                uint64_t end, uint32_t level, bool read,
                                bool write, vtd_page_walk_info *info)
 {
@@ -1214,7 +1220,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
             goto next;
         }
 
-        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
+        if (vtd_slpte_nonzero_rsvd(s, slpte, level)) {
             trace_vtd_page_walk_skip_reserve(iova, iova_next);
             goto next;
         }
@@ -1235,7 +1241,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
              * This is a valid PDE (or even bigger than PDE).  We need
              * to walk one further level.
              */
-            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, info->aw),
+            ret = vtd_page_walk_level(s, vtd_get_slpte_addr(slpte, info->aw),
                                       iova, MIN(iova_next, end), level - 1,
                                       read_cur, write_cur, info);
         } else {
@@ -1294,7 +1300,7 @@ static int vtd_page_walk(IntelIOMMUState *s, VTDContextEntry *ce,
         end = vtd_iova_limit(s, ce, info->aw);
     }
 
-    return vtd_page_walk_level(addr, start, end, level, true, true, info);
+    return vtd_page_walk_level(s, addr, start, end, level, true, true, info);
 }
 
 static int vtd_root_entry_rsvd_bits_check(IntelIOMMUState *s,
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 3d5487fe2c..a6c788049b 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -388,6 +388,8 @@ typedef union VTDInvDesc VTDInvDesc;
 #define VTD_INV_DESC_DEVICE_IOTLB_RSVD_LO 0xffff0000ffe0fff8
 
 /* Rsvd field masks for spte */
+#define VTD_SPTE_SNP 0x800ULL
+
 #define VTD_SPTE_PAGE_L1_RSVD_MASK(aw, dt_supported) \
         dt_supported ? \
         (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM | VTD_SL_TM)) : \
-- 
2.25.1

Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode

Posted by Peter Xu 4 years, 2 months ago

On Wed, Nov 24, 2021 at 02:03:09PM +0800, Jason Wang wrote:
> When booting with scalable mode, I hit this error:
> 
> qemu-system-x86_64: vtd_iova_to_slpte: detected splte reserve non-zero iova=0xfffff002, level=0x1slpte=0x102681803)
> qemu-system-x86_64: vtd_iommu_translate: detected translation failure (dev=01:00:00, iova=0xfffff002)
> qemu-system-x86_64: New fault is not recorded due to compression of faults
> 
> This is because the SNP bit is set since Linux kernel commit
> 6c00612d0cba1 ("iommu/vt-d: Report right snoop capability when using
> FL for IOVA") where SNP bit is set if scalable mode is on though this
> seems to be an violation on the spec which said the SNP bit is
> considered to be reserved if SC is not supported.

When I was reading that commit, I was actually confused by this change:

---8<---
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 956a02eb40b4..0ee5f1bd8af2 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -658,7 +658,14 @@ static int domain_update_iommu_snooping(struct intel_iommu *skip)
        rcu_read_lock();
        for_each_active_iommu(iommu, drhd) {
                if (iommu != skip) {
-                       if (!ecap_sc_support(iommu->ecap)) {
+                       /*
+                        * If the hardware is operating in the scalable mode,
+                        * the snooping control is always supported since we
+                        * always set PASID-table-entry.PGSNP bit if the domain
+                        * is managed outside (UNMANAGED).
+                        */
+                       if (!sm_supported(iommu) &&
+                           !ecap_sc_support(iommu->ecap)) {
                                ret = 0;
                                break;
                        }
---8<---

Does it mean that for some hardwares that has sm_supported()==true, it'll have
SC bit cleared in ecap register?  That sounds odd, and not sure why.  Maybe Yi
Liu or Yi Sun may know?

> 
> To unbreak the guest, ignore the SNP bit for scalable mode first. In
> the future we may consider to add SC support.

Oh yes, I remembered the last time we discussed this.  Could you remind me
what's missing for us to support SC?

IIUC, for common device emulations we can just declare SC==1, right?  As all
the DMAs (including kernel accels like vhost) will be from host processors so
there're no coherent issues with guest vcpu threads.

If that's correct, the only challenge is device assignment in any form (I am
not familiar with vdpa; so perhaps that includes vfio, vpda and any other kind
of assigning host devices to guest?), then we'll try to detect IOMMU_CACHE
capability from the host iommu groups that covers the assigned devices, and we
only set SC==1 if we have cache coherency on all the devices?

> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  hw/i386/intel_iommu.c          | 18 ++++++++++++------
>  hw/i386/intel_iommu_internal.h |  2 ++
>  2 files changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 294499ee20..3bcac56c3e 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -969,7 +969,8 @@ static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
>  static uint64_t vtd_spte_rsvd[5];
>  static uint64_t vtd_spte_rsvd_large[5];
>  
> -static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> +static bool vtd_slpte_nonzero_rsvd(IntelIOMMUState *s,
> +                                   uint64_t slpte, uint32_t level)
>  {
>      uint64_t rsvd_mask = vtd_spte_rsvd[level];
>  
> @@ -979,6 +980,10 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
>          rsvd_mask = vtd_spte_rsvd_large[level];
>      }
>  
> +    if (s->scalable_mode) {
> +        rsvd_mask &= ~VTD_SPTE_SNP;
> +    }

IMHO what we want to do is only to skip the leaves of pgtables on SNP, so maybe
we still want to keep checking the bit 11 reserved for e.g. common pgtable dir
entries?

To do so, how about directly modifying the vtd_spte_rsvd* fields in vtd_init()?
I think we only need to modify 4k/2m/1g entries to mask bit 11, they're:

  - vtd_spte_rsvd[1] (4K)
  - vtd_spte_rsvd_large[2] (2M)
  - vtd_spte_rsvd_large[3] (1G)

What do you think?  Then we avoid passing IntelIOMMUState* all over too.

> +
>      return slpte & rsvd_mask;
>  }
>  
> @@ -1054,7 +1059,7 @@ static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
>                                iova, level, slpte, is_write);
>              return is_write ? -VTD_FR_WRITE : -VTD_FR_READ;
>          }
> -        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
> +        if (vtd_slpte_nonzero_rsvd(s, slpte, level)) {
>              error_report_once("%s: detected splte reserve non-zero "
>                                "iova=0x%" PRIx64 ", level=0x%" PRIx32
>                                "slpte=0x%" PRIx64 ")", __func__, iova,
> @@ -1185,7 +1190,8 @@ static int vtd_page_walk_one(IOMMUTLBEvent *event, vtd_page_walk_info *info)
>   * @write: whether parent level has write permission
>   * @info: constant information for the page walk
>   */
> -static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
> +static int vtd_page_walk_level(IntelIOMMUState *s,
> +                               dma_addr_t addr, uint64_t start,
>                                 uint64_t end, uint32_t level, bool read,
>                                 bool write, vtd_page_walk_info *info)
>  {
> @@ -1214,7 +1220,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
>              goto next;
>          }
>  
> -        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
> +        if (vtd_slpte_nonzero_rsvd(s, slpte, level)) {
>              trace_vtd_page_walk_skip_reserve(iova, iova_next);
>              goto next;
>          }
> @@ -1235,7 +1241,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
>               * This is a valid PDE (or even bigger than PDE).  We need
>               * to walk one further level.
>               */
> -            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, info->aw),
> +            ret = vtd_page_walk_level(s, vtd_get_slpte_addr(slpte, info->aw),
>                                        iova, MIN(iova_next, end), level - 1,
>                                        read_cur, write_cur, info);
>          } else {
> @@ -1294,7 +1300,7 @@ static int vtd_page_walk(IntelIOMMUState *s, VTDContextEntry *ce,
>          end = vtd_iova_limit(s, ce, info->aw);
>      }
>  
> -    return vtd_page_walk_level(addr, start, end, level, true, true, info);
> +    return vtd_page_walk_level(s, addr, start, end, level, true, true, info);
>  }
>  
>  static int vtd_root_entry_rsvd_bits_check(IntelIOMMUState *s,
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 3d5487fe2c..a6c788049b 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -388,6 +388,8 @@ typedef union VTDInvDesc VTDInvDesc;
>  #define VTD_INV_DESC_DEVICE_IOTLB_RSVD_LO 0xffff0000ffe0fff8
>  
>  /* Rsvd field masks for spte */
> +#define VTD_SPTE_SNP 0x800ULL
> +
>  #define VTD_SPTE_PAGE_L1_RSVD_MASK(aw, dt_supported) \
>          dt_supported ? \
>          (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM | VTD_SL_TM)) : \
> -- 
> 2.25.1
> 

-- 
Peter Xu

Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode

Posted by Jason Wang 4 years, 2 months ago

On Wed, Nov 24, 2021 at 3:57 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Nov 24, 2021 at 02:03:09PM +0800, Jason Wang wrote:
> > When booting with scalable mode, I hit this error:
> >
> > qemu-system-x86_64: vtd_iova_to_slpte: detected splte reserve non-zero iova=0xfffff002, level=0x1slpte=0x102681803)
> > qemu-system-x86_64: vtd_iommu_translate: detected translation failure (dev=01:00:00, iova=0xfffff002)
> > qemu-system-x86_64: New fault is not recorded due to compression of faults
> >
> > This is because the SNP bit is set since Linux kernel commit
> > 6c00612d0cba1 ("iommu/vt-d: Report right snoop capability when using
> > FL for IOVA") where SNP bit is set if scalable mode is on though this
> > seems to be an violation on the spec which said the SNP bit is
> > considered to be reserved if SC is not supported.
>
> When I was reading that commit, I was actually confused by this change:
>
> ---8<---
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 956a02eb40b4..0ee5f1bd8af2 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -658,7 +658,14 @@ static int domain_update_iommu_snooping(struct intel_iommu *skip)
>         rcu_read_lock();
>         for_each_active_iommu(iommu, drhd) {
>                 if (iommu != skip) {
> -                       if (!ecap_sc_support(iommu->ecap)) {
> +                       /*
> +                        * If the hardware is operating in the scalable mode,
> +                        * the snooping control is always supported since we
> +                        * always set PASID-table-entry.PGSNP bit if the domain
> +                        * is managed outside (UNMANAGED).
> +                        */
> +                       if (!sm_supported(iommu) &&
> +                           !ecap_sc_support(iommu->ecap)) {
>                                 ret = 0;
>                                 break;
>                         }
> ---8<---
>
> Does it mean that for some hardwares that has sm_supported()==true, it'll have
> SC bit cleared in ecap register?

I guess not, so it's probably only the problem of vIOMMU.

> That sounds odd, and not sure why.  Maybe Yi
> Liu or Yi Sun may know?

Another interesting point is that, it looks to me after that commit
SNP is used for the domain that is not UNMANAGED even if PGSNP is not
set.


>
> >
> > To unbreak the guest, ignore the SNP bit for scalable mode first. In
> > the future we may consider to add SC support.
>
> Oh yes, I remembered the last time we discussed this.  Could you remind me
> what's missing for us to support SC?

Exactly what you described below.

>
> IIUC, for common device emulations we can just declare SC==1, right?

Strictly speaking, only safe for the datapath that is running in the
Qemu. For things like vhost-user, I'm not sure it can check CC when
using VFIO.

>  As all
> the DMAs (including kernel accels like vhost) will be from host processors so
> there're no coherent issues with guest vcpu threads.
>
> If that's correct, the only challenge is device assignment in any form (I am
> not familiar with vdpa; so perhaps that includes vfio, vpda and any other kind
> of assigning host devices to guest?), then we'll try to detect IOMMU_CACHE
> capability from the host iommu groups that covers the assigned devices, and we
> only set SC==1 if we have cache coherency on all the devices?

For VFIO yes, and we should prevent VFIO without CC to be plugged if
SC is advertised.

For vDPA, we don't need to worry about it at all, kernel vDPA forces
IOMMU_CACHE now.

vhost_vdpa_alloc_domain():

        if (!iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
                return -ENOTSUPP;

(For device with on-chip IOMMU, it's the parent and device that
guarantees the CC)

Thanks


>
> >
> > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > ---
> >  hw/i386/intel_iommu.c          | 18 ++++++++++++------
> >  hw/i386/intel_iommu_internal.h |  2 ++
> >  2 files changed, 14 insertions(+), 6 deletions(-)
> >
> > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > index 294499ee20..3bcac56c3e 100644
> > --- a/hw/i386/intel_iommu.c
> > +++ b/hw/i386/intel_iommu.c
> > @@ -969,7 +969,8 @@ static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
> >  static uint64_t vtd_spte_rsvd[5];
> >  static uint64_t vtd_spte_rsvd_large[5];
> >
> > -static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> > +static bool vtd_slpte_nonzero_rsvd(IntelIOMMUState *s,
> > +                                   uint64_t slpte, uint32_t level)
> >  {
> >      uint64_t rsvd_mask = vtd_spte_rsvd[level];
> >
> > @@ -979,6 +980,10 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> >          rsvd_mask = vtd_spte_rsvd_large[level];
> >      }
> >
> > +    if (s->scalable_mode) {
> > +        rsvd_mask &= ~VTD_SPTE_SNP;
> > +    }
>
> IMHO what we want to do is only to skip the leaves of pgtables on SNP, so maybe
> we still want to keep checking the bit 11 reserved for e.g. common pgtable dir
> entries?
>
> To do so, how about directly modifying the vtd_spte_rsvd* fields in vtd_init()?
> I think we only need to modify 4k/2m/1g entries to mask bit 11, they're:
>
>   - vtd_spte_rsvd[1] (4K)
>   - vtd_spte_rsvd_large[2] (2M)
>   - vtd_spte_rsvd_large[3] (1G)
>
> What do you think?  Then we avoid passing IntelIOMMUState* all over too.
>
> > +
> >      return slpte & rsvd_mask;
> >  }
> >
> > @@ -1054,7 +1059,7 @@ static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
> >                                iova, level, slpte, is_write);
> >              return is_write ? -VTD_FR_WRITE : -VTD_FR_READ;
> >          }
> > -        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
> > +        if (vtd_slpte_nonzero_rsvd(s, slpte, level)) {
> >              error_report_once("%s: detected splte reserve non-zero "
> >                                "iova=0x%" PRIx64 ", level=0x%" PRIx32
> >                                "slpte=0x%" PRIx64 ")", __func__, iova,
> > @@ -1185,7 +1190,8 @@ static int vtd_page_walk_one(IOMMUTLBEvent *event, vtd_page_walk_info *info)
> >   * @write: whether parent level has write permission
> >   * @info: constant information for the page walk
> >   */
> > -static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
> > +static int vtd_page_walk_level(IntelIOMMUState *s,
> > +                               dma_addr_t addr, uint64_t start,
> >                                 uint64_t end, uint32_t level, bool read,
> >                                 bool write, vtd_page_walk_info *info)
> >  {
> > @@ -1214,7 +1220,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
> >              goto next;
> >          }
> >
> > -        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
> > +        if (vtd_slpte_nonzero_rsvd(s, slpte, level)) {
> >              trace_vtd_page_walk_skip_reserve(iova, iova_next);
> >              goto next;
> >          }
> > @@ -1235,7 +1241,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
> >               * This is a valid PDE (or even bigger than PDE).  We need
> >               * to walk one further level.
> >               */
> > -            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, info->aw),
> > +            ret = vtd_page_walk_level(s, vtd_get_slpte_addr(slpte, info->aw),
> >                                        iova, MIN(iova_next, end), level - 1,
> >                                        read_cur, write_cur, info);
> >          } else {
> > @@ -1294,7 +1300,7 @@ static int vtd_page_walk(IntelIOMMUState *s, VTDContextEntry *ce,
> >          end = vtd_iova_limit(s, ce, info->aw);
> >      }
> >
> > -    return vtd_page_walk_level(addr, start, end, level, true, true, info);
> > +    return vtd_page_walk_level(s, addr, start, end, level, true, true, info);
> >  }
> >
> >  static int vtd_root_entry_rsvd_bits_check(IntelIOMMUState *s,
> > diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> > index 3d5487fe2c..a6c788049b 100644
> > --- a/hw/i386/intel_iommu_internal.h
> > +++ b/hw/i386/intel_iommu_internal.h
> > @@ -388,6 +388,8 @@ typedef union VTDInvDesc VTDInvDesc;
> >  #define VTD_INV_DESC_DEVICE_IOTLB_RSVD_LO 0xffff0000ffe0fff8
> >
> >  /* Rsvd field masks for spte */
> > +#define VTD_SPTE_SNP 0x800ULL
> > +
> >  #define VTD_SPTE_PAGE_L1_RSVD_MASK(aw, dt_supported) \
> >          dt_supported ? \
> >          (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM | VTD_SL_TM)) : \
> > --
> > 2.25.1
> >
>
> --
> Peter Xu
>

Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode

Posted by Peter Xu 4 years, 2 months ago

On Wed, Nov 24, 2021 at 04:28:52PM +0800, Jason Wang wrote:
> On Wed, Nov 24, 2021 at 3:57 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, Nov 24, 2021 at 02:03:09PM +0800, Jason Wang wrote:
> > > When booting with scalable mode, I hit this error:
> > >
> > > qemu-system-x86_64: vtd_iova_to_slpte: detected splte reserve non-zero iova=0xfffff002, level=0x1slpte=0x102681803)
> > > qemu-system-x86_64: vtd_iommu_translate: detected translation failure (dev=01:00:00, iova=0xfffff002)
> > > qemu-system-x86_64: New fault is not recorded due to compression of faults
> > >
> > > This is because the SNP bit is set since Linux kernel commit
> > > 6c00612d0cba1 ("iommu/vt-d: Report right snoop capability when using
> > > FL for IOVA") where SNP bit is set if scalable mode is on though this
> > > seems to be an violation on the spec which said the SNP bit is
> > > considered to be reserved if SC is not supported.
> >
> > When I was reading that commit, I was actually confused by this change:
> >
> > ---8<---
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index 956a02eb40b4..0ee5f1bd8af2 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -658,7 +658,14 @@ static int domain_update_iommu_snooping(struct intel_iommu *skip)
> >         rcu_read_lock();
> >         for_each_active_iommu(iommu, drhd) {
> >                 if (iommu != skip) {
> > -                       if (!ecap_sc_support(iommu->ecap)) {
> > +                       /*
> > +                        * If the hardware is operating in the scalable mode,
> > +                        * the snooping control is always supported since we
> > +                        * always set PASID-table-entry.PGSNP bit if the domain
> > +                        * is managed outside (UNMANAGED).
> > +                        */
> > +                       if (!sm_supported(iommu) &&
> > +                           !ecap_sc_support(iommu->ecap)) {
> >                                 ret = 0;
> >                                 break;
> >                         }
> > ---8<---
> >
> > Does it mean that for some hardwares that has sm_supported()==true, it'll have
> > SC bit cleared in ecap register?
> 
> I guess not, so it's probably only the problem of vIOMMU.

But then what does the code mean above?

If SC is required for scalable mode, ecap_sc_support()==false already implies
sm_supported()==false too.  Then that check seems redundant.

> 
> > That sounds odd, and not sure why.  Maybe Yi
> > Liu or Yi Sun may know?
> 
> Another interesting point is that, it looks to me after that commit
> SNP is used for the domain that is not UNMANAGED even if PGSNP is not
> set.
> 
> 
> >
> > >
> > > To unbreak the guest, ignore the SNP bit for scalable mode first. In
> > > the future we may consider to add SC support.
> >
> > Oh yes, I remembered the last time we discussed this.  Could you remind me
> > what's missing for us to support SC?
> 
> Exactly what you described below.
> 
> >
> > IIUC, for common device emulations we can just declare SC==1, right?
> 
> Strictly speaking, only safe for the datapath that is running in the
> Qemu. For things like vhost-user, I'm not sure it can check CC when
> using VFIO.

Hmm yeah.. though I'll just call those vhost-user backends to fall into below
"device assignment" category too.  Great to know we're on the same page here.

> 
> >  As all
> > the DMAs (including kernel accels like vhost) will be from host processors so
> > there're no coherent issues with guest vcpu threads.
> >
> > If that's correct, the only challenge is device assignment in any form (I am
> > not familiar with vdpa; so perhaps that includes vfio, vpda and any other kind
> > of assigning host devices to guest?), then we'll try to detect IOMMU_CACHE
> > capability from the host iommu groups that covers the assigned devices, and we
> > only set SC==1 if we have cache coherency on all the devices?
> 
> For VFIO yes, and we should prevent VFIO without CC to be plugged if
> SC is advertised.
> 
> For vDPA, we don't need to worry about it at all, kernel vDPA forces
> IOMMU_CACHE now.
> 
> vhost_vdpa_alloc_domain():
> 
>         if (!iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
>                 return -ENOTSUPP;
> 
> (For device with on-chip IOMMU, it's the parent and device that
> guarantees the CC)

Ah right, yes you mentioned it and I forgot..  Though I'm not sure we'd simply
double-check again here (if we'll support vfio anyway, then we'll need to be
able to read those out from IOMMU groups), because we shouldn't rely on that
fact which is an implementation detail of vdpa, imho (say, when vdpa starts to
support !SC someday).

PS: I have other comments below in previous reply - please have a look too! :-D

> 
> Thanks
> 
> 
> >
> > >
> > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > ---
> > >  hw/i386/intel_iommu.c          | 18 ++++++++++++------
> > >  hw/i386/intel_iommu_internal.h |  2 ++
> > >  2 files changed, 14 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > index 294499ee20..3bcac56c3e 100644
> > > --- a/hw/i386/intel_iommu.c
> > > +++ b/hw/i386/intel_iommu.c
> > > @@ -969,7 +969,8 @@ static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
> > >  static uint64_t vtd_spte_rsvd[5];
> > >  static uint64_t vtd_spte_rsvd_large[5];
> > >
> > > -static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> > > +static bool vtd_slpte_nonzero_rsvd(IntelIOMMUState *s,
> > > +                                   uint64_t slpte, uint32_t level)
> > >  {
> > >      uint64_t rsvd_mask = vtd_spte_rsvd[level];
> > >
> > > @@ -979,6 +980,10 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> > >          rsvd_mask = vtd_spte_rsvd_large[level];
> > >      }
> > >
> > > +    if (s->scalable_mode) {
> > > +        rsvd_mask &= ~VTD_SPTE_SNP;
> > > +    }
> >
> > IMHO what we want to do is only to skip the leaves of pgtables on SNP, so maybe
> > we still want to keep checking the bit 11 reserved for e.g. common pgtable dir
> > entries?
> >
> > To do so, how about directly modifying the vtd_spte_rsvd* fields in vtd_init()?
> > I think we only need to modify 4k/2m/1g entries to mask bit 11, they're:
> >
> >   - vtd_spte_rsvd[1] (4K)
> >   - vtd_spte_rsvd_large[2] (2M)
> >   - vtd_spte_rsvd_large[3] (1G)
> >
> > What do you think?  Then we avoid passing IntelIOMMUState* all over too.

[Here]

> >
> > > +
> > >      return slpte & rsvd_mask;
> > >  }
> > >
> > > @@ -1054,7 +1059,7 @@ static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
> > >                                iova, level, slpte, is_write);
> > >              return is_write ? -VTD_FR_WRITE : -VTD_FR_READ;
> > >          }
> > > -        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
> > > +        if (vtd_slpte_nonzero_rsvd(s, slpte, level)) {
> > >              error_report_once("%s: detected splte reserve non-zero "
> > >                                "iova=0x%" PRIx64 ", level=0x%" PRIx32
> > >                                "slpte=0x%" PRIx64 ")", __func__, iova,
> > > @@ -1185,7 +1190,8 @@ static int vtd_page_walk_one(IOMMUTLBEvent *event, vtd_page_walk_info *info)
> > >   * @write: whether parent level has write permission
> > >   * @info: constant information for the page walk
> > >   */
> > > -static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
> > > +static int vtd_page_walk_level(IntelIOMMUState *s,
> > > +                               dma_addr_t addr, uint64_t start,
> > >                                 uint64_t end, uint32_t level, bool read,
> > >                                 bool write, vtd_page_walk_info *info)
> > >  {
> > > @@ -1214,7 +1220,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
> > >              goto next;
> > >          }
> > >
> > > -        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
> > > +        if (vtd_slpte_nonzero_rsvd(s, slpte, level)) {
> > >              trace_vtd_page_walk_skip_reserve(iova, iova_next);
> > >              goto next;
> > >          }
> > > @@ -1235,7 +1241,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
> > >               * This is a valid PDE (or even bigger than PDE).  We need
> > >               * to walk one further level.
> > >               */
> > > -            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, info->aw),
> > > +            ret = vtd_page_walk_level(s, vtd_get_slpte_addr(slpte, info->aw),
> > >                                        iova, MIN(iova_next, end), level - 1,
> > >                                        read_cur, write_cur, info);
> > >          } else {
> > > @@ -1294,7 +1300,7 @@ static int vtd_page_walk(IntelIOMMUState *s, VTDContextEntry *ce,
> > >          end = vtd_iova_limit(s, ce, info->aw);
> > >      }
> > >
> > > -    return vtd_page_walk_level(addr, start, end, level, true, true, info);
> > > +    return vtd_page_walk_level(s, addr, start, end, level, true, true, info);
> > >  }
> > >
> > >  static int vtd_root_entry_rsvd_bits_check(IntelIOMMUState *s,
> > > diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> > > index 3d5487fe2c..a6c788049b 100644
> > > --- a/hw/i386/intel_iommu_internal.h
> > > +++ b/hw/i386/intel_iommu_internal.h
> > > @@ -388,6 +388,8 @@ typedef union VTDInvDesc VTDInvDesc;
> > >  #define VTD_INV_DESC_DEVICE_IOTLB_RSVD_LO 0xffff0000ffe0fff8
> > >
> > >  /* Rsvd field masks for spte */
> > > +#define VTD_SPTE_SNP 0x800ULL
> > > +
> > >  #define VTD_SPTE_PAGE_L1_RSVD_MASK(aw, dt_supported) \
> > >          dt_supported ? \
> > >          (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM | VTD_SL_TM)) : \
> > > --
> > > 2.25.1
> > >
> >
> > --
> > Peter Xu
> >
> 

-- 
Peter Xu

Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode

Posted by Jason Wang 4 years, 2 months ago

On Wed, Nov 24, 2021 at 4:52 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Nov 24, 2021 at 04:28:52PM +0800, Jason Wang wrote:
> > On Wed, Nov 24, 2021 at 3:57 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Wed, Nov 24, 2021 at 02:03:09PM +0800, Jason Wang wrote:
> > > > When booting with scalable mode, I hit this error:
> > > >
> > > > qemu-system-x86_64: vtd_iova_to_slpte: detected splte reserve non-zero iova=0xfffff002, level=0x1slpte=0x102681803)
> > > > qemu-system-x86_64: vtd_iommu_translate: detected translation failure (dev=01:00:00, iova=0xfffff002)
> > > > qemu-system-x86_64: New fault is not recorded due to compression of faults
> > > >
> > > > This is because the SNP bit is set since Linux kernel commit
> > > > 6c00612d0cba1 ("iommu/vt-d: Report right snoop capability when using
> > > > FL for IOVA") where SNP bit is set if scalable mode is on though this
> > > > seems to be an violation on the spec which said the SNP bit is
> > > > considered to be reserved if SC is not supported.
> > >
> > > When I was reading that commit, I was actually confused by this change:
> > >
> > > ---8<---
> > > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > > index 956a02eb40b4..0ee5f1bd8af2 100644
> > > --- a/drivers/iommu/intel/iommu.c
> > > +++ b/drivers/iommu/intel/iommu.c
> > > @@ -658,7 +658,14 @@ static int domain_update_iommu_snooping(struct intel_iommu *skip)
> > >         rcu_read_lock();
> > >         for_each_active_iommu(iommu, drhd) {
> > >                 if (iommu != skip) {
> > > -                       if (!ecap_sc_support(iommu->ecap)) {
> > > +                       /*
> > > +                        * If the hardware is operating in the scalable mode,
> > > +                        * the snooping control is always supported since we
> > > +                        * always set PASID-table-entry.PGSNP bit if the domain
> > > +                        * is managed outside (UNMANAGED).
> > > +                        */
> > > +                       if (!sm_supported(iommu) &&
> > > +                           !ecap_sc_support(iommu->ecap)) {
> > >                                 ret = 0;
> > >                                 break;
> > >                         }
> > > ---8<---
> > >
> > > Does it mean that for some hardwares that has sm_supported()==true, it'll have
> > > SC bit cleared in ecap register?
> >
> > I guess not, so it's probably only the problem of vIOMMU.
>
> But then what does the code mean above?
>
> If SC is required for scalable mode,

I guess the code was tested on hardware with both SC and SM.

> ecap_sc_support()==false already implies
> sm_supported()==false too.  Then that check seems redundant.

To tell the truth, I'm not sure. And according to the spec SNP is
reserved if SC==false.

>
> >
> > > That sounds odd, and not sure why.  Maybe Yi
> > > Liu or Yi Sun may know?
> >
> > Another interesting point is that, it looks to me after that commit
> > SNP is used for the domain that is not UNMANAGED even if PGSNP is not
> > set.
> >
> >
> > >
> > > >
> > > > To unbreak the guest, ignore the SNP bit for scalable mode first. In
> > > > the future we may consider to add SC support.
> > >
> > > Oh yes, I remembered the last time we discussed this.  Could you remind me
> > > what's missing for us to support SC?
> >
> > Exactly what you described below.
> >
> > >
> > > IIUC, for common device emulations we can just declare SC==1, right?
> >
> > Strictly speaking, only safe for the datapath that is running in the
> > Qemu. For things like vhost-user, I'm not sure it can check CC when
> > using VFIO.
>
> Hmm yeah.. though I'll just call those vhost-user backends to fall into below
> "device assignment" category too.  Great to know we're on the same page here.

I see.

>
> >
> > >  As all
> > > the DMAs (including kernel accels like vhost) will be from host processors so
> > > there're no coherent issues with guest vcpu threads.
> > >
> > > If that's correct, the only challenge is device assignment in any form (I am
> > > not familiar with vdpa; so perhaps that includes vfio, vpda and any other kind
> > > of assigning host devices to guest?), then we'll try to detect IOMMU_CACHE
> > > capability from the host iommu groups that covers the assigned devices, and we
> > > only set SC==1 if we have cache coherency on all the devices?
> >
> > For VFIO yes, and we should prevent VFIO without CC to be plugged if
> > SC is advertised.
> >
> > For vDPA, we don't need to worry about it at all, kernel vDPA forces
> > IOMMU_CACHE now.
> >
> > vhost_vdpa_alloc_domain():
> >
> >         if (!iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
> >                 return -ENOTSUPP;
> >
> > (For device with on-chip IOMMU, it's the parent and device that
> > guarantees the CC)
>
> Ah right, yes you mentioned it and I forgot..  Though I'm not sure we'd simply
> double-check again here (if we'll support vfio anyway, then we'll need to be
> able to read those out from IOMMU groups), because we shouldn't rely on that
> fact which is an implementation detail of vdpa, imho (say, when vdpa starts to
> support !SC someday).

Good point, but at that time we need to expose the !SC via uAPI which
is currently not supported.

>
> PS: I have other comments below in previous reply - please have a look too! :-D

Sorry for missing that part :(

>
> >
> > Thanks
> >
> >
> > >
> > > >
> > > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > > ---
> > > >  hw/i386/intel_iommu.c          | 18 ++++++++++++------
> > > >  hw/i386/intel_iommu_internal.h |  2 ++
> > > >  2 files changed, 14 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > > index 294499ee20..3bcac56c3e 100644
> > > > --- a/hw/i386/intel_iommu.c
> > > > +++ b/hw/i386/intel_iommu.c
> > > > @@ -969,7 +969,8 @@ static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
> > > >  static uint64_t vtd_spte_rsvd[5];
> > > >  static uint64_t vtd_spte_rsvd_large[5];
> > > >
> > > > -static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> > > > +static bool vtd_slpte_nonzero_rsvd(IntelIOMMUState *s,
> > > > +                                   uint64_t slpte, uint32_t level)
> > > >  {
> > > >      uint64_t rsvd_mask = vtd_spte_rsvd[level];
> > > >
> > > > @@ -979,6 +980,10 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> > > >          rsvd_mask = vtd_spte_rsvd_large[level];
> > > >      }
> > > >
> > > > +    if (s->scalable_mode) {
> > > > +        rsvd_mask &= ~VTD_SPTE_SNP;
> > > > +    }
> > >
> > > IMHO what we want to do is only to skip the leaves of pgtables on SNP, so maybe
> > > we still want to keep checking the bit 11 reserved for e.g. common pgtable dir
> > > entries?

Maybe, but it's probably a question that can only be answered by
Intel. I can change it for the next version if you stick.

> > >
> > > To do so, how about directly modifying the vtd_spte_rsvd* fields in vtd_init()?
> > > I think we only need to modify 4k/2m/1g entries to mask bit 11, they're:
> > >
> > >   - vtd_spte_rsvd[1] (4K)
> > >   - vtd_spte_rsvd_large[2] (2M)
> > >   - vtd_spte_rsvd_large[3] (1G)
> > >
> > > What do you think?  Then we avoid passing IntelIOMMUState* all over too.

I started a version like that:), it should work, I will change that if
it was agreed by everyone.

The reason that I changed to pass IntelIOMMUState is that it results
in a smaller changeset. The reason is that I tend to introduce new
rsvd bits for SM mode since after checking vtd 3.3 it looks have
different reserved bit requirement than before (at least 1.2)

Thanks

>
> [Here]
>
> > >
> > > > +
> > > >      return slpte & rsvd_mask;
> > > >  }
> > > >
> > > > @@ -1054,7 +1059,7 @@ static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
> > > >                                iova, level, slpte, is_write);
> > > >              return is_write ? -VTD_FR_WRITE : -VTD_FR_READ;
> > > >          }
> > > > -        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
> > > > +        if (vtd_slpte_nonzero_rsvd(s, slpte, level)) {
> > > >              error_report_once("%s: detected splte reserve non-zero "
> > > >                                "iova=0x%" PRIx64 ", level=0x%" PRIx32
> > > >                                "slpte=0x%" PRIx64 ")", __func__, iova,
> > > > @@ -1185,7 +1190,8 @@ static int vtd_page_walk_one(IOMMUTLBEvent *event, vtd_page_walk_info *info)
> > > >   * @write: whether parent level has write permission
> > > >   * @info: constant information for the page walk
> > > >   */
> > > > -static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
> > > > +static int vtd_page_walk_level(IntelIOMMUState *s,
> > > > +                               dma_addr_t addr, uint64_t start,
> > > >                                 uint64_t end, uint32_t level, bool read,
> > > >                                 bool write, vtd_page_walk_info *info)
> > > >  {
> > > > @@ -1214,7 +1220,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
> > > >              goto next;
> > > >          }
> > > >
> > > > -        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
> > > > +        if (vtd_slpte_nonzero_rsvd(s, slpte, level)) {
> > > >              trace_vtd_page_walk_skip_reserve(iova, iova_next);
> > > >              goto next;
> > > >          }
> > > > @@ -1235,7 +1241,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
> > > >               * This is a valid PDE (or even bigger than PDE).  We need
> > > >               * to walk one further level.
> > > >               */
> > > > -            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, info->aw),
> > > > +            ret = vtd_page_walk_level(s, vtd_get_slpte_addr(slpte, info->aw),
> > > >                                        iova, MIN(iova_next, end), level - 1,
> > > >                                        read_cur, write_cur, info);
> > > >          } else {
> > > > @@ -1294,7 +1300,7 @@ static int vtd_page_walk(IntelIOMMUState *s, VTDContextEntry *ce,
> > > >          end = vtd_iova_limit(s, ce, info->aw);
> > > >      }
> > > >
> > > > -    return vtd_page_walk_level(addr, start, end, level, true, true, info);
> > > > +    return vtd_page_walk_level(s, addr, start, end, level, true, true, info);
> > > >  }
> > > >
> > > >  static int vtd_root_entry_rsvd_bits_check(IntelIOMMUState *s,
> > > > diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> > > > index 3d5487fe2c..a6c788049b 100644
> > > > --- a/hw/i386/intel_iommu_internal.h
> > > > +++ b/hw/i386/intel_iommu_internal.h
> > > > @@ -388,6 +388,8 @@ typedef union VTDInvDesc VTDInvDesc;
> > > >  #define VTD_INV_DESC_DEVICE_IOTLB_RSVD_LO 0xffff0000ffe0fff8
> > > >
> > > >  /* Rsvd field masks for spte */
> > > > +#define VTD_SPTE_SNP 0x800ULL
> > > > +
> > > >  #define VTD_SPTE_PAGE_L1_RSVD_MASK(aw, dt_supported) \
> > > >          dt_supported ? \
> > > >          (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM | VTD_SL_TM)) : \
> > > > --
> > > > 2.25.1
> > > >
> > >
> > > --
> > > Peter Xu
> > >
> >
>
> --
> Peter Xu
>

Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode

Posted by Peter Xu 4 years, 2 months ago

On Wed, Nov 24, 2021 at 05:01:42PM +0800, Jason Wang wrote:
> > > > > -static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> > > > > +static bool vtd_slpte_nonzero_rsvd(IntelIOMMUState *s,
> > > > > +                                   uint64_t slpte, uint32_t level)
> > > > >  {
> > > > >      uint64_t rsvd_mask = vtd_spte_rsvd[level];
> > > > >
> > > > > @@ -979,6 +980,10 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> > > > >          rsvd_mask = vtd_spte_rsvd_large[level];
> > > > >      }
> > > > >
> > > > > +    if (s->scalable_mode) {
> > > > > +        rsvd_mask &= ~VTD_SPTE_SNP;
> > > > > +    }
> > > >
> > > > IMHO what we want to do is only to skip the leaves of pgtables on SNP, so maybe
> > > > we still want to keep checking the bit 11 reserved for e.g. common pgtable dir
> > > > entries?
> 
> Maybe, but it's probably a question that can only be answered by
> Intel. I can change it for the next version if you stick.

I'm reading vtd spec v3.1 (June 2019) here, and chap 9.8 told me they're
reserved bits for pgdir entries, as no SNP bit defined on pgdir entries.

> 
> > > >
> > > > To do so, how about directly modifying the vtd_spte_rsvd* fields in vtd_init()?
> > > > I think we only need to modify 4k/2m/1g entries to mask bit 11, they're:
> > > >
> > > >   - vtd_spte_rsvd[1] (4K)
> > > >   - vtd_spte_rsvd_large[2] (2M)
> > > >   - vtd_spte_rsvd_large[3] (1G)
> > > >
> > > > What do you think?  Then we avoid passing IntelIOMMUState* all over too.
> 
> I started a version like that:), it should work, I will change that if
> it was agreed by everyone.
> 
> The reason that I changed to pass IntelIOMMUState is that it results
> in a smaller changeset. The reason is that I tend to introduce new
> rsvd bits for SM mode since after checking vtd 3.3 it looks have
> different reserved bit requirement than before (at least 1.2)

Oh I thought changing vtd_spte_rsvd* should have smaller changeset instead,
hmm? :)

IMHO it'll be:

  if (s->scalable_mode) {
        vtd_spte_rsvd[1] &= ~BIT(11);
        vtd_spte_rsvd_large[2] &= ~BIT(11);
        vtd_spte_rsvd_large[3] &= ~BIT(11);
  }

Would that work?  Thanks,

-- 
Peter Xu

Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode

Posted by Jason Wang 4 years, 2 months ago

On Wed, Nov 24, 2021 at 5:23 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Nov 24, 2021 at 05:01:42PM +0800, Jason Wang wrote:
> > > > > > -static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> > > > > > +static bool vtd_slpte_nonzero_rsvd(IntelIOMMUState *s,
> > > > > > +                                   uint64_t slpte, uint32_t level)
> > > > > >  {
> > > > > >      uint64_t rsvd_mask = vtd_spte_rsvd[level];
> > > > > >
> > > > > > @@ -979,6 +980,10 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> > > > > >          rsvd_mask = vtd_spte_rsvd_large[level];
> > > > > >      }
> > > > > >
> > > > > > +    if (s->scalable_mode) {
> > > > > > +        rsvd_mask &= ~VTD_SPTE_SNP;
> > > > > > +    }
> > > > >
> > > > > IMHO what we want to do is only to skip the leaves of pgtables on SNP, so maybe
> > > > > we still want to keep checking the bit 11 reserved for e.g. common pgtable dir
> > > > > entries?
> >
> > Maybe, but it's probably a question that can only be answered by
> > Intel. I can change it for the next version if you stick.
>
> I'm reading vtd spec v3.1 (June 2019) here, and chap 9.8 told me they're
> reserved bits for pgdir entries, as no SNP bit defined on pgdir entries.

Yes, you're right.

>
> >
> > > > >
> > > > > To do so, how about directly modifying the vtd_spte_rsvd* fields in vtd_init()?
> > > > > I think we only need to modify 4k/2m/1g entries to mask bit 11, they're:
> > > > >
> > > > >   - vtd_spte_rsvd[1] (4K)
> > > > >   - vtd_spte_rsvd_large[2] (2M)
> > > > >   - vtd_spte_rsvd_large[3] (1G)
> > > > >
> > > > > What do you think?  Then we avoid passing IntelIOMMUState* all over too.
> >
> > I started a version like that:), it should work, I will change that if
> > it was agreed by everyone.
> >
> > The reason that I changed to pass IntelIOMMUState is that it results
> > in a smaller changeset. The reason is that I tend to introduce new
> > rsvd bits for SM mode since after checking vtd 3.3 it looks have
> > different reserved bit requirement than before (at least 1.2)
>
> Oh I thought changing vtd_spte_rsvd* should have smaller changeset instead,
> hmm? :)
>
> IMHO it'll be:
>
>   if (s->scalable_mode) {
>         vtd_spte_rsvd[1] &= ~BIT(11);
>         vtd_spte_rsvd_large[2] &= ~BIT(11);
>         vtd_spte_rsvd_large[3] &= ~BIT(11);
>   }
>
> Would that work?  Thanks,

Works for sure, if we just want to fix the SNP bit.

(I actually have a version like this as well). I can go this way

The reason why I had another big changset is to align the reserved
bits to vtd 3.3. E.g it equires e.g bit 62 to be reserved 63 to be
ignored which seems not as strict as VTD_SL_IGN_COM. But it can be
addressed in the future.

Thanks

>
> --
> Peter Xu
>

RE: [PATCH] intel-iommu: ignore SNP bit in scalable mode

Posted by Liu, Yi L 4 years, 2 months ago

> From: Jason Wang <jasowang@redhat.com>
> Sent: Wednesday, November 24, 2021 5:35 PM
> 
> On Wed, Nov 24, 2021 at 5:23 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, Nov 24, 2021 at 05:01:42PM +0800, Jason Wang wrote:
> > > > > > > -static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t
> level)
> > > > > > > +static bool vtd_slpte_nonzero_rsvd(IntelIOMMUState *s,
> > > > > > > +                                   uint64_t slpte, uint32_t level)
> > > > > > >  {
> > > > > > >      uint64_t rsvd_mask = vtd_spte_rsvd[level];
> > > > > > >
> > > > > > > @@ -979,6 +980,10 @@ static bool
> vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> > > > > > >          rsvd_mask = vtd_spte_rsvd_large[level];
> > > > > > >      }
> > > > > > >
> > > > > > > +    if (s->scalable_mode) {
> > > > > > > +        rsvd_mask &= ~VTD_SPTE_SNP;
> > > > > > > +    }
> > > > > >
> > > > > > IMHO what we want to do is only to skip the leaves of pgtables on
> SNP, so maybe
> > > > > > we still want to keep checking the bit 11 reserved for e.g. common
> pgtable dir
> > > > > > entries?
> > >
> > > Maybe, but it's probably a question that can only be answered by
> > > Intel. I can change it for the next version if you stick.
> >
> > I'm reading vtd spec v3.1 (June 2019) here, and chap 9.8 told me they're
> > reserved bits for pgdir entries, as no SNP bit defined on pgdir entries.
> 
> Yes, you're right.

yeah. The SNP bit is only available in the leaf paging entry. e.g. for 4KB pages,
the SNP bit is in the PTE, but for 2MB pages, the SNP bit is in PDE, and etc.

Regards,
Yi Liu

RE: [PATCH] intel-iommu: ignore SNP bit in scalable mode

Posted by Liu, Yi L 4 years, 2 months ago

> From: Jason Wang <jasowang@redhat.com>
> Sent: Wednesday, November 24, 2021 4:29 PM
> 
> On Wed, Nov 24, 2021 at 3:57 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, Nov 24, 2021 at 02:03:09PM +0800, Jason Wang wrote:
> > > When booting with scalable mode, I hit this error:
> > >
> > > qemu-system-x86_64: vtd_iova_to_slpte: detected splte reserve non-
> zero iova=0xfffff002, level=0x1slpte=0x102681803)
> > > qemu-system-x86_64: vtd_iommu_translate: detected translation
> failure (dev=01:00:00, iova=0xfffff002)
> > > qemu-system-x86_64: New fault is not recorded due to compression of
> faults
> > >
> > > This is because the SNP bit is set since Linux kernel commit
> > > 6c00612d0cba1 ("iommu/vt-d: Report right snoop capability when using
> > > FL for IOVA") where SNP bit is set if scalable mode is on though this
> > > seems to be an violation on the spec which said the SNP bit is
> > > considered to be reserved if SC is not supported.
> >
> > When I was reading that commit, I was actually confused by this change:
> >
> > ---8<---
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index 956a02eb40b4..0ee5f1bd8af2 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -658,7 +658,14 @@ static int
> domain_update_iommu_snooping(struct intel_iommu *skip)
> >         rcu_read_lock();
> >         for_each_active_iommu(iommu, drhd) {
> >                 if (iommu != skip) {
> > -                       if (!ecap_sc_support(iommu->ecap)) {
> > +                       /*
> > +                        * If the hardware is operating in the scalable mode,
> > +                        * the snooping control is always supported since we
> > +                        * always set PASID-table-entry.PGSNP bit if the domain
> > +                        * is managed outside (UNMANAGED).
> > +                        */
> > +                       if (!sm_supported(iommu) &&
> > +                           !ecap_sc_support(iommu->ecap)) {
> >                                 ret = 0;
> >                                 break;
> >                         }
> > ---8<---
> >
> > Does it mean that for some hardwares that has sm_supported()==true,
> it'll have
> > SC bit cleared in ecap register?
> 
> I guess not, so it's probably only the problem of vIOMMU.
> 
> > That sounds odd, and not sure why.  Maybe Yi
> > Liu or Yi Sun may know?
> 
> Another interesting point is that, it looks to me after that commit
> SNP is used for the domain that is not UNMANAGED even if PGSNP is not
> set.

Per spec, if the PGSNP is set, it means the final page access is snooped.
If it's not set, then it's up to other bit to decide it. For detail, you may
refer to table 6 of chapter 3.9 in vtd 3.2 spec.

Regards,
Yi Liu

RE: [PATCH] intel-iommu: ignore SNP bit in scalable mode

Posted by Liu, Yi L 4 years, 2 months ago

> From: Peter Xu <peterx@redhat.com>
> Sent: Wednesday, November 24, 2021 3:57 PM
> 
> On Wed, Nov 24, 2021 at 02:03:09PM +0800, Jason Wang wrote:
> > When booting with scalable mode, I hit this error:
> >
> > qemu-system-x86_64: vtd_iova_to_slpte: detected splte reserve non-
> zero iova=0xfffff002, level=0x1slpte=0x102681803)
> > qemu-system-x86_64: vtd_iommu_translate: detected translation failure
> (dev=01:00:00, iova=0xfffff002)
> > qemu-system-x86_64: New fault is not recorded due to compression of
> faults
> >
> > This is because the SNP bit is set since Linux kernel commit
> > 6c00612d0cba1 ("iommu/vt-d: Report right snoop capability when using
> > FL for IOVA") where SNP bit is set if scalable mode is on though this
> > seems to be an violation on the spec which said the SNP bit is
> > considered to be reserved if SC is not supported.
> 
> When I was reading that commit, I was actually confused by this change:
> 
> ---8<---
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 956a02eb40b4..0ee5f1bd8af2 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -658,7 +658,14 @@ static int domain_update_iommu_snooping(struct
> intel_iommu *skip)
>         rcu_read_lock();
>         for_each_active_iommu(iommu, drhd) {
>                 if (iommu != skip) {
> -                       if (!ecap_sc_support(iommu->ecap)) {
> +                       /*
> +                        * If the hardware is operating in the scalable mode,
> +                        * the snooping control is always supported since we
> +                        * always set PASID-table-entry.PGSNP bit if the domain
> +                        * is managed outside (UNMANAGED).
> +                        */
> +                       if (!sm_supported(iommu) &&
> +                           !ecap_sc_support(iommu->ecap)) {
>                                 ret = 0;
>                                 break;
>                         }
> ---8<---
> 
> Does it mean that for some hardwares that has sm_supported()==true, it'll
> have  SC bit cleared in ecap register?  That sounds odd, and not sure why.  Maybe
> Yi Liu or Yi Sun may know?

scalable mode has no dependency on SC, so it's possible.

> >
> > To unbreak the guest, ignore the SNP bit for scalable mode first. In
> > the future we may consider to add SC support.
> 
> Oh yes, I remembered the last time we discussed this.  Could you remind
> me what's missing for us to support SC?
> 
> IIUC, for common device emulations we can just declare SC==1, right?  As all
> the DMAs (including kernel accels like vhost) will be from host processors so
> there're no coherent issues with guest vcpu threads.
> 
> If that's correct, the only challenge is device assignment in any form (I am
> not familiar with vdpa; so perhaps that includes vfio, vpda and any other
> kind of assigning host devices to guest?), then we'll try to detect IOMMU_CACHE
> capability from the host iommu groups that covers the assigned devices,
> and we only set SC==1 if we have cache coherency on all the devices?

above looks good to me. SC bit means SNP field available in leaf paging
structure. So we need to check the host side's SC capability for the assigned
devices, then decide whether set SC or not. Then guest iommu driver can
set the SNP bit per its policy.

btw. there is a discussion on the IOMMU_CACHE, it's considered to be
an incorrect usage to let it linked with no_snoop. So there may be some
cleanup later. anyhow, just let you two be aware of it.

https://lore.kernel.org/kvm/20210922234954.GB964074@nvidia.com/

Regards,
Yi Liu

> 
> >
> > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > ---
> >  hw/i386/intel_iommu.c          | 18 ++++++++++++------
> >  hw/i386/intel_iommu_internal.h |  2 ++
> >  2 files changed, 14 insertions(+), 6 deletions(-)
> >
> > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > index 294499ee20..3bcac56c3e 100644
> > --- a/hw/i386/intel_iommu.c
> > +++ b/hw/i386/intel_iommu.c
> > @@ -969,7 +969,8 @@ static dma_addr_t
> vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
> >  static uint64_t vtd_spte_rsvd[5];
> >  static uint64_t vtd_spte_rsvd_large[5];
> >
> > -static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> > +static bool vtd_slpte_nonzero_rsvd(IntelIOMMUState *s,
> > +                                   uint64_t slpte, uint32_t level)
> >  {
> >      uint64_t rsvd_mask = vtd_spte_rsvd[level];
> >
> > @@ -979,6 +980,10 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte,
> uint32_t level)
> >          rsvd_mask = vtd_spte_rsvd_large[level];
> >      }
> >
> > +    if (s->scalable_mode) {
> > +        rsvd_mask &= ~VTD_SPTE_SNP;
> > +    }
> 
> IMHO what we want to do is only to skip the leaves of pgtables on SNP, so
> maybe
> we still want to keep checking the bit 11 reserved for e.g. common pgtable
> dir
> entries?
> 
> To do so, how about directly modifying the vtd_spte_rsvd* fields in
> vtd_init()?
> I think we only need to modify 4k/2m/1g entries to mask bit 11, they're:
> 
>   - vtd_spte_rsvd[1] (4K)
>   - vtd_spte_rsvd_large[2] (2M)
>   - vtd_spte_rsvd_large[3] (1G)
> 
> What do you think?  Then we avoid passing IntelIOMMUState* all over too.
> 
> > +
> >      return slpte & rsvd_mask;
> >  }
> >
> > @@ -1054,7 +1059,7 @@ static int vtd_iova_to_slpte(IntelIOMMUState *s,
> VTDContextEntry *ce,
> >                                iova, level, slpte, is_write);
> >              return is_write ? -VTD_FR_WRITE : -VTD_FR_READ;
> >          }
> > -        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
> > +        if (vtd_slpte_nonzero_rsvd(s, slpte, level)) {
> >              error_report_once("%s: detected splte reserve non-zero "
> >                                "iova=0x%" PRIx64 ", level=0x%" PRIx32
> >                                "slpte=0x%" PRIx64 ")", __func__, iova,
> > @@ -1185,7 +1190,8 @@ static int vtd_page_walk_one(IOMMUTLBEvent
> *event, vtd_page_walk_info *info)
> >   * @write: whether parent level has write permission
> >   * @info: constant information for the page walk
> >   */
> > -static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
> > +static int vtd_page_walk_level(IntelIOMMUState *s,
> > +                               dma_addr_t addr, uint64_t start,
> >                                 uint64_t end, uint32_t level, bool read,
> >                                 bool write, vtd_page_walk_info *info)
> >  {
> > @@ -1214,7 +1220,7 @@ static int vtd_page_walk_level(dma_addr_t addr,
> uint64_t start,
> >              goto next;
> >          }
> >
> > -        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
> > +        if (vtd_slpte_nonzero_rsvd(s, slpte, level)) {
> >              trace_vtd_page_walk_skip_reserve(iova, iova_next);
> >              goto next;
> >          }
> > @@ -1235,7 +1241,7 @@ static int vtd_page_walk_level(dma_addr_t addr,
> uint64_t start,
> >               * This is a valid PDE (or even bigger than PDE).  We need
> >               * to walk one further level.
> >               */
> > -            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, info->aw),
> > +            ret = vtd_page_walk_level(s, vtd_get_slpte_addr(slpte, info->aw),
> >                                        iova, MIN(iova_next, end), level - 1,
> >                                        read_cur, write_cur, info);
> >          } else {
> > @@ -1294,7 +1300,7 @@ static int vtd_page_walk(IntelIOMMUState *s,
> VTDContextEntry *ce,
> >          end = vtd_iova_limit(s, ce, info->aw);
> >      }
> >
> > -    return vtd_page_walk_level(addr, start, end, level, true, true, info);
> > +    return vtd_page_walk_level(s, addr, start, end, level, true, true, info);
> >  }
> >
> >  static int vtd_root_entry_rsvd_bits_check(IntelIOMMUState *s,
> > diff --git a/hw/i386/intel_iommu_internal.h
> b/hw/i386/intel_iommu_internal.h
> > index 3d5487fe2c..a6c788049b 100644
> > --- a/hw/i386/intel_iommu_internal.h
> > +++ b/hw/i386/intel_iommu_internal.h
> > @@ -388,6 +388,8 @@ typedef union VTDInvDesc VTDInvDesc;
> >  #define VTD_INV_DESC_DEVICE_IOTLB_RSVD_LO 0xffff0000ffe0fff8
> >
> >  /* Rsvd field masks for spte */
> > +#define VTD_SPTE_SNP 0x800ULL
> > +
> >  #define VTD_SPTE_PAGE_L1_RSVD_MASK(aw, dt_supported) \
> >          dt_supported ? \
> >          (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM |
> VTD_SL_TM)) : \
> > --
> > 2.25.1
> >
> 
> --
> Peter Xu

Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode

Posted by Peter Xu 4 years, 2 months ago

On Thu, Nov 25, 2021 at 04:03:34AM +0000, Liu, Yi L wrote:
> > From: Peter Xu <peterx@redhat.com>
> > Sent: Wednesday, November 24, 2021 3:57 PM
> > 
> > On Wed, Nov 24, 2021 at 02:03:09PM +0800, Jason Wang wrote:
> > > When booting with scalable mode, I hit this error:
> > >
> > > qemu-system-x86_64: vtd_iova_to_slpte: detected splte reserve non-
> > zero iova=0xfffff002, level=0x1slpte=0x102681803)
> > > qemu-system-x86_64: vtd_iommu_translate: detected translation failure
> > (dev=01:00:00, iova=0xfffff002)
> > > qemu-system-x86_64: New fault is not recorded due to compression of
> > faults
> > >
> > > This is because the SNP bit is set since Linux kernel commit
> > > 6c00612d0cba1 ("iommu/vt-d: Report right snoop capability when using
> > > FL for IOVA") where SNP bit is set if scalable mode is on though this
> > > seems to be an violation on the spec which said the SNP bit is
> > > considered to be reserved if SC is not supported.
> > 
> > When I was reading that commit, I was actually confused by this change:
> > 
> > ---8<---
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index 956a02eb40b4..0ee5f1bd8af2 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -658,7 +658,14 @@ static int domain_update_iommu_snooping(struct
> > intel_iommu *skip)
> >         rcu_read_lock();
> >         for_each_active_iommu(iommu, drhd) {
> >                 if (iommu != skip) {
> > -                       if (!ecap_sc_support(iommu->ecap)) {
> > +                       /*
> > +                        * If the hardware is operating in the scalable mode,
> > +                        * the snooping control is always supported since we
> > +                        * always set PASID-table-entry.PGSNP bit if the domain
> > +                        * is managed outside (UNMANAGED).
> > +                        */
> > +                       if (!sm_supported(iommu) &&
> > +                           !ecap_sc_support(iommu->ecap)) {
> >                                 ret = 0;
> >                                 break;
> >                         }
> > ---8<---
> > 
> > Does it mean that for some hardwares that has sm_supported()==true, it'll
> > have  SC bit cleared in ecap register?  That sounds odd, and not sure why.  Maybe
> > Yi Liu or Yi Sun may know?
> 
> scalable mode has no dependency on SC, so it's possible.

I see; thanks, Yi.

However then OTOH I don't understand above comment 

  "If the hardware is operating in the scalable mode, the snooping control is
   always supported since... "

Because the current qemu vt-d emulation should fall into the case that Yi
mentioned - we support initial scalable mode but no SC yet..

Cc Baolu too.

-- 
Peter Xu

RE: [PATCH] intel-iommu: ignore SNP bit in scalable mode

Posted by Liu, Yi L 4 years, 2 months ago

> From: Peter Xu <peterx@redhat.com>
> Sent: Thursday, November 25, 2021 12:31 PM
> 
> On Thu, Nov 25, 2021 at 04:03:34AM +0000, Liu, Yi L wrote:
> > > From: Peter Xu <peterx@redhat.com>
> > > Sent: Wednesday, November 24, 2021 3:57 PM
> > >
> > > On Wed, Nov 24, 2021 at 02:03:09PM +0800, Jason Wang wrote:
> > > > When booting with scalable mode, I hit this error:
> > > >
> > > > qemu-system-x86_64: vtd_iova_to_slpte: detected splte reserve non-
> > > zero iova=0xfffff002, level=0x1slpte=0x102681803)
> > > > qemu-system-x86_64: vtd_iommu_translate: detected translation
> failure
> > > (dev=01:00:00, iova=0xfffff002)
> > > > qemu-system-x86_64: New fault is not recorded due to compression
> of
> > > faults
> > > >
> > > > This is because the SNP bit is set since Linux kernel commit
> > > > 6c00612d0cba1 ("iommu/vt-d: Report right snoop capability when
> using
> > > > FL for IOVA") where SNP bit is set if scalable mode is on though this
> > > > seems to be an violation on the spec which said the SNP bit is
> > > > considered to be reserved if SC is not supported.
> > >
> > > When I was reading that commit, I was actually confused by this change:
> > >
> > > ---8<---
> > > diff --git a/drivers/iommu/intel/iommu.c
> b/drivers/iommu/intel/iommu.c
> > > index 956a02eb40b4..0ee5f1bd8af2 100644
> > > --- a/drivers/iommu/intel/iommu.c
> > > +++ b/drivers/iommu/intel/iommu.c
> > > @@ -658,7 +658,14 @@ static int
> domain_update_iommu_snooping(struct
> > > intel_iommu *skip)
> > >         rcu_read_lock();
> > >         for_each_active_iommu(iommu, drhd) {
> > >                 if (iommu != skip) {
> > > -                       if (!ecap_sc_support(iommu->ecap)) {
> > > +                       /*
> > > +                        * If the hardware is operating in the scalable mode,
> > > +                        * the snooping control is always supported since we
> > > +                        * always set PASID-table-entry.PGSNP bit if the domain
> > > +                        * is managed outside (UNMANAGED).
> > > +                        */
> > > +                       if (!sm_supported(iommu) &&
> > > +                           !ecap_sc_support(iommu->ecap)) {
> > >                                 ret = 0;
> > >                                 break;
> > >                         }
> > > ---8<---
> > >
> > > Does it mean that for some hardwares that has sm_supported()==true,
> it'll
> > > have  SC bit cleared in ecap register?  That sounds odd, and not sure why.
> Maybe
> > > Yi Liu or Yi Sun may know?
> >
> > scalable mode has no dependency on SC, so it's possible.
> 
> I see; thanks, Yi.
> 
> However then OTOH I don't understand above comment
> 
>   "If the hardware is operating in the scalable mode, the snooping control is
>    always supported since... "
> 
> Because the current qemu vt-d emulation should fall into the case that Yi
> mentioned - we support initial scalable mode but no SC yet..

chapter 3.9 of 3.2 spec says below.

“If the remapping hardware is setup in scalable-mode (RTADDR_REG.TTM=01b)
and the Page Snoop (PGSNP) field in PASID-table entry is Set, access to the
final page is snooped.”

It means the PGSNP field is available under scalable mode. And spec also
says below in chapter 96. of 3.2 spec.

"Requests snoop processor caches irrespective of, other attributes in the
request or other fields in paging structure entries used to translate the
request."

It means the PGSNP field of PASID table entry is the first class control
of the snoop behaviour. Also it means the scalable mode has the snoop
control by default. ^_^. So the comment in the above commit is correct
since the policy of intel iommu driver is always setting the PGSNP bit.
But spec is not so clear. Will reach out to make it more clearer in the
spec. thanks for catching it. :-)

Regards,
Yi Liu

Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode

Posted by Peter Xu 4 years, 2 months ago

On Thu, Nov 25, 2021 at 05:49:38AM +0000, Liu, Yi L wrote:
> > From: Peter Xu <peterx@redhat.com>
> > Sent: Thursday, November 25, 2021 12:31 PM
> > 
> > On Thu, Nov 25, 2021 at 04:03:34AM +0000, Liu, Yi L wrote:
> > > > From: Peter Xu <peterx@redhat.com>
> > > > Sent: Wednesday, November 24, 2021 3:57 PM
> > > >
> > > > On Wed, Nov 24, 2021 at 02:03:09PM +0800, Jason Wang wrote:
> > > > > When booting with scalable mode, I hit this error:
> > > > >
> > > > > qemu-system-x86_64: vtd_iova_to_slpte: detected splte reserve non-
> > > > zero iova=0xfffff002, level=0x1slpte=0x102681803)
> > > > > qemu-system-x86_64: vtd_iommu_translate: detected translation
> > failure
> > > > (dev=01:00:00, iova=0xfffff002)
> > > > > qemu-system-x86_64: New fault is not recorded due to compression
> > of
> > > > faults
> > > > >
> > > > > This is because the SNP bit is set since Linux kernel commit
> > > > > 6c00612d0cba1 ("iommu/vt-d: Report right snoop capability when
> > using
> > > > > FL for IOVA") where SNP bit is set if scalable mode is on though this
> > > > > seems to be an violation on the spec which said the SNP bit is
> > > > > considered to be reserved if SC is not supported.
> > > >
> > > > When I was reading that commit, I was actually confused by this change:
> > > >
> > > > ---8<---
> > > > diff --git a/drivers/iommu/intel/iommu.c
> > b/drivers/iommu/intel/iommu.c
> > > > index 956a02eb40b4..0ee5f1bd8af2 100644
> > > > --- a/drivers/iommu/intel/iommu.c
> > > > +++ b/drivers/iommu/intel/iommu.c
> > > > @@ -658,7 +658,14 @@ static int
> > domain_update_iommu_snooping(struct
> > > > intel_iommu *skip)
> > > >         rcu_read_lock();
> > > >         for_each_active_iommu(iommu, drhd) {
> > > >                 if (iommu != skip) {
> > > > -                       if (!ecap_sc_support(iommu->ecap)) {
> > > > +                       /*
> > > > +                        * If the hardware is operating in the scalable mode,
> > > > +                        * the snooping control is always supported since we
> > > > +                        * always set PASID-table-entry.PGSNP bit if the domain
> > > > +                        * is managed outside (UNMANAGED).
> > > > +                        */
> > > > +                       if (!sm_supported(iommu) &&
> > > > +                           !ecap_sc_support(iommu->ecap)) {
> > > >                                 ret = 0;
> > > >                                 break;
> > > >                         }
> > > > ---8<---
> > > >
> > > > Does it mean that for some hardwares that has sm_supported()==true,
> > it'll
> > > > have  SC bit cleared in ecap register?  That sounds odd, and not sure why.
> > Maybe
> > > > Yi Liu or Yi Sun may know?
> > >
> > > scalable mode has no dependency on SC, so it's possible.
> > 
> > I see; thanks, Yi.
> > 
> > However then OTOH I don't understand above comment
> > 
> >   "If the hardware is operating in the scalable mode, the snooping control is
> >    always supported since... "
> > 
> > Because the current qemu vt-d emulation should fall into the case that Yi
> > mentioned - we support initial scalable mode but no SC yet..
> 
> chapter 3.9 of 3.2 spec says below.
> 
> “If the remapping hardware is setup in scalable-mode (RTADDR_REG.TTM=01b)
> and the Page Snoop (PGSNP) field in PASID-table entry is Set, access to the
> final page is snooped.”
> 
> It means the PGSNP field is available under scalable mode. And spec also
> says below in chapter 96. of 3.2 spec.
> 
> "Requests snoop processor caches irrespective of, other attributes in the
> request or other fields in paging structure entries used to translate the
> request."
> 
> It means the PGSNP field of PASID table entry is the first class control
> of the snoop behaviour. Also it means the scalable mode has the snoop
> control by default. ^_^. So the comment in the above commit is correct
> since the policy of intel iommu driver is always setting the PGSNP bit.

I see.  Setting PGSNP bit in the pasid entry looks fine to me.

However IIUC what's triggering the crash (that Jason is fixing) is the guest
iommu driver "thinks" SC is supported since scalable is enabled (even if qemu
vIOMMU has declared ECAP.SC==0 there), then it'll update iommu_snooping bit,
then it'll try to attach the SNP bit in the 2nd level pgtable (intel_iommu_map):

	if ((iommu_prot & IOMMU_CACHE) && dmar_domain->iommu_snooping)
		prot |= DMA_PTE_SNP;

So what I'm wondering is: whether the kernel should _not_ set SNP bit in the
2nd level pgtable, even if we set PGSNP in the pasid entry.. because as you
mentioned, the hardware (here the emulated vIOMMU) is allowed to have both
scalable==1 but sc==0 so it may recognize PGSNP in pasid entry but not the SNP
bit in pgtables.

If we'll skip pgtable SNP bit anyway for scalable mode, it looks weird to
explicitly set it too.

I think it's fine for Jason's solution to just skip checking SNP bit so we
ignore it in qemu, however just to double check we're on the same page.

Thanks,

-- 
Peter Xu

RE: [PATCH] intel-iommu: ignore SNP bit in scalable mode

Posted by Liu, Yi L 4 years, 2 months ago

> From: Peter Xu <peterx@redhat.com>
> Sent: Thursday, November 25, 2021 2:14 PM
> 
> On Thu, Nov 25, 2021 at 05:49:38AM +0000, Liu, Yi L wrote:
> > > From: Peter Xu <peterx@redhat.com>
> > > Sent: Thursday, November 25, 2021 12:31 PM
> > >
> > > On Thu, Nov 25, 2021 at 04:03:34AM +0000, Liu, Yi L wrote:
> > > > > From: Peter Xu <peterx@redhat.com>
> > > > > Sent: Wednesday, November 24, 2021 3:57 PM
> > > > >
> > > > > On Wed, Nov 24, 2021 at 02:03:09PM +0800, Jason Wang wrote:
> > > > > > When booting with scalable mode, I hit this error:
> > > > > >
> > > > > > qemu-system-x86_64: vtd_iova_to_slpte: detected splte reserve
> non-
> > > > > zero iova=0xfffff002, level=0x1slpte=0x102681803)
> > > > > > qemu-system-x86_64: vtd_iommu_translate: detected translation
> > > failure
> > > > > (dev=01:00:00, iova=0xfffff002)
> > > > > > qemu-system-x86_64: New fault is not recorded due to
> compression
> > > of
> > > > > faults
> > > > > >
> > > > > > This is because the SNP bit is set since Linux kernel commit
> > > > > > 6c00612d0cba1 ("iommu/vt-d: Report right snoop capability when
> > > using
> > > > > > FL for IOVA") where SNP bit is set if scalable mode is on though this
> > > > > > seems to be an violation on the spec which said the SNP bit is
> > > > > > considered to be reserved if SC is not supported.
> > > > >
> > > > > When I was reading that commit, I was actually confused by this
> change:
> > > > >
> > > > > ---8<---
> > > > > diff --git a/drivers/iommu/intel/iommu.c
> > > b/drivers/iommu/intel/iommu.c
> > > > > index 956a02eb40b4..0ee5f1bd8af2 100644
> > > > > --- a/drivers/iommu/intel/iommu.c
> > > > > +++ b/drivers/iommu/intel/iommu.c
> > > > > @@ -658,7 +658,14 @@ static int
> > > domain_update_iommu_snooping(struct
> > > > > intel_iommu *skip)
> > > > >         rcu_read_lock();
> > > > >         for_each_active_iommu(iommu, drhd) {
> > > > >                 if (iommu != skip) {
> > > > > -                       if (!ecap_sc_support(iommu->ecap)) {
> > > > > +                       /*
> > > > > +                        * If the hardware is operating in the scalable mode,
> > > > > +                        * the snooping control is always supported since we
> > > > > +                        * always set PASID-table-entry.PGSNP bit if the domain
> > > > > +                        * is managed outside (UNMANAGED).
> > > > > +                        */
> > > > > +                       if (!sm_supported(iommu) &&
> > > > > +                           !ecap_sc_support(iommu->ecap)) {
> > > > >                                 ret = 0;
> > > > >                                 break;
> > > > >                         }
> > > > > ---8<---
> > > > >
> > > > > Does it mean that for some hardwares that has
> sm_supported()==true,
> > > it'll
> > > > > have  SC bit cleared in ecap register?  That sounds odd, and not sure
> why.
> > > Maybe
> > > > > Yi Liu or Yi Sun may know?
> > > >
> > > > scalable mode has no dependency on SC, so it's possible.
> > >
> > > I see; thanks, Yi.
> > >
> > > However then OTOH I don't understand above comment
> > >
> > >   "If the hardware is operating in the scalable mode, the snooping control
> is
> > >    always supported since... "
> > >
> > > Because the current qemu vt-d emulation should fall into the case that Yi
> > > mentioned - we support initial scalable mode but no SC yet..
> >
> > chapter 3.9 of 3.2 spec says below.
> >
> > “If the remapping hardware is setup in scalable-mode
> (RTADDR_REG.TTM=01b)
> > and the Page Snoop (PGSNP) field in PASID-table entry is Set, access to
> the
> > final page is snooped.”
> >
> > It means the PGSNP field is available under scalable mode. And spec also
> > says below in chapter 96. of 3.2 spec.
> >
> > "Requests snoop processor caches irrespective of, other attributes in the
> > request or other fields in paging structure entries used to translate the
> > request."
> >
> > It means the PGSNP field of PASID table entry is the first class control
> > of the snoop behaviour. Also it means the scalable mode has the snoop
> > control by default. ^_^. So the comment in the above commit is correct
> > since the policy of intel iommu driver is always setting the PGSNP bit.
> 
> I see.  Setting PGSNP bit in the pasid entry looks fine to me.
> 
> However IIUC what's triggering the crash (that Jason is fixing) is the guest
> iommu driver "thinks" SC is supported since scalable is enabled (even if
> qemu vIOMMU has declared ECAP.SC==0 there), then it'll update
> iommu_snooping bit, then it'll try to attach the SNP bit in the 2nd level
> pgtable (intel_iommu_map):
> 
> 	if ((iommu_prot & IOMMU_CACHE) && dmar_domain->iommu_snooping)
> 		prot |= DMA_PTE_SNP;
> 

Above is the fact today.

> So what I'm wondering is: whether the kernel should _not_ set SNP bit in
> the 2nd level pgtable, even if we set PGSNP in the pasid entry.. because
> as you mentioned, the hardware (here the emulated vIOMMU) is allowed to have
> both scalable==1 but sc==0 so it may recognize PGSNP in pasid entry but not
> the SNP bit in pgtables.

This is a weird configuration. :( let's fix it in spec. e.g. PGSNP bit is
supported only when scalable==1 and sc==1. this makes more sense. right?
Then SNP bit will always depend on sc bit, it won't have any relationship
with scalable bit.

> 
> If we'll skip pgtable SNP bit anyway for scalable mode, it looks weird to
> explicitly set it too.
> 
> I think it's fine for Jason's solution to just skip checking SNP bit so we
> ignore it in qemu, however just to double check we're on the same page.

If we have above fix in spec, the iommu driver would also update its
behavior on the SNP bit. then the problem encountered by Jason will
go away. right?

Thanks,
Yi Liu

Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode

Posted by Peter Xu 4 years, 2 months ago

On Sun, Nov 28, 2021 at 07:06:18AM +0000, Liu, Yi L wrote:
> > From: Peter Xu <peterx@redhat.com>
> > Sent: Thursday, November 25, 2021 2:14 PM
> > 
> > On Thu, Nov 25, 2021 at 05:49:38AM +0000, Liu, Yi L wrote:
> > > > From: Peter Xu <peterx@redhat.com>
> > > > Sent: Thursday, November 25, 2021 12:31 PM
> > > >
> > > > On Thu, Nov 25, 2021 at 04:03:34AM +0000, Liu, Yi L wrote:
> > > > > > From: Peter Xu <peterx@redhat.com>
> > > > > > Sent: Wednesday, November 24, 2021 3:57 PM
> > > > > >
> > > > > > On Wed, Nov 24, 2021 at 02:03:09PM +0800, Jason Wang wrote:
> > > > > > > When booting with scalable mode, I hit this error:
> > > > > > >
> > > > > > > qemu-system-x86_64: vtd_iova_to_slpte: detected splte reserve
> > non-
> > > > > > zero iova=0xfffff002, level=0x1slpte=0x102681803)
> > > > > > > qemu-system-x86_64: vtd_iommu_translate: detected translation
> > > > failure
> > > > > > (dev=01:00:00, iova=0xfffff002)
> > > > > > > qemu-system-x86_64: New fault is not recorded due to
> > compression
> > > > of
> > > > > > faults
> > > > > > >
> > > > > > > This is because the SNP bit is set since Linux kernel commit
> > > > > > > 6c00612d0cba1 ("iommu/vt-d: Report right snoop capability when
> > > > using
> > > > > > > FL for IOVA") where SNP bit is set if scalable mode is on though this
> > > > > > > seems to be an violation on the spec which said the SNP bit is
> > > > > > > considered to be reserved if SC is not supported.
> > > > > >
> > > > > > When I was reading that commit, I was actually confused by this
> > change:
> > > > > >
> > > > > > ---8<---
> > > > > > diff --git a/drivers/iommu/intel/iommu.c
> > > > b/drivers/iommu/intel/iommu.c
> > > > > > index 956a02eb40b4..0ee5f1bd8af2 100644
> > > > > > --- a/drivers/iommu/intel/iommu.c
> > > > > > +++ b/drivers/iommu/intel/iommu.c
> > > > > > @@ -658,7 +658,14 @@ static int
> > > > domain_update_iommu_snooping(struct
> > > > > > intel_iommu *skip)
> > > > > >         rcu_read_lock();
> > > > > >         for_each_active_iommu(iommu, drhd) {
> > > > > >                 if (iommu != skip) {
> > > > > > -                       if (!ecap_sc_support(iommu->ecap)) {
> > > > > > +                       /*
> > > > > > +                        * If the hardware is operating in the scalable mode,
> > > > > > +                        * the snooping control is always supported since we
> > > > > > +                        * always set PASID-table-entry.PGSNP bit if the domain
> > > > > > +                        * is managed outside (UNMANAGED).
> > > > > > +                        */
> > > > > > +                       if (!sm_supported(iommu) &&
> > > > > > +                           !ecap_sc_support(iommu->ecap)) {
> > > > > >                                 ret = 0;
> > > > > >                                 break;
> > > > > >                         }
> > > > > > ---8<---
> > > > > >
> > > > > > Does it mean that for some hardwares that has
> > sm_supported()==true,
> > > > it'll
> > > > > > have  SC bit cleared in ecap register?  That sounds odd, and not sure
> > why.
> > > > Maybe
> > > > > > Yi Liu or Yi Sun may know?
> > > > >
> > > > > scalable mode has no dependency on SC, so it's possible.
> > > >
> > > > I see; thanks, Yi.
> > > >
> > > > However then OTOH I don't understand above comment
> > > >
> > > >   "If the hardware is operating in the scalable mode, the snooping control
> > is
> > > >    always supported since... "
> > > >
> > > > Because the current qemu vt-d emulation should fall into the case that Yi
> > > > mentioned - we support initial scalable mode but no SC yet..
> > >
> > > chapter 3.9 of 3.2 spec says below.
> > >
> > > “If the remapping hardware is setup in scalable-mode
> > (RTADDR_REG.TTM=01b)
> > > and the Page Snoop (PGSNP) field in PASID-table entry is Set, access to
> > the
> > > final page is snooped.”
> > >
> > > It means the PGSNP field is available under scalable mode. And spec also
> > > says below in chapter 96. of 3.2 spec.
> > >
> > > "Requests snoop processor caches irrespective of, other attributes in the
> > > request or other fields in paging structure entries used to translate the
> > > request."
> > >
> > > It means the PGSNP field of PASID table entry is the first class control
> > > of the snoop behaviour. Also it means the scalable mode has the snoop
> > > control by default. ^_^. So the comment in the above commit is correct
> > > since the policy of intel iommu driver is always setting the PGSNP bit.
> > 
> > I see.  Setting PGSNP bit in the pasid entry looks fine to me.
> > 
> > However IIUC what's triggering the crash (that Jason is fixing) is the guest
> > iommu driver "thinks" SC is supported since scalable is enabled (even if
> > qemu vIOMMU has declared ECAP.SC==0 there), then it'll update
> > iommu_snooping bit, then it'll try to attach the SNP bit in the 2nd level
> > pgtable (intel_iommu_map):
> > 
> > 	if ((iommu_prot & IOMMU_CACHE) && dmar_domain->iommu_snooping)
> > 		prot |= DMA_PTE_SNP;
> > 
> 
> Above is the fact today.
> 
> > So what I'm wondering is: whether the kernel should _not_ set SNP bit in
> > the 2nd level pgtable, even if we set PGSNP in the pasid entry.. because
> > as you mentioned, the hardware (here the emulated vIOMMU) is allowed to have
> > both scalable==1 but sc==0 so it may recognize PGSNP in pasid entry but not
> > the SNP bit in pgtables.
> 
> This is a weird configuration. :( let's fix it in spec. e.g. PGSNP bit is
> supported only when scalable==1 and sc==1. this makes more sense. right?
> Then SNP bit will always depend on sc bit, it won't have any relationship
> with scalable bit.

Sounds reasonable.

> 
> > 
> > If we'll skip pgtable SNP bit anyway for scalable mode, it looks weird to
> > explicitly set it too.
> > 
> > I think it's fine for Jason's solution to just skip checking SNP bit so we
> > ignore it in qemu, however just to double check we're on the same page.
> 
> If we have above fix in spec, the iommu driver would also update its
> behavior on the SNP bit. then the problem encountered by Jason will
> go away. right?

Looks right to me.

I think we can still have Jason's patch continued because the kernel commit to
apply SNP bit is merged in v5.13, so we may need the qemu change to let it
still work with v5.13-v5.15+ guest kernels.  We'll loose the resv bit check a
bit, but looks worthwhile.  Jason?

Thanks,

-- 
Peter Xu

Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode

Posted by Jason Wang 4 years, 2 months ago

On Mon, Nov 29, 2021 at 9:19 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Sun, Nov 28, 2021 at 07:06:18AM +0000, Liu, Yi L wrote:
> > > From: Peter Xu <peterx@redhat.com>
> > > Sent: Thursday, November 25, 2021 2:14 PM
> > >
> > > On Thu, Nov 25, 2021 at 05:49:38AM +0000, Liu, Yi L wrote:
> > > > > From: Peter Xu <peterx@redhat.com>
> > > > > Sent: Thursday, November 25, 2021 12:31 PM
> > > > >
> > > > > On Thu, Nov 25, 2021 at 04:03:34AM +0000, Liu, Yi L wrote:
> > > > > > > From: Peter Xu <peterx@redhat.com>
> > > > > > > Sent: Wednesday, November 24, 2021 3:57 PM
> > > > > > >
> > > > > > > On Wed, Nov 24, 2021 at 02:03:09PM +0800, Jason Wang wrote:
> > > > > > > > When booting with scalable mode, I hit this error:
> > > > > > > >
> > > > > > > > qemu-system-x86_64: vtd_iova_to_slpte: detected splte reserve
> > > non-
> > > > > > > zero iova=0xfffff002, level=0x1slpte=0x102681803)
> > > > > > > > qemu-system-x86_64: vtd_iommu_translate: detected translation
> > > > > failure
> > > > > > > (dev=01:00:00, iova=0xfffff002)
> > > > > > > > qemu-system-x86_64: New fault is not recorded due to
> > > compression
> > > > > of
> > > > > > > faults
> > > > > > > >
> > > > > > > > This is because the SNP bit is set since Linux kernel commit
> > > > > > > > 6c00612d0cba1 ("iommu/vt-d: Report right snoop capability when
> > > > > using
> > > > > > > > FL for IOVA") where SNP bit is set if scalable mode is on though this
> > > > > > > > seems to be an violation on the spec which said the SNP bit is
> > > > > > > > considered to be reserved if SC is not supported.
> > > > > > >
> > > > > > > When I was reading that commit, I was actually confused by this
> > > change:
> > > > > > >
> > > > > > > ---8<---
> > > > > > > diff --git a/drivers/iommu/intel/iommu.c
> > > > > b/drivers/iommu/intel/iommu.c
> > > > > > > index 956a02eb40b4..0ee5f1bd8af2 100644
> > > > > > > --- a/drivers/iommu/intel/iommu.c
> > > > > > > +++ b/drivers/iommu/intel/iommu.c
> > > > > > > @@ -658,7 +658,14 @@ static int
> > > > > domain_update_iommu_snooping(struct
> > > > > > > intel_iommu *skip)
> > > > > > >         rcu_read_lock();
> > > > > > >         for_each_active_iommu(iommu, drhd) {
> > > > > > >                 if (iommu != skip) {
> > > > > > > -                       if (!ecap_sc_support(iommu->ecap)) {
> > > > > > > +                       /*
> > > > > > > +                        * If the hardware is operating in the scalable mode,
> > > > > > > +                        * the snooping control is always supported since we
> > > > > > > +                        * always set PASID-table-entry.PGSNP bit if the domain
> > > > > > > +                        * is managed outside (UNMANAGED).
> > > > > > > +                        */
> > > > > > > +                       if (!sm_supported(iommu) &&
> > > > > > > +                           !ecap_sc_support(iommu->ecap)) {
> > > > > > >                                 ret = 0;
> > > > > > >                                 break;
> > > > > > >                         }
> > > > > > > ---8<---
> > > > > > >
> > > > > > > Does it mean that for some hardwares that has
> > > sm_supported()==true,
> > > > > it'll
> > > > > > > have  SC bit cleared in ecap register?  That sounds odd, and not sure
> > > why.
> > > > > Maybe
> > > > > > > Yi Liu or Yi Sun may know?
> > > > > >
> > > > > > scalable mode has no dependency on SC, so it's possible.
> > > > >
> > > > > I see; thanks, Yi.
> > > > >
> > > > > However then OTOH I don't understand above comment
> > > > >
> > > > >   "If the hardware is operating in the scalable mode, the snooping control
> > > is
> > > > >    always supported since... "
> > > > >
> > > > > Because the current qemu vt-d emulation should fall into the case that Yi
> > > > > mentioned - we support initial scalable mode but no SC yet..
> > > >
> > > > chapter 3.9 of 3.2 spec says below.
> > > >
> > > > “If the remapping hardware is setup in scalable-mode
> > > (RTADDR_REG.TTM=01b)
> > > > and the Page Snoop (PGSNP) field in PASID-table entry is Set, access to
> > > the
> > > > final page is snooped.”
> > > >
> > > > It means the PGSNP field is available under scalable mode. And spec also
> > > > says below in chapter 96. of 3.2 spec.
> > > >
> > > > "Requests snoop processor caches irrespective of, other attributes in the
> > > > request or other fields in paging structure entries used to translate the
> > > > request."
> > > >
> > > > It means the PGSNP field of PASID table entry is the first class control
> > > > of the snoop behaviour. Also it means the scalable mode has the snoop
> > > > control by default. ^_^. So the comment in the above commit is correct
> > > > since the policy of intel iommu driver is always setting the PGSNP bit.
> > >
> > > I see.  Setting PGSNP bit in the pasid entry looks fine to me.
> > >
> > > However IIUC what's triggering the crash (that Jason is fixing) is the guest
> > > iommu driver "thinks" SC is supported since scalable is enabled (even if
> > > qemu vIOMMU has declared ECAP.SC==0 there), then it'll update
> > > iommu_snooping bit, then it'll try to attach the SNP bit in the 2nd level
> > > pgtable (intel_iommu_map):
> > >
> > >     if ((iommu_prot & IOMMU_CACHE) && dmar_domain->iommu_snooping)
> > >             prot |= DMA_PTE_SNP;
> > >
> >
> > Above is the fact today.
> >
> > > So what I'm wondering is: whether the kernel should _not_ set SNP bit in
> > > the 2nd level pgtable, even if we set PGSNP in the pasid entry.. because
> > > as you mentioned, the hardware (here the emulated vIOMMU) is allowed to have
> > > both scalable==1 but sc==0 so it may recognize PGSNP in pasid entry but not
> > > the SNP bit in pgtables.
> >
> > This is a weird configuration. :( let's fix it in spec. e.g. PGSNP bit is
> > supported only when scalable==1 and sc==1. this makes more sense. right?
> > Then SNP bit will always depend on sc bit, it won't have any relationship
> > with scalable bit.
>
> Sounds reasonable.
>
> >
> > >
> > > If we'll skip pgtable SNP bit anyway for scalable mode, it looks weird to
> > > explicitly set it too.
> > >
> > > I think it's fine for Jason's solution to just skip checking SNP bit so we
> > > ignore it in qemu, however just to double check we're on the same page.
> >
> > If we have above fix in spec, the iommu driver would also update its
> > behavior on the SNP bit. then the problem encountered by Jason will
> > go away. right?
>
> Looks right to me.
>
> I think we can still have Jason's patch continued because the kernel commit to
> apply SNP bit is merged in v5.13, so we may need the qemu change to let it
> still work with v5.13-v5.15+ guest kernels.  We'll loose the resv bit check a
> bit, but looks worthwhile.  Jason?

Yes, I agree. The only thing that may worry me is the migration
compatibility. If we migrate from new to old we may break the guests,
we probably need compatibility props for that.

And in the future, it could be even more troublesome,e.g there's one
day we found another bit that needs not to be checked. Maybe we should
even remove all the rsvd bits checks?

Thanks

>
> Thanks,
>
> --
> Peter Xu
>

Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode

Posted by Peter Xu 4 years, 2 months ago

On Mon, Nov 29, 2021 at 10:28:42AM +0800, Jason Wang wrote:
> > I think we can still have Jason's patch continued because the kernel commit to
> > apply SNP bit is merged in v5.13, so we may need the qemu change to let it
> > still work with v5.13-v5.15+ guest kernels.  We'll loose the resv bit check a
> > bit, but looks worthwhile.  Jason?
> 
> Yes, I agree. The only thing that may worry me is the migration
> compatibility. If we migrate from new to old we may break the guests,
> we probably need compatibility props for that.

Hmm.. How important is new->old migrations?  Is that normally for the
old->new->old migration so that users can always fallback to the old hosts?

If that's the case then IMHO we're fine here, since the new binary check less
on resv bits than the old binary, then if the guest code can work with the old
binary already then migrating back to old binary should work too.  Changing
guest OS during migration of new->old can have a problem but hopefully rare.

OTOH - do you know any of the real enterprise user that uses scalable mode yet?
To my own understanding it's still mostly "experimental", then hopefully we can
avoid worrying on that too much?

> 
> And in the future, it could be even more troublesome,e.g there's one
> day we found another bit that needs not to be checked. Maybe we should
> even remove all the rsvd bits checks?

When a real hardware sees any of the reserved bits set, it'll bail out and
raise an error, right?

If that's the case, I'm wondering whether we should always follow the hardware
behavior as an emulator.

Now I'm trying to remember normally how a spec could re-use a bit that was used
to be reserved: should the hardware bumps the version of the version reg in so
that softwares will know what to expect?

So I'm thinking whether the emulator code can identify the version bump by
"scalable mode enabled", if so we know some resved bits are "ignored" now, and
IIUC that's mostly the original proposal to add a quirk when scalable mode in
vtd_init().

But again, I really think it should be the spec owner who should have
considered all these.. e.g. explicitly document "this bit was used to reserved,
but when scalable mode enabled it's ignored and programmable by the guest
driver", or something like that.

Thanks,

-- 
Peter Xu

RE: [PATCH] intel-iommu: ignore SNP bit in scalable mode

Posted by Liu, Yi L 4 years, 2 months ago

> From: Peter Xu <peterx@redhat.com>
> Sent: Monday, November 29, 2021 11:14 AM
> 
> On Mon, Nov 29, 2021 at 10:28:42AM +0800, Jason Wang wrote:
> >
> > And in the future, it could be even more troublesome,e.g there's one
> > day we found another bit that needs not to be checked. Maybe we should
> > even remove all the rsvd bits checks?
> 
> When a real hardware sees any of the reserved bits set, it'll bail out and
> raise an error, right?

I think so. vtd spec has defined Non-zero reserved field error code against
all the translation structures (root/context/pasid dir/pasid table/page table)
for it. And it makes sense since any such error indicates a potential
misunderstanding on the spec.

> If that's the case, I'm wondering whether we should always follow the
> hardware behavior as an emulator.

I think so. and current virtual Intel IOMMU does a good job to detect the
SNP setting.:)

> Now I'm trying to remember normally how a spec could re-use a bit that was
> used to be reserved: should the hardware bumps the version of the version reg in so
> that softwares will know what to expect?

defining a new capability bit is also a way for it. New software will probe the
capability bit and then program the bit which was reserved but now redefined.
Old software doesn’t have any idea on the new capability bit, so it will not
program the reserved bit.

> So I'm thinking whether the emulator code can identify the version bump by
> "scalable mode enabled", if so we know some resved bits are "ignored" now,
> and IIUC that's mostly the original proposal to add a quirk when scalable mode
> in vtd_init().

do you mean the spec version or?

> But again, I really think it should be the spec owner who should have
> considered all these..

yes, spec owner should consider it.

> e.g. explicitly document "this bit was used to reserved,
> but when scalable mode enabled it's ignored and programmable by the guest
> driver", or something like that.

there is a good example for your above sentence. It's the root table entry
and the scalable mode root table entry. In legacy mode, the high 64 bits of
root table entry are all reserved. In scalable mode, some of the high 64 bits
are used. I think we have defined scalable mode reserved bits macro in the
emulator code.

But regards to minor changes within a working mode, it may be more common to
define a capability bit when a reserved bit is re-used.

Regards,
Yi Liu

Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode

Posted by Jason Wang 4 years, 2 months ago

On Mon, Nov 29, 2021 at 11:14 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Nov 29, 2021 at 10:28:42AM +0800, Jason Wang wrote:
> > > I think we can still have Jason's patch continued because the kernel commit to
> > > apply SNP bit is merged in v5.13, so we may need the qemu change to let it
> > > still work with v5.13-v5.15+ guest kernels.  We'll loose the resv bit check a
> > > bit, but looks worthwhile.  Jason?
> >
> > Yes, I agree. The only thing that may worry me is the migration
> > compatibility. If we migrate from new to old we may break the guests,
> > we probably need compatibility props for that.
>
> Hmm.. How important is new->old migrations?  Is that normally for the
> old->new->old migration so that users can always fallback to the old hosts?

I meant e.g migrating from new qemu with machine=6.3 to old qemu with
machine=6.3.

So guest works on the src but not dst.

>
> If that's the case then IMHO we're fine here, since the new binary check less
> on resv bits than the old binary, then if the guest code can work with the old
> binary already then migrating back to old binary should work too.  Changing
> guest OS during migration of new->old can have a problem but hopefully rare.
>
> OTOH - do you know any of the real enterprise user that uses scalable mode yet?
> To my own understanding it's still mostly "experimental", then hopefully we can
> avoid worrying on that too much?

Probably, it has an "x" prefix.

>
> >
> > And in the future, it could be even more troublesome,e.g there's one
> > day we found another bit that needs not to be checked. Maybe we should
> > even remove all the rsvd bits checks?
>
> When a real hardware sees any of the reserved bits set, it'll bail out and
> raise an error, right?

To say the truth I don't know (though spec said that).

>
> If that's the case, I'm wondering whether we should always follow the hardware
> behavior as an emulator.
>
> Now I'm trying to remember normally how a spec could re-use a bit that was used
> to be reserved: should the hardware bumps the version of the version reg in so
> that softwares will know what to expect?

I think so or at least some kind of negotiation.

>
> So I'm thinking whether the emulator code can identify the version bump by
> "scalable mode enabled", if so we know some resved bits are "ignored" now, and
> IIUC that's mostly the original proposal to add a quirk when scalable mode in
> vtd_init().
>
> But again, I really think it should be the spec owner who should have
> considered all these.. e.g. explicitly document "this bit was used to reserved,
> but when scalable mode enabled it's ignored and programmable by the guest
> driver", or something like that.

I fully agree, we've learnt a lot when dealing with migration
compatibility in the past decade.

I will prepare a V2, it was basically what you suggested. And we can
leave the rest for future investigation. The motivation is prototyping
PASID support for virtio, it's sufficient for this patch to unblock
the work.

Thanks

>
> Thanks,
>
> --
> Peter Xu
>