[PATCH v6 0/9] iommu/amd: Use 128-bit cmpxchg operation to update DTE

Suravee Suthikulpanit posted 9 patches 1 month, 1 week ago
There is a newer version of this series
This series modifies the current implementation to use a 128-bit cmpxchg to
update the DTE when needed, as specified in the AMD I/O Virtualization
Technology (IOMMU) Specification.

Please note that I have verified with the hardware designer, and they have
confirmed that the IOMMU hardware has always been implemented with 256-bit
read. The next revision of the IOMMU spec will be updated to correctly
describe this part.  Therefore, I have updated the implementation to avoid
unnecessary flushing.
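For readers unfamiliar with the pattern, the retry loop behind a cmpxchg-based
update can be sketched in plain C as below. This is an illustrative userspace
sketch, not the kernel code: it uses C11 64-bit atomics for portability,
whereas the series applies the same pattern with a 128-bit cmpxchg to each
half of the 256-bit DTE (hence the CMPXCHG16B requirement in patch 1), and
dte_update_word() is a made-up name for this demo.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative sketch of a cmpxchg retry loop. The series uses the
 * same shape with a 128-bit cmpxchg on each half of the 256-bit DTE;
 * dte_update_word() is a hypothetical name for this demo only. */
static void dte_update_word(_Atomic uint64_t *word, uint64_t new_val)
{
	uint64_t old = atomic_load(word);

	/* On failure, 'old' is refreshed with the current value of
	 * *word; retry until the swap lands atomically. */
	while (!atomic_compare_exchange_weak(word, &old, new_val))
		;
}
```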

Changes in v6:

* Patch 2, 4, 7: Newly added

* Patch 3, 5, 6, 7, 9: Add READ_ONCE() per Uros.

* Patch 3:
  - Modify write_dte_[higher|lower]128() to avoid copying old DTE in the loop.

* Patch 5:
  - Use dev_data->dte_cache to restore persistent DTE bits in set_dte_entry().
  - Simplify make_clear_dte():
    - Remove bit preservation logic.
    - Remove non-SNP check for setting TV since it should not be needed.

* Patch 6:
  - Use find_dev_data(..., alias) since the dev_data might not have been allocated.
  - Move dev_iommu_priv_set() to before setup_aliases().

v5: https://lore.kernel.org/lkml/20241007041353.4756-1-suravee.suthikulpanit@amd.com/
v4: https://lore.kernel.org/lkml/20240916171805.324292-1-suravee.suthikulpanit@amd.com/
v3: https://lore.kernel.org/lkml/20240906121308.5013-1-suravee.suthikulpanit@amd.com/
v2: https://lore.kernel.org/lkml/20240829180726.5022-1-suravee.suthikulpanit@amd.com/
v1: https://lore.kernel.org/lkml/20240819161839.4657-1-suravee.suthikulpanit@amd.com/

Thanks,
Suravee

Suravee Suthikulpanit (8):
  iommu/amd: Disable AMD IOMMU if CMPXCHG16B feature is not supported
  iommu/amd: Introduce helper function to update 256-bit DTE
  iommu/amd: Introduce per-device DTE cache to store persistent bits
  iommu/amd: Modify set_dte_entry() to use 256-bit DTE helpers
  iommu/amd: Introduce helper function get_dte256()
  iommu/amd: Move erratum 63 logic to write_dte_lower128()
  iommu/amd: Modify clear_dte_entry() to avoid in-place update
  iommu/amd: Lock DTE before updating the entry with WRITE_ONCE()

Uros Bizjak (1):
  asm/rwonce: Introduce [READ|WRITE]_ONCE() support for __int128
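As a rough userspace approximation of what Uros's patch enables (the real
macros live in include/asm-generic/rwonce.h and add compile-time type checks
this sketch omits), the core of both accessors is a single volatile access of
the full scalar type, which the patch extends to __int128-sized scalars:

```c
/* Simplified userspace approximation of READ_ONCE()/WRITE_ONCE():
 * one volatile access of the whole object, preventing the compiler
 * from tearing or caching the load/store. Illustrative only; the
 * kernel versions additionally reject non-scalar types. */
#define READ_ONCE(x)	 (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))
```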

 drivers/iommu/amd/amd_iommu.h       |   4 +-
 drivers/iommu/amd/amd_iommu_types.h |  25 +-
 drivers/iommu/amd/init.c            |  79 ++----
 drivers/iommu/amd/iommu.c           | 364 ++++++++++++++++++++--------
 include/asm-generic/rwonce.h        |   2 +-
 include/linux/compiler_types.h      |   8 +-
 6 files changed, 322 insertions(+), 160 deletions(-)

-- 
2.34.1
Re: [PATCH v6 0/9] iommu/amd: Use 128-bit cmpxchg operation to update DTE
Posted by Jason Gunthorpe 1 month, 1 week ago
On Wed, Oct 16, 2024 at 05:17:47AM +0000, Suravee Suthikulpanit wrote:
> This series modifies current implementation to use 128-bit cmpxchg to
> update DTE when needed as specified in the AMD I/O Virtualization
> Technology (IOMMU) Specification.
> 
> Please note that I have verified with the hardware designer, and they have
> confirmed that the IOMMU hardware has always been implemented with 256-bit
> read. The next revision of the IOMMU spec will be updated to correctly
> describe this part.  Therefore, I have updated the implementation to avoid
> unnecessary flushing.
> 
> Changes in v6:
> 
> * Patch 2, 4, 7: Newly added
> 
> * Patch 3, 5, 6, 7, 9: Add READ_ONCE() per Uros.
> 
> * Patch 3:
>   - Modify write_dte_[higher|lower]128() to avoid copying old DTE in the loop.
> 
> * Patch 5:
>   - Use dev_data->dte_cache to restore persistent DTE bits in set_dte_entry().
>   - Simplify make_clear_dte():
>     - Remove bit preservation logic.
>     - Remove non-SNP check for setting TV since it should not be needed.
> 
> * Patch 6:
>   - Use find_dev_data(..., alias) since the dev_data might not have been allocated.
>   - Move dev_iommu_priv_set() to before setup_aliases().

I wanted to see how far this was from being split up neatly like ARM is,
and I came up with this, which seems pretty good to me. This would
probably be the next step to get to; then you'd lift the individual
set functions higher up the call chain into their respective attach
functions.

static void set_dte_identity(struct amd_iommu *iommu,
			       struct iommu_dev_data *dev_data,
			       struct dev_table_entry *target)
{
	/*
	 * SNP does not support TV=1/Mode=1 in any case, and can't do IDENTITY
	 */
	if (WARN_ON(amd_iommu_snp_en))
		return;

	/* mode is zero */
	target->data[0] |= DTE_FLAG_TV | DTE_FLAG_IR | DTE_FLAG_IW | DTE_FLAG_V;
	if (dev_data->ats_enabled)
		target->data[1] |= DTE_FLAG_IOTLB;
	/* ppr is not allowed for identity */

	target->data128[0] |= dev_data->dte_cache.data128[0];
	target->data128[1] |= dev_data->dte_cache.data128[1];
}

static void set_dte_gcr3_table(struct amd_iommu *iommu,
			       struct iommu_dev_data *dev_data,
			       struct dev_table_entry *target)
{
	struct gcr3_tbl_info *gcr3_info = &dev_data->gcr3_info;
	u64 gcr3;

	if (!gcr3_info->gcr3_tbl)
		return;

	pr_debug("%s: devid=%#x, glx=%#x, gcr3_tbl=%#llx\n",
		 __func__, dev_data->devid, gcr3_info->glx,
		 (unsigned long long)gcr3_info->gcr3_tbl);

	gcr3 = iommu_virt_to_phys(gcr3_info->gcr3_tbl);

	target->data[0] |= DTE_FLAG_GV | DTE_FLAG_TV | DTE_FLAG_IR |
			   DTE_FLAG_IW | DTE_FLAG_V |
			   FIELD_PREP(DTE_GLX, gcr3_info->glx) |
			   FIELD_PREP(DTE_GCR3_14_12, gcr3 >> 12);
	if (pdom_is_v2_pgtbl_mode(dev_data->domain))
		target->data[0] |= DTE_FLAG_GIOV;

	target->data[1] |= FIELD_PREP(DTE_GCR3_30_15, gcr3 >> 15) |
			   FIELD_PREP(DTE_GCR3_51_31, gcr3 >> 31);

	/* Guest page table can only support 4 and 5 levels  */
	target->data[2] |= FIELD_PREP(
		DTE_GPT_LEVEL_MASK, (amd_iommu_gpt_level == PAGE_MODE_5_LEVEL ?
					     GUEST_PGTABLE_5_LEVEL :
					     GUEST_PGTABLE_4_LEVEL));

	target->data[1] |= dev_data->gcr3_info.domid;
	if (dev_data->ppr)
		target->data[0] |= 1ULL << DEV_ENTRY_PPR;
	if (dev_data->ats_enabled)
		target->data[1] |= DTE_FLAG_IOTLB;

	target->data128[0] |= dev_data->dte_cache.data128[0];
	target->data128[1] |= dev_data->dte_cache.data128[1];
}

static void set_dte_paging(struct amd_iommu *iommu,
			       struct iommu_dev_data *dev_data,
			       struct dev_table_entry *target)
{
	struct protection_domain *domain = dev_data->domain;

	target->data[0] |= DTE_FLAG_TV | DTE_FLAG_IR | DTE_FLAG_IW |
			   iommu_virt_to_phys(domain->iop.root) |
			   ((domain->iop.mode & DEV_ENTRY_MODE_MASK)
			    << DEV_ENTRY_MODE_SHIFT) |
			   DTE_FLAG_V;
	if (dev_data->ppr)
		target->data[0] |= 1ULL << DEV_ENTRY_PPR;
	if (domain->dirty_tracking)
		target->data[0] |= DTE_FLAG_HAD;

	target->data[1] |= domain->id;
	if (dev_data->ats_enabled)
		target->data[1] |= DTE_FLAG_IOTLB;

	target->data128[0] |= dev_data->dte_cache.data128[0];
	target->data128[1] |= dev_data->dte_cache.data128[1];
}

static void set_dte_entry(struct amd_iommu *iommu,
			  struct iommu_dev_data *dev_data)
{
	u32 old_domid;
	struct dev_table_entry new = {};
	struct protection_domain *domain = dev_data->domain;
	struct gcr3_tbl_info *gcr3_info = &dev_data->gcr3_info;
	struct dev_table_entry *dte = &get_dev_table(iommu)[dev_data->devid];

	make_clear_dte(dev_data, dte, &new);
	if (gcr3_info && gcr3_info->gcr3_tbl)
		set_dte_gcr3_table(iommu, dev_data, &new);
	else if (domain->iop.mode == PAGE_MODE_NONE)
		set_dte_identity(iommu, dev_data, &new);
	else
		set_dte_paging(iommu, dev_data, &new);

	old_domid = READ_ONCE(dte->data[1]) & DEV_DOMID_MASK;
	update_dte256(iommu, dev_data, &new);

	/*
	 * A kdump kernel might be replacing a domain ID that was copied from
	 * the previous kernel--if so, it needs to flush the translation cache
	 * entries for the old domain ID that is being overwritten
	 */
	if (old_domid) {
		amd_iommu_flush_tlb_domid(iommu, old_domid);
	}
}
Re: [PATCH v6 0/9] iommu/amd: Use 128-bit cmpxchg operation to update DTE
Posted by Suthikulpanit, Suravee 3 weeks, 5 days ago
On 10/16/2024 9:22 PM, Jason Gunthorpe wrote:
>
> ....
>
> I wanted to see how far this was to being split up neatly like ARM is,
> I came up with this, which seems pretty good to me. This would
> probably be the next step to get to, then you'd lift the individual
> set functions higher up the call chain into their respective attach
> functions.

I like this idea and will look into adopting this code when I submit the 
nested translation stuff (right after this series), since it will affect 
set_dte_entry().

> .....
> 
> static void set_dte_entry(struct amd_iommu *iommu,
> 			  struct iommu_dev_data *dev_data)
> {
> 	u32 old_domid;
> 	struct dev_table_entry new = {};
> 	struct protection_domain *domain = dev_data->domain;
> 	struct gcr3_tbl_info *gcr3_info = &dev_data->gcr3_info;
> 	struct dev_table_entry *dte = &get_dev_table(iommu)[dev_data->devid];
> 
> 	make_clear_dte(dev_data, dte, &new);
> 	if (gcr3_info && gcr3_info->gcr3_tbl)
> 		set_dte_gcr3_table(iommu, dev_data, &new);
> 	else if (domain->iop.mode == PAGE_MODE_NONE)
> 		set_dte_identity(iommu, dev_data, &new);
> 	else
> 		set_dte_paging(iommu, dev_data, &new);

This will need to change once we add nested translation support, 
because we will need to call both set_dte_paging() and set_dte_gcr3_table().

Thanks,
Suravee
Re: [PATCH v6 0/9] iommu/amd: Use 128-bit cmpxchg operation to update DTE
Posted by Jason Gunthorpe 3 weeks, 5 days ago
On Thu, Oct 31, 2024 at 04:15:02PM +0700, Suthikulpanit, Suravee wrote:
> On 10/16/2024 9:22 PM, Jason Gunthorpe wrote:
> > 
> > ....
> > 
> > I wanted to see how far this was to being split up neatly like ARM is,
> > I came up with this, which seems pretty good to me. This would
> > probably be the next step to get to, then you'd lift the individual
> > set functions higher up the call chain into their respective attach
> > functions.
> 
> I like this idea and will look into adopting this code when I submit the
> nested translation stuff (right after this series) since it will affect the
> set_dte_entry().

Yes, I definitely want to see this kind of code structure before
nested translation.

> > static void set_dte_entry(struct amd_iommu *iommu,
> > 			  struct iommu_dev_data *dev_data)
> > {
> > 	u32 old_domid;
> > 	struct dev_table_entry new = {};
> > 	struct protection_domain *domain = dev_data->domain;
> > 	struct gcr3_tbl_info *gcr3_info = &dev_data->gcr3_info;
> > 	struct dev_table_entry *dte = &get_dev_table(iommu)[dev_data->devid];
> > 
> > 	make_clear_dte(dev_data, dte, &new);
> > 	if (gcr3_info && gcr3_info->gcr3_tbl)
> > 		set_dte_gcr3_table(iommu, dev_data, &new);
> > 	else if (domain->iop.mode == PAGE_MODE_NONE)
> > 		set_dte_identity(iommu, dev_data, &new);
> > 	else
> > 		set_dte_paging(iommu, dev_data, &new);
> 
> This will need to be change once we add nested translation support because
> we need to call both set_dte_paging() and set_dte_gcr3().

The idea would be to remove set_dte_entry() because the attach
functions just call their specific set_dte_xx() directly, like how arm
is structured.

That will make everything much clearer.

Then the nested attach function would call some set_dte_nested() and
it would use set_dte_paging() internally.

Getting to this level is necessary to get hitless replace, which
is important.

I hope this series lands this cycle. Next cycle you should try
to get to hitless replace on the domain path, including this stuff;
then adding the nested domain should be straightforward!

Jason