From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Use a previously introduced per-domain read/write lock to check
whether vpci is present, so we are sure there are no accesses to the
contents of the vpci struct if not. This lock can be used (and in a
few cases is used right away) so that vpci removal can be performed
while holding the lock in write mode. Previously such removal could
race with vpci_read for example.
When taking both d->pci_lock and pdev->vpci->lock, they should be
taken in this exact order: d->pci_lock then pdev->vpci->lock to avoid
possible deadlock situations.
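For illustration, the nesting used throughout the patch looks like the
sketch below (a minimal sketch only; all names are taken from the patch
itself, and the write variant is used where vpci may be removed or
modify_bars may run):

  read_lock(&d->pci_lock);
  if ( pdev->vpci )
  {
      spin_lock(&pdev->vpci->lock);
      /* ... access the contents of pdev->vpci ... */
      spin_unlock(&pdev->vpci->lock);
  }
  read_unlock(&d->pci_lock);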
1. Per-domain's pci_lock is used to protect pdev->vpci structure
from being removed.
2. Writing the command register and ROM BAR register may trigger
modify_bars to run, which in turn may access multiple pdevs while
checking for the existing BAR's overlap. The overlapping check, if
done under the read lock, requires vpci->lock to be acquired on both
devices being compared, which may produce a deadlock. It is not
possible to upgrade read lock to write lock in such a case. So, in
order to prevent the deadlock, use d->pci_lock instead.
All other code, which doesn't lead to pdev->vpci destruction and does
not access multiple pdevs at the same time, can still use a
combination of the read lock and pdev->vpci->lock.
3. Drop const qualifier where the new rwlock is used and this is
appropriate.
4. Do not call process_pending_softirqs with any locks held. For that
unlock prior to the call and re-acquire the locks after. After
re-acquiring the lock there is no need to check if pdev->vpci exists:
- in apply_map because of the context it is called (no race condition
possible)
- for MSI/MSI-X debug code because it is called at the end of
pdev->vpci access and no further access to pdev->vpci is made
5. Use d->pci_lock around for_each_pdev and pci_get_pdev_by_domain
while accessing pdevs in vpci code.
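For example, iterating over a domain's devices in vpci code now follows the
pattern below (a sketch of the scheme, not a verbatim excerpt; debug key
handlers such as vpci_dump_msi() use read_trylock() instead):

  read_lock(&d->pci_lock);
  for_each_pdev ( d, pdev )
  {
      if ( !pdev->vpci )
          continue;
      spin_lock(&pdev->vpci->lock);
      /* ... */
      spin_unlock(&pdev->vpci->lock);
  }
  read_unlock(&d->pci_lock);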
Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
Changes in v12:
- s/pci_rwlock/pci_lock/ in commit message
- expand comment about scope of pci_lock in sched.h
- in vpci_{read,write}, if hwdom is trying to access a device assigned
to dom_xen, holding hwdom->pci_lock is sufficient (no need to hold
dom_xen->pci_lock)
- reintroduce ASSERT in vmx_pi_update_irte()
- reintroduce ASSERT in __pci_enable_msi{x}()
- delete note 6. in commit message about removing ASSERTs since we have
reintroduced them
Changes in v11:
- Fixed commit message regarding possible spinlocks
- Removed parameter from allocate_and_map_msi_pirq(), which was added
in the prev version. Now we are taking pcidevs_lock in
physdev_map_pirq()
- Returned ASSERT to pci_enable_msi
- Fixed case when we took read lock instead of write one
- Fixed label indentation
Changes in v10:
- Moved printk past locked area
- Returned back ASSERTs
- Added new parameter to allocate_and_map_msi_pirq() so it knows if
it should take the global pci lock
- Added comment about possible improvement in vpci_write
- Changed ASSERT(rw_is_locked()) to rw_is_write_locked() in
appropriate places
- Renamed release_domain_locks() to release_domain_write_locks()
- moved domain_done label in vpci_dump_msi() to correct place
Changes in v9:
- extended locked region to protect vpci_remove_device and
vpci_add_handlers() calls
- vpci_write() takes lock in the write mode to protect
potential call to modify_bars()
- renamed lock releasing function
- removed ASSERT()s from msi code
- added trylock in vpci_dump_msi
Changes in v8:
- changed d->vpci_lock to d->pci_lock
- introducing d->pci_lock in a separate patch
- extended locked region in vpci_process_pending
- removed pcidevs_lock in vpci_dump_msi()
- removed some changes as they are not needed with
the new locking scheme
- added handling for hwdom && dom_xen case
---
xen/arch/x86/hvm/vmsi.c | 22 +++++++--------
xen/arch/x86/hvm/vmx/vmx.c | 2 +-
xen/arch/x86/irq.c | 8 +++---
xen/arch/x86/msi.c | 14 +++++-----
xen/arch/x86/physdev.c | 2 ++
xen/drivers/passthrough/pci.c | 9 +++---
xen/drivers/vpci/header.c | 18 ++++++++++++
xen/drivers/vpci/msi.c | 28 +++++++++++++++++--
xen/drivers/vpci/msix.c | 52 ++++++++++++++++++++++++++++++-----
xen/drivers/vpci/vpci.c | 28 ++++++++++++++++---
xen/include/xen/sched.h | 3 +-
11 files changed, 144 insertions(+), 42 deletions(-)
diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index 128f23636279..03caf91beefc 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -468,7 +468,7 @@ int msixtbl_pt_register(struct domain *d, struct pirq *pirq, uint64_t gtable)
struct msixtbl_entry *entry, *new_entry;
int r = -EINVAL;
- ASSERT(pcidevs_locked());
+ ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
ASSERT(rw_is_write_locked(&d->event_lock));
if ( !msixtbl_initialised(d) )
@@ -538,7 +538,7 @@ void msixtbl_pt_unregister(struct domain *d, struct pirq *pirq)
struct pci_dev *pdev;
struct msixtbl_entry *entry;
- ASSERT(pcidevs_locked());
+ ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
ASSERT(rw_is_write_locked(&d->event_lock));
if ( !msixtbl_initialised(d) )
@@ -684,7 +684,7 @@ static int vpci_msi_update(const struct pci_dev *pdev, uint32_t data,
{
unsigned int i;
- ASSERT(pcidevs_locked());
+ ASSERT(rw_is_locked(&pdev->domain->pci_lock));
if ( (address & MSI_ADDR_BASE_MASK) != MSI_ADDR_HEADER )
{
@@ -725,8 +725,8 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
int rc;
ASSERT(msi->arch.pirq != INVALID_PIRQ);
+ ASSERT(rw_is_locked(&pdev->domain->pci_lock));
- pcidevs_lock();
for ( i = 0; i < msi->vectors && msi->arch.bound; i++ )
{
struct xen_domctl_bind_pt_irq unbind = {
@@ -745,7 +745,6 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address,
msi->vectors, msi->arch.pirq, msi->mask);
- pcidevs_unlock();
}
static int vpci_msi_enable(const struct pci_dev *pdev, unsigned int nr,
@@ -778,15 +777,14 @@ int vpci_msi_arch_enable(struct vpci_msi *msi, const struct pci_dev *pdev,
int rc;
ASSERT(msi->arch.pirq == INVALID_PIRQ);
+ ASSERT(rw_is_locked(&pdev->domain->pci_lock));
rc = vpci_msi_enable(pdev, vectors, 0);
if ( rc < 0 )
return rc;
msi->arch.pirq = rc;
- pcidevs_lock();
msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address, vectors,
msi->arch.pirq, msi->mask);
- pcidevs_unlock();
return 0;
}
@@ -797,8 +795,8 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
unsigned int i;
ASSERT(pirq != INVALID_PIRQ);
+ ASSERT(rw_is_locked(&pdev->domain->pci_lock));
- pcidevs_lock();
for ( i = 0; i < nr && bound; i++ )
{
struct xen_domctl_bind_pt_irq bind = {
@@ -814,7 +812,6 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
write_lock(&pdev->domain->event_lock);
unmap_domain_pirq(pdev->domain, pirq);
write_unlock(&pdev->domain->event_lock);
- pcidevs_unlock();
}
void vpci_msi_arch_disable(struct vpci_msi *msi, const struct pci_dev *pdev)
@@ -854,6 +851,7 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
int rc;
ASSERT(entry->arch.pirq == INVALID_PIRQ);
+ ASSERT(rw_is_locked(&pdev->domain->pci_lock));
rc = vpci_msi_enable(pdev, vmsix_entry_nr(pdev->vpci->msix, entry),
table_base);
if ( rc < 0 )
@@ -861,7 +859,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
entry->arch.pirq = rc;
- pcidevs_lock();
rc = vpci_msi_update(pdev, entry->data, entry->addr, 1, entry->arch.pirq,
entry->masked);
if ( rc )
@@ -869,7 +866,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
vpci_msi_disable(pdev, entry->arch.pirq, 1, false);
entry->arch.pirq = INVALID_PIRQ;
}
- pcidevs_unlock();
return rc;
}
@@ -895,6 +891,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
{
unsigned int i;
+ ASSERT(rw_is_locked(&msix->pdev->domain->pci_lock));
+
for ( i = 0; i < msix->max_entries; i++ )
{
const struct vpci_msix_entry *entry = &msix->entries[i];
@@ -913,7 +911,9 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
struct pci_dev *pdev = msix->pdev;
spin_unlock(&msix->pdev->vpci->lock);
+ read_unlock(&pdev->domain->pci_lock);
process_pending_softirqs();
+ read_lock(&pdev->domain->pci_lock);
/* NB: we assume that pdev cannot go away for an alive domain. */
if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
return -EBUSY;
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 8ff675883c2b..890faef48b82 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -413,7 +413,7 @@ static int cf_check vmx_pi_update_irte(const struct vcpu *v,
spin_unlock_irq(&desc->lock);
- ASSERT(pcidevs_locked());
+ ASSERT(pcidevs_locked() || rw_is_locked(&msi_desc->dev->domain->pci_lock));
return iommu_update_ire_from_msi(msi_desc, &msi_desc->msg);
diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index 50e49e1a4b6f..4d89d9af99fe 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -2166,7 +2166,7 @@ int map_domain_pirq(
struct pci_dev *pdev;
unsigned int nr = 0;
- ASSERT(pcidevs_locked());
+ ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
ret = -ENODEV;
if ( !cpu_has_apic )
@@ -2323,7 +2323,7 @@ int unmap_domain_pirq(struct domain *d, int pirq)
if ( (pirq < 0) || (pirq >= d->nr_pirqs) )
return -EINVAL;
- ASSERT(pcidevs_locked());
+ ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
ASSERT(rw_is_write_locked(&d->event_lock));
info = pirq_info(d, pirq);
@@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
{
int irq, pirq, ret;
+ ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
+
switch ( type )
{
case MAP_PIRQ_TYPE_MSI:
@@ -2917,7 +2919,6 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
msi->irq = irq;
- pcidevs_lock();
/* Verify or get pirq. */
write_lock(&d->event_lock);
pirq = allocate_pirq(d, index, *pirq_p, irq, type, &msi->entry_nr);
@@ -2933,7 +2934,6 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
done:
write_unlock(&d->event_lock);
- pcidevs_unlock();
if ( ret )
{
switch ( type )
diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index 335c0868a225..7da2affa2e82 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -602,7 +602,7 @@ static int msi_capability_init(struct pci_dev *dev,
unsigned int i, mpos;
uint16_t control;
- ASSERT(pcidevs_locked());
+ ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
pos = pci_find_cap_offset(dev->sbdf, PCI_CAP_ID_MSI);
if ( !pos )
return -ENODEV;
@@ -771,7 +771,7 @@ static int msix_capability_init(struct pci_dev *dev,
if ( !pos )
return -ENODEV;
- ASSERT(pcidevs_locked());
+ ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));
/*
@@ -988,11 +988,11 @@ static int __pci_enable_msi(struct pci_dev *pdev, struct msi_info *msi,
{
struct msi_desc *old_desc;
- ASSERT(pcidevs_locked());
-
if ( !pdev )
return -ENODEV;
+ ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
+
old_desc = find_msi_entry(pdev, msi->irq, PCI_CAP_ID_MSI);
if ( old_desc )
{
@@ -1043,11 +1043,11 @@ static int __pci_enable_msix(struct pci_dev *pdev, struct msi_info *msi,
{
struct msi_desc *old_desc;
- ASSERT(pcidevs_locked());
-
if ( !pdev || !pdev->msix )
return -ENODEV;
+ ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
+
if ( msi->entry_nr >= pdev->msix->nr_entries )
return -EINVAL;
@@ -1154,7 +1154,7 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off)
int pci_enable_msi(struct pci_dev *pdev, struct msi_info *msi,
struct msi_desc **desc)
{
- ASSERT(pcidevs_locked());
+ ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
if ( !use_msi )
return -EPERM;
diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c
index 47c4da0af7e1..369c9e788c1c 100644
--- a/xen/arch/x86/physdev.c
+++ b/xen/arch/x86/physdev.c
@@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
case MAP_PIRQ_TYPE_MSI:
case MAP_PIRQ_TYPE_MULTI_MSI:
+ pcidevs_lock();
ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
+ pcidevs_unlock();
break;
default:
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 1439d1ef2b26..3a973324bca1 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -750,7 +750,6 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
pdev->domain = hardware_domain;
write_lock(&hardware_domain->pci_lock);
list_add(&pdev->domain_list, &hardware_domain->pdev_list);
- write_unlock(&hardware_domain->pci_lock);
/*
* For devices not discovered by Xen during boot, add vPCI handlers
@@ -759,18 +758,18 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
ret = vpci_add_handlers(pdev);
if ( ret )
{
- printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
- write_lock(&hardware_domain->pci_lock);
list_del(&pdev->domain_list);
write_unlock(&hardware_domain->pci_lock);
pdev->domain = NULL;
+ printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
goto out;
}
+ write_unlock(&hardware_domain->pci_lock);
ret = iommu_add_device(pdev);
if ( ret )
{
- vpci_remove_device(pdev);
write_lock(&hardware_domain->pci_lock);
+ vpci_remove_device(pdev);
list_del(&pdev->domain_list);
write_unlock(&hardware_domain->pci_lock);
pdev->domain = NULL;
@@ -1146,7 +1145,9 @@ static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt,
} while ( devfn != pdev->devfn &&
PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
+ write_lock(&ctxt->d->pci_lock);
err = vpci_add_handlers(pdev);
+ write_unlock(&ctxt->d->pci_lock);
if ( err )
printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n",
ctxt->d->domain_id, err);
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 58195549d50a..8f5850b8cf6d 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -173,6 +173,7 @@ bool vpci_process_pending(struct vcpu *v)
if ( rc == -ERESTART )
return true;
+ write_lock(&v->domain->pci_lock);
spin_lock(&v->vpci.pdev->vpci->lock);
/* Disable memory decoding unconditionally on failure. */
modify_decoding(v->vpci.pdev,
@@ -191,6 +192,7 @@ bool vpci_process_pending(struct vcpu *v)
* failure.
*/
vpci_remove_device(v->vpci.pdev);
+ write_unlock(&v->domain->pci_lock);
}
return false;
@@ -202,8 +204,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
struct map_data data = { .d = d, .map = true };
int rc;
+ ASSERT(rw_is_write_locked(&d->pci_lock));
+
while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
+ {
+ /*
+ * It's safe to drop and reacquire the lock in this context
+ * without risking pdev disappearing because devices cannot be
+ * removed until the initial domain has been started.
+ */
+ write_unlock(&d->pci_lock);
process_pending_softirqs();
+ write_lock(&d->pci_lock);
+ }
+
rangeset_destroy(mem);
if ( !rc )
modify_decoding(pdev, cmd, false);
@@ -244,6 +258,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
unsigned int i;
int rc;
+ ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
if ( !mem )
return -ENOMEM;
@@ -524,6 +540,8 @@ static int cf_check init_header(struct pci_dev *pdev)
int rc;
bool mask_cap_list = false;
+ ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
{
case PCI_HEADER_TYPE_NORMAL:
diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
index a253ccbd7db7..6ff71e5f9ab7 100644
--- a/xen/drivers/vpci/msi.c
+++ b/xen/drivers/vpci/msi.c
@@ -263,7 +263,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
void vpci_dump_msi(void)
{
- const struct domain *d;
+ struct domain *d;
rcu_read_lock(&domlist_read_lock);
for_each_domain ( d )
@@ -275,6 +275,9 @@ void vpci_dump_msi(void)
printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
+ if ( !read_trylock(&d->pci_lock) )
+ continue;
+
for_each_pdev ( d, pdev )
{
const struct vpci_msi *msi;
@@ -316,14 +319,33 @@ void vpci_dump_msi(void)
* holding the lock.
*/
printk("unable to print all MSI-X entries: %d\n", rc);
- process_pending_softirqs();
- continue;
+ goto pdev_done;
}
}
spin_unlock(&pdev->vpci->lock);
+ pdev_done:
+ /*
+ * Drop the lock to process pending softirqs. This is
+ * potentially unsafe, as d->pdev_list can be changed in the
+ * meantime.
+ */
+ read_unlock(&d->pci_lock);
process_pending_softirqs();
+ if ( !read_trylock(&d->pci_lock) )
+ {
+ printk("unable to access other devices for the domain\n");
+ goto domain_done;
+ }
}
+ read_unlock(&d->pci_lock);
+ domain_done:
+ /*
+ * We need this label at the end of the loop, but some
+ * compilers might not be happy about a label at the end of a
+ * compound statement, so we add an empty statement here.
+ */
+ ;
}
rcu_read_unlock(&domlist_read_lock);
}
diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
index d1126a417da9..b6abab47efdd 100644
--- a/xen/drivers/vpci/msix.c
+++ b/xen/drivers/vpci/msix.c
@@ -147,6 +147,8 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
{
struct vpci_msix *msix;
+ ASSERT(rw_is_locked(&d->pci_lock));
+
list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next )
{
const struct vpci_bar *bars = msix->pdev->vpci->header.bars;
@@ -163,7 +165,13 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
static int cf_check msix_accept(struct vcpu *v, unsigned long addr)
{
- return !!msix_find(v->domain, addr);
+ int rc;
+
+ read_lock(&v->domain->pci_lock);
+ rc = !!msix_find(v->domain, addr);
+ read_unlock(&v->domain->pci_lock);
+
+ return rc;
}
static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
@@ -358,21 +366,35 @@ static int adjacent_read(const struct domain *d, const struct vpci_msix *msix,
static int cf_check msix_read(
struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
{
- const struct domain *d = v->domain;
- struct vpci_msix *msix = msix_find(d, addr);
+ struct domain *d = v->domain;
+ struct vpci_msix *msix;
const struct vpci_msix_entry *entry;
unsigned int offset;
*data = ~0UL;
+ read_lock(&d->pci_lock);
+
+ msix = msix_find(d, addr);
if ( !msix )
+ {
+ read_unlock(&d->pci_lock);
return X86EMUL_RETRY;
+ }
if ( adjacent_handle(msix, addr) )
- return adjacent_read(d, msix, addr, len, data);
+ {
+ int rc = adjacent_read(d, msix, addr, len, data);
+
+ read_unlock(&d->pci_lock);
+ return rc;
+ }
if ( !access_allowed(msix->pdev, addr, len) )
+ {
+ read_unlock(&d->pci_lock);
return X86EMUL_OKAY;
+ }
spin_lock(&msix->pdev->vpci->lock);
entry = get_entry(msix, addr);
@@ -404,6 +426,7 @@ static int cf_check msix_read(
break;
}
spin_unlock(&msix->pdev->vpci->lock);
+ read_unlock(&d->pci_lock);
return X86EMUL_OKAY;
}
@@ -491,19 +514,33 @@ static int adjacent_write(const struct domain *d, const struct vpci_msix *msix,
static int cf_check msix_write(
struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
{
- const struct domain *d = v->domain;
- struct vpci_msix *msix = msix_find(d, addr);
+ struct domain *d = v->domain;
+ struct vpci_msix *msix;
struct vpci_msix_entry *entry;
unsigned int offset;
+ read_lock(&d->pci_lock);
+
+ msix = msix_find(d, addr);
if ( !msix )
+ {
+ read_unlock(&d->pci_lock);
return X86EMUL_RETRY;
+ }
if ( adjacent_handle(msix, addr) )
- return adjacent_write(d, msix, addr, len, data);
+ {
+ int rc = adjacent_write(d, msix, addr, len, data);
+
+ read_unlock(&d->pci_lock);
+ return rc;
+ }
if ( !access_allowed(msix->pdev, addr, len) )
+ {
+ read_unlock(&d->pci_lock);
return X86EMUL_OKAY;
+ }
spin_lock(&msix->pdev->vpci->lock);
entry = get_entry(msix, addr);
@@ -579,6 +616,7 @@ static int cf_check msix_write(
break;
}
spin_unlock(&msix->pdev->vpci->lock);
+ read_unlock(&d->pci_lock);
return X86EMUL_OKAY;
}
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 72ef277c4f8e..a1a004460491 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -42,6 +42,8 @@ extern vpci_register_init_t *const __end_vpci_array[];
void vpci_remove_device(struct pci_dev *pdev)
{
+ ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
if ( !has_vpci(pdev->domain) || !pdev->vpci )
return;
@@ -77,6 +79,8 @@ int vpci_add_handlers(struct pci_dev *pdev)
const unsigned long *ro_map;
int rc = 0;
+ ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
if ( !has_vpci(pdev->domain) )
return 0;
@@ -361,7 +365,7 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
{
- const struct domain *d = current->domain;
+ struct domain *d = current->domain;
const struct pci_dev *pdev;
const struct vpci_register *r;
unsigned int data_offset = 0;
@@ -375,13 +379,19 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
/*
* Find the PCI dev matching the address, which for hwdom also requires
- * consulting DomXEN. Passthrough everything that's not trapped.
+ * consulting DomXEN. Passthrough everything that's not trapped.
+ * If this is hwdom and the device is assigned to dom_xen, acquiring hwdom's
+ * pci_lock is sufficient.
*/
+ read_lock(&d->pci_lock);
pdev = pci_get_pdev(d, sbdf);
if ( !pdev && is_hardware_domain(d) )
pdev = pci_get_pdev(dom_xen, sbdf);
if ( !pdev || !pdev->vpci )
+ {
+ read_unlock(&d->pci_lock);
return vpci_read_hw(sbdf, reg, size);
+ }
spin_lock(&pdev->vpci->lock);
@@ -428,6 +438,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
ASSERT(data_offset < size);
}
spin_unlock(&pdev->vpci->lock);
+ read_unlock(&d->pci_lock);
if ( data_offset < size )
{
@@ -470,7 +481,7 @@ static void vpci_write_helper(const struct pci_dev *pdev,
void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
uint32_t data)
{
- const struct domain *d = current->domain;
+ struct domain *d = current->domain;
const struct pci_dev *pdev;
const struct vpci_register *r;
unsigned int data_offset = 0;
@@ -483,8 +494,14 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
/*
* Find the PCI dev matching the address, which for hwdom also requires
- * consulting DomXEN. Passthrough everything that's not trapped.
+ * consulting DomXEN. Passthrough everything that's not trapped.
+ * If this is hwdom and the device is assigned to dom_xen, acquiring hwdom's
+ * pci_lock is sufficient.
+ *
+ * TODO: We need to take pci_locks in exclusive mode only if we
+ * are modifying BARs, so there is room for improvement.
*/
+ write_lock(&d->pci_lock);
pdev = pci_get_pdev(d, sbdf);
if ( !pdev && is_hardware_domain(d) )
pdev = pci_get_pdev(dom_xen, sbdf);
@@ -493,6 +510,8 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
/* Ignore writes to read-only devices, which have no ->vpci. */
const unsigned long *ro_map = pci_get_ro_map(sbdf.seg);
+ write_unlock(&d->pci_lock);
+
if ( !ro_map || !test_bit(sbdf.bdf, ro_map) )
vpci_write_hw(sbdf, reg, size, data);
return;
@@ -534,6 +553,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
ASSERT(data_offset < size);
}
spin_unlock(&pdev->vpci->lock);
+ write_unlock(&d->pci_lock);
if ( data_offset < size )
/* Tailing gap, write the remaining. */
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 9da91e0e6244..37f5922f3206 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -462,7 +462,8 @@ struct domain
#ifdef CONFIG_HAS_PCI
struct list_head pdev_list;
/*
- * pci_lock protects access to pdev_list.
+ * pci_lock protects access to pdev_list. pci_lock also protects pdev->vpci
+ * structure from being removed.
*
* Any user *reading* from pdev_list, or from devices stored in pdev_list,
* should hold either pcidevs_lock() or pci_lock in read mode. Optionally,
--
2.43.0
On Tue, Jan 09, 2024 at 04:51:16PM -0500, Stewart Hildebrand wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>
> Use a previously introduced per-domain read/write lock to check
> whether vpci is present, so we are sure there are no accesses to the
> contents of the vpci struct if not.
Nit: isn't this sentence kind of awkward? I would word it as:
"Use the per-domain PCI read/write lock to protect the presence of the
pci device vpci field."
> This lock can be used (and in a
> few cases is used right away) so that vpci removal can be performed
> while holding the lock in write mode. Previously such removal could
> race with vpci_read for example.
>
> When taking both d->pci_lock and pdev->vpci->lock, they should be
> taken in this exact order: d->pci_lock then pdev->vpci->lock to avoid
> possible deadlock situations.
>
> 1. Per-domain's pci_lock is used to protect pdev->vpci structure
> from being removed.
>
> 2. Writing the command register and ROM BAR register may trigger
> modify_bars to run, which in turn may access multiple pdevs while
> checking for the existing BAR's overlap. The overlapping check, if
> done under the read lock, requires vpci->lock to be acquired on both
> devices being compared, which may produce a deadlock. It is not
> possible to upgrade read lock to write lock in such a case. So, in
> order to prevent the deadlock, use d->pci_lock instead.
... use d->pci_lock in write mode instead.
>
> All other code, which doesn't lead to pdev->vpci destruction and does
> not access multiple pdevs at the same time, can still use a
> combination of the read lock and pdev->vpci->lock.
>
> 3. Drop const qualifier where the new rwlock is used and this is
> appropriate.
>
> 4. Do not call process_pending_softirqs with any locks held. For that
> unlock prior to the call and re-acquire the locks after. After
> re-acquiring the lock there is no need to check if pdev->vpci exists:
> - in apply_map because of the context it is called (no race condition
> possible)
> - for MSI/MSI-X debug code because it is called at the end of
> pdev->vpci access and no further access to pdev->vpci is made
>
> 5. Use d->pci_lock around for_each_pdev and pci_get_pdev_by_domain
> while accessing pdevs in vpci code.
>
> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
> Suggested-by: Jan Beulich <jbeulich@suse.com>
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
Just one code fix regarding the usage of read_lock (instead of the
trylock version) in vpci_msix_arch_print(). With that and the
comments adjusted:
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> Changes in v12:
> - s/pci_rwlock/pci_lock/ in commit message
> - expand comment about scope of pci_lock in sched.h
> - in vpci_{read,write}, if hwdom is trying to access a device assigned
> to dom_xen, holding hwdom->pci_lock is sufficient (no need to hold
> dom_xen->pci_lock)
> - reintroduce ASSERT in vmx_pi_update_irte()
> - reintroduce ASSERT in __pci_enable_msi{x}()
> - delete note 6. in commit message about removing ASSERTs since we have
> reintroduced them
>
> Changes in v11:
> - Fixed commit message regarding possible spinlocks
> - Removed parameter from allocate_and_map_msi_pirq(), which was added
> in the prev version. Now we are taking pcidevs_lock in
> physdev_map_pirq()
> - Returned ASSERT to pci_enable_msi
> - Fixed case when we took read lock instead of write one
> - Fixed label indentation
>
> Changes in v10:
> - Moved printk past locked area
> - Returned back ASSERTs
> - Added new parameter to allocate_and_map_msi_pirq() so it knows if
> it should take the global pci lock
> - Added comment about possible improvement in vpci_write
> - Changed ASSERT(rw_is_locked()) to rw_is_write_locked() in
> appropriate places
> - Renamed release_domain_locks() to release_domain_write_locks()
> - moved domain_done label in vpci_dump_msi() to correct place
> Changes in v9:
> - extended locked region to protect vpci_remove_device and
> vpci_add_handlers() calls
> - vpci_write() takes lock in the write mode to protect
> potential call to modify_bars()
> - renamed lock releasing function
> - removed ASSERT()s from msi code
> - added trylock in vpci_dump_msi
>
> Changes in v8:
> - changed d->vpci_lock to d->pci_lock
> - introducing d->pci_lock in a separate patch
> - extended locked region in vpci_process_pending
> - removed pcidevs_lock in vpci_dump_msi()
> - removed some changes as they are not needed with
> the new locking scheme
> - added handling for hwdom && dom_xen case
> ---
> xen/arch/x86/hvm/vmsi.c | 22 +++++++--------
> xen/arch/x86/hvm/vmx/vmx.c | 2 +-
> xen/arch/x86/irq.c | 8 +++---
> xen/arch/x86/msi.c | 14 +++++-----
> xen/arch/x86/physdev.c | 2 ++
> xen/drivers/passthrough/pci.c | 9 +++---
> xen/drivers/vpci/header.c | 18 ++++++++++++
> xen/drivers/vpci/msi.c | 28 +++++++++++++++++--
> xen/drivers/vpci/msix.c | 52 ++++++++++++++++++++++++++++++-----
> xen/drivers/vpci/vpci.c | 28 ++++++++++++++++---
> xen/include/xen/sched.h | 3 +-
> 11 files changed, 144 insertions(+), 42 deletions(-)
>
> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
> index 128f23636279..03caf91beefc 100644
> --- a/xen/arch/x86/hvm/vmsi.c
> +++ b/xen/arch/x86/hvm/vmsi.c
> @@ -468,7 +468,7 @@ int msixtbl_pt_register(struct domain *d, struct pirq *pirq, uint64_t gtable)
> struct msixtbl_entry *entry, *new_entry;
> int r = -EINVAL;
>
> - ASSERT(pcidevs_locked());
> + ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
> ASSERT(rw_is_write_locked(&d->event_lock));
>
> if ( !msixtbl_initialised(d) )
> @@ -538,7 +538,7 @@ void msixtbl_pt_unregister(struct domain *d, struct pirq *pirq)
> struct pci_dev *pdev;
> struct msixtbl_entry *entry;
>
> - ASSERT(pcidevs_locked());
> + ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
> ASSERT(rw_is_write_locked(&d->event_lock));
>
> if ( !msixtbl_initialised(d) )
> @@ -684,7 +684,7 @@ static int vpci_msi_update(const struct pci_dev *pdev, uint32_t data,
> {
> unsigned int i;
>
> - ASSERT(pcidevs_locked());
> + ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>
> if ( (address & MSI_ADDR_BASE_MASK) != MSI_ADDR_HEADER )
> {
> @@ -725,8 +725,8 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
> int rc;
>
> ASSERT(msi->arch.pirq != INVALID_PIRQ);
> + ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>
> - pcidevs_lock();
> for ( i = 0; i < msi->vectors && msi->arch.bound; i++ )
> {
> struct xen_domctl_bind_pt_irq unbind = {
> @@ -745,7 +745,6 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
>
> msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address,
> msi->vectors, msi->arch.pirq, msi->mask);
> - pcidevs_unlock();
> }
>
> static int vpci_msi_enable(const struct pci_dev *pdev, unsigned int nr,
> @@ -778,15 +777,14 @@ int vpci_msi_arch_enable(struct vpci_msi *msi, const struct pci_dev *pdev,
> int rc;
>
> ASSERT(msi->arch.pirq == INVALID_PIRQ);
> + ASSERT(rw_is_locked(&pdev->domain->pci_lock));
> rc = vpci_msi_enable(pdev, vectors, 0);
> if ( rc < 0 )
> return rc;
> msi->arch.pirq = rc;
>
> - pcidevs_lock();
> msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address, vectors,
> msi->arch.pirq, msi->mask);
> - pcidevs_unlock();
>
> return 0;
> }
> @@ -797,8 +795,8 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
> unsigned int i;
>
> ASSERT(pirq != INVALID_PIRQ);
> + ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>
> - pcidevs_lock();
> for ( i = 0; i < nr && bound; i++ )
> {
> struct xen_domctl_bind_pt_irq bind = {
> @@ -814,7 +812,6 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
> write_lock(&pdev->domain->event_lock);
> unmap_domain_pirq(pdev->domain, pirq);
> write_unlock(&pdev->domain->event_lock);
> - pcidevs_unlock();
> }
>
> void vpci_msi_arch_disable(struct vpci_msi *msi, const struct pci_dev *pdev)
> @@ -854,6 +851,7 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
> int rc;
>
> ASSERT(entry->arch.pirq == INVALID_PIRQ);
> + ASSERT(rw_is_locked(&pdev->domain->pci_lock));
> rc = vpci_msi_enable(pdev, vmsix_entry_nr(pdev->vpci->msix, entry),
> table_base);
> if ( rc < 0 )
> @@ -861,7 +859,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
>
> entry->arch.pirq = rc;
>
> - pcidevs_lock();
> rc = vpci_msi_update(pdev, entry->data, entry->addr, 1, entry->arch.pirq,
> entry->masked);
> if ( rc )
> @@ -869,7 +866,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
> vpci_msi_disable(pdev, entry->arch.pirq, 1, false);
> entry->arch.pirq = INVALID_PIRQ;
> }
> - pcidevs_unlock();
>
> return rc;
> }
> @@ -895,6 +891,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
> {
> unsigned int i;
>
> + ASSERT(rw_is_locked(&msix->pdev->domain->pci_lock));
> +
> for ( i = 0; i < msix->max_entries; i++ )
> {
> const struct vpci_msix_entry *entry = &msix->entries[i];
> @@ -913,7 +911,9 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
> struct pci_dev *pdev = msix->pdev;
>
> spin_unlock(&msix->pdev->vpci->lock);
> + read_unlock(&pdev->domain->pci_lock);
> process_pending_softirqs();
> + read_lock(&pdev->domain->pci_lock);
This should be a trylock, like it is in the caller (vpci_dump_msi()).
> /* NB: we assume that pdev cannot go away for an alive domain. */
> if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
> return -EBUSY;
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index 8ff675883c2b..890faef48b82 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -413,7 +413,7 @@ static int cf_check vmx_pi_update_irte(const struct vcpu *v,
>
> spin_unlock_irq(&desc->lock);
>
> - ASSERT(pcidevs_locked());
> + ASSERT(pcidevs_locked() || rw_is_locked(&msi_desc->dev->domain->pci_lock));
>
> return iommu_update_ire_from_msi(msi_desc, &msi_desc->msg);
>
> diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
> index 50e49e1a4b6f..4d89d9af99fe 100644
> --- a/xen/arch/x86/irq.c
> +++ b/xen/arch/x86/irq.c
> @@ -2166,7 +2166,7 @@ int map_domain_pirq(
> struct pci_dev *pdev;
> unsigned int nr = 0;
>
> - ASSERT(pcidevs_locked());
> + ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>
> ret = -ENODEV;
> if ( !cpu_has_apic )
> @@ -2323,7 +2323,7 @@ int unmap_domain_pirq(struct domain *d, int pirq)
> if ( (pirq < 0) || (pirq >= d->nr_pirqs) )
> return -EINVAL;
>
> - ASSERT(pcidevs_locked());
> + ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
> ASSERT(rw_is_write_locked(&d->event_lock));
>
> info = pirq_info(d, pirq);
> @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
> {
> int irq, pirq, ret;
>
> + ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
> +
> switch ( type )
> {
> case MAP_PIRQ_TYPE_MSI:
> @@ -2917,7 +2919,6 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>
> msi->irq = irq;
>
> - pcidevs_lock();
> /* Verify or get pirq. */
> write_lock(&d->event_lock);
> pirq = allocate_pirq(d, index, *pirq_p, irq, type, &msi->entry_nr);
> @@ -2933,7 +2934,6 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>
> done:
> write_unlock(&d->event_lock);
> - pcidevs_unlock();
> if ( ret )
> {
> switch ( type )
> diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
> index 335c0868a225..7da2affa2e82 100644
> --- a/xen/arch/x86/msi.c
> +++ b/xen/arch/x86/msi.c
> @@ -602,7 +602,7 @@ static int msi_capability_init(struct pci_dev *dev,
> unsigned int i, mpos;
> uint16_t control;
>
> - ASSERT(pcidevs_locked());
> + ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
> pos = pci_find_cap_offset(dev->sbdf, PCI_CAP_ID_MSI);
> if ( !pos )
> return -ENODEV;
> @@ -771,7 +771,7 @@ static int msix_capability_init(struct pci_dev *dev,
> if ( !pos )
> return -ENODEV;
>
> - ASSERT(pcidevs_locked());
> + ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
>
> control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));
> /*
> @@ -988,11 +988,11 @@ static int __pci_enable_msi(struct pci_dev *pdev, struct msi_info *msi,
> {
> struct msi_desc *old_desc;
>
> - ASSERT(pcidevs_locked());
> -
> if ( !pdev )
> return -ENODEV;
>
> + ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
> +
> old_desc = find_msi_entry(pdev, msi->irq, PCI_CAP_ID_MSI);
> if ( old_desc )
> {
> @@ -1043,11 +1043,11 @@ static int __pci_enable_msix(struct pci_dev *pdev, struct msi_info *msi,
> {
> struct msi_desc *old_desc;
>
> - ASSERT(pcidevs_locked());
> -
> if ( !pdev || !pdev->msix )
> return -ENODEV;
>
> + ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
> +
> if ( msi->entry_nr >= pdev->msix->nr_entries )
> return -EINVAL;
>
> @@ -1154,7 +1154,7 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off)
> int pci_enable_msi(struct pci_dev *pdev, struct msi_info *msi,
> struct msi_desc **desc)
> {
> - ASSERT(pcidevs_locked());
> + ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
>
> if ( !use_msi )
> return -EPERM;
> diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c
> index 47c4da0af7e1..369c9e788c1c 100644
> --- a/xen/arch/x86/physdev.c
> +++ b/xen/arch/x86/physdev.c
> @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
>
> case MAP_PIRQ_TYPE_MSI:
> case MAP_PIRQ_TYPE_MULTI_MSI:
> + pcidevs_lock();
> ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
> + pcidevs_unlock();
> break;
>
> default:
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 1439d1ef2b26..3a973324bca1 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -750,7 +750,6 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
> pdev->domain = hardware_domain;
> write_lock(&hardware_domain->pci_lock);
> list_add(&pdev->domain_list, &hardware_domain->pdev_list);
> - write_unlock(&hardware_domain->pci_lock);
>
> /*
> * For devices not discovered by Xen during boot, add vPCI handlers
> @@ -759,18 +758,18 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
> ret = vpci_add_handlers(pdev);
> if ( ret )
> {
> - printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
> - write_lock(&hardware_domain->pci_lock);
> list_del(&pdev->domain_list);
> write_unlock(&hardware_domain->pci_lock);
> pdev->domain = NULL;
> + printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
> goto out;
> }
> + write_unlock(&hardware_domain->pci_lock);
> ret = iommu_add_device(pdev);
> if ( ret )
> {
> - vpci_remove_device(pdev);
> write_lock(&hardware_domain->pci_lock);
> + vpci_remove_device(pdev);
> list_del(&pdev->domain_list);
> write_unlock(&hardware_domain->pci_lock);
> pdev->domain = NULL;
> @@ -1146,7 +1145,9 @@ static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt,
> } while ( devfn != pdev->devfn &&
> PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
>
> + write_lock(&ctxt->d->pci_lock);
> err = vpci_add_handlers(pdev);
> + write_unlock(&ctxt->d->pci_lock);
> if ( err )
> printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n",
> ctxt->d->domain_id, err);
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index 58195549d50a..8f5850b8cf6d 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -173,6 +173,7 @@ bool vpci_process_pending(struct vcpu *v)
> if ( rc == -ERESTART )
> return true;
>
> + write_lock(&v->domain->pci_lock);
> spin_lock(&v->vpci.pdev->vpci->lock);
> /* Disable memory decoding unconditionally on failure. */
> modify_decoding(v->vpci.pdev,
> @@ -191,6 +192,7 @@ bool vpci_process_pending(struct vcpu *v)
> * failure.
> */
> vpci_remove_device(v->vpci.pdev);
> + write_unlock(&v->domain->pci_lock);
> }
>
> return false;
> @@ -202,8 +204,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
> struct map_data data = { .d = d, .map = true };
> int rc;
>
> + ASSERT(rw_is_write_locked(&d->pci_lock));
> +
> while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
> + {
> + /*
> + * It's safe to drop and reacquire the lock in this context
> + * without risking pdev disappearing because devices cannot be
> + * removed until the initial domain has been started.
> + */
> + write_unlock(&d->pci_lock);
> process_pending_softirqs();
> + write_lock(&d->pci_lock);
Hm, I should have noticed before, but we already call
process_pending_softirqs() with the pdev->vpci->lock held here, so it
would make sense to drop it also.
Fix should be in a separate patch.
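I.e. something along these lines (untested sketch only):

  spin_unlock(&pdev->vpci->lock);
  write_unlock(&d->pci_lock);
  process_pending_softirqs();
  write_lock(&d->pci_lock);
  spin_lock(&pdev->vpci->lock);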
> + }
> +
> rangeset_destroy(mem);
> if ( !rc )
> modify_decoding(pdev, cmd, false);
> @@ -244,6 +258,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> unsigned int i;
> int rc;
>
> + ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
> if ( !mem )
> return -ENOMEM;
>
> @@ -524,6 +540,8 @@ static int cf_check init_header(struct pci_dev *pdev)
> int rc;
> bool mask_cap_list = false;
>
> + ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
> switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
> {
> case PCI_HEADER_TYPE_NORMAL:
> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
> index a253ccbd7db7..6ff71e5f9ab7 100644
> --- a/xen/drivers/vpci/msi.c
> +++ b/xen/drivers/vpci/msi.c
> @@ -263,7 +263,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
>
> void vpci_dump_msi(void)
> {
> - const struct domain *d;
> + struct domain *d;
>
> rcu_read_lock(&domlist_read_lock);
> for_each_domain ( d )
> @@ -275,6 +275,9 @@ void vpci_dump_msi(void)
>
> printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
>
> + if ( !read_trylock(&d->pci_lock) )
> + continue;
> +
> for_each_pdev ( d, pdev )
> {
> const struct vpci_msi *msi;
> @@ -316,14 +319,33 @@ void vpci_dump_msi(void)
> * holding the lock.
> */
> printk("unable to print all MSI-X entries: %d\n", rc);
> - process_pending_softirqs();
> - continue;
> + goto pdev_done;
> }
> }
>
> spin_unlock(&pdev->vpci->lock);
> + pdev_done:
> + /*
> + * Drop the lock to process pending softirqs. This is
> + * potentially unsafe, as d->pdev_list can be changed in the
> + * meantime.
> + */
> + read_unlock(&d->pci_lock);
> process_pending_softirqs();
> + if ( !read_trylock(&d->pci_lock) )
> + {
> + printk("unable to access other devices for the domain\n");
> + goto domain_done;
> + }
> }
> + read_unlock(&d->pci_lock);
> + domain_done:
> + /*
> + * We need this label at the end of the loop, but some
> + * compilers might not be happy about a label at the end of a
> + * compound statement, so we add an empty statement here.
> + */
> + ;
> }
> rcu_read_unlock(&domlist_read_lock);
> }
> diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
> index d1126a417da9..b6abab47efdd 100644
> --- a/xen/drivers/vpci/msix.c
> +++ b/xen/drivers/vpci/msix.c
> @@ -147,6 +147,8 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
> {
> struct vpci_msix *msix;
>
> + ASSERT(rw_is_locked(&d->pci_lock));
> +
> list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next )
> {
> const struct vpci_bar *bars = msix->pdev->vpci->header.bars;
> @@ -163,7 +165,13 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>
> static int cf_check msix_accept(struct vcpu *v, unsigned long addr)
> {
> - return !!msix_find(v->domain, addr);
> + int rc;
> +
> + read_lock(&v->domain->pci_lock);
> + rc = !!msix_find(v->domain, addr);
> + read_unlock(&v->domain->pci_lock);
> +
> + return rc;
> }
>
> static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
> @@ -358,21 +366,35 @@ static int adjacent_read(const struct domain *d, const struct vpci_msix *msix,
> static int cf_check msix_read(
> struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
> {
> - const struct domain *d = v->domain;
> - struct vpci_msix *msix = msix_find(d, addr);
> + struct domain *d = v->domain;
> + struct vpci_msix *msix;
> const struct vpci_msix_entry *entry;
> unsigned int offset;
>
> *data = ~0UL;
>
> + read_lock(&d->pci_lock);
> +
> + msix = msix_find(d, addr);
> if ( !msix )
> + {
> + read_unlock(&d->pci_lock);
> return X86EMUL_RETRY;
> + }
>
> if ( adjacent_handle(msix, addr) )
> - return adjacent_read(d, msix, addr, len, data);
> + {
> + int rc = adjacent_read(d, msix, addr, len, data);
> +
> + read_unlock(&d->pci_lock);
> + return rc;
> + }
>
> if ( !access_allowed(msix->pdev, addr, len) )
> + {
> + read_unlock(&d->pci_lock);
> return X86EMUL_OKAY;
> + }
>
> spin_lock(&msix->pdev->vpci->lock);
> entry = get_entry(msix, addr);
> @@ -404,6 +426,7 @@ static int cf_check msix_read(
> break;
> }
> spin_unlock(&msix->pdev->vpci->lock);
> + read_unlock(&d->pci_lock);
>
> return X86EMUL_OKAY;
> }
> @@ -491,19 +514,33 @@ static int adjacent_write(const struct domain *d, const struct vpci_msix *msix,
> static int cf_check msix_write(
> struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
> {
> - const struct domain *d = v->domain;
> - struct vpci_msix *msix = msix_find(d, addr);
> + struct domain *d = v->domain;
> + struct vpci_msix *msix;
> struct vpci_msix_entry *entry;
> unsigned int offset;
>
> + read_lock(&d->pci_lock);
> +
> + msix = msix_find(d, addr);
> if ( !msix )
> + {
> + read_unlock(&d->pci_lock);
> return X86EMUL_RETRY;
> + }
>
> if ( adjacent_handle(msix, addr) )
> - return adjacent_write(d, msix, addr, len, data);
> + {
> + int rc = adjacent_write(d, msix, addr, len, data);
> +
> + read_unlock(&d->pci_lock);
> + return rc;
> + }
>
> if ( !access_allowed(msix->pdev, addr, len) )
> + {
> + read_unlock(&d->pci_lock);
> return X86EMUL_OKAY;
> + }
>
> spin_lock(&msix->pdev->vpci->lock);
> entry = get_entry(msix, addr);
> @@ -579,6 +616,7 @@ static int cf_check msix_write(
> break;
> }
> spin_unlock(&msix->pdev->vpci->lock);
> + read_unlock(&d->pci_lock);
>
> return X86EMUL_OKAY;
> }
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index 72ef277c4f8e..a1a004460491 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -42,6 +42,8 @@ extern vpci_register_init_t *const __end_vpci_array[];
>
> void vpci_remove_device(struct pci_dev *pdev)
> {
> + ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
> if ( !has_vpci(pdev->domain) || !pdev->vpci )
> return;
>
> @@ -77,6 +79,8 @@ int vpci_add_handlers(struct pci_dev *pdev)
> const unsigned long *ro_map;
> int rc = 0;
>
> + ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
> if ( !has_vpci(pdev->domain) )
> return 0;
>
> @@ -361,7 +365,7 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
>
> uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
> {
> - const struct domain *d = current->domain;
> + struct domain *d = current->domain;
> const struct pci_dev *pdev;
> const struct vpci_register *r;
> unsigned int data_offset = 0;
> @@ -375,13 +379,19 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>
> /*
> * Find the PCI dev matching the address, which for hwdom also requires
> - * consulting DomXEN. Passthrough everything that's not trapped.
> + * consulting DomXEN. Passthrough everything that's not trapped.
The double space in the comment above was intentional.
> + * If this is hwdom and the device is assigned to dom_xen, acquiring hwdom's
For consistency with the above sentence, could you please use DomXEN
here?
> + * pci_lock is sufficient.
> */
> + read_lock(&d->pci_lock);
> pdev = pci_get_pdev(d, sbdf);
> if ( !pdev && is_hardware_domain(d) )
> pdev = pci_get_pdev(dom_xen, sbdf);
> if ( !pdev || !pdev->vpci )
> + {
> + read_unlock(&d->pci_lock);
> return vpci_read_hw(sbdf, reg, size);
> + }
>
> spin_lock(&pdev->vpci->lock);
>
> @@ -428,6 +438,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
> ASSERT(data_offset < size);
> }
> spin_unlock(&pdev->vpci->lock);
> + read_unlock(&d->pci_lock);
>
> if ( data_offset < size )
> {
> @@ -470,7 +481,7 @@ static void vpci_write_helper(const struct pci_dev *pdev,
> void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
> uint32_t data)
> {
> - const struct domain *d = current->domain;
> + struct domain *d = current->domain;
> const struct pci_dev *pdev;
> const struct vpci_register *r;
> unsigned int data_offset = 0;
> @@ -483,8 +494,14 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>
> /*
> * Find the PCI dev matching the address, which for hwdom also requires
> - * consulting DomXEN. Passthrough everything that's not trapped.
> + * consulting DomXEN. Passthrough everything that's not trapped.
Same here re the double space removal and the usage of dom_xen below.
Thanks, Roger.
On 1/12/24 08:48, Roger Pau Monné wrote:
> On Tue, Jan 09, 2024 at 04:51:16PM -0500, Stewart Hildebrand wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> Use a previously introduced per-domain read/write lock to check
>> whether vpci is present, so we are sure there are no accesses to the
>> contents of the vpci struct if not.
>
> Nit: isn't this sentence kind of awkward? I would word it as:
>
> "Use the per-domain PCI read/write lock to protect the presence of the
> pci device vpci field."
I'll use your suggested wording.
>
>> This lock can be used (and in a
>> few cases is used right away) so that vpci removal can be performed
>> while holding the lock in write mode. Previously such removal could
>> race with vpci_read for example.
>>
>> When taking both d->pci_lock and pdev->vpci->lock, they should be
>> taken in this exact order: d->pci_lock then pdev->vpci->lock to avoid
>> possible deadlock situations.
>>
>> 1. Per-domain's pci_lock is used to protect pdev->vpci structure
>> from being removed.
>>
>> 2. Writing the command register and ROM BAR register may trigger
>> modify_bars to run, which in turn may access multiple pdevs while
>> checking for the existing BAR's overlap. The overlapping check, if
>> done under the read lock, requires vpci->lock to be acquired on both
>> devices being compared, which may produce a deadlock. It is not
>> possible to upgrade read lock to write lock in such a case. So, in
>> order to prevent the deadlock, use d->pci_lock instead.
>
> ... use d->pci_lock in write mode instead.
Will fix
>
>>
>> All other code, which doesn't lead to pdev->vpci destruction and does
>> not access multiple pdevs at the same time, can still use a
>> combination of the read lock and pdev->vpci->lock.
>>
>> 3. Drop const qualifier where the new rwlock is used and this is
>> appropriate.
>>
>> 4. Do not call process_pending_softirqs with any locks held. For that
>> unlock prior to the call and re-acquire the locks after. After
>> re-acquiring the lock there is no need to check if pdev->vpci exists:
>> - in apply_map because of the context it is called (no race condition
>> possible)
>> - for MSI/MSI-X debug code because it is called at the end of
>> pdev->vpci access and no further access to pdev->vpci is made
>>
>> 5. Use d->pci_lock around for_each_pdev and pci_get_pdev_by_domain
>> while accessing pdevs in vpci code.
>>
>> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
>> Suggested-by: Jan Beulich <jbeulich@suse.com>
>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
>
> Just one code fix regarding the usage of read_lock (instead of the
> trylock version) in vpci_msix_arch_print(). With that and the
> comments adjusted:
>
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Thanks! I'll reply to this with a v12.1 patch so as to not unnecessarily resend the whole series.
>
>> ---
>> Changes in v12:
>> - s/pci_rwlock/pci_lock/ in commit message
>> - expand comment about scope of pci_lock in sched.h
>> - in vpci_{read,write}, if hwdom is trying to access a device assigned
>> to dom_xen, holding hwdom->pci_lock is sufficient (no need to hold
>> dom_xen->pci_lock)
>> - reintroduce ASSERT in vmx_pi_update_irte()
>> - reintroduce ASSERT in __pci_enable_msi{x}()
>> - delete note 6. in commit message about removing ASSERTs since we have
>> reintroduced them
>>
>> Changes in v11:
>> - Fixed commit message regarding possible spinlocks
>> - Removed parameter from allocate_and_map_msi_pirq(), which was added
>> in the prev version. Now we are taking pcidevs_lock in
>> physdev_map_pirq()
>> - Returned ASSERT to pci_enable_msi
>> - Fixed case when we took read lock instead of write one
>> - Fixed label indentation
>>
>> Changes in v10:
>> - Moved printk past locked area
>> - Returned back ASSERTs
>> - Added new parameter to allocate_and_map_msi_pirq() so it knows if
>> it should take the global pci lock
>> - Added comment about possible improvement in vpci_write
>> - Changed ASSERT(rw_is_locked()) to rw_is_write_locked() in
>> appropriate places
>> - Renamed release_domain_locks() to release_domain_write_locks()
>> - moved domain_done label in vpci_dump_msi() to correct place
>> Changes in v9:
>> - extended locked region to protect vpci_remove_device and
>> vpci_add_handlers() calls
>> - vpci_write() takes lock in the write mode to protect
>> potential call to modify_bars()
>> - renamed lock releasing function
>> - removed ASSERT()s from msi code
>> - added trylock in vpci_dump_msi
>>
>> Changes in v8:
>> - changed d->vpci_lock to d->pci_lock
>> - introducing d->pci_lock in a separate patch
>> - extended locked region in vpci_process_pending
>> - removed pcidevs_lock in vpci_dump_msi()
>> - removed some changes as they are not needed with
>> the new locking scheme
>> - added handling for hwdom && dom_xen case
>> ---
>> xen/arch/x86/hvm/vmsi.c | 22 +++++++--------
>> xen/arch/x86/hvm/vmx/vmx.c | 2 +-
>> xen/arch/x86/irq.c | 8 +++---
>> xen/arch/x86/msi.c | 14 +++++-----
>> xen/arch/x86/physdev.c | 2 ++
>> xen/drivers/passthrough/pci.c | 9 +++---
>> xen/drivers/vpci/header.c | 18 ++++++++++++
>> xen/drivers/vpci/msi.c | 28 +++++++++++++++++--
>> xen/drivers/vpci/msix.c | 52 ++++++++++++++++++++++++++++++-----
>> xen/drivers/vpci/vpci.c | 28 ++++++++++++++++---
>> xen/include/xen/sched.h | 3 +-
>> 11 files changed, 144 insertions(+), 42 deletions(-)
>>
>> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
>> index 128f23636279..03caf91beefc 100644
>> --- a/xen/arch/x86/hvm/vmsi.c
>> +++ b/xen/arch/x86/hvm/vmsi.c
>> @@ -468,7 +468,7 @@ int msixtbl_pt_register(struct domain *d, struct pirq *pirq, uint64_t gtable)
>> struct msixtbl_entry *entry, *new_entry;
>> int r = -EINVAL;
>>
>> - ASSERT(pcidevs_locked());
>> + ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>> ASSERT(rw_is_write_locked(&d->event_lock));
>>
>> if ( !msixtbl_initialised(d) )
>> @@ -538,7 +538,7 @@ void msixtbl_pt_unregister(struct domain *d, struct pirq *pirq)
>> struct pci_dev *pdev;
>> struct msixtbl_entry *entry;
>>
>> - ASSERT(pcidevs_locked());
>> + ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>> ASSERT(rw_is_write_locked(&d->event_lock));
>>
>> if ( !msixtbl_initialised(d) )
>> @@ -684,7 +684,7 @@ static int vpci_msi_update(const struct pci_dev *pdev, uint32_t data,
>> {
>> unsigned int i;
>>
>> - ASSERT(pcidevs_locked());
>> + ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>>
>> if ( (address & MSI_ADDR_BASE_MASK) != MSI_ADDR_HEADER )
>> {
>> @@ -725,8 +725,8 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
>> int rc;
>>
>> ASSERT(msi->arch.pirq != INVALID_PIRQ);
>> + ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>>
>> - pcidevs_lock();
>> for ( i = 0; i < msi->vectors && msi->arch.bound; i++ )
>> {
>> struct xen_domctl_bind_pt_irq unbind = {
>> @@ -745,7 +745,6 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
>>
>> msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address,
>> msi->vectors, msi->arch.pirq, msi->mask);
>> - pcidevs_unlock();
>> }
>>
>> static int vpci_msi_enable(const struct pci_dev *pdev, unsigned int nr,
>> @@ -778,15 +777,14 @@ int vpci_msi_arch_enable(struct vpci_msi *msi, const struct pci_dev *pdev,
>> int rc;
>>
>> ASSERT(msi->arch.pirq == INVALID_PIRQ);
>> + ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>> rc = vpci_msi_enable(pdev, vectors, 0);
>> if ( rc < 0 )
>> return rc;
>> msi->arch.pirq = rc;
>>
>> - pcidevs_lock();
>> msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address, vectors,
>> msi->arch.pirq, msi->mask);
>> - pcidevs_unlock();
>>
>> return 0;
>> }
>> @@ -797,8 +795,8 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
>> unsigned int i;
>>
>> ASSERT(pirq != INVALID_PIRQ);
>> + ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>>
>> - pcidevs_lock();
>> for ( i = 0; i < nr && bound; i++ )
>> {
>> struct xen_domctl_bind_pt_irq bind = {
>> @@ -814,7 +812,6 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
>> write_lock(&pdev->domain->event_lock);
>> unmap_domain_pirq(pdev->domain, pirq);
>> write_unlock(&pdev->domain->event_lock);
>> - pcidevs_unlock();
>> }
>>
>> void vpci_msi_arch_disable(struct vpci_msi *msi, const struct pci_dev *pdev)
>> @@ -854,6 +851,7 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
>> int rc;
>>
>> ASSERT(entry->arch.pirq == INVALID_PIRQ);
>> + ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>> rc = vpci_msi_enable(pdev, vmsix_entry_nr(pdev->vpci->msix, entry),
>> table_base);
>> if ( rc < 0 )
>> @@ -861,7 +859,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
>>
>> entry->arch.pirq = rc;
>>
>> - pcidevs_lock();
>> rc = vpci_msi_update(pdev, entry->data, entry->addr, 1, entry->arch.pirq,
>> entry->masked);
>> if ( rc )
>> @@ -869,7 +866,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
>> vpci_msi_disable(pdev, entry->arch.pirq, 1, false);
>> entry->arch.pirq = INVALID_PIRQ;
>> }
>> - pcidevs_unlock();
>>
>> return rc;
>> }
>> @@ -895,6 +891,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>> {
>> unsigned int i;
>>
>> + ASSERT(rw_is_locked(&msix->pdev->domain->pci_lock));
>> +
>> for ( i = 0; i < msix->max_entries; i++ )
>> {
>> const struct vpci_msix_entry *entry = &msix->entries[i];
>> @@ -913,7 +911,9 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>> struct pci_dev *pdev = msix->pdev;
>>
>> spin_unlock(&msix->pdev->vpci->lock);
>> + read_unlock(&pdev->domain->pci_lock);
>> process_pending_softirqs();
>> + read_lock(&pdev->domain->pci_lock);
>
> This should be a trylock, like it's on the caller (vpci_dump_msi()).
I'll replace the read_lock with:
if ( !read_trylock(&pdev->domain->pci_lock) )
return -EBUSY;
>
>> /* NB: we assume that pdev cannot go away for an alive domain. */
>> if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
>> return -EBUSY;
>> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
>> index 8ff675883c2b..890faef48b82 100644
>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>> @@ -413,7 +413,7 @@ static int cf_check vmx_pi_update_irte(const struct vcpu *v,
>>
>> spin_unlock_irq(&desc->lock);
>>
>> - ASSERT(pcidevs_locked());
>> + ASSERT(pcidevs_locked() || rw_is_locked(&msi_desc->dev->domain->pci_lock));
>>
>> return iommu_update_ire_from_msi(msi_desc, &msi_desc->msg);
>>
>> diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
>> index 50e49e1a4b6f..4d89d9af99fe 100644
>> --- a/xen/arch/x86/irq.c
>> +++ b/xen/arch/x86/irq.c
>> @@ -2166,7 +2166,7 @@ int map_domain_pirq(
>> struct pci_dev *pdev;
>> unsigned int nr = 0;
>>
>> - ASSERT(pcidevs_locked());
>> + ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>>
>> ret = -ENODEV;
>> if ( !cpu_has_apic )
>> @@ -2323,7 +2323,7 @@ int unmap_domain_pirq(struct domain *d, int pirq)
>> if ( (pirq < 0) || (pirq >= d->nr_pirqs) )
>> return -EINVAL;
>>
>> - ASSERT(pcidevs_locked());
>> + ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>> ASSERT(rw_is_write_locked(&d->event_lock));
>>
>> info = pirq_info(d, pirq);
>> @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>> {
>> int irq, pirq, ret;
>>
>> + ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>> +
>> switch ( type )
>> {
>> case MAP_PIRQ_TYPE_MSI:
>> @@ -2917,7 +2919,6 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>>
>> msi->irq = irq;
>>
>> - pcidevs_lock();
>> /* Verify or get pirq. */
>> write_lock(&d->event_lock);
>> pirq = allocate_pirq(d, index, *pirq_p, irq, type, &msi->entry_nr);
>> @@ -2933,7 +2934,6 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>>
>> done:
>> write_unlock(&d->event_lock);
>> - pcidevs_unlock();
>> if ( ret )
>> {
>> switch ( type )
>> diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
>> index 335c0868a225..7da2affa2e82 100644
>> --- a/xen/arch/x86/msi.c
>> +++ b/xen/arch/x86/msi.c
>> @@ -602,7 +602,7 @@ static int msi_capability_init(struct pci_dev *dev,
>> unsigned int i, mpos;
>> uint16_t control;
>>
>> - ASSERT(pcidevs_locked());
>> + ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
>> pos = pci_find_cap_offset(dev->sbdf, PCI_CAP_ID_MSI);
>> if ( !pos )
>> return -ENODEV;
>> @@ -771,7 +771,7 @@ static int msix_capability_init(struct pci_dev *dev,
>> if ( !pos )
>> return -ENODEV;
>>
>> - ASSERT(pcidevs_locked());
>> + ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
>>
>> control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));
>> /*
>> @@ -988,11 +988,11 @@ static int __pci_enable_msi(struct pci_dev *pdev, struct msi_info *msi,
>> {
>> struct msi_desc *old_desc;
>>
>> - ASSERT(pcidevs_locked());
>> -
>> if ( !pdev )
>> return -ENODEV;
>>
>> + ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
>> +
>> old_desc = find_msi_entry(pdev, msi->irq, PCI_CAP_ID_MSI);
>> if ( old_desc )
>> {
>> @@ -1043,11 +1043,11 @@ static int __pci_enable_msix(struct pci_dev *pdev, struct msi_info *msi,
>> {
>> struct msi_desc *old_desc;
>>
>> - ASSERT(pcidevs_locked());
>> -
>> if ( !pdev || !pdev->msix )
>> return -ENODEV;
>>
>> + ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
>> +
>> if ( msi->entry_nr >= pdev->msix->nr_entries )
>> return -EINVAL;
>>
>> @@ -1154,7 +1154,7 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off)
>> int pci_enable_msi(struct pci_dev *pdev, struct msi_info *msi,
>> struct msi_desc **desc)
>> {
>> - ASSERT(pcidevs_locked());
>> + ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
>>
>> if ( !use_msi )
>> return -EPERM;
>> diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c
>> index 47c4da0af7e1..369c9e788c1c 100644
>> --- a/xen/arch/x86/physdev.c
>> +++ b/xen/arch/x86/physdev.c
>> @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
>>
>> case MAP_PIRQ_TYPE_MSI:
>> case MAP_PIRQ_TYPE_MULTI_MSI:
>> + pcidevs_lock();
>> ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
>> + pcidevs_unlock();
>> break;
>>
>> default:
>> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
>> index 1439d1ef2b26..3a973324bca1 100644
>> --- a/xen/drivers/passthrough/pci.c
>> +++ b/xen/drivers/passthrough/pci.c
>> @@ -750,7 +750,6 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>> pdev->domain = hardware_domain;
>> write_lock(&hardware_domain->pci_lock);
>> list_add(&pdev->domain_list, &hardware_domain->pdev_list);
>> - write_unlock(&hardware_domain->pci_lock);
>>
>> /*
>> * For devices not discovered by Xen during boot, add vPCI handlers
>> @@ -759,18 +758,18 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>> ret = vpci_add_handlers(pdev);
>> if ( ret )
>> {
>> - printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
>> - write_lock(&hardware_domain->pci_lock);
>> list_del(&pdev->domain_list);
>> write_unlock(&hardware_domain->pci_lock);
>> pdev->domain = NULL;
>> + printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
>> goto out;
>> }
>> + write_unlock(&hardware_domain->pci_lock);
>> ret = iommu_add_device(pdev);
>> if ( ret )
>> {
>> - vpci_remove_device(pdev);
>> write_lock(&hardware_domain->pci_lock);
>> + vpci_remove_device(pdev);
>> list_del(&pdev->domain_list);
>> write_unlock(&hardware_domain->pci_lock);
>> pdev->domain = NULL;
>> @@ -1146,7 +1145,9 @@ static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt,
>> } while ( devfn != pdev->devfn &&
>> PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
>>
>> + write_lock(&ctxt->d->pci_lock);
>> err = vpci_add_handlers(pdev);
>> + write_unlock(&ctxt->d->pci_lock);
>> if ( err )
>> printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n",
>> ctxt->d->domain_id, err);
>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
>> index 58195549d50a..8f5850b8cf6d 100644
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -173,6 +173,7 @@ bool vpci_process_pending(struct vcpu *v)
>> if ( rc == -ERESTART )
>> return true;
>>
>> + write_lock(&v->domain->pci_lock);
>> spin_lock(&v->vpci.pdev->vpci->lock);
>> /* Disable memory decoding unconditionally on failure. */
>> modify_decoding(v->vpci.pdev,
>> @@ -191,6 +192,7 @@ bool vpci_process_pending(struct vcpu *v)
>> * failure.
>> */
>> vpci_remove_device(v->vpci.pdev);
>> + write_unlock(&v->domain->pci_lock);
>> }
>>
>> return false;
>> @@ -202,8 +204,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>> struct map_data data = { .d = d, .map = true };
>> int rc;
>>
>> + ASSERT(rw_is_write_locked(&d->pci_lock));
>> +
>> while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
>> + {
>> + /*
>> + * It's safe to drop and reacquire the lock in this context
>> + * without risking pdev disappearing because devices cannot be
>> + * removed until the initial domain has been started.
>> + */
>> + write_unlock(&d->pci_lock);
>> process_pending_softirqs();
>> + write_lock(&d->pci_lock);
>
> Hm, I should have noticed before, but we already call
> process_pending_softirqs() with the pdev->vpci->lock held here, so it
> would make sense to drop it also.
I don't quite understand this, maybe I'm missing something. I don't see where we acquire pdev->vpci->lock before calling process_pending_softirqs()?
Also, I tried adding
ASSERT(!spin_is_locked(&pdev->vpci->lock));
both here in apply_map() and in vpci_process_pending(), and they haven't triggered in either dom0 or domU test cases, tested on both arm and x86.
>
> Fix should be in a separate patch.
Agreed, I'll send v12.1 addressing the other feedback, and we can address this separately.
>
>> + }
>> +
>> rangeset_destroy(mem);
>> if ( !rc )
>> modify_decoding(pdev, cmd, false);
>> @@ -244,6 +258,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>> unsigned int i;
>> int rc;
>>
>> + ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
>> +
>> if ( !mem )
>> return -ENOMEM;
>>
>> @@ -524,6 +540,8 @@ static int cf_check init_header(struct pci_dev *pdev)
>> int rc;
>> bool mask_cap_list = false;
>>
>> + ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
>> +
>> switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
>> {
>> case PCI_HEADER_TYPE_NORMAL:
>> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
>> index a253ccbd7db7..6ff71e5f9ab7 100644
>> --- a/xen/drivers/vpci/msi.c
>> +++ b/xen/drivers/vpci/msi.c
>> @@ -263,7 +263,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
>>
>> void vpci_dump_msi(void)
>> {
>> - const struct domain *d;
>> + struct domain *d;
>>
>> rcu_read_lock(&domlist_read_lock);
>> for_each_domain ( d )
>> @@ -275,6 +275,9 @@ void vpci_dump_msi(void)
>>
>> printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
>>
>> + if ( !read_trylock(&d->pci_lock) )
>> + continue;
>> +
>> for_each_pdev ( d, pdev )
>> {
>> const struct vpci_msi *msi;
>> @@ -316,14 +319,33 @@ void vpci_dump_msi(void)
>> * holding the lock.
>> */
>> printk("unable to print all MSI-X entries: %d\n", rc);
>> - process_pending_softirqs();
>> - continue;
>> + goto pdev_done;
>> }
>> }
>>
>> spin_unlock(&pdev->vpci->lock);
>> + pdev_done:
>> + /*
>> + * Unlock lock to process pending softirqs. This is
>> + * potentially unsafe, as d->pdev_list can be changed in
>> + * meantime.
>> + */
>> + read_unlock(&d->pci_lock);
>> process_pending_softirqs();
>> + if ( !read_trylock(&d->pci_lock) )
>> + {
>> + printk("unable to access other devices for the domain\n");
>> + goto domain_done;
>> + }
>> }
>> + read_unlock(&d->pci_lock);
>> + domain_done:
>> + /*
>> + * We need this label at the end of the loop, but some
>> + * compilers might not be happy about label at the end of the
>> + * compound statement so we adding an empty statement here.
>> + */
>> + ;
>> }
>> rcu_read_unlock(&domlist_read_lock);
>> }
>> diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
>> index d1126a417da9..b6abab47efdd 100644
>> --- a/xen/drivers/vpci/msix.c
>> +++ b/xen/drivers/vpci/msix.c
>> @@ -147,6 +147,8 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>> {
>> struct vpci_msix *msix;
>>
>> + ASSERT(rw_is_locked(&d->pci_lock));
>> +
>> list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next )
>> {
>> const struct vpci_bar *bars = msix->pdev->vpci->header.bars;
>> @@ -163,7 +165,13 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>>
>> static int cf_check msix_accept(struct vcpu *v, unsigned long addr)
>> {
>> - return !!msix_find(v->domain, addr);
>> + int rc;
>> +
>> + read_lock(&v->domain->pci_lock);
>> + rc = !!msix_find(v->domain, addr);
>> + read_unlock(&v->domain->pci_lock);
>> +
>> + return rc;
>> }
>>
>> static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
>> @@ -358,21 +366,35 @@ static int adjacent_read(const struct domain *d, const struct vpci_msix *msix,
>> static int cf_check msix_read(
>> struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
>> {
>> - const struct domain *d = v->domain;
>> - struct vpci_msix *msix = msix_find(d, addr);
>> + struct domain *d = v->domain;
>> + struct vpci_msix *msix;
>> const struct vpci_msix_entry *entry;
>> unsigned int offset;
>>
>> *data = ~0UL;
>>
>> + read_lock(&d->pci_lock);
>> +
>> + msix = msix_find(d, addr);
>> if ( !msix )
>> + {
>> + read_unlock(&d->pci_lock);
>> return X86EMUL_RETRY;
>> + }
>>
>> if ( adjacent_handle(msix, addr) )
>> - return adjacent_read(d, msix, addr, len, data);
>> + {
>> + int rc = adjacent_read(d, msix, addr, len, data);
>> +
>> + read_unlock(&d->pci_lock);
>> + return rc;
>> + }
>>
>> if ( !access_allowed(msix->pdev, addr, len) )
>> + {
>> + read_unlock(&d->pci_lock);
>> return X86EMUL_OKAY;
>> + }
>>
>> spin_lock(&msix->pdev->vpci->lock);
>> entry = get_entry(msix, addr);
>> @@ -404,6 +426,7 @@ static int cf_check msix_read(
>> break;
>> }
>> spin_unlock(&msix->pdev->vpci->lock);
>> + read_unlock(&d->pci_lock);
>>
>> return X86EMUL_OKAY;
>> }
>> @@ -491,19 +514,33 @@ static int adjacent_write(const struct domain *d, const struct vpci_msix *msix,
>> static int cf_check msix_write(
>> struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
>> {
>> - const struct domain *d = v->domain;
>> - struct vpci_msix *msix = msix_find(d, addr);
>> + struct domain *d = v->domain;
>> + struct vpci_msix *msix;
>> struct vpci_msix_entry *entry;
>> unsigned int offset;
>>
>> + read_lock(&d->pci_lock);
>> +
>> + msix = msix_find(d, addr);
>> if ( !msix )
>> + {
>> + read_unlock(&d->pci_lock);
>> return X86EMUL_RETRY;
>> + }
>>
>> if ( adjacent_handle(msix, addr) )
>> - return adjacent_write(d, msix, addr, len, data);
>> + {
>> + int rc = adjacent_write(d, msix, addr, len, data);
>> +
>> + read_unlock(&d->pci_lock);
>> + return rc;
>> + }
>>
>> if ( !access_allowed(msix->pdev, addr, len) )
>> + {
>> + read_unlock(&d->pci_lock);
>> return X86EMUL_OKAY;
>> + }
>>
>> spin_lock(&msix->pdev->vpci->lock);
>> entry = get_entry(msix, addr);
>> @@ -579,6 +616,7 @@ static int cf_check msix_write(
>> break;
>> }
>> spin_unlock(&msix->pdev->vpci->lock);
>> + read_unlock(&d->pci_lock);
>>
>> return X86EMUL_OKAY;
>> }
>> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
>> index 72ef277c4f8e..a1a004460491 100644
>> --- a/xen/drivers/vpci/vpci.c
>> +++ b/xen/drivers/vpci/vpci.c
>> @@ -42,6 +42,8 @@ extern vpci_register_init_t *const __end_vpci_array[];
>>
>> void vpci_remove_device(struct pci_dev *pdev)
>> {
>> + ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
>> +
>> if ( !has_vpci(pdev->domain) || !pdev->vpci )
>> return;
>>
>> @@ -77,6 +79,8 @@ int vpci_add_handlers(struct pci_dev *pdev)
>> const unsigned long *ro_map;
>> int rc = 0;
>>
>> + ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
>> +
>> if ( !has_vpci(pdev->domain) )
>> return 0;
>>
>> @@ -361,7 +365,7 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
>>
>> uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>> {
>> - const struct domain *d = current->domain;
>> + struct domain *d = current->domain;
>> const struct pci_dev *pdev;
>> const struct vpci_register *r;
>> unsigned int data_offset = 0;
>> @@ -375,13 +379,19 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>>
>> /*
>> * Find the PCI dev matching the address, which for hwdom also requires
>> - * consulting DomXEN. Passthrough everything that's not trapped.
>> + * consulting DomXEN. Passthrough everything that's not trapped.
>
> The double space in the comment above was intentional.
Will fix
>
>> + * If this is hwdom and the device is assigned to dom_xen, acquiring hwdom's
>
> For consistency with the above sentence, could you please use DomXEN
> here?
Yes
>
>> + * pci_lock is sufficient.
>> */
>> + read_lock(&d->pci_lock);
>> pdev = pci_get_pdev(d, sbdf);
>> if ( !pdev && is_hardware_domain(d) )
>> pdev = pci_get_pdev(dom_xen, sbdf);
>> if ( !pdev || !pdev->vpci )
>> + {
>> + read_unlock(&d->pci_lock);
>> return vpci_read_hw(sbdf, reg, size);
>> + }
>>
>> spin_lock(&pdev->vpci->lock);
>>
>> @@ -428,6 +438,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>> ASSERT(data_offset < size);
>> }
>> spin_unlock(&pdev->vpci->lock);
>> + read_unlock(&d->pci_lock);
>>
>> if ( data_offset < size )
>> {
>> @@ -470,7 +481,7 @@ static void vpci_write_helper(const struct pci_dev *pdev,
>> void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>> uint32_t data)
>> {
>> - const struct domain *d = current->domain;
>> + struct domain *d = current->domain;
>> const struct pci_dev *pdev;
>> const struct vpci_register *r;
>> unsigned int data_offset = 0;
>> @@ -483,8 +494,14 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>
>> /*
>> * Find the PCI dev matching the address, which for hwdom also requires
>> - * consulting DomXEN. Passthrough everything that's not trapped.
>> + * consulting DomXEN. Passthrough everything that's not trapped.
>
> Same here re the double space removal and the usage of dom_xen below.
Will fix
>
> Thanks, Roger.
On Fri, Jan 12, 2024 at 12:54:56PM -0500, Stewart Hildebrand wrote:
> On 1/12/24 08:48, Roger Pau Monné wrote:
> > On Tue, Jan 09, 2024 at 04:51:16PM -0500, Stewart Hildebrand wrote:
> >> @@ -202,8 +204,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
> >> struct map_data data = { .d = d, .map = true };
> >> int rc;
> >>
> >> + ASSERT(rw_is_write_locked(&d->pci_lock));
> >> +
> >> while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
> >> + {
> >> + /*
> >> + * It's safe to drop and reacquire the lock in this context
> >> + * without risking pdev disappearing because devices cannot be
> >> + * removed until the initial domain has been started.
> >> + */
> >> + write_unlock(&d->pci_lock);
> >> process_pending_softirqs();
> >> + write_lock(&d->pci_lock);
> >
> > Hm, I should have noticed before, but we already call
> > process_pending_softirqs() with the pdev->vpci->lock held here, so it
> > would make sense to drop it also.
>
> I don't quite understand this, maybe I'm missing something. I don't see where we acquire pdev->vpci->lock before calling process_pending_softirqs()?
>
> Also, I tried adding
>
> ASSERT(!spin_is_locked(&pdev->vpci->lock));
>
> both here in apply_map() and in vpci_process_pending(), and they haven't triggered in either dom0 or domU test cases, tested on both arm and x86.
I think I was confused. Are you sure that pdev->vpci->lock is taken
in the apply_map() call? I was mistakenly assuming that
vpci_add_handlers() called the init function with the vpci->lock
taken, but that doesn't seem to be the case with the current code. That
leads to apply_map() also being called without the vpci->lock taken.
I was wrongly assuming that apply_map() was called with the vpci->lock
taken, and that would need dropping around the
process_pending_softirqs() call.
Thanks, Roger.
On 1/15/24 03:53, Roger Pau Monné wrote:
> On Fri, Jan 12, 2024 at 12:54:56PM -0500, Stewart Hildebrand wrote:
>> On 1/12/24 08:48, Roger Pau Monné wrote:
>>> On Tue, Jan 09, 2024 at 04:51:16PM -0500, Stewart Hildebrand wrote:
>>>> @@ -202,8 +204,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>>>> struct map_data data = { .d = d, .map = true };
>>>> int rc;
>>>>
>>>> + ASSERT(rw_is_write_locked(&d->pci_lock));
>>>> +
>>>> while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
>>>> + {
>>>> + /*
>>>> + * It's safe to drop and reacquire the lock in this context
>>>> + * without risking pdev disappearing because devices cannot be
>>>> + * removed until the initial domain has been started.
>>>> + */
>>>> + write_unlock(&d->pci_lock);
>>>> process_pending_softirqs();
>>>> + write_lock(&d->pci_lock);
>>>
>>> Hm, I should have noticed before, but we already call
>>> process_pending_softirqs() with the pdev->vpci->lock held here, so it
>>> would make sense to drop it also.
>>
>> I don't quite understand this, maybe I'm missing something. I don't see where we acquire pdev->vpci->lock before calling process_pending_softirqs()?
>>
>> Also, I tried adding
>>
>> ASSERT(!spin_is_locked(&pdev->vpci->lock));
>>
>> both here in apply_map() and in vpci_process_pending(), and they haven't triggered in either dom0 or domU test cases, tested on both arm and x86.
>
> I think I was confused. Are you sure that pdev->vpci->lock is taken
> in the apply_map() call?
I'm sure that it's NOT taken in apply_map(). See the ! in the test ASSERT above.
> I was mistakenly assuming that
> vpci_add_handlers() called the init function with the vpci->lock
> taken, but that doesn't seem to be the case with the current code. That
> leads to apply_map() also being called without the vpci->lock taken.
Right.
>
> I was wrongly assuming that apply_map() was called with the vpci->lock
> taken, and that would need dropping around the
> process_pending_softirqs() call.
>
> Thanks, Roger.
From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Use the per-domain PCI read/write lock to protect the presence of the
pci device vpci field. This lock can be used (and in a few cases is used
right away) so that vpci removal can be performed while holding the lock
in write mode. Previously such removal could race with vpci_read for
example.
When taking both d->pci_lock and pdev->vpci->lock, they should be
taken in this exact order: d->pci_lock then pdev->vpci->lock to avoid
possible deadlock situations.
1. Per-domain's pci_lock is used to protect pdev->vpci structure
from being removed.
2. Writing the command register and ROM BAR register may trigger
modify_bars to run, which in turn may access multiple pdevs while
checking for the existing BAR's overlap. The overlapping check, if
done under the read lock, requires vpci->lock to be acquired on both
devices being compared, which may produce a deadlock. It is not
possible to upgrade read lock to write lock in such a case. So, in
order to prevent the deadlock, use d->pci_lock in write mode instead.
All other code, which doesn't lead to pdev->vpci destruction and does
not access multiple pdevs at the same time, can still use a
combination of the read lock and pdev->vpci->lock.
3. Drop const qualifier where the new rwlock is used and this is
appropriate.
4. Do not call process_pending_softirqs with any locks held. For that
unlock prior the call and re-acquire the locks after. After
re-acquiring the lock there is no need to check if pdev->vpci exists:
- in apply_map because of the context it is called (no race condition
possible)
- for MSI/MSI-X debug code because it is called at the end of
pdev->vpci access and no further access to pdev->vpci is made
5. Use d->pci_lock around for_each_pdev and pci_get_pdev_by_domain
while accessing pdevs in vpci code.
Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes in v12.1:
- use read_trylock() in vpci_msix_arch_print()
- fixup in-code comments (revert double space, use DomXEN) in
vpci_{read,write}()
- minor updates in commit message
- add Roger's R-b
Changes in v12:
- s/pci_rwlock/pci_lock/ in commit message
- expand comment about scope of pci_lock in sched.h
- in vpci_{read,write}, if hwdom is trying to access a device assigned
to dom_xen, holding hwdom->pci_lock is sufficient (no need to hold
dom_xen->pci_lock)
- reintroduce ASSERT in vmx_pi_update_irte()
- reintroduce ASSERT in __pci_enable_msi{x}()
- delete note 6. in commit message about removing ASSERTs since we have
reintroduced them
Changes in v11:
- Fixed commit message regarding possible spinlocks
- Removed parameter from allocate_and_map_msi_pirq(), which was added
in the prev version. Now we are taking pcidevs_lock in
physdev_map_pirq()
- Returned ASSERT to pci_enable_msi
- Fixed case when we took read lock instead of write one
- Fixed label indentation
Changes in v10:
- Moved printk past locked area
- Returned back ASSERTs
- Added new parameter to allocate_and_map_msi_pirq() so it knows if
it should take the global pci lock
- Added comment about possible improvement in vpci_write
- Changed ASSERT(rw_is_locked()) to rw_is_write_locked() in
appropriate places
- Renamed release_domain_locks() to release_domain_write_locks()
- moved domain_done label in vpci_dump_msi() to correct place
Changes in v9:
- extended locked region to protect vpci_remove_device and
vpci_add_handlers() calls
- vpci_write() takes lock in the write mode to protect
potential call to modify_bars()
- renamed lock releasing function
- removed ASSERT()s from msi code
- added trylock in vpci_dump_msi
Changes in v8:
- changed d->vpci_lock to d->pci_lock
- introducing d->pci_lock in a separate patch
- extended locked region in vpci_process_pending
- removed pcidevs_lock in vpci_dump_msi()
- removed some changes as they are not needed with
the new locking scheme
- added handling for hwdom && dom_xen case
---
xen/arch/x86/hvm/vmsi.c | 25 +++++++++--------
xen/arch/x86/hvm/vmx/vmx.c | 2 +-
xen/arch/x86/irq.c | 8 +++---
xen/arch/x86/msi.c | 14 +++++-----
xen/arch/x86/physdev.c | 2 ++
xen/drivers/passthrough/pci.c | 9 +++---
xen/drivers/vpci/header.c | 18 ++++++++++++
xen/drivers/vpci/msi.c | 28 +++++++++++++++++--
xen/drivers/vpci/msix.c | 52 ++++++++++++++++++++++++++++++-----
xen/drivers/vpci/vpci.c | 24 ++++++++++++++--
xen/include/xen/sched.h | 3 +-
11 files changed, 145 insertions(+), 40 deletions(-)
diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index 128f23636279..4725e3b72d53 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -468,7 +468,7 @@ int msixtbl_pt_register(struct domain *d, struct pirq *pirq, uint64_t gtable)
struct msixtbl_entry *entry, *new_entry;
int r = -EINVAL;
- ASSERT(pcidevs_locked());
+ ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
ASSERT(rw_is_write_locked(&d->event_lock));
if ( !msixtbl_initialised(d) )
@@ -538,7 +538,7 @@ void msixtbl_pt_unregister(struct domain *d, struct pirq *pirq)
struct pci_dev *pdev;
struct msixtbl_entry *entry;
- ASSERT(pcidevs_locked());
+ ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
ASSERT(rw_is_write_locked(&d->event_lock));
if ( !msixtbl_initialised(d) )
@@ -684,7 +684,7 @@ static int vpci_msi_update(const struct pci_dev *pdev, uint32_t data,
{
unsigned int i;
- ASSERT(pcidevs_locked());
+ ASSERT(rw_is_locked(&pdev->domain->pci_lock));
if ( (address & MSI_ADDR_BASE_MASK) != MSI_ADDR_HEADER )
{
@@ -725,8 +725,8 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
int rc;
ASSERT(msi->arch.pirq != INVALID_PIRQ);
+ ASSERT(rw_is_locked(&pdev->domain->pci_lock));
- pcidevs_lock();
for ( i = 0; i < msi->vectors && msi->arch.bound; i++ )
{
struct xen_domctl_bind_pt_irq unbind = {
@@ -745,7 +745,6 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address,
msi->vectors, msi->arch.pirq, msi->mask);
- pcidevs_unlock();
}
static int vpci_msi_enable(const struct pci_dev *pdev, unsigned int nr,
@@ -778,15 +777,14 @@ int vpci_msi_arch_enable(struct vpci_msi *msi, const struct pci_dev *pdev,
int rc;
ASSERT(msi->arch.pirq == INVALID_PIRQ);
+ ASSERT(rw_is_locked(&pdev->domain->pci_lock));
rc = vpci_msi_enable(pdev, vectors, 0);
if ( rc < 0 )
return rc;
msi->arch.pirq = rc;
- pcidevs_lock();
msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address, vectors,
msi->arch.pirq, msi->mask);
- pcidevs_unlock();
return 0;
}
@@ -797,8 +795,8 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
unsigned int i;
ASSERT(pirq != INVALID_PIRQ);
+ ASSERT(rw_is_locked(&pdev->domain->pci_lock));
- pcidevs_lock();
for ( i = 0; i < nr && bound; i++ )
{
struct xen_domctl_bind_pt_irq bind = {
@@ -814,7 +812,6 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
write_lock(&pdev->domain->event_lock);
unmap_domain_pirq(pdev->domain, pirq);
write_unlock(&pdev->domain->event_lock);
- pcidevs_unlock();
}
void vpci_msi_arch_disable(struct vpci_msi *msi, const struct pci_dev *pdev)
@@ -854,6 +851,7 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
int rc;
ASSERT(entry->arch.pirq == INVALID_PIRQ);
+ ASSERT(rw_is_locked(&pdev->domain->pci_lock));
rc = vpci_msi_enable(pdev, vmsix_entry_nr(pdev->vpci->msix, entry),
table_base);
if ( rc < 0 )
@@ -861,7 +859,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
entry->arch.pirq = rc;
- pcidevs_lock();
rc = vpci_msi_update(pdev, entry->data, entry->addr, 1, entry->arch.pirq,
entry->masked);
if ( rc )
@@ -869,7 +866,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
vpci_msi_disable(pdev, entry->arch.pirq, 1, false);
entry->arch.pirq = INVALID_PIRQ;
}
- pcidevs_unlock();
return rc;
}
@@ -895,6 +891,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
{
unsigned int i;
+ ASSERT(rw_is_locked(&msix->pdev->domain->pci_lock));
+
for ( i = 0; i < msix->max_entries; i++ )
{
const struct vpci_msix_entry *entry = &msix->entries[i];
@@ -913,7 +911,12 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
struct pci_dev *pdev = msix->pdev;
spin_unlock(&msix->pdev->vpci->lock);
+ read_unlock(&pdev->domain->pci_lock);
process_pending_softirqs();
+
+ if ( !read_trylock(&pdev->domain->pci_lock) )
+ return -EBUSY;
+
/* NB: we assume that pdev cannot go away for an alive domain. */
if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
return -EBUSY;
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 8ff675883c2b..890faef48b82 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -413,7 +413,7 @@ static int cf_check vmx_pi_update_irte(const struct vcpu *v,
spin_unlock_irq(&desc->lock);
- ASSERT(pcidevs_locked());
+ ASSERT(pcidevs_locked() || rw_is_locked(&msi_desc->dev->domain->pci_lock));
return iommu_update_ire_from_msi(msi_desc, &msi_desc->msg);
diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index 50e49e1a4b6f..4d89d9af99fe 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -2166,7 +2166,7 @@ int map_domain_pirq(
struct pci_dev *pdev;
unsigned int nr = 0;
- ASSERT(pcidevs_locked());
+ ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
ret = -ENODEV;
if ( !cpu_has_apic )
@@ -2323,7 +2323,7 @@ int unmap_domain_pirq(struct domain *d, int pirq)
if ( (pirq < 0) || (pirq >= d->nr_pirqs) )
return -EINVAL;
- ASSERT(pcidevs_locked());
+ ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
ASSERT(rw_is_write_locked(&d->event_lock));
info = pirq_info(d, pirq);
@@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
{
int irq, pirq, ret;
+ ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
+
switch ( type )
{
case MAP_PIRQ_TYPE_MSI:
@@ -2917,7 +2919,6 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
msi->irq = irq;
- pcidevs_lock();
/* Verify or get pirq. */
write_lock(&d->event_lock);
pirq = allocate_pirq(d, index, *pirq_p, irq, type, &msi->entry_nr);
@@ -2933,7 +2934,6 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
done:
write_unlock(&d->event_lock);
- pcidevs_unlock();
if ( ret )
{
switch ( type )
diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index 335c0868a225..7da2affa2e82 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -602,7 +602,7 @@ static int msi_capability_init(struct pci_dev *dev,
unsigned int i, mpos;
uint16_t control;
- ASSERT(pcidevs_locked());
+ ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
pos = pci_find_cap_offset(dev->sbdf, PCI_CAP_ID_MSI);
if ( !pos )
return -ENODEV;
@@ -771,7 +771,7 @@ static int msix_capability_init(struct pci_dev *dev,
if ( !pos )
return -ENODEV;
- ASSERT(pcidevs_locked());
+ ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));
/*
@@ -988,11 +988,11 @@ static int __pci_enable_msi(struct pci_dev *pdev, struct msi_info *msi,
{
struct msi_desc *old_desc;
- ASSERT(pcidevs_locked());
-
if ( !pdev )
return -ENODEV;
+ ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
+
old_desc = find_msi_entry(pdev, msi->irq, PCI_CAP_ID_MSI);
if ( old_desc )
{
@@ -1043,11 +1043,11 @@ static int __pci_enable_msix(struct pci_dev *pdev, struct msi_info *msi,
{
struct msi_desc *old_desc;
- ASSERT(pcidevs_locked());
-
if ( !pdev || !pdev->msix )
return -ENODEV;
+ ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
+
if ( msi->entry_nr >= pdev->msix->nr_entries )
return -EINVAL;
@@ -1154,7 +1154,7 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off)
int pci_enable_msi(struct pci_dev *pdev, struct msi_info *msi,
struct msi_desc **desc)
{
- ASSERT(pcidevs_locked());
+ ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
if ( !use_msi )
return -EPERM;
diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c
index 47c4da0af7e1..369c9e788c1c 100644
--- a/xen/arch/x86/physdev.c
+++ b/xen/arch/x86/physdev.c
@@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
case MAP_PIRQ_TYPE_MSI:
case MAP_PIRQ_TYPE_MULTI_MSI:
+ pcidevs_lock();
ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
+ pcidevs_unlock();
break;
default:
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 1439d1ef2b26..3a973324bca1 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -750,7 +750,6 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
pdev->domain = hardware_domain;
write_lock(&hardware_domain->pci_lock);
list_add(&pdev->domain_list, &hardware_domain->pdev_list);
- write_unlock(&hardware_domain->pci_lock);
/*
* For devices not discovered by Xen during boot, add vPCI handlers
@@ -759,18 +758,18 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
ret = vpci_add_handlers(pdev);
if ( ret )
{
- printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
- write_lock(&hardware_domain->pci_lock);
list_del(&pdev->domain_list);
write_unlock(&hardware_domain->pci_lock);
pdev->domain = NULL;
+ printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
goto out;
}
+ write_unlock(&hardware_domain->pci_lock);
ret = iommu_add_device(pdev);
if ( ret )
{
- vpci_remove_device(pdev);
write_lock(&hardware_domain->pci_lock);
+ vpci_remove_device(pdev);
list_del(&pdev->domain_list);
write_unlock(&hardware_domain->pci_lock);
pdev->domain = NULL;
@@ -1146,7 +1145,9 @@ static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt,
} while ( devfn != pdev->devfn &&
PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
+ write_lock(&ctxt->d->pci_lock);
err = vpci_add_handlers(pdev);
+ write_unlock(&ctxt->d->pci_lock);
if ( err )
printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n",
ctxt->d->domain_id, err);
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 58195549d50a..8f5850b8cf6d 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -173,6 +173,7 @@ bool vpci_process_pending(struct vcpu *v)
if ( rc == -ERESTART )
return true;
+ write_lock(&v->domain->pci_lock);
spin_lock(&v->vpci.pdev->vpci->lock);
/* Disable memory decoding unconditionally on failure. */
modify_decoding(v->vpci.pdev,
@@ -191,6 +192,7 @@ bool vpci_process_pending(struct vcpu *v)
* failure.
*/
vpci_remove_device(v->vpci.pdev);
+ write_unlock(&v->domain->pci_lock);
}
return false;
@@ -202,8 +204,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
struct map_data data = { .d = d, .map = true };
int rc;
+ ASSERT(rw_is_write_locked(&d->pci_lock));
+
while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
+ {
+ /*
+ * It's safe to drop and reacquire the lock in this context
+ * without risking pdev disappearing because devices cannot be
+ * removed until the initial domain has been started.
+ */
+ write_unlock(&d->pci_lock);
process_pending_softirqs();
+ write_lock(&d->pci_lock);
+ }
+
rangeset_destroy(mem);
if ( !rc )
modify_decoding(pdev, cmd, false);
@@ -244,6 +258,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
unsigned int i;
int rc;
+ ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
if ( !mem )
return -ENOMEM;
@@ -524,6 +540,8 @@ static int cf_check init_header(struct pci_dev *pdev)
int rc;
bool mask_cap_list = false;
+ ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
{
case PCI_HEADER_TYPE_NORMAL:
diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
index a253ccbd7db7..6ff71e5f9ab7 100644
--- a/xen/drivers/vpci/msi.c
+++ b/xen/drivers/vpci/msi.c
@@ -263,7 +263,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
void vpci_dump_msi(void)
{
- const struct domain *d;
+ struct domain *d;
rcu_read_lock(&domlist_read_lock);
for_each_domain ( d )
@@ -275,6 +275,9 @@ void vpci_dump_msi(void)
printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
+ if ( !read_trylock(&d->pci_lock) )
+ continue;
+
for_each_pdev ( d, pdev )
{
const struct vpci_msi *msi;
@@ -316,14 +319,33 @@ void vpci_dump_msi(void)
* holding the lock.
*/
printk("unable to print all MSI-X entries: %d\n", rc);
- process_pending_softirqs();
- continue;
+ goto pdev_done;
}
}
spin_unlock(&pdev->vpci->lock);
+ pdev_done:
+ /*
+ * Unlock lock to process pending softirqs. This is
+ * potentially unsafe, as d->pdev_list can be changed in
+ * meantime.
+ */
+ read_unlock(&d->pci_lock);
process_pending_softirqs();
+ if ( !read_trylock(&d->pci_lock) )
+ {
+ printk("unable to access other devices for the domain\n");
+ goto domain_done;
+ }
}
+ read_unlock(&d->pci_lock);
+ domain_done:
+ /*
+ * We need this label at the end of the loop, but some
+ * compilers might not be happy about label at the end of the
+ * compound statement so we adding an empty statement here.
+ */
+ ;
}
rcu_read_unlock(&domlist_read_lock);
}
diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
index d1126a417da9..b6abab47efdd 100644
--- a/xen/drivers/vpci/msix.c
+++ b/xen/drivers/vpci/msix.c
@@ -147,6 +147,8 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
{
struct vpci_msix *msix;
+ ASSERT(rw_is_locked(&d->pci_lock));
+
list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next )
{
const struct vpci_bar *bars = msix->pdev->vpci->header.bars;
@@ -163,7 +165,13 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
static int cf_check msix_accept(struct vcpu *v, unsigned long addr)
{
- return !!msix_find(v->domain, addr);
+ int rc;
+
+ read_lock(&v->domain->pci_lock);
+ rc = !!msix_find(v->domain, addr);
+ read_unlock(&v->domain->pci_lock);
+
+ return rc;
}
static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
@@ -358,21 +366,35 @@ static int adjacent_read(const struct domain *d, const struct vpci_msix *msix,
static int cf_check msix_read(
struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
{
- const struct domain *d = v->domain;
- struct vpci_msix *msix = msix_find(d, addr);
+ struct domain *d = v->domain;
+ struct vpci_msix *msix;
const struct vpci_msix_entry *entry;
unsigned int offset;
*data = ~0UL;
+ read_lock(&d->pci_lock);
+
+ msix = msix_find(d, addr);
if ( !msix )
+ {
+ read_unlock(&d->pci_lock);
return X86EMUL_RETRY;
+ }
if ( adjacent_handle(msix, addr) )
- return adjacent_read(d, msix, addr, len, data);
+ {
+ int rc = adjacent_read(d, msix, addr, len, data);
+
+ read_unlock(&d->pci_lock);
+ return rc;
+ }
if ( !access_allowed(msix->pdev, addr, len) )
+ {
+ read_unlock(&d->pci_lock);
return X86EMUL_OKAY;
+ }
spin_lock(&msix->pdev->vpci->lock);
entry = get_entry(msix, addr);
@@ -404,6 +426,7 @@ static int cf_check msix_read(
break;
}
spin_unlock(&msix->pdev->vpci->lock);
+ read_unlock(&d->pci_lock);
return X86EMUL_OKAY;
}
@@ -491,19 +514,33 @@ static int adjacent_write(const struct domain *d, const struct vpci_msix *msix,
static int cf_check msix_write(
struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
{
- const struct domain *d = v->domain;
- struct vpci_msix *msix = msix_find(d, addr);
+ struct domain *d = v->domain;
+ struct vpci_msix *msix;
struct vpci_msix_entry *entry;
unsigned int offset;
+ read_lock(&d->pci_lock);
+
+ msix = msix_find(d, addr);
if ( !msix )
+ {
+ read_unlock(&d->pci_lock);
return X86EMUL_RETRY;
+ }
if ( adjacent_handle(msix, addr) )
- return adjacent_write(d, msix, addr, len, data);
+ {
+ int rc = adjacent_write(d, msix, addr, len, data);
+
+ read_unlock(&d->pci_lock);
+ return rc;
+ }
if ( !access_allowed(msix->pdev, addr, len) )
+ {
+ read_unlock(&d->pci_lock);
return X86EMUL_OKAY;
+ }
spin_lock(&msix->pdev->vpci->lock);
entry = get_entry(msix, addr);
@@ -579,6 +616,7 @@ static int cf_check msix_write(
break;
}
spin_unlock(&msix->pdev->vpci->lock);
+ read_unlock(&d->pci_lock);
return X86EMUL_OKAY;
}
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 72ef277c4f8e..475272b173f3 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -42,6 +42,8 @@ extern vpci_register_init_t *const __end_vpci_array[];
void vpci_remove_device(struct pci_dev *pdev)
{
+ ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
if ( !has_vpci(pdev->domain) || !pdev->vpci )
return;
@@ -77,6 +79,8 @@ int vpci_add_handlers(struct pci_dev *pdev)
const unsigned long *ro_map;
int rc = 0;
+ ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
if ( !has_vpci(pdev->domain) )
return 0;
@@ -361,7 +365,7 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
{
- const struct domain *d = current->domain;
+ struct domain *d = current->domain;
const struct pci_dev *pdev;
const struct vpci_register *r;
unsigned int data_offset = 0;
@@ -376,12 +380,18 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
/*
* Find the PCI dev matching the address, which for hwdom also requires
* consulting DomXEN. Passthrough everything that's not trapped.
+ * If this is hwdom and the device is assigned to DomXEN, acquiring hwdom's
+ * pci_lock is sufficient.
*/
+ read_lock(&d->pci_lock);
pdev = pci_get_pdev(d, sbdf);
if ( !pdev && is_hardware_domain(d) )
pdev = pci_get_pdev(dom_xen, sbdf);
if ( !pdev || !pdev->vpci )
+ {
+ read_unlock(&d->pci_lock);
return vpci_read_hw(sbdf, reg, size);
+ }
spin_lock(&pdev->vpci->lock);
@@ -428,6 +438,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
ASSERT(data_offset < size);
}
spin_unlock(&pdev->vpci->lock);
+ read_unlock(&d->pci_lock);
if ( data_offset < size )
{
@@ -470,7 +481,7 @@ static void vpci_write_helper(const struct pci_dev *pdev,
void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
uint32_t data)
{
- const struct domain *d = current->domain;
+ struct domain *d = current->domain;
const struct pci_dev *pdev;
const struct vpci_register *r;
unsigned int data_offset = 0;
@@ -484,7 +495,13 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
/*
* Find the PCI dev matching the address, which for hwdom also requires
* consulting DomXEN. Passthrough everything that's not trapped.
+ * If this is hwdom and the device is assigned to DomXEN, acquiring hwdom's
+ * pci_lock is sufficient.
+ *
+ * TODO: We need to take pci_locks in exclusive mode only if we
+ * are modifying BARs, so there is a room for improvement.
*/
+ write_lock(&d->pci_lock);
pdev = pci_get_pdev(d, sbdf);
if ( !pdev && is_hardware_domain(d) )
pdev = pci_get_pdev(dom_xen, sbdf);
@@ -493,6 +510,8 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
/* Ignore writes to read-only devices, which have no ->vpci. */
const unsigned long *ro_map = pci_get_ro_map(sbdf.seg);
+ write_unlock(&d->pci_lock);
+
if ( !ro_map || !test_bit(sbdf.bdf, ro_map) )
vpci_write_hw(sbdf, reg, size, data);
return;
@@ -534,6 +553,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
ASSERT(data_offset < size);
}
spin_unlock(&pdev->vpci->lock);
+ write_unlock(&d->pci_lock);
if ( data_offset < size )
/* Tailing gap, write the remaining. */
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 9da91e0e6244..37f5922f3206 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -462,7 +462,8 @@ struct domain
#ifdef CONFIG_HAS_PCI
struct list_head pdev_list;
/*
- * pci_lock protects access to pdev_list.
+ * pci_lock protects access to pdev_list. pci_lock also protects pdev->vpci
+ * structure from being removed.
*
* Any user *reading* from pdev_list, or from devices stored in pdev_list,
* should hold either pcidevs_lock() or pci_lock in read mode. Optionally,
base-commit: 1ec3fe1f664fa837daf31e9fa8938f6109464f28
--
2.43.0
On 12.01.2024 19:14, Stewart Hildebrand wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>
> Use the per-domain PCI read/write lock to protect the presence of the
> pci device vpci field. This lock can be used (and in a few cases is used
> right away) so that vpci removal can be performed while holding the lock
> in write mode. Previously such removal could race with vpci_read for
> example.
>
> When taking both d->pci_lock and pdev->vpci->lock, they should be
> taken in this exact order: d->pci_lock then pdev->vpci->lock to avoid
> possible deadlock situations.
>
> 1. Per-domain's pci_lock is used to protect pdev->vpci structure
> from being removed.
>
> 2. Writing the command register and ROM BAR register may trigger
> modify_bars to run, which in turn may access multiple pdevs while
> checking for the existing BAR's overlap. The overlapping check, if
> done under the read lock, requires vpci->lock to be acquired on both
> devices being compared, which may produce a deadlock. It is not
> possible to upgrade read lock to write lock in such a case. So, in
> order to prevent the deadlock, use d->pci_lock in write mode instead.
>
> All other code, which doesn't lead to pdev->vpci destruction and does
> not access multiple pdevs at the same time, can still use a
> combination of the read lock and pdev->vpci->lock.
>
> 3. Drop const qualifier where the new rwlock is used and this is
> appropriate.
>
> 4. Do not call process_pending_softirqs with any locks held. For that
> unlock prior the call and re-acquire the locks after. After
> re-acquiring the lock there is no need to check if pdev->vpci exists:
> - in apply_map because of the context it is called (no race condition
> possible)
> - for MSI/MSI-X debug code because it is called at the end of
> pdev->vpci access and no further access to pdev->vpci is made
>
> 5. Use d->pci_lock around for_each_pdev and pci_get_pdev_by_domain
> while accessing pdevs in vpci code.
>
> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
> Suggested-by: Jan Beulich <jbeulich@suse.com>
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

While I know Roger did offer the tag with certain adjustments, ...

> @@ -913,7 +911,12 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
> struct pci_dev *pdev = msix->pdev;
>
> spin_unlock(&msix->pdev->vpci->lock);
> + read_unlock(&pdev->domain->pci_lock);
> process_pending_softirqs();
> +
> + if ( !read_trylock(&pdev->domain->pci_lock) )
> + return -EBUSY;
> +
> /* NB: we assume that pdev cannot go away for an alive domain. */
> if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
> return -EBUSY;

... I'm sure he was assuming you would get this right, in also
dropping the 1st-try-acquired lock when this 2nd try-lock fails.

Personally I feel this is the kind of change one would better not
offer (or take) R-b ahead of time.

I further think the respective comment in vpci_dump_msi() also wants
adjusting from singular to plural.

Jan
On 1/15/24 03:58, Jan Beulich wrote:
> On 12.01.2024 19:14, Stewart Hildebrand wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> Use the per-domain PCI read/write lock to protect the presence of the
>> pci device vpci field. This lock can be used (and in a few cases is used
>> right away) so that vpci removal can be performed while holding the lock
>> in write mode. Previously such removal could race with vpci_read for
>> example.
>>
>> When taking both d->pci_lock and pdev->vpci->lock, they should be
>> taken in this exact order: d->pci_lock then pdev->vpci->lock to avoid
>> possible deadlock situations.
>>
>> 1. Per-domain's pci_lock is used to protect pdev->vpci structure
>> from being removed.
>>
>> 2. Writing the command register and ROM BAR register may trigger
>> modify_bars to run, which in turn may access multiple pdevs while
>> checking for the existing BAR's overlap. The overlapping check, if
>> done under the read lock, requires vpci->lock to be acquired on both
>> devices being compared, which may produce a deadlock. It is not
>> possible to upgrade read lock to write lock in such a case. So, in
>> order to prevent the deadlock, use d->pci_lock in write mode instead.
>>
>> All other code, which doesn't lead to pdev->vpci destruction and does
>> not access multiple pdevs at the same time, can still use a
>> combination of the read lock and pdev->vpci->lock.
>>
>> 3. Drop const qualifier where the new rwlock is used and this is
>> appropriate.
>>
>> 4. Do not call process_pending_softirqs with any locks held. For that
>> unlock prior the call and re-acquire the locks after. After
>> re-acquiring the lock there is no need to check if pdev->vpci exists:
>> - in apply_map because of the context it is called (no race condition
>> possible)
>> - for MSI/MSI-X debug code because it is called at the end of
>> pdev->vpci access and no further access to pdev->vpci is made
>>
>> 5. Use d->pci_lock around for_each_pdev and pci_get_pdev_by_domain
>> while accessing pdevs in vpci code.
>>
>> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
>> Suggested-by: Jan Beulich <jbeulich@suse.com>
>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
>> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
>
> While I know Roger did offer the tag with certain adjustments, ...
>
>> @@ -913,7 +911,12 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>> struct pci_dev *pdev = msix->pdev;
>>
>> spin_unlock(&msix->pdev->vpci->lock);
>> + read_unlock(&pdev->domain->pci_lock);
>> process_pending_softirqs();
>> +
>> + if ( !read_trylock(&pdev->domain->pci_lock) )
>> + return -EBUSY;
>> +
>> /* NB: we assume that pdev cannot go away for an alive domain. */
>> if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
>> return -EBUSY;
>
> ... I'm sure he was assuming you would get this right, in also
> dropping the 1st-try-acquired lock when this 2nd try-lock fails.

Thanks for catching this, and I appreciate the suggestion. I'll make sure
both locks are dropped if needed on all error paths in
vpci_msix_arch_print(), and adjust vpci_dump_msi() accordingly.

> Personally I feel this is the kind of change one would better not
> offer (or take) R-b ahead of time.

I'll drop Roger's R-b for v12.2.

>
> I further think the respective comment in vpci_dump_msi() also wants
> adjusting from singular to plural.
I'll fix for v12.2, thanks for suggesting this.
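
To make the plan concrete: the idea is that every early return in that block
of vpci_msix_arch_print() leaves the function with both d->pci_lock and
pdev->vpci->lock released, so a -EBUSY return always means "both locks
dropped". Roughly along these lines (just a sketch of the intent, not the
final v12.2 hunk):

    spin_unlock(&msix->pdev->vpci->lock);
    read_unlock(&pdev->domain->pci_lock);
    process_pending_softirqs();

    if ( !read_trylock(&pdev->domain->pci_lock) )
        return -EBUSY;

    /* NB: we assume that pdev cannot go away for an alive domain. */
    if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
    {
        /* Also drop the read lock re-acquired above before bailing out. */
        read_unlock(&pdev->domain->pci_lock);
        return -EBUSY;
    }

Any later bail-out in that block would need the same treatment. In
vpci_dump_msi() the pdev_done label then moves so that the error path does
not unlock what the callee already released, and the nearby comment gets
adjusted to talk about locks, plural.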
From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Use the per-domain PCI read/write lock to protect the presence of the
pci device vpci field. This lock can be used (and in a few cases is used
right away) so that vpci removal can be performed while holding the lock
in write mode. Previously such removal could race with vpci_read for
example.
When taking both d->pci_lock and pdev->vpci->lock, they should be
taken in this exact order: d->pci_lock then pdev->vpci->lock to avoid
possible deadlock situations.
1. Per-domain's pci_lock is used to protect pdev->vpci structure
from being removed.
2. Writing the command register and ROM BAR register may trigger
modify_bars to run, which in turn may access multiple pdevs while
checking for the existing BAR's overlap. The overlapping check, if
done under the read lock, requires vpci->lock to be acquired on both
devices being compared, which may produce a deadlock. It is not
possible to upgrade read lock to write lock in such a case. So, in
order to prevent the deadlock, use d->pci_lock in write mode instead.
All other code, which doesn't lead to pdev->vpci destruction and does
not access multiple pdevs at the same time, can still use a
combination of the read lock and pdev->vpci->lock.
3. Drop const qualifier where the new rwlock is used and this is
appropriate.
4. Do not call process_pending_softirqs with any locks held. For that
unlock prior the call and re-acquire the locks after. After
re-acquiring the lock there is no need to check if pdev->vpci exists:
- in apply_map because of the context it is called (no race condition
possible)
- for MSI/MSI-X debug code because it is called at the end of
pdev->vpci access and no further access to pdev->vpci is made
5. Use d->pci_lock around for_each_pdev and pci_get_pdev_by_domain
while accessing pdevs in vpci code.
Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
Changes in v12.2:
- drop Roger's R-b
- drop both locks on error paths in vpci_msix_arch_print()
- add another ASSERT in vpci_msix_arch_print(), to enforce the
expectation both locks are held before calling vpci_msix_arch_print()
- move pdev_done label in vpci_dump_msi()
- update comments in vpci_dump_msi() to say locks (plural)
Changes in v12.1:
- use read_trylock() in vpci_msix_arch_print()
- fixup in-code comments (revert double space, use DomXEN) in
vpci_{read,write}()
- minor updates in commit message
- add Roger's R-b
Changes in v12:
- s/pci_rwlock/pci_lock/ in commit message
- expand comment about scope of pci_lock in sched.h
- in vpci_{read,write}, if hwdom is trying to access a device assigned
to dom_xen, holding hwdom->pci_lock is sufficient (no need to hold
dom_xen->pci_lock)
- reintroduce ASSERT in vmx_pi_update_irte()
- reintroduce ASSERT in __pci_enable_msi{x}()
- delete note 6. in commit message about removing ASSERTs since we have
reintroduced them
Changes in v11:
- Fixed commit message regarding possible spinlocks
- Removed parameter from allocate_and_map_msi_pirq(), which was added
in the prev version. Now we are taking pcidevs_lock in
physdev_map_pirq()
- Returned ASSERT to pci_enable_msi
- Fixed case when we took read lock instead of write one
- Fixed label indentation
Changes in v10:
- Moved printk past locked area
- Returned back ASSERTs
- Added new parameter to allocate_and_map_msi_pirq() so it knows if
it should take the global pci lock
- Added comment about possible improvement in vpci_write
- Changed ASSERT(rw_is_locked()) to rw_is_write_locked() in
appropriate places
- Renamed release_domain_locks() to release_domain_write_locks()
- moved domain_done label in vpci_dump_msi() to correct place
Changes in v9:
- extended locked region to protect vpci_remove_device and
vpci_add_handlers() calls
- vpci_write() takes lock in the write mode to protect
potential call to modify_bars()
- renamed lock releasing function
- removed ASSERT()s from msi code
- added trylock in vpci_dump_msi
Changes in v8:
- changed d->vpci_lock to d->pci_lock
- introducing d->pci_lock in a separate patch
- extended locked region in vpci_process_pending
- removed pcidevs_lock in vpci_dump_msi()
- removed some changes as they are not needed with
the new locking scheme
- added handling for hwdom && dom_xen case
---
xen/arch/x86/hvm/vmsi.c | 31 +++++++++++++--------
xen/arch/x86/hvm/vmx/vmx.c | 2 +-
xen/arch/x86/irq.c | 8 +++---
xen/arch/x86/msi.c | 14 +++++-----
xen/arch/x86/physdev.c | 2 ++
xen/drivers/passthrough/pci.c | 9 +++---
xen/drivers/vpci/header.c | 18 ++++++++++++
xen/drivers/vpci/msi.c | 30 +++++++++++++++++---
xen/drivers/vpci/msix.c | 52 ++++++++++++++++++++++++++++++-----
xen/drivers/vpci/vpci.c | 24 ++++++++++++++--
xen/include/xen/sched.h | 3 +-
11 files changed, 152 insertions(+), 41 deletions(-)
diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index 128f23636279..9e5e93a6ba0f 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -468,7 +468,7 @@ int msixtbl_pt_register(struct domain *d, struct pirq *pirq, uint64_t gtable)
struct msixtbl_entry *entry, *new_entry;
int r = -EINVAL;
- ASSERT(pcidevs_locked());
+ ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
ASSERT(rw_is_write_locked(&d->event_lock));
if ( !msixtbl_initialised(d) )
@@ -538,7 +538,7 @@ void msixtbl_pt_unregister(struct domain *d, struct pirq *pirq)
struct pci_dev *pdev;
struct msixtbl_entry *entry;
- ASSERT(pcidevs_locked());
+ ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
ASSERT(rw_is_write_locked(&d->event_lock));
if ( !msixtbl_initialised(d) )
@@ -684,7 +684,7 @@ static int vpci_msi_update(const struct pci_dev *pdev, uint32_t data,
{
unsigned int i;
- ASSERT(pcidevs_locked());
+ ASSERT(rw_is_locked(&pdev->domain->pci_lock));
if ( (address & MSI_ADDR_BASE_MASK) != MSI_ADDR_HEADER )
{
@@ -725,8 +725,8 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
int rc;
ASSERT(msi->arch.pirq != INVALID_PIRQ);
+ ASSERT(rw_is_locked(&pdev->domain->pci_lock));
- pcidevs_lock();
for ( i = 0; i < msi->vectors && msi->arch.bound; i++ )
{
struct xen_domctl_bind_pt_irq unbind = {
@@ -745,7 +745,6 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address,
msi->vectors, msi->arch.pirq, msi->mask);
- pcidevs_unlock();
}
static int vpci_msi_enable(const struct pci_dev *pdev, unsigned int nr,
@@ -778,15 +777,14 @@ int vpci_msi_arch_enable(struct vpci_msi *msi, const struct pci_dev *pdev,
int rc;
ASSERT(msi->arch.pirq == INVALID_PIRQ);
+ ASSERT(rw_is_locked(&pdev->domain->pci_lock));
rc = vpci_msi_enable(pdev, vectors, 0);
if ( rc < 0 )
return rc;
msi->arch.pirq = rc;
- pcidevs_lock();
msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address, vectors,
msi->arch.pirq, msi->mask);
- pcidevs_unlock();
return 0;
}
@@ -797,8 +795,8 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
unsigned int i;
ASSERT(pirq != INVALID_PIRQ);
+ ASSERT(rw_is_locked(&pdev->domain->pci_lock));
- pcidevs_lock();
for ( i = 0; i < nr && bound; i++ )
{
struct xen_domctl_bind_pt_irq bind = {
@@ -814,7 +812,6 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
write_lock(&pdev->domain->event_lock);
unmap_domain_pirq(pdev->domain, pirq);
write_unlock(&pdev->domain->event_lock);
- pcidevs_unlock();
}
void vpci_msi_arch_disable(struct vpci_msi *msi, const struct pci_dev *pdev)
@@ -854,6 +851,7 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
int rc;
ASSERT(entry->arch.pirq == INVALID_PIRQ);
+ ASSERT(rw_is_locked(&pdev->domain->pci_lock));
rc = vpci_msi_enable(pdev, vmsix_entry_nr(pdev->vpci->msix, entry),
table_base);
if ( rc < 0 )
@@ -861,7 +859,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
entry->arch.pirq = rc;
- pcidevs_lock();
rc = vpci_msi_update(pdev, entry->data, entry->addr, 1, entry->arch.pirq,
entry->masked);
if ( rc )
@@ -869,7 +866,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
vpci_msi_disable(pdev, entry->arch.pirq, 1, false);
entry->arch.pirq = INVALID_PIRQ;
}
- pcidevs_unlock();
return rc;
}
@@ -895,6 +891,9 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
{
unsigned int i;
+ ASSERT(rw_is_locked(&msix->pdev->domain->pci_lock));
+ ASSERT(spin_is_locked(&msix->pdev->vpci->lock));
+
for ( i = 0; i < msix->max_entries; i++ )
{
const struct vpci_msix_entry *entry = &msix->entries[i];
@@ -913,13 +912,23 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
struct pci_dev *pdev = msix->pdev;
spin_unlock(&msix->pdev->vpci->lock);
+ read_unlock(&pdev->domain->pci_lock);
process_pending_softirqs();
+
+ if ( !read_trylock(&pdev->domain->pci_lock) )
+ return -EBUSY;
+
/* NB: we assume that pdev cannot go away for an alive domain. */
if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
+ {
+ read_unlock(&pdev->domain->pci_lock);
return -EBUSY;
+ }
+
if ( pdev->vpci->msix != msix )
{
spin_unlock(&pdev->vpci->lock);
+ read_unlock(&pdev->domain->pci_lock);
return -EAGAIN;
}
}
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 8ff675883c2b..890faef48b82 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -413,7 +413,7 @@ static int cf_check vmx_pi_update_irte(const struct vcpu *v,
spin_unlock_irq(&desc->lock);
- ASSERT(pcidevs_locked());
+ ASSERT(pcidevs_locked() || rw_is_locked(&msi_desc->dev->domain->pci_lock));
return iommu_update_ire_from_msi(msi_desc, &msi_desc->msg);
diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index 50e49e1a4b6f..4d89d9af99fe 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -2166,7 +2166,7 @@ int map_domain_pirq(
struct pci_dev *pdev;
unsigned int nr = 0;
- ASSERT(pcidevs_locked());
+ ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
ret = -ENODEV;
if ( !cpu_has_apic )
@@ -2323,7 +2323,7 @@ int unmap_domain_pirq(struct domain *d, int pirq)
if ( (pirq < 0) || (pirq >= d->nr_pirqs) )
return -EINVAL;
- ASSERT(pcidevs_locked());
+ ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
ASSERT(rw_is_write_locked(&d->event_lock));
info = pirq_info(d, pirq);
@@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
{
int irq, pirq, ret;
+ ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
+
switch ( type )
{
case MAP_PIRQ_TYPE_MSI:
@@ -2917,7 +2919,6 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
msi->irq = irq;
- pcidevs_lock();
/* Verify or get pirq. */
write_lock(&d->event_lock);
pirq = allocate_pirq(d, index, *pirq_p, irq, type, &msi->entry_nr);
@@ -2933,7 +2934,6 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
done:
write_unlock(&d->event_lock);
- pcidevs_unlock();
if ( ret )
{
switch ( type )
diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index 335c0868a225..7da2affa2e82 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -602,7 +602,7 @@ static int msi_capability_init(struct pci_dev *dev,
unsigned int i, mpos;
uint16_t control;
- ASSERT(pcidevs_locked());
+ ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
pos = pci_find_cap_offset(dev->sbdf, PCI_CAP_ID_MSI);
if ( !pos )
return -ENODEV;
@@ -771,7 +771,7 @@ static int msix_capability_init(struct pci_dev *dev,
if ( !pos )
return -ENODEV;
- ASSERT(pcidevs_locked());
+ ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));
/*
@@ -988,11 +988,11 @@ static int __pci_enable_msi(struct pci_dev *pdev, struct msi_info *msi,
{
struct msi_desc *old_desc;
- ASSERT(pcidevs_locked());
-
if ( !pdev )
return -ENODEV;
+ ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
+
old_desc = find_msi_entry(pdev, msi->irq, PCI_CAP_ID_MSI);
if ( old_desc )
{
@@ -1043,11 +1043,11 @@ static int __pci_enable_msix(struct pci_dev *pdev, struct msi_info *msi,
{
struct msi_desc *old_desc;
- ASSERT(pcidevs_locked());
-
if ( !pdev || !pdev->msix )
return -ENODEV;
+ ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
+
if ( msi->entry_nr >= pdev->msix->nr_entries )
return -EINVAL;
@@ -1154,7 +1154,7 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off)
int pci_enable_msi(struct pci_dev *pdev, struct msi_info *msi,
struct msi_desc **desc)
{
- ASSERT(pcidevs_locked());
+ ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
if ( !use_msi )
return -EPERM;
diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c
index 47c4da0af7e1..369c9e788c1c 100644
--- a/xen/arch/x86/physdev.c
+++ b/xen/arch/x86/physdev.c
@@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
case MAP_PIRQ_TYPE_MSI:
case MAP_PIRQ_TYPE_MULTI_MSI:
+ pcidevs_lock();
ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
+ pcidevs_unlock();
break;
default:
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 1439d1ef2b26..3a973324bca1 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -750,7 +750,6 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
pdev->domain = hardware_domain;
write_lock(&hardware_domain->pci_lock);
list_add(&pdev->domain_list, &hardware_domain->pdev_list);
- write_unlock(&hardware_domain->pci_lock);
/*
* For devices not discovered by Xen during boot, add vPCI handlers
@@ -759,18 +758,18 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
ret = vpci_add_handlers(pdev);
if ( ret )
{
- printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
- write_lock(&hardware_domain->pci_lock);
list_del(&pdev->domain_list);
write_unlock(&hardware_domain->pci_lock);
pdev->domain = NULL;
+ printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
goto out;
}
+ write_unlock(&hardware_domain->pci_lock);
ret = iommu_add_device(pdev);
if ( ret )
{
- vpci_remove_device(pdev);
write_lock(&hardware_domain->pci_lock);
+ vpci_remove_device(pdev);
list_del(&pdev->domain_list);
write_unlock(&hardware_domain->pci_lock);
pdev->domain = NULL;
@@ -1146,7 +1145,9 @@ static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt,
} while ( devfn != pdev->devfn &&
PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
+ write_lock(&ctxt->d->pci_lock);
err = vpci_add_handlers(pdev);
+ write_unlock(&ctxt->d->pci_lock);
if ( err )
printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n",
ctxt->d->domain_id, err);
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 58195549d50a..8f5850b8cf6d 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -173,6 +173,7 @@ bool vpci_process_pending(struct vcpu *v)
if ( rc == -ERESTART )
return true;
+ write_lock(&v->domain->pci_lock);
spin_lock(&v->vpci.pdev->vpci->lock);
/* Disable memory decoding unconditionally on failure. */
modify_decoding(v->vpci.pdev,
@@ -191,6 +192,7 @@ bool vpci_process_pending(struct vcpu *v)
* failure.
*/
vpci_remove_device(v->vpci.pdev);
+ write_unlock(&v->domain->pci_lock);
}
return false;
@@ -202,8 +204,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
struct map_data data = { .d = d, .map = true };
int rc;
+ ASSERT(rw_is_write_locked(&d->pci_lock));
+
while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
+ {
+ /*
+ * It's safe to drop and reacquire the lock in this context
+ * without risking pdev disappearing because devices cannot be
+ * removed until the initial domain has been started.
+ */
+ write_unlock(&d->pci_lock);
process_pending_softirqs();
+ write_lock(&d->pci_lock);
+ }
+
rangeset_destroy(mem);
if ( !rc )
modify_decoding(pdev, cmd, false);
@@ -244,6 +258,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
unsigned int i;
int rc;
+ ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
if ( !mem )
return -ENOMEM;
@@ -524,6 +540,8 @@ static int cf_check init_header(struct pci_dev *pdev)
int rc;
bool mask_cap_list = false;
+ ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
{
case PCI_HEADER_TYPE_NORMAL:
diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
index a253ccbd7db7..dc71938e23f5 100644
--- a/xen/drivers/vpci/msi.c
+++ b/xen/drivers/vpci/msi.c
@@ -263,7 +263,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
void vpci_dump_msi(void)
{
- const struct domain *d;
+ struct domain *d;
rcu_read_lock(&domlist_read_lock);
for_each_domain ( d )
@@ -275,6 +275,9 @@ void vpci_dump_msi(void)
printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
+ if ( !read_trylock(&d->pci_lock) )
+ continue;
+
for_each_pdev ( d, pdev )
{
const struct vpci_msi *msi;
@@ -313,17 +316,36 @@ void vpci_dump_msi(void)
{
/*
* On error vpci_msix_arch_print will always return without
- * holding the lock.
+ * holding the locks.
*/
printk("unable to print all MSI-X entries: %d\n", rc);
- process_pending_softirqs();
- continue;
+ goto pdev_done;
}
}
+ /*
+ * Unlock the locks to process pending softirqs. This is
+ * potentially unsafe, as d->pdev_list can be changed in the
+ * meantime.
+ */
spin_unlock(&pdev->vpci->lock);
+ read_unlock(&d->pci_lock);
+ pdev_done:
process_pending_softirqs();
+ if ( !read_trylock(&d->pci_lock) )
+ {
+ printk("unable to access other devices for the domain\n");
+ goto domain_done;
+ }
}
+ read_unlock(&d->pci_lock);
+ domain_done:
+ /*
+ * We need this label at the end of the loop, but some
+ * compilers might not be happy about a label at the end of a
+ * compound statement, so we add an empty statement here.
+ */
+ ;
}
rcu_read_unlock(&domlist_read_lock);
}
diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
index d1126a417da9..b6abab47efdd 100644
--- a/xen/drivers/vpci/msix.c
+++ b/xen/drivers/vpci/msix.c
@@ -147,6 +147,8 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
{
struct vpci_msix *msix;
+ ASSERT(rw_is_locked(&d->pci_lock));
+
list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next )
{
const struct vpci_bar *bars = msix->pdev->vpci->header.bars;
@@ -163,7 +165,13 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
static int cf_check msix_accept(struct vcpu *v, unsigned long addr)
{
- return !!msix_find(v->domain, addr);
+ int rc;
+
+ read_lock(&v->domain->pci_lock);
+ rc = !!msix_find(v->domain, addr);
+ read_unlock(&v->domain->pci_lock);
+
+ return rc;
}
static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
@@ -358,21 +366,35 @@ static int adjacent_read(const struct domain *d, const struct vpci_msix *msix,
static int cf_check msix_read(
struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
{
- const struct domain *d = v->domain;
- struct vpci_msix *msix = msix_find(d, addr);
+ struct domain *d = v->domain;
+ struct vpci_msix *msix;
const struct vpci_msix_entry *entry;
unsigned int offset;
*data = ~0UL;
+ read_lock(&d->pci_lock);
+
+ msix = msix_find(d, addr);
if ( !msix )
+ {
+ read_unlock(&d->pci_lock);
return X86EMUL_RETRY;
+ }
if ( adjacent_handle(msix, addr) )
- return adjacent_read(d, msix, addr, len, data);
+ {
+ int rc = adjacent_read(d, msix, addr, len, data);
+
+ read_unlock(&d->pci_lock);
+ return rc;
+ }
if ( !access_allowed(msix->pdev, addr, len) )
+ {
+ read_unlock(&d->pci_lock);
return X86EMUL_OKAY;
+ }
spin_lock(&msix->pdev->vpci->lock);
entry = get_entry(msix, addr);
@@ -404,6 +426,7 @@ static int cf_check msix_read(
break;
}
spin_unlock(&msix->pdev->vpci->lock);
+ read_unlock(&d->pci_lock);
return X86EMUL_OKAY;
}
@@ -491,19 +514,33 @@ static int adjacent_write(const struct domain *d, const struct vpci_msix *msix,
static int cf_check msix_write(
struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
{
- const struct domain *d = v->domain;
- struct vpci_msix *msix = msix_find(d, addr);
+ struct domain *d = v->domain;
+ struct vpci_msix *msix;
struct vpci_msix_entry *entry;
unsigned int offset;
+ read_lock(&d->pci_lock);
+
+ msix = msix_find(d, addr);
if ( !msix )
+ {
+ read_unlock(&d->pci_lock);
return X86EMUL_RETRY;
+ }
if ( adjacent_handle(msix, addr) )
- return adjacent_write(d, msix, addr, len, data);
+ {
+ int rc = adjacent_write(d, msix, addr, len, data);
+
+ read_unlock(&d->pci_lock);
+ return rc;
+ }
if ( !access_allowed(msix->pdev, addr, len) )
+ {
+ read_unlock(&d->pci_lock);
return X86EMUL_OKAY;
+ }
spin_lock(&msix->pdev->vpci->lock);
entry = get_entry(msix, addr);
@@ -579,6 +616,7 @@ static int cf_check msix_write(
break;
}
spin_unlock(&msix->pdev->vpci->lock);
+ read_unlock(&d->pci_lock);
return X86EMUL_OKAY;
}
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 72ef277c4f8e..475272b173f3 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -42,6 +42,8 @@ extern vpci_register_init_t *const __end_vpci_array[];
void vpci_remove_device(struct pci_dev *pdev)
{
+ ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
if ( !has_vpci(pdev->domain) || !pdev->vpci )
return;
@@ -77,6 +79,8 @@ int vpci_add_handlers(struct pci_dev *pdev)
const unsigned long *ro_map;
int rc = 0;
+ ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
if ( !has_vpci(pdev->domain) )
return 0;
@@ -361,7 +365,7 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
{
- const struct domain *d = current->domain;
+ struct domain *d = current->domain;
const struct pci_dev *pdev;
const struct vpci_register *r;
unsigned int data_offset = 0;
@@ -376,12 +380,18 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
/*
* Find the PCI dev matching the address, which for hwdom also requires
* consulting DomXEN. Passthrough everything that's not trapped.
+ * If this is hwdom and the device is assigned to DomXEN, acquiring hwdom's
+ * pci_lock is sufficient.
*/
+ read_lock(&d->pci_lock);
pdev = pci_get_pdev(d, sbdf);
if ( !pdev && is_hardware_domain(d) )
pdev = pci_get_pdev(dom_xen, sbdf);
if ( !pdev || !pdev->vpci )
+ {
+ read_unlock(&d->pci_lock);
return vpci_read_hw(sbdf, reg, size);
+ }
spin_lock(&pdev->vpci->lock);
@@ -428,6 +438,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
ASSERT(data_offset < size);
}
spin_unlock(&pdev->vpci->lock);
+ read_unlock(&d->pci_lock);
if ( data_offset < size )
{
@@ -470,7 +481,7 @@ static void vpci_write_helper(const struct pci_dev *pdev,
void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
uint32_t data)
{
- const struct domain *d = current->domain;
+ struct domain *d = current->domain;
const struct pci_dev *pdev;
const struct vpci_register *r;
unsigned int data_offset = 0;
@@ -484,7 +495,13 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
/*
* Find the PCI dev matching the address, which for hwdom also requires
* consulting DomXEN. Passthrough everything that's not trapped.
+ * If this is hwdom and the device is assigned to DomXEN, acquiring hwdom's
+ * pci_lock is sufficient.
+ *
+ * TODO: We need to take pci_locks in exclusive mode only if we
+ * are modifying BARs, so there is room for improvement.
*/
+ write_lock(&d->pci_lock);
pdev = pci_get_pdev(d, sbdf);
if ( !pdev && is_hardware_domain(d) )
pdev = pci_get_pdev(dom_xen, sbdf);
@@ -493,6 +510,8 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
/* Ignore writes to read-only devices, which have no ->vpci. */
const unsigned long *ro_map = pci_get_ro_map(sbdf.seg);
+ write_unlock(&d->pci_lock);
+
if ( !ro_map || !test_bit(sbdf.bdf, ro_map) )
vpci_write_hw(sbdf, reg, size, data);
return;
@@ -534,6 +553,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
ASSERT(data_offset < size);
}
spin_unlock(&pdev->vpci->lock);
+ write_unlock(&d->pci_lock);
if ( data_offset < size )
/* Tailing gap, write the remaining. */
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 9da91e0e6244..37f5922f3206 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -462,7 +462,8 @@ struct domain
#ifdef CONFIG_HAS_PCI
struct list_head pdev_list;
/*
- * pci_lock protects access to pdev_list.
+ * pci_lock protects access to pdev_list. pci_lock also protects pdev->vpci
+ * structure from being removed.
*
* Any user *reading* from pdev_list, or from devices stored in pdev_list,
* should hold either pcidevs_lock() or pci_lock in read mode. Optionally,
base-commit: c2ce3466472e9c9eda79f5dc98eb701bc6fdba20
--
2.43.0
On 15.01.2024 20:43, Stewart Hildebrand wrote:
> @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
> {
> int irq, pirq, ret;
>
> + ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
If either lock is sufficient to hold here, ...
> --- a/xen/arch/x86/physdev.c
> +++ b/xen/arch/x86/physdev.c
> @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
>
> case MAP_PIRQ_TYPE_MSI:
> case MAP_PIRQ_TYPE_MULTI_MSI:
> + pcidevs_lock();
> ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
> + pcidevs_unlock();
> break;
... why is it the global lock that's being acquired here?
Jan
On Tue, Jan 23, 2024 at 03:32:12PM +0100, Jan Beulich wrote:
> On 15.01.2024 20:43, Stewart Hildebrand wrote:
> > @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
> > {
> > int irq, pirq, ret;
> >
> > + ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>
> If either lock is sufficient to hold here, ...
>
> > --- a/xen/arch/x86/physdev.c
> > +++ b/xen/arch/x86/physdev.c
> > @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
> >
> > case MAP_PIRQ_TYPE_MSI:
> > case MAP_PIRQ_TYPE_MULTI_MSI:
> > + pcidevs_lock();
> > ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
> > + pcidevs_unlock();
> > break;
>
IIRC (Stewart can further comment) this is done holding the pcidevs
lock to keep the path unmodified, as there's no need to hold the
per-domain rwlock.
Roger.
On 23.01.2024 16:07, Roger Pau Monné wrote:
> On Tue, Jan 23, 2024 at 03:32:12PM +0100, Jan Beulich wrote:
>> On 15.01.2024 20:43, Stewart Hildebrand wrote:
>>> @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>>> {
>>> int irq, pirq, ret;
>>>
>>> + ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>>
>> If either lock is sufficient to hold here, ...
>>
>>> --- a/xen/arch/x86/physdev.c
>>> +++ b/xen/arch/x86/physdev.c
>>> @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
>>>
>>> case MAP_PIRQ_TYPE_MSI:
>>> case MAP_PIRQ_TYPE_MULTI_MSI:
>>> + pcidevs_lock();
>>> ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
>>> + pcidevs_unlock();
>>> break;
>>
>
> IIRC (Stewart can further comment) this is done holding the pcidevs
> lock to keep the path unmodified, as there's no need to hold the
> per-domain rwlock.
Yet why would we prefer to acquire a global lock when a per-domain one
suffices?
Jan
On Wed, Jan 24, 2024 at 09:48:35AM +0100, Jan Beulich wrote:
> On 23.01.2024 16:07, Roger Pau Monné wrote:
> > On Tue, Jan 23, 2024 at 03:32:12PM +0100, Jan Beulich wrote:
> >> On 15.01.2024 20:43, Stewart Hildebrand wrote:
> >>> @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
> >>> {
> >>> int irq, pirq, ret;
> >>>
> >>> + ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
> >>
> >> If either lock is sufficient to hold here, ...
> >>
> >>> --- a/xen/arch/x86/physdev.c
> >>> +++ b/xen/arch/x86/physdev.c
> >>> @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
> >>>
> >>> case MAP_PIRQ_TYPE_MSI:
> >>> case MAP_PIRQ_TYPE_MULTI_MSI:
> >>> + pcidevs_lock();
> >>> ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
> >>> + pcidevs_unlock();
> >>> break;
> >>
> >
> > IIRC (Stewart can further comment) this is done holding the pcidevs
> > lock to keep the path unmodified, as there's no need to hold the
> > per-domain rwlock.
>
> Yet why would we prefer to acquire a global lock when a per-domain one
> suffices?
I was hoping to introduce less changes, specially if they are not
strictly required, as it's less risk. I'm always quite worry of
locking changes.
Thanks, Roger.
On 24.01.2024 10:24, Roger Pau Monné wrote:
> On Wed, Jan 24, 2024 at 09:48:35AM +0100, Jan Beulich wrote:
>> On 23.01.2024 16:07, Roger Pau Monné wrote:
>>> On Tue, Jan 23, 2024 at 03:32:12PM +0100, Jan Beulich wrote:
>>>> On 15.01.2024 20:43, Stewart Hildebrand wrote:
>>>>> @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>>>>> {
>>>>> int irq, pirq, ret;
>>>>>
>>>>> + ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>>>>
>>>> If either lock is sufficient to hold here, ...
>>>>
>>>>> --- a/xen/arch/x86/physdev.c
>>>>> +++ b/xen/arch/x86/physdev.c
>>>>> @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
>>>>>
>>>>> case MAP_PIRQ_TYPE_MSI:
>>>>> case MAP_PIRQ_TYPE_MULTI_MSI:
>>>>> + pcidevs_lock();
>>>>> ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
>>>>> + pcidevs_unlock();
>>>>> break;
>>>>
>>>
>>> IIRC (Stewart can further comment) this is done holding the pcidevs
>>> lock to keep the path unmodified, as there's no need to hold the
>>> per-domain rwlock.
>>
>> Yet why would we prefer to acquire a global lock when a per-domain one
>> suffices?
>
> I was hoping to introduce less changes, specially if they are not
> strictly required, as it's less risk. I'm always quite worry of
> locking changes.
In which case more description / code commenting is needed. The pattern
of the assertions looks dangerous. Even if (as you say in a later reply)
this is only temporary, we all know how long "temporary" can be. It
might even be advisable to introduce a helper construct. That would then
be where the respective code comment goes, clarifying that the _sole_
purpose is to guarantee a pdev to not go away; no further protection of
any state, no other critical region aspects (assuming of course all of
this is actually true for all of the affected use sites).
Jan
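For illustration only, one possible shape of such a helper construct
(hypothetical name; a concrete wrapper along these lines is proposed by
Stewart later in the thread):

    /*
     * Hypothetical sketch: the _sole_ purpose of holding either lock
     * here is to guarantee the pdev cannot go away; no other state is
     * protected and no wider critical region is implied.
     */
    #define pdev_not_going_away(d) \
        (pcidevs_locked() || rw_is_locked(&(d)->pci_lock))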
On Wed, Jan 24, 2024 at 12:34:10PM +0100, Jan Beulich wrote:
> On 24.01.2024 10:24, Roger Pau Monné wrote:
> > On Wed, Jan 24, 2024 at 09:48:35AM +0100, Jan Beulich wrote:
> >> On 23.01.2024 16:07, Roger Pau Monné wrote:
> >>> On Tue, Jan 23, 2024 at 03:32:12PM +0100, Jan Beulich wrote:
> >>>> On 15.01.2024 20:43, Stewart Hildebrand wrote:
> >>>>> @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
> >>>>> {
> >>>>> int irq, pirq, ret;
> >>>>>
> >>>>> + ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
> >>>>
> >>>> If either lock is sufficient to hold here, ...
> >>>>
> >>>>> --- a/xen/arch/x86/physdev.c
> >>>>> +++ b/xen/arch/x86/physdev.c
> >>>>> @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
> >>>>>
> >>>>> case MAP_PIRQ_TYPE_MSI:
> >>>>> case MAP_PIRQ_TYPE_MULTI_MSI:
> >>>>> + pcidevs_lock();
> >>>>> ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
> >>>>> + pcidevs_unlock();
> >>>>> break;
> >>>>
> >>>
> >>> IIRC (Stewart can further comment) this is done holding the pcidevs
> >>> lock to keep the path unmodified, as there's no need to hold the
> >>> per-domain rwlock.
> >>
> >> Yet why would we prefer to acquire a global lock when a per-domain one
> >> suffices?
> >
> > I was hoping to introduce less changes, specially if they are not
> > strictly required, as it's less risk. I'm always quite worry of
> > locking changes.
>
> In which case more description / code commenting is needed. The pattern
> of the assertions looks dangerous.
Is such dangerousness perception because you fear some of the pcidevs
lock usage might be there not just for preventing the pdev from going
away, but also to guarantee exclusive access to certain state?
> Even if (as you say in a later reply)
> this is only temporary, we all know how long "temporary" can be. It
> might even be advisable to introduce a helper construct.
The aim here was to modify as little as possible, in order to avoid
having to analyze all possible users of pcidevs lock, and thus not
block the vPCI work on the probably lengthy and difficult analysis.
Not sure adding a construct makes is much better, as I didn't want to
give the impression all checks for the pcidevs lock can merely be
replaced by the new construct.
Thanks, Roger.
On 24.01.2024 18:51, Roger Pau Monné wrote:
> On Wed, Jan 24, 2024 at 12:34:10PM +0100, Jan Beulich wrote:
>> On 24.01.2024 10:24, Roger Pau Monné wrote:
>>> On Wed, Jan 24, 2024 at 09:48:35AM +0100, Jan Beulich wrote:
>>>> On 23.01.2024 16:07, Roger Pau Monné wrote:
>>>>> On Tue, Jan 23, 2024 at 03:32:12PM +0100, Jan Beulich wrote:
>>>>>> On 15.01.2024 20:43, Stewart Hildebrand wrote:
>>>>>>> @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>>>>>>> {
>>>>>>> int irq, pirq, ret;
>>>>>>>
>>>>>>> + ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>>>>>>
>>>>>> If either lock is sufficient to hold here, ...
>>>>>>
>>>>>>> --- a/xen/arch/x86/physdev.c
>>>>>>> +++ b/xen/arch/x86/physdev.c
>>>>>>> @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
>>>>>>>
>>>>>>> case MAP_PIRQ_TYPE_MSI:
>>>>>>> case MAP_PIRQ_TYPE_MULTI_MSI:
>>>>>>> + pcidevs_lock();
>>>>>>> ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
>>>>>>> + pcidevs_unlock();
>>>>>>> break;
>>>>>>
>>>>>
>>>>> IIRC (Stewart can further comment) this is done holding the pcidevs
>>>>> lock to keep the path unmodified, as there's no need to hold the
>>>>> per-domain rwlock.
>>>>
>>>> Yet why would we prefer to acquire a global lock when a per-domain one
>>>> suffices?
>>>
>>> I was hoping to introduce less changes, specially if they are not
>>> strictly required, as it's less risk. I'm always quite worry of
>>> locking changes.
>>
>> In which case more description / code commenting is needed. The pattern
>> of the assertions looks dangerous.
>
> Is such dangerousness perception because you fear some of the pcidevs
> lock usage might be there not just for preventing the pdev from going
> away, but also to guarantee exclusive access to certain state?
Indeed. In my view the main purpose of locks is to guard state. Their
use here to guard against devices here is imo rather an abuse; as
mentioned before this should instead be achieved e.g via refcounting.
And it's bad enough already that pcidevs_lock() alone has been abused
this way, without proper marking (leaving us to guess in many places).
It gets worse when a second lock can now also serve this same purpose.
>> Even if (as you say in a later reply)
>> this is only temporary, we all know how long "temporary" can be. It
>> might even be advisable to introduce a helper construct.
>
> The aim here was to modify as little as possible, in order to avoid
> having to analyze all possible users of pcidevs lock, and thus not
> block the vPCI work on the probably lengthy and difficult analysis.
>
> Not sure adding a construct makes is much better, as I didn't want to
> give the impression all checks for the pcidevs lock can merely be
> replaced by the new construct.
Of course such a construct could only be used in places where it can
be shown to be appropriate.
Jan
On Thu, Jan 25, 2024 at 08:43:05AM +0100, Jan Beulich wrote:
> On 24.01.2024 18:51, Roger Pau Monné wrote:
> > On Wed, Jan 24, 2024 at 12:34:10PM +0100, Jan Beulich wrote:
> >> On 24.01.2024 10:24, Roger Pau Monné wrote:
> >>> On Wed, Jan 24, 2024 at 09:48:35AM +0100, Jan Beulich wrote:
> >>>> On 23.01.2024 16:07, Roger Pau Monné wrote:
> >>>>> On Tue, Jan 23, 2024 at 03:32:12PM +0100, Jan Beulich wrote:
> >>>>>> On 15.01.2024 20:43, Stewart Hildebrand wrote:
> >>>>>>> @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
> >>>>>>> {
> >>>>>>> int irq, pirq, ret;
> >>>>>>>
> >>>>>>> + ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
> >>>>>>
> >>>>>> If either lock is sufficient to hold here, ...
> >>>>>>
> >>>>>>> --- a/xen/arch/x86/physdev.c
> >>>>>>> +++ b/xen/arch/x86/physdev.c
> >>>>>>> @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
> >>>>>>>
> >>>>>>> case MAP_PIRQ_TYPE_MSI:
> >>>>>>> case MAP_PIRQ_TYPE_MULTI_MSI:
> >>>>>>> + pcidevs_lock();
> >>>>>>> ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
> >>>>>>> + pcidevs_unlock();
> >>>>>>> break;
> >>>>>>
> >>>>>
> >>>>> IIRC (Stewart can further comment) this is done holding the pcidevs
> >>>>> lock to keep the path unmodified, as there's no need to hold the
> >>>>> per-domain rwlock.
> >>>>
> >>>> Yet why would we prefer to acquire a global lock when a per-domain one
> >>>> suffices?
> >>>
> >>> I was hoping to introduce less changes, specially if they are not
> >>> strictly required, as it's less risk. I'm always quite worry of
> >>> locking changes.
> >>
> >> In which case more description / code commenting is needed. The pattern
> >> of the assertions looks dangerous.
> >
> > Is such dangerousness perception because you fear some of the pcidevs
> > lock usage might be there not just for preventing the pdev from going
> > away, but also to guarantee exclusive access to certain state?
>
> Indeed. In my view the main purpose of locks is to guard state. Their
> use here to guard against devices here is imo rather an abuse; as
> mentioned before this should instead be achieved e.g via refcounting.
> And it's bad enough already that pcidevs_lock() alone has been abused
> this way, without proper marking (leaving us to guess in many places).
> It gets worse when a second lock can now also serve this same
> purpose.
The new lock is taken in read mode in most contexts, and hence can't
be used to indirectly gain exclusive access to domain related
structures in a safe way.
Not saying this makes it any better, but so far this is the best
solution we could come up with that didn't involve a full evaluation
and possible re-write of all usage of the pcidevs lock.
I would also prefer a solution that fully replaces the pcidevs lock
with something else, but for one I don't have a clear picture of what
that would look like, because the analysis is a huge amount of work,
likely more than the implementation itself.
Hence the proposed compromise solution that should allow the vPCI work
to make progress.
Regards, Roger.
On 25.01.2024 10:05, Roger Pau Monné wrote:
> On Thu, Jan 25, 2024 at 08:43:05AM +0100, Jan Beulich wrote:
>> On 24.01.2024 18:51, Roger Pau Monné wrote:
>>> On Wed, Jan 24, 2024 at 12:34:10PM +0100, Jan Beulich wrote:
>>>> On 24.01.2024 10:24, Roger Pau Monné wrote:
>>>>> On Wed, Jan 24, 2024 at 09:48:35AM +0100, Jan Beulich wrote:
>>>>>> On 23.01.2024 16:07, Roger Pau Monné wrote:
>>>>>>> On Tue, Jan 23, 2024 at 03:32:12PM +0100, Jan Beulich wrote:
>>>>>>>> On 15.01.2024 20:43, Stewart Hildebrand wrote:
>>>>>>>>> @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>>>>>>>>> {
>>>>>>>>> int irq, pirq, ret;
>>>>>>>>>
>>>>>>>>> + ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>>>>>>>>
>>>>>>>> If either lock is sufficient to hold here, ...
>>>>>>>>
>>>>>>>>> --- a/xen/arch/x86/physdev.c
>>>>>>>>> +++ b/xen/arch/x86/physdev.c
>>>>>>>>> @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
>>>>>>>>>
>>>>>>>>> case MAP_PIRQ_TYPE_MSI:
>>>>>>>>> case MAP_PIRQ_TYPE_MULTI_MSI:
>>>>>>>>> + pcidevs_lock();
>>>>>>>>> ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
>>>>>>>>> + pcidevs_unlock();
>>>>>>>>> break;
>>>>>>>>
>>>>>>>
>>>>>>> IIRC (Stewart can further comment) this is done holding the pcidevs
>>>>>>> lock to keep the path unmodified, as there's no need to hold the
>>>>>>> per-domain rwlock.
>>>>>>
>>>>>> Yet why would we prefer to acquire a global lock when a per-domain one
>>>>>> suffices?
>>>>>
>>>>> I was hoping to introduce less changes, specially if they are not
>>>>> strictly required, as it's less risk. I'm always quite worry of
>>>>> locking changes.
>>>>
>>>> In which case more description / code commenting is needed. The pattern
>>>> of the assertions looks dangerous.
>>>
>>> Is such dangerousness perception because you fear some of the pcidevs
>>> lock usage might be there not just for preventing the pdev from going
>>> away, but also to guarantee exclusive access to certain state?
>>
>> Indeed. In my view the main purpose of locks is to guard state. Their
>> use here to guard against devices here is imo rather an abuse; as
>> mentioned before this should instead be achieved e.g via refcounting.
>> And it's bad enough already that pcidevs_lock() alone has been abused
>> this way, without proper marking (leaving us to guess in many places).
>> It gets worse when a second lock can now also serve this same
>> purpose.
>
> The new lock is taken in read mode in most contexts, and hence can't
> be used to indirectly gain exclusive access to domain related
> structures in a safe way.
Oh, right - I keep being misled by rw_is_locked(). This is a fair
argument. Irrespective it would feel better to me if an abstraction
construct was introduced; but seeing you don't like the idea I guess
I won't insist.
Jan
On Thu, Jan 25, 2024 at 12:23:05PM +0100, Jan Beulich wrote:
> On 25.01.2024 10:05, Roger Pau Monné wrote:
> > On Thu, Jan 25, 2024 at 08:43:05AM +0100, Jan Beulich wrote:
> >> On 24.01.2024 18:51, Roger Pau Monné wrote:
> >>> On Wed, Jan 24, 2024 at 12:34:10PM +0100, Jan Beulich wrote:
> >>>> On 24.01.2024 10:24, Roger Pau Monné wrote:
> >>>>> On Wed, Jan 24, 2024 at 09:48:35AM +0100, Jan Beulich wrote:
> >>>>>> On 23.01.2024 16:07, Roger Pau Monné wrote:
> >>>>>>> On Tue, Jan 23, 2024 at 03:32:12PM +0100, Jan Beulich wrote:
> >>>>>>>> On 15.01.2024 20:43, Stewart Hildebrand wrote:
> >>>>>>>>> @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
> >>>>>>>>> {
> >>>>>>>>> int irq, pirq, ret;
> >>>>>>>>>
> >>>>>>>>> + ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
> >>>>>>>>
> >>>>>>>> If either lock is sufficient to hold here, ...
> >>>>>>>>
> >>>>>>>>> --- a/xen/arch/x86/physdev.c
> >>>>>>>>> +++ b/xen/arch/x86/physdev.c
> >>>>>>>>> @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
> >>>>>>>>>
> >>>>>>>>> case MAP_PIRQ_TYPE_MSI:
> >>>>>>>>> case MAP_PIRQ_TYPE_MULTI_MSI:
> >>>>>>>>> + pcidevs_lock();
> >>>>>>>>> ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
> >>>>>>>>> + pcidevs_unlock();
> >>>>>>>>> break;
> >>>>>>>>
> >>>>>>>
> >>>>>>> IIRC (Stewart can further comment) this is done holding the pcidevs
> >>>>>>> lock to keep the path unmodified, as there's no need to hold the
> >>>>>>> per-domain rwlock.
> >>>>>>
> >>>>>> Yet why would we prefer to acquire a global lock when a per-domain one
> >>>>>> suffices?
> >>>>>
> >>>>> I was hoping to introduce less changes, specially if they are not
> >>>>> strictly required, as it's less risk. I'm always quite worry of
> >>>>> locking changes.
> >>>>
> >>>> In which case more description / code commenting is needed. The pattern
> >>>> of the assertions looks dangerous.
> >>>
> >>> Is such dangerousness perception because you fear some of the pcidevs
> >>> lock usage might be there not just for preventing the pdev from going
> >>> away, but also to guarantee exclusive access to certain state?
> >>
> >> Indeed. In my view the main purpose of locks is to guard state. Their
> >> use here to guard against devices here is imo rather an abuse; as
> >> mentioned before this should instead be achieved e.g via refcounting.
> >> And it's bad enough already that pcidevs_lock() alone has been abused
> >> this way, without proper marking (leaving us to guess in many places).
> >> It gets worse when a second lock can now also serve this same
> >> purpose.
> >
> > The new lock is taken in read mode in most contexts, and hence can't
> > be used to indirectly gain exclusive access to domain related
> > structures in a safe way.
>
> Oh, right - I keep being misled by rw_is_locked(). This is a fair
> argument. Irrespective it would feel better to me if an abstraction
> construct was introduced; but seeing you don't like the idea I guess
> I won't insist.
TBH I'm not going to argue against it if you and Stewart think it's
clearer, but I also won't request the addition of such wrapper myself.
Thanks, Roger.
On 1/25/24 07:33, Roger Pau Monné wrote:
> On Thu, Jan 25, 2024 at 12:23:05PM +0100, Jan Beulich wrote:
>> On 25.01.2024 10:05, Roger Pau Monné wrote:
>>> On Thu, Jan 25, 2024 at 08:43:05AM +0100, Jan Beulich wrote:
>>>> On 24.01.2024 18:51, Roger Pau Monné wrote:
>>>>> On Wed, Jan 24, 2024 at 12:34:10PM +0100, Jan Beulich wrote:
>>>>>> On 24.01.2024 10:24, Roger Pau Monné wrote:
>>>>>>> On Wed, Jan 24, 2024 at 09:48:35AM +0100, Jan Beulich wrote:
>>>>>>>> On 23.01.2024 16:07, Roger Pau Monné wrote:
>>>>>>>>> On Tue, Jan 23, 2024 at 03:32:12PM +0100, Jan Beulich wrote:
>>>>>>>>>> On 15.01.2024 20:43, Stewart Hildebrand wrote:
>>>>>>>>>>> @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>>>>>>>>>>> {
>>>>>>>>>>> int irq, pirq, ret;
>>>>>>>>>>>
>>>>>>>>>>> + ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>>>>>>>>>>
>>>>>>>>>> If either lock is sufficient to hold here, ...
>>>>>>>>>>
>>>>>>>>>>> --- a/xen/arch/x86/physdev.c
>>>>>>>>>>> +++ b/xen/arch/x86/physdev.c
>>>>>>>>>>> @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
>>>>>>>>>>>
>>>>>>>>>>> case MAP_PIRQ_TYPE_MSI:
>>>>>>>>>>> case MAP_PIRQ_TYPE_MULTI_MSI:
>>>>>>>>>>> + pcidevs_lock();
>>>>>>>>>>> ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
>>>>>>>>>>> + pcidevs_unlock();
>>>>>>>>>>> break;
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> IIRC (Stewart can further comment) this is done holding the pcidevs
>>>>>>>>> lock to keep the path unmodified, as there's no need to hold the
>>>>>>>>> per-domain rwlock.
>>>>>>>>
>>>>>>>> Yet why would we prefer to acquire a global lock when a per-domain one
>>>>>>>> suffices?
>>>>>>>
>>>>>>> I was hoping to introduce less changes, specially if they are not
>>>>>>> strictly required, as it's less risk. I'm always quite worry of
>>>>>>> locking changes.
>>>>>>
>>>>>> In which case more description / code commenting is needed. The pattern
>>>>>> of the assertions looks dangerous.
>>>>>
>>>>> Is such dangerousness perception because you fear some of the pcidevs
>>>>> lock usage might be there not just for preventing the pdev from going
>>>>> away, but also to guarantee exclusive access to certain state?
>>>>
>>>> Indeed. In my view the main purpose of locks is to guard state. Their
>>>> use here to guard against devices here is imo rather an abuse; as
>>>> mentioned before this should instead be achieved e.g via refcounting.
>>>> And it's bad enough already that pcidevs_lock() alone has been abused
>>>> this way, without proper marking (leaving us to guess in many places).
>>>> It gets worse when a second lock can now also serve this same
>>>> purpose.
>>>
>>> The new lock is taken in read mode in most contexts, and hence can't
>>> be used to indirectly gain exclusive access to domain related
>>> structures in a safe way.
>>
>> Oh, right - I keep being misled by rw_is_locked(). This is a fair
>> argument. Irrespective it would feel better to me if an abstraction
>> construct was introduced; but seeing you don't like the idea I guess
>> I won't insist.
>
> TBH I'm not going to argue against it if you and Stewart think it's
> clearer, but I also won't request the addition of such wrapper myself.
>
> Thanks, Roger.
Overall, I think there are two sources of confusion:
1. This patch is using the odd-looking ASSERTs to verify that it is safe to *read* d->pdev_list, and/or ensure a pdev does not go away or get reassigned. The purpose of these ASSERTs is not immediately obvious due to inadequate description.
2. At first glance, the patch appears to be doing two things: using d->pci_lock for d->pdev_list/pdev protection, and using d->pci_lock for pdev->vpci protection.
Regarding #1, while the review experience could have been improved by introducing a wrapper construct, I think it would also (more importantly) be valuable to have such a wrapper for the sake of code readability. I think it is important to get this right and hopefully avoid/reduce potential future confusion. I'll add something like this in v13, e.g. in sched.h:
/* Ensure pdevs do not go away or get assigned to other domains. */
#define pdev_list_is_read_locked(d) ({ \
struct domain *d_ = (d); \
pcidevs_locked() || (d_ && rw_is_locked(&d_->pci_lock)); \
})
Example use:
ASSERT(pdev_list_is_read_locked(d));
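Purely for illustration (not part of the patch), the assertion added in
allocate_and_map_msi_pirq() would then become:

    ASSERT(pdev_list_is_read_locked(d));

and in places that only have the pdev at hand (e.g. __pci_enable_msix()),
the owning domain can be passed, with the macro already tolerating a
NULL pdev->domain:

    ASSERT(pdev_list_is_read_locked(pdev->domain));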
Regarding #2, the patch description primarily talks about protecting the pdev->vpci field, and the d->pdev_list read / pdev reassignment protection seems an afterthought. However, the use of pcidevs_lock() for pdev protection is pre-existing. Now that vPCI callers are going to use d->pci_lock (instead of pcidevs_lock()), we are simultaneously changing the allowable mechanism for protecting d->pdev_list reads and pdevs going away or getting reassigned. I briefly experimented with splitting this into two separate patches, but I chose not to pursue this further because then we'd have a brief odd intermediate state, not to mention the additional test/review burden of evaluating each separate change independently. Keep in mind this patch as a whole has already been through much test/review, and at this point my primary focus is to improve readability and avoid confusion. I will plan to add appropriate description and rationale for v13.
Since I will be changing to use a wrapper construct and updating the descriptions, I will plan to drop Roger's R-b tag on this patch for v13.
Lastly, as has already been mentioned in the cover letter and reiterated in discussions here, for the non-vPCI code paths already using pcidevs_lock() I will plan to keep them using pcidevs_lock().
On 1/23/24 10:07, Roger Pau Monné wrote:
> On Tue, Jan 23, 2024 at 03:32:12PM +0100, Jan Beulich wrote:
>> On 15.01.2024 20:43, Stewart Hildebrand wrote:
>>> @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>>> {
>>> int irq, pirq, ret;
>>>
>>> + ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>>
>> If either lock is sufficient to hold here, ...
>>
>>> --- a/xen/arch/x86/physdev.c
>>> +++ b/xen/arch/x86/physdev.c
>>> @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
>>>
>>> case MAP_PIRQ_TYPE_MSI:
>>> case MAP_PIRQ_TYPE_MULTI_MSI:
>>> + pcidevs_lock();
>>> ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
>>> + pcidevs_unlock();
>>> break;
>>
>> ... why is it the global lock that's being acquired here?
>>
>
> IIRC (Stewart can further comment) this is done holding the pcidevs
> lock to keep the path unmodified, as there's no need to hold the
> per-domain rwlock.
>
Although allocate_and_map_msi_pirq() was itself acquiring the global pcidevs_lock() before this patch, we could just as well use read_lock(&d->pci_lock) here instead now. It seems like a good optimization to make, so if there aren't any objections I'll change it to read_lock(&d->pci_lock).
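A minimal sketch of what that alternative would look like in
physdev_map_pirq() (for reference only; see the follow-up below where
this ends up not being pursued):

    case MAP_PIRQ_TYPE_MSI:
    case MAP_PIRQ_TYPE_MULTI_MSI:
        /* Sketch: per-domain read lock instead of the global pcidevs lock. */
        read_lock(&d->pci_lock);
        ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
        read_unlock(&d->pci_lock);
        break;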
On 1/24/24 00:00, Stewart Hildebrand wrote:
> On 1/23/24 10:07, Roger Pau Monné wrote:
>> On Tue, Jan 23, 2024 at 03:32:12PM +0100, Jan Beulich wrote:
>>> On 15.01.2024 20:43, Stewart Hildebrand wrote:
>>>> @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>>>> {
>>>> int irq, pirq, ret;
>>>>
>>>> + ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>>>
>>> If either lock is sufficient to hold here, ...
>>>
>>>> --- a/xen/arch/x86/physdev.c
>>>> +++ b/xen/arch/x86/physdev.c
>>>> @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
>>>>
>>>> case MAP_PIRQ_TYPE_MSI:
>>>> case MAP_PIRQ_TYPE_MULTI_MSI:
>>>> + pcidevs_lock();
>>>> ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
>>>> + pcidevs_unlock();
>>>> break;
>>>
>>> ... why is it the global lock that's being acquired here?
>>>
>>
>> IIRC (Stewart can further comment) this is done holding the pcidevs
>> lock to keep the path unmodified, as there's no need to hold the
>> per-domain rwlock.
>>
>
> Although allocate_and_map_msi_pirq() was itself acquiring the global pcidevs_lock() before this patch, we could just as well use read_lock(&d->pci_lock) here instead now. It seems like a good optimization to make, so if there aren't any objections I'll change it to read_lock(&d->pci_lock).
>
Actually, I take this back. As mentioned in the cover letter of this series, and has been reiterated in recent discussions, the goal with this is to keep existing (non-vPCI) code paths as unmodified as possible. So I'll keep it as pcidevs_lock() here.
On 15.01.2024 20:43, Stewart Hildebrand wrote:
> @@ -1043,11 +1043,11 @@ static int __pci_enable_msix(struct pci_dev *pdev, struct msi_info *msi,
> {
> struct msi_desc *old_desc;
>
> - ASSERT(pcidevs_locked());
> -
> if ( !pdev || !pdev->msix )
> return -ENODEV;
>
> + ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
> +
> if ( msi->entry_nr >= pdev->msix->nr_entries )
> return -EINVAL;
Further looking at this - is dereferencing pdev actually safe without holding
the global lock?
Jan
On 1/23/24 09:29, Jan Beulich wrote:
> On 15.01.2024 20:43, Stewart Hildebrand wrote:
>> @@ -1043,11 +1043,11 @@ static int __pci_enable_msix(struct pci_dev *pdev, struct msi_info *msi,
>> {
>> struct msi_desc *old_desc;
>>
>> - ASSERT(pcidevs_locked());
>> -
>> if ( !pdev || !pdev->msix )
>> return -ENODEV;
>>
>> + ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
>> +
>> if ( msi->entry_nr >= pdev->msix->nr_entries )
>> return -EINVAL;
>
> Further looking at this - is dereferencing pdev actually safe without holding
> the global lock?
Are you referring to the new placement of the ASSERT, which opens up the possibility that pdev could be dereferenced and the function return before the ASSERT? If that is what you mean, I see your point. The ASSERT was placed there simply because we wanted to check that pdev != NULL first. See prior discussion at [1]. Hmm.. How about splitting the pdev-checking condition? E.g.:
if ( !pdev )
return -ENODEV;
ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
if ( !pdev->msix )
return -ENODEV;
[1] https://lore.kernel.org/xen-devel/85a52f8d-d6db-4478-92b1-2b6305769c96@amd.com/
On 24.01.2024 06:07, Stewart Hildebrand wrote:
> On 1/23/24 09:29, Jan Beulich wrote:
>> On 15.01.2024 20:43, Stewart Hildebrand wrote:
>>> @@ -1043,11 +1043,11 @@ static int __pci_enable_msix(struct pci_dev *pdev, struct msi_info *msi,
>>> {
>>> struct msi_desc *old_desc;
>>>
>>> - ASSERT(pcidevs_locked());
>>> -
>>> if ( !pdev || !pdev->msix )
>>> return -ENODEV;
>>>
>>> + ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
>>> +
>>> if ( msi->entry_nr >= pdev->msix->nr_entries )
>>> return -EINVAL;
>>
>> Further looking at this - is dereferencing pdev actually safe without holding
>> the global lock?
>
> Are you referring to the new placement of the ASSERT, which opens up the possibility that pdev could be dereferenced and the function return before the ASSERT? If that is what you mean, I see your point. The ASSERT was placed there simply because we wanted to check that pdev != NULL first. See prior discussion at [1]. Hmm.. How about splitting the pdev-checking condition? E.g.:
>
> if ( !pdev )
> return -ENODEV;
>
> ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
>
> if ( !pdev->msix )
> return -ENODEV;
Yes, this is the particular arrangement I would have expected (at least
partly based on the guessing of the purpose of holding those locks).
Jan
> [1] https://lore.kernel.org/xen-devel/85a52f8d-d6db-4478-92b1-2b6305769c96@amd.com/
On Wed, Jan 24, 2024 at 12:07:28AM -0500, Stewart Hildebrand wrote:
> On 1/23/24 09:29, Jan Beulich wrote:
> > On 15.01.2024 20:43, Stewart Hildebrand wrote:
> >> @@ -1043,11 +1043,11 @@ static int __pci_enable_msix(struct pci_dev *pdev, struct msi_info *msi,
> >> {
> >> struct msi_desc *old_desc;
> >>
> >> - ASSERT(pcidevs_locked());
> >> -
> >> if ( !pdev || !pdev->msix )
> >> return -ENODEV;
> >>
> >> + ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
> >> +
> >> if ( msi->entry_nr >= pdev->msix->nr_entries )
> >> return -EINVAL;
> >
> > Further looking at this - is dereferencing pdev actually safe without holding
> > the global lock?
It is safe because either the global pcidevs lock or the per-domain
pci_lock will be held, which should prevent the device from being
removed.
> Are you referring to the new placement of the ASSERT, which opens up the possibility that pdev could be dereferenced and the function return before the ASSERT? If that is what you mean, I see your point. The ASSERT was placed there simply because we wanted to check that pdev != NULL first. See prior discussion at [1]. Hmm.. How about splitting the pdev-checking condition? E.g.:
>
> if ( !pdev )
> return -ENODEV;
>
> ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
>
> if ( !pdev->msix )
> return -ENODEV;
I'm not specially worried about the position of the assert, those are
just debug messages at the end.
One worry I have after further looking at the code, when called from
ns16550_init_postirq(), does the device have pdev->domain set?
That case would satisfy the first condition of the assert, so won't
attempt to dereference pdev->domain, but still would be good to ensure
consistency here wrt the state of pdev->domain.
Regards, Roger.
On 1/24/24 03:21, Roger Pau Monné wrote:
> On Wed, Jan 24, 2024 at 12:07:28AM -0500, Stewart Hildebrand wrote:
>> On 1/23/24 09:29, Jan Beulich wrote:
>>> On 15.01.2024 20:43, Stewart Hildebrand wrote:
>>>> @@ -1043,11 +1043,11 @@ static int __pci_enable_msix(struct pci_dev *pdev, struct msi_info *msi,
>>>> {
>>>> struct msi_desc *old_desc;
>>>>
>>>> - ASSERT(pcidevs_locked());
>>>> -
>>>> if ( !pdev || !pdev->msix )
>>>> return -ENODEV;
>>>>
>>>> + ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
>>>> +
>>>> if ( msi->entry_nr >= pdev->msix->nr_entries )
>>>> return -EINVAL;
>>>
>>> Further looking at this - is dereferencing pdev actually safe without holding
>>> the global lock?
>
> It is safe because either the global pcidevs lock or the per-domain
> pci_lock will be held, which should prevent the device from being
> removed.
>
>> Are you referring to the new placement of the ASSERT, which opens up the possibility that pdev could be dereferenced and the function return before the ASSERT? If that is what you mean, I see your point. The ASSERT was placed there simply because we wanted to check that pdev != NULL first. See prior discussion at [1]. Hmm.. How about splitting the pdev-checking condition? E.g.:
>>
>> if ( !pdev )
>> return -ENODEV;
>>
>> ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
>>
>> if ( !pdev->msix )
>> return -ENODEV;
>
> I'm not specially worried about the position of the assert, those are
> just debug messages at the end.
>
> One worry I have after further looking at the code, when called from
> ns16550_init_postirq(), does the device have pdev->domain set?
>
> That case would satisfy the first condition of the assert, so won't
> attempt to dereference pdev->domain, but still would be good to ensure
> consistency here wrt the state of pdev->domain.
Indeed. How about this?
if ( !pdev )
return -ENODEV;
ASSERT(pcidevs_locked() ||
(pdev->domain && rw_is_locked(&pdev->domain->pci_lock)));
if ( !pdev->msix )
return -ENODEV;
And similarly in __pci_enable_msi(), without the !pdev->msix check. And similarly in pci_enable_msi(), which then should also gain its own if ( !pdev ) return -ENODEV; check.
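A rough sketch of how pci_enable_msi() might then start (illustrative
only, based on the hunk above, not a final implementation):

    int pci_enable_msi(struct pci_dev *pdev, struct msi_info *msi,
                       struct msi_desc **desc)
    {
        /* Sketch: own NULL check before dereferencing pdev->domain. */
        if ( !pdev )
            return -ENODEV;

        ASSERT(pcidevs_locked() ||
               (pdev->domain && rw_is_locked(&pdev->domain->pci_lock)));

        if ( !use_msi )
            return -EPERM;

        /* rest of the function unchanged */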
On 15.01.2024 20:43, Stewart Hildebrand wrote:
> --- a/xen/arch/x86/hvm/vmsi.c
> +++ b/xen/arch/x86/hvm/vmsi.c
> @@ -468,7 +468,7 @@ int msixtbl_pt_register(struct domain *d, struct pirq *pirq, uint64_t gtable)
> struct msixtbl_entry *entry, *new_entry;
> int r = -EINVAL;
>
> - ASSERT(pcidevs_locked());
> + ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
> ASSERT(rw_is_write_locked(&d->event_lock));
>
> if ( !msixtbl_initialised(d) )
> @@ -538,7 +538,7 @@ void msixtbl_pt_unregister(struct domain *d, struct pirq *pirq)
> struct pci_dev *pdev;
> struct msixtbl_entry *entry;
>
> - ASSERT(pcidevs_locked());
> + ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
> ASSERT(rw_is_write_locked(&d->event_lock));

I was hoping to just ack this patch, but the two changes above look
questionable to me: How can it be that holding _either_ lock is okay?
It's not obvious in this context that consumers have to hold both
locks now. In fact consumers looks to be the callers of
msixtbl_find_entry(), yet the list is RCU-protected. Whereas races
against themselves or against one another are avoided by holding
d->event_lock.

My only guess then for the original need of holding pcidevs_lock is
the use of msi_desc->dev, with the desire for the device to not go
away. Yet the description doesn't talk about interactions of the per-
domain PCI lock with that one at all; it all circles around the
domain'd vPCI lock.

Feels like I'm missing something that's obvious to everyone else. Or
maybe this part of the patch is actually unrelated, and should be
split off (with its own [proper] justification)? Or wouldn't it then
be better to also change the other paths leading here to acquire the
per-domain PCI lock?

> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -413,7 +413,7 @@ static int cf_check vmx_pi_update_irte(const struct vcpu *v,
>
> spin_unlock_irq(&desc->lock);
>
> - ASSERT(pcidevs_locked());
> + ASSERT(pcidevs_locked() || rw_is_locked(&msi_desc->dev->domain->pci_lock));
>
> return iommu_update_ire_from_msi(msi_desc, &msi_desc->msg);

This then falls in the same category. And apparently there are more.

Jan
On Tue, Jan 23, 2024 at 03:26:26PM +0100, Jan Beulich wrote:
> On 15.01.2024 20:43, Stewart Hildebrand wrote:
> > --- a/xen/arch/x86/hvm/vmsi.c
> > +++ b/xen/arch/x86/hvm/vmsi.c
> > @@ -468,7 +468,7 @@ int msixtbl_pt_register(struct domain *d, struct pirq *pirq, uint64_t gtable)
> > struct msixtbl_entry *entry, *new_entry;
> > int r = -EINVAL;
> >
> > - ASSERT(pcidevs_locked());
> > + ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
> > ASSERT(rw_is_write_locked(&d->event_lock));
> >
> > if ( !msixtbl_initialised(d) )
> > @@ -538,7 +538,7 @@ void msixtbl_pt_unregister(struct domain *d, struct pirq *pirq)
> > struct pci_dev *pdev;
> > struct msixtbl_entry *entry;
> >
> > - ASSERT(pcidevs_locked());
> > + ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
> > ASSERT(rw_is_write_locked(&d->event_lock));
>
> I was hoping to just ack this patch, but the two changes above look
> questionable to me: How can it be that holding _either_ lock is okay?
> It's not obvious in this context that consumers have to hold both
> locks now. In fact consumers looks to be the callers of
> msixtbl_find_entry(), yet the list is RCU-protected. Whereas races
> against themselves or against one another are avoided by holding
> d->event_lock.
The reason for the change here is that msixtbl_pt_{un,}register() gets
called by pt_irq_{create,destroy}_bind(), which is in turn called by
vPCI code (pcidevs_locked()) that has been switched to not take the
pcidevs lock anymore, and hence the ASSERT would trigger.
> My only guess then for the original need of holding pcidevs_lock is
> the use of msi_desc->dev, with the desire for the device to not go
> away. Yet the description doesn't talk about interactions of the per-
> domain PCI lock with that one at all; it all circles around the
> domain'd vPCI lock.
I do agree that it looks like the original intention of holding
pcidevs_lock is to prevent msi_desc->dev from being removed - yet I'm
not sure it's possible for the device to go away while the domain
event_lock is held, as device removal would need to take that same
lock in order to destroy the irq_desc.
> Feels like I'm missing something that's obvious to everyone else.
> Or maybe this part of the patch is actually unrelated, and should be
> split off (with its own [proper] justification)? Or wouldn't it then
> be better to also change the other paths leading here to acquire the
> per-domain PCI lock?
Other vPCI paths (vpci_msi_update(), vpci_msi_arch_update(),
vpci_msi_arch_enable(), ...) are switched in this patch to use the
per-domain pci_lock instead of the pcidevs lock.
> > --- a/xen/arch/x86/hvm/vmx/vmx.c
> > +++ b/xen/arch/x86/hvm/vmx/vmx.c
> > @@ -413,7 +413,7 @@ static int cf_check vmx_pi_update_irte(const struct vcpu *v,
> >
> > spin_unlock_irq(&desc->lock);
> >
> > - ASSERT(pcidevs_locked());
> > + ASSERT(pcidevs_locked() || rw_is_locked(&msi_desc->dev->domain->pci_lock));
> >
> > return iommu_update_ire_from_msi(msi_desc, &msi_desc->msg);
>
> This then falls in the same category. And apparently there are more.
This one is again a result of such function being called from
pt_irq_create_bind() from vPCI code that has been switched to use the
per-domain pci_lock.
IOMMU state is already protected by its own internal locks, and
doesn't rely on pcidevs lock. Hence I can also only guess that the
usage here is to prevent the device from being removed.
Regards, Roger.
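To illustrate the call paths being discussed, a simplified sketch (not actual Xen code; the function is hypothetical and the signatures are approximations) of the locking used on those vPCI MSI paths:

/*
 * Illustrative only: the per-domain pci_lock (read mode) keeps
 * pdev->vpci from being removed, the vpci lock protects the per-device
 * vPCI state, and the pcidevs lock is no longer taken on this path.
 */
static void vpci_msi_control_write_sketch(struct pci_dev *pdev, bool enable)
{
    read_lock(&pdev->domain->pci_lock);
    spin_lock(&pdev->vpci->lock);

    if ( enable )
        /*
         * Eventually reaches pt_irq_create_bind() and
         * msixtbl_pt_register(), which now see d->pci_lock held
         * instead of the pcidevs lock.
         */
        vpci_msi_arch_enable(pdev->vpci->msi, pdev, pdev->vpci->msi->vectors);

    spin_unlock(&pdev->vpci->lock);
    read_unlock(&pdev->domain->pci_lock);
}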
On 23.01.2024 16:23, Roger Pau Monné wrote:
> On Tue, Jan 23, 2024 at 03:26:26PM +0100, Jan Beulich wrote:
>> On 15.01.2024 20:43, Stewart Hildebrand wrote:
>>> --- a/xen/arch/x86/hvm/vmsi.c
>>> +++ b/xen/arch/x86/hvm/vmsi.c
>>> @@ -468,7 +468,7 @@ int msixtbl_pt_register(struct domain *d, struct pirq *pirq, uint64_t gtable)
>>> struct msixtbl_entry *entry, *new_entry;
>>> int r = -EINVAL;
>>>
>>> - ASSERT(pcidevs_locked());
>>> + ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>>> ASSERT(rw_is_write_locked(&d->event_lock));
>>>
>>> if ( !msixtbl_initialised(d) )
>>> @@ -538,7 +538,7 @@ void msixtbl_pt_unregister(struct domain *d, struct pirq *pirq)
>>> struct pci_dev *pdev;
>>> struct msixtbl_entry *entry;
>>>
>>> - ASSERT(pcidevs_locked());
>>> + ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>>> ASSERT(rw_is_write_locked(&d->event_lock));
>>
>> I was hoping to just ack this patch, but the two changes above look
>> questionable to me: How can it be that holding _either_ lock is okay?
>> It's not obvious in this context that consumers have to hold both
>> locks now. In fact consumers looks to be the callers of
>> msixtbl_find_entry(), yet the list is RCU-protected. Whereas races
>> against themselves or against one another are avoided by holding
>> d->event_lock.
>
> The reason for the change here is that msixtbl_pt_{un,}register() gets
> called by pt_irq_{create,destroy}_bind(), which is in turn called by
> vPCI code (pcidevs_locked()) that has been switched to not take the
> pcidevs lock anymore, and hence the ASSERT would trigger.
I understand this is the motivation for the change, but that doesn't
(alone) render the construct above sensible / correct.
>> My only guess then for the original need of holding pcidevs_lock is
>> the use of msi_desc->dev, with the desire for the device to not go
>> away. Yet the description doesn't talk about interactions of the per-
>> domain PCI lock with that one at all; it all circles around the
>> domain'd vPCI lock.
>
> I do agree that it looks like the original intention of holding
> pcidevs_lock is to prevent msi_desc->dev from being removed - yet I'm
> not sure it's possible for the device to go away while the domain
> event_lock is hold, as device removal would need to take that same
> lock in order to destroy the irq_desc.
Yes, that matches an observation of mine as well. If we can simplify
(rather than complicate) locking, I'd prefer if we did. May need to
be a separate (prereq) patch, though.
>> Feels like I'm missing something that's obvious to everyone else.
>> Or maybe this part of the patch is actually unrelated, and should be
>> split off (with its own [proper] justification)? Or wouldn't it then
>> be better to also change the other paths leading here to acquire the
>> per-domain PCI lock?
>
> Other paths in vPCI vpci_msi_update(), vpci_msi_arch_update(),
> vpci_msi_arch_enable()... are switched in this patch to use the
> per-domain pci_lock instead of pcidevs lock.
Hence my question: Can't we consolidate to all paths only using the
per-domain lock? That would make these odd-looking assertions become
normal-looking again.
Jan
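In other words, a sketch of the consolidated outcome being asked about, assuming every caller were converted to take the per-domain lock:

/* In msixtbl_pt_{un,}register(), the assertions would then simply be: */
ASSERT(rw_is_locked(&d->pci_lock));
ASSERT(rw_is_write_locked(&d->event_lock));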
On Wed, Jan 24, 2024 at 09:56:42AM +0100, Jan Beulich wrote:
> On 23.01.2024 16:23, Roger Pau Monné wrote:
> > On Tue, Jan 23, 2024 at 03:26:26PM +0100, Jan Beulich wrote:
> >> On 15.01.2024 20:43, Stewart Hildebrand wrote:
> >>> --- a/xen/arch/x86/hvm/vmsi.c
> >>> +++ b/xen/arch/x86/hvm/vmsi.c
> >>> @@ -468,7 +468,7 @@ int msixtbl_pt_register(struct domain *d, struct pirq *pirq, uint64_t gtable)
> >>> struct msixtbl_entry *entry, *new_entry;
> >>> int r = -EINVAL;
> >>>
> >>> - ASSERT(pcidevs_locked());
> >>> + ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
> >>> ASSERT(rw_is_write_locked(&d->event_lock));
> >>>
> >>> if ( !msixtbl_initialised(d) )
> >>> @@ -538,7 +538,7 @@ void msixtbl_pt_unregister(struct domain *d, struct pirq *pirq)
> >>> struct pci_dev *pdev;
> >>> struct msixtbl_entry *entry;
> >>>
> >>> - ASSERT(pcidevs_locked());
> >>> + ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
> >>> ASSERT(rw_is_write_locked(&d->event_lock));
> >>
> >> I was hoping to just ack this patch, but the two changes above look
> >> questionable to me: How can it be that holding _either_ lock is okay?
> >> It's not obvious in this context that consumers have to hold both
> >> locks now. In fact consumers looks to be the callers of
> >> msixtbl_find_entry(), yet the list is RCU-protected. Whereas races
> >> against themselves or against one another are avoided by holding
> >> d->event_lock.
> >
> > The reason for the change here is that msixtbl_pt_{un,}register() gets
> > called by pt_irq_{create,destroy}_bind(), which is in turn called by
> > vPCI code (pcidevs_locked()) that has been switched to not take the
> > pcidevs lock anymore, and hence the ASSERT would trigger.
>
> I understand this is the motivation for the change, but that doesn't
> (alone) render the construct above sensible / correct.
But we agreed that, for the purpose of the device not going away,
either the pcidevs lock or the per-domain pci_lock should be held; both
are valid for that purpose, and hence functions have been adjusted to
take that into account.
So your concern is that the pcidevs lock is used here not just for
preventing the device from being removed, but also for protecting
MSI-related state?
> >> My only guess then for the original need of holding pcidevs_lock is
> >> the use of msi_desc->dev, with the desire for the device to not go
> >> away. Yet the description doesn't talk about interactions of the per-
> >> domain PCI lock with that one at all; it all circles around the
> >> domain'd vPCI lock.
> >
> > I do agree that it looks like the original intention of holding
> > pcidevs_lock is to prevent msi_desc->dev from being removed - yet I'm
> > not sure it's possible for the device to go away while the domain
> > event_lock is hold, as device removal would need to take that same
> > lock in order to destroy the irq_desc.
>
> Yes, that matches an observation of mine as well. If we can simplify
> (rather then complicate) locking, I'd prefer if we did. May need to
> be a separate (prereq) patch, though.
Hm, yes, that might be an option, and doing a pre-patch that removes
the need to have pcidevs locked here would avoid what seem to be
controversial changes.
> >> Feels like I'm missing something that's obvious to everyone else.
> >> Or maybe this part of the patch is actually unrelated, and should be
> >> split off (with its own [proper] justification)? Or wouldn't it then
> >> be better to also change the other paths leading here to acquire the
> >> per-domain PCI lock?
> >
> > Other paths in vPCI vpci_msi_update(), vpci_msi_arch_update(),
> > vpci_msi_arch_enable()... are switched in this patch to use the
> > per-domain pci_lock instead of pcidevs lock.
>
> Hence my question: Can't we consolidate to all paths only using the
> per-domain lock? That would make these odd-looking assertions become
> normal-looking again.
Hm, I think that's more work than originally planned, as the initial
plan was to use both locks during an interim period in order to avoid
doing a full sweeping change to switch to the per-domain one.
Regards, Roger.
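If the interim either-lock state is kept, one way to avoid open-coding the disjunction at every assertion site would be a small helper along these lines (purely a sketch; the helper name is hypothetical):

/* Hypothetical helper: true if d's device list cannot change under us. */
static inline bool pdev_cannot_go_away(struct domain *d)
{
    return pcidevs_locked() || rw_is_locked(&d->pci_lock);
}

/* Call sites would then read: */
ASSERT(pdev_cannot_go_away(d));
ASSERT(rw_is_write_locked(&d->event_lock));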
On Mon, Jan 15, 2024 at 02:43:08PM -0500, Stewart Hildebrand wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>
> Use the per-domain PCI read/write lock to protect the presence of the
> pci device vpci field. This lock can be used (and in a few cases is used
> right away) so that vpci removal can be performed while holding the lock
> in write mode. Previously such removal could race with vpci_read for
> example.
>
> When taking both d->pci_lock and pdev->vpci->lock, they should be
> taken in this exact order: d->pci_lock then pdev->vpci->lock to avoid
> possible deadlock situations.
>
> 1. Per-domain's pci_lock is used to protect pdev->vpci structure
> from being removed.
>
> 2. Writing the command register and ROM BAR register may trigger
> modify_bars to run, which in turn may access multiple pdevs while
> checking for the existing BAR's overlap. The overlapping check, if
> done under the read lock, requires vpci->lock to be acquired on both
> devices being compared, which may produce a deadlock. It is not
> possible to upgrade read lock to write lock in such a case. So, in
> order to prevent the deadlock, use d->pci_lock in write mode instead.
>
> All other code, which doesn't lead to pdev->vpci destruction and does
> not access multiple pdevs at the same time, can still use a
> combination of the read lock and pdev->vpci->lock.
>
> 3. Drop const qualifier where the new rwlock is used and this is
> appropriate.
>
> 4. Do not call process_pending_softirqs with any locks held. For that
> unlock prior the call and re-acquire the locks after. After
> re-acquiring the lock there is no need to check if pdev->vpci exists:
> - in apply_map because of the context it is called (no race condition
> possible)
> - for MSI/MSI-X debug code because it is called at the end of
> pdev->vpci access and no further access to pdev->vpci is made
>
> 5. Use d->pci_lock around for_each_pdev and pci_get_pdev_by_domain
> while accessing pdevs in vpci code.
>
> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
> Suggested-by: Jan Beulich <jbeulich@suse.com>
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Thanks, Roger.