[PATCH 00/12] evtchn: recent XSAs follow-on
Posted by Jan Beulich 3 years, 6 months ago
These are grouped into a series largely because of their origin,
not so much because there are heavy dependencies among them.

01: refuse EVTCHNOP_status for Xen-bound event channels
02: avoid race in get_xen_consumer()
03: don't call Xen consumer callback with per-channel lock held
04: evtchn_set_priority() needs to acquire the per-channel lock
05: sched: reject poll requests for unusable ports
06: don't bypass unlinking pIRQ when closing port
07: cut short evtchn_reset()'s loop in the common case
08: ECS_CLOSED => ECS_FREE
09: move FIFO-private struct declarations
10: fifo: use stable fields when recording "last queue" information
11: convert vIRQ lock to an r/w one
12: convert domain event lock to an r/w one

Jan

[PATCH 01/12] evtchn: refuse EVTCHNOP_status for Xen-bound event channels
Posted by Jan Beulich 3 years, 6 months ago
Callers have no business knowing the state of the Xen end of an event
channel.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/common/event_channel.c
+++ b/xen/common/event_channel.c
@@ -933,6 +933,11 @@ int evtchn_status(evtchn_status_t *statu
     }
 
     chn = evtchn_from_port(d, port);
+    if ( consumer_is_xen(chn) )
+    {
+        rc = -EACCES;
+        goto out;
+    }
 
     rc = xsm_evtchn_status(XSM_TARGET, d, chn);
     if ( rc )


Re: [PATCH 01/12] evtchn: refuse EVTCHNOP_status for Xen-bound event channels
Posted by Julien Grall 3 years, 6 months ago
Hi Jan,

On 28/09/2020 11:56, Jan Beulich wrote:
> Callers have no business knowing the state of the Xen end of an event
> channel.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Julien Grall <jgrall@amazon.com>

Cheers,

> 
> --- a/xen/common/event_channel.c
> +++ b/xen/common/event_channel.c
> @@ -933,6 +933,11 @@ int evtchn_status(evtchn_status_t *statu
>       }
>   
>       chn = evtchn_from_port(d, port);
> +    if ( consumer_is_xen(chn) )
> +    {
> +        rc = -EACCES;
> +        goto out;
> +    }
>   
>       rc = xsm_evtchn_status(XSM_TARGET, d, chn);
>       if ( rc )
> 

-- 
Julien Grall

RE: [PATCH 01/12] evtchn: refuse EVTCHNOP_status for Xen-bound event channels
Posted by Paul Durrant 3 years, 6 months ago
> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Jan Beulich
> Sent: 28 September 2020 11:56
> To: xen-devel@lists.xenproject.org
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>; George Dunlap <George.Dunlap@eu.citrix.com>; Ian
> Jackson <iwj@xenproject.org>; Julien Grall <julien@xen.org>; Wei Liu <wl@xen.org>; Stefano Stabellini
> <sstabellini@kernel.org>
> Subject: [PATCH 01/12] evtchn: refuse EVTCHNOP_status for Xen-bound event channels
> 
> Callers have no business knowing the state of the Xen end of an event
> channel.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 

Reviewed-by: Paul Durrant <Paul@xen.org>

> --- a/xen/common/event_channel.c
> +++ b/xen/common/event_channel.c
> @@ -933,6 +933,11 @@ int evtchn_status(evtchn_status_t *statu
>      }
> 
>      chn = evtchn_from_port(d, port);
> +    if ( consumer_is_xen(chn) )
> +    {
> +        rc = -EACCES;
> +        goto out;
> +    }
> 
>      rc = xsm_evtchn_status(XSM_TARGET, d, chn);
>      if ( rc )
> 



[PATCH 02/12] evtchn: avoid race in get_xen_consumer()
Posted by Jan Beulich 3 years, 6 months ago
There's no global lock around the updating of this global piece of data.
Make use of cmpxchg() to avoid two entities racing with their updates.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
TBD: Initially I used cmpxchgptr() here, until realizing Arm doesn't
     have it. It's slightly more type-safe than cmpxchg() (requiring
     all arguments to actually be pointers), so I now wonder whether Arm
     should gain it (perhaps simply by moving the x86 implementation to
     xen/lib.h), or whether we should purge it from x86 as being
     pointless.
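
     For illustration only, a generic variant might look roughly like the
     below (the macro shape and its placement in xen/lib.h are assumptions
     on my part, not the existing x86 implementation); the **(ptr)
     dereference is what forces the arguments to actually be pointers:

         #define cmpxchgptr(ptr, o, n) ({                              \
             __typeof__(**(ptr)) *o_ = (o);                            \
             __typeof__(**(ptr)) *n_ = (n);                            \
             (__typeof__(*(ptr)))(unsigned long)                       \
                 cmpxchg((unsigned long *)(ptr),                       \
                         (unsigned long)o_, (unsigned long)n_);        \
         })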

--- a/xen/common/event_channel.c
+++ b/xen/common/event_channel.c
@@ -57,7 +57,8 @@
  * with a pointer, we stash them dynamically in a small lookup array which
  * can be indexed by a small integer.
  */
-static xen_event_channel_notification_t xen_consumers[NR_XEN_CONSUMERS];
+static xen_event_channel_notification_t __read_mostly
+    xen_consumers[NR_XEN_CONSUMERS];
 
 /* Default notification action: wake up from wait_on_xen_event_channel(). */
 static void default_xen_notification_fn(struct vcpu *v, unsigned int port)
@@ -81,7 +82,7 @@ static uint8_t get_xen_consumer(xen_even
     for ( i = 0; i < ARRAY_SIZE(xen_consumers); i++ )
     {
         if ( xen_consumers[i] == NULL )
-            xen_consumers[i] = fn;
+            (void)cmpxchg(&xen_consumers[i], NULL, fn);
         if ( xen_consumers[i] == fn )
             break;
     }


Re: [PATCH 02/12] evtchn: avoid race in get_xen_consumer()
Posted by Julien Grall 3 years, 6 months ago
Hi Jan,

On 28/09/2020 11:56, Jan Beulich wrote:
> There's no global lock around the updating of this global piece of data.
> Make use of cmpxchg() to avoid two entities racing with their updates.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> TBD: Initially I used cmpxchgptr() here, until realizing Arm doesn't
>       have it. It's slightly more type-safe than cmpxchg() (requiring
>       all arguments to actually be pointers), so I now wonder whether Arm
>       should gain it (perhaps simply by moving the x86 implementation to
>       xen/lib.h), or whether we should purge it from x86 as being
>       pointless.

I would be happy with an implementation of cmpxchgptr() in xen/lib.h.

> 
> --- a/xen/common/event_channel.c
> +++ b/xen/common/event_channel.c
> @@ -57,7 +57,8 @@
>    * with a pointer, we stash them dynamically in a small lookup array which
>    * can be indexed by a small integer.
>    */
> -static xen_event_channel_notification_t xen_consumers[NR_XEN_CONSUMERS];
> +static xen_event_channel_notification_t __read_mostly
> +    xen_consumers[NR_XEN_CONSUMERS];

This doesn't seem directly related to the changes below. Can you explain 
it in the commit message?

>   
>   /* Default notification action: wake up from wait_on_xen_event_channel(). */
>   static void default_xen_notification_fn(struct vcpu *v, unsigned int port)
> @@ -81,7 +82,7 @@ static uint8_t get_xen_consumer(xen_even
>       for ( i = 0; i < ARRAY_SIZE(xen_consumers); i++ )
>       {
>           if ( xen_consumers[i] == NULL )
> -            xen_consumers[i] = fn;
> +            (void)cmpxchg(&xen_consumers[i], NULL, fn);

This wants an explanation in the code. Maybe:

"As there is no global lock, the cmpxchg() will prevent race between two 
callers."

>           if ( xen_consumers[i] == fn )
>               break;
>       }
> 

Cheers,

-- 
Julien Grall

RE: [PATCH 02/12] evtchn: avoid race in get_xen_consumer()
Posted by Paul Durrant 3 years, 6 months ago
> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Jan Beulich
> Sent: 28 September 2020 11:57
> To: xen-devel@lists.xenproject.org
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>; George Dunlap <George.Dunlap@eu.citrix.com>; Ian
> Jackson <iwj@xenproject.org>; Julien Grall <julien@xen.org>; Wei Liu <wl@xen.org>; Stefano Stabellini
> <sstabellini@kernel.org>
> Subject: [PATCH 02/12] evtchn: avoid race in get_xen_consumer()
> 
> There's no global lock around the updating of this global piece of data.
> Make use of cmpxchg() to avoid two entities racing with their updates.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> TBD: Initially I used cmpxchgptr() here, until realizing Arm doesn't
>      have it. It's slightly more type-safe than cmpxchg() (requiring
>      all arguments to actually be pointers), so I now wonder whether Arm
>      should gain it (perhaps simply by moving the x86 implementation to
>      xen/lib.h), or whether we should purge it from x86 as being
>      pointless.
> 
> --- a/xen/common/event_channel.c
> +++ b/xen/common/event_channel.c
> @@ -57,7 +57,8 @@
>   * with a pointer, we stash them dynamically in a small lookup array which
>   * can be indexed by a small integer.
>   */
> -static xen_event_channel_notification_t xen_consumers[NR_XEN_CONSUMERS];
> +static xen_event_channel_notification_t __read_mostly
> +    xen_consumers[NR_XEN_CONSUMERS];
> 
>  /* Default notification action: wake up from wait_on_xen_event_channel(). */
>  static void default_xen_notification_fn(struct vcpu *v, unsigned int port)
> @@ -81,7 +82,7 @@ static uint8_t get_xen_consumer(xen_even
>      for ( i = 0; i < ARRAY_SIZE(xen_consumers); i++ )
>      {
>          if ( xen_consumers[i] == NULL )
> -            xen_consumers[i] = fn;
> +            (void)cmpxchg(&xen_consumers[i], NULL, fn);
>          if ( xen_consumers[i] == fn )

Why not use the return from cmpxchg() to determine success and break out of the loop rather than re-accessing the global array?

  Paul

>              break;
>      }
> 



Re: [PATCH 02/12] evtchn: avoid race in get_xen_consumer()
Posted by Jan Beulich 3 years, 6 months ago
On 29.09.2020 17:44, Paul Durrant wrote:
>> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Jan Beulich
>> Sent: 28 September 2020 11:57
>>
>> @@ -81,7 +82,7 @@ static uint8_t get_xen_consumer(xen_even
>>      for ( i = 0; i < ARRAY_SIZE(xen_consumers); i++ )
>>      {
>>          if ( xen_consumers[i] == NULL )
>> -            xen_consumers[i] = fn;
>> +            (void)cmpxchg(&xen_consumers[i], NULL, fn);
>>          if ( xen_consumers[i] == fn )
> 
> Why not use the return from cmpxchg() to determine success and break
> out of the loop rather than re-accessing the global array?

That's an option, in which case I wouldn't be sure anymore whether
adding __read_mostly to the definition of xen_consumers[] is
appropriate. This way, otoh, the (LOCKed on x86) write isn't even
attempted when the slot already holds non-NULL.
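
For completeness, the suggested variant would be something like the below
(illustrative sketch only, not what the patch does) - trading the re-read
of the array for a LOCKed access on every probed slot:

    for ( i = 0; i < ARRAY_SIZE(xen_consumers); i++ )
    {
        xen_event_channel_notification_t old =
            cmpxchg(&xen_consumers[i], NULL, fn);

        if ( old == NULL || old == fn )
            break;
    }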

Jan

[PATCH 03/12] evtchn: don't call Xen consumer callback with per-channel lock held
Posted by Jan Beulich 3 years, 6 months ago
While there don't look to be any problems with this right now, the lock
order implications from holding the lock can be very difficult to follow
(and may be easy to violate unknowingly). The present callbacks don't
(and no such callback should) have any need for the lock to be held.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/common/event_channel.c
+++ b/xen/common/event_channel.c
@@ -746,9 +746,18 @@ int evtchn_send(struct domain *ld, unsig
         rport = lchn->u.interdomain.remote_port;
         rchn  = evtchn_from_port(rd, rport);
         if ( consumer_is_xen(rchn) )
-            xen_notification_fn(rchn)(rd->vcpu[rchn->notify_vcpu_id], rport);
-        else
-            evtchn_port_set_pending(rd, rchn->notify_vcpu_id, rchn);
+        {
+            /* Don't keep holding the lock for the call below. */
+            xen_event_channel_notification_t fn = xen_notification_fn(rchn);
+            struct vcpu *rv = rd->vcpu[rchn->notify_vcpu_id];
+
+            rcu_lock_domain(rd);
+            spin_unlock_irqrestore(&lchn->lock, flags);
+            fn(rv, rport);
+            rcu_unlock_domain(rd);
+            return 0;
+        }
+        evtchn_port_set_pending(rd, rchn->notify_vcpu_id, rchn);
         break;
     case ECS_IPI:
         evtchn_port_set_pending(ld, lchn->notify_vcpu_id, lchn);


Re: [PATCH 03/12] evtchn: don't call Xen consumer callback with per-channel lock held
Posted by Julien Grall 3 years, 6 months ago
Hi Jan,

On 28/09/2020 11:57, Jan Beulich wrote:
> While there don't look to be any problems with this right now, the lock
> order implications from holding the lock can be very difficult to follow
> (and may be easy to violate unknowingly).

I think this is a good idea given that we are disabling interrupts now. 
Unfortunately...

> The present callbacks don't
> (and no such callback should) have any need for the lock to be held.

... I think the lock is necessary for the vm_event subsystem to avoid 
racing with the vm_event_disable().

The notification callback will use a data structure that is freed by 
vm_event_disable(). There is a lock, but it is part of the data structure...

One solution would be to have the lock outside of the data structure.

> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> --- a/xen/common/event_channel.c
> +++ b/xen/common/event_channel.c
> @@ -746,9 +746,18 @@ int evtchn_send(struct domain *ld, unsig
>           rport = lchn->u.interdomain.remote_port;
>           rchn  = evtchn_from_port(rd, rport);
>           if ( consumer_is_xen(rchn) )
> -            xen_notification_fn(rchn)(rd->vcpu[rchn->notify_vcpu_id], rport);
> -        else
> -            evtchn_port_set_pending(rd, rchn->notify_vcpu_id, rchn);
> +        {
> +            /* Don't keep holding the lock for the call below. */
> +            xen_event_channel_notification_t fn = xen_notification_fn(rchn);
> +            struct vcpu *rv = rd->vcpu[rchn->notify_vcpu_id];
> +
> +            rcu_lock_domain(rd);
> +            spin_unlock_irqrestore(&lchn->lock, flags);
> +            fn(rv, rport);
> +            rcu_unlock_domain(rd);
> +            return 0;
> +        }
> +        evtchn_port_set_pending(rd, rchn->notify_vcpu_id, rchn);
>           break;
>       case ECS_IPI:
>           evtchn_port_set_pending(ld, lchn->notify_vcpu_id, lchn);
>

Cheers,

-- 
Julien Grall

Re: [PATCH 03/12] evtchn: don't call Xen consumer callback with per-channel lock held
Posted by Jan Beulich 3 years, 6 months ago
On 29.09.2020 12:16, Julien Grall wrote:
> On 28/09/2020 11:57, Jan Beulich wrote:
>> While there don't look to be any problems with this right now, the lock
>> order implications from holding the lock can be very difficult to follow
>> (and may be easy to violate unknowingly).
> 
> I think this is a good idea given that we are disabling interrupts now. 
> Unfortunately...
> 
>> The present callbacks don't
>> (and no such callback should) have any need for the lock to be held.
> 
> ... I think the lock is necessary for the vm_event subsystem to avoid 
> racing with the vm_event_disable().
> 
> The notification callback will use a data structure that is freed by 
> vm_event_disable(). There is a lock, but it is part of the data structure...

Oh, indeed - somehow I didn't spot this despite looking there.

> One solution would be to have the lock outside of the data structure.

I don't think that's viable - the structures are intentionally
separated from struct vcpu. I see two other options: Either free
the structure via call_rcu(), or maintain a count of in-progress
calls, and wait for it to drop to zero when closing the port.
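
To illustrate the first option (rough sketch only - the rcu_head field and
the callback name are made up, and I'm assuming the structure in question
is struct vm_event_domain):

    /* struct vm_event_domain would gain a "struct rcu_head rcu;" member. */
    static void vm_event_free_cb(struct rcu_head *rcu)
    {
        xfree(container_of(rcu, struct vm_event_domain, rcu));
    }

    /* vm_event_disable() would then defer the freeing: */
    call_rcu(&ved->rcu, vm_event_free_cb);

The free would then be deferred past in-progress calls holding an RCU
read-side reference, as the evtchn_send() path in patch 03 does via
rcu_lock_domain() / rcu_unlock_domain().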

VM event maintainers / reviewers - what are your thoughts here?

Jan

[PATCH 04/12] evtchn: evtchn_set_priority() needs to acquire the per-channel lock
Posted by Jan Beulich 3 years, 6 months ago
evtchn_fifo_set_pending() (invoked with the per-channel lock held) has
two uses of the channel's priority field. The field gets updated by
evtchn_fifo_set_priority() with only the per-domain event_lock held,
i.e. the two reads may observe two different values. While the 2nd use
could - afaict - in principle be replaced by q->priority, I think
evtchn_set_priority() should acquire the per-channel lock in any event.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/common/event_channel.c
+++ b/xen/common/event_channel.c
@@ -1132,7 +1132,9 @@ static long evtchn_set_priority(const st
 {
     struct domain *d = current->domain;
     unsigned int port = set_priority->port;
+    struct evtchn *chn;
     long ret;
+    unsigned long flags;
 
     spin_lock(&d->event_lock);
 
@@ -1142,8 +1144,10 @@ static long evtchn_set_priority(const st
         return -EINVAL;
     }
 
-    ret = evtchn_port_set_priority(d, evtchn_from_port(d, port),
-                                   set_priority->priority);
+    chn = evtchn_from_port(d, port);
+    spin_lock_irqsave(&chn->lock, flags);
+    ret = evtchn_port_set_priority(d, chn, set_priority->priority);
+    spin_unlock_irqrestore(&chn->lock, flags);
 
     spin_unlock(&d->event_lock);
 


Re: [PATCH 04/12] evtchn: evtchn_set_priority() needs to acquire the per-channel lock
Posted by Julien Grall 3 years, 6 months ago
Hi Jan,

On 28/09/2020 11:57, Jan Beulich wrote:
> evtchn_fifo_set_pending() (invoked with the per-channel lock held) has
> two uses of the channel's priority field. The field gets updated by
> evtchn_fifo_set_priority() with only the per-domain event_lock held,
> i.e. the two reads may observe two different values. While the 2nd use
> could - afaict - in principle be replaced by q->priority, I think
> evtchn_set_priority() should acquire the per-channel lock in any event.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> --- a/xen/common/event_channel.c
> +++ b/xen/common/event_channel.c
> @@ -1132,7 +1132,9 @@ static long evtchn_set_priority(const st
>   {
>       struct domain *d = current->domain;
>       unsigned int port = set_priority->port;
> +    struct evtchn *chn;
>       long ret;
> +    unsigned long flags;
>   
>       spin_lock(&d->event_lock);

Is it still necessary to hold d->event_lock?

>   
> @@ -1142,8 +1144,10 @@ static long evtchn_set_priority(const st
>           return -EINVAL;
>       }
>   
> -    ret = evtchn_port_set_priority(d, evtchn_from_port(d, port),
> -                                   set_priority->priority);
> +    chn = evtchn_from_port(d, port);
> +    spin_lock_irqsave(&chn->lock, flags);
> +    ret = evtchn_port_set_priority(d, chn, set_priority->priority);
> +    spin_unlock_irqrestore(&chn->lock, flags);
>   
>       spin_unlock(&d->event_lock);
>   
> 

Cheers,

-- 
Julien Grall

Re: [PATCH 04/12] evtchn: evtchn_set_priority() needs to acquire the per-channel lock
Posted by Jan Beulich 3 years, 6 months ago
On 29.09.2020 12:21, Julien Grall wrote:
> On 28/09/2020 11:57, Jan Beulich wrote:
>> evtchn_fifo_set_pending() (invoked with the per-channel lock held) has
>> two uses of the channel's priority field. The field gets updated by
>> evtchn_fifo_set_priority() with only the per-domain event_lock held,
>> i.e. the two reads may observe two different values. While the 2nd use
>> could - afaict - in principle be replaced by q->priority, I think
>> evtchn_set_priority() should acquire the per-channel lock in any event.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>
>> --- a/xen/common/event_channel.c
>> +++ b/xen/common/event_channel.c
>> @@ -1132,7 +1132,9 @@ static long evtchn_set_priority(const st
>>   {
>>       struct domain *d = current->domain;
>>       unsigned int port = set_priority->port;
>> +    struct evtchn *chn;
>>       long ret;
>> +    unsigned long flags;
>>   
>>       spin_lock(&d->event_lock);
> 
> Is it still necessary to hold d->event_lock?

Good point - I see no reason for it to be held anymore.
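
I.e. (rough sketch only of what the function might then shrink to):

    static long evtchn_set_priority(const struct evtchn_set_priority *set_priority)
    {
        struct domain *d = current->domain;
        unsigned int port = set_priority->port;
        struct evtchn *chn;
        long ret;
        unsigned long flags;

        if ( !port_is_valid(d, port) )
            return -EINVAL;

        chn = evtchn_from_port(d, port);
        spin_lock_irqsave(&chn->lock, flags);
        ret = evtchn_port_set_priority(d, chn, set_priority->priority);
        spin_unlock_irqrestore(&chn->lock, flags);

        return ret;
    }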

Jan

RE: [PATCH 04/12] evtchn: evtchn_set_priority() needs to acquire the per-channel lock
Posted by Paul Durrant 3 years, 6 months ago
> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 30 September 2020 07:42
> To: paul@xen.org
> Cc: xen-devel@lists.xenproject.org; 'Andrew Cooper' <andrew.cooper3@citrix.com>; 'George Dunlap'
> <George.Dunlap@eu.citrix.com>; 'Ian Jackson' <iwj@xenproject.org>; 'Julien Grall' <julien@xen.org>;
> 'Wei Liu' <wl@xen.org>; 'Stefano Stabellini' <sstabellini@kernel.org>
> Subject: Re: [PATCH 04/12] evtchn: evtchn_set_priority() needs to acquire the per-channel lock
> 
> On 29.09.2020 18:31, Paul Durrant wrote:
> >> -----Original Message-----
> >> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Jan Beulich
> >> Sent: 28 September 2020 11:58
> >> To: xen-devel@lists.xenproject.org
> >> Cc: Andrew Cooper <andrew.cooper3@citrix.com>; George Dunlap <George.Dunlap@eu.citrix.com>; Ian
> >> Jackson <iwj@xenproject.org>; Julien Grall <julien@xen.org>; Wei Liu <wl@xen.org>; Stefano
> Stabellini
> >> <sstabellini@kernel.org>
> >> Subject: [PATCH 04/12] evtchn: evtchn_set_priority() needs to acquire the per-channel lock
> >>
> >> evtchn_fifo_set_pending() (invoked with the per-channel lock held) has
> >> two uses of the channel's priority field.
> >
> > AFAICT it is invoked with only the sending end's lock held...
> >
> >> The field gets updated by
> >> evtchn_fifo_set_priority() with only the per-domain event_lock held,
> >> i.e. the two reads may observe two different values. While the 2nd use
> >> could - afaict - in principle be replaced by q->priority, I think
> >> evtchn_set_priority() should acquire the per-channel lock in any event.
> >>
> >
> > ... so how is this going to help?
> 
> I guess the reasoning needs to change here - it should focus solely
> on using the finer grained lock here (as holding the per-domain one
> doesn't help anyway). It would then be patch 10 which addresses the
> (FIFO-specific) concern of possibly reading inconsistent values.
> 

Yes, it looks like patch #10 should ensure consistency.

Prior to ad34d0656fc at least the first layer of calls done in evtchn_send() didn't take the evtchn itself as an arg. Of course, evtchn_set_pending() then looked up the evtchn and passed it to evtchn_port_set_pending() without any locking in the interdomain case. I wonder whether, to make reasoning easier, there ought to be a rule that ABI entry points are always called with the evtchn lock held?

  Paul

> Jan


Re: [PATCH 04/12] evtchn: evtchn_set_priority() needs to acquire the per-channel lock
Posted by Jan Beulich 3 years, 6 months ago
On 30.09.2020 09:31, Paul Durrant wrote:
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: 30 September 2020 07:42
>> To: paul@xen.org
>> Cc: xen-devel@lists.xenproject.org; 'Andrew Cooper' <andrew.cooper3@citrix.com>; 'George Dunlap'
>> <George.Dunlap@eu.citrix.com>; 'Ian Jackson' <iwj@xenproject.org>; 'Julien Grall' <julien@xen.org>;
>> 'Wei Liu' <wl@xen.org>; 'Stefano Stabellini' <sstabellini@kernel.org>
>> Subject: Re: [PATCH 04/12] evtchn: evtchn_set_priority() needs to acquire the per-channel lock
>>
>> On 29.09.2020 18:31, Paul Durrant wrote:
>>>> -----Original Message-----
>>>> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Jan Beulich
>>>> Sent: 28 September 2020 11:58
>>>> To: xen-devel@lists.xenproject.org
>>>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>; George Dunlap <George.Dunlap@eu.citrix.com>; Ian
>>>> Jackson <iwj@xenproject.org>; Julien Grall <julien@xen.org>; Wei Liu <wl@xen.org>; Stefano
>> Stabellini
>>>> <sstabellini@kernel.org>
>>>> Subject: [PATCH 04/12] evtchn: evtchn_set_priority() needs to acquire the per-channel lock
>>>>
>>>> evtchn_fifo_set_pending() (invoked with the per-channel lock held) has
>>>> two uses of the channel's priority field.
>>>
>>> AFAICT it is invoked with only the sending end's lock held...
>>>
>>>> The field gets updated by
>>>> evtchn_fifo_set_priority() with only the per-domain event_lock held,
>>>> i.e. the two reads may observe two different values. While the 2nd use
>>>> could - afaict - in principle be replaced by q->priority, I think
>>>> evtchn_set_priority() should acquire the per-channel lock in any event.
>>>>
>>>
>>> ... so how is this going to help?
>>
>> I guess the reasoning needs to change here - it should focus solely
>> on using the finer grained lock here (as holding the per-domain one
>> doesn't help anyway). It would then be patch 10 which addresses the
>> (FIFO-specific) concern of possibly reading inconsistent values.
>>
> 
> Yes, it looks like patch #10 should ensure consistency.
> 
> Prior to ad34d0656fc at least the first layer of calls done in evtchn_send() didn't take the evtchn itself as an arg. Of course, evtchn_set_pending() then looked up the evtchn and passed it to evtchn_port_set_pending() without any locking in the interdomain case. I wonder whether, to make reasoning easier, there ought to be a rule that ABI entry points are always called with the evtchn lock held?

What do you mean by "ABI entry points" here? To me this would sound
like what's directly accessible to guests, but that's hardly what
you mean. Do you perhaps mean the hooks in struct evtchn_port_ops?
As per the comment that got added there recently, the locking
unfortunately is less consistent there.

Jan

RE: [PATCH 04/12] evtchn: evtchn_set_priority() needs to acquire the per-channel lock
Posted by Paul Durrant 3 years, 6 months ago
> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 30 September 2020 09:32
> To: paul@xen.org
> Cc: xen-devel@lists.xenproject.org; 'Andrew Cooper' <andrew.cooper3@citrix.com>; 'George Dunlap'
> <George.Dunlap@eu.citrix.com>; 'Ian Jackson' <iwj@xenproject.org>; 'Julien Grall' <julien@xen.org>;
> 'Wei Liu' <wl@xen.org>; 'Stefano Stabellini' <sstabellini@kernel.org>
> Subject: Re: [PATCH 04/12] evtchn: evtchn_set_priority() needs to acquire the per-channel lock
> 
> On 30.09.2020 09:31, Paul Durrant wrote:
> >> -----Original Message-----
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: 30 September 2020 07:42
> >> To: paul@xen.org
> >> Cc: xen-devel@lists.xenproject.org; 'Andrew Cooper' <andrew.cooper3@citrix.com>; 'George Dunlap'
> >> <George.Dunlap@eu.citrix.com>; 'Ian Jackson' <iwj@xenproject.org>; 'Julien Grall' <julien@xen.org>;
> >> 'Wei Liu' <wl@xen.org>; 'Stefano Stabellini' <sstabellini@kernel.org>
> >> Subject: Re: [PATCH 04/12] evtchn: evtchn_set_priority() needs to acquire the per-channel lock
> >>
> >> On 29.09.2020 18:31, Paul Durrant wrote:
> >>>> -----Original Message-----
> >>>> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Jan Beulich
> >>>> Sent: 28 September 2020 11:58
> >>>> To: xen-devel@lists.xenproject.org
> >>>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>; George Dunlap <George.Dunlap@eu.citrix.com>; Ian
> >>>> Jackson <iwj@xenproject.org>; Julien Grall <julien@xen.org>; Wei Liu <wl@xen.org>; Stefano
> >> Stabellini
> >>>> <sstabellini@kernel.org>
> >>>> Subject: [PATCH 04/12] evtchn: evtchn_set_priority() needs to acquire the per-channel lock
> >>>>
> >>>> evtchn_fifo_set_pending() (invoked with the per-channel lock held) has
> >>>> two uses of the channel's priority field.
> >>>
> >>> AFAICT it is invoked with only the sending end's lock held...
> >>>
> >>>> The field gets updated by
> >>>> evtchn_fifo_set_priority() with only the per-domain event_lock held,
> >>>> i.e. the two reads may observe two different values. While the 2nd use
> >>>> could - afaict - in principle be replaced by q->priority, I think
> >>>> evtchn_set_priority() should acquire the per-channel lock in any event.
> >>>>
> >>>
> >>> ... so how is this going to help?
> >>
> >> I guess the reasoning needs to change here - it should focus solely
> >> on using the finer grained lock here (as holding the per-domain one
> >> doesn't help anyway). It would then be patch 10 which addresses the
> >> (FIFO-specific) concern of possibly reading inconsistent values.
> >>
> >
> > Yes, it looks like patch #10 should ensure consistency.
> >
> > Prior to ad34d0656fc at least the first layer of calls done in evtchn_send() didn't take the evtchn
> itself as an arg. Of course, evtchn_set_pending() then looked up the evtchn and passed it to
> evtchn_port_set_pending() without any locking in the interdomain case. I wonder whether, to make
> reasoning easier, there ought to be a rule that ABI entry points are always called with the evtchn
> lock held?
> 
> What do you mean by "ABI entry points" here? To me this would sound
> like what's directly accessible to guests, but that's hardly what
> you mean. Do you perhaps mean the hooks in struct evtchn_port_ops?

Yes, by ABI I mean 'fifo' or '2l'. (I guess that 'ABI' is just the name I chose to refer to them in the Windows PV driver code).

> As per the comment that got added there recently, the locking
> unfortunately is less consistent there.
> 

It looked to me that most functions were entered with the channel lock held, so I wondered whether it could be a rule.

  Paul

> Jan


Re: [PATCH 04/12] evtchn: evtchn_set_priority() needs to acquire the per-channel lock
Posted by Jan Beulich 3 years, 6 months ago
On 30.09.2020 10:36, Paul Durrant wrote:
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: 30 September 2020 09:32
>>
>> On 30.09.2020 09:31, Paul Durrant wrote:
>>>> From: Jan Beulich <jbeulich@suse.com>
>>>> Sent: 30 September 2020 07:42
>>>>
>>>> On 29.09.2020 18:31, Paul Durrant wrote:
>>>>>> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Jan Beulich
>>>>>> Sent: 28 September 2020 11:58
>>>>>>
>>>>>> evtchn_fifo_set_pending() (invoked with the per-channel lock held) has
>>>>>> two uses of the channel's priority field.
>>>>>
>>>>> AFAICT it is invoked with only the sending end's lock held...
>>>>>
>>>>>> The field gets updated by
>>>>>> evtchn_fifo_set_priority() with only the per-domain event_lock held,
>>>>>> i.e. the two reads may observe two different values. While the 2nd use
>>>>>> could - afaict - in principle be replaced by q->priority, I think
>>>>>> evtchn_set_priority() should acquire the per-channel lock in any event.
>>>>>>
>>>>>
>>>>> ... so how is this going to help?
>>>>
>>>> I guess the reasoning needs to change here - it should focus solely
>>>> on using the finer grained lock here (as holding the per-domain one
>>>> doesn't help anyway). It would then be patch 10 which addresses the
>>>> (FIFO-specific) concern of possibly reading inconsistent values.
>>>>
>>>
>>> Yes, it looks like patch #10 should ensure consistency.
>>>
>>> Prior to ad34d0656fc at least the first layer of calls done in evtchn_send() didn't take the evtchn
>> itself as an arg. Of course, evtchn_set_pending() then looked up the evtchn and passed it to
>> evtchn_port_set_pending() without any locking in the interdomain case. I wonder whether, to make
>> reasoning easier, there ought to be a rule that ABI entry points are always called with the evtchn
>> lock held?
>>
>> What do you mean by "ABI entry points" here? To me this would sound
>> like what's directly accessible to guests, but that's hardly what
>> you mean. Do you perhaps mean the hooks in struct evtchn_port_ops?
> 
> Yes, by ABI I mean 'fifo' or '2l'. (I guess that 'ABI' is just the name I chose to refer to them in the Windows PV driver code).
> 
>> As per the comment that got added there recently, the locking
>> unfortunately is less consistent there.
>>
> 
> It looked to me that most functions were entered with the channel lock
> held, so I wondered whether it could be a rule.

Well - especially for the sending paths it may be _a_ per-channel lock,
not _the_ one. While putting together the XSA fixes I had looked some at
the possibility of having a simple rule here, but acquiring _the_ lock
on the interdomain sending path looked to be complicating this path
quite a bit, when it specifically should be as lightweight as possible.

Jan

RE: [PATCH 04/12] evtchn: evtchn_set_priority() needs to acquire the per-channel lock
Posted by Paul Durrant 3 years, 6 months ago
> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Jan Beulich
> Sent: 28 September 2020 11:58
> To: xen-devel@lists.xenproject.org
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>; George Dunlap <George.Dunlap@eu.citrix.com>; Ian
> Jackson <iwj@xenproject.org>; Julien Grall <julien@xen.org>; Wei Liu <wl@xen.org>; Stefano Stabellini
> <sstabellini@kernel.org>
> Subject: [PATCH 04/12] evtchn: evtchn_set_priority() needs to acquire the per-channel lock
> 
> evtchn_fifo_set_pending() (invoked with the per-channel lock held) has
> two uses of the channel's priority field.

AFAICT it is invoked with only the sending end's lock held...

> The field gets updated by
> evtchn_fifo_set_priority() with only the per-domain event_lock held,
> i.e. the two reads may observe two different values. While the 2nd use
> could - afaict - in principle be replaced by q->priority, I think
> evtchn_set_priority() should acquire the per-channel lock in any event.
> 

... so how is this going to help?

  Paul

> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> --- a/xen/common/event_channel.c
> +++ b/xen/common/event_channel.c
> @@ -1132,7 +1132,9 @@ static long evtchn_set_priority(const st
>  {
>      struct domain *d = current->domain;
>      unsigned int port = set_priority->port;
> +    struct evtchn *chn;
>      long ret;
> +    unsigned long flags;
> 
>      spin_lock(&d->event_lock);
> 
> @@ -1142,8 +1144,10 @@ static long evtchn_set_priority(const st
>          return -EINVAL;
>      }
> 
> -    ret = evtchn_port_set_priority(d, evtchn_from_port(d, port),
> -                                   set_priority->priority);
> +    chn = evtchn_from_port(d, port);
> +    spin_lock_irqsave(&chn->lock, flags);
> +    ret = evtchn_port_set_priority(d, chn, set_priority->priority);
> +    spin_unlock_irqrestore(&chn->lock, flags);
> 
>      spin_unlock(&d->event_lock);
> 
> 



Re: [PATCH 04/12] evtchn: evtchn_set_priority() needs to acquire the per-channel lock
Posted by Jan Beulich 3 years, 6 months ago
On 29.09.2020 18:31, Paul Durrant wrote:
>> -----Original Message-----
>> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Jan Beulich
>> Sent: 28 September 2020 11:58
>> To: xen-devel@lists.xenproject.org
>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>; George Dunlap <George.Dunlap@eu.citrix.com>; Ian
>> Jackson <iwj@xenproject.org>; Julien Grall <julien@xen.org>; Wei Liu <wl@xen.org>; Stefano Stabellini
>> <sstabellini@kernel.org>
>> Subject: [PATCH 04/12] evtchn: evtchn_set_priority() needs to acquire the per-channel lock
>>
>> evtchn_fifo_set_pending() (invoked with the per-channel lock held) has
>> two uses of the channel's priority field.
> 
> AFAICT it is invoked with only the sending end's lock held...

In the interdomain case you mean?

>> The field gets updated by
>> evtchn_fifo_set_priority() with only the per-domain event_lock held,
>> i.e. the two reads may observe two different values. While the 2nd use
>> could - afaict - in principle be replaced by q->priority, I think
>> evtchn_set_priority() should acquire the per-channel lock in any event.
> 
> ... so how is this going to help?

Indeed, this will require more thought then.

Jan

Re: [PATCH 04/12] evtchn: evtchn_set_priority() needs to acquire the per-channel lock
Posted by Jan Beulich 3 years, 6 months ago
On 29.09.2020 18:31, Paul Durrant wrote:
>> -----Original Message-----
>> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Jan Beulich
>> Sent: 28 September 2020 11:58
>> To: xen-devel@lists.xenproject.org
>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>; George Dunlap <George.Dunlap@eu.citrix.com>; Ian
>> Jackson <iwj@xenproject.org>; Julien Grall <julien@xen.org>; Wei Liu <wl@xen.org>; Stefano Stabellini
>> <sstabellini@kernel.org>
>> Subject: [PATCH 04/12] evtchn: evtchn_set_priority() needs to acquire the per-channel lock
>>
>> evtchn_fifo_set_pending() (invoked with the per-channel lock held) has
>> two uses of the channel's priority field.
> 
> AFAICT it is invoked with only the sending end's lock held...
> 
>> The field gets updated by
>> evtchn_fifo_set_priority() with only the per-domain event_lock held,
>> i.e. the two reads may observe two different values. While the 2nd use
>> could - afaict - in principle be replaced by q->priority, I think
>> evtchn_set_priority() should acquire the per-channel lock in any event.
>>
> 
> ... so how is this going to help?

I guess the reasoning needs to change here - it should focus solely
on using the finer grained lock here (as holding the per-domain one
doesn't help anyway). It would then be patch 10 which addresses the
(FIFO-specific) concern of possibly reading inconsistent values.

Jan

[PATCH 05/12] evtchn/sched: reject poll requests for unusable ports
Posted by Jan Beulich 3 years, 6 months ago
Before and after XSA-342 there has been an asymmetry in how not really
usable ports get treated in do_poll(): Ones beyond a certain boundary
(max_evtchns originally, valid_evtchns subsequently) did get refused
with -EINVAL, while lower ones were accepted despite there potentially
being no way to wake the vCPU again from its polling state. Arrange to
also honor evtchn_usable() output in the decision.

Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/common/sched/core.c
+++ b/xen/common/sched/core.c
@@ -1427,13 +1427,13 @@ static long do_poll(struct sched_poll *s
         if ( __copy_from_guest_offset(&port, sched_poll->ports, i, 1) )
             goto out;
 
-        rc = -EINVAL;
-        if ( !port_is_valid(d, port) )
-            goto out;
-
-        rc = 0;
-        if ( evtchn_port_is_pending(d, port) )
+        rc = evtchn_port_poll(d, port);
+        if ( rc )
+        {
+            if ( rc > 0 )
+                rc = 0;
             goto out;
+        }
     }
 
     if ( sched_poll->nr_ports == 1 )
--- a/xen/include/xen/event.h
+++ b/xen/include/xen/event.h
@@ -240,19 +240,6 @@ static inline bool evtchn_is_pending(con
     return evtchn_usable(evtchn) && d->evtchn_port_ops->is_pending(d, evtchn);
 }
 
-static inline bool evtchn_port_is_pending(struct domain *d, evtchn_port_t port)
-{
-    struct evtchn *evtchn = evtchn_from_port(d, port);
-    bool rc;
-    unsigned long flags;
-
-    spin_lock_irqsave(&evtchn->lock, flags);
-    rc = evtchn_is_pending(d, evtchn);
-    spin_unlock_irqrestore(&evtchn->lock, flags);
-
-    return rc;
-}
-
 static inline bool evtchn_is_masked(const struct domain *d,
                                     const struct evtchn *evtchn)
 {
@@ -279,6 +266,24 @@ static inline bool evtchn_is_busy(const
            d->evtchn_port_ops->is_busy(d, evtchn);
 }
 
+static inline int evtchn_port_poll(struct domain *d, evtchn_port_t port)
+{
+    int rc = -EINVAL;
+
+    if ( port_is_valid(d, port) )
+    {
+        struct evtchn *evtchn = evtchn_from_port(d, port);
+        unsigned long flags;
+
+        spin_lock_irqsave(&evtchn->lock, flags);
+        if ( evtchn_usable(evtchn) )
+            rc = evtchn_is_pending(d, evtchn);
+        spin_unlock_irqrestore(&evtchn->lock, flags);
+    }
+
+    return rc;
+}
+
 static inline int evtchn_port_set_priority(struct domain *d,
                                            struct evtchn *evtchn,
                                            unsigned int priority)


Re: [PATCH 05/12] evtchn/sched: reject poll requests for unusable ports
Posted by Julien Grall 3 years, 6 months ago
Hi Jan,

On 28/09/2020 11:58, Jan Beulich wrote:
> Before and after XSA-342 there has been an asymmetry in how not really
> usable ports get treated in do_poll(): Ones beyond a certain boundary
> (max_evtchns originally, valid_evtchns subsequently) did get refused
> with -EINVAL, while lower ones were accepted despite there potentially
> being no way to wake the vCPU again from its polling state. Arrange to
> also honor evtchn_usable() output in the decision.
> 
> Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

With one request (see below):

Reviewed-by: Julien Grall <jgrall@amazon.com>

> 
> --- a/xen/common/sched/core.c
> +++ b/xen/common/sched/core.c
> @@ -1427,13 +1427,13 @@ static long do_poll(struct sched_poll *s
>           if ( __copy_from_guest_offset(&port, sched_poll->ports, i, 1) )
>               goto out;
>   
> -        rc = -EINVAL;
> -        if ( !port_is_valid(d, port) )
> -            goto out;
> -
> -        rc = 0;
> -        if ( evtchn_port_is_pending(d, port) )
> +        rc = evtchn_port_poll(d, port);
> +        if ( rc )
> +        {
> +            if ( rc > 0 )
> +                rc = 0;

This check wasn't obvious to me until I spent some times understanding 
how evtchn_port_poll() is working. I think it would be worth to...

>               goto out;
> +        }
>       }
>   
>       if ( sched_poll->nr_ports == 1 )
> --- a/xen/include/xen/event.h
> +++ b/xen/include/xen/event.h
> @@ -240,19 +240,6 @@ static inline bool evtchn_is_pending(con
>       return evtchn_usable(evtchn) && d->evtchn_port_ops->is_pending(d, evtchn);
>   }
>   
> -static inline bool evtchn_port_is_pending(struct domain *d, evtchn_port_t port)
> -{
> -    struct evtchn *evtchn = evtchn_from_port(d, port);
> -    bool rc;
> -    unsigned long flags;
> -
> -    spin_lock_irqsave(&evtchn->lock, flags);
> -    rc = evtchn_is_pending(d, evtchn);
> -    spin_unlock_irqrestore(&evtchn->lock, flags);
> -
> -    return rc;
> -}
> -
>   static inline bool evtchn_is_masked(const struct domain *d,
>                                       const struct evtchn *evtchn)
>   {
> @@ -279,6 +266,24 @@ static inline bool evtchn_is_busy(const
>              d->evtchn_port_ops->is_busy(d, evtchn);
>   }
>   
> +static inline int evtchn_port_poll(struct domain *d, evtchn_port_t port)

... add a comment on top of evtchn_port_poll() explaining that it will 
return a value < 0 if the port is not valid and > 0 if the port is pending
(see the suggested wording after the quoted hunk).

> +{
> +    int rc = -EINVAL;
> +
> +    if ( port_is_valid(d, port) )
> +    {
> +        struct evtchn *evtchn = evtchn_from_port(d, port);
> +        unsigned long flags;
> +
> +        spin_lock_irqsave(&evtchn->lock, flags);
> +        if ( evtchn_usable(evtchn) )
> +            rc = evtchn_is_pending(d, evtchn);
> +        spin_unlock_irqrestore(&evtchn->lock, flags);
> +    }
> +
> +    return rc;
> +}
> +
>   static inline int evtchn_port_set_priority(struct domain *d,
>                                              struct evtchn *evtchn,
>                                              unsigned int priority)
> 
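
Something along these lines maybe, to go on top of evtchn_port_poll()
(wording just a suggestion):

    /*
     * Returns a negative error code if the port is invalid or currently
     * not usable, a positive value if a usable port has an event pending,
     * and 0 if a usable port has no event pending.
     */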

Cheers,

-- 
Julien Grall

Re: [PATCH 05/12] evtchn/sched: reject poll requests for unusable ports
Posted by Dario Faggioli 3 years, 6 months ago
On Mon, 2020-09-28 at 12:58 +0200, Jan Beulich wrote:
> Before and after XSA-342 there has been an asymmetry in how not really
> usable ports get treated in do_poll(): Ones beyond a certain boundary
> (max_evtchns originally, valid_evtchns subsequently) did get refused
> with -EINVAL, while lower ones were accepted despite there potentially
> being no way to wake the vCPU again from its polling state. Arrange to
> also honor evtchn_usable() output in the decision.
> 
> Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>

I agree with Julien that a comment about how evtchn_port_poll() works
would improve things even further.

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)

RE: [PATCH 05/12] evtchn/sched: reject poll requests for unusable ports
Posted by Paul Durrant 3 years, 6 months ago
> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Jan Beulich
> Sent: 28 September 2020 11:59
> To: xen-devel@lists.xenproject.org
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>; George Dunlap <George.Dunlap@eu.citrix.com>; Ian
> Jackson <iwj@xenproject.org>; Julien Grall <julien@xen.org>; Wei Liu <wl@xen.org>; Stefano Stabellini
> <sstabellini@kernel.org>; Dario Faggioli <dfaggioli@suse.com>
> Subject: [PATCH 05/12] evtchn/sched: reject poll requests for unusable ports
> 
> Before and after XSA-342 there has been an asymmetry in how not really
> usable ports get treated in do_poll(): Ones beyond a certain boundary
> (max_evtchns originally, valid_evtchns subsequently) did get refused
> with -EINVAL, while lower ones were accepted despite there potentially
> being no way to wake the vCPU again from its polling state. Arrange to
> also honor evtchn_usable() output in the decision.
> 
> Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Paul Durrant <paul@xen.org>

> 
> --- a/xen/common/sched/core.c
> +++ b/xen/common/sched/core.c
> @@ -1427,13 +1427,13 @@ static long do_poll(struct sched_poll *s
>          if ( __copy_from_guest_offset(&port, sched_poll->ports, i, 1) )
>              goto out;
> 
> -        rc = -EINVAL;
> -        if ( !port_is_valid(d, port) )
> -            goto out;
> -
> -        rc = 0;
> -        if ( evtchn_port_is_pending(d, port) )
> +        rc = evtchn_port_poll(d, port);
> +        if ( rc )
> +        {
> +            if ( rc > 0 )
> +                rc = 0;
>              goto out;
> +        }
>      }
> 
>      if ( sched_poll->nr_ports == 1 )
> --- a/xen/include/xen/event.h
> +++ b/xen/include/xen/event.h
> @@ -240,19 +240,6 @@ static inline bool evtchn_is_pending(con
>      return evtchn_usable(evtchn) && d->evtchn_port_ops->is_pending(d, evtchn);
>  }
> 
> -static inline bool evtchn_port_is_pending(struct domain *d, evtchn_port_t port)
> -{
> -    struct evtchn *evtchn = evtchn_from_port(d, port);
> -    bool rc;
> -    unsigned long flags;
> -
> -    spin_lock_irqsave(&evtchn->lock, flags);
> -    rc = evtchn_is_pending(d, evtchn);
> -    spin_unlock_irqrestore(&evtchn->lock, flags);
> -
> -    return rc;
> -}
> -
>  static inline bool evtchn_is_masked(const struct domain *d,
>                                      const struct evtchn *evtchn)
>  {
> @@ -279,6 +266,24 @@ static inline bool evtchn_is_busy(const
>             d->evtchn_port_ops->is_busy(d, evtchn);
>  }
> 
> +static inline int evtchn_port_poll(struct domain *d, evtchn_port_t port)
> +{
> +    int rc = -EINVAL;
> +
> +    if ( port_is_valid(d, port) )
> +    {
> +        struct evtchn *evtchn = evtchn_from_port(d, port);
> +        unsigned long flags;
> +
> +        spin_lock_irqsave(&evtchn->lock, flags);
> +        if ( evtchn_usable(evtchn) )
> +            rc = evtchn_is_pending(d, evtchn);
> +        spin_unlock_irqrestore(&evtchn->lock, flags);
> +    }
> +
> +    return rc;
> +}
> +
>  static inline int evtchn_port_set_priority(struct domain *d,
>                                             struct evtchn *evtchn,
>                                             unsigned int priority)
> 



[PATCH 06/12] evtchn: don't bypass unlinking pIRQ when closing port
Posted by Jan Beulich 3 years, 6 months ago
There's no other path causing a terminal unlink_pirq_port() to be called
(evtchn_bind_vcpu() relinks it right away) and hence _if_ pirq can
indeed be NULL when closing the port, list corruption would occur when
bypassing the unlink (unless the structure never gets linked again). As
we can't come here after evtchn_destroy() anymore, (late) domain
destruction also isn't a reason for a possible exception, and hence the
only alternative looks to be that the check was pointless in the first
place. While I haven't observed the case, from code inspection I'm far
from sure I can exclude this being possible, so it feels more safe to
re-arrange the code instead.

Fixes: c24536b636f2 ("replace d->nr_pirqs sized arrays with radix tree")
Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/common/event_channel.c
+++ b/xen/common/event_channel.c
@@ -615,17 +615,18 @@ int evtchn_close(struct domain *d1, int
     case ECS_PIRQ: {
         struct pirq *pirq = pirq_info(d1, chn1->u.pirq.irq);
 
-        if ( !pirq )
-            break;
-        if ( !is_hvm_domain(d1) )
-            pirq_guest_unbind(d1, pirq);
-        pirq->evtchn = 0;
-        pirq_cleanup_check(pirq, d1);
-        unlink_pirq_port(chn1, d1->vcpu[chn1->notify_vcpu_id]);
+        if ( pirq )
+        {
+            if ( !is_hvm_domain(d1) )
+                pirq_guest_unbind(d1, pirq);
+            pirq->evtchn = 0;
+            pirq_cleanup_check(pirq, d1);
 #ifdef CONFIG_X86
-        if ( is_hvm_domain(d1) && domain_pirq_to_irq(d1, pirq->pirq) > 0 )
-            unmap_domain_pirq_emuirq(d1, pirq->pirq);
+            if ( is_hvm_domain(d1) && domain_pirq_to_irq(d1, pirq->pirq) > 0 )
+                unmap_domain_pirq_emuirq(d1, pirq->pirq);
 #endif
+        }
+        unlink_pirq_port(chn1, d1->vcpu[chn1->notify_vcpu_id]);
         break;
     }
 


RE: [PATCH 06/12] evtchn: don't bypass unlinking pIRQ when closing port
Posted by Paul Durrant 3 years, 6 months ago
> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Jan Beulich
> Sent: 28 September 2020 11:59
> To: xen-devel@lists.xenproject.org
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>; George Dunlap <George.Dunlap@eu.citrix.com>; Ian
> Jackson <iwj@xenproject.org>; Julien Grall <julien@xen.org>; Wei Liu <wl@xen.org>; Stefano Stabellini
> <sstabellini@kernel.org>
> Subject: [PATCH 06/12] evtchn: don't bypass unlinking pIRQ when closing port
> 
> There's no other path causing a terminal unlink_pirq_port() to be called
> (evtchn_bind_vcpu() relinks it right away) and hence _if_ pirq can
> indeed be NULL when closing the port, list corruption would occur when
> bypassing the unlink (unless the structure never gets linked again). As
> we can't come here after evtchn_destroy() anymore, (late) domain
> destruction also isn't a reason for a possible exception, and hence the
> only alternative looks to be that the check was pointless in the first
> place. While I haven't observed the case, from code inspection I'm far
> from sure I can exclude this being possible, so it feels more safe to
> re-arrange the code instead.
> 
> Fixes: c24536b636f2 ("replace d->nr_pirqs sized arrays with radix tree")
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Paul Durrant <paul@xen.org>

> 
> --- a/xen/common/event_channel.c
> +++ b/xen/common/event_channel.c
> @@ -615,17 +615,18 @@ int evtchn_close(struct domain *d1, int
>      case ECS_PIRQ: {
>          struct pirq *pirq = pirq_info(d1, chn1->u.pirq.irq);
> 
> -        if ( !pirq )
> -            break;
> -        if ( !is_hvm_domain(d1) )
> -            pirq_guest_unbind(d1, pirq);
> -        pirq->evtchn = 0;
> -        pirq_cleanup_check(pirq, d1);
> -        unlink_pirq_port(chn1, d1->vcpu[chn1->notify_vcpu_id]);
> +        if ( pirq )
> +        {
> +            if ( !is_hvm_domain(d1) )
> +                pirq_guest_unbind(d1, pirq);
> +            pirq->evtchn = 0;
> +            pirq_cleanup_check(pirq, d1);
>  #ifdef CONFIG_X86
> -        if ( is_hvm_domain(d1) && domain_pirq_to_irq(d1, pirq->pirq) > 0 )
> -            unmap_domain_pirq_emuirq(d1, pirq->pirq);
> +            if ( is_hvm_domain(d1) && domain_pirq_to_irq(d1, pirq->pirq) > 0 )
> +                unmap_domain_pirq_emuirq(d1, pirq->pirq);
>  #endif
> +        }
> +        unlink_pirq_port(chn1, d1->vcpu[chn1->notify_vcpu_id]);
>          break;
>      }
> 
> 



[PATCH 07/12] evtchn: cut short evtchn_reset()'s loop in the common case
Posted by Jan Beulich 3 years, 6 months ago
The general expectation is that there are only a few open ports left
when a domain asks its event channel configuration to be reset.
Similarly on average half a bucket worth of event channels can be
expected to be inactive. Try to avoid iterating over all channels, by
utilizing usage data we're maintaining anyway.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/common/event_channel.c
+++ b/xen/common/event_channel.c
@@ -232,7 +232,11 @@ void evtchn_free(struct domain *d, struc
     evtchn_port_clear_pending(d, chn);
 
     if ( consumer_is_xen(chn) )
+    {
         write_atomic(&d->xen_evtchns, d->xen_evtchns - 1);
+        /* Decrement ->xen_evtchns /before/ ->active_evtchns. */
+        smp_wmb();
+    }
     write_atomic(&d->active_evtchns, d->active_evtchns - 1);
 
     /* Reset binding to vcpu0 when the channel is freed. */
@@ -1073,6 +1077,19 @@ int evtchn_unmask(unsigned int port)
     return 0;
 }
 
+static bool has_active_evtchns(const struct domain *d)
+{
+    unsigned int xen = read_atomic(&d->xen_evtchns);
+
+    /*
+     * Read ->xen_evtchns /before/ active_evtchns, to prevent
+     * evtchn_reset() exiting its loop early.
+     */
+    smp_rmb();
+
+    return read_atomic(&d->active_evtchns) > xen;
+}
+
 int evtchn_reset(struct domain *d, bool resuming)
 {
     unsigned int i;
@@ -1097,7 +1114,7 @@ int evtchn_reset(struct domain *d, bool
     if ( !i )
         return -EBUSY;
 
-    for ( ; port_is_valid(d, i); i++ )
+    for ( ; port_is_valid(d, i) && has_active_evtchns(d); i++ )
     {
         evtchn_close(d, i, 1);
 
@@ -1340,6 +1357,10 @@ int alloc_unbound_xen_event_channel(
 
     spin_unlock_irqrestore(&chn->lock, flags);
 
+    /*
+     * Increment ->xen_evtchns /after/ ->active_evtchns. No explicit
+     * barrier needed due to spin-locked region just above.
+     */
     write_atomic(&ld->xen_evtchns, ld->xen_evtchns + 1);
 
  out:


RE: [PATCH 07/12] evtchn: cut short evtchn_reset()'s loop in the common case
Posted by Paul Durrant 3 years, 6 months ago
> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Jan Beulich
> Sent: 28 September 2020 12:00
> To: xen-devel@lists.xenproject.org
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>; George Dunlap <George.Dunlap@eu.citrix.com>; Ian
> Jackson <iwj@xenproject.org>; Julien Grall <julien@xen.org>; Wei Liu <wl@xen.org>; Stefano Stabellini
> <sstabellini@kernel.org>
> Subject: [PATCH 07/12] evtchn: cut short evtchn_reset()'s loop in the common case
> 
> The general expectation is that there are only a few open ports left
> when a domain asks its event channel configuration to be reset.
> Similarly on average half a bucket worth of event channels can be
> expected to be inactive. Try to avoid iterating over all channels, by
> utilizing usage data we're maintaining anyway.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Paul Durrant <paul@xen.org>

> 
> --- a/xen/common/event_channel.c
> +++ b/xen/common/event_channel.c
> @@ -232,7 +232,11 @@ void evtchn_free(struct domain *d, struc
>      evtchn_port_clear_pending(d, chn);
> 
>      if ( consumer_is_xen(chn) )
> +    {
>          write_atomic(&d->xen_evtchns, d->xen_evtchns - 1);
> +        /* Decrement ->xen_evtchns /before/ ->active_evtchns. */
> +        smp_wmb();
> +    }
>      write_atomic(&d->active_evtchns, d->active_evtchns - 1);
> 
>      /* Reset binding to vcpu0 when the channel is freed. */
> @@ -1073,6 +1077,19 @@ int evtchn_unmask(unsigned int port)
>      return 0;
>  }
> 
> +static bool has_active_evtchns(const struct domain *d)
> +{
> +    unsigned int xen = read_atomic(&d->xen_evtchns);
> +
> +    /*
> +     * Read ->xen_evtchns /before/ active_evtchns, to prevent
> +     * evtchn_reset() exiting its loop early.
> +     */
> +    smp_rmb();
> +
> +    return read_atomic(&d->active_evtchns) > xen;
> +}
> +
>  int evtchn_reset(struct domain *d, bool resuming)
>  {
>      unsigned int i;
> @@ -1097,7 +1114,7 @@ int evtchn_reset(struct domain *d, bool
>      if ( !i )
>          return -EBUSY;
> 
> -    for ( ; port_is_valid(d, i); i++ )
> +    for ( ; port_is_valid(d, i) && has_active_evtchns(d); i++ )
>      {
>          evtchn_close(d, i, 1);
> 
> @@ -1340,6 +1357,10 @@ int alloc_unbound_xen_event_channel(
> 
>      spin_unlock_irqrestore(&chn->lock, flags);
> 
> +    /*
> +     * Increment ->xen_evtchns /after/ ->active_evtchns. No explicit
> +     * barrier needed due to spin-locked region just above.
> +     */
>      write_atomic(&ld->xen_evtchns, ld->xen_evtchns + 1);
> 
>   out:
> 



Re: [PATCH 07/12] evtchn: cut short evtchn_reset()'s loop in the common case
Posted by Julien Grall 3 years, 5 months ago
Hi,

On 28/09/2020 12:00, Jan Beulich wrote:
> The general expectation is that there are only a few open ports left
> when a domain asks its event channel configuration to be reset.
> Similarly on average half a bucket worth of event channels can be
> expected to be inactive. Try to avoid iterating over all channels, by
> utilizing usage data we're maintaining anyway.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Julien Grall <jgrall@amazon.com>

Cheers,

-- 
Julien Grall

[PATCH 08/12] evtchn: ECS_CLOSED => ECS_FREE
Posted by Jan Beulich 3 years, 6 months ago
There's no ECS_CLOSED; correct a comment naming it.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/common/event_channel.c
+++ b/xen/common/event_channel.c
@@ -673,7 +673,7 @@ int evtchn_close(struct domain *d1, int
              * We can only get here if the port was closed and re-bound after
              * unlocking d1 but before locking d2 above. We could retry but
              * it is easier to return the same error as if we had seen the
-             * port in ECS_CLOSED. It must have passed through that state for
+             * port in ECS_FREE. It must have passed through that state for
              * us to end up here, so it's a valid error to return.
              */
             rc = -EINVAL;


RE: [PATCH 08/12] evtchn: ECS_CLOSED => ECS_FREE
Posted by Paul Durrant 3 years, 6 months ago
> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Jan Beulich
> Sent: 28 September 2020 12:00
> To: xen-devel@lists.xenproject.org
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>; George Dunlap <George.Dunlap@eu.citrix.com>; Ian
> Jackson <iwj@xenproject.org>; Julien Grall <julien@xen.org>; Wei Liu <wl@xen.org>; Stefano Stabellini
> <sstabellini@kernel.org>
> Subject: [PATCH 08/12] evtchn: ECS_CLOSED => ECS_FREE
> 
> There's no ECS_CLOSED; correct a comment naming it.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 

Reviewed-by: Paul Durrant <paul@xen.org>

> --- a/xen/common/event_channel.c
> +++ b/xen/common/event_channel.c
> @@ -673,7 +673,7 @@ int evtchn_close(struct domain *d1, int
>               * We can only get here if the port was closed and re-bound after
>               * unlocking d1 but before locking d2 above. We could retry but
>               * it is easier to return the same error as if we had seen the
> -             * port in ECS_CLOSED. It must have passed through that state for
> +             * port in ECS_FREE. It must have passed through that state for
>               * us to end up here, so it's a valid error to return.
>               */
>              rc = -EINVAL;
> 



Re: [PATCH 08/12] evtchn: ECS_CLOSED => ECS_FREE
Posted by Julien Grall 3 years, 6 months ago
Hi,

On 28/09/2020 12:00, Jan Beulich wrote:
> There's no ECS_CLOSED; correct a comment naming it.
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Julien Grall <jgrall@amazon.com>

Cheers,

-- 
Julien Grall

[PATCH 09/12] evtchn: move FIFO-private struct declarations
Posted by Jan Beulich 3 years, 6 months ago
There's no need to expose them.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I wonder whether we shouldn't do away with event_fifo.h altogether.

--- a/xen/common/event_fifo.c
+++ b/xen/common/event_fifo.c
@@ -21,6 +21,27 @@
 
 #include <public/event_channel.h>
 
+struct evtchn_fifo_queue {
+    uint32_t *head; /* points into control block */
+    uint32_t tail;
+    uint8_t priority;
+    spinlock_t lock;
+};
+
+struct evtchn_fifo_vcpu {
+    struct evtchn_fifo_control_block *control_block;
+    struct evtchn_fifo_queue queue[EVTCHN_FIFO_MAX_QUEUES];
+};
+
+#define EVTCHN_FIFO_EVENT_WORDS_PER_PAGE (PAGE_SIZE / sizeof(event_word_t))
+#define EVTCHN_FIFO_MAX_EVENT_ARRAY_PAGES \
+    (EVTCHN_FIFO_NR_CHANNELS / EVTCHN_FIFO_EVENT_WORDS_PER_PAGE)
+
+struct evtchn_fifo_domain {
+    event_word_t *event_array[EVTCHN_FIFO_MAX_EVENT_ARRAY_PAGES];
+    unsigned int num_evtchns;
+};
+
 static inline event_word_t *evtchn_fifo_word_from_port(const struct domain *d,
                                                        unsigned int port)
 {
--- a/xen/include/xen/event_fifo.h
+++ b/xen/include/xen/event_fifo.h
@@ -9,27 +9,6 @@
 #ifndef __XEN_EVENT_FIFO_H__
 #define __XEN_EVENT_FIFO_H__
 
-struct evtchn_fifo_queue {
-    uint32_t *head; /* points into control block */
-    uint32_t tail;
-    uint8_t priority;
-    spinlock_t lock;
-};
-
-struct evtchn_fifo_vcpu {
-    struct evtchn_fifo_control_block *control_block;
-    struct evtchn_fifo_queue queue[EVTCHN_FIFO_MAX_QUEUES];
-};
-
-#define EVTCHN_FIFO_EVENT_WORDS_PER_PAGE (PAGE_SIZE / sizeof(event_word_t))
-#define EVTCHN_FIFO_MAX_EVENT_ARRAY_PAGES \
-    (EVTCHN_FIFO_NR_CHANNELS / EVTCHN_FIFO_EVENT_WORDS_PER_PAGE)
-
-struct evtchn_fifo_domain {
-    event_word_t *event_array[EVTCHN_FIFO_MAX_EVENT_ARRAY_PAGES];
-    unsigned int num_evtchns;
-};
-
 int evtchn_fifo_init_control(struct evtchn_init_control *init_control);
 int evtchn_fifo_expand_array(const struct evtchn_expand_array *expand_array);
 void evtchn_fifo_destroy(struct domain *domain);


Re: [PATCH 09/12] evtchn: move FIFO-private struct declarations
Posted by Julien Grall 3 years, 6 months ago
Hi Jan,

On 28/09/2020 12:00, Jan Beulich wrote:
> There's no need to expose them.

We are going to need them for LiveUpdate and Non-cooperative Live 
Migration as the save/restore is happening outside of event_fifo.c.

This is because we tried to keep all the save/restore code in a separate 
directory.

Although I could also see pros for scattering save/restore across the code 
base.

Cheers,

> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> I wonder whether we shouldn't do away with event_fifo.h altogether.
> 
> --- a/xen/common/event_fifo.c
> +++ b/xen/common/event_fifo.c
> @@ -21,6 +21,27 @@
>   
>   #include <public/event_channel.h>
>   
> +struct evtchn_fifo_queue {
> +    uint32_t *head; /* points into control block */
> +    uint32_t tail;
> +    uint8_t priority;
> +    spinlock_t lock;
> +};
> +
> +struct evtchn_fifo_vcpu {
> +    struct evtchn_fifo_control_block *control_block;
> +    struct evtchn_fifo_queue queue[EVTCHN_FIFO_MAX_QUEUES];
> +};
> +
> +#define EVTCHN_FIFO_EVENT_WORDS_PER_PAGE (PAGE_SIZE / sizeof(event_word_t))
> +#define EVTCHN_FIFO_MAX_EVENT_ARRAY_PAGES \
> +    (EVTCHN_FIFO_NR_CHANNELS / EVTCHN_FIFO_EVENT_WORDS_PER_PAGE)
> +
> +struct evtchn_fifo_domain {
> +    event_word_t *event_array[EVTCHN_FIFO_MAX_EVENT_ARRAY_PAGES];
> +    unsigned int num_evtchns;
> +};
> +
>   static inline event_word_t *evtchn_fifo_word_from_port(const struct domain *d,
>                                                          unsigned int port)
>   {
> --- a/xen/include/xen/event_fifo.h
> +++ b/xen/include/xen/event_fifo.h
> @@ -9,27 +9,6 @@
>   #ifndef __XEN_EVENT_FIFO_H__
>   #define __XEN_EVENT_FIFO_H__
>   
> -struct evtchn_fifo_queue {
> -    uint32_t *head; /* points into control block */
> -    uint32_t tail;
> -    uint8_t priority;
> -    spinlock_t lock;
> -};
> -
> -struct evtchn_fifo_vcpu {
> -    struct evtchn_fifo_control_block *control_block;
> -    struct evtchn_fifo_queue queue[EVTCHN_FIFO_MAX_QUEUES];
> -};
> -
> -#define EVTCHN_FIFO_EVENT_WORDS_PER_PAGE (PAGE_SIZE / sizeof(event_word_t))
> -#define EVTCHN_FIFO_MAX_EVENT_ARRAY_PAGES \
> -    (EVTCHN_FIFO_NR_CHANNELS / EVTCHN_FIFO_EVENT_WORDS_PER_PAGE)
> -
> -struct evtchn_fifo_domain {
> -    event_word_t *event_array[EVTCHN_FIFO_MAX_EVENT_ARRAY_PAGES];
> -    unsigned int num_evtchns;
> -};
> -
>   int evtchn_fifo_init_control(struct evtchn_init_control *init_control);
>   int evtchn_fifo_expand_array(const struct evtchn_expand_array *expand_array);
>   void evtchn_fifo_destroy(struct domain *domain);
> 

-- 
Julien Grall

Re: [PATCH 09/12] evtchn: move FIFO-private struct declarations
Posted by Jan Beulich 3 years, 6 months ago
On 29.09.2020 14:26, Julien Grall wrote:
> On 28/09/2020 12:00, Jan Beulich wrote:
>> There's no need to expose them.
> 
> We are going to need them for LiveUpdate and Non-cooperative Live 
> Migration as the save/restore is happening outside of event_fifo.c.
> 
> This is because we tried to keep all the save/restore code in a separate 
> directory.

I'm afraid I don't view this as a reason for the change not to be made
right now. If, when, and in what shape LU will hit upstream is unknown
at the moment (unless I've missed some indication of a time line). In
fact, if we didn't expose things like the ones here to far too wide an
"audience", I wonder whether ...

> Although, I could also see pros for scatter save/restore across the code 
> base.

... you wouldn't have chosen this route anyway, just to avoid exposing
items widely that are supposed to be (almost) private.

Jan

Re: [PATCH 09/12] evtchn: move FIFO-private struct declarations
Posted by Julien Grall 3 years, 6 months ago
Hi Jan,

On 29/09/2020 13:49, Jan Beulich wrote:
> On 29.09.2020 14:26, Julien Grall wrote:
>> On 28/09/2020 12:00, Jan Beulich wrote:
>>> There's no need to expose them.
>>
>> We are going to need them for LiveUpdate and Non-cooperative Live
>> Migration as the save/restore is happening outside of event_fifo.c.
>>
>> This is because we tried to keep all the save/restore code in a separate
>> directory.
> 
> I'm afraid I don't view this as a reason for the change not to be made
> right now. If, when, and in what shape LU will hit upstream is unknown
> at the moment (unless I've missed some indication of a time line).

I was merely pointing out a long-term use case, as I vaguely remember a 
discussion in the past about avoiding short-term changes.

> In
> fact, if we didn't expose things like the ones here to far too wide an
> "audience", I wonder whether ... >
>> Although, I could also see pros for scatter save/restore across the code
>> base.
> 
> ... you wouldn't have chosen this route anyway, just to avoid exposing
> items widely that are supposed to be (almost) private.

I would still choose that route because it helps to keep all the 
save/restore code in one place. FWIW, this follows what HVM context does 
today.

Anyway, as I said, I can see pros/cons for each. So at least this 
discussion clarifies the preferred approach.

For the patch itself:

Acked-by: Julien Grall <jgrall@amazon.com>

Cheers,

-- 
Julien Grall

RE: [PATCH 09/12] evtchn: move FIFO-private struct declarations
Posted by Paul Durrant 3 years, 6 months ago

> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Jan Beulich
> Sent: 28 September 2020 12:01
> To: xen-devel@lists.xenproject.org
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>; George Dunlap <George.Dunlap@eu.citrix.com>; Ian
> Jackson <iwj@xenproject.org>; Julien Grall <julien@xen.org>; Wei Liu <wl@xen.org>; Stefano Stabellini
> <sstabellini@kernel.org>
> Subject: [PATCH 09/12] evtchn: move FIFO-private struct declarations
> 
> There's no need to expose them.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> I wonder whether we shouldn't do away with event_fifo.h altogether.
> 

+1

I can't see why it needs to exist. 2l doesn't have a header.

  Paul

> --- a/xen/common/event_fifo.c
> +++ b/xen/common/event_fifo.c
> @@ -21,6 +21,27 @@
> 
>  #include <public/event_channel.h>
> 
> +struct evtchn_fifo_queue {
> +    uint32_t *head; /* points into control block */
> +    uint32_t tail;
> +    uint8_t priority;
> +    spinlock_t lock;
> +};
> +
> +struct evtchn_fifo_vcpu {
> +    struct evtchn_fifo_control_block *control_block;
> +    struct evtchn_fifo_queue queue[EVTCHN_FIFO_MAX_QUEUES];
> +};
> +
> +#define EVTCHN_FIFO_EVENT_WORDS_PER_PAGE (PAGE_SIZE / sizeof(event_word_t))
> +#define EVTCHN_FIFO_MAX_EVENT_ARRAY_PAGES \
> +    (EVTCHN_FIFO_NR_CHANNELS / EVTCHN_FIFO_EVENT_WORDS_PER_PAGE)
> +
> +struct evtchn_fifo_domain {
> +    event_word_t *event_array[EVTCHN_FIFO_MAX_EVENT_ARRAY_PAGES];
> +    unsigned int num_evtchns;
> +};
> +
>  static inline event_word_t *evtchn_fifo_word_from_port(const struct domain *d,
>                                                         unsigned int port)
>  {
> --- a/xen/include/xen/event_fifo.h
> +++ b/xen/include/xen/event_fifo.h
> @@ -9,27 +9,6 @@
>  #ifndef __XEN_EVENT_FIFO_H__
>  #define __XEN_EVENT_FIFO_H__
> 
> -struct evtchn_fifo_queue {
> -    uint32_t *head; /* points into control block */
> -    uint32_t tail;
> -    uint8_t priority;
> -    spinlock_t lock;
> -};
> -
> -struct evtchn_fifo_vcpu {
> -    struct evtchn_fifo_control_block *control_block;
> -    struct evtchn_fifo_queue queue[EVTCHN_FIFO_MAX_QUEUES];
> -};
> -
> -#define EVTCHN_FIFO_EVENT_WORDS_PER_PAGE (PAGE_SIZE / sizeof(event_word_t))
> -#define EVTCHN_FIFO_MAX_EVENT_ARRAY_PAGES \
> -    (EVTCHN_FIFO_NR_CHANNELS / EVTCHN_FIFO_EVENT_WORDS_PER_PAGE)
> -
> -struct evtchn_fifo_domain {
> -    event_word_t *event_array[EVTCHN_FIFO_MAX_EVENT_ARRAY_PAGES];
> -    unsigned int num_evtchns;
> -};
> -
>  int evtchn_fifo_init_control(struct evtchn_init_control *init_control);
>  int evtchn_fifo_expand_array(const struct evtchn_expand_array *expand_array);
>  void evtchn_fifo_destroy(struct domain *domain);
> 



Re: [PATCH 09/12] evtchn: move FIFO-private struct declarations
Posted by Jan Beulich 3 years, 6 months ago
On 30.09.2020 09:37, Paul Durrant wrote:
>> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Jan Beulich
>> Sent: 28 September 2020 12:01
>>
>> There's no need to expose them.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> I wonder whether we shouldn't do away with event_fifo.h altogether.
>>
> 
> +1
> 
> I can't see why it needs to exist. 2l doesn't have a header.

Okay, yet another patch then (I don't think I want to extend this one).

Jan

[PATCH 10/12] evtchn/fifo: use stable fields when recording "last queue" information
Posted by Jan Beulich 3 years, 6 months ago
Both evtchn->priority and evtchn->notify_vcpu_id could, prior to recent
locking adjustments, change behind the back of
evtchn_fifo_set_pending(). Neither the queue's priority nor the vCPU's
vcpu_id fields have similar properties, so they seem better suited for
the purpose. In particular they reflect the respective evtchn fields'
values at the time they were used to determine queue and vCPU.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/common/event_fifo.c
+++ b/xen/common/event_fifo.c
@@ -246,8 +246,8 @@ static void evtchn_fifo_set_pending(stru
         /* Moved to a different queue? */
         if ( old_q != q )
         {
-            evtchn->last_vcpu_id = evtchn->notify_vcpu_id;
-            evtchn->last_priority = evtchn->priority;
+            evtchn->last_vcpu_id = v->vcpu_id;
+            evtchn->last_priority = q->priority;
 
             spin_unlock_irqrestore(&old_q->lock, flags);
             spin_lock_irqsave(&q->lock, flags);


Re: [PATCH 10/12] evtchn/fifo: use stable fields when recording "last queue" information
Posted by Julien Grall 3 years, 6 months ago
Hi Jan,

On 28/09/2020 12:01, Jan Beulich wrote:
> Both evtchn->priority and evtchn->notify_vcpu_id could, prior to recent
> locking adjustments, change behind the back of
> evtchn_fifo_set_pending(). Neither the queue's priority nor the vCPU's
> vcpu_id fields have similar properties, so they seem better suited for
> the purpose. In particular they reflect the respective evtchn fields'
> values at the time they were used to determine queue and vCPU.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Julien Grall <jgrall@amazon.com>

Cheers,

-- 
Julien Grall

RE: [PATCH 10/12] evtchn/fifo: use stable fields when recording "last queue" information
Posted by Paul Durrant 3 years, 6 months ago

> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Jan Beulich
> Sent: 28 September 2020 12:02
> To: xen-devel@lists.xenproject.org
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>; George Dunlap <George.Dunlap@eu.citrix.com>; Ian
> Jackson <iwj@xenproject.org>; Julien Grall <julien@xen.org>; Wei Liu <wl@xen.org>; Stefano Stabellini
> <sstabellini@kernel.org>
> Subject: [PATCH 10/12] evtchn/fifo: use stable fields when recording "last queue" information
> 
> Both evtchn->priority and evtchn->notify_vcpu_id could, prior to recent
> locking adjustments, change behind the back of
> evtchn_fifo_set_pending(). Neither the queue's priority nor the vCPU's
> vcpu_id fields have similar properties, so they seem better suited for
> the purpose. In particular they reflect the respective evtchn fields'
> values at the time they were used to determine queue and vCPU.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 

I think these changes make the code clearer anyway.

Reviewed-by: Paul Durrant <paul@xen.org>

> --- a/xen/common/event_fifo.c
> +++ b/xen/common/event_fifo.c
> @@ -246,8 +246,8 @@ static void evtchn_fifo_set_pending(stru
>          /* Moved to a different queue? */
>          if ( old_q != q )
>          {
> -            evtchn->last_vcpu_id = evtchn->notify_vcpu_id;
> -            evtchn->last_priority = evtchn->priority;
> +            evtchn->last_vcpu_id = v->vcpu_id;
> +            evtchn->last_priority = q->priority;
> 
>              spin_unlock_irqrestore(&old_q->lock, flags);
>              spin_lock_irqsave(&q->lock, flags);
> 



Re: [PATCH 10/12] evtchn/fifo: use stable fields when recording "last queue" information
Posted by Jan Beulich 3 years, 6 months ago
On 30.09.2020 09:35, Paul Durrant wrote:
>> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Jan Beulich
>> Sent: 28 September 2020 12:02
>>
>> Both evtchn->priority and evtchn->notify_vcpu_id could, prior to recent
>> locking adjustments, change behind the back of
>> evtchn_fifo_set_pending(). Neither the queue's priority nor the vCPU's
>> vcpu_id fields have similar properties, so they seem better suited for
>> the purpose. In particular they reflect the respective evtchn fields'
>> values at the time they were used to determine queue and vCPU.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> I think these changes make the code clearer anyway.
> 
> Reviewed-by: Paul Durrant <paul@xen.org>

Thanks. With the change of description in the earlier patch, and with
this one possibly going in ahead of it, I'll massage the description
here somewhat, I guess.

Jan

RE: [PATCH 10/12] evtchn/fifo: use stable fields when recording "last queue" information
Posted by Paul Durrant 3 years, 6 months ago
> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 30 September 2020 09:35
> To: paul@xen.org
> Cc: xen-devel@lists.xenproject.org; 'Andrew Cooper' <andrew.cooper3@citrix.com>; 'George Dunlap'
> <George.Dunlap@eu.citrix.com>; 'Ian Jackson' <iwj@xenproject.org>; 'Julien Grall' <julien@xen.org>;
> 'Wei Liu' <wl@xen.org>; 'Stefano Stabellini' <sstabellini@kernel.org>
> Subject: Re: [PATCH 10/12] evtchn/fifo: use stable fields when recording "last queue" information
> 
> On 30.09.2020 09:35, Paul Durrant wrote:
> >> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Jan Beulich
> >> Sent: 28 September 2020 12:02
> >>
> >> Both evtchn->priority and evtchn->notify_vcpu_id could, prior to recent
> >> locking adjustments, change behind the back of
> >> evtchn_fifo_set_pending(). Neither the queue's priority nor the vCPU's
> >> vcpu_id fields have similar properties, so they seem better suited for
> >> the purpose. In particular they reflect the respective evtchn fields'
> >> values at the time they were used to determine queue and vCPU.
> >>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >
> > I think these changes make the code clearer anyway.
> >
> > Reviewed-by: Paul Durrant <paul@xen.org>
> 
> Thanks. With the change of description in the earlier patch, and with
> this one possibly going in ahead of it, I'll massage the description
> here somewhat, I guess.
> 

That's fine.

  Paul


[PATCH 11/12] evtchn: convert vIRQ lock to an r/w one
Posted by Jan Beulich 3 years, 6 months ago
There's no need to serialize all sending of vIRQ-s; all that's needed
is serialization against the closing of the respective event channels
(by means of a barrier). To facilitate the conversion, introduce a new
rw_barrier().

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -160,7 +160,7 @@ struct vcpu *vcpu_create(struct domain *
     v->vcpu_id = vcpu_id;
     v->dirty_cpu = VCPU_CPU_CLEAN;
 
-    spin_lock_init(&v->virq_lock);
+    rwlock_init(&v->virq_lock);
 
     tasklet_init(&v->continue_hypercall_tasklet, NULL, NULL);
 
--- a/xen/common/event_channel.c
+++ b/xen/common/event_channel.c
@@ -640,7 +640,7 @@ int evtchn_close(struct domain *d1, int
             if ( v->virq_to_evtchn[chn1->u.virq] != port1 )
                 continue;
             v->virq_to_evtchn[chn1->u.virq] = 0;
-            spin_barrier(&v->virq_lock);
+            rw_barrier(&v->virq_lock);
         }
         break;
 
@@ -794,7 +794,7 @@ void send_guest_vcpu_virq(struct vcpu *v
 
     ASSERT(!virq_is_global(virq));
 
-    spin_lock_irqsave(&v->virq_lock, flags);
+    read_lock_irqsave(&v->virq_lock, flags);
 
     port = v->virq_to_evtchn[virq];
     if ( unlikely(port == 0) )
@@ -807,7 +807,7 @@ void send_guest_vcpu_virq(struct vcpu *v
     spin_unlock(&chn->lock);
 
  out:
-    spin_unlock_irqrestore(&v->virq_lock, flags);
+    read_unlock_irqrestore(&v->virq_lock, flags);
 }
 
 void send_guest_global_virq(struct domain *d, uint32_t virq)
@@ -826,7 +826,7 @@ void send_guest_global_virq(struct domai
     if ( unlikely(v == NULL) )
         return;
 
-    spin_lock_irqsave(&v->virq_lock, flags);
+    read_lock_irqsave(&v->virq_lock, flags);
 
     port = v->virq_to_evtchn[virq];
     if ( unlikely(port == 0) )
@@ -838,7 +838,7 @@ void send_guest_global_virq(struct domai
     spin_unlock(&chn->lock);
 
  out:
-    spin_unlock_irqrestore(&v->virq_lock, flags);
+    read_unlock_irqrestore(&v->virq_lock, flags);
 }
 
 void send_guest_pirq(struct domain *d, const struct pirq *pirq)
--- a/xen/common/spinlock.c
+++ b/xen/common/spinlock.c
@@ -2,7 +2,7 @@
 #include <xen/irq.h>
 #include <xen/smp.h>
 #include <xen/time.h>
-#include <xen/spinlock.h>
+#include <xen/rwlock.h>
 #include <xen/guest_access.h>
 #include <xen/preempt.h>
 #include <public/sysctl.h>
@@ -334,6 +334,12 @@ void _spin_unlock_recursive(spinlock_t *
     }
 }
 
+void _rw_barrier(rwlock_t *lock)
+{
+    check_barrier(&lock->lock.debug);
+    do { smp_mb(); } while ( _rw_is_locked(lock) );
+}
+
 #ifdef CONFIG_DEBUG_LOCK_PROFILE
 
 struct lock_profile_anc {
--- a/xen/include/xen/rwlock.h
+++ b/xen/include/xen/rwlock.h
@@ -237,6 +237,8 @@ static inline int _rw_is_write_locked(rw
     return (atomic_read(&lock->cnts) & _QW_WMASK) == _QW_LOCKED;
 }
 
+void _rw_barrier(rwlock_t *lock);
+
 #define read_lock(l)                  _read_lock(l)
 #define read_lock_irq(l)              _read_lock_irq(l)
 #define read_lock_irqsave(l, f)                                 \
@@ -266,6 +268,7 @@ static inline int _rw_is_write_locked(rw
 #define rw_is_locked(l)               _rw_is_locked(l)
 #define rw_is_write_locked(l)         _rw_is_write_locked(l)
 
+#define rw_barrier(l)                 _rw_barrier(l)
 
 typedef struct percpu_rwlock percpu_rwlock_t;
 
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -235,7 +235,7 @@ struct vcpu
 
     /* IRQ-safe virq_lock protects against delivering VIRQ to stale evtchn. */
     evtchn_port_t    virq_to_evtchn[NR_VIRQS];
-    spinlock_t       virq_lock;
+    rwlock_t         virq_lock;
 
     /* Tasklet for continue_hypercall_on_cpu(). */
     struct tasklet   continue_hypercall_tasklet;


RE: [PATCH 11/12] evtchn: convert vIRQ lock to an r/w one
Posted by Paul Durrant 3 years, 6 months ago
> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Jan Beulich
> Sent: 28 September 2020 12:02
> To: xen-devel@lists.xenproject.org
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>; George Dunlap <George.Dunlap@eu.citrix.com>; Ian
> Jackson <iwj@xenproject.org>; Julien Grall <julien@xen.org>; Wei Liu <wl@xen.org>; Stefano Stabellini
> <sstabellini@kernel.org>
> Subject: [PATCH 11/12] evtchn: convert vIRQ lock to an r/w one
> 
> There's no need to serialize all sending of vIRQ-s; all that's needed
> is serialization against the closing of the respective event channels
> (by means of a barrier). To facilitate the conversion, introduce a new
> rw_barrier().
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -160,7 +160,7 @@ struct vcpu *vcpu_create(struct domain *
>      v->vcpu_id = vcpu_id;
>      v->dirty_cpu = VCPU_CPU_CLEAN;
> 
> -    spin_lock_init(&v->virq_lock);
> +    rwlock_init(&v->virq_lock);
> 
>      tasklet_init(&v->continue_hypercall_tasklet, NULL, NULL);
> 
> --- a/xen/common/event_channel.c
> +++ b/xen/common/event_channel.c
> @@ -640,7 +640,7 @@ int evtchn_close(struct domain *d1, int
>              if ( v->virq_to_evtchn[chn1->u.virq] != port1 )
>                  continue;
>              v->virq_to_evtchn[chn1->u.virq] = 0;
> -            spin_barrier(&v->virq_lock);
> +            rw_barrier(&v->virq_lock);
>          }
>          break;
> 
> @@ -794,7 +794,7 @@ void send_guest_vcpu_virq(struct vcpu *v
> 
>      ASSERT(!virq_is_global(virq));
> 
> -    spin_lock_irqsave(&v->virq_lock, flags);
> +    read_lock_irqsave(&v->virq_lock, flags);
> 
>      port = v->virq_to_evtchn[virq];
>      if ( unlikely(port == 0) )
> @@ -807,7 +807,7 @@ void send_guest_vcpu_virq(struct vcpu *v
>      spin_unlock(&chn->lock);
> 
>   out:
> -    spin_unlock_irqrestore(&v->virq_lock, flags);
> +    read_unlock_irqrestore(&v->virq_lock, flags);
>  }
> 
>  void send_guest_global_virq(struct domain *d, uint32_t virq)
> @@ -826,7 +826,7 @@ void send_guest_global_virq(struct domai
>      if ( unlikely(v == NULL) )
>          return;
> 
> -    spin_lock_irqsave(&v->virq_lock, flags);
> +    read_lock_irqsave(&v->virq_lock, flags);
> 
>      port = v->virq_to_evtchn[virq];
>      if ( unlikely(port == 0) )
> @@ -838,7 +838,7 @@ void send_guest_global_virq(struct domai
>      spin_unlock(&chn->lock);
> 
>   out:
> -    spin_unlock_irqrestore(&v->virq_lock, flags);
> +    read_unlock_irqrestore(&v->virq_lock, flags);
>  }
> 
>  void send_guest_pirq(struct domain *d, const struct pirq *pirq)
> --- a/xen/common/spinlock.c
> +++ b/xen/common/spinlock.c
> @@ -2,7 +2,7 @@
>  #include <xen/irq.h>
>  #include <xen/smp.h>
>  #include <xen/time.h>
> -#include <xen/spinlock.h>
> +#include <xen/rwlock.h>
>  #include <xen/guest_access.h>
>  #include <xen/preempt.h>
>  #include <public/sysctl.h>
> @@ -334,6 +334,12 @@ void _spin_unlock_recursive(spinlock_t *
>      }
>  }
> 
> +void _rw_barrier(rwlock_t *lock)
> +{
> +    check_barrier(&lock->lock.debug);
> +    do { smp_mb(); } while ( _rw_is_locked(lock) );
> +}

Should you not have a cpu_relax() somewhere in here?

TBH though, the fact this lock is never taken as a writer makes me wonder whether there needs to be a lock at all.

  Paul

> +
>  #ifdef CONFIG_DEBUG_LOCK_PROFILE
> 
>  struct lock_profile_anc {
> --- a/xen/include/xen/rwlock.h
> +++ b/xen/include/xen/rwlock.h
> @@ -237,6 +237,8 @@ static inline int _rw_is_write_locked(rw
>      return (atomic_read(&lock->cnts) & _QW_WMASK) == _QW_LOCKED;
>  }
> 
> +void _rw_barrier(rwlock_t *lock);
> +
>  #define read_lock(l)                  _read_lock(l)
>  #define read_lock_irq(l)              _read_lock_irq(l)
>  #define read_lock_irqsave(l, f)                                 \
> @@ -266,6 +268,7 @@ static inline int _rw_is_write_locked(rw
>  #define rw_is_locked(l)               _rw_is_locked(l)
>  #define rw_is_write_locked(l)         _rw_is_write_locked(l)
> 
> +#define rw_barrier(l)                 _rw_barrier(l)
> 
>  typedef struct percpu_rwlock percpu_rwlock_t;
> 
> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -235,7 +235,7 @@ struct vcpu
> 
>      /* IRQ-safe virq_lock protects against delivering VIRQ to stale evtchn. */
>      evtchn_port_t    virq_to_evtchn[NR_VIRQS];
> -    spinlock_t       virq_lock;
> +    rwlock_t         virq_lock;
> 
>      /* Tasklet for continue_hypercall_on_cpu(). */
>      struct tasklet   continue_hypercall_tasklet;
> 



Re: [PATCH 11/12] evtchn: convert vIRQ lock to an r/w one
Posted by Jan Beulich 3 years, 6 months ago
On 30.09.2020 09:58, Paul Durrant wrote:
>> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Jan Beulich
>> Sent: 28 September 2020 12:02
>>
>> @@ -334,6 +334,12 @@ void _spin_unlock_recursive(spinlock_t *
>>      }
>>  }
>>
>> +void _rw_barrier(rwlock_t *lock)
>> +{
>> +    check_barrier(&lock->lock.debug);
>> +    do { smp_mb(); } while ( _rw_is_locked(lock) );
>> +}
> 
> Should you not have a cpu_relax() somewhere in here?
> 
> TBH though, the fact this lock is never taken as a writer makes me
> wonder whether there needs to be a lock at all.

For both of these - see the discussion Julien and I also had. The
construct will now move to the subsequent patch anyway, and change
as per Julien's request.

Jan

RE: [PATCH 11/12] evtchn: convert vIRQ lock to an r/w one
Posted by Paul Durrant 3 years, 6 months ago
> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 30 September 2020 09:37
> To: paul@xen.org
> Cc: xen-devel@lists.xenproject.org; 'Andrew Cooper' <andrew.cooper3@citrix.com>; 'George Dunlap'
> <George.Dunlap@eu.citrix.com>; 'Ian Jackson' <iwj@xenproject.org>; 'Julien Grall' <julien@xen.org>;
> 'Wei Liu' <wl@xen.org>; 'Stefano Stabellini' <sstabellini@kernel.org>
> Subject: Re: [PATCH 11/12] evtchn: convert vIRQ lock to an r/w one
> 
> On 30.09.2020 09:58, Paul Durrant wrote:
> >> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Jan Beulich
> >> Sent: 28 September 2020 12:02
> >>
> >> @@ -334,6 +334,12 @@ void _spin_unlock_recursive(spinlock_t *
> >>      }
> >>  }
> >>
> >> +void _rw_barrier(rwlock_t *lock)
> >> +{
> >> +    check_barrier(&lock->lock.debug);
> >> +    do { smp_mb(); } while ( _rw_is_locked(lock) );
> >> +}
> >
> > Should you not have a cpu_relax() somewhere in here?
> >
> > TBH though, the fact this lock is never taken as a writer makes me
> > wonder whether there needs to be a lock at all.
> 
> For both of these - see the discussion Julien and I also had. The
> construct will now move to the subsequent patch anyway, and change
> as per Julien's request.
> 

OK.

Looking again, given that both send_guest_vcpu_virq() and send_guest_global_virq() (rightly) hold the evtchn lock before calling evtchn_port_set_pending() I think you could do away with the virq lock by adding checks in those functions to verify evtchn->state == ECS_VIRQ and u.virq == virq after having acquired the channel lock but before calling evtchn_port_set_pending().
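
Something along these lines is what I have in mind - an untested sketch 
only, and the struct evtchn field names are what I'd expect them to be, 
so purely illustrative rather than a concrete patch:

    /* send_guest_vcpu_virq(), with the vIRQ lock gone: */
    port = read_atomic(&v->virq_to_evtchn[virq]);
    if ( unlikely(port == 0) )
        return;

    d = v->domain;
    chn = evtchn_from_port(d, port);

    spin_lock_irqsave(&chn->lock, flags);

    /*
     * The channel may have been closed (and perhaps re-bound) since the
     * port was read above, so re-check the binding under the per-channel
     * lock before raising the event.
     */
    if ( chn->state == ECS_VIRQ && chn->u.virq == virq &&
         chn->notify_vcpu_id == v->vcpu_id )
        evtchn_port_set_pending(d, v->vcpu_id, chn);

    spin_unlock_irqrestore(&chn->lock, flags);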

  Paul

> Jan


Re: [PATCH 11/12] evtchn: convert vIRQ lock to an r/w one
Posted by Jan Beulich 3 years, 6 months ago
On 30.09.2020 10:52, Paul Durrant wrote:
> Looking again, given that both send_guest_vcpu_virq() and
> send_guest_global_virq() (rightly) hold the evtchn lock before
> calling evtchn_port_set_pending() I think you could do away with
> the virq lock by adding checks in those functions to verify
> evtchn->state == ECS_VIRQ and u.virq == virq after having
> acquired the channel lock but before calling
> evtchn_port_set_pending().

I don't think so: The adjustment of v->virq_to_evtchn[] in
evtchn_close() would then happen with just the domain's event
lock held, which the sending paths don't use at all. The per-
channel lock gets acquired in evtchn_close() a bit later only
(and this lock can't possibly protect per-vCPU state).

In fact I'm now getting puzzled by evtchn_bind_virq() updating
this array with (just) the per-domain lock held. Since it's
the last thing in the function, there's technically no strict
need for acquiring the vIRQ lock, but holding the event lock
definitely doesn't help. All that looks to be needed is the
barrier implied from write_unlock().

Jan

Re: [PATCH 11/12] evtchn: convert vIRQ lock to an r/w one
Posted by Julien Grall 3 years, 5 months ago
Hi Jan,

On 30/09/2020 11:16, Jan Beulich wrote:
> On 30.09.2020 10:52, Paul Durrant wrote:
>> Looking again, given that both send_guest_vcpu_virq() and
>> send_guest_global_virq() (rightly) hold the evtchn lock before
>> calling evtchn_port_set_pending() I think you could do away with
>> the virq lock by adding checks in those functions to verify
>> evtchn->state == ECS_VIRQ and u.virq == virq after having
>> acquired the channel lock but before calling
>> evtchn_port_set_pending().
> 
> I don't think so: The adjustment of v->virq_to_evtchn[] in
> evtchn_close() would then happen with just the domain's event
> lock held, which the sending paths don't use at all. The per-
> channel lock gets acquired in evtchn_close() a bit later only
> (and this lock can't possibly protect per-vCPU state).
> 
> In fact I'm now getting puzzled by evtchn_bind_virq() updating
> this array with (just) the per-domain lock held. Since it's
> the last thing in the function, there's technically no strict
> need for acquiring the vIRQ lock,

Well, we at least need to prevent the compiler from tearing the store/load. 
If we don't use a lock, then we should use ACCESS_ONCE() or 
{read,write}_atomic() for all the uses.

> but holding the event lock
> definitely doesn't help.

It helps because spin_unlock() and write_unlock() use the same barrier 
(arch_lock_release_barrier()). So ...

> All that looks to be needed is the
> barrier implied from write_unlock().

No barrier should be necessary. Although, I would suggest to add a 
comment explaining it.

Cheers,

-- 
Julien Grall

Re: [PATCH 11/12] evtchn: convert vIRQ lock to an r/w one
Posted by Jan Beulich 3 years, 5 months ago
On 01.10.2020 18:21, Julien Grall wrote:
> On 30/09/2020 11:16, Jan Beulich wrote:
>> On 30.09.2020 10:52, Paul Durrant wrote:
>>> Looking again, given that both send_guest_vcpu_virq() and
>>> send_guest_global_virq() (rightly) hold the evtchn lock before
>>> calling evtchn_port_set_pending() I think you could do away with
>>> the virq lock by adding checks in those functions to verify
>>> evtchn->state == ECS_VIRQ and u.virq == virq after having
>>> acquired the channel lock but before calling
>>> evtchn_port_set_pending().
>>
>> I don't think so: The adjustment of v->virq_to_evtchn[] in
>> evtchn_close() would then happen with just the domain's event
>> lock held, which the sending paths don't use at all. The per-
>> channel lock gets acquired in evtchn_close() a bit later only
>> (and this lock can't possibly protect per-vCPU state).
>>
>> In fact I'm now getting puzzled by evtchn_bind_virq() updating
>> this array with (just) the per-domain lock held. Since it's
>> the last thing in the function, there's technically no strict
>> need for acquiring the vIRQ lock,
> 
> Well, we at least need to prevent the compiler to tear the store/load. 
> If we don't use a lock, then we would should use ACCESS_ONCE() or 
> {read,write}_atomic() for all the usage.
> 
>> but holding the event lock
>> definitely doesn't help.
> 
> It helps because spin_unlock() and write_unlock() use the same barrier 
> (arch_lock_release_barrier()). So ...

I'm having trouble making this part of your reply fit ...

>> All that looks to be needed is the
>> barrier implied from write_unlock().
> 
> No barrier should be necessary. Although, I would suggest to add a 
> comment explaining it.

... this. If we moved the update of v->virq_to_evtchn[] out of the
locked region (as the lock doesn't protect anything anymore at that
point), I think a barrier would need adding, such that the sending
paths will observe the update by the time evtchn_bind_virq()
returns (and hence sending of a respective vIRQ event can
legitimately be expected to actually work). Or did you possibly
just misunderstand what I wrote before? By putting in question the
utility of holding the event lock, I implied the write could be
moved out of the locked region ...
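
Purely to illustrate (this is not something I'm proposing as part of the
series), moving the store out would then presumably look roughly like

    spin_unlock(&d->event_lock);

    write_atomic(&v->virq_to_evtchn[virq], port);
    smp_wmb(); /* make the new binding visible before the port is handed back */

i.e. with an explicit barrier taking over the role which the unlock's
release semantics play for as long as the store sits inside the locked
region.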

Jan

Re: [PATCH 11/12] evtchn: convert vIRQ lock to an r/w one
Posted by Julien Grall 3 years, 5 months ago
Hi Jan,

On 02/10/2020 07:12, Jan Beulich wrote:
> On 01.10.2020 18:21, Julien Grall wrote:
>> On 30/09/2020 11:16, Jan Beulich wrote:
>>> On 30.09.2020 10:52, Paul Durrant wrote:
>>>> Looking again, given that both send_guest_vcpu_virq() and
>>>> send_guest_global_virq() (rightly) hold the evtchn lock before
>>>> calling evtchn_port_set_pending() I think you could do away with
>>>> the virq lock by adding checks in those functions to verify
>>>> evtchn->state == ECS_VIRQ and u.virq == virq after having
>>>> acquired the channel lock but before calling
>>>> evtchn_port_set_pending().
>>>
>>> I don't think so: The adjustment of v->virq_to_evtchn[] in
>>> evtchn_close() would then happen with just the domain's event
>>> lock held, which the sending paths don't use at all. The per-
>>> channel lock gets acquired in evtchn_close() a bit later only
>>> (and this lock can't possibly protect per-vCPU state).
>>>
>>> In fact I'm now getting puzzled by evtchn_bind_virq() updating
>>> this array with (just) the per-domain lock held. Since it's
>>> the last thing in the function, there's technically no strict
>>> need for acquiring the vIRQ lock,
>>
>> Well, we at least need to prevent the compiler to tear the store/load.
>> If we don't use a lock, then we would should use ACCESS_ONCE() or
>> {read,write}_atomic() for all the usage.
>>
>>> but holding the event lock
>>> definitely doesn't help.
>>
>> It helps because spin_unlock() and write_unlock() use the same barrier
>> (arch_lock_release_barrier()). So ...
> 
> I'm having trouble making this part of your reply fit ...
> 
>>> All that looks to be needed is the
>>> barrier implied from write_unlock().
>>
>> No barrier should be necessary. Although, I would suggest to add a
>> comment explaining it.
> 
> ... this. If we moved the update of v->virq_to_evtchn[] out of the
> locked region (as the lock doesn't protect anything anymore at that
> point), I think a barrier would need adding, such that the sending
> paths will observe the update by the time evtchn_bind_virq()
> returns (and hence sending of a respective vIRQ event can
> legitimately be expected to actually work). Or did you possibly
> just misunderstand what I wrote before? By putting in question the
> utility of holding the event lock, I implied the write could be
> moved out of the locked region ...

We are probably talking past each other... My point was that if we leave 
the write where it currently is, then we don't need an extra barrier 
because the spin_unlock() already contains the barrier we want.

Hence the suggestion to add a comment so a reader doesn't spend time 
wondering how this is safe...
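
To make it concrete, the comment I have in mind would be along these lines
(placement sketched only, since evtchn_bind_virq() itself isn't quoted in
this thread):

    /*
     * Deliberately kept inside the event-lock-ed region: the release
     * semantics of the subsequent unlock make the new binding visible
     * to the sending paths, so no explicit barrier is needed here.
     */
    v->virq_to_evtchn[virq] = port;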

Cheers,

-- 
Julien Grall

Re: [PATCH 11/12] evtchn: convert vIRQ lock to an r/w one
Posted by Julien Grall 3 years, 6 months ago
Hi Jan,

On 28/09/2020 12:02, Jan Beulich wrote:
> There's no need to serialize all sending of vIRQ-s; all that's needed
> is serialization against the closing of the respective event channels
> (by means of a barrier). To facilitate the conversion, introduce a new
> rw_barrier().

Looking at the code below, all the spin_lock()s have been replaced by 
read_lock_*(). This is a bit surprising.

> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -160,7 +160,7 @@ struct vcpu *vcpu_create(struct domain *
>       v->vcpu_id = vcpu_id;
>       v->dirty_cpu = VCPU_CPU_CLEAN;
>   
> -    spin_lock_init(&v->virq_lock);
> +    rwlock_init(&v->virq_lock);
>   
>       tasklet_init(&v->continue_hypercall_tasklet, NULL, NULL);
>   
> --- a/xen/common/event_channel.c
> +++ b/xen/common/event_channel.c
> @@ -640,7 +640,7 @@ int evtchn_close(struct domain *d1, int
>               if ( v->virq_to_evtchn[chn1->u.virq] != port1 )
>                   continue;
>               v->virq_to_evtchn[chn1->u.virq] = 0;
> -            spin_barrier(&v->virq_lock);
> +            rw_barrier(&v->virq_lock);
>           }
>           break;
>   
> @@ -794,7 +794,7 @@ void send_guest_vcpu_virq(struct vcpu *v
>   
>       ASSERT(!virq_is_global(virq));
>   
> -    spin_lock_irqsave(&v->virq_lock, flags);
> +    read_lock_irqsave(&v->virq_lock, flags);
>   
>       port = v->virq_to_evtchn[virq];
>       if ( unlikely(port == 0) )
> @@ -807,7 +807,7 @@ void send_guest_vcpu_virq(struct vcpu *v
>       spin_unlock(&chn->lock);
>   
>    out:
> -    spin_unlock_irqrestore(&v->virq_lock, flags);
> +    read_unlock_irqrestore(&v->virq_lock, flags);
>   }
>   
>   void send_guest_global_virq(struct domain *d, uint32_t virq)
> @@ -826,7 +826,7 @@ void send_guest_global_virq(struct domai
>       if ( unlikely(v == NULL) )
>           return;
>   
> -    spin_lock_irqsave(&v->virq_lock, flags);
> +    read_lock_irqsave(&v->virq_lock, flags);
>   
>       port = v->virq_to_evtchn[virq];
>       if ( unlikely(port == 0) )
> @@ -838,7 +838,7 @@ void send_guest_global_virq(struct domai
>       spin_unlock(&chn->lock);
>   
>    out:
> -    spin_unlock_irqrestore(&v->virq_lock, flags);
> +    read_unlock_irqrestore(&v->virq_lock, flags);
>   }
>   
>   void send_guest_pirq(struct domain *d, const struct pirq *pirq)
> --- a/xen/common/spinlock.c
> +++ b/xen/common/spinlock.c
> @@ -2,7 +2,7 @@
>   #include <xen/irq.h>
>   #include <xen/smp.h>
>   #include <xen/time.h>
> -#include <xen/spinlock.h>
> +#include <xen/rwlock.h>

I would prefer if we keep including <xen/spinlock.h>, as the fact that 
<xen/rwlock.h> includes it is merely an implementation detail.

>   #include <xen/guest_access.h>
>   #include <xen/preempt.h>
>   #include <public/sysctl.h>
> @@ -334,6 +334,12 @@ void _spin_unlock_recursive(spinlock_t *
>       }
>   }
>   
> +void _rw_barrier(rwlock_t *lock)
> +{
> +    check_barrier(&lock->lock.debug);
> +    do { smp_mb(); } while ( _rw_is_locked(lock) );
> +}

Why do you need to call smp_mb() on each loop iteration? Wouldn't it be 
sufficient to write something similar to spin_barrier()? I.e.:

smp_mb();
while ( _rw_is_locked(lock) )
   cpu_relax();
smp_mb();

But I wonder if there is a risk, with either implementation, of 
_rw_is_locked() always returning true, meaning the loop would never end.

Let's say we receive an interrupt; by the time it is handled, the 
read lock may have been taken again.

spin_barrier() seems to handle this situation fine because it just waits 
for the head to change. I don't think we can do the same here...

I am thinking that it may be easier to hold the write lock when doing 
the update.
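
I.e. roughly the following in evtchn_close() (untested, just to illustrate
what I mean):

    write_lock_irqsave(&v->virq_lock, flags);
    v->virq_to_evtchn[chn1->u.virq] = 0;
    write_unlock_irqrestore(&v->virq_lock, flags);

The writer then serializes naturally against the readers in the sending
paths, and the open-coded barrier loop (with its termination question)
goes away.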

Cheers,

-- 
Julien Grall

Re: [PATCH 11/12] evtchn: convert vIRQ lock to an r/w one
Posted by Jan Beulich 3 years, 6 months ago
On 29.09.2020 15:03, Julien Grall wrote:
> On 28/09/2020 12:02, Jan Beulich wrote:
>> There's no need to serialize all sending of vIRQ-s; all that's needed
>> is serialization against the closing of the respective event channels
>> (by means of a barrier). To facilitate the conversion, introduce a new
>> rw_barrier().
> 
> Looking at the code below, all the spin_lock() have been replaced by a 
> read_lock_*(). This is a bit surprising,

It is, I agree, but that's what it ends up being. It is my understanding
that the lock really only exists for the purpose of the barrier in
evtchn_close().

>> --- a/xen/common/spinlock.c
>> +++ b/xen/common/spinlock.c
>> @@ -2,7 +2,7 @@
>>   #include <xen/irq.h>
>>   #include <xen/smp.h>
>>   #include <xen/time.h>
>> -#include <xen/spinlock.h>
>> +#include <xen/rwlock.h>
> 
> I would prefer if keep including <xen/spinlock.h> as the fact 
> <xen/rwlock.h> include it is merely an implementation details.

Can do, sure.

>> @@ -334,6 +334,12 @@ void _spin_unlock_recursive(spinlock_t *
>>       }
>>   }
>>   
>> +void _rw_barrier(rwlock_t *lock)
>> +{
>> +    check_barrier(&lock->lock.debug);
>> +    do { smp_mb(); } while ( _rw_is_locked(lock) );
>> +}
> 
> Why do you need to call smp_mb() at each loop? Would not it be 
> sufficient to write something similar to spin_barrier(). I.e:
> 
> smp_mb();
> while ( _rw_is_locked(lock) )
>    cpu_relax();
> smp_mb();

Yes, this looks to be possible. Both for this and the question
below it may be relevant to know that this patch pre-dates the
conversion to ticket locks by quite a bit. Hence the construct 
above resembles the _spin_barrier() of that time.

> But I wonder if there is a risk with either implementation for 
> _rw_is_locked() to always return true and therefore never end.
> 
> Let say we receive an interrupt, by the time it is handled, the 
> read/lock may have been taken again.

With non-ticket locks I would say there was the same issue. But
yes, ...

> spin_barrier() seems to handle this situation fine because it just wait 
> for the head to change. I don't think we can do the same here...
> 
> I am thinking that it may be easier to hold the write lock when doing 
> the update.

... perhaps this is indeed better. I have to admit that I never
fully understood the benefit of using spin_barrier() in this code
(as opposed to the use in evtchn_destroy()).

Jan

Re: [PATCH 11/12] evtchn: convert vIRQ lock to an r/w one
Posted by Julien Grall 3 years, 6 months ago
Hi Jan,

On 29/09/2020 14:37, Jan Beulich wrote:
> On 29.09.2020 15:03, Julien Grall wrote:
>> On 28/09/2020 12:02, Jan Beulich wrote:
>>> There's no need to serialize all sending of vIRQ-s; all that's needed
>>> is serialization against the closing of the respective event channels
>>> (by means of a barrier). To facilitate the conversion, introduce a new
>>> rw_barrier().
>>
>> Looking at the code below, all the spin_lock() have been replaced by a
>> read_lock_*(). This is a bit surprising,
> 
> It is, I agree, but that's what it ends up being. It is my understanding
> that the lock really only exists for the purpose of the barrier in
> evtchn_close().
> 
>>> --- a/xen/common/spinlock.c
>>> +++ b/xen/common/spinlock.c
>>> @@ -2,7 +2,7 @@
>>>    #include <xen/irq.h>
>>>    #include <xen/smp.h>
>>>    #include <xen/time.h>
>>> -#include <xen/spinlock.h>
>>> +#include <xen/rwlock.h>
>>
>> I would prefer if keep including <xen/spinlock.h> as the fact
>> <xen/rwlock.h> include it is merely an implementation details.
> 
> Can do, sure.
> 
>>> @@ -334,6 +334,12 @@ void _spin_unlock_recursive(spinlock_t *
>>>        }
>>>    }
>>>    
>>> +void _rw_barrier(rwlock_t *lock)
>>> +{
>>> +    check_barrier(&lock->lock.debug);
>>> +    do { smp_mb(); } while ( _rw_is_locked(lock) );
>>> +}
>>
>> Why do you need to call smp_mb() at each loop? Would not it be
>> sufficient to write something similar to spin_barrier(). I.e:
>>
>> smp_mb();
>> while ( _rw_is_locked(lock) )
>>     cpu_relax();
>> smp_mb();
> 
> Yes, this looks to be possible. Both for this and the question
> below it may be relevant to know that this patch pre-dates the
> conversion to ticket locks by quite a bit. Hence the construct
> above resembles the _spin_barrier() back at the time.
> 
>> But I wonder if there is a risk with either implementation for
>> _rw_is_locked() to always return true and therefore never end.
>>
>> Let say we receive an interrupt, by the time it is handled, the
>> read/lock may have been taken again.
> 
> With non-ticket locks I would say there was the same issue. But
> yes, ...

Most likely yes.

> 
>> spin_barrier() seems to handle this situation fine because it just wait
>> for the head to change. I don't think we can do the same here...
>>
>> I am thinking that it may be easier to hold the write lock when doing
>> the update.
> 
> ... perhaps this is indeed better. I have to admit that I never
> fully understood the benefit of using spin_barrier() in this code
> (as opposed to the use in evtchn_destroy()).

I am not entirely sure either. It looks like it is an attempt to make 
v->virq_to_evtchn[X] visible without holding a lock.

Any holder of the lock after spin_barrier() has completed will read 0 
and not try to use the lock.

But the update of v->virq_to_evtchn[X] should have used either 
ACCESS_ONCE() or write_atomic().
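
I.e. something like (illustration only, not a concrete patch):

    write_atomic(&v->virq_to_evtchn[chn1->u.virq], 0);  /* closing side */

    port = read_atomic(&v->virq_to_evtchn[virq]);       /* sending side */

so the compiler can neither tear the accesses nor re-load the value behind
our back.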

Cheers,

-- 
Julien Grall

Re: [PATCH 11/12] evtchn: convert vIRQ lock to an r/w one
Posted by Jan Beulich 3 years, 6 months ago
On 29.09.2020 19:18, Julien Grall wrote:
> On 29/09/2020 14:37, Jan Beulich wrote:
>> On 29.09.2020 15:03, Julien Grall wrote:
>>> I am thinking that it may be easier to hold the write lock when doing
>>> the update.
>>
>> ... perhaps this is indeed better. I have to admit that I never
>> fully understood the benefit of using spin_barrier() in this code
>> (as opposed to the use in evtchn_destroy()).
> 
> I am not entirely sure either. It looks like it is an attempt to make 
> v->virq_to_evtchn[X] visible without holding a lock.
> 
> Any holder of the lock after spin_barrier() has completed will read 0 
> and not try to use the lock.

I'm not sure I follow: A holder of the lock is obviously already
making use of the lock. Or are you talking of two different locks
here (recall that before XSA-343 there was just one lock involved
in sending)?

> But the update of v->virq_to_evtchn[X] should have used either 
> ACCESS_ONCE() or write_atomic().

Of course, like in so many other places in the code base.

Jan

Re: [PATCH 11/12] evtchn: convert vIRQ lock to an r/w one
Posted by Julien Grall 3 years, 6 months ago
Hi Jan,

On 30/09/2020 07:26, Jan Beulich wrote:
> On 29.09.2020 19:18, Julien Grall wrote:
>> On 29/09/2020 14:37, Jan Beulich wrote:
>>> On 29.09.2020 15:03, Julien Grall wrote:
>>>> I am thinking that it may be easier to hold the write lock when doing
>>>> the update.
>>>
>>> ... perhaps this is indeed better. I have to admit that I never
>>> fully understood the benefit of using spin_barrier() in this code
>>> (as opposed to the use in evtchn_destroy()).
>>
>> I am not entirely sure either. It looks like it is an attempt to make
>> v->virq_to_evtchn[X] visible without holding a lock.
>>
>> Any holder of the lock after spin_barrier() has completed will read 0
>> and not try to use the lock.
> 
> I'm not sure I follow: A holder of the lock is obviously already
> making use of the lock.

My point is the barrier is meant to split the holders of the lock into two 
categories:
    - Anyone acquiring the lock before the spin_barrier() completed may 
see the port either open or closed.
    - Anyone acquiring the lock after the spin_barrier() completed will 
see a closed port.

> Or are you talking of two different locks
> here (recall that before XSA-343 there was just one lock involved
> in sending)?
> 
>> But the update of v->virq_to_evtchn[X] should have used either
>> ACCESS_ONCE() or write_atomic().
> 
> Of course, like in so many other places in the code base.

This is known. What I meant is if we are going to continue to use a 
spin_barrier() (or rw_barrier()), then we should also switch to using 
ACCESS_ONCE()/write_atomic().

Anyway, I think we discussed acquiring the write lock instead. So it 
should not be a concern.

Cheers,

-- 
Julien Grall

[PATCH 12/12] evtchn: convert domain event lock to an r/w one
Posted by Jan Beulich 3 years, 6 months ago
Especially for the use in evtchn_move_pirqs() (called when moving a vCPU
across pCPU-s) and the ones in EOI handling in PCI pass-through code,
serializing perhaps an entire domain isn't helpful when no state (which
isn't e.g. further protected by the per-channel lock) changes.

Unfortunately this implies dropping of lock profiling for this lock,
until r/w locks may get enabled for such functionality.

While ->notify_vcpu_id is now meant to be consistently updated with the
per-channel lock held for writing, an extension applies to ECS_PIRQ: The
field is also guaranteed to not change with the per-domain event lock
held. Therefore the unlink_pirq_port() call from evtchn_bind_vcpu() as
well as the link_pirq_port() one from evtchn_bind_pirq() could in
principle be moved out of the per-channel locked regions, but this
further code churn didn't seem worth it.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
RFC:
* In evtchn_bind_vcpu() the question is whether limiting the use of
  write_lock() to just the ECS_PIRQ case is really worth it.
* In flask_get_peer_sid() the question is whether we wouldn't better
  switch to using the per-channel lock.

--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -909,7 +909,7 @@ int arch_domain_soft_reset(struct domain
     if ( !is_hvm_domain(d) )
         return -EINVAL;
 
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
     for ( i = 0; i < d->nr_pirqs ; i++ )
     {
         if ( domain_pirq_to_emuirq(d, i) != IRQ_UNBOUND )
@@ -919,7 +919,7 @@ int arch_domain_soft_reset(struct domain
                 break;
         }
     }
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
 
     if ( ret )
         return ret;
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -528,9 +528,9 @@ void hvm_migrate_pirqs(struct vcpu *v)
     if ( !is_iommu_enabled(d) || !hvm_domain_irq(d)->dpci )
        return;
 
-    spin_lock(&d->event_lock);
+    read_lock(&d->event_lock);
     pt_pirq_iterate(d, migrate_pirq, v);
-    spin_unlock(&d->event_lock);
+    read_unlock(&d->event_lock);
 }
 
 static bool hvm_get_pending_event(struct vcpu *v, struct x86_event *info)
--- a/xen/arch/x86/hvm/irq.c
+++ b/xen/arch/x86/hvm/irq.c
@@ -404,9 +404,9 @@ int hvm_inject_msi(struct domain *d, uin
             {
                 int rc;
 
-                spin_lock(&d->event_lock);
+                write_lock(&d->event_lock);
                 rc = map_domain_emuirq_pirq(d, pirq, IRQ_MSI_EMU);
-                spin_unlock(&d->event_lock);
+                write_unlock(&d->event_lock);
                 if ( rc )
                     return rc;
                 info = pirq_info(d, pirq);
--- a/xen/arch/x86/hvm/vioapic.c
+++ b/xen/arch/x86/hvm/vioapic.c
@@ -203,9 +203,9 @@ static int vioapic_hwdom_map_gsi(unsigne
     {
         gprintk(XENLOG_WARNING, "vioapic: error binding GSI %u: %d\n",
                 gsi, ret);
-        spin_lock(&currd->event_lock);
+        write_lock(&currd->event_lock);
         unmap_domain_pirq(currd, pirq);
-        spin_unlock(&currd->event_lock);
+        write_unlock(&currd->event_lock);
     }
     pcidevs_unlock();
 
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -465,7 +465,7 @@ int msixtbl_pt_register(struct domain *d
     int r = -EINVAL;
 
     ASSERT(pcidevs_locked());
-    ASSERT(spin_is_locked(&d->event_lock));
+    ASSERT(rw_is_write_locked(&d->event_lock));
 
     if ( !msixtbl_initialised(d) )
         return -ENODEV;
@@ -535,7 +535,7 @@ void msixtbl_pt_unregister(struct domain
     struct msixtbl_entry *entry;
 
     ASSERT(pcidevs_locked());
-    ASSERT(spin_is_locked(&d->event_lock));
+    ASSERT(rw_is_write_locked(&d->event_lock));
 
     if ( !msixtbl_initialised(d) )
         return;
@@ -589,13 +589,13 @@ void msixtbl_pt_cleanup(struct domain *d
     if ( !msixtbl_initialised(d) )
         return;
 
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
 
     list_for_each_entry_safe( entry, temp,
                               &d->arch.hvm.msixtbl_list, list )
         del_msixtbl_entry(entry);
 
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
 }
 
 void msix_write_completion(struct vcpu *v)
@@ -719,9 +719,9 @@ int vpci_msi_arch_update(struct vpci_msi
                          msi->arch.pirq, msi->mask);
     if ( rc )
     {
-        spin_lock(&pdev->domain->event_lock);
+        write_lock(&pdev->domain->event_lock);
         unmap_domain_pirq(pdev->domain, msi->arch.pirq);
-        spin_unlock(&pdev->domain->event_lock);
+        write_unlock(&pdev->domain->event_lock);
         pcidevs_unlock();
         msi->arch.pirq = INVALID_PIRQ;
         return rc;
@@ -760,9 +760,9 @@ static int vpci_msi_enable(const struct
     rc = vpci_msi_update(pdev, data, address, vectors, pirq, mask);
     if ( rc )
     {
-        spin_lock(&pdev->domain->event_lock);
+        write_lock(&pdev->domain->event_lock);
         unmap_domain_pirq(pdev->domain, pirq);
-        spin_unlock(&pdev->domain->event_lock);
+        write_unlock(&pdev->domain->event_lock);
         pcidevs_unlock();
         return rc;
     }
@@ -807,9 +807,9 @@ static void vpci_msi_disable(const struc
         ASSERT(!rc);
     }
 
-    spin_lock(&pdev->domain->event_lock);
+    write_lock(&pdev->domain->event_lock);
     unmap_domain_pirq(pdev->domain, pirq);
-    spin_unlock(&pdev->domain->event_lock);
+    write_unlock(&pdev->domain->event_lock);
     pcidevs_unlock();
 }
 
--- a/xen/arch/x86/io_apic.c
+++ b/xen/arch/x86/io_apic.c
@@ -2413,10 +2413,10 @@ int ioapic_guest_write(unsigned long phy
     }
     if ( pirq >= 0 )
     {
-        spin_lock(&hardware_domain->event_lock);
+        write_lock(&hardware_domain->event_lock);
         ret = map_domain_pirq(hardware_domain, pirq, irq,
                               MAP_PIRQ_TYPE_GSI, NULL);
-        spin_unlock(&hardware_domain->event_lock);
+        write_unlock(&hardware_domain->event_lock);
         if ( ret < 0 )
             return ret;
     }
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -1536,7 +1536,7 @@ int pirq_guest_bind(struct vcpu *v, stru
     irq_guest_action_t *action, *newaction = NULL;
     int                 rc = 0;
 
-    WARN_ON(!spin_is_locked(&v->domain->event_lock));
+    WARN_ON(!rw_is_write_locked(&v->domain->event_lock));
     BUG_ON(!local_irq_is_enabled());
 
  retry:
@@ -1756,7 +1756,7 @@ void pirq_guest_unbind(struct domain *d,
     struct irq_desc *desc;
     int irq = 0;
 
-    WARN_ON(!spin_is_locked(&d->event_lock));
+    WARN_ON(!rw_is_write_locked(&d->event_lock));
 
     BUG_ON(!local_irq_is_enabled());
     desc = pirq_spin_lock_irq_desc(pirq, NULL);
@@ -1793,7 +1793,7 @@ static bool pirq_guest_force_unbind(stru
     unsigned int i;
     bool bound = false;
 
-    WARN_ON(!spin_is_locked(&d->event_lock));
+    WARN_ON(!rw_is_write_locked(&d->event_lock));
 
     BUG_ON(!local_irq_is_enabled());
     desc = pirq_spin_lock_irq_desc(pirq, NULL);
@@ -2037,7 +2037,7 @@ int get_free_pirq(struct domain *d, int
 {
     int i;
 
-    ASSERT(spin_is_locked(&d->event_lock));
+    ASSERT(rw_is_write_locked(&d->event_lock));
 
     if ( type == MAP_PIRQ_TYPE_GSI )
     {
@@ -2062,7 +2062,7 @@ int get_free_pirqs(struct domain *d, uns
 {
     unsigned int i, found = 0;
 
-    ASSERT(spin_is_locked(&d->event_lock));
+    ASSERT(rw_is_write_locked(&d->event_lock));
 
     for ( i = d->nr_pirqs - 1; i >= nr_irqs_gsi; --i )
         if ( is_free_pirq(d, pirq_info(d, i)) )
@@ -2090,7 +2090,7 @@ int map_domain_pirq(
     DECLARE_BITMAP(prepared, MAX_MSI_IRQS) = {};
     DECLARE_BITMAP(granted, MAX_MSI_IRQS) = {};
 
-    ASSERT(spin_is_locked(&d->event_lock));
+    ASSERT(rw_is_write_locked(&d->event_lock));
 
     if ( !irq_access_permitted(current->domain, irq))
         return -EPERM;
@@ -2309,7 +2309,7 @@ int unmap_domain_pirq(struct domain *d,
         return -EINVAL;
 
     ASSERT(pcidevs_locked());
-    ASSERT(spin_is_locked(&d->event_lock));
+    ASSERT(rw_is_write_locked(&d->event_lock));
 
     info = pirq_info(d, pirq);
     if ( !info || (irq = info->arch.irq) <= 0 )
@@ -2436,13 +2436,13 @@ void free_domain_pirqs(struct domain *d)
     int i;
 
     pcidevs_lock();
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
 
     for ( i = 0; i < d->nr_pirqs; i++ )
         if ( domain_pirq_to_irq(d, i) > 0 )
             unmap_domain_pirq(d, i);
 
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
     pcidevs_unlock();
 }
 
@@ -2685,7 +2685,7 @@ int map_domain_emuirq_pirq(struct domain
     int old_emuirq = IRQ_UNBOUND, old_pirq = IRQ_UNBOUND;
     struct pirq *info;
 
-    ASSERT(spin_is_locked(&d->event_lock));
+    ASSERT(rw_is_write_locked(&d->event_lock));
 
     if ( !is_hvm_domain(d) )
         return -EINVAL;
@@ -2751,7 +2751,7 @@ int unmap_domain_pirq_emuirq(struct doma
     if ( (pirq < 0) || (pirq >= d->nr_pirqs) )
         return -EINVAL;
 
-    ASSERT(spin_is_locked(&d->event_lock));
+    ASSERT(rw_is_write_locked(&d->event_lock));
 
     emuirq = domain_pirq_to_emuirq(d, pirq);
     if ( emuirq == IRQ_UNBOUND )
@@ -2799,7 +2799,7 @@ static int allocate_pirq(struct domain *
 {
     int current_pirq;
 
-    ASSERT(spin_is_locked(&d->event_lock));
+    ASSERT(rw_is_write_locked(&d->event_lock));
     current_pirq = domain_irq_to_pirq(d, irq);
     if ( pirq < 0 )
     {
@@ -2871,7 +2871,7 @@ int allocate_and_map_gsi_pirq(struct dom
     }
 
     /* Verify or get pirq. */
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
     pirq = allocate_pirq(d, index, *pirq_p, irq, MAP_PIRQ_TYPE_GSI, NULL);
     if ( pirq < 0 )
     {
@@ -2884,7 +2884,7 @@ int allocate_and_map_gsi_pirq(struct dom
         *pirq_p = pirq;
 
  done:
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
 
     return ret;
 }
@@ -2925,7 +2925,7 @@ int allocate_and_map_msi_pirq(struct dom
 
     pcidevs_lock();
     /* Verify or get pirq. */
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
     pirq = allocate_pirq(d, index, *pirq_p, irq, type, &msi->entry_nr);
     if ( pirq < 0 )
     {
@@ -2938,7 +2938,7 @@ int allocate_and_map_msi_pirq(struct dom
         *pirq_p = pirq;
 
  done:
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
     pcidevs_unlock();
     if ( ret )
     {
--- a/xen/arch/x86/physdev.c
+++ b/xen/arch/x86/physdev.c
@@ -34,7 +34,7 @@ static int physdev_hvm_map_pirq(
 
     ASSERT(!is_hardware_domain(d));
 
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
     switch ( type )
     {
     case MAP_PIRQ_TYPE_GSI: {
@@ -84,7 +84,7 @@ static int physdev_hvm_map_pirq(
         break;
     }
 
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
     return ret;
 }
 
@@ -154,18 +154,18 @@ int physdev_unmap_pirq(domid_t domid, in
 
     if ( is_hvm_domain(d) && has_pirq(d) )
     {
-        spin_lock(&d->event_lock);
+        write_lock(&d->event_lock);
         if ( domain_pirq_to_emuirq(d, pirq) != IRQ_UNBOUND )
             ret = unmap_domain_pirq_emuirq(d, pirq);
-        spin_unlock(&d->event_lock);
+        write_unlock(&d->event_lock);
         if ( domid == DOMID_SELF || ret )
             goto free_domain;
     }
 
     pcidevs_lock();
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
     ret = unmap_domain_pirq(d, pirq);
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
     pcidevs_unlock();
 
  free_domain:
@@ -192,10 +192,10 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_H
         ret = -EINVAL;
         if ( eoi.irq >= currd->nr_pirqs )
             break;
-        spin_lock(&currd->event_lock);
+        read_lock(&currd->event_lock);
         pirq = pirq_info(currd, eoi.irq);
         if ( !pirq ) {
-            spin_unlock(&currd->event_lock);
+            read_unlock(&currd->event_lock);
             break;
         }
         if ( currd->arch.auto_unmask )
@@ -214,7 +214,7 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_H
                     && hvm_irq->gsi_assert_count[gsi] )
                 send_guest_pirq(currd, pirq);
         }
-        spin_unlock(&currd->event_lock);
+        read_unlock(&currd->event_lock);
         ret = 0;
         break;
     }
@@ -626,7 +626,7 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_H
         if ( copy_from_guest(&out, arg, 1) != 0 )
             break;
 
-        spin_lock(&currd->event_lock);
+        write_lock(&currd->event_lock);
 
         ret = get_free_pirq(currd, out.type);
         if ( ret >= 0 )
@@ -639,7 +639,7 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_H
                 ret = -ENOMEM;
         }
 
-        spin_unlock(&currd->event_lock);
+        write_unlock(&currd->event_lock);
 
         if ( ret >= 0 )
         {
--- a/xen/arch/x86/pv/shim.c
+++ b/xen/arch/x86/pv/shim.c
@@ -448,7 +448,7 @@ static long pv_shim_event_channel_op(int
         if ( rc )                                                           \
             break;                                                          \
                                                                             \
-        spin_lock(&d->event_lock);                                          \
+        write_lock(&d->event_lock);                                         \
         rc = evtchn_allocate_port(d, op.port_field);                        \
         if ( rc )                                                           \
         {                                                                   \
@@ -457,7 +457,7 @@ static long pv_shim_event_channel_op(int
         }                                                                   \
         else                                                                \
             evtchn_reserve(d, op.port_field);                               \
-        spin_unlock(&d->event_lock);                                        \
+        write_unlock(&d->event_lock);                                       \
                                                                             \
         if ( !rc && __copy_to_guest(arg, &op, 1) )                          \
             rc = -EFAULT;                                                   \
@@ -585,11 +585,11 @@ static long pv_shim_event_channel_op(int
         if ( rc )
             break;
 
-        spin_lock(&d->event_lock);
+        write_lock(&d->event_lock);
         rc = evtchn_allocate_port(d, ipi.port);
         if ( rc )
         {
-            spin_unlock(&d->event_lock);
+            write_unlock(&d->event_lock);
 
             close.port = ipi.port;
             BUG_ON(xen_hypercall_event_channel_op(EVTCHNOP_close, &close));
@@ -598,7 +598,7 @@ static long pv_shim_event_channel_op(int
 
         evtchn_assign_vcpu(d, ipi.port, ipi.vcpu);
         evtchn_reserve(d, ipi.port);
-        spin_unlock(&d->event_lock);
+        write_unlock(&d->event_lock);
 
         if ( __copy_to_guest(arg, &ipi, 1) )
             rc = -EFAULT;
--- a/xen/common/event_channel.c
+++ b/xen/common/event_channel.c
@@ -260,7 +260,7 @@ static long evtchn_alloc_unbound(evtchn_
     if ( d == NULL )
         return -ESRCH;
 
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
 
     if ( (port = get_free_port(d)) < 0 )
         ERROR_EXIT_DOM(port, d);
@@ -283,7 +283,7 @@ static long evtchn_alloc_unbound(evtchn_
 
  out:
     check_free_port(d, port);
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
     rcu_unlock_domain(d);
 
     return rc;
@@ -336,14 +336,14 @@ static long evtchn_bind_interdomain(evtc
     /* Avoid deadlock by first acquiring lock of domain with smaller id. */
     if ( ld < rd )
     {
-        spin_lock(&ld->event_lock);
-        spin_lock(&rd->event_lock);
+        write_lock(&ld->event_lock);
+        read_lock(&rd->event_lock);
     }
     else
     {
         if ( ld != rd )
-            spin_lock(&rd->event_lock);
-        spin_lock(&ld->event_lock);
+            read_lock(&rd->event_lock);
+        write_lock(&ld->event_lock);
     }
 
     if ( (lport = get_free_port(ld)) < 0 )
@@ -384,9 +384,9 @@ static long evtchn_bind_interdomain(evtc
 
  out:
     check_free_port(ld, lport);
-    spin_unlock(&ld->event_lock);
+    write_unlock(&ld->event_lock);
     if ( ld != rd )
-        spin_unlock(&rd->event_lock);
+        read_unlock(&rd->event_lock);
     
     rcu_unlock_domain(rd);
 
@@ -418,7 +418,7 @@ int evtchn_bind_virq(evtchn_bind_virq_t
     if ( (v = domain_vcpu(d, vcpu)) == NULL )
         return -ENOENT;
 
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
 
     if ( v->virq_to_evtchn[virq] != 0 )
         ERROR_EXIT(-EEXIST);
@@ -451,7 +451,7 @@ int evtchn_bind_virq(evtchn_bind_virq_t
     v->virq_to_evtchn[virq] = bind->port = port;
 
  out:
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
 
     return rc;
 }
@@ -468,7 +468,7 @@ static long evtchn_bind_ipi(evtchn_bind_
     if ( domain_vcpu(d, vcpu) == NULL )
         return -ENOENT;
 
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
 
     if ( (port = get_free_port(d)) < 0 )
         ERROR_EXIT(port);
@@ -486,7 +486,7 @@ static long evtchn_bind_ipi(evtchn_bind_
     bind->port = port;
 
  out:
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
 
     return rc;
 }
@@ -533,7 +533,7 @@ static long evtchn_bind_pirq(evtchn_bind
     if ( !is_hvm_domain(d) && !pirq_access_permitted(d, pirq) )
         return -EPERM;
 
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
 
     if ( pirq_to_evtchn(d, pirq) != 0 )
         ERROR_EXIT(-EEXIST);
@@ -573,7 +573,7 @@ static long evtchn_bind_pirq(evtchn_bind
 
  out:
     check_free_port(d, port);
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
 
     return rc;
 }
@@ -589,7 +589,7 @@ int evtchn_close(struct domain *d1, int
     unsigned long  flags;
 
  again:
-    spin_lock(&d1->event_lock);
+    write_lock(&d1->event_lock);
 
     if ( !port_is_valid(d1, port1) )
     {
@@ -657,13 +657,11 @@ int evtchn_close(struct domain *d1, int
                 BUG();
 
             if ( d1 < d2 )
-            {
-                spin_lock(&d2->event_lock);
-            }
+                read_lock(&d2->event_lock);
             else if ( d1 != d2 )
             {
-                spin_unlock(&d1->event_lock);
-                spin_lock(&d2->event_lock);
+                write_unlock(&d1->event_lock);
+                read_lock(&d2->event_lock);
                 goto again;
             }
         }
@@ -710,11 +708,11 @@ int evtchn_close(struct domain *d1, int
     if ( d2 != NULL )
     {
         if ( d1 != d2 )
-            spin_unlock(&d2->event_lock);
+            read_unlock(&d2->event_lock);
         put_domain(d2);
     }
 
-    spin_unlock(&d1->event_lock);
+    write_unlock(&d1->event_lock);
 
     return rc;
 }
@@ -939,7 +937,7 @@ int evtchn_status(evtchn_status_t *statu
     if ( d == NULL )
         return -ESRCH;
 
-    spin_lock(&d->event_lock);
+    read_lock(&d->event_lock);
 
     if ( !port_is_valid(d, port) )
     {
@@ -992,7 +990,7 @@ int evtchn_status(evtchn_status_t *statu
     status->vcpu = chn->notify_vcpu_id;
 
  out:
-    spin_unlock(&d->event_lock);
+    read_unlock(&d->event_lock);
     rcu_unlock_domain(d);
 
     return rc;
@@ -1005,20 +1003,19 @@ long evtchn_bind_vcpu(unsigned int port,
     struct evtchn *chn;
     long           rc = 0;
     struct vcpu   *v;
+    bool           write_locked = false;
+    unsigned long  flags;
 
     /* Use the vcpu info to prevent speculative out-of-bound accesses */
     if ( (v = domain_vcpu(d, vcpu_id)) == NULL )
         return -ENOENT;
 
-    spin_lock(&d->event_lock);
-
     if ( !port_is_valid(d, port) )
-    {
-        rc = -EINVAL;
-        goto out;
-    }
+        return -EINVAL;
 
     chn = evtchn_from_port(d, port);
+ again:
+    spin_lock_irqsave(&chn->lock, flags);
 
     /* Guest cannot re-bind a Xen-attached event channel. */
     if ( unlikely(consumer_is_xen(chn)) )
@@ -1042,19 +1039,32 @@ long evtchn_bind_vcpu(unsigned int port,
     case ECS_PIRQ:
         if ( chn->notify_vcpu_id == v->vcpu_id )
             break;
+        if ( !write_locked )
+        {
+            spin_unlock_irqrestore(&chn->lock, flags);
+            write_lock(&d->event_lock);
+            write_locked = true;
+            goto again;
+        }
+
         unlink_pirq_port(chn, d->vcpu[chn->notify_vcpu_id]);
         chn->notify_vcpu_id = v->vcpu_id;
+        spin_unlock_irqrestore(&chn->lock, flags);
         pirq_set_affinity(d, chn->u.pirq.irq,
                           cpumask_of(v->processor));
         link_pirq_port(port, chn, v);
-        break;
+
+        write_unlock(&d->event_lock);
+        return 0;
     default:
         rc = -EINVAL;
         break;
     }
 
  out:
-    spin_unlock(&d->event_lock);
+    spin_unlock_irqrestore(&chn->lock, flags);
+    if ( write_locked )
+        write_unlock(&d->event_lock);
 
     return rc;
 }
@@ -1098,7 +1108,7 @@ int evtchn_reset(struct domain *d, bool
     if ( d != current->domain && !d->controller_pause_count )
         return -EINVAL;
 
-    spin_lock(&d->event_lock);
+    read_lock(&d->event_lock);
 
     /*
      * If we are resuming, then start where we stopped. Otherwise, check
@@ -1109,7 +1119,7 @@ int evtchn_reset(struct domain *d, bool
     if ( i > d->next_evtchn )
         d->next_evtchn = i;
 
-    spin_unlock(&d->event_lock);
+    read_unlock(&d->event_lock);
 
     if ( !i )
         return -EBUSY;
@@ -1121,14 +1131,14 @@ int evtchn_reset(struct domain *d, bool
         /* NB: Choice of frequency is arbitrary. */
         if ( !(i & 0x3f) && hypercall_preempt_check() )
         {
-            spin_lock(&d->event_lock);
+            write_lock(&d->event_lock);
             d->next_evtchn = i;
-            spin_unlock(&d->event_lock);
+            write_unlock(&d->event_lock);
             return -ERESTART;
         }
     }
 
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
 
     d->next_evtchn = 0;
 
@@ -1141,7 +1151,7 @@ int evtchn_reset(struct domain *d, bool
         evtchn_2l_init(d);
     }
 
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
 
     return rc;
 }
@@ -1154,11 +1164,11 @@ static long evtchn_set_priority(const st
     long ret;
     unsigned long flags;
 
-    spin_lock(&d->event_lock);
+    read_lock(&d->event_lock);
 
     if ( !port_is_valid(d, port) )
     {
-        spin_unlock(&d->event_lock);
+        read_unlock(&d->event_lock);
         return -EINVAL;
     }
 
@@ -1167,7 +1177,7 @@ static long evtchn_set_priority(const st
     ret = evtchn_port_set_priority(d, chn, set_priority->priority);
     spin_unlock_irqrestore(&chn->lock, flags);
 
-    spin_unlock(&d->event_lock);
+    read_unlock(&d->event_lock);
 
     return ret;
 }
@@ -1337,7 +1347,7 @@ int alloc_unbound_xen_event_channel(
     int            port, rc;
     unsigned long  flags;
 
-    spin_lock(&ld->event_lock);
+    write_lock(&ld->event_lock);
 
     port = rc = get_free_port(ld);
     if ( rc < 0 )
@@ -1365,7 +1375,7 @@ int alloc_unbound_xen_event_channel(
 
  out:
     check_free_port(ld, port);
-    spin_unlock(&ld->event_lock);
+    write_unlock(&ld->event_lock);
 
     return rc < 0 ? rc : port;
 }
@@ -1453,7 +1463,8 @@ int evtchn_init(struct domain *d, unsign
         return -ENOMEM;
     d->valid_evtchns = EVTCHNS_PER_BUCKET;
 
-    spin_lock_init_prof(d, event_lock);
+    rwlock_init(&d->event_lock);
+
     if ( get_free_port(d) != 0 )
     {
         free_evtchn_bucket(d, d->evtchn);
@@ -1480,7 +1491,7 @@ int evtchn_destroy(struct domain *d)
 
     /* After this barrier no new event-channel allocations can occur. */
     BUG_ON(!d->is_dying);
-    spin_barrier(&d->event_lock);
+    rw_barrier(&d->event_lock);
 
     /* Close all existing event channels. */
     for ( i = d->valid_evtchns; --i; )
@@ -1538,13 +1549,13 @@ void evtchn_move_pirqs(struct vcpu *v)
     unsigned int port;
     struct evtchn *chn;
 
-    spin_lock(&d->event_lock);
+    read_lock(&d->event_lock);
     for ( port = v->pirq_evtchn_head; port; port = chn->u.pirq.next_port )
     {
         chn = evtchn_from_port(d, port);
         pirq_set_affinity(d, chn->u.pirq.irq, mask);
     }
-    spin_unlock(&d->event_lock);
+    read_unlock(&d->event_lock);
 }
 
 
@@ -1557,7 +1568,7 @@ static void domain_dump_evtchn_info(stru
            "Polling vCPUs: {%*pbl}\n"
            "    port [p/m/s]\n", d->domain_id, d->max_vcpus, d->poll_mask);
 
-    spin_lock(&d->event_lock);
+    read_lock(&d->event_lock);
 
     for ( port = 1; port_is_valid(d, port); ++port )
     {
@@ -1604,7 +1615,7 @@ static void domain_dump_evtchn_info(stru
         }
     }
 
-    spin_unlock(&d->event_lock);
+    read_unlock(&d->event_lock);
 }
 
 static void dump_evtchn_info(unsigned char key)
--- a/xen/common/event_fifo.c
+++ b/xen/common/event_fifo.c
@@ -561,7 +561,7 @@ int evtchn_fifo_init_control(struct evtc
     if ( offset & (8 - 1) )
         return -EINVAL;
 
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
 
     /*
      * If this is the first control block, setup an empty event array
@@ -593,13 +593,13 @@ int evtchn_fifo_init_control(struct evtc
     else
         rc = map_control_block(v, gfn, offset);
 
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
 
     return rc;
 
  error:
     evtchn_fifo_destroy(d);
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
     return rc;
 }
 
@@ -652,9 +652,9 @@ int evtchn_fifo_expand_array(const struc
     if ( !d->evtchn_fifo )
         return -EOPNOTSUPP;
 
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
     rc = add_page_to_event_array(d, expand_array->array_gfn);
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
 
     return rc;
 }
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -105,7 +105,7 @@ static void pt_pirq_softirq_reset(struct
 {
     struct domain *d = pirq_dpci->dom;
 
-    ASSERT(spin_is_locked(&d->event_lock));
+    ASSERT(rw_is_write_locked(&d->event_lock));
 
     switch ( cmpxchg(&pirq_dpci->state, 1 << STATE_SCHED, 0) )
     {
@@ -162,7 +162,7 @@ static void pt_irq_time_out(void *data)
     const struct hvm_irq_dpci *dpci;
     const struct dev_intx_gsi_link *digl;
 
-    spin_lock(&irq_map->dom->event_lock);
+    read_lock(&irq_map->dom->event_lock);
 
     if ( irq_map->flags & HVM_IRQ_DPCI_IDENTITY_GSI )
     {
@@ -177,7 +177,7 @@ static void pt_irq_time_out(void *data)
         hvm_gsi_deassert(irq_map->dom, dpci_pirq(irq_map)->pirq);
         irq_map->flags |= HVM_IRQ_DPCI_EOI_LATCH;
         pt_irq_guest_eoi(irq_map->dom, irq_map, NULL);
-        spin_unlock(&irq_map->dom->event_lock);
+        read_unlock(&irq_map->dom->event_lock);
         return;
     }
 
@@ -185,7 +185,7 @@ static void pt_irq_time_out(void *data)
     if ( unlikely(!dpci) )
     {
         ASSERT_UNREACHABLE();
-        spin_unlock(&irq_map->dom->event_lock);
+        read_unlock(&irq_map->dom->event_lock);
         return;
     }
     list_for_each_entry ( digl, &irq_map->digl_list, list )
@@ -204,7 +204,7 @@ static void pt_irq_time_out(void *data)
 
     pt_pirq_iterate(irq_map->dom, pt_irq_guest_eoi, NULL);
 
-    spin_unlock(&irq_map->dom->event_lock);
+    read_unlock(&irq_map->dom->event_lock);
 }
 
 struct hvm_irq_dpci *domain_get_irq_dpci(const struct domain *d)
@@ -288,7 +288,7 @@ int pt_irq_create_bind(
         return -EINVAL;
 
  restart:
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
 
     hvm_irq_dpci = domain_get_irq_dpci(d);
     if ( !hvm_irq_dpci && !is_hardware_domain(d) )
@@ -304,7 +304,7 @@ int pt_irq_create_bind(
         hvm_irq_dpci = xzalloc(struct hvm_irq_dpci);
         if ( hvm_irq_dpci == NULL )
         {
-            spin_unlock(&d->event_lock);
+            write_unlock(&d->event_lock);
             return -ENOMEM;
         }
         for ( i = 0; i < NR_HVM_DOMU_IRQS; i++ )
@@ -316,7 +316,7 @@ int pt_irq_create_bind(
     info = pirq_get_info(d, pirq);
     if ( !info )
     {
-        spin_unlock(&d->event_lock);
+        write_unlock(&d->event_lock);
         return -ENOMEM;
     }
     pirq_dpci = pirq_dpci(info);
@@ -331,7 +331,7 @@ int pt_irq_create_bind(
      */
     if ( pt_pirq_softirq_active(pirq_dpci) )
     {
-        spin_unlock(&d->event_lock);
+        write_unlock(&d->event_lock);
         cpu_relax();
         goto restart;
     }
@@ -389,7 +389,7 @@ int pt_irq_create_bind(
                 pirq_dpci->dom = NULL;
                 pirq_dpci->flags = 0;
                 pirq_cleanup_check(info, d);
-                spin_unlock(&d->event_lock);
+                write_unlock(&d->event_lock);
                 return rc;
             }
         }
@@ -399,7 +399,7 @@ int pt_irq_create_bind(
 
             if ( (pirq_dpci->flags & mask) != mask )
             {
-                spin_unlock(&d->event_lock);
+                write_unlock(&d->event_lock);
                 return -EBUSY;
             }
 
@@ -423,7 +423,7 @@ int pt_irq_create_bind(
 
         dest_vcpu_id = hvm_girq_dest_2_vcpu_id(d, dest, dest_mode);
         pirq_dpci->gmsi.dest_vcpu_id = dest_vcpu_id;
-        spin_unlock(&d->event_lock);
+        write_unlock(&d->event_lock);
 
         pirq_dpci->gmsi.posted = false;
         vcpu = (dest_vcpu_id >= 0) ? d->vcpu[dest_vcpu_id] : NULL;
@@ -483,7 +483,7 @@ int pt_irq_create_bind(
 
             if ( !digl || !girq )
             {
-                spin_unlock(&d->event_lock);
+                write_unlock(&d->event_lock);
                 xfree(girq);
                 xfree(digl);
                 return -ENOMEM;
@@ -510,7 +510,7 @@ int pt_irq_create_bind(
             if ( pt_irq_bind->irq_type != PT_IRQ_TYPE_PCI ||
                  pirq >= hvm_domain_irq(d)->nr_gsis )
             {
-                spin_unlock(&d->event_lock);
+                write_unlock(&d->event_lock);
 
                 return -EINVAL;
             }
@@ -546,7 +546,7 @@ int pt_irq_create_bind(
 
                     if ( mask < 0 || trigger_mode < 0 )
                     {
-                        spin_unlock(&d->event_lock);
+                        write_unlock(&d->event_lock);
 
                         ASSERT_UNREACHABLE();
                         return -EINVAL;
@@ -594,14 +594,14 @@ int pt_irq_create_bind(
                 }
                 pirq_dpci->flags = 0;
                 pirq_cleanup_check(info, d);
-                spin_unlock(&d->event_lock);
+                write_unlock(&d->event_lock);
                 xfree(girq);
                 xfree(digl);
                 return rc;
             }
         }
 
-        spin_unlock(&d->event_lock);
+        write_unlock(&d->event_lock);
 
         if ( iommu_verbose )
         {
@@ -619,7 +619,7 @@ int pt_irq_create_bind(
     }
 
     default:
-        spin_unlock(&d->event_lock);
+        write_unlock(&d->event_lock);
         return -EOPNOTSUPP;
     }
 
@@ -672,13 +672,13 @@ int pt_irq_destroy_bind(
         return -EOPNOTSUPP;
     }
 
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
 
     hvm_irq_dpci = domain_get_irq_dpci(d);
 
     if ( !hvm_irq_dpci && !is_hardware_domain(d) )
     {
-        spin_unlock(&d->event_lock);
+        write_unlock(&d->event_lock);
         return -EINVAL;
     }
 
@@ -711,7 +711,7 @@ int pt_irq_destroy_bind(
 
         if ( girq )
         {
-            spin_unlock(&d->event_lock);
+            write_unlock(&d->event_lock);
             return -EINVAL;
         }
 
@@ -755,7 +755,7 @@ int pt_irq_destroy_bind(
         pirq_cleanup_check(pirq, d);
     }
 
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
 
     if ( what && iommu_verbose )
     {
@@ -799,7 +799,7 @@ int pt_pirq_iterate(struct domain *d,
     unsigned int pirq = 0, n, i;
     struct pirq *pirqs[8];
 
-    ASSERT(spin_is_locked(&d->event_lock));
+    ASSERT(rw_is_locked(&d->event_lock));
 
     do {
         n = radix_tree_gang_lookup(&d->pirq_tree, (void **)pirqs, pirq,
@@ -880,9 +880,9 @@ void hvm_dpci_msi_eoi(struct domain *d,
          (!hvm_domain_irq(d)->dpci && !is_hardware_domain(d)) )
        return;
 
-    spin_lock(&d->event_lock);
+    read_lock(&d->event_lock);
     pt_pirq_iterate(d, _hvm_dpci_msi_eoi, (void *)(long)vector);
-    spin_unlock(&d->event_lock);
+    read_unlock(&d->event_lock);
 }
 
 static void hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci)
@@ -893,7 +893,7 @@ static void hvm_dirq_assist(struct domai
         return;
     }
 
-    spin_lock(&d->event_lock);
+    read_lock(&d->event_lock);
     if ( test_and_clear_bool(pirq_dpci->masked) )
     {
         struct pirq *pirq = dpci_pirq(pirq_dpci);
@@ -947,7 +947,7 @@ static void hvm_dirq_assist(struct domai
     }
 
  out:
-    spin_unlock(&d->event_lock);
+    read_unlock(&d->event_lock);
 }
 
 static void hvm_pirq_eoi(struct pirq *pirq,
@@ -1012,7 +1012,7 @@ void hvm_dpci_eoi(struct domain *d, unsi
 
     if ( is_hardware_domain(d) )
     {
-        spin_lock(&d->event_lock);
+        read_lock(&d->event_lock);
         hvm_gsi_eoi(d, guest_gsi, ent);
         goto unlock;
     }
@@ -1023,7 +1023,7 @@ void hvm_dpci_eoi(struct domain *d, unsi
         return;
     }
 
-    spin_lock(&d->event_lock);
+    read_lock(&d->event_lock);
     hvm_irq_dpci = domain_get_irq_dpci(d);
 
     if ( !hvm_irq_dpci )
@@ -1033,7 +1033,7 @@ void hvm_dpci_eoi(struct domain *d, unsi
         __hvm_dpci_eoi(d, girq, ent);
 
 unlock:
-    spin_unlock(&d->event_lock);
+    read_unlock(&d->event_lock);
 }
 
 /*
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -883,7 +883,7 @@ static int pci_clean_dpci_irqs(struct do
     if ( !is_hvm_domain(d) )
         return 0;
 
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
     hvm_irq_dpci = domain_get_irq_dpci(d);
     if ( hvm_irq_dpci != NULL )
     {
@@ -901,14 +901,14 @@ static int pci_clean_dpci_irqs(struct do
             ret = pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
         if ( ret )
         {
-            spin_unlock(&d->event_lock);
+            write_unlock(&d->event_lock);
             return ret;
         }
 
         hvm_domain_irq(d)->dpci = NULL;
         free_hvm_irq_dpci(hvm_irq_dpci);
     }
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
     return 0;
 }
 
--- a/xen/drivers/passthrough/vtd/x86/hvm.c
+++ b/xen/drivers/passthrough/vtd/x86/hvm.c
@@ -54,7 +54,7 @@ void hvm_dpci_isairq_eoi(struct domain *
     if ( !is_iommu_enabled(d) )
         return;
 
-    spin_lock(&d->event_lock);
+    read_lock(&d->event_lock);
 
     dpci = domain_get_irq_dpci(d);
 
@@ -63,5 +63,5 @@ void hvm_dpci_isairq_eoi(struct domain *
         /* Multiple mirq may be mapped to one isa irq */
         pt_pirq_iterate(d, _hvm_dpci_isairq_eoi, (void *)(long)isairq);
     }
-    spin_unlock(&d->event_lock);
+    read_unlock(&d->event_lock);
 }
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -373,7 +373,7 @@ struct domain
     unsigned int     xen_evtchns;
     /* Port to resume from in evtchn_reset(), when in a continuation. */
     unsigned int     next_evtchn;
-    spinlock_t       event_lock;
+    rwlock_t         event_lock;
     const struct evtchn_port_ops *evtchn_port_ops;
     struct evtchn_fifo_domain *evtchn_fifo;
 
--- a/xen/xsm/flask/flask_op.c
+++ b/xen/xsm/flask/flask_op.c
@@ -555,7 +555,7 @@ static int flask_get_peer_sid(struct xen
     struct evtchn *chn;
     struct domain_security_struct *dsec;
 
-    spin_lock(&d->event_lock);
+    read_lock(&d->event_lock);
 
     if ( !port_is_valid(d, arg->evtchn) )
         goto out;
@@ -573,7 +573,7 @@ static int flask_get_peer_sid(struct xen
     rv = 0;
 
  out:
-    spin_unlock(&d->event_lock);
+    read_unlock(&d->event_lock);
     return rv;
 }
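
For readers following the conversion, here is a minimal stand-alone sketch
(not part of the patch; all names are made up for illustration) of the
discipline the series adopts: paths that only look up per-port state take
the event lock for reading, while paths that allocate or tear down ports
take it for writing. POSIX rwlocks stand in for Xen's rwlock_t here.

    /* Illustrative sketch only -- hypothetical names, POSIX rwlocks. */
    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define NR_PORTS 64

    struct fake_domain {
        pthread_rwlock_t event_lock;      /* stands in for d->event_lock */
        bool port_in_use[NR_PORTS];
    };

    /* Status-style queries only read state: a shared (read) lock suffices. */
    static bool port_bound(struct fake_domain *d, unsigned int port)
    {
        bool in_use;

        pthread_rwlock_rdlock(&d->event_lock);
        in_use = port < NR_PORTS && d->port_in_use[port];
        pthread_rwlock_unlock(&d->event_lock);

        return in_use;
    }

    /* Allocation mutates state: it must hold the exclusive (write) lock. */
    static int alloc_port(struct fake_domain *d)
    {
        int port = -1;

        pthread_rwlock_wrlock(&d->event_lock);
        for (unsigned int i = 0; i < NR_PORTS; i++)
            if (!d->port_in_use[i]) {
                d->port_in_use[i] = true;
                port = i;
                break;
            }
        pthread_rwlock_unlock(&d->event_lock);

        return port;                      /* -1 if no free port */
    }

    int main(void)
    {
        struct fake_domain d = { .event_lock = PTHREAD_RWLOCK_INITIALIZER };
        int port = alloc_port(&d);

        printf("allocated port %d, bound: %d\n", port, port_bound(&d, port));
        return 0;
    }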