[PATCH for-4.16] Revert "domctl: improve locking during domain destruction"

Roger Pau Monne posted 1 patch 2 years, 5 months ago
Test gitlab-ci failed
Patches applied successfully
git fetch https://gitlab.com/xen-project/patchew/xen tags/patchew/20211109143128.56089-1-roger.pau@citrix.com
There is a newer version of this series
[PATCH for-4.16] Revert "domctl: improve locking during domain destruction"
Posted by Roger Pau Monne 2 years, 5 months ago
This reverts commit 228ab9992ffb1d8f9d2475f2581e68b2913acb88.

Performance analysis has shown that dropping the domctl lock during
domain destruction greatly increases the contention in the heap_lock,
thus making parallel destruction of domains slower.

The following lockperf data shows the difference between the current
code and the reverted one:

lock:  3342357(2.268295505s), block:  3263853(18.556650797s)
lock:  2788704(0.362311723s), block:   222681( 0.091152276s)

Given the current point in the release, revert the commit and
reinstate holding the domctl lock during domain destruction. Further
work should be done in order to re-add more fine grained locking to
the domain destruction path once a proper solution to avoid the
heap_lock contention is found.

Reported-by: Hongyan Xia <hongyxia@amazon.com>
Reported-by: Dmitry Isaikin <isaikin-dmitry@yandex.ru>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Ian Jackson <iwj@xenproject.org>

Since this is a revert and not new code I think the risk is lower.
There's however some risk, as the original commit was from 2017, and
hence the surrounding code has changed a bit. It's also a possibility
that some other parts of the domain destruction code now rely on this
more fine grained locking. Local tests however haven't shown issues.
---
 xen/common/domain.c | 12 ++----------
 xen/common/domctl.c |  5 +----
 2 files changed, 3 insertions(+), 14 deletions(-)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index 56d47dd664..093bb4403f 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -917,21 +917,13 @@ int domain_kill(struct domain *d)
     if ( d == current->domain )
         return -EINVAL;
 
-    /* Protected by d->domain_lock. */
+    /* Protected by domctl_lock. */
     switch ( d->is_dying )
     {
     case DOMDYING_alive:
-        domain_unlock(d);
         domain_pause(d);
-        domain_lock(d);
-        /*
-         * With the domain lock dropped, d->is_dying may have changed. Call
-         * ourselves recursively if so, which is safe as then we won't come
-         * back here.
-         */
-        if ( d->is_dying != DOMDYING_alive )
-            return domain_kill(d);
         d->is_dying = DOMDYING_dying;
+        spin_barrier(&d->domain_lock);
         argo_destroy(d);
         vnuma_destroy(d->vnuma);
         domain_set_outstanding_pages(d, 0);
diff --git a/xen/common/domctl.c b/xen/common/domctl.c
index 271862ae58..879a2adcbe 100644
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -497,14 +497,11 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
         break;
 
     case XEN_DOMCTL_destroydomain:
-        domctl_lock_release();
-        domain_lock(d);
         ret = domain_kill(d);
-        domain_unlock(d);
         if ( ret == -ERESTART )
             ret = hypercall_create_continuation(
                 __HYPERVISOR_domctl, "h", u_domctl);
-        goto domctl_out_unlock_domonly;
+        break;
 
     case XEN_DOMCTL_setnodeaffinity:
     {
-- 
2.33.0


Re: [PATCH for-4.16] Revert "domctl: improve locking during domain destruction"
Posted by Andrew Cooper 2 years, 5 months ago
On 09/11/2021 14:31, Roger Pau Monne wrote:
> This reverts commit 228ab9992ffb1d8f9d2475f2581e68b2913acb88.
>
> Performance analysis has shown that dropping the domctl lock during
> domain destruction greatly increases the contention in the heap_lock,
> thus making parallel destruction of domains slower.
>
> The following lockperf data shows the difference between the current
> code and the reverted one:
>
> lock:  3342357(2.268295505s), block:  3263853(18.556650797s)
> lock:  2788704(0.362311723s), block:   222681( 0.091152276s)
>
> Given the current point in the release, revert the commit and
> reinstate holding the domctl lock during domain destruction. Further
> work should be done in order to re-add more fine grained locking to
> the domain destruction path once a proper solution to avoid the
> heap_lock contention is found.
>
> Reported-by: Hongyan Xia <hongyxia@amazon.com>
> Reported-by: Dmitry Isaikin <isaikin-dmitry@yandex.ru>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

Re: [PATCH for-4.16] Revert "domctl: improve locking during domain destruction"
Posted by Ian Jackson 2 years, 5 months ago
Roger Pau Monne writes ("[PATCH for-4.16] Revert "domctl: improve locking during domain destruction""):
> This reverts commit 228ab9992ffb1d8f9d2475f2581e68b2913acb88.
> 
> Performance analysis has shown that dropping the domctl lock during
> domain destruction greatly increases the contention in the heap_lock,
> thus making parallel destruction of domains slower.
...
> Given the current point in the release, revert the commit and
> reinstate holding the domctl lock during domain destruction. Further
> work should be done in order to re-add more fine grained locking to
> the domain destruction path once a proper solution to avoid the
> heap_lock contention is found.
> ---
...
> Since this is a revert and not new code I think the risk is lower.
> There's however some risk, as the original commit was from 2017, and
> hence the surrounding code has changed a bit. It's also a possibility
> that some other parts of the domain destruction code now rely on this
> more fine grained locking. Local tests however haven't shown issues.

From a release management point of view I don't regard this as the
kind of "revert" that ought to get any kind of special consideration.
The tree has been like this since 2017 and Xen 4.11, and many changes
have happened since.

So I am going to treat this as an effectively new change.

AIUI it is a proposal to improve performance, not a bugfix.  Was this
change posted (or, proposed on-list) before the Xen 4.16 Last Posting
Date (24th of September)?  Even if it was, it would need a freeze
exception.

Ian.

Re: [PATCH for-4.16] Revert "domctl: improve locking during domain destruction"
Posted by Julien Grall 2 years, 5 months ago
Hi Roger,

On 09/11/2021 14:31, Roger Pau Monne wrote:
> This reverts commit 228ab9992ffb1d8f9d2475f2581e68b2913acb88.
> 
> Performance analysis has shown that dropping the domctl lock during
> domain destruction greatly increases the contention in the heap_lock,
> thus making parallel destruction of domains slower.
> 
> The following lockperf data shows the difference between the current
> code and the reverted one:
> 
> lock:  3342357(2.268295505s), block:  3263853(18.556650797s)
> lock:  2788704(0.362311723s), block:   222681( 0.091152276s)

Thanks for the numbers, this is already an improvement over the reverted commit.

Can you also please provide some details on the setup that was used to 
get the number? (e.g. how many guests, amount of memory...).

For instance, in the case of Amazon our setup was:

On a 144-core server with 4TiB of memory, destroying 32 guests (each
with 4 vcpus and 122GiB memory) simultaneously takes:

before the revert: 29 minutes
after the revert: 6 minutes
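
For reference, a minimal sketch of how such a run can be timed (guest
names and count here are illustrative, not our exact scripts):

    start=$(date +%s)
    for i in $(seq 1 32); do
        xl destroy "guest-$i" &       # destroy all guests in parallel
    done
    wait                              # wait for every destroy to finish
    echo "parallel destroy took $(( $(date +%s) - start ))s"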

> 
> Given the current point in the release, revert the commit and
> reinstate holding the domctl lock during domain destruction. Further
> work should be done in order to re-add more fine grained locking to
> the domain destruction path once a proper solution to avoid the
> heap_lock contention is found.
> 
> Reported-by: Hongyan Xia <hongyxia@amazon.com>
> Reported-by: Dmitry Isaikin <isaikin-dmitry@yandex.ru>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> Cc: Ian Jackson <iwj@xenproject.org>
> 
> Since this is a revert and not new code I think the risk is lower.
> There's however some risk, as the original commit was from 2017, and
> hence the surrounding code has changed a bit. It's also a possibility
> that some other parts of the domain destruction code now rely on this
> more fine grained locking. Local tests however haven't shown issues.
> ---
>   xen/common/domain.c | 12 ++----------
>   xen/common/domctl.c |  5 +----
>   2 files changed, 3 insertions(+), 14 deletions(-)
> 
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index 56d47dd664..093bb4403f 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -917,21 +917,13 @@ int domain_kill(struct domain *d)
>       if ( d == current->domain )
>           return -EINVAL;
>   
> -    /* Protected by d->domain_lock. */
> +    /* Protected by domctl_lock. */
>       switch ( d->is_dying )
>       {
>       case DOMDYING_alive:
> -        domain_unlock(d);
>           domain_pause(d);
> -        domain_lock(d);
> -        /*
> -         * With the domain lock dropped, d->is_dying may have changed. Call
> -         * ourselves recursively if so, which is safe as then we won't come
> -         * back here.
> -         */
> -        if ( d->is_dying != DOMDYING_alive )
> -            return domain_kill(d);
>           d->is_dying = DOMDYING_dying;
> +        spin_barrier(&d->domain_lock);
>           argo_destroy(d);
>           vnuma_destroy(d->vnuma);
>           domain_set_outstanding_pages(d, 0);
> diff --git a/xen/common/domctl.c b/xen/common/domctl.c
> index 271862ae58..879a2adcbe 100644
> --- a/xen/common/domctl.c
> +++ b/xen/common/domctl.c
> @@ -497,14 +497,11 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
>           break;
>   
>       case XEN_DOMCTL_destroydomain:
> -        domctl_lock_release();
> -        domain_lock(d);
>           ret = domain_kill(d);
> -        domain_unlock(d);
>           if ( ret == -ERESTART )
>               ret = hypercall_create_continuation(
>                   __HYPERVISOR_domctl, "h", u_domctl);
> -        goto domctl_out_unlock_domonly;
> +        break;
>   
>       case XEN_DOMCTL_setnodeaffinity:
>       {
> 

-- 
Julien Grall

Re: [PATCH for-4.16] Revert "domctl: improve locking during domain destruction"
Posted by Roger Pau Monné 2 years, 5 months ago
On Tue, Nov 09, 2021 at 02:42:58PM +0000, Julien Grall wrote:
> Hi Roger,
> 
> On 09/11/2021 14:31, Roger Pau Monne wrote:
> > This reverts commit 228ab9992ffb1d8f9d2475f2581e68b2913acb88.
> > 
> > Performance analysis has shown that dropping the domctl lock during
> > domain destruction greatly increases the contention in the heap_lock,
> > thus making parallel destruction of domains slower.
> > 
> > The following lockperf data shows the difference between the current
> > code and the reverted one:
> > 
> > lock:  3342357(2.268295505s), block:  3263853(18.556650797s)
> > lock:  2788704(0.362311723s), block:   222681( 0.091152276s)
> 
> Thanks for the numbers, this is already an improvement over the reverted commit.
> 
> Can you also please provide some details on the setup that was used to get
> the number? (e.g. how many guests, amount of memory...).

Those are from Dmitry, and were gathered after destroying 5 guests in
parallel. Given his previous emails, he seems to use 2GB HVM guests for
other tests, so I would assume that's also the case for the lock profile
data (although it's not said explicitly):

https://lists.xenproject.org/archives/html/xen-devel/2021-09/msg01515.html

I'm not sure it's worth adding this explicitly, as it's not a very
complex test case. Probably any attempt to destroy a minimal number
of guests in parallel (5?) will already show the lock contention in
the profiling.
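
FWIW, a sketch of such a test (not Dmitry's exact commands; the guest
names are made up, and it assumes a hypervisor built with lock profiling
enabled (CONFIG_DEBUG_LOCK_PROFILE) plus the xenlockprof tool from
tools/misc):

    # Destroy a handful of (2GB HVM) guests in parallel...
    for i in $(seq 1 5); do
        xl destroy "hvm-guest-$i" &   # names are hypothetical
    done
    wait
    # ...then dump the hypervisor lock profiling data and look at the
    # heap_lock entry.
    xenlockprof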

Thanks, Roger.

Re: [PATCH for-4.16] Revert "domctl: improve locking during domain destruction"
Posted by Julien Grall 2 years, 5 months ago
Hi,

On 09/11/2021 14:55, Roger Pau Monné wrote:
> On Tue, Nov 09, 2021 at 02:42:58PM +0000, Julien Grall wrote:
>> Hi Roger,
>>
>> On 09/11/2021 14:31, Roger Pau Monne wrote:
>>> This reverts commit 228ab9992ffb1d8f9d2475f2581e68b2913acb88.
>>>
>>> Performance analysis has shown that dropping the domctl lock during
>>> domain destruction greatly increases the contention in the heap_lock,
>>> thus making parallel destruction of domains slower.
>>>
>>> The following lockperf data shows the difference between the current
>>> code and the reverted one:
>>>
>>> lock:  3342357(2.268295505s), block:  3263853(18.556650797s)
>>> lock:  2788704(0.362311723s), block:   222681( 0.091152276s)
>>
>> Thanks for the numbers, this is already an improvement over the reverted commit.
>>
>> Can you also please provide some details on the setup that was used to get
>> the number? (e.g. how many guests, amount of memory...).
> 
> Those are from Dmitry, and were gathered after destroying 5 guests in
> parallel. Given his previous emails, he seems to use 2GB HVM guests for
> other tests, so I would assume that's also the case for the lock profile
> data (although it's not said explicitly):
> 
> https://lists.xenproject.org/archives/html/xen-devel/2021-09/msg01515.html
> 
> I'm not sure it's worth adding this explicitly, as it's not a very
> complex test case. Probably any attempt to destroy a minimal number
> of guests in parallel (5?) will already show the lock contention in
> the profiling.

In this case, I am not too concerned about not being able to reproduce 
it. However, I think it is good practice to always post the setup 
along with the numbers.

This makes it easier to understand the context of the patch and avoids 
spending time digging into the archives to find the original report.

Anyway, you already wrote everything above. So this is a matter of 
adding your first paragraph to the commit message + maybe a link to the 
original discussion(s).

Cheers,

-- 
Julien Grall