[PATCH v2] x86: Perform mem_sharing teardown before paging teardown

Tamas K Lengyel posted 1 patch 1 year, 2 months ago
Test gitlab-ci failed
Patches applied successfully (tree, apply log)
git fetch https://gitlab.com/xen-project/patchew/xen tags/patchew/a7df4a01d7b901eb7b43be08f6fd3dce82ebbcd0.1676480656.git.tamas@tklengyel.com
xen/arch/x86/domain.c | 56 ++++++++++++++++++++++---------------------
1 file changed, 29 insertions(+), 27 deletions(-)
[PATCH v2] x86: Perform mem_sharing teardown before paging teardown
Posted by Tamas K Lengyel 1 year, 2 months ago
An assert failure has been observed in p2m_teardown when performing vm
forking and then destroying the forked VM (p2m-basic.c:173). The assert
checks whether the domain's shared pages counter is 0. According to the
patch that originally added the assert (7bedbbb5c31) the p2m_teardown
should only happen after mem_sharing already relinquished all shared pages.

In this patch we flip the order in which relinquish ops are called to avoid
tripping the assert.

Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>
Fixes: e7aa55c0aab3 ("x86/p2m: free the paging memory pool preemptively")
---
v2: comsetic fixes
---
 xen/arch/x86/domain.c | 56 ++++++++++++++++++++++---------------------
 1 file changed, 29 insertions(+), 27 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index db3ebf062d..a42f03e8e5 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -2289,9 +2289,9 @@ int domain_relinquish_resources(struct domain *d)
 
         enum {
             PROG_iommu_pagetables = 1,
+            PROG_shared,
             PROG_paging,
             PROG_vcpu_pagetables,
-            PROG_shared,
             PROG_xen,
             PROG_l4,
             PROG_l3,
@@ -2310,6 +2310,34 @@ int domain_relinquish_resources(struct domain *d)
         if ( ret )
             return ret;
 
+#ifdef CONFIG_MEM_SHARING
+    PROGRESS(shared):
+
+        if ( is_hvm_domain(d) )
+        {
+            /*
+             * If the domain has shared pages, relinquish them allowing
+             * for preemption.
+             */
+            ret = relinquish_shared_pages(d);
+            if ( ret )
+                return ret;
+
+            /*
+             * If the domain is forked, decrement the parent's pause count
+             * and release the domain.
+             */
+            if ( mem_sharing_is_fork(d) )
+            {
+                struct domain *parent = d->parent;
+
+                d->parent = NULL;
+                domain_unpause(parent);
+                put_domain(parent);
+            }
+        }
+#endif
+
     PROGRESS(paging):
 
         /* Tear down paging-assistance stuff. */
@@ -2350,32 +2378,6 @@ int domain_relinquish_resources(struct domain *d)
             d->arch.auto_unmask = 0;
         }
 
-#ifdef CONFIG_MEM_SHARING
-    PROGRESS(shared):
-
-        if ( is_hvm_domain(d) )
-        {
-            /* If the domain has shared pages, relinquish them allowing
-             * for preemption. */
-            ret = relinquish_shared_pages(d);
-            if ( ret )
-                return ret;
-
-            /*
-             * If the domain is forked, decrement the parent's pause count
-             * and release the domain.
-             */
-            if ( mem_sharing_is_fork(d) )
-            {
-                struct domain *parent = d->parent;
-
-                d->parent = NULL;
-                domain_unpause(parent);
-                put_domain(parent);
-            }
-        }
-#endif
-
         spin_lock(&d->page_alloc_lock);
         page_list_splice(&d->arch.relmem_list, &d->page_list);
         INIT_PAGE_LIST_HEAD(&d->arch.relmem_list);
-- 
2.34.1
Re: [PATCH v2] x86: Perform mem_sharing teardown before paging teardown
Posted by Jan Beulich 1 year, 2 months ago
On 15.02.2023 18:07, Tamas K Lengyel wrote:
> An assert failure has been observed in p2m_teardown when performing vm
> forking and then destroying the forked VM (p2m-basic.c:173). The assert
> checks whether the domain's shared pages counter is 0. According to the
> patch that originally added the assert (7bedbbb5c31) the p2m_teardown
> should only happen after mem_sharing already relinquished all shared pages.
> 
> In this patch we flip the order in which relinquish ops are called to avoid
> tripping the assert.
> 
> Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>
> Fixes: e7aa55c0aab3 ("x86/p2m: free the paging memory pool preemptively")

Especially considering the backporting requirement I'm disappointed to
find that you haven't extended the description to explain why this
heavier code churn is to be preferred over an adjustment to the offending
assertion. As said in reply to v1 - I'm willing to accept arguments
towards this being better, but I need to know those arguments.

Jan
Re: [PATCH v2] x86: Perform mem_sharing teardown before paging teardown
Posted by Tamas K Lengyel 1 year, 2 months ago
On Tue, Feb 21, 2023 at 8:54 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 15.02.2023 18:07, Tamas K Lengyel wrote:
> > An assert failure has been observed in p2m_teardown when performing vm
> > forking and then destroying the forked VM (p2m-basic.c:173). The assert
> > checks whether the domain's shared pages counter is 0. According to the
> > patch that originally added the assert (7bedbbb5c31) the p2m_teardown
> > should only happen after mem_sharing already relinquished all shared
pages.
> >
> > In this patch we flip the order in which relinquish ops are called to
avoid
> > tripping the assert.
> >
> > Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>
> > Fixes: e7aa55c0aab3 ("x86/p2m: free the paging memory pool
preemptively")
>
> Especially considering the backporting requirement I'm disappointed to
> find that you haven't extended the description to explain why this
> heavier code churn is to be preferred over an adjustment to the offending
> assertion. As said in reply to v1 - I'm willing to accept arguments
> towards this being better, but I need to know those arguments.

The assert change as you proposed is hard to understand and will be thus
harder to maintain. Conceptually sharing being torn down makes sense to
happen before paging is torn down. Considering that's how it has been
before the unfortunate regression I'm fixing here I don't think more
justification is needed. The code-churn is minimal anyway.

Tamas
Re: [PATCH v2] x86: Perform mem_sharing teardown before paging teardown
Posted by Jan Beulich 1 year, 2 months ago
On 21.02.2023 16:59, Tamas K Lengyel wrote:
> On Tue, Feb 21, 2023 at 8:54 AM Jan Beulich <jbeulich@suse.com> wrote:
>>
>> On 15.02.2023 18:07, Tamas K Lengyel wrote:
>>> An assert failure has been observed in p2m_teardown when performing vm
>>> forking and then destroying the forked VM (p2m-basic.c:173). The assert
>>> checks whether the domain's shared pages counter is 0. According to the
>>> patch that originally added the assert (7bedbbb5c31) the p2m_teardown
>>> should only happen after mem_sharing already relinquished all shared
> pages.
>>>
>>> In this patch we flip the order in which relinquish ops are called to
> avoid
>>> tripping the assert.
>>>
>>> Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>
>>> Fixes: e7aa55c0aab3 ("x86/p2m: free the paging memory pool
> preemptively")
>>
>> Especially considering the backporting requirement I'm disappointed to
>> find that you haven't extended the description to explain why this
>> heavier code churn is to be preferred over an adjustment to the offending
>> assertion. As said in reply to v1 - I'm willing to accept arguments
>> towards this being better, but I need to know those arguments.
> 
> The assert change as you proposed is hard to understand and will be thus
> harder to maintain. Conceptually sharing being torn down makes sense to
> happen before paging is torn down.

This is (perhaps) the crucial thing to say. Whereas ...

> Considering that's how it has been
> before the unfortunate regression I'm fixing here I don't think more
> justification is needed.

... I disagree here - that's _not_ how it has been before: paging_teardown()
has always been called first. The regression was with how much of teardown
is performed from there vs at the final stage of domain cleanup.

I'd be fine adding the "Conceptually ..." sentence to the description when
committing, of course provided you agree.

> The code-churn is minimal anyway.

Perspectives here may vary. Plus, as said, being the one to backport this
eventually, the larger the code change (even if pure movement), the larger
the chance of there being a conflict somewhere.

Jan
Re: [PATCH v2] x86: Perform mem_sharing teardown before paging teardown
Posted by Tamas K Lengyel 1 year, 2 months ago
On Wed, Feb 22, 2023 at 5:48 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 21.02.2023 16:59, Tamas K Lengyel wrote:
> > On Tue, Feb 21, 2023 at 8:54 AM Jan Beulich <jbeulich@suse.com> wrote:
> >>
> >> On 15.02.2023 18:07, Tamas K Lengyel wrote:
> >>> An assert failure has been observed in p2m_teardown when performing vm
> >>> forking and then destroying the forked VM (p2m-basic.c:173). The
assert
> >>> checks whether the domain's shared pages counter is 0. According to
the
> >>> patch that originally added the assert (7bedbbb5c31) the p2m_teardown
> >>> should only happen after mem_sharing already relinquished all shared
> > pages.
> >>>
> >>> In this patch we flip the order in which relinquish ops are called to
> > avoid
> >>> tripping the assert.
> >>>
> >>> Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>
> >>> Fixes: e7aa55c0aab3 ("x86/p2m: free the paging memory pool
> > preemptively")
> >>
> >> Especially considering the backporting requirement I'm disappointed to
> >> find that you haven't extended the description to explain why this
> >> heavier code churn is to be preferred over an adjustment to the
offending
> >> assertion. As said in reply to v1 - I'm willing to accept arguments
> >> towards this being better, but I need to know those arguments.
> >
> > The assert change as you proposed is hard to understand and will be thus
> > harder to maintain. Conceptually sharing being torn down makes sense to
> > happen before paging is torn down.
>
> This is (perhaps) the crucial thing to say. Whereas ...
>
> > Considering that's how it has been
> > before the unfortunate regression I'm fixing here I don't think more
> > justification is needed.
>
> ... I disagree here - that's _not_ how it has been before:
paging_teardown()
> has always been called first. The regression was with how much of teardown
> is performed from there vs at the final stage of domain cleanup.
>
> I'd be fine adding the "Conceptually ..." sentence to the description when
> committing, of course provided you agree.

Perfectly fine by me.

Thanks,
Tamas