[PATCH v4] gnttab: defer allocation of status frame tracking array

Jan Beulich posted 1 patch 2 years, 11 months ago
Test gitlab-ci failed
Patches applied successfully (tree, apply log)
git fetch https://gitlab.com/xen-project/patchew/xen tags/patchew/74048f89-fee7-06c2-ffd5-6e5a14bdf440@suse.com
[PATCH v4] gnttab: defer allocation of status frame tracking array
Posted by Jan Beulich 2 years, 11 months ago
This array can be large when many grant frames are permitted; avoid
allocating it when it's not going to be used anyway, by doing this only
in gnttab_populate_status_frames(). While the delaying of the respective
memory allocation adds possible reasons for failure of the respective
enclosing operations, there are other memory allocations there already,
so callers can't expect these operations to always succeed anyway.

As to the re-ordering at the end of gnttab_unpopulate_status_frames(),
this is merely to represent intended order of actions (shrink array
bound, then free higher array entries). If there were racing accesses,
suitable barriers would need adding in addition.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: Add a comment. Add a few blank lines. Extend description.
v3: Drop smp_wmb(). Re-base.
v2: Defer allocation to when a domain actually switches to the v2 grant
    API.

--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -1747,6 +1747,17 @@ gnttab_populate_status_frames(struct dom
     /* Make sure, prior version checks are architectural visible */
     block_speculation();
 
+    if ( gt->status == ZERO_BLOCK_PTR )
+    {
+        gt->status = xzalloc_array(grant_status_t *,
+                                   grant_to_status_frames(gt->max_grant_frames));
+        if ( !gt->status )
+        {
+            gt->status = ZERO_BLOCK_PTR;
+            return -ENOMEM;
+        }
+    }
+
     for ( i = nr_status_frames(gt); i < req_status_frames; i++ )
     {
         if ( (gt->status[i] = alloc_xenheap_page()) == NULL )
@@ -1767,18 +1778,25 @@ status_alloc_failed:
         free_xenheap_page(gt->status[i]);
         gt->status[i] = NULL;
     }
+
+    if ( !nr_status_frames(gt) )
+    {
+        xfree(gt->status);
+        gt->status = ZERO_BLOCK_PTR;
+    }
+
     return -ENOMEM;
 }
 
 static int
 gnttab_unpopulate_status_frames(struct domain *d, struct grant_table *gt)
 {
-    unsigned int i;
+    unsigned int i, n = nr_status_frames(gt);
 
     /* Make sure, prior version checks are architectural visible */
     block_speculation();
 
-    for ( i = 0; i < nr_status_frames(gt); i++ )
+    for ( i = 0; i < n; i++ )
     {
         struct page_info *pg = virt_to_page(gt->status[i]);
         gfn_t gfn = gnttab_get_frame_gfn(gt, true, i);
@@ -1833,12 +1849,11 @@ gnttab_unpopulate_status_frames(struct d
         page_set_owner(pg, NULL);
     }
 
-    for ( i = 0; i < nr_status_frames(gt); i++ )
-    {
-        free_xenheap_page(gt->status[i]);
-        gt->status[i] = NULL;
-    }
     gt->nr_status_frames = 0;
+    for ( i = 0; i < n; i++ )
+        free_xenheap_page(gt->status[i]);
+    xfree(gt->status);
+    gt->status = ZERO_BLOCK_PTR;
 
     return 0;
 }
@@ -1969,11 +1984,11 @@ int grant_table_init(struct domain *d, i
     if ( gt->shared_raw == NULL )
         goto out;
 
-    /* Status pages for grant table - for version 2 */
-    gt->status = xzalloc_array(grant_status_t *,
-                               grant_to_status_frames(gt->max_grant_frames));
-    if ( gt->status == NULL )
-        goto out;
+    /*
+     * Status page tracking array for v2 gets allocated on demand. But don't
+     * leave a NULL pointer there.
+     */
+    gt->status = ZERO_BLOCK_PTR;
 
     grant_write_lock(gt);
 
@@ -4047,11 +4062,13 @@ int gnttab_acquire_resource(
         if ( gt->gt_version != 2 )
             break;
 
+        /* This may change gt->status, so has to happen before setting vaddrs. */ 
+        rc = gnttab_get_status_frame_mfn(d, final_frame, &tmp);
+
         /* Check that void ** is a suitable representation for gt->status. */
         BUILD_BUG_ON(!__builtin_types_compatible_p(
                          typeof(gt->status), grant_status_t **));
         vaddrs = (void **)gt->status;
-        rc = gnttab_get_status_frame_mfn(d, final_frame, &tmp);
         break;
     }
 

Re: [PATCH v4] gnttab: defer allocation of status frame tracking array
Posted by Julien Grall 2 years, 10 months ago
Hi Jan,

On 04/05/2021 09:42, Jan Beulich wrote:
> This array can be large when many grant frames are permitted; avoid
> allocating it when it's not going to be used anyway, by doing this only
> in gnttab_populate_status_frames(). While the delaying of the respective
> memory allocation adds possible reasons for failure of the respective
> enclosing operations, there are other memory allocations there already,
> so callers can't expect these operations to always succeed anyway.
> 
> As to the re-ordering at the end of gnttab_unpopulate_status_frames(),
> this is merely to represent intended order of actions (shrink array
> bound, then free higher array entries). If there were racing accesses,
> suitable barriers would need adding in addition.

Please drop the last sentence, this is at best misleading because you 
can't just add barriers to make it race free (see the discussion on v2 
for more details).

With the sentence dropped:

Reviewed-by: Julien Grall <jgrall@amazon.com>

Cheers,


[1] <f82ddfe7-853d-ca15-2373-a38068f65ef7@xen.org>


-- 
Julien Grall

Re: [PATCH v4] gnttab: defer allocation of status frame tracking array
Posted by Andrew Cooper 2 years, 10 months ago
On 04/05/2021 09:42, Jan Beulich wrote:
> This array can be large when many grant frames are permitted; avoid
> allocating it when it's not going to be used anyway, by doing this only
> in gnttab_populate_status_frames(). While the delaying of the respective
> memory allocation adds possible reasons for failure of the respective
> enclosing operations, there are other memory allocations there already,
> so callers can't expect these operations to always succeed anyway.
>
> As to the re-ordering at the end of gnttab_unpopulate_status_frames(),
> this is merely to represent intended order of actions (shrink array
> bound, then free higher array entries). If there were racing accesses,
> suitable barriers would need adding in addition.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Nack.

The argument you make says that the grant status frames is "large
enough" to care about not allocating.  (I frankly disagree, but that
isn't relevant to my to my nack).

The change in logic here moves a failure in DOMCTL_createdomain, to
GNTTABOP_setversion.  We know, from the last minute screwups in XSA-226,
that there are versions of Windows and Linux in the field, used by
customers, which will BUG()/BSOD when GNTTABOP_setversion doesn't succeed.

You're literally adding even more complexity to the grant table, to also
increase the chance of regressing VMs in the wild.  This is not ok.

The only form of this patch which is in any way acceptable, is to avoid
the allocation when you know *in DOMCTL_createdomain* that it will never
be needed by the VM.  So far, that is from Kconfig and/or the command
line options.

~Andrew


Re: [PATCH v4] gnttab: defer allocation of status frame tracking array
Posted by Jan Beulich 2 years, 10 months ago
On 05.05.2021 12:49, Andrew Cooper wrote:
> On 04/05/2021 09:42, Jan Beulich wrote:
>> This array can be large when many grant frames are permitted; avoid
>> allocating it when it's not going to be used anyway, by doing this only
>> in gnttab_populate_status_frames(). While the delaying of the respective
>> memory allocation adds possible reasons for failure of the respective
>> enclosing operations, there are other memory allocations there already,
>> so callers can't expect these operations to always succeed anyway.
>>
>> As to the re-ordering at the end of gnttab_unpopulate_status_frames(),
>> this is merely to represent intended order of actions (shrink array
>> bound, then free higher array entries). If there were racing accesses,
>> suitable barriers would need adding in addition.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Nack.
> 
> The argument you make says that the grant status frames is "large
> enough" to care about not allocating.  (I frankly disagree, but that
> isn't relevant to my to my nack).
> 
> The change in logic here moves a failure in DOMCTL_createdomain, to
> GNTTABOP_setversion.  We know, from the last minute screwups in XSA-226,
> that there are versions of Windows and Linux in the field, used by
> customers, which will BUG()/BSOD when GNTTABOP_setversion doesn't succeed.

So after you did re-state the Linux aspect of this on the call yesterday
I went and looked. In the first phase of Linux supporting v2 at all
(3.3 - 3.16) there indeed was a panic() upon certain kinds of failures.
Only up to 3.13 was there an actual attempt to use v2. Also, while there
was some code there to support actual v2 features (sub-page or
transitive grants), all of this was dead. Hence their claim

		/*
		 * If we've already used version 2 features,
		 * but then suddenly discover that they're not
		 * available (e.g. migrating to an older
		 * version of Xen), almost unbounded badness
		 * can happen.
		 */

was bogus at best. If there was no way at all for set_version to fail
prior to the change, here I could probably accept your position. But as
said before by Julien - there are pre-existing memory allocations (of
typically larger size) there, and hence any guest assuming the call
can't fail is flawed already anyway. And no, I don't view this as a
reason for us to try to eliminate all memory allocations from that
hypercall (which would in particular mean populating status frames
irrespective of whether a guest would ever switch to v2). Guest flaws
would better be addressed in the guests.

Upon re-introduction in 4.15 no such fatal behavior was put back in
place. I notice though that even up-to-date Linux has problematic
behavior around GNTTABOP_setup_table - the call is followed (after an
explicit -ENOSYS check) by "BUG_ON(rc || setup.status)". The amount of
memory needed here (including the status table) is potentially far
higher than for set_version. And what's important: setup_table gets
invoked only after set_version, so any table expansion would happen
here, with - if any table growing is needed at all - a far higher risk
for failure.

This ordering of operations (set_version before setup_table) as well
as the "error checking" after setup_table was also in effect up to
3.13. Therefore the same conclusion can be drawn there: The main risk
for crashing the guest due to memory shortage in Xen is there, not
with the version switch. What you've observed with XSA-226 (or rather,
I assume, with a custom extension of yours at the time, to prohibit use
of v2, which was upstreamed only later) was entirely unrelated to memory
shortage, but was instead a result of v2 suddenly becoming unavailable
altogether.

As to Windows, in the pvdrivers/win/ git repos I haven't been able to
find any use of GrantTableSetVersion(). Of course I can't exclude
other versions or other driver variants making use of set_version in an
inappropriate way. But as long as they don't grow the number of grant
table entries before such a call, the main risk of memory allocation
failure would again be with other hypercalls.

Jan

Re: [PATCH v4] gnttab: defer allocation of status frame tracking array
Posted by Julien Grall 2 years, 10 months ago
Hi Andrew,

On 05/05/2021 11:49, Andrew Cooper wrote:
> On 04/05/2021 09:42, Jan Beulich wrote:
>> This array can be large when many grant frames are permitted; avoid
>> allocating it when it's not going to be used anyway, by doing this only
>> in gnttab_populate_status_frames(). While the delaying of the respective
>> memory allocation adds possible reasons for failure of the respective
>> enclosing operations, there are other memory allocations there already,
>> so callers can't expect these operations to always succeed anyway.
>>
>> As to the re-ordering at the end of gnttab_unpopulate_status_frames(),
>> this is merely to represent intended order of actions (shrink array
>> bound, then free higher array entries). If there were racing accesses,
>> suitable barriers would need adding in addition.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Nack.
> 
> The argument you make says that the grant status frames is "large
> enough" to care about not allocating.  (I frankly disagree, but that
> isn't relevant to my to my nack).
> 
> The change in logic here moves a failure in DOMCTL_createdomain, to
> GNTTABOP_setversion.  We know, from the last minute screwups in XSA-226,
> that there are versions of Windows and Linux in the field, used by
> customers, which will BUG()/BSOD when GNTTABOP_setversion doesn't succeed.

Unfortunately, my reply to this on the v2 was left unanswered by you. So 
I will rewrite it here with more details in the hope this can lead to a 
consensus now.

 From a discussion during January's community call, the behavior was 
related to an old version version of Windows PV drivers. Newer version 
will not use Grant Table v2.

However, your comment seems to suggest that GNTTABOP_setversion will 
never fail today. AFAICT, this is not true today because Xen will return 
-ENOMEM when trying to populate the status frame (see the call to 
gnttab_populate_status_frame()).

Therefore...

> 
> You're literally adding even more complexity to the grant table, to also
> increase the chance of regressing VMs in the wild.  This is not ok.

... I am not sure why you are saying this is a regression.

> The only form of this patch which is in any way acceptable, is to avoid
> the allocation when you know *in DOMCTL_createdomain* that it will never
> be needed by the VM.  So far, that is from Kconfig and/or the command
> line options.

I can see how allocating memory upfront is easier for accounting 
purpose. However, it also means we may not be able to reduce the 
footprint if we don't know what the guest OS will use it.

This can happen for instance, if you let your customers to run random 
OSes and never restrict them to v1.

I believe that in most of the cases v1 will be used by the guest. But we 
technically can't restrict it after the fact (e.g. on reboot) as this 
may regress customers workload.

That's where the current approach is quite appealing because the 
allocation is done a runtime. So it does cater a lot more uses cases 
than your suggestion.

Cheers,

-- 
Julien Grall

Re: [PATCH v4] gnttab: defer allocation of status frame tracking array
Posted by Jan Beulich 2 years, 10 months ago
On 05.05.2021 12:49, Andrew Cooper wrote:
> On 04/05/2021 09:42, Jan Beulich wrote:
>> This array can be large when many grant frames are permitted; avoid
>> allocating it when it's not going to be used anyway, by doing this only
>> in gnttab_populate_status_frames(). While the delaying of the respective
>> memory allocation adds possible reasons for failure of the respective
>> enclosing operations, there are other memory allocations there already,
>> so callers can't expect these operations to always succeed anyway.
>>
>> As to the re-ordering at the end of gnttab_unpopulate_status_frames(),
>> this is merely to represent intended order of actions (shrink array
>> bound, then free higher array entries). If there were racing accesses,
>> suitable barriers would need adding in addition.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Nack.
> 
> The argument you make says that the grant status frames is "large
> enough" to care about not allocating.  (I frankly disagree, but that
> isn't relevant to my to my nack).
> 
> The change in logic here moves a failure in DOMCTL_createdomain, to
> GNTTABOP_setversion.  We know, from the last minute screwups in XSA-226,
> that there are versions of Windows and Linux in the field, used by
> customers, which will BUG()/BSOD when GNTTABOP_setversion doesn't succeed.
> 
> You're literally adding even more complexity to the grant table, to also
> increase the chance of regressing VMs in the wild.  This is not ok.

To me, arguing like this is not okay. The code should have been written
like this in the first place. There's absolutely no reason to allocate
memory up front which is never going to be used.

> The only form of this patch which is in any way acceptable, is to avoid
> the allocation when you know *in DOMCTL_createdomain* that it will never
> be needed by the VM.  So far, that is from Kconfig and/or the command
> line options.

Well, may I remind you that this is how this patch had started? And
may I further remind you that it was Julien who asked me to convert to
this model? And may I then remind you that I already did suggest that
the two of you should come to an agreement? I'm not going to act as a
moderator in this process. But I insist that it is really bad practice
to drop the ball and leave patches (and alike) hanging in the air.

Just to be clear - in case I don't observe the two of you agreeing on
whichever outcome in the foreseeable future, I'm going to make attempts
to get another opinion from yet another REST maintainer. Since I can
live with both forms of the change (but would prefer the more
aggressive savings as done in this version), I can also live with
whatever 4th opinion would surface. But in case the result was not in
your favor, I'd then consider your Nack overruled.

Jan