[v1] KVM: arm64: Fix host/hyp tracking on share/unshare hypercall failure

[PATCH 0/2] KVM: arm64: Fix host/hyp tracking on share/unshare hypercall failure

Posted by tabba@google.com 1 week, 3 days ago

Hi folks,

Yet another bug I found while testing Sashiko locally with fixes to
review-prompts.

share_pfn_hyp() and unshare_pfn_hyp() in arch/arm64/kvm/mmu.c
maintain a host-side RB-tree mirroring the set of pages shared with
EL2. Both invoke a hypercall that can fail (page-state mismatch,
EL2 refcount still held), but neither cleans up on failure:

- share_pfn_hyp() inserts the tracking node before the hypercall
  and leaves it in the tree on failure, leaking the allocation and
  presenting a phantom share to a later unshare.

- unshare_pfn_hyp() erases the tracking node before the hypercall;
  on failure the host loses its record while EL2 still owns the
  share, breaking later operations on the same pfn.
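
For illustration only, the ordering both patches aim for can be modelled in user-space C. This is a sketch under stated assumptions, not the code in arch/arm64/kvm/mmu.c: a flat array stands in for the RB-tree, `mock_hypercall()` stands in for the EL2 call, reference counting is ignored, and every name below (`model_share`, `model_unshare`, `fail_next_hypercall`, ...) is hypothetical.

```c
/* User-space model only; all names here are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TRACKED 16

/* Stand-in for the host-side tracking tree: a flat array of pfns. */
static unsigned long tracked[MAX_TRACKED];
static size_t nr_tracked;

/* Set to true to make the next mock hypercall fail once. */
static bool fail_next_hypercall;

static int mock_hypercall(void)
{
	if (fail_next_hypercall) {
		fail_next_hypercall = false;
		return -1;	/* e.g. a page-state mismatch */
	}
	return 0;
}

static bool is_tracked(unsigned long pfn)
{
	for (size_t i = 0; i < nr_tracked; i++)
		if (tracked[i] == pfn)
			return true;
	return false;
}

static void untrack(unsigned long pfn)
{
	for (size_t i = 0; i < nr_tracked; i++) {
		if (tracked[i] == pfn) {
			/* Order does not matter: move the last entry into the hole. */
			tracked[i] = tracked[--nr_tracked];
			return;
		}
	}
}

/* Share: record the pfn, then undo the record if the call fails. */
static int model_share(unsigned long pfn)
{
	int ret;

	if (is_tracked(pfn))
		return 0;
	if (nr_tracked == MAX_TRACKED)
		return -1;

	tracked[nr_tracked++] = pfn;
	ret = mock_hypercall();
	if (ret) {
		untrack(pfn);
		assert(!is_tracked(pfn));
	}
	return ret;
}

/* Unshare: make the call first, drop the record only on success. */
static int model_unshare(unsigned long pfn)
{
	int ret;

	if (!is_tracked(pfn))
		return -1;

	ret = mock_hypercall();
	if (ret)
		return ret;	/* the host keeps its record */

	untrack(pfn);
	return 0;
}
```

In this model, a failed share leaves no phantom entry behind, and a failed unshare leaves the entry in place so that a later retry can still find it.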

Severity is low (no isolation impact) and the failure paths are rare
in practice, but the desync is real. Both patches are independent and
apply cleanly to current mainline. In other words, this can wait for
7.2.

Cheers,
/fuad

Fuad Tabba (2):
  KVM: arm64: Free hyp-share tracking node when share hypercall fails
  KVM: arm64: Avoid host/hyp share desync on unshare hypercall failure

 arch/arm64/kvm/mmu.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

-- 
2.54.0.929.g9b7fa37559-goog

Re: [PATCH 0/2] KVM: arm64: Fix host/hyp tracking on share/unshare hypercall failure

Posted by Vincent Donnefort 1 week, 3 days ago

On Fri, May 29, 2026 at 08:43:39AM +0100, tabba@google.com wrote:
> Hi folks,
> 
> Yet another bug I found while testing Sashiko locally with fixes to
> review-prompts.
> 
> share_pfn_hyp() and unshare_pfn_hyp() in arch/arm64/kvm/mmu.c
> maintain a host-side RB-tree mirroring the set of pages shared with
> EL2. Both invoke a hypercall that can fail (page-state mismatch,
> EL2 refcount still held), but neither cleans up on failure:
> 
> - share_pfn_hyp() inserts the tracking node before the hypercall
>   and leaves it in the tree on failure, leaking the allocation and
>   presenting a phantom share to a later unshare.
> 
> - unshare_pfn_hyp() erases the tracking node before the hypercall;
>   on failure the host loses its record while EL2 still owns the
>   share, breaking later operations on the same pfn.
> 
> Severity is low (no isolation impact) and the failure paths are rare
> in practice, but the desync is real. Both patches are independent and
> apply cleanly to current mainline. In other words, this can wait for
> 7.2.


I believe I fixed that here lore.kernel.org/all/acyKhZL2di_QQ9xm@google.com but
as Quentin pointed out, there's absolutely no reason for the hypercall to fail.
So I haven't sent a v2.

> 
> Cheers,
> /fuad
> 
> Fuad Tabba (2):
>   KVM: arm64: Free hyp-share tracking node when share hypercall fails
>   KVM: arm64: Avoid host/hyp share desync on unshare hypercall failure
> 
>  arch/arm64/kvm/mmu.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> -- 
> 2.54.0.929.g9b7fa37559-goog
>

Re: [PATCH 0/2] KVM: arm64: Fix host/hyp tracking on share/unshare hypercall failure

Posted by Fuad Tabba 1 week, 3 days ago

On Fri, 29 May 2026 at 09:02, Vincent Donnefort <vdonnefort@google.com> wrote:
>
> On Fri, May 29, 2026 at 08:43:39AM +0100, tabba@google.com wrote:
> > Hi folks,
> >
> > Yet another bug I found while testing Sashiko locally with fixes to
> > review-prompts.
> >
> > share_pfn_hyp() and unshare_pfn_hyp() in arch/arm64/kvm/mmu.c
> > maintain a host-side RB-tree mirroring the set of pages shared with
> > EL2. Both invoke a hypercall that can fail (page-state mismatch,
> > EL2 refcount still held), but neither cleans up on failure:
> >
> > - share_pfn_hyp() inserts the tracking node before the hypercall
> >   and leaves it in the tree on failure, leaking the allocation and
> >   presenting a phantom share to a later unshare.
> >
> > - unshare_pfn_hyp() erases the tracking node before the hypercall;
> >   on failure the host loses its record while EL2 still owns the
> >   share, breaking later operations on the same pfn.
> >
> > Severity is low (no isolation impact) and the failure paths are rare
> > in practice, but the desync is real. Both patches are independent and
> > apply cleanly to current mainline. In other words, this can wait for
> > 7.2.
>
>
> I believe I fixed that here lore.kernel.org/all/acyKhZL2di_QQ9xm@google.com but
> as Quentin pointed out, there's absolutely no reason for the hypercall to fail.
> So I haven't sent a v2.

At the very least we need to add a comment, otherwise, people like me
and LLMs like Sashiko would stumble upon it.

That said, this fix adds no real overhead, makes the code clearer, and
guards us against a future where that call might fail.
Self-documenting in essence.

WDYT?

/fuad

>
> >
> > Cheers,
> > /fuad
> >
> > Fuad Tabba (2):
> >   KVM: arm64: Free hyp-share tracking node when share hypercall fails
> >   KVM: arm64: Avoid host/hyp share desync on unshare hypercall failure
> >
> >  arch/arm64/kvm/mmu.c | 14 +++++++++++---
> >  1 file changed, 11 insertions(+), 3 deletions(-)
> >
> > --
> > 2.54.0.929.g9b7fa37559-goog
> >

Re: [PATCH 0/2] KVM: arm64: Fix host/hyp tracking on share/unshare hypercall failure

Posted by Marc Zyngier 1 week, 3 days ago

On Fri, 29 May 2026 09:05:35 +0100,
Fuad Tabba <tabba@google.com> wrote:
> 
> On Fri, 29 May 2026 at 09:02, Vincent Donnefort <vdonnefort@google.com> wrote:
> >
> > On Fri, May 29, 2026 at 08:43:39AM +0100, tabba@google.com wrote:
> > > Hi folks,
> > >
> > > Yet another bug I found while testing Sashiko locally with fixes to
> > > review-prompts.
> > >
> > > share_pfn_hyp() and unshare_pfn_hyp() in arch/arm64/kvm/mmu.c
> > > maintain a host-side RB-tree mirroring the set of pages shared with
> > > EL2. Both invoke a hypercall that can fail (page-state mismatch,
> > > EL2 refcount still held), but neither cleans up on failure:
> > >
> > > - share_pfn_hyp() inserts the tracking node before the hypercall
> > >   and leaves it in the tree on failure, leaking the allocation and
> > >   presenting a phantom share to a later unshare.
> > >
> > > - unshare_pfn_hyp() erases the tracking node before the hypercall;
> > >   on failure the host loses its record while EL2 still owns the
> > >   share, breaking later operations on the same pfn.
> > >
> > > Severity is low (no isolation impact) and the failure paths are rare
> > > in practice, but the desync is real. Both patches are independent and
> > > apply cleanly to current mainline. In other words, this can wait for
> > > 7.2.
> >
> >
> > I believe I fixed that here lore.kernel.org/all/acyKhZL2di_QQ9xm@google.com but
> > as Quentin pointed out, there's absolutely no reason for the hypercall to fail.
> > So I haven't sent a v2.
> 
> At the very least we need to add a comment, otherwise, people like me
> and LLMs like Sashiko would stumble upon it.
> 
> That said, this fix adds no real overhead, makes the code clearer, and
> guards us against a future where that call might fail.
> Self-documenting in essence.
> 
> WDYT?

If a hypercall really cannot fail, why does it have a return value?

	M.

-- 
Without deviation from the norm, progress is not possible.

Re: [PATCH 0/2] KVM: arm64: Fix host/hyp tracking on share/unshare hypercall failure

Posted by Fuad Tabba 1 week, 3 days ago

On Fri, 29 May 2026 at 09:15, Marc Zyngier <maz@kernel.org> wrote:
>
> On Fri, 29 May 2026 09:05:35 +0100,
> Fuad Tabba <tabba@google.com> wrote:
> >
> > On Fri, 29 May 2026 at 09:02, Vincent Donnefort <vdonnefort@google.com> wrote:
> > >
> > > On Fri, May 29, 2026 at 08:43:39AM +0100, tabba@google.com wrote:
> > > > Hi folks,
> > > >
> > > > Yet another bug I found while testing Sashiko locally with fixes to
> > > > review-prompts.
> > > >
> > > > share_pfn_hyp() and unshare_pfn_hyp() in arch/arm64/kvm/mmu.c
> > > > maintain a host-side RB-tree mirroring the set of pages shared with
> > > > EL2. Both invoke a hypercall that can fail (page-state mismatch,
> > > > EL2 refcount still held), but neither cleans up on failure:
> > > >
> > > > - share_pfn_hyp() inserts the tracking node before the hypercall
> > > >   and leaves it in the tree on failure, leaking the allocation and
> > > >   presenting a phantom share to a later unshare.
> > > >
> > > > - unshare_pfn_hyp() erases the tracking node before the hypercall;
> > > >   on failure the host loses its record while EL2 still owns the
> > > >   share, breaking later operations on the same pfn.
> > > >
> > > > Severity is low (no isolation impact) and the failure paths are rare
> > > > in practice, but the desync is real. Both patches are independent and
> > > > apply cleanly to current mainline. In other words, this can wait for
> > > > 7.2.
> > >
> > >
> > > I believe I fixed that here lore.kernel.org/all/acyKhZL2di_QQ9xm@google.com but
> > > as Quentin pointed out, there's absolutely no reason for the hypercall to fail.
> > > So I haven't sent a v2.
> >
> > At the very least we need to add a comment, otherwise, people like me
> > and LLMs like Sashiko would stumble upon it.
> >
> > That said, this fix adds no real overhead, makes the code clearer, and
> > guards us against a future where that call might fail.
> > Self-documenting in essence.
> >
> > WDYT?
>
> If a hypercall really cannot fail, why does it have a return value?

Good point. If we know it cannot fail, how about just `void`?

That said, Vincent's exact words are: `very much unlikely`, not the
same as cannot fail :)

https://lore.kernel.org/all/acyKhZL2di_QQ9xm@google.com/

/fuad

>
>         M.
>
> --
> Without deviation from the norm, progress is not possible.

Re: [PATCH 0/2] KVM: arm64: Fix host/hyp tracking on share/unshare hypercall failure

Posted by Marc Zyngier 1 week, 3 days ago

On Fri, 29 May 2026 09:20:50 +0100,
Fuad Tabba <tabba@google.com> wrote:
> 
> On Fri, 29 May 2026 at 09:15, Marc Zyngier <maz@kernel.org> wrote:
> >
> > On Fri, 29 May 2026 09:05:35 +0100,
> > Fuad Tabba <tabba@google.com> wrote:
> > >
> > > On Fri, 29 May 2026 at 09:02, Vincent Donnefort <vdonnefort@google.com> wrote:
> > > >
> > > > On Fri, May 29, 2026 at 08:43:39AM +0100, tabba@google.com wrote:
> > > > > Hi folks,
> > > > >
> > > > > Yet another bug I found while testing Sashiko locally with fixes to
> > > > > review-prompts.
> > > > >
> > > > > share_pfn_hyp() and unshare_pfn_hyp() in arch/arm64/kvm/mmu.c
> > > > > maintain a host-side RB-tree mirroring the set of pages shared with
> > > > > EL2. Both invoke a hypercall that can fail (page-state mismatch,
> > > > > EL2 refcount still held), but neither cleans up on failure:
> > > > >
> > > > > - share_pfn_hyp() inserts the tracking node before the hypercall
> > > > >   and leaves it in the tree on failure, leaking the allocation and
> > > > >   presenting a phantom share to a later unshare.
> > > > >
> > > > > - unshare_pfn_hyp() erases the tracking node before the hypercall;
> > > > >   on failure the host loses its record while EL2 still owns the
> > > > >   share, breaking later operations on the same pfn.
> > > > >
> > > > > Severity is low (no isolation impact) and the failure paths are rare
> > > > > in practice, but the desync is real. Both patches are independent and
> > > > > apply cleanly to current mainline. In other words, this can wait for
> > > > > 7.2.
> > > >
> > > >
> > > > I believe I fixed that here lore.kernel.org/all/acyKhZL2di_QQ9xm@google.com but
> > > > as Quentin pointed out, there's absolutely no reason for the hypercall to fail.
> > > > So I haven't sent a v2.
> > >
> > > At the very least we need to add a comment, otherwise, people like me
> > > and LLMs like Sashiko would stumble upon it.
> > >
> > > That said, this fix adds no real overhead, makes the code clearer, and
> > > guards us against a future where that call might fail.
> > > Self-documenting in essence.
> > >
> > > WDYT?
> >
> > If a hypercall really cannot fail, why does it have a return value?
> 
> Good point. If we know it cannot fail, how about just `void`?
> 
> That said, Vincent's exact words are: `very much unlikely`, not the
> same as cannot fail :)
> 
> https://lore.kernel.org/all/acyKhZL2di_QQ9xm@google.com/

I think the rules are simple:

- if something can fail, we need to handle the failure

- if something should not fail and has the potential of compromising
  the system, we should panic

- if something absolutely cannot fail, then there is nothing to handle
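
As a rough sketch of how those three cases might look in code (user-space C, hypothetical names, with `abort()` standing in for a kernel panic; nothing here is taken from the KVM sources):

```c
/* Illustrative only; all function names are hypothetical. */
#include <stdio.h>
#include <stdlib.h>

/* Case 1: the operation can fail, so it reports an error to handle. */
static int op_can_fail(int input)
{
	return input < 0 ? -1 : 0;
}

/*
 * Case 2: the operation should not fail, and carrying on after a
 * failure could compromise the system, so stop instead of returning.
 */
static void op_must_not_fail(int input)
{
	if (op_can_fail(input)) {
		fprintf(stderr, "invariant broken, giving up\n");
		abort();	/* stand-in for a panic */
	}
}

/* Case 3: the operation cannot fail, so there is nothing to return. */
static void op_cannot_fail(int *counter)
{
	(*counter)++;
}
```

The return type carries the contract: `int` tells the caller there is something to check, `void` tells the caller there is not.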

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

Re: [PATCH 0/2] KVM: arm64: Fix host/hyp tracking on share/unshare hypercall failure

Posted by Vincent Donnefort 1 week, 3 days ago

On Fri, May 29, 2026 at 10:29:40AM +0100, Marc Zyngier wrote:
> On Fri, 29 May 2026 09:20:50 +0100,
> Fuad Tabba <tabba@google.com> wrote:
> > 
> > On Fri, 29 May 2026 at 09:15, Marc Zyngier <maz@kernel.org> wrote:
> > >
> > > On Fri, 29 May 2026 09:05:35 +0100,
> > > Fuad Tabba <tabba@google.com> wrote:
> > > >
> > > > On Fri, 29 May 2026 at 09:02, Vincent Donnefort <vdonnefort@google.com> wrote:
> > > > >
> > > > > On Fri, May 29, 2026 at 08:43:39AM +0100, tabba@google.com wrote:
> > > > > > Hi folks,
> > > > > >
> > > > > > Yet another bug I found while testing Sashiko locally with fixes to
> > > > > > review-prompts.
> > > > > >
> > > > > > share_pfn_hyp() and unshare_pfn_hyp() in arch/arm64/kvm/mmu.c
> > > > > > maintain a host-side RB-tree mirroring the set of pages shared with
> > > > > > EL2. Both invoke a hypercall that can fail (page-state mismatch,
> > > > > > EL2 refcount still held), but neither cleans up on failure:
> > > > > >
> > > > > > - share_pfn_hyp() inserts the tracking node before the hypercall
> > > > > >   and leaves it in the tree on failure, leaking the allocation and
> > > > > >   presenting a phantom share to a later unshare.
> > > > > >
> > > > > > - unshare_pfn_hyp() erases the tracking node before the hypercall;
> > > > > >   on failure the host loses its record while EL2 still owns the
> > > > > >   share, breaking later operations on the same pfn.
> > > > > >
> > > > > > Severity is low (no isolation impact) and the failure paths are rare
> > > > > > in practice, but the desync is real. Both patches are independent and
> > > > > > apply cleanly to current mainline. In other words, this can wait for
> > > > > > 7.2.
> > > > >
> > > > >
> > > > > I believe I fixed that here lore.kernel.org/all/acyKhZL2di_QQ9xm@google.com but
> > > > > as Quentin pointed out, there's absolutely no reason for the hypercall to fail.
> > > > > So I haven't sent a v2.
> > > >
> > > > At the very least we need to add a comment, otherwise, people like me
> > > > and LLMs like Sashiko would stumble upon it.
> > > >
> > > > That said, this fix adds no real overhead, makes the code clearer, and
> > > > guards us against a future where that call might fail.
> > > > Self-documenting in essence.
> > > >
> > > > WDYT?
> > >
> > > If a hypercall really cannot fail, why does it have a return value?
> > 
> > Good point. If we know it cannot fail, how about just `void`?
> > 
> > That said, Vincent's exact words are: `very much unlikely`, not the
> > same as cannot fail :)
> > 
> > https://lore.kernel.org/all/acyKhZL2di_QQ9xm@google.com/
> 
> I think the rules are simple:
> 
> - if something can fail, we need to handle the failure

Looking at kvm_share_hyp(), it should then roll back the shared pages. I think
that is fine.

> 
> - if something should not fail and has the potential of compromising
>   the system, we should panic

Then kvm_unshare_hyp(), being void, should BUG_ON(unshare_pfn_hyp(pfn));

> 
> - if something absolutely cannot fail, then there is nothing to handle
> 
> Thanks,
> 
> 	M.
> 
> -- 
> Without deviation from the norm, progress is not possible.

Re: [PATCH 0/2] KVM: arm64: Fix host/hyp tracking on share/unshare hypercall failure

Posted by Vincent Donnefort 1 week, 3 days ago

On Fri, May 29, 2026 at 09:20:50AM +0100, Fuad Tabba wrote:
> On Fri, 29 May 2026 at 09:15, Marc Zyngier <maz@kernel.org> wrote:
> >
> > On Fri, 29 May 2026 09:05:35 +0100,
> > Fuad Tabba <tabba@google.com> wrote:
> > >
> > > On Fri, 29 May 2026 at 09:02, Vincent Donnefort <vdonnefort@google.com> wrote:
> > > >
> > > > On Fri, May 29, 2026 at 08:43:39AM +0100, tabba@google.com wrote:
> > > > > Hi folks,
> > > > >
> > > > > Yet another bug I found while testing Sashiko locally with fixes to
> > > > > review-prompts.
> > > > >
> > > > > share_pfn_hyp() and unshare_pfn_hyp() in arch/arm64/kvm/mmu.c
> > > > > maintain a host-side RB-tree mirroring the set of pages shared with
> > > > > EL2. Both invoke a hypercall that can fail (page-state mismatch,
> > > > > EL2 refcount still held), but neither cleans up on failure:
> > > > >
> > > > > - share_pfn_hyp() inserts the tracking node before the hypercall
> > > > >   and leaves it in the tree on failure, leaking the allocation and
> > > > >   presenting a phantom share to a later unshare.
> > > > >
> > > > > - unshare_pfn_hyp() erases the tracking node before the hypercall;
> > > > >   on failure the host loses its record while EL2 still owns the
> > > > >   share, breaking later operations on the same pfn.
> > > > >
> > > > > Severity is low (no isolation impact) and the failure paths are rare
> > > > > in practice, but the desync is real. Both patches are independent and
> > > > > apply cleanly to current mainline. In other words, this can wait for
> > > > > 7.2.
> > > >
> > > >
> > > > I believe I fixed that here lore.kernel.org/all/acyKhZL2di_QQ9xm@google.com but
> > > > > as Quentin pointed out, there's absolutely no reason for the hypercall to fail.
> > > > So I haven't sent a v2.
> > >
> > > At the very least we need to add a comment, otherwise, people like me
> > > and LLMs like Sashiko would stumble upon it.
> > >
> > > That said, this fix adds no real overhead, makes the code clearer, and
> > > guards us against a future where that call might fail.
> > > Self-documenting in essence.
> > >
> > > WDYT?
> >
> > If a hypercall really cannot fail, why does it have a return value?
> 
> Good point. If we know it cannot fail, how about just `void`?
> 
> That said, Vincent's exact words are: `very much unlikely`, not the
> same as cannot fail :)
> 
> https://lore.kernel.org/all/acyKhZL2di_QQ9xm@google.com/

The error would happen only if the host tries to share/unshare a page with the
wrong state. This would only happen in the case of a misbehaving host.

And Quentin's point was that this is incomplete anyway. To handle this error
properly, kvm_share_hyp/kvm_unshare_hyp would also need to roll things back...
The callers of the unshare should also leak the memory which couldn't be
unshared properly. This isn't the case now (however, we do WARN_ON).
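
To make the rollback point concrete, here is a minimal user-space sketch of a range share that undoes its partial work when one page fails. It is a model under assumptions, not kvm_share_hyp() itself: `share_one()`, `unshare_one()`, `share_range()`, `fail_pfn` and `nr_shared` are hypothetical names invented for the example.

```c
/* User-space model only; all names here are hypothetical. */
#include <stddef.h>

/* The pfn whose share attempt fails; ~0UL means "never fail". */
static unsigned long fail_pfn = ~0UL;

/* Number of pages currently shared in this model. */
static int nr_shared;

static int share_one(unsigned long pfn)
{
	if (pfn == fail_pfn)
		return -1;
	nr_shared++;
	return 0;
}

static void unshare_one(unsigned long pfn)
{
	(void)pfn;	/* the model only counts pages */
	nr_shared--;
}

/* Share [start, end); on failure, undo the pages shared so far. */
static int share_range(unsigned long start, unsigned long end)
{
	unsigned long pfn;
	int ret;

	for (pfn = start; pfn < end; pfn++) {
		ret = share_one(pfn);
		if (ret) {
			/* pfn itself was not shared; unwind pfn-1 down to start. */
			while (pfn-- > start)
				unshare_one(pfn);
			return ret;
		}
	}
	return 0;
}
```

With `fail_pfn` set inside the range, `share_range()` returns an error and leaves `nr_shared` where it started, which is the all-or-nothing behaviour a caller would need in order to act on the error.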

> 
> /fuad
> 
> >
> >         M.
> >
> > --
> > Without deviation from the norm, progress is not possible.

Re: [PATCH 0/2] KVM: arm64: Fix host/hyp tracking on share/unshare hypercall failure

Posted by Fuad Tabba 1 week, 3 days ago

On Fri, 29 May 2026 at 10:21, Vincent Donnefort <vdonnefort@google.com> wrote:
>
> On Fri, May 29, 2026 at 09:20:50AM +0100, Fuad Tabba wrote:
> > On Fri, 29 May 2026 at 09:15, Marc Zyngier <maz@kernel.org> wrote:
> > >
> > > On Fri, 29 May 2026 09:05:35 +0100,
> > > Fuad Tabba <tabba@google.com> wrote:
> > > >
> > > > On Fri, 29 May 2026 at 09:02, Vincent Donnefort <vdonnefort@google.com> wrote:
> > > > >
> > > > > On Fri, May 29, 2026 at 08:43:39AM +0100, tabba@google.com wrote:
> > > > > > Hi folks,
> > > > > >
> > > > > > Yet another bug I found while testing Sashiko locally with fixes to
> > > > > > review-prompts.
> > > > > >
> > > > > > share_pfn_hyp() and unshare_pfn_hyp() in arch/arm64/kvm/mmu.c
> > > > > > maintain a host-side RB-tree mirroring the set of pages shared with
> > > > > > EL2. Both invoke a hypercall that can fail (page-state mismatch,
> > > > > > EL2 refcount still held), but neither cleans up on failure:
> > > > > >
> > > > > > - share_pfn_hyp() inserts the tracking node before the hypercall
> > > > > >   and leaves it in the tree on failure, leaking the allocation and
> > > > > >   presenting a phantom share to a later unshare.
> > > > > >
> > > > > > - unshare_pfn_hyp() erases the tracking node before the hypercall;
> > > > > >   on failure the host loses its record while EL2 still owns the
> > > > > >   share, breaking later operations on the same pfn.
> > > > > >
> > > > > > Severity is low (no isolation impact) and the failure paths are rare
> > > > > > in practice, but the desync is real. Both patches are independent and
> > > > > > apply cleanly to current mainline. In other words, this can wait for
> > > > > > 7.2.
> > > > >
> > > > >
> > > > > I believe I fixed that here lore.kernel.org/all/acyKhZL2di_QQ9xm@google.com but
> > > > > > as Quentin pointed out, there's absolutely no reason for the hypercall to fail.
> > > > > So I haven't sent a v2.
> > > >
> > > > At the very least we need to add a comment, otherwise, people like me
> > > > and LLMs like Sashiko would stumble upon it.
> > > >
> > > > That said, this fix adds no real overhead, makes the code clearer, and
> > > > guards us against a future where that call might fail.
> > > > Self-documenting in essence.
> > > >
> > > > WDYT?
> > >
> > > If a hypercall really cannot fail, why does it have a return value?
> >
> > Good point. If we know it cannot fail, how about just `void`?
> >
> > That said, Vincent's exact words are: `very much unlikely`, not the
> > same as cannot fail :)
> >
> > https://lore.kernel.org/all/acyKhZL2di_QQ9xm@google.com/
>
> The error would happen only if the host tries to share/unshare a page with the
> wrong state. This would only happen in the case of a misbehaving host.
>
> And Quentin's point was that this is incomplete anyway. To handle this error
> properly, kvm_share_hyp/kvm_unshare_hyp would also need to roll things back...
> The callers of the unshare should also leak the memory which couldn't be
> unshared properly. This isn't the case now (however, we do WARN_ON).

If we WARN_ON() in hyp, then I argue we shouldn't have a return value.
Or at least add a comment, or a BUG_ON(), here. Think of the poor LLMs and
the people who run them :)

/fuad

>
> >
> > /fuad
> >
> > >
> > >         M.
> > >
> > > --
> > > Without deviation from the norm, progress is not possible.

Re: [PATCH 0/2] KVM: arm64: Fix host/hyp tracking on share/unshare hypercall failure

Posted by Vincent Donnefort 1 week, 3 days ago

On Fri, May 29, 2026 at 10:23:22AM +0100, Fuad Tabba wrote:
> On Fri, 29 May 2026 at 10:21, Vincent Donnefort <vdonnefort@google.com> wrote:
> >
> > On Fri, May 29, 2026 at 09:20:50AM +0100, Fuad Tabba wrote:
> > > On Fri, 29 May 2026 at 09:15, Marc Zyngier <maz@kernel.org> wrote:
> > > >
> > > > On Fri, 29 May 2026 09:05:35 +0100,
> > > > Fuad Tabba <tabba@google.com> wrote:
> > > > >
> > > > > On Fri, 29 May 2026 at 09:02, Vincent Donnefort <vdonnefort@google.com> wrote:
> > > > > >
> > > > > > On Fri, May 29, 2026 at 08:43:39AM +0100, tabba@google.com wrote:
> > > > > > > Hi folks,
> > > > > > >
> > > > > > > Yet another bug I found while testing Sashiko locally with fixes to
> > > > > > > review-prompts.
> > > > > > >
> > > > > > > share_pfn_hyp() and unshare_pfn_hyp() in arch/arm64/kvm/mmu.c
> > > > > > > maintain a host-side RB-tree mirroring the set of pages shared with
> > > > > > > EL2. Both invoke a hypercall that can fail (page-state mismatch,
> > > > > > > EL2 refcount still held), but neither cleans up on failure:
> > > > > > >
> > > > > > > - share_pfn_hyp() inserts the tracking node before the hypercall
> > > > > > >   and leaves it in the tree on failure, leaking the allocation and
> > > > > > >   presenting a phantom share to a later unshare.
> > > > > > >
> > > > > > > - unshare_pfn_hyp() erases the tracking node before the hypercall;
> > > > > > >   on failure the host loses its record while EL2 still owns the
> > > > > > >   share, breaking later operations on the same pfn.
> > > > > > >
> > > > > > > Severity is low (no isolation impact) and the failure paths are rare
> > > > > > > in practice, but the desync is real. Both patches are independent and
> > > > > > > apply cleanly to current mainline. In other words, this can wait for
> > > > > > > 7.2.
> > > > > >
> > > > > >
> > > > > > I believe I fixed that here lore.kernel.org/all/acyKhZL2di_QQ9xm@google.com but
> > > > > > > as Quentin pointed out, there's absolutely no reason for the hypercall to fail.
> > > > > > So I haven't sent a v2.
> > > > >
> > > > > At the very least we need to add a comment, otherwise, people like me
> > > > > and LLMs like Sashiko would stumble upon it.
> > > > >
> > > > > That said, this fix adds no real overhead, makes the code clearer, and
> > > > > guards us against a future where that call might fail.
> > > > > Self-documenting in essence.
> > > > >
> > > > > WDYT?
> > > >
> > > > If a hypercall really cannot fail, why does it have a return value?
> > >
> > > Good point. If we know it cannot fail, how about just `void`?
> > >
> > > That said, Vincent's exact words are: `very much unlikely`, not the
> > > same as cannot fail :)
> > >
> > > https://lore.kernel.org/all/acyKhZL2di_QQ9xm@google.com/
> >
> > The error would happen only if the host tries to share/unshare a page with the
> > wrong state. This would only happen in the case of a misbehaving host.
> >
> > And Quentin's point was that this is incomplete anyway. To handle this error
> > properly, kvm_share_hyp/kvm_unshare_hyp would also need to roll things back...
> > The callers of the unshare should also leak the memory which couldn't be
> > unshared properly. This isn't the case now (however, we do WARN_ON).
> 
> If we WARN_ON() in hyp, then I argue we shouldn't have a return value.

I meant the WARN_ON in the host's kvm_unshare_hyp()

> Or at least add a comment, or a BUG_ON(), here. Think of the poor LLMs and
> the people who run them :)
> 
> /fuad
> 
> >
> > >
> > > /fuad
> > >
> > > >
> > > >         M.
> > > >
> > > > --
> > > > Without deviation from the norm, progress is not possible.