[v2] RE: [PATCH v2 00/21] Runtime TDX Module update support

RE: [PATCH v2 00/21] Runtime TDX Module update support

Posted by Reshetova, Elena 3 months, 3 weeks ago


> -----Original Message-----
> From: Vishal Annapurve <vannapurve@google.com>
> Sent: Thursday, October 16, 2025 8:48 PM
> To: Reshetova, Elena <elena.reshetova@intel.com>
> Cc: Hansen, Dave <dave.hansen@intel.com>; Gao, Chao
> <chao.gao@intel.com>; linux-coco@lists.linux.dev; linux-
> kernel@vger.kernel.org; x86@kernel.org; Chatre, Reinette
> <reinette.chatre@intel.com>; Weiny, Ira <ira.weiny@intel.com>; Huang, Kai
> <kai.huang@intel.com>; Williams, Dan J <dan.j.williams@intel.com>;
> yilun.xu@linux.intel.com; sagis@google.com; paulmck@kernel.org;
> nik.borisov@suse.com; Borislav Petkov <bp@alien8.de>; Dave Hansen
> <dave.hansen@linux.intel.com>; H. Peter Anvin <hpa@zytor.com>; Ingo Molnar
> <mingo@redhat.com>; Kirill A. Shutemov <kas@kernel.org>; Paolo Bonzini
> <pbonzini@redhat.com>; Edgecombe, Rick P <rick.p.edgecombe@intel.com>;
> Thomas Gleixner <tglx@linutronix.de>
> Subject: Re: [PATCH v2 00/21] Runtime TDX Module update support
> 
> On Wed, Oct 15, 2025 at 11:46 PM Reshetova, Elena
> <elena.reshetova@intel.com> wrote:
> >
> > > ...
> > > > But the situation can be avoided fully, if TD preserving update is not
> > > conducted
> > > > during the TD build time.
> > >
> > > Sure, and the TDX module itself could guarantee this as well as much as
> > > the kernel could. It could decline to allow module updates during TD
> > > builds, or error out the TD build if it collides with an update.
> >
> > TDX module has a functionality to decline going into SHUTDOWN state
> > (pre-requisite for TD preserving update) if TD build or any problematic
> > operation is in progress. It requires VMM to opt-in into this feature.
> 
> Is this opt-in enabled as part of this series? If not, what is the
> mechanism to enable this opt-in?

For the information about how it works on TDX module side, 
please consult the latest ABI spec, definition of TDH.SYS.SHUTDOWN leaf,
page 321:
https://cdrdv2.intel.com/v1/dl/getContent/733579

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by Vishal Annapurve 3 months, 3 weeks ago

On Fri, Oct 17, 2025 at 3:08 AM Reshetova, Elena
<elena.reshetova@intel.com> wrote:
>
>
> > > > ...
> > > > > But the situation can be avoided fully, if TD preserving update is not
> > > > conducted
> > > > > during the TD build time.
> > > >
> > > > Sure, and the TDX module itself could guarantee this as well as much as
> > > > the kernel could. It could decline to allow module updates during TD
> > > > builds, or error out the TD build if it collides with an update.
> > >
> > > TDX module has a functionality to decline going into SHUTDOWN state
> > > (pre-requisite for TD preserving update) if TD build or any problematic
> > > operation is in progress. It requires VMM to opt-in into this feature.
> >
> > Is this opt-in enabled as part of this series? If not, what is the
> > mechanism to enable this opt-in?
>
> For the information about how it works on TDX module side,
> please consult the latest ABI spec, definition of TDH.SYS.SHUTDOWN leaf,
> page 321:
> https://cdrdv2.intel.com/v1/dl/getContent/733579
>

Thanks Elena. Should the patch [1] from this series be modified to
handle the TDX module shutdown as per:
"If supported by the TDX Module, the host VMM can set the
AVOID_COMPAT_SENSITIVE flag to request the TDX Module to fail
TDH.SYS.UPDATE if any of the TDs are currently in a state that is
impacted by the update-sensitive cases"

The documentation below doesn't make sense to me:
"The compatibility checks done by TDH.SYS.UPDATE do not include the
following cases:
* If any TD was initialized by an older TDX Module that did enumerate
TDX_FEATURES0.UPDATE_COMPATIBLITY as 1, TDH.SYS.SHUTDOWN does not
check for a TD build in progress condition for that TD.
* If any TD migration session is in progress, it was initialized by an
older TDX Module that did enumerate TDX_FEATURES0.UPDATE_COMPATIBLITY
as 1"

Was it supposed to say below?
"If any TD was initialized by an older TDX Module that did enumerate
TDX_FEATURES0.UPDATE_COMPATIBLITY as 0, TDH.SYS.SHUTDOWN does not
check for a TD build in progress condition for that TD"

[1] https://lore.kernel.org/all/20251001025442.427697-15-chao.gao@intel.com/

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by Chao Gao 3 months, 2 weeks ago

On Fri, Oct 17, 2025 at 05:01:55PM -0700, Vishal Annapurve wrote:
>On Fri, Oct 17, 2025 at 3:08 AM Reshetova, Elena
><elena.reshetova@intel.com> wrote:
>>
>>
>> > > > ...
>> > > > > But the situation can be avoided fully, if TD preserving update is not
>> > > > conducted
>> > > > > during the TD build time.
>> > > >
>> > > > Sure, and the TDX module itself could guarantee this as well as much as
>> > > > the kernel could. It could decline to allow module updates during TD
>> > > > builds, or error out the TD build if it collides with an update.
>> > >
>> > > TDX module has a functionality to decline going into SHUTDOWN state
>> > > (pre-requisite for TD preserving update) if TD build or any problematic
>> > > operation is in progress. It requires VMM to opt-in into this feature.
>> >
>> > Is this opt-in enabled as part of this series? If not, what is the
>> > mechanism to enable this opt-in?
>>
>> For the information about how it works on TDX module side,
>> please consult the latest ABI spec, definition of TDH.SYS.SHUTDOWN leaf,
>> page 321:
>> https://cdrdv2.intel.com/v1/dl/getContent/733579
>>
>
>Thanks Elena. Should the patch [1] from this series be modified to
>handle the TDX module shutdown as per:

Hi Vishal,

I will fix this issue in the next version.

The plan is to opt in post-update compatibility detection in the TDX
Module. If incompatibilities are found, the module will return errors to
any TD build or migration operations that were initiated prior to the
updates. Please refer to the TDH.SYS.UPDATE leaf definition in the ABI
spec above for details.

I prefer this approach because:

a. it guarantees forward progress. In contrast, failing updates would
   require admins to retry TDX Module updates, and no progress would be
   made unless they can successfully avoid race conditions between TDX
   module updates and TD build/migration operations. However, if such race
   conditions could be reliably prevented, this issue wouldn't require a
   fix in the first place.

b. it eliminates false alarms that could occur with the "block update"
   approach. Under the "block update" approach, updates would be rejected
   whenever TD build operations are running, regardless of whether the new
   module is actually compatible (e.g., when using the same crypto library as
   the current module). In contrast, the post-update detection approach only
   fails TD build or migration operations when genuine incompatibilities
   exist.

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by Vishal Annapurve 3 months, 2 weeks ago

On Wed, Oct 22, 2025 at 12:15 AM Chao Gao <chao.gao@intel.com> wrote:
>
> On Fri, Oct 17, 2025 at 05:01:55PM -0700, Vishal Annapurve wrote:
> >On Fri, Oct 17, 2025 at 3:08 AM Reshetova, Elena
> ><elena.reshetova@intel.com> wrote:
> >>
> >>
> >> > > > ...
> >> > > > > But the situation can be avoided fully, if TD preserving update is not
> >> > > > conducted
> >> > > > > during the TD build time.
> >> > > >
> >> > > > Sure, and the TDX module itself could guarantee this as well as much as
> >> > > > the kernel could. It could decline to allow module updates during TD
> >> > > > builds, or error out the TD build if it collides with an update.
> >> > >
> >> > > TDX module has a functionality to decline going into SHUTDOWN state
> >> > > (pre-requisite for TD preserving update) if TD build or any problematic
> >> > > operation is in progress. It requires VMM to opt-in into this feature.
> >> >
> >> > Is this opt-in enabled as part of this series? If not, what is the
> >> > mechanism to enable this opt-in?
> >>
> >> For the information about how it works on TDX module side,
> >> please consult the latest ABI spec, definition of TDH.SYS.SHUTDOWN leaf,
> >> page 321:
> >> https://cdrdv2.intel.com/v1/dl/getContent/733579
> >>
> >
> >Thanks Elena. Should the patch [1] from this series be modified to
> >handle the TDX module shutdown as per:
>
> Hi Vishal,
>
> I will fix this issue in the next version.
>
> The plan is to opt in post-update compatibility detection in the TDX
> Module. If incompatibilities are found, the module will return errors to
> any TD build or migration operations that were initiated prior to the
> updates. Please refer to the TDH.SYS.UPDATE leaf definition in the ABI
> spec above for details.
>
> I prefer this approach because:
>
> a. it guarantees forward progress. In contrast, failing updates would
>    require admins to retry TDX Module updates, and no progress would be
>    made unless they can successfully avoid race conditions between TDX
>    module updates and TD build/migration operations. However, if such race
>    conditions could be reliably prevented, this issue wouldn't require a
>    fix in the first place.

TD build operations are much more frequent than TDX module update
operations. Retrying TD build operation will need additional KVM and
userspace VMM changes IIUC (assuming TD build process needs to be
restarted from the scratch). IMO, it would be simpler to handle TDX
module update failures by retrying.

Admin logic to update TDX modules can be designed to either retry
failed TDX module updates or to be more robust, adds some
synchronization with VM creation attempts on the host. i.e. I think
it's fine to punt this problem of ensuring the forward progress to
user-space admin logic on the host.

>
> b. it eliminates false alarms that could occur with the "block update"
>    approach. Under the "block update" approach, updates would be rejected
>    whenever TD build operations are running, regardless of whether the new
>    module is actually compatible (e.g., when using the same crypto library as
>    the current module). In contrast, the post-update detection approach only
>    fails TD build or migration operations when genuine incompatibilities
>    exist.

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by Vishal Annapurve 3 months, 2 weeks ago

On Wed, Oct 22, 2025 at 8:42 AM Vishal Annapurve <vannapurve@google.com> wrote:
>
> On Wed, Oct 22, 2025 at 12:15 AM Chao Gao <chao.gao@intel.com> wrote:
> >
> > On Fri, Oct 17, 2025 at 05:01:55PM -0700, Vishal Annapurve wrote:
> > >On Fri, Oct 17, 2025 at 3:08 AM Reshetova, Elena
> > ><elena.reshetova@intel.com> wrote:
> > >>
> > >>
> > >> > > > ...
> > >> > > > > But the situation can be avoided fully, if TD preserving update is not
> > >> > > > conducted
> > >> > > > > during the TD build time.
> > >> > > >
> > >> > > > Sure, and the TDX module itself could guarantee this as well as much as
> > >> > > > the kernel could. It could decline to allow module updates during TD
> > >> > > > builds, or error out the TD build if it collides with an update.
> > >> > >
> > >> > > TDX module has a functionality to decline going into SHUTDOWN state
> > >> > > (pre-requisite for TD preserving update) if TD build or any problematic
> > >> > > operation is in progress. It requires VMM to opt-in into this feature.
> > >> >
> > >> > Is this opt-in enabled as part of this series? If not, what is the
> > >> > mechanism to enable this opt-in?
> > >>
> > >> For the information about how it works on TDX module side,
> > >> please consult the latest ABI spec, definition of TDH.SYS.SHUTDOWN leaf,
> > >> page 321:
> > >> https://cdrdv2.intel.com/v1/dl/getContent/733579
> > >>
> > >
> > >Thanks Elena. Should the patch [1] from this series be modified to
> > >handle the TDX module shutdown as per:
> >
> > Hi Vishal,
> >
> > I will fix this issue in the next version.
> >
> > The plan is to opt in post-update compatibility detection in the TDX
> > Module. If incompatibilities are found, the module will return errors to
> > any TD build or migration operations that were initiated prior to the
> > updates. Please refer to the TDH.SYS.UPDATE leaf definition in the ABI
> > spec above for details.
> >
> > I prefer this approach because:
> >
> > a. it guarantees forward progress. In contrast, failing updates would
> >    require admins to retry TDX Module updates, and no progress would be
> >    made unless they can successfully avoid race conditions between TDX
> >    module updates and TD build/migration operations. However, if such race
> >    conditions could be reliably prevented, this issue wouldn't require a
> >    fix in the first place.
>
> TD build operations are much more frequent than TDX module update
> operations. Retrying TD build operation will need additional KVM and
> userspace VMM changes IIUC (assuming TD build process needs to be
> restarted from the scratch). IMO, it would be simpler to handle TDX
> module update failures by retrying.
>
> Admin logic to update TDX modules can be designed to either retry
> failed TDX module updates or to be more robust, adds some
> synchronization with VM creation attempts on the host. i.e. I think
> it's fine to punt this problem of ensuring the forward progress to
> user-space admin logic on the host.

Discussed offline with Erdem Aktas on this. From Google's perspective
"Avoid updates during updatesensitive times" seems a better option as
I mentioned above.

To avoid having to choose which policy to enforce in kernel, a better
way could be to:
* Allow user space opt-in for "Avoid updates during updatesensitive times" AND
* Allow user space opt-in for "Detect incompatibility after update" as well OR
* Keep "Detect incompatibility after update" enabled by default based
on the appetite for avoiding silent corruption scenarios.

>
> >
> > b. it eliminates false alarms that could occur with the "block update"
> >    approach. Under the "block update" approach, updates would be rejected
> >    whenever TD build operations are running, regardless of whether the new
> >    module is actually compatible (e.g., when using the same crypto library as
> >    the current module). In contrast, the post-update detection approach only
> >    fails TD build or migration operations when genuine incompatibilities
> >    exist.

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by Dave Hansen 3 months, 2 weeks ago

On 10/23/25 13:31, Vishal Annapurve wrote:
...
>> Admin logic to update TDX modules can be designed to either retry
>> failed TDX module updates or to be more robust, adds some
>> synchronization with VM creation attempts on the host. i.e. I think
>> it's fine to punt this problem of ensuring the forward progress to
>> user-space admin logic on the host.
> Discussed offline with Erdem Aktas on this. From Google's perspective
> "Avoid updates during updatesensitive times" seems a better option as
> I mentioned above.
> 
> To avoid having to choose which policy to enforce in kernel, a better
> way could be to:
> * Allow user space opt-in for "Avoid updates during updatesensitive times" AND
> * Allow user space opt-in for "Detect incompatibility after update" as well OR
> * Keep "Detect incompatibility after update" enabled by default based
> on the appetite for avoiding silent corruption scenarios.

I'd really prefer to keep this simple. Adding new opt-in ABIs up the
wazoo doesn't seem great.

I think I've heard three requirements in the end:

1. Guarantee module update forward progress
2. Avoid "corrupt" TD build processes by letting the build/update
   race happen
3. Don't complicate the build process by forcing it to error out
   if a module update clobbers a build

One thing I don't think I've heard anyone be worried about is how timely
the update process is. So how about this: Updates wait for any existing
builds to complete. But, new builds wait for updates. That can be done
with a single rwsem:

struct rw_semaphore update_rwsem;

tdx_td_init()
{
	...
+	down_read_interruptible(&update_rwsem);
	kvm_tdx->state = TD_STATE_INITIALIZED;

tdx_td_finalize()
{
	...
+	up_read(&update_rwsem);
	kvm_tdx->state = TD_STATE_RUNNABLE;

A module update does:

	down_write_interruptible(&update_rwsem);
	do_actual_update();
	up_write(&update_rwsem);

There would be no corruption issues, no erroring out of the build
process, and no punting to userspace to ensure forward progress.

The big downside is that both the build process and update process can
appear to hang for a long time. It'll also be a bit annoying to ensure
that there are up_read(&update_rwsem)'s if the kvm_tdx object gets torn
down during a build.

But the massive upside is that there's no new ABI and all the
consistency and forward progress guarantees are in the kernel. If we
want new ABIs around it that give O_NONBLOCK semantics to build or
update, that can be added on after the fact.

Plus, if userspace *WANTS* to coordinate the whole shebang, they're free
to. They'd never see long hangs because they would be coordinating.

Thoughts?

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by Chao Gao 3 months, 2 weeks ago

>One thing I don't think I've heard anyone be worried about is how timely
>the update process is. So how about this: Updates wait for any existing
>builds to complete. But, new builds wait for updates. That can be done
>with a single rwsem:
>
>struct rw_semaphore update_rwsem;
>
>tdx_td_init()
>{
>	...
>+	down_read_interruptible(&update_rwsem);
>	kvm_tdx->state = TD_STATE_INITIALIZED;
>
>tdx_td_finalize()
>{
>	...
>+	up_read(&update_rwsem);
>	kvm_tdx->state = TD_STATE_RUNNABLE;
>
>A module update does:
>
>	down_write_interruptible(&update_rwsem);
>	do_actual_update();
>	up_write(&update_rwsem);
>
>There would be no corruption issues, no erroring out of the build
>process, and no punting to userspace to ensure forward progress.
>
>The big downside is that both the build process and update process can
>appear to hang for a long time. It'll also be a bit annoying to ensure
>that there are up_read(&update_rwsem)'s if the kvm_tdx object gets torn
>down during a build.
>
>But the massive upside is that there's no new ABI and all the
>consistency and forward progress guarantees are in the kernel. If we
>want new ABIs around it that give O_NONBLOCK semantics to build or
>update, that can be added on after the fact.
>
>Plus, if userspace *WANTS* to coordinate the whole shebang, they're free
>to. They'd never see long hangs because they would be coordinating.
>
>Thoughts?

Hi Dave,

Thanks for this summary and suggestion.

Beyond "the kvm_tdx object gets torn down during a build," I see two potential
issues:

1. TD Build and TDX migration aren't purely kernel processes -- they span multiple
   KVM ioctls. Holding a read-write lock throughout the entire process would
   require exiting to userspace while the lock is held. I think this is
   irregular, but I'm not sure if it's acceptable for read-write semaphores.

2. The kernel may need to hold this read-write lock for operations not yet
   defined in the future. The TDX Module Base spec [*] notes on page 55:

   : Future TDX Module versions may have different or additional update-sensitive
   : cases. By design, such cases apply to a small portion of the overall TD
   : lifecycle.

[*]: https://cdrdv2.intel.com/v1/dl/getContent/733575

Given these concerns, I'm not sure whether implementing a read-write lock in
the kernel is the right approach.

Since Google prefers to "avoid updates during update-sensitive times," we can
implement that approach for now. If other Linux users find this insufficient
and prefer failing TD build/migration operations with strong justification, we
can enable that functionality in the future.

What do you think?

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by Dave Hansen 3 months, 2 weeks ago

On 10/24/25 00:43, Chao Gao wrote:
...
> Beyond "the kvm_tdx object gets torn down during a build," I see two potential
> issues:
> 
> 1. TD Build and TDX migration aren't purely kernel processes -- they span multiple
>    KVM ioctls. Holding a read-write lock throughout the entire process would
>    require exiting to userspace while the lock is held. I think this is
>    irregular, but I'm not sure if it's acceptable for read-write semaphores.

Sure, I guess it's irregular. But look at it this way: let's say we
concocted some scheme to use a TD build refcount and a module update
flag, had them both wait_event_interruptible() on each other, and then
did wakeups. That would get the same semantics without an rwsem.

So is your issue with the rwsem, or the idea that one bit of userspace
can block another bit of userspace from doing something for arbitrary
lengths of time?

The one thing that worries me is solidly tying the build-side
down_read() to the lifetime of kvm_tdx. But that shouldn't be rocket
science. There also isn't a down_write_interruptible(), only
down_write_killable().

> 2. The kernel may need to hold this read-write lock for operations not yet
>    defined in the future. The TDX Module Base spec [*] notes on page 55:
> 
>    : Future TDX Module versions may have different or additional update-sensitive
>    : cases. By design, such cases apply to a small portion of the overall TD
>    : lifecycle.
Sure... But that's not a license for the TDX module to do completely
silly things. Elena is on cc here for a reason and I'm sure she'll
ensure that nothing silly gets put into the TDX module that will cause
problems here.

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by dan.j.williams@intel.com 3 months, 2 weeks ago

Dave Hansen wrote:
> On 10/24/25 00:43, Chao Gao wrote:
> ...
> > Beyond "the kvm_tdx object gets torn down during a build," I see two potential
> > issues:
> > 
> > 1. TD Build and TDX migration aren't purely kernel processes -- they span multiple
> >    KVM ioctls. Holding a read-write lock throughout the entire process would
> >    require exiting to userspace while the lock is held. I think this is
> >    irregular, but I'm not sure if it's acceptable for read-write semaphores.
> 
> Sure, I guess it's irregular. But look at it this way: let's say we
> concocted some scheme to use a TD build refcount and a module update
> flag, had them both wait_event_interruptible() on each other, and then
> did wakeups. That would get the same semantics without an rwsem.

This sounds unworkable to me.

First, you cannot return to userspace while holding a lock. Lockdep will
rightfully scream:

    "WARNING: lock held when returning to user space!"

The complexity of ensuring that a multi-stage ABI transaction completes
from the kernel side is painful. If that process dies in the middle of
its ABI sequence who cleans up these references?

The operational mechanism to make sure that one process flow does not
mess up another process flow is for those process to communicate with
*userspace* file locks, or for those process to check for failures after
the fact and retry. Unless you can make the build side an atomic ABI,
this is a documentation + userspace problem, not a kernel problem.

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by Dave Hansen 3 months, 2 weeks ago

On 10/24/25 12:40, dan.j.williams@intel.com wrote:
> Dave Hansen wrote:
>> On 10/24/25 00:43, Chao Gao wrote:
>> ...
>>> Beyond "the kvm_tdx object gets torn down during a build," I see two potential
>>> issues:
>>>
>>> 1. TD Build and TDX migration aren't purely kernel processes -- they span multiple
>>>    KVM ioctls. Holding a read-write lock throughout the entire process would
>>>    require exiting to userspace while the lock is held. I think this is
>>>    irregular, but I'm not sure if it's acceptable for read-write semaphores.
>>
>> Sure, I guess it's irregular. But look at it this way: let's say we
>> concocted some scheme to use a TD build refcount and a module update
>> flag, had them both wait_event_interruptible() on each other, and then
>> did wakeups. That would get the same semantics without an rwsem.
> 
> This sounds unworkable to me.
> 
> First, you cannot return to userspace while holding a lock. Lockdep will
> rightfully scream:
> 
>     "WARNING: lock held when returning to user space!"

Well, yup, it sure does look that way for normal lockdep-annotated lock
types. It does seem like a sane rule to have for most things.

But, just to be clear, this is a lockdep thing and a good, solid
semantic to have. It's not a rule that no kernel locking structure can
ever be held when returning to userspace.

> The complexity of ensuring that a multi-stage ABI transaction completes
> from the kernel side is painful. If that process dies in the middle of
> its ABI sequence who cleans up these references?

The 'struct kvm_tdx' has to get destroyed at some point. It also has a
'kvm_tdx_state' field that could be tied very tightly to the build
status. The reference gets cleaned up before the point when the
kvm_tdx->state memory is freed.

> The operational mechanism to make sure that one process flow does not
> mess up another process flow is for those process to communicate with
> *userspace* file locks, or for those process to check for failures after
> the fact and retry. Unless you can make the build side an atomic ABI,
> this is a documentation + userspace problem, not a kernel problem.

Yeah, that's a totally valid take on it.

My only worry is that the module update is going to be off in another
world from the thing building TDs. We had a similar set of challenges
around microcode updates, CPUSVN and SGX enclaves.

The guy doing "echo 1 > /sys/.../whatever" wasn't coordinating with
every entity on the system that might run an SGX enclave. It certainly
didn't help that enclave creation is typically done by unprivileged
users. Maybe the KVM/TDX world is a _bit_ more narrow and they will be
talking to each other, or the /dev/kvm permissions will be a nice funnel
to get them talking to each other.

The SGX solution, btw, was to at least ensure forward progress (CPUSVN
update) when the last enclave goes away. So new enclaves aren't
*prevented* from starting but the window when the first one starts
(enclave count going from 0->1) is leveraged to do the update.

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by dan.j.williams@intel.com 3 months, 2 weeks ago

Dave Hansen wrote:
> On 10/24/25 12:40, dan.j.williams@intel.com wrote:
> > Dave Hansen wrote:
> >> On 10/24/25 00:43, Chao Gao wrote:
> >> ...
> >>> Beyond "the kvm_tdx object gets torn down during a build," I see two potential
> >>> issues:
> >>>
> >>> 1. TD Build and TDX migration aren't purely kernel processes -- they span multiple
> >>>    KVM ioctls. Holding a read-write lock throughout the entire process would
> >>>    require exiting to userspace while the lock is held. I think this is
> >>>    irregular, but I'm not sure if it's acceptable for read-write semaphores.
> >>
> >> Sure, I guess it's irregular. But look at it this way: let's say we
> >> concocted some scheme to use a TD build refcount and a module update
> >> flag, had them both wait_event_interruptible() on each other, and then
> >> did wakeups. That would get the same semantics without an rwsem.
> > 
> > This sounds unworkable to me.
> > 
> > First, you cannot return to userspace while holding a lock. Lockdep will
> > rightfully scream:
> > 
> >     "WARNING: lock held when returning to user space!"
> 
> Well, yup, it sure does look that way for normal lockdep-annotated lock
> types. It does seem like a sane rule to have for most things.
> 
> But, just to be clear, this is a lockdep thing and a good, solid
> semantic to have. It's not a rule that no kernel locking structure can
> ever be held when returning to userspace.

Sure, but I would submit that the lesser known cousin of the common
suggestion "do not write your own locking primitives" is "do not invent
locking schemes that involve holding locks over return to userspace". It
is rarely a good idea to the point that lockdep warns about it by
default.

> > The complexity of ensuring that a multi-stage ABI transaction completes
> > from the kernel side is painful. If that process dies in the middle of
> > its ABI sequence who cleans up these references?
> 
> The 'struct kvm_tdx' has to get destroyed at some point.

Indefinite hangs because a process goes out to lunch and fails to
destroy kvm_tdx in a reasonable timeframe now has knock-on effects.

[..]
> > The operational mechanism to make sure that one process flow does not
> > mess up another process flow is for those process to communicate with
> > *userspace* file locks, or for those process to check for failures after
> > the fact and retry. Unless you can make the build side an atomic ABI,
> > this is a documentation + userspace problem, not a kernel problem.
> 
> Yeah, that's a totally valid take on it.
> 
> My only worry is that the module update is going to be off in another
> world from the thing building TDs. We had a similar set of challenges
> around microcode updates, CPUSVN and SGX enclaves.
> 
> The guy doing "echo 1 > /sys/.../whatever" wasn't coordinating with
> every entity on the system that might run an SGX enclave. It certainly
> didn't help that enclave creation is typically done by unprivileged
> users. Maybe the KVM/TDX world is a _bit_ more narrow and they will be
> talking to each other, or the /dev/kvm permissions will be a nice funnel
> to get them talking to each other.
> 
> The SGX solution, btw, was to at least ensure forward progress (CPUSVN
> update) when the last enclave goes away. So new enclaves aren't
> *prevented* from starting but the window when the first one starts
> (enclave count going from 0->1) is leveraged to do the update.

The status quo does ensure forward progress. The TD does get built and
the update does complete, just the small matter of TD attestation
failures, right?

Note, we had a similar problem with the tsm_report interface which,
because it is configfs and not an ioctl, is a multi-stage ABI to build a
report. If 2 threads collide in building an object, userspace indeed
gets to keep the pieces, but there is:

1/ Documentation of the potential for collisions

2/ A mechanism to detect collisions. See
   /sys/kernel/config/tsm/report/$name/generation in
   Documentation/ABI/testing/configfs-tsm-report

I really would not worry about the "off in another world" problem, it is
par for the course for datacenter operations. I encountered prolific use
of file locks in operations scripts at my time at Facebook. Think of
problems like coordinating disk partitioning across various provisioning
flows. The kernel happily lets 2 fdisk processes race to write a
partition table. The only way to ensure a consistent result in that case
is userspace sequencing, not a kernel lock while some process has a
partition table open.

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by Dave Hansen 3 months, 2 weeks ago

On 10/24/25 14:12, dan.j.williams@intel.com wrote:
>> The SGX solution, btw, was to at least ensure forward progress (CPUSVN
>> update) when the last enclave goes away. So new enclaves aren't
>> *prevented* from starting but the window when the first one starts
>> (enclave count going from 0->1) is leveraged to do the update.
> The status quo does ensure forward progress. The TD does get built and
> the update does complete, just the small matter of TD attestation
> failures, right?

Oh, yeah, for sure.

If we do _nothing_ in the kernel (no build vs. module update
synchronization), then the downside is being exposed to attestation
failures if userspace either also does nothing or has bugs.

That's actually, by far, my preferred solution to this whole mess:
Userspace plays stupid games, userspace wins stupid prizes.

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by Vishal Annapurve 3 months, 2 weeks ago

On Fri, Oct 24, 2025 at 2:19 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/24/25 14:12, dan.j.williams@intel.com wrote:
> >> The SGX solution, btw, was to at least ensure forward progress (CPUSVN
> >> update) when the last enclave goes away. So new enclaves aren't
> >> *prevented* from starting but the window when the first one starts
> >> (enclave count going from 0->1) is leveraged to do the update.
> > The status quo does ensure forward progress. The TD does get built and
> > the update does complete, just the small matter of TD attestation
> > failures, right?

I would think that it's not a "small" problem if confidential
workloads on the hosts are not able to pass attestation.

>
> Oh, yeah, for sure.
>
> If we do _nothing_ in the kernel (no build vs. module update
> synchronization), then the downside is being exposed to attestation
> failures if userspace either also does nothing or has bugs.
>
> That's actually, by far, my preferred solution to this whole mess:
> Userspace plays stupid games, userspace wins stupid prizes.
>

IIUC, enforcing "Avoid updates during update sensitive times" is not
that complex and will ensure to avoid any issues with user space
logic.

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by dan.j.williams@intel.com 3 months, 2 weeks ago

Vishal Annapurve wrote:
> On Fri, Oct 24, 2025 at 2:19 PM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 10/24/25 14:12, dan.j.williams@intel.com wrote:
> > >> The SGX solution, btw, was to at least ensure forward progress (CPUSVN
> > >> update) when the last enclave goes away. So new enclaves aren't
> > >> *prevented* from starting but the window when the first one starts
> > >> (enclave count going from 0->1) is leveraged to do the update.
> > > The status quo does ensure forward progress. The TD does get built and
> > > the update does complete, just the small matter of TD attestation
> > > failures, right?
> 
> I would think that it's not a "small" problem if confidential
> workloads on the hosts are not able to pass attestation.

"Small" as in "not the kernel's problem". Userspace asked for the
update, update is documented to clobber build sometimes, userspace ran
an update anyway. Userspace asked for the clobber.

It would be lovely if this clobbering does not happen at all and the
update mechanism did not come with this misfeature. Otherwise, the kernel
has no interface to solve that problem. The best it can do is document
that this new update facility has this side effect.

Userspace always has the choice to not update, coordinate update with
build, or do nothing and let tenants try to launch again.  Userspace
could even retry the build and hide the tenant failure if it knew about
the clobber, but be clear that the problem is the clobber not the kernel
doing what userspace asked.

The clobber, as I understand, is also limited to cases where the update
includes crypto library changes. I am not sure how often that happens in
practice. Suffice to say, the fact that the clobber is conditioned on
the contents of the update also puts it further away from being a kernel
problem. The clobber does not corrupt kernel state.

> > Oh, yeah, for sure.
> >
> > If we do _nothing_ in the kernel (no build vs. module update
> > synchronization), then the downside is being exposed to attestation
> > failures if userspace either also does nothing or has bugs.
> >
> > That's actually, by far, my preferred solution to this whole mess:
> > Userspace plays stupid games, userspace wins stupid prizes.
> >
> 
> IIUC, enforcing "Avoid updates during update sensitive times" is not
> that complex and will ensure to avoid any issues with user space
> logic.

Userspace logic avoids issues by honoring the documentation that these
ABIs sequences need synchronization. Otherwise, kernel blocking update
during build just trades one error for another.

Treat this like any other userspace solution for requiring "atomic"
semantics when the kernel mechanisms are not themselves designed to be
atomic, wrap it in userspace synchronization.

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by Vishal Annapurve 3 months, 2 weeks ago

On Fri, Oct 24, 2025 at 6:42 PM <dan.j.williams@intel.com> wrote:
>
> Vishal Annapurve wrote:
> > On Fri, Oct 24, 2025 at 2:19 PM Dave Hansen <dave.hansen@intel.com> wrote:
> > >
> > > On 10/24/25 14:12, dan.j.williams@intel.com wrote:
> > > >> The SGX solution, btw, was to at least ensure forward progress (CPUSVN
> > > >> update) when the last enclave goes away. So new enclaves aren't
> > > >> *prevented* from starting but the window when the first one starts
> > > >> (enclave count going from 0->1) is leveraged to do the update.
> > > > The status quo does ensure forward progress. The TD does get built and
> > > > the update does complete, just the small matter of TD attestation
> > > > failures, right?
> >
> > I would think that it's not a "small" problem if confidential
> > workloads on the hosts are not able to pass attestation.
>
> "Small" as in "not the kernel's problem". Userspace asked for the
> update, update is documented to clobber build sometimes, userspace ran
> an update anyway. Userspace asked for the clobber.
>
> It would be lovely if this clobbering does not happen at all and the
> update mechanism did not come with this misfeature. Otherwise, the kernel
> has no interface to solve that problem. The best it can do is document
> that this new update facility has this side effect.

In this case, host kernel has a way to ensure that userspace can't
trigger such clobbering at all. That IIUC is "Avoid updates during
update sensitive times". Best kernel can do is prevent userspace from
screwing up the state of TDs.

>
> Userspace always has the choice to not update, coordinate update with
> build, or do nothing and let tenants try to launch again.  Userspace
> could even retry the build and hide the tenant failure if it knew about
> the clobber,

IIUC host userspace has no way to know if the TD state got clobbered.

> but be clear that the problem is the clobber not the kernel
> doing what userspace asked.
>
> The clobber, as I understand, is also limited to cases where the update
> includes crypto library changes. I am not sure how often that happens in
> practice. Suffice to say, the fact that the clobber is conditioned on
> the contents of the update also puts it further away from being a kernel

The knowledge of things getting clobbered are well much further away
from userspace.

> problem. The clobber does not corrupt kernel state.
>
> > > Oh, yeah, for sure.
> > >
> > > If we do _nothing_ in the kernel (no build vs. module update
> > > synchronization), then the downside is being exposed to attestation
> > > failures if userspace either also does nothing or has bugs.
> > >
> > > That's actually, by far, my preferred solution to this whole mess:
> > > Userspace plays stupid games, userspace wins stupid prizes.
> > >
> >
> > IIUC, enforcing "Avoid updates during update sensitive times" is not
> > that complex and will ensure to avoid any issues with user space
> > logic.
>
> Userspace logic avoids issues by honoring the documentation that these
> ABIs sequences need synchronization. Otherwise, kernel blocking update
> during build just trades one error for another.

Kernel blocking update during build makes the production systems much
safer and prevents userspace from screwing up the state that it has no
way to detect after the fact.

>
> Treat this like any other userspace solution for requiring "atomic"
> semantics when the kernel mechanisms are not themselves designed to be
> atomic, wrap it in userspace synchronization.

In general if this is something userspace detectable I would agree,
TDX module is the closest entity that can detect the problematic
sequence and the host kernel has a very simple way to ensure that such
a problematic sequence is not at all allowed to happen by toggling
some seamcall controls. It would be very helpful IMO to ensure that
userspace is not able to screw up production workloads especially if
the mess is not all visible to userspace.

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by dan.j.williams@intel.com 3 months, 2 weeks ago

Vishal Annapurve wrote:
> On Fri, Oct 24, 2025 at 6:42 PM <dan.j.williams@intel.com> wrote:
> >
> > Vishal Annapurve wrote:
> > > On Fri, Oct 24, 2025 at 2:19 PM Dave Hansen <dave.hansen@intel.com> wrote:
> > > >
> > > > On 10/24/25 14:12, dan.j.williams@intel.com wrote:
> > > > >> The SGX solution, btw, was to at least ensure forward progress (CPUSVN
> > > > >> update) when the last enclave goes away. So new enclaves aren't
> > > > >> *prevented* from starting but the window when the first one starts
> > > > >> (enclave count going from 0->1) is leveraged to do the update.
> > > > > The status quo does ensure forward progress. The TD does get built and
> > > > > the update does complete, just the small matter of TD attestation
> > > > > failures, right?
> > >
> > > I would think that it's not a "small" problem if confidential
> > > workloads on the hosts are not able to pass attestation.
> >
> > "Small" as in "not the kernel's problem". Userspace asked for the
> > update, update is documented to clobber build sometimes, userspace ran
> > an update anyway. Userspace asked for the clobber.
> >
> > It would be lovely if this clobbering does not happen at all and the
> > update mechanism did not come with this misfeature. Otherwise, the kernel
> > has no interface to solve that problem. The best it can do is document
> > that this new update facility has this side effect.
> 
> In this case, host kernel has a way to ensure that userspace can't
> trigger such clobbering at all. 

Unless the clobber condition can be made atomic with respect to update
so that both succeed, the kernel needs to punt the syncrhonization
problem to userspace.

A theoretical TDX Module change could ensure that atomicity. A
theoretical change to the kernel's build ABI could effect that as well,
or notify the collision. I.e. a flag at the finalization stage that an
update happened during the build sequence needs a restart. This is the
role of "generation" in the tsm_report ABI. As far as I understand
userspace just skips that ABI and arranges for userspace synchronized
access to tsm_report.

At the point where the solution is "change existing build flows" might
as well just have userspace wrap the flows with userspace exclusion.

> That IIUC is "Avoid updates during update sensitive times". Best
> kernel can do is prevent userspace from screwing up the state of TDs.

"Avoid updates during update sensitive times" is the documentation for
the update userspace ABI.

> > Userspace always has the choice to not update, coordinate update with
> > build, or do nothing and let tenants try to launch again.  Userspace
> > could even retry the build and hide the tenant failure if it knew about
> > the clobber,
> 
> IIUC host userspace has no way to know if the TD state got clobbered.

Correct, today it can only assume that both flows need to be mutually
exclusive.

> > but be clear that the problem is the clobber not the kernel
> > doing what userspace asked.
> >
> > The clobber, as I understand, is also limited to cases where the update
> > includes crypto library changes. I am not sure how often that happens in
> > practice. Suffice to say, the fact that the clobber is conditioned on
> > the contents of the update also puts it further away from being a kernel
> 
> The knowledge of things getting clobbered are well much further away
> from userspace.

The possibility is documented as part of the update ABI. Another
documentation possibility is that updates that change the crypto library
are by definition not "runtime update" capable. A possible TDX Module
change to remove this collision. A menu of options before complicating
the kernel.

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by Vishal Annapurve 3 months, 2 weeks ago

On Sun, Oct 26, 2025 at 2:30 PM <dan.j.williams@intel.com> wrote:
>
> Vishal Annapurve wrote:
> > On Fri, Oct 24, 2025 at 6:42 PM <dan.j.williams@intel.com> wrote:
> > >
> > > Vishal Annapurve wrote:
> > > > On Fri, Oct 24, 2025 at 2:19 PM Dave Hansen <dave.hansen@intel.com> wrote:
> > > > >
> > > > > On 10/24/25 14:12, dan.j.williams@intel.com wrote:
> > > > > >> The SGX solution, btw, was to at least ensure forward progress (CPUSVN
> > > > > >> update) when the last enclave goes away. So new enclaves aren't
> > > > > >> *prevented* from starting but the window when the first one starts
> > > > > >> (enclave count going from 0->1) is leveraged to do the update.
> > > > > > The status quo does ensure forward progress. The TD does get built and
> > > > > > the update does complete, just the small matter of TD attestation
> > > > > > failures, right?
> > > >
> > > > I would think that it's not a "small" problem if confidential
> > > > workloads on the hosts are not able to pass attestation.
> > >
> > > "Small" as in "not the kernel's problem". Userspace asked for the
> > > update, update is documented to clobber build sometimes, userspace ran
> > > an update anyway. Userspace asked for the clobber.
> > >
> > > It would be lovely if this clobbering does not happen at all and the
> > > update mechanism did not come with this misfeature. Otherwise, the kernel
> > > has no interface to solve that problem. The best it can do is document
> > > that this new update facility has this side effect.
> >
> > In this case, host kernel has a way to ensure that userspace can't
> > trigger such clobbering at all.
>
> Unless the clobber condition can be made atomic with respect to update
> so that both succeed, the kernel needs to punt the syncrhonization
> problem to userspace.
>
> A theoretical TDX Module change could ensure that atomicity.

IIUC TDX module already supports avoiding this clobber based on the
TDH.SYS.SHUTDOWN documentation from section 5.4.73 of TDX ABI Spec
[1].

Host kernel needs to set bit 16 of rcx when invoking TDH.SYS.SHUTDOWN
is available.

"If supported by the TDX Module, the host VMM can set the
AVOID_COMPAT_SENSITIVE flag to request the TDX Module to fail
TDH.SYS.UPDATE if any of the TDs are currently in a state that is
impacted by the update-sensitive cases."

I think the above documentation should replace TDH.SYS.UPDATE with
TDH.SYS.SHUTDOWN IIUC.

[1] https://cdrdv2.intel.com/v1/dl/getContent/733579

> A
> theoretical change to the kernel's build ABI could effect that as well,
> or notify the collision. I.e. a flag at the finalization stage that an
> update happened during the build sequence needs a restart. This is the
> role of "generation" in the tsm_report ABI. As far as I understand
> userspace just skips that ABI and arranges for userspace synchronized
> access to tsm_report.
>

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by dan.j.williams@intel.com 3 months, 2 weeks ago

Vishal Annapurve wrote:
[..]
> > A theoretical TDX Module change could ensure that atomicity.
> 
> IIUC TDX module already supports avoiding this clobber based on the
> TDH.SYS.SHUTDOWN documentation from section 5.4.73 of TDX ABI Spec
> [1].
> 
> Host kernel needs to set bit 16 of rcx when invoking TDH.SYS.SHUTDOWN
> is available.
> 
> "If supported by the TDX Module, the host VMM can set the
> AVOID_COMPAT_SENSITIVE flag to request the TDX Module to fail
> TDH.SYS.UPDATE if any of the TDs are currently in a state that is
> impacted by the update-sensitive cases."

That is not a fix. That just shifts the complexity from build to update.
It still leaves update in a state where it is not guaranteed to make
forward progress. The way to ensure forward progress is the same as
ensuring build consistency, i.e. sequence build with respect to update.
The kernel sheds complexity by ether making userspace solve that
problem, or motivating a real fix in the TDX Module that obviates the
AVOID_COMPAT_SENSITIVE case.

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by Vishal Annapurve 3 months, 2 weeks ago

On Mon, Oct 27, 2025 at 11:53 AM <dan.j.williams@intel.com> wrote:
>
> Vishal Annapurve wrote:
> [..]
> > > A theoretical TDX Module change could ensure that atomicity.
> >
> > IIUC TDX module already supports avoiding this clobber based on the
> > TDH.SYS.SHUTDOWN documentation from section 5.4.73 of TDX ABI Spec
> > [1].
> >
> > Host kernel needs to set bit 16 of rcx when invoking TDH.SYS.SHUTDOWN
> > is available.
> >
> > "If supported by the TDX Module, the host VMM can set the
> > AVOID_COMPAT_SENSITIVE flag to request the TDX Module to fail
> > TDH.SYS.UPDATE if any of the TDs are currently in a state that is
> > impacted by the update-sensitive cases."
>
> That is not a fix. That just shifts the complexity from build to update.
> It still leaves update in a state where it is not guaranteed to make

IMO, there are two problems here:
1) Giving a consistent ABI that leaves the responsibility of ensuring
forward progress by sequencing TD update with TD build steps with
userspace.
2) Ensuring that userspace can't screw up the in-progress TD VM
metadata if userspace doesn't adhere to the sequence above.

Problem 2 should be solved in the TDX module as it is the state owner
and should be given a chance to ensure that nothing else can affect
it's state. Kernel is just opting-in to toggle the already provided
TDX module ABI. I don't think this is adding complexity to the kernel.

> forward progress. The way to ensure forward progress is the same as
> ensuring build consistency, i.e. sequence build with respect to update.
> The kernel sheds complexity by ether making userspace solve that
> problem, or motivating a real fix in the TDX Module that obviates the
> AVOID_COMPAT_SENSITIVE case.

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by dan.j.williams@intel.com 3 months, 1 week ago

Vishal Annapurve wrote:
> On Mon, Oct 27, 2025 at 11:53 AM <dan.j.williams@intel.com> wrote:
> >
> > Vishal Annapurve wrote:
> > [..]
> > > > A theoretical TDX Module change could ensure that atomicity.
> > >
> > > IIUC TDX module already supports avoiding this clobber based on the
> > > TDH.SYS.SHUTDOWN documentation from section 5.4.73 of TDX ABI Spec
> > > [1].
> > >
> > > Host kernel needs to set bit 16 of rcx when invoking TDH.SYS.SHUTDOWN
> > > is available.
> > >
> > > "If supported by the TDX Module, the host VMM can set the
> > > AVOID_COMPAT_SENSITIVE flag to request the TDX Module to fail
> > > TDH.SYS.UPDATE if any of the TDs are currently in a state that is
> > > impacted by the update-sensitive cases."
> >
> > That is not a fix. That just shifts the complexity from build to update.
> > It still leaves update in a state where it is not guaranteed to make
> 
> IMO, there are two problems here:
> 1) Giving a consistent ABI that leaves the responsibility of ensuring
> forward progress by sequencing TD update with TD build steps with
> userspace.
> 2) Ensuring that userspace can't screw up the in-progress TD VM
> metadata if userspace doesn't adhere to the sequence above.
> 
> Problem 2 should be solved in the TDX module as it is the state owner
> and should be given a chance to ensure that nothing else can affect
> it's state. Kernel is just opting-in to toggle the already provided
> TDX module ABI. I don't think this is adding complexity to the kernel.

That gives update a transient error to handle

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by dan.j.williams@intel.com 3 months, 1 week ago

dan.j.williams@ wrote:
[..]
> That gives update a transient error to handle

Apologies, this was a draft sent in error.

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by dan.j.williams@intel.com 3 months, 2 weeks ago

Vishal Annapurve wrote:
[..]
> Problem 2 should be solved in the TDX module as it is the state owner
> and should be given a chance to ensure that nothing else can affect
> it's state. Kernel is just opting-in to toggle the already provided
> TDX module ABI. I don't think this is adding complexity to the kernel.

It makes the interface hard to reason about, that is complexity.

Consider an urgent case where update is more important than the
consistency of ongoing builds. The kernel's job is its own self
consistency and security model, when that remains in tact root is
allowed to make informed decisions.

You might say, well add a --force option for that, and that is also
userspace prerogative to perform otherwise destructive operations with
the degrees of freedom the kernel allows.

I think we have reached the useful end of this thread. I support moving
ahead with the dead simple, "this may clobber your builds", for now. We
can always circle back to add more complexity later if that proves "too
simple" in practice.

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by Vishal Annapurve 3 months, 1 week ago

On Mon, Oct 27, 2025 at 7:13 PM <dan.j.williams@intel.com> wrote:
>
> Vishal Annapurve wrote:
> [..]
> > Problem 2 should be solved in the TDX module as it is the state owner
> > and should be given a chance to ensure that nothing else can affect
> > it's state. Kernel is just opting-in to toggle the already provided
> > TDX module ABI. I don't think this is adding complexity to the kernel.
>
> It makes the interface hard to reason about, that is complexity.
>
> Consider an urgent case where update is more important than the
> consistency of ongoing builds. The kernel's job is its own self
> consistency and security model,

I would consider the TDX module as more privileged than the kernel
itself in certain aspects. So it's not fine to expose an ABI to
userspace which affects kernel state consistency, but it's fine to
expose an ABI that can affect TDX module state consistency?

> when that remains in tact root is allowed to make informed decisions.
>
> You might say, well add a --force option for that, and that is also
> userspace prerogative to perform otherwise destructive operations with
> the degrees of freedom the kernel allows.
>
> I think we have reached the useful end of this thread. I support moving
> ahead with the dead simple, "this may clobber your builds", for now. We
> can always circle back to add more complexity later if that proves "too
> simple" in practice.
>

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by Erdem Aktas 3 months, 1 week ago

On Mon, Oct 27, 2025 at 7:14 PM <dan.j.williams@intel.com> wrote:
>
> Vishal Annapurve wrote:
> [..]
> > Problem 2 should be solved in the TDX module as it is the state owner
> > and should be given a chance to ensure that nothing else can affect
> > it's state. Kernel is just opting-in to toggle the already provided
> > TDX module ABI. I don't think this is adding complexity to the kernel.
>
> It makes the interface hard to reason about, that is complexity.

Could you clarify what you mean here? What interface do you need to
reason about? TDX module has a feature as described in its spec, this
is nothing to do with the kernel. Kernel executes the TDH.SYS.SHUTDOWN
and if it fails, it will return the error code back to the user space.
There is nothing here to reason about and it is not clear how it is
adding the complexity to the kernel.

>
> Consider an urgent case where update is more important than the
> consistency of ongoing builds. The kernel's job is its own self
> consistency and security model, when that remains in tact root is
> allowed to make informed decisions.
>
The whole update is initiated by the userspace, imo, it is not the
kernel's job to decide what to do. It should try to update the TDX
module and return error code back to the userspace if it fails. it is
up to the userspace to resolve the conflict and retry the
installation. If you are saying that the userspace is not trusted for
such a critical action, again the whole process is initiated and
controlled by the userspace so there is an inherent trust there.

Consistency? How does td preserve failure impact the kernel
consistency? On the contrary, bypassing AVOID_COMPAT_SENSITIVE will
break the consistency for some TDs.

> You might say, well add a --force option for that, and that is also
> userspace prerogative to perform otherwise destructive operations with
> the degrees of freedom the kernel allows.

IMO, It is something userspace should decide, kernel's job is to
provide the necessary interface about it.

>
> I think we have reached the useful end of this thread. I support moving
> ahead with the dead simple, "this may clobber your builds", for now. We
> can always circle back to add more complexity later if that proves "too
> simple" in practice.
>
It is not clear how you reached that conclusion. We are one of the
users for this feature and we have multiple times explained that we
prefer failure on update if there is any risk of corrupting some TD
states. I did not see any other feedback/preference from other users
and I did not see any reasonable argument why you are preferring the
"clobber your builds" option.

Also the "clobber your builds" option will impact the TDX live
migration, considering the TDX live migration is WIP, it will be
definitely very hard to foresee the challenges there you are
introducing with this decision. How about TDX connect? Are we going to
come back and keep updating this every time we find an issue?

Since the update process is initiated and controlled by userspace, it
is the userspace application's prerogative to make the informed
decision on whether an urgent update warrants potentially destructive
actions. The kernel's role is to provide a reliable mechanism to
interact with the TDX Module and report outcomes accurately.
 Ideally,  ABI should allow userpace to provide flags which can be
also used to configure the TD preserve update option. If you do not
want to change ABI, you can make those as module param so userspace
can make a decision by itself.

To address some of your previous concerns:
It shifts complexity to userspace which is something everyone here
seems to prefer. The problem is that the TD Preserve update would
corrupt the TDs who are in the build stage (also impacts TDX LM  and
possibly some TDX connect functionalities) and since the TDX module
would know about it,  this will make sure that they will not be
corrupted hence it is a fix for a problem.

TDH.SYS.SHUTDOWN may not succeed due to multiple reasons like
TDX_SYS_BUSY  therefore it needs to handle the error cases anyway and
should return the error to the userspace.
Now userspace can decide whatever logic it has to finish/cancel the
existing tdbuilds and retry the tdpreserve update.

You might be concerned about forward progress. As I said above, there
might be some other cases which might prevent the td preserve update
to succeed so forward progress is not guaranteed anyway and it is not
the kernel's job to figure it out. It will return the error code back
to userspace and let the userspace resolve the conflict.

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by Sean Christopherson 3 months, 1 week ago

On Tue, Oct 28, 2025, Erdem Aktas wrote:
> On Mon, Oct 27, 2025 at 7:14 PM <dan.j.williams@intel.com> wrote:
> >
> > Vishal Annapurve wrote:
> > [..]
> > > Problem 2 should be solved in the TDX module as it is the state owner
> > > and should be given a chance to ensure that nothing else can affect
> > > it's state. Kernel is just opting-in to toggle the already provided
> > > TDX module ABI. I don't think this is adding complexity to the kernel.
> >
> > It makes the interface hard to reason about, that is complexity.
> 
> Could you clarify what you mean here? What interface do you need to
> reason about? TDX module has a feature as described in its spec, this
> is nothing to do with the kernel. Kernel executes the TDH.SYS.SHUTDOWN
> and if it fails, it will return the error code back to the user space.
> There is nothing here to reason about and it is not clear how it is
> adding the complexity to the kernel.

Userspace needs to reason about error codes and potential sources of those error
codes.  That said, I agree that having the kernel set AVOID_COMPAT_SENSITIVE by
default (I vote for setting it unconditionally), doesn't add meaningful
complexity; the kernel would just need to document that the update mechanism can
return -EBUSY (or whatever), and why/when.

For me, that seems far less daunting/complex than attempting to document what all
can go wrong if the kernel _doesn't_ set AVOID_COMPAT_SENSITIVE.  Because IMO,
regardless of whether or not the kernel sets AVOID_COMPAT_SENSITIVE, the kernel
is making a decision and defining behavior, and that behavior needs to be
documented.  If AVOID_COMPAT_SENSITIVE didn't exist, then I would agree this is
purely a userspace vs. TDX-Module problem, but it does exist, and not setting the
flag defines ABI just as much as setting the flag does.

The failure mode also matters, a lot.  "Sorry dear customer, we corrupted your VM"
is very, very different than "A handful of machines in our fleet haven't completed
an (optional?) update".

> > Consider an urgent case where update is more important than the
> > consistency of ongoing builds. The kernel's job is its own self
> > consistency and security model, when that remains in tact root is
> > allowed to make informed decisions.
> >
> The whole update is initiated by the userspace, imo, it is not the
> kernel's job to decide what to do. 

I think you and Dan are in violent agreement.  I _think_ what Dan is saying that
the kernel needs to protect itself, e.g. by rejecting an update if the kernel knows
the system is in a bad state.  But other than that, userspace can do whatever.

AFAICT, the only disagreement is whether or not to set AVOID_COMPAT_SENSITIVE.

> It should try to update the TDX module and return error code back to the
> userspace if it fails.

+1.  Unless there's a wrinkle I'm missing, failing with -EBUSY seems like the
obvious choice.

> > You might say, well add a --force option for that, and that is also
> > userspace prerogative to perform otherwise destructive operations with
> > the degrees of freedom the kernel allows.
> 
> IMO, It is something userspace should decide, kernel's job is to
> provide the necessary interface about it.

I disagree, I don't think userspace should even get the option.  IMO, not setting
AVOID_COMPAT_SENSITIVE is all kinds of crazy.

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by dan.j.williams@intel.com 3 months, 1 week ago

Sean Christopherson wrote:
[..]
> > IMO, It is something userspace should decide, kernel's job is to
> > provide the necessary interface about it.
> 
> I disagree, I don't think userspace should even get the option.  IMO, not setting
> AVOID_COMPAT_SENSITIVE is all kinds of crazy.

Do see Table 4.4: "Comparison of Update Incompatibility Detection and/or
Avoidance Methods" from the latest base architecture specification [1].
It lists out the pros and cons of not setting AVOID_COMPAT_SENSITIVE.
This thread has only argued the merits of "None" and "Avoid updates
during update- sensitive times". It has not discussed "Detect
incompatibility after update", but let us not do that. You can just
assume the Module has multiple solutions to this awkward problem
precisely because different VMMs came to different conclusions.

I want this thread to end so I am not going to argue past what Dave and
Sean want to do here.

[1]: https://www.intel.com/content/www/us/en/content-details/865787/intel-tdx-module-base-architecture-specification.html

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by Sean Christopherson 3 months, 1 week ago

On Tue, Oct 28, 2025, dan.j.williams@intel.com wrote:
> Sean Christopherson wrote:
> [..]
> > > IMO, It is something userspace should decide, kernel's job is to
> > > provide the necessary interface about it.
> > 
> > I disagree, I don't think userspace should even get the option.  IMO, not setting
> > AVOID_COMPAT_SENSITIVE is all kinds of crazy.
> 
> Do see Table 4.4: "Comparison of Update Incompatibility Detection and/or
> Avoidance Methods" from the latest base architecture specification [1].
> It lists out the pros and cons of not setting AVOID_COMPAT_SENSITIVE.
> This thread has only argued the merits of "None" and "Avoid updates
> during update- sensitive times". It has not discussed "Detect
> incompatibility after update", but let us not do that.

But we already are discussing that, because the "None" option is just punting
"Detect incompatibility after update" to something other than the VMM.  Doing
literally nothing isn't an option.  The fact that it's even listed in the table,
not to mention has "Simplest." listed as a pro, makes me question whether or not
the authors actually understand how software built around the TDX-Module is used
in practice.

If an update causes a TD build to fail, or to generate the wrong measurement, or
whatever "Failures due to incompatibilities" means, *something* eventually needs
to take action.  Doing nothing is certainly the simplest option for the hypervisor
and VMM, but when looking at the entire stack/ecosystem, it's the most complex
option as it bleeds the damage into multiple, potentially-unknown components of
the stack.  Either that, or I'm grossly misunderstanding what "Failures" means.

That section also states:

  Future TDX Module versions may have different or additional update-sensitive cases.

Which means that from an ABI perspective, "Avoid updates during update-sensitive
times" is the _ONLY_ viable option.  My read of that is that future TDX-Modules
can effectively change the failure modes for a existing KVM ioctls.  That is an
ABI change and will break userspace, e.g. if userspace is sane and expects certain
operations to succeed.

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by Vishal Annapurve 3 months, 1 week ago

On Wed, Oct 29, 2025 at 6:48 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Oct 28, 2025, dan.j.williams@intel.com wrote:
> > Sean Christopherson wrote:
> > [..]
> > > > IMO, It is something userspace should decide, kernel's job is to
> > > > provide the necessary interface about it.
> > >
> > > I disagree, I don't think userspace should even get the option.  IMO, not setting
> > > AVOID_COMPAT_SENSITIVE is all kinds of crazy.
> >
> > Do see Table 4.4: "Comparison of Update Incompatibility Detection and/or
> > Avoidance Methods" from the latest base architecture specification [1].
> > It lists out the pros and cons of not setting AVOID_COMPAT_SENSITIVE.
> > This thread has only argued the merits of "None" and "Avoid updates
> > during update- sensitive times". It has not discussed "Detect
> > incompatibility after update", but let us not do that.
>
> But we already are discussing that, because the "None" option is just punting
> "Detect incompatibility after update" to something other than the VMM.  Doing
> literally nothing isn't an option.  The fact that it's even listed in the table,
> not to mention has "Simplest." listed as a pro, makes me question whether or not
> the authors actually understand how software built around the TDX-Module is used
> in practice.
>
> If an update causes a TD build to fail, or to generate the wrong measurement, or
> whatever "Failures due to incompatibilities" means, *something* eventually needs
> to take action.  Doing nothing is certainly the simplest option for the hypervisor
> and VMM, but when looking at the entire stack/ecosystem, it's the most complex
> option as it bleeds the damage into multiple, potentially-unknown components of
> the stack.  Either that, or I'm grossly misunderstanding what "Failures" means.
>
> That section also states:
>
>   Future TDX Module versions may have different or additional update-sensitive cases.
>
> Which means that from an ABI perspective, "Avoid updates during update-sensitive
> times" is the _ONLY_ viable option.  My read of that is that future TDX-Modules
> can effectively change the failure modes for a existing KVM ioctls.  That is an
> ABI change and will break userspace, e.g. if userspace is sane and expects certain
> operations to succeed.

A reference patch we tested for "Avoid updates during update-sensitive
times" and one caveat was that
/sys/devices/virtual/tdx/tdx_tsm/version was not available post update
failure until a subsequent successful update:

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index e00650b83f08..96ae7c679e4e 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -22,6 +22,7 @@
 #define TDX_FEATURES0_NO_RBP_MOD               BIT_ULL(18)
 #define TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC     BIT_ULL(23)
 #define TDX_FEATURES0_DYNAMIC_PAMT             BIT_ULL(36)
+#define TDX_FEATURES0_UPDATE_COMPATIBILITY     BIT_ULL(47)

 #ifndef __ASSEMBLY__

@@ -129,6 +130,11 @@ static inline bool
tdx_supports_dynamic_pamt(const struct tdx_sys_info *sysinfo)
        return sysinfo->features.tdx_features0 & TDX_FEATURES0_DYNAMIC_PAMT;
 }

+static inline bool tdx_supports_update_compatibility(const struct
tdx_sys_info *sysinfo)
+{
+       return sysinfo->features.tdx_features0 &
TDX_FEATURES0_UPDATE_COMPATIBILITY;
+}
+
 int tdx_guest_keyid_alloc(void);
 u32 tdx_get_nr_guest_keyids(void);
 void tdx_guest_keyid_free(unsigned int keyid);
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index f6199f8ce411..95deb1146a79 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1523,6 +1523,10 @@ int tdx_module_shutdown(void)
         * fail.
         */
        args.rcx = tdx_sysinfo.handoff.module_hv;
+
+       if (tdx_supports_update_compatibility(&tdx_sysinfo))
+               args.rcx |= TDX_SYS_SHUTDOWN_AVOID_COMPAT_SENSITIVE;
+
        ret = seamcall_prerr(TDH_SYS_SHUTDOWN, &args);
        if (!ret)
                tdx_module_reset_state();
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 0cd9140620f9..772c714de2bc 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -94,6 +94,8 @@ struct tdmr_info {
 /* Bit definitions of TDX_FEATURES0 metadata field */
 #define TDX_FEATURES0_TD_PRESERVING    BIT(1)

+#define TDX_SYS_SHUTDOWN_AVOID_COMPAT_SENSITIVE                BIT(16)
+
 /*
  * Do not put any hardware-defined TDX structure representations below
  * this comment!

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by Chao Gao 3 months, 1 week ago

>A reference patch we tested for "Avoid updates during update-sensitive
>times" and one caveat was that
>/sys/devices/virtual/tdx/tdx_tsm/version was not available post update
>failure until a subsequent successful update:

I also tested this. It works well to prevent updates during TD build, so,

  Tested-by: Chao Gao <chao.gao@intel.com>

And I can integrate this change into my next version if you don't object.

Regarding the caveat, could you check if the diff [*] I posted earlier this
week can fix it?

[1]: https://lore.kernel.org/linux-coco/aQAwRrvYMcaMsu02@intel.com/

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by Sagi Shahar 2 months, 3 weeks ago

On Thu, Oct 30, 2025 at 9:53 PM Chao Gao <chao.gao@intel.com> wrote:
>
> >A reference patch we tested for "Avoid updates during update-sensitive
> >times" and one caveat was that
> >/sys/devices/virtual/tdx/tdx_tsm/version was not available post update
> >failure until a subsequent successful update:
>
> I also tested this. It works well to prevent updates during TD build, so,
>
>   Tested-by: Chao Gao <chao.gao@intel.com>
>
> And I can integrate this change into my next version if you don't object.
>
> Regarding the caveat, could you check if the diff [*] I posted earlier this
> week can fix it?
>
> [1]: https://lore.kernel.org/linux-coco/aQAwRrvYMcaMsu02@intel.com/

[Now in plaintext]

I tried testing it with the 1.5.24 TDX module and it sometimes fails,
but the failure does not appear consistent.

I added a local change to add the
TDX_SYS_SHUTDOWN_AVOID_COMPAT_SENSITIVE flag when calling
TDH_SYS_SHUTDOWN and TDH_SYS_SHUTDOWN fails as expected if a VM is
under build:
[ 1224.571177] virt/tdx: SEAMCALL (52) failed: 0x8000051200010000

But then sometimes trying to finalize the VM fail with the following error:
[ 1230.915145] kvm_intel: SEAMCALL TDH_MR_FINALIZE failed: 0x8000ff00ffff0000
[ 1230.948264] kvm_intel: tdh_mng_vpflushdone() failed. HKID 3 is leaked.

At this point the module seems to be in a broken state and trying to
create more TDs will fail:
[ 1543.745606] kvm_intel: SEAMCALL TDH_MNG_CREATE failed: 0x8000ff00ffff0000

Trying to update the module will fail shutdown with -ENODEV

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by Chao Gao 2 months, 3 weeks ago

On Wed, Nov 19, 2025 at 04:44:50PM -0600, Sagi Shahar wrote:
>On Thu, Oct 30, 2025 at 9:53 PM Chao Gao <chao.gao@intel.com> wrote:
>>
>> >A reference patch we tested for "Avoid updates during update-sensitive
>> >times" and one caveat was that
>> >/sys/devices/virtual/tdx/tdx_tsm/version was not available post update
>> >failure until a subsequent successful update:
>>
>> I also tested this. It works well to prevent updates during TD build, so,
>>
>>   Tested-by: Chao Gao <chao.gao@intel.com>
>>
>> And I can integrate this change into my next version if you don't object.
>>
>> Regarding the caveat, could you check if the diff [*] I posted earlier this
>> week can fix it?
>>
>> [1]: https://lore.kernel.org/linux-coco/aQAwRrvYMcaMsu02@intel.com/
>
>[Now in plaintext]
>
>I tried testing it with the 1.5.24 TDX module and it sometimes fails,
>but the failure does not appear consistent.
>
>I added a local change to add the
>TDX_SYS_SHUTDOWN_AVOID_COMPAT_SENSITIVE flag when calling
>TDH_SYS_SHUTDOWN and TDH_SYS_SHUTDOWN fails as expected if a VM is
>under build:
>[ 1224.571177] virt/tdx: SEAMCALL (52) failed: 0x8000051200010000
>
>But then sometimes trying to finalize the VM fail with the following error:
>[ 1230.915145] kvm_intel: SEAMCALL TDH_MR_FINALIZE failed: 0x8000ff00ffff0000
>[ 1230.948264] kvm_intel: tdh_mng_vpflushdone() failed. HKID 3 is leaked.
>
>At this point the module seems to be in a broken state and trying to
>create more TDs will fail:
>[ 1543.745606] kvm_intel: SEAMCALL TDH_MNG_CREATE failed: 0x8000ff00ffff0000
>
>Trying to update the module will fail shutdown with -ENODEV

Can you apply this incremental change to see if the issue gets fixed?

diff --git a/arch/x86/virt/vmx/tdx/seamldr.c b/arch/x86/virt/vmx/tdx/seamldr.c
index e525bbd16610..f0bea1fecc52 100644
--- a/arch/x86/virt/vmx/tdx/seamldr.c
+++ b/arch/x86/virt/vmx/tdx/seamldr.c
@@ -317,8 +317,9 @@ static int do_seamldr_install_module(void *params)
					tdx_module_set_error();
					print_update_failure_message();
				}
+			} else {
+				ack_state();
			}
-			ack_state();
		} else {
			touch_nmi_watchdog();
			rcu_momentary_eqs();


The problem is if the failing CPU is the last one to ack the TDP_SHUTDOWN
state, the state will move to TDP_CPU_INSTALL state. Other CPUs may proceed to
install the new module before seeing tdp_data.failed. This disables TDX ISA, so
any subsequent SEAMCALLs get 0x8000ff00ffff0000.

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by Sagi Shahar 2 months, 2 weeks ago

On Wed, Nov 19, 2025 at 8:47 PM Chao Gao <chao.gao@intel.com> wrote:
>
> On Wed, Nov 19, 2025 at 04:44:50PM -0600, Sagi Shahar wrote:
> >On Thu, Oct 30, 2025 at 9:53 PM Chao Gao <chao.gao@intel.com> wrote:
> >>
> >> >A reference patch we tested for "Avoid updates during update-sensitive
> >> >times" and one caveat was that
> >> >/sys/devices/virtual/tdx/tdx_tsm/version was not available post update
> >> >failure until a subsequent successful update:
> >>
> >> I also tested this. It works well to prevent updates during TD build, so,
> >>
> >>   Tested-by: Chao Gao <chao.gao@intel.com>
> >>
> >> And I can integrate this change into my next version if you don't object.
> >>
> >> Regarding the caveat, could you check if the diff [*] I posted earlier this
> >> week can fix it?
> >>
> >> [1]: https://lore.kernel.org/linux-coco/aQAwRrvYMcaMsu02@intel.com/
> >
> >[Now in plaintext]
> >
> >I tried testing it with the 1.5.24 TDX module and it sometimes fails,
> >but the failure does not appear consistent.
> >
> >I added a local change to add the
> >TDX_SYS_SHUTDOWN_AVOID_COMPAT_SENSITIVE flag when calling
> >TDH_SYS_SHUTDOWN and TDH_SYS_SHUTDOWN fails as expected if a VM is
> >under build:
> >[ 1224.571177] virt/tdx: SEAMCALL (52) failed: 0x8000051200010000
> >
> >But then sometimes trying to finalize the VM fail with the following error:
> >[ 1230.915145] kvm_intel: SEAMCALL TDH_MR_FINALIZE failed: 0x8000ff00ffff0000
> >[ 1230.948264] kvm_intel: tdh_mng_vpflushdone() failed. HKID 3 is leaked.
> >
> >At this point the module seems to be in a broken state and trying to
> >create more TDs will fail:
> >[ 1543.745606] kvm_intel: SEAMCALL TDH_MNG_CREATE failed: 0x8000ff00ffff0000
> >
> >Trying to update the module will fail shutdown with -ENODEV
>
> Can you apply this incremental change to see if the issue gets fixed?
>
> diff --git a/arch/x86/virt/vmx/tdx/seamldr.c b/arch/x86/virt/vmx/tdx/seamldr.c
> index e525bbd16610..f0bea1fecc52 100644
> --- a/arch/x86/virt/vmx/tdx/seamldr.c
> +++ b/arch/x86/virt/vmx/tdx/seamldr.c
> @@ -317,8 +317,9 @@ static int do_seamldr_install_module(void *params)
>                                         tdx_module_set_error();
>                                         print_update_failure_message();
>                                 }
> +                       } else {
> +                               ack_state();
>                         }
> -                       ack_state();
>                 } else {
>                         touch_nmi_watchdog();
>                         rcu_momentary_eqs();
>
>
> The problem is if the failing CPU is the last one to ack the TDP_SHUTDOWN
> state, the state will move to TDP_CPU_INSTALL state. Other CPUs may proceed to
> install the new module before seeing tdp_data.failed. This disables TDX ISA, so
> any subsequent SEAMCALLs get 0x8000ff00ffff0000.

Thanks, I ran a couple dozen updates while TD was being built and
couldn't reproduce the issue with the new fix

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by Vishal Annapurve 3 months, 2 weeks ago

On Sat, Oct 25, 2025 at 4:55 AM Vishal Annapurve <vannapurve@google.com> wrote:
>
> On Fri, Oct 24, 2025 at 6:42 PM <dan.j.williams@intel.com> wrote:
> >
> > Vishal Annapurve wrote:
> > > On Fri, Oct 24, 2025 at 2:19 PM Dave Hansen <dave.hansen@intel.com> wrote:
> > > >
> > > > On 10/24/25 14:12, dan.j.williams@intel.com wrote:
> > > > >> The SGX solution, btw, was to at least ensure forward progress (CPUSVN
> > > > >> update) when the last enclave goes away. So new enclaves aren't
> > > > >> *prevented* from starting but the window when the first one starts
> > > > >> (enclave count going from 0->1) is leveraged to do the update.
> > > > > The status quo does ensure forward progress. The TD does get built and
> > > > > the update does complete, just the small matter of TD attestation
> > > > > failures, right?
> > >
> > > I would think that it's not a "small" problem if confidential
> > > workloads on the hosts are not able to pass attestation.
> >
> > "Small" as in "not the kernel's problem". Userspace asked for the
> > update, update is documented to clobber build sometimes, userspace ran
> > an update anyway. Userspace asked for the clobber.
> >
> > It would be lovely if this clobbering does not happen at all and the
> > update mechanism did not come with this misfeature. Otherwise, the kernel
> > has no interface to solve that problem. The best it can do is document
> > that this new update facility has this side effect.
>
> In this case, host kernel has a way to ensure that userspace can't
> trigger such clobbering at all. That IIUC is "Avoid updates during
> update sensitive times". Best kernel can do is prevent userspace from
> screwing up the state of TDs.
>
> >
> > Userspace always has the choice to not update, coordinate update with
> > build, or do nothing and let tenants try to launch again.  Userspace
> > could even retry the build and hide the tenant failure if it knew about
> > the clobber,
>
> IIUC host userspace has no way to know if the TD state got clobbered.
>
> > but be clear that the problem is the clobber not the kernel
> > doing what userspace asked.
> >
> > The clobber, as I understand, is also limited to cases where the update
> > includes crypto library changes. I am not sure how often that happens in
> > practice. Suffice to say, the fact that the clobber is conditioned on
> > the contents of the update also puts it further away from being a kernel
>
> The knowledge of things getting clobbered are well much further away
> from userspace.
>
> > problem. The clobber does not corrupt kernel state.
> >
> > > > Oh, yeah, for sure.
> > > >
> > > > If we do _nothing_ in the kernel (no build vs. module update
> > > > synchronization), then the downside is being exposed to attestation
> > > > failures if userspace either also does nothing or has bugs.
> > > >
> > > > That's actually, by far, my preferred solution to this whole mess:
> > > > Userspace plays stupid games, userspace wins stupid prizes.
> > > >
> > >
> > > IIUC, enforcing "Avoid updates during update sensitive times" is not
> > > that complex and will ensure to avoid any issues with user space
> > > logic.
> >
> > Userspace logic avoids issues by honoring the documentation that these
> > ABIs sequences need synchronization. Otherwise, kernel blocking update
> > during build just trades one error for another.
>
> Kernel blocking update during build makes the production systems much
> safer and prevents userspace from screwing up the state that it has no
> way to detect after the fact.
>
> >
> > Treat this like any other userspace solution for requiring "atomic"
> > semantics when the kernel mechanisms are not themselves designed to be
> > atomic, wrap it in userspace synchronization.
>
> In general if this is something userspace detectable I would agree,
> TDX module is the closest entity that can detect the problematic
> sequence and the host kernel has a very simple way to ensure that such
> a problematic sequence is not at all allowed to happen by toggling
> some seamcall controls. It would be very helpful IMO to ensure that
> userspace is not able to screw up production workloads especially if
> the mess is not all visible to userspace.

Detecting is one thing, undoing the mess is disruptive and not easy to
orchestrate in this case.

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by Sean Christopherson 3 months, 2 weeks ago

On Fri, Oct 24, 2025, dan.j.williams@intel.com wrote:
> Dave Hansen wrote:
> > On 10/24/25 00:43, Chao Gao wrote:
> > ...
> > > Beyond "the kvm_tdx object gets torn down during a build," I see two potential
> > > issues:
> > > 
> > > 1. TD Build and TDX migration aren't purely kernel processes -- they span multiple
> > >    KVM ioctls. Holding a read-write lock throughout the entire process would
> > >    require exiting to userspace while the lock is held. I think this is
> > >    irregular, but I'm not sure if it's acceptable for read-write semaphores.
> > 
> > Sure, I guess it's irregular. But look at it this way: let's say we
> > concocted some scheme to use a TD build refcount and a module update
> > flag, had them both wait_event_interruptible() on each other, and then
> > did wakeups. That would get the same semantics without an rwsem.
> 
> This sounds unworkable to me.
> 
> First, you cannot return to userspace while holding a lock. Lockdep will
> rightfully scream:
> 
>     "WARNING: lock held when returning to user space!"
> 
> The complexity of ensuring that a multi-stage ABI transaction completes
> from the kernel side is painful. If that process dies in the middle of
> its ABI sequence who cleans up these references?
> 
> The operational mechanism to make sure that one process flow does not
> mess up another process flow is for those process to communicate with
> *userspace* file locks, or for those process to check for failures after
> the fact and retry. Unless you can make the build side an atomic ABI,
> this is a documentation + userspace problem, not a kernel problem.

C'mon people (especially the Google folks), this is the ***exact*** same problem
as certificate updates for SNP[1].  Y'all suggested holding a lock across a userspace
exit back then, and Dan's analysis confirms my reaction from back then that
"Holding a lock across an exit to userspace seems wildly unsafe."[2]

In the end, it took more time to understand the problem then to sketch out and
test a solution[3]. 

Unless this somehow puts the host (kernel) at risk, this is a userspace problem.

[1] https://lore.kernel.org/all/20240426173515.6pio42iqvjj2aeac@amd.com
[2] https://lore.kernel.org/all/Zx_V5SHwzDAl8ZQR@google.com
[3] https://lore.kernel.org/all/ZixCYlKn5OYUFWEq@google.com

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by Dave Hansen 3 months, 2 weeks ago

On 10/24/25 13:00, Sean Christopherson wrote:
> C'mon people (especially the Google folks), this is the ***exact***
> same problem as certificate updates for SNP[1].  Y'all suggested
> holding a lock across a userspace exit back then, and Dan's analysis
> confirms my reaction from back then that "Holding a lock across an
> exit to userspace seems wildly unsafe."[2]
> 
> In the end, it took more time to understand the problem then to
> sketch out and test a solution[3].
> 
> Unless this somehow puts the host (kernel) at risk, this is a
> userspace problem.

If there's an similar SEV-SNP problem and accepted solution punted to
userspace that TDX can leverage, I'm 100% on board with that. Let's do that.

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by Vishal Annapurve 3 months, 2 weeks ago

On Fri, Oct 24, 2025 at 1:14 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/24/25 13:00, Sean Christopherson wrote:
> > C'mon people (especially the Google folks), this is the ***exact***
> > same problem as certificate updates for SNP[1].  Y'all suggested
> > holding a lock across a userspace exit back then, and Dan's analysis
> > confirms my reaction from back then that "Holding a lock across an
> > exit to userspace seems wildly unsafe."[2]
> >
> > In the end, it took more time to understand the problem then to
> > sketch out and test a solution[3].
> >
> > Unless this somehow puts the host (kernel) at risk, this is a
> > userspace problem.
>
> If there's an similar SEV-SNP problem and accepted solution punted to
> userspace that TDX can leverage, I'm 100% on board with that. Let's do that.

So IIUC, the current stance is that the kernel can rely on userspace
to ensure forward progress of TDX module update.

I still vote for the "Avoid updates during update sensitive times"
approach to be enabled in the host kernel to ensure userspace can't
mess up the TDX module state.

Re: [PATCH v2 00/21] Runtime TDX Module update support

Posted by Vishal Annapurve 3 months, 2 weeks ago

On Thu, Oct 23, 2025 at 2:10 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/23/25 13:31, Vishal Annapurve wrote:
> ...
> >> Admin logic to update TDX modules can be designed to either retry
> >> failed TDX module updates or to be more robust, adds some
> >> synchronization with VM creation attempts on the host. i.e. I think
> >> it's fine to punt this problem of ensuring the forward progress to
> >> user-space admin logic on the host.
> > Discussed offline with Erdem Aktas on this. From Google's perspective
> > "Avoid updates during updatesensitive times" seems a better option as
> > I mentioned above.
> >
> > To avoid having to choose which policy to enforce in kernel, a better
> > way could be to:
> > * Allow user space opt-in for "Avoid updates during updatesensitive times" AND
> > * Allow user space opt-in for "Detect incompatibility after update" as well OR
> > * Keep "Detect incompatibility after update" enabled by default based
> > on the appetite for avoiding silent corruption scenarios.
>
> I'd really prefer to keep this simple. Adding new opt-in ABIs up the
> wazoo doesn't seem great.
>
> I think I've heard three requirements in the end:
>
> 1. Guarantee module update forward progress
> 2. Avoid "corrupt" TD build processes by letting the build/update
>    race happen
> 3. Don't complicate the build process by forcing it to error out
>    if a module update clobbers a build
>
> One thing I don't think I've heard anyone be worried about is how timely
> the update process is. So how about this: Updates wait for any existing
> builds to complete. But, new builds wait for updates. That can be done
> with a single rwsem:
>
> struct rw_semaphore update_rwsem;
>
> tdx_td_init()
> {
>         ...
> +       down_read_interruptible(&update_rwsem);
>         kvm_tdx->state = TD_STATE_INITIALIZED;
>
> tdx_td_finalize()
> {
>         ...
> +       up_read(&update_rwsem);
>         kvm_tdx->state = TD_STATE_RUNNABLE;
>
> A module update does:
>
>         down_write_interruptible(&update_rwsem);
>         do_actual_update();
>         up_write(&update_rwsem);
>
> There would be no corruption issues, no erroring out of the build
> process, and no punting to userspace to ensure forward progress.
>
> The big downside is that both the build process and update process can
> appear to hang for a long time. It'll also be a bit annoying to ensure
> that there are up_read(&update_rwsem)'s if the kvm_tdx object gets torn
> down during a build.
>
> But the massive upside is that there's no new ABI and all the
> consistency and forward progress guarantees are in the kernel. If we
> want new ABIs around it that give O_NONBLOCK semantics to build or
> update, that can be added on after the fact.
>
> Plus, if userspace *WANTS* to coordinate the whole shebang, they're free
> to. They'd never see long hangs because they would be coordinating.
>
> Thoughts?

Yeah, this approach sounds reasonable.

RE: [PATCH v2 00/21] Runtime TDX Module update support

Posted by Reshetova, Elena 3 months, 2 weeks ago


> -----Original Message-----
> From: Vishal Annapurve <vannapurve@google.com>
> Sent: Saturday, October 18, 2025 3:02 AM
> To: Reshetova, Elena <elena.reshetova@intel.com>
> Cc: Hansen, Dave <dave.hansen@intel.com>; Gao, Chao
> <chao.gao@intel.com>; linux-coco@lists.linux.dev; linux-
> kernel@vger.kernel.org; x86@kernel.org; Chatre, Reinette
> <reinette.chatre@intel.com>; Weiny, Ira <ira.weiny@intel.com>; Huang, Kai
> <kai.huang@intel.com>; Williams, Dan J <dan.j.williams@intel.com>;
> yilun.xu@linux.intel.com; sagis@google.com; paulmck@kernel.org;
> nik.borisov@suse.com; Borislav Petkov <bp@alien8.de>; Dave Hansen
> <dave.hansen@linux.intel.com>; H. Peter Anvin <hpa@zytor.com>; Ingo Molnar
> <mingo@redhat.com>; Kirill A. Shutemov <kas@kernel.org>; Paolo Bonzini
> <pbonzini@redhat.com>; Edgecombe, Rick P <rick.p.edgecombe@intel.com>;
> Thomas Gleixner <tglx@linutronix.de>
> Subject: Re: [PATCH v2 00/21] Runtime TDX Module update support
> 
> On Fri, Oct 17, 2025 at 3:08 AM Reshetova, Elena
> <elena.reshetova@intel.com> wrote:
> >
> >
> > > > > ...
> > > > > > But the situation can be avoided fully, if TD preserving update is not
> > > > > conducted
> > > > > > during the TD build time.
> > > > >
> > > > > Sure, and the TDX module itself could guarantee this as well as much as
> > > > > the kernel could. It could decline to allow module updates during TD
> > > > > builds, or error out the TD build if it collides with an update.
> > > >
> > > > TDX module has a functionality to decline going into SHUTDOWN state
> > > > (pre-requisite for TD preserving update) if TD build or any problematic
> > > > operation is in progress. It requires VMM to opt-in into this feature.
> > >
> > > Is this opt-in enabled as part of this series? If not, what is the
> > > mechanism to enable this opt-in?
> >
> > For the information about how it works on TDX module side,
> > please consult the latest ABI spec, definition of TDH.SYS.SHUTDOWN leaf,
> > page 321:
> > https://cdrdv2.intel.com/v1/dl/getContent/733579
> >
> 
> Thanks Elena. Should the patch [1] from this series be modified to
> handle the TDX module shutdown as per:
> "If supported by the TDX Module, the host VMM can set the
> AVOID_COMPAT_SENSITIVE flag to request the TDX Module to fail
> TDH.SYS.UPDATE if any of the TDs are currently in a state that is
> impacted by the update-sensitive cases"
> 
> The documentation below doesn't make sense to me:
> "The compatibility checks done by TDH.SYS.UPDATE do not include the
> following cases:
> * If any TD was initialized by an older TDX Module that did enumerate
> TDX_FEATURES0.UPDATE_COMPATIBLITY as 1, TDH.SYS.SHUTDOWN does not
> check for a TD build in progress condition for that TD.
> * If any TD migration session is in progress, it was initialized by an
> older TDX Module that did enumerate TDX_FEATURES0.UPDATE_COMPATIBLITY
> as 1"
> 
> Was it supposed to say below?
> "If any TD was initialized by an older TDX Module that did enumerate
> TDX_FEATURES0.UPDATE_COMPATIBLITY as 0, TDH.SYS.SHUTDOWN does not

Yes, the spec error, thank you for catching this. Will be fixed. 
The correct text should say:

" If any TD was initialized by an older TDX Module that did *not* enumerate
TDX_FEATURES0.UPDATE_COMPATIBLITY as 1, TDH.SYS.SHUTDOWN does
not check for a TD build in progress condition for that TD.
If any TD migration session is in progress, and it was initialized by an older
TDX Module that did *not* enumerate TDX_FEATURES0.UPDATE_COMPATIBLITY as 1,
TDH.SYS.SHUTDOWN does not check for an interrupted TD migration function
condition for that TD."

Best Regards,
Elena.