> -----Original Message----- > From: Vishal Annapurve <vannapurve@google.com> > Sent: Thursday, October 16, 2025 8:48 PM > To: Reshetova, Elena <elena.reshetova@intel.com> > Cc: Hansen, Dave <dave.hansen@intel.com>; Gao, Chao > <chao.gao@intel.com>; linux-coco@lists.linux.dev; linux- > kernel@vger.kernel.org; x86@kernel.org; Chatre, Reinette > <reinette.chatre@intel.com>; Weiny, Ira <ira.weiny@intel.com>; Huang, Kai > <kai.huang@intel.com>; Williams, Dan J <dan.j.williams@intel.com>; > yilun.xu@linux.intel.com; sagis@google.com; paulmck@kernel.org; > nik.borisov@suse.com; Borislav Petkov <bp@alien8.de>; Dave Hansen > <dave.hansen@linux.intel.com>; H. Peter Anvin <hpa@zytor.com>; Ingo Molnar > <mingo@redhat.com>; Kirill A. Shutemov <kas@kernel.org>; Paolo Bonzini > <pbonzini@redhat.com>; Edgecombe, Rick P <rick.p.edgecombe@intel.com>; > Thomas Gleixner <tglx@linutronix.de> > Subject: Re: [PATCH v2 00/21] Runtime TDX Module update support > > On Wed, Oct 15, 2025 at 11:46 PM Reshetova, Elena > <elena.reshetova@intel.com> wrote: > > > > > ... > > > > But the situation can be avoided fully, if TD preserving update is not > > > conducted > > > > during the TD build time. > > > > > > Sure, and the TDX module itself could guarantee this as well as much as > > > the kernel could. It could decline to allow module updates during TD > > > builds, or error out the TD build if it collides with an update. > > > > TDX module has a functionality to decline going into SHUTDOWN state > > (pre-requisite for TD preserving update) if TD build or any problematic > > operation is in progress. It requires VMM to opt-in into this feature. > > Is this opt-in enabled as part of this series? If not, what is the > mechanism to enable this opt-in? For the information about how it works on TDX module side, please consult the latest ABI spec, definition of TDH.SYS.SHUTDOWN leaf, page 321: https://cdrdv2.intel.com/v1/dl/getContent/733579
On Fri, Oct 17, 2025 at 3:08 AM Reshetova, Elena <elena.reshetova@intel.com> wrote: > > > > > > ... > > > > > But the situation can be avoided fully, if TD preserving update is not > > > > conducted > > > > > during the TD build time. > > > > > > > > Sure, and the TDX module itself could guarantee this as well as much as > > > > the kernel could. It could decline to allow module updates during TD > > > > builds, or error out the TD build if it collides with an update. > > > > > > TDX module has a functionality to decline going into SHUTDOWN state > > > (pre-requisite for TD preserving update) if TD build or any problematic > > > operation is in progress. It requires VMM to opt-in into this feature. > > > > Is this opt-in enabled as part of this series? If not, what is the > > mechanism to enable this opt-in? > > For the information about how it works on TDX module side, > please consult the latest ABI spec, definition of TDH.SYS.SHUTDOWN leaf, > page 321: > https://cdrdv2.intel.com/v1/dl/getContent/733579 > Thanks Elena. Should the patch [1] from this series be modified to handle the TDX module shutdown as per: "If supported by the TDX Module, the host VMM can set the AVOID_COMPAT_SENSITIVE flag to request the TDX Module to fail TDH.SYS.UPDATE if any of the TDs are currently in a state that is impacted by the update-sensitive cases" The documentation below doesn't make sense to me: "The compatibility checks done by TDH.SYS.UPDATE do not include the following cases: * If any TD was initialized by an older TDX Module that did enumerate TDX_FEATURES0.UPDATE_COMPATIBLITY as 1, TDH.SYS.SHUTDOWN does not check for a TD build in progress condition for that TD. * If any TD migration session is in progress, it was initialized by an older TDX Module that did enumerate TDX_FEATURES0.UPDATE_COMPATIBLITY as 1" Was it supposed to say below? "If any TD was initialized by an older TDX Module that did enumerate TDX_FEATURES0.UPDATE_COMPATIBLITY as 0, TDH.SYS.SHUTDOWN does not check for a TD build in progress condition for that TD" [1] https://lore.kernel.org/all/20251001025442.427697-15-chao.gao@intel.com/
On Fri, Oct 17, 2025 at 05:01:55PM -0700, Vishal Annapurve wrote: >On Fri, Oct 17, 2025 at 3:08 AM Reshetova, Elena ><elena.reshetova@intel.com> wrote: >> >> >> > > > ... >> > > > > But the situation can be avoided fully, if TD preserving update is not >> > > > conducted >> > > > > during the TD build time. >> > > > >> > > > Sure, and the TDX module itself could guarantee this as well as much as >> > > > the kernel could. It could decline to allow module updates during TD >> > > > builds, or error out the TD build if it collides with an update. >> > > >> > > TDX module has a functionality to decline going into SHUTDOWN state >> > > (pre-requisite for TD preserving update) if TD build or any problematic >> > > operation is in progress. It requires VMM to opt-in into this feature. >> > >> > Is this opt-in enabled as part of this series? If not, what is the >> > mechanism to enable this opt-in? >> >> For the information about how it works on TDX module side, >> please consult the latest ABI spec, definition of TDH.SYS.SHUTDOWN leaf, >> page 321: >> https://cdrdv2.intel.com/v1/dl/getContent/733579 >> > >Thanks Elena. Should the patch [1] from this series be modified to >handle the TDX module shutdown as per: Hi Vishal, I will fix this issue in the next version. The plan is to opt in post-update compatibility detection in the TDX Module. If incompatibilities are found, the module will return errors to any TD build or migration operations that were initiated prior to the updates. Please refer to the TDH.SYS.UPDATE leaf definition in the ABI spec above for details. I prefer this approach because: a. it guarantees forward progress. In contrast, failing updates would require admins to retry TDX Module updates, and no progress would be made unless they can successfully avoid race conditions between TDX module updates and TD build/migration operations. However, if such race conditions could be reliably prevented, this issue wouldn't require a fix in the first place. b. it eliminates false alarms that could occur with the "block update" approach. Under the "block update" approach, updates would be rejected whenever TD build operations are running, regardless of whether the new module is actually compatible (e.g., when using the same crypto library as the current module). In contrast, the post-update detection approach only fails TD build or migration operations when genuine incompatibilities exist.
On Wed, Oct 22, 2025 at 12:15 AM Chao Gao <chao.gao@intel.com> wrote: > > On Fri, Oct 17, 2025 at 05:01:55PM -0700, Vishal Annapurve wrote: > >On Fri, Oct 17, 2025 at 3:08 AM Reshetova, Elena > ><elena.reshetova@intel.com> wrote: > >> > >> > >> > > > ... > >> > > > > But the situation can be avoided fully, if TD preserving update is not > >> > > > conducted > >> > > > > during the TD build time. > >> > > > > >> > > > Sure, and the TDX module itself could guarantee this as well as much as > >> > > > the kernel could. It could decline to allow module updates during TD > >> > > > builds, or error out the TD build if it collides with an update. > >> > > > >> > > TDX module has a functionality to decline going into SHUTDOWN state > >> > > (pre-requisite for TD preserving update) if TD build or any problematic > >> > > operation is in progress. It requires VMM to opt-in into this feature. > >> > > >> > Is this opt-in enabled as part of this series? If not, what is the > >> > mechanism to enable this opt-in? > >> > >> For the information about how it works on TDX module side, > >> please consult the latest ABI spec, definition of TDH.SYS.SHUTDOWN leaf, > >> page 321: > >> https://cdrdv2.intel.com/v1/dl/getContent/733579 > >> > > > >Thanks Elena. Should the patch [1] from this series be modified to > >handle the TDX module shutdown as per: > > Hi Vishal, > > I will fix this issue in the next version. > > The plan is to opt in post-update compatibility detection in the TDX > Module. If incompatibilities are found, the module will return errors to > any TD build or migration operations that were initiated prior to the > updates. Please refer to the TDH.SYS.UPDATE leaf definition in the ABI > spec above for details. > > I prefer this approach because: > > a. it guarantees forward progress. In contrast, failing updates would > require admins to retry TDX Module updates, and no progress would be > made unless they can successfully avoid race conditions between TDX > module updates and TD build/migration operations. However, if such race > conditions could be reliably prevented, this issue wouldn't require a > fix in the first place. TD build operations are much more frequent than TDX module update operations. Retrying TD build operation will need additional KVM and userspace VMM changes IIUC (assuming TD build process needs to be restarted from the scratch). IMO, it would be simpler to handle TDX module update failures by retrying. Admin logic to update TDX modules can be designed to either retry failed TDX module updates or to be more robust, adds some synchronization with VM creation attempts on the host. i.e. I think it's fine to punt this problem of ensuring the forward progress to user-space admin logic on the host. > > b. it eliminates false alarms that could occur with the "block update" > approach. Under the "block update" approach, updates would be rejected > whenever TD build operations are running, regardless of whether the new > module is actually compatible (e.g., when using the same crypto library as > the current module). In contrast, the post-update detection approach only > fails TD build or migration operations when genuine incompatibilities > exist.
On Wed, Oct 22, 2025 at 8:42 AM Vishal Annapurve <vannapurve@google.com> wrote: > > On Wed, Oct 22, 2025 at 12:15 AM Chao Gao <chao.gao@intel.com> wrote: > > > > On Fri, Oct 17, 2025 at 05:01:55PM -0700, Vishal Annapurve wrote: > > >On Fri, Oct 17, 2025 at 3:08 AM Reshetova, Elena > > ><elena.reshetova@intel.com> wrote: > > >> > > >> > > >> > > > ... > > >> > > > > But the situation can be avoided fully, if TD preserving update is not > > >> > > > conducted > > >> > > > > during the TD build time. > > >> > > > > > >> > > > Sure, and the TDX module itself could guarantee this as well as much as > > >> > > > the kernel could. It could decline to allow module updates during TD > > >> > > > builds, or error out the TD build if it collides with an update. > > >> > > > > >> > > TDX module has a functionality to decline going into SHUTDOWN state > > >> > > (pre-requisite for TD preserving update) if TD build or any problematic > > >> > > operation is in progress. It requires VMM to opt-in into this feature. > > >> > > > >> > Is this opt-in enabled as part of this series? If not, what is the > > >> > mechanism to enable this opt-in? > > >> > > >> For the information about how it works on TDX module side, > > >> please consult the latest ABI spec, definition of TDH.SYS.SHUTDOWN leaf, > > >> page 321: > > >> https://cdrdv2.intel.com/v1/dl/getContent/733579 > > >> > > > > > >Thanks Elena. Should the patch [1] from this series be modified to > > >handle the TDX module shutdown as per: > > > > Hi Vishal, > > > > I will fix this issue in the next version. > > > > The plan is to opt in post-update compatibility detection in the TDX > > Module. If incompatibilities are found, the module will return errors to > > any TD build or migration operations that were initiated prior to the > > updates. Please refer to the TDH.SYS.UPDATE leaf definition in the ABI > > spec above for details. > > > > I prefer this approach because: > > > > a. it guarantees forward progress. In contrast, failing updates would > > require admins to retry TDX Module updates, and no progress would be > > made unless they can successfully avoid race conditions between TDX > > module updates and TD build/migration operations. However, if such race > > conditions could be reliably prevented, this issue wouldn't require a > > fix in the first place. > > TD build operations are much more frequent than TDX module update > operations. Retrying TD build operation will need additional KVM and > userspace VMM changes IIUC (assuming TD build process needs to be > restarted from the scratch). IMO, it would be simpler to handle TDX > module update failures by retrying. > > Admin logic to update TDX modules can be designed to either retry > failed TDX module updates or to be more robust, adds some > synchronization with VM creation attempts on the host. i.e. I think > it's fine to punt this problem of ensuring the forward progress to > user-space admin logic on the host. Discussed offline with Erdem Aktas on this. From Google's perspective "Avoid updates during updatesensitive times" seems a better option as I mentioned above. To avoid having to choose which policy to enforce in kernel, a better way could be to: * Allow user space opt-in for "Avoid updates during updatesensitive times" AND * Allow user space opt-in for "Detect incompatibility after update" as well OR * Keep "Detect incompatibility after update" enabled by default based on the appetite for avoiding silent corruption scenarios. > > > > > b. it eliminates false alarms that could occur with the "block update" > > approach. Under the "block update" approach, updates would be rejected > > whenever TD build operations are running, regardless of whether the new > > module is actually compatible (e.g., when using the same crypto library as > > the current module). In contrast, the post-update detection approach only > > fails TD build or migration operations when genuine incompatibilities > > exist.
On 10/23/25 13:31, Vishal Annapurve wrote:
...
>> Admin logic to update TDX modules can be designed to either retry
>> failed TDX module updates or to be more robust, adds some
>> synchronization with VM creation attempts on the host. i.e. I think
>> it's fine to punt this problem of ensuring the forward progress to
>> user-space admin logic on the host.
> Discussed offline with Erdem Aktas on this. From Google's perspective
> "Avoid updates during updatesensitive times" seems a better option as
> I mentioned above.
>
> To avoid having to choose which policy to enforce in kernel, a better
> way could be to:
> * Allow user space opt-in for "Avoid updates during updatesensitive times" AND
> * Allow user space opt-in for "Detect incompatibility after update" as well OR
> * Keep "Detect incompatibility after update" enabled by default based
> on the appetite for avoiding silent corruption scenarios.
I'd really prefer to keep this simple. Adding new opt-in ABIs up the
wazoo doesn't seem great.
I think I've heard three requirements in the end:
1. Guarantee module update forward progress
2. Avoid "corrupt" TD build processes by letting the build/update
race happen
3. Don't complicate the build process by forcing it to error out
if a module update clobbers a build
One thing I don't think I've heard anyone be worried about is how timely
the update process is. So how about this: Updates wait for any existing
builds to complete. But, new builds wait for updates. That can be done
with a single rwsem:
struct rw_semaphore update_rwsem;
tdx_td_init()
{
...
+ down_read_interruptible(&update_rwsem);
kvm_tdx->state = TD_STATE_INITIALIZED;
tdx_td_finalize()
{
...
+ up_read(&update_rwsem);
kvm_tdx->state = TD_STATE_RUNNABLE;
A module update does:
down_write_interruptible(&update_rwsem);
do_actual_update();
up_write(&update_rwsem);
There would be no corruption issues, no erroring out of the build
process, and no punting to userspace to ensure forward progress.
The big downside is that both the build process and update process can
appear to hang for a long time. It'll also be a bit annoying to ensure
that there are up_read(&update_rwsem)'s if the kvm_tdx object gets torn
down during a build.
But the massive upside is that there's no new ABI and all the
consistency and forward progress guarantees are in the kernel. If we
want new ABIs around it that give O_NONBLOCK semantics to build or
update, that can be added on after the fact.
Plus, if userspace *WANTS* to coordinate the whole shebang, they're free
to. They'd never see long hangs because they would be coordinating.
Thoughts?
>One thing I don't think I've heard anyone be worried about is how timely
>the update process is. So how about this: Updates wait for any existing
>builds to complete. But, new builds wait for updates. That can be done
>with a single rwsem:
>
>struct rw_semaphore update_rwsem;
>
>tdx_td_init()
>{
> ...
>+ down_read_interruptible(&update_rwsem);
> kvm_tdx->state = TD_STATE_INITIALIZED;
>
>tdx_td_finalize()
>{
> ...
>+ up_read(&update_rwsem);
> kvm_tdx->state = TD_STATE_RUNNABLE;
>
>A module update does:
>
> down_write_interruptible(&update_rwsem);
> do_actual_update();
> up_write(&update_rwsem);
>
>There would be no corruption issues, no erroring out of the build
>process, and no punting to userspace to ensure forward progress.
>
>The big downside is that both the build process and update process can
>appear to hang for a long time. It'll also be a bit annoying to ensure
>that there are up_read(&update_rwsem)'s if the kvm_tdx object gets torn
>down during a build.
>
>But the massive upside is that there's no new ABI and all the
>consistency and forward progress guarantees are in the kernel. If we
>want new ABIs around it that give O_NONBLOCK semantics to build or
>update, that can be added on after the fact.
>
>Plus, if userspace *WANTS* to coordinate the whole shebang, they're free
>to. They'd never see long hangs because they would be coordinating.
>
>Thoughts?
Hi Dave,
Thanks for this summary and suggestion.
Beyond "the kvm_tdx object gets torn down during a build," I see two potential
issues:
1. TD Build and TDX migration aren't purely kernel processes -- they span multiple
KVM ioctls. Holding a read-write lock throughout the entire process would
require exiting to userspace while the lock is held. I think this is
irregular, but I'm not sure if it's acceptable for read-write semaphores.
2. The kernel may need to hold this read-write lock for operations not yet
defined in the future. The TDX Module Base spec [*] notes on page 55:
: Future TDX Module versions may have different or additional update-sensitive
: cases. By design, such cases apply to a small portion of the overall TD
: lifecycle.
[*]: https://cdrdv2.intel.com/v1/dl/getContent/733575
Given these concerns, I'm not sure whether implementing a read-write lock in
the kernel is the right approach.
Since Google prefers to "avoid updates during update-sensitive times," we can
implement that approach for now. If other Linux users find this insufficient
and prefer failing TD build/migration operations with strong justification, we
can enable that functionality in the future.
What do you think?
On 10/24/25 00:43, Chao Gao wrote: ... > Beyond "the kvm_tdx object gets torn down during a build," I see two potential > issues: > > 1. TD Build and TDX migration aren't purely kernel processes -- they span multiple > KVM ioctls. Holding a read-write lock throughout the entire process would > require exiting to userspace while the lock is held. I think this is > irregular, but I'm not sure if it's acceptable for read-write semaphores. Sure, I guess it's irregular. But look at it this way: let's say we concocted some scheme to use a TD build refcount and a module update flag, had them both wait_event_interruptible() on each other, and then did wakeups. That would get the same semantics without an rwsem. So is your issue with the rwsem, or the idea that one bit of userspace can block another bit of userspace from doing something for arbitrary lengths of time? The one thing that worries me is solidly tying the build-side down_read() to the lifetime of kvm_tdx. But that shouldn't be rocket science. There also isn't a down_write_interruptible(), only down_write_killable(). > 2. The kernel may need to hold this read-write lock for operations not yet > defined in the future. The TDX Module Base spec [*] notes on page 55: > > : Future TDX Module versions may have different or additional update-sensitive > : cases. By design, such cases apply to a small portion of the overall TD > : lifecycle. Sure... But that's not a license for the TDX module to do completely silly things. Elena is on cc here for a reason and I'm sure she'll ensure that nothing silly gets put into the TDX module that will cause problems here.
Dave Hansen wrote:
> On 10/24/25 00:43, Chao Gao wrote:
> ...
> > Beyond "the kvm_tdx object gets torn down during a build," I see two potential
> > issues:
> >
> > 1. TD Build and TDX migration aren't purely kernel processes -- they span multiple
> > KVM ioctls. Holding a read-write lock throughout the entire process would
> > require exiting to userspace while the lock is held. I think this is
> > irregular, but I'm not sure if it's acceptable for read-write semaphores.
>
> Sure, I guess it's irregular. But look at it this way: let's say we
> concocted some scheme to use a TD build refcount and a module update
> flag, had them both wait_event_interruptible() on each other, and then
> did wakeups. That would get the same semantics without an rwsem.
This sounds unworkable to me.
First, you cannot return to userspace while holding a lock. Lockdep will
rightfully scream:
"WARNING: lock held when returning to user space!"
The complexity of ensuring that a multi-stage ABI transaction completes
from the kernel side is painful. If that process dies in the middle of
its ABI sequence who cleans up these references?
The operational mechanism to make sure that one process flow does not
mess up another process flow is for those process to communicate with
*userspace* file locks, or for those process to check for failures after
the fact and retry. Unless you can make the build side an atomic ABI,
this is a documentation + userspace problem, not a kernel problem.
On 10/24/25 12:40, dan.j.williams@intel.com wrote: > Dave Hansen wrote: >> On 10/24/25 00:43, Chao Gao wrote: >> ... >>> Beyond "the kvm_tdx object gets torn down during a build," I see two potential >>> issues: >>> >>> 1. TD Build and TDX migration aren't purely kernel processes -- they span multiple >>> KVM ioctls. Holding a read-write lock throughout the entire process would >>> require exiting to userspace while the lock is held. I think this is >>> irregular, but I'm not sure if it's acceptable for read-write semaphores. >> >> Sure, I guess it's irregular. But look at it this way: let's say we >> concocted some scheme to use a TD build refcount and a module update >> flag, had them both wait_event_interruptible() on each other, and then >> did wakeups. That would get the same semantics without an rwsem. > > This sounds unworkable to me. > > First, you cannot return to userspace while holding a lock. Lockdep will > rightfully scream: > > "WARNING: lock held when returning to user space!" Well, yup, it sure does look that way for normal lockdep-annotated lock types. It does seem like a sane rule to have for most things. But, just to be clear, this is a lockdep thing and a good, solid semantic to have. It's not a rule that no kernel locking structure can ever be held when returning to userspace. > The complexity of ensuring that a multi-stage ABI transaction completes > from the kernel side is painful. If that process dies in the middle of > its ABI sequence who cleans up these references? The 'struct kvm_tdx' has to get destroyed at some point. It also has a 'kvm_tdx_state' field that could be tied very tightly to the build status. The reference gets cleaned up before the point when the kvm_tdx->state memory is freed. > The operational mechanism to make sure that one process flow does not > mess up another process flow is for those process to communicate with > *userspace* file locks, or for those process to check for failures after > the fact and retry. Unless you can make the build side an atomic ABI, > this is a documentation + userspace problem, not a kernel problem. Yeah, that's a totally valid take on it. My only worry is that the module update is going to be off in another world from the thing building TDs. We had a similar set of challenges around microcode updates, CPUSVN and SGX enclaves. The guy doing "echo 1 > /sys/.../whatever" wasn't coordinating with every entity on the system that might run an SGX enclave. It certainly didn't help that enclave creation is typically done by unprivileged users. Maybe the KVM/TDX world is a _bit_ more narrow and they will be talking to each other, or the /dev/kvm permissions will be a nice funnel to get them talking to each other. The SGX solution, btw, was to at least ensure forward progress (CPUSVN update) when the last enclave goes away. So new enclaves aren't *prevented* from starting but the window when the first one starts (enclave count going from 0->1) is leveraged to do the update.
Dave Hansen wrote: > On 10/24/25 12:40, dan.j.williams@intel.com wrote: > > Dave Hansen wrote: > >> On 10/24/25 00:43, Chao Gao wrote: > >> ... > >>> Beyond "the kvm_tdx object gets torn down during a build," I see two potential > >>> issues: > >>> > >>> 1. TD Build and TDX migration aren't purely kernel processes -- they span multiple > >>> KVM ioctls. Holding a read-write lock throughout the entire process would > >>> require exiting to userspace while the lock is held. I think this is > >>> irregular, but I'm not sure if it's acceptable for read-write semaphores. > >> > >> Sure, I guess it's irregular. But look at it this way: let's say we > >> concocted some scheme to use a TD build refcount and a module update > >> flag, had them both wait_event_interruptible() on each other, and then > >> did wakeups. That would get the same semantics without an rwsem. > > > > This sounds unworkable to me. > > > > First, you cannot return to userspace while holding a lock. Lockdep will > > rightfully scream: > > > > "WARNING: lock held when returning to user space!" > > Well, yup, it sure does look that way for normal lockdep-annotated lock > types. It does seem like a sane rule to have for most things. > > But, just to be clear, this is a lockdep thing and a good, solid > semantic to have. It's not a rule that no kernel locking structure can > ever be held when returning to userspace. Sure, but I would submit that the lesser known cousin of the common suggestion "do not write your own locking primitives" is "do not invent locking schemes that involve holding locks over return to userspace". It is rarely a good idea to the point that lockdep warns about it by default. > > The complexity of ensuring that a multi-stage ABI transaction completes > > from the kernel side is painful. If that process dies in the middle of > > its ABI sequence who cleans up these references? > > The 'struct kvm_tdx' has to get destroyed at some point. Indefinite hangs because a process goes out to lunch and fails to destroy kvm_tdx in a reasonable timeframe now has knock-on effects. [..] > > The operational mechanism to make sure that one process flow does not > > mess up another process flow is for those process to communicate with > > *userspace* file locks, or for those process to check for failures after > > the fact and retry. Unless you can make the build side an atomic ABI, > > this is a documentation + userspace problem, not a kernel problem. > > Yeah, that's a totally valid take on it. > > My only worry is that the module update is going to be off in another > world from the thing building TDs. We had a similar set of challenges > around microcode updates, CPUSVN and SGX enclaves. > > The guy doing "echo 1 > /sys/.../whatever" wasn't coordinating with > every entity on the system that might run an SGX enclave. It certainly > didn't help that enclave creation is typically done by unprivileged > users. Maybe the KVM/TDX world is a _bit_ more narrow and they will be > talking to each other, or the /dev/kvm permissions will be a nice funnel > to get them talking to each other. > > The SGX solution, btw, was to at least ensure forward progress (CPUSVN > update) when the last enclave goes away. So new enclaves aren't > *prevented* from starting but the window when the first one starts > (enclave count going from 0->1) is leveraged to do the update. The status quo does ensure forward progress. The TD does get built and the update does complete, just the small matter of TD attestation failures, right? Note, we had a similar problem with the tsm_report interface which, because it is configfs and not an ioctl, is a multi-stage ABI to build a report. If 2 threads collide in building an object, userspace indeed gets to keep the pieces, but there is: 1/ Documentation of the potential for collisions 2/ A mechanism to detect collisions. See /sys/kernel/config/tsm/report/$name/generation in Documentation/ABI/testing/configfs-tsm-report I really would not worry about the "off in another world" problem, it is par for the course for datacenter operations. I encountered prolific use of file locks in operations scripts at my time at Facebook. Think of problems like coordinating disk partitioning across various provisioning flows. The kernel happily lets 2 fdisk processes race to write a partition table. The only way to ensure a consistent result in that case is userspace sequencing, not a kernel lock while some process has a partition table open.
On 10/24/25 14:12, dan.j.williams@intel.com wrote: >> The SGX solution, btw, was to at least ensure forward progress (CPUSVN >> update) when the last enclave goes away. So new enclaves aren't >> *prevented* from starting but the window when the first one starts >> (enclave count going from 0->1) is leveraged to do the update. > The status quo does ensure forward progress. The TD does get built and > the update does complete, just the small matter of TD attestation > failures, right? Oh, yeah, for sure. If we do _nothing_ in the kernel (no build vs. module update synchronization), then the downside is being exposed to attestation failures if userspace either also does nothing or has bugs. That's actually, by far, my preferred solution to this whole mess: Userspace plays stupid games, userspace wins stupid prizes.
On Fri, Oct 24, 2025 at 2:19 PM Dave Hansen <dave.hansen@intel.com> wrote: > > On 10/24/25 14:12, dan.j.williams@intel.com wrote: > >> The SGX solution, btw, was to at least ensure forward progress (CPUSVN > >> update) when the last enclave goes away. So new enclaves aren't > >> *prevented* from starting but the window when the first one starts > >> (enclave count going from 0->1) is leveraged to do the update. > > The status quo does ensure forward progress. The TD does get built and > > the update does complete, just the small matter of TD attestation > > failures, right? I would think that it's not a "small" problem if confidential workloads on the hosts are not able to pass attestation. > > Oh, yeah, for sure. > > If we do _nothing_ in the kernel (no build vs. module update > synchronization), then the downside is being exposed to attestation > failures if userspace either also does nothing or has bugs. > > That's actually, by far, my preferred solution to this whole mess: > Userspace plays stupid games, userspace wins stupid prizes. > IIUC, enforcing "Avoid updates during update sensitive times" is not that complex and will ensure to avoid any issues with user space logic.
Vishal Annapurve wrote: > On Fri, Oct 24, 2025 at 2:19 PM Dave Hansen <dave.hansen@intel.com> wrote: > > > > On 10/24/25 14:12, dan.j.williams@intel.com wrote: > > >> The SGX solution, btw, was to at least ensure forward progress (CPUSVN > > >> update) when the last enclave goes away. So new enclaves aren't > > >> *prevented* from starting but the window when the first one starts > > >> (enclave count going from 0->1) is leveraged to do the update. > > > The status quo does ensure forward progress. The TD does get built and > > > the update does complete, just the small matter of TD attestation > > > failures, right? > > I would think that it's not a "small" problem if confidential > workloads on the hosts are not able to pass attestation. "Small" as in "not the kernel's problem". Userspace asked for the update, update is documented to clobber build sometimes, userspace ran an update anyway. Userspace asked for the clobber. It would be lovely if this clobbering does not happen at all and the update mechanism did not come with this misfeature. Otherwise, the kernel has no interface to solve that problem. The best it can do is document that this new update facility has this side effect. Userspace always has the choice to not update, coordinate update with build, or do nothing and let tenants try to launch again. Userspace could even retry the build and hide the tenant failure if it knew about the clobber, but be clear that the problem is the clobber not the kernel doing what userspace asked. The clobber, as I understand, is also limited to cases where the update includes crypto library changes. I am not sure how often that happens in practice. Suffice to say, the fact that the clobber is conditioned on the contents of the update also puts it further away from being a kernel problem. The clobber does not corrupt kernel state. > > Oh, yeah, for sure. > > > > If we do _nothing_ in the kernel (no build vs. module update > > synchronization), then the downside is being exposed to attestation > > failures if userspace either also does nothing or has bugs. > > > > That's actually, by far, my preferred solution to this whole mess: > > Userspace plays stupid games, userspace wins stupid prizes. > > > > IIUC, enforcing "Avoid updates during update sensitive times" is not > that complex and will ensure to avoid any issues with user space > logic. Userspace logic avoids issues by honoring the documentation that these ABIs sequences need synchronization. Otherwise, kernel blocking update during build just trades one error for another. Treat this like any other userspace solution for requiring "atomic" semantics when the kernel mechanisms are not themselves designed to be atomic, wrap it in userspace synchronization.
On Fri, Oct 24, 2025 at 6:42 PM <dan.j.williams@intel.com> wrote: > > Vishal Annapurve wrote: > > On Fri, Oct 24, 2025 at 2:19 PM Dave Hansen <dave.hansen@intel.com> wrote: > > > > > > On 10/24/25 14:12, dan.j.williams@intel.com wrote: > > > >> The SGX solution, btw, was to at least ensure forward progress (CPUSVN > > > >> update) when the last enclave goes away. So new enclaves aren't > > > >> *prevented* from starting but the window when the first one starts > > > >> (enclave count going from 0->1) is leveraged to do the update. > > > > The status quo does ensure forward progress. The TD does get built and > > > > the update does complete, just the small matter of TD attestation > > > > failures, right? > > > > I would think that it's not a "small" problem if confidential > > workloads on the hosts are not able to pass attestation. > > "Small" as in "not the kernel's problem". Userspace asked for the > update, update is documented to clobber build sometimes, userspace ran > an update anyway. Userspace asked for the clobber. > > It would be lovely if this clobbering does not happen at all and the > update mechanism did not come with this misfeature. Otherwise, the kernel > has no interface to solve that problem. The best it can do is document > that this new update facility has this side effect. In this case, host kernel has a way to ensure that userspace can't trigger such clobbering at all. That IIUC is "Avoid updates during update sensitive times". Best kernel can do is prevent userspace from screwing up the state of TDs. > > Userspace always has the choice to not update, coordinate update with > build, or do nothing and let tenants try to launch again. Userspace > could even retry the build and hide the tenant failure if it knew about > the clobber, IIUC host userspace has no way to know if the TD state got clobbered. > but be clear that the problem is the clobber not the kernel > doing what userspace asked. > > The clobber, as I understand, is also limited to cases where the update > includes crypto library changes. I am not sure how often that happens in > practice. Suffice to say, the fact that the clobber is conditioned on > the contents of the update also puts it further away from being a kernel The knowledge of things getting clobbered are well much further away from userspace. > problem. The clobber does not corrupt kernel state. > > > > Oh, yeah, for sure. > > > > > > If we do _nothing_ in the kernel (no build vs. module update > > > synchronization), then the downside is being exposed to attestation > > > failures if userspace either also does nothing or has bugs. > > > > > > That's actually, by far, my preferred solution to this whole mess: > > > Userspace plays stupid games, userspace wins stupid prizes. > > > > > > > IIUC, enforcing "Avoid updates during update sensitive times" is not > > that complex and will ensure to avoid any issues with user space > > logic. > > Userspace logic avoids issues by honoring the documentation that these > ABIs sequences need synchronization. Otherwise, kernel blocking update > during build just trades one error for another. Kernel blocking update during build makes the production systems much safer and prevents userspace from screwing up the state that it has no way to detect after the fact. > > Treat this like any other userspace solution for requiring "atomic" > semantics when the kernel mechanisms are not themselves designed to be > atomic, wrap it in userspace synchronization. In general if this is something userspace detectable I would agree, TDX module is the closest entity that can detect the problematic sequence and the host kernel has a very simple way to ensure that such a problematic sequence is not at all allowed to happen by toggling some seamcall controls. It would be very helpful IMO to ensure that userspace is not able to screw up production workloads especially if the mess is not all visible to userspace.
Vishal Annapurve wrote: > On Fri, Oct 24, 2025 at 6:42 PM <dan.j.williams@intel.com> wrote: > > > > Vishal Annapurve wrote: > > > On Fri, Oct 24, 2025 at 2:19 PM Dave Hansen <dave.hansen@intel.com> wrote: > > > > > > > > On 10/24/25 14:12, dan.j.williams@intel.com wrote: > > > > >> The SGX solution, btw, was to at least ensure forward progress (CPUSVN > > > > >> update) when the last enclave goes away. So new enclaves aren't > > > > >> *prevented* from starting but the window when the first one starts > > > > >> (enclave count going from 0->1) is leveraged to do the update. > > > > > The status quo does ensure forward progress. The TD does get built and > > > > > the update does complete, just the small matter of TD attestation > > > > > failures, right? > > > > > > I would think that it's not a "small" problem if confidential > > > workloads on the hosts are not able to pass attestation. > > > > "Small" as in "not the kernel's problem". Userspace asked for the > > update, update is documented to clobber build sometimes, userspace ran > > an update anyway. Userspace asked for the clobber. > > > > It would be lovely if this clobbering does not happen at all and the > > update mechanism did not come with this misfeature. Otherwise, the kernel > > has no interface to solve that problem. The best it can do is document > > that this new update facility has this side effect. > > In this case, host kernel has a way to ensure that userspace can't > trigger such clobbering at all. Unless the clobber condition can be made atomic with respect to update so that both succeed, the kernel needs to punt the syncrhonization problem to userspace. A theoretical TDX Module change could ensure that atomicity. A theoretical change to the kernel's build ABI could effect that as well, or notify the collision. I.e. a flag at the finalization stage that an update happened during the build sequence needs a restart. This is the role of "generation" in the tsm_report ABI. As far as I understand userspace just skips that ABI and arranges for userspace synchronized access to tsm_report. At the point where the solution is "change existing build flows" might as well just have userspace wrap the flows with userspace exclusion. > That IIUC is "Avoid updates during update sensitive times". Best > kernel can do is prevent userspace from screwing up the state of TDs. "Avoid updates during update sensitive times" is the documentation for the update userspace ABI. > > Userspace always has the choice to not update, coordinate update with > > build, or do nothing and let tenants try to launch again. Userspace > > could even retry the build and hide the tenant failure if it knew about > > the clobber, > > IIUC host userspace has no way to know if the TD state got clobbered. Correct, today it can only assume that both flows need to be mutually exclusive. > > but be clear that the problem is the clobber not the kernel > > doing what userspace asked. > > > > The clobber, as I understand, is also limited to cases where the update > > includes crypto library changes. I am not sure how often that happens in > > practice. Suffice to say, the fact that the clobber is conditioned on > > the contents of the update also puts it further away from being a kernel > > The knowledge of things getting clobbered are well much further away > from userspace. The possibility is documented as part of the update ABI. Another documentation possibility is that updates that change the crypto library are by definition not "runtime update" capable. A possible TDX Module change to remove this collision. A menu of options before complicating the kernel.
On Sun, Oct 26, 2025 at 2:30 PM <dan.j.williams@intel.com> wrote: > > Vishal Annapurve wrote: > > On Fri, Oct 24, 2025 at 6:42 PM <dan.j.williams@intel.com> wrote: > > > > > > Vishal Annapurve wrote: > > > > On Fri, Oct 24, 2025 at 2:19 PM Dave Hansen <dave.hansen@intel.com> wrote: > > > > > > > > > > On 10/24/25 14:12, dan.j.williams@intel.com wrote: > > > > > >> The SGX solution, btw, was to at least ensure forward progress (CPUSVN > > > > > >> update) when the last enclave goes away. So new enclaves aren't > > > > > >> *prevented* from starting but the window when the first one starts > > > > > >> (enclave count going from 0->1) is leveraged to do the update. > > > > > > The status quo does ensure forward progress. The TD does get built and > > > > > > the update does complete, just the small matter of TD attestation > > > > > > failures, right? > > > > > > > > I would think that it's not a "small" problem if confidential > > > > workloads on the hosts are not able to pass attestation. > > > > > > "Small" as in "not the kernel's problem". Userspace asked for the > > > update, update is documented to clobber build sometimes, userspace ran > > > an update anyway. Userspace asked for the clobber. > > > > > > It would be lovely if this clobbering does not happen at all and the > > > update mechanism did not come with this misfeature. Otherwise, the kernel > > > has no interface to solve that problem. The best it can do is document > > > that this new update facility has this side effect. > > > > In this case, host kernel has a way to ensure that userspace can't > > trigger such clobbering at all. > > Unless the clobber condition can be made atomic with respect to update > so that both succeed, the kernel needs to punt the syncrhonization > problem to userspace. > > A theoretical TDX Module change could ensure that atomicity. IIUC TDX module already supports avoiding this clobber based on the TDH.SYS.SHUTDOWN documentation from section 5.4.73 of TDX ABI Spec [1]. Host kernel needs to set bit 16 of rcx when invoking TDH.SYS.SHUTDOWN is available. "If supported by the TDX Module, the host VMM can set the AVOID_COMPAT_SENSITIVE flag to request the TDX Module to fail TDH.SYS.UPDATE if any of the TDs are currently in a state that is impacted by the update-sensitive cases." I think the above documentation should replace TDH.SYS.UPDATE with TDH.SYS.SHUTDOWN IIUC. [1] https://cdrdv2.intel.com/v1/dl/getContent/733579 > A > theoretical change to the kernel's build ABI could effect that as well, > or notify the collision. I.e. a flag at the finalization stage that an > update happened during the build sequence needs a restart. This is the > role of "generation" in the tsm_report ABI. As far as I understand > userspace just skips that ABI and arranges for userspace synchronized > access to tsm_report. >
Vishal Annapurve wrote: [..] > > A theoretical TDX Module change could ensure that atomicity. > > IIUC TDX module already supports avoiding this clobber based on the > TDH.SYS.SHUTDOWN documentation from section 5.4.73 of TDX ABI Spec > [1]. > > Host kernel needs to set bit 16 of rcx when invoking TDH.SYS.SHUTDOWN > is available. > > "If supported by the TDX Module, the host VMM can set the > AVOID_COMPAT_SENSITIVE flag to request the TDX Module to fail > TDH.SYS.UPDATE if any of the TDs are currently in a state that is > impacted by the update-sensitive cases." That is not a fix. That just shifts the complexity from build to update. It still leaves update in a state where it is not guaranteed to make forward progress. The way to ensure forward progress is the same as ensuring build consistency, i.e. sequence build with respect to update. The kernel sheds complexity by ether making userspace solve that problem, or motivating a real fix in the TDX Module that obviates the AVOID_COMPAT_SENSITIVE case.
On Mon, Oct 27, 2025 at 11:53 AM <dan.j.williams@intel.com> wrote: > > Vishal Annapurve wrote: > [..] > > > A theoretical TDX Module change could ensure that atomicity. > > > > IIUC TDX module already supports avoiding this clobber based on the > > TDH.SYS.SHUTDOWN documentation from section 5.4.73 of TDX ABI Spec > > [1]. > > > > Host kernel needs to set bit 16 of rcx when invoking TDH.SYS.SHUTDOWN > > is available. > > > > "If supported by the TDX Module, the host VMM can set the > > AVOID_COMPAT_SENSITIVE flag to request the TDX Module to fail > > TDH.SYS.UPDATE if any of the TDs are currently in a state that is > > impacted by the update-sensitive cases." > > That is not a fix. That just shifts the complexity from build to update. > It still leaves update in a state where it is not guaranteed to make IMO, there are two problems here: 1) Giving a consistent ABI that leaves the responsibility of ensuring forward progress by sequencing TD update with TD build steps with userspace. 2) Ensuring that userspace can't screw up the in-progress TD VM metadata if userspace doesn't adhere to the sequence above. Problem 2 should be solved in the TDX module as it is the state owner and should be given a chance to ensure that nothing else can affect it's state. Kernel is just opting-in to toggle the already provided TDX module ABI. I don't think this is adding complexity to the kernel. > forward progress. The way to ensure forward progress is the same as > ensuring build consistency, i.e. sequence build with respect to update. > The kernel sheds complexity by ether making userspace solve that > problem, or motivating a real fix in the TDX Module that obviates the > AVOID_COMPAT_SENSITIVE case.
Vishal Annapurve wrote: > On Mon, Oct 27, 2025 at 11:53 AM <dan.j.williams@intel.com> wrote: > > > > Vishal Annapurve wrote: > > [..] > > > > A theoretical TDX Module change could ensure that atomicity. > > > > > > IIUC TDX module already supports avoiding this clobber based on the > > > TDH.SYS.SHUTDOWN documentation from section 5.4.73 of TDX ABI Spec > > > [1]. > > > > > > Host kernel needs to set bit 16 of rcx when invoking TDH.SYS.SHUTDOWN > > > is available. > > > > > > "If supported by the TDX Module, the host VMM can set the > > > AVOID_COMPAT_SENSITIVE flag to request the TDX Module to fail > > > TDH.SYS.UPDATE if any of the TDs are currently in a state that is > > > impacted by the update-sensitive cases." > > > > That is not a fix. That just shifts the complexity from build to update. > > It still leaves update in a state where it is not guaranteed to make > > IMO, there are two problems here: > 1) Giving a consistent ABI that leaves the responsibility of ensuring > forward progress by sequencing TD update with TD build steps with > userspace. > 2) Ensuring that userspace can't screw up the in-progress TD VM > metadata if userspace doesn't adhere to the sequence above. > > Problem 2 should be solved in the TDX module as it is the state owner > and should be given a chance to ensure that nothing else can affect > it's state. Kernel is just opting-in to toggle the already provided > TDX module ABI. I don't think this is adding complexity to the kernel. That gives update a transient error to handle
dan.j.williams@ wrote: [..] > That gives update a transient error to handle Apologies, this was a draft sent in error.
Vishal Annapurve wrote: [..] > Problem 2 should be solved in the TDX module as it is the state owner > and should be given a chance to ensure that nothing else can affect > it's state. Kernel is just opting-in to toggle the already provided > TDX module ABI. I don't think this is adding complexity to the kernel. It makes the interface hard to reason about, that is complexity. Consider an urgent case where update is more important than the consistency of ongoing builds. The kernel's job is its own self consistency and security model, when that remains in tact root is allowed to make informed decisions. You might say, well add a --force option for that, and that is also userspace prerogative to perform otherwise destructive operations with the degrees of freedom the kernel allows. I think we have reached the useful end of this thread. I support moving ahead with the dead simple, "this may clobber your builds", for now. We can always circle back to add more complexity later if that proves "too simple" in practice.
On Mon, Oct 27, 2025 at 7:13 PM <dan.j.williams@intel.com> wrote: > > Vishal Annapurve wrote: > [..] > > Problem 2 should be solved in the TDX module as it is the state owner > > and should be given a chance to ensure that nothing else can affect > > it's state. Kernel is just opting-in to toggle the already provided > > TDX module ABI. I don't think this is adding complexity to the kernel. > > It makes the interface hard to reason about, that is complexity. > > Consider an urgent case where update is more important than the > consistency of ongoing builds. The kernel's job is its own self > consistency and security model, I would consider the TDX module as more privileged than the kernel itself in certain aspects. So it's not fine to expose an ABI to userspace which affects kernel state consistency, but it's fine to expose an ABI that can affect TDX module state consistency? > when that remains in tact root is allowed to make informed decisions. > > You might say, well add a --force option for that, and that is also > userspace prerogative to perform otherwise destructive operations with > the degrees of freedom the kernel allows. > > I think we have reached the useful end of this thread. I support moving > ahead with the dead simple, "this may clobber your builds", for now. We > can always circle back to add more complexity later if that proves "too > simple" in practice. >
On Mon, Oct 27, 2025 at 7:14 PM <dan.j.williams@intel.com> wrote: > > Vishal Annapurve wrote: > [..] > > Problem 2 should be solved in the TDX module as it is the state owner > > and should be given a chance to ensure that nothing else can affect > > it's state. Kernel is just opting-in to toggle the already provided > > TDX module ABI. I don't think this is adding complexity to the kernel. > > It makes the interface hard to reason about, that is complexity. Could you clarify what you mean here? What interface do you need to reason about? TDX module has a feature as described in its spec, this is nothing to do with the kernel. Kernel executes the TDH.SYS.SHUTDOWN and if it fails, it will return the error code back to the user space. There is nothing here to reason about and it is not clear how it is adding the complexity to the kernel. > > Consider an urgent case where update is more important than the > consistency of ongoing builds. The kernel's job is its own self > consistency and security model, when that remains in tact root is > allowed to make informed decisions. > The whole update is initiated by the userspace, imo, it is not the kernel's job to decide what to do. It should try to update the TDX module and return error code back to the userspace if it fails. it is up to the userspace to resolve the conflict and retry the installation. If you are saying that the userspace is not trusted for such a critical action, again the whole process is initiated and controlled by the userspace so there is an inherent trust there. Consistency? How does td preserve failure impact the kernel consistency? On the contrary, bypassing AVOID_COMPAT_SENSITIVE will break the consistency for some TDs. > You might say, well add a --force option for that, and that is also > userspace prerogative to perform otherwise destructive operations with > the degrees of freedom the kernel allows. IMO, It is something userspace should decide, kernel's job is to provide the necessary interface about it. > > I think we have reached the useful end of this thread. I support moving > ahead with the dead simple, "this may clobber your builds", for now. We > can always circle back to add more complexity later if that proves "too > simple" in practice. > It is not clear how you reached that conclusion. We are one of the users for this feature and we have multiple times explained that we prefer failure on update if there is any risk of corrupting some TD states. I did not see any other feedback/preference from other users and I did not see any reasonable argument why you are preferring the "clobber your builds" option. Also the "clobber your builds" option will impact the TDX live migration, considering the TDX live migration is WIP, it will be definitely very hard to foresee the challenges there you are introducing with this decision. How about TDX connect? Are we going to come back and keep updating this every time we find an issue? Since the update process is initiated and controlled by userspace, it is the userspace application's prerogative to make the informed decision on whether an urgent update warrants potentially destructive actions. The kernel's role is to provide a reliable mechanism to interact with the TDX Module and report outcomes accurately. Ideally, ABI should allow userpace to provide flags which can be also used to configure the TD preserve update option. If you do not want to change ABI, you can make those as module param so userspace can make a decision by itself. To address some of your previous concerns: It shifts complexity to userspace which is something everyone here seems to prefer. The problem is that the TD Preserve update would corrupt the TDs who are in the build stage (also impacts TDX LM and possibly some TDX connect functionalities) and since the TDX module would know about it, this will make sure that they will not be corrupted hence it is a fix for a problem. TDH.SYS.SHUTDOWN may not succeed due to multiple reasons like TDX_SYS_BUSY therefore it needs to handle the error cases anyway and should return the error to the userspace. Now userspace can decide whatever logic it has to finish/cancel the existing tdbuilds and retry the tdpreserve update. You might be concerned about forward progress. As I said above, there might be some other cases which might prevent the td preserve update to succeed so forward progress is not guaranteed anyway and it is not the kernel's job to figure it out. It will return the error code back to userspace and let the userspace resolve the conflict.
On Tue, Oct 28, 2025, Erdem Aktas wrote: > On Mon, Oct 27, 2025 at 7:14 PM <dan.j.williams@intel.com> wrote: > > > > Vishal Annapurve wrote: > > [..] > > > Problem 2 should be solved in the TDX module as it is the state owner > > > and should be given a chance to ensure that nothing else can affect > > > it's state. Kernel is just opting-in to toggle the already provided > > > TDX module ABI. I don't think this is adding complexity to the kernel. > > > > It makes the interface hard to reason about, that is complexity. > > Could you clarify what you mean here? What interface do you need to > reason about? TDX module has a feature as described in its spec, this > is nothing to do with the kernel. Kernel executes the TDH.SYS.SHUTDOWN > and if it fails, it will return the error code back to the user space. > There is nothing here to reason about and it is not clear how it is > adding the complexity to the kernel. Userspace needs to reason about error codes and potential sources of those error codes. That said, I agree that having the kernel set AVOID_COMPAT_SENSITIVE by default (I vote for setting it unconditionally), doesn't add meaningful complexity; the kernel would just need to document that the update mechanism can return -EBUSY (or whatever), and why/when. For me, that seems far less daunting/complex than attempting to document what all can go wrong if the kernel _doesn't_ set AVOID_COMPAT_SENSITIVE. Because IMO, regardless of whether or not the kernel sets AVOID_COMPAT_SENSITIVE, the kernel is making a decision and defining behavior, and that behavior needs to be documented. If AVOID_COMPAT_SENSITIVE didn't exist, then I would agree this is purely a userspace vs. TDX-Module problem, but it does exist, and not setting the flag defines ABI just as much as setting the flag does. The failure mode also matters, a lot. "Sorry dear customer, we corrupted your VM" is very, very different than "A handful of machines in our fleet haven't completed an (optional?) update". > > Consider an urgent case where update is more important than the > > consistency of ongoing builds. The kernel's job is its own self > > consistency and security model, when that remains in tact root is > > allowed to make informed decisions. > > > The whole update is initiated by the userspace, imo, it is not the > kernel's job to decide what to do. I think you and Dan are in violent agreement. I _think_ what Dan is saying that the kernel needs to protect itself, e.g. by rejecting an update if the kernel knows the system is in a bad state. But other than that, userspace can do whatever. AFAICT, the only disagreement is whether or not to set AVOID_COMPAT_SENSITIVE. > It should try to update the TDX module and return error code back to the > userspace if it fails. +1. Unless there's a wrinkle I'm missing, failing with -EBUSY seems like the obvious choice. > > You might say, well add a --force option for that, and that is also > > userspace prerogative to perform otherwise destructive operations with > > the degrees of freedom the kernel allows. > > IMO, It is something userspace should decide, kernel's job is to > provide the necessary interface about it. I disagree, I don't think userspace should even get the option. IMO, not setting AVOID_COMPAT_SENSITIVE is all kinds of crazy.
Sean Christopherson wrote: [..] > > IMO, It is something userspace should decide, kernel's job is to > > provide the necessary interface about it. > > I disagree, I don't think userspace should even get the option. IMO, not setting > AVOID_COMPAT_SENSITIVE is all kinds of crazy. Do see Table 4.4: "Comparison of Update Incompatibility Detection and/or Avoidance Methods" from the latest base architecture specification [1]. It lists out the pros and cons of not setting AVOID_COMPAT_SENSITIVE. This thread has only argued the merits of "None" and "Avoid updates during update- sensitive times". It has not discussed "Detect incompatibility after update", but let us not do that. You can just assume the Module has multiple solutions to this awkward problem precisely because different VMMs came to different conclusions. I want this thread to end so I am not going to argue past what Dave and Sean want to do here. [1]: https://www.intel.com/content/www/us/en/content-details/865787/intel-tdx-module-base-architecture-specification.html
On Tue, Oct 28, 2025, dan.j.williams@intel.com wrote: > Sean Christopherson wrote: > [..] > > > IMO, It is something userspace should decide, kernel's job is to > > > provide the necessary interface about it. > > > > I disagree, I don't think userspace should even get the option. IMO, not setting > > AVOID_COMPAT_SENSITIVE is all kinds of crazy. > > Do see Table 4.4: "Comparison of Update Incompatibility Detection and/or > Avoidance Methods" from the latest base architecture specification [1]. > It lists out the pros and cons of not setting AVOID_COMPAT_SENSITIVE. > This thread has only argued the merits of "None" and "Avoid updates > during update- sensitive times". It has not discussed "Detect > incompatibility after update", but let us not do that. But we already are discussing that, because the "None" option is just punting "Detect incompatibility after update" to something other than the VMM. Doing literally nothing isn't an option. The fact that it's even listed in the table, not to mention has "Simplest." listed as a pro, makes me question whether or not the authors actually understand how software built around the TDX-Module is used in practice. If an update causes a TD build to fail, or to generate the wrong measurement, or whatever "Failures due to incompatibilities" means, *something* eventually needs to take action. Doing nothing is certainly the simplest option for the hypervisor and VMM, but when looking at the entire stack/ecosystem, it's the most complex option as it bleeds the damage into multiple, potentially-unknown components of the stack. Either that, or I'm grossly misunderstanding what "Failures" means. That section also states: Future TDX Module versions may have different or additional update-sensitive cases. Which means that from an ABI perspective, "Avoid updates during update-sensitive times" is the _ONLY_ viable option. My read of that is that future TDX-Modules can effectively change the failure modes for a existing KVM ioctls. That is an ABI change and will break userspace, e.g. if userspace is sane and expects certain operations to succeed.
On Wed, Oct 29, 2025 at 6:48 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Oct 28, 2025, dan.j.williams@intel.com wrote:
> > Sean Christopherson wrote:
> > [..]
> > > > IMO, It is something userspace should decide, kernel's job is to
> > > > provide the necessary interface about it.
> > >
> > > I disagree, I don't think userspace should even get the option. IMO, not setting
> > > AVOID_COMPAT_SENSITIVE is all kinds of crazy.
> >
> > Do see Table 4.4: "Comparison of Update Incompatibility Detection and/or
> > Avoidance Methods" from the latest base architecture specification [1].
> > It lists out the pros and cons of not setting AVOID_COMPAT_SENSITIVE.
> > This thread has only argued the merits of "None" and "Avoid updates
> > during update- sensitive times". It has not discussed "Detect
> > incompatibility after update", but let us not do that.
>
> But we already are discussing that, because the "None" option is just punting
> "Detect incompatibility after update" to something other than the VMM. Doing
> literally nothing isn't an option. The fact that it's even listed in the table,
> not to mention has "Simplest." listed as a pro, makes me question whether or not
> the authors actually understand how software built around the TDX-Module is used
> in practice.
>
> If an update causes a TD build to fail, or to generate the wrong measurement, or
> whatever "Failures due to incompatibilities" means, *something* eventually needs
> to take action. Doing nothing is certainly the simplest option for the hypervisor
> and VMM, but when looking at the entire stack/ecosystem, it's the most complex
> option as it bleeds the damage into multiple, potentially-unknown components of
> the stack. Either that, or I'm grossly misunderstanding what "Failures" means.
>
> That section also states:
>
> Future TDX Module versions may have different or additional update-sensitive cases.
>
> Which means that from an ABI perspective, "Avoid updates during update-sensitive
> times" is the _ONLY_ viable option. My read of that is that future TDX-Modules
> can effectively change the failure modes for a existing KVM ioctls. That is an
> ABI change and will break userspace, e.g. if userspace is sane and expects certain
> operations to succeed.
A reference patch we tested for "Avoid updates during update-sensitive
times" and one caveat was that
/sys/devices/virtual/tdx/tdx_tsm/version was not available post update
failure until a subsequent successful update:
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index e00650b83f08..96ae7c679e4e 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -22,6 +22,7 @@
#define TDX_FEATURES0_NO_RBP_MOD BIT_ULL(18)
#define TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC BIT_ULL(23)
#define TDX_FEATURES0_DYNAMIC_PAMT BIT_ULL(36)
+#define TDX_FEATURES0_UPDATE_COMPATIBILITY BIT_ULL(47)
#ifndef __ASSEMBLY__
@@ -129,6 +130,11 @@ static inline bool
tdx_supports_dynamic_pamt(const struct tdx_sys_info *sysinfo)
return sysinfo->features.tdx_features0 & TDX_FEATURES0_DYNAMIC_PAMT;
}
+static inline bool tdx_supports_update_compatibility(const struct
tdx_sys_info *sysinfo)
+{
+ return sysinfo->features.tdx_features0 &
TDX_FEATURES0_UPDATE_COMPATIBILITY;
+}
+
int tdx_guest_keyid_alloc(void);
u32 tdx_get_nr_guest_keyids(void);
void tdx_guest_keyid_free(unsigned int keyid);
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index f6199f8ce411..95deb1146a79 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1523,6 +1523,10 @@ int tdx_module_shutdown(void)
* fail.
*/
args.rcx = tdx_sysinfo.handoff.module_hv;
+
+ if (tdx_supports_update_compatibility(&tdx_sysinfo))
+ args.rcx |= TDX_SYS_SHUTDOWN_AVOID_COMPAT_SENSITIVE;
+
ret = seamcall_prerr(TDH_SYS_SHUTDOWN, &args);
if (!ret)
tdx_module_reset_state();
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 0cd9140620f9..772c714de2bc 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -94,6 +94,8 @@ struct tdmr_info {
/* Bit definitions of TDX_FEATURES0 metadata field */
#define TDX_FEATURES0_TD_PRESERVING BIT(1)
+#define TDX_SYS_SHUTDOWN_AVOID_COMPAT_SENSITIVE BIT(16)
+
/*
* Do not put any hardware-defined TDX structure representations below
* this comment!
>A reference patch we tested for "Avoid updates during update-sensitive >times" and one caveat was that >/sys/devices/virtual/tdx/tdx_tsm/version was not available post update >failure until a subsequent successful update: I also tested this. It works well to prevent updates during TD build, so, Tested-by: Chao Gao <chao.gao@intel.com> And I can integrate this change into my next version if you don't object. Regarding the caveat, could you check if the diff [*] I posted earlier this week can fix it? [1]: https://lore.kernel.org/linux-coco/aQAwRrvYMcaMsu02@intel.com/
On Thu, Oct 30, 2025 at 9:53 PM Chao Gao <chao.gao@intel.com> wrote: > > >A reference patch we tested for "Avoid updates during update-sensitive > >times" and one caveat was that > >/sys/devices/virtual/tdx/tdx_tsm/version was not available post update > >failure until a subsequent successful update: > > I also tested this. It works well to prevent updates during TD build, so, > > Tested-by: Chao Gao <chao.gao@intel.com> > > And I can integrate this change into my next version if you don't object. > > Regarding the caveat, could you check if the diff [*] I posted earlier this > week can fix it? > > [1]: https://lore.kernel.org/linux-coco/aQAwRrvYMcaMsu02@intel.com/ [Now in plaintext] I tried testing it with the 1.5.24 TDX module and it sometimes fails, but the failure does not appear consistent. I added a local change to add the TDX_SYS_SHUTDOWN_AVOID_COMPAT_SENSITIVE flag when calling TDH_SYS_SHUTDOWN and TDH_SYS_SHUTDOWN fails as expected if a VM is under build: [ 1224.571177] virt/tdx: SEAMCALL (52) failed: 0x8000051200010000 But then sometimes trying to finalize the VM fail with the following error: [ 1230.915145] kvm_intel: SEAMCALL TDH_MR_FINALIZE failed: 0x8000ff00ffff0000 [ 1230.948264] kvm_intel: tdh_mng_vpflushdone() failed. HKID 3 is leaked. At this point the module seems to be in a broken state and trying to create more TDs will fail: [ 1543.745606] kvm_intel: SEAMCALL TDH_MNG_CREATE failed: 0x8000ff00ffff0000 Trying to update the module will fail shutdown with -ENODEV
On Wed, Nov 19, 2025 at 04:44:50PM -0600, Sagi Shahar wrote:
>On Thu, Oct 30, 2025 at 9:53 PM Chao Gao <chao.gao@intel.com> wrote:
>>
>> >A reference patch we tested for "Avoid updates during update-sensitive
>> >times" and one caveat was that
>> >/sys/devices/virtual/tdx/tdx_tsm/version was not available post update
>> >failure until a subsequent successful update:
>>
>> I also tested this. It works well to prevent updates during TD build, so,
>>
>> Tested-by: Chao Gao <chao.gao@intel.com>
>>
>> And I can integrate this change into my next version if you don't object.
>>
>> Regarding the caveat, could you check if the diff [*] I posted earlier this
>> week can fix it?
>>
>> [1]: https://lore.kernel.org/linux-coco/aQAwRrvYMcaMsu02@intel.com/
>
>[Now in plaintext]
>
>I tried testing it with the 1.5.24 TDX module and it sometimes fails,
>but the failure does not appear consistent.
>
>I added a local change to add the
>TDX_SYS_SHUTDOWN_AVOID_COMPAT_SENSITIVE flag when calling
>TDH_SYS_SHUTDOWN and TDH_SYS_SHUTDOWN fails as expected if a VM is
>under build:
>[ 1224.571177] virt/tdx: SEAMCALL (52) failed: 0x8000051200010000
>
>But then sometimes trying to finalize the VM fail with the following error:
>[ 1230.915145] kvm_intel: SEAMCALL TDH_MR_FINALIZE failed: 0x8000ff00ffff0000
>[ 1230.948264] kvm_intel: tdh_mng_vpflushdone() failed. HKID 3 is leaked.
>
>At this point the module seems to be in a broken state and trying to
>create more TDs will fail:
>[ 1543.745606] kvm_intel: SEAMCALL TDH_MNG_CREATE failed: 0x8000ff00ffff0000
>
>Trying to update the module will fail shutdown with -ENODEV
Can you apply this incremental change to see if the issue gets fixed?
diff --git a/arch/x86/virt/vmx/tdx/seamldr.c b/arch/x86/virt/vmx/tdx/seamldr.c
index e525bbd16610..f0bea1fecc52 100644
--- a/arch/x86/virt/vmx/tdx/seamldr.c
+++ b/arch/x86/virt/vmx/tdx/seamldr.c
@@ -317,8 +317,9 @@ static int do_seamldr_install_module(void *params)
tdx_module_set_error();
print_update_failure_message();
}
+ } else {
+ ack_state();
}
- ack_state();
} else {
touch_nmi_watchdog();
rcu_momentary_eqs();
The problem is if the failing CPU is the last one to ack the TDP_SHUTDOWN
state, the state will move to TDP_CPU_INSTALL state. Other CPUs may proceed to
install the new module before seeing tdp_data.failed. This disables TDX ISA, so
any subsequent SEAMCALLs get 0x8000ff00ffff0000.
On Wed, Nov 19, 2025 at 8:47 PM Chao Gao <chao.gao@intel.com> wrote:
>
> On Wed, Nov 19, 2025 at 04:44:50PM -0600, Sagi Shahar wrote:
> >On Thu, Oct 30, 2025 at 9:53 PM Chao Gao <chao.gao@intel.com> wrote:
> >>
> >> >A reference patch we tested for "Avoid updates during update-sensitive
> >> >times" and one caveat was that
> >> >/sys/devices/virtual/tdx/tdx_tsm/version was not available post update
> >> >failure until a subsequent successful update:
> >>
> >> I also tested this. It works well to prevent updates during TD build, so,
> >>
> >> Tested-by: Chao Gao <chao.gao@intel.com>
> >>
> >> And I can integrate this change into my next version if you don't object.
> >>
> >> Regarding the caveat, could you check if the diff [*] I posted earlier this
> >> week can fix it?
> >>
> >> [1]: https://lore.kernel.org/linux-coco/aQAwRrvYMcaMsu02@intel.com/
> >
> >[Now in plaintext]
> >
> >I tried testing it with the 1.5.24 TDX module and it sometimes fails,
> >but the failure does not appear consistent.
> >
> >I added a local change to add the
> >TDX_SYS_SHUTDOWN_AVOID_COMPAT_SENSITIVE flag when calling
> >TDH_SYS_SHUTDOWN and TDH_SYS_SHUTDOWN fails as expected if a VM is
> >under build:
> >[ 1224.571177] virt/tdx: SEAMCALL (52) failed: 0x8000051200010000
> >
> >But then sometimes trying to finalize the VM fail with the following error:
> >[ 1230.915145] kvm_intel: SEAMCALL TDH_MR_FINALIZE failed: 0x8000ff00ffff0000
> >[ 1230.948264] kvm_intel: tdh_mng_vpflushdone() failed. HKID 3 is leaked.
> >
> >At this point the module seems to be in a broken state and trying to
> >create more TDs will fail:
> >[ 1543.745606] kvm_intel: SEAMCALL TDH_MNG_CREATE failed: 0x8000ff00ffff0000
> >
> >Trying to update the module will fail shutdown with -ENODEV
>
> Can you apply this incremental change to see if the issue gets fixed?
>
> diff --git a/arch/x86/virt/vmx/tdx/seamldr.c b/arch/x86/virt/vmx/tdx/seamldr.c
> index e525bbd16610..f0bea1fecc52 100644
> --- a/arch/x86/virt/vmx/tdx/seamldr.c
> +++ b/arch/x86/virt/vmx/tdx/seamldr.c
> @@ -317,8 +317,9 @@ static int do_seamldr_install_module(void *params)
> tdx_module_set_error();
> print_update_failure_message();
> }
> + } else {
> + ack_state();
> }
> - ack_state();
> } else {
> touch_nmi_watchdog();
> rcu_momentary_eqs();
>
>
> The problem is if the failing CPU is the last one to ack the TDP_SHUTDOWN
> state, the state will move to TDP_CPU_INSTALL state. Other CPUs may proceed to
> install the new module before seeing tdp_data.failed. This disables TDX ISA, so
> any subsequent SEAMCALLs get 0x8000ff00ffff0000.
Thanks, I ran a couple dozen updates while TD was being built and
couldn't reproduce the issue with the new fix
On Sat, Oct 25, 2025 at 4:55 AM Vishal Annapurve <vannapurve@google.com> wrote: > > On Fri, Oct 24, 2025 at 6:42 PM <dan.j.williams@intel.com> wrote: > > > > Vishal Annapurve wrote: > > > On Fri, Oct 24, 2025 at 2:19 PM Dave Hansen <dave.hansen@intel.com> wrote: > > > > > > > > On 10/24/25 14:12, dan.j.williams@intel.com wrote: > > > > >> The SGX solution, btw, was to at least ensure forward progress (CPUSVN > > > > >> update) when the last enclave goes away. So new enclaves aren't > > > > >> *prevented* from starting but the window when the first one starts > > > > >> (enclave count going from 0->1) is leveraged to do the update. > > > > > The status quo does ensure forward progress. The TD does get built and > > > > > the update does complete, just the small matter of TD attestation > > > > > failures, right? > > > > > > I would think that it's not a "small" problem if confidential > > > workloads on the hosts are not able to pass attestation. > > > > "Small" as in "not the kernel's problem". Userspace asked for the > > update, update is documented to clobber build sometimes, userspace ran > > an update anyway. Userspace asked for the clobber. > > > > It would be lovely if this clobbering does not happen at all and the > > update mechanism did not come with this misfeature. Otherwise, the kernel > > has no interface to solve that problem. The best it can do is document > > that this new update facility has this side effect. > > In this case, host kernel has a way to ensure that userspace can't > trigger such clobbering at all. That IIUC is "Avoid updates during > update sensitive times". Best kernel can do is prevent userspace from > screwing up the state of TDs. > > > > > Userspace always has the choice to not update, coordinate update with > > build, or do nothing and let tenants try to launch again. Userspace > > could even retry the build and hide the tenant failure if it knew about > > the clobber, > > IIUC host userspace has no way to know if the TD state got clobbered. > > > but be clear that the problem is the clobber not the kernel > > doing what userspace asked. > > > > The clobber, as I understand, is also limited to cases where the update > > includes crypto library changes. I am not sure how often that happens in > > practice. Suffice to say, the fact that the clobber is conditioned on > > the contents of the update also puts it further away from being a kernel > > The knowledge of things getting clobbered are well much further away > from userspace. > > > problem. The clobber does not corrupt kernel state. > > > > > > Oh, yeah, for sure. > > > > > > > > If we do _nothing_ in the kernel (no build vs. module update > > > > synchronization), then the downside is being exposed to attestation > > > > failures if userspace either also does nothing or has bugs. > > > > > > > > That's actually, by far, my preferred solution to this whole mess: > > > > Userspace plays stupid games, userspace wins stupid prizes. > > > > > > > > > > IIUC, enforcing "Avoid updates during update sensitive times" is not > > > that complex and will ensure to avoid any issues with user space > > > logic. > > > > Userspace logic avoids issues by honoring the documentation that these > > ABIs sequences need synchronization. Otherwise, kernel blocking update > > during build just trades one error for another. > > Kernel blocking update during build makes the production systems much > safer and prevents userspace from screwing up the state that it has no > way to detect after the fact. > > > > > Treat this like any other userspace solution for requiring "atomic" > > semantics when the kernel mechanisms are not themselves designed to be > > atomic, wrap it in userspace synchronization. > > In general if this is something userspace detectable I would agree, > TDX module is the closest entity that can detect the problematic > sequence and the host kernel has a very simple way to ensure that such > a problematic sequence is not at all allowed to happen by toggling > some seamcall controls. It would be very helpful IMO to ensure that > userspace is not able to screw up production workloads especially if > the mess is not all visible to userspace. Detecting is one thing, undoing the mess is disruptive and not easy to orchestrate in this case.
On Fri, Oct 24, 2025, dan.j.williams@intel.com wrote: > Dave Hansen wrote: > > On 10/24/25 00:43, Chao Gao wrote: > > ... > > > Beyond "the kvm_tdx object gets torn down during a build," I see two potential > > > issues: > > > > > > 1. TD Build and TDX migration aren't purely kernel processes -- they span multiple > > > KVM ioctls. Holding a read-write lock throughout the entire process would > > > require exiting to userspace while the lock is held. I think this is > > > irregular, but I'm not sure if it's acceptable for read-write semaphores. > > > > Sure, I guess it's irregular. But look at it this way: let's say we > > concocted some scheme to use a TD build refcount and a module update > > flag, had them both wait_event_interruptible() on each other, and then > > did wakeups. That would get the same semantics without an rwsem. > > This sounds unworkable to me. > > First, you cannot return to userspace while holding a lock. Lockdep will > rightfully scream: > > "WARNING: lock held when returning to user space!" > > The complexity of ensuring that a multi-stage ABI transaction completes > from the kernel side is painful. If that process dies in the middle of > its ABI sequence who cleans up these references? > > The operational mechanism to make sure that one process flow does not > mess up another process flow is for those process to communicate with > *userspace* file locks, or for those process to check for failures after > the fact and retry. Unless you can make the build side an atomic ABI, > this is a documentation + userspace problem, not a kernel problem. C'mon people (especially the Google folks), this is the ***exact*** same problem as certificate updates for SNP[1]. Y'all suggested holding a lock across a userspace exit back then, and Dan's analysis confirms my reaction from back then that "Holding a lock across an exit to userspace seems wildly unsafe."[2] In the end, it took more time to understand the problem then to sketch out and test a solution[3]. Unless this somehow puts the host (kernel) at risk, this is a userspace problem. [1] https://lore.kernel.org/all/20240426173515.6pio42iqvjj2aeac@amd.com [2] https://lore.kernel.org/all/Zx_V5SHwzDAl8ZQR@google.com [3] https://lore.kernel.org/all/ZixCYlKn5OYUFWEq@google.com
On 10/24/25 13:00, Sean Christopherson wrote: > C'mon people (especially the Google folks), this is the ***exact*** > same problem as certificate updates for SNP[1]. Y'all suggested > holding a lock across a userspace exit back then, and Dan's analysis > confirms my reaction from back then that "Holding a lock across an > exit to userspace seems wildly unsafe."[2] > > In the end, it took more time to understand the problem then to > sketch out and test a solution[3]. > > Unless this somehow puts the host (kernel) at risk, this is a > userspace problem. If there's an similar SEV-SNP problem and accepted solution punted to userspace that TDX can leverage, I'm 100% on board with that. Let's do that.
On Fri, Oct 24, 2025 at 1:14 PM Dave Hansen <dave.hansen@intel.com> wrote: > > On 10/24/25 13:00, Sean Christopherson wrote: > > C'mon people (especially the Google folks), this is the ***exact*** > > same problem as certificate updates for SNP[1]. Y'all suggested > > holding a lock across a userspace exit back then, and Dan's analysis > > confirms my reaction from back then that "Holding a lock across an > > exit to userspace seems wildly unsafe."[2] > > > > In the end, it took more time to understand the problem then to > > sketch out and test a solution[3]. > > > > Unless this somehow puts the host (kernel) at risk, this is a > > userspace problem. > > If there's an similar SEV-SNP problem and accepted solution punted to > userspace that TDX can leverage, I'm 100% on board with that. Let's do that. So IIUC, the current stance is that the kernel can rely on userspace to ensure forward progress of TDX module update. I still vote for the "Avoid updates during update sensitive times" approach to be enabled in the host kernel to ensure userspace can't mess up the TDX module state.
On Thu, Oct 23, 2025 at 2:10 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/23/25 13:31, Vishal Annapurve wrote:
> ...
> >> Admin logic to update TDX modules can be designed to either retry
> >> failed TDX module updates or to be more robust, adds some
> >> synchronization with VM creation attempts on the host. i.e. I think
> >> it's fine to punt this problem of ensuring the forward progress to
> >> user-space admin logic on the host.
> > Discussed offline with Erdem Aktas on this. From Google's perspective
> > "Avoid updates during updatesensitive times" seems a better option as
> > I mentioned above.
> >
> > To avoid having to choose which policy to enforce in kernel, a better
> > way could be to:
> > * Allow user space opt-in for "Avoid updates during updatesensitive times" AND
> > * Allow user space opt-in for "Detect incompatibility after update" as well OR
> > * Keep "Detect incompatibility after update" enabled by default based
> > on the appetite for avoiding silent corruption scenarios.
>
> I'd really prefer to keep this simple. Adding new opt-in ABIs up the
> wazoo doesn't seem great.
>
> I think I've heard three requirements in the end:
>
> 1. Guarantee module update forward progress
> 2. Avoid "corrupt" TD build processes by letting the build/update
> race happen
> 3. Don't complicate the build process by forcing it to error out
> if a module update clobbers a build
>
> One thing I don't think I've heard anyone be worried about is how timely
> the update process is. So how about this: Updates wait for any existing
> builds to complete. But, new builds wait for updates. That can be done
> with a single rwsem:
>
> struct rw_semaphore update_rwsem;
>
> tdx_td_init()
> {
> ...
> + down_read_interruptible(&update_rwsem);
> kvm_tdx->state = TD_STATE_INITIALIZED;
>
> tdx_td_finalize()
> {
> ...
> + up_read(&update_rwsem);
> kvm_tdx->state = TD_STATE_RUNNABLE;
>
> A module update does:
>
> down_write_interruptible(&update_rwsem);
> do_actual_update();
> up_write(&update_rwsem);
>
> There would be no corruption issues, no erroring out of the build
> process, and no punting to userspace to ensure forward progress.
>
> The big downside is that both the build process and update process can
> appear to hang for a long time. It'll also be a bit annoying to ensure
> that there are up_read(&update_rwsem)'s if the kvm_tdx object gets torn
> down during a build.
>
> But the massive upside is that there's no new ABI and all the
> consistency and forward progress guarantees are in the kernel. If we
> want new ABIs around it that give O_NONBLOCK semantics to build or
> update, that can be added on after the fact.
>
> Plus, if userspace *WANTS* to coordinate the whole shebang, they're free
> to. They'd never see long hangs because they would be coordinating.
>
> Thoughts?
Yeah, this approach sounds reasonable.
> -----Original Message----- > From: Vishal Annapurve <vannapurve@google.com> > Sent: Saturday, October 18, 2025 3:02 AM > To: Reshetova, Elena <elena.reshetova@intel.com> > Cc: Hansen, Dave <dave.hansen@intel.com>; Gao, Chao > <chao.gao@intel.com>; linux-coco@lists.linux.dev; linux- > kernel@vger.kernel.org; x86@kernel.org; Chatre, Reinette > <reinette.chatre@intel.com>; Weiny, Ira <ira.weiny@intel.com>; Huang, Kai > <kai.huang@intel.com>; Williams, Dan J <dan.j.williams@intel.com>; > yilun.xu@linux.intel.com; sagis@google.com; paulmck@kernel.org; > nik.borisov@suse.com; Borislav Petkov <bp@alien8.de>; Dave Hansen > <dave.hansen@linux.intel.com>; H. Peter Anvin <hpa@zytor.com>; Ingo Molnar > <mingo@redhat.com>; Kirill A. Shutemov <kas@kernel.org>; Paolo Bonzini > <pbonzini@redhat.com>; Edgecombe, Rick P <rick.p.edgecombe@intel.com>; > Thomas Gleixner <tglx@linutronix.de> > Subject: Re: [PATCH v2 00/21] Runtime TDX Module update support > > On Fri, Oct 17, 2025 at 3:08 AM Reshetova, Elena > <elena.reshetova@intel.com> wrote: > > > > > > > > > ... > > > > > > But the situation can be avoided fully, if TD preserving update is not > > > > > conducted > > > > > > during the TD build time. > > > > > > > > > > Sure, and the TDX module itself could guarantee this as well as much as > > > > > the kernel could. It could decline to allow module updates during TD > > > > > builds, or error out the TD build if it collides with an update. > > > > > > > > TDX module has a functionality to decline going into SHUTDOWN state > > > > (pre-requisite for TD preserving update) if TD build or any problematic > > > > operation is in progress. It requires VMM to opt-in into this feature. > > > > > > Is this opt-in enabled as part of this series? If not, what is the > > > mechanism to enable this opt-in? > > > > For the information about how it works on TDX module side, > > please consult the latest ABI spec, definition of TDH.SYS.SHUTDOWN leaf, > > page 321: > > https://cdrdv2.intel.com/v1/dl/getContent/733579 > > > > Thanks Elena. Should the patch [1] from this series be modified to > handle the TDX module shutdown as per: > "If supported by the TDX Module, the host VMM can set the > AVOID_COMPAT_SENSITIVE flag to request the TDX Module to fail > TDH.SYS.UPDATE if any of the TDs are currently in a state that is > impacted by the update-sensitive cases" > > The documentation below doesn't make sense to me: > "The compatibility checks done by TDH.SYS.UPDATE do not include the > following cases: > * If any TD was initialized by an older TDX Module that did enumerate > TDX_FEATURES0.UPDATE_COMPATIBLITY as 1, TDH.SYS.SHUTDOWN does not > check for a TD build in progress condition for that TD. > * If any TD migration session is in progress, it was initialized by an > older TDX Module that did enumerate TDX_FEATURES0.UPDATE_COMPATIBLITY > as 1" > > Was it supposed to say below? > "If any TD was initialized by an older TDX Module that did enumerate > TDX_FEATURES0.UPDATE_COMPATIBLITY as 0, TDH.SYS.SHUTDOWN does not Yes, the spec error, thank you for catching this. Will be fixed. The correct text should say: " If any TD was initialized by an older TDX Module that did *not* enumerate TDX_FEATURES0.UPDATE_COMPATIBLITY as 1, TDH.SYS.SHUTDOWN does not check for a TD build in progress condition for that TD. If any TD migration session is in progress, and it was initialized by an older TDX Module that did *not* enumerate TDX_FEATURES0.UPDATE_COMPATIBLITY as 1, TDH.SYS.SHUTDOWN does not check for an interrupted TD migration function condition for that TD." Best Regards, Elena.
© 2016 - 2026 Red Hat, Inc.