[PATCH v2 16/21] x86/virt/seamldr: Handle TDX Module update failures

Chao Gao posted 21 patches 4 months, 1 week ago
There is a newer version of this series
[PATCH v2 16/21] x86/virt/seamldr: Handle TDX Module update failures
Posted by Chao Gao 4 months, 1 week ago
Failures encountered after a successful module shutdown are unrecoverable,
e.g., there is no way to restore the old TDX Module.

All subsequent SEAMCALLs to the TDX Module will fail and so TDs have to be
killed.

Report failures through sysfs attributes and log a message to clarify that
SEAMCALL errors are expected in this situation.

To prevent TDX Module update failures, admins are encouraged to use the
user space tool [1] that will perform compatibility and integrity checks
that guarantee TDX Module update success (unless the system's update limit
is exceeded, but the kernel will prevent an update attempt in this case).

Signed-off-by: Chao Gao <chao.gao@intel.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Link: https://github.com/intel/tdx-module-binaries/blob/main/version_select_and_load.py # [1]
---
 arch/x86/virt/vmx/tdx/seamldr.c       | 15 ++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.c           |  6 ++++++
 arch/x86/virt/vmx/tdx/tdx.h           |  1 +
 drivers/virt/coco/tdx-host/tdx-host.c |  4 ++++
 4 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/arch/x86/virt/vmx/tdx/seamldr.c b/arch/x86/virt/vmx/tdx/seamldr.c
index b9daf11e1064..a5aff04a85b9 100644
--- a/arch/x86/virt/vmx/tdx/seamldr.c
+++ b/arch/x86/virt/vmx/tdx/seamldr.c
@@ -264,6 +264,14 @@ static void ack_state(void)
 	}
 }
 
+static void print_update_failure_message(void)
+{
+	static atomic_t printed = ATOMIC_INIT(0);
+
+	if (atomic_inc_return(&printed) == 1)
+		pr_err("update failed, SEAMCALLs will report failure until TDs killed\n");
+}
+
 /*
  * See multi_cpu_stop() from where this multi-cpu state-machine was
  * adopted, and the rationale for touch_nmi_watchdog()
@@ -293,8 +301,13 @@ static int do_seamldr_install_module(void *params)
 				break;
 			}
 
-			if (ret)
+			if (ret) {
 				atomic_inc(&tdp_data.failed);
+				if (curstate > TDP_SHUTDOWN) {
+					tdx_module_set_error();
+					print_update_failure_message();
+				}
+			}
 			ack_state();
 		} else {
 			touch_nmi_watchdog();
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 7019a149ec4b..26357be18fa9 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1219,6 +1219,12 @@ int tdx_module_shutdown(void)
 	return 0;
 }
 
+void tdx_module_set_error(void)
+{
+	/* Called from stop_machine(). no need to hold tdx_module_lock */
+	tdx_module_status = TDX_MODULE_ERROR;
+}
+
 static bool is_pamt_page(unsigned long phys)
 {
 	struct tdmr_info_list *tdmr_list = &tdx_tdmr_list;
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 1c4da9540ae0..5b9a2d63808c 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -120,5 +120,6 @@ struct tdmr_info_list {
 };
 
 int tdx_module_shutdown(void);
+void tdx_module_set_error(void);
 
 #endif
diff --git a/drivers/virt/coco/tdx-host/tdx-host.c b/drivers/virt/coco/tdx-host/tdx-host.c
index 418e90797689..47c5ba115993 100644
--- a/drivers/virt/coco/tdx-host/tdx-host.c
+++ b/drivers/virt/coco/tdx-host/tdx-host.c
@@ -37,6 +37,10 @@ static ssize_t version_show(struct device *dev, struct device_attribute *attr,
 	const struct tdx_sys_info *tdx_sysinfo = tdx_get_sysinfo();
 	const struct tdx_sys_info_version *ver;
 
+	/*
+	 * Inform userspace that the TDX module isn't in a usable state,
+	 * possibly due to a failed update.
+	 */
 	if (!tdx_sysinfo)
 		return -ENXIO;
 
-- 
2.47.3
Re: [PATCH v2 16/21] x86/virt/seamldr: Handle TDX Module update failures
Posted by Xu Yilun 3 weeks, 4 days ago
On Tue, Sep 30, 2025 at 07:53:00PM -0700, Chao Gao wrote:
> Failures encountered after a successful module shutdown are unrecoverable,
> e.g., there is no way to restore the old TDX Module.

"e.g." is obscure. To me, the following sentence is explaining the
reason why the failure is not recoverable. Maybe "i.e." or "because"?
Re: [PATCH v2 16/21] x86/virt/seamldr: Handle TDX Module update failures
Posted by H. Peter Anvin 3 weeks ago
On January 14, 2026 10:24:22 PM PST, Xu Yilun <yilun.xu@linux.intel.com> wrote:
>On Tue, Sep 30, 2025 at 07:53:00PM -0700, Chao Gao wrote:
>> Failures encountered after a successful module shutdown are unrecoverable,
>> e.g., there is no way to restore the old TDX Module.
>
>"e.g." is obscure. To me, the following sentence is explaining the
>reason why the failure is not recoverable. Maybe "i.e." or "because"?

"e.g." (for example) means the following list is non-exhaustive.
Re: [PATCH v2 16/21] x86/virt/seamldr: Handle TDX Module update failures
Posted by Chao Gao 3 weeks ago
On Sun, Jan 18, 2026 at 08:55:22PM -0800, H. Peter Anvin wrote:
>On January 14, 2026 10:24:22 PM PST, Xu Yilun <yilun.xu@linux.intel.com> wrote:
>>On Tue, Sep 30, 2025 at 07:53:00PM -0700, Chao Gao wrote:
>>> Failures encountered after a successful module shutdown are unrecoverable,
>>> e.g., there is no way to restore the old TDX Module.
>>
>>"e.g." is obscure. To me, the following sentence is explaining the
>>reason why the failure is not recoverable. Maybe "i.e." or "because"?
>
>"e.g." (for example) means the following list is non-exhaustive.

Yes, 'i.e.' or 'because' would be more appropriate since the next sentence
explains 'unrecoverable'.
Re: [PATCH v2 16/21] x86/virt/seamldr: Handle TDX Module update failures
Posted by Dave Hansen 2 weeks, 6 days ago
On 1/18/26 21:34, Chao Gao wrote:
> On Sun, Jan 18, 2026 at 08:55:22PM -0800, H. Peter Anvin wrote:
>> On January 14, 2026 10:24:22 PM PST, Xu Yilun <yilun.xu@linux.intel.com> wrote:
>>> On Tue, Sep 30, 2025 at 07:53:00PM -0700, Chao Gao wrote:
>>>> Failures encountered after a successful module shutdown are unrecoverable,
>>>> e.g., there is no way to restore the old TDX Module.
>>> "e.g." is obscure. To me, the following sentence is explaining the
>>> reason why the failure is not recoverable. Maybe "i.e." or "because"?
>> "e.g." (for example) means the following list is non-exhaustive.
> Yes, 'i.e.' or 'because' would be more appropriate since the next sentence
> explains 'unrecoverable'.

FWIW, I very much prefer plain words. "i.e." and "e.g." are really only
useful if you and all your readers know how they work.

I don't feel the need to be draconian about it, but I frankly *wish*
folks would just remove them from their written vocabulary.
Re: [PATCH v2 16/21] x86/virt/seamldr: Handle TDX Module update failures
Posted by Chao Gao 3 months, 2 weeks ago
> /*
>  * See multi_cpu_stop() from where this multi-cpu state-machine was
>  * adopted, and the rationale for touch_nmi_watchdog()
>@@ -293,8 +301,13 @@ static int do_seamldr_install_module(void *params)
> 				break;
> 			}
> 
>-			if (ret)
>+			if (ret) {
> 				atomic_inc(&tdp_data.failed);
>+				if (curstate > TDP_SHUTDOWN) {
>+					tdx_module_set_error();
>+					print_update_failure_message();
>+				}
>+			}

I found a bug here.

If an error is non-fatal (e.g., occurs before shutting down the TDX module),
the TDX Module remains functional and TDs can continue to run. So, there's no
need to set the TDX module status to erorr or print the error message.

But, in this implementation, a failing CPU doesn't exit the do-while() loop in
do_seamldr_install_module(). Instead, all CPUs fast-forward to the TDP_DONE
state and execute the do-while() body once more. Then, the failing CPU reaches
the above if() statement again with a non-zero "ret" and "curstate=TDP_DONE",
causing it to incorrectly set the module to error and print the error message.

To fix this issue, apply the following diff:

diff --git a/arch/x86/virt/vmx/tdx/seamldr.c b/arch/x86/virt/vmx/tdx/seamldr.c
index a72f6b0b27e9..e525bbd16610 100644
--- a/arch/x86/virt/vmx/tdx/seamldr.c
+++ b/arch/x86/virt/vmx/tdx/seamldr.c
@@ -258,16 +258,8 @@ static void set_target_state(enum tdp_state state)
 /* Last one to ack a state moves to the next state. */
 static void ack_state(void)
 {
-	if (atomic_dec_and_test(&tdp_data.thread_ack)) {
-		/*
-		 * If an error occurred, abort the update by skipping to
-		 * the final state
-		 */
-		if (atomic_read(&tdp_data.failed))
-			set_target_state(TDP_DONE);
-		else
-			set_target_state(tdp_data.state + 1);
-	}
+	if (atomic_dec_and_test(&tdp_data.thread_ack))
+		set_target_state(tdp_data.state + 1);
 }
 
 static void print_update_failure_message(void)
@@ -331,7 +323,7 @@ static int do_seamldr_install_module(void *params)
			touch_nmi_watchdog();
			rcu_momentary_eqs();
		}
-	} while (curstate != TDP_DONE);
+	} while (curstate != TDP_DONE && !atomic_read(&tdp_data.failed));
 
	return ret;
 }
Re: [PATCH v2 16/21] x86/virt/seamldr: Handle TDX Module update failures
Posted by Xu Yilun 3 weeks, 4 days ago
>  static void print_update_failure_message(void)
> @@ -331,7 +323,7 @@ static int do_seamldr_install_module(void *params)
> 			touch_nmi_watchdog();
> 			rcu_momentary_eqs();
> 		}
> -	} while (curstate != TDP_DONE);
> +	} while (curstate != TDP_DONE && !atomic_read(&tdp_data.failed));

Ah, yes. That's idea of immediate error out I'm thinking of, your
implementation is better.

>  
> 	return ret;
>  }
>