From: "Alexey Gladkov (Intel)" <legion@kernel.org> Currently, MMIO inside the TDX guest is allowed from kernel space and access from userspace is denied. This becomes a problem when working with virtual devices in userspace. In TDX guest MMIO instructions are emulated in #VE. The kernel code uses special helpers to access MMIO memory to limit the number of instructions which are used. This patchset makes MMIO accessible from userspace. To do this additional checks were added to ensure that the emulated instruction will not be compromised. v5: - Improve commit messages and comments in commits as suggested by Kirill A. Shutemov. - To emulate MOVS, instead of storing the entire address, started using a flag. v4: - Move patches to avoid crossing the page boundary to separate patchset. They address separate issue. - Check the address only in user context and in case of nested exceptions. - Fix the check that the address does not point to private memory. v3: - Add patches to avoid crossing the page boundary when the instruction is read and decoded in the TDX, SEV, UMIP. - Forbid accessing userspace addresses from kernel space. The exception to this is when emulating MOVS instructions. - Fix address validation during MOVS emulation. v2: - Split into separate patches AMD helpers extraction and MOVS implementation code for intel as suggested by Thomas Gleixner. - Fix coding style issues. Alexey Gladkov (Intel) (6): x86/tdx: Split MMIO read and write operations x86/tdx: Add validation of userspace MMIO instructions x86/tdx: Allow MMIO from userspace x86/tdx: Add a restriction on access to MMIO address x86/tdx: Move MMIO helpers to common library x86/tdx: Implement MOVS for MMIO arch/x86/coco/sev/core.c | 139 ++----------- arch/x86/coco/tdx/tdx.c | 338 ++++++++++++++++++++++++++----- arch/x86/include/asm/io.h | 3 + arch/x86/include/asm/processor.h | 3 + arch/x86/lib/iomem.c | 115 +++++++++++ 5 files changed, 429 insertions(+), 169 deletions(-) -- 2.46.0
From: "Alexey Gladkov (Intel)" <legion@kernel.org> Currently, MMIO inside the TDX guest is allowed from kernel space and access from userspace is denied. This becomes a problem when working with virtual devices in userspace. In TDX guest MMIO instructions are emulated in #VE. The kernel code uses special helpers to access MMIO memory to limit the number of instructions which are used. This patchset makes MMIO accessible from userspace. To do this additional checks were added to ensure that the emulated instruction will not be compromised. v6: - Reorder patches and change commit messages. v5: - Improve commit messages and comments in commits as suggested by Kirill A. Shutemov. - To emulate MOVS, instead of storing the entire address, started using a flag. v4: - Move patches to avoid crossing the page boundary to separate patchset. They address separate issue. - Check the address only in user context and in case of nested exceptions. - Fix the check that the address does not point to private memory. v3: - Add patches to avoid crossing the page boundary when the instruction is read and decoded in the TDX, SEV, UMIP. - Forbid accessing userspace addresses from kernel space. The exception to this is when emulating MOVS instructions. - Fix address validation during MOVS emulation. v2: - Split into separate patches AMD helpers extraction and MOVS implementation code for intel as suggested by Thomas Gleixner. - Fix coding style issues. Alexey Gladkov (Intel) (6): x86/tdx: Fix "in-kernel MMIO" check x86/tdx: Split MMIO read and write operations x86/tdx: Add validation of userspace MMIO instructions x86/tdx: Allow MMIO from userspace x86/tdx: Move MMIO helpers to common library x86/tdx: Implement MOVS for MMIO arch/x86/coco/sev/core.c | 139 ++----------- arch/x86/coco/tdx/tdx.c | 338 ++++++++++++++++++++++++++----- arch/x86/include/asm/io.h | 3 + arch/x86/include/asm/processor.h | 3 + arch/x86/lib/iomem.c | 115 +++++++++++ 5 files changed, 429 insertions(+), 169 deletions(-) -- 2.46.0
On 9/6/24 04:49, Alexey Gladkov wrote:
> Currently, MMIO inside the TDX guest is allowed from kernel space and access
> from userspace is denied. This becomes a problem when working with virtual
> devices in userspace.

Kernel MMIO and User MMIO are very different beasts.

The kernel MMIO instructions are trusted and can be constrained to use a very
limited number of instructions that match the kernel's limited instruction
decoding capability.

Userspace is not constrained in that way.

TDX also doesn't have the option of having the VMM deal with the instruction
emulation.

Before we start down this road, I'd really want to hear from the KVM folks
that having the kernel decode arbitrary user instructions is the way we want
to go.
On Fri, Sep 06, 2024, Dave Hansen wrote:
> On 9/6/24 04:49, Alexey Gladkov wrote:
> > Currently, MMIO inside the TDX guest is allowed from kernel space and access
> > from userspace is denied. This becomes a problem when working with virtual
> > devices in userspace.
>
> Kernel MMIO and User MMIO are very different beasts.
>
> The kernel MMIO instructions are trusted and can be constrained to use a
> very limited number of instructions that match the kernel's limited
> instruction decoding capability.
>
> Userspace is not constrained in that way.
>
> TDX also doesn't have the option of having the VMM deal with the
> instruction emulation.
>
> Before we start down this road, I'd really want to hear from the KVM
> folks that having the kernel decode arbitrary user instructions is the
> way we want to go.

That's an x86 kernel decision, not a KVM decision.

KVM cares if the guest kernel has delegated certain permissions to userspace,
which is why emulated MMIO is preferred over hypercalls; the fact that
userspace can access an MMIO region communicates to KVM that the guest kernel
has granted userspace the necessary permissions (by mapping the MMIO region
into the user page tables).

But whether or not a particular user/application is trusted by the guest
kernel is firmly out of scope for KVM. KVM's responsibility is to not
undermine the guest kernel's security/trust model, but KVM doesn't define
that model.

Ditto for what behavior is supported/allowed. The kernel could choose to
disallow userspace MMIO entirely, limit what instructions are supported, etc,
in the name of security, simplicity, or whatever. Doing so would likely cause
friction with folks that want to run their workloads in an SNP/TDX VM, but
that friction is very much with the guest kernel, not with KVM.

FWIW, emulating MMIO that isn't controlled by the kernel gets to be a bit of
a slippery slope, e.g. there are KVM patches on the list to support emulating
AVX instructions[*]. But, a major use case of any hypervisor is to
lift-and-shift workloads, and so KVM users, developers, and maintainers are
quite motivated to ensure that anything that works on bare metal also works
on KVM.
On 9/6/24 14:13, Sean Christopherson wrote:
> Ditto for what behavior is supported/allowed. The kernel could choose to disallow
> userspace MMIO entirely, limit what instructions are supported, etc, in the name
> of security, simplicity, or whatever. Doing so would likely cause friction with
> folks that want to run their workloads in an SNP/TDX VM, but that friction is very
> much with the guest kernel, not with KVM.

I think by "guest kernel" you really mean "x86 maintainers". Thanks for
throwing us under the bus, Sean. ;)

I do agree with you, though. In the process of taking the VMM out of the TCB,
confidential computing has to fill the gap with _something_ and that something
is usually arch-specific code in the guest kernel.

By dragging the KVM folks in here, I was less asking what KVM does per se and
more asking for some advice from the experienced VMM folks.

> FWIW, emulating MMIO that isn't controlled by the kernel gets to be a bit of a
> slippery slope, e.g. there are KVM patches on the list to support emulating AVX
> instructions[*]. But, a major use case of any hypervisor is to lift-and-shift
> workloads, and so KVM users, developers, and maintainers are quite motivated to
> ensure that anything that works on bare metal also works on KVM.

Do you have a link for that AVX discussion? I searched a bit but came up
empty.

The slippery slope is precisely what I'm worried about. I suspect the AVX
instructions are a combination of compilers that are increasingly happy to
spit out AVX and users who just want to use whatever the compiler spits out
on "pointers" in their apps that just happen to be pointed at MMIO.

But before we start digging in to avoid the slippery slope, we really do need
to know more about the friction. Who are we causing it for and how bad is it
for them?
On Wed, Sep 11, 2024, Dave Hansen wrote:
> On 9/6/24 14:13, Sean Christopherson wrote:
> > Ditto for what behavior is supported/allowed. The kernel could choose to disallow
> > userspace MMIO entirely, limit what instructions are supported, etc, in the name
> > of security, simplicity, or whatever. Doing so would likely cause friction with
> > folks that want to run their workloads in an SNP/TDX VM, but that friction is very
> > much with the guest kernel, not with KVM.
>
> I think by "guest kernel" you really mean "x86 maintainers". Thanks for
> throwing us under the bus, Sean. ;)

Heh, I would argue that you tried to push me under the bus, but I'm slippery
fast and danced out of the way, and you got hit instead :-D

> I do agree with you, though. In the process of taking the VMM out of
> the TCB, confidential computing has to fill the gap with _something_ and
> that something is usually arch-specific code in the guest kernel.
>
> By dragging the KVM folks in here, I was less asking what KVM does per
> se and more asking for some advice from the experienced VMM folks.
>
> > FWIW, emulating MMIO that isn't controlled by the kernel gets to be a bit of a
> > slippery slope, e.g. there are KVM patches on the list to support emulating AVX
> > instructions[*]. But, a major use case of any hypervisor is to lift-and-shift
> > workloads, and so KVM users, developers, and maintainers are quite motivated to
> > ensure that anything that works on bare metal also works on KVM.
>
> Do you have a link for that AVX discussion? I searched a bit but came
> up empty.

Gah, of course I forgot to paste the link.

https://lore.kernel.org/all/20240820230431.3850991-1-kbusch@meta.com

> The slippery slope is precisely what I'm worried about. I suspect the
> AVX instructions are a combination of compilers that are increasingly
> happy to spit out AVX and users who just want to use whatever the
> compiler spits out on "pointers" in their apps that just happen to be
> pointed at MMIO.

Yep. Based on the original report[*], it sounds like the userspace program is
doing a memcpy(), so it's hard to even argue that userspace is being silly.

[*] https://lore.kernel.org/kvm/20240304145932.4e685a38.alex.williamson@redhat.com

> But before we start digging in to avoid the slippery slope, we really do
> need to know more about the friction. Who are we causing it for and how
> bad is it for them?

This type of issue will most likely show up in the form of an end customer
moving their workload into a TDX/SNP VM, and that workload crashing despite
working just fine when run in a regular VM.

One "answer" could be to tell users that they need to recompile with AVX+
explicitly disabled, but that's an answer that will make everyone unhappy.
E.g. customers won't like recompiling, CSPs don't like unhappy customers, and
CSPs and hardware vendors don't want their CoCo solutions to be hard(er) to
adopt.
On Wed, Sep 11, 2024 at 09:19:04AM -0700, Sean Christopherson wrote:
> Yep. Based on the original report[*], it sounds like the userspace program is
> doing a memcpy(), so it's hard to even argue that userspace is being silly.

The kernel does MMIO accesses using special helpers that use well-known
instructions. I believe we should educate userspace to do the same by
rejecting emulation of anything more complex than plain loads and stores.
Otherwise these asks will keep coming.

--
Kiryl Shutsemau / Kirill A. Shutemov
On 9/12/24 02:45, Kirill A. Shutemov wrote:
> On Wed, Sep 11, 2024 at 09:19:04AM -0700, Sean Christopherson wrote:
>> Yep. Based on the original report[*], it sounds like the userspace program is
>> doing a memcpy(), so it's hard to even argue that userspace is being silly.
> The kernel does MMIO accesses using special helpers that use well-known
> instructions. I believe we should educate userspace to do the same by
> rejecting emulation of anything more complex than plain loads and stores.
> Otherwise these asks will keep coming.

My assumption is that folks have VMM-specific kernel drivers and crusty old
userspace that mmap()'s an MMIO region exposed by that driver. They want to
keep their old userspace.

Once we're dictating that specific instructions be used, the old userspace
doesn't work and it needs to be changed. Once it needs to be changed, then
some _other_ new ABI might as well be considered.

Basically:

	New ABI =~ Specific Kernel-mandated Instructions
On Thu, Sep 12, 2024 at 08:49:21AM -0700, Dave Hansen wrote:
> On 9/12/24 02:45, Kirill A. Shutemov wrote:
> > On Wed, Sep 11, 2024 at 09:19:04AM -0700, Sean Christopherson wrote:
> >> Yep. Based on the original report[*], it sounds like the userspace program is
> >> doing a memcpy(), so it's hard to even argue that userspace is being silly.
> > The kernel does MMIO accesses using special helpers that use well-known
> > instructions. I believe we should educate userspace to do the same by
> > rejecting emulation of anything more complex than plain loads and stores.
> > Otherwise these asks will keep coming.
>
> My assumption is that folks have VMM-specific kernel drivers and crusty
> old userspace that mmap()'s an MMIO region exposed by that driver. They
> want to keep their old userspace.
>
> Once we're dictating that specific instructions be used, the old
> userspace doesn't work and it needs to be changed. Once it needs to be
> changed, then some _other_ new ABI might as well be considered.
>
> Basically:
>
> New ABI =~ Specific Kernel-mandated Instructions

If we are going to say "no" to userspace MMIO emulation for TDX, the same
has to be done for SEV. Or we can bring TDX to SEV level and draw the line
there.

SEV and TDX run similar workloads and functional difference in this area
is hard to justify.

--
Kiryl Shutsemau / Kirill A. Shutemov
On 9/13/24 08:53, Kirill A. Shutemov wrote:
>> Basically:
>>
>> New ABI =~ Specific Kernel-mandated Instructions
> If we are going to say "no" to userspace MMIO emulation for TDX, the same
> has to be done for SEV. Or we can bring TDX to SEV level and draw the line
> there.
>
> SEV and TDX run similar workloads and functional difference in this area
> is hard to justify.

Maybe. We definitely don't want to put any new restrictions on SEV because
folks would update their kernel and old userspace would break.

Or maybe we start enforcing things at >=SEV-SNP and TDX and just say that
security model has changed too much to allow the old userspace.
On Fri, Sep 13, 2024, Dave Hansen wrote:
> On 9/13/24 08:53, Kirill A. Shutemov wrote:
> >> Basically:
> >>
> >> New ABI =~ Specific Kernel-mandated Instructions
> > If we are going to say "no" to userspace MMIO emulation for TDX, the same
> > has to be done for SEV. Or we can bring TDX to SEV level and draw the line
> > there.
> >
> > SEV and TDX run similar workloads and functional difference in this area
> > is hard to justify.
>
> Maybe. We definitely don't want to put any new restrictions on SEV

Note, SEV-MEM, a.k.a. the original SEV, isn't in scope because instruction
decoding is still handled by the hypervisor. SEV-ES is where the guest kernel
first gets involved.

> because folks would update their kernel and old userspace would break.
>
> Or maybe we start enforcing things at >=SEV-SNP and TDX and just say
> that security model has changed too much to allow the old userspace.

Heh, that's an outright lie though. Nothing relevant has changed between
SEV-ES and SEV-SNP that makes old userspace any less secure, or makes it
harder for the kernel to support decoding instructions on SNP vs. ES.

I also don't know that this is for old userspace. AFAIK, the most common case
for userspace triggering emulated MMIO is when a device is passed to userspace
via VFIO/IOMMUFD, e.g. a la DPDK.
On 9/13/24 09:28, Sean Christopherson wrote:
>> because folks would update their kernel and old userspace would break.
>>
>> Or maybe we start enforcing things at >=SEV-SNP and TDX and just say
>> that security model has changed too much to allow the old userspace.
> Heh, that's an outright lie though. Nothing relevant has changed between SEV-ES
> and SEV-SNP that makes old userspace any less secure, or makes it harder for the
> kernel to support decoding instructions on SNP vs. ES.

The trust model does change, though.

The VMM is still in the guest TCB for SEV-ES because there are *so* many ways
to leverage NPT to compromise a VM. Yeah, the data isn't in plain view of the
VMM, but that doesn't mean the VMM is out of the TCB.

With SEV-ES, old crusty userspace is doing MMIO to a VMM in the TCB.

With SEV-SNP, old crusty userspace is talking to an untrusted VMM.

I think that's how the security model changes.

> I also don't know that this is for old userspace. AFAIK, the most common case
> for userspace triggering emulated MMIO is when a device is passed to userspace
> via VFIO/IOMMUFD, e.g. a la DPDK.

Ahh, that would make sense.

It would be nice to hear from those folks _somewhere_ about what their
restrictions are and if they'd ever be able to enforce a subset of the ISA
for MMIO or even (for example) make system calls to do MMIO.

Does it matter to them if all of a sudden the NIC or the NVMe device on the
other side of VFIO is malicious?
On Fri, Sep 13, 2024, Dave Hansen wrote:
> On 9/13/24 09:28, Sean Christopherson wrote:
> >> because folks would update their kernel and old userspace would break.
> >>
> >> Or maybe we start enforcing things at >=SEV-SNP and TDX and just say
> >> that security model has changed too much to allow the old userspace.
> > Heh, that's an outright lie though. Nothing relevant has changed between SEV-ES
> > and SEV-SNP that makes old userspace any less secure, or makes it harder for the
> > kernel to support decoding instructions on SNP vs. ES.
>
> The trust model does change, though.
>
> The VMM is still in the guest TCB for SEV-ES because there are *so* many
> ways to leverage NPT to compromise a VM. Yeah, the data isn't in plain
> view of the VMM, but that doesn't mean the VMM is out of the TCB.
>
> With SEV-ES, old crusty userspace is doing MMIO to a VMM in the TCB.
>
> With SEV-SNP, old crusty userspace is talking to an untrusted VMM.
>
> I think that's how the security model changes.

I agree to some extent, but as below, this really only holds true if we're
talking about old crusty userspace. And even then, it's weird to draw the
line at the emulated MMIO boundary, because if crusty old userspace is a
security risk, then the kernel arguably shouldn't have mapped the MMIO
address into that userspace in the first place.

> > I also don't know that this is for old userspace. AFAIK, the most common case
> > for userspace triggering emulated MMIO is when a device is passed to userspace
> > via VFIO/IOMMUFD, e.g. a la DPDK.
>
> Ahh, that would make sense.
>
> It would be nice to hear from those folks _somewhere_ about what their
> restrictions are and if they'd ever be able to enforce a subset of the
> ISA for MMIO or even (for example) make system calls to do MMIO.
>
> Does it matter to them if all of a sudden the NIC or the NVMe device on
> the other side of VFIO is malicious?
From: "Alexey Gladkov (Intel)" <legion@kernel.org> Currently, MMIO inside the TDX guest is allowed from kernel space and access from userspace is denied. This becomes a problem when working with virtual devices in userspace. In TDX guest MMIO instructions are emulated in #VE. The kernel code uses special helpers to access MMIO memory to limit the number of instructions which are used. This patchset makes MMIO accessible from userspace. To do this additional checks were added to ensure that the emulated instruction will not be compromised. v7: - Use fault_in_kernel_space() instead of using your own function as suggested by Dave Hansen. - Drop the unnecessary ifdef CONFIG_INTEL_TDX_GUEST from thread_struct. v6: - Reorder patches and change commit messages. v5: - Improve commit messages and comments in commits as suggested by Kirill A. Shutemov. - To emulate MOVS, instead of storing the entire address, started using a flag. v4: - Move patches to avoid crossing the page boundary to separate patchset. They address separate issue. - Check the address only in user context and in case of nested exceptions. - Fix the check that the address does not point to private memory. v3: - Add patches to avoid crossing the page boundary when the instruction is read and decoded in the TDX, SEV, UMIP. - Forbid accessing userspace addresses from kernel space. The exception to this is when emulating MOVS instructions. - Fix address validation during MOVS emulation. v2: - Split into separate patches AMD helpers extraction and MOVS implementation code for intel as suggested by Thomas Gleixner. - Fix coding style issues. Alexey Gladkov (Intel) (6): x86/tdx: Fix "in-kernel MMIO" check x86/tdx: Split MMIO read and write operations x86/tdx: Add validation of userspace MMIO instructions x86/tdx: Allow MMIO from userspace x86/tdx: Move MMIO helpers to common library x86/tdx: Implement MOVS for MMIO arch/x86/coco/sev/core.c | 139 ++----------- arch/x86/coco/tdx/tdx.c | 334 ++++++++++++++++++++++++++----- arch/x86/include/asm/io.h | 3 + arch/x86/include/asm/processor.h | 1 + arch/x86/lib/iomem.c | 115 +++++++++++ 5 files changed, 423 insertions(+), 169 deletions(-) -- 2.46.0
From: "Alexey Gladkov (Intel)" <legion@kernel.org>
TDX only supports kernel-initiated MMIO operations. The handle_mmio()
function checks if the #VE exception occurred in the kernel and rejects
the operation if it did not.
However, userspace can deceive the kernel into performing MMIO on its
behalf. For example, if userspace can point a syscall at an MMIO address,
the syscall will do a get_user() or put_user() on it, triggering an MMIO
#VE. The kernel will treat the #VE as in-kernel MMIO.

Ensure that the target MMIO address is within the kernel before decoding
the instruction.
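As an illustration of the scenario above (a hedged sketch: the device node and
its mmap() semantics are made up for the example), userspace only needs to
hand an MMIO pointer to a syscall:

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		int devfd = open("/dev/some_mmio", O_RDWR);	/* hypothetical device */
		int datafd = open("/etc/hostname", O_RDONLY);

		void *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
				 MAP_SHARED, devfd, 0);		/* userspace-mapped MMIO */

		/*
		 * The kernel's read() path does put_user()/copy_to_user() into
		 * 'bar', i.e. the kernel itself touches the MMIO address on
		 * userspace's behalf and takes the #VE in kernel context.
		 */
		read(datafd, bar, 8);
		return 0;
	}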
Cc: stable@vger.kernel.org
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Alexey Gladkov (Intel) <legion@kernel.org>
---
arch/x86/coco/tdx/tdx.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 078e2bac2553..d6e6407e3999 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -16,6 +16,7 @@
#include <asm/insn-eval.h>
#include <asm/pgtable.h>
#include <asm/set_memory.h>
+#include <asm/traps.h>
/* MMIO direction */
#define EPT_READ 0
@@ -434,6 +435,11 @@ static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
return -EINVAL;
}
+ if (!fault_in_kernel_space(ve->gla)) {
+ WARN_ONCE(1, "Access to userspace address is not supported");
+ return -EINVAL;
+ }
+
/*
* Reject EPT violation #VEs that split pages.
*
--
2.46.0
On 9/13/24 10:05, Alexey Gladkov wrote:
> TDX only supports kernel-initiated MMIO operations. The handle_mmio()
> function checks if the #VE exception occurred in the kernel and rejects
> the operation if it did not.
>
> However, userspace can deceive the kernel into performing MMIO on its
> behalf. For example, if userspace can point a syscall to an MMIO address,
> syscall does get_user() or put_user() on it, triggering MMIO #VE. The
> kernel will treat the #VE as in-kernel MMIO.
>
> Ensure that the target MMIO address is within the kernel before decoding
> instruction.

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
On 9/13/24 10:18, Dave Hansen wrote:
...
>> Ensure that the target MMIO address is within the kernel before decoding
>> instruction.
>
> Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Oh, and please add these to anything you cc:stable@ on:
Fixes: 31d58c4e557d ("x86/tdx: Handle in-kernel MMIO")
From: "Alexey Gladkov (Intel)" <legion@kernel.org>
To support MMIO from userspace, additional memory checks need to be added.
To avoid overly complicating the handle_mmio() function and to separate
checks from actions, split it into two separate functions that handle the
read and write operations.
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Alexey Gladkov (Intel) <legion@kernel.org>
---
arch/x86/coco/tdx/tdx.c | 136 ++++++++++++++++++++++++----------------
1 file changed, 83 insertions(+), 53 deletions(-)
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index d6e6407e3999..008840ac1191 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -406,14 +406,91 @@ static bool mmio_write(int size, unsigned long addr, unsigned long val)
EPT_WRITE, addr, val);
}
+static int handle_mmio_write(struct insn *insn, enum insn_mmio_type mmio, int size,
+ struct pt_regs *regs, struct ve_info *ve)
+{
+ unsigned long *reg, val;
+
+ switch (mmio) {
+ case INSN_MMIO_WRITE:
+ reg = insn_get_modrm_reg_ptr(insn, regs);
+ if (!reg)
+ return -EINVAL;
+ memcpy(&val, reg, size);
+ if (!mmio_write(size, ve->gpa, val))
+ return -EIO;
+ return insn->length;
+ case INSN_MMIO_WRITE_IMM:
+ val = insn->immediate.value;
+ if (!mmio_write(size, ve->gpa, val))
+ return -EIO;
+ return insn->length;
+ case INSN_MMIO_MOVS:
+ /*
+ * MMIO was accessed with an instruction that could not be
+ * decoded or handled properly. It was likely not using io.h
+ * helpers or accessed MMIO accidentally.
+ */
+ return -EINVAL;
+ default:
+ WARN_ON_ONCE(1);
+ return -EINVAL;
+ }
+
+ return insn->length;
+}
+
+static int handle_mmio_read(struct insn *insn, enum insn_mmio_type mmio, int size,
+ struct pt_regs *regs, struct ve_info *ve)
+{
+ unsigned long *reg, val;
+ int extend_size;
+ u8 extend_val;
+
+ reg = insn_get_modrm_reg_ptr(insn, regs);
+ if (!reg)
+ return -EINVAL;
+
+ if (!mmio_read(size, ve->gpa, &val))
+ return -EIO;
+
+ extend_val = 0;
+
+ switch (mmio) {
+ case INSN_MMIO_READ:
+ /* Zero-extend for 32-bit operation */
+ extend_size = size == 4 ? sizeof(*reg) : 0;
+ break;
+ case INSN_MMIO_READ_ZERO_EXTEND:
+ /* Zero extend based on operand size */
+ extend_size = insn->opnd_bytes;
+ break;
+ case INSN_MMIO_READ_SIGN_EXTEND:
+ /* Sign extend based on operand size */
+ extend_size = insn->opnd_bytes;
+ if (size == 1 && val & BIT(7))
+ extend_val = 0xFF;
+ else if (size > 1 && val & BIT(15))
+ extend_val = 0xFF;
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ return -EINVAL;
+ }
+
+ if (extend_size)
+ memset(reg, extend_val, extend_size);
+ memcpy(reg, &val, size);
+ return insn->length;
+}
+
static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
{
- unsigned long *reg, val, vaddr;
char buffer[MAX_INSN_SIZE];
enum insn_mmio_type mmio;
struct insn insn = {};
- int size, extend_size;
- u8 extend_val = 0;
+ unsigned long vaddr;
+ int size;
/* Only in-kernel MMIO is supported */
if (WARN_ON_ONCE(user_mode(regs)))
@@ -429,12 +506,6 @@ static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
if (WARN_ON_ONCE(mmio == INSN_MMIO_DECODE_FAILED))
return -EINVAL;
- if (mmio != INSN_MMIO_WRITE_IMM && mmio != INSN_MMIO_MOVS) {
- reg = insn_get_modrm_reg_ptr(&insn, regs);
- if (!reg)
- return -EINVAL;
- }
-
if (!fault_in_kernel_space(ve->gla)) {
WARN_ONCE(1, "Access to userspace address is not supported");
return -EINVAL;
@@ -453,24 +524,15 @@ static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
if (vaddr / PAGE_SIZE != (vaddr + size - 1) / PAGE_SIZE)
return -EFAULT;
- /* Handle writes first */
switch (mmio) {
case INSN_MMIO_WRITE:
- memcpy(&val, reg, size);
- if (!mmio_write(size, ve->gpa, val))
- return -EIO;
- return insn.length;
case INSN_MMIO_WRITE_IMM:
- val = insn.immediate.value;
- if (!mmio_write(size, ve->gpa, val))
- return -EIO;
- return insn.length;
+ case INSN_MMIO_MOVS:
+ return handle_mmio_write(&insn, mmio, size, regs, ve);
case INSN_MMIO_READ:
case INSN_MMIO_READ_ZERO_EXTEND:
case INSN_MMIO_READ_SIGN_EXTEND:
- /* Reads are handled below */
- break;
- case INSN_MMIO_MOVS:
+ return handle_mmio_read(&insn, mmio, size, regs, ve);
case INSN_MMIO_DECODE_FAILED:
/*
* MMIO was accessed with an instruction that could not be
@@ -482,38 +544,6 @@ static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
WARN_ONCE(1, "Unknown insn_decode_mmio() decode value?");
return -EINVAL;
}
-
- /* Handle reads */
- if (!mmio_read(size, ve->gpa, &val))
- return -EIO;
-
- switch (mmio) {
- case INSN_MMIO_READ:
- /* Zero-extend for 32-bit operation */
- extend_size = size == 4 ? sizeof(*reg) : 0;
- break;
- case INSN_MMIO_READ_ZERO_EXTEND:
- /* Zero extend based on operand size */
- extend_size = insn.opnd_bytes;
- break;
- case INSN_MMIO_READ_SIGN_EXTEND:
- /* Sign extend based on operand size */
- extend_size = insn.opnd_bytes;
- if (size == 1 && val & BIT(7))
- extend_val = 0xFF;
- else if (size > 1 && val & BIT(15))
- extend_val = 0xFF;
- break;
- default:
- /* All other cases has to be covered with the first switch() */
- WARN_ON_ONCE(1);
- return -EINVAL;
- }
-
- if (extend_size)
- memset(reg, extend_val, extend_size);
- memcpy(reg, &val, size);
- return insn.length;
}
static bool handle_in(struct pt_regs *regs, int size, int port)
--
2.46.0
From: "Alexey Gladkov (Intel)" <legion@kernel.org>
Instructions from kernel space are considered trusted. If the MMIO
instruction comes from userspace, it must be checked.

For userspace instructions, it is necessary to check that the instruction
has not changed between the #VE being raised and the instruction being
emulated.

After parsing the userspace instruction, it is necessary to ensure that:
1. the operation direction (read/write) corresponds to the #VE info;
2. the address still points to mapped memory of the current process;
3. the address does not point to private memory.
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Alexey Gladkov (Intel) <legion@kernel.org>
---
arch/x86/coco/tdx/tdx.c | 131 ++++++++++++++++++++++++++++++++++++----
1 file changed, 118 insertions(+), 13 deletions(-)
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 008840ac1191..30651a5af180 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -8,6 +8,7 @@
#include <linux/export.h>
#include <linux/io.h>
#include <linux/kexec.h>
+#include <linux/mm.h>
#include <asm/coco.h>
#include <asm/tdx.h>
#include <asm/vmx.h>
@@ -406,6 +407,87 @@ static bool mmio_write(int size, unsigned long addr, unsigned long val)
EPT_WRITE, addr, val);
}
+static inline bool is_private_gpa(u64 gpa)
+{
+ return gpa == cc_mkenc(gpa);
+}
+
+static int get_phys_addr(unsigned long addr, phys_addr_t *phys_addr, bool *writable)
+{
+ unsigned int level;
+ pgd_t *pgdp;
+ pte_t *ptep;
+
+ /*
+ * Address validation only makes sense for a user process. The lock must
+ * be obtained before validation can begin.
+ */
+ mmap_assert_locked(current->mm);
+
+ pgdp = pgd_offset(current->mm, addr);
+
+ if (!pgd_none(*pgdp)) {
+ ptep = lookup_address_in_pgd(pgdp, addr, &level);
+ if (ptep) {
+ unsigned long offset;
+
+ if (!pte_decrypted(*ptep))
+ return -EFAULT;
+
+ offset = addr & ~page_level_mask(level);
+ *phys_addr = PFN_PHYS(pte_pfn(*ptep));
+ *phys_addr |= offset;
+
+ *writable = pte_write(*ptep);
+
+ return 0;
+ }
+ }
+
+ return -EFAULT;
+}
+
+static int valid_vaddr(struct ve_info *ve, enum insn_mmio_type mmio, int size,
+ unsigned long vaddr)
+{
+ phys_addr_t phys_addr;
+ bool writable = false;
+
+ /* It's not fatal. This can happen due to swap out or page migration. */
+ if (get_phys_addr(vaddr, &phys_addr, &writable) || (ve->gpa != cc_mkdec(phys_addr)))
+ return -EAGAIN;
+
+ /*
+ * Re-check whether #VE info matches the instruction that was decoded.
+ *
+ * The ve->gpa was valid at the time ve_info was received. But this code
+ * is executed with interrupts enabled, allowing TLB shootdown and therefore
+ * munmap() to run in a parallel thread.
+ *
+ * By the time MMIO emulation is performed, ve->gpa may be already
+ * unmapped from the process, the device it belongs to removed from
+ * system and something else could be plugged in its place.
+ */
+ switch (mmio) {
+ case INSN_MMIO_WRITE:
+ case INSN_MMIO_WRITE_IMM:
+ if (!writable || !(ve->exit_qual & EPT_VIOLATION_ACC_WRITE))
+ return -EFAULT;
+ break;
+ case INSN_MMIO_READ:
+ case INSN_MMIO_READ_ZERO_EXTEND:
+ case INSN_MMIO_READ_SIGN_EXTEND:
+ if (!(ve->exit_qual & EPT_VIOLATION_ACC_READ))
+ return -EFAULT;
+ break;
+ default:
+ WARN_ONCE(1, "Unsupported mmio instruction: %d", mmio);
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
static int handle_mmio_write(struct insn *insn, enum insn_mmio_type mmio, int size,
struct pt_regs *regs, struct ve_info *ve)
{
@@ -490,7 +572,7 @@ static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
enum insn_mmio_type mmio;
struct insn insn = {};
unsigned long vaddr;
- int size;
+ int size, ret;
/* Only in-kernel MMIO is supported */
if (WARN_ON_ONCE(user_mode(regs)))
@@ -511,6 +593,17 @@ static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
return -EINVAL;
}
+ vaddr = (unsigned long)insn_get_addr_ref(&insn, regs);
+
+ if (user_mode(regs)) {
+ if (mmap_read_lock_killable(current->mm))
+ return -EINTR;
+
+ ret = valid_vaddr(ve, mmio, size, vaddr);
+ if (ret)
+ goto unlock;
+ }
+
/*
* Reject EPT violation #VEs that split pages.
*
@@ -520,30 +613,39 @@ static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
*
* load_unaligned_zeropad() will recover using exception fixups.
*/
- vaddr = (unsigned long)insn_get_addr_ref(&insn, regs);
- if (vaddr / PAGE_SIZE != (vaddr + size - 1) / PAGE_SIZE)
- return -EFAULT;
+ if (vaddr / PAGE_SIZE != (vaddr + size - 1) / PAGE_SIZE) {
+ ret = -EFAULT;
+ goto unlock;
+ }
switch (mmio) {
case INSN_MMIO_WRITE:
case INSN_MMIO_WRITE_IMM:
case INSN_MMIO_MOVS:
- return handle_mmio_write(&insn, mmio, size, regs, ve);
+ ret = handle_mmio_write(&insn, mmio, size, regs, ve);
+ break;
case INSN_MMIO_READ:
case INSN_MMIO_READ_ZERO_EXTEND:
case INSN_MMIO_READ_SIGN_EXTEND:
- return handle_mmio_read(&insn, mmio, size, regs, ve);
+ ret = handle_mmio_read(&insn, mmio, size, regs, ve);
+ break;
case INSN_MMIO_DECODE_FAILED:
/*
* MMIO was accessed with an instruction that could not be
* decoded or handled properly. It was likely not using io.h
* helpers or accessed MMIO accidentally.
*/
- return -EINVAL;
+ ret = -EINVAL;
+ break;
default:
WARN_ONCE(1, "Unknown insn_decode_mmio() decode value?");
- return -EINVAL;
+ ret = -EINVAL;
}
+unlock:
+ if (user_mode(regs))
+ mmap_read_unlock(current->mm);
+
+ return ret;
}
static bool handle_in(struct pt_regs *regs, int size, int port)
@@ -687,11 +789,6 @@ static int virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
}
}
-static inline bool is_private_gpa(u64 gpa)
-{
- return gpa == cc_mkenc(gpa);
-}
-
/*
* Handle the kernel #VE.
*
@@ -729,6 +826,14 @@ bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
insn_len = virt_exception_user(regs, ve);
else
insn_len = virt_exception_kernel(regs, ve);
+
+ /*
+ * A special case to return to userspace without increasing regs->ip
+ * to repeat the instruction once again.
+ */
+ if (insn_len == -EAGAIN)
+ return true;
+
if (insn_len < 0)
return false;
--
2.46.0
From: "Alexey Gladkov (Intel)" <legion@kernel.org>
MMIO emulation is currently allowed only for kernel space code. It is carried
out through a special API which uses only certain instructions.
This does not allow userspace to work with virtual devices.
Allow userspace to use the same instructions as kernel space to access
MMIO. The additional checks needed for this were added by the previous
patches.
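Not part of the series, just a sketch of what "the same instructions as
kernel space" means in practice for a userspace mapping (register offsets are
made up): plain MOV loads and stores via volatile pointers, which the #VE
decoder can emulate, as opposed to memcpy()/SIMD accesses, which it may not.

	#include <stdint.h>

	/* Sketch: keep userspace MMIO accesses down to single MOVs. */
	static inline uint32_t mmio_read32(volatile void *base, unsigned long off)
	{
		return *(volatile uint32_t *)((char *)base + off);
	}

	static inline void mmio_write32(volatile void *base, unsigned long off,
					uint32_t val)
	{
		*(volatile uint32_t *)((char *)base + off) = val;
	}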
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Alexey Gladkov (Intel) <legion@kernel.org>
---
arch/x86/coco/tdx/tdx.c | 45 +++++++++++++++++++++++++++++++----------
1 file changed, 34 insertions(+), 11 deletions(-)
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 30651a5af180..dffc343e64d7 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -488,6 +488,32 @@ static int valid_vaddr(struct ve_info *ve, enum insn_mmio_type mmio, int size,
return 0;
}
+static int decode_insn_struct(struct insn *insn, struct pt_regs *regs)
+{
+ char buffer[MAX_INSN_SIZE];
+
+ if (user_mode(regs)) {
+ int nr_copied = insn_fetch_from_user(regs, buffer);
+
+ if (nr_copied <= 0)
+ return -EFAULT;
+
+ if (!insn_decode_from_regs(insn, regs, buffer, nr_copied))
+ return -EINVAL;
+ } else {
+ if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
+ return -EFAULT;
+
+ if (insn_decode(insn, buffer, MAX_INSN_SIZE, INSN_MODE_64))
+ return -EINVAL;
+ }
+
+ if (!insn->immediate.got)
+ return -EINVAL;
+
+ return 0;
+}
+
static int handle_mmio_write(struct insn *insn, enum insn_mmio_type mmio, int size,
struct pt_regs *regs, struct ve_info *ve)
{
@@ -568,27 +594,20 @@ static int handle_mmio_read(struct insn *insn, enum insn_mmio_type mmio, int siz
static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
{
- char buffer[MAX_INSN_SIZE];
enum insn_mmio_type mmio;
struct insn insn = {};
unsigned long vaddr;
int size, ret;
- /* Only in-kernel MMIO is supported */
- if (WARN_ON_ONCE(user_mode(regs)))
- return -EFAULT;
-
- if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
- return -EFAULT;
-
- if (insn_decode(&insn, buffer, MAX_INSN_SIZE, INSN_MODE_64))
- return -EINVAL;
+ ret = decode_insn_struct(&insn, regs);
+ if (ret)
+ return ret;
mmio = insn_decode_mmio(&insn, &size);
if (WARN_ON_ONCE(mmio == INSN_MMIO_DECODE_FAILED))
return -EINVAL;
- if (!fault_in_kernel_space(ve->gla)) {
+ if (!user_mode(regs) && !fault_in_kernel_space(ve->gla)) {
WARN_ONCE(1, "Access to userspace address is not supported");
return -EINVAL;
}
@@ -783,6 +802,10 @@ static int virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
switch (ve->exit_reason) {
case EXIT_REASON_CPUID:
return handle_cpuid(regs, ve);
+ case EXIT_REASON_EPT_VIOLATION:
+ if (is_private_gpa(ve->gpa))
+ panic("Unexpected EPT-violation on private memory.");
+ return handle_mmio(regs, ve);
default:
pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
return -EIO;
--
2.46.0
From: "Alexey Gladkov (Intel)" <legion@kernel.org>
The AMD SEV code has helpers that are used to emulate MOVS instructions.
To be able to reuse this code in the MOVS implementation for Intel TDX,
move them to a common location.
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Alexey Gladkov (Intel) <legion@kernel.org>
---
arch/x86/coco/sev/core.c | 139 ++++++--------------------------------
arch/x86/include/asm/io.h | 3 +
arch/x86/lib/iomem.c | 115 +++++++++++++++++++++++++++++++
3 files changed, 140 insertions(+), 117 deletions(-)
diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
index 082d61d85dfc..07e9a6f15fba 100644
--- a/arch/x86/coco/sev/core.c
+++ b/arch/x86/coco/sev/core.c
@@ -369,72 +369,24 @@ static enum es_result vc_decode_insn(struct es_em_ctxt *ctxt)
static enum es_result vc_write_mem(struct es_em_ctxt *ctxt,
char *dst, char *buf, size_t size)
{
- unsigned long error_code = X86_PF_PROT | X86_PF_WRITE;
+ unsigned long error_code;
+ int ret;
/*
- * This function uses __put_user() independent of whether kernel or user
- * memory is accessed. This works fine because __put_user() does no
- * sanity checks of the pointer being accessed. All that it does is
- * to report when the access failed.
- *
- * Also, this function runs in atomic context, so __put_user() is not
- * allowed to sleep. The page-fault handler detects that it is running
- * in atomic context and will not try to take mmap_sem and handle the
- * fault, so additional pagefault_enable()/disable() calls are not
- * needed.
- *
- * The access can't be done via copy_to_user() here because
- * vc_write_mem() must not use string instructions to access unsafe
- * memory. The reason is that MOVS is emulated by the #VC handler by
- * splitting the move up into a read and a write and taking a nested #VC
- * exception on whatever of them is the MMIO access. Using string
- * instructions here would cause infinite nesting.
+ * This function runs in atomic context, so __put_iomem() is not allowed
+ * to sleep. The page-fault handler detects that it is running in atomic
+ * context and will not try to take mmap_lock and handle the fault, so
+ * additional pagefault_enable()/disable() calls are not needed.
*/
- switch (size) {
- case 1: {
- u8 d1;
- u8 __user *target = (u8 __user *)dst;
-
- memcpy(&d1, buf, 1);
- if (__put_user(d1, target))
- goto fault;
- break;
- }
- case 2: {
- u16 d2;
- u16 __user *target = (u16 __user *)dst;
-
- memcpy(&d2, buf, 2);
- if (__put_user(d2, target))
- goto fault;
- break;
- }
- case 4: {
- u32 d4;
- u32 __user *target = (u32 __user *)dst;
-
- memcpy(&d4, buf, 4);
- if (__put_user(d4, target))
- goto fault;
- break;
- }
- case 8: {
- u64 d8;
- u64 __user *target = (u64 __user *)dst;
+ ret = __put_iomem(dst, buf, size);
+ if (!ret)
+ return ES_OK;
- memcpy(&d8, buf, 8);
- if (__put_user(d8, target))
- goto fault;
- break;
- }
- default:
- WARN_ONCE(1, "%s: Invalid size: %zu\n", __func__, size);
+ if (ret == -EIO)
return ES_UNSUPPORTED;
- }
- return ES_OK;
+ error_code = X86_PF_PROT | X86_PF_WRITE;
-fault:
if (user_mode(ctxt->regs))
error_code |= X86_PF_USER;
@@ -448,71 +400,24 @@ static enum es_result vc_write_mem(struct es_em_ctxt *ctxt,
static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
char *src, char *buf, size_t size)
{
- unsigned long error_code = X86_PF_PROT;
+ unsigned long error_code;
+ int ret;
/*
- * This function uses __get_user() independent of whether kernel or user
- * memory is accessed. This works fine because __get_user() does no
- * sanity checks of the pointer being accessed. All that it does is
- * to report when the access failed.
- *
- * Also, this function runs in atomic context, so __get_user() is not
- * allowed to sleep. The page-fault handler detects that it is running
- * in atomic context and will not try to take mmap_sem and handle the
- * fault, so additional pagefault_enable()/disable() calls are not
- * needed.
- *
- * The access can't be done via copy_from_user() here because
- * vc_read_mem() must not use string instructions to access unsafe
- * memory. The reason is that MOVS is emulated by the #VC handler by
- * splitting the move up into a read and a write and taking a nested #VC
- * exception on whatever of them is the MMIO access. Using string
- * instructions here would cause infinite nesting.
+ * This function runs in atomic context, so __get_iomem() is not allowed
+ * to sleep. The page-fault handler detects that it is running in atomic
+ * context and will not try to take mmap_lock and handle the fault, so
+ * additional pagefault_enable()/disable() calls are not needed.
*/
- switch (size) {
- case 1: {
- u8 d1;
- u8 __user *s = (u8 __user *)src;
-
- if (__get_user(d1, s))
- goto fault;
- memcpy(buf, &d1, 1);
- break;
- }
- case 2: {
- u16 d2;
- u16 __user *s = (u16 __user *)src;
-
- if (__get_user(d2, s))
- goto fault;
- memcpy(buf, &d2, 2);
- break;
- }
- case 4: {
- u32 d4;
- u32 __user *s = (u32 __user *)src;
+ ret = __get_iomem(src, buf, size);
+ if (!ret)
+ return ES_OK;
- if (__get_user(d4, s))
- goto fault;
- memcpy(buf, &d4, 4);
- break;
- }
- case 8: {
- u64 d8;
- u64 __user *s = (u64 __user *)src;
- if (__get_user(d8, s))
- goto fault;
- memcpy(buf, &d8, 8);
- break;
- }
- default:
- WARN_ONCE(1, "%s: Invalid size: %zu\n", __func__, size);
+ if (ret == -EIO)
return ES_UNSUPPORTED;
- }
- return ES_OK;
+ error_code = X86_PF_PROT;
-fault:
if (user_mode(ctxt->regs))
error_code |= X86_PF_USER;
diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index 1d60427379c9..ac01d53466cb 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -402,4 +402,7 @@ static inline void iosubmit_cmds512(void __iomem *dst, const void *src,
}
}
+int __get_iomem(char *src, char *buf, size_t size);
+int __put_iomem(char *src, char *buf, size_t size);
+
#endif /* _ASM_X86_IO_H */
diff --git a/arch/x86/lib/iomem.c b/arch/x86/lib/iomem.c
index 5eecb45d05d5..3ab146edddea 100644
--- a/arch/x86/lib/iomem.c
+++ b/arch/x86/lib/iomem.c
@@ -2,6 +2,7 @@
#include <linux/module.h>
#include <linux/io.h>
#include <linux/kmsan-checks.h>
+#include <asm/uaccess.h>
#define movs(type,to,from) \
asm volatile("movs" type:"=&D" (to), "=&S" (from):"0" (to), "1" (from):"memory")
@@ -124,3 +125,117 @@ void memset_io(volatile void __iomem *a, int b, size_t c)
}
}
EXPORT_SYMBOL(memset_io);
+
+int __get_iomem(char *src, char *buf, size_t size)
+{
+ /*
+ * This function uses __get_user() independent of whether kernel or user
+ * memory is accessed. This works fine because __get_user() does no
+ * sanity checks of the pointer being accessed. All that it does is
+ * to report when the access failed.
+ *
+ * The access can't be done via copy_from_user() here because
+ * __get_iomem() must not use string instructions to access unsafe
+ * memory. The reason is that MOVS is emulated by the exception handler
+ * for SEV and TDX by splitting the move up into separate read and write
+ * operations and taking a nested exception on whichever of them is the
+ * MMIO access. Using string instructions here would cause infinite
+ * nesting.
+ */
+ switch (size) {
+ case 1: {
+ u8 d1, __user *s = (u8 __user *)src;
+
+ if (__get_user(d1, s))
+ return -EFAULT;
+ memcpy(buf, &d1, 1);
+ break;
+ }
+ case 2: {
+ u16 d2, __user *s = (u16 __user *)src;
+
+ if (__get_user(d2, s))
+ return -EFAULT;
+ memcpy(buf, &d2, 2);
+ break;
+ }
+ case 4: {
+ u32 d4, __user *s = (u32 __user *)src;
+
+ if (__get_user(d4, s))
+ return -EFAULT;
+ memcpy(buf, &d4, 4);
+ break;
+ }
+ case 8: {
+ u64 d8, __user *s = (u64 __user *)src;
+
+ if (__get_user(d8, s))
+ return -EFAULT;
+ memcpy(buf, &d8, 8);
+ break;
+ }
+ default:
+ WARN_ONCE(1, "%s: Invalid size: %zu\n", __func__, size);
+ return -EIO;
+ }
+
+ return 0;
+}
+
+int __put_iomem(char *dst, char *buf, size_t size)
+{
+ /*
+ * This function uses __put_user() independent of whether kernel or user
+ * memory is accessed. This works fine because __put_user() does no
+ * sanity checks of the pointer being accessed. All that it does is
+ * to report when the access failed.
+ *
+ * The access can't be done via copy_to_user() here because
+ * __put_iomem() must not use string instructions to access unsafe
+ * memory. The reason is that MOVS is emulated by the exception handler
+ * for SEV and TDX by splitting the move up into separate read and write
+ * operations and taking a nested exception on whichever of them is the
+ * MMIO access. Using string instructions here would cause infinite
+ * nesting.
+ */
+ switch (size) {
+ case 1: {
+ u8 d1, __user *target = (u8 __user *)dst;
+
+ memcpy(&d1, buf, 1);
+ if (__put_user(d1, target))
+ return -EFAULT;
+ break;
+ }
+ case 2: {
+ u16 d2, __user *target = (u16 __user *)dst;
+
+ memcpy(&d2, buf, 2);
+ if (__put_user(d2, target))
+ return -EFAULT;
+ break;
+ }
+ case 4: {
+ u32 d4, __user *target = (u32 __user *)dst;
+
+ memcpy(&d4, buf, 4);
+ if (__put_user(d4, target))
+ return -EFAULT;
+ break;
+ }
+ case 8: {
+ u64 d8, __user *target = (u64 __user *)dst;
+
+ memcpy(&d8, buf, 8);
+ if (__put_user(d8, target))
+ return -EFAULT;
+ break;
+ }
+ default:
+ WARN_ONCE(1, "%s: Invalid size: %zu\n", __func__, size);
+ return -EIO;
+ }
+
+ return 0;
+}
--
2.46.0
From: "Alexey Gladkov (Intel)" <legion@kernel.org>
Add emulation of the MOVS instruction on MMIO regions. MOVS emulation
consists of dividing it into a series of read and write operations,
which in turn will be validated separately.
This implementation is based on the same principle as in SEV. It splits
MOVS into separate read and write operations, which in turn can cause
nested #VEs depending on which of the arguments caused the first #VE.
The difference from the SEV implementation is the execution context. The SEV
code is executed in atomic context, while the exception handler in TDX runs
with interrupts enabled. That's why the approach to locking is different: in
TDX, mmap_lock is taken to verify and emulate the instruction.

Another difference is how the read and write instructions are executed
for MOVS emulation. While in SEV each read/write operation returns to
user space, in TDX these operations are performed from the kernel
context.

It may be possible to achieve more code reuse at this point, but it would
require confirmation from the SEV maintainers that such a change wouldn't
break anything.
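For context (assumed userspace code, not part of the series): the typical way
a MOVS ends up hitting this path is a memcpy() to an mmap()'ed MMIO region,
which compilers commonly turn into REP MOVS.

	#include <string.h>

	/*
	 * Likely emitted as REP MOVSB/MOVSQ; each element access then takes a
	 * #VE and is re-issued by the kernel as separate read and write
	 * operations, as described above.
	 */
	void push_descriptor(volatile void *mmio_ring, const void *desc, size_t len)
	{
		memcpy((void *)mmio_ring, desc, len);
	}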
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Alexey Gladkov (Intel) <legion@kernel.org>
---
arch/x86/coco/tdx/tdx.c | 82 ++++++++++++++++++++++++++++----
arch/x86/include/asm/processor.h | 1 +
2 files changed, 75 insertions(+), 8 deletions(-)
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index dffc343e64d7..151e63083a13 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -514,6 +514,60 @@ static int decode_insn_struct(struct insn *insn, struct pt_regs *regs)
return 0;
}
+static int handle_mmio_movs(struct insn *insn, struct pt_regs *regs, int size, struct ve_info *ve)
+{
+ unsigned long ds_base, es_base;
+ unsigned char *src, *dst;
+ unsigned char buffer[8];
+ int off, ret;
+ bool rep;
+
+ /*
+ * The in-kernel code must use a special API that does not use MOVS.
+ * If the MOVS instruction is received from in-kernel, then something
+ * is broken.
+ */
+ if (WARN_ON_ONCE(!user_mode(regs)))
+ return -EFAULT;
+
+ ds_base = insn_get_seg_base(regs, INAT_SEG_REG_DS);
+ es_base = insn_get_seg_base(regs, INAT_SEG_REG_ES);
+
+ if (ds_base == -1L || es_base == -1L)
+ return -EINVAL;
+
+ current->thread.in_mmio_emul = 1;
+
+ rep = insn_has_rep_prefix(insn);
+
+ do {
+ src = ds_base + (unsigned char *) regs->si;
+ dst = es_base + (unsigned char *) regs->di;
+
+ ret = __get_iomem(src, buffer, size);
+ if (ret)
+ goto out;
+
+ ret = __put_iomem(dst, buffer, size);
+ if (ret)
+ goto out;
+
+ off = (regs->flags & X86_EFLAGS_DF) ? -size : size;
+
+ regs->si += off;
+ regs->di += off;
+
+ if (rep)
+ regs->cx -= 1;
+ } while (rep && regs->cx > 0);
+
+ ret = insn->length;
+out:
+ current->thread.in_mmio_emul = 0;
+
+ return ret;
+}
+
static int handle_mmio_write(struct insn *insn, enum insn_mmio_type mmio, int size,
struct pt_regs *regs, struct ve_info *ve)
{
@@ -535,9 +589,8 @@ static int handle_mmio_write(struct insn *insn, enum insn_mmio_type mmio, int si
return insn->length;
case INSN_MMIO_MOVS:
/*
- * MMIO was accessed with an instruction that could not be
- * decoded or handled properly. It was likely not using io.h
- * helpers or accessed MMIO accidentally.
+ * MOVS is processed through higher level emulation which breaks
+ * this instruction into a sequence of reads and writes.
*/
return -EINVAL;
default:
@@ -596,6 +649,7 @@ static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
{
enum insn_mmio_type mmio;
struct insn insn = {};
+ int need_validation;
unsigned long vaddr;
int size, ret;
@@ -607,14 +661,27 @@ static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
if (WARN_ON_ONCE(mmio == INSN_MMIO_DECODE_FAILED))
return -EINVAL;
+ if (mmio == INSN_MMIO_MOVS)
+ return handle_mmio_movs(&insn, regs, size, ve);
+
+ need_validation = user_mode(regs);
+
if (!user_mode(regs) && !fault_in_kernel_space(ve->gla)) {
- WARN_ONCE(1, "Access to userspace address is not supported");
- return -EINVAL;
+ /*
+ * Access from kernel to userspace addresses is not allowed
+ * unless it is a nested exception during MOVS emulation.
+ */
+ if (!current->thread.in_mmio_emul || !current->mm) {
+ WARN_ONCE(1, "Access to userspace address is not supported");
+ return -EINVAL;
+ }
+
+ need_validation = 1;
}
vaddr = (unsigned long)insn_get_addr_ref(&insn, regs);
- if (user_mode(regs)) {
+ if (need_validation) {
if (mmap_read_lock_killable(current->mm))
return -EINTR;
@@ -640,7 +707,6 @@ static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
switch (mmio) {
case INSN_MMIO_WRITE:
case INSN_MMIO_WRITE_IMM:
- case INSN_MMIO_MOVS:
ret = handle_mmio_write(&insn, mmio, size, regs, ve);
break;
case INSN_MMIO_READ:
@@ -661,7 +727,7 @@ static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
ret = -EINVAL;
}
unlock:
- if (user_mode(regs))
+ if (need_validation)
mmap_read_unlock(current->mm);
return ret;
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index a75a07f4931f..57605b11b06c 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -486,6 +486,7 @@ struct thread_struct {
unsigned long iopl_emul;
unsigned int iopl_warn:1;
+ unsigned int in_mmio_emul:1;
/*
* Protection Keys Register for Userspace. Loaded immediately on
--
2.46.0
From: "Alexey Gladkov (Intel)" <legion@kernel.org>
TDX only supports kernel-initiated MMIO operations. The handle_mmio()
function checks if the #VE exception occurred in the kernel and rejects
the operation if it did not.
However, userspace can deceive the kernel into performing MMIO on its
behalf. For example, if userspace can point a syscall at an MMIO address,
the syscall will do a get_user() or put_user() on it, triggering an MMIO
#VE. The kernel will treat the #VE as in-kernel MMIO.

Ensure that the target MMIO address is within the kernel before decoding
the instruction.
Cc: stable@vger.kernel.org
Signed-off-by: Alexey Gladkov (Intel) <legion@kernel.org>
---
arch/x86/coco/tdx/tdx.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 078e2bac2553..c90d2fdb5fc4 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -405,6 +405,11 @@ static bool mmio_write(int size, unsigned long addr, unsigned long val)
EPT_WRITE, addr, val);
}
+static inline bool is_kernel_addr(unsigned long addr)
+{
+ return (long)addr < 0;
+}
+
static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
{
unsigned long *reg, val, vaddr;
@@ -434,6 +439,11 @@ static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
return -EINVAL;
}
+ if (!user_mode(regs) && !is_kernel_addr(ve->gla)) {
+ WARN_ONCE(1, "Access to userspace address is not supported");
+ return -EINVAL;
+ }
+
/*
* Reject EPT violation #VEs that split pages.
*
--
2.46.0
On Fri, Sep 06, 2024 at 01:49:59PM +0200, Alexey Gladkov wrote:
> From: "Alexey Gladkov (Intel)" <legion@kernel.org>
>
> TDX only supports kernel-initiated MMIO operations. The handle_mmio()
> function checks if the #VE exception occurred in the kernel and rejects
> the operation if it did not.
>
> However, userspace can deceive the kernel into performing MMIO on its
> behalf. For example, if userspace can point a syscall to an MMIO address,
> syscall does get_user() or put_user() on it, triggering MMIO #VE. The
> kernel will treat the #VE as in-kernel MMIO.
>
> Ensure that the target MMIO address is within the kernel before decoding
> instruction.
>
> Cc: stable@vger.kernel.org
> Signed-off-by: Alexey Gladkov (Intel) <legion@kernel.org>

Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

--
Kiryl Shutsemau / Kirill A. Shutemov
On 9/6/24 04:49, Alexey Gladkov wrote:
> +static inline bool is_kernel_addr(unsigned long addr)
> +{
> + return (long)addr < 0;
> +}
> +
> static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
> {
> unsigned long *reg, val, vaddr;
> @@ -434,6 +439,11 @@ static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
> return -EINVAL;
> }
>
> + if (!user_mode(regs) && !is_kernel_addr(ve->gla)) {
> + WARN_ONCE(1, "Access to userspace address is not supported");
> + return -EINVAL;
> + }
Should we really be open-coding a "is_kernel_addr" check? I mean,
TASK_SIZE_MAX is there for a reason. While I doubt we'd ever change the
positive vs. negative address space convention on 64-bit, I don't see a
good reason to write a 64-bit x86-specific is_kernel_addr() when a more
generic, portable and conventional idiom would do.
So, please use either a:
addr < TASK_SIZE_MAX
check, or use fault_in_kernel_space() directly.
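For reference, the suggested helper amounts to roughly the following
(simplified sketch; the real fault_in_kernel_space() in arch/x86/mm/fault.c
additionally carves out the vsyscall page):

	/* Sketch of the check being suggested above. */
	static bool addr_is_kernel_space(unsigned long address)
	{
		return address >= TASK_SIZE_MAX;
	}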
On Tue, Sep 10, 2024 at 12:54:19PM -0700, Dave Hansen wrote:
> On 9/6/24 04:49, Alexey Gladkov wrote:
> > +static inline bool is_kernel_addr(unsigned long addr)
> > +{
> > + return (long)addr < 0;
> > +}
> > +
> > static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
> > {
> > unsigned long *reg, val, vaddr;
> > @@ -434,6 +439,11 @@ static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
> > return -EINVAL;
> > }
> >
> > + if (!user_mode(regs) && !is_kernel_addr(ve->gla)) {
> > + WARN_ONCE(1, "Access to userspace address is not supported");
> > + return -EINVAL;
> > + }
>
> Should we really be open-coding a "is_kernel_addr" check? I mean,
> TASK_SIZE_MAX is there for a reason. While I doubt we'd ever change the
> positive vs. negative address space convention on 64-bit, I don't see a
> good reason to write a 64-bit x86-specific is_kernel_addr() when a more
> generic, portable and conventional idiom would do.
I took arch/x86/events/perf_event.h:1262 as an example. There is no
special reason for it to be its own function.
> So, please use either a:
>
> addr < TASK_SIZE_MAX
>
> check, or use fault_in_kernel_space() directly.
I'll use fault_in_kernel_space() since SEV uses it. Thanks.
--
Rgrds, legion
On Wed, Sep 11, 2024 at 02:08:47PM +0200, Alexey Gladkov wrote:
> On Tue, Sep 10, 2024 at 12:54:19PM -0700, Dave Hansen wrote:
> > On 9/6/24 04:49, Alexey Gladkov wrote:
> > > +static inline bool is_kernel_addr(unsigned long addr)
> > > +{
> > > + return (long)addr < 0;
> > > +}
> > > +
> > > static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
> > > {
> > > unsigned long *reg, val, vaddr;
> > > @@ -434,6 +439,11 @@ static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
> > > return -EINVAL;
> > > }
> > >
> > > + if (!user_mode(regs) && !is_kernel_addr(ve->gla)) {
> > > + WARN_ONCE(1, "Access to userspace address is not supported");
> > > + return -EINVAL;
> > > + }
> >
> > Should we really be open-coding a "is_kernel_addr" check? I mean,
> > TASK_SIZE_MAX is there for a reason. While I doubt we'd ever change the
> > positive vs. negative address space convention on 64-bit, I don't see a
> > good reason to write a 64-bit x86-specific is_kernel_addr() when a more
> > generic, portable and conventional idiom would do.
>
> I took arch/x86/events/perf_event.h:1262 as an example. There is no
> special reason for it to be its own function.
>
> > So, please use either a:
> >
> > addr < TASK_SIZE_MAX
> >
> > check, or use fault_in_kernel_space() directly.
>
> I'll use fault_in_kernel_space() since SEV uses it. Thanks.
Also, the user_mode() check is redundant until later in the patchset. Move
it to the patch that allows userspace MMIO.
--
Kiryl Shutsemau / Kirill A. Shutemov
From: "Alexey Gladkov (Intel)" <legion@kernel.org>
To implement MMIO for userspace, additional memory checks will need to be
added. To avoid overly complicating the handle_mmio() function and to keep
the checks separate from the actions, split it into dedicated helpers for
read and write operations.
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Alexey Gladkov (Intel) <legion@kernel.org>
---
arch/x86/coco/tdx/tdx.c | 136 ++++++++++++++++++++++++----------------
1 file changed, 83 insertions(+), 53 deletions(-)
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index c90d2fdb5fc4..eee97dff1eca 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -410,14 +410,91 @@ static inline bool is_kernel_addr(unsigned long addr)
return (long)addr < 0;
}
+static int handle_mmio_write(struct insn *insn, enum insn_mmio_type mmio, int size,
+ struct pt_regs *regs, struct ve_info *ve)
+{
+ unsigned long *reg, val;
+
+ switch (mmio) {
+ case INSN_MMIO_WRITE:
+ reg = insn_get_modrm_reg_ptr(insn, regs);
+ if (!reg)
+ return -EINVAL;
+ memcpy(&val, reg, size);
+ if (!mmio_write(size, ve->gpa, val))
+ return -EIO;
+ return insn->length;
+ case INSN_MMIO_WRITE_IMM:
+ val = insn->immediate.value;
+ if (!mmio_write(size, ve->gpa, val))
+ return -EIO;
+ return insn->length;
+ case INSN_MMIO_MOVS:
+ /*
+ * MMIO was accessed with an instruction that could not be
+ * decoded or handled properly. It was likely not using io.h
+ * helpers or accessed MMIO accidentally.
+ */
+ return -EINVAL;
+ default:
+ WARN_ON_ONCE(1);
+ return -EINVAL;
+ }
+
+ return insn->length;
+}
+
+static int handle_mmio_read(struct insn *insn, enum insn_mmio_type mmio, int size,
+ struct pt_regs *regs, struct ve_info *ve)
+{
+ unsigned long *reg, val;
+ int extend_size;
+ u8 extend_val;
+
+ reg = insn_get_modrm_reg_ptr(insn, regs);
+ if (!reg)
+ return -EINVAL;
+
+ if (!mmio_read(size, ve->gpa, &val))
+ return -EIO;
+
+ extend_val = 0;
+
+ switch (mmio) {
+ case INSN_MMIO_READ:
+ /* Zero-extend for 32-bit operation */
+ extend_size = size == 4 ? sizeof(*reg) : 0;
+ break;
+ case INSN_MMIO_READ_ZERO_EXTEND:
+ /* Zero extend based on operand size */
+ extend_size = insn->opnd_bytes;
+ break;
+ case INSN_MMIO_READ_SIGN_EXTEND:
+ /* Sign extend based on operand size */
+ extend_size = insn->opnd_bytes;
+ if (size == 1 && val & BIT(7))
+ extend_val = 0xFF;
+ else if (size > 1 && val & BIT(15))
+ extend_val = 0xFF;
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ return -EINVAL;
+ }
+
+ if (extend_size)
+ memset(reg, extend_val, extend_size);
+ memcpy(reg, &val, size);
+ return insn->length;
+}
+
static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
{
- unsigned long *reg, val, vaddr;
char buffer[MAX_INSN_SIZE];
enum insn_mmio_type mmio;
struct insn insn = {};
- int size, extend_size;
- u8 extend_val = 0;
+ unsigned long vaddr;
+ int size;
/* Only in-kernel MMIO is supported */
if (WARN_ON_ONCE(user_mode(regs)))
@@ -433,12 +510,6 @@ static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
if (WARN_ON_ONCE(mmio == INSN_MMIO_DECODE_FAILED))
return -EINVAL;
- if (mmio != INSN_MMIO_WRITE_IMM && mmio != INSN_MMIO_MOVS) {
- reg = insn_get_modrm_reg_ptr(&insn, regs);
- if (!reg)
- return -EINVAL;
- }
-
if (!user_mode(regs) && !is_kernel_addr(ve->gla)) {
WARN_ONCE(1, "Access to userspace address is not supported");
return -EINVAL;
@@ -457,24 +528,15 @@ static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
if (vaddr / PAGE_SIZE != (vaddr + size - 1) / PAGE_SIZE)
return -EFAULT;
- /* Handle writes first */
switch (mmio) {
case INSN_MMIO_WRITE:
- memcpy(&val, reg, size);
- if (!mmio_write(size, ve->gpa, val))
- return -EIO;
- return insn.length;
case INSN_MMIO_WRITE_IMM:
- val = insn.immediate.value;
- if (!mmio_write(size, ve->gpa, val))
- return -EIO;
- return insn.length;
+ case INSN_MMIO_MOVS:
+ return handle_mmio_write(&insn, mmio, size, regs, ve);
case INSN_MMIO_READ:
case INSN_MMIO_READ_ZERO_EXTEND:
case INSN_MMIO_READ_SIGN_EXTEND:
- /* Reads are handled below */
- break;
- case INSN_MMIO_MOVS:
+ return handle_mmio_read(&insn, mmio, size, regs, ve);
case INSN_MMIO_DECODE_FAILED:
/*
* MMIO was accessed with an instruction that could not be
@@ -486,38 +548,6 @@ static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
WARN_ONCE(1, "Unknown insn_decode_mmio() decode value?");
return -EINVAL;
}
-
- /* Handle reads */
- if (!mmio_read(size, ve->gpa, &val))
- return -EIO;
-
- switch (mmio) {
- case INSN_MMIO_READ:
- /* Zero-extend for 32-bit operation */
- extend_size = size == 4 ? sizeof(*reg) : 0;
- break;
- case INSN_MMIO_READ_ZERO_EXTEND:
- /* Zero extend based on operand size */
- extend_size = insn.opnd_bytes;
- break;
- case INSN_MMIO_READ_SIGN_EXTEND:
- /* Sign extend based on operand size */
- extend_size = insn.opnd_bytes;
- if (size == 1 && val & BIT(7))
- extend_val = 0xFF;
- else if (size > 1 && val & BIT(15))
- extend_val = 0xFF;
- break;
- default:
- /* All other cases has to be covered with the first switch() */
- WARN_ON_ONCE(1);
- return -EINVAL;
- }
-
- if (extend_size)
- memset(reg, extend_val, extend_size);
- memcpy(reg, &val, size);
- return insn.length;
}
static bool handle_in(struct pt_regs *regs, int size, int port)
--
2.46.0
From: "Alexey Gladkov (Intel)" <legion@kernel.org>
Instructions from kernel space are considered trusted. If the MMIO
instruction comes from userspace, it must be checked.
For userspace instructions, it is necessary to check that the instruction
has not changed between the time the #VE was delivered and the time the
instruction is emulated.
After parsing the userspace instruction, it is also necessary to ensure that:
1. the operation direction (read/write) corresponds to the #VE info;
2. the address still points to mapped memory of the current process;
3. the address does not point to private memory.
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Alexey Gladkov (Intel) <legion@kernel.org>
---
arch/x86/coco/tdx/tdx.c | 130 ++++++++++++++++++++++++++++++++++++----
1 file changed, 117 insertions(+), 13 deletions(-)
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index eee97dff1eca..636bf4013ef2 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -8,6 +8,7 @@
#include <linux/export.h>
#include <linux/io.h>
#include <linux/kexec.h>
+#include <linux/mm.h>
#include <asm/coco.h>
#include <asm/tdx.h>
#include <asm/vmx.h>
@@ -410,6 +411,87 @@ static inline bool is_kernel_addr(unsigned long addr)
return (long)addr < 0;
}
+static inline bool is_private_gpa(u64 gpa)
+{
+ return gpa == cc_mkenc(gpa);
+}
+
+static int get_phys_addr(unsigned long addr, phys_addr_t *phys_addr, bool *writable)
+{
+ unsigned int level;
+ pgd_t *pgdp;
+ pte_t *ptep;
+
+ /*
+ * Address validation only makes sense for a user process. The lock must
+ * be obtained before validation can begin.
+ */
+ mmap_assert_locked(current->mm);
+
+ pgdp = pgd_offset(current->mm, addr);
+
+ if (!pgd_none(*pgdp)) {
+ ptep = lookup_address_in_pgd(pgdp, addr, &level);
+ if (ptep) {
+ unsigned long offset;
+
+ if (!pte_decrypted(*ptep))
+ return -EFAULT;
+
+ offset = addr & ~page_level_mask(level);
+ *phys_addr = PFN_PHYS(pte_pfn(*ptep));
+ *phys_addr |= offset;
+
+ *writable = pte_write(*ptep);
+
+ return 0;
+ }
+ }
+
+ return -EFAULT;
+}
+
+static int valid_vaddr(struct ve_info *ve, enum insn_mmio_type mmio, int size,
+ unsigned long vaddr)
+{
+ phys_addr_t phys_addr;
+ bool writable = false;
+
+ /* It's not fatal. This can happen due to swap out or page migration. */
+ if (get_phys_addr(vaddr, &phys_addr, &writable) || (ve->gpa != cc_mkdec(phys_addr)))
+ return -EAGAIN;
+
+ /*
+ * Re-check whether #VE info matches the instruction that was decoded.
+ *
+ * The ve->gpa was valid at the time ve_info was received. But this code
+ * is executed with interrupts enabled, allowing TLB shootdown and therefore
+ * munmap() to run in a parallel thread.
+ *
+ * By the time MMIO emulation is performed, ve->gpa may already be
+ * unmapped from the process, the device it belongs to removed from the
+ * system, and something else plugged in its place.
+ */
+ switch (mmio) {
+ case INSN_MMIO_WRITE:
+ case INSN_MMIO_WRITE_IMM:
+ if (!writable || !(ve->exit_qual & EPT_VIOLATION_ACC_WRITE))
+ return -EFAULT;
+ break;
+ case INSN_MMIO_READ:
+ case INSN_MMIO_READ_ZERO_EXTEND:
+ case INSN_MMIO_READ_SIGN_EXTEND:
+ if (!(ve->exit_qual & EPT_VIOLATION_ACC_READ))
+ return -EFAULT;
+ break;
+ default:
+ WARN_ONCE(1, "Unsupported mmio instruction: %d", mmio);
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
static int handle_mmio_write(struct insn *insn, enum insn_mmio_type mmio, int size,
struct pt_regs *regs, struct ve_info *ve)
{
@@ -494,7 +576,7 @@ static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
enum insn_mmio_type mmio;
struct insn insn = {};
unsigned long vaddr;
- int size;
+ int size, ret;
/* Only in-kernel MMIO is supported */
if (WARN_ON_ONCE(user_mode(regs)))
@@ -513,6 +595,16 @@ static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
if (!user_mode(regs) && !is_kernel_addr(ve->gla)) {
WARN_ONCE(1, "Access to userspace address is not supported");
return -EINVAL;
+
+ vaddr = (unsigned long)insn_get_addr_ref(&insn, regs);
+
+ if (user_mode(regs)) {
+ if (mmap_read_lock_killable(current->mm))
+ return -EINTR;
+
+ ret = valid_vaddr(ve, mmio, size, vaddr);
+ if (ret)
+ goto unlock;
}
/*
@@ -524,30 +616,39 @@ static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
*
* load_unaligned_zeropad() will recover using exception fixups.
*/
- vaddr = (unsigned long)insn_get_addr_ref(&insn, regs);
- if (vaddr / PAGE_SIZE != (vaddr + size - 1) / PAGE_SIZE)
- return -EFAULT;
+ if (vaddr / PAGE_SIZE != (vaddr + size - 1) / PAGE_SIZE) {
+ ret = -EFAULT;
+ goto unlock;
+ }
switch (mmio) {
case INSN_MMIO_WRITE:
case INSN_MMIO_WRITE_IMM:
case INSN_MMIO_MOVS:
- return handle_mmio_write(&insn, mmio, size, regs, ve);
+ ret = handle_mmio_write(&insn, mmio, size, regs, ve);
+ break;
case INSN_MMIO_READ:
case INSN_MMIO_READ_ZERO_EXTEND:
case INSN_MMIO_READ_SIGN_EXTEND:
- return handle_mmio_read(&insn, mmio, size, regs, ve);
+ ret = handle_mmio_read(&insn, mmio, size, regs, ve);
+ break;
case INSN_MMIO_DECODE_FAILED:
/*
* MMIO was accessed with an instruction that could not be
* decoded or handled properly. It was likely not using io.h
* helpers or accessed MMIO accidentally.
*/
- return -EINVAL;
+ ret = -EINVAL;
+ break;
default:
WARN_ONCE(1, "Unknown insn_decode_mmio() decode value?");
- return -EINVAL;
+ ret = -EINVAL;
}
+unlock:
+ if (user_mode(regs))
+ mmap_read_unlock(current->mm);
+
+ return ret;
}
static bool handle_in(struct pt_regs *regs, int size, int port)
@@ -691,11 +792,6 @@ static int virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
}
}
-static inline bool is_private_gpa(u64 gpa)
-{
- return gpa == cc_mkenc(gpa);
-}
-
/*
* Handle the kernel #VE.
*
@@ -733,6 +829,14 @@ bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
insn_len = virt_exception_user(regs, ve);
else
insn_len = virt_exception_kernel(regs, ve);
+
+ /*
+ * A special case: return to userspace without advancing regs->ip so
+ * that the instruction is retried.
+ */
+ if (insn_len == -EAGAIN)
+ return true;
+
if (insn_len < 0)
return false;
--
2.46.0
From: "Alexey Gladkov (Intel)" <legion@kernel.org>
MMIO emulation is currently allowed only for kernel code. It is carried
out through a special API that uses only a limited set of instructions.
This does not allow userspace to work with virtual devices.
Allow userspace to use the same instructions as the kernel to access MMIO.
The additional checks this requires were added by the previous patches.
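A minimal sketch of what this enables for a userspace driver (how the MMIO region gets mapped -- vfio, uio, or sysfs -- is an assumption, not part of this patch):

	#include <stdint.h>

	/* Plain loads/stores that the #VE handler can now decode and emulate. */
	static inline uint32_t mmio_read32(const volatile void *addr)
	{
		return *(const volatile uint32_t *)addr;	/* INSN_MMIO_READ */
	}

	static inline void mmio_write32(volatile void *addr, uint32_t val)
	{
		*(volatile uint32_t *)addr = val;		/* INSN_MMIO_WRITE */
	}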
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Alexey Gladkov (Intel) <legion@kernel.org>
---
arch/x86/coco/tdx/tdx.c | 43 +++++++++++++++++++++++++++++++----------
1 file changed, 33 insertions(+), 10 deletions(-)
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 636bf4013ef2..1e391897e34f 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -492,6 +492,32 @@ static int valid_vaddr(struct ve_info *ve, enum insn_mmio_type mmio, int size,
return 0;
}
+static int decode_insn_struct(struct insn *insn, struct pt_regs *regs)
+{
+ char buffer[MAX_INSN_SIZE];
+
+ if (user_mode(regs)) {
+ int nr_copied = insn_fetch_from_user(regs, buffer);
+
+ if (nr_copied <= 0)
+ return -EFAULT;
+
+ if (!insn_decode_from_regs(insn, regs, buffer, nr_copied))
+ return -EINVAL;
+ } else {
+ if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
+ return -EFAULT;
+
+ if (insn_decode(insn, buffer, MAX_INSN_SIZE, INSN_MODE_64))
+ return -EINVAL;
+ }
+
+ if (!insn->immediate.got)
+ return -EINVAL;
+
+ return 0;
+}
+
static int handle_mmio_write(struct insn *insn, enum insn_mmio_type mmio, int size,
struct pt_regs *regs, struct ve_info *ve)
{
@@ -572,21 +598,14 @@ static int handle_mmio_read(struct insn *insn, enum insn_mmio_type mmio, int siz
static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
{
- char buffer[MAX_INSN_SIZE];
enum insn_mmio_type mmio;
struct insn insn = {};
unsigned long vaddr;
int size, ret;
- /* Only in-kernel MMIO is supported */
- if (WARN_ON_ONCE(user_mode(regs)))
- return -EFAULT;
-
- if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
- return -EFAULT;
-
- if (insn_decode(&insn, buffer, MAX_INSN_SIZE, INSN_MODE_64))
- return -EINVAL;
+ ret = decode_insn_struct(&insn, regs);
+ if (ret)
+ return ret;
mmio = insn_decode_mmio(&insn, &size);
if (WARN_ON_ONCE(mmio == INSN_MMIO_DECODE_FAILED))
@@ -786,6 +805,10 @@ static int virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
switch (ve->exit_reason) {
case EXIT_REASON_CPUID:
return handle_cpuid(regs, ve);
+ case EXIT_REASON_EPT_VIOLATION:
+ if (is_private_gpa(ve->gpa))
+ panic("Unexpected EPT-violation on private memory.");
+ return handle_mmio(regs, ve);
default:
pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
return -EIO;
--
2.46.0
From: "Alexey Gladkov (Intel)" <legion@kernel.org>
The AMD code has helpers that are used to emulate MOVS instructions. To be
able to reuse this code in the MOVS implementation for Intel, move these
helpers to a common location.
Signed-off-by: Alexey Gladkov (Intel) <legion@kernel.org>
---
arch/x86/coco/sev/core.c | 139 ++++++--------------------------------
arch/x86/include/asm/io.h | 3 +
arch/x86/lib/iomem.c | 115 +++++++++++++++++++++++++++++++
3 files changed, 140 insertions(+), 117 deletions(-)
diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
index 082d61d85dfc..07e9a6f15fba 100644
--- a/arch/x86/coco/sev/core.c
+++ b/arch/x86/coco/sev/core.c
@@ -369,72 +369,24 @@ static enum es_result vc_decode_insn(struct es_em_ctxt *ctxt)
static enum es_result vc_write_mem(struct es_em_ctxt *ctxt,
char *dst, char *buf, size_t size)
{
- unsigned long error_code = X86_PF_PROT | X86_PF_WRITE;
+ unsigned long error_code;
+ int ret;
/*
- * This function uses __put_user() independent of whether kernel or user
- * memory is accessed. This works fine because __put_user() does no
- * sanity checks of the pointer being accessed. All that it does is
- * to report when the access failed.
- *
- * Also, this function runs in atomic context, so __put_user() is not
- * allowed to sleep. The page-fault handler detects that it is running
- * in atomic context and will not try to take mmap_sem and handle the
- * fault, so additional pagefault_enable()/disable() calls are not
- * needed.
- *
- * The access can't be done via copy_to_user() here because
- * vc_write_mem() must not use string instructions to access unsafe
- * memory. The reason is that MOVS is emulated by the #VC handler by
- * splitting the move up into a read and a write and taking a nested #VC
- * exception on whatever of them is the MMIO access. Using string
- * instructions here would cause infinite nesting.
+ * This function runs in atomic context, so __put_iomem() is not allowed
+ * to sleep. The page-fault handler detects that it is running in atomic
+ * context and will not try to take mmap_lock and handle the fault, so
+ * additional pagefault_enable()/disable() calls are not needed.
*/
- switch (size) {
- case 1: {
- u8 d1;
- u8 __user *target = (u8 __user *)dst;
-
- memcpy(&d1, buf, 1);
- if (__put_user(d1, target))
- goto fault;
- break;
- }
- case 2: {
- u16 d2;
- u16 __user *target = (u16 __user *)dst;
-
- memcpy(&d2, buf, 2);
- if (__put_user(d2, target))
- goto fault;
- break;
- }
- case 4: {
- u32 d4;
- u32 __user *target = (u32 __user *)dst;
-
- memcpy(&d4, buf, 4);
- if (__put_user(d4, target))
- goto fault;
- break;
- }
- case 8: {
- u64 d8;
- u64 __user *target = (u64 __user *)dst;
+ ret = __put_iomem(dst, buf, size);
+ if (!ret)
+ return ES_OK;
- memcpy(&d8, buf, 8);
- if (__put_user(d8, target))
- goto fault;
- break;
- }
- default:
- WARN_ONCE(1, "%s: Invalid size: %zu\n", __func__, size);
+ if (ret == -EIO)
return ES_UNSUPPORTED;
- }
- return ES_OK;
+ error_code = X86_PF_PROT | X86_PF_WRITE;
-fault:
if (user_mode(ctxt->regs))
error_code |= X86_PF_USER;
@@ -448,71 +400,24 @@ static enum es_result vc_write_mem(struct es_em_ctxt *ctxt,
static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
char *src, char *buf, size_t size)
{
- unsigned long error_code = X86_PF_PROT;
+ unsigned long error_code;
+ int ret;
/*
- * This function uses __get_user() independent of whether kernel or user
- * memory is accessed. This works fine because __get_user() does no
- * sanity checks of the pointer being accessed. All that it does is
- * to report when the access failed.
- *
- * Also, this function runs in atomic context, so __get_user() is not
- * allowed to sleep. The page-fault handler detects that it is running
- * in atomic context and will not try to take mmap_sem and handle the
- * fault, so additional pagefault_enable()/disable() calls are not
- * needed.
- *
- * The access can't be done via copy_from_user() here because
- * vc_read_mem() must not use string instructions to access unsafe
- * memory. The reason is that MOVS is emulated by the #VC handler by
- * splitting the move up into a read and a write and taking a nested #VC
- * exception on whatever of them is the MMIO access. Using string
- * instructions here would cause infinite nesting.
+ * This function runs in atomic context, so __get_iomem() is not allowed
+ * to sleep. The page-fault handler detects that it is running in atomic
+ * context and will not try to take mmap_lock and handle the fault, so
+ * additional pagefault_enable()/disable() calls are not needed.
*/
- switch (size) {
- case 1: {
- u8 d1;
- u8 __user *s = (u8 __user *)src;
-
- if (__get_user(d1, s))
- goto fault;
- memcpy(buf, &d1, 1);
- break;
- }
- case 2: {
- u16 d2;
- u16 __user *s = (u16 __user *)src;
-
- if (__get_user(d2, s))
- goto fault;
- memcpy(buf, &d2, 2);
- break;
- }
- case 4: {
- u32 d4;
- u32 __user *s = (u32 __user *)src;
+ ret = __get_iomem(src, buf, size);
+ if (!ret)
+ return ES_OK;
- if (__get_user(d4, s))
- goto fault;
- memcpy(buf, &d4, 4);
- break;
- }
- case 8: {
- u64 d8;
- u64 __user *s = (u64 __user *)src;
- if (__get_user(d8, s))
- goto fault;
- memcpy(buf, &d8, 8);
- break;
- }
- default:
- WARN_ONCE(1, "%s: Invalid size: %zu\n", __func__, size);
+ if (ret == -EIO)
return ES_UNSUPPORTED;
- }
- return ES_OK;
+ error_code = X86_PF_PROT;
-fault:
if (user_mode(ctxt->regs))
error_code |= X86_PF_USER;
diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index 1d60427379c9..ac01d53466cb 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -402,4 +402,7 @@ static inline void iosubmit_cmds512(void __iomem *dst, const void *src,
}
}
+int __get_iomem(char *src, char *buf, size_t size);
+int __put_iomem(char *src, char *buf, size_t size);
+
#endif /* _ASM_X86_IO_H */
diff --git a/arch/x86/lib/iomem.c b/arch/x86/lib/iomem.c
index 5eecb45d05d5..3ab146edddea 100644
--- a/arch/x86/lib/iomem.c
+++ b/arch/x86/lib/iomem.c
@@ -2,6 +2,7 @@
#include <linux/module.h>
#include <linux/io.h>
#include <linux/kmsan-checks.h>
+#include <asm/uaccess.h>
#define movs(type,to,from) \
asm volatile("movs" type:"=&D" (to), "=&S" (from):"0" (to), "1" (from):"memory")
@@ -124,3 +125,117 @@ void memset_io(volatile void __iomem *a, int b, size_t c)
}
}
EXPORT_SYMBOL(memset_io);
+
+int __get_iomem(char *src, char *buf, size_t size)
+{
+ /*
+ * This function uses __get_user() independent of whether kernel or user
+ * memory is accessed. This works fine because __get_user() does no
+ * sanity checks of the pointer being accessed. All that it does is
+ * to report when the access failed.
+ *
+ * The access can't be done via copy_from_user() here because
+ * __get_iomem() must not use string instructions to access unsafe
+ * memory. The reason is that MOVS is emulated by the exception handler
+ * for SEV and TDX by splitting the move up into separate read and write
+ * operations and taking a nested exception on whichever of them is the
+ * MMIO access. Using string instructions here would cause infinite
+ * nesting.
+ */
+ switch (size) {
+ case 1: {
+ u8 d1, __user *s = (u8 __user *)src;
+
+ if (__get_user(d1, s))
+ return -EFAULT;
+ memcpy(buf, &d1, 1);
+ break;
+ }
+ case 2: {
+ u16 d2, __user *s = (u16 __user *)src;
+
+ if (__get_user(d2, s))
+ return -EFAULT;
+ memcpy(buf, &d2, 2);
+ break;
+ }
+ case 4: {
+ u32 d4, __user *s = (u32 __user *)src;
+
+ if (__get_user(d4, s))
+ return -EFAULT;
+ memcpy(buf, &d4, 4);
+ break;
+ }
+ case 8: {
+ u64 d8, __user *s = (u64 __user *)src;
+
+ if (__get_user(d8, s))
+ return -EFAULT;
+ memcpy(buf, &d8, 8);
+ break;
+ }
+ default:
+ WARN_ONCE(1, "%s: Invalid size: %zu\n", __func__, size);
+ return -EIO;
+ }
+
+ return 0;
+}
+
+int __put_iomem(char *dst, char *buf, size_t size)
+{
+ /*
+ * This function uses __put_user() independent of whether kernel or user
+ * memory is accessed. This works fine because __put_user() does no
+ * sanity checks of the pointer being accessed. All that it does is
+ * to report when the access failed.
+ *
+ * The access can't be done via copy_to_user() here because
+ * __put_iomem() must not use string instructions to access unsafe
+ * memory. The reason is that MOVS is emulated by the exception handler
+ * for SEV and TDX by splitting the move up into separate read and write
+ * operations and taking a nested exception on whichever of them is the
+ * MMIO access. Using string instructions here would cause infinite
+ * nesting.
+ */
+ switch (size) {
+ case 1: {
+ u8 d1, __user *target = (u8 __user *)dst;
+
+ memcpy(&d1, buf, 1);
+ if (__put_user(d1, target))
+ return -EFAULT;
+ break;
+ }
+ case 2: {
+ u16 d2, __user *target = (u16 __user *)dst;
+
+ memcpy(&d2, buf, 2);
+ if (__put_user(d2, target))
+ return -EFAULT;
+ break;
+ }
+ case 4: {
+ u32 d4, __user *target = (u32 __user *)dst;
+
+ memcpy(&d4, buf, 4);
+ if (__put_user(d4, target))
+ return -EFAULT;
+ break;
+ }
+ case 8: {
+ u64 d8, __user *target = (u64 __user *)dst;
+
+ memcpy(&d8, buf, 8);
+ if (__put_user(d8, target))
+ return -EFAULT;
+ break;
+ }
+ default:
+ WARN_ONCE(1, "%s: Invalid size: %zu\n", __func__, size);
+ return -EIO;
+ }
+
+ return 0;
+}
--
2.46.0
On Fri, Sep 06, 2024 at 01:50:03PM +0200, Alexey Gladkov wrote:
> From: "Alexey Gladkov (Intel)" <legion@kernel.org>
>
> AMD code has helpers that are used to emulate MOVS instructions. To be
> able to reuse this code in the MOVS implementation for intel, it is
> necessary to move them to a common location.
>
> Signed-off-by: Alexey Gladkov (Intel) <legion@kernel.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
--
Kiryl Shutsemau / Kirill A. Shutemov
From: "Alexey Gladkov (Intel)" <legion@kernel.org>
Add emulation of the MOVS instruction on MMIO regions. MOVS emulation
consists of splitting the instruction into a series of read and write
operations, which are then validated separately.
This implementation is based on the same principle as in SEV: MOVS is
split into separate read and write operations, which in turn can cause
nested #VEs depending on which of the arguments caused the first #VE.
The difference from the SEV implementation is the execution context. SEV
code runs in atomic context, while the #VE handler in TDX runs with
interrupts enabled. That is why the approach to locking is different: in
TDX, mmap_lock is taken to verify and emulate the instruction.
Another difference is how the read and write operations are executed
during MOVS emulation. While in SEV each read/write operation returns to
userspace, in TDX these operations are performed from kernel context.
It may be possible to achieve more code reuse here, but that would require
confirmation from the SEV side that such a change wouldn't break anything.
Signed-off-by: Alexey Gladkov (Intel) <legion@kernel.org>
---
arch/x86/coco/tdx/tdx.c | 83 +++++++++++++++++++++++++++++---
arch/x86/include/asm/processor.h | 3 ++
2 files changed, 78 insertions(+), 8 deletions(-)
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 1e391897e34f..7e760f03fa1e 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -518,6 +518,60 @@ static int decode_insn_struct(struct insn *insn, struct pt_regs *regs)
return 0;
}
+static int handle_mmio_movs(struct insn *insn, struct pt_regs *regs, int size, struct ve_info *ve)
+{
+ unsigned long ds_base, es_base;
+ unsigned char *src, *dst;
+ unsigned char buffer[8];
+ int off, ret;
+ bool rep;
+
+ /*
+ * The in-kernel code must use a special API that does not use MOVS.
+ * If a MOVS instruction arrives from kernel mode, something is broken.
+ */
+ if (WARN_ON_ONCE(!user_mode(regs)))
+ return -EFAULT;
+
+ ds_base = insn_get_seg_base(regs, INAT_SEG_REG_DS);
+ es_base = insn_get_seg_base(regs, INAT_SEG_REG_ES);
+
+ if (ds_base == -1L || es_base == -1L)
+ return -EINVAL;
+
+ current->thread.in_mmio_emul = 1;
+
+ rep = insn_has_rep_prefix(insn);
+
+ do {
+ src = ds_base + (unsigned char *) regs->si;
+ dst = es_base + (unsigned char *) regs->di;
+
+ ret = __get_iomem(src, buffer, size);
+ if (ret)
+ goto out;
+
+ ret = __put_iomem(dst, buffer, size);
+ if (ret)
+ goto out;
+
+ off = (regs->flags & X86_EFLAGS_DF) ? -size : size;
+
+ regs->si += off;
+ regs->di += off;
+
+ if (rep)
+ regs->cx -= 1;
+ } while (rep || regs->cx > 0);
+
+ ret = insn->length;
+out:
+ current->thread.in_mmio_emul = 0;
+
+ return ret;
+}
+
static int handle_mmio_write(struct insn *insn, enum insn_mmio_type mmio, int size,
struct pt_regs *regs, struct ve_info *ve)
{
@@ -539,9 +593,8 @@ static int handle_mmio_write(struct insn *insn, enum insn_mmio_type mmio, int si
return insn->length;
case INSN_MMIO_MOVS:
/*
- * MMIO was accessed with an instruction that could not be
- * decoded or handled properly. It was likely not using io.h
- * helpers or accessed MMIO accidentally.
+ * MOVS is processed through higher level emulation which breaks
+ * this instruction into a sequence of reads and writes.
*/
return -EINVAL;
default:
@@ -600,6 +653,7 @@ static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
{
enum insn_mmio_type mmio;
struct insn insn = {};
+ int need_validation;
unsigned long vaddr;
int size, ret;
@@ -611,13 +665,27 @@ static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
if (WARN_ON_ONCE(mmio == INSN_MMIO_DECODE_FAILED))
return -EINVAL;
+ if (mmio == INSN_MMIO_MOVS)
+ return handle_mmio_movs(&insn, regs, size, ve);
+
+ need_validation = user_mode(regs);
+
if (!user_mode(regs) && !is_kernel_addr(ve->gla)) {
- WARN_ONCE(1, "Access to userspace address is not supported");
- return -EINVAL;
+ /*
+ * Access from kernel to userspace addresses is not allowed
+ * unless it is a nested exception during MOVS emulation.
+ */
+ if (!current->thread.in_mmio_emul || !current->mm) {
+ WARN_ONCE(1, "Access to userspace address is not supported");
+ return -EINVAL;
+ }
+
+ need_validation = 1;
+ }
vaddr = (unsigned long)insn_get_addr_ref(&insn, regs);
- if (user_mode(regs)) {
+ if (need_validation) {
if (mmap_read_lock_killable(current->mm))
return -EINTR;
@@ -643,7 +711,6 @@ static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
switch (mmio) {
case INSN_MMIO_WRITE:
case INSN_MMIO_WRITE_IMM:
- case INSN_MMIO_MOVS:
ret = handle_mmio_write(&insn, mmio, size, regs, ve);
break;
case INSN_MMIO_READ:
@@ -664,7 +731,7 @@ static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
ret = -EINVAL;
}
unlock:
- if (user_mode(regs))
+ if (need_validation)
mmap_read_unlock(current->mm);
return ret;
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index a75a07f4931f..33875a217ed8 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -486,6 +486,9 @@ struct thread_struct {
unsigned long iopl_emul;
unsigned int iopl_warn:1;
+#ifdef CONFIG_INTEL_TDX_GUEST
+ unsigned int in_mmio_emul:1;
+#endif
/*
* Protection Keys Register for Userspace. Loaded immediately on
--
2.46.0
On Fri, Sep 06, 2024 at 01:50:04PM +0200, Alexey Gladkov wrote:
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index a75a07f4931f..33875a217ed8 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -486,6 +486,9 @@ struct thread_struct {
> unsigned long iopl_emul;
>
> unsigned int iopl_warn:1;
> +#ifdef CONFIG_INTEL_TDX_GUEST
> + unsigned int in_mmio_emul:1;
> +#endif
This ifdeffery doesn't help anybody. Just drop it.
Otherwise looks good to me:
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
--
Kiryl Shutsemau / Kirill A. Shutemov