This patchset is another spin of nommu mode addition to UML.  It would
be nice to hear about your opinions on that.

There are still several limitations/issues which we already found;
here is the list of those issues.

- memory mapped by loadable modules is not distinguished from
  userspace memory.
- CONFIG_SMP is disabled as host_fs handling doesn't work with thread
  local storage.

-- Hajime

v13:
- rebase with the latest uml/next branch, fixing a conflict ([06/13])

v12:
- rebase with the latest uml/next branch
- disable SMP and tls as those don't work with host_fs handling ([11/13])
- https://lore.kernel.org/all/cover.1762075876.git.thehajime@gmail.com/

v11:
- clean up userspace return routine and integrate to userspace() ([04/13])
- fix direction flag issue on using nolibc memcpy ([04/13])
- fix a crash issue when using usermode helper ([06/13])
- test with out-of-tree kunit-uapi patches (which uses umh)
  - https://lore.kernel.org/all/20250626-kunit-kselftests-v4-0-48760534fef5@linutronix.de/
  - https://lore.kernel.org/all/20250626195714.2123694-3-benjamin@sipsolutions.net/
- https://lore.kernel.org/all/cover.1758181109.git.thehajime@gmail.com/

v10:
- fix wrong comment on gs register handling ([09/13])
- remove unnecessary code of early syscall implementation ([04/13])
- https://lore.kernel.org/all/cover.1750594487.git.thehajime@gmail.com/

v9:
- rebase with the latest uml/next branch
- add performance numbers of new SECCOMP mode, and update results ([12/13])
- add a workaround for upstream change on MMU dependency to PCI drivers ([10/13])
- https://lore.kernel.org/all/cover.1750294482.git.thehajime@gmail.com/

v8:
- rebase with the latest uml/next branch
- clean up segv_handler to align with the latest uml ([9/12])
- https://lore.kernel.org/all/cover.1745980082.git.thehajime@gmail.com/

v7:
- properly handle FP register upon signal delivery [10/13]
- update benchmark result with new FP register handling [12/13]
- fix arch_has_single_step() for !MMU case [07/13]
- revert stack alignment as it is in uml/fixes tree [10/13]
- https://lore.kernel.org/all/cover.1737348399.git.thehajime@gmail.com/

v6:
- rebase to the latest uml/next tree
- more clean up on mmu/nommu for signal handling [10/13]
- rename functions of mcontext routines [06,10/13]
- added Acked-by tag for binfmt_elf_fdpic [02/13]
- https://lore.kernel.org/linux-um/cover.1736853925.git.thehajime@gmail.com/

v5:
- clean up stack manipulation code [05,06,07,10/13]
- https://lore.kernel.org/linux-um/cover.1733998168.git.thehajime@gmail.com/

v4:
- add arch/um/nommu, arch/x86/um/nommu to contain !MMU specific codes
- remove zpoline patch
- drop binfmt_elf_fdpic patch
- reduce ifndef CONFIG_MMU if possible
- split to elf header cleanup patch [01/13]
- fix kernel test robot warnings [06/13]
- fix coding styles [07/13]
- move task_top_of_stack definition [05/13]
- https://lore.kernel.org/linux-um/cover.1733652929.git.thehajime@gmail.com/

v3:
- https://lore.kernel.org/linux-um/cover.1733199769.git.thehajime@gmail.com/
- add seccomp-based syscall hook in addition to zpoline [06/13]
- remove RFC, add a line to MAINTAINERS file
- fix kernel test robot warnings [02/13,08/13,10/13]
- add base-commit tag to cover letter
- pull the latest uml/next
- clean up SIGSEGV handling [10/13]
- detect fsgsbase availability with elf aux vector [08/13]
- simplify vdso code with macros [09/13]

RFC v2:
- https://lore.kernel.org/linux-um/cover.1731290567.git.thehajime@gmail.com/
- base branch is now uml/linux.git instead of torvalds/linux.git.
- reorganize the patch series to clean up
- fixed various coding styles issues
- clean up exec code path [07/13]
- fixed the crash/SIGSEGV case on userspace programs [10/13]
- add seccomp filter to limit syscall caller address [06/13]
- detect fsgsbase availability with sigsetjmp/siglongjmp [08/13]
- removes unrelated changes
- removes unneeded ifndef CONFIG_MMU
- convert UML_CONFIG_MMU to CONFIG_MMU as using uml/linux.git
- proposed a patch of maple-tree issue (resolving a limitation in RFC v1)
  https://lore.kernel.org/linux-mm/20241108222834.3625217-1-thehajime@gmail.com/

RFC:
- https://lore.kernel.org/linux-um/cover.1729770373.git.thehajime@gmail.com/

Hajime Tazaki (13):
  x86/um: nommu: elf loader for fdpic
  um: decouple MMU specific code from the common part
  um: nommu: memory handling
  x86/um: nommu: syscall handling
  um: nommu: seccomp syscalls hook
  x86/um: nommu: process/thread handling
  um: nommu: configure fs register on host syscall invocation
  x86/um/vdso: nommu: vdso memory update
  x86/um: nommu: signal handling
  um: change machine name for uname output
  um: nommu: disable SMP on nommu UML
  um: nommu: add documentation of nommu UML
  um: nommu: plug nommu code into build system

 Documentation/virt/uml/nommu-uml.rst   | 180 ++++++++++++++++++++++
 MAINTAINERS                            |   1 +
 arch/um/Kconfig                        |  14 +-
 arch/um/Makefile                       |  10 ++
 arch/um/configs/x86_64_nommu_defconfig |  54 +++++++
 arch/um/include/asm/futex.h            |   4 +
 arch/um/include/asm/mmu.h              |   8 +
 arch/um/include/asm/mmu_context.h      |   2 +
 arch/um/include/asm/ptrace-generic.h   |   8 +-
 arch/um/include/asm/uaccess.h          |   7 +-
 arch/um/include/shared/kern_util.h     |   6 +
 arch/um/include/shared/os.h            |  16 ++
 arch/um/kernel/Makefile                |   5 +-
 arch/um/kernel/mem-pgtable.c           |  55 +++++++
 arch/um/kernel/mem.c                   |  38 +----
 arch/um/kernel/process.c               |  38 +++++
 arch/um/kernel/skas/process.c          |  37 -----
 arch/um/kernel/um_arch.c               |   3 +
 arch/um/nommu/Makefile                 |   3 +
 arch/um/nommu/os-Linux/Makefile        |   7 +
 arch/um/nommu/os-Linux/seccomp.c       |  87 +++++++++++
 arch/um/nommu/os-Linux/signal.c        |  24 +++
 arch/um/nommu/trap.c                   | 201 +++++++++++++++++++++++++
 arch/um/os-Linux/Makefile              |   3 +-
 arch/um/os-Linux/internal.h            |   8 +
 arch/um/os-Linux/mem.c                 |   4 +
 arch/um/os-Linux/process.c             | 139 ++++++++++++++++-
 arch/um/os-Linux/signal.c              |  11 +-
 arch/um/os-Linux/skas/process.c        | 127 ----------------
 arch/um/os-Linux/start_up.c            |  25 ++-
 arch/um/os-Linux/util.c                |   3 +-
 arch/x86/um/Kconfig                    |   2 +-
 arch/x86/um/Makefile                   |   7 +-
 arch/x86/um/asm/elf.h                  |   8 +-
 arch/x86/um/asm/syscall.h              |   6 +
 arch/x86/um/nommu/Makefile             |   8 +
 arch/x86/um/nommu/do_syscall_64.c      |  75 +++++++++
 arch/x86/um/nommu/entry_64.S           | 114 ++++++++++++++
 arch/x86/um/nommu/os-Linux/Makefile    |   6 +
 arch/x86/um/nommu/os-Linux/mcontext.c  |  26 ++++
 arch/x86/um/nommu/syscalls.h           |  18 +++
 arch/x86/um/nommu/syscalls_64.c        | 121 +++++++++++++++
 arch/x86/um/shared/sysdep/mcontext.h   |   5 +
 arch/x86/um/shared/sysdep/ptrace.h     |   2 +-
 arch/x86/um/vdso/vma.c                 |  17 ++-
 fs/Kconfig.binfmt                      |   2 +-
 46 files changed, 1322 insertions(+), 223 deletions(-)
 create mode 100644 Documentation/virt/uml/nommu-uml.rst
 create mode 100644 arch/um/configs/x86_64_nommu_defconfig
 create mode 100644 arch/um/kernel/mem-pgtable.c
 create mode 100644 arch/um/nommu/Makefile
 create mode 100644 arch/um/nommu/os-Linux/Makefile
 create mode 100644 arch/um/nommu/os-Linux/seccomp.c
 create mode 100644 arch/um/nommu/os-Linux/signal.c
 create mode 100644 arch/um/nommu/trap.c
 create mode 100644 arch/x86/um/nommu/Makefile
 create mode 100644 arch/x86/um/nommu/do_syscall_64.c
 create mode 100644 arch/x86/um/nommu/entry_64.S
 create mode 100644 arch/x86/um/nommu/os-Linux/Makefile
 create mode 100644 arch/x86/um/nommu/os-Linux/mcontext.c
 create mode 100644 arch/x86/um/nommu/syscalls.h
 create mode 100644 arch/x86/um/nommu/syscalls_64.c

base-commit: 293f71435d14f5b5c46fc3398695fa265c69363d
--
2.43.0

On Sat, Nov 08, 2025 at 05:05:35PM +0900, Hajime Tazaki wrote:
> This patchset is another spin of nommu mode addition to UML.  It would
> be nice to hear about your opinions on that.

I've not seen any explanation of the use case and/or benefits anywhere
in this cover letter or the patches.  Without that it's usually pretty
hard to get maintainers and reviewers excited.

Hello,

On Mon, 10 Nov 2025 18:14:26 +0900,
Christoph Hellwig wrote:
>
> On Sat, Nov 08, 2025 at 05:05:35PM +0900, Hajime Tazaki wrote:
> > This patchset is another spin of nommu mode addition to UML.  It would
> > be nice to hear about your opinions on that.
>
> I've not seen any explanation of the use case and/or benefits anywhere
> in this cover letter or the patches.  Without that it's usually pretty
> hard to get maintainers and reviewers excited.

thank you for the comment.

I tried to include this explanation in the document patch [12/13], which
I copied from the text below.

 What is it for ?
 ================

 - Alleviate syscall hook overhead implemented with ptrace(2)
 - To exercises nommu code over UML (and over KUnit)
 - Less dependency to host facilities

the first item is for speed up, the second item is for more testing,
the last item is for more extensibility in the future.

Early version of this patchset included this information as well as
the whole documentation, but I removed it as the versions grow.

But I can revert it to the cover letter if it helps.

-- Hajime

On Mon, 2025-11-10 at 21:18 +0900, Hajime Tazaki wrote:
>
> What is it for ?
> ================
>
> - Alleviate syscall hook overhead implemented with ptrace(2)
> - To exercises nommu code over UML (and over KUnit)
> - Less dependency to host facilities

FWIW, in some way, this order of priorities is exactly why this hasn't
been going anywhere, and every time I looked at it I got somewhat
annoyed by what seems to me like choices made to support especially the
first bullet.

I suspect that the first and third bullet are not even really true any
more, since you moved to seccomp (per our request), yet I think design
choices influenced by them persist.

People are definitely interested in the second bullet, mostly for kunit,
and I'd be willing to support them in that to some extent.

However, I'm not yet convinced that all of the complexities presented in
this patchset (such as completely separate seccomp implementation) are
actually necessary in support of _just_ the second bullet. These seem to
me like design choices necessary to support the _first_ bullet [1].

[1] and then I suppose the third, which I'm reading as "doesn't need
    seccomp or ptrace", but I'm not really quite sure what you meant

I've thought about what would happen if we stuck to creating a (single)
separate process on the host to execute userspace, and just used
CLONE_VM for it. That way, it's still no-MMU with full memory access,
but there's some implicit isolation between the kernel and userspace
processes which will likely remove complexities around FP/SSE/AVX
handling, may completely remove the need for a separate seccomp
implementation, etc.

It would, on the other hand, make it completely non-viable to achieve
the first and third bullets, so given your pursuit of those, one some
level I understand the design right now. I'm yet to be convinced,
however, that those are even worthy goals for (upstream) UML, what use
case would that enable that we really need? Especially considering that
over a longer perspective, NOMMU architectures _are_ on their way out,
and UML will certainly follow once that happens, it won't be the last
remaining NOMMU architecture.

So the only value I see in this is for testing over the net couple of
years, which really doesn't need any sort of significant optimisation or
less reliance on host facilities.

Where do you see this differently?

johannes

On Tue, 11 Nov 2025 17:01:25 +0900,
Johannes Berg wrote:
>
> On Mon, 2025-11-10 at 21:18 +0900, Hajime Tazaki wrote:
> >
> > What is it for ?
> > ================
> >
> > - Alleviate syscall hook overhead implemented with ptrace(2)
> > - To exercises nommu code over UML (and over KUnit)
> > - Less dependency to host facilities
>
> FWIW, in some way, this order of priorities is exactly why this hasn't
> been going anywhere, and every time I looked at it I got somewhat
> annoyed by what seems to me like choices made to support especially the
> first bullet.

over the past versions, I've been emphasized that the 2nd bullet (testing)
is the primary usecase as I saw several actually cases from mm folks,

https://lists.infradead.org/pipermail/maple-tree/2024-November/003775.html
https://lore.kernel.org/all/cb1cf0be-871d-4982-9a1b-5fdd54deec8d@lucifer.local/

and I think this is not limited to mm code.

other 2 bullets are additional benefits which we observed in a
comment, and our experience.

https://lore.kernel.org/all/20241122121826.GA26024@lst.de/
[2] https://static.sched.com/hosted_files/ossna2020/ec/kollerr_linux_um_nommu.pdf

but those are not the primary goal, so I'm not pushing this aspect
with usecases.

> I suspect that the first and third bullet are not even really true any
> more, since you moved to seccomp (per our request), yet I think design
> choices influenced by them persist.

this observation is not true; the first bullet is still true even
using seccomp.  please look at the benchmark result in the patch
[12/13], quoted below.

summary: most of tests show that um-nommu+seccomp is x4 to x20 faster
than um-mmu+seccomp (and ptrace).

.. csv-table:: lmbench (usec)
   :header: ,native,um,um-mmu(s),um-nommu(s)

   select-10 ,0.5319,36.1214,24.2795,2.9174
   select-100 ,1.6019,34.6049,28.8865,3.8080
   select-1000 ,12.2588,43.6838,48.7438,12.7872
   syscall ,0.1644,35.0321,53.2119,2.5981
   read ,0.3055,31.5509,45.8538,2.7068
   write ,0.2512,31.3609,29.2636,2.6948
   stat ,1.8894,43.8477,49.6121,3.1908
   open/close ,3.2973,77.5123,68.9431,6.2575
   fork+sh ,1110.3000,7359.5000,4618.6667,439.4615
   fork+execve ,510.8182,2834.0000,2461.1667,139.7848

.. csv-table:: do_getpid bench (nsec)
   :header: ,native,um,um-mmu(s),um-nommu(s)

   getpid , 161 , 34477 , 26242 , 2599

the 1st bullet saying ptrace(2) is somehow misleading now.  this might
be rephrased with "a separate process handling userspace", instead of
"ptrace".

# when I started this patchset, the seccomp patch wasn't in upstream.
  saying ptrace(2) wasn't not that much wrong.

> People are definitely interested in the second bullet, mostly for kunit,
> and I'd be willing to support them in that to some extent.

so (again) the 2nd bullet is the primary use case at this stage.

> However, I'm not yet convinced that all of the complexities presented in
> this patchset (such as completely separate seccomp implementation) are
> actually necessary in support of _just_ the second bullet. These seem to
> me like design choices necessary to support the _first_ bullet [1].

separate seccomp implementation is indeed needed due to the design
choice we made, to use a single process to host a (um) userspace.  I
think there is no reason to unify the seccomp part because the signal
handlers and filter installation do the different jobs.

I don't see why you see this as a _complexity_, as functionally both
seccomp handling don't interfere each other.  we have prepared
separate sub-directories for nommu to avoid unnecessary if/else
clauses in .c/.h files.  we haven't seen any functional regressions
since this RFC version (which was 6.12 kernel).

> [1] and then I suppose the third, which I'm reading as "doesn't need
>     seccomp or ptrace", but I'm not really quite sure what you meant
>
> I've thought about what would happen if we stuck to creating a (single)
> separate process on the host to execute userspace, and just used
> CLONE_VM for it. That way, it's still no-MMU with full memory access,
> but there's some implicit isolation between the kernel and userspace
> processes which will likely remove complexities around FP/SSE/AVX
> handling, may completely remove the need for a separate seccomp
> implementation, etc.

this would be doable I think, but we went the different way, as
using separate host processes (with ptrace/seccomp) is slow and add
complexity by the synchronization between processes, which we think
it's not easy to maintain in the future.

this was natural for us (not sure for maintainers) when we add a new
functionality, consider several options to implement, and took one of
the option which is faster, simpler, and having less cost to maintain.

the avoidance of separate processes is probably the core of our design
choice we made for nommu UML.

I'm not strongly pushing the benefits of 1st/3rd bullets, but I
thought describing the characteristics of what _this_ patchset can
should be useful.  thus in the document.

additionally, if the design choice we made introduces any breakages on
existing code, or maintenance burdens, I would understand your concern
on the complexity, but I don't think this is the case.

> It would, on the other hand, make it completely non-viable to achieve
> the first and third bullets, so given your pursuit of those, one some
> level I understand the design right now. I'm yet to be convinced,
> however, that those are even worthy goals for (upstream) UML, what use
> case would that enable that we really need?

the usecase for those are inherited from the original implementation,
[2] above, which is running UML on containers with less host
dependency and speedups.

but again, this is not the primary goal at this stage.

if you think that the document should not describe the potential
benefits/usecases which are not related to the primary goal of the
functionality, I'd agree to remove those descriptions.

> Especially considering that
> over a longer perspective, NOMMU architectures _are_ on their way out,
> and UML will certainly follow once that happens, it won't be the last
> remaining NOMMU architecture.

I'm aware of this nommu removal discussion, but also saw there are
expressions not to support this direction.  This patchset is still
useful even now.

> So the only value I see in this is for testing over the net couple of
> years, which really doesn't need any sort of significant optimisation or
> less reliance on host facilities.

I agree the former, but not the latter.

- there is a value with a real usecase,
- there are different ways to implement it but this went with the one
  with potential (additional) benefits,
- without breakages to the exising (MMU) uml code.

with that, we're proposing this patchset.

> Where do you see this differently?

thanks for the careful prompt for me.
I hope my answer clarifies your concerns.

I also wish to understand concerns of maintainers, due to the single
process design of nommu for um userspace, and the codebase is still
young so may have unexpected influence to others.

but this is exactly the reason why I also put myself to MAINTAINERS in
order to take care of this patchset even it is small (1.3k loc).

-- Hajime

On Wed, 2025-11-12 at 17:52 +0900, Hajime Tazaki wrote:
> > > What is it for ?
> > > ================
> > >
> > > - Alleviate syscall hook overhead implemented with ptrace(2)
> > > - To exercises nommu code over UML (and over KUnit)
> > > - Less dependency to host facilities
> >
> > FWIW, in some way, this order of priorities is exactly why this hasn't
> > been going anywhere, and every time I looked at it I got somewhat
> > annoyed by what seems to me like choices made to support especially the
> > first bullet.
>
> over the past versions, I've been emphasized that the 2nd bullet (testing)
> is the primary usecase as I saw several actually cases from mm folks,
>
> https://lists.infradead.org/pipermail/maple-tree/2024-November/003775.html
> https://lore.kernel.org/all/cb1cf0be-871d-4982-9a1b-5fdd54deec8d@lucifer.local/
>
> and I think this is not limited to mm code.

Not sure there's much value in testing much else in no-MMU, but sure,
I'll give you that it's useful for testing.

> other 2 bullets are additional benefits which we observed in a
> comment, and our experience.

But are they really _worthwhile_ benefits? A lot of this design adds
additional complexity, and it doesn't really seem necessary for the
testing use case. Making it faster is nice, but it's not like the
speedup really is 20x for arbitrary tests, that's just for corner cases
like "sit in a loop of gettimeofday()". And for kunit there's no syscall
boundary at all, so there's no speedup.

> > I suspect that the first and third bullet are not even really true any
> > more, since you moved to seccomp (per our request), yet I think design
> > choices influenced by them persist.
>
> this observation is not true; the first bullet is still true even
> using seccomp.  please look at the benchmark result in the patch
> [12/13], quoted below.

> [snip]

So thanks for the correction. If that's the case, however, it means the
speedup can't be due to the syscall boundary itself (seccomp) but must
rather be due to some pagefault/mapping handling issue? Which would be
inherent in no-MMU, even taking an approach of using two host processes
rather than embedding everything into one.

> > However, I'm not yet convinced that all of the complexities presented in
> > this patchset (such as completely separate seccomp implementation) are
> > actually necessary in support of _just_ the second bullet. These seem to
> > me like design choices necessary to support the _first_ bullet [1].
>
> separate seccomp implementation is indeed needed due to the design
> choice we made, to use a single process to host a (um) userspace.

That sounds misleading or even wrong to me, I'd say it's due to putting
the (um) userspace in the same host process as the kernel space?

> I don't see why you see this as a _complexity_, as functionally both
> seccomp handling don't interfere each other.

The complexity isn't so much in the separate code, which is a small
factor, but in the "put everything into the same process" aspect of it.
That has consequences around the host context state handling, things we
didn't really need to consider before suddenly become crucially
important. In the current (with-MMU) design, we only need to worry about
being able to correctly switch between userspace tasks/threads within a
userspace mm (host) process. With the no-MMU design you propose, we also
need to be able to correctly switch between kernel and userspace tasks
within the same single (host) process.

I think this is a pretty significant difference, and saying "there's no
complexity here" is simply pretending it isn't a relevant difference. I
believe you're not even handling this correctly right now in this patch
set, specifically wrt. the GS register which has been pointed out
before, but I wouldn't say that I even have a complete picture in my
head over what state handling would be necessary and sufficient.

So yeah, I think this warrants taking another look as to whether or not
the approach of putting everything into the same host process is even
worth it. I tend to believe that it isn't, given the use cases. And if
you say the speedup still is with seccomp, that kills the speed argument
too.

> > I've thought about what would happen if we stuck to creating a (single)
> > separate process on the host to execute userspace, and just used
> > CLONE_VM for it. That way, it's still no-MMU with full memory access,
> > but there's some implicit isolation between the kernel and userspace
> > processes which will likely remove complexities around FP/SSE/AVX
> > handling, may completely remove the need for a separate seccomp
> > implementation, etc.
>
> this would be doable I think, but we went the different way, as
> using separate host processes (with ptrace/seccomp) is slow and add
> complexity by the synchronization between processes, which we think
> it's not easy to maintain in the future.

Which one is it then, slow or not? Not sure I follow. You just said you
do have seccomp when comparing speeds, so that in itself doesn't make it
slow. What synchronization? It'd (have to) be CLONE_VM, but that
actually _simplifies_ state transfer/synchronization, and we already
have (to have) state transfer between different userspace threads in the
same host process for the with-MMU case.

johannes

On Tue, 25 Nov 2025 18:58:53 +0900,
Johannes Berg wrote:
>
> On Wed, 2025-11-12 at 17:52 +0900, Hajime Tazaki wrote:
> > > > What is it for ?
> > > > ================
> > > >
> > > > - Alleviate syscall hook overhead implemented with ptrace(2)
> > > > - To exercises nommu code over UML (and over KUnit)
> > > > - Less dependency to host facilities
> > >
> > > FWIW, in some way, this order of priorities is exactly why this hasn't
> > > been going anywhere, and every time I looked at it I got somewhat
> > > annoyed by what seems to me like choices made to support especially the
> > > first bullet.
> >
> > over the past versions, I've been emphasized that the 2nd bullet (testing)
> > is the primary usecase as I saw several actually cases from mm folks,
> >
> > https://lists.infradead.org/pipermail/maple-tree/2024-November/003775.html
> > https://lore.kernel.org/all/cb1cf0be-871d-4982-9a1b-5fdd54deec8d@lucifer.local/
> >
> > and I think this is not limited to mm code.
>
> Not sure there's much value in testing much else in no-MMU, but sure,
> I'll give you that it's useful for testing.

in the tree,

 % global -xr CONFIG_MMU | grep ifndef | grep -v -E "arch/|mm/" | wc -l
 45

this is only a rough picture, but it shows that there are places to be
tested other than the mm codebase.

> > other 2 bullets are additional benefits which we observed in a
> > comment, and our experience.
>
> But are they really _worthwhile_ benefits? A lot of this design adds
> additional complexity, and it doesn't really seem necessary for the
> testing use case. Making it faster is nice, but it's not like the
> speedup really is 20x for arbitrary tests, that's just for corner cases
> like "sit in a loop of gettimeofday()". And for kunit there's no syscall
> boundary at all, so there's no speedup.

I agree, and as I said, the reason for taking the single-host-process
approach is the speed and simplicity gained by removing the interaction
between host processes.

I have never claimed that tests should execute fast, and I agree that
kunit doesn't benefit from the speedup as there is no syscall (unless
the kunit-uapi patch gets merged).

> > > I suspect that the first and third bullet are not even really true any
> > > more, since you moved to seccomp (per our request), yet I think design
> > > choices influenced by them persist.
> >
> > this observation is not true; the first bullet is still true even
> > using seccomp. please look at the benchmark result in the patch
> > [12/13], quoted below.
>
> > [snip]
>
> So thanks for the correction. If that's the case, however, it means the
> speedup can't be due to the syscall boundary itself (seccomp) but must
> rather be due to some pagefault/mapping handling issue? Which would be
> inherent in no-MMU, even taking an approach of using two host processes
> rather than embedding everything into one.

I'll explain this later in this email.

# nommu doesn't have page faults as there are only physical addresses.

> > > However, I'm not yet convinced that all of the complexities presented in
> > > this patchset (such as completely separate seccomp implementation) are
> > > actually necessary in support of _just_ the second bullet. These seem to
> > > me like design choices necessary to support the _first_ bullet [1].
> >
> > separate seccomp implementation is indeed needed due to the design
> > choice we made, to use a single process to host a (um) userspace.
>
> That sounds misleading or even wrong to me, I'd say it's due to putting
> the (um) userspace in the same host process as the kernel space?
not sure if this is different from my explanation...
> > I don't see why you see this as a _complexity_, as functionally both
> > seccomp handling don't interfere each other.
>
> The complexity isn't so much in the separate code, which is a small
> factor, but in the "put everything into the same process" aspect of it.
> That has consequences around the host context state handling, things we
> didn't really need to consider before suddenly become crucially
> important. In the current (with-MMU) design, we only need to worry about
> being able to correctly switch between userspace tasks/threads within a
> userspace mm (host) process. With the no-MMU design you propose, we also
> need to be able to correctly switch between kernel and userspace tasks
> within the same single (host) process.
>
> I think this is a pretty significant difference, and saying "there's no
> complexity here" is simply pretending it isn't a relevant difference. I
> believe you're not even handling this correctly right now in this patch
> set, specifically wrt. the GS register which has been pointed out
> before, but I wouldn't say that I even have a complete picture in my
> head over what state handling would be necessary and sufficient.
>
> So yeah, I think this warrants taking another look as to whether or not
> the approach of putting everything into the same host process is even
> worth it. I tend to believe that it isn't, given the use cases. And if
> you say the speedup still is with seccomp, that kills the speed argument
> too.

I understand your concern about complexity; thanks for the details.

the host context state handling is indeed a new thing.  we've only
verified a limited set of code paths, with basic operation of um +
drivers and some userspace programs.  this is likely not perfect at
this moment, but it can be improved.

> > > I've thought about what would happen if we stuck to creating a (single)
> > > separate process on the host to execute userspace, and just used
> > > CLONE_VM for it. That way, it's still no-MMU with full memory access,
> > > but there's some implicit isolation between the kernel and userspace
> > > processes which will likely remove complexities around FP/SSE/AVX
> > > handling, may completely remove the need for a separate seccomp
> > > implementation, etc.
> >
> > this would be doable I think, but we went the different way, as
> > using separate host processes (with ptrace/seccomp) is slow and add
> > complexity by the synchronization between processes, which we think
> > it's not easy to maintain in the future.
>
> Which one is it then, slow or not? Not sure I follow. You just said you
> do have seccomp when comparing speeds, so that in itself doesn't make it
> slow. What synchronization? It'd (have to) be CLONE_VM, but that
> actually _simplifies_ state transfer/synchronization, and we already
> have (to have) state transfer between different userspace threads in the
> same host process for the with-MMU case.
Since I included speed characteristics in the document, I should
explain more about their impact, compared to the existing
design/implementation of UML.

Many documents and articles say UML is slow (the in-tree UML document
also mentions it briefly), but I could not find a detailed analysis,
so I looked closely at how nommu (w/ seccomp) and MMU (w/ seccomp)
behave.

Suppose we have a userspace program running under UML (on seccomp-mmu
or seccomp-nommu):
    struct timespec ts1, ts2;

    clock_gettime(CLOCK_REALTIME, &ts1);  // 1)
    getpid();                             // 2)
    clock_gettime(CLOCK_REALTIME, &ts2);  // 3)
# this is a chunk from the benchmark program used in the document.
I then collected several events (sched_switch, signal_generate, and
sys_enter_futex) via ftrace. The code above raises 3 SIGSYS (sig=31)
signals; below is the corresponding output of `trace-cmd report`.

- ftrace seccomp-mmu, 2)-3) = 11 usec
uml-userspace-3092637 [002] 1749286.670199: signal_generate: sig=31 errno=0 code=1 comm=uml-userspace pid=3092637 grp=0 res=0 => 1)
uml-userspace-3092637 [002] 1749286.670200: sys_enter_futex: op=FUTEX_WAKE uaddr=0x7fffffffdf8c val=1
uml-userspace-3092637 [002] 1749286.670201: sys_enter_futex: op=FUTEX_WAIT uaddr=0x7fffffffdf8c val=0x00000001 utime=0x00000000
uml-userspace-3092637 [002] 1749286.670202: sched_switch: uml-userspace:3092637 [120] S ==> swapper/2:0 [120]
<idle>-0 [028] 1749286.670203: sched_switch: swapper/28:0 [120] R ==> vmlinux:3092631 [120]
vmlinux-3092631 [028] 1749286.670205: sys_enter_futex: op=FUTEX_WAKE uaddr=0x60b64f8c val=1
vmlinux-3092631 [028] 1749286.670206: sys_enter_futex: op=FUTEX_WAIT uaddr=0x60b64f8c val=0x00000000 utime=0x00000000
vmlinux-3092631 [028] 1749286.670207: sched_switch: vmlinux:3092631 [120] S ==> swapper/28:0 [120]
<idle>-0 [002] 1749286.670209: sched_switch: swapper/2:0 [120] R ==> uml-userspace:3092637 [120]
uml-userspace-3092637 [002] 1749286.670211: signal_generate: sig=31 errno=0 code=1 comm=uml-userspace pid=3092637 grp=0 res=0 => 2)
uml-userspace-3092637 [002] 1749286.670212: sys_enter_futex: op=FUTEX_WAKE uaddr=0x7fffffffdf8c val=1
uml-userspace-3092637 [002] 1749286.670213: sys_enter_futex: op=FUTEX_WAIT uaddr=0x7fffffffdf8c val=0x00000001 utime=0x00000000
uml-userspace-3092637 [002] 1749286.670214: sched_switch: uml-userspace:3092637 [120] S ==> swapper/2:0 [120]
<idle>-0 [028] 1749286.670215: sched_switch: swapper/28:0 [120] R ==> vmlinux:3092631 [120]
vmlinux-3092631 [028] 1749286.670216: sys_enter_futex: op=FUTEX_WAKE uaddr=0x60b64f8c val=1
vmlinux-3092631 [028] 1749286.670217: sys_enter_futex: op=FUTEX_WAIT uaddr=0x60b64f8c val=0x00000000 utime=0x00000000
vmlinux-3092631 [028] 1749286.670218: sched_switch: vmlinux:3092631 [120] S ==> swapper/28:0 [120]
<idle>-0 [002] 1749286.670220: sched_switch: swapper/2:0 [120] R ==> uml-userspace:3092637 [120]
uml-userspace-3092637 [002] 1749286.670222: signal_generate: sig=31 errno=0 code=1 comm=uml-userspace pid=3092637 grp=0 res=0 => 3)
- ftrace seccomp-nommu, 2)-3) = 3 usec
vmlinux-3092542 [006] 1749158.829292: signal_generate: sig=31 errno=0 code=1 comm=vmlinux pid=3092542 grp=0 res=0 => 1)
vmlinux-3092542 [006] 1749158.829294: signal_generate: sig=31 errno=0 code=1 comm=vmlinux pid=3092542 grp=0 res=0 => 2)
vmlinux-3092542 [006] 1749158.829297: signal_generate: sig=31 errno=0 code=1 comm=vmlinux pid=3092542 grp=0 res=0 => 3)
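
The 11 usec and 3 usec figures are simply the difference between the
timestamps of the lines marked 2) and 3). A minimal sketch of that
arithmetic, assuming timestamps of the form SEC.USEC with exactly 6
fractional digits as in the output above; trace_ts_to_usec() and
trace_delta_usec() are hypothetical helper names, not part of
trace-cmd.

```c
#include <stdio.h>

/* Convert "SEC.USEC" (exactly 6 fractional digits assumed) to usec.
 * Returns -1 if the string does not have that shape. */
long long trace_ts_to_usec(const char *ts)
{
	long long sec, usec;
	int frac_start = 0, frac_end = 0;

	if (sscanf(ts, "%lld.%n%6lld%n",
		   &sec, &frac_start, &usec, &frac_end) != 2)
		return -1;
	if (frac_end - frac_start != 6)
		return -1;

	return sec * 1000000LL + usec;
}

/* Elapsed usec between two trace lines (callers pass from <= to);
 * returns -1 if either timestamp cannot be parsed. */
long long trace_delta_usec(const char *from, const char *to)
{
	long long a = trace_ts_to_usec(from);
	long long b = trace_ts_to_usec(to);

	if (a < 0 || b < 0)
		return -1;

	return b - a;
}
```

With the timestamps above, trace_delta_usec("1749286.670211",
"1749286.670222") gives 11 for seccomp-mmu, and
trace_delta_usec("1749158.829294", "1749158.829297") gives 3 for
seccomp-nommu.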
With seccomp-mmu, the host process for userspace (uml-userspace) is
notified with SIGSYS (sig=31) upon a syscall from userspace. The host
then switches tasks to vmlinux (the um kernel) using the futex
wake/wait synchronization (this is the synchronization I meant in my
previous email), and switches back to uml-userspace to continue the
userspace process. So there are at least 4 host sched_switch-es per
single um syscall.

With the current nommu, which uses a single host process, the
notification via SIGSYS is the same as in seccomp-mmu, but after that
there is no context switch for a syscall issued by userspace:
execution stays in the same context until the next syscall.

A nommu implementation with CLONE_VM (btw, the host process
uml-userspace is already created with the CLONE_VM flag, IIUC) might
face a similar situation to seccomp-mmu, with the same switches
between processes.

This accounts for the difference between the getpid benchmark results,
where um-mmu (seccomp) vs. um-nommu (seccomp) is roughly x10 (26.242
and 2.599 usec) (this was given as a benchmark example in the
patchset). I didn't look at the ptrace mode of MMU, but I expect a
similar (or longer) duration for a single syscall.

In addition to the ftrace measurement above, I ran more practical
benchmarks with iperf3 (forward/reverse path) and netperf
(TCP_STREAM/MAERTS), which I believe are not corner cases; the
results are below.

All runs use the vector driver with GRO on, via host tap devices.
The iperf3/netperf server runs on the host and the client runs inside
UML.

# I can give a complete script to reproduce this if needed.

- iperf3 (Mbps)

                   um-mmu(seccomp)   um-nommu(seccomp)
  ------------------------------------------------------
  iperf3(f)               7984              13152
  iperf3(r)               8009              14363

- netperf (Mbps, bufsize=65507bytes)

                   um-mmu(seccomp)   um-nommu(seccomp)
  ------------------------------------------------------
  netperf(STREAM)      5912.93           10792.02
  netperf(MAERTS)     29263.53           33970.06

The difference is not as significant as what we saw in the simple
syscall benchmark with getpid(2), but there is still a visible impact.
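
To put a number on "visible impact", the ratios can be computed
directly from the figures in the two tables. The helper below is just
an illustration (speedup() and print_speedups() are names I made up);
the inputs are the Mbps values quoted above.

```c
#include <stdio.h>

/* Throughput ratio of um-nommu over um-mmu (above 1.0 means nommu is faster). */
double speedup(double mmu_mbps, double nommu_mbps)
{
	return nommu_mbps / mmu_mbps;
}

/* Print the ratio for each row of the iperf3 and netperf tables. */
void print_speedups(void)
{
	printf("iperf3(f)        x%.2f\n", speedup(7984.0, 13152.0));
	printf("iperf3(r)        x%.2f\n", speedup(8009.0, 14363.0));
	printf("netperf(STREAM)  x%.2f\n", speedup(5912.93, 10792.02));
	printf("netperf(MAERTS)  x%.2f\n", speedup(29263.53, 33970.06));
}
```

That works out to about x1.65 and x1.79 for iperf3, and about x1.83
and x1.16 for netperf, compared with roughly x10 for the getpid
latency case.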

I would say these results only cover part of what UML can do, and
different workloads may show different results, but it is still
valuable to present this as one of the benefits, to show the nature
of the feature (what the single-process design can do).

Of course, nommu comes with various limitations, as I described in
the document; for example, applications need to be aware that the
kernel is nommu (i.e., they need to use vfork, PIE binaries, etc.).
So traditional UML is more generic and has broader usage, but given
this speed characteristic of nommu, I think it is worthwhile, and
users benefit from it if they need speed.
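
As an example of what "need to use vfork" means for an application:
fork() is not available on a no-MMU kernel, so spawning a child is
typically done with vfork() followed immediately by exec. A minimal
sketch (spawn_with_vfork() is a hypothetical helper, not something
from the patchset):

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for vfork() under strict -std= modes */
#endif
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run path with argv in a child created by vfork().
 * Returns the child's exit status, or -1 on error or abnormal exit. */
int spawn_with_vfork(const char *path, char *const argv[])
{
	int status;
	pid_t pid = vfork();

	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* Child: it borrows the parent's memory until exec or
		 * _exit, so do nothing else here. */
		execv(path, argv);
		_exit(127);	/* only reached if execv() failed */
	}

	if (waitpid(pid, &status, 0) < 0)
		return -1;

	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The executed binary itself still has to be something the nommu kernel
can load (e.g. PIE, as mentioned above).
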
I hope this clarifies a bit.
-- Hajime

On Wed, 12 Nov 2025 17:52:56 +0900, Hajime Tazaki wrote:
[...]
> > However, I'm not yet convinced that all of the complexities presented in
> > this patchset (such as completely separate seccomp implementation) are
> > actually necessary in support of _just_ the second bullet. These seem to
> > me like design choices necessary to support the _first_ bullet [1].
>
> separate seccomp implementation is indeed needed due to the design
> choice we made, to use a single process to host a (um) userspace. I
> think there is no reason to unify the seccomp part because the
> signal handlers and filter installation do the different jobs.
>
> I don't see why you see this as a _complexity_, as functionally both
> seccomp handling don't interfere each other. we have prepared
> separate sub-directories for nommu to avoid unnecessary if/else
> clauses in .c/.h files.

I have the same concern about the complexities introduced by this
patch set. The new processing paths it introduces (such as the
separate handling for FP/SSE/AVX, FS, signal, syscall, ...) add a
lot of unnecessary complexities. I think Johannes's suggestion is
a great idea.

> we haven't seen any functional regressions
> since this RFC version (which was 6.12 kernel).

I took a quick look at the code. It appears that patch 02/13 will
break the mmu build when UML_TIME_TRAVEL_SUPPORT is enabled.

Regards,
Tiwei

On Thu, 13 Nov 2025 01:36:51 +0900, Tiwei Bie wrote:
> > we haven't seen any functional regressions
> > since this RFC version (which was 6.12 kernel).
>
> I took a quick look at the code. It appears that patch 02/13 will
> break the mmu build when UML_TIME_TRAVEL_SUPPORT is enabled.

Thanks, that was my mistake when moving the chunk.
I will fix it and add it to my local tests.

-- Hajime