[v13] nommu UML

[PATCH v13 00/13] nommu UML

Posted by Hajime Tazaki 3 months ago

This patchset is another spin of nommu mode addition to UML.  It would
be nice to hear about your opinions on that.

There are still several limitations/issues that we have already found;
here is the list.

- memory mapped by loadable modules is not distinguished from
  userspace memory.
- CONFIG_SMP is disabled as host_fs handling doesn't work with thread
  local storage.

-- Hajime

v13:
- rebase with the latest uml/next branch, fixing a conflict ([06/13])

v12:
- rebase with the latest uml/next branch
- disable SMP and tls as those don't work with host_fs handling ([11/13])
- https://lore.kernel.org/all/cover.1762075876.git.thehajime@gmail.com/

v11:
- clean up userspace return routine and integrate it into userspace() ([04/13])
- fix direction flag issue on using nolibc memcpy ([04/13])
- fix a crash issue when using usermode helper ([06/13])
- test with out-of-tree kunit-uapi patches (which uses umh)
 - https://lore.kernel.org/all/20250626-kunit-kselftests-v4-0-48760534fef5@linutronix.de/
 - https://lore.kernel.org/all/20250626195714.2123694-3-benjamin@sipsolutions.net/
- https://lore.kernel.org/all/cover.1758181109.git.thehajime@gmail.com/

v10:
- fix wrong comment on gs register handling ([09/13])
- remove unnecessary code of early syscall implementation ([04/13])
- https://lore.kernel.org/all/cover.1750594487.git.thehajime@gmail.com/

v9:
- rebase with the latest uml/next branch
- add performance numbers of new SECCOMP mode, and update results ([12/13])
- add a workaround for upstream change on MMU dependency to PCI drivers ([10/13])
- https://lore.kernel.org/all/cover.1750294482.git.thehajime@gmail.com/

v8:
- rebase with the latest uml/next branch
- clean up segv_handler to align with the latest uml ([9/12])
- https://lore.kernel.org/all/cover.1745980082.git.thehajime@gmail.com/

v7:
- properly handle FP register upon signal delivery [10/13]
- update benchmark result with new FP register handling [12/13]
- fix arch_has_single_step() for !MMU case [07/13]
- revert stack alignment as it is in uml/fixes tree [10/13]
- https://lore.kernel.org/all/cover.1737348399.git.thehajime@gmail.com/

v6:
- rebase to the latest uml/next tree
- more clean up on mmu/nommu for signal handling [10/13]
- rename functions of mcontext routines [06,10/13]
- added Acked-by tag for binfmt_elf_fdpic [02/13]
- https://lore.kernel.org/linux-um/cover.1736853925.git.thehajime@gmail.com/

v5:
- clean up stack manipulation code [05,06,07,10/13]
- https://lore.kernel.org/linux-um/cover.1733998168.git.thehajime@gmail.com/

v4:
- add arch/um/nommu, arch/x86/um/nommu to contain !MMU specific code
- remove zpoline patch
- drop binfmt_elf_fdpic patch
- reduce ifndef CONFIG_MMU if possible
- split to elf header cleanup patch [01/13]
- fix kernel test robot warnings [06/13]
- fix coding styles [07/13]
- move task_top_of_stack definition [05/13]
- https://lore.kernel.org/linux-um/cover.1733652929.git.thehajime@gmail.com/

v3:
- https://lore.kernel.org/linux-um/cover.1733199769.git.thehajime@gmail.com/
- add seccomp-based syscall hook in addition to zpoline [06/13]
- remove RFC, add a line to MAINTAINERS file
- fix kernel test robot warnings [02/13,08/13,10/13]
- add base-commit tag to cover letter
- pull the latest uml/next
- clean up SIGSEGV handling [10/13]
- detect fsgsbase availability with elf aux vector [08/13]
- simplify vdso code with macros [09/13]

RFC v2:
- https://lore.kernel.org/linux-um/cover.1731290567.git.thehajime@gmail.com/
- base branch is now uml/linux.git instead of torvalds/linux.git.
- reorganize the patch series to clean up
- fixed various coding style issues
- clean up exec code path [07/13]
- fixed the crash/SIGSEGV case on userspace programs [10/13]
- add seccomp filter to limit syscall caller address [06/13]
- detect fsgsbase availability with sigsetjmp/siglongjmp [08/13]
- removes unrelated changes
- removes unneeded ifndef CONFIG_MMU
- convert UML_CONFIG_MMU to CONFIG_MMU as using uml/linux.git
- proposed a patch of maple-tree issue (resolving a limitation in RFC v1)
  https://lore.kernel.org/linux-mm/20241108222834.3625217-1-thehajime@gmail.com/

RFC:
- https://lore.kernel.org/linux-um/cover.1729770373.git.thehajime@gmail.com/

Hajime Tazaki (13):
  x86/um: nommu: elf loader for fdpic
  um: decouple MMU specific code from the common part
  um: nommu: memory handling
  x86/um: nommu: syscall handling
  um: nommu: seccomp syscalls hook
  x86/um: nommu: process/thread handling
  um: nommu: configure fs register on host syscall invocation
  x86/um/vdso: nommu: vdso memory update
  x86/um: nommu: signal handling
  um: change machine name for uname output
  um: nommu: disable SMP on nommu UML
  um: nommu: add documentation of nommu UML
  um: nommu: plug nommu code into build system

 Documentation/virt/uml/nommu-uml.rst   | 180 ++++++++++++++++++++++
 MAINTAINERS                            |   1 +
 arch/um/Kconfig                        |  14 +-
 arch/um/Makefile                       |  10 ++
 arch/um/configs/x86_64_nommu_defconfig |  54 +++++++
 arch/um/include/asm/futex.h            |   4 +
 arch/um/include/asm/mmu.h              |   8 +
 arch/um/include/asm/mmu_context.h      |   2 +
 arch/um/include/asm/ptrace-generic.h   |   8 +-
 arch/um/include/asm/uaccess.h          |   7 +-
 arch/um/include/shared/kern_util.h     |   6 +
 arch/um/include/shared/os.h            |  16 ++
 arch/um/kernel/Makefile                |   5 +-
 arch/um/kernel/mem-pgtable.c           |  55 +++++++
 arch/um/kernel/mem.c                   |  38 +----
 arch/um/kernel/process.c               |  38 +++++
 arch/um/kernel/skas/process.c          |  37 -----
 arch/um/kernel/um_arch.c               |   3 +
 arch/um/nommu/Makefile                 |   3 +
 arch/um/nommu/os-Linux/Makefile        |   7 +
 arch/um/nommu/os-Linux/seccomp.c       |  87 +++++++++++
 arch/um/nommu/os-Linux/signal.c        |  24 +++
 arch/um/nommu/trap.c                   | 201 +++++++++++++++++++++++++
 arch/um/os-Linux/Makefile              |   3 +-
 arch/um/os-Linux/internal.h            |   8 +
 arch/um/os-Linux/mem.c                 |   4 +
 arch/um/os-Linux/process.c             | 139 ++++++++++++++++-
 arch/um/os-Linux/signal.c              |  11 +-
 arch/um/os-Linux/skas/process.c        | 127 ----------------
 arch/um/os-Linux/start_up.c            |  25 ++-
 arch/um/os-Linux/util.c                |   3 +-
 arch/x86/um/Kconfig                    |   2 +-
 arch/x86/um/Makefile                   |   7 +-
 arch/x86/um/asm/elf.h                  |   8 +-
 arch/x86/um/asm/syscall.h              |   6 +
 arch/x86/um/nommu/Makefile             |   8 +
 arch/x86/um/nommu/do_syscall_64.c      |  75 +++++++++
 arch/x86/um/nommu/entry_64.S           | 114 ++++++++++++++
 arch/x86/um/nommu/os-Linux/Makefile    |   6 +
 arch/x86/um/nommu/os-Linux/mcontext.c  |  26 ++++
 arch/x86/um/nommu/syscalls.h           |  18 +++
 arch/x86/um/nommu/syscalls_64.c        | 121 +++++++++++++++
 arch/x86/um/shared/sysdep/mcontext.h   |   5 +
 arch/x86/um/shared/sysdep/ptrace.h     |   2 +-
 arch/x86/um/vdso/vma.c                 |  17 ++-
 fs/Kconfig.binfmt                      |   2 +-
 46 files changed, 1322 insertions(+), 223 deletions(-)
 create mode 100644 Documentation/virt/uml/nommu-uml.rst
 create mode 100644 arch/um/configs/x86_64_nommu_defconfig
 create mode 100644 arch/um/kernel/mem-pgtable.c
 create mode 100644 arch/um/nommu/Makefile
 create mode 100644 arch/um/nommu/os-Linux/Makefile
 create mode 100644 arch/um/nommu/os-Linux/seccomp.c
 create mode 100644 arch/um/nommu/os-Linux/signal.c
 create mode 100644 arch/um/nommu/trap.c
 create mode 100644 arch/x86/um/nommu/Makefile
 create mode 100644 arch/x86/um/nommu/do_syscall_64.c
 create mode 100644 arch/x86/um/nommu/entry_64.S
 create mode 100644 arch/x86/um/nommu/os-Linux/Makefile
 create mode 100644 arch/x86/um/nommu/os-Linux/mcontext.c
 create mode 100644 arch/x86/um/nommu/syscalls.h
 create mode 100644 arch/x86/um/nommu/syscalls_64.c


base-commit: 293f71435d14f5b5c46fc3398695fa265c69363d
-- 
2.43.0

Re: [PATCH v13 00/13] nommu UML

Posted by Christoph Hellwig 3 months ago

On Sat, Nov 08, 2025 at 05:05:35PM +0900, Hajime Tazaki wrote:
> This patchset is another spin of nommu mode addition to UML.  It would
> be nice to hear about your opinions on that.

I've not seen any explanation of the use case and/or benefits anywhere
in this cover letter or the patches.  Without that it's usually pretty
hard to get maintainers and reviewers excited.

Re: [PATCH v13 00/13] nommu UML

Posted by Hajime Tazaki 3 months ago

Hello,

On Mon, 10 Nov 2025 18:14:26 +0900,
Christoph Hellwig wrote:
> 
> On Sat, Nov 08, 2025 at 05:05:35PM +0900, Hajime Tazaki wrote:
> > This patchset is another spin of nommu mode addition to UML.  It would
> > be nice to hear about your opinions on that.
> 
> I've not seen any explanation of the use case and/or benefits anywhere
> in this cover letter or the patches.  Without that it's usually pretty
> hard to get maintainers and reviewers excited.

thank you for the comment.  I tried to include this explanation in the
documentation patch [12/13], from which I copied the text below.

  What is it for ?
  ================

  - Alleviate syscall hook overhead implemented with ptrace(2)
  - To exercise nommu code over UML (and over KUnit)
  - Less dependency on host facilities

the first item is for speed, the second item is for more testing, and
the last item is for more extensibility in the future.

Early versions of this patchset included this information as well as
the whole documentation in the cover letter, but I removed it as the
versions grew.  I can restore it to the cover letter if it helps.

-- Hajime

Re: [PATCH v13 00/13] nommu UML

Posted by Johannes Berg 3 months ago

On Mon, 2025-11-10 at 21:18 +0900, Hajime Tazaki wrote:
> 
>   What is it for ?
>   ================
>   
>   - Alleviate syscall hook overhead implemented with ptrace(2)
>   - To exercise nommu code over UML (and over KUnit)
>   - Less dependency on host facilities

FWIW, in some way, this order of priorities is exactly why this hasn't
been going anywhere, and every time I looked at it I got somewhat
annoyed by what seems to me like choices made to support especially the
first bullet.

I suspect that the first and third bullet are not even really true any
more, since you moved to seccomp (per our request), yet I think design
choices influenced by them persist.

People are definitely interested in the second bullet, mostly for kunit,
and I'd be willing to support them in that to some extent.

However, I'm not yet convinced that all of the complexities presented in
this patchset (such as completely separate seccomp implementation) are
actually necessary in support of _just_ the second bullet. These seem to
me like design choices necessary to support the _first_ bullet [1].

[1] and then I suppose the third, which I'm reading as "doesn't need
seccomp or ptrace", but I'm not really quite sure what you meant

I've thought about what would happen if we stuck to creating a (single)
separate process on the host to execute userspace, and just used
CLONE_VM for it. That way, it's still no-MMU with full memory access,
but there's some implicit isolation between the kernel and userspace
processes which will likely remove complexities around FP/SSE/AVX
handling, may completely remove the need for a separate seccomp
implementation, etc.

It would, on the other hand, make it completely non-viable to achieve
the first and third bullets, so given your pursuit of those, on some
level I understand the design right now. I'm yet to be convinced,
however, that those are even worthy goals for (upstream) UML, what use
case would that enable that we really need? Especially considering that
over a longer perspective, NOMMU architectures _are_ on their way out,
and UML will certainly follow once that happens, it won't be the last
remaining NOMMU architecture.

So the only value I see in this is for testing over the next couple of
years, which really doesn't need any sort of significant optimisation or
less reliance on host facilities.

Where do you see this differently?

johannes

Re: [PATCH v13 00/13] nommu UML

Posted by Hajime Tazaki 2 months, 4 weeks ago

On Tue, 11 Nov 2025 17:01:25 +0900,
Johannes Berg wrote:
> 
> On Mon, 2025-11-10 at 21:18 +0900, Hajime Tazaki wrote:
> > 
> >   What is it for ?
> >   ================
> >   
> >   - Alleviate syscall hook overhead implemented with ptrace(2)
> >   - To exercise nommu code over UML (and over KUnit)
> >   - Less dependency on host facilities
> 
> FWIW, in some way, this order of priorities is exactly why this hasn't
> been going anywhere, and every time I looked at it I got somewhat
> annoyed by what seems to me like choices made to support especially the
> first bullet.

over the past versions, I've been emphasizing that the 2nd bullet (testing)
is the primary use case, as I saw several actual cases from mm folks,

https://lists.infradead.org/pipermail/maple-tree/2024-November/003775.html
https://lore.kernel.org/all/cb1cf0be-871d-4982-9a1b-5fdd54deec8d@lucifer.local/

and I think this is not limited to mm code.

the other 2 bullets are additional benefits which we observed in a
comment, and in our experience.

https://lore.kernel.org/all/20241122121826.GA26024@lst.de/
[2] https://static.sched.com/hosted_files/ossna2020/ec/kollerr_linux_um_nommu.pdf

but those are not the primary goal, so I'm not pushing this aspect
with usecases.

> I suspect that the first and third bullet are not even really true any
> more, since you moved to seccomp (per our request), yet I think design
> choices influenced by them persist.

this observation is not true; the first bullet is still true even
using seccomp.  please look at the benchmark result in the patch
[12/13], quoted below.

summary: most of the tests show that um-nommu+seccomp is 4x to 20x faster
than um-mmu+seccomp (and ptrace).

.. csv-table:: lmbench (usec)
  :header: ,native,um,um-mmu(s),um-nommu(s)

  select-10    ,0.5319,36.1214,24.2795,2.9174
  select-100   ,1.6019,34.6049,28.8865,3.8080
  select-1000  ,12.2588,43.6838,48.7438,12.7872
  syscall      ,0.1644,35.0321,53.2119,2.5981
  read         ,0.3055,31.5509,45.8538,2.7068
  write        ,0.2512,31.3609,29.2636,2.6948
  stat         ,1.8894,43.8477,49.6121,3.1908
  open/close   ,3.2973,77.5123,68.9431,6.2575
  fork+sh      ,1110.3000,7359.5000,4618.6667,439.4615
  fork+execve  ,510.8182,2834.0000,2461.1667,139.7848

.. csv-table:: do_getpid bench (nsec)
  :header: ,native,um,um-mmu(s),um-nommu(s)

  getpid , 161 , 34477 , 26242 , 2599

the 1st bullet mentioning ptrace(2) is somewhat misleading now.  it might
be rephrased as "a separate process handling userspace", instead of
"ptrace".

# when I started this patchset, the seccomp patch wasn't upstream, so
  saying ptrace(2) wasn't that wrong.

> People are definitely interested in the second bullet, mostly for kunit,
> and I'd be willing to support them in that to some extent.

so (again) the 2nd bullet is the primary use case at this stage.

> However, I'm not yet convinced that all of the complexities presented in
> this patchset (such as completely separate seccomp implementation) are
> actually necessary in support of _just_ the second bullet. These seem to
> me like design choices necessary to support the _first_ bullet [1].

separate seccomp implementation is indeed needed due to the design
choice we made, to use a single process to host a (um) userspace.  I
think there is no reason to unify the seccomp part because the
signal handlers and filter installation do different jobs.

I don't see why you see this as a _complexity_, as functionally the two
seccomp handlers don't interfere with each other.  we have prepared
separate sub-directories for nommu to avoid unnecessary if/else
clauses in .c/.h files.  we haven't seen any functional regressions
since the RFC version (which was based on the 6.12 kernel).

> [1] and then I suppose the third, which I'm reading as "doesn't need
> seccomp or ptrace", but I'm not really quite sure what you meant
> 
> I've thought about what would happen if we stuck to creating a (single)
> separate process on the host to execute userspace, and just used
> CLONE_VM for it. That way, it's still no-MMU with full memory access,
> but there's some implicit isolation between the kernel and userspace
> processes which will likely remove complexities around FP/SSE/AVX
> handling, may completely remove the need for a separate seccomp
> implementation, etc.

this would be doable I think, but we went a different way, as
using separate host processes (with ptrace/seccomp) is slow and adds
complexity through the synchronization between processes, which we
think is not easy to maintain in the future.

this was natural for us (not sure about maintainers): when we add new
functionality, we consider several options to implement it, and take the
one that is faster, simpler, and cheaper to maintain.

the avoidance of separate processes is probably the core of the design
choice we made for nommu UML.  I'm not strongly pushing the benefits
of the 1st/3rd bullets, but I thought describing the characteristics of
what _this_ patchset can do would be useful, hence the document.

additionally, if the design choice we made introduced any breakage in
existing code, or maintenance burdens, I would understand your concern
about the complexity, but I don't think this is the case.

> It would, on the other hand, make it completely non-viable to achieve
> the first and third bullets, so given your pursuit of those, on some
> level I understand the design right now. I'm yet to be convinced,
> however, that those are even worthy goals for (upstream) UML, what use
> case would that enable that we really need?

the use case for those is inherited from the original implementation,
[2] above, which is running UML in containers with less host dependency
and speedups.  but again, this is not the primary goal at this stage.

if you think that the document should not describe the potential
benefits/usecases which are not related to the primary goal of the
functionality, I'd agree to remove those descriptions.

> Especially considering that
> over a longer perspective, NOMMU architectures _are_ on their way out,
> and UML will certainly follow once that happens, it won't be the last
> remaining NOMMU architecture.

I'm aware of this nommu removal discussion, but I also saw opinions
that do not support this direction.  This patchset is still
useful even now.

> So the only value I see in this is for testing over the next couple of
> years, which really doesn't need any sort of significant optimisation or
> less reliance on host facilities.

I agree with the former, but not the latter.

- there is a value with a real usecase,
- there are different ways to implement it but this went with the
  one with potential (additional) benefits,
- without breakages to the existing (MMU) uml code.

with that, we're proposing this patchset.

> Where do you see this differently?

thanks for the careful question.
I hope my answer clarifies your concerns.

I also wish to understand the maintainers' concerns about the single
process design of nommu for um userspace; the codebase is still
young, so it may have unexpected effects on other code.  but this is exactly
the reason why I also added myself to MAINTAINERS, in order to take care
of this patchset even though it is small (1.3k loc).

-- Hajime

Re: [PATCH v13 00/13] nommu UML

Posted by Johannes Berg 2 months, 2 weeks ago

On Wed, 2025-11-12 at 17:52 +0900, Hajime Tazaki wrote:
> > >   What is it for ?
> > >   ================
> > >   
> > >   - Alleviate syscall hook overhead implemented with ptrace(2)
> > >   - To exercise nommu code over UML (and over KUnit)
> > >   - Less dependency on host facilities
> > 
> > FWIW, in some way, this order of priorities is exactly why this hasn't
> > been going anywhere, and every time I looked at it I got somewhat
> > annoyed by what seems to me like choices made to support especially the
> > first bullet.
> 
> over the past versions, I've been emphasizing that the 2nd bullet (testing)
> is the primary use case, as I saw several actual cases from mm folks,
> 
> https://lists.infradead.org/pipermail/maple-tree/2024-November/003775.html
> https://lore.kernel.org/all/cb1cf0be-871d-4982-9a1b-5fdd54deec8d@lucifer.local/
> 
> and I think this is not limited to mm code.

Not sure there's much value in testing much else in no-MMU, but sure,
I'll give you that it's useful for testing.

> the other 2 bullets are additional benefits which we observed in a
> comment, and in our experience.

But are they really _worthwhile_ benefits? A lot of this design adds
additional complexity, and it doesn't really seem necessary for the
testing use case. Making it faster is nice, but it's not like the
speedup really is 20x for arbitrary tests, that's just for corner cases
like "sit in a loop of gettimeofday()". And for kunit there's no syscall
boundary at all, so there's no speedup.

> > I suspect that the first and third bullet are not even really true any
> > more, since you moved to seccomp (per our request), yet I think design
> > choices influenced by them persist.
> 
> this observation is not true; the first bullet is still true even
> using seccomp.  please look at the benchmark result in the patch
> [12/13], quoted below.

> [snip]

So thanks for the correction. If that's the case, however, it means the
speedup can't be due to the syscall boundary itself (seccomp) but must
rather be due to some pagefault/mapping handling issue? Which would be
inherent in no-MMU, even taking an approach of using two host processes
rather than embedding everything into one.

> > However, I'm not yet convinced that all of the complexities presented in
> > this patchset (such as completely separate seccomp implementation) are
> > actually necessary in support of _just_ the second bullet. These seem to
> > me like design choices necessary to support the _first_ bullet [1].
> 
> separate seccomp implementation is indeed needed due to the design
> choice we made, to use a single process to host a (um) userspace.

That sounds misleading or even wrong to me, I'd say it's due to putting
the (um) userspace in the same host process as the kernel space?

> I don't see why you see this as a _complexity_, as functionally the two
> seccomp handlers don't interfere with each other.

The complexity isn't so much in the separate code, which is a small
factor, but in the "put everything into the same process" aspect of it.
That has consequences around the host context state handling, things we
didn't really need to consider before suddenly become crucially
important. In the current (with-MMU) design, we only need to worry about
being able to correctly switch between userspace tasks/threads within a
userspace mm (host) process. With the no-MMU design you propose, we also
need to be able to correctly switch between kernel and userspace tasks
within the same single (host) process.

I think this is a pretty significant difference, and saying "there's no
complexity here" is simply pretending it isn't a relevant difference. I
believe you're not even handling this correctly right now in this patch
set, specifically wrt. the GS register which has been pointed out
before, but I wouldn't say that I even have a complete picture in my
head over what state handling would be necessary and sufficient.

So yeah, I think this warrants taking another look as to whether or not
the approach of putting everything into the same host process is even
worth it. I tend to believe that it isn't, given the use cases. And if
you say the speedup still is with seccomp, that kills the speed argument
too.

> > I've thought about what would happen if we stuck to creating a (single)
> > separate process on the host to execute userspace, and just used
> > CLONE_VM for it. That way, it's still no-MMU with full memory access,
> > but there's some implicit isolation between the kernel and userspace
> > processes which will likely remove complexities around FP/SSE/AVX
> > handling, may completely remove the need for a separate seccomp
> > implementation, etc.
> 
> this would be doable I think, but we went a different way, as
> using separate host processes (with ptrace/seccomp) is slow and adds
> complexity through the synchronization between processes, which we
> think is not easy to maintain in the future.

Which one is it then, slow or not? Not sure I follow. You just said you
do have seccomp when comparing speeds, so that in itself doesn't make it
slow. What synchronization? It'd (have to) be CLONE_VM, but that
actually _simplifies_ state transfer/synchronization, and we already
have (to have) state transfer between different userspace threads in the
same host process for the with-MMU case.

johannes

Re: [PATCH v13 00/13] nommu UML

Posted by Hajime Tazaki 2 months, 1 week ago

On Tue, 25 Nov 2025 18:58:53 +0900,
Johannes Berg wrote:
> 
> On Wed, 2025-11-12 at 17:52 +0900, Hajime Tazaki wrote:
> > > >   What is it for ?
> > > >   ================
> > > >   
> > > >   - Alleviate syscall hook overhead implemented with ptrace(2)
> > > >   - To exercise nommu code over UML (and over KUnit)
> > > >   - Less dependency on host facilities
> > > 
> > > FWIW, in some way, this order of priorities is exactly why this hasn't
> > > been going anywhere, and every time I looked at it I got somewhat
> > > annoyed by what seems to me like choices made to support especially the
> > > first bullet.
> > 
> > over the past versions, I've been emphasizing that the 2nd bullet (testing)
> > is the primary use case, as I saw several actual cases from mm folks,
> > 
> > https://lists.infradead.org/pipermail/maple-tree/2024-November/003775.html
> > https://lore.kernel.org/all/cb1cf0be-871d-4982-9a1b-5fdd54deec8d@lucifer.local/
> > 
> > and I think this is not limited to mm code.
> 
> Not sure there's much value in testing much else in no-MMU, but sure,
> I'll give you that it's useful for testing.

under the tree,

% global -xr CONFIG_MMU | grep ifndef  | grep -v -E "arch/|mm/" | wc -l
45

this is a rough picture, but there are places to be tested other than
the mm codebase.

> > the other 2 bullets are additional benefits which we observed in a
> > comment, and in our experience.
> 
> But are they really _worthwhile_ benefits? A lot of this design adds
> additional complexity, and it doesn't really seem necessary for the
> testing use case. Making it faster is nice, but it's not like the
> speedup really is 20x for arbitrary tests, that's just for corner cases
> like "sit in a loop of gettimeofday()". And for kunit there's no syscall
> boundary at all, so there's no speedup.

I agree, and as I said, the reason for taking the single-host-process
approach is the speed and simplicity gained by removing interaction
between host processes.

I have never claimed that tests should execute fast,
and I agree that kunit doesn't benefit from the speedup as there is no
syscall (unless the kunit-uapi patch gets merged).

> > > I suspect that the first and third bullet are not even really true any
> > > more, since you moved to seccomp (per our request), yet I think design
> > > choices influenced by them persist.
> > 
> > this observation is not true; the first bullet is still true even
> > using seccomp.  please look at the benchmark result in the patch
> > [12/13], quoted below.
> 
> > [snip]
> 
> So thanks for the correction. If that's the case, however, it means the
> speedup can't be due to the syscall boundary itself (seccomp) but must
> rather be due to some pagefault/mapping handling issue? Which would be
> inherent in no-MMU, even taking an approach of using two host processes
> rather than embedding everything into one.

I'll explain this later in this email.

# nommu doesn't have page faults as there are only physical addresses.

> > > However, I'm not yet convinced that all of the complexities presented in
> > > this patchset (such as completely separate seccomp implementation) are
> > > actually necessary in support of _just_ the second bullet. These seem to
> > > me like design choices necessary to support the _first_ bullet [1].
> > 
> > separate seccomp implementation is indeed needed due to the design
> > choice we made, to use a single process to host a (um) userspace.
> 
> That sounds misleading or even wrong to me, I'd say it's due to putting
> the (um) userspace in the same host process as the kernel space?

not sure if this is different from my explanation...

> > I don't see why you see this as a _complexity_, as functionally the two
> > seccomp handlers don't interfere with each other.
> 
> The complexity isn't so much in the separate code, which is a small
> factor, but in the "put everything into the same process" aspect of it.
> That has consequences around the host context state handling, things we
> didn't really need to consider before suddenly become crucially
> important. In the current (with-MMU) design, we only need to worry about
> being able to correctly switch between userspace tasks/threads within a
> userspace mm (host) process. With the no-MMU design you propose, we also
> need to be able to correctly switch between kernel and userspace tasks
> within the same single (host) process.
> 
> I think this is a pretty significant difference, and saying "there's no
> complexity here" is simply pretending it isn't a relevant difference. I
> believe you're not even handling this correctly right now in this patch
> set, specifically wrt. the GS register which has been pointed out
> before, but I wouldn't say that I even have a complete picture in my
> head over what state handling would be necessary and sufficient.
> 
> So yeah, I think this warrants taking another look as to whether or not
> the approach of putting everything into the same host process is even
> worth it. I tend to believe that it isn't, given the use cases. And if
> you say the speedup still is with seccomp, that kills the speed argument
> too.

I understand your concern on complexity, thanks for the detail.

the host context state handling is indeed a new thing. we've only
verified a limited set of code paths, with basic operation of um +
drivers and some userspace programs.  this is probably not perfect at
the moment but can be improved.

> > > I've thought about what would happen if we stuck to creating a (single)
> > > separate process on the host to execute userspace, and just used
> > > CLONE_VM for it. That way, it's still no-MMU with full memory access,
> > > but there's some implicit isolation between the kernel and userspace
> > > processes which will likely remove complexities around FP/SSE/AVX
> > > handling, may completely remove the need for a separate seccomp
> > > implementation, etc.
> > 
> > this would be doable I think, but we went a different way, as
> > using separate host processes (with ptrace/seccomp) is slow and adds
> > complexity through the synchronization between processes, which we
> > think is not easy to maintain in the future.
> 
> Which one is it then, slow or not? Not sure I follow. You just said you
> do have seccomp when comparing speeds, so that in itself doesn't make it
> slow. What synchronization? It'd (have to) be CLONE_VM, but that
> actually _simplifies_ state transfer/synchronization, and we already
> have (to have) state transfer between different userspace threads in the
> same host process for the with-MMU case.

Since I included speed characteristics in the document, I should
explain more about their impact, compared to the existing
design/implementation of uml.

many documents and articles say uml is slow (the uml document in the tree
also mentions it a bit), but I cannot find a detailed analysis, so I looked
closely at how nommu (w/ seccomp) and mmu (w/ seccomp) behave.

suppose we have a userspace program running under uml (on seccomp-mmu
and seccomp-nommu).


	struct timespec ts1, ts2;
	clock_gettime(CLOCK_REALTIME, &ts1);  // 1)
	getpid();                             // 2)
	clock_gettime(CLOCK_REALTIME, &ts2);  // 3)

# this is a chunk from the benchmark program used in the document.

I then collected several events (sched_switch, signal_generate, and
sys_enter_futex) via ftrace.

Looking at the 3 SIGSYS (sig=31) signals generated by the above code,
below is the output of `trace-cmd report`.

- ftrace seccomp-mmu, 2)-3) = 11 usec
 uml-userspace-3092637 [002] 1749286.670199: signal_generate:      sig=31 errno=0 code=1 comm=uml-userspace pid=3092637 grp=0 res=0    => 1)
 uml-userspace-3092637 [002] 1749286.670200: sys_enter_futex:      op=FUTEX_WAKE uaddr=0x7fffffffdf8c val=1
 uml-userspace-3092637 [002] 1749286.670201: sys_enter_futex:      op=FUTEX_WAIT uaddr=0x7fffffffdf8c val=0x00000001 utime=0x00000000
 uml-userspace-3092637 [002] 1749286.670202: sched_switch:         uml-userspace:3092637 [120] S ==> swapper/2:0 [120]
          <idle>-0     [028] 1749286.670203: sched_switch:         swapper/28:0 [120] R ==> vmlinux:3092631 [120]
       vmlinux-3092631 [028] 1749286.670205: sys_enter_futex:      op=FUTEX_WAKE uaddr=0x60b64f8c val=1
       vmlinux-3092631 [028] 1749286.670206: sys_enter_futex:      op=FUTEX_WAIT uaddr=0x60b64f8c val=0x00000000 utime=0x00000000
       vmlinux-3092631 [028] 1749286.670207: sched_switch:         vmlinux:3092631 [120] S ==> swapper/28:0 [120]
          <idle>-0     [002] 1749286.670209: sched_switch:         swapper/2:0 [120] R ==> uml-userspace:3092637 [120]
 uml-userspace-3092637 [002] 1749286.670211: signal_generate:      sig=31 errno=0 code=1 comm=uml-userspace pid=3092637 grp=0 res=0    => 2)
 uml-userspace-3092637 [002] 1749286.670212: sys_enter_futex:      op=FUTEX_WAKE uaddr=0x7fffffffdf8c val=1
 uml-userspace-3092637 [002] 1749286.670213: sys_enter_futex:      op=FUTEX_WAIT uaddr=0x7fffffffdf8c val=0x00000001 utime=0x00000000
 uml-userspace-3092637 [002] 1749286.670214: sched_switch:         uml-userspace:3092637 [120] S ==> swapper/2:0 [120]
          <idle>-0     [028] 1749286.670215: sched_switch:         swapper/28:0 [120] R ==> vmlinux:3092631 [120]
       vmlinux-3092631 [028] 1749286.670216: sys_enter_futex:      op=FUTEX_WAKE uaddr=0x60b64f8c val=1
       vmlinux-3092631 [028] 1749286.670217: sys_enter_futex:      op=FUTEX_WAIT uaddr=0x60b64f8c val=0x00000000 utime=0x00000000
       vmlinux-3092631 [028] 1749286.670218: sched_switch:         vmlinux:3092631 [120] S ==> swapper/28:0 [120]
          <idle>-0     [002] 1749286.670220: sched_switch:         swapper/2:0 [120] R ==> uml-userspace:3092637 [120]
 uml-userspace-3092637 [002] 1749286.670222: signal_generate:      sig=31 errno=0 code=1 comm=uml-userspace pid=3092637 grp=0 res=0    => 3)


- ftrace seccomp-nommu, 2)-3) =  3 usec
       vmlinux-3092542 [006] 1749158.829292: signal_generate:      sig=31 errno=0 code=1 comm=vmlinux pid=3092542 grp=0 res=0    => 1)
       vmlinux-3092542 [006] 1749158.829294: signal_generate:      sig=31 errno=0 code=1 comm=vmlinux pid=3092542 grp=0 res=0    => 2)
       vmlinux-3092542 [006] 1749158.829297: signal_generate:      sig=31 errno=0 code=1 comm=vmlinux pid=3092542 grp=0 res=0    => 3)

With seccomp-mmu, the host process for userspace (uml-userspace) is
notified with SIGSYS (sig=31) upon a syscall from userspace, the host
switches tasks to vmlinux (the um kernel) using the futex wake/wait
synchronization (this is the synchronization I meant in my previous
email), and then switches back to uml-userspace to continue the
userspace process.

So there are at least 4 host sched_switch-es per single um syscall.

With the current nommu using a single host process, the notification via
SIGSYS is the same as in seccomp-mmu, but after that there is no context
switch upon a syscall issued by userspace; execution stays in the same
host context until the next syscall.

A nommu implementation with CLONE_VM (btw, the host process uml-userspace
is already created with the CLONE_VM flag IIUC) might face a similar
situation to seccomp-mmu, with the same switches between processes.

This explains the difference in the getpid benchmark results, where
um-mmu (seccomp) is roughly 10x slower than um-nommu (seccomp) (26.242
vs 2.599 usec) (this was described as an example benchmark in the
patchset).

I didn't look at the ptrace mode of MMU, but I expect to see a similar
(or longer) duration for a single syscall.




In addition to the ftrace measurement above, I conducted more
practical benchmarks with iperf3 (forward/reverse path) and netperf
(TCP_STREAM/MAERTS), which I believe aren't corner cases; the results
are below.

All runs use the vector driver with GRO on, via host tap devices.
The iperf3/netperf server runs on the host and the client runs inside uml.

# I can give a complete script to reproduce this if needed.


- iperf3 (Mbps)
                  um-mmu(seccomp)   um-nommu(seccomp)
-----------------------------------------------------
iperf3(f)                7984              13152
iperf3(r)                8009              14363

- netperf (Mbps, bufsize=65507 bytes)
                  um-mmu(seccomp)   um-nommu(seccomp)
-----------------------------------------------------
netperf(STREAM)       5912.93           10792.02
netperf(MAERTS)      29263.53           33970.06


The difference is not as significant as what we saw with the simple
syscall benchmark with getpid(2), but there is still a visible impact.

I would say these results only show a subset of what UML can do, and
different workloads may show different results, but they are still
valuable as an illustration of one benefit of the feature (of what a
single-process design can do).

Of course, nommu comes with various limitations, as I described in
the document; for example, applications must be aware that the kernel
is nommu (i.e., they need to use vfork, PIE binaries, etc).  So
traditional uml is more generic and has broader usage, but given this
speed characteristic of nommu, I think it is worthwhile, and users
benefit from it if they need speed.

I hope this clarifies a bit.

-- Hajime

Re: [PATCH v13 00/13] nommu UML

Posted by Tiwei Bie 2 months, 4 weeks ago

On Wed, 12 Nov 2025 17:52:56 +0900, Hajime Tazaki wrote:
[...]
> > However, I'm not yet convinced that all of the complexities presented in
> > this patchset (such as completely separate seccomp implementation) are
> > actually necessary in support of _just_ the second bullet. These seem to
> > me like design choices necessary to support the _first_ bullet [1].
> 
> separate seccomp implementation is indeed needed due to the design
> choice we made, to use a single process to host a (um) userspace.  I
> think there is no reason to unify the seccomp part because the
> signal handlers and filter installation do the different jobs.
> 
> I don't see why you see this as a _complexity_, as functionally both
> seccomp handling don't interfere each other.  we have prepared
> separate sub-directories for nommu to avoid unnecessary if/else
> clauses in .c/.h files.

I have the same concern about the complexities introduced by this
patch set. The new processing paths it introduces (such as the
separate handling for FP/SSE/AVX, FS, signal, syscall, ...) add a
lot of unnecessary complexities. I think Johannes's suggestion is
a great idea.

> we haven't seen any functional regressions
> since this RFC version (which was 6.12 kernel).

I took a quick look at the code. It appears that patch 02/13 will
break the mmu build when UML_TIME_TRAVEL_SUPPORT is enabled.

Regards,
Tiwei

Re: [PATCH v13 00/13] nommu UML

Posted by Hajime Tazaki 2 months, 3 weeks ago

On Thu, 13 Nov 2025 01:36:51 +0900,
Tiwei Bie wrote:

> > we haven't seen any functional regressions
> > since this RFC version (which was 6.12 kernel).
> 
> I took a quick look at the code. It appears that patch 02/13 will
> break the mmu build when UML_TIME_TRAVEL_SUPPORT is enabled.

thanks, that is my bad; I got it wrong when moving the chunk.
I will fix it and add it to my local tests.

-- Hajime