From: Wedson Almeida Filho <wedsonaf@gmail.com>
Allow for the creation of a file descriptor in two steps: first, we
reserve a slot for it, then we commit or drop the reservation. The first
step may fail (e.g., the current process ran out of available slots),
but commit and drop never fail (and are mutually exclusive).
This is needed by Rust Binder when fds are sent from one process to
another. It has to be a two-step process to properly handle the case
where multiple fds are sent: The operation must fail or succeed
atomically, which we achieve by first reserving the fds we need, and
only installing the files once we have reserved enough fds to send the
files.
Fd reservations assume that the value of `current` does not change
between the call to get_unused_fd_flags and the call to fd_install (or
put_unused_fd). By not implementing the Send trait, this abstraction
ensures that the `FileDescriptorReservation` cannot be moved into a
different process.
Signed-off-by: Wedson Almeida Filho <wedsonaf@gmail.com>
Co-developed-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Benno Lossin <benno.lossin@proton.me>
Reviewed-by: Martin Rodriguez Reboredo <yakoyoku@gmail.com>
Reviewed-by: Trevor Gross <tmgross@umich.edu>
Reviewed-by: Gary Guo <gary@garyguo.net>
Signed-off-by: Alice Ryhl <aliceryhl@google.com>
---
rust/kernel/fs/file.rs | 75 +++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 74 insertions(+), 1 deletion(-)
diff --git a/rust/kernel/fs/file.rs b/rust/kernel/fs/file.rs
index 3c1f51719804..e03dbe14d62a 100644
--- a/rust/kernel/fs/file.rs
+++ b/rust/kernel/fs/file.rs
@@ -11,7 +11,7 @@
bindings,
cred::Credential,
error::{code::*, Error, Result},
- types::{ARef, AlwaysRefCounted, Opaque},
+ types::{ARef, AlwaysRefCounted, NotThreadSafe, Opaque},
};
use core::ptr;
@@ -368,6 +368,79 @@ fn deref(&self) -> &LocalFile {
}
}
+/// A file descriptor reservation.
+///
+/// This allows the creation of a file descriptor in two steps: first, we reserve a slot for it,
+/// then we commit or drop the reservation. The first step may fail (e.g., the current process ran
+/// out of available slots), but commit and drop never fail (and are mutually exclusive).
+///
+/// Dropping the reservation happens in the destructor of this type.
+///
+/// # Invariants
+///
+/// The fd stored in this struct must correspond to a reserved file descriptor of the current task.
+pub struct FileDescriptorReservation {
+ fd: u32,
+ /// Prevent values of this type from being moved to a different task.
+ ///
+ /// The `fd_install` and `put_unused_fd` functions assume that the value of `current` is
+ /// unchanged since the call to `get_unused_fd_flags`. By adding this marker to this type, we
+ /// prevent it from being moved across task boundaries, which ensures that `current` does not
+ /// change while this value exists.
+ _not_send: NotThreadSafe,
+}
+
+impl FileDescriptorReservation {
+ /// Creates a new file descriptor reservation.
+ pub fn get_unused_fd_flags(flags: u32) -> Result<Self> {
+ // SAFETY: FFI call, there are no safety requirements on `flags`.
+ let fd: i32 = unsafe { bindings::get_unused_fd_flags(flags) };
+ if fd < 0 {
+ return Err(Error::from_errno(fd));
+ }
+ Ok(Self {
+ fd: fd as u32,
+ _not_send: NotThreadSafe,
+ })
+ }
+
+ /// Returns the file descriptor number that was reserved.
+ pub fn reserved_fd(&self) -> u32 {
+ self.fd
+ }
+
+ /// Commits the reservation.
+ ///
+ /// The previously reserved file descriptor is bound to `file`. This method consumes the
+ /// [`FileDescriptorReservation`], so it will not be usable after this call.
+ pub fn fd_install(self, file: ARef<File>) {
+ // SAFETY: `self.fd` was previously returned by `get_unused_fd_flags`. We have not yet used
+ // the fd, so it is still valid, and `current` still refers to the same task, as this type
+ // cannot be moved across task boundaries.
+ //
+ // Furthermore, the file pointer is guaranteed to own a refcount by its type invariants,
+ // and we take ownership of that refcount by not running the destructor below.
+ // Additionally, the file is known to not have any non-shared `fdget_pos` calls, so even if
+ // this process starts using the file position, this will not result in a data race on the
+ // file position.
+ unsafe { bindings::fd_install(self.fd, file.as_ptr()) };
+
+ // `fd_install` consumes both the file descriptor and the file reference, so we cannot run
+ // the destructors.
+ core::mem::forget(self);
+ core::mem::forget(file);
+ }
+}
+
+impl Drop for FileDescriptorReservation {
+ fn drop(&mut self) {
+ // SAFETY: By the type invariants of this type, `self.fd` was previously returned by
+ // `get_unused_fd_flags`. We have not yet used the fd, so it is still valid, and `current`
+ // still refers to the same task, as this type cannot be moved across task boundaries.
+ unsafe { bindings::put_unused_fd(self.fd) };
+ }
+}
+
/// Represents the `EBADF` error code.
///
/// Used for methods that can only fail with `EBADF`.
--
2.46.0.662.g92d0881bb0-goog
On Sun, Sep 15, 2024 at 02:31:32PM +0000, Alice Ryhl wrote: > +impl Drop for FileDescriptorReservation { > + fn drop(&mut self) { > + // SAFETY: By the type invariants of this type, `self.fd` was previously returned by > + // `get_unused_fd_flags`. We have not yet used the fd, so it is still valid, and `current` > + // still refers to the same task, as this type cannot be moved across task boundaries. > + unsafe { bindings::put_unused_fd(self.fd) }; > + } > +} FWIW, it's a bit more delicate. The real rules for API users are 1) anything you get from get_unused_fd_flags() (well, alloc_fd(), internally) must be passed either to put_unused_fd() or fd_install() before you return from syscall. That should be done by the same thread and all calls of put_unused_fd() or fd_install() should be paired with some get_unused_fd_flags() in that manner (i.e. by the same thread, within the same syscall, etc.) 2) calling thread MUST NOT unshare descriptor table while it has any reserved descriptors. I.e. fd = get_unused_fd(); unshare_files(); fd_install(fd, file); is a bug. Reservations are discarded by that. Getting rid of that constraint would require tracking the sets of reserved descriptors separately for each thread that happens to share the descriptor table. Conceptually they *are* per-thread - the same thread that has done reservation must either discard it or use it. However, it's easier to keep the "it's reserved by some thread" represented in descriptor table itself (bit set in ->open_fds bitmap, file reference in ->fd[] array is NULL) than try and keep track of who's reserved what. The constraint is basically "all reservations can stay with the old copy", i.e. "caller has no reservations of its own to transfer into the new private copy it gets". It's not particularly onerous[*] and it simplifies things quite a bit. However, if we are documenting thing, it needs to be put explicitly. With respect to Rust, if you do e.g. binfmt-in-rust support it will immediately become an issue - begin_new_exec() is calling unshare_files(), so the example above can become an issue. Internally (in fs/file.c, that is) we have additional safety rule - anything that might be given an arbitrary descriptor (e.g. do_dup2() destination can come directly from dup2(2) argument, file_close_fd_locked() victim can come directly from close(2) one, etc.) must leave reserved descriptors alone. Not an issue API users need to watch out for, though. [*] unsharing the descriptor table is done by + close_range(2), which has no reason to allocate any descriptors and is only called by userland. + unshare(2), which has no reason to allocate any descriptors and is only called by userland. + a place in early init that call ksys_unshare() while arranging the environment for /linuxrc from initrd image to be run. Again, no reserved descriptors there. + coredumping thread in the beginning of do_coredump(). The caller is at the point of signal delivery, which means that it had already left whatever syscall it might have been in. Which means that all reservations must have been undone by that point. + execve() at the point of no return (in begin_new_exec()). That's the only place where violation of that constraint on some later changes is plausible. That one needs to be watched out for.
On Sun, Sep 15, 2024 at 8:39 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Sun, Sep 15, 2024 at 02:31:32PM +0000, Alice Ryhl wrote: > > > +impl Drop for FileDescriptorReservation { > > + fn drop(&mut self) { > > + // SAFETY: By the type invariants of this type, `self.fd` was previously returned by > > + // `get_unused_fd_flags`. We have not yet used the fd, so it is still valid, and `current` > > + // still refers to the same task, as this type cannot be moved across task boundaries. > > + unsafe { bindings::put_unused_fd(self.fd) }; > > + } > > +} > > FWIW, it's a bit more delicate. The real rules for API users are > > 1) anything you get from get_unused_fd_flags() (well, alloc_fd(), > internally) must be passed either to put_unused_fd() or fd_install() before > you return from syscall. That should be done by the same thread and > all calls of put_unused_fd() or fd_install() should be paired with > some get_unused_fd_flags() in that manner (i.e. by the same thread, > within the same syscall, etc.) Ok, we have to use it before returning from the syscall. That's fine. What happens if I call `get_unused_fd_flags`, and then never call `put_unused_fd`? Assume that I don't try to use the fd in the future, and that I just forgot to clean up after myself. > 2) calling thread MUST NOT unshare descriptor table while it has > any reserved descriptors. I.e. > fd = get_unused_fd(); > unshare_files(); > fd_install(fd, file); > is a bug. Reservations are discarded by that. Getting rid of that > constraint would require tracking the sets of reserved descriptors > separately for each thread that happens to share the descriptor table. > Conceptually they *are* per-thread - the same thread that has done > reservation must either discard it or use it. However, it's easier to > keep the "it's reserved by some thread" represented in descriptor table > itself (bit set in ->open_fds bitmap, file reference in ->fd[] array is > NULL) than try and keep track of who's reserved what. The constraint is > basically "all reservations can stay with the old copy", i.e. "caller has > no reservations of its own to transfer into the new private copy it gets". > It's not particularly onerous[*] and it simplifies things > quite a bit. However, if we are documenting thing, it needs to be > put explicitly. With respect to Rust, if you do e.g. binfmt-in-rust > support it will immediately become an issue - begin_new_exec() is calling > unshare_files(), so the example above can become an issue. > > Internally (in fs/file.c, that is) we have additional safety > rule - anything that might be given an arbitrary descriptor (e.g. > do_dup2() destination can come directly from dup2(2) argument, > file_close_fd_locked() victim can come directly from close(2) one, > etc.) must leave reserved descriptors alone. Not an issue API users > need to watch out for, though. > > [*] unsharing the descriptor table is done by > + close_range(2), which has no reason to allocate any descriptors > and is only called by userland. > + unshare(2), which has no reason to allocate any descriptors > and is only called by userland. > + a place in early init that call ksys_unshare() while arranging > the environment for /linuxrc from initrd image to be run. Again, no > reserved descriptors there. > + coredumping thread in the beginning of do_coredump(). > The caller is at the point of signal delivery, which means that it had > already left whatever syscall it might have been in. Which means > that all reservations must have been undone by that point. > + execve() at the point of no return (in begin_new_exec()). > That's the only place where violation of that constraint on some later > changes is plausible. That one needs to be watched out for. Thanks for going through that. From a Rust perspective, it sounds easiest to just declare that execve() is an unsafe operation, and that one of the conditions for using it is that you don't hold any fd reservations. Trying to encode this in the fd reservation logic seems too onerous, and I'm guessing execve is not used particularly often anyway. Alice
> What happens if I call `get_unused_fd_flags`, and then never call > `put_unused_fd`? Assume that I don't try to use the fd in the future, > and that I just forgot to clean up after myself. Moderate amount of bogosities while the thread exists. For one thing, descriptor is leaked - for open() et.al. it will look like it's in use. For close() it will look like it's _not_ open (i.e. trying to close it from userland will quietly do nothing). Trying to overwrite it with dup2() will keep failing with -EBUSY. Kernel-side it definitely violates assertions, but currently nothing will break. Might or might not remain true in the future. Doing that again and again would lead to inflated descriptor table, but then so would dup2(0, something_large). IOW, it would be a bug, but it's probably not going to be high impact security hole. > > + execve() at the point of no return (in begin_new_exec()). execve(2), sorry. > > That's the only place where violation of that constraint on some later > > changes is plausible. That one needs to be watched out for. > Thanks for going through that. I'm in the middle of writing documentation on the descriptor table and struct file handling right now anyway... > From a Rust perspective, it sounds > easiest to just declare that execve() is an unsafe operation, and that > one of the conditions for using it is that you don't hold any fd > reservations. Trying to encode this in the fd reservation logic seems > too onerous, and I'm guessing execve is not used particularly often > anyway. Sorry, bad editing on my part - I should've clearly marked execve() as a syscall. It's not that it's an unsafe operation - it's only called from userland anyway, so it's not going to happen in scope of any reserved descriptor. The problem is different: * userland calls execve("/usr/bin/whatever", argv, envp) * that proceeds through several wrappers to do_execveat_common(). There are several syscalls in that family and they all converge to do_execveat_common() - wrappers just deal with difference in calling conventions. * do_execveat_common() set up exec context (struct linux_binprm). That opens the binary to be executed, creates memory context to be used, calculates argc, sets up argv and envp on what will become the userland stack for the new program, etc. - basically, all the work for marshalling the data from caller's memory. Then it calls bprm_execve(), which is where the rest of the work will be done. * bprm_execve() eventually calls exec_binprm(). That calls search_binary_handler(), which is where we finally get a look at the binary we are trying to load. search_binary_handler() goes through the known binary formats (ELF, script, etc.) and tries to offer the exec context to ->load_binary() of each. * ->load_binary() instance looks at the binary (starting with the first 256 bytes read for us by prepare_binprm() called in the beginning of search_binary_handler()). If it doesn't have the right magic values, ->load_binary() just returns -ENOEXEC, so that the next format would be tried. If it *does* look like something this format is going to deal with, more sanity checks are done, things are set up, etc. - details depend upon the binary format in question. See load_elf_binary() for some taste of that. Eventually it decides to actually discard the old memory and switch to new binary. Up to that point it can return an error - -ENOEXEC for soft ones ("not mine, after all, try other formats"), something like -EINVAL/-ENOMEM/-EPERM/-EIO/etc. for hard ones ("fail execve(2) with that error"). _After_ that point we have nothing to return to; old binary is not mapped anymore, userland stack frame is gone, etc. Any errors past that point are treated as "kill me". At the point of no return we call begin_new_exec(). That kills all other threads, unshares descriptor table in case it had been shared wider than the thread group, switches memory context, etc. Once begin_new_exec() is done, we can safely map whatever we want to map, handle relocations, etc. Among other things, we modify the userland register values saved on syscall entry, so that once we return from syscall we'll end up with the right register state at the right userland entry point, etc. If everything goes fine, ->load_binary() returns 0 into search_binary_handler() and we are pretty much done - some cleanups on the way out and off to the loaded binary. Alternatively, we may decide to mangle the exec context - that's what #! handling does (see load_script() - basically it opens the interpreter and massages the arguments, so that something like debian/rules build-arch turns into /usr/bin/make -f debian/rules build-arch and tells the caller to deal with that; this is what the loop in search_binary_handler() is about). There's not a lot of binary formats (5 of those currently - all in fs/binmt_*.c), but there's nothing to prohibit more of them. If somebody decides to add the infrastructure for writing those in Rust, begin_new_exec() wrapper will need to be documented as "never call that in scope of reserved descriptor". Maybe by marking that wrapper unsafe and telling the users about the restriction wrt descriptor reservations, maybe by somehow telling the compiler to watch out for that - or maybe the constraint will be gone by that time. In any case, the underlying constraint ("a thread with reserved descriptors should not try to get a private descriptor table until all those descriptors are disposed of one way or another") needs to be documented.
On Sun, Sep 15, 2024 at 11:01:26PM +0100, Al Viro wrote: > There's not a lot of binary formats (5 of those currently - > all in fs/binmt_*.c), but there's nothing to prohibit more binfmt_*.c, sorry. > of them. If somebody decides to add the infrastructure for > writing those in Rust, begin_new_exec() wrapper will need > to be documented as "never call that in scope of reserved > descriptor". Maybe by marking that wrapper unsafe and > telling the users about the restriction wrt descriptor > reservations, maybe by somehow telling the compiler to > watch out for that - or maybe the constraint will be gone > by that time. > > In any case, the underlying constraint ("a thread with > reserved descriptors should not try to get a private > descriptor table until all those descriptors are disposed > of one way or another") needs to be documented. >
On Sun, Sep 15, 2024 at 07:39:05PM +0100, Al Viro wrote: > 2) calling thread MUST NOT unshare descriptor table while it has > any reserved descriptors. I.e. > fd = get_unused_fd(); > unshare_files(); > fd_install(fd, file); > is a bug. Reservations are discarded by that. Getting rid of that > constraint would require tracking the sets of reserved descriptors > separately for each thread that happens to share the descriptor table. > Conceptually they *are* per-thread - the same thread that has done > reservation must either discard it or use it. However, it's easier to > keep the "it's reserved by some thread" represented in descriptor table > itself (bit set in ->open_fds bitmap, file reference in ->fd[] array is > NULL) than try and keep track of who's reserved what. The constraint is > basically "all reservations can stay with the old copy", i.e. "caller has > no reservations of its own to transfer into the new private copy it gets". FWIW, I toyed with the idea of having reservations kept per-thread; it is possible and it simplifies some things, but I hadn't been able to find a way to do that without buggering syscall latency for open() et.al. It would keep track of the set of reservations in task_struct (count, two-element array for the first two + page pointer for spillovers, for the rare threads that need more than two reserved simultaneously). Representation in fdtable: state open_fds bit value in ->fd[] array free clear 0UL reserved set 0UL uncommitted set 1UL|(unsigned long)file open set (unsigned long)file with file lookup treating any odd value as 0 (i.e. as reserved) fd_install() switching reserved to uncommitted *AND* separate "commit" operation that does this: if current->reservation_count == 0 return if failure for each descriptor in our reserved set v = fdtable->fd[descriptor] if (v) { fdtable->fd[descriptor] = 0; fput((struct file *)(v & ~1); } clear bit in fdtable->open_fds[] else for each descriptor in our reserved set v = fdtable->fd[descriptor] if (v) fdtable->fd[descriptor] = v & ~1; else BUG current->reservation_count = 0 That "commit" thing would be called on return from syscall for userland threads and would be called explicitly in kernel threads that have a descriptor table and work with it. The benefit would be that fd_install() would *NOT* have to be done after the last possible failure exit - something that installs a lot of files wouldn't have to play convoluted games on cleanup. Simply returning an error would do the right thing. Two stores into ->fd[] instead of one is not a big deal; however, anything like task_work_add() to arrange the call of "commit" ends up being bloody awful. We could have it called from syscall glue directly, but that means touching assembler for quite a few architectures, and I hadn't gotten around to that. Can be done, but it's not a pleasant work...
On Sun, Sep 15, 2024 at 08:34:43PM +0100, Al Viro wrote: > FWIW, I toyed with the idea of having reservations kept per-thread; > it is possible and it simplifies some things, but I hadn't been able to > find a way to do that without buggering syscall latency for open() et.al. Hmm... How about the following: * add an equivalent of array of pairs (fd, file) to task_struct; representation could be e.g. (a _VERY_ preliminary variant) unsigned fd_count; int fds[2]; struct file *fp[2]; void *spillover; with 'spillover' being a separately allocated array of pairs to deal with the moments when we have more than 2 simultaneously reserved descriptors. Initially NULL, allocated the first time we need more than 2. Always empty outside of syscall. * inline primitives: count_reserved_fds() reserved_descriptor(index) reserved_file(index) * int reserve_fd(flags) returns -E... or index. slot = current->fd_count if (unlikely(slot == 2) && !current->spillover) { allocate spillover if failed return -ENOMEM set current->spillover } if slot is maximal allowed (2 + how much fits into allocated part?) return -E<something> fd = get_unused_fd_flags(flags); if (unlikely(fd < 0)) return fd; if (likely(slot < 2)) { current->fds[slot] = fd; current->fp[slot] = NULL; } else { store (fd, NULL) into element #(slot - 2) of current->spillover } current->fd_count = slot + 1; * void install_file(index, file) if (likely(slot < 2)) current->fp[slot] = file; else store file to element #(slot - 2) of current->spillover * void __commit_reservations(unsigned count, bool failed) // count == current->fd_count while (count--) { fd = reserved_descriptor(count); file = reserved_file(count); if (!file) put_unused_fd(fd); else if (!failed) fd_install(fd, file); else { put_unused_fd(fd); fput(file); } } current->fd_count = 0; * static inline void commit_fd_reservations(bool failed) called in syscall glue, right after the syscall returns unsigned slots = current->fd_count; if (unlikely(slots)) __commit_reservations(slots, failed); Then we can (in addition to the current use of get_unused_fd_flags() et.al. - that still works) do e.g. things like for (i = 0; i < 69; i++) { index = reserve_fd(FD_CLOEXEC); if (unlikely(index < 0)) return index; file = some_driver_shite(some_shite, i); if (IS_ERR(file)) return PTR_ERR(file); install_file(index, file); // consumed file ioctl_result.some_array[i] = reserved_descriptor(index); .... } ... if (copy_to_user(arg, &ioctl_result, sizeof(ioctl_result)) return -EFAULT; ... return 0; and have it DTRT on all failures, no matter how many files we have added, etc. - on syscall return we will either commit all reservations (on success) or release all reserved descriptors and drop all files we had planned to put into descriptor table. Getting that right manually is doable (drm has some examples), but it's _not_ pleasant. The win here is in simpler cleanup code. And it can coexist with the current API just fine. The PITA is in the need to add the call of commit_fd_reservations() in syscall exit glue and have that done on all architectures ;-/ FWIW, I suspect that it won't be slower than the current API, even if used on hot paths. pipe(2) would be an interesting testcase for that - converting it is easy, and there's a plenty of loads where latency of pipe(2) would be visible. Comments?
© 2016 - 2024 Red Hat, Inc.