Writing to file-backed mappings which require folio dirty tracking using
GUP is a fundamentally broken operation, as kernel write access to GUP
mappings does not adhere to the semantics expected by a file system.

A GUP caller uses the direct mapping to access the folio, which does not
cause write notify to trigger, nor does it enforce that the caller marks
the folio dirty.

The problem arises when, after an initial write to the folio, writeback
results in the folio being cleaned and then the caller, via the GUP
interface, writes to the folio again.

As a result of the use of this secondary, direct, mapping to the folio no
write notify will occur, and if the caller does mark the folio dirty, this
will be done unexpectedly.

For example, consider the following scenario:-

1. A folio is written to via GUP which write-faults the memory, notifying
   the file system and dirtying the folio.
2. Later, writeback is triggered, resulting in the folio being cleaned and
   the PTE being marked read-only.
3. The GUP caller writes to the folio, as it is mapped read/write via the
   direct mapping.
4. The GUP caller, now done with the page, unpins it and sets it dirty
   (though it does not have to).

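
The four steps above can be walked through with a toy userspace model. This
is only a sketch: struct toy_folio and every toy_*() helper are hypothetical
names invented for illustration, not kernel structures or APIs, and the model
assumes a single folio with a single userspace PTE.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model only: these are NOT kernel structures or APIs. */
struct toy_folio {
	bool dirty;             /* file system's view: folio needs writeback */
	bool pte_writable;      /* userspace PTE currently permits writes */
	int  fs_notifications;  /* write-notify events the file system saw */
	char data;              /* stand-in for the folio contents */
};

/* Step 1: a GUP write fault notifies the file system and dirties the folio. */
static void toy_gup_write_fault(struct toy_folio *f, char c)
{
	if (!f->pte_writable) {
		f->fs_notifications++;
		f->pte_writable = true;
	}
	f->dirty = true;
	f->data = c;
}

/* Step 2: writeback cleans the folio and write-protects the PTE. */
static void toy_writeback(struct toy_folio *f)
{
	f->dirty = false;
	f->pte_writable = false;
}

/* Step 3: a write through the direct mapping bypasses the PTE entirely. */
static void toy_direct_map_write(struct toy_folio *f, char c)
{
	f->data = c;	/* no fault, no notification, no dirtying */
}

/* Run steps 1-4 and return the final state of the folio. */
static struct toy_folio toy_run_scenario(bool caller_sets_dirty)
{
	struct toy_folio f = { 0 };

	toy_gup_write_fault(&f, 'a');		/* step 1 */
	toy_writeback(&f);			/* step 2 */
	assert(!f.dirty && !f.pte_writable);	/* file system believes it is clean */
	toy_direct_map_write(&f, 'b');		/* step 3 */
	if (caller_sets_dirty)
		f.dirty = true;			/* step 4, optional */
	return f;
}
```

In this model, toy_run_scenario(false) ends with new data in a folio the file
system still considers clean, and toy_run_scenario(true) ends with a dirty
folio for which the file system only ever saw the first notification.
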
This change updates both the PUP FOLL_LONGTERM slow and fast APIs. As
pin_user_pages_fast_only() does not exist, we can rely on a slightly
imperfect whitelisting in the PUP-fast case and fall back to the slow case
should this fail.
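
The "imperfect whitelist, then fall back" shape can be sketched in userspace
as follows. This is an illustration under stated assumptions, not the code in
the series: the TOY_* flags, enum toy_backing, struct toy_page and the toy_*()
functions are all hypothetical, and the particular set of backings treated as
always safe by the fast path is assumed for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for FOLL_WRITE and FOLL_LONGTERM. */
#define TOY_FOLL_WRITE    0x1u
#define TOY_FOLL_LONGTERM 0x2u
#define TOY_RESTRICTED    (TOY_FOLL_WRITE | TOY_FOLL_LONGTERM)

/* What the fast path can cheaply recognise about a page (assumed set). */
enum toy_backing { TOY_ANON, TOY_SHMEM, TOY_HUGETLB, TOY_OTHER_FILE };

struct toy_page {
	enum toy_backing backing;
	bool needs_dirty_tracking;	/* only the slow path can know this */
};

enum toy_result { TOY_PINNED_FAST, TOY_PINNED_SLOW, TOY_REJECTED };

/* Fast path: conservative whitelist. false means "cannot tell, go slow". */
static bool toy_fast_pin_allowed(const struct toy_page *p, unsigned int flags)
{
	if ((flags & TOY_RESTRICTED) != TOY_RESTRICTED)
		return true;	/* not a long-term write pin */

	switch (p->backing) {
	case TOY_ANON:
	case TOY_SHMEM:
	case TOY_HUGETLB:
		return true;
	default:
		return false;
	}
}

/* Slow path: has full information, so it can make the precise decision. */
static bool toy_slow_pin_allowed(const struct toy_page *p, unsigned int flags)
{
	if ((flags & TOY_RESTRICTED) != TOY_RESTRICTED)
		return true;
	return !p->needs_dirty_tracking;
}

/* Try the fast path first and fall back to the slow path if it declines. */
static enum toy_result toy_pin(const struct toy_page *p, unsigned int flags)
{
	assert(p != NULL);

	if (toy_fast_pin_allowed(p, flags))
		return TOY_PINNED_FAST;
	if (toy_slow_pin_allowed(p, flags))
		return TOY_PINNED_SLOW;
	return TOY_REJECTED;
}
```

The whitelist may decline pages that would in fact be fine; in the model that
costs only a trip through the slow path, which is why an imperfect fast check
is acceptable.
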
v7:
- Fixed very silly bug in writeable_file_mapping_allowed() inverting the
logic.
- Removed unnecessary RCU lock code and replaced with adaptation of Peter's
idea.
- Removed unnecessary open-coded folio_test_anon() in
folio_longterm_write_pin_allowed() and restructured to generally permit
NULL folio_mapping().
v6:
- Rebased on latest mm-unstable as of 28th April 2023.
- Add PUP-fast check with handling for rcu-locked TLB shootdown to synchronise
correctly.
- Split patch series into 3 to make it more digestible.
https://lore.kernel.org/all/cover.1682981880.git.lstoakes@gmail.com/
v5:
- Rebased on latest mm-unstable as of 25th April 2023.
- Some small refactorings suggested by John.
- Added an extended description of the problem in the comment around
writeable_file_mapping_allowed() for clarity.
- Updated commit message as suggested by Mika and John.
https://lore.kernel.org/all/6b73e692c2929dc4613af711bdf92e2ec1956a66.1682638385.git.lstoakes@gmail.com/
v4:
- Split out vma_needs_dirty_tracking() from vma_wants_writenotify() to
reduce duplication and update to use this in the GUP check. Note that
both separately check vm_ops_needs_writenotify() as the latter needs to
test this before the vm_pgprot_modify() test, resulting in
vma_wants_writenotify() checking this twice, however it is such a small
check this should not be egregious.
https://lore.kernel.org/all/3b92d56f55671a0389252379237703df6e86ea48.1682464032.git.lstoakes@gmail.com/
v3:
- Rebased on latest mm-unstable as of 24th April 2023.
- Explicitly check whether file system requires folio dirtying. Note that
vma_wants_writenotify() could not be used directly as it is very much focused
on determining if the PTE r/w should be set (e.g. assuming private mapping
does not require it as already set, soft dirty considerations).
- Tested code against shmem and hugetlb mappings - confirmed that these are not
disallowed by the check.
- Eliminate FOLL_ALLOW_BROKEN_FILE_MAPPING flag and instead perform check only
for FOLL_LONGTERM pins.
- As a result, limit check to internal GUP code.
https://lore.kernel.org/all/23c19e27ef0745f6d3125976e047ee0da62569d4.1682406295.git.lstoakes@gmail.com/
v2:
- Add accidentally excluded ptrace_access_vm() use of
FOLL_ALLOW_BROKEN_FILE_MAPPING.
- Tweak commit message.
https://lore.kernel.org/all/c8ee7e02d3d4f50bb3e40855c53bda39eec85b7d.1682321768.git.lstoakes@gmail.com/
v1:
https://lore.kernel.org/all/f86dc089b460c80805e321747b0898fd1efe93d7.1682168199.git.lstoakes@gmail.com/

Lorenzo Stoakes (3):
  mm/mmap: separate writenotify and dirty tracking logic
  mm/gup: disallow FOLL_LONGTERM GUP-nonfast writing to file-backed
    mappings
  mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed
    mappings

include/linux/mm.h | 1 +
mm/gup.c | 105 +++++++++++++++++++++++++++++++++++++++++++--
mm/mmap.c | 36 ++++++++++++----
3 files changed, 130 insertions(+), 12 deletions(-)
--
2.40.1

On 5/2/23 12:34 PM, Lorenzo Stoakes wrote:
> Writing to file-backed mappings which require folio dirty tracking using
> GUP is a fundamentally broken operation, as kernel write access to GUP
> mappings do not adhere to the semantics expected by a file system.
> [...]
>

FWIW, I realize you are planning another respin, but I went and tried this
version out on s390 -- Now when using a memory backend file and vfio-pci on
s390 I see vfio_pin_pages_remote failing consistently. However, the
pin_user_pages_fast(FOLL_WRITE | FOLL_LONGTERM) in kvm_s390_pci_aif_enable
will still return positive.

On Tue, May 02, 2023 at 02:45:01PM -0400, Matthew Rosato wrote:
> On 5/2/23 12:34 PM, Lorenzo Stoakes wrote:
> > [...]
>
> FWIW, I realize you are planning another respin, but I went and tried this
> version out on s390 -- Now when using a memory backend file and vfio-pci on
> s390 I see vfio_pin_pages_remote failing consistently. However, the
> pin_user_pages_fast(FOLL_WRITE | FOLL_LONGTERM) in kvm_s390_pci_aif_enable
> will still return positive.
>

Hey thanks very much for checking that :)

This version will unconditionally apply the restriction to non-FOLL_LONGTERM
by mistake (ugh) but vfio_pin_pages_remote() does seem to be setting
FOLL_LONGTERM anyway so this seems a legitimate test.

Interesting the _fast() variant succeeds...

David, Jason et al. can speak more to the ins and outs of these
virtualisation cases which I am not so familiar with, but I wonder if we do
need a flag to provide an exception for VFIO.