Subject: famfs: port into fuse

This is the initial RFC for the fabric-attached memory file system (famfs)
integration into fuse. In order to function, this requires a related patch
to libfuse [1] and the famfs user space [2].

This RFC is mainly intended to socialize the approach and get feedback from
the fuse developers and maintainers. There is some dax work that needs to
be done before this should be merged (see the "poisoned page|folio problem"
below).

This patch set fully works with Linux 6.14 -- passing all existing famfs
smoke and unit tests -- and I encourage existing famfs users to test it.

This is really two patch sets mashed up:

* The patches with the dev_dax_iomap: prefix fill in missing functionality for
  devdax to host an fs-dax file system.
* The famfs_fuse: patches add famfs into fs/fuse/. These are effectively
  unchanged since last year.

Because this is not ready to merge yet, I have felt free to leave some debug
prints in place because we still find them useful; those will be cleaned up
in a subsequent revision.

Famfs Overview

Famfs exposes shared memory as a file system. Famfs consumes shared memory
from dax devices, and provides memory-mappable files that map directly to
the memory - no page cache involvement. Famfs differs from conventional
file systems in fs-dax mode, in that it handles in-memory metadata in a
sharable way (which begins with never caching dirty shared metadata).

Famfs started as a standalone file system [3,4], but the consensus at LSFMM
2024 [5] was that it should be ported into fuse - and this RFC is the first
public evidence that I've been working on that.

The key performance requirement is that famfs must resolve mapping faults
without upcalls. This is achieved by fully caching the file-to-devdax
metadata for all active files. This is done via two fuse client/server
message/response pairs: GET_FMAP and GET_DAXDEV.

Famfs remains the first fs-dax file system that is backed by devdax rather
than pmem in fs-dax mode (hence the need for the dev_dax_iomap fixups).

Notes

* Once the dev_dax_iomap patches land, I suspect it may make sense for
  virtiofs to update to use the improved interface.

* I'm currently maintaining compatibility between the famfs user space and
  both the standalone famfs kernel file system and this new fuse
  implementation. In the near future I'll be running performance comparisons
  and sharing them - but there is no reason to expect significant degradation
  with fuse, since famfs caches entire "fmaps" in the kernel to resolve
  faults with no upcalls. This patch has a bit too much debug turned on to
  do that testing quite yet. A branch

* Two new fuse messages / responses are added: GET_FMAP and GET_DAXDEV.

* When a file is looked up in a famfs mount, the LOOKUP is followed by a
  GET_FMAP message and response. The "fmap" is the full file-to-dax mapping,
  allowing the fuse/famfs kernel code to handle read/write/fault without any
  upcalls.

* After each GET_FMAP, the fmap is checked for extents that reference
  previously-unknown daxdevs. Each such occurrence is handled with a
  GET_DAXDEV message and response.

* Daxdevs are stored in a table (which might become an xarray at some point).
  When entries are added to the table, we acquire exclusive access to the
  daxdev via the fs_dax_get() call (modeled after how fs-dax handles this
  with pmem devices). famfs provides holder_operations to devdax, providing
  a notification path in the event of memory errors.

* If devdax notifies famfs of memory errors on a dax device, famfs currently
  blocks all subsequent accesses to data on that device. The recovery is to
  re-initialize the memory and file system. Famfs is memory, not storage...

* Because famfs uses backing (devdax) devices, only privileged mounts are
  supported.

* The famfs kernel code never accesses the memory directly - it only
  facilitates read, write and mmap on behalf of user processes. As such,
  the RAS of the shared memory affects applications, but not the kernel.

* Famfs has backing device(s), but they are devdax (char) rather than
  block. Right now there is no way to tell the vfs layer that famfs has a
  char backing device (unless we say it's block, but it's not). Currently
  we use the standard anonymous fuse fs_type - but I'm not sure that's
  ultimately optimal (thoughts?)

The "poisoned page|folio problem"

* Background: before doing a kernel mount, the famfs user space [2] validates
  the superblock and log. This is done via raw mmap of the primary devdax
  device. If valid, the file system is mounted, and the superblock and log
  get exposed through a pair of files (.meta/.superblock and .meta/.log) -
  because we can't be using raw device mmap when a file system is mounted
  on the device. But this exposes a devdax bug and warning...

* Pages that have been memory mapped via devdax are left in a permanently
  problematic state. Devdax sets page|folio->mapping when a page is accessed
  via raw devdax mmap (as famfs does before mount), but never cleans it up.
  When the pages of the famfs superblock and log are accessed via the "meta"
  files after mount, we see a WARN_ONCE() in dax_insert_entry(), which
  notices that page|folio->mapping is still set. I intend to address this
  prior to asking for the famfs patches to be merged.

* Alistair Popple's recent dax patch series [6], which has been merged
  for 6.15, addresses some dax issues, but sadly does not fix the poisoned
  page|folio problem - its enhanced refcount checking turns the warning into
  an error.

* This 6.14 patch set disables the warning; a proper fix will be required for
  famfs to work at all in 6.15. Dan W. and I are actively discussing how to do
  this properly...

* In terms of the correct functionality of famfs, the warning can be ignored.

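The Notes say that a fully cached fmap lets the kernel handle
read/write/fault without any upcalls. A minimal userspace sketch of that
kind of lookup follows, assuming the simplest case of an ordered extent
list. The names struct sketch_ext, struct sketch_fmap and sketch_resolve()
are hypothetical and only illustrate the idea; they are not the structures
in fs/fuse/famfs_kfmap.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent: one contiguous range on one dax device. */
struct sketch_ext {
	uint32_t dev_index;	/* index into a daxdev table */
	uint64_t dev_offset;	/* byte offset on that device */
	uint64_t len;		/* extent length in bytes */
};

/* Hypothetical cached map: the extents back the file in order. */
struct sketch_fmap {
	uint64_t file_size;
	size_t nextents;
	const struct sketch_ext *ext;
};

/*
 * Translate a file offset into (device index, device offset) using only
 * the cached extent list, i.e. with no call out to the fuse server.
 * Returns 0 on success, -1 if the offset is past EOF or unmapped.
 */
int sketch_resolve(const struct sketch_fmap *fmap, uint64_t file_off,
		   uint32_t *dev_index, uint64_t *dev_off)
{
	uint64_t ext_start = 0;	/* file offset where extent i begins */

	assert(fmap && dev_index && dev_off);
	if (file_off >= fmap->file_size)
		return -1;

	for (size_t i = 0; i < fmap->nextents; i++) {
		const struct sketch_ext *e = &fmap->ext[i];

		if (file_off < ext_start + e->len) {
			*dev_index = e->dev_index;
			*dev_off = e->dev_offset + (file_off - ext_start);
			return 0;
		}
		ext_start += e->len;
	}
	return -1;
}
```

With a two-extent map (4 KiB on device 0, then 8 KiB on device 1), file
offset 0x1800 falls 0x800 bytes into the second extent. A linear walk is
used here only for brevity.
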
References

[1] - https://github.com/libfuse/libfuse/pull/1200
[2] - https://github.com/cxl-micron-reskit/famfs
[3] - https://lore.kernel.org/linux-cxl/cover.1708709155.git.john@groves.net/
[4] - https://lore.kernel.org/linux-cxl/cover.1714409084.git.john@groves.net/
[5] - https://lwn.net/Articles/983105/
[6] - https://lore.kernel.org/linux-cxl/cover.8068ad144a7eea4a813670301f4d2a86a8e68ec4.1740713401.git-series.apopple@nvidia.com/


John Groves (19):
  dev_dax_iomap: Move dax_pgoff_to_phys() from device.c to bus.c
  dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage
  dev_dax_iomap: Save the kva from memremap
  dev_dax_iomap: Add dax_operations for use by fs-dax on devdax
  dev_dax_iomap: export dax_dev_get()
  dev_dax_iomap: (ignore!) Drop poisoned page warning in fs/dax.c
  famfs_fuse: magic.h: Add famfs magic numbers
  famfs_fuse: Kconfig
  famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX/
  famfs_fuse: Basic fuse kernel ABI enablement for famfs
  famfs_fuse: Basic famfs mount opts
  famfs_fuse: Plumb the GET_FMAP message/response
  famfs_fuse: Create files with famfs fmaps
  famfs_fuse: GET_DAXDEV message and daxdev_table
  famfs_fuse: Plumb dax iomap and fuse read/write/mmap
  famfs_fuse: Add holder_operations for dax notify_failure()
  famfs_fuse: Add famfs metadata documentation
  famfs_fuse: Add documentation
  famfs_fuse: (ignore) debug cruft

 Documentation/filesystems/famfs.rst |  142 ++++
 Documentation/filesystems/index.rst |    1 +
 MAINTAINERS                         |   10 +
 drivers/dax/Kconfig                 |    6 +
 drivers/dax/bus.c                   |  144 +++-
 drivers/dax/dax-private.h           |    1 +
 drivers/dax/device.c                |   38 +-
 drivers/dax/super.c                 |   33 +-
 fs/dax.c                            |    1 -
 fs/fuse/Kconfig                     |   13 +
 fs/fuse/Makefile                    |    4 +-
 fs/fuse/dev.c                       |   61 ++
 fs/fuse/dir.c                       |   74 +-
 fs/fuse/famfs.c                     | 1105 +++++++++++++++++++++++++++
 fs/fuse/famfs_kfmap.h               |  166 ++++
 fs/fuse/file.c                      |   27 +-
 fs/fuse/fuse_i.h                    |   67 +-
 fs/fuse/inode.c                     |   49 +-
 fs/fuse/iomode.c                    |    2 +-
 fs/namei.c                          |    1 +
 include/linux/dax.h                 |    6 +
 include/uapi/linux/fuse.h           |   63 ++
 include/uapi/linux/magic.h          |    2 +
 23 files changed, 1973 insertions(+), 43 deletions(-)
 create mode 100644 Documentation/filesystems/famfs.rst
 create mode 100644 fs/fuse/famfs.c
 create mode 100644 fs/fuse/famfs_kfmap.h


base-commit: 38fec10eb60d687e30c8c6b5420d86e8149f7557
-- 
2.49.0

On 25/04/20 08:33PM, John Groves wrote:
> Subject: famfs: port into fuse
>
> <snip>

I'm planning to apply the review comments and send v2 of
this patch series soon - hopefully next week.

I asked a couple of specific questions for Miklos and
Amir at [1] that I hope they will answer in the next few
days. Do you object to zeroing fuse_inodes when they're
allocated, and do I really need an xchg() to set the
fi->famfs_meta pointer during fuse_alloc_inode()? cmpxchg
would be good for avoiding stepping on an "already set"
pointer, but not useful if fi->famfs_meta has random
contents (which it does when allocated).

I plan to move the GET_FMAP message to OPEN time rather than
LOOKUP - unless that leads to problems that I don't
currently foresee. The GET_FMAP response will also get a
variable-sized payload.

Darrick and I have met and discussed commonality between our
use cases, and the only thing from famfs that he will be able
to directly use is the GET_FMAP message/response - but likely
with a different response payload. The file operations in
famfs.c are not applicable for Darrick, as they only handle
mapping file offsets to devdax offsets (i.e. fs-dax over
devdax).

Darrick is primarily exploring adapting block-backed file
systems to use fuse. These are conventional page-cache-backed
files that will primarily be read and written between
blockdev and page cache.

(Darrick, correct me if I got anything important wrong there.)

In prep for Darrick, I'll add an offset and length to the
GET_FMAP message, to specify what range of the file map is
being requested. I'll also add a new "first header" struct
in the GET_FMAP response that can accommodate additional fmap
types, and will specify the file size as well as the offset
and length of the fmap portion described in the response
(allowing for GET_FMAP responses that contain an incomplete
file map).

If there is desire to give GET_FMAP a different name, like
GET_IOMAP, I don't much care - although the term "iomap" may
be a bit overloaded already (e.g. the dax_iomap_rw()/
dax_iomap_fault() functions debatably didn't need "iomap"
in their names since they're about converting a file offset
range to daxdev ranges, and they don't handle anything
specifically iomap-related). At least "FMAP" is more narrowly
descriptive of what it is.

I don't think Darrick needs GET_DAXDEV (or anything
analogous), because passing in the backing dev at mount time
seems entirely sufficient - so I assume that at least for now
GET_DAXDEV won't be shared. But famfs definitely needs
GET_DAXDEV, because files must be able to interleave across
memory devices.

The one issue that I will kick down the road until v3 is
fixing the "poisoned page|folio" problem. Because of that,
v2 of this series will still be against a 6.14 kernel. Not
solving that problem means this series won't be merge-able
until v3.

I hope this is all clear and obvious. Let me know if not (or
if so).

Thanks,
John


[1] https://lore.kernel.org/linux-fsdevel/20250421013346.32530-1-john@groves.net/T/#me47467b781d6c637899a38b898c27afb619206e0

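The ranged GET_FMAP request and the "first header" in the response, as
proposed above, could take roughly the following shape. This is only a
sketch under stated assumptions: the struct names, field names, field
widths and ordering are hypothetical and are not taken from
include/uapi/linux/fuse.h or from any posted patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical request: which range of the file map is wanted. */
struct sketch_get_fmap_in {
	uint64_t offset;
	uint64_t length;
};

/* Hypothetical "first header" leading every GET_FMAP response. */
struct sketch_fmap_header {
	uint32_t fmap_type;	/* selects the payload format that follows */
	uint32_t reserved;
	uint64_t file_size;
	uint64_t fmap_offset;	/* start of the range this response describes */
	uint64_t fmap_length;	/* length of the range this response describes */
};

/* Add without wrapping, so bogus lengths cannot overflow the checks. */
static uint64_t sat_add(uint64_t a, uint64_t b)
{
	return (b > UINT64_MAX - a) ? UINT64_MAX : a + b;
}

/* True if the response describes the whole file. */
bool sketch_fmap_is_complete(const struct sketch_fmap_header *h)
{
	assert(h);
	return h->fmap_offset == 0 && h->fmap_length >= h->file_size;
}

/* True if the response covers the requested range, clamped to EOF. */
bool sketch_fmap_covers(const struct sketch_fmap_header *h,
			const struct sketch_get_fmap_in *in)
{
	uint64_t want_end, have_end;

	assert(h && in);
	if (in->offset >= h->file_size)
		return true;	/* nothing past EOF needs a mapping */

	want_end = sat_add(in->offset, in->length);
	if (want_end > h->file_size)
		want_end = h->file_size;
	have_end = sat_add(h->fmap_offset, h->fmap_length);

	return h->fmap_offset <= in->offset && have_end >= want_end;
}
```

A receiver that gets an incomplete map would use a check like
sketch_fmap_covers() to decide whether another GET_FMAP is needed for the
range at hand.
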
On Thu, May 22, 2025 at 12:30 AM John Groves <John@groves.net> wrote:
>
> On 25/04/20 08:33PM, John Groves wrote:
> > Subject: famfs: port into fuse
> >
> > <snip>
>
> I'm planning to apply the review comments and send v2 of
> this patch series soon - hopefully next week.
>
> I asked a couple of specific questions for Miklos and
> Amir at [1] that I hope they will answer in the next few
> days.

I missed this question.
Feel free to ping me next time if I am not answering.

> Do you object to zeroing fuse_inodes when they're
> allocated, and do I really need an xchg() to set the
> fi->famfs_meta pointer during fuse_alloc_inode()? cmpxchg
> would be good for avoiding stepping on an "already set"
> pointer, but not useful if fi->famfs_meta has random
> contents (which it does when allocated).
>

I don't have anything against zeroing the fuse inode fields
but be careful not to step over fuse_inode_init_once().

The answer to the xchg() question is quite technically boring.
At least in my case it was done to avoid an #ifdef in a C file.

Thanks,
Amir.

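The difference between the two primitives discussed above can be shown
with a userspace analogue built on C11 atomics. This is a sketch only:
struct sketch_inode and both helpers are hypothetical, and the kernel's
xchg()/cmpxchg() macros are not the same API as <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the inode; meta plays fi->famfs_meta. */
struct sketch_inode {
	_Atomic(void *) meta;
};

/*
 * xchg-style store: always installs the new pointer and returns the
 * previous contents, whatever they were.
 */
void *sketch_set_meta(struct sketch_inode *fi, void *meta)
{
	assert(fi);
	return atomic_exchange(&fi->meta, meta);
}

/*
 * cmpxchg-style store: installs the new pointer only if the field is
 * currently NULL. If the field holds stale or uninitialized contents,
 * the comparison fails and nothing is stored, which is why this form
 * only helps when the field was zeroed at allocation.
 */
bool sketch_set_meta_once(struct sketch_inode *fi, void *meta)
{
	void *expected = NULL;

	assert(fi);
	return atomic_compare_exchange_strong(&fi->meta, &expected, meta);
}
```

After the field has been initialized to NULL, the first
sketch_set_meta_once() call succeeds and a second one is refused, while
sketch_set_meta() overwrites unconditionally.
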
On Wed, May 21, 2025 at 05:30:12PM -0500, John Groves wrote:
> On 25/04/20 08:33PM, John Groves wrote:
> > Subject: famfs: port into fuse
> >
> > <snip>
>
> I'm planning to apply the review comments and send v2 of
> this patch series soon - hopefully next week.

Heh, I'm just about to push go on an RFC patchbomb for the entirety of
fuse + iomap + ext4-fuse2fs.

> I asked a couple of specific questions for Miklos and
> Amir at [1] that I hope they will answer in the next few
> days. Do you object to zeroing fuse_inodes when they're
> allocated, and do I really need an xchg() to set the
> fi->famfs_meta pointer during fuse_alloc_inode()? cmpxchg
> would be good for avoiding stepping on an "already set"
> pointer, but not useful if fi->famfs_meta has random
> contents (which it does when allocated).

I guess you could always null it out in fuse_inode_init_once and again
when you free the inode...

> I plan to move the GET_FMAP message to OPEN time rather than
> LOOKUP - unless that leads to problems that I don't
> currently foresee. The GET_FMAP response will also get a
> variable-sized payload.
>
> Darrick and I have met and discussed commonality between our
> use cases, and the only thing from famfs that he will be able
> to directly use is the GET_FMAP message/response - but likely
> with a different response payload. The file operations in
> famfs.c are not applicable for Darrick, as they only handle
> mapping file offsets to devdax offsets (i.e. fs-dax over
> devdax).
>
> Darrick is primarily exploring adapting block-backed file
> systems to use fuse. These are conventional page-cache-backed
> files that will primarily be read and written between
> blockdev and page cache.

Yeah, I really do need to get moving on sending out the RFC.
Everyone: patchbomb incoming!

> (Darrick, correct me if I got anything important wrong there.)
>
> In prep for Darrick, I'll add an offset and length to the
> GET_FMAP message, to specify what range of the file map is
> being requested. I'll also add a new "first header" struct
> in the GET_FMAP response that can accommodate additional fmap
> types, and will specify the file size as well as the offset
> and length of the fmap portion described in the response
> (allowing for GET_FMAP responses that contain an incomplete
> file map).

Hrrmrmrmm. I don't think there's much use in trying to share a fuse
command but then have to parse through the size of the response to
figure out what the server actually sent back. It's less confusing to
have just one response type per fuse command.

I also don't think that FUSE_IOMAP_BEGIN is all that good of an
interface for what John is trying to do. A regular filesystem creates
whatever mappings it likes in response to the far too many file IO APIs
in Linux, and needs to throw them at the kernel. OTOH, famfs'
management daemon creates a static mapping with repeating elements and
that gets uploaded in one go via FUSE_GET_FMAP. Yes, we could mash them
into a single uncohesive mess of an interface, but why would we torture
ourselves so?

(For me it's the "repeating sequences" aspect of GET_FMAP that makes me
think it should be a separate interface. OTOH I haven't thought much
about how to support filesystems that implement RAID.)

> If there is desire to give GET_FMAP a different name, like
> GET_IOMAP, I don't much care - although the term "iomap" may
> be a bit overloaded already (e.g. the dax_iomap_rw()/
> dax_iomap_fault() functions debatably didn't need "iomap"
> in their names since they're about converting a file offset
> range to daxdev ranges, and they don't handle anything
> specifically iomap-related). At least "FMAP" is more narrowly
> descriptive of what it is.
>
> I don't think Darrick needs GET_DAXDEV (or anything
> analogous), because passing in the backing dev at mount time
> seems entirely sufficient - so I assume that at least for now
> GET_DAXDEV won't be shared. But famfs definitely needs
> GET_DAXDEV, because files must be able to interleave across
> memory devices.

I actually /did/ add a notification so that the fuse server can tell the
kernel that they'd like to use a particular fd with iomap. It doesn't
support dax devices by virtue of gatekeeping on S_ISBLK, but it wouldn't
be hard to do that.

> The one issue that I will kick down the road until v3 is
> fixing the "poisoned page|folio" problem. Because of that,
> v2 of this series will still be against a 6.14 kernel. Not
> solving that problem means this series won't be merge-able
> until v3.
>
> I hope this is all clear and obvious. Let me know if not (or
> if so).

Hee hee.

--D

>
> Thanks,
> John
>
>
> [1] https://lore.kernel.org/linux-fsdevel/20250421013346.32530-1-john@groves.net/T/#me47467b781d6c637899a38b898c27afb619206e0
>
>

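The "repeating sequences" and interleaving across memory devices mentioned
in this exchange can be pictured as a striped mapping. The sketch below
assumes a fixed chunk size and one strip per device; struct sketch_strip,
struct sketch_interleave and sketch_interleave_resolve() are hypothetical
names for illustration and do not describe the famfs fmap format.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical strip: where one device's share of the file begins. */
struct sketch_strip {
	uint32_t dev_index;
	uint64_t dev_offset;
};

/* Hypothetical interleaved map: chunks rotate across the strips. */
struct sketch_interleave {
	uint64_t chunk_size;	/* bytes placed on a strip before moving on */
	uint32_t nstrips;
	const struct sketch_strip *strip;
};

/*
 * Resolve a file offset under round-robin interleaving: chunk k lives
 * on strip (k % nstrips), at position (k / nstrips) within that strip.
 */
void sketch_interleave_resolve(const struct sketch_interleave *il,
			       uint64_t file_off,
			       uint32_t *dev_index, uint64_t *dev_off)
{
	const struct sketch_strip *s;
	uint64_t chunk, within, round;

	assert(il && il->chunk_size && il->nstrips && dev_index && dev_off);

	chunk = file_off / il->chunk_size;	/* which chunk of the file */
	within = file_off % il->chunk_size;	/* offset inside that chunk */
	round = chunk / il->nstrips;		/* completed passes over strips */
	s = &il->strip[chunk % il->nstrips];

	*dev_index = s->dev_index;
	*dev_off = s->dev_offset + round * il->chunk_size + within;
}
```

Because the pattern repeats, the whole mapping is described by the chunk
size and the strip list, independent of file size.
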
On Sun, 20 Apr 2025 20:33:27 -0500
John Groves <John@Groves.net> wrote:

> Subject: famfs: port into fuse
>
> <snip>

Hi John,

Apologies if the question is far off or irrelevant.

I am trying to understand FAMFS, and I am thinking where does FAMFS
stand when compared to OpenSHMEM PGAS. Can't we have a OpenSHMEM-based
shared memory implementation over CXL that serves as FAMFS?

Maybe FAMFS does more than that!?!

Thanks,
Alireza

On 25/04/30 03:42PM, Alireza Sanaee wrote:
> On Sun, 20 Apr 2025 20:33:27 -0500
> John Groves <John@Groves.net> wrote:
>
>> <snip>
>
> Hi John,
>
> Apologies if the question is far off or irrelevant.
>
> I am trying to understand FAMFS, and I am thinking where does FAMFS
> stand when compared to OpenSHMEM PGAS. Can't we have a OpenSHMEM-based
> shared memory implementation over CXL that serves as FAMFS?
>
> Maybe FAMFS does more than that!?!
>
> Thanks,
> Alireza
>

Continuation of this conversation likely belongs in the discussion section at
[1], but a couple of thoughts.

Famfs provides scale-out filesystem mounts where the files map to the same
disaggregated shared memory. If you mmap a famfs file, you are accessing
the memory directly. Since shmem is file-backed (usually tmpfs or its ilk),
shmem is a higher-level and more specialized abstraction, and OpenSHMEM may
be able to run atop famfs. It looks like OpenSHMEM and PGAS cover the
possibility that "shared memory" might require grabbing a copy via [r]dma -
which famfs will probably never do. Famfs only handles cases where the
memory is actually shared. (hey, I work for a memory company.)

Since famfs provides memory-mappable files, almost all apps can access them
(no requirement to write to the shmem, or other related but more esoteric
interfaces). Apps are responsible for not doing "nonsense" access WRT cache
coherency, but famfs manages cache coherency for its metadata.

The video at [2] may be useful to get up to speed.

[1] http://github.com/cxl-micron-reskit/famfs
[2] https://www.youtube.com/watch?v=L1QNpb-8VgM&t=1680

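Since the point above is that apps reach famfs through ordinary
memory-mappable files, the access pattern is plain POSIX. A minimal
sketch follows; sketch_map_file() is a hypothetical helper, and nothing
in it is famfs-specific. On famfs, per the description above, the mapping
would refer to the shared memory directly, whereas on a conventional file
system the same calls go through the page cache.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Map an existing file read/write and shared. Returns the mapping and
 * stores its length in *len_out, or returns NULL on any failure.
 */
void *sketch_map_file(const char *path, size_t *len_out)
{
	struct stat st;
	void *addr;
	int fd;

	assert(path && len_out);

	fd = open(path, O_RDWR);
	if (fd < 0)
		return NULL;

	if (fstat(fd, &st) < 0 || st.st_size <= 0) {
		close(fd);
		return NULL;
	}

	addr = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
		    MAP_SHARED, fd, 0);
	close(fd);	/* the mapping remains valid after the fd is closed */
	if (addr == MAP_FAILED)
		return NULL;

	*len_out = (size_t)st.st_size;
	return addr;
}
```

The caller releases the mapping with munmap(addr, len). Coordinating
concurrent access between hosts remains the application's job, as noted
above.
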
On Sun, Apr 20, 2025 at 08:33:27PM -0500, John Groves wrote:
> Subject: famfs: port into fuse
>
> This is the initial RFC for the fabric-attached memory file system (famfs)
> integration into fuse. In order to function, this requires a related patch
> to libfuse [1] and the famfs user space [2].
>
> This RFC is mainly intended to socialize the approach and get feedback from
> the fuse developers and maintainers. There is some dax work that needs to
> be done before this should be merged (see the "poisoned page|folio problem"
> below).

Note that I'm only looking at the fuse and iomap aspects of this
patchset. I don't know the devdax code at all.

> This patch set fully works with Linux 6.14 -- passing all existing famfs
> smoke and unit tests -- and I encourage existing famfs users to test it.
>
> This is really two patch sets mashed up:
>
> * The patches with the dev_dax_iomap: prefix fill in missing functionality for
>   devdax to host an fs-dax file system.
> * The famfs_fuse: patches add famfs into fs/fuse/. These are effectively
>   unchanged since last year.
>
> Because this is not ready to merge yet, I have felt free to leave some debug
> prints in place because we still find them useful; those will be cleaned up
> in a subsequent revision.
>
> Famfs Overview
>
> Famfs exposes shared memory as a file system. Famfs consumes shared memory
> from dax devices, and provides memory-mappable files that map directly to
> the memory - no page cache involvement. Famfs differs from conventional
> file systems in fs-dax mode, in that it handles in-memory metadata in a
> sharable way (which begins with never caching dirty shared metadata).
>
> Famfs started as a standalone file system [3,4], but the consensus at LSFMM
> 2024 [5] was that it should be ported into fuse - and this RFC is the first
> public evidence that I've been working on that.

This is very timely, as I just started looking into how I might connect
iomap to fuse so that most of the hot IO path continues to run in the
kernel, and userspace block device filesystem drivers merely supply the
file mappings to the kernel. In other words, we kick the metadata
parsing craziness out of the kernel.

> The key performance requirement is that famfs must resolve mapping faults
> without upcalls. This is achieved by fully caching the file-to-devdax
> metadata for all active files. This is done via two fuse client/server
> message/response pairs: GET_FMAP and GET_DAXDEV.

Heh, just last week I finally got around to laying out how I think I'd
want to expose iomap through fuse to allow ->iomap_begin/->iomap_end
upcalls to a fuse server. Note that I've done zero prototyping but
"upload all the mappings at open time" seems like a reasonable place for
me to start looking, especially for a filesystem with static mappings.

I think what I want to try to build is an in-kernel mapping cache (sort
of like the one you built), only with upcalls to the fuse server when
there is no mapping information for a given IO. I'd probably want to
have a means for the fuse server to put new mappings into the cache, or
invalidate existing mappings.

(famfs obviously is a simple corner-case of that grandiose vision, but I
still have a long way to get to my larger vision so don't take my words
as any kind of requirement.)

> Famfs remains the first fs-dax file system that is backed by devdax rather
> than pmem in fs-dax mode (hence the need for the dev_dax_iomap fixups).
>
> Notes
>
> * Once the dev_dax_iomap patches land, I suspect it may make sense for
>   virtiofs to update to use the improved interface.
>
> * I'm currently maintaining compatibility between the famfs user space and
>   both the standalone famfs kernel file system and this new fuse
>   implementation. In the near future I'll be running performance comparisons
>   and sharing them - but there is no reason to expect significant degradation
>   with fuse, since famfs caches entire "fmaps" in the kernel to resolve

I'm curious to hear what you find, performance-wise. :)

>   faults with no upcalls. This patch has a bit too much debug turned on to
>   do that testing quite yet. A branch

A branch ... what?

> * Two new fuse messages / responses are added: GET_FMAP and GET_DAXDEV.
>
> * When a file is looked up in a famfs mount, the LOOKUP is followed by a
>   GET_FMAP message and response. The "fmap" is the full file-to-dax mapping,
>   allowing the fuse/famfs kernel code to handle read/write/fault without any
>   upcalls.

Huh, I'd have thought you'd wait until FUSE_OPEN to start preloading
mappings into the kernel.

> * After each GET_FMAP, the fmap is checked for extents that reference
>   previously-unknown daxdevs. Each such occurrence is handled with a
>   GET_DAXDEV message and response.

I hadn't figured out how this part would work for my silly prototype.
Just out of curiosity, does the famfs fuse server hold an open fd to the
storage, in which case the fmap(ping) could just contain the open fd?

Where are the mappings that are sent from the fuse server? Is that
struct fuse_famfs_simple_ext?

> * Daxdevs are stored in a table (which might become an xarray at some point).
>   When entries are added to the table, we acquire exclusive access to the
>   daxdev via the fs_dax_get() call (modeled after how fs-dax handles this
>   with pmem devices). famfs provides holder_operations to devdax, providing
>   a notification path in the event of memory errors.
>
> * If devdax notifies famfs of memory errors on a dax device, famfs currently
>   blocks all subsequent accesses to data on that device. The recovery is to
>   re-initialize the memory and file system. Famfs is memory, not storage...

Ouch. :)

> * Because famfs uses backing (devdax) devices, only privileged mounts are
>   supported.
>
> * The famfs kernel code never accesses the memory directly - it only
>   facilitates read, write and mmap on behalf of user processes. As such,
>   the RAS of the shared memory affects applications, but not the kernel.
>
> * Famfs has backing device(s), but they are devdax (char) rather than
>   block. Right now there is no way to tell the vfs layer that famfs has a
>   char backing device (unless we say it's block, but it's not). Currently
>   we use the standard anonymous fuse fs_type - but I'm not sure that's
>   ultimately optimal (thoughts?)

Does it work if the fusefs server adds "-o fsname=<devdax cdev>" to the
fuse_args object? fuse2fs does that, though I don't recall if that's a
reasonable thing to do.

> The "poisoned page|folio problem"
>
> * Background: before doing a kernel mount, the famfs user space [2] validates
>   the superblock and log. This is done via raw mmap of the primary devdax
>   device. If valid, the file system is mounted, and the superblock and log
>   get exposed through a pair of files (.meta/.superblock and .meta/.log) -
>   because we can't be using raw device mmap when a file system is mounted
>   on the device. But this exposes a devdax bug and warning...
>
> * Pages that have been memory mapped via devdax are left in a permanently
>   problematic state. Devdax sets page|folio->mapping when a page is accessed
>   via raw devdax mmap (as famfs does before mount), but never cleans it up.
>   When the pages of the famfs superblock and log are accessed via the "meta"
>   files after mount, we see a WARN_ONCE() in dax_insert_entry(), which
>   notices that page|folio->mapping is still set. I intend to address this
>   prior to asking for the famfs patches to be merged.
>
> * Alistair Popple's recent dax patch series [6], which has been merged
>   for 6.15, addresses some dax issues, but sadly does not fix the poisoned
>   page|folio problem - its enhanced refcount checking turns the warning into
>   an error.
>
> * This 6.14 patch set disables the warning; a proper fix will be required for
>   famfs to work at all in 6.15. Dan W. and I are actively discussing how to do
>   this properly...
>
> * In terms of the correct functionality of famfs, the warning can be ignored.
>
> References
>
> [1] - https://github.com/libfuse/libfuse/pull/1200
> [2] - https://github.com/cxl-micron-reskit/famfs

Thanks for posting links, I'll have a look there too.

--D

> [3] - https://lore.kernel.org/linux-cxl/cover.1708709155.git.john@groves.net/
> [4] - https://lore.kernel.org/linux-cxl/cover.1714409084.git.john@groves.net/
> [5] - https://lwn.net/Articles/983105/
> [6] - https://lore.kernel.org/linux-cxl/cover.8068ad144a7eea4a813670301f4d2a86a8e68ec4.1740713401.git-series.apopple@nvidia.com/
>
>
> John Groves (19):
>   dev_dax_iomap: Move dax_pgoff_to_phys() from device.c to bus.c
>   dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage
>   dev_dax_iomap: Save the kva from memremap
>   dev_dax_iomap: Add dax_operations for use by fs-dax on devdax
>   dev_dax_iomap: export dax_dev_get()
>   dev_dax_iomap: (ignore!)
Drop poisoned page warning in fs/dax.c > famfs_fuse: magic.h: Add famfs magic numbers > famfs_fuse: Kconfig > famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX/ > famfs_fuse: Basic fuse kernel ABI enablement for famfs > famfs_fuse: Basic famfs mount opts > famfs_fuse: Plumb the GET_FMAP message/response > famfs_fuse: Create files with famfs fmaps > famfs_fuse: GET_DAXDEV message and daxdev_table > famfs_fuse: Plumb dax iomap and fuse read/write/mmap > famfs_fuse: Add holder_operations for dax notify_failure() > famfs_fuse: Add famfs metadata documentation > famfs_fuse: Add documentation > famfs_fuse: (ignore) debug cruft > > Documentation/filesystems/famfs.rst | 142 ++++ > Documentation/filesystems/index.rst | 1 + > MAINTAINERS | 10 + > drivers/dax/Kconfig | 6 + > drivers/dax/bus.c | 144 +++- > drivers/dax/dax-private.h | 1 + > drivers/dax/device.c | 38 +- > drivers/dax/super.c | 33 +- > fs/dax.c | 1 - > fs/fuse/Kconfig | 13 + > fs/fuse/Makefile | 4 +- > fs/fuse/dev.c | 61 ++ > fs/fuse/dir.c | 74 +- > fs/fuse/famfs.c | 1105 +++++++++++++++++++++++++++ > fs/fuse/famfs_kfmap.h | 166 ++++ > fs/fuse/file.c | 27 +- > fs/fuse/fuse_i.h | 67 +- > fs/fuse/inode.c | 49 +- > fs/fuse/iomode.c | 2 +- > fs/namei.c | 1 + > include/linux/dax.h | 6 + > include/uapi/linux/fuse.h | 63 ++ > include/uapi/linux/magic.h | 2 + > 23 files changed, 1973 insertions(+), 43 deletions(-) > create mode 100644 Documentation/filesystems/famfs.rst > create mode 100644 fs/fuse/famfs.c > create mode 100644 fs/fuse/famfs_kfmap.h > > > base-commit: 38fec10eb60d687e30c8c6b5420d86e8149f7557 > -- > 2.49.0 > >
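The mapping-cache idea in this reply (answer from a cached mapping, upcall to the fuse server only on a miss, and let the server upload or invalidate mappings) can be sketched roughly as below. This is a userspace illustration under stated assumptions, not code from the patch set or from fuse: every name (`struct mapping`, `struct map_cache`, `cache_lookup`, `demo_server_upcall`) is hypothetical, and the fixed-size array stands in for whatever structure a real implementation would choose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration; not taken from the patch set. */
struct mapping {
	uint64_t file_off;	/* start of the mapped range in the file */
	uint64_t len;		/* length of the range in bytes */
	uint64_t dev_off;	/* where the range starts on the backing device */
	int valid;
};

#define CACHE_SLOTS 8

struct map_cache {
	struct mapping slots[CACHE_SLOTS];
	unsigned int upcalls;	/* how often the server had to be asked */
};

/* Stand-in for a round trip to the fuse server. */
typedef int (*upcall_fn)(uint64_t file_off, struct mapping *out);

static const struct mapping *cache_find(const struct map_cache *c, uint64_t off)
{
	for (size_t i = 0; i < CACHE_SLOTS; i++) {
		const struct mapping *m = &c->slots[i];

		if (m->valid && off >= m->file_off && off - m->file_off < m->len)
			return m;
	}
	return NULL;
}

/* Server-initiated upload: store a mapping in the first free slot. */
static int cache_insert(struct map_cache *c, const struct mapping *m)
{
	for (size_t i = 0; i < CACHE_SLOTS; i++) {
		if (!c->slots[i].valid) {
			c->slots[i] = *m;
			c->slots[i].valid = 1;
			return 0;
		}
	}
	return -1;	/* cache full; a real cache would evict something */
}

/* Drop every cached mapping that overlaps [off, off + len). */
static void cache_invalidate(struct map_cache *c, uint64_t off, uint64_t len)
{
	for (size_t i = 0; i < CACHE_SLOTS; i++) {
		struct mapping *m = &c->slots[i];

		if (m->valid && m->file_off < off + len && off < m->file_off + m->len)
			m->valid = 0;
	}
}

/* Resolve a file offset to a device offset, upcalling only on a miss. */
static int cache_lookup(struct map_cache *c, uint64_t off, upcall_fn upcall,
			uint64_t *dev_off)
{
	const struct mapping *m = cache_find(c, off);
	struct mapping fresh;

	assert(upcall != NULL);
	if (!m) {
		c->upcalls++;
		if (upcall(off, &fresh) != 0)
			return -1;
		/* Reject a reply that does not cover the requested offset. */
		if (off < fresh.file_off || off - fresh.file_off >= fresh.len)
			return -1;
		(void)cache_insert(c, &fresh);	/* best effort; answer anyway */
		m = &fresh;
	}
	*dev_off = m->dev_off + (off - m->file_off);
	return 0;
}

/* Toy server: 1 MiB extents laid out contiguously from a made-up base. */
static int demo_server_upcall(uint64_t off, struct mapping *out)
{
	const uint64_t ext = 1ULL << 20;

	out->file_off = off - off % ext;
	out->len = ext;
	out->dev_off = 0x10000000ULL + out->file_off;
	out->valid = 1;
	return 0;
}
```

In this sketch a filesystem with static mappings, as famfs appears to have, would fill the cache up front via `cache_insert()` and never take the miss path; a more general filesystem would rely on the upcall and on `cache_invalidate()` after operations such as truncate.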
On 25/04/21 11:27AM, Darrick J. Wong wrote: > On Sun, Apr 20, 2025 at 08:33:27PM -0500, John Groves wrote: > > Subject: famfs: port into fuse > > > > This is the initial RFC for the fabric-attached memory file system (famfs) > > integration into fuse. In order to function, this requires a related patch > > to libfuse [1] and the famfs user space [2]. > > > > This RFC is mainly intended to socialize the approach and get feedback from > > the fuse developers and maintainers. There is some dax work that needs to > > be done before this should be merged (see the "poisoned page|folio problem" > > below). > > Note that I'm only looking at the fuse and iomap aspects of this > patchset. I don't know the devdax code at all. > > > This patch set fully works with Linux 6.14 -- passing all existing famfs > > smoke and unit tests -- and I encourage existing famfs users to test it. > > > > This is really two patch sets mashed up: > > > > * The patches with the dev_dax_iomap: prefix fill in missing functionality for > > devdax to host an fs-dax file system. > > * The famfs_fuse: patches add famfs into fs/fuse/. These are effectively > > unchanged since last year. > > > > Because this is not ready to merge yet, I have felt free to leave some debug > > prints in place because we still find them useful; those will be cleaned up > > in a subsequent revision. > > > > Famfs Overview > > > > Famfs exposes shared memory as a file system. Famfs consumes shared memory > > from dax devices, and provides memory-mappable files that map directly to > > the memory - no page cache involvement. Famfs differs from conventional > > file systems in fs-dax mode, in that it handles in-memory metadata in a > > sharable way (which begins with never caching dirty shared metadata). > > > > Famfs started as a standalone file system [3,4], but the consensus at LSFMM > > 2024 [5] was that it should be ported into fuse - and this RFC is the first > > public evidence that I've been working on that. 
> > This is very timely, as I just started looking into how I might connect > iomap to fuse so that most of the hot IO path continues to run in the > kernel, and userspace block device filesystem drivers merely supply the > file mappings to the kernel. In other words, we kick the metadata > parsing craziness out of the kernel. Coool! > > > The key performance requirement is that famfs must resolve mapping faults > > without upcalls. This is achieved by fully caching the file-to-devdax > > metadata for all active files. This is done via two fuse client/server > > message/response pairs: GET_FMAP and GET_DAXDEV. > > Heh, just last week I finally got around to laying out how I think I'd > want to expose iomap through fuse to allow ->iomap_begin/->iomap_end > upcalls to a fuse server. Note that I've done zero prototyping but > "upload all the mappings at open time" seems like a reasonable place for > me to start looking, especially for a filesystem with static mappings. > > I think what I want to try to build is an in-kernel mapping cache (sort > of like the one you built), only with upcalls to the fuse server when > there is no mapping information for a given IO. I'd probably want to > have a means for the fuse server to put new mappings into the cache, or > invalidate existing mappings. > > (famfs obviously is a simple corner-case of that grandiose vision, but I > still have a long way to get to my larger vision so don't take my words > as any kind of requirement.) > > > Famfs remains the first fs-dax file system that is backed by devdax rather > > than pmem in fs-dax mode (hence the need for the dev_dax_iomap fixups). > > > > Notes > > > > * Once the dev_dax_iomap patches land, I suspect it may make sense for > > virtiofs to update to use the improved interface. > > > > * I'm currently maintaining compatibility between the famfs user space and > > both the standalone famfs kernel file system and this new fuse > > implementation. 
In the near future I'll be running performance comparisons > > and sharing them - but there is no reason to expect significant degradation > > with fuse, since famfs caches entire "fmaps" in the kernel to resolve > > I'm curious to hear what you find, performance-wise. :) > > > faults with no upcalls. This patch has a bit too much debug turned on to > > to that testing quite yet. A branch > > A branch ... what? I trail off sometimes... ;) > > > * Two new fuse messages / responses are added: GET_FMAP and GET_DAXDEV. > > > > * When a file is looked up in a famfs mount, the LOOKUP is followed by a > > GET_FMAP message and response. The "fmap" is the full file-to-dax mapping, > > allowing the fuse/famfs kernel code to handle read/write/fault without any > > upcalls. > > Huh, I'd have thought you'd wait until FUSE_OPEN to start preloading > mappings into the kernel. That may be a better approach. Miklos and I discussed it during LPC last year and thought both were options. Having implemented it at LOOKUP time, I think moving it to open might avoid my READDIRPLUS problem: RDP is a mashup of READDIR and LOOKUP, and therefore might also need to carry the GET_FMAP payload. Moving GET_FMAP to open time would break that connection in a good way, I think. > > > * After each GET_FMAP, the fmap is checked for extents that reference > > previously-unknown daxdevs. Each such occurence is handled with a > > GET_DAXDEV message and response. > > I hadn't figured out how this part would work for my silly prototype. > Just out of curiosity, does the famfs fuse server hold an open fd to the > storage, in which case the fmap(ping) could just contain the open fd? > > Where are the mappings that are sent from the fuse server? Is that > struct fuse_famfs_simple_ext? See patch 17 or fs/fuse/famfs_kfmap.h for the fmap metadata explanation. Famfs currently supports either simple extents (daxdev, offset, length) or interleaved ones (which describe each "strip" as a simple extent).
I think the explanation in famfs_kfmap.h is pretty clear. A key question is whether any additional basic metadata abstractions would be needed - because the kernel needs to understand the full scheme. With disaggregated memory, the interleave approach is nice because it gets aggregated performance and resolving a file offset to daxdev offset is order 1. Oh, and there are two fmap formats (ok, more, but the others are legacy ;). The fmaps-in-messages structs are currently in the famfs section of include/uapi/linux/fuse.h. And the in-memory version is in fs/fuse/famfs_kfmap.h. The former will need to be a versioned interface. (ugh...) > > > * Daxdevs are stored in a table (which might become an xarray at some point). > > When entries are added to the table, we acquire exclusive access to the > > daxdev via the fs_dax_get() call (modeled after how fs-dax handles this > > with pmem devices). famfs provides holder_operations to devdax, providing > > a notification path in the event of memory errors. > > > > * If devdax notifies famfs of memory errors on a dax device, famfs currently > > bocks all subsequent accesses to data on that device. The recovery is to > > re-initialize the memory and file system. Famfs is memory, not storage... > > Ouch. :) Cautious initial approach (i.e. I'm trying not to scare people too much ;) > > > * Because famfs uses backing (devdax) devices, only privileged mounts are > > supported. > > > > * The famfs kernel code never accesses the memory directly - it only > > facilitates read, write and mmap on behalf of user processes. As such, > > the RAS of the shared memory affects applications, but not the kernel. > > > > * Famfs has backing device(s), but they are devdax (char) rather than > > block. Right now there is no way to tell the vfs layer that famfs has a > > char backing device (unless we say it's block, but it's not). Currently > > we use the standard anonymous fuse fs_type - but I'm not sure that's > > ultimately optimal (thoughts?) 
> > Does it work if the fusefs server adds "-o fsname=<devdax cdev>" to the > fuse_args object? fuse2fs does that, though I don't recall if that's a > reasonable thing to do. The kernel needs to "own" the dax devices. fs-dax on pmem/block calls fs_dax_get_by_bdev() and passes in holder_operations - which are used for error upcalls, but also effect exclusive ownership. I added fs_dax_get() since the bdev version wasn't really right for char devdax. But same holder_operations. I had originally intended to pass in "-o daxdev=<cdev>", but famfs needs to span multiple daxdevs, in order to interleave for performance. The approach of retrieving them with GET_DAXDEV handles the generalized case, so "-o" just amounts to a second way to do the same thing. "But wait"... I thought. Doesn't the "-o" approach get the primary daxdev locked up sooner, which might be good? Well, no, because famfs creates a couple of meta files during mount (.meta/.superblock and .meta/.log), and those are guaranteed to reference the primary daxdev. So I concluded the -o approach wasn't worth the trouble (though it's not *much* trouble). > > > The "poisoned page|folio problem" > > > > * Background: before doing a kernel mount, the famfs user space [2] validates > > the superblock and log. This is done via raw mmap of the primary devdax > > device. If valid, the file system is mounted, and the superblock and log > > get exposed through a pair of files (.meta/.superblock and .meta/.log) - > > because we can't be using raw device mmap when a file system is mounted > > on the device. But this exposes a devdax bug and warning... > > > > * Pages that have been memory mapped via devdax are left in a permanently > > problematic state. Devdax sets page|folio->mapping when a page is accessed > > via raw devdax mmap (as famfs does before mount), but never cleans it up.
> > When the pages of the famfs superblock and log are accessed via the "meta" > > files after mount, we see a WARN_ONCE() in dax_insert_entry(), which > > notices that page|folio->mapping is still set. I intend to address this > > prior to asking for the famfs patches to be merged. > > > > * Alistair Popple's recent dax patch series [6], which has been merged > > for 6.15, addresses some dax issues, but sadly does not fix the poisoned > > page|folio problem - its enhanced refcount checking turns the warning into > > an error. > > > > * This 6.14 patch set disables the warning; a proper fix will be required for > > famfs to work at all in 6.15. Dan W. and I are actively discussing how to do > > this properly... > > > > * In terms of the correct functionality of famfs, the warning can be ignored. > > > > References > > > > [1] - https://github.com/libfuse/libfuse/pull/1200 > > [2] - https://github.com/cxl-micron-reskit/famfs > > Thanks for posting links, I'll have a look there too. > > --D > I'm happy to talk if you wanna kick ideas around. Cheers, John
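To make the "order 1" remark about interleaved extents concrete, here is a minimal sketch of resolving a file offset to a (daxdev, offset) pair with a fixed amount of arithmetic. It assumes a round-robin layout in which fixed-size chunks rotate across the strips; that layout, and all names here (`struct simple_ext`, `struct interleaved_ext`, `chunk_size`, `resolve_interleaved`), are illustrative assumptions and not the definitions in fs/fuse/famfs_kfmap.h.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical types for illustration; not the structs in famfs_kfmap.h. */
struct simple_ext {
	uint32_t dev_index;	/* which daxdev this strip lives on */
	uint64_t dev_offset;	/* byte offset of the strip within that daxdev */
	uint64_t len;		/* bytes available in the strip */
};

struct interleaved_ext {
	uint64_t chunk_size;	/* bytes placed on one strip before rotating */
	uint32_t nstrips;
	const struct simple_ext *strips;
};

struct dev_addr {
	uint32_t dev_index;
	uint64_t dev_offset;
};

/*
 * Resolve a file offset with a constant number of divisions, independent
 * of file size and strip count. Returns 0 on success, -1 if the offset
 * falls beyond the end of the selected strip.
 */
static int resolve_interleaved(const struct interleaved_ext *ie,
			       uint64_t file_off, struct dev_addr *out)
{
	assert(ie->chunk_size > 0 && ie->nstrips > 0);

	uint64_t chunk = file_off / ie->chunk_size;	/* global chunk number */
	uint64_t within = file_off % ie->chunk_size;	/* offset inside chunk */
	uint32_t strip = (uint32_t)(chunk % ie->nstrips);
	uint64_t row = chunk / ie->nstrips;	/* full rotations before it */
	uint64_t strip_off = row * ie->chunk_size + within;
	const struct simple_ext *se = &ie->strips[strip];

	if (strip_off >= se->len)
		return -1;

	out->dev_index = se->dev_index;
	out->dev_offset = se->dev_offset + strip_off;
	return 0;
}
```

With two strips and a 4096-byte chunk, file offset 4106 falls in chunk 1, which lands on strip 1 at strip offset 10; file offset 8197 falls in chunk 2, which wraps back to strip 0 at strip offset 4101.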
On Mon, Apr 21, 2025 at 05:00:35PM -0500, John Groves wrote: > On 25/04/21 11:27AM, Darrick J. Wong wrote: > > On Sun, Apr 20, 2025 at 08:33:27PM -0500, John Groves wrote: > > > Subject: famfs: port into fuse > > > > > > This is the initial RFC for the fabric-attached memory file system (famfs) > > > integration into fuse. In order to function, this requires a related patch > > > to libfuse [1] and the famfs user space [2]. > > > > > > This RFC is mainly intended to socialize the approach and get feedback from > > > the fuse developers and maintainers. There is some dax work that needs to > > > be done before this should be merged (see the "poisoned page|folio problem" > > > below). > > > > Note that I'm only looking at the fuse and iomap aspects of this > > patchset. I don't know the devdax code at all. > > > > > This patch set fully works with Linux 6.14 -- passing all existing famfs > > > smoke and unit tests -- and I encourage existing famfs users to test it. > > > > > > This is really two patch sets mashed up: > > > > > > * The patches with the dev_dax_iomap: prefix fill in missing functionality for > > > devdax to host an fs-dax file system. > > > * The famfs_fuse: patches add famfs into fs/fuse/. These are effectively > > > unchanged since last year. > > > > > > Because this is not ready to merge yet, I have felt free to leave some debug > > > prints in place because we still find them useful; those will be cleaned up > > > in a subsequent revision. > > > > > > Famfs Overview > > > > > > Famfs exposes shared memory as a file system. Famfs consumes shared memory > > > from dax devices, and provides memory-mappable files that map directly to > > > the memory - no page cache involvement. Famfs differs from conventional > > > file systems in fs-dax mode, in that it handles in-memory metadata in a > > > sharable way (which begins with never caching dirty shared metadata). 
> > > > > > Famfs started as a standalone file system [3,4], but the consensus at LSFMM > > > 2024 [5] was that it should be ported into fuse - and this RFC is the first > > > public evidence that I've been working on that. > > > > This is very timely, as I just started looking into how I might connect > > iomap to fuse so that most of the hot IO path continues to run in the > > kernel, and userspace block device filesystem drivers merely supply the > > file mappings to the kernel. In other words, we kick the metadata > > parsing craziness out of the kernel. > > Coool! > > > > > > The key performance requirement is that famfs must resolve mapping faults > > > without upcalls. This is achieved by fully caching the file-to-devdax > > > metadata for all active files. This is done via two fuse client/server > > > message/response pairs: GET_FMAP and GET_DAXDEV. > > > > Heh, just last week I finally got around to laying out how I think I'd > > want to expose iomap through fuse to allow ->iomap_begin/->iomap_end > > upcalls to a fuse server. Note that I've done zero prototyping but > > "upload all the mappings at open time" seems like a reasonable place for > > me to start looking, especially for a filesystem with static mappings. > > > > I think what I want to try to build is an in-kernel mapping cache (sort > > of like the one you built), only with upcalls to the fuse server when > > there is no mapping information for a given IO. I'd probably want to > > have a means for the fuse server to put new mappings into the cache, or > > invalidate existing mappings. > > > > (famfs obviously is a simple corner-case of that grandiose vision, but I > > still have a long way to get to my larger vision so don't take my words > > as any kind of requirement.) > > > > > Famfs remains the first fs-dax file system that is backed by devdax rather > > > than pmem in fs-dax mode (hence the need for the dev_dax_iomap fixups). 
> > > > > > Notes > > > > > > * Once the dev_dax_iomap patches land, I suspect it may make sense for > > > virtiofs to update to use the improved interface. > > > > > > * I'm currently maintaining compatibility between the famfs user space and > > > both the standalone famfs kernel file system and this new fuse > > > implementation. In the near future I'll be running performance comparisons > > > and sharing them - but there is no reason to expect significant degradation > > > with fuse, since famfs caches entire "fmaps" in the kernel to resolve > > > > I'm curious to hear what you find, performance-wise. :) > > > > > faults with no upcalls. This patch has a bit too much debug turned on to > > > to that testing quite yet. A branch > > > > A branch ... what? > > I trail off sometimes... ;) > > > > > > * Two new fuse messages / responses are added: GET_FMAP and GET_DAXDEV. > > > > > > * When a file is looked up in a famfs mount, the LOOKUP is followed by a > > > GET_FMAP message and response. The "fmap" is the full file-to-dax mapping, > > > allowing the fuse/famfs kernel code to handle read/write/fault without any > > > upcalls. > > > > Huh, I'd have thought you'd wait until FUSE_OPEN to start preloading > > mappings into the kernel. > > That may be a better approach. Miklos and I discussed it during LPC last year, > and thought both were options. Having implemented it at LOOKUP time, I think > moving it to open might avoid my READDIRPLUS problem (which is that RDP is a > mashup of READDIR and LOOKUP), therefore might need to add the GET_FMAP > payload. Moving GET_FMAP to open time, would break that connection in a good > way, I think. I wonder if we could just add a couple new "notification" types so that the fuse server can initiate uploads of mappings whenever it feels like it. For your usage model I don't think it'll make much difference since they seem pretty static, but the ability to do that would open up some flexibility for famfs. 
The more general filesystems will need it anyway, and someone's going to want to truncate a famfs file. They always do. ;) > > > > > * After each GET_FMAP, the fmap is checked for extents that reference > > > previously-unknown daxdevs. Each such occurence is handled with a > > > GET_DAXDEV message and response. > > > > I hadn't figured out how this part would work for my silly prototype. > > Just out of curiosity, does the famfs fuse server hold an open fd to the > > storage, in which case the fmap(ping) could just contain the open fd? > > > > Where are the mappings that are sent from the fuse server? Is that > > struct fuse_famfs_simple_ext? > > See patch 17 or fs/fuse/famfs_kfmap.h for the fmap metadata explanation. > Famfs currently supports either simple extents (daxdev, offset, length) or > interleaved ones (which describe each "strip" as a simple extent). I think > the explanation in famfs_kfmap.h is pretty clear. > > A key question is whether any additional basic metadata abstractions would > be needed - because the kernel needs to understand the full scheme. > > With disaggregated memory, the interleave approach is nice because it gets > aggregated performance and resolving a file offset to daxdev offset is order > 1. > > Oh, and there are two fmap formats (ok, more, but the others are legacy ;). > The fmaps-in-messages structs are currently in the famfs section of > include/uapi/linux/fuse.h. And the in-memory version is in > fs/fuse/famfs_kfmap.h. The former will need to be a versioned interface. > (ugh...) Ok, will take a look tomorrow morning. > > > > > * Daxdevs are stored in a table (which might become an xarray at some point). > > > When entries are added to the table, we acquire exclusive access to the > > > daxdev via the fs_dax_get() call (modeled after how fs-dax handles this > > > with pmem devices). famfs provides holder_operations to devdax, providing > > > a notification path in the event of memory errors. 
> > > > > > * If devdax notifies famfs of memory errors on a dax device, famfs currently > > > bocks all subsequent accesses to data on that device. The recovery is to > > > re-initialize the memory and file system. Famfs is memory, not storage... > > > > Ouch. :) > > Cautious initial approach (i.e. I'm trying not to scare people too much ;) > > > > > > * Because famfs uses backing (devdax) devices, only privileged mounts are > > > supported. > > > > > > * The famfs kernel code never accesses the memory directly - it only > > > facilitates read, write and mmap on behalf of user processes. As such, > > > the RAS of the shared memory affects applications, but not the kernel. > > > > > > * Famfs has backing device(s), but they are devdax (char) rather than > > > block. Right now there is no way to tell the vfs layer that famfs has a > > > char backing device (unless we say it's block, but it's not). Currently > > > we use the standard anonymous fuse fs_type - but I'm not sure that's > > > ultimately optimal (thoughts?) > > > > Does it work if the fusefs server adds "-o fsname=<devdax cdev>" to the > > fuse_args object? fuse2fs does that, though I don't recall if that's a > > reasonable thing to do. > > The kernel needs to "own" the dax devices. fs-dax on pmem/block calls > fs_dax_get_by_bdev() and passes in holder_operations - which are used for > error upcalls, but also effect exclusive ownership. > > I added fs_dax_get() since the bdev version wasn't really right or char > devdax. But same holder_operations. > > I had originally intended to pass in "-o daxdev=<cdev>", but famfs needs to > span multiple daxdevs, in order to interleave for performance. The approach > of retrieving them with GET_DAXDEV handles the generalized case, so "-o" > just amounts to a second way to do the same thing. Oh, hah, it's a multi-device filesystem. Hee hee hee... > "But wait"... I thought. Doesn't the "-o" approach get the primary daxdev > locked up sooner, which might be good? 
Well, no, because famfs creates a > couple of meta files during mount .meta/.superblock and .meta/.log - and > those are guaranteed to reference the primary daxdev. So I concluded the -o > approach wasn't worth the trouble (though it's not *much* trouble). <nod> For block devices, someone needs to own the bdev O_EXCL, but it doesn't have to be the kernel. Though ... I wonder what *does* happen when something tries to invoke the bdev holder_ops? Maybe it would be nice to freeze the fs, but I don't know if fuse already does that. > > > > > The "poisoned page|folio problem" > > > > > > * Background: before doing a kernel mount, the famfs user space [2] validates > > > the superblock and log. This is done via raw mmap of the primary devdax > > > device. If valid, the file system is mounted, and the superblock and log > > > get exposed through a pair of files (.meta/.superblock and .meta/.log) - > > > because we can't be using raw device mmap when a file system is mounted > > > on the device. But this exposes a devdax bug and warning... > > > > > > * Pages that have been memory mapped via devdax are left in a permanently > > > problematic state. Devdax sets page|folio->mapping when a page is accessed > > > via raw devdax mmap (as famfs does before mount), but never cleans it up. > > > When the pages of the famfs superblock and log are accessed via the "meta" > > > files after mount, we see a WARN_ONCE() in dax_insert_entry(), which > > > notices that page|folio->mapping is still set. I intend to address this > > > prior to asking for the famfs patches to be merged. > > > > > > * Alistair Popple's recent dax patch series [6], which has been merged > > > for 6.15, addresses some dax issues, but sadly does not fix the poisoned > > > page|folio problem - its enhanced refcount checking turns the warning into > > > an error. > > > > > > * This 6.14 patch set disables the warning; a proper fix will be required for > > > famfs to work at all in 6.15. Dan W.
and I are actively discussing how to do > > > this properly... > > > > > > * In terms of the correct functionality of famfs, the warning can be ignored. > > > > > > References > > > > > > [1] - https://github.com/libfuse/libfuse/pull/1200 > > > [2] - https://github.com/cxl-micron-reskit/famfs > > > > Thanks for posting links, I'll have a look there too. > > > > --D > > > > I'm happy to talk if you wanna kick ideas around. Heheh I will, but give me a day or two to wander through the rest of the patches, or maybe just decide to pull the branch and look at one huge diff. --D > Cheers, > John > >
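Since the thread keeps returning to the multi-device aspect, the GET_FMAP / GET_DAXDEV sequence from the cover letter (scan a new fmap for extents that reference previously-unknown daxdevs, and fetch each one once) can be sketched as follows. This is a userspace illustration only: the table layout, `MAX_DAXDEVS`, `resolve_new_daxdevs`, and the `demo_get_daxdev` callback standing in for the GET_DAXDEV round trip are all hypothetical names, and the real code additionally takes exclusive ownership of each device via fs_dax_get().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DAXDEVS 16	/* hypothetical table size */

/* Hypothetical table: one flag per daxdev index already resolved. */
struct daxdev_table {
	bool known[MAX_DAXDEVS];
};

struct ext {
	uint32_t dev_index;
	uint64_t offset;
	uint64_t len;
};

/* Stand-in for the GET_DAXDEV message and response. */
typedef int (*get_daxdev_fn)(uint32_t dev_index);

/*
 * Walk the extents of a freshly received fmap and request every daxdev
 * that is not in the table yet. Returns the number of requests issued,
 * or -1 on a bad index or a failed request.
 */
static int resolve_new_daxdevs(struct daxdev_table *t, const struct ext *exts,
			       size_t n, get_daxdev_fn get_daxdev)
{
	int requests = 0;

	assert(get_daxdev != NULL);
	for (size_t i = 0; i < n; i++) {
		uint32_t idx = exts[i].dev_index;

		if (idx >= MAX_DAXDEVS)
			return -1;
		if (t->known[idx])
			continue;	/* already resolved by an earlier fmap */
		if (get_daxdev(idx) != 0)
			return -1;
		t->known[idx] = true;
		requests++;
	}
	return requests;
}

/* Toy server side: every device lookup succeeds. */
static int demo_get_daxdev(uint32_t dev_index)
{
	(void)dev_index;
	return 0;
}
```

An fmap whose extents reference devices 0, 1 and 0 again triggers two requests the first time it is seen and none afterwards, which is why the per-file cost of a multi-device layout stays small once the table is warm.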
On 25/04/21 06:25PM, Darrick J. Wong wrote: > On Mon, Apr 21, 2025 at 05:00:35PM -0500, John Groves wrote: > > On 25/04/21 11:27AM, Darrick J. Wong wrote: > > > On Sun, Apr 20, 2025 at 08:33:27PM -0500, John Groves wrote: > > > > Subject: famfs: port into fuse > > > > > > > > This is the initial RFC for the fabric-attached memory file system (famfs) > > > > integration into fuse. In order to function, this requires a related patch > > > > to libfuse [1] and the famfs user space [2]. > > > > > > > > This RFC is mainly intended to socialize the approach and get feedback from > > > > the fuse developers and maintainers. There is some dax work that needs to > > > > be done before this should be merged (see the "poisoned page|folio problem" > > > > below). > > > > > > Note that I'm only looking at the fuse and iomap aspects of this > > > patchset. I don't know the devdax code at all. > > > > > > > This patch set fully works with Linux 6.14 -- passing all existing famfs > > > > smoke and unit tests -- and I encourage existing famfs users to test it. > > > > > > > > This is really two patch sets mashed up: > > > > > > > > * The patches with the dev_dax_iomap: prefix fill in missing functionality for > > > > devdax to host an fs-dax file system. > > > > * The famfs_fuse: patches add famfs into fs/fuse/. These are effectively > > > > unchanged since last year. > > > > > > > > Because this is not ready to merge yet, I have felt free to leave some debug > > > > prints in place because we still find them useful; those will be cleaned up > > > > in a subsequent revision. > > > > > > > > Famfs Overview > > > > > > > > Famfs exposes shared memory as a file system. Famfs consumes shared memory > > > > from dax devices, and provides memory-mappable files that map directly to > > > > the memory - no page cache involvement. 
Famfs differs from conventional > > > > file systems in fs-dax mode, in that it handles in-memory metadata in a > > > > sharable way (which begins with never caching dirty shared metadata). > > > > > > > > Famfs started as a standalone file system [3,4], but the consensus at LSFMM > > > > 2024 [5] was that it should be ported into fuse - and this RFC is the first > > > > public evidence that I've been working on that. > > > > > > This is very timely, as I just started looking into how I might connect > > > iomap to fuse so that most of the hot IO path continues to run in the > > > kernel, and userspace block device filesystem drivers merely supply the > > > file mappings to the kernel. In other words, we kick the metadata > > > parsing craziness out of the kernel. > > > > Coool! > > > > > > > > > The key performance requirement is that famfs must resolve mapping faults > > > > without upcalls. This is achieved by fully caching the file-to-devdax > > > > metadata for all active files. This is done via two fuse client/server > > > > message/response pairs: GET_FMAP and GET_DAXDEV. > > > > > > Heh, just last week I finally got around to laying out how I think I'd > > > want to expose iomap through fuse to allow ->iomap_begin/->iomap_end > > > upcalls to a fuse server. Note that I've done zero prototyping but > > > "upload all the mappings at open time" seems like a reasonable place for > > > me to start looking, especially for a filesystem with static mappings. > > > > > > I think what I want to try to build is an in-kernel mapping cache (sort > > > of like the one you built), only with upcalls to the fuse server when > > > there is no mapping information for a given IO. I'd probably want to > > > have a means for the fuse server to put new mappings into the cache, or > > > invalidate existing mappings. 
> > > > > > (famfs obviously is a simple corner-case of that grandiose vision, but I > > > still have a long way to get to my larger vision so don't take my words > > > as any kind of requirement.) > > > > > > > Famfs remains the first fs-dax file system that is backed by devdax rather > > > > than pmem in fs-dax mode (hence the need for the dev_dax_iomap fixups). > > > > > > > > Notes > > > > > > > > * Once the dev_dax_iomap patches land, I suspect it may make sense for > > > > virtiofs to update to use the improved interface. > > > > > > > > * I'm currently maintaining compatibility between the famfs user space and > > > > both the standalone famfs kernel file system and this new fuse > > > > implementation. In the near future I'll be running performance comparisons > > > > and sharing them - but there is no reason to expect significant degradation > > > > with fuse, since famfs caches entire "fmaps" in the kernel to resolve > > > > > > I'm curious to hear what you find, performance-wise. :) > > > > > > > faults with no upcalls. This patch has a bit too much debug turned on to > > > > to that testing quite yet. A branch > > > > > > A branch ... what? > > > > I trail off sometimes... ;) > > > > > > > > > * Two new fuse messages / responses are added: GET_FMAP and GET_DAXDEV. > > > > > > > > * When a file is looked up in a famfs mount, the LOOKUP is followed by a > > > > GET_FMAP message and response. The "fmap" is the full file-to-dax mapping, > > > > allowing the fuse/famfs kernel code to handle read/write/fault without any > > > > upcalls. > > > > > > Huh, I'd have thought you'd wait until FUSE_OPEN to start preloading > > > mappings into the kernel. > > > > That may be a better approach. Miklos and I discussed it during LPC last year, > > and thought both were options. 
> > Having implemented it at LOOKUP time, I think
> > moving it to open might avoid my READDIRPLUS problem (which is that RDP is a
> > mashup of READDIR and LOOKUP), therefore might need to add the GET_FMAP
> > payload. Moving GET_FMAP to open time would break that connection in a good
> > way, I think.
>
> I wonder if we could just add a couple new "notification" types so that
> the fuse server can initiate uploads of mappings whenever it feels like
> it. For your usage model I don't think it'll make much difference since
> they seem pretty static, but the ability to do that would open up some
> flexibility for famfs. The more general filesystems will need it
> anyway, and someone's going to want to truncate a famfs file. They
> always do. ;)
>
> > >
> > > > * After each GET_FMAP, the fmap is checked for extents that reference
> > > >   previously-unknown daxdevs. Each such occurrence is handled with a
> > > >   GET_DAXDEV message and response.
> > >
> > > I hadn't figured out how this part would work for my silly prototype.
> > > Just out of curiosity, does the famfs fuse server hold an open fd to the
> > > storage, in which case the fmap(ping) could just contain the open fd?
> > >
> > > Where are the mappings that are sent from the fuse server? Is that
> > > struct fuse_famfs_simple_ext?
> >
> > See patch 17 or fs/fuse/famfs_kfmap.h for the fmap metadata explanation.
> > Famfs currently supports either simple extents (daxdev, offset, length) or
> > interleaved ones (which describe each "strip" as a simple extent). I think
> > the explanation in famfs_kfmap.h is pretty clear.
> >
> > A key question is whether any additional basic metadata abstractions would
> > be needed - because the kernel needs to understand the full scheme.
> >
> > With disaggregated memory, the interleave approach is nice because it gets
> > aggregated performance and resolving a file offset to daxdev offset is order
> > 1.
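To make the "order 1" claim concrete, here is a minimal sketch of resolving a file offset against an interleaved extent, assuming fixed-size chunks laid out round-robin across the strips. The struct and function names (simple_ext, interleaved_ext, resolve_interleaved) and the chunk_size field are hypothetical illustrations, not the definitions in fs/fuse/famfs_kfmap.h or include/uapi/linux/fuse.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simple extent: (daxdev, offset, length). */
struct simple_ext {
	uint32_t dev_index;	/* index into a table of daxdevs */
	uint64_t dev_offset;	/* byte offset of the strip on that daxdev */
	uint64_t len;		/* length of the strip in bytes */
};

/* Hypothetical interleaved extent: chunks go round-robin across strips. */
struct interleaved_ext {
	uint64_t chunk_size;
	size_t nstrips;
	const struct simple_ext *strips;
};

/*
 * Resolve a file offset to (daxdev index, daxdev offset) with a fixed
 * number of divisions and no searching. Returns 0 on success, or -1 if
 * the offset falls beyond the end of the selected strip.
 */
static int resolve_interleaved(const struct interleaved_ext *ie, uint64_t file_off,
			       uint32_t *dev_index, uint64_t *dev_offset)
{
	assert(ie->chunk_size > 0 && ie->nstrips > 0);

	uint64_t chunk = file_off / ie->chunk_size;	/* global chunk number */
	uint64_t within = file_off % ie->chunk_size;	/* offset inside the chunk */
	size_t strip = chunk % ie->nstrips;		/* which strip holds it */
	uint64_t stripe = chunk / ie->nstrips;		/* row within that strip */
	uint64_t strip_off = stripe * ie->chunk_size + within;

	if (strip_off >= ie->strips[strip].len)
		return -1;

	*dev_index = ie->strips[strip].dev_index;
	*dev_offset = ie->strips[strip].dev_offset + strip_off;
	return 0;
}
```

With two strips and a 4096-byte chunk, for example, file offset 4106 lands in chunk 1, which is the first chunk of the second strip, at byte 10 of that strip. The cost does not depend on the file size or the number of strips.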
> >
> > Oh, and there are two fmap formats (ok, more, but the others are legacy ;).
> > The fmaps-in-messages structs are currently in the famfs section of
> > include/uapi/linux/fuse.h. And the in-memory version is in
> > fs/fuse/famfs_kfmap.h. The former will need to be a versioned interface.
> > (ugh...)
>
> Ok, will take a look tomorrow morning.
>
> > >
> > > > * Daxdevs are stored in a table (which might become an xarray at some point).
> > > >   When entries are added to the table, we acquire exclusive access to the
> > > >   daxdev via the fs_dax_get() call (modeled after how fs-dax handles this
> > > >   with pmem devices). famfs provides holder_operations to devdax, providing
> > > >   a notification path in the event of memory errors.
> > > >
> > > > * If devdax notifies famfs of memory errors on a dax device, famfs currently
> > > >   blocks all subsequent accesses to data on that device. The recovery is to
> > > >   re-initialize the memory and file system. Famfs is memory, not storage...
> > >
> > > Ouch. :)
> >
> > Cautious initial approach (i.e. I'm trying not to scare people too much ;)
> >
> > >
> > > > * Because famfs uses backing (devdax) devices, only privileged mounts are
> > > >   supported.
> > > >
> > > > * The famfs kernel code never accesses the memory directly - it only
> > > >   facilitates read, write and mmap on behalf of user processes. As such,
> > > >   the RAS of the shared memory affects applications, but not the kernel.
> > > >
> > > > * Famfs has backing device(s), but they are devdax (char) rather than
> > > >   block. Right now there is no way to tell the vfs layer that famfs has a
> > > >   char backing device (unless we say it's block, but it's not). Currently
> > > >   we use the standard anonymous fuse fs_type - but I'm not sure that's
> > > >   ultimately optimal (thoughts?)
> > >
> > > Does it work if the fusefs server adds "-o fsname=<devdax cdev>" to the
> > > fuse_args object?
> > > fuse2fs does that, though I don't recall if that's a
> > > reasonable thing to do.
> >
> > The kernel needs to "own" the dax devices. fs-dax on pmem/block calls
> > fs_dax_get_by_bdev() and passes in holder_operations - which are used for
> > error upcalls, but also effect exclusive ownership.
> >
> > I added fs_dax_get() since the bdev version wasn't really right for char
> > devdax. But same holder_operations.
> >
> > I had originally intended to pass in "-o daxdev=<cdev>", but famfs needs to
> > span multiple daxdevs, in order to interleave for performance. The approach
> > of retrieving them with GET_DAXDEV handles the generalized case, so "-o"
> > just amounts to a second way to do the same thing.
>
> Oh, hah, it's a multi-device filesystem. Hee hee hee...

Hee hee indeed.

The thing about memory, and dax devices, is that there isn't anything like
device mapper that can make compound or interleaved devices. There's not a
"stop while dma happens" point for swizzling addresses. I'm down for a
discussion about whether there is a viable way to have a mapper layer, but
I also think constructing interleaved objects as files is quite good - and
might be the best solution.

Interleaving is essential to memory performance in general. System-ram is
pretty much never not interleaved. And there are some reasons why
programming the hardware to do the interleaving is gonna be a problem for
non-static setups. I'll save going down that rathole for a different time...

John