[RFC PATCH 00/19] famfs: port into fuse

John Groves posted 19 patches 7 months, 3 weeks ago
There is a newer version of this series
[RFC PATCH 00/19] famfs: port into fuse
Posted by John Groves 7 months, 3 weeks ago
Subject: famfs: port into fuse

This is the initial RFC for the fabric-attached memory file system (famfs)
integration into fuse. In order to function, this requires a related patch
to libfuse [1] and the famfs user space [2]. 

This RFC is mainly intended to socialize the approach and get feedback from
the fuse developers and maintainers. There is some dax work that needs to
be done before this should be merged (see the "poisoned page|folio problem"
below).

This patch set fully works with Linux 6.14 -- passing all existing famfs
smoke and unit tests -- and I encourage existing famfs users to test it.

This is really two patch sets mashed up:

* The patches with the dev_dax_iomap: prefix fill in missing functionality for
  devdax to host an fs-dax file system.
* The famfs_fuse: patches add famfs into fs/fuse/. These are effectively
  unchanged since last year.

Because this is not ready to merge yet, I have felt free to leave some debug
prints in place because we still find them useful; those will be cleaned up
in a subsequent revision.

Famfs Overview

Famfs exposes shared memory as a file system. Famfs consumes shared memory
from dax devices, and provides memory-mappable files that map directly to
the memory - no page cache involvement. Famfs differs from conventional
file systems in fs-dax mode, in that it handles in-memory metadata in a
sharable way (which begins with never caching dirty shared metadata).

Famfs started as a standalone file system [3,4], but the consensus at LSFMM
2024 [5] was that it should be ported into fuse - and this RFC is the first
public evidence that I've been working on that.

The key performance requirement is that famfs must resolve mapping faults
without upcalls. This is achieved by fully caching the file-to-devdax
metadata for all active files. This is done via two fuse client/server
message/response pairs: GET_FMAP and GET_DAXDEV.
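
To make the no-upcall fault path concrete, here is a minimal userspace
sketch of resolving a file offset against a fully cached map. It assumes
a simple list of extents; the struct and function names are hypothetical
and are not taken from fs/fuse/famfs.c or famfs_kfmap.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cached extent: a contiguous run of file bytes on one daxdev. */
struct sketch_extent {
	uint32_t dev_index;  /* index into a daxdev table */
	uint64_t dev_offset; /* byte offset on that daxdev */
	uint64_t len;        /* extent length in bytes */
};

/* Hypothetical per-file map, cached in full after a GET_FMAP-like reply. */
struct sketch_fmap {
	uint64_t file_size;
	size_t nextents;
	const struct sketch_extent *extents;
};

/*
 * Resolve a file offset to (dev_index, dev_offset) using only cached data.
 * Returns 0 on success, -1 if the offset is not covered by the map.
 */
static int sketch_resolve(const struct sketch_fmap *fm, uint64_t file_off,
			  uint32_t *dev_index, uint64_t *dev_off)
{
	uint64_t start = 0; /* file offset where the current extent begins */
	size_t i;

	assert(fm && dev_index && dev_off);
	if (file_off >= fm->file_size)
		return -1;

	for (i = 0; i < fm->nextents; i++) {
		const struct sketch_extent *e = &fm->extents[i];

		if (file_off < start + e->len) {
			*dev_index = e->dev_index;
			*dev_off = e->dev_offset + (file_off - start);
			return 0;
		}
		start += e->len;
	}
	return -1;
}
```

Because every input to the lookup is already in memory, a read, write or
fault handler built this way never has to wait on the fuse server.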

Famfs remains the first fs-dax file system that is backed by devdax rather
than pmem in fs-dax mode (hence the need for the dev_dax_iomap fixups).

Notes

* Once the dev_dax_iomap patches land, I suspect it may make sense for
  virtiofs to update to use the improved interface.

* I'm currently maintaining compatibility between the famfs user space and
  both the standalone famfs kernel file system and this new fuse
  implementation. In the near future I'll be running performance comparisons
  and sharing them - but there is no reason to expect significant degradation
  with fuse, since famfs caches entire "fmaps" in the kernel to resolve
  faults with no upcalls. This patch has a bit too much debug turned on to
  do that testing quite yet. A branch 

* Two new fuse messages / responses are added: GET_FMAP and GET_DAXDEV.

* When a file is looked up in a famfs mount, the LOOKUP is followed by a
  GET_FMAP message and response. The "fmap" is the full file-to-dax mapping,
  allowing the fuse/famfs kernel code to handle read/write/fault without any
  upcalls.

* After each GET_FMAP, the fmap is checked for extents that reference
  previously-unknown daxdevs. Each such occurrence is handled with a
  GET_DAXDEV message and response.

* Daxdevs are stored in a table (which might become an xarray at some point).
  When entries are added to the table, we acquire exclusive access to the
  daxdev via the fs_dax_get() call (modeled after how fs-dax handles this
  with pmem devices). famfs provides holder_operations to devdax, providing
  a notification path in the event of memory errors.

* If devdax notifies famfs of memory errors on a dax device, famfs currently
  blocks all subsequent accesses to data on that device. The recovery is to
  re-initialize the memory and file system. Famfs is memory, not storage...

* Because famfs uses backing (devdax) devices, only privileged mounts are
  supported.

* The famfs kernel code never accesses the memory directly - it only
  facilitates read, write and mmap on behalf of user processes. As such,
  the RAS of the shared memory affects applications, but not the kernel.

* Famfs has backing device(s), but they are devdax (char) rather than
  block. Right now there is no way to tell the vfs layer that famfs has a
  char backing device (unless we say it's block, but it's not). Currently
  we use the standard anonymous fuse fs_type - but I'm not sure that's
  ultimately optimal (thoughts?)
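
The "check each new fmap for previously-unknown daxdevs" step from the
notes above can be sketched roughly as follows. This is a userspace
illustration under assumed names: the fixed-size table, the index type
and sketch_unknown_daxdevs() are all hypothetical, and the real code
would issue a GET_DAXDEV for each index returned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_DAXDEVS 8 /* arbitrary size for this sketch */

/* Hypothetical daxdev table: valid[i] is set once device i is known. */
struct sketch_daxdev_table {
	bool valid[SKETCH_MAX_DAXDEVS];
};

/*
 * Scan the device indices referenced by a new fmap's extents and collect
 * each index that is not yet in the table, without duplicates.
 * Returns the number of unknown devices found; at most need_cap of them
 * are stored in need[].
 */
static size_t sketch_unknown_daxdevs(const struct sketch_daxdev_table *t,
				     const uint32_t *ext_dev, size_t nextents,
				     uint32_t *need, size_t need_cap)
{
	bool seen[SKETCH_MAX_DAXDEVS] = { false };
	size_t n = 0;
	size_t i;

	assert(t && ext_dev && need);
	for (i = 0; i < nextents; i++) {
		uint32_t idx = ext_dev[i];

		assert(idx < SKETCH_MAX_DAXDEVS);
		if (t->valid[idx] || seen[idx])
			continue;
		seen[idx] = true;
		if (n < need_cap)
			need[n] = idx;
		n++;
	}
	return n;
}
```

Each index in need[] stands for one GET_DAXDEV round trip; once the
table entry is filled in, later fmaps that reference the same device
cost nothing extra.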

The "poisoned page|folio problem"

* Background: before doing a kernel mount, the famfs user space [2] validates
  the superblock and log. This is done via raw mmap of the primary devdax
  device. If valid, the file system is mounted, and the superblock and log
  get exposed through a pair of files (.meta/.superblock and .meta/.log) -
  because we can't be using raw device mmap when a file system is mounted
  on the device. But this exposes a devdax bug and warning...

* Pages that have been memory mapped via devdax are left in a permanently
  problematic state. Devdax sets page|folio->mapping when a page is accessed
  via raw devdax mmap (as famfs does before mount), but never cleans it up.
  When the pages of the famfs superblock and log are accessed via the "meta"
  files after mount, we see a WARN_ONCE() in dax_insert_entry(), which
  notices that page|folio->mapping is still set. I intend to address this
  prior to asking for the famfs patches to be merged.

* Alistair Popple's recent dax patch series [6], which has been merged
  for 6.15, addresses some dax issues, but sadly does not fix the poisoned
  page|folio problem - its enhanced refcount checking turns the warning into
  an error.

* This 6.14 patch set disables the warning; a proper fix will be required for
  famfs to work at all in 6.15. Dan W. and I are actively discussing how to do
  this properly...

* In terms of the correct functionality of famfs, the warning can be ignored.

References

[1] - https://github.com/libfuse/libfuse/pull/1200
[2] - https://github.com/cxl-micron-reskit/famfs
[3] - https://lore.kernel.org/linux-cxl/cover.1708709155.git.john@groves.net/
[4] - https://lore.kernel.org/linux-cxl/cover.1714409084.git.john@groves.net/
[5] - https://lwn.net/Articles/983105/
[6] - https://lore.kernel.org/linux-cxl/cover.8068ad144a7eea4a813670301f4d2a86a8e68ec4.1740713401.git-series.apopple@nvidia.com/


John Groves (19):
  dev_dax_iomap: Move dax_pgoff_to_phys() from device.c to bus.c
  dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage
  dev_dax_iomap: Save the kva from memremap
  dev_dax_iomap: Add dax_operations for use by fs-dax on devdax
  dev_dax_iomap: export dax_dev_get()
  dev_dax_iomap: (ignore!) Drop poisoned page warning in fs/dax.c
  famfs_fuse: magic.h: Add famfs magic numbers
  famfs_fuse: Kconfig
  famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX/
  famfs_fuse: Basic fuse kernel ABI enablement for famfs
  famfs_fuse: Basic famfs mount opts
  famfs_fuse: Plumb the GET_FMAP message/response
  famfs_fuse: Create files with famfs fmaps
  famfs_fuse: GET_DAXDEV message and daxdev_table
  famfs_fuse: Plumb dax iomap and fuse read/write/mmap
  famfs_fuse: Add holder_operations for dax notify_failure()
  famfs_fuse: Add famfs metadata documentation
  famfs_fuse: Add documentation
  famfs_fuse: (ignore) debug cruft

 Documentation/filesystems/famfs.rst |  142 ++++
 Documentation/filesystems/index.rst |    1 +
 MAINTAINERS                         |   10 +
 drivers/dax/Kconfig                 |    6 +
 drivers/dax/bus.c                   |  144 +++-
 drivers/dax/dax-private.h           |    1 +
 drivers/dax/device.c                |   38 +-
 drivers/dax/super.c                 |   33 +-
 fs/dax.c                            |    1 -
 fs/fuse/Kconfig                     |   13 +
 fs/fuse/Makefile                    |    4 +-
 fs/fuse/dev.c                       |   61 ++
 fs/fuse/dir.c                       |   74 +-
 fs/fuse/famfs.c                     | 1105 +++++++++++++++++++++++++++
 fs/fuse/famfs_kfmap.h               |  166 ++++
 fs/fuse/file.c                      |   27 +-
 fs/fuse/fuse_i.h                    |   67 +-
 fs/fuse/inode.c                     |   49 +-
 fs/fuse/iomode.c                    |    2 +-
 fs/namei.c                          |    1 +
 include/linux/dax.h                 |    6 +
 include/uapi/linux/fuse.h           |   63 ++
 include/uapi/linux/magic.h          |    2 +
 23 files changed, 1973 insertions(+), 43 deletions(-)
 create mode 100644 Documentation/filesystems/famfs.rst
 create mode 100644 fs/fuse/famfs.c
 create mode 100644 fs/fuse/famfs_kfmap.h


base-commit: 38fec10eb60d687e30c8c6b5420d86e8149f7557
-- 
2.49.0
Re: [RFC PATCH 00/19] famfs: port into fuse
Posted by John Groves 6 months, 3 weeks ago
On 25/04/20 08:33PM, John Groves wrote:
> Subject: famfs: port into fuse
>
> <snip>

I'm planning to apply the review comments and send v2 of
this patch series soon - hopefully next week.

I asked a couple of specific questions for Miklos and
Amir at [1] that I hope they will answer in the next few
days. Do you object to zeroing fuse_inodes when they're
allocated, and do I really need an xchg() to set the
fi->famfs_meta pointer during fuse_alloc_inode()? cmpxchg
would be good for avoiding stepping on an "already set"
pointer, but not useful if fi->famfs_meta has random
contents (which it does when allocated).
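
To illustrate the distinction, here is a small userspace sketch that
uses C11 <stdatomic.h> as a stand-in for the kernel's xchg() and
cmpxchg(). The struct and helper names are hypothetical. The point is
that the compare-and-swap variant only protects an "already set" pointer
if the field is known to start out NULL.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an inode carrying a metadata pointer. */
struct sketch_inode {
	_Atomic(void *) meta;
};

/* xchg-style: store unconditionally and return whatever was there before. */
static void *sketch_set_meta_xchg(struct sketch_inode *si, void *meta)
{
	assert(si);
	return atomic_exchange(&si->meta, meta);
}

/*
 * cmpxchg-style: store only if the field is currently NULL.
 * Returns true if the store happened. This is only meaningful when the
 * field was zeroed at allocation; with random initial contents the
 * comparison against NULL would fail (or succeed) arbitrarily.
 */
static bool sketch_set_meta_cmpxchg(struct sketch_inode *si, void *meta)
{
	void *expected = NULL;

	assert(si);
	return atomic_compare_exchange_strong(&si->meta, &expected, meta);
}
```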

I plan to move the GET_FMAP message to OPEN time rather than
LOOKUP - unless that leads to problems that I don't
currently foresee. The GET_FMAP response will also get a
variable-sized payload.

Darrick and I have met and discussed commonality between our
use cases, and the only thing from famfs that he will be able
to directly use is the GET_FMAP message/response - but likely
with a different response payload. The file operations in
famfs.c are not applicable for Darrick, as they only handle
mapping file offsets to devdax offsets (i.e. fs-dax over
devdax).

Darrick is primarily exploring adapting block-backed file
systems to use fuse. These are conventional page-cache-backed
files that will primarily be read and written between
blockdev and page cache.

(Darrick, correct me if I got anything important wrong there.)

In prep for Darrick, I'll add an offset and length to the
GET_FMAP message, to specify what range of the file map is
being requested. I'll also add a new "first header" struct
in the GET_FMAP response that can accommodate additional fmap
types, and will specify the file size as well as the offset
and length of the fmap portion described in the response
(allowing for GET_FMAP responses that contain an incomplete
file map).
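
As a rough picture of what that could look like, here is a sketch of a
ranged request and a "first header" in the response. Every field name,
width and ordering below is an assumption made for illustration; none of
it is the layout in include/uapi/linux/fuse.h or in the planned v2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical request body: which range of the file map is wanted. */
struct sketch_get_fmap_in {
	uint64_t offset; /* start of requested range, in file bytes */
	uint64_t length; /* length of requested range, in file bytes */
};

/* Hypothetical "first header" at the start of the response payload. */
struct sketch_fmap_header {
	uint32_t fmap_type; /* selects how the rest of the payload is parsed */
	uint32_t reserved;
	uint64_t file_size; /* total file size in bytes */
	uint64_t offset;    /* start of the range this response describes */
	uint64_t length;    /* length of the range this response describes */
};

/* True if the response describes the whole file in one go. */
static bool sketch_fmap_is_complete(const struct sketch_fmap_header *h)
{
	assert(h);
	return h->offset == 0 && h->length >= h->file_size;
}

/* True if file position pos falls inside the range described. */
static bool sketch_fmap_covers(const struct sketch_fmap_header *h, uint64_t pos)
{
	assert(h);
	return pos < h->file_size && pos >= h->offset &&
	       pos - h->offset < h->length;
}
```

With a header like this, a receiver can tell a complete map from a
partial one before it looks at any type-specific payload.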

If there is desire to give GET_FMAP a different name, like
GET_IOMAP, I don't much care - although the term "iomap" may
be a bit overloaded already (e.g. the dax_iomap_rw()/
dax_iomap_fault() functions debatably didn't need "iomap"
in their names since they're about converting a file offset
range to daxdev ranges, and they don't handle anything
specifically iomap-related). At least "FMAP" is more narrowly
descriptive of what it is.

I don't think Darrick needs GET_DAXDEV (or anything
analogous), because passing in the backing dev at mount time
seems entirely sufficient - so I assume that at least for now
GET_DAXDEV won't be shared. But famfs definitely needs
GET_DAXDEV, because files must be able to interleave across
memory devices.
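
For anyone who wants a picture of why interleaving needs more than one
backing device, here is a sketch of round-robin striping across daxdevs.
The chunk size, the strip layout and all names are assumptions for
illustration only, not the famfs fmap format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical strip: one device's share of an interleaved file. */
struct sketch_strip {
	uint32_t dev_index; /* which daxdev holds this strip */
	uint64_t dev_base;  /* byte offset where the strip starts */
};

/* Hypothetical interleaved map: chunks are dealt round-robin to strips. */
struct sketch_interleave {
	uint64_t chunk_size;
	size_t nstrips;
	const struct sketch_strip *strips;
};

/* Map a file offset to (dev_index, dev_offset) for a striped file. */
static void sketch_interleave_resolve(const struct sketch_interleave *il,
				      uint64_t file_off,
				      uint32_t *dev_index, uint64_t *dev_off)
{
	uint64_t chunk_no, stripe_no, in_chunk;
	size_t strip;

	assert(il && il->chunk_size > 0 && il->nstrips > 0);
	assert(dev_index && dev_off);

	chunk_no = file_off / il->chunk_size;     /* chunk within the file */
	in_chunk = file_off % il->chunk_size;     /* byte within that chunk */
	strip = (size_t)(chunk_no % il->nstrips); /* device that holds it */
	stripe_no = chunk_no / il->nstrips;       /* chunk within the strip */

	*dev_index = il->strips[strip].dev_index;
	*dev_off = il->strips[strip].dev_base +
		   stripe_no * il->chunk_size + in_chunk;
}
```

Consecutive chunks of a single file land on different devices, so the
kernel has to know about every device a file's map references.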

The one issue that I will kick down the road until v3 is
fixing the "poisoned page|folio" problem. Because of that,
v2 of this series will still be against a 6.14 kernel. Not
solving that problem means this series won't be merge-able
until v3.

I hope this is all clear and obvious. Let me know if not (or
if so).

Thanks,
John


[1] https://lore.kernel.org/linux-fsdevel/20250421013346.32530-1-john@groves.net/T/#me47467b781d6c637899a38b898c27afb619206e0
Re: [RFC PATCH 00/19] famfs: port into fuse
Posted by Amir Goldstein 6 months, 3 weeks ago
On Thu, May 22, 2025 at 12:30 AM John Groves <John@groves.net> wrote:
>
> On 25/04/20 08:33PM, John Groves wrote:
> > Subject: famfs: port into fuse
> >
> > <snip>
>
> I'm planning to apply the review comments and send v2 of
> this patch series soon - hopefully next week.
>
> I asked a couple of specific questions for Miklos and
> Amir at [1] that I hope they will answer in the next few
> days.

I missed this question.
Feel free to ping me next time if I am not answering.

> Do you object to zeroing fuse_inodes when they're
> allocated, and do I really need an xchg() to set the
> fi->famfs_meta pointer during fuse_alloc_inode()? cmpxchg
> would be good for avoiding stepping on an "already set"
> pointer, but not useful if fi->famfs_meta has random
> contents (which it does when allocated).
>

I don't have anything against zeroing the fuse inode fields
but be careful not to step over fuse_inode_init_once().

The answer to the xchg() question is quite technically boring.
At least in my case it was done to avoid an #ifdef in a C file.

Thanks,
Amir.
Re: [RFC PATCH 00/19] famfs: port into fuse
Posted by Darrick J. Wong 6 months, 3 weeks ago
On Wed, May 21, 2025 at 05:30:12PM -0500, John Groves wrote:
> On 25/04/20 08:33PM, John Groves wrote:
> > Subject: famfs: port into fuse
> >
> > <snip>
> 
> I'm planning to apply the review comments and send v2 of
> this patch series soon - hopefully next week.

Heh, I'm just about to push go on an RFC patchbomb for the entirety of
fuse + iomap + ext4-fuse2fs.

> I asked a couple of specific questions for Miklos and
> Amir at [1] that I hope they will answer in the next few
> days. Do you object to zeroing fuse_inodes when they're
> allocated, and do I really need an xchg() to set the
> fi->famfs_meta pointer during fuse_alloc_inode()? cmpxchg
> would be good for avoiding stepping on an "already set"
> pointer, but not useful if fi->famfs_meta has random
> contents (which it does when allocated).

I guess you could always null it out in fuse_inode_init_once and again
when you free the inode...

> I plan to move the GET_FMAP message to OPEN time rather than
> LOOKUP - unless that leads to problems that I don't
> currently foresee. The GET_FMAP response will also get a
> variable-sized payload.
> 
> Darrick and I have met and discussed commonality between our
> use cases, and the only thing from famfs that he will be able
> to directly use is the GET_FMAP message/response - but likely
> with a different response payload. The file operations in
> famfs.c are not applicable for Darrick, as they only handle
> mapping file offsets to devdax offsets (i.e. fs-dax over
> devdax).
> 
> Darrick is primarily exploring adapting block-backed file
> systems to use fuse. These are conventional page-cache-backed
> files that will primarily be read and written between
> blockdev and page cache.

Yeah, I really do need to get moving on sending out the RFC.

Everyone: patchbomb incoming!

> (Darrick, correct me if I got anything important wrong there.)
> 
> In prep for Darrick, I'll add an offset and length to the
> GET_FMAP message, to specify what range of the file map is
> being requested. I'll also add a new "first header" struct
> in the GET_FMAP response that can accommodate additional fmap
> types, and will specify the file size as well as the offset
> and length of the fmap portion described in the response
> (allowing for GET_FMAP responses that contain an incomplete
> file map).

Hrrmrmrmm.  I don't think there's much use in trying to share a fuse
command but then have to parse through the size of the response to
figure out what the server actually sent back.  It's less confusing to
have just one response type per fuse command.

I also don't think that FUSE_IOMAP_BEGIN is all that good of an
interface for what John is trying to do.  A regular filesystem creates
whatever mappings it likes in response to the far too many file IO APIs
in Linux, and needs to throw them at the kernel.  OTOH, famfs'
management daemon creates a static mapping with repeating elements and
that gets uploaded in one go via FUSE_GET_FMAP.  Yes, we could mash them
into a single uncohesive mess of an interface, but why would we torture
ourselves so?

(For me it's the "repeating sequences" aspect of GET_FMAP that makes me
think it should be a separate interface.  OTOH I haven't thought much
about how to support filesystems that implement RAID.)

> If there is desire to give GET_FMAP a different name, like
> GET_IOMAP, I don't much care - although the term "iomap" may
> be a bit overloaded already (e.g. the dax_iomap_rw()/
> dax_iomap_fault() functions debatably didn't need "iomap"
> in their names since they're about converting a file offset
> range to daxdev ranges, and they don't handle anything
> specifically iomap-related). At least "FMAP" is more narrowly
> descriptive of what it is.
> 
> I don't think Darrick needs GET_DAXDEV (or anything
> analogous), because passing in the backing dev at mount time
> seems entirely sufficient - so I assume that at least for now
> GET_DAXDEV won't be shared. But famfs definitely needs
> GET_DAXDEV, because files must be able to interleave across
> memory devices.

I actually /did/ add a notification so that the fuse server can tell the
kernel that they'd like to use a particular fd with iomap.  It doesn't
support dax devices by virtue of gatekeeping on S_ISBLK, but it wouldn't
be hard to do that.

> The one issue that I will kick down the road until v3 is
> fixing the "poisoned page|folio" problem. Because of that,
> v2 of this series will still be against a 6.14 kernel. Not
> solving that problem means this series won't be merge-able
> until v3.
> 
> I hope this is all clear and obvious. Let me know if not (or
> if so).

Hee hee.

--D

> 
> Thanks,
> John
> 
> 
> [1] https://lore.kernel.org/linux-fsdevel/20250421013346.32530-1-john@groves.net/T/#me47467b781d6c637899a38b898c27afb619206e0
> 
>
Re: [RFC PATCH 00/19] famfs: port into fuse
Posted by Alireza Sanaee 7 months, 2 weeks ago
On Sun, 20 Apr 2025 20:33:27 -0500
John Groves <John@Groves.net> wrote:

> Subject: famfs: port into fuse
> 
> This is the initial RFC for the fabric-attached memory file system
> (famfs) integration into fuse. In order to function, this requires a
> related patch to libfuse [1] and the famfs user space [2]. 
> 
> This RFC is mainly intended to socialize the approach and get
> feedback from the fuse developers and maintainers. There is some dax
> work that needs to be done before this should be merged (see the
> "poisoned page|folio problem" below).
> 
> This patch set fully works with Linux 6.14 -- passing all existing
> famfs smoke and unit tests -- and I encourage existing famfs users to
> test it.
> 
> This is really two patch sets mashed up:
> 
> * The patches with the dev_dax_iomap: prefix fill in missing
> functionality for devdax to host an fs-dax file system.
> * The famfs_fuse: patches add famfs into fs/fuse/. These are
> effectively unchanged since last year.
> 
> Because this is not ready to merge yet, I have felt free to leave
> some debug prints in place because we still find them useful; those
> will be cleaned up in a subsequent revision.
> 
> Famfs Overview
> 
> Famfs exposes shared memory as a file system. Famfs consumes shared
> memory from dax devices, and provides memory-mappable files that map
> directly to the memory - no page cache involvement. Famfs differs
> from conventional file systems in fs-dax mode, in that it handles
> in-memory metadata in a sharable way (which begins with never caching
> dirty shared metadata).
> 
> Famfs started as a standalone file system [3,4], but the consensus at
> LSFMM 2024 [5] was that it should be ported into fuse - and this RFC
> is the first public evidence that I've been working on that.
> 
> The key performance requirement is that famfs must resolve mapping
> faults without upcalls. This is achieved by fully caching the
> file-to-devdax metadata for all active files. This is done via two
> fuse client/server message/response pairs: GET_FMAP and GET_DAXDEV.
> 
> Famfs remains the first fs-dax file system that is backed by devdax
> rather than pmem in fs-dax mode (hence the need for the dev_dax_iomap
> fixups).
> 
> Notes
> 
> * Once the dev_dax_iomap patches land, I suspect it may make sense for
>   virtiofs to update to use the improved interface.
> 
> * I'm currently maintaining compatibility between the famfs user
> space and both the standalone famfs kernel file system and this new
> fuse implementation. In the near future I'll be running performance
> comparisons and sharing them - but there is no reason to expect
> significant degradation with fuse, since famfs caches entire "fmaps"
> in the kernel to resolve faults with no upcalls. This patch has a bit
> too much debug turned on to do that testing quite yet. A branch 
> 
> * Two new fuse messages / responses are added: GET_FMAP and
> GET_DAXDEV.
> 
> * When a file is looked up in a famfs mount, the LOOKUP is followed
> by a GET_FMAP message and response. The "fmap" is the full
> file-to-dax mapping, allowing the fuse/famfs kernel code to handle
> read/write/fault without any upcalls.
> 
> * After each GET_FMAP, the fmap is checked for extents that reference
>   previously-unknown daxdevs. Each such occurrence is handled with a
>   GET_DAXDEV message and response.
> 
> * Daxdevs are stored in a table (which might become an xarray at some
> point). When entries are added to the table, we acquire exclusive
> access to the daxdev via the fs_dax_get() call (modeled after how
> fs-dax handles this with pmem devices). famfs provides
> holder_operations to devdax, providing a notification path in the
> event of memory errors.
> 
> * If devdax notifies famfs of memory errors on a dax device, famfs
> currently blocks all subsequent accesses to data on that device. The
> recovery is to re-initialize the memory and file system. Famfs is
> memory, not storage...
> 
> * Because famfs uses backing (devdax) devices, only privileged mounts
> are supported.
> 
> * The famfs kernel code never accesses the memory directly - it only
>   facilitates read, write and mmap on behalf of user processes. As
> such, the RAS of the shared memory affects applications, but not the
> kernel.
> 
> * Famfs has backing device(s), but they are devdax (char) rather than
>   block. Right now there is no way to tell the vfs layer that famfs
> has a char backing device (unless we say it's block, but it's not).
> Currently we use the standard anonymous fuse fs_type - but I'm not
> sure that's ultimately optimal (thoughts?)
> 
> The "poisoned page|folio problem"
> 
> * Background: before doing a kernel mount, the famfs user space [2]
> validates the superblock and log. This is done via raw mmap of the
> primary devdax device. If valid, the file system is mounted, and the
> superblock and log get exposed through a pair of files
> (.meta/.superblock and .meta/.log) - because we can't be using raw
> device mmap when a file system is mounted on the device. But this
> exposes a devdax bug and warning...
> 
> * Pages that have been memory mapped via devdax are left in a
> permanently problematic state. Devdax sets page|folio->mapping when a
> page is accessed via raw devdax mmap (as famfs does before mount),
> but never cleans it up. When the pages of the famfs superblock and
> log are accessed via the "meta" files after mount, we see a
> WARN_ONCE() in dax_insert_entry(), which notices that
> page|folio->mapping is still set. I intend to address this prior to
> asking for the famfs patches to be merged.
> 
> * Alistair Popple's recent dax patch series [6], which has been merged
>   for 6.15, addresses some dax issues, but sadly does not fix the
> poisoned page|folio problem - its enhanced refcount checking turns
> the warning into an error.
> 
> * This 6.14 patch set disables the warning; a proper fix will be
> required for famfs to work at all in 6.15. Dan W. and I are actively
> discussing how to do this properly...
> 
> * In terms of the correct functionality of famfs, the warning can be
> ignored.
> 
> References
> 
> [1] - https://github.com/libfuse/libfuse/pull/1200
> [2] - https://github.com/cxl-micron-reskit/famfs
> [3] - https://lore.kernel.org/linux-cxl/cover.1708709155.git.john@groves.net/
> [4] - https://lore.kernel.org/linux-cxl/cover.1714409084.git.john@groves.net/
> [5] - https://lwn.net/Articles/983105/
> [6] - https://lore.kernel.org/linux-cxl/cover.8068ad144a7eea4a813670301f4d2a86a8e68ec4.1740713401.git-series.apopple@nvidia.com/
> 
> 
> John Groves (19):
>   dev_dax_iomap: Move dax_pgoff_to_phys() from device.c to bus.c
>   dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage
>   dev_dax_iomap: Save the kva from memremap
>   dev_dax_iomap: Add dax_operations for use by fs-dax on devdax
>   dev_dax_iomap: export dax_dev_get()
>   dev_dax_iomap: (ignore!) Drop poisoned page warning in fs/dax.c
>   famfs_fuse: magic.h: Add famfs magic numbers
>   famfs_fuse: Kconfig
>   famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX/
>   famfs_fuse: Basic fuse kernel ABI enablement for famfs
>   famfs_fuse: Basic famfs mount opts
>   famfs_fuse: Plumb the GET_FMAP message/response
>   famfs_fuse: Create files with famfs fmaps
>   famfs_fuse: GET_DAXDEV message and daxdev_table
>   famfs_fuse: Plumb dax iomap and fuse read/write/mmap
>   famfs_fuse: Add holder_operations for dax notify_failure()
>   famfs_fuse: Add famfs metadata documentation
>   famfs_fuse: Add documentation
>   famfs_fuse: (ignore) debug cruft
> 
>  Documentation/filesystems/famfs.rst |  142 ++++
>  Documentation/filesystems/index.rst |    1 +
>  MAINTAINERS                         |   10 +
>  drivers/dax/Kconfig                 |    6 +
>  drivers/dax/bus.c                   |  144 +++-
>  drivers/dax/dax-private.h           |    1 +
>  drivers/dax/device.c                |   38 +-
>  drivers/dax/super.c                 |   33 +-
>  fs/dax.c                            |    1 -
>  fs/fuse/Kconfig                     |   13 +
>  fs/fuse/Makefile                    |    4 +-
>  fs/fuse/dev.c                       |   61 ++
>  fs/fuse/dir.c                       |   74 +-
>  fs/fuse/famfs.c                     | 1105 +++++++++++++++++++++++++++
>  fs/fuse/famfs_kfmap.h               |  166 ++++
>  fs/fuse/file.c                      |   27 +-
>  fs/fuse/fuse_i.h                    |   67 +-
>  fs/fuse/inode.c                     |   49 +-
>  fs/fuse/iomode.c                    |    2 +-
>  fs/namei.c                          |    1 +
>  include/linux/dax.h                 |    6 +
>  include/uapi/linux/fuse.h           |   63 ++
>  include/uapi/linux/magic.h          |    2 +
>  23 files changed, 1973 insertions(+), 43 deletions(-)
>  create mode 100644 Documentation/filesystems/famfs.rst
>  create mode 100644 fs/fuse/famfs.c
>  create mode 100644 fs/fuse/famfs_kfmap.h
> 
> 
> base-commit: 38fec10eb60d687e30c8c6b5420d86e8149f7557

Hi John,

Apologies if the question is far off or irrelevant.

I am trying to understand FAMFS, and I am wondering where FAMFS stands
when compared to OpenSHMEM PGAS. Can't we have an OpenSHMEM-based
shared memory implementation over CXL that serves as FAMFS?

Maybe FAMFS does more than that!?!

Thanks,
Alireza
Re: [RFC PATCH 00/19] famfs: port into fuse
Posted by John Groves 7 months, 2 weeks ago
On 25/04/30 03:42PM, Alireza Sanaee wrote:
> On Sun, 20 Apr 2025 20:33:27 -0500
> John Groves <John@Groves.net> wrote:
> 
>> <snip>
> 
> Hi John,
> 
> Apologies if the question is far off or irrelevant.
> 
> I am trying to understand FAMFS, and I am wondering where FAMFS stands
> when compared to OpenSHMEM PGAS. Can't we have an OpenSHMEM-based
> shared memory implementation over CXL that serves as FAMFS?
> 
> Maybe FAMFS does more than that!?!
> 
> Thanks,
> Alireza
>

Continuation of this conversation likely belongs in the discussion section
at [1], but a couple of thoughts.

Famfs provides scale-out filesystem mounts where the files map to the
same disaggregated shared memory. If you mmap a famfs file, you are accessing
the memory directly. Since shmem is file-backed (usually tmpfs or
its ilk), shmem is a higher-level and more specialized abstraction, and
OpenSHMEM may be able to run atop famfs. It looks like OpenSHMEM and PGAS
cover the possibility that "shared memory" might require grabbing a copy via
[r]dma - which famfs will probably never do. Famfs only handles cases where
the memory is actually shared. (hey, I work for a memory company.)

Since famfs provides memory-mappable files, almost all apps can access them
(no requirement to write to the shmem, or other related but more esoteric
interfaces). Apps are responsible for not doing "nonsense" access WRT cache
coherency, but famfs manages cache coherency for its metadata.

The video at [2] may be useful to get up to speed.

[1] http://github.com/cxl-micron-reskit/famfs
[2] https://www.youtube.com/watch?v=L1QNpb-8VgM&t=1680
Re: [RFC PATCH 00/19] famfs: port into fuse
Posted by Darrick J. Wong 7 months, 3 weeks ago
On Sun, Apr 20, 2025 at 08:33:27PM -0500, John Groves wrote:
> Subject: famfs: port into fuse
> 
> This is the initial RFC for the fabric-attached memory file system (famfs)
> integration into fuse. In order to function, this requires a related patch
> to libfuse [1] and the famfs user space [2]. 
> 
> This RFC is mainly intended to socialize the approach and get feedback from
> the fuse developers and maintainers. There is some dax work that needs to
> be done before this should be merged (see the "poisoned page|folio problem"
> below).

Note that I'm only looking at the fuse and iomap aspects of this
patchset.  I don't know the devdax code at all.

> This patch set fully works with Linux 6.14 -- passing all existing famfs
> smoke and unit tests -- and I encourage existing famfs users to test it.
> 
> This is really two patch sets mashed up:
> 
> * The patches with the dev_dax_iomap: prefix fill in missing functionality for
>   devdax to host an fs-dax file system.
> * The famfs_fuse: patches add famfs into fs/fuse/. These are effectively
>   unchanged since last year.
> 
> Because this is not ready to merge yet, I have felt free to leave some debug
> prints in place because we still find them useful; those will be cleaned up
> in a subsequent revision.
> 
> Famfs Overview
> 
> Famfs exposes shared memory as a file system. Famfs consumes shared memory
> from dax devices, and provides memory-mappable files that map directly to
> the memory - no page cache involvement. Famfs differs from conventional
> file systems in fs-dax mode, in that it handles in-memory metadata in a
> sharable way (which begins with never caching dirty shared metadata).
> 
> Famfs started as a standalone file system [3,4], but the consensus at LSFMM
> 2024 [5] was that it should be ported into fuse - and this RFC is the first
> public evidence that I've been working on that.

This is very timely, as I just started looking into how I might connect
iomap to fuse so that most of the hot IO path continues to run in the
kernel, and userspace block device filesystem drivers merely supply the
file mappings to the kernel.  In other words, we kick the metadata
parsing craziness out of the kernel.

> The key performance requirement is that famfs must resolve mapping faults
> without upcalls. This is achieved by fully caching the file-to-devdax
> metadata for all active files. This is done via two fuse client/server
> message/response pairs: GET_FMAP and GET_DAXDEV.

Heh, just last week I finally got around to laying out how I think I'd
want to expose iomap through fuse to allow ->iomap_begin/->iomap_end
upcalls to a fuse server.  Note that I've done zero prototyping but
"upload all the mappings at open time" seems like a reasonable place for
me to start looking, especially for a filesystem with static mappings.

I think what I want to try to build is an in-kernel mapping cache (sort
of like the one you built), only with upcalls to the fuse server when
there is no mapping information for a given IO.  I'd probably want to
have a means for the fuse server to put new mappings into the cache, or
invalidate existing mappings.
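
To make that concrete, here's a userspace toy of the idea (zero kernel code;
every name below is made up for illustration and none of it is a real fuse
interface): look up a cached mapping, upcall on a miss, and let the server
push or invalidate mappings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CACHE_SLOTS 8

/* Hypothetical cached mapping: file range -> device offset. */
struct sketch_mapping {
	uint64_t file_off;
	uint64_t len;
	uint64_t dev_off;
	int valid;
};

/* Stand-in for an upcall to the fuse server; returns 0 on success. */
typedef int (*sketch_upcall_fn)(uint64_t file_off, struct sketch_mapping *out);

struct sketch_cache {
	struct sketch_mapping slots[SKETCH_CACHE_SLOTS];
	size_t next;            /* round-robin replacement cursor */
	sketch_upcall_fn upcall;
	unsigned long upcalls;  /* number of misses that went to the server */
};

static struct sketch_mapping *sketch_find(struct sketch_cache *c, uint64_t off)
{
	for (size_t i = 0; i < SKETCH_CACHE_SLOTS; i++) {
		struct sketch_mapping *m = &c->slots[i];

		if (m->valid && off >= m->file_off && off - m->file_off < m->len)
			return m;
	}
	return NULL;
}

/* Server-initiated (or miss-driven) insertion of a mapping. */
static void sketch_insert(struct sketch_cache *c, const struct sketch_mapping *m)
{
	c->slots[c->next] = *m;
	c->slots[c->next].valid = 1;
	c->next = (c->next + 1) % SKETCH_CACHE_SLOTS;
}

/* Drop every cached mapping that overlaps [off, off + len). */
static void sketch_invalidate(struct sketch_cache *c, uint64_t off, uint64_t len)
{
	for (size_t i = 0; i < SKETCH_CACHE_SLOTS; i++) {
		struct sketch_mapping *m = &c->slots[i];

		if (m->valid && m->file_off < off + len && off < m->file_off + m->len)
			m->valid = 0;
	}
}

/* Resolve a file offset; only a cache miss costs an upcall. */
static int sketch_map(struct sketch_cache *c, uint64_t off, uint64_t *dev_off)
{
	struct sketch_mapping *m = sketch_find(c, off);

	if (!m) {
		struct sketch_mapping fresh;

		c->upcalls++;
		if (c->upcall(off, &fresh) != 0)
			return -1;
		sketch_insert(c, &fresh);
		m = sketch_find(c, off);
		if (!m)
			return -1;
	}
	*dev_off = m->dev_off + (off - m->file_off);
	return 0;
}

/* Demo "server": hands out 4 KiB extents at a fixed 1 MiB device offset. */
static int sketch_demo_upcall(uint64_t file_off, struct sketch_mapping *out)
{
	out->file_off = file_off & ~4095ULL;
	out->len = 4096;
	out->dev_off = 0x100000 + out->file_off;
	out->valid = 1;
	return 0;
}
```

With static mappings every slot would be filled up front and the upcall path
would never run, which is more or less the famfs case.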

(famfs obviously is a simple corner-case of that grandiose vision, but I
still have a long way to get to my larger vision so don't take my words
as any kind of requirement.)

> Famfs remains the first fs-dax file system that is backed by devdax rather
> than pmem in fs-dax mode (hence the need for the dev_dax_iomap fixups).
> 
> Notes
> 
> * Once the dev_dax_iomap patches land, I suspect it may make sense for
>   virtiofs to update to use the improved interface.
> 
> * I'm currently maintaining compatibility between the famfs user space and
>   both the standalone famfs kernel file system and this new fuse
>   implementation. In the near future I'll be running performance comparisons
>   and sharing them - but there is no reason to expect significant degradation
>   with fuse, since famfs caches entire "fmaps" in the kernel to resolve

I'm curious to hear what you find, performance-wise. :)

>   faults with no upcalls. This patch has a bit too much debug turned on to
>   do that testing quite yet. A branch 

A branch ... what?

> * Two new fuse messages / responses are added: GET_FMAP and GET_DAXDEV.
> 
> * When a file is looked up in a famfs mount, the LOOKUP is followed by a
>   GET_FMAP message and response. The "fmap" is the full file-to-dax mapping,
>   allowing the fuse/famfs kernel code to handle read/write/fault without any
>   upcalls.

Huh, I'd have thought you'd wait until FUSE_OPEN to start preloading
mappings into the kernel.

> * After each GET_FMAP, the fmap is checked for extents that reference
>   previously-unknown daxdevs. Each such occurrence is handled with a
>   GET_DAXDEV message and response.

I hadn't figured out how this part would work for my silly prototype.
Just out of curiosity, does the famfs fuse server hold an open fd to the
storage, in which case the fmap(ping) could just contain the open fd?

Where are the mappings that are sent from the fuse server?  Is that
struct fuse_famfs_simple_ext?

> * Daxdevs are stored in a table (which might become an xarray at some point).
>   When entries are added to the table, we acquire exclusive access to the
>   daxdev via the fs_dax_get() call (modeled after how fs-dax handles this
>   with pmem devices). famfs provides holder_operations to devdax, providing
>   a notification path in the event of memory errors.
> 
> * If devdax notifies famfs of memory errors on a dax device, famfs currently
>   blocks all subsequent accesses to data on that device. The recovery is to
>   re-initialize the memory and file system. Famfs is memory, not storage...

Ouch. :)

> * Because famfs uses backing (devdax) devices, only privileged mounts are
>   supported.
> 
> * The famfs kernel code never accesses the memory directly - it only
>   facilitates read, write and mmap on behalf of user processes. As such,
>   the RAS of the shared memory affects applications, but not the kernel.
> 
> * Famfs has backing device(s), but they are devdax (char) rather than
>   block. Right now there is no way to tell the vfs layer that famfs has a
>   char backing device (unless we say it's block, but it's not). Currently
>   we use the standard anonymous fuse fs_type - but I'm not sure that's
>   ultimately optimal (thoughts?)

Does it work if the fusefs server adds "-o fsname=<devdax cdev>" to the
fuse_args object?  fuse2fs does that, though I don't recall if that's a
reasonable thing to do.

> The "poisoned page|folio problem"
> 
> * Background: before doing a kernel mount, the famfs user space [2] validates
>   the superblock and log. This is done via raw mmap of the primary devdax
>   device. If valid, the file system is mounted, and the superblock and log
>   get exposed through a pair of files (.meta/.superblock and .meta/.log) -
>   because we can't be using raw device mmap when a file system is mounted
>   on the device. But this exposes a devdax bug and warning...
> 
> * Pages that have been memory mapped via devdax are left in a permanently
>   problematic state. Devdax sets page|folio->mapping when a page is accessed
>   via raw devdax mmap (as famfs does before mount), but never cleans it up.
>   When the pages of the famfs superblock and log are accessed via the "meta"
>   files after mount, we see a WARN_ONCE() in dax_insert_entry(), which
>   notices that page|folio->mapping is still set. I intend to address this
>   prior to asking for the famfs patches to be merged.
> 
> * Alistair Popple's recent dax patch series [6], which has been merged
>   for 6.15, addresses some dax issues, but sadly does not fix the poisoned
>   page|folio problem - its enhanced refcount checking turns the warning into
>   an error.
> 
> * This 6.14 patch set disables the warning; a proper fix will be required for
>   famfs to work at all in 6.15. Dan W. and I are actively discussing how to do
>   this properly...
> 
> * In terms of the correct functionality of famfs, the warning can be ignored.
> 
> References
> 
> [1] - https://github.com/libfuse/libfuse/pull/1200
> [2] - https://github.com/cxl-micron-reskit/famfs

Thanks for posting links, I'll have a look there too.

--D

> [3] - https://lore.kernel.org/linux-cxl/cover.1708709155.git.john@groves.net/
> [4] - https://lore.kernel.org/linux-cxl/cover.1714409084.git.john@groves.net/
> [5] - https://lwn.net/Articles/983105/
> [6] - https://lore.kernel.org/linux-cxl/cover.8068ad144a7eea4a813670301f4d2a86a8e68ec4.1740713401.git-series.apopple@nvidia.com/
> 
> 
> John Groves (19):
>   dev_dax_iomap: Move dax_pgoff_to_phys() from device.c to bus.c
>   dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage
>   dev_dax_iomap: Save the kva from memremap
>   dev_dax_iomap: Add dax_operations for use by fs-dax on devdax
>   dev_dax_iomap: export dax_dev_get()
>   dev_dax_iomap: (ignore!) Drop poisoned page warning in fs/dax.c
>   famfs_fuse: magic.h: Add famfs magic numbers
>   famfs_fuse: Kconfig
>   famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX/
>   famfs_fuse: Basic fuse kernel ABI enablement for famfs
>   famfs_fuse: Basic famfs mount opts
>   famfs_fuse: Plumb the GET_FMAP message/response
>   famfs_fuse: Create files with famfs fmaps
>   famfs_fuse: GET_DAXDEV message and daxdev_table
>   famfs_fuse: Plumb dax iomap and fuse read/write/mmap
>   famfs_fuse: Add holder_operations for dax notify_failure()
>   famfs_fuse: Add famfs metadata documentation
>   famfs_fuse: Add documentation
>   famfs_fuse: (ignore) debug cruft
> 
>  Documentation/filesystems/famfs.rst |  142 ++++
>  Documentation/filesystems/index.rst |    1 +
>  MAINTAINERS                         |   10 +
>  drivers/dax/Kconfig                 |    6 +
>  drivers/dax/bus.c                   |  144 +++-
>  drivers/dax/dax-private.h           |    1 +
>  drivers/dax/device.c                |   38 +-
>  drivers/dax/super.c                 |   33 +-
>  fs/dax.c                            |    1 -
>  fs/fuse/Kconfig                     |   13 +
>  fs/fuse/Makefile                    |    4 +-
>  fs/fuse/dev.c                       |   61 ++
>  fs/fuse/dir.c                       |   74 +-
>  fs/fuse/famfs.c                     | 1105 +++++++++++++++++++++++++++
>  fs/fuse/famfs_kfmap.h               |  166 ++++
>  fs/fuse/file.c                      |   27 +-
>  fs/fuse/fuse_i.h                    |   67 +-
>  fs/fuse/inode.c                     |   49 +-
>  fs/fuse/iomode.c                    |    2 +-
>  fs/namei.c                          |    1 +
>  include/linux/dax.h                 |    6 +
>  include/uapi/linux/fuse.h           |   63 ++
>  include/uapi/linux/magic.h          |    2 +
>  23 files changed, 1973 insertions(+), 43 deletions(-)
>  create mode 100644 Documentation/filesystems/famfs.rst
>  create mode 100644 fs/fuse/famfs.c
>  create mode 100644 fs/fuse/famfs_kfmap.h
> 
> 
> base-commit: 38fec10eb60d687e30c8c6b5420d86e8149f7557
> -- 
> 2.49.0
> 
>
Re: [RFC PATCH 00/19] famfs: port into fuse
Posted by John Groves 7 months, 3 weeks ago
On 25/04/21 11:27AM, Darrick J. Wong wrote:
> On Sun, Apr 20, 2025 at 08:33:27PM -0500, John Groves wrote:
> > Subject: famfs: port into fuse
> > 
> > This is the initial RFC for the fabric-attached memory file system (famfs)
> > integration into fuse. In order to function, this requires a related patch
> > to libfuse [1] and the famfs user space [2]. 
> > 
> > This RFC is mainly intended to socialize the approach and get feedback from
> > the fuse developers and maintainers. There is some dax work that needs to
> > be done before this should be merged (see the "poisoned page|folio problem"
> > below).
> 
> Note that I'm only looking at the fuse and iomap aspects of this
> patchset.  I don't know the devdax code at all.
> 
> > This patch set fully works with Linux 6.14 -- passing all existing famfs
> > smoke and unit tests -- and I encourage existing famfs users to test it.
> > 
> > This is really two patch sets mashed up:
> > 
> > * The patches with the dev_dax_iomap: prefix fill in missing functionality for
> >   devdax to host an fs-dax file system.
> > * The famfs_fuse: patches add famfs into fs/fuse/. These are effectively
> >   unchanged since last year.
> > 
> > Because this is not ready to merge yet, I have felt free to leave some debug
> > prints in place because we still find them useful; those will be cleaned up
> > in a subsequent revision.
> > 
> > Famfs Overview
> > 
> > Famfs exposes shared memory as a file system. Famfs consumes shared memory
> > from dax devices, and provides memory-mappable files that map directly to
> > the memory - no page cache involvement. Famfs differs from conventional
> > file systems in fs-dax mode, in that it handles in-memory metadata in a
> > sharable way (which begins with never caching dirty shared metadata).
> > 
> > Famfs started as a standalone file system [3,4], but the consensus at LSFMM
> > 2024 [5] was that it should be ported into fuse - and this RFC is the first
> > public evidence that I've been working on that.
> 
> This is very timely, as I just started looking into how I might connect
> iomap to fuse so that most of the hot IO path continues to run in the
> kernel, and userspace block device filesystem drivers merely supply the
> file mappings to the kernel.  In other words, we kick the metadata
> parsing craziness out of the kernel.

Coool!

> 
> > The key performance requirement is that famfs must resolve mapping faults
> > without upcalls. This is achieved by fully caching the file-to-devdax
> > metadata for all active files. This is done via two fuse client/server
> > message/response pairs: GET_FMAP and GET_DAXDEV.
> 
> Heh, just last week I finally got around to laying out how I think I'd
> want to expose iomap through fuse to allow ->iomap_begin/->iomap_end
> upcalls to a fuse server.  Note that I've done zero prototyping but
> "upload all the mappings at open time" seems like a reasonable place for
> me to start looking, especially for a filesystem with static mappings.
> 
> I think what I want to try to build is an in-kernel mapping cache (sort
> of like the one you built), only with upcalls to the fuse server when
> there is no mapping information for a given IO.  I'd probably want to
> have a means for the fuse server to put new mappings into the cache, or
> invalidate existing mappings.
> 
> (famfs obviously is a simple corner-case of that grandiose vision, but I
> still have a long way to get to my larger vision so don't take my words
> as any kind of requirement.)
> 
> > Famfs remains the first fs-dax file system that is backed by devdax rather
> > than pmem in fs-dax mode (hence the need for the dev_dax_iomap fixups).
> > 
> > Notes
> > 
> > * Once the dev_dax_iomap patches land, I suspect it may make sense for
> >   virtiofs to update to use the improved interface.
> > 
> > * I'm currently maintaining compatibility between the famfs user space and
> >   both the standalone famfs kernel file system and this new fuse
> >   implementation. In the near future I'll be running performance comparisons
> >   and sharing them - but there is no reason to expect significant degradation
> >   with fuse, since famfs caches entire "fmaps" in the kernel to resolve
> 
> I'm curious to hear what you find, performance-wise. :)
> 
> >   faults with no upcalls. This patch has a bit too much debug turned on to
> >   do that testing quite yet. A branch 
> 
> A branch ... what?

I trail off sometimes... ;)

> 
> > * Two new fuse messages / responses are added: GET_FMAP and GET_DAXDEV.
> > 
> > * When a file is looked up in a famfs mount, the LOOKUP is followed by a
> >   GET_FMAP message and response. The "fmap" is the full file-to-dax mapping,
> >   allowing the fuse/famfs kernel code to handle read/write/fault without any
> >   upcalls.
> 
> Huh, I'd have thought you'd wait until FUSE_OPEN to start preloading
> mappings into the kernel.

That may be a better approach. Miklos and I discussed it during LPC last year,
and thought both were options. Having implemented it at LOOKUP time, I think
moving it to open might avoid my READDIRPLUS problem: RDP is a mashup of
READDIR and LOOKUP, and therefore might need to carry the GET_FMAP payload as
well. Moving GET_FMAP to open time would break that connection in a good way,
I think.

> 
> > * After each GET_FMAP, the fmap is checked for extents that reference
> >   previously-unknown daxdevs. Each such occurrence is handled with a
> >   GET_DAXDEV message and response.
> 
> I hadn't figured out how this part would work for my silly prototype.
> Just out of curiosity, does the famfs fuse server hold an open fd to the
> storage, in which case the fmap(ping) could just contain the open fd?
> 
> Where are the mappings that are sent from the fuse server?  Is that
> struct fuse_famfs_simple_ext?

See patch 17 or fs/fuse/famfs_kfmap.h for the fmap metadata explanation. 
Famfs currently supports either simple extents (daxdev, offset, length) or 
interleaved ones (which describe each "strip" as a simple extent). I think 
the explanation in famfs_kfmap.h is pretty clear.

A key question is whether any additional basic metadata abstractions would
be needed - because the kernel needs to understand the full scheme.

With disaggregated memory, the interleave approach is nice because it gets
aggregated performance and resolving a file offset to daxdev offset is order
1.
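
As a rough sketch of that interleave math (the struct and field names here
are hypothetical stand-ins, not the ones in famfs_kfmap.h, and this assumes
every strip starts on a chunk boundary):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical simple extent: one contiguous range on one daxdev. */
struct sketch_simple_ext {
	uint64_t dev_index;  /* which daxdev the strip lives on */
	uint64_t offset;     /* byte offset of the strip on that daxdev */
	uint64_t len;        /* byte length of the strip */
};

/* Hypothetical interleaved extent: file data rotates across the strips. */
struct sketch_interleaved_ext {
	uint64_t nstrips;     /* number of strips */
	uint64_t chunk_size;  /* bytes placed on one strip before moving on */
	const struct sketch_simple_ext *strips;
};

struct sketch_resolved {
	uint64_t dev_index;
	uint64_t dev_offset;
};

/* Resolve a file offset with a fixed number of divides; no searching. */
static struct sketch_resolved
sketch_resolve(const struct sketch_interleaved_ext *ie, uint64_t file_off)
{
	struct sketch_resolved r;
	uint64_t chunk_num, chunk_off, strip_num, stripe_num;

	assert(ie->nstrips > 0 && ie->chunk_size > 0);

	chunk_num = file_off / ie->chunk_size;   /* which chunk of the file */
	chunk_off = file_off % ie->chunk_size;   /* offset inside that chunk */
	strip_num = chunk_num % ie->nstrips;     /* which strip holds it */
	stripe_num = chunk_num / ie->nstrips;    /* how far down the strip */

	r.dev_index = ie->strips[strip_num].dev_index;
	r.dev_offset = ie->strips[strip_num].offset +
		       stripe_num * ie->chunk_size + chunk_off;
	return r;
}
```

The cost is the same no matter how large the file is or how many strips it
spans, which is the point.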

Oh, and there are two fmap formats (ok, more, but the others are legacy ;).
The fmaps-in-messages structs are currently in the famfs section of
include/uapi/linux/fuse.h. And the in-memory version is in 
fs/fuse/famfs_kfmap.h. The former will need to be a versioned interface.
(ugh...)

> 
> > * Daxdevs are stored in a table (which might become an xarray at some point).
> >   When entries are added to the table, we acquire exclusive access to the
> >   daxdev via the fs_dax_get() call (modeled after how fs-dax handles this
> >   with pmem devices). famfs provides holder_operations to devdax, providing
> >   a notification path in the event of memory errors.
> > 
> > * If devdax notifies famfs of memory errors on a dax device, famfs currently
> >   blocks all subsequent accesses to data on that device. The recovery is to
> >   re-initialize the memory and file system. Famfs is memory, not storage...
> 
> Ouch. :)

Cautious initial approach (i.e. I'm trying not to scare people too much ;) 
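
In toy form the current policy amounts to a sticky per-device flag. This is
only an illustration: the names are hypothetical, and returning -EIO is an
assumption about the error code rather than a statement of what the patches
return.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-daxdev state. */
struct sketch_daxdev {
	bool error;  /* set once a memory error has been reported */
};

/* Stand-in for the holder_operations notification path. */
static void sketch_notify_failure(struct sketch_daxdev *d)
{
	d->error = true;  /* sticky: nothing clears it short of re-init */
}

/* Checked before any read/write/fault is resolved against the device. */
static int sketch_access_check(const struct sketch_daxdev *d)
{
	return d->error ? -EIO : 0;
}
```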

> 
> > * Because famfs uses backing (devdax) devices, only privileged mounts are
> >   supported.
> > 
> > * The famfs kernel code never accesses the memory directly - it only
> >   facilitates read, write and mmap on behalf of user processes. As such,
> >   the RAS of the shared memory affects applications, but not the kernel.
> > 
> > * Famfs has backing device(s), but they are devdax (char) rather than
> >   block. Right now there is no way to tell the vfs layer that famfs has a
> >   char backing device (unless we say it's block, but it's not). Currently
> >   we use the standard anonymous fuse fs_type - but I'm not sure that's
> >   ultimately optimal (thoughts?)
> 
> Does it work if the fusefs server adds "-o fsname=<devdax cdev>" to the
> fuse_args object?  fuse2fs does that, though I don't recall if that's a
> reasonable thing to do.

The kernel needs to "own" the dax devices. fs-dax on pmem/block calls
fs_dax_get_by_bdev() and passes in holder_operations - which are used for
error upcalls, but also effect exclusive ownership. 

I added fs_dax_get() since the bdev version wasn't really right for char
devdax. But same holder_operations.

I had originally intended to pass in "-o daxdev=<cdev>", but famfs needs to
span multiple daxdevs, in order to interleave for performance. The approach
of retrieving them with GET_DAXDEV handles the generalized case, so "-o"
just amounts to a second way to do the same thing.
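
A toy version of that flow, for illustration only (hypothetical names and a
fixed-size table; the real code is in fs/fuse/famfs.c and differs): after
each fmap arrives, walk its extents and fetch any daxdev we haven't seen yet.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_DAXDEVS 16

/* Hypothetical daxdev table: one "known" flag per device index. */
struct sketch_daxdev_table {
	int known[SKETCH_MAX_DAXDEVS];
	unsigned int fetches;  /* simulated GET_DAXDEV round trips */
};

/* Stand-in for the GET_DAXDEV message/response exchange. */
static int sketch_get_daxdev(struct sketch_daxdev_table *t, size_t idx)
{
	t->fetches++;
	t->known[idx] = 1;
	return 0;
}

/*
 * Walk the device indices referenced by a new fmap's extents and fetch
 * each device that is not in the table yet. Returns 0 on success, -1 on
 * an out-of-range index or a failed fetch.
 */
static int sketch_check_fmap(struct sketch_daxdev_table *t,
			     const size_t *ext_dev_idx, size_t nextents)
{
	for (size_t i = 0; i < nextents; i++) {
		size_t idx = ext_dev_idx[i];

		if (idx >= SKETCH_MAX_DAXDEVS)
			return -1;
		if (!t->known[idx] && sketch_get_daxdev(t, idx) != 0)
			return -1;
	}
	return 0;
}
```

Each device costs one round trip the first time any file references it, and
nothing after that.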

"But wait"... I thought. Doesn't the "-o" approach get the primary daxdev
locked up sooner, which might be good? Well, no, because famfs creates a
couple of meta files during mount (.meta/.superblock and .meta/.log) - and 
those are guaranteed to reference the primary daxdev. So I concluded the -o
approach wasn't worth the trouble (though it's not *much* trouble).

> 
> > The "poisoned page|folio problem"
> > 
> > * Background: before doing a kernel mount, the famfs user space [2] validates
> >   the superblock and log. This is done via raw mmap of the primary devdax
> >   device. If valid, the file system is mounted, and the superblock and log
> >   get exposed through a pair of files (.meta/.superblock and .meta/.log) -
> >   because we can't be using raw device mmap when a file system is mounted
> >   on the device. But this exposes a devdax bug and warning...
> > 
> > * Pages that have been memory mapped via devdax are left in a permanently
> >   problematic state. Devdax sets page|folio->mapping when a page is accessed
> >   via raw devdax mmap (as famfs does before mount), but never cleans it up.
> >   When the pages of the famfs superblock and log are accessed via the "meta"
> >   files after mount, we see a WARN_ONCE() in dax_insert_entry(), which
> >   notices that page|folio->mapping is still set. I intend to address this
> >   prior to asking for the famfs patches to be merged.
> > 
> > * Alistair Popple's recent dax patch series [6], which has been merged
> >   for 6.15, addresses some dax issues, but sadly does not fix the poisoned
> >   page|folio problem - its enhanced refcount checking turns the warning into
> >   an error.
> > 
> > * This 6.14 patch set disables the warning; a proper fix will be required for
> >   famfs to work at all in 6.15. Dan W. and I are actively discussing how to do
> >   this properly...
> > 
> > * In terms of the correct functionality of famfs, the warning can be ignored.
> > 
> > References
> > 
> > [1] - https://github.com/libfuse/libfuse/pull/1200
> > [2] - https://github.com/cxl-micron-reskit/famfs
> 
> Thanks for posting links, I'll have a look there too.
> 
> --D
> 

I'm happy to talk if you wanna kick ideas around.

Cheers,
John
Re: [RFC PATCH 00/19] famfs: port into fuse
Posted by Darrick J. Wong 7 months, 3 weeks ago
On Mon, Apr 21, 2025 at 05:00:35PM -0500, John Groves wrote:
> On 25/04/21 11:27AM, Darrick J. Wong wrote:
> > On Sun, Apr 20, 2025 at 08:33:27PM -0500, John Groves wrote:
> > > Subject: famfs: port into fuse
> > > 
> > > This is the initial RFC for the fabric-attached memory file system (famfs)
> > > integration into fuse. In order to function, this requires a related patch
> > > to libfuse [1] and the famfs user space [2]. 
> > > 
> > > This RFC is mainly intended to socialize the approach and get feedback from
> > > the fuse developers and maintainers. There is some dax work that needs to
> > > be done before this should be merged (see the "poisoned page|folio problem"
> > > below).
> > 
> > Note that I'm only looking at the fuse and iomap aspects of this
> > patchset.  I don't know the devdax code at all.
> > 
> > > This patch set fully works with Linux 6.14 -- passing all existing famfs
> > > smoke and unit tests -- and I encourage existing famfs users to test it.
> > > 
> > > This is really two patch sets mashed up:
> > > 
> > > * The patches with the dev_dax_iomap: prefix fill in missing functionality for
> > >   devdax to host an fs-dax file system.
> > > * The famfs_fuse: patches add famfs into fs/fuse/. These are effectively
> > >   unchanged since last year.
> > > 
> > > Because this is not ready to merge yet, I have felt free to leave some debug
> > > prints in place because we still find them useful; those will be cleaned up
> > > in a subsequent revision.
> > > 
> > > Famfs Overview
> > > 
> > > Famfs exposes shared memory as a file system. Famfs consumes shared memory
> > > from dax devices, and provides memory-mappable files that map directly to
> > > the memory - no page cache involvement. Famfs differs from conventional
> > > file systems in fs-dax mode, in that it handles in-memory metadata in a
> > > sharable way (which begins with never caching dirty shared metadata).
> > > 
> > > Famfs started as a standalone file system [3,4], but the consensus at LSFMM
> > > 2024 [5] was that it should be ported into fuse - and this RFC is the first
> > > public evidence that I've been working on that.
> > 
> > This is very timely, as I just started looking into how I might connect
> > iomap to fuse so that most of the hot IO path continues to run in the
> > kernel, and userspace block device filesystem drivers merely supply the
> > file mappings to the kernel.  In other words, we kick the metadata
> > parsing craziness out of the kernel.
> 
> Coool!
> 
> > 
> > > The key performance requirement is that famfs must resolve mapping faults
> > > without upcalls. This is achieved by fully caching the file-to-devdax
> > > metadata for all active files. This is done via two fuse client/server
> > > message/response pairs: GET_FMAP and GET_DAXDEV.
> > 
> > Heh, just last week I finally got around to laying out how I think I'd
> > want to expose iomap through fuse to allow ->iomap_begin/->iomap_end
> > upcalls to a fuse server.  Note that I've done zero prototyping but
> > "upload all the mappings at open time" seems like a reasonable place for
> > me to start looking, especially for a filesystem with static mappings.
> > 
> > I think what I want to try to build is an in-kernel mapping cache (sort
> > of like the one you built), only with upcalls to the fuse server when
> > there is no mapping information for a given IO.  I'd probably want to
> > have a means for the fuse server to put new mappings into the cache, or
> > invalidate existing mappings.
> > 
> > (famfs obviously is a simple corner-case of that grandiose vision, but I
> > still have a long way to get to my larger vision so don't take my words
> > as any kind of requirement.)
> > 
> > > Famfs remains the first fs-dax file system that is backed by devdax rather
> > > than pmem in fs-dax mode (hence the need for the dev_dax_iomap fixups).
> > > 
> > > Notes
> > > 
> > > * Once the dev_dax_iomap patches land, I suspect it may make sense for
> > >   virtiofs to update to use the improved interface.
> > > 
> > > * I'm currently maintaining compatibility between the famfs user space and
> > >   both the standalone famfs kernel file system and this new fuse
> > >   implementation. In the near future I'll be running performance comparisons
> > >   and sharing them - but there is no reason to expect significant degradation
> > >   with fuse, since famfs caches entire "fmaps" in the kernel to resolve
> > 
> > I'm curious to hear what you find, performance-wise. :)
> > 
> > >   faults with no upcalls. This patch has a bit too much debug turned on to
> > >   do that testing quite yet. A branch 
> > 
> > A branch ... what?
> 
> I trail off sometimes... ;)
> 
> > 
> > > * Two new fuse messages / responses are added: GET_FMAP and GET_DAXDEV.
> > > 
> > > * When a file is looked up in a famfs mount, the LOOKUP is followed by a
> > >   GET_FMAP message and response. The "fmap" is the full file-to-dax mapping,
> > >   allowing the fuse/famfs kernel code to handle read/write/fault without any
> > >   upcalls.
> > 
> > Huh, I'd have thought you'd wait until FUSE_OPEN to start preloading
> > mappings into the kernel.
> 
> That may be a better approach. Miklos and I discussed it during LPC last year,
> and thought both were options. Having implemented it at LOOKUP time, I think
> moving it to open might avoid my READDIRPLUS problem: RDP is a mashup of
> READDIR and LOOKUP, and therefore might need to carry the GET_FMAP payload as
> well. Moving GET_FMAP to open time would break that connection in a good way,
> I think.

I wonder if we could just add a couple new "notification" types so that
the fuse server can initiate uploads of mappings whenever it feels like
it.  For your usage model I don't think it'll make much difference since
they seem pretty static, but the ability to do that would open up some
flexibility for famfs.  The more general filesystems will need it
anyway, and someone's going to want to truncate a famfs file.  They
always do. ;)

> > 
> > > * After each GET_FMAP, the fmap is checked for extents that reference
> > >   previously-unknown daxdevs. Each such occurrence is handled with a
> > >   GET_DAXDEV message and response.
> > 
> > I hadn't figured out how this part would work for my silly prototype.
> > Just out of curiosity, does the famfs fuse server hold an open fd to the
> > storage, in which case the fmap(ping) could just contain the open fd?
> > 
> > Where are the mappings that are sent from the fuse server?  Is that
> > struct fuse_famfs_simple_ext?
> 
> See patch 17 or fs/fuse/famfs_kfmap.h for the fmap metadata explanation. 
> Famfs currently supports either simple extents (daxdev, offset, length) or 
> interleaved ones (which describe each "strip" as a simple extent). I think 
> the explanation in famfs_kfmap.h is pretty clear.
> 
> A key question is whether any additional basic metadata abstractions would
> be needed - because the kernel needs to understand the full scheme.
> 
> With disaggregated memory, the interleave approach is nice because it gets
> aggregated performance and resolving a file offset to daxdev offset is order
> 1.
> 
> Oh, and there are two fmap formats (ok, more, but the others are legacy ;).
> The fmaps-in-messages structs are currently in the famfs section of
> include/uapi/linux/fuse.h. And the in-memory version is in 
> fs/fuse/famfs_kfmap.h. The former will need to be a versioned interface.
> (ugh...)

Ok, will take a look tomorrow morning.

> > 
> > > * Daxdevs are stored in a table (which might become an xarray at some point).
> > >   When entries are added to the table, we acquire exclusive access to the
> > >   daxdev via the fs_dax_get() call (modeled after how fs-dax handles this
> > >   with pmem devices). famfs provides holder_operations to devdax, providing
> > >   a notification path in the event of memory errors.
> > > 
> > > * If devdax notifies famfs of memory errors on a dax device, famfs currently
> > >   blocks all subsequent accesses to data on that device. The recovery is to
> > >   re-initialize the memory and file system. Famfs is memory, not storage...
> > 
> > Ouch. :)
> 
> Cautious initial approach (i.e. I'm trying not to scare people too much ;) 
> 
> > 
> > > * Because famfs uses backing (devdax) devices, only privileged mounts are
> > >   supported.
> > > 
> > > * The famfs kernel code never accesses the memory directly - it only
> > >   facilitates read, write and mmap on behalf of user processes. As such,
> > >   the RAS of the shared memory affects applications, but not the kernel.
> > > 
> > > * Famfs has backing device(s), but they are devdax (char) rather than
> > >   block. Right now there is no way to tell the vfs layer that famfs has a
> > >   char backing device (unless we say it's block, but it's not). Currently
> > >   we use the standard anonymous fuse fs_type - but I'm not sure that's
> > >   ultimately optimal (thoughts?)
> > 
> > Does it work if the fusefs server adds "-o fsname=<devdax cdev>" to the
> > fuse_args object?  fuse2fs does that, though I don't recall if that's a
> > reasonable thing to do.
> 
> The kernel needs to "own" the dax devices. fs-dax on pmem/block calls
> fs_dax_get_by_bdev() and passes in holder_operations - which are used for
> error upcalls, but also effect exclusive ownership. 
> 
> I added fs_dax_get() since the bdev version wasn't really right for char
> devdax. But same holder_operations.
> 
> I had originally intended to pass in "-o daxdev=<cdev>", but famfs needs to
> span multiple daxdevs, in order to interleave for performance. The approach
> of retrieving them with GET_DAXDEV handles the generalized case, so "-o"
> just amounts to a second way to do the same thing.

Oh, hah, it's a multi-device filesystem.  Hee hee hee...

> "But wait"... I thought. Doesn't the "-o" approach get the primary daxdev
> locked up sooner, which might be good? Well, no, because famfs creates a
> couple of meta files during mount (.meta/.superblock and .meta/.log) - and 
> those are guaranteed to reference the primary daxdev. So I concluded the -o
> approach wasn't worth the trouble (though it's not *much* trouble).

<nod> For block devices, someone needs to own the bdev O_EXCL, but it
doesn't have to be the kernel.  Though ... I wonder what *does* happen
when something tries to invoke the bdev holder_ops?  Maybe it would
be nice to freeze the fs, but I don't know if fuse already does that.

> > 
> > > The "poisoned page|folio problem"
> > > 
> > > * Background: before doing a kernel mount, the famfs user space [2] validates
> > >   the superblock and log. This is done via raw mmap of the primary devdax
> > >   device. If valid, the file system is mounted, and the superblock and log
> > >   get exposed through a pair of files (.meta/.superblock and .meta/.log) -
> > >   because we can't be using raw device mmap when a file system is mounted
> > >   on the device. But this exposes a devdax bug and warning...
> > > 
> > > * Pages that have been memory mapped via devdax are left in a permanently
> > >   problematic state. Devdax sets page|folio->mapping when a page is accessed
> > >   via raw devdax mmap (as famfs does before mount), but never cleans it up.
> > >   When the pages of the famfs superblock and log are accessed via the "meta"
> > >   files after mount, we see a WARN_ONCE() in dax_insert_entry(), which
> > >   notices that page|folio->mapping is still set. I intend to address this
> > >   prior to asking for the famfs patches to be merged.
> > > 
> > > * Alistair Popple's recent dax patch series [6], which has been merged
> > >   for 6.15, addresses some dax issues, but sadly does not fix the poisoned
> > >   page|folio problem - its enhanced refcount checking turns the warning into
> > >   an error.
> > > 
> > > * This 6.14 patch set disables the warning; a proper fix will be required for
> > >   famfs to work at all in 6.15. Dan W. and I are actively discussing how to do
> > >   this properly...
> > > 
> > > * In terms of the correct functionality of famfs, the warning can be ignored.
> > > 
> > > References
> > > 
> > > [1] - https://github.com/libfuse/libfuse/pull/1200
> > > [2] - https://github.com/cxl-micron-reskit/famfs
> > 
> > Thanks for posting links, I'll have a look there too.
> > 
> > --D
> > 
> 
> I'm happy to talk if you wanna kick ideas around.

Heheh I will, but give me a day or two to wander through the rest of the
patches, or maybe just decide to pull the branch and look at one huge
diff.

--D

> Cheers,
> John
> 
>
Re: [RFC PATCH 00/19] famfs: port into fuse
Posted by John Groves 7 months, 3 weeks ago
On 25/04/21 06:25PM, Darrick J. Wong wrote:
> On Mon, Apr 21, 2025 at 05:00:35PM -0500, John Groves wrote:
> > On 25/04/21 11:27AM, Darrick J. Wong wrote:
> > > On Sun, Apr 20, 2025 at 08:33:27PM -0500, John Groves wrote:
> > > > Subject: famfs: port into fuse
> > > > 
> > > > This is the initial RFC for the fabric-attached memory file system (famfs)
> > > > integration into fuse. In order to function, this requires a related patch
> > > > to libfuse [1] and the famfs user space [2]. 
> > > > 
> > > > This RFC is mainly intended to socialize the approach and get feedback from
> > > > the fuse developers and maintainers. There is some dax work that needs to
> > > > be done before this should be merged (see the "poisoned page|folio problem"
> > > > below).
> > > 
> > > Note that I'm only looking at the fuse and iomap aspects of this
> > > patchset.  I don't know the devdax code at all.
> > > 
> > > > This patch set fully works with Linux 6.14 -- passing all existing famfs
> > > > smoke and unit tests -- and I encourage existing famfs users to test it.
> > > > 
> > > > This is really two patch sets mashed up:
> > > > 
> > > > * The patches with the dev_dax_iomap: prefix fill in missing functionality for
> > > >   devdax to host an fs-dax file system.
> > > > * The famfs_fuse: patches add famfs into fs/fuse/. These are effectively
> > > >   unchanged since last year.
> > > > 
> > > > Because this is not ready to merge yet, I have felt free to leave some debug
> > > > prints in place because we still find them useful; those will be cleaned up
> > > > in a subsequent revision.
> > > > 
> > > > Famfs Overview
> > > > 
> > > > Famfs exposes shared memory as a file system. Famfs consumes shared memory
> > > > from dax devices, and provides memory-mappable files that map directly to
> > > > the memory - no page cache involvement. Famfs differs from conventional
> > > > file systems in fs-dax mode, in that it handles in-memory metadata in a
> > > > sharable way (which begins with never caching dirty shared metadata).
> > > > 
> > > > Famfs started as a standalone file system [3,4], but the consensus at LSFMM
> > > > 2024 [5] was that it should be ported into fuse - and this RFC is the first
> > > > public evidence that I've been working on that.
> > > 
> > > This is very timely, as I just started looking into how I might connect
> > > iomap to fuse so that most of the hot IO path continues to run in the
> > > kernel, and userspace block device filesystem drivers merely supply the
> > > file mappings to the kernel.  In other words, we kick the metadata
> > > parsing craziness out of the kernel.
> > 
> > Coool!
> > 
> > > 
> > > > The key performance requirement is that famfs must resolve mapping faults
> > > > without upcalls. This is achieved by fully caching the file-to-devdax
> > > > metadata for all active files. This is done via two fuse client/server
> > > > message/response pairs: GET_FMAP and GET_DAXDEV.
> > > 
> > > Heh, just last week I finally got around to laying out how I think I'd
> > > want to expose iomap through fuse to allow ->iomap_begin/->iomap_end
> > > upcalls to a fuse server.  Note that I've done zero prototyping but
> > > "upload all the mappings at open time" seems like a reasonable place for
> > > me to start looking, especially for a filesystem with static mappings.
> > > 
> > > I think what I want to try to build is an in-kernel mapping cache (sort
> > > of like the one you built), only with upcalls to the fuse server when
> > > there is no mapping information for a given IO.  I'd probably want to
> > > have a means for the fuse server to put new mappings into the cache, or
> > > invalidate existing mappings.
> > > 
> > > (famfs obviously is a simple corner-case of that grandiose vision, but I
> > > still have a long way to get to my larger vision so don't take my words
> > > as any kind of requirement.)
> > > 
> > > > Famfs remains the first fs-dax file system that is backed by devdax rather
> > > > than pmem in fs-dax mode (hence the need for the dev_dax_iomap fixups).
> > > > 
> > > > Notes
> > > > 
> > > > * Once the dev_dax_iomap patches land, I suspect it may make sense for
> > > >   virtiofs to update to use the improved interface.
> > > > 
> > > > * I'm currently maintaining compatibility between the famfs user space and
> > > >   both the standalone famfs kernel file system and this new fuse
> > > >   implementation. In the near future I'll be running performance comparisons
> > > >   and sharing them - but there is no reason to expect significant degradation
> > > >   with fuse, since famfs caches entire "fmaps" in the kernel to resolve
> > > 
> > > I'm curious to hear what you find, performance-wise. :)
> > > 
> > > >   faults with no upcalls. This patch has a bit too much debug turned on to
> > > >   do that testing quite yet. A branch 
> > > 
> > > A branch ... what?
> > 
> > I trail off sometimes... ;)
> > 
> > > 
> > > > * Two new fuse messages / responses are added: GET_FMAP and GET_DAXDEV.
> > > > 
> > > > * When a file is looked up in a famfs mount, the LOOKUP is followed by a
> > > >   GET_FMAP message and response. The "fmap" is the full file-to-dax mapping,
> > > >   allowing the fuse/famfs kernel code to handle read/write/fault without any
> > > >   upcalls.
> > > 
> > > Huh, I'd have thought you'd wait until FUSE_OPEN to start preloading
> > > mappings into the kernel.
> > 
> > That may be a better approach. Miklos and I discussed it during LPC last year, 
> > and thought both were options. Having implemented it at LOOKUP time, I think
> > moving it to open might avoid my READDIRPLUS problem (which is that RDP is a
> > mashup of READDIR and LOOKUP, and therefore might need to carry the GET_FMAP
> > payload). Moving GET_FMAP to open time would break that connection in a good
> > way, I think.
> 
> I wonder if we could just add a couple new "notification" types so that
> the fuse server can initiate uploads of mappings whenever it feels like
> it.  For your usage model I don't think it'll make much difference since
> they seem pretty static, but the ability to do that would open up some
> flexibility for famfs.  The more general filesystems will need it
> anyway, and someone's going to want to truncate a famfs file.  They
> always do. ;)
> 
> > > 
> > > > * After each GET_FMAP, the fmap is checked for extents that reference
> > > >   previously-unknown daxdevs. Each such occurrence is handled with a
> > > >   GET_DAXDEV message and response.
> > > 
> > > I hadn't figured out how this part would work for my silly prototype.
> > > Just out of curiosity, does the famfs fuse server hold an open fd to the
> > > storage, in which case the fmap(ping) could just contain the open fd?
> > > 
> > > Where are the mappings that are sent from the fuse server?  Is that
> > > struct fuse_famfs_simple_ext?
> > 
> > See patch 17 or fs/fuse/famfs_kfmap.h for the fmap metadata explanation. 
> > Famfs currently supports either simple extents (daxdev, offset, length) or 
> > interleaved ones (which describe each "strip" as a simple extent). I think 
> > the explanation in famfs_kfmap.h is pretty clear.
> > 
> > A key question is whether any additional basic metadata abstractions would
> > be needed - because the kernel needs to understand the full scheme.
> > 
> > With disaggregated memory, the interleave approach is nice because it gets
> > aggregated performance, and resolving a file offset to a daxdev offset is
> > O(1).
> > 
> > Oh, and there are two fmap formats (ok, more, but the others are legacy ;).
> > The fmaps-in-messages structs are currently in the famfs section of
> > include/uapi/linux/fuse.h. And the in-memory version is in 
> > fs/fuse/famfs_kfmap.h. The former will need to be a versioned interface.
> > (ugh...)
> 
> Ok, will take a look tomorrow morning.
> 
> > > 
> > > > * Daxdevs are stored in a table (which might become an xarray at some point).
> > > >   When entries are added to the table, we acquire exclusive access to the
> > > >   daxdev via the fs_dax_get() call (modeled after how fs-dax handles this
> > > >   with pmem devices). famfs provides holder_operations to devdax, providing
> > > >   a notification path in the event of memory errors.
> > > > 
> > > > * If devdax notifies famfs of memory errors on a dax device, famfs currently
> > > >   blocks all subsequent accesses to data on that device. The recovery is to
> > > >   re-initialize the memory and file system. Famfs is memory, not storage...
> > > 
> > > Ouch. :)
> > 
> > Cautious initial approach (i.e. I'm trying not to scare people too much ;) 
> > 
> > > 
> > > > * Because famfs uses backing (devdax) devices, only privileged mounts are
> > > >   supported.
> > > > 
> > > > * The famfs kernel code never accesses the memory directly - it only
> > > >   facilitates read, write and mmap on behalf of user processes. As such,
> > > >   the RAS of the shared memory affects applications, but not the kernel.
> > > > 
> > > > * Famfs has backing device(s), but they are devdax (char) rather than
> > > >   block. Right now there is no way to tell the vfs layer that famfs has a
> > > >   char backing device (unless we say it's block, but it's not). Currently
> > > >   we use the standard anonymous fuse fs_type - but I'm not sure that's
> > > >   ultimately optimal (thoughts?)
> > > 
> > > Does it work if the fusefs server adds "-o fsname=<devdax cdev>" to the
> > > fuse_args object?  fuse2fs does that, though I don't recall if that's a
> > > reasonable thing to do.
> > 
> > The kernel needs to "own" the dax devices. fs-dax on pmem/block calls
> > fs_dax_get_by_bdev() and passes in holder_operations - which are used for
> > error upcalls, but also effect exclusive ownership. 
> > 
> > I added fs_dax_get() since the bdev version wasn't really right for char
> > devdax. But same holder_operations.
> > 
> > I had originally intended to pass in "-o daxdev=<cdev>", but famfs needs to
> > span multiple daxdevs, in order to interleave for performance. The approach
> > of retrieving them with GET_DAXDEV handles the generalized case, so "-o"
> > just amounts to a second way to do the same thing.
> 
> Oh, hah, it's a multi-device filesystem.  Hee hee hee...

Hee hee indeed. The thing about memory, and dax devices, is that there
isn't anything like device mapper that can make compound or interleaved
devices. There's not a "stop while dma happens" point for swizzling 
addresses. I'm down for a discussion about whether there is a viable way 
to have a mapper layer, but I also think constructing interleaved objects 
as files is quite good - and might be the best solution.

Interleaving is essential to memory performance in general. System-ram is
pretty much never not interleaved. And there are some reasons why programming
the hardware to do the interleaving is gonna be a problem for non-static 
setups. I'll save going down that rathole for a different time...
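
To make the "interleaved objects as files" idea concrete, here is a minimal
userspace C sketch of how a file offset could be resolved to a (daxdev, offset)
pair when a file is described as round-robin chunks over several strips. The
struct and function names (simple_ext, interleaved_ext, dax_target,
resolve_interleaved) are hypothetical and chosen for illustration only; they
are not the definitions in fs/fuse/famfs_kfmap.h or include/uapi/linux/fuse.h,
and the layout is an assumption based on the description of strips as simple
extents (daxdev, offset, length).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simple extent: one contiguous byte range on one daxdev. */
struct simple_ext {
	uint32_t dev_index;   /* index into a table of daxdevs */
	uint64_t offset;      /* starting byte offset on that daxdev */
	uint64_t len;         /* length of the range in bytes */
};

/*
 * Hypothetical interleaved extent: file data is laid out in chunk_size-byte
 * chunks, assigned round-robin to nstrips strips. Each strip is a simple
 * extent.
 */
struct interleaved_ext {
	uint64_t chunk_size;
	uint32_t nstrips;
	const struct simple_ext *strips;
};

/* Result of a lookup: which daxdev, and where on it. */
struct dax_target {
	uint32_t dev_index;
	uint64_t dev_offset;
};

/*
 * Resolve a file offset with a fixed number of divisions and multiplies,
 * independent of file size or strip count (no search, no upcall).
 * Returns 0 on success, -1 on bad arguments or an out-of-range offset.
 */
static int resolve_interleaved(const struct interleaved_ext *ie,
			       uint64_t file_off, struct dax_target *out)
{
	if (!ie || !out || !ie->strips || ie->chunk_size == 0 || ie->nstrips == 0)
		return -1;

	uint64_t chunk_num  = file_off / ie->chunk_size;  /* which chunk of the file */
	uint64_t chunk_off  = file_off % ie->chunk_size;  /* offset inside that chunk */
	uint64_t strip_idx  = chunk_num % ie->nstrips;    /* strip holding the chunk */
	uint64_t stripe_num = chunk_num / ie->nstrips;    /* full stripes before it */

	/* Offset relative to the start of the chosen strip. */
	uint64_t strip_off = stripe_num * ie->chunk_size + chunk_off;

	const struct simple_ext *s = &ie->strips[strip_idx];
	if (strip_off >= s->len)
		return -1;  /* offset falls past the end of this strip */

	out->dev_index  = s->dev_index;
	out->dev_offset = s->offset + strip_off;
	return 0;
}
```

With two strips and 2 MiB chunks, for instance, file offset 0x200010 falls in
chunk 1, so it lands on strip 1 at strip-relative offset 0x10. Because the
lookup is pure arithmetic on cached metadata, it is the kind of thing a fault
handler can do without consulting the fuse server.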

John