From: David Howells
To: Christian Brauner
cc: dhowells@redhat.com, Paulo Alcantara (Red Hat), Jeff Layton,
    Viacheslav Dubeyko, Alex Markuze, Timothy Day, Jonathan Corbet,
    netfs@lists.linux.dev, linux-doc@vger.kernel.org,
    linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH] netfs: Update main API document
Date: Tue, 08 Apr 2025 16:09:57 +0100
Message-ID: <1565252.1744124997@warthog.procyon.org.uk>

Bring the netfs documentation up to date.

Signed-off-by: David Howells
Reviewed-by: Paulo Alcantara (Red Hat)
cc: Jeff Layton
cc: Viacheslav Dubeyko
cc: Alex Markuze
cc: Timothy Day
cc: Jonathan Corbet
cc: netfs@lists.linux.dev
cc: linux-doc@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
 Documentation/filesystems/netfs_library.rst | 995 ++++++++++++++++++++--------
 1 file changed, 718 insertions(+), 277 deletions(-)

diff --git a/Documentation/filesystems/netfs_library.rst b/Documentation/filesystems/netfs_library.rst
index 3886c14f89f4..ce6a7109e941 100644
--- a/Documentation/filesystems/netfs_library.rst
+++ b/Documentation/filesystems/netfs_library.rst
@@ -1,33 +1,185 @@
 .. SPDX-License-Identifier: GPL-2.0
 
-=================================
-Network Filesystem Helper Library
-=================================
+===================================
+Network Filesystem Services Library
+===================================
 
 .. Contents:
 
  - Overview.
+   - Requests and streams.
+   - Subrequests.
+   - Result collection and retry.
+   - Local caching.
+   - Content encryption (fscrypt).
  - Per-inode context.
    - Inode context helper functions.
- - Buffered read helpers.
-   - Read helper functions.
-   - Read helper structures.
-   - Read helper operations.
-   - Read helper procedure.
-   - Read helper cache API.
+   - Inode locking.
+   - Inode writeback.
+ - High-level VFS API.
+   - Unlocked read/write iter.
+   - Pre-locked read/write iter.
+   - Monolithic files API.
+   - Memory-mapped I/O API.
+ - High-level VM API.
+   - Deprecated PG_private_2 API.
+ - I/O request API.
+   - Request structure.
+   - Stream structure.
+   - Subrequest structure.
+   - Filesystem methods.
+   - Terminating a subrequest.
+   - Local cache API.
+ - API function reference.
 
 
 Overview
 ========
 
-The network filesystem helper library is a set of functions designed to aid a
-network filesystem in implementing VM/VFS operations. For the moment, that
-just includes turning various VM buffered read operations into requests to read
-from the server. The helper library, however, can also interpose other
-services, such as local caching or local data encryption.
+The network filesystem services library, netfslib, is a set of functions
+designed to aid a network filesystem in implementing VM/VFS API operations. It
+takes over the normal buffered read, readahead, write and writeback and also
+handles unbuffered and direct I/O.
 
-Note that the library module doesn't link against local caching directly, so
-access must be provided by the netfs.
+The library provides support for (re-)negotiation of I/O sizes and retrying
+failed I/O as well as local caching and will, in the future, provide content
+encryption.
+
+It insulates the filesystem from VM interface changes as much as possible and
+handles VM features such as large multipage folios.
+The filesystem basically
+just has to provide a way to perform read and write RPC calls.
+
+I/O inside netfslib is organised around a number of objects:
+
+ * A *request*. A request is used to track the progress of the I/O overall and
+   to hold on to resources. The collection of results is done at the request
+   level. The I/O within a request is divided into a number of parallel
+   streams of subrequests.
+
+ * A *stream*. A non-overlapping series of subrequests. The subrequests
+   within a stream do not have to be contiguous.
+
+ * A *subrequest*. This is the basic unit of I/O. It represents a single RPC
+   call or a single cache I/O operation. The library passes these to the
+   filesystem and the cache to perform.
+
+Requests and Streams
+--------------------
+
+When actually performing I/O (as opposed to just copying into the pagecache),
+netfslib will create one or more requests to track the progress of the I/O and
+to hold resources.
+
+A read operation will have a single stream and the subrequests within that
+stream may be of mixed origins, for instance mixing RPC subrequests and cache
+subrequests.
+
+On the other hand, a write operation may have multiple streams, where each
+stream targets a different destination. For instance, there may be one stream
+writing to the local cache and one to the server. Currently, only two streams
+are allowed, but this could be increased if parallel writes to multiple servers
+are desired.
+
+The subrequests within a write stream do not need to match alignment or size
+with the subrequests in another write stream and netfslib performs the tiling
+of subrequests in each stream over the source buffer independently. Further,
+each stream may contain holes that don't correspond to holes in the other
+stream.
+
+In addition, the subrequests do not need to correspond to the boundaries of the
+folios or vectors in the source/destination buffer.
+The library handles the
+collection of results and the wrangling of folio flags and references.
+
+Subrequests
+-----------
+
+Subrequests are at the heart of the interaction between netfslib and the
+filesystem using it. Each subrequest is expected to correspond to a single
+read or write RPC or cache operation. The library will stitch together the
+results from a set of subrequests to provide a higher level operation.
+
+Netfslib has two interactions with the filesystem or the cache when setting up
+a subrequest. First, there's an optional preparatory step that allows the
+filesystem to negotiate the limits on the subrequest, both in terms of maximum
+number of bytes and maximum number of vectors (e.g. for RDMA). This may
+involve negotiating with the server (e.g. cifs needing to acquire credits).
+
+And, secondly, there's the issuing step in which the subrequest is handed off
+to the filesystem to perform.
+
+Note that these two steps are done slightly differently between read and write:
+
+ * For reads, the VM/VFS tells us how much is being requested up front, so the
+   library can preset maximum values that the cache and then the filesystem can
+   reduce. The cache also gets consulted first on whether it wants to do
+   a read before the filesystem is consulted.
+
+ * For writeback, it is unknown how much there will be to write until the
+   pagecache is walked, so no limit is set by the library.
+
+Once a subrequest is completed, the filesystem or cache informs the library of
+the completion and then collection is invoked. Depending on whether the
+request is synchronous or asynchronous, the collection of results will be done
+in either the application thread or in a work queue.
+
+Result Collection and Retry
+---------------------------
+
+As subrequests complete, the results are collected and collated by the library
+and folio unlocking is performed progressively (if appropriate).
+Once the
+request is complete, async completion will be invoked (again, if appropriate).
+The filesystem can provide interim progress reports to the library to cause
+folio unlocking to happen earlier.
+
+If any subrequests fail, netfslib can retry them. It will wait until all
+subrequests are completed, offer the filesystem the opportunity to fiddle with
+the resources/state held by the request and poke at the subrequests before
+re-preparing and re-issuing the subrequests.
+
+This allows the tiling of contiguous sets of failed subrequests within a stream
+to be changed, adding more subrequests or ditching excess as necessary (for
+instance, if the network sizes change or the server decides it wants smaller
+chunks).
+
+Further, if a read from the cache fails, the library will ask the filesystem to
+do the read instead, renegotiating and retiling the subrequests as necessary.
+
+Local Caching
+-------------
+
+One of the services netfslib provides, via ``fscache``, is the option to cache
+on local disk a copy of the data obtained from/written to a network filesystem.
+The library will manage the storing, retrieval and some invalidation of data
+automatically on behalf of the filesystem if a cookie is attached to the
+``netfs_inode``.
+
+Note that local caching used to use the PG_private_2 flag (aliased as
+PG_fscache) to keep track of a page that was being written to the cache, but
+this is now deprecated as PG_private_2 will be removed.
+
+Instead, folios that are read from the server for which there was no data in
+the cache will be marked as dirty and will have ``folio->private`` set to a
+special value (``NETFS_FOLIO_COPY_TO_CACHE``) and left to writeback to write.
+If the folio is modified before that happens, the special value will be
+cleared and the folio will become normally dirty.
+
+When writeback occurs, folios that are so marked will only be written to the
+cache and not to the server.
+Writeback handles mixed cache-only writes and
+server-and-cache writes by using two streams, sending one to the cache and one
+to the server. The server stream will have gaps in it corresponding to those
+folios.
+
+Content Encryption (fscrypt)
+----------------------------
+
+Though it does not do so yet, at some point netfslib will acquire the ability
+to do client-side content encryption on behalf of the network filesystem (Ceph,
+for example). fscrypt can be used for this if appropriate (it may not be -
+cifs, for example).
+
+The data will be stored encrypted in the local cache using the same manner of
+encryption as the data written to the server and the library will impose bounce
+buffering and RMW cycles as necessary.
 
 
 Per-Inode Context
@@ -40,10 +192,13 @@ structure is defined::
 	struct netfs_inode {
 		struct inode inode;
 		const struct netfs_request_ops *ops;
-		struct fscache_cookie *cache;
+		struct fscache_cookie * cache;
+		loff_t remote_i_size;
+		unsigned long flags;
+		...
 	};
 
-A network filesystem that wants to use netfs lib must place one of these in its
+A network filesystem that wants to use netfslib must place one of these in its
 inode wrapper struct instead of the VFS ``struct inode``. This can be done in
 a way similar to the following::
 
@@ -56,7 +211,8 @@ This allows netfslib to find its state by using ``container_of()`` from the
 inode pointer, thereby allowing the netfslib helper functions to be pointed to
 directly by the VFS/VM operation tables.
 
-The structure contains the following fields:
+The structure contains the following fields that are of interest to the
+filesystem:
 
  * ``inode``
 
@@ -71,6 +227,37 @@ The structure contains the following fields:
    Local caching cookie, or NULL if no caching is enabled. This field does not
    exist if fscache is disabled.
 
+ * ``remote_i_size``
+
+   The size of the file on the server.
+   This differs from inode->i_size if
+   local modifications have been made but not yet written back.
+
+ * ``flags``
+
+   A set of flags, some of which the filesystem might be interested in:
+
+   * ``NETFS_ICTX_MODIFIED_ATTR``
+
+     Set if netfslib modifies mtime/ctime. The filesystem is free to ignore
+     this or clear it.
+
+   * ``NETFS_ICTX_UNBUFFERED``
+
+     Do unbuffered I/O upon the file. Like direct I/O but without the
+     alignment limitations. RMW will be performed if necessary. The pagecache
+     will not be used unless mmap() is also used.
+
+   * ``NETFS_ICTX_WRITETHROUGH``
+
+     Do writethrough caching upon the file. I/O will be set up and dispatched
+     as buffered writes are made to the page cache. mmap() does the normal
+     writeback thing.
+
+   * ``NETFS_ICTX_SINGLE_NO_UPLOAD``
+
+     Set if the file has monolithic content that must be read entirely in a
+     single go and must not be written back to the server, though it can be
+     cached (e.g. AFS directories).
 
 Inode Context Helper Functions
 ------------------------------
@@ -84,117 +271,234 @@ set the operations table pointer::
 
 then a function to cast from the VFS inode structure to the netfs context::
 
-	struct netfs_inode *netfs_node(struct inode *inode);
+	struct netfs_inode *netfs_inode(struct inode *inode);
 
 and finally, a function to get the cache cookie pointer from the context
 attached to an inode (or NULL if fscache is disabled)::
 
 	struct fscache_cookie *netfs_i_cookie(struct netfs_inode *ctx);
 
+Inode Locking
+-------------
+
+A number of functions are provided to manage the locking of i_rwsem for I/O and
+to effectively extend it to provide more separate classes of exclusion::
+
+	int netfs_start_io_read(struct inode *inode);
+	void netfs_end_io_read(struct inode *inode);
+	int netfs_start_io_write(struct inode *inode);
+	void netfs_end_io_write(struct inode *inode);
+	int netfs_start_io_direct(struct inode *inode);
+	void netfs_end_io_direct(struct inode *inode);
+
+The exclusion
+breaks down into four separate classes:
+
+ 1) Buffered reads and writes.
+
+    Buffered reads can run concurrently with each other and with buffered
+    writes, but buffered writes cannot run concurrently with each other.
+
+ 2) Direct reads and writes.
+
+    Direct (and unbuffered) reads and writes can run concurrently since they do
+    not share local buffering (i.e. the pagecache) and, in a network
+    filesystem, are expected to have exclusion managed on the server (though
+    this may not be the case for, say, Ceph).
+
+ 3) Other major inode modifying operations (e.g. truncate, fallocate).
+
+    These should just access i_rwsem directly.
 
-Buffered Read Helpers
-=====================
+ 4) mmap().
 
-The library provides a set of read helpers that handle the ->read_folio(),
-->readahead() and much of the ->write_begin() VM operations and translate them
-into a common call framework.
+    mmap'd accesses might operate concurrently with any of the other classes.
+    They might form the buffer for an intra-file loopback DIO read/write. They
+    might be permitted on unbuffered files.
 
-The following services are provided:
+Inode Writeback
+---------------
 
- * Handle folios that span multiple pages.
+Netfslib will pin resources on an inode for future writeback (such as pinning
+use of an fscache cookie) when an inode is dirtied. However, this needs
+managing. Firstly, a function is provided to unpin the writeback in
+``->write_inode()``::
 
- * Insulate the netfs from VM interface changes.
+	int netfs_unpin_writeback(struct inode *inode, struct writeback_control *wbc);
 
- * Allow the netfs to arbitrarily split reads up into pieces, even ones that
-   don't match folio sizes or folio alignments and that may cross folios.
+and, indeed, this may be set as a filesystem's ``.write_inode`` method.
 
- * Allow the netfs to expand a readahead request in both directions to meet its
-   needs.
+Further, if an inode is deleted, the filesystem's write_inode method may not
+get called, so::
 
- * Allow the netfs to partially fulfil a read, which will then be resubmitted.
+	void netfs_clear_inode_writeback(struct inode *inode, const void *aux);
 
- * Handle local caching, allowing cached data and server-read data to be
-   interleaved for a single request.
+must be called from ``->evict_inode()`` *before* ``clear_inode()`` is called.
 
- * Handle clearing of bufferage that isn't on the server.
 
- * Handle retrying of reads that failed, switching reads from the cache to the
-   server as necessary.
+High-Level VFS API
+==================
 
- * In the future, this is a place that other services can be performed, such as
-   local encryption of data to be stored remotely or in the cache.
+Netfslib provides a number of sets of API calls for the filesystem to delegate
+VFS operations to. Netfslib, in turn, will call out to the filesystem and the
+cache to negotiate I/O sizes, issue RPCs and provide places for it to intervene
+at various times.
 
-From the network filesystem, the helpers require a table of operations. This
-includes a mandatory method to issue a read operation along with a number of
-optional methods.
+Unlocked Read/Write Iter
+------------------------
 
+The first API set is for the delegation of operations to netfslib when the
+filesystem is called through the standard VFS read/write_iter methods::
 
-Read Helper Functions
+	ssize_t netfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter);
+	ssize_t netfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from);
+	ssize_t netfs_buffered_read_iter(struct kiocb *iocb, struct iov_iter *iter);
+	ssize_t netfs_unbuffered_read_iter(struct kiocb *iocb, struct iov_iter *iter);
+	ssize_t netfs_unbuffered_write_iter(struct kiocb *iocb, struct iov_iter *from);
+
+They can be assigned directly to ``.read_iter`` and ``.write_iter``. They
+perform the inode locking themselves and the first two will switch between
+buffered I/O and DIO as appropriate.
+
+Pre-Locked Read/Write Iter
+--------------------------
+
+The second API set is for the delegation of operations to netfslib when the
+filesystem is called through the standard VFS methods, but needs to do some
+other work before or after calling netfslib whilst still inside the locked
+section (e.g. Ceph negotiating caps). The unbuffered read function is::
+
+	ssize_t netfs_unbuffered_read_iter_locked(struct kiocb *iocb, struct iov_iter *iter);
+
+This must not be assigned directly to ``.read_iter`` and the filesystem is
+responsible for performing the inode locking before calling it. In the case of
+buffered read, the filesystem should use ``filemap_read()``.
+
+There are three functions for writes::
+
+	ssize_t netfs_buffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *from,
+			struct netfs_group *netfs_group);
+	ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
+			struct netfs_group *netfs_group);
+	ssize_t netfs_unbuffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *iter,
+			struct netfs_group *netfs_group);
+
+These must not be assigned directly to ``.write_iter`` and the filesystem is
+responsible for performing the inode locking before calling them.
+
+The first two functions are for buffered writes; the first just adds some
+standard write checks and jumps to the second, but if the filesystem wants to
+do the checks itself, it can use the second directly. The third function is
+for unbuffered or DIO writes.
+
+On all three write functions, there is a writeback group pointer (which should
+be NULL if the filesystem doesn't use this). Writeback groups are set on
+folios when they're modified. If a folio to-be-modified is already marked with
+a different group, it is flushed first. The writeback API allows writing back
+of a specific group.
+
+Memory-Mapped I/O API
 ---------------------
 
-Three read helpers are provided::
+An API for support of mmap()'d I/O is provided::
+
+	vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_group);
+
+This allows the filesystem to delegate ``.page_mkwrite`` to netfslib. The
+filesystem should not take the inode lock before calling it, but, as with the
+locked write functions above, this does take a writeback group pointer. If the
+page to be made writable is in a different group, it will be flushed first.
+
+Monolithic Files API
+--------------------
+
+There is also a special API set for files for which the content must be read in
+a single RPC (and not written back) and is maintained as a monolithic blob
+(e.g.
+an AFS directory), though it can be stored and updated in the local cache::
+
+	ssize_t netfs_read_single(struct inode *inode, struct file *file, struct iov_iter *iter);
+	void netfs_single_mark_inode_dirty(struct inode *inode);
+	int netfs_writeback_single(struct address_space *mapping,
+			struct writeback_control *wbc,
+			struct iov_iter *iter);
+
+The first function reads from a file into the given buffer, reading from the
+cache in preference if the data is cached there; the second function allows the
+inode to be marked dirty, causing a later writeback; and the third function can
+be called from the writeback code to write the data to the cache, if there is
+one.
+
+The inode should be marked ``NETFS_ICTX_SINGLE_NO_UPLOAD`` if this API is to be
+used. The writeback function requires the buffer to be of ITER_FOLIOQ type.
+
+High-Level VM API
+==================
+
+Netfslib also provides a number of sets of API calls for the filesystem to
+delegate VM operations to. Again, netfslib, in turn, will call out to the
+filesystem and the cache to negotiate I/O sizes, issue RPCs and provide places
+for it to intervene at various times::
 
-	void netfs_readahead(struct readahead_control *ractl);
-	int netfs_read_folio(struct file *file,
-			struct folio *folio);
-	int netfs_write_begin(struct netfs_inode *ctx,
-			struct file *file,
-			struct address_space *mapping,
-			loff_t pos,
-			unsigned int len,
-			struct folio **_folio,
-			void **_fsdata);
+	void netfs_readahead(struct readahead_control *);
+	int netfs_read_folio(struct file *, struct folio *);
+	int netfs_writepages(struct address_space *mapping,
+			struct writeback_control *wbc);
+	bool netfs_dirty_folio(struct address_space *mapping, struct folio *folio);
+	void netfs_invalidate_folio(struct folio *folio, size_t offset, size_t length);
+	bool netfs_release_folio(struct folio *folio, gfp_t gfp);
 
-Each corresponds to a VM address space operation.
-These operations use the
-state in the per-inode context.
+These are ``address_space_operations`` methods and can be set directly in the
+operations table.
 
-For ->readahead() and ->read_folio(), the network filesystem just point directly
-at the corresponding read helper; whereas for ->write_begin(), it may be a
-little more complicated as the network filesystem might want to flush
-conflicting writes or track dirty data and needs to put the acquired folio if
-an error occurs after calling the helper.
+Deprecated PG_private_2 API
+---------------------------
 
-The helpers manage the read request, calling back into the network filesystem
-through the supplied table of operations. Waits will be performed as
-necessary before returning for helpers that are meant to be synchronous.
+There is also a deprecated function for filesystems that still use the
+``->write_begin`` method::
 
-If an error occurs, the ->free_request() will be called to clean up the
-netfs_io_request struct allocated. If some parts of the request are in
-progress when an error occurs, the request will get partially completed if
-sufficient data is read.
+	int netfs_write_begin(struct netfs_inode *inode, struct file *file,
+			struct address_space *mapping, loff_t pos, unsigned int len,
+			struct folio **_folio, void **_fsdata);
 
-Additionally, there is::
+It uses the deprecated PG_private_2 flag and so should not be used.
 
-  * void netfs_subreq_terminated(struct netfs_io_subrequest *subreq,
-			ssize_t transferred_or_error,
-			bool was_async);
 
-which should be called to complete a read subrequest. This is given the number
-of bytes transferred or a negative error code, plus a flag indicating whether
-the operation was asynchronous (ie. whether the follow-on processing can be
-done in the current context, given this may involve sleeping).
+I/O Request API
+===============
 
+The I/O request API comprises a number of structures and a number of functions
+that the filesystem may need to use.
 
-Read Helper Structures
-----------------------
+Request Structure
+-----------------
 
-The read helpers make use of a couple of structures to maintain the state of
-the read. The first is a structure that manages a read request as a whole::
+The request structure manages the request as a whole, holding some resources
+and state on behalf of the filesystem and tracking the collection of results.
+If the filesystem wants more private data than is afforded by this structure,
+then it should wrap it and provide its own allocator.
+
+The fields generally of interest to a filesystem are::
 
 	struct netfs_io_request {
+		enum netfs_io_origin origin;
 		struct inode *inode;
 		struct address_space *mapping;
-		struct netfs_cache_resources cache_resources;
+		struct netfs_group *group;
+		struct netfs_io_stream io_streams[];
 		void *netfs_priv;
-		loff_t start;
-		size_t len;
-		loff_t i_size;
-		const struct netfs_request_ops *netfs_ops;
+		void *netfs_priv2;
+		unsigned long long start;
+		unsigned long long len;
+		unsigned long long i_size;
 		unsigned int debug_id;
+		unsigned long flags;
 		...
 	};
 
-The above fields are the ones the netfs can use. They are:
+They are:
+
+ * ``origin``
+
+   The origin of the request (readahead, read_folio, DIO read, writeback, ...).
 
  * ``inode``
  * ``mapping``
@@ -202,11 +506,19 @@ The above fields are the ones the netfs can use. They are:
    The inode and the address space of the file being read from. The mapping
    may or may not point to inode->i_data.
 
- * ``cache_resources``
+ * ``group``
 
-   Resources for the local cache to use, if present.
+   The writeback group this request is dealing with or NULL. This holds a ref
+   on the group.
+
+ * ``io_streams``
+
+   The parallel streams of subrequests available to the request.
+   Currently two
+   are available, but this may be made extensible in future. ``NR_IO_STREAMS``
+   indicates the size of the array.
 
  * ``netfs_priv``
+ * ``netfs_priv2``
 
    The network filesystem's private data. The value for this can be passed in
    to the helper functions or set during the request.
@@ -221,37 +533,118 @@ The above fields are the ones the netfs can use. They are:
 
    The size of the file at the start of the request.
 
- * ``netfs_ops``
-
-   A pointer to the operation table. The value for this is passed into the
-   helper functions.
-
  * ``debug_id``
 
    A number allocated to this operation that can be displayed in trace lines
    for reference.
 
+ * ``flags``
+
+   Flags for managing and controlling the operation of the request. Some of
+   these may be of interest to the filesystem:
+
+   * ``NETFS_RREQ_RETRYING``
+
+     Netfslib sets this when generating retries.
+
+   * ``NETFS_RREQ_PAUSE``
+
+     The filesystem can set this to request a pause in the library's subrequest
+     issuing loop - but care needs to be taken as netfslib may also set it.
+
+   * ``NETFS_RREQ_NONBLOCK``
+   * ``NETFS_RREQ_BLOCKED``
+
+     Netfslib sets the first to indicate that non-blocking mode was set by the
+     caller and the filesystem can set the second to indicate that it would
+     have had to block.
+
+   * ``NETFS_RREQ_USE_PGPRIV2``
+
+     The filesystem can set this if it wants to use PG_private_2 to track
+     whether a folio is being written to the cache. This is deprecated as
+     PG_private_2 is going to go away.
 
-The second structure is used to manage individual slices of the overall read
-request::
+Stream Structure
+----------------
+
+A request comprises one or more parallel streams and each stream may be
+aimed at a different target.
+
+For read requests, only stream 0 is used. This can contain a mixture of
+subrequests aimed at different sources. For write requests, stream 0 is used
+for the server and stream 1 is used for the cache.
+For buffered writeback,
+stream 0 is not enabled unless a normal dirty folio is encountered, at which
+point ->begin_writeback() will be invoked and the filesystem can mark the
+stream available.
+
+The stream struct looks like::
+
+	struct netfs_io_stream {
+		unsigned char stream_nr;
+		bool avail;
+		size_t sreq_max_len;
+		unsigned int sreq_max_segs;
+		unsigned int submit_extendable_to;
+		...
+	};
+
+A number of members are available for access/use by the filesystem:
+
+ * ``stream_nr``
+
+   The number of the stream within the request.
+
+ * ``avail``
+
+   True if the stream is available for use. The filesystem should set this on
+   stream zero in ->begin_writeback().
+
+ * ``sreq_max_len``
+ * ``sreq_max_segs``
+
+   These are set by the filesystem or the cache in ->prepare_read() or
+   ->prepare_write() for each subrequest to indicate the maximum number of
+   bytes and, optionally, the maximum number of segments (if not 0) that that
+   subrequest can support.
+
+ * ``submit_extendable_to``
+
+   The size that a subrequest can be rounded up to beyond the EOF, given the
+   available buffer. This allows the cache to work out if it can do a DIO read
+   or write that straddles the EOF marker.
+
+Subrequest Structure
+--------------------
+
+Individual units of I/O are managed by the subrequest structure. These
+represent slices of the overall request and run independently::
 
 	struct netfs_io_subrequest {
 		struct netfs_io_request *rreq;
-		loff_t start;
+		struct iov_iter io_iter;
+		unsigned long long start;
 		size_t len;
 		size_t transferred;
 		unsigned long flags;
+		short error;
 		unsigned short debug_index;
+		unsigned char stream_nr;
 		...
 	};
 
-Each subrequest is expected to access a single source, though the helpers will
+Each subrequest is expected to access a single source, though the library will
 handle falling back from one source type to another. The members are:
 
  * ``rreq``
 
    A pointer to the read request.
=20 + * ``io_iter`` + + An I/O iterator representing a slice of the buffer to be read into or + written from. + * ``start`` * ``len`` =20 @@ -260,241 +653,300 @@ handle falling back from one source type to another= . The members are: =20 * ``transferred`` =20 - The amount of data transferred so far of the length of this slice. The - network filesystem or cache should start the operation this far into the - slice. If a short read occurs, the helpers will call again, having upd= ated - this to reflect the amount read so far. + The amount of data transferred so far for this subrequest. This should= be + added to with the length of the transfer made by this issuance of the + subrequest. If this is less than ``len`` then the subrequest may be + reissued to continue. =20 * ``flags`` =20 - Flags pertaining to the read. There are two of interest to the filesys= tem - or cache: + Flags for managing the subrequest. There are a number of interest to t= he + filesystem or cache: + + * ``NETFS_SREQ_MADE_PROGRESS`` + + Set by the filesystem to indicate that at least one byte of data was= read + or written. + + * ``NETFS_SREQ_HIT_EOF`` + + The filesystem should set this if a read hit the EOF on the file (in = which + case ``transferred`` should stop at the EOF). Netfslib may expand the + subrequest out to the size of the folio containing the EOF on the off + chance that a third party change happened or a DIO read may have aske= d for + more than is available. The library will clear any excess pagecache. =20 * ``NETFS_SREQ_CLEAR_TAIL`` =20 - This can be set to indicate that the remainder of the slice, from - transferred to len, should be cleared. + The filesystem can set this to indicate that the remainder of the sli= ce, + from transferred to len, should be cleared. Do not set if HIT_EOF is= set. + + * ``NETFS_SREQ_NEED_RETRY`` + + The filesystem can set this to tell netfslib to retry the subrequest.
+ + * ``NETFS_SREQ_BOUNDARY`` + + This can be set by the filesystem on a subrequest to indicate that it= ends + at a boundary with the filesystem structure (e.g. at the end of a Ceph + object). It tells netfslib not to retile subrequests across it. =20 * ``NETFS_SREQ_SEEK_DATA_READ`` =20 - This is a hint to the cache that it might want to try skipping ahead = to - the next data (ie. using SEEK_DATA). + This is a hint from netfslib to the cache that it might want to try + skipping ahead to the next data (ie. using SEEK_DATA). + + * ``error`` + + This is for the filesystem to store the result of the subrequest. It shoul= d be + set to 0 if successful and a negative error code otherwise. =20 * ``debug_index`` + * ``stream_nr`` =20 A number allocated to this slice that can be displayed in trace lines f= or - reference. + reference and the number of the request stream that it belongs to. + +If necessary, the filesystem can get and put extra refs on the subrequest = it is +given:: =20 + void netfs_get_subrequest(struct netfs_io_subrequest *subreq, + enum netfs_sreq_ref_trace what); + void netfs_put_subrequest(struct netfs_io_subrequest *subreq, + enum netfs_sreq_ref_trace what); =20 -Read Helper Operations ----------------------- +using netfs trace codes to indicate the reason. Care must be taken, howev= er, +as once control of the subrequest is returned to netfslib, the same subreq= uest +can be reissued/retried.
=20 -The network filesystem must provide the read helpers with a table of opera= tions -through which it can issue requests and negotiate:: +Filesystem Methods +------------------ + +The filesystem sets a table of operations in ``netfs_inode`` for netfslib = to +use:: =20 struct netfs_request_ops { - void (*init_request)(struct netfs_io_request *rreq, struct file *file); + mempool_t *request_pool; + mempool_t *subrequest_pool; + int (*init_request)(struct netfs_io_request *rreq, struct file *file); void (*free_request)(struct netfs_io_request *rreq); + void (*free_subrequest)(struct netfs_io_subrequest *rreq); void (*expand_readahead)(struct netfs_io_request *rreq); - bool (*clamp_length)(struct netfs_io_subrequest *subreq); + int (*prepare_read)(struct netfs_io_subrequest *subreq); void (*issue_read)(struct netfs_io_subrequest *subreq); - bool (*is_still_valid)(struct netfs_io_request *rreq); - int (*check_write_begin)(struct file *file, loff_t pos, unsigned len, - struct folio **foliop, void **_fsdata); void (*done)(struct netfs_io_request *rreq); + void (*update_i_size)(struct inode *inode, loff_t i_size); + void (*post_modify)(struct inode *inode); + void (*begin_writeback)(struct netfs_io_request *wreq); + void (*prepare_write)(struct netfs_io_subrequest *subreq); + void (*issue_write)(struct netfs_io_subrequest *subreq); + void (*retry_request)(struct netfs_io_request *wreq, + struct netfs_io_stream *stream); + void (*invalidate_cache)(struct netfs_io_request *wreq); }; =20 -The operations are as follows: - - * ``init_request()`` +The table starts with a pair of optional pointers to memory pools from whi= ch +requests and subrequests can be allocated. If these are not given, netfsl= ib +has default pools that it will use. If the filesystem wraps the netfs str= ucts +in its own larger structs, then it will need to use its own pools. Netfsl= ib +will allocate directly from the pools. =20 - [Optional] This is called to initialise the request structure. 
It is g= iven - the file for reference. +The methods defined in the table are: =20 + * ``init_request()`` * ``free_request()`` + * ``free_subrequest()`` =20 - [Optional] This is called as the request is being deallocated so that t= he - filesystem can clean up any state it has attached there. + [Optional] A filesystem may implement these to initialise or clean up a= ny + resources that it attaches to the request or subrequest. =20 * ``expand_readahead()`` =20 [Optional] This is called to allow the filesystem to expand the size of= a - readahead read request. The filesystem gets to expand the request in b= oth - directions, though it's not permitted to reduce it as the numbers may - represent an allocation already made. If local caching is enabled, it = gets - to expand the request first. + readahead request. The filesystem gets to expand the request in both + directions, though it must retain the initial region as that may repres= ent + an allocation already made. If local caching is enabled, it gets to ex= pand + the request first. =20 Expansion is communicated by changing ->start and ->len in the request structure. Note that if any change is made, ->len must be increased by= at least as much as ->start is reduced. =20 - * ``clamp_length()`` - - [Optional] This is called to allow the filesystem to reduce the size of= a - subrequest. The filesystem can use this, for example, to chop up a req= uest - that has to be split across multiple servers or to put multiple reads in - flight. - - This should return 0 on success and an error code on error. - - * ``issue_read()`` + * ``prepare_read()`` =20 - [Required] The helpers use this to dispatch a subrequest to the server = for - reading. In the subrequest, ->start, ->len and ->transferred indicate = what - data should be read from the server. + [Optional] This is called to allow the filesystem to limit the size of a + subrequest. It may also limit the number of individual regions in iter= ator, + such as required by RDMA. 
This information should be set on stream zer= o in:: =20 - There is no return value; the netfs_subreq_terminated() function should= be - called to indicate whether or not the operation succeeded and how much = data - it transferred. The filesystem also should not deal with setting folios - uptodate, unlocking them or dropping their refs - the helpers need to d= eal - with this as they have to coordinate with copying to the local cache. + rreq->io_streams[0].sreq_max_len + rreq->io_streams[0].sreq_max_segs =20 - Note that the helpers have the folios locked, but not pinned. It is - possible to use the ITER_XARRAY iov iterator to refer to the range of t= he - inode that is being operated upon without the need to allocate large bv= ec - tables. + The filesystem can use this, for example, to chop up a request that has= to + be split across multiple servers or to put multiple reads in flight. =20 - * ``is_still_valid()`` + Zero should be returned on success and an error code otherwise. =20 - [Optional] This is called to find out if the data just read from the lo= cal - cache is still valid. It should return true if it is still valid and f= alse - if not. If it's not still valid, it will be reread from the server. + * ``issue_read()`` =20 - * ``check_write_begin()`` + [Required] Netfslib calls this to dispatch a subrequest to the server f= or + reading. In the subrequest, ->start, ->len and ->transferred indicate = what + data should be read from the server and ->io_iter indicates the buffer = to be + used. =20 - [Optional] This is called from the netfs_write_begin() helper once it h= as - allocated/grabbed the folio to be modified to allow the filesystem to f= lush - conflicting state before allowing it to be modified. + There is no return value; the ``netfs_read_subreq_terminated()`` functi= on + should be called to indicate that the subrequest completed either way. 
+ ->error, ->transferred and ->flags should be updated before completing.= The + termination can be done asynchronously. =20 - It may unlock and discard the folio it was given and set the caller's f= olio - pointer to NULL. It should return 0 if everything is now fine (``*foli= op`` - left set) or the op should be retried (``*foliop`` cleared) and any oth= er - error code to abort the operation. + Note: the filesystem must not deal with setting folios uptodate, unlock= ing + them or dropping their refs - the library deals with this as it may hav= e to + stitch together the results of multiple subrequests that variously over= lap + the set of folios. =20 - * ``done`` + * ``done()`` =20 - [Optional] This is called after the folios in the request have all been + [Optional] This is called after the folios in a read request have all b= een unlocked (and marked uptodate if applicable). =20 + * ``update_i_size()`` + + [Optional] This is invoked by netfslib at various points during the wri= te + paths to ask the filesystem to update its idea of the file size. If not + given, netfslib will set i_size and i_blocks and update the local cache + cookie. + =20 + * ``post_modify()`` + + [Optional] This is called after netfslib writes to the pagecache or whe= n it + allows an mmap'd page to be marked as writable. + =20 + * ``begin_writeback()`` + + [Optional] Netfslib calls this when processing a writeback request if it + finds a dirty page that isn't simply marked NETFS_FOLIO_COPY_TO_CACHE, + indicating it must be written to the server. This allows the filesyste= m to + only set up writeback resources when it knows it's going to have to per= form + a write. + =20 + * ``prepare_write()`` =20 + [Optional] This is called to allow the filesystem to limit the size of a + subrequest. It may also limit the number of individual regions in iter= ator, + such as required by RDMA. 
This information should be set on stream to = which + the subrequest belongs:: =20 -Read Helper Procedure ---------------------- - -The read helpers work by the following general procedure: - - * Set up the request. - - * For readahead, allow the local cache and then the network filesystem to - propose expansions to the read request. This is then proposed to the V= M. - If the VM cannot fully perform the expansion, a partially expanded read= will - be performed, though this may not get written to the cache in its entir= ety. - - * Loop around slicing chunks off of the request to form subrequests: - - * If a local cache is present, it gets to do the slicing, otherwise the - helpers just try to generate maximal slices. - - * The network filesystem gets to clamp the size of each slice if it is = to be - the source. This allows rsize and chunking to be implemented. + rreq->io_streams[subreq->stream_nr].sreq_max_len + rreq->io_streams[subreq->stream_nr].sreq_max_segs =20 - * The helpers issue a read from the cache or a read from the server or = just - clears the slice as appropriate. + The filesystem can use this, for example, to chop up a request that has= to + be split across multiple servers or to put multiple writes in flight. =20 - * The next slice begins at the end of the last one. + This is not permitted to return an error. In the event of failure, + ``netfs_prepare_write_failed()`` must be called. =20 - * As slices finish being read, they terminate. + * ``issue_write()`` =20 - * When all the subrequests have terminated, the subrequests are assessed = and - any that are short or have failed are reissued: + [Required] This is used to dispatch a subrequest to the server for writ= ing. + In the subrequest, ->start, ->len and ->transferred indicate what data + should be written to the server and ->io_iter indicates the buffer to be + used. =20 - * Failed cache requests are issued against the server instead. 
+ There is no return value; the ``netfs_write_subrequest_terminated()`` funct= ion + should be called to indicate that the subrequest completed either way. + ->error, ->transferred and ->flags should be updated before completing.= The + termination can be done asynchronously. =20 - * Failed server requests just fail. + Note: the filesystem must not deal with removing the dirty or writeback + marks on folios involved in the operation and should not take refs or p= ins + on them, but should leave retention to netfslib. =20 - * Short reads against either source will be reissued against that source - provided they have transferred some more data: + * ``retry_request()`` =20 - * The cache may need to skip holes that it can't do DIO from. + [Optional] Netfslib calls this at the beginning of a retry cycle. This + allows the filesystem to examine the state of the request, the subreque= sts + in the indicated stream and of its own data and make adjustments or + renegotiate resources. + =20 + * ``invalidate_cache()`` =20 - * If NETFS_SREQ_CLEAR_TAIL was set, a short read will be cleared to t= he - end of the slice instead of reissuing. + [Optional] This is called by netfslib to invalidate data stored in the = local + cache in the event that writing to the local cache fails, providing upd= ated + coherency data that netfs can't provide. =20 - * Once the data is read, the folios that have been fully read/cleared: +Terminating a subrequest +------------------------ =20 - * Will be marked uptodate. +When a subrequest completes, there are a number of functions that the cach= e or +filesystem can call to inform netfslib of the status change. One function= is +provided to terminate a write subrequest at the preparation stage and acts +synchronously: =20 - * If a cache is present, will be marked with PG_fscache. + * ``void netfs_prepare_write_failed(struct netfs_io_subrequest *subreq);`` =20 - * Unlocked + Indicate that the ->prepare_write() call failed.
The ``error`` field s= hould + have been updated. =20 - * Any folios that need writing to the cache will then have DIO writes iss= ued. +Note that ->prepare_read() can return an error as a read can simply be abo= rted. +Dealing with writeback failure is trickier. =20 - * Synchronous operations will wait for reading to be complete. +The other functions are used for subrequests that got as far as being issu= ed: =20 - * Writes to the cache will proceed asynchronously and the folios will hav= e the - PG_fscache mark removed when that completes. + * ``void netfs_read_subreq_terminated(struct netfs_io_subrequest *subreq)= ;`` =20 - * The request structures will be cleaned up when everything has completed. + Tell netfslib that a read subrequest has terminated. The ``error``, + ``flags`` and ``transferred`` fields should have been updated. =20 + * ``void netfs_write_subrequest_terminated(void *_op, ssize_t transferred= _or_error);`` =20 -Read Helper Cache API ---------------------- + Tell netfslib that a write subrequest has terminated. Either the amoun= t of + data processed or the negative error code can be passed in. This + can be used as a kiocb completion function. =20 -When implementing a local cache to be used by the read helpers, two things= are -required: some way for the network filesystem to initialise the caching fo= r a -read request and a table of operations for the helpers to call. + * ``void netfs_read_subreq_progress(struct netfs_io_subrequest *subreq);`` =20 -To begin a cache operation on an fscache object, the following function is -called:: + This is provided to optionally update netfslib on the incremental progr= ess + of a read, allowing some folios to be unlocked early and does not actua= lly + terminate the subrequest. The ``transferred`` field should have been + updated.
=20 - int fscache_begin_read_operation(struct netfs_io_request *rreq, - struct fscache_cookie *cookie); +Local Cache API +--------------- =20 -passing in the request pointer and the cookie corresponding to the file. = This -fills in the cache resources mentioned below. +Netfslib provides a separate API for a local cache to implement, though it +provides some somewhat similar routines to the filesystem request API. =20 -The netfs_io_request object contains a place for the cache to hang its +Firstly, the netfs_io_request object contains a place for the cache to han= g its state:: =20 struct netfs_cache_resources { const struct netfs_cache_ops *ops; void *cache_priv; void *cache_priv2; + unsigned int debug_id; + unsigned int inval_counter; }; =20 -This contains an operations table pointer and two private pointers. The -operation table looks like the following:: +This contains an operations table pointer and two private pointers plus the +debug ID of the fscache cookie for tracing purposes and an invalidation co= unter +that is cranked by calls to ``fscache_invalidate()`` allowing cache subreq= uests +to be invalidated after completion. 
+ +The cache operation table looks like the following:: =20 struct netfs_cache_ops { void (*end_operation)(struct netfs_cache_resources *cres); - void (*expand_readahead)(struct netfs_cache_resources *cres, loff_t *_start, size_t *_len, loff_t i_size); - enum netfs_io_source (*prepare_read)(struct netfs_io_subrequest *subreq, - loff_t i_size); - + loff_t i_size); int (*read)(struct netfs_cache_resources *cres, loff_t start_pos, struct iov_iter *iter, bool seek_data, netfs_io_terminated_t term_func, void *term_func_priv); - - int (*prepare_write)(struct netfs_cache_resources *cres, - loff_t *_start, size_t *_len, loff_t i_size, - bool no_space_allocated_yet); - - int (*write)(struct netfs_cache_resources *cres, - loff_t start_pos, - struct iov_iter *iter, - netfs_io_terminated_t term_func, - void *term_func_priv); - - int (*query_occupancy)(struct netfs_cache_resources *cres, - loff_t start, size_t len, size_t granularity, - loff_t *_data_start, size_t *_data_len); + void (*prepare_write_subreq)(struct netfs_io_subrequest *subreq); + void (*issue_write)(struct netfs_io_subrequest *subreq); }; =20 With a termination handler function pointer:: @@ -511,10 +963,16 @@ The methods defined in the table are: =20 * ``expand_readahead()`` =20 - [Optional] Called at the beginning of a netfs_readahead() operation to = allow - the cache to expand a request in either direction. This allows the cac= he to + [Optional] Called at the beginning of a readahead operation to allow the + cache to expand a request in either direction. This allows the cache to size the request appropriately for the cache granularity. =20 + * ``prepare_read()`` + + [Required] Called to configure the next slice of a request. ->start and + ->len in the subrequest indicate where and how big the next slice can b= e; + the cache gets to reduce the length to match its granularity requiremen= ts. 
+ The function is passed pointers to the start and length in its paramete= rs, plus the size of the file for reference, and adjusts the start and leng= th appropriately. It should return one of: @@ -528,12 +986,6 @@ The methods defined in the table are: downloaded from the server or read from the cache - or whether slicing should be given up at the current point. =20 - * ``prepare_read()`` - - [Required] Called to configure the next slice of a request. ->start and - ->len in the subrequest indicate where and how big the next slice can b= e; - the cache gets to reduce the length to match its granularity requiremen= ts. - * ``read()`` =20 [Required] Called to read from the cache. The start file offset is giv= en @@ -547,44 +999,33 @@ The methods defined in the table are: indicating whether the termination is definitely happening in the calle= r's context. =20 - * ``prepare_write()`` + * ``prepare_write_subreq()`` =20 - [Required] Called to prepare a write to the cache to take place. This - involves checking to see whether the cache has sufficient space to hono= ur - the write. ``*_start`` and ``*_len`` indicate the region to be written= ; the - region can be shrunk or it can be expanded to a page boundary either wa= y as - necessary to align for direct I/O. i_size holds the size of the object= and - is provided for reference. no_space_allocated_yet is set to true if the - caller is certain that no data has been written to that region - for ex= ample - if it tried to do a read from there already. + [Required] This is called to allow the cache to limit the size of a + subrequest. It may also limit the number of individual regions in iter= ator, + such as required by DIO/DMA. This information should be set on stream = to + which the subrequest belongs:: =20 - * ``write()`` + rreq->io_streams[subreq->stream_nr].sreq_max_len + rreq->io_streams[subreq->stream_nr].sreq_max_segs =20 - [Required] Called to write to the cache. 
The start file offset is given - along with an iterator to write from, which gives the length also. - - Also provided is a pointer to a termination handler function and private - data to pass to that function. The termination function should be call= ed - with the number of bytes transferred or an error code, plus a flag - indicating whether the termination is definitely happening in the calle= r's - context. + The filesystem can use this, for example, to chop up a request that has= to + be split across multiple servers or to put multiple writes in flight. =20 - * ``query_occupancy()`` + This is not permitted to return an error. In the event of failure, + ``netfs_prepare_write_failed()`` must be called. =20 - [Required] Called to find out where the next piece of data is within a - particular region of the cache. The start and length of the region to = be - queried are passed in, along with the granularity to which the answer n= eeds - to be aligned. The function passes back the start and length of the da= ta, - if any, available within that region. Note that there may be a hole at= the - front. + * ``issue_write()`` =20 - It returns 0 if some data was found, -ENODATA if there was no usable da= ta - within the region or -ENOBUFS if there is no caching on this file. + [Required] This is used to dispatch a subrequest to the cache for writi= ng. + In the subrequest, ->start, ->len and ->transferred indicate what data + should be written to the cache and ->io_iter indicates the buffer to be + used. =20 -Note that these methods are passed a pointer to the cache resource structu= re, -not the read request structure as they could be used in other situations w= here -there isn't a read request structure as well, such as writing dirty data t= o the -cache. + There is no return value; the ``netfs_write_subrequest_terminated()`` funct= ion + should be called to indicate that the subrequest completed either way.
+ ->error, ->transferred and ->flags should be updated before completing.= The + termination can be done asynchronously. =20 =20 API Function Reference