From: Bagas Sanjaya
To: Linux Kernel Mailing List, Linux Documentation, Linux ext4
Cc: "Theodore Ts'o", Andreas Dilger, Jonathan Corbet, "Darrick J. Wong", "Ritesh Harjani (IBM)", Bagas Sanjaya
Subject: [PATCH 1/4] Documentation: ext4: Slurp included subdocs in high-level overview docs
Date: Wed, 18 Jun 2025 18:15:34 +0700
Message-ID: <20250618111544.22602-2-bagasdotme@gmail.com>
In-Reply-To: <20250618111544.22602-1-bagasdotme@gmail.com>
References: <20250618111544.22602-1-bagasdotme@gmail.com>

Slurp subdocumentations for high-level ext4 design overview (overview.rst) by replacing reST include:: directive with their respective contents.
Signed-off-by: Bagas Sanjaya --- Documentation/filesystems/ext4/allocators.rst | 56 -- .../filesystems/ext4/atomic_writes.rst | 225 ----- Documentation/filesystems/ext4/bigalloc.rst | 34 - Documentation/filesystems/ext4/blockgroup.rst | 135 --- Documentation/filesystems/ext4/blocks.rst | 144 --- Documentation/filesystems/ext4/checksums.rst | 73 -- Documentation/filesystems/ext4/eainode.rst | 18 - Documentation/filesystems/ext4/inlinedata.rst | 37 - Documentation/filesystems/ext4/overview.rst | 819 +++++++++++++++++- .../filesystems/ext4/special_inodes.rst | 55 -- Documentation/filesystems/ext4/verity.rst | 44 - 11 files changed, 809 insertions(+), 831 deletions(-) delete mode 100644 Documentation/filesystems/ext4/allocators.rst delete mode 100644 Documentation/filesystems/ext4/atomic_writes.rst delete mode 100644 Documentation/filesystems/ext4/bigalloc.rst delete mode 100644 Documentation/filesystems/ext4/blockgroup.rst delete mode 100644 Documentation/filesystems/ext4/blocks.rst delete mode 100644 Documentation/filesystems/ext4/checksums.rst delete mode 100644 Documentation/filesystems/ext4/eainode.rst delete mode 100644 Documentation/filesystems/ext4/inlinedata.rst delete mode 100644 Documentation/filesystems/ext4/special_inodes.rst delete mode 100644 Documentation/filesystems/ext4/verity.rst diff --git a/Documentation/filesystems/ext4/allocators.rst b/Documentation/filesystems/ext4/allocators.rst deleted file mode 100644 index 7aa85152ace3d0..00000000000000 --- a/Documentation/filesystems/ext4/allocators.rst +++ /dev/null @@ -1,56 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Block and Inode Allocation Policy ---------------------------------- - -ext4 recognizes (better than ext3, anyway) that data locality is -generally a desirably quality of a filesystem. On a spinning disk, -keeping related blocks near each other reduces the amount of movement -that the head actuator and disk must perform to access a data block, -thus speeding up disk IO.
On an SSD there of course are no moving parts, -but locality can increase the size of each transfer request while -reducing the total number of requests. This locality may also have the -effect of concentrating writes on a single erase block, which can speed -up file rewrites significantly. Therefore, it is useful to reduce -fragmentation whenever possible. - -The first tool that ext4 uses to combat fragmentation is the multi-block -allocator. When a file is first created, the block allocator -speculatively allocates 8KiB of disk space to the file on the assumption -that the space will get written soon. When the file is closed, the -unused speculative allocations are of course freed, but if the -speculation is correct (typically the case for full writes of small -files) then the file data gets written out in a single multi-block -extent. A second related trick that ext4 uses is delayed allocation. -Under this scheme, when a file needs more blocks to absorb file writes, -the filesystem defers deciding the exact placement on the disk until all -the dirty buffers are being written out to disk. By not committing to a -particular placement until it's absolutely necessary (the commit timeout -is hit, or sync() is called, or the kernel runs out of memory), the hope -is that the filesystem can make better location decisions. - -The third trick that ext4 (and ext3) uses is that it tries to keep a -file's data blocks in the same block group as its inode. This cuts down -on the seek penalty when the filesystem first has to read a file's inode -to learn where the file's data blocks live and then seek over to the -file's data blocks to begin I/O operations. - -The fourth trick is that all the inodes in a directory are placed in the -same block group as the directory, when feasible. The working assumption -here is that all the files in a directory might be related, therefore it -is useful to try to keep them all together. 
- -The fifth trick is that the disk volume is cut up into 128MB block -groups; these mini-containers are used as outlined above to try to -maintain data locality. However, there is a deliberate quirk -- when a -directory is created in the root directory, the inode allocator scans -the block groups and puts that directory into the least heavily loaded -block group that it can find. This encourages directories to spread out -over a disk; as the top-level directory/file blobs fill up one block -group, the allocators simply move on to the next block group. Allegedly -this scheme evens out the loading on the block groups, though the author -suspects that the directories which are so unlucky as to land towards -the end of a spinning drive get a raw deal performance-wise. - -Of course if all of these mechanisms fail, one can always use e4defrag -to defragment files. diff --git a/Documentation/filesystems/ext4/atomic_writes.rst b/Documentation/filesystems/ext4/atomic_writes.rst deleted file mode 100644 index f65767df3620d5..00000000000000 --- a/Documentation/filesystems/ext4/atomic_writes.rst +++ /dev/null @@ -1,225 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 -.. _atomic_writes: - -Atomic Block Writes -------------------------- - -Introduction -~~~~~~~~~~~~ - -Atomic (untorn) block writes ensure that either the entire write is committed -to disk or none of it is. This prevents "torn writes" during power loss or -system crashes. The ext4 filesystem supports atomic writes (only with Direct -I/O) on regular files with extents, provided the underlying storage device -supports hardware atomic writes. This is supported in the following two ways: - -1. **Single-fsblock Atomic Writes**: - EXT4's supports atomic write operations with a single filesystem block since - v6.13. In this the atomic write unit minimum and maximum sizes are both set - to filesystem blocksize. - e.g. doing atomic write of 16KB with 16KB filesystem blocksize on 64KB - pagesize system is possible.
- -2. **Multi-fsblock Atomic Writes with Bigalloc**: - EXT4 now also supports atomic writes spanning multiple filesystem blocks - using a feature known as bigalloc. The atomic write unit's minimum and - maximum sizes are determined by the filesystem block size and cluster size, - based on the underlying device’s supported atomic write unit limits. - -Requirements -~~~~~~~~~~~~ - -Basic requirements for atomic writes in ext4: - - 1. The extents feature must be enabled (default for ext4) - 2. The underlying block device must support atomic writes - 3. For single-fsblock atomic writes: - - 1. A filesystem with appropriate block size (up to the page size) - 4. For multi-fsblock atomic writes: - - 1. The bigalloc feature must be enabled - 2. The cluster size must be appropriately configured - -NOTE: EXT4 does not support software or COW based atomic write, which means -atomic writes on ext4 are only supported if underlying storage device supports -it. - -Multi-fsblock Implementation Details -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The bigalloc feature changes ext4 to allocate in units of multiple filesystem -blocks, also known as clusters. With bigalloc each bit within block bitmap -represents cluster (power of 2 number of blocks) rather than individual -filesystem blocks. -EXT4 supports multi-fsblock atomic writes with bigalloc, subject to the -following constraints. The minimum atomic write size is the larger of the fs -block size and the minimum hardware atomic write unit; and the maximum atomic -write size is smaller of the bigalloc cluster size and the maximum hardware -atomic write unit. Bigalloc ensures that all allocations are aligned to the -cluster size, which satisfies the LBA alignment requirements of the hardware -device if the start of the partition/logical volume is itself aligned correctly.
- -Here is the block allocation strategy in bigalloc for atomic writes: - - * For regions with fully mapped extents, no additional work is needed - * For append writes, a new mapped extent is allocated - * For regions that are entirely holes, unwritten extent is created - * For large unwritten extents, the extent gets split into two unwritten - extents of appropriate requested size - * For mixed mapping regions (combinations of holes, unwritten extents, or - mapped extents), ext4_map_blocks() is called in a loop with - EXT4_GET_BLOCKS_ZERO flag to convert the region into a single contiguous - mapped extent by writing zeroes to it and converting any unwritten extents to - written, if found within the range. - -Note: Writing on a single contiguous underlying extent, whether mapped or -unwritten, is not inherently problematic. However, writing to a mixed mapping -region (i.e. one containing a combination of mapped and unwritten extents) -must be avoided when performing atomic writes. - -The reason is that, atomic writes when issued via pwritev2() with the RWF_ATOMIC -flag, requires that either all data is written or none at all. In the event of -a system crash or unexpected power loss during the write operation, the affected -region (when later read) must reflect either the complete old data or the -complete new data, but never a mix of both. - -To enforce this guarantee, we ensure that the write target is backed by -a single, contiguous extent before any data is written. This is critical because -ext4 defers the conversion of unwritten extents to written extents until the I/O -completion path (typically in ->end_io()). If a write is allowed to proceed over -a mixed mapping region (with mapped and unwritten extents) and a failure occurs -mid-write, the system could observe partially updated regions after reboot, i.e. -new data over mapped areas, and stale (old) data over unwritten extents that -were never marked written.
This violates the atomicity and/or torn write -prevention guarantee. - -To prevent such torn writes, ext4 proactively allocates a single contiguous -extent for the entire requested region in ``ext4_iomap_alloc`` via -``ext4_map_blocks_atomic()``. EXT4 also force commits the current journalling -transaction in case if allocation is done over mixed mapping. This ensures any -pending metadata updates (like unwritten to written extents conversion) in this -range are in consistent state with the file data blocks, before performing the -actual write I/O. If the commit fails, the whole I/O must be aborted to prevent -from any possible torn writes. -Only after this step, the actual data write operation is performed by the iomap. - -Handling Split Extents Across Leaf Blocks -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -There can be a special edge case where we have logically and physically -contiguous extents stored in separate leaf nodes of the on-disk extent tree. -This occurs because on-disk extent tree merges only happens within the leaf -blocks except for a case where we have 2-level tree which can get merged and -collapsed entirely into the inode. -If such a layout exists and, in the worst case, the extent status cache entries -are reclaimed due to memory pressure, ``ext4_map_blocks()`` may never return -a single contiguous extent for these split leaf extents. - -To address this edge case, a new get block flag -``EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS flag`` is added to enhance the -``ext4_map_query_blocks()`` lookup behavior. - -This new get block flag allows ``ext4_map_blocks()`` to first check if there is -an entry in the extent status cache for the full range. -If not present, it consults the on-disk extent tree using -``ext4_map_query_blocks()``. -If the located extent is at the end of a leaf node, it probes the next logical -block (lblk) to detect a contiguous extent in the adjacent leaf.
- -For now only one additional leaf block is queried to maintain efficiency, as -atomic writes are typically constrained to small sizes -(e.g. [blocksize, clustersize]). - - -Handling Journal transactions -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -To support multi-fsblock atomic writes, we ensure enough journal credits are -reserved during: - - 1. Block allocation time in ``ext4_iomap_alloc()``. We first query if there - could be a mixed mapping for the underlying requested range. If yes, then we - reserve credits of up to ``m_len``, assuming every alternate block can be - an unwritten extent followed by a hole. - - 2. During ``->end_io()`` call, we make sure a single transaction is started for - doing unwritten-to-written conversion. The loop for conversion is mainly - only required to handle a split extent across leaf blocks. - -How to ------- - -Creating Filesystems with Atomic Write Support -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -First check the atomic write units supported by block device. -See :ref:`atomic_write_bdev_support` for more details. - -For single-fsblock atomic writes with a larger block size -(on systems with block size < page size): - -.. code-block:: bash - - # Create an ext4 filesystem with a 16KB block size - # (requires page size >= 16KB) - mkfs.ext4 -b 16384 /dev/device - -For multi-fsblock atomic writes with bigalloc: - -.. code-block:: bash - - # Create an ext4 filesystem with bigalloc and 64KB cluster size - mkfs.ext4 -F -O bigalloc -b 4096 -C 65536 /dev/device - -Where ``-b`` specifies the block size, ``-C`` specifies the cluster size in bytes, -and ``-O bigalloc`` enables the bigalloc feature. - -Application Interface -~~~~~~~~~~~~~~~~~~~~~ - -Applications can use the ``pwritev2()`` system call with the ``RWF_ATOMIC`` flag -to perform atomic writes: - -..
code-block:: c - - pwritev2(fd, iov, iovcnt, offset, RWF_ATOMIC); - -The write must be aligned to the filesystem's block size and not exceed the -filesystem's maximum atomic write unit size. -See ``generic_atomic_write_valid()`` for more details. - -``statx()`` system call with ``STATX_WRITE_ATOMIC`` flag can provides following -details: - - * ``stx_atomic_write_unit_min``: Minimum size of an atomic write request. - * ``stx_atomic_write_unit_max``: Maximum size of an atomic write request. - * ``stx_atomic_write_segments_max``: Upper limit for segments. The number of - separate memory buffers that can be gathered into a write operation - (e.g., the iovcnt parameter for IOV_ITER). Currently, this is always set to one. - -The STATX_ATTR_WRITE_ATOMIC flag in ``statx->attributes`` is set if atomic -writes are supported. - -.. _atomic_write_bdev_support: - -Hardware Support ----------------- - -The underlying storage device must support atomic write operations. -Modern NVMe and SCSI devices often provide this capability. -The Linux kernel exposes this information through sysfs: - -* ``/sys/block//queue/atomic_write_unit_min`` - Minimum atomic write size -* ``/sys/block//queue/atomic_write_unit_max`` - Maximum atomic write size - -Nonzero values for these attributes indicate that the device supports -atomic writes. - -See Also --------- - -* :doc:`bigalloc` - Documentation on the bigalloc feature -* :doc:`allocators` - Documentation on block allocation in ext4 -* Support for atomic block writes in 6.13: - https://lwn.net/Articles/1009298/ diff --git a/Documentation/filesystems/ext4/bigalloc.rst b/Documentation/filesystems/ext4/bigalloc.rst deleted file mode 100644 index 976a180b209c2a..00000000000000 --- a/Documentation/filesystems/ext4/bigalloc.rst +++ /dev/null @@ -1,34 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Bigalloc --------- - -At the moment, the default size of a block is 4KiB, which is a commonly -supported page size on most MMU-capable hardware.
This is fortunate, as -ext4 code is not prepared to handle the case where the block size -exceeds the page size. However, for a filesystem of mostly huge files, -it is desirable to be able to allocate disk blocks in units of multiple -blocks to reduce both fragmentation and metadata overhead. The -bigalloc feature provides exactly this ability. - -The bigalloc feature (EXT4_FEATURE_RO_COMPAT_BIGALLOC) changes ext4 to -use clustered allocation, so that each bit in the ext4 block allocation -bitmap addresses a power of two number of blocks. For example, if the -file system is mainly going to be storing large files in the 4-32 -megabyte range, it might make sense to set a cluster size of 1 megabyte. -This means that each bit in the block allocation bitmap now addresses -256 4k blocks. This shrinks the total size of the block allocation -bitmaps for a 2T file system from 64 megabytes to 256 kilobytes. It also -means that a block group addresses 32 gigabytes instead of 128 megabytes, -also shrinking the amount of file system overhead for metadata. - -The administrator can set a block cluster size at mkfs time (which is -stored in the s_log_cluster_size field in the superblock); from then -on, the block bitmaps track clusters, not individual blocks. This means -that block groups can be several gigabytes in size (instead of just -128MiB); however, the minimum allocation unit becomes a cluster, not a -block, even for directories. TaoBao had a patchset to extend the “use -units of clusters instead of blocks” to the extent tree, though it is -not clear where those patches went-- they eventually morphed into -“extent tree v2” but that code has not landed as of May 2015. - diff --git a/Documentation/filesystems/ext4/blockgroup.rst b/Documentation/filesystems/ext4/blockgroup.rst deleted file mode 100644 index ed5a5cac6d40e1..00000000000000 --- a/Documentation/filesystems/ext4/blockgroup.rst +++ /dev/null @@ -1,135 +0,0 @@ -..
SPDX-License-Identifier: GPL-2.0 - -Layout ------- - -The layout of a standard block group is approximately as follows (each -of these fields is discussed in a separate section below): - -.. list-table:: - :widths: 1 1 1 1 1 1 1 1 - :header-rows: 1 - - * - Group 0 Padding - - ext4 Super Block - - Group Descriptors - - Reserved GDT Blocks - - Data Block Bitmap - - inode Bitmap - - inode Table - - Data Blocks - * - 1024 bytes - - 1 block - - many blocks - - many blocks - - 1 block - - 1 block - - many blocks - - many more blocks - -For the special case of block group 0, the first 1024 bytes are unused, -to allow for the installation of x86 boot sectors and other oddities. -The superblock will start at offset 1024 bytes, whichever block that -happens to be (usually 0). However, if for some reason the block size = -1024, then block 0 is marked in use and the superblock goes in block 1. -For all other block groups, there is no padding. - -The ext4 driver primarily works with the superblock and the group -descriptors that are found in block group 0. Redundant copies of the -superblock and group descriptors are written to some of the block groups -across the disk in case the beginning of the disk gets trashed, though -not all block groups necessarily host a redundant copy (see following -paragraph for more details). If the group does not have a redundant -copy, the block group begins with the data block bitmap. Note also that -when the filesystem is freshly formatted, mkfs will allocate “reserve -GDT block” space after the block group descriptors and before the start -of the block bitmaps to allow for future expansion of the filesystem. By -default, a filesystem is allowed to increase in size by a factor of -1024x over the original filesystem size. - -The location of the inode table is given by ``grp.bg_inode_table_*``. It -is continuous range of blocks large enough to contain -``sb.s_inodes_per_group * sb.s_inode_size`` bytes.
- -As for the ordering of items in a block group, it is generally -established that the super block and the group descriptor table, if -present, will be at the beginning of the block group. The bitmaps and -the inode table can be anywhere, and it is quite possible for the -bitmaps to come after the inode table, or for both to be in different -groups (flex_bg). Leftover space is used for file data blocks, indirect -block maps, extent tree blocks, and extended attributes. - -Flexible Block Groups ---------------------- - -Starting in ext4, there is a new feature called flexible block groups -(flex_bg). In a flex_bg, several block groups are tied together as one -logical block group; the bitmap spaces and the inode table space in the -first block group of the flex_bg are expanded to include the bitmaps -and inode tables of all other block groups in the flex_bg. For example, -if the flex_bg size is 4, then group 0 will contain (in order) the -superblock, group descriptors, data block bitmaps for groups 0-3, inode -bitmaps for groups 0-3, inode tables for groups 0-3, and the remaining -space in group 0 is for file data. The effect of this is to group the -block group metadata close together for faster loading, and to enable -large files to be continuous on disk. Backup copies of the superblock -and group descriptors are always at the beginning of block groups, even -if flex_bg is enabled. The number of block groups that make up a -flex_bg is given by 2 ^ ``sb.s_log_groups_per_flex``. - -Meta Block Groups ------------------ - -Without the option META_BG, for safety concerns, all block group -descriptors copies are kept in the first block group. Given the default -128MiB(2^27 bytes) block group size and 64-byte group descriptors, ext4 -can have at most 2^27/64 = 2^21 block groups. This limits the entire -filesystem size to 2^21 * 2^27 = 2^48bytes or 256TiB.
- -The solution to this problem is to use the metablock group feature -(META_BG), which is already in ext3 for all 2.6 releases. With the -META_BG feature, ext4 filesystems are partitioned into many metablock -groups. Each metablock group is a cluster of block groups whose group -descriptor structures can be stored in a single disk block. For ext4 -filesystems with 4 KB block size, a single metablock group partition -includes 64 block groups, or 8 GiB of disk space. The metablock group -feature moves the location of the group descriptors from the congested -first block group of the whole filesystem into the first group of each -metablock group itself. The backups are in the second and last group of -each metablock group. This increases the 2^21 maximum block groups limit -to the hard limit 2^32, allowing support for a 512PiB filesystem. - -The change in the filesystem format replaces the current scheme where -the superblock is followed by a variable-length set of block group -descriptors. Instead, the superblock and a single block group descriptor -block is placed at the beginning of the first, second, and last block -groups in a meta-block group. A meta-block group is a collection of -block groups which can be described by a single block group descriptor -block. Since the size of the block group descriptor structure is 64 -bytes, a meta-block group contains 16 block groups for filesystems with -a 1KB block size, and 64 block groups for filesystems with a 4KB -blocksize. Filesystems can either be created using this new block group -descriptor layout, or existing filesystems can be resized on-line, and -the field s_first_meta_bg in the superblock will indicate the first -block group using this new layout. - -Please see an important note about ``BLOCK_UNINIT`` in the section about -block and inode bitmaps. 
- -Lazy Block Group Initialization -------------------------------- - -A new feature for ext4 are three block group descriptor flags that -enable mkfs to skip initializing other parts of the block group -metadata. Specifically, the INODE_UNINIT and BLOCK_UNINIT flags mean -that the inode and block bitmaps for that group can be calculated and -therefore the on-disk bitmap blocks are not initialized. This is -generally the case for an empty block group or a block group containing -only fixed-location block group metadata. The INODE_ZEROED flag means -that the inode table has been initialized; mkfs will unset this flag and -rely on the kernel to initialize the inode tables in the background. - -By not writing zeroes to the bitmaps and inode table, mkfs time is -reduced considerably. Note the feature flag is RO_COMPAT_GDT_CSUM, -but the dumpe2fs output prints this as “uninit_bg”. They are the same -thing. diff --git a/Documentation/filesystems/ext4/blocks.rst b/Documentation/filesystems/ext4/blocks.rst deleted file mode 100644 index b0f80ea87c90e1..00000000000000 --- a/Documentation/filesystems/ext4/blocks.rst +++ /dev/null @@ -1,144 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Blocks ------- - -ext4 allocates storage space in units of “blocks”. A block is a group of -sectors between 1KiB and 64KiB, and the number of sectors must be an -integral power of 2. Blocks are in turn grouped into larger units called -block groups. Block size is specified at mkfs time and typically is -4KiB. You may experience mounting problems if block size is greater than -page size (i.e. 64KiB blocks on a i386 which only has 4KiB memory -pages). By default a filesystem can contain 2^32 blocks; if the '64bit' -feature is enabled, then a filesystem can have 2^64 blocks. The location -of structures is stored in terms of the block number the structure lives -in and not the absolute offset on disk. - -For 32-bit filesystems, limits are as follows: - -..
list-table:: - :widths: 1 1 1 1 1 - :header-rows: 1 - - * - Item - - 1KiB - - 2KiB - - 4KiB - - 64KiB - * - Blocks - - 2^32 - - 2^32 - - 2^32 - - 2^32 - * - Inodes - - 2^32 - - 2^32 - - 2^32 - - 2^32 - * - File System Size - - 4TiB - - 8TiB - - 16TiB - - 256TiB - * - Blocks Per Block Group - - 8,192 - - 16,384 - - 32,768 - - 524,288 - * - Inodes Per Block Group - - 8,192 - - 16,384 - - 32,768 - - 524,288 - * - Block Group Size - - 8MiB - - 32MiB - - 128MiB - - 32GiB - * - Blocks Per File, Extents - - 2^32 - - 2^32 - - 2^32 - - 2^32 - * - Blocks Per File, Block Maps - - 16,843,020 - - 134,480,396 - - 1,074,791,436 - - 4,398,314,962,956 (really 2^32 due to field size limitations) - * - File Size, Extents - - 4TiB - - 8TiB - - 16TiB - - 256TiB - * - File Size, Block Maps - - 16GiB - - 256GiB - - 4TiB - - 256TiB - -For 64-bit filesystems, limits are as follows: - -.. list-table:: - :widths: 1 1 1 1 1 - :header-rows: 1 - - * - Item - - 1KiB - - 2KiB - - 4KiB - - 64KiB - * - Blocks - - 2^64 - - 2^64 - - 2^64 - - 2^64 - * - Inodes - - 2^32 - - 2^32 - - 2^32 - - 2^32 - * - File System Size - - 16ZiB - - 32ZiB - - 64ZiB - - 1YiB - * - Blocks Per Block Group - - 8,192 - - 16,384 - - 32,768 - - 524,288 - * - Inodes Per Block Group - - 8,192 - - 16,384 - - 32,768 - - 524,288 - * - Block Group Size - - 8MiB - - 32MiB - - 128MiB - - 32GiB - * - Blocks Per File, Extents - - 2^32 - - 2^32 - - 2^32 - - 2^32 - * - Blocks Per File, Block Maps - - 16,843,020 - - 134,480,396 - - 1,074,791,436 - - 4,398,314,962,956 (really 2^32 due to field size limitations) - * - File Size, Extents - - 4TiB - - 8TiB - - 16TiB - - 256TiB - * - File Size, Block Maps - - 16GiB - - 256GiB - - 4TiB - - 256TiB - -Note: Files not using extents (i.e. files using block maps) must be -placed within the first 2^32 blocks of a filesystem. Files with extents -must be placed within the first 2^48 blocks of a filesystem. It's not -clear what happens with larger filesystems. 
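Reviewer aside, not part of the patch: most figures in the two limit tables of the removed blocks.rst follow from a few formulas, and a short Python sketch can cross-check them. It rests on assumptions that the removed text does not spell out: a block bitmap occupies exactly one block (so a group holds 8 * block_size blocks), and a block-mapped file has 12 direct pointers plus single, double and triple indirect blocks holding 4-byte block numbers. All helper names are hypothetical.

```python
# Cross-check of the ext4 limit tables (illustrative sketch, not kernel code).
# Assumptions: one bitmap block per group (8 bits per byte) and block maps
# with 12 direct slots plus 1-, 2- and 3-level indirect blocks of 4-byte entries.

KIB = 1024
UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB", "ZiB", "YiB"]


def blocks_per_group(block_size):
    """One bitmap block tracks 8 blocks per byte."""
    return 8 * block_size


def group_size_bytes(block_size):
    """Size in bytes of one block group."""
    return blocks_per_group(block_size) * block_size


def fs_size_bytes(block_size, addr_bits):
    """Largest filesystem addressable with addr_bits-wide block numbers."""
    return (1 << addr_bits) * block_size


def block_map_blocks(block_size):
    """Blocks reachable through direct and indirect block-map pointers."""
    n = block_size // 4  # 4-byte block numbers per indirect block
    return 12 + n + n**2 + n**3


def human(nbytes):
    """Format an exact power-of-1024 multiple using binary unit suffixes."""
    idx = 0
    while nbytes >= 1024 and nbytes % 1024 == 0 and idx < len(UNITS) - 1:
        nbytes //= 1024
        idx += 1
    return f"{nbytes}{UNITS[idx]}"


if __name__ == "__main__":
    for bs in (1 * KIB, 2 * KIB, 4 * KIB, 64 * KIB):
        print(human(bs),
              blocks_per_group(bs),
              human(group_size_bytes(bs)),
              human(fs_size_bytes(bs, 32)),
              human(fs_size_bytes(bs, 64)),
              block_map_blocks(bs))
```

Running the script prints one row per block size. The "File Size, Block Maps" row is not reproduced because the table rounds those values.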
diff --git a/Documentation/filesystems/ext4/checksums.rst b/Documentation/filesystems/ext4/checksums.rst deleted file mode 100644 index e232749daf5f30..00000000000000 --- a/Documentation/filesystems/ext4/checksums.rst +++ /dev/null @@ -1,73 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Checksums ---------- - -Starting in early 2012, metadata checksums were added to all major ext4 -and jbd2 data structures. The associated feature flag is metadata_csum. -The desired checksum algorithm is indicated in the superblock, though as -of October 2012 the only supported algorithm is crc32c. Some data -structures did not have space to fit a full 32-bit checksum, so only the -lower 16 bits are stored. Enabling the 64bit feature increases the data -structure size so that full 32-bit checksums can be stored for many data -structures. However, existing 32-bit filesystems cannot be extended to -enable 64bit mode, at least not without the experimental resize2fs -patches to do so. - -Existing filesystems can have checksumming added by running -``tune2fs -O metadata_csum`` against the underlying device. If tune2fs -encounters directory blocks that lack sufficient empty space to add a -checksum, it will request that you run ``e2fsck -D`` to have the -directories rebuilt with checksums. This has the added benefit of -removing slack space from the directory files and rebalancing the htree -indexes. If you _ignore_ this step, your directories will not be -protected by a checksum! - -The following table describes the data elements that go into each type -of checksum. The checksum function is whatever the superblock describes -(crc32c as of October 2013) unless noted otherwise. - -.. list-table:: - :widths: 20 8 50 - :header-rows: 1 - - * - Metadata - - Length - - Ingredients - * - Superblock - - __le32 - - The entire superblock up to the checksum field. The UUID lives inside - the superblock. - * - MMP - - __le32 - - UUID + the entire MMP block up to the checksum field.
- * - Extended Attributes - - __le32 - - UUID + the entire extended attribute block. The checksum field is set to - zero. - * - Directory Entries - - __le32 - - UUID + inode number + inode generation + the directory block up to the - fake entry enclosing the checksum field. - * - HTREE Nodes - - __le32 - - UUID + inode number + inode generation + all valid extents + HTREE tail. - The checksum field is set to zero. - * - Extents - - __le32 - - UUID + inode number + inode generation + the entire extent block up to - the checksum field. - * - Bitmaps - - __le32 or __le16 - - UUID + the entire bitmap. Checksums are stored in the group descriptor, - and truncated if the group descriptor size is 32 bytes (i.e. ^64bit) - * - Inodes - - __le32 - - UUID + inode number + inode generation + the entire inode. The checksum - field is set to zero. Each inode has its own checksum. - * - Group Descriptors - - __le16 - - If metadata_csum, then UUID + group number + the entire descriptor; - else if gdt_csum, then crc16(UUID + group number + the entire - descriptor). In all cases, only the lower 16 bits are stored. - diff --git a/Documentation/filesystems/ext4/eainode.rst b/Documentation/filesystems/ext4/eainode.rst deleted file mode 100644 index 7a2ef26b064ac0..00000000000000 --- a/Documentation/filesystems/ext4/eainode.rst +++ /dev/null @@ -1,18 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Large Extended Attribute Values -------------------------------- - -To enable ext4 to store extended attribute values that do not fit in the -inode or in the single extended attribute block attached to an inode, -the EA_INODE feature allows us to store the value in the data blocks of -a regular file inode. This “EA inode” is linked only from the extended -attribute name index and must not appear in a directory entry.
The -inode's i_atime field is used to store a checksum of the xattr value; -and i_ctime/i_version store a 64-bit reference count, which enables -sharing of large xattr values between multiple owning inodes. For -backward compatibility with older versions of this feature, the -i_mtime/i_generation *may* store a back-reference to the inode number -and i_generation of the **one** owning inode (in cases where the EA -inode is not referenced by multiple inodes) to verify that the EA inode -is the correct one being accessed. diff --git a/Documentation/filesystems/ext4/inlinedata.rst b/Documentation/filesystems/ext4/inlinedata.rst deleted file mode 100644 index a728af0d2fd0c5..00000000000000 --- a/Documentation/filesystems/ext4/inlinedata.rst +++ /dev/null @@ -1,37 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Inline Data ------------ - -The inline data feature was designed to handle the case that a file's -data is so tiny that it readily fits inside the inode, which -(theoretically) reduces disk block consumption and reduces seeks. If the -file is smaller than 60 bytes, then the data are stored inline in -``inode.i_block``. If the rest of the file would fit inside the extended -attribute space, then it might be found as an extended attribute -“system.data” within the inode body (“ibody EA”). This of course -constrains the amount of extended attributes one can attach to an inode. -If the data size increases beyond i_block + ibody EA, a regular block -is allocated and the contents moved to that block. - -Pending a change to compact the extended attribute key used to store -inline data, one ought to be able to store 160 bytes of data in a -256-byte inode (as of June 2015, when i_extra_isize is 28). Prior to -that, the limit was 156 bytes due to inefficient use of inode space. - -The inline data feature requires the presence of an extended attribute -for “system.data”, even if the attribute value is zero length.
- -Inline Directories -~~~~~~~~~~~~~~~~~~ - -The first four bytes of i_block are the inode number of the parent -directory. Following that is a 56-byte space for an array of directory -entries; see ``struct ext4_dir_entry``. If there is a “system.data” -attribute in the inode body, the EA value is an array of -``struct ext4_dir_entry`` as well. Note that for inline directories, the -i_block and EA space are treated as separate dirent blocks; directory -entries cannot span the two. - -Inline directory entries are not checksummed, as the inode checksum -should protect all inline data contents. diff --git a/Documentation/filesystems/ext4/overview.rst b/Documentation/filesystems/ext4/overview.rst index 9d4054c17ecb7b..f402ba37179f02 100644 --- a/Documentation/filesystems/ext4/overview.rst +++ b/Documentation/filesystems/ext4/overview.rst @@ -16,13 +16,812 @@ All fields in ext4 are written to disk in little-endian order. HOWEVER, all fields in jbd2 (the journal) are written to disk in big-endian order. -.. include:: blocks.rst -.. include:: blockgroup.rst -.. include:: special_inodes.rst -.. include:: allocators.rst -.. include:: checksums.rst -.. include:: bigalloc.rst -.. include:: inlinedata.rst -.. include:: eainode.rst -.. include:: verity.rst -.. include:: atomic_writes.rst +Blocks +------ + +ext4 allocates storage space in units of “blocks”. A block is a group of +sectors between 1KiB and 64KiB, and the number of sectors must be an +integral power of 2. Blocks are in turn grouped into larger units called +block groups. Block size is specified at mkfs time and typically is +4KiB. You may experience mounting problems if block size is greater than +page size (i.e. 64KiB blocks on a i386 which only has 4KiB memory +pages). By default a filesystem can contain 2^32 blocks; if the '64bit' +feature is enabled, then a filesystem can have 2^64 blocks.
The location +of structures is stored in terms of the block number the structure lives +in and not the absolute offset on disk. + +For 32-bit filesystems, limits are as follows: + +.. list-table:: + :widths: 1 1 1 1 1 + :header-rows: 1 + + * - Item + - 1KiB + - 2KiB + - 4KiB + - 64KiB + * - Blocks + - 2^32 + - 2^32 + - 2^32 + - 2^32 + * - Inodes + - 2^32 + - 2^32 + - 2^32 + - 2^32 + * - File System Size + - 4TiB + - 8TiB + - 16TiB + - 256TiB + * - Blocks Per Block Group + - 8,192 + - 16,384 + - 32,768 + - 524,288 + * - Inodes Per Block Group + - 8,192 + - 16,384 + - 32,768 + - 524,288 + * - Block Group Size + - 8MiB + - 32MiB + - 128MiB + - 32GiB + * - Blocks Per File, Extents + - 2^32 + - 2^32 + - 2^32 + - 2^32 + * - Blocks Per File, Block Maps + - 16,843,020 + - 134,480,396 + - 1,074,791,436 + - 4,398,314,962,956 (really 2^32 due to field size limitations) + * - File Size, Extents + - 4TiB + - 8TiB + - 16TiB + - 256TiB + * - File Size, Block Maps + - 16GiB + - 256GiB + - 4TiB + - 256TiB + +For 64-bit filesystems, limits are as follows: + +.. list-table:: + :widths: 1 1 1 1 1 + :header-rows: 1 + + * - Item + - 1KiB + - 2KiB + - 4KiB + - 64KiB + * - Blocks + - 2^64 + - 2^64 + - 2^64 + - 2^64 + * - Inodes + - 2^32 + - 2^32 + - 2^32 + - 2^32 + * - File System Size + - 16ZiB + - 32ZiB + - 64ZiB + - 1YiB + * - Blocks Per Block Group + - 8,192 + - 16,384 + - 32,768 + - 524,288 + * - Inodes Per Block Group + - 8,192 + - 16,384 + - 32,768 + - 524,288 + * - Block Group Size + - 8MiB + - 32MiB + - 128MiB + - 32GiB + * - Blocks Per File, Extents + - 2^32 + - 2^32 + - 2^32 + - 2^32 + * - Blocks Per File, Block Maps + - 16,843,020 + - 134,480,396 + - 1,074,791,436 + - 4,398,314,962,956 (really 2^32 due to field size limitations) + * - File Size, Extents + - 4TiB + - 8TiB + - 16TiB + - 256TiB + * - File Size, Block Maps + - 16GiB + - 256GiB + - 4TiB + - 256TiB + +.. note:: + Files not using extents (i.e. 
files using block maps) must be + placed within the first 2^32 blocks of a filesystem. Files with extents + must be placed within the first 2^48 blocks of a filesystem. It's not + clear what happens with larger filesystems. + +Layout +------ + +The layout of a standard block group is approximately as follows (each +of these fields is discussed in a separate section below): + +.. list-table:: + :widths: 1 1 1 1 1 1 1 1 + :header-rows: 1 + + * - Group 0 Padding + - ext4 Super Block + - Group Descriptors + - Reserved GDT Blocks + - Data Block Bitmap + - inode Bitmap + - inode Table + - Data Blocks + * - 1024 bytes + - 1 block + - many blocks + - many blocks + - 1 block + - 1 block + - many blocks + - many more blocks + +For the special case of block group 0, the first 1024 bytes are unused, +to allow for the installation of x86 boot sectors and other oddities. +The superblock will start at offset 1024 bytes, whichever block that +happens to be (usually 0). However, if for some reason the block size = +1024, then block 0 is marked in use and the superblock goes in block 1. +For all other block groups, there is no padding. + +The ext4 driver primarily works with the superblock and the group +descriptors that are found in block group 0. Redundant copies of the +superblock and group descriptors are written to some of the block groups +across the disk in case the beginning of the disk gets trashed, though +not all block groups necessarily host a redundant copy (see following +paragraph for more details). If the group does not have a redundant +copy, the block group begins with the data block bitmap. Note also that +when the filesystem is freshly formatted, mkfs will allocate “reserve +GDT block” space after the block group descriptors and before the start +of the block bitmaps to allow for future expansion of the filesystem. By +default, a filesystem is allowed to increase in size by a factor of +1024x over the original filesystem size.
+ +The location of the inode table is given by ``grp.bg_inode_table_*``. It +is a continuous range of blocks large enough to contain +``sb.s_inodes_per_group * sb.s_inode_size`` bytes. + +As for the ordering of items in a block group, it is generally +established that the super block and the group descriptor table, if +present, will be at the beginning of the block group. The bitmaps and +the inode table can be anywhere, and it is quite possible for the +bitmaps to come after the inode table, or for both to be in different +groups (flex_bg). Leftover space is used for file data blocks, indirect +block maps, extent tree blocks, and extended attributes. + +Flexible Block Groups +--------------------- + +Starting in ext4, there is a new feature called flexible block groups +(flex_bg). In a flex_bg, several block groups are tied together as one +logical block group; the bitmap spaces and the inode table space in the +first block group of the flex_bg are expanded to include the bitmaps +and inode tables of all other block groups in the flex_bg. For example, +if the flex_bg size is 4, then group 0 will contain (in order) the +superblock, group descriptors, data block bitmaps for groups 0-3, inode +bitmaps for groups 0-3, inode tables for groups 0-3, and the remaining +space in group 0 is for file data. The effect of this is to group the +block group metadata close together for faster loading, and to enable +large files to be continuous on disk. Backup copies of the superblock +and group descriptors are always at the beginning of block groups, even +if flex_bg is enabled. The number of block groups that make up a +flex_bg is given by 2 ^ ``sb.s_log_groups_per_flex``. + +Meta Block Groups +----------------- + +Without the option META_BG, for safety concerns, all block group +descriptor copies are kept in the first block group. Given the default +128MiB (2^27 bytes) block group size and 64-byte group descriptors, ext4 +can have at most 2^27/64 = 2^21 block groups.
This limits the entire +filesystem size to 2^21 * 2^27 = 2^48 bytes or 256TiB. + +The solution to this problem is to use the metablock group feature +(META_BG), which is already in ext3 for all 2.6 releases. With the +META_BG feature, ext4 filesystems are partitioned into many metablock +groups. Each metablock group is a cluster of block groups whose group +descriptor structures can be stored in a single disk block. For ext4 +filesystems with 4 KB block size, a single metablock group partition +includes 64 block groups, or 8 GiB of disk space. The metablock group +feature moves the location of the group descriptors from the congested +first block group of the whole filesystem into the first group of each +metablock group itself. The backups are in the second and last group of +each metablock group. This increases the 2^21 maximum block groups limit +to the hard limit 2^32, allowing support for a 512PiB filesystem. + +The change in the filesystem format replaces the current scheme where +the superblock is followed by a variable-length set of block group +descriptors. Instead, the superblock and a single block group descriptor +block are placed at the beginning of the first, second, and last block +groups in a meta-block group. A meta-block group is a collection of +block groups which can be described by a single block group descriptor +block. Since the size of the block group descriptor structure is 64 +bytes, a meta-block group contains 16 block groups for filesystems with +a 1KB block size, and 64 block groups for filesystems with a 4KB +blocksize. Filesystems can either be created using this new block group +descriptor layout, or existing filesystems can be resized on-line, and +the field s_first_meta_bg in the superblock will indicate the first +block group using this new layout. + +Please see an important note about ``BLOCK_UNINIT`` in the section about +block and inode bitmaps.
+ +Lazy Block Group Initialization +------------------------------- + +A new feature for ext4 are three block group descriptor flags that +enable mkfs to skip initializing other parts of the block group +metadata. Specifically, the INODE_UNINIT and BLOCK_UNINIT flags mean +that the inode and block bitmaps for that group can be calculated and +therefore the on-disk bitmap blocks are not initialized. This is +generally the case for an empty block group or a block group containing +only fixed-location block group metadata. The INODE_ZEROED flag means +that the inode table has been initialized; mkfs will unset this flag and +rely on the kernel to initialize the inode tables in the background. + +By not writing zeroes to the bitmaps and inode table, mkfs time is +reduced considerably. Note the feature flag is RO_COMPAT_GDT_CSUM, +but the dumpe2fs output prints this as “uninit_bg”. They are the same +thing. + +Special inodes +-------------- + +ext4 reserves some inodes for special features, as follows: + +.. list-table:: + :widths: 6 70 + :header-rows: 1 + + * - inode Number + - Purpose + * - 0 + - Doesn't exist; there is no inode 0. + * - 1 + - List of defective blocks. + * - 2 + - Root directory. + * - 3 + - User quota. + * - 4 + - Group quota. + * - 5 + - Boot loader. + * - 6 + - Undelete directory. + * - 7 + - Reserved group descriptors inode. (“resize inode”) + * - 8 + - Journal inode. + * - 9 + - The “exclude” inode, for snapshots(?) + * - 10 + - Replica inode, used for some non-upstream feature? + * - 11 + - Traditional first non-reserved inode. Usually this is the lost+found directory. See s_first_ino in the superblock. + +Note that there are also some inodes allocated from non-reserved inode numbers +for other filesystem features which are not referenced from standard directory +hierarchy. These are generally referenced from the superblock. They are: + +..
list-table:: + :widths: 20 50 + :header-rows: 1 + + * - Superblock field + - Description + + * - s_lpf_ino + - Inode number of lost+found directory. + * - s_prj_quota_inum + - Inode number of quota file tracking project quotas + * - s_orphan_file_inum + - Inode number of file tracking orphan inodes. + +Block and Inode Allocation Policy +--------------------------------- + +ext4 recognizes (better than ext3, anyway) that data locality is +generally a desirable quality of a filesystem. On a spinning disk, +keeping related blocks near each other reduces the amount of movement +that the head actuator and disk must perform to access a data block, +thus speeding up disk IO. On an SSD there of course are no moving parts, +but locality can increase the size of each transfer request while +reducing the total number of requests. This locality may also have the +effect of concentrating writes on a single erase block, which can speed +up file rewrites significantly. Therefore, it is useful to reduce +fragmentation whenever possible. + +The first tool that ext4 uses to combat fragmentation is the multi-block +allocator. When a file is first created, the block allocator +speculatively allocates 8KiB of disk space to the file on the assumption +that the space will get written soon. When the file is closed, the +unused speculative allocations are of course freed, but if the +speculation is correct (typically the case for full writes of small +files) then the file data gets written out in a single multi-block +extent. A second related trick that ext4 uses is delayed allocation. +Under this scheme, when a file needs more blocks to absorb file writes, +the filesystem defers deciding the exact placement on the disk until all +the dirty buffers are being written out to disk.
By not committing to a +particular placement until it's absolutely necessary (the commit timeout +is hit, or sync() is called, or the kernel runs out of memory), the hope +is that the filesystem can make better location decisions. + +The third trick that ext4 (and ext3) uses is that it tries to keep a +file's data blocks in the same block group as its inode. This cuts down +on the seek penalty when the filesystem first has to read a file's inode +to learn where the file's data blocks live and then seek over to the +file's data blocks to begin I/O operations. + +The fourth trick is that all the inodes in a directory are placed in the +same block group as the directory, when feasible. The working assumption +here is that all the files in a directory might be related, therefore it +is useful to try to keep them all together. + +The fifth trick is that the disk volume is cut up into 128MB block +groups; these mini-containers are used as outlined above to try to +maintain data locality. However, there is a deliberate quirk -- when a +directory is created in the root directory, the inode allocator scans +the block groups and puts that directory into the least heavily loaded +block group that it can find. This encourages directories to spread out +over a disk; as the top-level directory/file blobs fill up one block +group, the allocators simply move on to the next block group. Allegedly +this scheme evens out the loading on the block groups, though the author +suspects that the directories which are so unlucky as to land towards +the end of a spinning drive get a raw deal performance-wise. + +Of course if all of these mechanisms fail, one can always use e4defrag +to defragment files. + +Checksums +--------- + +Starting in early 2012, metadata checksums were added to all major ext4 +and jbd2 data structures. The associated feature flag is metadata_csum. 
+The desired checksum algorithm is indicated in the superblock, though as +of October 2012 the only supported algorithm is crc32c. Some data +structures did not have space to fit a full 32-bit checksum, so only the +lower 16 bits are stored. Enabling the 64bit feature increases the data +structure size so that full 32-bit checksums can be stored for many data +structures. However, existing 32-bit filesystems cannot be extended to +enable 64bit mode, at least not without the experimental resize2fs +patches to do so. + +Existing filesystems can have checksumming added by running +``tune2fs -O metadata_csum`` against the underlying device. If tune2fs +encounters directory blocks that lack sufficient empty space to add a +checksum, it will request that you run ``e2fsck -D`` to have the +directories rebuilt with checksums. This has the added benefit of +removing slack space from the directory files and rebalancing the htree +indexes. If you _ignore_ this step, your directories will not be +protected by a checksum! + +The following table describes the data elements that go into each type +of checksum. The checksum function is whatever the superblock describes +(crc32c as of October 2013) unless noted otherwise. + +.. list-table:: + :widths: 20 8 50 + :header-rows: 1 + + * - Metadata + - Length + - Ingredients + * - Superblock + - __le32 + - The entire superblock up to the checksum field. The UUID lives inside + the superblock. + * - MMP + - __le32 + - UUID + the entire MMP block up to the checksum field. + * - Extended Attributes + - __le32 + - UUID + the entire extended attribute block. The checksum field is set to + zero. + * - Directory Entries + - __le32 + - UUID + inode number + inode generation + the directory block up to the + fake entry enclosing the checksum field. + * - HTREE Nodes + - __le32 + - UUID + inode number + inode generation + all valid extents + HTREE tail. + The checksum field is set to zero.
+ * - Extents + - __le32 + - UUID + inode number + inode generation + the entire extent block up to + the checksum field. + * - Bitmaps + - __le32 or __le16 + - UUID + the entire bitmap. Checksums are stored in the group descriptor, + and truncated if the group descriptor size is 32 bytes (i.e. ^64bit) + * - Inodes + - __le32 + - UUID + inode number + inode generation + the entire inode. The checksum + field is set to zero. Each inode has its own checksum. + * - Group Descriptors + - __le16 + - If metadata_csum, then UUID + group number + the entire descriptor; + else if gdt_csum, then crc16(UUID + group number + the entire + descriptor). In all cases, only the lower 16 bits are stored. + +Bigalloc +-------- + +At the moment, the default size of a block is 4KiB, which is a commonly +supported page size on most MMU-capable hardware. This is fortunate, as +ext4 code is not prepared to handle the case where the block size +exceeds the page size. However, for a filesystem of mostly huge files, +it is desirable to be able to allocate disk blocks in units of multiple +blocks to reduce both fragmentation and metadata overhead. The +bigalloc feature provides exactly this ability. + +The bigalloc feature (EXT4_FEATURE_RO_COMPAT_BIGALLOC) changes ext4 to +use clustered allocation, so that each bit in the ext4 block allocation +bitmap addresses a power of two number of blocks. For example, if the +file system is mainly going to be storing large files in the 4-32 +megabyte range, it might make sense to set a cluster size of 1 megabyte. +This means that each bit in the block allocation bitmap now addresses +256 4k blocks. This shrinks the total size of the block allocation +bitmaps for a 2T file system from 64 megabytes to 256 kilobytes. It also +means that a block group addresses 32 gigabytes instead of 128 megabytes, +also shrinking the amount of file system overhead for metadata.
+ +The administrator can set a block cluster size at mkfs time (which is +stored in the s_log_cluster_size field in the superblock); from then +on, the block bitmaps track clusters, not individual blocks. This means +that block groups can be several gigabytes in size (instead of just +128MiB); however, the minimum allocation unit becomes a cluster, not a +block, even for directories. TaoBao had a patchset to extend the “use +units of clusters instead of blocks” to the extent tree, though it is +not clear where those patches went-- they eventually morphed into +“extent tree v2” but that code has not landed as of May 2015. + +Inline Data +----------- + +The inline data feature was designed to handle the case that a file's +data is so tiny that it readily fits inside the inode, which +(theoretically) reduces disk block consumption and reduces seeks. If the +file is smaller than 60 bytes, then the data are stored inline in +``inode.i_block``. If the rest of the file would fit inside the extended +attribute space, then it might be found as an extended attribute +“system.data” within the inode body (“ibody EA”). This of course +constrains the amount of extended attributes one can attach to an inode. +If the data size increases beyond i_block + ibody EA, a regular block +is allocated and the contents moved to that block. + +Pending a change to compact the extended attribute key used to store +inline data, one ought to be able to store 160 bytes of data in a +256-byte inode (as of June 2015, when i_extra_isize is 28). Prior to +that, the limit was 156 bytes due to inefficient use of inode space. + +The inline data feature requires the presence of an extended attribute +for “system.data”, even if the attribute value is zero length. + +Inline Directories +~~~~~~~~~~~~~~~~~~ + +The first four bytes of i_block are the inode number of the parent +directory.
Following that is a 56-byte space for an array of directory +entries; see ``struct ext4_dir_entry``. If there is a “system.data” +attribute in the inode body, the EA value is an array of +``struct ext4_dir_entry`` as well. Note that for inline directories, the +i_block and EA space are treated as separate dirent blocks; directory +entries cannot span the two. + +Inline directory entries are not checksummed, as the inode checksum +should protect all inline data contents. + +Large Extended Attribute Values +------------------------------- + +To enable ext4 to store extended attribute values that do not fit in the +inode or in the single extended attribute block attached to an inode, +the EA_INODE feature allows us to store the value in the data blocks of +a regular file inode. This “EA inode” is linked only from the extended +attribute name index and must not appear in a directory entry. The +inode's i_atime field is used to store a checksum of the xattr value; +and i_ctime/i_version store a 64-bit reference count, which enables +sharing of large xattr values between multiple owning inodes. For +backward compatibility with older versions of this feature, the +i_mtime/i_generation *may* store a back-reference to the inode number +and i_generation of the **one** owning inode (in cases where the EA +inode is not referenced by multiple inodes) to verify that the EA inode +is the correct one being accessed. + +Verity files +------------ + +ext4 supports fs-verity, which is a filesystem feature that provides +Merkle tree based hashing for individual readonly files. Most of +fs-verity is common to all filesystems that support it; see +:ref:`Documentation/filesystems/fsverity.rst ` for the +fs-verity documentation. However, the on-disk layout of the verity +metadata is filesystem-specific.
On ext4, the verity metadata is +stored after the end of the file data itself, in the following format: + +- Zero-padding to the next 65536-byte boundary. This padding need not + actually be allocated on-disk, i.e. it may be a hole. + +- The Merkle tree, as documented in + :ref:`Documentation/filesystems/fsverity.rst + `, with the tree levels stored in order from + root to leaf, and the tree blocks within each level stored in their + natural order. + +- Zero-padding to the next filesystem block boundary. + +- The verity descriptor, as documented in + :ref:`Documentation/filesystems/fsverity.rst `, + with optionally appended signature blob. + +- Zero-padding to the next offset that is 4 bytes before a filesystem + block boundary. + +- The size of the verity descriptor in bytes, as a 4-byte little + endian integer. + +Verity inodes have EXT4_VERITY_FL set, and they must use extents, i.e. +EXT4_EXTENTS_FL must be set and EXT4_INLINE_DATA_FL must be clear. +They can have EXT4_ENCRYPT_FL set, in which case the verity metadata +is encrypted as well as the data itself. + +Verity files cannot have blocks allocated past the end of the verity +metadata. + +Verity and DAX are not compatible and attempts to set both of these flags +on a file will fail. + +Atomic Block Writes +------------------- + +Introduction +~~~~~~~~~~~~ + +Atomic (untorn) block writes ensure that either the entire write is committed +to disk or none of it is. This prevents "torn writes" during power loss or +system crashes. The ext4 filesystem supports atomic writes (only with Direct +I/O) on regular files with extents, provided the underlying storage device +supports hardware atomic writes. This is supported in the following two ways: + +1. **Single-fsblock Atomic Writes**: + ext4 supports atomic write operations with a single filesystem block since + v6.13. In this the atomic write unit minimum and maximum sizes are both set + to filesystem blocksize. + e.g.
doing atomic write of 16KB with 16KB filesystem blocksize on 64KB + pagesize system is possible. + +2. **Multi-fsblock Atomic Writes with Bigalloc**: + ext4 now also supports atomic writes spanning multiple filesystem blocks + using a feature known as bigalloc. The atomic write unit's minimum and + maximum sizes are determined by the filesystem block size and cluster size, + based on the underlying device’s supported atomic write unit limits. + +Requirements +~~~~~~~~~~~~ + +Basic requirements for atomic writes in ext4: + + 1. The extents feature must be enabled (default for ext4) + 2. The underlying block device must support atomic writes + 3. For single-fsblock atomic writes: + + 1. A filesystem with appropriate block size (up to the page size) + 4. For multi-fsblock atomic writes: + + 1. The bigalloc feature must be enabled + 2. The cluster size must be appropriately configured + +.. note:: + ext4 does not support software or COW based atomic write, which means + atomic writes on ext4 are only supported if underlying storage device + supports it. + +Multi-fsblock Implementation Details +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The bigalloc feature changes ext4 to allocate in units of multiple filesystem +blocks, also known as clusters. With bigalloc each bit within the block bitmap +represents a cluster (power of 2 number of blocks) rather than individual +filesystem blocks. + +ext4 supports multi-fsblock atomic writes with bigalloc, subject to the +following constraints. The minimum atomic write size is the larger of the fs +block size and the minimum hardware atomic write unit; and the maximum atomic +write size is the smaller of the bigalloc cluster size and the maximum hardware +atomic write unit. Bigalloc ensures that all allocations are aligned to the +cluster size, which satisfies the LBA alignment requirements of the hardware +device if the start of the partition/logical volume is itself aligned correctly.
+
+Here is the block allocation strategy in bigalloc for atomic writes:
+
+ * For regions with fully mapped extents, no additional work is needed
+ * For append writes, a new mapped extent is allocated
+ * For regions that are entirely holes, an unwritten extent is created
+ * For large unwritten extents, the extent gets split into two unwritten
+   extents of appropriate requested size
+ * For mixed mapping regions (combinations of holes, unwritten extents, or
+   mapped extents), ext4_map_blocks() is called in a loop with the
+   EXT4_GET_BLOCKS_ZERO flag to convert the region into a single contiguous
+   mapped extent by writing zeroes to it and converting any unwritten extents
+   found within the range to written.
+
+.. note::
+   Writing on a single contiguous underlying extent, whether mapped or
+   unwritten, is not inherently problematic. However, writing to a mixed mapping
+   region (i.e. one containing a combination of mapped and unwritten extents)
+   must be avoided when performing atomic writes.
+
+The reason is that atomic writes, when issued via pwritev2() with the RWF_ATOMIC
+flag, require that either all data is written or none at all. In the event of
+a system crash or unexpected power loss during the write operation, the affected
+region (when later read) must reflect either the complete old data or the
+complete new data, but never a mix of both.
+
+To enforce this guarantee, we ensure that the write target is backed by
+a single, contiguous extent before any data is written. This is critical because
+ext4 defers the conversion of unwritten extents to written extents until the I/O
+completion path (typically in ->end_io()). If a write is allowed to proceed over
+a mixed mapping region (with mapped and unwritten extents) and a failure occurs
+mid-write, the system could observe partially updated regions after reboot, i.e.
+new data over mapped areas, and stale (old) data over unwritten extents that
+were never marked written.
+This violates the atomicity and/or torn write
+prevention guarantee.
+
+To prevent such torn writes, ext4 proactively allocates a single contiguous
+extent for the entire requested region in ``ext4_iomap_alloc()`` via
+``ext4_map_blocks_atomic()``. ext4 also forces a commit of the current journal
+transaction if the allocation is done over a mixed mapping. This ensures that any
+pending metadata updates (like unwritten-to-written extent conversion) in this
+range are in a consistent state with the file data blocks before performing the
+actual write I/O. If the commit fails, the whole I/O must be aborted to prevent
+any possible torn writes.
+Only after this step is the actual data write operation performed by iomap.
+
+Handling Split Extents Across Leaf Blocks
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+There can be a special edge case where we have logically and physically
+contiguous extents stored in separate leaf nodes of the on-disk extent tree.
+This occurs because on-disk extent tree merges only happen within the leaf
+blocks, except for the case where we have a 2-level tree which can get merged and
+collapsed entirely into the inode.
+If such a layout exists and, in the worst case, the extent status cache entries
+are reclaimed due to memory pressure, ``ext4_map_blocks()`` may never return
+a single contiguous extent for these split leaf extents.
+
+To address this edge case, a new get block flag,
+``EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS``, is added to enhance the
+``ext4_map_query_blocks()`` lookup behavior.
+
+This new get block flag allows ``ext4_map_blocks()`` to first check if there is
+an entry in the extent status cache for the full range.
+If not present, it consults the on-disk extent tree using
+``ext4_map_query_blocks()``.
+If the located extent is at the end of a leaf node, it probes the next logical
+block (lblk) to detect a contiguous extent in the adjacent leaf.
+
+For now only one additional leaf block is queried to maintain efficiency, as
+atomic writes are typically constrained to small sizes
+(e.g. [blocksize, clustersize]).
+
+
+Handling Journal transactions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To support multi-fsblock atomic writes, we ensure enough journal credits are
+reserved during:
+
+ 1. Block allocation time in ``ext4_iomap_alloc()``. We first query if there
+    could be a mixed mapping for the underlying requested range. If yes, then we
+    reserve credits of up to ``m_len``, assuming every alternate block can be
+    an unwritten extent followed by a hole.
+
+ 2. During the ``->end_io()`` call, we make sure a single transaction is started
+    for doing unwritten-to-written conversion. The loop for conversion is mainly
+    required to handle a split extent across leaf blocks.
+
+How to
+~~~~~~
+
+Creating Filesystems with Atomic Write Support
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+First check the atomic write units supported by the block device.
+See :ref:`atomic_write_bdev_support` for more details.
+
+For single-fsblock atomic writes with a larger block size
+(on systems with block size < page size):
+
+.. code-block:: bash
+
+   # Create an ext4 filesystem with a 16KB block size
+   # (requires page size >= 16KB)
+   mkfs.ext4 -b 16384 /dev/device
+
+For multi-fsblock atomic writes with bigalloc:
+
+.. code-block:: bash
+
+   # Create an ext4 filesystem with bigalloc and 64KB cluster size
+   mkfs.ext4 -F -O bigalloc -b 4096 -C 65536 /dev/device
+
+Where ``-b`` specifies the block size, ``-C`` specifies the cluster size in bytes,
+and ``-O bigalloc`` enables the bigalloc feature.
+
+Application Interface
+^^^^^^^^^^^^^^^^^^^^^
+
+Applications can use the ``pwritev2()`` system call with the ``RWF_ATOMIC`` flag
+to perform atomic writes:
+
+.. code-block:: c
+
+   pwritev2(fd, iov, iovcnt, offset, RWF_ATOMIC);
+
+The write must be aligned to the filesystem's block size and not exceed the
+filesystem's maximum atomic write unit size.
+See ``generic_atomic_write_valid()`` for more details.
+
+The ``statx()`` system call with the ``STATX_WRITE_ATOMIC`` flag provides the
+following details:
+
+ * ``stx_atomic_write_unit_min``: Minimum size of an atomic write request.
+ * ``stx_atomic_write_unit_max``: Maximum size of an atomic write request.
+ * ``stx_atomic_write_segments_max``: Upper limit for segments. The number of
+   separate memory buffers that can be gathered into a write operation
+   (e.g., the iovcnt parameter for IOV_ITER). Currently, this is always set to one.
+
+The STATX_ATTR_WRITE_ATOMIC flag in ``statx->attributes`` is set if atomic
+writes are supported.
+
+.. _atomic_write_bdev_support:
+
+Hardware Support
+~~~~~~~~~~~~~~~~
+
+The underlying storage device must support atomic write operations.
+Modern NVMe and SCSI devices often provide this capability.
+The Linux kernel exposes this information through sysfs:
+
+* ``/sys/block//queue/atomic_write_unit_min`` - Minimum atomic write size
+* ``/sys/block//queue/atomic_write_unit_max`` - Maximum atomic write size
+
+Nonzero values for these attributes indicate that the device supports
+atomic writes.
+
+See Also
+~~~~~~~~
+
+* Support for atomic block writes in 6.13:
+  https://lwn.net/Articles/1009298/
diff --git a/Documentation/filesystems/ext4/special_inodes.rst b/Documentation/filesystems/ext4/special_inodes.rst
deleted file mode 100644
index fc0636901fa0e1..00000000000000
--- a/Documentation/filesystems/ext4/special_inodes.rst
+++ /dev/null
@@ -1,55 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-Special inodes
---------------
-
-ext4 reserves some inode for special features, as follows:
-
-.. list-table::
-   :widths: 6 70
-   :header-rows: 1
-
-   * - inode Number
-     - Purpose
-   * - 0
-     - Doesn't exist; there is no inode 0.
- * - 1 - - List of defective blocks. - * - 2 - - Root directory. - * - 3 - - User quota. - * - 4 - - Group quota. - * - 5 - - Boot loader. - * - 6 - - Undelete directory. - * - 7 - - Reserved group descriptors inode. (=E2=80=9Cresize inode=E2=80=9D) - * - 8 - - Journal inode. - * - 9 - - The =E2=80=9Cexclude=E2=80=9D inode, for snapshots(?) - * - 10 - - Replica inode, used for some non-upstream feature? - * - 11 - - Traditional first non-reserved inode. Usually this is the lost+foun= d directory. See s_first_ino in the superblock. - -Note that there are also some inodes allocated from non-reserved inode num= bers -for other filesystem features which are not referenced from standard direc= tory -hierarchy. These are generally reference from the superblock. They are: - -.. list-table:: - :widths: 20 50 - :header-rows: 1 - - * - Superblock field - - Description - - * - s_lpf_ino - - Inode number of lost+found directory. - * - s_prj_quota_inum - - Inode number of quota file tracking project quotas - * - s_orphan_file_inum - - Inode number of file tracking orphan inodes. diff --git a/Documentation/filesystems/ext4/verity.rst b/Documentation/file= systems/ext4/verity.rst deleted file mode 100644 index e99ff3fd09f7e7..00000000000000 --- a/Documentation/filesystems/ext4/verity.rst +++ /dev/null @@ -1,44 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Verity files ------------- - -ext4 supports fs-verity, which is a filesystem feature that provides -Merkle tree based hashing for individual readonly files. Most of -fs-verity is common to all filesystems that support it; see -:ref:`Documentation/filesystems/fsverity.rst ` for the -fs-verity documentation. However, the on-disk layout of the verity -metadata is filesystem-specific. On ext4, the verity metadata is -stored after the end of the file data itself, in the following format: - -- Zero-padding to the next 65536-byte boundary. This padding need not - actually be allocated on-disk, i.e. it may be a hole. 
- -- The Merkle tree, as documented in - :ref:`Documentation/filesystems/fsverity.rst - `, with the tree levels stored in order from - root to leaf, and the tree blocks within each level stored in their - natural order. - -- Zero-padding to the next filesystem block boundary. - -- The verity descriptor, as documented in - :ref:`Documentation/filesystems/fsverity.rst `, - with optionally appended signature blob. - -- Zero-padding to the next offset that is 4 bytes before a filesystem - block boundary. - -- The size of the verity descriptor in bytes, as a 4-byte little - endian integer. - -Verity inodes have EXT4_VERITY_FL set, and they must use extents, i.e. -EXT4_EXTENTS_FL must be set and EXT4_INLINE_DATA_FL must be clear. -They can have EXT4_ENCRYPT_FL set, in which case the verity metadata -is encrypted as well as the data itself. - -Verity files cannot have blocks allocated past the end of the verity -metadata. - -Verity and DAX are not compatible and attempts to set both of these flags -on a file will fail. --=20 An old man doll... just what I always wanted! 
- Clara

From nobody Thu Oct 9 09:03:17 2025
From: Bagas Sanjaya
To: Linux Kernel Mailing List, Linux Documentation, Linux ext4
Cc: "Theodore Ts'o", Andreas Dilger, Jonathan Corbet, "Darrick J. Wong", "Ritesh Harjani (IBM)", Bagas Sanjaya
Subject: [PATCH 2/4] Documentation: ext4: Slurp included subdocs in global structures docs
Date: Wed, 18 Jun 2025 18:15:35 +0700
Message-ID: <20250618111544.22602-3-bagasdotme@gmail.com>
X-Mailer: git-send-email 2.49.0
In-Reply-To: <20250618111544.22602-1-bagasdotme@gmail.com>
References: <20250618111544.22602-1-bagasdotme@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

Slurp subdocumentations for global structures (globals.rst) by replacing
reST include:: directive with their respective contents.
Signed-off-by: Bagas Sanjaya --- Documentation/filesystems/ext4/bitmaps.rst | 28 - Documentation/filesystems/ext4/globals.rst | 1923 ++++++++++++++++- .../filesystems/ext4/group_descr.rst | 173 -- Documentation/filesystems/ext4/journal.rst | 761 ------- Documentation/filesystems/ext4/mmp.rst | 77 - Documentation/filesystems/ext4/orphan.rst | 42 - Documentation/filesystems/ext4/super.rst | 839 ------- 7 files changed, 1917 insertions(+), 1926 deletions(-) delete mode 100644 Documentation/filesystems/ext4/bitmaps.rst delete mode 100644 Documentation/filesystems/ext4/group_descr.rst delete mode 100644 Documentation/filesystems/ext4/journal.rst delete mode 100644 Documentation/filesystems/ext4/mmp.rst delete mode 100644 Documentation/filesystems/ext4/orphan.rst delete mode 100644 Documentation/filesystems/ext4/super.rst diff --git a/Documentation/filesystems/ext4/bitmaps.rst b/Documentation/fil= esystems/ext4/bitmaps.rst deleted file mode 100644 index 91c45d86e9bb56..00000000000000 --- a/Documentation/filesystems/ext4/bitmaps.rst +++ /dev/null @@ -1,28 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Block and inode Bitmaps ------------------------ - -The data block bitmap tracks the usage of data blocks within the block -group. - -The inode bitmap records which entries in the inode table are in use. - -As with most bitmaps, one bit represents the usage status of one data -block or inode table entry. This implies a block group size of 8 * -number_of_bytes_in_a_logical_block. - -NOTE: If ``BLOCK_UNINIT`` is set for a given block group, various parts -of the kernel and e2fsprogs code pretends that the block bitmap contains -zeros (i.e. all blocks in the group are free). However, it is not -necessarily the case that no blocks are in use -- if ``meta_bg`` is set, -the bitmaps and group descriptor live inside the group. Unfortunately, -ext2fs_test_block_bitmap2() will return '0' for those locations, -which produces confusing debugfs output. 
- -Inode Table ------------ -Inode tables are statically allocated at mkfs time. Each block group -descriptor points to the start of the table, and the superblock records -the number of inodes per group. See the section on inodes for more -information. diff --git a/Documentation/filesystems/ext4/globals.rst b/Documentation/fil= esystems/ext4/globals.rst index b17418974fd35e..46eabf88267f80 100644 --- a/Documentation/filesystems/ext4/globals.rst +++ b/Documentation/filesystems/ext4/globals.rst @@ -6,9 +6,1920 @@ Global Structures The filesystem is sharded into a number of block groups, each of which have static metadata at fixed locations. =20 -.. include:: super.rst -.. include:: group_descr.rst -.. include:: bitmaps.rst -.. include:: mmp.rst -.. include:: journal.rst -.. include:: orphan.rst +Super Block +----------- + +The superblock records various information about the enclosing +filesystem, such as block counts, inode counts, supported features, +maintenance information, and more. + +If the sparse_super feature flag is set, redundant copies of the +superblock and group descriptors are kept only in the groups whose group +number is either 0 or a power of 3, 5, or 7. If the flag is not set, +redundant copies are kept in all groups. + +The superblock checksum is calculated against the superblock structure, +which includes the FS UUID. + +The ext4 superblock is laid out as follows in +``struct ext4_super_block``: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Size + - Name + - Description + * - 0x0 + - __le32 + - s_inodes_count + - Total inode count. + * - 0x4 + - __le32 + - s_blocks_count_lo + - Total block count. + * - 0x8 + - __le32 + - s_r_blocks_count_lo + - This number of blocks can only be allocated by the super-user. + * - 0xC + - __le32 + - s_free_blocks_count_lo + - Free block count. + * - 0x10 + - __le32 + - s_free_inodes_count + - Free inode count. + * - 0x14 + - __le32 + - s_first_data_block + - First data block. 
This must be at least 1 for 1k-block filesystems = and + is typically 0 for all other block sizes. + * - 0x18 + - __le32 + - s_log_block_size + - Block size is 2 ^ (10 + s_log_block_size). + * - 0x1C + - __le32 + - s_log_cluster_size + - Cluster size is 2 ^ (10 + s_log_cluster_size) blocks if bigalloc is + enabled. Otherwise s_log_cluster_size must equal s_log_block_size. + * - 0x20 + - __le32 + - s_blocks_per_group + - Blocks per group. + * - 0x24 + - __le32 + - s_clusters_per_group + - Clusters per group, if bigalloc is enabled. Otherwise + s_clusters_per_group must equal s_blocks_per_group. + * - 0x28 + - __le32 + - s_inodes_per_group + - Inodes per group. + * - 0x2C + - __le32 + - s_mtime + - Mount time, in seconds since the epoch. + * - 0x30 + - __le32 + - s_wtime + - Write time, in seconds since the epoch. + * - 0x34 + - __le16 + - s_mnt_count + - Number of mounts since the last fsck. + * - 0x36 + - __le16 + - s_max_mnt_count + - Number of mounts beyond which a fsck is needed. + * - 0x38 + - __le16 + - s_magic + - Magic signature, 0xEF53 + * - 0x3A + - __le16 + - s_state + - File system state. See super_state_ for more info. + * - 0x3C + - __le16 + - s_errors + - Behaviour when detecting errors. See super_errors_ for more info. + * - 0x3E + - __le16 + - s_minor_rev_level + - Minor revision level. + * - 0x40 + - __le32 + - s_lastcheck + - Time of last check, in seconds since the epoch. + * - 0x44 + - __le32 + - s_checkinterval + - Maximum time between checks, in seconds. + * - 0x48 + - __le32 + - s_creator_os + - Creator OS. See the table super_creator_ for more info. + * - 0x4C + - __le32 + - s_rev_level + - Revision level. See the table super_revision_ for more info. + * - 0x50 + - __le16 + - s_def_resuid + - Default uid for reserved blocks. + * - 0x52 + - __le16 + - s_def_resgid + - Default gid for reserved blocks. + * - + - + - + - These fields are for EXT4_DYNAMIC_REV superblocks only. + =20 + .. 
note:: + the difference between the compatible feature set and the + incompatible feature set is that if there is a bit set in the + incompatible feature set that the kernel doesn't know about, it + should refuse to mount the filesystem. + =20 + e2fsck's requirements are more strict; if it doesn't know + about a feature in either the compatible or incompatible feature= set, + it must abort and not try to meddle with things it doesn't + understand... + * - 0x54 + - __le32 + - s_first_ino + - First non-reserved inode. + * - 0x58 + - __le16 + - s_inode_size + - Size of inode structure, in bytes. + * - 0x5A + - __le16 + - s_block_group_nr + - Block group # of this superblock. + * - 0x5C + - __le32 + - s_feature_compat + - Compatible feature set flags. Kernel can still read/write this fs e= ven + if it doesn't understand a flag; fsck should not do that. See the + super_compat_ table for more info. + * - 0x60 + - __le32 + - s_feature_incompat + - Incompatible feature set. If the kernel or fsck doesn't understand = one + of these bits, it should stop. See the super_incompat_ table for mo= re + info. + * - 0x64 + - __le32 + - s_feature_ro_compat + - Readonly-compatible feature set. If the kernel doesn't understand o= ne of + these bits, it can still mount read-only. See the super_rocompat_ t= able + for more info. + * - 0x68 + - __u8 + - s_uuid[16] + - 128-bit UUID for volume. + * - 0x78 + - char + - s_volume_name[16] + - Volume label. + * - 0x88 + - char + - s_last_mounted[64] + - Directory where filesystem was last mounted. + * - 0xC8 + - __le32 + - s_algorithm_usage_bitmap + - For compression (Not used in e2fsprogs/Linux) + * - + - + - + - Performance hints. Directory preallocation should only happen if t= he + EXT4_FEATURE_COMPAT_DIR_PREALLOC flag is on. + * - 0xCC + - __u8 + - s_prealloc_blocks + - #. of blocks to try to preallocate for ... files? (Not used in + e2fsprogs/Linux) + * - 0xCD + - __u8 + - s_prealloc_dir_blocks + - #. 
of blocks to preallocate for directories. (Not used in + e2fsprogs/Linux) + * - 0xCE + - __le16 + - s_reserved_gdt_blocks + - Number of reserved GDT entries for future filesystem expansion. + * - + - + - + - Journalling support is valid only if EXT4_FEATURE_COMPAT_HAS_JOURNA= L is + set. + * - 0xD0 + - __u8 + - s_journal_uuid[16] + - UUID of journal superblock + * - 0xE0 + - __le32 + - s_journal_inum + - inode number of journal file. + * - 0xE4 + - __le32 + - s_journal_dev + - Device number of journal file, if the external journal feature flag= is + set. + * - 0xE8 + - __le32 + - s_last_orphan + - Start of list of orphaned inodes to delete. + * - 0xEC + - __le32 + - s_hash_seed[4] + - HTREE hash seed. + * - 0xFC + - __u8 + - s_def_hash_version + - Default hash algorithm to use for directory hashes. See super_def_h= ash_ + for more info. + * - 0xFD + - __u8 + - s_jnl_backup_type + - If this value is 0 or EXT3_JNL_BACKUP_BLOCKS (1), then the + ``s_jnl_blocks`` field contains a duplicate copy of the inode's + ``i_block[]`` array and ``i_size``. + * - 0xFE + - __le16 + - s_desc_size + - Size of group descriptors, in bytes, if the 64bit incompat feature = flag + is set. + * - 0x100 + - __le32 + - s_default_mount_opts + - Default mount options. See the super_mountopts_ table for more info. + * - 0x104 + - __le32 + - s_first_meta_bg + - First metablock block group, if the meta_bg feature is enabled. + * - 0x108 + - __le32 + - s_mkfs_time + - When the filesystem was created, in seconds since the epoch. + * - 0x10C + - __le32 + - s_jnl_blocks[17] + - Backup copy of the journal inode's ``i_block[]`` array in the first= 15 + elements and i_size_high and i_size in the 16th and 17th elements, + respectively. + * - + - + - + - 64bit support is valid only if EXT4_FEATURE_COMPAT_64BIT is set. + * - 0x150 + - __le32 + - s_blocks_count_hi + - High 32-bits of the block count. + * - 0x154 + - __le32 + - s_r_blocks_count_hi + - High 32-bits of the reserved block count. 
+ * - 0x158 + - __le32 + - s_free_blocks_count_hi + - High 32-bits of the free block count. + * - 0x15C + - __le16 + - s_min_extra_isize + - All inodes have at least # bytes. + * - 0x15E + - __le16 + - s_want_extra_isize + - New inodes should reserve # bytes. + * - 0x160 + - __le32 + - s_flags + - Miscellaneous flags. See the super_flags_ table for more info. + * - 0x164 + - __le16 + - s_raid_stride + - RAID stride. This is the number of logical blocks read from or writ= ten + to the disk before moving to the next disk. This affects the placem= ent + of filesystem metadata, which will hopefully make RAID storage fast= er. + * - 0x166 + - __le16 + - s_mmp_interval + - #. seconds to wait in multi-mount prevention (MMP) checking. In the= ory, + MMP is a mechanism to record in the superblock which host and device + have mounted the filesystem, in order to prevent multiple mounts. T= his + feature does not seem to be implemented... + * - 0x168 + - __le64 + - s_mmp_block + - Block # for multi-mount protection data. + * - 0x170 + - __le32 + - s_raid_stripe_width + - RAID stripe width. This is the number of logical blocks read from or + written to the disk before coming back to the current disk. This is= used + by the block allocator to try to reduce the number of read-modify-w= rite + operations in a RAID5/6. + * - 0x174 + - __u8 + - s_log_groups_per_flex + - Size of a flexible block group is 2 ^ ``s_log_groups_per_flex``. + * - 0x175 + - __u8 + - s_checksum_type + - Metadata checksum algorithm type. The only valid value is 1 (crc32c= ). + * - 0x176 + - \_\_u8 + - s\_encryption\_level + - Versioning level for encryption. + * - 0x177 + - \_\_u8 + - s\_reserved\_pad + - Padding to next 32bits. + * - 0x178 + - __le64 + - s_kbytes_written + - Number of KiB written to this filesystem over its lifetime. + * - 0x180 + - __le32 + - s_snapshot_inum + - inode number of active snapshot. (Not used in e2fsprogs/Linux.) 
+ * - 0x184 + - __le32 + - s_snapshot_id + - Sequential ID of active snapshot. (Not used in e2fsprogs/Linux.) + * - 0x188 + - __le64 + - s_snapshot_r_blocks_count + - Number of blocks reserved for active snapshot's future use. (Not us= ed in + e2fsprogs/Linux.) + * - 0x190 + - __le32 + - s_snapshot_list + - inode number of the head of the on-disk snapshot list. (Not used in + e2fsprogs/Linux.) + * - 0x194 + - __le32 + - s_error_count + - Number of errors seen. + * - 0x198 + - __le32 + - s_first_error_time + - First time an error happened, in seconds since the epoch. + * - 0x19C + - __le32 + - s_first_error_ino + - inode involved in first error. + * - 0x1A0 + - __le64 + - s_first_error_block + - Number of block involved of first error. + * - 0x1A8 + - __u8 + - s_first_error_func[32] + - Name of function where the error happened. + * - 0x1C8 + - __le32 + - s_first_error_line + - Line number where error happened. + * - 0x1CC + - __le32 + - s_last_error_time + - Time of most recent error, in seconds since the epoch. + * - 0x1D0 + - __le32 + - s_last_error_ino + - inode involved in most recent error. + * - 0x1D4 + - __le32 + - s_last_error_line + - Line number where most recent error happened. + * - 0x1D8 + - __le64 + - s_last_error_block + - Number of block involved in most recent error. + * - 0x1E0 + - __u8 + - s_last_error_func[32] + - Name of function where the most recent error happened. + * - 0x200 + - __u8 + - s_mount_opts[64] + - ASCIIZ string of mount options. + * - 0x240 + - __le32 + - s_usr_quota_inum + - Inode number of user `quota `__ file. + * - 0x244 + - __le32 + - s_grp_quota_inum + - Inode number of group `quota `__ file. + * - 0x248 + - __le32 + - s_overhead_blocks + - Overhead blocks/clusters in fs. (Huh? This field is always zero, wh= ich + means that the kernel calculates it dynamically.) 
+ * - 0x24C + - __le32 + - s_backup_bgs[2] + - Block groups containing superblock backups (if sparse_super2) + * - 0x254 + - __u8 + - s_encrypt_algos[4] + - Encryption algorithms in use. There can be up to four algorithms in= use + at any time; valid algorithm codes are given in the super_encrypt_ = table + below. + * - 0x258 + - __u8 + - s_encrypt_pw_salt[16] + - Salt for the string2key algorithm for encryption. + * - 0x268 + - __le32 + - s_lpf_ino + - Inode number of lost+found + * - 0x26C + - __le32 + - s_prj_quota_inum + - Inode that tracks project quotas. + * - 0x270 + - __le32 + - s_checksum_seed + - Checksum seed used for metadata_csum calculations. This value is + crc32c(~0, $orig_fs_uuid). + * - 0x274 + - __u8 + - s_wtime_hi + - Upper 8 bits of the s_wtime field. + * - 0x275 + - __u8 + - s_mtime_hi + - Upper 8 bits of the s_mtime field. + * - 0x276 + - __u8 + - s_mkfs_time_hi + - Upper 8 bits of the s_mkfs_time field. + * - 0x277 + - __u8 + - s_lastcheck_hi + - Upper 8 bits of the s_lastcheck field. + * - 0x278 + - __u8 + - s_first_error_time_hi + - Upper 8 bits of the s_first_error_time field. + * - 0x279 + - __u8 + - s_last_error_time_hi + - Upper 8 bits of the s_last_error_time field. + * - 0x27A + - \_\_u8 + - s\_first\_error\_errcode + - + * - 0x27B + - \_\_u8 + - s\_last\_error\_errcode + - + * - 0x27C + - __le16 + - s_encoding + - Filename charset encoding. + * - 0x27E + - __le16 + - s_encoding_flags + - Filename charset encoding flags. + * - 0x280 + - __le32 + - s_orphan_file_inum + - Orphan file inode number. + * - 0x284 + - __le32 + - s_reserved[94] + - Padding to the end of the block. + * - 0x3FC + - __le32 + - s_checksum + - Superblock checksum. + +.. _super_state: + +The superblock state is some combination of the following: + +.. list-table:: + :widths: 8 72 + :header-rows: 1 + + * - Value + - Description + * - 0x0001 + - Cleanly umounted + * - 0x0002 + - Errors detected + * - 0x0004 + - Orphans being recovered + +.. 
_super_errors: + +The superblock error policy is one of the following: + +.. list-table:: + :widths: 8 72 + :header-rows: 1 + + * - Value + - Description + * - 1 + - Continue + * - 2 + - Remount read-only + * - 3 + - Panic + +.. _super_creator: + +The filesystem creator is one of the following: + +.. list-table:: + :widths: 8 72 + :header-rows: 1 + + * - Value + - Description + * - 0 + - Linux + * - 1 + - Hurd + * - 2 + - Masix + * - 3 + - FreeBSD + * - 4 + - Lites + +.. _super_revision: + +The superblock revision is one of the following: + +.. list-table:: + :widths: 8 72 + :header-rows: 1 + + * - Value + - Description + * - 0 + - Original format + * - 1 + - v2 format w/ dynamic inode sizes + +Note that ``EXT4_DYNAMIC_REV`` refers to a revision 1 or newer filesystem. + +.. _super_compat: + +The superblock compatible features field is a combination of any of the +following: + +.. list-table:: + :widths: 16 64 + :header-rows: 1 + + * - Value + - Description + * - 0x1 + - Directory preallocation (COMPAT_DIR_PREALLOC). + * - 0x2 + - “imagic inodes”. Not clear from the code what this does + (COMPAT_IMAGIC_INODES). + * - 0x4 + - Has a journal (COMPAT_HAS_JOURNAL). + * - 0x8 + - Supports extended attributes (COMPAT_EXT_ATTR). + * - 0x10 + - Has reserved GDT blocks for filesystem expansion + (COMPAT_RESIZE_INODE). Requires RO_COMPAT_SPARSE_SUPER. + * - 0x20 + - Has directory indices (COMPAT_DIR_INDEX). + * - 0x40 + - “Lazy BG”. Not in Linux kernel, seems to have been for uninitialized + block groups? (COMPAT_LAZY_BG) + * - 0x80 + - “Exclude inode”. Not used. (COMPAT_EXCLUDE_INODE). + * - 0x100 + - “Exclude bitmap”. Seems to be used to indicate the presence of + snapshot-related exclude bitmaps? Not defined in kernel or used in + e2fsprogs (COMPAT_EXCLUDE_BITMAP). + * - 0x200 + - Sparse Super Block, v2.
If this flag is set, the SB field s_backup_bgs + points to the two block groups that contain backup superblocks + (COMPAT_SPARSE_SUPER2). + * - 0x400 + - Fast commits supported. Although fast commit blocks are + backward incompatible, fast commit blocks are not always + present in the journal. If fast commit blocks are present in + the journal, the JBD2 incompat feature + (JBD2_FEATURE_INCOMPAT_FAST_COMMIT) gets + set (COMPAT_FAST_COMMIT). + * - 0x1000 + - Orphan file allocated. This is the special file for more efficient + tracking of unlinked but still open inodes. When there may be any + entries in the file, we additionally set the proper rocompat feature + (RO_COMPAT_ORPHAN_PRESENT). + +.. _super_incompat: + +The superblock incompatible features field is a combination of any of the +following: + +.. list-table:: + :widths: 16 64 + :header-rows: 1 + + * - Value + - Description + * - 0x1 + - Compression (INCOMPAT_COMPRESSION). + * - 0x2 + - Directory entries record the file type. See ext4_dir_entry_2 below + (INCOMPAT_FILETYPE). + * - 0x4 + - Filesystem needs recovery (INCOMPAT_RECOVER). + * - 0x8 + - Filesystem has a separate journal device (INCOMPAT_JOURNAL_DEV). + * - 0x10 + - Meta block groups. See the earlier discussion of this feature + (INCOMPAT_META_BG). + * - 0x40 + - Files in this filesystem use extents (INCOMPAT_EXTENTS). + * - 0x80 + - Enable a filesystem size of 2^64 blocks (INCOMPAT_64BIT). + * - 0x100 + - Multiple mount protection (INCOMPAT_MMP). + * - 0x200 + - Flexible block groups. See the earlier discussion of this feature + (INCOMPAT_FLEX_BG). + * - 0x400 + - Inodes can be used to store large extended attribute values + (INCOMPAT_EA_INODE). + * - 0x1000 + - Data in directory entry (INCOMPAT_DIRDATA). (Not implemented?) + * - 0x2000 + - Metadata checksum seed is stored in the superblock.
This feature enables + the administrator to change the UUID of a metadata_csum filesystem + while the filesystem is mounted; without it, the checksum definition + requires all metadata blocks to be rewritten (INCOMPAT_CSUM_SEED). + * - 0x4000 + - Large directory >2GB or 3-level htree (INCOMPAT_LARGEDIR). Prior to + this feature, directories could not be larger than 4GiB and could not + have an htree more than 2 levels deep. If this feature is enabled, + directories can be larger than 4GiB and have a maximum htree depth of 3. + * - 0x8000 + - Data in inode (INCOMPAT_INLINE_DATA). + * - 0x10000 + - Encrypted inodes are present on the filesystem. (INCOMPAT_ENCRYPT). + +.. _super_rocompat: + +The superblock read-only compatible features field is a combination of any of +the following: + +.. list-table:: + :widths: 16 64 + :header-rows: 1 + + * - Value + - Description + * - 0x1 + - Sparse superblocks. See the earlier discussion of this feature + (RO_COMPAT_SPARSE_SUPER). + * - 0x2 + - This filesystem has been used to store a file greater than 2GiB + (RO_COMPAT_LARGE_FILE). + * - 0x4 + - Not used in kernel or e2fsprogs (RO_COMPAT_BTREE_DIR). + * - 0x8 + - This filesystem has files whose sizes are represented in units of + logical blocks, not 512-byte sectors. This implies a very large file + indeed! (RO_COMPAT_HUGE_FILE) + * - 0x10 + - Group descriptors have checksums. In addition to detecting corruption, + this is useful for lazy formatting with uninitialized groups + (RO_COMPAT_GDT_CSUM). + * - 0x20 + - Indicates that the old ext3 32,000 subdirectory limit no longer applies + (RO_COMPAT_DIR_NLINK). A directory's i_links_count will be set to 1 + if it is incremented past 64,999. + * - 0x40 + - Indicates that large inodes exist on this filesystem + (RO_COMPAT_EXTRA_ISIZE). + * - 0x80 + - This filesystem has a snapshot (RO_COMPAT_HAS_SNAPSHOT). + * - 0x100 + - `Quota `__ (RO_COMPAT_QUOTA).
+ * - 0x200 + - This filesystem supports “bigalloc”, which means that file extents are + tracked in units of clusters (of blocks) instead of blocks + (RO_COMPAT_BIGALLOC). + * - 0x400 + - This filesystem supports metadata checksumming. + (RO_COMPAT_METADATA_CSUM; implies RO_COMPAT_GDT_CSUM, though + GDT_CSUM must not be set) + * - 0x800 + - Filesystem supports replicas. This feature is neither in the kernel nor + e2fsprogs. (RO_COMPAT_REPLICA) + * - 0x1000 + - Read-only filesystem image; the kernel will not mount this image + read-write and most tools will refuse to write to the image. + (RO_COMPAT_READONLY) + * - 0x2000 + - Filesystem tracks project quotas. (RO_COMPAT_PROJECT) + * - 0x8000 + - Verity inodes may be present on the filesystem. (RO_COMPAT_VERITY) + * - 0x10000 + - Indicates orphan file may have valid orphan entries and thus we need + to clean them up when mounting the filesystem + (RO_COMPAT_ORPHAN_PRESENT). + +.. _super_def_hash: + +The ``s_def_hash_version`` field is one of the following: + +.. list-table:: + :widths: 8 72 + :header-rows: 1 + + * - Value + - Description + * - 0x0 + - Legacy. + * - 0x1 + - Half MD4. + * - 0x2 + - Tea. + * - 0x3 + - Legacy, unsigned. + * - 0x4 + - Half MD4, unsigned. + * - 0x5 + - Tea, unsigned. + +.. _super_mountopts: + +The ``s_default_mount_opts`` field is any combination of the following: + +.. list-table:: + :widths: 8 72 + :header-rows: 1 + + * - Value + - Description + * - 0x0001 + - Print debugging info upon (re)mount. (EXT4_DEFM_DEBUG) + * - 0x0002 + - New files take the gid of the containing directory (instead of the fsgid + of the current process). (EXT4_DEFM_BSDGROUPS) + * - 0x0004 + - Support userspace-provided extended attributes. (EXT4_DEFM_XATTR_USER) + * - 0x0008 + - Support POSIX access control lists (ACLs). (EXT4_DEFM_ACL) + * - 0x0010 + - Do not support 32-bit UIDs. (EXT4_DEFM_UID16) + * - 0x0020 + - All data and metadata are committed to the journal.
+ (EXT4_DEFM_JMODE_DATA) + * - 0x0040 + - All data are flushed to the disk before metadata are committed to the + journal. (EXT4_DEFM_JMODE_ORDERED) + * - 0x0060 + - Data ordering is not preserved; data may be written after the metadata + has been written. (EXT4_DEFM_JMODE_WBACK) + * - 0x0100 + - Disable write flushes. (EXT4_DEFM_NOBARRIER) + * - 0x0200 + - Track which blocks in a filesystem are metadata and therefore should not + be used as data blocks. This option will be enabled by default on 3.18, + hopefully. (EXT4_DEFM_BLOCK_VALIDITY) + * - 0x0400 + - Enable DISCARD support, where the storage device is told about blocks + becoming unused. (EXT4_DEFM_DISCARD) + * - 0x0800 + - Disable delayed allocation. (EXT4_DEFM_NODELALLOC) + +.. _super_flags: + +The ``s_flags`` field is any combination of the following: + +.. list-table:: + :widths: 8 72 + :header-rows: 1 + + * - Value + - Description + * - 0x0001 + - Signed directory hash in use. + * - 0x0002 + - Unsigned directory hash in use. + * - 0x0004 + - To test development code. + +.. _super_encrypt: + +The ``s_encrypt_algos`` list can contain any of the following: + +.. list-table:: + :widths: 8 72 + :header-rows: 1 + + * - Value + - Description + * - 0 + - Invalid algorithm (ENCRYPTION_MODE_INVALID). + * - 1 + - 256-bit AES in XTS mode (ENCRYPTION_MODE_AES_256_XTS). + * - 2 + - 256-bit AES in GCM mode (ENCRYPTION_MODE_AES_256_GCM). + * - 3 + - 256-bit AES in CBC mode (ENCRYPTION_MODE_AES_256_CBC). + +Total size of the superblock is 1024 bytes. + +Block Group Descriptors +----------------------- + +Each block group on the filesystem has one of these descriptors +associated with it. As noted in the Layout section above, the group +descriptors (if present) are the second item in the block group. The +standard configuration is for each block group to contain a full copy of +the block group descriptor table unless the sparse_super feature flag +is set.
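To make the sparse_super case concrete: with that flag set, backup copies of the superblock and group descriptor table are kept only in groups 0 and 1 and in groups whose number is a power of 3, 5, or 7. The sketch below is illustrative only; the function names are hypothetical (not kernel or e2fsprogs API), and it deliberately ignores sparse_super2 and meta_bg.

```python
def is_power_of(n, base):
    """Return True if n is base**k for some integer k >= 0."""
    while n > 1 and n % base == 0:
        n //= base
    return n == 1


def group_has_backup(group, sparse_super=True):
    """Hypothetical helper: does this block group carry a backup
    superblock and group descriptor table?

    Assumes plain sparse_super semantics (no sparse_super2, no meta_bg).
    """
    if not sparse_super:
        # Without sparse_super, every group carries a full copy.
        return True
    if group <= 1:
        return True
    return any(is_power_of(group, base) for base in (3, 5, 7))


if __name__ == "__main__":
    print([g for g in range(50) if group_has_backup(g)])
```

For the first 50 groups this selects groups 0, 1, 3, 5, 7, 9, 25, 27 and 49.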
+ +Notice how the group descriptor records the location of both bitmaps and +the inode table (i.e. they can float). This means that within a block +group, the only data structures with fixed locations are the superblock +and the group descriptor table. The flex_bg mechanism uses this +property to group several block groups into a flex group and lay out all +of the groups' bitmaps and inode tables into one long run in the first +group of the flex group. + +If the meta_bg feature flag is set, then several block groups are +grouped together into a meta group. Note that in the meta_bg case, +however, the first and last two block groups within the larger meta +group contain only group descriptors for the groups inside the meta +group. + +flex_bg and meta_bg do not appear to be mutually exclusive features. + +In ext2, ext3, and ext4 (when the 64bit feature is not enabled), the +block group descriptor was only 32 bytes long and therefore ends at +bg_checksum. On an ext4 filesystem with the 64bit feature enabled, the +block group descriptor expands to at least the 64 bytes described below; +the size is stored in the superblock. + +If gdt_csum is set and metadata_csum is not set, the block group +checksum is the crc16 of the FS UUID, the group number, and the group +descriptor structure. If metadata_csum is set, then the block group +checksum is the lower 16 bits of the checksum of the FS UUID, the group +number, and the group descriptor structure. Both block and inode bitmap +checksums are calculated against the FS UUID, the group number, and the +entire bitmap. + +The block group descriptor is laid out in ``struct ext4_group_desc``. + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Size + - Name + - Description + * - 0x0 + - __le32 + - bg_block_bitmap_lo + - Lower 32-bits of location of block bitmap. + * - 0x4 + - __le32 + - bg_inode_bitmap_lo + - Lower 32-bits of location of inode bitmap. 
+ * - 0x8 + - __le32 + - bg_inode_table_lo + - Lower 32-bits of location of inode table. + * - 0xC + - __le16 + - bg_free_blocks_count_lo + - Lower 16-bits of free block count. + * - 0xE + - __le16 + - bg_free_inodes_count_lo + - Lower 16-bits of free inode count. + * - 0x10 + - __le16 + - bg_used_dirs_count_lo + - Lower 16-bits of directory count. + * - 0x12 + - __le16 + - bg_flags + - Block group flags. See the bgflags_ table below. + * - 0x14 + - __le32 + - bg_exclude_bitmap_lo + - Lower 32-bits of location of snapshot exclusion bitmap. + * - 0x18 + - __le16 + - bg_block_bitmap_csum_lo + - Lower 16-bits of the block bitmap checksum. + * - 0x1A + - __le16 + - bg_inode_bitmap_csum_lo + - Lower 16-bits of the inode bitmap checksum. + * - 0x1C + - __le16 + - bg_itable_unused_lo + - Lower 16-bits of unused inode count. If set, we needn't scan past the + ``(sb.s_inodes_per_group - gdt.bg_itable_unused)`` th entry in the + inode table for this group. + * - 0x1E + - __le16 + - bg_checksum + - Group descriptor checksum; crc16(sb_uuid+group_num+bg_desc) if the + RO_COMPAT_GDT_CSUM feature is set, or + crc32c(sb_uuid+group_num+bg_desc) & 0xFFFF if the + RO_COMPAT_METADATA_CSUM feature is set. The bg_checksum + field in bg_desc is skipped when calculating crc16 checksum, + and set to zero if crc32c checksum is used. + * - + - + - + - These fields only exist if the 64bit feature is enabled and s_desc_size + > 32. + * - 0x20 + - __le32 + - bg_block_bitmap_hi + - Upper 32-bits of location of block bitmap. + * - 0x24 + - __le32 + - bg_inode_bitmap_hi + - Upper 32-bits of location of inode bitmap. + * - 0x28 + - __le32 + - bg_inode_table_hi + - Upper 32-bits of location of inode table. + * - 0x2C + - __le16 + - bg_free_blocks_count_hi + - Upper 16-bits of free block count. + * - 0x2E + - __le16 + - bg_free_inodes_count_hi + - Upper 16-bits of free inode count. + * - 0x30 + - __le16 + - bg_used_dirs_count_hi + - Upper 16-bits of directory count.
+ * - 0x32 + - __le16 + - bg_itable_unused_hi + - Upper 16-bits of unused inode count. + * - 0x34 + - __le32 + - bg_exclude_bitmap_hi + - Upper 32-bits of location of snapshot exclusion bitmap. + * - 0x38 + - __le16 + - bg_block_bitmap_csum_hi + - Upper 16-bits of the block bitmap checksum. + * - 0x3A + - __le16 + - bg_inode_bitmap_csum_hi + - Upper 16-bits of the inode bitmap checksum. + * - 0x3C + - __u32 + - bg_reserved + - Padding to 64 bytes. + +.. _bgflags: + +Block group flags can be any combination of the following: + +.. list-table:: + :widths: 16 64 + :header-rows: 1 + + * - Value + - Description + * - 0x1 + - inode table and bitmap are not initialized (EXT4_BG_INODE_UNINIT). + * - 0x2 + - block bitmap is not initialized (EXT4_BG_BLOCK_UNINIT). + * - 0x4 + - inode table is zeroed (EXT4_BG_INODE_ZEROED). + +Block and inode Bitmaps +----------------------- + +The data block bitmap tracks the usage of data blocks within the block +group. + +The inode bitmap records which entries in the inode table are in use. + +As with most bitmaps, one bit represents the usage status of one data +block or inode table entry. This implies a block group size of 8 * +number_of_bytes_in_a_logical_block. + +.. note:: + If ``BLOCK_UNINIT`` is set for a given block group, various parts + of the kernel and e2fsprogs code pretend that the block bitmap contains + zeros (i.e. all blocks in the group are free). However, it is not + necessarily the case that no blocks are in use -- if ``meta_bg`` is set, + the bitmaps and group descriptor live inside the group. Unfortunately, + ext2fs_test_block_bitmap2() will return '0' for those locations, + which produces confusing debugfs output. + +Inode Table +----------- +Inode tables are statically allocated at mkfs time. Each block group +descriptor points to the start of the table, and the superblock records +the number of inodes per group. See the section on inodes for more +information.
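As a rough sketch of how those two pieces of information combine, the snippet below finds the block and byte offset holding a given inode. It assumes inode numbers start at 1, that the inode size and block size have already been read from the superblock, and that ``table_start`` holds each group's inode table location with the ``bg_inode_table_lo`` and ``bg_inode_table_hi`` halves already joined. The function and parameter names are made up for illustration.

```python
def join_lo_hi(lo, hi=0):
    """Combine the _lo and _hi halves of a 64-bit block number."""
    return (hi << 32) | lo


def locate_inode(ino, inodes_per_group, inode_size, block_size, table_start):
    """Return (group, block, offset_in_block) for inode number `ino`.

    `table_start` maps a group number to the first block of that group's
    inode table (an assumption of this sketch, not an on-disk structure).
    """
    if ino < 1:
        raise ValueError("inode numbers start at 1")
    # Which group owns the inode, and which slot within that group's table.
    group, index = divmod(ino - 1, inodes_per_group)
    # Convert the slot into a block within the table plus a byte offset.
    block_in_table, offset = divmod(index * inode_size, block_size)
    return group, table_start[group] + block_in_table, offset


if __name__ == "__main__":
    tables = {0: 1000, 1: 33000}
    print(locate_inode(12, 8192, 256, 4096, tables))
    print(locate_inode(8209, 8192, 256, 4096, tables))
```

With 8192 inodes per group, 256-byte inodes and 4096-byte blocks, inode 12 is slot 11 of group 0, which sits 2816 bytes into the first block of that group's table.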
+ +Multiple Mount Protection +------------------------- + +Multiple mount protection (MMP) is a feature that protects the +filesystem against multiple hosts trying to use the filesystem +simultaneously. When a filesystem is opened (for mounting, or fsck, +etc.), the MMP code running on the node (call it node A) checks a +sequence number. If the sequence number is EXT4_MMP_SEQ_CLEAN, the +open continues. If the sequence number is EXT4_MMP_SEQ_FSCK, then +fsck is (hopefully) running, and open fails immediately. Otherwise, the +open code will wait for twice the specified MMP check interval and check +the sequence number again. If the sequence number has changed, then the +filesystem is active on another machine and the open fails. If the MMP +code passes all of those checks, a new MMP sequence number is generated +and written to the MMP block, and the mount proceeds. + +While the filesystem is live, the kernel sets up a timer to re-check the +MMP block at the specified MMP check interval. To perform the re-check, +the MMP sequence number is re-read; if it does not match the in-memory +MMP sequence number, then another node (node B) has mounted the +filesystem, and node A remounts the filesystem read-only. If the +sequence numbers match, the sequence number is incremented both in +memory and on disk, and the re-check is complete. + +The hostname and device filename are written into the MMP block whenever +an open operation succeeds. The MMP code does not use these values; they +are provided purely for informational purposes. + +The checksum is calculated against the FS UUID and the MMP structure. +The MMP structure (``struct mmp_struct``) is as follows: + +.. list-table:: + :widths: 8 12 20 40 + :header-rows: 1 + + * - Offset + - Type + - Name + - Description + * - 0x0 + - __le32 + - mmp_magic + - Magic number for MMP, 0x004D4D50 (“MMP”). + * - 0x4 + - __le32 + - mmp_seq + - Sequence number, updated periodically.
+ * - 0x8 + - __le64 + - mmp_time + - Time that the MMP block was last updated. + * - 0x10 + - char[64] + - mmp_nodename + - Hostname of the node that opened the filesystem. + * - 0x50 + - char[32] + - mmp_bdevname + - Block device name of the filesystem. + * - 0x70 + - __le16 + - mmp_check_interval + - The MMP re-check interval, in seconds. + * - 0x72 + - __le16 + - mmp_pad1 + - Zero. + * - 0x74 + - __le32[226] + - mmp_pad2 + - Zero. + * - 0x3FC + - __le32 + - mmp_checksum + - Checksum of the MMP block. + +Journal (jbd2) +-------------- + +Introduced in ext3, the ext4 filesystem employs a journal to protect the +filesystem against metadata inconsistencies in the case of a system crash. Up +to 10,240,000 file system blocks (see man mke2fs(8) for more details on journal +size limits) can be reserved inside the filesystem as a place to land +“important” data writes on-disk as quickly as possible. Once the important +data transaction is fully written to the disk and flushed from the disk write +cache, a record of the data being committed is also written to the journal. At +some later point in time, the journal code writes the transactions to their +final locations on disk (this could involve a lot of seeking or a lot of small +read-write-erases) before erasing the commit record. Should the system +crash during the second slow write, the journal can be replayed all the +way to the latest commit record, guaranteeing the atomicity of whatever +gets written through the journal to the disk. The effect of this is to +guarantee that the filesystem does not become stuck midway through a +metadata update. + +For performance reasons, ext4 by default only writes filesystem metadata +through the journal. This means that file data blocks are /not/ +guaranteed to be in any consistent state after a crash. If this default +guarantee level (``data=ordered``) is not satisfactory, there is a mount +option to control journal behavior.
If ``data=journal``, all data and +metadata are written to disk through the journal. This is slower but +safest. If ``data=writeback``, dirty data blocks are not flushed to the +disk before the metadata are written to disk through the journal. + +In case of ``data=ordered`` mode, Ext4 also supports fast commits which +help reduce commit latency significantly. The default ``data=ordered`` +mode works by logging metadata blocks to the journal. In fast commit +mode, Ext4 only stores the minimal delta needed to recreate the +affected metadata in fast commit space that is shared with JBD2. +Once the fast commit area fills up or if fast commit is not possible +or if the JBD2 commit timer goes off, Ext4 performs a traditional full commit. +A full commit invalidates all the fast commits that happened before +it and thus it makes the fast commit area empty for further fast +commits. This feature needs to be enabled at mkfs time. + +The journal inode is typically inode 8. The first 68 bytes of the +journal inode are replicated in the ext4 superblock. The journal itself +is a normal (but hidden) file within the filesystem. The file usually +consumes an entire block group, though mke2fs tries to put it in the +middle of the disk. + +All fields in jbd2 are written to disk in big-endian order. This is the +opposite of ext4. + +.. note:: Both ext4 and ocfs2 use jbd2. + +The maximum size of a journal embedded in an ext4 filesystem is 2^32 +blocks. jbd2 itself does not seem to care. + +Layout +~~~~~~ + +Generally speaking, the journal has this format: + +.. list-table:: + :widths: 16 48 16 + :header-rows: 1 + + * - Superblock + - descriptor_block (data_blocks or revocation_block) [more data or + revocations] commit_block + - [more transactions...] + * - + - One transaction + - + +Notice that a transaction begins with either a descriptor and some data, +or a block revocation list. A finished transaction always ends with a +commit.
If there is no commit record (or the checksums don't match), the +transaction will be discarded during replay. + +External Journal +~~~~~~~~~~~~~~~~ + +Optionally, an ext4 filesystem can be created with an external journal +device (as opposed to an internal journal, which uses a reserved inode). +In this case, on the filesystem device, ``s_journal_inum`` should be +zero and ``s_journal_uuid`` should be set. On the journal device there +will be an ext4 super block in the usual place, with a matching UUID. +The journal superblock will be in the next full block after the +superblock. + +.. list-table:: + :widths: 12 12 12 32 12 + :header-rows: 1 + + * - 1024 bytes of padding + - ext4 Superblock + - Journal Superblock + - descriptor_block (data_blocks or revocation_block) [more data or + revocations] commit_block + - [more transactions...] + * - + - + - + - One transaction + - + +Block Header +~~~~~~~~~~~~ + +Every block in the journal starts with a common 12-byte header +``struct journal_header_s``: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Type + - Name + - Description + * - 0x0 + - __be32 + - h_magic + - jbd2 magic number, 0xC03B3998. + * - 0x4 + - __be32 + - h_blocktype + - Description of what this block contains. See the jbd2_blocktype_ table + below. + * - 0x8 + - __be32 + - h_sequence + - The transaction ID that goes with this block. + +.. _jbd2_blocktype: + +The journal block type can be any one of: + +.. list-table:: + :widths: 16 64 + :header-rows: 1 + + * - Value + - Description + * - 1 + - Descriptor. This block precedes a series of data blocks that were + written through the journal during a transaction. + * - 2 + - Block commit record. This block signifies the completion of a + transaction. + * - 3 + - Journal superblock, v1. + * - 4 + - Journal superblock, v2. + * - 5 + - Block revocation records. This speeds up recovery by enabling the + journal to skip writing blocks that were subsequently rewritten.
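The 12-byte header and block type codes above can be decoded with a few lines of code. This is a minimal sketch, assuming the caller has already read one journal block into a ``bytes`` object; the function name and the label strings are illustrative, not part of jbd2.

```python
import struct

JBD2_MAGIC = 0xC03B3998

# Block type codes from the table above; the labels are informal.
BLOCK_TYPES = {
    1: "descriptor",
    2: "commit",
    3: "superblock v1",
    4: "superblock v2",
    5: "revoke",
}


def parse_journal_header(block):
    """Decode the common jbd2 block header.

    Returns (block_type_label, sequence), or None if the block does not
    start with the jbd2 magic number (e.g. it is a journalled data block).
    """
    if len(block) < 12:
        raise ValueError("a journal header needs at least 12 bytes")
    # All jbd2 fields are big-endian, hence the ">" byte-order prefix.
    magic, blocktype, sequence = struct.unpack(">III", block[:12])
    if magic != JBD2_MAGIC:
        return None
    return BLOCK_TYPES.get(blocktype, "unknown"), sequence


if __name__ == "__main__":
    sample = struct.pack(">III", JBD2_MAGIC, 2, 7) + bytes(1012)
    print(parse_journal_header(sample))
```

A block that lacks the magic number is reported as ``None`` rather than as an error, because data blocks stored in the journal carry no header of their own.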
+ +Super Block +~~~~~~~~~~~ + +The super block for the journal is much simpler than ext4's. +The key data kept within are the size of the journal and where to find the +start of the log of transactions. + +The journal superblock is recorded as ``struct journal_superblock_s``, +which is 1024 bytes long: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Type + - Name + - Description + * - + - + - + - Static information describing the journal. + * - 0x0 + - journal_header_t (12 bytes) + - s_header + - Common header identifying this as a superblock. + * - 0xC + - __be32 + - s_blocksize + - Journal device block size. + * - 0x10 + - __be32 + - s_maxlen + - Total number of blocks in this journal. + * - 0x14 + - __be32 + - s_first + - First block of log information. + * - + - + - + - Dynamic information describing the current state of the log. + * - 0x18 + - __be32 + - s_sequence + - First commit ID expected in log. + * - 0x1C + - __be32 + - s_start + - Block number of the start of log. Contrary to the comments, this field + being zero does not imply that the journal is clean! + * - 0x20 + - __be32 + - s_errno + - Error value, as set by jbd2_journal_abort(). + * - + - + - + - The remaining fields are only valid in a v2 superblock. + * - 0x24 + - __be32 + - s_feature_compat + - Compatible feature set. See the table jbd2_compat_ below. + * - 0x28 + - __be32 + - s_feature_incompat + - Incompatible feature set. See the table jbd2_incompat_ below. + * - 0x2C + - __be32 + - s_feature_ro_compat + - Read-only compatible feature set. There aren't any of these currently. + * - 0x30 + - __u8 + - s_uuid[16] + - 128-bit uuid for journal. This is compared against the copy in the ext4 + super block at mount time. + * - 0x40 + - __be32 + - s_nr_users + - Number of file systems sharing this journal. + * - 0x44 + - __be32 + - s_dynsuper + - Location of dynamic super block copy. (Not used?)
+ * - 0x48 + - __be32 + - s_max_transaction + - Limit of journal blocks per transaction. (Not used?) + * - 0x4C + - __be32 + - s_max_trans_data + - Limit of data blocks per transaction. (Not used?) + * - 0x50 + - __u8 + - s_checksum_type + - Checksum algorithm used for the journal. See jbd2_checksum_type_ for + more info. + * - 0x51 + - __u8[3] + - s_padding2 + - + * - 0x54 + - __be32 + - s_num_fc_blocks + - Number of fast commit blocks in the journal. + * - 0x58 + - __be32 + - s_head + - Block number of the head (first unused block) of the journal, only + up-to-date when the journal is empty. + * - 0x5C + - __u32 + - s_padding[40] + - + * - 0xFC + - __be32 + - s_checksum + - Checksum of the entire superblock, with this field set to zero. + * - 0x100 + - __u8 + - s_users[16*48] + - ids of all file systems sharing the log. e2fsprogs/Linux don't allow + shared external journals, but I imagine Lustre (or ocfs2?), which use + the jbd2 code, might. + +.. _jbd2_compat: + +The journal compat features are any combination of the following: + +.. list-table:: + :widths: 16 64 + :header-rows: 1 + + * - Value + - Description + * - 0x1 + - Journal maintains checksums on the data blocks. + (JBD2_FEATURE_COMPAT_CHECKSUM) + +.. _jbd2_incompat: + +The journal incompat features are any combination of the following: + +.. list-table:: + :widths: 16 64 + :header-rows: 1 + + * - Value + - Description + * - 0x1 + - Journal has block revocation records. (JBD2_FEATURE_INCOMPAT_REVOKE) + * - 0x2 + - Journal can deal with 64-bit block numbers. + (JBD2_FEATURE_INCOMPAT_64BIT) + * - 0x4 + - Journal commits asynchronously. (JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT) + * - 0x8 + - This journal uses v2 of the checksum on-disk format. Each journal + metadata block gets its own checksum, and the block tags in the + descriptor table contain checksums for each of the data blocks in the + journal. (JBD2_FEATURE_INCOMPAT_CSUM_V2) + * - 0x10 + - This journal uses v3 of the checksum on-disk format.
This is the same as + v2, but the journal block tag size is fixed regardless of the size of + block numbers. (JBD2_FEATURE_INCOMPAT_CSUM_V3) + * - 0x20 + - Journal has fast commit blocks. (JBD2_FEATURE_INCOMPAT_FAST_COMMIT) + +.. _jbd2_checksum_type: + +Journal checksum type codes are one of the following. crc32 or crc32c are the +most likely choices. + +.. list-table:: + :widths: 16 64 + :header-rows: 1 + + * - Value + - Description + * - 1 + - CRC32 + * - 2 + - MD5 + * - 3 + - SHA1 + * - 4 + - CRC32C + +Descriptor Block +~~~~~~~~~~~~~~~~ + +The descriptor block contains an array of journal block tags that +describe the final locations of the data blocks that follow in the +journal. Descriptor blocks are open-coded instead of being completely +described by a data structure, but here is the block structure anyway. +Descriptor blocks consume at least 36 bytes, but use a full block: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Type + - Name + - Description + * - 0x0 + - journal_header_t + - (open coded) + - Common block header. + * - 0xC + - struct journal_block_tag_s + - open coded array[] + - Enough tags either to fill up the block or to describe all the data + blocks that follow this descriptor block. + +Journal block tags have any of the following formats, depending on which +journal feature and block tag flags are set. + +If JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the journal block tag is +defined as ``struct journal_block_tag3_s``, which looks like the +following. The size is 16 or 32 bytes. + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Type + - Name + - Description + * - 0x0 + - __be32 + - t_blocknr + - Lower 32-bits of the location of where the corresponding data block + should end up on disk. + * - 0x4 + - __be32 + - t_flags + - Flags that go with the descriptor. See the table jbd2_tag_flags_ for + more info.
+ * - 0x8 + - __be32 + - t_blocknr_high + - Upper 32-bits of the location of where the corresponding data block + should end up on disk. This is zero if JBD2_FEATURE_INCOMPAT_64BIT is + not enabled. + * - 0xC + - __be32 + - t_checksum + - Checksum of the journal UUID, the sequence number, and the data block. + * - + - + - + - This field appears to be open coded. It always comes at the end of the + tag, after t_checksum. This field is not present if the "same UUID" flag + is set. + * - 0x8 or 0xC + - char + - uuid[16] + - A UUID to go with this tag. This field appears to be copied from the + ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that + field. + +.. _jbd2_tag_flags: + +The journal tag flags are any combination of the following: + +.. list-table:: + :widths: 16 64 + :header-rows: 1 + + * - Value + - Description + * - 0x1 + - On-disk block is escaped. The first four bytes of the data block just + happened to match the jbd2 magic number. + * - 0x2 + - This block has the same UUID as previous, therefore the UUID field is + omitted. + * - 0x4 + - The data block was deleted by the transaction. (Not used?) + * - 0x8 + - This is the last tag in this descriptor block. + +If JBD2_FEATURE_INCOMPAT_CSUM_V3 is NOT set, the journal block tag +is defined as ``struct journal_block_tag_s``, which looks like the +following. The size is 8, 12, 24, or 28 bytes: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Type + - Name + - Description + * - 0x0 + - __be32 + - t_blocknr + - Lower 32-bits of the location of where the corresponding data block + should end up on disk. + * - 0x4 + - __be16 + - t_checksum + - Checksum of the journal UUID, the sequence number, and the data block. + Note that only the lower 16 bits are stored. + * - 0x6 + - __be16 + - t_flags + - Flags that go with the descriptor. See the table jbd2_tag_flags_ for + more info.
+ * - + - + - + - This next field is only present if the super block indicates support for + 64-bit block numbers. + * - 0x8 + - __be32 + - t_blocknr_high + - Upper 32-bits of the location of where the corresponding data block + should end up on disk. + * - + - + - + - This field appears to be open coded. It always comes at the end of the + tag, after t_flags or t_blocknr_high. This field is not present if the + "same UUID" flag is set. + * - 0x8 or 0xC + - char + - uuid[16] + - A UUID to go with this tag. This field appears to be copied from the + ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that + field. + +If JBD2_FEATURE_INCOMPAT_CSUM_V2 or +JBD2_FEATURE_INCOMPAT_CSUM_V3 are set, the end of the block is a +``struct jbd2_journal_block_tail``, which looks like this: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Type + - Name + - Description + * - 0x0 + - __be32 + - t_checksum + - Checksum of the journal UUID + the descriptor block, with this field set + to zero. + +Data Block +~~~~~~~~~~ + +In general, the data blocks being written to disk through the journal +are written verbatim into the journal file after the descriptor block. +However, if the first four bytes of the block match the jbd2 magic +number then those four bytes are replaced with zeroes and the “escaped” +flag is set in the descriptor block tag. + +Revocation Block +~~~~~~~~~~~~~~~~ + +A revocation block is used to prevent replay of a block in an earlier +transaction. This is used to mark blocks that were journalled at one +time but are no longer journalled. Typically this happens if a metadata +block is freed and re-allocated as a file data block; in this case, a +journal replay after the file block was written to disk will cause +corruption. + +.. note:: + This mechanism is NOT used to express “this journal block is + superseded by this other journal block”, as the author (djwong) + mistakenly thought.
Any block being added to a transaction will cause + the removal of all existing revocation records for that block. + +Revocation blocks are described in +``struct jbd2_journal_revoke_header_s``, are at least 16 bytes in +length, but use a full block: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Type + - Name + - Description + * - 0x0 + - journal_header_t + - r_header + - Common block header. + * - 0xC + - __be32 + - r_count + - Number of bytes used in this block. + * - 0x10 + - __be32 or __be64 + - blocks[0] + - Blocks to revoke. + +After r_count is a linear array of block numbers that are effectively +revoked by this transaction. The size of each block number is 8 bytes if +the superblock advertises 64-bit block number support, or 4 bytes +otherwise. + +If JBD2_FEATURE_INCOMPAT_CSUM_V2 or +JBD2_FEATURE_INCOMPAT_CSUM_V3 are set, the end of the revocation +block is a ``struct jbd2_journal_revoke_tail``, which has this format: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Type + - Name + - Description + * - 0x0 + - __be32 + - r_checksum + - Checksum of the journal UUID + revocation block + +Commit Block +~~~~~~~~~~~~ + +The commit block is a sentry that indicates that a transaction has been +completely written to the journal. Once this commit block reaches the +journal, the data stored with this transaction can be written to their +final locations on disk. + +The commit block is described by ``struct commit_header``, which is 32 +bytes long (but uses a full block): + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Type + - Name + - Descriptor + * - 0x0 + - journal_header_s + - (open coded) + - Common block header. + * - 0xC + - unsigned char + - h_chksum_type + - The type of checksum to use to verify the integrity of the data blocks + in the transaction. See jbd2_checksum_type_ for more info.
+ * - 0xD + - unsigned char + - h_chksum_size + - The number of bytes used by the checksum. Most likely 4. + * - 0xE + - unsigned char + - h_padding[2] + - + * - 0x10 + - __be32 + - h_chksum[JBD2_CHECKSUM_BYTES] + - 32 bytes of space to store checksums. If + JBD2_FEATURE_INCOMPAT_CSUM_V2 or JBD2_FEATURE_INCOMPAT_CSUM_V3 + are set, the first ``__be32`` is the checksum of the journal UUID and + the entire commit block, with this field zeroed. If + JBD2_FEATURE_COMPAT_CHECKSUM is set, the first ``__be32`` is the + crc32 of all the blocks already written to the transaction. + * - 0x30 + - __be64 + - h_commit_sec + - The time that the transaction was committed, in seconds since the epoch. + * - 0x38 + - __be32 + - h_commit_nsec + - Nanoseconds component of the above timestamp. + +Fast commits +~~~~~~~~~~~~ + +Fast commit area is organized as a log of tag length values. Each TLV has +a ``struct ext4_fc_tl`` in the beginning which stores the tag and the length +of the entire field. It is followed by variable length tag specific value. +Here is the list of supported tags and their meanings: + +.. list-table:: + :widths: 8 20 20 32 + :header-rows: 1 + + * - Tag + - Meaning + - Value struct + - Description + * - EXT4_FC_TAG_HEAD + - Fast commit area header + - ``struct ext4_fc_head`` + - Stores the TID of the transaction after which these fast commits should + be applied.
+ * - EXT4_FC_TAG_ADD_RANGE + - Add extent to inode + - ``struct ext4_fc_add_range`` + - Stores the inode number and extent to be added in this inode + * - EXT4_FC_TAG_DEL_RANGE + - Remove logical offsets to inode + - ``struct ext4_fc_del_range`` + - Stores the inode number and the logical offset range that needs to be + removed + * - EXT4_FC_TAG_CREAT + - Create directory entry for a newly created file + - ``struct ext4_fc_dentry_info`` + - Stores the parent inode number, inode number and directory entry of the + newly created file + * - EXT4_FC_TAG_LINK + - Link a directory entry to an inode + - ``struct ext4_fc_dentry_info`` + - Stores the parent inode number, inode number and directory entry + * - EXT4_FC_TAG_UNLINK + - Unlink a directory entry of an inode + - ``struct ext4_fc_dentry_info`` + - Stores the parent inode number, inode number and directory entry + + * - EXT4_FC_TAG_PAD + - Padding (unused area) + - None + - Unused bytes in the fast commit area. + + * - EXT4_FC_TAG_TAIL + - Mark the end of a fast commit + - ``struct ext4_fc_tail`` + - Stores the TID of the commit, CRC of the fast commit of which this tag + represents the end of + +Fast Commit Replay Idempotence +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Fast commit tags are idempotent in nature provided the recovery code follows +certain rules. The guiding principle that the commit path follows while +committing is that it stores the result of a particular operation instead of +storing the procedure. + +Let's consider this rename operation: 'mv /a /b'. Let's assume dirent '/a' +was associated with inode 10. During fast commit, instead of storing this +operation as a procedure "rename a to b", we store the resulting file system +state as a "series" of outcomes: + +- Link dirent b to inode 10 +- Unlink dirent a +- Inode 10 with valid refcount + +Now when recovery code runs, it needs to "enforce" this state on the file +system. This is what guarantees idempotence of fast commit replay.
+ +Let's take an example of a procedure that is not idempotent and see how fast +commits make it idempotent. Consider the following sequence of operations: + +1) rm A +2) mv B A +3) read A + +If we store this sequence of operations as is then the replay is not idempotent. +Let's say while in replay, we crash after (2). During the second replay, +file A (which was actually created as a result of "mv B A" operation) would get +deleted. Thus, file named A would be absent when we try to read A. So, this +sequence of operations is not idempotent. However, as mentioned above, instead +of storing the procedure fast commits store the outcome of each procedure. Thus +the fast commit log for above procedure would be as follows: + +(Let's assume dirent A was linked to inode 10 and dirent B was linked to +inode 11 before the replay) + +1) Unlink A +2) Link A to inode 11 +3) Unlink B +4) Inode 11 + +If we crash after (3) we will have file A linked to inode 11. During the second +replay, we will remove file A (inode 11). But we will create it back and make +it point to inode 11. We won't find B, so we'll just skip that step. At this +point, the refcount for inode 11 is not reliable, but that gets fixed by the +replay of last inode 11 tag. Thus, by converting a non-idempotent procedure +into a series of idempotent outcomes, fast commits ensured idempotence during +the replay. + +Journal Checkpoint +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Checkpointing the journal ensures all transactions and their associated buffers +are submitted to the disk. In-progress transactions are waited upon and included +in the checkpoint. Checkpointing is used internally during critical updates to +the filesystem including journal recovery, filesystem resizing, and freeing of +the journal_t structure. + +A journal checkpoint can be triggered from userspace via the ioctl +EXT4_IOC_CHECKPOINT. This ioctl takes a single, u64 argument for flags. +Currently, three flags are supported.
First, EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN +can be used to verify input to the ioctl. It returns an error if there is any +invalid input, otherwise it returns success without performing +any checkpointing. This can be used to check whether the ioctl exists on a +system and to verify there are no issues with arguments or flags. The +other two flags are EXT4_IOC_CHECKPOINT_FLAG_DISCARD and +EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT. These flags cause the journal blocks to be +discarded or zero-filled, respectively, after the journal checkpoint is +complete. EXT4_IOC_CHECKPOINT_FLAG_DISCARD and EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT +cannot both be set. The ioctl may be useful when snapshotting a system or for +complying with content deletion SLOs. + +Orphan file +----------- + +In Unix there can be inodes that are unlinked from directory hierarchy but that +are still alive because they are open. In case of crash the filesystem has to +clean up these inodes as otherwise they (and the blocks referenced from them) +would leak. Similarly if we truncate or extend the file, we may not be able +to perform the operation in a single journalling transaction. In such case we +track the inode as orphan so that in case of crash extra blocks allocated to +the file get truncated. + +Traditionally ext4 tracks orphan inodes in a form of single linked list where +superblock contains the inode number of the last orphan inode (s_last_orphan +field) and then each inode contains inode number of the previously orphaned +inode (we overload i_dtime inode field for this). However this filesystem +global single linked list is a scalability bottleneck for workloads that result +in heavy creation of orphan inodes. When orphan file feature +(COMPAT_ORPHAN_FILE) is enabled, the filesystem has a special inode +(referenced from the superblock through s_orphan_file_inum) with several +blocks.
Each of these blocks has a structure: + +============= ================ =============== =============================== +Offset Type Name Description +============= ================ =============== =============================== +0x0 Array of Orphan inode Each __le32 entry is either + __le32 entries entries empty (0) or it contains + inode number of an orphan + inode. +blocksize-8 __le32 ob_magic Magic value stored in orphan + block tail (0x0b10ca04) +blocksize-4 __le32 ob_checksum Checksum of the orphan block. +============= ================ =============== =============================== + +When a filesystem with orphan file feature is writeably mounted, we set +RO_COMPAT_ORPHAN_PRESENT feature in the superblock to indicate there may +be valid orphan entries. In case we see this feature when mounting the +filesystem, we read the whole orphan file and process all orphan inodes found +there as usual. When cleanly unmounting the filesystem we remove the +RO_COMPAT_ORPHAN_PRESENT feature to avoid unnecessary scanning of the orphan +file and also make the filesystem fully compatible with older kernels. diff --git a/Documentation/filesystems/ext4/group_descr.rst b/Documentation/filesystems/ext4/group_descr.rst deleted file mode 100644 index 392ec44f8fb00d..00000000000000 --- a/Documentation/filesystems/ext4/group_descr.rst +++ /dev/null @@ -1,173 +0,0 @@ -..
SPDX-License-Identifier: GPL-2.0 - -Block Group Descriptors ------------------------ - -Each block group on the filesystem has one of these descriptors -associated with it. As noted in the Layout section above, the group -descriptors (if present) are the second item in the block group. The -standard configuration is for each block group to contain a full copy of -the block group descriptor table unless the sparse_super feature flag -is set. - -Notice how the group descriptor records the location of both bitmaps and -the inode table (i.e. they can float). This means that within a block -group, the only data structures with fixed locations are the superblock -and the group descriptor table. The flex_bg mechanism uses this -property to group several block groups into a flex group and lay out all -of the groups' bitmaps and inode tables into one long run in the first -group of the flex group. - -If the meta_bg feature flag is set, then several block groups are -grouped together into a meta group. Note that in the meta_bg case, -however, the first and last two block groups within the larger meta -group contain only group descriptors for the groups inside the meta -group. - -flex_bg and meta_bg do not appear to be mutually exclusive features. - -In ext2, ext3, and ext4 (when the 64bit feature is not enabled), the -block group descriptor was only 32 bytes long and therefore ends at -bg_checksum. On an ext4 filesystem with the 64bit feature enabled, the -block group descriptor expands to at least the 64 bytes described below; -the size is stored in the superblock. - -If gdt_csum is set and metadata_csum is not set, the block group -checksum is the crc16 of the FS UUID, the group number, and the group -descriptor structure. If metadata_csum is set, then the block group -checksum is the lower 16 bits of the checksum of the FS UUID, the group -number, and the group descriptor structure. 
Both block and inode bitmap -checksums are calculated against the FS UUID, the group number, and the -entire bitmap. - -The block group descriptor is laid out in ``struct ext4_group_desc``. - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - __le32 - - bg_block_bitmap_lo - - Lower 32-bits of location of block bitmap. - * - 0x4 - - __le32 - - bg_inode_bitmap_lo - - Lower 32-bits of location of inode bitmap. - * - 0x8 - - __le32 - - bg_inode_table_lo - - Lower 32-bits of location of inode table. - * - 0xC - - __le16 - - bg_free_blocks_count_lo - - Lower 16-bits of free block count. - * - 0xE - - __le16 - - bg_free_inodes_count_lo - - Lower 16-bits of free inode count. - * - 0x10 - - __le16 - - bg_used_dirs_count_lo - - Lower 16-bits of directory count. - * - 0x12 - - __le16 - - bg_flags - - Block group flags. See the bgflags_ table below. - * - 0x14 - - __le32 - - bg_exclude_bitmap_lo - - Lower 32-bits of location of snapshot exclusion bitmap. - * - 0x18 - - __le16 - - bg_block_bitmap_csum_lo - - Lower 16-bits of the block bitmap checksum. - * - 0x1A - - __le16 - - bg_inode_bitmap_csum_lo - - Lower 16-bits of the inode bitmap checksum. - * - 0x1C - - __le16 - - bg_itable_unused_lo - - Lower 16-bits of unused inode count. If set, we needn't scan past the - ``(sb.s_inodes_per_group - gdt.bg_itable_unused)`` th entry in the - inode table for this group. - * - 0x1E - - __le16 - - bg_checksum - - Group descriptor checksum; crc16(sb_uuid+group_num+bg_desc) if the - RO_COMPAT_GDT_CSUM feature is set, or - crc32c(sb_uuid+group_num+bg_desc) & 0xFFFF if the - RO_COMPAT_METADATA_CSUM feature is set. The bg_checksum - field in bg_desc is skipped when calculating crc16 checksum, - and set to zero if crc32c checksum is used. - * - - - - - - - These fields only exist if the 64bit feature is enabled and s_desc_size - > 32.
- * - 0x20 - - __le32 - - bg_block_bitmap_hi - - Upper 32-bits of location of block bitmap. - * - 0x24 - - __le32 - - bg_inode_bitmap_hi - - Upper 32-bits of location of inodes bitmap. - * - 0x28 - - __le32 - - bg_inode_table_hi - - Upper 32-bits of location of inodes table. - * - 0x2C - - __le16 - - bg_free_blocks_count_hi - - Upper 16-bits of free block count. - * - 0x2E - - __le16 - - bg_free_inodes_count_hi - - Upper 16-bits of free inode count. - * - 0x30 - - __le16 - - bg_used_dirs_count_hi - - Upper 16-bits of directory count. - * - 0x32 - - __le16 - - bg_itable_unused_hi - - Upper 16-bits of unused inode count. - * - 0x34 - - __le32 - - bg_exclude_bitmap_hi - - Upper 32-bits of location of snapshot exclusion bitmap. - * - 0x38 - - __le16 - - bg_block_bitmap_csum_hi - - Upper 16-bits of the block bitmap checksum. - * - 0x3A - - __le16 - - bg_inode_bitmap_csum_hi - - Upper 16-bits of the inode bitmap checksum. - * - 0x3C - - __u32 - - bg_reserved - - Padding to 64 bytes. - -.. _bgflags: - -Block group flags can be any combination of the following: - -.. list-table:: - :widths: 16 64 - :header-rows: 1 - - * - Value - - Description - * - 0x1 - - inode table and bitmap are not initialized (EXT4_BG_INODE_UNINIT). - * - 0x2 - - block bitmap is not initialized (EXT4_BG_BLOCK_UNINIT). - * - 0x4 - - inode table is zeroed (EXT4_BG_INODE_ZEROED). diff --git a/Documentation/filesystems/ext4/journal.rst b/Documentation/filesystems/ext4/journal.rst deleted file mode 100644 index 6e8fb2d4b46fed..00000000000000 --- a/Documentation/filesystems/ext4/journal.rst +++ /dev/null @@ -1,761 +0,0 @@ -..
SPDX-License-Identifier: GPL-2.0 - -Journal (jbd2) --------------- - -Introduced in ext3, the ext4 filesystem employs a journal to protect the -filesystem against metadata inconsistencies in the case of a system crash. Up -to 10,240,000 file system blocks (see man mke2fs(8) for more details on journal -size limits) can be reserved inside the filesystem as a place to land -“important” data writes on-disk as quickly as possible. Once the important -data transaction is fully written to the disk and flushed from the disk write -cache, a record of the data being committed is also written to the journal. At -some later point in time, the journal code writes the transactions to their -final locations on disk (this could involve a lot of seeking or a lot of small -read-write-erases) before erasing the commit record. Should the system -crash during the second slow write, the journal can be replayed all the -way to the latest commit record, guaranteeing the atomicity of whatever -gets written through the journal to the disk. The effect of this is to -guarantee that the filesystem does not become stuck midway through a -metadata update. - -For performance reasons, ext4 by default only writes filesystem metadata -through the journal. This means that file data blocks are /not/ -guaranteed to be in any consistent state after a crash. If this default -guarantee level (``data=ordered``) is not satisfactory, there is a mount -option to control journal behavior. If ``data=journal``, all data and -metadata are written to disk through the journal. This is slower but -safest. If ``data=writeback``, dirty data blocks are not flushed to the -disk before the metadata are written to disk through the journal. - -In case of ``data=ordered`` mode, Ext4 also supports fast commits which -help reduce commit latency significantly. The default ``data=ordered`` -mode works by logging metadata blocks to the journal.
In fast commit -mode, Ext4 only stores the minimal delta needed to recreate the -affected metadata in fast commit space that is shared with JBD2. -Once the fast commit area fills in or if fast commit is not possible -or if JBD2 commit timer goes off, Ext4 performs a traditional full commit. -A full commit invalidates all the fast commits that happened before -it and thus it makes the fast commit area empty for further fast -commits. This feature needs to be enabled at mkfs time. - -The journal inode is typically inode 8. The first 68 bytes of the -journal inode are replicated in the ext4 superblock. The journal itself -is normal (but hidden) file within the filesystem. The file usually -consumes an entire block group, though mke2fs tries to put it in the -middle of the disk. - -All fields in jbd2 are written to disk in big-endian order. This is the -opposite of ext4. - -NOTE: Both ext4 and ocfs2 use jbd2. - -The maximum size of a journal embedded in an ext4 filesystem is 2^32 -blocks. jbd2 itself does not seem to care. - -Layout -~~~~~~ - -Generally speaking, the journal has this format: - -.. list-table:: - :widths: 16 48 16 - :header-rows: 1 - - * - Superblock - - descriptor_block (data_blocks or revocation_block) [more data or - revocations] commmit_block - - [more transactions...] - * - - - One transaction - - - -Notice that a transaction begins with either a descriptor and some data, -or a block revocation list. A finished transaction always ends with a -commit. If there is no commit record (or the checksums don't match), the -transaction will be discarded during replay. - -External Journal -~~~~~~~~~~~~~~~~ - -Optionally, an ext4 filesystem can be created with an external journal -device (as opposed to an internal journal, which uses a reserved inode). -In this case, on the filesystem device, ``s_journal_inum`` should be -zero and ``s_journal_uuid`` should be set.
On the journal device there -will be an ext4 super block in the usual place, with a matching UUID. -The journal superblock will be in the next full block after the -superblock. - -.. list-table:: - :widths: 12 12 12 32 12 - :header-rows: 1 - - * - 1024 bytes of padding - - ext4 Superblock - - Journal Superblock - - descriptor_block (data_blocks or revocation_block) [more data or - revocations] commmit_block - - [more transactions...] - * - - - - - - - One transaction - - - -Block Header -~~~~~~~~~~~~ - -Every block in the journal starts with a common 12-byte header -``struct journal_header_s``: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Type - - Name - - Description - * - 0x0 - - __be32 - - h_magic - - jbd2 magic number, 0xC03B3998. - * - 0x4 - - __be32 - - h_blocktype - - Description of what this block contains. See the jbd2_blocktype_ table - below. - * - 0x8 - - __be32 - - h_sequence - - The transaction ID that goes with this block. - -.. _jbd2_blocktype: - -The journal block type can be any one of: - -.. list-table:: - :widths: 16 64 - :header-rows: 1 - - * - Value - - Description - * - 1 - - Descriptor. This block precedes a series of data blocks that were - written through the journal during a transaction. - * - 2 - - Block commit record. This block signifies the completion of a - transaction. - * - 3 - - Journal superblock, v1. - * - 4 - - Journal superblock, v2. - * - 5 - - Block revocation records. This speeds up recovery by enabling the - journal to skip writing blocks that were subsequently rewritten. - -Super Block -~~~~~~~~~~~ - -The super block for the journal is much simpler as compared to ext4's. -The key data kept within are size of the journal, and where to find the -start of the log of transactions. - -The journal superblock is recorded as ``struct journal_superblock_s``, -which is 1024 bytes long: - -..
list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Type - - Name - - Description - * - - - - - - - Static information describing the journal. - * - 0x0 - - journal_header_t (12 bytes) - - s_header - - Common header identifying this as a superblock. - * - 0xC - - __be32 - - s_blocksize - - Journal device block size. - * - 0x10 - - __be32 - - s_maxlen - - Total number of blocks in this journal. - * - 0x14 - - __be32 - - s_first - - First block of log information. - * - - - - - - - Dynamic information describing the current state of the log. - * - 0x18 - - __be32 - - s_sequence - - First commit ID expected in log. - * - 0x1C - - __be32 - - s_start - - Block number of the start of log. Contrary to the comments, this field - being zero does not imply that the journal is clean! - * - 0x20 - - __be32 - - s_errno - - Error value, as set by jbd2_journal_abort(). - * - - - - - - - The remaining fields are only valid in a v2 superblock. - * - 0x24 - - __be32 - - s_feature_compat; - - Compatible feature set. See the table jbd2_compat_ below. - * - 0x28 - - __be32 - - s_feature_incompat - - Incompatible feature set. See the table jbd2_incompat_ below. - * - 0x2C - - __be32 - - s_feature_ro_compat - - Read-only compatible feature set. There aren't any of these currently. - * - 0x30 - - __u8 - - s_uuid[16] - - 128-bit uuid for journal. This is compared against the copy in the ext4 - super block at mount time. - * - 0x40 - - __be32 - - s_nr_users - - Number of file systems sharing this journal. - * - 0x44 - - __be32 - - s_dynsuper - - Location of dynamic super block copy. (Not used?) - * - 0x48 - - __be32 - - s_max_transaction - - Limit of journal blocks per transaction. (Not used?) - * - 0x4C - - __be32 - - s_max_trans_data - - Limit of data blocks per transaction. (Not used?) - * - 0x50 - - __u8 - - s_checksum_type - - Checksum algorithm used for the journal. See jbd2_checksum_type_ for - more info.
- * - 0x51 - - __u8[3] - - s_padding2 - - - * - 0x54 - - __be32 - - s_num_fc_blocks - - Number of fast commit blocks in the journal. - * - 0x58 - - __be32 - - s_head - - Block number of the head (first unused block) of the journal, only - up-to-date when the journal is empty. - * - 0x5C - - __u32 - - s_padding[40] - - - * - 0xFC - - __be32 - - s_checksum - - Checksum of the entire superblock, with this field set to zero. - * - 0x100 - - __u8 - - s_users[16*48] - - ids of all file systems sharing the log. e2fsprogs/Linux don't allow - shared external journals, but I imagine Lustre (or ocfs2?), which use - the jbd2 code, might. - -.. _jbd2_compat: - -The journal compat features are any combination of the following: - -.. list-table:: - :widths: 16 64 - :header-rows: 1 - - * - Value - - Description - * - 0x1 - - Journal maintains checksums on the data blocks. - (JBD2_FEATURE_COMPAT_CHECKSUM) - -.. _jbd2_incompat: - -The journal incompat features are any combination of the following: - -.. list-table:: - :widths: 16 64 - :header-rows: 1 - - * - Value - - Description - * - 0x1 - - Journal has block revocation records. (JBD2_FEATURE_INCOMPAT_REVOKE) - * - 0x2 - - Journal can deal with 64-bit block numbers. - (JBD2_FEATURE_INCOMPAT_64BIT) - * - 0x4 - - Journal commits asynchronously. (JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT) - * - 0x8 - - This journal uses v2 of the checksum on-disk format. Each journal - metadata block gets its own checksum, and the block tags in the - descriptor table contain checksums for each of the data blocks in the - journal. (JBD2_FEATURE_INCOMPAT_CSUM_V2) - * - 0x10 - - This journal uses v3 of the checksum on-disk format. This is the same as - v2, but the journal block tag size is fixed regardless of the size of - block numbers. (JBD2_FEATURE_INCOMPAT_CSUM_V3) - * - 0x20 - - Journal has fast commit blocks. (JBD2_FEATURE_INCOMPAT_FAST_COMMIT) - -.. _jbd2_checksum_type: - -Journal checksum type codes are one of the following.
crc32 or crc32c are the -most likely choices. - -.. list-table:: - :widths: 16 64 - :header-rows: 1 - - * - Value - - Description - * - 1 - - CRC32 - * - 2 - - MD5 - * - 3 - - SHA1 - * - 4 - - CRC32C - -Descriptor Block -~~~~~~~~~~~~~~~~ - -The descriptor block contains an array of journal block tags that -describe the final locations of the data blocks that follow in the -journal. Descriptor blocks are open-coded instead of being completely -described by a data structure, but here is the block structure anyway. -Descriptor blocks consume at least 36 bytes, but use a full block: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Type - - Name - - Descriptor - * - 0x0 - - journal_header_t - - (open coded) - - Common block header. - * - 0xC - - struct journal_block_tag_s - - open coded array[] - - Enough tags either to fill up the block or to describe all the data - blocks that follow this descriptor block. - -Journal block tags have any of the following formats, depending on which -journal feature and block tag flags are set. - -If JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the journal block tag is -defined as ``struct journal_block_tag3_s``, which looks like the -following. The size is 16 or 32 bytes. - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Type - - Name - - Descriptor - * - 0x0 - - __be32 - - t_blocknr - - Lower 32-bits of the location of where the corresponding data block - should end up on disk. - * - 0x4 - - __be32 - - t_flags - - Flags that go with the descriptor. See the table jbd2_tag_flags_ for - more info. - * - 0x8 - - __be32 - - t_blocknr_high - - Upper 32-bits of the location of where the corresponding data block - should end up on disk. This is zero if JBD2_FEATURE_INCOMPAT_64BIT is - not enabled. - * - 0xC - - __be32 - - t_checksum - - Checksum of the journal UUID, the sequence number, and the data block. - * - - - - - - - This field appears to be open coded.
It always comes at the end of the - tag, after t_checksum. This field is not present if the "same UUID" flag - is set. - * - 0x8 or 0xC - - char - - uuid[16] - - A UUID to go with this tag. This field appears to be copied from the - ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that - field. - -.. _jbd2_tag_flags: - -The journal tag flags are any combination of the following: - -.. list-table:: - :widths: 16 64 - :header-rows: 1 - - * - Value - - Description - * - 0x1 - - On-disk block is escaped. The first four bytes of the data block just - happened to match the jbd2 magic number. - * - 0x2 - - This block has the same UUID as previous, therefore the UUID field is - omitted. - * - 0x4 - - The data block was deleted by the transaction. (Not used?) - * - 0x8 - - This is the last tag in this descriptor block. - -If JBD2_FEATURE_INCOMPAT_CSUM_V3 is NOT set, the journal block tag -is defined as ``struct journal_block_tag_s``, which looks like the -following. The size is 8, 12, 24, or 28 bytes: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Type - - Name - - Descriptor - * - 0x0 - - __be32 - - t_blocknr - - Lower 32-bits of the location of where the corresponding data block - should end up on disk. - * - 0x4 - - __be16 - - t_checksum - - Checksum of the journal UUID, the sequence number, and the data block. - Note that only the lower 16 bits are stored. - * - 0x6 - - __be16 - - t_flags - - Flags that go with the descriptor. See the table jbd2_tag_flags_ for - more info. - * - - - - - - - This next field is only present if the super block indicates support for - 64-bit block numbers. - * - 0x8 - - __be32 - - t_blocknr_high - - Upper 32-bits of the location of where the corresponding data block - should end up on disk. - * - - - - - - - This field appears to be open coded. It always comes at the end of the - tag, after t_flags or t_blocknr_high. This field is not present if the - "same UUID" flag is set.
- * - 0x8 or 0xC - - char - - uuid[16] - - A UUID to go with this tag. This field appears to be copied from the - ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that - field. - -If JBD2_FEATURE_INCOMPAT_CSUM_V2 or -JBD2_FEATURE_INCOMPAT_CSUM_V3 are set, the end of the block is a -``struct jbd2_journal_block_tail``, which looks like this: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Type - - Name - - Descriptor - * - 0x0 - - __be32 - - t_checksum - - Checksum of the journal UUID + the descriptor block, with this field set - to zero. - -Data Block -~~~~~~~~~~ - -In general, the data blocks being written to disk through the journal -are written verbatim into the journal file after the descriptor block. -However, if the first four bytes of the block match the jbd2 magic -number then those four bytes are replaced with zeroes and the “escaped” -flag is set in the descriptor block tag. - -Revocation Block -~~~~~~~~~~~~~~~~ - -A revocation block is used to prevent replay of a block in an earlier -transaction. This is used to mark blocks that were journalled at one -time but are no longer journalled. Typically this happens if a metadata -block is freed and re-allocated as a file data block; in this case, a -journal replay after the file block was written to disk will cause -corruption. - -**NOTE**: This mechanism is NOT used to express “this journal block is -superseded by this other journal block”, as the author (djwong) -mistakenly thought. Any block being added to a transaction will cause -the removal of all existing revocation records for that block. - -Revocation blocks are described in -``struct jbd2_journal_revoke_header_s``, are at least 16 bytes in -length, but use a full block: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Type - - Name - - Description - * - 0x0 - - journal_header_t - - r_header - - Common block header.
- * - 0xC - - __be32 - - r_count - - Number of bytes used in this block. - * - 0x10 - - __be32 or __be64 - - blocks[0] - - Blocks to revoke. - -After r_count is a linear array of block numbers that are effectively -revoked by this transaction. The size of each block number is 8 bytes if -the superblock advertises 64-bit block number support, or 4 bytes -otherwise. - -If JBD2_FEATURE_INCOMPAT_CSUM_V2 or -JBD2_FEATURE_INCOMPAT_CSUM_V3 are set, the end of the revocation -block is a ``struct jbd2_journal_revoke_tail``, which has this format: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Type - - Name - - Description - * - 0x0 - - __be32 - - r_checksum - - Checksum of the journal UUID + revocation block - -Commit Block -~~~~~~~~~~~~ - -The commit block is a sentry that indicates that a transaction has been -completely written to the journal. Once this commit block reaches the -journal, the data stored with this transaction can be written to their -final locations on disk. - -The commit block is described by ``struct commit_header``, which is 32 -bytes long (but uses a full block): - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Type - - Name - - Descriptor - * - 0x0 - - journal_header_s - - (open coded) - - Common block header. - * - 0xC - - unsigned char - - h_chksum_type - - The type of checksum to use to verify the integrity of the data blocks - in the transaction. See jbd2_checksum_type_ for more info. - * - 0xD - - unsigned char - - h_chksum_size - - The number of bytes used by the checksum. Most likely 4. - * - 0xE - - unsigned char - - h_padding[2] - - - * - 0x10 - - __be32 - - h_chksum[JBD2_CHECKSUM_BYTES] - - 32 bytes of space to store checksums. If - JBD2_FEATURE_INCOMPAT_CSUM_V2 or JBD2_FEATURE_INCOMPAT_CSUM_V3 - are set, the first ``__be32`` is the checksum of the journal UUID and - the entire commit block, with this field zeroed.
If - JBD2_FEATURE_COMPAT_CHECKSUM is set, the first ``__be32`` is the - crc32 of all the blocks already written to the transaction. - * - 0x30 - - __be64 - - h_commit_sec - - The time that the transaction was committed, in seconds since the epoch. - * - 0x38 - - __be32 - - h_commit_nsec - - Nanoseconds component of the above timestamp. - -Fast commits -~~~~~~~~~~~~ - -Fast commit area is organized as a log of tag length values. Each TLV has -a ``struct ext4_fc_tl`` in the beginning which stores the tag and the length -of the entire field. It is followed by variable length tag specific value. -Here is the list of supported tags and their meanings: - -.. list-table:: - :widths: 8 20 20 32 - :header-rows: 1 - - * - Tag - - Meaning - - Value struct - - Description - * - EXT4_FC_TAG_HEAD - - Fast commit area header - - ``struct ext4_fc_head`` - - Stores the TID of the transaction after which these fast commits should - be applied. - * - EXT4_FC_TAG_ADD_RANGE - - Add extent to inode - - ``struct ext4_fc_add_range`` - - Stores the inode number and extent to be added in this inode - * - EXT4_FC_TAG_DEL_RANGE - - Remove logical offsets to inode - - ``struct ext4_fc_del_range`` - - Stores the inode number and the logical offset range that needs to be - removed - * - EXT4_FC_TAG_CREAT - - Create directory entry for a newly created file - - ``struct ext4_fc_dentry_info`` - - Stores the parent inode number, inode number and directory entry of the - newly created file - * - EXT4_FC_TAG_LINK - - Link a directory entry to an inode - - ``struct ext4_fc_dentry_info`` - - Stores the parent inode number, inode number and directory entry - * - EXT4_FC_TAG_UNLINK - - Unlink a directory entry of an inode - - ``struct ext4_fc_dentry_info`` - - Stores the parent inode number, inode number and directory entry - - * - EXT4_FC_TAG_PAD - - Padding (unused area) - - None - - Unused bytes in the fast commit area.
- - * - EXT4_FC_TAG_TAIL - - Mark the end of a fast commit - - ``struct ext4_fc_tail`` - - Stores the TID of the commit, CRC of the fast commit of which this tag - represents the end of - -Fast Commit Replay Idempotence -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Fast commits tags are idempotent in nature provided the recovery code follows -certain rules. The guiding principle that the commit path follows while -committing is that it stores the result of a particular operation instead of -storing the procedure. - -Let's consider this rename operation: 'mv /a /b'. Let's assume dirent '/a' -was associated with inode 10. During fast commit, instead of storing this -operation as a procedure "rename a to b", we store the resulting file system -state as a "series" of outcomes: - -- Link dirent b to inode 10 -- Unlink dirent a -- Inode 10 with valid refcount - -Now when recovery code runs, it needs "enforce" this state on the file -system. This is what guarantees idempotence of fast commit replay. - -Let's take an example of a procedure that is not idempotent and see how fast -commits make it idempotent. Consider following sequence of operations: - -1) rm A -2) mv B A -3) read A - -If we store this sequence of operations as is then the replay is not idempotent. -Let's say while in replay, we crash after (2). During the second replay, -file A (which was actually created as a result of "mv B A" operation) would get -deleted. Thus, file named A would be absent when we try to read A. So, this -sequence of operations is not idempotent. However, as mentioned above, instead -of storing the procedure fast commits store the outcome of each procedure. Thus -the fast commit log for above procedure would be as follows: - -(Let's assume dirent A was linked to inode 10 and dirent B was linked to -inode 11 before the replay) - -1) Unlink A -2) Link A to inode 11 -3) Unlink B -4) Inode 11 - -If we crash after (3) we will have file A linked to inode 11.
During the second -replay, we will remove file A (inode 11). But we will create it back and make -it point to inode 11. We won't find B, so we'll just skip that step. At this -point, the refcount for inode 11 is not reliable, but that gets fixed by the -replay of last inode 11 tag. Thus, by converting a non-idempotent procedure -into a series of idempotent outcomes, fast commits ensured idempotence during -the replay. - -Journal Checkpoint -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Checkpointing the journal ensures all transactions and their associated buffers -are submitted to the disk. In-progress transactions are waited upon and included -in the checkpoint. Checkpointing is used internally during critical updates to -the filesystem including journal recovery, filesystem resizing, and freeing of -the journal_t structure. - -A journal checkpoint can be triggered from userspace via the ioctl -EXT4_IOC_CHECKPOINT. This ioctl takes a single, u64 argument for flags. -Currently, three flags are supported. First, EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN -can be used to verify input to the ioctl. It returns error if there is any -invalid input, otherwise it returns success without performing -any checkpointing. This can be used to check whether the ioctl exists on a -system and to verify there are no issues with arguments or flags. The -other two flags are EXT4_IOC_CHECKPOINT_FLAG_DISCARD and -EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT. These flags cause the journal blocks to be -discarded or zero-filled, respectively, after the journal checkpoint is -complete. EXT4_IOC_CHECKPOINT_FLAG_DISCARD and EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT -cannot both be set. The ioctl may be useful when snapshotting a system or for -complying with content deletion SLOs.
diff --git a/Documentation/filesystems/ext4/mmp.rst b/Documentation/filesystems/ext4/mmp.rst deleted file mode 100644 index 174dd6538737d8..00000000000000 --- a/Documentation/filesystems/ext4/mmp.rst +++ /dev/null @@ -1,77 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Multiple Mount Protection -------------------------- - -Multiple mount protection (MMP) is a feature that protects the -filesystem against multiple hosts trying to use the filesystem -simultaneously. When a filesystem is opened (for mounting, or fsck, -etc.), the MMP code running on the node (call it node A) checks a -sequence number. If the sequence number is EXT4_MMP_SEQ_CLEAN, the -open continues. If the sequence number is EXT4_MMP_SEQ_FSCK, then -fsck is (hopefully) running, and open fails immediately. Otherwise, the -open code will wait for twice the specified MMP check interval and check -the sequence number again. If the sequence number has changed, then the -filesystem is active on another machine and the open fails. If the MMP -code passes all of those checks, a new MMP sequence number is generated -and written to the MMP block, and the mount proceeds. - -While the filesystem is live, the kernel sets up a timer to re-check the -MMP block at the specified MMP check interval. To perform the re-check, -the MMP sequence number is re-read; if it does not match the in-memory -MMP sequence number, then another node (node B) has mounted the -filesystem, and node A remounts the filesystem read-only. If the -sequence numbers match, the sequence number is incremented both in -memory and on disk, and the re-check is complete. - -The hostname and device filename are written into the MMP block whenever -an open operation succeeds. The MMP code does not use these values; they -are provided purely for informational purposes. - -The checksum is calculated against the FS UUID and the MMP structure. -The MMP structure (``struct mmp_struct``) is as follows: - -..
list-table:: - :widths: 8 12 20 40 - :header-rows: 1 - - * - Offset - - Type - - Name - - Description - * - 0x0 - - __le32 - - mmp_magic - - Magic number for MMP, 0x004D4D50 (“MMP”). - * - 0x4 - - __le32 - - mmp_seq - - Sequence number, updated periodically. - * - 0x8 - - __le64 - - mmp_time - - Time that the MMP block was last updated. - * - 0x10 - - char[64] - - mmp_nodename - - Hostname of the node that opened the filesystem. - * - 0x50 - - char[32] - - mmp_bdevname - - Block device name of the filesystem. - * - 0x70 - - __le16 - - mmp_check_interval - - The MMP re-check interval, in seconds. - * - 0x72 - - __le16 - - mmp_pad1 - - Zero. - * - 0x74 - - __le32[226] - - mmp_pad2 - - Zero. - * - 0x3FC - - __le32 - - mmp_checksum - - Checksum of the MMP block. diff --git a/Documentation/filesystems/ext4/orphan.rst b/Documentation/filesystems/ext4/orphan.rst deleted file mode 100644 index 03cca178864bb0..00000000000000 --- a/Documentation/filesystems/ext4/orphan.rst +++ /dev/null @@ -1,42 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Orphan file ------------ - -In unix there can inodes that are unlinked from directory hierarchy but that -are still alive because they are open. In case of crash the filesystem has to -clean up these inodes as otherwise they (and the blocks referenced from them) -would leak. Similarly if we truncate or extend the file, we need not be able -to perform the operation in a single journalling transaction. In such case we -track the inode as orphan so that in case of crash extra blocks allocated to -the file get truncated. - -Traditionally ext4 tracks orphan inodes in a form of single linked list where -superblock contains the inode number of the last orphan inode (s_last_orphan -field) and then each inode contains inode number of the previously orphaned -inode (we overload i_dtime inode field for this).
However this filesystem -global single linked list is a scalability bottleneck for workloads that result -in heavy creation of orphan inodes. When orphan file feature -(COMPAT_ORPHAN_FILE) is enabled, the filesystem has a special inode -(referenced from the superblock through s_orphan_file_inum) with several -blocks. Each of these blocks has a structure: - -============= ================ =============== =============================== -Offset Type Name Description -============= ================ =============== =============================== -0x0 Array of Orphan inode Each __le32 entry is either - __le32 entries entries empty (0) or it contains - inode number of an orphan - inode. -blocksize-8 __le32 ob_magic Magic value stored in orphan - block tail (0x0b10ca04) -blocksize-4 __le32 ob_checksum Checksum of the orphan block. -============= ================ =============== =============================== - -When a filesystem with orphan file feature is writeably mounted, we set -RO_COMPAT_ORPHAN_PRESENT feature in the superblock to indicate there may -be valid orphan entries. In case we see this feature when mounting the -filesystem, we read the whole orphan file and process all orphan inodes found -there as usual. When cleanly unmounting the filesystem we remove the -RO_COMPAT_ORPHAN_PRESENT feature to avoid unnecessary scanning of the orphan -file and also make the filesystem fully compatible with older kernels.
diff --git a/Documentation/filesystems/ext4/super.rst b/Documentation/filesystems/ext4/super.rst deleted file mode 100644 index 1b240661bfa306..00000000000000 --- a/Documentation/filesystems/ext4/super.rst +++ /dev/null @@ -1,839 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Super Block ------------ - -The superblock records various information about the enclosing -filesystem, such as block counts, inode counts, supported features, -maintenance information, and more. - -If the sparse_super feature flag is set, redundant copies of the -superblock and group descriptors are kept only in the groups whose group -number is either 0 or a power of 3, 5, or 7. If the flag is not set, -redundant copies are kept in all groups. - -The superblock checksum is calculated against the superblock structure, -which includes the FS UUID. - -The ext4 superblock is laid out as follows in -``struct ext4_super_block``: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - __le32 - - s_inodes_count - - Total inode count. - * - 0x4 - - __le32 - - s_blocks_count_lo - - Total block count. - * - 0x8 - - __le32 - - s_r_blocks_count_lo - - This number of blocks can only be allocated by the super-user. - * - 0xC - - __le32 - - s_free_blocks_count_lo - - Free block count. - * - 0x10 - - __le32 - - s_free_inodes_count - - Free inode count. - * - 0x14 - - __le32 - - s_first_data_block - - First data block. This must be at least 1 for 1k-block filesystems and - is typically 0 for all other block sizes. - * - 0x18 - - __le32 - - s_log_block_size - - Block size is 2 ^ (10 + s_log_block_size). - * - 0x1C - - __le32 - - s_log_cluster_size - - Cluster size is 2 ^ (10 + s_log_cluster_size) blocks if bigalloc is - enabled. Otherwise s_log_cluster_size must equal s_log_block_size. - * - 0x20 - - __le32 - - s_blocks_per_group - - Blocks per group.
- * - 0x24 - - __le32 - - s_clusters_per_group - - Clusters per group, if bigalloc is enabled. Otherwise - s_clusters_per_group must equal s_blocks_per_group. - * - 0x28 - - __le32 - - s_inodes_per_group - - Inodes per group. - * - 0x2C - - __le32 - - s_mtime - - Mount time, in seconds since the epoch. - * - 0x30 - - __le32 - - s_wtime - - Write time, in seconds since the epoch. - * - 0x34 - - __le16 - - s_mnt_count - - Number of mounts since the last fsck. - * - 0x36 - - __le16 - - s_max_mnt_count - - Number of mounts beyond which a fsck is needed. - * - 0x38 - - __le16 - - s_magic - - Magic signature, 0xEF53 - * - 0x3A - - __le16 - - s_state - - File system state. See super_state_ for more info. - * - 0x3C - - __le16 - - s_errors - - Behaviour when detecting errors. See super_errors_ for more info. - * - 0x3E - - __le16 - - s_minor_rev_level - - Minor revision level. - * - 0x40 - - __le32 - - s_lastcheck - - Time of last check, in seconds since the epoch. - * - 0x44 - - __le32 - - s_checkinterval - - Maximum time between checks, in seconds. - * - 0x48 - - __le32 - - s_creator_os - - Creator OS. See the table super_creator_ for more info. - * - 0x4C - - __le32 - - s_rev_level - - Revision level. See the table super_revision_ for more info. - * - 0x50 - - __le16 - - s_def_resuid - - Default uid for reserved blocks. - * - 0x52 - - __le16 - - s_def_resgid - - Default gid for reserved blocks. - * - - - - - - - These fields are for EXT4_DYNAMIC_REV superblocks only. - - Note: the difference between the compatible feature set and the - incompatible feature set is that if there is a bit set in the - incompatible feature set that the kernel doesn't know about, it should - refuse to mount the filesystem. - - e2fsck's requirements are more strict; if it doesn't know - about a feature in either the compatible or incompatible feature set, it - must abort and not try to meddle with things it doesn't understand...
- * - 0x54 - - __le32 - - s_first_ino - - First non-reserved inode. - * - 0x58 - - __le16 - - s_inode_size - - Size of inode structure, in bytes. - * - 0x5A - - __le16 - - s_block_group_nr - - Block group # of this superblock. - * - 0x5C - - __le32 - - s_feature_compat - - Compatible feature set flags. Kernel can still read/write this fs even - if it doesn't understand a flag; fsck should not do that. See the - super_compat_ table for more info. - * - 0x60 - - __le32 - - s_feature_incompat - - Incompatible feature set. If the kernel or fsck doesn't understand one - of these bits, it should stop. See the super_incompat_ table for more - info. - * - 0x64 - - __le32 - - s_feature_ro_compat - - Readonly-compatible feature set. If the kernel doesn't understand one of - these bits, it can still mount read-only. See the super_rocompat_ table - for more info. - * - 0x68 - - __u8 - - s_uuid[16] - - 128-bit UUID for volume. - * - 0x78 - - char - - s_volume_name[16] - - Volume label. - * - 0x88 - - char - - s_last_mounted[64] - - Directory where filesystem was last mounted. - * - 0xC8 - - __le32 - - s_algorithm_usage_bitmap - - For compression (Not used in e2fsprogs/Linux) - * - - - - - - - Performance hints. Directory preallocation should only happen if the - EXT4_FEATURE_COMPAT_DIR_PREALLOC flag is on. - * - 0xCC - - __u8 - - s_prealloc_blocks - - #. of blocks to try to preallocate for ... files? (Not used in - e2fsprogs/Linux) - * - 0xCD - - __u8 - - s_prealloc_dir_blocks - - #. of blocks to preallocate for directories. (Not used in - e2fsprogs/Linux) - * - 0xCE - - __le16 - - s_reserved_gdt_blocks - - Number of reserved GDT entries for future filesystem expansion. - * - - - - - - - Journalling support is valid only if EXT4_FEATURE_COMPAT_HAS_JOURNAL is - set. - * - 0xD0 - - __u8 - - s_journal_uuid[16] - - UUID of journal superblock - * - 0xE0 - - __le32 - - s_journal_inum - - inode number of journal file.
- * - 0xE4 - - __le32 - - s_journal_dev - - Device number of journal file, if the external journal feature flag is - set. - * - 0xE8 - - __le32 - - s_last_orphan - - Start of list of orphaned inodes to delete. - * - 0xEC - - __le32 - - s_hash_seed[4] - - HTREE hash seed. - * - 0xFC - - __u8 - - s_def_hash_version - - Default hash algorithm to use for directory hashes. See super_def_hash_ - for more info. - * - 0xFD - - __u8 - - s_jnl_backup_type - - If this value is 0 or EXT3_JNL_BACKUP_BLOCKS (1), then the - ``s_jnl_blocks`` field contains a duplicate copy of the inode's - ``i_block[]`` array and ``i_size``. - * - 0xFE - - __le16 - - s_desc_size - - Size of group descriptors, in bytes, if the 64bit incompat feature flag - is set. - * - 0x100 - - __le32 - - s_default_mount_opts - - Default mount options. See the super_mountopts_ table for more info. - * - 0x104 - - __le32 - - s_first_meta_bg - - First metablock block group, if the meta_bg feature is enabled. - * - 0x108 - - __le32 - - s_mkfs_time - - When the filesystem was created, in seconds since the epoch. - * - 0x10C - - __le32 - - s_jnl_blocks[17] - - Backup copy of the journal inode's ``i_block[]`` array in the first 15 - elements and i_size_high and i_size in the 16th and 17th elements, - respectively. - * - - - - - - - 64bit support is valid only if EXT4_FEATURE_COMPAT_64BIT is set. - * - 0x150 - - __le32 - - s_blocks_count_hi - - High 32-bits of the block count. - * - 0x154 - - __le32 - - s_r_blocks_count_hi - - High 32-bits of the reserved block count. - * - 0x158 - - __le32 - - s_free_blocks_count_hi - - High 32-bits of the free block count. - * - 0x15C - - __le16 - - s_min_extra_isize - - All inodes have at least # bytes. - * - 0x15E - - __le16 - - s_want_extra_isize - - New inodes should reserve # bytes. - * - 0x160 - - __le32 - - s_flags - - Miscellaneous flags. See the super_flags_ table for more info. - * - 0x164 - - __le16 - - s_raid_stride - - RAID stride.
This is the number of logical blocks read from or written - to the disk before moving to the next disk. This affects the placement - of filesystem metadata, which will hopefully make RAID storage faster. - * - 0x166 - - __le16 - - s_mmp_interval - - #. seconds to wait in multi-mount prevention (MMP) checking. In theory, - MMP is a mechanism to record in the superblock which host and device - have mounted the filesystem, in order to prevent multiple mounts. This - feature does not seem to be implemented... - * - 0x168 - - __le64 - - s_mmp_block - - Block # for multi-mount protection data. - * - 0x170 - - __le32 - - s_raid_stripe_width - - RAID stripe width. This is the number of logical blocks read from or - written to the disk before coming back to the current disk. This is used - by the block allocator to try to reduce the number of read-modify-write - operations in a RAID5/6. - * - 0x174 - - __u8 - - s_log_groups_per_flex - - Size of a flexible block group is 2 ^ ``s_log_groups_per_flex``. - * - 0x175 - - __u8 - - s_checksum_type - - Metadata checksum algorithm type. The only valid value is 1 (crc32c). - * - 0x176 - - \_\_u8 - - s\_encryption\_level - - Versioning level for encryption. - * - 0x177 - - \_\_u8 - - s\_reserved\_pad - - Padding to next 32bits. - * - 0x178 - - __le64 - - s_kbytes_written - - Number of KiB written to this filesystem over its lifetime. - * - 0x180 - - __le32 - - s_snapshot_inum - - inode number of active snapshot. (Not used in e2fsprogs/Linux.) - * - 0x184 - - __le32 - - s_snapshot_id - - Sequential ID of active snapshot. (Not used in e2fsprogs/Linux.) - * - 0x188 - - __le64 - - s_snapshot_r_blocks_count - - Number of blocks reserved for active snapshot's future use. (Not used in - e2fsprogs/Linux.) - * - 0x190 - - __le32 - - s_snapshot_list - - inode number of the head of the on-disk snapshot list. (Not used in - e2fsprogs/Linux.) - * - 0x194 - - __le32 - - s_error_count - - Number of errors seen.
- * - 0x198 - - __le32 - - s_first_error_time - - First time an error happened, in seconds since the epoch. - * - 0x19C - - __le32 - - s_first_error_ino - - inode involved in first error. - * - 0x1A0 - - __le64 - - s_first_error_block - - Number of block involved of first error. - * - 0x1A8 - - __u8 - - s_first_error_func[32] - - Name of function where the error happened. - * - 0x1C8 - - __le32 - - s_first_error_line - - Line number where error happened. - * - 0x1CC - - __le32 - - s_last_error_time - - Time of most recent error, in seconds since the epoch. - * - 0x1D0 - - __le32 - - s_last_error_ino - - inode involved in most recent error. - * - 0x1D4 - - __le32 - - s_last_error_line - - Line number where most recent error happened. - * - 0x1D8 - - __le64 - - s_last_error_block - - Number of block involved in most recent error. - * - 0x1E0 - - __u8 - - s_last_error_func[32] - - Name of function where the most recent error happened. - * - 0x200 - - __u8 - - s_mount_opts[64] - - ASCIIZ string of mount options. - * - 0x240 - - __le32 - - s_usr_quota_inum - - Inode number of user `quota `__ file. - * - 0x244 - - __le32 - - s_grp_quota_inum - - Inode number of group `quota `__ file. - * - 0x248 - - __le32 - - s_overhead_blocks - - Overhead blocks/clusters in fs. (Huh? This field is always zero, which - means that the kernel calculates it dynamically.) - * - 0x24C - - __le32 - - s_backup_bgs[2] - - Block groups containing superblock backups (if sparse_super2) - * - 0x254 - - __u8 - - s_encrypt_algos[4] - - Encryption algorithms in use. There can be up to four algorithms in use - at any time; valid algorithm codes are given in the super_encrypt_ table - below. - * - 0x258 - - __u8 - - s_encrypt_pw_salt[16] - - Salt for the string2key algorithm for encryption. - * - 0x268 - - __le32 - - s_lpf_ino - - Inode number of lost+found - * - 0x26C - - __le32 - - s_prj_quota_inum - - Inode that tracks project quotas.
- * - 0x270 - - __le32 - - s_checksum_seed - - Checksum seed used for metadata_csum calculations. This value is - crc32c(~0, $orig_fs_uuid). - * - 0x274 - - __u8 - - s_wtime_hi - - Upper 8 bits of the s_wtime field. - * - 0x275 - - __u8 - - s_mtime_hi - - Upper 8 bits of the s_mtime field. - * - 0x276 - - __u8 - - s_mkfs_time_hi - - Upper 8 bits of the s_mkfs_time field. - * - 0x277 - - __u8 - - s_lastcheck_hi - - Upper 8 bits of the s_lastcheck field. - * - 0x278 - - __u8 - - s_first_error_time_hi - - Upper 8 bits of the s_first_error_time field. - * - 0x279 - - __u8 - - s_last_error_time_hi - - Upper 8 bits of the s_last_error_time field. - * - 0x27A - - \_\_u8 - - s\_first\_error\_errcode - - - * - 0x27B - - \_\_u8 - - s\_last\_error\_errcode - - - * - 0x27C - - __le16 - - s_encoding - - Filename charset encoding. - * - 0x27E - - __le16 - - s_encoding_flags - - Filename charset encoding flags. - * - 0x280 - - __le32 - - s_orphan_file_inum - - Orphan file inode number. - * - 0x284 - - __le32 - - s_reserved[94] - - Padding to the end of the block. - * - 0x3FC - - __le32 - - s_checksum - - Superblock checksum. - -.. _super_state: - -The superblock state is some combination of the following: - -.. list-table:: - :widths: 8 72 - :header-rows: 1 - - * - Value - - Description - * - 0x0001 - - Cleanly umounted - * - 0x0002 - - Errors detected - * - 0x0004 - - Orphans being recovered - -.. _super_errors: - -The superblock error policy is one of the following: - -.. list-table:: - :widths: 8 72 - :header-rows: 1 - - * - Value - - Description - * - 1 - - Continue - * - 2 - - Remount read-only - * - 3 - - Panic - -.. _super_creator: - -The filesystem creator is one of the following: - -.. list-table:: - :widths: 8 72 - :header-rows: 1 - - * - Value - - Description - * - 0 - - Linux - * - 1 - - Hurd - * - 2 - - Masix - * - 3 - - FreeBSD - * - 4 - - Lites - -.. _super_revision: - -The superblock revision is one of the following: - -.. 
list-table:: - :widths: 8 72 - :header-rows: 1 - - * - Value - - Description - * - 0 - - Original format - * - 1 - - v2 format w/ dynamic inode sizes - -Note that ``EXT4_DYNAMIC_REV`` refers to a revision 1 or newer filesystem. - -.. _super_compat: - -The superblock compatible features field is a combination of any of the -following: - -.. list-table:: - :widths: 16 64 - :header-rows: 1 - - * - Value - - Description - * - 0x1 - - Directory preallocation (COMPAT_DIR_PREALLOC). - * - 0x2 - - “imagic inodes”. Not clear from the code what this does - (COMPAT_IMAGIC_INODES). - * - 0x4 - - Has a journal (COMPAT_HAS_JOURNAL). - * - 0x8 - - Supports extended attributes (COMPAT_EXT_ATTR). - * - 0x10 - - Has reserved GDT blocks for filesystem expansion - (COMPAT_RESIZE_INODE). Requires RO_COMPAT_SPARSE_SUPER. - * - 0x20 - - Has directory indices (COMPAT_DIR_INDEX). - * - 0x40 - - “Lazy BG”. Not in Linux kernel, seems to have been for uninitialized - block groups? (COMPAT_LAZY_BG) - * - 0x80 - - “Exclude inode”. Not used. (COMPAT_EXCLUDE_INODE). - * - 0x100 - - “Exclude bitmap”. Seems to be used to indicate the presence of - snapshot-related exclude bitmaps? Not defined in kernel or used in - e2fsprogs (COMPAT_EXCLUDE_BITMAP). - * - 0x200 - - Sparse Super Block, v2. If this flag is set, the SB field s_backup_bgs - points to the two block groups that contain backup superblocks - (COMPAT_SPARSE_SUPER2). - * - 0x400 - - Fast commits supported. Although fast commits blocks are - backward incompatible, fast commit blocks are not always - present in the journal. If fast commit blocks are present in - the journal, JBD2 incompat feature - (JBD2_FEATURE_INCOMPAT_FAST_COMMIT) gets - set (COMPAT_FAST_COMMIT). - * - 0x1000 - - Orphan file allocated. This is the special file for more efficient - tracking of unlinked but still open inodes.
When there may be any - entries in the file, we additionally set proper rocompat feature - (RO_COMPAT_ORPHAN_PRESENT). - -.. _super_incompat: - -The superblock incompatible features field is a combination of any of the -following: - -.. list-table:: - :widths: 16 64 - :header-rows: 1 - - * - Value - - Description - * - 0x1 - - Compression (INCOMPAT_COMPRESSION). - * - 0x2 - - Directory entries record the file type. See ext4_dir_entry_2 below - (INCOMPAT_FILETYPE). - * - 0x4 - - Filesystem needs recovery (INCOMPAT_RECOVER). - * - 0x8 - - Filesystem has a separate journal device (INCOMPAT_JOURNAL_DEV). - * - 0x10 - - Meta block groups. See the earlier discussion of this feature - (INCOMPAT_META_BG). - * - 0x40 - - Files in this filesystem use extents (INCOMPAT_EXTENTS). - * - 0x80 - - Enable a filesystem size of 2^64 blocks (INCOMPAT_64BIT). - * - 0x100 - - Multiple mount protection (INCOMPAT_MMP). - * - 0x200 - - Flexible block groups. See the earlier discussion of this feature - (INCOMPAT_FLEX_BG). - * - 0x400 - - Inodes can be used to store large extended attribute values - (INCOMPAT_EA_INODE). - * - 0x1000 - - Data in directory entry (INCOMPAT_DIRDATA). (Not implemented?) - * - 0x2000 - - Metadata checksum seed is stored in the superblock. This feature enables - the administrator to change the UUID of a metadata_csum filesystem - while the filesystem is mounted; without it, the checksum definition - requires all metadata blocks to be rewritten (INCOMPAT_CSUM_SEED). - * - 0x4000 - - Large directory >2GB or 3-level htree (INCOMPAT_LARGEDIR). Prior to - this feature, directories could not be larger than 4GiB and could not - have an htree more than 2 levels deep. If this feature is enabled, - directories can be larger than 4GiB and have a maximum htree depth of 3. - * - 0x8000 - - Data in inode (INCOMPAT_INLINE_DATA). - * - 0x10000 - - Encrypted inodes are present on the filesystem. (INCOMPAT_ENCRYPT). - -..
_super_rocompat: - -The superblock read-only compatible features field is a combination of any of -the following: - -.. list-table:: - :widths: 16 64 - :header-rows: 1 - - * - Value - - Description - * - 0x1 - - Sparse superblocks. See the earlier discussion of this feature - (RO_COMPAT_SPARSE_SUPER). - * - 0x2 - - This filesystem has been used to store a file greater than 2GiB - (RO_COMPAT_LARGE_FILE). - * - 0x4 - - Not used in kernel or e2fsprogs (RO_COMPAT_BTREE_DIR). - * - 0x8 - - This filesystem has files whose sizes are represented in units of - logical blocks, not 512-byte sectors. This implies a very large file - indeed! (RO_COMPAT_HUGE_FILE) - * - 0x10 - - Group descriptors have checksums. In addition to detecting corruption, - this is useful for lazy formatting with uninitialized groups - (RO_COMPAT_GDT_CSUM). - * - 0x20 - - Indicates that the old ext3 32,000 subdirectory limit no longer applies - (RO_COMPAT_DIR_NLINK). A directory's i_links_count will be set to 1 - if it is incremented past 64,999. - * - 0x40 - - Indicates that large inodes exist on this filesystem - (RO_COMPAT_EXTRA_ISIZE). - * - 0x80 - - This filesystem has a snapshot (RO_COMPAT_HAS_SNAPSHOT). - * - 0x100 - - `Quota `__ (RO_COMPAT_QUOTA). - * - 0x200 - - This filesystem supports “bigalloc”, which means that file extents are - tracked in units of clusters (of blocks) instead of blocks - (RO_COMPAT_BIGALLOC). - * - 0x400 - - This filesystem supports metadata checksumming. - (RO_COMPAT_METADATA_CSUM; implies RO_COMPAT_GDT_CSUM, though - GDT_CSUM must not be set) - * - 0x800 - - Filesystem supports replicas. This feature is neither in the kernel nor - e2fsprogs. (RO_COMPAT_REPLICA) - * - 0x1000 - - Read-only filesystem image; the kernel will not mount this image - read-write and most tools will refuse to write to the image. - (RO_COMPAT_READONLY) - * - 0x2000 - - Filesystem tracks project quotas.
(RO_COMPAT_PROJECT) - * - 0x8000 - - Verity inodes may be present on the filesystem. (RO_COMPAT_VERITY) - * - 0x10000 - - Indicates orphan file may have valid orphan entries and thus we need - to clean them up when mounting the filesystem - (RO_COMPAT_ORPHAN_PRESENT). - -.. _super_def_hash: - -The ``s_def_hash_version`` field is one of the following: - -.. list-table:: - :widths: 8 72 - :header-rows: 1 - - * - Value - - Description - * - 0x0 - - Legacy. - * - 0x1 - - Half MD4. - * - 0x2 - - Tea. - * - 0x3 - - Legacy, unsigned. - * - 0x4 - - Half MD4, unsigned. - * - 0x5 - - Tea, unsigned. - -.. _super_mountopts: - -The ``s_default_mount_opts`` field is any combination of the following: - -.. list-table:: - :widths: 8 72 - :header-rows: 1 - - * - Value - - Description - * - 0x0001 - - Print debugging info upon (re)mount. (EXT4_DEFM_DEBUG) - * - 0x0002 - - New files take the gid of the containing directory (instead of the = fsgid - of the current process). (EXT4_DEFM_BSDGROUPS) - * - 0x0004 - - Support userspace-provided extended attributes. (EXT4_DEFM_XATTR_US= ER) - * - 0x0008 - - Support POSIX access control lists (ACLs). (EXT4_DEFM_ACL) - * - 0x0010 - - Do not support 32-bit UIDs. (EXT4_DEFM_UID16) - * - 0x0020 - - All data and metadata are committed to the journal. - (EXT4_DEFM_JMODE_DATA) - * - 0x0040 - - All data are flushed to the disk before metadata are committed to t= he - journal. (EXT4_DEFM_JMODE_ORDERED) - * - 0x0060 - - Data ordering is not preserved; data may be written after the metad= ata - has been written. (EXT4_DEFM_JMODE_WBACK) - * - 0x0100 - - Disable write flushes. (EXT4_DEFM_NOBARRIER) - * - 0x0200 - - Track which blocks in a filesystem are metadata and therefore shoul= d not - be used as data blocks. This option will be enabled by default on 3= .18, - hopefully. (EXT4_DEFM_BLOCK_VALIDITY) - * - 0x0400 - - Enable DISCARD support, where the storage device is told about bloc= ks - becoming unused. 
(EXT4_DEFM_DISCARD) - * - 0x0800 - - Disable delayed allocation. (EXT4_DEFM_NODELALLOC) - -.. _super_flags: - -The ``s_flags`` field is any combination of the following: - -.. list-table:: - :widths: 8 72 - :header-rows: 1 - - * - Value - - Description - * - 0x0001 - - Signed directory hash in use. - * - 0x0002 - - Unsigned directory hash in use. - * - 0x0004 - - To test development code. - -.. _super_encrypt: - -The ``s_encrypt_algos`` list can contain any of the following: - -.. list-table:: - :widths: 8 72 - :header-rows: 1 - - * - Value - - Description - * - 0 - - Invalid algorithm (ENCRYPTION_MODE_INVALID). - * - 1 - - 256-bit AES in XTS mode (ENCRYPTION_MODE_AES_256_XTS). - * - 2 - - 256-bit AES in GCM mode (ENCRYPTION_MODE_AES_256_GCM). - * - 3 - - 256-bit AES in CBC mode (ENCRYPTION_MODE_AES_256_CBC). - -Total size of the superblock is 1024 bytes. --=20 An old man doll... just what I always wanted! - Clara From nobody Thu Oct 9 09:03:17 2025 Received: from mail-pl1-f179.google.com (mail-pl1-f179.google.com [209.85.214.179]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 27520295DBA; Wed, 18 Jun 2025 11:16:08 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.179 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750245374; cv=none; b=djcAjk1BqVt4MNqwquIZKvpb7vFN3HHwGa4EoLMwRZ4znteikDoaOOpwdcej+wZCvPGRb5LuoB49pB+zVhvoUPZhwIbGEpMt8av0oRwtm2QeFUVHhxOITuhf3fPXgT8wCAYodDyYoa131KRgH93V99p9VU1dvdILl6RByGqe7z4= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750245374; c=relaxed/simple; bh=BUu9HUdtaYWu1FAd7SnaGwZ6GdU8g8tTTCu8MsamgeM=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; 
From: Bagas Sanjaya
To: Linux Kernel Mailing List , Linux Documentation , Linux ext4
Cc: "Theodore Ts'o" , Andreas Dilger , Jonathan Corbet , "Darrick J.
Wong" , "Ritesh Harjani (IBM)" , Bagas Sanjaya
Subject: [PATCH 3/4] Documentation: ext4: Slurp included subdocs in dynamic structures docs
Date: Wed, 18 Jun 2025 18:15:36 +0700
Message-ID: <20250618111544.22602-4-bagasdotme@gmail.com>
X-Mailer: git-send-email 2.49.0
In-Reply-To: <20250618111544.22602-1-bagasdotme@gmail.com>
References: <20250618111544.22602-1-bagasdotme@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

Slurp the subdocuments for dynamic structures (dynamic.rst) by replacing each reST include:: directive with the contents of the file it includes.

Signed-off-by: Bagas Sanjaya --- Documentation/filesystems/ext4/attributes.rst | 191 --- Documentation/filesystems/ext4/directory.rst | 453 ------ Documentation/filesystems/ext4/dynamic.rst | 1415 ++++++++++++++++- Documentation/filesystems/ext4/ifork.rst | 194 --- Documentation/filesystems/ext4/inodes.rst | 578 ------- 5 files changed, 1411 insertions(+), 1420 deletions(-) delete mode 100644 Documentation/filesystems/ext4/attributes.rst delete mode 100644 Documentation/filesystems/ext4/directory.rst delete mode 100644 Documentation/filesystems/ext4/ifork.rst delete mode 100644 Documentation/filesystems/ext4/inodes.rst diff --git a/Documentation/filesystems/ext4/attributes.rst b/Documentation/= filesystems/ext4/attributes.rst deleted file mode 100644 index 87814696a65b59..00000000000000 --- a/Documentation/filesystems/ext4/attributes.rst +++ /dev/null @@ -1,191 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Extended Attributes -------------------- - -Extended attributes (xattrs) are typically stored in a separate data -block on the disk and referenced from inodes via ``inode.i_file_acl*``. -The first use of extended attributes seems to have been for storing file -ACLs and other security data (selinux). With the ``user_xattr`` mount -option it is possible for users to store extended attributes so long as -all attribute names begin with =E2=80=9Cuser=E2=80=9D; this restriction se= ems to have -disappeared as of Linux 3.0. - -There are two places where extended attributes can be found. The first -place is between the end of each inode entry and the beginning of the -next inode entry. For example, if inode.i_extra_isize =3D 28 and -sb.inode_size =3D 256, then there are 256 - (128 + 28) =3D 100 bytes -available for in-inode extended attribute storage. The second place -where extended attributes can be found is in the block pointed to by -``inode.i_file_acl``. 
As of Linux 3.11, it is not possible for this -block to contain a pointer to a second extended attribute block (or even -the remaining blocks of a cluster). In theory it is possible for each -attribute's value to be stored in a separate data block, though as of -Linux 3.11 the code does not permit this. - -Keys are generally assumed to be ASCIIZ strings, whereas values can be -strings or binary data. - -Extended attributes, when stored after the inode, have a header -``ext4_xattr_ibody_header`` that is 4 bytes long: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Type - - Name - - Description - * - 0x0 - - __le32 - - h_magic - - Magic number for identification, 0xEA020000. This value is set by t= he - Linux driver, though e2fsprogs doesn't seem to check it(?) - -The beginning of an extended attribute block is in -``struct ext4_xattr_header``, which is 32 bytes long: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Type - - Name - - Description - * - 0x0 - - __le32 - - h_magic - - Magic number for identification, 0xEA020000. - * - 0x4 - - __le32 - - h_refcount - - Reference count. - * - 0x8 - - __le32 - - h_blocks - - Number of disk blocks used. - * - 0xC - - __le32 - - h_hash - - Hash value of all attributes. - * - 0x10 - - __le32 - - h_checksum - - Checksum of the extended attribute block. - * - 0x14 - - __u32 - - h_reserved[3] - - Zero. - -The checksum is calculated against the FS UUID, the 64-bit block number -of the extended attribute block, and the entire block (header + -entries). - -Following the ``struct ext4_xattr_header`` or -``struct ext4_xattr_ibody_header`` is an array of -``struct ext4_xattr_entry``; each of these entries is at least 16 bytes -long. When stored in an external block, the ``struct ext4_xattr_entry`` -entries must be stored in sorted order. The sort order is -``e_name_index``, then ``e_name_len``, and finally ``e_name``. 
-Attributes stored inside an inode do not need be stored in sorted order. - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Type - - Name - - Description - * - 0x0 - - __u8 - - e_name_len - - Length of name. - * - 0x1 - - __u8 - - e_name_index - - Attribute name index. There is a discussion of this below. - * - 0x2 - - __le16 - - e_value_offs - - Location of this attribute's value on the disk block where it is st= ored. - Multiple attributes can share the same value. For an inode attribute - this value is relative to the start of the first entry; for a block= this - value is relative to the start of the block (i.e. the header). - * - 0x4 - - __le32 - - e_value_inum - - The inode where the value is stored. Zero indicates the value is in= the - same block as this entry. This field is only used if the - INCOMPAT_EA_INODE feature is enabled. - * - 0x8 - - __le32 - - e_value_size - - Length of attribute value. - * - 0xC - - __le32 - - e_hash - - Hash value of attribute name and attribute value. The kernel doesn't - update the hash for in-inode attributes, so for that case this value - must be zero, because e2fsck validates any non-zero hash regardless= of - where the xattr lives. - * - 0x10 - - char - - e_name[e_name_len] - - Attribute name. Does not include trailing NULL. - -Attribute values can follow the end of the entry table. There appears to -be a requirement that they be aligned to 4-byte boundaries. The values -are stored starting at the end of the block and grow towards the -xattr_header/xattr_entry table. When the two collide, the overflow is -put into a separate disk block. If the disk block fills up, the -filesystem returns -ENOSPC. - -The first four fields of the ``ext4_xattr_entry`` are set to zero to -mark the end of the key list. - -Attribute Name Indices -~~~~~~~~~~~~~~~~~~~~~~ - -Logically speaking, extended attributes are a series of key=3Dvalue pairs. -The keys are assumed to be NULL-terminated strings. 
To reduce the amount -of on-disk space that the keys consume, the beginning of the key string -is matched against the attribute name index. If a match is found, the -attribute name index field is set, and matching string is removed from -the key name. Here is a map of name index values to key prefixes: - -.. list-table:: - :widths: 16 64 - :header-rows: 1 - - * - Name Index - - Key Prefix - * - 0 - - (no prefix) - * - 1 - - =E2=80=9Cuser.=E2=80=9D - * - 2 - - =E2=80=9Csystem.posix_acl_access=E2=80=9D - * - 3 - - =E2=80=9Csystem.posix_acl_default=E2=80=9D - * - 4 - - =E2=80=9Ctrusted.=E2=80=9D - * - 6 - - =E2=80=9Csecurity.=E2=80=9D - * - 7 - - =E2=80=9Csystem.=E2=80=9D (inline_data only?) - * - 8 - - =E2=80=9Csystem.richacl=E2=80=9D (SuSE kernels only?) - -For example, if the attribute key is =E2=80=9Cuser.fubar=E2=80=9D, the att= ribute name -index is set to 1 and the =E2=80=9Cfubar=E2=80=9D name is recorded on disk. - -POSIX ACLs -~~~~~~~~~~ - -POSIX ACLs are stored in a reduced version of the Linux kernel (and -libacl's) internal ACL format. The key difference is that the version -number is different (1) and the ``e_id`` field is only stored for named -user and group ACLs. diff --git a/Documentation/filesystems/ext4/directory.rst b/Documentation/f= ilesystems/ext4/directory.rst deleted file mode 100644 index 6eece8e31df8b7..00000000000000 --- a/Documentation/filesystems/ext4/directory.rst +++ /dev/null @@ -1,453 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Directory Entries ------------------ - -In an ext4 filesystem, a directory is more or less a flat file that maps -an arbitrary byte string (usually ASCII) to an inode number on the -filesystem. There can be many directory entries across the filesystem -that reference the same inode number--these are known as hard links, and -that is why hard links cannot reference files on other filesystems. 
As -such, directory entries are found by reading the data block(s) -associated with a directory file for the particular directory entry that -is desired. - -Linear (Classic) Directories -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -By default, each directory lists its entries in an =E2=80=9Calmost-linear= =E2=80=9D -array. I write =E2=80=9Calmost=E2=80=9D because it's not a linear array in= the memory -sense because directory entries are not split across filesystem blocks. -Therefore, it is more accurate to say that a directory is a series of -data blocks and that each block contains a linear array of directory -entries. The end of each per-block array is signified by reaching the -end of the block; the last entry in the block has a record length that -takes it all the way to the end of the block. The end of the entire -directory is of course signified by reaching the end of the file. Unused -directory entries are signified by inode =3D 0. By default the filesystem -uses ``struct ext4_dir_entry_2`` for directory entries unless the -=E2=80=9Cfiletype=E2=80=9D feature flag is not set, in which case it uses -``struct ext4_dir_entry``. - -The original directory entry format is ``struct ext4_dir_entry``, which -is at most 263 bytes long, though on disk you'll need to reference -``dirent.rec_len`` to know for sure. - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - __le32 - - inode - - Number of the inode that this directory entry points to. - * - 0x4 - - __le16 - - rec_len - - Length of this directory entry. Must be a multiple of 4. - * - 0x6 - - __le16 - - name_len - - Length of the file name. - * - 0x8 - - char - - name[EXT4_NAME_LEN] - - File name. - -Since file names cannot be longer than 255 bytes, the new directory -entry format shortens the name_len field and uses the space for a file -type flag, probably to avoid having to load every inode during directory -tree traversal. 
This format is ``ext4_dir_entry_2``, which is at most -263 bytes long, though on disk you'll need to reference -``dirent.rec_len`` to know for sure. - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - __le32 - - inode - - Number of the inode that this directory entry points to. - * - 0x4 - - __le16 - - rec_len - - Length of this directory entry. - * - 0x6 - - __u8 - - name_len - - Length of the file name. - * - 0x7 - - __u8 - - file_type - - File type code, see ftype_ table below. - * - 0x8 - - char - - name[EXT4_NAME_LEN] - - File name. - -.. _ftype: - -The directory file type is one of the following values: - -.. list-table:: - :widths: 16 64 - :header-rows: 1 - - * - Value - - Description - * - 0x0 - - Unknown. - * - 0x1 - - Regular file. - * - 0x2 - - Directory. - * - 0x3 - - Character device file. - * - 0x4 - - Block device file. - * - 0x5 - - FIFO. - * - 0x6 - - Socket. - * - 0x7 - - Symbolic link. - -To support directories that are both encrypted and casefolded directories,= we -must also include hash information in the directory entry. We append -``ext4_extended_dir_entry_2`` to ``ext4_dir_entry_2`` except for the entri= es -for dot and dotdot, which are kept the same. The structure follows immedia= tely -after ``name`` and is included in the size listed by ``rec_len`` If a dire= ctory -entry uses this extension, it may be up to 271 bytes. - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - __le32 - - hash - - The hash of the directory name - * - 0x4 - - __le32 - - minor_hash - - The minor hash of the directory name - - -In order to add checksums to these classic directory blocks, a phony -``struct ext4_dir_entry`` is placed at the end of each leaf block to -hold the checksum. The directory entry is 12 bytes long. 
The inode -number and name_len fields are set to zero to fool old software into -ignoring an apparently empty directory entry, and the checksum is stored -in the place where the name normally goes. The structure is -``struct ext4_dir_entry_tail``: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - __le32 - - det_reserved_zero1 - - Inode number, which must be zero. - * - 0x4 - - __le16 - - det_rec_len - - Length of this directory entry, which must be 12. - * - 0x6 - - __u8 - - det_reserved_zero2 - - Length of the file name, which must be zero. - * - 0x7 - - __u8 - - det_reserved_ft - - File type, which must be 0xDE. - * - 0x8 - - __le32 - - det_checksum - - Directory leaf block checksum. - -The leaf directory block checksum is calculated against the FS UUID, the -directory's inode number, the directory's inode generation number, and -the entire directory entry block up to (but not including) the fake -directory entry. - -Hash Tree Directories -~~~~~~~~~~~~~~~~~~~~~ - -A linear array of directory entries isn't great for performance, so a -new feature was added to ext3 to provide a faster (but peculiar) -balanced tree keyed off a hash of the directory entry name. If the -EXT4_INDEX_FL (0x1000) flag is set in the inode, this directory uses a -hashed btree (htree) to organize and find directory entries. For -backwards read-only compatibility with ext2, this tree is actually -hidden inside the directory file, masquerading as =E2=80=9Cempty=E2=80=9D = directory data -blocks! It was stated previously that the end of the linear directory -entry table was signified with an entry pointing to inode 0; this is -(ab)used to fool the old linear-scan algorithm into thinking that the -rest of the directory block is empty so that it moves on. - -The root of the tree always lives in the first data block of the -directory. By ext2 custom, the '.' and '..' 
entries must appear at the -beginning of this first block, so they are put here as two -``struct ext4_dir_entry_2`` s and not stored in the tree. The rest of -the root node contains metadata about the tree and finally a hash->block -map to find nodes that are lower in the htree. If -``dx_root.info.indirect_levels`` is non-zero then the htree has two -levels; the data block pointed to by the root node's map is an interior -node, which is indexed by a minor hash. Interior nodes in this tree -contains a zeroed out ``struct ext4_dir_entry_2`` followed by a -minor_hash->block map to find leafe nodes. Leaf nodes contain a linear -array of all ``struct ext4_dir_entry_2``; all of these entries -(presumably) hash to the same value. If there is an overflow, the -entries simply overflow into the next leaf node, and the -least-significant bit of the hash (in the interior node map) that gets -us to this next leaf node is set. - -To traverse the directory as a htree, the code calculates the hash of -the desired file name and uses it to find the corresponding block -number. If the tree is flat, the block is a linear array of directory -entries that can be searched; otherwise, the minor hash of the file name -is computed and used against this second block to find the corresponding -third block number. That third block number will be a linear array of -directory entries. - -To traverse the directory as a linear array (such as the old code does), -the code simply reads every data block in the directory. The blocks used -for the htree will appear to have no entries (aside from '.' and '..') -and so only the leaf nodes will appear to have any interesting content. - -The root of the htree is in ``struct dx_root``, which is the full length -of a data block: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Type - - Name - - Description - * - 0x0 - - __le32 - - dot.inode - - inode number of this directory. 
- * - 0x4 - - __le16 - - dot.rec_len - - Length of this record, 12. - * - 0x6 - - u8 - - dot.name_len - - Length of the name, 1. - * - 0x7 - - u8 - - dot.file_type - - File type of this entry, 0x2 (directory) (if the feature flag is se= t). - * - 0x8 - - char - - dot.name[4] - - =E2=80=9C.\0\0\0=E2=80=9D - * - 0xC - - __le32 - - dotdot.inode - - inode number of parent directory. - * - 0x10 - - __le16 - - dotdot.rec_len - - block_size - 12. The record length is long enough to cover all htree - data. - * - 0x12 - - u8 - - dotdot.name_len - - Length of the name, 2. - * - 0x13 - - u8 - - dotdot.file_type - - File type of this entry, 0x2 (directory) (if the feature flag is se= t). - * - 0x14 - - char - - dotdot_name[4] - - =E2=80=9C..\0\0=E2=80=9D - * - 0x18 - - __le32 - - struct dx_root_info.reserved_zero - - Zero. - * - 0x1C - - u8 - - struct dx_root_info.hash_version - - Hash type, see dirhash_ table below. - * - 0x1D - - u8 - - struct dx_root_info.info_length - - Length of the tree information, 0x8. - * - 0x1E - - u8 - - struct dx_root_info.indirect_levels - - Depth of the htree. Cannot be larger than 3 if the INCOMPAT_LARGEDIR - feature is set; cannot be larger than 2 otherwise. - * - 0x1F - - u8 - - struct dx_root_info.unused_flags - - - * - 0x20 - - __le16 - - limit - - Maximum number of dx_entries that can follow this header, plus 1 for - the header itself. - * - 0x22 - - __le16 - - count - - Actual number of dx_entries that follow this header, plus 1 for the - header itself. - * - 0x24 - - __le32 - - block - - The block number (within the directory file) that goes with hash=3D= 0. - * - 0x28 - - struct dx_entry - - entries[0] - - As many 8-byte ``struct dx_entry`` as fits in the rest of the data = block. - -.. _dirhash: - -The directory hash is one of the following values: - -.. list-table:: - :widths: 16 64 - :header-rows: 1 - - * - Value - - Description - * - 0x0 - - Legacy. - * - 0x1 - - Half MD4. - * - 0x2 - - Tea. - * - 0x3 - - Legacy, unsigned. 
- * - 0x4 - - Half MD4, unsigned. - * - 0x5 - - Tea, unsigned. - * - 0x6 - - Siphash. - -Interior nodes of an htree are recorded as ``struct dx_node``, which is -also the full length of a data block: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Type - - Name - - Description - * - 0x0 - - __le32 - - fake.inode - - Zero, to make it look like this entry is not in use. - * - 0x4 - - __le16 - - fake.rec_len - - The size of the block, in order to hide all of the dx_node data. - * - 0x6 - - u8 - - name_len - - Zero. There is no name for this =E2=80=9Cunused=E2=80=9D directory = entry. - * - 0x7 - - u8 - - file_type - - Zero. There is no file type for this =E2=80=9Cunused=E2=80=9D direc= tory entry. - * - 0x8 - - __le16 - - limit - - Maximum number of dx_entries that can follow this header, plus 1 for - the header itself. - * - 0xA - - __le16 - - count - - Actual number of dx_entries that follow this header, plus 1 for the - header itself. - * - 0xE - - __le32 - - block - - The block number (within the directory file) that goes with the low= est - hash value of this block. This value is stored in the parent block. - * - 0x12 - - struct dx_entry - - entries[0] - - As many 8-byte ``struct dx_entry`` as fits in the rest of the data = block. - -The hash maps that exist in both ``struct dx_root`` and -``struct dx_node`` are recorded as ``struct dx_entry``, which is 8 bytes -long: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Type - - Name - - Description - * - 0x0 - - __le32 - - hash - - Hash code. - * - 0x4 - - __le32 - - block - - Block number (within the directory file, not filesystem blocks) of = the - next node in the htree. - -(If you think this is all quite clever and peculiar, so does the -author.) - -If metadata checksums are enabled, the last 8 bytes of the directory -block (precisely the length of one dx_entry) are used to store a -``struct dx_tail``, which contains the checksum. 
The ``limit`` and -``count`` entries in the dx_root/dx_node structures are adjusted as -necessary to fit the dx_tail into the block. If there is no space for -the dx_tail, the user is notified to run e2fsck -D to rebuild the -directory index (which will ensure that there's space for the checksum. -The dx_tail structure is 8 bytes long and looks like this: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Type - - Name - - Description - * - 0x0 - - u32 - - dt_reserved - - Zero. - * - 0x4 - - __le32 - - dt_checksum - - Checksum of the htree directory block. - -The checksum is calculated against the FS UUID, the htree index header -(dx_root or dx_node), all of the htree indices (dx_entry) that are in -use, and the tail block (dx_tail). diff --git a/Documentation/filesystems/ext4/dynamic.rst b/Documentation/fil= esystems/ext4/dynamic.rst index bb0c84333341a5..225324e59fe57c 100644 --- a/Documentation/filesystems/ext4/dynamic.rst +++ b/Documentation/filesystems/ext4/dynamic.rst @@ -6,7 +6,1414 @@ Dynamic Structures Dynamic metadata are created on the fly when files and blocks are allocated to files. =20 -.. include:: inodes.rst -.. include:: ifork.rst -.. include:: directory.rst -.. include:: attributes.rst +Index Nodes +----------- + +In a regular UNIX filesystem, the inode stores all the metadata +pertaining to the file (time stamps, block maps, extended attributes, +etc), not the directory entry. To find the information associated with a +file, one must traverse the directory files to find the directory entry +associated with a file, then load the inode to find the metadata for +that file. ext4 appears to cheat (for performance reasons) a little bit +by storing a copy of the file type (normally stored in the inode) in the +directory entry. 
(Compare all this to FAT, which stores all the file +information directly in the directory entry, but does not support hard +links and is in general more seek-happy than ext4 due to its simpler +block allocator and extensive use of linked lists.) + +The inode table is a linear array of ``struct ext4_inode``. The table is +sized to have enough blocks to store at least +``sb.s_inode_size * sb.s_inodes_per_group`` bytes. The number of the +block group containing an inode can be calculated as +``(inode_number - 1) / sb.s_inodes_per_group``, and the offset into the +group's table is ``(inode_number - 1) % sb.s_inodes_per_group``. There +is no inode 0. + +The inode checksum is calculated against the FS UUID, the inode number, +and the inode structure itself. + +The inode table entry is laid out in ``struct ext4_inode``. + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + :class: longtable + + * - Offset + - Size + - Name + - Description + * - 0x0 + - __le16 + - i_mode + - File mode. See the table i_mode_ below. + * - 0x2 + - __le16 + - i_uid + - Lower 16-bits of Owner UID. + * - 0x4 + - __le32 + - i_size_lo + - Lower 32-bits of size in bytes. + * - 0x8 + - __le32 + - i_atime + - Last access time, in seconds since the epoch. However, if the EA_IN= ODE + inode flag is set, this inode stores an extended attribute value and + this field contains the checksum of the value. + * - 0xC + - __le32 + - i_ctime + - Last inode change time, in seconds since the epoch. However, if the + EA_INODE inode flag is set, this inode stores an extended attribute + value and this field contains the lower 32 bits of the attribute va= lue's + reference count. + * - 0x10 + - __le32 + - i_mtime + - Last data modification time, in seconds since the epoch. However, i= f the + EA_INODE inode flag is set, this inode stores an extended attribute + value and this field contains the number of the inode that owns the + extended attribute. 
+ * - 0x14 + - __le32 + - i_dtime + - Deletion Time, in seconds since the epoch. + * - 0x18 + - __le16 + - i_gid + - Lower 16-bits of GID. + * - 0x1A + - __le16 + - i_links_count + - Hard link count. Normally, ext4 does not permit an inode to have mo= re + than 65,000 hard links. This applies to files as well as directorie= s, + which means that there cannot be more than 64,998 subdirectories in= a + directory (each subdirectory's '..' entry counts as a hard link, as= does + the '.' entry in the directory itself). With the DIR_NLINK feature + enabled, ext4 supports more than 64,998 subdirectories by setting t= his + field to 1 to indicate that the number of hard links is not known. + * - 0x1C + - __le32 + - i_blocks_lo + - Lower 32-bits of =E2=80=9Cblock=E2=80=9D count. If the huge_file fe= ature flag is not + set on the filesystem, the file consumes ``i_blocks_lo`` 512-byte b= locks + on disk. If huge_file is set and EXT4_HUGE_FILE_FL is NOT set in + ``inode.i_flags``, then the file consumes ``i_blocks_lo + (i_blocks= _hi + << 32)`` 512-byte blocks on disk. If huge_file is set and + EXT4_HUGE_FILE_FL IS set in ``inode.i_flags``, then this file + consumes (``i_blocks_lo + i_blocks_hi`` << 32) filesystem blocks on + disk. + * - 0x20 + - __le32 + - i_flags + - Inode flags. See the table i_flags_ below. + * - 0x24 + - 4 bytes + - i_osd1 + - See the table i_osd1_ for more details. + * - 0x28 + - 60 bytes + - i_block[EXT4_N_BLOCKS=3D15] + - Block map or extent tree. See the section =E2=80=9CThe Contents of = inode.i_block=E2=80=9D. + * - 0x64 + - __le32 + - i_generation + - File version (for NFS). + * - 0x68 + - __le32 + - i_file_acl_lo + - Lower 32-bits of extended attribute block. ACLs are of course one of + many possible extended attributes; I think the name of this field i= s a + result of the first use of extended attributes being for ACLs. + * - 0x6C + - __le32 + - i_size_high / i_dir_acl + - Upper 32-bits of file/directory size. 
In ext2/3 this field was named + i_dir_acl, though it was usually set to zero and never used. + * - 0x70 + - __le32 + - i_obso_faddr + - (Obsolete) fragment address. + * - 0x74 + - 12 bytes + - i_osd2 + - See the table i_osd2_ for more details. + * - 0x80 + - __le16 + - i_extra_isize + - Size of this inode - 128. Alternately, the size of the extended inode + fields beyond the original ext2 inode, including this field. + * - 0x82 + - __le16 + - i_checksum_hi + - Upper 16-bits of the inode checksum. + * - 0x84 + - __le32 + - i_ctime_extra + - Extra change time bits. This provides sub-second precision. See Inode + Timestamps section. + * - 0x88 + - __le32 + - i_mtime_extra + - Extra modification time bits. This provides sub-second precision. + * - 0x8C + - __le32 + - i_atime_extra + - Extra access time bits. This provides sub-second precision. + * - 0x90 + - __le32 + - i_crtime + - File creation time, in seconds since the epoch. + * - 0x94 + - __le32 + - i_crtime_extra + - Extra file creation time bits. This provides sub-second precision. + * - 0x98 + - __le32 + - i_version_hi + - Upper 32-bits for version number. + * - 0x9C + - __le32 + - i_projid + - Project ID. + +.. _i_mode: + +The ``i_mode`` value is a combination of the following flags: + +.. 
list-table:: + :widths: 16 64 + :header-rows: 1 + + * - Value + - Description + * - 0x1 + - S_IXOTH (Others may execute) + * - 0x2 + - S_IWOTH (Others may write) + * - 0x4 + - S_IROTH (Others may read) + * - 0x8 + - S_IXGRP (Group members may execute) + * - 0x10 + - S_IWGRP (Group members may write) + * - 0x20 + - S_IRGRP (Group members may read) + * - 0x40 + - S_IXUSR (Owner may execute) + * - 0x80 + - S_IWUSR (Owner may write) + * - 0x100 + - S_IRUSR (Owner may read) + * - 0x200 + - S_ISVTX (Sticky bit) + * - 0x400 + - S_ISGID (Set GID) + * - 0x800 + - S_ISUID (Set UID) + * - + - These are mutually-exclusive file types: + * - 0x1000 + - S_IFIFO (FIFO) + * - 0x2000 + - S_IFCHR (Character device) + * - 0x4000 + - S_IFDIR (Directory) + * - 0x6000 + - S_IFBLK (Block device) + * - 0x8000 + - S_IFREG (Regular file) + * - 0xA000 + - S_IFLNK (Symbolic link) + * - 0xC000 + - S_IFSOCK (Socket) + +.. _i_flags: + +The ``i_flags`` field is a combination of these values: + +.. list-table:: + :widths: 16 64 + :header-rows: 1 + + * - Value + - Description + * - 0x1 + - This file requires secure deletion (EXT4_SECRM_FL). (not implemented) + * - 0x2 + - This file should be preserved, should undeletion be desired + (EXT4_UNRM_FL). (not implemented) + * - 0x4 + - File is compressed (EXT4_COMPR_FL). (not really implemented) + * - 0x8 + - All writes to the file must be synchronous (EXT4_SYNC_FL). + * - 0x10 + - File is immutable (EXT4_IMMUTABLE_FL). + * - 0x20 + - File can only be appended (EXT4_APPEND_FL). + * - 0x40 + - The dump(1) utility should not dump this file (EXT4_NODUMP_FL). + * - 0x80 + - Do not update access time (EXT4_NOATIME_FL). + * - 0x100 + - Dirty compressed file (EXT4_DIRTY_FL). (not used) + * - 0x200 + - File has one or more compressed clusters (EXT4_COMPRBLK_FL). (not used) + * - 0x400 + - Do not compress file (EXT4_NOCOMPR_FL). (not used) + * - 0x800 + - Encrypted inode (EXT4_ENCRYPT_FL). 
This bit value previously was + EXT4_ECOMPR_FL (compression error), which was never used. + * - 0x1000 + - Directory has hashed indexes (EXT4_INDEX_FL). + * - 0x2000 + - AFS magic directory (EXT4_IMAGIC_FL). + * - 0x4000 + - File data must always be written through the journal + (EXT4_JOURNAL_DATA_FL). + * - 0x8000 + - File tail should not be merged (EXT4_NOTAIL_FL). (not used by ext4) + * - 0x10000 + - All directory entry data should be written synchronously (see + ``dirsync``) (EXT4_DIRSYNC_FL). + * - 0x20000 + - Top of directory hierarchy (EXT4_TOPDIR_FL). + * - 0x40000 + - This is a huge file (EXT4_HUGE_FILE_FL). + * - 0x80000 + - Inode uses extents (EXT4_EXTENTS_FL). + * - 0x100000 + - Verity protected file (EXT4_VERITY_FL). + * - 0x200000 + - Inode stores a large extended attribute value in its data blocks + (EXT4_EA_INODE_FL). + * - 0x400000 + - This file has blocks allocated past EOF (EXT4_EOFBLOCKS_FL). + (deprecated) + * - 0x01000000 + - Inode is a snapshot (``EXT4_SNAPFILE_FL``). (not in mainline) + * - 0x04000000 + - Snapshot is being deleted (``EXT4_SNAPFILE_DELETED_FL``). (not in + mainline) + * - 0x08000000 + - Snapshot shrink has completed (``EXT4_SNAPFILE_SHRUNK_FL``). (not in + mainline) + * - 0x10000000 + - Inode has inline data (EXT4_INLINE_DATA_FL). + * - 0x20000000 + - Create children with the same project ID (EXT4_PROJINHERIT_FL). + * - 0x80000000 + - Reserved for ext4 library (EXT4_RESERVED_FL). + * - + - Aggregate flags: + * - 0x705BDFFF + - User-visible flags. + * - 0x604BC0FF + - User-modifiable flags. Note that while EXT4_JOURNAL_DATA_FL and + EXT4_EXTENTS_FL can be set with setattr, they are not in the kernel's + EXT4_FL_USER_MODIFIABLE mask, since it needs to handle the setting of + these flags in a special manner and they are masked out of the set of + flags that are saved directly to i_flags. + +.. _i_osd1: + +The ``osd1`` field has multiple meanings depending on the creator: + +Linux: + +.. 
list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Size + - Name + - Description + * - 0x0 + - __le32 + - l_i_version + - Inode version. However, if the EA_INODE inode flag is set, this inode + stores an extended attribute value and this field contains the upper 32 + bits of the attribute value's reference count. + +Hurd: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Size + - Name + - Description + * - 0x0 + - __le32 + - h_i_translator + - ?? + +Masix: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Size + - Name + - Description + * - 0x0 + - __le32 + - m_i_reserved + - ?? + +.. _i_osd2: + +The ``osd2`` field has multiple meanings depending on the filesystem creator: + +Linux: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Size + - Name + - Description + * - 0x0 + - __le16 + - l_i_blocks_high + - Upper 16-bits of the block count. Please see the note attached to + i_blocks_lo. + * - 0x2 + - __le16 + - l_i_file_acl_high + - Upper 16-bits of the extended attribute block (historically, the file + ACL location). See the Extended Attributes section below. + * - 0x4 + - __le16 + - l_i_uid_high + - Upper 16-bits of the Owner UID. + * - 0x6 + - __le16 + - l_i_gid_high + - Upper 16-bits of the GID. + * - 0x8 + - __le16 + - l_i_checksum_lo + - Lower 16-bits of the inode checksum. + * - 0xA + - __le16 + - l_i_reserved + - Unused. + +Hurd: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Size + - Name + - Description + * - 0x0 + - __le16 + - h_i_reserved1 + - ?? + * - 0x2 + - __u16 + - h_i_mode_high + - Upper 16-bits of the file mode. + * - 0x4 + - __le16 + - h_i_uid_high + - Upper 16-bits of the Owner UID. + * - 0x6 + - __le16 + - h_i_gid_high + - Upper 16-bits of the GID. + * - 0x8 + - __u32 + - h_i_author + - Author code? + +Masix: + +.. 
list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Size + - Name + - Description + * - 0x0 + - __le16 + - h_i_reserved1 + - ?? + * - 0x2 + - __u16 + - m_i_file_acl_high + - Upper 16-bits of the extended attribute block (historically, the file + ACL location). + * - 0x4 + - __u32 + - m_i_reserved2[2] + - ?? + +Inode Size +~~~~~~~~~~ + +In ext2 and ext3, the inode structure size was fixed at 128 bytes +(``EXT2_GOOD_OLD_INODE_SIZE``) and each inode had a disk record size of +128 bytes. Starting with ext4, it is possible to allocate a larger +on-disk inode at format time for all inodes in the filesystem to provide +space beyond the end of the original ext2 inode. The on-disk inode +record size is recorded in the superblock as ``s_inode_size``. The +number of bytes actually used by struct ext4_inode beyond the original +128-byte ext2 inode is recorded in the ``i_extra_isize`` field for each +inode, which allows struct ext4_inode to grow for a new kernel without +having to upgrade all of the on-disk inodes. Access to fields beyond +EXT2_GOOD_OLD_INODE_SIZE should be verified to be within +``i_extra_isize``. By default, ext4 inode records are 256 bytes, and (as +of August 2019) the inode structure is 160 bytes +(``i_extra_isize = 32``). The extra space between the end of the inode +structure and the end of the inode record can be used to store extended +attributes. Each inode record can be as large as the filesystem block +size, though this is not terribly efficient. + +Finding an Inode +~~~~~~~~~~~~~~~~ + +Each block group contains ``sb->s_inodes_per_group`` inodes. Because +inode 0 is defined not to exist, this formula can be used to find the +block group that an inode lives in: +``bg = (inode_num - 1) / sb->s_inodes_per_group``. The particular inode +can be found within the block group's inode table at +``index = (inode_num - 1) % sb->s_inodes_per_group``. 
To get the byte +address within the inode table, use +``offset = index * sb->s_inode_size``. + +Inode Timestamps +~~~~~~~~~~~~~~~~ + +Four timestamps are recorded in the lower 128 bytes of the inode +structure -- inode change time (ctime), access time (atime), data +modification time (mtime), and deletion time (dtime). The four fields +are 32-bit signed integers that represent seconds since the Unix epoch +(1970-01-01 00:00:00 GMT), which means that the fields will overflow in +January 2038. If the filesystem does not have the orphan_file feature, inodes +that are not linked from any directory but are still open (orphan inodes) have +the dtime field overloaded for use with the orphan list. The superblock field +``s_last_orphan`` points to the first inode in the orphan list; dtime is then +the number of the next orphaned inode, or zero if there are no more orphans. + +If the inode structure size ``sb->s_inode_size`` is larger than 128 +bytes and the ``i_extra_isize`` field is large enough to encompass the +respective ``i_[cma]time_extra`` field, the ctime, atime, and mtime +inode fields are widened to 64 bits. Within this “extra” 32-bit field, +the lower two bits are used to extend the 32-bit seconds field to be 34 +bits wide; the upper 30 bits are used to provide nanosecond timestamp +accuracy. Therefore, timestamps should not overflow until May 2446. +dtime was not widened. There is also a fifth timestamp to record inode +creation time (crtime); this field is 64-bits wide and decoded in the +same manner as 64-bit [cma]time. Neither crtime nor dtime is accessible +through the regular stat() interface, though debugfs will report them. + +We use the 32-bit signed time value plus (2^32 * (extra epoch bits)). +In other words: + +.. 
list-table:: + :widths: 20 20 20 20 20 + :header-rows: 1 + + * - Extra epoch bits + - MSB of 32-bit time + - Adjustment for signed 32-bit to 64-bit tv_sec + - Decoded 64-bit tv_sec + - valid time range + * - 0 0 + - 1 + - 0 + - ``-0x80000000 - -0x00000001`` + - 1901-12-13 to 1969-12-31 + * - 0 0 + - 0 + - 0 + - ``0x000000000 - 0x07fffffff`` + - 1970-01-01 to 2038-01-19 + * - 0 1 + - 1 + - 0x100000000 + - ``0x080000000 - 0x0ffffffff`` + - 2038-01-19 to 2106-02-07 + * - 0 1 + - 0 + - 0x100000000 + - ``0x100000000 - 0x17fffffff`` + - 2106-02-07 to 2174-02-25 + * - 1 0 + - 1 + - 0x200000000 + - ``0x180000000 - 0x1ffffffff`` + - 2174-02-25 to 2242-03-16 + * - 1 0 + - 0 + - 0x200000000 + - ``0x200000000 - 0x27fffffff`` + - 2242-03-16 to 2310-04-04 + * - 1 1 + - 1 + - 0x300000000 + - ``0x280000000 - 0x2ffffffff`` + - 2310-04-04 to 2378-04-22 + * - 1 1 + - 0 + - 0x300000000 + - ``0x300000000 - 0x37fffffff`` + - 2378-04-22 to 2446-05-10 + +This is a somewhat odd encoding since there are effectively seven times +as many positive values as negative values. There have also been +long-standing bugs decoding and encoding dates beyond 2038, which don't +seem to be fixed as of kernel 3.12 and e2fsprogs 1.42.8. 64-bit kernels +incorrectly use the extra epoch bits 1,1 for dates between 1901 and +1970. At some point the kernel will be fixed and e2fsck will fix this +situation, assuming that it is run before 2310. + +The Contents of inode.i_block +------------------------------ + +Depending on the type of file an inode describes, the 60 bytes of +storage in ``inode.i_block`` can be used in different ways. In general, +regular files and directories will use it for file block indexing +information, and special files will use it for special purposes. + +Symbolic Links +~~~~~~~~~~~~~~ + +The target of a symbolic link will be stored in this field if the target +string is less than 60 bytes long. 
Otherwise, either extents or block +maps will be used to allocate data blocks to store the link target. + +Direct/Indirect Block Addressing +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In ext2/3, file block numbers were mapped to logical block numbers by +means of an (up to) three level 1-1 block map. To find the logical block +that stores a particular file block, the code would navigate through +this increasingly complicated structure. Notice that there is neither a +magic number nor a checksum to provide any level of confidence that the +block isn't full of garbage. + +.. ifconfig:: builder != 'latex' + + .. include:: blockmap.rst + +.. ifconfig:: builder == 'latex' + + [Table omitted because LaTeX doesn't support nested tables.] + +Note that with this block mapping scheme, it is necessary to fill out a +lot of mapping data even for a large contiguous file! This inefficiency +led to the creation of the extent mapping scheme, discussed below. + +Notice also that a file using this mapping scheme cannot be placed +higher than 2^32 blocks. + +Extent Tree +~~~~~~~~~~~ + +In ext4, the file to logical block map has been replaced with an extent +tree. Under the old scheme, allocating a contiguous run of 1,000 blocks +requires an indirect block to map all 1,000 entries; with extents, the +mapping is reduced to a single ``struct ext4_extent`` with +``ee_len = 1000``. If flex_bg is enabled, it is possible to allocate +very large files with a single extent, at a considerable reduction in +metadata block use, and some improvement in disk efficiency. The inode +must have the extents flag (0x80000) set for this feature to be in +use. + +Extents are arranged as a tree. Each node of the tree begins with a +``struct ext4_extent_header``. If the node is an interior node +(``eh.eh_depth`` > 0), the header is followed by ``eh.eh_entries`` +instances of ``struct ext4_extent_idx``; each of these index entries +points to a block containing more nodes in the extent tree. 
If the node +is a leaf node (``eh.eh_depth == 0``), then the header is followed by +``eh.eh_entries`` instances of ``struct ext4_extent``; these instances +point to the file's data blocks. The root node of the extent tree is +stored in ``inode.i_block``, which allows for the first four extents to +be recorded without the use of extra metadata blocks. + +The extent tree header is recorded in ``struct ext4_extent_header``, +which is 12 bytes long: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Size + - Name + - Description + * - 0x0 + - __le16 + - eh_magic + - Magic number, 0xF30A. + * - 0x2 + - __le16 + - eh_entries + - Number of valid entries following the header. + * - 0x4 + - __le16 + - eh_max + - Maximum number of entries that could follow the header. + * - 0x6 + - __le16 + - eh_depth + - Depth of this extent node in the extent tree. 0 = this extent node + points to data blocks; otherwise, this extent node points to other + extent nodes. The extent tree can be at most 5 levels deep: a logical + block number can be at most ``2^32``, and the smallest ``n`` that + satisfies ``4*(((blocksize - 12)/12)^n) >= 2^32`` is 5. + * - 0x8 + - __le32 + - eh_generation + - Generation of the tree. (Used by Lustre, but not standard ext4). + +Internal nodes of the extent tree, also known as index nodes, are +recorded as ``struct ext4_extent_idx``, and are 12 bytes long: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Size + - Name + - Description + * - 0x0 + - __le32 + - ei_block + - This index node covers file blocks from 'block' onward. + * - 0x4 + - __le32 + - ei_leaf_lo + - Lower 32-bits of the block number of the extent node that is the next + level lower in the tree. The tree node pointed to can be either another + internal node or a leaf node, described below. + * - 0x8 + - __le16 + - ei_leaf_hi + - Upper 16-bits of the previous field. 
+ * - 0xA + - __u16 + - ei_unused + - + +Leaf nodes of the extent tree are recorded as ``struct ext4_extent``, +and are also 12 bytes long: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Size + - Name + - Description + * - 0x0 + - __le32 + - ee_block + - First file block number that this extent covers. + * - 0x4 + - __le16 + - ee_len + - Number of blocks covered by extent. If the value of this field is <= + 32768, the extent is initialized. If the value of the field is > 32768, + the extent is uninitialized and the actual extent length is ``ee_len`` - + 32768. Therefore, the maximum length of an initialized extent is 32768 + blocks, and the maximum length of an uninitialized extent is 32767. + * - 0x6 + - __le16 + - ee_start_hi + - Upper 16-bits of the block number to which this extent points. + * - 0x8 + - __le32 + - ee_start_lo + - Lower 32-bits of the block number to which this extent points. + +Prior to the introduction of metadata checksums, the extent header + +extent entries always left at least 4 bytes of unallocated space at the +end of each extent tree data block (because (2^x % 12) >= 4). Therefore, +the 32-bit checksum is inserted into this space. The 4 extents in the +inode do not need checksumming, since the inode is already checksummed. +The checksum is calculated against the FS UUID, the inode number, the +inode generation, and the entire extent block leading up to (but not +including) the checksum itself. + +``struct ext4_extent_tail`` is 4 bytes long: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Size + - Name + - Description + * - 0x0 + - __le32 + - eb_checksum + - Checksum of the extent block, crc32c(uuid+inum+igeneration+extentblock) + +Inline Data +~~~~~~~~~~~ + +If the inline data feature is enabled for the filesystem and the flag is +set for the inode, it is possible that the first 60 bytes of the file +data are stored here. 
+ +Directory Entries +----------------- + +In an ext4 filesystem, a directory is more or less a flat file that maps +an arbitrary byte string (usually ASCII) to an inode number on the +filesystem. There can be many directory entries across the filesystem +that reference the same inode number--these are known as hard links, and +that is why hard links cannot reference files on other filesystems. As +such, directory entries are found by reading the data block(s) +associated with a directory file for the particular directory entry that +is desired. + +Linear (Classic) Directories +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +By default, each directory lists its entries in an “almost-linear” +array. I write “almost” because it's not a linear array in the memory +sense because directory entries are not split across filesystem blocks. +Therefore, it is more accurate to say that a directory is a series of +data blocks and that each block contains a linear array of directory +entries. The end of each per-block array is signified by reaching the +end of the block; the last entry in the block has a record length that +takes it all the way to the end of the block. The end of the entire +directory is of course signified by reaching the end of the file. Unused +directory entries are signified by inode = 0. By default the filesystem +uses ``struct ext4_dir_entry_2`` for directory entries unless the +“filetype” feature flag is not set, in which case it uses +``struct ext4_dir_entry``. + +The original directory entry format is ``struct ext4_dir_entry``, which +is at most 263 bytes long, though on disk you'll need to reference +``dirent.rec_len`` to know for sure. + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Size + - Name + - Description + * - 0x0 + - __le32 + - inode + - Number of the inode that this directory entry points to. + * - 0x4 + - __le16 + - rec_len + - Length of this directory entry. 
Must be a multiple of 4. + * - 0x6 + - __le16 + - name_len + - Length of the file name. + * - 0x8 + - char + - name[EXT4_NAME_LEN] + - File name. + +Since file names cannot be longer than 255 bytes, the new directory +entry format shortens the name_len field and uses the space for a file +type flag, probably to avoid having to load every inode during directory +tree traversal. This format is ``ext4_dir_entry_2``, which is at most +263 bytes long, though on disk you'll need to reference +``dirent.rec_len`` to know for sure. + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Size + - Name + - Description + * - 0x0 + - __le32 + - inode + - Number of the inode that this directory entry points to. + * - 0x4 + - __le16 + - rec_len + - Length of this directory entry. + * - 0x6 + - __u8 + - name_len + - Length of the file name. + * - 0x7 + - __u8 + - file_type + - File type code, see ftype_ table below. + * - 0x8 + - char + - name[EXT4_NAME_LEN] + - File name. + +.. _ftype: + +The directory file type is one of the following values: + +.. list-table:: + :widths: 16 64 + :header-rows: 1 + + * - Value + - Description + * - 0x0 + - Unknown. + * - 0x1 + - Regular file. + * - 0x2 + - Directory. + * - 0x3 + - Character device file. + * - 0x4 + - Block device file. + * - 0x5 + - FIFO. + * - 0x6 + - Socket. + * - 0x7 + - Symbolic link. + +To support directories that are both encrypted and casefolded, we +must also include hash information in the directory entry. We append +``ext4_extended_dir_entry_2`` to ``ext4_dir_entry_2`` except for the entries +for dot and dotdot, which are kept the same. The structure follows immediately +after ``name`` and is included in the size listed by ``rec_len``. If a directory +entry uses this extension, it may be up to 271 bytes. + +.. 
list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Size + - Name + - Description + * - 0x0 + - __le32 + - hash + - The hash of the directory name + * - 0x4 + - __le32 + - minor_hash + - The minor hash of the directory name + + +In order to add checksums to these classic directory blocks, a phony +``struct ext4_dir_entry`` is placed at the end of each leaf block to +hold the checksum. The directory entry is 12 bytes long. The inode +number and name_len fields are set to zero to fool old software into +ignoring an apparently empty directory entry, and the checksum is stored +in the place where the name normally goes. The structure is +``struct ext4_dir_entry_tail``: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Size + - Name + - Description + * - 0x0 + - __le32 + - det_reserved_zero1 + - Inode number, which must be zero. + * - 0x4 + - __le16 + - det_rec_len + - Length of this directory entry, which must be 12. + * - 0x6 + - __u8 + - det_reserved_zero2 + - Length of the file name, which must be zero. + * - 0x7 + - __u8 + - det_reserved_ft + - File type, which must be 0xDE. + * - 0x8 + - __le32 + - det_checksum + - Directory leaf block checksum. + +The leaf directory block checksum is calculated against the FS UUID, the +directory's inode number, the directory's inode generation number, and +the entire directory entry block up to (but not including) the fake +directory entry. + +Hash Tree Directories +~~~~~~~~~~~~~~~~~~~~~ + +A linear array of directory entries isn't great for performance, so a +new feature was added to ext3 to provide a faster (but peculiar) +balanced tree keyed off a hash of the directory entry name. If the +EXT4_INDEX_FL (0x1000) flag is set in the inode, this directory uses a +hashed btree (htree) to organize and find directory entries. 
For +backwards read-only compatibility with ext2, this tree is actually +hidden inside the directory file, masquerading as “empty” directory data +blocks! It was stated previously that the end of the linear directory +entry table was signified with an entry pointing to inode 0; this is +(ab)used to fool the old linear-scan algorithm into thinking that the +rest of the directory block is empty so that it moves on. + +The root of the tree always lives in the first data block of the +directory. By ext2 custom, the '.' and '..' entries must appear at the +beginning of this first block, so they are put here as two +``struct ext4_dir_entry_2`` s and not stored in the tree. The rest of +the root node contains metadata about the tree and finally a hash->block +map to find nodes that are lower in the htree. If +``dx_root.info.indirect_levels`` is non-zero then the htree has two +levels; the data block pointed to by the root node's map is an interior +node, which is indexed by a minor hash. Interior nodes in this tree +contain a zeroed out ``struct ext4_dir_entry_2`` followed by a +minor_hash->block map to find leaf nodes. Leaf nodes contain a linear +array of all ``struct ext4_dir_entry_2``; all of these entries +(presumably) hash to the same value. If there is an overflow, the +entries simply overflow into the next leaf node, and the +least-significant bit of the hash (in the interior node map) that gets +us to this next leaf node is set. + +To traverse the directory as an htree, the code calculates the hash of +the desired file name and uses it to find the corresponding block +number. If the tree is flat, the block is a linear array of directory +entries that can be searched; otherwise, the minor hash of the file name +is computed and used against this second block to find the corresponding +third block number. That third block number will be a linear array of +directory entries. 
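The hash-to-block step described above can be sketched in Python. This is an illustration only, not the kernel's implementation: it assumes the ``(hash, block)`` pairs of one index block have already been parsed into a list sorted by hash, the helper name ``dx_lookup`` and the sample values are made up, and it ignores the continuation bit stored in the least-significant bit of the hash.

```python
from bisect import bisect_right


def dx_lookup(entries, target_hash):
    """Return the block number whose hash range covers target_hash.

    entries: list of (hash, block) tuples sorted by hash, as read from one
    htree index block. The first slot's hash is treated as 0, since that
    slot holds the block that goes with the lowest hashes.
    """
    # Hypothetical helper: force the first slot to start at hash 0.
    hashes = [0] + [h for h, _ in entries[1:]]
    # Pick the last entry whose starting hash is <= target_hash.
    idx = bisect_right(hashes, target_hash) - 1
    return entries[idx][1]


# Made-up sample index with three hash ranges.
sample = [(0, 1), (0x40000000, 2), (0x80000000, 3)]
print(dx_lookup(sample, 0x12345678))
print(dx_lookup(sample, 0x80000000))
```

For a tree with more than one level, the same search would presumably be repeated on the index block that this lookup returns, until a leaf block of directory entries is reached.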
+ +To traverse the directory as a linear array (such as the old code does), +the code simply reads every data block in the directory. The blocks used +for the htree will appear to have no entries (aside from '.' and '..') +and so only the leaf nodes will appear to have any interesting content. + +The root of the htree is in ``struct dx_root``, which is the full length +of a data block: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Type + - Name + - Description + * - 0x0 + - __le32 + - dot.inode + - inode number of this directory. + * - 0x4 + - __le16 + - dot.rec_len + - Length of this record, 12. + * - 0x6 + - u8 + - dot.name_len + - Length of the name, 1. + * - 0x7 + - u8 + - dot.file_type + - File type of this entry, 0x2 (directory) (if the feature flag is set). + * - 0x8 + - char + - dot.name[4] + - “.\0\0\0” + * - 0xC + - __le32 + - dotdot.inode + - inode number of parent directory. + * - 0x10 + - __le16 + - dotdot.rec_len + - block_size - 12. The record length is long enough to cover all htree + data. + * - 0x12 + - u8 + - dotdot.name_len + - Length of the name, 2. + * - 0x13 + - u8 + - dotdot.file_type + - File type of this entry, 0x2 (directory) (if the feature flag is set). + * - 0x14 + - char + - dotdot_name[4] + - “..\0\0” + * - 0x18 + - __le32 + - struct dx_root_info.reserved_zero + - Zero. + * - 0x1C + - u8 + - struct dx_root_info.hash_version + - Hash type, see dirhash_ table below. + * - 0x1D + - u8 + - struct dx_root_info.info_length + - Length of the tree information, 0x8. + * - 0x1E + - u8 + - struct dx_root_info.indirect_levels + - Depth of the htree. Cannot be larger than 3 if the INCOMPAT_LARGEDIR + feature is set; cannot be larger than 2 otherwise. + * - 0x1F + - u8 + - struct dx_root_info.unused_flags + - + * - 0x20 + - __le16 + - limit + - Maximum number of dx_entries that can follow this header, plus 1 for + the header itself. 
+ * - 0x22 + - __le16 + - count + - Actual number of dx_entries that follow this header, plus 1 for the + header itself. + * - 0x24 + - __le32 + - block + - The block number (within the directory file) that goes with hash=0. + * - 0x28 + - struct dx_entry + - entries[0] + - As many 8-byte ``struct dx_entry`` as fits in the rest of the data block. + +.. _dirhash: + +The directory hash is one of the following values: + +.. list-table:: + :widths: 16 64 + :header-rows: 1 + + * - Value + - Description + * - 0x0 + - Legacy. + * - 0x1 + - Half MD4. + * - 0x2 + - Tea. + * - 0x3 + - Legacy, unsigned. + * - 0x4 + - Half MD4, unsigned. + * - 0x5 + - Tea, unsigned. + * - 0x6 + - Siphash. + +Interior nodes of an htree are recorded as ``struct dx_node``, which is +also the full length of a data block: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Type + - Name + - Description + * - 0x0 + - __le32 + - fake.inode + - Zero, to make it look like this entry is not in use. + * - 0x4 + - __le16 + - fake.rec_len + - The size of the block, in order to hide all of the dx_node data. + * - 0x6 + - u8 + - name_len + - Zero. There is no name for this “unused” directory entry. + * - 0x7 + - u8 + - file_type + - Zero. There is no file type for this “unused” directory entry. + * - 0x8 + - __le16 + - limit + - Maximum number of dx_entries that can follow this header, plus 1 for + the header itself. + * - 0xA + - __le16 + - count + - Actual number of dx_entries that follow this header, plus 1 for the + header itself. + * - 0xC + - __le32 + - block + - The block number (within the directory file) that goes with the lowest + hash value of this block. This value is stored in the parent block. + * - 0x10 + - struct dx_entry + - entries[0] + - As many 8-byte ``struct dx_entry`` as fits in the rest of the data block. 
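The ``limit`` values in the two tables above follow from where the limit/count header sits in the block. The sketch below is a rough illustration under stated assumptions: 8-byte ``struct dx_entry`` slots, a header slot at offset 0x20 in ``dx_root`` and 0x8 in ``dx_node``, and an optional 8-byte checksum tail at the end of the block when metadata checksums are enabled. The constant and function names are made up for this example.

```python
DX_ENTRY_SIZE = 8      # size of struct dx_entry (hash + block)
DX_TAIL_SIZE = 8       # assumed size of the checksum tail, when present
DX_ROOT_HEADER = 0x20  # offset of limit/count in dx_root
DX_NODE_HEADER = 0x08  # offset of limit/count in dx_node


def dx_limit(block_size, header_offset, has_checksum=False):
    """Count the 8-byte slots from the limit/count header to the block end.

    The header itself occupies the first slot, which matches the
    "plus 1 for the header itself" wording in the tables.
    """
    space = block_size - header_offset
    if has_checksum:
        # Reserve room for the checksum tail at the end of the block.
        space -= DX_TAIL_SIZE
    return space // DX_ENTRY_SIZE


print(dx_limit(4096, DX_ROOT_HEADER))        # root, no checksum
print(dx_limit(4096, DX_NODE_HEADER))        # interior node, no checksum
print(dx_limit(4096, DX_NODE_HEADER, True))  # interior node with checksum tail
```

Under these assumptions a 4096-byte block gives 508 slots in the root and 511 in an interior node, or 510 once the checksum tail is reserved.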
+ +The hash maps that exist in both ``struct dx_root`` and +``struct dx_node`` are recorded as ``struct dx_entry``, which is 8 bytes +long: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Type + - Name + - Description + * - 0x0 + - __le32 + - hash + - Hash code. + * - 0x4 + - __le32 + - block + - Block number (within the directory file, not filesystem blocks) of the + next node in the htree. + +(If you think this is all quite clever and peculiar, so does the +author.) + +If metadata checksums are enabled, the last 8 bytes of the directory +block (precisely the length of one dx_entry) are used to store a +``struct dx_tail``, which contains the checksum. The ``limit`` and +``count`` entries in the dx_root/dx_node structures are adjusted as +necessary to fit the dx_tail into the block. If there is no space for +the dx_tail, the user is notified to run e2fsck -D to rebuild the +directory index (which will ensure that there's space for the checksum). +The dx_tail structure is 8 bytes long and looks like this: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Type + - Name + - Description + * - 0x0 + - u32 + - dt_reserved + - Zero. + * - 0x4 + - __le32 + - dt_checksum + - Checksum of the htree directory block. + +The checksum is calculated against the FS UUID, the htree index header +(dx_root or dx_node), all of the htree indices (dx_entry) that are in +use, and the tail block (dx_tail). + +Extended Attributes +------------------- + +Extended attributes (xattrs) are typically stored in a separate data +block on the disk and referenced from inodes via ``inode.i_file_acl*``. +The first use of extended attributes seems to have been for storing file +ACLs and other security data (selinux). With the ``user_xattr`` mount +option it is possible for users to store extended attributes so long as +all attribute names begin with “user”; this restriction seems to have +disappeared as of Linux 3.0. 
+ +There are two places where extended attributes can be found. The first +place is between the end of each inode entry and the beginning of the +next inode entry. For example, if inode.i_extra_isize = 28 and +sb.s_inode_size = 256, then there are 256 - (128 + 28) = 100 bytes +available for in-inode extended attribute storage. The second place +where extended attributes can be found is in the block pointed to by +``inode.i_file_acl``. As of Linux 3.11, it is not possible for this +block to contain a pointer to a second extended attribute block (or even +the remaining blocks of a cluster). In theory it is possible for each +attribute's value to be stored in a separate data block, though as of +Linux 3.11 the code does not permit this. + +Keys are generally assumed to be ASCIIZ strings, whereas values can be +strings or binary data. + +Extended attributes, when stored after the inode, have a header +``ext4_xattr_ibody_header`` that is 4 bytes long: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Type + - Name + - Description + * - 0x0 + - __le32 + - h_magic + - Magic number for identification, 0xEA020000. This value is set by the + Linux driver, though e2fsprogs doesn't seem to check it(?) + +The beginning of an extended attribute block is in +``struct ext4_xattr_header``, which is 32 bytes long: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Type + - Name + - Description + * - 0x0 + - __le32 + - h_magic + - Magic number for identification, 0xEA020000. + * - 0x4 + - __le32 + - h_refcount + - Reference count. + * - 0x8 + - __le32 + - h_blocks + - Number of disk blocks used. + * - 0xC + - __le32 + - h_hash + - Hash value of all attributes. + * - 0x10 + - __le32 + - h_checksum + - Checksum of the extended attribute block. + * - 0x14 + - __u32 + - h_reserved[3] + - Zero. 
+
+The checksum is calculated against the FS UUID, the 64-bit block number
+of the extended attribute block, and the entire block (header +
+entries).
+
+Following the ``struct ext4_xattr_header`` or
+``struct ext4_xattr_ibody_header`` is an array of
+``struct ext4_xattr_entry``; each of these entries is at least 16 bytes
+long. When stored in an external block, the ``struct ext4_xattr_entry``
+entries must be stored in sorted order. The sort order is
+``e_name_index``, then ``e_name_len``, and finally ``e_name``.
+Attributes stored inside an inode do not need to be stored in sorted order.
+
+.. list-table::
+   :widths: 8 8 24 40
+   :header-rows: 1
+
+   * - Offset
+     - Type
+     - Name
+     - Description
+   * - 0x0
+     - __u8
+     - e_name_len
+     - Length of name.
+   * - 0x1
+     - __u8
+     - e_name_index
+     - Attribute name index. There is a discussion of this below.
+   * - 0x2
+     - __le16
+     - e_value_offs
+     - Location of this attribute's value on the disk block where it is stored.
+       Multiple attributes can share the same value. For an inode attribute
+       this value is relative to the start of the first entry; for a block this
+       value is relative to the start of the block (i.e. the header).
+   * - 0x4
+     - __le32
+     - e_value_inum
+     - The inode where the value is stored. Zero indicates the value is in the
+       same block as this entry. This field is only used if the
+       INCOMPAT_EA_INODE feature is enabled.
+   * - 0x8
+     - __le32
+     - e_value_size
+     - Length of attribute value.
+   * - 0xC
+     - __le32
+     - e_hash
+     - Hash value of attribute name and attribute value. The kernel doesn't
+       update the hash for in-inode attributes, so for that case this value
+       must be zero, because e2fsck validates any non-zero hash regardless of
+       where the xattr lives.
+   * - 0x10
+     - char
+     - e_name[e_name_len]
+     - Attribute name. Does not include trailing NULL.
+
+Attribute values can follow the end of the entry table. There appears to
+be a requirement that they be aligned to 4-byte boundaries.
The values +are stored starting at the end of the block and grow towards the +xattr_header/xattr_entry table. When the two collide, the overflow is +put into a separate disk block. If the disk block fills up, the +filesystem returns -ENOSPC. + +The first four fields of the ``ext4_xattr_entry`` are set to zero to +mark the end of the key list. + +Attribute Name Indices +~~~~~~~~~~~~~~~~~~~~~~ + +Logically speaking, extended attributes are a series of key=3Dvalue pairs. +The keys are assumed to be NULL-terminated strings. To reduce the amount +of on-disk space that the keys consume, the beginning of the key string +is matched against the attribute name index. If a match is found, the +attribute name index field is set, and matching string is removed from +the key name. Here is a map of name index values to key prefixes: + +.. list-table:: + :widths: 16 64 + :header-rows: 1 + + * - Name Index + - Key Prefix + * - 0 + - (no prefix) + * - 1 + - =E2=80=9Cuser.=E2=80=9D + * - 2 + - =E2=80=9Csystem.posix_acl_access=E2=80=9D + * - 3 + - =E2=80=9Csystem.posix_acl_default=E2=80=9D + * - 4 + - =E2=80=9Ctrusted.=E2=80=9D + * - 6 + - =E2=80=9Csecurity.=E2=80=9D + * - 7 + - =E2=80=9Csystem.=E2=80=9D (inline_data only?) + * - 8 + - =E2=80=9Csystem.richacl=E2=80=9D (SuSE kernels only?) + +For example, if the attribute key is =E2=80=9Cuser.fubar=E2=80=9D, the att= ribute name +index is set to 1 and the =E2=80=9Cfubar=E2=80=9D name is recorded on disk. + +POSIX ACLs +~~~~~~~~~~ + +POSIX ACLs are stored in a reduced version of the Linux kernel (and +libacl's) internal ACL format. The key difference is that the version +number is different (1) and the ``e_id`` field is only stored for named +user and group ACLs. diff --git a/Documentation/filesystems/ext4/ifork.rst b/Documentation/files= ystems/ext4/ifork.rst deleted file mode 100644 index dc31f505e6c835..00000000000000 --- a/Documentation/filesystems/ext4/ifork.rst +++ /dev/null @@ -1,194 +0,0 @@ -.. 
SPDX-License-Identifier: GPL-2.0 - -The Contents of inode.i_block ------------------------------- - -Depending on the type of file an inode describes, the 60 bytes of -storage in ``inode.i_block`` can be used in different ways. In general, -regular files and directories will use it for file block indexing -information, and special files will use it for special purposes. - -Symbolic Links -~~~~~~~~~~~~~~ - -The target of a symbolic link will be stored in this field if the target -string is less than 60 bytes long. Otherwise, either extents or block -maps will be used to allocate data blocks to store the link target. - -Direct/Indirect Block Addressing -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -In ext2/3, file block numbers were mapped to logical block numbers by -means of an (up to) three level 1-1 block map. To find the logical block -that stores a particular file block, the code would navigate through -this increasingly complicated structure. Notice that there is neither a -magic number nor a checksum to provide any level of confidence that the -block isn't full of garbage. - -.. ifconfig:: builder !=3D 'latex' - - .. include:: blockmap.rst - -.. ifconfig:: builder =3D=3D 'latex' - - [Table omitted because LaTeX doesn't support nested tables.] - -Note that with this block mapping scheme, it is necessary to fill out a -lot of mapping data even for a large contiguous file! This inefficiency -led to the creation of the extent mapping scheme, discussed below. - -Notice also that a file using this mapping scheme cannot be placed -higher than 2^32 blocks. - -Extent Tree -~~~~~~~~~~~ - -In ext4, the file to logical block map has been replaced with an extent -tree. Under the old scheme, allocating a contiguous run of 1,000 blocks -requires an indirect block to map all 1,000 entries; with extents, the -mapping is reduced to a single ``struct ext4_extent`` with -``ee_len =3D 1000``. 
If flex_bg is enabled, it is possible to allocate
-very large files with a single extent, at a considerable reduction in
-metadata block use, and some improvement in disk efficiency. The inode
-must have the extents flag (0x80000) set for this feature to be in
-use.
-
-Extents are arranged as a tree. Each node of the tree begins with a
-``struct ext4_extent_header``. If the node is an interior node
-(``eh.eh_depth`` > 0), the header is followed by ``eh.eh_entries``
-instances of ``struct ext4_extent_idx``; each of these index entries
-points to a block containing more nodes in the extent tree. If the node
-is a leaf node (``eh.eh_depth == 0``), then the header is followed by
-``eh.eh_entries`` instances of ``struct ext4_extent``; these instances
-point to the file's data blocks. The root node of the extent tree is
-stored in ``inode.i_block``, which allows for the first four extents to
-be recorded without the use of extra metadata blocks.
-
-The extent tree header is recorded in ``struct ext4_extent_header``,
-which is 12 bytes long:
-
-.. list-table::
-   :widths: 8 8 24 40
-   :header-rows: 1
-
-   * - Offset
-     - Size
-     - Name
-     - Description
-   * - 0x0
-     - __le16
-     - eh_magic
-     - Magic number, 0xF30A.
-   * - 0x2
-     - __le16
-     - eh_entries
-     - Number of valid entries following the header.
-   * - 0x4
-     - __le16
-     - eh_max
-     - Maximum number of entries that could follow the header.
-   * - 0x6
-     - __le16
-     - eh_depth
-     - Depth of this extent node in the extent tree. 0 = this extent node
-       points to data blocks; otherwise, this extent node points to other
-       extent nodes. The extent tree can be at most 5 levels deep: a logical
-       block number can be at most ``2^32``, and the smallest ``n`` that
-       satisfies ``4*(((blocksize - 12)/12)^n) >= 2^32`` is 5.
-   * - 0x8
-     - __le32
-     - eh_generation
-     - Generation of the tree. (Used by Lustre, but not standard ext4).
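To make the header table above and its depth bound more concrete, here is a small Python sketch. The names ``parse_extent_header`` and ``levels_needed`` are invented for this example, and ``levels_needed`` simply searches for the smallest ``n`` in the inequality quoted in the ``eh_depth`` row, using integer division for ``(blocksize - 12)/12``; it is an illustration of the arithmetic, not the kernel's own logic.

```python
import struct

# eh_magic, eh_entries, eh_max, eh_depth (__le16 each), eh_generation (__le32)
EXTENT_HEADER = struct.Struct("<4HI")  # 12 bytes
EXTENT_MAGIC = 0xF30A


def parse_extent_header(node: bytes) -> dict:
    """Decode the 12-byte header at the start of an extent tree node."""
    magic, entries, max_entries, depth, generation = EXTENT_HEADER.unpack_from(node)
    if magic != EXTENT_MAGIC:
        raise ValueError(f"bad extent magic {magic:#06x}")
    return {
        "eh_entries": entries,
        "eh_max": max_entries,
        "eh_depth": depth,
        "eh_generation": generation,
        "is_leaf": depth == 0,  # leaf nodes hold struct ext4_extent records
    }


def levels_needed(blocksize: int) -> int:
    """Smallest n with 4 * ((blocksize - 12) // 12) ** n >= 2 ** 32."""
    per_block = (blocksize - 12) // 12  # 12-byte entries after a 12-byte header
    capacity, n = 4, 0                  # the root in i_block holds 4 entries
    while capacity < 2 ** 32:
        capacity *= per_block
        n += 1
    return n
```

Under this reading, ``levels_needed(1024)`` evaluates to 5 and ``levels_needed(4096)`` to 4, so the bound of 5 quoted in the table corresponds to the smaller block sizes.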
-
-Internal nodes of the extent tree, also known as index nodes, are
-recorded as ``struct ext4_extent_idx``, and are 12 bytes long:
-
-.. list-table::
-   :widths: 8 8 24 40
-   :header-rows: 1
-
-   * - Offset
-     - Size
-     - Name
-     - Description
-   * - 0x0
-     - __le32
-     - ei_block
-     - This index node covers file blocks from 'block' onward.
-   * - 0x4
-     - __le32
-     - ei_leaf_lo
-     - Lower 32-bits of the block number of the extent node that is the next
-       level lower in the tree. The tree node pointed to can be either another
-       internal node or a leaf node, described below.
-   * - 0x8
-     - __le16
-     - ei_leaf_hi
-     - Upper 16-bits of the previous field.
-   * - 0xA
-     - __u16
-     - ei_unused
-     -
-
-Leaf nodes of the extent tree are recorded as ``struct ext4_extent``,
-and are also 12 bytes long:
-
-.. list-table::
-   :widths: 8 8 24 40
-   :header-rows: 1
-
-   * - Offset
-     - Size
-     - Name
-     - Description
-   * - 0x0
-     - __le32
-     - ee_block
-     - First file block number that this extent covers.
-   * - 0x4
-     - __le16
-     - ee_len
-     - Number of blocks covered by extent. If the value of this field is <=
-       32768, the extent is initialized. If the value of the field is > 32768,
-       the extent is uninitialized and the actual extent length is ``ee_len`` -
-       32768. Therefore, the maximum length of an initialized extent is 32768
-       blocks, and the maximum length of an uninitialized extent is 32767.
-   * - 0x6
-     - __le16
-     - ee_start_hi
-     - Upper 16-bits of the block number to which this extent points.
-   * - 0x8
-     - __le32
-     - ee_start_lo
-     - Lower 32-bits of the block number to which this extent points.
-
-Prior to the introduction of metadata checksums, the extent header +
-extent entries always left at least 4 bytes of unallocated space at the
-end of each extent tree data block (because (2^x % 12) >= 4). Therefore,
-the 32-bit checksum is inserted into this space. The 4 extents in the
-inode do not need checksumming, since the inode is already checksummed.
-The checksum is calculated against the FS UUID, the inode number, the -inode generation, and the entire extent block leading up to (but not -including) the checksum itself. - -``struct ext4_extent_tail`` is 4 bytes long: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - __le32 - - eb_checksum - - Checksum of the extent block, crc32c(uuid+inum+igeneration+extentbl= ock) - -Inline Data -~~~~~~~~~~~ - -If the inline data feature is enabled for the filesystem and the flag is -set for the inode, it is possible that the first 60 bytes of the file -data are stored here. diff --git a/Documentation/filesystems/ext4/inodes.rst b/Documentation/file= systems/ext4/inodes.rst deleted file mode 100644 index cfc6c16599312a..00000000000000 --- a/Documentation/filesystems/ext4/inodes.rst +++ /dev/null @@ -1,578 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Index Nodes ------------ - -In a regular UNIX filesystem, the inode stores all the metadata -pertaining to the file (time stamps, block maps, extended attributes, -etc), not the directory entry. To find the information associated with a -file, one must traverse the directory files to find the directory entry -associated with a file, then load the inode to find the metadata for -that file. ext4 appears to cheat (for performance reasons) a little bit -by storing a copy of the file type (normally stored in the inode) in the -directory entry. (Compare all this to FAT, which stores all the file -information directly in the directory entry, but does not support hard -links and is in general more seek-happy than ext4 due to its simpler -block allocator and extensive use of linked lists.) - -The inode table is a linear array of ``struct ext4_inode``. The table is -sized to have enough blocks to store at least -``sb.s_inode_size * sb.s_inodes_per_group`` bytes. 
The number of the -block group containing an inode can be calculated as -``(inode_number - 1) / sb.s_inodes_per_group``, and the offset into the -group's table is ``(inode_number - 1) % sb.s_inodes_per_group``. There -is no inode 0. - -The inode checksum is calculated against the FS UUID, the inode number, -and the inode structure itself. - -The inode table entry is laid out in ``struct ext4_inode``. - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - :class: longtable - - * - Offset - - Size - - Name - - Description - * - 0x0 - - __le16 - - i_mode - - File mode. See the table i_mode_ below. - * - 0x2 - - __le16 - - i_uid - - Lower 16-bits of Owner UID. - * - 0x4 - - __le32 - - i_size_lo - - Lower 32-bits of size in bytes. - * - 0x8 - - __le32 - - i_atime - - Last access time, in seconds since the epoch. However, if the EA_IN= ODE - inode flag is set, this inode stores an extended attribute value and - this field contains the checksum of the value. - * - 0xC - - __le32 - - i_ctime - - Last inode change time, in seconds since the epoch. However, if the - EA_INODE inode flag is set, this inode stores an extended attribute - value and this field contains the lower 32 bits of the attribute va= lue's - reference count. - * - 0x10 - - __le32 - - i_mtime - - Last data modification time, in seconds since the epoch. However, i= f the - EA_INODE inode flag is set, this inode stores an extended attribute - value and this field contains the number of the inode that owns the - extended attribute. - * - 0x14 - - __le32 - - i_dtime - - Deletion Time, in seconds since the epoch. - * - 0x18 - - __le16 - - i_gid - - Lower 16-bits of GID. - * - 0x1A - - __le16 - - i_links_count - - Hard link count. Normally, ext4 does not permit an inode to have mo= re - than 65,000 hard links. This applies to files as well as directorie= s, - which means that there cannot be more than 64,998 subdirectories in= a - directory (each subdirectory's '..' 
entry counts as a hard link, as= does - the '.' entry in the directory itself). With the DIR_NLINK feature - enabled, ext4 supports more than 64,998 subdirectories by setting t= his - field to 1 to indicate that the number of hard links is not known. - * - 0x1C - - __le32 - - i_blocks_lo - - Lower 32-bits of =E2=80=9Cblock=E2=80=9D count. If the huge_file fe= ature flag is not - set on the filesystem, the file consumes ``i_blocks_lo`` 512-byte b= locks - on disk. If huge_file is set and EXT4_HUGE_FILE_FL is NOT set in - ``inode.i_flags``, then the file consumes ``i_blocks_lo + (i_blocks= _hi - << 32)`` 512-byte blocks on disk. If huge_file is set and - EXT4_HUGE_FILE_FL IS set in ``inode.i_flags``, then this file - consumes (``i_blocks_lo + i_blocks_hi`` << 32) filesystem blocks on - disk. - * - 0x20 - - __le32 - - i_flags - - Inode flags. See the table i_flags_ below. - * - 0x24 - - 4 bytes - - i_osd1 - - See the table i_osd1_ for more details. - * - 0x28 - - 60 bytes - - i_block[EXT4_N_BLOCKS=3D15] - - Block map or extent tree. See the section =E2=80=9CThe Contents of = inode.i_block=E2=80=9D. - * - 0x64 - - __le32 - - i_generation - - File version (for NFS). - * - 0x68 - - __le32 - - i_file_acl_lo - - Lower 32-bits of extended attribute block. ACLs are of course one of - many possible extended attributes; I think the name of this field i= s a - result of the first use of extended attributes being for ACLs. - * - 0x6C - - __le32 - - i_size_high / i_dir_acl - - Upper 32-bits of file/directory size. In ext2/3 this field was named - i_dir_acl, though it was usually set to zero and never used. - * - 0x70 - - __le32 - - i_obso_faddr - - (Obsolete) fragment address. - * - 0x74 - - 12 bytes - - i_osd2 - - See the table i_osd2_ for more details. - * - 0x80 - - __le16 - - i_extra_isize - - Size of this inode - 128. Alternately, the size of the extended ino= de - fields beyond the original ext2 inode, including this field. 
- * - 0x82 - - __le16 - - i_checksum_hi - - Upper 16-bits of the inode checksum. - * - 0x84 - - __le32 - - i_ctime_extra - - Extra change time bits. This provides sub-second precision. See Ino= de - Timestamps section. - * - 0x88 - - __le32 - - i_mtime_extra - - Extra modification time bits. This provides sub-second precision. - * - 0x8C - - __le32 - - i_atime_extra - - Extra access time bits. This provides sub-second precision. - * - 0x90 - - __le32 - - i_crtime - - File creation time, in seconds since the epoch. - * - 0x94 - - __le32 - - i_crtime_extra - - Extra file creation time bits. This provides sub-second precision. - * - 0x98 - - __le32 - - i_version_hi - - Upper 32-bits for version number. - * - 0x9C - - __le32 - - i_projid - - Project ID. - -.. _i_mode: - -The ``i_mode`` value is a combination of the following flags: - -.. list-table:: - :widths: 16 64 - :header-rows: 1 - - * - Value - - Description - * - 0x1 - - S_IXOTH (Others may execute) - * - 0x2 - - S_IWOTH (Others may write) - * - 0x4 - - S_IROTH (Others may read) - * - 0x8 - - S_IXGRP (Group members may execute) - * - 0x10 - - S_IWGRP (Group members may write) - * - 0x20 - - S_IRGRP (Group members may read) - * - 0x40 - - S_IXUSR (Owner may execute) - * - 0x80 - - S_IWUSR (Owner may write) - * - 0x100 - - S_IRUSR (Owner may read) - * - 0x200 - - S_ISVTX (Sticky bit) - * - 0x400 - - S_ISGID (Set GID) - * - 0x800 - - S_ISUID (Set UID) - * - - - These are mutually-exclusive file types: - * - 0x1000 - - S_IFIFO (FIFO) - * - 0x2000 - - S_IFCHR (Character device) - * - 0x4000 - - S_IFDIR (Directory) - * - 0x6000 - - S_IFBLK (Block device) - * - 0x8000 - - S_IFREG (Regular file) - * - 0xA000 - - S_IFLNK (Symbolic link) - * - 0xC000 - - S_IFSOCK (Socket) - -.. _i_flags: - -The ``i_flags`` field is a combination of these values: - -.. list-table:: - :widths: 16 64 - :header-rows: 1 - - * - Value - - Description - * - 0x1 - - This file requires secure deletion (EXT4_SECRM_FL). 
(not implemente= d) - * - 0x2 - - This file should be preserved, should undeletion be desired - (EXT4_UNRM_FL). (not implemented) - * - 0x4 - - File is compressed (EXT4_COMPR_FL). (not really implemented) - * - 0x8 - - All writes to the file must be synchronous (EXT4_SYNC_FL). - * - 0x10 - - File is immutable (EXT4_IMMUTABLE_FL). - * - 0x20 - - File can only be appended (EXT4_APPEND_FL). - * - 0x40 - - The dump(1) utility should not dump this file (EXT4_NODUMP_FL). - * - 0x80 - - Do not update access time (EXT4_NOATIME_FL). - * - 0x100 - - Dirty compressed file (EXT4_DIRTY_FL). (not used) - * - 0x200 - - File has one or more compressed clusters (EXT4_COMPRBLK_FL). (not u= sed) - * - 0x400 - - Do not compress file (EXT4_NOCOMPR_FL). (not used) - * - 0x800 - - Encrypted inode (EXT4_ENCRYPT_FL). This bit value previously was - EXT4_ECOMPR_FL (compression error), which was never used. - * - 0x1000 - - Directory has hashed indexes (EXT4_INDEX_FL). - * - 0x2000 - - AFS magic directory (EXT4_IMAGIC_FL). - * - 0x4000 - - File data must always be written through the journal - (EXT4_JOURNAL_DATA_FL). - * - 0x8000 - - File tail should not be merged (EXT4_NOTAIL_FL). (not used by ext4) - * - 0x10000 - - All directory entry data should be written synchronously (see - ``dirsync``) (EXT4_DIRSYNC_FL). - * - 0x20000 - - Top of directory hierarchy (EXT4_TOPDIR_FL). - * - 0x40000 - - This is a huge file (EXT4_HUGE_FILE_FL). - * - 0x80000 - - Inode uses extents (EXT4_EXTENTS_FL). - * - 0x100000 - - Verity protected file (EXT4_VERITY_FL). - * - 0x200000 - - Inode stores a large extended attribute value in its data blocks - (EXT4_EA_INODE_FL). - * - 0x400000 - - This file has blocks allocated past EOF (EXT4_EOFBLOCKS_FL). - (deprecated) - * - 0x01000000 - - Inode is a snapshot (``EXT4_SNAPFILE_FL``). (not in mainline) - * - 0x04000000 - - Snapshot is being deleted (``EXT4_SNAPFILE_DELETED_FL``). 
(not in - mainline) - * - 0x08000000 - - Snapshot shrink has completed (``EXT4_SNAPFILE_SHRUNK_FL``). (not in - mainline) - * - 0x10000000 - - Inode has inline data (EXT4_INLINE_DATA_FL). - * - 0x20000000 - - Create children with the same project ID (EXT4_PROJINHERIT_FL). - * - 0x80000000 - - Reserved for ext4 library (EXT4_RESERVED_FL). - * - - - Aggregate flags: - * - 0x705BDFFF - - User-visible flags. - * - 0x604BC0FF - - User-modifiable flags. Note that while EXT4_JOURNAL_DATA_FL and - EXT4_EXTENTS_FL can be set with setattr, they are not in the kernel= 's - EXT4_FL_USER_MODIFIABLE mask, since it needs to handle the setting = of - these flags in a special manner and they are masked out of the set = of - flags that are saved directly to i_flags. - -.. _i_osd1: - -The ``osd1`` field has multiple meanings depending on the creator: - -Linux: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - __le32 - - l_i_version - - Inode version. However, if the EA_INODE inode flag is set, this ino= de - stores an extended attribute value and this field contains the uppe= r 32 - bits of the attribute value's reference count. - -Hurd: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - __le32 - - h_i_translator - - ?? - -Masix: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - __le32 - - m_i_reserved - - ?? - -.. _i_osd2: - -The ``osd2`` field has multiple meanings depending on the filesystem creat= or: - -Linux: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - __le16 - - l_i_blocks_high - - Upper 16-bits of the block count. Please see the note attached to - i_blocks_lo. 
- * - 0x2 - - __le16 - - l_i_file_acl_high - - Upper 16-bits of the extended attribute block (historically, the fi= le - ACL location). See the Extended Attributes section below. - * - 0x4 - - __le16 - - l_i_uid_high - - Upper 16-bits of the Owner UID. - * - 0x6 - - __le16 - - l_i_gid_high - - Upper 16-bits of the GID. - * - 0x8 - - __le16 - - l_i_checksum_lo - - Lower 16-bits of the inode checksum. - * - 0xA - - __le16 - - l_i_reserved - - Unused. - -Hurd: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - __le16 - - h_i_reserved1 - - ?? - * - 0x2 - - __u16 - - h_i_mode_high - - Upper 16-bits of the file mode. - * - 0x4 - - __le16 - - h_i_uid_high - - Upper 16-bits of the Owner UID. - * - 0x6 - - __le16 - - h_i_gid_high - - Upper 16-bits of the GID. - * - 0x8 - - __u32 - - h_i_author - - Author code? - -Masix: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - __le16 - - h_i_reserved1 - - ?? - * - 0x2 - - __u16 - - m_i_file_acl_high - - Upper 16-bits of the extended attribute block (historically, the fi= le - ACL location). - * - 0x4 - - __u32 - - m_i_reserved2[2] - - ?? - -Inode Size -~~~~~~~~~~ - -In ext2 and ext3, the inode structure size was fixed at 128 bytes -(``EXT2_GOOD_OLD_INODE_SIZE``) and each inode had a disk record size of -128 bytes. Starting with ext4, it is possible to allocate a larger -on-disk inode at format time for all inodes in the filesystem to provide -space beyond the end of the original ext2 inode. The on-disk inode -record size is recorded in the superblock as ``s_inode_size``. The -number of bytes actually used by struct ext4_inode beyond the original -128-byte ext2 inode is recorded in the ``i_extra_isize`` field for each -inode, which allows struct ext4_inode to grow for a new kernel without -having to upgrade all of the on-disk inodes. 
Access to fields beyond -EXT2_GOOD_OLD_INODE_SIZE should be verified to be within -``i_extra_isize``. By default, ext4 inode records are 256 bytes, and (as -of August 2019) the inode structure is 160 bytes -(``i_extra_isize =3D 32``). The extra space between the end of the inode -structure and the end of the inode record can be used to store extended -attributes. Each inode record can be as large as the filesystem block -size, though this is not terribly efficient. - -Finding an Inode -~~~~~~~~~~~~~~~~ - -Each block group contains ``sb->s_inodes_per_group`` inodes. Because -inode 0 is defined not to exist, this formula can be used to find the -block group that an inode lives in: -``bg =3D (inode_num - 1) / sb->s_inodes_per_group``. The particular inode -can be found within the block group's inode table at -``index =3D (inode_num - 1) % sb->s_inodes_per_group``. To get the byte -address within the inode table, use -``offset =3D index * sb->s_inode_size``. - -Inode Timestamps -~~~~~~~~~~~~~~~~ - -Four timestamps are recorded in the lower 128 bytes of the inode -structure -- inode change time (ctime), access time (atime), data -modification time (mtime), and deletion time (dtime). The four fields -are 32-bit signed integers that represent seconds since the Unix epoch -(1970-01-01 00:00:00 GMT), which means that the fields will overflow in -January 2038. If the filesystem does not have orphan_file feature, inodes -that are not linked from any directory but are still open (orphan inodes) = have -the dtime field overloaded for use with the orphan list. The superblock fi= eld -``s_last_orphan`` points to the first inode in the orphan list; dtime is t= hen -the number of the next orphaned inode, or zero if there are no more orphan= s. 
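The arithmetic from the "Finding an Inode" section above can be written out as a small helper. This is a sketch only: ``locate_inode`` is a name invented here, and its two size arguments stand in for ``sb->s_inodes_per_group`` and ``sb->s_inode_size`` read from the superblock.

```python
def locate_inode(inode_num: int, inodes_per_group: int, inode_size: int):
    """Return (block group, index in that group's inode table, byte offset).

    Hypothetical helper mirroring the formulas in the text; inode numbers
    start at 1 because inode 0 is defined not to exist.
    """
    if inode_num < 1:
        raise ValueError("there is no inode 0")
    bg, index = divmod(inode_num - 1, inodes_per_group)
    offset = index * inode_size  # byte address within the group's inode table
    return bg, index, offset
```

For instance, with 8192 inodes per group and 256-byte inode records, inode 12 resolves to group 0, index 11, byte offset 2816, while inode 8193 is the first entry of group 1.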
-
-If the inode structure size ``sb->s_inode_size`` is larger than 128
-bytes and the ``i_extra_isize`` field is large enough to encompass the
-respective ``i_[cma]time_extra`` field, the ctime, atime, and mtime
-inode fields are widened to 64 bits. Within this “extra” 32-bit field,
-the lower two bits are used to extend the 32-bit seconds field to be 34
-bits wide; the upper 30 bits are used to provide nanosecond timestamp
-accuracy. Therefore, timestamps should not overflow until May 2446.
-dtime was not widened. There is also a fifth timestamp to record inode
-creation time (crtime); this field is 64-bits wide and decoded in the
-same manner as 64-bit [cma]time. Neither crtime nor dtime is accessible
-through the regular stat() interface, though debugfs will report them.
-
-We use the 32-bit signed time value plus (2^32 * (extra epoch bits)).
-In other words:
-
-.. list-table::
-   :widths: 20 20 20 20 20
-   :header-rows: 1
-
-   * - Extra epoch bits
-     - MSB of 32-bit time
-     - Adjustment for signed 32-bit to 64-bit tv_sec
-     - Decoded 64-bit tv_sec
-     - valid time range
-   * - 0 0
-     - 1
-     - 0
-     - ``-0x80000000 - -0x00000001``
-     - 1901-12-13 to 1969-12-31
-   * - 0 0
-     - 0
-     - 0
-     - ``0x000000000 - 0x07fffffff``
-     - 1970-01-01 to 2038-01-19
-   * - 0 1
-     - 1
-     - 0x100000000
-     - ``0x080000000 - 0x0ffffffff``
-     - 2038-01-19 to 2106-02-07
-   * - 0 1
-     - 0
-     - 0x100000000
-     - ``0x100000000 - 0x17fffffff``
-     - 2106-02-07 to 2174-02-25
-   * - 1 0
-     - 1
-     - 0x200000000
-     - ``0x180000000 - 0x1ffffffff``
-     - 2174-02-25 to 2242-03-16
-   * - 1 0
-     - 0
-     - 0x200000000
-     - ``0x200000000 - 0x27fffffff``
-     - 2242-03-16 to 2310-04-04
-   * - 1 1
-     - 1
-     - 0x300000000
-     - ``0x280000000 - 0x2ffffffff``
-     - 2310-04-04 to 2378-04-22
-   * - 1 1
-     - 0
-     - 0x300000000
-     - ``0x300000000 - 0x37fffffff``
-     - 2378-04-22 to 2446-05-10
-
-This is a somewhat odd encoding since there are effectively seven times
-as many positive values as negative values.
There have also been
-long-standing bugs decoding and encoding dates beyond 2038, which don't
-seem to be fixed as of kernel 3.12 and e2fsprogs 1.42.8. 64-bit kernels
-incorrectly use the extra epoch bits 1,1 for dates between 1901 and
-1970. At some point the kernel will be fixed and e2fsck will fix this
-situation, assuming that it is run before 2310.
-- 
An old man doll... just what I always wanted! - Clara

From nobody Thu Oct 9 09:03:17 2025
From: Bagas Sanjaya
To: Linux Kernel Mailing List, Linux Documentation, Linux ext4
Cc: "Theodore Ts'o" , Andreas Dilger , Jonathan Corbet , "Darrick J.
Wong" , "Ritesh Harjani (IBM)" , Bagas Sanjaya Subject: [PATCH 4/4] Documentation: ext4: Reduce toctree depth Date: Wed, 18 Jun 2025 18:15:37 +0700 Message-ID: <20250618111544.22602-5-bagasdotme@gmail.com> X-Mailer: git-send-email 2.49.0 In-Reply-To: <20250618111544.22602-1-bagasdotme@gmail.com> References: <20250618111544.22602-1-bagasdotme@gmail.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Developer-Signature: v=1; a=openpgp-sha256; l=888; i=bagasdotme@gmail.com; h=from:subject; bh=WM+fr1mKglNeWX++E+JHAdUnDqdYDjuycRYTRJ7Ja+0=; b=owGbwMvMwCX2bWenZ2ig32LG02pJDBlB89W3TVX2ceLmfNWwxjmj9/E3LuE1sxOOiolx1zg3P /oaKjilo5SFQYyLQVZMkWVSIl/T6V1GIhfa1zrCzGFlAhnCwMUpABPpjGJkmM7E9q7RUfCWoJdM zVR9zwv/3v49EPwk0crlQ2v9Pet59xn+O7de7QryuHCe/+6VFKH/a5a+c5aXmHtq5epK8yv+bzb uYwQA X-Developer-Key: i=bagasdotme@gmail.com; a=openpgp; fpr=701B806FDCA5D3A58FFB8F7D7C276C64A5E44A1D Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" ext4 docs toctree has an arbitrary :maxdepth: of 6 (which is presumably intended to cover all possible heading levels), whereas the docs has at most 4-level section heading depth. Reduce the option instead to 2 (only showing the title and sections). Signed-off-by: Bagas Sanjaya --- Documentation/filesystems/ext4/index.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/filesystems/ext4/index.rst b/Documentation/files= ystems/ext4/index.rst index 705d813d558f0e..1ff8150c50e927 100644 --- a/Documentation/filesystems/ext4/index.rst +++ b/Documentation/filesystems/ext4/index.rst @@ -5,7 +5,7 @@ ext4 Data Structures and Algorithms =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D =20 .. toctree:: - :maxdepth: 6 + :maxdepth: 2 :numbered: =20 about --=20 An old man doll... just what I always wanted! - Clara