From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-pl1-f179.google.com (mail-pl1-f179.google.com [209.85.214.179]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 27520295DBA; Wed, 18 Jun 2025 11:16:08 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.179 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750245374; cv=none; b=djcAjk1BqVt4MNqwquIZKvpb7vFN3HHwGa4EoLMwRZ4znteikDoaOOpwdcej+wZCvPGRb5LuoB49pB+zVhvoUPZhwIbGEpMt8av0oRwtm2QeFUVHhxOITuhf3fPXgT8wCAYodDyYoa131KRgH93V99p9VU1dvdILl6RByGqe7z4= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750245374; c=relaxed/simple; bh=BUu9HUdtaYWu1FAd7SnaGwZ6GdU8g8tTTCu8MsamgeM=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=rhZYkctmrtEvWBkIlAQgB5mdnlE++in1piRfsyTQB9X7Wui9qJcP0ZHF+8Lmup7UGzeGv/Gx02e+y3V8ZRd6PzCLe1n72MpHtcLePqWSGagCejuT3nZ+N8t0dAKrA974Gt1rrgsLDzuUfkPT1gofZgQ8VbJoWfub5bpDe1EKLhY= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=GxIFb23u; arc=none smtp.client-ip=209.85.214.179 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="GxIFb23u" Received: by mail-pl1-f179.google.com with SMTP id d9443c01a7336-234d3261631so45337375ad.1; Wed, 18 Jun 2025 04:16:08 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1750245368; x=1750850168; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=b6KDUdKeadskP6aqdnMUYKVbmLKz5AEWQ17jezj/TW4=; b=GxIFb23uIQxsD4EaM3v/WicLTcfgzJQI+rPEY6X829lBdUOPVZ4ubRJJldfGY8P4HZ GcJZOb+azqwKd8clrodcy/FR/FvMIt6/v1WMDWCDzRjW7F3ATI8Sj8rrCj4D6vCXdL9p rDikdBZ8pHtAiUqYvy5v0PSTD90jubK+sS4GXGN35rtKe58l/QNZGANO5smXPCt4xKZf lS836WYIxLdw/BfNBheOQG9EaB22W1B2AnYTIRvAj9r6u0umOXMS1UqzmET+Zb93tzmE Ipt09zsA6lrfjmHwLlxgRitwdGoxlklF4kOm5zCRiP1B4uJrVsZTFHr7sa/ouj5Wudl7 QSuA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1750245368; x=1750850168; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=b6KDUdKeadskP6aqdnMUYKVbmLKz5AEWQ17jezj/TW4=; b=oBofUaHV4yQePF0CF5TJGoxP6r7eaKSFRdPm6vwDwkKYaOfUmofBVoUfXWDSKul3t8 4YgGhj6JLcYB2t+Z6SshpvZKJFgROC+Uqdra8Br+zLUxXYkEq5QFrLn44TovCH6g/udS 9OwFYpArPBPkWVJjer66oszTFhGu64sp8G+9K48tMbHaBkuRKjTq/lXdmK/4eZHGHPsQ 9wn/PpIduhTpByMJL0fho7Evuudei3rWnt2xCLuYq/ilKlN7hIbMlEXUMmY6aIbaqcIe FPwn1doY0oZZNoltjSVPEnE71tnvZoop4dSkAiGr74nT2C70naYwxXPJeCMzGKiZNsUi PRYw== X-Forwarded-Encrypted: i=1; AJvYcCUoZyGbjforJjYbiyY2ADhXXEluxNdyNoOWDojhBVzQGswgpfyPp61FBJy+NjMsVqLpPfGo7BgoWUg=@vger.kernel.org, AJvYcCUrEqZGB1ZMJ9wywIzAem2tT/+QdU3oekByerDiuVDjt2wRWlxwjDGrcbk1MyHr/G+2r7i6Jjib374KaQ==@vger.kernel.org X-Gm-Message-State: AOJu0Ywncv4U4/ldU0GXdHMj1LF/bTCXyfrVdEH4jxf6l6Np8Sggo0/N zvHWeF3Lq0kr3ScpLjnG8QYXQy+ZwOsCHJYNjBuaHm8Qr6v75C9CsYZE X-Gm-Gg: ASbGncuVlQ+Qnlc5nstADnrCu0ngpRZ6g6wH81an3ZSuDC53hrdSYmxtt2ki1F8EIQr TAGal9IHAA5XSDRuoXZ9sTNGF9Y82kn73xfK1Eqb3BLSaXnM1OuJun9yt/tMoT3FGUCxzSgM+dx 2k5ed+CQaDgdKlq49ckDJe6S/1YhXTAU6ImBVuf6r7rq4T0VA8aQwco12QCayxVATtvu1wuV7Ny e09SgAJLDR2kv30lReqFzqGabUWlEQfWv5ATpdSO6Vw7jsaEoJ8os/AOiK43YlgciDgIocAZdpk wjSMfQb2RPcoXO40/yY/vOvWzpea1k6D2ynqyxeQHc9Z0LOV+YUchyenWrUjDrkdw85n8gpt X-Google-Smtp-Source: AGHT+IFJIv1HyeWLDYYDnjy9zhtgS5Rqks2BjUnmyLphTQi+Ix18XVcp2XkFyRlLYUPMyv8jUuBNQQ== X-Received: by 2002:a17:903:19cc:b0:234:ef42:5d75 with SMTP id d9443c01a7336-2366b00ee6bmr228150495ad.20.1750245367032; Wed, 18 Jun 2025 04:16:07 -0700 (PDT) Received: from archie.me ([103.124.138.155]) by smtp.gmail.com with ESMTPSA id d9443c01a7336-2365e0d0bb5sm97548615ad.250.2025.06.18.04.16.02 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 18 Jun 2025 04:16:03 -0700 (PDT) Received: by archie.me (Postfix, from userid 1000) id DAAFC45E3AAE; Wed, 18 Jun 2025 18:15:59 +0700 (WIB) From: Bagas Sanjaya To: Linux Kernel Mailing List , Linux Documentation , Linux ext4 Cc: "Theodore Ts'o" , Andreas Dilger , Jonathan Corbet , "Darrick J. Wong" , "Ritesh Harjani (IBM)" , Bagas Sanjaya Subject: [PATCH 3/4] Documentation: ext4: Slurp included subdocs in dynamic structures docs Date: Wed, 18 Jun 2025 18:15:36 +0700 Message-ID: <20250618111544.22602-4-bagasdotme@gmail.com> X-Mailer: git-send-email 2.49.0 In-Reply-To: <20250618111544.22602-1-bagasdotme@gmail.com> References: <20250618111544.22602-1-bagasdotme@gmail.com> Precedence: bulk X-Mailing-List: linux-doc@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Developer-Signature: v=1; a=openpgp-sha256; l=94728; i=bagasdotme@gmail.com; h=from:subject; bh=BUu9HUdtaYWu1FAd7SnaGwZ6GdU8g8tTTCu8MsamgeM=; b=owGbwMvMwCX2bWenZ2ig32LG02pJDBlB89WWZTiFTugymbP31uW5l7wcOoS9jj/Lc9d2eH9mE /OV3TPjO0pZGMS4GGTFFFkmJfI1nd5lJHKhfa0jzBxWJpAhDFycAjCRKc8Z/scvi2XYJW8Z378q 90kdu/d7lf/fpm6dUyl+wW0OV7bf7jOMDNPmL80Leh3v3bbwm3gA7x//gNSoxnkmd2wuWtgm3I2 LZQMA X-Developer-Key: i=bagasdotme@gmail.com; a=openpgp; fpr=701B806FDCA5D3A58FFB8F7D7C276C64A5E44A1D Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Slurp subdocumentations for dynamic structures (dynamic.rst) by replacing reST include:: directive with their respective contents. Signed-off-by: Bagas Sanjaya --- Documentation/filesystems/ext4/attributes.rst | 191 --- Documentation/filesystems/ext4/directory.rst | 453 ------ Documentation/filesystems/ext4/dynamic.rst | 1415 ++++++++++++++++- Documentation/filesystems/ext4/ifork.rst | 194 --- Documentation/filesystems/ext4/inodes.rst | 578 ------- 5 files changed, 1411 insertions(+), 1420 deletions(-) delete mode 100644 Documentation/filesystems/ext4/attributes.rst delete mode 100644 Documentation/filesystems/ext4/directory.rst delete mode 100644 Documentation/filesystems/ext4/ifork.rst delete mode 100644 Documentation/filesystems/ext4/inodes.rst diff --git a/Documentation/filesystems/ext4/attributes.rst b/Documentation/filesystems/ext4/attributes.rst deleted file mode 100644 index 87814696a65b59..00000000000000 --- a/Documentation/filesystems/ext4/attributes.rst +++ /dev/null @@ -1,191 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Extended Attributes -------------------- - -Extended attributes (xattrs) are typically stored in a separate data -block on the disk and referenced from inodes via ``inode.i_file_acl*``. -The first use of extended attributes seems to have been for storing file -ACLs and other security data (selinux). With the ``user_xattr`` mount -option it is possible for users to store extended attributes so long as -all attribute names begin with “user”; this restriction seems to have -disappeared as of Linux 3.0. - -There are two places where extended attributes can be found. The first -place is between the end of each inode entry and the beginning of the -next inode entry. For example, if inode.i_extra_isize = 28 and -sb.inode_size = 256, then there are 256 - (128 + 28) = 100 bytes -available for in-inode extended attribute storage. The second place -where extended attributes can be found is in the block pointed to by -``inode.i_file_acl``. As of Linux 3.11, it is not possible for this -block to contain a pointer to a second extended attribute block (or even -the remaining blocks of a cluster). In theory it is possible for each -attribute's value to be stored in a separate data block, though as of -Linux 3.11 the code does not permit this. - -Keys are generally assumed to be ASCIIZ strings, whereas values can be -strings or binary data. - -Extended attributes, when stored after the inode, have a header -``ext4_xattr_ibody_header`` that is 4 bytes long: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Type - - Name - - Description - * - 0x0 - - __le32 - - h_magic - - Magic number for identification, 0xEA020000. This value is set by the - Linux driver, though e2fsprogs doesn't seem to check it(?) - -The beginning of an extended attribute block is in -``struct ext4_xattr_header``, which is 32 bytes long: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Type - - Name - - Description - * - 0x0 - - __le32 - - h_magic - - Magic number for identification, 0xEA020000. - * - 0x4 - - __le32 - - h_refcount - - Reference count. - * - 0x8 - - __le32 - - h_blocks - - Number of disk blocks used. - * - 0xC - - __le32 - - h_hash - - Hash value of all attributes. - * - 0x10 - - __le32 - - h_checksum - - Checksum of the extended attribute block. - * - 0x14 - - __u32 - - h_reserved[3] - - Zero. - -The checksum is calculated against the FS UUID, the 64-bit block number -of the extended attribute block, and the entire block (header + -entries). - -Following the ``struct ext4_xattr_header`` or -``struct ext4_xattr_ibody_header`` is an array of -``struct ext4_xattr_entry``; each of these entries is at least 16 bytes -long. When stored in an external block, the ``struct ext4_xattr_entry`` -entries must be stored in sorted order. The sort order is -``e_name_index``, then ``e_name_len``, and finally ``e_name``. -Attributes stored inside an inode do not need be stored in sorted order. - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Type - - Name - - Description - * - 0x0 - - __u8 - - e_name_len - - Length of name. - * - 0x1 - - __u8 - - e_name_index - - Attribute name index. There is a discussion of this below. - * - 0x2 - - __le16 - - e_value_offs - - Location of this attribute's value on the disk block where it is stored. - Multiple attributes can share the same value. For an inode attribute - this value is relative to the start of the first entry; for a block this - value is relative to the start of the block (i.e. the header). - * - 0x4 - - __le32 - - e_value_inum - - The inode where the value is stored. Zero indicates the value is in the - same block as this entry. This field is only used if the - INCOMPAT_EA_INODE feature is enabled. - * - 0x8 - - __le32 - - e_value_size - - Length of attribute value. - * - 0xC - - __le32 - - e_hash - - Hash value of attribute name and attribute value. The kernel doesn't - update the hash for in-inode attributes, so for that case this value - must be zero, because e2fsck validates any non-zero hash regardless of - where the xattr lives. - * - 0x10 - - char - - e_name[e_name_len] - - Attribute name. Does not include trailing NULL. - -Attribute values can follow the end of the entry table. There appears to -be a requirement that they be aligned to 4-byte boundaries. The values -are stored starting at the end of the block and grow towards the -xattr_header/xattr_entry table. When the two collide, the overflow is -put into a separate disk block. If the disk block fills up, the -filesystem returns -ENOSPC. - -The first four fields of the ``ext4_xattr_entry`` are set to zero to -mark the end of the key list. - -Attribute Name Indices -~~~~~~~~~~~~~~~~~~~~~~ - -Logically speaking, extended attributes are a series of key=value pairs. -The keys are assumed to be NULL-terminated strings. To reduce the amount -of on-disk space that the keys consume, the beginning of the key string -is matched against the attribute name index. If a match is found, the -attribute name index field is set, and matching string is removed from -the key name. Here is a map of name index values to key prefixes: - -.. list-table:: - :widths: 16 64 - :header-rows: 1 - - * - Name Index - - Key Prefix - * - 0 - - (no prefix) - * - 1 - - “user.” - * - 2 - - “system.posix_acl_access” - * - 3 - - “system.posix_acl_default” - * - 4 - - “trusted.” - * - 6 - - “security.” - * - 7 - - “system.” (inline_data only?) - * - 8 - - “system.richacl” (SuSE kernels only?) - -For example, if the attribute key is “user.fubar”, the attribute name -index is set to 1 and the “fubar” name is recorded on disk. - -POSIX ACLs -~~~~~~~~~~ - -POSIX ACLs are stored in a reduced version of the Linux kernel (and -libacl's) internal ACL format. The key difference is that the version -number is different (1) and the ``e_id`` field is only stored for named -user and group ACLs. diff --git a/Documentation/filesystems/ext4/directory.rst b/Documentation/filesystems/ext4/directory.rst deleted file mode 100644 index 6eece8e31df8b7..00000000000000 --- a/Documentation/filesystems/ext4/directory.rst +++ /dev/null @@ -1,453 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Directory Entries ------------------ - -In an ext4 filesystem, a directory is more or less a flat file that maps -an arbitrary byte string (usually ASCII) to an inode number on the -filesystem. There can be many directory entries across the filesystem -that reference the same inode number--these are known as hard links, and -that is why hard links cannot reference files on other filesystems. As -such, directory entries are found by reading the data block(s) -associated with a directory file for the particular directory entry that -is desired. - -Linear (Classic) Directories -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -By default, each directory lists its entries in an “almost-linear” -array. I write “almost” because it's not a linear array in the memory -sense because directory entries are not split across filesystem blocks. -Therefore, it is more accurate to say that a directory is a series of -data blocks and that each block contains a linear array of directory -entries. The end of each per-block array is signified by reaching the -end of the block; the last entry in the block has a record length that -takes it all the way to the end of the block. The end of the entire -directory is of course signified by reaching the end of the file. Unused -directory entries are signified by inode = 0. By default the filesystem -uses ``struct ext4_dir_entry_2`` for directory entries unless the -“filetype” feature flag is not set, in which case it uses -``struct ext4_dir_entry``. - -The original directory entry format is ``struct ext4_dir_entry``, which -is at most 263 bytes long, though on disk you'll need to reference -``dirent.rec_len`` to know for sure. - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - __le32 - - inode - - Number of the inode that this directory entry points to. - * - 0x4 - - __le16 - - rec_len - - Length of this directory entry. Must be a multiple of 4. - * - 0x6 - - __le16 - - name_len - - Length of the file name. - * - 0x8 - - char - - name[EXT4_NAME_LEN] - - File name. - -Since file names cannot be longer than 255 bytes, the new directory -entry format shortens the name_len field and uses the space for a file -type flag, probably to avoid having to load every inode during directory -tree traversal. This format is ``ext4_dir_entry_2``, which is at most -263 bytes long, though on disk you'll need to reference -``dirent.rec_len`` to know for sure. - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - __le32 - - inode - - Number of the inode that this directory entry points to. - * - 0x4 - - __le16 - - rec_len - - Length of this directory entry. - * - 0x6 - - __u8 - - name_len - - Length of the file name. - * - 0x7 - - __u8 - - file_type - - File type code, see ftype_ table below. - * - 0x8 - - char - - name[EXT4_NAME_LEN] - - File name. - -.. _ftype: - -The directory file type is one of the following values: - -.. list-table:: - :widths: 16 64 - :header-rows: 1 - - * - Value - - Description - * - 0x0 - - Unknown. - * - 0x1 - - Regular file. - * - 0x2 - - Directory. - * - 0x3 - - Character device file. - * - 0x4 - - Block device file. - * - 0x5 - - FIFO. - * - 0x6 - - Socket. - * - 0x7 - - Symbolic link. - -To support directories that are both encrypted and casefolded directories, we -must also include hash information in the directory entry. We append -``ext4_extended_dir_entry_2`` to ``ext4_dir_entry_2`` except for the entries -for dot and dotdot, which are kept the same. The structure follows immediately -after ``name`` and is included in the size listed by ``rec_len`` If a directory -entry uses this extension, it may be up to 271 bytes. - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - __le32 - - hash - - The hash of the directory name - * - 0x4 - - __le32 - - minor_hash - - The minor hash of the directory name - - -In order to add checksums to these classic directory blocks, a phony -``struct ext4_dir_entry`` is placed at the end of each leaf block to -hold the checksum. The directory entry is 12 bytes long. The inode -number and name_len fields are set to zero to fool old software into -ignoring an apparently empty directory entry, and the checksum is stored -in the place where the name normally goes. The structure is -``struct ext4_dir_entry_tail``: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - __le32 - - det_reserved_zero1 - - Inode number, which must be zero. - * - 0x4 - - __le16 - - det_rec_len - - Length of this directory entry, which must be 12. - * - 0x6 - - __u8 - - det_reserved_zero2 - - Length of the file name, which must be zero. - * - 0x7 - - __u8 - - det_reserved_ft - - File type, which must be 0xDE. - * - 0x8 - - __le32 - - det_checksum - - Directory leaf block checksum. - -The leaf directory block checksum is calculated against the FS UUID, the -directory's inode number, the directory's inode generation number, and -the entire directory entry block up to (but not including) the fake -directory entry. - -Hash Tree Directories -~~~~~~~~~~~~~~~~~~~~~ - -A linear array of directory entries isn't great for performance, so a -new feature was added to ext3 to provide a faster (but peculiar) -balanced tree keyed off a hash of the directory entry name. If the -EXT4_INDEX_FL (0x1000) flag is set in the inode, this directory uses a -hashed btree (htree) to organize and find directory entries. For -backwards read-only compatibility with ext2, this tree is actually -hidden inside the directory file, masquerading as “empty” directory data -blocks! It was stated previously that the end of the linear directory -entry table was signified with an entry pointing to inode 0; this is -(ab)used to fool the old linear-scan algorithm into thinking that the -rest of the directory block is empty so that it moves on. - -The root of the tree always lives in the first data block of the -directory. By ext2 custom, the '.' and '..' entries must appear at the -beginning of this first block, so they are put here as two -``struct ext4_dir_entry_2`` s and not stored in the tree. The rest of -the root node contains metadata about the tree and finally a hash->block -map to find nodes that are lower in the htree. If -``dx_root.info.indirect_levels`` is non-zero then the htree has two -levels; the data block pointed to by the root node's map is an interior -node, which is indexed by a minor hash. Interior nodes in this tree -contains a zeroed out ``struct ext4_dir_entry_2`` followed by a -minor_hash->block map to find leafe nodes. Leaf nodes contain a linear -array of all ``struct ext4_dir_entry_2``; all of these entries -(presumably) hash to the same value. If there is an overflow, the -entries simply overflow into the next leaf node, and the -least-significant bit of the hash (in the interior node map) that gets -us to this next leaf node is set. - -To traverse the directory as a htree, the code calculates the hash of -the desired file name and uses it to find the corresponding block -number. If the tree is flat, the block is a linear array of directory -entries that can be searched; otherwise, the minor hash of the file name -is computed and used against this second block to find the corresponding -third block number. That third block number will be a linear array of -directory entries. - -To traverse the directory as a linear array (such as the old code does), -the code simply reads every data block in the directory. The blocks used -for the htree will appear to have no entries (aside from '.' and '..') -and so only the leaf nodes will appear to have any interesting content. - -The root of the htree is in ``struct dx_root``, which is the full length -of a data block: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Type - - Name - - Description - * - 0x0 - - __le32 - - dot.inode - - inode number of this directory. - * - 0x4 - - __le16 - - dot.rec_len - - Length of this record, 12. - * - 0x6 - - u8 - - dot.name_len - - Length of the name, 1. - * - 0x7 - - u8 - - dot.file_type - - File type of this entry, 0x2 (directory) (if the feature flag is set). - * - 0x8 - - char - - dot.name[4] - - “.\0\0\0” - * - 0xC - - __le32 - - dotdot.inode - - inode number of parent directory. - * - 0x10 - - __le16 - - dotdot.rec_len - - block_size - 12. The record length is long enough to cover all htree - data. - * - 0x12 - - u8 - - dotdot.name_len - - Length of the name, 2. - * - 0x13 - - u8 - - dotdot.file_type - - File type of this entry, 0x2 (directory) (if the feature flag is set). - * - 0x14 - - char - - dotdot_name[4] - - “..\0\0” - * - 0x18 - - __le32 - - struct dx_root_info.reserved_zero - - Zero. - * - 0x1C - - u8 - - struct dx_root_info.hash_version - - Hash type, see dirhash_ table below. - * - 0x1D - - u8 - - struct dx_root_info.info_length - - Length of the tree information, 0x8. - * - 0x1E - - u8 - - struct dx_root_info.indirect_levels - - Depth of the htree. Cannot be larger than 3 if the INCOMPAT_LARGEDIR - feature is set; cannot be larger than 2 otherwise. - * - 0x1F - - u8 - - struct dx_root_info.unused_flags - - - * - 0x20 - - __le16 - - limit - - Maximum number of dx_entries that can follow this header, plus 1 for - the header itself. - * - 0x22 - - __le16 - - count - - Actual number of dx_entries that follow this header, plus 1 for the - header itself. - * - 0x24 - - __le32 - - block - - The block number (within the directory file) that goes with hash=0. - * - 0x28 - - struct dx_entry - - entries[0] - - As many 8-byte ``struct dx_entry`` as fits in the rest of the data block. - -.. _dirhash: - -The directory hash is one of the following values: - -.. list-table:: - :widths: 16 64 - :header-rows: 1 - - * - Value - - Description - * - 0x0 - - Legacy. - * - 0x1 - - Half MD4. - * - 0x2 - - Tea. - * - 0x3 - - Legacy, unsigned. - * - 0x4 - - Half MD4, unsigned. - * - 0x5 - - Tea, unsigned. - * - 0x6 - - Siphash. - -Interior nodes of an htree are recorded as ``struct dx_node``, which is -also the full length of a data block: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Type - - Name - - Description - * - 0x0 - - __le32 - - fake.inode - - Zero, to make it look like this entry is not in use. - * - 0x4 - - __le16 - - fake.rec_len - - The size of the block, in order to hide all of the dx_node data. - * - 0x6 - - u8 - - name_len - - Zero. There is no name for this “unused” directory entry. - * - 0x7 - - u8 - - file_type - - Zero. There is no file type for this “unused” directory entry. - * - 0x8 - - __le16 - - limit - - Maximum number of dx_entries that can follow this header, plus 1 for - the header itself. - * - 0xA - - __le16 - - count - - Actual number of dx_entries that follow this header, plus 1 for the - header itself. - * - 0xE - - __le32 - - block - - The block number (within the directory file) that goes with the lowest - hash value of this block. This value is stored in the parent block. - * - 0x12 - - struct dx_entry - - entries[0] - - As many 8-byte ``struct dx_entry`` as fits in the rest of the data block. - -The hash maps that exist in both ``struct dx_root`` and -``struct dx_node`` are recorded as ``struct dx_entry``, which is 8 bytes -long: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Type - - Name - - Description - * - 0x0 - - __le32 - - hash - - Hash code. - * - 0x4 - - __le32 - - block - - Block number (within the directory file, not filesystem blocks) of the - next node in the htree. - -(If you think this is all quite clever and peculiar, so does the -author.) - -If metadata checksums are enabled, the last 8 bytes of the directory -block (precisely the length of one dx_entry) are used to store a -``struct dx_tail``, which contains the checksum. The ``limit`` and -``count`` entries in the dx_root/dx_node structures are adjusted as -necessary to fit the dx_tail into the block. If there is no space for -the dx_tail, the user is notified to run e2fsck -D to rebuild the -directory index (which will ensure that there's space for the checksum. -The dx_tail structure is 8 bytes long and looks like this: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Type - - Name - - Description - * - 0x0 - - u32 - - dt_reserved - - Zero. - * - 0x4 - - __le32 - - dt_checksum - - Checksum of the htree directory block. - -The checksum is calculated against the FS UUID, the htree index header -(dx_root or dx_node), all of the htree indices (dx_entry) that are in -use, and the tail block (dx_tail). diff --git a/Documentation/filesystems/ext4/dynamic.rst b/Documentation/filesystems/ext4/dynamic.rst index bb0c84333341a5..225324e59fe57c 100644 --- a/Documentation/filesystems/ext4/dynamic.rst +++ b/Documentation/filesystems/ext4/dynamic.rst @@ -6,7 +6,1414 @@ Dynamic Structures Dynamic metadata are created on the fly when files and blocks are allocated to files. -.. include:: inodes.rst -.. include:: ifork.rst -.. include:: directory.rst -.. include:: attributes.rst +Index Nodes +----------- + +In a regular UNIX filesystem, the inode stores all the metadata +pertaining to the file (time stamps, block maps, extended attributes, +etc), not the directory entry. To find the information associated with a +file, one must traverse the directory files to find the directory entry +associated with a file, then load the inode to find the metadata for +that file. ext4 appears to cheat (for performance reasons) a little bit +by storing a copy of the file type (normally stored in the inode) in the +directory entry. (Compare all this to FAT, which stores all the file +information directly in the directory entry, but does not support hard +links and is in general more seek-happy than ext4 due to its simpler +block allocator and extensive use of linked lists.) + +The inode table is a linear array of ``struct ext4_inode``. The table is +sized to have enough blocks to store at least +``sb.s_inode_size * sb.s_inodes_per_group`` bytes. The number of the +block group containing an inode can be calculated as +``(inode_number - 1) / sb.s_inodes_per_group``, and the offset into the +group's table is ``(inode_number - 1) % sb.s_inodes_per_group``. There +is no inode 0. + +The inode checksum is calculated against the FS UUID, the inode number, +and the inode structure itself. + +The inode table entry is laid out in ``struct ext4_inode``. + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + :class: longtable + + * - Offset + - Size + - Name + - Description + * - 0x0 + - __le16 + - i_mode + - File mode. See the table i_mode_ below. + * - 0x2 + - __le16 + - i_uid + - Lower 16-bits of Owner UID. + * - 0x4 + - __le32 + - i_size_lo + - Lower 32-bits of size in bytes. + * - 0x8 + - __le32 + - i_atime + - Last access time, in seconds since the epoch. However, if the EA_INODE + inode flag is set, this inode stores an extended attribute value and + this field contains the checksum of the value. + * - 0xC + - __le32 + - i_ctime + - Last inode change time, in seconds since the epoch. However, if the + EA_INODE inode flag is set, this inode stores an extended attribute + value and this field contains the lower 32 bits of the attribute value's + reference count. + * - 0x10 + - __le32 + - i_mtime + - Last data modification time, in seconds since the epoch. However, if the + EA_INODE inode flag is set, this inode stores an extended attribute + value and this field contains the number of the inode that owns the + extended attribute. + * - 0x14 + - __le32 + - i_dtime + - Deletion Time, in seconds since the epoch. + * - 0x18 + - __le16 + - i_gid + - Lower 16-bits of GID. + * - 0x1A + - __le16 + - i_links_count + - Hard link count. Normally, ext4 does not permit an inode to have more + than 65,000 hard links. This applies to files as well as directories, + which means that there cannot be more than 64,998 subdirectories in a + directory (each subdirectory's '..' entry counts as a hard link, as does + the '.' entry in the directory itself). With the DIR_NLINK feature + enabled, ext4 supports more than 64,998 subdirectories by setting this + field to 1 to indicate that the number of hard links is not known. + * - 0x1C + - __le32 + - i_blocks_lo + - Lower 32-bits of “block” count. If the huge_file feature flag is not + set on the filesystem, the file consumes ``i_blocks_lo`` 512-byte blocks + on disk. If huge_file is set and EXT4_HUGE_FILE_FL is NOT set in + ``inode.i_flags``, then the file consumes ``i_blocks_lo + (i_blocks_hi + << 32)`` 512-byte blocks on disk. If huge_file is set and + EXT4_HUGE_FILE_FL IS set in ``inode.i_flags``, then this file + consumes (``i_blocks_lo + i_blocks_hi`` << 32) filesystem blocks on + disk. + * - 0x20 + - __le32 + - i_flags + - Inode flags. See the table i_flags_ below. + * - 0x24 + - 4 bytes + - i_osd1 + - See the table i_osd1_ for more details. + * - 0x28 + - 60 bytes + - i_block[EXT4_N_BLOCKS=15] + - Block map or extent tree. See the section “The Contents of inode.i_block”. + * - 0x64 + - __le32 + - i_generation + - File version (for NFS). + * - 0x68 + - __le32 + - i_file_acl_lo + - Lower 32-bits of extended attribute block. ACLs are of course one of + many possible extended attributes; I think the name of this field is a + result of the first use of extended attributes being for ACLs. + * - 0x6C + - __le32 + - i_size_high / i_dir_acl + - Upper 32-bits of file/directory size. In ext2/3 this field was named + i_dir_acl, though it was usually set to zero and never used. + * - 0x70 + - __le32 + - i_obso_faddr + - (Obsolete) fragment address. + * - 0x74 + - 12 bytes + - i_osd2 + - See the table i_osd2_ for more details. + * - 0x80 + - __le16 + - i_extra_isize + - Size of this inode - 128. Alternately, the size of the extended inode + fields beyond the original ext2 inode, including this field. + * - 0x82 + - __le16 + - i_checksum_hi + - Upper 16-bits of the inode checksum. + * - 0x84 + - __le32 + - i_ctime_extra + - Extra change time bits. This provides sub-second precision. See Inode + Timestamps section. + * - 0x88 + - __le32 + - i_mtime_extra + - Extra modification time bits. This provides sub-second precision. + * - 0x8C + - __le32 + - i_atime_extra + - Extra access time bits. This provides sub-second precision. + * - 0x90 + - __le32 + - i_crtime + - File creation time, in seconds since the epoch. + * - 0x94 + - __le32 + - i_crtime_extra + - Extra file creation time bits. This provides sub-second precision. + * - 0x98 + - __le32 + - i_version_hi + - Upper 32-bits for version number. + * - 0x9C + - __le32 + - i_projid + - Project ID. + +.. _i_mode: + +The ``i_mode`` value is a combination of the following flags: + +.. list-table:: + :widths: 16 64 + :header-rows: 1 + + * - Value + - Description + * - 0x1 + - S_IXOTH (Others may execute) + * - 0x2 + - S_IWOTH (Others may write) + * - 0x4 + - S_IROTH (Others may read) + * - 0x8 + - S_IXGRP (Group members may execute) + * - 0x10 + - S_IWGRP (Group members may write) + * - 0x20 + - S_IRGRP (Group members may read) + * - 0x40 + - S_IXUSR (Owner may execute) + * - 0x80 + - S_IWUSR (Owner may write) + * - 0x100 + - S_IRUSR (Owner may read) + * - 0x200 + - S_ISVTX (Sticky bit) + * - 0x400 + - S_ISGID (Set GID) + * - 0x800 + - S_ISUID (Set UID) + * - + - These are mutually-exclusive file types: + * - 0x1000 + - S_IFIFO (FIFO) + * - 0x2000 + - S_IFCHR (Character device) + * - 0x4000 + - S_IFDIR (Directory) + * - 0x6000 + - S_IFBLK (Block device) + * - 0x8000 + - S_IFREG (Regular file) + * - 0xA000 + - S_IFLNK (Symbolic link) + * - 0xC000 + - S_IFSOCK (Socket) + +.. _i_flags: + +The ``i_flags`` field is a combination of these values: + +.. list-table:: + :widths: 16 64 + :header-rows: 1 + + * - Value + - Description + * - 0x1 + - This file requires secure deletion (EXT4_SECRM_FL). (not implemented) + * - 0x2 + - This file should be preserved, should undeletion be desired + (EXT4_UNRM_FL). (not implemented) + * - 0x4 + - File is compressed (EXT4_COMPR_FL). (not really implemented) + * - 0x8 + - All writes to the file must be synchronous (EXT4_SYNC_FL). + * - 0x10 + - File is immutable (EXT4_IMMUTABLE_FL). + * - 0x20 + - File can only be appended (EXT4_APPEND_FL). + * - 0x40 + - The dump(1) utility should not dump this file (EXT4_NODUMP_FL). + * - 0x80 + - Do not update access time (EXT4_NOATIME_FL). + * - 0x100 + - Dirty compressed file (EXT4_DIRTY_FL). (not used) + * - 0x200 + - File has one or more compressed clusters (EXT4_COMPRBLK_FL). (not used) + * - 0x400 + - Do not compress file (EXT4_NOCOMPR_FL). (not used) + * - 0x800 + - Encrypted inode (EXT4_ENCRYPT_FL). This bit value previously was + EXT4_ECOMPR_FL (compression error), which was never used. + * - 0x1000 + - Directory has hashed indexes (EXT4_INDEX_FL). + * - 0x2000 + - AFS magic directory (EXT4_IMAGIC_FL). + * - 0x4000 + - File data must always be written through the journal + (EXT4_JOURNAL_DATA_FL). + * - 0x8000 + - File tail should not be merged (EXT4_NOTAIL_FL). (not used by ext4) + * - 0x10000 + - All directory entry data should be written synchronously (see + ``dirsync``) (EXT4_DIRSYNC_FL). + * - 0x20000 + - Top of directory hierarchy (EXT4_TOPDIR_FL). + * - 0x40000 + - This is a huge file (EXT4_HUGE_FILE_FL). + * - 0x80000 + - Inode uses extents (EXT4_EXTENTS_FL). + * - 0x100000 + - Verity protected file (EXT4_VERITY_FL). + * - 0x200000 + - Inode stores a large extended attribute value in its data blocks + (EXT4_EA_INODE_FL). + * - 0x400000 + - This file has blocks allocated past EOF (EXT4_EOFBLOCKS_FL). + (deprecated) + * - 0x01000000 + - Inode is a snapshot (``EXT4_SNAPFILE_FL``). (not in mainline) + * - 0x04000000 + - Snapshot is being deleted (``EXT4_SNAPFILE_DELETED_FL``). (not in + mainline) + * - 0x08000000 + - Snapshot shrink has completed (``EXT4_SNAPFILE_SHRUNK_FL``). (not in + mainline) + * - 0x10000000 + - Inode has inline data (EXT4_INLINE_DATA_FL). + * - 0x20000000 + - Create children with the same project ID (EXT4_PROJINHERIT_FL). + * - 0x80000000 + - Reserved for ext4 library (EXT4_RESERVED_FL). + * - + - Aggregate flags: + * - 0x705BDFFF + - User-visible flags. + * - 0x604BC0FF + - User-modifiable flags. Note that while EXT4_JOURNAL_DATA_FL and + EXT4_EXTENTS_FL can be set with setattr, they are not in the kernel's + EXT4_FL_USER_MODIFIABLE mask, since it needs to handle the setting of + these flags in a special manner and they are masked out of the set of + flags that are saved directly to i_flags. + +.. _i_osd1: + +The ``osd1`` field has multiple meanings depending on the creator: + +Linux: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Size + - Name + - Description + * - 0x0 + - __le32 + - l_i_version + - Inode version. However, if the EA_INODE inode flag is set, this inode + stores an extended attribute value and this field contains the upper 32 + bits of the attribute value's reference count. + +Hurd: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Size + - Name + - Description + * - 0x0 + - __le32 + - h_i_translator + - ?? + +Masix: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Size + - Name + - Description + * - 0x0 + - __le32 + - m_i_reserved + - ?? + +.. _i_osd2: + +The ``osd2`` field has multiple meanings depending on the filesystem creator: + +Linux: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Size + - Name + - Description + * - 0x0 + - __le16 + - l_i_blocks_high + - Upper 16-bits of the block count. Please see the note attached to + i_blocks_lo. + * - 0x2 + - __le16 + - l_i_file_acl_high + - Upper 16-bits of the extended attribute block (historically, the file + ACL location). See the Extended Attributes section below. + * - 0x4 + - __le16 + - l_i_uid_high + - Upper 16-bits of the Owner UID. + * - 0x6 + - __le16 + - l_i_gid_high + - Upper 16-bits of the GID. + * - 0x8 + - __le16 + - l_i_checksum_lo + - Lower 16-bits of the inode checksum. + * - 0xA + - __le16 + - l_i_reserved + - Unused. + +Hurd: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Size + - Name + - Description + * - 0x0 + - __le16 + - h_i_reserved1 + - ?? + * - 0x2 + - __u16 + - h_i_mode_high + - Upper 16-bits of the file mode. + * - 0x4 + - __le16 + - h_i_uid_high + - Upper 16-bits of the Owner UID. + * - 0x6 + - __le16 + - h_i_gid_high + - Upper 16-bits of the GID. + * - 0x8 + - __u32 + - h_i_author + - Author code? + +Masix: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Size + - Name + - Description + * - 0x0 + - __le16 + - h_i_reserved1 + - ?? + * - 0x2 + - __u16 + - m_i_file_acl_high + - Upper 16-bits of the extended attribute block (historically, the file + ACL location). + * - 0x4 + - __u32 + - m_i_reserved2[2] + - ?? + +Inode Size +~~~~~~~~~~ + +In ext2 and ext3, the inode structure size was fixed at 128 bytes +(``EXT2_GOOD_OLD_INODE_SIZE``) and each inode had a disk record size of +128 bytes. Starting with ext4, it is possible to allocate a larger +on-disk inode at format time for all inodes in the filesystem to provide +space beyond the end of the original ext2 inode. The on-disk inode +record size is recorded in the superblock as ``s_inode_size``. The +number of bytes actually used by struct ext4_inode beyond the original +128-byte ext2 inode is recorded in the ``i_extra_isize`` field for each +inode, which allows struct ext4_inode to grow for a new kernel without +having to upgrade all of the on-disk inodes. Access to fields beyond +EXT2_GOOD_OLD_INODE_SIZE should be verified to be within +``i_extra_isize``. By default, ext4 inode records are 256 bytes, and (as +of August 2019) the inode structure is 160 bytes +(``i_extra_isize = 32``). The extra space between the end of the inode +structure and the end of the inode record can be used to store extended +attributes. Each inode record can be as large as the filesystem block +size, though this is not terribly efficient. + +Finding an Inode +~~~~~~~~~~~~~~~~ + +Each block group contains ``sb->s_inodes_per_group`` inodes. Because +inode 0 is defined not to exist, this formula can be used to find the +block group that an inode lives in: +``bg = (inode_num - 1) / sb->s_inodes_per_group``. The particular inode +can be found within the block group's inode table at +``index = (inode_num - 1) % sb->s_inodes_per_group``. To get the byte +address within the inode table, use +``offset = index * sb->s_inode_size``. + +Inode Timestamps +~~~~~~~~~~~~~~~~ + +Four timestamps are recorded in the lower 128 bytes of the inode +structure -- inode change time (ctime), access time (atime), data +modification time (mtime), and deletion time (dtime). The four fields +are 32-bit signed integers that represent seconds since the Unix epoch +(1970-01-01 00:00:00 GMT), which means that the fields will overflow in +January 2038. If the filesystem does not have orphan_file feature, inodes +that are not linked from any directory but are still open (orphan inodes) have +the dtime field overloaded for use with the orphan list. The superblock field +``s_last_orphan`` points to the first inode in the orphan list; dtime is then +the number of the next orphaned inode, or zero if there are no more orphans. + +If the inode structure size ``sb->s_inode_size`` is larger than 128 +bytes and the ``i_inode_extra`` field is large enough to encompass the +respective ``i_[cma]time_extra`` field, the ctime, atime, and mtime +inode fields are widened to 64 bits. Within this “extra” 32-bit field, +the lower two bits are used to extend the 32-bit seconds field to be 34 +bit wide; the upper 30 bits are used to provide nanosecond timestamp +accuracy. Therefore, timestamps should not overflow until May 2446. +dtime was not widened. There is also a fifth timestamp to record inode +creation time (crtime); this field is 64-bits wide and decoded in the +same manner as 64-bit [cma]time. Neither crtime nor dtime are accessible +through the regular stat() interface, though debugfs will report them. + +We use the 32-bit signed time value plus (2^32 * (extra epoch bits)). +In other words: + +.. list-table:: + :widths: 20 20 20 20 20 + :header-rows: 1 + + * - Extra epoch bits + - MSB of 32-bit time + - Adjustment for signed 32-bit to 64-bit tv_sec + - Decoded 64-bit tv_sec + - valid time range + * - 0 0 + - 1 + - 0 + - ``-0x80000000 - -0x00000001`` + - 1901-12-13 to 1969-12-31 + * - 0 0 + - 0 + - 0 + - ``0x000000000 - 0x07fffffff`` + - 1970-01-01 to 2038-01-19 + * - 0 1 + - 1 + - 0x100000000 + - ``0x080000000 - 0x0ffffffff`` + - 2038-01-19 to 2106-02-07 + * - 0 1 + - 0 + - 0x100000000 + - ``0x100000000 - 0x17fffffff`` + - 2106-02-07 to 2174-02-25 + * - 1 0 + - 1 + - 0x200000000 + - ``0x180000000 - 0x1ffffffff`` + - 2174-02-25 to 2242-03-16 + * - 1 0 + - 0 + - 0x200000000 + - ``0x200000000 - 0x27fffffff`` + - 2242-03-16 to 2310-04-04 + * - 1 1 + - 1 + - 0x300000000 + - ``0x280000000 - 0x2ffffffff`` + - 2310-04-04 to 2378-04-22 + * - 1 1 + - 0 + - 0x300000000 + - ``0x300000000 - 0x37fffffff`` + - 2378-04-22 to 2446-05-10 + +This is a somewhat odd encoding since there are effectively seven times +as many positive values as negative values. There have also been +long-standing bugs decoding and encoding dates beyond 2038, which don't +seem to be fixed as of kernel 3.12 and e2fsprogs 1.42.8. 64-bit kernels +incorrectly use the extra epoch bits 1,1 for dates between 1901 and +1970. At some point the kernel will be fixed and e2fsck will fix this +situation, assuming that it is run before 2310. + +The Contents of inode.i_block +------------------------------ + +Depending on the type of file an inode describes, the 60 bytes of +storage in ``inode.i_block`` can be used in different ways. In general, +regular files and directories will use it for file block indexing +information, and special files will use it for special purposes. + +Symbolic Links +~~~~~~~~~~~~~~ + +The target of a symbolic link will be stored in this field if the target +string is less than 60 bytes long. Otherwise, either extents or block +maps will be used to allocate data blocks to store the link target. + +Direct/Indirect Block Addressing +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In ext2/3, file block numbers were mapped to logical block numbers by +means of an (up to) three level 1-1 block map. To find the logical block +that stores a particular file block, the code would navigate through +this increasingly complicated structure. Notice that there is neither a +magic number nor a checksum to provide any level of confidence that the +block isn't full of garbage. + +.. ifconfig:: builder != 'latex' + + .. include:: blockmap.rst + +.. ifconfig:: builder == 'latex' + + [Table omitted because LaTeX doesn't support nested tables.] + +Note that with this block mapping scheme, it is necessary to fill out a +lot of mapping data even for a large contiguous file! This inefficiency +led to the creation of the extent mapping scheme, discussed below. + +Notice also that a file using this mapping scheme cannot be placed +higher than 2^32 blocks. + +Extent Tree +~~~~~~~~~~~ + +In ext4, the file to logical block map has been replaced with an extent +tree. Under the old scheme, allocating a contiguous run of 1,000 blocks +requires an indirect block to map all 1,000 entries; with extents, the +mapping is reduced to a single ``struct ext4_extent`` with +``ee_len = 1000``. If flex_bg is enabled, it is possible to allocate +very large files with a single extent, at a considerable reduction in +metadata block use, and some improvement in disk efficiency. The inode +must have the extents flag (0x80000) flag set for this feature to be in +use. + +Extents are arranged as a tree. Each node of the tree begins with a +``struct ext4_extent_header``. If the node is an interior node +(``eh.eh_depth`` > 0), the header is followed by ``eh.eh_entries`` +instances of ``struct ext4_extent_idx``; each of these index entries +points to a block containing more nodes in the extent tree. If the node +is a leaf node (``eh.eh_depth == 0``), then the header is followed by +``eh.eh_entries`` instances of ``struct ext4_extent``; these instances +point to the file's data blocks. The root node of the extent tree is +stored in ``inode.i_block``, which allows for the first four extents to +be recorded without the use of extra metadata blocks. + +The extent tree header is recorded in ``struct ext4_extent_header``, +which is 12 bytes long: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Size + - Name + - Description + * - 0x0 + - __le16 + - eh_magic + - Magic number, 0xF30A. + * - 0x2 + - __le16 + - eh_entries + - Number of valid entries following the header. + * - 0x4 + - __le16 + - eh_max + - Maximum number of entries that could follow the header. + * - 0x6 + - __le16 + - eh_depth + - Depth of this extent node in the extent tree. 0 = this extent node + points to data blocks; otherwise, this extent node points to other + extent nodes. The extent tree can be at most 5 levels deep: a logical + block number can be at most ``2^32``, and the smallest ``n`` that + satisfies ``4*(((blocksize - 12)/12)^n) >= 2^32`` is 5. + * - 0x8 + - __le32 + - eh_generation + - Generation of the tree. (Used by Lustre, but not standard ext4). + +Internal nodes of the extent tree, also known as index nodes, are +recorded as ``struct ext4_extent_idx``, and are 12 bytes long: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Size + - Name + - Description + * - 0x0 + - __le32 + - ei_block + - This index node covers file blocks from 'block' onward. + * - 0x4 + - __le32 + - ei_leaf_lo + - Lower 32-bits of the block number of the extent node that is the next + level lower in the tree. The tree node pointed to can be either another + internal node or a leaf node, described below. + * - 0x8 + - __le16 + - ei_leaf_hi + - Upper 16-bits of the previous field. + * - 0xA + - __u16 + - ei_unused + - + +Leaf nodes of the extent tree are recorded as ``struct ext4_extent``, +and are also 12 bytes long: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Size + - Name + - Description + * - 0x0 + - __le32 + - ee_block + - First file block number that this extent covers. + * - 0x4 + - __le16 + - ee_len + - Number of blocks covered by extent. If the value of this field is <= + 32768, the extent is initialized. If the value of the field is > 32768, + the extent is uninitialized and the actual extent length is ``ee_len`` - + 32768. Therefore, the maximum length of a initialized extent is 32768 + blocks, and the maximum length of an uninitialized extent is 32767. + * - 0x6 + - __le16 + - ee_start_hi + - Upper 16-bits of the block number to which this extent points. + * - 0x8 + - __le32 + - ee_start_lo + - Lower 32-bits of the block number to which this extent points. + +Prior to the introduction of metadata checksums, the extent header + +extent entries always left at least 4 bytes of unallocated space at the +end of each extent tree data block (because (2^x % 12) >= 4). Therefore, +the 32-bit checksum is inserted into this space. The 4 extents in the +inode do not need checksumming, since the inode is already checksummed. +The checksum is calculated against the FS UUID, the inode number, the +inode generation, and the entire extent block leading up to (but not +including) the checksum itself. + +``struct ext4_extent_tail`` is 4 bytes long: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Size + - Name + - Description + * - 0x0 + - __le32 + - eb_checksum + - Checksum of the extent block, crc32c(uuid+inum+igeneration+extentblock) + +Inline Data +~~~~~~~~~~~ + +If the inline data feature is enabled for the filesystem and the flag is +set for the inode, it is possible that the first 60 bytes of the file +data are stored here. + +Directory Entries +----------------- + +In an ext4 filesystem, a directory is more or less a flat file that maps +an arbitrary byte string (usually ASCII) to an inode number on the +filesystem. There can be many directory entries across the filesystem +that reference the same inode number--these are known as hard links, and +that is why hard links cannot reference files on other filesystems. As +such, directory entries are found by reading the data block(s) +associated with a directory file for the particular directory entry that +is desired. + +Linear (Classic) Directories +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +By default, each directory lists its entries in an “almost-linear” +array. I write “almost” because it's not a linear array in the memory +sense because directory entries are not split across filesystem blocks. +Therefore, it is more accurate to say that a directory is a series of +data blocks and that each block contains a linear array of directory +entries. The end of each per-block array is signified by reaching the +end of the block; the last entry in the block has a record length that +takes it all the way to the end of the block. The end of the entire +directory is of course signified by reaching the end of the file. Unused +directory entries are signified by inode = 0. By default the filesystem +uses ``struct ext4_dir_entry_2`` for directory entries unless the +“filetype” feature flag is not set, in which case it uses +``struct ext4_dir_entry``. + +The original directory entry format is ``struct ext4_dir_entry``, which +is at most 263 bytes long, though on disk you'll need to reference +``dirent.rec_len`` to know for sure. + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Size + - Name + - Description + * - 0x0 + - __le32 + - inode + - Number of the inode that this directory entry points to. + * - 0x4 + - __le16 + - rec_len + - Length of this directory entry. Must be a multiple of 4. + * - 0x6 + - __le16 + - name_len + - Length of the file name. + * - 0x8 + - char + - name[EXT4_NAME_LEN] + - File name. + +Since file names cannot be longer than 255 bytes, the new directory +entry format shortens the name_len field and uses the space for a file +type flag, probably to avoid having to load every inode during directory +tree traversal. This format is ``ext4_dir_entry_2``, which is at most +263 bytes long, though on disk you'll need to reference +``dirent.rec_len`` to know for sure. + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Size + - Name + - Description + * - 0x0 + - __le32 + - inode + - Number of the inode that this directory entry points to. + * - 0x4 + - __le16 + - rec_len + - Length of this directory entry. + * - 0x6 + - __u8 + - name_len + - Length of the file name. + * - 0x7 + - __u8 + - file_type + - File type code, see ftype_ table below. + * - 0x8 + - char + - name[EXT4_NAME_LEN] + - File name. + +.. _ftype: + +The directory file type is one of the following values: + +.. list-table:: + :widths: 16 64 + :header-rows: 1 + + * - Value + - Description + * - 0x0 + - Unknown. + * - 0x1 + - Regular file. + * - 0x2 + - Directory. + * - 0x3 + - Character device file. + * - 0x4 + - Block device file. + * - 0x5 + - FIFO. + * - 0x6 + - Socket. + * - 0x7 + - Symbolic link. + +To support directories that are both encrypted and casefolded directories, we +must also include hash information in the directory entry. We append +``ext4_extended_dir_entry_2`` to ``ext4_dir_entry_2`` except for the entries +for dot and dotdot, which are kept the same. The structure follows immediately +after ``name`` and is included in the size listed by ``rec_len`` If a directory +entry uses this extension, it may be up to 271 bytes. + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Size + - Name + - Description + * - 0x0 + - __le32 + - hash + - The hash of the directory name + * - 0x4 + - __le32 + - minor_hash + - The minor hash of the directory name + + +In order to add checksums to these classic directory blocks, a phony +``struct ext4_dir_entry`` is placed at the end of each leaf block to +hold the checksum. The directory entry is 12 bytes long. The inode +number and name_len fields are set to zero to fool old software into +ignoring an apparently empty directory entry, and the checksum is stored +in the place where the name normally goes. The structure is +``struct ext4_dir_entry_tail``: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Size + - Name + - Description + * - 0x0 + - __le32 + - det_reserved_zero1 + - Inode number, which must be zero. + * - 0x4 + - __le16 + - det_rec_len + - Length of this directory entry, which must be 12. + * - 0x6 + - __u8 + - det_reserved_zero2 + - Length of the file name, which must be zero. + * - 0x7 + - __u8 + - det_reserved_ft + - File type, which must be 0xDE. + * - 0x8 + - __le32 + - det_checksum + - Directory leaf block checksum. + +The leaf directory block checksum is calculated against the FS UUID, the +directory's inode number, the directory's inode generation number, and +the entire directory entry block up to (but not including) the fake +directory entry. + +Hash Tree Directories +~~~~~~~~~~~~~~~~~~~~~ + +A linear array of directory entries isn't great for performance, so a +new feature was added to ext3 to provide a faster (but peculiar) +balanced tree keyed off a hash of the directory entry name. If the +EXT4_INDEX_FL (0x1000) flag is set in the inode, this directory uses a +hashed btree (htree) to organize and find directory entries. For +backwards read-only compatibility with ext2, this tree is actually +hidden inside the directory file, masquerading as “empty” directory data +blocks! It was stated previously that the end of the linear directory +entry table was signified with an entry pointing to inode 0; this is +(ab)used to fool the old linear-scan algorithm into thinking that the +rest of the directory block is empty so that it moves on. + +The root of the tree always lives in the first data block of the +directory. By ext2 custom, the '.' and '..' entries must appear at the +beginning of this first block, so they are put here as two +``struct ext4_dir_entry_2`` s and not stored in the tree. The rest of +the root node contains metadata about the tree and finally a hash->block +map to find nodes that are lower in the htree. If +``dx_root.info.indirect_levels`` is non-zero then the htree has two +levels; the data block pointed to by the root node's map is an interior +node, which is indexed by a minor hash. Interior nodes in this tree +contains a zeroed out ``struct ext4_dir_entry_2`` followed by a +minor_hash->block map to find leafe nodes. Leaf nodes contain a linear +array of all ``struct ext4_dir_entry_2``; all of these entries +(presumably) hash to the same value. If there is an overflow, the +entries simply overflow into the next leaf node, and the +least-significant bit of the hash (in the interior node map) that gets +us to this next leaf node is set. + +To traverse the directory as a htree, the code calculates the hash of +the desired file name and uses it to find the corresponding block +number. If the tree is flat, the block is a linear array of directory +entries that can be searched; otherwise, the minor hash of the file name +is computed and used against this second block to find the corresponding +third block number. That third block number will be a linear array of +directory entries. + +To traverse the directory as a linear array (such as the old code does), +the code simply reads every data block in the directory. The blocks used +for the htree will appear to have no entries (aside from '.' and '..') +and so only the leaf nodes will appear to have any interesting content. + +The root of the htree is in ``struct dx_root``, which is the full length +of a data block: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Type + - Name + - Description + * - 0x0 + - __le32 + - dot.inode + - inode number of this directory. + * - 0x4 + - __le16 + - dot.rec_len + - Length of this record, 12. + * - 0x6 + - u8 + - dot.name_len + - Length of the name, 1. + * - 0x7 + - u8 + - dot.file_type + - File type of this entry, 0x2 (directory) (if the feature flag is set). + * - 0x8 + - char + - dot.name[4] + - “.\0\0\0” + * - 0xC + - __le32 + - dotdot.inode + - inode number of parent directory. + * - 0x10 + - __le16 + - dotdot.rec_len + - block_size - 12. The record length is long enough to cover all htree + data. + * - 0x12 + - u8 + - dotdot.name_len + - Length of the name, 2. + * - 0x13 + - u8 + - dotdot.file_type + - File type of this entry, 0x2 (directory) (if the feature flag is set). + * - 0x14 + - char + - dotdot_name[4] + - “..\0\0” + * - 0x18 + - __le32 + - struct dx_root_info.reserved_zero + - Zero. + * - 0x1C + - u8 + - struct dx_root_info.hash_version + - Hash type, see dirhash_ table below. + * - 0x1D + - u8 + - struct dx_root_info.info_length + - Length of the tree information, 0x8. + * - 0x1E + - u8 + - struct dx_root_info.indirect_levels + - Depth of the htree. Cannot be larger than 3 if the INCOMPAT_LARGEDIR + feature is set; cannot be larger than 2 otherwise. + * - 0x1F + - u8 + - struct dx_root_info.unused_flags + - + * - 0x20 + - __le16 + - limit + - Maximum number of dx_entries that can follow this header, plus 1 for + the header itself. + * - 0x22 + - __le16 + - count + - Actual number of dx_entries that follow this header, plus 1 for the + header itself. + * - 0x24 + - __le32 + - block + - The block number (within the directory file) that goes with hash=0. + * - 0x28 + - struct dx_entry + - entries[0] + - As many 8-byte ``struct dx_entry`` as fits in the rest of the data block. + +.. _dirhash: + +The directory hash is one of the following values: + +.. list-table:: + :widths: 16 64 + :header-rows: 1 + + * - Value + - Description + * - 0x0 + - Legacy. + * - 0x1 + - Half MD4. + * - 0x2 + - Tea. + * - 0x3 + - Legacy, unsigned. + * - 0x4 + - Half MD4, unsigned. + * - 0x5 + - Tea, unsigned. + * - 0x6 + - Siphash. + +Interior nodes of an htree are recorded as ``struct dx_node``, which is +also the full length of a data block: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Type + - Name + - Description + * - 0x0 + - __le32 + - fake.inode + - Zero, to make it look like this entry is not in use. + * - 0x4 + - __le16 + - fake.rec_len + - The size of the block, in order to hide all of the dx_node data. + * - 0x6 + - u8 + - name_len + - Zero. There is no name for this “unused” directory entry. + * - 0x7 + - u8 + - file_type + - Zero. There is no file type for this “unused” directory entry. + * - 0x8 + - __le16 + - limit + - Maximum number of dx_entries that can follow this header, plus 1 for + the header itself. + * - 0xA + - __le16 + - count + - Actual number of dx_entries that follow this header, plus 1 for the + header itself. + * - 0xE + - __le32 + - block + - The block number (within the directory file) that goes with the lowest + hash value of this block. This value is stored in the parent block. + * - 0x12 + - struct dx_entry + - entries[0] + - As many 8-byte ``struct dx_entry`` as fits in the rest of the data block. + +The hash maps that exist in both ``struct dx_root`` and +``struct dx_node`` are recorded as ``struct dx_entry``, which is 8 bytes +long: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Type + - Name + - Description + * - 0x0 + - __le32 + - hash + - Hash code. + * - 0x4 + - __le32 + - block + - Block number (within the directory file, not filesystem blocks) of the + next node in the htree. + +(If you think this is all quite clever and peculiar, so does the +author.) + +If metadata checksums are enabled, the last 8 bytes of the directory +block (precisely the length of one dx_entry) are used to store a +``struct dx_tail``, which contains the checksum. The ``limit`` and +``count`` entries in the dx_root/dx_node structures are adjusted as +necessary to fit the dx_tail into the block. If there is no space for +the dx_tail, the user is notified to run e2fsck -D to rebuild the +directory index (which will ensure that there's space for the checksum. +The dx_tail structure is 8 bytes long and looks like this: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Type + - Name + - Description + * - 0x0 + - u32 + - dt_reserved + - Zero. + * - 0x4 + - __le32 + - dt_checksum + - Checksum of the htree directory block. + +The checksum is calculated against the FS UUID, the htree index header +(dx_root or dx_node), all of the htree indices (dx_entry) that are in +use, and the tail block (dx_tail). + +Extended Attributes +------------------- + +Extended attributes (xattrs) are typically stored in a separate data +block on the disk and referenced from inodes via ``inode.i_file_acl*``. +The first use of extended attributes seems to have been for storing file +ACLs and other security data (selinux). With the ``user_xattr`` mount +option it is possible for users to store extended attributes so long as +all attribute names begin with “user”; this restriction seems to have +disappeared as of Linux 3.0. + +There are two places where extended attributes can be found. The first +place is between the end of each inode entry and the beginning of the +next inode entry. For example, if inode.i_extra_isize = 28 and +sb.inode_size = 256, then there are 256 - (128 + 28) = 100 bytes +available for in-inode extended attribute storage. The second place +where extended attributes can be found is in the block pointed to by +``inode.i_file_acl``. As of Linux 3.11, it is not possible for this +block to contain a pointer to a second extended attribute block (or even +the remaining blocks of a cluster). In theory it is possible for each +attribute's value to be stored in a separate data block, though as of +Linux 3.11 the code does not permit this. + +Keys are generally assumed to be ASCIIZ strings, whereas values can be +strings or binary data. + +Extended attributes, when stored after the inode, have a header +``ext4_xattr_ibody_header`` that is 4 bytes long: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Type + - Name + - Description + * - 0x0 + - __le32 + - h_magic + - Magic number for identification, 0xEA020000. This value is set by the + Linux driver, though e2fsprogs doesn't seem to check it(?) + +The beginning of an extended attribute block is in +``struct ext4_xattr_header``, which is 32 bytes long: + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Type + - Name + - Description + * - 0x0 + - __le32 + - h_magic + - Magic number for identification, 0xEA020000. + * - 0x4 + - __le32 + - h_refcount + - Reference count. + * - 0x8 + - __le32 + - h_blocks + - Number of disk blocks used. + * - 0xC + - __le32 + - h_hash + - Hash value of all attributes. + * - 0x10 + - __le32 + - h_checksum + - Checksum of the extended attribute block. + * - 0x14 + - __u32 + - h_reserved[3] + - Zero. + +The checksum is calculated against the FS UUID, the 64-bit block number +of the extended attribute block, and the entire block (header + +entries). + +Following the ``struct ext4_xattr_header`` or +``struct ext4_xattr_ibody_header`` is an array of +``struct ext4_xattr_entry``; each of these entries is at least 16 bytes +long. When stored in an external block, the ``struct ext4_xattr_entry`` +entries must be stored in sorted order. The sort order is +``e_name_index``, then ``e_name_len``, and finally ``e_name``. +Attributes stored inside an inode do not need be stored in sorted order. + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Type + - Name + - Description + * - 0x0 + - __u8 + - e_name_len + - Length of name. + * - 0x1 + - __u8 + - e_name_index + - Attribute name index. There is a discussion of this below. + * - 0x2 + - __le16 + - e_value_offs + - Location of this attribute's value on the disk block where it is stored. + Multiple attributes can share the same value. For an inode attribute + this value is relative to the start of the first entry; for a block this + value is relative to the start of the block (i.e. the header). + * - 0x4 + - __le32 + - e_value_inum + - The inode where the value is stored. Zero indicates the value is in the + same block as this entry. This field is only used if the + INCOMPAT_EA_INODE feature is enabled. + * - 0x8 + - __le32 + - e_value_size + - Length of attribute value. + * - 0xC + - __le32 + - e_hash + - Hash value of attribute name and attribute value. The kernel doesn't + update the hash for in-inode attributes, so for that case this value + must be zero, because e2fsck validates any non-zero hash regardless of + where the xattr lives. + * - 0x10 + - char + - e_name[e_name_len] + - Attribute name. Does not include trailing NULL. + +Attribute values can follow the end of the entry table. There appears to +be a requirement that they be aligned to 4-byte boundaries. The values +are stored starting at the end of the block and grow towards the +xattr_header/xattr_entry table. When the two collide, the overflow is +put into a separate disk block. If the disk block fills up, the +filesystem returns -ENOSPC. + +The first four fields of the ``ext4_xattr_entry`` are set to zero to +mark the end of the key list. + +Attribute Name Indices +~~~~~~~~~~~~~~~~~~~~~~ + +Logically speaking, extended attributes are a series of key=value pairs. +The keys are assumed to be NULL-terminated strings. To reduce the amount +of on-disk space that the keys consume, the beginning of the key string +is matched against the attribute name index. If a match is found, the +attribute name index field is set, and matching string is removed from +the key name. Here is a map of name index values to key prefixes: + +.. list-table:: + :widths: 16 64 + :header-rows: 1 + + * - Name Index + - Key Prefix + * - 0 + - (no prefix) + * - 1 + - “user.” + * - 2 + - “system.posix_acl_access” + * - 3 + - “system.posix_acl_default” + * - 4 + - “trusted.” + * - 6 + - “security.” + * - 7 + - “system.” (inline_data only?) + * - 8 + - “system.richacl” (SuSE kernels only?) + +For example, if the attribute key is “user.fubar”, the attribute name +index is set to 1 and the “fubar” name is recorded on disk. + +POSIX ACLs +~~~~~~~~~~ + +POSIX ACLs are stored in a reduced version of the Linux kernel (and +libacl's) internal ACL format. The key difference is that the version +number is different (1) and the ``e_id`` field is only stored for named +user and group ACLs. diff --git a/Documentation/filesystems/ext4/ifork.rst b/Documentation/filesystems/ext4/ifork.rst deleted file mode 100644 index dc31f505e6c835..00000000000000 --- a/Documentation/filesystems/ext4/ifork.rst +++ /dev/null @@ -1,194 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -The Contents of inode.i_block ------------------------------- - -Depending on the type of file an inode describes, the 60 bytes of -storage in ``inode.i_block`` can be used in different ways. In general, -regular files and directories will use it for file block indexing -information, and special files will use it for special purposes. - -Symbolic Links -~~~~~~~~~~~~~~ - -The target of a symbolic link will be stored in this field if the target -string is less than 60 bytes long. Otherwise, either extents or block -maps will be used to allocate data blocks to store the link target. - -Direct/Indirect Block Addressing -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -In ext2/3, file block numbers were mapped to logical block numbers by -means of an (up to) three level 1-1 block map. To find the logical block -that stores a particular file block, the code would navigate through -this increasingly complicated structure. Notice that there is neither a -magic number nor a checksum to provide any level of confidence that the -block isn't full of garbage. - -.. ifconfig:: builder != 'latex' - - .. include:: blockmap.rst - -.. ifconfig:: builder == 'latex' - - [Table omitted because LaTeX doesn't support nested tables.] - -Note that with this block mapping scheme, it is necessary to fill out a -lot of mapping data even for a large contiguous file! This inefficiency -led to the creation of the extent mapping scheme, discussed below. - -Notice also that a file using this mapping scheme cannot be placed -higher than 2^32 blocks. - -Extent Tree -~~~~~~~~~~~ - -In ext4, the file to logical block map has been replaced with an extent -tree. Under the old scheme, allocating a contiguous run of 1,000 blocks -requires an indirect block to map all 1,000 entries; with extents, the -mapping is reduced to a single ``struct ext4_extent`` with -``ee_len = 1000``. If flex_bg is enabled, it is possible to allocate -very large files with a single extent, at a considerable reduction in -metadata block use, and some improvement in disk efficiency. The inode -must have the extents flag (0x80000) flag set for this feature to be in -use. - -Extents are arranged as a tree. Each node of the tree begins with a -``struct ext4_extent_header``. If the node is an interior node -(``eh.eh_depth`` > 0), the header is followed by ``eh.eh_entries`` -instances of ``struct ext4_extent_idx``; each of these index entries -points to a block containing more nodes in the extent tree. If the node -is a leaf node (``eh.eh_depth == 0``), then the header is followed by -``eh.eh_entries`` instances of ``struct ext4_extent``; these instances -point to the file's data blocks. The root node of the extent tree is -stored in ``inode.i_block``, which allows for the first four extents to -be recorded without the use of extra metadata blocks. - -The extent tree header is recorded in ``struct ext4_extent_header``, -which is 12 bytes long: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - __le16 - - eh_magic - - Magic number, 0xF30A. - * - 0x2 - - __le16 - - eh_entries - - Number of valid entries following the header. - * - 0x4 - - __le16 - - eh_max - - Maximum number of entries that could follow the header. - * - 0x6 - - __le16 - - eh_depth - - Depth of this extent node in the extent tree. 0 = this extent node - points to data blocks; otherwise, this extent node points to other - extent nodes. The extent tree can be at most 5 levels deep: a logical - block number can be at most ``2^32``, and the smallest ``n`` that - satisfies ``4*(((blocksize - 12)/12)^n) >= 2^32`` is 5. - * - 0x8 - - __le32 - - eh_generation - - Generation of the tree. (Used by Lustre, but not standard ext4). - -Internal nodes of the extent tree, also known as index nodes, are -recorded as ``struct ext4_extent_idx``, and are 12 bytes long: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - __le32 - - ei_block - - This index node covers file blocks from 'block' onward. - * - 0x4 - - __le32 - - ei_leaf_lo - - Lower 32-bits of the block number of the extent node that is the next - level lower in the tree. The tree node pointed to can be either another - internal node or a leaf node, described below. - * - 0x8 - - __le16 - - ei_leaf_hi - - Upper 16-bits of the previous field. - * - 0xA - - __u16 - - ei_unused - - - -Leaf nodes of the extent tree are recorded as ``struct ext4_extent``, -and are also 12 bytes long: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - __le32 - - ee_block - - First file block number that this extent covers. - * - 0x4 - - __le16 - - ee_len - - Number of blocks covered by extent. If the value of this field is <= - 32768, the extent is initialized. If the value of the field is > 32768, - the extent is uninitialized and the actual extent length is ``ee_len`` - - 32768. Therefore, the maximum length of a initialized extent is 32768 - blocks, and the maximum length of an uninitialized extent is 32767. - * - 0x6 - - __le16 - - ee_start_hi - - Upper 16-bits of the block number to which this extent points. - * - 0x8 - - __le32 - - ee_start_lo - - Lower 32-bits of the block number to which this extent points. - -Prior to the introduction of metadata checksums, the extent header + -extent entries always left at least 4 bytes of unallocated space at the -end of each extent tree data block (because (2^x % 12) >= 4). Therefore, -the 32-bit checksum is inserted into this space. The 4 extents in the -inode do not need checksumming, since the inode is already checksummed. -The checksum is calculated against the FS UUID, the inode number, the -inode generation, and the entire extent block leading up to (but not -including) the checksum itself. - -``struct ext4_extent_tail`` is 4 bytes long: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - __le32 - - eb_checksum - - Checksum of the extent block, crc32c(uuid+inum+igeneration+extentblock) - -Inline Data -~~~~~~~~~~~ - -If the inline data feature is enabled for the filesystem and the flag is -set for the inode, it is possible that the first 60 bytes of the file -data are stored here. diff --git a/Documentation/filesystems/ext4/inodes.rst b/Documentation/filesystems/ext4/inodes.rst deleted file mode 100644 index cfc6c16599312a..00000000000000 --- a/Documentation/filesystems/ext4/inodes.rst +++ /dev/null @@ -1,578 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Index Nodes ------------ - -In a regular UNIX filesystem, the inode stores all the metadata -pertaining to the file (time stamps, block maps, extended attributes, -etc), not the directory entry. To find the information associated with a -file, one must traverse the directory files to find the directory entry -associated with a file, then load the inode to find the metadata for -that file. ext4 appears to cheat (for performance reasons) a little bit -by storing a copy of the file type (normally stored in the inode) in the -directory entry. (Compare all this to FAT, which stores all the file -information directly in the directory entry, but does not support hard -links and is in general more seek-happy than ext4 due to its simpler -block allocator and extensive use of linked lists.) - -The inode table is a linear array of ``struct ext4_inode``. The table is -sized to have enough blocks to store at least -``sb.s_inode_size * sb.s_inodes_per_group`` bytes. The number of the -block group containing an inode can be calculated as -``(inode_number - 1) / sb.s_inodes_per_group``, and the offset into the -group's table is ``(inode_number - 1) % sb.s_inodes_per_group``. There -is no inode 0. - -The inode checksum is calculated against the FS UUID, the inode number, -and the inode structure itself. - -The inode table entry is laid out in ``struct ext4_inode``. - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - :class: longtable - - * - Offset - - Size - - Name - - Description - * - 0x0 - - __le16 - - i_mode - - File mode. See the table i_mode_ below. - * - 0x2 - - __le16 - - i_uid - - Lower 16-bits of Owner UID. - * - 0x4 - - __le32 - - i_size_lo - - Lower 32-bits of size in bytes. - * - 0x8 - - __le32 - - i_atime - - Last access time, in seconds since the epoch. However, if the EA_INODE - inode flag is set, this inode stores an extended attribute value and - this field contains the checksum of the value. - * - 0xC - - __le32 - - i_ctime - - Last inode change time, in seconds since the epoch. However, if the - EA_INODE inode flag is set, this inode stores an extended attribute - value and this field contains the lower 32 bits of the attribute value's - reference count. - * - 0x10 - - __le32 - - i_mtime - - Last data modification time, in seconds since the epoch. However, if the - EA_INODE inode flag is set, this inode stores an extended attribute - value and this field contains the number of the inode that owns the - extended attribute. - * - 0x14 - - __le32 - - i_dtime - - Deletion Time, in seconds since the epoch. - * - 0x18 - - __le16 - - i_gid - - Lower 16-bits of GID. - * - 0x1A - - __le16 - - i_links_count - - Hard link count. Normally, ext4 does not permit an inode to have more - than 65,000 hard links. This applies to files as well as directories, - which means that there cannot be more than 64,998 subdirectories in a - directory (each subdirectory's '..' entry counts as a hard link, as does - the '.' entry in the directory itself). With the DIR_NLINK feature - enabled, ext4 supports more than 64,998 subdirectories by setting this - field to 1 to indicate that the number of hard links is not known. - * - 0x1C - - __le32 - - i_blocks_lo - - Lower 32-bits of “block” count. If the huge_file feature flag is not - set on the filesystem, the file consumes ``i_blocks_lo`` 512-byte blocks - on disk. If huge_file is set and EXT4_HUGE_FILE_FL is NOT set in - ``inode.i_flags``, then the file consumes ``i_blocks_lo + (i_blocks_hi - << 32)`` 512-byte blocks on disk. If huge_file is set and - EXT4_HUGE_FILE_FL IS set in ``inode.i_flags``, then this file - consumes (``i_blocks_lo + i_blocks_hi`` << 32) filesystem blocks on - disk. - * - 0x20 - - __le32 - - i_flags - - Inode flags. See the table i_flags_ below. - * - 0x24 - - 4 bytes - - i_osd1 - - See the table i_osd1_ for more details. - * - 0x28 - - 60 bytes - - i_block[EXT4_N_BLOCKS=15] - - Block map or extent tree. See the section “The Contents of inode.i_block”. - * - 0x64 - - __le32 - - i_generation - - File version (for NFS). - * - 0x68 - - __le32 - - i_file_acl_lo - - Lower 32-bits of extended attribute block. ACLs are of course one of - many possible extended attributes; I think the name of this field is a - result of the first use of extended attributes being for ACLs. - * - 0x6C - - __le32 - - i_size_high / i_dir_acl - - Upper 32-bits of file/directory size. In ext2/3 this field was named - i_dir_acl, though it was usually set to zero and never used. - * - 0x70 - - __le32 - - i_obso_faddr - - (Obsolete) fragment address. - * - 0x74 - - 12 bytes - - i_osd2 - - See the table i_osd2_ for more details. - * - 0x80 - - __le16 - - i_extra_isize - - Size of this inode - 128. Alternately, the size of the extended inode - fields beyond the original ext2 inode, including this field. - * - 0x82 - - __le16 - - i_checksum_hi - - Upper 16-bits of the inode checksum. - * - 0x84 - - __le32 - - i_ctime_extra - - Extra change time bits. This provides sub-second precision. See Inode - Timestamps section. - * - 0x88 - - __le32 - - i_mtime_extra - - Extra modification time bits. This provides sub-second precision. - * - 0x8C - - __le32 - - i_atime_extra - - Extra access time bits. This provides sub-second precision. - * - 0x90 - - __le32 - - i_crtime - - File creation time, in seconds since the epoch. - * - 0x94 - - __le32 - - i_crtime_extra - - Extra file creation time bits. This provides sub-second precision. - * - 0x98 - - __le32 - - i_version_hi - - Upper 32-bits for version number. - * - 0x9C - - __le32 - - i_projid - - Project ID. - -.. _i_mode: - -The ``i_mode`` value is a combination of the following flags: - -.. list-table:: - :widths: 16 64 - :header-rows: 1 - - * - Value - - Description - * - 0x1 - - S_IXOTH (Others may execute) - * - 0x2 - - S_IWOTH (Others may write) - * - 0x4 - - S_IROTH (Others may read) - * - 0x8 - - S_IXGRP (Group members may execute) - * - 0x10 - - S_IWGRP (Group members may write) - * - 0x20 - - S_IRGRP (Group members may read) - * - 0x40 - - S_IXUSR (Owner may execute) - * - 0x80 - - S_IWUSR (Owner may write) - * - 0x100 - - S_IRUSR (Owner may read) - * - 0x200 - - S_ISVTX (Sticky bit) - * - 0x400 - - S_ISGID (Set GID) - * - 0x800 - - S_ISUID (Set UID) - * - - - These are mutually-exclusive file types: - * - 0x1000 - - S_IFIFO (FIFO) - * - 0x2000 - - S_IFCHR (Character device) - * - 0x4000 - - S_IFDIR (Directory) - * - 0x6000 - - S_IFBLK (Block device) - * - 0x8000 - - S_IFREG (Regular file) - * - 0xA000 - - S_IFLNK (Symbolic link) - * - 0xC000 - - S_IFSOCK (Socket) - -.. _i_flags: - -The ``i_flags`` field is a combination of these values: - -.. list-table:: - :widths: 16 64 - :header-rows: 1 - - * - Value - - Description - * - 0x1 - - This file requires secure deletion (EXT4_SECRM_FL). (not implemented) - * - 0x2 - - This file should be preserved, should undeletion be desired - (EXT4_UNRM_FL). (not implemented) - * - 0x4 - - File is compressed (EXT4_COMPR_FL). (not really implemented) - * - 0x8 - - All writes to the file must be synchronous (EXT4_SYNC_FL). - * - 0x10 - - File is immutable (EXT4_IMMUTABLE_FL). - * - 0x20 - - File can only be appended (EXT4_APPEND_FL). - * - 0x40 - - The dump(1) utility should not dump this file (EXT4_NODUMP_FL). - * - 0x80 - - Do not update access time (EXT4_NOATIME_FL). - * - 0x100 - - Dirty compressed file (EXT4_DIRTY_FL). (not used) - * - 0x200 - - File has one or more compressed clusters (EXT4_COMPRBLK_FL). (not used) - * - 0x400 - - Do not compress file (EXT4_NOCOMPR_FL). (not used) - * - 0x800 - - Encrypted inode (EXT4_ENCRYPT_FL). This bit value previously was - EXT4_ECOMPR_FL (compression error), which was never used. - * - 0x1000 - - Directory has hashed indexes (EXT4_INDEX_FL). - * - 0x2000 - - AFS magic directory (EXT4_IMAGIC_FL). - * - 0x4000 - - File data must always be written through the journal - (EXT4_JOURNAL_DATA_FL). - * - 0x8000 - - File tail should not be merged (EXT4_NOTAIL_FL). (not used by ext4) - * - 0x10000 - - All directory entry data should be written synchronously (see - ``dirsync``) (EXT4_DIRSYNC_FL). - * - 0x20000 - - Top of directory hierarchy (EXT4_TOPDIR_FL). - * - 0x40000 - - This is a huge file (EXT4_HUGE_FILE_FL). - * - 0x80000 - - Inode uses extents (EXT4_EXTENTS_FL). - * - 0x100000 - - Verity protected file (EXT4_VERITY_FL). - * - 0x200000 - - Inode stores a large extended attribute value in its data blocks - (EXT4_EA_INODE_FL). - * - 0x400000 - - This file has blocks allocated past EOF (EXT4_EOFBLOCKS_FL). - (deprecated) - * - 0x01000000 - - Inode is a snapshot (``EXT4_SNAPFILE_FL``). (not in mainline) - * - 0x04000000 - - Snapshot is being deleted (``EXT4_SNAPFILE_DELETED_FL``). (not in - mainline) - * - 0x08000000 - - Snapshot shrink has completed (``EXT4_SNAPFILE_SHRUNK_FL``). (not in - mainline) - * - 0x10000000 - - Inode has inline data (EXT4_INLINE_DATA_FL). - * - 0x20000000 - - Create children with the same project ID (EXT4_PROJINHERIT_FL). - * - 0x80000000 - - Reserved for ext4 library (EXT4_RESERVED_FL). - * - - - Aggregate flags: - * - 0x705BDFFF - - User-visible flags. - * - 0x604BC0FF - - User-modifiable flags. Note that while EXT4_JOURNAL_DATA_FL and - EXT4_EXTENTS_FL can be set with setattr, they are not in the kernel's - EXT4_FL_USER_MODIFIABLE mask, since it needs to handle the setting of - these flags in a special manner and they are masked out of the set of - flags that are saved directly to i_flags. - -.. _i_osd1: - -The ``osd1`` field has multiple meanings depending on the creator: - -Linux: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - __le32 - - l_i_version - - Inode version. However, if the EA_INODE inode flag is set, this inode - stores an extended attribute value and this field contains the upper 32 - bits of the attribute value's reference count. - -Hurd: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - __le32 - - h_i_translator - - ?? - -Masix: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - __le32 - - m_i_reserved - - ?? - -.. _i_osd2: - -The ``osd2`` field has multiple meanings depending on the filesystem creator: - -Linux: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - __le16 - - l_i_blocks_high - - Upper 16-bits of the block count. Please see the note attached to - i_blocks_lo. - * - 0x2 - - __le16 - - l_i_file_acl_high - - Upper 16-bits of the extended attribute block (historically, the file - ACL location). See the Extended Attributes section below. - * - 0x4 - - __le16 - - l_i_uid_high - - Upper 16-bits of the Owner UID. - * - 0x6 - - __le16 - - l_i_gid_high - - Upper 16-bits of the GID. - * - 0x8 - - __le16 - - l_i_checksum_lo - - Lower 16-bits of the inode checksum. - * - 0xA - - __le16 - - l_i_reserved - - Unused. - -Hurd: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - __le16 - - h_i_reserved1 - - ?? - * - 0x2 - - __u16 - - h_i_mode_high - - Upper 16-bits of the file mode. - * - 0x4 - - __le16 - - h_i_uid_high - - Upper 16-bits of the Owner UID. - * - 0x6 - - __le16 - - h_i_gid_high - - Upper 16-bits of the GID. - * - 0x8 - - __u32 - - h_i_author - - Author code? - -Masix: - -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - __le16 - - h_i_reserved1 - - ?? - * - 0x2 - - __u16 - - m_i_file_acl_high - - Upper 16-bits of the extended attribute block (historically, the file - ACL location). - * - 0x4 - - __u32 - - m_i_reserved2[2] - - ?? - -Inode Size -~~~~~~~~~~ - -In ext2 and ext3, the inode structure size was fixed at 128 bytes -(``EXT2_GOOD_OLD_INODE_SIZE``) and each inode had a disk record size of -128 bytes. Starting with ext4, it is possible to allocate a larger -on-disk inode at format time for all inodes in the filesystem to provide -space beyond the end of the original ext2 inode. The on-disk inode -record size is recorded in the superblock as ``s_inode_size``. The -number of bytes actually used by struct ext4_inode beyond the original -128-byte ext2 inode is recorded in the ``i_extra_isize`` field for each -inode, which allows struct ext4_inode to grow for a new kernel without -having to upgrade all of the on-disk inodes. Access to fields beyond -EXT2_GOOD_OLD_INODE_SIZE should be verified to be within -``i_extra_isize``. By default, ext4 inode records are 256 bytes, and (as -of August 2019) the inode structure is 160 bytes -(``i_extra_isize = 32``). The extra space between the end of the inode -structure and the end of the inode record can be used to store extended -attributes. Each inode record can be as large as the filesystem block -size, though this is not terribly efficient. - -Finding an Inode -~~~~~~~~~~~~~~~~ - -Each block group contains ``sb->s_inodes_per_group`` inodes. Because -inode 0 is defined not to exist, this formula can be used to find the -block group that an inode lives in: -``bg = (inode_num - 1) / sb->s_inodes_per_group``. The particular inode -can be found within the block group's inode table at -``index = (inode_num - 1) % sb->s_inodes_per_group``. To get the byte -address within the inode table, use -``offset = index * sb->s_inode_size``. - -Inode Timestamps -~~~~~~~~~~~~~~~~ - -Four timestamps are recorded in the lower 128 bytes of the inode -structure -- inode change time (ctime), access time (atime), data -modification time (mtime), and deletion time (dtime). The four fields -are 32-bit signed integers that represent seconds since the Unix epoch -(1970-01-01 00:00:00 GMT), which means that the fields will overflow in -January 2038. If the filesystem does not have orphan_file feature, inodes -that are not linked from any directory but are still open (orphan inodes) have -the dtime field overloaded for use with the orphan list. The superblock field -``s_last_orphan`` points to the first inode in the orphan list; dtime is then -the number of the next orphaned inode, or zero if there are no more orphans. - -If the inode structure size ``sb->s_inode_size`` is larger than 128 -bytes and the ``i_inode_extra`` field is large enough to encompass the -respective ``i_[cma]time_extra`` field, the ctime, atime, and mtime -inode fields are widened to 64 bits. Within this “extra” 32-bit field, -the lower two bits are used to extend the 32-bit seconds field to be 34 -bit wide; the upper 30 bits are used to provide nanosecond timestamp -accuracy. Therefore, timestamps should not overflow until May 2446. -dtime was not widened. There is also a fifth timestamp to record inode -creation time (crtime); this field is 64-bits wide and decoded in the -same manner as 64-bit [cma]time. Neither crtime nor dtime are accessible -through the regular stat() interface, though debugfs will report them. - -We use the 32-bit signed time value plus (2^32 * (extra epoch bits)). -In other words: - -.. list-table:: - :widths: 20 20 20 20 20 - :header-rows: 1 - - * - Extra epoch bits - - MSB of 32-bit time - - Adjustment for signed 32-bit to 64-bit tv_sec - - Decoded 64-bit tv_sec - - valid time range - * - 0 0 - - 1 - - 0 - - ``-0x80000000 - -0x00000001`` - - 1901-12-13 to 1969-12-31 - * - 0 0 - - 0 - - 0 - - ``0x000000000 - 0x07fffffff`` - - 1970-01-01 to 2038-01-19 - * - 0 1 - - 1 - - 0x100000000 - - ``0x080000000 - 0x0ffffffff`` - - 2038-01-19 to 2106-02-07 - * - 0 1 - - 0 - - 0x100000000 - - ``0x100000000 - 0x17fffffff`` - - 2106-02-07 to 2174-02-25 - * - 1 0 - - 1 - - 0x200000000 - - ``0x180000000 - 0x1ffffffff`` - - 2174-02-25 to 2242-03-16 - * - 1 0 - - 0 - - 0x200000000 - - ``0x200000000 - 0x27fffffff`` - - 2242-03-16 to 2310-04-04 - * - 1 1 - - 1 - - 0x300000000 - - ``0x280000000 - 0x2ffffffff`` - - 2310-04-04 to 2378-04-22 - * - 1 1 - - 0 - - 0x300000000 - - ``0x300000000 - 0x37fffffff`` - - 2378-04-22 to 2446-05-10 - -This is a somewhat odd encoding since there are effectively seven times -as many positive values as negative values. There have also been -long-standing bugs decoding and encoding dates beyond 2038, which don't -seem to be fixed as of kernel 3.12 and e2fsprogs 1.42.8. 64-bit kernels -incorrectly use the extra epoch bits 1,1 for dates between 1901 and -1970. At some point the kernel will be fixed and e2fsck will fix this -situation, assuming that it is run before 2310. -- An old man doll... just what I always wanted! - Clara