The Linux Kernel Mailing List
 help / color / mirror / Atom feed
* [PATCH 0/3] ocfs2: harden inode validators against forged metadata
@ 2026-05-17 11:10 Michael Bommarito
  2026-05-17 11:10 ` [PATCH 1/3] ocfs2: reject dinodes with non-canonical i_mode type or stray bits Michael Bommarito
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Michael Bommarito @ 2026-05-17 11:10 UTC (permalink / raw)
  To: Joseph Qi, Mark Fasheh, Joel Becker
  Cc: ZhengYuan Huang, ocfs2-devel, linux-fsdevel, linux-kernel

This series adds three structural checks to
ocfs2_validate_inode_block() that catch attacker-controlled bytes
in a freshly read dinode before ocfs2_populate_inode() copies them
verbatim into the in-core inode.  All three checks fire on the
mount, lookup, and read-after-cache-invalidation paths and reject
the block with ocfs2_error(), the same error-propagation
mechanism the existing suballoc-slot, inline-data, chain-list,
and refcount checks use.

Direction
=========

This continues the validator-hardening direction visible in the
recent in-flight ocfs2_validate_inode_block hardening series,
e.g. ZhengYuan Huang's "ocfs2: revalidate the journal dinode
before toggling dirty"
(<20260512024115.4036371-1-gality369@gmail.com>), "ocfs2: add
extent tree depth validation"
(<20260416110229.3283682-1-gality369@gmail.com>), and "ocfs2:
add extent list validation v2"
(<20260423094116.876696-1-gality369@gmail.com>).  Each of those
adds a per-field invariant check against bytes that downstream
code paths trust unconditionally.

The three checks in this series cover three more fields whose
attacker-controlled values currently propagate into VFS-visible
state without question: i_mode (type bits and reserved bits),
i_rdev (cross-checked against the file type), and the
i_size / i_clusters pairing for regular files on non-sparse
volumes (patch 3 is gated on !ocfs2_sparse_alloc(); sparse-
allocation mounts legitimately commit i_size > 0 with
i_clusters == 0 via ocfs2_zero_extend()).

Threat model
============

The validator is the chokepoint that protects
ocfs2_populate_inode() from a malformed dinode whether the
malformation got there via:

  (1) An attacker-supplied disk image mounted by a privileged
      user.  The mount path runs every dinode through this
      validator before any unprivileged user opens a file on
      the volume.  This is the same threat model the existing
      inline-data, refcount, and chain-list checks in this
      function were written for.

  (2) A compromised cluster peer with raw write access to the
      shared block device.  OCFS2 is a clustered filesystem;
      the on-disk blocks behind bh->b_data live on shared
      storage that other cluster nodes can write.  The local
      node's cache-eviction re-read runs the newly fetched
      block through this validator before ocfs2_populate_inode()
      runs again.  Oracle's BlockErrorDetection design document
      scopes the existing CRC32 + Hamming integrity primitive
      explicitly as defense against memory and wire corruption,
      not as authentication of peer writes; the field-level
      validators are therefore the kernel-side defense
      whichever path produced the forged block.

The three checks in this series are deliberately structural:
they each express an invariant mkfs.ocfs2 and the kernel
maintain unconditionally, and they reject any dinode whose
header violates that invariant before its bytes propagate to
the in-core inode.

Scope note: these checks block forge patterns that touch i_mode
(outside the canonical envelope), i_rdev (on a non-device file),
or the i_size / i_clusters pair (regular file with size but no
extents, on non-sparse volumes only).  A forger who keeps the
dinode within these structural envelopes (for example, flipping
only the permission bits and uid/gid on a regular file that
already has clusters allocated) can still produce a dinode that
satisfies the field-level invariants; closing that residual
class is outside the scope of this hardening series.

Validation
==========

Each patch builds on top of the previous one against mainline;
the series as a whole builds clean against v7.1-rc1 with zero
new warnings.  checkpatch --strict reports 0 errors, 0 warnings,
0 checks for each patch.

The series was exercised on a two-node QEMU cluster (virtio-blk
shared LUN with share-rw=on, both nodes joining the same o2cb
cluster, mounted ocfs2 with metaecc):

  - Pre-series baseline: a peer-node raw-write forge that adds
    S_ISUID and flips uid/gid to 0 on a regular file is accepted
    by the existing validator chain; the unprivileged user on
    the victim node exec()s the file and gains euid=0.  This
    confirms the cluster-peer write primitive is reachable in
    today's mainline.  Per the Scope note above, this particular
    forge stays within the structural envelope these patches
    enforce and is NOT blocked by them; closing it requires the
    out-of-scope keyed-integrity work.
  - Post-series, structural-variant forge: a peer-node forge
    that, in addition to the setuid + uid/gid changes, stores
    i_rdev = MKDEV(1,1) on the same regular-file dinode (the
    cleanest cluster-context attacker primitive patch 2
    catches) is rejected by ocfs2_validate_inode_block() with
    ocfs2_error "non-device mode 0104755 with i_rdev N".  The
    buffer-head propagates -EFSCORRUPTED to
    ocfs2_read_locked_inode and the user-visible result is
    Permission denied on subsequent stat / open / exec of the
    forged file.  Analogous post-series forge variants that
    flip i_mode outside the canonical envelope, or that set
    i_size > 0 with i_clusters == 0 on a non-inline regular
    file mounted from a non-sparse volume, are rejected by
    patches 1 and 3 respectively.

A separate cluster regression (mount, peer-write a regular
file, drop_caches on the second node, read it back) runs
clean post-series, so the checks do not regress normal
operation.

In-tree selftests under tools/testing/selftests/ that
reference fs/ocfs2/inode.c or any changed symbol were checked;
no matching selftests exist for ocfs2_validate_inode_block(),
which is consistent with OCFS2 having no in-tree selftest
coverage.  The subsystem's standard regression coverage is
xfstests (the generic fs group) plus ocfs2-test, both out of
tree.  Those were not run as part of this series; a full
xfstests pass before merge is recommended and I am happy to
run a representative subset and report results if reviewers
would find it useful.

Patches
=======

Michael Bommarito (3):
  ocfs2: reject dinodes with non-canonical i_mode type or stray bits
  ocfs2: reject dinodes whose i_rdev disagrees with the file type
  ocfs2: reject regular files with non-zero i_size and zero i_clusters

 fs/ocfs2/inode.c | 118 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 118 insertions(+)

--
2.53.0


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/3] ocfs2: reject dinodes with non-canonical i_mode type or stray bits
  2026-05-17 11:10 [PATCH 0/3] ocfs2: harden inode validators against forged metadata Michael Bommarito
@ 2026-05-17 11:10 ` Michael Bommarito
  2026-05-18  1:36   ` Joseph Qi
  2026-05-17 11:10 ` [PATCH 2/3] ocfs2: reject dinodes whose i_rdev disagrees with the file type Michael Bommarito
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: Michael Bommarito @ 2026-05-17 11:10 UTC (permalink / raw)
  To: Joseph Qi, Mark Fasheh, Joel Becker
  Cc: ZhengYuan Huang, ocfs2-devel, linux-fsdevel, linux-kernel

ocfs2_validate_inode_block() currently accepts any 16-bit i_mode
value as long as i_mode is non-zero.  ocfs2_populate_inode() then
copies that mode verbatim into inode->i_mode and dispatches on
i_mode & S_IFMT to the file/dir/symlink/special_file iops; any
unrecognised type falls through to ocfs2_special_file_iops and
init_special_inode(), which interprets id1.dev1.i_rdev as a
device number.

The result is that anything able to forge or corrupt an inode
block (a hostile cluster peer with raw write access to the
shared LUN, a privileged user mounting an attacker-supplied
image, on-disk corruption) can publish an in-core inode whose
type bits do not name a POSIX file type, or whose permission
bits carry bytes outside S_IFMT|07777.  Both shapes propagate
into VFS-visible state that downstream code paths assume is
well-formed.

Reject early in the validator:

  - mode bits outside S_IFMT|07777
  - S_IFMT values that are not one of S_IFREG, S_IFDIR, S_IFLNK,
    S_IFCHR, S_IFBLK, S_IFIFO, S_IFSOCK

mkfs.ocfs2 and the kernel only ever produce these seven types
plus the standard permission, setuid/setgid/sticky bits; an
on-disk i_mode outside this envelope is structurally malformed
regardless of how it got there.

Validated against the existing inline_data, refcount, and
chain-list checks: this hardening fires before any of them and
does not perturb their behaviour for well-formed inodes.

Fixes: b657c95c1108 ("ocfs2: Wrap inode block reads in a dedicated function.")
Cc: stable@vger.kernel.org
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Assisted-by: Claude:claude-opus-4-7
---
 fs/ocfs2/inode.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
index a510a0eb1adcc..fb592bf3e5f31 100644
--- a/fs/ocfs2/inode.c
+++ b/fs/ocfs2/inode.c
@@ -1494,6 +1494,45 @@ int ocfs2_validate_inode_block(struct super_block *sb,
 		goto bail;
 	}
 
+	/*
+	 * Reject dinodes whose i_mode does not name one of the seven
+	 * canonical POSIX file types, or whose mode carries bits outside
+	 * S_IFMT | 07777.  ocfs2_populate_inode() copies i_mode verbatim
+	 * into inode->i_mode and then dispatches via switch (mode & S_IFMT)
+	 * to file/dir/symlink/special_file iops; an unrecognised type
+	 * falls into ocfs2_special_file_iops with init_special_inode(),
+	 * which interprets i_rdev.  Constrain the type byte here so the
+	 * dispatch only ever sees a value mkfs.ocfs2 / VFS can produce.
+	 */
+	{
+		u16 mode = le16_to_cpu(di->i_mode);
+
+		if (mode & ~(S_IFMT | 07777)) {
+			rc = ocfs2_error(sb,
+					 "Invalid dinode #%llu: mode 0%o has bits outside S_IFMT|07777\n",
+					 (unsigned long long)bh->b_blocknr,
+					 mode);
+			goto bail;
+		}
+
+		switch (mode & S_IFMT) {
+		case S_IFREG:
+		case S_IFDIR:
+		case S_IFLNK:
+		case S_IFCHR:
+		case S_IFBLK:
+		case S_IFIFO:
+		case S_IFSOCK:
+			break;
+		default:
+			rc = ocfs2_error(sb,
+					 "Invalid dinode #%llu: mode 0%o has unknown file type\n",
+					 (unsigned long long)bh->b_blocknr,
+					 mode);
+			goto bail;
+		}
+	}
+
 	if (le16_to_cpu(di->i_dyn_features) & OCFS2_INLINE_DATA_FL) {
 		struct ocfs2_inline_data *data = &di->id2.i_data;
 
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/3] ocfs2: reject dinodes whose i_rdev disagrees with the file type
  2026-05-17 11:10 [PATCH 0/3] ocfs2: harden inode validators against forged metadata Michael Bommarito
  2026-05-17 11:10 ` [PATCH 1/3] ocfs2: reject dinodes with non-canonical i_mode type or stray bits Michael Bommarito
@ 2026-05-17 11:10 ` Michael Bommarito
  2026-05-18  1:37   ` Joseph Qi
  2026-05-17 11:10 ` [PATCH 3/3] ocfs2: reject regular files with non-zero i_size and zero i_clusters Michael Bommarito
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: Michael Bommarito @ 2026-05-17 11:10 UTC (permalink / raw)
  To: Joseph Qi, Mark Fasheh, Joel Becker
  Cc: ZhengYuan Huang, ocfs2-devel, linux-fsdevel, linux-kernel

id1.dev1.i_rdev is the device-number arm of the ocfs2_dinode id1
union and is only meaningful for character and block device
inodes.  For any other user-visible file type the on-disk value
must be zero.

ocfs2_populate_inode() currently runs

    inode->i_rdev = huge_decode_dev(le64_to_cpu(fe->id1.dev1.i_rdev));

unconditionally, before the S_IFMT switch decides whether the
inode is a special file.  As a result, an i_rdev value present on
a non-device inode is silently published into the in-core inode.
A subsequent forced re-read or in-core mode mutation (cluster
peer with raw write access to the shared LUN, on-disk corruption,
or a separately forged dinode) can then expose the attacker-
controlled device number to init_special_inode() without ever
showing an unusual i_mode at validation time.

System inodes (OCFS2_SYSTEM_FL) legitimately use the bitmap1 and
journal1 arms of the same union: allocator inodes encode i_used
/ i_total in the bitmap1 arm and the journal encodes ij_flags /
ij_recovery_generation in the journal1 arm.  Those byte
sequences are not an i_rdev and a non-zero pattern there is the
on-disk norm, not an integrity violation.  Restrict the cross-
check to non-system inodes; that is the full surface where
i_rdev semantics apply and is also the full surface an
unprivileged consumer of the volume can see.

Following the i_mode canonicalisation in patch 1, S_ISCHR /
S_ISBLK covers the whole device-inode space; this check operates
correctly on its own, but the canonicalised i_mode makes the
predicate exhaustive.

Fixes: b657c95c1108 ("ocfs2: Wrap inode block reads in a dedicated function.")
Cc: stable@vger.kernel.org
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Assisted-by: Claude:claude-opus-4-7
---
 fs/ocfs2/inode.c | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
index fb592bf3e5f31..305e22cc9b1d9 100644
--- a/fs/ocfs2/inode.c
+++ b/fs/ocfs2/inode.c
@@ -1533,6 +1533,44 @@ int ocfs2_validate_inode_block(struct super_block *sb,
 		}
 	}
 
+	/*
+	 * id1.dev1.i_rdev is the device-number arm of the id1 union and
+	 * is only meaningful for character and block device inodes.  For
+	 * any other regular user-visible file type the on-disk value
+	 * must be zero.  ocfs2_populate_inode() currently runs
+	 *
+	 *     inode->i_rdev = huge_decode_dev(le64_to_cpu(fe->id1.dev1.i_rdev));
+	 *
+	 * unconditionally, before the S_IFMT switch decides whether the
+	 * inode is a special file.  As a result, an i_rdev value present
+	 * on a non-device inode is silently published into the in-core
+	 * inode; a subsequent forced re-read or in-core mode mutation
+	 * (cluster peer with raw write access to the shared LUN,
+	 * on-disk corruption, or a separately forged dinode) can then
+	 * expose the attacker-controlled device number to
+	 * init_special_inode() without ever showing an unusual i_mode
+	 * at validation time.
+	 *
+	 * System inodes (OCFS2_SYSTEM_FL) legitimately use the bitmap1
+	 * and journal1 arms of the same union (allocator i_used /
+	 * i_total counters and the journal ij_flags /
+	 * ij_recovery_generation pair); those bytes are not an i_rdev
+	 * and must not be checked here.  Restrict the cross-check to
+	 * non-system inodes, which is the full attacker-controllable
+	 * surface.
+	 */
+	if (!(le32_to_cpu(di->i_flags) & OCFS2_SYSTEM_FL) &&
+	    !S_ISCHR(le16_to_cpu(di->i_mode)) &&
+	    !S_ISBLK(le16_to_cpu(di->i_mode)) &&
+	    di->id1.dev1.i_rdev != 0) {
+		rc = ocfs2_error(sb,
+				 "Invalid dinode #%llu: non-device mode 0%o with i_rdev %llu\n",
+				 (unsigned long long)bh->b_blocknr,
+				 le16_to_cpu(di->i_mode),
+				 (unsigned long long)le64_to_cpu(di->id1.dev1.i_rdev));
+		goto bail;
+	}
+
 	if (le16_to_cpu(di->i_dyn_features) & OCFS2_INLINE_DATA_FL) {
 		struct ocfs2_inline_data *data = &di->id2.i_data;
 
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 3/3] ocfs2: reject regular files with non-zero i_size and zero i_clusters
  2026-05-17 11:10 [PATCH 0/3] ocfs2: harden inode validators against forged metadata Michael Bommarito
  2026-05-17 11:10 ` [PATCH 1/3] ocfs2: reject dinodes with non-canonical i_mode type or stray bits Michael Bommarito
  2026-05-17 11:10 ` [PATCH 2/3] ocfs2: reject dinodes whose i_rdev disagrees with the file type Michael Bommarito
@ 2026-05-17 11:10 ` Michael Bommarito
  2026-05-18  1:38   ` Joseph Qi
  2026-05-18 21:40 ` [PATCH 0/3] ocfs2: harden inode validators against forged metadata Andrew Morton
  2026-06-01 17:39 ` Joel Becker
  4 siblings, 1 reply; 10+ messages in thread
From: Michael Bommarito @ 2026-05-17 11:10 UTC (permalink / raw)
  To: Joseph Qi, Mark Fasheh, Joel Becker
  Cc: ZhengYuan Huang, ocfs2-devel, linux-fsdevel, linux-kernel

On a volume mounted WITHOUT OCFS2_FEATURE_INCOMPAT_SPARSE_ALLOC, a
regular file with non-zero i_size, zero i_clusters, and no
OCFS2_INLINE_DATA_FL flag is structurally malformed: the extent
map declares no allocated clusters yet the size header claims
the file has content.  ocfs2_populate_inode() copies i_size into
the in-core inode and dispatches to ocfs2_aops; subsequent reads
or truncates then operate on an inconsistent extent state.

This is the shape an attacker who keeps the rest of the extent
list intact (to satisfy the inline-data, refcount, chain-list,
and per-field validators already in this function) would produce
when forging only the inode header to publish a synthetic file
size on a victim node.  It is also the shape on-disk corruption
of the i_clusters field produces.  Reject early in the
validator.

The check is restricted to non-sparse volumes
(ocfs2_sparse_alloc() returns false).  On non-sparse mounts the
allocator path always grows clusters before i_size:
ocfs2_extend_file() takes the !sparse branch into
ocfs2_extend_no_holes(), which calls ocfs2_extend_allocation()
to journal new clusters first, and only then
ocfs2_simple_size_update() journals the larger i_size.  The
truncate path likewise lowers i_size in ocfs2_orphan_for_truncate()
and then frees clusters in ocfs2_commit_truncate(), which uses
ocfs2_clusters_for_bytes(new_i_size) as its new_highest_cpos:
when new_i_size > 0 the floor is at least one cluster, so the
on-disk dinode never legitimately exposes a non-inline regular
file with i_size > 0 and i_clusters == 0 on a non-sparse volume.

On sparse-alloc volumes the same shape is legitimate: an
ocfs2_extend_file() call goes through ocfs2_zero_extend() +
ocfs2_simple_size_update(), which grows i_size on its own
without changing i_clusters; a freshly truncate -s 1M of a
sparse regular file is therefore on-disk
(i_size = 1048576, i_clusters = 0).  The check therefore opts
out via ocfs2_sparse_alloc(OCFS2_SB(sb)).

System inodes (OCFS2_SYSTEM_FL) carry their own size and
cluster invariants validated by the allocator, journal, quota,
and truncate-log subsystems; skip them here.  The inline-data
fast path is filtered separately by its own dedicated branch
below: its well-formed case is exactly i_clusters == 0 with
i_size <= id_count.  Symlinks legitimately keep i_clusters ==
0 with non-zero i_size (fast symlinks), so this check is
restricted to S_IFREG.

Fixes: b657c95c1108 ("ocfs2: Wrap inode block reads in a dedicated function.")
Cc: stable@vger.kernel.org
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Assisted-by: Claude:claude-opus-4-7
---
 fs/ocfs2/inode.c | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
index 305e22cc9b1d9..c63d2ced6b338 100644
--- a/fs/ocfs2/inode.c
+++ b/fs/ocfs2/inode.c
@@ -1571,6 +1571,47 @@ int ocfs2_validate_inode_block(struct super_block *sb,
 		goto bail;
 	}
 
+	/*
+	 * On a non-sparse volume, a regular file with non-zero i_size
+	 * and zero i_clusters that is not marked as inline data is
+	 * structurally malformed: the extent map declares no allocated
+	 * clusters yet the size header claims the file has content.
+	 * ocfs2_populate_inode() would still publish i_size to VFS and
+	 * leave the extent state inconsistent for any later read or
+	 * truncate.  This is the shape an attacker who keeps the rest
+	 * of the extent list intact (to satisfy the inline-data,
+	 * refcount, chain-list, and per-field validators above) would
+	 * produce when forging only the inode header to publish a
+	 * synthetic file size on a victim node.  It is also the shape
+	 * on-disk corruption of the i_clusters field produces.
+	 *
+	 * The check opts out on sparse-alloc volumes, where the
+	 * extend path (ocfs2_extend_file -> ocfs2_zero_extend ->
+	 * ocfs2_simple_size_update) legitimately grows i_size without
+	 * allocating clusters.  On non-sparse volumes the equivalent
+	 * path (ocfs2_extend_no_holes) journals clusters first and
+	 * i_size second, and truncate-down floors i_clusters at
+	 * ocfs2_clusters_for_bytes(new_i_size) which is >= 1 whenever
+	 * new_i_size > 0, so the rejected shape never appears on disk.
+	 *
+	 * Skip system inodes (OCFS2_SYSTEM_FL) and the inline-data
+	 * fast path (handled below).  Symlinks legitimately keep
+	 * i_clusters == 0 with non-zero i_size (fast symlinks), so
+	 * restrict to S_IFREG.
+	 */
+	if (!ocfs2_sparse_alloc(OCFS2_SB(sb)) &&
+	    S_ISREG(le16_to_cpu(di->i_mode)) &&
+	    !(le32_to_cpu(di->i_flags) & OCFS2_SYSTEM_FL) &&
+	    !(le16_to_cpu(di->i_dyn_features) & OCFS2_INLINE_DATA_FL) &&
+	    le64_to_cpu(di->i_size) != 0 &&
+	    le32_to_cpu(di->i_clusters) == 0) {
+		rc = ocfs2_error(sb,
+				 "Invalid dinode #%llu: regular file i_size %llu with i_clusters 0 and no inline-data flag on non-sparse volume\n",
+				 (unsigned long long)bh->b_blocknr,
+				 (unsigned long long)le64_to_cpu(di->i_size));
+		goto bail;
+	}
+
 	if (le16_to_cpu(di->i_dyn_features) & OCFS2_INLINE_DATA_FL) {
 		struct ocfs2_inline_data *data = &di->id2.i_data;
 
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/3] ocfs2: reject dinodes with non-canonical i_mode type or stray bits
  2026-05-17 11:10 ` [PATCH 1/3] ocfs2: reject dinodes with non-canonical i_mode type or stray bits Michael Bommarito
@ 2026-05-18  1:36   ` Joseph Qi
  0 siblings, 0 replies; 10+ messages in thread
From: Joseph Qi @ 2026-05-18  1:36 UTC (permalink / raw)
  To: Michael Bommarito, akpm
  Cc: ZhengYuan Huang, ocfs2-devel, linux-fsdevel, linux-kernel,
	Mark Fasheh, Joel Becker



On 5/17/26 7:10 PM, Michael Bommarito wrote:
> ocfs2_validate_inode_block() currently accepts any 16-bit i_mode
> value as long as i_mode is non-zero.  ocfs2_populate_inode() then
> copies that mode verbatim into inode->i_mode and dispatches on
> i_mode & S_IFMT to the file/dir/symlink/special_file iops; any
> unrecognised type falls through to ocfs2_special_file_iops and
> init_special_inode(), which interprets id1.dev1.i_rdev as a
> device number.
> 
> The result is that anything able to forge or corrupt an inode
> block (a hostile cluster peer with raw write access to the
> shared LUN, a privileged user mounting an attacker-supplied
> image, on-disk corruption) can publish an in-core inode whose
> type bits do not name a POSIX file type, or whose permission
> bits carry bytes outside S_IFMT|07777.  Both shapes propagate
> into VFS-visible state that downstream code paths assume is
> well-formed.
> 
> Reject early in the validator:
> 
>   - mode bits outside S_IFMT|07777
>   - S_IFMT values that are not one of S_IFREG, S_IFDIR, S_IFLNK,
>     S_IFCHR, S_IFBLK, S_IFIFO, S_IFSOCK
> 
> mkfs.ocfs2 and the kernel only ever produce these seven types
> plus the standard permission, setuid/setgid/sticky bits; an
> on-disk i_mode outside this envelope is structurally malformed
> regardless of how it got there.
> 
> Validated against the existing inline_data, refcount, and
> chain-list checks: this hardening fires before any of them and
> does not perturb their behaviour for well-formed inodes.
> 
> Fixes: b657c95c1108 ("ocfs2: Wrap inode block reads in a dedicated function.")
> Cc: stable@vger.kernel.org
> Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
> Assisted-by: Claude:claude-opus-4-7

Looks fine.
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

> ---
>  fs/ocfs2/inode.c | 39 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 39 insertions(+)
> 
> diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
> index a510a0eb1adcc..fb592bf3e5f31 100644
> --- a/fs/ocfs2/inode.c
> +++ b/fs/ocfs2/inode.c
> @@ -1494,6 +1494,45 @@ int ocfs2_validate_inode_block(struct super_block *sb,
>  		goto bail;
>  	}
>  
> +	/*
> +	 * Reject dinodes whose i_mode does not name one of the seven
> +	 * canonical POSIX file types, or whose mode carries bits outside
> +	 * S_IFMT | 07777.  ocfs2_populate_inode() copies i_mode verbatim
> +	 * into inode->i_mode and then dispatches via switch (mode & S_IFMT)
> +	 * to file/dir/symlink/special_file iops; an unrecognised type
> +	 * falls into ocfs2_special_file_iops with init_special_inode(),
> +	 * which interprets i_rdev.  Constrain the type byte here so the
> +	 * dispatch only ever sees a value mkfs.ocfs2 / VFS can produce.
> +	 */
> +	{
> +		u16 mode = le16_to_cpu(di->i_mode);
> +
> +		if (mode & ~(S_IFMT | 07777)) {
> +			rc = ocfs2_error(sb,
> +					 "Invalid dinode #%llu: mode 0%o has bits outside S_IFMT|07777\n",
> +					 (unsigned long long)bh->b_blocknr,
> +					 mode);
> +			goto bail;
> +		}
> +
> +		switch (mode & S_IFMT) {
> +		case S_IFREG:
> +		case S_IFDIR:
> +		case S_IFLNK:
> +		case S_IFCHR:
> +		case S_IFBLK:
> +		case S_IFIFO:
> +		case S_IFSOCK:
> +			break;
> +		default:
> +			rc = ocfs2_error(sb,
> +					 "Invalid dinode #%llu: mode 0%o has unknown file type\n",
> +					 (unsigned long long)bh->b_blocknr,
> +					 mode);
> +			goto bail;
> +		}
> +	}
> +
>  	if (le16_to_cpu(di->i_dyn_features) & OCFS2_INLINE_DATA_FL) {
>  		struct ocfs2_inline_data *data = &di->id2.i_data;
>  


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/3] ocfs2: reject dinodes whose i_rdev disagrees with the file type
  2026-05-17 11:10 ` [PATCH 2/3] ocfs2: reject dinodes whose i_rdev disagrees with the file type Michael Bommarito
@ 2026-05-18  1:37   ` Joseph Qi
  0 siblings, 0 replies; 10+ messages in thread
From: Joseph Qi @ 2026-05-18  1:37 UTC (permalink / raw)
  To: Michael Bommarito, akpm
  Cc: ZhengYuan Huang, ocfs2-devel, linux-fsdevel, linux-kernel,
	Mark Fasheh, Joel Becker



On 5/17/26 7:10 PM, Michael Bommarito wrote:
> id1.dev1.i_rdev is the device-number arm of the ocfs2_dinode id1
> union and is only meaningful for character and block device
> inodes.  For any other user-visible file type the on-disk value
> must be zero.
> 
> ocfs2_populate_inode() currently runs
> 
>     inode->i_rdev = huge_decode_dev(le64_to_cpu(fe->id1.dev1.i_rdev));
> 
> unconditionally, before the S_IFMT switch decides whether the
> inode is a special file.  As a result, an i_rdev value present on
> a non-device inode is silently published into the in-core inode.
> A subsequent forced re-read or in-core mode mutation (cluster
> peer with raw write access to the shared LUN, on-disk corruption,
> or a separately forged dinode) can then expose the attacker-
> controlled device number to init_special_inode() without ever
> showing an unusual i_mode at validation time.
> 
> System inodes (OCFS2_SYSTEM_FL) legitimately use the bitmap1 and
> journal1 arms of the same union: allocator inodes encode i_used
> / i_total in the bitmap1 arm and the journal encodes ij_flags /
> ij_recovery_generation in the journal1 arm.  Those byte
> sequences are not an i_rdev and a non-zero pattern there is the
> on-disk norm, not an integrity violation.  Restrict the cross-
> check to non-system inodes; that is the full surface where
> i_rdev semantics apply and is also the full surface an
> unprivileged consumer of the volume can see.
> 
> Following the i_mode canonicalisation in patch 1, S_ISCHR /
> S_ISBLK covers the whole device-inode space; this check operates
> correctly on its own, but the canonicalised i_mode makes the
> predicate exhaustive.
> 
> Fixes: b657c95c1108 ("ocfs2: Wrap inode block reads in a dedicated function.")
> Cc: stable@vger.kernel.org
> Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
> Assisted-by: Claude:claude-opus-4-7

Looks fine.
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

> ---
>  fs/ocfs2/inode.c | 38 ++++++++++++++++++++++++++++++++++++++
>  1 file changed, 38 insertions(+)
> 
> diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
> index fb592bf3e5f31..305e22cc9b1d9 100644
> --- a/fs/ocfs2/inode.c
> +++ b/fs/ocfs2/inode.c
> @@ -1533,6 +1533,44 @@ int ocfs2_validate_inode_block(struct super_block *sb,
>  		}
>  	}
>  
> +	/*
> +	 * id1.dev1.i_rdev is the device-number arm of the id1 union and
> +	 * is only meaningful for character and block device inodes.  For
> +	 * any other regular user-visible file type the on-disk value
> +	 * must be zero.  ocfs2_populate_inode() currently runs
> +	 *
> +	 *     inode->i_rdev = huge_decode_dev(le64_to_cpu(fe->id1.dev1.i_rdev));
> +	 *
> +	 * unconditionally, before the S_IFMT switch decides whether the
> +	 * inode is a special file.  As a result, an i_rdev value present
> +	 * on a non-device inode is silently published into the in-core
> +	 * inode; a subsequent forced re-read or in-core mode mutation
> +	 * (cluster peer with raw write access to the shared LUN,
> +	 * on-disk corruption, or a separately forged dinode) can then
> +	 * expose the attacker-controlled device number to
> +	 * init_special_inode() without ever showing an unusual i_mode
> +	 * at validation time.
> +	 *
> +	 * System inodes (OCFS2_SYSTEM_FL) legitimately use the bitmap1
> +	 * and journal1 arms of the same union (allocator i_used /
> +	 * i_total counters and the journal ij_flags /
> +	 * ij_recovery_generation pair); those bytes are not an i_rdev
> +	 * and must not be checked here.  Restrict the cross-check to
> +	 * non-system inodes, which is the full attacker-controllable
> +	 * surface.
> +	 */
> +	if (!(le32_to_cpu(di->i_flags) & OCFS2_SYSTEM_FL) &&
> +	    !S_ISCHR(le16_to_cpu(di->i_mode)) &&
> +	    !S_ISBLK(le16_to_cpu(di->i_mode)) &&
> +	    di->id1.dev1.i_rdev != 0) {
> +		rc = ocfs2_error(sb,
> +				 "Invalid dinode #%llu: non-device mode 0%o with i_rdev %llu\n",
> +				 (unsigned long long)bh->b_blocknr,
> +				 le16_to_cpu(di->i_mode),
> +				 (unsigned long long)le64_to_cpu(di->id1.dev1.i_rdev));
> +		goto bail;
> +	}
> +
>  	if (le16_to_cpu(di->i_dyn_features) & OCFS2_INLINE_DATA_FL) {
>  		struct ocfs2_inline_data *data = &di->id2.i_data;
>  


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 3/3] ocfs2: reject regular files with non-zero i_size and zero i_clusters
  2026-05-17 11:10 ` [PATCH 3/3] ocfs2: reject regular files with non-zero i_size and zero i_clusters Michael Bommarito
@ 2026-05-18  1:38   ` Joseph Qi
  0 siblings, 0 replies; 10+ messages in thread
From: Joseph Qi @ 2026-05-18  1:38 UTC (permalink / raw)
  To: Michael Bommarito, akpm
  Cc: ZhengYuan Huang, ocfs2-devel, linux-fsdevel, linux-kernel,
	Mark Fasheh, Joel Becker



On 5/17/26 7:10 PM, Michael Bommarito wrote:
> On a volume mounted WITHOUT OCFS2_FEATURE_INCOMPAT_SPARSE_ALLOC, a
> regular file with non-zero i_size, zero i_clusters, and no
> OCFS2_INLINE_DATA_FL flag is structurally malformed: the extent
> map declares no allocated clusters yet the size header claims
> the file has content.  ocfs2_populate_inode() copies i_size into
> the in-core inode and dispatches to ocfs2_aops; subsequent reads
> or truncates then operate on an inconsistent extent state.
> 
> This is the shape an attacker who keeps the rest of the extent
> list intact (to satisfy the inline-data, refcount, chain-list,
> and per-field validators already in this function) would produce
> when forging only the inode header to publish a synthetic file
> size on a victim node.  It is also the shape on-disk corruption
> of the i_clusters field produces.  Reject early in the
> validator.
> 
> The check is restricted to non-sparse volumes
> (ocfs2_sparse_alloc() returns false).  On non-sparse mounts the
> allocator path always grows clusters before i_size:
> ocfs2_extend_file() takes the !sparse branch into
> ocfs2_extend_no_holes(), which calls ocfs2_extend_allocation()
> to journal new clusters first, and only then
> ocfs2_simple_size_update() journals the larger i_size.  The
> truncate path likewise lowers i_size in ocfs2_orphan_for_truncate()
> and then frees clusters in ocfs2_commit_truncate(), which uses
> ocfs2_clusters_for_bytes(new_i_size) as its new_highest_cpos:
> when new_i_size > 0 the floor is at least one cluster, so the
> on-disk dinode never legitimately exposes a non-inline regular
> file with i_size > 0 and i_clusters == 0 on a non-sparse volume.
> 
> On sparse-alloc volumes the same shape is legitimate: an
> ocfs2_extend_file() call goes through ocfs2_zero_extend() +
> ocfs2_simple_size_update(), which grows i_size on its own
> without changing i_clusters; a freshly truncate -s 1M of a
> sparse regular file is therefore on-disk
> (i_size = 1048576, i_clusters = 0).  The check therefore opts
> out via ocfs2_sparse_alloc(OCFS2_SB(sb)).
> 
> System inodes (OCFS2_SYSTEM_FL) carry their own size and
> cluster invariants validated by the allocator, journal, quota,
> and truncate-log subsystems; skip them here.  The inline-data
> fast path is filtered separately by its own dedicated branch
> below: its well-formed case is exactly i_clusters == 0 with
> i_size <= id_count.  Symlinks legitimately keep i_clusters ==
> 0 with non-zero i_size (fast symlinks), so this check is
> restricted to S_IFREG.
> 
> Fixes: b657c95c1108 ("ocfs2: Wrap inode block reads in a dedicated function.")
> Cc: stable@vger.kernel.org
> Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
> Assisted-by: Claude:claude-opus-4-7

Looks fine.
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

> ---
>  fs/ocfs2/inode.c | 41 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 41 insertions(+)
> 
> diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
> index 305e22cc9b1d9..c63d2ced6b338 100644
> --- a/fs/ocfs2/inode.c
> +++ b/fs/ocfs2/inode.c
> @@ -1571,6 +1571,47 @@ int ocfs2_validate_inode_block(struct super_block *sb,
>  		goto bail;
>  	}
>  
> +	/*
> +	 * On a non-sparse volume, a regular file with non-zero i_size
> +	 * and zero i_clusters that is not marked as inline data is
> +	 * structurally malformed: the extent map declares no allocated
> +	 * clusters yet the size header claims the file has content.
> +	 * ocfs2_populate_inode() would still publish i_size to VFS and
> +	 * leave the extent state inconsistent for any later read or
> +	 * truncate.  This is the shape an attacker who keeps the rest
> +	 * of the extent list intact (to satisfy the inline-data,
> +	 * refcount, chain-list, and per-field validators above) would
> +	 * produce when forging only the inode header to publish a
> +	 * synthetic file size on a victim node.  It is also the shape
> +	 * on-disk corruption of the i_clusters field produces.
> +	 *
> +	 * The check opts out on sparse-alloc volumes, where the
> +	 * extend path (ocfs2_extend_file -> ocfs2_zero_extend ->
> +	 * ocfs2_simple_size_update) legitimately grows i_size without
> +	 * allocating clusters.  On non-sparse volumes the equivalent
> +	 * path (ocfs2_extend_no_holes) journals clusters first and
> +	 * i_size second, and truncate-down floors i_clusters at
> +	 * ocfs2_clusters_for_bytes(new_i_size) which is >= 1 whenever
> +	 * new_i_size > 0, so the rejected shape never appears on disk.
> +	 *
> +	 * Skip system inodes (OCFS2_SYSTEM_FL) and the inline-data
> +	 * fast path (handled below).  Symlinks legitimately keep
> +	 * i_clusters == 0 with non-zero i_size (fast symlinks), so
> +	 * restrict to S_IFREG.
> +	 */
> +	if (!ocfs2_sparse_alloc(OCFS2_SB(sb)) &&
> +	    S_ISREG(le16_to_cpu(di->i_mode)) &&
> +	    !(le32_to_cpu(di->i_flags) & OCFS2_SYSTEM_FL) &&
> +	    !(le16_to_cpu(di->i_dyn_features) & OCFS2_INLINE_DATA_FL) &&
> +	    le64_to_cpu(di->i_size) != 0 &&
> +	    le32_to_cpu(di->i_clusters) == 0) {
> +		rc = ocfs2_error(sb,
> +				 "Invalid dinode #%llu: regular file i_size %llu with i_clusters 0 and no inline-data flag on non-sparse volume\n",
> +				 (unsigned long long)bh->b_blocknr,
> +				 (unsigned long long)le64_to_cpu(di->i_size));
> +		goto bail;
> +	}
> +
>  	if (le16_to_cpu(di->i_dyn_features) & OCFS2_INLINE_DATA_FL) {
>  		struct ocfs2_inline_data *data = &di->id2.i_data;
>  


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 0/3] ocfs2: harden inode validators against forged metadata
  2026-05-17 11:10 [PATCH 0/3] ocfs2: harden inode validators against forged metadata Michael Bommarito
                   ` (2 preceding siblings ...)
  2026-05-17 11:10 ` [PATCH 3/3] ocfs2: reject regular files with non-zero i_size and zero i_clusters Michael Bommarito
@ 2026-05-18 21:40 ` Andrew Morton
  2026-05-19  0:57   ` Michael Bommarito
  2026-06-01 17:39 ` Joel Becker
  4 siblings, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2026-05-18 21:40 UTC (permalink / raw)
  To: Michael Bommarito
  Cc: Joseph Qi, Mark Fasheh, Joel Becker, ZhengYuan Huang, ocfs2-devel,
	linux-fsdevel, linux-kernel

On Sun, 17 May 2026 07:10:11 -0400 Michael Bommarito <michael.bommarito@gmail.com> wrote:

> 
> This series adds three structural checks to
> ocfs2_validate_inode_block() that catch attacker-controlled bytes
> in a freshly read dinode before ocfs2_populate_inode() copies them
> verbatim into the in-core inode.  All three checks fire on the
> mount, lookup, and read-after-cache-invalidation paths and reject
> the block with ocfs2_error(), the same error-propagation
> mechanism the existing suballoc-slot, inline-data, chain-list,
> and refcount checks use.

Thanks.  Sashiko review might have found a few issues:
	https://sashiko.dev/#/patchset/20260517111015.3187935-1-michael.bommarito@gmail.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 0/3] ocfs2: harden inode validators against forged metadata
  2026-05-18 21:40 ` [PATCH 0/3] ocfs2: harden inode validators against forged metadata Andrew Morton
@ 2026-05-19  0:57   ` Michael Bommarito
  0 siblings, 0 replies; 10+ messages in thread
From: Michael Bommarito @ 2026-05-19  0:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joseph Qi, Mark Fasheh, Joel Becker, ZhengYuan Huang, ocfs2-devel,
	linux-fsdevel, linux-kernel

On Mon, May 18, 2026 at 5:40 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> Thanks.  Sashiko review might have found a few issues:
>         https://sashiko.dev/#/patchset/20260517111015.3187935-1-michael.bommarito@gmail.com

Yes, I think patches 1 and 3 need a v2.  Sashiko found a sibling issue
in patch 3, which is the most important part.

I'll send revisions later tonight.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 0/3] ocfs2: harden inode validators against forged metadata
  2026-05-17 11:10 [PATCH 0/3] ocfs2: harden inode validators against forged metadata Michael Bommarito
                   ` (3 preceding siblings ...)
  2026-05-18 21:40 ` [PATCH 0/3] ocfs2: harden inode validators against forged metadata Andrew Morton
@ 2026-06-01 17:39 ` Joel Becker
  4 siblings, 0 replies; 10+ messages in thread
From: Joel Becker @ 2026-06-01 17:39 UTC (permalink / raw)
  To: Michael Bommarito
  Cc: Joseph Qi, Mark Fasheh, ZhengYuan Huang, ocfs2-devel,
	linux-fsdevel, linux-kernel

On Sun, May 17, 2026 at 07:10:11AM -0400, Michael Bommarito wrote:
> This series adds three structural checks to
> ocfs2_validate_inode_block() that catch attacker-controlled bytes
> in a freshly read dinode before ocfs2_populate_inode() copies them
> verbatim into the in-core inode.  All three checks fire on the
>
> ...
>
> Threat model
> ============
> 
> The validator is the chokepoint that protects
> ocfs2_populate_inode() from a malformed dinode whether the
> malformation got there via:
> 
>   (1) An attacker-supplied disk image mounted by a privileged
>       user.  The mount path runs every dinode through this
>       validator before any unprivileged user opens a file on
>       the volume.  This is the same threat model the existing
>       inline-data, refcount, and chain-list checks in this
>       function were written for.
> 
>   (2) A compromised cluster peer with raw write access to the
>       shared block device.  OCFS2 is a clustered filesystem;
>       the on-disk blocks behind bh->b_data live on shared
>       storage that other cluster nodes can write.  The local
>       node's cache-eviction re-read runs the newly fetched
>       block through this validator before ocfs2_populate_inode()
>       runs again.  Oracle's BlockErrorDetection design document
>       scopes the existing CRC32 + Hamming integrity primitive
>       explicitly as defense against memory and wire corruption,
>       not as authentication of peer writes; the field-level
>       validators are therefore the kernel-side defense
>       whichever path produced the forged block.

Thank you for the excellent description of the threat model, and for
devising the comprehensive model in the first place.  It really helps
consider these patches in the right context.

Can you make sure this info carries on to later versions of this patch
series, so it isn't lost?

Thanks,
Joel

-- 

"Every day I get up and look through the Forbes list of the richest
 people in America. If I'm not there, I go to work."
        - Robert Orben

			http://www.jlbec.org/
			jlbec@evilplan.org

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2026-06-01 17:39 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-17 11:10 [PATCH 0/3] ocfs2: harden inode validators against forged metadata Michael Bommarito
2026-05-17 11:10 ` [PATCH 1/3] ocfs2: reject dinodes with non-canonical i_mode type or stray bits Michael Bommarito
2026-05-18  1:36   ` Joseph Qi
2026-05-17 11:10 ` [PATCH 2/3] ocfs2: reject dinodes whose i_rdev disagrees with the file type Michael Bommarito
2026-05-18  1:37   ` Joseph Qi
2026-05-17 11:10 ` [PATCH 3/3] ocfs2: reject regular files with non-zero i_size and zero i_clusters Michael Bommarito
2026-05-18  1:38   ` Joseph Qi
2026-05-18 21:40 ` [PATCH 0/3] ocfs2: harden inode validators against forged metadata Andrew Morton
2026-05-19  0:57   ` Michael Bommarito
2026-06-01 17:39 ` Joel Becker

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox