[PATCH] bpf: add bpf_real

public inbox for bpf@vger.kernel.org

* [PATCH] bpf: add bpf_real_inode() kfunc
@ 2026-03-26 16:53 Christian Brauner
  2026-03-26 17:02 ` Amir Goldstein
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Christian Brauner @ 2026-03-26 16:53 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Alexander Viro, Jan Kara, Daniel Borkmann, Alexei Starovoitov,
	linux-fsdevel, bpf, Christian Brauner

Add a sleepable BPF kfunc that resolves the real inode backing a dentry
via d_real_inode(). On overlay/union filesystems the inode attached to
the dentry is the overlay inode which does not carry the underlying
device information. d_real_inode() resolves through the overlay and
returns the inode from the lower, real filesystem.

This is needed by the dm-verity based execution policy implemented in
systemd [1] where BPF LSM hooks must resolve a file's backing block
device via inode->i_sb->s_dev. Without looking through overlayfs the
device lookup would return the overlay's anonymous device number instead
of the actual dm-verity block device, causing all overlayfs-hosted
binaries to be incorrectly denied.

Link: https://github.com/systemd/systemd/pull/41340 [1]
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/bpf_fs_kfuncs.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/fs/bpf_fs_kfuncs.c b/fs/bpf_fs_kfuncs.c
index e4e51a1d0de2..fc30aa906b8c 100644
--- a/fs/bpf_fs_kfuncs.c
+++ b/fs/bpf_fs_kfuncs.c
@@ -353,6 +353,21 @@ __bpf_kfunc int bpf_cgroup_read_xattr(struct cgroup *cgroup, const char *name__s
 }
 #endif /* CONFIG_CGROUPS */
 
+/**
+ * bpf_real_inode - get the real inode backing a dentry
+ * @dentry: dentry to resolve
+ *
+ * If the dentry is on a union/overlay filesystem, return the underlying, real
+ * inode that hosts the data.  Otherwise return the inode attached to the
+ * dentry itself.
+ *
+ * Return: The real inode backing the dentry.
+ */
+__bpf_kfunc struct inode *bpf_real_inode(struct dentry *dentry)
+{
+	return d_real_inode(dentry);
+}
+
 __bpf_kfunc_end_defs();
 
 BTF_KFUNCS_START(bpf_fs_kfunc_set_ids)
@@ -363,6 +378,7 @@ BTF_ID_FLAGS(func, bpf_get_dentry_xattr, KF_SLEEPABLE)
 BTF_ID_FLAGS(func, bpf_get_file_xattr, KF_SLEEPABLE)
 BTF_ID_FLAGS(func, bpf_set_dentry_xattr, KF_SLEEPABLE)
 BTF_ID_FLAGS(func, bpf_remove_dentry_xattr, KF_SLEEPABLE)
+BTF_ID_FLAGS(func, bpf_real_inode, KF_SLEEPABLE)
 BTF_KFUNCS_END(bpf_fs_kfunc_set_ids)
 
 static int bpf_fs_kfuncs_filter(const struct bpf_prog *prog, u32 kfunc_id)

---
base-commit: 1f318b96cc84d7c2ab792fcc0bfd42a7ca890681
change-id: 20260326-work-bpf-verity-a43f28baa242



* Re: [PATCH] bpf: add bpf_real_inode() kfunc
  2026-03-26 16:53 [PATCH] bpf: add bpf_real_inode() kfunc Christian Brauner
@ 2026-03-26 17:02 ` Amir Goldstein
  2026-03-27  5:28 ` Christoph Hellwig
  2026-03-27 12:19 ` bot+bpf-ci
  2 siblings, 0 replies; 15+ messages in thread
From: Amir Goldstein @ 2026-03-26 17:02 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Viro, Jan Kara, Daniel Borkmann, Alexei Starovoitov,
	linux-fsdevel, bpf

On Thu, Mar 26, 2026 at 5:53 PM Christian Brauner <brauner@kernel.org> wrote:
>
> Add a sleepable BPF kfunc that resolves the real inode backing a dentry
> via d_real_inode(). On overlay/union filesystems the inode attached to
> the dentry is the overlay inode which does not carry the underlying
> device information. d_real_inode() resolves through the overlay and
> returns the inode from the lower, real filesystem.
>
> This is needed by the dm-verity based execution policy implemented in
> systemd [1] where BPF LSM hooks must resolve a file's backing block
> device via inode->i_sb->s_dev. Without looking through overlayfs the
> device lookup would return the overlay's anonymous device number instead
> of the actual dm-verity block device, causing all overlayfs-hosted
> binaries to be incorrectly denied.
>
> Link: https://github.com/systemd/systemd/pull/41340 [1]
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Acked-by: Amir Goldstein <amir73il@gmail.com>

> ---
>  fs/bpf_fs_kfuncs.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
>
> diff --git a/fs/bpf_fs_kfuncs.c b/fs/bpf_fs_kfuncs.c
> index e4e51a1d0de2..fc30aa906b8c 100644
> --- a/fs/bpf_fs_kfuncs.c
> +++ b/fs/bpf_fs_kfuncs.c
> @@ -353,6 +353,21 @@ __bpf_kfunc int bpf_cgroup_read_xattr(struct cgroup *cgroup, const char *name__s
>  }
>  #endif /* CONFIG_CGROUPS */
>
> +/**
> + * bpf_real_inode - get the real inode backing a dentry
> + * @dentry: dentry to resolve
> + *
> + * If the dentry is on a union/overlay filesystem, return the underlying, real
> + * inode that hosts the data.  Otherwise return the inode attached to the
> + * dentry itself.
> + *
> + * Return: The real inode backing the dentry.
> + */
> +__bpf_kfunc struct inode *bpf_real_inode(struct dentry *dentry)
> +{
> +       return d_real_inode(dentry);
> +}
> +
>  __bpf_kfunc_end_defs();
>
>  BTF_KFUNCS_START(bpf_fs_kfunc_set_ids)
> @@ -363,6 +378,7 @@ BTF_ID_FLAGS(func, bpf_get_dentry_xattr, KF_SLEEPABLE)
>  BTF_ID_FLAGS(func, bpf_get_file_xattr, KF_SLEEPABLE)
>  BTF_ID_FLAGS(func, bpf_set_dentry_xattr, KF_SLEEPABLE)
>  BTF_ID_FLAGS(func, bpf_remove_dentry_xattr, KF_SLEEPABLE)
> +BTF_ID_FLAGS(func, bpf_real_inode, KF_SLEEPABLE)
>  BTF_KFUNCS_END(bpf_fs_kfunc_set_ids)
>
>  static int bpf_fs_kfuncs_filter(const struct bpf_prog *prog, u32 kfunc_id)
>
> ---
> base-commit: 1f318b96cc84d7c2ab792fcc0bfd42a7ca890681
> change-id: 20260326-work-bpf-verity-a43f28baa242
>


* Re: [PATCH] bpf: add bpf_real_inode() kfunc
  2026-03-26 16:53 [PATCH] bpf: add bpf_real_inode() kfunc Christian Brauner
  2026-03-26 17:02 ` Amir Goldstein
@ 2026-03-27  5:28 ` Christoph Hellwig
  2026-03-27  6:05   ` Darrick J. Wong
  2026-03-27 12:19 ` bot+bpf-ci
  2 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2026-03-27  5:28 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Amir Goldstein, Alexander Viro, Jan Kara, Daniel Borkmann,
	Alexei Starovoitov, linux-fsdevel, bpf

No comment on the code, here, but this caught my eye:

On Thu, Mar 26, 2026 at 05:53:44PM +0100, Christian Brauner wrote:
> This is needed by the dm-verity based execution policy implemented in
> systemd [1] where BPF LSM hooks must resolve a file's backing block
> device via inode->i_sb->s_dev. 

inode->i_sb->s_dev is not a file's backing block device.  The only
thing inode->i_sb->s_dev is required to be is the lookup key for
finding the super block.  It also happens to be the default backing
block device for simple file systems, but once things get a little more
complicated it often is not.  Examples are btrfs where it never matches,
f2fs additional devices, the XFS RT device, pNFS block layouts and
probably a few more I forgot.

For the more complex cases like btrfs there might not even be a single
block device for a file and/or the mapping can change.  So please do not
encode such an assumption anywhere because it is broken.


* Re: [PATCH] bpf: add bpf_real_inode() kfunc
  2026-03-27  5:28 ` Christoph Hellwig
@ 2026-03-27  6:05   ` Darrick J. Wong
  2026-04-07 10:25     ` Christian Brauner
  0 siblings, 1 reply; 15+ messages in thread
From: Darrick J. Wong @ 2026-03-27  6:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christian Brauner, Amir Goldstein, Alexander Viro, Jan Kara,
	Daniel Borkmann, Alexei Starovoitov, linux-fsdevel, bpf

On Thu, Mar 26, 2026 at 10:28:46PM -0700, Christoph Hellwig wrote:
> No comment on the code, here, but this caught my eye:
> 
> On Thu, Mar 26, 2026 at 05:53:44PM +0100, Christian Brauner wrote:
> > This is needed by the dm-verity based execution policy implemented in
> > systemd [1] where BPF LSM hooks must resolve a file's backing block
> > device via inode->i_sb->s_dev. 
> 
> inode->i_sb->s_dev is not a file's backing block device.  The only
> thing inode->i_sb->s_dev is required to be is the lookup key for
> finding the super block.  It also happens to be the default backing
> block device for simple file systems, but once things get a little more
> complicated it often is not.  Examples are btrfs where it never matches,
> f2fs additional devices, the XFS RT device, pNFS block layouts and
> probably a few more I forgot.
> 
> For the more complex cases like btrfs there might not even be a single
> block device for a file and/or the mapping can change.  So please do not
> encode such an assumption anywhere because it is broken.

Seconded.  It's a pity that dax_writeback_mapping_range makes nearly the
same mistake...

--D


* Re: [PATCH] bpf: add bpf_real_inode() kfunc
  2026-03-26 16:53 [PATCH] bpf: add bpf_real_inode() kfunc Christian Brauner
  2026-03-26 17:02 ` Amir Goldstein
  2026-03-27  5:28 ` Christoph Hellwig
@ 2026-03-27 12:19 ` bot+bpf-ci
  2 siblings, 0 replies; 15+ messages in thread
From: bot+bpf-ci @ 2026-03-27 12:19 UTC (permalink / raw)
  To: brauner, amir73il
  Cc: viro, jack, daniel, ast, linux-fsdevel, bpf, brauner, ast, andrii,
	daniel, martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 1303 bytes --]

> diff --git a/fs/bpf_fs_kfuncs.c b/fs/bpf_fs_kfuncs.c
> index e4e51a1d0de2..fc30aa906b8c 100644
> --- a/fs/bpf_fs_kfuncs.c
> +++ b/fs/bpf_fs_kfuncs.c

[ ... ]

> +__bpf_kfunc struct inode *bpf_real_inode(struct dentry *dentry)
> +{
> +	return d_real_inode(dentry);
> +}

[ ... ]

> +BTF_ID_FLAGS(func, bpf_real_inode, KF_SLEEPABLE)

d_real_inode() calls d_inode(d_real(dentry, D_REAL_DATA)), and
d_inode() returns dentry->d_inode, which is NULL for negative
dentries.

BPF LSM programs can be attached to hooks that receive negative
dentries, for example security_inode_create() or
security_inode_mkdir(), where the dentry represents a name that
does not yet exist.  If a BPF program passes such a dentry to
bpf_real_inode() and then dereferences the returned pointer, does
this result in a NULL pointer dereference?

Without KF_RET_NULL, the verifier marks the return as
PTR_TO_BTF_ID | PTR_TRUSTED, so it will not require a NULL check
before the BPF program accesses the returned inode.  Should this
have KF_RET_NULL to let the verifier enforce a NULL check?
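
To make the question concrete, here is a minimal userspace sketch, not
kernel or BPF code, of what a caller has to do if the return value is
treated as nullable. All `mock_*` names are hypothetical stand-ins invented
for illustration; they only model the "negative dentry has no inode" case
described above and say nothing about how the verifier behaves.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins, NOT the kernel's structures. */
struct mock_super_block {
	unsigned int s_dev;
};

struct mock_inode {
	struct mock_super_block *i_sb;
};

struct mock_dentry {
	struct mock_inode *d_inode;	/* NULL models a negative dentry */
};

/* Models the lookup: returns NULL when the dentry is negative. */
static struct mock_inode *mock_real_inode(const struct mock_dentry *dentry)
{
	assert(dentry != NULL);
	return dentry->d_inode;
}

/*
 * Caller that treats the result as nullable.
 * Returns 0 and stores the device number on success,
 * -1 if the dentry has no inode.
 */
static int mock_lookup_dev(const struct mock_dentry *dentry, unsigned int *dev)
{
	struct mock_inode *inode = mock_real_inode(dentry);

	if (!inode)
		return -1;

	*dev = inode->i_sb->s_dev;
	return 0;
}
```

Without the `if (!inode)` check, the negative-dentry case would dereference
a NULL pointer in this model, which is the situation the question is about.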

---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/23644830944


* Re: [PATCH] bpf: add bpf_real_inode() kfunc
  2026-03-27  6:05   ` Darrick J. Wong
@ 2026-04-07 10:25     ` Christian Brauner
  2026-04-07 14:54       ` Christoph Hellwig
  0 siblings, 1 reply; 15+ messages in thread
From: Christian Brauner @ 2026-04-07 10:25 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Amir Goldstein, Alexander Viro, Jan Kara,
	Daniel Borkmann, Alexei Starovoitov, linux-fsdevel, bpf

On Thu, Mar 26, 2026 at 11:05:18PM -0700, Darrick J. Wong wrote:
> On Thu, Mar 26, 2026 at 10:28:46PM -0700, Christoph Hellwig wrote:
> > No comment on the code, here, but this caught my eye:
> > 
> > On Thu, Mar 26, 2026 at 05:53:44PM +0100, Christian Brauner wrote:
> > > This is needed by the dm-verity based execution policy implemented in
> > > systemd [1] where BPF LSM hooks must resolve a file's backing block
> > > device via inode->i_sb->s_dev. 
> > 
> > inode->i_sb->s_dev is not a file's backing block device.  The only
> > thing inode->i_sb->s_dev is required to be is the lookup key for
> > finding the super block.  It also happens to be the default backing
> > block device for simple file systems, but once things get a little more
> > complicated it often is not.  Examples are btrfs where it never matches,
> > f2fs additional devices, the XFS RT device, pNFS block layouts and
> > probably a few more I forgot.
> > 
> > For the more complex cases like btrfs there might not even be a single
> > block device for a file and/or the mapping can change.  So please do not
> > encode such an assumption anywhere because it is broken.
> 
> Seconded.  It's a pity that dax_writeback_mapping_range makes nearly the
> same mistake...

Yes, I'm aware of that limitation. In this case we know exactly what
type of filesystem is used. But thanks for reminding me that for btrfs
its s_dev is never the actual block device.
  
What I would ultimately like in the future is to have a security
hook that allows bpf to reject mounting any block devices that aren't
dm-verity protected. Maybe we can chat about this at LSFMM as well. I
know you will love this idea.


* Re: [PATCH] bpf: add bpf_real_inode() kfunc
  2026-04-07 10:25     ` Christian Brauner
@ 2026-04-07 14:54       ` Christoph Hellwig
  2026-04-09 13:19         ` Christian Brauner
  0 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2026-04-07 14:54 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Darrick J. Wong, Christoph Hellwig, Amir Goldstein,
	Alexander Viro, Jan Kara, Daniel Borkmann, Alexei Starovoitov,
	linux-fsdevel, bpf

On Tue, Apr 07, 2026 at 12:25:29PM +0200, Christian Brauner wrote:
> Yes, I'm aware of that limitation. In this case we know exactly what
> type of filesystem is used. But thanks for reminding me that for btrfs
> its s_dev is never the actual block device.
>   
> What I would ultimately like in the future is to have a security
> hook that allows bpf to reject mounting any block devices that aren't
> dm-verity protected. Maybe we can chat about this at LSFMM as well. I
> know you will love this idea.

I'd much rather go right to that, with a slight tweak to clearly
specify the expected protection and not hard code dm-verity.  There's
ways to do full file system verity [1] much more efficiently inside the
file systems, and it would be good to not lock in a specific solution.

[1] not to be confused with the existing fsverity for certain read-only
files.  Although a loopback image on that would probably also qualify.


* Re: [PATCH] bpf: add bpf_real_inode() kfunc
  2026-04-07 14:54       ` Christoph Hellwig
@ 2026-04-09 13:19         ` Christian Brauner
  2026-04-09 14:24           ` Christoph Hellwig
  0 siblings, 1 reply; 15+ messages in thread
From: Christian Brauner @ 2026-04-09 13:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, Amir Goldstein, Alexander Viro, Jan Kara,
	Daniel Borkmann, Alexei Starovoitov, linux-fsdevel, bpf

On Tue, Apr 07, 2026 at 07:54:14AM -0700, Christoph Hellwig wrote:
> On Tue, Apr 07, 2026 at 12:25:29PM +0200, Christian Brauner wrote:
> > Yes, I'm aware of that limitation. In this case we know exactly what
> > type of filesystem is used. But thanks for reminding me that for btrfs
> > its s_dev is never the actual block device.
> >   
> > What I would ultimately like in the future is to have a security
> > hook that allows bpf to reject mounting any block devices that aren't
> > dm-verity protected. Maybe we can chat about this at LSFMM as well. I
> > know you will love this idea.
> 
> I'd much rather go right to that, with a slight tweak to clearly
> specify the expected protection and not hard code dm-verity.  There's

Yes, of course. I just used that as an example because we're using that
policy. Just a mechanism to intercept the onlining/mounting of devices
based on $rule. And the placement of that call needs to be as central as
possible so we don't have to spaghetti that everywhere.

> ways to do full file system verity [1] much more efficiently inside the
> file systems, and it would be good to not lock in a specific solution.

Note about that: we generally don't rely on any verity implementation
that makes the verity information itself part of the on-disk filesystem
format. The nice property of dm-verity is that the integrity is
completely separate from the filesystem format and it's basically simple
math that is trivial to prove correct.

> [1] not to be confused with the existing fsverity for certain read-only
> files.  Although a loopback image on that would probably also qualify.


* Re: [PATCH] bpf: add bpf_real_inode() kfunc
  2026-04-09 13:19         ` Christian Brauner
@ 2026-04-09 14:24           ` Christoph Hellwig
  2026-04-09 14:37             ` Gao Xiang
  0 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2026-04-09 14:24 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christoph Hellwig, Darrick J. Wong, Amir Goldstein,
	Alexander Viro, Jan Kara, Daniel Borkmann, Alexei Starovoitov,
	linux-fsdevel, bpf

On Thu, Apr 09, 2026 at 03:19:27PM +0200, Christian Brauner wrote:
> > ways to do full file system verity [1] much more efficiently inside the
> > file systems, and it would be good to not lock in a specific solution.
> 
> Note about that: we generally don't rely on any verity implementation
> that makes the verity information itself part of the on-disk filesystem
> format. The nice property of dm-verity is that the integrity is
> completely separate from the filesystem format and it's basically simple
> math that is trivial to prove correct.

Any file system integrated version storing the hashes in the extended
LBA data (which Linux also confusingly calls integrity data) would be
even simpler and easier to verify.  But yes, we need to clearly
document what we want.


* Re: [PATCH] bpf: add bpf_real_inode() kfunc
  2026-04-09 14:24           ` Christoph Hellwig
@ 2026-04-09 14:37             ` Gao Xiang
  2026-04-09 16:11               ` Christoph Hellwig
  0 siblings, 1 reply; 15+ messages in thread
From: Gao Xiang @ 2026-04-09 14:37 UTC (permalink / raw)
  To: Christoph Hellwig, Christian Brauner
  Cc: Darrick J. Wong, Amir Goldstein, Alexander Viro, Jan Kara,
	Daniel Borkmann, Alexei Starovoitov, linux-fsdevel, bpf

On 2026/4/9 22:24, Christoph Hellwig wrote:
> On Thu, Apr 09, 2026 at 03:19:27PM +0200, Christian Brauner wrote:
>>> ways to do full file system verity [1] much more efficiently inside the
>>> file systems, and it would be good to not lock in a specific solution.
>>
>> Note about that: we generally don't rely on any verity implementation
>> that makes the verity information itself part of the on-disk filesystem
>> format. The nice property of dm-verity is that the integrity is
>> completely separate from the filesystem format and it's basically simple
>> math that is trivial to prove correct.
> 
> Any file system integrated version storing the hashes in the extended
> LBA data (which Linux also confusingly calls integrity data) would be
> even simpler and easier to verify.  But yes, we need to clearly
> document what we want.

Yes, you could keep hash / checksum in the extended OOB area, but
I guess you still don't know if the hash / checksum of the
particular data can be trusted (or is changed by attackers).

I think the key point of merkle tree approach is not only for
data integrity but also to distribute the trust between all
hashes by calculating the hash (of the hashes) of the hashes
so that you could get the only one root hash value regardless
of how to organize the container filesystem (meta)data so
that it's totally separated from individual filesystem layouts
and can be proven separately and even implemented easily by
hardware/firmware rather than inside the specific filesystem
implementation.
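
As a rough illustration of the "hash (of the hashes) of the hashes" idea
above, the following sketch reduces a set of per-block leaf hashes to a
single root value. It rests on loud assumptions: `toy_hash()` is a plain
64-bit FNV-1a used only so the example is self-contained and is NOT
cryptographic, `MAX_LEAVES` is an arbitrary limit, and pairing an odd node
with itself is just one possible convention. It is not how dm-verity or any
filesystem lays out its tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LEAVES 64	/* arbitrary limit for this sketch */

/* Toy 64-bit FNV-1a; a stand-in for a real cryptographic hash. */
static uint64_t toy_hash(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint64_t h = 14695981039346656037ULL;

	for (size_t i = 0; i < len; i++) {
		h ^= p[i];
		h *= 1099511628211ULL;
	}
	return h;
}

/*
 * Reduce n leaf hashes to one root by hashing adjacent pairs,
 * level by level.  An odd node at the end of a level is paired
 * with itself.
 */
static uint64_t merkle_root(const uint64_t *leaves, size_t n)
{
	uint64_t level[MAX_LEAVES];

	assert(n > 0 && n <= MAX_LEAVES);
	for (size_t i = 0; i < n; i++)
		level[i] = leaves[i];

	while (n > 1) {
		size_t out = 0;

		for (size_t i = 0; i < n; i += 2) {
			uint64_t pair[2] = {
				level[i],
				level[i + 1 < n ? i + 1 : i],
			};

			level[out++] = toy_hash(pair, sizeof(pair));
		}
		n = out;
	}
	return level[0];
}
```

The point of the sketch is only that a change to any single leaf changes
the one root value, so trusting the root is enough to detect a modified
block or a modified stored hash.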

Thanks,
Gao Xiang


* Re: [PATCH] bpf: add bpf_real_inode() kfunc
  2026-04-09 14:37             ` Gao Xiang
@ 2026-04-09 16:11               ` Christoph Hellwig
  2026-04-09 16:42                 ` Gao Xiang
  0 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2026-04-09 16:11 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Christoph Hellwig, Christian Brauner, Darrick J. Wong,
	Amir Goldstein, Alexander Viro, Jan Kara, Daniel Borkmann,
	Alexei Starovoitov, linux-fsdevel, bpf

On Thu, Apr 09, 2026 at 10:37:46PM +0800, Gao Xiang wrote:
> > LBA data (which Linux also confusingly calls integrity data) would be
> > even simpler and easier to verify.  But yes, we need to clearly
> > document what we want.
> 
> Yes, you could keep hash / checksum in the extended OOB area, but
> I guess you still don't know if the hash / checksum of the
> particular data can be trusted (or is changed by attackers).

You'd still need to build a full merkle-tree out of them, but storing
the leaf hashes in the extended LBAs means:

 - a lot less I/O amplification
 - a sane way to actually have verification (including authenticated
   encryption) in a writable file system



* Re: [PATCH] bpf: add bpf_real_inode() kfunc
  2026-04-09 16:11               ` Christoph Hellwig
@ 2026-04-09 16:42                 ` Gao Xiang
  2026-04-10  6:15                   ` Christoph Hellwig
  0 siblings, 1 reply; 15+ messages in thread
From: Gao Xiang @ 2026-04-09 16:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christian Brauner, Darrick J. Wong, Amir Goldstein,
	Alexander Viro, Jan Kara, Daniel Borkmann, Alexei Starovoitov,
	linux-fsdevel, bpf

Hi Christoph,

On 2026/4/10 00:11, Christoph Hellwig wrote:
> On Thu, Apr 09, 2026 at 10:37:46PM +0800, Gao Xiang wrote:
>>> LBA data (which Linux also confusingly calls integrity data) would be
>>> even simpler and easier to verify.  But yes, we need to clearly
>>> document what we want.
>>
>> Yes, you could keep hash / checksum in the extended OOB area, but
>> I guess you still don't know if the hash / checksum of the
>> particular data can be trusted (or is changed by attackers).
> 
> You'd still need to build a full merkle-tree out of them, but storing
> the leaf hashes in the extended LBAs means:

Yes, the leaf hashes can be stored in the extended
LBA area.

> 
>   - a lot less I/O amplification

But not quite sure recently how extended OOB data is
kept within the physical media (I remembered in the
early years each page of raw nand flashes has
several-byte OOB together with the user data): if
it's along with the corresponding LBA main area, I
guess it has to read extended data among multiple
LBAs (but the LBA main data may not relate to this
particular I/O) in order to verify the hash of the
hashes.

>   - a sane way to actually have verification (including authenticated
>     encryption) in a writable file system

Yet if considering modification, it needs to move up
the tree and recalculate/update all hashes up to the
root hash; it's not a low-overhead task TBH, and it needs
to be applied to every single on-disk change.

Thanks,
Gao Xiang


* Re: [PATCH] bpf: add bpf_real_inode() kfunc
  2026-04-09 16:42                 ` Gao Xiang
@ 2026-04-10  6:15                   ` Christoph Hellwig
  2026-04-10  6:46                     ` Gao Xiang
  0 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2026-04-10  6:15 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Christoph Hellwig, Christian Brauner, Darrick J. Wong,
	Amir Goldstein, Alexander Viro, Jan Kara, Daniel Borkmann,
	Alexei Starovoitov, linux-fsdevel, bpf

On Fri, Apr 10, 2026 at 12:42:41AM +0800, Gao Xiang wrote:
> >   - a lot less I/O amplification
> 
> But not quite sure recently how extended OOB data is
> kept within the physical media (I remembered in the
> early years each page of raw nand flashes has
> several-byte OOB together with the user data): if
> it's along with the corresponding LBA main area, I
> guess it has to read extended data among multiple
> LBAs (but the LBA main data may not relate to this
> particular I/O) in order to verify the hash of the
> hashes.

For HDD it is stored next to the media, and doesn't introduce
extra seeks.  For NAND-based SSDs it typically is stored close to
the data as well.  NAND pages are significantly larger than the
exposed block size for any modern SSD.

> 
> >   - a sane way to actually have verification (including authenticated
> >     encryption) in a writable file system
> 
> Yet if considering modification, it needs to move up
> the tree and recalculate/update all hashes up to the
> root hash; it's not a low-overhead task TBH, and it needs
> to be applied to every single on-disk change.

It needs to be applied in-memory for every change, and persisted to
disk on every fsync or equivalent operation.



* Re: [PATCH] bpf: add bpf_real_inode() kfunc
  2026-04-10  6:15                   ` Christoph Hellwig
@ 2026-04-10  6:46                     ` Gao Xiang
  2026-04-10  7:06                       ` Christoph Hellwig
  0 siblings, 1 reply; 15+ messages in thread
From: Gao Xiang @ 2026-04-10  6:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christian Brauner, Darrick J. Wong, Amir Goldstein,
	Alexander Viro, Jan Kara, Daniel Borkmann, Alexei Starovoitov,
	linux-fsdevel, bpf

On 2026/4/10 14:15, Christoph Hellwig wrote:
> On Fri, Apr 10, 2026 at 12:42:41AM +0800, Gao Xiang wrote:
>>>    - a lot less I/O amplification
>>
>> But not quite sure recently how extended OOB data is
>> kept within the physical media (I remembered in the
>> early years each page of raw nand flashes has
>> several-byte OOB together with the user data): if
>> it's along with the corresponding LBA main area, I
>> guess it has to read extended data among multiple
>> LBAs (but the LBA main data may not relate to this
>> particular I/O) in order to verify the hash of the
>> hashes.
> 
> For HDD it is stored next to the media, and doesn't introduce
> extra seeks.  For NAND-based SSDs it typically is stored close to
> the data as well.  NAND pages are significantly larger than the
> exposed block size for any modern SSD.

Ok, thanks.

> 
>>
>>>    - a sane way to actually have verification (including authenticated
>>>      encryption) in a writable file system
>>
>> Yet if considering modification, it needs to move up
>> the tree and recalculate/update all hashes up to the
>> root hash; it's not a low-overhead task TBH, and it needs
>> to be applied to every single on-disk change.
> 
> It needs to be applied in-memory for every change, and persisted to
> disk on every fsync or equivalent operation.

Yes, yet it doesn't change my evaluation: and you need
to consider background writebacks too (since writeback
will update data and then impact the whole hash tree).

Currently data writeback can be applied for each block
independently, but if you consider maintaining a hash
tree (rather than simple checksums), I guess you have
to keep strict atomicity between data writeback,
metadata and hash-tree writeback, otherwise the hashes
and partial writeback data will be mismatched.

I won't say it's impossible; in short, I have to say
it's just not a straightforward task compared to the
previous data checksums, e.g. for XFS, maybe a journalled
COW approach has to be used since it's hard to keep
such atomicity for XFS without COW.

Yes, the OOB approach for leaf hashes will help to
reduce write amplification, but my current observation
is that it won't help with read amplification,
especially for small random reads; overall it depends
on the target workload.

Thanks,
Gao Xiang


* Re: [PATCH] bpf: add bpf_real_inode() kfunc
  2026-04-10  6:46                     ` Gao Xiang
@ 2026-04-10  7:06                       ` Christoph Hellwig
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2026-04-10  7:06 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Christoph Hellwig, Christian Brauner, Darrick J. Wong,
	Amir Goldstein, Alexander Viro, Jan Kara, Daniel Borkmann,
	Alexei Starovoitov, linux-fsdevel, bpf

On Fri, Apr 10, 2026 at 02:46:00PM +0800, Gao Xiang wrote:
> > It needs to be applied in-memory for every change, and persisted to
> > disk on every fsync or equivalent operation.
> 
> Yes, yet it doesn't change my evaluation: and you need
> to consider background writebacks too (since writeback
> will update data and then impact the whole hash tree).
> 
> Currently data writeback can be applied for each block
> independently, but if you consider maintaining a hash
> tree (rather than simple checksums), I guess you have
> to keep strict atomicity between data writeback,
> metadata and hash-tree writeback, otherwise the hashes
> and partial writeback data will be mismatched.

You write the leaf checksum with each block.  The rest of the chain
leading up to the root is kept in metadata tied to the inode and needs
to be written atomically with the transaction commit that updates the
on-disk metadata to point to the newly written block.
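
A toy in-memory model of that split, offered only as a sketch: the leaf
hash changes with the block, and only the chain of parents up to the root
is recomputed. Everything here is an assumption made for illustration:
`struct toy_tree`, `NLEAVES`, and the FNV-1a based `toy_hash_pair()` (which
is NOT cryptographic) are hypothetical, and nothing in it models
transactions, journalling, or where any real file system stores the interior
nodes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NLEAVES 8	/* power of two, to keep the indexing simple */

/* node[1] is the root, node[NLEAVES + i] is leaf i, node[0] is unused. */
struct toy_tree {
	uint64_t node[2 * NLEAVES];
};

/* Toy FNV-1a over two child hashes; a stand-in for a real hash. */
static uint64_t toy_hash_pair(uint64_t left, uint64_t right)
{
	uint64_t in[2] = { left, right };
	const unsigned char *p = (const unsigned char *)in;
	uint64_t h = 14695981039346656037ULL;

	for (size_t i = 0; i < sizeof(in); i++) {
		h ^= p[i];
		h *= 1099511628211ULL;
	}
	return h;
}

/* Build every interior node from scratch, bottom up. */
static void toy_tree_build(struct toy_tree *t, const uint64_t *leaves)
{
	for (size_t i = 0; i < NLEAVES; i++)
		t->node[NLEAVES + i] = leaves[i];
	for (size_t i = NLEAVES - 1; i >= 1; i--)
		t->node[i] = toy_hash_pair(t->node[2 * i], t->node[2 * i + 1]);
}

/*
 * Replace one leaf hash and recompute only its ancestors.
 * Returns the number of interior nodes that were rehashed.
 */
static unsigned int toy_tree_update(struct toy_tree *t, size_t leaf,
				    uint64_t new_hash)
{
	unsigned int rehashed = 0;
	size_t i = NLEAVES + leaf;

	assert(leaf < NLEAVES);
	t->node[i] = new_hash;

	for (i /= 2; i >= 1; i /= 2) {
		t->node[i] = toy_hash_pair(t->node[2 * i], t->node[2 * i + 1]);
		rehashed++;
	}
	return rehashed;
}
```

With 8 leaves, a single block update rehashes 3 interior nodes (one per
level) rather than all 7, and ends at the same root a full rebuild would
produce. The atomicity concern raised in the thread is about making those
interior updates durable together with the metadata update, which this
sketch does not attempt to model.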

> Yes, the OOB approach for leaf hashes will help to
> reduce write amplification, but my current observation
> is that it won't help with read amplification,
> especially for small random reads; overall it depends
> on the target workload.

For HDD it roughly halves the number of seeks for random reads, and
at least significantly reduces it, but quite a bit
less.  For SSD it reduces the IOPS in a similar way, but for that
you need to max out the IOPS, which for most workloads you won't
on anything currently (and probably in the future) using erofs.
