public inbox for bpf@vger.kernel.org
From: Gao Xiang <hsiangkao@linux.alibaba.com>
To: Christoph Hellwig <hch@infradead.org>
Cc: Christian Brauner <brauner@kernel.org>,
	"Darrick J. Wong" <djwong@kernel.org>,
	Amir Goldstein <amir73il@gmail.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>, Jan Kara <jack@suse.cz>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Alexei Starovoitov <ast@kernel.org>,
	linux-fsdevel@vger.kernel.org, bpf@vger.kernel.org
Subject: Re: [PATCH] bpf: add bpf_real_inode() kfunc
Date: Fri, 10 Apr 2026 15:29:39 +0800
Message-ID: <6bfe805e-be0d-4bd5-ac45-1b58fc55839f@linux.alibaba.com>
In-Reply-To: <adihXuFuuS5uoJ31@infradead.org>



On 2026/4/10 15:06, Christoph Hellwig wrote:
> On Fri, Apr 10, 2026 at 02:46:00PM +0800, Gao Xiang wrote:
> >>> It needs to be applied in-memory for every change, and persisted to
>>> disk on every fsync or equivalent operation.
>>
>> Yes, yet it doesn't change my evaluation: and you need
>> to consider background writebacks too (since writeback
>> will update data and then impact the whole hash tree).
>>
>> Currently data writeback can be applied for each block
>> independently, but if you consider maintaining a hash
>> tree (rather than simple checksums), I guess you have
>> to keep strict atomicity between data writeback,
>> metadata and hash-tree writeback, otherwise the hashes
>> and partial writeback data will be mismatched.
> 
> You write the leaf checksum with each block.  The rest of the chain
> leading up to the root is kept in metadata tied to the inode and needs
> to be written atomically with the transaction commit that updates the
> on-disk metadata to point to the newly written block.
> 
>> Yes, the OOB approach for leaf hashes will help to
>> reduce write amplification, but my current observation
>> is that it won't have any help to read amplification,
>> especially for small random read; overall it depends
>> on the target workload.
> 
> For HDD it roughly halves the number of seeks for random reads, and
> at least significantly reduces it significantly but quite a bit
> less.  For SSD it reduces the IOPS in a similar way, but for that

Yes, seeks can be reduced, and if all related leaf block
hashes can be loaded in a single request (even if some of
the main data blocks are not needed) it will help more; in
practice, a POC that measures the numbers for hashes kept
in individual blocks versus in an extended area may help
to get a more detailed picture.

If it's useful, I think dm-verity could work this way
too as a bonus, as long as the device supports it (and
converting between these two metadata formats should be
trivial).

> you need to max out the IOPS, which for most workloads you won't
> on anything currently (and probably in the future) using erofs.

The problem is not always maximum IOPS (I also suspect
there is such a real long-term max-IOPS workload too), but
small random burst I/Os with low latency are what we
usually care about most in typical use cases.

Thanks,
Gao Xiang

