Linux Btrfs filesystem development
From: David Sterba <dsterba@suse.cz>
To: Daniel Vacek <neelx@suse.com>
Cc: Chris Mason <clm@fb.com>, Josef Bacik <josef@toxicpanda.com>,
	David Sterba <dsterba@suse.com>,
	linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] btrfs: remove extent buffer's redundant `len` member field
Date: Fri, 2 May 2025 12:56:30 +0200
Message-ID: <20250502105630.GO9140@suse.cz>
In-Reply-To: <CAPjX3FdexSywSbJQfrj5pazrBRyVns3SdRCsw1VmvhrJv20bvw@mail.gmail.com>

On Wed, Apr 30, 2025 at 04:13:20PM +0200, Daniel Vacek wrote:
> On Wed, 30 Apr 2025 at 15:30, David Sterba <dsterba@suse.cz> wrote:
> >
> > On Wed, Apr 30, 2025 at 10:21:18AM +0200, Daniel Vacek wrote:
> > > > The benefit of duplicating the length in each eb is that it's in the
> > > > same cacheline as the other members that are used for offset
> > > > calculations or bit manipulations.
> > > >
> > > > Going to the fs_info->nodesize may or may not hit a cache, also because
> > > > it needs to do 2 pointer dereferences, so from that perspective I think
> > > > it's making it worse.
> > >
> > > I was considering that. Since fs_info is shared for all ebs and other
> > > stuff like transactions, etc. I think the cache is hot most of the
> > > time and there will be hardly any performance difference observable.
> > > Though without benchmarks this is just a speculation (on both sides).
> >
> > The comparison is between "always access 1 cacheline" and "hope that the
> > other cacheline is hot", yeah we don't have benchmarks for that but the
> > first access pattern is not conditional.
> 
> That's quite right. Though in many places we already have fs_info
> anyways so it's rather accessing a cacheline in eb vs. accessing a
> cacheline in fs_info. In the former case it's likely hot memory due
> to accessing surrounding members anyway, while in the latter case it's
> hopefully hot as it's a heavily shared resource accessed when
> processing other ebs or transactions.
> But yeah, in some places we don't have the fs_info pointer yet and two
> accesses are still needed.

The fs_info got added to eb because it used to be passed as a parameter to
many functions.

> In theory fs_info could be shuffled to move nodesize to the same
> cacheline with buffer_tree. Would that feel better to you?

We'd get conflicting requirements for ordering in fs_info. Right now
the nodesize/sectorsize/... are in one cacheline in fs_info and they're
often used together in many functions. Reordering it to fit the eb usage
pattern may work but I'm not convinced we need it.

> > > > I don't think we need to do the optimization right now, but maybe in the
> > > > future if there's a need to add something to eb. Still we can use the
> > > > remaining 16 bytes up to 256 without making things worse.
> > >
> > > This really depends on configuration. On my laptop (Debian -rt kernel)
> > > the eb struct is actually 272 bytes as the rt_mutex is significantly
> > > heavier than raw spin lock. And -rt is a first class citizen nowadays,
> > > often used in Kubernetes deployments like 5G RAN telco, dpdk and such.
> > > I think it would be nice to slim the struct below 256 bytes even there
> > > if that's your aim.
> >
> > I configured and built RT kernel to see if it's possible to go to 256
> > bytes on RT and it seems yes with a big sacrifice of removing several
> > struct members that cache values like folio_size or folio_shift and
> > generating worse code.
> >
> > As 272 is a multiple of 16 it's a reasonable size and we don't need to
> > optimize further. The number of ebs in one slab is 30, with the non-rt
> > build it's 34, which sounds OK.
> 
> That sounds fair. Well the 256 bytes were your argument in the first place.

Yeah, 256 is a nice number because it aligns with cachelines on multiple
architectures, which is useful for splitting the structure into the "data
accessed together" part and the locking/refcounting part. It's a tentative
goal; we used to have a larger eb size due to our own locking implementation,
but with rwsems it got close to/under 256.

The current size of 240 is shifted by 1/4 of a cacheline so it's not all
clean, but we have some wiggle room for adding new members or cached values,
like folio_size/folio_shift/addr.
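
To make the numbers above easier to check, here is a minimal sketch of the
arithmetic. It assumes 64-byte cachelines and an 8 KiB slab; the slab size is
only an inference, chosen because it reproduces the 30 and 34 objects per slab
quoted earlier. The names are hypothetical and nothing here is kernel code.

```c
#include <assert.h>

/* Assumptions, not taken from the kernel: 64-byte cachelines, 8 KiB slab. */
#define SKETCH_CACHELINE	64u
#define SKETCH_SLAB_BYTES	8192u

static_assert(SKETCH_SLAB_BYTES % SKETCH_CACHELINE == 0,
	      "slab must hold a whole number of cachelines");

/* Bytes missing until the object would end on a cacheline boundary. */
static inline unsigned int sketch_tail_slack(unsigned int obj_size)
{
	return (SKETCH_CACHELINE - obj_size % SKETCH_CACHELINE) %
	       SKETCH_CACHELINE;
}

/* Whole objects that fit in one slab, ignoring any allocator metadata. */
static inline unsigned int sketch_objs_per_slab(unsigned int obj_size)
{
	return SKETCH_SLAB_BYTES / obj_size;
}
```

With these assumptions, sketch_tail_slack(240) is 16 bytes, i.e. a quarter of
a cacheline, and sketch_objs_per_slab() gives 34 for 240 bytes and 30 for 272.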


> 
> Still, with this:
> 
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -82,7 +82,10 @@ void __cold extent_buffer_free_cachep(void);
>  struct extent_buffer {
>         u64 start;
>         u32 folio_size;
> -       unsigned long bflags;
> +       u8 folio_shift;
> +       /* >= 0 if eb belongs to a log tree, -1 otherwise */
> +       s8 log_index;
> +       unsigned short bflags;

This does not compile because of set_bit/clear_bit/wait_on_bit API
requirements.
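
A userspace sketch of why the type matters: the kernel bit helpers take a
pointer to unsigned long, so a narrower flags member cannot be passed to them.
The sketch_* helpers below are hypothetical, non-atomic stand-ins written only
for illustration; they are not the kernel implementations.

```c
#include <assert.h>
#include <limits.h>

#define SKETCH_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in: like the kernel helpers, takes unsigned long *. */
static inline void sketch_set_bit(unsigned int nr, unsigned long *addr)
{
	addr[nr / SKETCH_BITS_PER_LONG] |= 1UL << (nr % SKETCH_BITS_PER_LONG);
}

static inline void sketch_clear_bit(unsigned int nr, unsigned long *addr)
{
	addr[nr / SKETCH_BITS_PER_LONG] &= ~(1UL << (nr % SKETCH_BITS_PER_LONG));
}

static inline int sketch_test_bit(unsigned int nr, const unsigned long *addr)
{
	return (addr[nr / SKETCH_BITS_PER_LONG] >>
		(nr % SKETCH_BITS_PER_LONG)) & 1UL;
}

struct sketch_eb {
	unsigned long bflags;	/* word-sized so the helpers accept &bflags */
};

static_assert(sizeof(((struct sketch_eb *)0)->bflags) == sizeof(unsigned long),
	      "flags member must be a full unsigned long");

/*
 * With "unsigned short bflags", a call like sketch_set_bit(0, &eb->bflags)
 * passes an incompatible pointer type and the compiler rejects or warns.
 */
```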

>         struct btrfs_fs_info *fs_info;
> 
>         /*
> @@ -94,9 +97,6 @@ struct extent_buffer {
>         spinlock_t refs_lock;
>         atomic_t refs;
>         int read_mirror;
> -       /* >= 0 if eb belongs to a log tree, -1 otherwise */
> -       s8 log_index;
> -       u8 folio_shift;
>         struct rcu_head rcu_head;
> 
>         struct rw_semaphore lock;
> 
> you're down to 256 even on -rt. And the great part is I don't see any
> sacrifices (other than accessing a cacheline in fs_info). We're only
> using 8 flags now, so there is still some room left for another 8 if
> needed in the future.

Which means that the size on non-rt would be something like 228, roughly
calculating the savings and the increase due to spinlock_t going from
4 -> 32 bytes. Also I'd like to see the generated assembly after the
suggested reordering.
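
The rough estimate above can be written out as follows. This is only a sketch
of the arithmetic with hypothetical helper names; the 8-byte alignment is an
assumption based on the struct containing a u64 on a 64-bit build, and the
real size depends on the actual layout and config.

```c
#include <assert.h>

/* Assumed struct alignment (u64 member on a 64-bit build). */
#define SKETCH_EB_ALIGN	8u

static_assert((SKETCH_EB_ALIGN & (SKETCH_EB_ALIGN - 1)) == 0,
	      "alignment must be a power of two");

/* Size estimate after replacing a lock of rt_lock bytes by a smaller one. */
static inline unsigned int sketch_swap_lock(unsigned int size,
					    unsigned int rt_lock,
					    unsigned int plain_lock)
{
	return size - (rt_lock - plain_lock);
}

/* sizeof() is always a multiple of the alignment, so round the estimate up. */
static inline unsigned int sketch_round_up(unsigned int size)
{
	return (size + SKETCH_EB_ALIGN - 1) & ~(SKETCH_EB_ALIGN - 1);
}
```

sketch_swap_lock(256, 32, 4) gives the 228 mentioned above; rounding that up
to the assumed alignment gives 232, which is why the figure is only rough.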

The eb may not be perfect; I think there could be false sharing of
refs_lock and refs, but this is a wild guess based only on code
observation. You may have more luck with other data structures with
unnecessary holes but please optimize for non-RT first.
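
As a generic illustration of what "unnecessary holes" means, here is a
userspace sketch with two hypothetical structs. They are not the real
extent_buffer; the member names are borrowed only to make the example
familiar, and the flags member is kept word-sized. On a typical 64-bit ABI
the first layout has padding after folio_size and after the two byte-sized
members, and moving those bytes into the first hole removes both.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout with holes (typical 64-bit ABI). */
struct sketch_before {
	uint64_t start;
	uint32_t folio_size;	/* 4 bytes of padding usually follow */
	uint64_t bflags;
	int32_t refs;
	int32_t read_mirror;
	int8_t log_index;
	uint8_t folio_shift;	/* 6 bytes of padding usually follow */
	uint64_t tail;		/* stands in for the remaining members */
};

/* Same members, with the two bytes moved into the hole after folio_size. */
struct sketch_after {
	uint64_t start;
	uint32_t folio_size;
	uint8_t folio_shift;
	int8_t log_index;
	uint64_t bflags;
	int32_t refs;
	int32_t read_mirror;
	uint64_t tail;
};

static_assert(sizeof(struct sketch_after) <= sizeof(struct sketch_before),
	      "reordering must not grow the struct");

/* Bytes saved by the reordering on the ABI this is compiled for. */
static inline size_t sketch_bytes_saved(void)
{
	return sizeof(struct sketch_before) - sizeof(struct sketch_after);
}
```

On real kernel structures pahole reports such holes directly; the sketch only
shows the effect with sizeof and offsetof.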

