Re: [PATCH V2] jbd2: use rhashtable for revoke records during replay

public inbox for linux-ext4@vger.kernel.org
 help / color / mirror / Atom feed

From: Jan Kara <jack@suse.cz>
To: Li Dongyang <dongyangli@ddn.com>
Cc: linux-ext4@vger.kernel.org, Andreas Dilger <adilger@dilger.ca>,
	Alex Zhuravlev <bzzz@whamcloud.com>
Subject: Re: [PATCH V2] jbd2: use rhashtable for revoke records during replay
Date: Fri, 8 Nov 2024 11:33:58 +0100	[thread overview]
Message-ID: <20241108103358.ziocxsyapli2pexv@quack3> (raw)
In-Reply-To: <20241105034428.578701-1-dongyangli@ddn.com>

On Tue 05-11-24 14:44:28, Li Dongyang wrote:
> Resizable hashtable should improve journal replay time when
> we have million of revoke records.
> Notice that rhashtable is used during replay only,
> as removal with list_del() is less expensive and it's still used
> during regular processing.
> 
> before:
> 1048576 records - 95 seconds
> 2097152 records - 580 seconds

These are really high numbers of revoke records. Deleting couple GB of
metadata doesn't happen so easily. Are they from a real workload or just
a stress test?

> after:
> 1048576 records - 2 seconds
> 2097152 records - 3 seconds
> 4194304 records - 7 seconds

The gains are very nice :).

> Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
> Signed-off-by: Li Dongyang <dongyangli@ddn.com>

> diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
> index 667f67342c52..d9287439171c 100644
> --- a/fs/jbd2/recovery.c
> +++ b/fs/jbd2/recovery.c
> @@ -294,6 +294,10 @@ int jbd2_journal_recover(journal_t *journal)
>  	memset(&info, 0, sizeof(info));
>  	sb = journal->j_superblock;
>  
> +	err = jbd2_journal_init_recovery_revoke(journal);
> +	if (err)
> +		return err;
> +
>  	/*
>  	 * The journal superblock's s_start field (the current log head)
>  	 * is always zero if, and only if, the journal was cleanly
> diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
> index 4556e4689024..d6e96099e9c9 100644
> --- a/fs/jbd2/revoke.c
> +++ b/fs/jbd2/revoke.c
> @@ -90,6 +90,7 @@
>  #include <linux/bio.h>
>  #include <linux/log2.h>
>  #include <linux/hash.h>
> +#include <linux/rhashtable.h>
>  #endif
>  
>  static struct kmem_cache *jbd2_revoke_record_cache;
> @@ -101,7 +102,10 @@ static struct kmem_cache *jbd2_revoke_table_cache;
>  
>  struct jbd2_revoke_record_s
>  {
> -	struct list_head  hash;
> +	union {
> +		struct list_head  hash;
> +		struct rhash_head linkage;
> +	};
>  	tid_t		  sequence;	/* Used for recovery only */
>  	unsigned long long	  blocknr;
>  };
> @@ -680,13 +684,22 @@ static void flush_descriptor(journal_t *journal,
>   * single block.
>   */
>  
> +static const struct rhashtable_params revoke_rhashtable_params = {
> +	.key_len     = sizeof(unsigned long long),
> +	.key_offset  = offsetof(struct jbd2_revoke_record_s, blocknr),
> +	.head_offset = offsetof(struct jbd2_revoke_record_s, linkage),
> +};
> +

I'd probably view your performance results as: "JOURNAL_REVOKE_DEFAULT_HASH
is just too small for replays of a journal with huge numbers of revoked
blocks". Or did you observe that JOURNAL_REVOKE_DEFAULT_HASH is causing
performance issues also during normal operation when we track there revokes
for the current transaction?

If my interpretation is correct, then rhashtable is unnecessarily huge
hammer for this. Firstly, as the big hash is needed only during replay,
there's no concurrent access to the data structure. Secondly, we just fill
the data structure in the PASS_REVOKE scan and then use it. Thirdly, we
know the number of elements we need to store in the table in advance (well,
currently we don't but it's trivial to modify PASS_SCAN to get that
number). 

So rather than playing with rhashtable, I'd modify PASS_SCAN to sum up
number of revoke records we're going to process and then prepare a static
hash of appropriate size for replay (we can just use the standard hashing
fs/jbd2/revoke.c uses, just with differently sized hash table allocated for
replay and point journal->j_revoke to it). And once recovery completes
jbd2_journal_clear_revoke() can free the table and point journal->j_revoke
back to the original table. What do you think?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

next prev parent reply	other threads:[~2024-11-08 10:34 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-11-05  3:44 [PATCH V2] jbd2: use rhashtable for revoke records during replay Li Dongyang
2024-11-08 10:33 ` Jan Kara [this message]
2024-11-08 16:11   ` Theodore Ts'o
2024-11-12 18:44     ` Andreas Dilger
2024-11-13 14:47       ` Jan Kara
2025-01-16  0:08         ` Andreas Dilger
2025-01-16 18:04           ` Jan Kara
2024-11-09  3:12   ` Zhang Yi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20241108103358.ziocxsyapli2pexv@quack3 \
    --to=jack@suse.cz \
    --cc=adilger@dilger.ca \
    --cc=bzzz@whamcloud.com \
    --cc=dongyangli@ddn.com \
    --cc=linux-ext4@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox