public inbox for linux-ext4@vger.kernel.org
* [PATCH V2] jbd2: use rhashtable for revoke records during replay
@ 2024-11-05  3:44 Li Dongyang
  2024-11-08 10:33 ` Jan Kara
  0 siblings, 1 reply; 8+ messages in thread
From: Li Dongyang @ 2024-11-05  3:44 UTC (permalink / raw)
  To: linux-ext4; +Cc: Andreas Dilger, Alex Zhuravlev

Resizable hashtable should improve journal replay time when
we have millions of revoke records.
Notice that rhashtable is used during replay only,
as removal with list_del() is less expensive and it's still used
during regular processing.

before:
1048576 records - 95 seconds
2097152 records - 580 seconds

after:
1048576 records - 2 seconds
2097152 records - 3 seconds
4194304 records - 7 seconds

Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Signed-off-by: Li Dongyang <dongyangli@ddn.com>
---
v1->v2:
include rhashtable header in jbd2.h
---
 fs/jbd2/recovery.c   |  4 +++
 fs/jbd2/revoke.c     | 65 +++++++++++++++++++++++++++++++-------------
 include/linux/jbd2.h |  7 +++++
 3 files changed, 57 insertions(+), 19 deletions(-)

diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
index 667f67342c52..d9287439171c 100644
--- a/fs/jbd2/recovery.c
+++ b/fs/jbd2/recovery.c
@@ -294,6 +294,10 @@ int jbd2_journal_recover(journal_t *journal)
 	memset(&info, 0, sizeof(info));
 	sb = journal->j_superblock;
 
+	err = jbd2_journal_init_recovery_revoke(journal);
+	if (err)
+		return err;
+
 	/*
 	 * The journal superblock's s_start field (the current log head)
 	 * is always zero if, and only if, the journal was cleanly
diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
index 4556e4689024..d6e96099e9c9 100644
--- a/fs/jbd2/revoke.c
+++ b/fs/jbd2/revoke.c
@@ -90,6 +90,7 @@
 #include <linux/bio.h>
 #include <linux/log2.h>
 #include <linux/hash.h>
+#include <linux/rhashtable.h>
 #endif
 
 static struct kmem_cache *jbd2_revoke_record_cache;
@@ -101,7 +102,10 @@ static struct kmem_cache *jbd2_revoke_table_cache;
 
 struct jbd2_revoke_record_s
 {
-	struct list_head  hash;
+	union {
+		struct list_head  hash;
+		struct rhash_head linkage;
+	};
 	tid_t		  sequence;	/* Used for recovery only */
 	unsigned long long	  blocknr;
 };
@@ -680,13 +684,22 @@ static void flush_descriptor(journal_t *journal,
  * single block.
  */
 
+static const struct rhashtable_params revoke_rhashtable_params = {
+	.key_len     = sizeof(unsigned long long),
+	.key_offset  = offsetof(struct jbd2_revoke_record_s, blocknr),
+	.head_offset = offsetof(struct jbd2_revoke_record_s, linkage),
+};
+
 int jbd2_journal_set_revoke(journal_t *journal,
 		       unsigned long long blocknr,
 		       tid_t sequence)
 {
 	struct jbd2_revoke_record_s *record;
+	gfp_t gfp_mask = GFP_NOFS;
+	int err;
 
-	record = find_revoke_record(journal, blocknr);
+	record = rhashtable_lookup(&journal->j_revoke_rhtable, &blocknr,
+				   revoke_rhashtable_params);
 	if (record) {
 		/* If we have multiple occurrences, only record the
 		 * latest sequence number in the hashed record */
@@ -694,7 +707,22 @@ int jbd2_journal_set_revoke(journal_t *journal,
 			record->sequence = sequence;
 		return 0;
 	}
-	return insert_revoke_hash(journal, blocknr, sequence);
+
+	if (journal_oom_retry)
+		gfp_mask |= __GFP_NOFAIL;
+	record = kmem_cache_alloc(jbd2_revoke_record_cache, gfp_mask);
+	if (!record)
+		return -ENOMEM;
+
+	record->sequence = sequence;
+	record->blocknr = blocknr;
+	err = rhashtable_lookup_insert_fast(&journal->j_revoke_rhtable,
+					    &record->linkage,
+					    revoke_rhashtable_params);
+	if (err)
+		kmem_cache_free(jbd2_revoke_record_cache, record);
+
+	return err;
 }
 
 /*
@@ -710,7 +738,8 @@ int jbd2_journal_test_revoke(journal_t *journal,
 {
 	struct jbd2_revoke_record_s *record;
 
-	record = find_revoke_record(journal, blocknr);
+	record = rhashtable_lookup(&journal->j_revoke_rhtable, &blocknr,
+				   revoke_rhashtable_params);
 	if (!record)
 		return 0;
 	if (tid_gt(sequence, record->sequence))
@@ -718,6 +747,17 @@ int jbd2_journal_test_revoke(journal_t *journal,
 	return 1;
 }
 
+int jbd2_journal_init_recovery_revoke(journal_t *journal)
+{
+	return rhashtable_init(&journal->j_revoke_rhtable,
+			       &revoke_rhashtable_params);
+}
+
+static void jbd2_revoke_record_free(void *ptr, void *arg)
+{
+	kmem_cache_free(jbd2_revoke_record_cache, ptr);
+}
+
 /*
  * Finally, once recovery is over, we need to clear the revoke table so
  * that it can be reused by the running filesystem.
@@ -725,19 +765,6 @@ int jbd2_journal_test_revoke(journal_t *journal,
 
 void jbd2_journal_clear_revoke(journal_t *journal)
 {
-	int i;
-	struct list_head *hash_list;
-	struct jbd2_revoke_record_s *record;
-	struct jbd2_revoke_table_s *revoke;
-
-	revoke = journal->j_revoke;
-
-	for (i = 0; i < revoke->hash_size; i++) {
-		hash_list = &revoke->hash_table[i];
-		while (!list_empty(hash_list)) {
-			record = (struct jbd2_revoke_record_s*) hash_list->next;
-			list_del(&record->hash);
-			kmem_cache_free(jbd2_revoke_record_cache, record);
-		}
-	}
+	rhashtable_free_and_destroy(&journal->j_revoke_rhtable,
+				    jbd2_revoke_record_free, NULL);
 }
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 8aef9bb6ad57..2b0aa1e159b8 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -28,6 +28,7 @@
 #include <linux/slab.h>
 #include <linux/bit_spinlock.h>
 #include <linux/blkdev.h>
+#include <linux/rhashtable-types.h>
 #include <crypto/hash.h>
 #endif
 
@@ -1122,6 +1123,11 @@ struct journal_s
 	 */
 	struct jbd2_revoke_table_s *j_revoke_table[2];
 
+	/**
+	 * @j_revoke_rhtable:	rhashtable for revoke records during recovery
+	 */
+	struct rhashtable	j_revoke_rhtable;
+
 	/**
 	 * @j_wbuf: Array of bhs for jbd2_journal_commit_transaction.
 	 */
@@ -1644,6 +1650,7 @@ extern void	   jbd2_journal_write_revoke_records(transaction_t *transaction,
 /* Recovery revoke support */
 extern int	jbd2_journal_set_revoke(journal_t *, unsigned long long, tid_t);
 extern int	jbd2_journal_test_revoke(journal_t *, unsigned long long, tid_t);
+extern int	jbd2_journal_init_recovery_revoke(journal_t *);
 extern void	jbd2_journal_clear_revoke(journal_t *);
 extern void	jbd2_journal_switch_revoke_table(journal_t *journal);
 extern void	jbd2_clear_buffer_revoked_flags(journal_t *journal);
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread
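
The super-linear slowdown in the "before" numbers is what a fixed-size chained hash table predicts: during replay the revoke table has a constant bucket count (sized from JOURNAL_REVOKE_DEFAULT_HASH, 256 buckets), so every insert first walks an ever-longer chain. A hedged back-of-envelope cost model (illustrative only; expected_chain_comparisons is not a jbd2 function):

```c
#include <stdint.h>

/*
 * Rough cost model: inserting n uniformly hashed revoke records into a
 * chained hash table with a fixed number of buckets. Each insert first
 * searches its own chain, whose expected length is (records inserted so
 * far) / buckets, so the total number of key comparisons is roughly
 * n^2 / (2 * buckets).
 */
static uint64_t expected_chain_comparisons(uint64_t n, uint64_t buckets)
{
	return n * n / (2 * buckets);
}
```

For 2,097,152 records this model gives roughly 8.6 billion comparisons with 256 buckets, but only about 8.4 million once the table holds n/8 buckets, which lines up with the 580 s vs. 3 s figures above.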

* Re: [PATCH V2] jbd2: use rhashtable for revoke records during replay
  2024-11-05  3:44 [PATCH V2] jbd2: use rhashtable for revoke records during replay Li Dongyang
@ 2024-11-08 10:33 ` Jan Kara
  2024-11-08 16:11   ` Theodore Ts'o
  2024-11-09  3:12   ` Zhang Yi
  0 siblings, 2 replies; 8+ messages in thread
From: Jan Kara @ 2024-11-08 10:33 UTC (permalink / raw)
  To: Li Dongyang; +Cc: linux-ext4, Andreas Dilger, Alex Zhuravlev

On Tue 05-11-24 14:44:28, Li Dongyang wrote:
> Resizable hashtable should improve journal replay time when
> we have millions of revoke records.
> Notice that rhashtable is used during replay only,
> as removal with list_del() is less expensive and it's still used
> during regular processing.
> 
> before:
> 1048576 records - 95 seconds
> 2097152 records - 580 seconds

These are really high numbers of revoke records. Deleting a couple GB of
metadata doesn't happen so easily. Are they from a real workload or just
a stress test?
 
> after:
> 1048576 records - 2 seconds
> 2097152 records - 3 seconds
> 4194304 records - 7 seconds

The gains are very nice :).

> Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
> Signed-off-by: Li Dongyang <dongyangli@ddn.com>

> diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
> index 667f67342c52..d9287439171c 100644
> --- a/fs/jbd2/recovery.c
> +++ b/fs/jbd2/recovery.c
> @@ -294,6 +294,10 @@ int jbd2_journal_recover(journal_t *journal)
>  	memset(&info, 0, sizeof(info));
>  	sb = journal->j_superblock;
>  
> +	err = jbd2_journal_init_recovery_revoke(journal);
> +	if (err)
> +		return err;
> +
>  	/*
>  	 * The journal superblock's s_start field (the current log head)
>  	 * is always zero if, and only if, the journal was cleanly
> diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
> index 4556e4689024..d6e96099e9c9 100644
> --- a/fs/jbd2/revoke.c
> +++ b/fs/jbd2/revoke.c
> @@ -90,6 +90,7 @@
>  #include <linux/bio.h>
>  #include <linux/log2.h>
>  #include <linux/hash.h>
> +#include <linux/rhashtable.h>
>  #endif
>  
>  static struct kmem_cache *jbd2_revoke_record_cache;
> @@ -101,7 +102,10 @@ static struct kmem_cache *jbd2_revoke_table_cache;
>  
>  struct jbd2_revoke_record_s
>  {
> -	struct list_head  hash;
> +	union {
> +		struct list_head  hash;
> +		struct rhash_head linkage;
> +	};
>  	tid_t		  sequence;	/* Used for recovery only */
>  	unsigned long long	  blocknr;
>  };
> @@ -680,13 +684,22 @@ static void flush_descriptor(journal_t *journal,
>   * single block.
>   */
>  
> +static const struct rhashtable_params revoke_rhashtable_params = {
> +	.key_len     = sizeof(unsigned long long),
> +	.key_offset  = offsetof(struct jbd2_revoke_record_s, blocknr),
> +	.head_offset = offsetof(struct jbd2_revoke_record_s, linkage),
> +};
> +

I'd probably view your performance results as: "JOURNAL_REVOKE_DEFAULT_HASH
is just too small for replays of a journal with huge numbers of revoked
blocks". Or did you observe that JOURNAL_REVOKE_DEFAULT_HASH is causing
performance issues also during normal operation when we track the revokes
for the current transaction?

If my interpretation is correct, then rhashtable is an unnecessarily huge
hammer for this. Firstly, as the big hash is needed only during replay,
there's no concurrent access to the data structure. Secondly, we just fill
the data structure in the PASS_REVOKE scan and then use it. Thirdly, we
know the number of elements we need to store in the table in advance (well,
currently we don't but it's trivial to modify PASS_SCAN to get that
number). 

So rather than playing with rhashtable, I'd modify PASS_SCAN to sum up
number of revoke records we're going to process and then prepare a static
hash of appropriate size for replay (we can just use the standard hashing
fs/jbd2/revoke.c uses, just with differently sized hash table allocated for
replay and point journal->j_revoke to it). And once recovery completes
jbd2_journal_clear_revoke() can free the table and point journal->j_revoke
back to the original table. What do you think?
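
A minimal sketch of the sizing rule suggested above, assuming PASS_SCAN has already produced a revoke-record count (the helper names are illustrative, not the jbd2 API, and the kernel's roundup_pow_of_two() is open-coded here to keep the sketch self-contained):

```c
/*
 * Illustrative sketch only: pick a static hash size for journal replay
 * from the number of revoke records counted during PASS_SCAN. Aim for
 * chains of about 8 entries, never go below the default table size,
 * and cap the bucket count so a corrupted or malicious journal cannot
 * force a huge allocation.
 */
static unsigned int roundup_pow_of_two_u(unsigned int x)
{
	unsigned int p = 1;

	while (p < x)
		p <<= 1;
	return p;
}

static unsigned int replay_hash_size(unsigned int nr_revokes)
{
	unsigned int hash_size = roundup_pow_of_two_u(nr_revokes / 8);

	if (hash_size < 256)		/* floor at the default table size */
		hash_size = 256;
	if (hash_size > (1U << 20))	/* cap at 1M buckets */
		hash_size = 1U << 20;
	return hash_size;
}
```

With 2,097,152 revoke records this yields 262,144 buckets, i.e. about 8 records per chain instead of about 8192 with a fixed 256-bucket table.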

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH V2] jbd2: use rhashtable for revoke records during replay
  2024-11-08 10:33 ` Jan Kara
@ 2024-11-08 16:11   ` Theodore Ts'o
  2024-11-12 18:44     ` Andreas Dilger
  2024-11-09  3:12   ` Zhang Yi
  1 sibling, 1 reply; 8+ messages in thread
From: Theodore Ts'o @ 2024-11-08 16:11 UTC (permalink / raw)
  To: Jan Kara; +Cc: Li Dongyang, linux-ext4, Andreas Dilger, Alex Zhuravlev

On Fri, Nov 08, 2024 at 11:33:58AM +0100, Jan Kara wrote:
> > 1048576 records - 95 seconds
> > 2097152 records - 580 seconds
> 
> These are really high numbers of revoke records. Deleting a couple GB of
> metadata doesn't happen so easily. Are they from a real workload or just
> a stress test?

For context, the background of this is that this has been an
out-of-tree patch that's been around for a very long time, for use with
Lustre servers where, apparently, this very large number of revoke
records is a real thing.

> If my interpretation is correct, then rhashtable is an unnecessarily
> huge hammer for this. Firstly, as the big hash is needed only during
> replay, there's no concurrent access to the data
> structure. Secondly, we just fill the data structure in the
> PASS_REVOKE scan and then use it. Thirdly, we know the number of
> elements we need to store in the table in advance (well, currently
> we don't but it's trivial to modify PASS_SCAN to get that number).
> 
> So rather than playing with rhashtable, I'd modify PASS_SCAN to sum
> up number of revoke records we're going to process and then prepare
> a static hash of appropriate size for replay (we can just use the
> standard hashing fs/jbd2/revoke.c uses, just with differently sized
> hash table allocated for replay and point journal->j_revoke to
> it). And once recovery completes jbd2_journal_clear_revoke() can
> free the table and point journal->j_revoke back to the original
> table. What do you think?

Hmm, that's a really nice idea; Andreas, what do you think?

     	      	     	  		 - Ted


* Re: [PATCH V2] jbd2: use rhashtable for revoke records during replay
  2024-11-08 10:33 ` Jan Kara
  2024-11-08 16:11   ` Theodore Ts'o
@ 2024-11-09  3:12   ` Zhang Yi
  1 sibling, 0 replies; 8+ messages in thread
From: Zhang Yi @ 2024-11-09  3:12 UTC (permalink / raw)
  To: Jan Kara, Li Dongyang; +Cc: linux-ext4, Andreas Dilger, Alex Zhuravlev

On 2024/11/8 18:33, Jan Kara wrote:
> On Tue 05-11-24 14:44:28, Li Dongyang wrote:
>> Resizable hashtable should improve journal replay time when
>> we have millions of revoke records.
>> Notice that rhashtable is used during replay only,
>> as removal with list_del() is less expensive and it's still used
>> during regular processing.
>>
>> before:
>> 1048576 records - 95 seconds
>> 2097152 records - 580 seconds
> 
> These are really high numbers of revoke records. Deleting a couple GB of
> metadata doesn't happen so easily. Are they from a real workload or just
> a stress test?
>  
>> after:
>> 1048576 records - 2 seconds
>> 2097152 records - 3 seconds
>> 4194304 records - 7 seconds
> 
> The gains are very nice :).
> 
>> Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
>> Signed-off-by: Li Dongyang <dongyangli@ddn.com>
> 
>> diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
>> index 667f67342c52..d9287439171c 100644
>> --- a/fs/jbd2/recovery.c
>> +++ b/fs/jbd2/recovery.c
>> @@ -294,6 +294,10 @@ int jbd2_journal_recover(journal_t *journal)
>>  	memset(&info, 0, sizeof(info));
>>  	sb = journal->j_superblock;
>>  
>> +	err = jbd2_journal_init_recovery_revoke(journal);
>> +	if (err)
>> +		return err;
>> +
>>  	/*
>>  	 * The journal superblock's s_start field (the current log head)
>>  	 * is always zero if, and only if, the journal was cleanly
>> diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
>> index 4556e4689024..d6e96099e9c9 100644
>> --- a/fs/jbd2/revoke.c
>> +++ b/fs/jbd2/revoke.c
>> @@ -90,6 +90,7 @@
>>  #include <linux/bio.h>
>>  #include <linux/log2.h>
>>  #include <linux/hash.h>
>> +#include <linux/rhashtable.h>
>>  #endif
>>  
>>  static struct kmem_cache *jbd2_revoke_record_cache;
>> @@ -101,7 +102,10 @@ static struct kmem_cache *jbd2_revoke_table_cache;
>>  
>>  struct jbd2_revoke_record_s
>>  {
>> -	struct list_head  hash;
>> +	union {
>> +		struct list_head  hash;
>> +		struct rhash_head linkage;
>> +	};
>>  	tid_t		  sequence;	/* Used for recovery only */
>>  	unsigned long long	  blocknr;
>>  };
>> @@ -680,13 +684,22 @@ static void flush_descriptor(journal_t *journal,
>>   * single block.
>>   */
>>  
>> +static const struct rhashtable_params revoke_rhashtable_params = {
>> +	.key_len     = sizeof(unsigned long long),
>> +	.key_offset  = offsetof(struct jbd2_revoke_record_s, blocknr),
>> +	.head_offset = offsetof(struct jbd2_revoke_record_s, linkage),
>> +};
>> +
> 
> I'd probably view your performance results as: "JOURNAL_REVOKE_DEFAULT_HASH
> is just too small for replays of a journal with huge numbers of revoked
> blocks". Or did you observe that JOURNAL_REVOKE_DEFAULT_HASH is causing
> performance issues also during normal operation when we track the revokes
> for the current transaction?
> 
> If my interpretation is correct, then rhashtable is an unnecessarily huge
> hammer for this. Firstly, as the big hash is needed only during replay,
> there's no concurrent access to the data structure. Secondly, we just fill
> the data structure in the PASS_REVOKE scan and then use it. Thirdly, we
> know the number of elements we need to store in the table in advance (well,
> currently we don't but it's trivial to modify PASS_SCAN to get that
> number). 
> 
> So rather than playing with rhashtable, I'd modify PASS_SCAN to sum up
> number of revoke records we're going to process and then prepare a static
> hash of appropriate size for replay (we can just use the standard hashing
> fs/jbd2/revoke.c uses, just with differently sized hash table allocated for
> replay and point journal->j_revoke to it). And once recovery completes
> jbd2_journal_clear_revoke() can free the table and point journal->j_revoke
> back to the original table. What do you think?
> 

Sounds reasonable to me. I'd vote for this solution, this is a really simple
and clear solution, and I believe it can achieve similar gains as rhashtable.

Thanks,
Yi.



* Re: [PATCH V2] jbd2: use rhashtable for revoke records during replay
  2024-11-08 16:11   ` Theodore Ts'o
@ 2024-11-12 18:44     ` Andreas Dilger
  2024-11-13 14:47       ` Jan Kara
  0 siblings, 1 reply; 8+ messages in thread
From: Andreas Dilger @ 2024-11-12 18:44 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Jan Kara, Li Dongyang, linux-ext4, Alex Zhuravlev

[-- Attachment #1: Type: text/plain, Size: 2471 bytes --]

On Nov 8, 2024, at 9:11 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> 
> On Fri, Nov 08, 2024 at 11:33:58AM +0100, Jan Kara wrote:
>>> 1048576 records - 95 seconds
>>> 2097152 records - 580 seconds
>> 
>> These are really high numbers of revoke records. Deleting a couple GB of
>> metadata doesn't happen so easily. Are they from a real workload or just
>> a stress test?
> 
> For context, the background of this is that this has been an
> out-of-tree patch that's been around for a very long time, for use with
> Lustre servers where, apparently, this very large number of revoke
> records is a real thing.

Yes, we've seen this in production if there was a crash after deleting
many millions of log records.  This causes remount to take potentially
several hours before completing (and this was made worse by HA causing
failovers due to mount being "stuck" doing the journal replay).

>> If my interpretation is correct, then rhashtable is an unnecessarily
>> huge hammer for this. Firstly, as the big hash is needed only during
>> replay, there's no concurrent access to the data
>> structure. Secondly, we just fill the data structure in the
>> PASS_REVOKE scan and then use it. Thirdly, we know the number of
>> elements we need to store in the table in advance (well, currently
>> we don't but it's trivial to modify PASS_SCAN to get that number).
>> 
>> So rather than playing with rhashtable, I'd modify PASS_SCAN to sum
>> up number of revoke records we're going to process and then prepare
>> a static hash of appropriate size for replay (we can just use the
>> standard hashing fs/jbd2/revoke.c uses, just with differently sized
>> hash table allocated for replay and point journal->j_revoke to
>> it). And once recovery completes jbd2_journal_clear_revoke() can
>> free the table and point journal->j_revoke back to the original
>> table. What do you think?
> 
> Hmm, that's a really nice idea; Andreas, what do you think?

Implementing code to manually count and resize the recovery hashtable
will also have its own complexity, including possible allocation size
limits for a huge hash table.  That could be worked around by kvmalloc(),
but IMHO this essentially starts "open coding" something rhashtable was
exactly designed to avoid.

Since the rhashtable is only used during the journal recovery phase,
IMHO it isn't adding any complexity to the normal operational code, but
if there was ongoing overhead then this would be a different discussion.

Cheers, Andreas








* Re: [PATCH V2] jbd2: use rhashtable for revoke records during replay
  2024-11-12 18:44     ` Andreas Dilger
@ 2024-11-13 14:47       ` Jan Kara
  2025-01-16  0:08         ` Andreas Dilger
  0 siblings, 1 reply; 8+ messages in thread
From: Jan Kara @ 2024-11-13 14:47 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Theodore Ts'o, Jan Kara, Li Dongyang, linux-ext4,
	Alex Zhuravlev

[-- Attachment #1: Type: text/plain, Size: 3101 bytes --]

On Tue 12-11-24 11:44:11, Andreas Dilger wrote:
> On Nov 8, 2024, at 9:11 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> > 
> > On Fri, Nov 08, 2024 at 11:33:58AM +0100, Jan Kara wrote:
> >>> 1048576 records - 95 seconds
> >>> 2097152 records - 580 seconds
> >> 
> >> These are really high numbers of revoke records. Deleting a couple GB of
> >> metadata doesn't happen so easily. Are they from a real workload or just
> >> a stress test?
> > 
> > For context, the background of this is that this has been an
> > out-of-tree patch that's been around for a very long time, for use with
> > Lustre servers where, apparently, this very large number of revoke
> > records is a real thing.
> 
> Yes, we've seen this in production if there was a crash after deleting
> many millions of log records.  This causes remount to take potentially
> several hours before completing (and this was made worse by HA causing
> failovers due to mount being "stuck" doing the journal replay).

Thanks for clarification!

> >> If my interpretation is correct, then rhashtable is an unnecessarily
> >> huge hammer for this. Firstly, as the big hash is needed only during
> >> replay, there's no concurrent access to the data
> >> structure. Secondly, we just fill the data structure in the
> >> PASS_REVOKE scan and then use it. Thirdly, we know the number of
> >> elements we need to store in the table in advance (well, currently
> >> we don't but it's trivial to modify PASS_SCAN to get that number).
> >> 
> >> So rather than playing with rhashtable, I'd modify PASS_SCAN to sum
> >> up number of revoke records we're going to process and then prepare
> >> a static hash of appropriate size for replay (we can just use the
> >> standard hashing fs/jbd2/revoke.c uses, just with differently sized
> >> hash table allocated for replay and point journal->j_revoke to
> >> it). And once recovery completes jbd2_journal_clear_revoke() can
> >> free the table and point journal->j_revoke back to the original
> >> table. What do you think?
> > 
> > Hmm, that's a really nice idea; Andreas, what do you think?
> 
> Implementing code to manually count and resize the recovery hashtable
> will also have its own complexity, including possible allocation size
> limits for a huge hash table.  That could be worked around by kvmalloc(),
> but IMHO this essentially starts "open coding" something rhashtable was
> exactly designed to avoid.

Well, I'd say the result is much simpler than rhashtable code since you
don't need all that dynamic reallocation and complex locking. Attached is a
patch that implements my suggestion. I'd say it is simpler than having two
types of revoke block hashing depending on whether we are doing recovery or
running the journal. I've tested it and it seems to work fine (including
replay of a journal with sufficiently many revoke blocks) but I'm not sure
I can do a meaningful performance testing (I cannot quite reproduce the
slow replay times even when shutting down the filesystem after deleting
1000000 directories). So can you please give it a spin?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: 0001-jbd2-Avoid-long-replay-times-due-to-high-number-or-r.patch --]
[-- Type: text/x-patch, Size: 6628 bytes --]

From db87b1d2cac01bc8336b70a32616388e6ff9fa8f Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Wed, 13 Nov 2024 11:53:13 +0100
Subject: [PATCH] jbd2: Avoid long replay times due to high number of revoke
 blocks

Some users are reporting journal replay takes a long time when there is
an excessive number of revoke blocks in the journal. Reported times are
like:

1048576 records - 95 seconds
2097152 records - 580 seconds

The problem is that hash chains in the revoke table get excessively
long in these cases. Fix the problem by sizing the revoke table
appropriately before the revoke pass.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/recovery.c   | 54 +++++++++++++++++++++++++++++++++++++-------
 fs/jbd2/revoke.c     |  8 +++----
 include/linux/jbd2.h |  2 ++
 3 files changed, 52 insertions(+), 12 deletions(-)

diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
index 667f67342c52..9845f72e456a 100644
--- a/fs/jbd2/recovery.c
+++ b/fs/jbd2/recovery.c
@@ -39,7 +39,7 @@ struct recovery_info
 
 static int do_one_pass(journal_t *journal,
 				struct recovery_info *info, enum passtype pass);
-static int scan_revoke_records(journal_t *, struct buffer_head *,
+static int scan_revoke_records(journal_t *, enum passtype, struct buffer_head *,
 				tid_t, struct recovery_info *);
 
 #ifdef __KERNEL__
@@ -327,6 +327,12 @@ int jbd2_journal_recover(journal_t *journal)
 		  journal->j_transaction_sequence, journal->j_head);
 
 	jbd2_journal_clear_revoke(journal);
+	/* Free revoke table allocated for replay */
+	if (journal->j_revoke != journal->j_revoke_table[0] &&
+	    journal->j_revoke != journal->j_revoke_table[1]) {
+		jbd2_journal_destroy_revoke_table(journal->j_revoke);
+		journal->j_revoke = journal->j_revoke_table[1];
+	}
 	err2 = sync_blockdev(journal->j_fs_dev);
 	if (!err)
 		err = err2;
@@ -517,6 +523,31 @@ static int do_one_pass(journal_t *journal,
 	first_commit_ID = next_commit_ID;
 	if (pass == PASS_SCAN)
 		info->start_transaction = first_commit_ID;
+	else if (pass == PASS_REVOKE) {
+		/*
+		 * Would the default revoke table have too long hash chains
+		 * during replay?
+		 */
+		if (info->nr_revokes > JOURNAL_REVOKE_DEFAULT_HASH * 16) {
+			unsigned int hash_size;
+
+			/*
+			 * Aim for average chain length of 8, limit at 1M
+			 * entries to avoid problems with malicious
+			 * filesystems.
+			 */
+			hash_size = min(roundup_pow_of_two(info->nr_revokes / 8),
+					1U << 20);
+			journal->j_revoke =
+				jbd2_journal_init_revoke_table(hash_size);
+			if (!journal->j_revoke) {
+				printk(KERN_ERR
+				       "JBD2: failed to allocate revoke table for replay with %u entries. "
+				       "Journal replay may be slow.\n", hash_size);
+				journal->j_revoke = journal->j_revoke_table[1];
+			}
+		}
+	}
 
 	jbd2_debug(1, "Starting recovery pass %d\n", pass);
 
@@ -874,14 +905,16 @@ static int do_one_pass(journal_t *journal,
 				need_check_commit_time = true;
 			}
 
-			/* If we aren't in the REVOKE pass, then we can
-			 * just skip over this block. */
-			if (pass != PASS_REVOKE) {
+			/*
+			 * If we aren't in the SCAN or REVOKE pass, then we can
+			 * just skip over this block.
+			 */
+			if (pass != PASS_REVOKE && pass != PASS_SCAN) {
 				brelse(bh);
 				continue;
 			}
 
-			err = scan_revoke_records(journal, bh,
+			err = scan_revoke_records(journal, pass, bh,
 						  next_commit_ID, info);
 			brelse(bh);
 			if (err)
@@ -937,8 +970,9 @@ static int do_one_pass(journal_t *journal,
 
 /* Scan a revoke record, marking all blocks mentioned as revoked. */
 
-static int scan_revoke_records(journal_t *journal, struct buffer_head *bh,
-			       tid_t sequence, struct recovery_info *info)
+static int scan_revoke_records(journal_t *journal, enum passtype pass,
+			       struct buffer_head *bh, tid_t sequence,
+			       struct recovery_info *info)
 {
 	jbd2_journal_revoke_header_t *header;
 	int offset, max;
@@ -959,6 +993,11 @@ static int scan_revoke_records(journal_t *journal, struct buffer_head *bh,
 	if (jbd2_has_feature_64bit(journal))
 		record_len = 8;
 
+	if (pass == PASS_SCAN) {
+		info->nr_revokes += (max - offset) / record_len;
+		return 0;
+	}
+
 	while (offset + record_len <= max) {
 		unsigned long long blocknr;
 		int err;
@@ -971,7 +1010,6 @@ static int scan_revoke_records(journal_t *journal, struct buffer_head *bh,
 		err = jbd2_journal_set_revoke(journal, blocknr, sequence);
 		if (err)
 			return err;
-		++info->nr_revokes;
 	}
 	return 0;
 }
diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
index 4556e4689024..f4ac308e84c5 100644
--- a/fs/jbd2/revoke.c
+++ b/fs/jbd2/revoke.c
@@ -215,7 +215,7 @@ int __init jbd2_journal_init_revoke_table_cache(void)
 	return 0;
 }
 
-static struct jbd2_revoke_table_s *jbd2_journal_init_revoke_table(int hash_size)
+struct jbd2_revoke_table_s *jbd2_journal_init_revoke_table(int hash_size)
 {
 	int shift = 0;
 	int tmp = hash_size;
@@ -231,7 +231,7 @@ static struct jbd2_revoke_table_s *jbd2_journal_init_revoke_table(int hash_size)
 	table->hash_size = hash_size;
 	table->hash_shift = shift;
 	table->hash_table =
-		kmalloc_array(hash_size, sizeof(struct list_head), GFP_KERNEL);
+		kvmalloc_array(hash_size, sizeof(struct list_head), GFP_KERNEL);
 	if (!table->hash_table) {
 		kmem_cache_free(jbd2_revoke_table_cache, table);
 		table = NULL;
@@ -245,7 +245,7 @@ static struct jbd2_revoke_table_s *jbd2_journal_init_revoke_table(int hash_size)
 	return table;
 }
 
-static void jbd2_journal_destroy_revoke_table(struct jbd2_revoke_table_s *table)
+void jbd2_journal_destroy_revoke_table(struct jbd2_revoke_table_s *table)
 {
 	int i;
 	struct list_head *hash_list;
@@ -255,7 +255,7 @@ static void jbd2_journal_destroy_revoke_table(struct jbd2_revoke_table_s *table)
 		J_ASSERT(list_empty(hash_list));
 	}
 
-	kfree(table->hash_table);
+	kvfree(table->hash_table);
 	kmem_cache_free(jbd2_revoke_table_cache, table);
 }
 
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 8aef9bb6ad57..781615214d47 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1634,6 +1634,8 @@ extern void	   jbd2_journal_destroy_revoke_record_cache(void);
 extern void	   jbd2_journal_destroy_revoke_table_cache(void);
 extern int __init jbd2_journal_init_revoke_record_cache(void);
 extern int __init jbd2_journal_init_revoke_table_cache(void);
+struct jbd2_revoke_table_s *jbd2_journal_init_revoke_table(int hash_size);
+void jbd2_journal_destroy_revoke_table(struct jbd2_revoke_table_s *table);
 
 extern void	   jbd2_journal_destroy_revoke(journal_t *);
 extern int	   jbd2_journal_revoke (handle_t *, unsigned long long, struct buffer_head *);
-- 
2.35.3



* Re: [PATCH V2] jbd2: use rhashtable for revoke records during replay
  2024-11-13 14:47       ` Jan Kara
@ 2025-01-16  0:08         ` Andreas Dilger
  2025-01-16 18:04           ` Jan Kara
  0 siblings, 1 reply; 8+ messages in thread
From: Andreas Dilger @ 2025-01-16  0:08 UTC (permalink / raw)
  To: Jan Kara, Theodore Ts'o
  Cc: Li Dongyang, Ext4 Developers List, Alex Zhuravlev

[-- Attachment #1: Type: text/plain, Size: 11780 bytes --]

On Nov 13, 2024, at 7:47 AM, Jan Kara <jack@suse.cz> wrote:
> 
> On Tue 12-11-24 11:44:11, Andreas Dilger wrote:
>> On Nov 8, 2024, at 9:11 AM, Theodore Ts'o <tytso@mit.edu> wrote:
>>> 
>>> On Fri, Nov 08, 2024 at 11:33:58AM +0100, Jan Kara wrote:
>>>>> 1048576 records - 95 seconds
>>>>> 2097152 records - 580 seconds
>>>> 
>>>> These are really high numbers of revoke records. Deleting a couple GB of
>>>> metadata doesn't happen so easily. Are they from a real workload or just
>>>> a stress test?
>>> 
>>> For context, the background of this is that this has been an
>>> out-of-tree patch that's been around for a very long time, for use with
>>> Lustre servers where, apparently, this very large number of revoke
>>> records is a real thing.
>> 
>> Yes, we've seen this in production if there was a crash after deleting
>> many millions of log records.  This causes remount to take potentially
>> several hours before completing (and this was made worse by HA causing
>> failovers due to mount being "stuck" doing the journal replay).
> 
> Thanks for clarification!
> 
>>>> If my interpretation is correct, then rhashtable is an unnecessarily
>>>> huge hammer for this. Firstly, as the big hash is needed only during
>>>> replay, there's no concurrent access to the data
>>>> structure. Secondly, we just fill the data structure in the
>>>> PASS_REVOKE scan and then use it. Thirdly, we know the number of
>>>> elements we need to store in the table in advance (well, currently
>>>> we don't but it's trivial to modify PASS_SCAN to get that number).
>>>> 
>>>> So rather than playing with rhashtable, I'd modify PASS_SCAN to sum
>>>> up number of revoke records we're going to process and then prepare
>>>> a static hash of appropriate size for replay (we can just use the
>>>> standard hashing fs/jbd2/revoke.c uses, just with differently sized
>>>> hash table allocated for replay and point journal->j_revoke to
>>>> it). And once recovery completes jbd2_journal_clear_revoke() can
>>>> free the table and point journal->j_revoke back to the original
>>>> table. What do you think?
>>> 
>>> Hmm, that's a really nice idea; Andreas, what do you think?
>> 
>> Implementing code to manually count and resize the recovery hashtable
>> will also have its own complexity, including possible allocation size
>> limits for a huge hash table.  That could be worked around by kvmalloc(),
>> but IMHO this essentially starts "open coding" something rhashtable was
>> exactly designed to avoid.
> 
> Well, I'd say the result is much simpler than rhashtable code since
> you don't need all that dynamic reallocation and complex locking.
> 
> Attached is a patch that implements my suggestion. I'd say it is
> simpler than having two types of revoke block hashing depending on
> whether we are doing recovery or running the journal.
> 
> I've tested it and it seems to work fine (including replay of a
> journal with sufficiently many revoke blocks) but I'm not sure
> I can do a meaningful performance testing (I cannot quite reproduce
> the slow replay times even when shutting down the filesystem after
> deleting 1000000 directories). So can you please give it a spin?

Alex posted test results on the other rhashtable revoke thread,
which show both the rhashtable and Jan's dynamically-allocated hash
table perform much better than the original fixed-size hash table.

On Jan 13, 2025, at 8:31 AM, Alexey Zhuravlev <azhuravlev@ddn.com> wrote:
> I benchmarked rhashtable based patch vs Jan's patch:
> 
> records		vanilla	rhashtable	JK patch
> 2.5M records	102s	29s		25s
> 5.0M records	317s	28s		30s
> 6.0M records	--	35s		44s
> 
> The tests were done on a 4.18 kernel (I guess this doesn't
> matter much in this context), using an SSD.
> 
> Time to mount after a crash (simulated with read-only device
> mapper) was measured.
> 
> Unfortunately I wasn't able to reproduce with more records
> as my test node has just 32GB RAM.

I'm OK with either variant landing.  This issue doesn't happen on
a regular basis, only when a huge amount of data is deleted just
before the server crashes, so waiting +/-5s for revoke processing
isn't critical, but waiting for 30h definitely is a problem.

Jan, do you want to resubmit your patch as a standalone email,
or can Ted just take it from below (or the original attachment on
your email)?

Note I also have a patch for e2fsprogs to scale the revoke hashtable
size, to avoid a similar issue with e2fsck; I will submit it separately.

Cheers, Andreas

> From db87b1d2cac01bc8336b70a32616388e6ff9fa8f Mon Sep 17 00:00:00 2001
> From: Jan Kara <jack@suse.cz>
> Date: Wed, 13 Nov 2024 11:53:13 +0100
> Subject: [PATCH] jbd2: Avoid long replay times due to high number of revoke
>  blocks
> 
> Some users are reporting journal replay takes a long time when there is
> excessive number of revoke blocks in the journal. Reported times are
> like:
> 
> 1048576 records - 95 seconds
> 2097152 records - 580 seconds
> 
> The problem is that hash chains in the revoke table get excessively
> long in these cases. Fix the problem by sizing the revoke table
> appropriately before the revoke pass.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/jbd2/recovery.c   | 54 +++++++++++++++++++++++++++++++++++++-------
>  fs/jbd2/revoke.c     |  8 +++----
>  include/linux/jbd2.h |  2 ++
>  3 files changed, 52 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
> index 667f67342c52..9845f72e456a 100644
> --- a/fs/jbd2/recovery.c
> +++ b/fs/jbd2/recovery.c
> @@ -39,7 +39,7 @@ struct recovery_info
> 
>  static int do_one_pass(journal_t *journal,
>  				struct recovery_info *info, enum passtype pass);
> -static int scan_revoke_records(journal_t *, struct buffer_head *,
> +static int scan_revoke_records(journal_t *, enum passtype, struct buffer_head *,
>  				tid_t, struct recovery_info *);
> 
>  #ifdef __KERNEL__
> @@ -327,6 +327,12 @@ int jbd2_journal_recover(journal_t *journal)
>  		  journal->j_transaction_sequence, journal->j_head);
> 
>  	jbd2_journal_clear_revoke(journal);
> +	/* Free revoke table allocated for replay */
> +	if (journal->j_revoke != journal->j_revoke_table[0] &&
> +	    journal->j_revoke != journal->j_revoke_table[1]) {
> +		jbd2_journal_destroy_revoke_table(journal->j_revoke);
> +		journal->j_revoke = journal->j_revoke_table[1];
> +	}
>  	err2 = sync_blockdev(journal->j_fs_dev);
>  	if (!err)
>  		err = err2;
> @@ -517,6 +523,31 @@ static int do_one_pass(journal_t *journal,
>  	first_commit_ID = next_commit_ID;
>  	if (pass == PASS_SCAN)
>  		info->start_transaction = first_commit_ID;
> +	else if (pass == PASS_REVOKE) {
> +		/*
> +		 * Would the default revoke table have too long hash chains
> +		 * during replay?
> +		 */
> +		if (info->nr_revokes > JOURNAL_REVOKE_DEFAULT_HASH * 16) {
> +			unsigned int hash_size;
> +
> +			/*
> +			 * Aim for average chain length of 8, limit at 1M
> +			 * entries to avoid problems with malicious
> +			 * filesystems.
> +			 */
> +			hash_size = min(roundup_pow_of_two(info->nr_revokes / 8),
> +					1U << 20);
> +			journal->j_revoke =
> +				jbd2_journal_init_revoke_table(hash_size);
> +			if (!journal->j_revoke) {
> +				printk(KERN_ERR
> +				       "JBD2: failed to allocate revoke table for replay with %u entries. "
> +				       "Journal replay may be slow.\n", hash_size);
> +				journal->j_revoke = journal->j_revoke_table[1];
> +			}
> +		}
> +	}
> 
>  	jbd2_debug(1, "Starting recovery pass %d\n", pass);
> 
> @@ -874,14 +905,16 @@ static int do_one_pass(journal_t *journal,
>  				need_check_commit_time = true;
>  			}
> 
> -			/* If we aren't in the REVOKE pass, then we can
> -			 * just skip over this block. */
> -			if (pass != PASS_REVOKE) {
> +			/*
> +			 * If we aren't in the SCAN or REVOKE pass, then we can
> +			 * just skip over this block.
> +			 */
> +			if (pass != PASS_REVOKE && pass != PASS_SCAN) {
>  				brelse(bh);
>  				continue;
>  			}
> 
> -			err = scan_revoke_records(journal, bh,
> +			err = scan_revoke_records(journal, pass, bh,
>  						  next_commit_ID, info);
>  			brelse(bh);
>  			if (err)
> @@ -937,8 +970,9 @@ static int do_one_pass(journal_t *journal,
> 
>  /* Scan a revoke record, marking all blocks mentioned as revoked. */
> 
> -static int scan_revoke_records(journal_t *journal, struct buffer_head *bh,
> -			       tid_t sequence, struct recovery_info *info)
> +static int scan_revoke_records(journal_t *journal, enum passtype pass,
> +			       struct buffer_head *bh, tid_t sequence,
> +			       struct recovery_info *info)
>  {
>  	jbd2_journal_revoke_header_t *header;
>  	int offset, max;
> @@ -959,6 +993,11 @@ static int scan_revoke_records(journal_t *journal, struct buffer_head *bh,
>  	if (jbd2_has_feature_64bit(journal))
>  		record_len = 8;
> 
> +	if (pass == PASS_SCAN) {
> +		info->nr_revokes += (max - offset) / record_len;
> +		return 0;
> +	}
> +
>  	while (offset + record_len <= max) {
>  		unsigned long long blocknr;
>  		int err;
> @@ -971,7 +1010,6 @@ static int scan_revoke_records(journal_t *journal, struct buffer_head *bh,
>  		err = jbd2_journal_set_revoke(journal, blocknr, sequence);
>  		if (err)
>  			return err;
> -		++info->nr_revokes;
>  	}
>  	return 0;
>  }
> diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
> index 4556e4689024..f4ac308e84c5 100644
> --- a/fs/jbd2/revoke.c
> +++ b/fs/jbd2/revoke.c
> @@ -215,7 +215,7 @@ int __init jbd2_journal_init_revoke_table_cache(void)
>  	return 0;
>  }
> 
> -static struct jbd2_revoke_table_s *jbd2_journal_init_revoke_table(int hash_size)
> +struct jbd2_revoke_table_s *jbd2_journal_init_revoke_table(int hash_size)
>  {
>  	int shift = 0;
>  	int tmp = hash_size;
> @@ -231,7 +231,7 @@ static struct jbd2_revoke_table_s *jbd2_journal_init_revoke_table(int hash_size)
>  	table->hash_size = hash_size;
>  	table->hash_shift = shift;
>  	table->hash_table =
> -		kmalloc_array(hash_size, sizeof(struct list_head), GFP_KERNEL);
> +		kvmalloc_array(hash_size, sizeof(struct list_head), GFP_KERNEL);
>  	if (!table->hash_table) {
>  		kmem_cache_free(jbd2_revoke_table_cache, table);
>  		table = NULL;
> @@ -245,7 +245,7 @@ static struct jbd2_revoke_table_s *jbd2_journal_init_revoke_table(int hash_size)
>  	return table;
>  }
> 
> -static void jbd2_journal_destroy_revoke_table(struct jbd2_revoke_table_s *table)
> +void jbd2_journal_destroy_revoke_table(struct jbd2_revoke_table_s *table)
>  {
>  	int i;
>  	struct list_head *hash_list;
> @@ -255,7 +255,7 @@ static void jbd2_journal_destroy_revoke_table(struct jbd2_revoke_table_s *table)
>  		J_ASSERT(list_empty(hash_list));
>  	}
> 
> -	kfree(table->hash_table);
> +	kvfree(table->hash_table);
>  	kmem_cache_free(jbd2_revoke_table_cache, table);
>  }
> 
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 8aef9bb6ad57..781615214d47 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -1634,6 +1634,8 @@ extern void	   jbd2_journal_destroy_revoke_record_cache(void);
>  extern void	   jbd2_journal_destroy_revoke_table_cache(void);
>  extern int __init jbd2_journal_init_revoke_record_cache(void);
>  extern int __init jbd2_journal_init_revoke_table_cache(void);
> +struct jbd2_revoke_table_s *jbd2_journal_init_revoke_table(int hash_size);
> +void jbd2_journal_destroy_revoke_table(struct jbd2_revoke_table_s *table);
> 
>  extern void	   jbd2_journal_destroy_revoke(journal_t *);
>  extern int	   jbd2_journal_revoke (handle_t *, unsigned long long, struct buffer_head *);
> --
> 2.35.3




[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH V2] jbd2: use rhashtable for revoke records during replay
  2025-01-16  0:08         ` Andreas Dilger
@ 2025-01-16 18:04           ` Jan Kara
  0 siblings, 0 replies; 8+ messages in thread
From: Jan Kara @ 2025-01-16 18:04 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Jan Kara, Theodore Ts'o, Li Dongyang, Ext4 Developers List,
	Alex Zhuravlev

On Wed 15-01-25 17:08:03, Andreas Dilger wrote:
> On Nov 13, 2024, at 7:47 AM, Jan Kara <jack@suse.cz> wrote:
> > 
> > On Tue 12-11-24 11:44:11, Andreas Dilger wrote:
> >> On Nov 8, 2024, at 9:11 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> >>> 
> >>> On Fri, Nov 08, 2024 at 11:33:58AM +0100, Jan Kara wrote:
> >>>>> 1048576 records - 95 seconds
> >>>>> 2097152 records - 580 seconds
> >>>> 
> >>>> These are really high numbers of revoke records. Deleting couple GB of
> >>>> metadata doesn't happen so easily. Are they from a real workload or just
> >>>> a stress test?
> >>> 
> >>> For context, the background of this is that this has been an
> >>> out-of-tree patch that's been around for a very long time, for use
> >>> with Lustre servers where, apparently, this very large number of
> >>> revoke records is a real thing.
> >> 
> >> Yes, we've seen this in production if there was a crash after deleting
> >> many millions of log records.  This causes remount to take potentially
> >> several hours before completing (and this was made worse by HA causing
> >> failovers due to mount being "stuck" doing the journal replay).
> > 
> > Thanks for clarification!
> > 
> >>>> If my interpretation is correct, then rhashtable is unnecessarily
> >>>> huge hammer for this. Firstly, as the big hash is needed only during
> >>>> replay, there's no concurrent access to the data
> >>>> structure. Secondly, we just fill the data structure in the
> >>>> PASS_REVOKE scan and then use it. Thirdly, we know the number of
> >>>> elements we need to store in the table in advance (well, currently
> >>>> we don't but it's trivial to modify PASS_SCAN to get that number).
> >>>> 
> >>>> So rather than playing with rhashtable, I'd modify PASS_SCAN to sum
> >>>> up number of revoke records we're going to process and then prepare
> >>>> a static hash of appropriate size for replay (we can just use the
> >>>> standard hashing fs/jbd2/revoke.c uses, just with differently sized
> >>>> hash table allocated for replay and point journal->j_revoke to
> >>>> it). And once recovery completes jbd2_journal_clear_revoke() can
> >>>> free the table and point journal->j_revoke back to the original
> >>>> table. What do you think?
> >>> 
> >>> Hmm, that's a really nice idea; Andreas, what do you think?
> >> 
> >> Implementing code to manually count and resize the recovery hashtable
> >> will also have its own complexity, including possible allocation size
> >> limits for a huge hash table.  That could be worked around by kvmalloc(),
> >> but IMHO this essentially starts "open coding" something rhashtable was
> >> exactly designed to avoid.
> > 
> > Well, I'd say the result is much simpler than rhashtable code since
> > you don't need all that dynamic reallocation and complex locking.
> > 
> > Attached is a patch that implements my suggestion. I'd say it is
> > simpler than having two types of revoke block hashing depending on
> > whether we are doing recovery or running the journal.
> > 
> > I've tested it and it seems to work fine (including replay of a
> > journal with sufficiently many revoke blocks) but I'm not sure
> > I can do a meaningful performance testing (I cannot quite reproduce
> > the slow replay times even when shutting down the filesystem after
> > deleting 1000000 directories). So can you please give it a spin?
> 
> Alex posted test results on the other rhashtable revoke thread,
> which show both the rhashtable and Jan's dynamically-allocated hash
> table perform much better than the original fixed-size hash table.

Yes, thanks for the testing! Patch submitted.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2025-01-16 18:04 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-11-05  3:44 [PATCH V2] jbd2: use rhashtable for revoke records during replay Li Dongyang
2024-11-08 10:33 ` Jan Kara
2024-11-08 16:11   ` Theodore Ts'o
2024-11-12 18:44     ` Andreas Dilger
2024-11-13 14:47       ` Jan Kara
2025-01-16  0:08         ` Andreas Dilger
2025-01-16 18:04           ` Jan Kara
2024-11-09  3:12   ` Zhang Yi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox