[PATCH] dm-verity: support block number limits for different ioprio classes

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] dm-verity: support block number limits for different ioprio classes
@ 2025-03-25 10:49 LongPing Wei
  2025-03-25 17:43 ` Eric Biggers
  2025-03-25 21:17 ` Mikulas Patocka
  0 siblings, 2 replies; 23+ messages in thread
From: LongPing Wei @ 2025-03-25 10:49 UTC (permalink / raw)
  To: snitzer, mpatocka; +Cc: dm-devel, guoweichao, LongPing Wei

Calling verity_verify_io in bh for IO of all sizes is not suitable for
embedded devices. From our tests, it can improve the performance of 4K
synchronise random reads.
For example:
./fio --name=rand_read --ioengine=psync --rw=randread --bs=4K \
 --direct=1 --numjobs=8 --runtime=60 --time_based --group_reporting \
 --filename=/dev/block/mapper/xx-verity

But it will degrade the performance of 512K synchronise sequential reads
on our devices.
For example:
./fio --name=read --ioengine=psync --rw=read --bs=512K --direct=1 \
 --numjobs=8 --runtime=60 --time_based --group_reporting \
 --filename=/dev/block/mapper/xx-verity

A parameter array is introduced by this change. And users can modify the
default config by /sys/module/dm_verity/parameters/use_bh_blocks.

The default limits for NONE/RT/BE/IDLE is set to UINT_MAX to keep the
original behaviour.

Signed-off-by: LongPing Wei <weilongping@oppo.com>
---
 drivers/md/dm-verity-target.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm-verity-target.c b/drivers/md/dm-verity-target.c
index e86c1431b108..1e0d0920d6a9 100644
--- a/drivers/md/dm-verity-target.c
+++ b/drivers/md/dm-verity-target.c
@@ -30,6 +30,7 @@
 #define DM_VERITY_ENV_VAR_NAME		"DM_VERITY_ERR_BLOCK_NR"
 
 #define DM_VERITY_DEFAULT_PREFETCH_SIZE	262144
+#define DM_VERITY_USE_BH_DEFAULT_BLOCKS	UINT_MAX
 
 #define DM_VERITY_MAX_CORRUPTED_ERRS	100
 
@@ -49,6 +50,15 @@ static unsigned int dm_verity_prefetch_cluster = DM_VERITY_DEFAULT_PREFETCH_SIZE
 
 module_param_named(prefetch_cluster, dm_verity_prefetch_cluster, uint, 0644);
 
+static unsigned int dm_verity_use_bh_blocks[4] = {
+	DM_VERITY_USE_BH_DEFAULT_BLOCKS,	// IOPRIO_CLASS_NONE
+	DM_VERITY_USE_BH_DEFAULT_BLOCKS,	// IOPRIO_CLASS_RT
+	DM_VERITY_USE_BH_DEFAULT_BLOCKS,	// IOPRIO_CLASS_BE
+	DM_VERITY_USE_BH_DEFAULT_BLOCKS		// IOPRIO_CLASS_IDLE
+};
+
+module_param_array_named(use_bh_blocks, dm_verity_use_bh_blocks, uint, NULL, 0644);
+
 static DEFINE_STATIC_KEY_FALSE(use_bh_wq_enabled);
 
 /* Is at least one dm-verity instance using ahash_tfm instead of shash_tfm? */
@@ -652,6 +662,12 @@ static void verity_bh_work(struct work_struct *w)
 	verity_finish_io(io, errno_to_blk_status(err));
 }
 
+static inline bool verity_use_bh(unsigned int n_blocks, unsigned short ioprio)
+{
+	return ioprio <= IOPRIO_CLASS_IDLE &&
+		n_blocks <= READ_ONCE(dm_verity_use_bh_blocks[ioprio]);
+}
+
 static void verity_end_io(struct bio *bio)
 {
 	struct dm_verity_io *io = bio->bi_private;
@@ -664,7 +680,8 @@ static void verity_end_io(struct bio *bio)
 		return;
 	}
 
-	if (static_branch_unlikely(&use_bh_wq_enabled) && io->v->use_bh_wq) {
+	if (static_branch_unlikely(&use_bh_wq_enabled) && io->v->use_bh_wq &&
+		verity_use_bh(io->n_blocks, IOPRIO_PRIO_CLASS(bio->bi_ioprio))) {
 		INIT_WORK(&io->bh_work, verity_bh_work);
 		queue_work(system_bh_wq, &io->bh_work);
 	} else {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH] dm-verity: support block number limits for different ioprio classes
  2025-03-25 10:49 [PATCH] dm-verity: support block number limits for different ioprio classes LongPing Wei
@ 2025-03-25 17:43 ` Eric Biggers
  2025-03-26  2:09   ` LongPing Wei
  2025-03-25 21:17 ` Mikulas Patocka
  1 sibling, 1 reply; 23+ messages in thread
From: Eric Biggers @ 2025-03-25 17:43 UTC (permalink / raw)
  To: LongPing Wei; +Cc: snitzer, mpatocka, dm-devel, guoweichao, Eran Messeri

Hi LongPing,

On Tue, Mar 25, 2025 at 06:49:43PM +0800, LongPing Wei wrote:
> Calling verity_verify_io in bh for IO of all sizes is not suitable for
> embedded devices. From our tests, it can improve the performance of 4K
> synchronise random reads.
> For example:
> ./fio --name=rand_read --ioengine=psync --rw=randread --bs=4K \
>  --direct=1 --numjobs=8 --runtime=60 --time_based --group_reporting \
>  --filename=/dev/block/mapper/xx-verity
> 
> But it will degrade the performance of 512K synchronise sequential reads
> on our devices.
> For example:
> ./fio --name=read --ioengine=psync --rw=read --bs=512K --direct=1 \
>  --numjobs=8 --runtime=60 --time_based --group_reporting \
>  --filename=/dev/block/mapper/xx-verity
> 
> A parameter array is introduced by this change. And users can modify the
> default config by /sys/module/dm_verity/parameters/use_bh_blocks.
> 
> The default limits for NONE/RT/BE/IDLE is set to UINT_MAX to keep the
> original behaviour.
> 
> Signed-off-by: LongPing Wei <weilongping@oppo.com>
> ---
>  drivers/md/dm-verity-target.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/dm-verity-target.c b/drivers/md/dm-verity-target.c
> index e86c1431b108..1e0d0920d6a9 100644
> --- a/drivers/md/dm-verity-target.c
> +++ b/drivers/md/dm-verity-target.c
> @@ -30,6 +30,7 @@
>  #define DM_VERITY_ENV_VAR_NAME		"DM_VERITY_ERR_BLOCK_NR"
>  
>  #define DM_VERITY_DEFAULT_PREFETCH_SIZE	262144
> +#define DM_VERITY_USE_BH_DEFAULT_BLOCKS	UINT_MAX
>  
>  #define DM_VERITY_MAX_CORRUPTED_ERRS	100
>  
> @@ -49,6 +50,15 @@ static unsigned int dm_verity_prefetch_cluster = DM_VERITY_DEFAULT_PREFETCH_SIZE
>  
>  module_param_named(prefetch_cluster, dm_verity_prefetch_cluster, uint, 0644);
>  
> +static unsigned int dm_verity_use_bh_blocks[4] = {
> +	DM_VERITY_USE_BH_DEFAULT_BLOCKS,	// IOPRIO_CLASS_NONE
> +	DM_VERITY_USE_BH_DEFAULT_BLOCKS,	// IOPRIO_CLASS_RT
> +	DM_VERITY_USE_BH_DEFAULT_BLOCKS,	// IOPRIO_CLASS_BE
> +	DM_VERITY_USE_BH_DEFAULT_BLOCKS		// IOPRIO_CLASS_IDLE
> +};
> +
> +module_param_array_named(use_bh_blocks, dm_verity_use_bh_blocks, uint, NULL, 0644);
> +
>  static DEFINE_STATIC_KEY_FALSE(use_bh_wq_enabled);
>  
>  /* Is at least one dm-verity instance using ahash_tfm instead of shash_tfm? */
> @@ -652,6 +662,12 @@ static void verity_bh_work(struct work_struct *w)
>  	verity_finish_io(io, errno_to_blk_status(err));
>  }
>  
> +static inline bool verity_use_bh(unsigned int n_blocks, unsigned short ioprio)
> +{
> +	return ioprio <= IOPRIO_CLASS_IDLE &&
> +		n_blocks <= READ_ONCE(dm_verity_use_bh_blocks[ioprio]);
> +}
> +
>  static void verity_end_io(struct bio *bio)
>  {
>  	struct dm_verity_io *io = bio->bi_private;
> @@ -664,7 +680,8 @@ static void verity_end_io(struct bio *bio)
>  		return;
>  	}
>  
> -	if (static_branch_unlikely(&use_bh_wq_enabled) && io->v->use_bh_wq) {
> +	if (static_branch_unlikely(&use_bh_wq_enabled) && io->v->use_bh_wq &&
> +		verity_use_bh(io->n_blocks, IOPRIO_PRIO_CLASS(bio->bi_ioprio))) {
>  		INIT_WORK(&io->bh_work, verity_bh_work);
>  		queue_work(system_bh_wq, &io->bh_work);

Thanks for working on this and sending a patch!

Just for some background: "try_verify_in_tasklet" was originally developed for
use in Android, but it did not actually end up being used there because of
concern about softirqs interfering with real-time tasks.  Yet, the scheduling
delay in dm-verity remains a problem.  I think that opportunistically doing the
verity work in softirq context remains promising (not necessarily ideal, but
helpful in practice), especially in cases where the block softirq already has to
run so the work can just be done in-line.  dm-verity mainly just needs to be
smarter about whether it chooses to use softirq or kworker for each request.

Choosing softirq context only for short requests seems helpful, and it's the
same idea I had.  In practice, dm-verity I/O request lengths have a bimodal
distribution with peaks at 4 KiB for synchronous reads and 128 KiB for
readahead.  These cases are quite different, and it makes sense that different
solutions could be optimal for them.  SHA-256 hashing a 4 KiB block takes only
2-6 microseconds, so there is not much reason to punt it to a kworker.

Taking the I/O priority into account makes sense too.  In theory, the higher the
priority, the more softirq should be preferred.  Though, I do wonder if almost
all the benefit actually just comes from the 4 KiB case, in which case there
would not be much reason to bother looking the I/O priority.

FWIW, other ideas that I have are:

    - When possible, make dm-verity do the work inline in the existing block
      softirq, instead of using the BH workqueue.  This would reduce the amount
      of overhead.  Note that dm-crypt does it this way.

    - If the CPU has been in softirq context for too long, then stop choosing
      softirq context and instead fall back to the traditional workqueue.  This
      could help address objections about increased use of softirq context.

But this patch seems like a good start.

> The default limits for NONE/RT/BE/IDLE is set to UINT_MAX to keep the
> original behaviour.

Let's just set good parameters by default instead.  "try_verify_in_tasklet" is
kind of useless as-is, and we should just fix it.  Especially since it only
affects performance and not other behavior, I don't think there are any
compatibility concerns with changing what it does exactly.

- Eric

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] dm-verity: support block number limits for different ioprio classes
  2025-03-25 10:49 [PATCH] dm-verity: support block number limits for different ioprio classes LongPing Wei
  2025-03-25 17:43 ` Eric Biggers
@ 2025-03-25 21:17 ` Mikulas Patocka
  2025-03-26  1:30   ` [PATCH v2] " LongPing Wei
  1 sibling, 1 reply; 23+ messages in thread
From: Mikulas Patocka @ 2025-03-25 21:17 UTC (permalink / raw)
  To: LongPing Wei; +Cc: snitzer, Alasdair Kergon, dm-devel, guoweichao

Hi

I think the patch is OK, I just suggest these changes:

- set the limit for the number of bytes, rather than the number of blocks, 
because different dm-verity devices can have different block size

- run some benchmarks and set the default values according to the 
benchmark results

- increase the target version number

- add a note about this feature to 
Documentation/admin-guide/device-mapper/verity.rst

Mikulas


On Tue, 25 Mar 2025, LongPing Wei wrote:

> Calling verity_verify_io in bh for IO of all sizes is not suitable for
> embedded devices. From our tests, it can improve the performance of 4K
> synchronise random reads.
> For example:
> ./fio --name=rand_read --ioengine=psync --rw=randread --bs=4K \
>  --direct=1 --numjobs=8 --runtime=60 --time_based --group_reporting \
>  --filename=/dev/block/mapper/xx-verity
> 
> But it will degrade the performance of 512K synchronise sequential reads
> on our devices.
> For example:
> ./fio --name=read --ioengine=psync --rw=read --bs=512K --direct=1 \
>  --numjobs=8 --runtime=60 --time_based --group_reporting \
>  --filename=/dev/block/mapper/xx-verity
> 
> A parameter array is introduced by this change. And users can modify the
> default config by /sys/module/dm_verity/parameters/use_bh_blocks.
> 
> The default limits for NONE/RT/BE/IDLE is set to UINT_MAX to keep the
> original behaviour.
> 
> Signed-off-by: LongPing Wei <weilongping@oppo.com>
> ---
>  drivers/md/dm-verity-target.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/dm-verity-target.c b/drivers/md/dm-verity-target.c
> index e86c1431b108..1e0d0920d6a9 100644
> --- a/drivers/md/dm-verity-target.c
> +++ b/drivers/md/dm-verity-target.c
> @@ -30,6 +30,7 @@
>  #define DM_VERITY_ENV_VAR_NAME		"DM_VERITY_ERR_BLOCK_NR"
>  
>  #define DM_VERITY_DEFAULT_PREFETCH_SIZE	262144
> +#define DM_VERITY_USE_BH_DEFAULT_BLOCKS	UINT_MAX
>  
>  #define DM_VERITY_MAX_CORRUPTED_ERRS	100
>  
> @@ -49,6 +50,15 @@ static unsigned int dm_verity_prefetch_cluster = DM_VERITY_DEFAULT_PREFETCH_SIZE
>  
>  module_param_named(prefetch_cluster, dm_verity_prefetch_cluster, uint, 0644);
>  
> +static unsigned int dm_verity_use_bh_blocks[4] = {
> +	DM_VERITY_USE_BH_DEFAULT_BLOCKS,	// IOPRIO_CLASS_NONE
> +	DM_VERITY_USE_BH_DEFAULT_BLOCKS,	// IOPRIO_CLASS_RT
> +	DM_VERITY_USE_BH_DEFAULT_BLOCKS,	// IOPRIO_CLASS_BE
> +	DM_VERITY_USE_BH_DEFAULT_BLOCKS		// IOPRIO_CLASS_IDLE
> +};
> +
> +module_param_array_named(use_bh_blocks, dm_verity_use_bh_blocks, uint, NULL, 0644);
> +
>  static DEFINE_STATIC_KEY_FALSE(use_bh_wq_enabled);
>  
>  /* Is at least one dm-verity instance using ahash_tfm instead of shash_tfm? */
> @@ -652,6 +662,12 @@ static void verity_bh_work(struct work_struct *w)
>  	verity_finish_io(io, errno_to_blk_status(err));
>  }
>  
> +static inline bool verity_use_bh(unsigned int n_blocks, unsigned short ioprio)
> +{
> +	return ioprio <= IOPRIO_CLASS_IDLE &&
> +		n_blocks <= READ_ONCE(dm_verity_use_bh_blocks[ioprio]);
> +}
> +
>  static void verity_end_io(struct bio *bio)
>  {
>  	struct dm_verity_io *io = bio->bi_private;
> @@ -664,7 +680,8 @@ static void verity_end_io(struct bio *bio)
>  		return;
>  	}
>  
> -	if (static_branch_unlikely(&use_bh_wq_enabled) && io->v->use_bh_wq) {
> +	if (static_branch_unlikely(&use_bh_wq_enabled) && io->v->use_bh_wq &&
> +		verity_use_bh(io->n_blocks, IOPRIO_PRIO_CLASS(bio->bi_ioprio))) {
>  		INIT_WORK(&io->bh_work, verity_bh_work);
>  		queue_work(system_bh_wq, &io->bh_work);
>  	} else {
> -- 
> 2.34.1
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v2] dm-verity: support block number limits for different ioprio classes
  2025-03-25 21:17 ` Mikulas Patocka
@ 2025-03-26  1:30   ` LongPing Wei
  2025-03-26  5:18     ` Eric Biggers
  2025-03-26 11:04     ` Mikulas Patocka
  0 siblings, 2 replies; 23+ messages in thread
From: LongPing Wei @ 2025-03-26  1:30 UTC (permalink / raw)
  To: mpatocka; +Cc: agk, dm-devel, guoweichao, snitzer, weilongping

Calling verity_verify_io in bh for IO of all sizes is not suitable for
embedded devices. From our tests, it can improve the performance of 4K
synchronise random reads.
For example:
./fio --name=rand_read --ioengine=psync --rw=randread --bs=4K \
 --direct=1 --numjobs=8 --runtime=60 --time_based --group_reporting \
 --filename=/dev/block/mapper/xx-verity

But it will degrade the performance of 512K synchronise sequential reads
on our devices.
For example:
./fio --name=read --ioengine=psync --rw=read --bs=512K --direct=1 \
 --numjobs=8 --runtime=60 --time_based --group_reporting \
 --filename=/dev/block/mapper/xx-verity

A parameter array is introduced by this change. And users can modify the
default config by /sys/module/dm_verity/parameters/use_bh_bytes.

The default limits for NONE/RT/BE/IDLE is set to 4096.

Call verity_verify_io directly when verity_end_io is in softirq.

Signed-off-by: LongPing Wei <weilongping@oppo.com>
---
v2:
1. change from use_bh_blocks to use_bh_bytes;
2. update documention;
3. set the default limit to 4096bytes;
4. call verity_verify_io directly when verity_end_io is in softirq;
5. bump version of dm-verity;
---
 .../admin-guide/device-mapper/verity.rst      |  8 ++++-
 drivers/md/dm-verity-target.c                 | 30 +++++++++++++++++--
 2 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/device-mapper/verity.rst b/Documentation/admin-guide/device-mapper/verity.rst
index a65c1602cb23..f7fb5ea3c0df 100644
--- a/Documentation/admin-guide/device-mapper/verity.rst
+++ b/Documentation/admin-guide/device-mapper/verity.rst
@@ -143,7 +143,13 @@ root_hash_sig_key_desc <key_description>
 
 try_verify_in_tasklet
     If verity hashes are in cache, verify data blocks in kernel tasklet instead
-    of workqueue. This option can reduce IO latency.
+    of workqueue. This option can reduce IO latency. The size limits could be
+    configured via /sys/module/dm_verity/parameters/use_bh_bytes. The four
+    parameters correspond to limits for IOPRIO_CLASS_NONE,IOPRIO_CLASS_RT,
+    IOPRIO_CLASS_BE and IOPRIO_CLASS_IDLE in turn.
+    For example:
+    <none>,<rt>,<be>,<idle>
+    4096,4096,4096,4096
 
 Theory of operation
 ===================
diff --git a/drivers/md/dm-verity-target.c b/drivers/md/dm-verity-target.c
index e86c1431b108..6e26d59cdcef 100644
--- a/drivers/md/dm-verity-target.c
+++ b/drivers/md/dm-verity-target.c
@@ -30,6 +30,7 @@
 #define DM_VERITY_ENV_VAR_NAME		"DM_VERITY_ERR_BLOCK_NR"
 
 #define DM_VERITY_DEFAULT_PREFETCH_SIZE	262144
+#define DM_VERITY_USE_BH_DEFAULT_BYTES	4096
 
 #define DM_VERITY_MAX_CORRUPTED_ERRS	100
 
@@ -49,6 +50,15 @@ static unsigned int dm_verity_prefetch_cluster = DM_VERITY_DEFAULT_PREFETCH_SIZE
 
 module_param_named(prefetch_cluster, dm_verity_prefetch_cluster, uint, 0644);
 
+static unsigned int dm_verity_use_bh_bytes[4] = {
+	DM_VERITY_USE_BH_DEFAULT_BYTES,	// IOPRIO_CLASS_NONE
+	DM_VERITY_USE_BH_DEFAULT_BYTES,	// IOPRIO_CLASS_RT
+	DM_VERITY_USE_BH_DEFAULT_BYTES,	// IOPRIO_CLASS_BE
+	DM_VERITY_USE_BH_DEFAULT_BYTES	// IOPRIO_CLASS_IDLE
+};
+
+module_param_array_named(use_bh_bytes, dm_verity_use_bh_bytes, uint, NULL, 0644);
+
 static DEFINE_STATIC_KEY_FALSE(use_bh_wq_enabled);
 
 /* Is at least one dm-verity instance using ahash_tfm instead of shash_tfm? */
@@ -652,6 +662,12 @@ static void verity_bh_work(struct work_struct *w)
 	verity_finish_io(io, errno_to_blk_status(err));
 }
 
+static inline bool verity_use_bh(unsigned int bytes, unsigned short ioprio)
+{
+	return ioprio <= IOPRIO_CLASS_IDLE &&
+		bytes <= READ_ONCE(dm_verity_use_bh_bytes[ioprio]);
+}
+
 static void verity_end_io(struct bio *bio)
 {
 	struct dm_verity_io *io = bio->bi_private;
@@ -665,8 +681,16 @@ static void verity_end_io(struct bio *bio)
 	}
 
 	if (static_branch_unlikely(&use_bh_wq_enabled) && io->v->use_bh_wq) {
-		INIT_WORK(&io->bh_work, verity_bh_work);
-		queue_work(system_bh_wq, &io->bh_work);
+		unsigned short ioprio = IOPRIO_PRIO_CLASS(bio->bi_ioprio);
+
+		if (verity_use_bh(bio->bi_iter.bi_size, ioprio)) {
+			if (in_hardirq() || irqs_disabled()) {
+				INIT_WORK(&io->bh_work, verity_bh_work);
+				queue_work(system_bh_wq, &io->bh_work);
+			} else {
+				verity_bh_work(&io->bh_work);
+			}
+		}
 	} else {
 		INIT_WORK(&io->work, verity_work);
 		queue_work(io->v->verify_wq, &io->work);
@@ -1761,7 +1785,7 @@ static struct target_type verity_target = {
 	.name		= "verity",
 /* Note: the LSMs depend on the singleton and immutable features */
 	.features	= DM_TARGET_SINGLETON | DM_TARGET_IMMUTABLE,
-	.version	= {1, 10, 0},
+	.version	= {1, 11, 0},
 	.module		= THIS_MODULE,
 	.ctr		= verity_ctr,
 	.dtr		= verity_dtr,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH] dm-verity: support block number limits for different ioprio classes
  2025-03-25 17:43 ` Eric Biggers
@ 2025-03-26  2:09   ` LongPing Wei
  2025-03-26 16:40     ` Eric Biggers
  0 siblings, 1 reply; 23+ messages in thread
From: LongPing Wei @ 2025-03-26  2:09 UTC (permalink / raw)
  To: ebiggers; +Cc: dm-devel, eranm, guoweichao, mpatocka, snitzer, weilongping

Hi, Eric

 > If the CPU has been in softirq context for too long, then stop
 > choosing softirq context and instead fall back to the traditional
 > workqueue.  This could help address objections about increased use of 
 > softirq context.

Could you share me some example about this?

 > When possible, make dm-verity do the work inline in the existing block
 > softirq, instead of using the BH workqueue.  This would reduce the
 > amount of overhead.  Note that dm-crypt does it this way.

 > Let's just set good parameters by default instead.
 > "try_verify_in_tasklet" is kind of useless as-is, and we should just
 > fix it.  Especially since it only affects performance and not other
 > behavior, I don't think there are any compatibility concerns with
 > changing what it does exactly.

These ideas have been implemented in patch v2. Thanks for your suggestions.

- LongPing

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2] dm-verity: support block number limits for different ioprio classes
  2025-03-26  1:30   ` [PATCH v2] " LongPing Wei
@ 2025-03-26  5:18     ` Eric Biggers
  2025-03-26  6:34       ` LongPing Wei
  2025-03-26 11:04     ` Mikulas Patocka
  1 sibling, 1 reply; 23+ messages in thread
From: Eric Biggers @ 2025-03-26  5:18 UTC (permalink / raw)
  To: LongPing Wei; +Cc: mpatocka, agk, dm-devel, guoweichao, snitzer

On Wed, Mar 26, 2025 at 09:30:51AM +0800, LongPing Wei wrote:
>  try_verify_in_tasklet
>      If verity hashes are in cache, verify data blocks in kernel tasklet instead
> -    of workqueue. This option can reduce IO latency.
> +    of workqueue. This option can reduce IO latency. The size limits could be
> +    configured via /sys/module/dm_verity/parameters/use_bh_bytes. The four
> +    parameters correspond to limits for IOPRIO_CLASS_NONE,IOPRIO_CLASS_RT,
> +    IOPRIO_CLASS_BE and IOPRIO_CLASS_IDLE in turn.
> +    For example:
> +    <none>,<rt>,<be>,<idle>
> +    4096,4096,4096,4096

I'm concerned about adding yet another performance tuning UAPI, which will then
have to be maintained going forwards, even if the implementation changes.  See
e.g. how try_verify_in_tasklet has ended up not actually using tasklets anymore.
Is there any strong reason for userspace needing to configure use_bh_bytes
differently on different systems, vs. just having the kernel choose the best
value automatically without any new UAPI surface?

Keep in mind that the tunables could also be added separately later.

- Eric

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2] dm-verity: support block number limits for different ioprio classes
  2025-03-26  5:18     ` Eric Biggers
@ 2025-03-26  6:34       ` LongPing Wei
  0 siblings, 0 replies; 23+ messages in thread
From: LongPing Wei @ 2025-03-26  6:34 UTC (permalink / raw)
  To: Eric Biggers; +Cc: mpatocka, agk, dm-devel, guoweichao, snitzer

On 2025/3/26 13:18, Eric Biggers wrote:
> On Wed, Mar 26, 2025 at 09:30:51AM +0800, LongPing Wei wrote:
>>   try_verify_in_tasklet
>>       If verity hashes are in cache, verify data blocks in kernel tasklet instead
>> -    of workqueue. This option can reduce IO latency.
>> +    of workqueue. This option can reduce IO latency. The size limits could be
>> +    configured via /sys/module/dm_verity/parameters/use_bh_bytes. The four
>> +    parameters correspond to limits for IOPRIO_CLASS_NONE,IOPRIO_CLASS_RT,
>> +    IOPRIO_CLASS_BE and IOPRIO_CLASS_IDLE in turn.
>> +    For example:
>> +    <none>,<rt>,<be>,<idle>
>> +    4096,4096,4096,4096
> 
> I'm concerned about adding yet another performance tuning UAPI, which will then
> have to be maintained going forwards, even if the implementation changes.  See
> e.g. how try_verify_in_tasklet has ended up not actually using tasklets anymore.
> Is there any strong reason for userspace needing to configure use_bh_bytes
> differently on different systems, vs. just having the kernel choose the best
> value automatically without any new UAPI surface?
> 
> Keep in mind that the tunables could also be added separately later.
> 
> - Eric

Hi, Eric

It seems that it is not easy to implement the idea that having the 
kernel choose the best value automatically.

Unlike on Linux servers and PCs, the CPU load,number and frequency on 
Android devices may vary. For Android devices, dynamic policies may have 
a more general benefit effect.

On the low-end device I tested locally, soft interrupts would be 
concentrated on CPU4~7. I tried to increase use_bh_bytes but got 
negative effects. The time cost of a single block softirq becomes 
higher, which will defer subsequent block softirq.

But on the devices with UFS MCQ or NVME SSD, most verity_end_io should 
be called in hardirq context directly but not softirq. These devices 
have the potential to raise the threshold.

- LongPing

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2] dm-verity: support block number limits for different ioprio classes
  2025-03-26  1:30   ` [PATCH v2] " LongPing Wei
  2025-03-26  5:18     ` Eric Biggers
@ 2025-03-26 11:04     ` Mikulas Patocka
  2025-03-26 13:32       ` LongPing Wei
  1 sibling, 1 reply; 23+ messages in thread
From: Mikulas Patocka @ 2025-03-26 11:04 UTC (permalink / raw)
  To: LongPing Wei; +Cc: agk, dm-devel, guoweichao, snitzer



On Wed, 26 Mar 2025, LongPing Wei wrote:

> Calling verity_verify_io in bh for IO of all sizes is not suitable for
> embedded devices. From our tests, it can improve the performance of 4K
> synchronise random reads.
> For example:
> ./fio --name=rand_read --ioengine=psync --rw=randread --bs=4K \
>  --direct=1 --numjobs=8 --runtime=60 --time_based --group_reporting \
>  --filename=/dev/block/mapper/xx-verity
> 
> But it will degrade the performance of 512K synchronise sequential reads
> on our devices.
> For example:
> ./fio --name=read --ioengine=psync --rw=read --bs=512K --direct=1 \
>  --numjobs=8 --runtime=60 --time_based --group_reporting \
>  --filename=/dev/block/mapper/xx-verity
> 
> A parameter array is introduced by this change. And users can modify the
> default config by /sys/module/dm_verity/parameters/use_bh_bytes.
> 
> The default limits for NONE/RT/BE/IDLE is set to 4096.
> 
> Call verity_verify_io directly when verity_end_io is in softirq.
> 
> Signed-off-by: LongPing Wei <weilongping@oppo.com>

Are you sure that 4096 bytes is the correct threshold?

I suggest that you run the benchmarks for 4k, 8k, 16k, 32k, 64k, 128k, 
256k, 512k and set the default threshold to the largest value where bh 
code performs better than non-bh code.

Mikulas


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2] dm-verity: support block number limits for different ioprio classes
  2025-03-26 11:04     ` Mikulas Patocka
@ 2025-03-26 13:32       ` LongPing Wei
  2025-03-26 16:53         ` Eric Biggers
  0 siblings, 1 reply; 23+ messages in thread
From: LongPing Wei @ 2025-03-26 13:32 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: agk, dm-devel, guoweichao, snitzer, ebiggers

On 2025/3/26 19:04, Mikulas Patocka wrote:
> 
> 
> On Wed, 26 Mar 2025, LongPing Wei wrote:
> 
>> Calling verity_verify_io in bh for IO of all sizes is not suitable for
>> embedded devices. From our tests, it can improve the performance of 4K
>> synchronise random reads.
>> For example:
>> ./fio --name=rand_read --ioengine=psync --rw=randread --bs=4K \
>>   --direct=1 --numjobs=8 --runtime=60 --time_based --group_reporting \
>>   --filename=/dev/block/mapper/xx-verity
>>
>> But it will degrade the performance of 512K synchronise sequential reads
>> on our devices.
>> For example:
>> ./fio --name=read --ioengine=psync --rw=read --bs=512K --direct=1 \
>>   --numjobs=8 --runtime=60 --time_based --group_reporting \
>>   --filename=/dev/block/mapper/xx-verity
>>
>> A parameter array is introduced by this change. And users can modify the
>> default config by /sys/module/dm_verity/parameters/use_bh_bytes.
>>
>> The default limits for NONE/RT/BE/IDLE is set to 4096.
>>
>> Call verity_verify_io directly when verity_end_io is in softirq.
>>
>> Signed-off-by: LongPing Wei <weilongping@oppo.com>
> 
> Are you sure that 4096 bytes is the correct threshold?
> 
> I suggest that you run the benchmarks for 4k, 8k, 16k, 32k, 64k, 128k,
> 256k, 512k and set the default threshold to the largest value where bh
> code performs better than non-bh code.
> 
> Mikulas
> 

Hi, Mikulas

My test device is a smartphone based on SnapDragon 6gen1 with 512GB
UFS3.1 and 12GiB DDR.

./fio --name=rand_read --ioengine=psync --rw=randread --bs=4K \
    --direct=1 --numjobs=8 --runtime=60 --time_based --group_reporting \
    --filename=/dev/block/mapper/xx-verity

origin: 165~MiB/s
after: 215~ MiB/s

./fio --name=rand_read --ioengine=psync --rw=randread --bs=8K \
    --direct=1 --numjobs=8 --runtime=60 --time_based --group_reporting \
    --filename=/dev/block/mapper/xx-verity
origin: 265~ MiB/s
after: 132,268,308,302,116  avg:225.2MiB/s

Just like what I have explained to Eric:

 > On the low-end device I tested locally, soft interrupts would be
 > concentrated on CPU4~7. I tried to increase use_bh_bytes but got
 > negative effects. The time cost of a single block softirq becomes
 > higher, which will defer subsequent block softirq.
 >
 > But on the devices with UFS MCQ or NVME SSD, most verity_end_io should
 > be called in hardirq context directly but not softirq. These devices
 > have the potential to raise the threshold.

I cannot run the benchmarks on the high-end devices with UFS MCQ now
as I don't have a test device with 6.12 kernel. The existing Android
products will stay on the LTS version they are launched on until that
version is EOL. I will share you the results when I get a new product
with 6.12 and UFS MCQ and run the benckmarks in the future.

I would prefer 4096 to be the default threshold as it won't introduce
negative effects on all devices including low-end devices like mine.

LongPing

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] dm-verity: support block number limits for different ioprio classes
  2025-03-26  2:09   ` LongPing Wei
@ 2025-03-26 16:40     ` Eric Biggers
  0 siblings, 0 replies; 23+ messages in thread
From: Eric Biggers @ 2025-03-26 16:40 UTC (permalink / raw)
  To: LongPing Wei
  Cc: dm-devel, eranm, guoweichao, mpatocka, snitzer, Bart Van Assche

On Wed, Mar 26, 2025 at 10:09:22AM +0800, LongPing Wei wrote:
> Hi, Eric
> 
> > If the CPU has been in softirq context for too long, then stop
> > choosing softirq context and instead fall back to the traditional
> > workqueue.  This could help address objections about increased use of >
> softirq context.
> 
> Could you share me some example about this?
> 

The objection that keeps getting raised to doing more work in softirq context is
that the latency of real-time tasks, such as audio tasks, may increase due to
softirqs having a higher priority than all tasks.  Causing audio skipping, etc.

The logic in handle_softirqs() in kernel/softirq.c actually goes a ways towards
addressing this already: it processes softirqs for at most 2 ms or until
rescheduling of the interrupted task gets requested, before deferring them to
ksoftirqd which runs at normal task priority.

The gaps that I see are (a) 2 ms is longer than desired, and (b) the limit does
not apply to *individual* softirqs.  Looking at (b), observe that the block
softirq (blk_done_softirq()) completes a list of I/O requests.  If there are a
lot of requests in that list, it could theoretically take more than the 2 ms
limit that kernel/softirq.c is meant to enforce (which is already too long).

So the objection would be that, even with dm-verity choosing to do in-line
verification only for 4 KiB requests which take only a few microseconds each, a
lot of requests could still add up to cause a long time to be spent in a single
softirq context.  (At least in theory.  AFAIK no one has confirmed that this is
actually a problem in practice with the 4 KiB limit applied.  But this is the
objection that I keep hearing whenever anyone suggests doing more in softirqs.)

But if dm-verity could detect this case happening and start deferring
verification work to task context, that should mostly address this concern, IMO.

One idea which *might* get us most of the way there pretty easily would be to do
the in-line verification only when need_resched() returns false.

- Eric

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2] dm-verity: support block number limits for different ioprio classes
  2025-03-26 13:32       ` LongPing Wei
@ 2025-03-26 16:53         ` Eric Biggers
  2025-03-26 18:50           ` Mikulas Patocka
  0 siblings, 1 reply; 23+ messages in thread
From: Eric Biggers @ 2025-03-26 16:53 UTC (permalink / raw)
  To: LongPing Wei
  Cc: Mikulas Patocka, agk, dm-devel, guoweichao, snitzer,
	Bart Van Assche

On Wed, Mar 26, 2025 at 09:32:27PM +0800, LongPing Wei wrote:
> On 2025/3/26 19:04, Mikulas Patocka wrote:
> > 
> > 
> > On Wed, 26 Mar 2025, LongPing Wei wrote:
> > 
> > > Calling verity_verify_io in bh for IO of all sizes is not suitable for
> > > embedded devices. From our tests, it can improve the performance of 4K
> > > synchronise random reads.
> > > For example:
> > > ./fio --name=rand_read --ioengine=psync --rw=randread --bs=4K \
> > >   --direct=1 --numjobs=8 --runtime=60 --time_based --group_reporting \
> > >   --filename=/dev/block/mapper/xx-verity
> > > 
> > > But it will degrade the performance of 512K synchronise sequential reads
> > > on our devices.
> > > For example:
> > > ./fio --name=read --ioengine=psync --rw=read --bs=512K --direct=1 \
> > >   --numjobs=8 --runtime=60 --time_based --group_reporting \
> > >   --filename=/dev/block/mapper/xx-verity
> > > 
> > > A parameter array is introduced by this change. And users can modify the
> > > default config by /sys/module/dm_verity/parameters/use_bh_bytes.
> > > 
> > > The default limits for NONE/RT/BE/IDLE is set to 4096.
> > > 
> > > Call verity_verify_io directly when verity_end_io is in softirq.
> > > 
> > > Signed-off-by: LongPing Wei <weilongping@oppo.com>
> > 
> > Are you sure that 4096 bytes is the correct threshold?
> > 
> > I suggest that you run the benchmarks for 4k, 8k, 16k, 32k, 64k, 128k,
> > 256k, 512k and set the default threshold to the largest value where bh
> > code performs better than non-bh code.
> > 
> > Mikulas
> > 

4096 bytes sounds good to me.  As I mentioned elsewhere in the thread
(https://lore.kernel.org/dm-devel/20250326164057.GA1243@sol.localdomain/),
objections keep getting raised to doing more work in softirq context due to
potential impact on real-time tasks.  So while more I/O benchmarks would be
interesting, they are also not the whole story and we should probably prefer a
more conservative threshold.  Also, dm-verity I/O requests have a bimodal
distribution with peaks at 4 KiB (for synchronous reads) and 128 KiB or
read_ahead_kb (for readahead) anyway, so most of the impact should come from 4
KiB anyway.

8 KiB could also be reasonable when taking into account multibuffer hashing
though, since multibuffer hashing makes hashing two 4 KiB blocks almost as fast
as hashing a single 4 KiB block.  (Though multibuffer hashing is currently
Android-only, as the crypto maintainer keeps rejecting it upstream.)

- Eric

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2] dm-verity: support block number limits for different ioprio classes
  2025-03-26 16:53         ` Eric Biggers
@ 2025-03-26 18:50           ` Mikulas Patocka
  2025-03-26 20:32             ` Bart Van Assche
                               ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Mikulas Patocka @ 2025-03-26 18:50 UTC (permalink / raw)
  To: Eric Biggers
  Cc: LongPing Wei, agk, dm-devel, guoweichao, snitzer, Bart Van Assche



On Wed, 26 Mar 2025, Eric Biggers wrote:

> On Wed, Mar 26, 2025 at 09:32:27PM +0800, LongPing Wei wrote:
> > On 2025/3/26 19:04, Mikulas Patocka wrote:
> > > 
> > > 
> > > On Wed, 26 Mar 2025, LongPing Wei wrote:
> > > 
> > > > Calling verity_verify_io in bh for IO of all sizes is not suitable for
> > > > embedded devices. From our tests, it can improve the performance of 4K
> > > > synchronise random reads.
> > > > For example:
> > > > ./fio --name=rand_read --ioengine=psync --rw=randread --bs=4K \
> > > >   --direct=1 --numjobs=8 --runtime=60 --time_based --group_reporting \
> > > >   --filename=/dev/block/mapper/xx-verity
> > > > 
> > > > But it will degrade the performance of 512K synchronise sequential reads
> > > > on our devices.
> > > > For example:
> > > > ./fio --name=read --ioengine=psync --rw=read --bs=512K --direct=1 \
> > > >   --numjobs=8 --runtime=60 --time_based --group_reporting \
> > > >   --filename=/dev/block/mapper/xx-verity
> > > > 
> > > > A parameter array is introduced by this change. And users can modify the
> > > > default config by /sys/module/dm_verity/parameters/use_bh_bytes.
> > > > 
> > > > The default limits for NONE/RT/BE/IDLE is set to 4096.
> > > > 
> > > > Call verity_verify_io directly when verity_end_io is in softirq.
> > > > 
> > > > Signed-off-by: LongPing Wei <weilongping@oppo.com>
> > > 
> > > Are you sure that 4096 bytes is the correct threshold?
> > > 
> > > I suggest that you run the benchmarks for 4k, 8k, 16k, 32k, 64k, 128k,
> > > 256k, 512k and set the default threshold to the largest value where bh
> > > code performs better than non-bh code.
> > > 
> > > Mikulas
> > > 
> 
> 4096 bytes sounds good to me.  As I mentioned elsewhere in the thread
> (https://lore.kernel.org/dm-devel/20250326164057.GA1243@sol.localdomain/),
> objections keep getting raised to doing more work in softirq context due to
> potential impact on real-time tasks.  So while more I/O benchmarks would be
> interesting, they are also not the whole story and we should probably prefer a
> more conservative threshold.  Also, dm-verity I/O requests have a bimodal
> distribution with peaks at 4 KiB (for synchronous reads) and 128 KiB or
> read_ahead_kb (for readahead) anyway, so most of the impact should come from 4
> KiB anyway.
> 
> 8 KiB could also be reasonable when taking into account multibuffer hashing
> though, since multibuffer hashing makes hashing two 4 KiB blocks almost as fast
> as hashing a single 4 KiB block.  (Though multibuffer hashing is currently
> Android-only, as the crypto maintainer keeps rejecting it upstream.)
> 
> - Eric

OK, so I'll set it to 8KiB.

Do you think that I should increase this limit to "unlimited" for 
IOPRIO_CLASS_RT and disable this feature at all for IOPRIO_CLASS_NONE?

Mikulas


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2] dm-verity: support block number limits for different ioprio classes
  2025-03-26 18:50           ` Mikulas Patocka
@ 2025-03-26 20:32             ` Bart Van Assche
  2025-03-26 20:48             ` Eric Biggers
  2025-03-27  2:18             ` [PATCH v3] " LongPing Wei
  2 siblings, 0 replies; 23+ messages in thread
From: Bart Van Assche @ 2025-03-26 20:32 UTC (permalink / raw)
  To: Mikulas Patocka, Eric Biggers
  Cc: LongPing Wei, agk, dm-devel, guoweichao, snitzer

On 3/26/25 2:50 PM, Mikulas Patocka wrote:
> OK, so I'll set it to 8KiB.
> 
> Do you think that I should increase this limit to "unlimited" for
> IOPRIO_CLASS_RT and disable this feature at all for IOPRIO_CLASS_NONE?

How about a single limit for all I/O priority classes? Running too much
work in tasklet context can cause all kinds of nasty issues, e.g. the
user interface seeming to freeze if the tasklet happens to run on the
same CPU core as the user interface code.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2] dm-verity: support block number limits for different ioprio classes
  2025-03-26 18:50           ` Mikulas Patocka
  2025-03-26 20:32             ` Bart Van Assche
@ 2025-03-26 20:48             ` Eric Biggers
  2025-03-27  0:52               ` LongPing Wei
  2025-03-27  2:18             ` [PATCH v3] " LongPing Wei
  2 siblings, 1 reply; 23+ messages in thread
From: Eric Biggers @ 2025-03-26 20:48 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: LongPing Wei, agk, dm-devel, guoweichao, snitzer, Bart Van Assche

On Wed, Mar 26, 2025 at 07:50:34PM +0100, Mikulas Patocka wrote:
> 
> 
> On Wed, 26 Mar 2025, Eric Biggers wrote:
> 
> > On Wed, Mar 26, 2025 at 09:32:27PM +0800, LongPing Wei wrote:
> > > On 2025/3/26 19:04, Mikulas Patocka wrote:
> > > > 
> > > > 
> > > > On Wed, 26 Mar 2025, LongPing Wei wrote:
> > > > 
> > > > > Calling verity_verify_io in bh for IO of all sizes is not suitable for
> > > > > embedded devices. From our tests, it can improve the performance of 4K
> > > > > synchronise random reads.
> > > > > For example:
> > > > > ./fio --name=rand_read --ioengine=psync --rw=randread --bs=4K \
> > > > >   --direct=1 --numjobs=8 --runtime=60 --time_based --group_reporting \
> > > > >   --filename=/dev/block/mapper/xx-verity
> > > > > 
> > > > > But it will degrade the performance of 512K synchronise sequential reads
> > > > > on our devices.
> > > > > For example:
> > > > > ./fio --name=read --ioengine=psync --rw=read --bs=512K --direct=1 \
> > > > >   --numjobs=8 --runtime=60 --time_based --group_reporting \
> > > > >   --filename=/dev/block/mapper/xx-verity
> > > > > 
> > > > > A parameter array is introduced by this change. And users can modify the
> > > > > default config by /sys/module/dm_verity/parameters/use_bh_bytes.
> > > > > 
> > > > > The default limits for NONE/RT/BE/IDLE is set to 4096.
> > > > > 
> > > > > Call verity_verify_io directly when verity_end_io is in softirq.
> > > > > 
> > > > > Signed-off-by: LongPing Wei <weilongping@oppo.com>
> > > > 
> > > > Are you sure that 4096 bytes is the correct threshold?
> > > > 
> > > > I suggest that you run the benchmarks for 4k, 8k, 16k, 32k, 64k, 128k,
> > > > 256k, 512k and set the default threshold to the largest value where bh
> > > > code performs better than non-bh code.
> > > > 
> > > > Mikulas
> > > > 
> > 
> > 4096 bytes sounds good to me.  As I mentioned elsewhere in the thread
> > (https://lore.kernel.org/dm-devel/20250326164057.GA1243@sol.localdomain/),
> > objections keep getting raised to doing more work in softirq context due to
> > potential impact on real-time tasks.  So while more I/O benchmarks would be
> > interesting, they are also not the whole story and we should probably prefer a
> > more conservative threshold.  Also, dm-verity I/O requests have a bimodal
> > distribution with peaks at 4 KiB (for synchronous reads) and 128 KiB or
> > read_ahead_kb (for readahead) anyway, so most of the impact should come from 4
> > KiB anyway.
> > 
> > 8 KiB could also be reasonable when taking into account multibuffer hashing
> > though, since multibuffer hashing makes hashing two 4 KiB blocks almost as fast
> > as hashing a single 4 KiB block.  (Though multibuffer hashing is currently
> > Android-only, as the crypto maintainer keeps rejecting it upstream.)
> > 
> > - Eric
> 
> OK, so I'll set it to 8KiB.
> 
> Do you think that I should increase this limit to "unlimited" for 
> IOPRIO_CLASS_RT and disable this feature at all for IOPRIO_CLASS_NONE?

IOPRIO_CLASS_NONE means an I/O priority isn't set, so it probably should be
treated the same as IOPRIO_CLASS_BE which is the "normal" priority.

For now I think it makes sense to use 4 or 8 KiB even for IOPRIO_CLASS_RT.  That
covers the single-block request case which is the one that matters most.  We can
bump it up later if we can be shown that something higher is better.  Softirqs
normally get prioritized over all real-time tasks; so while it should be more
acceptable to use them for IOPRIO_CLASS_RT, it's not a perfect solution.

Another concern I have with longer requests is that when dm-verity runs the
verification work in a softirq and finds that a Merkle tree block isn't cached,
it punts the request to a kworker and restarts it from the beginning.  That not
only defeats the point of using softirq context, but it also throws away work
that was done.  I think we could fix it to not throw away work, but having to
fall back at all would still defeat the point of using softirq context.

- Eric

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2] dm-verity: support block number limits for different ioprio classes
  2025-03-26 20:48             ` Eric Biggers
@ 2025-03-27  0:52               ` LongPing Wei
  2025-03-27 17:05                 ` Eric Biggers
  0 siblings, 1 reply; 23+ messages in thread
From: LongPing Wei @ 2025-03-27  0:52 UTC (permalink / raw)
  To: Eric Biggers, Mikulas Patocka
  Cc: agk, dm-devel, guoweichao, snitzer, Bart Van Assche

On 2025/3/27 4:48, Eric Biggers wrote:
> On Wed, Mar 26, 2025 at 07:50:34PM +0100, Mikulas Patocka wrote:
>>
>>
>> On Wed, 26 Mar 2025, Eric Biggers wrote:
>>
>>> On Wed, Mar 26, 2025 at 09:32:27PM +0800, LongPing Wei wrote:
>>>> On 2025/3/26 19:04, Mikulas Patocka wrote:
>>>>>
>>>>>
>>>>> On Wed, 26 Mar 2025, LongPing Wei wrote:
>>>>>
>>>>>> Calling verity_verify_io in bh for IO of all sizes is not suitable for
>>>>>> embedded devices. From our tests, it can improve the performance of 4K
>>>>>> synchronise random reads.
>>>>>> For example:
>>>>>> ./fio --name=rand_read --ioengine=psync --rw=randread --bs=4K \
>>>>>>    --direct=1 --numjobs=8 --runtime=60 --time_based --group_reporting \
>>>>>>    --filename=/dev/block/mapper/xx-verity
>>>>>>
>>>>>> But it will degrade the performance of 512K synchronise sequential reads
>>>>>> on our devices.
>>>>>> For example:
>>>>>> ./fio --name=read --ioengine=psync --rw=read --bs=512K --direct=1 \
>>>>>>    --numjobs=8 --runtime=60 --time_based --group_reporting \
>>>>>>    --filename=/dev/block/mapper/xx-verity
>>>>>>
>>>>>> A parameter array is introduced by this change. And users can modify the
>>>>>> default config by /sys/module/dm_verity/parameters/use_bh_bytes.
>>>>>>
>>>>>> The default limits for NONE/RT/BE/IDLE is set to 4096.
>>>>>>
>>>>>> Call verity_verify_io directly when verity_end_io is in softirq.
>>>>>>
>>>>>> Signed-off-by: LongPing Wei <weilongping@oppo.com>
>>>>>
>>>>> Are you sure that 4096 bytes is the correct threshold?
>>>>>
>>>>> I suggest that you run the benchmarks for 4k, 8k, 16k, 32k, 64k, 128k,
>>>>> 256k, 512k and set the default threshold to the largest value where bh
>>>>> code performs better than non-bh code.
>>>>>
>>>>> Mikulas
>>>>>
>>>
>>> 4096 bytes sounds good to me.  As I mentioned elsewhere in the thread
>>> (https://lore.kernel.org/dm-devel/20250326164057.GA1243@sol.localdomain/),
>>> objections keep getting raised to doing more work in softirq context due to
>>> potential impact on real-time tasks.  So while more I/O benchmarks would be
>>> interesting, they are also not the whole story and we should probably prefer a
>>> more conservative threshold.  Also, dm-verity I/O requests have a bimodal
>>> distribution with peaks at 4 KiB (for synchronous reads) and 128 KiB or
>>> read_ahead_kb (for readahead) anyway, so most of the impact should come from 4
>>> KiB anyway.
>>>
>>> 8 KiB could also be reasonable when taking into account multibuffer hashing
>>> though, since multibuffer hashing makes hashing two 4 KiB blocks almost as fast
>>> as hashing a single 4 KiB block.  (Though multibuffer hashing is currently
>>> Android-only, as the crypto maintainer keeps rejecting it upstream.)
>>>
>>> - Eric
>>
>> OK, so I'll set it to 8KiB.
>>
>> Do you think that I should increase this limit to "unlimited" for
>> IOPRIO_CLASS_RT and disable this feature at all for IOPRIO_CLASS_NONE?
> 
> IOPRIO_CLASS_NONE means an I/O priority isn't set, so it probably should be
> treated the same as IOPRIO_CLASS_BE which is the "normal" priority.
> 
> For now I think it makes sense to use 4 or 8 KiB even for IOPRIO_CLASS_RT.  That
> covers the single-block request case which is the one that matters most.  We can
> bump it up later if we can be shown that something higher is better.  Softirqs
> normally get prioritized over all real-time tasks; so while it should be more
> acceptable to use them for IOPRIO_CLASS_RT, it's not a perfect solution.
> 
> Another concern I have with longer requests is that when dm-verity runs the
> verification work in a softirq and finds that a Merkle tree block isn't cached,
> it punts the request to a kworker and restarts it from the beginning.  That not
> only defeats the point of using softirq context, but it also throws away work
> that was done.  I think we could fix it to not throw away work, but having to
> fall back at all would still defeat the point of using softirq context.
> 
> - Eric

Hi, Eric

It seems that verity_prefetch_io doesn't work efficiently.
dm_bufio_prefetch_with_ioprio
   __dm_bufio_prefetch
     blk_start_plug
     for loop
       __bufio_new
       submit_io
       cond_resched
     blk_finish_plug

If more than one hash blocks need to be prefetched, cond_resched will
be called in each loop and blk_finish_plug will be called at the end.

The requests for hash blocks may have not been dispatched when the
requests for data blocks have been completed.

- LongPing

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v3] dm-verity: support block number limits for different ioprio classes
  2025-03-26 18:50           ` Mikulas Patocka
  2025-03-26 20:32             ` Bart Van Assche
  2025-03-26 20:48             ` Eric Biggers
@ 2025-03-27  2:18             ` LongPing Wei
  2025-03-27 16:16               ` Mikulas Patocka
  2025-03-27 16:57               ` Eric Biggers
  2 siblings, 2 replies; 23+ messages in thread
From: LongPing Wei @ 2025-03-27  2:18 UTC (permalink / raw)
  To: snitzer, mpatocka
  Cc: dm-devel, guoweichao, ebiggers, bvanassche, LongPing Wei

Calling verity_verify_io in bh for IO of all sizes is not suitable for
embedded devices. From our tests, it can improve the performance of 4K
synchronise random reads.
For example:
./fio --name=rand_read --ioengine=psync --rw=randread --bs=4K \
 --direct=1 --numjobs=8 --runtime=60 --time_based --group_reporting \
 --filename=/dev/block/mapper/xx-verity

But it will degrade the performance of 512K synchronise sequential reads
on our devices.
For example:
./fio --name=read --ioengine=psync --rw=read --bs=512K --direct=1 \
 --numjobs=8 --runtime=60 --time_based --group_reporting \
 --filename=/dev/block/mapper/xx-verity

A parameter array is introduced by this change. And users can modify the
default config by /sys/module/dm_verity/parameters/use_bh_bytes.

The default limits for NONE/RT/BE is set to 8192.
The default limits for IDLE is set to 0.

Call verity_verify_io directly when verity_end_io is not in hardirq.

Signed-off-by: LongPing Wei <weilongping@oppo.com>
---
v3:
1. set the default limit to 8192 bytes for none,rt,be;
2. set the default limit to 0 bytes for idle;
3. fix the mistake in v2 that verity_verify_io won't be called when
verity_use_bh return false;
v2:
1. change from use_bh_blocks to use_bh_bytes;
2. update documention;
3. set the default limit to 4096bytes;
4. call verity_verify_io directly when verity_end_io is in softirq;
5. bump version of dm-verity;
---
 .../admin-guide/device-mapper/verity.rst      |  8 ++++-
 drivers/md/dm-verity-target.c                 | 30 ++++++++++++++++---
 2 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/device-mapper/verity.rst b/Documentation/admin-guide/device-mapper/verity.rst
index a65c1602cb23..f7fb5ea3c0df 100644
--- a/Documentation/admin-guide/device-mapper/verity.rst
+++ b/Documentation/admin-guide/device-mapper/verity.rst
@@ -143,7 +143,13 @@ root_hash_sig_key_desc <key_description>
 
 try_verify_in_tasklet
     If verity hashes are in cache, verify data blocks in kernel tasklet instead
-    of workqueue. This option can reduce IO latency.
+    of workqueue. This option can reduce IO latency. The size limits could be
+    configured via /sys/module/dm_verity/parameters/use_bh_bytes. The four
+    parameters correspond to limits for IOPRIO_CLASS_NONE,IOPRIO_CLASS_RT,
+    IOPRIO_CLASS_BE and IOPRIO_CLASS_IDLE in turn.
+    For example:
+    <none>,<rt>,<be>,<idle>
+    4096,4096,4096,4096
 
 Theory of operation
 ===================
diff --git a/drivers/md/dm-verity-target.c b/drivers/md/dm-verity-target.c
index e86c1431b108..4d6148ed02e8 100644
--- a/drivers/md/dm-verity-target.c
+++ b/drivers/md/dm-verity-target.c
@@ -30,6 +30,7 @@
 #define DM_VERITY_ENV_VAR_NAME		"DM_VERITY_ERR_BLOCK_NR"
 
 #define DM_VERITY_DEFAULT_PREFETCH_SIZE	262144
+#define DM_VERITY_USE_BH_DEFAULT_BYTES	8192
 
 #define DM_VERITY_MAX_CORRUPTED_ERRS	100
 
@@ -49,6 +50,15 @@ static unsigned int dm_verity_prefetch_cluster = DM_VERITY_DEFAULT_PREFETCH_SIZE
 
 module_param_named(prefetch_cluster, dm_verity_prefetch_cluster, uint, 0644);
 
+static unsigned int dm_verity_use_bh_bytes[4] = {
+	DM_VERITY_USE_BH_DEFAULT_BYTES,	// IOPRIO_CLASS_NONE
+	DM_VERITY_USE_BH_DEFAULT_BYTES,	// IOPRIO_CLASS_RT
+	DM_VERITY_USE_BH_DEFAULT_BYTES,	// IOPRIO_CLASS_BE
+	0				// IOPRIO_CLASS_IDLE
+};
+
+module_param_array_named(use_bh_bytes, dm_verity_use_bh_bytes, uint, NULL, 0644);
+
 static DEFINE_STATIC_KEY_FALSE(use_bh_wq_enabled);
 
 /* Is at least one dm-verity instance using ahash_tfm instead of shash_tfm? */
@@ -652,9 +662,16 @@ static void verity_bh_work(struct work_struct *w)
 	verity_finish_io(io, errno_to_blk_status(err));
 }
 
+static inline bool verity_use_bh(unsigned int bytes, unsigned short ioprio)
+{
+	return ioprio <= IOPRIO_CLASS_IDLE &&
+		bytes <= READ_ONCE(dm_verity_use_bh_bytes[ioprio]);
+}
+
 static void verity_end_io(struct bio *bio)
 {
 	struct dm_verity_io *io = bio->bi_private;
+	unsigned short ioprio = IOPRIO_PRIO_CLASS(bio->bi_ioprio);
 
 	if (bio->bi_status &&
 	    (!verity_fec_is_enabled(io->v) ||
@@ -664,9 +681,14 @@ static void verity_end_io(struct bio *bio)
 		return;
 	}
 
-	if (static_branch_unlikely(&use_bh_wq_enabled) && io->v->use_bh_wq) {
-		INIT_WORK(&io->bh_work, verity_bh_work);
-		queue_work(system_bh_wq, &io->bh_work);
+	if (static_branch_unlikely(&use_bh_wq_enabled) && io->v->use_bh_wq &&
+		verity_use_bh(bio->bi_iter.bi_size, ioprio)) {
+		if (in_hardirq() || irqs_disabled()) {
+			INIT_WORK(&io->bh_work, verity_bh_work);
+			queue_work(system_bh_wq, &io->bh_work);
+		} else {
+			verity_bh_work(&io->bh_work);
+		}
 	} else {
 		INIT_WORK(&io->work, verity_work);
 		queue_work(io->v->verify_wq, &io->work);
@@ -1761,7 +1783,7 @@ static struct target_type verity_target = {
 	.name		= "verity",
 /* Note: the LSMs depend on the singleton and immutable features */
 	.features	= DM_TARGET_SINGLETON | DM_TARGET_IMMUTABLE,
-	.version	= {1, 10, 0},
+	.version	= {1, 11, 0},
 	.module		= THIS_MODULE,
 	.ctr		= verity_ctr,
 	.dtr		= verity_dtr,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v3] dm-verity: support block number limits for different ioprio classes
  2025-03-27  2:18             ` [PATCH v3] " LongPing Wei
@ 2025-03-27 16:16               ` Mikulas Patocka
  2025-03-27 16:57               ` Eric Biggers
  1 sibling, 0 replies; 23+ messages in thread
From: Mikulas Patocka @ 2025-03-27 16:16 UTC (permalink / raw)
  To: LongPing Wei; +Cc: snitzer, dm-devel, guoweichao, ebiggers, bvanassche

I applied the patch, thanks.

Mikulas


On Thu, 27 Mar 2025, LongPing Wei wrote:

> Calling verity_verify_io in bh for IO of all sizes is not suitable for
> embedded devices. From our tests, it can improve the performance of 4K
> synchronise random reads.
> For example:
> ./fio --name=rand_read --ioengine=psync --rw=randread --bs=4K \
>  --direct=1 --numjobs=8 --runtime=60 --time_based --group_reporting \
>  --filename=/dev/block/mapper/xx-verity
> 
> But it will degrade the performance of 512K synchronise sequential reads
> on our devices.
> For example:
> ./fio --name=read --ioengine=psync --rw=read --bs=512K --direct=1 \
>  --numjobs=8 --runtime=60 --time_based --group_reporting \
>  --filename=/dev/block/mapper/xx-verity
> 
> A parameter array is introduced by this change. And users can modify the
> default config by /sys/module/dm_verity/parameters/use_bh_bytes.
> 
> The default limits for NONE/RT/BE is set to 8192.
> The default limits for IDLE is set to 0.
> 
> Call verity_verify_io directly when verity_end_io is not in hardirq.
> 
> Signed-off-by: LongPing Wei <weilongping@oppo.com>
> ---
> v3:
> 1. set the default limit to 8192 bytes for none,rt,be;
> 2. set the default limit to 0 bytes for idle;
> 3. fix the mistake in v2 that verity_verify_io won't be called when
> verity_use_bh return false;
> v2:
> 1. change from use_bh_blocks to use_bh_bytes;
> 2. update documention;
> 3. set the default limit to 4096bytes;
> 4. call verity_verify_io directly when verity_end_io is in softirq;
> 5. bump version of dm-verity;
> ---
>  .../admin-guide/device-mapper/verity.rst      |  8 ++++-
>  drivers/md/dm-verity-target.c                 | 30 ++++++++++++++++---
>  2 files changed, 33 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/admin-guide/device-mapper/verity.rst b/Documentation/admin-guide/device-mapper/verity.rst
> index a65c1602cb23..f7fb5ea3c0df 100644
> --- a/Documentation/admin-guide/device-mapper/verity.rst
> +++ b/Documentation/admin-guide/device-mapper/verity.rst
> @@ -143,7 +143,13 @@ root_hash_sig_key_desc <key_description>
>  
>  try_verify_in_tasklet
>      If verity hashes are in cache, verify data blocks in kernel tasklet instead
> -    of workqueue. This option can reduce IO latency.
> +    of workqueue. This option can reduce IO latency. The size limits could be
> +    configured via /sys/module/dm_verity/parameters/use_bh_bytes. The four
> +    parameters correspond to limits for IOPRIO_CLASS_NONE,IOPRIO_CLASS_RT,
> +    IOPRIO_CLASS_BE and IOPRIO_CLASS_IDLE in turn.
> +    For example:
> +    <none>,<rt>,<be>,<idle>
> +    4096,4096,4096,4096
>  
>  Theory of operation
>  ===================
> diff --git a/drivers/md/dm-verity-target.c b/drivers/md/dm-verity-target.c
> index e86c1431b108..4d6148ed02e8 100644
> --- a/drivers/md/dm-verity-target.c
> +++ b/drivers/md/dm-verity-target.c
> @@ -30,6 +30,7 @@
>  #define DM_VERITY_ENV_VAR_NAME		"DM_VERITY_ERR_BLOCK_NR"
>  
>  #define DM_VERITY_DEFAULT_PREFETCH_SIZE	262144
> +#define DM_VERITY_USE_BH_DEFAULT_BYTES	8192
>  
>  #define DM_VERITY_MAX_CORRUPTED_ERRS	100
>  
> @@ -49,6 +50,15 @@ static unsigned int dm_verity_prefetch_cluster = DM_VERITY_DEFAULT_PREFETCH_SIZE
>  
>  module_param_named(prefetch_cluster, dm_verity_prefetch_cluster, uint, 0644);
>  
> +static unsigned int dm_verity_use_bh_bytes[4] = {
> +	DM_VERITY_USE_BH_DEFAULT_BYTES,	// IOPRIO_CLASS_NONE
> +	DM_VERITY_USE_BH_DEFAULT_BYTES,	// IOPRIO_CLASS_RT
> +	DM_VERITY_USE_BH_DEFAULT_BYTES,	// IOPRIO_CLASS_BE
> +	0				// IOPRIO_CLASS_IDLE
> +};
> +
> +module_param_array_named(use_bh_bytes, dm_verity_use_bh_bytes, uint, NULL, 0644);
> +
>  static DEFINE_STATIC_KEY_FALSE(use_bh_wq_enabled);
>  
>  /* Is at least one dm-verity instance using ahash_tfm instead of shash_tfm? */
> @@ -652,9 +662,16 @@ static void verity_bh_work(struct work_struct *w)
>  	verity_finish_io(io, errno_to_blk_status(err));
>  }
>  
> +static inline bool verity_use_bh(unsigned int bytes, unsigned short ioprio)
> +{
> +	return ioprio <= IOPRIO_CLASS_IDLE &&
> +		bytes <= READ_ONCE(dm_verity_use_bh_bytes[ioprio]);
> +}
> +
>  static void verity_end_io(struct bio *bio)
>  {
>  	struct dm_verity_io *io = bio->bi_private;
> +	unsigned short ioprio = IOPRIO_PRIO_CLASS(bio->bi_ioprio);
>  
>  	if (bio->bi_status &&
>  	    (!verity_fec_is_enabled(io->v) ||
> @@ -664,9 +681,14 @@ static void verity_end_io(struct bio *bio)
>  		return;
>  	}
>  
> -	if (static_branch_unlikely(&use_bh_wq_enabled) && io->v->use_bh_wq) {
> -		INIT_WORK(&io->bh_work, verity_bh_work);
> -		queue_work(system_bh_wq, &io->bh_work);
> +	if (static_branch_unlikely(&use_bh_wq_enabled) && io->v->use_bh_wq &&
> +		verity_use_bh(bio->bi_iter.bi_size, ioprio)) {
> +		if (in_hardirq() || irqs_disabled()) {
> +			INIT_WORK(&io->bh_work, verity_bh_work);
> +			queue_work(system_bh_wq, &io->bh_work);
> +		} else {
> +			verity_bh_work(&io->bh_work);
> +		}
>  	} else {
>  		INIT_WORK(&io->work, verity_work);
>  		queue_work(io->v->verify_wq, &io->work);
> @@ -1761,7 +1783,7 @@ static struct target_type verity_target = {
>  	.name		= "verity",
>  /* Note: the LSMs depend on the singleton and immutable features */
>  	.features	= DM_TARGET_SINGLETON | DM_TARGET_IMMUTABLE,
> -	.version	= {1, 10, 0},
> +	.version	= {1, 11, 0},
>  	.module		= THIS_MODULE,
>  	.ctr		= verity_ctr,
>  	.dtr		= verity_dtr,
> -- 
> 2.34.1
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3] dm-verity: support block number limits for different ioprio classes
  2025-03-27  2:18             ` [PATCH v3] " LongPing Wei
  2025-03-27 16:16               ` Mikulas Patocka
@ 2025-03-27 16:57               ` Eric Biggers
  1 sibling, 0 replies; 23+ messages in thread
From: Eric Biggers @ 2025-03-27 16:57 UTC (permalink / raw)
  To: LongPing Wei; +Cc: snitzer, mpatocka, dm-devel, guoweichao, bvanassche

On Thu, Mar 27, 2025 at 10:18:19AM +0800, LongPing Wei wrote:
>  try_verify_in_tasklet
>      If verity hashes are in cache, verify data blocks in kernel tasklet instead
> -    of workqueue. This option can reduce IO latency.
> +    of workqueue. This option can reduce IO latency.

The above is really overdue for an update (tasklets aren't actually used
anymore), but that can be done in a separate patch...

> + The size limits could be configured

could => can

> +	if (static_branch_unlikely(&use_bh_wq_enabled) && io->v->use_bh_wq &&
> +		verity_use_bh(bio->bi_iter.bi_size, ioprio)) {

bi_size is always 0 here since the iterator was already advanced.

dm_verity_io::n_blocks needs to be used instead.

- Eric

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2] dm-verity: support block number limits for different ioprio classes
  2025-03-27  0:52               ` LongPing Wei
@ 2025-03-27 17:05                 ` Eric Biggers
  2025-03-27 17:12                   ` Mikulas Patocka
  0 siblings, 1 reply; 23+ messages in thread
From: Eric Biggers @ 2025-03-27 17:05 UTC (permalink / raw)
  To: LongPing Wei
  Cc: Mikulas Patocka, agk, dm-devel, guoweichao, snitzer,
	Bart Van Assche

On Thu, Mar 27, 2025 at 08:52:45AM +0800, LongPing Wei wrote:
> On 2025/3/27 4:48, Eric Biggers wrote:
> > On Wed, Mar 26, 2025 at 07:50:34PM +0100, Mikulas Patocka wrote:
> > > 
> > > 
> > > On Wed, 26 Mar 2025, Eric Biggers wrote:
> > > 
> > > > On Wed, Mar 26, 2025 at 09:32:27PM +0800, LongPing Wei wrote:
> > > > > On 2025/3/26 19:04, Mikulas Patocka wrote:
> > > > > > 
> > > > > > 
> > > > > > On Wed, 26 Mar 2025, LongPing Wei wrote:
> > > > > > 
> > > > > > > Calling verity_verify_io in bh for IO of all sizes is not suitable for
> > > > > > > embedded devices. From our tests, it can improve the performance of 4K
> > > > > > > synchronise random reads.
> > > > > > > For example:
> > > > > > > ./fio --name=rand_read --ioengine=psync --rw=randread --bs=4K \
> > > > > > >    --direct=1 --numjobs=8 --runtime=60 --time_based --group_reporting \
> > > > > > >    --filename=/dev/block/mapper/xx-verity
> > > > > > > 
> > > > > > > But it will degrade the performance of 512K synchronise sequential reads
> > > > > > > on our devices.
> > > > > > > For example:
> > > > > > > ./fio --name=read --ioengine=psync --rw=read --bs=512K --direct=1 \
> > > > > > >    --numjobs=8 --runtime=60 --time_based --group_reporting \
> > > > > > >    --filename=/dev/block/mapper/xx-verity
> > > > > > > 
> > > > > > > A parameter array is introduced by this change. And users can modify the
> > > > > > > default config by /sys/module/dm_verity/parameters/use_bh_bytes.
> > > > > > > 
> > > > > > > The default limits for NONE/RT/BE/IDLE is set to 4096.
> > > > > > > 
> > > > > > > Call verity_verify_io directly when verity_end_io is in softirq.
> > > > > > > 
> > > > > > > Signed-off-by: LongPing Wei <weilongping@oppo.com>
> > > > > > 
> > > > > > Are you sure that 4096 bytes is the correct threshold?
> > > > > > 
> > > > > > I suggest that you run the benchmarks for 4k, 8k, 16k, 32k, 64k, 128k,
> > > > > > 256k, 512k and set the default threshold to the largest value where bh
> > > > > > code performs better than non-bh code.
> > > > > > 
> > > > > > Mikulas
> > > > > > 
> > > > 
> > > > 4096 bytes sounds good to me.  As I mentioned elsewhere in the thread
> > > > (https://lore.kernel.org/dm-devel/20250326164057.GA1243@sol.localdomain/),
> > > > objections keep getting raised to doing more work in softirq context due to
> > > > potential impact on real-time tasks.  So while more I/O benchmarks would be
> > > > interesting, they are also not the whole story and we should probably prefer a
> > > > more conservative threshold.  Also, dm-verity I/O requests have a bimodal
> > > > distribution with peaks at 4 KiB (for synchronous reads) and 128 KiB or
> > > > read_ahead_kb (for readahead) anyway, so most of the impact should come from 4
> > > > KiB anyway.
> > > > 
> > > > 8 KiB could also be reasonable when taking into account multibuffer hashing
> > > > though, since multibuffer hashing makes hashing two 4 KiB blocks almost as fast
> > > > as hashing a single 4 KiB block.  (Though multibuffer hashing is currently
> > > > Android-only, as the crypto maintainer keeps rejecting it upstream.)
> > > > 
> > > > - Eric
> > > 
> > > OK, so I'll set it to 8KiB.
> > > 
> > > Do you think that I should increase this limit to "unlimited" for
> > > IOPRIO_CLASS_RT and disable this feature at all for IOPRIO_CLASS_NONE?
> > 
> > IOPRIO_CLASS_NONE means an I/O priority isn't set, so it probably should be
> > treated the same as IOPRIO_CLASS_BE which is the "normal" priority.
> > 
> > For now I think it makes sense to use 4 or 8 KiB even for IOPRIO_CLASS_RT.  That
> > covers the single-block request case which is the one that matters most.  We can
> > bump it up later if we can be shown that something higher is better.  Softirqs
> > normally get prioritized over all real-time tasks; so while it should be more
> > acceptable to use them for IOPRIO_CLASS_RT, it's not a perfect solution.
> > 
> > Another concern I have with longer requests is that when dm-verity runs the
> > verification work in a softirq and finds that a Merkle tree block isn't cached,
> > it punts the request to a kworker and restarts it from the beginning.  That not
> > only defeats the point of using softirq context, but it also throws away work
> > that was done.  I think we could fix it to not throw away work, but having to
> > fall back at all would still defeat the point of using softirq context.
> > 
> > - Eric
> 
> Hi, Eric
> 
> It seems that verity_prefetch_io doesn't work efficiently.
> dm_bufio_prefetch_with_ioprio
>   __dm_bufio_prefetch
>     blk_start_plug
>     for loop
>       __bufio_new
>       submit_io
>       cond_resched
>     blk_finish_plug
> 
> If more than one hash blocks need to be prefetched, cond_resched will
> be called in each loop and blk_finish_plug will be called at the end.
> 
> The requests for hash blocks may have not been dispatched when the
> requests for data blocks have been completed.

Well, it's always possible for the prefetch I/O to be too slow, but the way it
works does seem to be unnecessarily bad.  The prefetch work runs on a kworker,
which is going to slow things down.  The way it should work is that verity_map()
runs the prefetch work and starts, but doesn't wait for, any I/O that is needed.
(And of course, most of the time no I/O will actually be needed.)

- Eric

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2] dm-verity: support block number limits for different ioprio classes
  2025-03-27 17:05                 ` Eric Biggers
@ 2025-03-27 17:12                   ` Mikulas Patocka
  2025-03-27 17:35                     ` Eric Biggers
  0 siblings, 1 reply; 23+ messages in thread
From: Mikulas Patocka @ 2025-03-27 17:12 UTC (permalink / raw)
  To: Eric Biggers
  Cc: LongPing Wei, agk, dm-devel, guoweichao, snitzer, Bart Van Assche



On Thu, 27 Mar 2025, Eric Biggers wrote:

> > Hi, Eric
> > 
> > It seems that verity_prefetch_io doesn't work efficiently.
> > dm_bufio_prefetch_with_ioprio
> >   __dm_bufio_prefetch
> >     blk_start_plug
> >     for loop
> >       __bufio_new
> >       submit_io
> >       cond_resched
> >     blk_finish_plug
> > 
> > If more than one hash blocks need to be prefetched, cond_resched will
> > be called in each loop and blk_finish_plug will be called at the end.
> > 
> > The requests for hash blocks may have not been dispatched when the
> > requests for data blocks have been completed.
> 
> Well, it's always possible for the prefetch I/O to be too slow, but the way it
> works does seem to be unnecessarily bad.  The prefetch work runs on a kworker,
> which is going to slow things down.  The way it should work is that verity_map()
> runs the prefetch work and starts, but doesn't wait for, any I/O that is needed.
> (And of course, most of the time no I/O will actually be needed.)
> 
> - Eric

dm-verity used to prefetch from the verity_map function, but it caused a 
deadlock - 3b6b7813b198b578aa7e04e4047ddb8225c37b7f - so, it was moved to 
a workqueue.

Mikulas


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2] dm-verity: support block number limits for different ioprio classes
  2025-03-27 17:12                   ` Mikulas Patocka
@ 2025-03-27 17:35                     ` Eric Biggers
  2025-03-27 19:03                       ` Mikulas Patocka
  0 siblings, 1 reply; 23+ messages in thread
From: Eric Biggers @ 2025-03-27 17:35 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: LongPing Wei, agk, dm-devel, guoweichao, snitzer, Bart Van Assche

On Thu, Mar 27, 2025 at 06:12:22PM +0100, Mikulas Patocka wrote:
> 
> 
> On Thu, 27 Mar 2025, Eric Biggers wrote:
> 
> > > Hi, Eric
> > > 
> > > It seems that verity_prefetch_io doesn't work efficiently.
> > > dm_bufio_prefetch_with_ioprio
> > >   __dm_bufio_prefetch
> > >     blk_start_plug
> > >     for loop
> > >       __bufio_new
> > >       submit_io
> > >       cond_resched
> > >     blk_finish_plug
> > > 
> > > If more than one hash blocks need to be prefetched, cond_resched will
> > > be called in each loop and blk_finish_plug will be called at the end.
> > > 
> > > The requests for hash blocks may have not been dispatched when the
> > > requests for data blocks have been completed.
> > 
> > Well, it's always possible for the prefetch I/O to be too slow, but the way it
> > works does seem to be unnecessarily bad.  The prefetch work runs on a kworker,
> > which is going to slow things down.  The way it should work is that verity_map()
> > runs the prefetch work and starts, but doesn't wait for, any I/O that is needed.
> > (And of course, most of the time no I/O will actually be needed.)
> > 
> > - Eric
> 
> dm-verity used to prefetch from the verity_map function, but it caused a 
> deadlock - 3b6b7813b198b578aa7e04e4047ddb8225c37b7f - so, it was moved to 
> a workqueue.

Interesting.  Unfortunately I think the workqueue makes it way worse.

In principle there is no need for it to block for anything.  If it would have to
block, it just should just continue on.

- Eric

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2] dm-verity: support block number limits for different ioprio classes
  2025-03-27 17:35                     ` Eric Biggers
@ 2025-03-27 19:03                       ` Mikulas Patocka
  2025-03-27 19:39                         ` Mikulas Patocka
  0 siblings, 1 reply; 23+ messages in thread
From: Mikulas Patocka @ 2025-03-27 19:03 UTC (permalink / raw)
  To: Eric Biggers
  Cc: LongPing Wei, agk, dm-devel, guoweichao, snitzer, Bart Van Assche



On Thu, 27 Mar 2025, Eric Biggers wrote:

> On Thu, Mar 27, 2025 at 06:12:22PM +0100, Mikulas Patocka wrote:
> > 
> > 
> > On Thu, 27 Mar 2025, Eric Biggers wrote:
> > 
> > > > Hi, Eric
> > > > 
> > > > It seems that verity_prefetch_io doesn't work efficiently.
> > > > dm_bufio_prefetch_with_ioprio
> > > >   __dm_bufio_prefetch
> > > >     blk_start_plug
> > > >     for loop
> > > >       __bufio_new
> > > >       submit_io
> > > >       cond_resched
> > > >     blk_finish_plug
> > > > 
> > > > If more than one hash blocks need to be prefetched, cond_resched will
> > > > be called in each loop and blk_finish_plug will be called at the end.
> > > > 
> > > > The requests for hash blocks may have not been dispatched when the
> > > > requests for data blocks have been completed.
> > > 
> > > Well, it's always possible for the prefetch I/O to be too slow, but the way it
> > > works does seem to be unnecessarily bad.  The prefetch work runs on a kworker,
> > > which is going to slow things down.  The way it should work is that verity_map()
> > > runs the prefetch work and starts, but doesn't wait for, any I/O that is needed.
> > > (And of course, most of the time no I/O will actually be needed.)
> > > 
> > > - Eric
> > 
> > dm-verity used to prefetch from the verity_map function, but it caused a 
> > deadlock - 3b6b7813b198b578aa7e04e4047ddb8225c37b7f - so, it was moved to 
> > a workqueue.
> 
> Interesting.  Unfortunately I think the workqueue makes it way worse.
> 
> In principle there is no need for it to block for anything.  If it would have to
> block, it just should just continue on.
> 
> - Eric

That seems to be right concern. Should I disable prefetch and add a switch 
to enable it on demand? Or delete the prefetch code entirely?

There is also a bug that prefetch may race with suspend, sending I/Os even 
when the dm-verity device is suspended - but that is fixable by adding 
flush_workqueue to the postsuspend hook.

Mikulas



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2] dm-verity: support block number limits for different ioprio classes
  2025-03-27 19:03                       ` Mikulas Patocka
@ 2025-03-27 19:39                         ` Mikulas Patocka
  0 siblings, 0 replies; 23+ messages in thread
From: Mikulas Patocka @ 2025-03-27 19:39 UTC (permalink / raw)
  To: Eric Biggers
  Cc: LongPing Wei, agk, dm-devel, guoweichao, snitzer, Bart Van Assche



On Thu, 27 Mar 2025, Mikulas Patocka wrote:

> There is also a bug that prefetch may race with suspend, sending I/Os even 
> when the dm-verity device is suspended - but that is fixable by adding 
> flush_workqueue to the postsuspend hook.
> 
> Mikulas

flush_workqueue won't make it - it seems that I'll have to add a new 
function dm_bufio_quisce that will wait until all prefetches terminate and 
use this function in the postsuspend hook of targets that do prefetch.

Mikulas


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2025-03-27 19:39 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-03-25 10:49 [PATCH] dm-verity: support block number limits for different ioprio classes LongPing Wei
2025-03-25 17:43 ` Eric Biggers
2025-03-26  2:09   ` LongPing Wei
2025-03-26 16:40     ` Eric Biggers
2025-03-25 21:17 ` Mikulas Patocka
2025-03-26  1:30   ` [PATCH v2] " LongPing Wei
2025-03-26  5:18     ` Eric Biggers
2025-03-26  6:34       ` LongPing Wei
2025-03-26 11:04     ` Mikulas Patocka
2025-03-26 13:32       ` LongPing Wei
2025-03-26 16:53         ` Eric Biggers
2025-03-26 18:50           ` Mikulas Patocka
2025-03-26 20:32             ` Bart Van Assche
2025-03-26 20:48             ` Eric Biggers
2025-03-27  0:52               ` LongPing Wei
2025-03-27 17:05                 ` Eric Biggers
2025-03-27 17:12                   ` Mikulas Patocka
2025-03-27 17:35                     ` Eric Biggers
2025-03-27 19:03                       ` Mikulas Patocka
2025-03-27 19:39                         ` Mikulas Patocka
2025-03-27  2:18             ` [PATCH v3] " LongPing Wei
2025-03-27 16:16               ` Mikulas Patocka
2025-03-27 16:57               ` Eric Biggers

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.