* make block layer auto-PI deadlock safe
@ 2025-10-23  8:08 Christoph Hellwig
  2025-10-23  8:08 ` [PATCH 1/3] slab, block: generalize bvec_alloc_gfp Christoph Hellwig
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Christoph Hellwig @ 2025-10-23  8:08 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Vlastimil Babka, Andrew Morton, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Martin K. Petersen, linux-block,
	linux-mm

Hi all,

currently the automatic block layer PI generation allocates the integrity
buffer using kmalloc, and thus can deadlock or fail I/O requests under
memory pressure.

Fix this by adding a mempool, and capping the maximum I/O size on
PI-capable devices so that it does not exceed the allocation size of the
mempool.
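
In outline, the buffer allocation then becomes (a simplified sketch of
what patch 3 does; the real code also handles buffer zeroing):

	buf = kmalloc(len, try_alloc_gfp(GFP_NOIO));
	if (unlikely(!buf)) {
		/* cannot fail: one pool element covers the capped max I/O */
		page = mempool_alloc(&integrity_buf_pool, GFP_NOIO);
		bvec_set_page(&bip->bip_vec[0], page, len, 0);
		bip->bip_flags |= BIP_MEMPOOL;
	} else {
		bvec_set_page(&bip->bip_vec[0], virt_to_page(buf), len,
				offset_in_page(buf));
	}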

This is against the block-6.18 branch as it has a contextual dependency
on the PI fix merged there yesterday.

Diffstat:
 block/bio-integrity-auto.c    |   26 ++---------------------
 block/bio-integrity.c         |   47 ++++++++++++++++++++++++++++++++++++++++++
 block/bio.c                   |   13 +----------
 block/blk-settings.c          |   11 +++++++++
 include/linux/bio-integrity.h |    6 +++++
 include/linux/blk-integrity.h |    5 ++++
 include/linux/slab.h          |   10 ++++++++
 7 files changed, 84 insertions(+), 34 deletions(-)



* [PATCH 1/3] slab, block: generalize bvec_alloc_gfp
  2025-10-23  8:08 make block layer auto-PI deadlock safe Christoph Hellwig
@ 2025-10-23  8:08 ` Christoph Hellwig
  2025-10-24  1:44   ` Martin K. Petersen
                     ` (2 more replies)
  2025-10-23  8:08 ` [PATCH 2/3] block: blocking mempool_alloc doesn't fail Christoph Hellwig
  2025-10-23  8:08 ` [PATCH 3/3] block: make bio auto-integrity deadlock safe Christoph Hellwig
  2 siblings, 3 replies; 14+ messages in thread
From: Christoph Hellwig @ 2025-10-23  8:08 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Vlastimil Babka, Andrew Morton, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Martin K. Petersen, linux-block,
	linux-mm

bvec_alloc_gfp is useful for any place that tries to kmalloc first and
then fall back to a mempool.  Rename it and move it to blk.h to prepare
for using it to allocate the default integrity buffer.
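
The pattern for callers is then (illustrative sketch, not part of this
patch):

	ptr = kmem_cache_alloc(cache, try_alloc_gfp(gfp_mask));
	if (!ptr)
		ptr = mempool_alloc(pool, gfp_mask);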

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/bio.c          | 13 ++-----------
 include/linux/slab.h | 10 ++++++++++
 2 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index b3a79285c278..4ea5833a7637 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -169,16 +169,6 @@ void bvec_free(mempool_t *pool, struct bio_vec *bv, unsigned short nr_vecs)
 		kmem_cache_free(biovec_slab(nr_vecs)->slab, bv);
 }
 
-/*
- * Make the first allocation restricted and don't dump info on allocation
- * failures, since we'll fall back to the mempool in case of failure.
- */
-static inline gfp_t bvec_alloc_gfp(gfp_t gfp)
-{
-	return (gfp & ~(__GFP_DIRECT_RECLAIM | __GFP_IO)) |
-		__GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN;
-}
-
 struct bio_vec *bvec_alloc(mempool_t *pool, unsigned short *nr_vecs,
 		gfp_t gfp_mask)
 {
@@ -201,7 +191,8 @@ struct bio_vec *bvec_alloc(mempool_t *pool, unsigned short *nr_vecs,
 	if (*nr_vecs < BIO_MAX_VECS) {
 		struct bio_vec *bvl;
 
-		bvl = kmem_cache_alloc(bvs->slab, bvec_alloc_gfp(gfp_mask));
+		bvl = kmem_cache_alloc(bvs->slab,
+				try_alloc_gfp(gfp_mask & ~__GFP_IO));
 		if (likely(bvl) || !(gfp_mask & __GFP_DIRECT_RECLAIM))
 			return bvl;
 		*nr_vecs = BIO_MAX_VECS;
diff --git a/include/linux/slab.h b/include/linux/slab.h
index d5a8ab98035c..a6672cead03e 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -1113,6 +1113,16 @@ void kfree_rcu_scheduler_running(void);
  */
 size_t kmalloc_size_roundup(size_t size);
 
+/*
+ * Make the first allocation restricted and don't dump info on allocation
+ * failures, for callers that will fall back to a mempool in case of failure.
+ */
+static inline gfp_t try_alloc_gfp(gfp_t gfp)
+{
+	return (gfp & ~__GFP_DIRECT_RECLAIM) |
+		__GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN;
+}
+
 void __init kmem_cache_init_late(void);
 void __init kvfree_rcu_init(void);
 
-- 
2.47.3




* [PATCH 2/3] block: blocking mempool_alloc doesn't fail
  2025-10-23  8:08 make block layer auto-PI deadlock safe Christoph Hellwig
  2025-10-23  8:08 ` [PATCH 1/3] slab, block: generalize bvec_alloc_gfp Christoph Hellwig
@ 2025-10-23  8:08 ` Christoph Hellwig
  2025-10-24  1:45   ` Martin K. Petersen
  2025-10-23  8:08 ` [PATCH 3/3] block: make bio auto-integrity deadlock safe Christoph Hellwig
  2 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2025-10-23  8:08 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Vlastimil Babka, Andrew Morton, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Martin K. Petersen, linux-block,
	linux-mm

So remove the error check for it in bio_integrity_prep.  mempool_alloc
with a gfp_mask that allows direct reclaim, as GFP_NOIO does here, is
guaranteed to return an element, sleeping until one becomes available
rather than returning NULL.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/bio-integrity-auto.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/block/bio-integrity-auto.c b/block/bio-integrity-auto.c
index 687952f63bbb..2f4a244749ac 100644
--- a/block/bio-integrity-auto.c
+++ b/block/bio-integrity-auto.c
@@ -158,8 +158,6 @@ bool bio_integrity_prep(struct bio *bio)
 	if (!buf)
 		goto err_end_io;
 	bid = mempool_alloc(&bid_pool, GFP_NOIO);
-	if (!bid)
-		goto err_free_buf;
 	bio_integrity_init(bio, &bid->bip, &bid->bvec, 1);
 
 	bid->bio = bio;
@@ -187,8 +185,6 @@ bool bio_integrity_prep(struct bio *bio)
 		bid->saved_bio_iter = bio->bi_iter;
 	return true;
 
-err_free_buf:
-	kfree(buf);
 err_end_io:
 	bio->bi_status = BLK_STS_RESOURCE;
 	bio_endio(bio);
-- 
2.47.3




* [PATCH 3/3] block: make bio auto-integrity deadlock safe
  2025-10-23  8:08 make block layer auto-PI deadlock safe Christoph Hellwig
  2025-10-23  8:08 ` [PATCH 1/3] slab, block: generalize bvec_alloc_gfp Christoph Hellwig
  2025-10-23  8:08 ` [PATCH 2/3] block: blocking mempool_alloc doesn't fail Christoph Hellwig
@ 2025-10-23  8:08 ` Christoph Hellwig
  2025-10-24  1:47   ` Martin K. Petersen
  2025-10-27  6:03   ` Kanchan Joshi
  2 siblings, 2 replies; 14+ messages in thread
From: Christoph Hellwig @ 2025-10-23  8:08 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Vlastimil Babka, Andrew Morton, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Martin K. Petersen, linux-block,
	linux-mm

The current block layer automatic integrity protection allocates the
actual integrity buffer with kmalloc, which has three problems:

 - because it happens at the bottom of the I/O stack and doesn't use a
   mempool, it can deadlock under load
 - because the data size in a bio is almost unbounded when using large
   folios, it can relatively easily exceed the maximum kmalloc size
 - even when it does not exceed the maximum kmalloc size, it could
   exceed the maximum segment size of the device

Fix this by limiting the I/O size so that we can allocated at least a
2MiB integrity buffer, i.e. 128MiB for 8 byte PI and 512 byte integrity
internals, and create a mempool as a last resort for this maximum size,
mirroring the scheme used for bvecs.  As a nice upside none of this
can fail now, so we remove the error handling and open code the
trivial addition of the bip vec.
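
(For the arithmetic: 2 MiB of buffer / 8 bytes of PI per tuple = 262144
tuples, each covering one 512 byte interval, i.e. 262144 * 512 bytes =
128 MiB of data.)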

The new allocation helpers sit outside of bio-integrity-auto.c because
I plan to reuse them for file system based PI in the near future.

Fixes: 7ba1ba12eeef ("block: Block layer data integrity support")
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/bio-integrity-auto.c    | 22 +++-------------
 block/bio-integrity.c         | 47 +++++++++++++++++++++++++++++++++++
 block/blk-settings.c          | 11 ++++++++
 include/linux/bio-integrity.h |  6 +++++
 include/linux/blk-integrity.h |  5 ++++
 5 files changed, 72 insertions(+), 19 deletions(-)

diff --git a/block/bio-integrity-auto.c b/block/bio-integrity-auto.c
index 2f4a244749ac..9850c338548d 100644
--- a/block/bio-integrity-auto.c
+++ b/block/bio-integrity-auto.c
@@ -29,7 +29,7 @@ static void bio_integrity_finish(struct bio_integrity_data *bid)
 {
 	bid->bio->bi_integrity = NULL;
 	bid->bio->bi_opf &= ~REQ_INTEGRITY;
-	kfree(bvec_virt(bid->bip.bip_vec));
+	bio_integrity_free_buf(&bid->bip);
 	mempool_free(bid, &bid_pool);
 }
 
@@ -110,8 +110,6 @@ bool bio_integrity_prep(struct bio *bio)
 	struct bio_integrity_data *bid;
 	bool set_flags = true;
 	gfp_t gfp = GFP_NOIO;
-	unsigned int len;
-	void *buf;
 
 	if (!bi)
 		return true;
@@ -152,17 +150,12 @@ bool bio_integrity_prep(struct bio *bio)
 	if (WARN_ON_ONCE(bio_has_crypt_ctx(bio)))
 		return true;
 
-	/* Allocate kernel buffer for protection data */
-	len = bio_integrity_bytes(bi, bio_sectors(bio));
-	buf = kmalloc(len, gfp);
-	if (!buf)
-		goto err_end_io;
 	bid = mempool_alloc(&bid_pool, GFP_NOIO);
 	bio_integrity_init(bio, &bid->bip, &bid->bvec, 1);
-
 	bid->bio = bio;
-
 	bid->bip.bip_flags |= BIP_BLOCK_INTEGRITY;
+	bio_integrity_alloc_buf(bio, gfp & __GFP_ZERO);
+
 	bip_set_seed(&bid->bip, bio->bi_iter.bi_sector);
 
 	if (set_flags) {
@@ -174,21 +167,12 @@ bool bio_integrity_prep(struct bio *bio)
 			bid->bip.bip_flags |= BIP_CHECK_REFTAG;
 	}
 
-	if (bio_integrity_add_page(bio, virt_to_page(buf), len,
-			offset_in_page(buf)) < len)
-		goto err_end_io;
-
 	/* Auto-generate integrity metadata if this is a write */
 	if (bio_data_dir(bio) == WRITE && bip_should_check(&bid->bip))
 		blk_integrity_generate(bio);
 	else
 		bid->saved_bio_iter = bio->bi_iter;
 	return true;
-
-err_end_io:
-	bio->bi_status = BLK_STS_RESOURCE;
-	bio_endio(bio);
-	return false;
 }
 EXPORT_SYMBOL(bio_integrity_prep);
 
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index bed26f1ec869..a9896d563c1c 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -14,6 +14,44 @@ struct bio_integrity_alloc {
 	struct bio_vec			bvecs[];
 };
 
+static mempool_t integrity_buf_pool;
+
+void bio_integrity_alloc_buf(struct bio *bio, bool zero_buffer)
+{
+	struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
+	struct bio_integrity_payload *bip = bio_integrity(bio);
+	unsigned int len = bio_integrity_bytes(bi, bio_sectors(bio));
+	gfp_t gfp = GFP_NOIO | (zero_buffer ? __GFP_ZERO : 0);
+	void *buf;
+
+	buf = kmalloc(len, try_alloc_gfp(gfp));
+	if (unlikely(!buf)) {
+		struct page *page;
+
+		page = mempool_alloc(&integrity_buf_pool, GFP_NOFS);
+		if (zero_buffer)
+			memset(page_address(page), 0, len);
+		bvec_set_page(&bip->bip_vec[0], page, len, 0);
+		bip->bip_flags |= BIP_MEMPOOL;
+	} else {
+		bvec_set_page(&bip->bip_vec[0], virt_to_page(buf), len,
+				offset_in_page(buf));
+	}
+
+	bip->bip_vcnt = 1;
+	bip->bip_iter.bi_size = len;
+}
+
+void bio_integrity_free_buf(struct bio_integrity_payload *bip)
+{
+	struct bio_vec *bv = &bip->bip_vec[0];
+
+	if (bip->bip_flags & BIP_MEMPOOL)
+		mempool_free(bv->bv_page, &integrity_buf_pool);
+	else
+		kfree(bvec_virt(bv));
+}
+
 /**
  * bio_integrity_free - Free bio integrity payload
  * @bio:	bio containing bip to be freed
@@ -438,3 +476,12 @@ int bio_integrity_clone(struct bio *bio, struct bio *bio_src,
 
 	return 0;
 }
+
+static int __init bio_integrity_initfn(void)
+{
+	if (mempool_init_page_pool(&integrity_buf_pool, BIO_POOL_SIZE,
+			get_order(BLK_INTEGRITY_MAX_SIZE)))
+		panic("bio: can't create integrity buf pool\n");
+	return 0;
+}
+subsys_initcall(bio_integrity_initfn);
diff --git a/block/blk-settings.c b/block/blk-settings.c
index d74b13ec8e54..04e88615032a 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -194,6 +194,17 @@ static int blk_validate_integrity_limits(struct queue_limits *lim)
 					(1U << bi->interval_exp) - 1);
 	}
 
+	/*
+	 * The block layer automatically adds integrity data for bios that don't
+	 * already have it.  It allocates a single segment. Limit the I/O size
+	 * so that a single maximum size metadata segment can cover the
+	 * integrity data for the entire I/O.
+	 */
+	lim->max_sectors = min3(lim->max_sectors,
+		BLK_INTEGRITY_MAX_SIZE /
+			bi->pi_tuple_size * lim->logical_block_size,
+		lim->max_segment_size >> SECTOR_SHIFT);
+
 	return 0;
 }
 
diff --git a/include/linux/bio-integrity.h b/include/linux/bio-integrity.h
index 851254f36eb3..3d05296a5afe 100644
--- a/include/linux/bio-integrity.h
+++ b/include/linux/bio-integrity.h
@@ -14,6 +14,8 @@ enum bip_flags {
 	BIP_CHECK_REFTAG	= 1 << 6, /* reftag check */
 	BIP_CHECK_APPTAG	= 1 << 7, /* apptag check */
 	BIP_P2P_DMA		= 1 << 8, /* using P2P address */
+
+	BIP_MEMPOOL		= 1 << 15, /* buffer backed by mempool */
 };
 
 struct bio_integrity_payload {
@@ -140,4 +142,8 @@ static inline int bio_integrity_add_page(struct bio *bio, struct page *page,
 	return 0;
 }
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
+
+void bio_integrity_alloc_buf(struct bio *bio, bool zero_buffer);
+void bio_integrity_free_buf(struct bio_integrity_payload *bip);
+
 #endif /* _LINUX_BIO_INTEGRITY_H */
diff --git a/include/linux/blk-integrity.h b/include/linux/blk-integrity.h
index b659373788f6..c2030fd8ba0a 100644
--- a/include/linux/blk-integrity.h
+++ b/include/linux/blk-integrity.h
@@ -8,6 +8,11 @@
 
 struct request;
 
+/*
+ * Maximum contiguous integrity buffer allocation.
+ */
+#define BLK_INTEGRITY_MAX_SIZE		SZ_2M
+
 enum blk_integrity_flags {
 	BLK_INTEGRITY_NOVERIFY		= 1 << 0,
 	BLK_INTEGRITY_NOGENERATE	= 1 << 1,
-- 
2.47.3




* Re: [PATCH 1/3] slab, block: generalize bvec_alloc_gfp
  2025-10-23  8:08 ` [PATCH 1/3] slab, block: generalize bvec_alloc_gfp Christoph Hellwig
@ 2025-10-24  1:44   ` Martin K. Petersen
  2025-10-24  8:38   ` Vlastimil Babka
  2025-10-26 21:19   ` Matthew Wilcox
  2 siblings, 0 replies; 14+ messages in thread
From: Martin K. Petersen @ 2025-10-24  1:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Vlastimil Babka, Andrew Morton, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Martin K. Petersen,
	linux-block, linux-mm


Christoph,

> bvec_alloc_gfp is useful for any place that tries to kmalloc first and
> then fall back to a mempool. Rename it and move it to blk.h to prepare
> for using it to allocate the default integrity buffer.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

-- 
Martin K. Petersen



* Re: [PATCH 2/3] block: blocking mempool_alloc doesn't fail
  2025-10-23  8:08 ` [PATCH 2/3] block: blocking mempool_alloc doesn't fail Christoph Hellwig
@ 2025-10-24  1:45   ` Martin K. Petersen
  0 siblings, 0 replies; 14+ messages in thread
From: Martin K. Petersen @ 2025-10-24  1:45 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Vlastimil Babka, Andrew Morton, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Martin K. Petersen,
	linux-block, linux-mm


Christoph,

> So remove the error check for it in bio_integrity_prep.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

-- 
Martin K. Petersen



* Re: [PATCH 3/3] block: make bio auto-integrity deadlock safe
  2025-10-23  8:08 ` [PATCH 3/3] block: make bio auto-integrity deadlock safe Christoph Hellwig
@ 2025-10-24  1:47   ` Martin K. Petersen
  2025-10-27  6:03   ` Kanchan Joshi
  1 sibling, 0 replies; 14+ messages in thread
From: Martin K. Petersen @ 2025-10-24  1:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Vlastimil Babka, Andrew Morton, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Martin K. Petersen,
	linux-block, linux-mm


Christoph,

> Fix this by limiting the I/O size so that we can allocated at least a

allocate

> 2MiB integrity buffer, i.e. 128MiB for 8 byte PI and 512 byte integrity
> internals, and create a mempool as a last resort for this maximum size,

intervals

> mirroring the scheme used for bvecs.  As a nice upside none of this
> can fail now, so we remove the error handling and open code the
> trivial addition of the bip vec.

Typos aside, this looks good to me.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

-- 
Martin K. Petersen



* Re: [PATCH 1/3] slab, block: generalize bvec_alloc_gfp
  2025-10-23  8:08 ` [PATCH 1/3] slab, block: generalize bvec_alloc_gfp Christoph Hellwig
  2025-10-24  1:44   ` Martin K. Petersen
@ 2025-10-24  8:38   ` Vlastimil Babka
  2025-10-24  9:05     ` Christoph Hellwig
  2025-10-26 21:19   ` Matthew Wilcox
  2 siblings, 1 reply; 14+ messages in thread
From: Vlastimil Babka @ 2025-10-24  8:38 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin,
	Harry Yoo, Martin K. Petersen, linux-block, linux-mm,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan

On 10/23/25 10:08, Christoph Hellwig wrote:
> bvec_alloc_gfp is useful for any place that tries to kmalloc first and
> then fall back to a mempool.  Rename it and move it to blk.h to prepare

I wonder if such fall backs are necessary because IIRC mempools try to
allocate from the underlying provider (i.e. kmalloc caches first), and only
give out the reserves when that fails. Is it done for less overhead or
something?

> for using it to allocate the default integrity buffer.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

That says blk.h but you move it to slab.h? Assuming you intended slab.h.
However gfp flags are not slab only so it should be rather
include/linux/gfp.h - added maintainers of that to Cc.

We do have gfp_nested_mask() there which is quite similar but not exactly.
Maybe a canonical macro not for nested, but for opportunistic allocations
(if a one size fits all solution can be found) would be useful too, as
people indeed reinvent those manually in various places with subtle differences.

> ---
>  block/bio.c          | 13 ++-----------
>  include/linux/slab.h | 10 ++++++++++
>  2 files changed, 12 insertions(+), 11 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index b3a79285c278..4ea5833a7637 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -169,16 +169,6 @@ void bvec_free(mempool_t *pool, struct bio_vec *bv, unsigned short nr_vecs)
>  		kmem_cache_free(biovec_slab(nr_vecs)->slab, bv);
>  }
>  
> -/*
> - * Make the first allocation restricted and don't dump info on allocation
> - * failures, since we'll fall back to the mempool in case of failure.
> - */
> -static inline gfp_t bvec_alloc_gfp(gfp_t gfp)
> -{
> -	return (gfp & ~(__GFP_DIRECT_RECLAIM | __GFP_IO)) |
> -		__GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN;
> -}
> -
>  struct bio_vec *bvec_alloc(mempool_t *pool, unsigned short *nr_vecs,
>  		gfp_t gfp_mask)
>  {
> @@ -201,7 +191,8 @@ struct bio_vec *bvec_alloc(mempool_t *pool, unsigned short *nr_vecs,
>  	if (*nr_vecs < BIO_MAX_VECS) {
>  		struct bio_vec *bvl;
>  
> -		bvl = kmem_cache_alloc(bvs->slab, bvec_alloc_gfp(gfp_mask));
> +		bvl = kmem_cache_alloc(bvs->slab,
> +				try_alloc_gfp(gfp_mask & ~__GFP_IO));
>  		if (likely(bvl) || !(gfp_mask & __GFP_DIRECT_RECLAIM))
>  			return bvl;
>  		*nr_vecs = BIO_MAX_VECS;
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index d5a8ab98035c..a6672cead03e 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -1113,6 +1113,16 @@ void kfree_rcu_scheduler_running(void);
>   */
>  size_t kmalloc_size_roundup(size_t size);
>  
> +/*
> + * Make the first allocation restricted and don't dump info on allocation
> + * failures, for callers that will fall back to a mempool in case of failure.
> + */
> +static inline gfp_t try_alloc_gfp(gfp_t gfp)
> +{
> +	return (gfp & ~__GFP_DIRECT_RECLAIM) |
> +		__GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN;

Note without __GFP_DIRECT_RECLAIM the __GFP_NORETRY is most likely redundant
(but doesn't hurt).

> +}
> +
>  void __init kmem_cache_init_late(void);
>  void __init kvfree_rcu_init(void);
>  




* Re: [PATCH 1/3] slab, block: generalize bvec_alloc_gfp
  2025-10-24  8:38   ` Vlastimil Babka
@ 2025-10-24  9:05     ` Christoph Hellwig
  0 siblings, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2025-10-24  9:05 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Christoph Hellwig, Jens Axboe, Andrew Morton, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Martin K. Petersen,
	linux-block, linux-mm, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan

On Fri, Oct 24, 2025 at 10:38:20AM +0200, Vlastimil Babka wrote:
> On 10/23/25 10:08, Christoph Hellwig wrote:
> > bvec_alloc_gfp is useful for any place that tries to kmalloc first and
> > then fall back to a mempool.  Rename it and move it to blk.h to prepare
> 
> I wonder if such fall backs are necessary because IIRC mempools try to
> allocate from the underlying provider (i.e. kmalloc caches first), and only
> give out the reserves when that fails. Is it done for less overhead or
> something?

That's the mempool behavior, yes.  But the bvec allocator only has a
mempool for the largest possible allocation, while usually trying
smaller allocations instead.

> 
> > for using it to allocate the default integrity buffer.
> > 
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> 
> That says blk.h but you move it to slab.h? Assuming you intended slab.h.

Yes.  I initially had it in blk.h, but it felt more general.

> However gfp flags are not slab only so it should be rather
> include/linux/gfp.h - added maintainers of that to Cc.

Ok.

> We do have gfp_nested_mask() there which is quite similar but not exactly.
> Maybe a canonical macro not for nested, but for opportunistic allocations
> (if a one size fits all solution can be found) would be useful too, as
> people indeed reinvent those manually in various places with subtle differences.

That's exactly what I've been trying to avoid indeed.



* Re: [PATCH 1/3] slab, block: generalize bvec_alloc_gfp
  2025-10-23  8:08 ` [PATCH 1/3] slab, block: generalize bvec_alloc_gfp Christoph Hellwig
  2025-10-24  1:44   ` Martin K. Petersen
  2025-10-24  8:38   ` Vlastimil Babka
@ 2025-10-26 21:19   ` Matthew Wilcox
  2025-10-27  6:47     ` Christoph Hellwig
  2 siblings, 1 reply; 14+ messages in thread
From: Matthew Wilcox @ 2025-10-26 21:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Vlastimil Babka, Andrew Morton, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Martin K. Petersen,
	linux-block, linux-mm

On Thu, Oct 23, 2025 at 10:08:54AM +0200, Christoph Hellwig wrote:
> +/*
> + * Make the first allocation restricted and don't dump info on allocation
> + * failures, for callers that will fall back to a mempool in case of failure.
> + */
> +static inline gfp_t try_alloc_gfp(gfp_t gfp)
> +{
> +	return (gfp & ~__GFP_DIRECT_RECLAIM) |
> +		__GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN;
> +}

Comparing to the kvmalloc code:

        flags |= __GFP_NOWARN;

        if (!(flags & __GFP_RETRY_MAYFAIL))
                flags &= ~__GFP_DIRECT_RECLAIM;

        /* nofail semantic is implemented by the vmalloc fallback */
        flags &= ~__GFP_NOFAIL;

it's quite different.  I am by no stretch of the imagination a GFP
flags expert, but it seems to me that we should make the two the same
since they're both "try to allocate and we have a fallback if
necessary".  I suspect kvmalloc() is called with a wider range of
GFP flags than bvec allocation is, so it's probably better tested.

Is there a reason _not_ to use the kvmalloc code for bvec allocations?



* Re: [PATCH 3/3] block: make bio auto-integrity deadlock safe
  2025-10-23  8:08 ` [PATCH 3/3] block: make bio auto-integrity deadlock safe Christoph Hellwig
  2025-10-24  1:47   ` Martin K. Petersen
@ 2025-10-27  6:03   ` Kanchan Joshi
  1 sibling, 0 replies; 14+ messages in thread
From: Kanchan Joshi @ 2025-10-27  6:03 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Vlastimil Babka, Andrew Morton, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Martin K. Petersen, linux-block,
	linux-mm

On 10/23/2025 1:38 PM, Christoph Hellwig wrote:
> @@ -194,6 +194,17 @@ static int blk_validate_integrity_limits(struct queue_limits *lim)
>   					(1U << bi->interval_exp) - 1);
>   	}
>   
> +	/*
> +	 * The block layer automatically adds integrity data for bios that don't
> +	 * already have it.  It allocates a single segment. Limit the I/O size
> +	 * so that a single maximum size metadata segment can cover the
> +	 * integrity data for the entire I/O.
> +	 */
> +	lim->max_sectors = min3(lim->max_sectors,
> +		BLK_INTEGRITY_MAX_SIZE /
> +			bi->pi_tuple_size * lim->logical_block_size,
> +		lim->max_segment_size >> SECTOR_SHIFT);

Two issues:
- When the underlying device has PI type 0, pi_tuple_size will be 0 and
this will cause a divide-by-zero.
- The second value in the above min3() is in bytes, while the other two
are in sectors. So this clamping may not be happening correctly.
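
One way to address both might be (untested sketch):

	if (bi->pi_tuple_size)
		lim->max_sectors = min3(lim->max_sectors,
			(BLK_INTEGRITY_MAX_SIZE / bi->pi_tuple_size *
				lim->logical_block_size) >> SECTOR_SHIFT,
			lim->max_segment_size >> SECTOR_SHIFT);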






* Re: [PATCH 1/3] slab, block: generalize bvec_alloc_gfp
  2025-10-26 21:19   ` Matthew Wilcox
@ 2025-10-27  6:47     ` Christoph Hellwig
  2025-10-27 13:09       ` Matthew Wilcox
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2025-10-27  6:47 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Jens Axboe, Vlastimil Babka, Andrew Morton,
	Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo,
	Martin K. Petersen, linux-block, linux-mm

On Sun, Oct 26, 2025 at 09:19:27PM +0000, Matthew Wilcox wrote:
> it's quite different.  I am by no stretch of the imagination a GFP
> flags expert, but it seems to me that we should make the two the same
> since they're both "try to allocate and we have a fallback if
> necessary".  I suspect kvmalloc() is called with a wider range of
> GFP flags than bvec allocation is, so it's probably better tested.
> 
> Is there a reason _not_ to use the kvmalloc code for bvec allocations?

It's using a dedicated slab cache, which makes sense for such a frequent
and usually short-lived allocation.  We also don't use vmalloc backing
ever at the moment.



* Re: [PATCH 1/3] slab, block: generalize bvec_alloc_gfp
  2025-10-27  6:47     ` Christoph Hellwig
@ 2025-10-27 13:09       ` Matthew Wilcox
  2025-10-27 13:14         ` Christoph Hellwig
  0 siblings, 1 reply; 14+ messages in thread
From: Matthew Wilcox @ 2025-10-27 13:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Vlastimil Babka, Andrew Morton, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Martin K. Petersen,
	linux-block, linux-mm

On Mon, Oct 27, 2025 at 07:47:28AM +0100, Christoph Hellwig wrote:
> On Sun, Oct 26, 2025 at 09:19:27PM +0000, Matthew Wilcox wrote:
> > it's quite different.  I am by no stretch of the imagination a GFP
> > flags expert, but it seems to me that we should make the two the same
> > since they're both "try to allocate and we have a fallback if
> > necessary".  I suspect kvmalloc() is called with a wider range of
> > GFP flags than bvec allocation is, so it's probably better tested.
> > 
> > Is there a reason _not_ to use the kvmalloc code for bvec allocations?
> 
> It's using a dedicated slab cache, which makes sense for such a frequent
> and usually short-lived allocation.  We also don't use vmalloc backing
> ever at the moment.

That's not what I meant.

What I was proposing was:

+static inline gfp_t try_alloc_gfp(gfp_t gfp)
+{
+	gfp |= __GFP_NOWARN;
+	if (!(gfp & __GFP_RETRY_MAYFAIL))
+		gfp &= ~__GFP_DIRECT_RECLAIM;
+	gfp &= ~__GFP_NOFAIL;
+
+	return gfp;
+}



* Re: [PATCH 1/3] slab, block: generalize bvec_alloc_gfp
  2025-10-27 13:09       ` Matthew Wilcox
@ 2025-10-27 13:14         ` Christoph Hellwig
  0 siblings, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2025-10-27 13:14 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Jens Axboe, Vlastimil Babka, Andrew Morton,
	Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo,
	Martin K. Petersen, linux-block, linux-mm

On Mon, Oct 27, 2025 at 01:09:06PM +0000, Matthew Wilcox wrote:
> > > Is there a reason _not_ to use the kvmalloc code for bvec allocations?
> > 
> > It's using a dedicated slab cache, which makes sense for such a frequent
> > and usually short-lived allocation.  We also don't use vmalloc backing
> > ever at the moment.
> 
> That's not what I meant.

Oh, not the entire kvmalloc code, but the helper.  Ok.

> What I was proposing was:
> 
> +static inline gfp_t try_alloc_gfp(gfp_t gfp)
> +{
> +	gfp |= __GFP_NOWARN;
> +	if (!(gfp & __GFP_RETRY_MAYFAIL))
> +		gfp &= ~__GFP_DIRECT_RECLAIM;
> +	gfp &= ~__GFP_NOFAIL;


That's missing the __GFP_NOMEMALLOC from the original one, and also
__GFP_NORETRY.  So it'll be pretty different.

For now I think I'll just either duplicate the logic or keep it in
block code to get the deadlock fix in, and then we can spend more
time analyzing all the flags and documenting the ones needed in
various places.


