* [PATCHv5 6/9] scsi: use request to get integrity segments
From: Keith Busch @ 2024-09-13 18:28 UTC (permalink / raw)
To: axboe, hch, martin.petersen, linux-block, linux-nvme, linux-scsi
Cc: sagi, Keith Busch, Kanchan Joshi
In-Reply-To: <20240913182854.2445457-1-kbusch@meta.com>
From: Keith Busch <kbusch@kernel.org>
The request tracks the integrity segments already, so no need to recount
the segments again.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
drivers/scsi/scsi_lib.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 3958a6d14bf45..c602b0af745ca 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1175,8 +1175,7 @@ blk_status_t scsi_alloc_sgtables(struct scsi_cmnd *cmd)
goto out_free_sgtables;
}
- ivecs = blk_rq_count_integrity_sg(rq->q, rq->bio);
-
+ ivecs = rq->nr_integrity_segments;
if (sg_alloc_table_chained(&prot_sdb->table, ivecs,
prot_sdb->table.sgl,
SCSI_INLINE_PROT_SG_CNT)) {
--
2.43.5
^ permalink raw reply related
* [PATCHv5 2/9] blk-mq: set the nr_integrity_segments from bio
From: Keith Busch @ 2024-09-13 18:28 UTC (permalink / raw)
To: axboe, hch, martin.petersen, linux-block, linux-nvme, linux-scsi
Cc: sagi, Keith Busch
In-Reply-To: <20240913182854.2445457-1-kbusch@meta.com>
From: Keith Busch <kbusch@kernel.org>
This value is used for merging considerations, so it needs to be
accurate.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
block/blk-mq.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index ef3a2ed499563..82219f0e9a256 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2544,6 +2544,9 @@ static void blk_mq_bio_to_request(struct request *rq, struct bio *bio,
rq->__sector = bio->bi_iter.bi_sector;
rq->write_hint = bio->bi_write_hint;
blk_rq_bio_prep(rq, bio, nr_segs);
+ if (bio_integrity(bio))
+ rq->nr_integrity_segments = blk_rq_count_integrity_sg(rq->q,
+ bio);
/* This can't fail, since GFP_NOIO includes __GFP_DIRECT_RECLAIM. */
err = blk_crypto_rq_bio_prep(rq, bio, GFP_NOIO);
--
2.43.5
^ permalink raw reply related
* [PATCHv5 3/9] blk-integrity: properly account for segments
From: Keith Busch @ 2024-09-13 18:28 UTC (permalink / raw)
To: axboe, hch, martin.petersen, linux-block, linux-nvme, linux-scsi
Cc: sagi, Keith Busch
In-Reply-To: <20240913182854.2445457-1-kbusch@meta.com>
From: Keith Busch <kbusch@kernel.org>
Both types of merging when integrity data is used are miscounting the
segments:
Merging two requests wasn't accounting for the new segment count, so add
the "next" segment count to the first on a successful merge to ensure
this value is accurate.
Merging a bio into an existing request was double counting the bio's
segments, even if the merge failed later on. Move the segment accounting
to the end when the merge is successful.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
block/blk-integrity.c | 2 --
block/blk-merge.c | 4 ++++
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 010decc892eaa..afd101555d3cb 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -153,8 +153,6 @@ bool blk_integrity_merge_bio(struct request_queue *q, struct request *req,
q->limits.max_integrity_segments)
return false;
- req->nr_integrity_segments += nr_integrity_segs;
-
return true;
}
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 56769c4bcd799..ad763ec313b6a 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -639,6 +639,9 @@ static inline int ll_new_hw_segment(struct request *req, struct bio *bio,
* counters.
*/
req->nr_phys_segments += nr_phys_segs;
+ if (bio_integrity(bio))
+ req->nr_integrity_segments += blk_rq_count_integrity_sg(req->q,
+ bio);
return 1;
no_merge:
@@ -731,6 +734,7 @@ static int ll_merge_requests_fn(struct request_queue *q, struct request *req,
/* Merge is OK... */
req->nr_phys_segments = total_phys_segments;
+ req->nr_integrity_segments += next->nr_integrity_segments;
return 1;
}
--
2.43.5
^ permalink raw reply related
* [PATCHv5 9/9] blk-integrity: improved sg segment mapping
From: Keith Busch @ 2024-09-13 18:28 UTC (permalink / raw)
To: axboe, hch, martin.petersen, linux-block, linux-nvme, linux-scsi
Cc: sagi, Keith Busch
In-Reply-To: <20240913182854.2445457-1-kbusch@meta.com>
From: Keith Busch <kbusch@kernel.org>
Make the integrity mapping more like data mapping, blk_rq_map_sg. Use
the request to validate the segment count, and update the callers so
they don't have to.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
block/blk-integrity.c | 15 +++++++++++----
drivers/nvme/host/rdma.c | 4 ++--
drivers/scsi/scsi_lib.c | 11 +++--------
include/linux/blk-integrity.h | 6 ++----
4 files changed, 18 insertions(+), 18 deletions(-)
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 1d82b18e06f8e..549480aa2a069 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -62,19 +62,20 @@ int blk_rq_count_integrity_sg(struct request_queue *q, struct bio *bio)
*
* Description: Map the integrity vectors in request into a
* scatterlist. The scatterlist must be big enough to hold all
- * elements. I.e. sized using blk_rq_count_integrity_sg().
+ * elements. I.e. sized using blk_rq_count_integrity_sg() or
+ * rq->nr_integrity_segments.
*/
-int blk_rq_map_integrity_sg(struct request_queue *q, struct bio *bio,
- struct scatterlist *sglist)
+int blk_rq_map_integrity_sg(struct request *rq, struct scatterlist *sglist)
{
struct bio_vec iv, ivprv = { NULL };
+ struct request_queue *q = rq->q;
struct scatterlist *sg = NULL;
+ struct bio *bio = rq->bio;
unsigned int segments = 0;
struct bvec_iter iter;
int prev = 0;
bio_for_each_integrity_vec(iv, bio, iter) {
-
if (prev) {
if (!biovec_phys_mergeable(q, &ivprv, &iv))
goto new_segment;
@@ -102,6 +103,12 @@ int blk_rq_map_integrity_sg(struct request_queue *q, struct bio *bio,
if (sg)
sg_mark_end(sg);
+ /*
+ * Something must have been wrong if the figured number of segment
+ * is bigger than number of req's physical integrity segments
+ */
+ BUG_ON(segments > blk_rq_nr_phys_segments(rq));
+ BUG_ON(segments > queue_max_integrity_segments(q));
return segments;
}
EXPORT_SYMBOL(blk_rq_map_integrity_sg);
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 256466bdaee7c..c8fd0e8f02375 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1504,8 +1504,8 @@ static int nvme_rdma_dma_map_req(struct ib_device *ibdev, struct request *rq,
goto out_unmap_sg;
}
- req->metadata_sgl->nents = blk_rq_map_integrity_sg(rq->q,
- rq->bio, req->metadata_sgl->sg_table.sgl);
+ req->metadata_sgl->nents = blk_rq_map_integrity_sg(rq,
+ req->metadata_sgl->sg_table.sgl);
*pi_count = ib_dma_map_sg(ibdev,
req->metadata_sgl->sg_table.sgl,
req->metadata_sgl->nents,
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index c602b0af745ca..c2f6d0e1c03e7 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1163,7 +1163,6 @@ blk_status_t scsi_alloc_sgtables(struct scsi_cmnd *cmd)
if (blk_integrity_rq(rq)) {
struct scsi_data_buffer *prot_sdb = cmd->prot_sdb;
- int ivecs;
if (WARN_ON_ONCE(!prot_sdb)) {
/*
@@ -1175,19 +1174,15 @@ blk_status_t scsi_alloc_sgtables(struct scsi_cmnd *cmd)
goto out_free_sgtables;
}
- ivecs = rq->nr_integrity_segments;
- if (sg_alloc_table_chained(&prot_sdb->table, ivecs,
+ if (sg_alloc_table_chained(&prot_sdb->table,
+ rq->nr_integrity_segments,
prot_sdb->table.sgl,
SCSI_INLINE_PROT_SG_CNT)) {
ret = BLK_STS_RESOURCE;
goto out_free_sgtables;
}
- count = blk_rq_map_integrity_sg(rq->q, rq->bio,
- prot_sdb->table.sgl);
- BUG_ON(count > ivecs);
- BUG_ON(count > queue_max_integrity_segments(rq->q));
-
+ count = blk_rq_map_integrity_sg(rq, prot_sdb->table.sgl);
cmd->prot_sdb = prot_sdb;
cmd->prot_sdb->table.nents = count;
}
diff --git a/include/linux/blk-integrity.h b/include/linux/blk-integrity.h
index 793dbb1e0672d..676f8f860c474 100644
--- a/include/linux/blk-integrity.h
+++ b/include/linux/blk-integrity.h
@@ -25,8 +25,7 @@ static inline bool queue_limits_stack_integrity_bdev(struct queue_limits *t,
}
#ifdef CONFIG_BLK_DEV_INTEGRITY
-int blk_rq_map_integrity_sg(struct request_queue *, struct bio *,
- struct scatterlist *);
+int blk_rq_map_integrity_sg(struct request *, struct scatterlist *);
int blk_rq_count_integrity_sg(struct request_queue *, struct bio *);
int blk_rq_integrity_map_user(struct request *rq, void __user *ubuf,
ssize_t bytes, u32 seed);
@@ -98,8 +97,7 @@ static inline int blk_rq_count_integrity_sg(struct request_queue *q,
{
return 0;
}
-static inline int blk_rq_map_integrity_sg(struct request_queue *q,
- struct bio *b,
+static inline int blk_rq_map_integrity_sg(struct request *q,
struct scatterlist *s)
{
return 0;
--
2.43.5
^ permalink raw reply related
* [PATCHv5 4/9] blk-integrity: consider entire bio list for merging
From: Keith Busch @ 2024-09-13 18:28 UTC (permalink / raw)
To: axboe, hch, martin.petersen, linux-block, linux-nvme, linux-scsi
Cc: sagi, Keith Busch
In-Reply-To: <20240913182854.2445457-1-kbusch@meta.com>
From: Keith Busch <kbusch@kernel.org>
If a bio is merged to a request, the entire bio list is merged, so don't
temporarily detach it from its list when counting segments. In most
cases, bi_next will already be NULL, so detaching is usually a no-op.
But if the bio does have a list, the current code is miscounting the
segments for the resulting merge.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
block/blk-integrity.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index afd101555d3cb..84065691aaed0 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -134,7 +134,6 @@ bool blk_integrity_merge_bio(struct request_queue *q, struct request *req,
struct bio *bio)
{
int nr_integrity_segs;
- struct bio *next = bio->bi_next;
if (blk_integrity_rq(req) == 0 && bio_integrity(bio) == NULL)
return true;
@@ -145,10 +144,7 @@ bool blk_integrity_merge_bio(struct request_queue *q, struct request *req,
if (bio_integrity(req->bio)->bip_flags != bio_integrity(bio)->bip_flags)
return false;
- bio->bi_next = NULL;
nr_integrity_segs = blk_rq_count_integrity_sg(q, bio);
- bio->bi_next = next;
-
if (req->nr_integrity_segments + nr_integrity_segs >
q->limits.max_integrity_segments)
return false;
--
2.43.5
^ permalink raw reply related
* [PATCHv5 0/9] block integrity merging and counting
From: Keith Busch @ 2024-09-13 18:28 UTC (permalink / raw)
To: axboe, hch, martin.petersen, linux-block, linux-nvme, linux-scsi
Cc: sagi, Keith Busch
From: Keith Busch <kbusch@kernel.org>
Some fixes and cleanups to counting integrity segments when metadata is
used. This addresses merging issues when integrity data is present.
Changes from v3:
Dropped the trivial nr_integerity_segments helper
Fixed sg mapping sanity check
Added reviews
Fixed commit log typos
Keith Busch (9):
blk-mq: unconditional nr_integrity_segments
blk-mq: set the nr_integrity_segments from bio
blk-integrity: properly account for segments
blk-integrity: consider entire bio list for merging
block: provide a request helper for user integrity segments
scsi: use request to get integrity segments
nvme-rdma: use request to get integrity segments
block: unexport blk_rq_count_integrity_sg
blk-integrity: improved sg segment mapping
block/bio-integrity.c | 1 -
block/blk-integrity.c | 36 ++++++++++++++++++++++++-----------
block/blk-merge.c | 4 ++++
block/blk-mq.c | 5 +++--
drivers/nvme/host/ioctl.c | 6 ++----
drivers/nvme/host/rdma.c | 6 +++---
drivers/scsi/scsi_lib.c | 12 +++---------
include/linux/blk-integrity.h | 15 +++++++++++----
include/linux/blk-mq.h | 3 ---
9 files changed, 51 insertions(+), 37 deletions(-)
--
2.43.5
^ permalink raw reply
* [PATCHv5 8/9] block: unexport blk_rq_count_integrity_sg
From: Keith Busch @ 2024-09-13 18:28 UTC (permalink / raw)
To: axboe, hch, martin.petersen, linux-block, linux-nvme, linux-scsi
Cc: sagi, Keith Busch
In-Reply-To: <20240913182854.2445457-1-kbusch@meta.com>
From: Keith Busch <kbusch@kernel.org>
There are no external users of this.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
block/blk-integrity.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index ddc742d58330b..1d82b18e06f8e 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -53,7 +53,6 @@ int blk_rq_count_integrity_sg(struct request_queue *q, struct bio *bio)
return segments;
}
-EXPORT_SYMBOL(blk_rq_count_integrity_sg);
/**
* blk_rq_map_integrity_sg - Map integrity metadata into a scatterlist
--
2.43.5
^ permalink raw reply related
* [PATCHv5 1/9] blk-mq: unconditional nr_integrity_segments
From: Keith Busch @ 2024-09-13 18:28 UTC (permalink / raw)
To: axboe, hch, martin.petersen, linux-block, linux-nvme, linux-scsi
Cc: sagi, Keith Busch
In-Reply-To: <20240913182854.2445457-1-kbusch@meta.com>
From: Keith Busch <kbusch@kernel.org>
Always defining the field will make using it easier and less error prone
in future patches.
There shouldn't be any downside to this: the field fits in what would
otherwise be a 2-byte hole, so we're not saving space by conditionally
leaving it out.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
block/blk-mq.c | 2 --
include/linux/blk-mq.h | 3 ---
2 files changed, 5 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3f1f7d0b3ff35..ef3a2ed499563 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -376,9 +376,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
rq->io_start_time_ns = 0;
rq->stats_sectors = 0;
rq->nr_phys_segments = 0;
-#if defined(CONFIG_BLK_DEV_INTEGRITY)
rq->nr_integrity_segments = 0;
-#endif
rq->end_io = NULL;
rq->end_io_data = NULL;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 8d304b1d16b15..4fecf46ef681b 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -149,10 +149,7 @@ struct request {
* physical address coalescing is performed.
*/
unsigned short nr_phys_segments;
-
-#ifdef CONFIG_BLK_DEV_INTEGRITY
unsigned short nr_integrity_segments;
-#endif
#ifdef CONFIG_BLK_INLINE_ENCRYPTION
struct bio_crypt_ctx *crypt_ctx;
--
2.43.5
^ permalink raw reply related
* [PATCHv5 5/9] block: provide a request helper for user integrity segments
From: Keith Busch @ 2024-09-13 18:28 UTC (permalink / raw)
To: axboe, hch, martin.petersen, linux-block, linux-nvme, linux-scsi
Cc: sagi, Keith Busch, Kanchan Joshi
In-Reply-To: <20240913182854.2445457-1-kbusch@meta.com>
From: Keith Busch <kbusch@kernel.org>
Provide a helper to keep the request flags and nr_integrity_segments in
sync with the bio's integrity payload. This is an integrity equivalent
to the normal data helper function, 'blk_rq_map_user()'.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
block/bio-integrity.c | 1 -
block/blk-integrity.c | 14 ++++++++++++++
drivers/nvme/host/ioctl.c | 6 ++----
include/linux/blk-integrity.h | 9 +++++++++
4 files changed, 25 insertions(+), 5 deletions(-)
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 8d1fb38f745f9..357a022eed411 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -371,7 +371,6 @@ int bio_integrity_map_user(struct bio *bio, void __user *ubuf, ssize_t bytes,
kfree(bvec);
return ret;
}
-EXPORT_SYMBOL_GPL(bio_integrity_map_user);
/**
* bio_integrity_prep - Prepare bio for integrity I/O
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 84065691aaed0..ddc742d58330b 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -107,6 +107,20 @@ int blk_rq_map_integrity_sg(struct request_queue *q, struct bio *bio,
}
EXPORT_SYMBOL(blk_rq_map_integrity_sg);
+int blk_rq_integrity_map_user(struct request *rq, void __user *ubuf,
+ ssize_t bytes, u32 seed)
+{
+ int ret = bio_integrity_map_user(rq->bio, ubuf, bytes, seed);
+
+ if (ret)
+ return ret;
+
+ rq->nr_integrity_segments = blk_rq_count_integrity_sg(rq->q, rq->bio);
+ rq->cmd_flags |= REQ_INTEGRITY;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(blk_rq_integrity_map_user);
+
bool blk_integrity_merge_rq(struct request_queue *q, struct request *req,
struct request *next)
{
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 1d769c842fbf5..b9b79ccfabf8a 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -3,7 +3,6 @@
* Copyright (c) 2011-2014, Intel Corporation.
* Copyright (c) 2017-2021 Christoph Hellwig.
*/
-#include <linux/bio-integrity.h>
#include <linux/blk-integrity.h>
#include <linux/ptrace.h> /* for force_successful_syscall_return */
#include <linux/nvme_ioctl.h>
@@ -153,11 +152,10 @@ static int nvme_map_user_request(struct request *req, u64 ubuffer,
bio_set_dev(bio, bdev);
if (has_metadata) {
- ret = bio_integrity_map_user(bio, meta_buffer, meta_len,
- meta_seed);
+ ret = blk_rq_integrity_map_user(req, meta_buffer, meta_len,
+ meta_seed);
if (ret)
goto out_unmap;
- req->cmd_flags |= REQ_INTEGRITY;
}
return ret;
diff --git a/include/linux/blk-integrity.h b/include/linux/blk-integrity.h
index de98049b7ded9..793dbb1e0672d 100644
--- a/include/linux/blk-integrity.h
+++ b/include/linux/blk-integrity.h
@@ -28,6 +28,8 @@ static inline bool queue_limits_stack_integrity_bdev(struct queue_limits *t,
int blk_rq_map_integrity_sg(struct request_queue *, struct bio *,
struct scatterlist *);
int blk_rq_count_integrity_sg(struct request_queue *, struct bio *);
+int blk_rq_integrity_map_user(struct request *rq, void __user *ubuf,
+ ssize_t bytes, u32 seed);
static inline bool
blk_integrity_queue_supports_integrity(struct request_queue *q)
@@ -102,6 +104,13 @@ static inline int blk_rq_map_integrity_sg(struct request_queue *q,
{
return 0;
}
+static inline int blk_rq_integrity_map_user(struct request *rq,
+ void __user *ubuf,
+ ssize_t bytes,
+ u32 seed)
+{
+ return -EINVAL;
+}
static inline struct blk_integrity *bdev_get_integrity(struct block_device *b)
{
return NULL;
--
2.43.5
^ permalink raw reply related
* [PATCHv5 7/9] nvme-rdma: use request to get integrity segments
From: Keith Busch @ 2024-09-13 18:28 UTC (permalink / raw)
To: axboe, hch, martin.petersen, linux-block, linux-nvme, linux-scsi
Cc: sagi, Keith Busch, Kanchan Joshi
In-Reply-To: <20240913182854.2445457-1-kbusch@meta.com>
From: Keith Busch <kbusch@kernel.org>
The request tracks the integrity segments already, so no need to recount
the segments again.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
drivers/nvme/host/rdma.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 15b5e06039a5f..256466bdaee7c 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1496,7 +1496,7 @@ static int nvme_rdma_dma_map_req(struct ib_device *ibdev, struct request *rq,
req->metadata_sgl->sg_table.sgl =
(struct scatterlist *)(req->metadata_sgl + 1);
ret = sg_alloc_table_chained(&req->metadata_sgl->sg_table,
- blk_rq_count_integrity_sg(rq->q, rq->bio),
+ rq->nr_integrity_segments,
req->metadata_sgl->sg_table.sgl,
NVME_INLINE_METADATA_SG_CNT);
if (unlikely(ret)) {
--
2.43.5
^ permalink raw reply related
* [GIT PULL] io_uring async discard support
From: Jens Axboe @ 2024-09-13 17:02 UTC (permalink / raw)
To: Linus Torvalds; +Cc: io-uring, linux-block@vger.kernel.org
Hi Linus,
As mentioned, and sitting on top of both the 6.12 block and io_uring
core branches, here's support for async discard through io_uring. This
allows applications to issue async discards, rather than rely on the
blocking sync ioctl discards we already have. The sync support is
difficult to use outside of idle/cleanup periods. On a real (but slow)
device, testing shows the following results when compared to sync
discard:
qd64 sync discard: 21K IOPS, lat avg 3 msec (max 21 msec)
qd64 async discard: 76K IOPS, lat avg 845 usec (max 2.2 msec)
qd64 sync discard: 14K IOPS, lat avg 5 msec (max 25 msec)
qd64 async discard: 56K IOPS, lat avg 1153 usec (max 3.6 msec)
and synthetic null_blk testing with the same queue depth and block size
settings as above shows:
Type Trim size IOPS Lat avg (usec) Lat Max (usec)
==============================================================
sync 4k 144K 444 20314
async 4k 1353K 47 595
sync 1M 56K 1136 21031
async 1M 94K 680 760
Please pull!
The following changes since commit 84eacf177faa605853c58e5b1c0d9544b88c16fd:
io_uring/io-wq: inherit cpuset of cgroup in io worker (2024-09-11 07:27:56 -0600)
are available in the Git repository at:
git://git.kernel.dk/linux.git tags/for-6.12/io_uring-discard-20240913
for you to fetch changes up to 50c52250e2d74b098465841163c18f4b4e9ad430:
block: implement async io_uring discard cmd (2024-09-11 10:45:28 -0600)
----------------------------------------------------------------
for-6.12/io_uring-discard-20240913
----------------------------------------------------------------
Jens Axboe (2):
Merge branch 'for-6.12/block' into for-6.12/io_uring-discard
Merge branch 'for-6.12/io_uring' into for-6.12/io_uring-discard
Pavel Begunkov (5):
io_uring/cmd: expose iowq to cmds
io_uring/cmd: give inline space in request to cmds
filemap: introduce filemap_invalidate_pages
block: introduce blk_validate_byte_range()
block: implement async io_uring discard cmd
block/blk.h | 1 +
block/fops.c | 2 +
block/ioctl.c | 163 +++++++++++++++++++++++++++++++----
include/linux/io_uring/cmd.h | 15 ++++
include/linux/pagemap.h | 2 +
include/uapi/linux/blkdev.h | 14 +++
io_uring/io_uring.c | 11 +++
io_uring/io_uring.h | 1 +
io_uring/uring_cmd.c | 7 ++
mm/filemap.c | 17 ++--
10 files changed, 209 insertions(+), 24 deletions(-)
--
Jens Axboe
^ permalink raw reply
* [GIT PULL] Block updates for 6.12-rc
From: Jens Axboe @ 2024-09-13 17:02 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-block@vger.kernel.org
Hi Linus,
Here are the block changes queued up for the 6.12 merge window. Pretty
quiet round, mostly driver side fixes and cleanups. This pull request
contains:
- MD changes via Song:
- md-bitmap refactoring (Yu Kuai)
- raid5 performance optimization (Artur Paszkiewicz)
- Other small fixes (Yu Kuai, Chen Ni)
- Add a sysfs entry 'new_level' (Xiao Ni)
- Improve information reported in /proc/mdstat (Mateusz Kusiak)
- NVMe changes via Keith:
- Asynchronous namespace scanning (Stuart)
- TCP TLS updates (Hannes)
- RDMA queue controller validation (Niklas)
- Align field names to the spec (Anuj)
- Metadata support validation (Puranjay)
- A syntax cleanup (Shen)
- Fix a Kconfig linking error (Arnd)
- New queue-depth quirk (Keith)
- Add missing unplug trace event (Keith)
- blk-iocost fixes (Colin, Konstantin)
- t10-pi modular removal and fixes (Alexey)
- Fix for potential BLKSECDISCARD overflow (Alexey)
- bio splitting cleanups and fixes (Christoph)
- Deal with folios rather than rather than pages, speeding up how the
block layer handles bigger IOs (Kundan)
- Use spinlocks rather than bit spinlocks in zram (Sebastian, Mike)
- Reduce zoned device overhead in ublk (Ming)
- Add and use sendpages_ok() for drbd and nvme-tcp (Ofir)
- Fix regression in partition error pointer checking (Riyan)
- Add support for write zeroes and rotational status in nbd (Wouter)
- Add Yu Kuai as new BFQ maintainer. The scheduler has been unmaintained
for quite a while.
- Various sets of fixes for BFQ (Yu Kuai)
- Misc fixes and cleanups (Alvaro, Christophe, Li, Md Haris, Mikhail,
Yang)
Please pull!
The following changes since commit 8400291e289ee6b2bf9779ff1c83a291501f017b:
Linux 6.11-rc1 (2024-07-28 14:19:55 -0700)
are available in the Git repository at:
git://git.kernel.dk/linux.git tags/for-6.12/block-20240913
for you to fetch changes up to d4d7c03f7ee1d7f16b7b6e885b1e00968f72b93c:
Merge tag 'nvme-6.12-2024-09-13' of git://git.infradead.org/nvme into for-6.12/block (2024-09-13 08:39:09 -0600)
----------------------------------------------------------------
for-6.12/block-20240913
----------------------------------------------------------------
Alexey Dobriyan (3):
block: delete module stuff from t10-pi
block: constify ext_pi_ref_escape()
block: fix integer overflow in BLKSECDISCARD
Alvaro Parker (1):
block: fix comment to use set_current_state
Anuj Gupta (1):
nvme: rename apptag and appmask to lbat and lbatm
Arnd Bergmann (1):
nvme-tcp: fix link failure for TCP auth
Artur Paszkiewicz (3):
md/raid5: use wait_on_bit() for R5_Overlap
md/raid5: only add to wq if reshape is in progress
md/raid5: rename wait_for_overlap to wait_for_reshape
Chen Ni (1):
md: convert comma to semicolon
Christoph Hellwig (4):
block: rework bio splitting
block: constify the lim argument to queue_limits_max_zone_append_sectors
block: properly handle REQ_OP_ZONE_APPEND in __bio_split_to_limits
block: don't use bio_split_rw on misc operations
Christophe JAILLET (1):
drbd: Remove an unused field in struct drbd_device
Colin Ian King (1):
blk_iocost: make read-only static array vrate_adj_pct const
Hannes Reinecke (9):
nvme-keyring: restrict match length for version '1' identifiers
nvme-tcp: sanitize TLS key handling
nvme-tcp: check for invalidated or revoked key
nvme: add a newline to the 'tls_key' sysfs attribute
nvme: split off TLS sysfs attributes into a separate group
nvme-sysfs: add 'tls_configured_key' sysfs attribute
nvme-sysfs: add 'tls_keyring' attribute
nvmet-auth: allow to clear DH-HMAC-CHAP keys
nvme-target: do not check authentication status for admin commands twice
Jens Axboe (6):
Merge tag 'md-6.12-20240829' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.12/block
MAINTAINERS: move the BFQ io scheduler to orphan state
Merge tag 'md-6.12-20240905' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.12/block
Merge tag 'nvme-6.12-2024-09-06' of git://git.infradead.org/nvme into for-6.12/block
Merge tag 'md-6.12-20240906' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.12/block
Merge tag 'nvme-6.12-2024-09-13' of git://git.infradead.org/nvme into for-6.12/block
Keith Busch (2):
blk-mq: add missing unplug trace event
nvme-pci: qdepth 1 quirk
Konstantin Ovsepian (1):
blk_iocost: fix more out of bound shifts
Kundan Kumar (4):
block: Added folio-ized version of bio_add_hw_page()
block: introduce folio awareness and add a bigger size from folio
mm: release number of pages of a folio
block: unpin user pages belonging to a folio at once
Li Zetao (1):
mtip32xx: Remove redundant null pointer checks in mtip_hw_debugfs_init()
Mateusz Kusiak (1):
md: Report failed arrays as broken in mdstat
Md Haris Iqbal (1):
block/rnbd-srv: Add sanity check and remove redundant assignment
Mike Galbraith (1):
zram: Replace bit spinlocks with a spinlock_t.
Mikhail Lobanov (1):
drbd: Add NULL check for net_conf to prevent dereference in state validation
Ming Lei (2):
ublk: move zone report data out of request pdu
nbd: fix race between timeout and normal completion
Niklas Cassel (1):
nvme-rdma: send cntlid in the RDMA_CM_REQUEST Private Data
Ofir Gal (3):
net: introduce helper sendpages_ok()
nvme-tcp: use sendpages_ok() instead of sendpage_ok()
drbd: use sendpages_ok() instead of sendpage_ok()
Puranjay Mohan (1):
nvme: fix metadata handling in nvme-passthrough
Riyan Dhiman (1):
block: fix potential invalid pointer dereference in blk_add_partition
Sebastian Andrzej Siewior (2):
zram: Remove ZRAM_LOCK
zram: Shrink zram_table_entry::flags.
Shen Lichuan (1):
nvme: Convert comma to semicolon
Song Liu (2):
Merge branch 'md-6.12-bitmap' into md-6.12
Merge branch 'md-6.12-raid5-opt' into md-6.12
Stuart Hayes (1):
nvme_core: scan namespaces asynchronously
Wouter Verhelst (4):
nbd: add support for rotational devices
nbd: implement the WRITE_ZEROES command
nbd: nbd_bg_flags_show: add NBD_FLAG_ROTATIONAL
nbd: correct the maximum value for discard sectors
Xiao Ni (1):
md: Add new_level sysfs interface
Yang Ruibin (1):
pktcdvd: remove unnecessary debugfs_create_dir() error check
Yu Kuai (62):
blk-cgroup: check for pd_(alloc|free)_fn in blkcg_activate_policy()
blk-ioprio: remove ioprio_blkcg_from_bio()
blk-ioprio: remove per-disk structure
md: Don't flush sync_work in md_write_start()
md/raid1: Clean up local variable 'b' from raid1_read_request()
md/raid1: use md_bitmap_wait_behind_writes() in raid1_read_request()
md/md-bitmap: replace md_bitmap_status() with a new helper md_bitmap_get_stats()
md: use new helper md_bitmap_get_stats() in update_array_info()
md/md-bitmap: add 'events_cleared' into struct md_bitmap_stats
md/md-cluster: fix spares warnings for __le64
md/md-bitmap: add 'sync_size' into struct md_bitmap_stats
md/md-bitmap: add 'file_pages' into struct md_bitmap_stats
md/md-bitmap: add 'behind_writes' and 'behind_wait' into struct md_bitmap_stats
md/md-cluster: use helper md_bitmap_get_stats() to get pages in resize_bitmaps()
md/md-bitmap: add a new helper md_bitmap_set_pages()
md/md-bitmap: introduce struct bitmap_operations
md/md-bitmap: simplify md_bitmap_create() + md_bitmap_load()
md/md-bitmap: merge md_bitmap_create() into bitmap_operations
md/md-bitmap: merge md_bitmap_load() into bitmap_operations
md/md-bitmap: merge md_bitmap_destroy() into bitmap_operations
md/md-bitmap: merge md_bitmap_flush() into bitmap_operations
md/md-bitmap: make md_bitmap_print_sb() internal
md/md-bitmap: merge md_bitmap_update_sb() into bitmap_operations
md/md-bitmap: merge md_bitmap_status() into bitmap_operations
md/md-bitmap: remove md_bitmap_setallbits()
md/md-bitmap: merge bitmap_write_all() into bitmap_operations
md/md-bitmap: merge md_bitmap_dirty_bits() into bitmap_operations
md/md-bitmap: merge md_bitmap_startwrite() into bitmap_operations
md/md-bitmap: merge md_bitmap_endwrite() into bitmap_operations
md/md-bitmap: merge md_bitmap_start_sync() into bitmap_operations
md/md-bitmap: remove the parameter 'aborted' for md_bitmap_end_sync()
md/md-bitmap: merge md_bitmap_end_sync() into bitmap_operations
md/md-bitmap: merge md_bitmap_close_sync() into bitmap_operations
md/md-bitmap: merge md_bitmap_cond_end_sync() into bitmap_operations
md/md-bitmap: merge md_bitmap_sync_with_cluster() into bitmap_operations
md/md-bitmap: merge md_bitmap_unplug_async() into md_bitmap_unplug()
md/md-bitmap: merge bitmap_unplug() into bitmap_operations
md/md-bitmap: merge md_bitmap_daemon_work() into bitmap_operations
md/md-bitmap: pass in mddev directly for md_bitmap_resize()
md/md-bitmap: merge md_bitmap_resize() into bitmap_operations
md/md-bitmap: merge get_bitmap_from_slot() into bitmap_operations
md/md-bitmap: merge md_bitmap_copy_from_slot() into struct bitmap_operation.
md/md-bitmap: merge md_bitmap_set_pages() into struct bitmap_operations
md/md-bitmap: merge md_bitmap_free() into bitmap_operations
md/md-bitmap: merge md_bitmap_wait_behind_writes() into bitmap_operations
md/md-bitmap: merge md_bitmap_enabled() into bitmap_operations
md/md-bitmap: make in memory structure internal
md: Remove flush handling
block, bfq: fix possible UAF for bfqq->bic with merge chain
block, bfq: choose the last bfqq from merge chain in bfq_setup_cooperator()
block, bfq: don't break merge chain in bfq_split_bfqq()
block, bfq: use bfq_reassign_last_bfqq() in bfq_bfqq_move()
MAINTAINERS: Move the BFQ io scheduler to Odd Fixes state
blk-throttle: remove last_low_overflow_time
blk-throttle: support prioritized processing of metadata
block, bfq: fix uaf for accessing waker_bfqq after splitting
block, bfq: fix procress reference leakage for bfqq in merge chain
block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator()
block, bfq: remove bfq_log_bfqg()
block, bfq: remove local variable 'split' in bfq_init_rq()
block, bfq: remove local variable 'bfqq_already_existing' in bfq_init_rq()
block, bfq: factor out a helper to split bfqq in bfq_init_rq()
Yue Haibing (1):
blk-cgroup: Remove unused declaration blkg_path()
YueHaibing (1):
drbd: Remove unused extern declarations
MAINTAINERS | 5 +-
block/bfq-cgroup.c | 8 +-
block/bfq-iosched.c | 206 +++++++-------
block/bfq-iosched.h | 8 +-
block/bio.c | 112 ++++++--
block/blk-cgroup.c | 23 +-
block/blk-cgroup.h | 1 -
block/blk-iocost.c | 10 +-
block/blk-ioprio.c | 57 +---
block/blk-ioprio.h | 9 -
block/blk-merge.c | 162 +++++------
block/blk-mq.c | 14 +-
block/blk-rq-qos.c | 2 +-
block/blk-throttle.c | 69 +++--
block/blk-throttle.h | 2 -
block/blk.h | 74 +++--
block/ioctl.c | 9 +-
block/partitions/core.c | 8 +-
block/t10-pi.c | 8 +-
drivers/block/drbd/drbd_int.h | 11 -
drivers/block/drbd/drbd_main.c | 2 +-
drivers/block/drbd/drbd_state.c | 2 +-
drivers/block/mtip32xx/mtip32xx.c | 19 +-
drivers/block/nbd.c | 28 +-
drivers/block/pktcdvd.c | 2 -
drivers/block/rnbd/rnbd-srv.c | 11 +-
drivers/block/ublk_drv.c | 62 +++--
drivers/block/zram/zram_drv.c | 16 +-
drivers/block/zram/zram_drv.h | 7 +-
drivers/md/dm-raid.c | 7 +-
drivers/md/md-bitmap.c | 568 +++++++++++++++++++++++++++++---------
drivers/md/md-bitmap.h | 268 ++++--------------
drivers/md/md-cluster.c | 91 +++---
drivers/md/md.c | 332 ++++++++++------------
drivers/md/md.h | 13 +-
drivers/md/raid1-10.c | 9 +-
drivers/md/raid1.c | 99 +++----
drivers/md/raid10.c | 75 ++---
drivers/md/raid5-cache.c | 14 +-
drivers/md/raid5.c | 157 ++++++-----
drivers/md/raid5.h | 2 +-
drivers/nvme/common/keyring.c | 58 +++-
drivers/nvme/host/Kconfig | 3 +-
drivers/nvme/host/core.c | 47 +++-
drivers/nvme/host/fabrics.c | 2 +-
drivers/nvme/host/ioctl.c | 26 +-
drivers/nvme/host/nvme.h | 7 +-
drivers/nvme/host/pci.c | 18 +-
drivers/nvme/host/rdma.c | 6 +-
drivers/nvme/host/sysfs.c | 90 ++++--
drivers/nvme/host/tcp.c | 57 +++-
drivers/nvme/target/admin-cmd.c | 2 -
drivers/nvme/target/auth.c | 12 +
drivers/nvme/target/rdma.c | 4 +-
fs/btrfs/bio.c | 30 +-
include/linux/bio.h | 4 +-
include/linux/blkdev.h | 3 +-
include/linux/mm.h | 1 +
include/linux/net.h | 19 ++
include/linux/nvme-keyring.h | 6 +-
include/linux/nvme-rdma.h | 6 +-
include/linux/nvme.h | 8 +-
include/uapi/linux/nbd.h | 8 +-
mm/gup.c | 13 +
64 files changed, 1711 insertions(+), 1301 deletions(-)
--
Jens Axboe
^ permalink raw reply
* Re: [PATCH 1/6] blk-mq: introduce blk_mq_hctx_map_queues
From: Bjorn Helgaas @ 2024-09-13 16:26 UTC (permalink / raw)
To: Daniel Wagner
Cc: Jens Axboe, Bjorn Helgaas, Michael S. Tsirkin, Jason Wang,
Martin K. Petersen, Keith Busch, Christoph Hellwig, Sagi Grimberg,
linux-block, linux-kernel, linux-pci, virtualization, linux-scsi,
megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl,
storagedev, linux-nvme, Daniel Wagner,
20240912-do-not-overwrite-pci-mapping-v1-1-85724b6cec49, Ming Lei
In-Reply-To: <20240913-refactor-blk-affinity-helpers-v1-1-8e058f77af12@suse.de>
On Fri, Sep 13, 2024 at 09:41:59AM +0200, Daniel Wagner wrote:
> From: Ming Lei <ming.lei@redhat.com>
>
> blk_mq_pci_map_queues and blk_mq_virtio_map_queues will create a CPU to
> hardware queue mapping based on affinity information. These two
> function share code which only differs on how the affinity information
> is retrieved. Also there is the hisi_sas which open codes the same loop.
>
> Thus introduce a new helper function for creating these mappings which
> takes an callback function for fetching the affinity mask. Also
> introduce common helper function for PCI and virtio devices to retrieve
> affinity masks.
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e3a49f66982d..84f9c16b813b 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -6370,6 +6370,26 @@ int pci_set_vga_state(struct pci_dev *dev, bool decode,
> return 0;
> }
>
> +#ifdef CONFIG_BLK_MQ_PCI
> +/**
> + * pci_get_blk_mq_affinity - get affinity mask queue mapping for PCI device
> + * @dev_data: Pointer to struct pci_dev.
> + * @offset: Offset to use for the pci irq vector
> + * @queue: Queue index
> + *
> + * This function returns for a queue the affinity mask for a PCI device.
> + * It is usually used as callback for blk_mq_hctx_map_queues().
> + */
> +const struct cpumask *pci_get_blk_mq_affinity(void *dev_data, int offset,
> + int queue)
> +{
> + struct pci_dev *pdev = dev_data;
> +
> + return pci_irq_get_affinity(pdev, offset + queue);
> +}
> +EXPORT_SYMBOL_GPL(pci_get_blk_mq_affinity);
> +#endif
IMO this doesn't really fit well in drivers/pci since it doesn't add
any PCI-specific knowledge or require any PCI core internals, and the
parameters are blk-specific. I don't object to the code, but it seems
like it could go somewhere in block/?
Bjorn
^ permalink raw reply
* Re: [PATCH RFC] block: trace: add block alignment information
From: John Garry @ 2024-09-13 14:08 UTC (permalink / raw)
To: Daniel Gomez
Cc: Jens Axboe, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
linux-block, linux-kernel, linux-trace-kernel, gost.dev,
Luis Chamberlain, Pankaj Raghav, Dave Chinner, Daniel Gomez
In-Reply-To: <20240913135713.vzevruukayd3o7cj@AALNPWDAGOMEZ1.aal.scsc.local>
On 13/09/2024 14:57, Daniel Gomez wrote:
>> I am just saying that the information already present in the block trace
>> point can be used to get this "alignment" info, right? And userspace can do
>> the work of reading those trace events to find this "alignment" info.
> So, maybe this is better to have integrated in blktrace tool?
Sure, or whatever tool which can process this information. I just don't
see why the kernel should be burdened with this.
Thanks,
John
^ permalink raw reply
* Re: [PATCH RFC] block: trace: add block alignment information
From: Daniel Gomez @ 2024-09-13 13:57 UTC (permalink / raw)
To: John Garry
Cc: Jens Axboe, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
linux-block, linux-kernel, linux-trace-kernel, gost.dev,
Luis Chamberlain, Pankaj Raghav, Dave Chinner, Daniel Gomez
In-Reply-To: <77f6fd38-e216-47cf-8ceb-930395614aca@oracle.com>
On Fri, Sep 13, 2024 at 01:08:34PM +0100, John Garry wrote:
> On 13/09/2024 12:26, Daniel Gomez wrote:
> > On Fri, Sep 13, 2024 at 09:59:08AM +0100, John Garry wrote:
> > > On 12/09/2024 21:48, Daniel Gomez via B4 Relay wrote:
> > > > From: Daniel Gomez <da.gomez@samsung.com>
> > > >
> > > > Report block alignment in terms of LBA and size during block tracing for
> > > > block_rq. Calculate alignment only for read/writes where the length is
> > > > greater than 0. Otherwise, report 0 to indicate no alignment calculated.
> > > >
> > > > Suggested-by: Dave Chinner <dchinner@redhat.com>
> > > > Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
> > > > ---
> > > > This patch introduces LBA and size alignment information for
> > > > the block_rq tracepoints (block_rq{insert, issue, merge} and
> > > > block_{io_start, io_done}).
> > >
> > > eh, shouldn't this belong in the description of the patch?
> >
> > Yes. I'll move this to the commit message.
> >
> > >
> > > And I still don't know what we mean by alignment in this context.
> > >
> > > From looking at the code, it seems to be the max detected block size
> > > granularity. For example, for a 64KB write at a 32KB offset, that would give
> > > a 32KB "alignment". But a 64KB write at a 64KB offset would be "64KB"
> > > alignment. While a 8KB write at 64KB offset would be 8KB "alignment". And a
> > > 24KB write at offset 0 is a 8KB "alignment", as 8KB is the lowest power-of-2
>
> note: I meant "8KB is the largest power-of-2"
8KB will be the largest unit at what a device can operate at, for that
particular I/O.
>
> > > which is divisible into 24KB. Is this a correct understanding?
> >
> > That is correct.
>
> So maybe it's me, but I just find it odd to call this information
> "alignment". To me, what you are looking for is largest block size
> granularity.
More suggestions are welcome. What about just I/O granularity? Does the term
imply LBA and size?
>
> > Do you think adding examples like yours can help to explain
> > this better?
> > Below the same examples using fio with the trace output:
> >
> >
> > sudo fio -bs=64k -size=64k -offset=32k -rw=write \
> > -direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync
> >
> > sudo fio -bs=64k -size=64k -offset=64k -rw=write \
> > -direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync
> >
> > sudo fio -bs=8k -size=8k -offset=64k -rw=write \
> > -direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync
> >
> > sudo fio -bs=24k -size=24k -offset=0k -rw=write \
> > -direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync
> >
> > fio-789 [000] ..... 4455.092003: block_rq_issue: 259,0 WS 65536 () 64 + 128 none,0,0 |32768| [fio]
> > fio-801 [000] ..... 4455.474826: block_rq_issue: 259,0 WS 65536 () 128 + 128 none,0,0 |65536| [fio]
> > fio-813 [000] ..... 4455.855143: block_rq_issue: 259,0 WS 8192 () 128 + 16 none,0,0 |8192| [fio]
> > fio-825 [000] ..... 4456.235595: block_rq_issue: 259,0 WS 24576 () 0 + 48 none,0,0 |8192| [fio]
> >
> >
> > Also, the motivation behind this is explained in the LBS RFC [1] and I should
> > have included it here for context. I hope [1] and my description below helps to
> > explain what alignment means and why is needed:
> >
> > [1] Subject: [RFC 00/23] Enable block size > page size in XFS
> > https://urldefense.com/v3/__https://lore.kernel.org/lkml/20230915183848.1018717-1-kernel@pankajraghav.com/__;!!ACWV5N9M2RV99hQ!NoMpDxzuA5uKlv0RAWE5UtOQKOrNB2zv8PHmOLWxfGCEzw5WpyyvonfhcMi0REPjCgF8pgBvEO9kyhTPO8z1$
> >
> > Tracing alignment information is important for high-capacity and QLC SSDs with
> > Indirection Units greater than 4 KiB. These devices are still 4 KiB in Logical
> > Block Size (LBS) but because they work at higher IUs, unaligned writes to the IU
> > boundaries can imply in a read-modify-write (RMW).
>
> Yes, I get that this might be important to know.
>
> >
> > The graph below is a representation of the device IU vs an I/O block aligned/
> > unaligned.
> >
> > |--- IU Boundaries ----| |-PS-|
> > a) [====][====][====][====][····][····][····]--
> > | |
> > b) [····][====][====][====][====][····][····]--
> > | |
> > c) [====][====][====][====][····][====][====]--
>
> is there meant to be a gap at page index #4?
Sorry, that's a copy+paste error. c) can be ignored.
>
> > | |
> > d) [····][····][====][====][····][····][····]--
d) is c)
> > | |
> > LBA 0 4
> > Key:
> > [====] = I/O Block
> > [····] = Memory in Page Size (PS) chunks
> >
> > a) I/O matches IU boundaries (LBA and block size). I/O is aligned.
> > b) The size of the I/O matches the IU size but the I/O is not aligned to the
> > IU boundaries. I/O is unaligned.
> > c) I/O does not match in either size or LBA. I/O is unaligned.
>
> what about d)? Not aligned to IU, I assume.
Yes, c) description is meant for d).
So for clarity, the correct graph is:
|--- IU Boundaries ----| |-PS-|
a) [====][====][====][====][····][····][····]--
| |
b) [····][====][====][====][====][····][····]--
| |
c) [····][····][====][====][····][····][····]--
| |
LBA 0 4
a) I/O matches IU boundaries (LBA and block size). I/O is aligned to IU boundaries.
b) The size of the I/O matches the IU size but the I/O is not aligned to the
IU boundaries. I/O is unaligned.
c) I/O does not match in either size or LBA. I/O is unaligned.
Using I/O granularity term:
a) 16k I/O granularity
b) 4k I/O granularity
c) 8k I/O granularity
>
> >
> > >
> > > >
> > > > The idea of reporting alignment in a tracepoint was first suggested in
> > > > this thread [1] by Dave Chinner. Additionally, an eBPF-based equivalent
> > > > tracing tool [2] was developed and used during LBS development, as
> > > > mentioned in the patch series [3] and in [1].
> > > >
> > > > With this addition, users can check block alignment directly through the
> > > > block layer tracepoints without needing any additional tools.
> > > >
> > > > In case we have a use case, this can be extended to other tracepoints,
> > > > such as complete and error.
> > > >
> > > > Another potential enhancement could be the integration of this
> > > > information into blktrace. Would that be a feasible option to consider?
> > > >
> > > > [1] https://urldefense.com/v3/__https://lore.kernel.org/all/ZdvXAn1Q*2F*QX5sPQ@dread.disaster.area/__;JSs!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXtYsRb3aY$
> > > > [2] blkalgn tool written in eBPF/bcc:
> > > > https://urldefense.com/v3/__https://github.com/dkruces/bcc/tree/lbs__;!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXthE7cfng$
> > > > [3] https://urldefense.com/v3/__https://lore.kernel.org/all/20240822135018.1931258-1-kernel@pankajraghav.com/__;!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXtqQ5uwAE$
> > > > ---
> > > > block/blk-mq.c | 29 +++++++++++++++++++++++++++++
> > > > include/linux/blk-mq.h | 11 +++++++++++
> > > > include/linux/blkdev.h | 6 ++++++
> > > > include/trace/events/block.h | 7 +++++--
> > > > 4 files changed, 51 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > > index 831c5cf5d874..714452bc236b 100644
> > > > --- a/block/blk-mq.c
> > > > +++ b/block/blk-mq.c
> > > > @@ -4920,6 +4920,35 @@ int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
> > > > }
> > > > EXPORT_SYMBOL_GPL(blk_rq_poll);
> > > > +u32 __blk_rq_lba_algn(struct request *req)
> > >
> > > why use "algn", and not "align"?
> > >
> > > "algn" is not a natural abbreviation of "alignment".
> >
> > That's okay with me, changing the var name to a more natural abbreviation.
> >
> > >
> > > And why can't userspace figure this out? All the info is available already,
> > > right?
> >
> > We are interested in the block alignment (LBA and size) at block device driver
> > level, not userspace level. That is, everything that is going out from the block
> > layer. Using the block tracing points currently available makes it block-driver
> > generic.
>
> I am just saying that the information already present in the block trace
> point can be used to get this "alignment" info, right? And userspace can do
> the work of reading those trace events to find this "alignment" info.
So, maybe this is better to have integrated in blktrace tool?
>
> Thanks,
> John
>
^ permalink raw reply
* Re: [PATCHv4 08/10] nvme-rdma: use request helper to get integrity segments
From: Kanchan Joshi @ 2024-09-13 12:37 UTC (permalink / raw)
To: Keith Busch, axboe, hch, martin.petersen, linux-block, linux-nvme,
linux-scsi, sagi
Cc: Keith Busch
In-Reply-To: <20240911201240.3982856-9-kbusch@meta.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
^ permalink raw reply
* Re: [PATCHv4 07/10] scsi: use request helper to get integrity segments
From: Kanchan Joshi @ 2024-09-13 12:35 UTC (permalink / raw)
To: Keith Busch, axboe, hch, martin.petersen, linux-block, linux-nvme,
linux-scsi, sagi
Cc: Keith Busch
In-Reply-To: <20240911201240.3982856-8-kbusch@meta.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
^ permalink raw reply
* Re: [PATCHv4 06/10] block: provide helper for nr_integrity_segments
From: Kanchan Joshi @ 2024-09-13 12:35 UTC (permalink / raw)
To: Keith Busch, axboe, hch, martin.petersen, linux-block, linux-nvme,
linux-scsi, sagi
Cc: Keith Busch
In-Reply-To: <20240911201240.3982856-7-kbusch@meta.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
^ permalink raw reply
* Re: [PATCHv4 05/10] block: provide a request helper for user integrity segments
From: Kanchan Joshi @ 2024-09-13 12:33 UTC (permalink / raw)
To: Keith Busch, axboe, hch, martin.petersen, linux-block, linux-nvme,
linux-scsi, sagi
Cc: Keith Busch
In-Reply-To: <20240911201240.3982856-6-kbusch@meta.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
^ permalink raw reply
* Re: [PATCH v6 09/17] soc: qcom: ice: add HWKM support to the ICE driver
From: Dmitry Baryshkov @ 2024-09-13 12:21 UTC (permalink / raw)
To: Eric Biggers
Cc: Gaurav Kashyap (QUIC), Neil Armstrong, Bartosz Golaszewski,
Jens Axboe, Jonathan Corbet, Alasdair Kergon, Mike Snitzer,
Mikulas Patocka, Adrian Hunter, Asutosh Das, Ritesh Harjani,
Ulf Hansson, Alim Akhtar, Avri Altman, Bart Van Assche,
James E.J. Bottomley, Martin K. Petersen, Theodore Y. Ts'o,
Jaegeuk Kim, Alexander Viro, Christian Brauner, Jan Kara,
Bjorn Andersson, Konrad Dybcio, manivannan.sadhasivam@linaro.org,
linux-block@vger.kernel.org, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, dm-devel@lists.linux.dev,
linux-mmc@vger.kernel.org, linux-scsi@vger.kernel.org,
linux-fscrypt@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-arm-msm@vger.kernel.org, bartosz.golaszewski
In-Reply-To: <20240913045716.GA2292625@google.com>
On Fri, Sep 13, 2024 at 04:57:16AM GMT, Eric Biggers wrote:
> On Fri, Sep 13, 2024 at 07:28:33AM +0300, Dmitry Baryshkov wrote:
> > On Fri, 13 Sept 2024 at 02:17, Eric Biggers <ebiggers@kernel.org> wrote:
> > >
> > > On Thu, Sep 12, 2024 at 10:17:03PM +0000, Gaurav Kashyap (QUIC) wrote:
> > > >
> > > > On Monday, September 9, 2024 11:29 PM PDT, Dmitry Baryshkov wrote:
> > > > > On Tue, 10 Sept 2024 at 03:51, Gaurav Kashyap (QUIC)
> > > > > <quic_gaurkash@quicinc.com> wrote:
> > > > > >
> > > > > > Hello Dmitry and Neil
> > > > > >
> > > > > > On Monday, September 9, 2024 2:44 AM PDT, Dmitry Baryshkov wrote:
> > > > > > > On Mon, Sep 09, 2024 at 10:58:30AM GMT, Neil Armstrong wrote:
> > > > > > > > On 07/09/2024 00:07, Dmitry Baryshkov wrote:
> > > > > > > > > On Fri, Sep 06, 2024 at 08:07:12PM GMT, Bartosz Golaszewski wrote:
> > > > > > > > > > From: Gaurav Kashyap <quic_gaurkash@quicinc.com>
> > > > > > > > > >
> > > > > > > > > > Qualcomm's ICE (Inline Crypto Engine) contains a proprietary
> > > > > > > > > > key management hardware called Hardware Key Manager (HWKM).
> > > > > > > > > > Add
> > > > > > > HWKM
> > > > > > > > > > support to the ICE driver if it is available on the platform.
> > > > > > > > > > HWKM primarily provides hardware wrapped key support where
> > > > > the
> > > > > > > > > > ICE
> > > > > > > > > > (storage) keys are not available in software and instead
> > > > > > > > > > protected in
> > > > > > > hardware.
> > > > > > > > > >
> > > > > > > > > > When HWKM software support is not fully available (from
> > > > > > > > > > Trustzone), there can be a scenario where the ICE hardware
> > > > > > > > > > supports HWKM, but it cannot be used for wrapped keys. In this
> > > > > > > > > > case, raw keys have to be used without using the HWKM. We
> > > > > > > > > > query the TZ at run-time to find out whether wrapped keys
> > > > > > > > > > support is
> > > > > > > available.
> > > > > > > > > >
> > > > > > > > > > Tested-by: Neil Armstrong <neil.armstrong@linaro.org>
> > > > > > > > > > Signed-off-by: Gaurav Kashyap <quic_gaurkash@quicinc.com>
> > > > > > > > > > Signed-off-by: Bartosz Golaszewski
> > > > > > > > > > <bartosz.golaszewski@linaro.org>
> > > > > > > > > > ---
> > > > > > > > > > drivers/soc/qcom/ice.c | 152
> > > > > > > +++++++++++++++++++++++++++++++++++++++++++++++--
> > > > > > > > > > include/soc/qcom/ice.h | 1 +
> > > > > > > > > > 2 files changed, 149 insertions(+), 4 deletions(-)
> > > > > > > > > >
> > > > > > > > > > int qcom_ice_enable(struct qcom_ice *ice)
> > > > > > > > > > {
> > > > > > > > > > + int err;
> > > > > > > > > > +
> > > > > > > > > > qcom_ice_low_power_mode_enable(ice);
> > > > > > > > > > qcom_ice_optimization_enable(ice);
> > > > > > > > > > - return qcom_ice_wait_bist_status(ice);
> > > > > > > > > > + if (ice->use_hwkm)
> > > > > > > > > > + qcom_ice_enable_standard_mode(ice);
> > > > > > > > > > +
> > > > > > > > > > + err = qcom_ice_wait_bist_status(ice); if (err)
> > > > > > > > > > + return err;
> > > > > > > > > > +
> > > > > > > > > > + if (ice->use_hwkm)
> > > > > > > > > > + qcom_ice_hwkm_init(ice);
> > > > > > > > > > +
> > > > > > > > > > + return err;
> > > > > > > > > > }
> > > > > > > > > > EXPORT_SYMBOL_GPL(qcom_ice_enable);
> > > > > > > > > > @@ -150,6 +282,10 @@ int qcom_ice_resume(struct qcom_ice
> > > > > *ice)
> > > > > > > > > > return err;
> > > > > > > > > > }
> > > > > > > > > > + if (ice->use_hwkm) {
> > > > > > > > > > + qcom_ice_enable_standard_mode(ice);
> > > > > > > > > > + qcom_ice_hwkm_init(ice); }
> > > > > > > > > > return qcom_ice_wait_bist_status(ice);
> > > > > > > > > > }
> > > > > > > > > > EXPORT_SYMBOL_GPL(qcom_ice_resume);
> > > > > > > > > > @@ -157,6 +293,7 @@ EXPORT_SYMBOL_GPL(qcom_ice_resume);
> > > > > > > > > > int qcom_ice_suspend(struct qcom_ice *ice)
> > > > > > > > > > {
> > > > > > > > > > clk_disable_unprepare(ice->core_clk);
> > > > > > > > > > + ice->hwkm_init_complete = false;
> > > > > > > > > > return 0;
> > > > > > > > > > }
> > > > > > > > > > @@ -206,6 +343,12 @@ int qcom_ice_evict_key(struct qcom_ice
> > > > > > > > > > *ice,
> > > > > > > int slot)
> > > > > > > > > > }
> > > > > > > > > > EXPORT_SYMBOL_GPL(qcom_ice_evict_key);
> > > > > > > > > > +bool qcom_ice_hwkm_supported(struct qcom_ice *ice) { return
> > > > > > > > > > +ice->use_hwkm; }
> > > > > > > EXPORT_SYMBOL_GPL(qcom_ice_hwkm_supported);
> > > > > > > > > > +
> > > > > > > > > > static struct qcom_ice *qcom_ice_create(struct device *dev,
> > > > > > > > > > void __iomem *base)
> > > > > > > > > > {
> > > > > > > > > > @@ -240,6 +383,7 @@ static struct qcom_ice
> > > > > > > > > > *qcom_ice_create(struct
> > > > > > > device *dev,
> > > > > > > > > > engine->core_clk = devm_clk_get_enabled(dev, NULL);
> > > > > > > > > > if (IS_ERR(engine->core_clk))
> > > > > > > > > > return ERR_CAST(engine->core_clk);
> > > > > > > > > > + engine->use_hwkm = qcom_scm_has_wrapped_key_support();
> > > > > > > > >
> > > > > > > > > This still makes the decision on whether to use HW-wrapped keys
> > > > > > > > > on behalf of a user. I suppose this is incorrect. The user must
> > > > > > > > > be able to use raw keys even if HW-wrapped keys are available on
> > > > > > > > > the platform. One of the examples for such use-cases is if a
> > > > > > > > > user prefers to be able to recover stored information in case of
> > > > > > > > > a device failure (such recovery will be impossible if SoC is
> > > > > > > > > damaged and HW-
> > > > > > > wrapped keys are used).
> > > > > > > >
> > > > > > > > Isn't that already the case ? the
> > > > > BLK_CRYPTO_KEY_TYPE_HW_WRAPPED
> > > > > > > size
> > > > > > > > is here to select HW-wrapped key, otherwise the ol' raw key is passed.
> > > > > > > > Just look the next patch.
> > > > > > > >
> > > > > > > > Or did I miss something ?
> > > > > > >
> > > > > > > That's a good question. If use_hwkm is set, ICE gets programmed to
> > > > > > > use hwkm (see qcom_ice_hwkm_init() call above). I'm not sure if it
> > > > > > > is expected to work properly if after such a call we pass raw key.
> > > > > > >
> > > > > >
> > > > > > Once ICE has moved to a HWKM mode, the firmware key programming
> > > > > currently does not support raw keys.
> > > > > > This support is being added for the next Qualcomm chipset in Trustzone to
> > > > > support both at he same time, but that will take another year or two to hit
> > > > > the market.
> > > > > > Until that time, due to TZ (firmware) limitations , the driver can only
> > > > > support one or the other.
> > > > > >
> > > > > > We also cannot keep moving ICE modes, due to the HWKM enablement
> > > > > being a one-time configurable value at boot.
> > > > >
> > > > > So the init of HWKM should be delayed until the point where the user tells if
> > > > > HWKM or raw keys should be used.
> > > >
> > > > Ack.
> > > > I'll work with Bartosz to look into moving to HWKM mode only during the first key program request
> > > >
> > >
> > > That would mean the driver would have to initially advertise support for both
> > > HW-wrapped keys and raw keys, and then it would revoke the support for one of
> > > them later (due to the other one being used). However, runtime revocation of
> > > crypto capabilities is not supported by the blk-crypto framework
> > > (Documentation/block/inline-encryption.rst), and there is no clear path to
> > > adding such support. Upper layers may have already checked the crypto
> > > capabilities and decided to use them. It's too late to find out that the
> > > support was revoked in the middle of an I/O request. Upper layer code
> > > (blk-crypto, fscrypt, etc.) is not prepared for this. And even if it was, the
> > > best it could do is cleanly fail the I/O, which is too late as e.g. it may
> > > happen during background writeback and cause user data to be thrown away.
> >
> > Can we check crypto capabilities when the user sets the key?
>
> I think you mean when a key is programmed into a keyslot? That happens during
> I/O, which is too late as I've explained above.
>
> > Compare this to the actual HSM used to secure communication or
> > storage. It has certain capabilities, which can be enumerated, etc.
> > But then at the time the user sets the key it is perfectly normal to
> > return an error because HSM is out of resources. It might even have
> > spare key slots, but it might be not enough to be able to program the
> > required key (as a really crazy example, consider the HSM having at
> > this time a single spare DES key slot, while the user wants to program
> > 3DES key).
>
> That isn't how the kernel handles inline encryption keyslots. They are only
> programmed as needed for I/O. If they are all in-use by pending I/O requests,
> then the kernel waits for an I/O request to finish and reprograms the keyslot it
> was using. There is never an error reported due to lack of keyslots.
Does that mean that the I/O can be outstanding for the very long period
of time? Or that if the ICE hardware has just a single keyslot, but
there are two concurrent I/O processes using two different keys, the
framework will be constantly swapping the keys programmed to the HW?
I think it might be prefereable for the drivers and the framework to
support "preprogramming" of the keys, when the key is programmed to the
hardware when it is set by the user.
Another option might be to let the drivers validate the keys being set
by userspace. This way in our case the driver might report that it
supports both raw and wrapped keys, but start rejecting the keys once
it gets notified that the user has programmed other kind of keys. This
way key setup can fail, but the actual I/O can not. WDYT?
> If I/O requests could randomly fail at any time when using inline encryption,
> then no one would use inline encryption because it would not be reliable.
Yes, I agree here.
>
> > > So, the choice of support for HW-wrapped vs. raw will need to be made ahead of
> > > time, rather than being implicitly set by the first use. That is most easily
> > > done using a module parameter like qcom_ice.hw_wrapped_keys=1. Yes, it's a bit
> > > inconvenient, but there's no realistic way around this currently.
> >
> > This doesn't work for Android usecase. The user isn't able to setup modparams.
>
> It does work for Android. The encryption setting that Android uses is
> configured in the build of Android for the device (by the OEM, or by whoever
> made the build in the case of a custom build). Refer to
> https://source.android.com/docs/security/features/encryption/file-based#enabling-file-based-encryption
>
> Anyone who can change that can also change the kernel command line.
Ok. I think if the 'validation' or 'notify' proposal is declined, I'll
have to agree to the modparam.
--
With best wishes
Dmitry
^ permalink raw reply
* Re: [PATCH RFC] block: trace: add block alignment information
From: John Garry @ 2024-09-13 12:08 UTC (permalink / raw)
To: Daniel Gomez
Cc: Jens Axboe, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
linux-block, linux-kernel, linux-trace-kernel, gost.dev,
Luis Chamberlain, Pankaj Raghav, Dave Chinner, Daniel Gomez
In-Reply-To: <20240913112626.mmr27xzxicyf37kh@AALNPWDAGOMEZ1.aal.scsc.local>
On 13/09/2024 12:26, Daniel Gomez wrote:
> On Fri, Sep 13, 2024 at 09:59:08AM +0100, John Garry wrote:
>> On 12/09/2024 21:48, Daniel Gomez via B4 Relay wrote:
>>> From: Daniel Gomez <da.gomez@samsung.com>
>>>
>>> Report block alignment in terms of LBA and size during block tracing for
>>> block_rq. Calculate alignment only for read/writes where the length is
>>> greater than 0. Otherwise, report 0 to indicate no alignment calculated.
>>>
>>> Suggested-by: Dave Chinner <dchinner@redhat.com>
>>> Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
>>> ---
>>> This patch introduces LBA and size alignment information for
>>> the block_rq tracepoints (block_rq{insert, issue, merge} and
>>> block_{io_start, io_done}).
>>
>> eh, shouldn't this belong in the description of the patch?
>
> Yes. I'll move this to the commit message.
>
>>
>> And I still don't know what we mean by alignment in this context.
>>
>> From looking at the code, it seems to be the max detected block size
>> granularity. For example, for a 64KB write at a 32KB offset, that would give
>> a 32KB "alignment". But a 64KB write at a 64KB offset would be "64KB"
>> alignment. While a 8KB write at 64KB offset would be 8KB "alignment". And a
>> 24KB write at offset 0 is a 8KB "alignment", as 8KB is the lowest power-of-2
note: I meant "8KB is the largest power-of-2"
>> which is divisible into 24KB. Is this a correct understanding?
>
> That is correct.
So maybe it's me, but I just find it odd to call this information
"alignment". To me, what you are looking for is largest block size
granularity.
> Do you think adding examples like yours can help to explain
> this better?
> Below the same examples using fio with the trace output:
>
>
> sudo fio -bs=64k -size=64k -offset=32k -rw=write \
> -direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync
>
> sudo fio -bs=64k -size=64k -offset=64k -rw=write \
> -direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync
>
> sudo fio -bs=8k -size=8k -offset=64k -rw=write \
> -direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync
>
> sudo fio -bs=24k -size=24k -offset=0k -rw=write \
> -direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync
>
> fio-789 [000] ..... 4455.092003: block_rq_issue: 259,0 WS 65536 () 64 + 128 none,0,0 |32768| [fio]
> fio-801 [000] ..... 4455.474826: block_rq_issue: 259,0 WS 65536 () 128 + 128 none,0,0 |65536| [fio]
> fio-813 [000] ..... 4455.855143: block_rq_issue: 259,0 WS 8192 () 128 + 16 none,0,0 |8192| [fio]
> fio-825 [000] ..... 4456.235595: block_rq_issue: 259,0 WS 24576 () 0 + 48 none,0,0 |8192| [fio]
>
>
> Also, the motivation behind this is explained in the LBS RFC [1] and I should
> have included it here for context. I hope [1] and my description below helps to
> explain what alignment means and why is needed:
>
> [1] Subject: [RFC 00/23] Enable block size > page size in XFS
> https://urldefense.com/v3/__https://lore.kernel.org/lkml/20230915183848.1018717-1-kernel@pankajraghav.com/__;!!ACWV5N9M2RV99hQ!NoMpDxzuA5uKlv0RAWE5UtOQKOrNB2zv8PHmOLWxfGCEzw5WpyyvonfhcMi0REPjCgF8pgBvEO9kyhTPO8z1$
>
> Tracing alignment information is important for high-capacity and QLC SSDs with
> Indirection Units greater than 4 KiB. These devices are still 4 KiB in Logical
> Block Size (LBS) but because they work at higher IUs, unaligned writes to the IU
> boundaries can imply in a read-modify-write (RMW).
Yes, I get that this might be important to know.
>
> The graph below is a representation of the device IU vs an I/O block aligned/
> unaligned.
>
> |--- IU Boundaries ----| |-PS-|
> a) [====][====][====][====][····][····][····]--
> | |
> b) [····][====][====][====][====][····][····]--
> | |
> c) [====][====][====][====][····][====][====]--
is there meant to be a gap at page index #4?
> | |
> d) [····][····][====][====][····][····][····]--
> | |
> LBA 0 4
>
> Key:
> [====] = I/O Block
> [····] = Memory in Page Size (PS) chunks
>
> a) I/O matches IU boundaries (LBA and block size). I/O is aligned.
> b) The size of the I/O matches the IU size but the I/O is not aligned to the
> IU boundaries. I/O is unaligned.
> c) I/O does not match in either size or LBA. I/O is unaligned.
what about d)? Not aligned to IU, I assume.
>
>>
>>>
>>> The idea of reporting alignment in a tracepoint was first suggested in
>>> this thread [1] by Dave Chinner. Additionally, an eBPF-based equivalent
>>> tracing tool [2] was developed and used during LBS development, as
>>> mentioned in the patch series [3] and in [1].
>>>
>>> With this addition, users can check block alignment directly through the
>>> block layer tracepoints without needing any additional tools.
>>>
>>> In case we have a use case, this can be extended to other tracepoints,
>>> such as complete and error.
>>>
>>> Another potential enhancement could be the integration of this
>>> information into blktrace. Would that be a feasible option to consider?
>>>
>>> [1] https://urldefense.com/v3/__https://lore.kernel.org/all/ZdvXAn1Q*2F*QX5sPQ@dread.disaster.area/__;JSs!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXtYsRb3aY$
>>> [2] blkalgn tool written in eBPF/bcc:
>>> https://urldefense.com/v3/__https://github.com/dkruces/bcc/tree/lbs__;!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXthE7cfng$
>>> [3] https://urldefense.com/v3/__https://lore.kernel.org/all/20240822135018.1931258-1-kernel@pankajraghav.com/__;!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXtqQ5uwAE$
>>> ---
>>> block/blk-mq.c | 29 +++++++++++++++++++++++++++++
>>> include/linux/blk-mq.h | 11 +++++++++++
>>> include/linux/blkdev.h | 6 ++++++
>>> include/trace/events/block.h | 7 +++++--
>>> 4 files changed, 51 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index 831c5cf5d874..714452bc236b 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -4920,6 +4920,35 @@ int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
>>> }
>>> EXPORT_SYMBOL_GPL(blk_rq_poll);
>>> +u32 __blk_rq_lba_algn(struct request *req)
>>
>> why use "algn", and not "align"?
>>
>> "algn" is not a natural abbreviation of "alignment".
>
> That's okay with me, changing the var name to a more natural abbreviation.
>
>>
>> And why can't userspace figure this out? All the info is available already,
>> right?
>
> We are interested in the block alignment (LBA and size) at block device driver
> level, not userspace level. That is, everything that is going out from the block
> layer. Using the block tracing points currently available makes it block-driver
> generic.
I am just saying that the information already present in the block trace
point can be used to get this "alignment" info, right? And userspace can
do the work of reading those trace events to find this "alignment" info.
Thanks,
John
^ permalink raw reply
* Re: [PATCH RFC] block: trace: add block alignment information
From: Daniel Gomez @ 2024-09-13 11:26 UTC (permalink / raw)
To: John Garry
Cc: Jens Axboe, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
linux-block, linux-kernel, linux-trace-kernel, gost.dev,
Luis Chamberlain, Pankaj Raghav, Dave Chinner, Daniel Gomez
In-Reply-To: <a7f9079f-6f47-4a47-a327-98497bd33dfe@oracle.com>
On Fri, Sep 13, 2024 at 09:59:08AM +0100, John Garry wrote:
> On 12/09/2024 21:48, Daniel Gomez via B4 Relay wrote:
> > From: Daniel Gomez <da.gomez@samsung.com>
> >
> > Report block alignment in terms of LBA and size during block tracing for
> > block_rq. Calculate alignment only for read/writes where the length is
> > greater than 0. Otherwise, report 0 to indicate no alignment calculated.
> >
> > Suggested-by: Dave Chinner <dchinner@redhat.com>
> > Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
> > ---
> > This patch introduces LBA and size alignment information for
> > the block_rq tracepoints (block_rq{insert, issue, merge} and
> > block_{io_start, io_done}).
>
> eh, shouldn't this belong in the description of the patch?
Yes. I'll move this to the commit message.
>
> And I still don't know what we mean by alignment in this context.
>
> From looking at the code, it seems to be the max detected block size
> granularity. For example, for a 64KB write at a 32KB offset, that would give
> a 32KB "alignment". But a 64KB write at a 64KB offset would be "64KB"
> alignment. While a 8KB write at 64KB offset would be 8KB "alignment". And a
> 24KB write at offset 0 is a 8KB "alignment", as 8KB is the lowest power-of-2
> which is divisible into 24KB. Is this a correct understanding?
That is correct. Do you think adding examples like yours can help to explain
this better? Below the same examples using fio with the trace output:
sudo fio -bs=64k -size=64k -offset=32k -rw=write \
-direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync
sudo fio -bs=64k -size=64k -offset=64k -rw=write \
-direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync
sudo fio -bs=8k -size=8k -offset=64k -rw=write \
-direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync
sudo fio -bs=24k -size=24k -offset=0k -rw=write \
-direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync
fio-789 [000] ..... 4455.092003: block_rq_issue: 259,0 WS 65536 () 64 + 128 none,0,0 |32768| [fio]
fio-801 [000] ..... 4455.474826: block_rq_issue: 259,0 WS 65536 () 128 + 128 none,0,0 |65536| [fio]
fio-813 [000] ..... 4455.855143: block_rq_issue: 259,0 WS 8192 () 128 + 16 none,0,0 |8192| [fio]
fio-825 [000] ..... 4456.235595: block_rq_issue: 259,0 WS 24576 () 0 + 48 none,0,0 |8192| [fio]
Also, the motivation behind this is explained in the LBS RFC [1] and I should
have included it here for context. I hope [1] and my description below helps to
explain what alignment means and why is needed:
[1] Subject: [RFC 00/23] Enable block size > page size in XFS
https://lore.kernel.org/lkml/20230915183848.1018717-1-kernel@pankajraghav.com/
Tracing alignment information is important for high-capacity and QLC SSDs with
Indirection Units greater than 4 KiB. These devices are still 4 KiB in Logical
Block Size (LBS) but because they work at higher IUs, unaligned writes to the IU
boundaries can imply in a read-modify-write (RMW).
The graph below is a representation of the device IU vs an I/O block aligned/
unaligned.
|--- IU Boundaries ----| |-PS-|
a) [====][====][====][====][····][····][····]--
| |
b) [····][====][====][====][====][····][····]--
| |
c) [====][====][====][====][····][====][====]--
| |
d) [····][····][====][====][····][····][····]--
| |
LBA 0 4
Key:
[====] = I/O Block
[····] = Memory in Page Size (PS) chunks
a) I/O matches IU boundaries (LBA and block size). I/O is aligned.
b) The size of the I/O matches the IU size but the I/O is not aligned to the
IU boundaries. I/O is unaligned.
c) I/O does not match in either size or LBA. I/O is unaligned.
>
> >
> > The idea of reporting alignment in a tracepoint was first suggested in
> > this thread [1] by Dave Chinner. Additionally, an eBPF-based equivalent
> > tracing tool [2] was developed and used during LBS development, as
> > mentioned in the patch series [3] and in [1].
> >
> > With this addition, users can check block alignment directly through the
> > block layer tracepoints without needing any additional tools.
> >
> > In case we have a use case, this can be extended to other tracepoints,
> > such as complete and error.
> >
> > Another potential enhancement could be the integration of this
> > information into blktrace. Would that be a feasible option to consider?
> >
> > [1] https://urldefense.com/v3/__https://lore.kernel.org/all/ZdvXAn1Q*2F*QX5sPQ@dread.disaster.area/__;JSs!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXtYsRb3aY$
> > [2] blkalgn tool written in eBPF/bcc:
> > https://urldefense.com/v3/__https://github.com/dkruces/bcc/tree/lbs__;!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXthE7cfng$
> > [3] https://urldefense.com/v3/__https://lore.kernel.org/all/20240822135018.1931258-1-kernel@pankajraghav.com/__;!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXtqQ5uwAE$
> > ---
> > block/blk-mq.c | 29 +++++++++++++++++++++++++++++
> > include/linux/blk-mq.h | 11 +++++++++++
> > include/linux/blkdev.h | 6 ++++++
> > include/trace/events/block.h | 7 +++++--
> > 4 files changed, 51 insertions(+), 2 deletions(-)
> >
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 831c5cf5d874..714452bc236b 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -4920,6 +4920,35 @@ int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
> > }
> > EXPORT_SYMBOL_GPL(blk_rq_poll);
> > +u32 __blk_rq_lba_algn(struct request *req)
>
> why use "algn", and not "align"?
>
> "algn" is not a natural abbreviation of "alignment".
That's okay with me, changing the var name to a more natural abbreviation.
>
> And why can't userspace figure this out? All the info is available already,
> right?
We are interested in the block alignment (LBA and size) at block device driver
level, not userspace level. That is, everything that is going out from the block
layer. Using the block tracing points currently available makes it block-driver
generic.
>
> > +{
> > + u32 lbs = queue_logical_block_size(req->q);
> > + u32 lba_shift = ilog2(lbs);
> > + u32 lba = req->__sector >> (lba_shift - SECTOR_SHIFT);
>
> req->__sector is a u64 - is it safe to store the shifted value in a u32?
>
> > + u32 len = req->__data_len;
> > + u32 algn_len = len;
> > + u32 algn_lba = len / lbs;
> > + u32 alignment = lbs;
> > +
> > + if (is_power_of_2(len) &&
> > + blk_rq_lba_aligned(len, algn_len, lba, algn_lba))
> > + return len;
> > +
> > + algn_len = lbs << 1U;
> > + algn_lba = algn_len / lbs;
> > +
> > + while (algn_len < len) {
> > + if (!blk_rq_lba_aligned(len, algn_len, lba, algn_lba))
> > + break;
> > +
> > + alignment = algn_len;
> > + algn_len = algn_len << 1U;
> > + algn_lba = algn_len / lbs;
> > + }
> > +
> > + return alignment;
> > +}
> > +
> > unsigned int blk_mq_rq_cpu(struct request *rq)
> > {
> > return rq->mq_ctx->cpu;
> > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> > index 8d304b1d16b1..02959fbd5e28 100644
> > --- a/include/linux/blk-mq.h
> > +++ b/include/linux/blk-mq.h
> > @@ -740,6 +740,17 @@ void blk_mq_free_request(struct request *rq);
> > int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
> > unsigned int poll_flags);
> > +/* The alignment of the block in terms of LBA and size */
> > +u32 __blk_rq_lba_algn(struct request *req);
> > +static inline u32 blk_rq_lba_algn(struct request *req)
> > +{
> > + if ((req_op(req) != REQ_OP_WRITE) && (req_op(req) != REQ_OP_READ) &&
> > + !(req->__data_len))
> > + return 0;
> > +
> > + return __blk_rq_lba_algn(req);
> > +}
> > +
> > bool blk_mq_queue_inflight(struct request_queue *q);
> > enum {
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index bf1aa951fda2..28557987daa8 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -1433,6 +1433,12 @@ static inline int blk_rq_aligned(struct request_queue *q, unsigned long addr,
> > return !(addr & alignment) && !(len & alignment);
> > }
> > +static inline bool blk_rq_lba_aligned(u32 len, u32 algn_len, u32 lba,
> > + u32 algn_lba)
> > +{
> > + return !(len % algn_len) && !(lba % algn_lba);
> > +}
>
> why in blkdev.h? It is only used in block/blk-mq.c
There's a blk_rq_aligned in blkdev.h already and used in blk-map.c that checks
for dma and pad alignment. I can move it to blk-mq.h if that fits better.
>
> > +
> > /* assumes size > 256 */
> > static inline unsigned int blksize_bits(unsigned int size)
> > {
> > diff --git a/include/trace/events/block.h b/include/trace/events/block.h
> > index 1527d5d45e01..ba3764214dc7 100644
> > --- a/include/trace/events/block.h
> > +++ b/include/trace/events/block.h
> > @@ -202,6 +202,7 @@ DECLARE_EVENT_CLASS(block_rq,
> > __array( char, rwbs, RWBS_LEN )
> > __array( char, comm, TASK_COMM_LEN )
> > __dynamic_array( char, cmd, 1 )
> > + __field( unsigned int, algn )
> > ),
> > TP_fast_assign(
> > @@ -210,20 +211,22 @@ DECLARE_EVENT_CLASS(block_rq,
> > __entry->nr_sector = blk_rq_trace_nr_sectors(rq);
> > __entry->bytes = blk_rq_bytes(rq);
> > __entry->ioprio = rq->ioprio;
> > + __entry->algn = blk_rq_lba_algn(rq);
> > blk_fill_rwbs(__entry->rwbs, rq->cmd_flags);
> > __get_str(cmd)[0] = '\0';
> > memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
> > ),
> > - TP_printk("%d,%d %s %u (%s) %llu + %u %s,%u,%u [%s]",
> > + TP_printk("%d,%d %s %u (%s) %llu + %u %s,%u,%u |%u| [%s]",
> > MAJOR(__entry->dev), MINOR(__entry->dev),
> > __entry->rwbs, __entry->bytes, __get_str(cmd),
> > (unsigned long long)__entry->sector, __entry->nr_sector,
> > __print_symbolic(IOPRIO_PRIO_CLASS(__entry->ioprio),
> > IOPRIO_CLASS_STRINGS),
> > IOPRIO_PRIO_HINT(__entry->ioprio),
> > - IOPRIO_PRIO_LEVEL(__entry->ioprio), __entry->comm)
> > + IOPRIO_PRIO_LEVEL(__entry->ioprio), __entry->algn,
> > + __entry->comm)
> > );
> > /**
> >
> > ---
> > base-commit: 57f962b956f1d116cd64d5c406776c4975de549d
> > change-id: 20240912-add-blkalgn-block-trace-71e8ab6708f1
> >
> > Best regards,
>
^ permalink raw reply
* Re: [PATCH RFC] block: trace: add block alignment information
From: John Garry @ 2024-09-13 8:59 UTC (permalink / raw)
To: da.gomez, Jens Axboe, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers
Cc: linux-block, linux-kernel, linux-trace-kernel, gost.dev,
Luis Chamberlain, Pankaj Raghav, Dave Chinner, Daniel Gomez
In-Reply-To: <20240912-add-blkalgn-block-trace-v1-1-335dd6eea557@samsung.com>
On 12/09/2024 21:48, Daniel Gomez via B4 Relay wrote:
> From: Daniel Gomez <da.gomez@samsung.com>
>
> Report block alignment in terms of LBA and size during block tracing for
> block_rq. Calculate alignment only for read/writes where the length is
> greater than 0. Otherwise, report 0 to indicate no alignment calculated.
>
> Suggested-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
> ---
> This patch introduces LBA and size alignment information for
> the block_rq tracepoints (block_rq{insert, issue, merge} and
> block_{io_start, io_done}).
eh, shouldn't this belong in the description of the patch?
And I still don't know what we mean by alignment in this context.
From looking at the code, it seems to be the max detected block size
granularity. For example, for a 64KB write at a 32KB offset, that would
give a 32KB "alignment". But a 64KB write at a 64KB offset would be
"64KB" alignment. While a 8KB write at 64KB offset would be 8KB
"alignment". And a 24KB write at offset 0 is a 8KB "alignment", as 8KB
is the lowest power-of-2 which is divisible into 24KB. Is this a correct
understanding?
>
> The idea of reporting alignment in a tracepoint was first suggested in
> this thread [1] by Dave Chinner. Additionally, an eBPF-based equivalent
> tracing tool [2] was developed and used during LBS development, as
> mentioned in the patch series [3] and in [1].
>
> With this addition, users can check block alignment directly through the
> block layer tracepoints without needing any additional tools.
>
> In case we have a use case, this can be extended to other tracepoints,
> such as complete and error.
>
> Another potential enhancement could be the integration of this
> information into blktrace. Would that be a feasible option to consider?
>
> [1] https://urldefense.com/v3/__https://lore.kernel.org/all/ZdvXAn1Q*2F*QX5sPQ@dread.disaster.area/__;JSs!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXtYsRb3aY$
> [2] blkalgn tool written in eBPF/bcc:
> https://urldefense.com/v3/__https://github.com/dkruces/bcc/tree/lbs__;!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXthE7cfng$
> [3] https://urldefense.com/v3/__https://lore.kernel.org/all/20240822135018.1931258-1-kernel@pankajraghav.com/__;!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXtqQ5uwAE$
> ---
> block/blk-mq.c | 29 +++++++++++++++++++++++++++++
> include/linux/blk-mq.h | 11 +++++++++++
> include/linux/blkdev.h | 6 ++++++
> include/trace/events/block.h | 7 +++++--
> 4 files changed, 51 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 831c5cf5d874..714452bc236b 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -4920,6 +4920,35 @@ int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
> }
> EXPORT_SYMBOL_GPL(blk_rq_poll);
>
> +u32 __blk_rq_lba_algn(struct request *req)
why use "algn", and not "align"?
"algn" is not a natural abbreviation of "alignment".
And why can't userspace figure this out? All the info is available
already, right?
> +{
> + u32 lbs = queue_logical_block_size(req->q);
> + u32 lba_shift = ilog2(lbs);
> + u32 lba = req->__sector >> (lba_shift - SECTOR_SHIFT);
req->__sector is a u64 - is it safe to store the shifted value in a u32?
> + u32 len = req->__data_len;
> + u32 algn_len = len;
> + u32 algn_lba = len / lbs;
> + u32 alignment = lbs;
> +
> + if (is_power_of_2(len) &&
> + blk_rq_lba_aligned(len, algn_len, lba, algn_lba))
> + return len;
> +
> + algn_len = lbs << 1U;
> + algn_lba = algn_len / lbs;
> +
> + while (algn_len < len) {
> + if (!blk_rq_lba_aligned(len, algn_len, lba, algn_lba))
> + break;
> +
> + alignment = algn_len;
> + algn_len = algn_len << 1U;
> + algn_lba = algn_len / lbs;
> + }
> +
> + return alignment;
> +}
> +
> unsigned int blk_mq_rq_cpu(struct request *rq)
> {
> return rq->mq_ctx->cpu;
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 8d304b1d16b1..02959fbd5e28 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -740,6 +740,17 @@ void blk_mq_free_request(struct request *rq);
> int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
> unsigned int poll_flags);
>
> +/* The alignment of the block in terms of LBA and size */
> +u32 __blk_rq_lba_algn(struct request *req);
> +static inline u32 blk_rq_lba_algn(struct request *req)
> +{
> + if ((req_op(req) != REQ_OP_WRITE) && (req_op(req) != REQ_OP_READ) &&
> + !(req->__data_len))
> + return 0;
> +
> + return __blk_rq_lba_algn(req);
> +}
> +
> bool blk_mq_queue_inflight(struct request_queue *q);
>
> enum {
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index bf1aa951fda2..28557987daa8 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1433,6 +1433,12 @@ static inline int blk_rq_aligned(struct request_queue *q, unsigned long addr,
> return !(addr & alignment) && !(len & alignment);
> }
>
> +static inline bool blk_rq_lba_aligned(u32 len, u32 algn_len, u32 lba,
> + u32 algn_lba)
> +{
> + return !(len % algn_len) && !(lba % algn_lba);
> +}
why in blkdev.h? It is only used in block/blk-mq.c
> +
> /* assumes size > 256 */
> static inline unsigned int blksize_bits(unsigned int size)
> {
> diff --git a/include/trace/events/block.h b/include/trace/events/block.h
> index 1527d5d45e01..ba3764214dc7 100644
> --- a/include/trace/events/block.h
> +++ b/include/trace/events/block.h
> @@ -202,6 +202,7 @@ DECLARE_EVENT_CLASS(block_rq,
> __array( char, rwbs, RWBS_LEN )
> __array( char, comm, TASK_COMM_LEN )
> __dynamic_array( char, cmd, 1 )
> + __field( unsigned int, algn )
> ),
>
> TP_fast_assign(
> @@ -210,20 +211,22 @@ DECLARE_EVENT_CLASS(block_rq,
> __entry->nr_sector = blk_rq_trace_nr_sectors(rq);
> __entry->bytes = blk_rq_bytes(rq);
> __entry->ioprio = rq->ioprio;
> + __entry->algn = blk_rq_lba_algn(rq);
>
> blk_fill_rwbs(__entry->rwbs, rq->cmd_flags);
> __get_str(cmd)[0] = '\0';
> memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
> ),
>
> - TP_printk("%d,%d %s %u (%s) %llu + %u %s,%u,%u [%s]",
> + TP_printk("%d,%d %s %u (%s) %llu + %u %s,%u,%u |%u| [%s]",
> MAJOR(__entry->dev), MINOR(__entry->dev),
> __entry->rwbs, __entry->bytes, __get_str(cmd),
> (unsigned long long)__entry->sector, __entry->nr_sector,
> __print_symbolic(IOPRIO_PRIO_CLASS(__entry->ioprio),
> IOPRIO_CLASS_STRINGS),
> IOPRIO_PRIO_HINT(__entry->ioprio),
> - IOPRIO_PRIO_LEVEL(__entry->ioprio), __entry->comm)
> + IOPRIO_PRIO_LEVEL(__entry->ioprio), __entry->algn,
> + __entry->comm)
> );
>
> /**
>
> ---
> base-commit: 57f962b956f1d116cd64d5c406776c4975de549d
> change-id: 20240912-add-blkalgn-block-trace-71e8ab6708f1
>
> Best regards,
^ permalink raw reply
* Re: [PATCH RFC 1/4] block: Make bdev_can_atomic_write() robust against mis-aligned bdev size
From: Hannes Reinecke @ 2024-09-13 8:36 UTC (permalink / raw)
To: John Garry, Christoph Hellwig
Cc: axboe, song, yukuai3, kbusch, sagi, James.Bottomley,
martin.petersen, linux-block, linux-kernel, linux-raid,
linux-nvme, linux-scsi
In-Reply-To: <4a015015-ae7f-4eb5-ad00-420db5961d96@oracle.com>
On 9/12/24 17:22, John Garry wrote:
> On 12/09/2024 16:07, Christoph Hellwig wrote:
>>> We should do be able to, but with this patch we cannot. However, a
>>> misaligned partition would be very much unexpected.
>> Yes, misaligned partitions is very unexpected, but with large and
>> potentially unlimited atomic boundaries I would not expect the size
>> to always be aligned. But then again at least in NVMe atomic writes
>> don't need to match the max size anyway, so I'm not entirely sure
>> what the problem actually is.
>
> Actually it's not an alignment issue, but a size issue.
>
> Consider a 3.5MB partition and atomic write max is 1MB. If we tried to
> atomic write 1MB at offset 3MB, then it would be truncated to a 0.5MB
> write.
>
> So maybe it is an application bug.
>
Hmm. Why don't we reject such an I/O? We cannot guarantee an atomic
write, so I think we should be perfectly fine to return an error to
userspace.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply
* [PATCH] drbd: Fix atomicity violation in drbd_uuid_set_bm()
From: Qiu-ji Chen @ 2024-09-13 8:35 UTC (permalink / raw)
To: philipp.reisner, lars.ellenberg, christoph.boehmwalder, axboe
Cc: drbd-dev, linux-block, linux-kernel, baijiaju1990, Qiu-ji Chen,
stable
The violation of atomicity occurs when the drbd_uuid_set_bm function is
executed simultaneously with modifying the value of
device->ldev->md.uuid[UI_BITMAP]. Consider a scenario where, while
device->ldev->md.uuid[UI_BITMAP] passes the validity check when its value
is not zero, the value of device->ldev->md.uuid[UI_BITMAP] is written to
zero. In this case, the check in drbd_uuid_set_bm might refer to the old
value of device->ldev->md.uuid[UI_BITMAP] (before locking), which allows
an invalid value to pass the validity check, resulting in inconsistency.
To address this issue, it is recommended to include the data validity check
within the locked section of the function. This modification ensures that
the value of device->ldev->md.uuid[UI_BITMAP] does not change during the
validation process, thereby maintaining its integrity.
This possible bug is found by an experimental static analysis tool
developed by our team. This tool analyzes the locking APIs to extract
function pairs that can be concurrently executed, and then analyzes the
instructions in the paired functions to identify possible concurrency bugs
including data races and atomicity violations.
Fixes: 9f2247bb9b75 ("drbd: Protect accesses to the uuid set with a spinlock")
Cc: stable@vger.kernel.org
Signed-off-by: Qiu-ji Chen <chenqiuji666@gmail.com>
---
drivers/block/drbd/drbd_main.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index a9e49b212341..abafc4edf9ed 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -3399,10 +3399,12 @@ void drbd_uuid_new_current(struct drbd_device *device) __must_hold(local)
void drbd_uuid_set_bm(struct drbd_device *device, u64 val) __must_hold(local)
{
unsigned long flags;
- if (device->ldev->md.uuid[UI_BITMAP] == 0 && val == 0)
+ spin_lock_irqsave(&device->ldev->md.uuid_lock, flags);
+ if (device->ldev->md.uuid[UI_BITMAP] == 0 && val == 0) {
+ spin_unlock_irqrestore(&device->ldev->md.uuid_lock, flags);
return;
+ }
- spin_lock_irqsave(&device->ldev->md.uuid_lock, flags);
if (val == 0) {
drbd_uuid_move_history(device);
device->ldev->md.uuid[UI_HISTORY_START] = device->ldev->md.uuid[UI_BITMAP];
--
2.34.1
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox