* [PATCH 1/4] block: create payloadless issue bio helper
2020-03-29 17:47 [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE Chaitanya Kulkarni
@ 2020-03-29 17:47 ` Chaitanya Kulkarni
2020-03-29 17:47 ` [PATCH 2/4] block: Add support for REQ_OP_ASSIGN_RANGE Chaitanya Kulkarni
` (5 subsequent siblings)
6 siblings, 0 replies; 18+ messages in thread
From: Chaitanya Kulkarni @ 2020-03-29 17:47 UTC (permalink / raw)
To: hch, martin.petersen
Cc: darrick.wong, axboe, tytso, adilger.kernel, ming.lei, jthumshirn,
minwoo.im.dev, chaitanya.kulkarni, damien.lemoal, andrea.parri,
hare, tj, hannes, khlebnikov, ajay.joshi, bvanassche, arnd,
houtao1, asml.silence, linux-block, linux-ext4
This is a prep patch that creates a helper to submit a payloadless bio
with all the required arguments. It avoids code repetition in
blk-lib.c so that new payloadless ops can reuse it.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
---
block/blk-lib.c | 51 +++++++++++++++++++++++++++++--------------------
1 file changed, 30 insertions(+), 21 deletions(-)
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 5f2c429d4378..8e53e393703c 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -209,13 +209,40 @@ int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
}
EXPORT_SYMBOL(blkdev_issue_write_same);
+static void __blkdev_issue_payloadless(struct block_device *bdev,
+ unsigned op, sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
+ struct bio **biop, unsigned bio_opf, unsigned int max_sectors)
+{
+ struct bio *bio = *biop;
+
+ while (nr_sects) {
+ bio = blk_next_bio(bio, 0, gfp_mask);
+ bio->bi_iter.bi_sector = sector;
+ bio_set_dev(bio, bdev);
+ bio->bi_opf = op;
+ bio->bi_opf |= bio_opf;
+
+ if (nr_sects > max_sectors) {
+ bio->bi_iter.bi_size = max_sectors << 9;
+ nr_sects -= max_sectors;
+ sector += max_sectors;
+ } else {
+ bio->bi_iter.bi_size = nr_sects << 9;
+ nr_sects = 0;
+ }
+ cond_resched();
+ }
+
+ *biop = bio;
+}
+
static int __blkdev_issue_write_zeroes(struct block_device *bdev,
sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
struct bio **biop, unsigned flags)
{
- struct bio *bio = *biop;
unsigned int max_write_zeroes_sectors;
struct request_queue *q = bdev_get_queue(bdev);
+ unsigned int unmap = (flags & BLKDEV_ZERO_NOUNMAP) ? REQ_NOUNMAP : 0;
if (!q)
return -ENXIO;
@@ -229,26 +256,8 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
if (max_write_zeroes_sectors == 0)
return -EOPNOTSUPP;
- while (nr_sects) {
- bio = blk_next_bio(bio, 0, gfp_mask);
- bio->bi_iter.bi_sector = sector;
- bio_set_dev(bio, bdev);
- bio->bi_opf = REQ_OP_WRITE_ZEROES;
- if (flags & BLKDEV_ZERO_NOUNMAP)
- bio->bi_opf |= REQ_NOUNMAP;
-
- if (nr_sects > max_write_zeroes_sectors) {
- bio->bi_iter.bi_size = max_write_zeroes_sectors << 9;
- nr_sects -= max_write_zeroes_sectors;
- sector += max_write_zeroes_sectors;
- } else {
- bio->bi_iter.bi_size = nr_sects << 9;
- nr_sects = 0;
- }
- cond_resched();
- }
-
- *biop = bio;
+ __blkdev_issue_payloadless(bdev, REQ_OP_WRITE_ZEROES, sector, nr_sects,
+ gfp_mask, biop, unmap, max_write_zeroes_sectors);
return 0;
}
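The chunking logic of the new helper can be sketched in userspace C. This is an illustrative model of the loop above, not kernel code: the real helper also allocates and chains the bios via blk_next_bio(), which is omitted here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace sketch of the chunking in __blkdev_issue_payloadless():
 * a payloadless request covering nr_sects sectors is split into bios
 * of at most max_sectors each, and each bio's bi_size is the chunk's
 * sector count converted to bytes (<< 9). Returns the number of bios
 * that would be issued; *last_bytes reports the final bio's byte size.
 */
static unsigned int split_payloadless(uint64_t sector, uint64_t nr_sects,
                                      uint64_t max_sectors,
                                      uint64_t *last_bytes)
{
    unsigned int nbios = 0;

    while (nr_sects) {
        /* mirror the if/else in the helper: full chunk or the tail */
        uint64_t chunk = nr_sects > max_sectors ? max_sectors : nr_sects;

        *last_bytes = chunk << 9;   /* bi_iter.bi_size is in bytes */
        sector += chunk;
        nr_sects -= chunk;
        nbios++;
    }
    return nbios;
}
```

With max_sectors = 4, a 10-sector range produces three bios (4 + 4 + 2 sectors), the last covering 2 sectors, i.e. 1024 bytes.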
--
2.22.0
* [PATCH 2/4] block: Add support for REQ_OP_ASSIGN_RANGE
2020-03-29 17:47 [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE Chaitanya Kulkarni
2020-03-29 17:47 ` [PATCH 1/4] block: create payloadless issue bio helper Chaitanya Kulkarni
@ 2020-03-29 17:47 ` Chaitanya Kulkarni
2020-03-29 17:47 ` [PATCH 3/4] loop: Forward REQ_OP_ASSIGN_RANGE into fallocate(0) Chaitanya Kulkarni
` (4 subsequent siblings)
6 siblings, 0 replies; 18+ messages in thread
From: Chaitanya Kulkarni @ 2020-03-29 17:47 UTC (permalink / raw)
To: hch, martin.petersen
Cc: darrick.wong, axboe, tytso, adilger.kernel, ming.lei, jthumshirn,
minwoo.im.dev, chaitanya.kulkarni, damien.lemoal, andrea.parri,
hare, tj, hannes, khlebnikov, ajay.joshi, bvanassche, arnd,
houtao1, asml.silence, linux-block, linux-ext4
From: Kirill Tkhai <ktkhai@virtuozzo.com>
This operation lets a filesystem notify a device that a range of
sectors has been chosen as a single extent, and that the device should
try its best to reflect that (keep the range as a single hunk in its
internals, or represent the range as a minimal set of hunks). Put
directly, the operation forwards fallocate(0) requests to the storage
the device is based on.
This may be useful for distributed network filesystems that provide a
block device interface, to optimize the placement of their blocks
across the cluster nodes.
Block devices that map a file (like loop) are also users, since this
allows them to allocate more contiguous extents and to batch block
allocation requests. In addition, hypervisors like QEMU may use this
for better block placement.
This patch adds a new blkdev_issue_assign_range() primitive, which is
rather similar to the existing blkdev_issue_{*} API. A new queue
limit, max_assign_range_sectors, is also added.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
---
block/blk-core.c | 5 +++
block/blk-lib.c | 64 +++++++++++++++++++++++++++++++++++++++
block/blk-merge.c | 21 +++++++++++++
block/blk-settings.c | 19 ++++++++++++
block/blk-zoned.c | 1 +
block/bounce.c | 1 +
include/linux/bio.h | 9 ++++--
include/linux/blk_types.h | 2 ++
include/linux/blkdev.h | 34 +++++++++++++++++++++
9 files changed, 153 insertions(+), 3 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 60dc9552ef8d..25165fa8fe46 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -137,6 +137,7 @@ static const char *const blk_op_name[] = {
REQ_OP_NAME(ZONE_FINISH),
REQ_OP_NAME(WRITE_SAME),
REQ_OP_NAME(WRITE_ZEROES),
+ REQ_OP_NAME(ASSIGN_RANGE),
REQ_OP_NAME(SCSI_IN),
REQ_OP_NAME(SCSI_OUT),
REQ_OP_NAME(DRV_IN),
@@ -952,6 +953,10 @@ generic_make_request_checks(struct bio *bio)
if (!q->limits.max_write_zeroes_sectors)
goto not_supported;
break;
+ case REQ_OP_ASSIGN_RANGE:
+ if (!q->limits.max_assign_range_sectors)
+ goto not_supported;
+ break;
default:
break;
}
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 8e53e393703c..16dc9dbf6c79 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -414,3 +414,67 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
return ret;
}
EXPORT_SYMBOL(blkdev_issue_zeroout);
+
+static int __blkdev_issue_assign_range(struct block_device *bdev,
+ sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
+ struct bio **biop)
+{
+ unsigned int max_assign_range_sectors;
+ struct request_queue *q = bdev_get_queue(bdev);
+
+ if (!q)
+ return -ENXIO;
+
+ if (bdev_read_only(bdev))
+ return -EPERM;
+
+ max_assign_range_sectors = bdev_assign_range_sectors(bdev);
+
+ if (max_assign_range_sectors == 0)
+ return -EOPNOTSUPP;
+
+ __blkdev_issue_payloadless(bdev, REQ_OP_ASSIGN_RANGE, sector, nr_sects,
+ gfp_mask, biop, 0, max_assign_range_sectors);
+ return 0;
+}
+
+/**
+ * blkdev_issue_assign_range - issue assign range requests for a sector range
+ * @bdev: blockdev to issue against
+ * @sector: start sector
+ * @nr_sects: number of sectors to assign
+ * @gfp_mask: memory allocation flags (for bio_alloc)
+ *
+ * Description:
+ * Assign a block range for batched allocation requests. Useful for stacking
+ * a block device on top of a filesystem.
+ */
+int blkdev_issue_assign_range(struct block_device *bdev, sector_t sector,
+ sector_t nr_sects, gfp_t gfp_mask)
+{
+ int ret = 0;
+ sector_t bs_mask;
+ struct blk_plug plug;
+ struct bio *bio = NULL;
+
+ if (bdev_assign_range_sectors(bdev) == 0)
+ return 0;
+
+ bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
+ if ((sector | nr_sects) & bs_mask)
+ return -EINVAL;
+
+ blk_start_plug(&plug);
+ ret = __blkdev_issue_assign_range(bdev, sector, nr_sects,
+ gfp_mask, &bio);
+ if (ret == 0 && bio) {
+ ret = submit_bio_wait(bio);
+ bio_put(bio);
+ }
+ blk_finish_plug(&plug);
+
+ return ret;
+}
+EXPORT_SYMBOL(blkdev_issue_assign_range);
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 1534ed736363..441d1620de03 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -116,6 +116,22 @@ static struct bio *blk_bio_write_zeroes_split(struct request_queue *q,
return bio_split(bio, q->limits.max_write_zeroes_sectors, GFP_NOIO, bs);
}
+static struct bio *blk_bio_assign_range_split(struct request_queue *q,
+ struct bio *bio,
+ struct bio_set *bs,
+ unsigned *nsegs)
+{
+ *nsegs = 0;
+
+ if (!q->limits.max_assign_range_sectors)
+ return NULL;
+
+ if (bio_sectors(bio) <= q->limits.max_assign_range_sectors)
+ return NULL;
+
+ return bio_split(bio, q->limits.max_assign_range_sectors, GFP_NOIO, bs);
+}
+
static struct bio *blk_bio_write_same_split(struct request_queue *q,
struct bio *bio,
struct bio_set *bs,
@@ -308,6 +324,10 @@ void __blk_queue_split(struct request_queue *q, struct bio **bio,
split = blk_bio_write_zeroes_split(q, *bio, &q->bio_split,
nr_segs);
break;
+ case REQ_OP_ASSIGN_RANGE:
+ split = blk_bio_assign_range_split(q, *bio, &q->bio_split,
+ nr_segs);
+ break;
case REQ_OP_WRITE_SAME:
split = blk_bio_write_same_split(q, *bio, &q->bio_split,
nr_segs);
@@ -386,6 +406,7 @@ unsigned int blk_recalc_rq_segments(struct request *rq)
case REQ_OP_DISCARD:
case REQ_OP_SECURE_ERASE:
case REQ_OP_WRITE_ZEROES:
+ case REQ_OP_ASSIGN_RANGE:
return 0;
case REQ_OP_WRITE_SAME:
return 1;
diff --git a/block/blk-settings.c b/block/blk-settings.c
index c8eda2e7b91e..6beee0585580 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -48,6 +48,7 @@ void blk_set_default_limits(struct queue_limits *lim)
lim->chunk_sectors = 0;
lim->max_write_same_sectors = 0;
lim->max_write_zeroes_sectors = 0;
+ lim->max_assign_range_sectors = 0;
lim->max_discard_sectors = 0;
lim->max_hw_discard_sectors = 0;
lim->discard_granularity = 0;
@@ -83,6 +84,7 @@ void blk_set_stacking_limits(struct queue_limits *lim)
lim->max_dev_sectors = UINT_MAX;
lim->max_write_same_sectors = UINT_MAX;
lim->max_write_zeroes_sectors = UINT_MAX;
+ lim->max_assign_range_sectors = UINT_MAX;
}
EXPORT_SYMBOL(blk_set_stacking_limits);
@@ -257,6 +259,21 @@ void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
}
EXPORT_SYMBOL(blk_queue_max_write_zeroes_sectors);
+/**
+ * blk_queue_max_assign_range_sectors - set max sectors for a single
+ * assign_range
+ *
+ * @q: the request queue for the device
+ * @max_assign_range_sectors: maximum number of sectors to assign range per
+ * command
+ **/
+void blk_queue_max_assign_range_sectors(struct request_queue *q,
+ unsigned int max_assign_range_sectors)
+{
+ q->limits.max_assign_range_sectors = max_assign_range_sectors;
+}
+EXPORT_SYMBOL(blk_queue_max_assign_range_sectors);
+
/**
* blk_queue_max_segments - set max hw segments for a request for this queue
* @q: the request queue for the device
@@ -506,6 +523,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
b->max_write_same_sectors);
t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
b->max_write_zeroes_sectors);
+ t->max_assign_range_sectors = min(t->max_assign_range_sectors,
+ b->max_assign_range_sectors);
t->bounce_pfn = min_not_zero(t->bounce_pfn, b->bounce_pfn);
t->seg_boundary_mask = min_not_zero(t->seg_boundary_mask,
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 05741c6f618b..14b1fbed40f6 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -41,6 +41,7 @@ bool blk_req_needs_zone_write_lock(struct request *rq)
switch (req_op(rq)) {
case REQ_OP_WRITE_ZEROES:
+ case REQ_OP_ASSIGN_RANGE:
case REQ_OP_WRITE_SAME:
case REQ_OP_WRITE:
return blk_rq_zone_is_seq(rq);
diff --git a/block/bounce.c b/block/bounce.c
index f8ed677a1bf7..0eeb20b290ec 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -257,6 +257,7 @@ static struct bio *bounce_clone_bio(struct bio *bio_src, gfp_t gfp_mask,
case REQ_OP_DISCARD:
case REQ_OP_SECURE_ERASE:
case REQ_OP_WRITE_ZEROES:
+ case REQ_OP_ASSIGN_RANGE:
break;
case REQ_OP_WRITE_SAME:
bio->bi_io_vec[bio->bi_vcnt++] = bio_src->bi_io_vec[0];
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 853d92ceee64..8617abfc6f78 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -64,7 +64,8 @@ static inline bool bio_has_data(struct bio *bio)
bio->bi_iter.bi_size &&
bio_op(bio) != REQ_OP_DISCARD &&
bio_op(bio) != REQ_OP_SECURE_ERASE &&
- bio_op(bio) != REQ_OP_WRITE_ZEROES)
+ bio_op(bio) != REQ_OP_WRITE_ZEROES &&
+ bio_op(bio) != REQ_OP_ASSIGN_RANGE)
return true;
return false;
@@ -75,7 +76,8 @@ static inline bool bio_no_advance_iter(struct bio *bio)
return bio_op(bio) == REQ_OP_DISCARD ||
bio_op(bio) == REQ_OP_SECURE_ERASE ||
bio_op(bio) == REQ_OP_WRITE_SAME ||
- bio_op(bio) == REQ_OP_WRITE_ZEROES;
+ bio_op(bio) == REQ_OP_WRITE_ZEROES ||
+ bio_op(bio) == REQ_OP_ASSIGN_RANGE;
}
static inline bool bio_mergeable(struct bio *bio)
@@ -178,7 +180,7 @@ static inline unsigned bio_segments(struct bio *bio)
struct bvec_iter iter;
/*
- * We special case discard/write same/write zeroes, because they
+ * We special case discard/write same/write zeroes/assign range, because they
* interpret bi_size differently:
*/
@@ -186,6 +188,7 @@ static inline unsigned bio_segments(struct bio *bio)
case REQ_OP_DISCARD:
case REQ_OP_SECURE_ERASE:
case REQ_OP_WRITE_ZEROES:
+ case REQ_OP_ASSIGN_RANGE:
return 0;
case REQ_OP_WRITE_SAME:
return 1;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 70254ae11769..bef450026044 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -296,6 +296,8 @@ enum req_opf {
REQ_OP_ZONE_CLOSE = 11,
/* Transition a zone to full */
REQ_OP_ZONE_FINISH = 12,
+ /* Assign a sector range */
+ REQ_OP_ASSIGN_RANGE = 15,
/* SCSI passthrough using struct scsi_request */
REQ_OP_SCSI_IN = 32,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f629d40c645c..3a63c14e2cbc 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -336,6 +336,7 @@ struct queue_limits {
unsigned int max_hw_discard_sectors;
unsigned int max_write_same_sectors;
unsigned int max_write_zeroes_sectors;
+ unsigned int max_assign_range_sectors;
unsigned int discard_granularity;
unsigned int discard_alignment;
@@ -747,6 +748,9 @@ static inline bool rq_mergeable(struct request *rq)
if (req_op(rq) == REQ_OP_WRITE_ZEROES)
return false;
+ if (req_op(rq) == REQ_OP_ASSIGN_RANGE)
+ return false;
+
if (rq->cmd_flags & REQ_NOMERGE_FLAGS)
return false;
if (rq->rq_flags & RQF_NOMERGE_FLAGS)
@@ -1000,6 +1004,10 @@ static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
if (unlikely(op == REQ_OP_WRITE_ZEROES))
return q->limits.max_write_zeroes_sectors;
+ if (unlikely(op == REQ_OP_ASSIGN_RANGE))
+ return min(q->limits.max_assign_range_sectors,
+ UINT_MAX >> SECTOR_SHIFT);
+
return q->limits.max_sectors;
}
@@ -1077,6 +1085,8 @@ extern void blk_queue_max_write_same_sectors(struct request_queue *q,
unsigned int max_write_same_sectors);
extern void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
unsigned int max_write_same_sectors);
+extern void blk_queue_max_assign_range_sectors(struct request_queue *q,
+ unsigned int max_assign_range_sectors);
extern void blk_queue_logical_block_size(struct request_queue *, unsigned int);
extern void blk_queue_physical_block_size(struct request_queue *, unsigned int);
extern void blk_queue_alignment_offset(struct request_queue *q,
@@ -1246,6 +1256,20 @@ static inline int sb_issue_zeroout(struct super_block *sb, sector_t block,
gfp_mask, 0);
}
+extern int blkdev_issue_assign_range(struct block_device *bdev, sector_t sector,
+ sector_t nr_sects, gfp_t gfp_mask);
+
+static inline int sb_issue_assign_range(struct super_block *sb, sector_t block,
+ sector_t nr_blocks, gfp_t gfp_mask)
+{
+ return blkdev_issue_assign_range(sb->s_bdev,
+ block << (sb->s_blocksize_bits -
+ SECTOR_SHIFT),
+ nr_blocks << (sb->s_blocksize_bits -
+ SECTOR_SHIFT),
+ gfp_mask);
+}
+
extern int blk_verify_command(unsigned char *cmd, fmode_t mode);
enum blk_default_limits {
@@ -1427,6 +1451,16 @@ static inline unsigned int bdev_write_zeroes_sectors(struct block_device *bdev)
return 0;
}
+static inline unsigned int bdev_assign_range_sectors(struct block_device *bdev)
+{
+ struct request_queue *q = bdev_get_queue(bdev);
+
+ if (q)
+ return q->limits.max_assign_range_sectors;
+
+ return 0;
+}
+
static inline enum blk_zoned_model bdev_zoned_model(struct block_device *bdev)
{
struct request_queue *q = bdev_get_queue(bdev);
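The entry check in blkdev_issue_assign_range() can be modeled outside the kernel. The sketch below (illustrative userspace C, not the kernel implementation) shows why bs_mask works: OR-ing the start sector and the length and masking catches a sub-logical-block remainder in either value with one test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of the alignment check at the top of blkdev_issue_assign_range():
 * both the start sector and the length must be multiples of the logical
 * block size expressed in 512-byte sectors. bs_mask has the low bits set
 * for any remainder smaller than one logical block.
 */
static bool assign_range_aligned(uint64_t sector, uint64_t nr_sects,
                                 unsigned int logical_block_size)
{
    uint64_t bs_mask = (logical_block_size >> 9) - 1;

    /* one combined test instead of checking sector and length separately */
    return ((sector | nr_sects) & bs_mask) == 0;
}
```

For a 4096-byte logical block, bs_mask is 7, so both the sector and the count must be multiples of 8; for 512-byte logical blocks the mask is 0 and every range passes.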
--
2.22.0
* [PATCH 3/4] loop: Forward REQ_OP_ASSIGN_RANGE into fallocate(0)
2020-03-29 17:47 [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE Chaitanya Kulkarni
2020-03-29 17:47 ` [PATCH 1/4] block: create payloadless issue bio helper Chaitanya Kulkarni
2020-03-29 17:47 ` [PATCH 2/4] block: Add support for REQ_OP_ASSIGN_RANGE Chaitanya Kulkarni
@ 2020-03-29 17:47 ` Chaitanya Kulkarni
2020-03-29 17:47 ` [PATCH 4/4] ext4: Notify block device about alloc-assigned blk Chaitanya Kulkarni
` (3 subsequent siblings)
6 siblings, 0 replies; 18+ messages in thread
From: Chaitanya Kulkarni @ 2020-03-29 17:47 UTC (permalink / raw)
To: hch, martin.petersen
Cc: darrick.wong, axboe, tytso, adilger.kernel, ming.lei, jthumshirn,
minwoo.im.dev, chaitanya.kulkarni, damien.lemoal, andrea.parri,
hare, tj, hannes, khlebnikov, ajay.joshi, bvanassche, arnd,
houtao1, asml.silence, linux-block, linux-ext4, Kirill Tkhai
From: Kirill Tkhai <ktkhai@virtuozzo.com>
Send a fallocate(0) request to the underlying filesystem after the
upper filesystem has sent a REQ_OP_ASSIGN_RANGE request to the block
device.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
[Use blk_queue_max_assign_range_sectors() from newly updated previous
patch.]
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
---
drivers/block/loop.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 739b372a5112..0a28db66c485 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -609,6 +609,8 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
FALLOC_FL_PUNCH_HOLE);
case REQ_OP_DISCARD:
return lo_fallocate(lo, rq, pos, FALLOC_FL_PUNCH_HOLE);
+ case REQ_OP_ASSIGN_RANGE:
+ return lo_fallocate(lo, rq, pos, 0);
case REQ_OP_WRITE:
if (lo->transfer)
return lo_write_transfer(lo, rq, pos);
@@ -876,6 +878,7 @@ static void loop_config_discard(struct loop_device *lo)
q->limits.discard_granularity = 0;
q->limits.discard_alignment = 0;
blk_queue_max_discard_sectors(q, 0);
+ blk_queue_max_assign_range_sectors(q, 0);
blk_queue_max_write_zeroes_sectors(q, 0);
blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
return;
@@ -886,6 +889,7 @@ static void loop_config_discard(struct loop_device *lo)
blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
+ blk_queue_max_assign_range_sectors(q, UINT_MAX >> 9);
blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
}
@@ -1917,6 +1921,7 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
case REQ_OP_FLUSH:
case REQ_OP_DISCARD:
case REQ_OP_WRITE_ZEROES:
+ case REQ_OP_ASSIGN_RANGE:
cmd->use_aio = false;
break;
default:
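The op-to-mode mapping this patch extends can be sketched as below. The enum values and FALLOC_* numbers are hypothetical stand-ins for the kernel's real REQ_OP_* and FALLOC_FL_* constants (which live in <linux/blk_types.h> and <linux/falloc.h>); the point is only the shape of the dispatch in do_req_filebacked().

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel constants. */
enum sketch_op { SK_OP_DISCARD, SK_OP_WRITE_ZEROES, SK_OP_ASSIGN_RANGE };

#define SK_FALLOC_FL_KEEP_SIZE  0x01
#define SK_FALLOC_FL_PUNCH_HOLE 0x02

/*
 * Sketch of how the loop driver picks a fallocate() mode for the backing
 * file: discard punches a hole, while the new REQ_OP_ASSIGN_RANGE maps to
 * mode 0, i.e. a plain allocating fallocate.
 */
static int sketch_falloc_mode(enum sketch_op op)
{
    switch (op) {
    case SK_OP_DISCARD:
        /* free the backing range, keeping the file size */
        return SK_FALLOC_FL_PUNCH_HOLE | SK_FALLOC_FL_KEEP_SIZE;
    case SK_OP_ASSIGN_RANGE:
        /* fallocate(0): preallocate, preserving existing data */
        return 0;
    default:
        return -1; /* other ops are handled by different paths */
    }
}
```

Mode 0 is what batches extent allocation in the underlying filesystem, which is the whole point of forwarding the op.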
--
2.22.0
* [PATCH 4/4] ext4: Notify block device about alloc-assigned blk
2020-03-29 17:47 [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE Chaitanya Kulkarni
` (2 preceding siblings ...)
2020-03-29 17:47 ` [PATCH 3/4] loop: Forward REQ_OP_ASSIGN_RANGE into fallocate(0) Chaitanya Kulkarni
@ 2020-03-29 17:47 ` Chaitanya Kulkarni
2020-04-01 6:22 ` [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE Konstantin Khlebnikov
` (2 subsequent siblings)
6 siblings, 0 replies; 18+ messages in thread
From: Chaitanya Kulkarni @ 2020-03-29 17:47 UTC (permalink / raw)
To: hch, martin.petersen
Cc: darrick.wong, axboe, tytso, adilger.kernel, ming.lei, jthumshirn,
minwoo.im.dev, chaitanya.kulkarni, damien.lemoal, andrea.parri,
hare, tj, hannes, khlebnikov, ajay.joshi, bvanassche, arnd,
houtao1, asml.silence, linux-block, linux-ext4, Kirill Tkhai
From: Kirill Tkhai <ktkhai@virtuozzo.com>
Call sb_issue_assign_range() after an extent range has been allocated
on user request. Hopefully this helps the block device maintain its
internals in the best way, where applicable.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
---
fs/ext4/ext4.h | 2 ++
fs/ext4/extents.c | 12 +++++++++++-
2 files changed, 13 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 61b37a052052..0d0fa9904147 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -622,6 +622,8 @@ enum {
* allows jbd2 to avoid submitting data before commit. */
#define EXT4_GET_BLOCKS_IO_SUBMIT 0x0400
+#define EXT4_GET_BLOCKS_SUBMIT_ALLOC 0x0800
+
/*
* The bit position of these flags must not overlap with any of the
* EXT4_GET_BLOCKS_*. They are used by ext4_find_extent(),
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 954013d6076b..598b700c4d4c 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4449,6 +4449,14 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
ar.len = allocated;
got_allocated_blocks:
+ if ((flags & EXT4_GET_BLOCKS_SUBMIT_ALLOC) &&
+ inode->i_fop->fallocate) {
+ err = sb_issue_assign_range(inode->i_sb, newblock,
+ EXT4_C2B(sbi, allocated_clusters), GFP_NOFS);
+ if (err)
+ goto free_on_err;
+ }
+
/* try to insert new extent into found leaf and return */
ext4_ext_store_pblock(&newex, newblock + offset);
newex.ee_len = cpu_to_le16(ar.len);
@@ -4466,6 +4474,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
err = ext4_ext_insert_extent(handle, inode, &path,
&newex, flags);
+free_on_err:
if (err && free_on_err) {
int fb_flags = flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE ?
EXT4_FREE_BLOCKS_NO_QUOT_UPDATE : 0;
@@ -4733,7 +4742,8 @@ static long ext4_zero_range(struct file *file, loff_t offset,
goto out_mutex;
}
- flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT;
+ flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT |
+ EXT4_GET_BLOCKS_SUBMIT_ALLOC;
if (mode & FALLOC_FL_KEEP_SIZE)
flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
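The sb_issue_assign_range() helper this patch calls converts filesystem blocks into the 512-byte sectors the block layer expects. A minimal sketch of that conversion (illustrative userspace C; SECTOR_SHIFT of 9 matches the block layer's definition):

```c
#include <assert.h>
#include <stdint.h>

#define SK_SECTOR_SHIFT 9  /* 512-byte sectors, as in the block layer */

/*
 * Sketch of the unit conversion inside sb_issue_assign_range(): the
 * filesystem counts in blocks of (1 << blocksize_bits) bytes, so block
 * numbers and block counts are shifted left by
 * (blocksize_bits - SECTOR_SHIFT) to become sector values.
 */
static uint64_t fs_blocks_to_sectors(uint64_t blocks,
                                     unsigned int blocksize_bits)
{
    return blocks << (blocksize_bits - SK_SECTOR_SHIFT);
}
```

With 4 KiB filesystem blocks (blocksize_bits = 12) each block is 8 sectors; with 1 KiB blocks it is 2.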
--
2.22.0
* Re: [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE
2020-03-29 17:47 [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE Chaitanya Kulkarni
` (3 preceding siblings ...)
2020-03-29 17:47 ` [PATCH 4/4] ext4: Notify block device about alloc-assigned blk Chaitanya Kulkarni
@ 2020-04-01 6:22 ` Konstantin Khlebnikov
2020-04-02 2:29 ` Martin K. Petersen
2020-04-02 22:41 ` Dave Chinner
[not found] ` <(Chaitanya>
6 siblings, 1 reply; 18+ messages in thread
From: Konstantin Khlebnikov @ 2020-04-01 6:22 UTC (permalink / raw)
To: Chaitanya Kulkarni, hch, martin.petersen
Cc: darrick.wong, axboe, tytso, adilger.kernel, ming.lei, jthumshirn,
minwoo.im.dev, damien.lemoal, andrea.parri, hare, tj, hannes,
ajay.joshi, bvanassche, arnd, houtao1, asml.silence, linux-block,
linux-ext4
On 29/03/2020 20.47, Chaitanya Kulkarni wrote:
> Hi,
>
> This patch-series is based on the original RFC patch series:-
> https://www.spinics.net/lists/linux-block/msg47933.html.
>
> I've designed a rough testcase based on the information present
> in the mailing list archive for original RFC, it may need
> some corrections from the author.
>
> If anyone is interested, test results are at the end of this patch.
>
> Following is the original cover-letter :-
>
> Information about continuous extent placement may be useful
> for some block devices. Say, distributed network filesystems,
> which provide block device interface, may use this information
> for better blocks placement over the nodes in their cluster,
> and for better performance. Block devices, which map a file
> on another filesystem (loop), may request the same length extent
> on the underlying filesystem for less fragmentation and for batching
> allocation requests. Also, hypervisors like QEMU may use this
> information for optimization of cluster allocations.
>
> This patchset introduces REQ_OP_ASSIGN_RANGE, which is going
> to be used for forwarding user's fallocate(0) requests into
> block device internals. It is rather similar to the existing
> REQ_OP_DISCARD, REQ_OP_WRITE_ZEROES, etc. The corresponding
> exported primitive is called blkdev_issue_assign_range().
What are the exact semantics of that?
May/must it preserve existing data, may/must it discard it, or may it fill the range with random garbage?
Obviously I prefer the weakest one: may discard data, may return garbage, may do nothing.
I.e. the lower layer could reuse blocks without zeroing; for encrypted storage this is even safe.
So this works as a third type of discard, in addition to REQ_OP_DISCARD and REQ_OP_SECURE_ERASE.
> See [1/3] for the details.
>
> Patch [2/3] teaches loop driver to handle REQ_OP_ASSIGN_RANGE
> requests by calling fallocate(0).
>
> Patch [3/3] makes ext4 to notify a block device about fallocate(0).
>
> Here is a simple test I did:
> https://gist.github.com/tkhai/5b788651cdb74c1dbff3500745878856
>
> I attached a file on ext4 to loop. Then, created ext4 partition
> on loop device and started the test in the partition. Direct-io
> is enabled on loop.
>
> The test fallocates 4G file and writes from some offset with
> given step, then it chooses another offset and repeats. After
> the test all the blocks in the file become written.
>
> The results shows that batching extents-assigning requests improves
> the performance:
>
> Before patchset: real ~ 1min 27sec
> After patchset: real ~ 1min 16sec (18% better)
>
> Ordinary fallocate() before writes improves the performance
> by batching the requests. These results just show that the same
> holds when forwarding extent information to the underlying
> filesystem.
>
> Regards,
> Chaitanya
>
> Changes from RFC:-
>
> 1. Add missing plumbing for REQ_OP_ASSIGN_RANGE similar to write-zeores.
> 2. Add a prep patch to create a helper to submit payloadless bios.
> 3. Design a testcases around the description present in the
> cover-letter.
>
> Chaitanya Kulkarni (1):
> block: create payloadless issue bio helper
>
> Kirill Tkhai (3):
> block: Add support for REQ_OP_ASSIGN_RANGE
> loop: Forward REQ_OP_ASSIGN_RANGE into fallocate(0)
> ext4: Notify block device about alloc-assigned blk
>
> block/blk-core.c | 5 ++
> block/blk-lib.c | 115 +++++++++++++++++++++++++++++++-------
> block/blk-merge.c | 21 +++++++
> block/blk-settings.c | 19 +++++++
> block/blk-zoned.c | 1 +
> block/bounce.c | 1 +
> drivers/block/loop.c | 5 ++
> fs/ext4/ext4.h | 2 +
> fs/ext4/extents.c | 12 +++-
> include/linux/bio.h | 9 ++-
> include/linux/blk_types.h | 2 +
> include/linux/blkdev.h | 34 +++++++++++
> 12 files changed, 201 insertions(+), 25 deletions(-)
>
> 1. Setup :-
> -----------
> # git log --oneline -5
> c64a4c781915 (HEAD -> req-op-assign-range) ext4: Notify block device about alloc-assigned blk
> 000cbc6720a4 loop: Forward REQ_OP_ASSIGN_RANGE into fallocate(0)
> 89ceed8cac80 block: Add support for REQ_OP_ASSIGN_RANGE
> a798743e87e7 block: create payloadless issue bio helper
> b53df2e7442c (tag: block-5.6-2020-03-13) block: Fix partition support for host aware zoned block devices
>
> # cat /proc/kallsyms | grep -i blkdev_issue_assign_range
> ffffffffa3264a80 T blkdev_issue_assign_range
> ffffffffa4027184 r __ksymtab_blkdev_issue_assign_range
> ffffffffa40524be r __kstrtabns_blkdev_issue_assign_range
> ffffffffa405a8eb r __kstrtab_blkdev_issue_assign_range
>
> 2. Test program, will be moved to blktest once code is upstream :-
> -----------------
> #define _GNU_SOURCE
> #include <sys/types.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <stdio.h>
> #include <fcntl.h>
> #include <errno.h>
>
> #define BLOCK_SIZE 4096
> #define STEP (BLOCK_SIZE * 16)
> #define SIZE (1024 * 1024 * 1024ULL)
>
> int main(int argc, char *argv[])
> {
> int fd, step, ret = 0;
> unsigned long i;
> void *buf;
>
> if (posix_memalign(&buf, BLOCK_SIZE, SIZE)) {
> perror("alloc");
> exit(1);
> }
>
> fd = open("/mnt/loop0/file.img", O_RDWR | O_CREAT | O_DIRECT);
> if (fd < 0) {
> perror("open");
> exit(1);
> }
>
> if (ftruncate(fd, SIZE)) {
> perror("ftruncate");
> exit(1);
> }
>
> ret = fallocate(fd, 0, 0, SIZE);
> if (ret) {
> perror("fallocate");
> exit(1);
> }
>
> for (step = STEP - BLOCK_SIZE; step >= 0; step -= BLOCK_SIZE) {
> printf("step=%u\n", step);
> for (i = step; i < SIZE; i += STEP) {
> errno = 0;
> if (pwrite(fd, buf, BLOCK_SIZE, i) != BLOCK_SIZE) {
> perror("pwrite");
> exit(1);
> }
> }
>
> if (fsync(fd)) {
> perror("fsync");
> exit(1);
> }
> }
> return 0;
> }
>
> 3. Test script, will be moved to blktests once code is upstream :-
> ------------------------------------------------------------------
> # cat req_op_assign_test.sh
> #!/bin/bash -x
>
> NULLB_FILE="/mnt/backend/data"
> NULLB_MNT="/mnt/backend"
> LOOP_MNT="/mnt/loop0"
>
> delete_loop()
> {
> umount ${LOOP_MNT}
> losetup -D
> sleep 3
> }
>
> delete_nullb()
> {
> umount ${NULLB_MNT}
> echo 1 > config/nullb/nullb0/power
> rmdir config/nullb/nullb0
> sleep 3
> }
>
> unload_modules()
> {
> rmmod drivers/block/loop.ko
> rmmod fs/ext4/ext4.ko
> rmmod drivers/block/null_blk.ko
> lsmod | grep -e ext4 -e loop -e null_blk
> }
>
> unload()
> {
> delete_loop
> delete_nullb
> unload_modules
> }
>
> load_ext4()
> {
> make -j $(nproc) M=fs/ext4 modules
> local src=fs/ext4/
> local dest=/lib/modules/`uname -r`/kernel/fs/ext4
> \cp ${src}/ext4.ko ${dest}/
>
> modprobe mbcache
> modprobe jbd2
> sleep 1
> insmod fs/ext4/ext4.ko
> sleep 1
> }
>
> load_nullb()
> {
> local src=drivers/block/
> local dest=/lib/modules/`uname -r`/kernel/drivers/block
> \cp ${src}/null_blk.ko ${dest}/
>
> modprobe null_blk nr_devices=0
> sleep 1
>
> mkdir config/nullb/nullb0
> tree config/nullb/nullb0
>
> echo 1 > config/nullb/nullb0/memory_backed
> echo 512 > config/nullb/nullb0/blocksize
>
> # 20 GB
> echo 20480 > config/nullb/nullb0/size
> echo 1 > config/nullb/nullb0/power
> sleep 2
> IDX=`cat config/nullb/nullb0/index`
> lsblk | grep null${IDX}
> sleep 1
>
> mkfs.ext4 /dev/nullb0
> mount /dev/nullb0 ${NULLB_MNT}
> sleep 1
> mount | grep nullb
>
> # 10 GB
> dd if=/dev/zero of=${NULLB_FILE} count=2621440 bs=4096
> }
>
> load_loop()
> {
> local src=drivers/block/
> local dest=/lib/modules/`uname -r`/kernel/drivers/block
> \cp ${src}/loop.ko ${dest}/
>
> insmod drivers/block/loop.ko max_loop=1
> sleep 3
> /root/util-linux/losetup --direct-io=off /dev/loop0 ${NULLB_FILE}
> sleep 3
> /root/util-linux/losetup
> ls -l /dev/loop*
> dmesg -c
> mkfs.ext4 /dev/loop0
> mount /dev/loop0 ${LOOP_MNT}
> mount | grep loop0
> }
>
> load()
> {
> make -j $(nproc) M=drivers/block modules
>
> load_ext4
> load_nullb
> load_loop
> sleep 1
> sync
> sync
> sync
> }
>
> unload
> load
> time ./test
>
> 4. Test Results :-
> ------------------
>
> # ./req_op_assign_test.sh
> + NULLB_FILE=/mnt/backend/data
> + NULLB_MNT=/mnt/backend
> + LOOP_MNT=/mnt/loop0
> + unload
> + delete_loop
> + umount /mnt/loop0
> + losetup -D
> + sleep 3
> + delete_nullb
> + umount /mnt/backend
> + echo 1
> + rmdir config/nullb/nullb0
> + sleep 3
> + unload_modules
> + rmmod drivers/block/loop.ko
> + rmmod fs/ext4/ext4.ko
> + rmmod drivers/block/null_blk.ko
> + lsmod
> + grep -e ext4 -e loop -e null_blk
> + load
> ++ nproc
> + make -j 32 M=drivers/block modules
> CC [M] drivers/block/loop.o
> MODPOST 11 modules
> CC [M] drivers/block/loop.mod.o
> LD [M] drivers/block/loop.ko
> + load_ext4
> ++ nproc
> + make -j 32 M=fs/ext4 modules
> CC [M] fs/ext4/balloc.o
> CC [M] fs/ext4/bitmap.o
> CC [M] fs/ext4/block_validity.o
> CC [M] fs/ext4/dir.o
> CC [M] fs/ext4/ext4_jbd2.o
> CC [M] fs/ext4/extents.o
> CC [M] fs/ext4/extents_status.o
> CC [M] fs/ext4/file.o
> CC [M] fs/ext4/fsmap.o
> CC [M] fs/ext4/fsync.o
> CC [M] fs/ext4/hash.o
> CC [M] fs/ext4/ialloc.o
> CC [M] fs/ext4/indirect.o
> CC [M] fs/ext4/inline.o
> CC [M] fs/ext4/inode.o
> CC [M] fs/ext4/ioctl.o
> CC [M] fs/ext4/mballoc.o
> CC [M] fs/ext4/migrate.o
> CC [M] fs/ext4/mmp.o
> CC [M] fs/ext4/move_extent.o
> CC [M] fs/ext4/namei.o
> CC [M] fs/ext4/page-io.o
> CC [M] fs/ext4/readpage.o
> CC [M] fs/ext4/resize.o
> CC [M] fs/ext4/super.o
> CC [M] fs/ext4/symlink.o
> CC [M] fs/ext4/sysfs.o
> CC [M] fs/ext4/xattr.o
> CC [M] fs/ext4/xattr_trusted.o
> CC [M] fs/ext4/xattr_user.o
> CC [M] fs/ext4/acl.o
> CC [M] fs/ext4/xattr_security.o
> LD [M] fs/ext4/ext4.o
> MODPOST 1 modules
> LD [M] fs/ext4/ext4.ko
> + local src=fs/ext4/
> ++ uname -r
> + local dest=/lib/modules/5.6.0-rc3lbk+/kernel/fs/ext4
> + cp fs/ext4//ext4.ko /lib/modules/5.6.0-rc3lbk+/kernel/fs/ext4/
> + modprobe mbcache
> + modprobe jbd2
> + sleep 1
> + insmod fs/ext4/ext4.ko
> + sleep 1
> + load_nullb
> + local src=drivers/block/
> ++ uname -r
> + local dest=/lib/modules/5.6.0-rc3lbk+/kernel/drivers/block
> + cp drivers/block//null_blk.ko /lib/modules/5.6.0-rc3lbk+/kernel/drivers/block/
> + modprobe null_blk nr_devices=0
> + sleep 1
> + mkdir config/nullb/nullb0
> + tree config/nullb/nullb0
> config/nullb/nullb0
> ├── badblocks
> ├── blocking
> ├── blocksize
> ├── cache_size
> ├── completion_nsec
> ├── discard
> ├── home_node
> ├── hw_queue_depth
> ├── index
> ├── irqmode
> ├── mbps
> ├── memory_backed
> ├── power
> ├── queue_mode
> ├── size
> ├── submit_queues
> ├── use_per_node_hctx
> ├── zoned
> ├── zone_nr_conv
> └── zone_size
>
> 0 directories, 20 files
> + echo 1
> + echo 512
> + echo 20480
> + echo 1
> + sleep 2
> ++ cat config/nullb/nullb0/index
> + IDX=0
> + lsblk
> + grep null0
> + sleep 1
> + mkfs.ext4 /dev/nullb0
> mke2fs 1.42.9 (28-Dec-2013)
> Filesystem label=
> OS type: Linux
> Block size=4096 (log=2)
> Fragment size=4096 (log=2)
> Stride=0 blocks, Stripe width=0 blocks
> 1310720 inodes, 5242880 blocks
> 262144 blocks (5.00%) reserved for the super user
> First data block=0
> Maximum filesystem blocks=2153775104
> 160 block groups
> 32768 blocks per group, 32768 fragments per group
> 8192 inodes per group
> Superblock backups stored on blocks:
> 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> 4096000
>
> Allocating group tables: done
> Writing inode tables: done
> Creating journal (32768 blocks): done
> Writing superblocks and filesystem accounting information: done
>
> + mount /dev/nullb0 /mnt/backend
> + sleep 1
> + mount
> + grep nullb
> /dev/nullb0 on /mnt/backend type ext4 (rw,relatime,seclabel)
> + dd if=/dev/zero of=/mnt/backend/data count=2621440 bs=4096
> 2621440+0 records in
> 2621440+0 records out
> 10737418240 bytes (11 GB) copied, 27.4579 s, 391 MB/s
> + load_loop
> + local src=drivers/block/
> ++ uname -r
> + local dest=/lib/modules/5.6.0-rc3lbk+/kernel/drivers/block
> + cp drivers/block//loop.ko /lib/modules/5.6.0-rc3lbk+/kernel/drivers/block/
> + insmod drivers/block/loop.ko max_loop=1
> + sleep 3
> + /root/util-linux/losetup --direct-io=off /dev/loop0 /mnt/backend/data
> + sleep 3
> + /root/util-linux/losetup
> NAME SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE DIO LOG-SEC
> /dev/loop0 0 0 0 0 /mnt/backend/data 0 512
> + ls -l /dev/loop0 /dev/loop-control
> brw-rw----. 1 root disk 7, 0 Mar 29 10:28 /dev/loop0
> crw-rw----. 1 root disk 10, 237 Mar 29 10:28 /dev/loop-control
> + dmesg -c
> [42963.967060] null_blk: module loaded
> [42968.419481] EXT4-fs (nullb0): mounted filesystem with ordered data mode. Opts: (null)
> [42996.928141] loop: module loaded
> + mkfs.ext4 /dev/loop0
> mke2fs 1.42.9 (28-Dec-2013)
> Discarding device blocks: done
> Filesystem label=
> OS type: Linux
> Block size=4096 (log=2)
> Fragment size=4096 (log=2)
> Stride=0 blocks, Stripe width=0 blocks
> 655360 inodes, 2621440 blocks
> 131072 blocks (5.00%) reserved for the super user
> First data block=0
> Maximum filesystem blocks=2151677952
> 80 block groups
> 32768 blocks per group, 32768 fragments per group
> 8192 inodes per group
> Superblock backups stored on blocks:
> 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632
>
> Allocating group tables: done
> Writing inode tables: done
> Creating journal (32768 blocks): done
> Writing superblocks and filesystem accounting information: done
>
> + mount /dev/loop0 /mnt/loop0
> + mount
> + grep loop0
> /dev/loop0 on /mnt/loop0 type ext4 (rw,relatime,seclabel)
> + sleep 1
> + sync
> + sync
> + sync
> + ./test
> step=61440
> step=57344
> step=53248
> step=49152
> step=45056
> step=40960
> step=36864
> step=32768
> step=28672
> step=24576
> step=20480
> step=16384
> step=12288
> step=8192
> step=4096
> step=0
>
> real 9m34.472s
> user 0m0.062s
> sys 0m5.783s
>
^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE
2020-04-01 6:22 ` [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE Konstantin Khlebnikov
@ 2020-04-02 2:29 ` Martin K. Petersen
2020-04-02 9:49 ` Konstantin Khlebnikov
0 siblings, 1 reply; 18+ messages in thread
From: Martin K. Petersen @ 2020-04-02 2:29 UTC (permalink / raw)
To: Konstantin Khlebnikov
Cc: Chaitanya Kulkarni, hch, martin.petersen, darrick.wong, axboe,
tytso, adilger.kernel, ming.lei, jthumshirn, minwoo.im.dev,
damien.lemoal, andrea.parri, hare, tj, hannes, ajay.joshi,
bvanassche, arnd, houtao1, asml.silence, linux-block, linux-ext4
Konstantin,
>> The corresponding exported primitive is called
>> blkdev_issue_assign_range().
>
> What are the exact semantics of that?
REQ_OP_ALLOCATE will be used to compel a device to allocate a block
range. What a given block contains after successful allocation is
undefined (depends on the device implementation).
For block allocation with deterministic zeroing, one must keep using
REQ_OP_WRITE_ZEROES with the NOUNMAP flag set.
--
Martin K. Petersen Oracle Linux Engineering
* Re: [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE
2020-04-02 2:29 ` Martin K. Petersen
@ 2020-04-02 9:49 ` Konstantin Khlebnikov
0 siblings, 0 replies; 18+ messages in thread
From: Konstantin Khlebnikov @ 2020-04-02 9:49 UTC (permalink / raw)
To: Martin K. Petersen
Cc: Chaitanya Kulkarni, hch, darrick.wong, axboe, tytso,
adilger.kernel, ming.lei, jthumshirn, minwoo.im.dev,
damien.lemoal, andrea.parri, hare, tj, hannes, ajay.joshi,
bvanassche, arnd, houtao1, asml.silence, linux-block, linux-ext4
On 02/04/2020 05.29, Martin K. Petersen wrote:
>
> Konstantin,
>
>>> The corresponding exported primitive is called
>>> blkdev_issue_assign_range().
>>
>> What are the exact semantics of that?
>
> REQ_OP_ALLOCATE will be used to compel a device to allocate a block
> range. What a given block contains after successful allocation is
> undefined (depends on the device implementation).
Ok. Then REQ_OP_ALLOCATE should be accounted as a discard rather than a write.
That's decided by the helper op_is_discard(), which is used only for statistics.
It seems REQ_OP_SECURE_ERASE should also be accounted this way.
>
> For block allocation with deterministic zeroing, one must keep using
> REQ_OP_WRITE_ZEROES with the NOUNMAP flag set.
>
* Re: [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE
2020-03-29 17:47 [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE Chaitanya Kulkarni
` (4 preceding siblings ...)
2020-04-01 6:22 ` [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE Konstantin Khlebnikov
@ 2020-04-02 22:41 ` Dave Chinner
2020-04-03 1:34 ` Martin K. Petersen
[not found] ` <(Chaitanya>
6 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2020-04-02 22:41 UTC (permalink / raw)
To: Chaitanya Kulkarni
Cc: hch, martin.petersen, darrick.wong, axboe, tytso, adilger.kernel,
ming.lei, jthumshirn, minwoo.im.dev, damien.lemoal, andrea.parri,
hare, tj, hannes, khlebnikov, ajay.joshi, bvanassche, arnd,
houtao1, asml.silence, linux-block, linux-ext4
On Sun, Mar 29, 2020 at 10:47:10AM -0700, Chaitanya Kulkarni wrote:
> Hi,
>
> This patch-series is based on the original RFC patch series:-
> https://www.spinics.net/lists/linux-block/msg47933.html.
>
> I've designed a rough testcase based on the information present
> in the mailing list archive for the original RFC; it may need
> some corrections from the author.
>
> If anyone is interested, test results are at the end of this patch.
>
> Following is the original cover-letter :-
>
> Information about contiguous extent placement may be useful
> for some block devices. Say, distributed network filesystems,
> which provide a block device interface, may use this information
> for better block placement over the nodes in their cluster,
> and for better performance. Block devices which map a file
> on another filesystem (loop) may request an extent of the same
> length on the underlying filesystem for less fragmentation and
> for batching allocation requests. Also, hypervisors like QEMU
> may use this information to optimize cluster allocations.
>
> This patchset introduces REQ_OP_ASSIGN_RANGE, which is going
> to be used for forwarding a user's fallocate(0) requests into
> block device internals. It is rather similar to the existing
> REQ_OP_DISCARD, REQ_OP_WRITE_ZEROES, etc. The corresponding
> exported primitive is called blkdev_issue_assign_range().
> See [1/3] for the details.
>
> Patch [2/3] teaches loop driver to handle REQ_OP_ASSIGN_RANGE
> requests by calling fallocate(0).
>
> Patch [3/3] makes ext4 notify a block device about fallocate(0).
Ok, so ext4 has a very limited max allocation size for an extent, so
I expect this won't cause huge latency problems. However, what
happens when we use XFS, have a 64kB block size, and fallocate() is
allocating disk space in contiguous 100GB extents and passing those
down to the block device?
How does this get split by dm devices? Are raid stripes going to
dice this into separate stripe unit sized bios, so instead of single
large requests we end up with hundreds or thousands of tiny
allocation requests being issued?
I know that for the loop device, it is going to serialise all IO to
the backing file while fallocate is run on it. Hence if you have
concurrent IO running, any REQ_OP_ASSIGN_RANGE is going to cause a
significant, measurable latency hit to all those IOs in flight.
How are we expecting hardware to behave here? Is this a queued
command in the scsi/nvme/sata protocols? Or is this, for the moment,
just a special snowflake that we can't actually use in production
because the hardware just can't handle what we throw at it?
IOWs, what sort of latency issues is this operation going to cause
on real hardware? Is this going to be like discard? i.e. where we
end up not using it at all because so few devices actually handle
the massive stream of operations the filesystem will end up sending
the device(s) in the course of normal operations?
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE
2020-04-02 22:41 ` Dave Chinner
@ 2020-04-03 1:34 ` Martin K. Petersen
2020-04-03 2:57 ` Dave Chinner
0 siblings, 1 reply; 18+ messages in thread
From: Martin K. Petersen @ 2020-04-03 1:34 UTC (permalink / raw)
To: Dave Chinner
Cc: Chaitanya Kulkarni, hch, martin.petersen, darrick.wong, axboe,
tytso, adilger.kernel, ming.lei, jthumshirn, minwoo.im.dev,
damien.lemoal, andrea.parri, hare, tj, hannes, khlebnikov,
ajay.joshi, bvanassche, arnd, houtao1, asml.silence, linux-block,
linux-ext4
Hi Dave!
> Ok, so ext4 has a very limited max allocation size for an extent, so
> I expect this won't cause huge latency problems. However, what
> happens when we use XFS, have a 64kB block size, and fallocate() is
> allocating disk space in contiguous 100GB extents and passing those
> down to the block device?
Depends on the device.
> How does this get split by dm devices? Are raid stripes going to dice
> this into separate stripe unit sized bios, so instead of single large
> requests we end up with hundreds or thousands of tiny allocation
> requests being issued?
There is nothing special about this operation. It needs to be handled
the same way as all other splits. I.e. ideally coalesced at the bottom
of the stack so we can issue larger, contiguous commands to the
hardware.
> How are we expecting hardware to behave here? Is this a queued
> command in the scsi/nvme/sata protocols? Or is this, for the moment,
> just a special snowflake that we can't actually use in production
> because the hardware just can't handle what we throw at it?
For now it's SCSI and queued. Only found in high-end thinly provisioned
storage arrays and not in your average SSD.
The performance expectation for REQ_OP_ALLOCATE is that it is faster
than a write to the same block range since the device potentially needs
to do less work. I.e. the device simply needs to decrement the free
space and mark the LBAs reserved in a map. It doesn't need to write all
the blocks to zero them. If you want zeroed blocks, use
REQ_OP_WRITE_ZEROES.
> IOWs, what sort of latency issues is this operation going to cause
> on real hardware? Is this going to be like discard? i.e. where we
> end up not using it at all because so few devices actually handle
> the massive stream of operations the filesystem will end up sending
> the device(s) in the course of normal operations?
The intended use case, from a SCSI perspective, is that on a thinly
provisioned device you can use this operation to preallocate blocks so
that future writes to the LBAs in question will not fail due to the
device being out of space. I.e. you would use this to pin down block
ranges where you can not tolerate write failures. The advantage over
writing the blocks individually is that dedup won't apply and that the
device doesn't actually have to go write all the individual blocks.
--
Martin K. Petersen Oracle Linux Engineering
* Re: [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE
2020-04-03 1:34 ` Martin K. Petersen
@ 2020-04-03 2:57 ` Dave Chinner
[not found] ` <(Dave>
0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2020-04-03 2:57 UTC (permalink / raw)
To: Martin K. Petersen
Cc: Chaitanya Kulkarni, hch, darrick.wong, axboe, tytso,
adilger.kernel, ming.lei, jthumshirn, minwoo.im.dev,
damien.lemoal, andrea.parri, hare, tj, hannes, khlebnikov,
ajay.joshi, bvanassche, arnd, houtao1, asml.silence, linux-block,
linux-ext4
On Thu, Apr 02, 2020 at 09:34:43PM -0400, Martin K. Petersen wrote:
>
> Hi Dave!
>
> > Ok, so ext4 has a very limited max allocation size for an extent, so
> > I expect this won't cause huge latency problems. However, what
> > happens when we use XFS, have a 64kB block size, and fallocate() is
> > allocating disk space in contiguous 100GB extents and passing those
> > down to the block device?
>
> Depends on the device.
Great. :(
> > How does this get split by dm devices? Are raid stripes going to dice
> > this into separate stripe unit sized bios, so instead of single large
> > requests we end up with hundreds or thousands of tiny allocation
> > requests being issued?
>
> There is nothing special about this operation. It needs to be handled
> the same way as all other splits. I.e. ideally coalesced at the bottom
> of the stack so we can issue larger, contiguous commands to the
> hardware.
>
> > How are we expecting hardware to behave here? Is this a queued
> > command in the scsi/nvme/sata protocols? Or is this, for the moment,
> > just a special snowflake that we can't actually use in production
> > because the hardware just can't handle what we throw at it?
>
> For now it's SCSI and queued. Only found in high-end thinly provisioned
> storage arrays and not in your average SSD.
So it's a special snowflake :)
> The performance expectation for REQ_OP_ALLOCATE is that it is faster
> than a write to the same block range since the device potentially needs
> to do less work. I.e. the device simply needs to decrement the free
> space and mark the LBAs reserved in a map. It doesn't need to write all
> the blocks to zero them. If you want zeroed blocks, use
> REQ_OP_WRITE_ZEROES.
I suspect that the implications of wiring filesystems directly up to
this haven't been thought through entirely....
> > IOWs, what sort of latency issues is this operation going to cause
> > on real hardware? Is this going to be like discard? i.e. where we
> > end up not using it at all because so few devices actually handle
> > the massive stream of operations the filesystem will end up sending
> > the device(s) in the course of normal operations?
>
> The intended use case, from a SCSI perspective, is that on a thinly
> provisioned device you can use this operation to preallocate blocks so
> that future writes to the LBAs in question will not fail due to the
> device being out of space. I.e. you would use this to pin down block
> ranges where you can not tolerate write failures. The advantage over
> writing the blocks individually is that dedup won't apply and that the
> device doesn't actually have to go write all the individual blocks.
.... because when backed by thinp storage, plumbing user-level
fallocate() straight through from the filesystem introduces a
trivial, user-level storage DoS vector....
i.e. a user can just fallocate a bunch of files and, because the
filesystem can do that instantly, can also run the back end array
out of space almost instantly. Storage admins are going to love
this!
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com