* [PATCH v3 1/4] brd: use a switch statement in brd_submit_bio
2023-08-10 10:07 [PATCH v3 0/4] brd discard patches Mikulas Patocka
@ 2023-08-10 10:08 ` Mikulas Patocka
2023-08-10 10:09 ` [PATCH v3 2/4] brd: extend the rcu regions to cover read and write Mikulas Patocka
` (4 subsequent siblings)
5 siblings, 0 replies; 10+ messages in thread
From: Mikulas Patocka @ 2023-08-10 10:08 UTC (permalink / raw)
To: Jens Axboe
Cc: Li Nan, Zdenek Kabelac, Christoph Hellwig, Chaitanya Kulkarni,
linux-block, dm-devel
Use a switch statement in brd_submit_bio, so that the code will look
better when we add support for more bio operations.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
---
drivers/block/brd.c | 43 ++++++++++++++++++++++++++-----------------
1 file changed, 26 insertions(+), 17 deletions(-)
Index: linux-2.6/drivers/block/brd.c
===================================================================
--- linux-2.6.orig/drivers/block/brd.c
+++ linux-2.6/drivers/block/brd.c
@@ -243,29 +243,38 @@ out:
static void brd_submit_bio(struct bio *bio)
{
struct brd_device *brd = bio->bi_bdev->bd_disk->private_data;
- sector_t sector = bio->bi_iter.bi_sector;
+ sector_t sector;
struct bio_vec bvec;
struct bvec_iter iter;
- bio_for_each_segment(bvec, bio, iter) {
- unsigned int len = bvec.bv_len;
- int err;
+ switch (bio_op(bio)) {
+ case REQ_OP_READ:
+ case REQ_OP_WRITE:
+ sector = bio->bi_iter.bi_sector;
+ bio_for_each_segment(bvec, bio, iter) {
+ unsigned int len = bvec.bv_len;
+ int err;
- /* Don't support un-aligned buffer */
- WARN_ON_ONCE((bvec.bv_offset & (SECTOR_SIZE - 1)) ||
- (len & (SECTOR_SIZE - 1)));
+ /* Don't support un-aligned buffer */
+ WARN_ON_ONCE((bvec.bv_offset & (SECTOR_SIZE - 1)) ||
+ (len & (SECTOR_SIZE - 1)));
- err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset,
- bio->bi_opf, sector);
- if (err) {
- if (err == -ENOMEM && bio->bi_opf & REQ_NOWAIT) {
- bio_wouldblock_error(bio);
- return;
+ err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset,
+ bio->bi_opf, sector);
+ if (err) {
+ if (err == -ENOMEM && bio->bi_opf & REQ_NOWAIT) {
+ bio_wouldblock_error(bio);
+ return;
+ }
+ bio_io_error(bio);
+ return;
+ }
+ sector += len >> SECTOR_SHIFT;
}
- bio_io_error(bio);
- return;
- }
- sector += len >> SECTOR_SHIFT;
+ break;
+ default:
+ bio->bi_status = BLK_STS_NOTSUPP;
+ break;
}
bio_endio(bio);
* [PATCH v3 2/4] brd: extend the rcu regions to cover read and write
2023-08-10 10:07 [PATCH v3 0/4] brd discard patches Mikulas Patocka
2023-08-10 10:08 ` [PATCH v3 1/4] brd: use a switch statement in brd_submit_bio Mikulas Patocka
@ 2023-08-10 10:09 ` Mikulas Patocka
2023-08-10 10:09 ` [PATCH v3 3/4] brd: enable discard Mikulas Patocka
` (3 subsequent siblings)
5 siblings, 0 replies; 10+ messages in thread
From: Mikulas Patocka @ 2023-08-10 10:09 UTC (permalink / raw)
To: Jens Axboe
Cc: Li Nan, Zdenek Kabelac, Christoph Hellwig, Chaitanya Kulkarni,
linux-block, dm-devel
This patch extends the rcu regions so that a page lookup followed by a
read or write of that page is done inside the rcu read lock. This is
needed for the following patch, which enables discard.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
---
drivers/block/brd.c | 8 ++++++++
1 file changed, 8 insertions(+)
Index: linux-2.6/drivers/block/brd.c
===================================================================
--- linux-2.6.orig/drivers/block/brd.c
+++ linux-2.6/drivers/block/brd.c
@@ -150,23 +150,27 @@ static void copy_to_brd(struct brd_devic
size_t copy;
copy = min_t(size_t, n, PAGE_SIZE - offset);
+ rcu_read_lock();
page = brd_lookup_page(brd, sector);
BUG_ON(!page);
dst = kmap_atomic(page);
memcpy(dst + offset, src, copy);
kunmap_atomic(dst);
+ rcu_read_unlock();
if (copy < n) {
src += copy;
sector += copy >> SECTOR_SHIFT;
copy = n - copy;
+ rcu_read_lock();
page = brd_lookup_page(brd, sector);
BUG_ON(!page);
dst = kmap_atomic(page);
memcpy(dst, src, copy);
kunmap_atomic(dst);
+ rcu_read_unlock();
}
}
@@ -182,6 +186,7 @@ static void copy_from_brd(void *dst, str
size_t copy;
copy = min_t(size_t, n, PAGE_SIZE - offset);
+ rcu_read_lock();
page = brd_lookup_page(brd, sector);
if (page) {
src = kmap_atomic(page);
@@ -189,11 +194,13 @@ static void copy_from_brd(void *dst, str
kunmap_atomic(src);
} else
memset(dst, 0, copy);
+ rcu_read_unlock();
if (copy < n) {
dst += copy;
sector += copy >> SECTOR_SHIFT;
copy = n - copy;
+ rcu_read_lock();
page = brd_lookup_page(brd, sector);
if (page) {
src = kmap_atomic(page);
@@ -201,6 +208,7 @@ static void copy_from_brd(void *dst, str
kunmap_atomic(src);
} else
memset(dst, 0, copy);
+ rcu_read_unlock();
}
}
* [PATCH v3 3/4] brd: enable discard
2023-08-10 10:07 [PATCH v3 0/4] brd discard patches Mikulas Patocka
2023-08-10 10:08 ` [PATCH v3 1/4] brd: use a switch statement in brd_submit_bio Mikulas Patocka
2023-08-10 10:09 ` [PATCH v3 2/4] brd: extend the rcu regions to cover read and write Mikulas Patocka
@ 2023-08-10 10:09 ` Mikulas Patocka
2023-08-10 10:10 ` [PATCH v3 4/4] brd: implement write zeroes Mikulas Patocka
` (2 subsequent siblings)
5 siblings, 0 replies; 10+ messages in thread
From: Mikulas Patocka @ 2023-08-10 10:09 UTC (permalink / raw)
To: Jens Axboe
Cc: Li Nan, Zdenek Kabelac, Christoph Hellwig, Chaitanya Kulkarni,
linux-block, dm-devel
This patch implements discard in the brd driver. We use RCU to free the
page, so that if there are any concurrent readers or writes, they won't
touch the page after it is freed.
Calling "call_rcu" for each page is inefficient, so we attempt to batch
multiple pages into a single "call_rcu" call.
Note that we replace "BUG_ON(!page);" with "if (page) ..." in copy_to_brd:
the page can't be NULL under normal circumstances; it can only be NULL if
REQ_OP_WRITE races with REQ_OP_DISCARD. If these two bios race with each
other on the same page, the result is undefined, so we can handle this
race condition simply by skipping the copy.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
---
drivers/block/brd.c | 144 +++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 131 insertions(+), 13 deletions(-)
Index: linux-2.6/drivers/block/brd.c
===================================================================
--- linux-2.6.orig/drivers/block/brd.c
+++ linux-2.6/drivers/block/brd.c
@@ -46,6 +46,8 @@ struct brd_device {
u64 brd_nr_pages;
};
+static bool discard;
+
/*
* Look up and return a brd's page for a given sector.
*/
@@ -100,6 +102,54 @@ static int brd_insert_page(struct brd_de
return ret;
}
+struct free_page_batch {
+ struct rcu_head rcu;
+ struct list_head list;
+};
+
+static void brd_free_page_rcu(struct rcu_head *head)
+{
+ __free_page(container_of(head, struct page, rcu_head));
+}
+
+static void brd_free_pages_rcu(struct rcu_head *head)
+{
+ struct free_page_batch *batch = container_of(head, struct free_page_batch, rcu);
+
+ while (!list_empty(&batch->list)) {
+ struct page *page = list_entry(batch->list.prev, struct page, lru);
+
+ list_del(&page->lru);
+
+ __free_page(page);
+ }
+
+ kfree(batch);
+}
+
+static void brd_free_page(struct brd_device *brd, sector_t sector,
+ struct free_page_batch **batch)
+{
+ struct page *page;
+ pgoff_t idx;
+
+ idx = sector >> PAGE_SECTORS_SHIFT;
+ page = xa_erase(&brd->brd_pages, idx);
+
+ if (page) {
+ BUG_ON(page->index != idx);
+ if (!*batch) {
+ *batch = kmalloc(sizeof(struct free_page_batch), GFP_NOIO);
+ if (unlikely(!*batch)) {
+ call_rcu(&page->rcu_head, brd_free_page_rcu);
+ return;
+ }
+ INIT_LIST_HEAD(&(*batch)->list);
+ }
+ list_add(&page->lru, &(*batch)->list);
+ }
+}
+
/*
* Free all backing store pages and xarray. This must only be called when
* there are no other users of the device.
@@ -152,11 +202,11 @@ static void copy_to_brd(struct brd_devic
copy = min_t(size_t, n, PAGE_SIZE - offset);
rcu_read_lock();
page = brd_lookup_page(brd, sector);
- BUG_ON(!page);
-
- dst = kmap_atomic(page);
- memcpy(dst + offset, src, copy);
- kunmap_atomic(dst);
+ if (page) {
+ dst = kmap_atomic(page);
+ memcpy(dst + offset, src, copy);
+ kunmap_atomic(dst);
+ }
rcu_read_unlock();
if (copy < n) {
@@ -165,11 +215,11 @@ static void copy_to_brd(struct brd_devic
copy = n - copy;
rcu_read_lock();
page = brd_lookup_page(brd, sector);
- BUG_ON(!page);
-
- dst = kmap_atomic(page);
- memcpy(dst, src, copy);
- kunmap_atomic(dst);
+ if (page) {
+ dst = kmap_atomic(page);
+ memcpy(dst, src, copy);
+ kunmap_atomic(dst);
+ }
rcu_read_unlock();
}
}
@@ -248,6 +298,34 @@ out:
return err;
}
+void brd_do_discard(struct brd_device *brd, struct bio *bio)
+{
+ struct free_page_batch *batch = NULL;
+ sector_t sector, len, front_pad;
+
+ if (unlikely(!discard)) {
+ bio->bi_status = BLK_STS_NOTSUPP;
+ return;
+ }
+
+ sector = bio->bi_iter.bi_sector;
+ len = bio_sectors(bio);
+ front_pad = -sector & (PAGE_SECTORS - 1);
+ sector += front_pad;
+ if (unlikely(len <= front_pad))
+ return;
+ len -= front_pad;
+ len = round_down(len, PAGE_SECTORS);
+ while (len) {
+ brd_free_page(brd, sector, &batch);
+ sector += PAGE_SECTORS;
+ len -= PAGE_SECTORS;
+ cond_resched();
+ }
+ if (batch)
+ call_rcu(&batch->rcu, brd_free_pages_rcu);
+}
+
static void brd_submit_bio(struct bio *bio)
{
struct brd_device *brd = bio->bi_bdev->bd_disk->private_data;
@@ -280,6 +358,9 @@ static void brd_submit_bio(struct bio *b
sector += len >> SECTOR_SHIFT;
}
break;
+ case REQ_OP_DISCARD:
+ brd_do_discard(brd, bio);
+ break;
default:
bio->bi_status = BLK_STS_NOTSUPP;
break;
@@ -293,6 +374,40 @@ static const struct block_device_operati
.submit_bio = brd_submit_bio,
};
+static LIST_HEAD(brd_devices);
+static struct dentry *brd_debugfs_dir;
+
+static void brd_set_discard_limits(struct brd_device *brd)
+{
+ struct request_queue *queue = brd->brd_disk->queue;
+ if (discard) {
+ queue->limits.discard_granularity = PAGE_SIZE;
+ blk_queue_max_discard_sectors(queue, round_down(UINT_MAX, PAGE_SECTORS));
+ } else {
+ queue->limits.discard_granularity = 0;
+ blk_queue_max_discard_sectors(queue, 0);
+ }
+}
+
+static int discard_set_bool(const char *val, const struct kernel_param *kp)
+{
+ struct brd_device *brd;
+
+ int r = param_set_bool(val, kp);
+ if (r)
+ return r;
+
+ list_for_each_entry(brd, &brd_devices, brd_list)
+ brd_set_discard_limits(brd);
+
+ return 0;
+}
+
+static const struct kernel_param_ops discard_ops = {
+ .set = discard_set_bool,
+ .get = param_get_bool,
+};
+
/*
* And now the modules code and kernel interface.
*/
@@ -308,6 +423,10 @@ static int max_part = 1;
module_param(max_part, int, 0444);
MODULE_PARM_DESC(max_part, "Num Minors to reserve between devices");
+static bool discard = false;
+module_param_cb(discard, &discard_ops, &discard, 0644);
+MODULE_PARM_DESC(discard, "Support discard");
+
MODULE_LICENSE("GPL");
MODULE_ALIAS_BLOCKDEV_MAJOR(RAMDISK_MAJOR);
MODULE_ALIAS("rd");
@@ -326,9 +445,6 @@ __setup("ramdisk_size=", ramdisk_size);
* The device scheme is derived from loop.c. Keep them in synch where possible
* (should share code eventually).
*/
-static LIST_HEAD(brd_devices);
-static struct dentry *brd_debugfs_dir;
-
static int brd_alloc(int i)
{
struct brd_device *brd;
@@ -373,6 +489,8 @@ static int brd_alloc(int i)
*/
blk_queue_physical_block_size(disk->queue, PAGE_SIZE);
+ brd_set_discard_limits(brd);
+
/* Tell the block layer that this is not a rotational device */
blk_queue_flag_set(QUEUE_FLAG_NONROT, disk->queue);
blk_queue_flag_set(QUEUE_FLAG_SYNCHRONOUS, disk->queue);
* [PATCH v3 4/4] brd: implement write zeroes
2023-08-10 10:07 [PATCH v3 0/4] brd discard patches Mikulas Patocka
` (2 preceding siblings ...)
2023-08-10 10:09 ` [PATCH v3 3/4] brd: enable discard Mikulas Patocka
@ 2023-08-10 10:10 ` Mikulas Patocka
2023-11-10 1:22 ` [PATCH v3 0/4] brd discard patches Li Nan
2024-01-19 8:41 ` Ming Lei
5 siblings, 0 replies; 10+ messages in thread
From: Mikulas Patocka @ 2023-08-10 10:10 UTC (permalink / raw)
To: Jens Axboe
Cc: Li Nan, Zdenek Kabelac, Christoph Hellwig, Chaitanya Kulkarni,
linux-block, dm-devel
This patch implements REQ_OP_WRITE_ZEROES in brd. Write zeroes frees the
pages just like discard; the difference is that it also writes zeroes to
the partial preceding and following pages when the range is not aligned
on a page boundary.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
---
drivers/block/brd.c | 21 ++++++++++++++++++---
1 file changed, 18 insertions(+), 3 deletions(-)
Index: linux-2.6/drivers/block/brd.c
===================================================================
--- linux-2.6.orig/drivers/block/brd.c
+++ linux-2.6/drivers/block/brd.c
@@ -301,7 +301,8 @@ out:
void brd_do_discard(struct brd_device *brd, struct bio *bio)
{
struct free_page_batch *batch = NULL;
- sector_t sector, len, front_pad;
+ bool zero_padding = bio_op(bio) == REQ_OP_WRITE_ZEROES;
+ sector_t sector, len, front_pad, end_pad;
if (unlikely(!discard)) {
bio->bi_status = BLK_STS_NOTSUPP;
@@ -311,11 +312,22 @@ void brd_do_discard(struct brd_device *b
sector = bio->bi_iter.bi_sector;
len = bio_sectors(bio);
front_pad = -sector & (PAGE_SECTORS - 1);
+
+ if (zero_padding && unlikely(front_pad != 0))
+ copy_to_brd(brd, page_address(ZERO_PAGE(0)),
+ sector, min(len, front_pad) << SECTOR_SHIFT);
+
sector += front_pad;
if (unlikely(len <= front_pad))
return;
len -= front_pad;
- len = round_down(len, PAGE_SECTORS);
+
+ end_pad = len & (PAGE_SECTORS - 1);
+ if (zero_padding && unlikely(end_pad != 0))
+ copy_to_brd(brd, page_address(ZERO_PAGE(0)),
+ sector + len - end_pad, end_pad << SECTOR_SHIFT);
+ len -= end_pad;
+
while (len) {
brd_free_page(brd, sector, &batch);
sector += PAGE_SECTORS;
@@ -359,6 +371,7 @@ static void brd_submit_bio(struct bio *b
}
break;
case REQ_OP_DISCARD:
+ case REQ_OP_WRITE_ZEROES:
brd_do_discard(brd, bio);
break;
default:
@@ -383,9 +396,11 @@ static void brd_set_discard_limits(struc
if (discard) {
queue->limits.discard_granularity = PAGE_SIZE;
blk_queue_max_discard_sectors(queue, round_down(UINT_MAX, PAGE_SECTORS));
+ blk_queue_max_write_zeroes_sectors(queue, round_down(UINT_MAX, PAGE_SECTORS));
} else {
queue->limits.discard_granularity = 0;
blk_queue_max_discard_sectors(queue, 0);
+ blk_queue_max_write_zeroes_sectors(queue, 0);
}
}
@@ -425,7 +440,7 @@ MODULE_PARM_DESC(max_part, "Num Minors t
static bool discard = false;
module_param_cb(discard, &discard_ops, &discard, 0644);
-MODULE_PARM_DESC(discard, "Support discard");
+MODULE_PARM_DESC(discard, "Support discard and write zeroes");
MODULE_LICENSE("GPL");
MODULE_ALIAS_BLOCKDEV_MAJOR(RAMDISK_MAJOR);
* Re: [PATCH v3 0/4] brd discard patches
2023-08-10 10:07 [PATCH v3 0/4] brd discard patches Mikulas Patocka
` (3 preceding siblings ...)
2023-08-10 10:10 ` [PATCH v3 4/4] brd: implement write zeroes Mikulas Patocka
@ 2023-11-10 1:22 ` Li Nan
2023-11-14 13:59 ` Mikulas Patocka
2024-01-19 8:41 ` Ming Lei
5 siblings, 1 reply; 10+ messages in thread
From: Li Nan @ 2023-11-10 1:22 UTC (permalink / raw)
To: Mikulas Patocka, Jens Axboe
Cc: Zdenek Kabelac, Christoph Hellwig, Chaitanya Kulkarni,
linux-block, dm-devel
friendly ping...
On 2023/8/10 18:07, Mikulas Patocka wrote:
> Hi
>
> Here I'm submitting the ramdisk discard patches for the next merge window.
> If you want to make some more changes, please let me know.
>
> Changes since version 2:
> There are no functional changes, I only moved the switch statement
> conversion to a separate patch, so that it's easier to review.
>
> Patch 1: introduce a switch statement in brd_submit_bio
> Patch 2: extend the rcu regions to cover read and write
> Patch 3: enable discard
> Patch 4: implement write zeroes
>
> Mikulas
>
--
Thanks,
Nan
* Re: [PATCH v3 0/4] brd discard patches
2023-11-10 1:22 ` [PATCH v3 0/4] brd discard patches Li Nan
@ 2023-11-14 13:59 ` Mikulas Patocka
0 siblings, 0 replies; 10+ messages in thread
From: Mikulas Patocka @ 2023-11-14 13:59 UTC (permalink / raw)
To: Jens Axboe
Cc: Li Nan, Zdenek Kabelac, Christoph Hellwig, Chaitanya Kulkarni,
linux-block, dm-devel
On Fri, 10 Nov 2023, Li Nan wrote:
> friendly ping...
Jens? Do you want this patch series or not?
Mikulas
> On 2023/8/10 18:07, Mikulas Patocka wrote:
> > Hi
> >
> > Here I'm submitting the ramdisk discard patches for the next merge window.
> > If you want to make some more changes, please let me know.
> >
> > Changes since version 2:
> > There are no functional changes, I only moved the switch statement
> > conversion to a separate patch, so that it's easier to review.
> >
> > Patch 1: introduce a switch statement in brd_submit_bio
> > Patch 2: extend the rcu regions to cover read and write
> > Patch 3: enable discard
> > Patch 4: implement write zeroes
> >
> > Mikulas
> >
>
> --
> Thanks,
> Nan
>
* Re: [PATCH v3 0/4] brd discard patches
2023-08-10 10:07 [PATCH v3 0/4] brd discard patches Mikulas Patocka
` (4 preceding siblings ...)
2023-11-10 1:22 ` [PATCH v3 0/4] brd discard patches Li Nan
@ 2024-01-19 8:41 ` Ming Lei
2024-01-22 16:30 ` Mikulas Patocka
5 siblings, 1 reply; 10+ messages in thread
From: Ming Lei @ 2024-01-19 8:41 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Jens Axboe, Li Nan, Zdenek Kabelac, Christoph Hellwig,
Chaitanya Kulkarni, linux-block, dm-devel, ming.lei
Hi Mikulas,
On Thu, Aug 10, 2023 at 12:07:07PM +0200, Mikulas Patocka wrote:
> Hi
>
> Here I'm submitting the ramdisk discard patches for the next merge window.
> If you want to make some more changes, please let me know.
brd discard was removed in 2017 by commit f09a06a193d9 ("brd: remove
discard support") because it was just a driver-private write_zero, and
users can get the same result with fallocate(FALLOC_FL_ZERO_RANGE).
Also, you only mentioned the motivation in the V1 cover letter:
https://lore.kernel.org/linux-block/alpine.LRH.2.02.2209151604410.13231@file01.intranet.prod.int.rdu2.redhat.com/
```
Zdenek asked me to write it, because we use brd in the lvm2 testsuite and
it would be beneficial to run the testsuite with discard enabled in order
to test discard handling.
```
But we have lots of test disks with discard support: loop, scsi_debug,
null_blk, ublk, ..., so one question is why brd discard is
a must for the lvm2 testsuite to cover (lvm) discard handling?
The reason brd didn't support discard by freeing pages is the risk of
writeback deadlock; see:
commit f09a06a193d9 ("brd: remove discard support")
-static void discard_from_brd(struct brd_device *brd,
- sector_t sector, size_t n)
-{
- while (n >= PAGE_SIZE) {
- /*
- * Don't want to actually discard pages here because
- * re-allocating the pages can result in writeback
- * deadlocks under heavy load.
- */
- if (0)
- brd_free_page(brd, sector);
- else
- brd_zero_page(brd, sector);
- sector += PAGE_SIZE >> SECTOR_SHIFT;
- n -= PAGE_SIZE;
- }
-}
However, you didn't mention how your patches address this potential
risk; care to document it? I can't find any discussion of this problem
in the series.
BTW, your patches look more complicated than the original removed
discard implementation. Once the above questions are addressed, I am
happy to review the following patches.
Thanks,
Ming
* Re: [PATCH v3 0/4] brd discard patches
2024-01-19 8:41 ` Ming Lei
@ 2024-01-22 16:30 ` Mikulas Patocka
2024-01-23 2:49 ` Ming Lei
0 siblings, 1 reply; 10+ messages in thread
From: Mikulas Patocka @ 2024-01-22 16:30 UTC (permalink / raw)
To: Ming Lei, Zdenek Kabelac
Cc: Jens Axboe, Li Nan, Christoph Hellwig, Chaitanya Kulkarni,
linux-block, dm-devel
Hi
On Fri, 19 Jan 2024, Ming Lei wrote:
> Hi Mikulas,
>
> On Thu, Aug 10, 2023 at 12:07:07PM +0200, Mikulas Patocka wrote:
> > Hi
> >
> > Here I'm submitting the ramdisk discard patches for the next merge window.
> > If you want to make some more changes, please let me know.
>
> brd discard is removed in f09a06a193d9 ("brd: remove discard support")
> in 2017 because it is just driver private write_zero, and user can get same
> result with fallocate(FALLOC_FL_ZERO_RANGE).
>
> Also you only mentioned the motivation in V1 cover-letter:
>
> https://lore.kernel.org/linux-block/alpine.LRH.2.02.2209151604410.13231@file01.intranet.prod.int.rdu2.redhat.com/
>
> ```
> Zdenek asked me to write it, because we use brd in the lvm2 testsuite and
> it would be beneficial to run the testsuite with discard enabled in order
> to test discard handling.
> ```
>
> But we have lots of test disks with discard support: loop, scsi_debug,
> null_blk, ublk, ..., so one requestion is that why brd discard is
> a must for lvm2 testsuite to cover (lvm)discard handling?
We should ask Zdeněk Kabeláč about it; he is the expert on the lvm2
testsuite.
> The reason why brd didn't support discard by freeing pages is writeback
> deadlock risk, see:
>
> commit f09a06a193d9 ("brd: remove discard support")
>
> -static void discard_from_brd(struct brd_device *brd,
> - sector_t sector, size_t n)
> -{
> - while (n >= PAGE_SIZE) {
> - /*
> - * Don't want to actually discard pages here because
> - * re-allocating the pages can result in writeback
> - * deadlocks under heavy load.
> - */
> - if (0)
> - brd_free_page(brd, sector);
> - else
> - brd_zero_page(brd, sector);
> - sector += PAGE_SIZE >> SECTOR_SHIFT;
> - n -= PAGE_SIZE;
> - }
> -}
>
> However, you didn't mention how your patches address this potential
> risk, care to document it? I can't find any related words about
> this problem.
The writeback deadlock can happen even without discard, if the machine
runs out of memory while writing data to a ramdisk. But the probability is
increased when discard is used, because pages are freed and re-allocated
more often.
Generally, the admin should make sure that the machine has enough
available memory when creating a ramdisk - then, the deadlock can't
happen.
A ramdisk has no limit on the number of allocated pages, so when it runs
out of memory, the OOM killer will try to kill unrelated processes and the
machine will hang. If there is a risk of exhausting the available memory,
the admin should use tmpfs instead of a ramdisk: tmpfs can be configured
with a limit and it can also swap out pages.
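That recommendation can be sketched as commands; the size and mount point here are made-up examples, not values from the thread:

```shell
# tmpfs with a hard cap: writes beyond 1 GiB fail with ENOSPC instead of
# pushing the machine toward the OOM killer, and its pages can be swapped
mount -t tmpfs -o size=1G tmpfs /mnt/ramdisk

# brd by comparison: the device size is fixed at load time (rd_size is in
# KiB), but its pages are allocated lazily and are never swappable
modprobe brd rd_nr=1 rd_size=1048576
```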
> BTW, your patches looks more complicated than the original removed
> discard implementation. And if the above questions get addressed,
> I am happy to provide review on the following patches.
My patches actually free the discarded pages. The original discard
implementation just overwrote the pages with zeroes without freeing them.
Mikulas
>
>
> Thanks,
> Ming
>
* Re: [PATCH v3 0/4] brd discard patches
2024-01-22 16:30 ` Mikulas Patocka
@ 2024-01-23 2:49 ` Ming Lei
0 siblings, 0 replies; 10+ messages in thread
From: Ming Lei @ 2024-01-23 2:49 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Zdenek Kabelac, Jens Axboe, Li Nan, Christoph Hellwig,
Chaitanya Kulkarni, linux-block, dm-devel, ming.lei
On Mon, Jan 22, 2024 at 05:30:07PM +0100, Mikulas Patocka wrote:
> Hi
>
>
> On Fri, 19 Jan 2024, Ming Lei wrote:
>
> > Hi Mikulas,
> >
> > On Thu, Aug 10, 2023 at 12:07:07PM +0200, Mikulas Patocka wrote:
> > > Hi
> > >
> > > Here I'm submitting the ramdisk discard patches for the next merge window.
> > > If you want to make some more changes, please let me know.
> >
> > brd discard is removed in f09a06a193d9 ("brd: remove discard support")
> > in 2017 because it is just driver private write_zero, and user can get same
> > result with fallocate(FALLOC_FL_ZERO_RANGE).
> >
> > Also you only mentioned the motivation in V1 cover-letter:
> >
> > https://lore.kernel.org/linux-block/alpine.LRH.2.02.2209151604410.13231@file01.intranet.prod.int.rdu2.redhat.com/
> >
> > ```
> > Zdenek asked me to write it, because we use brd in the lvm2 testsuite and
> > it would be beneficial to run the testsuite with discard enabled in order
> > to test discard handling.
> > ```
> >
> > But we have lots of test disks with discard support: loop, scsi_debug,
> > null_blk, ublk, ..., so one requestion is that why brd discard is
> > a must for lvm2 testsuite to cover (lvm)discard handling?
>
> We should ask Zdeněk Kabeláč about it - he is expert about the lvm2
> testsuite.
>
> > The reason why brd didn't support discard by freeing pages is writeback
> > deadlock risk, see:
> >
> > commit f09a06a193d9 ("brd: remove discard support")
> >
> > -static void discard_from_brd(struct brd_device *brd,
> > - sector_t sector, size_t n)
> > -{
> > - while (n >= PAGE_SIZE) {
> > - /*
> > - * Don't want to actually discard pages here because
> > - * re-allocating the pages can result in writeback
> > - * deadlocks under heavy load.
> > - */
> > - if (0)
> > - brd_free_page(brd, sector);
> > - else
> > - brd_zero_page(brd, sector);
> > - sector += PAGE_SIZE >> SECTOR_SHIFT;
> > - n -= PAGE_SIZE;
> > - }
> > -}
> >
> > However, you didn't mention how your patches address this potential
> > risk, care to document it? I can't find any related words about
> > this problem.
>
> The writeback deadlock can happen even without discard - if the machine
> runs out of memory while writing data to a ramdisk. But the probability is
> increased when discard is used, because pages are freed and re-allocated
> more often.
Yeah, I agree. What I meant is that this needs to be documented, given
that discard is re-introduced and the original deadlock comment isn't
addressed.
>
> Generally, the admin should make sure that the machine has enough
> available memory when creating a ramdisk - then, the deadlock can't
> happen.
>
> Ramdisk has no limit on the number of allocated pages, so when it runs out
> of memory, the oom killer will try to kill unrelated processes and the
> machine will hang. If there is risk of overflowing the available memory,
> the admin should use tmpfs instead of a ramdisk - tmpfs can be configured
> with a limit and it can also swap out pages.
>
> > BTW, your patches looks more complicated than the original removed
> > discard implementation. And if the above questions get addressed,
> > I am happy to provide review on the following patches.
>
> My patches actually free the discarded pages. The original discard
> implementation just overwrote the pages with zeroes without freeing them.
The original implementation did support discard by freeing pages, but
that path was bypassed unconditionally by:
if (0)
brd_free_page(brd, sector);
else
brd_zero_page(brd, sector);
However, a page could be freed by discard while it is being accessed in
brd_do_bvec().
Maybe your patch of "brd: extend the rcu regions to cover read and write"
can be simplified a bit, such as:
- grab rcu read lock in brd_do_bvec()
- release the rcu read lock when allocating page via alloc_page() in
brd_insert_page()
- change free page by rcu
Or avoid it by holding a page reference:
- grab a page reference in brd_lookup_page() when it is called from
copy_to_brd() or copy_from_brd(), and drop it after the page is consumed
Thanks,
Ming