* [PATCH 0/5] Improve zram writeback performance
From: Pankaj Raghav @ 2023-09-11 13:34 UTC
To: minchan, senozhatsky
Cc: linux-kernel, axboe, p.raghav, linux-block, kernel, gost.dev
ZRAM can have a backing device that can be used as a writeback target
for the pages in RAM. The current writeback code (writeback_store())
does synchronous single-page IOs to the backing device.
This series implements IO batching while doing a writeback to a backing
device. The code still does synchronous IOs, but with larger IO sizes
whenever possible. This crosses off one of the TODOs noted in the
writeback_store() function:
A single page IO would be inefficient for write...
The idea is to batch the IOs up to a certain limit before the data is
flushed to the backing device. The batch limit is initially chosen based
on the bdi->io_pages value, with an upper limit of 32 pages (128k on x86).
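As an illustration, the batch order is derived roughly as follows (a
minimal sketch mirroring the logic added in patch 4, not the exact
final code):

    /* cap the batch at 2^5 = 32 pages */
    #define MAX_INDEX_ENTRIES_ORDER 5

    unsigned int order = min(MAX_INDEX_ENTRIES_ORDER,
                             ilog2(zram->bdev->bd_disk->bdi->io_pages));

    /* at most 2^order pages per batch, clamped by the pages left */
    io_pages = min(1UL << order, nr_pages);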
Batching reduces the writeback time of 4G of data to an nvme backing
device from 68 secs to 15 secs (more than a 4x improvement).
The first 3 patches are prep. The 4th patch implements the main logic
for IO batching, and the last patch is another cleanup.
Perf:
$ modprobe zram num_devices=1
$ echo "/dev/nvme0n1" > /sys/block/zram0/backing_dev
$ echo 6G > /sys/block/zram0/disksize
$ fio -iodepth=16 -rw=randwrite -ioengine=io_uring -bs=4k -numjobs=1 -size=4G -filename=/dev/zram0 -name=io_uring_1 > /dev/null
$ echo all > /sys/block/zram0/idle
Without changes:
$ time echo idle > /sys/block/zram0/writeback
real 1m8.648s (68 secs)
user 0m0.000s
sys 0m24.899s
$ cat /sys/block/zram0/bd_stat
1048576 0 1048576
With changes:
$ time echo idle > /sys/block/zram0/writeback
real 0m15.496s (15 secs)
user 0m0.000s
sys 0m7.789s
$ cat /sys/block/zram0/bd_stat
1048576 0 1048576
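(For reference, the three bd_stat columns are bd_count, bd_reads and
bd_writes, in units of 4k pages; identical values before and after show
that the same amount of data is written back, only faster.)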
Testing:
Basic end-to-end testing (based on Sergey's test flow [1]):
1) Configure zram0 and add an nvme device as a writeback device
2) Get the sha256sum of a tarball
3) mkfs.ext4 on zram0, cp the tarball
4) Perform idle writeback
5) cp the tarball from zram0 to another device (rereading the
written-back pages) and compare the sha256sum again
The sha256sums before and after are verified to be the same.
Writeback limit testing:
1) Configure zram0 and add an nvme device as a writeback device
2) Set the writeback limit and enable it
3) Run a fio job that crosses the writeback limit
4) Perform idle writeback
5) Verify that writeback is limited to the set writeback limit value
$ modprobe zram num_devices=1
$ echo "/dev/nvme0n1" > /sys/block/zram0/backing_dev
$ echo 4G > /sys/block/zram0/disksize
$ echo 1 > /sys/block/zram0/writeback_limit_enable
$ echo 1002 > /sys/block/zram0/writeback_limit
$ fio -iodepth=16 -rw=write -ioengine=io_uring -bs=4k -numjobs=1 -size=10M -filename=/dev/zram0 -name=io_uring_1
$ echo all > /sys/block/zram0/idle
$ echo idle > /sys/block/zram0/writeback
$ cat /sys/block/zram0/bd_stat
1002 0 1002
Writeback is limited to the set value.
[1] https://lore.kernel.org/lkml/20230806071601.GB907732@google.com/
Pankaj Raghav (5):
zram: move index preparation to a separate function in writeback_store
zram: encapsulate writeback to the backing bdev in a function
zram: add alloc_block_bdev_range() and free_block_bdev_range()
zram: batch IOs during writeback to improve performance
zram: don't overload blk_idx variable in writeback_store()
drivers/block/zram/zram_drv.c | 318 ++++++++++++++++++++++------------
1 file changed, 210 insertions(+), 108 deletions(-)
base-commit: 7bc675554773f09d88101bf1ccfc8537dc7c0be9
--
2.40.1
* [PATCH 1/5] zram: move index preparation to a separate function in writeback_store
From: Pankaj Raghav @ 2023-09-11 13:34 UTC
To: minchan, senozhatsky
Cc: linux-kernel, axboe, p.raghav, linux-block, kernel, gost.dev
From: Pankaj Raghav <p.raghav@samsung.com>
Add a new function, writeback_prep_or_skip_index(), that performs the
checks and sets the appropriate flags before writeback starts. The
function returns false if the index can be skipped.
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
drivers/block/zram/zram_drv.c | 68 +++++++++++++++++++++--------------
1 file changed, 42 insertions(+), 26 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 06673c6ca255..eaf9e227778e 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -595,6 +595,46 @@ static void read_from_bdev_async(struct zram *zram, struct page *page,
#define IDLE_WRITEBACK (1<<1)
#define INCOMPRESSIBLE_WRITEBACK (1<<2)
+/*
+ * Returns: true if the index was prepared for further processing
+ * false if the index can be skipped
+ */
+static bool writeback_prep_or_skip_index(struct zram *zram, int mode,
+ unsigned long index)
+{
+ bool ret = false;
+
+ zram_slot_lock(zram, index);
+ if (!zram_allocated(zram, index))
+ goto skip;
+
+ if (zram_test_flag(zram, index, ZRAM_WB) ||
+ zram_test_flag(zram, index, ZRAM_SAME) ||
+ zram_test_flag(zram, index, ZRAM_UNDER_WB))
+ goto skip;
+
+ if (mode & IDLE_WRITEBACK && !zram_test_flag(zram, index, ZRAM_IDLE))
+ goto skip;
+ if (mode & HUGE_WRITEBACK && !zram_test_flag(zram, index, ZRAM_HUGE))
+ goto skip;
+ if (mode & INCOMPRESSIBLE_WRITEBACK &&
+ !zram_test_flag(zram, index, ZRAM_INCOMPRESSIBLE))
+ goto skip;
+
+ /*
+ * Clearing ZRAM_UNDER_WB is duty of caller.
+ * IOW, zram_free_page never clear it.
+ */
+ zram_set_flag(zram, index, ZRAM_UNDER_WB);
+ /* Need for hugepage writeback racing */
+ zram_set_flag(zram, index, ZRAM_IDLE);
+
+ ret = true;
+skip:
+ zram_slot_unlock(zram, index);
+ return ret;
+}
+
static ssize_t writeback_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t len)
{
@@ -662,33 +702,9 @@ static ssize_t writeback_store(struct device *dev,
}
}
- zram_slot_lock(zram, index);
- if (!zram_allocated(zram, index))
- goto next;
+ if (!writeback_prep_or_skip_index(zram, mode, index))
+ continue;
- if (zram_test_flag(zram, index, ZRAM_WB) ||
- zram_test_flag(zram, index, ZRAM_SAME) ||
- zram_test_flag(zram, index, ZRAM_UNDER_WB))
- goto next;
-
- if (mode & IDLE_WRITEBACK &&
- !zram_test_flag(zram, index, ZRAM_IDLE))
- goto next;
- if (mode & HUGE_WRITEBACK &&
- !zram_test_flag(zram, index, ZRAM_HUGE))
- goto next;
- if (mode & INCOMPRESSIBLE_WRITEBACK &&
- !zram_test_flag(zram, index, ZRAM_INCOMPRESSIBLE))
- goto next;
-
- /*
- * Clearing ZRAM_UNDER_WB is duty of caller.
- * IOW, zram_free_page never clear it.
- */
- zram_set_flag(zram, index, ZRAM_UNDER_WB);
- /* Need for hugepage writeback racing */
- zram_set_flag(zram, index, ZRAM_IDLE);
- zram_slot_unlock(zram, index);
if (zram_read_page(zram, page, index, NULL)) {
zram_slot_lock(zram, index);
zram_clear_flag(zram, index, ZRAM_UNDER_WB);
--
2.40.1
* [PATCH 2/5] zram: encapsulate writeback to the backing bdev in a function
From: Pankaj Raghav @ 2023-09-11 13:34 UTC
To: minchan, senozhatsky
Cc: linux-kernel, axboe, p.raghav, linux-block, kernel, gost.dev
From: Pankaj Raghav <p.raghav@samsung.com>
Encapsulate flushing of data to the backing bdev during writeback in a
separate function, writeback_flush_to_bdev(). This is in preparation for
adding IO batching support to writeback_store().
No functional changes.
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
drivers/block/zram/zram_drv.c | 125 ++++++++++++++++++----------------
1 file changed, 68 insertions(+), 57 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index eaf9e227778e..bd93ed653b99 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -635,14 +635,78 @@ static bool writeback_prep_or_skip_index(struct zram *zram, int mode,
return ret;
}
+static int writeback_flush_to_bdev(struct zram *zram, unsigned long index,
+ struct page *page, unsigned long *blk_idx)
+{
+ struct bio bio;
+ struct bio_vec bio_vec;
+ int ret;
+
+ bio_init(&bio, zram->bdev, &bio_vec, 1, REQ_OP_WRITE | REQ_SYNC);
+ bio.bi_iter.bi_sector = *blk_idx * (PAGE_SIZE >> 9);
+ __bio_add_page(&bio, page, PAGE_SIZE, 0);
+
+ /*
+ * XXX: A single page IO would be inefficient for write
+ * but it would be not bad as starter.
+ */
+ ret = submit_bio_wait(&bio);
+ if (ret) {
+ zram_slot_lock(zram, index);
+ zram_clear_flag(zram, index, ZRAM_UNDER_WB);
+ zram_clear_flag(zram, index, ZRAM_IDLE);
+ zram_slot_unlock(zram, index);
+ /*
+ * BIO errors are not fatal, we continue and simply
+ * attempt to writeback the remaining objects (pages).
+ * At the same time we need to signal user-space that
+ * some writes (at least one, but also could be all of
+ * them) were not successful and we do so by returning
+ * the most recent BIO error.
+ */
+ return ret;
+ }
+
+ atomic64_inc(&zram->stats.bd_writes);
+ /*
+ * We released zram_slot_lock so need to check if the slot was
+ * changed. If there is freeing for the slot, we can catch it
+ * easily by zram_allocated.
+ * A subtle case is the slot is freed/reallocated/marked as
+ * ZRAM_IDLE again. To close the race, idle_store doesn't
+ * mark ZRAM_IDLE once it found the slot was ZRAM_UNDER_WB.
+ * Thus, we could close the race by checking ZRAM_IDLE bit.
+ */
+ zram_slot_lock(zram, index);
+ if (!zram_allocated(zram, index) ||
+ !zram_test_flag(zram, index, ZRAM_IDLE)) {
+ zram_clear_flag(zram, index, ZRAM_UNDER_WB);
+ zram_clear_flag(zram, index, ZRAM_IDLE);
+ goto skip;
+ }
+
+ zram_free_page(zram, index);
+ zram_clear_flag(zram, index, ZRAM_UNDER_WB);
+ zram_set_flag(zram, index, ZRAM_WB);
+ zram_set_element(zram, index, *blk_idx);
+ atomic64_inc(&zram->stats.pages_stored);
+ *blk_idx = 0;
+
+ spin_lock(&zram->wb_limit_lock);
+ if (zram->wb_limit_enable && zram->bd_wb_limit > 0)
+ zram->bd_wb_limit -= 1UL << (PAGE_SHIFT - 12);
+ spin_unlock(&zram->wb_limit_lock);
+skip:
+ zram_slot_unlock(zram, index);
+ return 0;
+}
+
static ssize_t writeback_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t len)
{
struct zram *zram = dev_to_zram(dev);
unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
unsigned long index = 0;
- struct bio bio;
- struct bio_vec bio_vec;
struct page *page;
ssize_t ret = len;
int mode, err;
@@ -713,63 +777,10 @@ static ssize_t writeback_store(struct device *dev,
continue;
}
- bio_init(&bio, zram->bdev, &bio_vec, 1,
- REQ_OP_WRITE | REQ_SYNC);
- bio.bi_iter.bi_sector = blk_idx * (PAGE_SIZE >> 9);
- __bio_add_page(&bio, page, PAGE_SIZE, 0);
+ err = writeback_flush_to_bdev(zram, index, page, &blk_idx);
- /*
- * XXX: A single page IO would be inefficient for write
- * but it would be not bad as starter.
- */
- err = submit_bio_wait(&bio);
- if (err) {
- zram_slot_lock(zram, index);
- zram_clear_flag(zram, index, ZRAM_UNDER_WB);
- zram_clear_flag(zram, index, ZRAM_IDLE);
- zram_slot_unlock(zram, index);
- /*
- * BIO errors are not fatal, we continue and simply
- * attempt to writeback the remaining objects (pages).
- * At the same time we need to signal user-space that
- * some writes (at least one, but also could be all of
- * them) were not successful and we do so by returning
- * the most recent BIO error.
- */
+ if (err)
ret = err;
- continue;
- }
-
- atomic64_inc(&zram->stats.bd_writes);
- /*
- * We released zram_slot_lock so need to check if the slot was
- * changed. If there is freeing for the slot, we can catch it
- * easily by zram_allocated.
- * A subtle case is the slot is freed/reallocated/marked as
- * ZRAM_IDLE again. To close the race, idle_store doesn't
- * mark ZRAM_IDLE once it found the slot was ZRAM_UNDER_WB.
- * Thus, we could close the race by checking ZRAM_IDLE bit.
- */
- zram_slot_lock(zram, index);
- if (!zram_allocated(zram, index) ||
- !zram_test_flag(zram, index, ZRAM_IDLE)) {
- zram_clear_flag(zram, index, ZRAM_UNDER_WB);
- zram_clear_flag(zram, index, ZRAM_IDLE);
- goto next;
- }
-
- zram_free_page(zram, index);
- zram_clear_flag(zram, index, ZRAM_UNDER_WB);
- zram_set_flag(zram, index, ZRAM_WB);
- zram_set_element(zram, index, blk_idx);
- blk_idx = 0;
- atomic64_inc(&zram->stats.pages_stored);
- spin_lock(&zram->wb_limit_lock);
- if (zram->wb_limit_enable && zram->bd_wb_limit > 0)
- zram->bd_wb_limit -= 1UL << (PAGE_SHIFT - 12);
- spin_unlock(&zram->wb_limit_lock);
-next:
- zram_slot_unlock(zram, index);
}
if (blk_idx)
--
2.40.1
* [PATCH 3/5] zram: add alloc_block_bdev_range() and free_block_bdev_range()
From: Pankaj Raghav @ 2023-09-11 13:34 UTC
To: minchan, senozhatsky
Cc: linux-kernel, axboe, p.raghav, linux-block, kernel, gost.dev
From: Pankaj Raghav <p.raghav@samsung.com>
Add [alloc|free]_block_bdev_range(), which accept the number of blocks
to allocate or free from the block bitmap.
alloc_block_bdev_range() tries to allocate a contiguous range of the
bitmap based on the input nr_of_blocks whenever possible; otherwise it
retries with a smaller value. This is done so that we don't
unnecessarily return EIO when the underlying bitmap is fragmented.
alloc_block_bdev_range() is not an atomic operation, as this function
can only be called from writeback_store(), where init_lock is already
taken, ensuring that two processes cannot allocate from the bdev bitmap
at the same time.
free_block_bdev_range() is just a simple loop that calls the atomic
free_block_bdev() function. As frees from the bdev bitmap can happen from
two different processes simultaneously without a lock, atomicity needs
to be maintained.
This is useful when we want to send larger IOs to the backing dev.
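A minimal usage sketch (based on how the later batching patch calls
these helpers; the batch size of 32 here is illustrative):

    unsigned int io_pages = 32;    /* desired number of blocks */
    unsigned long blk_idx;

    blk_idx = alloc_block_bdev_range(zram, &io_pages);
    if (!blk_idx)
            return -ENOSPC;    /* not even one block is free */

    /* io_pages now holds the count actually allocated (<= 32);
     * blocks that end up unused must be given back, e.g.:
     * free_block_bdev_range(zram, blk_idx + used, io_pages - used);
     */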
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
drivers/block/zram/zram_drv.c | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index bd93ed653b99..0b8f814e11dd 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -576,6 +576,39 @@ static void free_block_bdev(struct zram *zram, unsigned long blk_idx)
atomic64_dec(&zram->stats.bd_count);
}
+static unsigned long alloc_block_bdev_range(struct zram *zram,
+ unsigned int *nr_of_blocksp)
+{
+ unsigned long blk_idx;
+ unsigned int nr_of_blocks = *nr_of_blocksp;
+retry:
+ /* skip 0 bit to confuse zram.handle = 0 */
+ blk_idx = 1;
+ blk_idx = bitmap_find_next_zero_area(zram->bitmap, zram->nr_pages,
+ blk_idx, nr_of_blocks, 0);
+
+ if ((blk_idx + nr_of_blocks) > zram->nr_pages) {
+ if (nr_of_blocks == 1)
+ return 0;
+
+ nr_of_blocks = nr_of_blocks / 2;
+ goto retry;
+ }
+
+ bitmap_set(zram->bitmap, blk_idx, nr_of_blocks);
+ atomic64_add(nr_of_blocks, &zram->stats.bd_count);
+ *nr_of_blocksp = nr_of_blocks;
+
+ return blk_idx;
+}
+
+static void free_block_bdev_range(struct zram *zram, unsigned long blk_idx,
+ unsigned int nr_of_blocks)
+{
+ for (unsigned int i = 0; i < nr_of_blocks; i++)
+ free_block_bdev(zram, blk_idx + i);
+}
+
static void read_from_bdev_async(struct zram *zram, struct page *page,
unsigned long entry, struct bio *parent)
{
--
2.40.1
* [PATCH 4/5] zram: batch IOs during writeback to improve performance
From: Pankaj Raghav @ 2023-09-11 13:34 UTC
To: minchan, senozhatsky
Cc: linux-kernel, axboe, p.raghav, linux-block, kernel, gost.dev
From: Pankaj Raghav <p.raghav@samsung.com>
This crosses off one of the TODOs noted in the writeback_store()
function:
A single page IO would be inefficient for write...
This reduces the writeback time of 4G of data to an nvme backing device
from 68 secs to 15 secs (more than a 4x improvement).
The idea is to batch the IOs up to a certain limit before the data is
flushed to the backing device. The batch limit is initially chosen based
on the bdi->io_pages value, with an upper limit of 32 pages (128k on
x86). The limit is further adjusted based on the writeback_limit, if set.
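In outline, the clamping against the writeback limit looks like this (a
simplified sketch of the logic in this patch; bd_wb_limit is accounted
in 4k units, so the PAGE_SHIFT - 12 shift converts it to a page count):

    u64 bd_wb_limit_pages = ULONG_MAX;

    spin_lock(&zram->wb_limit_lock);
    if (zram->wb_limit_enable)
            bd_wb_limit_pages = zram->bd_wb_limit >> (PAGE_SHIFT - 12);
    spin_unlock(&zram->wb_limit_lock);

    /* never batch more than the remaining writeback budget */
    io_pages = min(1UL << order, nr_pages);
    io_pages = min_t(u64, bd_wb_limit_pages, io_pages);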
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
drivers/block/zram/zram_drv.c | 186 +++++++++++++++++++++-------------
1 file changed, 113 insertions(+), 73 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 0b8f814e11dd..27313c2d781d 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -551,22 +551,6 @@ static ssize_t backing_dev_store(struct device *dev,
return err;
}
-static unsigned long alloc_block_bdev(struct zram *zram)
-{
- unsigned long blk_idx = 1;
-retry:
- /* skip 0 bit to confuse zram.handle = 0 */
- blk_idx = find_next_zero_bit(zram->bitmap, zram->nr_pages, blk_idx);
- if (blk_idx == zram->nr_pages)
- return 0;
-
- if (test_and_set_bit(blk_idx, zram->bitmap))
- goto retry;
-
- atomic64_inc(&zram->stats.bd_count);
- return blk_idx;
-}
-
static void free_block_bdev(struct zram *zram, unsigned long blk_idx)
{
int was_set;
@@ -628,6 +612,15 @@ static void read_from_bdev_async(struct zram *zram, struct page *page,
#define IDLE_WRITEBACK (1<<1)
#define INCOMPRESSIBLE_WRITEBACK (1<<2)
+#define MAX_INDEX_ENTRIES_ORDER 5
+#define MAX_INDEX_ENTRIES (1U << MAX_INDEX_ENTRIES_ORDER)
+struct index_mapping {
+ /* Cap the maximum indices to 32 before we flush */
+ unsigned long arr[MAX_INDEX_ENTRIES];
+ unsigned int nr_of_entries;
+};
+
+
/*
* Returns: true if the index was prepared for further processing
* false if the index can be skipped
@@ -668,39 +661,36 @@ static bool writeback_prep_or_skip_index(struct zram *zram, int mode,
return ret;
}
-static int writeback_flush_to_bdev(struct zram *zram, unsigned long index,
- struct page *page, unsigned long *blk_idx)
+static int writeback_flush_to_bdev(struct zram *zram, struct folio *folio,
+ struct index_mapping *map,
+ unsigned long blk_idx, unsigned int io_pages)
{
struct bio bio;
struct bio_vec bio_vec;
- int ret;
+ int ret = 0;
+
+ if (!map->nr_of_entries)
+ return ret;
bio_init(&bio, zram->bdev, &bio_vec, 1, REQ_OP_WRITE | REQ_SYNC);
- bio.bi_iter.bi_sector = *blk_idx * (PAGE_SIZE >> 9);
- __bio_add_page(&bio, page, PAGE_SIZE, 0);
+ bio.bi_iter.bi_sector = blk_idx * (PAGE_SIZE >> 9);
+
+ if (!bio_add_folio(&bio, folio, io_pages * PAGE_SIZE, 0))
+ goto cleanup;
- /*
- * XXX: A single page IO would be inefficient for write
- * but it would be not bad as starter.
- */
ret = submit_bio_wait(&bio);
- if (ret) {
- zram_slot_lock(zram, index);
- zram_clear_flag(zram, index, ZRAM_UNDER_WB);
- zram_clear_flag(zram, index, ZRAM_IDLE);
- zram_slot_unlock(zram, index);
- /*
- * BIO errors are not fatal, we continue and simply
- * attempt to writeback the remaining objects (pages).
- * At the same time we need to signal user-space that
- * some writes (at least one, but also could be all of
- * them) were not successful and we do so by returning
- * the most recent BIO error.
- */
- return ret;
- }
+ /*
+ * BIO errors are not fatal, we continue and simply
+ * attempt to writeback the remaining objects (pages).
+ * At the same time we need to signal user-space that
+ * some writes (at least one, but also could be all of
+ * them) were not successful and we do so by returning
+ * the most recent BIO error.
+ */
+ if (ret)
+ goto cleanup;
- atomic64_inc(&zram->stats.bd_writes);
+ atomic64_add(map->nr_of_entries, &zram->stats.bd_writes);
/*
* We released zram_slot_lock so need to check if the slot was
* changed. If there is freeing for the slot, we can catch it
@@ -710,28 +700,40 @@ static int writeback_flush_to_bdev(struct zram *zram, unsigned long index,
* mark ZRAM_IDLE once it found the slot was ZRAM_UNDER_WB.
* Thus, we could close the race by checking ZRAM_IDLE bit.
*/
- zram_slot_lock(zram, index);
- if (!zram_allocated(zram, index) ||
- !zram_test_flag(zram, index, ZRAM_IDLE)) {
- zram_clear_flag(zram, index, ZRAM_UNDER_WB);
- zram_clear_flag(zram, index, ZRAM_IDLE);
- goto skip;
+ for (int iter = 0; iter < map->nr_of_entries; iter++) {
+ zram_slot_lock(zram, map->arr[iter]);
+ if (!zram_allocated(zram, map->arr[iter]) ||
+ !zram_test_flag(zram, map->arr[iter], ZRAM_IDLE)) {
+ zram_clear_flag(zram, map->arr[iter], ZRAM_UNDER_WB);
+ zram_clear_flag(zram, map->arr[iter], ZRAM_IDLE);
+ zram_slot_unlock(zram, map->arr[iter]);
+ free_block_bdev(zram, blk_idx + iter);
+ continue;
+ }
+
+ zram_free_page(zram, map->arr[iter]);
+ zram_clear_flag(zram, map->arr[iter], ZRAM_UNDER_WB);
+ zram_set_flag(zram, map->arr[iter], ZRAM_WB);
+ zram_set_element(zram, map->arr[iter], blk_idx + iter);
+ zram_slot_unlock(zram, map->arr[iter]);
+ atomic64_inc(&zram->stats.pages_stored);
+
+ spin_lock(&zram->wb_limit_lock);
+ if (zram->wb_limit_enable && zram->bd_wb_limit > 0)
+ zram->bd_wb_limit -= 1UL << (PAGE_SHIFT - 12);
+ spin_unlock(&zram->wb_limit_lock);
}
+ return ret;
- zram_free_page(zram, index);
- zram_clear_flag(zram, index, ZRAM_UNDER_WB);
- zram_set_flag(zram, index, ZRAM_WB);
- zram_set_element(zram, index, *blk_idx);
- atomic64_inc(&zram->stats.pages_stored);
- *blk_idx = 0;
-
- spin_lock(&zram->wb_limit_lock);
- if (zram->wb_limit_enable && zram->bd_wb_limit > 0)
- zram->bd_wb_limit -= 1UL << (PAGE_SHIFT - 12);
- spin_unlock(&zram->wb_limit_lock);
-skip:
- zram_slot_unlock(zram, index);
- return 0;
+cleanup:
+ for (int iter = 0; iter < map->nr_of_entries; iter++) {
+ zram_slot_lock(zram, map->arr[iter]);
+ zram_clear_flag(zram, map->arr[iter], ZRAM_UNDER_WB);
+ zram_clear_flag(zram, map->arr[iter], ZRAM_IDLE);
+ zram_slot_unlock(zram, map->arr[iter]);
+ }
+ free_block_bdev_range(zram, blk_idx, map->nr_of_entries);
+ return ret;
}
static ssize_t writeback_store(struct device *dev,
@@ -741,9 +743,15 @@ static ssize_t writeback_store(struct device *dev,
unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
unsigned long index = 0;
struct page *page;
+ struct folio *folio;
ssize_t ret = len;
int mode, err;
unsigned long blk_idx = 0;
+ unsigned int io_pages;
+ u64 bd_wb_limit_pages = ULONG_MAX;
+ struct index_mapping map = {};
+ unsigned int order = min(MAX_INDEX_ENTRIES_ORDER,
+ ilog2(zram->bdev->bd_disk->bdi->io_pages));
if (sysfs_streq(buf, "idle"))
mode = IDLE_WRITEBACK;
@@ -776,32 +784,48 @@ static ssize_t writeback_store(struct device *dev,
goto release_init_lock;
}
- page = alloc_page(GFP_KERNEL);
- if (!page) {
+ folio = folio_alloc(GFP_KERNEL, order);
+ if (!folio) {
ret = -ENOMEM;
goto release_init_lock;
}
for (; nr_pages != 0; index++, nr_pages--) {
spin_lock(&zram->wb_limit_lock);
- if (zram->wb_limit_enable && !zram->bd_wb_limit) {
- spin_unlock(&zram->wb_limit_lock);
- ret = -EIO;
- break;
+ if (zram->wb_limit_enable) {
+ if (!zram->bd_wb_limit) {
+ spin_unlock(&zram->wb_limit_lock);
+ ret = -EIO;
+ break;
+ }
+ bd_wb_limit_pages = zram->bd_wb_limit >>
+ (PAGE_SHIFT - 12);
}
spin_unlock(&zram->wb_limit_lock);
if (!blk_idx) {
- blk_idx = alloc_block_bdev(zram);
+ io_pages = min(1UL << order, nr_pages);
+ io_pages = min_t(u64, bd_wb_limit_pages, io_pages);
+
+ blk_idx = alloc_block_bdev_range(zram, &io_pages);
if (!blk_idx) {
ret = -ENOSPC;
break;
}
}
- if (!writeback_prep_or_skip_index(zram, mode, index))
- continue;
+ if (!writeback_prep_or_skip_index(zram, mode, index)) {
+ if (nr_pages > 1 || map.nr_of_entries == 0)
+ continue;
+ /* There are still some unfinished IOs that
+ * needs to be flushed
+ */
+ err = writeback_flush_to_bdev(zram, folio, &map,
+ blk_idx, io_pages);
+ goto next;
+ }
+ page = folio_page(folio, map.nr_of_entries);
if (zram_read_page(zram, page, index, NULL)) {
zram_slot_lock(zram, index);
zram_clear_flag(zram, index, ZRAM_UNDER_WB);
@@ -810,15 +834,31 @@ static ssize_t writeback_store(struct device *dev,
continue;
}
- err = writeback_flush_to_bdev(zram, index, page, &blk_idx);
+ map.arr[map.nr_of_entries++] = index;
+ if (map.nr_of_entries < io_pages)
+ continue;
+ err = writeback_flush_to_bdev(zram, folio, &map, blk_idx,
+ io_pages);
+next:
if (err)
ret = err;
+
+ /*
+ * Check if all the blocks have been utilized before
+ * allocating the next batch. This is necessary to free
+ * the unused blocks after looping through all indices.
+ */
+ if (map.nr_of_entries == io_pages) {
+ blk_idx = 0;
+ map.nr_of_entries = 0;
+ }
}
if (blk_idx)
- free_block_bdev(zram, blk_idx);
- __free_page(page);
+ free_block_bdev_range(zram, blk_idx + map.nr_of_entries,
+ io_pages - map.nr_of_entries);
+ folio_put(folio);
release_init_lock:
up_read(&zram->init_lock);
--
2.40.1
* [PATCH 5/5] zram: don't overload blk_idx variable in writeback_store()
From: Pankaj Raghav @ 2023-09-11 13:34 UTC
To: minchan, senozhatsky
Cc: linux-kernel, axboe, p.raghav, linux-block, kernel, gost.dev
From: Pankaj Raghav <p.raghav@samsung.com>
Instead of overloading the blk_idx variable to track whether a block was
allocated and used, add a new boolean variable, blk_allocated.
No functional changes.
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
drivers/block/zram/zram_drv.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 27313c2d781d..7c1420e92c6a 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -747,6 +747,7 @@ static ssize_t writeback_store(struct device *dev,
ssize_t ret = len;
int mode, err;
unsigned long blk_idx = 0;
+ bool blk_allocated = false;
unsigned int io_pages;
u64 bd_wb_limit_pages = ULONG_MAX;
struct index_mapping map = {};
@@ -803,7 +804,7 @@ static ssize_t writeback_store(struct device *dev,
}
spin_unlock(&zram->wb_limit_lock);
- if (!blk_idx) {
+ if (!blk_allocated) {
io_pages = min(1UL << order, nr_pages);
io_pages = min_t(u64, bd_wb_limit_pages, io_pages);
@@ -812,6 +813,7 @@ static ssize_t writeback_store(struct device *dev,
ret = -ENOSPC;
break;
}
+ blk_allocated = true;
}
if (!writeback_prep_or_skip_index(zram, mode, index)) {
@@ -850,12 +852,12 @@ static ssize_t writeback_store(struct device *dev,
* the unused blocks after looping through all indices.
*/
if (map.nr_of_entries == io_pages) {
- blk_idx = 0;
+ blk_allocated = false;
map.nr_of_entries = 0;
}
}
- if (blk_idx)
+ if (blk_allocated)
free_block_bdev_range(zram, blk_idx + map.nr_of_entries,
io_pages - map.nr_of_entries);
folio_put(folio);
--
2.40.1
* Re: [PATCH 0/5] Improve zram writeback performance
From: Pankaj Raghav @ 2023-09-18 13:53 UTC
To: minchan, senozhatsky
Cc: linux-kernel, axboe, linux-block, gost.dev, Pankaj Raghav
Gentle ping Minchan and Sergey.
Regards,
Pankaj
On 2023-09-11 15:34, Pankaj Raghav wrote:
> ZRAM can have a backing device that can be used as a writeback target
> for the pages in RAM. The current writeback code (writeback_store())
> does synchronous single-page IOs to the backing device.
> [...]
> Batching reduces the writeback time of 4G of data to an nvme backing
> device from 68 secs to 15 secs (more than a 4x improvement).
> [...]
* Re: [PATCH 0/5] Improve zram writeback performance
From: Sergey Senozhatsky @ 2023-09-19 0:33 UTC
To: Pankaj Raghav
Cc: minchan, senozhatsky, linux-kernel, axboe, linux-block, gost.dev,
Pankaj Raghav
On (23/09/18 15:53), Pankaj Raghav wrote:
> Gentle ping Minchan and Sergey.
Hello,
zram writeback is currently under (heavy) rework; the series hasn't
been published yet, but it has been in the making for some time already.
The biggest change is that zram will support compressed writeback, that
is, writeback of compressed objects, as opposed to the current design,
where zram decompresses pages before writeback.
Minchan will have more details, but I guess we'll need to wait for
that series to land.
* Re: [PATCH 0/5] Improve zram writeback performance
From: Pankaj Raghav @ 2023-09-19 14:20 UTC
To: Sergey Senozhatsky
Cc: minchan, linux-kernel, axboe, linux-block, gost.dev,
Pankaj Raghav
On 2023-09-19 02:33, Sergey Senozhatsky wrote:
> On (23/09/18 15:53), Pankaj Raghav wrote:
>> Gentle ping Minchan and Sergey.
>
> Hello,
>
> zram writeback is currently under (heavy) rework, the series hasn't
> been published yet, but it's in the making for some time already.
> The biggest change is that zram will support compressed writeback,
> that is writeback of compressed objects, as opposed to current design
> when zram de-compresses pages before writeback.
>
Got it, thanks for the explanation. Compressed writeback also makes
sense, as it will save space on the backing device.
> Minchan will have more details, but I guess we'll need to wait for
> that series to land.
This series might not be applicable given the new direction, but I will
keep an eye on the new series from Minchan anyway.
* Re: [PATCH 0/5] Improve zram writeback performance
From: Jassi Brar @ 2024-09-25 15:53 UTC
To: senozhatsky
Cc: axboe, gost.dev, kernel, linux-block, linux-kernel, minchan,
p.raghav
Hi Sergey, Hi Minchan,
>> Gentle ping Minchan and Sergey.
>
>Hello,
>
>zram writeback is currently under (heavy) rework, the series hasn't
>been published yet, but it's in the making for some time already.
>The biggest change is that zram will support compressed writeback,
>that is writeback of compressed objects, as opposed to current design
>when zram de-compresses pages before writeback.
>
>Minchan will have more details, but I guess we'll need to wait for
>that series to land.
May I please know where we are with the rework? Is there somewhere I
could look up the compressed-writeback work-in-progress code?
Regards,
Jassi
* Re: [PATCH 0/5] Improve zram writeback performance
From: Sergey Senozhatsky @ 2024-09-26 4:33 UTC
To: Jassi Brar
Cc: senozhatsky, axboe, gost.dev, kernel, linux-block, linux-kernel,
minchan, p.raghav
On (24/09/25 10:53), Jassi Brar wrote:
> Hi Sergey, Hi Minchan,
>
> >> Gentle ping Minchan and Sergey.
> >
> May I please know where we are with the rework? Is there somewhere I
> could look up the compressed-writeback work-in-progress code?
There is no code for that nor any progress that can be shared,
as far as I'm concerned.
The most recent writeback-related patch series (WIP) reworks how
writeback and re-compression select target slots for post-processing [1].
[1] https://lore.kernel.org/linux-kernel/20240917021020.883356-1-senozhatsky@chromium.org
* Re: [PATCH 0/5] Improve zram writeback performance
From: Sergey Senozhatsky @ 2024-09-26 4:41 UTC
To: Pankaj Raghav
Cc: minchan, senozhatsky, linux-kernel, axboe, p.raghav, linux-block,
gost.dev
On (23/09/11 15:34), Pankaj Raghav wrote:
> Batching reduces the writeback time of 4G of data to an nvme backing
> device from 68 secs to 15 secs (more than a 4x improvement).
I don't think anyone does that in practice. Excessive writeback wears
out flash storage, so in practice no one writes back gigabytes of data
all at once; instead, people put daily writeback limits in place and try
to be flash storage "friendly", which is especially important if your
device has to have a lifespan of 7 or 10 years. IOW, writeback is
usually put under such constraints that writeback speed is hardly
noticeable. So I'm not sure that the complexity this patch introduces is
justified, to be honest.
* Re: [PATCH 0/5] Improve zram writeback performance
From: Jassi Brar @ 2024-09-29 22:21 UTC
To: Sergey Senozhatsky
Cc: axboe, gost.dev, kernel, linux-block, linux-kernel, minchan,
p.raghav
On Wed, Sep 25, 2024 at 11:34 PM Sergey Senozhatsky
<senozhatsky@chromium.org> wrote:
>
> On (24/09/25 10:53), Jassi Brar wrote:
> > Hi Sergey, Hi Minchan,
> >
> > >> Gentle ping Minchan and Sergey.
> > >
> > May I please know where we are with the rework? Is there somewhere I
> > could look up the compressed-writeback work-in-progress code?
>
> There is no code for that nor any progress that can be shared,
> as far as I'm concerned.
>
> The most recent writeback-related patch series (WIP) reworks
> how writeback and re-compression select target slots for
> post-processing [1]
>
Thanks for the update, Sergey,
Minchan, if you are not pursuing that patchset anymore, is it possible
to share the last version you had? That could avoid re-writing from
scratch and, more importantly, straying too far from your preferred
implementation.
Thanks
Jassi