* bdi cleanups v6
@ 2020-09-21 8:07 Christoph Hellwig
2020-09-21 8:07 ` [PATCH 01/13] fs: remove the unused SB_I_MULTIROOT flag Christoph Hellwig
` (12 more replies)
0 siblings, 13 replies; 29+ messages in thread
From: Christoph Hellwig @ 2020-09-21 8:07 UTC (permalink / raw)
To: Jens Axboe
Cc: Song Liu, Hans de Goede, Coly Li, Richard Weinberger, Minchan Kim,
Johannes Thumshirn, Justin Sanders, linux-mtd, dm-devel,
linux-block, linux-bcache, linux-kernel, drbd-dev, linux-raid,
linux-fsdevel, linux-mm, cgroups
Hi Jens,
this series contains a bunch of different BDI cleanups. The biggest item
is to isolate block drivers from the BDI in preparation for changing the
lifetime of the block device BDI in a follow-up series.
Changes since v5:
- improve a commit message
- improve the stable_writes deprecation printk
- drop "drbd: remove RB_CONGESTED_REMOTE"
- drop a few hunks that add a local variable in an otherwise unchanged
file due to changes in the previous revisions
- keep updating ->io_pages in queue_max_sectors_store
- set an optimal I/O size in aoe
- inherit the optimal I/O size in bcache
Changes since v4:
- add back a prematurely removed assignment in dm-table.c
- pick up a few reviews from Johannes that got lost
Changes since v3:
- rebased on the latest block tree, which has some of the prep
changes merged
- extend the ->ra_pages changes to ->io_pages
- move initializing ->ra_pages and ->io_pages for block devices to
blk_register_queue
Changes since v2:
- fix a rw_page return value check
- fix up various changelogs
Changes since v1:
- rebased to the for-5.9/block-merge branch
- explicitly set the readahead to 0 for ubifs, vboxsf and mtd
- split the zram block_device operations
- let rw_page users fall back to bios in swap_readpage
* [PATCH 01/13] fs: remove the unused SB_I_MULTIROOT flag
From: Christoph Hellwig @ 2020-09-21 8:07 UTC (permalink / raw)
To: Jens Axboe
Cc: Song Liu, Hans de Goede, Coly Li, Richard Weinberger, Minchan Kim,
Johannes Thumshirn, Justin Sanders, linux-mtd, dm-devel,
linux-block, linux-bcache, linux-kernel, drbd-dev, linux-raid,
linux-fsdevel, linux-mm, cgroups, Jan Kara, Johannes Thumshirn
The last user of SB_I_MULTIROOT disappeared with commit f2aedb713c28
("NFS: Add fs_context support.")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/namei.c | 4 ++--
include/linux/fs.h | 1 -
2 files changed, 2 insertions(+), 3 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index e99e2a9da0f7de..f1eb8ccd2be958 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -568,8 +568,8 @@ static bool path_connected(struct vfsmount *mnt, struct dentry *dentry)
{
struct super_block *sb = mnt->mnt_sb;
- /* Bind mounts and multi-root filesystems can have disconnected paths */
- if (!(sb->s_iflags & SB_I_MULTIROOT) && (mnt->mnt_root == sb->s_root))
+ /* Bind mounts can have disconnected paths */
+ if (mnt->mnt_root == sb->s_root)
return true;
return is_subdir(dentry, mnt->mnt_root);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7519ae003a082c..fbd74df5ce5f34 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1385,7 +1385,6 @@ extern int send_sigurg(struct fown_struct *fown);
#define SB_I_CGROUPWB 0x00000001 /* cgroup-aware writeback enabled */
#define SB_I_NOEXEC 0x00000002 /* Ignore executables on this fs */
#define SB_I_NODEV 0x00000004 /* Ignore devices on this fs */
-#define SB_I_MULTIROOT 0x00000008 /* Multiple roots to the dentry tree */
/* sb->s_iflags to limit user namespace mounts */
#define SB_I_USERNS_VISIBLE 0x00000010 /* fstype already mounted */
--
2.28.0
* [PATCH 02/13] drbd: remove dead code in device_to_statistics
From: Christoph Hellwig @ 2020-09-21 8:07 UTC (permalink / raw)
To: Jens Axboe
Cc: Song Liu, Hans de Goede, Coly Li, Richard Weinberger, Minchan Kim,
Johannes Thumshirn, Justin Sanders, linux-mtd, dm-devel,
linux-block, linux-bcache, linux-kernel, drbd-dev, linux-raid,
linux-fsdevel, linux-mm, cgroups, Jan Kara, Johannes Thumshirn
Ever since the switch to blk-mq, a lower device not used for VM
writeback will not be marked congested, so the check will never
trigger.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
drivers/block/drbd/drbd_nl.c | 6 ------
1 file changed, 6 deletions(-)
diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 43c8ae4d9fca81..aaff5bde391506 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -3370,7 +3370,6 @@ static void device_to_statistics(struct device_statistics *s,
if (get_ldev(device)) {
struct drbd_md *md = &device->ldev->md;
u64 *history_uuids = (u64 *)s->history_uuids;
- struct request_queue *q;
int n;
spin_lock_irq(&md->uuid_lock);
@@ -3384,11 +3383,6 @@ static void device_to_statistics(struct device_statistics *s,
spin_unlock_irq(&md->uuid_lock);
s->dev_disk_flags = md->flags;
- q = bdev_get_queue(device->ldev->backing_bdev);
- s->dev_lower_blocked =
- bdi_congested(q->backing_dev_info,
- (1 << WB_async_congested) |
- (1 << WB_sync_congested));
put_ldev(device);
}
s->dev_size = drbd_get_capacity(device->this_bdev);
--
2.28.0
* [PATCH 03/13] bcache: inherit the optimal I/O size
From: Christoph Hellwig @ 2020-09-21 8:07 UTC (permalink / raw)
To: Jens Axboe
Cc: Song Liu, Hans de Goede, Coly Li, Richard Weinberger, Minchan Kim,
Johannes Thumshirn, Justin Sanders, linux-mtd, dm-devel,
linux-block, linux-bcache, linux-kernel, drbd-dev, linux-raid,
linux-fsdevel, linux-mm, cgroups
Inherit the optimal I/O size setting just like the readahead window,
as any reason to do larger I/O does not apply to just readahead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
drivers/md/bcache/super.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 1bbdc410ee3c51..48113005ed86ad 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1430,6 +1430,8 @@ static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
dc->disk.disk->queue->backing_dev_info->ra_pages =
max(dc->disk.disk->queue->backing_dev_info->ra_pages,
q->backing_dev_info->ra_pages);
+ blk_queue_io_opt(dc->disk.disk->queue,
+ max(queue_io_opt(dc->disk.disk->queue), queue_io_opt(q)));
atomic_set(&dc->io_errors, 0);
dc->io_disable = false;
--
2.28.0
* [PATCH 04/13] aoe: set an optimal I/O size
From: Christoph Hellwig @ 2020-09-21 8:07 UTC (permalink / raw)
To: Jens Axboe
Cc: Song Liu, Hans de Goede, Coly Li, Richard Weinberger, Minchan Kim,
Johannes Thumshirn, Justin Sanders, linux-mtd, dm-devel,
linux-block, linux-bcache, linux-kernel, drbd-dev, linux-raid,
linux-fsdevel, linux-mm, cgroups
aoe forces a larger readahead size, but any reason to do larger I/O
is not limited to readahead. Also set the optimal I/O size, and
remove the local constants in favor of just using SZ_2M.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
drivers/block/aoe/aoeblk.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/block/aoe/aoeblk.c b/drivers/block/aoe/aoeblk.c
index 5ca7216e9e01f3..d8cfc233e64b93 100644
--- a/drivers/block/aoe/aoeblk.c
+++ b/drivers/block/aoe/aoeblk.c
@@ -347,7 +347,6 @@ aoeblk_gdalloc(void *vp)
mempool_t *mp;
struct request_queue *q;
struct blk_mq_tag_set *set;
- enum { KB = 1024, MB = KB * KB, READ_AHEAD = 2 * MB, };
ulong flags;
int late = 0;
int err;
@@ -407,7 +406,8 @@ aoeblk_gdalloc(void *vp)
WARN_ON(d->gd);
WARN_ON(d->flags & DEVFL_UP);
blk_queue_max_hw_sectors(q, BLK_DEF_MAX_SECTORS);
- q->backing_dev_info->ra_pages = READ_AHEAD / PAGE_SIZE;
+ q->backing_dev_info->ra_pages = SZ_2M / PAGE_SIZE;
+ blk_queue_io_opt(q, SZ_2M);
d->bufpool = mp;
d->blkq = gd->queue = q;
q->queuedata = d;
--
2.28.0
* [PATCH 05/13] bdi: initialize ->ra_pages and ->io_pages in bdi_init
From: Christoph Hellwig @ 2020-09-21 8:07 UTC (permalink / raw)
To: Jens Axboe
Cc: Song Liu, Hans de Goede, Coly Li, Richard Weinberger, Minchan Kim,
Johannes Thumshirn, Justin Sanders, linux-mtd, dm-devel,
linux-block, linux-bcache, linux-kernel, drbd-dev, linux-raid,
linux-fsdevel, linux-mm, cgroups, David Sterba
Set up a readahead size by default, as very few users have a good
reason to change it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Acked-by: Richard Weinberger <richard@nod.at> [ubifs, mtd]
---
block/blk-core.c | 2 --
drivers/mtd/mtdcore.c | 2 ++
fs/9p/vfs_super.c | 6 ++++--
fs/afs/super.c | 1 -
fs/btrfs/disk-io.c | 1 -
fs/fuse/inode.c | 1 -
fs/nfs/super.c | 9 +--------
fs/ubifs/super.c | 2 ++
fs/vboxsf/super.c | 2 ++
mm/backing-dev.c | 2 ++
10 files changed, 13 insertions(+), 15 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index ca3f0f00c9435f..865d39e5be2b28 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -538,8 +538,6 @@ struct request_queue *blk_alloc_queue(int node_id)
if (!q->stats)
goto fail_stats;
- q->backing_dev_info->ra_pages = VM_READAHEAD_PAGES;
- q->backing_dev_info->io_pages = VM_READAHEAD_PAGES;
q->backing_dev_info->capabilities = BDI_CAP_CGROUP_WRITEBACK;
q->node = node_id;
diff --git a/drivers/mtd/mtdcore.c b/drivers/mtd/mtdcore.c
index 7d930569a7dfb7..b5e5d3140f578e 100644
--- a/drivers/mtd/mtdcore.c
+++ b/drivers/mtd/mtdcore.c
@@ -2196,6 +2196,8 @@ static struct backing_dev_info * __init mtd_bdi_init(char *name)
bdi = bdi_alloc(NUMA_NO_NODE);
if (!bdi)
return ERR_PTR(-ENOMEM);
+ bdi->ra_pages = 0;
+ bdi->io_pages = 0;
/*
* We put '-0' suffix to the name to get the same name format as we
diff --git a/fs/9p/vfs_super.c b/fs/9p/vfs_super.c
index 74df32be4c6a52..e34fa20acf612e 100644
--- a/fs/9p/vfs_super.c
+++ b/fs/9p/vfs_super.c
@@ -80,8 +80,10 @@ v9fs_fill_super(struct super_block *sb, struct v9fs_session_info *v9ses,
if (ret)
return ret;
- if (v9ses->cache)
- sb->s_bdi->ra_pages = VM_READAHEAD_PAGES;
+ if (!v9ses->cache) {
+ sb->s_bdi->ra_pages = 0;
+ sb->s_bdi->io_pages = 0;
+ }
sb->s_flags |= SB_ACTIVE | SB_DIRSYNC;
if (!v9ses->cache)
diff --git a/fs/afs/super.c b/fs/afs/super.c
index b552357b1d1379..3a40ee752c1e3f 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -456,7 +456,6 @@ static int afs_fill_super(struct super_block *sb, struct afs_fs_context *ctx)
ret = super_setup_bdi(sb);
if (ret)
return ret;
- sb->s_bdi->ra_pages = VM_READAHEAD_PAGES;
/* allocate the root inode and dentry */
if (as->dyn_root) {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index f6bba7eb1fa171..047934cea25efa 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3092,7 +3092,6 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
}
sb->s_bdi->capabilities |= BDI_CAP_CGROUP_WRITEBACK;
- sb->s_bdi->ra_pages = VM_READAHEAD_PAGES;
sb->s_bdi->ra_pages *= btrfs_super_num_devices(disk_super);
sb->s_bdi->ra_pages = max(sb->s_bdi->ra_pages, SZ_4M / PAGE_SIZE);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index bba747520e9b08..17b00670fb539e 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1049,7 +1049,6 @@ static int fuse_bdi_init(struct fuse_conn *fc, struct super_block *sb)
if (err)
return err;
- sb->s_bdi->ra_pages = VM_READAHEAD_PAGES;
/* fuse does it's own writeback accounting */
sb->s_bdi->capabilities = BDI_CAP_NO_ACCT_WB | BDI_CAP_STRICTLIMIT;
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 7a70287f21a2c1..f943e37853fa25 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -1200,13 +1200,6 @@ static void nfs_get_cache_cookie(struct super_block *sb,
}
#endif
-static void nfs_set_readahead(struct backing_dev_info *bdi,
- unsigned long iomax_pages)
-{
- bdi->ra_pages = VM_READAHEAD_PAGES;
- bdi->io_pages = iomax_pages;
-}
-
int nfs_get_tree_common(struct fs_context *fc)
{
struct nfs_fs_context *ctx = nfs_fc2context(fc);
@@ -1251,7 +1244,7 @@ int nfs_get_tree_common(struct fs_context *fc)
MINOR(server->s_dev));
if (error)
goto error_splat_super;
- nfs_set_readahead(s->s_bdi, server->rpages);
+ s->s_bdi->io_pages = server->rpages;
server->super = s;
}
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index a2420c900275a8..fbddb2a1c03f5e 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -2177,6 +2177,8 @@ static int ubifs_fill_super(struct super_block *sb, void *data, int silent)
c->vi.vol_id);
if (err)
goto out_close;
+ sb->s_bdi->ra_pages = 0;
+ sb->s_bdi->io_pages = 0;
sb->s_fs_info = c;
sb->s_magic = UBIFS_SUPER_MAGIC;
diff --git a/fs/vboxsf/super.c b/fs/vboxsf/super.c
index 8fe03b4a0d2b03..8e3792177a8523 100644
--- a/fs/vboxsf/super.c
+++ b/fs/vboxsf/super.c
@@ -167,6 +167,8 @@ static int vboxsf_fill_super(struct super_block *sb, struct fs_context *fc)
err = super_setup_bdi_name(sb, "vboxsf-%d", sbi->bdi_id);
if (err)
goto fail_free;
+ sb->s_bdi->ra_pages = 0;
+ sb->s_bdi->io_pages = 0;
/* Turn source into a shfl_string and map the folder */
size = strlen(fc->source) + 1;
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 8e8b00627bb2d8..2dac3be6127127 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -746,6 +746,8 @@ struct backing_dev_info *bdi_alloc(int node_id)
kfree(bdi);
return NULL;
}
+ bdi->ra_pages = VM_READAHEAD_PAGES;
+ bdi->io_pages = VM_READAHEAD_PAGES;
return bdi;
}
EXPORT_SYMBOL(bdi_alloc);
--
2.28.0
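
The defaulting scheme in this patch can be sketched in userspace C as
follows. This is only an illustration, not kernel code: struct sketch_bdi
and the sketch_* helpers are hypothetical names, and the 4 KiB page size
and 128 KiB default readahead window are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Assumptions for illustration: 4 KiB pages, 128 KiB default readahead. */
#define SKETCH_PAGE_SIZE          4096ul
#define SKETCH_READAHEAD_PAGES    ((128ul * 1024ul) / SKETCH_PAGE_SIZE)
#define SKETCH_SZ_4M              (4ul * 1024ul * 1024ul)

/* Hypothetical stand-in for the two fields this patch initializes. */
struct sketch_bdi {
	unsigned long ra_pages;
	unsigned long io_pages;
};

/* Every freshly allocated bdi starts with the default window. */
static void sketch_bdi_init_defaults(struct sketch_bdi *bdi)
{
	assert(bdi != NULL);
	bdi->ra_pages = SKETCH_READAHEAD_PAGES;
	bdi->io_pages = SKETCH_READAHEAD_PAGES;
}

/* Users that do not want readahead (ubifs, vboxsf, mtd) opt out explicitly. */
static void sketch_bdi_disable_readahead(struct sketch_bdi *bdi)
{
	assert(bdi != NULL);
	bdi->ra_pages = 0;
	bdi->io_pages = 0;
}

/* btrfs-style scaling on top of the default: per-device, at least 4 MiB. */
static void sketch_scale_readahead(struct sketch_bdi *bdi,
				   unsigned long num_devices)
{
	unsigned long floor_pages = SKETCH_SZ_4M / SKETCH_PAGE_SIZE;

	assert(bdi != NULL);
	bdi->ra_pages *= num_devices;
	if (bdi->ra_pages < floor_pages)
		bdi->ra_pages = floor_pages;
}
```

The point of the sketch is the direction of the opt-out: callers no longer
have to remember to set a window, only to clear it when readahead makes no
sense for them.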
* [PATCH 06/13] md: update the optimal I/O size on reshape
From: Christoph Hellwig @ 2020-09-21 8:07 UTC (permalink / raw)
To: Jens Axboe
Cc: Song Liu, Hans de Goede, Coly Li, Richard Weinberger, Minchan Kim,
Johannes Thumshirn, Justin Sanders, linux-mtd, dm-devel,
linux-block, linux-bcache, linux-kernel, drbd-dev, linux-raid,
linux-fsdevel, linux-mm, cgroups, Johannes Thumshirn
The raid5 and raid10 drivers currently update the read-ahead size,
but not the optimal I/O size, on reshape. To prepare for deriving the
read-ahead size from the optimal I/O size, make sure it is updated
as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
drivers/md/raid10.c | 22 ++++++++++++++--------
drivers/md/raid5.c | 10 ++++++++--
2 files changed, 22 insertions(+), 10 deletions(-)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index e8fa327339171c..9956a04ac13bd6 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3703,10 +3703,20 @@ static struct r10conf *setup_conf(struct mddev *mddev)
return ERR_PTR(err);
}
+static void raid10_set_io_opt(struct r10conf *conf)
+{
+ int raid_disks = conf->geo.raid_disks;
+
+ if (!(conf->geo.raid_disks % conf->geo.near_copies))
+ raid_disks /= conf->geo.near_copies;
+ blk_queue_io_opt(conf->mddev->queue, (conf->mddev->chunk_sectors << 9) *
+ raid_disks);
+}
+
static int raid10_run(struct mddev *mddev)
{
struct r10conf *conf;
- int i, disk_idx, chunk_size;
+ int i, disk_idx;
struct raid10_info *disk;
struct md_rdev *rdev;
sector_t size;
@@ -3742,18 +3752,13 @@ static int raid10_run(struct mddev *mddev)
mddev->thread = conf->thread;
conf->thread = NULL;
- chunk_size = mddev->chunk_sectors << 9;
if (mddev->queue) {
blk_queue_max_discard_sectors(mddev->queue,
mddev->chunk_sectors);
blk_queue_max_write_same_sectors(mddev->queue, 0);
blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
- blk_queue_io_min(mddev->queue, chunk_size);
- if (conf->geo.raid_disks % conf->geo.near_copies)
- blk_queue_io_opt(mddev->queue, chunk_size * conf->geo.raid_disks);
- else
- blk_queue_io_opt(mddev->queue, chunk_size *
- (conf->geo.raid_disks / conf->geo.near_copies));
+ blk_queue_io_min(mddev->queue, mddev->chunk_sectors << 9);
+ raid10_set_io_opt(conf);
}
rdev_for_each(rdev, mddev) {
@@ -4727,6 +4732,7 @@ static void end_reshape(struct r10conf *conf)
stripe /= conf->geo.near_copies;
if (conf->mddev->queue->backing_dev_info->ra_pages < 2 * stripe)
conf->mddev->queue->backing_dev_info->ra_pages = 2 * stripe;
+ raid10_set_io_opt(conf);
}
conf->fullsync = 0;
}
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 225380efd1e24f..9a7d1250894ef1 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7232,6 +7232,12 @@ static int only_parity(int raid_disk, int algo, int raid_disks, int max_degraded
return 0;
}
+static void raid5_set_io_opt(struct r5conf *conf)
+{
+ blk_queue_io_opt(conf->mddev->queue, (conf->chunk_sectors << 9) *
+ (conf->raid_disks - conf->max_degraded));
+}
+
static int raid5_run(struct mddev *mddev)
{
struct r5conf *conf;
@@ -7521,8 +7527,7 @@ static int raid5_run(struct mddev *mddev)
chunk_size = mddev->chunk_sectors << 9;
blk_queue_io_min(mddev->queue, chunk_size);
- blk_queue_io_opt(mddev->queue, chunk_size *
- (conf->raid_disks - conf->max_degraded));
+ raid5_set_io_opt(conf);
mddev->queue->limits.raid_partial_stripes_expensive = 1;
/*
* We can only discard a whole stripe. It doesn't make sense to
@@ -8115,6 +8120,7 @@ static void end_reshape(struct r5conf *conf)
/ PAGE_SIZE);
if (conf->mddev->queue->backing_dev_info->ra_pages < 2 * stripe)
conf->mddev->queue->backing_dev_info->ra_pages = 2 * stripe;
+ raid5_set_io_opt(conf);
}
}
}
--
2.28.0
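
The arithmetic behind the two new helpers can be reproduced in userspace C
roughly as below. This is a sketch under stated assumptions, not the driver
code: the sketch_* function names are hypothetical, the geometry is passed
in as plain integers instead of being read from the r10conf/r5conf
structures, and 512-byte sectors are assumed (hence the shift by 9).

```c
#include <assert.h>
#include <stdint.h>

/*
 * raid10: one chunk per disk, divided by the number of near copies when
 * the disk count is a multiple of it (mirrors raid10_set_io_opt above).
 */
static uint32_t sketch_raid10_io_opt(uint32_t chunk_sectors, int raid_disks,
				     int near_copies)
{
	assert(raid_disks > 0 && near_copies > 0);
	if (raid_disks % near_copies == 0)
		raid_disks /= near_copies;
	return (chunk_sectors << 9) * (uint32_t)raid_disks;
}

/*
 * raid5/6: one chunk per data disk, i.e. all disks minus the parity
 * disks counted by max_degraded (mirrors raid5_set_io_opt above).
 */
static uint32_t sketch_raid5_io_opt(uint32_t chunk_sectors, int raid_disks,
				    int max_degraded)
{
	assert(raid_disks > max_degraded && max_degraded >= 0);
	return (chunk_sectors << 9) * (uint32_t)(raid_disks - max_degraded);
}
```

For example, with 512 KiB chunks (1024 sectors), a 4-disk raid10 with two
near copies gets an optimal I/O size of 1 MiB, and a 4-disk raid5 gets
1.5 MiB. Growing either array changes these values, which is why the
reshape path has to call the helper again.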
* [PATCH 07/13] block: lift setting the readahead size into the block layer
From: Christoph Hellwig @ 2020-09-21 8:07 UTC (permalink / raw)
To: Jens Axboe
Cc: Song Liu, Hans de Goede, Coly Li, Richard Weinberger, Minchan Kim,
Johannes Thumshirn, Justin Sanders, linux-mtd, dm-devel,
linux-block, linux-bcache, linux-kernel, drbd-dev, linux-raid,
linux-fsdevel, linux-mm, cgroups, Johannes Thumshirn
Drivers shouldn't really mess with the readahead size, as that is a VM
concept. Instead set it based on the optimal I/O size by lifting the
algorithm from the md driver when registering the disk. Also set
bdi->io_pages there by applying the same scheme based on max_sectors.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
block/blk-settings.c | 5 ++---
block/blk-sysfs.c | 9 +++++++++
drivers/block/aoe/aoeblk.c | 1 -
drivers/block/drbd/drbd_nl.c | 12 +-----------
drivers/md/bcache/super.c | 3 ---
drivers/md/raid0.c | 16 ----------------
drivers/md/raid10.c | 24 +-----------------------
drivers/md/raid5.c | 13 +------------
8 files changed, 14 insertions(+), 69 deletions(-)
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 76a7e03bcd6cac..01049e9b998f1d 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -452,6 +452,8 @@ EXPORT_SYMBOL(blk_limits_io_opt);
void blk_queue_io_opt(struct request_queue *q, unsigned int opt)
{
blk_limits_io_opt(&q->limits, opt);
+ q->backing_dev_info->ra_pages =
+ max(queue_io_opt(q) * 2 / PAGE_SIZE, VM_READAHEAD_PAGES);
}
EXPORT_SYMBOL(blk_queue_io_opt);
@@ -628,9 +630,6 @@ void disk_stack_limits(struct gendisk *disk, struct block_device *bdev,
printk(KERN_NOTICE "%s: Warning: Device %s is misaligned\n",
top, bottom);
}
-
- t->backing_dev_info->io_pages =
- t->limits.max_sectors >> (PAGE_SHIFT - 9);
}
EXPORT_SYMBOL(disk_stack_limits);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 81722cdcf0cb21..83915b4a1fc3ad 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -854,6 +854,15 @@ int blk_register_queue(struct gendisk *disk)
percpu_ref_switch_to_percpu(&q->q_usage_counter);
}
+ /*
+ * For read-ahead of large files to be effective, we need to read ahead
+ * at least twice the optimal I/O size.
+ */
+ q->backing_dev_info->ra_pages =
+ max(queue_io_opt(q) * 2 / PAGE_SIZE, VM_READAHEAD_PAGES);
+ q->backing_dev_info->io_pages =
+ queue_max_sectors(q) >> (PAGE_SHIFT - 9);
+
ret = blk_trace_init_sysfs(dev);
if (ret)
return ret;
diff --git a/drivers/block/aoe/aoeblk.c b/drivers/block/aoe/aoeblk.c
index d8cfc233e64b93..c34e71b0c4a98c 100644
--- a/drivers/block/aoe/aoeblk.c
+++ b/drivers/block/aoe/aoeblk.c
@@ -406,7 +406,6 @@ aoeblk_gdalloc(void *vp)
WARN_ON(d->gd);
WARN_ON(d->flags & DEVFL_UP);
blk_queue_max_hw_sectors(q, BLK_DEF_MAX_SECTORS);
- q->backing_dev_info->ra_pages = SZ_2M / PAGE_SIZE;
blk_queue_io_opt(q, SZ_2M);
d->bufpool = mp;
d->blkq = gd->queue = q;
diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index aaff5bde391506..f8fb1c9b1bb6c1 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -1360,18 +1360,8 @@ static void drbd_setup_queue_param(struct drbd_device *device, struct drbd_backi
decide_on_discard_support(device, q, b, discard_zeroes_if_aligned);
decide_on_write_same_support(device, q, b, o, disable_write_same);
- if (b) {
+ if (b)
blk_stack_limits(&q->limits, &b->limits, 0);
-
- if (q->backing_dev_info->ra_pages !=
- b->backing_dev_info->ra_pages) {
- drbd_info(device, "Adjusting my ra_pages to backing device's (%lu -> %lu)\n",
- q->backing_dev_info->ra_pages,
- b->backing_dev_info->ra_pages);
- q->backing_dev_info->ra_pages =
- b->backing_dev_info->ra_pages;
- }
- }
fixup_discard_if_not_supported(q);
fixup_write_zeroes(device, q);
}
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 48113005ed86ad..6bfa771673623e 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1427,9 +1427,6 @@ static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
if (ret)
return ret;
- dc->disk.disk->queue->backing_dev_info->ra_pages =
- max(dc->disk.disk->queue->backing_dev_info->ra_pages,
- q->backing_dev_info->ra_pages);
blk_queue_io_opt(dc->disk.disk->queue,
max(queue_io_opt(dc->disk.disk->queue), queue_io_opt(q)));
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index f54a449f97aa79..aa2d7279176880 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -410,22 +410,6 @@ static int raid0_run(struct mddev *mddev)
mdname(mddev),
(unsigned long long)mddev->array_sectors);
- if (mddev->queue) {
- /* calculate the max read-ahead size.
- * For read-ahead of large files to be effective, we need to
- * readahead at least twice a whole stripe. i.e. number of devices
- * multiplied by chunk size times 2.
- * If an individual device has an ra_pages greater than the
- * chunk size, then we will not drive that device as hard as it
- * wants. We consider this a configuration error: a larger
- * chunksize should be used in that case.
- */
- int stripe = mddev->raid_disks *
- (mddev->chunk_sectors << 9) / PAGE_SIZE;
- if (mddev->queue->backing_dev_info->ra_pages < 2* stripe)
- mddev->queue->backing_dev_info->ra_pages = 2* stripe;
- }
-
dump_zones(mddev);
ret = md_integrity_register(mddev);
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 9956a04ac13bd6..5d1bdee313ec33 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3873,19 +3873,6 @@ static int raid10_run(struct mddev *mddev)
mddev->resync_max_sectors = size;
set_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
- if (mddev->queue) {
- int stripe = conf->geo.raid_disks *
- ((mddev->chunk_sectors << 9) / PAGE_SIZE);
-
- /* Calculate max read-ahead size.
- * We need to readahead at least twice a whole stripe....
- * maybe...
- */
- stripe /= conf->geo.near_copies;
- if (mddev->queue->backing_dev_info->ra_pages < 2 * stripe)
- mddev->queue->backing_dev_info->ra_pages = 2 * stripe;
- }
-
if (md_integrity_register(mddev))
goto out_free_conf;
@@ -4723,17 +4710,8 @@ static void end_reshape(struct r10conf *conf)
conf->reshape_safe = MaxSector;
spin_unlock_irq(&conf->device_lock);
- /* read-ahead size must cover two whole stripes, which is
- * 2 * (datadisks) * chunksize where 'n' is the number of raid devices
- */
- if (conf->mddev->queue) {
- int stripe = conf->geo.raid_disks *
- ((conf->mddev->chunk_sectors << 9) / PAGE_SIZE);
- stripe /= conf->geo.near_copies;
- if (conf->mddev->queue->backing_dev_info->ra_pages < 2 * stripe)
- conf->mddev->queue->backing_dev_info->ra_pages = 2 * stripe;
+ if (conf->mddev->queue)
raid10_set_io_opt(conf);
- }
conf->fullsync = 0;
}
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 9a7d1250894ef1..7ace1f76b14736 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7522,8 +7522,6 @@ static int raid5_run(struct mddev *mddev)
int data_disks = conf->previous_raid_disks - conf->max_degraded;
int stripe = data_disks *
((mddev->chunk_sectors << 9) / PAGE_SIZE);
- if (mddev->queue->backing_dev_info->ra_pages < 2 * stripe)
- mddev->queue->backing_dev_info->ra_pages = 2 * stripe;
chunk_size = mddev->chunk_sectors << 9;
blk_queue_io_min(mddev->queue, chunk_size);
@@ -8111,17 +8109,8 @@ static void end_reshape(struct r5conf *conf)
spin_unlock_irq(&conf->device_lock);
wake_up(&conf->wait_for_overlap);
- /* read-ahead size must cover two whole stripes, which is
- * 2 * (datadisks) * chunksize where 'n' is the number of raid devices
- */
- if (conf->mddev->queue) {
- int data_disks = conf->raid_disks - conf->max_degraded;
- int stripe = data_disks * ((conf->chunk_sectors << 9)
- / PAGE_SIZE);
- if (conf->mddev->queue->backing_dev_info->ra_pages < 2 * stripe)
- conf->mddev->queue->backing_dev_info->ra_pages = 2 * stripe;
+ if (conf->mddev->queue)
raid5_set_io_opt(conf);
- }
}
}
--
2.28.0
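
The sizing rule that this patch moves into the block layer can be tried out
in userspace C with something like the following. It is a minimal sketch,
not the kernel implementation: the sketch_* names are hypothetical, the page
shift is passed in explicitly, and the 128 KiB default readahead window is
an assumption made for the example. Unlike the kernel expression, the sketch
widens to 64 bits before doubling.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed default readahead window in bytes (illustration only). */
#define SKETCH_READAHEAD_BYTES (128ull * 1024ull)

/*
 * Readahead window in pages: at least twice the optimal I/O size,
 * but never below the default window.
 */
static uint64_t sketch_ra_pages(uint32_t io_opt_bytes, unsigned int page_shift)
{
	uint64_t page_size, dflt, want;

	/* Pages must be at least one sector and no larger than the default. */
	assert(page_shift >= 9 && page_shift <= 17);
	page_size = 1ull << page_shift;
	dflt = SKETCH_READAHEAD_BYTES / page_size;
	want = (uint64_t)io_opt_bytes * 2 / page_size;
	return want > dflt ? want : dflt;
}

/* I/O size in pages: max_sectors (512-byte units) converted to pages. */
static uint64_t sketch_io_pages(uint32_t max_sectors, unsigned int page_shift)
{
	assert(page_shift >= 9);
	return (uint64_t)max_sectors >> (page_shift - 9);
}
```

With 4 KiB pages (page_shift of 12), a device that reports no optimal I/O
size keeps the 32-page default, while the 2 MiB optimal I/O size that aoe
sets yields a 1024-page window.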
* [PATCH 08/13] bdi: remove BDI_CAP_CGROUP_WRITEBACK
From: Christoph Hellwig @ 2020-09-21 8:07 UTC (permalink / raw)
To: Jens Axboe
Cc: Song Liu, Hans de Goede, Coly Li, Richard Weinberger, Minchan Kim,
Johannes Thumshirn, Justin Sanders, linux-mtd, dm-devel,
linux-block, linux-bcache, linux-kernel, drbd-dev, linux-raid,
linux-fsdevel, linux-mm, cgroups, Jan Kara, Johannes Thumshirn
Just checking SB_I_CGROUPWB for cgroup writeback support is enough.
Either the file system allocates its own bdi (e.g. btrfs), in which case
it is known to support cgroup writeback, or the bdi comes from the block
layer, which always supports cgroup writeback.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
block/blk-core.c | 1 -
fs/btrfs/disk-io.c | 1 -
include/linux/backing-dev.h | 8 +++-----
3 files changed, 3 insertions(+), 7 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 865d39e5be2b28..1cc4fa6bc7fe1f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -538,7 +538,6 @@ struct request_queue *blk_alloc_queue(int node_id)
if (!q->stats)
goto fail_stats;
- q->backing_dev_info->capabilities = BDI_CAP_CGROUP_WRITEBACK;
q->node = node_id;
atomic_set(&q->nr_active_requests_shared_sbitmap, 0);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 047934cea25efa..e24927bddd5829 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3091,7 +3091,6 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
goto fail_sb_buffer;
}
- sb->s_bdi->capabilities |= BDI_CAP_CGROUP_WRITEBACK;
sb->s_bdi->ra_pages *= btrfs_super_num_devices(disk_super);
sb->s_bdi->ra_pages = max(sb->s_bdi->ra_pages, SZ_4M / PAGE_SIZE);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 0b06b2d26c9aa3..52583b6f2ea05d 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -123,7 +123,6 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
* BDI_CAP_NO_ACCT_WB: Don't automatically account writeback pages
* BDI_CAP_STRICTLIMIT: Keep number of dirty pages below bdi threshold.
*
- * BDI_CAP_CGROUP_WRITEBACK: Supports cgroup-aware writeback.
* BDI_CAP_SYNCHRONOUS_IO: Device is so fast that asynchronous IO would be
* inefficient.
*/
@@ -233,9 +232,9 @@ int inode_congested(struct inode *inode, int cong_bits);
* inode_cgwb_enabled - test whether cgroup writeback is enabled on an inode
* @inode: inode of interest
*
- * cgroup writeback requires support from both the bdi and filesystem.
- * Also, both memcg and iocg have to be on the default hierarchy. Test
- * whether all conditions are met.
+ * Cgroup writeback requires support from the filesystem. Also, both memcg and
+ * iocg have to be on the default hierarchy. Test whether all conditions are
+ * met.
*
* Note that the test result may change dynamically on the same inode
* depending on how memcg and iocg are configured.
@@ -247,7 +246,6 @@ static inline bool inode_cgwb_enabled(struct inode *inode)
return cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
cgroup_subsys_on_dfl(io_cgrp_subsys) &&
bdi_cap_account_dirty(bdi) &&
- (bdi->capabilities & BDI_CAP_CGROUP_WRITEBACK) &&
(inode->i_sb->s_iflags & SB_I_CGROUPWB);
}
--
2.28.0
* [PATCH 09/13] bdi: remove BDI_CAP_SYNCHRONOUS_IO
From: Christoph Hellwig @ 2020-09-21 8:07 UTC (permalink / raw)
To: Jens Axboe
Cc: Song Liu, Hans de Goede, Coly Li, Richard Weinberger, Minchan Kim,
Johannes Thumshirn, Justin Sanders, linux-mtd, dm-devel,
linux-block, linux-bcache, linux-kernel, drbd-dev, linux-raid,
linux-fsdevel, linux-mm, cgroups, Jan Kara, Johannes Thumshirn
BDI_CAP_SYNCHRONOUS_IO is only checked in the swap code, and used to
decide if ->rw_page can be used on a block device. Just check for
the method instead. The only complication is that zram needs a second
set of block_device_operations as it can switch between modes that
actually support ->rw_page and those that don't.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
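Note (illustration only, not part of the patch): the idea in the commit
message, testing for an optional method instead of a capability bit, can
be sketched in plain userspace C. Every sketch_* name below is a
hypothetical stand-in and not a kernel definition. The sketch assumes the
default ops table provides ->rw_page and the writeback table omits it.

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins; these are NOT the kernel structures. */
struct sketch_bdev_ops {
	int (*submit_bio)(void *bio);
	int (*rw_page)(void *page);	/* optional: NULL when unsupported */
};

struct sketch_disk {
	const struct sketch_bdev_ops *fops;
};

static int sketch_submit_bio(void *bio) { (void)bio; return 0; }
static int sketch_rw_page(void *page) { (void)page; return 0; }

/* Default mode: page-at-a-time synchronous I/O is available. */
static const struct sketch_bdev_ops sketch_devops = {
	.submit_bio = sketch_submit_bio,
	.rw_page = sketch_rw_page,
};

/* Writeback mode: the same table without ->rw_page. */
static const struct sketch_bdev_ops sketch_wb_devops = {
	.submit_bio = sketch_submit_bio,
};

/* Switching modes swaps the whole ops table instead of toggling a bit. */
static void sketch_set_backing_dev(struct sketch_disk *disk, bool has_backing)
{
	disk->fops = has_backing ? &sketch_wb_devops : &sketch_devops;
}

/* The consumer only asks whether the method exists. */
static bool sketch_is_synchronous(const struct sketch_disk *disk)
{
	return disk && disk->fops->rw_page != NULL;
}
```

Because both tables are const, the mode switch is a single pointer store
and the consumer never needs a separate flag that could drift out of sync
with what the driver actually implements.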
drivers/block/brd.c | 1 -
drivers/block/zram/zram_drv.c | 19 +++++++++++++------
drivers/nvdimm/btt.c | 2 --
drivers/nvdimm/pmem.c | 1 -
include/linux/backing-dev.h | 9 ---------
mm/swapfile.c | 2 +-
6 files changed, 14 insertions(+), 20 deletions(-)
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 2723a70eb85593..cc49a921339f77 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -403,7 +403,6 @@ static struct brd_device *brd_alloc(int i)
disk->flags = GENHD_FL_EXT_DEVT;
sprintf(disk->disk_name, "ram%d", i);
set_capacity(disk, rd_size * 2);
- brd->brd_queue->backing_dev_info->capabilities |= BDI_CAP_SYNCHRONOUS_IO;
/* Tell the block layer that this is not a rotational device */
blk_queue_flag_set(QUEUE_FLAG_NONROT, brd->brd_queue);
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index a356275605b104..1b51bb664f91f5 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -52,6 +52,9 @@ static unsigned int num_devices = 1;
*/
static size_t huge_class_size;
+static const struct block_device_operations zram_devops;
+static const struct block_device_operations zram_wb_devops;
+
static void zram_free_page(struct zram *zram, size_t index);
static int zram_bvec_read(struct zram *zram, struct bio_vec *bvec,
u32 index, int offset, struct bio *bio);
@@ -408,8 +411,7 @@ static void reset_bdev(struct zram *zram)
zram->backing_dev = NULL;
zram->old_block_size = 0;
zram->bdev = NULL;
- zram->disk->queue->backing_dev_info->capabilities |=
- BDI_CAP_SYNCHRONOUS_IO;
+ zram->disk->fops = &zram_devops;
kvfree(zram->bitmap);
zram->bitmap = NULL;
}
@@ -528,8 +530,7 @@ static ssize_t backing_dev_store(struct device *dev,
* freely but in fact, IO is going on so finally could cause
* use-after-free when the IO is really done.
*/
- zram->disk->queue->backing_dev_info->capabilities &=
- ~BDI_CAP_SYNCHRONOUS_IO;
+ zram->disk->fops = &zram_wb_devops;
up_write(&zram->init_lock);
pr_info("setup backing device %s\n", file_name);
@@ -1819,6 +1820,13 @@ static const struct block_device_operations zram_devops = {
.owner = THIS_MODULE
};
+static const struct block_device_operations zram_wb_devops = {
+ .open = zram_open,
+ .submit_bio = zram_submit_bio,
+ .swap_slot_free_notify = zram_slot_free_notify,
+ .owner = THIS_MODULE
+};
+
static DEVICE_ATTR_WO(compact);
static DEVICE_ATTR_RW(disksize);
static DEVICE_ATTR_RO(initstate);
@@ -1946,8 +1954,7 @@ static int zram_add(void)
if (ZRAM_LOGICAL_BLOCK_SIZE == PAGE_SIZE)
blk_queue_max_write_zeroes_sectors(zram->disk->queue, UINT_MAX);
- zram->disk->queue->backing_dev_info->capabilities |=
- (BDI_CAP_STABLE_WRITES | BDI_CAP_SYNCHRONOUS_IO);
+ zram->disk->queue->backing_dev_info->capabilities |= BDI_CAP_STABLE_WRITES;
device_add_disk(NULL, zram->disk, zram_disk_attr_groups);
strlcpy(zram->compressor, default_compressor, sizeof(zram->compressor));
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 0d710140bf93be..12ff6f8784ac11 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -1537,8 +1537,6 @@ static int btt_blk_init(struct btt *btt)
btt->btt_disk->private_data = btt;
btt->btt_disk->queue = btt->btt_queue;
btt->btt_disk->flags = GENHD_FL_EXT_DEVT;
- btt->btt_disk->queue->backing_dev_info->capabilities |=
- BDI_CAP_SYNCHRONOUS_IO;
blk_queue_logical_block_size(btt->btt_queue, btt->sector_size);
blk_queue_max_hw_sectors(btt->btt_queue, UINT_MAX);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 140cf3b9000c60..1711fdfd8d2816 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -475,7 +475,6 @@ static int pmem_attach_disk(struct device *dev,
disk->queue = q;
disk->flags = GENHD_FL_EXT_DEVT;
disk->private_data = pmem;
- disk->queue->backing_dev_info->capabilities |= BDI_CAP_SYNCHRONOUS_IO;
nvdimm_namespace_disk_name(ndns, disk->disk_name);
set_capacity(disk, (pmem->size - pmem->pfn_pad - pmem->data_offset)
/ 512);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 52583b6f2ea05d..860ea33571bce5 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -122,9 +122,6 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
* BDI_CAP_NO_WRITEBACK: Don't write pages back
* BDI_CAP_NO_ACCT_WB: Don't automatically account writeback pages
* BDI_CAP_STRICTLIMIT: Keep number of dirty pages below bdi threshold.
- *
- * BDI_CAP_SYNCHRONOUS_IO: Device is so fast that asynchronous IO would be
- * inefficient.
*/
#define BDI_CAP_NO_ACCT_DIRTY 0x00000001
#define BDI_CAP_NO_WRITEBACK 0x00000002
@@ -132,7 +129,6 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
#define BDI_CAP_STABLE_WRITES 0x00000008
#define BDI_CAP_STRICTLIMIT 0x00000010
#define BDI_CAP_CGROUP_WRITEBACK 0x00000020
-#define BDI_CAP_SYNCHRONOUS_IO 0x00000040
#define BDI_CAP_NO_ACCT_AND_WRITEBACK \
(BDI_CAP_NO_WRITEBACK | BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_ACCT_WB)
@@ -174,11 +170,6 @@ static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
long congestion_wait(int sync, long timeout);
long wait_iff_congested(int sync, long timeout);
-static inline bool bdi_cap_synchronous_io(struct backing_dev_info *bdi)
-{
- return bdi->capabilities & BDI_CAP_SYNCHRONOUS_IO;
-}
-
static inline bool bdi_cap_stable_pages_required(struct backing_dev_info *bdi)
{
return bdi->capabilities & BDI_CAP_STABLE_WRITES;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 12f59e641b5e29..986fe5aad30e18 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3237,7 +3237,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
if (bdi_cap_stable_pages_required(inode_to_bdi(inode)))
p->flags |= SWP_STABLE_WRITES;
- if (bdi_cap_synchronous_io(inode_to_bdi(inode)))
+ if (p->bdev && p->bdev->bd_disk->fops->rw_page)
p->flags |= SWP_SYNCHRONOUS_IO;
if (p->bdev && blk_queue_nonrot(bdev_get_queue(p->bdev))) {
--
2.28.0
* [PATCH 10/13] mm: use SWP_SYNCHRONOUS_IO more intelligently
2020-09-21 8:07 bdi cleanups v6 Christoph Hellwig
` (8 preceding siblings ...)
2020-09-21 8:07 ` [PATCH 09/13] bdi: remove BDI_CAP_SYNCHRONOUS_IO Christoph Hellwig
@ 2020-09-21 8:07 ` Christoph Hellwig
2020-09-21 8:07 ` [PATCH 11/13] bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag Christoph Hellwig
` (2 subsequent siblings)
12 siblings, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2020-09-21 8:07 UTC (permalink / raw)
To: Jens Axboe
Cc: Song Liu, Hans de Goede, Coly Li, Richard Weinberger, Minchan Kim,
Johannes Thumshirn, Justin Sanders, linux-mtd, dm-devel,
linux-block, linux-bcache, linux-kernel, drbd-dev, linux-raid,
linux-fsdevel, linux-mm, cgroups, Jan Kara, Johannes Thumshirn
There is no point in trying to call bdev_read_page if SWP_SYNCHRONOUS_IO
is not set, as the device won't support it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
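Note (illustration only, not part of the patch): a userspace sketch of
the control flow described in the commit message. The page read is
attempted only when the synchronous flag is set, and any failure falls
through to the bio path. All sketch_* and SKETCH_* names, including the
flag's bit value, are hypothetical.

```c
#define SKETCH_SWP_SYNCHRONOUS_IO (1u << 0)	/* hypothetical bit value */

enum sketch_path { SKETCH_PATH_RW_PAGE, SKETCH_PATH_BIO };

struct sketch_swap_info {
	unsigned int flags;
	int rw_page_result;	/* simulated page read result, 0 is success */
	int rw_page_calls;	/* how often the page read was attempted */
};

/* Stand-in for the page read; it only records that it was called. */
static int sketch_read_page(struct sketch_swap_info *sis)
{
	sis->rw_page_calls++;
	return sis->rw_page_result;
}

/* Returns which path ended up servicing the read. */
static enum sketch_path sketch_swap_readpage(struct sketch_swap_info *sis)
{
	if (sis->flags & SKETCH_SWP_SYNCHRONOUS_IO) {
		if (!sketch_read_page(sis))
			return SKETCH_PATH_RW_PAGE;
	}
	/* Flag not set, or the page read failed: build a bio instead. */
	return SKETCH_PATH_BIO;
}
```

Without the flag the page read is never attempted, which is the point of
the change: the call is skipped when it is known that it cannot succeed.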
mm/page_io.c | 18 ++++++++++--------
1 file changed, 10 insertions(+), 8 deletions(-)
diff --git a/mm/page_io.c b/mm/page_io.c
index e485a6e8a6cddb..b199b87e0aa92b 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -403,15 +403,17 @@ int swap_readpage(struct page *page, bool synchronous)
goto out;
}
- ret = bdev_read_page(sis->bdev, swap_page_sector(page), page);
- if (!ret) {
- if (trylock_page(page)) {
- swap_slot_free_notify(page);
- unlock_page(page);
- }
+ if (sis->flags & SWP_SYNCHRONOUS_IO) {
+ ret = bdev_read_page(sis->bdev, swap_page_sector(page), page);
+ if (!ret) {
+ if (trylock_page(page)) {
+ swap_slot_free_notify(page);
+ unlock_page(page);
+ }
- count_vm_event(PSWPIN);
- goto out;
+ count_vm_event(PSWPIN);
+ goto out;
+ }
}
ret = 0;
--
2.28.0
* [PATCH 11/13] bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag
2020-09-21 8:07 bdi cleanups v6 Christoph Hellwig
` (9 preceding siblings ...)
2020-09-21 8:07 ` [PATCH 10/13] mm: use SWP_SYNCHRONOUS_IO more intelligently Christoph Hellwig
@ 2020-09-21 8:07 ` Christoph Hellwig
2020-09-21 8:07 ` [PATCH 12/13] bdi: invert BDI_CAP_NO_ACCT_WB Christoph Hellwig
2020-09-21 8:07 ` [PATCH 13/13] bdi: replace BDI_CAP_NO_{WRITEBACK,ACCT_DIRTY} with a single flag Christoph Hellwig
12 siblings, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2020-09-21 8:07 UTC (permalink / raw)
To: Jens Axboe
Cc: Song Liu, Hans de Goede, Coly Li, Richard Weinberger, Minchan Kim,
Johannes Thumshirn, Justin Sanders, linux-mtd, dm-devel,
linux-block, linux-bcache, linux-kernel, drbd-dev, linux-raid,
linux-fsdevel, linux-mm, cgroups, Jan Kara, Johannes Thumshirn
The BDI_CAP_STABLE_WRITES flag is one of the few bits of information in the
backing_dev_info shared between the block drivers and the writeback code.
To help untangle the dependency, replace it with a queue flag and a
superblock flag derived from it. This also helps with the case of e.g.
a file system requiring stable writes due to its own checksumming, but
not forcing it on other users of the block device like the swap code.
One downside is that we can't support the stable_pages_required bdi
attribute in sysfs anymore. It is replaced with a queue attribute which
also is writable for easier testing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
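Note (illustration only, not part of the patch): a userspace sketch of
the split described in the commit message. The queue carries the
device-level flag, the superblock flag is derived from it at mount time,
and a file system may set the superblock flag for its own reasons without
affecting other users of the queue. The sketch_* types and helpers are
hypothetical; the bit number and mask mirror the values in this patch.

```c
#include <stdbool.h>

#define SKETCH_QUEUE_FLAG_STABLE_WRITES	15		/* bit number */
#define SKETCH_SB_I_STABLE_WRITES	0x00000008u	/* bit mask */

struct sketch_queue { unsigned long queue_flags; };
struct sketch_super { unsigned long s_iflags; };

/* Device-level question, as asked by e.g. the swap code. */
static bool sketch_queue_stable_writes(const struct sketch_queue *q)
{
	return q->queue_flags & (1UL << SKETCH_QUEUE_FLAG_STABLE_WRITES);
}

/* Mount time: the superblock inherits the requirement from the queue. */
static void sketch_set_bdev_super(struct sketch_super *sb,
				  const struct sketch_queue *q)
{
	if (sketch_queue_stable_writes(q))
		sb->s_iflags |= SKETCH_SB_I_STABLE_WRITES;
}

/* File-system-level question, as asked by the writeback code. */
static bool sketch_fs_needs_stable_page(const struct sketch_super *sb)
{
	return sb->s_iflags & SKETCH_SB_I_STABLE_WRITES;
}
```

In this sketch the superblock flag is copied once at mount time, so a
later change to the queue flag does not propagate to a superblock that is
already mounted.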
block/blk-integrity.c | 4 ++--
block/blk-mq-debugfs.c | 1 +
block/blk-sysfs.c | 3 +++
drivers/block/rbd.c | 2 +-
drivers/block/zram/zram_drv.c | 2 +-
drivers/md/dm-table.c | 6 +++---
drivers/md/raid5.c | 8 ++++----
drivers/mmc/core/queue.c | 3 +--
drivers/nvme/host/core.c | 3 +--
drivers/nvme/host/multipath.c | 10 +++-------
drivers/scsi/iscsi_tcp.c | 4 ++--
fs/super.c | 2 ++
include/linux/backing-dev.h | 6 ------
include/linux/blkdev.h | 3 +++
include/linux/fs.h | 1 +
mm/backing-dev.c | 7 +++----
mm/page-writeback.c | 2 +-
mm/swapfile.c | 2 +-
18 files changed, 33 insertions(+), 36 deletions(-)
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index c03705cbb9c9f2..2b36a8f9b81390 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -408,7 +408,7 @@ void blk_integrity_register(struct gendisk *disk, struct blk_integrity *template
bi->tuple_size = template->tuple_size;
bi->tag_size = template->tag_size;
- disk->queue->backing_dev_info->capabilities |= BDI_CAP_STABLE_WRITES;
+ blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, disk->queue);
#ifdef CONFIG_BLK_INLINE_ENCRYPTION
if (disk->queue->ksm) {
@@ -428,7 +428,7 @@ EXPORT_SYMBOL(blk_integrity_register);
*/
void blk_integrity_unregister(struct gendisk *disk)
{
- disk->queue->backing_dev_info->capabilities &= ~BDI_CAP_STABLE_WRITES;
+ blk_queue_flag_clear(QUEUE_FLAG_STABLE_WRITES, disk->queue);
memset(&disk->queue->integrity, 0, sizeof(struct blk_integrity));
}
EXPORT_SYMBOL(blk_integrity_unregister);
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 645b7f800cb827..3094542e12ae0f 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -116,6 +116,7 @@ static const char *const blk_queue_flag_name[] = {
QUEUE_FLAG_NAME(SAME_FORCE),
QUEUE_FLAG_NAME(DEAD),
QUEUE_FLAG_NAME(INIT_DONE),
+ QUEUE_FLAG_NAME(STABLE_WRITES),
QUEUE_FLAG_NAME(POLL),
QUEUE_FLAG_NAME(WC),
QUEUE_FLAG_NAME(FUA),
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 83915b4a1fc3ad..06c5cebaea3def 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -287,6 +287,7 @@ queue_##name##_store(struct request_queue *q, const char *page, size_t count) \
QUEUE_SYSFS_BIT_FNS(nonrot, NONROT, 1);
QUEUE_SYSFS_BIT_FNS(random, ADD_RANDOM, 0);
QUEUE_SYSFS_BIT_FNS(iostats, IO_STAT, 0);
+QUEUE_SYSFS_BIT_FNS(stable_writes, STABLE_WRITES, 0);
#undef QUEUE_SYSFS_BIT_FNS
static ssize_t queue_zoned_show(struct request_queue *q, char *page)
@@ -613,6 +614,7 @@ static struct queue_sysfs_entry queue_hw_sector_size_entry = {
QUEUE_RW_ENTRY(queue_nonrot, "rotational");
QUEUE_RW_ENTRY(queue_iostats, "iostats");
QUEUE_RW_ENTRY(queue_random, "add_random");
+QUEUE_RW_ENTRY(queue_stable_writes, "stable_writes");
static struct attribute *queue_attrs[] = {
&queue_requests_entry.attr,
@@ -645,6 +647,7 @@ static struct attribute *queue_attrs[] = {
&queue_nomerges_entry.attr,
&queue_rq_affinity_entry.attr,
&queue_iostats_entry.attr,
+ &queue_stable_writes_entry.attr,
&queue_random_entry.attr,
&queue_poll_entry.attr,
&queue_wc_entry.attr,
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 5d3923c0997ce0..cf5b016358cdab 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -5022,7 +5022,7 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
}
if (!ceph_test_opt(rbd_dev->rbd_client->client, NOCRC))
- q->backing_dev_info->capabilities |= BDI_CAP_STABLE_WRITES;
+ blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, q);
/*
* disk_release() expects a queue ref from add_disk() and will
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 1b51bb664f91f5..2e26e170bd9753 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1954,7 +1954,7 @@ static int zram_add(void)
if (ZRAM_LOGICAL_BLOCK_SIZE == PAGE_SIZE)
blk_queue_max_write_zeroes_sectors(zram->disk->queue, UINT_MAX);
- zram->disk->queue->backing_dev_info->capabilities |= BDI_CAP_STABLE_WRITES;
+ blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, zram->disk->queue);
device_add_disk(NULL, zram->disk, zram_disk_attr_groups);
strlcpy(zram->compressor, default_compressor, sizeof(zram->compressor));
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 5edc3079e7c199..ad58c91a1bf9b0 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1815,7 +1815,7 @@ static int device_requires_stable_pages(struct dm_target *ti,
{
struct request_queue *q = bdev_get_queue(dev->bdev);
- return q && bdi_cap_stable_pages_required(q->backing_dev_info);
+ return q && blk_queue_stable_writes(q);
}
/*
@@ -1900,9 +1900,9 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
* because they do their own checksumming.
*/
if (dm_table_requires_stable_pages(t))
- q->backing_dev_info->capabilities |= BDI_CAP_STABLE_WRITES;
+ blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, q);
else
- q->backing_dev_info->capabilities &= ~BDI_CAP_STABLE_WRITES;
+ blk_queue_flag_clear(QUEUE_FLAG_STABLE_WRITES, q);
/*
* Determine whether or not this queue's I/O timings contribute
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 7ace1f76b14736..d589d26c86ea3f 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6638,14 +6638,14 @@ raid5_store_skip_copy(struct mddev *mddev, const char *page, size_t len)
if (!conf)
err = -ENODEV;
else if (new != conf->skip_copy) {
+ struct request_queue *q = mddev->queue;
+
mddev_suspend(mddev);
conf->skip_copy = new;
if (new)
- mddev->queue->backing_dev_info->capabilities |=
- BDI_CAP_STABLE_WRITES;
+ blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, q);
else
- mddev->queue->backing_dev_info->capabilities &=
- ~BDI_CAP_STABLE_WRITES;
+ blk_queue_flag_clear(QUEUE_FLAG_STABLE_WRITES, q);
mddev_resume(mddev);
}
mddev_unlock(mddev);
diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
index 6c022ef0f84d72..80fe3852ce0f75 100644
--- a/drivers/mmc/core/queue.c
+++ b/drivers/mmc/core/queue.c
@@ -472,8 +472,7 @@ int mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card)
}
if (mmc_host_is_spi(host) && host->use_spi_crc)
- mq->queue->backing_dev_info->capabilities |=
- BDI_CAP_STABLE_WRITES;
+ blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, mq->queue);
mq->queue->queuedata = mq;
blk_queue_rq_timeout(mq->queue, 60 * HZ);
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ea1fa41fbba8df..1c9547c7a61388 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3925,8 +3925,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
goto out_free_ns;
if (ctrl->opts && ctrl->opts->data_digest)
- ns->queue->backing_dev_info->capabilities
- |= BDI_CAP_STABLE_WRITES;
+ blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, ns->queue);
blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue);
if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index d4ba736c6c8905..74896be40c1769 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -673,13 +673,9 @@ void nvme_mpath_add_disk(struct nvme_ns *ns, struct nvme_id_ns *id)
nvme_mpath_set_live(ns);
}
- if (bdi_cap_stable_pages_required(ns->queue->backing_dev_info)) {
- struct gendisk *disk = ns->head->disk;
-
- if (disk)
- disk->queue->backing_dev_info->capabilities |=
- BDI_CAP_STABLE_WRITES;
- }
+ if (blk_queue_stable_writes(ns->queue) && ns->head->disk)
+ blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES,
+ ns->head->disk->queue);
}
void nvme_mpath_remove_disk(struct nvme_ns_head *head)
diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
index b5dd1caae5e92d..a622f334c933f5 100644
--- a/drivers/scsi/iscsi_tcp.c
+++ b/drivers/scsi/iscsi_tcp.c
@@ -962,8 +962,8 @@ static int iscsi_sw_tcp_slave_configure(struct scsi_device *sdev)
struct iscsi_conn *conn = session->leadconn;
if (conn->datadgst_en)
- sdev->request_queue->backing_dev_info->capabilities
- |= BDI_CAP_STABLE_WRITES;
+ blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES,
+ sdev->request_queue);
blk_queue_dma_alignment(sdev->request_queue, 0);
return 0;
}
diff --git a/fs/super.c b/fs/super.c
index 904459b3511995..a51c2083cd6b18 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1256,6 +1256,8 @@ static int set_bdev_super(struct super_block *s, void *data)
s->s_dev = s->s_bdev->bd_dev;
s->s_bdi = bdi_get(s->s_bdev->bd_bdi);
+ if (blk_queue_stable_writes(s->s_bdev->bd_disk->queue))
+ s->s_iflags |= SB_I_STABLE_WRITES;
return 0;
}
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 860ea33571bce5..5da4ea3dd0cc5c 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -126,7 +126,6 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
#define BDI_CAP_NO_ACCT_DIRTY 0x00000001
#define BDI_CAP_NO_WRITEBACK 0x00000002
#define BDI_CAP_NO_ACCT_WB 0x00000004
-#define BDI_CAP_STABLE_WRITES 0x00000008
#define BDI_CAP_STRICTLIMIT 0x00000010
#define BDI_CAP_CGROUP_WRITEBACK 0x00000020
@@ -170,11 +169,6 @@ static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
long congestion_wait(int sync, long timeout);
long wait_iff_congested(int sync, long timeout);
-static inline bool bdi_cap_stable_pages_required(struct backing_dev_info *bdi)
-{
- return bdi->capabilities & BDI_CAP_STABLE_WRITES;
-}
-
static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
{
return !(bdi->capabilities & BDI_CAP_NO_WRITEBACK);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5bd96fbab9b4c8..3192937d89e3f5 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -606,6 +606,7 @@ struct request_queue {
#define QUEUE_FLAG_SAME_FORCE 12 /* force complete on same CPU */
#define QUEUE_FLAG_DEAD 13 /* queue tear-down finished */
#define QUEUE_FLAG_INIT_DONE 14 /* queue is initialized */
+#define QUEUE_FLAG_STABLE_WRITES 15 /* don't modify blks until WB is done */
#define QUEUE_FLAG_POLL 16 /* IO polling enabled if set */
#define QUEUE_FLAG_WC 17 /* Write back caching */
#define QUEUE_FLAG_FUA 18 /* device supports FUA writes */
@@ -635,6 +636,8 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
#define blk_queue_noxmerges(q) \
test_bit(QUEUE_FLAG_NOXMERGES, &(q)->queue_flags)
#define blk_queue_nonrot(q) test_bit(QUEUE_FLAG_NONROT, &(q)->queue_flags)
+#define blk_queue_stable_writes(q) \
+ test_bit(QUEUE_FLAG_STABLE_WRITES, &(q)->queue_flags)
#define blk_queue_io_stat(q) test_bit(QUEUE_FLAG_IO_STAT, &(q)->queue_flags)
#define blk_queue_add_random(q) test_bit(QUEUE_FLAG_ADD_RANDOM, &(q)->queue_flags)
#define blk_queue_discard(q) test_bit(QUEUE_FLAG_DISCARD, &(q)->queue_flags)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fbd74df5ce5f34..222465b7cf4178 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1385,6 +1385,7 @@ extern int send_sigurg(struct fown_struct *fown);
#define SB_I_CGROUPWB 0x00000001 /* cgroup-aware writeback enabled */
#define SB_I_NOEXEC 0x00000002 /* Ignore executables on this fs */
#define SB_I_NODEV 0x00000004 /* Ignore devices on this fs */
+#define SB_I_STABLE_WRITES 0x00000008 /* don't modify blks until WB is done */
/* sb->s_iflags to limit user namespace mounts */
#define SB_I_USERNS_VISIBLE 0x00000010 /* fstype already mounted */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 2dac3be6127127..8e3802bf03a968 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -204,10 +204,9 @@ static ssize_t stable_pages_required_show(struct device *dev,
struct device_attribute *attr,
char *page)
{
- struct backing_dev_info *bdi = dev_get_drvdata(dev);
-
- return snprintf(page, PAGE_SIZE-1, "%d\n",
- bdi_cap_stable_pages_required(bdi) ? 1 : 0);
+ dev_warn_once(dev,
+ "the stable_pages_required attribute has been removed. Use the stable_writes queue attribute instead.\n");
+ return snprintf(page, PAGE_SIZE-1, "%d\n", 0);
}
static DEVICE_ATTR_RO(stable_pages_required);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 4e4ddd67b71e58..e9c36521461aaa 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2849,7 +2849,7 @@ EXPORT_SYMBOL_GPL(wait_on_page_writeback);
*/
void wait_for_stable_page(struct page *page)
{
- if (bdi_cap_stable_pages_required(inode_to_bdi(page->mapping->host)))
+ if (page->mapping->host->i_sb->s_iflags & SB_I_STABLE_WRITES)
wait_on_page_writeback(page);
}
EXPORT_SYMBOL_GPL(wait_for_stable_page);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 986fe5aad30e18..c119b839937d65 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3234,7 +3234,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
goto bad_swap_unlock_inode;
}
- if (bdi_cap_stable_pages_required(inode_to_bdi(inode)))
+ if (p->bdev && blk_queue_stable_writes(p->bdev->bd_disk->queue))
p->flags |= SWP_STABLE_WRITES;
if (p->bdev && p->bdev->bd_disk->fops->rw_page)
--
2.28.0
* [PATCH 12/13] bdi: invert BDI_CAP_NO_ACCT_WB
2020-09-21 8:07 bdi cleanups v6 Christoph Hellwig
` (10 preceding siblings ...)
2020-09-21 8:07 ` [PATCH 11/13] bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag Christoph Hellwig
@ 2020-09-21 8:07 ` Christoph Hellwig
2020-09-21 8:07 ` [PATCH 13/13] bdi: replace BDI_CAP_NO_{WRITEBACK,ACCT_DIRTY} with a single flag Christoph Hellwig
12 siblings, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2020-09-21 8:07 UTC (permalink / raw)
To: Jens Axboe
Cc: Song Liu, Hans de Goede, Coly Li, Richard Weinberger, Minchan Kim,
Johannes Thumshirn, Justin Sanders, linux-mtd, dm-devel,
linux-block, linux-bcache, linux-kernel, drbd-dev, linux-raid,
linux-fsdevel, linux-mm, cgroups, Jan Kara, Johannes Thumshirn
Replace BDI_CAP_NO_ACCT_WB with a positive BDI_CAP_WRITEBACK_ACCT to
make the checks more obvious. Also remove the pointless
bdi_cap_account_writeback wrapper that just obfuscates the check.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
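Note (illustration only, not part of the patch): inverting a negative
flag into a positive one changes what a zero-initialized capabilities
word means, which is why the allocator has to set the new default
explicitly. A userspace sketch with hypothetical sketch_* names follows;
the flag values mirror the ones in this patch.

```c
#include <stdbool.h>

#define SKETCH_CAP_WRITEBACK_ACCT	0x00000004u
#define SKETCH_CAP_STRICTLIMIT		0x00000010u

struct sketch_acct_bdi { unsigned int capabilities; };

/* Allocation: accounting is on by default, so the bit must be set here. */
static void sketch_bdi_init(struct sketch_acct_bdi *bdi)
{
	bdi->capabilities = SKETCH_CAP_WRITEBACK_ACCT;
}

/* A user that does its own accounting opts out by clearing the bit. */
static void sketch_own_accounting_init(struct sketch_acct_bdi *bdi)
{
	bdi->capabilities &= ~SKETCH_CAP_WRITEBACK_ACCT;
	bdi->capabilities |= SKETCH_CAP_STRICTLIMIT;
}

/* The check is now a plain positive test, with no wrapper needed. */
static bool sketch_accounts_writeback(const struct sketch_acct_bdi *bdi)
{
	return bdi->capabilities & SKETCH_CAP_WRITEBACK_ACCT;
}
```

The opt-out uses &= and |= rather than a plain assignment so that any
other capability bits set at allocation time are preserved.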
fs/fuse/inode.c | 3 ++-
include/linux/backing-dev.h | 13 +++----------
mm/backing-dev.c | 1 +
mm/page-writeback.c | 4 ++--
4 files changed, 8 insertions(+), 13 deletions(-)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 17b00670fb539e..581329203d6860 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1050,7 +1050,8 @@ static int fuse_bdi_init(struct fuse_conn *fc, struct super_block *sb)
return err;
/* fuse does it's own writeback accounting */
- sb->s_bdi->capabilities = BDI_CAP_NO_ACCT_WB | BDI_CAP_STRICTLIMIT;
+ sb->s_bdi->capabilities &= ~BDI_CAP_WRITEBACK_ACCT;
+ sb->s_bdi->capabilities |= BDI_CAP_STRICTLIMIT;
/*
* For a single fuse filesystem use max 1% of dirty +
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 5da4ea3dd0cc5c..b217344a2c63be 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -120,17 +120,17 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
*
* BDI_CAP_NO_ACCT_DIRTY: Dirty pages shouldn't contribute to accounting
* BDI_CAP_NO_WRITEBACK: Don't write pages back
- * BDI_CAP_NO_ACCT_WB: Don't automatically account writeback pages
+ * BDI_CAP_WRITEBACK_ACCT: Automatically account writeback pages
* BDI_CAP_STRICTLIMIT: Keep number of dirty pages below bdi threshold.
*/
#define BDI_CAP_NO_ACCT_DIRTY 0x00000001
#define BDI_CAP_NO_WRITEBACK 0x00000002
-#define BDI_CAP_NO_ACCT_WB 0x00000004
+#define BDI_CAP_WRITEBACK_ACCT 0x00000004
#define BDI_CAP_STRICTLIMIT 0x00000010
#define BDI_CAP_CGROUP_WRITEBACK 0x00000020
#define BDI_CAP_NO_ACCT_AND_WRITEBACK \
- (BDI_CAP_NO_WRITEBACK | BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_ACCT_WB)
+ (BDI_CAP_NO_WRITEBACK | BDI_CAP_NO_ACCT_DIRTY)
extern struct backing_dev_info noop_backing_dev_info;
@@ -179,13 +179,6 @@ static inline bool bdi_cap_account_dirty(struct backing_dev_info *bdi)
return !(bdi->capabilities & BDI_CAP_NO_ACCT_DIRTY);
}
-static inline bool bdi_cap_account_writeback(struct backing_dev_info *bdi)
-{
- /* Paranoia: BDI_CAP_NO_WRITEBACK implies BDI_CAP_NO_ACCT_WB */
- return !(bdi->capabilities & (BDI_CAP_NO_ACCT_WB |
- BDI_CAP_NO_WRITEBACK));
-}
-
static inline bool mapping_cap_writeback_dirty(struct address_space *mapping)
{
return bdi_cap_writeback_dirty(inode_to_bdi(mapping->host));
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 8e3802bf03a968..df18f0088dd3f5 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -745,6 +745,7 @@ struct backing_dev_info *bdi_alloc(int node_id)
kfree(bdi);
return NULL;
}
+ bdi->capabilities = BDI_CAP_WRITEBACK_ACCT;
bdi->ra_pages = VM_READAHEAD_PAGES;
bdi->io_pages = VM_READAHEAD_PAGES;
return bdi;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index e9c36521461aaa..0139f9622a92da 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2738,7 +2738,7 @@ int test_clear_page_writeback(struct page *page)
if (ret) {
__xa_clear_mark(&mapping->i_pages, page_index(page),
PAGECACHE_TAG_WRITEBACK);
- if (bdi_cap_account_writeback(bdi)) {
+ if (bdi->capabilities & BDI_CAP_WRITEBACK_ACCT) {
struct bdi_writeback *wb = inode_to_wb(inode);
dec_wb_stat(wb, WB_WRITEBACK);
@@ -2791,7 +2791,7 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
PAGECACHE_TAG_WRITEBACK);
xas_set_mark(&xas, PAGECACHE_TAG_WRITEBACK);
- if (bdi_cap_account_writeback(bdi))
+ if (bdi->capabilities & BDI_CAP_WRITEBACK_ACCT)
inc_wb_stat(inode_to_wb(inode), WB_WRITEBACK);
/*
--
2.28.0
* [PATCH 13/13] bdi: replace BDI_CAP_NO_{WRITEBACK,ACCT_DIRTY} with a single flag
2020-09-21 8:07 bdi cleanups v6 Christoph Hellwig
` (11 preceding siblings ...)
2020-09-21 8:07 ` [PATCH 12/13] bdi: invert BDI_CAP_NO_ACCT_WB Christoph Hellwig
@ 2020-09-21 8:07 ` Christoph Hellwig
12 siblings, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2020-09-21 8:07 UTC (permalink / raw)
To: Jens Axboe
Cc: Song Liu, Hans de Goede, Coly Li, Richard Weinberger, Minchan Kim,
Johannes Thumshirn, Justin Sanders, linux-mtd, dm-devel,
linux-block, linux-bcache, linux-kernel, drbd-dev, linux-raid,
linux-fsdevel, linux-mm, cgroups, Jan Kara, Johannes Thumshirn
Replace the two negative flags that are always used together with a
single positive flag that indicates the writeback capability instead
of two related non-capabilities. Also remove the pointless wrappers
that just check the flag.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
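Note (illustration only, not part of the patch): with a single positive
flag, an all-zero capabilities word means "no writeback", so a no-op
backing device needs no initializer at all. A userspace sketch with
hypothetical sketch_* names follows. The bit positions mirror the new
defines in this patch; the assumption that a regular device sets both
writeback bits is made for the sketch only.

```c
#include <stdbool.h>

#define SKETCH_BDI_CAP_WRITEBACK	(1u << 0)
#define SKETCH_BDI_CAP_WRITEBACK_ACCT	(1u << 1)
#define SKETCH_BDI_CAP_STRICTLIMIT	(1u << 2)

struct sketch_cap_bdi { unsigned int capabilities; };

/* No-op device: zero capabilities, so neither writeback nor accounting. */
static const struct sketch_cap_bdi sketch_noop_bdi = { 0 };

/* Assumed default for a regular device (an assumption of this sketch). */
static const struct sketch_cap_bdi sketch_regular_bdi = {
	.capabilities = SKETCH_BDI_CAP_WRITEBACK |
			SKETCH_BDI_CAP_WRITEBACK_ACCT,
};

/* One positive test replaces the two negated wrappers. */
static bool sketch_can_writeback(const struct sketch_cap_bdi *bdi)
{
	return bdi->capabilities & SKETCH_BDI_CAP_WRITEBACK;
}
```

Callers that previously combined two negated checks now test one bit.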
fs/9p/vfs_file.c | 2 +-
fs/fs-writeback.c | 7 +++---
include/linux/backing-dev.h | 48 ++++++++-----------------------------
mm/backing-dev.c | 6 ++---
mm/filemap.c | 4 ++--
mm/memcontrol.c | 2 +-
mm/memory-failure.c | 2 +-
mm/migrate.c | 2 +-
mm/mmap.c | 2 +-
mm/page-writeback.c | 12 +++++-----
10 files changed, 29 insertions(+), 58 deletions(-)
diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
index 3576123d82990e..6ecf863bfa2f4b 100644
--- a/fs/9p/vfs_file.c
+++ b/fs/9p/vfs_file.c
@@ -625,7 +625,7 @@ static void v9fs_mmap_vm_close(struct vm_area_struct *vma)
inode = file_inode(vma->vm_file);
- if (!mapping_cap_writeback_dirty(inode->i_mapping))
+ if (!mapping_can_writeback(inode->i_mapping))
wbc.nr_to_write = 0;
might_sleep();
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 149227160ff0b0..d4f84a2fe0878e 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2321,7 +2321,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
wb = locked_inode_to_wb_and_lock_list(inode);
- WARN(bdi_cap_writeback_dirty(wb->bdi) &&
+ WARN((wb->bdi->capabilities & BDI_CAP_WRITEBACK) &&
!test_bit(WB_registered, &wb->state),
"bdi-%s not registered\n", bdi_dev_name(wb->bdi));
@@ -2346,7 +2346,8 @@ void __mark_inode_dirty(struct inode *inode, int flags)
* to make sure background write-back happens
* later.
*/
- if (bdi_cap_writeback_dirty(wb->bdi) && wakeup_bdi)
+ if (wakeup_bdi &&
+ (wb->bdi->capabilities & BDI_CAP_WRITEBACK))
wb_wakeup_delayed(wb);
return;
}
@@ -2581,7 +2582,7 @@ int write_inode_now(struct inode *inode, int sync)
.range_end = LLONG_MAX,
};
- if (!mapping_cap_writeback_dirty(inode->i_mapping))
+ if (!mapping_can_writeback(inode->i_mapping))
wbc.nr_to_write = 0;
might_sleep();
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index b217344a2c63be..44df4fcef65c1e 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -110,27 +110,14 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
/*
* Flags in backing_dev_info::capability
*
- * The first three flags control whether dirty pages will contribute to the
- * VM's accounting and whether writepages() should be called for dirty pages
- * (something that would not, for example, be appropriate for ramfs)
- *
- * WARNING: these flags are closely related and should not normally be
- * used separately. The BDI_CAP_NO_ACCT_AND_WRITEBACK combines these
- * three flags into a single convenience macro.
- *
- * BDI_CAP_NO_ACCT_DIRTY: Dirty pages shouldn't contribute to accounting
- * BDI_CAP_NO_WRITEBACK: Don't write pages back
- * BDI_CAP_WRITEBACK_ACCT: Automatically account writeback pages
- * BDI_CAP_STRICTLIMIT: Keep number of dirty pages below bdi threshold.
+ * BDI_CAP_WRITEBACK: Supports dirty page writeback, and dirty pages
+ * should contribute to accounting
+ * BDI_CAP_WRITEBACK_ACCT: Automatically account writeback pages
+ * BDI_CAP_STRICTLIMIT: Keep number of dirty pages below bdi threshold
*/
-#define BDI_CAP_NO_ACCT_DIRTY 0x00000001
-#define BDI_CAP_NO_WRITEBACK 0x00000002
-#define BDI_CAP_WRITEBACK_ACCT 0x00000004
-#define BDI_CAP_STRICTLIMIT 0x00000010
-#define BDI_CAP_CGROUP_WRITEBACK 0x00000020
-
-#define BDI_CAP_NO_ACCT_AND_WRITEBACK \
- (BDI_CAP_NO_WRITEBACK | BDI_CAP_NO_ACCT_DIRTY)
+#define BDI_CAP_WRITEBACK (1 << 0)
+#define BDI_CAP_WRITEBACK_ACCT (1 << 1)
+#define BDI_CAP_STRICTLIMIT (1 << 2)
extern struct backing_dev_info noop_backing_dev_info;
@@ -169,24 +156,9 @@ static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
long congestion_wait(int sync, long timeout);
long wait_iff_congested(int sync, long timeout);
-static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
-{
- return !(bdi->capabilities & BDI_CAP_NO_WRITEBACK);
-}
-
-static inline bool bdi_cap_account_dirty(struct backing_dev_info *bdi)
-{
- return !(bdi->capabilities & BDI_CAP_NO_ACCT_DIRTY);
-}
-
-static inline bool mapping_cap_writeback_dirty(struct address_space *mapping)
-{
- return bdi_cap_writeback_dirty(inode_to_bdi(mapping->host));
-}
-
-static inline bool mapping_cap_account_dirty(struct address_space *mapping)
+static inline bool mapping_can_writeback(struct address_space *mapping)
{
- return bdi_cap_account_dirty(inode_to_bdi(mapping->host));
+ return inode_to_bdi(mapping->host)->capabilities & BDI_CAP_WRITEBACK;
}
static inline int bdi_sched_wait(void *word)
@@ -223,7 +195,7 @@ static inline bool inode_cgwb_enabled(struct inode *inode)
return cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
cgroup_subsys_on_dfl(io_cgrp_subsys) &&
- bdi_cap_account_dirty(bdi) &&
+ (bdi->capabilities & BDI_CAP_WRITEBACK) &&
(inode->i_sb->s_iflags & SB_I_CGROUPWB);
}
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index df18f0088dd3f5..408d5051d05b3d 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -14,9 +14,7 @@
#include <linux/device.h>
#include <trace/events/writeback.h>
-struct backing_dev_info noop_backing_dev_info = {
- .capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK,
-};
+struct backing_dev_info noop_backing_dev_info;
EXPORT_SYMBOL_GPL(noop_backing_dev_info);
static struct class *bdi_class;
@@ -745,7 +743,7 @@ struct backing_dev_info *bdi_alloc(int node_id)
kfree(bdi);
return NULL;
}
- bdi->capabilities = BDI_CAP_WRITEBACK_ACCT;
+ bdi->capabilities = BDI_CAP_WRITEBACK | BDI_CAP_WRITEBACK_ACCT;
bdi->ra_pages = VM_READAHEAD_PAGES;
bdi->io_pages = VM_READAHEAD_PAGES;
return bdi;
diff --git a/mm/filemap.c b/mm/filemap.c
index 1aaea26556cc7e..6c2a0139e22fa3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -414,7 +414,7 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
.range_end = end,
};
- if (!mapping_cap_writeback_dirty(mapping) ||
+ if (!mapping_can_writeback(mapping) ||
!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
return 0;
@@ -1702,7 +1702,7 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
no_page:
if (!page && (fgp_flags & FGP_CREAT)) {
int err;
- if ((fgp_flags & FGP_WRITE) && mapping_cap_account_dirty(mapping))
+ if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
gfp_mask |= __GFP_WRITE;
if (fgp_flags & FGP_NOFS)
gfp_mask &= ~__GFP_FS;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b807952b4d431b..d2352f76d6519f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5643,7 +5643,7 @@ static int mem_cgroup_move_account(struct page *page,
if (PageDirty(page)) {
struct address_space *mapping = page_mapping(page);
- if (mapping_cap_account_dirty(mapping)) {
+ if (mapping_can_writeback(mapping)) {
__mod_lruvec_state(from_vec, NR_FILE_DIRTY,
-nr_pages);
__mod_lruvec_state(to_vec, NR_FILE_DIRTY,
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index f1aa6433f40416..a1e73943445e77 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1006,7 +1006,7 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
*/
mapping = page_mapping(hpage);
if (!(flags & MF_MUST_KILL) && !PageDirty(hpage) && mapping &&
- mapping_cap_writeback_dirty(mapping)) {
+ mapping_can_writeback(mapping)) {
if (page_mkclean(hpage)) {
SetPageDirty(hpage);
} else {
diff --git a/mm/migrate.c b/mm/migrate.c
index 34a842a8eb6a7b..9d2f42a3a16294 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -503,7 +503,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
__dec_lruvec_state(old_lruvec, NR_SHMEM);
__inc_lruvec_state(new_lruvec, NR_SHMEM);
}
- if (dirty && mapping_cap_account_dirty(mapping)) {
+ if (dirty && mapping_can_writeback(mapping)) {
__dec_node_state(oldzone->zone_pgdat, NR_FILE_DIRTY);
__dec_zone_state(oldzone, NR_ZONE_WRITE_PENDING);
__inc_node_state(newzone->zone_pgdat, NR_FILE_DIRTY);
diff --git a/mm/mmap.c b/mm/mmap.c
index 40248d84ad5fbd..1fc0e92be4ba9b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1666,7 +1666,7 @@ int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot)
/* Can the mapping track the dirty pages? */
return vma->vm_file && vma->vm_file->f_mapping &&
- mapping_cap_account_dirty(vma->vm_file->f_mapping);
+ mapping_can_writeback(vma->vm_file->f_mapping);
}
/*
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0139f9622a92da..358d6f28c627b7 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1882,7 +1882,7 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping)
int ratelimit;
int *p;
- if (!bdi_cap_account_dirty(bdi))
+ if (!(bdi->capabilities & BDI_CAP_WRITEBACK))
return;
if (inode_cgwb_enabled(inode))
@@ -2423,7 +2423,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
trace_writeback_dirty_page(page, mapping);
- if (mapping_cap_account_dirty(mapping)) {
+ if (mapping_can_writeback(mapping)) {
struct bdi_writeback *wb;
inode_attach_wb(inode, page);
@@ -2450,7 +2450,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
void account_page_cleaned(struct page *page, struct address_space *mapping,
struct bdi_writeback *wb)
{
- if (mapping_cap_account_dirty(mapping)) {
+ if (mapping_can_writeback(mapping)) {
dec_lruvec_page_state(page, NR_FILE_DIRTY);
dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
dec_wb_stat(wb, WB_RECLAIMABLE);
@@ -2513,7 +2513,7 @@ void account_page_redirty(struct page *page)
{
struct address_space *mapping = page->mapping;
- if (mapping && mapping_cap_account_dirty(mapping)) {
+ if (mapping && mapping_can_writeback(mapping)) {
struct inode *inode = mapping->host;
struct bdi_writeback *wb;
struct wb_lock_cookie cookie = {};
@@ -2625,7 +2625,7 @@ void __cancel_dirty_page(struct page *page)
{
struct address_space *mapping = page_mapping(page);
- if (mapping_cap_account_dirty(mapping)) {
+ if (mapping_can_writeback(mapping)) {
struct inode *inode = mapping->host;
struct bdi_writeback *wb;
struct wb_lock_cookie cookie = {};
@@ -2665,7 +2665,7 @@ int clear_page_dirty_for_io(struct page *page)
VM_BUG_ON_PAGE(!PageLocked(page), page);
- if (mapping && mapping_cap_account_dirty(mapping)) {
+ if (mapping && mapping_can_writeback(mapping)) {
struct inode *inode = mapping->host;
struct bdi_writeback *wb;
struct wb_lock_cookie cookie = {};
--
2.28.0
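For readers skimming the diff, the flag inversion above can be modelled in a few lines of userspace C. This is only a sketch: the three flag values are copied from the patched header, while `struct demo_bdi` and every `demo_*` function are hypothetical stand-ins and not kernel API.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values copied from the patched include/linux/backing-dev.h. */
#define BDI_CAP_WRITEBACK	(1 << 0)
#define BDI_CAP_WRITEBACK_ACCT	(1 << 1)
#define BDI_CAP_STRICTLIMIT	(1 << 2)

/* Hypothetical stand-in for struct backing_dev_info (one field only). */
struct demo_bdi {
	unsigned int capabilities;
};

/* Same test the patch open-codes: capabilities & BDI_CAP_WRITEBACK. */
static bool demo_can_writeback(const struct demo_bdi *bdi)
{
	return bdi->capabilities & BDI_CAP_WRITEBACK;
}

/* Capabilities that bdi_alloc() assigns after the patch. */
static struct demo_bdi demo_alloc_defaults(void)
{
	struct demo_bdi bdi = {
		.capabilities = BDI_CAP_WRITEBACK | BDI_CAP_WRITEBACK_ACCT,
	};
	return bdi;
}

/* Zero-initialized BDI, like noop_backing_dev_info after the patch. */
static struct demo_bdi demo_noop(void)
{
	struct demo_bdi bdi = { 0 };
	return bdi;
}

/* Usage: an allocated BDI does writeback, the noop BDI does not. */
static void demo_usage(void)
{
	struct demo_bdi allocated = demo_alloc_defaults();
	struct demo_bdi noop = demo_noop();

	assert(demo_can_writeback(&allocated));
	assert(!demo_can_writeback(&noop));
}
```

The point of the positive flag is that an all-zero capabilities word now means "no writeback", so noop_backing_dev_info no longer needs an explicit initializer.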
^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [PATCH 03/13] bcache: inherit the optimal I/O size
2020-09-21 8:07 ` [PATCH 03/13] bcache: inherit the optimal I/O size Christoph Hellwig
@ 2020-09-21 9:54 ` Coly Li
2020-09-21 14:00 ` Christoph Hellwig
2020-09-22 8:44 ` Jan Kara
2020-09-22 9:39 ` Coly Li
2 siblings, 1 reply; 29+ messages in thread
From: Coly Li @ 2020-09-21 9:54 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Song Liu, Hans de Goede, Richard Weinberger,
Minchan Kim, Johannes Thumshirn, Justin Sanders, linux-mtd,
dm-devel, linux-block, linux-bcache, linux-kernel, drbd-dev,
linux-raid, linux-fsdevel, linux-mm, cgroups
On 2020/9/21 16:07, Christoph Hellwig wrote:
> Inherit the optimal I/O size setting just like the readahead window,
> as any reason to do larger I/O does not apply to just readahead.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> drivers/md/bcache/super.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 1bbdc410ee3c51..48113005ed86ad 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1430,6 +1430,8 @@ static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
> dc->disk.disk->queue->backing_dev_info->ra_pages =
> max(dc->disk.disk->queue->backing_dev_info->ra_pages,
> q->backing_dev_info->ra_pages);
> + blk_queue_io_opt(dc->disk.disk->queue,
> + max(queue_io_opt(dc->disk.disk->queue), queue_io_opt(q)));
>
Hi Christoph,
I am not sure whether the virtual bcache device's optimal request size
can simply be set like this.
Most of the time, inheriting the backing device's optimal request size
is fine, but there are two exceptions:
- A read request hits on the cache device
- The user sets sequential_cutoff to 0, so all writes may go to the
cache device first.
In these two cases, all I/O goes to the cache device, so using the
optimal request size of the backing device might be improper.
Just a guess: is it OK to set the optimal request size of the virtual
bcache device to the least common multiple of the cache device's and
backing device's optimal request sizes?
[snipped]
Thanks.
Coly Li
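The least-common-multiple idea floated above could be sketched as follows. This is a userspace illustration only, not what the patch does (the patch takes the max() of the two values). The `demo_*` names are hypothetical, treating 0 as "no hint" is an assumption of this sketch, and overflow for large coprime sizes is not handled.

```c
#include <assert.h>

/* Greatest common divisor via Euclid's algorithm; gcd(a, 0) == a. */
static unsigned int demo_gcd(unsigned int a, unsigned int b)
{
	while (b) {
		unsigned int t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/*
 * Least common multiple of two optimal I/O sizes in bytes.
 * Assumption: a size of 0 means "no hint", so the other value wins.
 * Dividing before multiplying reduces, but does not remove, the
 * risk of overflow.
 */
static unsigned int demo_io_opt_lcm(unsigned int cache_opt,
				    unsigned int backing_opt)
{
	if (!cache_opt)
		return backing_opt;
	if (!backing_opt)
		return cache_opt;
	return cache_opt / demo_gcd(cache_opt, backing_opt) * backing_opt;
}

/* Usage: 512 KiB cache hint and 768 KiB backing hint give 1.5 MiB. */
static void demo_usage(void)
{
	assert(demo_io_opt_lcm(512 * 1024, 768 * 1024) == 1536 * 1024);
	assert(demo_io_opt_lcm(0, 4096) == 4096);
}
```

When one size is a multiple of the other, as is common for power-of-two hints, the LCM and the max() used by the patch give the same result; they differ only for sizes such as 512 KiB and 768 KiB.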
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 03/13] bcache: inherit the optimal I/O size
2020-09-21 9:54 ` Coly Li
@ 2020-09-21 14:00 ` Christoph Hellwig
2020-09-21 15:09 ` Coly Li
0 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2020-09-21 14:00 UTC (permalink / raw)
To: Coly Li
Cc: Christoph Hellwig, Jens Axboe, Song Liu, Hans de Goede,
Richard Weinberger, Minchan Kim, Johannes Thumshirn,
Justin Sanders, linux-mtd, dm-devel, linux-block, linux-bcache,
linux-kernel, drbd-dev, linux-raid, linux-fsdevel, linux-mm,
cgroups
On Mon, Sep 21, 2020 at 05:54:59PM +0800, Coly Li wrote:
> I am not sure whether the virtual bcache device's optimal request size
> can simply be set like this.
>
> Most of the time, inheriting the backing device's optimal request size
> is fine, but there are two exceptions:
> - A read request hits on the cache device
> - The user sets sequential_cutoff to 0, so all writes may go to the
> cache device first.
> In these two cases, all I/O goes to the cache device, so using the
> optimal request size of the backing device might be improper.
>
> Just a guess: is it OK to set the optimal request size of the virtual
> bcache device to the least common multiple of the cache device's and
> backing device's optimal request sizes?
Well, if the optimal I/O size is wrong, the read ahead size also is
wrong. Can we just drop the setting?
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 03/13] bcache: inherit the optimal I/O size
2020-09-21 14:00 ` Christoph Hellwig
@ 2020-09-21 15:09 ` Coly Li
2020-09-21 18:18 ` Christoph Hellwig
0 siblings, 1 reply; 29+ messages in thread
From: Coly Li @ 2020-09-21 15:09 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Song Liu, Hans de Goede, Richard Weinberger,
Minchan Kim, Johannes Thumshirn, Justin Sanders, linux-mtd,
dm-devel, linux-block, linux-bcache, linux-kernel, drbd-dev,
linux-raid, linux-fsdevel, linux-mm, cgroups
On 2020/9/21 22:00, Christoph Hellwig wrote:
> On Mon, Sep 21, 2020 at 05:54:59PM +0800, Coly Li wrote:
>> I am not sure whether the virtual bcache device's optimal request size
>> can simply be set like this.
>>
>> Most of the time, inheriting the backing device's optimal request size
>> is fine, but there are two exceptions:
>> - A read request hits on the cache device
>> - The user sets sequential_cutoff to 0, so all writes may go to the
>> cache device first.
>> In these two cases, all I/O goes to the cache device, so using the
>> optimal request size of the backing device might be improper.
>>
>> Just a guess: is it OK to set the optimal request size of the virtual
>> bcache device to the least common multiple of the cache device's and
>> backing device's optimal request sizes?
>
> Well, if the optimal I/O size is wrong, the read ahead size also is
> wrong. Can we just drop the setting?
>
I feel this is something that should be fixed. Indeed, I overlooked it
until you pointed out the issue just now.
The optimal request size and readahead pages hints are necessary, but
the current initialization is simplistic. A better way might be to set
them dynamically, depending on the cache mode and some special
configuration.
Inspired by your point, I want to ACK your original patch although it
doesn't work well for all conditions. Then we will know that these two
settings (ra_pages and queue_io_opt) should be improved for more
situations. At least for most situations they provide proper hints.
What do you think of the above idea?
Coly Li
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 03/13] bcache: inherit the optimal I/O size
2020-09-21 15:09 ` Coly Li
@ 2020-09-21 18:18 ` Christoph Hellwig
0 siblings, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2020-09-21 18:18 UTC (permalink / raw)
To: Coly Li
Cc: Christoph Hellwig, Jens Axboe, Song Liu, Hans de Goede,
Richard Weinberger, Minchan Kim, Johannes Thumshirn,
Justin Sanders, linux-mtd, dm-devel, linux-block, linux-bcache,
linux-kernel, drbd-dev, linux-raid, linux-fsdevel, linux-mm,
cgroups
On Mon, Sep 21, 2020 at 11:09:48PM +0800, Coly Li wrote:
> I feel this is something that should be fixed. Indeed, I overlooked it
> until you pointed out the issue just now.
>
> The optimal request size and readahead pages hints are necessary, but
> the current initialization is simplistic. A better way might be to set
> them dynamically, depending on the cache mode and some special
> configuration.
>
> Inspired by your point, I want to ACK your original patch although it
> doesn't work well for all conditions. Then we will know that these two
> settings (ra_pages and queue_io_opt) should be improved for more
> situations. At least for most situations they provide proper hints.
>
> What do you think of the above idea?
Sounds like a plan. I'd really like to get this series in to get
some soaking before the end of the merge window, but we should still
have plenty of time for localized bcache updates.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 03/13] bcache: inherit the optimal I/O size
2020-09-21 8:07 ` [PATCH 03/13] bcache: inherit the optimal I/O size Christoph Hellwig
2020-09-21 9:54 ` Coly Li
@ 2020-09-22 8:44 ` Jan Kara
2020-09-22 9:39 ` Coly Li
2 siblings, 0 replies; 29+ messages in thread
From: Jan Kara @ 2020-09-22 8:44 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Song Liu, Hans de Goede, Coly Li, Richard Weinberger,
Minchan Kim, Johannes Thumshirn, Justin Sanders, linux-mtd,
dm-devel, linux-block, linux-bcache, linux-kernel, drbd-dev,
linux-raid, linux-fsdevel, linux-mm, cgroups
On Mon 21-09-20 10:07:24, Christoph Hellwig wrote:
> Inherit the optimal I/O size setting just like the readahead window,
> as any reason to do larger I/O does not apply to just readahead.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
The patch looks good to me. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> drivers/md/bcache/super.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 1bbdc410ee3c51..48113005ed86ad 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1430,6 +1430,8 @@ static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
> dc->disk.disk->queue->backing_dev_info->ra_pages =
> max(dc->disk.disk->queue->backing_dev_info->ra_pages,
> q->backing_dev_info->ra_pages);
> + blk_queue_io_opt(dc->disk.disk->queue,
> + max(queue_io_opt(dc->disk.disk->queue), queue_io_opt(q)));
>
> atomic_set(&dc->io_errors, 0);
> dc->io_disable = false;
> --
> 2.28.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 04/13] aoe: set an optimal I/O size
2020-09-21 8:07 ` [PATCH 04/13] aoe: set an " Christoph Hellwig
@ 2020-09-22 8:45 ` Jan Kara
0 siblings, 0 replies; 29+ messages in thread
From: Jan Kara @ 2020-09-22 8:45 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Song Liu, Hans de Goede, Coly Li, Richard Weinberger,
Minchan Kim, Johannes Thumshirn, Justin Sanders, linux-mtd,
dm-devel, linux-block, linux-bcache, linux-kernel, drbd-dev,
linux-raid, linux-fsdevel, linux-mm, cgroups
On Mon 21-09-20 10:07:25, Christoph Hellwig wrote:
> aoe forces a larger readahead size, but any reason to do larger I/O
> is not limited to readahead. Also set the optimal I/O size, and
> remove the local constants in favor of just using SZ_2M.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
Looks good. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> drivers/block/aoe/aoeblk.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/block/aoe/aoeblk.c b/drivers/block/aoe/aoeblk.c
> index 5ca7216e9e01f3..d8cfc233e64b93 100644
> --- a/drivers/block/aoe/aoeblk.c
> +++ b/drivers/block/aoe/aoeblk.c
> @@ -347,7 +347,6 @@ aoeblk_gdalloc(void *vp)
> mempool_t *mp;
> struct request_queue *q;
> struct blk_mq_tag_set *set;
> - enum { KB = 1024, MB = KB * KB, READ_AHEAD = 2 * MB, };
> ulong flags;
> int late = 0;
> int err;
> @@ -407,7 +406,8 @@ aoeblk_gdalloc(void *vp)
> WARN_ON(d->gd);
> WARN_ON(d->flags & DEVFL_UP);
> blk_queue_max_hw_sectors(q, BLK_DEF_MAX_SECTORS);
> - q->backing_dev_info->ra_pages = READ_AHEAD / PAGE_SIZE;
> + q->backing_dev_info->ra_pages = SZ_2M / PAGE_SIZE;
> + blk_queue_io_opt(q, SZ_2M);
> d->bufpool = mp;
> d->blkq = gd->queue = q;
> q->queuedata = d;
> --
> 2.28.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 05/13] bdi: initialize ->ra_pages and ->io_pages in bdi_init
2020-09-21 8:07 ` [PATCH 05/13] bdi: initialize ->ra_pages and ->io_pages in bdi_init Christoph Hellwig
@ 2020-09-22 8:49 ` Jan Kara
2020-09-23 15:16 ` Christoph Hellwig
0 siblings, 1 reply; 29+ messages in thread
From: Jan Kara @ 2020-09-22 8:49 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Song Liu, Hans de Goede, Coly Li, Richard Weinberger,
Minchan Kim, Johannes Thumshirn, Justin Sanders, linux-mtd,
dm-devel, linux-block, linux-bcache, linux-kernel, drbd-dev,
linux-raid, linux-fsdevel, linux-mm, cgroups, David Sterba
On Mon 21-09-20 10:07:26, Christoph Hellwig wrote:
> Set up a readahead size by default, as very few users have a good
> reason to change it.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Acked-by: David Sterba <dsterba@suse.com> [btrfs]
> Acked-by: Richard Weinberger <richard@nod.at> [ubifs, mtd]
The patch looks good to me. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>
I'd just prefer if the changelog explicitly mentioned that this patch
results in enabling readahead for coda, ecryptfs, and orangefs... Just in
case someone bisects some issue down to this patch :).
Honza
> ---
> block/blk-core.c | 2 --
> drivers/mtd/mtdcore.c | 2 ++
> fs/9p/vfs_super.c | 6 ++++--
> fs/afs/super.c | 1 -
> fs/btrfs/disk-io.c | 1 -
> fs/fuse/inode.c | 1 -
> fs/nfs/super.c | 9 +--------
> fs/ubifs/super.c | 2 ++
> fs/vboxsf/super.c | 2 ++
> mm/backing-dev.c | 2 ++
> 10 files changed, 13 insertions(+), 15 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index ca3f0f00c9435f..865d39e5be2b28 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -538,8 +538,6 @@ struct request_queue *blk_alloc_queue(int node_id)
> if (!q->stats)
> goto fail_stats;
>
> - q->backing_dev_info->ra_pages = VM_READAHEAD_PAGES;
> - q->backing_dev_info->io_pages = VM_READAHEAD_PAGES;
> q->backing_dev_info->capabilities = BDI_CAP_CGROUP_WRITEBACK;
> q->node = node_id;
>
> diff --git a/drivers/mtd/mtdcore.c b/drivers/mtd/mtdcore.c
> index 7d930569a7dfb7..b5e5d3140f578e 100644
> --- a/drivers/mtd/mtdcore.c
> +++ b/drivers/mtd/mtdcore.c
> @@ -2196,6 +2196,8 @@ static struct backing_dev_info * __init mtd_bdi_init(char *name)
> bdi = bdi_alloc(NUMA_NO_NODE);
> if (!bdi)
> return ERR_PTR(-ENOMEM);
> + bdi->ra_pages = 0;
> + bdi->io_pages = 0;
>
> /*
> * We put '-0' suffix to the name to get the same name format as we
> diff --git a/fs/9p/vfs_super.c b/fs/9p/vfs_super.c
> index 74df32be4c6a52..e34fa20acf612e 100644
> --- a/fs/9p/vfs_super.c
> +++ b/fs/9p/vfs_super.c
> @@ -80,8 +80,10 @@ v9fs_fill_super(struct super_block *sb, struct v9fs_session_info *v9ses,
> if (ret)
> return ret;
>
> - if (v9ses->cache)
> - sb->s_bdi->ra_pages = VM_READAHEAD_PAGES;
> + if (!v9ses->cache) {
> + sb->s_bdi->ra_pages = 0;
> + sb->s_bdi->io_pages = 0;
> + }
>
> sb->s_flags |= SB_ACTIVE | SB_DIRSYNC;
> if (!v9ses->cache)
> diff --git a/fs/afs/super.c b/fs/afs/super.c
> index b552357b1d1379..3a40ee752c1e3f 100644
> --- a/fs/afs/super.c
> +++ b/fs/afs/super.c
> @@ -456,7 +456,6 @@ static int afs_fill_super(struct super_block *sb, struct afs_fs_context *ctx)
> ret = super_setup_bdi(sb);
> if (ret)
> return ret;
> - sb->s_bdi->ra_pages = VM_READAHEAD_PAGES;
>
> /* allocate the root inode and dentry */
> if (as->dyn_root) {
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index f6bba7eb1fa171..047934cea25efa 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3092,7 +3092,6 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
> }
>
> sb->s_bdi->capabilities |= BDI_CAP_CGROUP_WRITEBACK;
> - sb->s_bdi->ra_pages = VM_READAHEAD_PAGES;
> sb->s_bdi->ra_pages *= btrfs_super_num_devices(disk_super);
> sb->s_bdi->ra_pages = max(sb->s_bdi->ra_pages, SZ_4M / PAGE_SIZE);
>
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index bba747520e9b08..17b00670fb539e 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -1049,7 +1049,6 @@ static int fuse_bdi_init(struct fuse_conn *fc, struct super_block *sb)
> if (err)
> return err;
>
> - sb->s_bdi->ra_pages = VM_READAHEAD_PAGES;
> /* fuse does it's own writeback accounting */
> sb->s_bdi->capabilities = BDI_CAP_NO_ACCT_WB | BDI_CAP_STRICTLIMIT;
>
> diff --git a/fs/nfs/super.c b/fs/nfs/super.c
> index 7a70287f21a2c1..f943e37853fa25 100644
> --- a/fs/nfs/super.c
> +++ b/fs/nfs/super.c
> @@ -1200,13 +1200,6 @@ static void nfs_get_cache_cookie(struct super_block *sb,
> }
> #endif
>
> -static void nfs_set_readahead(struct backing_dev_info *bdi,
> - unsigned long iomax_pages)
> -{
> - bdi->ra_pages = VM_READAHEAD_PAGES;
> - bdi->io_pages = iomax_pages;
> -}
> -
> int nfs_get_tree_common(struct fs_context *fc)
> {
> struct nfs_fs_context *ctx = nfs_fc2context(fc);
> @@ -1251,7 +1244,7 @@ int nfs_get_tree_common(struct fs_context *fc)
> MINOR(server->s_dev));
> if (error)
> goto error_splat_super;
> - nfs_set_readahead(s->s_bdi, server->rpages);
> + s->s_bdi->io_pages = server->rpages;
> server->super = s;
> }
>
> diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
> index a2420c900275a8..fbddb2a1c03f5e 100644
> --- a/fs/ubifs/super.c
> +++ b/fs/ubifs/super.c
> @@ -2177,6 +2177,8 @@ static int ubifs_fill_super(struct super_block *sb, void *data, int silent)
> c->vi.vol_id);
> if (err)
> goto out_close;
> + sb->s_bdi->ra_pages = 0;
> + sb->s_bdi->io_pages = 0;
>
> sb->s_fs_info = c;
> sb->s_magic = UBIFS_SUPER_MAGIC;
> diff --git a/fs/vboxsf/super.c b/fs/vboxsf/super.c
> index 8fe03b4a0d2b03..8e3792177a8523 100644
> --- a/fs/vboxsf/super.c
> +++ b/fs/vboxsf/super.c
> @@ -167,6 +167,8 @@ static int vboxsf_fill_super(struct super_block *sb, struct fs_context *fc)
> err = super_setup_bdi_name(sb, "vboxsf-%d", sbi->bdi_id);
> if (err)
> goto fail_free;
> + sb->s_bdi->ra_pages = 0;
> + sb->s_bdi->io_pages = 0;
>
> /* Turn source into a shfl_string and map the folder */
> size = strlen(fc->source) + 1;
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 8e8b00627bb2d8..2dac3be6127127 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -746,6 +746,8 @@ struct backing_dev_info *bdi_alloc(int node_id)
> kfree(bdi);
> return NULL;
> }
> + bdi->ra_pages = VM_READAHEAD_PAGES;
> + bdi->io_pages = VM_READAHEAD_PAGES;
> return bdi;
> }
> EXPORT_SYMBOL(bdi_alloc);
> --
> 2.28.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 07/13] block: lift setting the readahead size into the block layer
2020-09-21 8:07 ` [PATCH 07/13] block: lift setting the readahead size into the block layer Christoph Hellwig
@ 2020-09-22 9:13 ` Jan Kara
2020-09-22 9:51 ` Coly Li
1 sibling, 0 replies; 29+ messages in thread
From: Jan Kara @ 2020-09-22 9:13 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Song Liu, Hans de Goede, Coly Li, Richard Weinberger,
Minchan Kim, Johannes Thumshirn, Justin Sanders, linux-mtd,
dm-devel, linux-block, linux-bcache, linux-kernel, drbd-dev,
linux-raid, linux-fsdevel, linux-mm, cgroups
On Mon 21-09-20 10:07:28, Christoph Hellwig wrote:
> Drivers shouldn't really mess with the readahead size, as that is a VM
> concept. Instead set it based on the optimal I/O size by lifting the
> algorithm from the md driver when registering the disk. Also set
> bdi->io_pages there as well by applying the same scheme based on
> max_sectors.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
...
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 76a7e03bcd6cac..01049e9b998f1d 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -452,6 +452,8 @@ EXPORT_SYMBOL(blk_limits_io_opt);
> void blk_queue_io_opt(struct request_queue *q, unsigned int opt)
> {
> blk_limits_io_opt(&q->limits, opt);
> + q->backing_dev_info->ra_pages =
> + max(queue_io_opt(q) * 2 / PAGE_SIZE, VM_READAHEAD_PAGES);
> }
> EXPORT_SYMBOL(blk_queue_io_opt);
>
> @@ -628,9 +630,6 @@ void disk_stack_limits(struct gendisk *disk, struct block_device *bdev,
> printk(KERN_NOTICE "%s: Warning: Device %s is misaligned\n",
> top, bottom);
> }
> -
> - t->backing_dev_info->io_pages =
> - t->limits.max_sectors >> (PAGE_SHIFT - 9);
> }
> EXPORT_SYMBOL(disk_stack_limits);
One thing I've noticed is that blk_stack_limits() does not use
blk_queue_io_opt() to set the new optimal limit. That means that ra_pages
won't be updated for the new queue. E.g. your DRBD change below will result
in ra_pages not being properly updated AFAICT.
Similarly it isn't clear to me how io_pages would get updated after
blk_stack_limits() updates max_hw_sectors...
Otherwise the patch looks good.
Honza
> diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
> index aaff5bde391506..f8fb1c9b1bb6c1 100644
> --- a/drivers/block/drbd/drbd_nl.c
> +++ b/drivers/block/drbd/drbd_nl.c
> @@ -1360,18 +1360,8 @@ static void drbd_setup_queue_param(struct drbd_device *device, struct drbd_backi
> decide_on_discard_support(device, q, b, discard_zeroes_if_aligned);
> decide_on_write_same_support(device, q, b, o, disable_write_same);
>
> - if (b) {
> + if (b)
> blk_stack_limits(&q->limits, &b->limits, 0);
> -
> - if (q->backing_dev_info->ra_pages !=
> - b->backing_dev_info->ra_pages) {
> - drbd_info(device, "Adjusting my ra_pages to backing device's (%lu -> %lu)\n",
> - q->backing_dev_info->ra_pages,
> - b->backing_dev_info->ra_pages);
> - q->backing_dev_info->ra_pages =
> - b->backing_dev_info->ra_pages;
> - }
> - }
> fixup_discard_if_not_supported(q);
> fixup_write_zeroes(device, q);
> }
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
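To make the two formulas in the quoted hunks concrete, here is a small userspace sketch of the arithmetic. It assumes 4 KiB pages and a 128 KiB default readahead window; both are assumptions of this sketch, since PAGE_SIZE is architecture dependent, and all `DEMO_*`/`demo_*` names are hypothetical rather than kernel symbols.

```c
#include <assert.h>

/* Assumptions: 4 KiB pages, 128 KiB default readahead window. */
#define DEMO_PAGE_SHIFT		12
#define DEMO_PAGE_SIZE		(1UL << DEMO_PAGE_SHIFT)
#define DEMO_VM_READAHEAD_PAGES	((128UL * 1024) / DEMO_PAGE_SIZE)

/* ra_pages = max(io_opt * 2 / PAGE_SIZE, VM_READAHEAD_PAGES) */
static unsigned long demo_ra_pages(unsigned int io_opt_bytes)
{
	unsigned long pages =
		(unsigned long)io_opt_bytes * 2 / DEMO_PAGE_SIZE;

	return pages > DEMO_VM_READAHEAD_PAGES ?
		pages : DEMO_VM_READAHEAD_PAGES;
}

/* io_pages = max_sectors >> (PAGE_SHIFT - 9), with 512-byte sectors */
static unsigned long demo_io_pages(unsigned int max_sectors)
{
	return max_sectors >> (DEMO_PAGE_SHIFT - 9);
}

/* Usage: no io_opt hint falls back to the default window. */
static void demo_usage(void)
{
	assert(demo_ra_pages(0) == 32);
	/* A 2 MiB io_opt (the aoe value) yields a 4 MiB window. */
	assert(demo_ra_pages(2 * 1024 * 1024) == 1024);
	/* 2560 sectors of 512 bytes are 1280 KiB, i.e. 320 pages. */
	assert(demo_io_pages(2560) == 320);
}
```

This also shows why the concern above matters: if a stacking driver changes io_opt or max_sectors through blk_stack_limits() without these formulas being re-evaluated, ra_pages and io_pages keep whatever values were computed earlier.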
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 03/13] bcache: inherit the optimal I/O size
2020-09-21 8:07 ` [PATCH 03/13] bcache: inherit the optimal I/O size Christoph Hellwig
2020-09-21 9:54 ` Coly Li
2020-09-22 8:44 ` Jan Kara
@ 2020-09-22 9:39 ` Coly Li
2 siblings, 0 replies; 29+ messages in thread
From: Coly Li @ 2020-09-22 9:39 UTC (permalink / raw)
To: Christoph Hellwig, Jens Axboe
Cc: Song Liu, Hans de Goede, Richard Weinberger, Minchan Kim,
Johannes Thumshirn, Justin Sanders, linux-mtd, dm-devel,
linux-block, linux-bcache, linux-kernel, drbd-dev, linux-raid,
linux-fsdevel, linux-mm, cgroups
On 2020/9/21 16:07, Christoph Hellwig wrote:
> Inherit the optimal I/O size setting just like the readahead window,
> as any reason to do larger I/O does not apply to just readahead.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Coly Li <colyli@suse.de>
Thanks.
Coly Li
> ---
> drivers/md/bcache/super.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 1bbdc410ee3c51..48113005ed86ad 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1430,6 +1430,8 @@ static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
> dc->disk.disk->queue->backing_dev_info->ra_pages =
> max(dc->disk.disk->queue->backing_dev_info->ra_pages,
> q->backing_dev_info->ra_pages);
> + blk_queue_io_opt(dc->disk.disk->queue,
> + max(queue_io_opt(dc->disk.disk->queue), queue_io_opt(q)));
>
> atomic_set(&dc->io_errors, 0);
> dc->io_disable = false;
>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 07/13] block: lift setting the readahead size into the block layer
2020-09-21 8:07 ` [PATCH 07/13] block: lift setting the readahead size into the block layer Christoph Hellwig
2020-09-22 9:13 ` Jan Kara
@ 2020-09-22 9:51 ` Coly Li
1 sibling, 0 replies; 29+ messages in thread
From: Coly Li @ 2020-09-22 9:51 UTC (permalink / raw)
To: Christoph Hellwig, Jens Axboe
Cc: Song Liu, Hans de Goede, Richard Weinberger, Minchan Kim,
Johannes Thumshirn, Justin Sanders, linux-mtd, dm-devel,
linux-block, linux-bcache, linux-kernel, drbd-dev, linux-raid,
linux-fsdevel, linux-mm, cgroups
On 2020/9/21 16:07, Christoph Hellwig wrote:
> Drivers shouldn't really mess with the readahead size, as that is a VM
> concept. Instead set it based on the optimal I/O size by lifting the
> algorithm from the md driver when registering the disk. Also set
> bdi->io_pages there as well by applying the same scheme based on
> max_sectors.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
For the bcache part,
Acked-by: Coly Li <colyli@suse.de>
Thanks.
Coly Li
> ---
> block/blk-settings.c | 5 ++---
> block/blk-sysfs.c | 9 +++++++++
> drivers/block/aoe/aoeblk.c | 1 -
> drivers/block/drbd/drbd_nl.c | 12 +-----------
> drivers/md/bcache/super.c | 3 ---
> drivers/md/raid0.c | 16 ----------------
> drivers/md/raid10.c | 24 +-----------------------
> drivers/md/raid5.c | 13 +------------
> 8 files changed, 14 insertions(+), 69 deletions(-)
>
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 76a7e03bcd6cac..01049e9b998f1d 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -452,6 +452,8 @@ EXPORT_SYMBOL(blk_limits_io_opt);
> void blk_queue_io_opt(struct request_queue *q, unsigned int opt)
> {
> blk_limits_io_opt(&q->limits, opt);
> + q->backing_dev_info->ra_pages =
> + max(queue_io_opt(q) * 2 / PAGE_SIZE, VM_READAHEAD_PAGES);
> }
> EXPORT_SYMBOL(blk_queue_io_opt);
>
> @@ -628,9 +630,6 @@ void disk_stack_limits(struct gendisk *disk, struct block_device *bdev,
> printk(KERN_NOTICE "%s: Warning: Device %s is misaligned\n",
> top, bottom);
> }
> -
> - t->backing_dev_info->io_pages =
> - t->limits.max_sectors >> (PAGE_SHIFT - 9);
> }
> EXPORT_SYMBOL(disk_stack_limits);
>
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 81722cdcf0cb21..83915b4a1fc3ad 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -854,6 +854,15 @@ int blk_register_queue(struct gendisk *disk)
> percpu_ref_switch_to_percpu(&q->q_usage_counter);
> }
>
> + /*
> + * For read-ahead of large files to be effective, we need to read ahead
> + * at least twice the optimal I/O size.
> + */
> + q->backing_dev_info->ra_pages =
> + max(queue_io_opt(q) * 2 / PAGE_SIZE, VM_READAHEAD_PAGES);
> + q->backing_dev_info->io_pages =
> + queue_max_sectors(q) >> (PAGE_SHIFT - 9);
> +
> ret = blk_trace_init_sysfs(dev);
> if (ret)
> return ret;
> diff --git a/drivers/block/aoe/aoeblk.c b/drivers/block/aoe/aoeblk.c
> index d8cfc233e64b93..c34e71b0c4a98c 100644
> --- a/drivers/block/aoe/aoeblk.c
> +++ b/drivers/block/aoe/aoeblk.c
> @@ -406,7 +406,6 @@ aoeblk_gdalloc(void *vp)
> WARN_ON(d->gd);
> WARN_ON(d->flags & DEVFL_UP);
> blk_queue_max_hw_sectors(q, BLK_DEF_MAX_SECTORS);
> - q->backing_dev_info->ra_pages = SZ_2M / PAGE_SIZE;
> blk_queue_io_opt(q, SZ_2M);
> d->bufpool = mp;
> d->blkq = gd->queue = q;
> diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
> index aaff5bde391506..f8fb1c9b1bb6c1 100644
> --- a/drivers/block/drbd/drbd_nl.c
> +++ b/drivers/block/drbd/drbd_nl.c
> @@ -1360,18 +1360,8 @@ static void drbd_setup_queue_param(struct drbd_device *device, struct drbd_backi
> decide_on_discard_support(device, q, b, discard_zeroes_if_aligned);
> decide_on_write_same_support(device, q, b, o, disable_write_same);
>
> - if (b) {
> + if (b)
> blk_stack_limits(&q->limits, &b->limits, 0);
> -
> - if (q->backing_dev_info->ra_pages !=
> - b->backing_dev_info->ra_pages) {
> - drbd_info(device, "Adjusting my ra_pages to backing device's (%lu -> %lu)\n",
> - q->backing_dev_info->ra_pages,
> - b->backing_dev_info->ra_pages);
> - q->backing_dev_info->ra_pages =
> - b->backing_dev_info->ra_pages;
> - }
> - }
> fixup_discard_if_not_supported(q);
> fixup_write_zeroes(device, q);
> }
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 48113005ed86ad..6bfa771673623e 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1427,9 +1427,6 @@ static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
> if (ret)
> return ret;
>
> - dc->disk.disk->queue->backing_dev_info->ra_pages =
> - max(dc->disk.disk->queue->backing_dev_info->ra_pages,
> - q->backing_dev_info->ra_pages);
> blk_queue_io_opt(dc->disk.disk->queue,
> max(queue_io_opt(dc->disk.disk->queue), queue_io_opt(q)));
>
> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
> index f54a449f97aa79..aa2d7279176880 100644
> --- a/drivers/md/raid0.c
> +++ b/drivers/md/raid0.c
> @@ -410,22 +410,6 @@ static int raid0_run(struct mddev *mddev)
> mdname(mddev),
> (unsigned long long)mddev->array_sectors);
>
> - if (mddev->queue) {
> - /* calculate the max read-ahead size.
> - * For read-ahead of large files to be effective, we need to
> - * readahead at least twice a whole stripe. i.e. number of devices
> - * multiplied by chunk size times 2.
> - * If an individual device has an ra_pages greater than the
> - * chunk size, then we will not drive that device as hard as it
> - * wants. We consider this a configuration error: a larger
> - * chunksize should be used in that case.
> - */
> - int stripe = mddev->raid_disks *
> - (mddev->chunk_sectors << 9) / PAGE_SIZE;
> - if (mddev->queue->backing_dev_info->ra_pages < 2* stripe)
> - mddev->queue->backing_dev_info->ra_pages = 2* stripe;
> - }
> -
> dump_zones(mddev);
>
> ret = md_integrity_register(mddev);
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 9956a04ac13bd6..5d1bdee313ec33 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -3873,19 +3873,6 @@ static int raid10_run(struct mddev *mddev)
> mddev->resync_max_sectors = size;
> set_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
>
> - if (mddev->queue) {
> - int stripe = conf->geo.raid_disks *
> - ((mddev->chunk_sectors << 9) / PAGE_SIZE);
> -
> - /* Calculate max read-ahead size.
> - * We need to readahead at least twice a whole stripe....
> - * maybe...
> - */
> - stripe /= conf->geo.near_copies;
> - if (mddev->queue->backing_dev_info->ra_pages < 2 * stripe)
> - mddev->queue->backing_dev_info->ra_pages = 2 * stripe;
> - }
> -
> if (md_integrity_register(mddev))
> goto out_free_conf;
>
> @@ -4723,17 +4710,8 @@ static void end_reshape(struct r10conf *conf)
> conf->reshape_safe = MaxSector;
> spin_unlock_irq(&conf->device_lock);
>
> - /* read-ahead size must cover two whole stripes, which is
> - * 2 * (datadisks) * chunksize where 'n' is the number of raid devices
> - */
> - if (conf->mddev->queue) {
> - int stripe = conf->geo.raid_disks *
> - ((conf->mddev->chunk_sectors << 9) / PAGE_SIZE);
> - stripe /= conf->geo.near_copies;
> - if (conf->mddev->queue->backing_dev_info->ra_pages < 2 * stripe)
> - conf->mddev->queue->backing_dev_info->ra_pages = 2 * stripe;
> + if (conf->mddev->queue)
> raid10_set_io_opt(conf);
> - }
> conf->fullsync = 0;
> }
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 9a7d1250894ef1..7ace1f76b14736 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -7522,8 +7522,6 @@ static int raid5_run(struct mddev *mddev)
> int data_disks = conf->previous_raid_disks - conf->max_degraded;
> int stripe = data_disks *
> ((mddev->chunk_sectors << 9) / PAGE_SIZE);
> - if (mddev->queue->backing_dev_info->ra_pages < 2 * stripe)
> - mddev->queue->backing_dev_info->ra_pages = 2 * stripe;
>
> chunk_size = mddev->chunk_sectors << 9;
> blk_queue_io_min(mddev->queue, chunk_size);
> @@ -8111,17 +8109,8 @@ static void end_reshape(struct r5conf *conf)
> spin_unlock_irq(&conf->device_lock);
> wake_up(&conf->wait_for_overlap);
>
> - /* read-ahead size must cover two whole stripes, which is
> - * 2 * (datadisks) * chunksize where 'n' is the number of raid devices
> - */
> - if (conf->mddev->queue) {
> - int data_disks = conf->raid_disks - conf->max_degraded;
> - int stripe = data_disks * ((conf->chunk_sectors << 9)
> - / PAGE_SIZE);
> - if (conf->mddev->queue->backing_dev_info->ra_pages < 2 * stripe)
> - conf->mddev->queue->backing_dev_info->ra_pages = 2 * stripe;
> + if (conf->mddev->queue)
> raid5_set_io_opt(conf);
> - }
> }
> }
>
>
* Re: [PATCH 05/13] bdi: initialize ->ra_pages and ->io_pages in bdi_init
2020-09-22 8:49 ` Jan Kara
@ 2020-09-23 15:16 ` Christoph Hellwig
0 siblings, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2020-09-23 15:16 UTC (permalink / raw)
To: Jan Kara
Cc: Christoph Hellwig, Jens Axboe, Song Liu, Hans de Goede, Coly Li,
Richard Weinberger, Minchan Kim, Johannes Thumshirn,
Justin Sanders, linux-mtd, dm-devel, linux-block, linux-bcache,
linux-kernel, drbd-dev, linux-raid, linux-fsdevel, linux-mm,
cgroups, David Sterba
On Tue, Sep 22, 2020 at 10:49:54AM +0200, Jan Kara wrote:
> On Mon 21-09-20 10:07:26, Christoph Hellwig wrote:
> > Set up a readahead size by default, as very few users have a good
> > reason to change it.
> >
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > Acked-by: David Sterba <dsterba@suse.com> [btrfs]
> > Acked-by: Richard Weinberger <richard@nod.at> [ubifs, mtd]
>
> The patch looks good to me. You can add:
>
> Reviewed-by: Jan Kara <jack@suse.cz>
>
> I'd just prefer if the changelog explicitly mentioned that this patch
> results in enabling readahead for coda, ecryptfs, and orangefs... Just in
> case someone bisects some issue down to this patch :).
Ok, I've updated the changelog.
* [PATCH 07/13] block: lift setting the readahead size into the block layer
2020-09-24 6:51 bdi cleanups v7 Christoph Hellwig
@ 2020-09-24 6:51 ` Christoph Hellwig
2020-09-24 14:53 ` Jan Kara
` (2 more replies)
0 siblings, 3 replies; 29+ messages in thread
From: Christoph Hellwig @ 2020-09-24 6:51 UTC (permalink / raw)
To: Jens Axboe
Cc: Song Liu, Hans de Goede, Coly Li, Richard Weinberger, Minchan Kim,
Johannes Thumshirn, Justin Sanders, linux-mtd, dm-devel,
linux-block, linux-bcache, linux-kernel, drbd-dev, linux-raid,
linux-fsdevel, linux-mm, cgroups, Johannes Thumshirn
Drivers shouldn't really mess with the readahead size, as that is a VM
concept. Instead set it based on the optimal I/O size by lifting the
algorithm from the md driver when registering the disk. Also set
bdi->io_pages there as well by applying the same scheme based on
max_sectors. To ensure the limits work well for stacking drivers a
new helper is added to update the readahead limits from the block
limits, which is also called from disk_stack_limits.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
block/blk-settings.c | 18 ++++++++++++++++--
block/blk-sysfs.c | 2 ++
drivers/block/aoe/aoeblk.c | 1 -
drivers/block/drbd/drbd_nl.c | 10 +---------
drivers/md/bcache/super.c | 3 ---
drivers/md/dm-table.c | 3 +--
drivers/md/raid0.c | 16 ----------------
drivers/md/raid10.c | 24 +-----------------------
drivers/md/raid5.c | 13 +------------
drivers/nvme/host/core.c | 1 +
include/linux/blkdev.h | 1 +
11 files changed, 24 insertions(+), 68 deletions(-)
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 5ea3de48afba22..4f6eb4bb17236a 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -372,6 +372,19 @@ void blk_queue_alignment_offset(struct request_queue *q, unsigned int offset)
}
EXPORT_SYMBOL(blk_queue_alignment_offset);
+void blk_queue_update_readahead(struct request_queue *q)
+{
+ /*
+ * For read-ahead of large files to be effective, we need to read ahead
+ * at least twice the optimal I/O size.
+ */
+ q->backing_dev_info->ra_pages =
+ max(queue_io_opt(q) * 2 / PAGE_SIZE, VM_READAHEAD_PAGES);
+ q->backing_dev_info->io_pages =
+ queue_max_sectors(q) >> (PAGE_SHIFT - 9);
+}
+EXPORT_SYMBOL_GPL(blk_queue_update_readahead);
+
/**
* blk_limits_io_min - set minimum request size for a device
* @limits: the queue limits
@@ -450,6 +463,8 @@ EXPORT_SYMBOL(blk_limits_io_opt);
void blk_queue_io_opt(struct request_queue *q, unsigned int opt)
{
blk_limits_io_opt(&q->limits, opt);
+ q->backing_dev_info->ra_pages =
+ max(queue_io_opt(q) * 2 / PAGE_SIZE, VM_READAHEAD_PAGES);
}
EXPORT_SYMBOL(blk_queue_io_opt);
@@ -631,8 +646,7 @@ void disk_stack_limits(struct gendisk *disk, struct block_device *bdev,
top, bottom);
}
- t->backing_dev_info->io_pages =
- t->limits.max_sectors >> (PAGE_SHIFT - 9);
+ blk_queue_update_readahead(disk->queue);
}
EXPORT_SYMBOL(disk_stack_limits);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 81722cdcf0cb21..869ed21a9edcab 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -854,6 +854,8 @@ int blk_register_queue(struct gendisk *disk)
percpu_ref_switch_to_percpu(&q->q_usage_counter);
}
+ blk_queue_update_readahead(q);
+
ret = blk_trace_init_sysfs(dev);
if (ret)
return ret;
diff --git a/drivers/block/aoe/aoeblk.c b/drivers/block/aoe/aoeblk.c
index d8cfc233e64b93..c34e71b0c4a98c 100644
--- a/drivers/block/aoe/aoeblk.c
+++ b/drivers/block/aoe/aoeblk.c
@@ -406,7 +406,6 @@ aoeblk_gdalloc(void *vp)
WARN_ON(d->gd);
WARN_ON(d->flags & DEVFL_UP);
blk_queue_max_hw_sectors(q, BLK_DEF_MAX_SECTORS);
- q->backing_dev_info->ra_pages = SZ_2M / PAGE_SIZE;
blk_queue_io_opt(q, SZ_2M);
d->bufpool = mp;
d->blkq = gd->queue = q;
diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index aaff5bde391506..54a4930c04fe07 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -1362,15 +1362,7 @@ static void drbd_setup_queue_param(struct drbd_device *device, struct drbd_backi
if (b) {
blk_stack_limits(&q->limits, &b->limits, 0);
-
- if (q->backing_dev_info->ra_pages !=
- b->backing_dev_info->ra_pages) {
- drbd_info(device, "Adjusting my ra_pages to backing device's (%lu -> %lu)\n",
- q->backing_dev_info->ra_pages,
- b->backing_dev_info->ra_pages);
- q->backing_dev_info->ra_pages =
- b->backing_dev_info->ra_pages;
- }
+ blk_queue_update_readahead(q);
}
fixup_discard_if_not_supported(q);
fixup_write_zeroes(device, q);
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 48113005ed86ad..6bfa771673623e 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1427,9 +1427,6 @@ static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
if (ret)
return ret;
- dc->disk.disk->queue->backing_dev_info->ra_pages =
- max(dc->disk.disk->queue->backing_dev_info->ra_pages,
- q->backing_dev_info->ra_pages);
blk_queue_io_opt(dc->disk.disk->queue,
max(queue_io_opt(dc->disk.disk->queue), queue_io_opt(q)));
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 5edc3079e7c199..ef2757012f59d5 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1925,8 +1925,7 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
}
#endif
- /* Allow reads to exceed readahead limits */
- q->backing_dev_info->io_pages = limits->max_sectors >> (PAGE_SHIFT - 9);
+ blk_queue_update_readahead(q);
}
unsigned int dm_table_get_num_targets(struct dm_table *t)
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index f54a449f97aa79..aa2d7279176880 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -410,22 +410,6 @@ static int raid0_run(struct mddev *mddev)
mdname(mddev),
(unsigned long long)mddev->array_sectors);
- if (mddev->queue) {
- /* calculate the max read-ahead size.
- * For read-ahead of large files to be effective, we need to
- * readahead at least twice a whole stripe. i.e. number of devices
- * multiplied by chunk size times 2.
- * If an individual device has an ra_pages greater than the
- * chunk size, then we will not drive that device as hard as it
- * wants. We consider this a configuration error: a larger
- * chunksize should be used in that case.
- */
- int stripe = mddev->raid_disks *
- (mddev->chunk_sectors << 9) / PAGE_SIZE;
- if (mddev->queue->backing_dev_info->ra_pages < 2* stripe)
- mddev->queue->backing_dev_info->ra_pages = 2* stripe;
- }
-
dump_zones(mddev);
ret = md_integrity_register(mddev);
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 9956a04ac13bd6..5d1bdee313ec33 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3873,19 +3873,6 @@ static int raid10_run(struct mddev *mddev)
mddev->resync_max_sectors = size;
set_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
- if (mddev->queue) {
- int stripe = conf->geo.raid_disks *
- ((mddev->chunk_sectors << 9) / PAGE_SIZE);
-
- /* Calculate max read-ahead size.
- * We need to readahead at least twice a whole stripe....
- * maybe...
- */
- stripe /= conf->geo.near_copies;
- if (mddev->queue->backing_dev_info->ra_pages < 2 * stripe)
- mddev->queue->backing_dev_info->ra_pages = 2 * stripe;
- }
-
if (md_integrity_register(mddev))
goto out_free_conf;
@@ -4723,17 +4710,8 @@ static void end_reshape(struct r10conf *conf)
conf->reshape_safe = MaxSector;
spin_unlock_irq(&conf->device_lock);
- /* read-ahead size must cover two whole stripes, which is
- * 2 * (datadisks) * chunksize where 'n' is the number of raid devices
- */
- if (conf->mddev->queue) {
- int stripe = conf->geo.raid_disks *
- ((conf->mddev->chunk_sectors << 9) / PAGE_SIZE);
- stripe /= conf->geo.near_copies;
- if (conf->mddev->queue->backing_dev_info->ra_pages < 2 * stripe)
- conf->mddev->queue->backing_dev_info->ra_pages = 2 * stripe;
+ if (conf->mddev->queue)
raid10_set_io_opt(conf);
- }
conf->fullsync = 0;
}
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 9a7d1250894ef1..7ace1f76b14736 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7522,8 +7522,6 @@ static int raid5_run(struct mddev *mddev)
int data_disks = conf->previous_raid_disks - conf->max_degraded;
int stripe = data_disks *
((mddev->chunk_sectors << 9) / PAGE_SIZE);
- if (mddev->queue->backing_dev_info->ra_pages < 2 * stripe)
- mddev->queue->backing_dev_info->ra_pages = 2 * stripe;
chunk_size = mddev->chunk_sectors << 9;
blk_queue_io_min(mddev->queue, chunk_size);
@@ -8111,17 +8109,8 @@ static void end_reshape(struct r5conf *conf)
spin_unlock_irq(&conf->device_lock);
wake_up(&conf->wait_for_overlap);
- /* read-ahead size must cover two whole stripes, which is
- * 2 * (datadisks) * chunksize where 'n' is the number of raid devices
- */
- if (conf->mddev->queue) {
- int data_disks = conf->raid_disks - conf->max_degraded;
- int stripe = data_disks * ((conf->chunk_sectors << 9)
- / PAGE_SIZE);
- if (conf->mddev->queue->backing_dev_info->ra_pages < 2 * stripe)
- conf->mddev->queue->backing_dev_info->ra_pages = 2 * stripe;
+ if (conf->mddev->queue)
raid5_set_io_opt(conf);
- }
}
}
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ea1fa41fbba8df..741c9bfa8e14c7 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2147,6 +2147,7 @@ static int __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
nvme_update_disk_info(ns->head->disk, ns, id);
blk_stack_limits(&ns->head->disk->queue->limits,
&ns->queue->limits, 0);
+ blk_queue_update_readahead(ns->head->disk->queue);
nvme_update_bdev_size(ns->head->disk);
}
#endif
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index be5ef6f4ba1905..282f5ca424f14a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1140,6 +1140,7 @@ extern void blk_queue_max_zone_append_sectors(struct request_queue *q,
extern void blk_queue_physical_block_size(struct request_queue *, unsigned int);
extern void blk_queue_alignment_offset(struct request_queue *q,
unsigned int alignment);
+void blk_queue_update_readahead(struct request_queue *q);
extern void blk_limits_io_min(struct queue_limits *limits, unsigned int min);
extern void blk_queue_io_min(struct request_queue *q, unsigned int min);
extern void blk_limits_io_opt(struct queue_limits *limits, unsigned int opt);
--
2.28.0
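
As a rough illustration only (not part of the patch), the sizing rule that blk_queue_update_readahead() applies can be modelled in userspace C. The struct, the MODEL_* constants and the function names below are hypothetical stand-ins; the sketch assumes 4 KiB pages and a 128 KiB default readahead window for VM_READAHEAD_PAGES, and the max_sectors value of 2560 is just an example input.

```c
#include <assert.h>

/* Assumed values: 4 KiB pages and a 128 KiB default readahead window. */
#define MODEL_PAGE_SHIFT		12
#define MODEL_PAGE_SIZE			(1UL << MODEL_PAGE_SHIFT)
#define MODEL_VM_READAHEAD_PAGES	((128UL * 1024) / MODEL_PAGE_SIZE)

/* Hypothetical stand-in for the queue and bdi fields the helper touches. */
struct model_queue {
	unsigned int io_opt;		/* optimal I/O size in bytes, 0 if unset */
	unsigned int max_sectors;	/* in 512-byte sectors */
	unsigned long ra_pages;		/* readahead window in pages */
	unsigned long io_pages;		/* largest I/O in pages */
};

static unsigned long model_max(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/* Mirrors the arithmetic of blk_queue_update_readahead() in the patch. */
void model_update_readahead(struct model_queue *q)
{
	/* Read ahead at least twice the optimal I/O size, never below the default. */
	q->ra_pages = model_max((unsigned long)q->io_opt * 2 / MODEL_PAGE_SIZE,
				MODEL_VM_READAHEAD_PAGES);
	/* Convert 512-byte sectors to pages. */
	q->io_pages = q->max_sectors >> (MODEL_PAGE_SHIFT - 9);
}

/* Self-check with example inputs: io_opt = 2 MiB, max_sectors = 2560. */
void model_selftest(void)
{
	struct model_queue q = { .io_opt = 2U * 1024 * 1024, .max_sectors = 2560 };

	model_update_readahead(&q);
	assert(q.ra_pages == 1024);	/* 4 MiB / 4 KiB */
	assert(q.io_pages == 320);	/* 2560 sectors / 8 sectors per page */
}
```

With io_opt left at 0 the first term vanishes and ra_pages falls back to the default window, which is why drivers that never set an optimal I/O size keep the behaviour they had before.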
* Re: [PATCH 07/13] block: lift setting the readahead size into the block layer
2020-09-24 6:51 ` [PATCH 07/13] block: lift setting the readahead size into the block layer Christoph Hellwig
@ 2020-09-24 14:53 ` Jan Kara
2020-09-24 15:03 ` Mike Snitzer
2020-09-24 15:57 ` Martin K. Petersen
2 siblings, 0 replies; 29+ messages in thread
From: Jan Kara @ 2020-09-24 14:53 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Song Liu, Hans de Goede, Coly Li, Richard Weinberger,
Minchan Kim, Johannes Thumshirn, Justin Sanders, linux-mtd,
dm-devel, linux-block, linux-bcache, linux-kernel, drbd-dev,
linux-raid, linux-fsdevel, linux-mm, cgroups
On Thu 24-09-20 08:51:34, Christoph Hellwig wrote:
> Drivers shouldn't really mess with the readahead size, as that is a VM
> concept. Instead set it based on the optimal I/O size by lifting the
> algorithm from the md driver when registering the disk. Also set
> bdi->io_pages there as well by applying the same scheme based on
> max_sectors. To ensure the limits work well for stacking drivers a
> new helper is added to update the readahead limits from the block
> limits, which is also called from disk_stack_limits.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Acked-by: Coly Li <colyli@suse.de>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
The patch looks good to me now. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 07/13] block: lift setting the readahead size into the block layer
2020-09-24 6:51 ` [PATCH 07/13] block: lift setting the readahead size into the block layer Christoph Hellwig
2020-09-24 14:53 ` Jan Kara
@ 2020-09-24 15:03 ` Mike Snitzer
2020-09-24 15:57 ` Martin K. Petersen
2 siblings, 0 replies; 29+ messages in thread
From: Mike Snitzer @ 2020-09-24 15:03 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, linux-raid, Hans de Goede, Justin Sanders,
Minchan Kim, Johannes Thumshirn, cgroups, linux-bcache, Coly Li,
linux-block, Song Liu, dm-devel, linux-mtd, Richard Weinberger,
drbd-dev, linux-fsdevel, linux-mm, linux-kernel
On Thu, Sep 24 2020 at 2:51am -0400,
Christoph Hellwig <hch@lst.de> wrote:
> Drivers shouldn't really mess with the readahead size, as that is a VM
> concept. Instead set it based on the optimal I/O size by lifting the
> algorithm from the md driver when registering the disk. Also set
> bdi->io_pages there as well by applying the same scheme based on
> max_sectors. To ensure the limits work well for stacking drivers a
> new helper is added to update the readahead limits from the block
> limits, which is also called from disk_stack_limits.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Acked-by: Coly Li <colyli@suse.de>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Thanks for adding blk_queue_update_readahead()
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
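
For readers following along, here is a hedged userspace sketch of why stacking drivers call the helper after stacking limits. It is a simplified model, not kernel code: it tracks only two limits and assumes that stacking takes the smaller non-zero max_sectors and the least common multiple of the non-zero io_opt values. All names (stk_*, STK_*) are hypothetical, and 4 KiB pages with a 32-page default readahead window are assumed.

```c
#include <assert.h>
#include <limits.h>

#define STK_PAGE_SHIFT		12
#define STK_PAGE_SIZE		(1UL << STK_PAGE_SHIFT)
#define STK_DEFAULT_RA_PAGES	32UL	/* assumed 128 KiB default window */

/* Only the two limits that feed the readahead settings are modelled. */
struct stk_limits {
	unsigned int max_sectors;	/* in 512-byte sectors */
	unsigned int io_opt;		/* in bytes, 0 if unset */
};

static unsigned int stk_gcd(unsigned int a, unsigned int b)
{
	while (b) {
		unsigned int t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/* Least common multiple that ignores zero (unset) operands. */
static unsigned int stk_lcm_not_zero(unsigned int a, unsigned int b)
{
	if (!a)
		return b;
	if (!b)
		return a;
	return a / stk_gcd(a, b) * b;
}

/* Minimum that ignores zero (unset) operands. */
static unsigned int stk_min_not_zero(unsigned int a, unsigned int b)
{
	if (!a)
		return b;
	if (!b)
		return a;
	return a < b ? a : b;
}

/* Simplified model of folding one bottom device into the top limits. */
void stk_stack_limits(struct stk_limits *top, const struct stk_limits *bottom)
{
	top->max_sectors = stk_min_not_zero(top->max_sectors, bottom->max_sectors);
	top->io_opt = stk_lcm_not_zero(top->io_opt, bottom->io_opt);
}

/* Recompute the readahead window from the stacked limits. */
unsigned long stk_ra_pages(const struct stk_limits *l)
{
	unsigned long want = (unsigned long)l->io_opt * 2 / STK_PAGE_SIZE;

	return want > STK_DEFAULT_RA_PAGES ? want : STK_DEFAULT_RA_PAGES;
}

/* Recompute the largest I/O in pages from the stacked limits. */
unsigned long stk_io_pages(const struct stk_limits *l)
{
	return l->max_sectors >> (STK_PAGE_SHIFT - 9);
}
```

The point of the model is ordering: the page counts are derived values, so they only reflect the combined device if they are recomputed after every bottom device has been stacked.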
* Re: [PATCH 07/13] block: lift setting the readahead size into the block layer
2020-09-24 6:51 ` [PATCH 07/13] block: lift setting the readahead size into the block layer Christoph Hellwig
2020-09-24 14:53 ` Jan Kara
2020-09-24 15:03 ` Mike Snitzer
@ 2020-09-24 15:57 ` Martin K. Petersen
2 siblings, 0 replies; 29+ messages in thread
From: Martin K. Petersen @ 2020-09-24 15:57 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Song Liu, Hans de Goede, Coly Li, Richard Weinberger,
Minchan Kim, Johannes Thumshirn, Justin Sanders, linux-mtd,
dm-devel, linux-block, linux-bcache, linux-kernel, drbd-dev,
linux-raid, linux-fsdevel, linux-mm, cgroups
Christoph,
> Drivers shouldn't really mess with the readahead size, as that is a VM
> concept. Instead set it based on the optimal I/O size by lifting the
> algorithm from the md driver when registering the disk. Also set
> bdi->io_pages there as well by applying the same scheme based on
> max_sectors. To ensure the limits work well for stacking drivers a
> new helper is added to update the readahead limits from the block
> limits, which is also called from disk_stack_limits.
Looks good!
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
--
Martin K. Petersen Oracle Linux Engineering
end of thread, other threads:[~2020-09-24 15:57 UTC | newest]
Thread overview: 29+ messages
2020-09-21 8:07 bdi cleanups v6 Christoph Hellwig
2020-09-21 8:07 ` [PATCH 01/13] fs: remove the unused SB_I_MULTIROOT flag Christoph Hellwig
2020-09-21 8:07 ` [PATCH 02/13] drbd: remove dead code in device_to_statistics Christoph Hellwig
2020-09-21 8:07 ` [PATCH 03/13] bcache: inherit the optimal I/O size Christoph Hellwig
2020-09-21 9:54 ` Coly Li
2020-09-21 14:00 ` Christoph Hellwig
2020-09-21 15:09 ` Coly Li
2020-09-21 18:18 ` Christoph Hellwig
2020-09-22 8:44 ` Jan Kara
2020-09-22 9:39 ` Coly Li
2020-09-21 8:07 ` [PATCH 04/13] aoe: set an " Christoph Hellwig
2020-09-22 8:45 ` Jan Kara
2020-09-21 8:07 ` [PATCH 05/13] bdi: initialize ->ra_pages and ->io_pages in bdi_init Christoph Hellwig
2020-09-22 8:49 ` Jan Kara
2020-09-23 15:16 ` Christoph Hellwig
2020-09-21 8:07 ` [PATCH 06/13] md: update the optimal I/O size on reshape Christoph Hellwig
2020-09-21 8:07 ` [PATCH 07/13] block: lift setting the readahead size into the block layer Christoph Hellwig
2020-09-22 9:13 ` Jan Kara
2020-09-22 9:51 ` Coly Li
2020-09-21 8:07 ` [PATCH 08/13] bdi: remove BDI_CAP_CGROUP_WRITEBACK Christoph Hellwig
2020-09-21 8:07 ` [PATCH 09/13] bdi: remove BDI_CAP_SYNCHRONOUS_IO Christoph Hellwig
2020-09-21 8:07 ` [PATCH 10/13] mm: use SWP_SYNCHRONOUS_IO more intelligently Christoph Hellwig
2020-09-21 8:07 ` [PATCH 11/13] bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag Christoph Hellwig
2020-09-21 8:07 ` [PATCH 12/13] bdi: invert BDI_CAP_NO_ACCT_WB Christoph Hellwig
2020-09-21 8:07 ` [PATCH 13/13] bdi: replace BDI_CAP_NO_{WRITEBACK,ACCT_DIRTY} with a single flag Christoph Hellwig
-- strict thread matches above, loose matches on Subject: below --
2020-09-24 6:51 bdi cleanups v7 Christoph Hellwig
2020-09-24 6:51 ` [PATCH 07/13] block: lift setting the readahead size into the block layer Christoph Hellwig
2020-09-24 14:53 ` Jan Kara
2020-09-24 15:03 ` Mike Snitzer
2020-09-24 15:57 ` Martin K. Petersen