From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp-out2.suse.de (smtp-out2.suse.de [195.135.223.131]) by mail19.linbit.com (LINBIT Mail Daemon) with ESMTP id 072C342091F for ; Tue, 11 Jun 2024 11:58:20 +0200 (CEST) Message-ID: <34a7b2a4-b0cb-4580-85c9-b598fd70449e@suse.de> Date: Tue, 11 Jun 2024 11:58:19 +0200 MIME-Version: 1.0 Subject: Re: [PATCH 13/26] block: move cache control settings out of queue->flags To: Christoph Hellwig , Jens Axboe References: <20240611051929.513387-1-hch@lst.de> <20240611051929.513387-14-hch@lst.de> Content-Language: en-US From: Hannes Reinecke In-Reply-To: <20240611051929.513387-14-hch@lst.de> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit Cc: nvdimm@lists.linux.dev, "Michael S. Tsirkin" , Jason Wang , linux-nvme@lists.infradead.org, Song Liu , linux-mtd@lists.infradead.org, Vineeth Vijayan , Alasdair Kergon , drbd-dev@lists.linbit.com, linux-s390@vger.kernel.org, linux-scsi@vger.kernel.org, Richard Weinberger , Geert Uytterhoeven , Yu Kuai , dm-devel@lists.linux.dev, linux-um@lists.infradead.org, Mike Snitzer , Josef Bacik , nbd@other.debian.org, linux-raid@vger.kernel.org, linux-m68k@lists.linux-m68k.org, Mikulas Patocka , xen-devel@lists.xenproject.org, ceph-devel@vger.kernel.org, Ming Lei , linux-bcache@vger.kernel.org, linux-block@vger.kernel.org, "Martin K. Petersen" , linux-mmc@vger.kernel.org, Philipp Reisner , virtualization@lists.linux.dev, Lars Ellenberg , linuxppc-dev@lists.ozlabs.org, =?UTF-8?Q?Roger_Pau_Monn=C3=A9?= List-Id: "*Coordination* of development, patches, contributions -- *Questions* \(even to developers\) go to drbd-user, please." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , On 6/11/24 07:19, Christoph Hellwig wrote: > Move the cache control settings into the queue_limits so that they > can be set atomically and all I/O is frozen when changing the > flags. > > Add new features and flags field for the driver set flags, and internal > (usually sysfs-controlled) flags in the block layer. Note that we'll > eventually remove enough field from queue_limits to bring it back to the > previous size. > > The disable flag is inverted compared to the previous meaning, which > means it now survives a rescan, similar to the max_sectors and > max_discard_sectors user limits. > > The FLUSH and FUA flags are now inherited by blk_stack_limits, which > simplified the code in dm a lot, but also causes a slight behavior > change in that dm-switch and dm-unstripe now advertise a write cache > despite setting num_flush_bios to 0. The I/O path will handle this > gracefully, but as far as I can tell the lack of num_flush_bios > and thus flush support is a pre-existing data integrity bug in those > targets that really needs fixing, after which a non-zero num_flush_bios > should be required in dm for targets that map to underlying devices. > > Signed-off-by: Christoph Hellwig > --- > .../block/writeback_cache_control.rst | 67 +++++++++++-------- > arch/um/drivers/ubd_kern.c | 2 +- > block/blk-core.c | 2 +- > block/blk-flush.c | 9 ++- > block/blk-mq-debugfs.c | 2 - > block/blk-settings.c | 29 ++------ > block/blk-sysfs.c | 29 +++++--- > block/blk-wbt.c | 4 +- > drivers/block/drbd/drbd_main.c | 2 +- > drivers/block/loop.c | 9 +-- > drivers/block/nbd.c | 14 ++-- > drivers/block/null_blk/main.c | 12 ++-- > drivers/block/ps3disk.c | 7 +- > drivers/block/rnbd/rnbd-clt.c | 10 +-- > drivers/block/ublk_drv.c | 8 ++- > drivers/block/virtio_blk.c | 20 ++++-- > drivers/block/xen-blkfront.c | 9 ++- > drivers/md/bcache/super.c | 7 +- > drivers/md/dm-table.c | 39 +++-------- > drivers/md/md.c | 8 ++- > drivers/mmc/core/block.c | 42 ++++++------ > drivers/mmc/core/queue.c | 12 ++-- > drivers/mmc/core/queue.h | 3 +- > drivers/mtd/mtd_blkdevs.c | 5 +- > drivers/nvdimm/pmem.c | 4 +- > drivers/nvme/host/core.c | 7 +- > drivers/nvme/host/multipath.c | 6 -- > drivers/scsi/sd.c | 28 +++++--- > include/linux/blkdev.h | 38 +++++++++-- > 29 files changed, 227 insertions(+), 207 deletions(-) > > diff --git a/Documentation/block/writeback_cache_control.rst b/Documentation/block/writeback_cache_control.rst > index b208488d0aae85..9cfe27f90253c7 100644 > --- a/Documentation/block/writeback_cache_control.rst > +++ b/Documentation/block/writeback_cache_control.rst > @@ -46,41 +46,50 @@ worry if the underlying devices need any explicit cache flushing and how > the Forced Unit Access is implemented. The REQ_PREFLUSH and REQ_FUA flags > may both be set on a single bio. > > +Feature settings for block drivers > +---------------------------------- > > -Implementation details for bio based block drivers > --------------------------------------------------------------- > +For devices that do not support volatile write caches there is no driver > +support required, the block layer completes empty REQ_PREFLUSH requests before > +entering the driver and strips off the REQ_PREFLUSH and REQ_FUA bits from > +requests that have a payload. > > -These drivers will always see the REQ_PREFLUSH and REQ_FUA bits as they sit > -directly below the submit_bio interface. For remapping drivers the REQ_FUA > -bits need to be propagated to underlying devices, and a global flush needs > -to be implemented for bios with the REQ_PREFLUSH bit set. For real device > -drivers that do not have a volatile cache the REQ_PREFLUSH and REQ_FUA bits > -on non-empty bios can simply be ignored, and REQ_PREFLUSH requests without > -data can be completed successfully without doing any work. Drivers for > -devices with volatile caches need to implement the support for these > -flags themselves without any help from the block layer. > +For devices with volatile write caches the driver needs to tell the block layer > +that it supports flushing caches by setting the > > + BLK_FEAT_WRITE_CACHE > > -Implementation details for request_fn based block drivers > ---------------------------------------------------------- > +flag in the queue_limits feature field. For devices that also support the FUA > +bit the block layer needs to be told to pass on the REQ_FUA bit by also setting > +the > > -For devices that do not support volatile write caches there is no driver > -support required, the block layer completes empty REQ_PREFLUSH requests before > -entering the driver and strips off the REQ_PREFLUSH and REQ_FUA bits from > -requests that have a payload. For devices with volatile write caches the > -driver needs to tell the block layer that it supports flushing caches by > -doing:: > + BLK_FEAT_FUA > + > +flag in the features field of the queue_limits structure. > + > +Implementation details for bio based block drivers > +-------------------------------------------------- > + > +For bio based drivers the REQ_PREFLUSH and REQ_FUA bit are simplify passed on > +to the driver if the drivers sets the BLK_FEAT_WRITE_CACHE flag and the drivers > +needs to handle them. > + > +*NOTE*: The REQ_FUA bit also gets passed on when the BLK_FEAT_FUA flags is > +_not_ set. Any bio based driver that sets BLK_FEAT_WRITE_CACHE also needs to > +handle REQ_FUA. > > - blk_queue_write_cache(sdkp->disk->queue, true, false); > +For remapping drivers the REQ_FUA bits need to be propagated to underlying > +devices, and a global flush needs to be implemented for bios with the > +REQ_PREFLUSH bit set. > > -and handle empty REQ_OP_FLUSH requests in its prep_fn/request_fn. Note that > -REQ_PREFLUSH requests with a payload are automatically turned into a sequence > -of an empty REQ_OP_FLUSH request followed by the actual write by the block > -layer. For devices that also support the FUA bit the block layer needs > -to be told to pass through the REQ_FUA bit using:: > +Implementation details for blk-mq drivers > +----------------------------------------- > > - blk_queue_write_cache(sdkp->disk->queue, true, true); > +When the BLK_FEAT_WRITE_CACHE flag is set, REQ_OP_WRITE | REQ_PREFLUSH requests > +with a payload are automatically turned into a sequence of a REQ_OP_FLUSH > +request followed by the actual write by the block layer. > > -and the driver must handle write requests that have the REQ_FUA bit set > -in prep_fn/request_fn. If the FUA bit is not natively supported the block > -layer turns it into an empty REQ_OP_FLUSH request after the actual write. > +When the BLK_FEA_FUA flags is set, the REQ_FUA bit simplify passed on for the > +REQ_OP_WRITE request, else a REQ_OP_FLUSH request is sent by the block layer > +after the completion of the write request for bio submissions with the REQ_FUA > +bit set. > diff --git a/arch/um/drivers/ubd_kern.c b/arch/um/drivers/ubd_kern.c > index cdcb75a68989dd..19e01691ea0ea7 100644 > --- a/arch/um/drivers/ubd_kern.c > +++ b/arch/um/drivers/ubd_kern.c > @@ -835,6 +835,7 @@ static int ubd_add(int n, char **error_out) > struct queue_limits lim = { > .max_segments = MAX_SG, > .seg_boundary_mask = PAGE_SIZE - 1, > + .features = BLK_FEAT_WRITE_CACHE, > }; > struct gendisk *disk; > int err = 0; > @@ -882,7 +883,6 @@ static int ubd_add(int n, char **error_out) > } > > blk_queue_flag_set(QUEUE_FLAG_NONROT, disk->queue); > - blk_queue_write_cache(disk->queue, true, false); > disk->major = UBD_MAJOR; > disk->first_minor = n << UBD_SHIFT; > disk->minors = 1 << UBD_SHIFT; > diff --git a/block/blk-core.c b/block/blk-core.c > index 82c3ae22d76d88..2b45a4df9a1aa1 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -782,7 +782,7 @@ void submit_bio_noacct(struct bio *bio) > if (WARN_ON_ONCE(bio_op(bio) != REQ_OP_WRITE && > bio_op(bio) != REQ_OP_ZONE_APPEND)) > goto end_io; > - if (!test_bit(QUEUE_FLAG_WC, &q->queue_flags)) { > + if (!bdev_write_cache(bdev)) { > bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA); > if (!bio_sectors(bio)) { > status = BLK_STS_OK; > diff --git a/block/blk-flush.c b/block/blk-flush.c > index 2234f8b3fc05f2..30b9d5033a2b85 100644 > --- a/block/blk-flush.c > +++ b/block/blk-flush.c > @@ -381,8 +381,8 @@ static void blk_rq_init_flush(struct request *rq) > bool blk_insert_flush(struct request *rq) > { > struct request_queue *q = rq->q; > - unsigned long fflags = q->queue_flags; /* may change, cache */ > struct blk_flush_queue *fq = blk_get_flush_queue(q, rq->mq_ctx); > + bool supports_fua = q->limits.features & BLK_FEAT_FUA; Shouldn't we have a helper like blk_feat_fua() here? > unsigned int policy = 0; > > /* FLUSH/FUA request must never be merged */ > @@ -394,11 +394,10 @@ bool blk_insert_flush(struct request *rq) > /* > * Check which flushes we need to sequence for this operation. > */ > - if (fflags & (1UL << QUEUE_FLAG_WC)) { > + if (blk_queue_write_cache(q)) { > if (rq->cmd_flags & REQ_PREFLUSH) > policy |= REQ_FSEQ_PREFLUSH; > - if (!(fflags & (1UL << QUEUE_FLAG_FUA)) && > - (rq->cmd_flags & REQ_FUA)) > + if ((rq->cmd_flags & REQ_FUA) && !supports_fua) > policy |= REQ_FSEQ_POSTFLUSH; > } > > @@ -407,7 +406,7 @@ bool blk_insert_flush(struct request *rq) > * REQ_PREFLUSH and FUA for the driver. > */ > rq->cmd_flags &= ~REQ_PREFLUSH; > - if (!(fflags & (1UL << QUEUE_FLAG_FUA))) > + if (!supports_fua) > rq->cmd_flags &= ~REQ_FUA; > > /* > diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c > index 770c0c2b72faaa..e8b9db7c30c455 100644 > --- a/block/blk-mq-debugfs.c > +++ b/block/blk-mq-debugfs.c > @@ -93,8 +93,6 @@ static const char *const blk_queue_flag_name[] = { > QUEUE_FLAG_NAME(INIT_DONE), > QUEUE_FLAG_NAME(STABLE_WRITES), > QUEUE_FLAG_NAME(POLL), > - QUEUE_FLAG_NAME(WC), > - QUEUE_FLAG_NAME(FUA), > QUEUE_FLAG_NAME(DAX), > QUEUE_FLAG_NAME(STATS), > QUEUE_FLAG_NAME(REGISTERED), > diff --git a/block/blk-settings.c b/block/blk-settings.c > index f11c8676eb4c67..536ee202fcdccb 100644 > --- a/block/blk-settings.c > +++ b/block/blk-settings.c > @@ -261,6 +261,9 @@ static int blk_validate_limits(struct queue_limits *lim) > lim->misaligned = 0; > } > > + if (!(lim->features & BLK_FEAT_WRITE_CACHE)) > + lim->features &= ~BLK_FEAT_FUA; > + > err = blk_validate_integrity_limits(lim); > if (err) > return err; > @@ -454,6 +457,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b, > { > unsigned int top, bottom, alignment, ret = 0; > > + t->features |= (b->features & BLK_FEAT_INHERIT_MASK); > + > t->max_sectors = min_not_zero(t->max_sectors, b->max_sectors); > t->max_user_sectors = min_not_zero(t->max_user_sectors, > b->max_user_sectors); > @@ -711,30 +716,6 @@ void blk_set_queue_depth(struct request_queue *q, unsigned int depth) > } > EXPORT_SYMBOL(blk_set_queue_depth); > > -/** > - * blk_queue_write_cache - configure queue's write cache > - * @q: the request queue for the device > - * @wc: write back cache on or off > - * @fua: device supports FUA writes, if true > - * > - * Tell the block layer about the write cache of @q. > - */ > -void blk_queue_write_cache(struct request_queue *q, bool wc, bool fua) > -{ > - if (wc) { > - blk_queue_flag_set(QUEUE_FLAG_HW_WC, q); > - blk_queue_flag_set(QUEUE_FLAG_WC, q); > - } else { > - blk_queue_flag_clear(QUEUE_FLAG_HW_WC, q); > - blk_queue_flag_clear(QUEUE_FLAG_WC, q); > - } > - if (fua) > - blk_queue_flag_set(QUEUE_FLAG_FUA, q); > - else > - blk_queue_flag_clear(QUEUE_FLAG_FUA, q); > -} > -EXPORT_SYMBOL_GPL(blk_queue_write_cache); > - > int bdev_alignment_offset(struct block_device *bdev) > { > struct request_queue *q = bdev_get_queue(bdev); > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c > index 5c787965b7d09e..4f524c1d5e08bd 100644 > --- a/block/blk-sysfs.c > +++ b/block/blk-sysfs.c > @@ -423,32 +423,41 @@ static ssize_t queue_io_timeout_store(struct request_queue *q, const char *page, > > static ssize_t queue_wc_show(struct request_queue *q, char *page) > { > - if (test_bit(QUEUE_FLAG_WC, &q->queue_flags)) > - return sprintf(page, "write back\n"); > - > - return sprintf(page, "write through\n"); > + if (q->limits.features & BLK_FLAGS_WRITE_CACHE_DISABLED) Where is the difference between 'flags' and 'features'? Ie why is is named BLK_FEAT_FUA but BLK_FLAGS_WRITE_CACHE_DISABLED? And if the feature is the existence of a capability, and the flag is the setting of that capability, can you make it clear in the documentation? Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich