* [PATCH 0/2] block, md: Better handle REQ_OP_FLUSH
@ 2023-12-21 1:27 Song Liu
2023-12-21 1:27 ` [PATCH 1/2] block: Check REQ_OP_FLUSH in op_is_flush() Song Liu
` (2 more replies)
0 siblings, 3 replies; 9+ messages in thread
From: Song Liu @ 2023-12-21 1:27 UTC (permalink / raw)
To: linux-block, linux-raid
Cc: axboe, kent.overstreet, janpieter.sollie, colyli, bagasdotme,
Song Liu
A recent bug report [1] shows md is handling a flush from bcachefs as read:
bch2_journal_write=>
submit_bio=>
...
md_handle_request =>
raid5_make_request =>
chunk_aligned_read =>
raid5_read_one_chunk =>
...
It appears md code only checks REQ_PREFLUSH for flush requests, which
doesn't cover all cases. OTOH, op_is_flush() doesn't check REQ_OP_FLUSH
either.
Fix this by:
1) Check REQ_PREFLUSH in op_is_flush();
2) Use op_is_flush() in md code.
Thanks,
Song
[1] https://bugzilla.kernel.org/show_bug.cgi?id=218184
Song Liu (2):
block: Check REQ_OP_FLUSH in op_is_flush()
md: Use op_is_flush() to check flush bio
drivers/md/raid0.c | 2 +-
drivers/md/raid1.c | 2 +-
drivers/md/raid10.c | 2 +-
drivers/md/raid5.c | 2 +-
include/linux/blk_types.h | 3 ++-
5 files changed, 6 insertions(+), 5 deletions(-)
--
2.34.1
^ permalink raw reply [flat|nested] 9+ messages in thread* [PATCH 1/2] block: Check REQ_OP_FLUSH in op_is_flush() 2023-12-21 1:27 [PATCH 0/2] block, md: Better handle REQ_OP_FLUSH Song Liu @ 2023-12-21 1:27 ` Song Liu 2023-12-21 6:12 ` Christoph Hellwig 2023-12-21 1:27 ` [PATCH 2/2] md: Use op_is_flush() to check flush bio Song Liu 2023-12-21 3:36 ` [PATCH 0/2] block, md: Better handle REQ_OP_FLUSH Ed Tsai (蔡宗軒) 2 siblings, 1 reply; 9+ messages in thread From: Song Liu @ 2023-12-21 1:27 UTC (permalink / raw) To: linux-block, linux-raid Cc: axboe, kent.overstreet, janpieter.sollie, colyli, bagasdotme, Song Liu Upper layer (fs, etc.) may issue flush with something like: bio_reset(bio, bdev, REQ_OP_FLUSH); bio->bi_end_io = xxx; submit_bio(bio); op_is_flush(bio->bi_opf) should return true for this bio. Signed-off-by: Song Liu <song@kernel.org> --- include/linux/blk_types.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index d5c5e59ddbd2..338423da84ca 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -487,7 +487,8 @@ static inline bool op_is_write(blk_opf_t op) */ static inline bool op_is_flush(blk_opf_t op) { - return op & (REQ_FUA | REQ_PREFLUSH); + return op & (REQ_FUA | REQ_PREFLUSH) || + (op & REQ_OP_MASK) == REQ_OP_FLUSH; } /* -- 2.34.1 ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] block: Check REQ_OP_FLUSH in op_is_flush() 2023-12-21 1:27 ` [PATCH 1/2] block: Check REQ_OP_FLUSH in op_is_flush() Song Liu @ 2023-12-21 6:12 ` Christoph Hellwig 2023-12-21 7:53 ` Song Liu 0 siblings, 1 reply; 9+ messages in thread From: Christoph Hellwig @ 2023-12-21 6:12 UTC (permalink / raw) To: Song Liu Cc: linux-block, linux-raid, axboe, kent.overstreet, janpieter.sollie, colyli, bagasdotme On Wed, Dec 20, 2023 at 05:27:14PM -0800, Song Liu wrote: > Upper layer (fs, etc.) may issue flush with something like: > > bio_reset(bio, bdev, REQ_OP_FLUSH); > bio->bi_end_io = xxx; > submit_bio(bio); No, they can't. REQ_OP_FLUSH is currently only used by request based drivers and only generated by the flush state machine. (Not that I particularly like this wart, I'd much prefer file systems to submit a REQ_OP_FLUSH than an empty write with a preflush flag, but that's not a trivial change). ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] block: Check REQ_OP_FLUSH in op_is_flush() 2023-12-21 6:12 ` Christoph Hellwig @ 2023-12-21 7:53 ` Song Liu 0 siblings, 0 replies; 9+ messages in thread From: Song Liu @ 2023-12-21 7:53 UTC (permalink / raw) To: Christoph Hellwig Cc: linux-block, linux-raid, axboe, kent.overstreet, janpieter.sollie, colyli, bagasdotme On Wed, Dec 20, 2023 at 10:12 PM Christoph Hellwig <hch@infradead.org> wrote: > > On Wed, Dec 20, 2023 at 05:27:14PM -0800, Song Liu wrote: > > Upper layer (fs, etc.) may issue flush with something like: > > > > bio_reset(bio, bdev, REQ_OP_FLUSH); > > bio->bi_end_io = xxx; > > submit_bio(bio); > > No, they can't. REQ_OP_FLUSH is currently only used by request > based drivers and only generated by the flush state machine. Hmm... Then this call trace from bcachefs is the exception here: bch2_journal_write=> submit_bio=> ... md_handle_request => raid5_make_request => chunk_aligned_read => raid5_read_one_chunk => > (Not that I particularly like this wart, I'd much prefer file systems > to submit a REQ_OP_FLUSH than an empty write with a preflush flag, > but that's not a trivial change). Kent, I think we need to update how bcachefs issue flushes. Thanks, Song ^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 2/2] md: Use op_is_flush() to check flush bio 2023-12-21 1:27 [PATCH 0/2] block, md: Better handle REQ_OP_FLUSH Song Liu 2023-12-21 1:27 ` [PATCH 1/2] block: Check REQ_OP_FLUSH in op_is_flush() Song Liu @ 2023-12-21 1:27 ` Song Liu 2023-12-21 3:36 ` [PATCH 0/2] block, md: Better handle REQ_OP_FLUSH Ed Tsai (蔡宗軒) 2 siblings, 0 replies; 9+ messages in thread From: Song Liu @ 2023-12-21 1:27 UTC (permalink / raw) To: linux-block, linux-raid Cc: axboe, kent.overstreet, janpieter.sollie, colyli, bagasdotme, Song Liu op_is_flush() covers different ways to request flush. Use it instead of simply checking against REQ_PREFLUSH. Signed-off-by: Song Liu <song@kernel.org> --- drivers/md/raid0.c | 2 +- drivers/md/raid1.c | 2 +- drivers/md/raid10.c | 2 +- drivers/md/raid5.c | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c index c50a7abda744..20283dc5208a 100644 --- a/drivers/md/raid0.c +++ b/drivers/md/raid0.c @@ -592,7 +592,7 @@ static bool raid0_make_request(struct mddev *mddev, struct bio *bio) unsigned chunk_sects; unsigned sectors; - if (unlikely(bio->bi_opf & REQ_PREFLUSH) + if (unlikely(op_is_flush(bio->bi_opf)) && md_flush_request(mddev, bio)) return true; diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index aaa434f0c175..5c1dadd7fbb6 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -1581,7 +1581,7 @@ static bool raid1_make_request(struct mddev *mddev, struct bio *bio) { sector_t sectors; - if (unlikely(bio->bi_opf & REQ_PREFLUSH) + if (unlikely(op_is_flush(bio->bi_opf)) && md_flush_request(mddev, bio)) return true; diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c index 7412066ea22c..5c6e0a8635f2 100644 --- a/drivers/md/raid10.c +++ b/drivers/md/raid10.c @@ -1857,7 +1857,7 @@ static bool raid10_make_request(struct mddev *mddev, struct bio *bio) int chunk_sects = chunk_mask + 1; int sectors = bio_sectors(bio); - if (unlikely(bio->bi_opf & REQ_PREFLUSH) + if (unlikely(op_is_flush(bio->bi_opf)) && md_flush_request(mddev, bio)) return true; diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index e57deb1c6138..1bcf96b490a7 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -6070,7 +6070,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi) enum stripe_result res; int s, stripe_cnt; - if (unlikely(bi->bi_opf & REQ_PREFLUSH)) { + if (unlikely(op_is_flush(bi->bi_opf))) { int ret = log_handle_flush_request(conf, bi); if (ret == 0) -- 2.34.1 ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH 0/2] block, md: Better handle REQ_OP_FLUSH 2023-12-21 1:27 [PATCH 0/2] block, md: Better handle REQ_OP_FLUSH Song Liu 2023-12-21 1:27 ` [PATCH 1/2] block: Check REQ_OP_FLUSH in op_is_flush() Song Liu 2023-12-21 1:27 ` [PATCH 2/2] md: Use op_is_flush() to check flush bio Song Liu @ 2023-12-21 3:36 ` Ed Tsai (蔡宗軒) 2023-12-21 5:30 ` Kent Overstreet 2 siblings, 1 reply; 9+ messages in thread From: Ed Tsai (蔡宗軒) @ 2023-12-21 3:36 UTC (permalink / raw) To: song@kernel.org Cc: bagasdotme@gmail.com, colyli@suse.de, linux-raid@vger.kernel.org, linux-block@vger.kernel.org, janpieter.sollie@edpnet.be, axboe@kernel.dk, kent.overstreet@linux.dev On Wed, 2023-12-20 at 17:27 -0800, Song Liu wrote: > > External email : Please do not click links or open attachments until > you have verified the sender or the content. > A recent bug report [1] shows md is handling a flush from bcachefs > as read: > > bch2_journal_write=> > submit_bio=> > ... > md_handle_request => > raid5_make_request => > chunk_aligned_read => > raid5_read_one_chunk => > ... > > It appears md code only checks REQ_PREFLUSH for flush requests, which > doesn't cover all cases. OTOH, op_is_flush() doesn't check > REQ_OP_FLUSH > either. > > Fix this by: > 1) Check REQ_PREFLUSH in op_is_flush(); > 2) Use op_is_flush() in md code. > > Thanks, > Song > > [1] > https://urldefense.com/v3/__https://bugzilla.kernel.org/show_bug.cgi?id=218184__;!!CTRNKA9wMg0ARbw!gQbjtS_f5d3Du2prpIT8zUM4mkZf7qDleyaAuEfG8j5tMrDvw7cfJUB04VWl0uVAL4BJ4YWbVopp$ > REQ_OP_FLUSH is only used by the block layer's flush code, and the filesystem should use REQ_PREFLUSH with an empty write bio. If we want upper layer to be able to directly send REQ_OP_FLUSH bio, then we should retrieve all REQ_PREFLUSH to confirm. At least for now, it seems that REQ_OP_FLUSH without REQ_PREFLUSH in `blk_flush_policy` will directly return 0 and no flush operation will be sent to the driver. > > > Song Liu (2): > block: Check REQ_OP_FLUSH in op_is_flush() > md: Use op_is_flush() to check flush bio > > drivers/md/raid0.c | 2 +- > drivers/md/raid1.c | 2 +- > drivers/md/raid10.c | 2 +- > drivers/md/raid5.c | 2 +- > include/linux/blk_types.h | 3 ++- > 5 files changed, 6 insertions(+), 5 deletions(-) > > -- > 2.34.1 ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 0/2] block, md: Better handle REQ_OP_FLUSH 2023-12-21 3:36 ` [PATCH 0/2] block, md: Better handle REQ_OP_FLUSH Ed Tsai (蔡宗軒) @ 2023-12-21 5:30 ` Kent Overstreet 2023-12-21 7:56 ` Ed Tsai (蔡宗軒) 0 siblings, 1 reply; 9+ messages in thread From: Kent Overstreet @ 2023-12-21 5:30 UTC (permalink / raw) To: Ed Tsai (蔡宗軒) Cc: song@kernel.org, bagasdotme@gmail.com, colyli@suse.de, linux-raid@vger.kernel.org, linux-block@vger.kernel.org, janpieter.sollie@edpnet.be, axboe@kernel.dk On Thu, Dec 21, 2023 at 03:36:40AM +0000, Ed Tsai (蔡宗軒) wrote: > On Wed, 2023-12-20 at 17:27 -0800, Song Liu wrote: > > > > External email : Please do not click links or open attachments until > > you have verified the sender or the content. > > A recent bug report [1] shows md is handling a flush from bcachefs > > as read: > > > > bch2_journal_write=> > > submit_bio=> > > ... > > md_handle_request => > > raid5_make_request => > > chunk_aligned_read => > > raid5_read_one_chunk => > > ... > > > > It appears md code only checks REQ_PREFLUSH for flush requests, which > > doesn't cover all cases. OTOH, op_is_flush() doesn't check > > REQ_OP_FLUSH > > either. > > > > Fix this by: > > 1) Check REQ_PREFLUSH in op_is_flush(); > > 2) Use op_is_flush() in md code. > > > > Thanks, > > Song > > > > [1] > > https://urldefense.com/v3/__https://bugzilla.kernel.org/show_bug.cgi?id=218184__;!!CTRNKA9wMg0ARbw!gQbjtS_f5d3Du2prpIT8zUM4mkZf7qDleyaAuEfG8j5tMrDvw7cfJUB04VWl0uVAL4BJ4YWbVopp$ > > > > REQ_OP_FLUSH is only used by the block layer's flush code, and the > filesystem should use REQ_PREFLUSH with an empty write bio. > > If we want upper layer to be able to directly send REQ_OP_FLUSH bio, > then we should retrieve all REQ_PREFLUSH to confirm. At least for now, > it seems that REQ_OP_FLUSH without REQ_PREFLUSH in `blk_flush_policy` > will directly return 0 and no flush operation will be sent to the > driver. If that's the case, then it should be documented and there should be a WARN_ON() in generic_make_request(). Also, glancing at blk_types.h, we have the req_op and req_flag_bits both using (__force blk_opf_t), but using the same bit range - what the hell? That's seriously broken... ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 0/2] block, md: Better handle REQ_OP_FLUSH 2023-12-21 5:30 ` Kent Overstreet @ 2023-12-21 7:56 ` Ed Tsai (蔡宗軒) 2023-12-21 19:19 ` Kent Overstreet 0 siblings, 1 reply; 9+ messages in thread From: Ed Tsai (蔡宗軒) @ 2023-12-21 7:56 UTC (permalink / raw) To: kent.overstreet@linux.dev Cc: linux-raid@vger.kernel.org, colyli@suse.de, song@kernel.org, linux-block@vger.kernel.org, bagasdotme@gmail.com, janpieter.sollie@edpnet.be, axboe@kernel.dk On Thu, 2023-12-21 at 00:30 -0500, Kent Overstreet wrote: > On Thu, Dec 21, 2023 at 03:36:40AM +0000, Ed Tsai (蔡宗軒) wrote: > > On Wed, 2023-12-20 at 17:27 -0800, Song Liu wrote: > > > you have verified the sender or the content. > > > A recent bug report [1] shows md is handling a flush from > bcachefs > > > as read: > > > > > > bch2_journal_write=> > > > submit_bio=> > > > ... > > > md_handle_request => > > > raid5_make_request => > > > chunk_aligned_read => > > > raid5_read_one_chunk => > > > ... > > > > > > It appears md code only checks REQ_PREFLUSH for flush requests, > which > > > doesn't cover all cases. OTOH, op_is_flush() doesn't check > > > REQ_OP_FLUSH > > > either. > > > > > > Fix this by: > > > 1) Check REQ_PREFLUSH in op_is_flush(); > > > 2) Use op_is_flush() in md code. > > > > > > Thanks, > > > Song > > > > > > [1] > > > > https://urldefense.com/v3/__https://bugzilla.kernel.org/show_bug.cgi?id=218184__;!!CTRNKA9wMg0ARbw!gQbjtS_f5d3Du2prpIT8zUM4mkZf7qDleyaAuEfG8j5tMrDvw7cfJUB04VWl0uVAL4BJ4YWbVopp$ > > > > > > > REQ_OP_FLUSH is only used by the block layer's flush code, and the > > filesystem should use REQ_PREFLUSH with an empty write bio. > > > > If we want upper layer to be able to directly send REQ_OP_FLUSH > bio, > > then we should retrieve all REQ_PREFLUSH to confirm. At least for > now, > > it seems that REQ_OP_FLUSH without REQ_PREFLUSH in > `blk_flush_policy` > > will directly return 0 and no flush operation will be sent to the > > driver. > > If that's the case, then it should be documented and there should be > a > WARN_ON() in generic_make_request(). Please refer to the writeback_cache_control.rst. Use an empty write bio with the REQ_PREFLUSH flag for an explicit flush, or as commonly practiced by most filesystems, use blkdev_issue_flush for a pure flush. > > Also, glancing at blk_types.h, we have the req_op and req_flag_bits > both > using (__force blk_opf_t), but using the same bit range - what the > hell? > That's seriously broken... No, read the comment before req_op. We do not need to use the entire 32 bits to represent OP; only 8 bits for OP, while the remaning 24 bits is used for FLAG. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 0/2] block, md: Better handle REQ_OP_FLUSH 2023-12-21 7:56 ` Ed Tsai (蔡宗軒) @ 2023-12-21 19:19 ` Kent Overstreet 0 siblings, 0 replies; 9+ messages in thread From: Kent Overstreet @ 2023-12-21 19:19 UTC (permalink / raw) To: Ed Tsai (蔡宗軒) Cc: linux-raid@vger.kernel.org, colyli@suse.de, song@kernel.org, linux-block@vger.kernel.org, bagasdotme@gmail.com, janpieter.sollie@edpnet.be, axboe@kernel.dk On Thu, Dec 21, 2023 at 07:56:45AM +0000, Ed Tsai (蔡宗軒) wrote: > On Thu, 2023-12-21 at 00:30 -0500, Kent Overstreet wrote: > > On Thu, Dec 21, 2023 at 03:36:40AM +0000, Ed Tsai (蔡宗軒) wrote: > > > On Wed, 2023-12-20 at 17:27 -0800, Song Liu wrote: > > > > you have verified the sender or the content. > > > > A recent bug report [1] shows md is handling a flush from > > bcachefs > > > > as read: > > > > > > > > bch2_journal_write=> > > > > submit_bio=> > > > > ... > > > > md_handle_request => > > > > raid5_make_request => > > > > chunk_aligned_read => > > > > raid5_read_one_chunk => > > > > ... > > > > > > > > It appears md code only checks REQ_PREFLUSH for flush requests, > > which > > > > doesn't cover all cases. OTOH, op_is_flush() doesn't check > > > > REQ_OP_FLUSH > > > > either. > > > > > > > > Fix this by: > > > > 1) Check REQ_PREFLUSH in op_is_flush(); > > > > 2) Use op_is_flush() in md code. > > > > > > > > Thanks, > > > > Song > > > > > > > > [1] > > > > > > https://urldefense.com/v3/__https://bugzilla.kernel.org/show_bug.cgi?id=218184__;!!CTRNKA9wMg0ARbw!gQbjtS_f5d3Du2prpIT8zUM4mkZf7qDleyaAuEfG8j5tMrDvw7cfJUB04VWl0uVAL4BJ4YWbVopp$ > > > > > > > > > > REQ_OP_FLUSH is only used by the block layer's flush code, and the > > > filesystem should use REQ_PREFLUSH with an empty write bio. > > > > > > If we want upper layer to be able to directly send REQ_OP_FLUSH > > bio, > > > then we should retrieve all REQ_PREFLUSH to confirm. At least for > > now, > > > it seems that REQ_OP_FLUSH without REQ_PREFLUSH in > > `blk_flush_policy` > > > will directly return 0 and no flush operation will be sent to the > > > driver. > > > > If that's the case, then it should be documented and there should be > > a > > WARN_ON() in generic_make_request(). > > Please refer to the writeback_cache_control.rst. Use an empty write bio > with the REQ_PREFLUSH flag for an explicit flush, or as commonly > practiced by most filesystems, use blkdev_issue_flush for a pure flush. That's not a substitute for a proper comment in the code. > > > > > Also, glancing at blk_types.h, we have the req_op and req_flag_bits > > both > > using (__force blk_opf_t), but using the same bit range - what the > > hell? > > That's seriously broken... > > No, read the comment before req_op. We do not need to use the entire 32 > bits to represent OP; only 8 bits for OP, while the remaning 24 bits is > used for FLAG. No, this is just broken; it's using the same bitwise enum for two different enums. bitwise exists for a reason - C enums are not natively type safe, and mixing up enums/bitflags and using them in the wrong context is a serious source of bugs. If it would be incorrect to or the two different flags together, you can't use the same bitwise type. ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2023-12-21 19:19 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2023-12-21 1:27 [PATCH 0/2] block, md: Better handle REQ_OP_FLUSH Song Liu 2023-12-21 1:27 ` [PATCH 1/2] block: Check REQ_OP_FLUSH in op_is_flush() Song Liu 2023-12-21 6:12 ` Christoph Hellwig 2023-12-21 7:53 ` Song Liu 2023-12-21 1:27 ` [PATCH 2/2] md: Use op_is_flush() to check flush bio Song Liu 2023-12-21 3:36 ` [PATCH 0/2] block, md: Better handle REQ_OP_FLUSH Ed Tsai (蔡宗軒) 2023-12-21 5:30 ` Kent Overstreet 2023-12-21 7:56 ` Ed Tsai (蔡宗軒) 2023-12-21 19:19 ` Kent Overstreet
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).