linux-block.vger.kernel.org archive mirror
* [PATCH stable] block/mq-deadline: fix different priority request on the same zone
@ 2024-05-16  9:28 Wu Bo
  2024-05-16 13:45 ` Bart Van Assche
  0 siblings, 1 reply; 6+ messages in thread
From: Wu Bo @ 2024-05-16  9:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jens Axboe, Bart Van Assche, linux-block, Wu Bo, Wu Bo, stable

Zoned devices require sequential writes within a zone. That means that if
there are 2 write requests on the same zone, the request with the lower pos
must be dispatched to the device first.
However, since each priority class has its own tree & list, requests with a
higher priority are dispatched first.
So if requestA & requestB are on the same zone, requestA is BE with pos X+0
and requestB is RT with pos X+1, then requestB will be dispatched before
requestA, which triggers an ERROR from the zoned device.
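
For reference, the dispatch path walks the per-priority queues roughly
like this (a simplified sketch, not the exact upstream code; argument
lists vary between kernel versions):

	/*
	 * mq-deadline serves priorities in order RT -> BE -> IDLE, so a
	 * BE write that must precede an RT write in the same zone is
	 * never considered before the RT write.
	 */
	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
		rq = __dd_dispatch_request(dd, &dd->per_prio[prio]);
		if (rq)
			return rq;
	}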

This was found in a practical scenario when using F2FS on a zoned device,
and it is very easy to reproduce (example commands below):
1. Use fsstress to run 8 test processes
2. Use ionice to switch 4 of the 8 processes to RT priority
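
For example (the exact fsstress/ionice invocations below are only an
illustration; adjust the mount point and counts as needed):

	# 8 fsstress workers on an F2FS mount backed by the zoned device
	fsstress -d /mnt/f2fs/test -p 8 -n 100000 -l 0 &
	# move 4 of the workers to the RT I/O scheduling class (needs root)
	for pid in $(pgrep fsstress | head -n 4); do ionice -c 1 -p "$pid"; done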

Fixes: c807ab520fc3 ("block/mq-deadline: Add I/O priority support")
Cc: <stable@vger.kernel.org>
Signed-off-by: Wu Bo <bo.wu@vivo.com>
---
 block/mq-deadline.c    | 31 +++++++++++++++++++++++++++++++
 include/linux/blk-mq.h | 15 +++++++++++++++
 2 files changed, 46 insertions(+)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 02a916ba62ee..6a05dd86e8ca 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -539,6 +539,37 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
 	if (started_after(dd, rq, latest_start))
 		return NULL;
 
+	if (!blk_rq_is_seq_zoned_write(rq))
+		goto skip_check;
+	/*
+	 * To ensure sequential writing, check the lower priority classes to see
+	 * whether there is a request on the same zone that needs to be
+	 * dispatched first.
+	 */
+	ioprio_class = dd_rq_ioclass(rq);
+	prio = ioprio_class_to_prio[ioprio_class];
+	prio++;
+	for (; prio <= DD_PRIO_MAX; prio++) {
+		struct request *temp_rq;
+		unsigned long flags;
+		bool can_dispatch;
+
+		if (!dd_queued(dd, prio))
+			continue;
+
+		temp_rq = deadline_from_pos(&dd->per_prio[prio], data_dir, blk_rq_pos(rq));
+		if (temp_rq && blk_req_zone_in_one(temp_rq, rq) &&
+				blk_rq_pos(temp_rq) < blk_rq_pos(rq)) {
+			spin_lock_irqsave(&dd->zone_lock, flags);
+			can_dispatch = blk_req_can_dispatch_to_zone(temp_rq);
+			spin_unlock_irqrestore(&dd->zone_lock, flags);
+			if (!can_dispatch)
+				return NULL;
+			rq = temp_rq;
+			per_prio = &dd->per_prio[prio];
+		}
+	}
+skip_check:
 	/*
 	 * rq is the selected appropriate request.
 	 */
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d3d8fd8e229b..bca1e639e0f3 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -1202,6 +1202,15 @@ static inline bool blk_req_can_dispatch_to_zone(struct request *rq)
 		return true;
 	return !blk_req_zone_is_write_locked(rq);
 }
+
+static inline bool blk_req_zone_in_one(struct request *rq_a,
+		struct request *rq_b)
+{
+	unsigned int zone_sectors = rq_a->q->limits.chunk_sectors;
+
+	return round_down(blk_rq_pos(rq_a), zone_sectors) ==
+		round_down(blk_rq_pos(rq_b), zone_sectors);
+}
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline bool blk_rq_is_seq_zoned_write(struct request *rq)
 {
@@ -1229,6 +1238,12 @@ static inline bool blk_req_can_dispatch_to_zone(struct request *rq)
 {
 	return true;
 }
+
+static inline bool blk_req_zone_in_one(struct request *rq_a,
+		struct request *rq_b)
+{
+	return false;
+}
 #endif /* CONFIG_BLK_DEV_ZONED */
 
 #endif /* BLK_MQ_H */
-- 
2.35.3



* Re: [PATCH stable] block/mq-deadline: fix different priority request on the same zone
  2024-05-16  9:28 [PATCH stable] block/mq-deadline: fix different priority request on the same zone Wu Bo
@ 2024-05-16 13:45 ` Bart Van Assche
  2024-05-17  1:44   ` Wu Bo
  0 siblings, 1 reply; 6+ messages in thread
From: Bart Van Assche @ 2024-05-16 13:45 UTC (permalink / raw)
  To: Wu Bo, linux-kernel
  Cc: Jens Axboe, linux-block, Wu Bo, stable, Damien Le Moal

On 5/16/24 03:28, Wu Bo wrote:
> Zoned devices require sequential writes within a zone. That means that if
> there are 2 write requests on the same zone, the request with the lower pos
> must be dispatched to the device first.
> However, since each priority class has its own tree & list, requests with a
> higher priority are dispatched first.
> So if requestA & requestB are on the same zone, requestA is BE with pos X+0
> and requestB is RT with pos X+1, then requestB will be dispatched before
> requestA, which triggers an ERROR from the zoned device.
> 
> This was found in a practical scenario when using F2FS on a zoned device,
> and it is very easy to reproduce:
> 1. Use fsstress to run 8 test processes
> 2. Use ionice to switch 4 of the 8 processes to RT priority

Hi Wu,

I agree that there is a problem related to the interaction of I/O
priority and zoned storage. A solution with a lower runtime overhead
is available here:
https://lore.kernel.org/linux-block/20231218211342.2179689-1-bvanassche@acm.org/T/#me97b088c535278fe3d1dc5846b388ed58aa53f46

Are you OK with that alternative solution?

Thanks,

Bart.


* Re: [PATCH stable] block/mq-deadline: fix different priority request on the same zone
  2024-05-16 13:45 ` Bart Van Assche
@ 2024-05-17  1:44   ` Wu Bo
  2024-05-17 17:53     ` Bart Van Assche
  0 siblings, 1 reply; 6+ messages in thread
From: Wu Bo @ 2024-05-17  1:44 UTC (permalink / raw)
  To: bvanassche
  Cc: axboe, bo.wu, dlemoal, linux-block, linux-kernel, stable,
	wubo.oduw

On Thu, May 16, 2024 at 07:45:21AM -0600, Bart Van Assche wrote:
> On 5/16/24 03:28, Wu Bo wrote:
> > Zoned devices require sequential writes within a zone. That means that if
> > there are 2 write requests on the same zone, the request with the lower pos
> > must be dispatched to the device first.
> > However, since each priority class has its own tree & list, requests with a
> > higher priority are dispatched first.
> > So if requestA & requestB are on the same zone, requestA is BE with pos X+0
> > and requestB is RT with pos X+1, then requestB will be dispatched before
> > requestA, which triggers an ERROR from the zoned device.
> > 
> > This was found in a practical scenario when using F2FS on a zoned device,
> > and it is very easy to reproduce:
> > 1. Use fsstress to run 8 test processes
> > 2. Use ionice to switch 4 of the 8 processes to RT priority
> 
> Hi Wu,
> 
> I agree that there is a problem related to the interaction of I/O
> priority and zoned storage. A solution with a lower runtime overhead
> is available here:
> https://lore.kernel.org/linux-block/20231218211342.2179689-1-bvanassche@acm.org/T/#me97b088c535278fe3d1dc5846b388ed58aa53f46
Hi Bart,

I have tried setting all sequential write requests to the same priority:

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 6a05dd86e8ca..b560846c63cb 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -841,7 +841,10 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	 */
 	blk_req_zone_write_unlock(rq);
 
-	prio = ioprio_class_to_prio[ioprio_class];
+	if (blk_rq_is_seq_zoned_write(rq))
+		prio = DD_BE_PRIO;
+	else
+		prio = ioprio_class_to_prio[ioprio_class];
 	per_prio = &dd->per_prio[prio];
 	if (!rq->elv.priv[0]) {
 		per_prio->stats.inserted++;

I think this has the same effect as the patch you mentioned here. Unfortunately,
this fix causes another issue.
Since all write requests are set to the same priority while read requests still
have different priorities, f2fs becomes prone to hangs under stress testing:

[129412.105440][T1100129] vkhungtaskd: INFO: task "f2fs_ckpt-254:5":769 blocked for more than 193 seconds.
[129412.106629][T1100129] vkhungtaskd:       6.1.25-android14-11-maybe-dirty #1
[129412.107624][T1100129] vkhungtaskd: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[129412.108873][T1100129] vkhungtaskd: task:f2fs_ckpt-254:5 state:D stack:10496 pid:769   ppid:2      flags:0x00000408
[129412.110194][T1100129] vkhungtaskd: Call trace:
[129412.110769][T1100129] vkhungtaskd:  __switch_to+0x174/0x338
[129412.111566][T1100129] vkhungtaskd:  __schedule+0x604/0x9e4
[129412.112275][T1100129] vkhungtaskd:  schedule+0x7c/0xe8
[129412.112938][T1100129] vkhungtaskd:  rwsem_down_write_slowpath+0x4cc/0xf98
[129412.113813][T1100129] vkhungtaskd:  down_write+0x38/0x40
[129412.114500][T1100129] vkhungtaskd:  __write_checkpoint_sync+0x8c/0x11c
[129412.115409][T1100129] vkhungtaskd:  __checkpoint_and_complete_reqs+0x54/0x1dc
[129412.116323][T1100129] vkhungtaskd:  issue_checkpoint_thread+0x8c/0xec
[129412.117148][T1100129] vkhungtaskd:  kthread+0x110/0x224
[129412.117826][T1100129] vkhungtaskd:  ret_from_fork+0x10/0x20
[129412.484027][T1700129] vkhungtaskd: task:f2fs_gc-254:55  state:D stack:10832 pid:771   ppid:2      flags:0x00000408
[129412.485337][T1700129] vkhungtaskd: Call trace:
[129412.485906][T1700129] vkhungtaskd:  __switch_to+0x174/0x338
[129412.486618][T1700129] vkhungtaskd:  __schedule+0x604/0x9e4
[129412.487327][T1700129] vkhungtaskd:  schedule+0x7c/0xe8
[129412.487985][T1700129] vkhungtaskd:  io_schedule+0x38/0xc4
[129412.488675][T1700129] vkhungtaskd:  folio_wait_bit_common+0x3d8/0x4f8
[129412.489496][T1700129] vkhungtaskd:  __folio_lock+0x1c/0x2c
[129412.490196][T1700129] vkhungtaskd:  __folio_lock_io+0x24/0x44
[129412.490936][T1700129] vkhungtaskd:  __filemap_get_folio+0x190/0x400
[129412.491736][T1700129] vkhungtaskd:  pagecache_get_page+0x1c/0x5c
[129412.492501][T1700129] vkhungtaskd:  f2fs_wait_on_block_writeback+0x60/0xf8
[129412.493376][T1700129] vkhungtaskd:  do_garbage_collect+0x1100/0x223c
[129412.494185][T1700129] vkhungtaskd:  f2fs_gc+0x284/0x778
[129412.494858][T1700129] vkhungtaskd:  gc_thread_func+0x304/0x838
[129412.495603][T1700129] vkhungtaskd:  kthread+0x110/0x224
[129412.496271][T1700129] vkhungtaskd:  ret_from_fork+0x10/0x20

I think this is because f2fs is a CoW filesystem: some threads hold locks while
doing a lot of reading & writing at the same time. When such a thread's reads
and writes have different priorities, its progress becomes very slow, and other
FS operations get blocked.

So I came up with this solution to fix the priority issue on zoned devices. It
does raise the overhead, but it does fix the problem.

Thanks,
Wu Bo
> 
> Are you OK with that alternative solution?
> 
> Thanks,
> 
> Bart.


* Re: [PATCH stable] block/mq-deadline: fix different priority request on the same zone
  2024-05-17  1:44   ` Wu Bo
@ 2024-05-17 17:53     ` Bart Van Assche
  2024-05-18  0:52       ` Wu Bo
  0 siblings, 1 reply; 6+ messages in thread
From: Bart Van Assche @ 2024-05-17 17:53 UTC (permalink / raw)
  To: Wu Bo; +Cc: axboe, dlemoal, linux-block, linux-kernel, stable, wubo.oduw

On 5/16/24 18:44, Wu Bo wrote:
> So I came up with this solution to fix the priority issue on zoned devices.
> It does raise the overhead, but it does fix the problem.

Something I should have realized earlier is that this patch is not
necessary with the latest upstream kernel (v6.10-rc1). Damien's zoned
write plugging patch series has been merged. Hence, I/O schedulers,
including the mq-deadline I/O scheduler, will only see a single
zoned write at a time per zone. So it is no longer possible that
zoned writes are reordered by the I/O scheduler because of their I/O
priorities.

Thanks,

Bart.


* Re: [PATCH stable] block/mq-deadline: fix different priority request on the same zone
  2024-05-17 17:53     ` Bart Van Assche
@ 2024-05-18  0:52       ` Wu Bo
  2024-05-18  2:14         ` Bart Van Assche
  0 siblings, 1 reply; 6+ messages in thread
From: Wu Bo @ 2024-05-18  0:52 UTC (permalink / raw)
  To: Bart Van Assche, Wu Bo; +Cc: axboe, dlemoal, linux-block, linux-kernel, stable

On 2024/5/18 01:53, Bart Van Assche wrote:
> On 5/16/24 18:44, Wu Bo wrote:
>> So I came up with this solution to fix the priority issue on zoned devices.
>> It does raise the overhead, but it does fix the problem.
>
> Something I should have realized earlier is that this patch is not
> necessary with the latest upstream kernel (v6.10-rc1). Damien's zoned
> write plugging patch series has been merged. Hence, I/O schedulers,
> including the mq-deadline I/O scheduler, will only see a single
> zoned write at a time per zone. So it is no longer possible that
> zoned writes are reordered by the I/O scheduler because of their I/O
> priorities.
Hi Bart,

Yes, I noticed that 'zone write plugging' has been merged into the latest
branch. But it seems hard to backport to older kernels that already have the
mq-deadline priority feature. So is it possible to apply this fix to those
older versions?

Thanks,
Wu Bo
>
> Thanks,
>
> Bart.


* Re: [PATCH stable] block/mq-deadline: fix different priority request on the same zone
  2024-05-18  0:52       ` Wu Bo
@ 2024-05-18  2:14         ` Bart Van Assche
  0 siblings, 0 replies; 6+ messages in thread
From: Bart Van Assche @ 2024-05-18  2:14 UTC (permalink / raw)
  To: Wu Bo, Wu Bo; +Cc: axboe, dlemoal, linux-block, linux-kernel, stable

On 5/17/24 18:52, Wu Bo wrote:
> Yes, I noticed that 'zone write plugging' has been merged into the latest
> branch. But it seems hard to backport to older kernels that already have the
> mq-deadline priority feature. So is it possible to apply this fix to those
> older versions?

If you need this change in the Android kernel, please either submit a CL
to the Android kernel repository or submit a request to the Android Partner
Engineering program.

Thanks,

Bart.


