* [PATCH v2] fs-writeback: writeback_sb_inodes: Recalculate 'wrote' according to skipped pages
@ 2022-04-18  9:28 Zhihao Cheng
  2022-04-18 19:43 ` Linus Torvalds
  0 siblings, 1 reply; 13+ messages in thread
From: Zhihao Cheng @ 2022-04-18  9:28 UTC (permalink / raw)
  To: viro; +Cc: hch, torvalds, linux-fsdevel, linux-kernel, chengzhihao1, yukuai3
Commit 505a666ee3fc ("writeback: plug writeback in wb_writeback() and
writeback_inodes_wb()") has us holding a plug during wb_writeback, which
may cause a potential ABBA deadlock:
    wb_writeback		fat_file_fsync
blk_start_plug(&plug)
for (;;) {
  iter i-1: some reqs have been added into plug->mq_list  // LOCK A
  iter i:
    progress = __writeback_inodes_wb(wb, work)
    . writeback_sb_inodes // fat's bdev
    .   __writeback_single_inode
    .   . generic_writepages
    .   .   __block_write_full_page
    .   .   . . 	    __generic_file_fsync
    .   .   . . 	      sync_inode_metadata
    .   .   . . 	        writeback_single_inode
    .   .   . . 		  __writeback_single_inode
    .   .   . . 		    fat_write_inode
    .   .   . . 		      __fat_write_inode
    .   .   . . 		        sync_dirty_buffer	// fat's bdev
    .   .   . . 			  lock_buffer(bh)	// LOCK B
    .   .   . . 			    submit_bh
    .   .   . . 			      blk_mq_get_tag	// LOCK A
    .   .   . trylock_buffer(bh)  // LOCK B
    .   .   .   redirty_page_for_writepage
    .   .   .     wbc->pages_skipped++
    .   .   --wbc->nr_to_write
    .   wrote += write_chunk - wbc.nr_to_write  // wrote > 0
    .   requeue_inode
    .     redirty_tail_locked
    if (progress)    // progress > 0
      continue;
  iter i+1:
      queue_io
      // similar process with iter i, infinite for-loop !
}
blk_finish_plug(&plug)   // flush plug won't be called
The above process triggers a hung task like:
[  399.044861] INFO: task bb:2607 blocked for more than 30 seconds.
[  399.046824]       Not tainted 5.18.0-rc1-00005-gefae4d9eb6a2-dirty
[  399.051539] task:bb              state:D stack:    0 pid: 2607 ppid:  2426 flags:0x00004000
[  399.051556] Call Trace:
[  399.051570]  __schedule+0x480/0x1050
[  399.051592]  schedule+0x92/0x1a0
[  399.051602]  io_schedule+0x22/0x50
[  399.051613]  blk_mq_get_tag+0x1d3/0x3c0
[  399.051640]  __blk_mq_alloc_requests+0x21d/0x3f0
[  399.051657]  blk_mq_submit_bio+0x68d/0xca0
[  399.051674]  __submit_bio+0x1b5/0x2d0
[  399.051708]  submit_bio_noacct+0x34e/0x720
[  399.051718]  submit_bio+0x3b/0x150
[  399.051725]  submit_bh_wbc+0x161/0x230
[  399.051734]  __sync_dirty_buffer+0xd1/0x420
[  399.051744]  sync_dirty_buffer+0x17/0x20
[  399.051750]  __fat_write_inode+0x289/0x310
[  399.051766]  fat_write_inode+0x2a/0xa0
[  399.051783]  __writeback_single_inode+0x53c/0x6f0
[  399.051795]  writeback_single_inode+0x145/0x200
[  399.051803]  sync_inode_metadata+0x45/0x70
[  399.051856]  __generic_file_fsync+0xa3/0x150
[  399.051880]  fat_file_fsync+0x1d/0x80
[  399.051895]  vfs_fsync_range+0x40/0xb0
[  399.051929]  __x64_sys_fsync+0x18/0x30
In my test, the 'need_resched()' check (introduced by commit 590dca3a71
("fs-writeback: unplug before cond_resched in writeback_sb_inodes")) in
'writeback_sb_inodes()' rarely evaluates to true, unless cond_resched() is
deleted from write_cache_pages().
Fix it by correcting the 'wrote' count according to the number of skipped
pages in writeback_sb_inodes().
A reproducer can be found at the Link below.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215837
Cc: stable@vger.kernel.org # v4.3
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
---
 fs/fs-writeback.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 591fe9cf1659..9740c7d346aa 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1775,11 +1775,12 @@ static long writeback_sb_inodes(struct super_block *sb,
 	};
 	unsigned long start_time = jiffies;
 	long write_chunk;
-	long wrote = 0;  /* count both pages and inodes */
+	long total_wrote = 0;  /* count both pages and inodes */
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);
 		struct bdi_writeback *tmp_wb;
+		long wrote;
 
 		if (inode->i_sb != sb) {
 			if (work->sb) {
@@ -1854,8 +1855,10 @@ static long writeback_sb_inodes(struct super_block *sb,
 		__writeback_single_inode(inode, &wbc);
 
 		wbc_detach_inode(&wbc);
-		work->nr_pages -= write_chunk - wbc.nr_to_write;
-		wrote += write_chunk - wbc.nr_to_write;
+		wrote = write_chunk - wbc.nr_to_write - wbc.pages_skipped;
+		wrote = wrote < 0 ? 0 : wrote;
+		work->nr_pages -= wrote;
+		total_wrote += wrote;
 
 		if (need_resched()) {
 			/*
@@ -1877,7 +1880,7 @@ static long writeback_sb_inodes(struct super_block *sb,
 		tmp_wb = inode_to_wb_and_lock_list(inode);
 		spin_lock(&inode->i_lock);
 		if (!(inode->i_state & I_DIRTY_ALL))
-			wrote++;
+			total_wrote++;
 		requeue_inode(inode, tmp_wb, &wbc);
 		inode_sync_complete(inode);
 		spin_unlock(&inode->i_lock);
@@ -1891,14 +1894,14 @@ static long writeback_sb_inodes(struct super_block *sb,
 		 * bail out to wb_writeback() often enough to check
 		 * background threshold and other termination conditions.
 		 */
-		if (wrote) {
+		if (total_wrote) {
 			if (time_is_before_jiffies(start_time + HZ / 10UL))
 				break;
 			if (work->nr_pages <= 0)
 				break;
 		}
 	}
-	return wrote;
+	return total_wrote;
 }
 
 static long __writeback_inodes_wb(struct bdi_writeback *wb,
-- 
2.31.1
* Re: [PATCH v2] fs-writeback: writeback_sb_inodes: Recalculate 'wrote' according to skipped pages
       [not found] <20220418112745.1761-1-hdanton@sina.com>
@ 2022-04-18 13:27 ` Zhihao Cheng
  0 siblings, 0 replies; 13+ messages in thread
From: Zhihao Cheng @ 2022-04-18 13:27 UTC (permalink / raw)
  To: Hillf Danton; +Cc: hch, torvalds, linux-fsdevel, linux-kernel, yukuai3
Hi Hillf,
> On Mon, 18 Apr 2022 17:28:24 +0800 Zhihao Cheng wrote:
>> Commit 505a666ee3fc ("writeback: plug writeback in wb_writeback() and
>> writeback_inodes_wb()") has us holding a plug during wb_writeback, which
>> may cause a potential ABBA deadlock:
>>
>>      wb_writeback		fat_file_fsync
>> blk_start_plug(&plug)
>> for (;;) {
>>    iter i-1: some reqs have been added into plug->mq_list  // LOCK A
>>    iter i:
>>      progress = __writeback_inodes_wb(wb, work)
>>      . writeback_sb_inodes // fat's bdev
See the "fat's bdev" comments.
> 
> 	if (inode->i_state & I_SYNC) {
> 		/* Wait for I_SYNC. This function drops i_lock... */
> 		inode_sleep_on_writeback(inode);
> 		/* Inode may be gone, start again */
> 		spin_lock(&wb->list_lock);
> 		continue;
> 	}
> 	inode->i_state |= I_SYNC;
This inode is fat's bdev's inode.
> 
>>      .   __writeback_single_inode
>>      .   . generic_writepages
>>      .   .   __block_write_full_page
>>      .   .   . . 	    __generic_file_fsync
>>      .   .   . . 	      sync_inode_metadata
>>      .   .   . . 	        writeback_single_inode
> 
> 				if (inode->i_state & I_SYNC) {
This inode is fat's inode.
> 					/*
> 					 * Writeback is already running on the inode.  For WB_SYNC_NONE,
> 					 * that's enough and we can just return.  For WB_SYNC_ALL, we
> 					 * must wait for the existing writeback to complete, then do
> 					 * writeback again if there's anything left.
> 					 */
> 					if (wbc->sync_mode != WB_SYNC_ALL)
> 						goto out;
> 					__inode_wait_for_writeback(inode);
> 				}
> 				inode->i_state |= I_SYNC;
> 
>>      .   .   . . 		  __writeback_single_inode
>>      .   .   . . 		    fat_write_inode
>>      .   .   . . 		      __fat_write_inode
>>      .   .   . . 		        sync_dirty_buffer	// fat's bdev
>>      .   .   . . 			  lock_buffer(bh)	// LOCK B
>>      .   .   . . 			    submit_bh
>>      .   .   . . 			      blk_mq_get_tag	// LOCK A
>>      .   .   . trylock_buffer(bh)  // LOCK B
> 
> Given I_SYNC checked on both sides, the chance for ABBA deadlock is zero.
The two inodes above are not the same.
* Re: [PATCH v2] fs-writeback: writeback_sb_inodes: Recalculate 'wrote' according to skipped pages
  2022-04-18  9:28 [PATCH v2] fs-writeback: writeback_sb_inodes: Recalculate 'wrote' according to skipped pages Zhihao Cheng
@ 2022-04-18 19:43 ` Linus Torvalds
  2022-04-18 21:16   ` Jens Axboe
                     ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Linus Torvalds @ 2022-04-18 19:43 UTC (permalink / raw)
  To: Zhihao Cheng, Ingo Molnar, Peter Zijlstra, Jens Axboe
  Cc: Al Viro, Christoph Hellwig, linux-fsdevel,
	Linux Kernel Mailing List, yukuai3
[ Adding some scheduler people - the background here is an ABBA
deadlock because a plug never gets unplugged and the IO never starts
and the buffer lock thus never gets released. That's simplified, see
     https://lore.kernel.org/all/20220415013735.1610091-1-chengzhihao1@huawei.com/
  and
     https://bugzilla.kernel.org/show_bug.cgi?id=215837
   for details ]
On Mon, Apr 18, 2022 at 2:14 AM Zhihao Cheng <chengzhihao1@huawei.com> wrote:
>
> In my test, the 'need_resched()' check (introduced by commit 590dca3a71
> ("fs-writeback: unplug before cond_resched in writeback_sb_inodes")) in
> 'writeback_sb_inodes()' rarely evaluates to true, unless cond_resched() is
> deleted from write_cache_pages().
So I'm not reacting to the patch, but just to this part of the message...
I forget the exact history of plugging, but at some point (long long
ago - we're talking pre-git days) it was device-specific and always
released on a timeout (or, obviously, explicitly unplugged).
And then later it became per-process, and always released by task-work
on any schedule() call.
But over time, that "any schedule" has gone away. It did so gradually,
over time, and long ago:
  73c101011926 ("block: initial patch for on-stack per-task plugging")
  6631e635c65d ("block: don't flush plugged IO on forced preemtion scheduling")
And that's *mostly* perfectly fine, but the problem ends up being that
not everything necessarily triggers the flushing at all.
In fact, if you call "__schedule()" directly (rather than
"schedule()") I think you may end up avoiding flush entirely. I'm
looking at  do_task_dead() and schedule_idle() and the
preempt_schedule() cases.
Similarly, tsk_is_pi_blocked() will disable the plug flush.
Back when it was a timer, the flushing was eventually guaranteed.
And then we would flush on any re-schedule, even if it was about
preemption and the process might stay on the CPU.
But these days we can be in the situation where we really don't flush
at all - the process may be scheduled away, but if it's still
runnable, the blk plug won't be flushed.
To make things *really* confusing, doing an io_schedule() will force a
plug flush, even  if the process might stay runnable. So io_schedule()
has those old legacy "unconditional flush" guarantees that a normal
schedule does not any more.
Also note how the plug is per-process, so when another process *does*
block (because it's waiting for some resource), that doesn't end up
really unplugging the actual IO which was started by somebody else.
Even if that other process is using io_schedule().
Which all brings us back to how we have that hacky thing in
writeback_sb_inodes() that does
        if (need_resched()) {
                /*
                 * We're trying to balance between building up a nice
                 * long list of IOs to improve our merge rate, and
                 * getting those IOs out quickly for anyone throttling
                 * in balance_dirty_pages().  cond_resched() doesn't
                 * unplug, so get our IOs out the door before we
                 * give up the CPU.
                 */
                blk_flush_plug(current->plug, false);
                cond_resched();
        }
and that currently *mostly* ends up protecting us and flushing the
plug when doing big writebacks, but as you can see from the email I'm
quoting, it then doesn't always work very well, because
"need_resched()" may end up being cleared by some other scheduling
point, and is entirely meaningless when preemption is on anyway.
So I think that's basically just a random voodoo programming thing
that has protected us in the past in some situations.
Now, Zhihao has a patch that fixes the problem by limiting the
writeback by being better at accounting:
    https://lore.kernel.org/all/20220418092824.3018714-1-chengzhihao1@huawei.com/
which is the email I'm answering, but I did want to bring in the
scheduler people to the discussion to see if people have ideas.
I think the writeback accounting fix is the right thing to do
regardless, but that whole need_resched() dance in
writeback_sb_inodes() is, I think, a sign that we do have real issues
here. That whole "flush plug if we need to reschedule" is simply a
fundamentally broken concept, when there are other rescheduling
points.
Comments?
The answer may just be that "the code in writeback_sb_inodes() is
fundamentally broken and should be removed".
But the fact that we have that code at all makes me quite nervous
about this. And we clearly *do* have situations where the writeback
code seems to cause nasty unplugging delays.
So I'm not convinced that "fix up the writeback accounting" is the
real and final fix.
I don't really have answers or suggestions, I just wanted people to
look at this in case they have ideas.
                    Linus
* Re: [PATCH v2] fs-writeback: writeback_sb_inodes: Recalculate 'wrote' according to skipped pages
  2022-04-18 19:43 ` Linus Torvalds
@ 2022-04-18 21:16   ` Jens Axboe
  2022-04-18 22:01     ` Linus Torvalds
  2022-04-19  8:10   ` Peter Zijlstra
  2022-04-19  8:17   ` Peter Zijlstra
  2 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2022-04-18 21:16 UTC (permalink / raw)
  To: Linus Torvalds, Zhihao Cheng, Ingo Molnar, Peter Zijlstra
  Cc: Al Viro, Christoph Hellwig, linux-fsdevel,
	Linux Kernel Mailing List, yukuai3
On 4/18/22 1:43 PM, Linus Torvalds wrote:
> [ Adding some scheduler people - the background here is an ABBA
> deadlock because a plug never gets unplugged and the IO never starts
> and the buffer lock thus never gets released. That's simplified, see
>      https://lore.kernel.org/all/20220415013735.1610091-1-chengzhihao1@huawei.com/
>   and
>      https://bugzilla.kernel.org/show_bug.cgi?id=215837
>    for details ]
> 
> On Mon, Apr 18, 2022 at 2:14 AM Zhihao Cheng <chengzhihao1@huawei.com> wrote:
>>
>> In my test, the 'need_resched()' check (introduced by commit 590dca3a71
>> ("fs-writeback: unplug before cond_resched in writeback_sb_inodes")) in
>> 'writeback_sb_inodes()' rarely evaluates to true, unless cond_resched() is
>> deleted from write_cache_pages().
> 
> So I'm not reacting to the patch, but just to this part of the message...
> 
> I forget the exact history of plugging, but at some point (long long
> ago - we're talking pre-git days) it was device-specific and always
> released on a timeout (or, obviously, explicitly unplugged).
That is correct, it used to be a tq_disk list and each queue could be
added. This was back in the days when io_request_lock was a single
spinlock around all of bdev queuing, so quite a while ago :-)
> And then later it became per-process, and always released by task-work
> on any schedule() call.
kblockd kickoff from schedule; we never do task-work for unplug. It's
either done inline if not from schedule, or punted to kblockd. But not
really relevant to the problem at hand...
> But over time, that "any schedule" has gone away. It did so gradually,
> over time, and long ago:
> 
>   73c101011926 ("block: initial patch for on-stack per-task plugging")
>   6631e635c65d ("block: don't flush plugged IO on forced preemtion scheduling")
> 
> And that's *mostly* perfectly fine, but the problem ends up being that
> not everything necessarily triggers the flushing at all.
> 
> In fact, if you call "__schedule()" directly (rather than
> "schedule()") I think you may end up avoiding flush entirely. I'm
> looking at  do_task_dead() and schedule_idle() and the
> preempt_schedule() cases.
> 
> Similarly, tsk_is_pi_blocked() will disable the plug flush.
> 
> Back when it was a timer, the flushing was eventually guaranteed.
> 
> And then we would flush on any re-schedule, even if it was about
> preemption and the process might stay on the CPU.
> 
> But these days we can be in the situation where we really don't flush
> at all - the process may be scheduled away, but if it's still
> runnable, the blk plug won't be flushed.
> 
> To make things *really* confusing, doing an io_schedule() will force a
> plug flush, even  if the process might stay runnable. So io_schedule()
> has those old legacy "unconditional flush" guarantees that a normal
> schedule does not any more.
I think that's mostly to avoid hitting it in the schedule path, as it
involves a lock juggle at that point. If you're doing io_schedule(),
presumably chances are high that you have queued IO.
> Also note how the plug is per-process, so when another process *does*
> block (because it's waiting for some resource), that doesn't end up
> really unplugging the actual IO which was started by somebody else.
> Even if that other process is using io_schedule().
> 
> Which all brings us back to how we have that hacky thing in
> writeback_sb_inodes() that does
> 
>         if (need_resched()) {
>                 /*
>                  * We're trying to balance between building up a nice
>                  * long list of IOs to improve our merge rate, and
>                  * getting those IOs out quickly for anyone throttling
>                  * in balance_dirty_pages().  cond_resched() doesn't
>                  * unplug, so get our IOs out the door before we
>                  * give up the CPU.
>                  */
>                 blk_flush_plug(current->plug, false);
>                 cond_resched();
>         }
> 
> and that currently *mostly* ends up protecting us and flushing the
> plug when doing big writebacks, but as you can see from the email I'm
> quoting, it then doesn't always work very well, because
> "need_resched()" may end up being cleared by some other scheduling
> point, and is entirely meaningless when preemption is on anyway.
> 
> So I think that's basically just a random voodoo programming thing
> that has protected us in the past in some situations.
> 
> Now, Zhihao has a patch that fixes the problem by limiting the
> writeback by being better at accounting:
> 
>     https://lore.kernel.org/all/20220418092824.3018714-1-chengzhihao1@huawei.com/
> 
> which is the email I'm answering, but I did want to bring in the
> scheduler people to the discussion to see if people have ideas.
> 
> I think the writeback accounting fix is the right thing to do
> regardless, but that whole need_resched() dance in
> writeback_sb_inodes() is, I think, a sign that we do have real issues
> here. That whole "flush plug if we need to reschedule" is simply a
> fundamentally broken concept, when there are other rescheduling
> points.
> 
> Comments?
> 
> The answer may just be that "the code in writeback_sb_inodes() is
> fundamentally broken and should be removed".
> 
> But the fact that we have that code at all makes me quite nervous
> about this. And we clearly *do* have situations where the writeback
> code seems to cause nasty unplugging delays.
> 
> So I'm not convinced that "fix up the writeback accounting" is the
> real and final fix.
> 
> I don't really have answers or suggestions, I just wanted people to
> look at this in case they have ideas.
Unless I'm missing something, this exclusively seems to be a problem
with being preempted (task scheduled out, still runnable) and the
original patch did flush for preemption. I wasn't aware of the writeback
work-around doing those need_resched() checks to explicitly work around
not flushing on preemption; that seems like a somewhat nasty
work-around...
So as far as I can tell, we really have two options:
1) Don't preempt a task that has a plug active
2) Flush for any schedule out, not just going to sleep
1 may not be feasible if we're queueing lots of IO, which then leaves 2.
Linus, do you remember what your original patch here was motivated by?
I'm assuming it was an efficiency thing, but do we really have a lot of
cases of IO submissions being preempted a lot and hence making the plug
less efficient than it should be at merging IO? Seems unlikely, but I
could be wrong.
-- 
Jens Axboe
* Re: [PATCH v2] fs-writeback: writeback_sb_inodes: Recalculate 'wrote' according to skipped pages
  2022-04-18 21:16   ` Jens Axboe
@ 2022-04-18 22:01     ` Linus Torvalds
  2022-04-18 22:12       ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2022-04-18 22:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Zhihao Cheng, Ingo Molnar, Peter Zijlstra, Al Viro,
	Christoph Hellwig, linux-fsdevel, Linux Kernel Mailing List,
	yukuai3
On Mon, Apr 18, 2022 at 2:16 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> So as far as I can tell, we really have two options:
>
> 1) Don't preempt a task that has a plug active
> 2) Flush for any schedule out, not just going to sleep
>
> 1 may not be feasible if we're queueing lots of IO, which then leaves 2.
> Linus, do you remember what your original patch here was motivated by?
> I'm assuming it was an efficiency thing, but do we really have a lot of
> cases of IO submissions being preempted a lot and hence making the plug
> less efficient than it should be at merging IO? Seems unlikely, but I
> could be wrong.
No, it goes all the way back to 2011, my memory for those kinds of
details doesn't go that far back.
That said, it clearly is about preemption, and I wonder if we had an
actual bug there.
IOW, it might well not just be the "gather up more IO for bigger
requests" thing, but also "the IO plug is per-thread and doesn't have
locking because of that".
So doing plug flushing from a preemptible kernel context might race
with it all being set up.
Explicit io_schedule() etc obviously doesn't have that issue.
                       Linus
* Re: [PATCH v2] fs-writeback: writeback_sb_inodes: Recalculate 'wrote' according to skipped pages
  2022-04-18 22:01     ` Linus Torvalds
@ 2022-04-18 22:12       ` Jens Axboe
  2022-04-18 22:20         ` Jens Axboe
  2022-04-19  0:19         ` Linus Torvalds
  0 siblings, 2 replies; 13+ messages in thread
From: Jens Axboe @ 2022-04-18 22:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zhihao Cheng, Ingo Molnar, Peter Zijlstra, Al Viro,
	Christoph Hellwig, linux-fsdevel, Linux Kernel Mailing List,
	yukuai3
On 4/18/22 4:01 PM, Linus Torvalds wrote:
> On Mon, Apr 18, 2022 at 2:16 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> So as far as I can tell, we really have two options:
>>
>> 1) Don't preempt a task that has a plug active
>> 2) Flush for any schedule out, not just going to sleep
>>
>> 1 may not be feasible if we're queueing lots of IO, which then leaves 2.
>> Linus, do you remember what your original patch here was motivated by?
>> I'm assuming it was an efficiency thing, but do we really have a lot of
>> cases of IO submissions being preempted a lot and hence making the plug
>> less efficient than it should be at merging IO? Seems unlikely, but I
>> could be wrong.
> 
> No, it goes all the way back to 2011, my memory for those kinds of
> details doesn't go that far back.
> 
> That said, it clearly is about preemption, and I wonder if we had an
> actual bug there.
> 
> IOW, it might well not just in the "gather up more IO for bigger
> requests" thing, but about "the IO plug is per-thread and doesn't have
> locking because of that".
> 
> So doing plug flushing from a preemptible kernel context might race
> with it all being set up.
Hmm yes. But doesn't preemption imply a full barrier? As long as we
assign the plug at the end, we should be fine. And just now looking that
up, there's even already a comment to that effect in blk_start_plug().
So barring any weirdness with that, maybe that's the solution.
Your comment did jog my memory a bit though, and I do in fact think it
was something related to that that made us change it. I'll dig through
some old emails and see if I can find it.
> Explicit io_schedule() etc obviously doesn't have that issue.
Right
-- 
Jens Axboe
* Re: [PATCH v2] fs-writeback: writeback_sb_inodes: Recalculate 'wrote' according to skipped pages
  2022-04-18 22:12       ` Jens Axboe
@ 2022-04-18 22:20         ` Jens Axboe
  2022-04-19  0:24           ` Linus Torvalds
  2022-04-19  0:19         ` Linus Torvalds
  1 sibling, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2022-04-18 22:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zhihao Cheng, Ingo Molnar, Peter Zijlstra, Al Viro,
	Christoph Hellwig, linux-fsdevel, Linux Kernel Mailing List,
	yukuai3
On 4/18/22 4:12 PM, Jens Axboe wrote:
> On 4/18/22 4:01 PM, Linus Torvalds wrote:
>> On Mon, Apr 18, 2022 at 2:16 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> So as far as I can tell, we really have two options:
>>>
>>> 1) Don't preempt a task that has a plug active
>>> 2) Flush for any schedule out, not just going to sleep
>>>
>>> 1 may not be feasible if we're queueing lots of IO, which then leaves 2.
>>> Linus, do you remember what your original patch here was motivated by?
>>> I'm assuming it was an efficiency thing, but do we really have a lot of
>>> cases of IO submissions being preempted a lot and hence making the plug
>>> less efficient than it should be at merging IO? Seems unlikely, but I
>>> could be wrong.
>>
>> No, it goes all the way back to 2011, my memory for those kinds of
>> details doesn't go that far back.
>>
>> That said, it clearly is about preemption, and I wonder if we had an
>> actual bug there.
>>
>> IOW, it might well not just in the "gather up more IO for bigger
>> requests" thing, but about "the IO plug is per-thread and doesn't have
>> locking because of that".
>>
>> So doing plug flushing from a preemptible kernel context might race
>> with it all being set up.
> 
> Hmm yes. But doesn't preemption imply a full barrier? As long as we
> assign the plug at the end, we should be fine. And just now looking that
> up, there's even already a comment to that effect in blk_start_plug().
> So barring any weirdness with that, maybe that's the solution.
> 
> Your comment did jog my memory a bit though, and I do in fact think it
> was something related to that that made us change it. I'll dig through
> some old emails and see if I can find it.
Here's the thread:
https://lore.kernel.org/all/1295659049-2688-6-git-send-email-jaxboe@fusionio.com/
I'll dig through it in a bit, but here's your reasoning for why it
should not flush on preemption:
https://lore.kernel.org/all/BANLkTikBEJa7bJJoLFU7NoiEgOjVHVG08A@mail.gmail.com/
-- 
Jens Axboe
* Re: [PATCH v2] fs-writeback: writeback_sb_inodes: Recalculate 'wrote' according to skipped pages
  2022-04-18 22:12       ` Jens Axboe
  2022-04-18 22:20         ` Jens Axboe
@ 2022-04-19  0:19         ` Linus Torvalds
  2022-04-19  0:30           ` Jens Axboe
  1 sibling, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2022-04-19  0:19 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Zhihao Cheng, Ingo Molnar, Peter Zijlstra, Al Viro,
	Christoph Hellwig, linux-fsdevel, Linux Kernel Mailing List,
	yukuai3
On Mon, Apr 18, 2022 at 3:12 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> Hmm yes. But doesn't preemption imply a full barrier? As long as we
> assign the plug at the end, we should be fine. And just now looking that
> up, there's even already a comment to that effect in blk_start_plug().
> So barring any weirdness with that, maybe that's the solution.
My worry is more about the code that adds new cb_list entries to the
plug, racing with then some random preemption event that flushes the
plug.
preemption itself is perfectly fine wrt any per-thread data updates
etc, but if preemption then also *changes* the data that is updated,
that's not great.
So that worries me.
             Linus
* Re: [PATCH v2] fs-writeback: writeback_sb_inodes: Recalculate 'wrote' according to skipped pages
  2022-04-18 22:20         ` Jens Axboe
@ 2022-04-19  0:24           ` Linus Torvalds
  0 siblings, 0 replies; 13+ messages in thread
From: Linus Torvalds @ 2022-04-19  0:24 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Zhihao Cheng, Ingo Molnar, Peter Zijlstra, Al Viro,
	Christoph Hellwig, linux-fsdevel, Linux Kernel Mailing List,
	yukuai3
On Mon, Apr 18, 2022 at 3:20 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> Here's the thread:
>
> https://lore.kernel.org/all/1295659049-2688-6-git-send-email-jaxboe@fusionio.com/
>
> I'll dig through it in a bit, but here's your reasoning for why it
> should not flush on preemption:
>
> https://lore.kernel.org/all/BANLkTikBEJa7bJJoLFU7NoiEgOjVHVG08A@mail.gmail.com/
Well, that one was triggered by that whole "now it can happen
anywhere" worry that people had.
So yes, IO patterns are a worry, but I think the bigger worry - even
back then - was that preemption points can be pretty much anywhere in
the code.
              Linus
* Re: [PATCH v2] fs-writeback: writeback_sb_inodes: Recalculate 'wrote' according to skipped pages
  2022-04-19  0:19         ` Linus Torvalds
@ 2022-04-19  0:30           ` Jens Axboe
  2022-04-19  8:16             ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2022-04-19  0:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zhihao Cheng, Ingo Molnar, Peter Zijlstra, Al Viro,
	Christoph Hellwig, linux-fsdevel, Linux Kernel Mailing List,
	yukuai3
On 4/18/22 6:19 PM, Linus Torvalds wrote:
> On Mon, Apr 18, 2022 at 3:12 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> Hmm yes. But doesn't preemption imply a full barrier? As long as we
>> assign the plug at the end, we should be fine. And just now looking that
>> up, there's even already a comment to that effect in blk_start_plug().
>> So barring any weirdness with that, maybe that's the solution.
> 
> My worry is more about the code that adds new cb_list entries to the
> plug, racing with then some random preemption event that flushes the
> plug.
> 
> preemption itself is perfectly fine wrt any per-thread data updates
> etc, but if preemption then also *changes* the data that is updated,
> that's not great.
> 
> So that worries me.
Yes, and the same is true for e.g. merge traversal. We'd then need to
disable preempt for that as well...
It may be the best option in terms of making this issue go away without
having callers working around it. I don't have a good answer to this
right now, I'll think about it.
-- 
Jens Axboe
* Re: [PATCH v2] fs-writeback: writeback_sb_inodes: Recalculate 'wrote' according to skipped pages
  2022-04-18 19:43 ` Linus Torvalds
  2022-04-18 21:16   ` Jens Axboe
@ 2022-04-19  8:10   ` Peter Zijlstra
  2022-04-19  8:17   ` Peter Zijlstra
  2 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2022-04-19  8:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zhihao Cheng, Ingo Molnar, Jens Axboe, Al Viro, Christoph Hellwig,
	linux-fsdevel, Linux Kernel Mailing List, yukuai3
On Mon, Apr 18, 2022 at 12:43:43PM -0700, Linus Torvalds wrote:
> Similarly, tsk_is_pi_blocked() will disable the plug flush.
Bah, I'm forever confused by that clause... I did ask for a comment at
some point but that never seems to have happened.
  https://lore.kernel.org/all/20190816160626.12742-1-bigeasy@linutronix.de/T/#md983b64256814c046cec9ec875f4a975029b0daa
Seems to indicate it is about scheduling while holding a lock and the
unplug then leading to a deadlock.
* Re: [PATCH v2] fs-writeback: writeback_sb_inodes: Recalculate 'wrote' according to skipped pages
  2022-04-19  0:30           ` Jens Axboe
@ 2022-04-19  8:16             ` Peter Zijlstra
  0 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2022-04-19  8:16 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Zhihao Cheng, Ingo Molnar, Al Viro,
	Christoph Hellwig, linux-fsdevel, Linux Kernel Mailing List,
	yukuai3
On Mon, Apr 18, 2022 at 06:30:16PM -0600, Jens Axboe wrote:
> On 4/18/22 6:19 PM, Linus Torvalds wrote:
> > On Mon, Apr 18, 2022 at 3:12 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>
> >> Hmm yes. But doesn't preemption imply a full barrier? As long as we
> >> assign the plug at the end, we should be fine. And just now looking that
> >> up, there's even already a comment to that effect in blk_start_plug().
> >> So barring any weirdness with that, maybe that's the solution.
> > 
> > My worry is more about the code that adds new cb_list entries to the
> > plug, racing with then some random preemption event that flushes the
> > plug.
> > 
> > preemption itself is perfectly fine wrt any per-thread data updates
> > etc, but if preemption then also *changes* the data that is updated,
> > that's not great.
> > 
> > So that worries me.
> 
> Yes, and the same is true for eg merge traversal. We'd then need to
> disable preempt for that as well...
One is only supposed to disable preemption for short and bounded things,
otherwise we'll get people complaining their latencies are going bad.
* Re: [PATCH v2] fs-writeback: writeback_sb_inodes: Recalculate 'wrote' according to skipped pages
  2022-04-18 19:43 ` Linus Torvalds
  2022-04-18 21:16   ` Jens Axboe
  2022-04-19  8:10   ` Peter Zijlstra
@ 2022-04-19  8:17   ` Peter Zijlstra
  2 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2022-04-19  8:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zhihao Cheng, Ingo Molnar, Jens Axboe, Al Viro, Christoph Hellwig,
	linux-fsdevel, Linux Kernel Mailing List, yukuai3
On Mon, Apr 18, 2022 at 12:43:43PM -0700, Linus Torvalds wrote:
> Which all brings us back to how we have that hacky thing in
> writeback_sb_inodes() that does
> 
>         if (need_resched()) {
>                 /*
>                  * We're trying to balance between building up a nice
>                  * long list of IOs to improve our merge rate, and
>                  * getting those IOs out quickly for anyone throttling
>                  * in balance_dirty_pages().  cond_resched() doesn't
>                  * unplug, so get our IOs out the door before we
>                  * give up the CPU.
>                  */
>                 blk_flush_plug(current->plug, false);
>                 cond_resched();
>         }
Yeah, that's horribly broken for PREEMPT=y.