From: Jens Axboe <jaxboe@fusionio.com>
To: Dave Chinner <david@fromorbit.com>
Cc: "hch@infradead.org" <hch@infradead.org>,
NeilBrown <neilb@suse.de>, Mike Snitzer <snitzer@redhat.com>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"dm-devel@redhat.com" <dm-devel@redhat.com>,
"linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>
Subject: Re: [PATCH 05/10] block: remove per-queue plugging
Date: Tue, 12 Apr 2011 15:45:52 +0200 [thread overview]
Message-ID: <4DA45790.2010109@fusionio.com> (raw)
In-Reply-To: <20110412133117.GE31057@dastard>
On 2011-04-12 15:31, Dave Chinner wrote:
> On Tue, Apr 12, 2011 at 02:58:46PM +0200, Jens Axboe wrote:
>> On 2011-04-12 14:41, Dave Chinner wrote:
>>> On Tue, Apr 12, 2011 at 02:28:31PM +0200, Jens Axboe wrote:
>>>> On 2011-04-12 14:22, Dave Chinner wrote:
>>>>> On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote:
>>>>>> On 2011-04-12 03:12, hch@infradead.org wrote:
>>>>>>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
>>>>>>>> Great, once you do that and XFS kills the blk_flush_plug() calls too,
>>>>>>>> then we can remove that export and make it internal only.
>>>>>>>
>>>>>>> Linus pulled the tree, so they are gone now. Btw, there's still some
>>>>>>> bits in the area that confuse me:
>>>>>>
>>>>>> Great!
>>>>>>
>>>>>>> - what's the point of the queue_sync_plugs? It has a lot of comment
>>>>>>> that seem to pre-date the on-stack plugging, but except for that
>>>>>>> it's trivial wrapper around blk_flush_plug, with an argument
>>>>>>> that is not used.
>>>>>>
>>>>>> There's really no point to it anymore. Its existence was due to the
>>>>>> older revision that had to track write requests for serializing around
>>>>>> a barrier. I'll kill it, since we don't do that anymore.
>>>>>>
>>>>>>> - is there a good reason for the existence of __blk_flush_plug? You'd
>>>>>>> get one additional instruction in the inlined version of
>>>>>>> blk_flush_plug when opencoding, but avoid the need for chained
>>>>>>> function calls.
>>>>>>> - Why is having a plug in blk_flush_plug marked unlikely? Note that
>>>>>>> unlikely is the static branch prediction hint to mark the case
>>>>>>> extremely unlikely and is even used for hot/cold partitioning. But
>>>>>>> when we call it we usually check beforehand if we actually have
>>>>>>> plugs, so it's actually likely to happen.
>>>>>>
>>>>>> The existence and the out-of-line version are for the schedule() hook.
>>>>>> It should be an unlikely event to schedule with a plug held; normally
>>>>>> the plug should have been explicitly flushed before that happens.
>>>>>
>>>>> Though if it does, haven't you just added a significant amount of
>>>>> depth to the worst case stack usage? I'm seeing this sort of thing
>>>>> from io_schedule():
>>>>>
>>>>> Depth Size Location (40 entries)
>>>>> ----- ---- --------
>>>>> 0) 4256 16 mempool_alloc_slab+0x15/0x20
>>>>> 1) 4240 144 mempool_alloc+0x63/0x160
>>>>> 2) 4096 16 scsi_sg_alloc+0x4c/0x60
>>>>> 3) 4080 112 __sg_alloc_table+0x66/0x140
>>>>> 4) 3968 32 scsi_init_sgtable+0x33/0x90
>>>>> 5) 3936 48 scsi_init_io+0x31/0xc0
>>>>> 6) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0
>>>>> 7) 3856 112 sd_prep_fn+0x150/0xa90
>>>>> 8) 3744 48 blk_peek_request+0x6a/0x1f0
>>>>> 9) 3696 96 scsi_request_fn+0x60/0x510
>>>>> 10) 3600 32 __blk_run_queue+0x57/0x100
>>>>> 11) 3568 80 flush_plug_list+0x133/0x1d0
>>>>> 12) 3488 32 __blk_flush_plug+0x24/0x50
>>>>> 13) 3456 32 io_schedule+0x79/0x80
>>>>>
>>>>> (This is from a page fault on ext3 that is doing page cache
>>>>> readahead and blocking on a locked buffer.)
>>>>>
>>>>> I've seen traces where mempool_alloc_slab enters direct reclaim
>>>>> which adds another 1.5k of stack usage to this path. So I'm
>>>>> extremely concerned that you've just reduced the stack available to
>>>>> every thread by at least 2.5k of space...
>>>>
>>>> Yeah, that does not look great. If this turns out to be problematic, we
>>>> can turn the queue runs from the unlikely case into out-of-line from
>>>> kblockd.
>>>>
>>>> But this really isn't that new, you could enter the IO dispatch path
>>>> when doing IO already (when submitting it). So we better be able to
>>>> handle that.
>>>
>>> The problem I see is that IO is submitted when there's plenty of
>>> stack available, which would previously have been fine. However, now
>>> it hits the plug, and then later on, after the thread has consumed a
>>> lot more stack, it, say, waits for a completion. We then schedule, the
>>> queue is unplugged, and we add the IO stack at a place where there
>>> isn't much space available.
>>>
>>> So effectively we are moving around the places where stack is
>>> consumed, and it's completely unpredictable where that stack is
>>> going to land now.
>>
>> Isn't that example fairly contrived?
>
> I don't think so. e.g. in the XFS allocation path we do btree block
> readahead, then go do the real work. The real work can end up with a
> deeper stack before blocking on locks or completions unrelated to
> the readahead, leading to schedule() being called and an unplug
> being issued at that point. You might think it contrived, but if
> you can't provide a guarantee that it can't happen then I have to
> assume it will happen.
If you ended up in lock_page() somewhere along the way, the path would
have been pretty much the same as it is now:
lock_page()
  __lock_page()
    __wait_on_bit_lock()
      sync_page()
        aops->sync_page();
          block_sync_page()
            __blk_run_backing_dev()
and the dispatch follows after that. If your schedules are only due to,
say, blocking on a mutex, then yes it'll be different. But is that
really the case?
I bet that the worst-case stack usage is exactly the same as before, and
that's the only metric we really care about.
> My concern is that we're already under stack space stress in the
> writeback path, so anything that has the potential to increase it
> significantly is a major worry from my point of view...
I agree that writeback is a worry, and that's why I made the change
(since it makes sense for other reasons, too). I just don't think we are
worse off than before.
>> If we ended up doing the IO
>> dispatch before, then the only difference now is the stack usage of
>> schedule() itself. Apart from that, as far as I can tell, there should
>> not be much difference.
>
> There's a difference between IO submission and IO dispatch. IO
> submission is submit_bio thru to the plug; IO dispatch is from the
> plug down to the disk. If they happen at the same place, there's no
> problem. If IO dispatch is moved to schedule() via a plug....
The IO submission can easily and non-deterministically turn into an IO
dispatch, so there's no real difference for the submitter. That was the
case before. With the explicit plug now, you _know_ that the IO
submission is only that and doesn't include IO dispatch. Not until you
schedule() or call blk_finish_plug(), both of which are events that you
can control.
>>>> If it's a problem from the schedule()/io_schedule() path, then
>>>> let's ensure that those are truly unlikely events so we can punt
>>>> them to kblockd.
>>>
>>> Rather than wait for an explosion to be reported before doing this,
>>> why not just punt unplugs to kblockd unconditionally?
>>
>> Supposedly it's faster to do it inline rather than punt the dispatch.
>> But that may actually not be true, if you have multiple plugs going (and
>> thus multiple contenders for the queue lock on dispatch). So let's play
>> it safe and punt to kblockd, we can always revisit this later.
>
> It's always best to play it safe when it comes to other people's
> data....
Certainly, but so far I see no real evidence that this is in fact any
safer.
--
Jens Axboe
Thread overview: 152+ messages
2011-01-22 1:17 [PATCH 0/10] On-stack explicit block queue plugging Jens Axboe
2011-01-22 1:17 ` [PATCH 01/10] block: add API for delaying work/request_fn a little bit Jens Axboe
2011-01-22 1:17 ` [PATCH 02/10] ide-cd: convert to blk_delay_queue() for a short pause Jens Axboe
2011-01-22 1:19 ` David Miller
2011-01-22 1:17 ` [PATCH 03/10] scsi: convert to blk_delay_queue() Jens Axboe
2011-01-22 1:17 ` [PATCH 04/10] block: initial patch for on-stack per-task plugging Jens Axboe
2011-01-24 19:36 ` Jeff Moyer
2011-01-24 21:23 ` Jens Axboe
2011-03-10 16:54 ` Vivek Goyal
2011-03-10 19:32 ` Jens Axboe
2011-03-10 19:46 ` Vivek Goyal
2011-03-16 8:18 ` Shaohua Li
2011-03-16 17:31 ` Vivek Goyal
2011-03-17 1:00 ` Shaohua Li
2011-03-17 3:19 ` Shaohua Li
2011-03-17 9:44 ` Jens Axboe
2011-03-18 1:55 ` Shaohua Li
2011-03-17 9:43 ` Jens Axboe
2011-03-18 6:36 ` Shaohua Li
2011-03-18 12:54 ` Jens Axboe
2011-03-18 13:52 ` Jens Axboe
2011-03-21 6:52 ` Shaohua Li
2011-03-21 9:20 ` Jens Axboe
2011-03-22 0:32 ` Shaohua Li
2011-03-22 7:36 ` Jens Axboe
2011-03-17 9:39 ` Jens Axboe
2011-01-22 1:17 ` [PATCH 05/10] block: remove per-queue plugging Jens Axboe
2011-01-22 1:31 ` Nick Piggin
2011-03-03 21:23 ` Mike Snitzer
2011-03-03 21:27 ` Mike Snitzer
2011-03-03 22:13 ` Mike Snitzer
2011-03-04 13:02 ` Shaohua Li
2011-03-04 13:20 ` Jens Axboe
2011-03-04 21:43 ` Mike Snitzer
2011-03-04 21:50 ` Jens Axboe
2011-03-04 22:27 ` Mike Snitzer
2011-03-05 20:54 ` Jens Axboe
2011-03-07 10:23 ` Peter Zijlstra
2011-03-07 19:43 ` Jens Axboe
2011-03-07 20:41 ` Peter Zijlstra
2011-03-07 20:46 ` Jens Axboe
2011-03-08 9:38 ` Peter Zijlstra
2011-03-08 9:41 ` Jens Axboe
2011-03-07 0:54 ` Shaohua Li
2011-03-07 8:07 ` Jens Axboe
2011-03-08 12:16 ` Jens Axboe
2011-03-08 20:21 ` Mike Snitzer
2011-03-08 20:27 ` Jens Axboe
2011-03-08 21:36 ` Jeff Moyer
2011-03-09 7:25 ` Jens Axboe
2011-03-08 22:05 ` Mike Snitzer
2011-03-10 0:58 ` Mike Snitzer
2011-04-05 3:05 ` NeilBrown
2011-04-11 4:50 ` NeilBrown
2011-04-11 9:19 ` Jens Axboe
2011-04-11 10:59 ` NeilBrown
2011-04-11 11:04 ` Jens Axboe
2011-04-11 11:26 ` NeilBrown
2011-04-11 11:37 ` Jens Axboe
2011-04-11 12:05 ` NeilBrown
2011-04-11 12:11 ` Jens Axboe
2011-04-11 12:36 ` NeilBrown
2011-04-11 12:48 ` Jens Axboe
2011-04-12 1:12 ` hch
2011-04-12 8:36 ` Jens Axboe
2011-04-12 12:22 ` Dave Chinner
2011-04-12 12:28 ` Jens Axboe
2011-04-12 12:41 ` Dave Chinner
2011-04-12 12:58 ` Jens Axboe
2011-04-12 13:31 ` Dave Chinner
2011-04-12 13:45 ` Jens Axboe [this message]
2011-04-12 14:34 ` Dave Chinner
2011-04-12 21:08 ` NeilBrown
2011-04-13 2:23 ` Linus Torvalds
2011-04-13 11:12 ` Peter Zijlstra
2011-04-13 11:23 ` Jens Axboe
2011-04-13 11:41 ` Peter Zijlstra
2011-04-13 15:13 ` Linus Torvalds
2011-04-13 17:35 ` Jens Axboe
2011-04-12 16:58 ` hch
2011-04-12 17:29 ` Jens Axboe
2011-04-12 16:44 ` hch
2011-04-12 16:49 ` Jens Axboe
2011-04-12 16:54 ` hch
2011-04-12 17:24 ` Jens Axboe
2011-04-12 13:40 ` Dave Chinner
2011-04-12 13:48 ` Jens Axboe
2011-04-12 23:35 ` Dave Chinner
2011-04-12 16:50 ` hch
2011-04-15 4:26 ` hch
2011-04-15 6:34 ` Jens Axboe
2011-04-17 22:19 ` NeilBrown
2011-04-18 4:19 ` NeilBrown
2011-04-18 6:38 ` Jens Axboe
2011-04-18 7:25 ` NeilBrown
2011-04-18 8:10 ` Jens Axboe
2011-04-18 8:33 ` NeilBrown
2011-04-18 8:42 ` Jens Axboe
2011-04-18 21:23 ` hch
2011-04-22 15:39 ` hch
2011-04-22 16:01 ` Vivek Goyal
2011-04-22 16:10 ` Vivek Goyal
2011-04-18 21:30 ` hch
2011-04-18 22:38 ` NeilBrown
2011-04-20 10:55 ` hch
2011-04-18 9:19 ` hch
2011-04-18 9:40 ` [dm-devel] " Hannes Reinecke
2011-04-18 9:47 ` Jens Axboe
2011-04-18 9:46 ` Jens Axboe
2011-04-11 11:55 ` NeilBrown
2011-04-11 12:12 ` Jens Axboe
2011-04-11 22:58 ` hch
2011-04-12 6:20 ` Jens Axboe
2011-04-11 16:59 ` hch
2011-04-11 21:14 ` NeilBrown
2011-04-11 22:59 ` hch
2011-04-12 6:18 ` Jens Axboe
2011-03-17 15:51 ` Mike Snitzer
2011-03-17 18:31 ` Jens Axboe
2011-03-17 18:46 ` Mike Snitzer
2011-03-18 9:15 ` hch
2011-03-08 12:15 ` Jens Axboe
2011-03-04 4:00 ` Vivek Goyal
2011-03-08 12:24 ` Jens Axboe
2011-03-08 22:10 ` blk-throttle: Use blk_plug in throttle code (Was: Re: [PATCH 05/10] block: remove per-queue plugging) Vivek Goyal
2011-03-09 7:26 ` Jens Axboe
2011-01-22 1:17 ` [PATCH 06/10] block: kill request allocation batching Jens Axboe
2011-01-22 9:31 ` Christoph Hellwig
2011-01-24 19:09 ` Jens Axboe
2011-01-22 1:17 ` [PATCH 07/10] fs: make generic file read/write functions plug Jens Axboe
2011-01-24 3:57 ` Dave Chinner
2011-01-24 19:11 ` Jens Axboe
2011-03-04 4:09 ` Vivek Goyal
2011-03-04 13:22 ` Jens Axboe
2011-03-04 13:25 ` hch
2011-03-04 13:40 ` Jens Axboe
2011-03-04 14:08 ` hch
2011-03-04 22:07 ` Jens Axboe
2011-03-04 23:12 ` hch
2011-03-08 12:38 ` Jens Axboe
2011-03-09 10:38 ` hch
2011-03-09 10:52 ` Jens Axboe
2011-01-22 1:17 ` [PATCH 08/10] read-ahead: use plugging Jens Axboe
2011-01-22 1:17 ` [PATCH 09/10] fs: make mpage read/write_pages() plug Jens Axboe
2011-01-22 1:17 ` [PATCH 10/10] fs: make aio plug Jens Axboe
2011-01-24 17:59 ` Jeff Moyer
2011-01-24 19:09 ` Jens Axboe
2011-01-24 19:15 ` Jeff Moyer
2011-01-24 19:22 ` Jens Axboe
2011-01-24 19:29 ` Jeff Moyer
2011-01-24 19:31 ` Jens Axboe
2011-01-24 19:38 ` Jeff Moyer