From: Jens Axboe <axboe@kernel.dk>
To: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>, Shaohua Li <shli@kernel.org>,
linux-fsdevel@vger.kernel.org,
LKML <linux-kernel@vger.kernel.org>
Subject: Re: Read starvation by sync writes
Date: Thu, 13 Dec 2012 19:03:48 +0100 [thread overview]
Message-ID: <50CA1884.6050408@kernel.dk> (raw)
In-Reply-To: <20121213150255.GA21592@quack.suse.cz>
On 2012-12-13 16:02, Jan Kara wrote:
> On Thu 13-12-12 14:30:42, Jens Axboe wrote:
>> On 2012-12-12 20:41, Jeff Moyer wrote:
>>> Jeff Moyer <jmoyer@redhat.com> writes:
>>>
>>>>> I agree. This isn't about scheduling, we haven't even reached that part
>>>>> yet. Back when we split the queues into read vs write, this problem
>>>>> obviously wasn't there. Now we have sync writes and reads, both eating
>>>>> from the same pool. The io scheduler can impact this a bit by forcing
>>>>> reads to must allocate (Jan, which io scheduler are you using?). CFQ
>>>>> does this when it's expecting a request from this process queue.
>>>>>
>>>>> Back in the day, we used to have one list. To avoid a similar problem,
>>>>> we reserved the top of the list for reads. With the batching, it's a bit
>>>>> more complicated. If we make the request allocation (just that, not the
>>>>> scheduling) be read vs write instead of sync vs async, then we have the
>>>>> same issue for sync vs buffered writes.
>>>>>
>>>>> How about something like the below? Due to the nature of sync reads, we
>>>>> should allow a much longer timeout. The batch is really tailored towards
>>>>> writes at the moment. Also shrink the batch count, 32 is pretty large...
>>>>
>>>> Does batching even make sense for dependent reads? I don't think it
>>>> does.
>>>
>>> Having just read the batching code in detail, I'd like to ammend this
>>> misguided comment. Batching logic kicks in when you happen to be lucky
>>> enough to use up the last request. As such, I'd be surprised if the
>>> patch you posted helped. Jens, don't you think the writer is way more
>>> likely to become the batcher? I do agree with shrinking the batch count
>>> to 16, whether or not the rest of the patch goes in.
>>>
>>>> Assuming you disagree, then you'll have to justify that fixed
>>>> time value of 2 seconds. The amount of time between dependent reads
>>>> will vary depending on other I/O sent to the device, the properties of
>>>> the device, the I/O scheduler, and so on. If you do stick 2 seconds in
>>>> there, please comment it. Maybe it's time we started keeping track of
>>>> worst case Q->C time? That could be used to tell worst case latency,
>>>> and adjust magic timeouts like this one.
>>>>
>>>> I'm still thinking about how we might solve this in a cleaner way.
>>>
>>> The way things stand today, you can do a complete end run around the I/O
>>> scheduler by queueing up enough I/O. To address that, I think we need
>>> to move to a request list per io_context as Jan had suggested. That
>>> way, we can keep the logic about who gets to submit I/O when in one
>>> place.
>>>
>>> Jens, what do you think?
>>
>> I think that is pretty extreme. We have way too much accounting around
>> this already, and I'd rather just limit the batching than make
>> per-ioc request lists too.
>>
>> I agree the batch addition isn't super useful for the reads. It really
>> is mostly a writer thing, and the timing reflects that.
>>
>> The problem is really that the WRITE_SYNC is (for Jan's case) behaving
>> like buffered writes, so it eats up a queue of requests very easily. On
>> the allocation side, the assumption is that WRITE_SYNC behaves like
>> dependent reads. Similar to a dd with oflag=direct, not like a flood of
>> requests. For dependent sync writes, our current behaviour is fine, we
>> treat them like reads. For commits of WRITE_SYNC, they should be treated
>> like async WRITE instead.
> Yeah. But it's similar to what happens when you run fsync() on a large
> dirty file. That will also submit a lot of WRITE_SYNC requests... kjournald
> could probably use WRITE instead of WRITE_SYNC for large commits. It's just
> that we don't really want to give e.g. DIO a preference over kjournald
> because transaction commit can effectively block any metadata changes on
> the filesystem.
Sure, I'm not advocating against changing WRITE_SYNC, we just need to be
able to handle it a bit better. I've got a test patch, will post it
later.
--
Jens Axboe
next prev parent reply other threads:[~2012-12-13 18:04 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-12-10 22:12 Read starvation by sync writes Jan Kara
2012-12-11 21:44 ` Jeff Moyer
2012-12-12 2:31 ` Jan Kara
2012-12-12 4:18 ` Dave Chinner
2012-12-12 10:26 ` Jan Kara
2012-12-12 23:33 ` Dave Chinner
2012-12-12 0:13 ` Jan Engelhardt
2012-12-12 2:55 ` Shaohua Li
2012-12-12 10:11 ` Jan Kara
2012-12-12 15:19 ` Jens Axboe
2012-12-12 16:38 ` Jeff Moyer
2012-12-12 19:41 ` Jeff Moyer
2012-12-13 12:30 ` Jan Kara
2012-12-13 13:30 ` Jens Axboe
2012-12-13 14:55 ` Jeff Moyer
2012-12-13 15:02 ` Jan Kara
2012-12-13 18:03 ` Jens Axboe [this message]
2013-01-23 17:35 ` Jeff Moyer
2012-12-13 1:43 ` Shaohua Li
2012-12-13 10:32 ` Jan Kara
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=50CA1884.6050408@kernel.dk \
--to=axboe@kernel.dk \
--cc=jack@suse.cz \
--cc=jmoyer@redhat.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=shli@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).