From: Jan Kara <jack@suse.cz>
To: Jens Axboe <axboe@kernel.dk>
Cc: Jeff Moyer <jmoyer@redhat.com>, Jan Kara <jack@suse.cz>,
Shaohua Li <shli@kernel.org>,
linux-fsdevel@vger.kernel.org,
LKML <linux-kernel@vger.kernel.org>
Subject: Re: Read starvation by sync writes
Date: Thu, 13 Dec 2012 16:02:55 +0100 [thread overview]
Message-ID: <20121213150255.GA21592@quack.suse.cz> (raw)
In-Reply-To: <50C9D882.7000808@kernel.dk>
On Thu 13-12-12 14:30:42, Jens Axboe wrote:
> On 2012-12-12 20:41, Jeff Moyer wrote:
> > Jeff Moyer <jmoyer@redhat.com> writes:
> >
> >>> I agree. This isn't about scheduling, we haven't even reached that part
> >>> yet. Back when we split the queues into read vs write, this problem
> >>> obviously wasn't there. Now we have sync writes and reads, both eating
> >>> from the same pool. The io scheduler can impact this a bit by forcing
> >>> reads to must allocate (Jan, which io scheduler are you using?). CFQ
> >>> does this when it's expecting a request from this process queue.
> >>>
> >>> Back in the day, we used to have one list. To avoid a similar problem,
> >>> we reserved the top of the list for reads. With the batching, it's a bit
> >>> more complicated. If we make the request allocation (just that, not the
> >>> scheduling) be read vs write instead of sync vs async, then we have the
> >>> same issue for sync vs buffered writes.
> >>>
> >>> How about something like the below? Due to the nature of sync reads, we
> >>> should allow a much longer timeout. The batch is really tailored towards
> >>> writes at the moment. Also shrink the batch count, 32 is pretty large...
> >>
> >> Does batching even make sense for dependent reads? I don't think it
> >> does.
> >
> > Having just read the batching code in detail, I'd like to ammend this
> > misguided comment. Batching logic kicks in when you happen to be lucky
> > enough to use up the last request. As such, I'd be surprised if the
> > patch you posted helped. Jens, don't you think the writer is way more
> > likely to become the batcher? I do agree with shrinking the batch count
> > to 16, whether or not the rest of the patch goes in.
> >
> >> Assuming you disagree, then you'll have to justify that fixed
> >> time value of 2 seconds. The amount of time between dependent reads
> >> will vary depending on other I/O sent to the device, the properties of
> >> the device, the I/O scheduler, and so on. If you do stick 2 seconds in
> >> there, please comment it. Maybe it's time we started keeping track of
> >> worst case Q->C time? That could be used to tell worst case latency,
> >> and adjust magic timeouts like this one.
> >>
> >> I'm still thinking about how we might solve this in a cleaner way.
> >
> > The way things stand today, you can do a complete end run around the I/O
> > scheduler by queueing up enough I/O. To address that, I think we need
> > to move to a request list per io_context as Jan had suggested. That
> > way, we can keep the logic about who gets to submit I/O when in one
> > place.
> >
> > Jens, what do you think?
>
> I think that is pretty extreme. We have way too much accounting around
> this already, and I'd rather just limit the batching than make
> per-ioc request lists too.
>
> I agree the batch addition isn't super useful for the reads. It really
> is mostly a writer thing, and the timing reflects that.
>
> The problem is really that the WRITE_SYNC is (for Jan's case) behaving
> like buffered writes, so it eats up a queue of requests very easily. On
> the allocation side, the assumption is that WRITE_SYNC behaves like
> dependent reads. Similar to a dd with oflag=direct, not like a flood of
> requests. For dependent sync writes, our current behaviour is fine, we
> treat them like reads. For commits of WRITE_SYNC, they should be treated
> like async WRITE instead.
Yeah. But it's similar to what happens when you run fsync() on a large
dirty file. That will also submit a lot of WRITE_SYNC requests... kjournald
could probably use WRITE instead of WRITE_SYNC for large commits. It's just
that we don't really want to give e.g. DIO a preference over kjournald
because transaction commit can effectively block any metadata changes on
the filesystem.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
next prev parent reply other threads:[~2012-12-13 15:02 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-12-10 22:12 Read starvation by sync writes Jan Kara
2012-12-11 21:44 ` Jeff Moyer
2012-12-12 2:31 ` Jan Kara
2012-12-12 4:18 ` Dave Chinner
2012-12-12 10:26 ` Jan Kara
2012-12-12 23:33 ` Dave Chinner
2012-12-12 0:13 ` Jan Engelhardt
2012-12-12 2:55 ` Shaohua Li
2012-12-12 10:11 ` Jan Kara
2012-12-12 15:19 ` Jens Axboe
2012-12-12 16:38 ` Jeff Moyer
2012-12-12 19:41 ` Jeff Moyer
2012-12-13 12:30 ` Jan Kara
2012-12-13 13:30 ` Jens Axboe
2012-12-13 14:55 ` Jeff Moyer
2012-12-13 15:02 ` Jan Kara [this message]
2012-12-13 18:03 ` Jens Axboe
2013-01-23 17:35 ` Jeff Moyer
2012-12-13 1:43 ` Shaohua Li
2012-12-13 10:32 ` Jan Kara
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20121213150255.GA21592@quack.suse.cz \
--to=jack@suse.cz \
--cc=axboe@kernel.dk \
--cc=jmoyer@redhat.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=shli@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.