Re: Excessive stall times on ext4 in 3.9-rc2

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Jeff Moyer <jmoyer@redhat.com>
To: Mel Gorman <mgorman@suse.de>
Cc: Theodore Ts'o <tytso@mit.edu>, Dave Chinner <david@fromorbit.com>,
	Jan Kara <jack@suse.cz>,
	linux-ext4@vger.kernel.org, LKML <linux-kernel@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>, Jiri Slaby <jslaby@suse.cz>
Subject: Re: Excessive stall times on ext4 in 3.9-rc2
Date: Mon, 22 Apr 2013 18:42:23 -0400	[thread overview]
Message-ID: <x49a9oqmblc.fsf@segfault.boston.devel.redhat.com> (raw)
In-Reply-To: <20130422143846.GA2675@suse.de> (Mel Gorman's message of "Mon, 22 Apr 2013 15:38:46 +0100")

Mel Gorman <mgorman@suse.de> writes:

> (Adding Jeff Moyer to the cc as I'm told he is interested in the blktrace)

Thanks.  I've got a few comments and corrections for you below.

> TLDR: Flusher writes pages very quickly after processes dirty a buffer. Reads
> starve flusher writes.
[snip]

> 3. The blktrace indicates that reads can starve writes from flusher
>
>    While there are people that can look at a blktrace and find problems
>    like they are rain man, I'm more like an ADHD squirrel when looking at
>    a trace.  I wrote a script to look for what unrelated requests completed
>    while an request got stalled for over a second. It seemed like something
>    that a tool shoudl already exist for but I didn't find one unless btt
>    can give the information somehow.

Care to share that script?

[snip]

> I recognise that the output will have a WTF reaction but the key
> observations to me are
>
> a) a single write request from flusher took over a second to complete
> b) at the time it was queued, it was mostly other writes that were in
>    the queue at the same time
> c) The write request and the parallel writes were all asynchronous write
>    requests
> D) at the time the request completed, there were a LARGE number of
>    other requested queued and completed at the same time.
>
> Of the requests queued and completed in the meantime the breakdown was
>
>      22 RM
>      31 RA
>      82 W
>     445 R
>
> If I'm reading this correctly, it is saying that 22 reads were merged (RM),
> 31 reads were remapped to another device (RA) which is probably reads from
> the dm-crypt partition, 82 were writes (W) which is not far off the number
> of writes that were in the queue and 445 were other reads. The delay was
> dominated by reads that were queued after the write request and completed
> before it.

RM == Read Meta
RA == Read Ahead  (remapping, by the way, does not happen across
                   devices, just into partitions)
W and R you understood correctly.

> That's saying that the 27128th request in the trace took over 7 seconds
> to complete and was an asynchronous write from flusher. The contents of
> the queue are displayed at that time and the breakdown of requests is
>
>      23 WS  [JEM: write sync]
>      86 RM  [JEM: Read Meta]
>     124 RA  [JEM: Read Ahead]
>     442 W
>    1931 R
>
> 7 seconds later when it was completed the breakdown of completed
> requests was
>
>      25 WS
>     114 RM
>     155 RA
>     408 W
>    2457 R
>
> In combination, that confirms for me that asynchronous writes from flush
> are being starved by reads. When a process requires a buffer that is locked
> by that asynchronous write from flusher, it stalls.
>
>> The thing is, we do want to make ext4 work well with cfq, and
>> prioritizing non-readahead read requests ahead of data writeback does
>> make sense.  The issue is with is that metadata writes going through
>> the block device could in some cases effectively cause a priority
>> inversion when what had previously been an asynchronous writeback
>> starts blocking a foreground, user-visible process.
>> 
>> At least, that's the theory;
>
> I *think* the data more or less confirms the theory but it'd be nice if
> someone else double checked in case I'm seeing what I want to see
> instead of what is actually there.

Looks sane.  You can also see a lot of "preempt"s in the blkparse
output, which indicates exactly what you're saying.  Any sync request
gets priority over the async requests.

I'll also note that even though your I/O is going all over the place
(D2C is pretty bad, 14ms), most of the time is spent waiting for a
struct request allocation or between Queue and Merge:

==================== All Devices ====================

            ALL           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------

Q2Q               0.000000001   0.000992259   8.898375882     2300861
Q2G               0.000000843  10.193261239 2064.079501935     1016463 <====
G2I               0.000000461   0.000044702   3.237065090     1015803
Q2M               0.000000101   8.203147238 2064.079367557     1311662
I2D               0.000002012   1.476824812 2064.089774419     1014890
M2D               0.000003283   6.994306138 283.573348664     1284872
D2C               0.000061889   0.014438316   0.857811758     2291996
Q2C               0.000072284  13.363007244 2064.092228625     2292191

==================== Device Overhead ====================

       DEV |       Q2G       G2I       Q2M       I2D       D2C
---------- | --------- --------- --------- --------- ---------
 (  8,  0) |  33.8259%   0.0001%  35.1275%   4.8932%   0.1080%
---------- | --------- --------- --------- --------- ---------
   Overall |  33.8259%   0.0001%  35.1275%   4.8932%   0.1080%

I'm not sure I believe that max value.  2064 seconds seems a bit high.
Also, Q2M should not be anywhere near that big, so more investigation is
required there.  A quick look over the data doesn't show any such delays
(making me question the tools), but I'll write some code tomorrow to
verify the btt output.

Jan, if I were to come up with a way of promoting a particular async
queue to the front of the line, where would I put such a call in the
ext4/jbd2 code to be effective?

Mel, can you reproduce this at will?  Do you have a reproducer that I
could run so I'm not constantly bugging you?

Cheers,
Jeff

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2013-04-22 22:42 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-04-02 14:27 Excessive stall times on ext4 in 3.9-rc2 Mel Gorman
2013-04-02 15:00 ` Jiri Slaby
2013-04-02 15:03 ` Zheng Liu
2013-04-02 15:15   ` Mel Gorman
2013-04-02 15:06 ` Theodore Ts'o
2013-04-02 15:14   ` Theodore Ts'o
2013-04-02 18:19     ` Theodore Ts'o
2013-04-07 21:59       ` Frank Ch. Eigler
2013-04-08  8:36         ` Mel Gorman
2013-04-08 10:52           ` Frank Ch. Eigler
2013-04-08 11:01         ` Theodore Ts'o
2013-04-03 10:19     ` Mel Gorman
2013-04-03 12:05       ` Theodore Ts'o
2013-04-03 15:15         ` Mel Gorman
2013-04-05 22:18       ` Jiri Slaby
2013-04-05 23:16         ` Theodore Ts'o
2013-04-06  7:29           ` Jiri Slaby
2013-04-06  7:37             ` Jiri Slaby
2013-04-06  8:19               ` Jiri Slaby
2013-04-06 13:15             ` Theodore Ts'o
2013-04-10 10:56   ` Mel Gorman
2013-04-10 13:12     ` Theodore Ts'o
2013-04-11 17:04       ` Mel Gorman
2013-04-11 18:35         ` Theodore Ts'o
2013-04-11 21:33           ` Jan Kara
2013-04-12  2:57             ` Theodore Ts'o
2013-04-12  4:50               ` Dave Chinner
2013-04-12 15:19                 ` Theodore Ts'o
2013-04-13  1:23                   ` Dave Chinner
2013-04-22 14:38                   ` Mel Gorman
2013-04-22 22:42                     ` Jeff Moyer [this message]
2013-04-23  0:02                       ` Theodore Ts'o
2013-04-23  9:31                       ` Jan Kara
2013-04-23 14:01                       ` Mel Gorman
2013-04-24 19:09                         ` Jeff Moyer
2013-04-25 12:21                           ` Mel Gorman
2013-04-12  9:47               ` Mel Gorman
2013-04-21  0:05                 ` Theodore Ts'o
2013-04-21  0:07                   ` [PATCH 1/3] ext4: mark all metadata I/O with REQ_META Theodore Ts'o
2013-04-21  0:07                     ` [PATCH 2/3] buffer: add BH_Prio and BH_Meta flags Theodore Ts'o
2013-04-21  0:07                     ` [PATCH 3/3] ext4: mark metadata blocks using bh flags Theodore Ts'o
2013-04-21  6:09                       ` Jiri Slaby
2013-04-21 19:55                         ` Theodore Ts'o
2013-04-21 20:48                           ` [PATCH 3/3 -v2] " Theodore Ts'o
2013-04-22 12:06                     ` [PATCH 1/3] ext4: mark all metadata I/O with REQ_META Zheng Liu
2013-04-23 15:33                   ` Excessive stall times on ext4 in 3.9-rc2 Mel Gorman
2013-04-23 15:50                     ` Theodore Ts'o
2013-04-23 16:13                       ` Mel Gorman
2013-04-12 10:18               ` Tvrtko Ursulin
2013-04-12  9:45           ` Mel Gorman
2013-04-02 23:16 ` Theodore Ts'o
2013-04-03 15:22   ` Mel Gorman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=x49a9oqmblc.fsf@segfault.boston.devel.redhat.com \
    --to=jmoyer@redhat.com \
    --cc=david@fromorbit.com \
    --cc=jack@suse.cz \
    --cc=jslaby@suse.cz \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).