From: Dave Chinner <david@fromorbit.com>
To: Jens Axboe <axboe@fb.com>
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-block@vger.kernel.org
Subject: Re: [PATCHSET v3][RFC] Make background writeback not suck
Date: Fri, 1 Apr 2016 17:16:10 +1100
Message-ID: <20160401061610.GX11812@dastard>
In-Reply-To: <56FDED6D.4070200@fb.com>
On Thu, Mar 31, 2016 at 09:39:25PM -0600, Jens Axboe wrote:
> On 03/31/2016 09:29 PM, Jens Axboe wrote:
> >>>I can't seem to reproduce this at all. On an nvme device, I get a
> >>>fairly steady 60K/sec file creation rate, and we're nowhere near
> >>>being IO bound. So the throttling has no effect at all.
> >>
> >>That's too slow to show the stalls - you're likely concurrency-bound
> >>in allocation by the default AG count (4) from mkfs. Use mkfs.xfs -d
> >>agcount=32 so that every thread works in its own AG.
> >
> >That's the key, with that I get 300-400K ops/sec instead. I'll run some
> >testing with this tomorrow and see what I can find, it did one full run
> >now and I didn't see any issues, but I need to run it at various
> >settings and see if I can find the issue.
>
> No stalls seen, I get the same performance with it disabled and with
> it enabled, at both default settings, and lower ones
> (wb_percent=20). Looking at iostat, we don't drive a lot of depth,
> so it makes sense, even with the throttling we're doing essentially
> the same amount of IO.
Try appending numa=fake=4 to your guest's kernel command line.
(that's what I'm using)
>
> What does 'nr_requests' say for your virtio_blk device? Looks like
> virtio_blk has a queue_depth setting, but it's not set by default,
> and then it uses the free entries in the ring. But I don't know what
> that is...
$ cat /sys/block/vdc/queue/nr_requests
128
$
Without the block throttling, guest IO (measured within the guest)
looks like this over a fair proportion of the test (5s sample time)
# iostat -d -x -m 5 /dev/vdc
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vdc 0.00 20443.00 6.20 436.60 0.05 269.89 1248.48 73.83 146.11 486.58 141.27 1.64 72.40
vdc 0.00 11567.60 19.20 161.40 0.05 146.08 1657.12 119.17 704.57 707.25 704.25 5.34 96.48
vdc 0.00 12723.20 3.20 437.40 0.05 193.65 900.38 29.46 57.12 1.75 57.52 0.78 34.56
vdc 0.00 1739.80 22.40 426.80 0.05 123.62 563.86 23.44 62.51 79.89 61.59 1.01 45.28
vdc 0.00 12553.80 0.00 521.20 0.00 210.86 828.54 34.38 65.96 0.00 65.96 0.97 50.80
vdc 0.00 12523.60 25.60 529.60 0.10 201.94 745.29 52.24 77.73 0.41 81.47 1.14 63.20
vdc 0.00 5419.80 22.40 502.60 0.05 158.34 617.90 24.42 63.81 30.96 65.27 1.31 68.80
vdc 0.00 12059.00 0.00 439.60 0.00 174.85 814.59 30.91 70.27 0.00 70.27 0.72 31.76
vdc 0.00 7578.00 25.60 397.00 0.10 139.18 675.00 15.72 37.26 61.19 35.72 0.73 30.72
vdc 0.00 9156.00 0.00 537.40 0.00 173.57 661.45 17.08 29.62 0.00 29.62 0.53 28.72
vdc 0.00 5274.80 22.40 377.60 0.05 136.42 698.77 26.17 68.33 186.96 61.30 1.53 61.36
vdc 0.00 9407.00 3.20 541.00 0.05 174.28 656.05 36.10 66.33 3.00 66.71 0.87 47.60
vdc 0.00 8687.20 22.40 410.40 0.05 150.98 714.70 39.91 92.21 93.82 92.12 1.39 60.32
vdc 0.00 8872.80 0.00 422.60 0.00 139.28 674.96 25.01 33.03 0.00 33.03 0.91 38.40
vdc 0.00 1081.60 22.40 241.00 0.05 68.88 535.97 10.78 82.89 137.86 77.79 2.25 59.20
vdc 0.00 9826.80 0.00 445.00 0.00 167.42 770.49 45.16 101.49 0.00 101.49 1.80 79.92
vdc 0.00 7394.00 22.40 447.60 0.05 157.34 685.83 18.06 38.42 77.64 36.46 1.46 68.48
vdc 0.00 9984.80 3.20 252.00 0.05 108.46 870.82 85.68 293.73 16.75 297.24 3.00 76.64
vdc 0.00 0.00 22.40 454.20 0.05 117.67 505.86 8.11 39.51 35.71 39.70 1.17 55.76
vdc 0.00 10273.20 0.00 418.80 0.00 156.76 766.57 90.52 179.40 0.00 179.40 1.85 77.52
vdc 0.00 5650.00 22.40 185.00 0.05 84.12 831.20 103.90 575.15 60.82 637.42 4.21 87.36
vdc 0.00 7193.00 0.00 308.80 0.00 120.71 800.56 63.77 194.35 0.00 194.35 2.24 69.12
vdc 0.00 4460.80 9.80 211.00 0.03 69.52 645.07 72.35 154.81 269.39 149.49 4.42 97.60
vdc 0.00 683.00 14.00 374.60 0.05 99.13 522.69 25.38 167.61 603.14 151.33 1.45 56.24
vdc 0.00 7140.20 1.80 275.20 0.03 104.53 773.06 85.25 202.67 32.44 203.79 2.80 77.68
vdc 0.00 6916.00 0.00 164.00 0.00 82.59 1031.33 126.20 813.60 0.00 813.60 6.10 100.00
vdc 0.00 2255.60 22.40 359.00 0.05 107.41 577.06 42.97 170.03 92.79 174.85 2.17 82.64
vdc 0.00 7580.40 3.20 370.40 0.05 128.32 703.70 60.19 134.11 15.00 135.14 1.64 61.36
vdc 0.00 6438.40 18.80 159.20 0.04 78.04 898.36 126.80 706.27 639.15 714.19 5.62 100.00
vdc 0.00 5420.00 3.60 315.40 0.01 108.87 699.07 20.80 78.54 580.00 72.81 1.03 32.72
vdc 0.00 9444.00 2.60 242.40 0.00 118.72 992.38 126.21 488.66 146.15 492.33 4.08 100.00
vdc 0.00 0.00 19.80 434.60 0.05 110.14 496.65 12.74 57.56 313.78 45.89 1.10 49.84
vdc 0.00 14108.20 3.20 549.60 0.05 207.84 770.17 42.32 69.66 72.75 69.64 1.40 77.20
vdc 0.00 1306.40 35.20 268.20 0.08 78.74 532.08 30.84 114.22 175.07 106.24 2.02 61.20
vdc 0.00 14999.40 0.00 458.60 0.00 192.03 857.57 61.48 134.02 0.00 134.02 1.67 76.80
vdc 0.00 1.40 22.40 331.80 0.05 82.11 475.11 1.74 4.87 22.68 3.66 0.76 26.96
vdc 0.00 13971.80 0.00 670.20 0.00 248.26 758.63 34.45 51.37 0.00 51.37 1.04 69.52
vdc 0.00 7033.00 22.60 205.80 0.06 87.81 787.86 40.95 128.53 244.64 115.78 2.90 66.24
vdc 0.00 1282.00 3.20 456.00 0.05 123.21 549.74 14.56 46.99 21.00 47.17 1.42 65.20
vdc 0.00 9475.80 22.40 248.60 0.05 107.66 814.02 123.94 412.61 376.64 415.86 3.69 100.00
vdc 0.00 3603.60 0.00 418.80 0.00 133.32 651.94 71.28 210.08 0.00 210.08 1.77 74.00
You can see that there are periods where it drives the request queue
depth to congestion, but most of the time the device is only 60-70%
utilised and the queue depths are only 30-40 deep. There's quite a
lot of idle time in the request queue.
Note that there are a couple of points where merging stops
completely - that's when memory reclaim is directly flushing dirty
inodes, because if all we have is cached inodes then we have to
throttle memory allocation back to the rate at which we can clean
dirty inodes.
Throughput does drop momentarily when this happens, but because the
device has idle time and spare request queue space, these
less-than-optimal IO dispatch spikes don't really hurt overall: the
device has the capacity available to soak them up without dropping
performance.
An equivalent trace from the middle of a run with block throttling
enabled:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vdc 0.00 5143.40 22.40 188.00 0.05 81.04 789.38 19.17 89.09 237.86 71.37 4.75 99.92
vdc 0.00 7182.60 0.00 272.60 0.00 116.74 877.06 15.50 57.91 0.00 57.91 3.66 99.84
vdc 0.00 3732.80 11.60 102.20 0.01 53.17 957.05 19.35 151.47 514.00 110.32 8.79 100.00
vdc 0.00 0.00 10.80 1007.20 0.04 104.33 209.98 10.22 12.49 457.85 7.71 0.92 93.44
vdc 0.00 0.00 0.40 822.80 0.01 47.58 118.40 9.39 11.21 10.00 11.21 1.22 100.24
vdc 0.00 0.00 1.00 227.00 0.02 3.55 32.00 0.22 1.54 224.00 0.56 0.48 11.04
vdc 0.00 11100.40 3.20 437.40 0.05 125.91 585.49 7.95 17.77 47.25 17.55 1.56 68.56
vdc 0.00 14134.20 22.40 453.20 0.05 191.26 823.85 15.99 32.91 73.07 30.92 2.09 99.28
vdc 0.00 6667.00 0.00 265.20 0.00 105.11 811.70 15.71 58.98 0.00 58.98 3.77 100.00
vdc 0.00 6243.40 22.40 259.00 0.05 101.23 737.11 17.53 62.97 115.21 58.45 3.55 99.92
vdc 0.00 5590.80 0.00 278.20 0.00 105.30 775.18 18.09 65.55 0.00 65.55 3.59 100.00
vdc 0.00 0.00 14.20 714.80 0.02 97.81 274.86 11.61 12.86 260.85 7.93 1.23 89.44
vdc 0.00 0.00 9.80 1555.00 0.05 126.19 165.22 5.41 4.91 267.02 3.26 0.53 82.96
vdc 0.00 0.00 3.00 816.80 0.05 22.32 55.89 6.07 7.39 256.00 6.48 1.05 85.84
vdc 0.00 11172.80 0.20 463.00 0.00 125.77 556.10 6.13 13.23 260.00 13.13 0.93 43.28
vdc 0.00 9563.00 22.40 324.60 0.05 119.66 706.55 15.50 38.45 10.39 40.39 2.88 99.84
vdc 0.00 5333.60 0.00 218.00 0.00 83.71 786.46 15.57 80.46 0.00 80.46 4.59 100.00
vdc 0.00 5128.00 24.80 216.60 0.06 85.31 724.28 19.18 79.84 193.13 66.87 4.12 99.52
vdc 0.00 2746.40 0.00 257.40 0.00 81.13 645.49 11.16 43.70 0.00 43.70 3.87 99.68
vdc 0.00 0.00 0.00 418.80 0.00 104.68 511.92 5.33 12.74 0.00 12.74 1.93 80.96
vdc 0.00 8102.00 0.20 291.60 0.00 108.79 763.59 3.09 10.60 20.00 10.59 0.87 25.44
The first thing to note is that the device utilisation is almost
always above 80%, and often at 100%, meaning that with throttling the
device always has IO in flight. It has no real idle time to soak up
peaks of IO activity - throttling means the device is running at
close to 100% utilisation all the time under workloads like this, and
there's no elasticity in the pipeline to handle changes in IO
dispatch behaviour.
So when memory reclaim does direct inode writeback, we see merging
stop, but the request queue is not able to soak up all the IO being
dispatched, even though there is very little read IO demand. Hence
changes in the dispatch patterns that would otherwise drive deeper
queues and maintain performance now get throttled, resulting in
things like memory reclaim backing up a lot and everything on the
machine suffering.
I'll try the "don't throttle REQ_META" patch, but it seems like a
fragile way to solve this problem - it shuts up the messenger, but
doesn't solve the problem for any other subsystem that might have a
similar issue. E.g. next we're going to have to make sure direct IO
(which is also REQ_WRITE dispatch) does not get throttled, and so
on....
It seems to me that the right thing to do here is to add a separate
classification flag for IO that can be throttled, e.g. REQ_WRITEBACK,
and have only background writeback work set it. That would ensure
that when the IO is being dispatched from other sources (e.g. fsync,
sync_file_range(), direct IO, filesystem metadata, etc.) it is clear
that it is not a target for throttling. This would also allow us to
easily switch off throttling when writeback is occurring for memory
reclaim reasons, and so on.
Throttling policy decisions belong above the block layer, even
though the throttle mechanism itself is in the block layer.
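
To make that split concrete, here's a rough, userspace-only sketch of
the sort of classification I'm talking about. None of it is real
kernel code - REQ_WRITEBACK doesn't exist, and the flag values and
helpers are made up purely for illustration:

#include <stdbool.h>
#include <stdio.h>

/* Illustrative flags only - not the kernel's REQ_* definitions. */
#define REQ_WRITE	(1u << 0)	/* any write */
#define REQ_META	(1u << 1)	/* filesystem metadata IO */
#define REQ_WRITEBACK	(1u << 2)	/* proposed: background writeback only */

/* Policy: the writeback code classifies the IO when it dispatches it. */
static unsigned int wb_io_flags(bool for_background, bool for_reclaim)
{
	unsigned int flags = REQ_WRITE;

	/*
	 * Only pure background writeback is a throttling target;
	 * writeback run for memory reclaim must not be held back.
	 */
	if (for_background && !for_reclaim)
		flags |= REQ_WRITEBACK;
	return flags;
}

/* Mechanism: the block layer throttles only explicitly tagged IO. */
static bool should_throttle(unsigned int flags)
{
	return flags & REQ_WRITEBACK;
}

int main(void)
{
	printf("background writeback throttled: %d\n",
	       should_throttle(wb_io_flags(true, false)));
	printf("reclaim writeback throttled:    %d\n",
	       should_throttle(wb_io_flags(true, true)));
	printf("direct IO write throttled:      %d\n",
	       should_throttle(REQ_WRITE));
	printf("metadata write throttled:       %d\n",
	       should_throttle(REQ_WRITE | REQ_META));
	return 0;
}

The point being that the caller makes the policy decision when it
tags the IO, and the block layer mechanism only ever acts on IO that
has been explicitly tagged as a throttling candidate.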
FWIW, this is analogous to REQ_READA, which tells the block layer
that a read is not important and can be discarded if there is too
much load. Policy is set at the layer that knows whether the IO can
be discarded safely; the mechanism is implemented at a lower layer
that knows about load, scheduling and other things the higher layers
know nothing about.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com