From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH] mm: readahead: Increase maximum readahead window
Date: Wed, 4 Oct 2017 10:41:51 -0700
Message-ID: <20171004174151.GA6497@magnolia>
In-Reply-To: <20171004091205.468-1-jack@suse.cz>

On Wed, Oct 04, 2017 at 11:12:05AM +0200, Jan Kara wrote:
> Increase the default maximum allowed readahead window from 128 KB to 512 KB.
> This improves performance for some workloads (see below for details) where
> the ability to scale the readahead window to larger sizes allows for better
> total throughput. The chances of a regression are rather low, given that the
> readahead window size is dynamically computed from observed access patterns
> (and thus it never grows large for workloads with a random read pattern).
>
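A side note for anyone skimming: the reason a bigger cap is low-risk is that
the window only ramps up after sequential access has been observed. Here is a
toy model of that ramp-up; the growth rule below roughly mirrors the kernel's
heuristic of quadrupling a small window and then doubling it, but it is an
illustration, not the actual mm/readahead.c code:

    /* Toy model: how a sequential readahead window grows toward the cap. */
    #include <stdio.h>

    static unsigned long next_window(unsigned long cur, unsigned long max)
    {
            /* quadruple while small, then double, never exceeding the cap */
            unsigned long next = (cur < max / 16) ? 4 * cur : 2 * cur;

            return next < max ? next : max;
    }

    int main(void)
    {
            unsigned long max_kb = 512;     /* proposed VM_MAX_READAHEAD */
            unsigned long win_kb = 16;      /* VM_MIN_READAHEAD */
            int hit;

            for (hit = 1; hit <= 6; hit++) {
                    printf("sequential hit %d: window = %lu KB\n", hit, win_kb);
                    win_kb = next_window(win_kb, max_kb);
            }
            return 0;
    }

A random reader never takes those steps, so it never gets near the larger cap.
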
> Note that the same tuning can be done using udev rules or by manually setting
> the per-device sysfs attribute; however, we believe the new value is a better
> default that most users will want. As a data point, we have carried this patch
> in SUSE kernels for over 8 years.
>
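(For reference, the same tuning is available from userspace today; the device
name and rule file below are just examples:

    # at runtime, per device, in kilobytes
    echo 512 > /sys/block/sda/queue/read_ahead_kb

    # or via blockdev, which counts in 512-byte sectors (1024 == 512 KB)
    blockdev --setra 1024 /dev/sda

    # or persistently, e.g. in /etc/udev/rules.d/99-readahead.rules
    ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/read_ahead_kb}="512"

The point of the patch, of course, is that most users never touch these knobs.)
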
> Some data from the last evaluation of this patch (done on a 4.4-based kernel;
> I can rerun those tests on a newer kernel, but nothing has changed in the
> readahead area since 4.4). The patch was evaluated on two machines:
This is pure speculation, but I think this is worth at least a quick
retry on 4.14 to see what's changed in the past 10 kernel releases. For
one thing, ext3 no longer exists, and XFS' file IO path has changed
quite a lot since then.
> o a UMA machine, 8 cores and rotary storage
> o a NUMA machine, 4 sockets, 48 cores and SSD storage
>
> Five basic tests were conducted:
>
> 1. paralleldd-single
> paralleldd uses multiple instances of dd to read a single file and
> write the contents to /dev/null. Its performance depends on how well
> readahead works for a single file. The access pattern is mostly
> sequential IO.
>
> 2. paralleldd-multi
> Similar to test 1, except each instance of dd accesses a different file,
> so each instance reads its data sequentially but the combined timing
> makes the IO pattern look like random read IO.
>
> 3. pgbench-small
> A standard init of pgbench and execution with a small data set
>
> 4. pgbench-large
> A standard init of pgbench and execution with a large data set
>
> 5. bonnie++ with dataset sizes of 2X RAM, running in asynchronous mode
>
> UMA paralleldd-single on ext3
> 4.4.0 4.4.0
> vanilla readahead-v1r1
> Amean Elapsd-1 5.42 ( 0.00%) 5.40 ( 0.50%)
> Amean Elapsd-3 7.51 ( 0.00%) 5.54 ( 26.25%)
> Amean Elapsd-5 7.15 ( 0.00%) 5.90 ( 17.46%)
> Amean Elapsd-7 5.81 ( 0.00%) 5.61 ( 3.42%)
> Amean Elapsd-8 6.05 ( 0.00%) 5.73 ( 5.36%)
>
> The results speak for themselves: readahead is a major boost when there
> are multiple readers of data. It's not displayed here, but system CPU
> usage is lower overall. The IO stats support the results:
>
> 4.4.0 4.4.0
> vanilla readahead-v1r1
> Mean sda-avgqusz 7.44 8.59
> Mean sda-avgrqsz 279.77 722.52
> Mean sda-await 31.95 48.82
> Mean sda-r_await 3.32 11.58
> Mean sda-w_await 127.51 119.60
> Mean sda-svctm 1.47 3.46
> Mean sda-rrqm 27.82 23.52
> Mean sda-wrqm 4.52 5.00
>
> It shows that the average request size is 2.5 times larger even
> though the merging stats are similar. It's also interesting to
> note that average wait times are higher but more IO is being
> initiated per dd instance.
>
> It's interesting to note that this gain is specific to ext3 and that XFS
> showed a small regression with larger readahead.
>
> UMA paralleldd-single on xfs
> 4.4.0 4.4.0
> vanilla readahead-v1r1
> Min Elapsd-1 6.91 ( 0.00%) 7.10 ( -2.75%)
> Min Elapsd-3 6.77 ( 0.00%) 6.93 ( -2.36%)
> Min Elapsd-5 6.82 ( 0.00%) 7.00 ( -2.64%)
> Min Elapsd-7 6.84 ( 0.00%) 7.05 ( -3.07%)
> Min Elapsd-8 7.02 ( 0.00%) 7.04 ( -0.28%)
> Amean Elapsd-1 7.08 ( 0.00%) 7.20 ( -1.68%)
> Amean Elapsd-3 7.03 ( 0.00%) 7.12 ( -1.40%)
> Amean Elapsd-5 7.22 ( 0.00%) 7.38 ( -2.34%)
> Amean Elapsd-7 7.07 ( 0.00%) 7.19 ( -1.75%)
> Amean Elapsd-8 7.23 ( 0.00%) 7.23 ( -0.10%)
>
> The IO stats are not displayed, but they show a similar ratio to ext3, and
> system CPU usage is also lower. Hence, this slowdown is unexplained, but it
> may be due to differences in XFS's read path and how it locks, even though
> direct IO is not a factor. Tracing was not enabled to see what flags are
> passed into xfs_ilock and whether all the IO is serialised behind one lock,
> but that is one potential explanation.
>
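If anyone wants to chase that, the xfs_ilock tracepoints record the lock
flags, so something like the following (assuming tracefs is mounted in the
usual place) would show whether the dd readers all end up serialising on the
same inode lock:

    cd /sys/kernel/debug/tracing
    echo 1 > events/xfs/xfs_ilock/enable
    echo 1 > events/xfs/xfs_ilock_nowait/enable
    cat trace_pipe

Though as I said above, I'd rather see the numbers redone on a current
kernel first.
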
> UMA paralleldd-multi on ext3
>
> This showed nothing interesting, as the test was too short-lived to draw
> any conclusions. There was some difference between the kernels, but it was
> within the noise. The same applies to XFS.
>
> UMA pgbench-small on ext3
>
> This showed very little that was interesting. The database load time
> was slower but by a very small margin. The actual transaction times
> were highly variable and inconclusive.
>
> NUMA pgbench-small on ext3
>
> Load times are not reported but they completed 1.5% faster.
>
> 4.4.0 4.4.0
> vanilla readahead-v1r1
> Hmean 1 3000.54 ( 0.00%) 2895.28 ( -3.51%)
> Hmean 8 20596.33 ( 0.00%) 19291.92 ( -6.33%)
> Hmean 12 30760.68 ( 0.00%) 30019.58 ( -2.41%)
> Hmean 24 74383.22 ( 0.00%) 73580.80 ( -1.08%)
> Hmean 32 88377.30 ( 0.00%) 88928.70 ( 0.62%)
> Hmean 48 88133.53 ( 0.00%) 96099.16 ( 9.04%)
> Hmean 80 55981.37 ( 0.00%) 76886.10 ( 37.34%)
> Hmean 112 74060.29 ( 0.00%) 87632.95 ( 18.33%)
> Hmean 144 51331.50 ( 0.00%) 66135.77 ( 28.84%)
> Hmean 172 44256.92 ( 0.00%) 63521.73 ( 43.53%)
> Hmean 192 35942.74 ( 0.00%) 71121.35 ( 97.87%)
>
> The impact here is substantial, particularly for higher thread counts.
> It's interesting to note that there is an apparent regression for low
> thread counts. In general, there was a high degree of variability,
> but the gains were all outside of the noise. The IO stats did not show
> any particular pattern in request size as the workload is mostly
> resident in memory. The real curiosity is that readahead should have
> had little or no impact here for the same reason. Observing the
> transactions over time, there was a lot of variability, and the
> performance is likely dominated by whether the data happened to be
> local or not. In itself, this test does not argue for inclusion of
> the patch due to the lack of IO, but it is included for completeness.
>
> UMA pgbench-small on xfs
>
> Similar observations to ext3 on the load times. The transaction times
> were stable but showed no significant performance difference.
>
> UMA pgbench-large on ext3
>
> Database load times were slightly faster (3.36%). The transaction times
> were slower on average and more variable, but still very close to the noise.
>
> UMA pgbench-large on xfs
>
> No significant difference on either database load times or transactions.
>
> UMA bonnie on ext3
>
> 4.4.0 4.4.0
> vanilla readahead-v1r1
> Hmean SeqOut Char 81079.98 ( 0.00%) 81172.05 ( 0.11%)
> Hmean SeqOut Block 104416.12 ( 0.00%) 104116.24 ( -0.29%)
> Hmean SeqOut Rewrite 44153.34 ( 0.00%) 44596.23 ( 1.00%)
> Hmean SeqIn Char 88144.56 ( 0.00%) 91702.67 ( 4.04%)
> Hmean SeqIn Block 134581.06 ( 0.00%) 137245.71 ( 1.98%)
> Hmean Random seeks 258.46 ( 0.00%) 280.82 ( 8.65%)
> Hmean SeqCreate ops 2.25 ( 0.00%) 2.25 ( 0.00%)
> Hmean SeqCreate read 2.25 ( 0.00%) 2.25 ( 0.00%)
> Hmean SeqCreate del 911.29 ( 0.00%) 880.24 ( -3.41%)
> Hmean RandCreate ops 2.25 ( 0.00%) 2.25 ( 0.00%)
> Hmean RandCreate read 2.00 ( 0.00%) 2.25 ( 12.50%)
> Hmean RandCreate del 911.89 ( 0.00%) 878.80 ( -3.63%)
>
> The difference in headline performance figures is marginal and well within
> the noise. The system CPU usage tells a slightly different story:
>
> 4.4.0 4.4.0
> vanilla readahead-v1r1
> User 1817.53 1798.89
> System 499.40 420.65
> Elapsed 10692.67 10588.08
>
> As do the IO stats:
>
> 4.4.0 4.4.0
> vanilla readahead-v1r1
> Mean sda-avgqusz 1079.16 1083.35
> Mean sda-avgrqsz 807.95 1225.08
> Mean sda-await 7308.06 9647.13
> Mean sda-r_await 119.04 133.27
> Mean sda-w_await 19106.20 20255.41
> Mean sda-svctm 4.67 7.02
> Mean sda-rrqm 1.80 0.99
> Mean sda-wrqm 5597.12 5723.32
>
> NUMA bonnie on ext3
>
> bonnie
> 4.4.0 4.4.0
> vanilla readahead-v1r1
> Hmean SeqOut Char 58660.72 ( 0.00%) 58930.39 ( 0.46%)
> Hmean SeqOut Block 253950.92 ( 0.00%) 261466.37 ( 2.96%)
> Hmean SeqOut Rewrite 151960.60 ( 0.00%) 161300.48 ( 6.15%)
> Hmean SeqIn Char 57015.41 ( 0.00%) 55699.16 ( -2.31%)
> Hmean SeqIn Block 600448.14 ( 0.00%) 627565.09 ( 4.52%)
> Hmean Random seeks 0.00 ( 0.00%) 0.00 ( 0.00%)
> Hmean SeqCreate ops 1.00 ( 0.00%) 1.00 ( 0.00%)
> Hmean SeqCreate read 3.00 ( 0.00%) 3.00 ( 0.00%)
> Hmean SeqCreate del 90.91 ( 0.00%) 79.88 (-12.14%)
> Hmean RandCreate ops 1.00 ( 0.00%) 1.50 ( 50.00%)
> Hmean RandCreate read 3.00 ( 0.00%) 3.00 ( 0.00%)
> Hmean RandCreate del 92.95 ( 0.00%) 93.97 ( 1.10%)
>
> The impact is small but in line with the UMA machine in a number of details.
> As before, the CPU usage is lower even though the IO stats show very little
> difference overall.
>
> Overall, the headline performance figures are mostly improved or show
> little difference. There is a small anomaly with XFS that indicates it may
> not always win there due to other factors. There is also the possibility
/me wonders what the anomaly is/was?
(Well, not that much. If it disappears on 4.14 then I don't care at
all. :P)
--D
> that a mostly random read workload larger than memory, with each read
> spanning multiple pages but less than the max readahead window, would
> suffer; but the probability is low, as the readahead window should scale
> properly. On balance, this is a win -- particularly for the large read
> workloads.
>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
> include/linux/mm.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 00bad7793788..c50c6f442786 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1991,7 +1991,7 @@ int write_one_page(struct page *page, int wait);
> void task_dirty_inc(struct task_struct *tsk);
>
> /* readahead.c */
> -#define VM_MAX_READAHEAD 128 /* kbytes */
> +#define VM_MAX_READAHEAD 512 /* kbytes */
> #define VM_MIN_READAHEAD 16 /* kbytes (includes current page) */
>
> int force_page_cache_readahead(struct address_space *mapping, struct file *filp,