From: Trond Myklebust <trondmy@hammerspace.com>
To: "linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
"abuehaze@amazon.com" <abuehaze@amazon.com>
Subject: Re: NFS read performance degradation after upgrading to kernel 5.4.*
Date: Wed, 31 Mar 2021 13:09:07 +0000
Message-ID: <9f3477020dd5d6f120ddc2cc80ada38e1cf258af.camel@hammerspace.com>
In-Reply-To: <0D482540-0455-4AA7-B534-FCE24793ECB0@amazon.com>
On Wed, 2021-03-31 at 12:53 +0000, Mohamed Abuelfotoh, Hazem wrote:
> Hi Trond,
>
> I am wondering if we should consider raising the default maximum NFS
> read ahead size given the facts I mentioned in my previous e-mail.
>
We can't keep changing the default every time someone measures a new
workload.
The change in 5.4 was also measurement-based: it addressed poor
performance in cases where rsize/wsize is small and readahead was
overshooting.
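
For reference, what the 5.4 change did was essentially stop scaling
the NFS default readahead with the rsize and fall back to the generic
VM default. Roughly (paraphrased from memory, not the literal diff):

        /* Before 5.4: the default scaled with the mount's rsize. */
        ra_pages = server->rpages * NFS_MAX_READAHEAD;
        /* 256 pages (1MB rsize) * 15 => 3840 pages = 15MB */

        /* Since 5.4: the generic VM default applies instead. */
        ra_pages = VM_READAHEAD_PAGES;  /* 32 pages = 128KB */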
> Thank you.
>
> Hazem
>
> On 29/03/2021, 17:07, "Mohamed Abuelfotoh, Hazem" <
> abuehaze@amazon.com> wrote:
>
>
> Hello Team,
>
> -We have multiple customers complaining about NFS read
> performance degradation after they upgraded to kernel 5.4.*
>
> -After some deep-dive testing we found that the regression was
> introduced by the patch "NFS: Optimise the default readahead
> size" [1], which was merged in kernel 5.4 and later.
> -Our customers use AWS EC2 instances as clients mounting an EFS
> export (AWS's managed NFSv4 service). I am sharing the results we
> got before and after the upgrade: the NFS server (EFS) can sustain
> 250-300 MB/sec, which the clients achieve without patch [1], but
> with that patch merged they get roughly a quarter of that speed,
> around 70 MB/sec, as seen below.
>
>
> #####################################################################
>
>
> Before the upgrade:
> # uname -r
> 4.14.225-168.357.amzn2.x86_64
> [root@ip-172-31-28-135 ec2-user]# sync; echo 3 >
> /proc/sys/vm/drop_caches
> [root@ip-172-31-28-135 ec2-user]# mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport fs-6700f553.efs.eu-west-1.amazonaws.com:/ efs
> [root@ip-172-31-28-135 ec2-user]# rsync --progress efs/test .
> test
> 8,589,934,592 100% 313.20MB/s 0:00:26 (xfr#1, to-chk=0/1)
>
>
> #####################################################################
>
> After the upgrade using the same client & server:
> #uname -r; sync; echo 3 > /proc/sys/vm/drop_caches; ./nfs-readahead show /home/ec2-user/efs/; rsync --progress efs/test .
> 5.4.0-1.132.64.amzn2.x86_64
> /home/ec2-user/efs 0:40 /sys/class/bdi/0:40/read_ahead_kb = 128
> test
> 1,073,741,824 100% 68.61MB/s 0:00:14 (xfr#1, to-chk=0/1)
>
>
> -We recommend [2] that EFS users mount with rsize=1048576 to get
> the best read performance from their EFS exports, since EC2-to-EFS
> traffic stays within one AWS availability zone and therefore has low
> latency and up to 250-300 MB/sec of throughput. With the mentioned
> patch merged, however, customers can't reach that throughput after
> the kernel upgrade, because the default NFS readahead has dropped
> from (15 * rsize) = 15 MB to 128 KB, so clients have to manually
> raise read_ahead_kb from 128 to 15360 to get the same experience
> they had before the upgrade.
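>
> As an interim workaround we point customers at the BDI sysfs knob.
> The 0:40 device id below comes from the nfs-readahead output above
> and will differ per mount; one way to find it is the major:minor
> field in /proc/self/mountinfo:
>
> # grep ' - nfs4 ' /proc/self/mountinfo   # 3rd field is the BDI id, e.g. 0:40
> # echo 15360 > /sys/class/bdi/0:40/read_ahead_kb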
> -We know that the purpose of the patch was to reduce OS boot time
> (for netboot users) and application start-up times on congested,
> low-throughput networks, as mentioned in [3]; however, it also causes
> a regression for high-throughput, low-latency workloads, especially
> sequential read workflows.
> -After further debugging we also found that the maximum readahead
> size is a constant: there is no autotuning of this setting even when
> the client keeps filling the readahead window. That means any NFS
> client, especially one mounting with the maximum rsize, has to tune
> its maximum NFS readahead size manually after the upgrade, which in
> my opinion is a regression from older kernels' behaviour.
>
>
> #####################################################################
>
> After increasing the maximum NFS readahead size to 15 MB, it's
> clear that the readahead window expands as expected: it doubles
> until it reaches 15 MB.
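>
> For context, the window-growth logic in mm/readahead.c looks roughly
> like this (copied from a recent tree, so details may differ):
>
> static unsigned long get_next_ra_size(struct file_ra_state *ra,
>                                       unsigned long max)
> {
>         unsigned long cur = ra->size;
>
>         if (cur < max / 16)
>                 return 4 * cur;     /* quadruple small windows */
>         if (cur <= max / 2)
>                 return 2 * cur;     /* then keep doubling */
>         return max;                 /* until we hit the cap */
> }
>
> which matches the 256 -> 512 -> 1024 -> 2048 progression in the log
> below, capped at max ra 3840 pages (15 MB).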
>
> Mar 29 11:25:18 ip-172-31-17-191 kernel: init_ra_size 256
> Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 256
> Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840
> Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 59
> Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 512
> Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840
> Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 1024
> Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840
> Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 2048
> Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840
>
>
> #####################################################################
>
> With 128 KB as the maximum NFS readahead size, the readahead
> window grows until it hits the configured 128 KB (32-page) maximum.
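>
> The init_ra_size lines come from the initial-window sizing, which in
> mm/readahead.c looks roughly like this (again from a recent tree, so
> details may differ):
>
> static unsigned long get_init_ra_size(unsigned long size,
>                                       unsigned long max)
> {
>         unsigned long newsize = roundup_pow_of_two(size);
>
>         if (newsize <= max / 32)
>                 newsize = newsize * 4;
>         else if (newsize <= max / 4)
>                 newsize = newsize * 2;
>         else
>                 newsize = max;
>
>         return newsize;
> }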
>
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: init_ra_size 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 40
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 40
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 64
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: init_ra_size 4
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 4
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 64
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 64
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 64
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 59
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
>
> -In my reproduction I used rsync, as shown above; it consistently
> issues read() syscalls requesting 256 KB each:
> 15:47:10.780658 read(3,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...
> , 262144) = 262144 <0.023749>
> 15:47:10.805467 read(3,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...
> , 262144) = 262144 <0.023739>
> 15:47:10.830272 read(3,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...
> , 262144) = 262144 <0.023664>
> 15:47:10.854972 read(3,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...
> , 262144) = 262144 <0.023814>
> 15:47:10.879837 read(3,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...
> , 262144) = 262144 <0.023625>
> 15:47:10.904496 read(3,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...
> , 262144) = 262144 <0.023645>
> 15:47:10.929180 read(3,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...
> , 262144) = 262144 <0.024072>
> 15:47:10.954308 read(3,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...
> , 262144) = 262144 <0.023669>
>
>
> -Looking into the readahead source code, I can see that readahead
> uses heuristics to determine whether the access pattern is sequential
> or random and then sizes the readahead window (the amount of data it
> prefetches) accordingly; readahead also never reads beyond the end of
> the file. In theory this means that a large maximum NFS readahead
> size (15 MB) shouldn't hurt performance much, even with a random I/O
> pattern or a data set of small files. The main cost of a large NFS
> readahead size would be network congestion or boot-up delay on hosts
> with congested or low-throughput networks, as illustrated in
> https://bugzilla.kernel.org/show_bug.cgi?id=204939 and
> https://lore.kernel.org/linux-nfs/0b76213e-426b-0654-5b69-02bfead78031@gmail.com/T/
> -With patch
> https://www.spinics.net/lists/linux-nfs/msg75018.html applied, packet
> captures show the client asking for only 128 KB or 256 KB in its NFS
> READ calls; it never reaches the 1 MB configured via the rsize mount
> option. The reason is that ondemand_readahead(), which is responsible
> for moving and scaling the readahead window, contains an "if"
> condition that was introduced by
> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1274743.html
> . That patch made readahead issue the maximum of the user's request
> size (rsync does 256 KB reads) and the readahead max size (128 KB by
> default), capped at the maximum request size on the device side (1 MB
> in our case); the cap avoids reading ahead too much when an
> application asks for a huge read. So with a 128 KB readahead size and
> an application asking for 256 KB, we never exceed 256 KB. The patch
> deliberately avoids limiting the request to the maximum readahead
> size, but we are still limited by the minimum of the amount the
> application reads (256 KB, as seen in the rsync strace output) and
> bdi->io_pages (256 pages = 1 MB, as configured via the rsize mount
> option).
>
> -Output after adding some debugging to the kernel, showing the
> value of each variable in the mentioned "if" condition:
>
> [ 238.387788] req_size= 64 ------> 256 KB rsync read requests
> [ 238.387790] io pages= 256 -----> 1 MB, as supported by EFS and
> configured via the rsize mount option
> [ 238.390487] max_pages before= 32 -----> 128 KB, the default
> readahead size
> [ 238.393177] max_pages after= 64 ----> raised to 256 KB by the
> change in [4]: "max_pages = min(req_size, bdi->io_pages);"
>
>
> https://elixir.bootlin.com/linux/latest/source/mm/readahead.c#L435
>
> /*
>  * A minimal readahead algorithm for trivial sequential/random reads.
>  */
> static void ondemand_readahead(struct readahead_control *ractl,
>                 struct file_ra_state *ra, bool hit_readahead_marker,
>                 unsigned long req_size)
> {
>         struct backing_dev_info *bdi = inode_to_bdi(ractl->mapping->host);
>         unsigned long max_pages = ra->ra_pages;
>         unsigned long add_pages;
>         unsigned long index = readahead_index(ractl);
>         pgoff_t prev_index;
>
>         /*
>          * If the request exceeds the readahead window, allow the read to
>          * be up to the optimal hardware IO size
>          */
>         if (req_size > max_pages && bdi->io_pages > max_pages)
>                 max_pages = min(req_size, bdi->io_pages);
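>
> To show the arithmetic with our values, here is a tiny userspace
> re-implementation of just that cap (illustrative only, not kernel
> code):
>
> #include <stdio.h>
>
> static unsigned long cap_readahead(unsigned long req_size,
>                                    unsigned long ra_pages,
>                                    unsigned long io_pages)
> {
>         unsigned long max_pages = ra_pages;
>
>         if (req_size > max_pages && io_pages > max_pages)
>                 max_pages = req_size < io_pages ? req_size : io_pages;
>         return max_pages;
> }
>
> int main(void)
> {
>         /* 64-page (256KB) rsync reads, 32-page (128KB) default
>          * window, 256-page (1MB) rsize => capped at 64 pages. */
>         printf("max_pages = %lu\n", cap_readahead(64, 32, 256));
>         return 0;
> }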
>
>
>
> #####################################################################
>
> -With 128 KB as the default maximum readahead size, the packet
> capture from the client side shows NFSv4 READ calls with byte counts
> alternating between 128 KB and 256 KB.
>
> 73403 29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 Call
> READ StateID: 0xcec2 Offset: 1072955392 Len: 131072
> 73404 29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 Call
> READ StateID: 0xcec2 Offset: 1073086464 Len: 262144
> 73406 29 172.31.28.161 -> 172.31.17.191 NFS 8699 V4 Reply
> (Call In 73403)[Unreassembled Packet]
> 73415 29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 Call
> READ StateID: 0xcec2 Offset: 1073348608 Len: 131072
> 73416 29 172.31.28.161 -> 172.31.17.191 NFS 8699 V4 Reply
> (Call In 73404)[Unreassembled Packet]
> 73428 29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 Call
> READ StateID: 0xcec2 Offset: 1073479680 Len: 131072
> 73429 29 172.31.28.161 -> 172.31.17.191 NFS 8699 V4 Reply
> (Call In 73415)[Unreassembled Packet]
> 73438 29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 Call
> READ StateID: 0xcec2 Offset: 1073610752 Len: 131072
> 73439 29 172.31.28.161 -> 172.31.17.191 NFS 8699 V4 Reply
> (Call In 73428)[Unreassembled Packet]
>
> -nfsstat shows 8,183 NFSv4 READ calls were needed to read the
> 1 GB file, consistent with mostly 128 KB reads (1 GiB / 128 KiB =
> 8,192).
>
> # nfsstat
> Client rpc stats:
> calls retrans authrefrsh
> 8204 0 8204
>
> Client nfs v4:
> null read write commit open
> open_conf
> 1 0% 8183 99% 0 0% 0 0% 0 0%
> 0 0%
> open_noat open_dgrd close setattr fsinfo
> renew
> 1 0% 0 0% 1 0% 0 0% 2 0%
> 0 0%
> setclntid confirm lock lockt locku
> access
> 0 0% 0 0% 0 0% 0 0% 0 0%
> 1 0%
> getattr lookup lookup_root remove rename
> link
> 4 0% 1 0% 1 0% 0 0% 0 0%
> 0 0%
> symlink create pathconf statfs readlink
> readdir
> 0 0% 0 0% 1 0% 0 0% 0 0%
> 0 0%
> server_caps delegreturn getacl setacl fs_locations
> rel_lkowner
> 3 0% 0 0% 0 0% 0 0% 0 0%
> 0 0%
> secinfo exchange_id create_ses destroy_ses sequence
> get_lease_t
> 0 0% 0 0% 2 0% 1 0% 0 0%
> 0 0%
> reclaim_comp layoutget getdevinfo layoutcommit layoutreturn
> getdevlist
> 0 0% 1 0% 0 0% 0 0% 0 0%
> 0 0%
> (null)
> 1 0%
>
>
> #####################################################################
>
> -With 15 MB as the maximum readahead size, the client sends 1 MB
> NFSv4 READ requests and is therefore able to read the same 1 GB file
> in 1,024 READ calls (1 GiB / 1 MiB = 1,024):
>
> #uname -r; mount -t nfs4 -o nfsvers=4.1,rsize=1052672,wsize=1048576,hard,timeo=600,retrans=2,noresvport fs-6700f553.efs.eu-west-1.amazonaws.com:/ efs; ./nfs-readahead show /home/ec2-user/efs/
> 5.3.9
> /home/ec2-user/efs 0:40 /sys/class/bdi/0:40/read_ahead_kb = 15360
> #sync; echo 3 > /proc/sys/vm/drop_caches
> #rsync --progress efs/test .
> test
> 1,073,741,824 100% 260.15MB/s 0:00:03 (xfr#1, to-chk=0/1)
> [root@ip-172-31-17-42 ec2-user]# nfsstat
> Client rpc stats:
> calls retrans authrefrsh
> 1043 0 1043
>
> Client nfs v4:
> null read write commit open
> open_conf
> 1 0% 1024 98% 0 0% 0 0% 0 0%
> 0 0%
> open_noat open_dgrd close setattr fsinfo
> renew
> 1 0% 0 0% 1 0% 0 0% 2 0%
> 0 0%
> setclntid confirm lock lockt locku
> access
> 0 0% 0 0% 0 0% 0 0% 0 0%
> 1 0%
> getattr lookup lookup_root remove rename
> link
> 2 0% 1 0% 1 0% 0 0% 0 0%
> 0 0%
> symlink create pathconf statfs readlink
> readdir
> 0 0% 0 0% 1 0% 0 0% 0 0%
> 0 0%
> server_caps delegreturn getacl setacl fs_locations
> rel_lkowner
> 3 0% 0 0% 0 0% 0 0% 0 0%
> 0 0%
> secinfo exchange_id create_ses destroy_ses sequence
> get_lease_t
> 0 0% 0 0% 2 0% 1 0% 0 0%
> 0 0%
> reclaim_comp layoutget getdevinfo layoutcommit layoutreturn
> getdevlist
> 0 0% 1 0% 0 0% 0 0% 0 0%
> 0 0%
> (null)
> 1 0%
>
> -The packet capture from the client side shows NFSv4 READ calls
> with a 1 MB read count when the maximum NFS readahead size is 15 MB.
>
> 2021-03-22 14:25:34.984731 9398 172.31.17.42 → 172.31.28.161 NFS
> 0.000375 V4 Call READ StateID: 0x3640 Offset: 94371840 Len: 1048576
> 2021-03-22 14:25:34.984805 9405 172.31.17.42 → 172.31.28.161 NFS
> 0.000074 V4 Call READ StateID: 0x3640 Offset: 95420416 Len: 1048576
> 2021-03-22 14:25:34.984902 9416 172.31.17.42 → 172.31.28.161 NFS
> 0.000097 V4 Call READ StateID: 0x3640 Offset: 96468992 Len: 1048576
> 2021-03-22 14:25:34.984941 9421 172.31.17.42 → 172.31.28.161 NFS
> 0.000039 V4 Call READ StateID: 0x3640 Offset: 97517568 Len: 1048576
>
>
> #####################################################################
>
>
> -I think there are two options to mitigate this behaviour:
> A) Raise the default maximum NFS readahead size, because the
> current 128 KB default doesn't seem sufficient for high-throughput,
> low-latency workloads. I strongly believe the NFS rsize mount option
> should be a factor in deciding the maximum NFS readahead size, as it
> was before [1]; now it is always 128 KB regardless of the rsize in
> use. Also, clients running in high-latency, low-throughput
> environments shouldn't use a 1 MB rsize in their mount options (they
> should use a smaller rsize), because that can make things worse even
> with a small maximum NFS readahead size.
> B) Add logic to readahead for some kind of autotuning (similar to
> TCP autotuning), where the maximum readahead size grows dynamically
> when the client/reader keeps filling the readahead window; a rough
> sketch follows below.
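>
> A purely hypothetical sketch of (B), just to make the idea concrete
> (none of these names exist in the kernel today):
>
> struct ra_tune {
>         unsigned long cur_max;   /* current cap, in pages */
>         unsigned long hard_max;  /* ceiling, e.g. 15 * rsize in pages */
> };
>
> static void ra_autotune(struct ra_tune *t, unsigned long consumed,
>                         unsigned long window)
> {
>         /* Reader consumed the whole window? Double the cap. */
>         if (consumed >= window && t->cur_max < t->hard_max) {
>                 t->cur_max *= 2;
>                 if (t->cur_max > t->hard_max)
>                         t->cur_max = t->hard_max;
>         }
> }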
>
>
> Links:
> [1] https://www.spinics.net/lists/linux-nfs/msg75018.html
> [2] https://docs.aws.amazon.com/efs/latest/ug/mounting-fs-nfs-mount-settings.html
> [3] https://bugzilla.kernel.org/show_bug.cgi?id=204939
> [4] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1274743.html
>
>
> Thank you.
>
> Hazem
>
>
>
>
>
>
>
> Amazon Web Services EMEA SARL, 38 avenue John F. Kennedy, L-1855
> Luxembourg, R.C.S. Luxembourg B186284
>
> Amazon Web Services EMEA SARL, Irish Branch, One Burlington Plaza,
> Burlington Road, Dublin 4, Ireland, branch registration number 908705
>
>
--
Trond Myklebust
CTO, Hammerspace Inc
4984 El Camino Real, Suite 208
Los Altos, CA 94022
www.hammer.space