From: "Mkrtchyan, Tigran" <tigran.mkrtchyan@desy.de>
To: Daire Byrne <daire@dneg.com>
Cc: linux-nfs <linux-nfs@vger.kernel.org>, linux-cachefs@redhat.com
Subject: Re: Adventures in NFS re-exporting
Date: Tue, 8 Sep 2020 11:40:47 +0200 (CEST)	[thread overview]
Message-ID: <1642729052.4237865.1599558047149.JavaMail.zimbra@desy.de> (raw)
In-Reply-To: <943482310.31162206.1599499860595.JavaMail.zimbra@dneg.com>
Just out of curiosity:
did you have tries instead of re-exporting nfs mount directly
re-export an overlayfs mount on top of the original nfs mount?
Such setup should cover most of your issues.
Regards,
   Tigran.
----- Original Message -----
> From: "Daire Byrne" <daire@dneg.com>
> To: "linux-nfs" <linux-nfs@vger.kernel.org>
> Cc: linux-cachefs@redhat.com
> Sent: Monday, September 7, 2020 7:31:00 PM
> Subject: Adventures in NFS re-exporting
> Hi,
> 
> Apologies for this rather long email, but I thought there may be some interest
> out there in the community in how and why we've been doing something
> unsupported and barely documented - NFS re-exporting! And I'm not sure I can
> tell our story well in just a few short sentences so please bear with me (or
> stop now!).
> 
> Full disclosure - I am also rather hoping that this story piques some interest
> amongst developers to help make our rather niche setup even better and perhaps
> a little better documented. I also totally understand if this is something
> people wouldn't want to touch with a very long barge pole....
> 
> First a quick bit of history (I hope I have this right). Late in 2015, Jeff
> Layton proposed a patch series allowing knfsd to re-export a NFS client mount.
> The rationale then was to provide a "proxy" server that could mount an NFSv4
> only server and re-export it to older clients that only supported NFSv3. One of
> the main sticking points then (as now), was around the 63 byte limit of
> filehandles for NFSv3 and how it couldn't be guaranteed that all re-exported
> filehandles would fit within that (in my experience it mostly works with
> "no_subtree_check"). There are also the usual locking and coherence concerns
> with NFSv3 too but I'll get to that in a bit.
> 
> Then almost two years later, v4.13 was released including parts of the patch
> series that actually allowed the re-export and since then other relevant bits
> (such as the open file cache) have also been merged. I soon became interested
> in using this new functionality to both accelerate our on-premises NFS storage
> and use it as a "WAN cache" to provide cloud compute instances locally cached
> proxy access to our on-premises storage.
> 
> Cut to a brief introduction to us and what we do... DNEG is an award winning VFX
> company which uses large compute farms to generate complex final frame renders
> for movies and TV. This workload mostly consists of reads of common data shared
> between many render clients (e.g textures, geometry) and a little unique data
> per frame. All file writes are to unique files per process (frames) and there
> is very little if any writing over existing files. Hence it's not very
> demanding on locking and coherence guarantees.
> 
> When our on-premises NFS storage is being overloaded or the server's network is
> maxed out, we can place multiple re-export servers in between them and our farm
> to improve performance. When our on-premises render farm is not quite big
> enough to meet a deadline, we spin up compute instances with a (reasonably
> local) cloud provider. Some of these cloud instances are Linux NFS servers
> which mount our on-premises NFS storage servers (~10ms away) and re-export
> these to the other cloud (render) instances. Since we know that the data we are
> reading doesn't change often, we can increase the actimeo and even use nocto to
> reduce the network chatter back to the on-prem servers. These re-export servers
> also use fscache/cachefiles to cache data to disk so that we can retain TBs of
> previously read data locally in the cloud over long periods of time. We also
> use NFSv4 (less network chatter) all the way from our on-prem storage to the
> re-export server and then on to the clients.
> 
> The re-export server(s) quickly builds up both a memory cache and disk backed
> fscache/cachefiles storage cache of our working data set so the data being
> pulled from on-prem lessens over time. Data is only ever read once over the WAN
> network from on-prem storage and then read multiple times by the many render
> client instances in the cloud. Recent NFS features such as "nconnect" help to
> speed up the initial reading of data from on-prem by using multiple connections
> to offset TCP latency. At the end of the render, we write the files back
> through the re-export server to our on-prem storage. Our average read bandwidth
> is many times higher than our write bandwidth.
> 
> Rather surprisingly, this mostly works for our particular workloads. We've
> completed movies using this setup and saved money on commercial caching systems
> (e.g Avere, GPFS, etc). But there are still some remaining issues with doing
> something that is very much not widely supported (or recommended). In most
> cases we have worked around them, but it would be great if we didn't have to so
> others could also benefit. I will list the main problems quickly now and
> provide more information and reproducers later if anyone is interested.
> 
> 1) The kernel can drop entries out of the NFS client inode cache (under memory
> cache churn) when those filehandles are still being used by the knfsd's remote
> clients resulting in sporadic and random stale filehandles. This seems to be
> mostly for directories from what I've seen. Does the NFS client not know that
> knfsd is still using those files/dirs? The workaround is to never drop inode &
> dentry caches on the re-export servers (vfs_cache_pressure=1). This also helps
> to ensure that we actually make the most of our actimeo=3600,nocto mount
> options for the full specified time.
> 
> 2) If we cache metadata on the re-export server using actimeo=3600,nocto we can
> cut the network packets back to the origin server to zero for repeated lookups.
> However, if a client of the re-export server walks paths and memory maps those
> files (i.e. loading an application), the re-export server starts issuing
> unexpected calls back to the origin server again, ignoring/invalidating the
> re-export server's NFS client cache. We worked around this this by patching an
> inode/iversion validity check in inode.c so that the NFS client cache on the
> re-export server is used. I'm not sure about the correctness of this patch but
> it works for our corner case.
> 
> 3) If we saturate an NFS client's network with reads from the server, all client
> metadata lookups become unbearably slow even if it's all cached in the NFS
> client's memory and no network RPCs should be required. This is the case for
> any NFS client regardless of re-exporting but it affects this case more because
> when we can't serve cached metadata we also can't serve the cached data. It
> feels like some sort of bottleneck in the client's ability to parallelise
> requests? We work around this by not maxing out our network.
> 
> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per second)
> quickly eat up the CPU on the re-export server and perf top shows we are mostly
> in native_queued_spin_lock_slowpath. Does NFSv4 also need an open file cache
> like that added to NFSv3? Our workaround is to either fix the thing doing lots
> of repeated open/closes or use NFSv3 instead.
> 
> If you made it this far, I've probably taken up way too much of your valuable
> time already. If nobody is interested in this rather niche application of the
> Linux client & knfsd, then I totally understand and I will not mention it here
> again. If your interest is piqued however, I'm happy to go into more detail
> about any of this with the hope that this could become a better documented and
> understood type of setup that others with similar workloads could reference.
> 
> Also, many thanks to all the Linux NFS developers for the amazing work you do
> which, in turn, helps us to make great movies. :)
> 
> Daire (Head of Systems DNEG)
next prev parent reply	other threads:[~2020-09-08  9:46 UTC|newest]
Thread overview: 129+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-09-07 17:31 Adventures in NFS re-exporting Daire Byrne
2020-09-08  9:40 ` Mkrtchyan, Tigran [this message]
2020-09-08 11:06   ` Daire Byrne
2020-09-15 17:21 ` J. Bruce Fields
2020-09-15 19:59   ` Trond Myklebust
2020-09-16 16:01     ` Daire Byrne
2020-10-19 16:19       ` Daire Byrne
2020-10-19 17:53         ` [PATCH 0/2] Add NFSv3 emulation of the lookupp operation trondmy
2020-10-19 17:53           ` [PATCH 1/2] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry trondmy
2020-10-19 17:53             ` [PATCH 2/2] NFSv3: Add emulation of the lookupp() operation trondmy
2020-10-19 20:05         ` [PATCH v2 0/2] Add NFSv3 emulation of the lookupp operation trondmy
2020-10-19 20:05           ` [PATCH v2 1/2] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry trondmy
2020-10-19 20:05             ` [PATCH v2 2/2] NFSv3: Add emulation of the lookupp() operation trondmy
2020-10-20 18:37         ` [PATCH v3 0/3] Add NFSv3 emulation of the lookupp operation trondmy
2020-10-20 18:37           ` [PATCH v3 1/3] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry trondmy
2020-10-20 18:37             ` [PATCH v3 2/3] NFSv3: Add emulation of the lookupp() operation trondmy
2020-10-20 18:37               ` [PATCH v3 3/3] NFSv4: Observe the NFS_MOUNT_SOFTREVAL flag in _nfs4_proc_lookupp trondmy
2020-10-21  9:33         ` Adventures in NFS re-exporting Daire Byrne
2020-11-09 16:02           ` bfields
2020-11-12 13:01             ` Daire Byrne
2020-11-12 13:57               ` bfields
2020-11-12 18:33                 ` Daire Byrne
2020-11-12 20:55                   ` bfields
2020-11-12 23:05                     ` Daire Byrne
2020-11-13 14:50                       ` bfields
2020-11-13 22:26                         ` bfields
2020-11-14 12:57                           ` Daire Byrne
2020-11-16 15:18                             ` bfields
2020-11-16 15:53                             ` bfields
2020-11-16 19:21                               ` Daire Byrne
2020-11-16 15:29                           ` Jeff Layton
2020-11-16 15:56                             ` bfields
2020-11-16 16:03                               ` Jeff Layton
2020-11-16 16:14                                 ` bfields
2020-11-16 16:38                                   ` Jeff Layton
2020-11-16 19:03                                     ` bfields
2020-11-16 20:03                                       ` Jeff Layton
2020-11-17  3:16                                         ` bfields
2020-11-17  3:18                                           ` [PATCH 1/4] nfsd: move fill_{pre,post}_wcc to nfsfh.c J. Bruce Fields
2020-11-17  3:18                                             ` [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute J. Bruce Fields
2020-11-17 12:34                                               ` Jeff Layton
2020-11-17 15:26                                                 ` J. Bruce Fields
2020-11-17 15:34                                                   ` Jeff Layton
2020-11-20 22:38                                                     ` J. Bruce Fields
2020-11-20 22:39                                                       ` [PATCH 1/8] nfsd: only call inode_query_iversion in the I_VERSION case J. Bruce Fields
2020-11-20 22:39                                                         ` [PATCH 2/8] nfsd: simplify nfsd4_change_info J. Bruce Fields
2020-11-20 22:39                                                         ` [PATCH 3/8] nfsd: minor nfsd4_change_attribute cleanup J. Bruce Fields
2020-11-21  0:34                                                           ` Jeff Layton
2020-11-20 22:39                                                         ` [PATCH 4/8] nfsd4: don't query change attribute in v2/v3 case J. Bruce Fields
2020-11-20 22:39                                                         ` [PATCH 5/8] nfs: use change attribute for NFS re-exports J. Bruce Fields
2020-11-20 22:39                                                         ` [PATCH 6/8] nfsd: move change attribute generation to filesystem J. Bruce Fields
2020-11-21  0:58                                                           ` Jeff Layton
2020-11-21  1:01                                                             ` J. Bruce Fields
2020-11-21 13:00                                                           ` Jeff Layton
2020-11-20 22:39                                                         ` [PATCH 7/8] nfsd: skip some unnecessary stats in the v4 case J. Bruce Fields
2020-11-20 22:39                                                         ` [PATCH 8/8] Revert "nfsd4: support change_attr_type attribute" J. Bruce Fields
2020-11-20 22:44                                                       ` [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute J. Bruce Fields
2020-11-21  1:03                                                         ` Jeff Layton
2020-11-21 21:44                                                           ` Daire Byrne
2020-11-22  0:02                                                             ` bfields
2020-11-22  1:55                                                               ` Daire Byrne
2020-11-22  3:03                                                                 ` bfields
2020-11-23 20:07                                                                   ` Daire Byrne
2020-11-17 15:25                                               ` J. Bruce Fields
2020-11-17  3:18                                             ` [PATCH 3/4] nfs: don't mangle i_version on NFS J. Bruce Fields
2020-11-17 12:27                                               ` Jeff Layton
2020-11-17 14:14                                                 ` J. Bruce Fields
2020-11-17  3:18                                             ` [PATCH 4/4] nfs: support i_version in the NFSv4 case J. Bruce Fields
2020-11-17 12:34                                               ` Jeff Layton
2020-11-24 20:35               ` Adventures in NFS re-exporting Daire Byrne
2020-11-24 21:15                 ` bfields
2020-11-24 22:15                   ` Frank Filz
2020-11-25 14:47                     ` 'bfields'
2020-11-25 16:25                       ` Frank Filz
2020-11-25 19:03                         ` 'bfields'
2020-11-26  0:04                           ` Frank Filz
2020-11-25 17:14                   ` Daire Byrne
2020-11-25 19:31                     ` bfields
2020-12-03 12:20                     ` Daire Byrne
2020-12-03 18:51                       ` bfields
2020-12-03 20:27                         ` Trond Myklebust
2020-12-03 21:13                           ` bfields
2020-12-03 21:32                             ` Frank Filz
2020-12-03 21:34                             ` Trond Myklebust
2020-12-03 21:45                               ` Frank Filz
2020-12-03 21:57                                 ` Trond Myklebust
2020-12-03 22:04                                   ` bfields
2020-12-03 22:14                                     ` Trond Myklebust
2020-12-03 22:39                                       ` Frank Filz
2020-12-03 22:50                                         ` Trond Myklebust
2020-12-03 23:34                                           ` Frank Filz
2020-12-03 22:44                                       ` bfields
2020-12-03 21:54                               ` bfields
2020-12-03 22:45                               ` bfields
2020-12-03 22:53                                 ` Trond Myklebust
2020-12-03 23:16                                   ` bfields
2020-12-03 23:28                                     ` Frank Filz
2020-12-04  1:02                                     ` Trond Myklebust
2020-12-04  1:41                                       ` bfields
2020-12-04  2:27                                         ` Trond Myklebust
2020-09-17 16:01   ` Daire Byrne
2020-09-17 19:09     ` bfields
2020-09-17 20:23       ` Frank van der Linden
2020-09-17 21:57         ` bfields
2020-09-19 11:08           ` Daire Byrne
2020-09-22 16:43         ` Chuck Lever
2020-09-23 20:25           ` Daire Byrne
2020-09-23 21:01             ` Frank van der Linden
2020-09-26  9:00               ` Daire Byrne
2020-09-28 15:49                 ` Frank van der Linden
2020-09-28 16:08                   ` Chuck Lever
2020-09-28 17:42                     ` Frank van der Linden
2020-09-22 12:31 ` Daire Byrne
2020-09-22 13:52   ` Trond Myklebust
2020-09-23 12:40     ` J. Bruce Fields
2020-09-23 13:09       ` Trond Myklebust
2020-09-23 17:07         ` bfields
2020-09-30 19:30   ` [Linux-cachefs] " Jeff Layton
2020-10-01  0:09     ` Daire Byrne
2020-10-01 10:36       ` Jeff Layton
2020-10-01 12:38         ` Trond Myklebust
2020-10-01 16:39           ` Jeff Layton
2020-10-05 12:54         ` Daire Byrne
2020-10-13  9:59           ` Daire Byrne
2020-10-01 18:41     ` J. Bruce Fields
2020-10-01 19:24       ` Trond Myklebust
2020-10-01 19:26         ` bfields
2020-10-01 19:29           ` Trond Myklebust
2020-10-01 19:51             ` bfields
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox
  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):
  git send-email \
    --in-reply-to=1642729052.4237865.1599558047149.JavaMail.zimbra@desy.de \
    --to=tigran.mkrtchyan@desy.de \
    --cc=daire@dneg.com \
    --cc=linux-cachefs@redhat.com \
    --cc=linux-nfs@vger.kernel.org \
    /path/to/YOUR_REPLY
  https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
  Be sure your reply has a Subject: header at the top and a blank line
  before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).