* RFC: return d_type for non-plus READDIR
@ 2021-03-23  1:00 Geert Jansen
  2021-03-23 15:26 ` Chuck Lever III
  0 siblings, 1 reply; 5+ messages in thread
From: Geert Jansen @ 2021-03-23  1:00 UTC
To: linux-nfs

Hi,

recursively listing a directory tree requires that you know which entries are
directories so that you can recurse into them. The getdents() API can provide
this information through the d_type field.

Today, d_type is available if we use READDIRPLUS. A non-plus READDIR requests
only the "rdattr_error" and "mounted_on_fileid" attributes, but not "type", and
consequently sets d_type to DT_UNKNOWN.

Requesting the "type" attribute for regular, non-plus READDIR would allow us to
always return d_type, even for large directories where we switch to a non-plus
READDIR. It would allow the user to recursively list directories of any size
without the need for GETATTRs, and, if the server supports this, without any
stat() or equivalent calls on the server. For some use cases, you could also
mount with '-o nordirplus' to scan an entire file system efficiently.

Since not all file servers may be able to produce the directory entry type
efficiently, this could be implemented as a mount option that defaults off.

Some local file systems offer a similar choice. For example, both ext4 and xfs
have an (in this case mkfs-time) option to store the inode type in the
directory. If this option is set, then getdents() always returns d_type.

Would a patch that adds such a mount option be acceptable?
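For illustration, here is a minimal sketch (in Python, not part of the original message) of the recursive listing described above. `os.scandir()` passes along the d_type reported by the directory read: `DirEntry.is_dir()` answers from d_type when the file system provides it and falls back to a stat call per entry when it is DT_UNKNOWN, which is the cost the proposal aims to avoid. The function name `list_tree` is hypothetical.

```python
import os


def list_tree(root):
    """Return the paths of all entries below root, recursing into directories."""
    paths = []
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                paths.append(entry.path)
                # is_dir() uses d_type when the file system supplies it and
                # falls back to a stat call when d_type is DT_UNKNOWN.
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
    return paths
```

On a mount where every entry comes back as DT_UNKNOWN, the same loop still works, but each `is_dir()` call turns into an extra attribute lookup.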
* Re: RFC: return d_type for non-plus READDIR
  2021-03-23  1:00 RFC: return d_type for non-plus READDIR Geert Jansen
@ 2021-03-23 15:26 ` Chuck Lever III
  2021-03-24  1:47   ` Geert Jansen
  0 siblings, 1 reply; 5+ messages in thread
From: Chuck Lever III @ 2021-03-23 15:26 UTC
To: Geert Jansen; +Cc: Linux NFS Mailing List

Hi Geert -

> On Mar 22, 2021, at 9:00 PM, Geert Jansen <gerardu@amazon.com> wrote:
>
> Hi,
>
> recursively listing a directory tree requires that you know which entries are
> directories so that you can recurse into them. The getdents() API can provide
> this information through the d_type field.
>
> Today, d_type is available if we use READDIRPLUS. A non-plus READDIR requests
> only the "rdattr_error" and "mounted_on_fileid" attributes, but not "type", and
> consequently sets d_type to DT_UNKNOWN.
>
> Requesting the "type" attribute for regular, non-plus READDIR would allow us to
> always return d_type, even for large directories where we switch to a non-plus
> READDIR. It would allow the user to recursively list directories of any size
> without the need for GETATTRs, and, if the server supports this, without any
> stat() or equivalent calls on the server. For some use cases, you could also
> mount with '-o nordirplus' to scan an entire file system efficiently.
>
> Since not all file servers may be able to produce the directory entry type
> efficiently, this could be implemented as a mount option that defaults off.

Can you say more about the impact of requesting this attribute
from servers that cannot efficiently provide it? Which servers
and filesystems find it a problem, and how much of a problem is
it?

> Some local file systems offer a similar choice. For example, both ext4 and xfs
> have an (in this case mkfs-time) option to store the inode type in the
> directory. If this option is set, then getdents() always returns d_type.
>
> Would a patch that adds such a mount option be acceptable?

I'd rather avoid adding another administrative knob unless it is
absolutely necessary... are there other options for controlling
whether the client requests this attribute?

For example, is there a way for a server to decide not to provide
it if it would be burdensome to do so? ie, the client always asks,
but it would be up to the server to provide it if it can do so.

--
Chuck Lever
* Re: RFC: return d_type for non-plus READDIR
  2021-03-23 15:26 ` Chuck Lever III
@ 2021-03-24  1:47   ` Geert Jansen
  2021-03-24 13:50     ` Chuck Lever III
  0 siblings, 1 reply; 5+ messages in thread
From: Geert Jansen @ 2021-03-24  1:47 UTC
To: Chuck Lever III; +Cc: Linux NFS Mailing List

On Tue, Mar 23, 2021 at 03:26:02PM +0000, Chuck Lever III wrote:

> > Since not all file servers may be able to produce the directory entry type
> > efficiently, this could be implemented as a mount option that defaults off.
>
> Can you say more about the impact of requesting this attribute
> from servers that cannot efficiently provide it? Which servers
> and filesystems find it a problem, and how much of a problem is
> it?

The ability to satisfy a non-plus READDIR by reading just the directory
pages, instead of having to read all dirent inodes as well, can be worth it
for certain use cases (especially those with large directories). If a file
system does not store d_type in the directory, and the client would always
request the type attribute even for non-plus READDIR, then you lose the
ability to make this optimization.

From a review of the man pages, most local file systems appear to be able to
store d_type within the directory, including ext4, xfs and zfs. Both ext4
and xfs have options to turn this behavior off. If you'd export such a file
system using nfsd, then this would cause additional IO on the file system if
we would always request the type attribute.

I do not know how other commercial servers handle this.

> I'd rather avoid adding another administrative knob unless it is
> absolutely necessary... are there other options for controlling
> whether the client requests this attribute?
>
> For example, is there a way for a server to decide not to provide
> it if it would be burdensome to do so? ie, the client always asks,
> but it would be up to the server to provide it if it can do so.

I looked in the RFCs but I am not sure if there is a way today? Both 4.0 and
4.1 define "type" as a required attribute that needs to be returned if the
client asks for it. There also does not appear to be an enum value
corresponding to DT_UNKNOWN. Were you thinking about something specifically?

If there's no way to do this today, then I guess a per-file system attribute
that indicates support for "can produce file type efficiently when reading a
directory" would be a relatively clean solution. I presume it would
require an RFC to define this attribute. Would you have a recommendation given
your experience with the RFC process?

Geert
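To illustrate the point about DT_UNKNOWN, here is a hedged sketch (in Python, not from the thread) of how a client might map NFSv4 file types to d_type values. The numeric values follow the nfs_ftype4 enum in the NFSv4 specification and the DT_* constants in Linux's <dirent.h>; the table and function names are made up for this example, and this is not the Linux client's actual code.

```python
# nfs_ftype4 values as defined by the NFSv4 protocol.
NF4REG, NF4DIR, NF4BLK, NF4CHR, NF4LNK, NF4SOCK, NF4FIFO = 1, 2, 3, 4, 5, 6, 7

# d_type values from <dirent.h>.
DT_UNKNOWN, DT_FIFO, DT_CHR, DT_DIR = 0, 1, 2, 4
DT_BLK, DT_REG, DT_LNK, DT_SOCK = 6, 8, 10, 12

# Hypothetical lookup table: every protocol type maps to a concrete d_type.
NFS4_TO_DTYPE = {
    NF4REG: DT_REG,
    NF4DIR: DT_DIR,
    NF4BLK: DT_BLK,
    NF4CHR: DT_CHR,
    NF4LNK: DT_LNK,
    NF4SOCK: DT_SOCK,
    NF4FIFO: DT_FIFO,
}


def dtype_for(nfs_type=None):
    """Map an nfs_ftype4 value to d_type.

    DT_UNKNOWN is only produced when no (recognized) type was returned,
    for example because the client never requested the attribute.
    """
    return NFS4_TO_DTYPE.get(nfs_type, DT_UNKNOWN)
```

Nothing in the table yields DT_UNKNOWN, which mirrors the observation above: a server that is asked for "type" has no protocol value with which to answer "unknown".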
* Re: RFC: return d_type for non-plus READDIR
  2021-03-24  1:47 ` Geert Jansen
@ 2021-03-24 13:50   ` Chuck Lever III
  2021-03-25 17:26     ` Geert Jansen
  0 siblings, 1 reply; 5+ messages in thread
From: Chuck Lever III @ 2021-03-24 13:50 UTC
To: Geert Jansen; +Cc: Linux NFS Mailing List

Hi Geert -

> On Mar 23, 2021, at 9:47 PM, Geert Jansen <gerardu@amazon.com> wrote:
>
> On Tue, Mar 23, 2021 at 03:26:02PM +0000, Chuck Lever III wrote:
>
>>> Since not all file servers may be able to produce the directory entry type
>>> efficiently, this could be implemented as a mount option that defaults off.
>>
>> Can you say more about the impact of requesting this attribute
>> from servers that cannot efficiently provide it? Which servers
>> and filesystems find it a problem, and how much of a problem is
>> it?
>
> The ability to satisfy a non-plus READDIR by reading just the directory
> pages, instead of having to read all dirent inodes as well, can be worth it
> for certain use cases (especially those with large directories). If a file
> system does not store d_type in the directory, and the client would always
> request the type attribute even for non-plus READDIR, then you lose the
> ability to make this optimization.
>
> From a review of the man pages, most local file systems appear to be able to
> store d_type within the directory, including ext4, xfs and zfs. Both ext4
> and xfs have options to turn this behavior off. If you'd export such a file
> system using nfsd, then this would cause additional IO on the file system if
> we would always request the type attribute.
>
> I do not know how other commercial servers handle this.

"How much of a problem is it" -- I guess what I really want to
see is some quantification of the problem, in numbers.

- Exactly which workloads benefit from having the DT information?
- How much do they improve?
- Which workloads are negatively impacted, and how much?
- How are workloads impacted if the client requests DT
  information from servers that cannot support it efficiently?

Seems to me there will be some caching effects -- there are at
least two caches between the server's persistent storage and the
application. So I expect this will be a complex situation, at
best.

I totally agree that directory operations are a performance and
scalability sore spot for NFS, so I personally am interested in
hearing any and all suggestions in this area. In this case, the
proposed mechanism is intriguing and sensible, but I would suggest
that without measurement data, the proposal seems incomplete so
far.

>> I'd rather avoid adding another administrative knob unless it is
>> absolutely necessary... are there other options for controlling
>> whether the client requests this attribute?
>>
>> For example, is there a way for a server to decide not to provide
>> it if it would be burdensome to do so? ie, the client always asks,
>> but it would be up to the server to provide it if it can do so.
>
> I looked in the RFCs but I am not sure if there is a way today? Both 4.0 and
> 4.1 define "type" as a required attribute that needs to be returned if the
> client asks for it. There also does not appear to be an enum value
> corresponding to DT_UNKNOWN. Were you thinking about something specifically?

I wasn't thinking of a particular protocol mechanism, though that
is certainly a possibility.

I'm more interested in seeing if there are ways to enable the
proposed improvement without adding more administrative complexity.
Yet one more thing that can be set incorrectly and has to be
maintained in perpetuity.

So, alternatives might be:
- Always requesting the DT information
- Leveraging an existing mount option, like lookupcache=
- A sysfs setting or a module parameter
- A heuristic to guess when requesting the information is harmful
- Enabling the request based on directory size or some other static
  feature of the directory
- If this information is of truly great benefit, approaching server
  vendors to support it efficiently, and then have it always enabled
  on clients

Adding an administrative knob means we don't have a good understanding
of how this setting is going to work. As an experimental feature, this
is a great way to go, but for a permanent, long-term thing, let's keep
in mind that client administration is a resource that has to scale
well to cohorts of 100s of thousands of systems. The simpler and more
automatic we can make it, the better off it will be for everyone.

> If there's no way to do this today, then I guess a per-file system attribute
> that indicates support for "can produce file type efficiently when reading a
> directory" would be a relatively clean solution. I presume it would
> require an RFC to define this attribute. Would you have a recommendation given
> your experience with the RFC process?

My recommendation is to look for other alternatives first ;-)

It can't hurt to ask for advice from the nfsv4 working group, but
I would go in armed with some performance numbers.

--
Chuck Lever
* Re: RFC: return d_type for non-plus READDIR
  2021-03-24 13:50 ` Chuck Lever III
@ 2021-03-25 17:26   ` Geert Jansen
  0 siblings, 0 replies; 5+ messages in thread
From: Geert Jansen @ 2021-03-25 17:26 UTC
To: Chuck Lever III; +Cc: Linux NFS Mailing List

Hi Chuck,

On Wed, Mar 24, 2021 at 01:50:52PM +0000, Chuck Lever III wrote:

> "How much of a problem is it" -- I guess what I really want to
> see is some quantification of the problem, in numbers.
>
> - Exactly which workloads benefit from having the DT information?
> - How much do they improve?
> - Which workloads are negatively impacted, and how much?
> - How are workloads impacted if the client requests DT
>   information from servers that cannot support it efficiently?
>
> Seems to me there will be some caching effects -- there are at
> least two caches between the server's persistent storage and the
> application. So I expect this will be a complex situation, at
> best.

Customer applications that would benefit are those that periodically need to
scan a tree with large directories, e.g. to find new files for document
exchange or messaging applications. Most of the apps that I've seen do this
were custom developed. Some standard CLI apps also fall in this category,
including "find" (with no predicates other than for type and name), and
"updatedb".

How much do these improve? I think there are three cases. On EFS:

- Case 1: READDIR returns DT_UNKNOWN. The client needs to do a stat() for
  every entry to get the file type. Throughput is approximately 2K entries/s.
- Case 2: READDIR returns the actual d_type, but the server gets d_type by
  reading the dirent inodes. Throughput is approximately 18K entries/s.
- Case 3: READDIR returns the actual d_type and does not need to read
  inodes. Throughput is 200K entries/s.

(Caveat: EFS does not currently store d_type in our directories, so I did a
related test that should give the same results. For cases 2 and 3, I
measured a regular non-plus READDIR and tested it against two server
configurations, one where the server reads all dirent inodes and just
discards the results, and one where it does not read any inodes.)

If the server stores d_type in its directories, then the only negative
impact that I can think of would be the extra 4 bytes for each dirent in the
NFS response. The exact overhead depends on the file name size, but should
typically be less than 5-7%.

On the other hand, if requesting d_type requires the server to read inodes,
where previously it did not, then there's an 11x throughput regression
(case 3 vs case 2).

Regarding caching, yes, great question. This was something we looked into as
well. In our tests, reading dirent inodes only when needed (i.e. for
READDIRPLUS) got us an overall better cache hit rate, which we attributed to
lower pressure on the cache. That's a second reason why we want to only
request d_type if it's not going to force the server to read all inodes.

> So, alternatives might be:
> - Always requesting the DT information
> - Leveraging an existing mount option, like lookupcache=
> - A sysfs setting or a module parameter
> - A heuristic to guess when requesting the information is harmful
> - Enabling the request based on directory size or some other static
>   feature of the directory
> - If this information is of truly great benefit, approaching server
>   vendors to support it efficiently, and then have it always enabled
>   on clients
>
> Adding an administrative knob means we don't have a good understanding
> of how this setting is going to work. As an experimental feature, this
> is a great way to go, but for a permanent, long-term thing, let's keep
> in mind that client administration is a resource that has to scale
> well to cohorts of 100s of thousands of systems. The simpler and more
> automatic we can make it, the better off it will be for everyone.

Thanks for that! I'd be interested to hear if you think our data above is
compelling enough. Ideally we'd find a way to try this approach
experimentally at first. Whether we can make it a default, or whether we need
a way to discover the capability, would depend on how other server vendors
handle this.

Geert
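As a worked example of the figures reported in this message (a sketch in Python, not part of the original mail), the snippet below computes how long a scan would take in each of the three cases and the ratios between them. The one-million-entry tree and all names are illustrative assumptions; the rates are the approximate throughput numbers quoted in the message.

```python
# Approximate throughput per case, in directory entries per second,
# as reported in the message (case number -> rate).
RATES = {1: 2_000, 2: 18_000, 3: 200_000}


def scan_seconds(entries, case):
    """Time in seconds to list `entries` directory entries in the given case."""
    return entries / RATES[case]


def ratio(fast_case, slow_case):
    """How many times faster fast_case is than slow_case."""
    return RATES[fast_case] / RATES[slow_case]


if __name__ == "__main__":
    entries = 1_000_000  # assumed tree size, for illustration only
    for case in sorted(RATES):
        print(f"case {case}: {scan_seconds(entries, case):.1f} s")
    print(f"case 3 vs case 2: {ratio(3, 2):.1f}x")
    print(f"case 2 vs case 1: {ratio(2, 1):.1f}x")
```

For one million entries this works out to roughly 500 s, 56 s and 5 s for cases 1, 2 and 3. The ratio between cases 3 and 2 is about 11, which is the regression mentioned in the message, and case 2 is about 9 times faster than case 1.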
end of thread, other threads:[~2021-03-25 17:27 UTC | newest]

Thread overview: 5+ messages
2021-03-23  1:00 RFC: return d_type for non-plus READDIR Geert Jansen
2021-03-23 15:26 ` Chuck Lever III
2021-03-24  1:47   ` Geert Jansen
2021-03-24 13:50     ` Chuck Lever III
2021-03-25 17:26       ` Geert Jansen