* [LSF/MM/BPF TOPIC] Design challenges for a new file system that needs to support multiple billions of file
From: RIc Wheeler @ 2025-02-02 21:39 UTC
To: lsf-pc, linux-fsdevel; +Cc: Zach Brown

I have always been super interested in how far we can push the scalability limits of file systems, and for the workloads we need to support, we need to scale up to absolutely ridiculously large numbers of files (a few billion files doesn't meet the needs of the largest customers we support).

Zach Brown is leading a new project, ngnfs (his FOSDEM talk this year gives a good background on it - https://www.fosdem.org/2025/schedule/speaker/zach_brown/). We are looking at taking advantage of modern low-latency NVMe devices and today's networks to implement a distributed file system that provides the concurrency that high object counts need and still has the bandwidth needed to support the backend archival systems we feed.

ngnfs as a topic would go into the coherence design (and code) that underpins the increased concurrency it aims to deliver.

The project is clearly in its early days compared to most of the proposed content, but it can be useful to spend some of the time on new ideas.
* Re: [LSF/MM/BPF TOPIC] Design challenges for a new file system that needs to support multiple billions of file
From: Amir Goldstein @ 2025-02-03 15:22 UTC
To: RIc Wheeler; +Cc: lsf-pc, linux-fsdevel, Zach Brown, Christian Brauner

On Sun, Feb 2, 2025 at 10:40 PM RIc Wheeler <ricwheeler@gmail.com> wrote:
>
> I have always been super interested in how far we can push the scalability limits of file systems, and for the workloads we need to support, we need to scale up to absolutely ridiculously large numbers of files (a few billion files doesn't meet the needs of the largest customers we support).
>

Hi Ric,

Since LSFMM is not about presentations, it would be better if the proposed topic tried to address specific technical questions that developers could discuss.

If a topic cannot generate a discussion on the list, it is not very likely that it will generate a discussion on-prem.

Where does scaling the number of files in a filesystem affect existing filesystems? What are the limitations that you need to overcome?

> Zach Brown is leading a new project, ngnfs (his FOSDEM talk this year gives a good background on it - https://www.fosdem.org/2025/schedule/speaker/zach_brown/). We are looking at taking advantage of modern low-latency NVMe devices and today's networks to implement a distributed file system that provides the concurrency that high object counts need and still has the bandwidth needed to support the backend archival systems we feed.
>

I heard this talk and it was very interesting. Here's a direct link to the slides for people who may be too lazy to follow 3 clicks:
https://www.fosdem.org/2025/events/attachments/fosdem-2025-5471-ngnfs-a-distributed-file-system-using-block-granular-consistency/slides/236150/zach-brow_aqVkVuI.pdf

I was both very impressed by the cache coherent rename example and very puzzled - I do not know any filesystem where rename can be synchronized on a single block IO, and looking up ancestors is usually done on in-memory dentries, so I may not have understood the example.

> ngnfs as a topic would go into the coherence design (and code) that underpins the increased concurrency it aims to deliver.
>
> The project is clearly in its early days compared to most of the proposed content, but it can be useful to spend some of the time on new ideas.
>

This sounds like an interesting topic to discuss. I would love it if you or Zach could share more details on the list so that more people could participate in the discussion leading up to LSFMM.

Also, I think it is important to mention, as you told me, that the server implementation of ngnfs is GPL, and to provide some pointers, because IMO this is very important when requesting community feedback on a new filesystem.

Thanks,
Amir.
* Re: [LSF/MM/BPF TOPIC] Design challenges for a new file system that needs to support multiple billions of file
From: Ric Wheeler @ 2025-02-03 16:18 UTC
To: Amir Goldstein; +Cc: lsf-pc, linux-fsdevel, Zach Brown, Christian Brauner

On 2/3/25 4:22 PM, Amir Goldstein wrote:
> On Sun, Feb 2, 2025 at 10:40 PM RIc Wheeler <ricwheeler@gmail.com> wrote:
>>
>> I have always been super interested in how far we can push the scalability limits of file systems, and for the workloads we need to support, we need to scale up to absolutely ridiculously large numbers of files (a few billion files doesn't meet the needs of the largest customers we support).
>>
> Hi Ric,
>
> Since LSFMM is not about presentations, it would be better if the proposed topic tried to address specific technical questions that developers could discuss.

Totally agree - from the ancient history of LSF (before MM or BPF!) we also pushed for discussions over talks.

> If a topic cannot generate a discussion on the list, it is not very likely that it will generate a discussion on-prem.
>
> Where does scaling the number of files in a filesystem affect existing filesystems? What are the limitations that you need to overcome?

Local file systems like xfs running on "scale up" giant systems (think of the old super-sized HP Superdomes and the like) would likely handle this well. In a lot of ways, ngnfs means to replicate that scalability for "scale out" (hate buzz words!) systems that are more affordable. In effect, you can size your system by just adding more servers with their local NVMe devices and build up performance and capacity incrementally.

Shared disk file systems like scoutfs (also GPL'ed but not upstream) scale pretty well in file count, but they have coarse-grained locking that causes performance bumps, plus the added complexity of needing RAID heads or SAN systems.

>> Zach Brown is leading a new project, ngnfs (his FOSDEM talk this year gives a good background on it - https://www.fosdem.org/2025/schedule/speaker/zach_brown/). We are looking at taking advantage of modern low-latency NVMe devices and today's networks to implement a distributed file system that provides the concurrency that high object counts need and still has the bandwidth needed to support the backend archival systems we feed.
>>
> I heard this talk and it was very interesting. Here's a direct link to the slides for people who may be too lazy to follow 3 clicks:
> https://www.fosdem.org/2025/events/attachments/fosdem-2025-5471-ngnfs-a-distributed-file-system-using-block-granular-consistency/slides/236150/zach-brow_aqVkVuI.pdf
>
> I was both very impressed by the cache coherent rename example and very puzzled - I do not know any filesystem where rename can be synchronized on a single block IO, and looking up ancestors is usually done on in-memory dentries, so I may not have understood the example.
>
>> ngnfs as a topic would go into the coherence design (and code) that underpins the increased concurrency it aims to deliver.
>>
>> The project is clearly in its early days compared to most of the proposed content, but it can be useful to spend some of the time on new ideas.
>>
> This sounds like an interesting topic to discuss. I would love it if you or Zach could share more details on the list so that more people could participate in the discussion leading up to LSFMM.
>
> Also, I think it is important to mention, as you told me, that the server implementation of ngnfs is GPL, and to provide some pointers, because IMO this is very important when requesting community feedback on a new filesystem.
>
> Thanks,
> Amir.

All of ngnfs is GPL'ed - no non-open-source client or similar.

Regards,

Ric
* Re: [LSF/MM/BPF TOPIC] Design challenges for a new file system that needs to support multiple billions of file
From: Dave Chinner @ 2025-02-04 1:47 UTC
To: Ric Wheeler
Cc: Amir Goldstein, lsf-pc, linux-fsdevel, Zach Brown, Christian Brauner

On Mon, Feb 03, 2025 at 05:18:48PM +0100, Ric Wheeler wrote:
> On 2/3/25 4:22 PM, Amir Goldstein wrote:
> > On Sun, Feb 2, 2025 at 10:40 PM RIc Wheeler <ricwheeler@gmail.com> wrote:
> > >
> > > I have always been super interested in how far we can push the scalability limits of file systems, and for the workloads we need to support, we need to scale up to absolutely ridiculously large numbers of files (a few billion files doesn't meet the needs of the largest customers we support).
> > >
> > Hi Ric,
> >
> > Since LSFMM is not about presentations, it would be better if the proposed topic tried to address specific technical questions that developers could discuss.
>
> Totally agree - from the ancient history of LSF (before MM or BPF!) we also pushed for discussions over talks.
>
> > If a topic cannot generate a discussion on the list, it is not very likely that it will generate a discussion on-prem.
> >
> > Where does scaling the number of files in a filesystem affect existing filesystems? What are the limitations that you need to overcome?
>
> Local file systems like xfs running on "scale up" giant systems (think of the old super-sized HP Superdomes and the like) would likely handle this well.

We don't need "Big Iron" hardware to scale up to tens of billions of files in a single filesystem these days. A cheap server with 32p, a couple of hundred GB of RAM and a few NVMe SSDs is all that is really needed. We recently had an XFS user report over 16 billion files in a relatively small filesystem (a few tens of TB), most of which were reflink copied files (backup/archival storage farm).

So, yeah, large file counts (i.e. tens of billions) in production systems aren't a big deal these days. There shouldn't be any specific issues at the OS/VFS layers supporting filesystems with inode counts in the billions - most of the problems with this are internal filesystem implementation issues. If there are any specific VFS level scalability issues you've come across, I'm all ears...

-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [LSF/MM/BPF TOPIC] Design challenges for a new file system that needs to support multiple billions of file
From: Ric Wheeler @ 2025-02-05 8:05 UTC
To: Dave Chinner
Cc: Amir Goldstein, lsf-pc, linux-fsdevel, Zach Brown, Christian Brauner

On 2/4/25 2:47 AM, Dave Chinner wrote:
> On Mon, Feb 03, 2025 at 05:18:48PM +0100, Ric Wheeler wrote:
>> On 2/3/25 4:22 PM, Amir Goldstein wrote:
>>> On Sun, Feb 2, 2025 at 10:40 PM RIc Wheeler <ricwheeler@gmail.com> wrote:
>>>> I have always been super interested in how far we can push the scalability limits of file systems, and for the workloads we need to support, we need to scale up to absolutely ridiculously large numbers of files (a few billion files doesn't meet the needs of the largest customers we support).
>>>>
>>> Hi Ric,
>>>
>>> Since LSFMM is not about presentations, it would be better if the proposed topic tried to address specific technical questions that developers could discuss.
>> Totally agree - from the ancient history of LSF (before MM or BPF!) we also pushed for discussions over talks.
>>
>>> If a topic cannot generate a discussion on the list, it is not very likely that it will generate a discussion on-prem.
>>>
>>> Where does scaling the number of files in a filesystem affect existing filesystems? What are the limitations that you need to overcome?
>> Local file systems like xfs running on "scale up" giant systems (think of the old super-sized HP Superdomes and the like) would likely handle this well.
> We don't need "Big Iron" hardware to scale up to tens of billions of files in a single filesystem these days. A cheap server with 32p, a couple of hundred GB of RAM and a few NVMe SSDs is all that is really needed. We recently had an XFS user report over 16 billion files in a relatively small filesystem (a few tens of TB), most of which were reflink copied files (backup/archival storage farm).
>
> So, yeah, large file counts (i.e. tens of billions) in production systems aren't a big deal these days. There shouldn't be any specific issues at the OS/VFS layers supporting filesystems with inode counts in the billions - most of the problems with this are internal filesystem implementation issues. If there are any specific VFS level scalability issues you've come across, I'm all ears...
>
> -Dave.

I remember fondly torturing xfs (and ext4 and btrfs) many years back with a billion small (empty) files on a sata drive :)

For our workload though, we have a couple of requirements that prevent most customers from using a single server.

The first requirement is the need to keep a scary number of large tape drives/robots running at line rate - keeping all of those busy normally requires on the order of 5 servers with our existing stack, but larger systems can need more.

The second requirement is the need for high availability - that led us to using a shared-disk-backed file system (scoutfs), but others in this space have used cxfs and similar non-open-source file systems. The shared disk/cluster file systems are where the coarse-grained locking comes into conflict with concurrency.

What ngnfs is driving towards is being able to meet that bandwidth requirement for the backend archival workflow and support many billions of file objects in a high availability system built from today's cutting edge components.

Zach will jump in once he gets back, but my hand-wavy way of thinking of this is that ngnfs as a distributed file system is closer in design to how xfs would run on a huge system with coherence between NUMA zones.

Regards,

Ric
* Re: [LSF/MM/BPF TOPIC] Design challenges for a new file system that needs to support multiple billions of file
From: Zach Brown @ 2025-02-06 18:58 UTC
To: Amir Goldstein; +Cc: RIc Wheeler, lsf-pc, linux-fsdevel, Christian Brauner

(Yay, back from travel!)

On Mon, Feb 03, 2025 at 04:22:59PM +0100, Amir Goldstein wrote:
> On Sun, Feb 2, 2025 at 10:40 PM RIc Wheeler <ricwheeler@gmail.com> wrote:
> >
> > Zach Brown is leading a new project, ngnfs (his FOSDEM talk this year gives a good background on it - https://www.fosdem.org/2025/schedule/speaker/zach_brown/). We are looking at taking advantage of modern low-latency NVMe devices and today's networks to implement a distributed file system that provides the concurrency that high object counts need and still has the bandwidth needed to support the backend archival systems we feed.
> >
>
> I heard this talk and it was very interesting. Here's a direct link to the slides for people who may be too lazy to follow 3 clicks:
> https://www.fosdem.org/2025/events/attachments/fosdem-2025-5471-ngnfs-a-distributed-file-system-using-block-granular-consistency/slides/236150/zach-brow_aqVkVuI.pdf
>
> I was both very impressed by the cache coherent rename example and very puzzled - I do not know any filesystem where rename can be synchronized on a single block IO, and looking up ancestors is usually done on in-memory dentries, so I may not have understood the example.

The meat of that talk was about how ngnfs uses its distributed block cache as a serializing/coherence/consistency mechanism. That specific example was about how we can get concurrent rename between different mounts without needing some global equivalent of a rename mutex.

The core of the mechanism is that the code paths that implement operations have a transactional object that holds on to cached block references, each with a given access mode granted over the network. In the rename case, the ancestor walk holds on to all the blocks for the duration of the walk. (Can be a lot of blocks!) If another mount somewhere else tried to modify those ancestor blocks, it would have to have our cached read access revoked before it could be granted write access, and that would wait for the first rename to finish and release the read refs. This gives us specific serialization of access to the blocks in question rather than relying on a global serializing object over all renames.

That's the idea, anyway. I'm implementing the first bits of this now.

It's sort of a silly example, because who puts cross-directory rename in the fast path? (Historically some s3<->posix servers implemented CompleteMultipartUpload by renaming from tmp dirs to visible bucket dirs, hrmph.) But it illustrates the pattern of shrinking contention down to the block level.

- z
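To make that pattern concrete, here is a rough, illustrative-only sketch in C of the kind of interface being described. All of the names here (struct op_txn, txn_get_block(), the directory helpers) are hypothetical, not the actual ngnfs code; the point is only that an operation pins cached block references at a granted access mode and that conflicting access from another mount is granted only after those refs are revoked and released.

/*
 * Illustrative sketch only - hypothetical names, not ngnfs interfaces.
 * An operation pins cached block refs at a granted access mode;
 * conflicting access elsewhere is granted only after the refs are
 * released.
 */
#include <stdint.h>
#include <errno.h>

enum blk_mode { BLK_READ, BLK_WRITE };

struct blk_ref {
	uint64_t blkno;
	enum blk_mode mode;	/* access mode granted over the network */
	struct blk_ref *next;
};

struct op_txn {
	struct blk_ref *held;	/* block refs pinned by this operation */
};

/* Ask for access to a block; may wait while conflicting cached access
 * held by other mounts is revoked. */
int txn_get_block(struct op_txn *txn, uint64_t blkno, enum blk_mode mode);

/* Release all pinned refs; only now can another mount be granted
 * conflicting (write) access to the blocks this operation held. */
void txn_release(struct op_txn *txn);

/* Directory-tree helpers, assumed to read the now-pinned blocks. */
uint64_t dir_parent_block(uint64_t blkno);
int dir_is_root_block(uint64_t blkno);

/*
 * Cross-directory rename check: walk the destination's ancestors while
 * holding read refs on every ancestor block, so a concurrent rename on
 * another mount cannot re-link an ancestor underneath the walk.  The
 * refs stay pinned until the whole rename calls txn_release().
 */
int rename_would_loop(struct op_txn *txn, uint64_t dst_dir, uint64_t src_dir)
{
	uint64_t blk = dst_dir;

	while (!dir_is_root_block(blk)) {
		int err = txn_get_block(txn, blk, BLK_READ);
		if (err)
			return err;
		if (blk == src_dir)
			return -EINVAL;	/* src is an ancestor of dst */
		blk = dir_parent_block(blk);
	}
	return 0;
}

In this sketch, two renames whose ancestor walks touch disjoint sets of blocks never contend at all, which is where the concurrency win over a single global rename lock would come from.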
* Re: [LSF/MM/BPF TOPIC] Design challenges for a new file system that needs to support multiple billions of file
From: Andreas Dilger @ 2025-02-06 23:36 UTC
To: Zach Brown
Cc: Amir Goldstein, RIc Wheeler, lsf-pc, linux-fsdevel, Christian Brauner

Lustre has production filesystems with hundreds of billions of files today, with coherent renames running across dozens of servers.

We've relaxed the rename locking at the server to allow concurrent renames for regular files within the same server, and for directories that stay within the same parent (so they cannot break the namespace hierarchy). They are still subject to the VFS serialization on a single client node, but hopefully Neil's parallel dirops patch will eventually land.

Cheers, Andreas

> On Feb 6, 2025, at 13:59, Zach Brown <zab@zabbo.net> wrote:
>
> (Yay, back from travel!)
>
>> On Mon, Feb 03, 2025 at 04:22:59PM +0100, Amir Goldstein wrote:
>>> On Sun, Feb 2, 2025 at 10:40 PM RIc Wheeler <ricwheeler@gmail.com> wrote:
>>>
>>> Zach Brown is leading a new project, ngnfs (his FOSDEM talk this year gives a good background on it - https://www.fosdem.org/2025/schedule/speaker/zach_brown/). We are looking at taking advantage of modern low-latency NVMe devices and today's networks to implement a distributed file system that provides the concurrency that high object counts need and still has the bandwidth needed to support the backend archival systems we feed.
>>>
>> I heard this talk and it was very interesting. Here's a direct link to the slides for people who may be too lazy to follow 3 clicks:
>> https://www.fosdem.org/2025/events/attachments/fosdem-2025-5471-ngnfs-a-distributed-file-system-using-block-granular-consistency/slides/236150/zach-brow_aqVkVuI.pdf
>>
>> I was both very impressed by the cache coherent rename example and very puzzled - I do not know any filesystem where rename can be synchronized on a single block IO, and looking up ancestors is usually done on in-memory dentries, so I may not have understood the example.
>
> The meat of that talk was about how ngnfs uses its distributed block cache as a serializing/coherence/consistency mechanism. That specific example was about how we can get concurrent rename between different mounts without needing some global equivalent of a rename mutex.
>
> The core of the mechanism is that the code paths that implement operations have a transactional object that holds on to cached block references, each with a given access mode granted over the network. In the rename case, the ancestor walk holds on to all the blocks for the duration of the walk. (Can be a lot of blocks!) If another mount somewhere else tried to modify those ancestor blocks, it would have to have our cached read access revoked before it could be granted write access, and that would wait for the first rename to finish and release the read refs. This gives us specific serialization of access to the blocks in question rather than relying on a global serializing object over all renames.
>
> That's the idea, anyway. I'm implementing the first bits of this now.
>
> It's sort of a silly example, because who puts cross-directory rename in the fast path? (Historically some s3<->posix servers implemented CompleteMultipartUpload by renaming from tmp dirs to visible bucket dirs, hrmph.) But it illustrates the pattern of shrinking contention down to the block level.
>
> - z
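For comparison, a tiny illustrative-only sketch of the relaxed policy Andreas describes - this is hypothetical code, not Lustre's implementation, and the names are made up: only a directory rename that moves between parents can change the namespace hierarchy, so only that case would need the globally serializing lock.

/*
 * Illustrative only - hypothetical names, not Lustre code.  Under the
 * relaxed policy described above, regular-file renames and same-parent
 * directory renames can run concurrently under finer-grained locks;
 * only a directory moving between parents needs the global lock.
 */
#include <stdbool.h>
#include <stdint.h>

struct rename_req {
	bool src_is_dir;	/* renaming a directory? */
	uint64_t src_parent;	/* opaque ids of the parent directories */
	uint64_t dst_parent;
};

static bool rename_needs_global_lock(const struct rename_req *r)
{
	if (!r->src_is_dir)
		return false;	/* regular file: per-object locks suffice */
	return r->src_parent != r->dst_parent;	/* dir changes parent */
}

Everything that returns false here could proceed in parallel; only hierarchy-changing directory renames would serialize.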
* Re: [LSF/MM/BPF TOPIC] Design challenges for a new file system that needs to support multiple billions of file
From: Ric Wheeler @ 2025-02-03 15:23 UTC
To: lsf-pc, linux-fsdevel; +Cc: Zach Brown

On 2/2/25 10:39 PM, RIc Wheeler wrote:
>
> I have always been super interested in how far we can push the scalability limits of file systems, and for the workloads we need to support, we need to scale up to absolutely ridiculously large numbers of files (a few billion files doesn't meet the needs of the largest customers we support).
>
> Zach Brown is leading a new project, ngnfs (his FOSDEM talk this year gives a good background on it - https://www.fosdem.org/2025/schedule/speaker/zach_brown/). We are looking at taking advantage of modern low-latency NVMe devices and today's networks to implement a distributed file system that provides the concurrency that high object counts need and still has the bandwidth needed to support the backend archival systems we feed.
>
> ngnfs as a topic would go into the coherence design (and code) that underpins the increased concurrency it aims to deliver.
>
> The project is clearly in its early days compared to most of the proposed content, but it can be useful to spend some of the time on new ideas.
>

Just adding that all of this work is GPL'ed and we aspire to get it upstream. It is planned to be a core part of future shipping products, so we intend to fully maintain it going forward.
Thread overview: 8+ messages

2025-02-02 21:39 [LSF/MM/BPF TOPIC] Design challenges for a new file system that needs to support multiple billions of file  RIc Wheeler
2025-02-03 15:22 ` Amir Goldstein
2025-02-03 16:18   ` Ric Wheeler
2025-02-04  1:47     ` Dave Chinner
2025-02-05  8:05       ` Ric Wheeler
2025-02-06 18:58   ` Zach Brown
2025-02-06 23:36     ` Andreas Dilger
2025-02-03 15:23 ` Ric Wheeler