* [LSF/MM/BPF TOPIC] Design challenges for a new file system that needs to support multiple billions of files
@ 2025-02-02 21:39 RIc Wheeler
2025-02-03 15:22 ` Amir Goldstein
2025-02-03 15:23 ` Ric Wheeler
0 siblings, 2 replies; 8+ messages in thread
From: RIc Wheeler @ 2025-02-02 21:39 UTC (permalink / raw)
To: lsf-pc, linux-fsdevel; +Cc: Zach Brown
I have always been super interested in how far we can push the
scalability limits of file systems, and for the workloads we need to
support, we need to scale up to absolutely ridiculously large numbers
of files (a few billion files doesn't meet the needs of the largest
customers we support).
Zach Brown is leading a new project on ngnfs (FOSDEM talk this year gave
a good background on this -
https://www.fosdem.org/2025/schedule/speaker/zach_brown/). We are
looking at taking advantage of modern low latency NVMe devices and
today's networks to implement a distributed file system that provides
the better concurrency that high object counts need and still has the
bandwidth needed to support the backend archival systems we feed.
ngnfs as a topic would go into the coherence design (and code) that
underpins the increased concurrency it aims to deliver.
It is clear that the project is in its early days compared to most of the
proposed content, but it can be useful to spend some of the time on new ideas.
* Re: [LSF/MM/BPF TOPIC] Design challenges for a new file system that needs to support multiple billions of files
2025-02-02 21:39 [LSF/MM/BPF TOPIC] Design challenges for a new file system that needs to support multiple billions of files RIc Wheeler
@ 2025-02-03 15:22 ` Amir Goldstein
2025-02-03 16:18 ` Ric Wheeler
2025-02-06 18:58 ` Zach Brown
2025-02-03 15:23 ` Ric Wheeler
1 sibling, 2 replies; 8+ messages in thread
From: Amir Goldstein @ 2025-02-03 15:22 UTC (permalink / raw)
To: RIc Wheeler; +Cc: lsf-pc, linux-fsdevel, Zach Brown, Christian Brauner
On Sun, Feb 2, 2025 at 10:40 PM RIc Wheeler <ricwheeler@gmail.com> wrote:
>
>
> I have always been super interested in how much we can push the
> scalability limits of file systems and for the workloads we need to
> support, we need to scale up to supporting absolutely ridiculously large
> numbers of files (a few billion files doesn't meet the need of the
> largest customers we support).
>
Hi Ric,
Since LSFMM is not about presentations, it would be better if the proposed
topic tried to address specific technical questions that developers could
discuss.
If a topic cannot generate a discussion on the list, it is not very likely
that it will generate a discussion on-prem.
Where does scaling the number of files in a filesystem affect existing
filesystems? What are the limitations that you need to overcome?
> Zach Brown is leading a new project on ngnfs (FOSDEM talk this year gave
> a good background on this -
> https://www.fosdem.org/2025/schedule/speaker/zach_brown/). We are
> looking at taking advantage of modern low latency NVME devices and
> today's networks to implement a distributed file system that provides
> better concurrency that high object counts need and still have the
> bandwidth needed to support the backend archival systems we feed.
>
I heard this talk and it was very interesting.
Here's a direct link to the slides for people who may be too lazy to
follow 3 clicks:
https://www.fosdem.org/2025/events/attachments/fosdem-2025-5471-ngnfs-a-distributed-file-system-using-block-granular-consistency/slides/236150/zach-brow_aqVkVuI.pdf
I was both very impressed by the cache coherent rename example
and very puzzled - I do not know any filesystem where rename can be
synchronized on a single block io, and looking up ancestors is usually
done on in-memory dentries, so I may not have understood the example.
> ngnfs as a topic would go into the coherence design (and code) that
> underpins the increased concurrency it aims to deliver.
>
> Clear that the project is in early days compared to most of the proposed
> content, but it can be useful to spend some of the time on new ideas.
>
This sounds like an interesting topic to discuss.
I would love it if you or Zach could share more details on the list so that more
people could participate in the discussion leading to LSFMM.
Also, I think it is important to mention, as you told me, that the server
implementation of ngnfs is GPL, and to provide some pointers, because IMO
this is very important when requesting community feedback on a new filesystem.
Thanks,
Amir.
* Re: [LSF/MM/BPF TOPIC] Design challenges for a new file system that needs to support multiple billions of files
2025-02-02 21:39 [LSF/MM/BPF TOPIC] Design challenges for a new file system that needs to support multiple billions of files RIc Wheeler
2025-02-03 15:22 ` Amir Goldstein
@ 2025-02-03 15:23 ` Ric Wheeler
1 sibling, 0 replies; 8+ messages in thread
From: Ric Wheeler @ 2025-02-03 15:23 UTC (permalink / raw)
To: lsf-pc, linux-fsdevel; +Cc: Zach Brown
On 2/2/25 10:39 PM, RIc Wheeler wrote:
>
> I have always been super interested in how much we can push the
> scalability limits of file systems and for the workloads we need to
> support, we need to scale up to supporting absolutely ridiculously
> large numbers of files (a few billion files doesn't meet the need of
> the largest customers we support).
>
> Zach Brown is leading a new project on ngnfs (FOSDEM talk this year
> gave a good background on this -
> https://www.fosdem.org/2025/schedule/speaker/zach_brown/). We are
> looking at taking advantage of modern low latency NVME devices and
> today's networks to implement a distributed file system that provides
> better concurrency that high object counts need and still have the
> bandwidth needed to support the backend archival systems we feed.
>
> ngnfs as a topic would go into the coherence design (and code) that
> underpins the increased concurrency it aims to deliver.
>
> Clear that the project is in early days compared to most of the
> proposed content, but it can be useful to spend some of the time on
> new ideas.
>
Just adding that all of this work is GPL'ed and we aspire to getting it
upstream.
This is planned to be a core part of future shipping products, so we
intend to fully maintain it going forward.
* Re: [LSF/MM/BPF TOPIC] Design challenges for a new file system that needs to support multiple billions of files
2025-02-03 15:22 ` Amir Goldstein
@ 2025-02-03 16:18 ` Ric Wheeler
2025-02-04 1:47 ` Dave Chinner
2025-02-06 18:58 ` Zach Brown
1 sibling, 1 reply; 8+ messages in thread
From: Ric Wheeler @ 2025-02-03 16:18 UTC (permalink / raw)
To: Amir Goldstein; +Cc: lsf-pc, linux-fsdevel, Zach Brown, Christian Brauner
On 2/3/25 4:22 PM, Amir Goldstein wrote:
> On Sun, Feb 2, 2025 at 10:40 PM RIc Wheeler <ricwheeler@gmail.com> wrote:
>>
>> I have always been super interested in how much we can push the
>> scalability limits of file systems and for the workloads we need to
>> support, we need to scale up to supporting absolutely ridiculously large
>> numbers of files (a few billion files doesn't meet the need of the
>> largest customers we support).
>>
> Hi Ric,
>
> Since LSFMM is not about presentations, it would be better if the topic to
> discuss was trying to address specific technical questions that developers
> could discuss.
Totally agree - from the ancient history of LSF (before MM or BPF!) we
also pushed for discussions over talks.
>
> If a topic cannot generate a discussion on the list, it is not very
> likely that it will
> generate a discussion on-prem.
>
> Where does the scaling with the number of files in a filesystem affect existing
> filesystems? What are the limitations that you need to overcome?
Local file systems like xfs running on "scale up" giant systems (think
of the old super-sized HP Superdomes and the like) would likely handle
this well.
In a lot of ways, ngnfs aims to replicate that scalability for "scale
out" (hate buzz words!) systems that are more affordable. In effect, you
can size your system by just adding more servers with their local NVMe
devices and build up performance and capacity in an incremental way.
Shared disk file systems like scoutfs (also GPL'ed but not upstream)
scale pretty well in file count, but they have coarse-grained locking
that causes performance bumps, plus the added complexity of needing
RAID heads or SAN systems.
>
>> Zach Brown is leading a new project on ngnfs (FOSDEM talk this year gave
>> a good background on this -
>> https://www.fosdem.org/2025/schedule/speaker/zach_brown/). We are
>> looking at taking advantage of modern low latency NVME devices and
>> today's networks to implement a distributed file system that provides
>> better concurrency that high object counts need and still have the
>> bandwidth needed to support the backend archival systems we feed.
>>
> I heard this talk and it was very interesting.
> Here's a direct link to slides from people who may be too lazy to
> follow 3 clicks:
> https://www.fosdem.org/2025/events/attachments/fosdem-2025-5471-ngnfs-a-distributed-file-system-using-block-granular-consistency/slides/236150/zach-brow_aqVkVuI.pdf
>
> I was both very impressed by the cache coherent rename example
> and very puzzled - I do not know any filesystem where rename can be
> synchronized on a single block io, and looking up ancestors is usually
> done on in-memory dentries, so I may not have understood the example.
>
>> ngnfs as a topic would go into the coherence design (and code) that
>> underpins the increased concurrency it aims to deliver.
>>
>> Clear that the project is in early days compared to most of the proposed
>> content, but it can be useful to spend some of the time on new ideas.
>>
> This sounds like an interesting topic to discuss.
> I would love it if you or Zach could share more details on the list so that more
> people could participate in the discussion leading to LSFMM.
>
> Also, I think it is important to mention, as you told me, that the
> server implementation
> of ngnfs is GPL and to provide some pointers, because IMO this is very important
> when requesting community feedback on a new filesystem.
>
> Thanks,
> Amir.
All of ngnfs is GPL'ed - no non-open source client or similar.
Regards,
Ric
* Re: [LSF/MM/BPF TOPIC] Design challenges for a new file system that needs to support multiple billions of files
2025-02-03 16:18 ` Ric Wheeler
@ 2025-02-04 1:47 ` Dave Chinner
2025-02-05 8:05 ` Ric Wheeler
0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2025-02-04 1:47 UTC (permalink / raw)
To: Ric Wheeler
Cc: Amir Goldstein, lsf-pc, linux-fsdevel, Zach Brown,
Christian Brauner
On Mon, Feb 03, 2025 at 05:18:48PM +0100, Ric Wheeler wrote:
>
> On 2/3/25 4:22 PM, Amir Goldstein wrote:
> > On Sun, Feb 2, 2025 at 10:40 PM RIc Wheeler <ricwheeler@gmail.com> wrote:
> > >
> > > I have always been super interested in how much we can push the
> > > scalability limits of file systems and for the workloads we need to
> > > support, we need to scale up to supporting absolutely ridiculously large
> > > numbers of files (a few billion files doesn't meet the need of the
> > > largest customers we support).
> > >
> > Hi Ric,
> >
> > Since LSFMM is not about presentations, it would be better if the topic to
> > discuss was trying to address specific technical questions that developers
> > could discuss.
>
> Totally agree - from the ancient history of LSF (before MM or BPF!) we also
> pushed for discussions over talks.
>
> >
> > If a topic cannot generate a discussion on the list, it is not very
> > likely that it will
> > generate a discussion on-prem.
> >
> > Where does the scaling with the number of files in a filesystem affect existing
> > filesystems? What are the limitations that you need to overcome?
>
> Local file systems like xfs running on "scale up" giant systems (think of
> the old super sized HP Superdomes and the like) would be likely to handle
> this well.
We don't need "Big Iron" hardware to scale up to tens of billions of
files in a single filesystem these days. A cheap server with 32p and
a couple of hundred GB of RAM and a few NVMe SSDs is all that is
really needed. We recently had an XFS user report over 16 billion
files in a relatively small filesystem (a few tens of TB), most of
which were reflink copied files (backup/archival storage farm).
So, yeah, large file counts (i.e. tens of billions) in production
systems aren't a big deal these days. There shouldn't be any
specific issues at the OS/VFS layers supporting filesystems with
inode counts in the billions - most of the problems with this are
internal filesystem implementation issues. If there are any specific
VFS level scalability issues you've come across, I'm all ears...
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [LSF/MM/BPF TOPIC] Design challenges for a new file system that needs to support multiple billions of files
2025-02-04 1:47 ` Dave Chinner
@ 2025-02-05 8:05 ` Ric Wheeler
0 siblings, 0 replies; 8+ messages in thread
From: Ric Wheeler @ 2025-02-05 8:05 UTC (permalink / raw)
To: Dave Chinner
Cc: Amir Goldstein, lsf-pc, linux-fsdevel, Zach Brown,
Christian Brauner
On 2/4/25 2:47 AM, Dave Chinner wrote:
> On Mon, Feb 03, 2025 at 05:18:48PM +0100, Ric Wheeler wrote:
>> On 2/3/25 4:22 PM, Amir Goldstein wrote:
>>> On Sun, Feb 2, 2025 at 10:40 PM RIc Wheeler <ricwheeler@gmail.com> wrote:
>>>> I have always been super interested in how much we can push the
>>>> scalability limits of file systems and for the workloads we need to
>>>> support, we need to scale up to supporting absolutely ridiculously large
>>>> numbers of files (a few billion files doesn't meet the need of the
>>>> largest customers we support).
>>>>
>>> Hi Ric,
>>>
>>> Since LSFMM is not about presentations, it would be better if the topic to
>>> discuss was trying to address specific technical questions that developers
>>> could discuss.
>> Totally agree - from the ancient history of LSF (before MM or BPF!) we also
>> pushed for discussions over talks.
>>
>>> If a topic cannot generate a discussion on the list, it is not very
>>> likely that it will
>>> generate a discussion on-prem.
>>>
>>> Where does the scaling with the number of files in a filesystem affect existing
>>> filesystems? What are the limitations that you need to overcome?
>> Local file systems like xfs running on "scale up" giant systems (think of
>> the old super sized HP Superdomes and the like) would be likely to handle
>> this well.
> We don't need "Big Iron" hardware to scale up to tens of billions of
> files in a single filesystem these days. A cheap server with 32p and
> a couple of hundred GB of RAM and a few NVMe SSDs is all that is
> really needed. We recently had an XFS user report over 16 billion
> files in a relatively small filesystem (a few tens of TB), most of
> which were reflink copied files (backup/archival storage farm).
>
> So, yeah, large file counts (i.e. tens of billions) in production
> systems aren't a big deal these days. There shouldn't be any
> specific issues at the OS/VFS layers supporting filesystems with
> inode counts in the billions - most of the problems with this are
> internal filesystem implementation issues. If there are any specific
> VFS level scalability issues you've come across, I'm all ears...
>
> -Dave.
I remember fondly torturing xfs (and ext4 and btrfs) many years back
with a billion small (empty) files on a SATA drive :)
For our workload though, we have a couple of requirements that prevent
most customers from using a single server.
The first requirement is the need to keep a scary number of large tape
drives/robots running at line rate - keeping all of those busy normally
requires on the order of 5 servers with our existing stack, but larger
systems can need more.
The second requirement is the need for high availability - that led us to
using a shared-disk-backed file system (scoutfs) - but others in this
space have used cxfs and similar non-open source file systems. The
shared disk/cluster file systems are where the coarse-grained locking
comes into conflict with concurrency.
What ngnfs is driving towards is being able to meet that bandwidth
requirement for the backend archival workflow and support the many
billions of file objects in a high-availability system built with
today's cutting edge components. Zach will jump in once he gets back,
but my hand-wavy way of thinking of this is that ngnfs as a distributed
file system is closer in design to how xfs would run on a huge system
with coherence between NUMA zones.
regards,
Ric
* Re: [LSF/MM/BPF TOPIC] Design challenges for a new file system that needs to support multiple billions of files
2025-02-03 15:22 ` Amir Goldstein
2025-02-03 16:18 ` Ric Wheeler
@ 2025-02-06 18:58 ` Zach Brown
2025-02-06 23:36 ` Andreas Dilger
1 sibling, 1 reply; 8+ messages in thread
From: Zach Brown @ 2025-02-06 18:58 UTC (permalink / raw)
To: Amir Goldstein; +Cc: RIc Wheeler, lsf-pc, linux-fsdevel, Christian Brauner
(Yay, back from travel!)
On Mon, Feb 03, 2025 at 04:22:59PM +0100, Amir Goldstein wrote:
> On Sun, Feb 2, 2025 at 10:40 PM RIc Wheeler <ricwheeler@gmail.com> wrote:
> >
> > Zach Brown is leading a new project on ngnfs (FOSDEM talk this year gave
> > a good background on this -
> > https://www.fosdem.org/2025/schedule/speaker/zach_brown/). We are
> > looking at taking advantage of modern low latency NVME devices and
> > today's networks to implement a distributed file system that provides
> > better concurrency that high object counts need and still have the
> > bandwidth needed to support the backend archival systems we feed.
> >
>
> I heard this talk and it was very interesting.
> Here's a direct link to the slides for people who may be too lazy to
> follow 3 clicks:
> https://www.fosdem.org/2025/events/attachments/fosdem-2025-5471-ngnfs-a-distributed-file-system-using-block-granular-consistency/slides/236150/zach-brow_aqVkVuI.pdf
>
> I was both very impressed by the cache coherent rename example
> and very puzzled - I do not know any filesystem where rename can be
> synchronized on a single block io, and looking up ancestors is usually
> done on in-memory dentries, so I may not have understood the example.
The meat of that talk was about how ngnfs uses its distributed block
cache as a serializing/coherence/consistency mechanism. That specific
example was about how we can get concurrent rename between different
mounts without needing some global equivalent of a rename mutex.
The core of the mechanism is that code paths that implement operations
have a transactional object that holds on to cached block references
which have a given access mode granted over the network. In this rename
case, the ancestor walk holds on to all the blocks for the duration of
the walk. (Can be a lot of blocks!) If another mount somewhere else
tried to modify those ancestor blocks, that mount would need to revoke
the cached read access to be granted its write access. That'd wait
for the first rename to finish and release the read refs. This gives us
specific serialization of access to the blocks in question rather than
relying on a global serializing object over all renames.
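In rough C-ish pseudocode (every name below is invented for this sketch,
it is not the actual ngnfs code or API), the shape of it is something
like:

/*
 * Sketch only: an operation's transaction pins cached block refs at a
 * granted access mode; a conflicting mount has to wait for those refs
 * to be released before its own mode can be granted.
 */
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

enum blk_mode { BLK_READ, BLK_WRITE };

struct blk_ref {
	uint64_t blkno;		/* block this operation is holding */
	enum blk_mode mode;	/* access mode granted over the network */
};

#define TXN_MAX_REFS 64

struct blk_txn {
	struct blk_ref refs[TXN_MAX_REFS];	/* everything we've pinned */
	int nr_refs;
};

/*
 * Placeholders for the network side: txn_get_block() blocks until the
 * requested mode is granted (which may mean another mount's conflicting
 * cached refs get revoked first); txn_release() drops all of our refs
 * so that pending revokes against us can complete.
 */
int txn_get_block(struct blk_txn *txn, uint64_t blkno, enum blk_mode mode);
void txn_release(struct blk_txn *txn);
uint64_t dir_parent_block(struct blk_txn *txn, uint64_t blkno);
bool is_root_block(uint64_t blkno);

/*
 * The cross-directory rename ancestor walk: every ancestor of the
 * destination is pinned with a read ref for the duration of the walk,
 * so another mount can't re-parent one of them underneath us.  Only a
 * writer to exactly one of these blocks has to wait for us.
 */
int rename_ancestor_walk(struct blk_txn *txn, uint64_t dst_dir,
			 uint64_t src_dir)
{
	uint64_t blkno = dst_dir;

	while (!is_root_block(blkno)) {
		int err = txn_get_block(txn, blkno, BLK_READ);

		if (err)
			return err;
		if (blkno == src_dir)
			return -EINVAL;	/* rename would create a loop */
		blkno = dir_parent_block(txn, blkno);
	}
	return 0;
}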
That's the idea, anyway. I'm implementing the first bits of this now.
It's sort of a silly example, because who puts cross-directory rename in
the fast path? (Historically some s3<->posix servers implemented
CompleteMultipartUpload by renaming from tmp dirs to visible bucket
dirs, hrmph). But it illustrates the pattern of shrinking contention
down to the block level.
- z
* Re: [LSF/MM/BPF TOPIC] Design challenges for a new file system that needs to support multiple billions of files
2025-02-06 18:58 ` Zach Brown
@ 2025-02-06 23:36 ` Andreas Dilger
0 siblings, 0 replies; 8+ messages in thread
From: Andreas Dilger @ 2025-02-06 23:36 UTC (permalink / raw)
To: Zach Brown
Cc: Amir Goldstein, RIc Wheeler, lsf-pc, linux-fsdevel,
Christian Brauner
Lustre has production filesystems with hundreds of billions of files today, with
coherent renames running across dozens of servers.
We've relaxed the rename locking at the server to allow concurrent renames for regular
files within the same server, and for directories that stay within the same parent (so they
cannot break the namespace hierarchy). They are still subject to the VFS serialization on a
single client node, but hopefully Neil's parallel dirops patch will eventually land.
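To spell the relaxation out (a simplified sketch of just the policy, with
invented names, not the actual Lustre code), the server-side check amounts
to something like:

#include <stdbool.h>
#include <stdint.h>

/* Illustrative only -- these are not Lustre's actual structures. */
struct rename_req {
	uint64_t src_parent;	/* source directory */
	uint64_t dst_parent;	/* destination directory */
	bool	 is_dir;	/* is the renamed object a directory? */
};

/*
 * Concurrent renames are allowed when they cannot change the shape of
 * the namespace hierarchy: regular files handled within one server, and
 * directories that keep the same parent.  Everything else still takes
 * the big rename lock.
 */
bool rename_allows_concurrency(const struct rename_req *req,
			       bool same_server)
{
	if (!req->is_dir)
		return same_server;
	return req->src_parent == req->dst_parent;
}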
Cheers, Andreas
> On Feb 6, 2025, at 13:59, Zach Brown <zab@zabbo.net> wrote:
>
>
> (Yay, back from travel!)
>
>> On Mon, Feb 03, 2025 at 04:22:59PM +0100, Amir Goldstein wrote:
>>> On Sun, Feb 2, 2025 at 10:40 PM RIc Wheeler <ricwheeler@gmail.com> wrote:
>>>
>>> Zach Brown is leading a new project on ngnfs (FOSDEM talk this year gave
>>> a good background on this -
>>> https://www.fosdem.org/2025/schedule/speaker/zach_brown/). We are
>>> looking at taking advantage of modern low latency NVME devices and
>>> today's networks to implement a distributed file system that provides
>>> better concurrency that high object counts need and still have the
>>> bandwidth needed to support the backend archival systems we feed.
>>>
>>
>> I heard this talk and it was very interesting.
>> Here's a direct link to the slides for people who may be too lazy to
>> follow 3 clicks:
>> https://www.fosdem.org/2025/events/attachments/fosdem-2025-5471-ngnfs-a-distributed-file-system-using-block-granular-consistency/slides/236150/zach-brow_aqVkVuI.pdf
>>
>> I was both very impressed by the cache coherent rename example
>> and very puzzled - I do not know any filesystem where rename can be
>> synchronized on a single block io, and looking up ancestors is usually
>> done on in-memory dentries, so I may not have understood the example.
>
> The meat of that talk was about how ngnfs uses its distributed block
> cache as a serializing/coherence/consistency mechanism. That specific
> example was about how we can get concurrent rename between different
> mounts without needing some global equivalent of a rename mutex.
>
> The core of the mechanism is that code paths that implement operations
> have a transactional object that holds on to cached block references
> which have a given access mode granted over the network. In this rename
> case, the ancestor walk holds on to all the blocks for the duration of
> the walk. (Can be a lot of blocks!). If another mount somewhere else
> tried to modify those ancestor blocks, that mount would need to revoke
> the cached read access to be granted their write access. That'd wait
> for the first rename to finish and release the read refs. This gives us
> specific serialization of access to the blocks in question rather than
> relying on a global serializing object over all renames.
>
> That's the idea, anyway. I'm implementing the first bits of this now.
>
> It's sort of a silly example, because who puts cross-directory rename in
> the fast path? (Historically some s3<->posix servers implemented
> CompleteMultipartUpload by renaming from tmp dirs to visible bucket
> dirs, hrmph). But it illustrates the pattern of shrinking contention
> down to the block level.
>
> - z
>
Thread overview: 8+ messages
2025-02-02 21:39 [LSF/MM/BPF TOPIC] Design challenges for a new file system that needs to support multiple billions of files RIc Wheeler
2025-02-03 15:22 ` Amir Goldstein
2025-02-03 16:18 ` Ric Wheeler
2025-02-04 1:47 ` Dave Chinner
2025-02-05 8:05 ` Ric Wheeler
2025-02-06 18:58 ` Zach Brown
2025-02-06 23:36 ` Andreas Dilger
2025-02-03 15:23 ` Ric Wheeler