From: Ric Wheeler <ricwheeler@gmail.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Amir Goldstein <amir73il@gmail.com>,
	lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
	Zach Brown <zab@zabbo.net>,
	Christian Brauner <brauner@kernel.org>
Subject: Re: [LSF/MM/BPF TOPIC] Design challenges for a new file system that needs to support multiple billions of files
Date: Wed, 5 Feb 2025 09:05:11 +0100
Message-ID: <64ef645b-4d7e-4eb0-a497-0b24e90c225c@gmail.com>
In-Reply-To: <Z6FxzraXgNYxs2ct@dread.disaster.area>


On 2/4/25 2:47 AM, Dave Chinner wrote:
> On Mon, Feb 03, 2025 at 05:18:48PM +0100, Ric Wheeler wrote:
>> On 2/3/25 4:22 PM, Amir Goldstein wrote:
>>> On Sun, Feb 2, 2025 at 10:40 PM Ric Wheeler <ricwheeler@gmail.com> wrote:
>>>> I have always been super interested in how much we can push the
>>>> scalability limits of file systems and for the workloads we need to
>>>> support, we need to scale up to supporting absolutely ridiculously large
>>>> numbers of files (a few billion files doesn't meet the need of the
>>>> largest customers we support).
>>>>
>>> Hi Ric,
>>>
>>> Since LSFMM is not about presentations, it would be better if the topic to
>>> discuss was trying to address specific technical questions that developers
>>> could discuss.
>> Totally agree - from the ancient history of LSF (before MM or BPF!) we also
>> pushed for discussions over talks.
>>
>>> If a topic cannot generate a discussion on the list, it is not very
>>> likely that it will
>>> generate a discussion on-prem.
>>>
>>> Where does the scaling with the number of files in a filesystem affect existing
>>> filesystems? What are the limitations that you need to overcome?
>> Local file systems like xfs running on "scale up" giant systems (think of
>> the old super sized HP Superdomes and the like) would be likely to handle
>> this well.
> We don't need "Big Iron" hardware to scale up to tens of billions of
> files in a single filesystem these days. A cheap server with 32p and
> a couple of hundred GB of RAM and a few NVMe SSDs is all that is
> really needed. We recently had an XFS user report over 16 billion
> files in a relatively small filesystem (a few tens of TB), most of
> which were reflink copied files (backup/archival storage farm).
>
> So, yeah, large file counts (i.e. tens of billions) in production
> systems aren't a big deal these days. There shouldn't be any
> specific issues at the OS/VFS layers supporting filesystems with
> inode counts in the billions - most of the problems with this are
> internal filesystem implementation issues. If there are any specific
> VFS level scalability issues you've come across, I'm all ears...
>
> -Dave.

I remember fondly torturing xfs (and ext4 and btrfs) many years back 
with a billion small (empty) files on a sata drive :)
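
As an aside for anyone repeating that kind of experiment at the scale
Dave describes, statfs(2) - the call df -i wraps - is enough to watch
the inode population grow.  A minimal sketch in C, with a purely
hypothetical mount point:

/* Hedged sketch, not from this thread: report a filesystem's inode
 * counts via statfs(2).  The mount point below is a placeholder.
 */
#include <stdio.h>
#include <sys/vfs.h>		/* statfs(2) on Linux */

int main(void)
{
	struct statfs sfs;
	const char *mnt = "/mnt/archive";	/* hypothetical mount point */

	if (statfs(mnt, &sfs) != 0) {
		perror("statfs");
		return 1;
	}
	/* f_files is the total inode count the filesystem reports and
	 * f_ffree is how many of those are still free; XFS computes
	 * f_files dynamically, so treat it as headroom rather than a
	 * fixed format limit.
	 */
	printf("%s: %llu inodes total, %llu free, %llu in use\n", mnt,
	       (unsigned long long)sfs.f_files,
	       (unsigned long long)sfs.f_ffree,
	       (unsigned long long)(sfs.f_files - sfs.f_ffree));
	return 0;
}

df -i on the same mount point prints the same counters; the point is
just that nothing above the VFS cares how many billions that number
reaches.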

For our workload though, we have a couple of requirements that prevent 
most customers from using a single server.

The first requirement is the need to keep a scary number of large tape 
drives/robots running at line rate - keeping all of those busy normally 
requires on the order of 5 servers with our existing stack, but larger 
systems can need more.
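
The sizing there is just bandwidth arithmetic.  A hedged sketch, where
every number is a made-up placeholder rather than a measurement from
our stack:

/* Back-of-envelope only: how many data movers does it take to keep a
 * bank of tape drives streaming at line rate?  All inputs below are
 * hypothetical.
 */
#include <stdio.h>

int main(void)
{
	double drives     = 40.0;	/* hypothetical drive count */
	double drive_mbs  = 400.0;	/* hypothetical native line rate, MB/s */
	double server_mbs = 3000.0;	/* hypothetical sustained MB/s per server */
	double servers    = drives * drive_mbs / server_mbs;

	printf("~%.1f servers to keep %.0f drives streaming at line rate\n",
	       servers, drives);
	return 0;
}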

The second requirement is high availability - that led us to a 
shared-disk-backed file system (scoutfs), but others in this space have 
used cxfs and similar non-open-source file systems. These 
shared-disk/cluster file systems are where coarse-grained locking comes 
into conflict with concurrency.
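
To make that locking point concrete, here is a toy single-node sketch
in C (purely hypothetical, not code from scoutfs, cxfs or ngnfs).  With
one coarse lock over a directory, every create serializes on the same
mutex no matter how many CPUs are pushing work; hashing names into
per-bucket locks lets unrelated creates run in parallel.  In a cluster
the shape is the same, except the lock is a distributed lock/lease and
contention costs network round trips instead of cache-line bounces.

#include <pthread.h>

#define NBUCKETS 64

/* Coarse-grained: one lock covers the whole directory. */
static pthread_mutex_t dir_lock = PTHREAD_MUTEX_INITIALIZER;

/* Finer-grained: one lock per hash bucket of the name space
 * (GCC/Clang range initializer).
 */
static pthread_mutex_t bucket_lock[NBUCKETS] = {
	[0 ... NBUCKETS - 1] = PTHREAD_MUTEX_INITIALIZER
};

static unsigned int name_hash(const char *name)
{
	unsigned int h = 5381;

	while (*name)
		h = h * 33 + (unsigned char)*name++;
	return h % NBUCKETS;
}

void create_coarse(const char *name)
{
	(void)name;
	pthread_mutex_lock(&dir_lock);	/* every creator serializes here */
	/* ... insert "name" into the directory ... */
	pthread_mutex_unlock(&dir_lock);
}

void create_fine(const char *name)
{
	unsigned int b = name_hash(name);

	pthread_mutex_lock(&bucket_lock[b]);	/* only same-bucket names contend */
	/* ... insert "name" into bucket b ... */
	pthread_mutex_unlock(&bucket_lock[b]);
}

int main(void)
{
	create_coarse("alpha");
	create_fine("beta");
	return 0;
}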

What ngnfs is driving towards is sustaining that bandwidth requirement 
for the backend archival workflow while supporting many billions of 
file objects in a high-availability system built from today's 
cutting-edge components.  Zach will jump in once he gets back, but my 
hand-wavy way of thinking about it is that ngnfs as a distributed file 
system is closer in design to how xfs would run on a huge system with 
coherence between NUMA zones.

regards,

Ric




Thread overview: 8+ messages
2025-02-02 21:39 [LSF/MM/BPF TOPIC] Design challenges for a new file system that needs to support multiple billions of files Ric Wheeler
2025-02-03 15:22 ` Amir Goldstein
2025-02-03 16:18   ` Ric Wheeler
2025-02-04  1:47     ` Dave Chinner
2025-02-05  8:05       ` Ric Wheeler [this message]
2025-02-06 18:58   ` Zach Brown
2025-02-06 23:36     ` Andreas Dilger
2025-02-03 15:23 ` Ric Wheeler
