From: Dave Chinner <david@fromorbit.com>
To: "Richard W.M. Jones" <rjones@redhat.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: mkfs.xfs options suitable for creating absurdly large XFS filesystems?
Date: Tue, 4 Sep 2018 19:12:30 +1000
Message-ID: <20180904091230.GU5631@dastard>
In-Reply-To: <20180904082332.GS5631@dastard>
On Tue, Sep 04, 2018 at 06:23:32PM +1000, Dave Chinner wrote:
> On Tue, Sep 04, 2018 at 10:49:40AM +1000, Dave Chinner wrote:
> > On Mon, Sep 03, 2018 at 11:49:19PM +0100, Richard W.M. Jones wrote:
> > > [This is silly and has no real purpose except to explore the limits.
> > > If that offends you, don't read the rest of this email.]
> >
> > We do this quite frequently ourselves, even if it is just to remind
> > ourselves how long it takes to wait for millions of IOs to be done.
> >
> > > I am trying to create an XFS filesystem in a partition of approx
> > > 2^63 - 1 bytes to see what happens.
> >
> > Should just work. You might find problems with the underlying
> > storage, but the XFS side of things should just work.
>
> > I'm trying to reproduce it here:
> >
> > $ grep vdd /proc/partitions
> > 253 48 9007199254739968 vdd
> > $ sudo mkfs.xfs -f -s size=1024 -d size=2251799813684887b -N /dev/vdd
> > meta-data=/dev/vdd             isize=512    agcount=8388609, agsize=268435455 blks
> >          =                     sectsz=1024  attr=2, projid32bit=1
> >          =                     crc=1        finobt=1, sparse=1, rmapbt=0
> >          =                     reflink=0
> > data     =                     bsize=4096   blocks=2251799813684887, imaxpct=1
> >          =                     sunit=0      swidth=0 blks
> > naming   =version 2            bsize=4096   ascii-ci=0, ftype=1
> > log      =internal log         bsize=4096   blocks=521728, version=2
> >          =                     sectsz=1024  sunit=1 blks, lazy-count=1
> > realtime =none                 extsz=4096   blocks=0, rtextents=0
> >
> >
> > And it is running now without the "-N" and I have to wait for tens
> > of millions of IOs to be issued. The write rate is currently about
> > 13,000 IOPS, so I'm guessing it'll take at least an hour to do
> > this. Next time I'll run it on the machine with faster SSDs.
> >
> > I haven't seen any error after 20 minutes, though.
>
> I killed it after two and a half hours, and started looking at why it
> was taking that long. That's the above.
Or the below. Stand on your head if you're confused.
-Dave.
> But it's not fast. This is the first time I've looked at whether we
> perturbed the IO patterns in the recent mkfs.xfs refactoring. I'm
> not sure we made them any worse (the algorithms are the same), but
> it's now much more obvious how we can improve them drastically with
> a few small mods.
>
> Firstly, there's the force overwrite algorithm that zeros the old
> filesystem signature. On an 8EB device with an existing 8EB
> filesystem, that's 8+ million single-sector IOs right there.
> So for the moment, zero the first 1MB of the device to whack the
> old superblock and you can avoid this step (a sketch of that
> workaround follows the timing numbers below). I've got a fix for
> that now:
>
> Time to mkfs a 1TB filesystem on a big device after it held another
> larger filesystem:
>
> previous FS size     10PB      100PB     1EB
> old mkfs time        1.95s     8.9s      81.3s
> patched              0.95s     1.2s      1.2s
>
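> For reference, a minimal sketch of that manual workaround (not the
> mkfs.xfs fix itself). It does nothing more than a 1MB dd of zeros
> over the start of the device; the device path is just an example:
>
> /* Zero the first 1MB of the device so the old primary superblock
>  * is gone and the signature-erase pass never triggers. */
> #include <fcntl.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <err.h>
>
> int main(void)
> {
>         const char *dev = "/dev/vdd";   /* example device */
>         size_t len = 1 << 20;           /* 1MB */
>         char *buf = calloc(1, len);
>         int fd;
>
>         if (!buf)
>                 err(1, "calloc");
>         fd = open(dev, O_WRONLY);
>         if (fd < 0)
>                 err(1, "open %s", dev);
>         if (pwrite(fd, buf, len, 0) != (ssize_t)len)
>                 err(1, "pwrite %s", dev);
>         if (fsync(fd))
>                 err(1, "fsync %s", dev);
>         close(fd);
>         free(buf);
>         return 0;
> }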
>
> Second, use -K to avoid discard (which you already know).
>
> Third, we do two passes over the AG headers to initialise them.
> Unfortunately, with a large number of AGs, they don't stay in the
> buffer cache and so the second pass involves RMW cycles. This means
> we do at least 5 more read IOs and 5 more write IOs per AG than we
> need to. I've got a fix for this, too:
>
> Time to make a filesystem from scratch, using a zeroed device so the
> force overwrite algorithms are not triggered and -K to avoid
> discards:
>
> FS size              10PB      100PB     1EB
> current mkfs         26.9s     214.8s    2484s
> patched              11.3s     70.3s     709s
>
> From that projection, the 8EB mkfs would have taken somewhere around
> 7-8 hours to complete. The new code should only take a couple of
> hours. Still not all that good....
>
> .... and I think that's because we are using direct IO. That means
> the IO we issue is effectively synchronous, even though we're sort
> of doing delayed writeback. The problem is that mkfs is not threaded so
> writeback happens when the cache fills up and we run out of buffers
> on the free list. Basically it's "direct delayed writeback" at that
> point.
>
> Worse, because it's synchronous, we don't drive more than one IO at
> a time and so we don't get adjacent sector merging, even though most
> of the AG header writes are to adjacent sectors. That merging would
> cut the number of IOs from ~10 per AG down to 2 for sectorsize <
> blocksize filesystems and 1 for sectorsize = blocksize filesystems
> (a rough sketch of the merging follows below).
>
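> To make the merging concrete, here's a minimal sketch of coalescing
> adjacent writes. The structures are invented for illustration - this
> is not libxfs code. Given requests already sorted by disk address,
> contiguous runs collapse into single larger IOs:
>
> #include <stddef.h>
>
> struct io_req {
>         long long       offset;         /* byte offset on the device */
>         size_t          len;            /* bytes */
> };
>
> /*
>  * Merge contiguous requests in place; 'reqs' must already be sorted
>  * by offset. Returns the new number of requests.
>  */
> static int merge_adjacent(struct io_req *reqs, int nreqs)
> {
>         int out = 0;
>
>         for (int i = 1; i < nreqs; i++) {
>                 if (reqs[out].offset + (long long)reqs[out].len ==
>                     reqs[i].offset) {
>                         /* contiguous with the previous request: extend it */
>                         reqs[out].len += reqs[i].len;
>                 } else {
>                         reqs[++out] = reqs[i];
>                 }
>         }
>         return nreqs ? out + 1 : 0;
> }
>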
> This isn't so easy to fix. I either need to:
>
> 1) thread the libxfs buffer cache so we can do this
>    writeback in the background;
> 2) thread mkfs so it can process multiple AGs at once; or
> 3) make libxfs use AIO via delayed write infrastructure
>    similar to what we have in the kernel (buffer lists).
>
> Approach 1) does not solve the queue depth = 1 issue, so
> it's of limited value. Might be quick, but doesn't really get us
> much improvement.
>
> Approach 2) drives deeper queues, but it doesn't solve the adjacent
> sector IO merging problem because each thread only has a queue depth
> of one. So we'll be able to do more IO, but IO efficiency won't
> improve. And, realistically, this isn't a good idea because
> out-of-order AG processing doesn't work on spinning rust - it
> just causes seek
> storms and things go slower. To make things faster on spinning rust,
> we need single threaded, in order dispatch, asynchronous writeback.
> Which is almost what 1) is, except it's not asynchronous.
>
> That's what 3) solves - single threaded, in-order, async writeback,
> controlled by the context creating the dirty buffers in a limited
> AIO context. I'll have to think about this a bit more....
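>
> To make 3) a bit more concrete, here's a minimal sketch of the sort
> of thing I mean: single threaded, in-order dispatch with a bounded
> number of AIOs in flight, using Linux libaio (link with -laio). The
> names and the queue depth are made up for illustration - it's not
> the libxfs implementation:
>
> #include <libaio.h>
> #include <stdlib.h>
> #include <err.h>
>
> #define QUEUE_DEPTH     64              /* made-up bound on IOs in flight */
>
> struct dirty_buf {
>         void            *data;          /* buffer contents */
>         size_t          len;            /* bytes */
>         long long       daddr;          /* byte offset on the device */
>         struct iocb     iocb;
> };
>
> static int cmp_daddr(const void *a, const void *b)
> {
>         const struct dirty_buf *x = a, *y = b;
>
>         return (x->daddr > y->daddr) - (x->daddr < y->daddr);
> }
>
> /* Flush a delayed-write list: sort by disk address, then keep up to
>  * QUEUE_DEPTH async writes in flight while dispatching in order.
>  * Assumes fd/buffers meet any O_DIRECT alignment requirements. */
> static void delwri_flush(int fd, struct dirty_buf *bufs, int nbufs)
> {
>         struct io_event events[QUEUE_DEPTH];
>         io_context_t ctx = 0;
>         int submitted = 0, completed = 0;
>
>         if (io_setup(QUEUE_DEPTH, &ctx) < 0)
>                 errx(1, "io_setup failed");
>
>         qsort(bufs, nbufs, sizeof(*bufs), cmp_daddr);
>
>         while (completed < nbufs) {
>                 /* top up the in-flight queue, in ascending order */
>                 while (submitted < nbufs &&
>                        submitted - completed < QUEUE_DEPTH) {
>                         struct dirty_buf *db = &bufs[submitted];
>                         struct iocb *cb = &db->iocb;
>
>                         io_prep_pwrite(cb, fd, db->data, db->len, db->daddr);
>                         if (io_submit(ctx, 1, &cb) != 1)
>                                 errx(1, "io_submit failed");
>                         submitted++;
>                 }
>                 /* reap at least one completion before submitting more */
>                 int n = io_getevents(ctx, 1, QUEUE_DEPTH, events, NULL);
>                 if (n < 0)
>                         errx(1, "io_getevents failed");
>                 completed += n;
>         }
>         io_destroy(ctx);
> }
>
> Because dispatch is in ascending disk address order and many IOs are
> in flight at once, the block layer gets a chance to merge the
> adjacent sector writes that the current synchronous code never
> exposes.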
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
--
Dave Chinner
david@fromorbit.com