* Max theoretical XFS filesystem size in review
@ 2024-02-07 22:26 Luis Chamberlain
2024-02-07 22:39 ` Matthew Wilcox
2024-02-07 23:54 ` Dave Chinner
0 siblings, 2 replies; 8+ messages in thread
From: Luis Chamberlain @ 2024-02-07 22:26 UTC (permalink / raw)
To: linux-xfs
Cc: Luis Chamberlain, ritesh.list, Pankaj Raghav, Daniel Gomez,
Matthew Wilcox
I'd like to review the max theoretical XFS filesystem size and whether
the block size used may affect it. At first I thought that the 16 EiB
limit documented on a few pages online might reflect the current
limitations [0], however I suspect it's an artifact of the BLKGETSIZE64
limitation. There might be others, so I welcome your feedback on other
things as well.
As I see it the max filesystem size should be an artifact of:
max_num_ags * max_ag_blocks * block_size
Does that seem right?
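As a quick sanity check of that formula in shell: the agsize below is
taken from the 4 KiB-bsize mkfs.xfs output later in this thread, while
the AG count of 4 is purely illustrative, not a limit:

```shell
# Sketch: plug illustrative values into max_num_ags * max_ag_blocks * block_size
max_ag_blocks=268435455        # agsize (blocks) from a 4 KiB-bsize mkfs run, ~1 TiB
block_size=4096                # bytes per filesystem block
max_num_ags=4                  # hypothetical small AG count for the example
fs_size=$((max_num_ags * max_ag_blocks * block_size))
echo "$fs_size"                # just under 4 TiB
```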
This is because the allocation group stores the max number of
addressable blocks in an allocation group, counted in units of the
block size. If we consider the max possible value for max_num_ags in
light of the max number of addressable blocks which Linux can support,
this is capped at the limit of the blkdev_ioctl() BLKGETSIZE64
interface, which gives us a 64-bit integer, so (2^64)-1; we do -1 as we
start counting the first block at block 0. That's 16 EiB (exbibytes),
so we're capped at that in Linux regardless of filesystem.
Is that right?
If we didn't have that limitation though, let's consider what else would
be our cap.
max_num_ags depends on the actual max capacity reported by the device
divided by the maximum size of an AG in bytes. We have
XFS_AG_MAX_BYTES, which represents the maximum size of an AG in bytes.
This is statically defined as ((long long)BBSIZE << 31), and since
BBSIZE is 512 (2^9) this is exactly 1 TiB (2^40 bytes). So we cap one
AG at a max of 1 TiB. To get max_num_ags we divide the total capacity
of the drive by this 1 TiB, so in Linux effectively today that max
value should be 18,874,368.
Is that right?
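The BBSIZE shift itself is easy to verify; this is plain arithmetic,
not XFS code:

```shell
# XFS_AG_MAX_BYTES is (long long)BBSIZE << 31, with BBSIZE = 512 bytes
ag_max_bytes=$((512 << 31))
echo "$ag_max_bytes"           # 1099511627776, exactly 1 TiB (2^40)
```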
Although we're probably far from needing a single addressable storage
array of more than 16 EiB for a single XFS filesystem, if the above is
correct I was curious whether anyone has more details about the
baked-in 1 TiB limit per AG.
Datatype-wise though, max_num_ags is the agcount in the superblock: we
have xfs_agnumber_t sb_agcount, and xfs_agnumber_t is a uint32_t, so in
theory we should be able to get this to 2^32 if we were OK with
squeezing more data into one AG. And the number of blocks in the AG is
agf_length, another 32-bit value. With a 4 KiB block size that's
65,536 EiB, with a 16 KiB block size 262,144 exbibytes (EiB), and so
on.
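Since 2^32 AGs times 2^32 blocks times a block size overflows 64-bit
shell arithmetic, those theoretical capacities are easiest to check in
powers of two:

```shell
agcount_log=32                 # xfs_agnumber_t is a uint32_t
ag_blocks_log=32               # agf_length is a 32-bit block count
bsize_log=12                   # 4 KiB blocks
total_log=$((agcount_log + ag_blocks_log + bsize_log))   # 2^76 bytes
echo "$((1 << (total_log - 60))) EiB"                    # 65536 EiB (EiB = 2^60 bytes)
```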
[0] https://access.redhat.com/solutions/1532
Luis
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Max theoretical XFS filesystem size in review
2024-02-07 22:26 Max theoretical XFS filesystem size in review Luis Chamberlain
@ 2024-02-07 22:39 ` Matthew Wilcox
2024-02-07 23:54 ` Dave Chinner
1 sibling, 0 replies; 8+ messages in thread
From: Matthew Wilcox @ 2024-02-07 22:39 UTC (permalink / raw)
To: Luis Chamberlain; +Cc: linux-xfs, ritesh.list, Pankaj Raghav, Daniel Gomez
On Wed, Feb 07, 2024 at 02:26:53PM -0800, Luis Chamberlain wrote:
> I'd like to review the max theoretical XFS filesystem size and
> if block size used may affect this. At first I thought that the limit which
> seems to be documented on a few pages online of 16 EiB might reflect the
> current limitations [0], however I suspect its an artifact of both
> BLKGETSIZE64 limitation. There might be others so I welcome your feedback
> on other things as well.
Linux is limited to 8EiB as loff_t is signed ... I don't want to introduce
lllseek() to expand beyond 8EiB; I have reason to believe that we'll
have 128-bit registers in relevant CPUs before we can buy reasonably
priced arrays of drives that will reach 8EiB (and want to turn those
into a single block device).
See my Zettalinux presentation at Plumbers 2022 in Dublin (and that
reminds me, I really should do something with zettalinux.org)
* Re: Max theoretical XFS filesystem size in review
2024-02-07 22:26 Max theoretical XFS filesystem size in review Luis Chamberlain
2024-02-07 22:39 ` Matthew Wilcox
@ 2024-02-07 23:54 ` Dave Chinner
2024-03-15 0:12 ` Luis Chamberlain
1 sibling, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2024-02-07 23:54 UTC (permalink / raw)
To: Luis Chamberlain
Cc: linux-xfs, ritesh.list, Pankaj Raghav, Daniel Gomez,
Matthew Wilcox
On Wed, Feb 07, 2024 at 02:26:53PM -0800, Luis Chamberlain wrote:
> I'd like to review the max theoretical XFS filesystem size and
> if block size used may affect this. At first I thought that the limit which
> seems to be documented on a few pages online of 16 EiB might reflect the
> current limitations [0], however I suspect its an artifact of both
> BLKGETSIZE64 limitation. There might be others so I welcome your feedback
> on other things as well.
The actual limit is 8EiB, not 16EiB. mkfs.xfs won't allow a
filesystem over 8EiB to be made.
>
> As I see it the max filesystem size should be an artifact of:
>
> max_num_ags * max_ag_blocks * block_size
>
> Does that seem right?
Max sector size, not max block size, is the ultimate limitation.
Not really. Max filesystem size is also determined by compiler,
architecture, OS, tooling and support constraints.
> This is because the allocation group stores max number of addressable
> blocks in an allocation group, and this is in block of block size. If
> we consider the max possible value for max_num_ags in light of the max
> number of addressable blocks which Linux can support, this is capped at
> the limit of blkdev_ioctl() BLKGETSIZE64, which gives us a 64-bit
> integer, so (2^64)-1, we do -1 as we start counting the first block at
> block 0. That's 16 EiB (Exbibytes) and so we're capped at that in Linux
> regardless of filesystem.
>
> Is that right?
We could actually support the full 64 bit device sector_t range (so
2^73 bytes), and we support file sizes up to 2^54 FSBs, so with 64kB
block sizes we are at 2^70 bytes per file. IOWs, we -could- go
larger than 8EiB, but....
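The same power-of-two bookkeeping for the limits quoted here:

```shell
dev_log=$((64 + 9))      # 2^64 sectors of 512 bytes = 2^73 bytes addressable
file_log=$((54 + 16))    # 2^54 FSBs of 64 KiB = 2^70 bytes per file
echo "$dev_log $file_log"    # 73 70
```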
> If we didn't have that limitation though, let's consider what else would
> be our cap.
>
> max_num_ags depends on the actual max value possibly reported by the
> device divided by the maximum size of an AG in bytes. We have
> XFS_AG_MAX_BYTES which represents the maximum size of an AG in bytes.
> This is defined statically always as (longlong)BBSIZE << 31 and since
> BBSIZE is 9 this is about 1 TiB. So we cap one AG to have max 1 TiB.
> To get max_num_ags we divide the total capacity of the drive by
> this 1 TiB, so in Linux effectively today that max value should be
> 18,874,368.
>
> Is that right?
No. It's (2^64 / 2^40) = 2^24 AGs (16.7 million), not (2^64 /
10^12) AGs.
Also, inode numbers only go up to 2^56, so once the AG count goes
above 2^24 we'd have to introduce a new allocator to handle
inode/data locality in such large filesystems.
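The corrected AG count is a clean power of two:

```shell
# 2^64 bytes of device divided by 2^40 bytes (1 TiB) per AG:
max_ags=$((1 << (64 - 40)))
echo "$max_ags"            # 16777216, i.e. ~16.7 million AGs
```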
> Although we're probably far from needing a single storage addressable
> array needing more than 16 EiB for a single XFS filesystem, if the above was
> correct I was curious if anyone has more details about the caked in limit
> of 1 TiB limit per AG.
AGs are indexed by short btrees, i.e. they have 4-byte pointers to
minimise indexing space, so they are limited to indexing 2^31 blocks.
> Datatype wise though max_num_ags is the agcount in the superblock, we have
> xfs_agnumber_t sb_agcount and the xfs_agnumber_t is a uint32_t, so in theory
> we should be able to get this to 2^32 if we were OK to squeeze more data into
> one AG. And then the number of blocks in the ag is agf_length, another
> 32-bit value. With 4 KiB block size that's 65536 EiB, and on 16 KiB
> block size that's 262,144 Exbibytes (EiB) and so on.
Sure, in theory the XFS format *could* handle 2^80 bytes when we
have 64kB filesystem blocks. But we can't do that without massive
changes to the OS and filesystem implementation, so there's no point
in even talking about XFS support beyond 2^64 bytes until 128-bit
integer support is brought to the Linux kernel and all our block
device and syscall interfaces are 128-bit file offset capable....
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Max theoretical XFS filesystem size in review
2024-02-07 23:54 ` Dave Chinner
@ 2024-03-15 0:12 ` Luis Chamberlain
2024-03-15 1:14 ` Dave Chinner
0 siblings, 1 reply; 8+ messages in thread
From: Luis Chamberlain @ 2024-03-15 0:12 UTC (permalink / raw)
To: Dave Chinner
Cc: linux-xfs, ritesh.list, Pankaj Raghav, Daniel Gomez,
Matthew Wilcox
On Thu, Feb 08, 2024 at 10:54:08AM +1100, Dave Chinner wrote:
> On Wed, Feb 07, 2024 at 02:26:53PM -0800, Luis Chamberlain wrote:
> > I'd like to review the max theoretical XFS filesystem size and
> > if block size used may affect this. At first I thought that the limit which
> > seems to be documented on a few pages online of 16 EiB might reflect the
> > current limitations [0], however I suspect its an artifact of both
> > BLKGETSIZE64 limitation. There might be others so I welcome your feedback
> > on other things as well.
>
> The actual limit is 8EiB, not 16EiB. mkfs.xfs won't allow a
> filesystem over 8EiB to be made.
A truncated 9 EB file seems to go through:
truncate -s 9EB /mnt-pmem/sparse-9eb; losetup /dev/loop0 /mnt-pmem/sparse-9eb
mkfs.xfs -K /dev/loop0
meta-data=/dev/loop0 isize=512 agcount=8185453, agsize=268435455 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=1
= reflink=1 bigtime=1 inobtcount=1 nrext64=1
data = bsize=4096 blocks=2197265625000000, imaxpct=1
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=521728, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
Should we be rejecting that?
Joining two 8 EB files with device-mapper seems allowed:
truncate -s 8EB /mnt-pmem/sparse-8eb.1; losetup /dev/loop1 /mnt-pmem/sparse-8eb.1
truncate -s 8EB /mnt-pmem/sparse-8eb.2; losetup /dev/loop2 /mnt-pmem/sparse-8eb.2
cat /home/mcgrof/dm-join-multiple.sh
#!/bin/sh
# Join multiple devices with the same size in a linear form
# We assume the same size for simplicity
set -e
size=`blockdev --getsz $1`
FILE=$(mktemp)
for i in $(seq 1 $#) ; do
offset=$(( ($i -1) * $size))
echo "$offset $size linear $1 0" >> $FILE
shift
done
cat $FILE | dmsetup create joined
rm -f $FILE
/home/mcgrof/dm-join-multiple.sh /dev/loop1 /dev/loop2
And mkfs.xfs seems to go through on them, i.e., it's not rejected:
mkfs.xfs -f /dev/mapper/joined
meta-data=/dev/mapper/joined isize=512 agcount=14551916, agsize=268435455 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=1
= reflink=1 bigtime=1 inobtcount=1 nrext64=1
data = bsize=4096 blocks=3906250000000000, imaxpct=1
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=521728, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
Discarding blocks...
I didn't wait; should we be rejecting that?
Using -K does hit some failures on the bno number though:
mkfs.xfs -K -f /dev/mapper/joined
meta-data=/dev/mapper/joined isize=512 agcount=14551916, agsize=268435455 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=1
= reflink=1 bigtime=1 inobtcount=1 nrext64=1
data = bsize=4096 blocks=3906250000000000, imaxpct=1
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=521728, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
mkfs.xfs: pwrite failed: Invalid argument
libxfs_bwrite: write failed on (unknown) bno 0x6f05b59d3b1f00/0x100, err=22
mkfs.xfs: Releasing dirty buffer to free list!
found dirty buffer (bulk) on free list!
mkfs.xfs: pwrite failed: No space left on device
libxfs_bwrite: write failed on (unknown) bno 0x0/0x100, err=28
mkfs.xfs: Releasing dirty buffer to free list!
found dirty buffer (bulk) on free list!
mkfs.xfs: pwrite failed: No space left on device
libxfs_bwrite: write failed on xfs_sb bno 0x0/0x1, err=28
mkfs.xfs: Releasing dirty buffer to free list!
mkfs.xfs: libxfs_device_zero seek to offset 8000000394407514112 failed: Invalid argument
I still gotta chew through the rest of your reply, thanks for the
details!
Luis
* Re: Max theoretical XFS filesystem size in review
2024-03-15 0:12 ` Luis Chamberlain
@ 2024-03-15 1:14 ` Dave Chinner
2024-03-15 2:48 ` Matthew Wilcox
0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2024-03-15 1:14 UTC (permalink / raw)
To: Luis Chamberlain
Cc: linux-xfs, ritesh.list, Pankaj Raghav, Daniel Gomez,
Matthew Wilcox
On Thu, Mar 14, 2024 at 05:12:22PM -0700, Luis Chamberlain wrote:
> On Thu, Feb 08, 2024 at 10:54:08AM +1100, Dave Chinner wrote:
> > On Wed, Feb 07, 2024 at 02:26:53PM -0800, Luis Chamberlain wrote:
> > > I'd like to review the max theoretical XFS filesystem size and
> > > if block size used may affect this. At first I thought that the limit which
> > > seems to be documented on a few pages online of 16 EiB might reflect the
> > > current limitations [0], however I suspect its an artifact of both
> > > BLKGETSIZE64 limitation. There might be others so I welcome your feedback
> > > on other things as well.
> >
> > The actual limit is 8EiB, not 16EiB. mkfs.xfs won't allow a
> > filesystem over 8EiB to be made.
>
> A truncated 9 EB file seems to go through:
<sigh>
9EB = 9000000000000000000
8EiB = 9223372036854775808
So, 9EB < 8EiB and yes, mkfs.xfs will accept anything smaller than
8EiB...
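The decimal-vs-binary comparison in shell (note 2^63 itself won't fit
in bash's signed 64-bit arithmetic, so compare against 2^63 - 1):

```shell
nine_eb=$((9 * 10**18))           # "truncate -s 9EB": decimal exabytes
loff_max=9223372036854775807      # 2^63 - 1
test "$nine_eb" -le "$loff_max" && echo "9EB fits under 8EiB"
```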
> truncate -s 9EB /mnt-pmem/sparse-9eb; losetup /dev/loop0 /mnt-pmem/sparse-9eb
> mkfs.xfs -K /dev/loop0
> meta-data=/dev/loop0 isize=512 agcount=8185453, agsize=268435455 blks
yup, agcount is clearly less than 8388608, so you've screwed up your
units there...
> Joining two 8 EB files with device-mapper seems allowed:
>
> truncate -s 8EB /mnt-pmem/sparse-8eb.1; losetup /dev/loop1 /mnt-pmem/sparse-8eb.1
> truncate -s 8EB /mnt-pmem/sparse-8eb.2; losetup /dev/loop2 /mnt-pmem/sparse-8eb.2
>
> cat /home/mcgrof/dm-join-multiple.sh
> #!/bin/sh
> # Join multiple devices with the same size in a linear form
> # We assume the same size for simplicity
> set -e
> size=`blockdev --getsz $1`
> FILE=$(mktemp)
> for i in $(seq 1 $#) ; do
> offset=$(( ($i -1) * $size))
> echo "$offset $size linear $1 0" >> $FILE
> shift
> done
> cat $FILE | dmsetup create joined
> rm -f $FILE
>
> /home/mcgrof/dm-join-multiple.sh /dev/loop1 /dev/loop2
>
> And mkfs.xfs seems to go through on them, ie, its not rejected
Ah, I think mkfs.xfs has a limit of 8EiB on image files, maybe not
on block devices. What's the actual limit of block device size on
Linux?
> mkfs.xfs -f /dev/mapper/joined
> meta-data=/dev/mapper/joined isize=512 agcount=14551916, agsize=268435455 blks
> = sectsz=512 attr=2, projid32bit=1
> = crc=1 finobt=1, sparse=1, rmapbt=1
> = reflink=1 bigtime=1 inobtcount=1 nrext64=1
> data = bsize=4096 blocks=3906250000000000, imaxpct=1
> = sunit=0 swidth=0 blks
> naming =version 2 bsize=4096 ascii-ci=0, ftype=1
> log =internal log bsize=4096 blocks=521728, version=2
> = sectsz=512 sunit=0 blks, lazy-count=1
> realtime =none extsz=4096 blocks=0, rtextents=0
> Discarding blocks...
>
> I didn't wait, should we be rejecting that?
Probably. mkfs.xfs uses uint64_t for the block counts and
arithmetic, so all the size and geometry calcs should work. The
problem is when we translate those sizes to byte counts; the Linux
kernel side then has all sorts of problems because many things
described in bytes (like off_t and loff_t) are signed. Hence while
you might be able to make block devices larger than 8EiB, I'm pretty
sure you can't actually do things like pread()/pwrite() at offsets
above 8EiB on block devices....
> Using -K does hit some failures on the bno number though:
>
> mkfs.xfs -K -f /dev/mapper/joined
> meta-data=/dev/mapper/joined isize=512 agcount=14551916, agsize=268435455 blks
> = sectsz=512 attr=2, projid32bit=1
> = crc=1 finobt=1, sparse=1, rmapbt=1
> = reflink=1 bigtime=1 inobtcount=1 nrext64=1
> data = bsize=4096 blocks=3906250000000000, imaxpct=1
> = sunit=0 swidth=0 blks
> naming =version 2 bsize=4096 ascii-ci=0, ftype=1
> log =internal log bsize=4096 blocks=521728, version=2
> = sectsz=512 sunit=0 blks, lazy-count=1
> realtime =none extsz=4096 blocks=0, rtextents=0
> mkfs.xfs: pwrite failed: Invalid argument
> libxfs_bwrite: write failed on (unknown) bno 0x6f05b59d3b1f00/0x100, err=22
daddr is 0x6f05b59d3b1f00. So let's convert that to a byte-based
offset from a buffer daddr:
$ printf "0x%llx\n" $(( 0x6f05b59d3b1f00 << 9 ))
0xde0b6b3a763e0000
$
It's hard to see, but if I write it as 16-bit couplets:
0xde0b 6b3a 763e 0000
you can see the high bit in the file offset is set, so that's a
write beyond 8EiB that returned -EINVAL. That's exactly what
rw_verify_area() returns when the loff_t *pos is < 0 and the file does
not assert FMODE_UNSIGNED_OFFSET. No block-based filesystem or block
device asserts FMODE_UNSIGNED_OFFSET, so this write should always
fail with -EINVAL.
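bash's arithmetic is itself signed 64-bit, so the same shift shows the
sign-bit problem directly:

```shell
daddr=0x6f05b59d3b1f00
off=$((daddr << 9))        # byte offset; wraps negative in signed 64-bit
test "$off" -lt 0 && echo "pos < 0, so rw_verify_area() returns -EINVAL"
```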
And where did it fail? You used "-f" which set force_overwrite,
which means we do a bit of zeroing of potential locations for old
XFS structures (secondary superblocks) and that silently swallows IO
failures, so it wasn't that. The next thing it does is whack
potential MD and GPT records at the end of the filesystem and that's
done in IO sizes of:
/*
* amount (in bytes) we zero at the beginning and end of the device to
* remove traces of other filesystems, raid superblocks, etc.
*/
#define WHACK_SIZE (128 * 1024)
128kB IOs. The above IO that failed with -EINVAL is a 128kB IO
(0x100 basic blocks). This will emit a warning message that the IO
failed (as per above), but it also swallows IO errors and lets mkfs
continue.
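The IO-length arithmetic checks out against the failed bwrite above:

```shell
whack_size=$((128 * 1024))        # WHACK_SIZE in bytes
bb_count=$((whack_size / 512))    # basic blocks per whack IO
printf '0x%x\n' "$bb_count"       # 0x100, the length in the failed bwrite
```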
> mkfs.xfs: Releasing dirty buffer to free list!
> found dirty buffer (bulk) on free list!
> mkfs.xfs: pwrite failed: No space left on device
> libxfs_bwrite: write failed on (unknown) bno 0x0/0x100, err=28
Yup, that's the next write to zap the first blocks of the device to
get rid of primary superblocks and other signatures from other types
of filesystems and partition tables. That failed with -ENOSPC, which
implies something went wrong in the dm/loop device IO/backing
file IO stage. Likely an 8EiB overflow problem somewhere.
> mkfs.xfs: Releasing dirty buffer to free list!
> found dirty buffer (bulk) on free list!
> mkfs.xfs: pwrite failed: No space left on device
> libxfs_bwrite: write failed on xfs_sb bno 0x0/0x1, err=28
And that's the initial write of the superblock (single 512 byte
sector write) that failed with ENOSPC. Same error as the previous
write, same likely cause.
> mkfs.xfs: Releasing dirty buffer to free list!
> mkfs.xfs: libxfs_device_zero seek to offset 8000000394407514112 failed: Invalid argument
And yeah, there's the smoking gun: mkfs.xfs is attempting to seek to
an offset beyond 8EiB on the block device, and that is failing.
IOWs, the max supported block device size on Linux is 8EiB. mkfs.xfs
should really capture some of these errors, but largely the problem
here is that dm is allowing an unsupported block device mapping
to be created...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Max theoretical XFS filesystem size in review
2024-03-15 1:14 ` Dave Chinner
@ 2024-03-15 2:48 ` Matthew Wilcox
2024-03-15 17:52 ` Luis Chamberlain
0 siblings, 1 reply; 8+ messages in thread
From: Matthew Wilcox @ 2024-03-15 2:48 UTC (permalink / raw)
To: Dave Chinner
Cc: Luis Chamberlain, linux-xfs, ritesh.list, Pankaj Raghav,
Daniel Gomez
On Fri, Mar 15, 2024 at 12:14:05PM +1100, Dave Chinner wrote:
> On Thu, Mar 14, 2024 at 05:12:22PM -0700, Luis Chamberlain wrote:
> > Joining two 8 EB files with device-mapper seems allowed:
> >
> > truncate -s 8EB /mnt-pmem/sparse-8eb.1; losetup /dev/loop1 /mnt-pmem/sparse-8eb.1
> > truncate -s 8EB /mnt-pmem/sparse-8eb.2; losetup /dev/loop2 /mnt-pmem/sparse-8eb.2
> >
> > cat /home/mcgrof/dm-join-multiple.sh
> > #!/bin/sh
> > # Join multiple devices with the same size in a linear form
> > # We assume the same size for simplicity
> > set -e
> > size=`blockdev --getsz $1`
> > FILE=$(mktemp)
> > for i in $(seq 1 $#) ; do
> > offset=$(( ($i -1) * $size))
> > echo "$offset $size linear $1 0" >> $FILE
> > shift
> > done
> > cat $FILE | dmsetup create joined
> > rm -f $FILE
> >
> > /home/mcgrof/dm-join-multiple.sh /dev/loop1 /dev/loop2
> >
> > And mkfs.xfs seems to go through on them, ie, its not rejected
>
> Ah, I think mkfs.xfs has a limit of 8EiB on image files, maybe not
> on block devices. What's the actual limit of block device size on
> Linux?
We can't seek past 2^63-1. That's the limit on lseek, llseek, lseek64
or whatever we're calling it these days. If we're missing a check
somewhere, that's a bug.
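For reference, that ceiling expressed in whole EiB (integer division):

```shell
loff_max=9223372036854775807           # 2^63 - 1, the largest lseek offset
echo "$((loff_max / (1 << 60))) EiB"   # 7 whole EiB, i.e. just under 8 EiB
```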
* Re: Max theoretical XFS filesystem size in review
2024-03-15 2:48 ` Matthew Wilcox
@ 2024-03-15 17:52 ` Luis Chamberlain
2024-03-18 0:00 ` Dave Chinner
0 siblings, 1 reply; 8+ messages in thread
From: Luis Chamberlain @ 2024-03-15 17:52 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Dave Chinner, linux-xfs, ritesh.list, Pankaj Raghav, Daniel Gomez
On Fri, Mar 15, 2024 at 02:48:27AM +0000, Matthew Wilcox wrote:
> On Fri, Mar 15, 2024 at 12:14:05PM +1100, Dave Chinner wrote:
> > On Thu, Mar 14, 2024 at 05:12:22PM -0700, Luis Chamberlain wrote:
> > > Joining two 8 EB files with device-mapper seems allowed:
> > >
> > > truncate -s 8EB /mnt-pmem/sparse-8eb.1; losetup /dev/loop1 /mnt-pmem/sparse-8eb.1
> > > truncate -s 8EB /mnt-pmem/sparse-8eb.2; losetup /dev/loop2 /mnt-pmem/sparse-8eb.2
> > >
> > > cat /home/mcgrof/dm-join-multiple.sh
> > > #!/bin/sh
> > > # Join multiple devices with the same size in a linear form
> > > # We assume the same size for simplicity
> > > set -e
> > > size=`blockdev --getsz $1`
> > > FILE=$(mktemp)
> > > for i in $(seq 1 $#) ; do
> > > offset=$(( ($i -1) * $size))
> > > echo "$offset $size linear $1 0" >> $FILE
> > > shift
> > > done
> > > cat $FILE | dmsetup create joined
> > > rm -f $FILE
> > >
> > > /home/mcgrof/dm-join-multiple.sh /dev/loop1 /dev/loop2
> > >
> > > And mkfs.xfs seems to go through on them, ie, its not rejected
> >
> > Ah, I think mkfs.xfs has a limit of 8EiB on image files, maybe not
> > on block devices. What's the actual limit of block device size on
> > Linux?
>
> We can't seek past 2^63-1. That's the limit on lseek, llseek, lseek64
> or whatever we're calling it these days. If we're missing a check
> somewhere, that's a bug.
Thanks, I can send fixes; I just wanted to review some of these things
with the community to explore what constraints, if any, a big fat
Linux block device or filesystem is subject to. The fact that through
this discussion we're uncovering some perhaps-missing checks is
already useful. I'll try to document some of it.
Luis
* Re: Max theoretical XFS filesystem size in review
2024-03-15 17:52 ` Luis Chamberlain
@ 2024-03-18 0:00 ` Dave Chinner
0 siblings, 0 replies; 8+ messages in thread
From: Dave Chinner @ 2024-03-18 0:00 UTC (permalink / raw)
To: Luis Chamberlain
Cc: Matthew Wilcox, linux-xfs, ritesh.list, Pankaj Raghav,
Daniel Gomez
On Fri, Mar 15, 2024 at 10:52:43AM -0700, Luis Chamberlain wrote:
> On Fri, Mar 15, 2024 at 02:48:27AM +0000, Matthew Wilcox wrote:
> > On Fri, Mar 15, 2024 at 12:14:05PM +1100, Dave Chinner wrote:
> > > On Thu, Mar 14, 2024 at 05:12:22PM -0700, Luis Chamberlain wrote:
> > > > Joining two 8 EB files with device-mapper seems allowed:
> > > >
> > > > truncate -s 8EB /mnt-pmem/sparse-8eb.1; losetup /dev/loop1 /mnt-pmem/sparse-8eb.1
> > > > truncate -s 8EB /mnt-pmem/sparse-8eb.2; losetup /dev/loop2 /mnt-pmem/sparse-8eb.2
> > > >
> > > > cat /home/mcgrof/dm-join-multiple.sh
> > > > #!/bin/sh
> > > > # Join multiple devices with the same size in a linear form
> > > > # We assume the same size for simplicity
> > > > set -e
> > > > size=`blockdev --getsz $1`
> > > > FILE=$(mktemp)
> > > > for i in $(seq 1 $#) ; do
> > > > offset=$(( ($i -1) * $size))
> > > > echo "$offset $size linear $1 0" >> $FILE
> > > > shift
> > > > done
> > > > cat $FILE | dmsetup create joined
> > > > rm -f $FILE
> > > >
> > > > /home/mcgrof/dm-join-multiple.sh /dev/loop1 /dev/loop2
> > > >
> > > > And mkfs.xfs seems to go through on them, ie, its not rejected
> > >
> > > Ah, I think mkfs.xfs has a limit of 8EiB on image files, maybe not
> > > on block devices. What's the actual limit of block device size on
> > > Linux?
> >
> > We can't seek past 2^63-1. That's the limit on lseek, llseek, lseek64
> > or whatever we're calling it these days. If we're missing a check
> > somewhere, that's a bug.
>
> Thanks, I can send fixes, just wanted to review some of these things
> with the community to explore what a big fat linux block device or
> filesystem might be constrained to, if any. The fact that through this
> discussion we're uncovering perhaps some missing checks is already
> useful. I'll try to document some of it.
I don't really care about some random documentation on some random
website about some weird corner case issue. Just fix the problems
you find and get the patches to mkfs.xfs merged.
Realistically, though, we just haven't cared about mkfs.xfs
behaviour at that scale because of one main issue: have you ever
waited for mkfs.xfs to create and then mount an ~8EiB XFS
filesystem?
You have to wait through the hundreds of millions of
synchronous writes (as in "waits for each submitted write to
complete", not O_SYNC) that mkfs.xfs needs to do to create the
filesystem, and then wait through the hundreds of millions of
synchronous reads that mount does in the kernel to allow the
filesystem to mount.
Hence we have not done any real validation of behaviour at that
scale because of the time and resource cost involved in just
creating and mounting filesystems at that scale. Unless you have
many, many hours to burn every time you want mkfs and mount a XFS
filesystem, it's just not practical to even do basic functional
testing at this scale.
And, really, mkfs.xfs is the least of the problems that need
addressing before we can test filesystems that large. We do full
filesystem AG walks at mount that need to be avoided, we need tens
of GB of RAM to hold all the AG information in kernel memory (we
can't free per-AG information on demand yet - that's part of the
problem that makes shrink so complex), we have algorithms that do
linear AG walks that depend on AG information being held in memory,
etc. When you're talking about an algorithm that can iterate all AGs
in the filesystem 3 times before failing, with 8.4 million AGs
indexed, that's a serious scalability problem.
IOWs, we've got years of development ahead of us to scale the
filesystem implementation out to handle filesystems larger than a
few PiB efficiently - mkfs.xfs limits are the most trivial of things
compared to the deep surgery that is needed to make 64-bit capacity
support a production-quality reality....
-Dave.
--
Dave Chinner
david@fromorbit.com