* Max theoretical XFS filesystem size in review

From: Luis Chamberlain
Date: 2024-02-07 22:26 UTC
To: linux-xfs
Cc: Luis Chamberlain, ritesh.list, Pankaj Raghav, Daniel Gomez, Matthew Wilcox

I'd like to review the max theoretical XFS filesystem size and whether
the block size used affects it. At first I thought that the 16 EiB limit
documented on a few pages online might reflect the current limitations
[0], but I suspect it's an artifact of the BLKGETSIZE64 limitation.
There might be others, so I welcome feedback on other constraints as
well.

As I see it, the max filesystem size should be an artifact of:

  max_num_ags * max_ag_blocks * block_size

Does that seem right?

This is because the on-disk format stores the max number of addressable
blocks in an allocation group, in units of the filesystem block size.
If we consider the max possible value for max_num_ags in light of the
max number of addressable blocks Linux can support, we are capped by
the blkdev_ioctl() BLKGETSIZE64 limit, which gives us a 64-bit integer,
so (2^64)-1; we do -1 because we start counting at block 0. That's
16 EiB (exbibytes), so we're capped at that in Linux regardless of
filesystem.

Is that right?

If we didn't have that limitation, though, let's consider what else
would be our cap.

max_num_ags depends on the actual max value possibly reported by the
device divided by the maximum size of an AG in bytes. We have
XFS_AG_MAX_BYTES, which represents the maximum size of an AG in bytes.
This is defined statically as (long long)BBSIZE << 31, and since BBSIZE
is 512 (2^9) this is exactly 1 TiB. So we cap one AG at 1 TiB. To get
max_num_ags we divide the total capacity of the drive by this 1 TiB, so
in Linux effectively today that max value should be 18,874,368.

Is that right?

Although we're probably far from needing a single addressable storage
array of more than 16 EiB for one XFS filesystem, if the above is
correct I was curious whether anyone has more details about the
baked-in 1 TiB limit per AG.

Datatype-wise, though, max_num_ags is the agcount in the superblock: we
have xfs_agnumber_t sb_agcount, and xfs_agnumber_t is a uint32_t, so in
theory we should be able to get this to 2^32 if we were OK with
squeezing more data into one AG. And the number of blocks in the AG is
agf_length, another 32-bit value. With a 4 KiB block size that's
65,536 EiB, with a 16 KiB block size 262,144 EiB, and so on.

[0] https://access.redhat.com/solutions/1532

  Luis
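A quick shell sanity check of the AG size constant discussed above (a
sketch; 512 is BBSIZE, the 512-byte basic block, i.e. 1 << BBSHIFT with
BBSHIFT = 9):

  $ echo $(( 512 << 31 ))          # XFS_AG_MAX_BYTES in bytes
  1099511627776
  $ echo $(( (512 << 31) >> 40 ))  # the same value in TiB
  1

So the 1 TiB AG cap is exact: 2^9 * 2^31 = 2^40 bytes.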
* Re: Max theoretical XFS filesystem size in review

From: Matthew Wilcox
Date: 2024-02-07 22:39 UTC
To: Luis Chamberlain
Cc: linux-xfs, ritesh.list, Pankaj Raghav, Daniel Gomez

On Wed, Feb 07, 2024 at 02:26:53PM -0800, Luis Chamberlain wrote:
> I'd like to review the max theoretical XFS filesystem size and whether
> the block size used affects it. At first I thought that the 16 EiB limit
> documented on a few pages online might reflect the current limitations
> [0], but I suspect it's an artifact of the BLKGETSIZE64 limitation.
> There might be others, so I welcome feedback on other constraints as
> well.

Linux is limited to 8EiB as loff_t is signed ...

I don't want to introduce lllseek() to expand beyond 8EiB; I have
reason to believe that we'll have 128-bit registers in relevant CPUs
before we can buy reasonably priced arrays of drives that will reach
8EiB (and want to turn those into a single block device). See my
Zettalinux presentation at Plumbers 2022 in Dublin (and that reminds
me, I really should do something with zettalinux.org).
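The 8 EiB figure falls straight out of the signed 64-bit byte offset; a
one-liner shows it (bash arithmetic is itself signed 64-bit):

  $ echo $(( (1 << 63) - 1 ))      # LLONG_MAX, the largest valid loff_t
  9223372036854775807

2^63 bytes is exactly 8 EiB, so the maximum byte offset into a file or
device is one byte short of that.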
* Re: Max theoretical XFS filesystem size in review

From: Dave Chinner
Date: 2024-02-07 23:54 UTC
To: Luis Chamberlain
Cc: linux-xfs, ritesh.list, Pankaj Raghav, Daniel Gomez, Matthew Wilcox

On Wed, Feb 07, 2024 at 02:26:53PM -0800, Luis Chamberlain wrote:
> I'd like to review the max theoretical XFS filesystem size and whether
> the block size used affects it. At first I thought that the 16 EiB limit
> documented on a few pages online might reflect the current limitations
> [0], but I suspect it's an artifact of the BLKGETSIZE64 limitation.
> There might be others, so I welcome feedback on other constraints as
> well.

The actual limit is 8EiB, not 16EiB. mkfs.xfs won't allow a filesystem
over 8EiB to be made.

> As I see it, the max filesystem size should be an artifact of:
>
>   max_num_ags * max_ag_blocks * block_size
>
> Does that seem right?

Max sector size, not max block size, is the ultimate limitation.

Not really. Max filesystem size is also determined by compiler,
architecture, OS, tool and support constraints.

> This is because the on-disk format stores the max number of addressable
> blocks in an allocation group, in units of the filesystem block size.
> If we consider the max possible value for max_num_ags in light of the
> max number of addressable blocks Linux can support, we are capped by
> the blkdev_ioctl() BLKGETSIZE64 limit, which gives us a 64-bit integer,
> so (2^64)-1; we do -1 because we start counting at block 0. That's
> 16 EiB (exbibytes), so we're capped at that in Linux regardless of
> filesystem.
>
> Is that right?

We could actually support the full 64 bit device sector_t range (so
2^73 bytes), and we support file sizes up to 2^54 FSBs, so with 64kB
block sizes we are at 2^70 bytes per file. IOWs, we -could- go larger
than 8EiB, but....

> If we didn't have that limitation, though, let's consider what else
> would be our cap.
>
> max_num_ags depends on the actual max value possibly reported by the
> device divided by the maximum size of an AG in bytes. We have
> XFS_AG_MAX_BYTES, which represents the maximum size of an AG in bytes.
> This is defined statically as (long long)BBSIZE << 31, and since BBSIZE
> is 512 (2^9) this is exactly 1 TiB. So we cap one AG at 1 TiB. To get
> max_num_ags we divide the total capacity of the drive by this 1 TiB, so
> in Linux effectively today that max value should be 18,874,368.
>
> Is that right?

No. It's (2^64 / 2^40) = 2^24 AGs (16.7 million), not (2^64 / 10^12)
AGs.

Also, inode numbers only go up to 2^56, so once the AG count goes above
2^24 we'd have to introduce a new allocator to handle inode/data
locality in such large filesystems.

> Although we're probably far from needing a single addressable storage
> array of more than 16 EiB for one XFS filesystem, if the above is
> correct I was curious whether anyone has more details about the
> baked-in 1 TiB limit per AG.

AGs are indexed by short btrees, i.e. they have 4 byte pointers to
minimise indexing space, so they are limited to indexing 2^31 blocks.
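Dave's figures can be reproduced with bignum arithmetic (a sketch using
bc, since several of these values overflow 64-bit shell arithmetic):

  $ bc <<< '2^64 / 2^40'    # AGs at the 2^64-byte limit: 2^24
  16777216
  $ bc <<< '2^9 * 2^64'     # full 64-bit sector_t range in bytes: 2^73
  9444732965739290427392
  $ bc <<< '2^54 * 2^16'    # max file size at 64 KiB blocks: 2^70
  1180591620717411303424

The 2^56 inode-number ceiling is what pins the AG count: an XFS inode
number encodes the AG plus the inode's location within that AG, so 2^24
AGs is as far as the current encoding stretches.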
> Datatype-wise, though, max_num_ags is the agcount in the superblock: we
> have xfs_agnumber_t sb_agcount, and xfs_agnumber_t is a uint32_t, so in
> theory we should be able to get this to 2^32 if we were OK with
> squeezing more data into one AG. And the number of blocks in the AG is
> agf_length, another 32-bit value. With a 4 KiB block size that's
> 65,536 EiB, with a 16 KiB block size 262,144 EiB, and so on.

Sure, in theory the XFS format *could* handle 2^80 bytes when we have
64kB filesystem blocks. But we can't do that without massive changes to
the OS and filesystem implementation, so there's no point in even
talking about XFS support beyond 2^64 bytes until 128-bit integer
support is brought to the Linux kernel and all our block device and
syscall interfaces are 128-bit file offset capable....

-Dave.
--
Dave Chinner
david@fromorbit.com
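Those format-ceiling numbers check out the same way — none of them fit
in 64-bit shell arithmetic, which is rather the point:

  $ bc <<< '2^32 * 2^32 * 2^12 / 2^60'  # 2^32 AGs x 2^32 blocks x 4 KiB, in EiB
  65536
  $ bc <<< '2^32 * 2^32 * 2^16 / 2^60'  # the same with 64 KiB blocks (2^80 bytes)
  1048576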
* Re: Max theoretical XFS filesystem size in review

From: Luis Chamberlain
Date: 2024-03-15 0:12 UTC
To: Dave Chinner
Cc: linux-xfs, ritesh.list, Pankaj Raghav, Daniel Gomez, Matthew Wilcox

On Thu, Feb 08, 2024 at 10:54:08AM +1100, Dave Chinner wrote:
> On Wed, Feb 07, 2024 at 02:26:53PM -0800, Luis Chamberlain wrote:
> > I'd like to review the max theoretical XFS filesystem size and whether
> > the block size used affects it. At first I thought that the 16 EiB limit
> > documented on a few pages online might reflect the current limitations
> > [0], but I suspect it's an artifact of the BLKGETSIZE64 limitation.
> > There might be others, so I welcome feedback on other constraints as
> > well.
>
> The actual limit is 8EiB, not 16EiB. mkfs.xfs won't allow a filesystem
> over 8EiB to be made.

A truncated 9 EB file seems to go through:

truncate -s 9EB /mnt-pmem/sparse-9eb; losetup /dev/loop0 /mnt-pmem/sparse-9eb
mkfs.xfs -K /dev/loop0
meta-data=/dev/loop0             isize=512    agcount=8185453, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
data     =                       bsize=4096   blocks=2197265625000000, imaxpct=1
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Should we be rejecting that?

Joining two 8 EB files with device-mapper seems allowed:

truncate -s 8EB /mnt-pmem/sparse-8eb.1; losetup /dev/loop1 /mnt-pmem/sparse-8eb.1
truncate -s 8EB /mnt-pmem/sparse-8eb.2; losetup /dev/loop2 /mnt-pmem/sparse-8eb.2

cat /home/mcgrof/dm-join-multiple.sh
#!/bin/sh
# Join multiple devices with the same size in a linear form
# We assume the same size for simplicity
set -e
size=`blockdev --getsz $1`
FILE=$(mktemp)
for i in $(seq 1 $#) ; do
	offset=$(( ($i - 1) * $size ))
	echo "$offset $size linear $1 0" >> $FILE
	shift
done
cat $FILE | dmsetup create joined
rm -f $FILE

/home/mcgrof/dm-join-multiple.sh /dev/loop1 /dev/loop2

And mkfs.xfs seems to go through on them, i.e. it's not rejected:

mkfs.xfs -f /dev/mapper/joined
meta-data=/dev/mapper/joined     isize=512    agcount=14551916, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
data     =                       bsize=4096   blocks=3906250000000000, imaxpct=1
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
Discarding blocks...

I didn't wait, should we be rejecting that?
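A quick way to confirm what the block layer thinks these devices are (a
sketch; the device names are the ones set up above):

  $ blockdev --getsize64 /dev/loop0   # should echo back 9 * 10^18
  $ bc <<< '9 * 10^18 < 2^63'         # prints 1: the 9 EB image fits in a signed loff_t
  1

Note that truncate's "EB" suffix is decimal (10^18); "9E" would have
been 9 * 2^60 bytes, i.e. 9 EiB, and over the line.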
Using -K does hit some failures on the bno number though:

mkfs.xfs -K -f /dev/mapper/joined
meta-data=/dev/mapper/joined     isize=512    agcount=14551916, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
data     =                       bsize=4096   blocks=3906250000000000, imaxpct=1
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
mkfs.xfs: pwrite failed: Invalid argument
libxfs_bwrite: write failed on (unknown) bno 0x6f05b59d3b1f00/0x100, err=22
mkfs.xfs: Releasing dirty buffer to free list!
found dirty buffer (bulk) on free list!
mkfs.xfs: pwrite failed: No space left on device
libxfs_bwrite: write failed on (unknown) bno 0x0/0x100, err=28
mkfs.xfs: Releasing dirty buffer to free list!
found dirty buffer (bulk) on free list!
mkfs.xfs: pwrite failed: No space left on device
libxfs_bwrite: write failed on xfs_sb bno 0x0/0x1, err=28
mkfs.xfs: Releasing dirty buffer to free list!
mkfs.xfs: libxfs_device_zero seek to offset 8000000394407514112 failed: Invalid argument

I still have to chew through the rest of your reply, thanks for the
details!

  Luis
* Re: Max theoretical XFS filesystem size in review

From: Dave Chinner
Date: 2024-03-15 1:14 UTC
To: Luis Chamberlain
Cc: linux-xfs, ritesh.list, Pankaj Raghav, Daniel Gomez, Matthew Wilcox

On Thu, Mar 14, 2024 at 05:12:22PM -0700, Luis Chamberlain wrote:
> On Thu, Feb 08, 2024 at 10:54:08AM +1100, Dave Chinner wrote:
> > On Wed, Feb 07, 2024 at 02:26:53PM -0800, Luis Chamberlain wrote:
> > > I'd like to review the max theoretical XFS filesystem size and whether
> > > the block size used affects it. At first I thought that the 16 EiB limit
> > > documented on a few pages online might reflect the current limitations
> > > [0], but I suspect it's an artifact of the BLKGETSIZE64 limitation.
> > > There might be others, so I welcome feedback on other constraints as
> > > well.
> >
> > The actual limit is 8EiB, not 16EiB. mkfs.xfs won't allow a filesystem
> > over 8EiB to be made.
>
> A truncated 9 EB file seems to go through:

<sigh>

9EB  = 9000000000000000000
8EiB = 9223372036854775808

So, 9EB < 8EiB and yes, mkfs.xfs will accept anything smaller than
8EiB...

> truncate -s 9EB /mnt-pmem/sparse-9eb; losetup /dev/loop0 /mnt-pmem/sparse-9eb
> mkfs.xfs -K /dev/loop0
> meta-data=/dev/loop0             isize=512    agcount=8185453, agsize=268435455 blks

Yup, agcount is clearly less than 8388608, so you've screwed up your
units there...

> Joining two 8 EB files with device-mapper seems allowed:
>
> truncate -s 8EB /mnt-pmem/sparse-8eb.1; losetup /dev/loop1 /mnt-pmem/sparse-8eb.1
> truncate -s 8EB /mnt-pmem/sparse-8eb.2; losetup /dev/loop2 /mnt-pmem/sparse-8eb.2
>
> cat /home/mcgrof/dm-join-multiple.sh
> #!/bin/sh
> # Join multiple devices with the same size in a linear form
> # We assume the same size for simplicity
> set -e
> size=`blockdev --getsz $1`
> FILE=$(mktemp)
> for i in $(seq 1 $#) ; do
> 	offset=$(( ($i - 1) * $size ))
> 	echo "$offset $size linear $1 0" >> $FILE
> 	shift
> done
> cat $FILE | dmsetup create joined
> rm -f $FILE
>
> /home/mcgrof/dm-join-multiple.sh /dev/loop1 /dev/loop2
>
> And mkfs.xfs seems to go through on them, i.e. it's not rejected:

Ah, I think mkfs.xfs has a limit of 8EiB on image files, maybe not on
block devices. What's the actual limit of block device size on Linux?

> mkfs.xfs -f /dev/mapper/joined
> meta-data=/dev/mapper/joined     isize=512    agcount=14551916, agsize=268435455 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=1
>          =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
> data     =                       bsize=4096   blocks=3906250000000000, imaxpct=1
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=521728, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> Discarding blocks...
>
> I didn't wait, should we be rejecting that?

Probably. mkfs.xfs uses uint64_t for the block counts and arithmetic,
so all the size and geometry calcs should work. The problem is when we
translate those sizes to byte counts, because the Linux kernel side has
all sorts of problems there: many things described in bytes (like off_t
and loff_t) are signed.

Hence while you might be able to make block devices larger than 8EiB,
I'm pretty sure you can't actually do things like pread()/pwrite() at
offsets above 8EiB on block devices....
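The mkfs geometry in the earlier message is at least self-consistent
with a 9 * 10^18-byte device (a quick shell check of the reported
numbers):

  $ echo $(( 9000000000000000000 / 4096 ))                  # data blocks at bsize=4096
  2197265625000000
  $ echo $(( (2197265625000000 + 268435454) / 268435455 ))  # AGs of agsize=268435455, rounded up
  8185453

Both match the mkfs.xfs output, and 8185453 < 2^23 = 8388608, the AG
count of a maximal 8EiB filesystem with 1 TiB AGs.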
> Using -K does hit some failures on the bno number though:
>
> mkfs.xfs -K -f /dev/mapper/joined
> meta-data=/dev/mapper/joined     isize=512    agcount=14551916, agsize=268435455 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=1
>          =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
> data     =                       bsize=4096   blocks=3906250000000000, imaxpct=1
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=521728, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> mkfs.xfs: pwrite failed: Invalid argument
> libxfs_bwrite: write failed on (unknown) bno 0x6f05b59d3b1f00/0x100, err=22

daddr is 0x6f05b59d3b1f00. So let's convert that to a byte-based offset
from a buffer daddr:

$ printf "0x%llx\n" $(( 0x6f05b59d3b1f00 << 9 ))
0xde0b6b3a763e0000
$

It's hard to see, but if I write it as 16-bit couplets:

0xde0b 6b3a 763e 0000

You can see the high bit in the file offset is set, and so that's a
write beyond 8EiB that returned -EINVAL. That's exactly what
rw_verify_area() returns when loff_t *pos < 0 and the file does not
assert FMODE_UNSIGNED_OFFSET. No block-based filesystem asserts
FMODE_UNSIGNED_OFFSET, and neither do block devices, so this write
should always fail with -EINVAL.

And where did it fail? You used "-f", which set force_overwrite, which
means we do a bit of zeroing of potential locations of old XFS
structures (secondary superblocks) and silently swallow IO failures, so
it wasn't that. The next thing it does is whack potential MD and GPT
records at the end of the filesystem, and that's done in IO sizes of:

/*
 * amount (in bytes) we zero at the beginning and end of the device to
 * remove traces of other filesystems, raid superblocks, etc.
 */
#define WHACK_SIZE (128 * 1024)

128kB IOs. The above IO that failed with -EINVAL is a 128kB IO (0x100
basic blocks). This will emit a warning message that the IO failed (as
per above), but it also swallows IO errors and lets mkfs continue.

> mkfs.xfs: Releasing dirty buffer to free list!
> found dirty buffer (bulk) on free list!
> mkfs.xfs: pwrite failed: No space left on device
> libxfs_bwrite: write failed on (unknown) bno 0x0/0x100, err=28

Yup, that's the next write, to zap the first blocks of the device to
get rid of primary superblocks and other signatures from other types of
filesystems and partition tables. That failed with -ENOSPC, which
implies something went wrong in the dm/loop device IO/backing file IO
stage. Likely an 8EiB overflow problem somewhere.

> mkfs.xfs: Releasing dirty buffer to free list!
> found dirty buffer (bulk) on free list!
> mkfs.xfs: pwrite failed: No space left on device
> libxfs_bwrite: write failed on xfs_sb bno 0x0/0x1, err=28

And that's the initial write of the superblock (a single 512 byte
sector write) that failed with ENOSPC. Same error as the previous
write, same likely cause.

> mkfs.xfs: Releasing dirty buffer to free list!
> mkfs.xfs: libxfs_device_zero seek to offset 8000000394407514112 failed: Invalid argument

And yeah, there's the smoking gun: mkfs.xfs is attempting to seek to an
offset the block device can't support, and that is failing. IOWs, the
max supported block device size on Linux is 8EiB.

mkfs.xfs should really capture some of these errors, but largely the
problem here is that dm is allowing an unsupported block device mapping
to be created...

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
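Bash arithmetic is the same signed 64-bit as loff_t, so you can watch
the offset go negative (a sketch reusing the daddr from the failing
write above):

  $ printf '0x%llx\n' $(( 0x6f05b59d3b1f00 << 9 ))  # byte offset of the failing 128kB write
  0xde0b6b3a763e0000
  $ echo $(( 0x6f05b59d3b1f00 << 9 ))               # the same bits read as a signed value
  -2446744073709682688

Read unsigned, that offset is 16,000,000,000,000,000,000 - 131,072: a
WHACK_SIZE write ending exactly at the 16 EB end of the joined device.
Read as a signed loff_t it is negative, which is precisely the case
rw_verify_area() rejects with -EINVAL.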
* Re: Max theoretical XFS filesystem size in review

From: Matthew Wilcox
Date: 2024-03-15 2:48 UTC
To: Dave Chinner
Cc: Luis Chamberlain, linux-xfs, ritesh.list, Pankaj Raghav, Daniel Gomez

On Fri, Mar 15, 2024 at 12:14:05PM +1100, Dave Chinner wrote:
> On Thu, Mar 14, 2024 at 05:12:22PM -0700, Luis Chamberlain wrote:
> > Joining two 8 EB files with device-mapper seems allowed:
> >
> > truncate -s 8EB /mnt-pmem/sparse-8eb.1; losetup /dev/loop1 /mnt-pmem/sparse-8eb.1
> > truncate -s 8EB /mnt-pmem/sparse-8eb.2; losetup /dev/loop2 /mnt-pmem/sparse-8eb.2
> >
> > cat /home/mcgrof/dm-join-multiple.sh
> > #!/bin/sh
> > # Join multiple devices with the same size in a linear form
> > # We assume the same size for simplicity
> > set -e
> > size=`blockdev --getsz $1`
> > FILE=$(mktemp)
> > for i in $(seq 1 $#) ; do
> > 	offset=$(( ($i - 1) * $size ))
> > 	echo "$offset $size linear $1 0" >> $FILE
> > 	shift
> > done
> > cat $FILE | dmsetup create joined
> > rm -f $FILE
> >
> > /home/mcgrof/dm-join-multiple.sh /dev/loop1 /dev/loop2
> >
> > And mkfs.xfs seems to go through on them, i.e. it's not rejected:
>
> Ah, I think mkfs.xfs has a limit of 8EiB on image files, maybe not on
> block devices. What's the actual limit of block device size on Linux?

We can't seek past 2^63-1. That's the limit on lseek, llseek, lseek64
or whatever we're calling it these days. If we're missing a check
somewhere, that's a bug.
* Re: Max theoretical XFS filesystem size in review

From: Luis Chamberlain
Date: 2024-03-15 17:52 UTC
To: Matthew Wilcox
Cc: Dave Chinner, linux-xfs, ritesh.list, Pankaj Raghav, Daniel Gomez

On Fri, Mar 15, 2024 at 02:48:27AM +0000, Matthew Wilcox wrote:
> On Fri, Mar 15, 2024 at 12:14:05PM +1100, Dave Chinner wrote:
> > On Thu, Mar 14, 2024 at 05:12:22PM -0700, Luis Chamberlain wrote:
> > > Joining two 8 EB files with device-mapper seems allowed:
> > >
> > > [dm-join-multiple.sh setup trimmed]
> > >
> > > And mkfs.xfs seems to go through on them, i.e. it's not rejected:
> >
> > Ah, I think mkfs.xfs has a limit of 8EiB on image files, maybe not on
> > block devices. What's the actual limit of block device size on Linux?
>
> We can't seek past 2^63-1. That's the limit on lseek, llseek, lseek64
> or whatever we're calling it these days. If we're missing a check
> somewhere, that's a bug.

Thanks, I can send fixes; I just wanted to review some of these things
with the community to explore what a big fat Linux block device or
filesystem might be constrained to, if anything. The fact that through
this discussion we're uncovering some possibly missing checks is
already useful. I'll try to document some of it.

  Luis
* Re: Max theoretical XFS filesystem size in review

From: Dave Chinner
Date: 2024-03-18 0:00 UTC
To: Luis Chamberlain
Cc: Matthew Wilcox, linux-xfs, ritesh.list, Pankaj Raghav, Daniel Gomez

On Fri, Mar 15, 2024 at 10:52:43AM -0700, Luis Chamberlain wrote:
> On Fri, Mar 15, 2024 at 02:48:27AM +0000, Matthew Wilcox wrote:
> > We can't seek past 2^63-1. That's the limit on lseek, llseek, lseek64
> > or whatever we're calling it these days. If we're missing a check
> > somewhere, that's a bug.
>
> Thanks, I can send fixes; I just wanted to review some of these things
> with the community to explore what a big fat Linux block device or
> filesystem might be constrained to, if anything. The fact that through
> this discussion we're uncovering some possibly missing checks is
> already useful. I'll try to document some of it.

I don't really care about some random documentation on some random
website about some weird corner case issue. Just fix the problems you
find and get the patches to mkfs.xfs merged.

Realistically, though, we just haven't cared about mkfs.xfs behaviour
at that scale because of one main issue: have you ever waited for
mkfs.xfs to create and then mount an ~8EiB XFS filesystem?

You have to wait through the hundreds of millions of synchronous writes
(as in "waits for each submitted write to complete", not O_SYNC) that
mkfs.xfs needs to do to create the filesystem, and then wait through
the hundreds of millions of synchronous reads that mount does in the
kernel to allow the filesystem to mount.

Hence we have not done any real validation of behaviour at that scale
because of the time and resource cost involved in just creating and
mounting filesystems at that scale. Unless you have many, many hours to
burn every time you want to mkfs and mount an XFS filesystem, it's just
not practical to even do basic functional testing at this scale.

And, really, mkfs.xfs is the least of the problems that need addressing
before we can test filesystems that large.
We do full filesystem AG walks at mount that need to be avoided, we
need tens of GB of RAM to hold all the AG information in kernel memory
(we can't free per-AG information on demand yet - that's part of the
problem that makes shrink so complex), we have algorithms that do
linear AG walks that depend on AG information being held in memory,
etc. When you're talking about an algorithm that can iterate all AGs in
the filesystem 3 times before failing, and having 8.4 million AGs
indexed, this is a serious scalability problem.

IOWs, we've got years of development ahead of us to scale the
filesystem implementation out to handle filesystems larger than a few
PiB efficiently - mkfs.xfs limits are the most trivial of things
compared to the deep surgery that is needed to make 64 bit capacity
support a production-quality reality....

-Dave.
--
Dave Chinner
david@fromorbit.com