Re: Insane file system overhead on large volume

From: Martin Steigerwald <Martin@lichtvoll.de>
To: Eric Sandeen <sandeen@sandeen.net>
Cc: Manny <dermaniac@gmail.com>, xfs@oss.sgi.com
Subject: Re: Insane file system overhead on large volume
Date: Sat, 28 Jan 2012 17:23:42 +0100
Message-ID: <201201281723.42786.Martin@lichtvoll.de>
In-Reply-To: <4F2415B2.3080605@sandeen.net>

Am Samstag, 28. Januar 2012 schrieb Eric Sandeen:
> On 1/28/12 8:55 AM, Martin Steigerwald wrote:
> > Am Freitag, 27. Januar 2012 schrieb Eric Sandeen:
> >> On 1/27/12 1:50 AM, Manny wrote:
> >>> Hi there,
> >>> 
> >>> I'm not sure if this is intended behavior, but I was a bit stumped
> >>> when I formatted a 30TB volume (12x3TB minus 2x3TB for parity in
> >>> RAID 6) with XFS and noticed that there were only 22 TB left. I
> >>> just called mkfs.xfs with default parameters - except for swidth
> >>> and sunit which match the RAID setup.
> >>> 
> >>> Is it normal that I lost 8TB just for the file system? That's
> >>> almost 30% of the volume. Should I set the block size higher? Or
> >>> should I increase the number of allocation groups? Would that make
> >>> a difference? What's the preferred method for handling such large
> >>> volumes?
> >> 
> >> If it was 12x3TB I imagine you're confusing TB with TiB, so
> >> perhaps your 30T is really only 27TiB to start with.
> >> 
> >> Anyway, fs metadata should not eat much space:
> >> 
> >> # mkfs.xfs -dfile,name=fsfile,size=30t
> >> # ls -lh fsfile
> >> -rw-r--r-- 1 root root 30T Jan 27 12:18 fsfile
> >> # mount -o loop fsfile  mnt/
> >> # df -h mnt
> >> Filesystem            Size  Used Avail Use% Mounted on
> >> /tmp/fsfile            30T  5.0M   30T   1% /tmp/mnt
> >> 
> >> So Christoph's question was a good one; where are you getting
> >> your sizes?
> 
> To solve your original problem, can you answer the above question?
> Adding your actual raid config output (/proc/mdstat maybe) would help
> too.

Eric, I wrote

> > An academic question:

to make clear that it was just something I was curious about.

I was not the reporter of the problem anyway. I have no problem, and
neither does the reporter (see his answer), so all is good ;)

With your hint, plus some thinking and testing, I was able to
resolve most of my other questions. Thanks.

For the gory details:

> > Why is it that I get
[…]
> > merkaba:/tmp> LANG=C df -hT /mnt/zeit
> > Filesystem     Type  Size  Used Avail Use% Mounted on
> > /dev/loop0     xfs    30T   33M   30T   1% /mnt/zeit
> > 
> > 
> > 33MiB used on first mount instead of 5?
> 
> Not sure offhand, differences in xfsprogs version mkfs defaults
> perhaps.

Okay, that's fine with me. I was just curious. It doesn't matter much.

> > Hmmm, but creating the file on Ext4 does not work:
> ext4 is not designed to handle very large files, so anything
> above 16T will fail.
> 
> > fallocate instead of sparse file?
> 
> no, you just ran into file offset limits on ext4.

Oh, yes. Completely forgot about these Ext4 limits. Sorry.

> > And on BTRFS as well as XFS it appears to try to create a 30T file
> > for real, i.e. by writing data - I stopped it before it could do too
> > much harm.
> 
> Why do you say that it appears to create a 30T file for real?  It
> should not...

I jumped to a conclusion too quickly. It did cause an I/O storm on the
Intel SSD 320:

martin@merkaba:~> vmstat -S M 1 (not applied to bi/bo)
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0   1630   4365     87   1087    0    0   101    53    7   81  5  2 93  0
 1  0   1630   4365     87   1087    0    0     0     0  428  769  1  0 99  0
 2  0   1630   4365     87   1087    0    0     0     0  426  740  1  1 99  0
 0  0   1630   4358     87   1088    0    0     0     0 1165 2297  4  7 89  0
 0  0   1630   4357     87   1088    0    0     0    40 1736 3434  8  6 86  0
 0  0   1630   4357     87   1088    0    0     0     0  614 1121  3  1 96  0
 0  0   1630   4357     87   1088    0    0     0    32  359  636  0  0 100  0
 1  1   1630   3852     87   1585    0    0    13 81540  529 1045  1  7 91  1
 0  3   1630   3398     87   2027    0    0     0 227940 1357 2764  0  9 54 37
 4  3   1630   3225     87   2188    0    0     0 212004 2346 4796  5  6 41 49
 1  3   1630   2992     87   2415    0    0     0 215608 1825 3821  1  6 42 50
 0  2   1630   2820     87   2582    0    0     0 200492 1476 3089  3  6 49 41
 1  1   1630   2569     87   2832    0    0     0 198156 1250 2508  0  6 59 34
 0  2   1630   2386     87   3009    0    0     0 229896 1301 2611  1  6 56 37
 0  2   1630   2266     87   3126    0    0     0 302876 1067 2093  0  5 62 33
 1  3   1630   2266     87   3126    0    0     0 176092  723 1321  0  3 71 26
 0  3   1630   2266     87   3126    0    0     0 163840  706 1351  0  1 74 25
 0  1   1630   2266     87   3126    0    0     0 80104 3137 6228  1  4 69 26
 0  0   1630   2267     87   3126    0    0     0     3 3505 7035  6  3 86  5
 0  0   1630   2266     87   3126    0    0     0     0  631 1203  4  1 95  0
 0  0   1630   2259     87   3127    0    0     0     0  715 1398  4  2 94  0
 2  0   1630   2259     87   3127    0    0     0     0 1501 3087 10  3 86  0
 0  0   1630   2259     87   3127    0    0     0    27  945 1883  5  2 93  0
 0  0   1630   2259     87   3127    0    0     0     0  399  713  1  0 99  0
^C

But then it stopped. So it seems mkfs.xfs was just writing metadata,
and obviously I didn't see this on the tmpfs.

On reflection, creating a 30 TB XFS filesystem should involve writing
some metadata at different places in the file.

I get:

merkaba:/mnt/zeit> LANG=C xfs_bmap fsfile
fsfile:
        0: [0..255]: 96..351
        1: [256..2147483639]: hole
        2: [2147483640..2147483671]: 3400032..3400063
        3: [2147483672..4294967279]: hole
        4: [4294967280..4294967311]: 3400064..3400095
        5: [4294967312..6442450919]: hole
        6: [6442450920..6442450951]: 3400096..3400127
        7: [6442450952..8589934559]: hole
        8: [8589934560..8589934591]: 3400128..3400159
        9: [8589934592..10737418199]: hole
        10: [10737418200..10737418231]: 3400160..3400191
        11: [10737418232..12884901839]: hole
        12: [12884901840..12884901871]: 3400192..3400223
        13: [12884901872..15032385479]: hole
        14: [15032385480..15032385511]: 3400224..3400255
        15: [15032385512..17179869119]: hole
        16: [17179869120..17179869151]: 3400256..3400287
        17: [17179869152..19327352759]: hole
        18: [19327352760..19327352791]: 3400296..3400327
        19: [19327352792..21474836399]: hole
        20: [21474836400..21474836431]: 3400328..3400359
        21: [21474836432..23622320039]: hole
        22: [23622320040..23622320071]: 3400360..3400391
        23: [23622320072..25769803679]: hole
        24: [25769803680..25769803711]: 3400392..3400423
        25: [25769803712..27917287319]: hole
        26: [27917287320..27917287351]: 3400424..3400455
        27: [27917287352..30064770959]: hole
        28: [30064770960..30064770991]: 3400456..3400487
        29: [30064770992..32212254599]: hole
        30: [32212254600..32212254631]: 3400488..3400519
        31: [32212254632..32215654311]: 352..3400031
        32: [32215654312..32216428455]: 3400520..4174663
        33: [32216428456..34359738239]: hole
        34: [34359738240..34359738271]: 4174664..4174695
        35: [34359738272..36507221879]: hole
        36: [36507221880..36507221911]: 4174696..4174727
        37: [36507221912..38654705519]: hole
        38: [38654705520..38654705551]: 4174728..4174759
        39: [38654705552..40802189159]: hole
        40: [40802189160..40802189191]: 4174760..4174791
        41: [40802189192..42949672799]: hole
        42: [42949672800..42949672831]: 4174792..4174823
        43: [42949672832..45097156439]: hole
        44: [45097156440..45097156471]: 4174824..4174855
        45: [45097156472..47244640079]: hole
        46: [47244640080..47244640111]: 4174856..4174887
        47: [47244640112..49392123719]: hole
        48: [49392123720..49392123751]: 4174888..4174919
        49: [49392123752..51539607359]: hole
        50: [51539607360..51539607391]: 4174920..4174951
        51: [51539607392..53687090999]: hole
        52: [53687091000..53687091031]: 4174952..4174983
        53: [53687091032..55834574639]: hole
        54: [55834574640..55834574671]: 4174984..4175015
        55: [55834574672..57982058279]: hole
        56: [57982058280..57982058311]: 4175016..4175047
        57: [57982058312..60129541919]: hole
        58: [60129541920..60129541951]: 4175048..4175079
        59: [60129541952..62277025559]: hole
        60: [62277025560..62277025591]: 4175080..4175111
        61: [62277025592..64424509191]: hole
        62: [64424509192..64424509199]: 4175112..4175119
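
The regular spacing of the small extents in this listing can be checked
with a little arithmetic. A minimal sketch, assuming that the stride of
2147483640 512-byte blocks seen in the listing is the allocation group
size (xfs_bmap itself does not say so) and that the shell does 64-bit
arithmetic. The function name ag_offset is made up for illustration:

```shell
# Stride between the small 32-block extents in the xfs_bmap listing,
# in 512-byte blocks. Treating this as the allocation group size is an
# assumption read off the listing, not something xfs_bmap reports.
stride=2147483640

# Offset (in 512-byte blocks) at which the k-th small extent should start.
ag_offset() {
    echo $(( $1 * stride ))
}

for k in 1 2 15 29; do
    echo "$k: $(ag_offset "$k")"
done

# The stride in bytes, for comparison with 1 TiB (2^40 bytes).
echo "stride bytes: $(( stride * 512 ))"
echo "1 TiB bytes:  $(( 1 << 40 ))"
```

The printed offsets can be compared with extents 2, 4, 30 and 60 in the
listing, and the stride comes out 4096 bytes short of 1 TiB.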

Okay, it needed to write 2 GB:

merkaba:/mnt/zeit> du -h fsfile 
2,0G    fsfile
merkaba:/mnt/zeit> du --apparent-size -h fsfile
30T     fsfile
merkaba:/mnt/zeit>

I didn't expect mkfs.xfs to write 2 GB, but thinking it through, I find
this reasonable for a 30 TB filesystem.
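
The 2 GB figure can be cross-checked against the xfs_bmap listing by
adding up the non-hole extents. A minimal sketch, assuming awk is
available. The function name sum_bmap_mib is made up for illustration,
and the sample input is just a few lines copied from the listing:

```shell
# Sum the allocated (non-hole) extents in xfs_bmap output.
# xfs_bmap reports file offsets in 512-byte blocks.
sum_bmap_mib() {
    awk '
        # Extent lines look like "N: [start..end]: physstart..physend";
        # holes have "hole" as the third field and are skipped.
        $3 != "hole" && $2 ~ /^\[[0-9]+\.\.[0-9]+\]:$/ {
            # Splitting "[start..end]:" on non-digits yields "", start, end, "".
            split($2, r, /[^0-9]+/)
            blocks += r[3] - r[2] + 1
        }
        END { printf "%d blocks, %.1f MiB\n", blocks, blocks * 512 / 1048576 }
    '
}

# Sample: the first extents plus the two large ones from the listing.
sum_bmap_mib <<'EOF'
fsfile:
        0: [0..255]: 96..351
        1: [256..2147483639]: hole
        2: [2147483640..2147483671]: 3400032..3400063
        31: [32212254632..32215654311]: 352..3400031
        32: [32215654312..32216428455]: 3400520..4174663
EOF
```

Run against the full output (xfs_bmap fsfile | sum_bmap_mib), this
should land close to what du reports. The two large extents in the
sample account for 4173824 blocks, nearly all of the total. That would
be consistent with mkfs.xfs writing out an internal log in the middle
of the filesystem, but this interpretation is an assumption; the
listing does not say what the extents contain.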

Still, it reports 33 MiB used for metadata:

merkaba:/mnt/zeit> mkdir bigfilefs
merkaba:/mnt/zeit> mount -o loop fsfile bigfilefs 
merkaba:/mnt/zeit> LANG=C df -hT bigfilefs
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/loop0     xfs    30T   33M   30T   1% /mnt/zeit/bigfilefs

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7
