From: Martin Steigerwald <Martin@lichtvoll.de>
To: Eric Sandeen <sandeen@sandeen.net>
Cc: Manny <dermaniac@gmail.com>, xfs@oss.sgi.com
Subject: Re: Insane file system overhead on large volume
Date: Sat, 28 Jan 2012 17:23:42 +0100
Message-ID: <201201281723.42786.Martin@lichtvoll.de>
In-Reply-To: <4F2415B2.3080605@sandeen.net>

Am Samstag, 28. Januar 2012 schrieb Eric Sandeen:
> On 1/28/12 8:55 AM, Martin Steigerwald wrote:
> > Am Freitag, 27. Januar 2012 schrieb Eric Sandeen:
> >> On 1/27/12 1:50 AM, Manny wrote:
> >>> Hi there,
> >>> 
> >>> I'm not sure if this is intended behavior, but I was a bit stumped
> >>> when I formatted a 30TB volume (12x3TB minus 2x3TB for parity in
> >>> RAID 6) with XFS and noticed that there were only 22 TB left. I
> >>> just called mkfs.xfs with default parameters - except for swidth
> >>> and sunit which match the RAID setup.
> >>> 
> >>> Is it normal that I lost 8TB just for the file system? That's
> >>> almost 30% of the volume. Should I set the block size higher? Or
> >>> should I increase the number of allocation groups? Would that make
> >>> a difference? What's the preferred method for handling such large
> >>> volumes?
> >> 
> >> If it was 12x3TB I imagine you're confusing TB with TiB, so
> >> perhaps your 30T is really only 27TiB to start with.
> >> 
> >> Anyway, fs metadata should not eat much space:
> >> 
> >> # mkfs.xfs -dfile,name=fsfile,size=30t
> >> # ls -lh fsfile
> >> -rw-r--r-- 1 root root 30T Jan 27 12:18 fsfile
> >> # mount -o loop fsfile  mnt/
> >> # df -h mnt
> >> Filesystem            Size  Used Avail Use% Mounted on
> >> /tmp/fsfile            30T  5.0M   30T   1% /tmp/mnt
> >> 
> >> So Christoph's question was a good one; where are you getting
> >> your sizes?
> 
> To solve your original problem, can you answer the above question?
> Adding your actual raid config output (/proc/mdstat maybe) would help
> too.

Eric, I wrote

> > An academic question:

to make clear that it was just something I was curious about.

I was not the reporter of the problem anyway, I have no problem,
the reporter has no problem, see his answer, so all is good ;)

With your hint and some thinking and testing, I was able to
resolve most of my other questions. Thanks.



For the gory details:

> > Why is it that I get
[…]
> > merkaba:/tmp> LANG=C df -hT /mnt/zeit
> > Filesystem     Type  Size  Used Avail Use% Mounted on
> > /dev/loop0     xfs    30T   33M   30T   1% /mnt/zeit
> > 
> > 
> > 33MiB used on first mount instead of 5?
> 
> Not sure offhand, differences in xfsprogs version mkfs defaults
> perhaps.

Okay, that's fine with me. I was just curious. It doesn't matter much.

> > Hmmm, but creating the file on Ext4 does not work:
> ext4 is not designed to handle very large files, so anything
> above 16T will fail.
> 
> > fallocate instead of sparse file?
> 
> no, you just ran into file offset limits on ext4.

Oh, yes. Completely forgot about these Ext4 limits. Sorry.
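
For reference, a back-of-the-envelope sketch of where the 16T figure comes
from. It assumes ext4's default 4 KiB block size and 32-bit logical block
numbers per file; both are assumptions, not something checked on this
machine, and the function names are made up for the example.

```python
# Assumptions (not from the thread): 4 KiB blocks, 32-bit logical block numbers.
BLOCK_SIZE = 4096
MAX_LOGICAL_BLOCKS = 2**32
TIB = 2**40


def ext4_max_file_bytes(block_size=BLOCK_SIZE):
    """Approximate upper bound on a single file's size under the assumptions above."""
    return MAX_LOGICAL_BLOCKS * block_size


def fits_on_ext4(size_bytes, block_size=BLOCK_SIZE):
    """True if a file of size_bytes stays within that approximate limit."""
    return size_bytes <= ext4_max_file_bytes(block_size)


if __name__ == "__main__":
    print(ext4_max_file_bytes() // TIB, "TiB approximate max file size")
    print("30 TiB image fits:", fits_on_ext4(30 * TIB))
```

Under these assumptions a 30 TiB image file is well past the limit, whether
it is sparse or not.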

> > And on BTRFS as well as XFS it appears to try to create a 30T file
> > for real, i.e. by writing data - I stopped it before it could do too
> > much harm.
> 
> Why do you say that it appears to create a 30T file for real?  It
> should not...

I jumped to a conclusion too quickly. It did cause an I/O storm on the
Intel SSD 320:

martin@merkaba:~> vmstat -S M 1 (not applied to bi/bo)
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0   1630   4365     87   1087    0    0   101    53    7   81  5  2 93  0
 1  0   1630   4365     87   1087    0    0     0     0  428  769  1  0 99  0
 2  0   1630   4365     87   1087    0    0     0     0  426  740  1  1 99  0
 0  0   1630   4358     87   1088    0    0     0     0 1165 2297  4  7 89  0
 0  0   1630   4357     87   1088    0    0     0    40 1736 3434  8  6 86  0
 0  0   1630   4357     87   1088    0    0     0     0  614 1121  3  1 96  0
 0  0   1630   4357     87   1088    0    0     0    32  359  636  0  0 100  0
 1  1   1630   3852     87   1585    0    0    13 81540  529 1045  1  7 91  1
 0  3   1630   3398     87   2027    0    0     0 227940 1357 2764  0  9 54 37
 4  3   1630   3225     87   2188    0    0     0 212004 2346 4796  5  6 41 49
 1  3   1630   2992     87   2415    0    0     0 215608 1825 3821  1  6 42 50
 0  2   1630   2820     87   2582    0    0     0 200492 1476 3089  3  6 49 41
 1  1   1630   2569     87   2832    0    0     0 198156 1250 2508  0  6 59 34
 0  2   1630   2386     87   3009    0    0     0 229896 1301 2611  1  6 56 37
 0  2   1630   2266     87   3126    0    0     0 302876 1067 2093  0  5 62 33
 1  3   1630   2266     87   3126    0    0     0 176092  723 1321  0  3 71 26
 0  3   1630   2266     87   3126    0    0     0 163840  706 1351  0  1 74 25
 0  1   1630   2266     87   3126    0    0     0 80104 3137 6228  1  4 69 26
 0  0   1630   2267     87   3126    0    0     0     3 3505 7035  6  3 86  5
 0  0   1630   2266     87   3126    0    0     0     0  631 1203  4  1 95  0
 0  0   1630   2259     87   3127    0    0     0     0  715 1398  4  2 94  0
 2  0   1630   2259     87   3127    0    0     0     0 1501 3087 10  3 86  0
 0  0   1630   2259     87   3127    0    0     0    27  945 1883  5  2 93  0
 0  0   1630   2259     87   3127    0    0     0     0  399  713  1  0 99  0
^C

But then it stopped. So it seems mkfs.xfs was just writing metadata,
and I obviously didn't see this in the tmpfs.
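
A rough sketch for totalling that burst: it sums the bo column of vmstat
output sampled once per second. It assumes bo is reported in 1024-byte
blocks, as the vmstat man page states for Linux; sum_bo_kib is a made-up
helper name and the sample rows are copied from the output above.

```python
def sum_bo_kib(vmstat_text):
    """Sum the 'bo' column (blocks written out) over all numeric sample rows."""
    total = 0
    for line in vmstat_text.splitlines():
        fields = line.split()
        # Data rows are purely numeric; header rows and "^C" are skipped.
        if len(fields) < 10 or not all(f.isdigit() for f in fields):
            continue
        total += int(fields[9])  # columns: r b swpd free buff cache si so bi bo ...
    return total


SAMPLE = """\
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0   1630   4357     87   1088    0    0     0    32  359  636  0  0 100  0
 1  1   1630   3852     87   1585    0    0    13 81540  529 1045  1  7 91  1
 0  3   1630   3398     87   2027    0    0     0 227940 1357 2764  0  9 54 37
"""

if __name__ == "__main__":
    kib = sum_bo_kib(SAMPLE)
    print(f"{kib} KiB written, about {kib / 2**20:.2f} GiB")
```

Fed the whole table above instead of the three sample rows, the total comes
to roughly 2 GiB under that assumption.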

But when I think it over, creating a 30TB XFS filesystem should involve
writing some metadata at different places in the file.

I get:

merkaba:/mnt/zeit> LANG=C xfs_bmap fsfile
fsfile:
        0: [0..255]: 96..351
        1: [256..2147483639]: hole
        2: [2147483640..2147483671]: 3400032..3400063
        3: [2147483672..4294967279]: hole
        4: [4294967280..4294967311]: 3400064..3400095
        5: [4294967312..6442450919]: hole
        6: [6442450920..6442450951]: 3400096..3400127
        7: [6442450952..8589934559]: hole
        8: [8589934560..8589934591]: 3400128..3400159
        9: [8589934592..10737418199]: hole
        10: [10737418200..10737418231]: 3400160..3400191
        11: [10737418232..12884901839]: hole
        12: [12884901840..12884901871]: 3400192..3400223
        13: [12884901872..15032385479]: hole
        14: [15032385480..15032385511]: 3400224..3400255
        15: [15032385512..17179869119]: hole
        16: [17179869120..17179869151]: 3400256..3400287
        17: [17179869152..19327352759]: hole
        18: [19327352760..19327352791]: 3400296..3400327
        19: [19327352792..21474836399]: hole
        20: [21474836400..21474836431]: 3400328..3400359
        21: [21474836432..23622320039]: hole
        22: [23622320040..23622320071]: 3400360..3400391
        23: [23622320072..25769803679]: hole
        24: [25769803680..25769803711]: 3400392..3400423
        25: [25769803712..27917287319]: hole
        26: [27917287320..27917287351]: 3400424..3400455
        27: [27917287352..30064770959]: hole
        28: [30064770960..30064770991]: 3400456..3400487
        29: [30064770992..32212254599]: hole
        30: [32212254600..32212254631]: 3400488..3400519
        31: [32212254632..32215654311]: 352..3400031
        32: [32215654312..32216428455]: 3400520..4174663
        33: [32216428456..34359738239]: hole
        34: [34359738240..34359738271]: 4174664..4174695
        35: [34359738272..36507221879]: hole
        36: [36507221880..36507221911]: 4174696..4174727
        37: [36507221912..38654705519]: hole
        38: [38654705520..38654705551]: 4174728..4174759
        39: [38654705552..40802189159]: hole
        40: [40802189160..40802189191]: 4174760..4174791
        41: [40802189192..42949672799]: hole
        42: [42949672800..42949672831]: 4174792..4174823
        43: [42949672832..45097156439]: hole
        44: [45097156440..45097156471]: 4174824..4174855
        45: [45097156472..47244640079]: hole
        46: [47244640080..47244640111]: 4174856..4174887
        47: [47244640112..49392123719]: hole
        48: [49392123720..49392123751]: 4174888..4174919
        49: [49392123752..51539607359]: hole
        50: [51539607360..51539607391]: 4174920..4174951
        51: [51539607392..53687090999]: hole
        52: [53687091000..53687091031]: 4174952..4174983
        53: [53687091032..55834574639]: hole
        54: [55834574640..55834574671]: 4174984..4175015
        55: [55834574672..57982058279]: hole
        56: [57982058280..57982058311]: 4175016..4175047
        57: [57982058312..60129541919]: hole
        58: [60129541920..60129541951]: 4175048..4175079
        59: [60129541952..62277025559]: hole
        60: [62277025560..62277025591]: 4175080..4175111
        61: [62277025592..64424509191]: hole
        62: [64424509192..64424509199]: 4175112..4175119
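
A small sketch to add up what is actually allocated in that map. It assumes
xfs_bmap reports ranges in 512-byte units (its default); the function name
allocated_sectors is made up for this example, and the sample lines are the
first few from the output above.

```python
import re

# Matches lines like "2: [2147483640..2147483671]: 3400032..3400063".
EXTENT_RE = re.compile(r"^\s*\d+:\s*\[(\d+)\.\.(\d+)\]:\s*(\S+)")


def allocated_sectors(bmap_text):
    """Return the number of 512-byte units backed by real extents (holes skipped)."""
    total = 0
    for line in bmap_text.splitlines():
        m = EXTENT_RE.match(line)
        if not m or m.group(3) == "hole":
            continue
        start, end = int(m.group(1)), int(m.group(2))
        total += end - start + 1  # file-offset ranges are inclusive
    return total


SAMPLE = """\
fsfile:
        0: [0..255]: 96..351
        1: [256..2147483639]: hole
        2: [2147483640..2147483671]: 3400032..3400063
"""

if __name__ == "__main__":
    sectors = allocated_sectors(SAMPLE)
    print(sectors, "sectors,", sectors * 512, "bytes")
```

Run over the full listing, this gives 4175016 units, i.e. about 1.99 GiB,
nearly all of it in the two large extents 31 and 32 in the middle of the
file.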

Okay, it needed to write 2 GB:

merkaba:/mnt/zeit> du -h fsfile 
2,0G    fsfile
merkaba:/mnt/zeit> du --apparent-size -h fsfile
30T     fsfile
merkaba:/mnt/zeit>
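
The same comparison can be sketched in Python. This assumes a filesystem
that supports sparse files and that st_blocks is counted in 512-byte units,
as on Linux; the helper names sizes and make_sparse and the temporary file
are only for illustration.

```python
import os
import tempfile


def sizes(path):
    """Return (apparent_bytes, allocated_bytes), like du --apparent-size vs. du."""
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512  # assumes 512-byte st_blocks units


def make_sparse(path, size_bytes):
    """Create a file of the given apparent size without writing any data."""
    with open(path, "wb") as f:
        f.truncate(size_bytes)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "fsfile")
        make_sparse(path, 2**30)  # 1 GiB apparent size
        apparent, allocated = sizes(path)
        print("apparent:", apparent, "allocated:", allocated)
```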

I didn't expect mkfs.xfs to write 2 GB, but thinking it through,
I find this reasonable for a 30 TB filesystem.

Still, it shows 33 MiB used for metadata:

merkaba:/mnt/zeit> mkdir bigfilefs
merkaba:/mnt/zeit> mount -o loop fsfile bigfilefs 
merkaba:/mnt/zeit> LANG=C df -hT bigfilefs
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/loop0     xfs    30T   33M   30T   1% /mnt/zeit/bigfilefs

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

