From: Martin Steigerwald <Martin@lichtvoll.de>
To: Eric Sandeen <sandeen@sandeen.net>
Cc: Manny <dermaniac@gmail.com>, xfs@oss.sgi.com
Subject: Re: Insane file system overhead on large volume
Date: Sat, 28 Jan 2012 17:23:42 +0100
Message-ID: <201201281723.42786.Martin@lichtvoll.de>
In-Reply-To: <4F2415B2.3080605@sandeen.net>
On Saturday, 28 January 2012, Eric Sandeen wrote:
> On 1/28/12 8:55 AM, Martin Steigerwald wrote:
> > On Friday, 27 January 2012, Eric Sandeen wrote:
> >> On 1/27/12 1:50 AM, Manny wrote:
> >>> Hi there,
> >>>
> >>> I'm not sure if this is intended behavior, but I was a bit stumped
> >>> when I formatted a 30TB volume (12x3TB minus 2x3TB for parity in
> >>> RAID 6) with XFS and noticed that there were only 22 TB left. I
> >>> just called mkfs.xfs with default parameters - except for swidth
> >>> and sunit which match the RAID setup.
> >>>
> >>> Is it normal that I lost 8TB just for the file system? That's
> >>> almost 30% of the volume. Should I set the block size higher? Or
> >>> should I increase the number of allocation groups? Would that make
> >>> a difference? What's the preferred method for handling such large
> >>> volumes?
> >>
> >> If it was 12x3TB I imagine you're confusing TB with TiB, so
> >> perhaps your 30T is really only 27TiB to start with.
> >>
> >> Anyway, fs metadata should not eat much space:
> >>
> >> # mkfs.xfs -dfile,name=fsfile,size=30t
> >> # ls -lh fsfile
> >> -rw-r--r-- 1 root root 30T Jan 27 12:18 fsfile
> >> # mount -o loop fsfile mnt/
> >> # df -h mnt
> >> Filesystem      Size  Used Avail Use% Mounted on
> >> /tmp/fsfile      30T  5.0M   30T   1% /tmp/mnt
> >>
> >> So Christoph's question was a good one; where are you getting
> >> your sizes?
>
> To solve your original problem, can you answer the above question?
> Adding your actual raid config output (/proc/mdstat maybe) would help
> too.
Eric, I wrote
> > An academic question:
to make clear that it was just something I was curious about.
I was not the reporter of the problem anyway. I have no problem, and
the reporter has no problem either (see his answer), so all is good ;)
With your hint and some thinking and testing, I was able to
resolve most of my other questions. Thanks.
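As an aside, the TB versus TiB point in the quoted text above is easy to
check. This is only a sketch of the arithmetic, assuming the disk sizes
from the original report (12 disks of 3 TB, two of them worth of parity);
the variable names are made up for illustration:

```python
# Decimal terabytes (disk labels) versus binary tebibytes (what df -h shows as "T").
TB = 10**12
TIB = 2**40

def tb_to_tib(terabytes: float) -> float:
    """Convert decimal terabytes to binary tebibytes."""
    return terabytes * TB / TIB

data_disks = 12 - 2          # RAID 6: two disks' worth of capacity goes to parity
usable_tb = data_disks * 3   # 3 TB per disk, taken from the original report
usable_tib = tb_to_tib(usable_tb)

print(f"{usable_tb} TB = {usable_tib:.2f} TiB")
```

So a "30 TB" array is a bit over 27 TiB before any filesystem is put on it,
which is in line with the 27TiB guess quoted above.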
For the gory details:
> > Why is it that I get
[…]
> > merkaba:/tmp> LANG=C df -hT /mnt/zeit
> > Filesystem     Type  Size  Used Avail Use% Mounted on
> > /dev/loop0     xfs    30T   33M   30T   1% /mnt/zeit
> >
> >
> > 33MiB used on first mount instead of 5?
>
> Not sure offhand, differences in xfsprogs version mkfs defaults
> perhaps.
Okay, that's fine with me. I was just curious. It doesn't matter much.
> > Hmmm, but creating the file on Ext4 does not work:
> ext4 is not designed to handle very large files, so anything
> above 16T will fail.
>
> > fallocate instead of sparse file?
>
> no, you just ran into file offset limits on ext4.
Oh, yes. Completely forgot about these Ext4 limits. Sorry.
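For reference, the 16T figure follows from the ext4 extent format, which
stores the logical block number in 32 bits. A rough sketch of the
arithmetic, assuming the common 4 KiB block size (the filesystem in
question may have been created with a different one):

```python
# ext4 extents address file blocks with a 32-bit logical block number,
# so the largest file offset depends on the block size.
BLOCK_SIZE = 4096          # assumed ext4 block size in bytes
LOGICAL_BLOCK_BITS = 32    # width of the logical block field in an extent
TIB = 2**40

max_file_bytes = BLOCK_SIZE * 2**LOGICAL_BLOCK_BITS
wanted_bytes = 30 * TIB    # the sparse image file from this thread

print(max_file_bytes // TIB, "TiB maximum file size")
print("fits" if wanted_bytes <= max_file_bytes else "too large for one ext4 file")
```

A sparse 30T file needs offsets well beyond that limit, so it fails on
ext4 no matter how little data is actually written.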
> > And on BTRFS as well as XFS it appears to try to create a 30T file
> > for real, i.e. by writing data - I stopped it before it could do too
> > much harm.
>
> Why do you say that it appears to create a 30T file for real? It
> should not...
I jumped to a conclusion too quickly. It did cause an I/O storm on the
Intel SSD 320:
martin@merkaba:~> vmstat -S M 1 (not applied to bi/bo)
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi     bo   in   cs us sy  id wa
 0  0   1630   4365     87   1087    0    0   101     53    7   81  5  2  93  0
 1  0   1630   4365     87   1087    0    0     0      0  428  769  1  0  99  0
 2  0   1630   4365     87   1087    0    0     0      0  426  740  1  1  99  0
 0  0   1630   4358     87   1088    0    0     0      0 1165 2297  4  7  89  0
 0  0   1630   4357     87   1088    0    0     0     40 1736 3434  8  6  86  0
 0  0   1630   4357     87   1088    0    0     0      0  614 1121  3  1  96  0
 0  0   1630   4357     87   1088    0    0     0     32  359  636  0  0 100  0
 1  1   1630   3852     87   1585    0    0    13  81540  529 1045  1  7  91  1
 0  3   1630   3398     87   2027    0    0     0 227940 1357 2764  0  9  54 37
 4  3   1630   3225     87   2188    0    0     0 212004 2346 4796  5  6  41 49
 1  3   1630   2992     87   2415    0    0     0 215608 1825 3821  1  6  42 50
 0  2   1630   2820     87   2582    0    0     0 200492 1476 3089  3  6  49 41
 1  1   1630   2569     87   2832    0    0     0 198156 1250 2508  0  6  59 34
 0  2   1630   2386     87   3009    0    0     0 229896 1301 2611  1  6  56 37
 0  2   1630   2266     87   3126    0    0     0 302876 1067 2093  0  5  62 33
 1  3   1630   2266     87   3126    0    0     0 176092  723 1321  0  3  71 26
 0  3   1630   2266     87   3126    0    0     0 163840  706 1351  0  1  74 25
 0  1   1630   2266     87   3126    0    0     0  80104 3137 6228  1  4  69 26
 0  0   1630   2267     87   3126    0    0     0      3 3505 7035  6  3  86  5
 0  0   1630   2266     87   3126    0    0     0      0  631 1203  4  1  95  0
 0  0   1630   2259     87   3127    0    0     0      0  715 1398  4  2  94  0
 2  0   1630   2259     87   3127    0    0     0      0 1501 3087 10  3  86  0
 0  0   1630   2259     87   3127    0    0     0     27  945 1883  5  2  93  0
 0  0   1630   2259     87   3127    0    0     0      0  399  713  1  0  99  0
^C
But then it stopped. So it seems mkfs.xfs was just writing metadata, and
obviously I didn't see this on the tmpfs.
Thinking it over, creating a 30TB XFS filesystem has to involve writing
some metadata at different places in the file.
I get:
merkaba:/mnt/zeit> LANG=C xfs_bmap fsfile
fsfile:
0: [0..255]: 96..351
1: [256..2147483639]: hole
2: [2147483640..2147483671]: 3400032..3400063
3: [2147483672..4294967279]: hole
4: [4294967280..4294967311]: 3400064..3400095
5: [4294967312..6442450919]: hole
6: [6442450920..6442450951]: 3400096..3400127
7: [6442450952..8589934559]: hole
8: [8589934560..8589934591]: 3400128..3400159
9: [8589934592..10737418199]: hole
10: [10737418200..10737418231]: 3400160..3400191
11: [10737418232..12884901839]: hole
12: [12884901840..12884901871]: 3400192..3400223
13: [12884901872..15032385479]: hole
14: [15032385480..15032385511]: 3400224..3400255
15: [15032385512..17179869119]: hole
16: [17179869120..17179869151]: 3400256..3400287
17: [17179869152..19327352759]: hole
18: [19327352760..19327352791]: 3400296..3400327
19: [19327352792..21474836399]: hole
20: [21474836400..21474836431]: 3400328..3400359
21: [21474836432..23622320039]: hole
22: [23622320040..23622320071]: 3400360..3400391
23: [23622320072..25769803679]: hole
24: [25769803680..25769803711]: 3400392..3400423
25: [25769803712..27917287319]: hole
26: [27917287320..27917287351]: 3400424..3400455
27: [27917287352..30064770959]: hole
28: [30064770960..30064770991]: 3400456..3400487
29: [30064770992..32212254599]: hole
30: [32212254600..32212254631]: 3400488..3400519
31: [32212254632..32215654311]: 352..3400031
32: [32215654312..32216428455]: 3400520..4174663
33: [32216428456..34359738239]: hole
34: [34359738240..34359738271]: 4174664..4174695
35: [34359738272..36507221879]: hole
36: [36507221880..36507221911]: 4174696..4174727
37: [36507221912..38654705519]: hole
38: [38654705520..38654705551]: 4174728..4174759
39: [38654705552..40802189159]: hole
40: [40802189160..40802189191]: 4174760..4174791
41: [40802189192..42949672799]: hole
42: [42949672800..42949672831]: 4174792..4174823
43: [42949672832..45097156439]: hole
44: [45097156440..45097156471]: 4174824..4174855
45: [45097156472..47244640079]: hole
46: [47244640080..47244640111]: 4174856..4174887
47: [47244640112..49392123719]: hole
48: [49392123720..49392123751]: 4174888..4174919
49: [49392123752..51539607359]: hole
50: [51539607360..51539607391]: 4174920..4174951
51: [51539607392..53687090999]: hole
52: [53687091000..53687091031]: 4174952..4174983
53: [53687091032..55834574639]: hole
54: [55834574640..55834574671]: 4174984..4175015
55: [55834574672..57982058279]: hole
56: [57982058280..57982058311]: 4175016..4175047
57: [57982058312..60129541919]: hole
58: [60129541920..60129541951]: 4175048..4175079
59: [60129541952..62277025559]: hole
60: [62277025560..62277025591]: 4175080..4175111
61: [62277025592..64424509191]: hole
62: [64424509192..64424509199]: 4175112..4175119
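The small extents in the listing above are evenly spaced, which is what
one would expect if each marks the start of an allocation group. A quick
sketch to check that, working in the 512-byte units that xfs_bmap
prints. The spacing and the count of 30 groups are read off the listing,
not taken from mkfs.xfs output, so treat them as assumptions:

```python
SECTOR = 512                # xfs_bmap reports offsets in 512-byte blocks
AG_SECTORS = 2147483640     # spacing between the small extents in the listing
AG_COUNT = 30               # inferred from the listing (assumption)

# Expected start offset of every allocation group, in sectors.
ag_starts = [i * AG_SECTORS for i in range(AG_COUNT)]
total_sectors = AG_COUNT * AG_SECTORS

print("AG size in bytes:", AG_SECTORS * SECTOR)
print("bytes short of 1 TiB:", 2**40 - AG_SECTORS * SECTOR)
print("AG 1 starts at sector", ag_starts[1])
print("AG 15 starts at sector", ag_starts[15])
print("file length in sectors:", total_sectors)
```

The computed starts match the file offsets of extents 2, 4, 6 and so on,
and the total of 64424509200 sectors matches the end of the last extent
(64424509199). Each group is 4 KiB short of 1 TiB, so the image is 30
such groups back to back.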
Okay, it needed to write 2 GB:
merkaba:/mnt/zeit> du -h fsfile
2,0G fsfile
merkaba:/mnt/zeit> du --apparent-size -h fsfile
30T fsfile
merkaba:/mnt/zeit>
I didn't expect mkfs.xfs to write 2 GB, but thinking it through, I find
this reasonable for a 30 TB filesystem.
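The 2 GB can be cross-checked against the xfs_bmap listing by summing
every extent that is not a hole. A minimal sketch, assuming the output
format shown above; allocated_sectors is a made-up helper name and the
sample contains only a handful of lines from the listing:

```python
import re

# Matches lines such as "31: [32212254632..32215654311]: 352..3400031"
# and "1: [256..2147483639]: hole".
EXTENT_RE = re.compile(r"^\s*\d+: \[(\d+)\.\.(\d+)\]: (\S+)")

def allocated_sectors(bmap_text: str) -> int:
    """Sum the lengths, in 512-byte sectors, of all non-hole extents."""
    total = 0
    for line in bmap_text.splitlines():
        m = EXTENT_RE.match(line)
        if m is None or m.group(3) == "hole":
            continue  # skip the file name line and the holes
        start, end = int(m.group(1)), int(m.group(2))
        total += end - start + 1  # ranges are inclusive
    return total

sample = """\
fsfile:
0: [0..255]: 96..351
1: [256..2147483639]: hole
30: [32212254600..32212254631]: 3400488..3400519
31: [32212254632..32215654311]: 352..3400031
32: [32215654312..32216428455]: 3400520..4174663
"""

sectors = allocated_sectors(sample)
print(sectors, "sectors =", sectors * 512 / 2**20, "MiB")
```

Extents 31 and 32 alone account for about 2038 MiB, so they are
presumably the internal log, which mkfs.xfs writes out in full. Fed the
complete listing, the sum should come to 4175016 sectors, roughly 1.99
GiB, which agrees with the du output.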
Still, it has 33 MiB used for metadata:
merkaba:/mnt/zeit> mkdir bigfilefs
merkaba:/mnt/zeit> mount -o loop fsfile bigfilefs
merkaba:/mnt/zeit> LANG=C df -hT bigfilefs
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/loop0     xfs    30T   33M   30T   1% /mnt/zeit/bigfilefs
Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs