public inbox for linux-xfs@vger.kernel.org
* Insane file system overhead on large volume
@ 2012-01-27  7:50 Manny
  2012-01-27 10:44 ` Christoph Hellwig
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Manny @ 2012-01-27  7:50 UTC (permalink / raw)
  To: xfs

Hi there,

I'm not sure if this is intended behavior, but I was a bit stumped
when I formatted a 30TB volume (12x3TB minus 2x3TB for parity in RAID
6) with XFS and noticed that there were only 22 TB left. I just called
mkfs.xfs with default parameters - except for swidth and sunit which
match the RAID setup.

Is it normal that I lost 8TB just for the file system? That's almost
30% of the volume. Should I set the block size higher? Or should I
increase the number of allocation groups? Would that make a
difference? What's the preferred method for handling such large
volumes?

Thanks a lot,
Manny


* Re: Insane file system overhead on large volume
  2012-01-27  7:50 Insane file system overhead on large volume Manny
@ 2012-01-27 10:44 ` Christoph Hellwig
  2012-01-27 19:15   ` Manny
  2012-01-27 18:21 ` Eric Sandeen
  2012-01-27 19:08 ` Stan Hoeppner
  2 siblings, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2012-01-27 10:44 UTC (permalink / raw)
  To: Manny; +Cc: xfs

On Fri, Jan 27, 2012 at 08:50:38AM +0100, Manny wrote:
> Hi there,
> 
> I'm not sure if this is intended behavior, but I was a bit stumped
> when I formatted a 30TB volume (12x3TB minus 2x3TB for parity in RAID
> 6) with XFS and noticed that there were only 22 TB left. I just called
> mkfs.xfs with default parameters - except for swidth and sunit which
> match the RAID setup.
> 
> Is it normal that I lost 8TB just for the file system? That's almost
> 30% of the volume. Should I set the block size higher? Or should I
> increase the number of allocation groups? Would that make a
> difference? What's the preferred method for handling such large
> volumes?

Where did you get the sizes for the raw volume and the filesystem usage
from?
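
For reference, one way to collect both numbers (the device and mount point
names below are only placeholders):

  # blockdev --getsize64 /dev/sdX     # raw volume size in bytes
  # df -h /mountpoint                 # size/used/avail the filesystem reports
  # xfs_info /mountpoint              # block size and block count mkfs chose

Comparing the blockdev figure against df and xfs_info usually shows where
the "missing" space went.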


* Re: Insane file system overhead on large volume
  2012-01-27  7:50 Insane file system overhead on large volume Manny
  2012-01-27 10:44 ` Christoph Hellwig
@ 2012-01-27 18:21 ` Eric Sandeen
  2012-01-28 14:55   ` Martin Steigerwald
  2012-01-27 19:08 ` Stan Hoeppner
  2 siblings, 1 reply; 11+ messages in thread
From: Eric Sandeen @ 2012-01-27 18:21 UTC (permalink / raw)
  To: Manny; +Cc: xfs

On 1/27/12 1:50 AM, Manny wrote:
> Hi there,
> 
> I'm not sure if this is intended behavior, but I was a bit stumped
> when I formatted a 30TB volume (12x3TB minus 2x3TB for parity in RAID
> 6) with XFS and noticed that there were only 22 TB left. I just called
> mkfs.xfs with default parameters - except for swidth and sunit which
> match the RAID setup.
> 
> Is it normal that I lost 8TB just for the file system? That's almost
> 30% of the volume. Should I set the block size higher? Or should I
> increase the number of allocation groups? Would that make a
> difference? What's the preferred method for handling such large
> volumes?

If it was 12x3TB I imagine you're confusing TB with TiB, so
perhaps your 30T is really only 27TiB to start with.
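
As a rough worked example (assuming ten 3 TB data disks, decimal TB):

  10 * 3 * 10^12 bytes = 30 * 10^12 bytes
  30 * 10^12 / 2^40    ≈ 27.3 TiB

so a "30 TB" array starts out at roughly 27 TiB before any filesystem
metadata is counted.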

Anyway, fs metadata should not eat much space:

# mkfs.xfs -dfile,name=fsfile,size=30t
# ls -lh fsfile
-rw-r--r-- 1 root root 30T Jan 27 12:18 fsfile
# mount -o loop fsfile  mnt/
# df -h mnt
Filesystem            Size  Used Avail Use% Mounted on
/tmp/fsfile            30T  5.0M   30T   1% /tmp/mnt

So Christoph's question was a good one; where are you getting
your sizes?

-Eric

> Thanks a lot,
> Manny
> 

* Re: Insane file system overhead on large volume
  2012-01-27  7:50 Insane file system overhead on large volume Manny
  2012-01-27 10:44 ` Christoph Hellwig
  2012-01-27 18:21 ` Eric Sandeen
@ 2012-01-27 19:08 ` Stan Hoeppner
  2 siblings, 0 replies; 11+ messages in thread
From: Stan Hoeppner @ 2012-01-27 19:08 UTC (permalink / raw)
  To: xfs

On 1/27/2012 1:50 AM, Manny wrote:
> Hi there,
> 
> I'm not sure if this is intended behavior, but I was a bit stumped
> when I formatted a 30TB volume (12x3TB minus 2x3TB for parity in RAID
> 6) with XFS and noticed that there were only 22 TB left. I just called
> mkfs.xfs with default parameters - except for swidth and sunit which
> match the RAID setup.
> 
> Is it normal that I lost 8TB just for the file system? That's almost
> 30% of the volume. Should I set the block size higher? Or should I
> increase the number of allocation groups? Would that make a
> difference? What's the preferred method for handling such large
> volumes?

Maybe you simply assigned 2 spares and forgot, so you actually only have
10 RAID6 disks with 8 disks worth of stripe, equaling 24 TB, or 21.8
TiB.  21.8 TiB matches up pretty closely with your 22 TB, so this
scenario seems pretty plausible, dare I say likely.

If this is the case you'll want to reformat the 10 disk RAID6 with the
proper sunit/swidth values.
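
For example (the chunk size here is only a guess - use whatever your
controller actually reports):

  # mkfs.xfs -d su=256k,sw=8 /dev/sdX

where su is the per-disk chunk (stripe unit) and sw is the number of data
disks in the stripe, i.e. 8 for a 10-disk RAID6.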

-- 
Stan


* Re: Insane file system overhead on large volume
  2012-01-27 10:44 ` Christoph Hellwig
@ 2012-01-27 19:15   ` Manny
  0 siblings, 0 replies; 11+ messages in thread
From: Manny @ 2012-01-27 19:15 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

> Where did you get the sizes for the raw volume and the filesystem usage
> from?

Oh my god, you are so right. The raw volume was actually just 24TB. My
RAID controller decided to leave 6TB on the VDisk for a Snap pool.

Thanks for the hint, and sorry to bother you.


* Re: Insane file system overhead on large volume
  2012-01-27 18:21 ` Eric Sandeen
@ 2012-01-28 14:55   ` Martin Steigerwald
  2012-01-28 15:35     ` Eric Sandeen
  0 siblings, 1 reply; 11+ messages in thread
From: Martin Steigerwald @ 2012-01-28 14:55 UTC (permalink / raw)
  To: xfs; +Cc: Manny, Eric Sandeen

On Friday, 27 January 2012, Eric Sandeen wrote:
> On 1/27/12 1:50 AM, Manny wrote:
> > Hi there,
> > 
> > I'm not sure if this is intended behavior, but I was a bit stumped
> > when I formatted a 30TB volume (12x3TB minus 2x3TB for parity in RAID
> > 6) with XFS and noticed that there were only 22 TB left. I just
> > called mkfs.xfs with default parameters - except for swidth and sunit
> > which match the RAID setup.
> > 
> > Is it normal that I lost 8TB just for the file system? That's almost
> > 30% of the volume. Should I set the block size higher? Or should I
> > increase the number of allocation groups? Would that make a
> > difference? What's the preferred method for handling such large
> > volumes?
> 
> If it was 12x3TB I imagine you're confusing TB with TiB, so
> perhaps your 30T is really only 27TiB to start with.
> 
> Anyway, fs metadata should not eat much space:
> 
> # mkfs.xfs -dfile,name=fsfile,size=30t
> # ls -lh fsfile
> -rw-r--r-- 1 root root 30T Jan 27 12:18 fsfile
> # mount -o loop fsfile  mnt/
> # df -h mnt
> Filesystem            Size  Used Avail Use% Mounted on
> /tmp/fsfile            30T  5.0M   30T   1% /tmp/mnt
> 
> So Christoph's question was a good one; where are you getting
> your sizes?

An academic question:

Why is it that I get

merkaba:/tmp> mkfs.xfs -dfile,name=fsfile,size=30t
meta-data=fsfile                 isize=256    agcount=30, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=8053063650, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =Internes Protokoll     bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =keine                  extsz=4096   blocks=0, rtextents=0

merkaba:/tmp> mount -o loop fsfile /mnt/zeit
merkaba:/tmp> df -hT /mnt/zeit
Dateisystem    Typ  Größe Benutzt Verf. Verw% Eingehängt auf
/dev/loop0     xfs    30T     33M   30T    1% /mnt/zeit
merkaba:/tmp> LANG=C df -hT /mnt/zeit
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/loop0     xfs    30T   33M   30T   1% /mnt/zeit


33MiB used on first mount instead of 5?

merkaba:/tmp> cat /proc/version
Linux version 3.2.0-1-amd64 (Debian 3.2.1-2) ([…]) (gcc version 4.6.2 
(Debian 4.6.2-12) ) #1 SMP Tue Jan 24 05:01:45 UTC 2012
merkaba:/tmp> mkfs.xfs -V      
mkfs.xfs Version 3.1.7

Maybe it's due to me using a tmpfs for /tmp:

merkaba:/tmp> LANG=C df -hT .
Filesystem     Type   Size  Used Avail Use% Mounted on
tmpfs          tmpfs  2.0G  2.0G  6.6M 100% /tmp


Hmmm, but creating the file on Ext4 does not work:

merkaba:/home> LANG=C df -hT .                     
Filesystem               Type  Size  Used Avail Use% Mounted on
/dev/mapper/merkaba-home ext4  224G  202G   20G  92% /home
merkaba:/home> LANG=C mkfs.xfs -dfile,name=fsfile,size=30t
meta-data=fsfile                 isize=256    agcount=30, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=8053063650, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
mkfs.xfs: Growing the data section failed

fallocate instead of sparse file?


And on BTRFS as well as XFS it appears to try to create a 30T file for 
real, i.e. by writing data - I stopped it before it could do too much 
harm.

Where did you create that hugish XFS file?

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


* Re: Insane file system overhead on large volume
  2012-01-28 14:55   ` Martin Steigerwald
@ 2012-01-28 15:35     ` Eric Sandeen
  2012-01-28 16:05       ` Christoph Hellwig
                         ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Eric Sandeen @ 2012-01-28 15:35 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Manny, xfs

On 1/28/12 8:55 AM, Martin Steigerwald wrote:
> On Friday, 27 January 2012, Eric Sandeen wrote:
>> On 1/27/12 1:50 AM, Manny wrote:
>>> Hi there,
>>>
>>> I'm not sure if this is intended behavior, but I was a bit stumped
>>> when I formatted a 30TB volume (12x3TB minus 2x3TB for parity in RAID
>>> 6) with XFS and noticed that there were only 22 TB left. I just
>>> called mkfs.xfs with default parameters - except for swidth and sunit
>>> which match the RAID setup.
>>>
>>> Is it normal that I lost 8TB just for the file system? That's almost
>>> 30% of the volume. Should I set the block size higher? Or should I
>>> increase the number of allocation groups? Would that make a
>>> difference? What's the preferred method for handling such large
>>> volumes?
>>
>> If it was 12x3TB I imagine you're confusing TB with TiB, so
>> perhaps your 30T is really only 27TiB to start with.
>>
>> Anyway, fs metadata should not eat much space:
>>
>> # mkfs.xfs -dfile,name=fsfile,size=30t
>> # ls -lh fsfile
>> -rw-r--r-- 1 root root 30T Jan 27 12:18 fsfile
>> # mount -o loop fsfile  mnt/
>> # df -h mnt
>> Filesystem            Size  Used Avail Use% Mounted on
>> /tmp/fsfile            30T  5.0M   30T   1% /tmp/mnt
>>
>> So Christoph's question was a good one; where are you getting
>> your sizes?

To solve your original problem, can you answer the above question?
Adding your actual raid config output (/proc/mdstat maybe) would help
too.

> An academic question:
> 
> Why is it that I get
> 
> merkaba:/tmp> mkfs.xfs -dfile,name=fsfile,size=30t
> meta-data=fsfile                 isize=256    agcount=30, agsize=268435455 blks
>          =                       sectsz=512   attr=2, projid32bit=0
> data     =                       bsize=4096   blocks=8053063650, imaxpct=5
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =Internes Protokoll     bsize=4096   blocks=521728, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =keine                  extsz=4096   blocks=0, rtextents=0
> 
> merkaba:/tmp> mount -o loop fsfile /mnt/zeit
> merkaba:/tmp> df -hT /mnt/zeit
> Dateisystem    Typ  Größe Benutzt Verf. Verw% Eingehängt auf
> /dev/loop0     xfs    30T     33M   30T    1% /mnt/zeit
> merkaba:/tmp> LANG=C df -hT /mnt/zeit
> Filesystem     Type  Size  Used Avail Use% Mounted on
> /dev/loop0     xfs    30T   33M   30T   1% /mnt/zeit
> 
> 
> 33MiB used on first mount instead of 5?

Not sure offhand, differences in xfsprogs version mkfs defaults perhaps.

...

> Hmmm, but creating the file on Ext4 does not work:

ext4 is not designed to handle very large files, so anything
above 16T will fail.

> fallocate instead of sparse file?

no, you just ran into file offset limits on ext4.
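
(With 4k blocks ext4 addresses file data with 32-bit logical block numbers,
so the largest possible file is 2^32 * 4 KiB = 16 TiB - a 30T sparse image
is past that limit no matter how it is allocated.)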
 
> And on BTRFS as well as XFS it appears to try to create a 30T file for 
> real, i.e. by writing data - I stopped it before it could do too much 
> harm.

Why do you say that it appears to create a 30T file for real?  It
should not...
 
> Where did you create that hugish XFS file?

On XFS.  Of course.  :)

> Ciao,


* Re: Insane file system overhead on large volume
  2012-01-28 15:35     ` Eric Sandeen
@ 2012-01-28 16:05       ` Christoph Hellwig
  2012-01-28 16:07       ` Eric Sandeen
  2012-01-28 16:23       ` Martin Steigerwald
  2 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2012-01-28 16:05 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Manny, xfs

Everyone calm down, Manny already replied and mentioned the problem.


* Re: Insane file system overhead on large volume
  2012-01-28 15:35     ` Eric Sandeen
  2012-01-28 16:05       ` Christoph Hellwig
@ 2012-01-28 16:07       ` Eric Sandeen
  2012-01-28 16:23       ` Martin Steigerwald
  2 siblings, 0 replies; 11+ messages in thread
From: Eric Sandeen @ 2012-01-28 16:07 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Manny, xfs

On 1/28/12 9:35 AM, Eric Sandeen wrote:
> On 1/28/12 8:55 AM, Martin Steigerwald wrote:
>> On Friday, 27 January 2012, Eric Sandeen wrote:

...

>>> So Christoph's question was a good one; where are you getting
>>> your sizes?
> 
> To solve your original problem, can you answer the above question?
> Adding your actual raid config output (/proc/mdstat maybe) would help
> too.

Sorry, never mind.  I missed the earlier reply about solving the problem and
confused the responders.  Argh.

-Eric


* Re: Insane file system overhead on large volume
  2012-01-28 15:35     ` Eric Sandeen
  2012-01-28 16:05       ` Christoph Hellwig
  2012-01-28 16:07       ` Eric Sandeen
@ 2012-01-28 16:23       ` Martin Steigerwald
  2012-01-29 22:18         ` Dave Chinner
  2 siblings, 1 reply; 11+ messages in thread
From: Martin Steigerwald @ 2012-01-28 16:23 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Manny, xfs

On Saturday, 28 January 2012, Eric Sandeen wrote:
> On 1/28/12 8:55 AM, Martin Steigerwald wrote:
> > On Friday, 27 January 2012, Eric Sandeen wrote:
> >> On 1/27/12 1:50 AM, Manny wrote:
> >>> Hi there,
> >>> 
> >>> I'm not sure if this is intended behavior, but I was a bit stumped
> >>> when I formatted a 30TB volume (12x3TB minus 2x3TB for parity in
> >>> RAID 6) with XFS and noticed that there were only 22 TB left. I
> >>> just called mkfs.xfs with default parameters - except for swidth
> >>> and sunit which match the RAID setup.
> >>> 
> >>> Is it normal that I lost 8TB just for the file system? That's
> >>> almost 30% of the volume. Should I set the block size higher? Or
> >>> should I increase the number of allocation groups? Would that make
> >>> a difference? What's the preferred method for handling such large
> >>> volumes?
> >> 
> >> If it was 12x3TB I imagine you're confusing TB with TiB, so
> >> perhaps your 30T is really only 27TiB to start with.
> >> 
> >> Anyway, fs metadata should not eat much space:
> >> 
> >> # mkfs.xfs -dfile,name=fsfile,size=30t
> >> # ls -lh fsfile
> >> -rw-r--r-- 1 root root 30T Jan 27 12:18 fsfile
> >> # mount -o loop fsfile  mnt/
> >> # df -h mnt
> >> Filesystem            Size  Used Avail Use% Mounted on
> >> /tmp/fsfile            30T  5.0M   30T   1% /tmp/mnt
> >> 
> >> So Christoph's question was a good one; where are you getting
> >> your sizes?
> 
> To solve your original problem, can you answer the above question?
> Adding your actual raid config output (/proc/mdstat maybe) would help
> too.

Eric, I wrote

> > An academic question:

to make clear that it was just something I was curious about.

I was not the reporter of the problem anyway; I have no problem, and the
reporter has no problem - see his answer - so all is good ;)

With your hint and some thinking and testing I was able to resolve
most of my other questions. Thanks.



For the gory details:

> > Why is it that I get
[…]
> > merkaba:/tmp> LANG=C df -hT /mnt/zeit
> > Filesystem     Type  Size  Used Avail Use% Mounted on
> > /dev/loop0     xfs    30T   33M   30T   1% /mnt/zeit
> > 
> > 
> > 33MiB used on first mount instead of 5?
> 
> Not sure offhand, differences in xfsprogs version mkfs defaults
> perhaps.

Okay, that's fine with me. I was just curious. It doesn't matter much.

> > Hmmm, but creating the file on Ext4 does not work:
> ext4 is not designed to handle very large files, so anything
> above 16T will fail.
> 
> > fallocate instead of sparse file?
> 
> no, you just ran into file offset limits on ext4.

Oh, yes. Completely forgot about these Ext4 limits. Sorry.

> > And on BTRFS as well as XFS it appears to try to create a 30T file
> > for real, i.e. by writing data - I stopped it before it could do too
> > much harm.
> 
> Why do you say that it appears to create a 30T file for real?  It
> should not...

I jumped to a conclusion too quickly. It did cause an I/O storm on the
Intel SSD 320:

martin@merkaba:~> vmstat -S M 1   (the -S M unit does not apply to the bi/bo columns)
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0   1630   4365     87   1087    0    0   101    53    7   81  5  2 93  0
 1  0   1630   4365     87   1087    0    0     0     0  428  769  1  0 99  0
 2  0   1630   4365     87   1087    0    0     0     0  426  740  1  1 99  0
 0  0   1630   4358     87   1088    0    0     0     0 1165 2297  4  7 89  0
 0  0   1630   4357     87   1088    0    0     0    40 1736 3434  8  6 86  0
 0  0   1630   4357     87   1088    0    0     0     0  614 1121  3  1 96  0
 0  0   1630   4357     87   1088    0    0     0    32  359  636  0  0 100  0
 1  1   1630   3852     87   1585    0    0    13 81540  529 1045  1  7 91  1
 0  3   1630   3398     87   2027    0    0     0 227940 1357 2764  0  9 54 37
 4  3   1630   3225     87   2188    0    0     0 212004 2346 4796  5  6 41 49
 1  3   1630   2992     87   2415    0    0     0 215608 1825 3821  1  6 42 50
 0  2   1630   2820     87   2582    0    0     0 200492 1476 3089  3  6 49 41
 1  1   1630   2569     87   2832    0    0     0 198156 1250 2508  0  6 59 34
 0  2   1630   2386     87   3009    0    0     0 229896 1301 2611  1  6 56 37
 0  2   1630   2266     87   3126    0    0     0 302876 1067 2093  0  5 62 33
 1  3   1630   2266     87   3126    0    0     0 176092  723 1321  0  3 71 26
 0  3   1630   2266     87   3126    0    0     0 163840  706 1351  0  1 74 25
 0  1   1630   2266     87   3126    0    0     0 80104 3137 6228  1  4 69 26
 0  0   1630   2267     87   3126    0    0     0     3 3505 7035  6  3 86  5
 0  0   1630   2266     87   3126    0    0     0     0  631 1203  4  1 95  0
 0  0   1630   2259     87   3127    0    0     0     0  715 1398  4  2 94  0
 2  0   1630   2259     87   3127    0    0     0     0 1501 3087 10  3 86  0
 0  0   1630   2259     87   3127    0    0     0    27  945 1883  5  2 93  0
 0  0   1630   2259     87   3127    0    0     0     0  399  713  1  0 99  0
^C

But it then stopped. So it seems mkfs.xfs was just writing metadata,
and I obviously didn't see this on the tmpfs.

But on reflection, creating a 30TB XFS filesystem should involve writing
some metadata at different places in the file.

I get:

merkaba:/mnt/zeit> LANG=C xfs_bmap fsfile
fsfile:
        0: [0..255]: 96..351
        1: [256..2147483639]: hole
        2: [2147483640..2147483671]: 3400032..3400063
        3: [2147483672..4294967279]: hole
        4: [4294967280..4294967311]: 3400064..3400095
        5: [4294967312..6442450919]: hole
        6: [6442450920..6442450951]: 3400096..3400127
        7: [6442450952..8589934559]: hole
        8: [8589934560..8589934591]: 3400128..3400159
        9: [8589934592..10737418199]: hole
        10: [10737418200..10737418231]: 3400160..3400191
        11: [10737418232..12884901839]: hole
        12: [12884901840..12884901871]: 3400192..3400223
        13: [12884901872..15032385479]: hole
        14: [15032385480..15032385511]: 3400224..3400255
        15: [15032385512..17179869119]: hole
        16: [17179869120..17179869151]: 3400256..3400287
        17: [17179869152..19327352759]: hole
        18: [19327352760..19327352791]: 3400296..3400327
        19: [19327352792..21474836399]: hole
        20: [21474836400..21474836431]: 3400328..3400359
        21: [21474836432..23622320039]: hole
        22: [23622320040..23622320071]: 3400360..3400391
        23: [23622320072..25769803679]: hole
        24: [25769803680..25769803711]: 3400392..3400423
        25: [25769803712..27917287319]: hole
        26: [27917287320..27917287351]: 3400424..3400455
        27: [27917287352..30064770959]: hole
        28: [30064770960..30064770991]: 3400456..3400487
        29: [30064770992..32212254599]: hole
        30: [32212254600..32212254631]: 3400488..3400519
        31: [32212254632..32215654311]: 352..3400031
        32: [32215654312..32216428455]: 3400520..4174663
        33: [32216428456..34359738239]: hole
        34: [34359738240..34359738271]: 4174664..4174695
        35: [34359738272..36507221879]: hole
        36: [36507221880..36507221911]: 4174696..4174727
        37: [36507221912..38654705519]: hole
        38: [38654705520..38654705551]: 4174728..4174759
        39: [38654705552..40802189159]: hole
        40: [40802189160..40802189191]: 4174760..4174791
        41: [40802189192..42949672799]: hole
        42: [42949672800..42949672831]: 4174792..4174823
        43: [42949672832..45097156439]: hole
        44: [45097156440..45097156471]: 4174824..4174855
        45: [45097156472..47244640079]: hole
        46: [47244640080..47244640111]: 4174856..4174887
        47: [47244640112..49392123719]: hole
        48: [49392123720..49392123751]: 4174888..4174919
        49: [49392123752..51539607359]: hole
        50: [51539607360..51539607391]: 4174920..4174951
        51: [51539607392..53687090999]: hole
        52: [53687091000..53687091031]: 4174952..4174983
        53: [53687091032..55834574639]: hole
        54: [55834574640..55834574671]: 4174984..4175015
        55: [55834574672..57982058279]: hole
        56: [57982058280..57982058311]: 4175016..4175047
        57: [57982058312..60129541919]: hole
        58: [60129541920..60129541951]: 4175048..4175079
        59: [60129541952..62277025559]: hole
        60: [62277025560..62277025591]: 4175080..4175111
        61: [62277025592..64424509191]: hole
        62: [64424509192..64424509199]: 4175112..4175119

Okay, it needed to write 2 GB:

merkaba:/mnt/zeit> du -h fsfile 
2,0G    fsfile
merkaba:/mnt/zeit> du --apparent-size -h fsfile
30T     fsfile
merkaba:/mnt/zeit>

I didn't expect mkfs.xfs to write 2 GB, but thinking it through
for a 30 TB filesystem I find this reasonable.

Still, it has 33 MiB for metadata:

merkaba:/mnt/zeit> mkdir bigfilefs
merkaba:/mnt/zeit> mount -o loop fsfile bigfilefs 
merkaba:/mnt/zeit> LANG=C df -hT bigfilefs
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/loop0     xfs    30T   33M   30T   1% /mnt/zeit/bigfilefs

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


* Re: Insane file system overhead on large volume
  2012-01-28 16:23       ` Martin Steigerwald
@ 2012-01-29 22:18         ` Dave Chinner
  0 siblings, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2012-01-29 22:18 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Manny, Eric Sandeen, xfs

On Sat, Jan 28, 2012 at 05:23:42PM +0100, Martin Steigerwald wrote:
> On Saturday, 28 January 2012, Eric Sandeen wrote:
> > On 1/28/12 8:55 AM, Martin Steigerwald wrote:
> For the gory details:
> 
> > > Why is it that I get
> […]
> > > merkaba:/tmp> LANG=C df -hT /mnt/zeit
> > > Filesystem     Type  Size  Used Avail Use% Mounted on
> > > /dev/loop0     xfs    30T   33M   30T   1% /mnt/zeit
> > > 
> > > 
> > > 33MiB used on first mount instead of 5?
> > 
> > Not sure offhand, differences in xfsprogs version mkfs defaults
> > perhaps.
> 
> Okay, that's fine with me. I was just curious. It doesn't matter much.

More likely the kernel. Older kernels only use 1024 blocks for
the reserve block pool, while more recent ones use 8192 blocks.

$ git log -n 1 8babd8a
commit 8babd8a2e75cccff3167a61176c2a3e977e13799
Author: Dave Chinner <david@fromorbit.com>
Date:   Thu Mar 4 01:46:25 2010 +0000

    xfs: Increase the default size of the reserved blocks pool

    The current default size of the reserved blocks pool is easy to deplete
    with certain workloads, in particular workloads that do lots of concurrent
    delayed allocation extent conversions.  If enough transactions are running
    in parallel and the entire pool is consumed then subsequent calls to
    xfs_trans_reserve() will fail with ENOSPC.  Also add a rate limited
    warning so we know if this starts happening again.

    This is an updated version of an old patch from Lachlan McIlroy.

    Signed-off-by: Dave Chinner <david@fromorbit.com>
    Signed-off-by: Alex Elder <aelder@sgi.com>
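
That also matches the numbers above, assuming 4 KiB filesystem blocks:

  8192 blocks * 4 KiB = 32 MiB   (newer kernels)
  1024 blocks * 4 KiB =  4 MiB   (older kernels)

plus a bit of real metadata, which is about the 33M versus 5.0M difference
df showed.  If you want to check on a mounted filesystem, xfs_io in expert
mode should be able to print the pool, e.g. something like:

  # xfs_io -x -c "resblks" /mnt/zeit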

> But on reflection, creating a 30TB XFS filesystem should involve writing
> some metadata at different places in the file.
> 
> I get:
> 
> merkaba:/mnt/zeit> LANG=C xfs_bmap fsfile
> fsfile:
>         0: [0..255]: 96..351
>         1: [256..2147483639]: hole
>         2: [2147483640..2147483671]: 3400032..3400063
>         3: [2147483672..4294967279]: hole
>         4: [4294967280..4294967311]: 3400064..3400095
>         5: [4294967312..6442450919]: hole
>         6: [6442450920..6442450951]: 3400096..3400127
>         7: [6442450952..8589934559]: hole

.....

Yeah, that's all the AG headers.
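
The hole size matches the geometry mkfs printed, for what it's worth:

  agsize = 268435455 fs blocks * 8 sectors/block = 2147483640 sectors

which is exactly the stride between those small 32-sector extents - one
cluster of AG headers at the start of each of the 30 AGs, roughly 1 TiB
apart.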

> Okay, it needed to write 2 GB:
> 
> merkaba:/mnt/zeit> du -h fsfile 
> 2,0G    fsfile
> merkaba:/mnt/zeit> du --apparent-size -h fsfile
> 30T     fsfile
> merkaba:/mnt/zeit>
> 
> I didn't expect mkfs.xfs to write 2 GB, but thinking it through
> for a 30 TB filesystem I find this reasonable.

It zeroed the log, which will be just under 2GB in size for a
filesystem that large. Zeroing the log accounts for >99% of the IO
that mkfs does for most normal cases.
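
Taking the log geometry from the mkfs output above:

  521728 log blocks * 4096 bytes = 2136997888 bytes ≈ 1.99 GiB

which is essentially the 2.0G that du reported for the image file.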

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



Thread overview: 11+ messages
2012-01-27  7:50 Insane file system overhead on large volume Manny
2012-01-27 10:44 ` Christoph Hellwig
2012-01-27 19:15   ` Manny
2012-01-27 18:21 ` Eric Sandeen
2012-01-28 14:55   ` Martin Steigerwald
2012-01-28 15:35     ` Eric Sandeen
2012-01-28 16:05       ` Christoph Hellwig
2012-01-28 16:07       ` Eric Sandeen
2012-01-28 16:23       ` Martin Steigerwald
2012-01-29 22:18         ` Dave Chinner
2012-01-27 19:08 ` Stan Hoeppner
