public inbox for linux-xfs@vger.kernel.org
* Problems with filesizes on different Kernels
@ 2012-02-17 11:51 Bernhard Schrader
  2012-02-17 12:33 ` Matthias Schniedermeyer
  0 siblings, 1 reply; 6+ messages in thread
From: Bernhard Schrader @ 2012-02-17 11:51 UTC (permalink / raw)
  To: xfs

Hi all,

we just discovered a problem which I think is related to XFS. Well, I 
will try to explain.

The environment I am working with is around 300 Postgres databases in 
separate VMs, all running on XFS. The only differences are the kernel 
versions:
- 2.6.18
- 2.6.39
- 3.1.4

Some days ago I discovered that the file nodes of my PostgreSQL tables 
have strange sizes. They are located in 
/var/lib/postgresql/9.0/main/base/[databaseid]/
If I execute the following commands I get results like this:

Command: du -sh | tr "\n" " "; du --apparent-size -h
Result: 6.6G	. 5.7G	.

Well, as you can see, something is wrong: the files consume more disk 
space than they originally would. This happens only on the 2.6.39 and 
3.1.4 servers; the old 2.6.18 behaves normally and the sizes are the 
same for both commands.

The following was done on a 3.1.4 kernel.
To get some more information I played around a bit with the XFS tools.

First I chose one file to examine:
##########
/var/lib/postgresql/9.0/main/base/43169# ls -lh 64121
-rw------- 1 postgres postgres 58M 2012-02-16 17:03 64121

/var/lib/postgresql/9.0/main/base/43169# du -sh 64121
89M    64121
##########
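The same gap can be measured per file with stat(1); here is a minimal 
sketch using a scratch file created with mktemp (assuming GNU coreutils, 
so this is just an illustration, not one of the Postgres nodes):

```shell
# Compare apparent size (what ls -l shows) with allocated size (what du
# counts) for a single file, using GNU coreutils stat.
f=$(mktemp)
dd if=/dev/zero of="$f" bs=1M count=8 2>/dev/null

apparent=$(stat -c %s "$f")                                # bytes in the file
allocated=$(( $(stat -c %b "$f") * $(stat -c %B "$f") ))   # blocks * block size
echo "apparent=$apparent allocated=$allocated gap=$(( allocated - apparent ))"
```

Any file where "allocated" is noticeably larger than "apparent" is 
carrying extra on-disk space beyond its contents.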

So this file "64121" shows a difference of 31MB between disk usage and 
apparent size.

##########
/var/lib/postgresql/9.0/main/base/43169# xfs_bmap  64121
64121:
     0: [0..116991]: 17328672..17445663

/var/lib/postgresql/9.0/main/base/43169# xfs_fsr -v 64121
64121
64121 already fully defragmented.

/var/lib/postgresql/9.0/main/base/43169# xfs_info /dev/xvda1
meta-data=/dev/root              isize=256    agcount=4, agsize=959932 blks
          =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=3839727, imaxpct=25
          =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096
log      =internal               bsize=4096   blocks=2560, version=2
          =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

/var/lib/postgresql/9.0/main/base/43169# cat /proc/mounts
rootfs / rootfs rw 0 0
/dev/root / xfs rw,noatime,nodiratime,attr2,delaylog,nobarrier,noquota 0 0
tmpfs /lib/init/rw tmpfs rw,nosuid,relatime,mode=755 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev,relatime 0 0
devpts /dev/pts devpts 
rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
#########

I also sent the following to the Postgres mailing list, but I think it 
is useful here too.

Strange, or not? According to this information, the file is contiguous 
on disk and of course has no fragmentation, so why does it show so much 
disk usage?

The relation this file node belongs to is an index, and from my last 
overview it seems that in about 95% of cases this happens only to 
indexes/primary keys.

You might think I have some strange config settings, but we distribute 
this config via Puppet, and the servers on old hardware have the same 
config, so things like fillfactor cannot explain this.

We also thought there could still be open file handles, so we decided 
to reboot. At first we thought we had it: the free disk space increased 
slowly for a while. But then, after reclaiming 1-2GB, it went back to 
normal and the file nodes grew again. So that doesn't explain it 
either. :/

One more thing: an xfs_fsr /dev/xvda1 also reclaims some disk space, 
but with the same effect as a reboot.


Some differences on 2.6.18 are the mount options and the lazy-count:
###########
xfs_info /dev/xvda1
meta-data=/dev/root              isize=256    agcount=4, agsize=959996 blks
          =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=3839983, imaxpct=25
          =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096
log      =internal               bsize=4096   blocks=2560, version=2
          =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0

cat /proc/mounts
rootfs / rootfs rw 0 0
/dev/root / xfs rw,noatime,nodiratime 0 0
tmpfs /lib/init/rw tmpfs rw,nosuid 0 0
proc /proc proc rw,nosuid,nodev,noexec 0 0
sysfs /sys sysfs rw,nosuid,nodev,noexec 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev 0 0
devpts /dev/pts devpts rw,nosuid,noexec 0 0
#############

I don't know what causes this problem, or why we seem to be the only 
ones who have discovered it. I don't know if it is really 100% related 
to XFS, but for now I have no other ideas. If you need any more 
information, I will provide it.

Thanks in advance
Bernhard

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: Problems with filesizes on different Kernels
  2012-02-17 11:51 Problems with filesizes on different Kernels Bernhard Schrader
@ 2012-02-17 12:33 ` Matthias Schniedermeyer
  2012-02-20  8:41   ` Bernhard Schrader
  0 siblings, 1 reply; 6+ messages in thread
From: Matthias Schniedermeyer @ 2012-02-17 12:33 UTC (permalink / raw)
  To: Bernhard Schrader; +Cc: xfs

On 17.02.2012 12:51, Bernhard Schrader wrote:
> Hi all,
> 
> we just discovered a problem, which I think is related to XFS. Well,
> I will try to explain.
> 
> The environment i am working with are around 300 Postgres databases
> in separated VM's. All are running with XFS. Differences are just in
> kernel versions.
> - 2.6.18
> - 2.6.39
> - 3.1.4
> 
> Some days ago i discovered that the file nodes of my postgresql
> tables have strange sizes. They are located in
> /var/lib/postgresql/9.0/main/base/[databaseid]/
> If I execute the following commands i get results like this:
> 
> Command: du -sh | tr "\n" " "; du --apparent-size -h
> Result: 6.6G	. 5.7G	.

Since a few kernel versions, XFS does speculative preallocation, which 
is primarily a measure to prevent fragmentation.

The preallocations should go away when you drop the caches:

sync
echo 3 > /proc/sys/vm/drop_caches

XFS can be prevented from doing this with the mount option "allocsize". 
Personally I use "allocsize=64k" since I first encountered this 
behaviour; my workload consists primarily of single-threaded writing, 
which doesn't benefit from the preallocation.
Your workload, on the other hand, may benefit, as the preallocation 
should prevent or reduce fragmentation of the database files.
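For a persistent setup the option would go into /etc/fstab; a hedged 
sketch (device and mount point here are placeholders, and for the root 
filesystem the option may need to be passed at boot instead):

```shell
# Hypothetical /etc/fstab line; device and mount point are placeholders.
# allocsize=64k fixes the EOF preallocation size instead of the dynamic default.
/dev/xvdb1  /var/lib/postgresql  xfs  noatime,nodiratime,allocsize=64k  0  2
```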






Bis denn

-- 
Real Programmers consider "what you see is what you get" to be just as 
bad a concept in Text Editors as it is in women. No, the Real Programmer
wants a "you asked for it, you got it" text editor -- complicated, 
cryptic, powerful, unforgiving, dangerous.



* Re: Problems with filesizes on different Kernels
  2012-02-17 12:33 ` Matthias Schniedermeyer
@ 2012-02-20  8:41   ` Bernhard Schrader
  2012-02-20 11:06     ` Matthias Schniedermeyer
  0 siblings, 1 reply; 6+ messages in thread
From: Bernhard Schrader @ 2012-02-20  8:41 UTC (permalink / raw)
  To: Matthias Schniedermeyer; +Cc: xfs

On 02/17/2012 01:33 PM, Matthias Schniedermeyer wrote:
> On 17.02.2012 12:51, Bernhard Schrader wrote:
>> Hi all,
>>
>> we just discovered a problem, which I think is related to XFS. Well,
>> I will try to explain.
>>
>> The environment i am working with are around 300 Postgres databases
>> in separated VM's. All are running with XFS. Differences are just in
>> kernel versions.
>> - 2.6.18
>> - 2.6.39
>> - 3.1.4
>>
>> Some days ago i discovered that the file nodes of my postgresql
>> tables have strange sizes. They are located in
>> /var/lib/postgresql/9.0/main/base/[databaseid]/
>> If I execute the following commands i get results like this:
>>
>> Command: du -sh | tr "\n" " "; du --apparent-size -h
>> Result: 6.6G	. 5.7G	.
>
> Since a few kernel-version XFS does speculative preallocations, which is
> primarily a measure to prevent fragmentation.
>
> The preallocations should go away when you drop the caches.
>
> sync
> echo 3>  /proc/sys/vm/drop_caches
>
> XFS can be prevented to do that with the mount-option "allocsize".
> Personally i use "allocsize=64k", since i first encountered that
> behaviour, my workload primarily consists of single-thread writing which
> doesn't benefit from this preallocation.
> Your workload OTOH may benefit as it should prevent/lower the
> fragmentation of the database files.
>
>
>
>
>
>
> Bis denn
>

Hi Matthias,
thanks for the reply. As far as I can tell, the "echo 3 > 
/proc/sys/vm/drop_caches" didn't work; the sizes didn't shrink. Today I 
had the chance to test allocsize=64k. At first I thought it worked: I 
added the mount option, restarted the server, and everything shrank to 
normal sizes. But right now it is more or less "flapping": I have 5.7GB 
of real data and the sizes flap between 5.7GB and 6.9GB.
But I am wondering a little about the mount output:

# mount
/dev/xvda1 on / type xfs 
(rw,noatime,nodiratime,logbufs=8,nobarrier,allocsize=64k)
tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)


# cat /proc/mounts
rootfs / rootfs rw 0 0
/dev/root / xfs rw,noatime,nodiratime,attr2,delaylog,nobarrier,noquota 0 0
tmpfs /lib/init/rw tmpfs rw,nosuid,relatime,mode=755 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev,relatime 0 0
devpts /dev/pts devpts 
rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0


In the normal mount output I see the allocsize, but not in cat 
/proc/mounts?!?

Is there a way to completely disable speculative preallocation, or at 
least the behavior as it works right now?


regards
Bernhard



* Re: Problems with filesizes on different Kernels
  2012-02-20  8:41   ` Bernhard Schrader
@ 2012-02-20 11:06     ` Matthias Schniedermeyer
  2012-02-20 12:06       ` Bernhard Schrader
  0 siblings, 1 reply; 6+ messages in thread
From: Matthias Schniedermeyer @ 2012-02-20 11:06 UTC (permalink / raw)
  To: Bernhard Schrader; +Cc: xfs

On 20.02.2012 09:41, Bernhard Schrader wrote:
> On 02/17/2012 01:33 PM, Matthias Schniedermeyer wrote:
> >On 17.02.2012 12:51, Bernhard Schrader wrote:
> >>Hi all,
> >>
> >>we just discovered a problem, which I think is related to XFS. Well,
> >>I will try to explain.
> >>
> >>The environment i am working with are around 300 Postgres databases
> >>in separated VM's. All are running with XFS. Differences are just in
> >>kernel versions.
> >>- 2.6.18
> >>- 2.6.39
> >>- 3.1.4
> >>
> >>Some days ago i discovered that the file nodes of my postgresql
> >>tables have strange sizes. They are located in
> >>/var/lib/postgresql/9.0/main/base/[databaseid]/
> >>If I execute the following commands i get results like this:
> >>
> >>Command: du -sh | tr "\n" " "; du --apparent-size -h
> >>Result: 6.6G	. 5.7G	.
> >
> >Since a few kernel-version XFS does speculative preallocations, which is
> >primarily a measure to prevent fragmentation.
> >
> >The preallocations should go away when you drop the caches.
> >
> >sync
> >echo 3>  /proc/sys/vm/drop_caches
> >
> >XFS can be prevented to do that with the mount-option "allocsize".
> >Personally i use "allocsize=64k", since i first encountered that
> >behaviour, my workload primarily consists of single-thread writing which
> >doesn't benefit from this preallocation.
> >Your workload OTOH may benefit as it should prevent/lower the
> >fragmentation of the database files.
> 
> Hi Matthias,
> thanks for the reply, as far as i can say the "echo 3 >
> /proc/sys/vm/drop_caches" didn't work. the sizes didnt shrink.

You did "sync" before?
Dropping caches only drops "clean" pages; anything dirty isn't 
dropped. Hence the need to "sync" first.

Also, I presume you didn't stop Postgres?
I don't know if this works for files that are currently open.

When I tested the behaviour I tested it with files copied by "cp", so 
they weren't open by any program when I dropped the caches.
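A quick way to check the open-file condition without lsof is to scan 
/proc; a sketch (is_open is a hypothetical helper written for this mail, 
not a standard tool):

```shell
# Hypothetical helper: report whether any process currently holds the given
# file open, by scanning /proc/<pid>/fd. Relevant because preallocation may
# not be trimmed while a file is still open (e.g. by a Postgres backend).
is_open() {
    target=$(readlink -f "$1")
    for fd in /proc/[0-9]*/fd/*; do
        if [ "$(readlink -f "$fd" 2>/dev/null)" = "$target" ]; then
            echo open
            return 0
        fi
    done
    echo closed
}

tmp=$(mktemp)
is_open "$tmp"    # closed: nothing holds it yet
exec 3<"$tmp"     # keep it open on fd 3 of this shell
is_open "$tmp"    # open: our own shell shows up in /proc
exec 3<&-         # release it again
rm -f "$tmp"
```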

> Today
> i had the chance to test the allocsize=64k. Well, first i thought it
> worked, i added the mountoption, restarted the server, everything
> shrink to normal sizes. but right now its more or less "flapping". I
> have 5.7GB real data and the sizes flap between 6.9GB to 5.7GB.
> But I am wondering a little about the mount output:
> 
> # mount
> /dev/xvda1 on / type xfs
> (rw,noatime,nodiratime,logbufs=8,nobarrier,allocsize=64k)
> tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
> proc on /proc type proc (rw,noexec,nosuid,nodev)
> sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
> tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
> devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
> 
> 
> # cat /proc/mounts
> rootfs / rootfs rw 0 0
> /dev/root / xfs rw,noatime,nodiratime,attr2,delaylog,nobarrier,noquota 0 0
> tmpfs /lib/init/rw tmpfs rw,nosuid,relatime,mode=755 0 0
> proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
> sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
> tmpfs /dev/shm tmpfs rw,nosuid,nodev,relatime 0 0
> devpts /dev/pts devpts
> rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
> 
> 
> In normal mount output i see the allocsize, but not in cat /proc/mounts?!?
> 
> Is there a way to completly disable speculative prealocations? or
> the behavior how it works right now?

In /proc/mounts on my computer allocsize is there:
/dev/mapper/x1 /x1 xfs rw,nosuid,nodev,noatime,attr2,delaylog,allocsize=64k,noquota 0 0

I tracked down the patch. It went into 2.6.38:

- snip -
commit 055388a3188f56676c21e92962fc366ac8b5cb72
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Jan 4 11:35:03 2011 +1100

    xfs: dynamic speculative EOF preallocation

    Currently the size of the speculative preallocation during delayed
    allocation is fixed by either the allocsize mount option or a
    default size. We are seeing a lot of cases where we need to
    recommend using the allocsize mount option to prevent fragmentation
    when buffered writes land in the same AG.

    Rather than using a fixed preallocation size by default (up to 64k),
    make it dynamic by basing it on the current inode size. That way the
    EOF preallocation will increase as the file size increases.  Hence
    for streaming writes we are much more likely to get large
    preallocations exactly when we need it to reduce fragmentation.

    For default settings, the size of the initial extents is determined
    by the number of parallel writers and the amount of memory in the
    machine. For 4GB RAM and 4 concurrent 32GB file writes:

    EXT: FILE-OFFSET           BLOCK-RANGE          AG AG-OFFSET                 TOTAL
       0: [0..1048575]:         1048672..2097247      0 (1048672..2097247)      1048576
       1: [1048576..2097151]:   5242976..6291551      0 (5242976..6291551)      1048576
       2: [2097152..4194303]:   12583008..14680159    0 (12583008..14680159)    2097152
       3: [4194304..8388607]:   25165920..29360223    0 (25165920..29360223)    4194304
       4: [8388608..16777215]:  58720352..67108959    0 (58720352..67108959)    8388608
       5: [16777216..33554423]: 117440584..134217791  0 (117440584..134217791) 16777208
       6: [33554424..50331511]: 184549056..201326143  0 (184549056..201326143) 16777088
       7: [50331512..67108599]: 251657408..268434495  0 (251657408..268434495) 16777088

    and for 16 concurrent 16GB file writes:

     EXT: FILE-OFFSET           BLOCK-RANGE          AG AG-OFFSET                 TOTAL
       0: [0..262143]:          2490472..2752615      0 (2490472..2752615)       262144
       1: [262144..524287]:     6291560..6553703      0 (6291560..6553703)       262144
       2: [524288..1048575]:    13631592..14155879    0 (13631592..14155879)     524288
       3: [1048576..2097151]:   30408808..31457383    0 (30408808..31457383)    1048576
       4: [2097152..4194303]:   52428904..54526055    0 (52428904..54526055)    2097152
       5: [4194304..8388607]:   104857704..109052007  0 (104857704..109052007)  4194304
       6: [8388608..16777215]:  209715304..218103911  0 (209715304..218103911)  8388608
       7: [16777216..33554423]: 452984848..469762055  0 (452984848..469762055) 16777208

    Because it is hard to take back speculative preallocation, cases
    where there are large slow growing log files on a nearly full
    filesystem may cause premature ENOSPC. Hence as the filesystem nears
    full, the maximum dynamic prealloc size is reduced according to this
    table (based on 4k block size):

    freespace       max prealloc size
      >5%             full extent (8GB)
      4-5%             2GB (8GB >> 2)
      3-4%             1GB (8GB >> 3)
      2-3%           512MB (8GB >> 4)
      1-2%           256MB (8GB >> 5)
      <1%            128MB (8GB >> 6)

    This should reduce the amount of space held in speculative
    preallocation for such cases.

    The allocsize mount option turns off the dynamic behaviour and fixes
    the prealloc size to whatever the mount option specifies. i.e. the
    behaviour is unchanged.
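The throttle table above can be sketched as a small shell function 
(illustration only, with whole-percent bands; the real logic lives in 
the kernel, not in any userspace tool):

```shell
# Illustrative only: map percent free space to the max speculative
# preallocation size in MB, following the table in the commit message
# (8GB base, shifted down per step). Integer percentages only.
max_prealloc_mb() {
    pct=$1
    base=$(( 8 * 1024 ))                      # 8GB expressed in MB
    if   [ "$pct" -gt 5 ]; then shift_amt=0   # >5% free: full 8GB extent
    elif [ "$pct" -gt 4 ]; then shift_amt=2   # 4-5%: 2GB
    elif [ "$pct" -gt 3 ]; then shift_amt=3   # 3-4%: 1GB
    elif [ "$pct" -gt 2 ]; then shift_amt=4   # 2-3%: 512MB
    elif [ "$pct" -gt 1 ]; then shift_amt=5   # 1-2%: 256MB
    else                        shift_amt=6   # <1%: 128MB
    fi
    echo $(( base >> shift_amt ))
}

max_prealloc_mb 6    # 8192
max_prealloc_mb 3    # 512
max_prealloc_mb 1    # 128
```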

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
- snip -





Bis denn

-- 
Real Programmers consider "what you see is what you get" to be just as 
bad a concept in Text Editors as it is in women. No, the Real Programmer
wants a "you asked for it, you got it" text editor -- complicated, 
cryptic, powerful, unforgiving, dangerous.



* Re: Problems with filesizes on different Kernels
  2012-02-20 11:06     ` Matthias Schniedermeyer
@ 2012-02-20 12:06       ` Bernhard Schrader
  2012-02-27  8:23         ` Bernhard Schrader
  0 siblings, 1 reply; 6+ messages in thread
From: Bernhard Schrader @ 2012-02-20 12:06 UTC (permalink / raw)
  To: xfs

On 02/20/2012 12:06 PM, Matthias Schniedermeyer wrote:
> On 20.02.2012 09:41, Bernhard Schrader wrote:
>> On 02/17/2012 01:33 PM, Matthias Schniedermeyer wrote:
>>> On 17.02.2012 12:51, Bernhard Schrader wrote:
>>>> Hi all,
>>>>
>>>> we just discovered a problem, which I think is related to XFS. Well,
>>>> I will try to explain.
>>>>
>>>> The environment i am working with are around 300 Postgres databases
>>>> in separated VM's. All are running with XFS. Differences are just in
>>>> kernel versions.
>>>> - 2.6.18
>>>> - 2.6.39
>>>> - 3.1.4
>>>>
>>>> Some days ago i discovered that the file nodes of my postgresql
>>>> tables have strange sizes. They are located in
>>>> /var/lib/postgresql/9.0/main/base/[databaseid]/
>>>> If I execute the following commands i get results like this:
>>>>
>>>> Command: du -sh | tr "\n" " "; du --apparent-size -h
>>>> Result: 6.6G	. 5.7G	.
>>>
>>> Since a few kernel-version XFS does speculative preallocations, which is
>>> primarily a measure to prevent fragmentation.
>>>
>>> The preallocations should go away when you drop the caches.
>>>
>>> sync
>>> echo 3>   /proc/sys/vm/drop_caches
>>>
>>> XFS can be prevented to do that with the mount-option "allocsize".
>>> Personally i use "allocsize=64k", since i first encountered that
>>> behaviour, my workload primarily consists of single-thread writing which
>>> doesn't benefit from this preallocation.
>>> Your workload OTOH may benefit as it should prevent/lower the
>>> fragmentation of the database files.
>>
>> Hi Matthias,
>> thanks for the reply, as far as i can say the "echo 3>
>> /proc/sys/vm/drop_caches" didn't work. the sizes didnt shrink.
>
> You did "sync" before?
> drop caches only drops "clean" pages, everything that is dirty isn't
> dropped. Hence the need to "sync" before.
>
> Also i persume that you didn't stop Postgres?
> I don't know if the process works for files that are currently opened.
>
> When i tested the behaviour i tested it with files copied by "cp", so
> they weren't open by any program when i droped the caches.
>
>> Today
>> i had the chance to test the allocsize=64k. Well, first i thought it
>> worked, i added the mountoption, restarted the server, everything
>> shrink to normal sizes. but right now its more or less "flapping". I
>> have 5.7GB real data and the sizes flap between 6.9GB to 5.7GB.
>> But I am wondering a little about the mount output:
>>
>> # mount
>> /dev/xvda1 on / type xfs
>> (rw,noatime,nodiratime,logbufs=8,nobarrier,allocsize=64k)
>> tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
>> proc on /proc type proc (rw,noexec,nosuid,nodev)
>> sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
>> tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
>> devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
>>
>>
>> # cat /proc/mounts
>> rootfs / rootfs rw 0 0
>> /dev/root / xfs rw,noatime,nodiratime,attr2,delaylog,nobarrier,noquota 0 0
>> tmpfs /lib/init/rw tmpfs rw,nosuid,relatime,mode=755 0 0
>> proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
>> sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
>> tmpfs /dev/shm tmpfs rw,nosuid,nodev,relatime 0 0
>> devpts /dev/pts devpts
>> rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
>>
>>
>> In normal mount output i see the allocsize, but not in cat /proc/mounts?!?
>>
>> Is there a way to completly disable speculative prealocations? or
>> the behavior how it works right now?
>
> In /proc/mounts on my computer allocsize is there:
> /dev/mapper/x1 /x1 xfs rw,nosuid,nodev,noatime,attr2,delaylog,allocsize=64k,noquota 0 0
>
> I tracked down the patch. It went into 2.6.38
>
> - snip -
> commit 055388a3188f56676c21e92962fc366ac8b5cb72
> Author: Dave Chinner<dchinner@redhat.com>
> Date:   Tue Jan 4 11:35:03 2011 +1100
>
>      xfs: dynamic speculative EOF preallocation
>
>      Currently the size of the speculative preallocation during delayed
>      allocation is fixed by either the allocsize mount option of a
>      default size. We are seeing a lot of cases where we need to
>      recommend using the allocsize mount option to prevent fragmentation
>      when buffered writes land in the same AG.
>
>      Rather than using a fixed preallocation size by default (up to 64k),
>      make it dynamic by basing it on the current inode size. That way the
>      EOF preallocation will increase as the file size increases.  Hence
>      for streaming writes we are much more likely to get large
>      preallocations exactly when we need it to reduce fragmentation.
>
>      For default settings, the size of the initial extents is determined
>      by the number of parallel writers and the amount of memory in the
>      machine. For 4GB RAM and 4 concurrent 32GB file writes:
>
>      EXT: FILE-OFFSET           BLOCK-RANGE          AG AG-OFFSET                 TOTAL
>         0: [0..1048575]:         1048672..2097247      0 (1048672..2097247)      1048576
>         1: [1048576..2097151]:   5242976..6291551      0 (5242976..6291551)      1048576
>         2: [2097152..4194303]:   12583008..14680159    0 (12583008..14680159)    2097152
>         3: [4194304..8388607]:   25165920..29360223    0 (25165920..29360223)    4194304
>         4: [8388608..16777215]:  58720352..67108959    0 (58720352..67108959)    8388608
>         5: [16777216..33554423]: 117440584..134217791  0 (117440584..134217791) 16777208
>         6: [33554424..50331511]: 184549056..201326143  0 (184549056..201326143) 16777088
>         7: [50331512..67108599]: 251657408..268434495  0 (251657408..268434495) 16777088
>
>      and for 16 concurrent 16GB file writes:
>
>       EXT: FILE-OFFSET           BLOCK-RANGE          AG AG-OFFSET                 TOTAL
>         0: [0..262143]:          2490472..2752615      0 (2490472..2752615)       262144
>         1: [262144..524287]:     6291560..6553703      0 (6291560..6553703)       262144
>         2: [524288..1048575]:    13631592..14155879    0 (13631592..14155879)     524288
>         3: [1048576..2097151]:   30408808..31457383    0 (30408808..31457383)    1048576
>         4: [2097152..4194303]:   52428904..54526055    0 (52428904..54526055)    2097152
>         5: [4194304..8388607]:   104857704..109052007  0 (104857704..109052007)  4194304
>         6: [8388608..16777215]:  209715304..218103911  0 (209715304..218103911)  8388608
>         7: [16777216..33554423]: 452984848..469762055  0 (452984848..469762055) 16777208
>
>      Because it is hard to take back speculative preallocation, cases
>      where there are large slow growing log files on a nearly full
>      filesystem may cause premature ENOSPC. Hence as the filesystem nears
>      full, the maximum dynamic prealloc size is reduced according to this
>      table (based on 4k block size):
>
>      freespace       max prealloc size
>        >5%             full extent (8GB)
>        4-5%             2GB (8GB>>  2)
>        3-4%             1GB (8GB>>  3)
>        2-3%           512MB (8GB>>  4)
>        1-2%           256MB (8GB>>  5)
>        <1%            128MB (8GB>>  6)
>
>      This should reduce the amount of space held in speculative
>      preallocation for such cases.
>
>      The allocsize mount option turns off the dynamic behaviour and fixes
>      the prealloc size to whatever the mount option specifies. i.e. the
>      behaviour is unchanged.
>
>      Signed-off-by: Dave Chinner<dchinner@redhat.com>
> - snip -
>
>
>
>
>
> Bis denn
>

Yes, I did the sync, and you are right, I didn't restart the postgres 
process.
Well, today I restarted the whole server. And according to the last 
paragraph you quoted, allocsize=64k should stop the dynamic 
preallocation... but right now it doesn't seem to: the sizes always 
come back to 5.7GB, but in between they grow again.
Could it be, because of the different mount outputs, that the option 
didn't get applied properly?



* Re: Problems with filesizes on different Kernels
  2012-02-20 12:06       ` Bernhard Schrader
@ 2012-02-27  8:23         ` Bernhard Schrader
  0 siblings, 0 replies; 6+ messages in thread
From: Bernhard Schrader @ 2012-02-27  8:23 UTC (permalink / raw)
  To: xfs

On 02/20/2012 01:06 PM, Bernhard Schrader wrote:
> On 02/20/2012 12:06 PM, Matthias Schniedermeyer wrote:
>> On 20.02.2012 09:41, Bernhard Schrader wrote:
>>> On 02/17/2012 01:33 PM, Matthias Schniedermeyer wrote:
>>>> On 17.02.2012 12:51, Bernhard Schrader wrote:
>>>>> Hi all,
>>>>>
>>>>> we just discovered a problem, which I think is related to XFS. Well,
>>>>> I will try to explain.
>>>>>
>>>>> The environment i am working with are around 300 Postgres databases
>>>>> in separated VM's. All are running with XFS. Differences are just in
>>>>> kernel versions.
>>>>> - 2.6.18
>>>>> - 2.6.39
>>>>> - 3.1.4
>>>>>
>>>>> Some days ago i discovered that the file nodes of my postgresql
>>>>> tables have strange sizes. They are located in
>>>>> /var/lib/postgresql/9.0/main/base/[databaseid]/
>>>>> If I execute the following commands i get results like this:
>>>>>
>>>>> Command: du -sh | tr "\n" " "; du --apparent-size -h
>>>>> Result: 6.6G . 5.7G .
>>>>
>>>> Since a few kernel-version XFS does speculative preallocations,
>>>> which is
>>>> primarily a measure to prevent fragmentation.
>>>>
>>>> The preallocations should go away when you drop the caches.
>>>>
>>>> sync
>>>> echo 3> /proc/sys/vm/drop_caches
>>>>
>>>> XFS can be prevented to do that with the mount-option "allocsize".
>>>> Personally i use "allocsize=64k", since i first encountered that
>>>> behaviour, my workload primarily consists of single-thread writing
>>>> which
>>>> doesn't benefit from this preallocation.
>>>> Your workload OTOH may benefit as it should prevent/lower the
>>>> fragmentation of the database files.
>>>
>>> Hi Matthias,
>>> thanks for the reply, as far as i can say the "echo 3>
>>> /proc/sys/vm/drop_caches" didn't work. the sizes didnt shrink.
>>
>> You did "sync" before?
>> drop caches only drops "clean" pages, everything that is dirty isn't
>> dropped. Hence the need to "sync" before.
>>
>> Also i persume that you didn't stop Postgres?
>> I don't know if the process works for files that are currently opened.
>>
>> When i tested the behaviour i tested it with files copied by "cp", so
>> they weren't open by any program when i droped the caches.
>>
>>> Today
>>> i had the chance to test the allocsize=64k. Well, first i thought it
>>> worked, i added the mountoption, restarted the server, everything
>>> shrink to normal sizes. but right now its more or less "flapping". I
>>> have 5.7GB real data and the sizes flap between 6.9GB to 5.7GB.
>>> But I am wondering a little about the mount output:
>>>
>>> # mount
>>> /dev/xvda1 on / type xfs
>>> (rw,noatime,nodiratime,logbufs=8,nobarrier,allocsize=64k)
>>> tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
>>> proc on /proc type proc (rw,noexec,nosuid,nodev)
>>> sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
>>> tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
>>> devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
>>>
>>>
>>> # cat /proc/mounts
>>> rootfs / rootfs rw 0 0
>>> /dev/root / xfs
>>> rw,noatime,nodiratime,attr2,delaylog,nobarrier,noquota 0 0
>>> tmpfs /lib/init/rw tmpfs rw,nosuid,relatime,mode=755 0 0
>>> proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
>>> sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
>>> tmpfs /dev/shm tmpfs rw,nosuid,nodev,relatime 0 0
>>> devpts /dev/pts devpts
>>> rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
>>>
>>>
>>> In normal mount output i see the allocsize, but not in cat
>>> /proc/mounts?!?
>>>
>>> Is there a way to completly disable speculative prealocations? or
>>> the behavior how it works right now?
>>
>> In /proc/mounts on my computer allocsize is there:
>> /dev/mapper/x1 /x1 xfs
>> rw,nosuid,nodev,noatime,attr2,delaylog,allocsize=64k,noquota 0 0
>>
>> I tracked down the patch. It went into 2.6.38
>>
>> - snip -
>> commit 055388a3188f56676c21e92962fc366ac8b5cb72
>> Author: Dave Chinner<dchinner@redhat.com>
>> Date: Tue Jan 4 11:35:03 2011 +1100
>>
>> xfs: dynamic speculative EOF preallocation
>>
>> Currently the size of the speculative preallocation during delayed
>> allocation is fixed by either the allocsize mount option of a
>> default size. We are seeing a lot of cases where we need to
>> recommend using the allocsize mount option to prevent fragmentation
>> when buffered writes land in the same AG.
>>
>> Rather than using a fixed preallocation size by default (up to 64k),
>> make it dynamic by basing it on the current inode size. That way the
>> EOF preallocation will increase as the file size increases. Hence
>> for streaming writes we are much more likely to get large
>> preallocations exactly when we need it to reduce fragmentation.
>>
>> For default settings, the size of the initial extents is determined
>> by the number of parallel writers and the amount of memory in the
>> machine. For 4GB RAM and 4 concurrent 32GB file writes:
>>
>> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
>> 0: [0..1048575]: 1048672..2097247 0 (1048672..2097247) 1048576
>> 1: [1048576..2097151]: 5242976..6291551 0 (5242976..6291551) 1048576
>> 2: [2097152..4194303]: 12583008..14680159 0 (12583008..14680159) 2097152
>> 3: [4194304..8388607]: 25165920..29360223 0 (25165920..29360223) 4194304
>> 4: [8388608..16777215]: 58720352..67108959 0 (58720352..67108959) 8388608
>> 5: [16777216..33554423]: 117440584..134217791 0 (117440584..134217791) 16777208
>> 6: [33554424..50331511]: 184549056..201326143 0 (184549056..201326143) 16777088
>> 7: [50331512..67108599]: 251657408..268434495 0 (251657408..268434495) 16777088
>>
>> and for 16 concurrent 16GB file writes:
>>
>> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
>> 0: [0..262143]: 2490472..2752615 0 (2490472..2752615) 262144
>> 1: [262144..524287]: 6291560..6553703 0 (6291560..6553703) 262144
>> 2: [524288..1048575]: 13631592..14155879 0 (13631592..14155879) 524288
>> 3: [1048576..2097151]: 30408808..31457383 0 (30408808..31457383) 1048576
>> 4: [2097152..4194303]: 52428904..54526055 0 (52428904..54526055) 2097152
>> 5: [4194304..8388607]: 104857704..109052007 0 (104857704..109052007) 4194304
>> 6: [8388608..16777215]: 209715304..218103911 0 (209715304..218103911) 8388608
>> 7: [16777216..33554423]: 452984848..469762055 0 (452984848..469762055) 16777208
>>
>> Because it is hard to take back speculative preallocation, cases
>> where there are large slow growing log files on a nearly full
>> filesystem may cause premature ENOSPC. Hence as the filesystem nears
>> full, the maximum dynamic prealloc size is reduced according to this
>> table (based on 4k block size):
>>
>> freespace       max prealloc size
>>   >5%           full extent (8GB)
>>   4-5%          2GB (8GB >> 2)
>>   3-4%          1GB (8GB >> 3)
>>   2-3%          512MB (8GB >> 4)
>>   1-2%          256MB (8GB >> 5)
>>   <1%           128MB (8GB >> 6)
>>
>> This should reduce the amount of space held in speculative
>> preallocation for such cases.
>>
>> The allocsize mount option turns off the dynamic behaviour and fixes
>> the prealloc size to whatever the mount option specifies. i.e. the
>> behaviour is unchanged.
>>
>> Signed-off-by: Dave Chinner<dchinner@redhat.com>
>> - snip -
>>
>>
>>
>>
>>
>> Bis denn
>>
>
> Yes, I did the sync, and you are right, I didn't restarted the postgres
> process.
> Well, but today i restarted the whole server. And regarding the last
> paragraph you wrote, the allocsize=64K should stop the dynamic
> preallocation... but right now it doesnt seem so, the sizes always get
> back to the 5.7GB, but in between it raises up.
> Could it be possible, because of the different mount outputs, that it
> didnt get loaded well?


Just to give you the solution: the allocsize setting itself was 
correct, but the mount point for this option was /, so the flag cannot 
be applied there by remounting. I had to add "rootflags=allocsize=64k" 
to the extra kernel line in the *.sxp file of each VM; this way the 
option was recognized and everything worked as expected.
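For anyone hitting the same thing, a hedged sketch of the relevant 
piece of the domain config (syntax varies between Xen toolstacks, so 
treat the paths and kernel version as placeholders):

```shell
# Hypothetical Xen PV domain config fragment; exact syntax differs per
# toolstack. rootflags= hands allocsize to the initial mount of /.
kernel = "/boot/vmlinuz-3.1.4"
extra  = "root=/dev/xvda1 rootflags=allocsize=64k"
```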

Thanks all for the help.

regards
Bernhard


