public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
* XFS_IOC_RESVSP64 versus XFS_IOC_ALLOCSP64 with multiple threads
@ 2006-11-13  1:33 Stewart Smith
       [not found] ` <965ECEF2-971D-46A1-B3F2-C6C1860C9ED8@sgi.com>
  0 siblings, 1 reply; 9+ messages in thread
From: Stewart Smith @ 2006-11-13  1:33 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: Type: text/plain, Size: 1831 bytes --]

I recently (finally) wrote my patch to use xfsctl() to get better
allocation for NDB disk data files (datafiles and undofiles).
patch at:
http://lists.mysql.com/commits/15088

This actually ends up giving us a rather nice speed boost in some of the
test suite runs.

The problem is:
- two cluster nodes on 1 host (in the case of the mysql-test-run script)
- each node has a complete copy of the database
- ALTER TABLESPACE ADD DATAFILE / ALTER LOGFILEGROUP ADD UNDOFILE
creates files on *both* nodes. We want to zero these out.
- files are opened with O_SYNC (IIRC)

The patch I committed uses XFS_IOC_RESVSP64 to allocate (unwritten)
extents and then posix_fallocate to zero out the file (the glibc
implementation of this call just writes zeros out).

Now, ideally it would be beneficial (and probably faster) to have XFS do
this in the kernel. Doing it asynchronously would be pretty cool too..
but hey :)

The reason we don't want unwritten extents is that NDB has some realtime
properties, and futzing about with extents and the like in the FS during
transactions isn't such a good idea.

So, this would lead me to try XFS_IOC_ALLOCSP64 - which doesn't have the
"unwritten extents" warning that RESVSP64 does. However, with the two
processes writing the files out, I get heavy fragmentation. Even with a
RESVSP followed by ALLOCSP I get the same result.

So it seems that ALLOCSP re-allocates extents (even when it doesn't have
to) and really doesn't give you much (I didn't do much timing to see
if it was any quicker).

Is this expected behaviour? (it wasn't for me)
-- 
Stewart Smith, Software Engineer
MySQL AB, www.mysql.com
Office: +14082136540 Ext: 6616
VoIP: 6616@sip.us.mysql.com
Mobile: +61 4 3 8844 332

Jumpstart your cluster:
http://www.mysql.com/consulting/packaged/cluster.html

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: XFS_IOC_RESVSP64 versus XFS_IOC_ALLOCSP64 with multiple threads
       [not found] ` <965ECEF2-971D-46A1-B3F2-C6C1860C9ED8@sgi.com>
@ 2006-11-13  4:09   ` Stewart Smith
  2006-11-13  4:53     ` Sam Vaughan
  0 siblings, 1 reply; 9+ messages in thread
From: Stewart Smith @ 2006-11-13  4:09 UTC (permalink / raw)
  To: Sam Vaughan; +Cc: xfs

[-- Attachment #1: Type: text/plain, Size: 2417 bytes --]

On Mon, 2006-11-13 at 13:58 +1100, Sam Vaughan wrote:
> Are the two processes in your test writing files to the same  
> directory as each other?  If so then their allocations will go into  
> the same AG as the directory by default, hence the fragmentation.  If  
> you can limit yourself to an AG's worth of data per directory then  
> you should be able to avoid fragmentation using the default  
> allocator.  If you need to reserve more than that per AG, then the  
> files will most likely start interleaving again once they spill out  
> of their original AGs.  If that's the case then the upcoming  
> filestreams allocator may be your best bet.

I do predict that the filestreams allocator will be useful for us (and
also on my MythTV box...).

The two processes write to their own directories.

The structure of the "filesystem" for the process (ndbd) is:

ndb_1_fs/ (the 1 refers to node id, so there is a ndb_2_fs for a 2 node
setup)
	D8/, D9/, D10/, D11/
		all have a DBLQH subdirectory. In here there are several
		S0.FragLog files (the number changes). These are 16MB
		files used for logging.
		We (currently) don't do any xfsctl allocation on these.
		We should though. In fact, we're writing them in a way
		to get holes (which probably affects performance).
		These files are write only (except during a full cluster
		restart - a very rare event).

	LCP/0/T0F0.Data
		(there is at least 0, 1, 2 for that first number;
		T0 is table 0 - there can be thousands of tables.
		F0 is fragment 0 - there can be a few of those too,
		typically 2-4 though)
		These are an on-disk copy of in-memory tables, variably
		sized files (as big or as small as the tables in a DB).
		The above log files are for changes occurring during the
		writing of these files.

	datafile01.dat, undofile01.dat etc
	whatever files the user creates for disk based tables
		the datafiles and undofiles that i've done the special
		allocation for.
		Typical deployments will have anything from a few
		hundred MB per file to few GB to many many GB.

"typical" installations are probably now evenly split between 1 process
per physical machine and several (usually 2). 

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: XFS_IOC_RESVSP64 versus XFS_IOC_ALLOCSP64 with multiple threads
  2006-11-13  4:09   ` Stewart Smith
@ 2006-11-13  4:53     ` Sam Vaughan
  2006-11-13  5:20       ` Stewart Smith
  0 siblings, 1 reply; 9+ messages in thread
From: Sam Vaughan @ 2006-11-13  4:53 UTC (permalink / raw)
  To: Stewart Smith; +Cc: xfs

On 13/11/2006, at 3:09 PM, Stewart Smith wrote:

> On Mon, 2006-11-13 at 13:58 +1100, Sam Vaughan wrote:
>> Are the two processes in your test writing files to the same
>> directory as each other?  If so then their allocations will go into
>> the same AG as the directory by default, hence the fragmentation.  If
>> you can limit yourself to an AG's worth of data per directory then
>> you should be able to avoid fragmentation using the default
>> allocator.  If you need to reserve more than that per AG, then the
>> files will most likely start interleaving again once they spill out
>> of their original AGs.  If that's the case then the upcoming
>> filestreams allocator may be your best bet.
>
> I do predict that the filestreams allocator will be useful for us (and
> also on my MythTV box...).
>
> The two processes write to their own directories.
>
> The structure of the "filesystem" for the process (ndbd) is:
>
> ndb_1_fs/ (the 1 refers to node id, so there is a ndb_2_fs for a 2  
> node
> setup)
> 	D8/, D9/, D10/, D11/
> 		all have a DBLQH subdirectory. In here there are several
> 		S0.FragLog files (the number changes). These are 16MB
> 		files used for logging.
> 		We (currently) don't do any xfsctl allocation on these.
> 		We should though. In fact, we're writing them in a way
> 		to get holes (which probably affects performance).
> 		These files are write only (except during a full cluster
> 		restart - a very rare event).
>
> 	LCP/0/T0F0.Data
> 		(there is at least 0,1,2 for that first number,
> 		T0 is table 0 - can be thousands of tables.
> 		f0 is fragment 0, can be a few of them too, typically
> 		2-4 though)
> 		These are an on-disk copy of in-memory tables, variably
> 		sized files (as big or as small as tables in a DB)
> 		The above log files are for changes occurring during the
> 		writing of these files.
>
> 	datafile01.dat, undofile01.dat etc
> 	whatever files the user creates for disk based tables
> 		the datafiles and undofiles that i've done the special
> 		allocation for.
> 		Typical deployments will have anything from a few
> 		hundred MB per file to few GB to many many GB.
>
> "typical" installations are probably now evenly split between 1  
> process
> per physical machine and several (usually 2).

Just to be clear, are we talking about intra-file fragmentation, i.e.  
file data laid out discontiguously on disk, or inter-file  
fragmentation where each file is contiguous on disk but the files  
from different processes are getting interleaved?  Also, are there  
just a couple of user data files, each of them potentially much  
larger than the size of an AG, or do you split the data up into many  
files, e.g. datafile01.dat ... datafile99.dat ...?

If you have the flexibility to break the data up at arbitrary points  
into separate files, you could get optimal allocation behaviour by  
starting a new directory as soon as the files in the current one are  
large enough to fill an AG.  The problem with the filestreams  
allocator is that it will only dedicate an AG to a directory for a  
fixed and short period of time after the last file was written to  
it.  This works well to limit the resource drain on AGs when running  
file-per-frame video captures, but not so well with a database that  
writes its data in a far less regimented and timely way.

The following two tests illustrate the standard allocation policy I'm  
referring to here.  I've simplified it to take advantage of the fact  
that it's producing just one extent per file, but you can run  
`xfs_bmap -v` over all the files to verify that's the case.

Standard SLES 10 kernel, standard mount options:

$ uname -r
2.6.16.21-0.8-smp
$ xfs_info .
meta-data=/dev/sdb8              isize=256    agcount=16, agsize=3267720 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=52283520, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal               bsize=4096   blocks=25529, version=1
         =                       sectsz=512   sunit=0 blks
realtime =none                   extsz=65536  blocks=0, rtextents=0
$ mount | grep sdb8
/dev/sdb8 on /spare200 type xfs (rw)
$

Create two directories and start two processes off, one per  
directory.  The processes preallocate ten 100MB files each.  The  
result is that their data goes into separate AGs on disk, all nicely  
contiguous:

$ mkdir a b
$ for dir in a b; do
 > for file in `seq 0 9`; do
 > touch $dir/$file
 > xfs_io -c 'allocsp 100m 0' $dir/$file
 > done &
 > done; wait
[1] 5649
[2] 5650
$ for file in `seq 0 9`; do
 > bmap_a=`xfs_bmap -v a/$file | tail -1`
 > bmap_b=`xfs_bmap -v b/$file | tail -1`
 > ag_a=`echo $bmap_a | awk '{print $4}'`
 > ag_b=`echo $bmap_b | awk '{print $4}'`
 > br_a=`echo $bmap_a | awk '{printf "%-18s", $3}'`
 > br_b=`echo $bmap_b | awk '{printf "%-18s", $3}'`
 > echo a/$file: $ag_a "$br_a" b/$file: $ag_b "$br_b"
 > done
a/0: 8 209338416..209543215 b/0: 9 235275936..235480735
a/1: 8 209543216..209748015 b/1: 9 235480736..235685535
a/2: 8 209748016..209952815 b/2: 9 235685536..235890335
a/3: 8 209952816..210157615 b/3: 9 235890336..236095135
a/4: 8 210157616..210362415 b/4: 9 236095136..236299935
a/5: 8 210362416..210567215 b/5: 9 236299936..236504735
a/6: 8 210567216..210772015 b/6: 9 236504736..236709535
a/7: 8 210772016..210976815 b/7: 9 236709536..236914335
a/8: 8 210976816..211181615 b/8: 9 236914336..237119135
a/9: 8 211181616..211386415 b/9: 9 237119136..237323935
$

Now do the same thing, except have the processes write their files  
into the same directory using different file names.  This time the  
files are allocated on top of each other.

$ dir=c
$ mkdir $dir
$ for process in 1 2; do
 > for file in `seq 0 9`; do
 > touch $dir/$process.$file
 > xfs_io -c 'allocsp 100m 0' $dir/$process.$file
 > done &
 > done; wait
[1] 5985
[2] 5986
$ for file in c/*; do
 > bmap=`xfs_bmap -v $file | tail -1`
 > ag=`echo $bmap | awk '{print $4}'`
 > br=`echo $bmap | awk '{printf "%-18s", $3}'`
 > echo $file: $ag "$br"
 > done
c/1.0: 11 287559456..287764255
c/1.1: 11 287969056..288173855
c/1.2: 11 288378656..288583455
c/1.3: 11 288788256..288993055
c/1.4: 11 289197856..289402655
c/1.5: 11 289607456..289812255
c/1.6: 11 290017056..290221855
c/1.7: 11 290426656..290631455
c/1.8: 11 290836264..291041063
c/1.9: 11 291450664..291655463
c/2.0: 11 287764256..287969055
c/2.1: 11 288173856..288378655
c/2.2: 11 288583456..288788255
c/2.3: 11 288993056..289197855
c/2.4: 11 289402656..289607455
c/2.5: 11 289812256..290017055
c/2.6: 11 290221856..290426655
c/2.7: 11 290631464..290836263
c/2.8: 11 291041064..291245863
c/2.9: 11 291245864..291450663
$

Now in your case you're using different directories, so your files  
are probably OK at the start of day.  Once the AGs they start in fill  
up though, the files for both processes will start getting allocated  
from the next available AG.  At that point, allocations that started  
out looking like the first test above will end up looking like the  
second.

The filestreams allocator will stop this from happening for  
applications that write data regularly like video ingest servers, but  
I wouldn't expect it to be a cure-all for your database app because  
your writes could have large delays between them.  Instead, I'd look  
into ways to break up your data into AG-sized chunks, starting a new  
directory every time you go over that magic size.

Sam

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: XFS_IOC_RESVSP64 versus XFS_IOC_ALLOCSP64 with multiple threads
  2006-11-13  4:53     ` Sam Vaughan
@ 2006-11-13  5:20       ` Stewart Smith
  2006-11-14  0:04         ` Sam Vaughan
  0 siblings, 1 reply; 9+ messages in thread
From: Stewart Smith @ 2006-11-13  5:20 UTC (permalink / raw)
  To: Sam Vaughan; +Cc: xfs

[-- Attachment #1: Type: text/plain, Size: 4850 bytes --]

On Mon, 2006-11-13 at 15:53 +1100, Sam Vaughan wrote:
> Just to be clear, are we talking about intra-file fragmentation, i.e.  
> file data laid out discontiguously on disk, or inter-file  
> fragmentation where each file is contiguous on disk but the files  
> from different processes are getting interleaved?  Also, are there  
> just a couple of user data files, each of them potentially much  
> larger than the size of an AG, or do you split the data up into many  
> files, e.g. datafile01.dat ... datafile99.dat ...?

an example:

/home/mysql/cluster/ndb_1_fs/datafile1.dat:
 EXT: FILE-OFFSET       BLOCK-RANGE        AG AG-OFFSET          TOTAL
   0: [0..63]:          32862376..32862439  8 (1405096..1405159)    64
   1: [64..127]:        32875992..32876055  8 (1418712..1418775)    64
   2: [128..191]:       33040112..33040175  8 (1582832..1582895)    64
   3: [192..255]:       33080136..33080199  8 (1622856..1622919)    64
   4: [256..319]:       33101416..33101479  8 (1644136..1644199)    64
   5: [320..383]:       33112624..33112687  8 (1655344..1655407)    64
   6: [384..447]:       32526608..32526671  8 (1069328..1069391)    64
   7: [448..511]:       31678920..31678983  8 (221640..221703)      64
/home/mysql/cluster/ndb_2_fs/datafile1.dat:
 EXT: FILE-OFFSET       BLOCK-RANGE        AG AG-OFFSET          TOTAL
   0: [0..63]:          32864704..32864767  8 (1407424..1407487)    64
   1: [64..127]:        32888544..32888607  8 (1431264..1431327)    64
   2: [128..191]:       33068832..33068895  8 (1611552..1611615)    64
   3: [192..255]:       33101168..33101231  8 (1643888..1643951)    64
   4: [256..319]:       33101656..33101719  8 (1644376..1644439)    64
   5: [320..383]:       33115784..33115847  8 (1658504..1658567)    64
   6: [384..447]:       33897200..33897263  8 (2439920..2439983)    64
   7: [448..511]:       33900896..33900959  8 (2443616..2443679)    64

on this fs:
 isize=256    agcount=32, agsize=491520 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=15728640, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal               bsize=4096   blocks=3840, version=1
         =                       sectsz=512   sunit=0 blks
realtime =none                   extsz=65536  blocks=0, rtextents=0

(somewhere between 5-15GB free at the time of this create, IIRC)

these datafiles are fixed size, allocated by user. a DBA would run from
the SQL server something like:
CREATE TABLESPACE ts1
ADD DATAFILE 'datafile.dat'
USE LOGFILE GROUP lg1
INITIAL_SIZE 1G
ENGINE NDB;

to get a tablespace with 1GB data file (on each node).

we currently don't do any automatic extending.

> If you have the flexibility to break the data up at arbitrary points  
> into separate files, you could get optimal allocation behaviour by  
> starting a new directory as soon as the files in the current one are  
> large enough to fill an AG.  The problem with the filestreams  
> allocator is that it will only dedicate an AG to a directory for a  
> fixed and short period of time after the last file was written to  
> it.  This works well to limit the resource drain on AGs when running  
> file-per-frame video captures, but not so well with a database that  
> writes its data in a far less regimented and timely way.

for the data and undo files, we're just not changing their size except
at creation time, so that's okay.

> Now in your case you're using different directories, so your files  
> are probably OK at the start of day.  Once the AGs they start in fill  
> up though, the files for both processes will start getting allocated  
> from the next available AG.  At that point, allocations that started  
> out looking like the first test above will end up looking like the  
> second.
> 
> The filestreams allocator will stop this from happening for  
> applications that write data regularly like video ingest servers, but  
> I wouldn't expect it to be a cure-all for your database app because  
> your writes could have large delays between them.  Instead, I'd look  
> into ways to break up your data into AG-sized chunks, starting a new  
> directory every time you go over that magic size.

I'll have to check our writing behaviour for the files that change
sizes... but they're not too much of an issue (they're hardly ever read
back, so as long as writing them out is okay and reading isn't totally
abysmal, we don't have to worry).

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: XFS_IOC_RESVSP64 versus XFS_IOC_ALLOCSP64 with multiple threads
  2006-11-13  5:20       ` Stewart Smith
@ 2006-11-14  0:04         ` Sam Vaughan
  2006-11-14  0:25           ` Chris Wedgwood
  2006-11-27  5:55           ` Stewart Smith
  0 siblings, 2 replies; 9+ messages in thread
From: Sam Vaughan @ 2006-11-14  0:04 UTC (permalink / raw)
  To: Stewart Smith; +Cc: xfs

On 13/11/2006, at 4:20 PM, Stewart Smith wrote:

> On Mon, 2006-11-13 at 15:53 +1100, Sam Vaughan wrote:
>> Just to be clear, are we talking about intra-file fragmentation, i.e.
>> file data laid out discontiguously on disk, or inter-file
>>> fragmentation where each file is contiguous on disk but the files
>> from different processes are getting interleaved?  Also, are there
>> just a couple of user data files, each of them potentially much
>> larger than the size of an AG, or do you split the data up into many
>> files, e.g. datafile01.dat ... datafile99.dat ...?
>
> an example:
>
> /home/mysql/cluster/ndb_1_fs/datafile1.dat:
>  EXT: FILE-OFFSET       BLOCK-RANGE        AG AG-OFFSET          TOTAL
>    0: [0..63]:          32862376..32862439  8 (1405096..1405159)    64
>    1: [64..127]:        32875992..32876055  8 (1418712..1418775)    64
>    2: [128..191]:       33040112..33040175  8 (1582832..1582895)    64
>    3: [192..255]:       33080136..33080199  8 (1622856..1622919)    64
>    4: [256..319]:       33101416..33101479  8 (1644136..1644199)    64
>    5: [320..383]:       33112624..33112687  8 (1655344..1655407)    64
>    6: [384..447]:       32526608..32526671  8 (1069328..1069391)    64
>    7: [448..511]:       31678920..31678983  8 (221640..221703)      64
> /home/mysql/cluster/ndb_2_fs/datafile1.dat:
>  EXT: FILE-OFFSET       BLOCK-RANGE        AG AG-OFFSET          TOTAL
>    0: [0..63]:          32864704..32864767  8 (1407424..1407487)    64
>    1: [64..127]:        32888544..32888607  8 (1431264..1431327)    64
>    2: [128..191]:       33068832..33068895  8 (1611552..1611615)    64
>    3: [192..255]:       33101168..33101231  8 (1643888..1643951)    64
>    4: [256..319]:       33101656..33101719  8 (1644376..1644439)    64
>    5: [320..383]:       33115784..33115847  8 (1658504..1658567)    64
>    6: [384..447]:       33897200..33897263  8 (2439920..2439983)    64
>    7: [448..511]:       33900896..33900959  8 (2443616..2443679)    64

Those extents are curiously uniform, all 32kB in size.  The fact that  
both files' extents are in AG 8 suggests that the two directories  
ndb_1_fs and ndb_2_fs filled their original AGs and spilled out into  
other ones, which is when the interference would have started.   
Looking at the directory hierarchy in your last email, you might be  
better off if you could add another directory for the datafiles and  
undofiles to live in, so they don't end up sharing their AG with  
other stuff in their parent directory.

> on this fs:
>  isize=256    agcount=32, agsize=491520 blks
>          =                       sectsz=512   attr=0
> data     =                       bsize=4096   blocks=15728640, imaxpct=25
>          =                       sunit=0      swidth=0 blks, unwritten=1
> naming   =version 2              bsize=4096
> log      =internal               bsize=4096   blocks=3840, version=1
>          =                       sectsz=512   sunit=0 blks
> realtime =none                   extsz=65536  blocks=0, rtextents=0

OK, so you've got 32 2GB AGs, and the filesystem is much too small  
for the inode32 rotor to be involved.

> (somewhere between 5-15Gb free from this create IIRC)
>
> these datafiles are fixed size, allocated by user. a DBA would run  
> from
> the SQL server something like:
> CREATE TABLESPACE ts1
> ADD DATAFILE 'datafile.dat'
> USE LOGFILE GROUP lg1
> INITIAL_SIZE 1G
> ENGINE NDB;
>
> to get a tablespace with 1GB data file (on each node).

So your data file is half the size of an AG.  That shouldn't be a  
problem but it'd be best to keep it to one or two of these files per  
directory if there's going to be much other concurrent allocation  
activity.

> we currently don't do any automatic extending.
>
>> If you have the flexibility to break the data up at arbitrary points
>> into separate files, you could get optimal allocation behaviour by
>> starting a new directory as soon as the files in the current one are
>> large enough to fill an AG.  The problem with the filestreams
>> allocator is that it will only dedicate an AG to a directory for a
>> fixed and short period of time after the last file was written to
>> it.  This works well to limit the resource drain on AGs when running
>> file-per-frame video captures, but not so well with a database that
>> writes its data in a far less regimented and timely way.
>
> for the data and undo files, we're just not changing their size except
> at creation time, so that's okay.

I'd assumed that these files were being continually grown.  If all  
this is happening at creation time then it shouldn't be too hard to  
make sure the files are cleanly allocated with just one extent.  Does  
the following not work on your file system?

$ touch a b
$ for file in a b; do
 > xfs_io -c 'allocsp 1G 0' $file &
 > done; wait
[1] 12312
[2] 12313
[1]-  Done                    xfs_io -c 'allocsp 1G 0' $file
[2]+  Done                    xfs_io -c 'allocsp 1G 0' $file
$ xfs_bmap -v a b
a:
 EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET              TOTAL
   0: [0..2097151]:    231732008..233829159  6 (11968856..14066007) 2097152
b:
 EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET              TOTAL
   0: [0..2097151]:    233829160..235926311  6 (14066008..16163159) 2097152
$

>> Now in your case you're using different directories, so your files
>> are probably OK at the start of day.  Once the AGs they start in fill
>> up though, the files for both processes will start getting allocated
>> from the next available AG.  At that point, allocations that started
>> out looking like the first test above will end up looking like the
>> second.
>>
>> The filestreams allocator will stop this from happening for
>> applications that write data regularly like video ingest servers, but
>> I wouldn't expect it to be a cure-all for your database app because
>> your writes could have large delays between them.  Instead, I'd look
>> into ways to break up your data into AG-sized chunks, starting a new
>> directory every time you go over that magic size.
>
> I'll have to check our writing behaviour for the files that change
> sizes... but they're not too much of an issue (they're hardly ever read
> back, so as long as writing them out is okay and reading isn't totally
> abysmal, we don't have to worry).

That's handy.  All in all it sounds like your requirements are very  
file system friendly in terms of getting optimum allocation.  I'm not  
sure what could be causing all those 32kB extents.

Sam

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: XFS_IOC_RESVSP64 versus XFS_IOC_ALLOCSP64 with multiple threads
  2006-11-14  0:04         ` Sam Vaughan
@ 2006-11-14  0:25           ` Chris Wedgwood
  2006-11-14  0:31             ` Sam Vaughan
  2006-11-27  5:55           ` Stewart Smith
  1 sibling, 1 reply; 9+ messages in thread
From: Chris Wedgwood @ 2006-11-14  0:25 UTC (permalink / raw)
  To: Sam Vaughan; +Cc: Stewart Smith, xfs

On Tue, Nov 14, 2006 at 11:04:17AM +1100, Sam Vaughan wrote:

> Those extents are curiously uniform, all 32kB in size.

O_SYNC writes?

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: XFS_IOC_RESVSP64 versus XFS_IOC_ALLOCSP64 with multiple threads
  2006-11-14  0:25           ` Chris Wedgwood
@ 2006-11-14  0:31             ` Sam Vaughan
  2006-11-14  0:37               ` Sam Vaughan
  0 siblings, 1 reply; 9+ messages in thread
From: Sam Vaughan @ 2006-11-14  0:31 UTC (permalink / raw)
  To: Chris Wedgwood; +Cc: Stewart Smith, xfs

On 14/11/2006, at 11:25 AM, Chris Wedgwood wrote:

> On Tue, Nov 14, 2006 at 11:04:17AM +1100, Sam Vaughan wrote:
>
>> Those extents are curiously uniform, all 32kB in size.
>
> O_SYNC writes?

I'm assuming from Stuart's original email that these files weren't  
written out with write(), but instead pre-allocated using allocsp:

> So, this would lead me to try XFS_IOC_ALLOCSP64 - which doesn't  
> have the
> "unwritten extents" warning that RESVSP64 does. However, with the two
> processes writing the files out, I get heavy fragmentation. Even  
> with a
> RESVSP followed by ALLOCSP I get the same result.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: XFS_IOC_RESVSP64 versus XFS_IOC_ALLOCSP64 with multiple threads
  2006-11-14  0:31             ` Sam Vaughan
@ 2006-11-14  0:37               ` Sam Vaughan
  0 siblings, 0 replies; 9+ messages in thread
From: Sam Vaughan @ 2006-11-14  0:37 UTC (permalink / raw)
  To: Stewart Smith; +Cc: xfs

On 14/11/2006, at 11:31 AM, Sam Vaughan wrote:

> I'm assuming from Stuart's original email

Oops.  s/Stuart/Stewart/

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: XFS_IOC_RESVSP64 versus XFS_IOC_ALLOCSP64 with multiple threads
  2006-11-14  0:04         ` Sam Vaughan
  2006-11-14  0:25           ` Chris Wedgwood
@ 2006-11-27  5:55           ` Stewart Smith
  1 sibling, 0 replies; 9+ messages in thread
From: Stewart Smith @ 2006-11-27  5:55 UTC (permalink / raw)
  To: Sam Vaughan; +Cc: xfs

[-- Attachment #1: Type: text/plain, Size: 3662 bytes --]

On Tue, 2006-11-14 at 11:04 +1100, Sam Vaughan wrote: 
> Those extents are curiously uniform, all 32kB in size.  The fact that  
> both files' extents are in AG 8 suggests that the two directories  
> ndb_1_fs and ndb_2_fs filled their original AGs and spilled out into  
> other ones, which is when the interference would have started.   
> Looking at the directory hierarchy in your last email, you might be  
> better off if you could add another directory for the datafiles and  
> undofiles to live in, so they don't end up sharing their AG with  
> other stuff in their parent directory.

I think this is typically what the QA guys do (to help keep their sanity
if anything). Perhaps we should have this in our "best practice"
documentation as well...

> > for the data and undo files, we're just not changing their size except
> > at creation time, so that's okay.
> 
> I'd assumed that these files were being continually grown.  If all  
> this is happening at creation time then it shouldn't be too hard to  
> make sure the files are cleanly allocated with just one extent.  Does  
> the following not work on your file system?
> 
> $ touch a b
> $ for file in a b; do
>  > xfs_io -c 'allocsp 1G 0' $file &
>  > done; wait
> [1] 12312
> [2] 12313
> [1]-  Done                    xfs_io -c 'allocsp 1G 0' $file
> [2]+  Done                    xfs_io -c 'allocsp 1G 0' $file
> $ xfs_bmap -v a b
> a:
>  EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET              TOTAL
>    0: [0..2097151]:    231732008..233829159  6 (11968856..14066007) 2097152
> b:
>  EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET              TOTAL
>    0: [0..2097151]:    233829160..235926311  6 (14066008..16163159) 2097152
> $

That works fine on my file systems (or, on my rather full and well-used
/home, as well as it can).

We're opening the files with O_DIRECT (or, if that's not available or it
fails, with O_SYNC).



> >> Now in your case you're using different directories, so your files
> >> are probably OK at the start of day.  Once the AGs they start in fill
> >> up though, the files for both processes will start getting allocated
> >> from the next available AG.  At that point, allocations that started
> >> out looking like the first test above will end up looking like the
> >> second.
> >>
> >> The filestreams allocator will stop this from happening for
> >> applications that write data regularly like video ingest servers, but
> >> I wouldn't expect it to be a cure-all for your database app because
> >> your writes could have large delays between them.  Instead, I'd look
> >> into ways to break up your data into AG-sized chunks, starting a new
> >> directory every time you go over that magic size.
> >
> > I'll have to check our writing behaviour for the files that change
> > sizes... but they're not too much of an issue (they're hardly ever read
> > back, so as long as writing them out is okay and reading isn't totally
> > abysmal, we don't have to worry).
> 
> That's handy.  All in all it sounds like your requirements are very  
> file system friendly in terms of getting optimum allocation.  I'm not  
> sure what could be causing all those 32kB extents.

Perhaps they're being flushed out due to VM pressure? But with
O_DIRECT/O_SYNC that shouldn't be the case, right? Or perhaps it happens
*because* of O_DIRECT/O_SYNC?

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2006-11-27  6:29 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-11-13  1:33 XFS_IOC_RESVSP64 versus XFS_IOC_ALLOCSP64 with multiple threads Stewart Smith
     [not found] ` <965ECEF2-971D-46A1-B3F2-C6C1860C9ED8@sgi.com>
2006-11-13  4:09   ` Stewart Smith
2006-11-13  4:53     ` Sam Vaughan
2006-11-13  5:20       ` Stewart Smith
2006-11-14  0:04         ` Sam Vaughan
2006-11-14  0:25           ` Chris Wedgwood
2006-11-14  0:31             ` Sam Vaughan
2006-11-14  0:37               ` Sam Vaughan
2006-11-27  5:55           ` Stewart Smith

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox