* XFS_IOC_RESVSP64 versus XFS_IOC_ALLOCSP64 with multiple threads
@ 2006-11-13 1:33 Stewart Smith
[not found] ` <965ECEF2-971D-46A1-B3F2-C6C1860C9ED8@sgi.com>
0 siblings, 1 reply; 9+ messages in thread
From: Stewart Smith @ 2006-11-13 1:33 UTC (permalink / raw)
To: xfs
[-- Attachment #1: Type: text/plain, Size: 1831 bytes --]
I recently (finally) wrote my patch to use xfsctl to get better
allocation for NDB disk data files (datafiles and undofiles).
patch at:
http://lists.mysql.com/commits/15088
This actually ends up giving us a rather nice speed boost in some of the
test suite runs.
The problem is:
- two cluster nodes on 1 host (in the case of the mysql-test-run script)
- each node has a complete copy of the database
- ALTER TABLESPACE ADD DATAFILE / ALTER LOGFILEGROUP ADD UNDOFILE
creates files on *both* nodes. We want to zero these out.
- files are opened with O_SYNC (IIRC)
The patch I committed uses XFS_IOC_RESVSP64 to allocate (unwritten)
extents and then posix_fallocate to zero out the file (the glibc
implementation of this call just writes zeros out).
Now, ideally it would be beneficial (and probably faster) to have XFS do
this in kernel. Asynchronously would be pretty cool too.. but hey :)
The reason we don't want unwritten extents is that NDB has some realtime
properties, and futzing about with extents and the like in the FS during
transactions isn't such a good idea.
So, this would lead me to try XFS_IOC_ALLOCSP64 - which doesn't have the
"unwritten extents" warning that RESVSP64 does. However, with the two
processes writing the files out, I get heavy fragmentation. Even with a
RESVSP followed by ALLOCSP I get the same result.
So it seems that ALLOCSP re-allocates extents (even if it doesn't have
to) and really doesn't give you much (didn't do too much timing to see
if it was any quicker).
Is this expected behaviour? (it wasn't for me)
--
Stewart Smith, Software Engineer
MySQL AB, www.mysql.com
Office: +14082136540 Ext: 6616
VoIP: 6616@sip.us.mysql.com
Mobile: +61 4 3 8844 332
Jumpstart your cluster:
http://www.mysql.com/consulting/packaged/cluster.html
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: XFS_IOC_RESVSP64 versus XFS_IOC_ALLOCSP64 with multiple threads
[not found] ` <965ECEF2-971D-46A1-B3F2-C6C1860C9ED8@sgi.com>
@ 2006-11-13 4:09 ` Stewart Smith
2006-11-13 4:53 ` Sam Vaughan
0 siblings, 1 reply; 9+ messages in thread
From: Stewart Smith @ 2006-11-13 4:09 UTC (permalink / raw)
To: Sam Vaughan; +Cc: xfs
[-- Attachment #1: Type: text/plain, Size: 2417 bytes --]
On Mon, 2006-11-13 at 13:58 +1100, Sam Vaughan wrote:
> Are the two processes in your test writing files to the same
> directory as each other? If so then their allocations will go into
> the same AG as the directory by default, hence the fragmentation. If
> you can limit yourself to an AG's worth of data per directory then
> you should be able to avoid fragmentation using the default
> allocator. If you need to reserve more than that per AG, then the
> files will most likely start interleaving again once they spill out
> of their original AGs. If that's the case then the upcoming
> filestreams allocator may be your best bet.
I do predict that the filestreams allocator will be useful for us (and
also on my MythTV box...).
The two processes write to their own directories.
The structure of the "filesystem" for the process (ndbd) is:
ndb_1_fs/ (the 1 refers to node id, so there is a ndb_2_fs for a 2 node
setup)
D8/, D9/, D10/, D11/
all have a DBLQH subdirectory. In here there are several
S0.FragLog files (the number changes). These are 16MB
files used for logging.
We (currently) don't do any xfsctl allocation on these.
We should though. In fact, we're writing them in a way
to get holes (which probably affects performance).
These files are write only (except during a full cluster
restart - a very rare event).
LCP/0/T0F0.Data
(there is at least 0,1,2 for that first number,
T0 is table 0 - can be thousands of tables.
f0 is fragment 0, can be a few of them too, typically
2-4 though)
These are an on-disk copy of in-memory tables, variably
sized files (as big or as small as tables in a DB)
The above log files are for changes occurring during the
writing of these files.
datafile01.dat, undofile01.dat etc
whatever files the user creates for disk based tables
the datafiles and undofiles that I've done the special
allocation for.
Typical deployments will have anything from a few
hundred MB per file to a few GB to many GB.
"typical" installations are probably now evenly split between 1 process
per physical machine and several (usually 2).
--
Stewart Smith, Software Engineer
MySQL AB, www.mysql.com
Office: +14082136540 Ext: 6616
VoIP: 6616@sip.us.mysql.com
Mobile: +61 4 3 8844 332
Jumpstart your cluster:
http://www.mysql.com/consulting/packaged/cluster.html
* Re: XFS_IOC_RESVSP64 versus XFS_IOC_ALLOCSP64 with multiple threads
2006-11-13 4:09 ` Stewart Smith
@ 2006-11-13 4:53 ` Sam Vaughan
2006-11-13 5:20 ` Stewart Smith
0 siblings, 1 reply; 9+ messages in thread
From: Sam Vaughan @ 2006-11-13 4:53 UTC (permalink / raw)
To: Stewart Smith; +Cc: xfs
On 13/11/2006, at 3:09 PM, Stewart Smith wrote:
> On Mon, 2006-11-13 at 13:58 +1100, Sam Vaughan wrote:
>> Are the two processes in your test writing files to the same
>> directory as each other? If so then their allocations will go into
>> the same AG as the directory by default, hence the fragmentation. If
>> you can limit yourself to an AG's worth of data per directory then
>> you should be able to avoid fragmentation using the default
>> allocator. If you need to reserve more than that per AG, then the
>> files will most likely start interleaving again once they spill out
>> of their original AGs. If that's the case then the upcoming
>> filestreams allocator may be your best bet.
>
> I do predict that the filestreams allocator will be useful for us (and
> also on my MythTV box...).
>
> The two processes write to their own directories.
>
> The structure of the "filesystem" for the process (ndbd) is:
>
> ndb_1_fs/ (the 1 refers to node id, so there is a ndb_2_fs for a 2
> node
> setup)
> D8/, D9/, D10/, D11/
> all have a DBLQH subdirectory. In here there are several
> S0.FragLog files (the number changes). These are 16MB
> files used for logging.
> We (currently) don't do any xfsctl allocation on these.
> We should though. In fact, we're writing them in a way
> to get holes (which probably affects performance).
> These files are write only (except during a full cluster
> restart - a very rare event).
>
> LCP/0/T0F0.Data
> (there is at least 0,1,2 for that first number,
> T0 is table 0 - can be thousands of tables.
> f0 is fragment 0, can be a few of them too, typically
> 2-4 though)
> These are an on-disk copy of in-memory tables, variably
> sized files (as big or as small as tables in a DB)
> The above log files are for changes occurring during the
> writing of these files.
>
> datafile01.dat, undofile01.dat etc
> whatever files the user creates for disk based tables
> the datafiles and undofiles that i've done the special
> allocation for.
> Typical deployments will have anything from a few
> hundred MB per file to few GB to many many GB.
>
> "typical" installations are probably now evenly split between 1
> process
> per physical machine and several (usually 2).
Just to be clear, are we talking about intra-file fragmentation, i.e.
file data laid out discontiguously on disk, or inter-file
fragmentation where each file is contiguous on disk but the files
from different processes are getting interleaved? Also, are there
just a couple of user data files, each of them potentially much
larger than the size of an AG, or do you split the data up into many
files, e.g. datafile01.dat ... datafile99.dat ...?
If you have the flexibility to break the data up at arbitrary points
into separate files, you could get optimal allocation behaviour by
starting a new directory as soon as the files in the current one are
large enough to fill an AG. The problem with the filestreams
allocator is that it will only dedicate an AG to a directory for a
fixed and short period of time after the last file was written to
it. This works well to limit the resource drain on AGs when running
file-per-frame video captures, but not so well with a database that
writes its data in a far less regimented and timely way.
The following two tests illustrate the standard allocation policy I'm
referring to here. I've simplified it to take advantage of the fact
that it's producing just one extent per file, but you can run
`xfs_bmap -v` over all the files to verify that's the case.
Standard SLES 10 kernel, standard mount options:
$ uname -r
2.6.16.21-0.8-smp
$ xfs_info .
meta-data=/dev/sdb8 isize=256 agcount=16, agsize=3267720 blks
         =          sectsz=512 attr=0
data     =          bsize=4096 blocks=52283520, imaxpct=25
         =          sunit=0 swidth=0 blks, unwritten=1
naming   =version 2 bsize=4096
log      =internal  bsize=4096 blocks=25529, version=1
         =          sectsz=512 sunit=0 blks
realtime =none      extsz=65536 blocks=0, rtextents=0
$ mount | grep sdb8
/dev/sdb8 on /spare200 type xfs (rw)
$
Create two directories and start two processes off, one per
directory. The processes preallocate ten 100MB files each. The
result is that their data goes into separate AGs on disk, all nicely
contiguous:
$ mkdir a b
$ for dir in a b; do
> for file in `seq 0 9`; do
> touch $dir/$file
> xfs_io -c 'allocsp 100m 0' $dir/$file
> done &
> done; wait
[1] 5649
[2] 5650
$ for file in `seq 0 9`; do
> bmap_a=`xfs_bmap -v a/$file | tail -1`
> bmap_b=`xfs_bmap -v b/$file | tail -1`
> ag_a=`echo $bmap_a | awk '{print $4}'`
> ag_b=`echo $bmap_b | awk '{print $4}'`
> br_a=`echo $bmap_a | awk '{printf "%-18s", $3}'`
> br_b=`echo $bmap_b | awk '{printf "%-18s", $3}'`
> echo a/$file: $ag_a "$br_a" b/$file: $ag_b "$br_b"
> done
a/0: 8 209338416..209543215 b/0: 9 235275936..235480735
a/1: 8 209543216..209748015 b/1: 9 235480736..235685535
a/2: 8 209748016..209952815 b/2: 9 235685536..235890335
a/3: 8 209952816..210157615 b/3: 9 235890336..236095135
a/4: 8 210157616..210362415 b/4: 9 236095136..236299935
a/5: 8 210362416..210567215 b/5: 9 236299936..236504735
a/6: 8 210567216..210772015 b/6: 9 236504736..236709535
a/7: 8 210772016..210976815 b/7: 9 236709536..236914335
a/8: 8 210976816..211181615 b/8: 9 236914336..237119135
a/9: 8 211181616..211386415 b/9: 9 237119136..237323935
$
Now do the same thing, except have the processes write their files
into the same directory using different file names. This time the
files are allocated on top of each other.
$ dir=c
$ mkdir $dir
$ for process in 1 2; do
> for file in `seq 0 9`; do
> touch $dir/$process.$file
> xfs_io -c 'allocsp 100m 0' $dir/$process.$file
> done &
> done; wait
[1] 5985
[2] 5986
$ for file in c/*; do
> bmap=`xfs_bmap -v $file | tail -1`
> ag=`echo $bmap | awk '{print $4}'`
> br=`echo $bmap | awk '{printf "%-18s", $3}'`
> echo $file: $ag "$br"
> done
c/1.0: 11 287559456..287764255
c/1.1: 11 287969056..288173855
c/1.2: 11 288378656..288583455
c/1.3: 11 288788256..288993055
c/1.4: 11 289197856..289402655
c/1.5: 11 289607456..289812255
c/1.6: 11 290017056..290221855
c/1.7: 11 290426656..290631455
c/1.8: 11 290836264..291041063
c/1.9: 11 291450664..291655463
c/2.0: 11 287764256..287969055
c/2.1: 11 288173856..288378655
c/2.2: 11 288583456..288788255
c/2.3: 11 288993056..289197855
c/2.4: 11 289402656..289607455
c/2.5: 11 289812256..290017055
c/2.6: 11 290221856..290426655
c/2.7: 11 290631464..290836263
c/2.8: 11 291041064..291245863
c/2.9: 11 291245864..291450663
$
Now in your case you're using different directories, so your files
are probably OK at the start of day. Once the AGs they start in fill
up though, the files for both processes will start getting allocated
from the next available AG. At that point, allocations that started
out looking like the first test above will end up looking like the
second.
The filestreams allocator will stop this from happening for
applications that write data regularly like video ingest servers, but
I wouldn't expect it to be a cure-all for your database app because
your writes could have large delays between them. Instead, I'd look
into ways to break up your data into AG-sized chunks, starting a new
directory every time you go over that magic size.
Sam
* Re: XFS_IOC_RESVSP64 versus XFS_IOC_ALLOCSP64 with multiple threads
2006-11-13 4:53 ` Sam Vaughan
@ 2006-11-13 5:20 ` Stewart Smith
2006-11-14 0:04 ` Sam Vaughan
0 siblings, 1 reply; 9+ messages in thread
From: Stewart Smith @ 2006-11-13 5:20 UTC (permalink / raw)
To: Sam Vaughan; +Cc: xfs
[-- Attachment #1: Type: text/plain, Size: 4850 bytes --]
On Mon, 2006-11-13 at 15:53 +1100, Sam Vaughan wrote:
> Just to be clear, are we talking about intra-file fragmentation, i.e.
> file data laid out discontiguously on disk, or inter-file
> fragmentation where each file is contiguous on disk but the files
> from different processes are getting interleaved? Also, are there
> just a couple of user data files, each of them potentially much
> larger than the size of an AG, or do you split the data up into many
> files, e.g. datafile01.dat ... datafile99.dat ...?
an example:
/home/mysql/cluster/ndb_1_fs/datafile1.dat:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
0: [0..63]: 32862376..32862439 8 (1405096..1405159) 64
1: [64..127]: 32875992..32876055 8 (1418712..1418775) 64
2: [128..191]: 33040112..33040175 8 (1582832..1582895) 64
3: [192..255]: 33080136..33080199 8 (1622856..1622919) 64
4: [256..319]: 33101416..33101479 8 (1644136..1644199) 64
5: [320..383]: 33112624..33112687 8 (1655344..1655407) 64
6: [384..447]: 32526608..32526671 8 (1069328..1069391) 64
7: [448..511]: 31678920..31678983 8 (221640..221703) 64
/home/mysql/cluster/ndb_2_fs/datafile1.dat:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
0: [0..63]: 32864704..32864767 8 (1407424..1407487) 64
1: [64..127]: 32888544..32888607 8 (1431264..1431327) 64
2: [128..191]: 33068832..33068895 8 (1611552..1611615) 64
3: [192..255]: 33101168..33101231 8 (1643888..1643951) 64
4: [256..319]: 33101656..33101719 8 (1644376..1644439) 64
5: [320..383]: 33115784..33115847 8 (1658504..1658567) 64
6: [384..447]: 33897200..33897263 8 (2439920..2439983) 64
7: [448..511]: 33900896..33900959 8 (2443616..2443679) 64
on this fs:
isize=256 agcount=32, agsize=491520 blks
         =          sectsz=512 attr=0
data     =          bsize=4096 blocks=15728640, imaxpct=25
         =          sunit=0 swidth=0 blks, unwritten=1
naming   =version 2 bsize=4096
log      =internal  bsize=4096 blocks=3840, version=1
         =          sectsz=512 sunit=0 blks
realtime =none      extsz=65536 blocks=0, rtextents=0
(somewhere between 5-15GB free at the time of this create, IIRC)
these datafiles are fixed size, allocated by user. a DBA would run from
the SQL server something like:
CREATE TABLESPACE ts1
ADD DATAFILE 'datafile.dat'
USE LOGFILE GROUP lg1
INITIAL_SIZE 1G
ENGINE NDB;
to get a tablespace with 1GB data file (on each node).
we currently don't do any automatic extending.
> If you have the flexibility to break the data up at arbitrary points
> into separate files, you could get optimal allocation behaviour by
> starting a new directory as soon as the files in the current one are
> large enough to fill an AG. The problem with the filestreams
> allocator is that it will only dedicate an AG to a directory for a
> fixed and short period of time after the last file was written to
> it. This works well to limit the resource drain on AGs when running
> file-per-frame video captures, but not so well with a database that
> writes its data in a far less regimented and timely way.
for the data and undo files, we're just not changing their size except
at creation time, so that's okay.
> Now in your case you're using different directories, so your files
> are probably OK at the start of day. Once the AGs they start in fill
> up though, the files for both processes will start getting allocated
> from the next available AG. At that point, allocations that started
> out looking like the first test above will end up looking like the
> second.
>
> The filestreams allocator will stop this from happening for
> applications that write data regularly like video ingest servers, but
> I wouldn't expect it to be a cure-all for your database app because
> your writes could have large delays between them. Instead, I'd look
> into ways to break up your data into AG-sized chunks, starting a new
> directory every time you go over that magic size.
I'll have to check our writing behaviour for the files that change
sizes... but they're not too much of an issue (they're hardly ever read
back, so as long as writing them out is okay and reading isn't totally
abysmal, we don't have to worry).
--
Stewart Smith, Software Engineer
MySQL AB, www.mysql.com
Office: +14082136540 Ext: 6616
VoIP: 6616@sip.us.mysql.com
Mobile: +61 4 3 8844 332
Jumpstart your cluster:
http://www.mysql.com/consulting/packaged/cluster.html
* Re: XFS_IOC_RESVSP64 versus XFS_IOC_ALLOCSP64 with multiple threads
2006-11-13 5:20 ` Stewart Smith
@ 2006-11-14 0:04 ` Sam Vaughan
2006-11-14 0:25 ` Chris Wedgwood
2006-11-27 5:55 ` Stewart Smith
0 siblings, 2 replies; 9+ messages in thread
From: Sam Vaughan @ 2006-11-14 0:04 UTC (permalink / raw)
To: Stewart Smith; +Cc: xfs
On 13/11/2006, at 4:20 PM, Stewart Smith wrote:
> On Mon, 2006-11-13 at 15:53 +1100, Sam Vaughan wrote:
>> Just to be clear, are we talking about intra-file fragmentation, i.e.
>> file data laid out discontiguously on disk, or inter-file
>> fragmentation where each file is contiguous on disk but the files
>> from different processes are getting interleaved? Also, are there
>> just a couple of user data files, each of them potentially much
>> larger than the size of an AG, or do you split the data up into many
>> files, e.g. datafile01.dat ... datafile99.dat ...?
>
> an example:
>
> /home/mysql/cluster/ndb_1_fs/datafile1.dat:
> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
> 0: [0..63]: 32862376..32862439 8 (1405096..1405159) 64
> 1: [64..127]: 32875992..32876055 8 (1418712..1418775) 64
> 2: [128..191]: 33040112..33040175 8 (1582832..1582895) 64
> 3: [192..255]: 33080136..33080199 8 (1622856..1622919) 64
> 4: [256..319]: 33101416..33101479 8 (1644136..1644199) 64
> 5: [320..383]: 33112624..33112687 8 (1655344..1655407) 64
> 6: [384..447]: 32526608..32526671 8 (1069328..1069391) 64
> 7: [448..511]: 31678920..31678983 8 (221640..221703) 64
> /home/mysql/cluster/ndb_2_fs/datafile1.dat:
> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
> 0: [0..63]: 32864704..32864767 8 (1407424..1407487) 64
> 1: [64..127]: 32888544..32888607 8 (1431264..1431327) 64
> 2: [128..191]: 33068832..33068895 8 (1611552..1611615) 64
> 3: [192..255]: 33101168..33101231 8 (1643888..1643951) 64
> 4: [256..319]: 33101656..33101719 8 (1644376..1644439) 64
> 5: [320..383]: 33115784..33115847 8 (1658504..1658567) 64
> 6: [384..447]: 33897200..33897263 8 (2439920..2439983) 64
> 7: [448..511]: 33900896..33900959 8 (2443616..2443679) 64
Those extents are curiously uniform, all 32kB in size. The fact that
both files' extents are in AG 8 suggests that the two directories
ndb_1_fs and ndb_2_fs filled their original AGs and spilled out into
other ones, which is when the interference would have started.
Looking at the directory hierarchy in your last email, you might be
better off if you could add another directory for the datafiles and
undofiles to live in, so they don't end up sharing their AG with
other stuff in their parent directory.
> on this fs:
> isize=256 agcount=32, agsize=491520 blks
>          =          sectsz=512 attr=0
> data     =          bsize=4096 blocks=15728640, imaxpct=25
>          =          sunit=0 swidth=0 blks, unwritten=1
> naming   =version 2 bsize=4096
> log      =internal  bsize=4096 blocks=3840, version=1
>          =          sectsz=512 sunit=0 blks
> realtime =none      extsz=65536 blocks=0, rtextents=0
OK, so you've got 32 2GB AGs, and the filesystem is much too small
for the inode32 rotor to be involved.
> (somewhere between 5-15Gb free from this create IIRC)
>
> these datafiles are fixed size, allocated by user. a DBA would run
> from
> the SQL server something like:
> CREATE TABLESPACE ts1
> ADD DATAFILE 'datafile.dat'
> USE LOGFILE GROUP lg1
> INITIAL_SIZE 1G
> ENGINE NDB;
>
> to get a tablespace with 1GB data file (on each node).
So your data file is half the size of an AG. That shouldn't be a
problem but it'd be best to keep it to one or two of these files per
directory if there's going to be much other concurrent allocation
activity.
> we currently don't do any automatic extending.
>
>> If you have the flexibility to break the data up at arbitrary points
>> into separate files, you could get optimal allocation behaviour by
>> starting a new directory as soon as the files in the current one are
>> large enough to fill an AG. The problem with the filestreams
>> allocator is that it will only dedicate an AG to a directory for a
>> fixed and short period of time after the last file was written to
>> it. This works well to limit the resource drain on AGs when running
>> file-per-frame video captures, but not so well with a database that
>> writes its data in a far less regimented and timely way.
>
> for the data and undo files, we're just not changing their size except
> at creation time, so that's okay.
I'd assumed that these files were being continually grown. If all
this is happening at creation time then it shouldn't be too hard to
make sure the files are cleanly allocated with just one extent. Does
the following not work on your file system?
$ touch a b
$ for file in a b; do
> xfs_io -c 'allocsp 1G 0' $file &
> done; wait
[1] 12312
[2] 12313
[1]- Done xfs_io -c 'allocsp 1G 0' $file
[2]+ Done xfs_io -c 'allocsp 1G 0' $file
$ xfs_bmap -v a b
a:
EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET             TOTAL
0: [0..2097151]:      231732008..233829159    6 (11968856..14066007)  2097152
b:
EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET             TOTAL
0: [0..2097151]:      233829160..235926311    6 (14066008..16163159)  2097152
$
>> Now in your case you're using different directories, so your files
>> are probably OK at the start of day. Once the AGs they start in fill
>> up though, the files for both processes will start getting allocated
>> from the next available AG. At that point, allocations that started
>> out looking like the first test above will end up looking like the
>> second.
>>
>> The filestreams allocator will stop this from happening for
>> applications that write data regularly like video ingest servers, but
>> I wouldn't expect it to be a cure-all for your database app because
>> your writes could have large delays between them. Instead, I'd look
>> into ways to break up your data into AG-sized chunks, starting a new
>> directory every time you go over that magic size.
>
> I'll have to check our writing behaviour for the files that change
> sizes... but they're not too much of an issue (they're hardly ever read
> back, so as long as writing them out is okay and reading isn't totally
> abysmal, we don't have to worry).
That's handy. All in all it sounds like your requirements are very
file system friendly in terms of getting optimum allocation. I'm not
sure what could be causing all those 32kB extents.
Sam
* Re: XFS_IOC_RESVSP64 versus XFS_IOC_ALLOCSP64 with multiple threads
2006-11-14 0:04 ` Sam Vaughan
@ 2006-11-14 0:25 ` Chris Wedgwood
2006-11-14 0:31 ` Sam Vaughan
2006-11-27 5:55 ` Stewart Smith
1 sibling, 1 reply; 9+ messages in thread
From: Chris Wedgwood @ 2006-11-14 0:25 UTC (permalink / raw)
To: Sam Vaughan; +Cc: Stewart Smith, xfs
On Tue, Nov 14, 2006 at 11:04:17AM +1100, Sam Vaughan wrote:
> Those extents are curiously uniform, all 32kB in size.
O_SYNC writes?
* Re: XFS_IOC_RESVSP64 versus XFS_IOC_ALLOCSP64 with multiple threads
2006-11-14 0:25 ` Chris Wedgwood
@ 2006-11-14 0:31 ` Sam Vaughan
2006-11-14 0:37 ` Sam Vaughan
0 siblings, 1 reply; 9+ messages in thread
From: Sam Vaughan @ 2006-11-14 0:31 UTC (permalink / raw)
To: Chris Wedgwood; +Cc: Stewart Smith, xfs
On 14/11/2006, at 11:25 AM, Chris Wedgwood wrote:
> On Tue, Nov 14, 2006 at 11:04:17AM +1100, Sam Vaughan wrote:
>
>> Those extents are curiously uniform, all 32kB in size.
>
> O_SYNC writes?
I'm assuming from Stuart's original email that these files weren't
written out with write(), but instead pre-allocated using allocsp:
> So, this would lead me to try XFS_IOC_ALLOCSP64 - which doesn't
> have the
> "unwritten extents" warning that RESVSP64 does. However, with the two
> processes writing the files out, I get heavy fragmentation. Even
> with a
> RESVSP followed by ALLOCSP I get the same result.
* Re: XFS_IOC_RESVSP64 versus XFS_IOC_ALLOCSP64 with multiple threads
2006-11-14 0:31 ` Sam Vaughan
@ 2006-11-14 0:37 ` Sam Vaughan
0 siblings, 0 replies; 9+ messages in thread
From: Sam Vaughan @ 2006-11-14 0:37 UTC (permalink / raw)
To: Stewart Smith; +Cc: xfs
On 14/11/2006, at 11:31 AM, Sam Vaughan wrote:
> I'm assuming from Stuart's original email
Oops. s/Stuart/Stewart/
* Re: XFS_IOC_RESVSP64 versus XFS_IOC_ALLOCSP64 with multiple threads
2006-11-14 0:04 ` Sam Vaughan
2006-11-14 0:25 ` Chris Wedgwood
@ 2006-11-27 5:55 ` Stewart Smith
1 sibling, 0 replies; 9+ messages in thread
From: Stewart Smith @ 2006-11-27 5:55 UTC (permalink / raw)
To: Sam Vaughan; +Cc: xfs
[-- Attachment #1: Type: text/plain, Size: 3662 bytes --]
On Tue, 2006-11-14 at 11:04 +1100, Sam Vaughan wrote:
> Those extents are curiously uniform, all 32kB in size. The fact that
> both files' extents are in AG 8 suggests that the two directories
> ndb_1_fs and ndb_2_fs filled their original AGs and spilled out into
> other ones, which is when the interference would have started.
> Looking at the directory hierarchy in your last email, you might be
> better off if you could add another directory for the datafiles and
> undofiles to live in, so they don't end up sharing their AG with
> other stuff in their parent directory.
I think this is typically what the QA guys do (to help keep their sanity
if anything). Perhaps we should have this in our "best practice"
documentation as well...
> > for the data and undo files, we're just not changing their size except
> > at creation time, so that's okay.
>
> I'd assumed that these files were being continually grown. If all
> this is happening at creation time then it shouldn't be too hard to
> make sure the files are cleanly allocated with just one extent. Does
> the following not work on your file system?
>
> $ touch a b
> $ for file in a b; do
> > xfs_io -c 'allocsp 1G 0' $file &
> > done; wait
> [1] 12312
> [2] 12313
> [1]- Done xfs_io -c 'allocsp 1G 0' $file
> [2]+ Done xfs_io -c 'allocsp 1G 0' $file
> $ xfs_bmap -v a b
> a:
> EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET             TOTAL
> 0: [0..2097151]:      231732008..233829159    6 (11968856..14066007)  2097152
> b:
> EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET             TOTAL
> 0: [0..2097151]:      233829160..235926311    6 (14066008..16163159)  2097152
> $
That works fine on my file systems (or, on my rather full and well-used
/home, as well as it can).
We're opening the files with O_DIRECT (or, if not available or it fails,
O_SYNC).
> >> Now in your case you're using different directories, so your files
> >> are probably OK at the start of day. Once the AGs they start in fill
> >> up though, the files for both processes will start getting allocated
> >> from the next available AG. At that point, allocations that started
> >> out looking like the first test above will end up looking like the
> >> second.
> >>
> >> The filestreams allocator will stop this from happening for
> >> applications that write data regularly like video ingest servers, but
> >> I wouldn't expect it to be a cure-all for your database app because
> >> your writes could have large delays between them. Instead, I'd look
> >> into ways to break up your data into AG-sized chunks, starting a new
> >> directory every time you go over that magic size.
> >
> > I'll have to check our writing behaviour for the files that change
> > sizes... but they're not too much of an issue (they're hardly ever
> > read back, so as long as writing them out is okay and reading isn't
> > totally abysmal, we don't have to worry).
>
> That's handy. All in all it sounds like your requirements are very
> file system friendly in terms of getting optimum allocation. I'm not
> sure what could be causing all those 32kB extents.
Perhaps being flushed out due to VM pressure? But with O_DIRECT/O_SYNC
that shouldn't be the case, right? Or perhaps *because* of
O_DIRECT/O_SYNC?
--
Stewart Smith, Software Engineer
MySQL AB, www.mysql.com
Office: +14082136540 Ext: 6616
VoIP: 6616@sip.us.mysql.com
Mobile: +61 4 3 8844 332
Jumpstart your cluster:
http://www.mysql.com/consulting/packaged/cluster.html
2006-11-13 1:33 XFS_IOC_RESVSP64 versus XFS_IOC_ALLOCSP64 with multiple threads Stewart Smith
[not found] ` <965ECEF2-971D-46A1-B3F2-C6C1860C9ED8@sgi.com>
2006-11-13 4:09 ` Stewart Smith
2006-11-13 4:53 ` Sam Vaughan
2006-11-13 5:20 ` Stewart Smith
2006-11-14 0:04 ` Sam Vaughan
2006-11-14 0:25 ` Chris Wedgwood
2006-11-14 0:31 ` Sam Vaughan
2006-11-14 0:37 ` Sam Vaughan
2006-11-27 5:55 ` Stewart Smith