* XFS peculiar behavior

From: Yannis Klonatos @ 2010-06-23 7:37 UTC
To: xfs

Hi all!
    I have come across the following peculiar behavior in XFS, and I
would appreciate any information anyone could provide.

    In our lab we have a system with twelve 500 GByte hard disks
(6 TByte total capacity), connected to an Areca (ARC-1680D-IX-12) SAS
storage controller. The disks are configured as a RAID-0 device. I then
create a clean XFS filesystem on top of the RAID volume, using the
whole capacity. We use this test setup to measure performance
improvement for a TPC-H experiment. We copy the database onto the clean
XFS filesystem using the cp utility. The database used in our
experiments is 56 GBytes in size (data + indices).

    The problem is that I have noticed that XFS may - though not
always - split a table over a large on-disk distance. For example, in
one run I noticed that a 13 GByte file was spread over a 4.7 TByte
distance (I calculate this distance by subtracting the first disk block
used by the file from the last one; the two block values are obtained
with the FIBMAP ioctl).
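
For clarity, a minimal sketch of this measurement - illustration only,
with no error handling, and note that the FIBMAP ioctl needs root -
looks roughly like this:

/*
 * Map a file's first and last logical blocks to physical block numbers
 * with FIBMAP and report the span between them.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <linux/fs.h>                   /* FIBMAP, FIGETBSZ */

int main(int argc, char **argv)
{
        int fd = open(argv[1], O_RDONLY);
        struct stat st;
        int blksz = 0;

        fstat(fd, &st);
        ioctl(fd, FIGETBSZ, &blksz);    /* filesystem block size */

        int nblocks = (st.st_size + blksz - 1) / blksz;
        int first = 0;                  /* logical block in ...   */
        int last  = nblocks - 1;

        ioctl(fd, FIBMAP, &first);      /* ... physical block out */
        ioctl(fd, FIBMAP, &last);

        printf("first phys block %d, last phys block %d, span ~%.1f GB\n",
               first, last, (double)(last - first) * blksz / 1e9);

        close(fd);
        return 0;
}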

    Is there some reasoning behind this (peculiar) behavior? I would
expect that, since the underlying storage is so large and the dataset
so small, XFS would try to minimize disk seeks and thus place the file
sequentially on disk. Furthermore, I understand that some blocks may be
left unused by XFS between subsequent file blocks in order to handle
any write appends that may come afterward, but I wouldn't expect such a
wide split of a single file.

    Any help would be appreciated.

Thanks in advance,
Yannis Klonatos

* Re: XFS peculiar behavior

From: Michael Monnerie @ 2010-06-23 10:16 UTC
To: xfs

On Wednesday, 23 June 2010, Yannis Klonatos wrote:
> The problem is that I have noticed that XFS may - though not
> always - split a table over a large on-disk distance.

Interesting. I have no idea why this happens and would be interested in
an investigation too. As a quick help, maybe using allocsize=1G in the
XFS mount options would help?

How many AGs does the filesystem have? And which version of mkfs.xfs do
you use? Newer xfsprogs create the filesystem with 4 AGs by default;
maybe that influences the allocation order?

-- 
With kind regards,
Michael Monnerie, Ing. BSc

it-management Internet Services
http://proteger.at [pronounced: Prot-e-schee]
Tel: 0660 / 415 65 31

// We currently have two houses for sale:
// http://zmi.at/langegg/
// http://zmi.at/haus2009/

* Re: XFS peculiar behavior

From: Andi Kleen @ 2010-06-23 10:24 UTC
To: Yannis Klonatos; Cc: xfs

Yannis Klonatos <klonatos@ics.forth.gr> writes:

> The problem is that I have noticed that XFS may - though not

Why is that a problem?

> always - split a table over a large on-disk distance. For example, in
> one run I noticed that a 13 GByte file was spread over a 4.7 TByte
> distance (I calculate this distance by subtracting the first disk
> block used by the file from the last one; the two block values are
> obtained with the FIBMAP ioctl).

I don't know if it's the only reason, but XFS does a lot of data
structure locking and updates per allocation group, so spreading to
multiple AGs gives better scalability to many CPUs.

Also I suppose it's good to avoid hot spots on the underlying device.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

* Re: XFS peculiar behavior

From: Michael Monnerie @ 2010-06-23 15:04 UTC
To: xfs

On Wednesday, 23 June 2010, Andi Kleen wrote:
> I don't know if it's the only reason, but XFS does a lot of data
> structure locking and updates per allocation group, so spreading
> to multiple AGs gives better scalability to many CPUs.

This only helps if there are metadata operations, right? So in the case
where you have one big database "file" of 50GB, it should be laid out
sector-by-sector to get the maximum performance and minimize disk-head
movement. And I don't believe XFS would scatter a single big file
across several AGs; as far as I know, even all files within the same
dir are grouped within a single AG. The AG "scattering" is done for
separate dirs only.

> Also I suppose it's good to avoid hot spots on the underlying device.

A database file stays in the same place "forever" and is overwritten
all the time, so the "hot spot" case doesn't matter here.

-- 
With kind regards,
Michael Monnerie, Ing. BSc

it-management Internet Services
http://proteger.at [pronounced: Prot-e-schee]
Tel: 0660 / 415 65 31

// We currently have two houses for sale:
// http://zmi.at/langegg/
// http://zmi.at/haus2009/

* Re: XFS peculiar behavior

From: Eric Sandeen @ 2010-06-23 16:21 UTC
To: Yannis Klonatos; Cc: xfs

Yannis Klonatos wrote:
> Hi all!
>
>     I have come across the following peculiar behavior in XFS, and I
> would appreciate any information anyone could provide.
>
>     In our lab we have a system with twelve 500 GByte hard disks
> (6 TByte total capacity), connected to an Areca (ARC-1680D-IX-12) SAS
> storage controller. The disks are configured as a RAID-0 device. I
> then create a clean XFS filesystem on top of the RAID volume, using
> the whole capacity. We use this test setup to measure performance
> improvement for a TPC-H experiment. We copy the database onto the
> clean XFS filesystem using the cp utility. The database used in our
> experiments is 56 GBytes in size (data + indices).
>
>     The problem is that I have noticed that XFS may - though not
> always - split a table over a large on-disk distance. For example, in
> one run I noticed that a 13 GByte file was spread over a 4.7 TByte
> distance (I calculate this distance by subtracting the first disk
> block used by the file from the last one; the two block values are
> obtained with the FIBMAP ioctl).

xfs_bmap output would be a lot nicer. Maybe you can paste that here to
show exactly what the layout is.

-Eric

* Re: XFS peculiar behavior

From: Dave Chinner @ 2010-06-23 23:17 UTC
To: Yannis Klonatos; Cc: xfs

On Wed, Jun 23, 2010 at 10:37:19AM +0300, Yannis Klonatos wrote:
> Hi all!
>
>     I have come across the following peculiar behavior in XFS, and I
> would appreciate any information anyone could provide.
>
>     In our lab we have a system with twelve 500 GByte hard disks
> (6 TByte total capacity), connected to an Areca (ARC-1680D-IX-12) SAS
> storage controller. The disks are configured as a RAID-0 device. I
> then create a clean XFS filesystem on top of the RAID volume, using
> the whole capacity. We use this test setup to measure performance
> improvement for a TPC-H experiment. We copy the database onto the
> clean XFS filesystem using the cp utility. The database used in our
> experiments is 56 GBytes in size (data + indices).
>
>     The problem is that I have noticed that XFS may - though not
> always - split a table over a large on-disk distance. For example, in
> one run I noticed that a 13 GByte file was spread over a 4.7 TByte
> distance (I calculate this distance by subtracting the first disk
> block used by the file from the last one; the two block values are
> obtained with the FIBMAP ioctl).
>
>     Is there some reasoning behind this (peculiar) behavior? I would
> expect that, since the underlying storage is so large and the dataset
> so small, XFS would try to minimize disk seeks and thus place the
> file sequentially on disk. Furthermore, I understand that some blocks
> may be left unused by XFS between subsequent file blocks in order to
> handle any write appends that may come afterward, but I wouldn't
> expect such a wide split of a single file.
>
>     Any help would be appreciated.

The reasons for it being split are wide and varied. We need more
information before trying to determine the reason.

The output of "xfs_info <mntpt>" will tell us your filesystem geometry,
and the output of "xfs_bmap <split file>" will tell us exactly how it
was laid out on disk. These are needed to see exactly what the problem
is.

Did you copy the file alone, with others, or while there were other
write operations going on in the background? Was it a pristine
filesystem that you copied it to? If so, what was the directory
structure created before/by the copy?

Also, the kernel version you are running and the version of xfsprogs
you have installed (xfs_info -V) will help us determine if you are
tripping any known bugs...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: XFS peculiar behavior

From: Yannis Klonatos @ 2010-06-24 14:11 UTC
To: Dave Chinner; Cc: andi, sandeen, xfs

Hello again,

    First of all, thank you all for your quick replies. I attach all
the information you requested in your responses.

1) The output of xfs_info is the following:

meta-data=/dev/sdf         isize=256    agcount=32, agsize=45776328 blks
         =                 sectsz=512   attr=0
data     =                 bsize=4096   blocks=1464842496, imaxpct=25
         =                 sunit=0      swidth=0 blks, unwritten=1
naming   =version 2        bsize=4096
log      =internal         bsize=4096   blocks=32768, version=1
         =                 sectsz=512   sunit=0 blks, lazy-count=0
realtime =none             extsz=4096   blocks=0, rtextents=0

2) The output of xfs_bmap on the lineitem.MYI table of the TPC-H
workload for one run is:

/mnt/test/mysql/tpch/lineitem.MYI:
 EXT: FILE-OFFSET            BLOCK-RANGE                AG  AG-OFFSET        TOTAL
   0: [0..6344271]:          11352529416..11358873687   31  (72..6344343)    6344272
   1: [6344272..10901343]:   1464842608..1469399679      4  (112..4557183)   4557072
   2: [10901344..18439199]:  1831053200..1838591055      5  (80..7537935)    7537856
   3: [18439200..25311519]:  2197263840..2204136159      6  (96..6872415)    6872320
   4: [25311520..26660095]:  2563474464..2564823039      7  (96..1348671)    1348576

Given that all disk block numbers are in units of 512-byte blocks, if I
interpret the output correctly the file's first block is at block
1465352792 = 698.4 GByte offset and its last block is at a 5421.1 GByte
offset, meaning that this specific table is spread over a 4.7 TByte
distance.

However, in another run (with a clean file system again):

/mnt/test/mysql/tpch/lineitem.MYI:
 EXT: FILE-OFFSET         BLOCK-RANGE                AG  AG-OFFSET         TOTAL
   0: [0..26660095]:      11352529416..11379189511   31  (72..26660167)    26660096

3) For the copy, as I mentioned in my previous mail, I copied the
database over NFS using the Linux cp -R program. Thus, I believe all
the files are copied sequentially, one after the other, with no other
concurrent write operations running in the background. The filesystem
was pristine before the cp, with no files, and just the mount directory
was created (all the other necessary files and directories are created
by the cp program).

4) The version of xfsprogs is 2.9.4 (acquired with xfs_info -V) and the
version of the kernel is 2.6.18-164.11.1.el5.

    If you require any further information let me know. I can also
provide you with the complete data set if you feel it is necessary for
reproducing the issue.

Thanks,
Yannis Klonatos
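
For reference, the BLOCK-RANGE values printed by xfs_bmap are 512-byte
basic blocks, so the offsets above can be re-derived directly from the
extent list. A trivial check using the boundary blocks of the first run
(the figures in the mail were taken from FIBMAP, so they differ
slightly):

/* Convert xfs_bmap basic-block numbers (512-byte units) into offsets. */
#include <stdio.h>

int main(void)
{
        unsigned long long lowest  = 1464842608ULL;   /* start of extent 1 */
        unsigned long long highest = 11358873687ULL;  /* end of extent 0   */

        printf("lowest  ~ %.1f GiB\n", lowest  * 512.0 / (1ULL << 30));
        printf("highest ~ %.1f GiB\n", highest * 512.0 / (1ULL << 30));
        printf("span    ~ %.2f TiB\n",
               (highest - lowest) * 512.0 / (1ULL << 40));
        return 0;
}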

* Re: XFS peculiar behavior

From: Eric Sandeen @ 2010-06-24 15:21 UTC
To: Yannis Klonatos; Cc: andi, xfs

On 06/24/2010 09:11 AM, Yannis Klonatos wrote:
> 2) The output of xfs_bmap on the lineitem.MYI table of the TPC-H
> workload for one run is:
>
> /mnt/test/mysql/tpch/lineitem.MYI:
>  EXT: FILE-OFFSET            BLOCK-RANGE                AG  AG-OFFSET        TOTAL
>    0: [0..6344271]:          11352529416..11358873687   31  (72..6344343)    6344272
>    1: [6344272..10901343]:   1464842608..1469399679      4  (112..4557183)   4557072
>    2: [10901344..18439199]:  1831053200..1838591055      5  (80..7537935)    7537856
>    3: [18439200..25311519]:  2197263840..2204136159      6  (96..6872415)    6872320
>    4: [25311520..26660095]:  2563474464..2564823039      7  (96..1348671)    1348576
>
> Given that all disk block numbers are in units of 512-byte blocks, if
> I interpret the output correctly the file's first block is at block
> 1465352792 = 698.4 GByte offset and its last block is at a 5421.1
> GByte offset, meaning that this specific table is spread over a
> 4.7 TByte distance.

The file started out in the last AG, and then had to wrap around,
because it hit the end of the filesystem. :) It was then somewhat
sequential in AGs 4,5,6,7 after that, though not perfectly so.

This run was with a clean filesystem? Was the mountpoint /mnt/test?
XFS distributes new directories into new AGs (allocation groups, or
disk regions) for parallelism, and then files in those dirs start
populating the same AG. So if /mnt/test/mysql/tpch ended up in the
last AG (#31) then the file likely started there, too.

Also, the "inode32" allocator biases data towards the end of the
filesystem, because inode numbers in xfs reflect their on-disk
location, and to keep inode numbers below 2^32 it must save space in
the lower portions of the filesystem. You might want to re-test with a
fresh filesystem mounted with the "inode64" mount option.

> However, in another run (with a clean file system again):
>
> /mnt/test/mysql/tpch/lineitem.MYI:
>  EXT: FILE-OFFSET         BLOCK-RANGE                AG  AG-OFFSET         TOTAL
>    0: [0..26660095]:      11352529416..11379189511   31  (72..26660167)    26660096

Hmm.

> 3) For the copy, as I mentioned in my previous mail, I copied the
> database over NFS using the Linux cp -R program. Thus, I believe all
> the files are copied sequentially, one after the other, with no other
> concurrent write operations running in the background. The filesystem
> was pristine before the cp, with no files, and just the mount
> directory was created (all the other necessary files and directories
> are created by the cp program).

IIRC, copies over NFS can affect xfs allocator performance, because
(IIRC) it tends to close the filehandle periodically and xfs loses the
allocator context. We used to have a filehandle cache which held them
open, but that went away some time ago.

Dave will probably correct significant swaths of this information for
me, though ;)

> 4) The version of xfsprogs is 2.9.4 (acquired with xfs_info -V) and
> the version of the kernel is 2.6.18-164.11.1.el5.

Ah! A Red Hat kernel; have you asked your Red Hat support folks for
help on this issue?

-Eric
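
As an aside on the inode64 suggestion: a minimal sketch of the re-test
using the mount(2) syscall, with the device and mountpoint taken from
this thread. In practice one would simply pass -o inode64 on the mount
command line or add it to /etc/fstab.

/* Mount the XFS volume with the inode64 allocator enabled. */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
        /* source, target and fstype as used elsewhere in this thread */
        if (mount("/dev/sdf", "/mnt/test", "xfs", 0, "inode64") != 0) {
                perror("mount");
                return 1;
        }
        printf("mounted /dev/sdf on /mnt/test with inode64\n");
        return 0;
}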

* Re: XFS peculiar behavior

From: Yannis Klonatos @ 2010-06-24 15:35 UTC
To: Eric Sandeen; Cc: andi, xfs

On 6/24/2010 6:21 PM, Eric Sandeen wrote:
> The file started out in the last AG, and then had to wrap around,
> because it hit the end of the filesystem. :) It was then somewhat
> sequential in AGs 4,5,6,7 after that, though not perfectly so.
>
> This run was with a clean filesystem? Was the mountpoint /mnt/test?
> XFS distributes new directories into new AGs (allocation groups, or
> disk regions) for parallelism, and then files in those dirs start
> populating the same AG. So if /mnt/test/mysql/tpch ended up in the
> last AG (#31) then the file likely started there, too.

OK, your argument makes a lot of sense. However, this is a clean file
system (mount point /mnt/test), and I am certain that the files copied
before the aforementioned index file (lineitem.MYI) require 28 GByte of
space in total. So this still raises the question of why XFS split
these files in a way that caused the whole filesystem space to be
"covered", and the lineitem file to be placed starting at the end of
the FS (as you mentioned).

Also, based on my little XFS knowledge and background, I seriously
doubt that parallelism across AGs is an issue here, since the copy
utility copies files sequentially: a new AG would be allocated for the
/mnt/test/mysql/tpch directory and would be populated completely with
all its files before another AG was used. This is true, of course, only
if your observation holds.

> Ah! A Red Hat kernel; have you asked your Red Hat support folks for
> help on this issue?

I suppose that they will redirect me back to you, won't they? :-)

* Re: XFS peculiar behavior

From: Dave Chinner @ 2010-06-25 0:58 UTC
To: Yannis Klonatos; Cc: andi, Eric Sandeen, xfs

On Thu, Jun 24, 2010 at 06:35:42PM +0300, Yannis Klonatos wrote:
> OK, your argument makes a lot of sense. However, this is a clean file
> system (mount point /mnt/test), and I am certain that the files
> copied before the aforementioned index file (lineitem.MYI) require
> 28 GByte of space in total. So this still raises the question of why
> XFS split these files in a way that caused the whole filesystem space
> to be "covered", and the lineitem file to be placed starting at the
> end of the FS (as you mentioned).

XFS spreads allocation out over its entire address space to enable
utilisation of all the disks backing the filesystem (think linear
concatenation of devices). This is sub-optimal for a small number of
spindles, but XFS is designed to scale to hundreds to thousands of
disks effectively.

> > Ah! A Red Hat kernel; have you asked your Red Hat support folks for
> > help on this issue?
>
> I suppose that they will redirect me back to you, won't they? :-)

Or me ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: XFS peculiar behavior

From: Dave Chinner @ 2010-06-25 0:46 UTC
To: Eric Sandeen; Cc: Yannis Klonatos, andi, xfs

On Thu, Jun 24, 2010 at 10:21:17AM -0500, Eric Sandeen wrote:
> The file started out in the last AG, and then had to wrap around,
> because it hit the end of the filesystem. :) It was then somewhat
> sequential in AGs 4,5,6,7 after that, though not perfectly so.
>
> This run was with a clean filesystem? Was the mountpoint /mnt/test?
> XFS distributes new directories into new AGs (allocation groups, or
> disk regions) for parallelism, and then files in those dirs start
> populating the same AG. So if /mnt/test/mysql/tpch ended up in the
> last AG (#31) then the file likely started there, too.

For inode64, yes. For inode32, the first AG is derived from the
mp->m_agfrotor and the xfs_rotorstep value. The rate at which
mp->m_agfrotor increments for each new file is controlled by the
/proc/sys/fs/xfs/rotorstep sysctl. Changing the value of the step will
likely change the first AG location of the database in this test.
Alternatively, copy the database file first so that it starts in a low
AG.

> Also, the "inode32" allocator biases data towards the end of the
> filesystem, because inode numbers in xfs reflect their on-disk
> location, and to keep inode numbers below 2^32 it must save space in
> the lower portions of the filesystem. You might want to re-test with
> a fresh filesystem mounted with the "inode64" mount option.

Or just use inode64 ;)

> IIRC, copies over NFS can affect xfs allocator performance, because
> (IIRC) it tends to close the filehandle periodically and xfs loses
> the allocator context. We used to have a filehandle cache which held
> them open, but that went away some time ago.

The filehandle cache was used in 2.4 to prevent cached inodes being
torn down when NFS stops referencing them, only to have to rebuild
them a few ms later when the next request comes in. The frequent
teardown was what caused the problems on those kernels, which was why
the cache helped prevent bad allocation patterns. That doesn't happen
in 2.6 kernels, but it has other idiosyncrasies... :)

> Dave will probably correct significant swaths of this information for
> me, though ;)

Only minor bits ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
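
As an aside on the rotorstep knob mentioned above: a minimal sketch of
reading and changing /proc/sys/fs/xfs/rotorstep. The written value of 8
is only an example; the sysctl affects the inode32 allocator discussed
above, and writing it needs root.

/* Read the current rotorstep, then write a different step value, which
 * changes how the inode32 allocator's starting AG advances as new
 * files are created. */
#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/proc/sys/fs/xfs/rotorstep", "r+");
        int step = 0;

        if (!f) {
                perror("rotorstep");
                return 1;
        }
        if (fscanf(f, "%d", &step) == 1)
                printf("current rotorstep: %d\n", step);

        rewind(f);
        fprintf(f, "%d\n", 8);
        fclose(f);
        return 0;
}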

Thread overview: 11+ messages

2010-06-23  7:37 XFS peculiar behavior  Yannis Klonatos
2010-06-23 10:16 ` Michael Monnerie
2010-06-23 10:24 ` Andi Kleen
2010-06-23 15:04 ` Michael Monnerie
2010-06-23 16:21 ` Eric Sandeen
2010-06-23 23:17 ` Dave Chinner
2010-06-24 14:11 ` Yannis Klonatos
2010-06-24 15:21 ` Eric Sandeen
2010-06-24 15:35 ` Yannis Klonatos
2010-06-25  0:58 ` Dave Chinner
2010-06-25  0:46 ` Dave Chinner