* definitions for /proc/fs/xfs/stat
From: Mark Seger @ 2013-06-14 16:37 UTC
To: xfs

I've been thinking about adding xfs support to collectl but am having a hard
time trying to decode the data in the stat file. Can someone point me to a
decoder ring? Thanks.

-mark
* Re: definitions for /proc/fs/xfs/stat
From: Nathan Scott @ 2013-06-14 22:16 UTC
To: Mark Seger; Cc: xfs

Hi Mark,

----- Original Message -----
> I've been thinking about adding xfs support to collectl but am having a
> hard time trying to decode the data in the stat file. Can someone point
> me to a decoder ring? Thanks.

Using Performance Co-Pilot, "pminfo -T xfs" will give you the answers you
seek.  [ http://oss.sgi.com/projects/pcp/ ]

cheers.

--
Nathan
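[For readers unfamiliar with PCP, a rough sketch of how pminfo can be used
here. The xfs.log.* metric names below are taken from the pmchart config
attached later in this thread; exact names can vary by PCP version:]

    $ pminfo -T xfs.log.writes    # print the descriptive help text for one metric
    $ pminfo -f xfs.log.writes    # fetch and print its current value
    $ pminfo xfs.log              # list the metrics in the xfs.log subtree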
* Re: definitions for /proc/fs/xfs/stat
From: Mark Seger @ 2013-06-14 22:37 UTC
To: Nathan Scott; Cc: xfs

Actually I have since found a decoder ring here -
http://xfs.org/index.php/Runtime_Stats - and have been incorporating a lot
of the data so I can look at things in real time. I'd still love to know
why writing 1000 1K files results in 200MB/sec of disk I/O though; clearly
something weird is going on. Both PCP and collectl (now that I have
incorporated some of the counters) are telling me the I/O is in the
logging, but why remains a mystery. Can anyone offer some suggestions?

-mark

On Fri, Jun 14, 2013 at 6:16 PM, Nathan Scott <nathans@redhat.com> wrote:
> Using Performance Co-Pilot, "pminfo -T xfs" will give you the answers
> you seek.  [ http://oss.sgi.com/projects/pcp/ ]
* Re: definitions for /proc/fs/xfs/stat
From: Nathan Scott @ 2013-06-15 0:17 UTC
To: Mark Seger; Cc: xfs

----- Original Message -----
> Actually I have since found a decoder ring here -
> http://xfs.org/index.php/Runtime_Stats - and have been incorporating a
> lot of the data so I can look at things in real time. I'd still love to
> know why

Ah, good stuff.

> writing 1000 1K files results in 200MB/sec of disk I/O though; clearly

For how many seconds? (or fractions thereof?)  You may need the level of
detail that only tracing will provide to explore further, although off the
top of my head I'm not sure exactly which events you should focus on (log,
inode creation, or space allocation at a guess).

cheers.

--
Nathan
* Re: definitions for /proc/fs/xfs/stat
From: Mark Seger @ 2013-06-15 1:55 UTC
To: Nathan Scott; Cc: xfs

I'm doing 1-second samples and the rates are very steady. The reason I
ended up at this level of testing was that I had done a sustained test for
2 minutes at about 5MB/sec and was seeing over 500MB/sec going to the disk,
again sampling at 1-second intervals. I'd be happy to provide detailed
output and can even sample more frequently if you like.

From my shorter test I was experimenting with some of the XFS data in
collectl and recorded this, if it helps at all:

segerm@az1-sw-object-0006:~$ collectl --import xfs
waiting for 1 second sample...
#<--XFS Ops--><-----------XFS Logging----------><------Extents------><------DirOps-------><----Trans---><-
# Write Reads Writes WrtKBs NoRoom Force Sleep ExtA BlkA ExtF ExtF Look Cre8 Remv Gdnt Sync Asyn Empt
      0     0      0      0      0     0     0    0    0    0    0    0    0    0    0    0    0    0
      0     0      3    768      0     3     3    0    0    0    0    0    0    0    0    3    0    0
      0     0      0      0      0     0     0    0    0    0    0    0    0    0    0    0    0    0
      0     0      0      0      0     0     0    0    0    0    0    0    0    0    0    0    0    0
      0     0      0      0      0     0     0    0    0    0    0    0    0    0    0    0    0    0
      0     0      1    256      0     1     1    0    0    0    0    0    0    0    0    1    0    0
     38    88     95  24320      0    95    96   54   54   54   54  115   76   76  154   95  473    0
    339   776    968 247816      0   968   978  484  484  479  479 1011  675  671 1351  967 4087    0
    321   748    929 237806      0   929   935  450  450  453  453  967  645  647 1287  930 3847    0
    279   637    810 207360      0   810   811  391  391  390  390  838  559  558 1118  810 3324    0
    209   482    610 156160      0   610   610  286  286  289  289  627  417  420  834  610 2451    0
      0     0      3    768      0     3     3    0    0    0    0    0    0    0    0    3    0    0

I can say for a fact I was doing about 300 writes/sec, which the write
numbers seem to support, though I don't know what the read numbers are
measuring. You can also see from the logging data that there was
~250MB/sec going to disk. Are there other numbers that are meaningful that
you'd like to see? All it takes is adding a couple of print statements;
what you're seeing above only took an hour or so to throw together. I can
collect as much (or as little) as you like, and I can even save the
complete contents of /proc/fs/xfs/stat every second in a file for later
playback.

-mark

On Fri, Jun 14, 2013 at 8:17 PM, Nathan Scott <nathans@redhat.com> wrote:
> For how many seconds? (or fractions thereof?)  You may need the level of
> detail that only tracing will provide to explore further, ....
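[A sketch of one way to capture /proc/fs/xfs/stat once a second for later
playback, as Mark describes above; the output file name is arbitrary:]

    $ while true; do date +%s; cat /proc/fs/xfs/stat; sleep 1; done > xfs-stat.log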
* Re: definitions for /proc/fs/xfs/stat
From: Dave Chinner @ 2013-06-15 2:04 UTC
To: Mark Seger; Cc: Nathan Scott, xfs

On Fri, Jun 14, 2013 at 09:55:17PM -0400, Mark Seger wrote:
> I'm doing 1-second samples and the rates are very steady. The reason I
> ended up at this level of testing was that I had done a sustained test
> for 2 minutes at about 5MB/sec and was seeing over 500MB/sec going to
> the disk, again sampling at 1-second intervals. I'd be happy to provide
> detailed output and can even sample more frequently if you like.

Where are you getting your IO throughput numbers from?

How do they compare to, say, the output of `iostat -d -x -m 1`?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: definitions for /proc/fs/xfs/stat
From: Mark Seger @ 2013-06-15 10:35 UTC
To: Dave Chinner; Cc: Nathan Scott, xfs

Basically everything I do is with collectl, a tool I wrote and open-sourced
almost 10 years ago. Its numbers are very accurate - I've compared with
iostat on numerous occasions whenever I might have had doubts, and they
always agree. Since both tools get their data from the same place,
/proc/diskstats, it's hard for them not to agree, AND its numbers also
agree with /proc/fs/xfs.

Here's an example of comparing the two on a short run, leaving off the -m
since collectl reports its output in KB:

Device:  rrqm/s wrqm/s    r/s    w/s  rkB/s     wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdc        0.00   0.00   0.00 494.00   0.00 126464.00   512.00     0.11  0.22    0.00    0.22  0.22 11.00

#        <---------reads---------><---------writes---------><--------averages--------> Pct
#Time    Name KBytes Merged IOs Size KBytes Merged  IOs Size RWSize QLen Wait SvcTim Util
10:18:32 sdc1      0      0   0    0 127488      0  498  256    256    1    0      0    7
10:18:33 sdc1      0      0   0    0 118784      0  464  256    256    1    0      0    4

For grins I also ran a set of numbers at a monitoring interval of 0.2
seconds just to see if they were steady, and they are:

#            <---------reads---------><---------writes---------><--------averages--------> Pct
#Time        Name KBytes Merged IOs Size KBytes Merged  IOs Size RWSize QLen Wait SvcTim Util
10:19:50.601 sdc1      0      0   0    0    768      0    3  256    256    0    0      0    0
10:19:50.801 sdc1      0      0   0    0  23296      0   91  256    256    1    0      0   19
10:19:51.001 sdc1      0      0   0    0  32256      0  126  256    256    1    0      0   14
10:19:51.201 sdc1      0      0   0    0  29696      0  116  256    256    1    0      0   19
10:19:51.401 sdc1      0      0   0    0  30464      0  119  256    256    1    0      0    4
10:19:51.601 sdc1      0      0   0    0  32768      0  128  256    256    1    0      0   14

But back to the problem at hand: why is this happening?

To restate what's going on, I have a very simple script with which I'm
duplicating what OpenStack Swift is doing, namely creating a file with
mkstemp and then running fallocate against it. The files are being created
with a size of zero, but it seems that XFS is generating a ton of logging
activity. I had read your post from back in 2011 about speculative
preallocation and can't help but wonder if that's what's hitting me here.
I also saw where system memory can come into play, and this box has 192GB
and 12 hyperthreaded cores.
I also tried one more run without fallocate, creating 10000 1K files,
which should be about 10MB in total, and it looks like it's still doing
140MB of I/O, which still feels like a lot but at least it's less than the
fallocate case:

#        <---------reads---------><---------writes---------><--------averages--------> Pct
#Time    Name KBytes Merged IOs Size KBytes Merged  IOs Size RWSize QLen Wait SvcTim Util
10:29:20 sdc1      0      0   0    0  89608      0  351  255    255    1    0      0   11
10:29:21 sdc1      0      0   0    0  55296      0  216  256    256    1    0      0    5

And to repeat the full run with fallocate:

# DISK STATISTICS (/sec)
#        <---------reads---------><---------writes---------><--------averages--------> Pct
#Time    Name KBytes Merged IOs Size KBytes Merged  IOs Size RWSize QLen Wait SvcTim Util
10:30:50 sdc1      0      0   0    0  56064      0  219  256    256    1    0      0    2
10:30:51 sdc1      0      0   0    0 409720    148 1622  253    252    1    0      0   26
10:30:52 sdc1      0      0   0    0 453240    144 1796  252    252    1    0      0   36
10:30:53 sdc1      0      0   0    0 441768    298 1800  245    245    1    0      0   37
10:30:54 sdc1      0      0   0    0 455576    144 1813  251    251    1    0      0   25
10:30:55 sdc1      0      0   0    0 453532    145 1805  251    251    1    0      0   35
10:30:56 sdc1      0      0   0    0 307352    145 1233  249    249    1    0      0   17
10:30:57 sdc1      0      0   0    0      0      0    0    0      0    0    0      0    0

If there is anything more I can provide I'll be happy to do so. Actually,
I should point out I can easily generate graphs, and if you'd like to see
some examples I can provide those too. Also, if there is anything I can
report from /proc/fs/xfs I can relatively easily do that as well and
display it side by side with the disk I/O.

-mark

On Fri, Jun 14, 2013 at 10:04 PM, Dave Chinner <david@fromorbit.com> wrote:
> Where are you getting your IO throughput numbers from?
>
> How do they compare to, say, the output of `iostat -d -x -m 1`?
* Re: definitions for /proc/fs/xfs/stat
From: Mark Seger @ 2013-06-15 16:22 UTC
To: Dave Chinner; Cc: Nathan Scott, xfs

I was thinking a little color commentary might be helpful on the
functionality that's driving the need for fallocate. I think I mentioned
somewhere in this thread that the application is OpenStack Swift, which is
a highly scalable cloud object store. If you're not familiar with it, it
doesn't do successive sequential writes to a preallocated file but rather
writes out a full object in one shot; in other words, object = file. The
whole purpose of preallocation, at least my understanding of it, is to make
sure there is enough room when the time comes to write the actual object,
so that if there isn't, a redundant server elsewhere can do it instead.
This makes the notion of speculative preallocation for future sequential
writes moot, the ideal being to preallocate only the object size with
minimal extra I/O.

Does that help?

-mark

On Sat, Jun 15, 2013 at 6:35 AM, Mark Seger <mjseger@gmail.com> wrote:
> ....
* Re: definitions for /proc/fs/xfs/stat
From: Dave Chinner @ 2013-06-16 0:11 UTC
To: Mark Seger; Cc: Nathan Scott, xfs

On Sat, Jun 15, 2013 at 12:22:35PM -0400, Mark Seger wrote:
> I was thinking a little color commentary might be helpful on the
> functionality that's driving the need for fallocate. I think I mentioned
> somewhere in this thread that the application is OpenStack Swift, which
> is a highly scalable cloud object store.

I'm familiar with it and the problems it causes filesystems. What
application am I talking about here, for example?

http://oss.sgi.com/pipermail/xfs/2013-June/027159.html

Basically, Swift is trying to emulate Direct IO because python doesn't
support Direct IO. Hence Swift is hacking around that problem and causing
secondary issues that would never have occurred if Direct IO was used in
the first place.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: definitions for /proc/fs/xfs/stat
From: Mark Seger @ 2013-06-16 12:58 UTC
To: Dave Chinner; Cc: Nathan Scott, xfs

OK, I have a simple reproducer. Try out the following, noting you'll
obviously have to change the directory pointed to by dname:

#!/usr/bin/python -u

import os
import sys
import time
import ctypes
import ctypes.util
from tempfile import mkstemp

# call fallocate(2) directly through the glibc wrapper
libc = ctypes.CDLL(ctypes.util.find_library('c'), use_errno=True)
falloc = getattr(libc, 'fallocate')

dname = '/srv/node/disk0/mjs'
fname = 'foo'
fsize = 1024
nfiles = 1000

body = ' ' * fsize

time0 = time.time()
for i in range(nfiles):
    file_name = '%s/%s-%s' % (dname, fname, i)
    fd, tmppath = mkstemp(dir=dname)
    falloc(fd, 1, 0, fsize)        # mode 1 == FALLOC_FL_KEEP_SIZE
    os.rename(tmppath, file_name)

elapsed = time.time() - time0
tbytes = fsize * nfiles
rate = tbytes / elapsed / 1024 / 1024

print "DName: %s" % dname
print "Bytes: %d" % (tbytes / 1024 / 1024)
print "Time:  %.2f secs" % elapsed
print "Rate:  %.2f MB/sec" % rate

When I run it I see this:

segerm@az1-sw-object-0006:~$ sudo ./falloc.py
DName: /srv/node/disk0/mjs
Bytes: 9
Time:  5.84 secs
Rate:  1.67 MB/sec

And while running collectl I see this:

#<----CPU[HYPER]-----><----------Disks-----------><----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes   KBIn  PktIn  KBOut  PktOut
   0   0   110    113      0      0      0      0      0      5      0       3
   1   0  1576   2874      0      0 170240    665      0      3      0       2
   4   3  2248   6623      0      0 406585   1596      0      1      0       1
   4   3  2145   7680      0      0 473600   1850      0      1      0       1
   2   1  2200   7406      0      0 456633   1875      0      2      0       1
   4   3  3696   7401      0      0 454606   1823      0      1      0       1
   3   2  3877   7354      0      0 453537   1806      0      1      0       1
   1   0  1610   2764      0      0 163793    684      0      3      0       3

This is the same behavior I'm seeing with swift. 10K 1KB files times a 4KB
minimum block size still comes out to a lot less than the multiple GB of
writes being reported. Actually, since this whole thing only takes a few
seconds and I know a single disk can't write that fast, maybe it's just a
bug in the way the kernel reports preallocated blocks and nothing to do
with XFS? Or is XFS responsible for the stats?

If I remove the fallocate call I see the expected amount of disk traffic.

-mark

On Sat, Jun 15, 2013 at 8:11 PM, Dave Chinner <david@fromorbit.com> wrote:
> Basically, Swift is trying to emulate Direct IO because python doesn't
> support Direct IO. Hence Swift is hacking around that problem and
> causing secondary issues that would never have occurred if Direct IO
> was used in the first place.
* Re: definitions for /proc/fs/xfs/stat
From: Dave Chinner @ 2013-06-16 22:06 UTC
To: Mark Seger; Cc: Nathan Scott, xfs

On Sun, Jun 16, 2013 at 08:58:58AM -0400, Mark Seger wrote:
> OK, I have a simple reproducer. Try out the following, noting you'll
> obviously have to change the directory pointed to by dname:
> ....
> When I run it I see this:
>
> segerm@az1-sw-object-0006:~$ sudo ./falloc.py
> DName: /srv/node/disk0/mjs
> Bytes: 9
> Time:  5.84 secs
> Rate:  1.67 MB/sec

Running on my 3.11 branch, python 2.7.3, 100TB XFS filesystem, there is no
IO to speak of:

$ tests/falloc.py
DName: /mnt/scratch/
Bytes: 0
Time:  0.21 secs
Rate:  4.60 MB/sec

> #<----CPU[HYPER]-----><----------Disks-----------><----------Network---------->
> #cpu sys inter  ctxsw KBRead  Reads KBWrit Writes   KBIn  PktIn  KBOut  PktOut
>    0   0   110    113      0      0      0      0      0      5      0       3
>    1   0  1576   2874      0      0 170240    665      0      3      0       2
>    4   3  2248   6623      0      0 406585   1596      0      1      0       1
>    4   3  2145   7680      0      0 473600   1850      0      1      0       1
>    2   1  2200   7406      0      0 456633   1875      0      2      0       1
>    4   3  3696   7401      0      0 454606   1823      0      1      0       1
>    3   2  3877   7354      0      0 453537   1806      0      1      0       1
>    1   0  1610   2764      0      0 163793    684      0      3      0       3

There is no way that fallocate() of 1000x1k files should be causing
450MB/s of IO for 5 seconds.

However, I still have no idea what you are running this test on - as I
asked in another email, can you provide some information about the system
you are seeing this problem on so we can try to work out what might be
causing this?

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

> This is the same behavior I'm seeing with swift. 10K 1KB files times a
> 4KB minimum block size still comes out to a lot less than the multiple
> GB of writes being reported. ... maybe it's just a bug in the way the
> kernel reports preallocated blocks and nothing to do with XFS? Or is
> XFS responsible for the stats?

I don't know where you are getting your stats from. XFS is only
responsible for what is in /proc/fs/xfs/...

> If I remove the fallocate call I see the expected amount of disk traffic.

Have you run strace to confirm that your python script is actually calling
fallocate(), and it's not being emulated by python or libc?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
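[A sketch of the strace check Dave is suggesting; with the reproducer
above, a real fallocate(2) call would show up along these lines, while a
libc emulation would instead appear as pwrite()/write() calls doing the
preallocation:]

    $ strace -f -e trace=fallocate python falloc.py
    fallocate(15, FALLOC_FL_KEEP_SIZE, 0, 1024) = 0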
* Re: definitions for /proc/fs/xfs/stat
From: Mark Seger @ 2013-06-16 22:31 UTC
To: Dave Chinner; Cc: Nathan Scott, xfs

> There is no way that fallocate() of 1000x1k files should be causing
> 450MB/s of IO for 5 seconds.

I agree, and that's what has me puzzled as well.

> However, I still have no idea what you are running this test on - as
> I asked in another email, can you provide some information about the
> system you are seeing this problem on so we can try to work out what
> might be causing this?

Sorry about that. This is an HP box with 192GB RAM and 6 2-core
hyperthreaded CPUs, running ubuntu/precise.

segerm@az1-sw-object-0006:~$ uname -a
Linux az1-sw-object-0006 2.6.38-16-server #68hf1026116v20120926-Ubuntu SMP Wed Sep 26 14:34:13 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

segerm@az1-sw-object-0006:~$ python --version
Python 2.7.1+

segerm@az1-sw-object-0006:~$ xfs_repair -V
xfs_repair version 3.1.4

segerm@az1-sw-object-0006:~$ cat /proc/meminfo
MemTotal:        198191696 kB
MemFree:         166202324 kB
Buffers:            193268 kB
Cached:           21595332 kB
SwapCached:             36 kB
Active:            7992268 kB
Inactive:         14068428 kB
Active(anon):       719408 kB
Inactive(anon):    5244092 kB
Active(file):      7272860 kB
Inactive(file):    8824336 kB
Unevictable:             0 kB
Mlocked:                 0 kB
SwapTotal:       126889980 kB
SwapFree:        126863972 kB
Dirty:                   0 kB
Writeback:               0 kB
AnonPages:          232156 kB
Mapped:              11384 kB
Shmem:             5731388 kB
Slab:              5755820 kB
SReclaimable:      4792292 kB
SUnreclaim:         963528 kB
KernelStack:          6240 kB
PageTables:           7328 kB
NFS_Unstable:            0 kB
Bounce:                  0 kB
WritebackTmp:            0 kB
CommitLimit:     225985828 kB
Committed_AS:      8851976 kB
VmallocTotal:  34359738367 kB
VmallocUsed:        725112 kB
VmallocChunk:  34159749436 kB
HardwareCorrupted:       0 kB
HugePages_Total:         0
HugePages_Free:          0
HugePages_Rsvd:          0
HugePages_Surp:          0
Hugepagesize:         2048 kB
DirectMap4k:        118648 kB
DirectMap2M:       5089280 kB
DirectMap1G:     196083712 kB

Over 60 mounts, but here's the one I'm writing to:

segerm@az1-sw-object-0006:~$ mount | grep disk0
/dev/sdc1 on /srv/node/disk0 type xfs (rw,nobarrier)

Not sure what you're looking for here, so here's it all:

segerm@az1-sw-object-0006:~$ cat /proc/partitions
major minor    #blocks  name

   8     0   976762584  sda
   8     1      248976  sda1
   8     2           1  sda2
   8     5   976510993  sda5
 251     0    41943040  dm-0
 251     1     8785920  dm-1
 251     2     2928640  dm-2
   8    16   976762584  sdb
   8    17   976760832  sdb1
 251     3   126889984  dm-3
 251     4      389120  dm-4
 251     5    41943040  dm-5
   8    32  2930233816  sdc
   8    33  2930233344  sdc1
   8    48  2930233816  sdd
   8    49  2930233344  sdd1
   8    64  2930233816  sde
   8    65  2930233344  sde1
   8    80  2930233816  sdf
   8    81  2930233344  sdf1
   8    96  2930233816  sdg
   8    97  2930233344  sdg1
   8   112  2930233816  sdh
   8   113  2930233344  sdh1
   8   128  2930233816  sdi
   8   129  2930233344  sdi1
   8   144  2930233816  sdj
   8   145  2930233344  sdj1
   8   160  2930233816  sdk
   8   161  2930233344  sdk1
   8   176  2930233816  sdl
   8   177  2930233344  sdl1
   8   192  2930233816  sdm
   8   193  2930233344  sdm1
   8   208  2930233816  sdn
   8   209  2930233344  sdn1
   8   240  2930233816  sdp
   8   241  2930233344  sdp1
   8   224  2930233816  sdo
   8   225  2930233344  sdo1
  65     0  2930233816  sdq
  65     1  2930233344  sdq1
  65    16  2930233816  sdr
  65    17  2930233344  sdr1
  65    32  2930233816  sds
  65    33  2930233344  sds1
  65   176  2930233816  sdab
  65   177  2930233344  sdab1
  66     0  2930233816  sdag
  66     1  2930233344  sdag1
  65    64  2930233816  sdu
  65    65  2930233344  sdu1
  65   128  2930233816  sdy
  65   129  2930233344  sdy1
  65   224  2930233816  sdae
  65   225  2930233344  sdae1
  65   144  2930233816  sdz
  65   145  2930233344  sdz1
  65   208  2930233816  sdad
  65   209  2930233344  sdad1
  65    48  2930233816  sdt
  65    49  2930233344  sdt1
  65   240  2930233816  sdaf
  65   241  2930233344  sdaf1
  65    96  2930233816  sdw
  65    97  2930233344  sdw1
  66    32  2930233816  sdai
  66    33  2930233344  sdai1
  66    48  2930233816  sdaj
  66    49  2930233344  sdaj1
  65   112  2930233816  sdx
  65   113  2930233344  sdx1
  65   160  2930233816  sdaa
  65   161  2930233344  sdaa1
  65    80  2930233816  sdv
  65    81  2930233344  sdv1
  65   192  2930233816  sdac
  65   193  2930233344  sdac1
  66    16  2930233816  sdah
  66    17  2930233344  sdah1
  66    64  2930233816  sdak
  66    65  2930233344  sdak1
  66   144  2930233816  sdap
  66   145  2930233344  sdap1
  66   224  2930233816  sdau
  66   225  2930233344  sdau1
  67   208  2930233816  sdbj
  67   209  2930233344  sdbj1
  66   112  2930233816  sdan
  66   113  2930233344  sdan1
  66    96  2930233816  sdam
  66    97  2930233344  sdam1
  66    80  2930233816  sdal
  66    81  2930233344  sdal1
  67    96  2930233816  sdbc
  67    97  2930233344  sdbc1
  66   128  2930233816  sdao
  66   129  2930233344  sdao1
  67   128  2930233816  sdbe
  67   129  2930233344  sdbe1
  66   176  2930233816  sdar
  66   177  2930233344  sdar1
  67   160  2930233816  sdbg
  67   161  2930233344  sdbg1
  66   160  2930233816  sdaq
  66   161  2930233344  sdaq1
  67    48  2930233816  sdaz
  67    49  2930233344  sdaz1
  66   208  2930233816  sdat
  66   209  2930233344  sdat1
  67   112  2930233816  sdbd
  67   113  2930233344  sdbd1
  67   144  2930233816  sdbf
  67   145  2930233344  sdbf1
  67    64  2930233816  sdba
  67    65  2930233344  sdba1
  67    16  2930233816  sdax
  67    17  2930233344  sdax1
  67    32  2930233816  sday
  67    33  2930233344  sday1
  66   192  2930233816  sdas
  66   193  2930233344  sdas1
  67   192  2930233816  sdbi
  67   193  2930233344  sdbi1
  66   240  2930233816  sdav
  66   241  2930233344  sdav1
  67    80  2930233816  sdbb
  67    81  2930233344  sdbb1
  67   176  2930233816  sdbh
  67   177  2930233344  sdbh1
  67     0  2930233816  sdaw
  67     1  2930233344  sdaw1

segerm@az1-sw-object-0006:~$ xfs_info /srv/node/disk0
meta-data=/dev/sdc1              isize=1024   agcount=32, agsize=22892416 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=732557312, imaxpct=5
         =                       sunit=64     swidth=64 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=357696, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Is this enough to describe my environment?

> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
>
> I don't know where you are getting your stats from. XFS is only
> responsible for what is in /proc/fs/xfs/...

I'm getting my stats from 2 places, collectl reading /proc/diskstats AND
xfs from /proc/fs/xfs, though I've only shown you the collectl stats. It's
easy enough to send you the /proc/fs/xfs stats if you want them, as both
formatted output and the raw /proc/fs/xfs contents every second from which
they're derived.

> Have you run strace to confirm that your python script is actually
> calling fallocate(), and it's not being emulated by python or libc?

I haven't.
However, when I saw this exact behavior with swift we did see fallocate
being called, and here's a small section showing a complete 10K PUT
operation and the beginning of the next one:

0.067874 cpu=0  pid=41977 fallocate [285] entry fd=15 mode=0x1 offset=0x0 len=10240
0.067980 cpu=0  pid=41977 block_rq_insert dev_t=0x04100030 wr=write flags=SYNC sector=0xaec11a00 len=262144
0.067983 cpu=0  pid=41977 block_rq_issue dev_t=0x04100030 wr=write flags=SYNC sector=0xaec11a00 len=262144
0.068010 cpu=0  pid=41977 sched_switch prio=120 state=SSLEEP next_pid=0 next_prio=120
0.068125 cpu=0  pid=0     block_rq_complete dev_t=0x04100030 wr=write flags=SYNC sector=0xaec11a00 len=262144 qpid=41977 spid=41977 qtm= 0.000004 svtm= 0.000142
0.068175 cpu=0  pid=48168 sched_wakeup target_pid=41977 prio=120 target_cpu=12 success=1
0.068180 cpu=12 pid=0     sched_switch prio=n/a state=n/a next_pid=41977 next_prio=120
0.068185 cpu=12 pid=41977 fallocate [285] ret=0x0 syscallbeg= 0.000311 fd=15 mode=0x1 offset=0x0 len=10240

I hope this helps, but if there's anything more I can provide I'll be
happy to do so.

-mark
* Re: definitions for /proc/fs/xfs/stat
From: Dave Chinner @ 2013-06-16 23:14 UTC
To: Mark Seger; Cc: Nathan Scott, xfs

On Sun, Jun 16, 2013 at 06:31:13PM -0400, Mark Seger wrote:
> > There is no way that fallocate() of 1000x1k files should be causing
> > 450MB/s of IO for 5 seconds.
>
> I agree, and that's what has me puzzled as well.
>
> Sorry about that. This is an HP box with 192GB RAM and 6 2-core
> hyperthreaded CPUs, running ubuntu/precise.
>
> segerm@az1-sw-object-0006:~$ uname -a
> Linux az1-sw-object-0006 2.6.38-16-server #68hf1026116v20120926-Ubuntu
> SMP Wed Sep 26 14:34:13 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

So it's running a pretty old Ubuntu something-or-other kernel. There's
only limited help I can give you for this kernel, as I've got no idea
what Ubuntu have put in it...

....

> segerm@az1-sw-object-0006:~$ xfs_info /srv/node/disk0
> meta-data=/dev/sdc1              isize=1024   agcount=32, agsize=22892416 blks
>          =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=732557312, imaxpct=5
>          =                       sunit=64     swidth=64 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal               bsize=4096   blocks=357696, version=2
>          =                       sectsz=512   sunit=64 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0

Ok, that's interesting - a 1k inode size, and sunit=swidth=256k. But it
doesn't cause a current kernel to reproduce the behaviour you are
seeing....

sunit=256k is interesting, because:

> 0.067874 cpu=0 pid=41977 fallocate [285] entry fd=15 mode=0x1 offset=0x0 len=10240
> 0.067980 cpu=0 pid=41977 block_rq_insert dev_t=0x04100030 wr=write flags=SYNC sector=0xaec11a00 len=262144

That's a write which is rounded up to 256k.

BTW, that's a trace for a 10k fallocate, not a 1k one, but regardless it
doesn't change behaviour on my TOT test kernel.

> I hope this helps, but if there's anything more I can provide I'll be
> happy to do so.

It doesn't tell me what XFS is doing with the fallocate call. Providing
the trace-cmd trace output from the FAQ might shed some light on it...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
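[For reference, the event-tracing recipe Dave alludes to looks roughly
like this; a sketch only - see the XFS FAQ for the authoritative steps:]

    # trace-cmd record -e xfs\* python falloc.py
    # trace-cmd report > falloc-trace.txt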
* Re: definitions for /proc/fs/xfs/stat
From: Mark Seger @ 2013-06-16 23:31 UTC
To: Dave Chinner; Cc: Nathan Scott, xfs@oss.sgi.com

Thanks for your feedback, I really do appreciate it. Tell you what, I'll
try to get a newer distro installed and see if I can reproduce. Either way
I'll let you know what I find.

-mark

Sent from my iPhone

On Jun 16, 2013, at 7:14 PM, Dave Chinner <david@fromorbit.com> wrote:
> ....
* Re: definitions for /proc/fs/xfs/stat
From: Nathan Scott @ 2013-06-17 1:11 UTC
To: Mark Seger, Dave Chinner; Cc: xfs

Hey guys,

----- Original Message -----
> OK, I have a simple reproducer. Try out the following, noting you'll
> obviously have to change the directory pointed to by dname:
>
> libc = ctypes.CDLL(ctypes.util.find_library('c'), use_errno=True)
> falloc = getattr(libc, 'fallocate')

This is using the glibc fallocate wrapper - I have vague memories of an
old libc which used to do per-page buffered writes providing a poor man's
implementation of fallocate; maybe somehow that older version/behaviour is
being triggered.

Running the test case on a RHEL6 box here, you should see patterns like
the attached ("pmchart -c XFSLog" - config attached too), which suggest
log traffic dominates (though I have no stripe-fu setup like you, Mark,
which adds another wrinkle).

> This is the same behavior I'm seeing with swift. ... maybe it's just a
> bug in the way the kernel reports preallocated blocks and nothing to do
> with XFS? Or is XFS responsible for the stats?

The device-level IOPS and bandwidth stats are accounted in the block
layer, below the filesystem (at that point there is no concept of
preallocation any more - something has submitted actual I/O), whereas the
XFS log traffic accounting is done inside XFS.

> > Basically, Swift is trying to emulate Direct IO because python
> > doesn't support Direct IO. Hence Swift is hacking around that problem

I think it is still possible, FWIW. One could use python ctypes (as in
Mark's test program) and achieve a page-aligned POSIX memalign, and some
quick googling suggests flags can be passed to open(2) via os.O_DIRECT.

cheers.
--
Nathan

[-- Attachment: XFSLog (pmchart config) --]
#kmchart version 1
chart title "IOPS over all Disks [%h]" style stacking
    plot legend "Reads" color #ffff00 metric disk.all.read
    plot legend "Writes" color #ee82ee metric disk.all.write
chart style plot antialiasing off
    plot color #ee82ee metric xfs.log.writes
chart title "Disk Throughput [%h]" style stacking
    plot legend "Read rate" color #ffff00 metric disk.all.read_bytes
    plot legend "Write rate" color #ee82ee metric disk.all.write_bytes
chart style plot antialiasing off
    plot color #ee82ee metric xfs.log.blocks

[-- Attachment: xfs-log-vs-fallocate.png (chart comparing disk IOPS/throughput with XFS log traffic) --]
* Re: definitions for /proc/fs/xfs/stat
From: Dave Chinner @ 2013-06-17 2:46 UTC
To: Nathan Scott; Cc: Mark Seger, xfs

On Sun, Jun 16, 2013 at 09:11:29PM -0400, Nathan Scott wrote:
> This is using the glibc fallocate wrapper - I have vague memories of an
> old libc which used to do per-page buffered writes providing a poor
> man's implementation of fallocate; maybe somehow that older
> version/behaviour is being triggered.
>
> Running the test case on a RHEL6 box here, you should see patterns like
> the attached ("pmchart -c XFSLog" - config attached too), which suggest
> log traffic dominates.

Must be an old version of RHEL6, because 6.4 doesn't do any IO at all,
same as upstream. This test workload is a purely metadata-only workload
(no data is written), and so it all gets gathered up by delayed logging.
That's something 2.6.38 (and RHEL6.0/6.1) doesn't have by default, so it
is going to write a fair bit of metadata to the log. But I wouldn't have
expected one IO per fallocate call.

Oh, we fixed this in 2.6.39:

    8287889 xfs: preallocation transactions do not need to be synchronous

So, fallocate() is synchronous in 2.6.38 (and probably RHEL 6.0/6.1), and
the filesystem has a log stripe unit of 256k, so that would explain the
256k IO per fallocate call - the log is forced, and so the ~500 bytes of
dirty metadata gets padded to the full log stripe (i.e. 256k) and written
synchronously.

So there's the reason for the 256k write per file being written by swift.
Have I mentioned anything about weird side effects occurring as a result
of trying to emulate direct IO before? :)

> I think it is still possible, FWIW. One could use python ctypes (as in
> Mark's test program) and achieve a page-aligned POSIX memalign,

I wasn't aware you could get memalign() through python at all. I went
looking for this exact solution a couple of months ago when these problems
started to be reported, and couldn't find anything related to direct IO on
python with google except for "it can't be done", "it doesn't work" and a
patch that was rejected years ago to support it natively.

> and some quick googling suggests flags can be passed to open(2) via
> os.O_DIRECT.
Yup, the python manual that documents this kind of thing is what I'd
expect to show up as the number one hit when you google "python O_DIRECT
open flags", wouldn't you think? All I get with that is "O_DIRECT doesn't
work" bug reports and blog posts. If I drop the O_DIRECT out of the search
phrase, the first hit is the python documentation about open flags, and it
documents that O_DIRECT can be passed. And if I use different search
phrases for memalign without mentioning direct IO, I see lots of tricks
people use to get this functionality in python. <sigh>

Google has been letting me down like this quite a bit over the past few
months when it comes to searching for stuff related to development. It's
getting harder to find stuff amongst the noise of whiny blogs, forums, and
other places where people do nothing but complain about broken shite that
google seems to think is more important than a real reference manual on
the topic being searched. Is there anything better out there yet? Like
from years ago, when the google "I'm feeling lucky" button used to take
you directly to the exact page of the reference manual relevant to the
topic being searched?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
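[A back-of-the-envelope check of Dave's log-stripe-padding explanation
against the numbers earlier in the thread, assuming one synchronous,
stripe-padded log write per fallocate() as the trace showed. Note the
reproducer's output above ("Bytes: 9") implies that run actually created
10000 files, not the 1000 in the posted listing:

    10000 fallocate calls x 256 KB of padded log IO  ~= 2.4 GB
    2.4 GB / 5.84 s                                  ~= 430 MB/s

which matches the ~450MB/s collectl reported. Likewise, a 1000-file run
implies ~256 MB of log IO, consistent with the ~200MB/sec Mark saw in his
original 1000-file tests.]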
* Re: definitions for /proc/fs/xfs/stat
From: Nathan Scott @ 2013-06-17 5:41 UTC
To: Dave Chinner; Cc: Mark Seger, xfs

Hey Dave,

----- Original Message -----
> ...
> Must be an old version of RHEL6, because 6.4 doesn't do any IO at all,
> same as upstream. This test workload is a purely metadata-only workload
> (no data is written), and so it all gets gathered up by delayed logging.

*nod* - RHEL6.3.

> > I think it is still possible, FWIW. One could use python ctypes (as
> > in Mark's test program) and achieve a page-aligned POSIX memalign,
>
> I wasn't aware you could get memalign() through python at all. I went
> looking for this exact solution a couple of months ago when these
> problems started to be reported, and couldn't find anything ...

Yes, on reflection it doesn't jive too well with the way python wants to
do reads, in particular - os.read takes a file and a size; there's no
buffer exposed at the API level (for input).

It would need to be a separate python module to the core set I guess (with
a C component), and a slightly different API - or at least some additional
APIs which can take in an aligned buffer, rather than just allocating one
each time - but I believe it's still feasible.

cheers.

--
Nathan
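[A minimal sketch of the write side of this idea, for Python 2 as used in
the thread. It assumes that os.write() accepts any object supporting the
buffer interface (true for CPython 2.x) and that anonymous mmap regions
are page-aligned, which satisfies O_DIRECT's alignment rules; the path and
sizes are illustrative only:]

    import mmap
    import os

    BLKSIZE = 4096    # must be a multiple of the device's logical sector size

    # an anonymous mmap gives a page-aligned buffer without ctypes/memalign
    buf = mmap.mmap(-1, BLKSIZE)
    buf.write('x' * BLKSIZE)

    fd = os.open('/srv/node/disk0/obj', os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0644)
    os.write(fd, buf)    # length and alignment both satisfy O_DIRECT
    os.close(fd)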
* Re: definitions for /proc/fs/xfs/stat
From: Mark Seger @ 2013-06-17 10:57 UTC
To: Nathan Scott; Cc: xfs

All - good conversation, and again, thanks for digging into this. The
comment about me running on an older kernel seems to be the answer; by
rerunning my test on precise/3.5.0-23-generic all seems to be operating
correctly, so I guess that was it.

However, the one thing that does jump out of this is that the /proc/fs/xfs
stats and PCP were both showing many hundreds of MB/sec during tests that
only ran for a few seconds, which is impossible, so it still feels like
some sort of accounting bug to me. On the other hand, given that this was
an older kernel and newer kernels are fine, perhaps it's something just to
note and not worry about.

Thanks again...

-mark

On Mon, Jun 17, 2013 at 1:41 AM, Nathan Scott <nathans@redhat.com> wrote:
> ....
* Re: definitions for /proc/fs/xfs/stat
From: Dave Chinner @ 2013-06-17 11:13 UTC
To: Mark Seger; Cc: Nathan Scott, xfs

On Mon, Jun 17, 2013 at 06:57:14AM -0400, Mark Seger wrote:
> All - good conversation, and again, thanks for digging into this. The
> comment about me running on an older kernel seems to be the answer; by
> rerunning my test on precise/3.5.0-23-generic all seems to be operating
> correctly, so I guess that was it.

OK, good to know.

> However, the one thing that does jump out of this is that the
> /proc/fs/xfs stats and PCP were both showing many hundreds of MB/sec
> during tests that only ran for a few seconds, which is impossible, so
> it still feels like some sort of accounting bug to me.

How big is the write cache in your RAID array? If the log is the only
thing being written to, and it fits within the cache, then it can easily
push several hundred MB/s of write IO....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: definitions for /proc/fs/xfs/stat
  2013-06-17 11:13 ` Dave Chinner
@ 2013-06-17 14:57 ` Mark Seger
  2013-06-17 20:28   ` Stefan Ring
  2013-06-19 23:02   ` Useful stats (was Re: definitions for /proc/fs/xfs/stat) Nathan Scott
  0 siblings, 2 replies; 28+ messages in thread
From: Mark Seger @ 2013-06-17 14:57 UTC (permalink / raw)
To: Dave Chinner; +Cc: Nathan Scott, xfs

[-- Attachment #1.1: Type: text/plain, Size: 854 bytes --]

> How big is the write cache in your RAID array? If the log is the
> only thing being written to, and it fits within the cache, then it
> can easily push several hundred MB/s of write IO....

I asked around and people believe the cache is on the order of a few GB,
and the test I ran was intentionally large enough to overwhelm any cache
effects, running for about a minute and doing 100K 1K file creates. The
disk write rate was close to a sustained 475MB/sec and would have easily
filled the cache in the first handful of seconds, which should have
produced enough backpressure to slow down the writes - but that never
happened.

On a totally different topic, and if you like we can start a different
thread on it, I'd be interested in adding some monitoring stats to collectl
for xfs and could use some suggestions as to which are the most important.

-mark

[-- Attachment #1.2: Type: text/html, Size: 1193 bytes --]

[-- Attachment #2: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 28+ messages in thread
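To put numbers on why this result looked impossible: at the reported rate the
test wrote roughly seventy times more to disk than the files themselves could
account for. A quick back-of-envelope check, using only figures quoted in this
thread (the one-4k-block-per-1k-file assumption comes up later in the thread):

    # Sanity check of the figures reported above; a sketch, not
    # measurement code - all inputs are as quoted in this thread.
    nfiles = 100000       # 100K 1K file creates
    blocksize = 4096      # each 1k file still consumes one 4k block
    runtime = 60          # "running for about a minute"
    rate = 475e6          # sustained ~475MB/sec of disk writes

    expected = nfiles * blocksize   # ~410 MB of actual file data
    observed = runtime * rate       # ~28.5 GB of reported disk I/O
    print('expected data: %.0f MB' % (expected / 1e6))
    print('observed I/O: %.1f GB' % (observed / 1e9))
    print('ratio: ~%.0fx' % (observed / expected))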
* Re: definitions for /proc/fs/xfs/stat
  2013-06-17 14:57 ` Mark Seger
@ 2013-06-17 20:28 ` Stefan Ring
  2013-06-18  0:15   ` Dave Chinner
  2013-06-19 23:02 ` Useful stats (was Re: definitions for /proc/fs/xfs/stat) Nathan Scott
  1 sibling, 1 reply; 28+ messages in thread
From: Stefan Ring @ 2013-06-17 20:28 UTC (permalink / raw)
To: Mark Seger; +Cc: Nathan Scott, Linux fs XFS

> I asked around and people believe the cache is on the order of a few GB,
> and the test I ran was intentionally large enough to overwhelm any cache
> effects, running for about a minute and doing 100K 1K file creates. The
> disk write rate was close to a sustained 475MB/sec and would have easily
> filled the cache in the first handful of seconds, which should have
> produced enough backpressure to slow down the writes - but that never
> happened.

If a limited number of blocks gets written over and over, you won't
ever fill the cache.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: definitions for /proc/fs/xfs/stat
  2013-06-17 20:28 ` Stefan Ring
@ 2013-06-18  0:15 ` Dave Chinner
  2013-06-18 10:17   ` Mark Seger
  0 siblings, 1 reply; 28+ messages in thread
From: Dave Chinner @ 2013-06-18 0:15 UTC (permalink / raw)
To: Stefan Ring; +Cc: Nathan Scott, Mark Seger, Linux fs XFS

On Mon, Jun 17, 2013 at 10:28:19PM +0200, Stefan Ring wrote:
> > I asked around and people believe the cache is on the order of a few GB,
> > and the test I ran was intentionally large enough to overwhelm any cache
> > effects, running for about a minute and doing 100K 1K file creates. The
> > disk write rate was close to a sustained 475MB/sec and would have easily
> > filled the cache in the first handful of seconds, which should have
> > produced enough backpressure to slow down the writes - but that never
> > happened.
>
> If a limited number of blocks gets written over and over, you won't
> ever fill the cache.

Right, and the XFS log is circular and about 1GB in size on the
filesystem under test, so it should fit completely in cache. Hence
speeds faster than a physical disk are achievable if the log stays
resident in the cache....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: definitions for /proc/fs/xfs/stat
  2013-06-18  0:15 ` Dave Chinner
@ 2013-06-18 10:17 ` Mark Seger
  0 siblings, 0 replies; 28+ messages in thread
From: Mark Seger @ 2013-06-18 10:17 UTC (permalink / raw)
To: Dave Chinner; +Cc: Stefan Ring, Nathan Scott, Linux fs XFS

[-- Attachment #1.1: Type: text/plain, Size: 1822 bytes --]

mark - good point about the controller cache and the 1-byte/block writes,
but remember my files in this case are only 1 block.

dave - a long time ago I found/reported a bug in the way linux was doing
their block stats. They were basically incrementing the byte counts as
blocks were written to cache, and the reported numbers were ridiculously
high. I had wondered if something like this could be going on, but remember
the numbers being reported by xfs are much too high even for that. I did
10K 1K writes, which I do understand are really one 4k page each, but
that's still only 40MB. If I add up all the 500MB/sec samples xfs logging
is reporting (even one second of which is over 10 times that 40MB), I see
something on the order of 10GB. But again this IS with the older kernel and
so may not be worth worrying about.

-mark

On Mon, Jun 17, 2013 at 8:15 PM, Dave Chinner <david@fromorbit.com> wrote:

> On Mon, Jun 17, 2013 at 10:28:19PM +0200, Stefan Ring wrote:
> > > I asked around and people believe the cache is on the order of a few
> > > GB, and the test I ran was intentionally large enough to overwhelm
> > > any cache effects, running for about a minute and doing 100K 1K file
> > > creates. The disk write rate was close to a sustained 475MB/sec and
> > > would have easily filled the cache in the first handful of seconds,
> > > which should have produced enough backpressure to slow down the
> > > writes - but that never happened.
> >
> > If a limited number of blocks gets written over and over, you won't
> > ever fill the cache.
>
> Right, and the XFS log is circular and about 1GB in size on the
> filesystem under test, so it should fit completely in cache. Hence
> speeds faster than a physical disk are achievable if the log stays
> resident in the cache....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

[-- Attachment #1.2: Type: text/html, Size: 2405 bytes --]

[-- Attachment #2: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Useful stats (was Re: definitions for /proc/fs/xfs/stat)
  2013-06-17 14:57 ` Mark Seger
  2013-06-17 20:28   ` Stefan Ring
@ 2013-06-19 23:02 ` Nathan Scott
  1 sibling, 0 replies; 28+ messages in thread
From: Nathan Scott @ 2013-06-19 23:02 UTC (permalink / raw)
To: Mark Seger; +Cc: xfs

Hi Mark,

----- Original Message -----
> ...
> On a totally different topic, and if you like we can start a different
> thread on it, I'd be interested in adding some monitoring stats to
> collectl for xfs and could use some suggestions as to which are the most
> important.

Did you receive any other feedback on this (offlist perhaps)? I was kinda
hoping someone else would chime in, but can share our experiences with PCP
if it helps. We provide options to record all xfs.* metrics & an
alternative to record just the core I/O stats (so log I/O and the file
read/write metrics). With PCP one can record any subset at any rate, so
those are just convenience / default suggestions.

In terms of recording in collectl, not sure what the requirements/parameters
are for you, but I'd suggest recording all the data (in save-to-disk mode).
IIRC you've got brief/verbose reporting modes - I'd focus on the log, file
I/O and buffer I/O stats when being "brief", and expand out to showing
in-core inode/dquot stats, ail/transaction stats, dir/attr operation stats.
You might consider another very-verbose kinda view with all of the various
btree stats, but that may be going too far.

Note that these are all aggregate statistics; there has been on-and-off
discussion for many years about adding some per-filesystem XFS stats, but
to date that has not happened - perhaps someday it will, so keep that in
mind for collectl too I guess. Also, be aware there is a separate file for
quota metrics, for hysterical raisins (/proc/fs/xfs/xqmstat).

HTH.

cheers.

--
Nathan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 28+ messages in thread
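Rendering Nathan's suggestion as code: a minimal sketch of how a
collectl-style sampler might read these counters. The tag names follow the
Runtime_Stats page referenced earlier in the thread; the brief/verbose
grouping below is just his suggestion made concrete, not an established
collectl convention.

    # Minimal sampler for /proc/fs/xfs/stat (plus the separate quota
    # file). Each line is a tag followed by unsigned counters; see
    # http://xfs.org/index.php/Runtime_Stats for the field meanings.
    def read_xfs_stats(path='/proc/fs/xfs/stat'):
        stats = {}
        with open(path) as f:
            for line in f:
                fields = line.split()
                stats[fields[0]] = [int(v) for v in fields[1:]]
        return stats

    stats = read_xfs_stats()
    stats.update(read_xfs_stats('/proc/fs/xfs/xqmstat'))  # quota counters

    # "Brief" mode: log, file I/O (rw) and buffer (buf) counters.
    brief = ['log', 'rw', 'buf']
    # "Verbose" adds inode, quota, AIL/transaction, dir/attr counters.
    verbose = brief + ['ig', 'qm', 'trans', 'push_ail', 'dir', 'attr']
    for tag in verbose:
        print(tag, stats.get(tag))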
* Re: definitions for /proc/fs/xfs/stat
  2013-06-17  5:41 ` Nathan Scott
  2013-06-17 10:57   ` Mark Seger
@ 2013-06-17 11:19 ` Dave Chinner
  2013-06-17 13:18   ` Stan Hoeppner
  1 sibling, 1 reply; 28+ messages in thread
From: Dave Chinner @ 2013-06-17 11:19 UTC (permalink / raw)
To: Nathan Scott; +Cc: Mark Seger, xfs

On Mon, Jun 17, 2013 at 01:41:50AM -0400, Nathan Scott wrote:
> Hey Dave,
>
> ----- Original Message -----
> > ...
> > Must be an old version of RHEL6, because 6.4 doesn't do any IO at
> > all, same as upstream. This test workload is purely a metadata only
> > workload (no data is written) and so it all gets gathered up by
> > delayed logging.
>
> *nod* - RHEL6.3.
>
> > > I think it is still possible, FWIW. One could use python ctypes (as in
> > > Mark's test program) and achieve a page-aligned POSIX memalign,
> >
> > I wasn't aware you could get memalign() through python at all. I
> > went looking for this exact solution a couple of months ago when
> > these problems started to be reported and couldn't find anything
> > ...
>
> Yes, on reflection it doesn't jibe too well with the way python wants
> to do reads, in particular - os.read takes a file and a size, there's
> no buffer exposed at the API level (for input).
>
> It would need to be a separate python module to the core set I guess
> (with a C component), and a slightly different API - or at least some
> additional APIs which can take in an aligned buffer, rather than just
> allocating one each time - but I believe it's still feasible.

Oh, there were modules written to do this years ago. e.g. this from
2006:

https://pypi.python.org/pypi/directio/1.0

There were even patches to provide a native interface in 2004, but
they got shot down, like all the future enhancement requests that
have happened since, mainly because the python maintainers don't
believe that Direct IO is necessary.

I mean, Linus said Direct IO is for deranged monkeys, so nobody
should ever use direct IO, right? ;)

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 28+ messages in thread
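For the curious, the aligned-buffer approach Nathan and Dave are debating can
be sketched in a few lines of ctypes. posix_memalign() and read() here are
real glibc entry points, but treat this as an illustration of the idea rather
than a hardened direct-I/O module:

    import ctypes
    import ctypes.util
    import os

    libc = ctypes.CDLL(ctypes.util.find_library('c'), use_errno=True)

    def direct_read(path, size, align=4096):
        # O_DIRECT requires the buffer address, file offset and length
        # to all be suitably aligned; size should be a multiple of
        # align here.
        buf = ctypes.c_void_p()
        rc = libc.posix_memalign(ctypes.byref(buf), align, size)
        if rc != 0:
            raise MemoryError('posix_memalign returned %d' % rc)
        fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
        try:
            # Read into our aligned buffer via libc, since os.read()
            # allocates its own (unaligned) buffer internally.
            n = libc.read(fd, buf, size)
            if n < 0:
                err = ctypes.get_errno()
                raise OSError(err, os.strerror(err))
            return ctypes.string_at(buf, n)
        finally:
            os.close(fd)
            libc.free(buf)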
* Re: definitions for /proc/fs/xfs/stat
  2013-06-17 11:19 ` definitions for /proc/fs/xfs/stat Dave Chinner
@ 2013-06-17 13:18 ` Stan Hoeppner
  0 siblings, 0 replies; 28+ messages in thread
From: Stan Hoeppner @ 2013-06-17 13:18 UTC (permalink / raw)
To: xfs

On 6/17/2013 6:19 AM, Dave Chinner wrote:
> I mean, Linus said Direct IO is for deranged monkeys, so nobody
> should ever use direct IO, right? ;)

As it turns out, there are simply far more deranged monkeys running
loose than Linus anticipated. :)

--
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: definitions for /proc/fs/xfs/stat
  2013-06-17  1:11 ` Nathan Scott
  2013-06-17  2:46   ` Dave Chinner
@ 2013-06-18  0:13 ` Mark Goodwin
  1 sibling, 0 replies; 28+ messages in thread
From: Mark Goodwin @ 2013-06-18 0:13 UTC (permalink / raw)
To: Mark Seger; +Cc: Nathan Scott, xfs

On 06/17/2013 11:11 AM, Nathan Scott wrote:
>> ok, I have a simple reproducer. try out the following, noting you'll
>> obviously have to change the directory pointed to by dname:
>>
>> libc=ctypes.CDLL(ctypes.util.find_library('c'), use_errno=True)
>> falloc=getattr(libc, 'fallocate')
>>
>
> This is using the glibc fallocate wrapper - I have vague memories of an
> old libc which used to do per-page buffered writes providing a poor-man's
> implementation of fallocate, maybe somehow that older version/behaviour
> is being triggered.

Python probably tipped you into a posix_fallocate() bog hole - it writes
one byte to every block in the range offset .. offset+len as the portable
(but poor-man's) implementation of fallocate(). This, combined with your
controller cache, could explain the unexpected I/O.

See glibc:sysdeps/posix/posix_fallocate.c

--
Mark

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 28+ messages in thread
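For illustration, the write pattern Mark describes can be sketched in Python.
This is simplified: the real glibc code is C, uses the file's st_blksize, and
handles EOF and error cases more carefully - the point is only the shape of
the I/O it generates.

    import os

    def posix_fallocate_fallback(fd, offset, length, blocksize=4096):
        # Touch the last byte of every block in the range so the blocks
        # get allocated - "preallocation" becomes one tiny buffered
        # write per block, which is exactly the I/O storm seen in this
        # thread.
        end = offset + length
        pos = (offset // blocksize + 1) * blocksize - 1
        while pos < end:
            existing = os.pread(fd, 1, pos)  # preserve any existing data
            os.pwrite(fd, existing if existing else b'\0', pos)
            pos += blocksize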
* Re: definitions for /proc/fs/xfs/stat
  2013-06-15 10:35 ` Mark Seger
  2013-06-15 16:22   ` Mark Seger
@ 2013-06-16  0:00 ` Dave Chinner
  1 sibling, 0 replies; 28+ messages in thread
From: Dave Chinner @ 2013-06-16 0:00 UTC (permalink / raw)
To: Mark Seger; +Cc: Nathan Scott, xfs

On Sat, Jun 15, 2013 at 06:35:02AM -0400, Mark Seger wrote:
> Basically everything I do is with collectl, a tool I wrote and opensourced
> almost 10 years ago. Its numbers are very accurate - I've compared with
> iostat on numerous occasions whenever I might have had doubts and they
> always agree. Since both tools get their data from the same place,
> /proc/diskstats, it's hard for them not to agree AND its numbers also
> agree with /proc/fs/xfs.

Ok, that's all I wanted to know.

> happening?
>
> To restate what's going on, I have a very simple script that duplicates
> what openstack swift is doing, namely creating a file with mkstemp and
> then running fallocate against it. The files are being created with a
> size of zero but it seems that xfs is generating a ton of logging
> activity. I had read your post back in 2011 about speculative
> preallocation and can't help but wonder if that's what's hitting me here.
> I also saw where system memory can come into play and this box has 192GB
> and 12 hyperthreaded cores.
>
> I also tried one more run without fallocate, this is creating 10000 1K
> files, which should be about 10MB, and it looks like it's still doing
> 140MB of I/O which still feels like a lot but at least it's less than the

1k files will still write 4k filesystem blocks, so there's going to
be 40MB there at least. As it is, I ran a bunch of tests yesterday
writing 4k files, and I got 180MB/s @ 32,000 files/s. That's roughly
130MB/s for data, and another 50MB/s for log and metadata traffic.

But without knowing your test configuration and using your test
script, I can't compare those results to yours. Can you provide the
information in:

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

> If there is anything more I can provide I'll be happy to do so. Actually
> I should point out I can easily generate graphs and if you'd like to see
> some examples I can provide those too.

PCP generates realtime graphs, which is what I use ;)

> Also, if there is anything I can report
> from /proc/fs/xfs I can relatively easily do that as well and display it
> side by side with the disk I/O.

Let's see if there is something unusual in your setup that might
explain it first...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 28+ messages in thread
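For readers who don't have Mark's script, the workload under discussion is
roughly this shape - a sketch only, reusing the ctypes setup quoted earlier
in the thread; the directory, file count and sizes are placeholders, not his
actual test parameters:

    import ctypes
    import ctypes.util
    import os
    import tempfile

    libc = ctypes.CDLL(ctypes.util.find_library('c'), use_errno=True)
    falloc = libc.fallocate
    # Declare the 64-bit offset/length arguments explicitly for ctypes.
    falloc.argtypes = [ctypes.c_int, ctypes.c_int,
                       ctypes.c_longlong, ctypes.c_longlong]

    def churn(dname, nfiles=10000, size=1024):
        # mkstemp a file, fallocate a small range into it, repeat.
        for _ in range(nfiles):
            fd, path = tempfile.mkstemp(dir=dname)
            try:
                if falloc(fd, 0, 0, size) != 0:
                    err = ctypes.get_errno()
                    raise OSError(err, os.strerror(err))
            finally:
                os.close(fd)

    churn('/mnt/xfs/swift-sim')  # hypothetical test directory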
end of thread, other threads:[~2013-06-19 23:02 UTC | newest]

Thread overview: 28+ messages
2013-06-14 16:37 definitions for /proc/fs/xfs/stat Mark Seger
2013-06-14 22:16 ` Nathan Scott
2013-06-14 22:37   ` Mark Seger
2013-06-15  0:17     ` Nathan Scott
2013-06-15  1:55       ` Mark Seger
2013-06-15  2:04         ` Dave Chinner
2013-06-15 10:35           ` Mark Seger
2013-06-15 16:22             ` Mark Seger
2013-06-16  0:11               ` Dave Chinner
2013-06-16 12:58                 ` Mark Seger
2013-06-16 22:06                   ` Dave Chinner
2013-06-16 22:31                     ` Mark Seger
2013-06-16 23:14                       ` Dave Chinner
2013-06-16 23:31                         ` Mark Seger
2013-06-17  1:11                           ` Nathan Scott
2013-06-17  2:46                             ` Dave Chinner
2013-06-17  5:41                               ` Nathan Scott
2013-06-17 10:57                                 ` Mark Seger
2013-06-17 11:13                                   ` Dave Chinner
2013-06-17 14:57                                     ` Mark Seger
2013-06-17 20:28                                       ` Stefan Ring
2013-06-18  0:15                                         ` Dave Chinner
2013-06-18 10:17                                           ` Mark Seger
2013-06-19 23:02                                       ` Useful stats (was Re: definitions for /proc/fs/xfs/stat) Nathan Scott
2013-06-17 11:19                                 ` definitions for /proc/fs/xfs/stat Dave Chinner
2013-06-17 13:18                                   ` Stan Hoeppner
2013-06-18  0:13                             ` Mark Goodwin
2013-06-16  0:00               ` Dave Chinner