* definitions for /proc/fs/xfs/stat
From: Mark Seger @ 2013-06-14 16:37 UTC
To: xfs

I've been thinking about adding xfs support to collectl but am having a hard
time trying to decode the data in the stat file. Can someone point me to a
decoder ring? Thanks.

-mark
* Re: definitions for /proc/fs/xfs/stat
From: Nathan Scott @ 2013-06-14 22:16 UTC
To: Mark Seger; Cc: xfs

Hi Mark,

----- Original Message -----
> I've been thinking about adding xfs support to collectl but am having a
> hard time trying to decode the data in the stat file. Can someone point
> me to a decoder ring? Thanks.

Using Performance Co-Pilot, "pminfo -T xfs" will give you the answers you
seek.  [ http://oss.sgi.com/projects/pcp/ ]

cheers.

--
Nathan
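[For readers unfamiliar with PCP, a rough sketch of how pminfo can be used
here. The xfs.log.* metric names below are taken from the pmchart config
attached later in this thread; exact names can vary by PCP version:]

    $ pminfo -T xfs.log.writes    # print the descriptive help text for one metric
    $ pminfo -f xfs.log.writes    # fetch and print its current value
    $ pminfo xfs.log              # list the metrics in the xfs.log subtree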
* Re: definitions for /proc/fs/xfs/stat
From: Mark Seger @ 2013-06-14 22:37 UTC
To: Nathan Scott; Cc: xfs

Actually I have since found a decoder ring here -
http://xfs.org/index.php/Runtime_Stats - and have been incorporating a lot
of the data so I can look at things in real time. I'd still love to know
why writing 1000 1K files results in 200MB/sec of disk I/O though; clearly
something weird is going on. Both PCP and collectl (now that I have
incorporated some of the counters) are telling me the I/O is in the
logging, but why remains a mystery. Can anyone offer some suggestions?

-mark

On Fri, Jun 14, 2013 at 6:16 PM, Nathan Scott <nathans@redhat.com> wrote:
> Using Performance Co-Pilot, "pminfo -T xfs" will give you the answers
> you seek.  [ http://oss.sgi.com/projects/pcp/ ]
* Re: definitions for /proc/fs/xfs/stat
From: Nathan Scott @ 2013-06-15 0:17 UTC
To: Mark Seger; Cc: xfs

----- Original Message -----
> Actually I have since found a decoder ring here -
> http://xfs.org/index.php/Runtime_Stats - and have been incorporating a
> lot of the data so I can look at things in real time. I'd still love to
> know why

Ah, good stuff.

> writing 1000 1K files results in 200MB/sec of disk I/O though; clearly

For how many seconds? (or fractions thereof?)  You may need the level of
detail that only tracing will provide to explore further, although off the
top of my head I'm not sure exactly which events you should focus on (log,
inode creation, or space allocation at a guess).

cheers.

--
Nathan
* Re: definitions for /proc/fs/xfs/stat
From: Mark Seger @ 2013-06-15 1:55 UTC
To: Nathan Scott; Cc: xfs

I'm doing 1-second samples and the rates are very steady. The reason I
ended up at this level of testing was that I had done a sustained test for
2 minutes at about 5MB/sec and was seeing over 500MB/sec going to the disk,
again sampling at 1-second intervals. I'd be happy to provide detailed
output and can even sample more frequently if you like.

From my shorter test I was experimenting with some of the XFS data in
collectl and recorded this, if it helps at all:

segerm@az1-sw-object-0006:~$ collectl --import xfs
waiting for 1 second sample...
#<--XFS Ops--><-----------XFS Logging----------><------Extents------><------DirOps-------><----Trans---><-
# Write Reads Writes WrtKBs NoRoom Force Sleep ExtA BlkA ExtF ExtF Look Cre8 Remv Gdnt Sync Asyn Empt
      0     0      0      0      0     0     0    0    0    0    0    0    0    0    0    0    0    0
      0     0      3    768      0     3     3    0    0    0    0    0    0    0    0    3    0    0
      0     0      0      0      0     0     0    0    0    0    0    0    0    0    0    0    0    0
      0     0      0      0      0     0     0    0    0    0    0    0    0    0    0    0    0    0
      0     0      0      0      0     0     0    0    0    0    0    0    0    0    0    0    0    0
      0     0      1    256      0     1     1    0    0    0    0    0    0    0    0    1    0    0
     38    88     95  24320      0    95    96   54   54   54   54  115   76   76  154   95  473    0
    339   776    968 247816      0   968   978  484  484  479  479 1011  675  671 1351  967 4087    0
    321   748    929 237806      0   929   935  450  450  453  453  967  645  647 1287  930 3847    0
    279   637    810 207360      0   810   811  391  391  390  390  838  559  558 1118  810 3324    0
    209   482    610 156160      0   610   610  286  286  289  289  627  417  420  834  610 2451    0
      0     0      3    768      0     3     3    0    0    0    0    0    0    0    0    3    0    0

I can say for a fact I was doing about 300 writes/sec, which the write
numbers seem to support, though I don't know what the read numbers are
measuring. You can also see from the logging data that there was
~250MB/sec going to disk. Are there other numbers that are meaningful that
you'd like to see? All it takes is adding a couple of print statements;
what you're seeing above only took an hour or so to throw together. I can
collect as much (or as little) as you like, and I can even save the
complete contents of /proc/fs/xfs/stat every second in a file for later
playback.

-mark

On Fri, Jun 14, 2013 at 8:17 PM, Nathan Scott <nathans@redhat.com> wrote:
> For how many seconds? (or fractions thereof?)  You may need the level of
> detail that only tracing will provide to explore further, ....
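[A sketch of one way to capture /proc/fs/xfs/stat once a second for later
playback, as Mark describes above; the output file name is arbitrary:]

    $ while true; do date +%s; cat /proc/fs/xfs/stat; sleep 1; done > xfs-stat.log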
* Re: definitions for /proc/fs/xfs/stat
From: Dave Chinner @ 2013-06-15 2:04 UTC
To: Mark Seger; Cc: Nathan Scott, xfs

On Fri, Jun 14, 2013 at 09:55:17PM -0400, Mark Seger wrote:
> I'm doing 1-second samples and the rates are very steady. The reason I
> ended up at this level of testing was that I had done a sustained test
> for 2 minutes at about 5MB/sec and was seeing over 500MB/sec going to
> the disk, again sampling at 1-second intervals. I'd be happy to provide
> detailed output and can even sample more frequently if you like.

Where are you getting your IO throughput numbers from?

How do they compare to, say, the output of `iostat -d -x -m 1`?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: definitions for /proc/fs/xfs/stat
From: Mark Seger @ 2013-06-15 10:35 UTC
To: Dave Chinner; Cc: Nathan Scott, xfs

Basically everything I do is with collectl, a tool I wrote and open-sourced
almost 10 years ago. Its numbers are very accurate - I've compared with
iostat on numerous occasions whenever I might have had doubts, and they
always agree. Since both tools get their data from the same place,
/proc/diskstats, it's hard for them not to agree, AND its numbers also
agree with /proc/fs/xfs.

Here's an example of comparing the two on a short run, leaving off the -m
since collectl reports its output in KB:

Device:  rrqm/s wrqm/s    r/s    w/s  rkB/s     wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdc        0.00   0.00   0.00 494.00   0.00 126464.00   512.00     0.11  0.22    0.00    0.22  0.22 11.00

#        <---------reads---------><---------writes---------><--------averages--------> Pct
#Time    Name KBytes Merged IOs Size KBytes Merged  IOs Size RWSize QLen Wait SvcTim Util
10:18:32 sdc1      0      0   0    0 127488      0  498  256    256    1    0      0    7
10:18:33 sdc1      0      0   0    0 118784      0  464  256    256    1    0      0    4

For grins I also ran a set of numbers at a monitoring interval of 0.2
seconds just to see if they were steady, and they are:

#            <---------reads---------><---------writes---------><--------averages--------> Pct
#Time        Name KBytes Merged IOs Size KBytes Merged  IOs Size RWSize QLen Wait SvcTim Util
10:19:50.601 sdc1      0      0   0    0    768      0    3  256    256    0    0      0    0
10:19:50.801 sdc1      0      0   0    0  23296      0   91  256    256    1    0      0   19
10:19:51.001 sdc1      0      0   0    0  32256      0  126  256    256    1    0      0   14
10:19:51.201 sdc1      0      0   0    0  29696      0  116  256    256    1    0      0   19
10:19:51.401 sdc1      0      0   0    0  30464      0  119  256    256    1    0      0    4
10:19:51.601 sdc1      0      0   0    0  32768      0  128  256    256    1    0      0   14

But back to the problem at hand: why is this happening?

To restate what's going on, I have a very simple script with which I'm
duplicating what OpenStack Swift is doing, namely creating a file with
mkstemp and then running fallocate against it. The files are being created
with a size of zero, but it seems that XFS is generating a ton of logging
activity. I had read your post from back in 2011 about speculative
preallocation and can't help but wonder if that's what's hitting me here.
I also saw where system memory can come into play, and this box has 192GB
and 12 hyperthreaded cores.
I also tried one more run without fallocate, creating 10000 1K files,
which should be about 10MB in total, and it looks like it's still doing
140MB of I/O, which still feels like a lot but at least it's less than the
fallocate case:

#        <---------reads---------><---------writes---------><--------averages--------> Pct
#Time    Name KBytes Merged IOs Size KBytes Merged  IOs Size RWSize QLen Wait SvcTim Util
10:29:20 sdc1      0      0   0    0  89608      0  351  255    255    1    0      0   11
10:29:21 sdc1      0      0   0    0  55296      0  216  256    256    1    0      0    5

And to repeat the full run with fallocate:

# DISK STATISTICS (/sec)
#        <---------reads---------><---------writes---------><--------averages--------> Pct
#Time    Name KBytes Merged IOs Size KBytes Merged  IOs Size RWSize QLen Wait SvcTim Util
10:30:50 sdc1      0      0   0    0  56064      0  219  256    256    1    0      0    2
10:30:51 sdc1      0      0   0    0 409720    148 1622  253    252    1    0      0   26
10:30:52 sdc1      0      0   0    0 453240    144 1796  252    252    1    0      0   36
10:30:53 sdc1      0      0   0    0 441768    298 1800  245    245    1    0      0   37
10:30:54 sdc1      0      0   0    0 455576    144 1813  251    251    1    0      0   25
10:30:55 sdc1      0      0   0    0 453532    145 1805  251    251    1    0      0   35
10:30:56 sdc1      0      0   0    0 307352    145 1233  249    249    1    0      0   17
10:30:57 sdc1      0      0   0    0      0      0    0    0      0    0    0      0    0

If there is anything more I can provide I'll be happy to do so. Actually,
I should point out I can easily generate graphs, and if you'd like to see
some examples I can provide those too. Also, if there is anything I can
report from /proc/fs/xfs I can relatively easily do that as well and
display it side by side with the disk I/O.

-mark

On Fri, Jun 14, 2013 at 10:04 PM, Dave Chinner <david@fromorbit.com> wrote:
> Where are you getting your IO throughput numbers from?
>
> How do they compare to, say, the output of `iostat -d -x -m 1`?
* Re: definitions for /proc/fs/xfs/stat
From: Mark Seger @ 2013-06-15 16:22 UTC
To: Dave Chinner; Cc: Nathan Scott, xfs

I was thinking a little color commentary might be helpful on the
functionality that's driving the need for fallocate. I think I mentioned
somewhere in this thread that the application is OpenStack Swift, which is
a highly scalable cloud object store. If you're not familiar with it, it
doesn't do successive sequential writes to a preallocated file but rather
writes out a full object in one shot; in other words, object = file. The
whole purpose of preallocation, at least my understanding of it, is to make
sure there is enough room when the time comes to write the actual object,
so that if there isn't, a redundant server elsewhere can do it instead.
This makes the notion of speculative preallocation for future sequential
writes moot, the ideal being to preallocate only the object size with
minimal extra I/O.

Does that help?

-mark

On Sat, Jun 15, 2013 at 6:35 AM, Mark Seger <mjseger@gmail.com> wrote:
> ....
* Re: definitions for /proc/fs/xfs/stat
From: Dave Chinner @ 2013-06-16 0:11 UTC
To: Mark Seger; Cc: Nathan Scott, xfs

On Sat, Jun 15, 2013 at 12:22:35PM -0400, Mark Seger wrote:
> I was thinking a little color commentary might be helpful on the
> functionality that's driving the need for fallocate. I think I mentioned
> somewhere in this thread that the application is OpenStack Swift, which
> is a highly scalable cloud object store.

I'm familiar with it and the problems it causes filesystems. What
application am I talking about here, for example?

http://oss.sgi.com/pipermail/xfs/2013-June/027159.html

Basically, Swift is trying to emulate Direct IO because python doesn't
support Direct IO. Hence Swift is hacking around that problem and causing
secondary issues that would never have occurred if Direct IO was used in
the first place.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: definitions for /proc/fs/xfs/stat
From: Mark Seger @ 2013-06-16 12:58 UTC
To: Dave Chinner; Cc: Nathan Scott, xfs

OK, I have a simple reproducer. Try out the following, noting you'll
obviously have to change the directory pointed to by dname:

#!/usr/bin/python -u

import os
import sys
import time
import ctypes
import ctypes.util
from tempfile import mkstemp

# call fallocate(2) directly through the glibc wrapper
libc = ctypes.CDLL(ctypes.util.find_library('c'), use_errno=True)
falloc = getattr(libc, 'fallocate')

dname = '/srv/node/disk0/mjs'
fname = 'foo'
fsize = 1024
nfiles = 1000

body = ' ' * fsize

time0 = time.time()
for i in range(nfiles):
    file_name = '%s/%s-%s' % (dname, fname, i)
    fd, tmppath = mkstemp(dir=dname)
    falloc(fd, 1, 0, fsize)        # mode 1 == FALLOC_FL_KEEP_SIZE
    os.rename(tmppath, file_name)

elapsed = time.time() - time0
tbytes = fsize * nfiles
rate = tbytes / elapsed / 1024 / 1024

print "DName: %s" % dname
print "Bytes: %d" % (tbytes / 1024 / 1024)
print "Time:  %.2f secs" % elapsed
print "Rate:  %.2f MB/sec" % rate

When I run it I see this:

segerm@az1-sw-object-0006:~$ sudo ./falloc.py
DName: /srv/node/disk0/mjs
Bytes: 9
Time:  5.84 secs
Rate:  1.67 MB/sec

And while running collectl I see this:

#<----CPU[HYPER]-----><----------Disks-----------><----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes   KBIn  PktIn  KBOut  PktOut
   0   0   110    113      0      0      0      0      0      5      0       3
   1   0  1576   2874      0      0 170240    665      0      3      0       2
   4   3  2248   6623      0      0 406585   1596      0      1      0       1
   4   3  2145   7680      0      0 473600   1850      0      1      0       1
   2   1  2200   7406      0      0 456633   1875      0      2      0       1
   4   3  3696   7401      0      0 454606   1823      0      1      0       1
   3   2  3877   7354      0      0 453537   1806      0      1      0       1
   1   0  1610   2764      0      0 163793    684      0      3      0       3

This is the same behavior I'm seeing with swift. 10K 1KB files times a 4KB
minimum block size still comes out to a lot less than the multiple GB of
writes being reported. Actually, since this whole thing only takes a few
seconds and I know a single disk can't write that fast, maybe it's just a
bug in the way the kernel reports preallocated blocks and nothing to do
with XFS? Or is XFS responsible for the stats?

If I remove the fallocate call I see the expected amount of disk traffic.

-mark

On Sat, Jun 15, 2013 at 8:11 PM, Dave Chinner <david@fromorbit.com> wrote:
> Basically, Swift is trying to emulate Direct IO because python doesn't
> support Direct IO. Hence Swift is hacking around that problem and
> causing secondary issues that would never have occurred if Direct IO
> was used in the first place.
* Re: definitions for /proc/fs/xfs/stat
From: Dave Chinner @ 2013-06-16 22:06 UTC
To: Mark Seger; Cc: Nathan Scott, xfs

On Sun, Jun 16, 2013 at 08:58:58AM -0400, Mark Seger wrote:
> OK, I have a simple reproducer. Try out the following, noting you'll
> obviously have to change the directory pointed to by dname:
> ....
> When I run it I see this:
>
> segerm@az1-sw-object-0006:~$ sudo ./falloc.py
> DName: /srv/node/disk0/mjs
> Bytes: 9
> Time:  5.84 secs
> Rate:  1.67 MB/sec

Running on my 3.11 branch, python 2.7.3, 100TB XFS filesystem, there is no
IO to speak of:

$ tests/falloc.py
DName: /mnt/scratch/
Bytes: 0
Time:  0.21 secs
Rate:  4.60 MB/sec

> #<----CPU[HYPER]-----><----------Disks-----------><----------Network---------->
> #cpu sys inter  ctxsw KBRead  Reads KBWrit Writes   KBIn  PktIn  KBOut  PktOut
>    0   0   110    113      0      0      0      0      0      5      0       3
>    1   0  1576   2874      0      0 170240    665      0      3      0       2
>    4   3  2248   6623      0      0 406585   1596      0      1      0       1
>    4   3  2145   7680      0      0 473600   1850      0      1      0       1
>    2   1  2200   7406      0      0 456633   1875      0      2      0       1
>    4   3  3696   7401      0      0 454606   1823      0      1      0       1
>    3   2  3877   7354      0      0 453537   1806      0      1      0       1
>    1   0  1610   2764      0      0 163793    684      0      3      0       3

There is no way that fallocate() of 1000x1k files should be causing
450MB/s of IO for 5 seconds.

However, I still have no idea what you are running this test on - as I
asked in another email, can you provide some information about the system
you are seeing this problem on so we can try to work out what might be
causing this?

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

> This is the same behavior I'm seeing with swift. 10K 1KB files times a
> 4KB minimum block size still comes out to a lot less than the multiple
> GB of writes being reported. ... maybe it's just a bug in the way the
> kernel reports preallocated blocks and nothing to do with XFS? Or is
> XFS responsible for the stats?

I don't know where you are getting your stats from. XFS is only
responsible for what is in /proc/fs/xfs/...

> If I remove the fallocate call I see the expected amount of disk traffic.

Have you run strace to confirm that your python script is actually calling
fallocate(), and it's not being emulated by python or libc?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
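[A sketch of the strace check Dave is suggesting; with the reproducer
above, a real fallocate(2) call would show up along these lines, while a
libc emulation would instead appear as pwrite()/write() calls doing the
preallocation:]

    $ strace -f -e trace=fallocate python falloc.py
    fallocate(15, FALLOC_FL_KEEP_SIZE, 0, 1024) = 0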
* Re: definitions for /proc/fs/xfs/stat
From: Mark Seger @ 2013-06-16 22:31 UTC
To: Dave Chinner; Cc: Nathan Scott, xfs

> There is no way that fallocate() of 1000x1k files should be causing
> 450MB/s of IO for 5 seconds.

I agree, and that's what has me puzzled as well.

> However, I still have no idea what you are running this test on - as
> I asked in another email, can you provide some information about the
> system you are seeing this problem on so we can try to work out what
> might be causing this?

Sorry about that. This is an HP box with 192GB RAM and 6 2-core
hyperthreaded CPUs, running ubuntu/precise.

segerm@az1-sw-object-0006:~$ uname -a
Linux az1-sw-object-0006 2.6.38-16-server #68hf1026116v20120926-Ubuntu SMP Wed Sep 26 14:34:13 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

segerm@az1-sw-object-0006:~$ python --version
Python 2.7.1+

segerm@az1-sw-object-0006:~$ xfs_repair -V
xfs_repair version 3.1.4

segerm@az1-sw-object-0006:~$ cat /proc/meminfo
MemTotal:        198191696 kB
MemFree:         166202324 kB
Buffers:            193268 kB
Cached:           21595332 kB
SwapCached:             36 kB
Active:            7992268 kB
Inactive:         14068428 kB
Active(anon):       719408 kB
Inactive(anon):    5244092 kB
Active(file):      7272860 kB
Inactive(file):    8824336 kB
Unevictable:             0 kB
Mlocked:                 0 kB
SwapTotal:       126889980 kB
SwapFree:        126863972 kB
Dirty:                   0 kB
Writeback:               0 kB
AnonPages:          232156 kB
Mapped:              11384 kB
Shmem:             5731388 kB
Slab:              5755820 kB
SReclaimable:      4792292 kB
SUnreclaim:         963528 kB
KernelStack:          6240 kB
PageTables:           7328 kB
NFS_Unstable:            0 kB
Bounce:                  0 kB
WritebackTmp:            0 kB
CommitLimit:     225985828 kB
Committed_AS:      8851976 kB
VmallocTotal:  34359738367 kB
VmallocUsed:        725112 kB
VmallocChunk:  34159749436 kB
HardwareCorrupted:       0 kB
HugePages_Total:         0
HugePages_Free:          0
HugePages_Rsvd:          0
HugePages_Surp:          0
Hugepagesize:         2048 kB
DirectMap4k:        118648 kB
DirectMap2M:       5089280 kB
DirectMap1G:     196083712 kB

Over 60 mounts, but here's the one I'm writing to:

segerm@az1-sw-object-0006:~$ mount | grep disk0
/dev/sdc1 on /srv/node/disk0 type xfs (rw,nobarrier)

Not sure what you're looking for here, so here's it all:

segerm@az1-sw-object-0006:~$ cat /proc/partitions
major minor    #blocks  name

   8     0   976762584  sda
   8     1      248976  sda1
   8     2           1  sda2
   8     5   976510993  sda5
 251     0    41943040  dm-0
 251     1     8785920  dm-1
 251     2     2928640  dm-2
   8    16   976762584  sdb
   8    17   976760832  sdb1
 251     3   126889984  dm-3
 251     4      389120  dm-4
 251     5    41943040  dm-5
   8    32  2930233816  sdc
   8    33  2930233344  sdc1
   8    48  2930233816  sdd
   8    49  2930233344  sdd1
   8    64  2930233816  sde
   8    65  2930233344  sde1
   8    80  2930233816  sdf
   8    81  2930233344  sdf1
   8    96  2930233816  sdg
   8    97  2930233344  sdg1
   8   112  2930233816  sdh
   8   113  2930233344  sdh1
   8   128  2930233816  sdi
   8   129  2930233344  sdi1
   8   144  2930233816  sdj
   8   145  2930233344  sdj1
   8   160  2930233816  sdk
   8   161  2930233344  sdk1
   8   176  2930233816  sdl
   8   177  2930233344  sdl1
   8   192  2930233816  sdm
   8   193  2930233344  sdm1
   8   208  2930233816  sdn
   8   209  2930233344  sdn1
   8   240  2930233816  sdp
   8   241  2930233344  sdp1
   8   224  2930233816  sdo
   8   225  2930233344  sdo1
  65     0  2930233816  sdq
  65     1  2930233344  sdq1
  65    16  2930233816  sdr
  65    17  2930233344  sdr1
  65    32  2930233816  sds
  65    33  2930233344  sds1
  65   176  2930233816  sdab
  65   177  2930233344  sdab1
  66     0  2930233816  sdag
  66     1  2930233344  sdag1
  65    64  2930233816  sdu
  65    65  2930233344  sdu1
  65   128  2930233816  sdy
  65   129  2930233344  sdy1
  65   224  2930233816  sdae
  65   225  2930233344  sdae1
  65   144  2930233816  sdz
  65   145  2930233344  sdz1
  65   208  2930233816  sdad
  65   209  2930233344  sdad1
  65    48  2930233816  sdt
  65    49  2930233344  sdt1
  65   240  2930233816  sdaf
  65   241  2930233344  sdaf1
  65    96  2930233816  sdw
  65    97  2930233344  sdw1
  66    32  2930233816  sdai
  66    33  2930233344  sdai1
  66    48  2930233816  sdaj
  66    49  2930233344  sdaj1
  65   112  2930233816  sdx
  65   113  2930233344  sdx1
  65   160  2930233816  sdaa
  65   161  2930233344  sdaa1
  65    80  2930233816  sdv
  65    81  2930233344  sdv1
  65   192  2930233816  sdac
  65   193  2930233344  sdac1
  66    16  2930233816  sdah
  66    17  2930233344  sdah1
  66    64  2930233816  sdak
  66    65  2930233344  sdak1
  66   144  2930233816  sdap
  66   145  2930233344  sdap1
  66   224  2930233816  sdau
  66   225  2930233344  sdau1
  67   208  2930233816  sdbj
  67   209  2930233344  sdbj1
  66   112  2930233816  sdan
  66   113  2930233344  sdan1
  66    96  2930233816  sdam
  66    97  2930233344  sdam1
  66    80  2930233816  sdal
  66    81  2930233344  sdal1
  67    96  2930233816  sdbc
  67    97  2930233344  sdbc1
  66   128  2930233816  sdao
  66   129  2930233344  sdao1
  67   128  2930233816  sdbe
  67   129  2930233344  sdbe1
  66   176  2930233816  sdar
  66   177  2930233344  sdar1
  67   160  2930233816  sdbg
  67   161  2930233344  sdbg1
  66   160  2930233816  sdaq
  66   161  2930233344  sdaq1
  67    48  2930233816  sdaz
  67    49  2930233344  sdaz1
  66   208  2930233816  sdat
  66   209  2930233344  sdat1
  67   112  2930233816  sdbd
  67   113  2930233344  sdbd1
  67   144  2930233816  sdbf
  67   145  2930233344  sdbf1
  67    64  2930233816  sdba
  67    65  2930233344  sdba1
  67    16  2930233816  sdax
  67    17  2930233344  sdax1
  67    32  2930233816  sday
  67    33  2930233344  sday1
  66   192  2930233816  sdas
  66   193  2930233344  sdas1
  67   192  2930233816  sdbi
  67   193  2930233344  sdbi1
  66   240  2930233816  sdav
  66   241  2930233344  sdav1
  67    80  2930233816  sdbb
  67    81  2930233344  sdbb1
  67   176  2930233816  sdbh
  67   177  2930233344  sdbh1
  67     0  2930233816  sdaw
  67     1  2930233344  sdaw1

segerm@az1-sw-object-0006:~$ xfs_info /srv/node/disk0
meta-data=/dev/sdc1              isize=1024   agcount=32, agsize=22892416 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=732557312, imaxpct=5
         =                       sunit=64     swidth=64 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=357696, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Is this enough to describe my environment?

> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
>
> I don't know where you are getting your stats from. XFS is only
> responsible for what is in /proc/fs/xfs/...

I'm getting my stats from 2 places, collectl reading /proc/diskstats AND
xfs from /proc/fs/xfs, though I've only shown you the collectl stats. It's
easy enough to send you the /proc/fs/xfs stats if you want them, as both
formatted output and the raw /proc/fs/xfs contents every second from which
they're derived.

> Have you run strace to confirm that your python script is actually
> calling fallocate(), and it's not being emulated by python or libc?

I haven't.
However, when I saw this exact behavior with swift we did see fallocate
being called, and here's a small section showing a complete 10K PUT
operation and the beginning of the next one:

0.067874 cpu=0  pid=41977 fallocate [285] entry fd=15 mode=0x1 offset=0x0 len=10240
0.067980 cpu=0  pid=41977 block_rq_insert dev_t=0x04100030 wr=write flags=SYNC sector=0xaec11a00 len=262144
0.067983 cpu=0  pid=41977 block_rq_issue dev_t=0x04100030 wr=write flags=SYNC sector=0xaec11a00 len=262144
0.068010 cpu=0  pid=41977 sched_switch prio=120 state=SSLEEP next_pid=0 next_prio=120
0.068125 cpu=0  pid=0     block_rq_complete dev_t=0x04100030 wr=write flags=SYNC sector=0xaec11a00 len=262144 qpid=41977 spid=41977 qtm= 0.000004 svtm= 0.000142
0.068175 cpu=0  pid=48168 sched_wakeup target_pid=41977 prio=120 target_cpu=12 success=1
0.068180 cpu=12 pid=0     sched_switch prio=n/a state=n/a next_pid=41977 next_prio=120
0.068185 cpu=12 pid=41977 fallocate [285] ret=0x0 syscallbeg= 0.000311 fd=15 mode=0x1 offset=0x0 len=10240

I hope this helps, but if there's anything more I can provide I'll be
happy to do so.

-mark
* Re: definitions for /proc/fs/xfs/stat
From: Dave Chinner @ 2013-06-16 23:14 UTC
To: Mark Seger; Cc: Nathan Scott, xfs

On Sun, Jun 16, 2013 at 06:31:13PM -0400, Mark Seger wrote:
> > There is no way that fallocate() of 1000x1k files should be causing
> > 450MB/s of IO for 5 seconds.
>
> I agree, and that's what has me puzzled as well.
>
> Sorry about that. This is an HP box with 192GB RAM and 6 2-core
> hyperthreaded CPUs, running ubuntu/precise.
>
> segerm@az1-sw-object-0006:~$ uname -a
> Linux az1-sw-object-0006 2.6.38-16-server #68hf1026116v20120926-Ubuntu
> SMP Wed Sep 26 14:34:13 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

So it's running a pretty old Ubuntu something-or-other kernel. There's
only limited help I can give you for this kernel, as I've got no idea
what Ubuntu have put in it...

....

> segerm@az1-sw-object-0006:~$ xfs_info /srv/node/disk0
> meta-data=/dev/sdc1              isize=1024   agcount=32, agsize=22892416 blks
>          =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=732557312, imaxpct=5
>          =                       sunit=64     swidth=64 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal               bsize=4096   blocks=357696, version=2
>          =                       sectsz=512   sunit=64 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0

Ok, that's interesting - a 1k inode size, and sunit=swidth=256k. But it
doesn't cause a current kernel to reproduce the behaviour you are
seeing....

sunit=256k is interesting, because:

> 0.067874 cpu=0 pid=41977 fallocate [285] entry fd=15 mode=0x1 offset=0x0 len=10240
> 0.067980 cpu=0 pid=41977 block_rq_insert dev_t=0x04100030 wr=write flags=SYNC sector=0xaec11a00 len=262144

That's a write which is rounded up to 256k.

BTW, that's a trace for a 10k fallocate, not a 1k one, but regardless it
doesn't change behaviour on my TOT test kernel.

> I hope this helps, but if there's anything more I can provide I'll be
> happy to do so.

It doesn't tell me what XFS is doing with the fallocate call. Providing
the trace-cmd trace output from the FAQ might shed some light on it...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
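[For reference, the event-tracing recipe Dave alludes to looks roughly
like this; a sketch only - see the XFS FAQ for the authoritative steps:]

    # trace-cmd record -e xfs\* python falloc.py
    # trace-cmd report > falloc-trace.txt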
* Re: definitions for /proc/fs/xfs/stat
From: Mark Seger @ 2013-06-16 23:31 UTC
To: Dave Chinner; Cc: Nathan Scott, xfs@oss.sgi.com

Thanks for your feedback, I really do appreciate it. Tell you what, I'll
try to get a newer distro installed and see if I can reproduce. Either way
I'll let you know what I find.

-mark

Sent from my iPhone

On Jun 16, 2013, at 7:14 PM, Dave Chinner <david@fromorbit.com> wrote:
> ....
* Re: definitions for /proc/fs/xfs/stat
From: Nathan Scott @ 2013-06-17 1:11 UTC
To: Mark Seger, Dave Chinner; Cc: xfs

Hey guys,

----- Original Message -----
> OK, I have a simple reproducer. Try out the following, noting you'll
> obviously have to change the directory pointed to by dname:
>
> libc = ctypes.CDLL(ctypes.util.find_library('c'), use_errno=True)
> falloc = getattr(libc, 'fallocate')

This is using the glibc fallocate wrapper - I have vague memories of an
old libc which used to do per-page buffered writes providing a poor man's
implementation of fallocate; maybe somehow that older version/behaviour is
being triggered.

Running the test case on a RHEL6 box here, you should see patterns like
the attached ("pmchart -c XFSLog" - config attached too), which suggest
log traffic dominates (though I have no stripe-fu setup like you, Mark,
which adds another wrinkle).

> This is the same behavior I'm seeing with swift. ... maybe it's just a
> bug in the way the kernel reports preallocated blocks and nothing to do
> with XFS? Or is XFS responsible for the stats?

The device-level IOPS and bandwidth stats are accounted in the block
layer, below the filesystem (at that point there is no concept of
preallocation any more - something has submitted actual I/O), whereas the
XFS log traffic accounting is done inside XFS.

> > Basically, Swift is trying to emulate Direct IO because python
> > doesn't support Direct IO. Hence Swift is hacking around that problem

I think it is still possible, FWIW. One could use python ctypes (as in
Mark's test program) and achieve a page-aligned POSIX memalign, and some
quick googling suggests flags can be passed to open(2) via os.O_DIRECT.

cheers.
--
Nathan

[-- Attachment: XFSLog (pmchart config) --]
#kmchart version 1
chart title "IOPS over all Disks [%h]" style stacking
    plot legend "Reads" color #ffff00 metric disk.all.read
    plot legend "Writes" color #ee82ee metric disk.all.write
chart style plot antialiasing off
    plot color #ee82ee metric xfs.log.writes
chart title "Disk Throughput [%h]" style stacking
    plot legend "Read rate" color #ffff00 metric disk.all.read_bytes
    plot legend "Write rate" color #ee82ee metric disk.all.write_bytes
chart style plot antialiasing off
    plot color #ee82ee metric xfs.log.blocks

[-- Attachment: xfs-log-vs-fallocate.png (chart comparing disk IOPS/throughput with XFS log traffic) --]
* Re: definitions for /proc/fs/xfs/stat
From: Dave Chinner @ 2013-06-17 2:46 UTC
To: Nathan Scott; Cc: Mark Seger, xfs

On Sun, Jun 16, 2013 at 09:11:29PM -0400, Nathan Scott wrote:
> This is using the glibc fallocate wrapper - I have vague memories of an
> old libc which used to do per-page buffered writes providing a poor
> man's implementation of fallocate; maybe somehow that older
> version/behaviour is being triggered.
>
> Running the test case on a RHEL6 box here, you should see patterns like
> the attached ("pmchart -c XFSLog" - config attached too), which suggest
> log traffic dominates.

Must be an old version of RHEL6, because 6.4 doesn't do any IO at all,
same as upstream. This test workload is a purely metadata-only workload
(no data is written), and so it all gets gathered up by delayed logging.
That's something 2.6.38 (and RHEL6.0/6.1) doesn't have by default, so it
is going to write a fair bit of metadata to the log. But I wouldn't have
expected one IO per fallocate call.

Oh, we fixed this in 2.6.39:

    8287889 xfs: preallocation transactions do not need to be synchronous

So, fallocate() is synchronous in 2.6.38 (and probably RHEL 6.0/6.1), and
the filesystem has a log stripe unit of 256k, so that would explain the
256k IO per fallocate call - the log is forced, and so the ~500 bytes of
dirty metadata gets padded to the full log stripe (i.e. 256k) and written
synchronously.

So there's the reason for the 256k write per file being written by swift.
Have I mentioned anything about weird side effects occurring as a result
of trying to emulate direct IO before? :)

> I think it is still possible, FWIW. One could use python ctypes (as in
> Mark's test program) and achieve a page-aligned POSIX memalign,

I wasn't aware you could get memalign() through python at all. I went
looking for this exact solution a couple of months ago when these problems
started to be reported, and couldn't find anything related to direct IO on
python with google except for "it can't be done", "it doesn't work" and a
patch that was rejected years ago to support it natively.

> and some quick googling suggests flags can be passed to open(2) via
> os.O_DIRECT.
Yup, the python manual that documents this kind of thing is what I'd
expect to show up as the number one hit when you google "python O_DIRECT
open flags", wouldn't you think? All I get with that is "O_DIRECT doesn't
work" bug reports and blog posts. If I drop the O_DIRECT out of the search
phrase, the first hit is the python documentation about open flags, and it
documents that O_DIRECT can be passed. And if I use different search
phrases for memalign without mentioning direct IO, I see lots of tricks
people use to get this functionality in python. <sigh>

Google has been letting me down like this quite a bit over the past few
months when it comes to searching for stuff related to development. It's
getting harder to find stuff amongst the noise of whiny blogs, forums, and
other places where people do nothing but complain about broken shite that
google seems to think is more important than a real reference manual on
the topic being searched. Is there anything better out there yet? Like
from years ago, when the google "I'm feeling lucky" button used to take
you directly to the exact page of the reference manual relevant to the
topic being searched?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
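[A back-of-the-envelope check of Dave's log-stripe-padding explanation
against the numbers earlier in the thread, assuming one synchronous,
stripe-padded log write per fallocate() as the trace showed. Note the
reproducer's output above ("Bytes: 9") implies that run actually created
10000 files, not the 1000 in the posted listing:

    10000 fallocate calls x 256 KB of padded log IO  ~= 2.4 GB
    2.4 GB / 5.84 s                                  ~= 430 MB/s

which matches the ~450MB/s collectl reported. Likewise, a 1000-file run
implies ~256 MB of log IO, consistent with the ~200MB/sec Mark saw in his
original 1000-file tests.]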
* Re: definitions for /proc/fs/xfs/stat
From: Nathan Scott @ 2013-06-17 5:41 UTC
To: Dave Chinner; Cc: Mark Seger, xfs

Hey Dave,

----- Original Message -----
> ...
> Must be an old version of RHEL6, because 6.4 doesn't do any IO at all,
> same as upstream. This test workload is a purely metadata-only workload
> (no data is written), and so it all gets gathered up by delayed logging.

*nod* - RHEL6.3.

> > I think it is still possible, FWIW. One could use python ctypes (as
> > in Mark's test program) and achieve a page-aligned POSIX memalign,
>
> I wasn't aware you could get memalign() through python at all. I went
> looking for this exact solution a couple of months ago when these
> problems started to be reported, and couldn't find anything ...

Yes, on reflection it doesn't jive too well with the way python wants to
do reads, in particular - os.read takes a file and a size; there's no
buffer exposed at the API level (for input).

It would need to be a separate python module to the core set I guess (with
a C component), and a slightly different API - or at least some additional
APIs which can take in an aligned buffer, rather than just allocating one
each time - but I believe it's still feasible.

cheers.

--
Nathan
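[A minimal sketch of the write side of this idea, for Python 2 as used in
the thread. It assumes that os.write() accepts any object supporting the
buffer interface (true for CPython 2.x) and that anonymous mmap regions
are page-aligned, which satisfies O_DIRECT's alignment rules; the path and
sizes are illustrative only:]

    import mmap
    import os

    BLKSIZE = 4096    # must be a multiple of the device's logical sector size

    # an anonymous mmap gives a page-aligned buffer without ctypes/memalign
    buf = mmap.mmap(-1, BLKSIZE)
    buf.write('x' * BLKSIZE)

    fd = os.open('/srv/node/disk0/obj', os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0644)
    os.write(fd, buf)    # length and alignment both satisfy O_DIRECT
    os.close(fd)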
* Re: definitions for /proc/fs/xfs/stat
From: Mark Seger @ 2013-06-17 10:57 UTC
To: Nathan Scott; Cc: xfs

All - good conversation, and again, thanks for digging into this. The
comment about me running on an older kernel seems to be the answer; by
rerunning my test on precise/3.5.0-23-generic all seems to be operating
correctly, so I guess that was it.

However, the one thing that does jump out of this is that the /proc/fs/xfs
stats and PCP were both showing many hundreds of MB/sec during tests that
only ran for a few seconds, which is impossible, so it still feels like
some sort of accounting bug to me. On the other hand, given that this was
an older kernel and newer kernels are fine, perhaps it's something just to
note and not worry about.

Thanks again...

-mark

On Mon, Jun 17, 2013 at 1:41 AM, Nathan Scott <nathans@redhat.com> wrote:
> ....
* Re: definitions for /proc/fs/xfs/stat
From: Dave Chinner @ 2013-06-17 11:13 UTC
To: Mark Seger; Cc: Nathan Scott, xfs

On Mon, Jun 17, 2013 at 06:57:14AM -0400, Mark Seger wrote:
> All - good conversation, and again, thanks for digging into this. The
> comment about me running on an older kernel seems to be the answer; by
> rerunning my test on precise/3.5.0-23-generic all seems to be operating
> correctly, so I guess that was it.

OK, good to know.

> However, the one thing that does jump out of this is that the
> /proc/fs/xfs stats and PCP were both showing many hundreds of MB/sec
> during tests that only ran for a few seconds, which is impossible, so
> it still feels like some sort of accounting bug to me.

How big is the write cache in your RAID array? If the log is the only
thing being written to, and it fits within the cache, then it can easily
push several hundred MB/s of write IO....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: definitions for /proc/fs/xfs/stat
  2013-06-17 11:13 ` Dave Chinner
@ 2013-06-17 14:57 ` Mark Seger
  2013-06-17 20:28   ` Stefan Ring
  2013-06-19 23:02   ` Useful stats (was Re: definitions for /proc/fs/xfs/stat) Nathan Scott
  0 siblings, 2 replies; 28+ messages in thread
From: Mark Seger @ 2013-06-17 14:57 UTC (permalink / raw)
To: Dave Chinner; +Cc: Nathan Scott, xfs

[-- Attachment #1.1: Type: text/plain, Size: 854 bytes --]

> How big is the write cache in your RAID array? If the log is the
> only thing being written to, and it fits within the cache, then it
> can easily push several hundred MB/s of write IO....

I asked around and people believe the cache is on the order of a few GB,
and the test I ran was intentionally large enough to overwhelm any cache
effects, running for about a minute and doing 100K 1K file creates. The
disk write rate was close to a sustained 475MB/sec and would have easily
filled the cache in the first handful of seconds, which should have
produced enough backpressure to slow down the writes - but that never
happened.

On a totally different topic, and if you like we can start a different
thread on it, I'd be interested in adding some monitoring stats to collectl
for xfs and could use some suggestions as to which are the most important.

-mark

[-- Attachment #1.2: Type: text/html, Size: 1193 bytes --]

[-- Attachment #2: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 28+ messages in thread
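To put numbers on why this result looked impossible: at the reported rate the
test wrote roughly seventy times more to disk than the files themselves could
account for. A quick back-of-envelope check, using only figures quoted in this
thread (the one-4k-block-per-1k-file assumption comes up later in the thread):

    # Sanity check of the figures reported above; a sketch, not
    # measurement code - all inputs are as quoted in this thread.
    nfiles = 100000       # 100K 1K file creates
    blocksize = 4096      # each 1k file still consumes one 4k block
    runtime = 60          # "running for about a minute"
    rate = 475e6          # sustained ~475MB/sec of disk writes

    expected = nfiles * blocksize   # ~410 MB of actual file data
    observed = runtime * rate       # ~28.5 GB of reported disk I/O
    print('expected data: %.0f MB' % (expected / 1e6))
    print('observed I/O: %.1f GB' % (observed / 1e9))
    print('ratio: ~%.0fx' % (observed / expected))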
* Re: definitions for /proc/fs/xfs/stat
  2013-06-17 14:57 ` Mark Seger
@ 2013-06-17 20:28 ` Stefan Ring
  2013-06-18  0:15   ` Dave Chinner
  2013-06-19 23:02 ` Useful stats (was Re: definitions for /proc/fs/xfs/stat) Nathan Scott
  1 sibling, 1 reply; 28+ messages in thread
From: Stefan Ring @ 2013-06-17 20:28 UTC (permalink / raw)
To: Mark Seger; +Cc: Nathan Scott, Linux fs XFS

> I asked around and people believe the cache is on the order of a few GB,
> and the test I ran was intentionally large enough to overwhelm any cache
> effects, running for about a minute and doing 100K 1K file creates. The
> disk write rate was close to a sustained 475MB/sec and would have easily
> filled the cache in the first handful of seconds, which should have
> produced enough backpressure to slow down the writes - but that never
> happened.

If a limited number of blocks gets written over and over, you won't
ever fill the cache.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: definitions for /proc/fs/xfs/stat
  2013-06-17 20:28 ` Stefan Ring
@ 2013-06-18  0:15 ` Dave Chinner
  2013-06-18 10:17   ` Mark Seger
  0 siblings, 1 reply; 28+ messages in thread
From: Dave Chinner @ 2013-06-18 0:15 UTC (permalink / raw)
To: Stefan Ring; +Cc: Nathan Scott, Mark Seger, Linux fs XFS

On Mon, Jun 17, 2013 at 10:28:19PM +0200, Stefan Ring wrote:
> > I asked around and people believe the cache is on the order of a few GB,
> > and the test I ran was intentionally large enough to overwhelm any cache
> > effects, running for about a minute and doing 100K 1K file creates. The
> > disk write rate was close to a sustained 475MB/sec and would have easily
> > filled the cache in the first handful of seconds, which should have
> > produced enough backpressure to slow down the writes - but that never
> > happened.
>
> If a limited number of blocks gets written over and over, you won't
> ever fill the cache.

Right, and the XFS log is circular and about 1GB in size on the
filesystem under test, so it should fit completely in cache. Hence
speeds faster than a physical disk are achievable if the log stays
resident in the cache....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: definitions for /proc/fs/xfs/stat
  2013-06-18  0:15 ` Dave Chinner
@ 2013-06-18 10:17 ` Mark Seger
  0 siblings, 0 replies; 28+ messages in thread
From: Mark Seger @ 2013-06-18 10:17 UTC (permalink / raw)
To: Dave Chinner; +Cc: Stefan Ring, Nathan Scott, Linux fs XFS

[-- Attachment #1.1: Type: text/plain, Size: 1822 bytes --]

mark - good point about the controller cache and the 1-byte/block writes,
but remember my files in this case are only 1 block.

dave - a long time ago I found/reported a bug in the way linux was doing
their block stats. They were basically incrementing the byte counts as
blocks were written to cache, and the reported numbers were ridiculously
high. I had wondered if something like this could be going on, but remember
the numbers being reported by xfs are much too high even for that. I did
10K 1K writes, which I do understand are really one 4k page each, but
that's still only 40MB. If I add up all the 500MB/sec samples xfs logging
is reporting (even one second of which is over 10 times that 40MB), I see
something on the order of 10GB. But again this IS with the older kernel and
so may not be worth worrying about.

-mark

On Mon, Jun 17, 2013 at 8:15 PM, Dave Chinner <david@fromorbit.com> wrote:

> On Mon, Jun 17, 2013 at 10:28:19PM +0200, Stefan Ring wrote:
> > > I asked around and people believe the cache is on the order of a few
> > > GB, and the test I ran was intentionally large enough to overwhelm
> > > any cache effects, running for about a minute and doing 100K 1K file
> > > creates. The disk write rate was close to a sustained 475MB/sec and
> > > would have easily filled the cache in the first handful of seconds,
> > > which should have produced enough backpressure to slow down the
> > > writes - but that never happened.
> >
> > If a limited number of blocks gets written over and over, you won't
> > ever fill the cache.
>
> Right, and the XFS log is circular and about 1GB in size on the
> filesystem under test, so it should fit completely in cache. Hence
> speeds faster than a physical disk are achievable if the log stays
> resident in the cache....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

[-- Attachment #1.2: Type: text/html, Size: 2405 bytes --]

[-- Attachment #2: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Useful stats (was Re: definitions for /proc/fs/xfs/stat)
  2013-06-17 14:57 ` Mark Seger
  2013-06-17 20:28   ` Stefan Ring
@ 2013-06-19 23:02 ` Nathan Scott
  1 sibling, 0 replies; 28+ messages in thread
From: Nathan Scott @ 2013-06-19 23:02 UTC (permalink / raw)
To: Mark Seger; +Cc: xfs

Hi Mark,

----- Original Message -----
> ...
> On a totally different topic, and if you like we can start a different
> thread on it, I'd be interested in adding some monitoring stats to
> collectl for xfs and could use some suggestions as to which are the most
> important.

Did you receive any other feedback on this (offlist perhaps)? I was kinda
hoping someone else would chime in, but can share our experiences with PCP
if it helps. We provide options to record all xfs.* metrics & an
alternative to record just the core I/O stats (so log I/O and the file
read/write metrics). With PCP one can record any subset at any rate, so
those are just convenience / default suggestions.

In terms of recording in collectl, not sure what the requirements/parameters
are for you, but I'd suggest recording all the data (in save-to-disk mode).
IIRC you've got brief/verbose reporting modes - I'd focus on the log, file
I/O and buffer I/O stats when being "brief", and expand out to showing
in-core inode/dquot stats, ail/transaction stats, dir/attr operation stats.
You might consider another very-verbose kinda view with all of the various
btree stats, but that may be going too far.

Note that these are all aggregate statistics; there has been on-and-off
discussion for many years about adding some per-filesystem XFS stats, but
to date that has not happened - perhaps someday it will, so keep that in
mind for collectl too I guess. Also, be aware there is a separate file for
quota metrics, for hysterical raisins (/proc/fs/xfs/xqmstat).

HTH.

cheers.

--
Nathan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 28+ messages in thread
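Rendering Nathan's suggestion as code: a minimal sketch of how a
collectl-style sampler might read these counters. The tag names follow the
Runtime_Stats page referenced earlier in the thread; the brief/verbose
grouping below is just his suggestion made concrete, not an established
collectl convention.

    # Minimal sampler for /proc/fs/xfs/stat (plus the separate quota
    # file). Each line is a tag followed by unsigned counters; see
    # http://xfs.org/index.php/Runtime_Stats for the field meanings.
    def read_xfs_stats(path='/proc/fs/xfs/stat'):
        stats = {}
        with open(path) as f:
            for line in f:
                fields = line.split()
                stats[fields[0]] = [int(v) for v in fields[1:]]
        return stats

    stats = read_xfs_stats()
    stats.update(read_xfs_stats('/proc/fs/xfs/xqmstat'))  # quota counters

    # "Brief" mode: log, file I/O (rw) and buffer (buf) counters.
    brief = ['log', 'rw', 'buf']
    # "Verbose" adds inode, quota, AIL/transaction, dir/attr counters.
    verbose = brief + ['ig', 'qm', 'trans', 'push_ail', 'dir', 'attr']
    for tag in verbose:
        print(tag, stats.get(tag))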
* Re: definitions for /proc/fs/xfs/stat
  2013-06-17  5:41 ` Nathan Scott
  2013-06-17 10:57   ` Mark Seger
@ 2013-06-17 11:19 ` Dave Chinner
  2013-06-17 13:18   ` Stan Hoeppner
  1 sibling, 1 reply; 28+ messages in thread
From: Dave Chinner @ 2013-06-17 11:19 UTC (permalink / raw)
To: Nathan Scott; +Cc: Mark Seger, xfs

On Mon, Jun 17, 2013 at 01:41:50AM -0400, Nathan Scott wrote:
> Hey Dave,
>
> ----- Original Message -----
> > ...
> > Must be an old version of RHEL6, because 6.4 doesn't do any IO at
> > all, same as upstream. This test workload is purely a metadata only
> > workload (no data is written) and so it all gets gathered up by
> > delayed logging.
>
> *nod* - RHEL6.3.
>
> > > I think it is still possible, FWIW. One could use python ctypes (as in
> > > Mark's test program) and achieve a page-aligned POSIX memalign,
> >
> > I wasn't aware you could get memalign() through python at all. I
> > went looking for this exact solution a couple of months ago when
> > these problems started to be reported and couldn't find anything
> > ...
>
> Yes, on reflection it doesn't jibe too well with the way python wants
> to do reads, in particular - os.read takes a file and a size, there's
> no buffer exposed at the API level (for input).
>
> It would need to be a separate python module to the core set I guess
> (with a C component), and a slightly different API - or at least some
> additional APIs which can take in an aligned buffer, rather than just
> allocating one each time - but I believe it's still feasible.

Oh, there were modules written to do this years ago. e.g. this from
2006:

https://pypi.python.org/pypi/directio/1.0

There were even patches to provide a native interface in 2004, but
they got shot down, like all the future enhancement requests that
have happened since, mainly because the python maintainers don't
believe that Direct IO is necessary.

I mean, Linus said Direct IO is for deranged monkeys, so nobody
should ever use direct IO, right? ;)

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 28+ messages in thread
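For the curious, the aligned-buffer approach Nathan and Dave are debating can
be sketched in a few lines of ctypes. posix_memalign() and read() here are
real glibc entry points, but treat this as an illustration of the idea rather
than a hardened direct-I/O module:

    import ctypes
    import ctypes.util
    import os

    libc = ctypes.CDLL(ctypes.util.find_library('c'), use_errno=True)

    def direct_read(path, size, align=4096):
        # O_DIRECT requires the buffer address, file offset and length
        # to all be suitably aligned; size should be a multiple of
        # align here.
        buf = ctypes.c_void_p()
        rc = libc.posix_memalign(ctypes.byref(buf), align, size)
        if rc != 0:
            raise MemoryError('posix_memalign returned %d' % rc)
        fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
        try:
            # Read into our aligned buffer via libc, since os.read()
            # allocates its own (unaligned) buffer internally.
            n = libc.read(fd, buf, size)
            if n < 0:
                err = ctypes.get_errno()
                raise OSError(err, os.strerror(err))
            return ctypes.string_at(buf, n)
        finally:
            os.close(fd)
            libc.free(buf)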
* Re: definitions for /proc/fs/xfs/stat
  2013-06-17 11:19 ` definitions for /proc/fs/xfs/stat Dave Chinner
@ 2013-06-17 13:18 ` Stan Hoeppner
  0 siblings, 0 replies; 28+ messages in thread
From: Stan Hoeppner @ 2013-06-17 13:18 UTC (permalink / raw)
To: xfs

On 6/17/2013 6:19 AM, Dave Chinner wrote:
> I mean, Linus said Direct IO is for deranged monkeys, so nobody
> should ever use direct IO, right? ;)

As it turns out, there are simply far more deranged monkeys running
loose than Linus anticipated. :)

--
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: definitions for /proc/fs/xfs/stat
  2013-06-17  1:11 ` Nathan Scott
  2013-06-17  2:46   ` Dave Chinner
@ 2013-06-18  0:13 ` Mark Goodwin
  1 sibling, 0 replies; 28+ messages in thread
From: Mark Goodwin @ 2013-06-18 0:13 UTC (permalink / raw)
To: Mark Seger; +Cc: Nathan Scott, xfs

On 06/17/2013 11:11 AM, Nathan Scott wrote:
>> ok, I have a simple reproducer. try out the following, noting you'll
>> obviously have to change the directory pointed to by dname:
>>
>> libc=ctypes.CDLL(ctypes.util.find_library('c'), use_errno=True)
>> falloc=getattr(libc, 'fallocate')
>>
>
> This is using the glibc fallocate wrapper - I have vague memories of an
> old libc which used to do per-page buffered writes providing a poor-man's
> implementation of fallocate, maybe somehow that older version/behaviour
> is being triggered.

Python probably tipped you into a posix_fallocate() bog hole - it writes
one byte to every block in the range offset .. offset+len as the portable
(but poor-man's) implementation of fallocate(). This, combined with your
controller cache, could explain the unexpected I/O.

See glibc:sysdeps/posix/posix_fallocate.c

--
Mark

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 28+ messages in thread
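For illustration, the write pattern Mark describes can be sketched in Python.
This is simplified: the real glibc code is C, uses the file's st_blksize, and
handles EOF and error cases more carefully - the point is only the shape of
the I/O it generates.

    import os

    def posix_fallocate_fallback(fd, offset, length, blocksize=4096):
        # Touch the last byte of every block in the range so the blocks
        # get allocated - "preallocation" becomes one tiny buffered
        # write per block, which is exactly the I/O storm seen in this
        # thread.
        end = offset + length
        pos = (offset // blocksize + 1) * blocksize - 1
        while pos < end:
            existing = os.pread(fd, 1, pos)  # preserve any existing data
            os.pwrite(fd, existing if existing else b'\0', pos)
            pos += blocksize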
* Re: definitions for /proc/fs/xfs/stat
  2013-06-15 10:35 ` Mark Seger
  2013-06-15 16:22   ` Mark Seger
@ 2013-06-16  0:00 ` Dave Chinner
  1 sibling, 0 replies; 28+ messages in thread
From: Dave Chinner @ 2013-06-16 0:00 UTC (permalink / raw)
To: Mark Seger; +Cc: Nathan Scott, xfs

On Sat, Jun 15, 2013 at 06:35:02AM -0400, Mark Seger wrote:
> Basically everything I do is with collectl, a tool I wrote and opensourced
> almost 10 years ago. Its numbers are very accurate - I've compared with
> iostat on numerous occasions whenever I might have had doubts and they
> always agree. Since both tools get their data from the same place,
> /proc/diskstats, it's hard for them not to agree AND its numbers also
> agree with /proc/fs/xfs.

Ok, that's all I wanted to know.

> happening?
>
> To restate what's going on, I have a very simple script that duplicates
> what openstack swift is doing, namely creating a file with mkstemp and
> then running fallocate against it. The files are being created with a
> size of zero but it seems that xfs is generating a ton of logging
> activity. I had read your post back in 2011 about speculative
> preallocation and can't help but wonder if that's what's hitting me here.
> I also saw where system memory can come into play and this box has 192GB
> and 12 hyperthreaded cores.
>
> I also tried one more run without fallocate, this is creating 10000 1K
> files, which should be about 10MB, and it looks like it's still doing
> 140MB of I/O which still feels like a lot but at least it's less than the

1k files will still write 4k filesystem blocks, so there's going to
be 40MB there at least. As it is, I ran a bunch of tests yesterday
writing 4k files, and I got 180MB/s @ 32,000 files/s. That's roughly
130MB/s for data, and another 50MB/s for log and metadata traffic.

But without knowing your test configuration and using your test
script, I can't compare those results to yours. Can you provide the
information in:

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

> If there is anything more I can provide I'll be happy to do so. Actually
> I should point out I can easily generate graphs and if you'd like to see
> some examples I can provide those too.

PCP generates realtime graphs, which is what I use ;)

> Also, if there is anything I can report
> from /proc/fs/xfs I can relatively easily do that as well and display it
> side by side with the disk I/O.

Let's see if there is something unusual in your setup that might
explain it first...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 28+ messages in thread
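For readers who don't have Mark's script, the workload under discussion is
roughly this shape - a sketch only, reusing the ctypes setup quoted earlier
in the thread; the directory, file count and sizes are placeholders, not his
actual test parameters:

    import ctypes
    import ctypes.util
    import os
    import tempfile

    libc = ctypes.CDLL(ctypes.util.find_library('c'), use_errno=True)
    falloc = libc.fallocate
    # Declare the 64-bit offset/length arguments explicitly for ctypes.
    falloc.argtypes = [ctypes.c_int, ctypes.c_int,
                       ctypes.c_longlong, ctypes.c_longlong]

    def churn(dname, nfiles=10000, size=1024):
        # mkstemp a file, fallocate a small range into it, repeat.
        for _ in range(nfiles):
            fd, path = tempfile.mkstemp(dir=dname)
            try:
                if falloc(fd, 0, 0, size) != 0:
                    err = ctypes.get_errno()
                    raise OSError(err, os.strerror(err))
            finally:
                os.close(fd)

    churn('/mnt/xfs/swift-sim')  # hypothetical test directory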
end of thread, other threads:[~2013-06-19 23:02 UTC | newest]

Thread overview: 28+ messages
2013-06-14 16:37 definitions for /proc/fs/xfs/stat Mark Seger
2013-06-14 22:16 ` Nathan Scott
2013-06-14 22:37   ` Mark Seger
2013-06-15  0:17     ` Nathan Scott
2013-06-15  1:55       ` Mark Seger
2013-06-15  2:04         ` Dave Chinner
2013-06-15 10:35           ` Mark Seger
2013-06-15 16:22             ` Mark Seger
2013-06-16  0:11               ` Dave Chinner
2013-06-16 12:58                 ` Mark Seger
2013-06-16 22:06                   ` Dave Chinner
2013-06-16 22:31                     ` Mark Seger
2013-06-16 23:14                       ` Dave Chinner
2013-06-16 23:31                         ` Mark Seger
2013-06-17  1:11                           ` Nathan Scott
2013-06-17  2:46                             ` Dave Chinner
2013-06-17  5:41                               ` Nathan Scott
2013-06-17 10:57                                 ` Mark Seger
2013-06-17 11:13                                   ` Dave Chinner
2013-06-17 14:57                                     ` Mark Seger
2013-06-17 20:28                                       ` Stefan Ring
2013-06-18  0:15                                         ` Dave Chinner
2013-06-18 10:17                                           ` Mark Seger
2013-06-19 23:02                                       ` Useful stats (was Re: definitions for /proc/fs/xfs/stat) Nathan Scott
2013-06-17 11:19                                 ` definitions for /proc/fs/xfs/stat Dave Chinner
2013-06-17 13:18                                   ` Stan Hoeppner
2013-06-18  0:13                             ` Mark Goodwin
2013-06-16  0:00               ` Dave Chinner