public inbox for linux-xfs@vger.kernel.org
* task blocked for more than 120 seconds
@ 2012-04-18 15:11 Josef 'Jeff' Sipek
  2012-04-18 18:28 ` Ben Myers
  2012-04-18 23:48 ` Dave Chinner
  0 siblings, 2 replies; 13+ messages in thread
From: Josef 'Jeff' Sipek @ 2012-04-18 15:11 UTC (permalink / raw)
  To: xfs

Greetings!  I have a file server that gets a pretty nasty load (about 15
million files created every day).  After some time, I noticed that the load
average spiked up from the usual 30 to about 180.  dmesg revealed:

[434042.318401] INFO: task php:2185 blocked for more than 120 seconds.
[434042.318403] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[434042.318405] php             D 000000010675d6cd     0  2185  27306 0x00000000
[434042.318408]  ffff88008d735a48 0000000000000086 ffff88008d735938 ffffffff00000000
[434042.318412]  ffff88008d734010 ffff88000e28e340 0000000000012000 ffff88008d735fd8
[434042.318416]  ffff88008d735fd8 0000000000012000 ffff8807ef9966c0 ffff88000e28e340
[434042.318419] Call Trace:
[434042.318442]  [<ffffffffa0087a9b>] ? xfs_trans_brelse+0xee/0xf7 [xfs]
[434042.318464]  [<ffffffffa00689de>] ? xfs_da_brelse+0x71/0x96 [xfs]
[434042.318485]  [<ffffffffa006df10>] ? xfs_dir2_leaf_lookup_int+0x211/0x225 [xfs]
[434042.318489]  [<ffffffff8141481e>] schedule+0x55/0x57
[434042.318512]  [<ffffffffa0083de2>] xlog_reserveq_wait+0x115/0x1c0 [xfs]
[434042.318515]  [<ffffffff810381f1>] ? try_to_wake_up+0x23d/0x23d
[434042.318539]  [<ffffffffa0083f45>] xlog_grant_log_space+0xb8/0x1be [xfs]
[434042.318562]  [<ffffffffa0084164>] xfs_log_reserve+0x119/0x133 [xfs]
[434042.318585]  [<ffffffffa0080cf1>] xfs_trans_reserve+0xca/0x199 [xfs]
[434042.318605]  [<ffffffffa00500dc>] xfs_create+0x18d/0x467 [xfs]
[434042.318623]  [<ffffffffa00485be>] xfs_vn_mknod+0xa0/0xf9 [xfs]
[434042.318640]  [<ffffffffa0048632>] xfs_vn_create+0xb/0xd [xfs]
[434042.318644]  [<ffffffff810f0c5d>] vfs_create+0x6e/0x9e
[434042.318647]  [<ffffffff810f1c5e>] do_last+0x302/0x642
[434042.318651]  [<ffffffff810f2068>] path_openat+0xca/0x344
[434042.318654]  [<ffffffff810f23d1>] do_filp_open+0x38/0x87
[434042.318658]  [<ffffffff810fb22e>] ? alloc_fd+0x76/0x11e
[434042.318661]  [<ffffffff810e40b1>] do_sys_open+0x10b/0x1a4
[434042.318664]  [<ffffffff810e4173>] sys_open+0x1b/0x1d

It makes sense that the load average would spike up if some major lock got
held longer than it should have been.
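
Side note: tasks in uninterruptible sleep ('D' state) count toward the load
average even though they use no CPU, which is why blocked tasks show up as a
spike.  A quick way to count them with plain ps/awk (nothing XFS-specific
assumed):

```shell
# Count tasks currently in uninterruptible sleep (state 'D'); the load
# average includes these even though they burn no CPU.
ps -eo stat= | awk '$1 ~ /^D/ { n++ } END { print n + 0 }'
```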

The box has 32GB RAM, 6 cores, and it's running 3.2.2.

I've looked at the commits in the stable tree since 3.2.2 was tagged, and I
do see a couple of useful commits so I'll try to get the kernel updated
anyway but I don't quite see any of those fixes addressing this "hang".

Thanks,

Jeff.

-- 
Research, n.:
  Consider Columbus:
    He didn't know where he was going.
    When he got there he didn't know where he was.
    When he got back he didn't know where he had been.
    And he did it all on someone else's money.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

* Re: task blocked for more than 120 seconds
  2012-04-18 15:11 task blocked for more than 120 seconds Josef 'Jeff' Sipek
@ 2012-04-18 18:28 ` Ben Myers
  2012-04-18 23:48 ` Dave Chinner
  1 sibling, 0 replies; 13+ messages in thread
From: Ben Myers @ 2012-04-18 18:28 UTC (permalink / raw)
  To: Josef 'Jeff' Sipek; +Cc: xfs

Hey Josef,

On Wed, Apr 18, 2012 at 11:11:40AM -0400, Josef 'Jeff' Sipek wrote:
> Greetings!  I have a file server that gets a pretty nasty load (about 15
> million files created every day).  After some time, I noticed that the load
> average spiked up from the usual 30 to about 180.  dmesg revealed:
> 
> [434042.318401] INFO: task php:2185 blocked for more than 120 seconds.
> [434042.318403] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [434042.318405] php             D 000000010675d6cd     0  2185  27306 0x00000000
> [434042.318408]  ffff88008d735a48 0000000000000086 ffff88008d735938 ffffffff00000000
> [434042.318412]  ffff88008d734010 ffff88000e28e340 0000000000012000 ffff88008d735fd8
> [434042.318416]  ffff88008d735fd8 0000000000012000 ffff8807ef9966c0 ffff88000e28e340
> [434042.318419] Call Trace:
> [434042.318442]  [<ffffffffa0087a9b>] ? xfs_trans_brelse+0xee/0xf7 [xfs]
> [434042.318464]  [<ffffffffa00689de>] ? xfs_da_brelse+0x71/0x96 [xfs]
> [434042.318485]  [<ffffffffa006df10>] ? xfs_dir2_leaf_lookup_int+0x211/0x225 [xfs]
> [434042.318489]  [<ffffffff8141481e>] schedule+0x55/0x57
> [434042.318512]  [<ffffffffa0083de2>] xlog_reserveq_wait+0x115/0x1c0 [xfs]
> [434042.318515]  [<ffffffff810381f1>] ? try_to_wake_up+0x23d/0x23d
> [434042.318539]  [<ffffffffa0083f45>] xlog_grant_log_space+0xb8/0x1be [xfs]
> [434042.318562]  [<ffffffffa0084164>] xfs_log_reserve+0x119/0x133 [xfs]
> [434042.318585]  [<ffffffffa0080cf1>] xfs_trans_reserve+0xca/0x199 [xfs]
> [434042.318605]  [<ffffffffa00500dc>] xfs_create+0x18d/0x467 [xfs]
> [434042.318623]  [<ffffffffa00485be>] xfs_vn_mknod+0xa0/0xf9 [xfs]
> [434042.318640]  [<ffffffffa0048632>] xfs_vn_create+0xb/0xd [xfs]
> [434042.318644]  [<ffffffff810f0c5d>] vfs_create+0x6e/0x9e
> [434042.318647]  [<ffffffff810f1c5e>] do_last+0x302/0x642
> [434042.318651]  [<ffffffff810f2068>] path_openat+0xca/0x344
> [434042.318654]  [<ffffffff810f23d1>] do_filp_open+0x38/0x87
> [434042.318658]  [<ffffffff810fb22e>] ? alloc_fd+0x76/0x11e
> [434042.318661]  [<ffffffff810e40b1>] do_sys_open+0x10b/0x1a4
> [434042.318664]  [<ffffffff810e4173>] sys_open+0x1b/0x1d
> 
> It makes sense that the load average would spike up if some major lock got
> held longer than it should have been.
> 
> The box has 32GB RAM, 6 cores, and it's running 3.2.2.
> 
> I've looked at the commits in the stable tree since 3.2.2 was tagged, and I
> do see a couple of useful commits so I'll try to get the kernel updated
> anyway but I don't quite see any of those fixes addressing this "hang".

I was about to suggest that 9f9c19e 'xfs: fix the logspace waiting algorithm'
is probably what you need... but that's already in 3.2.2.  Hrm.

-Ben

* Re: task blocked for more than 120 seconds
  2012-04-18 15:11 task blocked for more than 120 seconds Josef 'Jeff' Sipek
  2012-04-18 18:28 ` Ben Myers
@ 2012-04-18 23:48 ` Dave Chinner
  2012-04-19 15:46   ` Josef 'Jeff' Sipek
  1 sibling, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2012-04-18 23:48 UTC (permalink / raw)
  To: Josef 'Jeff' Sipek; +Cc: xfs

On Wed, Apr 18, 2012 at 11:11:40AM -0400, Josef 'Jeff' Sipek wrote:
> Greetings!  I have a file server that gets a pretty nasty load (about 15
> million files created every day).  After some time, I noticed that the load
> average spiked up from the usual 30 to about 180.  dmesg revealed:
> 
> [434042.318401] INFO: task php:2185 blocked for more than 120 seconds.
> [434042.318403] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [434042.318405] php             D 000000010675d6cd     0  2185  27306 0x00000000
> [434042.318408]  ffff88008d735a48 0000000000000086 ffff88008d735938 ffffffff00000000
> [434042.318412]  ffff88008d734010 ffff88000e28e340 0000000000012000 ffff88008d735fd8
> [434042.318416]  ffff88008d735fd8 0000000000012000 ffff8807ef9966c0 ffff88000e28e340
> [434042.318419] Call Trace:
> [434042.318442]  [<ffffffffa0087a9b>] ? xfs_trans_brelse+0xee/0xf7 [xfs]
> [434042.318464]  [<ffffffffa00689de>] ? xfs_da_brelse+0x71/0x96 [xfs]
> [434042.318485]  [<ffffffffa006df10>] ? xfs_dir2_leaf_lookup_int+0x211/0x225 [xfs]
> [434042.318489]  [<ffffffff8141481e>] schedule+0x55/0x57
> [434042.318512]  [<ffffffffa0083de2>] xlog_reserveq_wait+0x115/0x1c0 [xfs]
> [434042.318515]  [<ffffffff810381f1>] ? try_to_wake_up+0x23d/0x23d
> [434042.318539]  [<ffffffffa0083f45>] xlog_grant_log_space+0xb8/0x1be [xfs]
> [434042.318562]  [<ffffffffa0084164>] xfs_log_reserve+0x119/0x133 [xfs]
> [434042.318585]  [<ffffffffa0080cf1>] xfs_trans_reserve+0xca/0x199 [xfs]
> [434042.318605]  [<ffffffffa00500dc>] xfs_create+0x18d/0x467 [xfs]
> [434042.318623]  [<ffffffffa00485be>] xfs_vn_mknod+0xa0/0xf9 [xfs]
> [434042.318640]  [<ffffffffa0048632>] xfs_vn_create+0xb/0xd [xfs]
> [434042.318644]  [<ffffffff810f0c5d>] vfs_create+0x6e/0x9e
> [434042.318647]  [<ffffffff810f1c5e>] do_last+0x302/0x642
> [434042.318651]  [<ffffffff810f2068>] path_openat+0xca/0x344
> [434042.318654]  [<ffffffff810f23d1>] do_filp_open+0x38/0x87
> [434042.318658]  [<ffffffff810fb22e>] ? alloc_fd+0x76/0x11e
> [434042.318661]  [<ffffffff810e40b1>] do_sys_open+0x10b/0x1a4
> [434042.318664]  [<ffffffff810e4173>] sys_open+0x1b/0x1d
> 
> It makes sense that the load average would spike up if some major lock got
> held longer than it should have been.

That's possibly just IO load. It's waiting for log space to become
available, and that will only occur when metadata is written back to
disk.

What's the storage subsystem, what IO scheduler is in use, what's the
actual workload running at the time the spike occurs (e.g. did someone
run a directory traversal that changed every inode?), and so on.

Also, getting the output of 'echo w > /proc/sysrq-trigger' would be
helpful here....
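
If it's easier, something like this guarded snippet does it and degrades
gracefully when not run as root (the availability check is an assumption
about your setup):

```shell
# Dump stacks of all blocked (D state) tasks into the kernel log via
# SysRq 'w', then show the tail of dmesg. Needs root and sysrq enabled;
# otherwise it just prints a hint instead of failing.
if [ -w /proc/sysrq-trigger ] && echo w > /proc/sysrq-trigger 2>/dev/null; then
    dmesg 2>/dev/null | tail -n 200
else
    echo "need root (and kernel.sysrq enabled) to trigger the dump"
fi
```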

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: task blocked for more than 120 seconds
  2012-04-18 23:48 ` Dave Chinner
@ 2012-04-19 15:46   ` Josef 'Jeff' Sipek
  2012-04-19 22:56     ` Dave Chinner
  0 siblings, 1 reply; 13+ messages in thread
From: Josef 'Jeff' Sipek @ 2012-04-19 15:46 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, Apr 19, 2012 at 09:48:21AM +1000, Dave Chinner wrote:
> On Wed, Apr 18, 2012 at 11:11:40AM -0400, Josef 'Jeff' Sipek wrote:
> > Greetings!  I have a file server that gets a pretty nasty load (about 15
> > million files created every day).  After some time, I noticed that the load
> > average spiked up from the usual 30 to about 180.  dmesg revealed:
> > 
> > [434042.318401] INFO: task php:2185 blocked for more than 120 seconds.
> > [434042.318403] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [434042.318405] php             D 000000010675d6cd     0  2185  27306 0x00000000
> > [434042.318408]  ffff88008d735a48 0000000000000086 ffff88008d735938 ffffffff00000000
> > [434042.318412]  ffff88008d734010 ffff88000e28e340 0000000000012000 ffff88008d735fd8
> > [434042.318416]  ffff88008d735fd8 0000000000012000 ffff8807ef9966c0 ffff88000e28e340
> > [434042.318419] Call Trace:
> > [434042.318442]  [<ffffffffa0087a9b>] ? xfs_trans_brelse+0xee/0xf7 [xfs]
> > [434042.318464]  [<ffffffffa00689de>] ? xfs_da_brelse+0x71/0x96 [xfs]
> > [434042.318485]  [<ffffffffa006df10>] ? xfs_dir2_leaf_lookup_int+0x211/0x225 [xfs]
> > [434042.318489]  [<ffffffff8141481e>] schedule+0x55/0x57
> > [434042.318512]  [<ffffffffa0083de2>] xlog_reserveq_wait+0x115/0x1c0 [xfs]
> > [434042.318515]  [<ffffffff810381f1>] ? try_to_wake_up+0x23d/0x23d
> > [434042.318539]  [<ffffffffa0083f45>] xlog_grant_log_space+0xb8/0x1be [xfs]
> > [434042.318562]  [<ffffffffa0084164>] xfs_log_reserve+0x119/0x133 [xfs]
> > [434042.318585]  [<ffffffffa0080cf1>] xfs_trans_reserve+0xca/0x199 [xfs]
> > [434042.318605]  [<ffffffffa00500dc>] xfs_create+0x18d/0x467 [xfs]
> > [434042.318623]  [<ffffffffa00485be>] xfs_vn_mknod+0xa0/0xf9 [xfs]
> > [434042.318640]  [<ffffffffa0048632>] xfs_vn_create+0xb/0xd [xfs]
> > [434042.318644]  [<ffffffff810f0c5d>] vfs_create+0x6e/0x9e
> > [434042.318647]  [<ffffffff810f1c5e>] do_last+0x302/0x642
> > [434042.318651]  [<ffffffff810f2068>] path_openat+0xca/0x344
> > [434042.318654]  [<ffffffff810f23d1>] do_filp_open+0x38/0x87
> > [434042.318658]  [<ffffffff810fb22e>] ? alloc_fd+0x76/0x11e
> > [434042.318661]  [<ffffffff810e40b1>] do_sys_open+0x10b/0x1a4
> > [434042.318664]  [<ffffffff810e4173>] sys_open+0x1b/0x1d
> > 
> > It makes sense that the load average would spike up if some major lock got
> > held longer than it should have been.
> 
> That's possibly just IO load. It's waiting for log space to become
> available, and that will only occur when metadata is written back to
> disk.

Yeah, it could be...

> What's the storage subsystem, what IO scheduler is in use, what's the
> actual workload running at the time the spike occurs (e.g. did someone
> run a directory traversal that changed every inode?), and so on.

fs3.ess ~ # MegaCli -CfgDsply -aAll
==============================================================================
Adapter: 0
Product Name: LSI MegaRAID SAS 9260-16i
Memory: 512MB
BBU: Present
...
RAID Level          : Primary-6, Secondary-0, RAID Level Qualifier-3
Size                : 32.742 TB
State               : Optimal
Strip Size          : 64 KB
Number Of Drives per span:8
Span Depth          : 2
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Access Policy       : Read/Write
Disk Cache Policy   : Disabled
Encryption Type     : None
...

fs3.ess ~ # cat /proc/mounts 
rootfs / rootfs rw 0 0
/dev/root / ext4 rw,noatime,user_xattr,barrier=1,data=ordered 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
udev /dev tmpfs rw,nosuid,relatime,size=10240k,mode=755 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620 0 0
none /dev/shm tmpfs rw,relatime 0 0
usbfs /proc/bus/usb usbfs rw,nosuid,noexec,relatime,devgid=85,devmode=664 0 0
/dev/sda1 /boot ext3 rw,noatime,errors=continue,user_xattr,acl,barrier=1,data=writeback 0 0
/dev/sda4 /var/data/disk0 xfs rw,noatime,attr2,filestreams,delaylog,nobarrier,inode64,logbufs=8,logbsize=256k,noquota 0 0

fs3.ess ~ # xfs_info /var/data/disk0/
meta-data=/dev/sda4              isize=256    agcount=32, agsize=268435455 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=8519632640, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
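
(Doing the arithmetic on the log line above, since it's easy to miss how
big the default log is: just blocks * bsize.)

```shell
# Internal log size implied by the xfs_info output: blocks * bsize.
blocks=521728
bsize=4096
echo $(( blocks * bsize / 1024 / 1024 )) MiB    # prints "2038 MiB", i.e. ~2 GiB
```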

As far as the workload is concerned, there's the typical (for us) load of
creating 15 million files a day (pretty standard diurnal distribution as far
as the load is concerned).  Half the files are about 300 bytes, and the other
half averages 20kB in size.  On this system, the files are never read back.
Since we had a fs corruption recently (due to a power failure & disk caches
being on), our software managed to dump about 25 million files onto /.  I've
been rsync'ing them to the data partition at the same time (yes, it's slow
because they are two partitions on the same array).  The rsync aside, we've had
this workload going for a month on this particular server without any issues.
(The other server which is set up identically has been fine too.)  On Monday,
we mkfs'd this server's sda4, mounted it as before and started the workload.
Then, on Tuesday and Wednesday, we saw two loadavg spikes for no apparent
reason.  (Our system forks a bunch and then each child does network & disk I/O,
so 180 loadavg makes sense if the system gums up.)

The following applies to all of the spikes, but I'm specifically talking about
the spike from this morning.  During the ~45 minute spike, there seems to be
very little disk I/O (<1 MByte/s compared to the usual 10 MBytes/s).  Since
all the I/O children get stuck waiting for disk (ps says they are in 'D'
state), the network utilization drops to essentially 0.  The CPU usage also
drops to ~0%.

I *think* things return to normal because our code eventually notices that
the children are stuck and it 'kill -9's them.  (It's nicer at first.)

> Also, getting the output of 'echo w > /proc/sysrq-trigger' would be
> helpful here....

I assume you want this when the loadavg spikes up to 180.  /me waits for the
next time it happens.

Thanks, let me know if I forgot something.

Jeff.

-- 
All science is either physics or stamp collecting.
		- Ernest Rutherford

* Re: task blocked for more than 120 seconds
  2012-04-19 15:46   ` Josef 'Jeff' Sipek
@ 2012-04-19 22:56     ` Dave Chinner
  2012-04-20 13:58       ` Josef 'Jeff' Sipek
  0 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2012-04-19 22:56 UTC (permalink / raw)
  To: Josef 'Jeff' Sipek; +Cc: xfs

On Thu, Apr 19, 2012 at 11:46:02AM -0400, Josef 'Jeff' Sipek wrote:
> On Thu, Apr 19, 2012 at 09:48:21AM +1000, Dave Chinner wrote:
> > On Wed, Apr 18, 2012 at 11:11:40AM -0400, Josef 'Jeff' Sipek wrote:
> > > Greetings!  I have a file server that gets a pretty nasty load (about 15
> > > million files created every day).  After some time, I noticed that the load
> > > average spiked up from the usual 30 to about 180.  dmesg revealed:
> > > 
> > > [434042.318401] INFO: task php:2185 blocked for more than 120 seconds.
> > > [434042.318403] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > [434042.318405] php             D 000000010675d6cd     0  2185  27306 0x00000000
> > > [434042.318408]  ffff88008d735a48 0000000000000086 ffff88008d735938 ffffffff00000000
> > > [434042.318412]  ffff88008d734010 ffff88000e28e340 0000000000012000 ffff88008d735fd8
> > > [434042.318416]  ffff88008d735fd8 0000000000012000 ffff8807ef9966c0 ffff88000e28e340
> > > [434042.318419] Call Trace:
> > > [434042.318442]  [<ffffffffa0087a9b>] ? xfs_trans_brelse+0xee/0xf7 [xfs]
> > > [434042.318464]  [<ffffffffa00689de>] ? xfs_da_brelse+0x71/0x96 [xfs]
> > > [434042.318485]  [<ffffffffa006df10>] ? xfs_dir2_leaf_lookup_int+0x211/0x225 [xfs]
> > > [434042.318489]  [<ffffffff8141481e>] schedule+0x55/0x57
> > > [434042.318512]  [<ffffffffa0083de2>] xlog_reserveq_wait+0x115/0x1c0 [xfs]
> > > [434042.318515]  [<ffffffff810381f1>] ? try_to_wake_up+0x23d/0x23d
> > > [434042.318539]  [<ffffffffa0083f45>] xlog_grant_log_space+0xb8/0x1be [xfs]
> > > [434042.318562]  [<ffffffffa0084164>] xfs_log_reserve+0x119/0x133 [xfs]
> > > [434042.318585]  [<ffffffffa0080cf1>] xfs_trans_reserve+0xca/0x199 [xfs]
> > > [434042.318605]  [<ffffffffa00500dc>] xfs_create+0x18d/0x467 [xfs]
> > > [434042.318623]  [<ffffffffa00485be>] xfs_vn_mknod+0xa0/0xf9 [xfs]
> > > [434042.318640]  [<ffffffffa0048632>] xfs_vn_create+0xb/0xd [xfs]
> > > [434042.318644]  [<ffffffff810f0c5d>] vfs_create+0x6e/0x9e
> > > [434042.318647]  [<ffffffff810f1c5e>] do_last+0x302/0x642
> > > [434042.318651]  [<ffffffff810f2068>] path_openat+0xca/0x344
> > > [434042.318654]  [<ffffffff810f23d1>] do_filp_open+0x38/0x87
> > > [434042.318658]  [<ffffffff810fb22e>] ? alloc_fd+0x76/0x11e
> > > [434042.318661]  [<ffffffff810e40b1>] do_sys_open+0x10b/0x1a4
> > > [434042.318664]  [<ffffffff810e4173>] sys_open+0x1b/0x1d
> > > 
> > > It makes sense that the load average would spike up if some major lock got
> > > held longer than it should have been.
> > 
> > That's possibly just IO load. It's waiting for log space to become
> > available, and that will only occur when metadata is written back to
> > disk.
> 
> Yeah, it could be...
> 
> > What's the storage subsystem, what IO scheduler is in use, what's the
> > actual workload running at the time the spike occurs (e.g. did someone
> > run a directory traversal that changed every inode?), and so on.
> 
> fs3.ess ~ # MegaCli -CfgDsply -aAll
> ==============================================================================
> Adapter: 0
> Product Name: LSI MegaRAID SAS 9260-16i
> Memory: 512MB
> BBU: Present
> ...
> RAID Level          : Primary-6, Secondary-0, RAID Level Qualifier-3
....
> /dev/sda4 /var/data/disk0 xfs rw,noatime,attr2,filestreams,delaylog,nobarrier,inode64,logbufs=8,logbsize=256k,noquota 0 0

Only unusual thing is the filestreams allocator is in use...

> fs3.ess ~ # xfs_info /var/data/disk0/
> meta-data=/dev/sda4              isize=256    agcount=32, agsize=268435455 blks
>          =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=8519632640, imaxpct=5
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal               bsize=4096   blocks=521728, version=2

and it is a large log, so that's a definite candidate for long
stalls pushing the tail of the log. The typical push that you'll get
stalled on is trying to keep 25% of the log space free, so there's
potentially hundreds of megabytes of random 4/8k metadata writes to
be done to free that space in the log...
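
Back-of-the-envelope from the xfs_info numbers quoted above (plain shell
arithmetic; nothing assumed beyond the 25% figure):

```shell
# ~25% of the log may need to be freed by pushing the tail, i.e. by
# writing back dirty metadata: log blocks * block size / 4.
log_bytes=$(( 521728 * 4096 ))       # from the xfs_info output
push_target=$(( log_bytes / 4 ))
echo "$(( push_target / 1024 / 1024 )) MiB of metadata writeback per push (worst case)"
```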

>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> As far as the workload is concerned, there's the typical (for us) load of
> creating 15 million files a day (pretty standard diurnal distribution as far
> as the load is concerned).  Half the files are about 300 bytes, and the other
> half averages 20kB in size.  On this system, the files are never read back.
> Since we had a fs corruption recently (due to a power failure & disk caches
> being on), our software managed to dump about 25 million files onto /.  I've
> been rsync'ing them to the data partition at the same time (yes, it's slow
> because they are two partitions on the same array).  The rsync aside, we've had
> this workload going for a month on this particular server without any issues.
> (The other server which is set up identically has been fine too.)  On Monday,
> we mkfs'd this server's sda4, mounted it as before and started the workload.

What was the size of the log on the previous incarnation of the
filesystem?

> Then, on Tuesday and Wednesday, we saw two loadavg spikes for no apparent
> reason.  (Our system forks a bunch and then each child does network & disk I/O,
> so 180 loadavg makes sense if the system gums up.)
> 
> The following applies to all of the spikes, but I'm specifically talking about
> the spike from this morning.  During the ~45 minute spike, there seems to be
> very little disk I/O (<1 MByte/s compared to the usual 10 MBytes/s).  Since

That sounds like it might have dropped into random 4k write IO or
inode cluster RMW cycles - a single large RAID6 volume is going to
be no faster than a single spindle at this. Can you get `iostat -d
-m -x 5` output when the next slowdown occurs so we can see the IOPS
and utilisation as well as the bandwidth?
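
A small awk filter over that output can flag the saturated intervals; the
column positions below assume the 14-column `iostat -d -m -x` layout
(%util last, avgqu-sz 9th, await 10th), so adjust if your sysstat differs:

```shell
# Print only intervals where sda looks saturated: %util > 95 and a deep
# queue (avgqu-sz > 10). The here-doc holds two sample lines; feed real
# `iostat -d -m -x 5` output instead.
awk '$1 == "sda" && $14+0 > 95 && $9+0 > 10 {
    printf "%s: qu=%s await=%s util=%s%%\n", $1, $9, $10, $14
}' <<'EOF'
sda               0.00    26.00   29.00 1235.40     0.13    17.10    27.91   202.30  161.37 1792.52  123.08   0.79 100.00
sda               0.00     0.20   61.20  112.60     0.49     0.88    16.17     1.50    8.64   17.66    3.74   5.74  99.76
EOF
```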

> all the I/O children get stuck waiting for disk (ps says they are in 'D'
> state), the network utilization drops to essentially 0.  The CPU usage also
> drops to ~0%.
> 
> I *think* things return to normal because our code eventually notices that
> the children are stuck and it 'kill -9's them.  (It's nicer at first.)

You can't kill processes stuck waiting for log space, IIRC.

> > Also, getting the output of 'echo w > /proc/sysrq-trigger' would be
> > helpful here....
> 
> I assume you want this when the loadavg spikes up to 180.  /me waits for the
> next time it happens.

Yup.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: task blocked for more than 120 seconds
  2012-04-19 22:56     ` Dave Chinner
@ 2012-04-20 13:58       ` Josef 'Jeff' Sipek
  2012-04-21  0:29         ` Dave Chinner
  0 siblings, 1 reply; 13+ messages in thread
From: Josef 'Jeff' Sipek @ 2012-04-20 13:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Apr 20, 2012 at 08:56:03AM +1000, Dave Chinner wrote:
> On Thu, Apr 19, 2012 at 11:46:02AM -0400, Josef 'Jeff' Sipek wrote:
...
> > /dev/sda4 /var/data/disk0 xfs rw,noatime,attr2,filestreams,delaylog,nobarrier,inode64,logbufs=8,logbsize=256k,noquota 0 0
> 
> Only unusual thing is the filestreams allocator is in use...

Yep.  It seemed to make sense to use it since we just want the files to get
to disk.

> > fs3.ess ~ # xfs_info /var/data/disk0/
> > meta-data=/dev/sda4              isize=256    agcount=32, agsize=268435455 blks
> >          =                       sectsz=512   attr=2
> > data     =                       bsize=4096   blocks=8519632640, imaxpct=5
> >          =                       sunit=0      swidth=0 blks
> > naming   =version 2              bsize=4096   ascii-ci=0
> > log      =internal               bsize=4096   blocks=521728, version=2
> 
> and it is a large log, so that's a definite candidate for long
> stalls pushing the tail of the log. The typical push that you'll get
> stalled on is trying to keep 25% of the log space free, so there's
> potentially hundreds of megabytes of random 4/8k metadata writes to
> be done to free that space in the log...

FWIW, the filesystem was made using the default mkfs options (IOW: mkfs.xfs
/dev/sda4).  I didn't even notice the log being 2GB.  (I just assumed it'd
be the good ol' 128MB.)

> What was the size of the log on the previous incarnation of the
> filesystem?

I don't know.  I checked the other file server that was set up at the same
time, and it has a log that's the same size.

...
> > The following applies to all of the spikes, but I'm specifically talking about
> > the spike from this morning.  During the ~45 minute spike, there seems to be
> > very little disk I/O (<1 MByte/s compared to the usual 10 MBytes/s).  Since
> 
> That sounds like it might have dropped into random 4k write IO or
> inode cluster RMW cycles - a single large RAID6 volume is going to
> be no faster than a single spindle at this. Can you get `iostat -d
> -m -x 5` output when the next slowdown occurs so we can see the IOPS
> and utilisation as well as the bandwidth?

Right.  Here's the output from a few minutes ago (I removed the empty lines
and redundant column headings):

fs3.ess ~ # iostat -d -m -x 5
Linux 3.2.2 (fs3.ess.sfj.cudaops.com) 	04/20/12 	_x86_64_	(6 CPU)

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda              11.90   129.18  261.81  457.22     2.81     6.76    27.26     4.44    6.18    9.60    4.22   0.82  59.02
sda               0.00     0.20   61.20  112.60     0.49     0.88    16.17     1.50    8.64   17.66    3.74   5.74  99.76
sda               0.00     0.60   61.40  117.60     0.48     0.92    16.02     1.33    7.41   16.18    2.82   5.56  99.52
sda               0.00     1.20   63.40  104.20     0.50     0.82    16.09     1.28    7.65   15.70    2.75   5.94  99.60
sda               0.00    43.00   59.20  166.60     0.46     1.64    19.05     1.63    7.23   16.82    3.82   4.41  99.60
sda               0.00     0.40   60.80  139.80     0.47     1.09    16.00     1.32    6.57   16.25    2.36   4.96  99.44
sda               0.00     1.20   59.60  108.00     0.47     0.85    16.08     1.22    7.34   16.82    2.10   5.95  99.68
sda               0.00     0.40   66.60  102.80     0.52     0.80    16.01     1.28    7.57   15.00    2.75   5.88  99.68
sda               0.00     4.80   67.00  111.60     0.52     0.91    16.39     1.21    6.75   14.85    1.88   5.58  99.68
sda               0.00     0.60   64.60  103.40     0.50     0.81    16.02     1.16    6.90   15.48    1.55   5.94  99.76
sda               0.00    23.00   65.20  122.60     0.51     1.14    17.96     1.34    7.13   15.28    2.80   5.31  99.68
sda               0.00     0.60   62.40  153.60     0.49     1.20    16.01     1.40    6.46   15.88    2.63   4.63  99.92
sda               0.00    13.60   56.20  131.20     0.44     1.13    17.13     1.39    7.43   17.81    2.98   5.31  99.52
sda               0.00    18.80   29.80  422.00     0.23     3.44    16.66    20.54   45.46   33.40   46.31   2.20  99.52
sda               0.00     0.40   80.40    2.00     0.63     0.01    15.98     0.99   12.09   12.39    0.00  12.06  99.36
sda               0.00     3.60   70.20   12.60     0.55     0.12    16.64     1.00   11.71   13.81    0.00  12.04  99.68
sda               0.00    28.20   66.60   40.40     0.52     0.54    20.22     1.12   10.68   15.33    3.01   9.29  99.36
sda               0.00     0.80   71.80   89.00     0.56     0.70    16.04     1.24    7.66   13.77    2.72   6.21  99.84
sda               0.00     3.60   56.60  156.00     0.44     1.24    16.24     1.51    7.10   17.58    3.29   4.69  99.68
sda               0.00     0.40   64.80  106.80     0.51     0.84    16.01     1.18    6.93   15.56    1.69   5.82  99.84
sda               0.00     4.60   22.80  439.40     0.18     3.46    16.14    25.95   56.14   43.68   56.78   2.16  99.68
sda               0.00    28.20   70.60   87.60     0.57     0.90    19.02     1.33    8.43   14.29    3.71   6.29  99.52
sda               0.00     0.60   83.60    4.00     0.65     0.03    16.02     0.99   11.33   11.88    0.00  11.33  99.28
sda               0.00    13.20   80.00   44.40     0.62     0.23    14.10     1.00    8.08   12.45    0.20   8.00  99.52
sda               0.00     1.00   74.60   41.40     0.58     0.33    16.06     1.08    9.23   13.22    2.05   8.58  99.52
sda               0.00     0.40   63.00   95.20     0.49     0.74    16.01     1.28    8.16   15.89    3.05   6.28  99.36
sda               0.00     5.00   22.60  482.60     0.18     3.81    16.15     9.94   19.65   43.68   18.53   1.98 100.00
sda               0.00     0.20   75.40   10.60     0.59     0.08    15.98     1.00   11.75   13.40    0.00  11.57  99.52
sda               0.00     0.00   78.00    1.60     0.61     0.01    16.00     1.00   12.50   12.76    0.00  12.53  99.76
sda               0.00    14.40   75.80   52.40     0.59     0.52    17.75     1.15    8.95   13.12    2.93   7.76  99.44
sda               0.00    36.40   70.60  101.80     0.55     1.08    19.34     1.29    7.47   14.11    2.86   5.77  99.44
sda               0.00     0.60   63.80  143.40     0.50     1.12    16.02     1.49    7.18   15.59    3.44   4.81  99.60
sda               0.00    32.80   55.40  197.00     0.43     1.79    18.05     1.67    6.62   17.92    3.44   3.94  99.44
sda               0.00     2.20   62.60  132.20     0.49     1.03    16.02     1.38    7.08   15.96    2.88   5.12  99.76
sda               0.00     6.20   64.20  122.40     0.50     1.00    16.49     1.35    7.20   15.45    2.88   5.34  99.60
sda               0.00     0.00   62.60  111.20     0.49     0.87    16.00     1.28    7.38   16.00    2.53   5.74  99.76
sda               0.00     7.20   58.20  121.60     0.45     1.00    16.61     1.16    6.27   16.54    1.36   5.53  99.36
sda               0.00     0.20   68.20  104.40     0.53     0.81    15.99     1.33    7.88   15.10    3.16   5.78  99.76
sda               0.00    14.60   62.40  122.40     0.49     1.07    17.23     1.49    8.08   16.00    4.04   5.40  99.76


And now while the system is coming back to the usual loadavg (of 30):

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    26.00   29.00 1235.40     0.13    17.10    27.91   202.30  161.37 1792.52  123.08   0.79 100.00
sda               0.00    29.80   26.20 1095.60     0.11    15.34    28.22   185.88  187.19 2248.92  137.89   0.89 100.00
sda               0.00    28.20   11.80 1527.20     0.06    19.80    26.44   165.23  112.77  665.83  108.50   0.65 100.00
sda               0.20     2.80  412.40  320.20     2.09     7.00    25.40    61.10   87.40   53.49  131.06   1.37 100.00
sda               2.60     1.60 1056.00   52.20     5.82     0.57    11.81    21.87   19.68   20.65    0.03   0.90 100.00
sda               0.00     7.40 1256.40   34.40     5.77     0.30     9.63    24.81   19.33   19.85    0.19   0.77 100.00
sda               0.00    28.80  917.60  273.00     4.18     2.42    11.34    26.72   22.38   27.52    5.12   0.84 100.00
sda               0.00     0.20 1302.00   28.20     5.90     0.37     9.64    24.75   18.69   19.09    0.03   0.75 100.00
sda               0.00    42.60   61.40  964.60     0.30    11.75    24.05   158.59  144.06  323.64  132.63   0.97 100.00
sda               0.00    29.40   24.00  675.00     0.11     5.00    14.96   143.77  209.05 1007.17  180.67   1.43 100.00
sda               0.00    21.60   28.00  785.00     0.16     6.38    16.48   177.47  220.53  930.09  195.22   1.23 100.00
sda               0.00    31.00   24.40  964.20     0.13    10.67    22.36   177.09  177.76  767.02  162.85   1.01 100.00
sda               0.00    27.00   24.80 1174.20     0.11    14.33    24.68   170.26  144.93  911.90  128.73   0.83 100.00
sda               0.00    34.40    9.00  700.80     0.05     2.98     8.73   151.54  212.61  797.51  205.09   1.41 100.00
sda               0.00    37.20    9.00  776.60     0.05     4.09    10.78   152.84  192.65  775.91  185.89   1.27 100.00
sda               0.00    36.60   14.80  716.40     0.07     3.35     9.58   167.69  214.14  769.19  202.68   1.37 100.00
sda               0.00     5.80   54.00 1824.00     0.24    25.04    27.56   200.50  107.98  950.50   83.04   0.53 100.00
sda               0.00     1.80  207.60 1659.60     0.87    36.25    40.72   151.21   89.22  316.71   60.77   0.54 100.00
sda               0.00     2.00 1513.20   33.20     6.96     0.27     9.57    27.73   17.94   18.33    0.10   0.65 100.00
sda               0.00     1.60 1257.00   14.00     5.78     0.68    10.41    24.57   19.34   19.54    1.54   0.79 100.00
sda               0.00    14.20 1214.00  122.80     5.51     5.54    16.93    23.34   17.42   19.05    1.29   0.75 100.00
sda               0.00    13.60  892.80  590.40     4.12     2.67     9.38    25.95   16.39   23.63    5.45   0.67 100.00
sda               0.00    22.00   31.00  775.40     0.15     3.64     9.60    96.49  115.41  671.79   93.16   1.24 100.00
sda               0.00    29.40   29.60  679.40     0.14     4.54    13.52   153.30  215.36  860.62  187.25   1.41 100.00
sda               0.00    20.60   23.00  635.60     0.11     3.06     9.85   136.16  198.69  772.73  177.92   1.52 100.00
sda               0.00    21.40   21.00  909.20     0.10     7.39    16.49   178.96  196.48 1222.10  172.79   1.08 100.00
sda               0.00     6.00   51.80 1457.20     0.23    26.23    35.91   200.49  124.56  666.36  105.30   0.66 100.00
sda               0.00    38.60   29.60  701.00     0.12     3.23     9.40   182.10  267.15 1711.57  206.16   1.37 100.00

Oh, and:

fs3.ess ~ # cat /sys/block/sda/queue/scheduler 
noop [deadline] cfq 

Jeff.

-- 
You measure democracy by the freedom it gives its dissidents, not the
freedom it gives its assimilated conformists.
		- Abbie Hoffman

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: task blocked for more than 120 seconds
  2012-04-20 13:58       ` Josef 'Jeff' Sipek
@ 2012-04-21  0:29         ` Dave Chinner
  2012-04-23 17:16           ` Josef 'Jeff' Sipek
  2012-04-23 20:24           ` Josef 'Jeff' Sipek
  0 siblings, 2 replies; 13+ messages in thread
From: Dave Chinner @ 2012-04-21  0:29 UTC (permalink / raw)
  To: Josef 'Jeff' Sipek; +Cc: xfs

On Fri, Apr 20, 2012 at 09:58:20AM -0400, Josef 'Jeff' Sipek wrote:
> On Fri, Apr 20, 2012 at 08:56:03AM +1000, Dave Chinner wrote:
> > On Thu, Apr 19, 2012 at 11:46:02AM -0400, Josef 'Jeff' Sipek wrote:
> FWIW, the filesystem was made using the default mkfs options (IOW: mkfs.xfs
> /dev/sda4).  I didn't even notice the log being 2GB.  (I just assumed it'd
> be the good ol' 128MB.)

OK.

> ...
> > > The following applies to all of the spikes, but I'm specifically talking about
> > > the spike from this morning.  During the ~45 minute spike, there seems to be
> > > very little disk I/O (<1 MByte/s compared to the usual 10 MBytes/s).  Since
> > 
> > That sounds like it might have dropped into random 4k write IO or
> > inode cluster RMW cycles - a single large RAID6 volume is going to
> > be no faster than a single spindle at this. Can you get `iostat -d
> > -m -x 5` output when the next slowdown occurs so we can see the IOPS
> > and utilisation as well as the bandwidth?
> 
> Right.  Here's the output from a few minutes ago (I removed the empty lines
> and redundant column headings):

Thanks, that makes it easy to read ;)

> fs3.ess ~ # iostat -d -m -x 5
> Linux 3.2.2 (fs3.ess.sfj.cudaops.com) 	04/20/12 	_x86_64_	(6 CPU)
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda              11.90   129.18  261.81  457.22     2.81     6.76    27.26     4.44    6.18    9.60    4.22   0.82  59.02
> sda               0.00     0.20   61.20  112.60     0.49     0.88    16.17     1.50    8.64   17.66    3.74   5.74  99.76
> sda               0.00     0.60   61.40  117.60     0.48     0.92    16.02     1.33    7.41   16.18    2.82   5.56  99.52
> sda               0.00     1.20   63.40  104.20     0.50     0.82    16.09     1.28    7.65   15.70    2.75   5.94  99.60

Ok, so the queue depth here is showing serialised IO (barely above
1), and the request size is 8k, which will be inode clusters. The
service time for reads is around 16ms, which, assuming SATA drives,
is a full disk seek. The writes are much faster due to the BBWC. It
looks, however, like it is reading inodes from all over the drive,
and that is the delay factor here. The serialisation indicates that
it is either log tail pushing or low-memory inode reclaim that is
slowing everyone down here.

Now comes the "more data needed" bit. I still need the sysrq-w
output, but also given the length of the incident, some other data
is definitely needed:

	- a 30s event trace - it'll compress pretty well
	  (trace-cmd record -e xfs* sleep 30; trace-cmd report > output.txt)
	- vmstat 1 output for the same period
	- output of /proc/meminfo at the same time
	- the same iostat output for comparison.

> And now while the system is coming back to the usual loadavg (of 30):
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00    26.00   29.00 1235.40     0.13    17.10    27.91   202.30  161.37 1792.52  123.08   0.79 100.00
> sda               0.00    29.80   26.20 1095.60     0.11    15.34    28.22   185.88  187.19 2248.92  137.89   0.89 100.00
> sda               0.00    28.20   11.80 1527.20     0.06    19.80    26.44   165.23  112.77  665.83  108.50   0.65 100.00

Yeah, that looks more like a properly working filesystem :/

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: task blocked for more than 120 seconds
  2012-04-21  0:29         ` Dave Chinner
@ 2012-04-23 17:16           ` Josef 'Jeff' Sipek
  2012-04-23 20:24           ` Josef 'Jeff' Sipek
  1 sibling, 0 replies; 13+ messages in thread
From: Josef 'Jeff' Sipek @ 2012-04-23 17:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Sat, Apr 21, 2012 at 10:29:32AM +1000, Dave Chinner wrote:
> On Fri, Apr 20, 2012 at 09:58:20AM -0400, Josef 'Jeff' Sipek wrote:
...
> > Right.  Here's the output from a few minutes ago (I removed the empty lines
> > and redundant column headings):
> 
> Thanks, that makes it easy to read ;)

Yeah, I don't know whose brilliant idea it was to output the headings
every time.

> Now comes the "more data needed" bit. I still need the sysrq-w
> output,

D'oh!  Totally forgot to put that up somewhere.

http://31bits.net/download/xfs_log.gz

Sadly, it looks like the kernel's ring buffer wrapped while dumping the ~180
stacks.

> but also given the length of the incident, some other data
> is definitely needed:
> 
> 	- a 30s event trace - it'll compress pretty well
> 	  (trace-cmd record -e xfs* sleep 30; trace-cmd report > output.txt)

Hrm... Time to figure out how to get ftrace going on Gentoo.  I don't see an
obvious package to install to get trace-cmd.

> 	- vmstat 1 output for the same period
> 	- output of /proc/meminfo at the same time
> 	- the same iostat output for comparison.

Jeff.

-- 
I'm somewhere between geek and normal.
		- Linus Torvalds



* Re: task blocked for more than 120 seconds
  2012-04-21  0:29         ` Dave Chinner
  2012-04-23 17:16           ` Josef 'Jeff' Sipek
@ 2012-04-23 20:24           ` Josef 'Jeff' Sipek
  2012-04-23 23:27             ` Dave Chinner
  1 sibling, 1 reply; 13+ messages in thread
From: Josef 'Jeff' Sipek @ 2012-04-23 20:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Sat, Apr 21, 2012 at 10:29:32AM +1000, Dave Chinner wrote:
...
> but also given the length of the incident, some other data is definitely
> needed:
> 	- a 30s event trace - it'll compress pretty well
> 	  (trace-cmd record -e xfs* sleep 30; trace-cmd report > output.txt)

http://31bits.net/download/output-1335211829.txt.bz2

> 	- vmstat 1 output for the same period

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  0      0 191064 321700 16540692    0    0   471  1131    1    2  3  2 81 15
 0  0      0 190248 321700 16540696    0    0   624   744  615  289  0  0 100  0
 0  0      0 188728 321708 16540700    0    0   568   732 1544 2427  0  1 99  0
 0  0      0 186848 321708 16540772    0    0   544     0 5789 11321  0  1 99  0
 0  0      0 185856 321708 16540928    0    0   556   728 5801 11318  0  1 99  0
 0  0      0 185608 321708 16541032    0    0   504   704 5952 11360  0  2 98  0
 0  0      0 184740 321708 16541188    0    0   488   792 5799 11293  0  1 99  0
 0  0      0 183996 321708 16541248    0    0   488   608 5937 11369  0  1 99  0
 0  0      0 183004 321716 16541420    0    0   664   504 5834 11396  0  1 99  0
 0  0      0 182136 321716 16541560    0    0   656     0 5790 11332  0  2 98  0
 0  0      0 179408 321716 16544120    0    0   944  1072 5877 11428  0  1 98  0
 1  0      0 177844 321716 16544828    0    0   232 11448 5996 11443  0  1 98  0
 0  0      0 177968 321716 16544944    0    0     0     0 5703 11177  0  1 98  0
 0  0      0 177472 321724 16545028    0    0   264    20 5769 11264  1  1 99  0
 0  0      0 176852 321724 16545116    0    0   672     0 5827 11355  0  2 98  0
 0  0      0 176140 321724 16545164    0    0   656     0 5769 11339  0  1 99  0
 0  0      0 175396 321724 16545304    0    0   664    24 5797 11320  1  2 98  0
 0  0      0 173024 321724 16545408    0    0   668    40 5932 11393  1  1 98  0
 0  0      0 172280 321732 16545504    0    0   544    44 5761 11309  0  1 99  0
 0  0      0 171660 321732 16545632    0    0   632     8 5782 11336  0  1 99  0
 0  0      0 171040 321732 16546000    0    0   648   176 5796 11330  0  1 99  0
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0      0 167320 321732 16548276    0    0  1312     0 5864 11479  0  1 99  0
 0  0      0 165832 321732 16548452    0    0   568  1224 5883 11359  0  1 99  0
 0  0      0 164716 321740 16548524    0    0   640   144 5871 11346  0  2 98  0
 0  0      0 163600 321740 16549920    0    0  1048   576 5876 11411  0  1 99  0
 0  0      0 161988 321740 16551356    0    0   944  1000 5849 11402  0  1 99  0
 0  0      0 161368 321740 16551964    0    0   624    48 5788 11324  0  2 98  0
 0  0      0 160996 321740 16552044    0    0   584     0 5772 11339  0  1 99  0
 0  0      0 160252 321748 16552136    0    0   624    20 5741 11327  0  0 100  0
 0  0      0 165708 321736 16546552    0    0   752   456 5895 11382  0  1 98  0
 0  0      0 164468 321736 16546952    0    0   664   376 5822 11326  0  1 99  0
 0  0      0 163724 321736 16547180    0    0   608   152 5800 11336  0  1 99  0
 0  0      0 163600 321736 16547260    0    0   672   872 4851 9169  0  1 99  0
 0  0      0 162880 321736 16547336    0    0   536     0  457  232  0  0 100  0
 1  0      0 164880 321740 16537360    0    0   376 16020  616  317  5  2 93  0
 1  0      0 167740 321368 16536716    0    0   496   888  597  258 17  0 83  0
 0  0      0 176048 317916 16540552    0    0   480   752  629  248 14  1 85  0
 0  0      0 175684 317916 16540448    0    0   472   800  230  216  0  0 100  0
 0  0      0 175328 317916 16540448    0    0   592   560  217  249  0  0 100  0
 0  0      0 174632 317916 16540488    0    0   600     0  254  257  0  0 100  0
 0  0      0 174508 317924 16540476    0    0   528   364  234  289  0  0 100  0
 0  0      0 173764 317924 16540480    0    0   584   640  290  280  0  0 100  0
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0      0 173392 317924 16540480    0    0   736  1088  310  324  0  0 100  0
 0  0      0 172720 317924 16540480    0    0   608   328  279  301  0  0 100  0
 0  0      0 172100 317924 16540480    0    0   680   744  270  309  0  0 100  0
 0  0      0 171728 317932 16540480    0    0   504    44  197  218  0  0 100  0
 0  0      0 172852 317932 16540484    0    0   504   776  354  270  0  0 100  0
 0  0      0 172380 317932 16540488    0    0   432   680  239  217  0  0 100  0
 0  0      0 171760 317932 16540484    0    0   552   616  208  221  0  0 100  0
 0  0      0 171388 317932 16540484    0    0   528   816  212  229  0  0 100  0

> 	- output of /proc/meminfo at the same time

MemTotal:       33017676 kB
MemFree:          191312 kB
Buffers:          321700 kB
Cached:         16540692 kB
SwapCached:            0 kB
Active:         15167228 kB
Inactive:        9247756 kB
Active(anon):    6736980 kB
Inactive(anon):   817368 kB
Active(file):    8430248 kB
Inactive(file):  8430388 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       3808252 kB
SwapFree:        3808252 kB
Dirty:            434824 kB
Writeback:             0 kB
AnonPages:       7552692 kB
Mapped:            10380 kB
Shmem:              1740 kB
Slab:            3689788 kB
SReclaimable:    2628672 kB
SUnreclaim:      1061116 kB
KernelStack:        2752 kB
PageTables:        58084 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    20317088 kB
Committed_AS:    9698260 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      329288 kB
VmallocChunk:   34359405376 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        6720 kB
DirectMap2M:     3137536 kB
DirectMap1G:    30408704 kB

> 	- the same iostat output for comparison.

Linux 3.2.2 (fs3.ess.sfj.cudaops.com) 	04/23/12 	_x86_64_	(6 CPU)

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               8.89    95.35  306.14  470.40     2.75     6.60    24.67     3.52    4.54    5.62    3.83   0.91  70.84
sda               0.00     2.00   70.20   71.80     0.55     0.57    16.07     1.02    7.19   14.54    0.01   7.03  99.84
sda               0.00     2.40   80.80   73.40     0.63     0.58    16.10     1.02    6.58   12.16    0.45   6.39  98.48
sda               0.00    12.20   45.80  275.00     0.36     2.24    16.58     3.96   12.36   21.87   10.78   3.11  99.92
sda               0.00     2.00   78.60    6.40     0.61     0.06    16.19     1.00   11.76   12.71    0.00  11.71  99.52
sda               0.00    30.20  113.00   44.00     0.88     0.57    19.02     0.99    6.29    8.74    0.00   6.30  98.88
sda               0.00     2.40   81.20   20.60     0.63     0.18    16.30     1.00    9.80   12.29    0.00   9.80  99.76
sda               0.00     1.20   67.20   80.00     0.53     3.50    56.03     1.13    7.67   14.88    1.62   6.78  99.84
sda               0.00     2.60   66.80   60.80     0.52     0.48    16.14     1.03    8.08   14.92    0.57   7.82  99.84
sda               0.00     8.00   77.60   64.20     0.61     0.56    16.78     1.01    7.11   12.86    0.16   7.03  99.68

Anything else?

Jeff.

P.S. I had a script prepared to fire off whenever the time came:

# cat run-when-loaded.sh 
#!/bin/sh

# Timestamp token so the output files from each incident sort together.
TOK=`date +%s`

cat /proc/mounts > ~/mounts-$TOK.txt
cat /proc/meminfo > ~/meminfo-$TOK.txt

# Gather vmstat and iostat in the background while the event trace runs.
(
	vmstat 1 50 > ~/vmstat-$TOK.txt
) &

(
	iostat -d -m -x 5 10 > ~/iostat-$TOK.txt
) &

(
	cd trace-cmd
	./trace-cmd record -e xfs* sleep 30
	./trace-cmd report > ~/output-$TOK.txt
)



* Re: task blocked for more than 120 seconds
  2012-04-23 20:24           ` Josef 'Jeff' Sipek
@ 2012-04-23 23:27             ` Dave Chinner
  2012-04-24 15:10               ` Josef 'Jeff' Sipek
  2012-09-27 12:49               ` Josef 'Jeff' Sipek
  0 siblings, 2 replies; 13+ messages in thread
From: Dave Chinner @ 2012-04-23 23:27 UTC (permalink / raw)
  To: Josef 'Jeff' Sipek; +Cc: xfs

On Mon, Apr 23, 2012 at 04:24:41PM -0400, Josef 'Jeff' Sipek wrote:
> On Sat, Apr 21, 2012 at 10:29:32AM +1000, Dave Chinner wrote:
> ...
> > but also given the length of the incident, some other data is definitely
> > needed:
> > 	- a 30s event trace - it'll compress pretty well
> > 	  (trace-cmd record -e xfs* sleep 30; trace-cmd report > output.txt)
> 
> http://31bits.net/download/output-1335211829.txt.bz2

Ok, that's instructive. Inode RMW cycles in the xfsaild:

xfsaild/sda4  968486.257027: xfs_buf_read:         dev 8:4 bno 0xe000d4e10 len 0x2000
xfsaild/sda4  968486.338194: xfs_buf_delwri_queue: dev 8:4 bno 0xe000d4e10 len 0x2000
xfsaild/sda4  968486.338203: xfs_buf_read:         dev 8:4 bno 0xb00ff7c00 len 0x2000
xfsaild/sda4  968486.351177: xfs_buf_delwri_queue: dev 8:4 bno 0xb00ff7c00 len 0x2000
xfsaild/sda4  968486.351183: xfs_buf_read:         dev 8:4 bno 0x280054e38 len 0x2000
xfsaild/sda4  968486.351194: xfs_buf_delwri_queue: dev 8:4 bno 0x280054e38 len 0x2000
xfsaild/sda4  968486.351200: xfs_buf_read:         dev 8:4 bno 0x580057ee8 len 0x2000
xfsaild/sda4  968486.363347: xfs_buf_delwri_queue: dev 8:4 bno 0x580057ee8 len 0x2000
xfsaild/sda4  968486.363355: xfs_buf_read:         dev 8:4 bno 0x500522980 len 0x2000
xfsaild/sda4  968486.373812: xfs_buf_delwri_queue: dev 8:4 bno 0x500522980 len 0x2000
xfsaild/sda4  968486.373817: xfs_buf_read:         dev 8:4 bno 0x800412390 len 0x2000
xfsaild/sda4  968486.373829: xfs_buf_delwri_queue: dev 8:4 bno 0x800412390 len 0x2000
xfsaild/sda4  968486.373835: xfs_buf_read:         dev 8:4 bno 0xe005a07c0 len 0x2000
xfsaild/sda4  968486.386211: xfs_buf_delwri_queue: dev 8:4 bno 0xe005a07c0 len 0x2000
xfsaild/sda4  968486.386220: xfs_buf_read:         dev 8:4 bno 0x5801af5f8 len 0x2000
xfsaild/sda4  968486.400109: xfs_buf_delwri_queue: dev 8:4 bno 0x5801af5f8 len 0x2000
xfsaild/sda4  968486.400116: xfs_buf_read:         dev 8:4 bno 0xf01026940 len 0x2000
xfsaild/sda4  968486.400128: xfs_buf_delwri_queue: dev 8:4 bno 0xf01026940 len 0x2000
xfsaild/sda4  968486.400135: xfs_buf_read:         dev 8:4 bno 0xe00fccac0 len 0x2000
xfsaild/sda4  968486.517162: xfs_buf_delwri_queue: dev 8:4 bno 0xe00fccac0 len 0x2000
xfsaild/sda4  968486.517169: xfs_buf_read:         dev 8:4 bno 0x4007ba2d0 len 0x2000
xfsaild/sda4  968486.517179: xfs_buf_delwri_queue: dev 8:4 bno 0x4007ba2d0 len 0x2000
xfsaild/sda4  968486.517187: xfs_buf_read:         dev 8:4 bno 0x8800652c8 len 0x2000
xfsaild/sda4  968486.524118: xfs_buf_delwri_queue: dev 8:4 bno 0x8800652c8 len 0x2000
xfsaild/sda4  968486.524126: xfs_buf_read:         dev 8:4 bno 0x2811e0dc8 len 0x2000
xfsaild/sda4  968486.536576: xfs_buf_delwri_queue: dev 8:4 bno 0x2811e0dc8 len 0x2000
.....
xfsaild/sda4  968516.199683: xfs_buf_read:         dev 8:4 bno 0x7008ebfb0 len 0x2000
xfsaild/sda4  968516.212424: xfs_buf_delwri_queue: dev 8:4 bno 0x7008ebfb0 len 0x2000


Every buffer read is followed by a queuing for write.  It is also
showing that it is typically taking 10-25ms per inode read IO, which
is exactly what I'd expect for your given storage. There are 2500 of
these over the 30s period, which translates to about one every 12ms
across the 30s sample.

So, yes, your hangs are definitely due to inode buffer RMW cycles
when trying to flush dirty inodes from the cache. I have a few
ideas on how to fix this - but I'm not sure whether a current TOT
solution will be easily back-portable. The simplest solution is a
readahead based solution - AIL pushing is async, and will cycle back
to inodes that it failed to flush the first time past, so triggering
readahead on the first pass might work just fine.

Right now we do:

	read inode buffer (trylock)
	no buffer:
		read from disk
		wait
	flush inode to buffer
	queue buffer for write

So the aild is essentially blocking on inode buffer read IO. What
would be better is:


	read inode buffer (trylock)
	no buffer:
		issue readahead to disk
		fail read
	xfsaild skips inode
	.....
	read inode buffer (trylock)
	buffer found
	flush inode to buffer
	queue buffer for write

That way the xfsaild will make a pass across the AIL doing
readahead and doesn't block on RMW cycles. Effectively we get async
RMW cycles occurring, and the latency of a single cycle will no
longer be the performance limiting factor. I'll start to prototype
something to address this - it isn't a new idea, and I've seen it
done before, so I should be able to get something working.

> Anything else?

No, I know what the problem is now.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: task blocked for more than 120 seconds
  2012-04-23 23:27             ` Dave Chinner
@ 2012-04-24 15:10               ` Josef 'Jeff' Sipek
  2012-09-27 12:49               ` Josef 'Jeff' Sipek
  1 sibling, 0 replies; 13+ messages in thread
From: Josef 'Jeff' Sipek @ 2012-04-24 15:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Apr 24, 2012 at 09:27:59AM +1000, Dave Chinner wrote:
> On Mon, Apr 23, 2012 at 04:24:41PM -0400, Josef 'Jeff' Sipek wrote:
> > On Sat, Apr 21, 2012 at 10:29:32AM +1000, Dave Chinner wrote:
> > ...
> > > but also given the length of the incident, some other data is definitely
> > > needed:
> > > 	- a 30s event trace - it'll compress pretty well
> > > 	  (trace-cmd record -e xfs* sleep 30; trace-cmd report > output.txt)
> > 
> > http://31bits.net/download/output-1335211829.txt.bz2
> 
> Ok, that's instructive. Inode RMW cycles in the xfsaild:
...
> Every buffer read is followed by a queuing for write.  It is also
> showing that it is typically taking 10-25ms per inode read IO, which
> is exactly what I'd expect for your given storage. There are 2500 of
> these over the 30s period, which translates to about one every 12ms
> across the 30s sample.
> 
> So, yes, your hangs are definitely due to inode buffer RMW cycles
> when trying to flush dirty inodes from the cache. I have a few
> ideas on how to fix this - but I'm not sure whether a current TOT
> solution will be easily back-portable.

If it is too much effort to backport, we should be able to move the box to a
3.3 stable kernel (assuming no driver problems).

> The simplest solution is a readahead based solution - AIL pushing is
> async, and will cycle back to inodes that it failed to flush the first
> time past, so triggering readahead on the first pass might work just fine.
...

This makes sense.  With a large enough log, could you not end up evicting
the readahead inodes by the time you get back to them?

> That way the xfsaild will make a pass across the AIL doing
> readahead and doesn't block on RMW cycles. Effectively we get async
> RMW cycles occurring, and the latency of a single cycle will no
> longer be the performance limiting factor. I'll start to prototype
> something to address this - it isn't a new idea, and I've seen it
> done before, so i should be able to get something working.

Cool.  Let me know when you have something we can try.  I don't quite know
what it is that's causing this giant backlog of inode modifications - I
suspect it's the rsync that's pushing it over.  But regardless, I'm
interested in testing the fix.

Thanks!

Jeff.

-- 
Fact: 30.3% of all statistics are generated randomly.



* Re: task blocked for more than 120 seconds
  2012-04-23 23:27             ` Dave Chinner
  2012-04-24 15:10               ` Josef 'Jeff' Sipek
@ 2012-09-27 12:49               ` Josef 'Jeff' Sipek
  2012-09-27 22:50                 ` Dave Chinner
  1 sibling, 1 reply; 13+ messages in thread
From: Josef 'Jeff' Sipek @ 2012-09-27 12:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Apr 24, 2012 at 09:27:59AM +1000, Dave Chinner wrote:
...
> So, yes, your hangs are definitely due to inode buffer RMW cycles
> when trying to flush dirty inodes from the cache. I have a few
> ideas on how to fix this - but I'm not sure whether a current TOT
> solution will be easily back-portable. The simplest solution is a
> readahead based solution - AIL pushing is async, and will cycle back
> to inodes that it failed to flush the first time past, so triggering
> readahead on the first pass might work just fine.

Have you had time to look at this?  (We got bitten by this again and I
really don't want to go back to ext4.)  Is there anything I can help with?

Jeff.

-- 
Evolution, n.:
  A hypothetical process whereby infinitely improbable events occur with
  alarming frequency, order arises from chaos, and no one is given credit.



* Re: task blocked for more than 120 seconds
  2012-09-27 12:49               ` Josef 'Jeff' Sipek
@ 2012-09-27 22:50                 ` Dave Chinner
  0 siblings, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2012-09-27 22:50 UTC (permalink / raw)
  To: Josef 'Jeff' Sipek; +Cc: xfs

On Thu, Sep 27, 2012 at 08:49:15AM -0400, Josef 'Jeff' Sipek wrote:
> On Tue, Apr 24, 2012 at 09:27:59AM +1000, Dave Chinner wrote:
> ...
> > So, yes, your hangs are definitely due to inode buffer RMW cycles
> > when trying to flush dirty inodes from the cache. I have a few
> > ideas on how to fix this - but I'm not sure whether a current TOT
> > solution will be easily back-portable. The simplest solution is a
> > readahead based solution - AIL pushing is async, and will cycle back
> > to inodes that it failed to flush the first time past, so triggering
> > readahead on the first pass might work just fine.
> 
> Have you had time to look at this?

No.

> (We got bitten by this again and I
> really don't want to go back to ext4.)  Is there anything I can help with?

What is needed is an xfs_inode_readahead() function, which takes an
XFS inode and attempts to issue readahead on the underlying buffer
if it is not found in cache.

The simplest thing to do is add another flag to xfs_buf_read that is
passed through from xfs_iflush() to say "don't block on read".
Indeed, we used to have an XBF_DONT_BLOCK flag that we removed
recently because it was the default behaviour for everything. That
could be reintroduced to tell xfs_buf_read_map() that it should
return the buffer if it is in cache, or issue readahead and return
NULL if it is not found in memory so that a read a little while
later might find it. Combined with the XBF_TRYLOCK that xfs_iflush
already uses, it won't block on read IO already in progress, either.
i.e. there's a difference between not-in-cache and
in-cache-but-locked when it comes to issuing readahead, so there
might need to be slight changes to xfs_buf_find() to accommodate
that.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



Thread overview: 13+ messages
2012-04-18 15:11 task blocked for more than 120 seconds Josef 'Jeff' Sipek
2012-04-18 18:28 ` Ben Myers
2012-04-18 23:48 ` Dave Chinner
2012-04-19 15:46   ` Josef 'Jeff' Sipek
2012-04-19 22:56     ` Dave Chinner
2012-04-20 13:58       ` Josef 'Jeff' Sipek
2012-04-21  0:29         ` Dave Chinner
2012-04-23 17:16           ` Josef 'Jeff' Sipek
2012-04-23 20:24           ` Josef 'Jeff' Sipek
2012-04-23 23:27             ` Dave Chinner
2012-04-24 15:10               ` Josef 'Jeff' Sipek
2012-09-27 12:49               ` Josef 'Jeff' Sipek
2012-09-27 22:50                 ` Dave Chinner
