public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
* Issues with XFS on Sles9 sp2.
@ 2006-12-01 15:08 Roger Heflin
  2006-12-02  0:25 ` Christian Kujau
       [not found] ` <20061203224832.GY37654165@melbourne.sgi.com>
  0 siblings, 2 replies; 4+ messages in thread
From: Roger Heflin @ 2006-12-01 15:08 UTC (permalink / raw)
  To: xfs

Hello,

I have a customer whose machines' XFS filesystem quits
responding when certain applications are running.  The only filesystem
using XFS is /tmp; all other filesystems still respond, but anything
going to /tmp hangs forever.   Multiple machines with
a couple of different types of motherboards have this issue, and
converting the machines to ext3 eliminates the issues.  Under load
they were seeing 1-2 events per 24 hours on 100 machines.   Since
the ext3 conversion they have had 0 events on 400 machines in
2 weeks, so it is fairly conclusive that XFS has something to do
with it.  It is not a hardware problem specific to either of the 2
motherboards with the issue: one uses Opteron+AMD chipset+IDE and the
other uses Opteron+Nvidia+SATA, and the problems do not repeat
on any 1 node; they appear to randomly hit 1 or 2 nodes out
of the test set, and the next day it will be a different one.

They are using SLES9 SP2; currently we cannot go to SP3, as there
are some other bad driver issues unrelated to XFS (the issue
preventing us from upgrading also appears to be in 2.6.16.x
kernel.org kernels, so it is more than just a SLES issue).

I have already had long discussions with SuSE, with less
than useful results.

Are there any patches that are likely to either produce
more debugging or to get rid of this issue?

There are no messages in the messages file when the event
happens.

Below is a sysrq-generated stack trace from one of the
machines.   The issues do not seem to require heavy IO
loads (we have verified that the application is not IO
intensive); it may be something related to running short
on memory, but we don't have any OOM-type messages
anywhere.  The first type of machine to have the issue,
and where the issue is a lot more common, has only 4 GB
of RAM; the second type of machine, which has recently
also started having the error, has 32 GB of RAM.
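For reference, the dump below came from the kernel's sysrq facility; a
minimal sketch of the procedure (assuming sysrq support is compiled into
the kernel, and run as root) is:

```shell
# Enable all sysrq functions (standard Linux sysctl; SLES defaults may differ).
echo 1 > /proc/sys/kernel/sysrq
# 't' dumps the state and stack trace of every task to the kernel log.
echo t > /proc/sysrq-trigger
# Read the traces back out of the kernel ring buffer.
dmesg | tail -n 200
```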

                           Roger


<Oct/27 07:40 am>xfssyncd D 00000000000493e0 0 2760 1 3876 2755 (L-TLB)

<Oct/27 07:40 am>Call Trace:<ffffffffa0141832>{:xfs:kmem_zone_zalloc+50} 
<ffffffffa012a9c4>{:xfs:_xfs_trans_alloc+36}

<Oct/27 07:40 am> <ffffffff80231b35>{__down_write+117} 
<ffffffffa0116ead>{:xfs:xfs_ilock+93}

<Oct/27 07:40 am> <ffffffffa012eda3>{:xfs:xfs_syncsub+2787} 
<ffffffff80146970>{del_timer_sync+80}

<Oct/27 07:40 am> <ffffffff80146a55>{del_singleshot_timer_sync+21} 
<ffffffff80146d2e>{schedule_timeout+254}

<Oct/27 07:40 am> <ffffffffa013e468>{:xfs:vfs_sync+40} 
<ffffffffa013da79>{:xfs:vfs_sync_worker+25}

<Oct/27 07:40 am> <ffffffffa013dc1a>{:xfs:xfssyncd+378} 
<ffffffffa013d780>{:xfs:linvfs_fill_super+0}

<Oct/27 07:40 am> <ffffffff801112b7>{child_rip+8} 
<ffffffffa013d780>{:xfs:linvfs_fill_super+0}

<Oct/27 07:40 am> <ffffffffa013daa0>{:xfs:xfssyncd+0} 
<ffffffff801112af>{child_rip+0}

<Oct/27 07:40 am>

<Oct/27 07:40 am>res D 000000000000000a 0 16149 1 26319 16151 5825 (NOTLB)

<Oct/27 07:40 am>Call Trace:<ffffffff80231bcd>{__down_read+125} 
<ffffffffa01333dc>{:xfs:xfs_access+44}

<Oct/27 07:40 am> <ffffffffa013af44>{:xfs:linvfs_permission+20} 
<ffffffff8019c767>{permission+55}

<Oct/27 07:40 am> <ffffffff8019df1c>{link_path_walk+348} 
<ffffffff801a0706>{__user_walk_it+70}

<Oct/27 07:40 am> <ffffffff801974b0>{vfs_lstat+128} 
<ffffffff80122868>{do_page_fault+536}

<Oct/27 07:40 am> <ffffffff801975bf>{sys_newlstat+31} 
<ffffffff80111101>{error_exit+0}

<Oct/27 07:40 am> <ffffffff80110794>{system_call+124}



<Oct/27 07:40 am>sbatchd D 00000000000493e0 0 16151 1 12686 16149 (NOTLB)

<Oct/27 07:40 am>Call Trace:<ffffffff801a7b51>{dput+33} 
<ffffffff8019c2cd>{follow_mount+93}

<Oct/27 07:40 am> <ffffffff801a7b51>{dput+33} 
<ffffffff80231bcd>{__down_read+125}

<Oct/27 07:40 am> <ffffffffa01333dc>{:xfs:xfs_access+44} 
<ffffffffa013af44>{:xfs:linvfs_permission+20}

<Oct/27 07:40 am> <ffffffff8019c767>{permission+55} 
<ffffffff8018aeca>{sys_chdir+138}

<Oct/27 07:40 am> <ffffffff801a394c>{sys_select+1244} 
<ffffffff80110794>{system_call+124}

<Oct/27 07:40 am>

<Oct/27 07:40 am>gm_mapper D 000000000000000a 0 12686 1 16834 16151 (L-TLB)

<Oct/27 07:40 am>Call 
Trace:<ffffffffa012b37b>{:xfs:xfs_trans_log_buf+107} 
<ffffffff8010f9c8>{__down+152}

<Oct/27 07:40 am> <ffffffff80135c50>{default_wake_function+0} 
<ffffffff80234447>{__down_failed+53}

<Oct/27 07:40 am> <ffffffffa0141642>{:xfs:.text.lock.xfs_buf+15} 
<ffffffffa0126618>{:xfs:xfs_getsb+40}

<Oct/27 07:40 am> <ffffffffa012b8aa>{:xfs:xfs_trans_getsb+106} 
<ffffffffa012a10c>{:xfs:xfs_trans_commit+332}

<Oct/27 07:40 am> <ffffffffa00e7a9c>{:xfs:xfs_free_extent+204} 
<ffffffffa0111634>{:xfs:xfs_efd_init+68}

<Oct/27 07:40 am> <ffffffffa014179b>{:xfs:kmem_zone_alloc+75} 
<ffffffffa0141832>{:xfs:kmem_zone_zalloc+50}

<Oct/27 07:40 am> <ffffffffa011a9cd>{:xfs:xfs_itruncate_finish+557} 
<ffffffffa012aae9>{:xfs:xfs_trans_alloc+217}

<Oct/27 07:40 am> <ffffffff8011081d>{sysret_signal+28} 
<ffffffffa01300af>{:xfs:xfs_inactive+591}

<Oct/27 07:40 am> <ffffffff8011081d>{sysret_signal+28} 
<ffffffff80169f50>{__pagevec_free+32}

<Oct/27 07:40 am> <ffffffff8011081d>{sysret_signal+28} 
<ffffffffa013ebc8>{:xfs:vn_rele+72}

<Oct/27 07:40 am> <ffffffffa013d392>{:xfs:linvfs_clear_inode+18} 
<ffffffff801a9d3b>{clear_inode+155}

<Oct/27 07:40 am> <ffffffff801aa3f5>{generic_delete_inode+245} 
<ffffffff801a95ee>{iput+158}

<Oct/27 07:40 am> <ffffffff801a7cb5>{dput+389} 
<ffffffff8018d9de>{__fput+270}

<Oct/27 07:40 am> <ffffffff8018965e>{filp_close+126} 
<ffffffff8013f073>{put_files_struct+115}

<Oct/27 07:40 am> <ffffffff80140522>{do_exit+1010} 
<ffffffff801484b5>{__dequeue_signal+501}

<Oct/27 07:40 am> <ffffffff8011081d>{sysret_signal+28} 
<ffffffff80140fa8>{do_group_exit+232}

<Oct/27 07:40 am> <ffffffff8014ab37>{get_signal_to_deliver+1175} 
<ffffffff8011004b>{do_signal+1179}

<Oct/27 07:40 am> <ffffffff8010fc45>{do_signal+149} 
<ffffffffa02dbea0>{:gm:gm_linux_ioctl+0}

<Oct/27 07:40 am> <ffffffffa02dbf0a>{:gm:gm_linux_ioctl+106} 
<ffffffff801a2094>{sys_ioctl+1092}

<Oct/27 07:40 am> <ffffffff8011052d>{sys_rt_sigreturn+653} 
<ffffffff8011081d>{sysret_signal+28}

<Oct/27 07:40 am> <ffffffff80110adf>{ptregscall_common+103}

<Oct/27 07:40 am>lim D 000000000000000a 0 16834 1 16835 17594 12686 (NOTLB)

<Oct/27 07:40 am>Call Trace:<ffffffff80231bcd>{__down_read+125} 
<ffffffffa01333dc>{:xfs:xfs_access+44}

<Oct/27 07:40 am> <ffffffffa013af44>{:xfs:linvfs_permission+20} 
<ffffffff8019c767>{permission+55}

<Oct/27 07:40 am> <ffffffff8019df1c>{link_path_walk+348} 
<ffffffff801a0706>{__user_walk_it+70}

<Oct/27 07:40 am> <ffffffff801974b0>{vfs_lstat+128} 
<ffffffff80117ec4>{save_i387+148}

<Oct/27 07:40 am> <ffffffff8011018d>{do_signal+1501} 
<ffffffff801975bf>{sys_newlstat+31}

<Oct/27 07:40 am> <ffffffff80147d04>{sys_rt_sigaction+148} 
<ffffffff80110794>{system_call+124}

<Oct/27 07:40 am>

<Oct/27 07:40 am>pim D 00000000000493e0 0 16835 16834 16870 (NOTLB)

<Oct/27 07:40 am>Call 
Trace:<ffffffffa01412ad>{:xfs:xfs_buf_get_flags+877} 
<ffffffffa014179b>{:xfs:kmem_zone_alloc+75}

<Oct/27 07:40 am> <ffffffff8010f9c8>{__down+152} 
<ffffffff80135c50>{default_wake_function+0}

<Oct/27 07:40 am> <ffffffffa012b37b>{:xfs:xfs_trans_log_buf+107} 
<ffffffff80234447>{__down_failed+53}

<Oct/27 07:40 am> <ffffffffa0141642>{:xfs:.text.lock.xfs_buf+15} 
<ffffffffa0126618>{:xfs:xfs_getsb+40}

<Oct/27 07:40 am> <ffffffffa012b8aa>{:xfs:xfs_trans_getsb+106} 
<ffffffffa012a10c>{:xfs:xfs_trans_commit+332}

<Oct/27 07:40 am> <ffffffffa0104d26>{:xfs:xfs_dir2_createname+278} 
<ffffffffa0117d3d>{:xfs:xfs_ichgtime+301}

<Oct/27 07:40 am> <ffffffffa013194f>{:xfs:xfs_create+1359} 
<ffffffffa013b429>{:xfs:linvfs_mknod+521}

<Oct/27 07:40 am> <ffffffffa0116d16>{:xfs:xfs_iunlock+102} 
<ffffffffa0133387>{:xfs:xfs_lookup+119}

<Oct/27 07:40 am> <ffffffffa013b704>{:xfs:linvfs_lookup+84} 
<ffffffff8019c49b>{real_lookup+123}

<Oct/27 07:40 am> <ffffffff8019cedb>{vfs_create+251} 
<ffffffff8019f3a0>{open_namei+464}

<Oct/27 07:40 am> <ffffffff80189cc7>{filp_open+87} 
<ffffffff80189d8f>{sys_open+159}

<Oct/27 07:40 am> <ffffffff80189765>{sys_close+229} 
<ffffffff80110794>{system_call+124}

<Oct/27 07:40 am>

<Oct/27 07:40 am>elim.uptime D 00000000000493e0 0 16873 1 14418 18756 
(NOTLB)

<Oct/27 07:40 am>Call Trace:<ffffffff80231bcd>{__down_read+125} 
<ffffffffa01333dc>{:xfs:xfs_access+44}

<Oct/27 07:40 am> <ffffffffa013af44>{:xfs:linvfs_permission+20} 
<ffffffff8019c767>{permission+55}

<Oct/27 07:40 am> <ffffffff8019df1c>{link_path_walk+348} 
<ffffffff8019f2a1>{open_namei+209}

<Oct/27 07:40 am> <ffffffff80189cc7>{filp_open+87} 
<ffffffff80189d8f>{sys_open+159}

<Oct/27 07:40 am> <ffffffff80111101>{error_exit+0} 
<ffffffff80110794>{system_call+124}

<Oct/27 07:40 am>

<Oct/27 07:40 am>res D 00000000000493e0 0 14323 16149 26319 (NOTLB)

<Oct/27 07:40 am>Call Trace:<ffffffff80301c83>{inet_recvmsg+51} 
<ffffffff802b520a>{sock_aio_read+346}

<Oct/27 07:40 am> <ffffffff80231bcd>{__down_read+125} 
<ffffffffa01333dc>{:xfs:xfs_access+44}

<Oct/27 07:40 am> <ffffffffa013af44>{:xfs:linvfs_permission+20} 
<ffffffff8019c767>{permission+55}

<Oct/27 07:40 am> <ffffffff8019df1c>{link_path_walk+348} 
<ffffffff8019f27f>{open_namei+175}

<Oct/27 07:40 am> <ffffffff80189cc7>{filp_open+87} 
<ffffffff80189d8f>{sys_open+159}

<Oct/27 07:40 am> <ffffffff802b58a8>{sys_socket+104} 
<ffffffff80110794>{system_call+124}

<Oct/27 07:40 am>

<Oct/27 07:40 am>acuSolve-gmpi D 00000000000493e0 0 14418 1 14419 16873 
(NOTLB)

<Oct/27 07:40 am>Call 
Trace:<ffffffff80165ad4>{wait_on_page_writeback_range_wq+324}

<Oct/27 07:40 am> <ffffffff8010f9c8>{__down+152} 
<ffffffff80135c50>{default_wake_function+0}

<Oct/27 07:40 am> <ffffffff80234447>{__down_failed+53} 
<ffffffff801949dc>{.text.lock.super+169}

<Oct/27 07:40 am> <ffffffff8018fcea>{do_sync+42} 
<ffffffff8018fd5e>{sys_sync+62}

<Oct/27 07:40 am> <ffffffff80110794>{system_call+124}

<Oct/27 07:40 am>acuSolve-gmpi D 00000000000493e0 0 14419 1 18864 14418 
(NOTLB)

<Oct/27 07:40 am>Call Trace:<ffffffff8010f9c8>{__down+152} 
<ffffffff80135c50>{default_wake_function+0}

<Oct/27 07:40 am> <ffffffff80234447>{__down_failed+53} 
<ffffffffa0141642>{:xfs:.text.lock.xfs_buf+15}

<Oct/27 07:40 am> <ffffffffa0126618>{:xfs:xfs_getsb+40} 
<ffffffffa012ecea>{:xfs:xfs_syncsub+2602}

<Oct/27 07:40 am> <ffffffffa013e468>{:xfs:vfs_sync+40} 
<ffffffffa013d434>{:xfs:linvfs_sync_super+68}

<Oct/27 07:40 am> <ffffffff80193cff>{sync_filesystems+223} 
<ffffffff8018fcf1>{do_sync+49}

<Oct/27 07:40 am> <ffffffff8018fd5e>{sys_sync+62} 
<ffffffff80110794>{system_call+124}

<Oct/27 07:40 am>

<Oct/27 07:40 am>mktemp D 00000000000493e0 0 17594 1 17656 16834 (NOTLB)

<Oct/27 07:40 am>Call Trace:<ffffffff8025bbe9>{SHATransform+25} 
<ffffffff8019c2cd>{follow_mount+93}

<Oct/27 07:40 am> <ffffffff801a7b51>{dput+33} 
<ffffffff80231bcd>{__down_read+125}

<Oct/27 07:40 am> <ffffffffa01333dc>{:xfs:xfs_access+44} 
<ffffffffa013af44>{:xfs:linvfs_permission+20}

<Oct/27 07:40 am> <ffffffff8019c767>{permission+55} 
<ffffffff8019df1c>{link_path_walk+348}

<Oct/27 07:40 am> <ffffffff801a02ac>{sys_mkdir+220} 
<ffffffff80110794>{system_call+124}

<Oct/27 07:40 am>

<Oct/27 07:40 am>check_EWNstag D 00000000000493e0 0 17620 1 17751 17656 
(NOTLB)

<Oct/27 07:40 am>Call Trace:<ffffffff8018cbbd>{do_sync_write+173} 
<ffffffff80231bcd>{__down_read+125}

<Oct/27 07:40 am> <ffffffffa01333dc>{:xfs:xfs_access+44} 
<ffffffffa013af44>{:xfs:linvfs_permission+20}

<Oct/27 07:40 am> <ffffffff8019c767>{permission+55} 
<ffffffff8019df1c>{link_path_walk+348}

<Oct/27 07:40 am> <ffffffff8019f2a1>{open_namei+209} 
<ffffffff80189cc7>{filp_open+87}

<Oct/27 07:40 am> <ffffffff80189d8f>{sys_open+159} 
<ffffffff80111101>{error_exit+0}

<Oct/27 07:40 am> <ffffffff80110794>{system_call+124}

<Oct/27 07:40 am>

<Oct/27 07:40 am>sh D 00000000000493e0 0 17959 1 17793 17858 (NOTLB)

<Oct/27 07:40 am>Call Trace:<ffffffff80231bcd>{__down_read+125} 
<ffffffffa01333dc>{:xfs:xfs_access+44}

<Oct/27 07:40 am> <ffffffffa013af44>{:xfs:linvfs_permission+20} 
<ffffffff8019c767>{permission+55}

<Oct/27 07:40 am> <ffffffff8019df1c>{link_path_walk+348} 
<ffffffff8019f2a1>{open_namei+209}

<Oct/27 07:40 am> <ffffffff80189cc7>{filp_open+87} 
<ffffffff80189d8f>{sys_open+159}

<Oct/27 07:40 am> <ffffffff80111101>{error_exit+0} 
<ffffffff80110794>{system_call+124}

<Oct/27 07:40 am>

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Issues with XFS on Sles9 sp2.
  2006-12-01 15:08 Issues with XFS on Sles9 sp2 Roger Heflin
@ 2006-12-02  0:25 ` Christian Kujau
  2006-12-02  0:44   ` Roger Heflin
       [not found] ` <20061203224832.GY37654165@melbourne.sgi.com>
  1 sibling, 1 reply; 4+ messages in thread
From: Christian Kujau @ 2006-12-02  0:25 UTC (permalink / raw)
  To: Roger Heflin; +Cc: xfs

On Fri, 1 Dec 2006, Roger Heflin wrote:
> converting the machines to ext3 eliminates the issues.  Under load
> they were seeing 1-2 events per 24 hours on 100 machines.

just to be sure: 1-2 machines out of 100 had a hanging XFS in 24h, 
right? (as opposed to "each of the 100 machines had 1-2 incidents in 
24h" ;))

> They are using SLES9 SP2; currently we cannot go to SP3, as there
> are some other bad driver issues unrelated to XFS (the issue
> preventing us from upgrading also appears to be in 2.6.16.x
> kernel.org kernels, so it is more than just a SLES issue).

So, you can't upgrade to a more current SuSE kernel, but you've already 
tried vanilla kernels? OK, you won't be able to upgrade to a kernel.org 
kernel because of the driver issues - but do the XFS hangs go away?
Did it happen with earlier SuSE kernels too?

> anywhere.  The first type of machine to have the issue,
> and where the issue is a lot more common, has only 4 GB
> of RAM; the second type of machine, which has recently
> also started having the error, has 32 GB of RAM.

Shot in the dark: did you try to boot with the "mem=" option?
You could try to boot with e.g. "mem=512M" and see if the problem has 
something to do with memory...
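In the boot loader that would look roughly like this (an illustrative
GRUB menu.lst entry only; the kernel path and root device here are made
up):

```shell
# /boot/grub/menu.lst -- illustrative entry, adjust for the real system
title Linux (512M memory test)
    kernel /boot/vmlinuz root=/dev/sda2 mem=512M
    initrd /boot/initrd
```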

Christian.
-- 
BOFH excuse #228:

That function is not currently supported, but Bill Gates assures us it will be featured in the next upgrade.


* RE: Issues with XFS on Sles9 sp2.
  2006-12-02  0:25 ` Christian Kujau
@ 2006-12-02  0:44   ` Roger Heflin
  0 siblings, 0 replies; 4+ messages in thread
From: Roger Heflin @ 2006-12-02  0:44 UTC (permalink / raw)
  To: Christian Kujau; +Cc: xfs




-----Original Message-----
From: Christian Kujau [mailto:lists@nerdbynature.de]
Sent: Fri 12/1/2006 6:25 PM
To: Roger Heflin
Cc: xfs@oss.sgi.com
Subject: Re: Issues with XFS on Sles9 sp2.
 
>On Fri, 1 Dec 2006, Roger Heflin wrote:
>> converting the machines to ext3 eliminates the issues.  Under load
>> they were seeing 1-2 events per 24 hours on 100 machines.
>
>just to be sure: 1-2 machines out of 100 had a hanging XFS in 24h, 
>right? (as opposed to "each of the 100 machines had 1-2 incidents in 
>24h" ;))

1-2 machines out of the 100 each 24 hours, different machines will
fail the next day if they are under a similar load.

>> They are using SLES9 SP2; currently we cannot go to SP3, as there
>> are some other bad driver issues unrelated to XFS (the issue
>> preventing us from upgrading also appears to be in 2.6.16.x
>> kernel.org kernels, so it is more than just a SLES issue).
>
>So, you can't upgrade to a more current SuSE kernel, but you've already 
>tried vanilla kernels? OK, you won't be able to upgrade to a kernel.org 
>kernel because of the driver issues - but do the XFS hangs go away?
>Did it happen with earlier SuSE kernels too?

I tested the earlier vanilla kernels on only a couple of machines,
to see if the "other" problem was also present; if it had not been,
I might have been able to upgrade that specific part.  Since the
"other" problem was in the kernel.org kernel, the new driver was
unlikely to fix anything, so I don't have any data that would tell me
whether the XFS problem happens there or not.     We did not find the
XFS issue in the testing stage (where the SP3 and kernel.org testing
was done); we found it in the early production phase.  The customer
can duplicate it, but those machines are connected by sneakernet, at
a remote site.

I also don't have similar data on earlier versions of
SLES, as the SP1 setup only uses XFS on a limited number
of file servers (no general usage), but on those file servers
we have not seen any issues and have run 4 machines like that
for 1.5 years.   Though if it were an out-of-kernel-memory
issue, those machines may never hit that condition.

The condition does not seem to happen at idle; the machines
only lock up with a load on them.  We never saw the issue at
idle or with well-understood loads (HPL).

>> anywhere.  The first type of machine to have the issue,
>> and where the issue is a lot more common, has only 4 GB
>> of RAM; the second type of machine, which has recently
>> also started having the error, has 32 GB of RAM.
>
>Shot in the dark: did you try to boot with the "mem=" option?
>You could try to boot with e.g. "mem=512M" and see if the problem has 
>something to do with memory...


I don't think it has to do with total memory; it may have to
do with some resource that is in short supply at some point,
like maybe a shortage of kernel memory at the wrong time.
The machines have never been observed to fail while idle
or under controlled loads (HPL), and we have significant
run times under both of those conditions (i.e. 20+ times the
observed failure time).  So far the machines have only failed under
production loads; we have yet to be able to duplicate the
failure under any test load.

The machines all appear to work correctly under tests like
bonnie, and the machines are found with XFS locked up
several hours later, so we don't really have a way of catching
exactly what happened at the time of the event to cause it.
Non-XFS filesystems still function, and the machine is still
usable so long as one avoids the locked-up XFS filesystem;
we only have 1 XFS filesystem on each machine, so I don't know
whether a second one would also be locked up.  Also, these
filesystems are built on MD stripe arrays, and that may
play some part in this.    The ext3 filesystems are also
built on the same MD arrays and work just fine.

We used XFS because we observed that ext3's write rates
started to drop significantly after a few minutes of
sustained IO (much more than was expected from moving
to the inner disk tracks), while XFS did not appear to suffer
from this issue and so produced more predictable results.
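A rough way to see that kind of fall-off (a sketch only, not the
benchmark we used; file names and sizes here are arbitrary):

```shell
# Write several 1 GB chunks in sequence and watch whether the rate
# reported by dd drops from one chunk to the next.
for i in 1 2 3 4; do
    dd if=/dev/zero of=/tmp/ddtest.$i bs=1M count=1024 conv=fsync 2>&1 | tail -n 1
done
rm -f /tmp/ddtest.1 /tmp/ddtest.2 /tmp/ddtest.3 /tmp/ddtest.4
```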

                         Roger


[[HTML alternate version deleted]]


* Re: Issues with XFS on Sles9 sp2.
       [not found]         ` <20061205004416.GU33919298@melbourne.sgi.com>
@ 2006-12-20 15:45           ` Roger Heflin
  0 siblings, 0 replies; 4+ messages in thread
From: Roger Heflin @ 2006-12-20 15:45 UTC (permalink / raw)
  To: David Chinner, xfs

David,

I have a bit more data on this hang up.

So far we have not been able to reproduce it with any other test
applications.

It will so far only duplicate with one particular application, and
the trace from that application shows something interesting:

<Oct/27 07:40 am>acuSolve-gmpi D 00000000000493e0 0 14418 1 14419 16873 
(NOTLB)
<Oct/27 07:40 am>Call 
Trace:<ffffffff80165ad4>{wait_on_page_writeback_range_wq+324}
<Oct/27 07:40 am> <ffffffff8010f9c8>{__down+152} 
<ffffffff80135c50>{default_wake_function+0}
<Oct/27 07:40 am> <ffffffff80234447>{__down_failed+53} 
<ffffffff801949dc>{.text.lock.super+169}
<Oct/27 07:40 am> <ffffffff8018fcea>{do_sync+42} 
<ffffffff8018fd5e>{sys_sync+62}
<Oct/27 07:40 am> <ffffffff80110794>{system_call+124}

<Oct/27 07:40 am>acuSolve-gmpi D 00000000000493e0 0 14419 1 18864 14418 
(NOTLB)
<Oct/27 07:40 am>Call Trace:<ffffffff8010f9c8>{__down+152} 
<ffffffff80135c50>{default_wake_function+0}
<Oct/27 07:40 am> <ffffffff80234447>{__down_failed+53} 
<ffffffffa0141642>{:xfs:.text.lock.xfs_buf+15}
<Oct/27 07:40 am> <ffffffffa0126618>{:xfs:xfs_getsb+40} 
<ffffffffa012ecea>{:xfs:xfs_syncsub+2602}
<Oct/27 07:40 am> <ffffffffa013e468>{:xfs:vfs_sync+40} 
<ffffffffa013d434>{:xfs:linvfs_sync_super+68}
<Oct/27 07:40 am> <ffffffff80193cff>{sync_filesystems+223} 
<ffffffff8018fcf1>{do_sync+49}
<Oct/27 07:40 am> <ffffffff8018fd5e>{sys_sync+62} 
<ffffffff80110794>{system_call+124}

Both copies of the application appear to be calling sys_sync at the
same time, and each one is hung in a different part of the sys_sync
path.  I wonder whether a deadlock is possible when something does
this sort of thing?
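A guess at a reproducer along those lines (a sketch under that
assumption only; the path and loop counts are arbitrary, and it has
not been confirmed to trigger the hang):

```shell
# Two processes hammering sync() while a third churns files on the
# suspect filesystem, to try to provoke the concurrent-sys_sync case.
( i=0; while [ $i -lt 200 ]; do sync; i=$((i+1)); done ) &
( i=0; while [ $i -lt 200 ]; do sync; i=$((i+1)); done ) &
( i=0; while [ $i -lt 200 ]; do
      f=/tmp/syncrepro.$$.$i
      echo x > "$f" && rm -f "$f"
      i=$((i+1))
  done ) &
wait
echo "finished without hanging"
```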

All the other processes are always running on the machines; these 2
processes are the extra ones that appear to be required to get the
XFS lockup condition.  No other application processes have ever
reproduced the issue, so it is likely that something these 2
processes are doing is required to cause the deadlock.

Both processes are trying to get locks and failing.  If I look at the
other processes that are working with locks, I see:

<Oct/27 07:40 am>Call Trace:<ffffffffa0141832>{:xfs:kmem_zone_zalloc+50} 
<ffffffffa012a9c4>{:xfs:_xfs_trans_alloc+36}
<Oct/27 07:40 am> <ffffffff80231b35>{__down_write+117} 
<ffffffffa0116ead>{:xfs:xfs_ilock+93}
<Oct/27 07:40 am> <ffffffffa012eda3>{:xfs:xfs_syncsub+2787} 
<ffffffff80146970>{del_timer_sync+80}


<Oct/27 07:40 am> <ffffffff80135c50>{default_wake_function+0} 
<ffffffff80234447>{__down_failed+53}
<Oct/27 07:40 am> <ffffffffa0141642>{:xfs:.text.lock.xfs_buf+15} 
<ffffffffa0126618>{:xfs:xfs_getsb+40}
<Oct/27 07:40 am> <ffffffffa012b8aa>{:xfs:xfs_trans_getsb+106} 
<ffffffffa012a10c>{:xfs:xfs_trans_commit+332}


<Oct/27 07:40 am> <ffffffffa012b37b>{:xfs:xfs_trans_log_buf+107} 
<ffffffff80234447>{__down_failed+53}
<Oct/27 07:40 am> <ffffffffa0141642>{:xfs:.text.lock.xfs_buf+15} 
<ffffffffa0126618>{:xfs:xfs_getsb+40}
<Oct/27 07:40 am> <ffffffffa012b8aa>{:xfs:xfs_trans_getsb+106} 
<ffffffffa012a10c>{:xfs:xfs_trans_commit+332}


<Oct/27 07:40 am> <ffffffffa013194f>{:xfs:xfs_create+1359} 
<ffffffffa013b429>{:xfs:linvfs_mknod+521}
<Oct/27 07:40 am> <ffffffffa0116d16>{:xfs:xfs_iunlock+102} 
<ffffffffa0133387>{:xfs:xfs_lookup+119}
<Oct/27 07:40 am> <ffffffffa013b704>{:xfs:linvfs_lookup+84} 
<ffffffff8019c49b>{real_lookup+123}


<Oct/27 07:40 am> <ffffffff8010f9c8>{__down+152} 
<ffffffff80135c50>{default_wake_function+0}
<Oct/27 07:40 am> <ffffffff80234447>{__down_failed+53} 
<ffffffff801949dc>{.text.lock.super+169}
<Oct/27 07:40 am> <ffffffff8018fcea>{do_sync+42} 
<ffffffff8018fd5e>{sys_sync+62}


<Oct/27 07:40 am>Call Trace:<ffffffff8010f9c8>{__down+152} 
<ffffffff80135c50>{default_wake_function+0}
<Oct/27 07:40 am> <ffffffff80234447>{__down_failed+53} 
<ffffffffa0141642>{:xfs:.text.lock.xfs_buf+15}
<Oct/27 07:40 am> <ffffffffa0126618>{:xfs:xfs_getsb+40} 
<ffffffffa012ecea>{:xfs:xfs_syncsub+2602}

It very much looks like a deadlock, as almost every hung process
appears to be working on or waiting for locks.

Is there a way to list the kernel locks and see who is holding what locks?

                                Roger

Full Trace follows:

<Oct/27 07:40 am>xfssyncd D 00000000000493e0 0 2760 1 3876 2755 (L-TLB)
<Oct/27 07:40 am>Call Trace:<ffffffffa0141832>{:xfs:kmem_zone_zalloc+50} 
<ffffffffa012a9c4>{:xfs:_xfs_trans_alloc+36}
<Oct/27 07:40 am> <ffffffff80231b35>{__down_write+117} 
<ffffffffa0116ead>{:xfs:xfs_ilock+93}
<Oct/27 07:40 am> <ffffffffa012eda3>{:xfs:xfs_syncsub+2787} 
<ffffffff80146970>{del_timer_sync+80}
<Oct/27 07:40 am> <ffffffff80146a55>{del_singleshot_timer_sync+21} 
<ffffffff80146d2e>{schedule_timeout+254}
<Oct/27 07:40 am> <ffffffffa013e468>{:xfs:vfs_sync+40} 
<ffffffffa013da79>{:xfs:vfs_sync_worker+25}
<Oct/27 07:40 am> <ffffffffa013dc1a>{:xfs:xfssyncd+378} 
<ffffffffa013d780>{:xfs:linvfs_fill_super+0}
<Oct/27 07:40 am> <ffffffff801112b7>{child_rip+8} 
<ffffffffa013d780>{:xfs:linvfs_fill_super+0}
<Oct/27 07:40 am> <ffffffffa013daa0>{:xfs:xfssyncd+0} 
<ffffffff801112af>{child_rip+0}
<Oct/27 07:40 am>

<Oct/27 07:40 am>res D 000000000000000a 0 16149 1 26319 16151 5825 (NOTLB)
<Oct/27 07:40 am>Call Trace:<ffffffff80231bcd>{__down_read+125} 
<ffffffffa01333dc>{:xfs:xfs_access+44}
<Oct/27 07:40 am> <ffffffffa013af44>{:xfs:linvfs_permission+20} 
<ffffffff8019c767>{permission+55}
<Oct/27 07:40 am> <ffffffff8019df1c>{link_path_walk+348} 
<ffffffff801a0706>{__user_walk_it+70}
<Oct/27 07:40 am> <ffffffff801974b0>{vfs_lstat+128} 
<ffffffff80122868>{do_page_fault+536}
<Oct/27 07:40 am> <ffffffff801975bf>{sys_newlstat+31} 
<ffffffff80111101>{error_exit+0}
<Oct/27 07:40 am> <ffffffff80110794>{system_call+124}



<Oct/27 07:40 am>sbatchd D 00000000000493e0 0 16151 1 12686 16149 (NOTLB)
<Oct/27 07:40 am>Call Trace:<ffffffff801a7b51>{dput+33} 
<ffffffff8019c2cd>{follow_mount+93}
<Oct/27 07:40 am> <ffffffff801a7b51>{dput+33} 
<ffffffff80231bcd>{__down_read+125}
<Oct/27 07:40 am> <ffffffffa01333dc>{:xfs:xfs_access+44} 
<ffffffffa013af44>{:xfs:linvfs_permission+20}
<Oct/27 07:40 am> <ffffffff8019c767>{permission+55} 
<ffffffff8018aeca>{sys_chdir+138}
<Oct/27 07:40 am> <ffffffff801a394c>{sys_select+1244} 
<ffffffff80110794>{system_call+124}
<Oct/27 07:40 am>

<Oct/27 07:40 am>gm_mapper D 000000000000000a 0 12686 1 16834 16151 (L-TLB)
<Oct/27 07:40 am>Call 
Trace:<ffffffffa012b37b>{:xfs:xfs_trans_log_buf+107} 
<ffffffff8010f9c8>{__down+152}
<Oct/27 07:40 am> <ffffffff80135c50>{default_wake_function+0} 
<ffffffff80234447>{__down_failed+53}
<Oct/27 07:40 am> <ffffffffa0141642>{:xfs:.text.lock.xfs_buf+15} 
<ffffffffa0126618>{:xfs:xfs_getsb+40}
<Oct/27 07:40 am> <ffffffffa012b8aa>{:xfs:xfs_trans_getsb+106} 
<ffffffffa012a10c>{:xfs:xfs_trans_commit+332}
<Oct/27 07:40 am> <ffffffffa00e7a9c>{:xfs:xfs_free_extent+204} 
<ffffffffa0111634>{:xfs:xfs_efd_init+68}
<Oct/27 07:40 am> <ffffffffa014179b>{:xfs:kmem_zone_alloc+75} 
<ffffffffa0141832>{:xfs:kmem_zone_zalloc+50}
<Oct/27 07:40 am> <ffffffffa011a9cd>{:xfs:xfs_itruncate_finish+557} 
<ffffffffa012aae9>{:xfs:xfs_trans_alloc+217}
<Oct/27 07:40 am> <ffffffff8011081d>{sysret_signal+28} 
<ffffffffa01300af>{:xfs:xfs_inactive+591}
<Oct/27 07:40 am> <ffffffff8011081d>{sysret_signal+28} 
<ffffffff80169f50>{__pagevec_free+32}
<Oct/27 07:40 am> <ffffffff8011081d>{sysret_signal+28} 
<ffffffffa013ebc8>{:xfs:vn_rele+72}
<Oct/27 07:40 am> <ffffffffa013d392>{:xfs:linvfs_clear_inode+18} 
<ffffffff801a9d3b>{clear_inode+155}
<Oct/27 07:40 am> <ffffffff801aa3f5>{generic_delete_inode+245} 
<ffffffff801a95ee>{iput+158}
<Oct/27 07:40 am> <ffffffff801a7cb5>{dput+389} 
<ffffffff8018d9de>{__fput+270}
<Oct/27 07:40 am> <ffffffff8018965e>{filp_close+126} 
<ffffffff8013f073>{put_files_struct+115}
<Oct/27 07:40 am> <ffffffff80140522>{do_exit+1010} 
<ffffffff801484b5>{__dequeue_signal+501}
<Oct/27 07:40 am> <ffffffff8011081d>{sysret_signal+28} 
<ffffffff80140fa8>{do_group_exit+232}
<Oct/27 07:40 am> <ffffffff8014ab37>{get_signal_to_deliver+1175} 
<ffffffff8011004b>{do_signal+1179}
<Oct/27 07:40 am> <ffffffff8010fc45>{do_signal+149} 
<ffffffffa02dbea0>{:gm:gm_linux_ioctl+0}
<Oct/27 07:40 am> <ffffffffa02dbf0a>{:gm:gm_linux_ioctl+106} 
<ffffffff801a2094>{sys_ioctl+1092}
<Oct/27 07:40 am> <ffffffff8011052d>{sys_rt_sigreturn+653} 
<ffffffff8011081d>{sysret_signal+28}
<Oct/27 07:40 am> <ffffffff80110adf>{ptregscall_common+103}


<Oct/27 07:40 am>lim D 000000000000000a 0 16834 1 16835 17594 12686 (NOTLB)
<Oct/27 07:40 am>Call Trace:<ffffffff80231bcd>{__down_read+125} 
<ffffffffa01333dc>{:xfs:xfs_access+44}
<Oct/27 07:40 am> <ffffffffa013af44>{:xfs:linvfs_permission+20} 
<ffffffff8019c767>{permission+55}
<Oct/27 07:40 am> <ffffffff8019df1c>{link_path_walk+348} 
<ffffffff801a0706>{__user_walk_it+70}
<Oct/27 07:40 am> <ffffffff801974b0>{vfs_lstat+128} 
<ffffffff80117ec4>{save_i387+148}
<Oct/27 07:40 am> <ffffffff8011018d>{do_signal+1501} 
<ffffffff801975bf>{sys_newlstat+31}
<Oct/27 07:40 am> <ffffffff80147d04>{sys_rt_sigaction+148} 
<ffffffff80110794>{system_call+124}
<Oct/27 07:40 am>

<Oct/27 07:40 am>pim D 00000000000493e0 0 16835 16834 16870 (NOTLB)
<Oct/27 07:40 am>Call 
Trace:<ffffffffa01412ad>{:xfs:xfs_buf_get_flags+877} 
<ffffffffa014179b>{:xfs:kmem_zone_alloc+75}
<Oct/27 07:40 am> <ffffffff8010f9c8>{__down+152} 
<ffffffff80135c50>{default_wake_function+0}
<Oct/27 07:40 am> <ffffffffa012b37b>{:xfs:xfs_trans_log_buf+107} 
<ffffffff80234447>{__down_failed+53}
<Oct/27 07:40 am> <ffffffffa0141642>{:xfs:.text.lock.xfs_buf+15} 
<ffffffffa0126618>{:xfs:xfs_getsb+40}
<Oct/27 07:40 am> <ffffffffa012b8aa>{:xfs:xfs_trans_getsb+106} 
<ffffffffa012a10c>{:xfs:xfs_trans_commit+332}
<Oct/27 07:40 am> <ffffffffa0104d26>{:xfs:xfs_dir2_createname+278} 
<ffffffffa0117d3d>{:xfs:xfs_ichgtime+301}
<Oct/27 07:40 am> <ffffffffa013194f>{:xfs:xfs_create+1359} 
<ffffffffa013b429>{:xfs:linvfs_mknod+521}
<Oct/27 07:40 am> <ffffffffa0116d16>{:xfs:xfs_iunlock+102} 
<ffffffffa0133387>{:xfs:xfs_lookup+119}
<Oct/27 07:40 am> <ffffffffa013b704>{:xfs:linvfs_lookup+84} 
<ffffffff8019c49b>{real_lookup+123}
<Oct/27 07:40 am> <ffffffff8019cedb>{vfs_create+251} 
<ffffffff8019f3a0>{open_namei+464}
<Oct/27 07:40 am> <ffffffff80189cc7>{filp_open+87} 
<ffffffff80189d8f>{sys_open+159}
<Oct/27 07:40 am> <ffffffff80189765>{sys_close+229} 
<ffffffff80110794>{system_call+124}
<Oct/27 07:40 am>

<Oct/27 07:40 am>elim.uptime D 00000000000493e0 0 16873 1 14418 18756 
(NOTLB)
<Oct/27 07:40 am>Call Trace:<ffffffff80231bcd>{__down_read+125} 
<ffffffffa01333dc>{:xfs:xfs_access+44}
<Oct/27 07:40 am> <ffffffffa013af44>{:xfs:linvfs_permission+20} 
<ffffffff8019c767>{permission+55}
<Oct/27 07:40 am> <ffffffff8019df1c>{link_path_walk+348} 
<ffffffff8019f2a1>{open_namei+209}
<Oct/27 07:40 am> <ffffffff80189cc7>{filp_open+87} 
<ffffffff80189d8f>{sys_open+159}
<Oct/27 07:40 am> <ffffffff80111101>{error_exit+0} 
<ffffffff80110794>{system_call+124}
<Oct/27 07:40 am>

<Oct/27 07:40 am>res D 00000000000493e0 0 14323 16149 26319 (NOTLB)
<Oct/27 07:40 am>Call Trace:<ffffffff80301c83>{inet_recvmsg+51} 
<ffffffff802b520a>{sock_aio_read+346}
<Oct/27 07:40 am> <ffffffff80231bcd>{__down_read+125} 
<ffffffffa01333dc>{:xfs:xfs_access+44}
<Oct/27 07:40 am> <ffffffffa013af44>{:xfs:linvfs_permission+20} 
<ffffffff8019c767>{permission+55}
<Oct/27 07:40 am> <ffffffff8019df1c>{link_path_walk+348} 
<ffffffff8019f27f>{open_namei+175}
<Oct/27 07:40 am> <ffffffff80189cc7>{filp_open+87} 
<ffffffff80189d8f>{sys_open+159}
<Oct/27 07:40 am> <ffffffff802b58a8>{sys_socket+104} 
<ffffffff80110794>{system_call+124}
<Oct/27 07:40 am>

<Oct/27 07:40 am>acuSolve-gmpi D 00000000000493e0 0 14418 1 14419 16873 
(NOTLB)
<Oct/27 07:40 am>Call 
Trace:<ffffffff80165ad4>{wait_on_page_writeback_range_wq+324}
<Oct/27 07:40 am> <ffffffff8010f9c8>{__down+152} 
<ffffffff80135c50>{default_wake_function+0}
<Oct/27 07:40 am> <ffffffff80234447>{__down_failed+53} 
<ffffffff801949dc>{.text.lock.super+169}
<Oct/27 07:40 am> <ffffffff8018fcea>{do_sync+42} 
<ffffffff8018fd5e>{sys_sync+62}
<Oct/27 07:40 am> <ffffffff80110794>{system_call+124}

<Oct/27 07:40 am>acuSolve-gmpi D 00000000000493e0 0 14419 1 18864 14418 
(NOTLB)
<Oct/27 07:40 am>Call Trace:<ffffffff8010f9c8>{__down+152} 
<ffffffff80135c50>{default_wake_function+0}
<Oct/27 07:40 am> <ffffffff80234447>{__down_failed+53} 
<ffffffffa0141642>{:xfs:.text.lock.xfs_buf+15}
<Oct/27 07:40 am> <ffffffffa0126618>{:xfs:xfs_getsb+40} 
<ffffffffa012ecea>{:xfs:xfs_syncsub+2602}
<Oct/27 07:40 am> <ffffffffa013e468>{:xfs:vfs_sync+40} 
<ffffffffa013d434>{:xfs:linvfs_sync_super+68}
<Oct/27 07:40 am> <ffffffff80193cff>{sync_filesystems+223} 
<ffffffff8018fcf1>{do_sync+49}
<Oct/27 07:40 am> <ffffffff8018fd5e>{sys_sync+62} 
<ffffffff80110794>{system_call+124}
<Oct/27 07:40 am>
<Oct/27 07:40 am>mktemp D 00000000000493e0 0 17594 1 17656 16834 (NOTLB)
<Oct/27 07:40 am>Call Trace:<ffffffff8025bbe9>{SHATransform+25} 
<ffffffff8019c2cd>{follow_mount+93}
<Oct/27 07:40 am> <ffffffff801a7b51>{dput+33} 
<ffffffff80231bcd>{__down_read+125}
<Oct/27 07:40 am> <ffffffffa01333dc>{:xfs:xfs_access+44} 
<ffffffffa013af44>{:xfs:linvfs_permission+20}
<Oct/27 07:40 am> <ffffffff8019c767>{permission+55} 
<ffffffff8019df1c>{link_path_walk+348}
<Oct/27 07:40 am> <ffffffff801a02ac>{sys_mkdir+220} 
<ffffffff80110794>{system_call+124}
<Oct/27 07:40 am>
<Oct/27 07:40 am>check_EWNstag D 00000000000493e0 0 17620 1 17751 17656 
(NOTLB)
<Oct/27 07:40 am>Call Trace:<ffffffff8018cbbd>{do_sync_write+173} 
<ffffffff80231bcd>{__down_read+125}
<Oct/27 07:40 am> <ffffffffa01333dc>{:xfs:xfs_access+44} 
<ffffffffa013af44>{:xfs:linvfs_permission+20}
<Oct/27 07:40 am> <ffffffff8019c767>{permission+55} 
<ffffffff8019df1c>{link_path_walk+348}
<Oct/27 07:40 am> <ffffffff8019f2a1>{open_namei+209} 
<ffffffff80189cc7>{filp_open+87}
<Oct/27 07:40 am> <ffffffff80189d8f>{sys_open+159} 
<ffffffff80111101>{error_exit+0}
<Oct/27 07:40 am> <ffffffff80110794>{system_call+124} <Oct/27 07:40 am>
<Oct/27 07:40 am>
<Oct/27 07:40 am>sh D 00000000000493e0 0 17959 1 17793 17858 (NOTLB)
<Oct/27 07:40 am>Call Trace:<ffffffff80231bcd>{__down_read+125} 
<ffffffffa01333dc>{:xfs:xfs_access+44}
<Oct/27 07:40 am> <ffffffffa013af44>{:xfs:linvfs_permission+20} 
<ffffffff8019c767>{permission+55}
<Oct/27 07:40 am> <ffffffff8019df1c>{link_path_walk+348} 
<ffffffff8019f2a1>{open_namei+209}
<Oct/27 07:40 am> <ffffffff80189cc7>{filp_open+87} 
<ffffffff80189d8f>{sys_open+159}
<Oct/27 07:40 am> <ffffffff80111101>{error_exit+0} 
<ffffffff80110794>{system_call+124}
<Oct/27 07:40 am>


end of thread, other threads:[~2006-12-20 15:46 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-12-01 15:08 Issues with XFS on Sles9 sp2 Roger Heflin
2006-12-02  0:25 ` Christian Kujau
2006-12-02  0:44   ` Roger Heflin
     [not found] ` <20061203224832.GY37654165@melbourne.sgi.com>
     [not found]   ` <457431B1.20700@atipa.com>
     [not found]     ` <20061204224208.GO33919298@melbourne.sgi.com>
     [not found]       ` <4574AA4F.4030902@atipa.com>
     [not found]         ` <20061205004416.GU33919298@melbourne.sgi.com>
2006-12-20 15:45           ` Roger Heflin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox