xfs partial dismount issue

* xfs partial dismount issue
@ 2007-03-05 13:13 Chuck Weber
  2007-03-05 15:57 ` Eric Sandeen
  0 siblings, 1 reply; 5+ messages in thread
From: Chuck Weber @ 2007-03-05 13:13 UTC (permalink / raw)
  To: linux-xfs

Hi everyone, I have a long-running problem that perhaps you can help
with. I will include as much detail as I can. I can set up a spare
server and disk set for testing if you have any bright ideas.

We use XFS for Samba and NFS on x86_64 Fedora ProLiant DL585/DL385
servers. Our busiest server has disk partitions go away. The other
servers never show this behavior. The partitions show as mounted, but
access to the partition just hangs. Open file count, process count, and
load average rise until the server becomes very unresponsive. Even if we
catch it before the load average gets high, the partition cannot be
unmounted, so the server must be powered off and back on. After the
restart all partitions mount properly and everything is fine for days or
months. I have not noticed anything in the log files. With sar, I can
track the rise in open files and process count.

I believed this to be a hardware issue and embarked on replacing parts
along the partition chain. I recently replaced the server itself and saw
the same issue the next week, so I don't think it is hardware. I think
the problem is related to XFS/Samba/ACL/load usage, as I have 2-8
directories set up as Samba shares in a given partition. When the
problem occurs, I first cannot access one directory, and shortly
afterward I cannot access the entire partition. This problem has
affected 3 partitions so far. Over the last 3 months it has occurred
every week or two.
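
The counters tracked with sar above can also be sampled directly from
/proc, which may help when logging a timestamped trail in the run-up to
a hang. The following is only a minimal sketch, not part of the setup
described in this thread: the function names are hypothetical, and it
assumes the usual Linux layouts of /proc/sys/fs/file-nr (allocated,
free, maximum) and /proc/loadavg (three load averages, running/total
tasks, last PID).

```python
import time


def parse_file_nr(text):
    """Parse /proc/sys/fs/file-nr: allocated handles, free handles, maximum."""
    allocated, free, maximum = (int(field) for field in text.split())
    return {"allocated": allocated, "free": free, "max": maximum}


def parse_loadavg(text):
    """Parse /proc/loadavg: 1/5/15-minute load and running/total task counts."""
    fields = text.split()
    running, total = fields[3].split("/")
    return {
        "load1": float(fields[0]),
        "load5": float(fields[1]),
        "load15": float(fields[2]),
        "running": int(running),
        "total_tasks": int(total),
    }


def sample():
    """Read both files once and return a combined, timestamped snapshot."""
    with open("/proc/sys/fs/file-nr") as f:
        snapshot = parse_file_nr(f.read())
    with open("/proc/loadavg") as f:
        snapshot.update(parse_loadavg(f.read()))
    snapshot["time"] = time.strftime("%Y-%m-%d %H:%M:%S")
    return snapshot
```

Calling sample() at a fixed interval and appending the result to a file
on a partition that is not affected would give a record of when the
counts start to climb.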

Configuration:

ProLiant DL585, 8 GB RAM, 2 processors, with 3 Smart Array 6404
4-channel U320 RAID cards. 6 MSA30 dual-channel disk carriers with 14
drives each, in RAID with 2 parity stripes. We started with 72 GB drives
and have since updated 1 carrier each to 146 GB and 300 GB drives. Each
disk carrier is mounted as a single partition, store1 through store6.
The partition that most recently had the mounting problem is shown
below:

/dev/cciss/c3d0p1 on /share/store3 type xfs (rw,logbufs=8)

/dev/cciss/c3d0p1  814G  677G  138G  84% /share/store3

meta-data=/dev/cciss/c3d0p1      isize=2048   agcount=32, agsize=6668186 blks
         =                       sectsz=512
data     =                       bsize=4096   blocks=213381952, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal               bsize=4096   blocks=32768, version=1
         =                       sectsz=512   sunit=0 blks
realtime =none                   extsz=65536  blocks=0, rtextents=0
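
As a quick consistency check on the numbers above, the xfs_info
geometry can be multiplied out and compared with the size df reports.
This is just a worked-arithmetic sketch; the variable names are made up
for illustration, and the values are copied from the output above.

```python
# Values copied from the xfs_info output above.
agcount = 32
agsize_blocks = 6668186
block_size = 4096
data_blocks = 213381952
log_blocks = 32768

# The allocation groups should tile the data section exactly.
total_from_ags = agcount * agsize_blocks

# Convert to binary units, which is what df -h style output uses.
data_gib = data_blocks * block_size / 2**30
log_mib = log_blocks * block_size / 2**20

print(total_from_ags == data_blocks)
print(round(data_gib, 1), "GiB data section")
print(log_mib, "MiB internal log")
```

The allocation groups account for all 213381952 data blocks, the data
section comes to about 814 GiB (matching the 814G shown by df), and the
internal log is 128 MiB.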

I recently added the nobarrier and noatime mount options, as suggested
on the list, but I don't see that they affect the problem.

For the carrier with 300 GB drives I am using LVM, as it runs into a
6404 2 TB limit, but I am only using 300-400 GB on it so far.

All servers are running x86_64 Fedora, so I hope not to have the stack
issue.

The DL585 (3 RAID controllers, 6 disk chassis) that has no problems runs
Fedora Core 2 and acts as an NFS server for some computational machines.
Another DL585 with only 1 RAID controller acts as the Windows home
directory and mail store server. It runs Fedora Core 4 and Samba
3.0.23a. These servers would show the same xfs_info as above on their
RAID partitions. Both of these servers have no problems and very long
uptimes.

Our problem server started on Fedora Core 2 with whatever Samba we used
then. When it first had problems, I upgraded to FC4 and then to FC5
with Samba 3.0.24. I have applied all current HP firmware throughout
this process. I have changed out power, disks, disk carriers, SCSI
cables, and RAID controllers. I finally swapped the DL585 for a DL385
with 4 processors and 16 GB RAM. None of this made a difference. The
Fedora Core 5 2.6.18 and 2.6.19 kernels dumped within 1 day of booting
with a spinlock error, so I am now running the latest FC5 2.6.17 kernel,
which does include the 2.6.17.13 patch. I have run HP diagnostics for
hours with no results. I have taken the active server offline and run
xfs_repair on the partitions. I have reformatted one of the partitions.
I have been formatting the partitions with an inode size of 2k and no
other options.

Current RPMs are below. Note that I have used different versions on this
server from FC2 to the present, and at times downloaded and built
acl/attr/xfsprogs, all with no difference in my problem:

acl-2.2.34-1.2
attr-2.4.28-1.2
samba-3.0.24-1.fc5
xfsprogs-2.7.3-1.2.1
kernel-2.6.17-1.2187_FC5

I could move to ext3, but in my one recent test it ran into trouble just
copying files with ACLs from an XFS partition to it. XFS performance
seems quite good, with my limiting factor being AD user/group ID lookup
times.

All I can think of now is some resource, tuning, formatting, or kernel
change. I would appreciate any suggestions you can come up with.

Thanks,
Chuck 

* Re: xfs partial dismount issue
  2007-03-05 13:13 xfs partial dismount issue Chuck Weber
@ 2007-03-05 15:57 ` Eric Sandeen
  2007-03-05 18:25   ` Charles Weber
  0 siblings, 1 reply; 5+ messages in thread
From: Eric Sandeen @ 2007-03-05 15:57 UTC (permalink / raw)
  To: Chuck Weber; +Cc: linux-xfs

Chuck Weber wrote:
> Hi everyone, I have a long running problem perhaps you can help with. I
> will include as much detail as I can. I can set up a spare server-disk
> set for testing if you have any bright ideas.
> 
> We use XFS for samba and nfs on x86_64 Fedora Proliant DL585/385
> servers. Our busiest server has disk partitions go away. 

What do you mean by this, exactly?  The partitions themselves go away,
or are you talking about the problem described below where processes
start hanging?

> The other
> servers do not show this behavior ever. The partitions show as mounted,
> but access to the partition just hangs. Open file count, process count
> and load average rise until the server becomes very unresponsive. Even
> if we catch it before the high load average, because it cannot unmount
> the partition, it must be powered off and back on to restart. Upon
> restart all partitions mount properly and everything is fine for days or
> months. There is nothing in log files that I have noticed. With sar, I
> can track the files open and process count rise. 

Maybe try sysrq-t, to capture all backtraces when it's in this state,
and see where the various threads are at.
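
For reference, the sysrq-t task dump can also be requested without a
console keyboard by writing the letter t to /proc/sysrq-trigger as
root; the backtraces then appear in the kernel log. The sketch below is
only an illustration under those assumptions: it requires a kernel with
magic SysRq support, the function names and the trigger_path parameter
are hypothetical (the parameter exists so the write can be tried
against an ordinary file), and it relies on the dmesg command being
available.

```python
import subprocess

SYSRQ_TRIGGER = "/proc/sysrq-trigger"


def trigger_sysrq(command="t", trigger_path=SYSRQ_TRIGGER):
    """Write a single SysRq command character to the trigger file (needs root)."""
    if len(command) != 1:
        raise ValueError("SysRq command must be a single character")
    with open(trigger_path, "w") as trigger:
        trigger.write(command)


def save_kernel_log(output_path):
    """Save the current kernel ring buffer (dmesg output) to output_path."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    with open(output_path, "w") as out:
        out.write(result.stdout)
```

On a hung system one would call trigger_sysrq() and then
save_kernel_log() with an output path on a filesystem that is still
responding, since a write to the hung partition would block as well.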

-Eric

* Re: xfs partial dismount issue
  2007-03-05 15:57 ` Eric Sandeen
@ 2007-03-05 18:25   ` Charles Weber
  2007-03-05 21:07     ` Roger Heflin
  0 siblings, 1 reply; 5+ messages in thread
From: Charles Weber @ 2007-03-05 18:25 UTC (permalink / raw)
  To: linux-xfs

Eric Sandeen <sandeen <at> sandeen.net> writes:

> 
> Chuck Weber wrote:
> > Hi everyone, I have a long running problem perhaps you can help with. I
> > will include as much detail as I can. I can set up a spare server-disk
> > set for testing if you have any bright ideas.
> > 
> > We use XFS for samba and nfs on x86_64 Fedora Proliant DL585/385
> > servers. Our busiest server has disk partitions go away. 
> 
> What do you mean by this, exactly?  The partitions themselves go away,
> or are you talking about the problem described below where processes
> start hanging?
> 
Here is an example partition (1 of 6 or more, XFS storage only):
/share/store3, with Samba shares on /share/store3/lls, lds, lxs and so on.
I will get a call saying "my group's share (lxs) is no longer accessible".
I ssh into the server and can ls /share/store3, but ls hangs when I ls
/share/store3/lxs. Shortly thereafter ls hangs for the root or any
directory on the partition. Other partitions and other Samba shares are
fine until the queued-up process load bogs the server down.
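
The manual ls check described above could be automated so the hang is
noticed before users call. What follows is a rough sketch only, with a
hypothetical function name and an arbitrary 10-second default timeout.
The listing runs in a child process because a process blocked on a hung
filesystem typically cannot be interrupted, so the monitor itself
should never touch the directory directly.

```python
import subprocess


def directory_responds(path, timeout_seconds=10.0):
    """Return True if listing `path` succeeds within the timeout.

    The listing runs in a child process so that a hung filesystem blocks
    the child rather than this monitor.
    """
    child = subprocess.Popen(
        ["ls", "-f", path],  # -f: unsorted, just read the directory entries
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        child.wait(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # A child stuck in uninterruptible sleep ignores SIGKILL until its
        # I/O completes, so send the signal but do not wait for it to exit.
        child.kill()
        return False
    return child.returncode == 0
```

Probing both /share/store3 and /share/store3/lxs on a schedule would
show whether a single directory stops responding before the rest of
the partition does.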

> > The other
> > servers do not show this behavior ever. The partitions show as mounted,
> > but access to the partition just hangs. Open file count, process count
> > and load average rise until the server becomes very unresponsive. Even
> > if we catch it before the high load average, because it cannot unmount
> > the partition, it must be powered off and back on to restart. Upon
> > restart all partitions mount properly and everything is fine for days or
> > months. There is nothing in log files that I have noticed. With sar, I
> > can track the files open and process count rise. 
> 
> Maybe try sysrq-t, to capture all backtraces when it's in this state,
> and see where the various threads are at.
> 

OK, I'll look into sysrq.
> -Eric
> 
> 

* Re: xfs partial dismount issue
  2007-03-05 18:25   ` Charles Weber
@ 2007-03-05 21:07     ` Roger Heflin
  2007-04-02 21:18       ` Charles Weber
  0 siblings, 1 reply; 5+ messages in thread
From: Roger Heflin @ 2007-03-05 21:07 UTC (permalink / raw)
  To: Charles Weber; +Cc: linux-xfs

Charles Weber wrote:
> Eric Sandeen <sandeen <at> sandeen.net> writes:
> 
>> Chuck Weber wrote:
>>> Hi everyone, I have a long running problem perhaps you can help with. I
>>> will include as much detail as I can. I can set up a spare server-disk
>>> set for testing if you have any bright ideas.
>>>
>>> We use XFS for samba and nfs on x86_64 Fedora Proliant DL585/385
>>> servers. Our busiest server has disk partitions go away. 
>> What do you mean by this, exactly?  The partitions themselves go away,
>> or are you talking about the problem described below where processes
>> start hanging?
>>
> Here is an example partition (1 of 6 or more xfs storage only).
> /share/store3 with samba shares on /share/store3/lls, lds, lxs and so on.
> I will get a call saying my groups share (lxs) is no longer accessable. I ssh
> into server and can ls /share/store3 but ls will hang when I ls
> /share/store3/lxs. Shortly there after ls will hang for the root or any
> directory on the partition. Other partitions will be fine and other samba shares
> will be fine until the queued up process load bogs the server down.
> 

Charles,

I have seen what may be a similar issue on SLES9 SP2. We had 1 XFS
partition, and under certain conditions it would stop responding. All
non-XFS partitions were OK, and everything was fine after a reboot.

Under sysrq-t it appeared to me that 2 separate processes were calling
fsync and causing each other to deadlock (and locking all others out of
changing the XFS partition). I was not able to determine exactly what
the underlying bug was, but the hung processes were all waiting on locks
in several widely different parts of the XFS and kernel code, and
adjusting the application to not call fsync has apparently stopped the
deadlock from occurring. In this case there were multiple (2-4)
different instances of the application calling fsync, apparently
sometimes at close to the same time. With the given application, a
failure on one machine (of 100) running the application overnight was
almost a certainty.
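
The access pattern described above, several writers calling fsync on
the same filesystem at nearly the same moment, can be approximated with
a small stress script for testing on a spare disk set. This is a
hypothetical sketch and not the application from the SLES9 case: the
function names and file names are invented, it uses threads instead of
separate application instances for brevity, and there is no claim that
it reproduces the deadlock.

```python
import os
import threading


def fsync_writer(path, rounds, payload, barrier):
    """Append payload to path and fsync it, `rounds` times."""
    with open(path, "ab") as f:
        for _ in range(rounds):
            barrier.wait()          # line the writers up so the fsyncs overlap
            f.write(payload)
            f.flush()               # push Python's buffer to the kernel
            os.fsync(f.fileno())    # ask the kernel to flush to stable storage


def run_fsync_stress(directory, writers=4, rounds=100, payload=b"x" * 4096):
    """Run several writers that fsync separate files in `directory` together.

    Returns the list of file paths written.
    """
    barrier = threading.Barrier(writers)
    paths = [os.path.join(directory, f"fsync-test-{i}.dat") for i in range(writers)]
    threads = [
        threading.Thread(target=fsync_writer, args=(p, rounds, payload, barrier))
        for p in paths
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return paths
```

Pointing run_fsync_stress() at a scratch directory on the suspect
partition, with 2-4 writers as in the report above, gives a repeatable
load to run while watching for hung processes.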

                            Roger

* Re: xfs partial dismount issue
  2007-03-05 21:07     ` Roger Heflin
@ 2007-04-02 21:18       ` Charles Weber
  0 siblings, 0 replies; 5+ messages in thread
From: Charles Weber @ 2007-04-02 21:18 UTC (permalink / raw)
  To: Roger Heflin; +Cc: linux-xfs, sandeen

Well, actually, I did a test with ext3 and got the same result. It
partially dismounted after a day or so of use, and the failure seemed
identical to my previous XFS filesystem failures. My guess now is that
this has always occurred when all 6 RAID controllers (2 per card) were
in use. I could go quite some time with 5 of the 6 controllers used. I
consolidated everything onto 2 cards, removed one card, and put in a
Fibre Channel card for my new storage array. So far no problems. If that
holds, then it seems something is funny about the cciss driver.

thanks,
Chuck

On 3/5/07, Roger Heflin <rheflin@atipa.com> wrote:
>
> Charles Weber wrote:
> > Eric Sandeen <sandeen <at> sandeen.net> writes:
> >
> >> Chuck Weber wrote:
> >>> Hi everyone, I have a long running problem perhaps you can help with. I
> >>> will include as much detail as I can. I can set up a spare server-disk
> >>> set for testing if you have any bright ideas.
> >>>
> >>> We use XFS for samba and nfs on x86_64 Fedora Proliant DL585/385
> >>> servers. Our busiest server has disk partitions go away.
> >> What do you mean by this, exactly?  The partitions themselves go away,
> >> or are you talking about the problem described below where processes
> >> start hanging?
> >>
> > Here is an example partition (1 of 6 or more xfs storage only).
> > /share/store3 with samba shares on /share/store3/lls, lds, lxs and so on.
> > I will get a call saying my groups share (lxs) is no longer accessable. I ssh
> > into server and can ls /share/store3 but ls will hang when I ls
> > /share/store3/lxs. Shortly there after ls will hang for the root or any
> > directory on the partition. Other partitions will be fine and other samba shares
> > will be fine until the queued up process load bogs the server down.
> >
>
> Charles,
>
> I have seen what may be a similar issue on SLES9SP2, we had 1 xfs
> partition, and under certain conditions it would stop responding, all
> non-xfs partitions were ok, and everything was fine after a reboot.
>
> Under sysrq-t it appeared to me that 2 separate processes were calling
> fsync and were causing each other to deadlock (and locking all others
> out of changing the xfs partition).  I was not able to determine exactly
> what the underlying bug was, but all of the hung processes
> were waiting on locks in at least several widely different parts of the
> xfs and kernel code, and adjusting the application to not fsync has
> apparently resulted in the deadlock not occuring.   In this case
> there were multiple (2-4) different instances of the application calling
> fsync apparently sometimes at close to the same time.   With the
> given application the failure was almost a certainly on one machine
> (of 100) running the application overnight.
>
>                             Roger
>

