public inbox for linux-xfs@vger.kernel.org
From: Dave Chinner <david@fromorbit.com>
To: Juerg Haefliger <juergh@gmail.com>
Cc: xfs@oss.sgi.com
Subject: Re: Still seeing hangs in xlog_grant_log_space
Date: Tue, 24 Apr 2012 22:07:31 +1000	[thread overview]
Message-ID: <20120424120731.GT9541@dastard> (raw)
In-Reply-To: <CADLDEKsfckBw2oVYFfaaTbpe8Ri+rYJr2e5SB7-pM0BU9nRUeA@mail.gmail.com>

On Tue, Apr 24, 2012 at 10:55:22AM +0200, Juerg Haefliger wrote:
> On Tue, Apr 24, 2012 at 1:58 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Apr 23, 2012 at 05:33:40PM +0200, Juerg Haefliger wrote:
> >> Hi Dave,
> >>
> >>
> >> On Mon, Apr 23, 2012 at 4:38 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > On Mon, Apr 23, 2012 at 02:09:53PM +0200, Juerg Haefliger wrote:
> >> >> Hi,
> >> >>
> >> >> I have a test system that I'm using to try to force an XFS filesystem
> >> >> hang since we're encountering that problem sporadically in production
> >> >> running a 2.6.38-8 Natty kernel. The original idea was to use this
> >> >> system to find the patches that fix the issue but I've tried a whole
> >> >> bunch of kernels and they all hang eventually (anywhere from 5 to 45
> >> >> mins) with the stack trace shown below.
> >> >
> >> > If you kill the workload, does the file system recover normally?
> >>
> >> The workload can't be killed.
> >
> > OK.
> >
> >> >> Only an emergency flush will
> >> >> bring the filesystem back. I tried kernels 3.0.29, 3.1.10, 3.2.15,
> >> >> 3.3.2. From reading through the mail archives, I get the impression
> >> >> that this should be fixed in 3.1.
> >> >
> >> > What you see is not necessarily a hang. It may just be that you've
> >> > caused your IO subsystem to have so much IO queued up it's completely
> >> > overwhelmed. How much RAM do you have in the machine?
> >>
> >> When it hangs, there are zero IOs going to the disk. The machine has
> >> 100GB of RAM.
> >
> > Can you get an event trace across the period where the hang occurs?
> >
> > ....
> >
> >> >> I can't seem to hit the problem without the above modifications.
> >> >
> >> > How on earth did you come up with this configuration?
> >>
> >> Just plain ol' luck. I was looking for a configuration that would
> >> allow me to reproduce the hangs and I accidentally picked a machine
> >> with a faulty controller battery which disabled the cache.
> >
> > Wonderful.
> >
> >> >> For the IO workload I pre-create 8000 files with random content and
> >> >> sizes between 1k and 128k on the test partition. Then I run a tool
> >> >> that spawns a bunch of threads which just copy these files to a
> >> >> different directory on the same partition.
> >> >
> >> > So, your workload also has a significant amount of parallelism and
> >> > concurrency on a filesystem with only 4 AGs?
> >>
> >> Yes. Excuse my ignorance but what are AGs?
> >
> > Allocation groups.
> >
> >> >> At the same time there are
> >> >> other threads that rename, remove and overwrite random files in the
> >> >> destination directory keeping the file count at around 500.
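The workload described above can be sketched roughly as follows. The original tool isn't shown in the thread, so the paths, the scaled-down counts, and the worker structure here are all assumptions, not the reporter's actual code:

```shell
#!/bin/bash
# Rough sketch of the reproduction workload: pre-create files with
# random content and sizes between 1k and 128k, then copy them in
# parallel while another worker churns (renames) files in the
# destination. Counts are scaled down from the 8000-file original.
BASE="${TMPDIR:-/tmp}/xfs-repro"
SRC="$BASE/src"
DST="$BASE/dst"
NFILES=50        # the original setup used 8000
NCOPIERS=4       # "a bunch of threads"

rm -rf "$BASE"
mkdir -p "$SRC" "$DST"

# Pre-create source files with random content, 1k-128k each.
for i in $(seq 1 "$NFILES"); do
    kb=$(( RANDOM % 128 + 1 ))
    dd if=/dev/urandom of="$SRC/file.$i" bs=1024 count="$kb" 2>/dev/null
done

# Copier workers: each copies random source files into the destination.
for c in $(seq 1 "$NCOPIERS"); do
    (
        for n in $(seq 1 20); do
            i=$(( RANDOM % NFILES + 1 ))
            cp "$SRC/file.$i" "$DST/copy.$c.$n"
        done
    ) &
done

# Churn worker: renames destination files while the copies proceed,
# standing in for the rename/remove/overwrite threads in the original.
(
    for n in $(seq 1 20); do
        f=$(ls "$DST" 2>/dev/null | head -n 1)
        [ -n "$f" ] && mv "$DST/$f" "$DST/$f.r$n" 2>/dev/null
    done
) &

wait
echo "created $(ls "$SRC" | wc -l) source files, $(ls "$DST" | wc -l) copies"
```

To approach the original setup, this would be run against a directory on the affected XFS mount with the counts and worker numbers scaled back up.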
> >> >
> >> > And you've added as much concurrent metadata modification as
> >> > possible, too, which makes me wonder.....
> >> >
> >> >> Let me know what other information I can provide to pin this down.
> >> >
> >> > .... exactly what are you trying to achieve with this test?  From my
> >> > point of view, you're doing something completely and utterly insane.
> >> > Your filesystem config and workload are so far outside normal
> >> > configurations and workloads that I'm not surprised you're seeing
> >> > some kind of problem.....
> >>
> >> No objection from my side. It's a silly configuration but it's the
> >> only one I've found that lets me reproduce a hang at will.
> >
> > Ok, that's fair enough - it's handy to tell us that up front,
> > though.  ;)
> 
> Ah sorry for not being clear enough. I thought my intentions could be
> deduced from the information that I provided :-)
> 
> 
> > Alright, then I need all the usual information. I suspect an event
> > trace is the only way I'm going to see what is happening. I just
> > updated the FAQ entry, so all the necessary info for gathering a
> > trace should be there now.
> >
> > http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
> 
> Very good. Will do. What kernel do you want me to run? I would prefer
> our current production kernel (2.6.38-8-server) but I understand if
> you want something newer.

If you can reproduce it on a current kernel, that would be best -
3.4-rc4 if possible, otherwise a 3.3.x stable kernel. 2.6.38 is simply
too old to be useful for debugging these sorts of problems...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

Thread overview: 58+ messages
2012-04-23 12:09 Still seeing hangs in xlog_grant_log_space Juerg Haefliger
2012-04-23 14:38 ` Dave Chinner
2012-04-23 15:33   ` Juerg Haefliger
2012-04-23 23:58     ` Dave Chinner
2012-04-24  8:55       ` Juerg Haefliger
2012-04-24 12:07         ` Dave Chinner [this message]
2012-04-24 18:26           ` Juerg Haefliger
2012-04-25 22:38             ` Dave Chinner
2012-04-26 12:37               ` Juerg Haefliger
2012-04-26 22:44                 ` Dave Chinner
2012-04-26 23:00                   ` Juerg Haefliger
2012-04-26 23:07                     ` Dave Chinner
2012-04-27  9:04                       ` Juerg Haefliger
2012-04-27 11:09                         ` Dave Chinner
2012-04-27 13:07                           ` Juerg Haefliger
2012-05-05  7:44                             ` Juerg Haefliger
2012-05-07 17:19                               ` Ben Myers
2012-05-09  7:54                                 ` Juerg Haefliger
2012-05-10 16:11                                   ` Chris J Arges
2012-05-10 21:53                                     ` Mark Tinguely
2012-05-16 18:42                                     ` Ben Myers
2012-05-16 19:03                                       ` Chris J Arges
2012-05-16 21:29                                         ` Mark Tinguely
2012-05-18 10:10                                           ` Dave Chinner
2012-05-18 14:42                                             ` Mark Tinguely
2012-05-22 22:59                                               ` Dave Chinner
2012-06-06 15:00                                             ` Chris J Arges
2012-06-07  0:49                                               ` Dave Chinner
2012-05-17 20:55                                       ` Chris J Arges
2012-05-18 16:53                                         ` Chris J Arges
2012-05-18 17:19                                   ` Ben Myers
2012-05-19  7:28                                     ` Juerg Haefliger
2012-05-21 17:11                                       ` Ben Myers
2012-05-24  5:45                                         ` Juerg Haefliger
2012-05-24 14:23                                           ` Ben Myers
2012-05-07 22:59                               ` Dave Chinner
2012-05-09  7:35                                 ` Dave Chinner
2012-05-09 21:07                                   ` Mark Tinguely
2012-05-10  2:10                                     ` Mark Tinguely
2012-05-18  9:37                                       ` Dave Chinner
2012-05-18  9:31                                     ` Dave Chinner
2012-05-24 20:18 ` Peter Watkins
2012-05-25  6:28   ` Juerg Haefliger
2012-05-25 17:03     ` Peter Watkins
2012-06-05 23:54       ` Dave Chinner
2012-06-06 13:40         ` Brian Foster
2012-06-06 17:41           ` Mark Tinguely
2012-06-11 20:42             ` Chris J Arges
2012-06-11 23:53               ` Dave Chinner
2012-06-12 13:28                 ` Chris J Arges
2012-06-06 22:03           ` Mark Tinguely
2012-06-06 23:04             ` Brian Foster
2012-06-07  1:35           ` Dave Chinner
2012-06-07 14:16             ` Brian Foster
2012-06-08  0:28               ` Dave Chinner
2012-06-08 17:09                 ` Ben Myers
2012-06-11 20:59         ` Mark Tinguely
2012-06-05 15:21   ` Chris J Arges
