From: Dave Chinner <david@fromorbit.com>
To: Juerg Haefliger <juergh@gmail.com>
Cc: xfs@oss.sgi.com
Subject: Re: Still seeing hangs in xlog_grant_log_space
Date: Tue, 24 Apr 2012 22:07:31 +1000
Message-ID: <20120424120731.GT9541@dastard>
In-Reply-To: <CADLDEKsfckBw2oVYFfaaTbpe8Ri+rYJr2e5SB7-pM0BU9nRUeA@mail.gmail.com>

On Tue, Apr 24, 2012 at 10:55:22AM +0200, Juerg Haefliger wrote:
> On Tue, Apr 24, 2012 at 1:58 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Apr 23, 2012 at 05:33:40PM +0200, Juerg Haefliger wrote:
> >> Hi Dave,
> >>
> >>
> >> On Mon, Apr 23, 2012 at 4:38 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > On Mon, Apr 23, 2012 at 02:09:53PM +0200, Juerg Haefliger wrote:
> >> >> Hi,
> >> >>
> >> >> I have a test system that I'm using to try to force an XFS filesystem
> >> >> hang since we're encountering that problem sporadically in production
> >> >> running a 2.6.38-8 Natty kernel. The original idea was to use this
> >> >> system to find the patches that fix the issue but I've tried a whole
> >> >> bunch of kernels and they all hang eventually (anywhere from 5 to 45
> >> >> mins) with the stack trace shown below.
> >> >
> >> > If you kill the workload, does the file system recover normally?
> >>
> >> The workload can't be killed.
> >
> > OK.
> >
> >> >> Only an emergency flush will
> >> >> bring the filesystem back. I tried kernels 3.0.29, 3.1.10, 3.2.15,
> >> >> 3.3.2. From reading through the mail archives, I get the impression
> >> >> that this should be fixed in 3.1.
> >> >
> >> > What you see is not necessarily a hang. It may just be that you've
> >> > caused your IO subsystem to have so much IO queued up it's completely
> >> > overwhelmed. How much RAM do you have in the machine?
> >>
> >> When it hangs, there are zero IOs going to the disk. The machine has
> >> 100GB of RAM.
> >
> > Can you get an event trace across the period where the hang occurs?
> >
> > ....
> >
> >> >> I can't seem to hit the problem without the above modifications.
> >> >
> >> > How on earth did you come up with this configuration?
> >>
> >> Just plain ol' luck. I was looking for a configuration that would
> >> allow me to reproduce the hangs and I accidentally picked a machine
> >> with a faulty controller battery which disabled the cache.
> >
> > Wonderful.
> >
> >> >> For the IO workload I pre-create 8000 files with random content and
> >> >> sizes between 1k and 128k on the test partition. Then I run a tool
> >> >> that spawns a bunch of threads which just copy these files to a
> >> >> different directory on the same partition.
> >> >
> >> > So, your workload also has a significant amount of parallelism and
> >> > concurrency on a filesystem with only 4 AGs?
> >>
> >> Yes. Excuse my ignorance but what are AGs?
> >
> > Allocation groups.
> >
> >> >> At the same time there are
> >> >> other threads that rename, remove and overwrite random files in the
> >> >> destination directory keeping the file count at around 500.
> >> >
> >> > And you've added as much concurrent metadata modification as
> >> > possible, too, which makes me wonder.....
> >> >
> >> >> Let me know what other information I can provide to pin this down.
> >> >
> >> > .... exactly what are you trying to achieve with this test?  From my
> >> > point of view, you're doing something completely and utterly insane.
> >> > Your filesystem config and workload are so far outside normal
> >> > configurations and workloads that I'm not surprised you're seeing
> >> > some kind of problem.....
> >>
> >> No objection from my side. It's a silly configuration but it's the
> >> only one I've found that lets me reproduce a hang at will.
> >
> > Ok, that's fair enough - it's handy to tell us that up front,
> > though.  ;)
> 
> Ah sorry for not being clear enough. I thought my intentions could be
> deduced from the information that I provided :-)
> 
> 
> > Alright, then I need all the usual information. I suspect an event
> > trace is the only way I'm going to see what is happening. I just
> > updated the FAQ entry, so all the necessary info for gathering a
> > trace should be there now.
> >
> > http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
> 
> Very good. Will do. What kernel do you want me to run? I would prefer
> our current production kernel (2.6.38-8-server) but I understand if
> you want something newer.

If you can, reproduce it on a current kernel: 3.4-rc4 if possible,
otherwise a 3.3.x stable kernel. 2.6.38 is simply too old to be
useful for debugging these sorts of problems...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
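
The reproducer described in the quoted report above (pre-created files
with random content, threads copying them to another directory, and
other threads that rename, remove and overwrite files in the
destination) could be approximated along these lines. This is a rough
Python sketch, not the reporter's actual tool, which is not shown in
this thread. Every function name, the default thread counts, the run
duration and the even split between rename and overwrite are
assumptions made for illustration only.

```python
import os
import random
import shutil
import threading
import time


def create_source_files(src_dir, count, min_size=1024, max_size=128 * 1024):
    """Pre-create `count` files of random content and random size."""
    os.makedirs(src_dir, exist_ok=True)
    paths = []
    for i in range(count):
        path = os.path.join(src_dir, f"src_{i:05d}")
        with open(path, "wb") as f:
            f.write(os.urandom(random.randint(min_size, max_size)))
        paths.append(path)
    return paths


def copier(sources, dst_dir, stop, worker_id):
    """Copy randomly chosen source files into dst_dir until told to stop."""
    n = 0
    while not stop.is_set():
        dst = os.path.join(dst_dir, f"copy_{worker_id}_{n}")
        shutil.copyfile(random.choice(sources), dst)
        n += 1


def churner(dst_dir, stop, target_count, min_size, max_size):
    """Rename, remove or overwrite random destination files.

    Files are removed whenever the directory holds more than
    target_count entries; otherwise one is renamed or overwritten
    (assumed 50/50 split).
    """
    while not stop.is_set():
        names = os.listdir(dst_dir)
        if not names:
            time.sleep(0.001)
            continue
        victim = os.path.join(dst_dir, random.choice(names))
        try:
            if len(names) > target_count:
                os.remove(victim)
            elif random.random() < 0.5:
                new_name = f"renamed_{random.getrandbits(64):016x}"
                os.rename(victim, os.path.join(dst_dir, new_name))
            else:
                with open(victim, "wb") as f:
                    f.write(os.urandom(random.randint(min_size, max_size)))
        except FileNotFoundError:
            pass  # another thread renamed or removed the file first


def run_workload(root, n_files=8000, n_copiers=8, n_churners=4,
                 target_count=500, duration=60.0,
                 min_size=1024, max_size=128 * 1024):
    """Run the copy and churn threads for `duration` seconds.

    Returns the number of files left in the destination directory.
    """
    src_dir = os.path.join(root, "src")
    dst_dir = os.path.join(root, "dst")
    sources = create_source_files(src_dir, n_files, min_size, max_size)
    os.makedirs(dst_dir, exist_ok=True)

    stop = threading.Event()
    threads = [threading.Thread(target=copier,
                                args=(sources, dst_dir, stop, i))
               for i in range(n_copiers)]
    threads += [threading.Thread(target=churner,
                                 args=(dst_dir, stop, target_count,
                                       min_size, max_size))
                for _ in range(n_churners)]
    for t in threads:
        t.start()
    time.sleep(duration)
    stop.set()
    for t in threads:
        t.join()
    return len(os.listdir(dst_dir))
```

Calling run_workload("/path/to/test/partition") with the defaults
mirrors the file count and size range given in the report. The sketch
does nothing to disable the controller cache or to set the filesystem
geometry, both of which the reporter says were needed to trigger the
hang.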
