public inbox for fstests@vger.kernel.org
From: Jens Axboe <axboe@fb.com>
To: Theodore Ts'o <tytso@mit.edu>
Cc: Dave Chinner <david@fromorbit.com>,
	linux-ext4@vger.kernel.org, fstests@vger.kernel.org,
	tarasov@vasily.name
Subject: Re: Test generic/299 stalling forever
Date: Sun, 23 Oct 2016 08:32:49 -0600
Message-ID: <53fe5a98-6ff9-4fa1-e84c-8a3e16cc0f50@fb.com>
In-Reply-To: <20161021221551.sdv4hgw33zjxnkvu@thunk.org>

On 10/21/2016 04:15 PM, Theodore Ts'o wrote:
> On Thu, Oct 20, 2016 at 08:22:00AM -0600, Jens Axboe wrote:
>>> So what's happening is that generic/299 is looping in the
>>> fallocate/truncate loop until fio exits, but since fio never exits,
>>> it ends up looping forever.
>>
>> I'm setting up the GCE now. I've had the tests running for about 24h
>> on another test box and haven't been able to trigger any hangs. I'll
>> match your setup as closely as I can; hopefully that'll work.
>
> Any luck reproducing the problem?
>
> On Wed, Oct 19, 2016 at 08:06:44AM -0600, Jens Axboe wrote:
>>
>> I'll take a look today. I agree, this definitely looks like a fio
>> bug. But not related to the mutex issue for the stat part, all verifier
>> threads are waiting to be woken up, but the main thread is done.
>>
>
> I was taking a closer look at this, and it does look like it's related
> to the stat_mutex.  The main thread (according to gdb) seems to be
> stuck in this loop in backend.c line 1738 (in thread_main):
>
> 		do {
> 			check_update_rusage(td);
> 			if (!fio_mutex_down_trylock(stat_mutex))
> 				break;
> 			usleep(1000);   <----- line 1738
> 		} while (1);
>
> So it looks like it's not able to grab the stat_mutex.  But I can't
> figure out how the stat_mutex could be down.  None of the stack
> traces seem to show that, and I've looked at all of the places where
> stat_mutex is taken, and it doesn't look like stat_mutex should ever
> be down for more than, say, a second?
>
> So as a temporary workaround, I'm considering adding a check to see if
> we stay stuck in this loop for more than a thousand times, and if so, print
> an error to stderr and then call _exit(1), or maybe just break out two
> levels by jumping to line 1778 at "td_set_runstate(td, TD_FINISHING)"
> and just give up on the usage statistics (since for xfstests we really
> don't care about the usage stats).

Very strange. Can you see who the owner of stat_mutex->lock is? That's
the pthread_mutex_t they are sleeping on.

For now, I'll apply the work-around you sent. I haven't been able to
reproduce this, but knowing that it's the stat_mutex will allow me to
better make up a test case to hit it.
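
For reference, the bounded-retry idea from the quoted proposal can be
sketched roughly as follows. This is not the patch that was sent: it
uses a plain pthread_mutex_t with pthread_mutex_trylock() as a stand-in
for fio's stat_mutex and fio_mutex_down_trylock(), and the helper name
and the MAX_TRYLOCK_ATTEMPTS limit are made up for illustration.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical limit: about one second of 1 ms sleeps. */
#define MAX_TRYLOCK_ATTEMPTS 1000

/*
 * Try to take 'lock', sleeping 1 ms between attempts.
 * Returns true with the lock held, or false after 'max_attempts'
 * failed tries so the caller can skip the work the lock protects
 * instead of spinning forever.
 */
static bool lock_with_retry_limit(pthread_mutex_t *lock, int max_attempts)
{
	for (int i = 0; i < max_attempts; i++) {
		if (pthread_mutex_trylock(lock) == 0)
			return true;
		usleep(1000);
	}
	fprintf(stderr, "gave up on lock after %d attempts\n", max_attempts);
	return false;
}
```

A caller would pass MAX_TRYLOCK_ATTEMPTS and, on a false return, either
exit with an error or carry on without the usage statistics, which are
the two options suggested in the quoted text.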

-- 
Jens Axboe
