From: "Theodore Ts'o" <tytso@mit.edu>
To: Jens Axboe <axboe@fb.com>
Cc: Dave Chinner <david@fromorbit.com>,
linux-ext4@vger.kernel.org, fstests@vger.kernel.org,
tarasov@vasily.name
Subject: Re: Test generic/299 stalling forever
Date: Sat, 22 Oct 2016 22:02:28 -0400
Message-ID: <20161023020228.gf6lzfw2phca3ykp@thunk.org>
In-Reply-To: <20161021221551.sdv4hgw33zjxnkvu@thunk.org>
On Fri, Oct 21, 2016 at 06:15:51PM -0400, Theodore Ts'o wrote:
> I was taking a closer look at this, and it does look like it's related
> to the stat_mutex. The main thread (according to gdb) seems to be
> stuck in this loop in backend.c line 1738 (in thread_main):
>
> 	do {
> 		check_update_rusage(td);
> 		if (!fio_mutex_down_trylock(stat_mutex))
> 			break;
> 		usleep(1000);		<----- line 1738
> 	} while (1);
So I have something very strange to report. I sync'ed up to the
latest fio repo, at commit e291cff14e97feb3cf. The problem still
manifests with that commit. Given what I've observed with a thread
spinning in this do loop, I added this commit:
commit 0f2f71f51595f6b708b801f7ae1dc86c5b2f3705
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Sat Oct 22 10:32:41 2016 -0400

    backend: if we can't grab stat_mutex, report a deadlock error and exit

    Signed-off-by: Theodore Ts'o <tytso@mit.edu>

diff --git a/backend.c b/backend.c
index fb2a855..093b6a3 100644
--- a/backend.c
+++ b/backend.c
@@ -1471,6 +1471,7 @@ static void *thread_main(void *data)
 	struct thread_data *td = fd->td;
 	struct thread_options *o = &td->o;
 	struct sk_out *sk_out = fd->sk_out;
+	int deadlock_loop_cnt;
 	int clear_state;
 	int ret;
 
@@ -1731,11 +1732,17 @@ static void *thread_main(void *data)
 	 * the rusage_sem, which would never get upped because
 	 * this thread is waiting for the stat mutex.
 	 */
+	deadlock_loop_cnt = 0;
 	do {
 		check_update_rusage(td);
 		if (!fio_mutex_down_trylock(stat_mutex))
 			break;
 		usleep(1000);
+		if (deadlock_loop_cnt++ > 5000) {
+			log_err("fio seems to be stuck grabbing stat_mutex, forcibly exiting\n");
+			td->error = EDEADLOCK;
+			goto err;
+		}
 	} while (1);
 
 	if (td_read(td) && td->io_bytes[DDIR_READ])
With this commit, the fio in the generic/299 test no longer hangs.
I've tried running it a very large number of times, and the hang no
longer reproduces at all. Specifically, the log_err() and the
EDEADLOCK error added by the patch aren't triggering, and fio is no
longer hanging. So merely adding a loop counter seems to make the
problem go away. Which makes me wonder if there is some kind of
compiler or code generation artifact we're seeing. So I should
mention which compiler I'm currently using:
% schroot -c jessie64 -- gcc --version
gcc (Debian 4.9.2-10) 4.9.2
Anyway, I have a workaround that seems to work for me, and which, even
if the deadlock_loop_cnt counter fires, will at least stop the test run
from hanging.
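For anyone who wants to experiment with the pattern outside of fio,
here is a minimal standalone sketch of the same idea: poll a lock with
trylock, sleep between attempts, and bail out with an error once a
retry limit is hit. It is not fio code. It assumes a plain pthread
mutex rather than fio's fio_mutex, the helper name
lock_with_retry_limit() is made up for illustration, and it returns
the POSIX EDEADLK rather than the Linux EDEADLOCK spelling used in the
patch.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>

/*
 * Hypothetical helper (not part of fio): try to take mutex m without
 * blocking, sleeping sleep_us microseconds between attempts, and give
 * up after max_tries failed attempts.
 *
 * Returns 0 with the mutex held, EDEADLK if the retry limit was hit,
 * or another errno value if pthread_mutex_trylock() itself failed.
 */
static int lock_with_retry_limit(pthread_mutex_t *m, int max_tries,
				 long sleep_us)
{
	struct timespec ts;
	int tries = 0;

	assert(max_tries > 0);
	assert(sleep_us >= 0);

	ts.tv_sec = sleep_us / 1000000L;
	ts.tv_nsec = (sleep_us % 1000000L) * 1000L;

	for (;;) {
		int ret = pthread_mutex_trylock(m);

		if (ret == 0)
			return 0;	/* got the lock */
		if (ret != EBUSY)
			return ret;	/* real error, e.g. EINVAL */
		if (++tries >= max_tries) {
			fprintf(stderr,
				"stuck grabbing mutex after %d tries, giving up\n",
				tries);
			return EDEADLK;
		}
		/* Back off before the next attempt. */
		nanosleep(&ts, NULL);
	}
}
```

Calling lock_with_retry_limit(&m, 5000, 1000) would roughly mirror
the limits in the patch above: about 5000 attempts spaced 1 ms apart,
so the caller waits at least around five seconds before giving up.
The caller is expected to treat a nonzero return as "lock not held"
and skip the unlock.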
You may or may not want to include this in the fio upstream repo,
given that I can't explain why merely trying to check for the deadlock
(or inability to grab the stat_mutex, anyway) makes the deadlock go
away. At least for the purposes of running the test, though, it does
seem to be a valid workaround.
Cheers,
- Ted