From: "Theodore Ts'o" <tytso@mit.edu>
To: Jens Axboe <axboe@fb.com>
Cc: Dave Chinner <david@fromorbit.com>,
linux-ext4@vger.kernel.org, fstests@vger.kernel.org,
tarasov@vasily.name
Subject: Re: Test generic/299 stalling forever
Date: Sun, 23 Oct 2016 23:38:52 -0400
Message-ID: <20161024033852.quinlee4a24mb2e2@thunk.org>
In-Reply-To: <20161023212408.cjqmnzw3547ujzil@thunk.org>
[-- Attachment #1: Type: text/plain, Size: 3558 bytes --]
I enabled some more debugging and it's become more clear what's going
on. (See attached for the full log).
The main issue seems to be that once one of the fio jobs is done, it
kills off the other threads (actually, we're using processes):
process 31848 terminate group_id=0
process 31848 setting terminate on direct_aio/31846
process 31848 setting terminate on direct_aio/31848
process 31848 setting terminate on direct_aio/31849
process 31848 setting terminate on direct_aio/31851
process 31848 setting terminate on aio-dio-verifier/31852
process 31848 setting terminate on buffered-aio-verifier/31854
process 31851 pid=31851: runstate RUNNING -> FINISHING
process 31851 terminate group_id=0
process 31851 setting terminate on direct_aio/31846
process 31851 setting terminate on direct_aio/31848
process 31851 setting terminate on direct_aio/31849
process 31851 setting terminate on direct_aio/31851
process 31851 setting terminate on aio-dio-verifier/31852
process 31851 setting terminate on buffered-aio-verifier/31854
process 31852 pid=31852: runstate RUNNING -> FINISHING
process 31846 pid=31846: runstate RUNNING -> FINISHING
...
but one or more of the threads doesn't exit within 60 seconds:
fio: job 'direct_aio' (state=5) hasn't exited in 60 seconds, it appears to be stuck. Doing forceful exit of this job.
process 31794 pid=31849: runstate RUNNING -> REAPED
fio: job 'buffered-aio-verifier' (state=5) hasn't exited in 60 seconds, it appears to be stuck. Doing forceful exit of this job.
process 31794 pid=31854: runstate RUNNING -> REAPED
process 31794 terminate group_id=-1
The main thread then prints all of the statistics, and calls stat_exit():
stat_exit called by tid: 31794 <---- debugging message which prints gettid()
Unfortunately, these processes aren't actually killed; they are
marked as reaped, but they are still in the process listing:
root@xfstests:~# ps augxww | grep fio
root 1585 0.0 0.0 0 0 ? S< 18:45 0:00 [dm_bufio_cache]
root 7191 0.0 0.0 12732 2200 pts/1 S+ 23:05 0:00 grep fio
root 31849 1.5 0.2 407208 18876 ? Ss 22:36 0:26 /root/xfstests/bin/fio /tmp/31503.fio
root 31854 1.2 0.1 398480 10240 ? Ssl 22:36 0:22 /root/xfstests/bin/fio /tmp/31503.fio
And if you attach to them with gdb, they are spinning trying to grab
the stat_mutex, which they can't get because the main thread has
already called stat_exit() and then exited. So these two threads
did eventually return, but some time after 60 seconds had passed, and
then they hung waiting for the stat_mutex, which they will never get
because the main thread has already called stat_exit().
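To make the hang described above concrete, here is a minimal sketch in Python rather than fio's C. It assumes, as the description suggests, that stat_exit() takes the stat_mutex and never releases it; the names stat_exit, late_worker and simulate are stand-ins for illustration, not fio's actual code. The timeout exists only so the sketch terminates; the real job would block indefinitely.

```python
import threading


def stat_exit(stat_mutex):
    # Models "down and then destroy": the lock is taken and never released.
    stat_mutex.acquire()


def late_worker(stat_mutex, result, timeout):
    # A job that returns after the main thread has given up on it and
    # then tries to take the lock. The timeout is only here so that the
    # sketch finishes; without it this call would never return.
    got = stat_mutex.acquire(timeout=timeout)
    result.append(got)
    if got:
        stat_mutex.release()


def simulate(timeout=0.2):
    stat_mutex = threading.Lock()  # hypothetical stand-in for stat_mutex
    result = []
    stat_exit(stat_mutex)  # main thread tears down first
    worker = threading.Thread(
        target=late_worker, args=(stat_mutex, result, timeout)
    )
    worker.start()
    worker.join()
    return result[0]  # True if the late worker ever got the lock
```

Calling simulate() reports whether the late worker managed to take the lock; under the assumption above it never does, which matches the two processes left spinning in the process listing.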
This probably also explains why you had trouble reproducing it. It
requires a disk whose performance is variable enough that under heavy
load, it might take more than 60 seconds for the direct_aio or
buffered-aio-verifier thread to close itself out.
And I suspect once the main thread exited, it probably also closed out
the debugging channel, so the deadlock detector probably did trip, but
somehow we just didn't see the output.
So I can imagine some possible fixes. We could make the thread
timeout configurable, and/or increase it from 60 seconds to something like
300 seconds. We could make stat_exit() a no-op --- after all, if the
main thread is exiting, there's no real point to down and then destroy
the stat_mutex. And/or we could change the forced reap to send a kill
-9 to the thread, instead of just marking it as reaped.
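As a rough illustration of two of these ideas combined (a configurable timeout plus a real kill -9 on forced reap), here is a sketch in Python rather than fio's C. The function name reap_or_kill and its timeout parameter are hypothetical and only show the control flow; this is not how fio is structured. It assumes a POSIX system where SIGKILL is available.

```python
import os
import signal
import subprocess
import sys


def reap_or_kill(proc, timeout):
    """Wait up to `timeout` seconds for `proc`; SIGKILL it if it is stuck.

    Returns the exit status reported by wait(). On POSIX, a process
    terminated by a signal is reported as the negated signal number.
    """
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Forced reap: actually kill the job instead of only marking it
        # as reaped, so it cannot linger and block on a lock later.
        os.kill(proc.pid, signal.SIGKILL)
        return proc.wait()


if __name__ == "__main__":
    # A child that would outlive a short timeout.
    stuck = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    print(reap_or_kill(stuck, timeout=0.5))
```

With the timeout as a parameter, the 60 versus 300 second question becomes a matter of configuration, and a job that overruns it is gone from the process listing rather than left behind.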
Cheers,
- Ted
[-- Attachment #2: 299.full.gz --]
[-- Type: application/gzip, Size: 4711 bytes --]