From: Jens Axboe <axboe@kernel.dk>
To: "Elliott, Robert (Server Storage)" <Elliott@hp.com>,
"stephenmcameron@gmail.com" <stephenmcameron@gmail.com>
Cc: "fio@vger.kernel.org" <fio@vger.kernel.org>
Subject: Re: fio main thread got stuck over the weekend
Date: Mon, 15 Dec 2014 19:51:55 -0700 [thread overview]
Message-ID: <548F9E4B.2010103@kernel.dk> (raw)
In-Reply-To: <94D0CD8314A33A4D9D801C0FE68B40295940F830@G4W3202.americas.hpqcorp.net>
On 12/15/2014 05:52 PM, Elliott, Robert (Server Storage) wrote:
>
>
>> -----Original Message-----
>> From: Jens Axboe [mailto:axboe@kernel.dk]
>> Sent: Monday, 15 December, 2014 2:50 PM
>> To: Elliott, Robert (Server Storage); stephenmcameron@gmail.com
>> Cc: fio@vger.kernel.org
>> Subject: Re: fio main thread got stuck over the weekend
>>
>> On 12/15/2014 01:31 PM, Jens Axboe wrote:
>>> On 12/15/2014 01:12 PM, Jens Axboe wrote:
>>>>> (gdb) thread 212
>>>>> [Switching to thread 212 (Thread 0x7fa9a086e700 (LWP 6509))]#0
>> 0x0000003974c0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from
>> /lib64/libpthread.so.0
>>>>> (gdb) bt
>>>>> #0 0x0000003974c0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from
>> /lib64/libpthread.so.0
>>>>> #1 0x0000000000449381 in helper_thread_main (data=<value optimized
>> out>) at backend.c:2127
>>>>> #2 0x0000003974c079d1 in start_thread () from /lib64/libpthread.so.0
>>>>> #3 0x00000039748e8b7d in clone () from /lib64/libc.so.6
>>>>> (gdb) select-frame 1
>>>>> (gdb) info local
>>>>> ts = {tv_sec = 1418658662, tv_nsec = 980074000}
>>>>> tv = {tv_sec = 1418658662, tv_usec = 730074}
>>>>> ret = <value optimized out>
>>>>> (gdb) print helper_exit
>>>>> $36 = 0
>>>>> (gdb) print nr_thread
>>>>> $37 = 224
>>>>> (gdb) print nr_process
>>>>> $38 = 0
>>>>
>>>> This is the helper thread, this is what will output the ETA etc stats.
>>>> There's a signal handler that runs and sets/wakes it to do so. So it
>>>> would be interesting if you could see what it is up to, set a break
>>>> point in there and check what the control flow ends up being.
>>>>
>>>> It's backend.c:helper_thread_main()
>>>>
>>>> backend.c:sig_show_status() should repeatedly fire and end up waking
>>>> this thread.
>>>
>>> So that referred to the output status, your bug was on the ETA, correct?
>
> Yes, the first anomaly was no ETAs incrementing, while IO was still
> occurring.
>
> The second anomaly was running forever.
Gotcha
>>> For that, we never get woken up by someone signaling our condvar, we
>>> just wake up every second to do the ETA display. The time conversion
>>> looks wrong, though:
>>>
>>> gettimeofday(&tv, NULL);
>>> ts.tv_sec = tv.tv_sec + sec;
>>> ts.tv_nsec = (tv.tv_usec * 1000) + nsec;
>>> if (ts.tv_nsec > 1000000000ULL) {
>>> ts.tv_nsec -= 1000000000ULL;
>>> ts.tv_sec++;
>>> }
>>>
>>> I wonder what happens if ts.tv_nsec == 10^6, that check really should be
>>> on >= instead. Not sure if this explains it though, I'll conjure up a
>>> test case...
>>
>> On my system, pthread_cond_timedwait() returns EINVAL if nsec is 10^9 or
>> higher, which is the sane thing to do. I wonder if it does that at your
>> end too, or whether it hangs? If it hangs indefinitely, that could
>> explain it. Test case attached, I checked in a fix for the off-by-one.
>
> That program results in EINVAL here:
> will wait
> done wait 22
OK, which is the sane way to handle it, but who knows what bugs there
might have been there.
> In backend.c helper_thread_main():
> * should the gettimeofday() return value be checked?
Should not really fail, since we know 'tv' is valid.
> * should fio_gettime() be used instead of gettimeofday() to honor
> the clocksource setting?
That would be the case if we used it internally, but we do not. It's
only used for passing to the condvar timedwait.
> In this pool of machines, I have occasionally seen time issues where
> some timeserver communication is messed up and "make" complains that
> files are newer than expected. Maybe some time warp occurred during
> the test?
That could definitely explain it, not sure what else could. That could
cause both the run to continue, and the timedwait to not return. I'll
switch it to use clock_gettime, if we have it, otherwise we really have
no other option than to use gettimeofday()...
--
Jens Axboe
next prev parent reply other threads:[~2014-12-16 2:51 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-08-11 15:44 fio main thread got stuck over the weekend scameron
2014-08-11 16:04 ` scameron
2014-08-22 19:04 ` Jens Axboe
2014-08-22 19:09 ` scameron
2014-08-22 19:11 ` Jens Axboe
2014-12-12 20:32 ` Elliott, Robert (Server Storage)
2014-12-13 4:49 ` Jens Axboe
2014-12-15 17:33 ` Elliott, Robert (Server Storage)
2014-12-15 17:37 ` Jens Axboe
2014-12-15 19:39 ` Elliott, Robert (Server Storage)
2014-12-15 20:12 ` Jens Axboe
2014-12-15 20:31 ` Jens Axboe
2014-12-15 20:49 ` Jens Axboe
2014-12-16 0:52 ` Elliott, Robert (Server Storage)
2014-12-16 2:51 ` Jens Axboe [this message]
2014-12-16 22:43 ` Jens Axboe
2014-12-17 3:52 ` Elliott, Robert (Server Storage)
2014-12-17 5:43 ` Jens Axboe
2014-12-17 16:48 ` Elliott, Robert (Server Storage)
2014-12-17 17:27 ` Jens Axboe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=548F9E4B.2010103@kernel.dk \
--to=axboe@kernel.dk \
--cc=Elliott@hp.com \
--cc=fio@vger.kernel.org \
--cc=stephenmcameron@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox