From: Jens Axboe <axboe@kernel.dk>
To: "Elliott, Robert (Server Storage)" <Elliott@hp.com>,
"stephenmcameron@gmail.com" <stephenmcameron@gmail.com>
Cc: "fio@vger.kernel.org" <fio@vger.kernel.org>
Subject: Re: fio main thread got stuck over the weekend
Date: Tue, 16 Dec 2014 15:43:00 -0700 [thread overview]
Message-ID: <5490B574.3000502@kernel.dk> (raw)
In-Reply-To: <94D0CD8314A33A4D9D801C0FE68B40295940F830@G4W3202.americas.hpqcorp.net>
On 12/15/2014 05:52 PM, Elliott, Robert (Server Storage) wrote:
>
>
>> -----Original Message-----
>> From: Jens Axboe [mailto:axboe@kernel.dk]
>> Sent: Monday, 15 December, 2014 2:50 PM
>> To: Elliott, Robert (Server Storage); stephenmcameron@gmail.com
>> Cc: fio@vger.kernel.org
>> Subject: Re: fio main thread got stuck over the weekend
>>
>> On 12/15/2014 01:31 PM, Jens Axboe wrote:
>>> On 12/15/2014 01:12 PM, Jens Axboe wrote:
>>>>> (gdb) thread 212
>>>>> [Switching to thread 212 (Thread 0x7fa9a086e700 (LWP 6509))]#0
>> 0x0000003974c0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from
>> /lib64/libpthread.so.0
>>>>> (gdb) bt
>>>>> #0 0x0000003974c0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from
>> /lib64/libpthread.so.0
>>>>> #1 0x0000000000449381 in helper_thread_main (data=<value optimized
>> out>) at backend.c:2127
>>>>> #2 0x0000003974c079d1 in start_thread () from /lib64/libpthread.so.0
>>>>> #3 0x00000039748e8b7d in clone () from /lib64/libc.so.6
>>>>> (gdb) select-frame 1
>>>>> (gdb) info local
>>>>> ts = {tv_sec = 1418658662, tv_nsec = 980074000}
>>>>> tv = {tv_sec = 1418658662, tv_usec = 730074}
>>>>> ret = <value optimized out>
>>>>> (gdb) print helper_exit
>>>>> $36 = 0
>>>>> (gdb) print nr_thread
>>>>> $37 = 224
>>>>> (gdb) print nr_process
>>>>> $38 = 0
>>>>
>>>> This is the helper thread, this is what will output the ETA etc stats.
>>>> There's a signal handler that runs and sets/wakes it to do so. So it
>>>> would be interesting if you could see what it is up to, set a break
>>>> point in there and check what the control flow ends up being.
>>>>
>>>> It's backend.c:helper_thread_main()
>>>>
>>>> backend.c:sig_show_status() should repeatedly fire and end up waking
>>>> this thread.
>>>
>>> So that referred to the output status, your bug was on the ETA, correct?
>
> Yes, the first anomaly was no ETAs incrementing, while IO was still
> occurring.
>
> The second anomaly was running forever.
>
>
>>> For that, we never get woken up by someone signaling our condvar, we
>>> just wake up every second to do the ETA display. The time conversion
>>> looks wrong, though:
>>>
>>> gettimeofday(&tv, NULL);
>>> ts.tv_sec = tv.tv_sec + sec;
>>> ts.tv_nsec = (tv.tv_usec * 1000) + nsec;
>>> if (ts.tv_nsec > 1000000000ULL) {
>>> ts.tv_nsec -= 1000000000ULL;
>>> ts.tv_sec++;
>>> }
>>>
>>> I wonder what happens if ts.tv_nsec == 10^6, that check really should be
>>> on >= instead. Not sure if this explains it though, I'll conjure up a
>>> test case...
>>
>> On my system, pthread_cond_timedwait() returns EINVAL if nsec is 10^9 or
>> higher, which is the sane thing to do. I wonder if it does that at your
>> end too, or whether it hangs? If it hangs indefinitely, that could
>> explain it. Test case attached, I checked in a fix for the off-by-one.
>
> That program results in EINVAL here:
> will wait
> done wait 22
>
> In backend.c helper_thread_main():
> * should the gettimeofday() return value be checked?
> * should fio_gettime() be used instead of gettimeofday() to honor
> the clocksource setting?
>
> In this pool of machines, I have occasionally seen time issues where
> some timeserver communication is messed up and "make" complains that
> files are newer than expected. Maybe some time warp occurred during
> the test?
I think I figured out what was going in, another user reported a similar
issue. Since you have it running, can you try and gdb attach to one of
the fio jobs? Then add a breakpoint in backend.c:__update_tv_cache(),
and print td->tv_cache after each call. Just do that a handful of times,
enough so we can see if it's incrementing or if it's stuck at the same
value.
--
Jens Axboe
next prev parent reply other threads:[~2014-12-16 22:43 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-08-11 15:44 fio main thread got stuck over the weekend scameron
2014-08-11 16:04 ` scameron
2014-08-22 19:04 ` Jens Axboe
2014-08-22 19:09 ` scameron
2014-08-22 19:11 ` Jens Axboe
2014-12-12 20:32 ` Elliott, Robert (Server Storage)
2014-12-13 4:49 ` Jens Axboe
2014-12-15 17:33 ` Elliott, Robert (Server Storage)
2014-12-15 17:37 ` Jens Axboe
2014-12-15 19:39 ` Elliott, Robert (Server Storage)
2014-12-15 20:12 ` Jens Axboe
2014-12-15 20:31 ` Jens Axboe
2014-12-15 20:49 ` Jens Axboe
2014-12-16 0:52 ` Elliott, Robert (Server Storage)
2014-12-16 2:51 ` Jens Axboe
2014-12-16 22:43 ` Jens Axboe [this message]
2014-12-17 3:52 ` Elliott, Robert (Server Storage)
2014-12-17 5:43 ` Jens Axboe
2014-12-17 16:48 ` Elliott, Robert (Server Storage)
2014-12-17 17:27 ` Jens Axboe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=5490B574.3000502@kernel.dk \
--to=axboe@kernel.dk \
--cc=Elliott@hp.com \
--cc=fio@vger.kernel.org \
--cc=stephenmcameron@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox