Latency spikes with 'thread' option

Flexible I/O Tester development
 help / color / mirror / Atom feed

* Latency spikes with 'thread' option
@ 2012-12-12 20:11 Sam Bradshaw (sbradshaw)
  2012-12-13 13:19 ` Jens Axboe
  0 siblings, 1 reply; 7+ messages in thread
From: Sam Bradshaw (sbradshaw) @ 2012-12-12 20:11 UTC (permalink / raw)
  To: fio@vger.kernel.org; +Cc: Jens Axboe

Hi All,

We're running queue depth sweeps with a 4k random read workload (sample config 
below) against a high performance PCIe SSD - the Micron p320h.  We're seeing
latency spikes to 1 sec when the 'thread' option is used.  Instrumenting the
driver, we see max latencies from driver entry point to block layer completion 
callback of <20 ms at high queue depths.  If 'thread' is not used, the max 
latencies reported by fio align almost exactly with that seen by the driver.
There are typically only one or two of these latency outliers during a 40 sec
run, for example, but they represent a significant enough excursion to pull
our std. dev. very high.

Has anyone witnessed this sort of behavior?  We see it with all the versions
of fio that we have used (2.0.5+) with a variety of kernels.  It's also very
suspicious that the max latency is either almost exactly 1 sec or aligns with
our hardware incurred latency for the given queue depth.

Thanks,
-Sam

[global]
#thread
group_reporting
direct=1
time_based
norandommap=1
refill_buffers
runtime=40
ioengine=libaio
filename=/dev/rssda
bs=4k
rw=randread

[read-qd-256]
numjobs=8
iodepth=32
stonewall

[read-qd-248]
numjobs=8
iodepth=31
stonewall

[read-qd-240]
numjobs=8
iodepth=30
stonewall

[read-qd-232]
numjobs=8
iodepth=29
stonewall

<snip>

[read-qd-64]
numjobs=8
iodepth=8
stonewall

[read-qd-56]
numjobs=8
iodepth=7
stonewall

[read-qd-48]
numjobs=8
iodepth=6
stonewall

[read-qd-40]
numjobs=8
iodepth=5
stonewall

[read-qd-32]
numjobs=8
iodepth=4
stonewall

[read-qd-24]
numjobs=8
iodepth=3
stonewall

[read-qd-16]
numjobs=8
iodepth=2
stonewall

[read-qd-8]
numjobs=8
iodepth=1
stonewall

[read-qd-4]
numjobs=4
iodepth=1
stonewall

[read-qd-1]
numjobs=1
iodepth=1
stonewall

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Latency spikes with 'thread' option
  2012-12-12 20:11 Latency spikes with 'thread' option Sam Bradshaw (sbradshaw)
@ 2012-12-13 13:19 ` Jens Axboe
  2012-12-17 22:23   ` Sam Bradshaw (sbradshaw)
  0 siblings, 1 reply; 7+ messages in thread
From: Jens Axboe @ 2012-12-13 13:19 UTC (permalink / raw)
  To: Sam Bradshaw (sbradshaw); +Cc: fio@vger.kernel.org

On 2012-12-12 21:11, Sam Bradshaw (sbradshaw) wrote:
> Hi All,
> 
> We're running queue depth sweeps with a 4k random read workload (sample config 
> below) against a high performance PCIe SSD - the Micron p320h.  We're seeing
> latency spikes to 1 sec when the 'thread' option is used.  Instrumenting the
> driver, we see max latencies from driver entry point to block layer completion 
> callback of <20 ms at high queue depths.  If 'thread' is not used, the max 
> latencies reported by fio align almost exactly with that seen by the driver.
> There are typically only one or two of these latency outliers during a 40 sec
> run, for example, but they represent a significant enough excursion to pull
> our std. dev. very high.
> 
> Has anyone witnessed this sort of behavior?  We see it with all the versions
> of fio that we have used (2.0.5+) with a variety of kernels.  It's also very
> suspicious that the max latency is either almost exactly 1 sec or aligns with
> our hardware incurred latency for the given queue depth.

I've seen that happen before as well, but I never got to the bottom of
it. I just tried, and I can trigger it fairly easily that dell box. If I
beat on two devices, it doesn't happen easily. Add the third, and it
hits almost immediately after starting up the threads.

For fio, the only difference between a thread and process is how they
are kicked off. So it would seem unlikely to be something in fio.
Perhaps it's a scheduling bug? But then it seems odd that nobody else
has seen this. I see exactly the same latencies you report, very close
to precisely 1s latencies. That is indeed very odd.

I'll try and poke at this a bit.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: Latency spikes with 'thread' option
  2012-12-13 13:19 ` Jens Axboe
@ 2012-12-17 22:23   ` Sam Bradshaw (sbradshaw)
  2012-12-18  7:21     ` Jens Axboe
  0 siblings, 1 reply; 7+ messages in thread
From: Sam Bradshaw (sbradshaw) @ 2012-12-17 22:23 UTC (permalink / raw)
  To: Jens Axboe; +Cc: fio@vger.kernel.org, Mike Berhan (mberhan)


> -----Original Message-----
> From: Jens Axboe [mailto:axboe@kernel.dk]
> Sent: Thursday, December 13, 2012 5:19 AM
> To: Sam Bradshaw (sbradshaw)
> Cc: fio@vger.kernel.org
> Subject: Re: Latency spikes with 'thread' option
> 
> On 2012-12-12 21:11, Sam Bradshaw (sbradshaw) wrote:
> > Hi All,
> >
> > We're running queue depth sweeps with a 4k random read workload (sample config
> > below) against a high performance PCIe SSD - the Micron p320h.  We're seeing
> > latency spikes to 1 sec when the 'thread' option is used.  Instrumenting the
> > driver, we see max latencies from driver entry point to block layer completion
> > callback of <20 ms at high queue depths.  If 'thread' is not used, the max
> > latencies reported by fio align almost exactly with that seen by the driver.
> > There are typically only one or two of these latency outliers during a 40 sec
> > run, for example, but they represent a significant enough excursion to pull
> > our std. dev. very high.
> >
> > Has anyone witnessed this sort of behavior?  We see it with all the versions
> > of fio that we have used (2.0.5+) with a variety of kernels.  It's also very
> > suspicious that the max latency is either almost exactly 1 sec or aligns with
> > our hardware incurred latency for the given queue depth.
> 
> I've seen that happen before as well, but I never got to the bottom of
> it. I just tried, and I can trigger it fairly easily that dell box. If I
> beat on two devices, it doesn't happen easily. Add the third, and it
> hits almost immediately after starting up the threads.
> 
> For fio, the only difference between a thread and process is how they
> are kicked off. So it would seem unlikely to be something in fio.
> Perhaps it's a scheduling bug? But then it seems odd that nobody else
> has seen this. I see exactly the same latencies you report, very close
> to precisely 1s latencies. That is indeed very odd.

I've spent some time poking around and I have some more data points and a
suggestion.

First, the 1 second latency events come in batches and those batches occur 
suspiciously close to a usec wrap (0.999999 us -> 1.000000 us).  Second, if you
subtract exactly 1 second from these outlier latencies, the remaining amount is
very close to what our instrumented low level driver records for the IO latency
and consistent with the expected latencies of our SSD.  Similarly, the tv_usec
portion of the timeval structs shows increasing values.  See snippet below.

Format is like    start: <start_time.tv_sec>.<start_time.tv_usec>

latency: 1004657 us, lba: 1111289192, start: 1355776806.995294 issue: 1355776806.995312 complete: 1355776807.999969
latency: 1000494 us, lba: 203093568, start: 1355776806.999456 issue: 1355776806.999475 complete: 1355776807.999969
latency: 1000404 us, lba: 551350992, start: 1355776806.999546 issue: 1355776806.999565 complete: 1355776807.999969
latency: 1000477 us, lba: 449672928, start: 1355776806.999484 issue: 1355776806.999492 complete: 1355776807.999969

All this pointed to the time collection code being buggy.  Reviewing the 
code, I spotted this in fio_gettime():

	/*
	 * If Linux is using the tsc clock on non-synced processors,
	 * sometimes time can appear to drift backwards. Fix that up.
	 */
	if (last_tv_valid) {
		if (tp->tv_sec < last_tv.tv_sec)
			tp->tv_sec = last_tv.tv_sec;
		else if (last_tv.tv_sec == tp->tv_sec &&
			 tp->tv_usec < last_tv.tv_usec)
			tp->tv_usec = last_tv.tv_usec;
	}
	last_tv_valid = 1;
	memcpy(&last_tv, tp, sizeof(*tp));

This does not appear to be multi-thread safe.  Pre-emption can occur between 
either comparison and the subsequent update.  Commenting it out makes the
problem go away (at the expense of being subject to drift).  How about
making last_tv & last_tv_valid thread-local?

-Sam




^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Latency spikes with 'thread' option
  2012-12-17 22:23   ` Sam Bradshaw (sbradshaw)
@ 2012-12-18  7:21     ` Jens Axboe
  2012-12-18  8:29       ` Jens Axboe
  0 siblings, 1 reply; 7+ messages in thread
From: Jens Axboe @ 2012-12-18  7:21 UTC (permalink / raw)
  To: Sam Bradshaw (sbradshaw); +Cc: fio@vger.kernel.org, Mike Berhan (mberhan)

On 2012-12-17 23:23, Sam Bradshaw (sbradshaw) wrote:
> 
>> -----Original Message-----
>> From: Jens Axboe [mailto:axboe@kernel.dk]
>> Sent: Thursday, December 13, 2012 5:19 AM
>> To: Sam Bradshaw (sbradshaw)
>> Cc: fio@vger.kernel.org
>> Subject: Re: Latency spikes with 'thread' option
>>
>> On 2012-12-12 21:11, Sam Bradshaw (sbradshaw) wrote:
>>> Hi All,
>>>
>>> We're running queue depth sweeps with a 4k random read workload (sample config
>>> below) against a high performance PCIe SSD - the Micron p320h.  We're seeing
>>> latency spikes to 1 sec when the 'thread' option is used.  Instrumenting the
>>> driver, we see max latencies from driver entry point to block layer completion
>>> callback of <20 ms at high queue depths.  If 'thread' is not used, the max
>>> latencies reported by fio align almost exactly with that seen by the driver.
>>> There are typically only one or two of these latency outliers during a 40 sec
>>> run, for example, but they represent a significant enough excursion to pull
>>> our std. dev. very high.
>>>
>>> Has anyone witnessed this sort of behavior?  We see it with all the versions
>>> of fio that we have used (2.0.5+) with a variety of kernels.  It's also very
>>> suspicious that the max latency is either almost exactly 1 sec or aligns with
>>> our hardware incurred latency for the given queue depth.
>>
>> I've seen that happen before as well, but I never got to the bottom of
>> it. I just tried, and I can trigger it fairly easily that dell box. If I
>> beat on two devices, it doesn't happen easily. Add the third, and it
>> hits almost immediately after starting up the threads.
>>
>> For fio, the only difference between a thread and process is how they
>> are kicked off. So it would seem unlikely to be something in fio.
>> Perhaps it's a scheduling bug? But then it seems odd that nobody else
>> has seen this. I see exactly the same latencies you report, very close
>> to precisely 1s latencies. That is indeed very odd.
> 
> I've spent some time poking around and I have some more data points and a
> suggestion.
> 
> First, the 1 second latency events come in batches and those batches occur 
> suspiciously close to a usec wrap (0.999999 us -> 1.000000 us).  Second, if you
> subtract exactly 1 second from these outlier latencies, the remaining amount is
> very close to what our instrumented low level driver records for the IO latency
> and consistent with the expected latencies of our SSD.  Similarly, the tv_usec
> portion of the timeval structs shows increasing values.  See snippet below.
> 
> Format is like    start: <start_time.tv_sec>.<start_time.tv_usec>
> 
> latency: 1004657 us, lba: 1111289192, start: 1355776806.995294 issue: 1355776806.995312 complete: 1355776807.999969
> latency: 1000494 us, lba: 203093568, start: 1355776806.999456 issue: 1355776806.999475 complete: 1355776807.999969
> latency: 1000404 us, lba: 551350992, start: 1355776806.999546 issue: 1355776806.999565 complete: 1355776807.999969
> latency: 1000477 us, lba: 449672928, start: 1355776806.999484 issue: 1355776806.999492 complete: 1355776807.999969
> 
> All this pointed to the time collection code being buggy.  Reviewing the 
> code, I spotted this in fio_gettime():
> 
> 	/*
> 	 * If Linux is using the tsc clock on non-synced processors,
> 	 * sometimes time can appear to drift backwards. Fix that up.
> 	 */
> 	if (last_tv_valid) {
> 		if (tp->tv_sec < last_tv.tv_sec)
> 			tp->tv_sec = last_tv.tv_sec;
> 		else if (last_tv.tv_sec == tp->tv_sec &&
> 			 tp->tv_usec < last_tv.tv_usec)
> 			tp->tv_usec = last_tv.tv_usec;
> 	}
> 	last_tv_valid = 1;
> 	memcpy(&last_tv, tp, sizeof(*tp));
> 
> This does not appear to be multi-thread safe.  Pre-emption can occur between 
> either comparison and the subsequent update.  Commenting it out makes the
> problem go away (at the expense of being subject to drift).  How about
> making last_tv & last_tv_valid thread-local?

Good analysis and I believe you are correct. It's not safely shared and
should be thread local. A quick test here with the below seems to
indicate that that is indeed the issue, I don't see any time weirdness
with that applied.


diff --git a/gettime.c b/gettime.c
index f5be6bd..1a7af37 100644
--- a/gettime.c
+++ b/gettime.c
@@ -18,8 +18,8 @@ static unsigned long cycles_per_usec;
 static unsigned long last_cycles;
 int tsc_reliable = 0;
 #endif
-static struct timeval last_tv;
-static int last_tv_valid;
+static __thread struct timeval last_tv;
+static __thread int last_tv_valid;
 
 enum fio_cs fio_clock_source = FIO_PREFERRED_CLOCK_SOURCE;
 int fio_clock_source_set = 0;

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: Latency spikes with 'thread' option
  2012-12-18  7:21     ` Jens Axboe
@ 2012-12-18  8:29       ` Jens Axboe
  2012-12-18 21:16         ` Sam Bradshaw (sbradshaw)
  0 siblings, 1 reply; 7+ messages in thread
From: Jens Axboe @ 2012-12-18  8:29 UTC (permalink / raw)
  To: Sam Bradshaw (sbradshaw); +Cc: fio@vger.kernel.org, Mike Berhan (mberhan)

On 2012-12-18 08:21, Jens Axboe wrote:
> Good analysis and I believe you are correct. It's not safely shared and
> should be thread local. A quick test here with the below seems to
> indicate that that is indeed the issue, I don't see any time weirdness
> with that applied.

Bah, so that didn't work on all the supported platforms (notably OSX
does not have support for __thread). I've committed a patch that should
work across platforms, please give the current fio git a try (5d879392
or newer).

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: Latency spikes with 'thread' option
  2012-12-18  8:29       ` Jens Axboe
@ 2012-12-18 21:16         ` Sam Bradshaw (sbradshaw)
  2012-12-19  7:00           ` Jens Axboe
  0 siblings, 1 reply; 7+ messages in thread
From: Sam Bradshaw (sbradshaw) @ 2012-12-18 21:16 UTC (permalink / raw)
  To: Jens Axboe; +Cc: fio@vger.kernel.org, Mike Berhan (mberhan)

> -----Original Message-----
> From: Jens Axboe [mailto:axboe@kernel.dk]
> Sent: Tuesday, December 18, 2012 12:29 AM
> To: Sam Bradshaw (sbradshaw)
> Cc: fio@vger.kernel.org; Mike Berhan (mberhan)
> Subject: Re: Latency spikes with 'thread' option
> 
> On 2012-12-18 08:21, Jens Axboe wrote:
> > Good analysis and I believe you are correct. It's not safely shared and
> > should be thread local. A quick test here with the below seems to
> > indicate that that is indeed the issue, I don't see any time weirdness
> > with that applied.
> 
> Bah, so that didn't work on all the supported platforms (notably OSX
> does not have support for __thread). I've committed a patch that should
> work across platforms, please give the current fio git a try (5d879392
> or newer).

Works great.  No abnormal latency spikes, nor do our measurements of peak
IOPs differ from prior releases.

One tidbit though: fio_gettime() shows a much larger % of execution time
with the default clocksource than the gettimeofday() clocksource.  In a
512b random read workload with gtod_reduce disabled, for example, up to 15% 
of the userspace execution time was spent in fio_gettime() with the default
clocksource vs. 5% with gettimeofday().  The latency is mostly incurred
at the divisions that convert nsecs? to secs for populating the timeval
struct.  I don't have a suggestion to replace the

usecs = t / cycles_per_usec 

division but the 

tp->tv_sec = usecs / 1000000

could be reduced by instead comparing "t" to "tv->last_cycles" to see if
they differ by more than 1000000.  If so, do the division and capture
secs in tv->tv_sec.  If not, just copy the last recorded tv->tv_sec into
tp->tv_sec.  That way, the division in only incurred every second at a 
minimum.

I could put together a patch if you like; I just have limited ability
to test the change on the multitude of platforms supported by fio.

-Sam

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Latency spikes with 'thread' option
  2012-12-18 21:16         ` Sam Bradshaw (sbradshaw)
@ 2012-12-19  7:00           ` Jens Axboe
  0 siblings, 0 replies; 7+ messages in thread
From: Jens Axboe @ 2012-12-19  7:00 UTC (permalink / raw)
  To: Sam Bradshaw (sbradshaw); +Cc: fio@vger.kernel.org, Mike Berhan (mberhan)

On 2012-12-18 22:16, Sam Bradshaw (sbradshaw) wrote:
> 
>> -----Original Message-----
>> From: Jens Axboe [mailto:axboe@kernel.dk]
>> Sent: Tuesday, December 18, 2012 12:29 AM
>> To: Sam Bradshaw (sbradshaw)
>> Cc: fio@vger.kernel.org; Mike Berhan (mberhan)
>> Subject: Re: Latency spikes with 'thread' option
>>
>> On 2012-12-18 08:21, Jens Axboe wrote:
>>> Good analysis and I believe you are correct. It's not safely shared and
>>> should be thread local. A quick test here with the below seems to
>>> indicate that that is indeed the issue, I don't see any time weirdness
>>> with that applied.
>>
>> Bah, so that didn't work on all the supported platforms (notably OSX
>> does not have support for __thread). I've committed a patch that should
>> work across platforms, please give the current fio git a try (5d879392
>> or newer).
> 
> Works great.  No abnormal latency spikes, nor do our measurements of peak
> IOPs differ from prior releases.

Good!

> One tidbit though: fio_gettime() shows a much larger % of execution time
> with the default clocksource than the gettimeofday() clocksource.  In a
> 512b random read workload with gtod_reduce disabled, for example, up to 15% 
> of the userspace execution time was spent in fio_gettime() with the default
> clocksource vs. 5% with gettimeofday().  The latency is mostly incurred
> at the divisions that convert nsecs? to secs for populating the timeval
> struct.  I don't have a suggestion to replace the
> 
> usecs = t / cycles_per_usec 
> 
> division but the 
> 
> tp->tv_sec = usecs / 1000000
> 
> could be reduced by instead comparing "t" to "tv->last_cycles" to see if
> they differ by more than 1000000.  If so, do the division and capture
> secs in tv->tv_sec.  If not, just copy the last recorded tv->tv_sec into
> tp->tv_sec.  That way, the division in only incurred every second at a 
> minimum.

That would work, though I'm surprised the two divisions are that costly
for fixed point math. It would be a branch more, but since it'd hit most
of the time, should be easy enough for the branch predictor.

There's also the option of disabling some of the time keeping. Are you
interested only in completion latency? Then you could disable submission
latency measuring. That would reduce the time calls per IOP by 25%.

> I could put together a patch if you like; I just have limited ability
> to test the change on the multitude of platforms supported by fio.

For this particular case, the risk of platform breakage is small. So go
for it!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2012-12-19  7:00 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-12-12 20:11 Latency spikes with 'thread' option Sam Bradshaw (sbradshaw)
2012-12-13 13:19 ` Jens Axboe
2012-12-17 22:23   ` Sam Bradshaw (sbradshaw)
2012-12-18  7:21     ` Jens Axboe
2012-12-18  8:29       ` Jens Axboe
2012-12-18 21:16         ` Sam Bradshaw (sbradshaw)
2012-12-19  7:00           ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox