Fix for a race when fio prints I/O statistics periodically

Flexible I/O Tester development
 help / color / mirror / Atom feed

* Fix for a race when fio prints I/O statistics periodically
@ 2014-11-05 17:10 Vasily Tarasov
  2014-11-05 17:19 ` Michael Mattsson
  2014-11-08 16:28 ` Jens Axboe
  0 siblings, 2 replies; 8+ messages in thread
From: Vasily Tarasov @ 2014-11-05 17:10 UTC (permalink / raw)
  To: fio

[-- Attachment #1: Type: text/plain, Size: 1893 bytes --]

Below is the demonstration for the latest code in git:

Job-file:

[job]
filename=/dev/sda
rw=randread
blocksize=4096
direct=1
time_based=1
runtime=300
numjobs=16

Command line:

fio --output-format=json --status-interval=10  /tmp/job.file | tee /tmp/fio.out

After parsing json and printing per-job "read->runtime" and "read->iops" in two
columns, one can see this:

runtime iops
9854 4762
19861 4724
29867 4711
39873 4705
49880 4701
59886 4698
69894 4696
79900 4694
89907 4693
99913 4692
109919 4692
119926 4691
129933 4691
139940 4691
149946 4691
159953 4690
169959 4690
179966 4690
189972 4690
199978 4690
209984 4690
219990 4690
229996 4690
240002 4690
250009 4690
260015 4690
270021 4690
280027 4690
290033 4690
600043 2344 <--- You can see a spurious jump in runtime here
300001 4690

The problem is two-fold, IMHO:

1) td->ts.runtime[DDIR_READ/DDIR_WRITE/DDIR_TRIM] are set in microseconds
   instead of milliseconds in thread_main() function. The values are later
   converted to milliseconds (after exiting from the loop in thread_main()),
   but it is too late for the out-of-band access by the helper_thread
   (the one that prints periodic statistics).
   
2) The helper_thread in __show_running_run_stats() temporarily adds to
   td->ts.runtime[DDIR_READ/DDIR_WRITE/DDIR_TRIM]
   the time that has passed since the beginning of the loop (td->start).
   However, the same addition happens in thread_main() by the job
   thread and td->start is not reset atomically with this addition.
   As a result, the runtime can be added twice. (That's why in the output
   we see 600043, which is approximately twice larger than the  
   previous reading).

My proposal is to make the job threads updating both td->ts.runtime and
td->start atomically by using an already existing stat_mutex.
__show_running_run_stats() already takes this mutex where needed. 

A proposed patch is attached.


[-- Attachment #2: running-stats-race.diff --]
[-- Type: text/x-diff, Size: 2242 bytes --]

diff --git a/backend.c b/backend.c
index d19d870..d05faf4 100644
--- a/backend.c
+++ b/backend.c
@@ -1485,18 +1485,21 @@ static void *thread_main(void *data)
 
 		clear_state = 1;
 
+		fio_mutex_down(stat_mutex);
 		if (td_read(td) && td->io_bytes[DDIR_READ]) {
-			elapsed = utime_since_now(&td->start);
+			elapsed = mtime_since_now(&td->start);
 			td->ts.runtime[DDIR_READ] += elapsed;
 		}
 		if (td_write(td) && td->io_bytes[DDIR_WRITE]) {
-			elapsed = utime_since_now(&td->start);
+			elapsed = mtime_since_now(&td->start);
 			td->ts.runtime[DDIR_WRITE] += elapsed;
 		}
 		if (td_trim(td) && td->io_bytes[DDIR_TRIM]) {
-			elapsed = utime_since_now(&td->start);
+			elapsed = mtime_since_now(&td->start);
 			td->ts.runtime[DDIR_TRIM] += elapsed;
 		}
+		fio_gettime(&td->start, NULL);
+		fio_mutex_up(stat_mutex);
 
 		if (td->error || td->terminate)
 			break;
@@ -1512,16 +1515,16 @@ static void *thread_main(void *data)
 
 		do_verify(td, verify_bytes);
 
-		td->ts.runtime[DDIR_READ] += utime_since_now(&td->start);
+		fio_mutex_down(stat_mutex);
+		td->ts.runtime[DDIR_READ] += mtime_since_now(&td->start);
+		fio_gettime(&td->start, NULL);
+		fio_mutex_up(stat_mutex);
 
 		if (td->error || td->terminate)
 			break;
 	}
 
 	update_rusage_stat(td);
-	td->ts.runtime[DDIR_READ] = (td->ts.runtime[DDIR_READ] + 999) / 1000;
-	td->ts.runtime[DDIR_WRITE] = (td->ts.runtime[DDIR_WRITE] + 999) / 1000;
-	td->ts.runtime[DDIR_TRIM] = (td->ts.runtime[DDIR_TRIM] + 999) / 1000;
 	td->ts.total_run_time = mtime_since_now(&td->epoch);
 	td->ts.io_bytes[DDIR_READ] = td->io_bytes[DDIR_READ];
 	td->ts.io_bytes[DDIR_WRITE] = td->io_bytes[DDIR_WRITE];
diff --git a/stat.c b/stat.c
index 84d9eef..b7f3992 100644
--- a/stat.c
+++ b/stat.c
@@ -14,7 +14,7 @@
 #include "lib/getrusage.h"
 #include "idletime.h"
 
-static struct fio_mutex *stat_mutex;
+struct fio_mutex *stat_mutex;
 
 void update_rusage_stat(struct thread_data *td)
 {
diff --git a/stat.h b/stat.h
index 8aefe03..8b4416c 100644
--- a/stat.h
+++ b/stat.h
@@ -214,6 +214,8 @@ struct jobs_eta {
 	uint8_t run_str[];
 } __attribute__((packed));
 
+extern struct fio_mutex *stat_mutex;
+
 extern struct jobs_eta *get_jobs_eta(int force, size_t *size);
 
 extern void stat_init(void);

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: Fix for a race when fio prints I/O statistics periodically
  2014-11-05 17:10 Fix for a race when fio prints I/O statistics periodically Vasily Tarasov
@ 2014-11-05 17:19 ` Michael Mattsson
  2014-11-08 16:28 ` Jens Axboe
  1 sibling, 0 replies; 8+ messages in thread
From: Michael Mattsson @ 2014-11-05 17:19 UTC (permalink / raw)
  To: Vasily Tarasov; +Cc: fio

Hey,
This is exactly what I'm seeing as well sporadically. My rig is busy
over the next few days but I'll try verify ASAP with these patches.

Thanks!
Michael

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Fix for a race when fio prints I/O statistics periodically
  2014-11-05 17:10 Fix for a race when fio prints I/O statistics periodically Vasily Tarasov
  2014-11-05 17:19 ` Michael Mattsson
@ 2014-11-08 16:28 ` Jens Axboe
  2014-11-08 16:33   ` Michael Mattsson
  2014-11-10 14:19   ` Vasily Tarasov
  1 sibling, 2 replies; 8+ messages in thread
From: Jens Axboe @ 2014-11-08 16:28 UTC (permalink / raw)
  To: Vasily Tarasov, fio, michael.mattsson

On 2014-11-05 10:10, Vasily Tarasov wrote:
> Below is the demonstration for the latest code in git:
>
> Job-file:
>
> [job]
> filename=/dev/sda
> rw=randread
> blocksize=4096
> direct=1
> time_based=1
> runtime=300
> numjobs=16
>
> Command line:
>
> fio --output-format=json --status-interval=10  /tmp/job.file | tee /tmp/fio.out
>
> After parsing json and printing per-job "read->runtime" and "read->iops" in two
> columns, one can see this:
>
> runtime iops
> 9854 4762
> 19861 4724
> 29867 4711
> 39873 4705
> 49880 4701
> 59886 4698
> 69894 4696
> 79900 4694
> 89907 4693
> 99913 4692
> 109919 4692
> 119926 4691
> 129933 4691
> 139940 4691
> 149946 4691
> 159953 4690
> 169959 4690
> 179966 4690
> 189972 4690
> 199978 4690
> 209984 4690
> 219990 4690
> 229996 4690
> 240002 4690
> 250009 4690
> 260015 4690
> 270021 4690
> 280027 4690
> 290033 4690
> 600043 2344 <--- You can see a spurious jump in runtime here
> 300001 4690
>
> The problem is two-fold, IMHO:
>
> 1) td->ts.runtime[DDIR_READ/DDIR_WRITE/DDIR_TRIM] are set in microseconds
>     instead of milliseconds in thread_main() function. The values are later
>     converted to milliseconds (after exiting from the loop in thread_main()),
>     but it is too late for the out-of-band access by the helper_thread
>     (the one that prints periodic statistics).
>
> 2) The helper_thread in __show_running_run_stats() temporarily adds to
>     td->ts.runtime[DDIR_READ/DDIR_WRITE/DDIR_TRIM]
>     the time that has passed since the beginning of the loop (td->start).
>     However, the same addition happens in thread_main() by the job
>     thread and td->start is not reset atomically with this addition.
>     As a result, the runtime can be added twice. (That's why in the output
>     we see 600043, which is approximately twice larger than the
>     previous reading).
>
> My proposal is to make the job threads updating both td->ts.runtime and
> td->start atomically by using an already existing stat_mutex.
> __show_running_run_stats() already takes this mutex where needed.
>
> A proposed patch is attached.

Patch looks pretty good, probably not worth it to make this a per-td 
mutex instead. But why the extra setting of td->start?

Michael, did you get a chance to test this?

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Fix for a race when fio prints I/O statistics periodically
  2014-11-08 16:28 ` Jens Axboe
@ 2014-11-08 16:33   ` Michael Mattsson
  2014-11-10 14:19   ` Vasily Tarasov
  1 sibling, 0 replies; 8+ messages in thread
From: Michael Mattsson @ 2014-11-08 16:33 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Vasily Tarasov, fio

Hey,
It's been a busy week. I'll try get some samples Sunday to Monday
night and report my findings.

Thanks for all the work!

Michael

On Sat, Nov 8, 2014 at 8:28 AM, Jens Axboe <axboe@kernel.dk> wrote:
> On 2014-11-05 10:10, Vasily Tarasov wrote:
>>
>> Below is the demonstration for the latest code in git:
>>
>> Job-file:
>>
>> [job]
>> filename=/dev/sda
>> rw=randread
>> blocksize=4096
>> direct=1
>> time_based=1
>> runtime=300
>> numjobs=16
>>
>> Command line:
>>
>> fio --output-format=json --status-interval=10  /tmp/job.file | tee
>> /tmp/fio.out
>>
>> After parsing json and printing per-job "read->runtime" and "read->iops"
>> in two
>> columns, one can see this:
>>
>> runtime iops
>> 9854 4762
>> 19861 4724
>> 29867 4711
>> 39873 4705
>> 49880 4701
>> 59886 4698
>> 69894 4696
>> 79900 4694
>> 89907 4693
>> 99913 4692
>> 109919 4692
>> 119926 4691
>> 129933 4691
>> 139940 4691
>> 149946 4691
>> 159953 4690
>> 169959 4690
>> 179966 4690
>> 189972 4690
>> 199978 4690
>> 209984 4690
>> 219990 4690
>> 229996 4690
>> 240002 4690
>> 250009 4690
>> 260015 4690
>> 270021 4690
>> 280027 4690
>> 290033 4690
>> 600043 2344 <--- You can see a spurious jump in runtime here
>> 300001 4690
>>
>> The problem is two-fold, IMHO:
>>
>> 1) td->ts.runtime[DDIR_READ/DDIR_WRITE/DDIR_TRIM] are set in microseconds
>>     instead of milliseconds in thread_main() function. The values are
>> later
>>     converted to milliseconds (after exiting from the loop in
>> thread_main()),
>>     but it is too late for the out-of-band access by the helper_thread
>>     (the one that prints periodic statistics).
>>
>> 2) The helper_thread in __show_running_run_stats() temporarily adds to
>>     td->ts.runtime[DDIR_READ/DDIR_WRITE/DDIR_TRIM]
>>     the time that has passed since the beginning of the loop (td->start).
>>     However, the same addition happens in thread_main() by the job
>>     thread and td->start is not reset atomically with this addition.
>>     As a result, the runtime can be added twice. (That's why in the output
>>     we see 600043, which is approximately twice larger than the
>>     previous reading).
>>
>> My proposal is to make the job threads updating both td->ts.runtime and
>> td->start atomically by using an already existing stat_mutex.
>> __show_running_run_stats() already takes this mutex where needed.
>>
>> A proposed patch is attached.
>
>
> Patch looks pretty good, probably not worth it to make this a per-td mutex
> instead. But why the extra setting of td->start?
>
> Michael, did you get a chance to test this?
>
> --
> Jens Axboe
>


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Fix for a race when fio prints I/O statistics periodically
  2014-11-08 16:28 ` Jens Axboe
  2014-11-08 16:33   ` Michael Mattsson
@ 2014-11-10 14:19   ` Vasily Tarasov
  2014-11-10 15:33     ` Jens Axboe
  1 sibling, 1 reply; 8+ messages in thread
From: Vasily Tarasov @ 2014-11-10 14:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: fio@vger.kernel.org, Michael Mattsson

Hi Jens,

td->start is used in  __show_running_run_stats() to compute the time
that passed since the beginning of the loop (stored in td->start) to
the moment when periodical statistics need to be printed. Then
td->ts.runtime[] is temporarily updated based on the computed value.

In thread_main(), similar actions are taken to update td->ts.runtime[]
based on the td->start and the loop's elapsed time. Without the patch,
td->start is set in the very beginning of the loop in thread_main().
So, it leaves a time window between the end of the current loop and
the beginning of the next loop when td->ts.runtime[] is already
updated while td->start still stores the start of the previous lop. If
__show_running_run_stats() is called during that window - the time can
be (temporarily) added twice to td->ts.runtime[].

However, if we reset the start time in the end of the loop under the
stat_mutex it should not happen.

I would agree that having a per-td mutex might be an overkill
considering that people usually print statistics periodically with
relatively long intervals: 1, 5, 10, 60 seconds.

Vasily

On Sat, Nov 8, 2014 at 11:28 AM, Jens Axboe <axboe@kernel.dk> wrote:
> On 2014-11-05 10:10, Vasily Tarasov wrote:
>>
>> Below is the demonstration for the latest code in git:
>>
>> Job-file:
>>
>> [job]
>> filename=/dev/sda
>> rw=randread
>> blocksize=4096
>> direct=1
>> time_based=1
>> runtime=300
>> numjobs=16
>>
>> Command line:
>>
>> fio --output-format=json --status-interval=10  /tmp/job.file | tee
>> /tmp/fio.out
>>
>> After parsing json and printing per-job "read->runtime" and "read->iops"
>> in two
>> columns, one can see this:
>>
>> runtime iops
>> 9854 4762
>> 19861 4724
>> 29867 4711
>> 39873 4705
>> 49880 4701
>> 59886 4698
>> 69894 4696
>> 79900 4694
>> 89907 4693
>> 99913 4692
>> 109919 4692
>> 119926 4691
>> 129933 4691
>> 139940 4691
>> 149946 4691
>> 159953 4690
>> 169959 4690
>> 179966 4690
>> 189972 4690
>> 199978 4690
>> 209984 4690
>> 219990 4690
>> 229996 4690
>> 240002 4690
>> 250009 4690
>> 260015 4690
>> 270021 4690
>> 280027 4690
>> 290033 4690
>> 600043 2344 <--- You can see a spurious jump in runtime here
>> 300001 4690
>>
>> The problem is two-fold, IMHO:
>>
>> 1) td->ts.runtime[DDIR_READ/DDIR_WRITE/DDIR_TRIM] are set in microseconds
>>     instead of milliseconds in thread_main() function. The values are
>> later
>>     converted to milliseconds (after exiting from the loop in
>> thread_main()),
>>     but it is too late for the out-of-band access by the helper_thread
>>     (the one that prints periodic statistics).
>>
>> 2) The helper_thread in __show_running_run_stats() temporarily adds to
>>     td->ts.runtime[DDIR_READ/DDIR_WRITE/DDIR_TRIM]
>>     the time that has passed since the beginning of the loop (td->start).
>>     However, the same addition happens in thread_main() by the job
>>     thread and td->start is not reset atomically with this addition.
>>     As a result, the runtime can be added twice. (That's why in the output
>>     we see 600043, which is approximately twice larger than the
>>     previous reading).
>>
>> My proposal is to make the job threads updating both td->ts.runtime and
>> td->start atomically by using an already existing stat_mutex.
>> __show_running_run_stats() already takes this mutex where needed.
>>
>> A proposed patch is attached.
>
>
> Patch looks pretty good, probably not worth it to make this a per-td mutex
> instead. But why the extra setting of td->start?
>
> Michael, did you get a chance to test this?
>
> --
> Jens Axboe
>


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Fix for a race when fio prints I/O statistics periodically
  2014-11-10 14:19   ` Vasily Tarasov
@ 2014-11-10 15:33     ` Jens Axboe
  2014-12-02 23:00       ` Michael Mattsson
  0 siblings, 1 reply; 8+ messages in thread
From: Jens Axboe @ 2014-11-10 15:33 UTC (permalink / raw)
  To: Vasily Tarasov; +Cc: fio@vger.kernel.org, Michael Mattsson

On 2014-11-10 07:19, Vasily Tarasov wrote:
> Hi Jens,
>
> td->start is used in  __show_running_run_stats() to compute the time
> that passed since the beginning of the loop (stored in td->start) to
> the moment when periodical statistics need to be printed. Then
> td->ts.runtime[] is temporarily updated based on the computed value.
>
> In thread_main(), similar actions are taken to update td->ts.runtime[]
> based on the td->start and the loop's elapsed time. Without the patch,
> td->start is set in the very beginning of the loop in thread_main().
> So, it leaves a time window between the end of the current loop and
> the beginning of the next loop when td->ts.runtime[] is already
> updated while td->start still stores the start of the previous lop. If
> __show_running_run_stats() is called during that window - the time can
> be (temporarily) added twice to td->ts.runtime[].
>
> However, if we reset the start time in the end of the loop under the
> stat_mutex it should not happen.
>
> I would agree that having a per-td mutex might be an overkill
> considering that people usually print statistics periodically with
> relatively long intervals: 1, 5, 10, 60 seconds.

Thanks, I missed the change in addition after IO was done, hence the 
question on ->start. I did commit the patch as-is yesterday, thanks!

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Fix for a race when fio prints I/O statistics periodically
  2014-11-10 15:33     ` Jens Axboe
@ 2014-12-02 23:00       ` Michael Mattsson
  2014-12-02 23:36         ` Jens Axboe
  0 siblings, 1 reply; 8+ messages in thread
From: Michael Mattsson @ 2014-12-02 23:00 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Vasily Tarasov, fio@vger.kernel.org

Hey,
I just want to close the loop here. Last week up until now I've
successfully run several hundred jobs and not a single skewed output
has come across.

Is this patch integrated into the main code base or the latest release?

Regards
Michael

On Mon, Nov 10, 2014 at 7:33 AM, Jens Axboe <axboe@kernel.dk> wrote:
> On 2014-11-10 07:19, Vasily Tarasov wrote:
>>
>> Hi Jens,
>>
>> td->start is used in  __show_running_run_stats() to compute the time
>> that passed since the beginning of the loop (stored in td->start) to
>> the moment when periodical statistics need to be printed. Then
>> td->ts.runtime[] is temporarily updated based on the computed value.
>>
>> In thread_main(), similar actions are taken to update td->ts.runtime[]
>> based on the td->start and the loop's elapsed time. Without the patch,
>> td->start is set in the very beginning of the loop in thread_main().
>> So, it leaves a time window between the end of the current loop and
>> the beginning of the next loop when td->ts.runtime[] is already
>> updated while td->start still stores the start of the previous lop. If
>> __show_running_run_stats() is called during that window - the time can
>> be (temporarily) added twice to td->ts.runtime[].
>>
>> However, if we reset the start time in the end of the loop under the
>> stat_mutex it should not happen.
>>
>> I would agree that having a per-td mutex might be an overkill
>> considering that people usually print statistics periodically with
>> relatively long intervals: 1, 5, 10, 60 seconds.
>
>
> Thanks, I missed the change in addition after IO was done, hence the
> question on ->start. I did commit the patch as-is yesterday, thanks!
>
> --
> Jens Axboe
>


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Fix for a race when fio prints I/O statistics periodically
  2014-12-02 23:00       ` Michael Mattsson
@ 2014-12-02 23:36         ` Jens Axboe
  0 siblings, 0 replies; 8+ messages in thread
From: Jens Axboe @ 2014-12-02 23:36 UTC (permalink / raw)
  To: Michael Mattsson; +Cc: Vasily Tarasov, fio@vger.kernel.org

On 12/02/2014 04:00 PM, Michael Mattsson wrote:
> Hey,
> I just want to close the loop here. Last week up until now I've
> successfully run several hundred jobs and not a single skewed output
> has come across.
>
> Is this patch integrated into the main code base or the latest release?

The fix is in 2.1.14 as released, so both.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2014-12-02 23:36 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-11-05 17:10 Fix for a race when fio prints I/O statistics periodically Vasily Tarasov
2014-11-05 17:19 ` Michael Mattsson
2014-11-08 16:28 ` Jens Axboe
2014-11-08 16:33   ` Michael Mattsson
2014-11-10 14:19   ` Vasily Tarasov
2014-11-10 15:33     ` Jens Axboe
2014-12-02 23:00       ` Michael Mattsson
2014-12-02 23:36         ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox