Flexible I/O Tester development
 help / color / mirror / Atom feed
* Missing log entries and other questions
@ 2016-04-04 11:08 Mihkal Dunfjeld
  0 siblings, 0 replies; 16+ messages in thread
From: Mihkal Dunfjeld @ 2016-04-04 11:08 UTC (permalink / raw)
  To: fio; +Cc: birger.lunstoeng

Hello fio team!

We are a group of three students that are writing a bachelor thesis at
NTNU Gjøvik Norway about Ceph and we are trying to measure its
performance using the great tool Fio. During our testing we find that
for 16 job write runs, the amount of lines in our logs falls
significantly below what we expect for a test set to run for 120 seconds.

Example: 4 jobs, 8 QD, random write
sudo fio --filename=/dev/rbd0 --direct=1 --rw=randwrite --refill_buffers
--norandommap --randrepeat=0 --ioengine=libaio --bs=4k --iodepth=8
--numjobs=4 --runtime=120 --log_avg_msec=1000
--write_iops_log=fsrandW4-8-iopslog --group_reporting --ramp_time=60
--name=fsrandW4-8 --output=fsrandW4-8

For this test we are missing 23% of the log entries. Got 369/480 (120
sec times 4 jobs).
Raw data file: http://pastebin.com/CeXa61MF
This gets worse with higher load. On 16 jobs 64 QD runs, more than 50%
of the log entries are missing.
The tests are run from a HP DL380G6 with 2x Xeon X5560 @ 2.8 Ghz, 12 GB
ECC RAM running Ubuntu Server 14.04. Tested with both the Ubuntu
repository version and the latest version compiled from github. We also
tried to output the log files to tempfs without that making much of a
difference.

Because of this we have a few questions:

1. What causes fio to not be able to write a new line to the 	
Write_iops_log file at the set interval? (log_avg_msec=1000).  We see up
to several seconds of delay (See above pastebin).

2. Does fio write to the write_iops_log file even if iops is zero? For
instance if cluster performance is really bad.

3. The iops-number that Fio writes to the log, is that the number of I/O
operations since the last line in the log, or the number of operations
during the time specified with iopsavgtime, or just a snapshot of the
current number of operations at the time of writing to the log?

4. Does the operation to write to the log enter the same I/O queue as
the jobs/QD Fio uses? This is not directly a Fio question, but maybe
someone knows. I suppose we really ought to know this ourselves, but
sadly do not.

5. The average IOPS number that Fio reports for a run, how is that
calculated? We see an at times significant disparity between our own
calculation from the raw data log (--write_iops_log) and the standard
--output. Our average is always higher when we calculate the iopslog
like this: http://pastebin.com/ZY3hNWh3

There may be simple explanation for all of this, but we are
unfortunately unable to see it.

Any help would be much appreciated.

Regards,
Birger, Mihkal, Øyvind

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: Missing log entries and other questions
       [not found] <eBGJ1s00E1TeosZ01BGKGG>
@ 2016-04-26  2:22 ` Paul Alcorn
  2016-04-27 20:38   ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Paul Alcorn @ 2016-04-26  2:22 UTC (permalink / raw)
  To: 'Mihkal Dunfjeld', fio; +Cc: birger.lunstoeng

I have also encountered this problem, it is very severe with SATA HDDs, and also is prevalent with SATA SSDs as well. In Centos 7 with the latest fio version it is very bad. The only two variables that seem to remain the same are with SATA drives and with slow devices. Perhaps the perception that it is with slow devices is merely because they are SATA, but the exact same configurations with PCIe SSDs never experience skipped log entries. 

-----Original Message-----
From: fio-owner@vger.kernel.org [mailto:fio-owner@vger.kernel.org] On Behalf Of Mihkal Dunfjeld
Sent: Monday, April 4, 2016 6:08 AM
To: fio@vger.kernel.org
Cc: birger.lunstoeng@stud.ntnu.no
Subject: Missing log entries and other questions

Hello fio team!

We are a group of three students that are writing a bachelor thesis at NTNU Gjøvik Norway about Ceph and we are trying to measure its performance using the great tool Fio. During our testing we find that for 16 job write runs, the amount of lines in our logs falls significantly below what we expect for a test set to run for 120 seconds.

Example: 4 jobs, 8 QD, random write
sudo fio --filename=/dev/rbd0 --direct=1 --rw=randwrite --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --iodepth=8
--numjobs=4 --runtime=120 --log_avg_msec=1000 --write_iops_log=fsrandW4-8-iopslog --group_reporting --ramp_time=60
--name=fsrandW4-8 --output=fsrandW4-8

For this test we are missing 23% of the log entries. Got 369/480 (120 sec times 4 jobs).
Raw data file: http://pastebin.com/CeXa61MF This gets worse with higher load. On 16 jobs 64 QD runs, more than 50% of the log entries are missing.
The tests are run from a HP DL380G6 with 2x Xeon X5560 @ 2.8 Ghz, 12 GB ECC RAM running Ubuntu Server 14.04. Tested with both the Ubuntu repository version and the latest version compiled from github. We also tried to output the log files to tempfs without that making much of a difference.

Because of this we have a few questions:

1. What causes fio to not be able to write a new line to the 	
Write_iops_log file at the set interval? (log_avg_msec=1000).  We see up to several seconds of delay (See above pastebin).

2. Does fio write to the write_iops_log file even if iops is zero? For instance if cluster performance is really bad.

3. The iops-number that Fio writes to the log, is that the number of I/O operations since the last line in the log, or the number of operations during the time specified with iopsavgtime, or just a snapshot of the current number of operations at the time of writing to the log?

4. Does the operation to write to the log enter the same I/O queue as the jobs/QD Fio uses? This is not directly a Fio question, but maybe someone knows. I suppose we really ought to know this ourselves, but sadly do not.

5. The average IOPS number that Fio reports for a run, how is that calculated? We see an at times significant disparity between our own calculation from the raw data log (--write_iops_log) and the standard --output. Our average is always higher when we calculate the iopslog like this: http://pastebin.com/ZY3hNWh3

There may be simple explanation for all of this, but we are unfortunately unable to see it.

Any help would be much appreciated.

Regards,
Birger, Mihkal, Øyvind
--
To unsubscribe from this list: send the line "unsubscribe fio" in the body of a message to mailto:majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Missing log entries and other questions
  2016-04-26  2:22 ` Missing log entries and other questions Paul Alcorn
@ 2016-04-27 20:38   ` Jens Axboe
  2016-05-04 14:10     ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2016-04-27 20:38 UTC (permalink / raw)
  To: Paul Alcorn; +Cc: 'Mihkal Dunfjeld', fio, birger.lunstoeng

I'll try and see if I can reproduce this.


On Mon, Apr 25 2016, Paul Alcorn wrote:
> I have also encountered this problem, it is very severe with SATA HDDs, and also is prevalent with SATA SSDs as well. In Centos 7 with the latest fio version it is very bad. The only two variables that seem to remain the same are with SATA drives and with slow devices. Perhaps the perception that it is with slow devices is merely because they are SATA, but the exact same configurations with PCIe SSDs never experience skipped log entries. 
> 
> -----Original Message-----
> From: fio-owner@vger.kernel.org [mailto:fio-owner@vger.kernel.org] On Behalf Of Mihkal Dunfjeld
> Sent: Monday, April 4, 2016 6:08 AM
> To: fio@vger.kernel.org
> Cc: birger.lunstoeng@stud.ntnu.no
> Subject: Missing log entries and other questions
> 
> Hello fio team!
> 
> We are a group of three students that are writing a bachelor thesis at NTNU Gj�vik Norway about Ceph and we are trying to measure its performance using the great tool Fio. During our testing we find that for 16 job write runs, the amount of lines in our logs falls significantly below what we expect for a test set to run for 120 seconds.
> 
> Example: 4 jobs, 8 QD, random write
> sudo fio --filename=/dev/rbd0 --direct=1 --rw=randwrite --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --iodepth=8
> --numjobs=4 --runtime=120 --log_avg_msec=1000 --write_iops_log=fsrandW4-8-iopslog --group_reporting --ramp_time=60
> --name=fsrandW4-8 --output=fsrandW4-8
> 
> For this test we are missing 23% of the log entries. Got 369/480 (120 sec times 4 jobs).
> Raw data file: http://pastebin.com/CeXa61MF This gets worse with higher load. On 16 jobs 64 QD runs, more than 50% of the log entries are missing.
> The tests are run from a HP DL380G6 with 2x Xeon X5560 @ 2.8 Ghz, 12 GB ECC RAM running Ubuntu Server 14.04. Tested with both the Ubuntu repository version and the latest version compiled from github. We also tried to output the log files to tempfs without that making much of a difference.
> 
> Because of this we have a few questions:
> 
> 1. What causes fio to not be able to write a new line to the 	
> Write_iops_log file at the set interval? (log_avg_msec=1000).  We see up to several seconds of delay (See above pastebin).
> 
> 2. Does fio write to the write_iops_log file even if iops is zero? For instance if cluster performance is really bad.
> 
> 3. The iops-number that Fio writes to the log, is that the number of I/O operations since the last line in the log, or the number of operations during the time specified with iopsavgtime, or just a snapshot of the current number of operations at the time of writing to the log?
> 
> 4. Does the operation to write to the log enter the same I/O queue as the jobs/QD Fio uses? This is not directly a Fio question, but maybe someone knows. I suppose we really ought to know this ourselves, but sadly do not.
> 
> 5. The average IOPS number that Fio reports for a run, how is that calculated? We see an at times significant disparity between our own calculation from the raw data log (--write_iops_log) and the standard --output. Our average is always higher when we calculate the iopslog like this: http://pastebin.com/ZY3hNWh3
> 
> There may be simple explanation for all of this, but we are unfortunately unable to see it.
> 
> Any help would be much appreciated.
> 
> Regards,
> Birger, Mihkal, �yvind
> --
> To unsubscribe from this list: send the line "unsubscribe fio" in the body of a message to mailto:majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> --
> To unsubscribe from this list: send the line "unsubscribe fio" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Missing log entries and other questions
  2016-04-27 20:38   ` Jens Axboe
@ 2016-05-04 14:10     ` Jens Axboe
  2016-05-04 16:34       ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2016-05-04 14:10 UTC (permalink / raw)
  To: Paul Alcorn; +Cc: 'Mihkal Dunfjeld', fio, birger.lunstoeng

On 04/27/2016 02:38 PM, Jens Axboe wrote:
> I'll try and see if I can reproduce this.

So I took a look at your logged output. The way that fio logs is that it 
does it on IO completion. You have set 1s intervals, so it checks, when 
an IO completes, if we're due for logging the previous interval IOPS:

spent = mtime_since(&td->iops_sample_time, t);
if (spent < td->o.iops_avg_time)
         return;

If you are doing lots of IO, it matches pretty evenly. But for slow 
workloads, it will end up being skewed a bit. It won't be too hard to 
fix the drift, though, but it still won't make it clock precise at exact 
1s intervals.

We could move this part into the main thread, and out of the workers. 
Then we could make it more precise. Might not be a bad idea in general, 
since it'll move some of this code out of the fast path.

I'll take a stab at it.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Missing log entries and other questions
  2016-05-04 14:10     ` Jens Axboe
@ 2016-05-04 16:34       ` Jens Axboe
  2016-05-04 17:43         ` Mark Nelson
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2016-05-04 16:34 UTC (permalink / raw)
  To: Paul Alcorn; +Cc: 'Mihkal Dunfjeld', fio, birger.lunstoeng

On 05/04/2016 08:10 AM, Jens Axboe wrote:
> On 04/27/2016 02:38 PM, Jens Axboe wrote:
>> I'll try and see if I can reproduce this.
>
> So I took a look at your logged output. The way that fio logs is that it
> does it on IO completion. You have set 1s intervals, so it checks, when
> an IO completes, if we're due for logging the previous interval IOPS:
>
> spent = mtime_since(&td->iops_sample_time, t);
> if (spent < td->o.iops_avg_time)
>          return;
>
> If you are doing lots of IO, it matches pretty evenly. But for slow
> workloads, it will end up being skewed a bit. It won't be too hard to
> fix the drift, though, but it still won't make it clock precise at exact
> 1s intervals.
>
> We could move this part into the main thread, and out of the workers.
> Then we could make it more precise. Might not be a bad idea in general,
> since it'll move some of this code out of the fast path.
>
> I'll take a stab at it.

Can you try the below patch?

diff --git a/backend.c b/backend.c
index 1723b8f..6ce54fc 100644
--- a/backend.c
+++ b/backend.c
@@ -1722,8 +1722,6 @@ static void *thread_main(void *data)

  	fio_unpin_memory(td);

-	fio_writeout_logs(td);
-
  	iolog_compress_exit(td);
  	rate_submit_exit(td);

@@ -2338,17 +2336,21 @@ static void free_disk_util(void)
  static void *helper_thread_main(void *data)
  {
  	struct sk_out *sk_out = data;
+	struct timeval last_du;
+	uint64_t msec_to_next_event = DISK_UTIL_MSEC;
  	int ret = 0;

  	sk_out_assign(sk_out);

  	fio_mutex_up(startup_mutex);

-	while (!ret) {
-		uint64_t sec = DISK_UTIL_MSEC / 1000;
-		uint64_t nsec = (DISK_UTIL_MSEC % 1000) * 1000000;
+	gettimeofday(&last_du, NULL);
+
+	while (!ret && !helper_exit) {
+		uint64_t sec = msec_to_next_event / 1000;
+		uint64_t nsec = (msec_to_next_event % 1000) * 1000000;
  		struct timespec ts;
-		struct timeval tv;
+		struct timeval now, tv;

  		gettimeofday(&tv, NULL);
  		ts.tv_sec = tv.tv_sec + sec;
@@ -2361,7 +2363,16 @@ static void *helper_thread_main(void *data)

  		pthread_cond_timedwait(&helper_cond, &helper_lock, &ts);

-		ret = update_io_ticks();
+		gettimeofday(&now, NULL);
+
+		if (mtime_since(&last_du, &now) >= DISK_UTIL_MSEC) {
+			ret = update_io_ticks();
+			memcpy(&last_du, &now, sizeof(now));
+		}
+
+		msec_to_next_event = calc_log_samples();
+		if (!msec_to_next_event || msec_to_next_event > DISK_UTIL_MSEC)
+			msec_to_next_event = DISK_UTIL_MSEC;

  		if (helper_do_stat) {
  			helper_do_stat = 0;
@@ -2372,6 +2383,8 @@ static void *helper_thread_main(void *data)
  			print_thread_status();
  	}

+	fio_writeout_logs();
+
  	sk_out_drop();
  	return NULL;
  }
diff --git a/io_u.c b/io_u.c
index 6622bc0..881fd67 100644
--- a/io_u.c
+++ b/io_u.c
@@ -1715,12 +1715,6 @@ static void account_io_completion(struct 
thread_data *td, struct io_u *io_u,
  		io_u_mark_latency(td, lusec);
  	}

-	if (!td->o.disable_bw)
-		add_bw_sample(td, idx, bytes, &icd->time);
-
-	if (no_reduce)
-		add_iops_sample(td, idx, bytes, &icd->time);
-
  	if (td->ts.nr_block_infos && io_u->ddir == DDIR_TRIM) {
  		uint32_t *info = io_u_block_info(td, io_u);
  		if (BLOCK_INFO_STATE(*info) < BLOCK_STATE_TRIM_FAILURE) {
diff --git a/iolog.c b/iolog.c
index feda9ed..6f53f3f 100644
--- a/iolog.c
+++ b/iolog.c
@@ -18,6 +18,7 @@
  #include "verify.h"
  #include "trim.h"
  #include "filelock.h"
+#include "smalloc.h"

  static const char iolog_ver2[] = "fio version 2 iolog";

@@ -574,14 +575,12 @@ void setup_log(struct io_log **log, struct 
log_params *p,
  {
  	struct io_log *l;

-	l = calloc(1, sizeof(*l));
+	l = smalloc(sizeof(*l));
  	l->nr_samples = 0;
-	l->max_samples = 1024;
  	l->log_type = p->log_type;
  	l->log_offset = p->log_offset;
  	l->log_gz = p->log_gz;
  	l->log_gz_store = p->log_gz_store;
-	l->log = malloc(l->max_samples * log_entry_sz(l));
  	l->avg_msec = p->avg_msec;
  	l->filename = strdup(filename);
  	l->td = p->td;
@@ -631,7 +630,7 @@ void free_log(struct io_log *log)
  {
  	free(log->log);
  	free(log->filename);
-	free(log);
+	sfree(log);
  }

  void flush_samples(FILE *f, void *samples, uint64_t sample_size)
@@ -1302,7 +1301,7 @@ static struct log_type log_types[] = {
  	},
  };

-void fio_writeout_logs(struct thread_data *td)
+static void td_writeout_logs(struct thread_data *td)
  {
  	unsigned int log_mask = 0;
  	unsigned int log_left = ALL_LOG_NR;
@@ -1334,3 +1333,12 @@ void fio_writeout_logs(struct thread_data *td)

  	td_restore_runstate(td, old_state);
  }
+
+void fio_writeout_logs(void)
+{
+	struct thread_data *td;
+	int i;
+
+	for_each_td(td, i)
+		td_writeout_logs(td);
+}
diff --git a/iolog.h b/iolog.h
index 297daf5..a1b2ec5 100644
--- a/iolog.h
+++ b/iolog.h
@@ -41,6 +41,8 @@ enum {
  	IO_LOG_TYPE_IOPS,
  };

+#define DEF_LOG_ENTRIES		1024
+
  /*
   * Dynamically growing data sample log
   */
@@ -210,7 +212,7 @@ extern void setup_log(struct io_log **, struct 
log_params *, const char *);
  extern void flush_log(struct io_log *, int);
  extern void flush_samples(FILE *, void *, uint64_t);
  extern void free_log(struct io_log *);
-extern void fio_writeout_logs(struct thread_data *);
+extern void fio_writeout_logs(void);
  extern int iolog_flush(struct io_log *, int);

  static inline void init_ipo(struct io_piece *ipo)
diff --git a/libfio.c b/libfio.c
index c626d15..b17f148 100644
--- a/libfio.c
+++ b/libfio.c
@@ -146,6 +146,8 @@ void reset_all_stats(struct thread_data *td)
  	fio_gettime(&tv, NULL);
  	memcpy(&td->epoch, &tv, sizeof(tv));
  	memcpy(&td->start, &tv, sizeof(tv));
+	memcpy(&td->iops_sample_time, &tv, sizeof(tv));
+	memcpy(&td->bw_sample_time, &tv, sizeof(tv));

  	lat_target_reset(td);
  	clear_rusage_stat(td);
diff --git a/stat.c b/stat.c
index 6d8d4d0..8087970 100644
--- a/stat.c
+++ b/stat.c
@@ -1865,7 +1865,11 @@ static void __add_log_sample(struct io_log 
*iolog, unsigned long val,
  		size_t new_size;
  		void *new_log;

-		new_size = 2 * iolog->max_samples * log_entry_sz(iolog);
+		if (!iolog->max_samples) {
+			iolog->max_samples = DEF_LOG_ENTRIES;
+			new_size = DEF_LOG_ENTRIES * log_entry_sz(iolog);
+		} else
+			new_size = 2 * iolog->max_samples * log_entry_sz(iolog);

  		if (iolog->log_gz && (new_size > iolog->log_gz)) {
  			if (iolog_flush(iolog, 0)) {
@@ -1882,7 +1886,7 @@ static void __add_log_sample(struct io_log *iolog, 
unsigned long val,
  				return;
  			}
  			iolog->log = new_log;
-			iolog->max_samples <<= 1;
+			iolog->max_samples = new_size;
  		}
  	}

@@ -2108,18 +2112,15 @@ void add_lat_sample(struct thread_data *td, enum 
fio_ddir ddir,
  	td_io_u_unlock(td);
  }

-void add_bw_sample(struct thread_data *td, enum fio_ddir ddir, unsigned 
int bs,
-		   struct timeval *t)
+static uint64_t add_bw_sample(struct thread_data *td, struct timeval *t)
  {
  	struct thread_stat *ts = &td->ts;
  	unsigned long spent, rate;
-
-	if (!ddir_rw(ddir))
-		return;
+	enum fio_ddir ddir;

  	spent = mtime_since(&td->bw_sample_time, t);
  	if (spent < td->o.bw_avg_time)
-		return;
+		return td->o.bw_avg_time - spent;

  	td_io_u_lock(td);

@@ -2141,27 +2142,25 @@ void add_bw_sample(struct thread_data *td, enum 
fio_ddir ddir, unsigned int bs,
  		add_stat_sample(&ts->bw_stat[ddir], rate);

  		if (td->bw_log)
-			add_log_sample(td, td->bw_log, rate, ddir, bs, 0);
+			add_log_sample(td, td->bw_log, rate, ddir, 0, 0);

  		td->stat_io_bytes[ddir] = td->this_io_bytes[ddir];
  	}

  	fio_gettime(&td->bw_sample_time, NULL);
  	td_io_u_unlock(td);
+	return td->o.bw_avg_time;
  }

-void add_iops_sample(struct thread_data *td, enum fio_ddir ddir, 
unsigned int bs,
-		     struct timeval *t)
+static uint64_t add_iops_sample(struct thread_data *td, struct timeval *t)
  {
  	struct thread_stat *ts = &td->ts;
  	unsigned long spent, iops;
-
-	if (!ddir_rw(ddir))
-		return;
+	enum fio_ddir ddir;

  	spent = mtime_since(&td->iops_sample_time, t);
  	if (spent < td->o.iops_avg_time)
-		return;
+		return td->o.iops_avg_time - spent;

  	td_io_u_lock(td);

@@ -2183,13 +2182,41 @@ void add_iops_sample(struct thread_data *td, 
enum fio_ddir ddir, unsigned int bs
  		add_stat_sample(&ts->iops_stat[ddir], iops);

  		if (td->iops_log)
-			add_log_sample(td, td->iops_log, iops, ddir, bs, 0);
+			add_log_sample(td, td->iops_log, iops, ddir, 0, 0);

  		td->stat_io_blocks[ddir] = td->this_io_blocks[ddir];
  	}

  	fio_gettime(&td->iops_sample_time, NULL);
  	td_io_u_unlock(td);
+	return td->o.iops_avg_time;
+}
+
+uint64_t calc_log_samples(void)
+{
+	struct thread_data *td;
+	struct timeval now;
+	uint64_t next = ~0ULL, tmp;
+	int i;
+
+	fio_gettime(&now, NULL);
+
+	for_each_td(td, i) {
+		if (!ramp_time_over(td) ||
+		    !(td->runstate == TD_RUNNING || td->runstate == TD_VERIFYING)) {
+			next = min(td->o.iops_avg_time, td->o.bw_avg_time);
+			continue;
+		}
+		tmp = add_bw_sample(td, &now);
+		if (tmp < next)
+			next = tmp;
+
+		tmp = add_iops_sample(td, &now);
+		if (tmp < next)
+			next = tmp;
+	}
+
+	return next == ~0ULL ? 0 : next;
  }

  void stat_init(void)
diff --git a/stat.h b/stat.h
index 9c3f192..91a8cff 100644
--- a/stat.h
+++ b/stat.h
@@ -276,11 +276,8 @@ extern void add_clat_sample(struct thread_data *, 
enum fio_ddir, unsigned long,
  				unsigned int, uint64_t);
  extern void add_slat_sample(struct thread_data *, enum fio_ddir, 
unsigned long,
  				unsigned int, uint64_t);
-extern void add_bw_sample(struct thread_data *, enum fio_ddir, unsigned 
int,
-				struct timeval *);
-extern void add_iops_sample(struct thread_data *, enum fio_ddir, 
unsigned int,
-				struct timeval *);
  extern void add_agg_sample(unsigned long, enum fio_ddir, unsigned int);
+extern uint64_t calc_log_samples(void);

  extern struct io_log *agg_io_log[DDIR_RWDIR_CNT];
  extern int write_bw_log;

-- 
Jens Axboe



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: Missing log entries and other questions
  2016-05-04 16:34       ` Jens Axboe
@ 2016-05-04 17:43         ` Mark Nelson
  2016-05-04 20:51           ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Mark Nelson @ 2016-05-04 17:43 UTC (permalink / raw)
  To: Jens Axboe, Paul Alcorn; +Cc: 'Mihkal Dunfjeld', fio, birger.lunstoeng

On 05/04/2016 11:34 AM, Jens Axboe wrote:
> On 05/04/2016 08:10 AM, Jens Axboe wrote:
>> On 04/27/2016 02:38 PM, Jens Axboe wrote:
>>> I'll try and see if I can reproduce this.
>>
>> So I took a look at your logged output. The way that fio logs is that it
>> does it on IO completion. You have set 1s intervals, so it checks, when
>> an IO completes, if we're due for logging the previous interval IOPS:
>>
>> spent = mtime_since(&td->iops_sample_time, t);
>> if (spent < td->o.iops_avg_time)
>>          return;
>>
>> If you are doing lots of IO, it matches pretty evenly. But for slow
>> workloads, it will end up being skewed a bit. It won't be too hard to
>> fix the drift, though, but it still won't make it clock precise at exact
>> 1s intervals.
>>
>> We could move this part into the main thread, and out of the workers.
>> Then we could make it more precise. Might not be a bad idea in general,
>> since it'll move some of this code out of the fast path.
>>
>> I'll take a stab at it.
>
> Can you try the below patch?

I'll also try to take a look at this as we've seen a fair amount of skew 
in the logs doing fio librbd upstream ceph testing.  I actually wrote a 
first stab at a parser to aggregate logs from multiple fio processes 
with the assumption of non-uniform sampling:

https://github.com/ceph/cbt/blob/master/tools/fiologpaser.py

The idea is you have say 8 fio bw logs in a directory you can do:

fiologparser.py -f *bw*

which will return average values from the logs over user defined 
intervals by taking weighted scores from samples that overlap.  You can 
also look at per-interval averages, sums, or just a global average.

various option flags:

-i <interval time in ms>
-d <divide results by d (say if you want a different unit of output)
-f <show full interval data>
-a <show average interval data across logs>
-s <show sums for interval data across logs>

Mark


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Missing log entries and other questions
  2016-05-04 17:43         ` Mark Nelson
@ 2016-05-04 20:51           ` Jens Axboe
  2016-05-04 21:30             ` Jens Axboe
  2016-05-05 19:36             ` Mark Nelson
  0 siblings, 2 replies; 16+ messages in thread
From: Jens Axboe @ 2016-05-04 20:51 UTC (permalink / raw)
  To: Mark Nelson, Paul Alcorn; +Cc: 'Mihkal Dunfjeld', fio, birger.lunstoeng

On 05/04/2016 11:43 AM, Mark Nelson wrote:
> On 05/04/2016 11:34 AM, Jens Axboe wrote:
>> On 05/04/2016 08:10 AM, Jens Axboe wrote:
>>> On 04/27/2016 02:38 PM, Jens Axboe wrote:
>>>> I'll try and see if I can reproduce this.
>>>
>>> So I took a look at your logged output. The way that fio logs is that it
>>> does it on IO completion. You have set 1s intervals, so it checks, when
>>> an IO completes, if we're due for logging the previous interval IOPS:
>>>
>>> spent = mtime_since(&td->iops_sample_time, t);
>>> if (spent < td->o.iops_avg_time)
>>>          return;
>>>
>>> If you are doing lots of IO, it matches pretty evenly. But for slow
>>> workloads, it will end up being skewed a bit. It won't be too hard to
>>> fix the drift, though, but it still won't make it clock precise at exact
>>> 1s intervals.
>>>
>>> We could move this part into the main thread, and out of the workers.
>>> Then we could make it more precise. Might not be a bad idea in general,
>>> since it'll move some of this code out of the fast path.
>>>
>>> I'll take a stab at it.
>>
>> Can you try the below patch?
>
> I'll also try to take a look at this as we've seen a fair amount of skew
> in the logs doing fio librbd upstream ceph testing.

That'd be great. More complete version below, fwiw.

> I actually wrote a
> first stab at a parser to aggregate logs from multiple fio processes
> with the assumption of non-uniform sampling:
>
> https://github.com/ceph/cbt/blob/master/tools/fiologpaser.py
>
> The idea is you have say 8 fio bw logs in a directory you can do:
>
> fiologparser.py -f *bw*
>
> which will return average values from the logs over user defined
> intervals by taking weighted scores from samples that overlap.  You can
> also look at per-interval averages, sums, or just a global average.
>
> various option flags:
>
> -i <interval time in ms>
> -d <divide results by d (say if you want a different unit of output)
> -f <show full interval data>
> -a <show average interval data across logs>
> -s <show sums for interval data across logs>

That looks useful. Might want to put that in the fio repo as well, if
you are interested.


diff --git a/HOWTO b/HOWTO
index 1f523d3..a74a112 100644
--- a/HOWTO
+++ b/HOWTO
@@ -1263,10 +1263,14 @@ exitall_on_error	When one job finishes in error, 
terminate the rest. The
  		default is to wait for each job to finish.

  bwavgtime=int	Average the calculated bandwidth over the given time. Value
-		is specified in milliseconds.
+		is specified in milliseconds. If the job also does bandwidth
+		logging through 'write_bw_log', then the minimum of this option
+		and 'log_avg_msec' will be used.  Default: 500ms.

  iopsavgtime=int	Average the calculated IOPS over the given time. Value
-		is specified in milliseconds.
+		is specified in milliseconds. If the job also does IOPS logging
+		through 'write_iops_log', then the minimum of this option and
+		'log_avg_msec' will be used.  Default: 500ms.

  create_serialize=bool	If true, serialize the file creating for the jobs.
  			This may be handy to avoid interleaving of data
diff --git a/backend.c b/backend.c
index 1723b8f..975517a 100644
--- a/backend.c
+++ b/backend.c
@@ -1722,7 +1722,7 @@ static void *thread_main(void *data)

  	fio_unpin_memory(td);

-	fio_writeout_logs(td);
+	td_writeout_logs(td, true);

  	iolog_compress_exit(td);
  	rate_submit_exit(td);
@@ -2337,41 +2337,64 @@ static void free_disk_util(void)

  static void *helper_thread_main(void *data)
  {
+	unsigned int msec_to_next_event, next_log;
  	struct sk_out *sk_out = data;
+	struct timeval tv, last_du;
  	int ret = 0;

  	sk_out_assign(sk_out);

+	gettimeofday(&tv, NULL);
+	memcpy(&last_du, &tv, sizeof(tv));
+
  	fio_mutex_up(startup_mutex);

-	while (!ret) {
-		uint64_t sec = DISK_UTIL_MSEC / 1000;
-		uint64_t nsec = (DISK_UTIL_MSEC % 1000) * 1000000;
+	msec_to_next_event = DISK_UTIL_MSEC;
+	while (!ret && !helper_exit) {
  		struct timespec ts;
-		struct timeval tv;
-
-		gettimeofday(&tv, NULL);
-		ts.tv_sec = tv.tv_sec + sec;
-		ts.tv_nsec = (tv.tv_usec * 1000) + nsec;
+		struct timeval now;

-		if (ts.tv_nsec >= 1000000000ULL) {
-			ts.tv_nsec -= 1000000000ULL;
-			ts.tv_sec++;
+		tv.tv_usec += msec_to_next_event * 1000;
+		if (tv.tv_usec >= 1000000) {
+			tv.tv_usec -= 1000000;
+			tv.tv_sec++;
  		}

+		ts.tv_sec = tv.tv_sec;
+		ts.tv_nsec = tv.tv_usec * 1000;
+
  		pthread_cond_timedwait(&helper_cond, &helper_lock, &ts);

-		ret = update_io_ticks();
+		gettimeofday(&now, NULL);
+
+		if (mtime_since(&last_du, &now) >= DISK_UTIL_MSEC) {
+			ret = update_io_ticks();
+			memcpy(&last_du, &now, sizeof(now));
+			msec_to_next_event = DISK_UTIL_MSEC;
+		} else {
+			msec_to_next_event = DISK_UTIL_MSEC -
+						mtime_since(&last_du, &now);
+		}

  		if (helper_do_stat) {
  			helper_do_stat = 0;
  			__show_running_run_stats();
  		}

+		next_log = calc_log_samples();
+		if (!next_log)
+			next_log = DISK_UTIL_MSEC;
+
+		msec_to_next_event = min(next_log, msec_to_next_event);
+		if (msec_to_next_event)
+			msec_to_next_event--;
+
  		if (!is_backend)
  			print_thread_status();
  	}

+	fio_writeout_logs(false);
+
  	sk_out_drop();
  	return NULL;
  }
diff --git a/fio.1 b/fio.1
index 73fdee6..e9bb72e 100644
--- a/fio.1
+++ b/fio.1
@@ -1180,12 +1180,14 @@ Terminate all jobs if one job finishes in error. 
  Default: wait for each job
  to finish.
  .TP
  .BI bwavgtime \fR=\fPint
-Average bandwidth calculations over the given time in milliseconds. 
Default:
-500ms.
+Average bandwidth calculations over the given time in milliseconds. If 
the job
+also does bandwidth logging through \fBwrite_bw_log\fR, then the minimum of
+this option and \fBlog_avg_msec\fR will be used.  Default: 500ms.
  .TP
  .BI iopsavgtime \fR=\fPint
-Average IOPS calculations over the given time in milliseconds.  Default:
-500ms.
+Average IOPS calculations over the given time in milliseconds. If the job
+also does IOPS logging through \fBwrite_iops_log\fR, then the minimum of
+this option and \fBlog_avg_msec\fR will be used.  Default: 500ms.
  .TP
  .BI create_serialize \fR=\fPbool
  If true, serialize file creation for the jobs.  Default: true.
diff --git a/init.c b/init.c
index 89e05c0..c579d5c 100644
--- a/init.c
+++ b/init.c
@@ -1416,6 +1416,11 @@ static int add_job(struct thread_data *td, const 
char *jobname, int job_add_num,
  		};
  		const char *suf;

+		if (fio_option_is_set(o, bw_avg_time))
+			p.avg_msec = min(o->log_avg_msec, o->bw_avg_time);
+		else
+			o->bw_avg_time = p.avg_msec;
+
  		if (p.log_gz_store)
  			suf = "log.fz";
  		else
@@ -1436,6 +1441,11 @@ static int add_job(struct thread_data *td, const 
char *jobname, int job_add_num,
  		};
  		const char *suf;

+		if (fio_option_is_set(o, iops_avg_time))
+			p.avg_msec = min(o->log_avg_msec, o->iops_avg_time);
+		else
+			o->iops_avg_time = p.avg_msec;
+
  		if (p.log_gz_store)
  			suf = "log.fz";
  		else
diff --git a/io_u.c b/io_u.c
index 6622bc0..a268c9b 100644
--- a/io_u.c
+++ b/io_u.c
@@ -1715,11 +1715,11 @@ static void account_io_completion(struct 
thread_data *td, struct io_u *io_u,
  		io_u_mark_latency(td, lusec);
  	}

-	if (!td->o.disable_bw)
-		add_bw_sample(td, idx, bytes, &icd->time);
+	if (!td->o.disable_bw && (per_unit_log(td->bw_log) || !td->o.bw_avg_time))
+		add_bw_sample(td, io_u, bytes, lusec);

-	if (no_reduce)
-		add_iops_sample(td, idx, bytes, &icd->time);
+	if (no_reduce && (per_unit_log(td->iops_log) || !td->o.iops_avg_time))
+		add_iops_sample(td, io_u, bytes, lusec);

  	if (td->ts.nr_block_infos && io_u->ddir == DDIR_TRIM) {
  		uint32_t *info = io_u_block_info(td, io_u);
diff --git a/iolog.c b/iolog.c
index feda9ed..21ab9c8 100644
--- a/iolog.c
+++ b/iolog.c
@@ -18,6 +18,7 @@
  #include "verify.h"
  #include "trim.h"
  #include "filelock.h"
+#include "smalloc.h"

  static const char iolog_ver2[] = "fio version 2 iolog";

@@ -574,14 +575,12 @@ void setup_log(struct io_log **log, struct 
log_params *p,
  {
  	struct io_log *l;

-	l = calloc(1, sizeof(*l));
+	l = smalloc(sizeof(*l));
  	l->nr_samples = 0;
-	l->max_samples = 1024;
  	l->log_type = p->log_type;
  	l->log_offset = p->log_offset;
  	l->log_gz = p->log_gz;
  	l->log_gz_store = p->log_gz_store;
-	l->log = malloc(l->max_samples * log_entry_sz(l));
  	l->avg_msec = p->avg_msec;
  	l->filename = strdup(filename);
  	l->td = p->td;
@@ -631,7 +630,7 @@ void free_log(struct io_log *log)
  {
  	free(log->log);
  	free(log->filename);
-	free(log);
+	sfree(log);
  }

  void flush_samples(FILE *f, void *samples, uint64_t sample_size)
@@ -1239,29 +1238,74 @@ static int __write_log(struct thread_data *td, 
struct io_log *log, int try)
  	return 0;
  }

-static int write_iops_log(struct thread_data *td, int try)
+static int write_iops_log(struct thread_data *td, int try, bool unit_log)
  {
-	return __write_log(td, td->iops_log, try);
+	int ret;
+
+	if (per_unit_log(td->iops_log) != unit_log)
+		return 0;
+
+	ret = __write_log(td, td->iops_log, try);
+	if (!ret)
+		td->iops_log = NULL;
+
+	return ret;
  }

-static int write_slat_log(struct thread_data *td, int try)
+static int write_slat_log(struct thread_data *td, int try, bool unit_log)
  {
-	return __write_log(td, td->slat_log, try);
+	int ret;
+
+	if (per_unit_log(td->slat_log) != unit_log)
+		return 0;
+
+	ret = __write_log(td, td->slat_log, try);
+	if (!ret)
+		td->slat_log = NULL;
+
+	return ret;
  }

-static int write_clat_log(struct thread_data *td, int try)
+static int write_clat_log(struct thread_data *td, int try, bool unit_log)
  {
-	return __write_log(td, td->clat_log, try);
+	int ret;
+
+	if (per_unit_log(td->clat_log) != unit_log)
+		return 0;
+
+	ret = __write_log(td, td->clat_log, try);
+	if (!ret)
+		td->clat_log = NULL;
+
+	return ret;
  }

-static int write_lat_log(struct thread_data *td, int try)
+static int write_lat_log(struct thread_data *td, int try, bool unit_log)
  {
-	return __write_log(td, td->lat_log, try);
+	int ret;
+
+	if (per_unit_log(td->lat_log) != unit_log)
+		return 0;
+
+	ret = __write_log(td, td->lat_log, try);
+	if (!ret)
+		td->lat_log = NULL;
+
+	return ret;
  }

-static int write_bandw_log(struct thread_data *td, int try)
+static int write_bandw_log(struct thread_data *td, int try, bool unit_log)
  {
-	return __write_log(td, td->bw_log, try);
+	int ret;
+
+	if (per_unit_log(td->bw_log) != unit_log)
+		return 0;
+
+	ret = __write_log(td, td->bw_log, try);
+	if (!ret)
+		td->bw_log = NULL;
+
+	return ret;
  }

  enum {
@@ -1276,7 +1320,7 @@ enum {

  struct log_type {
  	unsigned int mask;
-	int (*fn)(struct thread_data *, int);
+	int (*fn)(struct thread_data *, int, bool);
  };

  static struct log_type log_types[] = {
@@ -1302,7 +1346,7 @@ static struct log_type log_types[] = {
  	},
  };

-void fio_writeout_logs(struct thread_data *td)
+void td_writeout_logs(struct thread_data *td, bool unit_logs)
  {
  	unsigned int log_mask = 0;
  	unsigned int log_left = ALL_LOG_NR;
@@ -1310,7 +1354,7 @@ void fio_writeout_logs(struct thread_data *td)

  	old_state = td_bump_runstate(td, TD_FINISHING);

-	finalize_logs(td);
+	finalize_logs(td, unit_logs);

  	while (log_left) {
  		int prev_log_left = log_left;
@@ -1320,7 +1364,7 @@ void fio_writeout_logs(struct thread_data *td)
  			int ret;

  			if (!(log_mask & lt->mask)) {
-				ret = lt->fn(td, log_left != 1);
+				ret = lt->fn(td, log_left != 1, unit_logs);
  				if (!ret) {
  					log_left--;
  					log_mask |= lt->mask;
@@ -1334,3 +1378,12 @@ void fio_writeout_logs(struct thread_data *td)

  	td_restore_runstate(td, old_state);
  }
+
+void fio_writeout_logs(bool unit_logs)
+{
+	struct thread_data *td;
+	int i;
+
+	for_each_td(td, i)
+		td_writeout_logs(td, unit_logs);
+}
diff --git a/iolog.h b/iolog.h
index 297daf5..739a7c8 100644
--- a/iolog.h
+++ b/iolog.h
@@ -41,6 +41,8 @@ enum {
  	IO_LOG_TYPE_IOPS,
  };

+#define DEF_LOG_ENTRIES		1024
+
  /*
   * Dynamically growing data sample log
   */
@@ -205,12 +207,18 @@ struct log_params {
  	int log_compress;
  };

-extern void finalize_logs(struct thread_data *td);
+static inline bool per_unit_log(struct io_log *log)
+{
+	return log && !log->avg_msec;
+}
+
+extern void finalize_logs(struct thread_data *td, bool);
  extern void setup_log(struct io_log **, struct log_params *, const 
char *);
  extern void flush_log(struct io_log *, int);
  extern void flush_samples(FILE *, void *, uint64_t);
  extern void free_log(struct io_log *);
-extern void fio_writeout_logs(struct thread_data *);
+extern void fio_writeout_logs(bool);
+extern void td_writeout_logs(struct thread_data *, bool);
  extern int iolog_flush(struct io_log *, int);

  static inline void init_ipo(struct io_piece *ipo)
diff --git a/libfio.c b/libfio.c
index c626d15..b17f148 100644
--- a/libfio.c
+++ b/libfio.c
@@ -146,6 +146,8 @@ void reset_all_stats(struct thread_data *td)
  	fio_gettime(&tv, NULL);
  	memcpy(&td->epoch, &tv, sizeof(tv));
  	memcpy(&td->start, &tv, sizeof(tv));
+	memcpy(&td->iops_sample_time, &tv, sizeof(tv));
+	memcpy(&td->bw_sample_time, &tv, sizeof(tv));

  	lat_target_reset(td);
  	clear_rusage_stat(td);
diff --git a/stat.c b/stat.c
index 6d8d4d0..49ce2e6 100644
--- a/stat.c
+++ b/stat.c
@@ -1862,10 +1862,15 @@ static void __add_log_sample(struct io_log 
*iolog, unsigned long val,
  		iolog->avg_last = t;

  	if (iolog->nr_samples == iolog->max_samples) {
-		size_t new_size;
+		size_t new_size, new_samples;
  		void *new_log;

-		new_size = 2 * iolog->max_samples * log_entry_sz(iolog);
+		if (!iolog->max_samples)
+			new_samples = DEF_LOG_ENTRIES;
+		else
+			new_samples = iolog->max_samples * 2;
+
+		new_size = new_samples * log_entry_sz(iolog);

  		if (iolog->log_gz && (new_size > iolog->log_gz)) {
  			if (iolog_flush(iolog, 0)) {
@@ -1882,7 +1887,7 @@ static void __add_log_sample(struct io_log *iolog, 
unsigned long val,
  				return;
  			}
  			iolog->log = new_log;
-			iolog->max_samples <<= 1;
+			iolog->max_samples = new_samples;
  		}
  	}

@@ -2013,21 +2018,21 @@ static void add_log_sample(struct thread_data 
*td, struct io_log *iolog,
  	iolog->avg_last = elapsed;
  }

-void finalize_logs(struct thread_data *td)
+void finalize_logs(struct thread_data *td, bool unit_logs)
  {
  	unsigned long elapsed;

  	elapsed = mtime_since_now(&td->epoch);

-	if (td->clat_log)
+	if (td->clat_log && (unit_logs == per_unit_log(td->clat_log)))
  		_add_stat_to_log(td->clat_log, elapsed, td->o.log_max != 0);
-	if (td->slat_log)
+	if (td->slat_log && (unit_logs == per_unit_log(td->slat_log)))
  		_add_stat_to_log(td->slat_log, elapsed, td->o.log_max != 0);
-	if (td->lat_log)
+	if (td->lat_log && (unit_logs == per_unit_log(td->lat_log)))
  		_add_stat_to_log(td->lat_log, elapsed, td->o.log_max != 0);
-	if (td->bw_log)
+	if (td->bw_log && (unit_logs == per_unit_log(td->bw_log)))
  		_add_stat_to_log(td->bw_log, elapsed, td->o.log_max != 0);
-	if (td->iops_log)
+	if (td->iops_log && (unit_logs == per_unit_log(td->iops_log)))
  		_add_stat_to_log(td->iops_log, elapsed, td->o.log_max != 0);
  }

@@ -2108,18 +2113,48 @@ void add_lat_sample(struct thread_data *td, enum 
fio_ddir ddir,
  	td_io_u_unlock(td);
  }

-void add_bw_sample(struct thread_data *td, enum fio_ddir ddir, unsigned 
int bs,
-		   struct timeval *t)
+void add_bw_sample(struct thread_data *td, struct io_u *io_u,
+		   unsigned int bytes, unsigned long spent)
  {
  	struct thread_stat *ts = &td->ts;
-	unsigned long spent, rate;
+	unsigned long rate;
+	uint64_t delta;

-	if (!ddir_rw(ddir))
+	if (!ddir_rw(io_u->ddir))
  		return;

+	td_io_u_lock(td);
+
+	delta = td->this_io_bytes[io_u->ddir] - td->stat_io_bytes[io_u->ddir];
+	if (!delta)
+		return;
+
+	if (spent)
+		rate = delta * 1000 / spent / 1024;
+	else
+		rate = 0;
+
+	add_stat_sample(&ts->bw_stat[io_u->ddir], rate);
+
+	if (td->bw_log)
+		add_log_sample(td, td->bw_log, rate, io_u->ddir, bytes, io_u->offset);
+
+	td->stat_io_bytes[io_u->ddir] = td->this_io_bytes[io_u->ddir];
+	td_io_u_unlock(td);
+}
+
+uint64_t add_bw_samples(struct thread_data *td, struct timeval *t)
+{
+	struct thread_stat *ts = &td->ts;
+	unsigned long spent, rate;
+	enum fio_ddir ddir;
+
+	if (per_unit_log(td->bw_log) || !td->o.bw_avg_time)
+		return 0;
+
  	spent = mtime_since(&td->bw_sample_time, t);
  	if (spent < td->o.bw_avg_time)
-		return;
+		return td->o.bw_avg_time - spent;

  	td_io_u_lock(td);

@@ -2141,27 +2176,59 @@ void add_bw_sample(struct thread_data *td, enum 
fio_ddir ddir, unsigned int bs,
  		add_stat_sample(&ts->bw_stat[ddir], rate);

  		if (td->bw_log)
-			add_log_sample(td, td->bw_log, rate, ddir, bs, 0);
+			add_log_sample(td, td->bw_log, rate, ddir, 0, 0);

  		td->stat_io_bytes[ddir] = td->this_io_bytes[ddir];
  	}

-	fio_gettime(&td->bw_sample_time, NULL);
+	memcpy(&td->bw_sample_time, t, sizeof(*t));
  	td_io_u_unlock(td);
+
+	return td->o.bw_avg_time;
  }

-void add_iops_sample(struct thread_data *td, enum fio_ddir ddir, 
unsigned int bs,
-		     struct timeval *t)
+void add_iops_sample(struct thread_data *td, struct io_u *io_u,
+		     unsigned int bytes, unsigned long spent)
  {
  	struct thread_stat *ts = &td->ts;
-	unsigned long spent, iops;
+	unsigned long iops;
+	uint64_t delta;

-	if (!ddir_rw(ddir))
+	if (!ddir_rw(io_u->ddir))
  		return;

+	td_io_u_lock(td);
+
+	delta = td->this_io_blocks[io_u->ddir] - td->stat_io_blocks[io_u->ddir];
+	if (!delta)
+		return;
+
+	if (spent)
+		iops = (delta * 1000) / spent;
+	else
+		iops = 0;
+
+	add_stat_sample(&ts->iops_stat[io_u->ddir], iops);
+
+	if (td->iops_log)
+		add_log_sample(td, td->iops_log, iops, io_u->ddir, bytes, io_u->offset);
+
+	td->stat_io_blocks[io_u->ddir] = td->this_io_blocks[io_u->ddir];
+	td_io_u_unlock(td);
+}
+
+uint64_t add_iops_samples(struct thread_data *td, struct timeval *t)
+{
+	struct thread_stat *ts = &td->ts;
+	unsigned long spent, iops;
+	enum fio_ddir ddir;
+
+	if (per_unit_log(td->iops_log) || !td->o.iops_avg_time)
+		return 0;
+
  	spent = mtime_since(&td->iops_sample_time, t);
  	if (spent < td->o.iops_avg_time)
-		return;
+		return td->o.iops_avg_time - spent;

  	td_io_u_lock(td);

@@ -2183,13 +2250,48 @@ void add_iops_sample(struct thread_data *td, 
enum fio_ddir ddir, unsigned int bs
  		add_stat_sample(&ts->iops_stat[ddir], iops);

  		if (td->iops_log)
-			add_log_sample(td, td->iops_log, iops, ddir, bs, 0);
+			add_log_sample(td, td->iops_log, iops, ddir, 0, 0);

  		td->stat_io_blocks[ddir] = td->this_io_blocks[ddir];
  	}

-	fio_gettime(&td->iops_sample_time, NULL);
+	memcpy(&td->iops_sample_time, t, sizeof(*t));
  	td_io_u_unlock(td);
+
+	return td->o.iops_avg_time;
+}
+
+/*
+ * Returns msecs to next event
+ */
+unsigned int calc_log_samples(void)
+{
+	struct thread_data *td;
+	struct timeval now;
+	unsigned int next = ~0U, tmp;
+	int i;
+
+	fio_gettime(&now, NULL);
+
+	for_each_td(td, i) {
+		if (!ramp_time_over(td) ||
+		    !(td->runstate == TD_RUNNING || td->runstate == TD_VERIFYING)) {
+			next = min(td->o.iops_avg_time, td->o.bw_avg_time);
+			continue;
+		}
+		if (!per_unit_log(td->bw_log)) {
+			tmp = add_bw_samples(td, &now);
+			if (tmp < next)
+				next = tmp;
+		}
+		if (!per_unit_log(td->iops_log)) {
+			tmp = add_iops_samples(td, &now);
+			if (tmp < next)
+				next = tmp;
+		}
+	}
+
+	return next == ~0U ? 0 : next;
  }

  void stat_init(void)
diff --git a/stat.h b/stat.h
index 9c3f192..d95a2a7 100644
--- a/stat.h
+++ b/stat.h
@@ -276,11 +276,12 @@ extern void add_clat_sample(struct thread_data *, 
enum fio_ddir, unsigned long,
  				unsigned int, uint64_t);
  extern void add_slat_sample(struct thread_data *, enum fio_ddir, 
unsigned long,
  				unsigned int, uint64_t);
-extern void add_bw_sample(struct thread_data *, enum fio_ddir, unsigned 
int,
-				struct timeval *);
-extern void add_iops_sample(struct thread_data *, enum fio_ddir, 
unsigned int,
-				struct timeval *);
  extern void add_agg_sample(unsigned long, enum fio_ddir, unsigned int);
+extern void add_iops_sample(struct thread_data *, struct io_u *,
+				unsigned int, unsigned long);
+extern void add_bw_sample(struct thread_data *, struct io_u *,
+				unsigned int, unsigned long);
+extern unsigned int calc_log_samples(void);

  extern struct io_log *agg_io_log[DDIR_RWDIR_CNT];
  extern int write_bw_log;

-- 
Jens Axboe



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: Missing log entries and other questions
  2016-05-04 20:51           ` Jens Axboe
@ 2016-05-04 21:30             ` Jens Axboe
  2016-05-05 19:33               ` Mark Nelson
  2016-05-05 19:36             ` Mark Nelson
  1 sibling, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2016-05-04 21:30 UTC (permalink / raw)
  To: Mark Nelson, Paul Alcorn; +Cc: 'Mihkal Dunfjeld', fio, birger.lunstoeng

On 05/04/2016 02:51 PM, Jens Axboe wrote:
> On 05/04/2016 11:43 AM, Mark Nelson wrote:
>> On 05/04/2016 11:34 AM, Jens Axboe wrote:
>>> On 05/04/2016 08:10 AM, Jens Axboe wrote:
>>>> On 04/27/2016 02:38 PM, Jens Axboe wrote:
>>>>> I'll try and see if I can reproduce this.
>>>>
>>>> So I took a look at your logged output. The way that fio logs is
>>>> that it
>>>> does it on IO completion. You have set 1s intervals, so it checks, when
>>>> an IO completes, if we're due for logging the previous interval IOPS:
>>>>
>>>> spent = mtime_since(&td->iops_sample_time, t);
>>>> if (spent < td->o.iops_avg_time)
>>>>          return;
>>>>
>>>> If you are doing lots of IO, it matches pretty evenly. But for slow
>>>> workloads, it will end up being skewed a bit. It won't be too hard to
>>>> fix the drift, though, but it still won't make it clock precise at
>>>> exact
>>>> 1s intervals.
>>>>
>>>> We could move this part into the main thread, and out of the workers.
>>>> Then we could make it more precise. Might not be a bad idea in general,
>>>> since it'll move some of this code out of the fast path.
>>>>
>>>> I'll take a stab at it.
>>>
>>> Can you try the below patch?
>>
>> I'll also try to take a look at this as we've seen a fair amount of skew
>> in the logs doing fio librbd upstream ceph testing.
>
> That'd be great. More complete version below, fwiw.

And hopefully final... Little corner cases fixed. Seems to pass my
testing, would be great if others could test too. Any job that currently
uses any of the loggings would do.


diff --git a/HOWTO b/HOWTO
index 1f523d3..a74a112 100644
--- a/HOWTO
+++ b/HOWTO
@@ -1263,10 +1263,14 @@ exitall_on_error	When one job finishes in error, 
terminate the rest. The
  		default is to wait for each job to finish.

  bwavgtime=int	Average the calculated bandwidth over the given time. Value
-		is specified in milliseconds.
+		is specified in milliseconds. If the job also does bandwidth
+		logging through 'write_bw_log', then the minimum of this option
+		and 'log_avg_msec' will be used.  Default: 500ms.

  iopsavgtime=int	Average the calculated IOPS over the given time. Value
-		is specified in milliseconds.
+		is specified in milliseconds. If the job also does IOPS logging
+		through 'write_iops_log', then the minimum of this option and
+		'log_avg_msec' will be used.  Default: 500ms.

  create_serialize=bool	If true, serialize the file creating for the jobs.
  			This may be handy to avoid interleaving of data
diff --git a/backend.c b/backend.c
index 1723b8f..975517a 100644
--- a/backend.c
+++ b/backend.c
@@ -1722,7 +1722,7 @@ static void *thread_main(void *data)

  	fio_unpin_memory(td);

-	fio_writeout_logs(td);
+	td_writeout_logs(td, true);

  	iolog_compress_exit(td);
  	rate_submit_exit(td);
@@ -2337,41 +2337,64 @@ static void free_disk_util(void)

  static void *helper_thread_main(void *data)
  {
+	unsigned int msec_to_next_event, next_log;
  	struct sk_out *sk_out = data;
+	struct timeval tv, last_du;
  	int ret = 0;

  	sk_out_assign(sk_out);

+	gettimeofday(&tv, NULL);
+	memcpy(&last_du, &tv, sizeof(tv));
+
  	fio_mutex_up(startup_mutex);

-	while (!ret) {
-		uint64_t sec = DISK_UTIL_MSEC / 1000;
-		uint64_t nsec = (DISK_UTIL_MSEC % 1000) * 1000000;
+	msec_to_next_event = DISK_UTIL_MSEC;
+	while (!ret && !helper_exit) {
  		struct timespec ts;
-		struct timeval tv;
-
-		gettimeofday(&tv, NULL);
-		ts.tv_sec = tv.tv_sec + sec;
-		ts.tv_nsec = (tv.tv_usec * 1000) + nsec;
+		struct timeval now;

-		if (ts.tv_nsec >= 1000000000ULL) {
-			ts.tv_nsec -= 1000000000ULL;
-			ts.tv_sec++;
+		tv.tv_usec += msec_to_next_event * 1000;
+		if (tv.tv_usec >= 1000000) {
+			tv.tv_usec -= 1000000;
+			tv.tv_sec++;
  		}

+		ts.tv_sec = tv.tv_sec;
+		ts.tv_nsec = tv.tv_usec * 1000;
+
  		pthread_cond_timedwait(&helper_cond, &helper_lock, &ts);

-		ret = update_io_ticks();
+		gettimeofday(&now, NULL);
+
+		if (mtime_since(&last_du, &now) >= DISK_UTIL_MSEC) {
+			ret = update_io_ticks();
+			memcpy(&last_du, &now, sizeof(now));
+			msec_to_next_event = DISK_UTIL_MSEC;
+		} else {
+			msec_to_next_event = DISK_UTIL_MSEC -
+						mtime_since(&last_du, &now);
+		}

  		if (helper_do_stat) {
  			helper_do_stat = 0;
  			__show_running_run_stats();
  		}

+		next_log = calc_log_samples();
+		if (!next_log)
+			next_log = DISK_UTIL_MSEC;
+
+		msec_to_next_event = min(next_log, msec_to_next_event);
+		if (msec_to_next_event)
+			msec_to_next_event--;
+
  		if (!is_backend)
  			print_thread_status();
  	}

+	fio_writeout_logs(false);
+
  	sk_out_drop();
  	return NULL;
  }
diff --git a/fio.1 b/fio.1
index 73fdee6..e9bb72e 100644
--- a/fio.1
+++ b/fio.1
@@ -1180,12 +1180,14 @@ Terminate all jobs if one job finishes in error. 
  Default: wait for each job
  to finish.
  .TP
  .BI bwavgtime \fR=\fPint
-Average bandwidth calculations over the given time in milliseconds. 
Default:
-500ms.
+Average bandwidth calculations over the given time in milliseconds. If 
the job
+also does bandwidth logging through \fBwrite_bw_log\fR, then the minimum of
+this option and \fBlog_avg_msec\fR will be used.  Default: 500ms.
  .TP
  .BI iopsavgtime \fR=\fPint
-Average IOPS calculations over the given time in milliseconds.  Default:
-500ms.
+Average IOPS calculations over the given time in milliseconds. If the job
+also does IOPS logging through \fBwrite_iops_log\fR, then the minimum of
+this option and \fBlog_avg_msec\fR will be used.  Default: 500ms.
  .TP
  .BI create_serialize \fR=\fPbool
  If true, serialize file creation for the jobs.  Default: true.
diff --git a/init.c b/init.c
index 89e05c0..c579d5c 100644
--- a/init.c
+++ b/init.c
@@ -1416,6 +1416,11 @@ static int add_job(struct thread_data *td, const 
char *jobname, int job_add_num,
  		};
  		const char *suf;

+		if (fio_option_is_set(o, bw_avg_time))
+			p.avg_msec = min(o->log_avg_msec, o->bw_avg_time);
+		else
+			o->bw_avg_time = p.avg_msec;
+
  		if (p.log_gz_store)
  			suf = "log.fz";
  		else
@@ -1436,6 +1441,11 @@ static int add_job(struct thread_data *td, const 
char *jobname, int job_add_num,
  		};
  		const char *suf;

+		if (fio_option_is_set(o, iops_avg_time))
+			p.avg_msec = min(o->log_avg_msec, o->iops_avg_time);
+		else
+			o->iops_avg_time = p.avg_msec;
+
  		if (p.log_gz_store)
  			suf = "log.fz";
  		else
diff --git a/io_u.c b/io_u.c
index 6622bc0..8a5245a 100644
--- a/io_u.c
+++ b/io_u.c
@@ -1715,11 +1715,11 @@ static void account_io_completion(struct 
thread_data *td, struct io_u *io_u,
  		io_u_mark_latency(td, lusec);
  	}

-	if (!td->o.disable_bw)
-		add_bw_sample(td, idx, bytes, &icd->time);
+	if (!td->o.disable_bw && per_unit_log(td->bw_log))
+		add_bw_sample(td, io_u, bytes, lusec);

-	if (no_reduce)
-		add_iops_sample(td, idx, bytes, &icd->time);
+	if (no_reduce && per_unit_log(td->iops_log))
+		add_iops_sample(td, io_u, bytes);

  	if (td->ts.nr_block_infos && io_u->ddir == DDIR_TRIM) {
  		uint32_t *info = io_u_block_info(td, io_u);
diff --git a/iolog.c b/iolog.c
index feda9ed..f1edf84 100644
--- a/iolog.c
+++ b/iolog.c
@@ -18,6 +18,7 @@
  #include "verify.h"
  #include "trim.h"
  #include "filelock.h"
+#include "smalloc.h"

  static const char iolog_ver2[] = "fio version 2 iolog";

@@ -574,14 +575,12 @@ void setup_log(struct io_log **log, struct 
log_params *p,
  {
  	struct io_log *l;

-	l = calloc(1, sizeof(*l));
+	l = smalloc(sizeof(*l));
  	l->nr_samples = 0;
-	l->max_samples = 1024;
  	l->log_type = p->log_type;
  	l->log_offset = p->log_offset;
  	l->log_gz = p->log_gz;
  	l->log_gz_store = p->log_gz_store;
-	l->log = malloc(l->max_samples * log_entry_sz(l));
  	l->avg_msec = p->avg_msec;
  	l->filename = strdup(filename);
  	l->td = p->td;
@@ -631,7 +630,7 @@ void free_log(struct io_log *log)
  {
  	free(log->log);
  	free(log->filename);
-	free(log);
+	sfree(log);
  }

  void flush_samples(FILE *f, void *samples, uint64_t sample_size)
@@ -1239,29 +1238,74 @@ static int __write_log(struct thread_data *td, 
struct io_log *log, int try)
  	return 0;
  }

-static int write_iops_log(struct thread_data *td, int try)
+static int write_iops_log(struct thread_data *td, int try, bool unit_log)
  {
-	return __write_log(td, td->iops_log, try);
+	int ret;
+
+	if (per_unit_log(td->iops_log) != unit_log)
+		return 0;
+
+	ret = __write_log(td, td->iops_log, try);
+	if (!ret)
+		td->iops_log = NULL;
+
+	return ret;
  }

-static int write_slat_log(struct thread_data *td, int try)
+static int write_slat_log(struct thread_data *td, int try, bool unit_log)
  {
-	return __write_log(td, td->slat_log, try);
+	int ret;
+
+	if (!unit_log)
+		return 0;
+
+	ret = __write_log(td, td->slat_log, try);
+	if (!ret)
+		td->slat_log = NULL;
+
+	return ret;
  }

-static int write_clat_log(struct thread_data *td, int try)
+static int write_clat_log(struct thread_data *td, int try, bool unit_log)
  {
-	return __write_log(td, td->clat_log, try);
+	int ret;
+
+	if (!unit_log)
+		return 0;
+
+	ret = __write_log(td, td->clat_log, try);
+	if (!ret)
+		td->clat_log = NULL;
+
+	return ret;
  }

-static int write_lat_log(struct thread_data *td, int try)
+static int write_lat_log(struct thread_data *td, int try, bool unit_log)
  {
-	return __write_log(td, td->lat_log, try);
+	int ret;
+
+	if (!unit_log)
+		return 0;
+
+	ret = __write_log(td, td->lat_log, try);
+	if (!ret)
+		td->lat_log = NULL;
+
+	return ret;
  }

-static int write_bandw_log(struct thread_data *td, int try)
+static int write_bandw_log(struct thread_data *td, int try, bool unit_log)
  {
-	return __write_log(td, td->bw_log, try);
+	int ret;
+
+	if (per_unit_log(td->bw_log) != unit_log)
+		return 0;
+
+	ret = __write_log(td, td->bw_log, try);
+	if (!ret)
+		td->bw_log = NULL;
+
+	return ret;
  }

  enum {
@@ -1276,7 +1320,7 @@ enum {

  struct log_type {
  	unsigned int mask;
-	int (*fn)(struct thread_data *, int);
+	int (*fn)(struct thread_data *, int, bool);
  };

  static struct log_type log_types[] = {
@@ -1302,7 +1346,7 @@ static struct log_type log_types[] = {
  	},
  };

-void fio_writeout_logs(struct thread_data *td)
+void td_writeout_logs(struct thread_data *td, bool unit_logs)
  {
  	unsigned int log_mask = 0;
  	unsigned int log_left = ALL_LOG_NR;
@@ -1310,7 +1354,7 @@ void fio_writeout_logs(struct thread_data *td)

  	old_state = td_bump_runstate(td, TD_FINISHING);

-	finalize_logs(td);
+	finalize_logs(td, unit_logs);

  	while (log_left) {
  		int prev_log_left = log_left;
@@ -1320,7 +1364,7 @@ void fio_writeout_logs(struct thread_data *td)
  			int ret;

  			if (!(log_mask & lt->mask)) {
-				ret = lt->fn(td, log_left != 1);
+				ret = lt->fn(td, log_left != 1, unit_logs);
  				if (!ret) {
  					log_left--;
  					log_mask |= lt->mask;
@@ -1334,3 +1378,12 @@ void fio_writeout_logs(struct thread_data *td)

  	td_restore_runstate(td, old_state);
  }
+
+void fio_writeout_logs(bool unit_logs)
+{
+	struct thread_data *td;
+	int i;
+
+	for_each_td(td, i)
+		td_writeout_logs(td, unit_logs);
+}
diff --git a/iolog.h b/iolog.h
index 297daf5..739a7c8 100644
--- a/iolog.h
+++ b/iolog.h
@@ -41,6 +41,8 @@ enum {
  	IO_LOG_TYPE_IOPS,
  };

+#define DEF_LOG_ENTRIES		1024
+
  /*
   * Dynamically growing data sample log
   */
@@ -205,12 +207,18 @@ struct log_params {
  	int log_compress;
  };

-extern void finalize_logs(struct thread_data *td);
+static inline bool per_unit_log(struct io_log *log)
+{
+	return log && !log->avg_msec;
+}
+
+extern void finalize_logs(struct thread_data *td, bool);
  extern void setup_log(struct io_log **, struct log_params *, const 
char *);
  extern void flush_log(struct io_log *, int);
  extern void flush_samples(FILE *, void *, uint64_t);
  extern void free_log(struct io_log *);
-extern void fio_writeout_logs(struct thread_data *);
+extern void fio_writeout_logs(bool);
+extern void td_writeout_logs(struct thread_data *, bool);
  extern int iolog_flush(struct io_log *, int);

  static inline void init_ipo(struct io_piece *ipo)
diff --git a/libfio.c b/libfio.c
index c626d15..b17f148 100644
--- a/libfio.c
+++ b/libfio.c
@@ -146,6 +146,8 @@ void reset_all_stats(struct thread_data *td)
  	fio_gettime(&tv, NULL);
  	memcpy(&td->epoch, &tv, sizeof(tv));
  	memcpy(&td->start, &tv, sizeof(tv));
+	memcpy(&td->iops_sample_time, &tv, sizeof(tv));
+	memcpy(&td->bw_sample_time, &tv, sizeof(tv));

  	lat_target_reset(td);
  	clear_rusage_stat(td);
diff --git a/stat.c b/stat.c
index 6d8d4d0..df73ed7 100644
--- a/stat.c
+++ b/stat.c
@@ -1862,10 +1862,15 @@ static void __add_log_sample(struct io_log 
*iolog, unsigned long val,
  		iolog->avg_last = t;

  	if (iolog->nr_samples == iolog->max_samples) {
-		size_t new_size;
+		size_t new_size, new_samples;
  		void *new_log;

-		new_size = 2 * iolog->max_samples * log_entry_sz(iolog);
+		if (!iolog->max_samples)
+			new_samples = DEF_LOG_ENTRIES;
+		else
+			new_samples = iolog->max_samples * 2;
+
+		new_size = new_samples * log_entry_sz(iolog);

  		if (iolog->log_gz && (new_size > iolog->log_gz)) {
  			if (iolog_flush(iolog, 0)) {
@@ -1882,7 +1887,7 @@ static void __add_log_sample(struct io_log *iolog, 
unsigned long val,
  				return;
  			}
  			iolog->log = new_log;
-			iolog->max_samples <<= 1;
+			iolog->max_samples = new_samples;
  		}
  	}

@@ -2013,21 +2018,21 @@ static void add_log_sample(struct thread_data 
*td, struct io_log *iolog,
  	iolog->avg_last = elapsed;
  }

-void finalize_logs(struct thread_data *td)
+void finalize_logs(struct thread_data *td, bool unit_logs)
  {
  	unsigned long elapsed;

  	elapsed = mtime_since_now(&td->epoch);

-	if (td->clat_log)
+	if (td->clat_log && unit_logs)
  		_add_stat_to_log(td->clat_log, elapsed, td->o.log_max != 0);
-	if (td->slat_log)
+	if (td->slat_log && unit_logs)
  		_add_stat_to_log(td->slat_log, elapsed, td->o.log_max != 0);
-	if (td->lat_log)
+	if (td->lat_log && unit_logs)
  		_add_stat_to_log(td->lat_log, elapsed, td->o.log_max != 0);
-	if (td->bw_log)
+	if (td->bw_log && (unit_logs == per_unit_log(td->bw_log)))
  		_add_stat_to_log(td->bw_log, elapsed, td->o.log_max != 0);
-	if (td->iops_log)
+	if (td->iops_log && (unit_logs == per_unit_log(td->iops_log)))
  		_add_stat_to_log(td->iops_log, elapsed, td->o.log_max != 0);
  }

@@ -2108,18 +2113,43 @@ void add_lat_sample(struct thread_data *td, enum 
fio_ddir ddir,
  	td_io_u_unlock(td);
  }

-void add_bw_sample(struct thread_data *td, enum fio_ddir ddir, unsigned 
int bs,
-		   struct timeval *t)
+void add_bw_sample(struct thread_data *td, struct io_u *io_u,
+		   unsigned int bytes, unsigned long spent)
  {
  	struct thread_stat *ts = &td->ts;
-	unsigned long spent, rate;
+	unsigned long rate;

-	if (!ddir_rw(ddir))
+	if (!ddir_rw(io_u->ddir))
  		return;

+	if (spent)
+		rate = bytes * 1000 / spent;
+	else
+		rate = 0;
+
+	td_io_u_lock(td);
+
+	add_stat_sample(&ts->bw_stat[io_u->ddir], rate);
+
+	if (td->bw_log)
+		add_log_sample(td, td->bw_log, rate, io_u->ddir, bytes, io_u->offset);
+
+	td->stat_io_bytes[io_u->ddir] = td->this_io_bytes[io_u->ddir];
+	td_io_u_unlock(td);
+}
+
+uint64_t add_bw_samples(struct thread_data *td, struct timeval *t)
+{
+	struct thread_stat *ts = &td->ts;
+	unsigned long spent, rate;
+	enum fio_ddir ddir;
+
+	if (per_unit_log(td->bw_log))
+		return 0;
+
  	spent = mtime_since(&td->bw_sample_time, t);
  	if (spent < td->o.bw_avg_time)
-		return;
+		return td->o.bw_avg_time - spent;

  	td_io_u_lock(td);

@@ -2141,27 +2171,48 @@ void add_bw_sample(struct thread_data *td, enum 
fio_ddir ddir, unsigned int bs,
  		add_stat_sample(&ts->bw_stat[ddir], rate);

  		if (td->bw_log)
-			add_log_sample(td, td->bw_log, rate, ddir, bs, 0);
+			add_log_sample(td, td->bw_log, rate, ddir, 0, 0);

  		td->stat_io_bytes[ddir] = td->this_io_bytes[ddir];
  	}

-	fio_gettime(&td->bw_sample_time, NULL);
+	memcpy(&td->bw_sample_time, t, sizeof(*t));
  	td_io_u_unlock(td);
+
+	return td->o.bw_avg_time;
  }

-void add_iops_sample(struct thread_data *td, enum fio_ddir ddir, 
unsigned int bs,
-		     struct timeval *t)
+void add_iops_sample(struct thread_data *td, struct io_u *io_u,
+		     unsigned int bytes)
  {
  	struct thread_stat *ts = &td->ts;
-	unsigned long spent, iops;

-	if (!ddir_rw(ddir))
+	if (!ddir_rw(io_u->ddir))
  		return;

+	td_io_u_lock(td);
+
+	add_stat_sample(&ts->iops_stat[io_u->ddir], 1);
+
+	if (td->iops_log)
+		add_log_sample(td, td->iops_log, 1, io_u->ddir, bytes, io_u->offset);
+
+	td->stat_io_blocks[io_u->ddir] = td->this_io_blocks[io_u->ddir];
+	td_io_u_unlock(td);
+}
+
+uint64_t add_iops_samples(struct thread_data *td, struct timeval *t)
+{
+	struct thread_stat *ts = &td->ts;
+	unsigned long spent, iops;
+	enum fio_ddir ddir;
+
+	if (per_unit_log(td->iops_log))
+		return 0;
+
  	spent = mtime_since(&td->iops_sample_time, t);
  	if (spent < td->o.iops_avg_time)
-		return;
+		return td->o.iops_avg_time - spent;

  	td_io_u_lock(td);

@@ -2183,13 +2234,48 @@ void add_iops_sample(struct thread_data *td, 
enum fio_ddir ddir, unsigned int bs
  		add_stat_sample(&ts->iops_stat[ddir], iops);

  		if (td->iops_log)
-			add_log_sample(td, td->iops_log, iops, ddir, bs, 0);
+			add_log_sample(td, td->iops_log, iops, ddir, 0, 0);

  		td->stat_io_blocks[ddir] = td->this_io_blocks[ddir];
  	}

-	fio_gettime(&td->iops_sample_time, NULL);
+	memcpy(&td->iops_sample_time, t, sizeof(*t));
  	td_io_u_unlock(td);
+
+	return td->o.iops_avg_time;
+}
+
+/*
+ * Returns msecs to next event
+ */
+unsigned int calc_log_samples(void)
+{
+	struct thread_data *td;
+	struct timeval now;
+	unsigned int next = ~0U, tmp;
+	int i;
+
+	fio_gettime(&now, NULL);
+
+	for_each_td(td, i) {
+		if (!ramp_time_over(td) ||
+		    !(td->runstate == TD_RUNNING || td->runstate == TD_VERIFYING)) {
+			next = min(td->o.iops_avg_time, td->o.bw_avg_time);
+			continue;
+		}
+		if (!per_unit_log(td->bw_log)) {
+			tmp = add_bw_samples(td, &now);
+			if (tmp < next)
+				next = tmp;
+		}
+		if (!per_unit_log(td->iops_log)) {
+			tmp = add_iops_samples(td, &now);
+			if (tmp < next)
+				next = tmp;
+		}
+	}
+
+	return next == ~0U ? 0 : next;
  }

  void stat_init(void)
diff --git a/stat.h b/stat.h
index 9c3f192..7c2b16e 100644
--- a/stat.h
+++ b/stat.h
@@ -276,11 +276,12 @@ extern void add_clat_sample(struct thread_data *, 
enum fio_ddir, unsigned long,
  				unsigned int, uint64_t);
  extern void add_slat_sample(struct thread_data *, enum fio_ddir, 
unsigned long,
  				unsigned int, uint64_t);
-extern void add_bw_sample(struct thread_data *, enum fio_ddir, unsigned 
int,
-				struct timeval *);
-extern void add_iops_sample(struct thread_data *, enum fio_ddir, 
unsigned int,
-				struct timeval *);
  extern void add_agg_sample(unsigned long, enum fio_ddir, unsigned int);
+extern void add_iops_sample(struct thread_data *, struct io_u *,
+				unsigned int);
+extern void add_bw_sample(struct thread_data *, struct io_u *,
+				unsigned int, unsigned long);
+extern unsigned int calc_log_samples(void);

  extern struct io_log *agg_io_log[DDIR_RWDIR_CNT];
  extern int write_bw_log;


-- 
Jens Axboe



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: Missing log entries and other questions
  2016-05-04 21:30             ` Jens Axboe
@ 2016-05-05 19:33               ` Mark Nelson
  2016-05-06 17:11                 ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Mark Nelson @ 2016-05-05 19:33 UTC (permalink / raw)
  To: Jens Axboe, Paul Alcorn; +Cc: 'Mihkal Dunfjeld', fio, birger.lunstoeng



On 05/04/2016 04:30 PM, Jens Axboe wrote:
> On 05/04/2016 02:51 PM, Jens Axboe wrote:
>> On 05/04/2016 11:43 AM, Mark Nelson wrote:
>>> On 05/04/2016 11:34 AM, Jens Axboe wrote:
>>>> On 05/04/2016 08:10 AM, Jens Axboe wrote:
>>>>> On 04/27/2016 02:38 PM, Jens Axboe wrote:
>>>>>> I'll try and see if I can reproduce this.
>>>>>
>>>>> So I took a look at your logged output. The way that fio logs is
>>>>> that it
>>>>> does it on IO completion. You have set 1s intervals, so it checks,
>>>>> when
>>>>> an IO completes, if we're due for logging the previous interval IOPS:
>>>>>
>>>>> spent = mtime_since(&td->iops_sample_time, t);
>>>>> if (spent < td->o.iops_avg_time)
>>>>>          return;
>>>>>
>>>>> If you are doing lots of IO, it matches pretty evenly. But for slow
>>>>> workloads, it will end up being skewed a bit. It won't be too hard to
>>>>> fix the drift, though, but it still won't make it clock precise at
>>>>> exact
>>>>> 1s intervals.
>>>>>
>>>>> We could move this part into the main thread, and out of the workers.
>>>>> Then we could make it more precise. Might not be a bad idea in
>>>>> general,
>>>>> since it'll move some of this code out of the fast path.
>>>>>
>>>>> I'll take a stab at it.
>>>>
>>>> Can you try the below patch?
>>>
>>> I'll also try to take a look at this as we've seen a fair amount of skew
>>> in the logs doing fio librbd upstream ceph testing.
>>
>> That'd be great. More complete version below, fwiw.

Ok, I'll try this one.  It might be a bit before I can test it since I'm 
bisecting ceph right now looking for a regression that only hits 128k 
sequential rbd reads.  Go figure. :)

>
> And hopefully final... Little corner cases fixed. Seems to pass my
> testing, would be great if others could test too. Any job that currently
> uses any of the loggings would do.
>
>
> diff --git a/HOWTO b/HOWTO
> index 1f523d3..a74a112 100644
> --- a/HOWTO
> +++ b/HOWTO
> @@ -1263,10 +1263,14 @@ exitall_on_error    When one job finishes in
> error, terminate the rest. The
>          default is to wait for each job to finish.
>
>  bwavgtime=int    Average the calculated bandwidth over the given time.
> Value
> -        is specified in milliseconds.
> +        is specified in milliseconds. If the job also does bandwidth
> +        logging through 'write_bw_log', then the minimum of this option
> +        and 'log_avg_msec' will be used.  Default: 500ms.
>
>  iopsavgtime=int    Average the calculated IOPS over the given time. Value
> -        is specified in milliseconds.
> +        is specified in milliseconds. If the job also does IOPS logging
> +        through 'write_iops_log', then the minimum of this option and
> +        'log_avg_msec' will be used.  Default: 500ms.
>
>  create_serialize=bool    If true, serialize the file creating for the
> jobs.
>              This may be handy to avoid interleaving of data
> diff --git a/backend.c b/backend.c
> index 1723b8f..975517a 100644
> --- a/backend.c
> +++ b/backend.c
> @@ -1722,7 +1722,7 @@ static void *thread_main(void *data)
>
>      fio_unpin_memory(td);
>
> -    fio_writeout_logs(td);
> +    td_writeout_logs(td, true);
>
>      iolog_compress_exit(td);
>      rate_submit_exit(td);
> @@ -2337,41 +2337,64 @@ static void free_disk_util(void)
>
>  static void *helper_thread_main(void *data)
>  {
> +    unsigned int msec_to_next_event, next_log;
>      struct sk_out *sk_out = data;
> +    struct timeval tv, last_du;
>      int ret = 0;
>
>      sk_out_assign(sk_out);
>
> +    gettimeofday(&tv, NULL);
> +    memcpy(&last_du, &tv, sizeof(tv));
> +
>      fio_mutex_up(startup_mutex);
>
> -    while (!ret) {
> -        uint64_t sec = DISK_UTIL_MSEC / 1000;
> -        uint64_t nsec = (DISK_UTIL_MSEC % 1000) * 1000000;
> +    msec_to_next_event = DISK_UTIL_MSEC;
> +    while (!ret && !helper_exit) {
>          struct timespec ts;
> -        struct timeval tv;
> -
> -        gettimeofday(&tv, NULL);
> -        ts.tv_sec = tv.tv_sec + sec;
> -        ts.tv_nsec = (tv.tv_usec * 1000) + nsec;
> +        struct timeval now;
>
> -        if (ts.tv_nsec >= 1000000000ULL) {
> -            ts.tv_nsec -= 1000000000ULL;
> -            ts.tv_sec++;
> +        tv.tv_usec += msec_to_next_event * 1000;
> +        if (tv.tv_usec >= 1000000) {
> +            tv.tv_usec -= 1000000;
> +            tv.tv_sec++;
>          }
>
> +        ts.tv_sec = tv.tv_sec;
> +        ts.tv_nsec = tv.tv_usec * 1000;
> +
>          pthread_cond_timedwait(&helper_cond, &helper_lock, &ts);
>
> -        ret = update_io_ticks();
> +        gettimeofday(&now, NULL);
> +
> +        if (mtime_since(&last_du, &now) >= DISK_UTIL_MSEC) {
> +            ret = update_io_ticks();
> +            memcpy(&last_du, &now, sizeof(now));
> +            msec_to_next_event = DISK_UTIL_MSEC;
> +        } else {
> +            msec_to_next_event = DISK_UTIL_MSEC -
> +                        mtime_since(&last_du, &now);
> +        }
>
>          if (helper_do_stat) {
>              helper_do_stat = 0;
>              __show_running_run_stats();
>          }
>
> +        next_log = calc_log_samples();
> +        if (!next_log)
> +            next_log = DISK_UTIL_MSEC;
> +
> +        msec_to_next_event = min(next_log, msec_to_next_event);
> +        if (msec_to_next_event)
> +            msec_to_next_event--;
> +
>          if (!is_backend)
>              print_thread_status();
>      }
>
> +    fio_writeout_logs(false);
> +
>      sk_out_drop();
>      return NULL;
>  }
> diff --git a/fio.1 b/fio.1
> index 73fdee6..e9bb72e 100644
> --- a/fio.1
> +++ b/fio.1
> @@ -1180,12 +1180,14 @@ Terminate all jobs if one job finishes in error.
>  Default: wait for each job
>  to finish.
>  .TP
>  .BI bwavgtime \fR=\fPint
> -Average bandwidth calculations over the given time in milliseconds.
> Default:
> -500ms.
> +Average bandwidth calculations over the given time in milliseconds. If
> the job
> +also does bandwidth logging through \fBwrite_bw_log\fR, then the
> minimum of
> +this option and \fBlog_avg_msec\fR will be used.  Default: 500ms.
>  .TP
>  .BI iopsavgtime \fR=\fPint
> -Average IOPS calculations over the given time in milliseconds.  Default:
> -500ms.
> +Average IOPS calculations over the given time in milliseconds. If the job
> +also does IOPS logging through \fBwrite_iops_log\fR, then the minimum of
> +this option and \fBlog_avg_msec\fR will be used.  Default: 500ms.
>  .TP
>  .BI create_serialize \fR=\fPbool
>  If true, serialize file creation for the jobs.  Default: true.
> diff --git a/init.c b/init.c
> index 89e05c0..c579d5c 100644
> --- a/init.c
> +++ b/init.c
> @@ -1416,6 +1416,11 @@ static int add_job(struct thread_data *td, const
> char *jobname, int job_add_num,
>          };
>          const char *suf;
>
> +        if (fio_option_is_set(o, bw_avg_time))
> +            p.avg_msec = min(o->log_avg_msec, o->bw_avg_time);
> +        else
> +            o->bw_avg_time = p.avg_msec;
> +
>          if (p.log_gz_store)
>              suf = "log.fz";
>          else
> @@ -1436,6 +1441,11 @@ static int add_job(struct thread_data *td, const
> char *jobname, int job_add_num,
>          };
>          const char *suf;
>
> +        if (fio_option_is_set(o, iops_avg_time))
> +            p.avg_msec = min(o->log_avg_msec, o->iops_avg_time);
> +        else
> +            o->iops_avg_time = p.avg_msec;
> +
>          if (p.log_gz_store)
>              suf = "log.fz";
>          else
> diff --git a/io_u.c b/io_u.c
> index 6622bc0..8a5245a 100644
> --- a/io_u.c
> +++ b/io_u.c
> @@ -1715,11 +1715,11 @@ static void account_io_completion(struct
> thread_data *td, struct io_u *io_u,
>          io_u_mark_latency(td, lusec);
>      }
>
> -    if (!td->o.disable_bw)
> -        add_bw_sample(td, idx, bytes, &icd->time);
> +    if (!td->o.disable_bw && per_unit_log(td->bw_log))
> +        add_bw_sample(td, io_u, bytes, lusec);
>
> -    if (no_reduce)
> -        add_iops_sample(td, idx, bytes, &icd->time);
> +    if (no_reduce && per_unit_log(td->iops_log))
> +        add_iops_sample(td, io_u, bytes);
>
>      if (td->ts.nr_block_infos && io_u->ddir == DDIR_TRIM) {
>          uint32_t *info = io_u_block_info(td, io_u);
> diff --git a/iolog.c b/iolog.c
> index feda9ed..f1edf84 100644
> --- a/iolog.c
> +++ b/iolog.c
> @@ -18,6 +18,7 @@
>  #include "verify.h"
>  #include "trim.h"
>  #include "filelock.h"
> +#include "smalloc.h"
>
>  static const char iolog_ver2[] = "fio version 2 iolog";
>
> @@ -574,14 +575,12 @@ void setup_log(struct io_log **log, struct
> log_params *p,
>  {
>      struct io_log *l;
>
> -    l = calloc(1, sizeof(*l));
> +    l = smalloc(sizeof(*l));
>      l->nr_samples = 0;
> -    l->max_samples = 1024;
>      l->log_type = p->log_type;
>      l->log_offset = p->log_offset;
>      l->log_gz = p->log_gz;
>      l->log_gz_store = p->log_gz_store;
> -    l->log = malloc(l->max_samples * log_entry_sz(l));
>      l->avg_msec = p->avg_msec;
>      l->filename = strdup(filename);
>      l->td = p->td;
> @@ -631,7 +630,7 @@ void free_log(struct io_log *log)
>  {
>      free(log->log);
>      free(log->filename);
> -    free(log);
> +    sfree(log);
>  }
>
>  void flush_samples(FILE *f, void *samples, uint64_t sample_size)
> @@ -1239,29 +1238,74 @@ static int __write_log(struct thread_data *td,
> struct io_log *log, int try)
>      return 0;
>  }
>
> -static int write_iops_log(struct thread_data *td, int try)
> +static int write_iops_log(struct thread_data *td, int try, bool unit_log)
>  {
> -    return __write_log(td, td->iops_log, try);
> +    int ret;
> +
> +    if (per_unit_log(td->iops_log) != unit_log)
> +        return 0;
> +
> +    ret = __write_log(td, td->iops_log, try);
> +    if (!ret)
> +        td->iops_log = NULL;
> +
> +    return ret;
>  }
>
> -static int write_slat_log(struct thread_data *td, int try)
> +static int write_slat_log(struct thread_data *td, int try, bool unit_log)
>  {
> -    return __write_log(td, td->slat_log, try);
> +    int ret;
> +
> +    if (!unit_log)
> +        return 0;
> +
> +    ret = __write_log(td, td->slat_log, try);
> +    if (!ret)
> +        td->slat_log = NULL;
> +
> +    return ret;
>  }
>
> -static int write_clat_log(struct thread_data *td, int try)
> +static int write_clat_log(struct thread_data *td, int try, bool unit_log)
>  {
> -    return __write_log(td, td->clat_log, try);
> +    int ret;
> +
> +    if (!unit_log)
> +        return 0;
> +
> +    ret = __write_log(td, td->clat_log, try);
> +    if (!ret)
> +        td->clat_log = NULL;
> +
> +    return ret;
>  }
>
> -static int write_lat_log(struct thread_data *td, int try)
> +static int write_lat_log(struct thread_data *td, int try, bool unit_log)
>  {
> -    return __write_log(td, td->lat_log, try);
> +    int ret;
> +
> +    if (!unit_log)
> +        return 0;
> +
> +    ret = __write_log(td, td->lat_log, try);
> +    if (!ret)
> +        td->lat_log = NULL;
> +
> +    return ret;
>  }
>
> -static int write_bandw_log(struct thread_data *td, int try)
> +static int write_bandw_log(struct thread_data *td, int try, bool unit_log)
>  {
> -    return __write_log(td, td->bw_log, try);
> +    int ret;
> +
> +    if (per_unit_log(td->bw_log) != unit_log)
> +        return 0;
> +
> +    ret = __write_log(td, td->bw_log, try);
> +    if (!ret)
> +        td->bw_log = NULL;
> +
> +    return ret;
>  }
>
>  enum {
> @@ -1276,7 +1320,7 @@ enum {
>
>  struct log_type {
>      unsigned int mask;
> -    int (*fn)(struct thread_data *, int);
> +    int (*fn)(struct thread_data *, int, bool);
>  };
>
>  static struct log_type log_types[] = {
> @@ -1302,7 +1346,7 @@ static struct log_type log_types[] = {
>      },
>  };
>
> -void fio_writeout_logs(struct thread_data *td)
> +void td_writeout_logs(struct thread_data *td, bool unit_logs)
>  {
>      unsigned int log_mask = 0;
>      unsigned int log_left = ALL_LOG_NR;
> @@ -1310,7 +1354,7 @@ void fio_writeout_logs(struct thread_data *td)
>
>      old_state = td_bump_runstate(td, TD_FINISHING);
>
> -    finalize_logs(td);
> +    finalize_logs(td, unit_logs);
>
>      while (log_left) {
>          int prev_log_left = log_left;
> @@ -1320,7 +1364,7 @@ void fio_writeout_logs(struct thread_data *td)
>              int ret;
>
>              if (!(log_mask & lt->mask)) {
> -                ret = lt->fn(td, log_left != 1);
> +                ret = lt->fn(td, log_left != 1, unit_logs);
>                  if (!ret) {
>                      log_left--;
>                      log_mask |= lt->mask;
> @@ -1334,3 +1378,12 @@ void fio_writeout_logs(struct thread_data *td)
>
>      td_restore_runstate(td, old_state);
>  }
> +
> +void fio_writeout_logs(bool unit_logs)
> +{
> +    struct thread_data *td;
> +    int i;
> +
> +    for_each_td(td, i)
> +        td_writeout_logs(td, unit_logs);
> +}
> diff --git a/iolog.h b/iolog.h
> index 297daf5..739a7c8 100644
> --- a/iolog.h
> +++ b/iolog.h
> @@ -41,6 +41,8 @@ enum {
>      IO_LOG_TYPE_IOPS,
>  };
>
> +#define DEF_LOG_ENTRIES        1024
> +
>  /*
>   * Dynamically growing data sample log
>   */
> @@ -205,12 +207,18 @@ struct log_params {
>      int log_compress;
>  };
>
> -extern void finalize_logs(struct thread_data *td);
> +static inline bool per_unit_log(struct io_log *log)
> +{
> +    return log && !log->avg_msec;
> +}
> +
> +extern void finalize_logs(struct thread_data *td, bool);
>  extern void setup_log(struct io_log **, struct log_params *, const char
> *);
>  extern void flush_log(struct io_log *, int);
>  extern void flush_samples(FILE *, void *, uint64_t);
>  extern void free_log(struct io_log *);
> -extern void fio_writeout_logs(struct thread_data *);
> +extern void fio_writeout_logs(bool);
> +extern void td_writeout_logs(struct thread_data *, bool);
>  extern int iolog_flush(struct io_log *, int);
>
>  static inline void init_ipo(struct io_piece *ipo)
> diff --git a/libfio.c b/libfio.c
> index c626d15..b17f148 100644
> --- a/libfio.c
> +++ b/libfio.c
> @@ -146,6 +146,8 @@ void reset_all_stats(struct thread_data *td)
>      fio_gettime(&tv, NULL);
>      memcpy(&td->epoch, &tv, sizeof(tv));
>      memcpy(&td->start, &tv, sizeof(tv));
> +    memcpy(&td->iops_sample_time, &tv, sizeof(tv));
> +    memcpy(&td->bw_sample_time, &tv, sizeof(tv));
>
>      lat_target_reset(td);
>      clear_rusage_stat(td);
> diff --git a/stat.c b/stat.c
> index 6d8d4d0..df73ed7 100644
> --- a/stat.c
> +++ b/stat.c
> @@ -1862,10 +1862,15 @@ static void __add_log_sample(struct io_log
> *iolog, unsigned long val,
>          iolog->avg_last = t;
>
>      if (iolog->nr_samples == iolog->max_samples) {
> -        size_t new_size;
> +        size_t new_size, new_samples;
>          void *new_log;
>
> -        new_size = 2 * iolog->max_samples * log_entry_sz(iolog);
> +        if (!iolog->max_samples)
> +            new_samples = DEF_LOG_ENTRIES;
> +        else
> +            new_samples = iolog->max_samples * 2;
> +
> +        new_size = new_samples * log_entry_sz(iolog);
>
>          if (iolog->log_gz && (new_size > iolog->log_gz)) {
>              if (iolog_flush(iolog, 0)) {
> @@ -1882,7 +1887,7 @@ static void __add_log_sample(struct io_log *iolog,
> unsigned long val,
>                  return;
>              }
>              iolog->log = new_log;
> -            iolog->max_samples <<= 1;
> +            iolog->max_samples = new_samples;
>          }
>      }
>
> @@ -2013,21 +2018,21 @@ static void add_log_sample(struct thread_data
> *td, struct io_log *iolog,
>      iolog->avg_last = elapsed;
>  }
>
> -void finalize_logs(struct thread_data *td)
> +void finalize_logs(struct thread_data *td, bool unit_logs)
>  {
>      unsigned long elapsed;
>
>      elapsed = mtime_since_now(&td->epoch);
>
> -    if (td->clat_log)
> +    if (td->clat_log && unit_logs)
>          _add_stat_to_log(td->clat_log, elapsed, td->o.log_max != 0);
> -    if (td->slat_log)
> +    if (td->slat_log && unit_logs)
>          _add_stat_to_log(td->slat_log, elapsed, td->o.log_max != 0);
> -    if (td->lat_log)
> +    if (td->lat_log && unit_logs)
>          _add_stat_to_log(td->lat_log, elapsed, td->o.log_max != 0);
> -    if (td->bw_log)
> +    if (td->bw_log && (unit_logs == per_unit_log(td->bw_log)))
>          _add_stat_to_log(td->bw_log, elapsed, td->o.log_max != 0);
> -    if (td->iops_log)
> +    if (td->iops_log && (unit_logs == per_unit_log(td->iops_log)))
>          _add_stat_to_log(td->iops_log, elapsed, td->o.log_max != 0);
>  }
>
> @@ -2108,18 +2113,43 @@ void add_lat_sample(struct thread_data *td, enum
> fio_ddir ddir,
>      td_io_u_unlock(td);
>  }
>
> -void add_bw_sample(struct thread_data *td, enum fio_ddir ddir, unsigned
> int bs,
> -           struct timeval *t)
> +void add_bw_sample(struct thread_data *td, struct io_u *io_u,
> +           unsigned int bytes, unsigned long spent)
>  {
>      struct thread_stat *ts = &td->ts;
> -    unsigned long spent, rate;
> +    unsigned long rate;
>
> -    if (!ddir_rw(ddir))
> +    if (!ddir_rw(io_u->ddir))
>          return;
>
> +    if (spent)
> +        rate = bytes * 1000 / spent;
> +    else
> +        rate = 0;
> +
> +    td_io_u_lock(td);
> +
> +    add_stat_sample(&ts->bw_stat[io_u->ddir], rate);
> +
> +    if (td->bw_log)
> +        add_log_sample(td, td->bw_log, rate, io_u->ddir, bytes,
> io_u->offset);
> +
> +    td->stat_io_bytes[io_u->ddir] = td->this_io_bytes[io_u->ddir];
> +    td_io_u_unlock(td);
> +}
> +
> +uint64_t add_bw_samples(struct thread_data *td, struct timeval *t)
> +{
> +    struct thread_stat *ts = &td->ts;
> +    unsigned long spent, rate;
> +    enum fio_ddir ddir;
> +
> +    if (per_unit_log(td->bw_log))
> +        return 0;
> +
>      spent = mtime_since(&td->bw_sample_time, t);
>      if (spent < td->o.bw_avg_time)
> -        return;
> +        return td->o.bw_avg_time - spent;
>
>      td_io_u_lock(td);
>
> @@ -2141,27 +2171,48 @@ void add_bw_sample(struct thread_data *td, enum
> fio_ddir ddir, unsigned int bs,
>          add_stat_sample(&ts->bw_stat[ddir], rate);
>
>          if (td->bw_log)
> -            add_log_sample(td, td->bw_log, rate, ddir, bs, 0);
> +            add_log_sample(td, td->bw_log, rate, ddir, 0, 0);
>
>          td->stat_io_bytes[ddir] = td->this_io_bytes[ddir];
>      }
>
> -    fio_gettime(&td->bw_sample_time, NULL);
> +    memcpy(&td->bw_sample_time, t, sizeof(*t));
>      td_io_u_unlock(td);
> +
> +    return td->o.bw_avg_time;
>  }
>
> -void add_iops_sample(struct thread_data *td, enum fio_ddir ddir,
> unsigned int bs,
> -             struct timeval *t)
> +void add_iops_sample(struct thread_data *td, struct io_u *io_u,
> +             unsigned int bytes)
>  {
>      struct thread_stat *ts = &td->ts;
> -    unsigned long spent, iops;
>
> -    if (!ddir_rw(ddir))
> +    if (!ddir_rw(io_u->ddir))
>          return;
>
> +    td_io_u_lock(td);
> +
> +    add_stat_sample(&ts->iops_stat[io_u->ddir], 1);
> +
> +    if (td->iops_log)
> +        add_log_sample(td, td->iops_log, 1, io_u->ddir, bytes,
> io_u->offset);
> +
> +    td->stat_io_blocks[io_u->ddir] = td->this_io_blocks[io_u->ddir];
> +    td_io_u_unlock(td);
> +}
> +
> +uint64_t add_iops_samples(struct thread_data *td, struct timeval *t)
> +{
> +    struct thread_stat *ts = &td->ts;
> +    unsigned long spent, iops;
> +    enum fio_ddir ddir;
> +
> +    if (per_unit_log(td->iops_log))
> +        return 0;
> +
>      spent = mtime_since(&td->iops_sample_time, t);
>      if (spent < td->o.iops_avg_time)
> -        return;
> +        return td->o.iops_avg_time - spent;
>
>      td_io_u_lock(td);
>
> @@ -2183,13 +2234,48 @@ void add_iops_sample(struct thread_data *td,
> enum fio_ddir ddir, unsigned int bs
>          add_stat_sample(&ts->iops_stat[ddir], iops);
>
>          if (td->iops_log)
> -            add_log_sample(td, td->iops_log, iops, ddir, bs, 0);
> +            add_log_sample(td, td->iops_log, iops, ddir, 0, 0);
>
>          td->stat_io_blocks[ddir] = td->this_io_blocks[ddir];
>      }
>
> -    fio_gettime(&td->iops_sample_time, NULL);
> +    memcpy(&td->iops_sample_time, t, sizeof(*t));
>      td_io_u_unlock(td);
> +
> +    return td->o.iops_avg_time;
> +}
> +
> +/*
> + * Returns msecs to next event
> + */
> +unsigned int calc_log_samples(void)
> +{
> +    struct thread_data *td;
> +    struct timeval now;
> +    unsigned int next = ~0U, tmp;
> +    int i;
> +
> +    fio_gettime(&now, NULL);
> +
> +    for_each_td(td, i) {
> +        if (!ramp_time_over(td) ||
> +            !(td->runstate == TD_RUNNING || td->runstate ==
> TD_VERIFYING)) {
> +            next = min(td->o.iops_avg_time, td->o.bw_avg_time);
> +            continue;
> +        }
> +        if (!per_unit_log(td->bw_log)) {
> +            tmp = add_bw_samples(td, &now);
> +            if (tmp < next)
> +                next = tmp;
> +        }
> +        if (!per_unit_log(td->iops_log)) {
> +            tmp = add_iops_samples(td, &now);
> +            if (tmp < next)
> +                next = tmp;
> +        }
> +    }
> +
> +    return next == ~0U ? 0 : next;
>  }
>
>  void stat_init(void)
> diff --git a/stat.h b/stat.h
> index 9c3f192..7c2b16e 100644
> --- a/stat.h
> +++ b/stat.h
> @@ -276,11 +276,12 @@ extern void add_clat_sample(struct thread_data *,
> enum fio_ddir, unsigned long,
>                  unsigned int, uint64_t);
>  extern void add_slat_sample(struct thread_data *, enum fio_ddir,
> unsigned long,
>                  unsigned int, uint64_t);
> -extern void add_bw_sample(struct thread_data *, enum fio_ddir, unsigned
> int,
> -                struct timeval *);
> -extern void add_iops_sample(struct thread_data *, enum fio_ddir,
> unsigned int,
> -                struct timeval *);
>  extern void add_agg_sample(unsigned long, enum fio_ddir, unsigned int);
> +extern void add_iops_sample(struct thread_data *, struct io_u *,
> +                unsigned int);
> +extern void add_bw_sample(struct thread_data *, struct io_u *,
> +                unsigned int, unsigned long);
> +extern unsigned int calc_log_samples(void);
>
>  extern struct io_log *agg_io_log[DDIR_RWDIR_CNT];
>  extern int write_bw_log;
>
>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Missing log entries and other questions
  2016-05-04 20:51           ` Jens Axboe
  2016-05-04 21:30             ` Jens Axboe
@ 2016-05-05 19:36             ` Mark Nelson
  2016-05-06 17:11               ` Jens Axboe
  1 sibling, 1 reply; 16+ messages in thread
From: Mark Nelson @ 2016-05-05 19:36 UTC (permalink / raw)
  To: Jens Axboe, Paul Alcorn; +Cc: 'Mihkal Dunfjeld', fio, birger.lunstoeng



On 05/04/2016 03:51 PM, Jens Axboe wrote:
> On 05/04/2016 11:43 AM, Mark Nelson wrote:
>> On 05/04/2016 11:34 AM, Jens Axboe wrote:
>>> On 05/04/2016 08:10 AM, Jens Axboe wrote:
>>>> On 04/27/2016 02:38 PM, Jens Axboe wrote:
>>>>> I'll try and see if I can reproduce this.
>>>>
>>>> So I took a look at your logged output. The way that fio logs is
>>>> that it
>>>> does it on IO completion. You have set 1s intervals, so it checks, when
>>>> an IO completes, if we're due for logging the previous interval IOPS:
>>>>
>>>> spent = mtime_since(&td->iops_sample_time, t);
>>>> if (spent < td->o.iops_avg_time)
>>>>          return;
>>>>
>>>> If you are doing lots of IO, it matches pretty evenly. But for slow
>>>> workloads, it will end up being skewed a bit. It won't be too hard to
>>>> fix the drift, though, but it still won't make it clock precise at
>>>> exact
>>>> 1s intervals.
>>>>
>>>> We could move this part into the main thread, and out of the workers.
>>>> Then we could make it more precise. Might not be a bad idea in general,
>>>> since it'll move some of this code out of the fast path.
>>>>
>>>> I'll take a stab at it.
>>>
>>> Can you try the below patch?
>>
>> I'll also try to take a look at this as we've seen a fair amount of skew
>> in the logs doing fio librbd upstream ceph testing.
>
> That'd be great. More complete version below, fwiw.
>
>> I actually wrote a
>> first stab at a parser to aggregate logs from multiple fio processes
>> with the assumption of non-uniform sampling:
>>
>> https://github.com/ceph/cbt/blob/master/tools/fiologpaser.py
>>
>> The idea is you have say 8 fio bw logs in a directory you can do:
>>
>> fiologparser.py -f *bw*
>>
>> which will return average values from the logs over user defined
>> intervals by taking weighted scores from samples that overlap.  You can
>> also look at per-interval averages, sums, or just a global average.
>>
>> various option flags:
>>
>> -i <interval time in ms>
>> -d <divide results by d (say if you want a different unit of output)
>> -f <show full interval data>
>> -a <show average interval data across logs>
>> -s <show sums for interval data across logs>
>
> That looks useful. Might want to put that in the fio repo as well, if
> you are interested.

Sure, I'm game.  Might be a little buggy and it's definitely slow since 
I just threw it together the other day.  Do you have a place in the repo 
you like throwing parsing scripts?

Mark


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Missing log entries and other questions
  2016-05-05 19:33               ` Mark Nelson
@ 2016-05-06 17:11                 ` Jens Axboe
  2016-05-06 17:16                   ` Mark Nelson
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2016-05-06 17:11 UTC (permalink / raw)
  To: Mark Nelson, Paul Alcorn; +Cc: 'Mihkal Dunfjeld', fio, birger.lunstoeng

On 05/05/2016 01:33 PM, Mark Nelson wrote:
>>>>> Can you try the below patch?
>>>>
>>>> I'll also try to take a look at this as we've seen a fair amount of
>>>> skew
>>>> in the logs doing fio librbd upstream ceph testing.
>>>
>>> That'd be great. More complete version below, fwiw.
>
> Ok, I'll try this one.  It might be a bit before I can test it since I'm
> bisecting ceph right now looking for a regression that only hits 128k
> sequential rbd reads.  Go figure. :)

When you get to it, here's an updated version. I'm tempted to shove this
into a new branch, instead of flinging diffs around... Let me know what
you prefer.

I've tested various mutations of real time or windowed logging, and with
and without log compression enabled. Seems to work for me, and it should
be accurate and eliminate drifting.


diff --git a/HOWTO b/HOWTO
index 1f523d3..a74a112 100644
--- a/HOWTO
+++ b/HOWTO
@@ -1263,10 +1263,14 @@ exitall_on_error	When one job finishes in error, 
terminate the rest. The
  		default is to wait for each job to finish.

  bwavgtime=int	Average the calculated bandwidth over the given time. Value
-		is specified in milliseconds.
+		is specified in milliseconds. If the job also does bandwidth
+		logging through 'write_bw_log', then the minimum of this option
+		and 'log_avg_msec' will be used.  Default: 500ms.

  iopsavgtime=int	Average the calculated IOPS over the given time. Value
-		is specified in milliseconds.
+		is specified in milliseconds. If the job also does IOPS logging
+		through 'write_iops_log', then the minimum of this option and
+		'log_avg_msec' will be used.  Default: 500ms.

  create_serialize=bool	If true, serialize the file creating for the jobs.
  			This may be handy to avoid interleaving of data
diff --git a/backend.c b/backend.c
index 1723b8f..f0cb1bc 100644
--- a/backend.c
+++ b/backend.c
@@ -58,11 +58,6 @@
  #include "lib/mountcheck.h"
  #include "rate-submit.h"

-static pthread_t helper_thread;
-static pthread_mutex_t helper_lock;
-pthread_cond_t helper_cond;
-int helper_do_stat = 0;
-
  static struct fio_mutex *startup_mutex;
  static struct flist_head *cgroup_list;
  static char *cgroup_mnt;
@@ -79,7 +74,16 @@ unsigned int stat_number = 0;
  int shm_id = 0;
  int temp_stall_ts;
  unsigned long done_secs = 0;
-volatile int helper_exit = 0;
+
+static struct helper_data {
+	volatile int exit;
+	volatile int reset;
+	volatile int do_stat;
+	struct sk_out *sk_out;
+	pthread_t thread;
+	pthread_mutex_t lock;
+	pthread_cond_t cond;
+} *helper_data;

  #define PAGE_ALIGN(buf)	\
  	(char *) (((uintptr_t) (buf) + page_mask) & ~page_mask)
@@ -1722,7 +1726,7 @@ static void *thread_main(void *data)

  	fio_unpin_memory(td);

-	fio_writeout_logs(td);
+	td_writeout_logs(td, true);

  	iolog_compress_exit(td);
  	rate_submit_exit(td);
@@ -2319,78 +2323,154 @@ reap:
  	update_io_ticks();
  }

+void helper_reset(void)
+{
+	if (!helper_data)
+		return;
+
+	pthread_mutex_lock(&helper_data->lock);
+
+	if (!helper_data->reset) {
+		helper_data->reset = 1;
+		pthread_cond_signal(&helper_data->cond);
+	}
+
+	pthread_mutex_unlock(&helper_data->lock);
+}
+
+void helper_do_stat(void)
+{
+	if (!helper_data)
+		return;
+
+	pthread_mutex_lock(&helper_data->lock);
+	helper_data->do_stat = 1;
+	pthread_cond_signal(&helper_data->cond);
+	pthread_mutex_unlock(&helper_data->lock);
+}
+
+bool helper_should_exit(void)
+{
+	if (!helper_data)
+		return true;
+
+	return helper_data->exit;
+}
+
  static void wait_for_helper_thread_exit(void)
  {
  	void *ret;

-	helper_exit = 1;
-	pthread_cond_signal(&helper_cond);
-	pthread_join(helper_thread, &ret);
+	pthread_mutex_lock(&helper_data->lock);
+	helper_data->exit = 1;
+	pthread_cond_signal(&helper_data->cond);
+	pthread_mutex_unlock(&helper_data->lock);
+
+	pthread_join(helper_data->thread, &ret);
  }

  static void free_disk_util(void)
  {
  	disk_util_prune_entries();

-	pthread_cond_destroy(&helper_cond);
+	pthread_cond_destroy(&helper_data->cond);
+	pthread_mutex_destroy(&helper_data->lock);
+	sfree(helper_data);
  }

  static void *helper_thread_main(void *data)
  {
-	struct sk_out *sk_out = data;
+	struct helper_data *hd = data;
+	unsigned int msec_to_next_event, next_log;
+	struct timeval tv, last_du;
  	int ret = 0;

-	sk_out_assign(sk_out);
+	sk_out_assign(hd->sk_out);
+
+	gettimeofday(&tv, NULL);
+	memcpy(&last_du, &tv, sizeof(tv));

  	fio_mutex_up(startup_mutex);

-	while (!ret) {
-		uint64_t sec = DISK_UTIL_MSEC / 1000;
-		uint64_t nsec = (DISK_UTIL_MSEC % 1000) * 1000000;
+	msec_to_next_event = DISK_UTIL_MSEC;
+	while (!ret && !hd->exit) {
  		struct timespec ts;
-		struct timeval tv;
+		struct timeval now;
+		uint64_t since_du;
+
+		timeval_add_msec(&tv, msec_to_next_event);
+		ts.tv_sec = tv.tv_sec;
+		ts.tv_nsec = tv.tv_usec * 1000;
+
+		pthread_mutex_lock(&hd->lock);
+		pthread_cond_timedwait(&hd->cond, &hd->lock, &ts);

-		gettimeofday(&tv, NULL);
-		ts.tv_sec = tv.tv_sec + sec;
-		ts.tv_nsec = (tv.tv_usec * 1000) + nsec;
+		gettimeofday(&now, NULL);

-		if (ts.tv_nsec >= 1000000000ULL) {
-			ts.tv_nsec -= 1000000000ULL;
-			ts.tv_sec++;
+		if (hd->reset) {
+			memcpy(&tv, &now, sizeof(tv));
+			memcpy(&last_du, &now, sizeof(last_du));
+			hd->reset = 0;
  		}

-		pthread_cond_timedwait(&helper_cond, &helper_lock, &ts);
+		pthread_mutex_unlock(&hd->lock);

-		ret = update_io_ticks();
+		since_du = mtime_since(&last_du, &now);
+		if (since_du >= DISK_UTIL_MSEC || DISK_UTIL_MSEC - since_du < 10) {
+			ret = update_io_ticks();
+			timeval_add_msec(&last_du, DISK_UTIL_MSEC);
+			msec_to_next_event = DISK_UTIL_MSEC;
+			if (since_du >= DISK_UTIL_MSEC)
+				msec_to_next_event -= (since_du - DISK_UTIL_MSEC);
+		} else {
+			if (since_du >= DISK_UTIL_MSEC)
+				msec_to_next_event = DISK_UTIL_MSEC - (DISK_UTIL_MSEC - since_du);
+			else
+				msec_to_next_event = DISK_UTIL_MSEC;
+		}

-		if (helper_do_stat) {
-			helper_do_stat = 0;
+		if (hd->do_stat) {
+			hd->do_stat = 0;
  			__show_running_run_stats();
  		}

+		next_log = calc_log_samples();
+		if (!next_log)
+			next_log = DISK_UTIL_MSEC;
+
+		msec_to_next_event = min(next_log, msec_to_next_event);
+
  		if (!is_backend)
  			print_thread_status();
  	}

+	fio_writeout_logs(false);
+
  	sk_out_drop();
  	return NULL;
  }

  static int create_helper_thread(struct sk_out *sk_out)
  {
+	struct helper_data *hd;
  	int ret;

+	hd = smalloc(sizeof(*hd));
+
  	setup_disk_util();

-	pthread_cond_init(&helper_cond, NULL);
-	pthread_mutex_init(&helper_lock, NULL);
+	hd->sk_out = sk_out;
+	pthread_cond_init(&hd->cond, NULL);
+	pthread_mutex_init(&hd->lock, NULL);

-	ret = pthread_create(&helper_thread, NULL, helper_thread_main, sk_out);
+	ret = pthread_create(&hd->thread, NULL, helper_thread_main, hd);
  	if (ret) {
  		log_err("Can't create helper thread: %s\n", strerror(ret));
  		return 1;
  	}

+	helper_data = hd;
+
  	dprint(FD_MUTEX, "wait on startup_mutex\n");
  	fio_mutex_down(startup_mutex);
  	dprint(FD_MUTEX, "done waiting on startup_mutex\n");
diff --git a/diskutil.c b/diskutil.c
index c25c5c9..315c1e1 100644
--- a/diskutil.c
+++ b/diskutil.c
@@ -121,7 +121,7 @@ int update_io_ticks(void)

  	fio_mutex_down(disk_util_mutex);

-	if (!helper_exit) {
+	if (!helper_should_exit()) {
  		flist_for_each(entry, &disk_list) {
  			du = flist_entry(entry, struct disk_util, list);
  			update_io_tick_disk(du);
diff --git a/fio.1 b/fio.1
index 73fdee6..e9bb72e 100644
--- a/fio.1
+++ b/fio.1
@@ -1180,12 +1180,14 @@ Terminate all jobs if one job finishes in error. 
  Default: wait for each job
  to finish.
  .TP
  .BI bwavgtime \fR=\fPint
-Average bandwidth calculations over the given time in milliseconds. 
Default:
-500ms.
+Average bandwidth calculations over the given time in milliseconds. If 
the job
+also does bandwidth logging through \fBwrite_bw_log\fR, then the minimum of
+this option and \fBlog_avg_msec\fR will be used.  Default: 500ms.
  .TP
  .BI iopsavgtime \fR=\fPint
-Average IOPS calculations over the given time in milliseconds.  Default:
-500ms.
+Average IOPS calculations over the given time in milliseconds. If the job
+also does IOPS logging through \fBwrite_iops_log\fR, then the minimum of
+this option and \fBlog_avg_msec\fR will be used.  Default: 500ms.
  .TP
  .BI create_serialize \fR=\fPbool
  If true, serialize file creation for the jobs.  Default: true.
diff --git a/fio.h b/fio.h
index 829cc81..20ff26f 100644
--- a/fio.h
+++ b/fio.h
@@ -445,8 +445,6 @@ extern int nr_clients;
  extern int log_syslog;
  extern int status_interval;
  extern const char fio_version_string[];
-extern int helper_do_stat;
-extern pthread_cond_t helper_cond;
  extern char *trigger_file;
  extern char *trigger_cmd;
  extern char *trigger_remote_cmd;
@@ -734,4 +732,8 @@ enum {
  extern void exec_trigger(const char *);
  extern void check_trigger_file(void);

+extern void helper_reset(void);
+extern void helper_do_stat(void);
+extern bool helper_should_exit(void);
+
  #endif
diff --git a/fio_time.h b/fio_time.h
index 79f324a..cb271c2 100644
--- a/fio_time.h
+++ b/fio_time.h
@@ -17,5 +17,6 @@ extern void set_genesis_time(void);
  extern int ramp_time_over(struct thread_data *);
  extern int in_ramp_time(struct thread_data *);
  extern void fio_time_init(void);
+extern void timeval_add_msec(struct timeval *, unsigned int);

  #endif
diff --git a/init.c b/init.c
index 89e05c0..c579d5c 100644
--- a/init.c
+++ b/init.c
@@ -1416,6 +1416,11 @@ static int add_job(struct thread_data *td, const 
char *jobname, int job_add_num,
  		};
  		const char *suf;

+		if (fio_option_is_set(o, bw_avg_time))
+			p.avg_msec = min(o->log_avg_msec, o->bw_avg_time);
+		else
+			o->bw_avg_time = p.avg_msec;
+
  		if (p.log_gz_store)
  			suf = "log.fz";
  		else
@@ -1436,6 +1441,11 @@ static int add_job(struct thread_data *td, const 
char *jobname, int job_add_num,
  		};
  		const char *suf;

+		if (fio_option_is_set(o, iops_avg_time))
+			p.avg_msec = min(o->log_avg_msec, o->iops_avg_time);
+		else
+			o->iops_avg_time = p.avg_msec;
+
  		if (p.log_gz_store)
  			suf = "log.fz";
  		else
diff --git a/io_u.c b/io_u.c
index 6622bc0..eb15dc2 100644
--- a/io_u.c
+++ b/io_u.c
@@ -1710,16 +1710,18 @@ static void account_io_completion(struct 
thread_data *td, struct io_u *io_u,
  		}
  	}

-	if (!td->o.disable_clat) {
-		add_clat_sample(td, idx, lusec, bytes, io_u->offset);
-		io_u_mark_latency(td, lusec);
-	}
+	if (ddir_rw(idx)) {
+		if (!td->o.disable_clat) {
+			add_clat_sample(td, idx, lusec, bytes, io_u->offset);
+			io_u_mark_latency(td, lusec);
+		}

-	if (!td->o.disable_bw)
-		add_bw_sample(td, idx, bytes, &icd->time);
+		if (!td->o.disable_bw && per_unit_log(td->bw_log))
+			add_bw_sample(td, io_u, bytes, lusec);

-	if (no_reduce)
-		add_iops_sample(td, idx, bytes, &icd->time);
+		if (no_reduce && per_unit_log(td->iops_log))
+			add_iops_sample(td, io_u, bytes);
+	}

  	if (td->ts.nr_block_infos && io_u->ddir == DDIR_TRIM) {
  		uint32_t *info = io_u_block_info(td, io_u);
diff --git a/iolog.c b/iolog.c
index 94d3f3c..71afe86 100644
--- a/iolog.c
+++ b/iolog.c
@@ -18,6 +18,7 @@
  #include "verify.h"
  #include "trim.h"
  #include "filelock.h"
+#include "smalloc.h"

  static const char iolog_ver2[] = "fio version 2 iolog";

@@ -574,14 +575,12 @@ void setup_log(struct io_log **log, struct 
log_params *p,
  {
  	struct io_log *l;

-	l = calloc(1, sizeof(*l));
+	l = smalloc(sizeof(*l));
  	l->nr_samples = 0;
-	l->max_samples = DEF_LOG_ENTRIES;
  	l->log_type = p->log_type;
  	l->log_offset = p->log_offset;
  	l->log_gz = p->log_gz;
  	l->log_gz_store = p->log_gz_store;
-	l->log = malloc(l->max_samples * log_entry_sz(l));
  	l->avg_msec = p->avg_msec;
  	l->filename = strdup(filename);
  	l->td = p->td;
@@ -631,7 +630,7 @@ void free_log(struct io_log *log)
  {
  	free(log->log);
  	free(log->filename);
-	free(log);
+	sfree(log);
  }

  void flush_samples(FILE *f, void *samples, uint64_t sample_size)
@@ -1202,29 +1201,74 @@ static int __write_log(struct thread_data *td, 
struct io_log *log, int try)
  	return 0;
  }

-static int write_iops_log(struct thread_data *td, int try)
+static int write_iops_log(struct thread_data *td, int try, bool unit_log)
  {
-	return __write_log(td, td->iops_log, try);
+	int ret;
+
+	if (per_unit_log(td->iops_log) != unit_log)
+		return 0;
+
+	ret = __write_log(td, td->iops_log, try);
+	if (!ret)
+		td->iops_log = NULL;
+
+	return ret;
  }

-static int write_slat_log(struct thread_data *td, int try)
+static int write_slat_log(struct thread_data *td, int try, bool unit_log)
  {
-	return __write_log(td, td->slat_log, try);
+	int ret;
+
+	if (!unit_log)
+		return 0;
+
+	ret = __write_log(td, td->slat_log, try);
+	if (!ret)
+		td->slat_log = NULL;
+
+	return ret;
  }

-static int write_clat_log(struct thread_data *td, int try)
+static int write_clat_log(struct thread_data *td, int try, bool unit_log)
  {
-	return __write_log(td, td->clat_log, try);
+	int ret;
+
+	if (!unit_log)
+		return 0;
+
+	ret = __write_log(td, td->clat_log, try);
+	if (!ret)
+		td->clat_log = NULL;
+
+	return ret;
  }

-static int write_lat_log(struct thread_data *td, int try)
+static int write_lat_log(struct thread_data *td, int try, bool unit_log)
  {
-	return __write_log(td, td->lat_log, try);
+	int ret;
+
+	if (!unit_log)
+		return 0;
+
+	ret = __write_log(td, td->lat_log, try);
+	if (!ret)
+		td->lat_log = NULL;
+
+	return ret;
  }

-static int write_bandw_log(struct thread_data *td, int try)
+static int write_bandw_log(struct thread_data *td, int try, bool unit_log)
  {
-	return __write_log(td, td->bw_log, try);
+	int ret;
+
+	if (per_unit_log(td->bw_log) != unit_log)
+		return 0;
+
+	ret = __write_log(td, td->bw_log, try);
+	if (!ret)
+		td->bw_log = NULL;
+
+	return ret;
  }

  enum {
@@ -1239,7 +1283,7 @@ enum {

  struct log_type {
  	unsigned int mask;
-	int (*fn)(struct thread_data *, int);
+	int (*fn)(struct thread_data *, int, bool);
  };

  static struct log_type log_types[] = {
@@ -1265,7 +1309,7 @@ static struct log_type log_types[] = {
  	},
  };

-void fio_writeout_logs(struct thread_data *td)
+void td_writeout_logs(struct thread_data *td, bool unit_logs)
  {
  	unsigned int log_mask = 0;
  	unsigned int log_left = ALL_LOG_NR;
@@ -1273,7 +1317,7 @@ void fio_writeout_logs(struct thread_data *td)

  	old_state = td_bump_runstate(td, TD_FINISHING);

-	finalize_logs(td);
+	finalize_logs(td, unit_logs);

  	while (log_left) {
  		int prev_log_left = log_left;
@@ -1283,7 +1327,7 @@ void fio_writeout_logs(struct thread_data *td)
  			int ret;

  			if (!(log_mask & lt->mask)) {
-				ret = lt->fn(td, log_left != 1);
+				ret = lt->fn(td, log_left != 1, unit_logs);
  				if (!ret) {
  					log_left--;
  					log_mask |= lt->mask;
@@ -1297,3 +1341,12 @@ void fio_writeout_logs(struct thread_data *td)

  	td_restore_runstate(td, old_state);
  }
+
+void fio_writeout_logs(bool unit_logs)
+{
+	struct thread_data *td;
+	int i;
+
+	for_each_td(td, i)
+		td_writeout_logs(td, unit_logs);
+}
diff --git a/iolog.h b/iolog.h
index 74f2170..739a7c8 100644
--- a/iolog.h
+++ b/iolog.h
@@ -207,12 +207,18 @@ struct log_params {
  	int log_compress;
  };

-extern void finalize_logs(struct thread_data *td);
+static inline bool per_unit_log(struct io_log *log)
+{
+	return log && !log->avg_msec;
+}
+
+extern void finalize_logs(struct thread_data *td, bool);
  extern void setup_log(struct io_log **, struct log_params *, const 
char *);
  extern void flush_log(struct io_log *, int);
  extern void flush_samples(FILE *, void *, uint64_t);
  extern void free_log(struct io_log *);
-extern void fio_writeout_logs(struct thread_data *);
+extern void fio_writeout_logs(bool);
+extern void td_writeout_logs(struct thread_data *, bool);
  extern int iolog_flush(struct io_log *, int);

  static inline void init_ipo(struct io_piece *ipo)
diff --git a/libfio.c b/libfio.c
index b17f148..790c036 100644
--- a/libfio.c
+++ b/libfio.c
@@ -151,6 +151,7 @@ void reset_all_stats(struct thread_data *td)

  	lat_target_reset(td);
  	clear_rusage_stat(td);
+	helper_reset();
  }

  void reset_fio_state(void)
diff --git a/stat.c b/stat.c
index 6d8d4d0..8c41914 100644
--- a/stat.c
+++ b/stat.c
@@ -1862,13 +1862,21 @@ static void __add_log_sample(struct io_log 
*iolog, unsigned long val,
  		iolog->avg_last = t;

  	if (iolog->nr_samples == iolog->max_samples) {
-		size_t new_size;
+		size_t new_size, new_samples;
  		void *new_log;

-		new_size = 2 * iolog->max_samples * log_entry_sz(iolog);
+		if (!iolog->max_samples)
+			new_samples = DEF_LOG_ENTRIES;
+		else
+			new_samples = iolog->max_samples * 2;
+
+		new_size = new_samples * log_entry_sz(iolog);

  		if (iolog->log_gz && (new_size > iolog->log_gz)) {
-			if (iolog_flush(iolog, 0)) {
+			if (!iolog->log) {
+				iolog->log = malloc(new_size);
+				iolog->max_samples = new_samples;
+			} else if (iolog_flush(iolog, 0)) {
  				log_err("fio: failed flushing iolog! Will stop logging.\n");
  				iolog->disabled = 1;
  				return;
@@ -1882,7 +1890,7 @@ static void __add_log_sample(struct io_log *iolog, 
unsigned long val,
  				return;
  			}
  			iolog->log = new_log;
-			iolog->max_samples <<= 1;
+			iolog->max_samples = new_samples;
  		}
  	}

@@ -2013,21 +2021,21 @@ static void add_log_sample(struct thread_data 
*td, struct io_log *iolog,
  	iolog->avg_last = elapsed;
  }

-void finalize_logs(struct thread_data *td)
+void finalize_logs(struct thread_data *td, bool unit_logs)
  {
  	unsigned long elapsed;

  	elapsed = mtime_since_now(&td->epoch);

-	if (td->clat_log)
+	if (td->clat_log && unit_logs)
  		_add_stat_to_log(td->clat_log, elapsed, td->o.log_max != 0);
-	if (td->slat_log)
+	if (td->slat_log && unit_logs)
  		_add_stat_to_log(td->slat_log, elapsed, td->o.log_max != 0);
-	if (td->lat_log)
+	if (td->lat_log && unit_logs)
  		_add_stat_to_log(td->lat_log, elapsed, td->o.log_max != 0);
-	if (td->bw_log)
+	if (td->bw_log && (unit_logs == per_unit_log(td->bw_log)))
  		_add_stat_to_log(td->bw_log, elapsed, td->o.log_max != 0);
-	if (td->iops_log)
+	if (td->iops_log && (unit_logs == per_unit_log(td->iops_log)))
  		_add_stat_to_log(td->iops_log, elapsed, td->o.log_max != 0);
  }

@@ -2056,9 +2064,6 @@ void add_clat_sample(struct thread_data *td, enum 
fio_ddir ddir,
  {
  	struct thread_stat *ts = &td->ts;

-	if (!ddir_rw(ddir))
-		return;
-
  	td_io_u_lock(td);

  	add_stat_sample(&ts->clat_stat[ddir], usec);
@@ -2108,18 +2113,41 @@ void add_lat_sample(struct thread_data *td, enum 
fio_ddir ddir,
  	td_io_u_unlock(td);
  }

-void add_bw_sample(struct thread_data *td, enum fio_ddir ddir, unsigned 
int bs,
-		   struct timeval *t)
+void add_bw_sample(struct thread_data *td, struct io_u *io_u,
+		   unsigned int bytes, unsigned long spent)
+{
+	struct thread_stat *ts = &td->ts;
+	unsigned long rate;
+
+	if (spent)
+		rate = bytes * 1000 / spent;
+	else
+		rate = 0;
+
+	td_io_u_lock(td);
+
+	add_stat_sample(&ts->bw_stat[io_u->ddir], rate);
+
+	if (td->bw_log)
+		add_log_sample(td, td->bw_log, rate, io_u->ddir, bytes, io_u->offset);
+
+	td->stat_io_bytes[io_u->ddir] = td->this_io_bytes[io_u->ddir];
+	td_io_u_unlock(td);
+}
+
+static int add_bw_samples(struct thread_data *td, struct timeval *t)
  {
  	struct thread_stat *ts = &td->ts;
  	unsigned long spent, rate;
+	enum fio_ddir ddir;

-	if (!ddir_rw(ddir))
-		return;
+	if (per_unit_log(td->bw_log))
+		return 0;

  	spent = mtime_since(&td->bw_sample_time, t);
-	if (spent < td->o.bw_avg_time)
-		return;
+	if (spent < td->o.bw_avg_time &&
+	    td->o.bw_avg_time - spent >= 10)
+		return td->o.bw_avg_time - spent;

  	td_io_u_lock(td);

@@ -2141,27 +2169,50 @@ void add_bw_sample(struct thread_data *td, enum 
fio_ddir ddir, unsigned int bs,
  		add_stat_sample(&ts->bw_stat[ddir], rate);

  		if (td->bw_log)
-			add_log_sample(td, td->bw_log, rate, ddir, bs, 0);
+			add_log_sample(td, td->bw_log, rate, ddir, 0, 0);

  		td->stat_io_bytes[ddir] = td->this_io_bytes[ddir];
  	}

-	fio_gettime(&td->bw_sample_time, NULL);
+	timeval_add_msec(&td->bw_sample_time, td->o.bw_avg_time);
+
+	td_io_u_unlock(td);
+
+	if (spent <= td->o.bw_avg_time)
+		return td->o.bw_avg_time;
+
+	return td->o.bw_avg_time - (1 + spent - td->o.bw_avg_time);
+}
+
+void add_iops_sample(struct thread_data *td, struct io_u *io_u,
+		     unsigned int bytes)
+{
+	struct thread_stat *ts = &td->ts;
+
+	td_io_u_lock(td);
+
+	add_stat_sample(&ts->iops_stat[io_u->ddir], 1);
+
+	if (td->iops_log)
+		add_log_sample(td, td->iops_log, 1, io_u->ddir, bytes, io_u->offset);
+
+	td->stat_io_blocks[io_u->ddir] = td->this_io_blocks[io_u->ddir];
  	td_io_u_unlock(td);
  }

-void add_iops_sample(struct thread_data *td, enum fio_ddir ddir, 
unsigned int bs,
-		     struct timeval *t)
+static int add_iops_samples(struct thread_data *td, struct timeval *t)
  {
  	struct thread_stat *ts = &td->ts;
  	unsigned long spent, iops;
+	enum fio_ddir ddir;

-	if (!ddir_rw(ddir))
-		return;
+	if (per_unit_log(td->iops_log))
+		return 0;

  	spent = mtime_since(&td->iops_sample_time, t);
-	if (spent < td->o.iops_avg_time)
-		return;
+	if (spent < td->o.iops_avg_time &&
+	    td->o.iops_avg_time - spent >= 10)
+		return td->o.iops_avg_time - spent;

  	td_io_u_lock(td);

@@ -2183,13 +2234,53 @@ void add_iops_sample(struct thread_data *td, 
enum fio_ddir ddir, unsigned int bs
  		add_stat_sample(&ts->iops_stat[ddir], iops);

  		if (td->iops_log)
-			add_log_sample(td, td->iops_log, iops, ddir, bs, 0);
+			add_log_sample(td, td->iops_log, iops, ddir, 0, 0);

  		td->stat_io_blocks[ddir] = td->this_io_blocks[ddir];
  	}

-	fio_gettime(&td->iops_sample_time, NULL);
+	timeval_add_msec(&td->iops_sample_time, td->o.iops_avg_time);
+
  	td_io_u_unlock(td);
+
+	if (spent <= td->o.iops_avg_time)
+		return td->o.iops_avg_time;
+
+	printf("%lu over\n", spent - td->o.iops_avg_time);
+	return td->o.iops_avg_time - (1 + spent - td->o.iops_avg_time);
+}
+
+/*
+ * Returns msecs to next event
+ */
+int calc_log_samples(void)
+{
+	struct thread_data *td;
+	unsigned int next = ~0U, tmp;
+	struct timeval now;
+	int i;
+
+	fio_gettime(&now, NULL);
+
+	for_each_td(td, i) {
+		if (!ramp_time_over(td) ||
+		    !(td->runstate == TD_RUNNING || td->runstate == TD_VERIFYING)) {
+			next = min(td->o.iops_avg_time, td->o.bw_avg_time);
+			continue;
+		}
+		if (!per_unit_log(td->bw_log)) {
+			tmp = add_bw_samples(td, &now);
+			if (tmp < next)
+				next = tmp;
+		}
+		if (!per_unit_log(td->iops_log)) {
+			tmp = add_iops_samples(td, &now);
+			if (tmp < next)
+				next = tmp;
+		}
+	}
+
+	return next == ~0U ? 0 : next;
  }

  void stat_init(void)
@@ -2212,8 +2303,7 @@ void stat_exit(void)
   */
  void show_running_run_stats(void)
  {
-	helper_do_stat = 1;
-	pthread_cond_signal(&helper_cond);
+	helper_do_stat();
  }

  uint32_t *io_u_block_info(struct thread_data *td, struct io_u *io_u)
diff --git a/stat.h b/stat.h
index 9c3f192..86f1a0b 100644
--- a/stat.h
+++ b/stat.h
@@ -276,11 +276,12 @@ extern void add_clat_sample(struct thread_data *, 
enum fio_ddir, unsigned long,
  				unsigned int, uint64_t);
  extern void add_slat_sample(struct thread_data *, enum fio_ddir, 
unsigned long,
  				unsigned int, uint64_t);
-extern void add_bw_sample(struct thread_data *, enum fio_ddir, unsigned 
int,
-				struct timeval *);
-extern void add_iops_sample(struct thread_data *, enum fio_ddir, 
unsigned int,
-				struct timeval *);
  extern void add_agg_sample(unsigned long, enum fio_ddir, unsigned int);
+extern void add_iops_sample(struct thread_data *, struct io_u *,
+				unsigned int);
+extern void add_bw_sample(struct thread_data *, struct io_u *,
+				unsigned int, unsigned long);
+extern int calc_log_samples(void);

  extern struct io_log *agg_io_log[DDIR_RWDIR_CNT];
  extern int write_bw_log;
diff --git a/time.c b/time.c
index b145e90..7c7fa96 100644
--- a/time.c
+++ b/time.c
@@ -6,6 +6,15 @@
  static struct timeval genesis;
  static unsigned long ns_granularity;

+void timeval_add_msec(struct timeval *tv, unsigned int msec)
+{
+	tv->tv_usec += 1000 * msec;
+	if (tv->tv_usec > 1000000) {
+		tv->tv_usec -= 1000000;
+		tv->tv_sec++;
+	}
+}
+
  /*
   * busy looping version for the last few usec
   */
diff --git a/workqueue.c b/workqueue.c
index 6e67f3e..4f9c414 100644
--- a/workqueue.c
+++ b/workqueue.c
@@ -9,6 +9,7 @@
  #include "fio.h"
  #include "flist.h"
  #include "workqueue.h"
+#include "smalloc.h"

  enum {
  	SW_F_IDLE	= 1 << 0,
@@ -263,7 +264,7 @@ void workqueue_exit(struct workqueue *wq)
  		}
  	} while (shutdown && shutdown != wq->max_workers);

-	free(wq->workers);
+	sfree(wq->workers);
  	wq->workers = NULL;
  	pthread_mutex_destroy(&wq->flush_lock);
  	pthread_cond_destroy(&wq->flush_cond);
@@ -317,7 +318,7 @@ int workqueue_init(struct thread_data *td, struct 
workqueue *wq,
  	pthread_mutex_init(&wq->flush_lock, NULL);
  	pthread_mutex_init(&wq->stat_lock, NULL);

-	wq->workers = calloc(wq->max_workers, sizeof(struct submit_worker));
+	wq->workers = smalloc(wq->max_workers * sizeof(struct submit_worker));

  	for (i = 0; i < wq->max_workers; i++)
  		if (start_worker(wq, i, sk_out))

-- 
Jens Axboe



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: Missing log entries and other questions
  2016-05-05 19:36             ` Mark Nelson
@ 2016-05-06 17:11               ` Jens Axboe
  2016-05-06 17:43                 ` Mark Nelson
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2016-05-06 17:11 UTC (permalink / raw)
  To: Mark Nelson, Paul Alcorn; +Cc: 'Mihkal Dunfjeld', fio, birger.lunstoeng

On 05/05/2016 01:36 PM, Mark Nelson wrote:
>
>
> On 05/04/2016 03:51 PM, Jens Axboe wrote:
>> On 05/04/2016 11:43 AM, Mark Nelson wrote:
>>> On 05/04/2016 11:34 AM, Jens Axboe wrote:
>>>> On 05/04/2016 08:10 AM, Jens Axboe wrote:
>>>>> On 04/27/2016 02:38 PM, Jens Axboe wrote:
>>>>>> I'll try and see if I can reproduce this.
>>>>>
>>>>> So I took a look at your logged output. The way that fio logs is
>>>>> that it
>>>>> does it on IO completion. You have set 1s intervals, so it checks,
>>>>> when
>>>>> an IO completes, if we're due for logging the previous interval IOPS:
>>>>>
>>>>> spent = mtime_since(&td->iops_sample_time, t);
>>>>> if (spent < td->o.iops_avg_time)
>>>>>          return;
>>>>>
>>>>> If you are doing lots of IO, it matches pretty evenly. But for slow
>>>>> workloads, it will end up being skewed a bit. It won't be too hard to
>>>>> fix the drift, though, but it still won't make it clock precise at
>>>>> exact
>>>>> 1s intervals.
>>>>>
>>>>> We could move this part into the main thread, and out of the workers.
>>>>> Then we could make it more precise. Might not be a bad idea in
>>>>> general,
>>>>> since it'll move some of this code out of the fast path.
>>>>>
>>>>> I'll take a stab at it.
>>>>
>>>> Can you try the below patch?
>>>
>>> I'll also try to take a look at this as we've seen a fair amount of skew
>>> in the logs doing fio librbd upstream ceph testing.
>>
>> That'd be great. More complete version below, fwiw.
>>
>>> I actually wrote a
>>> first stab at a parser to aggregate logs from multiple fio processes
>>> with the assumption of non-uniform sampling:
>>>
>>> https://github.com/ceph/cbt/blob/master/tools/fiologpaser.py
>>>
>>> The idea is you have say 8 fio bw logs in a directory you can do:
>>>
>>> fiologparser.py -f *bw*
>>>
>>> which will return average values from the logs over user defined
>>> intervals by taking weighted scores from samples that overlap.  You can
>>> also look at per-interval averages, sums, or just a global average.
>>>
>>> various option flags:
>>>
>>> -i <interval time in ms>
>>> -d <divide results by d (say if you want a different unit of output)
>>> -f <show full interval data>
>>> -a <show average interval data across logs>
>>> -s <show sums for interval data across logs>
>>
>> That looks useful. Might want to put that in the fio repo as well, if
>> you are interested.
>
> Sure, I'm game.  Might be a little buggy and it's definitely slow since
> I just threw it together the other day.  Do you have a place in the repo
> you like throwing parsing scripts?

Usually I put them in the tools/ directory.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Missing log entries and other questions
  2016-05-06 17:11                 ` Jens Axboe
@ 2016-05-06 17:16                   ` Mark Nelson
  2016-05-06 17:18                     ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Mark Nelson @ 2016-05-06 17:16 UTC (permalink / raw)
  To: Jens Axboe, Paul Alcorn; +Cc: 'Mihkal Dunfjeld', fio, birger.lunstoeng



On 05/06/2016 12:11 PM, Jens Axboe wrote:
> On 05/05/2016 01:33 PM, Mark Nelson wrote:
>>>>>> Can you try the below patch?
>>>>>
>>>>> I'll also try to take a look at this as we've seen a fair amount of
>>>>> skew
>>>>> in the logs doing fio librbd upstream ceph testing.
>>>>
>>>> That'd be great. More complete version below, fwiw.
>>
>> Ok, I'll try this one.  It might be a bit before I can test it since I'm
>> bisecting ceph right now looking for a regression that only hits 128k
>> sequential rbd reads.  Go figure. :)
>
> When you get to it, here's an updated version. I'm tempted to shove this
> into a new branch, instead of flinging diffs around... Let me know what
> you prefer.

new branch would be wonderful!

>
> I've tested various mutations of real time or windowed logging, and with
> and without log compression enabled. Seems to work for me, and it should
> be accurate and eliminate drifting.
>
>
> diff --git a/HOWTO b/HOWTO
> index 1f523d3..a74a112 100644
> --- a/HOWTO
> +++ b/HOWTO
> @@ -1263,10 +1263,14 @@ exitall_on_error    When one job finishes in
> error, terminate the rest. The
>          default is to wait for each job to finish.
>
>  bwavgtime=int    Average the calculated bandwidth over the given time.
> Value
> -        is specified in milliseconds.
> +        is specified in milliseconds. If the job also does bandwidth
> +        logging through 'write_bw_log', then the minimum of this option
> +        and 'log_avg_msec' will be used.  Default: 500ms.
>
>  iopsavgtime=int    Average the calculated IOPS over the given time. Value
> -        is specified in milliseconds.
> +        is specified in milliseconds. If the job also does IOPS logging
> +        through 'write_iops_log', then the minimum of this option and
> +        'log_avg_msec' will be used.  Default: 500ms.
>
>  create_serialize=bool    If true, serialize the file creating for the
> jobs.
>              This may be handy to avoid interleaving of data
> diff --git a/backend.c b/backend.c
> index 1723b8f..f0cb1bc 100644
> --- a/backend.c
> +++ b/backend.c
> @@ -58,11 +58,6 @@
>  #include "lib/mountcheck.h"
>  #include "rate-submit.h"
>
> -static pthread_t helper_thread;
> -static pthread_mutex_t helper_lock;
> -pthread_cond_t helper_cond;
> -int helper_do_stat = 0;
> -
>  static struct fio_mutex *startup_mutex;
>  static struct flist_head *cgroup_list;
>  static char *cgroup_mnt;
> @@ -79,7 +74,16 @@ unsigned int stat_number = 0;
>  int shm_id = 0;
>  int temp_stall_ts;
>  unsigned long done_secs = 0;
> -volatile int helper_exit = 0;
> +
> +static struct helper_data {
> +    volatile int exit;
> +    volatile int reset;
> +    volatile int do_stat;
> +    struct sk_out *sk_out;
> +    pthread_t thread;
> +    pthread_mutex_t lock;
> +    pthread_cond_t cond;
> +} *helper_data;
>
>  #define PAGE_ALIGN(buf)    \
>      (char *) (((uintptr_t) (buf) + page_mask) & ~page_mask)
> @@ -1722,7 +1726,7 @@ static void *thread_main(void *data)
>
>      fio_unpin_memory(td);
>
> -    fio_writeout_logs(td);
> +    td_writeout_logs(td, true);
>
>      iolog_compress_exit(td);
>      rate_submit_exit(td);
> @@ -2319,78 +2323,154 @@ reap:
>      update_io_ticks();
>  }
>
> +void helper_reset(void)
> +{
> +    if (!helper_data)
> +        return;
> +
> +    pthread_mutex_lock(&helper_data->lock);
> +
> +    if (!helper_data->reset) {
> +        helper_data->reset = 1;
> +        pthread_cond_signal(&helper_data->cond);
> +    }
> +
> +    pthread_mutex_unlock(&helper_data->lock);
> +}
> +
> +void helper_do_stat(void)
> +{
> +    if (!helper_data)
> +        return;
> +
> +    pthread_mutex_lock(&helper_data->lock);
> +    helper_data->do_stat = 1;
> +    pthread_cond_signal(&helper_data->cond);
> +    pthread_mutex_unlock(&helper_data->lock);
> +}
> +
> +bool helper_should_exit(void)
> +{
> +    if (!helper_data)
> +        return true;
> +
> +    return helper_data->exit;
> +}
> +
>  static void wait_for_helper_thread_exit(void)
>  {
>      void *ret;
>
> -    helper_exit = 1;
> -    pthread_cond_signal(&helper_cond);
> -    pthread_join(helper_thread, &ret);
> +    pthread_mutex_lock(&helper_data->lock);
> +    helper_data->exit = 1;
> +    pthread_cond_signal(&helper_data->cond);
> +    pthread_mutex_unlock(&helper_data->lock);
> +
> +    pthread_join(helper_data->thread, &ret);
>  }
>
>  static void free_disk_util(void)
>  {
>      disk_util_prune_entries();
>
> -    pthread_cond_destroy(&helper_cond);
> +    pthread_cond_destroy(&helper_data->cond);
> +    pthread_mutex_destroy(&helper_data->lock);
> +    sfree(helper_data);
>  }
>
>  static void *helper_thread_main(void *data)
>  {
> -    struct sk_out *sk_out = data;
> +    struct helper_data *hd = data;
> +    unsigned int msec_to_next_event, next_log;
> +    struct timeval tv, last_du;
>      int ret = 0;
>
> -    sk_out_assign(sk_out);
> +    sk_out_assign(hd->sk_out);
> +
> +    gettimeofday(&tv, NULL);
> +    memcpy(&last_du, &tv, sizeof(tv));
>
>      fio_mutex_up(startup_mutex);
>
> -    while (!ret) {
> -        uint64_t sec = DISK_UTIL_MSEC / 1000;
> -        uint64_t nsec = (DISK_UTIL_MSEC % 1000) * 1000000;
> +    msec_to_next_event = DISK_UTIL_MSEC;
> +    while (!ret && !hd->exit) {
>          struct timespec ts;
> -        struct timeval tv;
> +        struct timeval now;
> +        uint64_t since_du;
> +
> +        timeval_add_msec(&tv, msec_to_next_event);
> +        ts.tv_sec = tv.tv_sec;
> +        ts.tv_nsec = tv.tv_usec * 1000;
> +
> +        pthread_mutex_lock(&hd->lock);
> +        pthread_cond_timedwait(&hd->cond, &hd->lock, &ts);
>
> -        gettimeofday(&tv, NULL);
> -        ts.tv_sec = tv.tv_sec + sec;
> -        ts.tv_nsec = (tv.tv_usec * 1000) + nsec;
> +        gettimeofday(&now, NULL);
>
> -        if (ts.tv_nsec >= 1000000000ULL) {
> -            ts.tv_nsec -= 1000000000ULL;
> -            ts.tv_sec++;
> +        if (hd->reset) {
> +            memcpy(&tv, &now, sizeof(tv));
> +            memcpy(&last_du, &now, sizeof(last_du));
> +            hd->reset = 0;
>          }
>
> -        pthread_cond_timedwait(&helper_cond, &helper_lock, &ts);
> +        pthread_mutex_unlock(&hd->lock);
>
> -        ret = update_io_ticks();
> +        since_du = mtime_since(&last_du, &now);
> +        if (since_du >= DISK_UTIL_MSEC || DISK_UTIL_MSEC - since_du <
> 10) {
> +            ret = update_io_ticks();
> +            timeval_add_msec(&last_du, DISK_UTIL_MSEC);
> +            msec_to_next_event = DISK_UTIL_MSEC;
> +            if (since_du >= DISK_UTIL_MSEC)
> +                msec_to_next_event -= (since_du - DISK_UTIL_MSEC);
> +        } else {
> +            if (since_du >= DISK_UTIL_MSEC)
> +                msec_to_next_event = DISK_UTIL_MSEC - (DISK_UTIL_MSEC -
> since_du);
> +            else
> +                msec_to_next_event = DISK_UTIL_MSEC;
> +        }
>
> -        if (helper_do_stat) {
> -            helper_do_stat = 0;
> +        if (hd->do_stat) {
> +            hd->do_stat = 0;
>              __show_running_run_stats();
>          }
>
> +        next_log = calc_log_samples();
> +        if (!next_log)
> +            next_log = DISK_UTIL_MSEC;
> +
> +        msec_to_next_event = min(next_log, msec_to_next_event);
> +
>          if (!is_backend)
>              print_thread_status();
>      }
>
> +    fio_writeout_logs(false);
> +
>      sk_out_drop();
>      return NULL;
>  }
>
>  static int create_helper_thread(struct sk_out *sk_out)
>  {
> +    struct helper_data *hd;
>      int ret;
>
> +    hd = smalloc(sizeof(*hd));
> +
>      setup_disk_util();
>
> -    pthread_cond_init(&helper_cond, NULL);
> -    pthread_mutex_init(&helper_lock, NULL);
> +    hd->sk_out = sk_out;
> +    pthread_cond_init(&hd->cond, NULL);
> +    pthread_mutex_init(&hd->lock, NULL);
>
> -    ret = pthread_create(&helper_thread, NULL, helper_thread_main,
> sk_out);
> +    ret = pthread_create(&hd->thread, NULL, helper_thread_main, hd);
>      if (ret) {
>          log_err("Can't create helper thread: %s\n", strerror(ret));
>          return 1;
>      }
>
> +    helper_data = hd;
> +
>      dprint(FD_MUTEX, "wait on startup_mutex\n");
>      fio_mutex_down(startup_mutex);
>      dprint(FD_MUTEX, "done waiting on startup_mutex\n");
> diff --git a/diskutil.c b/diskutil.c
> index c25c5c9..315c1e1 100644
> --- a/diskutil.c
> +++ b/diskutil.c
> @@ -121,7 +121,7 @@ int update_io_ticks(void)
>
>      fio_mutex_down(disk_util_mutex);
>
> -    if (!helper_exit) {
> +    if (!helper_should_exit()) {
>          flist_for_each(entry, &disk_list) {
>              du = flist_entry(entry, struct disk_util, list);
>              update_io_tick_disk(du);
> diff --git a/fio.1 b/fio.1
> index 73fdee6..e9bb72e 100644
> --- a/fio.1
> +++ b/fio.1
> @@ -1180,12 +1180,14 @@ Terminate all jobs if one job finishes in error.
>  Default: wait for each job
>  to finish.
>  .TP
>  .BI bwavgtime \fR=\fPint
> -Average bandwidth calculations over the given time in milliseconds.
> Default:
> -500ms.
> +Average bandwidth calculations over the given time in milliseconds. If
> the job
> +also does bandwidth logging through \fBwrite_bw_log\fR, then the
> minimum of
> +this option and \fBlog_avg_msec\fR will be used.  Default: 500ms.
>  .TP
>  .BI iopsavgtime \fR=\fPint
> -Average IOPS calculations over the given time in milliseconds.  Default:
> -500ms.
> +Average IOPS calculations over the given time in milliseconds. If the job
> +also does IOPS logging through \fBwrite_iops_log\fR, then the minimum of
> +this option and \fBlog_avg_msec\fR will be used.  Default: 500ms.
>  .TP
>  .BI create_serialize \fR=\fPbool
>  If true, serialize file creation for the jobs.  Default: true.
> diff --git a/fio.h b/fio.h
> index 829cc81..20ff26f 100644
> --- a/fio.h
> +++ b/fio.h
> @@ -445,8 +445,6 @@ extern int nr_clients;
>  extern int log_syslog;
>  extern int status_interval;
>  extern const char fio_version_string[];
> -extern int helper_do_stat;
> -extern pthread_cond_t helper_cond;
>  extern char *trigger_file;
>  extern char *trigger_cmd;
>  extern char *trigger_remote_cmd;
> @@ -734,4 +732,8 @@ enum {
>  extern void exec_trigger(const char *);
>  extern void check_trigger_file(void);
>
> +extern void helper_reset(void);
> +extern void helper_do_stat(void);
> +extern bool helper_should_exit(void);
> +
>  #endif
> diff --git a/fio_time.h b/fio_time.h
> index 79f324a..cb271c2 100644
> --- a/fio_time.h
> +++ b/fio_time.h
> @@ -17,5 +17,6 @@ extern void set_genesis_time(void);
>  extern int ramp_time_over(struct thread_data *);
>  extern int in_ramp_time(struct thread_data *);
>  extern void fio_time_init(void);
> +extern void timeval_add_msec(struct timeval *, unsigned int);
>
>  #endif
> diff --git a/init.c b/init.c
> index 89e05c0..c579d5c 100644
> --- a/init.c
> +++ b/init.c
> @@ -1416,6 +1416,11 @@ static int add_job(struct thread_data *td, const
> char *jobname, int job_add_num,
>          };
>          const char *suf;
>
> +        if (fio_option_is_set(o, bw_avg_time))
> +            p.avg_msec = min(o->log_avg_msec, o->bw_avg_time);
> +        else
> +            o->bw_avg_time = p.avg_msec;
> +
>          if (p.log_gz_store)
>              suf = "log.fz";
>          else
> @@ -1436,6 +1441,11 @@ static int add_job(struct thread_data *td, const
> char *jobname, int job_add_num,
>          };
>          const char *suf;
>
> +        if (fio_option_is_set(o, iops_avg_time))
> +            p.avg_msec = min(o->log_avg_msec, o->iops_avg_time);
> +        else
> +            o->iops_avg_time = p.avg_msec;
> +
>          if (p.log_gz_store)
>              suf = "log.fz";
>          else
> diff --git a/io_u.c b/io_u.c
> index 6622bc0..eb15dc2 100644
> --- a/io_u.c
> +++ b/io_u.c
> @@ -1710,16 +1710,18 @@ static void account_io_completion(struct
> thread_data *td, struct io_u *io_u,
>          }
>      }
>
> -    if (!td->o.disable_clat) {
> -        add_clat_sample(td, idx, lusec, bytes, io_u->offset);
> -        io_u_mark_latency(td, lusec);
> -    }
> +    if (ddir_rw(idx)) {
> +        if (!td->o.disable_clat) {
> +            add_clat_sample(td, idx, lusec, bytes, io_u->offset);
> +            io_u_mark_latency(td, lusec);
> +        }
>
> -    if (!td->o.disable_bw)
> -        add_bw_sample(td, idx, bytes, &icd->time);
> +        if (!td->o.disable_bw && per_unit_log(td->bw_log))
> +            add_bw_sample(td, io_u, bytes, lusec);
>
> -    if (no_reduce)
> -        add_iops_sample(td, idx, bytes, &icd->time);
> +        if (no_reduce && per_unit_log(td->iops_log))
> +            add_iops_sample(td, io_u, bytes);
> +    }
>
>      if (td->ts.nr_block_infos && io_u->ddir == DDIR_TRIM) {
>          uint32_t *info = io_u_block_info(td, io_u);
> diff --git a/iolog.c b/iolog.c
> index 94d3f3c..71afe86 100644
> --- a/iolog.c
> +++ b/iolog.c
> @@ -18,6 +18,7 @@
>  #include "verify.h"
>  #include "trim.h"
>  #include "filelock.h"
> +#include "smalloc.h"
>
>  static const char iolog_ver2[] = "fio version 2 iolog";
>
> @@ -574,14 +575,12 @@ void setup_log(struct io_log **log, struct
> log_params *p,
>  {
>      struct io_log *l;
>
> -    l = calloc(1, sizeof(*l));
> +    l = smalloc(sizeof(*l));
>      l->nr_samples = 0;
> -    l->max_samples = DEF_LOG_ENTRIES;
>      l->log_type = p->log_type;
>      l->log_offset = p->log_offset;
>      l->log_gz = p->log_gz;
>      l->log_gz_store = p->log_gz_store;
> -    l->log = malloc(l->max_samples * log_entry_sz(l));
>      l->avg_msec = p->avg_msec;
>      l->filename = strdup(filename);
>      l->td = p->td;
> @@ -631,7 +630,7 @@ void free_log(struct io_log *log)
>  {
>      free(log->log);
>      free(log->filename);
> -    free(log);
> +    sfree(log);
>  }
>
>  void flush_samples(FILE *f, void *samples, uint64_t sample_size)
> @@ -1202,29 +1201,74 @@ static int __write_log(struct thread_data *td,
> struct io_log *log, int try)
>      return 0;
>  }
>
> -static int write_iops_log(struct thread_data *td, int try)
> +static int write_iops_log(struct thread_data *td, int try, bool unit_log)
>  {
> -    return __write_log(td, td->iops_log, try);
> +    int ret;
> +
> +    if (per_unit_log(td->iops_log) != unit_log)
> +        return 0;
> +
> +    ret = __write_log(td, td->iops_log, try);
> +    if (!ret)
> +        td->iops_log = NULL;
> +
> +    return ret;
>  }
>
> -static int write_slat_log(struct thread_data *td, int try)
> +static int write_slat_log(struct thread_data *td, int try, bool unit_log)
>  {
> -    return __write_log(td, td->slat_log, try);
> +    int ret;
> +
> +    if (!unit_log)
> +        return 0;
> +
> +    ret = __write_log(td, td->slat_log, try);
> +    if (!ret)
> +        td->slat_log = NULL;
> +
> +    return ret;
>  }
>
> -static int write_clat_log(struct thread_data *td, int try)
> +static int write_clat_log(struct thread_data *td, int try, bool unit_log)
>  {
> -    return __write_log(td, td->clat_log, try);
> +    int ret;
> +
> +    if (!unit_log)
> +        return 0;
> +
> +    ret = __write_log(td, td->clat_log, try);
> +    if (!ret)
> +        td->clat_log = NULL;
> +
> +    return ret;
>  }
>
> -static int write_lat_log(struct thread_data *td, int try)
> +static int write_lat_log(struct thread_data *td, int try, bool unit_log)
>  {
> -    return __write_log(td, td->lat_log, try);
> +    int ret;
> +
> +    if (!unit_log)
> +        return 0;
> +
> +    ret = __write_log(td, td->lat_log, try);
> +    if (!ret)
> +        td->lat_log = NULL;
> +
> +    return ret;
>  }
>
> -static int write_bandw_log(struct thread_data *td, int try)
> +static int write_bandw_log(struct thread_data *td, int try, bool unit_log)
>  {
> -    return __write_log(td, td->bw_log, try);
> +    int ret;
> +
> +    if (per_unit_log(td->bw_log) != unit_log)
> +        return 0;
> +
> +    ret = __write_log(td, td->bw_log, try);
> +    if (!ret)
> +        td->bw_log = NULL;
> +
> +    return ret;
>  }
>
>  enum {
> @@ -1239,7 +1283,7 @@ enum {
>
>  struct log_type {
>      unsigned int mask;
> -    int (*fn)(struct thread_data *, int);
> +    int (*fn)(struct thread_data *, int, bool);
>  };
>
>  static struct log_type log_types[] = {
> @@ -1265,7 +1309,7 @@ static struct log_type log_types[] = {
>      },
>  };
>
> -void fio_writeout_logs(struct thread_data *td)
> +void td_writeout_logs(struct thread_data *td, bool unit_logs)
>  {
>      unsigned int log_mask = 0;
>      unsigned int log_left = ALL_LOG_NR;
> @@ -1273,7 +1317,7 @@ void fio_writeout_logs(struct thread_data *td)
>
>      old_state = td_bump_runstate(td, TD_FINISHING);
>
> -    finalize_logs(td);
> +    finalize_logs(td, unit_logs);
>
>      while (log_left) {
>          int prev_log_left = log_left;
> @@ -1283,7 +1327,7 @@ void fio_writeout_logs(struct thread_data *td)
>              int ret;
>
>              if (!(log_mask & lt->mask)) {
> -                ret = lt->fn(td, log_left != 1);
> +                ret = lt->fn(td, log_left != 1, unit_logs);
>                  if (!ret) {
>                      log_left--;
>                      log_mask |= lt->mask;
> @@ -1297,3 +1341,12 @@ void fio_writeout_logs(struct thread_data *td)
>
>      td_restore_runstate(td, old_state);
>  }
> +
> +void fio_writeout_logs(bool unit_logs)
> +{
> +    struct thread_data *td;
> +    int i;
> +
> +    for_each_td(td, i)
> +        td_writeout_logs(td, unit_logs);
> +}
> diff --git a/iolog.h b/iolog.h
> index 74f2170..739a7c8 100644
> --- a/iolog.h
> +++ b/iolog.h
> @@ -207,12 +207,18 @@ struct log_params {
>      int log_compress;
>  };
>
> -extern void finalize_logs(struct thread_data *td);
> +static inline bool per_unit_log(struct io_log *log)
> +{
> +    return log && !log->avg_msec;
> +}
> +
> +extern void finalize_logs(struct thread_data *td, bool);
>  extern void setup_log(struct io_log **, struct log_params *, const char
> *);
>  extern void flush_log(struct io_log *, int);
>  extern void flush_samples(FILE *, void *, uint64_t);
>  extern void free_log(struct io_log *);
> -extern void fio_writeout_logs(struct thread_data *);
> +extern void fio_writeout_logs(bool);
> +extern void td_writeout_logs(struct thread_data *, bool);
>  extern int iolog_flush(struct io_log *, int);
>
>  static inline void init_ipo(struct io_piece *ipo)
> diff --git a/libfio.c b/libfio.c
> index b17f148..790c036 100644
> --- a/libfio.c
> +++ b/libfio.c
> @@ -151,6 +151,7 @@ void reset_all_stats(struct thread_data *td)
>
>      lat_target_reset(td);
>      clear_rusage_stat(td);
> +    helper_reset();
>  }
>
>  void reset_fio_state(void)
> diff --git a/stat.c b/stat.c
> index 6d8d4d0..8c41914 100644
> --- a/stat.c
> +++ b/stat.c
> @@ -1862,13 +1862,21 @@ static void __add_log_sample(struct io_log
> *iolog, unsigned long val,
>          iolog->avg_last = t;
>
>      if (iolog->nr_samples == iolog->max_samples) {
> -        size_t new_size;
> +        size_t new_size, new_samples;
>          void *new_log;
>
> -        new_size = 2 * iolog->max_samples * log_entry_sz(iolog);
> +        if (!iolog->max_samples)
> +            new_samples = DEF_LOG_ENTRIES;
> +        else
> +            new_samples = iolog->max_samples * 2;
> +
> +        new_size = new_samples * log_entry_sz(iolog);
>
>          if (iolog->log_gz && (new_size > iolog->log_gz)) {
> -            if (iolog_flush(iolog, 0)) {
> +            if (!iolog->log) {
> +                iolog->log = malloc(new_size);
> +                iolog->max_samples = new_samples;
> +            } else if (iolog_flush(iolog, 0)) {
>                  log_err("fio: failed flushing iolog! Will stop
> logging.\n");
>                  iolog->disabled = 1;
>                  return;
> @@ -1882,7 +1890,7 @@ static void __add_log_sample(struct io_log *iolog,
> unsigned long val,
>                  return;
>              }
>              iolog->log = new_log;
> -            iolog->max_samples <<= 1;
> +            iolog->max_samples = new_samples;
>          }
>      }
>
> @@ -2013,21 +2021,21 @@ static void add_log_sample(struct thread_data
> *td, struct io_log *iolog,
>      iolog->avg_last = elapsed;
>  }
>
> -void finalize_logs(struct thread_data *td)
> +void finalize_logs(struct thread_data *td, bool unit_logs)
>  {
>      unsigned long elapsed;
>
>      elapsed = mtime_since_now(&td->epoch);
>
> -    if (td->clat_log)
> +    if (td->clat_log && unit_logs)
>          _add_stat_to_log(td->clat_log, elapsed, td->o.log_max != 0);
> -    if (td->slat_log)
> +    if (td->slat_log && unit_logs)
>          _add_stat_to_log(td->slat_log, elapsed, td->o.log_max != 0);
> -    if (td->lat_log)
> +    if (td->lat_log && unit_logs)
>          _add_stat_to_log(td->lat_log, elapsed, td->o.log_max != 0);
> -    if (td->bw_log)
> +    if (td->bw_log && (unit_logs == per_unit_log(td->bw_log)))
>          _add_stat_to_log(td->bw_log, elapsed, td->o.log_max != 0);
> -    if (td->iops_log)
> +    if (td->iops_log && (unit_logs == per_unit_log(td->iops_log)))
>          _add_stat_to_log(td->iops_log, elapsed, td->o.log_max != 0);
>  }
>
> @@ -2056,9 +2064,6 @@ void add_clat_sample(struct thread_data *td, enum
> fio_ddir ddir,
>  {
>      struct thread_stat *ts = &td->ts;
>
> -    if (!ddir_rw(ddir))
> -        return;
> -
>      td_io_u_lock(td);
>
>      add_stat_sample(&ts->clat_stat[ddir], usec);
> @@ -2108,18 +2113,41 @@ void add_lat_sample(struct thread_data *td, enum
> fio_ddir ddir,
>      td_io_u_unlock(td);
>  }
>
> -void add_bw_sample(struct thread_data *td, enum fio_ddir ddir, unsigned
> int bs,
> -           struct timeval *t)
> +void add_bw_sample(struct thread_data *td, struct io_u *io_u,
> +           unsigned int bytes, unsigned long spent)
> +{
> +    struct thread_stat *ts = &td->ts;
> +    unsigned long rate;
> +
> +    if (spent)
> +        rate = bytes * 1000 / spent;
> +    else
> +        rate = 0;
> +
> +    td_io_u_lock(td);
> +
> +    add_stat_sample(&ts->bw_stat[io_u->ddir], rate);
> +
> +    if (td->bw_log)
> +        add_log_sample(td, td->bw_log, rate, io_u->ddir, bytes,
> io_u->offset);
> +
> +    td->stat_io_bytes[io_u->ddir] = td->this_io_bytes[io_u->ddir];
> +    td_io_u_unlock(td);
> +}
> +
> +static int add_bw_samples(struct thread_data *td, struct timeval *t)
>  {
>      struct thread_stat *ts = &td->ts;
>      unsigned long spent, rate;
> +    enum fio_ddir ddir;
>
> -    if (!ddir_rw(ddir))
> -        return;
> +    if (per_unit_log(td->bw_log))
> +        return 0;
>
>      spent = mtime_since(&td->bw_sample_time, t);
> -    if (spent < td->o.bw_avg_time)
> -        return;
> +    if (spent < td->o.bw_avg_time &&
> +        td->o.bw_avg_time - spent >= 10)
> +        return td->o.bw_avg_time - spent;
>
>      td_io_u_lock(td);
>
> @@ -2141,27 +2169,50 @@ void add_bw_sample(struct thread_data *td, enum
> fio_ddir ddir, unsigned int bs,
>          add_stat_sample(&ts->bw_stat[ddir], rate);
>
>          if (td->bw_log)
> -            add_log_sample(td, td->bw_log, rate, ddir, bs, 0);
> +            add_log_sample(td, td->bw_log, rate, ddir, 0, 0);
>
>          td->stat_io_bytes[ddir] = td->this_io_bytes[ddir];
>      }
>
> -    fio_gettime(&td->bw_sample_time, NULL);
> +    timeval_add_msec(&td->bw_sample_time, td->o.bw_avg_time);
> +
> +    td_io_u_unlock(td);
> +
> +    if (spent <= td->o.bw_avg_time)
> +        return td->o.bw_avg_time;
> +
> +    return td->o.bw_avg_time - (1 + spent - td->o.bw_avg_time);
> +}
> +
> +void add_iops_sample(struct thread_data *td, struct io_u *io_u,
> +             unsigned int bytes)
> +{
> +    struct thread_stat *ts = &td->ts;
> +
> +    td_io_u_lock(td);
> +
> +    add_stat_sample(&ts->iops_stat[io_u->ddir], 1);
> +
> +    if (td->iops_log)
> +        add_log_sample(td, td->iops_log, 1, io_u->ddir, bytes,
> io_u->offset);
> +
> +    td->stat_io_blocks[io_u->ddir] = td->this_io_blocks[io_u->ddir];
>      td_io_u_unlock(td);
>  }
>
> -void add_iops_sample(struct thread_data *td, enum fio_ddir ddir,
> unsigned int bs,
> -             struct timeval *t)
> +static int add_iops_samples(struct thread_data *td, struct timeval *t)
>  {
>      struct thread_stat *ts = &td->ts;
>      unsigned long spent, iops;
> +    enum fio_ddir ddir;
>
> -    if (!ddir_rw(ddir))
> -        return;
> +    if (per_unit_log(td->iops_log))
> +        return 0;
>
>      spent = mtime_since(&td->iops_sample_time, t);
> -    if (spent < td->o.iops_avg_time)
> -        return;
> +    if (spent < td->o.iops_avg_time &&
> +        td->o.iops_avg_time - spent >= 10)
> +        return td->o.iops_avg_time - spent;
>
>      td_io_u_lock(td);
>
> @@ -2183,13 +2234,53 @@ void add_iops_sample(struct thread_data *td,
> enum fio_ddir ddir, unsigned int bs
>          add_stat_sample(&ts->iops_stat[ddir], iops);
>
>          if (td->iops_log)
> -            add_log_sample(td, td->iops_log, iops, ddir, bs, 0);
> +            add_log_sample(td, td->iops_log, iops, ddir, 0, 0);
>
>          td->stat_io_blocks[ddir] = td->this_io_blocks[ddir];
>      }
>
> -    fio_gettime(&td->iops_sample_time, NULL);
> +    timeval_add_msec(&td->iops_sample_time, td->o.iops_avg_time);
> +
>      td_io_u_unlock(td);
> +
> +    if (spent <= td->o.iops_avg_time)
> +        return td->o.iops_avg_time;
> +
> +    printf("%lu over\n", spent - td->o.iops_avg_time);
> +    return td->o.iops_avg_time - (1 + spent - td->o.iops_avg_time);
> +}
> +
> +/*
> + * Returns msecs to next event
> + */
> +int calc_log_samples(void)
> +{
> +    struct thread_data *td;
> +    unsigned int next = ~0U, tmp;
> +    struct timeval now;
> +    int i;
> +
> +    fio_gettime(&now, NULL);
> +
> +    for_each_td(td, i) {
> +        if (!ramp_time_over(td) ||
> +            !(td->runstate == TD_RUNNING || td->runstate ==
> TD_VERIFYING)) {
> +            next = min(td->o.iops_avg_time, td->o.bw_avg_time);
> +            continue;
> +        }
> +        if (!per_unit_log(td->bw_log)) {
> +            tmp = add_bw_samples(td, &now);
> +            if (tmp < next)
> +                next = tmp;
> +        }
> +        if (!per_unit_log(td->iops_log)) {
> +            tmp = add_iops_samples(td, &now);
> +            if (tmp < next)
> +                next = tmp;
> +        }
> +    }
> +
> +    return next == ~0U ? 0 : next;
>  }
>
>  void stat_init(void)
> @@ -2212,8 +2303,7 @@ void stat_exit(void)
>   */
>  void show_running_run_stats(void)
>  {
> -    helper_do_stat = 1;
> -    pthread_cond_signal(&helper_cond);
> +    helper_do_stat();
>  }
>
>  uint32_t *io_u_block_info(struct thread_data *td, struct io_u *io_u)
> diff --git a/stat.h b/stat.h
> index 9c3f192..86f1a0b 100644
> --- a/stat.h
> +++ b/stat.h
> @@ -276,11 +276,12 @@ extern void add_clat_sample(struct thread_data *,
> enum fio_ddir, unsigned long,
>                  unsigned int, uint64_t);
>  extern void add_slat_sample(struct thread_data *, enum fio_ddir,
> unsigned long,
>                  unsigned int, uint64_t);
> -extern void add_bw_sample(struct thread_data *, enum fio_ddir, unsigned
> int,
> -                struct timeval *);
> -extern void add_iops_sample(struct thread_data *, enum fio_ddir,
> unsigned int,
> -                struct timeval *);
>  extern void add_agg_sample(unsigned long, enum fio_ddir, unsigned int);
> +extern void add_iops_sample(struct thread_data *, struct io_u *,
> +                unsigned int);
> +extern void add_bw_sample(struct thread_data *, struct io_u *,
> +                unsigned int, unsigned long);
> +extern int calc_log_samples(void);
>
>  extern struct io_log *agg_io_log[DDIR_RWDIR_CNT];
>  extern int write_bw_log;
> diff --git a/time.c b/time.c
> index b145e90..7c7fa96 100644
> --- a/time.c
> +++ b/time.c
> @@ -6,6 +6,15 @@
>  static struct timeval genesis;
>  static unsigned long ns_granularity;
>
> +void timeval_add_msec(struct timeval *tv, unsigned int msec)
> +{
> +    tv->tv_usec += 1000 * msec;
> +    if (tv->tv_usec > 1000000) {
> +        tv->tv_usec -= 1000000;
> +        tv->tv_sec++;
> +    }
> +}
> +
>  /*
>   * busy looping version for the last few usec
>   */
> diff --git a/workqueue.c b/workqueue.c
> index 6e67f3e..4f9c414 100644
> --- a/workqueue.c
> +++ b/workqueue.c
> @@ -9,6 +9,7 @@
>  #include "fio.h"
>  #include "flist.h"
>  #include "workqueue.h"
> +#include "smalloc.h"
>
>  enum {
>      SW_F_IDLE    = 1 << 0,
> @@ -263,7 +264,7 @@ void workqueue_exit(struct workqueue *wq)
>          }
>      } while (shutdown && shutdown != wq->max_workers);
>
> -    free(wq->workers);
> +    sfree(wq->workers);
>      wq->workers = NULL;
>      pthread_mutex_destroy(&wq->flush_lock);
>      pthread_cond_destroy(&wq->flush_cond);
> @@ -317,7 +318,7 @@ int workqueue_init(struct thread_data *td, struct
> workqueue *wq,
>      pthread_mutex_init(&wq->flush_lock, NULL);
>      pthread_mutex_init(&wq->stat_lock, NULL);
>
> -    wq->workers = calloc(wq->max_workers, sizeof(struct submit_worker));
> +    wq->workers = smalloc(wq->max_workers * sizeof(struct submit_worker));
>
>      for (i = 0; i < wq->max_workers; i++)
>          if (start_worker(wq, i, sk_out))
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Missing log entries and other questions
  2016-05-06 17:16                   ` Mark Nelson
@ 2016-05-06 17:18                     ` Jens Axboe
  0 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2016-05-06 17:18 UTC (permalink / raw)
  To: Mark Nelson, Paul Alcorn; +Cc: 'Mihkal Dunfjeld', fio, birger.lunstoeng

On 05/06/2016 11:16 AM, Mark Nelson wrote:
> On 05/06/2016 12:11 PM, Jens Axboe wrote:
>> On 05/05/2016 01:33 PM, Mark Nelson wrote:
>>>>>>> Can you try the below patch?
>>>>>>
>>>>>> I'll also try to take a look at this as we've seen a fair amount of
>>>>>> skew
>>>>>> in the logs doing fio librbd upstream ceph testing.
>>>>>
>>>>> That'd be great. More complete version below, fwiw.
>>>
>>> Ok, I'll try this one.  It might be a bit before I can test it since I'm
>>> bisecting ceph right now looking for a regression that only hits 128k
>>> sequential rbd reads.  Go figure. :)
>>
>> When you get to it, here's an updated version. I'm tempted to shove this
>> into a new branch, instead of flinging diffs around... Let me know what
>> you prefer.
>
> new branch would be wonderful!

Done, it's now in the 'logging' branch:

http://git.kernel.dk/cgit/fio/commit/?h=logging&id=a47591e4923fb8faef18e1cd5125a50429282089

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Missing log entries and other questions
  2016-05-06 17:11               ` Jens Axboe
@ 2016-05-06 17:43                 ` Mark Nelson
  2016-05-06 17:44                   ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Mark Nelson @ 2016-05-06 17:43 UTC (permalink / raw)
  To: Jens Axboe, Paul Alcorn; +Cc: 'Mihkal Dunfjeld', fio, birger.lunstoeng



On 05/06/2016 12:11 PM, Jens Axboe wrote:
> On 05/05/2016 01:36 PM, Mark Nelson wrote:
>>
>>
>> On 05/04/2016 03:51 PM, Jens Axboe wrote:
>>> On 05/04/2016 11:43 AM, Mark Nelson wrote:
>>>> On 05/04/2016 11:34 AM, Jens Axboe wrote:
>>>>> On 05/04/2016 08:10 AM, Jens Axboe wrote:
>>>>>> On 04/27/2016 02:38 PM, Jens Axboe wrote:
>>>>>>> I'll try and see if I can reproduce this.
>>>>>>
>>>>>> So I took a look at your logged output. The way that fio logs is
>>>>>> that it
>>>>>> does it on IO completion. You have set 1s intervals, so it checks,
>>>>>> when
>>>>>> an IO completes, if we're due for logging the previous interval IOPS:
>>>>>>
>>>>>> spent = mtime_since(&td->iops_sample_time, t);
>>>>>> if (spent < td->o.iops_avg_time)
>>>>>>          return;
>>>>>>
>>>>>> If you are doing lots of IO, it matches pretty evenly. But for slow
>>>>>> workloads, it will end up being skewed a bit. It won't be too hard to
>>>>>> fix the drift, though, but it still won't make it clock precise at
>>>>>> exact
>>>>>> 1s intervals.
>>>>>>
>>>>>> We could move this part into the main thread, and out of the workers.
>>>>>> Then we could make it more precise. Might not be a bad idea in
>>>>>> general,
>>>>>> since it'll move some of this code out of the fast path.
>>>>>>
>>>>>> I'll take a stab at it.
>>>>>
>>>>> Can you try the below patch?
>>>>
>>>> I'll also try to take a look at this as we've seen a fair amount of
>>>> skew
>>>> in the logs doing fio librbd upstream ceph testing.
>>>
>>> That'd be great. More complete version below, fwiw.
>>>
>>>> I actually wrote a
>>>> first stab at a parser to aggregate logs from multiple fio processes
>>>> with the assumption of non-uniform sampling:
>>>>
>>>> https://github.com/ceph/cbt/blob/master/tools/fiologpaser.py
>>>>
>>>> The idea is you have say 8 fio bw logs in a directory you can do:
>>>>
>>>> fiologparser.py -f *bw*
>>>>
>>>> which will return average values from the logs over user defined
>>>> intervals by taking weighted scores from samples that overlap.  You can
>>>> also look at per-interval averages, sums, or just a global average.
>>>>
>>>> various option flags:
>>>>
>>>> -i <interval time in ms>
>>>> -d <divide results by d (say if you want a different unit of output)
>>>> -f <show full interval data>
>>>> -a <show average interval data across logs>
>>>> -s <show sums for interval data across logs>
>>>
>>> That looks useful. Might want to put that in the fio repo as well, if
>>> you are interested.
>>
>> Sure, I'm game.  Might be a little buggy and it's definitely slow since
>> I just threw it together the other day.  Do you have a place in the repo
>> you like throwing parsing scripts?
>
> Usually I put them in the tools/ directory.

Done: https://github.com/axboe/fio/pull/170


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Missing log entries and other questions
  2016-05-06 17:43                 ` Mark Nelson
@ 2016-05-06 17:44                   ` Jens Axboe
  0 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2016-05-06 17:44 UTC (permalink / raw)
  To: Mark Nelson, Paul Alcorn; +Cc: 'Mihkal Dunfjeld', fio, birger.lunstoeng

On 05/06/2016 11:43 AM, Mark Nelson wrote:
>
>
> On 05/06/2016 12:11 PM, Jens Axboe wrote:
>> On 05/05/2016 01:36 PM, Mark Nelson wrote:
>>>
>>>
>>> On 05/04/2016 03:51 PM, Jens Axboe wrote:
>>>> On 05/04/2016 11:43 AM, Mark Nelson wrote:
>>>>> On 05/04/2016 11:34 AM, Jens Axboe wrote:
>>>>>> On 05/04/2016 08:10 AM, Jens Axboe wrote:
>>>>>>> On 04/27/2016 02:38 PM, Jens Axboe wrote:
>>>>>>>> I'll try and see if I can reproduce this.
>>>>>>>
>>>>>>> So I took a look at your logged output. The way that fio logs is
>>>>>>> that it
>>>>>>> does it on IO completion. You have set 1s intervals, so it checks,
>>>>>>> when
>>>>>>> an IO completes, if we're due for logging the previous interval
>>>>>>> IOPS:
>>>>>>>
>>>>>>> spent = mtime_since(&td->iops_sample_time, t);
>>>>>>> if (spent < td->o.iops_avg_time)
>>>>>>>          return;
>>>>>>>
>>>>>>> If you are doing lots of IO, it matches pretty evenly. But for slow
>>>>>>> workloads, it will end up being skewed a bit. It won't be too
>>>>>>> hard to
>>>>>>> fix the drift, though, but it still won't make it clock precise at
>>>>>>> exact
>>>>>>> 1s intervals.
>>>>>>>
>>>>>>> We could move this part into the main thread, and out of the
>>>>>>> workers.
>>>>>>> Then we could make it more precise. Might not be a bad idea in
>>>>>>> general,
>>>>>>> since it'll move some of this code out of the fast path.
>>>>>>>
>>>>>>> I'll take a stab at it.
>>>>>>
>>>>>> Can you try the below patch?
>>>>>
>>>>> I'll also try to take a look at this as we've seen a fair amount of
>>>>> skew
>>>>> in the logs doing fio librbd upstream ceph testing.
>>>>
>>>> That'd be great. More complete version below, fwiw.
>>>>
>>>>> I actually wrote a
>>>>> first stab at a parser to aggregate logs from multiple fio processes
>>>>> with the assumption of non-uniform sampling:
>>>>>
>>>>> https://github.com/ceph/cbt/blob/master/tools/fiologpaser.py
>>>>>
>>>>> The idea is you have say 8 fio bw logs in a directory you can do:
>>>>>
>>>>> fiologparser.py -f *bw*
>>>>>
>>>>> which will return average values from the logs over user defined
>>>>> intervals by taking weighted scores from samples that overlap.  You
>>>>> can
>>>>> also look at per-interval averages, sums, or just a global average.
>>>>>
>>>>> various option flags:
>>>>>
>>>>> -i <interval time in ms>
>>>>> -d <divide results by d (say if you want a different unit of output)
>>>>> -f <show full interval data>
>>>>> -a <show average interval data across logs>
>>>>> -s <show sums for interval data across logs>
>>>>
>>>> That looks useful. Might want to put that in the fio repo as well, if
>>>> you are interested.
>>>
>>> Sure, I'm game.  Might be a little buggy and it's definitely slow since
>>> I just threw it together the other day.  Do you have a place in the repo
>>> you like throwing parsing scripts?
>>
>> Usually I put them in the tools/ directory.
>
> Done: https://github.com/axboe/fio/pull/170

Merged. I added it to the installable scripts as well.


-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2016-05-06 17:44 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <eBGJ1s00E1TeosZ01BGKGG>
2016-04-26  2:22 ` Missing log entries and other questions Paul Alcorn
2016-04-27 20:38   ` Jens Axboe
2016-05-04 14:10     ` Jens Axboe
2016-05-04 16:34       ` Jens Axboe
2016-05-04 17:43         ` Mark Nelson
2016-05-04 20:51           ` Jens Axboe
2016-05-04 21:30             ` Jens Axboe
2016-05-05 19:33               ` Mark Nelson
2016-05-06 17:11                 ` Jens Axboe
2016-05-06 17:16                   ` Mark Nelson
2016-05-06 17:18                     ` Jens Axboe
2016-05-05 19:36             ` Mark Nelson
2016-05-06 17:11               ` Jens Axboe
2016-05-06 17:43                 ` Mark Nelson
2016-05-06 17:44                   ` Jens Axboe
2016-04-04 11:08 Mihkal Dunfjeld

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox