Flexible I/O Tester development
 help / color / mirror / Atom feed
* bug in logging code?
@ 2014-04-02 15:23 Mark Nelson
  2014-04-02 17:38 ` Jens Axboe
  0 siblings, 1 reply; 11+ messages in thread
From: Mark Nelson @ 2014-04-02 15:23 UTC (permalink / raw)
  To: fio

I've been playing around with usng the various write_*_log options in 
fio to record run statistics.  What I'm finding is that some of the logs 
have malformed lines in them.  The following snippet is from the bw log:

811820, 12263, 0, 4194304
812619, 35929, 0, 4194304
818437, 8448, 0, 419430, 4194304
1256299, 160627, 0, 4194304
1256831, 154274, 0, 4194304

Notice that the malformed line appears to be a partial write that gets 
truncated.  The fio command being used to run the tests is 
autogenerated, so job files aren't used.  here's the one that created 
the above log snippet:

/home/nhm/bin/fio
--rw=read
--ioengine=libaio
--runtime=7200
--ramp_time=0
--numjobs=1
--direct=1
--bs=4194304B
--iodepth=16
--size=9216M
--write_iops_log=/tmp/cbt/ceph/kvmrbdfio/osd_ra-00004096/client_ra-00000128/op_size-04194304/concurrent_procs-008/iodepth-016/read/output 

--write_bw_log=/tmp/cbt/ceph/kvmrbdfio/osd_ra-00004096/client_ra-00000128/op_size-04194304/concurrent_procs-008/iodepth-016/read/output 

--write_lat_log=/tmp/cbt/ceph/kvmrbdfio/osd_ra-00004096/client_ra-00000128/op_size-04194304/concurrent_procs-008/iodepth-016/read/output 

--time_based
--name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-0
--name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-1
--name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-2
--name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-3
--name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-4
--name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-5
--name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-6
--name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-7

The --numjobs=1 line should be removed, but I believe it gets overridden 
anyway with multiple name parameters.  I wonder if having multiple name 
parameters (ie multiple fio processes) is causing those processes to all 
try to write to the same log files, potentially at the same time?  Any 
thoughts would be much appreicated!

Thanks,
Mark

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bug in logging code?
  2014-04-02 15:23 bug in logging code? Mark Nelson
@ 2014-04-02 17:38 ` Jens Axboe
  2014-04-02 17:53   ` Mark Nelson
  0 siblings, 1 reply; 11+ messages in thread
From: Jens Axboe @ 2014-04-02 17:38 UTC (permalink / raw)
  To: Mark Nelson, fio

On 04/02/2014 09:23 AM, Mark Nelson wrote:
> I've been playing around with usng the various write_*_log options in
> fio to record run statistics.  What I'm finding is that some of the logs
> have malformed lines in them.  The following snippet is from the bw log:
>
> 811820, 12263, 0, 4194304
> 812619, 35929, 0, 4194304
> 818437, 8448, 0, 419430, 4194304
> 1256299, 160627, 0, 4194304
> 1256831, 154274, 0, 4194304
>
> Notice that the malformed line appears to be a partial write that gets
> truncated.  The fio command being used to run the tests is
> autogenerated, so job files aren't used.  here's the one that created
> the above log snippet:
>
> /home/nhm/bin/fio
> --rw=read
> --ioengine=libaio
> --runtime=7200
> --ramp_time=0
> --numjobs=1
> --direct=1
> --bs=4194304B
> --iodepth=16
> --size=9216M
> --write_iops_log=/tmp/cbt/ceph/kvmrbdfio/osd_ra-00004096/client_ra-00000128/op_size-04194304/concurrent_procs-008/iodepth-016/read/output
>
> --write_bw_log=/tmp/cbt/ceph/kvmrbdfio/osd_ra-00004096/client_ra-00000128/op_size-04194304/concurrent_procs-008/iodepth-016/read/output
>
> --write_lat_log=/tmp/cbt/ceph/kvmrbdfio/osd_ra-00004096/client_ra-00000128/op_size-04194304/concurrent_procs-008/iodepth-016/read/output
>
> --time_based
> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-0
> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-1
> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-2
> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-3
> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-4
> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-5
> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-6
> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-7
>
> The --numjobs=1 line should be removed, but I believe it gets overridden
> anyway with multiple name parameters.  I wonder if having multiple name
> parameters (ie multiple fio processes) is causing those processes to all
> try to write to the same log files, potentially at the same time?  Any
> thoughts would be much appreicated!

What version of fio are you running?

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bug in logging code?
  2014-04-02 17:38 ` Jens Axboe
@ 2014-04-02 17:53   ` Mark Nelson
  2014-04-02 17:55     ` Jens Axboe
  0 siblings, 1 reply; 11+ messages in thread
From: Mark Nelson @ 2014-04-02 17:53 UTC (permalink / raw)
  To: Jens Axboe, fio

On 04/02/2014 12:38 PM, Jens Axboe wrote:
> On 04/02/2014 09:23 AM, Mark Nelson wrote:
>> I've been playing around with usng the various write_*_log options in
>> fio to record run statistics.  What I'm finding is that some of the logs
>> have malformed lines in them.  The following snippet is from the bw log:
>>
>> 811820, 12263, 0, 4194304
>> 812619, 35929, 0, 4194304
>> 818437, 8448, 0, 419430, 4194304
>> 1256299, 160627, 0, 4194304
>> 1256831, 154274, 0, 4194304
>>
>> Notice that the malformed line appears to be a partial write that gets
>> truncated.  The fio command being used to run the tests is
>> autogenerated, so job files aren't used.  here's the one that created
>> the above log snippet:
>>
>> /home/nhm/bin/fio
>> --rw=read
>> --ioengine=libaio
>> --runtime=7200
>> --ramp_time=0
>> --numjobs=1
>> --direct=1
>> --bs=4194304B
>> --iodepth=16
>> --size=9216M
>> --write_iops_log=/tmp/cbt/ceph/kvmrbdfio/osd_ra-00004096/client_ra-00000128/op_size-04194304/concurrent_procs-008/iodepth-016/read/output
>>
>>
>> --write_bw_log=/tmp/cbt/ceph/kvmrbdfio/osd_ra-00004096/client_ra-00000128/op_size-04194304/concurrent_procs-008/iodepth-016/read/output
>>
>>
>> --write_lat_log=/tmp/cbt/ceph/kvmrbdfio/osd_ra-00004096/client_ra-00000128/op_size-04194304/concurrent_procs-008/iodepth-016/read/output
>>
>>
>> --time_based
>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-0
>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-1
>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-2
>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-3
>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-4
>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-5
>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-6
>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-7
>>
>> The --numjobs=1 line should be removed, but I believe it gets overridden
>> anyway with multiple name parameters.  I wonder if having multiple name
>> parameters (ie multiple fio processes) is causing those processes to all
>> try to write to the same log files, potentially at the same time?  Any
>> thoughts would be much appreicated!
>
> What version of fio are you running?
>

f42557f (ie master from yesterday) with a slightly different fix for the 
basename issues that just got resolved.

I can verify now that running fio with a single --name flag appears to 
generate logs without malformated lines.

Mark


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bug in logging code?
  2014-04-02 17:53   ` Mark Nelson
@ 2014-04-02 17:55     ` Jens Axboe
  2014-04-02 18:24       ` Mark Nelson
  0 siblings, 1 reply; 11+ messages in thread
From: Jens Axboe @ 2014-04-02 17:55 UTC (permalink / raw)
  To: Mark Nelson, fio

On 04/02/2014 11:53 AM, Mark Nelson wrote:
> On 04/02/2014 12:38 PM, Jens Axboe wrote:
>> On 04/02/2014 09:23 AM, Mark Nelson wrote:
>>> I've been playing around with usng the various write_*_log options in
>>> fio to record run statistics.  What I'm finding is that some of the logs
>>> have malformed lines in them.  The following snippet is from the bw log:
>>>
>>> 811820, 12263, 0, 4194304
>>> 812619, 35929, 0, 4194304
>>> 818437, 8448, 0, 419430, 4194304
>>> 1256299, 160627, 0, 4194304
>>> 1256831, 154274, 0, 4194304
>>>
>>> Notice that the malformed line appears to be a partial write that gets
>>> truncated.  The fio command being used to run the tests is
>>> autogenerated, so job files aren't used.  here's the one that created
>>> the above log snippet:
>>>
>>> /home/nhm/bin/fio
>>> --rw=read
>>> --ioengine=libaio
>>> --runtime=7200
>>> --ramp_time=0
>>> --numjobs=1
>>> --direct=1
>>> --bs=4194304B
>>> --iodepth=16
>>> --size=9216M
>>> --write_iops_log=/tmp/cbt/ceph/kvmrbdfio/osd_ra-00004096/client_ra-00000128/op_size-04194304/concurrent_procs-008/iodepth-016/read/output
>>>
>>>
>>>
>>> --write_bw_log=/tmp/cbt/ceph/kvmrbdfio/osd_ra-00004096/client_ra-00000128/op_size-04194304/concurrent_procs-008/iodepth-016/read/output
>>>
>>>
>>>
>>> --write_lat_log=/tmp/cbt/ceph/kvmrbdfio/osd_ra-00004096/client_ra-00000128/op_size-04194304/concurrent_procs-008/iodepth-016/read/output
>>>
>>>
>>>
>>> --time_based
>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-0
>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-1
>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-2
>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-3
>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-4
>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-5
>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-6
>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-7
>>>
>>> The --numjobs=1 line should be removed, but I believe it gets overridden
>>> anyway with multiple name parameters.  I wonder if having multiple name
>>> parameters (ie multiple fio processes) is causing those processes to all
>>> try to write to the same log files, potentially at the same time?  Any
>>> thoughts would be much appreicated!
>>
>> What version of fio are you running?
>>
>
> f42557f (ie master from yesterday) with a slightly different fix for the
> basename issues that just got resolved.
>
> I can verify now that running fio with a single --name flag appears to
> generate logs without malformated lines.

OK, I'll take a look, We should already be serializing the writing to 
the same log files, but your example definitely looks like we are not.

BTW, the numjobs option applies to each --name section. Just to clear 
that up from your original email. So you really get numjobs * 
name_sections running.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bug in logging code?
  2014-04-02 17:55     ` Jens Axboe
@ 2014-04-02 18:24       ` Mark Nelson
  2014-04-02 18:55         ` Jens Axboe
  0 siblings, 1 reply; 11+ messages in thread
From: Mark Nelson @ 2014-04-02 18:24 UTC (permalink / raw)
  To: Jens Axboe, fio

On 04/02/2014 12:55 PM, Jens Axboe wrote:
> On 04/02/2014 11:53 AM, Mark Nelson wrote:
>> On 04/02/2014 12:38 PM, Jens Axboe wrote:
>>> On 04/02/2014 09:23 AM, Mark Nelson wrote:
>>>> I've been playing around with usng the various write_*_log options in
>>>> fio to record run statistics.  What I'm finding is that some of the
>>>> logs
>>>> have malformed lines in them.  The following snippet is from the bw
>>>> log:
>>>>
>>>> 811820, 12263, 0, 4194304
>>>> 812619, 35929, 0, 4194304
>>>> 818437, 8448, 0, 419430, 4194304
>>>> 1256299, 160627, 0, 4194304
>>>> 1256831, 154274, 0, 4194304
>>>>
>>>> Notice that the malformed line appears to be a partial write that gets
>>>> truncated.  The fio command being used to run the tests is
>>>> autogenerated, so job files aren't used.  here's the one that created
>>>> the above log snippet:
>>>>
>>>> /home/nhm/bin/fio
>>>> --rw=read
>>>> --ioengine=libaio
>>>> --runtime=7200
>>>> --ramp_time=0
>>>> --numjobs=1
>>>> --direct=1
>>>> --bs=4194304B
>>>> --iodepth=16
>>>> --size=9216M
>>>> --write_iops_log=/tmp/cbt/ceph/kvmrbdfio/osd_ra-00004096/client_ra-00000128/op_size-04194304/concurrent_procs-008/iodepth-016/read/output
>>>>
>>>>
>>>>
>>>>
>>>> --write_bw_log=/tmp/cbt/ceph/kvmrbdfio/osd_ra-00004096/client_ra-00000128/op_size-04194304/concurrent_procs-008/iodepth-016/read/output
>>>>
>>>>
>>>>
>>>>
>>>> --write_lat_log=/tmp/cbt/ceph/kvmrbdfio/osd_ra-00004096/client_ra-00000128/op_size-04194304/concurrent_procs-008/iodepth-016/read/output
>>>>
>>>>
>>>>
>>>>
>>>> --time_based
>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-0
>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-1
>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-2
>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-3
>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-4
>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-5
>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-6
>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-7
>>>>
>>>> The --numjobs=1 line should be removed, but I believe it gets
>>>> overridden
>>>> anyway with multiple name parameters.  I wonder if having multiple name
>>>> parameters (ie multiple fio processes) is causing those processes to
>>>> all
>>>> try to write to the same log files, potentially at the same time?  Any
>>>> thoughts would be much appreicated!
>>>
>>> What version of fio are you running?
>>>
>>
>> f42557f (ie master from yesterday) with a slightly different fix for the
>> basename issues that just got resolved.
>>
>> I can verify now that running fio with a single --name flag appears to
>> generate logs without malformated lines.
>
> OK, I'll take a look, We should already be serializing the writing to
> the same log files, but your example definitely looks like we are not.
>
> BTW, the numjobs option applies to each --name section. Just to clear
> that up from your original email. So you really get numjobs *
> name_sections running.

Aha, thanks for clearing that up!

Is github the right place to create new issues?  If so, want me to 
create one for this?

Mark


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bug in logging code?
  2014-04-02 18:24       ` Mark Nelson
@ 2014-04-02 18:55         ` Jens Axboe
  2014-04-02 20:38           ` Mark Nelson
  0 siblings, 1 reply; 11+ messages in thread
From: Jens Axboe @ 2014-04-02 18:55 UTC (permalink / raw)
  To: Mark Nelson, fio

On 04/02/2014 12:24 PM, Mark Nelson wrote:
> On 04/02/2014 12:55 PM, Jens Axboe wrote:
>> On 04/02/2014 11:53 AM, Mark Nelson wrote:
>>> On 04/02/2014 12:38 PM, Jens Axboe wrote:
>>>> On 04/02/2014 09:23 AM, Mark Nelson wrote:
>>>>> I've been playing around with usng the various write_*_log options in
>>>>> fio to record run statistics.  What I'm finding is that some of the
>>>>> logs
>>>>> have malformed lines in them.  The following snippet is from the bw
>>>>> log:
>>>>>
>>>>> 811820, 12263, 0, 4194304
>>>>> 812619, 35929, 0, 4194304
>>>>> 818437, 8448, 0, 419430, 4194304
>>>>> 1256299, 160627, 0, 4194304
>>>>> 1256831, 154274, 0, 4194304
>>>>>
>>>>> Notice that the malformed line appears to be a partial write that gets
>>>>> truncated.  The fio command being used to run the tests is
>>>>> autogenerated, so job files aren't used.  here's the one that created
>>>>> the above log snippet:
>>>>>
>>>>> /home/nhm/bin/fio
>>>>> --rw=read
>>>>> --ioengine=libaio
>>>>> --runtime=7200
>>>>> --ramp_time=0
>>>>> --numjobs=1
>>>>> --direct=1
>>>>> --bs=4194304B
>>>>> --iodepth=16
>>>>> --size=9216M
>>>>> --write_iops_log=/tmp/cbt/ceph/kvmrbdfio/osd_ra-00004096/client_ra-00000128/op_size-04194304/concurrent_procs-008/iodepth-016/read/output
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --write_bw_log=/tmp/cbt/ceph/kvmrbdfio/osd_ra-00004096/client_ra-00000128/op_size-04194304/concurrent_procs-008/iodepth-016/read/output
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --write_lat_log=/tmp/cbt/ceph/kvmrbdfio/osd_ra-00004096/client_ra-00000128/op_size-04194304/concurrent_procs-008/iodepth-016/read/output
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --time_based
>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-0
>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-1
>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-2
>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-3
>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-4
>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-5
>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-6
>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-7
>>>>>
>>>>> The --numjobs=1 line should be removed, but I believe it gets
>>>>> overridden
>>>>> anyway with multiple name parameters.  I wonder if having multiple
>>>>> name
>>>>> parameters (ie multiple fio processes) is causing those processes to
>>>>> all
>>>>> try to write to the same log files, potentially at the same time?  Any
>>>>> thoughts would be much appreicated!
>>>>
>>>> What version of fio are you running?
>>>>
>>>
>>> f42557f (ie master from yesterday) with a slightly different fix for the
>>> basename issues that just got resolved.
>>>
>>> I can verify now that running fio with a single --name flag appears to
>>> generate logs without malformated lines.
>>
>> OK, I'll take a look, We should already be serializing the writing to
>> the same log files, but your example definitely looks like we are not.
>>
>> BTW, the numjobs option applies to each --name section. Just to clear
>> that up from your original email. So you really get numjobs *
>> name_sections running.
>
> Aha, thanks for clearing that up!
>
> Is github the right place to create new issues?  If so, want me to
> create one for this?

Sure, you can go ahead and do that.

I'm puzzled by this report. The writeout of the logs is fully 
serialized, there should be no interleaving. Looking at your output, 
it's definitely either the output from two jobs, or mangled output from 
just one job. The corrupted line is at time 818437, and the next line is 
much further ahead. Normally you would expect to see:

500, some-bw, 0, blocksize		<- job X
1000, some-bw, 0, blocksize
[...]
end-time, some-bw, 0, blocksize		<- last entry from job X
500, some-bw, 0, blocksize		<- job Y now starts

Note that the log writing always appends. Did you possibly have this 
half-done files in that location from a previous run?

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bug in logging code?
  2014-04-02 18:55         ` Jens Axboe
@ 2014-04-02 20:38           ` Mark Nelson
  2014-04-03  3:24             ` Jens Axboe
  0 siblings, 1 reply; 11+ messages in thread
From: Mark Nelson @ 2014-04-02 20:38 UTC (permalink / raw)
  To: Jens Axboe, fio

On 04/02/2014 01:55 PM, Jens Axboe wrote:
> On 04/02/2014 12:24 PM, Mark Nelson wrote:
>> On 04/02/2014 12:55 PM, Jens Axboe wrote:
>>> On 04/02/2014 11:53 AM, Mark Nelson wrote:
>>>> On 04/02/2014 12:38 PM, Jens Axboe wrote:
>>>>> On 04/02/2014 09:23 AM, Mark Nelson wrote:
>>>>>> I've been playing around with usng the various write_*_log options in
>>>>>> fio to record run statistics.  What I'm finding is that some of the
>>>>>> logs
>>>>>> have malformed lines in them.  The following snippet is from the bw
>>>>>> log:
>>>>>>
>>>>>> 811820, 12263, 0, 4194304
>>>>>> 812619, 35929, 0, 4194304
>>>>>> 818437, 8448, 0, 419430, 4194304
>>>>>> 1256299, 160627, 0, 4194304
>>>>>> 1256831, 154274, 0, 4194304
>>>>>>
>>>>>> Notice that the malformed line appears to be a partial write that
>>>>>> gets
>>>>>> truncated.  The fio command being used to run the tests is
>>>>>> autogenerated, so job files aren't used.  here's the one that created
>>>>>> the above log snippet:
>>>>>>
>>>>>> /home/nhm/bin/fio
>>>>>> --rw=read
>>>>>> --ioengine=libaio
>>>>>> --runtime=7200
>>>>>> --ramp_time=0
>>>>>> --numjobs=1
>>>>>> --direct=1
>>>>>> --bs=4194304B
>>>>>> --iodepth=16
>>>>>> --size=9216M
>>>>>> --write_iops_log=/tmp/cbt/ceph/kvmrbdfio/osd_ra-00004096/client_ra-00000128/op_size-04194304/concurrent_procs-008/iodepth-016/read/output
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --write_bw_log=/tmp/cbt/ceph/kvmrbdfio/osd_ra-00004096/client_ra-00000128/op_size-04194304/concurrent_procs-008/iodepth-016/read/output
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --write_lat_log=/tmp/cbt/ceph/kvmrbdfio/osd_ra-00004096/client_ra-00000128/op_size-04194304/concurrent_procs-008/iodepth-016/read/output
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --time_based
>>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-0
>>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-1
>>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-2
>>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-3
>>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-4
>>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-5
>>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-6
>>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-7
>>>>>>
>>>>>> The --numjobs=1 line should be removed, but I believe it gets
>>>>>> overridden
>>>>>> anyway with multiple name parameters.  I wonder if having multiple
>>>>>> name
>>>>>> parameters (ie multiple fio processes) is causing those processes to
>>>>>> all
>>>>>> try to write to the same log files, potentially at the same time?
>>>>>> Any
>>>>>> thoughts would be much appreicated!
>>>>>
>>>>> What version of fio are you running?
>>>>>
>>>>
>>>> f42557f (ie master from yesterday) with a slightly different fix for
>>>> the
>>>> basename issues that just got resolved.
>>>>
>>>> I can verify now that running fio with a single --name flag appears to
>>>> generate logs without malformated lines.
>>>
>>> OK, I'll take a look, We should already be serializing the writing to
>>> the same log files, but your example definitely looks like we are not.
>>>
>>> BTW, the numjobs option applies to each --name section. Just to clear
>>> that up from your original email. So you really get numjobs *
>>> name_sections running.
>>
>> Aha, thanks for clearing that up!
>>
>> Is github the right place to create new issues?  If so, want me to
>> create one for this?
>
> Sure, you can go ahead and do that.

Actually, I'm going to wait after hearing your response below!

>
> I'm puzzled by this report. The writeout of the logs is fully
> serialized, there should be no interleaving. Looking at your output,
> it's definitely either the output from two jobs, or mangled output from
> just one job. The corrupted line is at time 818437, and the next line is
> much further ahead. Normally you would expect to see:
>
> 500, some-bw, 0, blocksize        <- job X
> 1000, some-bw, 0, blocksize
> [...]
> end-time, some-bw, 0, blocksize        <- last entry from job X
> 500, some-bw, 0, blocksize        <- job Y now starts
>
> Note that the log writing always appends. Did you possibly have this
> half-done files in that location from a previous run?

I think this is possible.  My tool is supposed to clean out old tmp 
directories but it was broken (and unnoticed because all other output 
files were getting overwritten each time).  Guess I should prioritize a 
long-standing plan to use UUIDs for all tests/directories to make this 
kind of bug more obvious. :)

Mark

>



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bug in logging code?
  2014-04-02 20:38           ` Mark Nelson
@ 2014-04-03  3:24             ` Jens Axboe
  2014-04-04  4:58               ` Andrey Kuzmin
  2014-05-28 14:01               ` Mark Nelson
  0 siblings, 2 replies; 11+ messages in thread
From: Jens Axboe @ 2014-04-03  3:24 UTC (permalink / raw)
  To: Mark Nelson, fio

On 2014-04-02 14:38, Mark Nelson wrote:
> On 04/02/2014 01:55 PM, Jens Axboe wrote:
>> On 04/02/2014 12:24 PM, Mark Nelson wrote:
>>> On 04/02/2014 12:55 PM, Jens Axboe wrote:
>>>> On 04/02/2014 11:53 AM, Mark Nelson wrote:
>>>>> On 04/02/2014 12:38 PM, Jens Axboe wrote:
>>>>>> On 04/02/2014 09:23 AM, Mark Nelson wrote:
>>>>>>> I've been playing around with usng the various write_*_log
>>>>>>> options in
>>>>>>> fio to record run statistics.  What I'm finding is that some of the
>>>>>>> logs
>>>>>>> have malformed lines in them.  The following snippet is from the bw
>>>>>>> log:
>>>>>>>
>>>>>>> 811820, 12263, 0, 4194304
>>>>>>> 812619, 35929, 0, 4194304
>>>>>>> 818437, 8448, 0, 419430, 4194304
>>>>>>> 1256299, 160627, 0, 4194304
>>>>>>> 1256831, 154274, 0, 4194304
>>>>>>>
>>>>>>> Notice that the malformed line appears to be a partial write that
>>>>>>> gets
>>>>>>> truncated.  The fio command being used to run the tests is
>>>>>>> autogenerated, so job files aren't used.  here's the one that
>>>>>>> created
>>>>>>> the above log snippet:
>>>>>>>
>>>>>>> /home/nhm/bin/fio
>>>>>>> --rw=read
>>>>>>> --ioengine=libaio
>>>>>>> --runtime=7200
>>>>>>> --ramp_time=0
>>>>>>> --numjobs=1
>>>>>>> --direct=1
>>>>>>> --bs=4194304B
>>>>>>> --iodepth=16
>>>>>>> --size=9216M
>>>>>>> --write_iops_log=/tmp/cbt/ceph/kvmrbdfio/osd_ra-00004096/client_ra-00000128/op_size-04194304/concurrent_procs-008/iodepth-016/read/output
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --write_bw_log=/tmp/cbt/ceph/kvmrbdfio/osd_ra-00004096/client_ra-00000128/op_size-04194304/concurrent_procs-008/iodepth-016/read/output
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --write_lat_log=/tmp/cbt/ceph/kvmrbdfio/osd_ra-00004096/client_ra-00000128/op_size-04194304/concurrent_procs-008/iodepth-016/read/output
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --time_based
>>>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-0
>>>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-1
>>>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-2
>>>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-3
>>>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-4
>>>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-5
>>>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-6
>>>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-7
>>>>>>>
>>>>>>> The --numjobs=1 line should be removed, but I believe it gets
>>>>>>> overridden
>>>>>>> anyway with multiple name parameters.  I wonder if having multiple
>>>>>>> name
>>>>>>> parameters (ie multiple fio processes) is causing those processes to
>>>>>>> all
>>>>>>> try to write to the same log files, potentially at the same time?
>>>>>>> Any
>>>>>>> thoughts would be much appreicated!
>>>>>>
>>>>>> What version of fio are you running?
>>>>>>
>>>>>
>>>>> f42557f (ie master from yesterday) with a slightly different fix for
>>>>> the
>>>>> basename issues that just got resolved.
>>>>>
>>>>> I can verify now that running fio with a single --name flag appears to
>>>>> generate logs without malformated lines.
>>>>
>>>> OK, I'll take a look, We should already be serializing the writing to
>>>> the same log files, but your example definitely looks like we are not.
>>>>
>>>> BTW, the numjobs option applies to each --name section. Just to clear
>>>> that up from your original email. So you really get numjobs *
>>>> name_sections running.
>>>
>>> Aha, thanks for clearing that up!
>>>
>>> Is github the right place to create new issues?  If so, want me to
>>> create one for this?
>>
>> Sure, you can go ahead and do that.
>
> Actually, I'm going to wait after hearing your response below!
>
>>
>> I'm puzzled by this report. The writeout of the logs is fully
>> serialized, there should be no interleaving. Looking at your output,
>> it's definitely either the output from two jobs, or mangled output from
>> just one job. The corrupted line is at time 818437, and the next line is
>> much further ahead. Normally you would expect to see:
>>
>> 500, some-bw, 0, blocksize        <- job X
>> 1000, some-bw, 0, blocksize
>> [...]
>> end-time, some-bw, 0, blocksize        <- last entry from job X
>> 500, some-bw, 0, blocksize        <- job Y now starts
>>
>> Note that the log writing always appends. Did you possibly have this
>> half-done files in that location from a previous run?
>
> I think this is possible.  My tool is supposed to clean out old tmp
> directories but it was broken (and unnoticed because all other output
> files were getting overwritten each time).  Guess I should prioritize a
> long-standing plan to use UUIDs for all tests/directories to make this
> kind of bug more obvious. :)

That makes me feel better, and does make a lot more sense! But since I 
was looking at this code anyway, I could not help but be annoyed at how 
we serialize all the logs when writing them out. If the logs are big, it 
can take a long time per log. So I added some code to only lock per file 
instead. This means that if no log sharing is happening, we'll write all 
5 logs in parallel. A quick test here with a 5s null run which resulted 
in 4 logs (two tiny, the bw and iops log) reduced the log writing from 
26 seconds to 14 seconds. And that was just two 18M logs.

So even if we didn't have a real bug here, at least the logging is now 
substantially faster.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bug in logging code?
  2014-04-03  3:24             ` Jens Axboe
@ 2014-04-04  4:58               ` Andrey Kuzmin
  2014-04-04  5:05                 ` Jens Axboe
  2014-05-28 14:01               ` Mark Nelson
  1 sibling, 1 reply; 11+ messages in thread
From: Andrey Kuzmin @ 2014-04-04  4:58 UTC (permalink / raw)
  To: Jens Axboe; +Cc: fio

[-- Attachment #1: Type: text/plain, Size: 6258 bytes --]

Another reduction idea would be to open/read/write io logs via gz frontend,
defaulting to regular I/O where compression is either unavailable or
undesired. Should be a big deal with large logs, which is often the case. I
planned to take on this in my spare time couple of months back, but it's
still where it was at that time :(.
On Apr 2, 2014 8:30 PM, "Jens Axboe" <axboe@kernel.dk> wrote:

> On 2014-04-02 14:38, Mark Nelson wrote:
>
>> On 04/02/2014 01:55 PM, Jens Axboe wrote:
>>
>>> On 04/02/2014 12:24 PM, Mark Nelson wrote:
>>>
>>>> On 04/02/2014 12:55 PM, Jens Axboe wrote:
>>>>
>>>>> On 04/02/2014 11:53 AM, Mark Nelson wrote:
>>>>>
>>>>>> On 04/02/2014 12:38 PM, Jens Axboe wrote:
>>>>>>
>>>>>>> On 04/02/2014 09:23 AM, Mark Nelson wrote:
>>>>>>>
>>>>>>>> I've been playing around with usng the various write_*_log
>>>>>>>> options in
>>>>>>>> fio to record run statistics.  What I'm finding is that some of the
>>>>>>>> logs
>>>>>>>> have malformed lines in them.  The following snippet is from the bw
>>>>>>>> log:
>>>>>>>>
>>>>>>>> 811820, 12263, 0, 4194304
>>>>>>>> 812619, 35929, 0, 4194304
>>>>>>>> 818437, 8448, 0, 419430, 4194304
>>>>>>>> 1256299, 160627, 0, 4194304
>>>>>>>> 1256831, 154274, 0, 4194304
>>>>>>>>
>>>>>>>> Notice that the malformed line appears to be a partial write that
>>>>>>>> gets
>>>>>>>> truncated.  The fio command being used to run the tests is
>>>>>>>> autogenerated, so job files aren't used.  here's the one that
>>>>>>>> created
>>>>>>>> the above log snippet:
>>>>>>>>
>>>>>>>> /home/nhm/bin/fio
>>>>>>>> --rw=read
>>>>>>>> --ioengine=libaio
>>>>>>>> --runtime=7200
>>>>>>>> --ramp_time=0
>>>>>>>> --numjobs=1
>>>>>>>> --direct=1
>>>>>>>> --bs=4194304B
>>>>>>>> --iodepth=16
>>>>>>>> --size=9216M
>>>>>>>> --write_iops_log=/tmp/cbt/ceph/kvmrbdfio/osd_ra-
>>>>>>>> 00004096/client_ra-00000128/op_size-04194304/concurrent_
>>>>>>>> procs-008/iodepth-016/read/output
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --write_bw_log=/tmp/cbt/ceph/kvmrbdfio/osd_ra-00004096/
>>>>>>>> client_ra-00000128/op_size-04194304/concurrent_procs-008/
>>>>>>>> iodepth-016/read/output
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --write_lat_log=/tmp/cbt/ceph/kvmrbdfio/osd_ra-00004096/
>>>>>>>> client_ra-00000128/op_size-04194304/concurrent_procs-008/
>>>>>>>> iodepth-016/read/output
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --time_based
>>>>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-0
>>>>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-1
>>>>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-2
>>>>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-3
>>>>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-4
>>>>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-5
>>>>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-6
>>>>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-7
>>>>>>>>
>>>>>>>> The --numjobs=1 line should be removed, but I believe it gets
>>>>>>>> overridden
>>>>>>>> anyway with multiple name parameters.  I wonder if having multiple
>>>>>>>> name
>>>>>>>> parameters (ie multiple fio processes) is causing those processes to
>>>>>>>> all
>>>>>>>> try to write to the same log files, potentially at the same time?
>>>>>>>> Any
>>>>>>>> thoughts would be much appreicated!
>>>>>>>>
>>>>>>>
>>>>>>> What version of fio are you running?
>>>>>>>
>>>>>>>
>>>>>> f42557f (ie master from yesterday) with a slightly different fix for
>>>>>> the
>>>>>> basename issues that just got resolved.
>>>>>>
>>>>>> I can verify now that running fio with a single --name flag appears to
>>>>>> generate logs without malformated lines.
>>>>>>
>>>>>
>>>>> OK, I'll take a look, We should already be serializing the writing to
>>>>> the same log files, but your example definitely looks like we are not.
>>>>>
>>>>> BTW, the numjobs option applies to each --name section. Just to clear
>>>>> that up from your original email. So you really get numjobs *
>>>>> name_sections running.
>>>>>
>>>>
>>>> Aha, thanks for clearing that up!
>>>>
>>>> Is github the right place to create new issues?  If so, want me to
>>>> create one for this?
>>>>
>>>
>>> Sure, you can go ahead and do that.
>>>
>>
>> Actually, I'm going to wait after hearing your response below!
>>
>>
>>> I'm puzzled by this report. The writeout of the logs is fully
>>> serialized, there should be no interleaving. Looking at your output,
>>> it's definitely either the output from two jobs, or mangled output from
>>> just one job. The corrupted line is at time 818437, and the next line is
>>> much further ahead. Normally you would expect to see:
>>>
>>> 500, some-bw, 0, blocksize        <- job X
>>> 1000, some-bw, 0, blocksize
>>> [...]
>>> end-time, some-bw, 0, blocksize        <- last entry from job X
>>> 500, some-bw, 0, blocksize        <- job Y now starts
>>>
>>> Note that the log writing always appends. Did you possibly have this
>>> half-done files in that location from a previous run?
>>>
>>
>> I think this is possible.  My tool is supposed to clean out old tmp
>> directories but it was broken (and unnoticed because all other output
>> files were getting overwritten each time).  Guess I should prioritize a
>> long-standing plan to use UUIDs for all tests/directories to make this
>> kind of bug more obvious. :)
>>
>
> That makes me feel better, and does make a lot more sense! But since I was
> looking at this code anyway, I could not help but be annoyed at how we
> serialize all the logs when writing them out. If the logs are big, it can
> take a long time per log. So I added some code to only lock per file
> instead. This means that if no log sharing is happening, we'll write all 5
> logs in parallel. A quick test here with a 5s null run which resulted in 4
> logs (two tiny, the bw and iops log) reduced the log writing from 26
> seconds to 14 seconds. And that was just two 18M logs.
>
> So even if we didn't have a real bug here, at least the logging is now
> substantially faster.
>
> --
> Jens Axboe
>
> --
> To unsubscribe from this list: send the line "unsubscribe fio" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

[-- Attachment #2: Type: text/html, Size: 7417 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bug in logging code?
  2014-04-04  4:58               ` Andrey Kuzmin
@ 2014-04-04  5:05                 ` Jens Axboe
  0 siblings, 0 replies; 11+ messages in thread
From: Jens Axboe @ 2014-04-04  5:05 UTC (permalink / raw)
  To: Andrey Kuzmin; +Cc: fio@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 505 bytes --]

On Apr 3, 2014, at 10:58 PM, Andrey Kuzmin <andrey.v.kuzmin@gmail.com> wrote:
> 
> Another reduction idea would be to open/read/write io logs via gz frontend, defaulting to regular I/O where compression is either unavailable or undesired. Should be a big deal with large logs, which is often the case. I planned to take on this in my spare time couple of months back, but it's still where it was at that time :(.
> 
Should be trivial to wire up actually, the network code already does gz of logs. 


[-- Attachment #2: Type: text/html, Size: 772 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bug in logging code?
  2014-04-03  3:24             ` Jens Axboe
  2014-04-04  4:58               ` Andrey Kuzmin
@ 2014-05-28 14:01               ` Mark Nelson
  1 sibling, 0 replies; 11+ messages in thread
From: Mark Nelson @ 2014-05-28 14:01 UTC (permalink / raw)
  To: Jens Axboe, fio

On 04/02/2014 10:24 PM, Jens Axboe wrote:
> On 2014-04-02 14:38, Mark Nelson wrote:
>> On 04/02/2014 01:55 PM, Jens Axboe wrote:
>>> On 04/02/2014 12:24 PM, Mark Nelson wrote:
>>>> On 04/02/2014 12:55 PM, Jens Axboe wrote:
>>>>> On 04/02/2014 11:53 AM, Mark Nelson wrote:
>>>>>> On 04/02/2014 12:38 PM, Jens Axboe wrote:
>>>>>>> On 04/02/2014 09:23 AM, Mark Nelson wrote:
>>>>>>>> I've been playing around with usng the various write_*_log
>>>>>>>> options in
>>>>>>>> fio to record run statistics.  What I'm finding is that some of the
>>>>>>>> logs
>>>>>>>> have malformed lines in them.  The following snippet is from the bw
>>>>>>>> log:
>>>>>>>>
>>>>>>>> 811820, 12263, 0, 4194304
>>>>>>>> 812619, 35929, 0, 4194304
>>>>>>>> 818437, 8448, 0, 419430, 4194304
>>>>>>>> 1256299, 160627, 0, 4194304
>>>>>>>> 1256831, 154274, 0, 4194304
>>>>>>>>
>>>>>>>> Notice that the malformed line appears to be a partial write that
>>>>>>>> gets
>>>>>>>> truncated.  The fio command being used to run the tests is
>>>>>>>> autogenerated, so job files aren't used.  here's the one that
>>>>>>>> created
>>>>>>>> the above log snippet:
>>>>>>>>
>>>>>>>> /home/nhm/bin/fio
>>>>>>>> --rw=read
>>>>>>>> --ioengine=libaio
>>>>>>>> --runtime=7200
>>>>>>>> --ramp_time=0
>>>>>>>> --numjobs=1
>>>>>>>> --direct=1
>>>>>>>> --bs=4194304B
>>>>>>>> --iodepth=16
>>>>>>>> --size=9216M
>>>>>>>> --write_iops_log=/tmp/cbt/ceph/kvmrbdfio/osd_ra-00004096/client_ra-00000128/op_size-04194304/concurrent_procs-008/iodepth-016/read/output
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --write_bw_log=/tmp/cbt/ceph/kvmrbdfio/osd_ra-00004096/client_ra-00000128/op_size-04194304/concurrent_procs-008/iodepth-016/read/output
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --write_lat_log=/tmp/cbt/ceph/kvmrbdfio/osd_ra-00004096/client_ra-00000128/op_size-04194304/concurrent_procs-008/iodepth-016/read/output
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --time_based
>>>>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-0
>>>>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-1
>>>>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-2
>>>>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-3
>>>>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-4
>>>>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-5
>>>>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-6
>>>>>>>> --name=/srv/rbdfio-`hostname -s`-0/cbt-kvmrbdfio-7
>>>>>>>>
>>>>>>>> The --numjobs=1 line should be removed, but I believe it gets
>>>>>>>> overridden
>>>>>>>> anyway with multiple name parameters.  I wonder if having multiple
>>>>>>>> name
>>>>>>>> parameters (ie multiple fio processes) is causing those
>>>>>>>> processes to
>>>>>>>> all
>>>>>>>> try to write to the same log files, potentially at the same time?
>>>>>>>> Any
>>>>>>>> thoughts would be much appreicated!
>>>>>>>
>>>>>>> What version of fio are you running?
>>>>>>>
>>>>>>
>>>>>> f42557f (ie master from yesterday) with a slightly different fix for
>>>>>> the
>>>>>> basename issues that just got resolved.
>>>>>>
>>>>>> I can verify now that running fio with a single --name flag
>>>>>> appears to
>>>>>> generate logs without malformated lines.
>>>>>
>>>>> OK, I'll take a look, We should already be serializing the writing to
>>>>> the same log files, but your example definitely looks like we are not.
>>>>>
>>>>> BTW, the numjobs option applies to each --name section. Just to clear
>>>>> that up from your original email. So you really get numjobs *
>>>>> name_sections running.
>>>>
>>>> Aha, thanks for clearing that up!
>>>>
>>>> Is github the right place to create new issues?  If so, want me to
>>>> create one for this?
>>>
>>> Sure, you can go ahead and do that.
>>
>> Actually, I'm going to wait after hearing your response below!
>>
>>>
>>> I'm puzzled by this report. The writeout of the logs is fully
>>> serialized, there should be no interleaving. Looking at your output,
>>> it's definitely either the output from two jobs, or mangled output from
>>> just one job. The corrupted line is at time 818437, and the next line is
>>> much further ahead. Normally you would expect to see:
>>>
>>> 500, some-bw, 0, blocksize        <- job X
>>> 1000, some-bw, 0, blocksize
>>> [...]
>>> end-time, some-bw, 0, blocksize        <- last entry from job X
>>> 500, some-bw, 0, blocksize        <- job Y now starts
>>>
>>> Note that the log writing always appends. Did you possibly have this
>>> half-done files in that location from a previous run?
>>
>> I think this is possible.  My tool is supposed to clean out old tmp
>> directories but it was broken (and unnoticed because all other output
>> files were getting overwritten each time).  Guess I should prioritize a
>> long-standing plan to use UUIDs for all tests/directories to make this
>> kind of bug more obvious. :)
>
> That makes me feel better, and does make a lot more sense! But since I
> was looking at this code anyway, I could not help but be annoyed at how
> we serialize all the logs when writing them out. If the logs are big, it
> can take a long time per log. So I added some code to only lock per file
> instead. This means that if no log sharing is happening, we'll write all
> 5 logs in parallel. A quick test here with a 5s null run which resulted
> in 4 logs (two tiny, the bw and iops log) reduced the log writing from
> 26 seconds to 14 seconds. And that was just two 18M logs.
>
> So even if we didn't have a real bug here, at least the logging is now
> substantially faster.
>

Hi Jens,

Sorry to necro this thread.  I've been continuing to play with the 
various write*log options along with bwavgtime and log_avg_msec and 
trying to make sense of the output.  Does each job log a line in the 
logs or do they get aggregated?  Also, when I have a different value for 
bwavgtime and log_avg_msec (say 500ms default for bwavgtime and 100ms 
for log_avg_msec), is it first doing the 100ms avg for log_avg_msec and 
then averaging those for bwavgtime?  What is the intended behaviour?  My 
bw logs seem to be showing performance lower than what the aggregate 
totals in the output file show (and oddities I wasn't expecting), so I'm 
just trying to get a feel for how things are supposed to work.

Mark


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2014-05-28 14:01 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-04-02 15:23 bug in logging code? Mark Nelson
2014-04-02 17:38 ` Jens Axboe
2014-04-02 17:53   ` Mark Nelson
2014-04-02 17:55     ` Jens Axboe
2014-04-02 18:24       ` Mark Nelson
2014-04-02 18:55         ` Jens Axboe
2014-04-02 20:38           ` Mark Nelson
2014-04-03  3:24             ` Jens Axboe
2014-04-04  4:58               ` Andrey Kuzmin
2014-04-04  5:05                 ` Jens Axboe
2014-05-28 14:01               ` Mark Nelson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox