All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] gettime: minimize integer division
@ 2012-12-20  0:52 Sam Bradshaw
  2012-12-20  8:14 ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Sam Bradshaw @ 2012-12-20  0:52 UTC (permalink / raw)
  To: axboe; +Cc: fio


This patch generally converts a division to a subtraction in fio_gettime().

Shows ~1% better iops with synthetic benchmarking at roughly the same cpu 
time spent in fio_gettime().

Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>

diff --git a/gettime.c b/gettime.c
index 248f146..e2a6241 100644
--- a/gettime.c
+++ b/gettime.c
@@ -163,17 +163,23 @@ void fio_gettime(struct timeval *tp, void fio_unused *caller)
 		}
 #ifdef ARCH_HAVE_CPU_CLOCK
 	case CS_CPUCLOCK: {
-		unsigned long long usecs, t;
+		unsigned long long usecs, t, delta = 0;

 		t = get_cpu_clock();
 		if (tv && t < tv->last_cycles) {
 			dprint(FD_TIME, "CPU clock going back in time\n");
 			t = tv->last_cycles;
-		} else if (tv)
+		} else if (tv) {
+			if (tv->last_tv_valid)
+				delta = t - tv->last_cycles;
 			tv->last_cycles = t;
+		}

 		usecs = t / cycles_per_usec;
-		tp->tv_sec = usecs / 1000000;
+		if (delta > 1000000)
+			tp->tv_sec = tv->last_tv.tv_sec;
+		else
+			tp->tv_sec = usecs / 1000000;
 		tp->tv_usec = usecs % 1000000;
 		break;
 		}


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] gettime: minimize integer division
  2012-12-20  0:52 [PATCH] gettime: minimize integer division Sam Bradshaw
@ 2012-12-20  8:14 ` Jens Axboe
  2012-12-20 17:18   ` Sam Bradshaw
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2012-12-20  8:14 UTC (permalink / raw)
  To: Sam Bradshaw; +Cc: fio

On 2012-12-20 01:52, Sam Bradshaw wrote:
> 
> This patch generally converts a division to a subtraction in fio_gettime().
> 
> Shows ~1% better iops with synthetic benchmarking at roughly the same cpu 
> time spent in fio_gettime().
> 
> Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
> 
> diff --git a/gettime.c b/gettime.c
> index 248f146..e2a6241 100644
> --- a/gettime.c
> +++ b/gettime.c
> @@ -163,17 +163,23 @@ void fio_gettime(struct timeval *tp, void fio_unused *caller)
>  		}
>  #ifdef ARCH_HAVE_CPU_CLOCK
>  	case CS_CPUCLOCK: {
> -		unsigned long long usecs, t;
> +		unsigned long long usecs, t, delta = 0;
> 
>  		t = get_cpu_clock();
>  		if (tv && t < tv->last_cycles) {
>  			dprint(FD_TIME, "CPU clock going back in time\n");
>  			t = tv->last_cycles;
> -		} else if (tv)
> +		} else if (tv) {
> +			if (tv->last_tv_valid)
> +				delta = t - tv->last_cycles;
>  			tv->last_cycles = t;
> +		}
> 
>  		usecs = t / cycles_per_usec;
> -		tp->tv_sec = usecs / 1000000;
> +		if (delta > 1000000)
> +			tp->tv_sec = tv->last_tv.tv_sec;
> +		else
> +			tp->tv_sec = usecs / 1000000;

Shouldn't that be delta < 1000000? What am I missing? If the diff is
more than 1M usecs, then do the division. If not, we can reuse the
seconds from the last one.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] gettime: minimize integer division
  2012-12-20  8:14 ` Jens Axboe
@ 2012-12-20 17:18   ` Sam Bradshaw
  2012-12-20 18:03     ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Sam Bradshaw @ 2012-12-20 17:18 UTC (permalink / raw)
  To: Jens Axboe; +Cc: fio

On 12/20/2012 12:14 AM, Jens Axboe wrote:

> On 2012-12-20 01:52, Sam Bradshaw wrote:
>>
>> This patch generally converts a division to a subtraction in fio_gettime().
>>
>> Shows ~1% better iops with synthetic benchmarking at roughly the same cpu 
>> time spent in fio_gettime().
>>
>> Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
>>
>> diff --git a/gettime.c b/gettime.c
>> index 248f146..e2a6241 100644
>> --- a/gettime.c
>> +++ b/gettime.c
>> @@ -163,17 +163,23 @@ void fio_gettime(struct timeval *tp, void fio_unused *caller)
>>  		}
>>  #ifdef ARCH_HAVE_CPU_CLOCK
>>  	case CS_CPUCLOCK: {
>> -		unsigned long long usecs, t;
>> +		unsigned long long usecs, t, delta = 0;
>>
>>  		t = get_cpu_clock();
>>  		if (tv && t < tv->last_cycles) {
>>  			dprint(FD_TIME, "CPU clock going back in time\n");
>>  			t = tv->last_cycles;
>> -		} else if (tv)
>> +		} else if (tv) {
>> +			if (tv->last_tv_valid)
>> +				delta = t - tv->last_cycles;
>>  			tv->last_cycles = t;
>> +		}
>>
>>  		usecs = t / cycles_per_usec;
>> -		tp->tv_sec = usecs / 1000000;
>> +		if (delta > 1000000)
>> +			tp->tv_sec = tv->last_tv.tv_sec;
>> +		else
>> +			tp->tv_sec = usecs / 1000000;
> 
> Shouldn't that be delta < 1000000? What am I missing? If the diff is
> more than 1M usecs, then do the division. If not, we can reuse the
> seconds from the last one.
> 



Yes, my bad.  Correct patch below.

diff --git a/gettime.c b/gettime.c
index 035d275..89f3e27 100644
--- a/gettime.c
+++ b/gettime.c
@@ -168,17 +168,23 @@ void fio_gettime(struct timeval *tp, void
fio_unused *caller)
 		}
 #ifdef ARCH_HAVE_CPU_CLOCK
 	case CS_CPUCLOCK: {
-		unsigned long long usecs, t;
+		unsigned long long usecs, t, delta = 0;

 		t = get_cpu_clock();
 		if (tv && t < tv->last_cycles) {
 			dprint(FD_TIME, "CPU clock going back in time\n");
 			t = tv->last_cycles;
-		} else if (tv)
+		} else if (tv) {
+			if (tv->last_tv_valid)
+				delta = t - tv->last_cycles;
 			tv->last_cycles = t;
+		}

 		usecs = t / cycles_per_usec;
-		tp->tv_sec = usecs / 1000000;
+		if (delta && delta < 1000000)
+			tp->tv_sec = tv->last_tv.tv_sec;
+		else
+			tp->tv_sec = usecs / 1000000;
 		tp->tv_usec = usecs % 1000000;
 		break;
 		}


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] gettime: minimize integer division
  2012-12-20 17:18   ` Sam Bradshaw
@ 2012-12-20 18:03     ` Jens Axboe
  2012-12-20 18:58       ` Sam Bradshaw (sbradshaw)
  2012-12-20 19:23       ` Sam Bradshaw
  0 siblings, 2 replies; 12+ messages in thread
From: Jens Axboe @ 2012-12-20 18:03 UTC (permalink / raw)
  To: Sam Bradshaw; +Cc: fio

On 2012-12-20 18:18, Sam Bradshaw wrote:
> On 12/20/2012 12:14 AM, Jens Axboe wrote:
> 
>> On 2012-12-20 01:52, Sam Bradshaw wrote:
>>>
>>> This patch generally converts a division to a subtraction in fio_gettime().
>>>
>>> Shows ~1% better iops with synthetic benchmarking at roughly the same cpu 
>>> time spent in fio_gettime().
>>>
>>> Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
>>>
>>> diff --git a/gettime.c b/gettime.c
>>> index 248f146..e2a6241 100644
>>> --- a/gettime.c
>>> +++ b/gettime.c
>>> @@ -163,17 +163,23 @@ void fio_gettime(struct timeval *tp, void fio_unused *caller)
>>>  		}
>>>  #ifdef ARCH_HAVE_CPU_CLOCK
>>>  	case CS_CPUCLOCK: {
>>> -		unsigned long long usecs, t;
>>> +		unsigned long long usecs, t, delta = 0;
>>>
>>>  		t = get_cpu_clock();
>>>  		if (tv && t < tv->last_cycles) {
>>>  			dprint(FD_TIME, "CPU clock going back in time\n");
>>>  			t = tv->last_cycles;
>>> -		} else if (tv)
>>> +		} else if (tv) {
>>> +			if (tv->last_tv_valid)
>>> +				delta = t - tv->last_cycles;
>>>  			tv->last_cycles = t;
>>> +		}
>>>
>>>  		usecs = t / cycles_per_usec;
>>> -		tp->tv_sec = usecs / 1000000;
>>> +		if (delta > 1000000)
>>> +			tp->tv_sec = tv->last_tv.tv_sec;
>>> +		else
>>> +			tp->tv_sec = usecs / 1000000;
>>
>> Shouldn't that be delta < 1000000? What am I missing? If the diff is
>> more than 1M usecs, then do the division. If not, we can reuse the
>> seconds from the last one.
>>
> 
> 
> 
> Yes, my bad.  Correct patch below.
> 
> diff --git a/gettime.c b/gettime.c
> index 035d275..89f3e27 100644
> --- a/gettime.c
> +++ b/gettime.c
> @@ -168,17 +168,23 @@ void fio_gettime(struct timeval *tp, void
> fio_unused *caller)
>  		}
>  #ifdef ARCH_HAVE_CPU_CLOCK
>  	case CS_CPUCLOCK: {
> -		unsigned long long usecs, t;
> +		unsigned long long usecs, t, delta = 0;
> 
>  		t = get_cpu_clock();
>  		if (tv && t < tv->last_cycles) {
>  			dprint(FD_TIME, "CPU clock going back in time\n");
>  			t = tv->last_cycles;
> -		} else if (tv)
> +		} else if (tv) {
> +			if (tv->last_tv_valid)
> +				delta = t - tv->last_cycles;
>  			tv->last_cycles = t;
> +		}
> 
>  		usecs = t / cycles_per_usec;
> -		tp->tv_sec = usecs / 1000000;
> +		if (delta && delta < 1000000)
> +			tp->tv_sec = tv->last_tv.tv_sec;
> +		else
> +			tp->tv_sec = usecs / 1000000;
>  		tp->tv_usec = usecs % 1000000;
>  		break;
>  		}

I was thinking about this... Is it actually guarenteed to work. If
tv->last_tv.tv_usec is eg 900,000, you'd only need a 100k usec diff to
need to wrap, not 1000k. And since this is about avoiding costly divs,
since we know the number of cycles last time, it might make more sense
to just do the single div to go from cycles to usecs, then add that to
the tv->last_tv.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH] gettime: minimize integer division
  2012-12-20 18:03     ` Jens Axboe
@ 2012-12-20 18:58       ` Sam Bradshaw (sbradshaw)
  2012-12-20 19:23       ` Sam Bradshaw
  1 sibling, 0 replies; 12+ messages in thread
From: Sam Bradshaw (sbradshaw) @ 2012-12-20 18:58 UTC (permalink / raw)
  To: Jens Axboe; +Cc: fio@vger.kernel.org



> -----Original Message-----
> From: fio-owner@vger.kernel.org [mailto:fio-owner@vger.kernel.org] On Behalf Of Jens
> Axboe
> Sent: Thursday, December 20, 2012 10:03 AM
> To: Sam Bradshaw (sbradshaw)
> Cc: fio@vger.kernel.org
> Subject: Re: [PATCH] gettime: minimize integer division
> 
> On 2012-12-20 18:18, Sam Bradshaw wrote:
> > On 12/20/2012 12:14 AM, Jens Axboe wrote:
> >
> >> On 2012-12-20 01:52, Sam Bradshaw wrote:
> >>>
> >>> This patch generally converts a division to a subtraction in fio_gettime().
> >>>
> >>> Shows ~1% better iops with synthetic benchmarking at roughly the same cpu
> >>> time spent in fio_gettime().
> >>>
> >>> Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
> >>>
> >>> diff --git a/gettime.c b/gettime.c
> >>> index 248f146..e2a6241 100644
> >>> --- a/gettime.c
> >>> +++ b/gettime.c
> >>> @@ -163,17 +163,23 @@ void fio_gettime(struct timeval *tp, void fio_unused
> *caller)
> >>>  		}
> >>>  #ifdef ARCH_HAVE_CPU_CLOCK
> >>>  	case CS_CPUCLOCK: {
> >>> -		unsigned long long usecs, t;
> >>> +		unsigned long long usecs, t, delta = 0;
> >>>
> >>>  		t = get_cpu_clock();
> >>>  		if (tv && t < tv->last_cycles) {
> >>>  			dprint(FD_TIME, "CPU clock going back in time\n");
> >>>  			t = tv->last_cycles;
> >>> -		} else if (tv)
> >>> +		} else if (tv) {
> >>> +			if (tv->last_tv_valid)
> >>> +				delta = t - tv->last_cycles;
> >>>  			tv->last_cycles = t;
> >>> +		}
> >>>
> >>>  		usecs = t / cycles_per_usec;
> >>> -		tp->tv_sec = usecs / 1000000;
> >>> +		if (delta > 1000000)
> >>> +			tp->tv_sec = tv->last_tv.tv_sec;
> >>> +		else
> >>> +			tp->tv_sec = usecs / 1000000;
> >>
> >> Shouldn't that be delta < 1000000? What am I missing? If the diff is
> >> more than 1M usecs, then do the division. If not, we can reuse the
> >> seconds from the last one.
> >>
> >
> >
> >
> > Yes, my bad.  Correct patch below.
> >
> > diff --git a/gettime.c b/gettime.c
> > index 035d275..89f3e27 100644
> > --- a/gettime.c
> > +++ b/gettime.c
> > @@ -168,17 +168,23 @@ void fio_gettime(struct timeval *tp, void
> > fio_unused *caller)
> >  		}
> >  #ifdef ARCH_HAVE_CPU_CLOCK
> >  	case CS_CPUCLOCK: {
> > -		unsigned long long usecs, t;
> > +		unsigned long long usecs, t, delta = 0;
> >
> >  		t = get_cpu_clock();
> >  		if (tv && t < tv->last_cycles) {
> >  			dprint(FD_TIME, "CPU clock going back in time\n");
> >  			t = tv->last_cycles;
> > -		} else if (tv)
> > +		} else if (tv) {
> > +			if (tv->last_tv_valid)
> > +				delta = t - tv->last_cycles;
> >  			tv->last_cycles = t;
> > +		}
> >
> >  		usecs = t / cycles_per_usec;
> > -		tp->tv_sec = usecs / 1000000;
> > +		if (delta && delta < 1000000)
> > +			tp->tv_sec = tv->last_tv.tv_sec;
> > +		else
> > +			tp->tv_sec = usecs / 1000000;
> >  		tp->tv_usec = usecs % 1000000;
> >  		break;
> >  		}
> 
> I was thinking about this... Is it actually guarenteed to work. If
> tv->last_tv.tv_usec is eg 900,000, you'd only need a 100k usec diff to
> need to wrap, not 1000k. And since this is about avoiding costly divs,
> since we know the number of cycles last time, it might make more sense
> to just do the single div to go from cycles to usecs, then add that to
> the tv->last_tv.

Good point.  This patch is fundamentally broken as is.  It could perhaps
be fixed up with some additional reconciliation but that would add more
arithmetic operations and may quickly approach the div latency and 
confound the branching logic.  I'll think about it some more and try out 
your suggestion.

-Sam

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] gettime: minimize integer division
  2012-12-20 18:03     ` Jens Axboe
  2012-12-20 18:58       ` Sam Bradshaw (sbradshaw)
@ 2012-12-20 19:23       ` Sam Bradshaw
  2012-12-21 15:33         ` Jens Axboe
  1 sibling, 1 reply; 12+ messages in thread
From: Sam Bradshaw @ 2012-12-20 19:23 UTC (permalink / raw)
  To: Jens Axboe; +Cc: fio@vger.kernel.org


>> diff --git a/gettime.c b/gettime.c
>> index 035d275..89f3e27 100644
>> --- a/gettime.c
>> +++ b/gettime.c
>> @@ -168,17 +168,23 @@ void fio_gettime(struct timeval *tp, void
>> fio_unused *caller)
>>  		}
>>  #ifdef ARCH_HAVE_CPU_CLOCK
>>  	case CS_CPUCLOCK: {
>> -		unsigned long long usecs, t;
>> +		unsigned long long usecs, t, delta = 0;
>>
>>  		t = get_cpu_clock();
>>  		if (tv && t < tv->last_cycles) {
>>  			dprint(FD_TIME, "CPU clock going back in time\n");
>>  			t = tv->last_cycles;
>> -		} else if (tv)
>> +		} else if (tv) {
>> +			if (tv->last_tv_valid)
>> +				delta = t - tv->last_cycles;
>>  			tv->last_cycles = t;
>> +		}
>>
>>  		usecs = t / cycles_per_usec;
>> -		tp->tv_sec = usecs / 1000000;
>> +		if (delta && delta < 1000000)
>> +			tp->tv_sec = tv->last_tv.tv_sec;
>> +		else
>> +			tp->tv_sec = usecs / 1000000;
>>  		tp->tv_usec = usecs % 1000000;
>>  		break;
>>  		}
> 
> I was thinking about this... Is it actually guarenteed to work. If
> tv->last_tv.tv_usec is eg 900,000, you'd only need a 100k usec diff to
> need to wrap, not 1000k. And since this is about avoiding costly divs,
> since we know the number of cycles last time, it might make more sense
> to just do the single div to go from cycles to usecs, then add that to
> the tv->last_tv.
> 



Something like this might work, though that amount of logic may
be equivalent in terms of cycles to the divide.

diff --git a/gettime.c b/gettime.c
index 035d275..955a0a1 100644
--- a/gettime.c
+++ b/gettime.c
@@ -168,18 +168,25 @@ void fio_gettime(struct timeval *tp, void
fio_unused *caller)
 		}
 #ifdef ARCH_HAVE_CPU_CLOCK
 	case CS_CPUCLOCK: {
-		unsigned long long usecs, t;
+		unsigned long long usecs, t, delta = 0;

 		t = get_cpu_clock();
 		if (tv && t < tv->last_cycles) {
 			dprint(FD_TIME, "CPU clock going back in time\n");
 			t = tv->last_cycles;
-		} else if (tv)
+		} else if (tv) {
+			if (tv->last_tv_valid)
+				delta = t - tv->last_cycles;
 			tv->last_cycles = t;
+		}

 		usecs = t / cycles_per_usec;
-		tp->tv_sec = usecs / 1000000;
 		tp->tv_usec = usecs % 1000000;
+		if (delta && delta < 1000000 &&
+			tv->last_tv.tv_usec < tp->tv_usec)
+			tp->tv_sec = tv->last_tv.tv_sec;
+		else
+			tp->tv_sec = usecs / 1000000;
 		break;
 		}
 #endif




^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] gettime: minimize integer division
  2012-12-20 19:23       ` Sam Bradshaw
@ 2012-12-21 15:33         ` Jens Axboe
  2012-12-21 15:45           ` Jens Axboe
  2012-12-21 21:28           ` Sam Bradshaw (sbradshaw)
  0 siblings, 2 replies; 12+ messages in thread
From: Jens Axboe @ 2012-12-21 15:33 UTC (permalink / raw)
  To: Sam Bradshaw; +Cc: fio@vger.kernel.org

On 2012-12-20 20:23, Sam Bradshaw wrote:
> 
>>> diff --git a/gettime.c b/gettime.c
>>> index 035d275..89f3e27 100644
>>> --- a/gettime.c
>>> +++ b/gettime.c
>>> @@ -168,17 +168,23 @@ void fio_gettime(struct timeval *tp, void
>>> fio_unused *caller)
>>>  		}
>>>  #ifdef ARCH_HAVE_CPU_CLOCK
>>>  	case CS_CPUCLOCK: {
>>> -		unsigned long long usecs, t;
>>> +		unsigned long long usecs, t, delta = 0;
>>>
>>>  		t = get_cpu_clock();
>>>  		if (tv && t < tv->last_cycles) {
>>>  			dprint(FD_TIME, "CPU clock going back in time\n");
>>>  			t = tv->last_cycles;
>>> -		} else if (tv)
>>> +		} else if (tv) {
>>> +			if (tv->last_tv_valid)
>>> +				delta = t - tv->last_cycles;
>>>  			tv->last_cycles = t;
>>> +		}
>>>
>>>  		usecs = t / cycles_per_usec;
>>> -		tp->tv_sec = usecs / 1000000;
>>> +		if (delta && delta < 1000000)
>>> +			tp->tv_sec = tv->last_tv.tv_sec;
>>> +		else
>>> +			tp->tv_sec = usecs / 1000000;
>>>  		tp->tv_usec = usecs % 1000000;
>>>  		break;
>>>  		}
>>
>> I was thinking about this... Is it actually guarenteed to work. If
>> tv->last_tv.tv_usec is eg 900,000, you'd only need a 100k usec diff to
>> need to wrap, not 1000k. And since this is about avoiding costly divs,
>> since we know the number of cycles last time, it might make more sense
>> to just do the single div to go from cycles to usecs, then add that to
>> the tv->last_tv.
>>
> 
> 
> 
> Something like this might work, though that amount of logic may
> be equivalent in terms of cycles to the divide.

So I took a look at it. The costly bit is the division by
cycles_per_usec, which the compiler has no other option than turn into a
divq. The modulo and divide by 1M can be turned into something more
clever, basically shifts and imull.

So how about the below? It turns the divq into multiplication and
division by 10M, which should be considerably less expensive. Can you
test and see how that works for you?

diff --git a/gettime.c b/gettime.c
index 035d275..56703e1 100644
--- a/gettime.c
+++ b/gettime.c
@@ -15,6 +15,7 @@
 
 #ifdef ARCH_HAVE_CPU_CLOCK
 static unsigned long cycles_per_usec;
+static unsigned long inv_cycles_per_usec;
 int tsc_reliable = 0;
 #endif
 
@@ -177,7 +178,7 @@ void fio_gettime(struct timeval *tp, void fio_unused *caller)
 		} else if (tv)
 			tv->last_cycles = t;
 
-		usecs = t / cycles_per_usec;
+		usecs = (t * inv_cycles_per_usec) / 10000000UL;
 		tp->tv_sec = usecs / 1000000;
 		tp->tv_usec = usecs % 1000000;
 		break;
@@ -277,6 +278,8 @@ static void calibrate_cpu_clock(void)
 	dprint(FD_TIME, "mean=%f, S=%f\n", mean, S);
 
 	cycles_per_usec = avg;
+	inv_cycles_per_usec = 10000000UL / cycles_per_usec;
+	dprint(FD_TIME, "inv_cycles_per_usec=%lu\n", inv_cycles_per_usec);
 }
 #else
 static void calibrate_cpu_clock(void)

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] gettime: minimize integer division
  2012-12-21 15:33         ` Jens Axboe
@ 2012-12-21 15:45           ` Jens Axboe
  2012-12-21 21:28           ` Sam Bradshaw (sbradshaw)
  1 sibling, 0 replies; 12+ messages in thread
From: Jens Axboe @ 2012-12-21 15:45 UTC (permalink / raw)
  To: Sam Bradshaw; +Cc: fio@vger.kernel.org

On 2012-12-21 16:33, Jens Axboe wrote:
> On 2012-12-20 20:23, Sam Bradshaw wrote:
>>
>>>> diff --git a/gettime.c b/gettime.c
>>>> index 035d275..89f3e27 100644
>>>> --- a/gettime.c
>>>> +++ b/gettime.c
>>>> @@ -168,17 +168,23 @@ void fio_gettime(struct timeval *tp, void
>>>> fio_unused *caller)
>>>>  		}
>>>>  #ifdef ARCH_HAVE_CPU_CLOCK
>>>>  	case CS_CPUCLOCK: {
>>>> -		unsigned long long usecs, t;
>>>> +		unsigned long long usecs, t, delta = 0;
>>>>
>>>>  		t = get_cpu_clock();
>>>>  		if (tv && t < tv->last_cycles) {
>>>>  			dprint(FD_TIME, "CPU clock going back in time\n");
>>>>  			t = tv->last_cycles;
>>>> -		} else if (tv)
>>>> +		} else if (tv) {
>>>> +			if (tv->last_tv_valid)
>>>> +				delta = t - tv->last_cycles;
>>>>  			tv->last_cycles = t;
>>>> +		}
>>>>
>>>>  		usecs = t / cycles_per_usec;
>>>> -		tp->tv_sec = usecs / 1000000;
>>>> +		if (delta && delta < 1000000)
>>>> +			tp->tv_sec = tv->last_tv.tv_sec;
>>>> +		else
>>>> +			tp->tv_sec = usecs / 1000000;
>>>>  		tp->tv_usec = usecs % 1000000;
>>>>  		break;
>>>>  		}
>>>
>>> I was thinking about this... Is it actually guarenteed to work. If
>>> tv->last_tv.tv_usec is eg 900,000, you'd only need a 100k usec diff to
>>> need to wrap, not 1000k. And since this is about avoiding costly divs,
>>> since we know the number of cycles last time, it might make more sense
>>> to just do the single div to go from cycles to usecs, then add that to
>>> the tv->last_tv.
>>>
>>
>>
>>
>> Something like this might work, though that amount of logic may
>> be equivalent in terms of cycles to the divide.
> 
> So I took a look at it. The costly bit is the division by
> cycles_per_usec, which the compiler has no other option than turn into a
> divq. The modulo and divide by 1M can be turned into something more
> clever, basically shifts and imull.
> 
> So how about the below? It turns the divq into multiplication and
> division by 10M, which should be considerably less expensive. Can you
> test and see how that works for you?

Actually, it'd be dumb not to make it a power-of-2, since the actual
number doesn't really matter. So this uses 2^24, try that.

diff --git a/gettime.c b/gettime.c
index 035d275..df329f6 100644
--- a/gettime.c
+++ b/gettime.c
@@ -15,6 +15,7 @@
 
 #ifdef ARCH_HAVE_CPU_CLOCK
 static unsigned long cycles_per_usec;
+static unsigned long inv_cycles_per_usec;
 int tsc_reliable = 0;
 #endif
 
@@ -177,7 +178,7 @@ void fio_gettime(struct timeval *tp, void fio_unused *caller)
 		} else if (tv)
 			tv->last_cycles = t;
 
-		usecs = t / cycles_per_usec;
+		usecs = (t * inv_cycles_per_usec) / 16777216UL;
 		tp->tv_sec = usecs / 1000000;
 		tp->tv_usec = usecs % 1000000;
 		break;
@@ -277,6 +278,8 @@ static void calibrate_cpu_clock(void)
 	dprint(FD_TIME, "mean=%f, S=%f\n", mean, S);
 
 	cycles_per_usec = avg;
+	inv_cycles_per_usec = 16777216UL / cycles_per_usec;
+	dprint(FD_TIME, "inv_cycles_per_usec=%lu\n", inv_cycles_per_usec);
 }
 #else
 static void calibrate_cpu_clock(void)

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* RE: [PATCH] gettime: minimize integer division
  2012-12-21 15:33         ` Jens Axboe
  2012-12-21 15:45           ` Jens Axboe
@ 2012-12-21 21:28           ` Sam Bradshaw (sbradshaw)
  2012-12-21 21:30             ` Jens Axboe
  1 sibling, 1 reply; 12+ messages in thread
From: Sam Bradshaw (sbradshaw) @ 2012-12-21 21:28 UTC (permalink / raw)
  To: Jens Axboe; +Cc: fio@vger.kernel.org


> -----Original Message-----
> From: Jens Axboe [mailto:axboe@kernel.dk]
> Sent: Friday, December 21, 2012 7:33 AM
> To: Sam Bradshaw (sbradshaw)
> Cc: fio@vger.kernel.org
> Subject: Re: [PATCH] gettime: minimize integer division
> 
> On 2012-12-20 20:23, Sam Bradshaw wrote:
> >
> >
> > Something like this might work, though that amount of logic may
> > be equivalent in terms of cycles to the divide.
> 
> So I took a look at it. The costly bit is the division by
> cycles_per_usec, which the compiler has no other option than turn into a
> divq. The modulo and divide by 1M can be turned into something more
> clever, basically shifts and imull.
> 
> So how about the below? It turns the divq into multiplication and
> division by 10M, which should be considerably less expensive. Can you
> test and see how that works for you?

That works much better.  Several % lower execution time in fio_gettime().
IOPs look the same in my synthetic test but I'm not sure that's relevant;
(it probably just needs some more tweaking).

I'll keep hunting for other hot spots.

-Sam

 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] gettime: minimize integer division
  2012-12-21 21:28           ` Sam Bradshaw (sbradshaw)
@ 2012-12-21 21:30             ` Jens Axboe
  2012-12-21 21:53               ` Sam Bradshaw (sbradshaw)
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2012-12-21 21:30 UTC (permalink / raw)
  To: Sam Bradshaw (sbradshaw); +Cc: fio@vger.kernel.org

On 2012-12-21 22:28, Sam Bradshaw (sbradshaw) wrote:
> 
>> -----Original Message-----
>> From: Jens Axboe [mailto:axboe@kernel.dk]
>> Sent: Friday, December 21, 2012 7:33 AM
>> To: Sam Bradshaw (sbradshaw)
>> Cc: fio@vger.kernel.org
>> Subject: Re: [PATCH] gettime: minimize integer division
>>
>> On 2012-12-20 20:23, Sam Bradshaw wrote:
>>>
>>>
>>> Something like this might work, though that amount of logic may
>>> be equivalent in terms of cycles to the divide.
>>
>> So I took a look at it. The costly bit is the division by
>> cycles_per_usec, which the compiler has no other option than turn into a
>> divq. The modulo and divide by 1M can be turned into something more
>> clever, basically shifts and imull.
>>
>> So how about the below? It turns the divq into multiplication and
>> division by 10M, which should be considerably less expensive. Can you
>> test and see how that works for you?
> 
> That works much better.  Several % lower execution time in fio_gettime().

Goodie

> IOPs look the same in my synthetic test but I'm not sure that's relevant;
> (it probably just needs some more tweaking).

It'd probably need 3-4M IOPS from a single thread to have a big impact.
But reduced CPU is leftover CPU for doing actual IO, so always a good
thing.

And just as important, did the timing look correct?

> I'll keep hunting for other hot spots.

Thanks!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH] gettime: minimize integer division
  2012-12-21 21:30             ` Jens Axboe
@ 2012-12-21 21:53               ` Sam Bradshaw (sbradshaw)
  2012-12-23 20:49                 ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Sam Bradshaw (sbradshaw) @ 2012-12-21 21:53 UTC (permalink / raw)
  To: Jens Axboe; +Cc: fio@vger.kernel.org


> > IOPs look the same in my synthetic test but I'm not sure that's relevant;
> > (it probably just needs some more tweaking).
> 
> It'd probably need 3-4M IOPS from a single thread to have a big impact.
> But reduced CPU is leftover CPU for doing actual IO, so always a good
> thing.
> 
> And just as important, did the timing look correct?

Yes, no spikes, latency average and std. dev. as expected.

-Sam


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] gettime: minimize integer division
  2012-12-21 21:53               ` Sam Bradshaw (sbradshaw)
@ 2012-12-23 20:49                 ` Jens Axboe
  0 siblings, 0 replies; 12+ messages in thread
From: Jens Axboe @ 2012-12-23 20:49 UTC (permalink / raw)
  To: Sam Bradshaw (sbradshaw); +Cc: fio@vger.kernel.org

On 2012-12-21 22:53, Sam Bradshaw (sbradshaw) wrote:
> 
>>> IOPs look the same in my synthetic test but I'm not sure that's relevant;
>>> (it probably just needs some more tweaking).
>>
>> It'd probably need 3-4M IOPS from a single thread to have a big impact.
>> But reduced CPU is leftover CPU for doing actual IO, so always a good
>> thing.
>>
>> And just as important, did the timing look correct?
> 
> Yes, no spikes, latency average and std. dev. as expected.

Excellent, now committed...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2012-12-23 20:49 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-12-20  0:52 [PATCH] gettime: minimize integer division Sam Bradshaw
2012-12-20  8:14 ` Jens Axboe
2012-12-20 17:18   ` Sam Bradshaw
2012-12-20 18:03     ` Jens Axboe
2012-12-20 18:58       ` Sam Bradshaw (sbradshaw)
2012-12-20 19:23       ` Sam Bradshaw
2012-12-21 15:33         ` Jens Axboe
2012-12-21 15:45           ` Jens Axboe
2012-12-21 21:28           ` Sam Bradshaw (sbradshaw)
2012-12-21 21:30             ` Jens Axboe
2012-12-21 21:53               ` Sam Bradshaw (sbradshaw)
2012-12-23 20:49                 ` Jens Axboe

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.