From mboxrd@z Thu Jan  1 00:00:00 1970
From: Stephen Hemminger <stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ@public.gmane.org>
Subject: Re: [PATCH v2 6/7] rte_sched: eliminate floating point
 in	calculating byte clock
Date: Tue, 17 Feb 2015 11:05:00 -0500
Message-ID: <20150217110500.41ed8a18@uryu.home.lan>
References: <1423116841-19799-4-git-send-email-stephen@networkplumber.org>
 <1423116841-19799-6-git-send-email-stephen@networkplumber.org>
 <3EB4FA525960D640B5BDFFD6A3D8912632318070@IRSMSX108.ger.corp.intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Cc: "dev-VfR2kkLFssw@public.gmane.org" <dev-VfR2kkLFssw@public.gmane.org>, Stephen Hemminger <shemming-43mecJUBy8ZBDgjK7y7TUQ@public.gmane.org>
To: "Dumitrescu, Cristian" <cristian.dumitrescu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Return-path: <dev-bounces-VfR2kkLFssw@public.gmane.org>
In-Reply-To: <3EB4FA525960D640B5BDFFD6A3D8912632318070-kPTMFJFq+rEMvF1YICWikbfspsVTdybXVpNB7YpNyf8@public.gmane.org>
List-Id: patches and discussions about DPDK <dev.dpdk.org>
List-Unsubscribe: <http://dpdk.org/ml/options/dev>,
 <mailto:dev-request-VfR2kkLFssw@public.gmane.org?subject=unsubscribe>
List-Archive: <http://dpdk.org/ml/archives/dev/>
List-Post: <mailto:dev-VfR2kkLFssw@public.gmane.org>
List-Help: <mailto:dev-request-VfR2kkLFssw@public.gmane.org?subject=help>
List-Subscribe: <http://dpdk.org/ml/listinfo/dev>,
 <mailto:dev-request-VfR2kkLFssw@public.gmane.org?subject=subscribe>
Errors-To: dev-bounces-VfR2kkLFssw@public.gmane.org
Sender: "dev" <dev-bounces-VfR2kkLFssw@public.gmane.org>

On Mon, 16 Feb 2015 22:44:31 +0000
"Dumitrescu, Cristian" <cristian.dumitrescu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:

> Hi Stephen,
> 
> Sorry, NACK.
> 
> 1. Overflow issue
> As you declare cycles_per_byte as uint32_t, for a CPU frequency of 2-3 GHz, the line of code below results in overflow:
> 	port->cycles_per_byte = (rte_get_tsc_hz() << RTE_SCHED_TIME_SHIFT) / params->rate;
> Therefore, there is most likely a significant accuracy loss, which might result in more packets allowed to go out than it should.

The tsc shifted is still 64 bits.
and rate is 32 bits bytes/sec.

I chose scale such that
if clock = 3 Ghz
then min rate = 715 bytes/sec =  5722 bits/sec

> 2. Integer division has a higher cost than floating point division
> My understanding is we are considering a performance improvement by replacing the double precision floating point division in:
> 	double bytes_diff = ((double) cycles_diff) / port->cycles_per_byte;
> with an integer division:
> 	uint64_t bytes_diff = (cycles_diff << RTE_SCHED_TIME_SHIFT) / port->cycles_per_byte;
> I don't think this is going to have the claimed benefit, as acording to "Intel 64 and IA-32 Architectures Optimization  Reference Manual" (Appendix C), the latency of the integer division instruction is significantly bigger than the latency of integer division:
> 	Instruction FDIV double precision: latency = 38-40 cycles
> 	Instruction IDIV: latency = 56 - 80 cycles

I observed that performance when from 5Gbit/sec to 10Gbit/sec.
Mostly because the floating point engages more instruction units and does not
pipeline. Cycle count is not everything.  This was on Ivy Bridge processor.


> 3. Alternative
> I hear though your suggestion about replacing the floating point division with a more performant construction. One suggestion would be to replace it with an integer multiplication followed by a shift right, probably by using a uint64_t bytes_per_cycle_scaled_up (the inverse of cycles_per_bytes). I need to prototype this code myself. Would you be OK to look into providing an alternative implementation?
>

I looked into multiplative integer method, and will do it in future. But it has
more scaling issues since it would require that the values both be 32 bits.