From mboxrd@z Thu Jan 1 00:00:00 1970 From: Stephen Hemminger Subject: Re: [PATCH v2 6/7] rte_sched: eliminate floating point in calculating byte clock Date: Tue, 17 Feb 2015 11:05:00 -0500 Message-ID: <20150217110500.41ed8a18@uryu.home.lan> References: <1423116841-19799-4-git-send-email-stephen@networkplumber.org> <1423116841-19799-6-git-send-email-stephen@networkplumber.org> <3EB4FA525960D640B5BDFFD6A3D8912632318070@IRSMSX108.ger.corp.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Cc: "dev-VfR2kkLFssw@public.gmane.org" , Stephen Hemminger To: "Dumitrescu, Cristian" Return-path: In-Reply-To: <3EB4FA525960D640B5BDFFD6A3D8912632318070-kPTMFJFq+rEMvF1YICWikbfspsVTdybXVpNB7YpNyf8@public.gmane.org> List-Id: patches and discussions about DPDK List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces-VfR2kkLFssw@public.gmane.org Sender: "dev" On Mon, 16 Feb 2015 22:44:31 +0000 "Dumitrescu, Cristian" wrote: > Hi Stephen, > > Sorry, NACK. > > 1. Overflow issue > As you declare cycles_per_byte as uint32_t, for a CPU frequency of 2-3 GHz, the line of code below results in overflow: > port->cycles_per_byte = (rte_get_tsc_hz() << RTE_SCHED_TIME_SHIFT) / params->rate; > Therefore, there is most likely a significant accuracy loss, which might result in more packets allowed to go out than it should. The tsc shifted is still 64 bits. and rate is 32 bits bytes/sec. I chose scale such that if clock = 3 Ghz then min rate = 715 bytes/sec = 5722 bits/sec > 2. Integer division has a higher cost than floating point division > My understanding is we are considering a performance improvement by replacing the double precision floating point division in: > double bytes_diff = ((double) cycles_diff) / port->cycles_per_byte; > with an integer division: > uint64_t bytes_diff = (cycles_diff << RTE_SCHED_TIME_SHIFT) / port->cycles_per_byte; > I don't think this is going to have the claimed benefit, as acording to "Intel 64 and IA-32 Architectures Optimization Reference Manual" (Appendix C), the latency of the integer division instruction is significantly bigger than the latency of integer division: > Instruction FDIV double precision: latency = 38-40 cycles > Instruction IDIV: latency = 56 - 80 cycles I observed that performance when from 5Gbit/sec to 10Gbit/sec. Mostly because the floating point engages more instruction units and does not pipeline. Cycle count is not everything. This was on Ivy Bridge processor. > 3. Alternative > I hear though your suggestion about replacing the floating point division with a more performant construction. One suggestion would be to replace it with an integer multiplication followed by a shift right, probably by using a uint64_t bytes_per_cycle_scaled_up (the inverse of cycles_per_bytes). I need to prototype this code myself. Would you be OK to look into providing an alternative implementation? > I looked into multiplative integer method, and will do it in future. But it has more scaling issues since it would require that the values both be 32 bits.