From mboxrd@z Thu Jan 1 00:00:00 1970 From: Stephen Hemminger Subject: Re: [PATCH v2 6/6] rte_sched: eliminate floating point in calculating byte clock Date: Thu, 12 Mar 2015 16:09:26 -0700 Message-ID: <20150312160926.5d9af5d7@urahara> References: <1426004018-25948-1-git-send-email-stephen@networkplumber.org> <1426004018-25948-7-git-send-email-stephen@networkplumber.org> <3EB4FA525960D640B5BDFFD6A3D89126323225C5@IRSMSX108.ger.corp.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Cc: "dev-VfR2kkLFssw@public.gmane.org" , Stephen Hemminger To: "Dumitrescu, Cristian" Return-path: In-Reply-To: <3EB4FA525960D640B5BDFFD6A3D89126323225C5-kPTMFJFq+rEMvF1YICWikbfspsVTdybXVpNB7YpNyf8@public.gmane.org> List-Id: patches and discussions about DPDK List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces-VfR2kkLFssw@public.gmane.org Sender: "dev" On Thu, 12 Mar 2015 19:06:39 +0000 "Dumitrescu, Cristian" wrote: > > > > -----Original Message----- > > From: Stephen Hemminger [mailto:stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ@public.gmane.org] > > Sent: Tuesday, March 10, 2015 4:14 PM > > To: Dumitrescu, Cristian > > Cc: dev-VfR2kkLFssw@public.gmane.org; Stephen Hemminger; Stephen Hemminger > > Subject: [PATCH v2 6/6] rte_sched: eliminate floating point in calculating byte > > clock > > > > From: Stephen Hemminger > > > > The old code was doing a floating point divide for each rte_dequeue() > > which is very expensive. Change to using fixed point scaled math instead. > > This improved performance from 5Gbit/sec to 10 Gbit/sec > > > > Signed-off-by: Stephen Hemminger > > --- > > v2 -- no changes > > despite objections, the performance observation is real > > on Intel(R) Core(TM) i7-3770 CPU > > > > lib/librte_sched/rte_sched.c | 14 ++++++++++---- > > 1 file changed, 10 insertions(+), 4 deletions(-) > > > > diff --git a/lib/librte_sched/rte_sched.c b/lib/librte_sched/rte_sched.c > > index 74d0e0a..522a647 100644 > > --- a/lib/librte_sched/rte_sched.c > > +++ b/lib/librte_sched/rte_sched.c > > @@ -102,6 +102,9 @@ > > > > #define RTE_SCHED_BMP_POS_INVALID UINT32_MAX > > > > +/* For cycles_per_byte calculation */ > > +#define RTE_SCHED_TIME_SHIFT 20 > > + > > struct rte_sched_subport { > > /* Token bucket (TB) */ > > uint64_t tb_time; /* time of last update */ > > @@ -239,7 +242,7 @@ struct rte_sched_port { > > uint64_t time_cpu_cycles; /* Current CPU time measured in CPU > > cyles */ > > uint64_t time_cpu_bytes; /* Current CPU time measured in bytes > > */ > > uint64_t time; /* Current NIC TX time measured in bytes */ > > - double cycles_per_byte; /* CPU cycles per byte */ > > + uint32_t cycles_per_byte; /* CPU cycles per byte (scaled) */ > > > > /* Scheduling loop detection */ > > uint32_t pipe_loop; > > @@ -657,7 +660,9 @@ rte_sched_port_config(struct > > rte_sched_port_params *params) > > port->time_cpu_cycles = rte_get_tsc_cycles(); > > port->time_cpu_bytes = 0; > > port->time = 0; > > - port->cycles_per_byte = ((double) rte_get_tsc_hz()) / ((double) > > params->rate); > > + > > + port->cycles_per_byte = (rte_get_tsc_hz() << > > RTE_SCHED_TIME_SHIFT) > > + / params->rate; > > > > /* Scheduling loop detection */ > > port->pipe_loop = RTE_SCHED_PIPE_INVALID; > > @@ -2126,11 +2131,12 @@ rte_sched_port_time_resync(struct > > rte_sched_port *port) > > { > > uint64_t cycles = rte_get_tsc_cycles(); > > uint64_t cycles_diff = cycles - port->time_cpu_cycles; > > - double bytes_diff = ((double) cycles_diff) / port->cycles_per_byte; > > + uint64_t bytes_diff = (cycles_diff << RTE_SCHED_TIME_SHIFT) > > + / port->cycles_per_byte; > > > > /* Advance port time */ > > port->time_cpu_cycles = cycles; > > - port->time_cpu_bytes += (uint64_t) bytes_diff; > > + port->time_cpu_bytes += bytes_diff; > > if (port->time < port->time_cpu_bytes) { > > port->time = port->time_cpu_bytes; > > } > > -- > > 2.1.4 > > Stephen, > > We agreed during the previous round to look at 64-bit multiplication option, but looks like this patch is identical to the previous one. Did you meet any issues in implementing this approach? As stated before, I do not think this is the best solution for the reasons listed previously, and this part of the code is too sensitive to take the risk. > > Since Thomas indicated these patches will be considered for 2.1 rather than 2.0 release, it looks like we have some time to refine this work. I would reiterate the same proposal that Thomas made: re-submit the patches where we have consensus, and keep this one out for the moment; you and me can sync up offline and come back with an implementation proposal that would hopefully address the previous concerns for 2.1 release. Would this work for you? > > Thank you for your work and for your understanding! > > Regards, > Cristian > I haven't had time to do the analysis to figure out how to keep the maths in range so that 64 bit multiplicative divide would work. When I first tried it the issue was that the cycles_diff and bytes_diff would both have to scaled so that they are 32 bit values. Should be possible, might get back to it (or someone else might take up the cause).