From: Vlad Zolotarov
To: "Zhang, Helin", "Ananyev, Konstantin"
Cc: "dev@dpdk.org"
Subject: Re: i40e xmit path HW limitation
Date: Thu, 30 Jul 2015 19:44:24 +0300
Message-ID: <55BA5468.80109@cloudius-systems.com>
References: <55BA3B5D.4020402@cloudius-systems.com>

On 07/30/15 19:10, Zhang, Helin wrote:
>
>> -----Original Message-----
>> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com]
>> Sent: Thursday, July 30, 2015 7:58 AM
>> To: dev@dpdk.org; Ananyev, Konstantin; Zhang, Helin
>> Subject: RFC: i40e xmit path HW limitation
>>
>> Hi Konstantin, Helin,
>> there is a documented limitation of the xl710 controllers (i40e driver) which
>> is not handled in any way by the DPDK driver.
>> From the datasheet, chapter 8.4.1:
>>
>> "• A single transmit packet may span up to 8 buffers (up to 8 data descriptors
>> per packet, including both the header and payload buffers).
>> • The total number of data descriptors for the whole TSO (explained later on
>> in this chapter) is unlimited as long as each segment within the TSO obeys
>> the previous rule (up to 8 data descriptors per segment for both the TSO
>> header and the segment payload buffers)."
> Yes, I remember the RX side just supports 5 segments per packet receiving.
> But what's the possible issue you thought about?

Note that it's the Tx side we are talking about.

See commit 30520831f058cd9d75c0f6b360bc5c5ae49b5f27 in the linux net-next repo.
If such a cluster arrives and you post it on the HW ring, the HW will shut
this HW ring down permanently. The application will see that its ring is stuck.

>
>> This means that, for instance, a long cluster with small fragments has to be
>> linearized before it may be placed on the HW ring.
> What type of size of the small fragments? Basically 2KB is the default size of mbuf of most
> example applications. 2KB x 8 is bigger than 1.5KB. So it is enough for the maximum
> packet size we supported.
> If 1KB mbuf is used, don't expect it can transmit more than 8KB size of packet.

I kinda lost you here. Again, we are talking about the Tx side here, and the
buffers are not necessarily completely filled. Namely, there may be a cluster
with 15 fragments of 100 bytes each.
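To make the numbers concrete, below is a minimal sketch (an illustration, not
a patch) of the check an application would have to perform for the non-TSO
case. The helper name and the local constant are invented for the example;
the TSO case would additionally need a windowed check over each MSS worth of
payload plus the header.

#include <stdbool.h>
#include <rte_mbuf.h>

/* Per-packet limit from the XL710 datasheet quoted above:
 * up to 8 data descriptors per packet (or per TSO segment). */
#define XL710_MAX_DATA_DESC_PER_PKT 8

/* Hypothetical helper: a non-TSO mbuf chain may be posted as-is only
 * if it needs no more than 8 data descriptors, i.e. has at most 8
 * segments. A chain of 15 fragments of 100 bytes each fails this
 * check even though the packet is only 1500 bytes long. */
static inline bool
xl710_tx_chain_ok(const struct rte_mbuf *m)
{
	return m->nb_segs <= XL710_MAX_DATA_DESC_PER_PKT;
}

With proposal 1 below, this is essentially the predicate the callback would
evaluate, extended to cover the TSO case.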
>
>> In more standard environments like Linux or FreeBSD drivers the solution is
>> straightforward - call skb_linearize()/m_collapse() respectively.
>> In a non-conformist environment like DPDK, life is not that easy - there is
>> no easy way to collapse the cluster into a linear buffer from inside the
>> device driver, since the device driver doesn't allocate memory in the fast
>> path and utilizes only the user-allocated pools.
>> Here are two proposals for a solution:
>>
>>  1. We may provide a callback that would return TRUE to the user if a given
>>     cluster has to be linearized, and it should always be called before
>>     rte_eth_tx_burst(). Alternatively it may be called from inside
>>     rte_eth_tx_burst(), and rte_eth_tx_burst() is changed to return some
>>     error code for the case when one of the clusters it is given has to be
>>     linearized.
>>  2. Another option is to allocate a mempool in the driver with the
>>     elements consuming a single page each (standard 2KB buffers would
>>     do). The number of elements in the pool should be the Tx ring length
>>     multiplied by "64KB/(linear data length of the buffer in the pool
>>     above)". Here I use 64KB as the maximum packet length, not taking
>>     into account esoteric things like the "Giant" TSO mentioned in the
>>     spec above. Then we may actually go and linearize the cluster, if
>>     needed, on top of the buffers from the pool above, post the buffer
>>     from that mempool on the HW ring, link the original cluster to the
>>     new one (using the private data) and release it when the send is
>>     done.
>>
>> The first is a change in the API and would require some additional handling
>> (linearization) from the application. The second would require some
>> additional memory but would keep all the dirty details inside the driver and
>> would leave the rest of the code intact.
>>
>> Please comment.
>>
>> thanks,
>> vlad
>>
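P.S. To make option 2 a bit more concrete, here is a rough sketch of the copy
step only, assuming a dedicated pool ("lin_pool" is a made-up name) whose
elements are large enough to hold a whole packet. Error handling, offload
metadata copying and the completion-time release of the original cluster are
left out.

#include <rte_mbuf.h>
#include <rte_mempool.h>
#include <rte_memcpy.h>

/* Sketch: flatten a multi-segment cluster into a single mbuf taken
 * from a dedicated pool, so the HW sees one data descriptor instead
 * of one per fragment. */
static struct rte_mbuf *
linearize_cluster(struct rte_mempool *lin_pool, const struct rte_mbuf *m)
{
	struct rte_mbuf *flat = rte_pktmbuf_alloc(lin_pool);
	const struct rte_mbuf *seg;

	if (flat == NULL)
		return NULL;

	for (seg = m; seg != NULL; seg = seg->next) {
		/* Reserve room in the flat mbuf and copy this fragment. */
		char *dst = rte_pktmbuf_append(flat, seg->data_len);

		if (dst == NULL) {	/* pool element too small */
			rte_pktmbuf_free(flat);
			return NULL;
		}
		rte_memcpy(dst, rte_pktmbuf_mtod(seg, const void *),
			   seg->data_len);
	}
	return flat;
}

The interesting part, of course, is not the copy itself but who owns the flat
buffer afterwards - which is why option 2 links the original cluster to the
new one via the private data and frees it only on Tx completion.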