From mboxrd@z Thu Jan 1 00:00:00 1970 From: Amir Vadai Subject: Re: [PATCH 7/8] net: support tx_ring per UP in HW based QoS mechanism Date: Wed, 14 Mar 2012 12:09:30 +0200 Message-ID: <4F606E5A.6080407@mellanox.com> References: <1331659323-12904-1-git-send-email-amirv@mellanox.com> <1331659323-12904-8-git-send-email-amirv@mellanox.com> <4F5F9087.7090607@intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset="ISO-8859-1"; format=flowed Content-Transfer-Encoding: 7bit Cc: "David S. Miller" , , Roland Dreier , Oren Duer , Amir Vadai , Jeff Kirsher , Eilon Greenstein To: John Fastabend Return-path: Received: from eu1sys200aog113.obsmtp.com ([207.126.144.135]:60605 "HELO eu1sys200aog113.obsmtp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with SMTP id S1760368Ab2CNKJm (ORCPT ); Wed, 14 Mar 2012 06:09:42 -0400 In-Reply-To: <4F5F9087.7090607@intel.com> Sender: netdev-owner@vger.kernel.org List-ID: On 03/13/2012 08:23 PM, John Fastabend wrote: > On 3/13/2012 10:22 AM, Amir Vadai wrote: >> From: Amir Vadai >> >> The Current HW based QoS mechanism which was introduced in commit 4f57c087de9 >> "net: implement mechanism for HW based QOS" is in orientation to ETS traffic >> class. This patch introduces an approach which allow to use this mechanism also >> with hardware who has queues per user priority (UP). After the change, >> __skb_tx_hash() will direct a flow to a tx ring from a range of tx rings. This >> range is defined by the caller function by the specific HW. If TC based queues, >> the range is by TC number and for UP based queues, the range is by UP. >> > ETS is one specific use case for mqprio it can easily be used with other > hardware transmission selection algorithms 802.1Q std based or otherwise. > > The mapping is really just an skb->priority to group of queues (qoffset and > qcount). I happened to call the queue grouping a traffic class because that > aligns with 802.1Q but it _really_ is just a queue grouping. This is good for untagged traffic, but for tagged traffic I can see 2 problems with this approach: 1. mqprio mapping could contradict egress map or 8021Qaz mapping (UP <=> TC). This could be solved (not very elegantly) by forcing the mappings to be synced. 2. egress map is per vlan, and mqprio mapping is one global mapping. > > In your case what would it mean to change the map and num_tc see 'tc': > > [root@jf-dev1-dcblab netperf]# tc qdisc add dev eth3 root mqprio help > Usage: ... mqprio [num_tc NUMBER] [map P0 P1 ...] > [queues count1@offset1 count2@offset2 ...] [hw 1|0] > > For example setting 'num_tc 8 map 0 1 2 3 0 1 2 3' looks like it might > not work correctly. Would you end up with an skb tagged with priority > 4,5,6,7 being sent out UP queues 0,1,2,3? My quess is that won't work > with PFC or your ETS correctly. I don't see a problem here. For example, skb tagged with priority 5 is mapped to UP 1. And sent through one of the tx rings of UP 1. All the rings of UP 1 share the same transmission queue (schedule queue) which is controlled by PFC and ETS by the HW. What is the problem here? > > In the canonical iSCSI case we put iscsid in a net_prio cgroup to get the > priority set then use the priority to steer the skb to the correct queue > groupings. In your case I think you can just fail any num_tc != 8 and keep > the dflt map 1:1 then this should work. What did I miss? It looks like you > already fail the num_tc != 8 case so why do we need this change? > > At most maybe we need a flag to indicate the mqprio map can't be changed in > some cases. What you suggest is that the priority in net_prio cgroup will be the User Priority, and not just the skb priority? And also, for tagged traffic, how could it be forced to be synced with egress map? > > >> CC: John Fastabend >> CC: Jeff Kirsher >> CC: Eilon Greenstein >> Signed-off-by: Amir Vadai >> --- >> drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 11 ++++++++++- >> drivers/net/ethernet/mellanox/mlx4/en_tx.c | 9 +++------ >> include/linux/netdevice.h | 12 +++++++++++- >> include/linux/skbuff.h | 3 ++- >> net/core/dev.c | 10 +--------- >> 5 files changed, 27 insertions(+), 18 deletions(-) >> > [...] > >> >> void bnx2x_set_num_queues(struct bnx2x *bp) >> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c >> index 7a49830..d0d96e3 100644 >> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c >> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c >> @@ -570,18 +570,15 @@ static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc, struct sk_buff *sk >> >> u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb) >> { >> - struct mlx4_en_priv *priv = netdev_priv(dev); >> - int up = -1; >> + int up = 0; >> >> if (vlan_tx_tag_present(skb)) >> up = (vlan_tx_tag_get(skb)>> 13); > I was trying to avoid logic like this in select_queue(). Why? > > Can we get the same behavior by keeping the egress map and mqprio > map in sync? As I said above, if we force egress map to be synced to mqprio mapping, we loose it's power - mqprio is global, and egress map is per vlan. > >> else if (dev->num_tc) >> up = netdev_get_prio_tc_map(dev, skb->priority); >> >> - if (up>= 0) >> - return MLX4_EN_NUM_TX_RINGS + up; >> - >> - return __skb_tx_hash(dev, skb, MLX4_EN_NUM_TX_RINGS); >> + return __skb_tx_hash(dev, skb, MLX4_EN_NUM_TX_RINGS, >> + MLX4_EN_NUM_TX_RINGS + up, 1); >> } >> > Thanks, > John Thanks, Amir