Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: XDP redirect measurements, gotchas and tracepoints
From: John Fastabend @ 2017-08-28 16:14 UTC (permalink / raw)
  To: Andy Gospodarek, Michael Chan
  Cc: Jesper Dangaard Brouer, Alexander Duyck, Duyck, Alexander H,
	pstaszewski@itcare.pl, netdev@vger.kernel.org,
	xdp-newbies@vger.kernel.org, borkmann@iogearbox.net
In-Reply-To: <20170828160237.GA70910@C02RW35GFVH8.dhcp.broadcom.net>

On 08/28/2017 09:02 AM, Andy Gospodarek wrote:
> On Fri, Aug 25, 2017 at 08:28:55AM -0700, Michael Chan wrote:
>> On Fri, Aug 25, 2017 at 8:10 AM, John Fastabend
>> <john.fastabend@gmail.com> wrote:
>>> On 08/25/2017 05:45 AM, Jesper Dangaard Brouer wrote:
>>>> On Thu, 24 Aug 2017 20:36:28 -0700
>>>> Michael Chan <michael.chan@broadcom.com> wrote:
>>>>
>>>>> On Wed, Aug 23, 2017 at 1:29 AM, Jesper Dangaard Brouer
>>>>> <brouer@redhat.com> wrote:
>>>>>> On Tue, 22 Aug 2017 23:59:05 -0700
>>>>>> Michael Chan <michael.chan@broadcom.com> wrote:
>>>>>>
>>>>>>> On Tue, Aug 22, 2017 at 6:06 PM, Alexander Duyck
>>>>>>> <alexander.duyck@gmail.com> wrote:
>>>>>>>> On Tue, Aug 22, 2017 at 1:04 PM, Michael Chan <michael.chan@broadcom.com> wrote:
>>>>>>>>>
>>>>>>>>> Right, but it's conceivable to add an API to "return" the buffer to
>>>>>>>>> the input device, right?
>>>>>>
>>>>>> Yes, I would really like to see an API like this.
>>>>>>
>>>>>>>>
>>>>>>>> You could, it is just added complexity. "just free the buffer" in
>>>>>>>> ixgbe usually just amounts to one atomic operation to decrement the
>>>>>>>> total page count since page recycling is already implemented in the
>>>>>>>> driver. You still would have to unmap the buffer regardless of if you
>>>>>>>> were recycling it or not so all you would save is 1.000015259 atomic
>>>>>>>> operations per packet. The fraction is because once every 64K uses we
>>>>>>>> have to bulk update the count on the page.
>>>>>>>>
>>>>>>>
>>>>>>> If the buffer is returned to the input device, the input device can
>>>>>>> keep the DMA mapping.  All it needs to do is to dma_sync it back to
>>>>>>> the input device when the buffer is returned.
>>>>>>
>>>>>> Yes, exactly, return to the input device. I really think we should
>>>>>> work on a solution where we can keep the DMA mapping around.  We have
>>>>>> an opportunity here to make ndo_xdp_xmit TX queues use a specialized
>>>>>> page return call, to achieve this. (I imagine other arch's have a high
>>>>>> DMA overhead than Intel)
>>>>>>
>>>>>> I'm not sure how the API should look.  The ixgbe recycle mechanism and
>>>>>> splitting the page (into two packets) actually complicates things, and
>>>>>> tie us into a page-refcnt based model.  We could get around this by
>>>>>> each driver implementing a page-return-callback, that allow us to
>>>>>> return the page to the input device?  Then, drivers implementing the
>>>>>> 1-packet-per-page can simply check/read the page-refcnt, and if it is
>>>>>> "1" DMA-sync and reuse it in the RX queue.
>>>>>>
>>>>>
>>>>> Yeah, based on Alex' description, it's not clear to me whether ixgbe
>>>>> redirecting to a non-intel NIC or vice versa will actually work.  It
>>>>> sounds like the output device has to make some assumptions about how
>>>>> the page was allocated by the input device.
>>>>
>>>> Yes, exactly. We are tied into a page refcnt based scheme.
>>>>
>>>> Besides the ixgbe page recycle scheme (which keeps the DMA RX-mapping)
>>>> is also tied to the RX queue size, plus how fast the pages are returned.
>>>> This makes it very hard to tune.  As I demonstrated, default ixgbe
>>>> settings does not work well with XDP_REDIRECT.  I needed to increase
>>>> TX-ring size, but it broke page recycling (dropping perf from 13Mpps to
>>>> 10Mpps) so I also needed it increase RX-ring size.  But perf is best if
>>>> RX-ring size is smaller, thus two contradicting tuning needed.
>>>>
>>>
>>> The changes to decouple the ixgbe page recycle scheme (1pg per descriptor
>>> split into two halves being the default) from the number of descriptors
>>> doesn't look too bad IMO. It seems like it could be done by having some
>>> extra pages allocated upfront and pulling those in when we need another
>>> page.
>>>
>>> This would be a nice iterative step we could take on the existing API.
>>>
>>>>
>>>>> With buffer return API,
>>>>> each driver can cleanly recycle or free its own buffers properly.
>>>>
>>>> Yes, exactly. And RX-driver can implement a special memory model for
>>>> this queue.  E.g. RX-driver can know this is a dedicated XDP RX-queue
>>>> which is never used for SKBs, thus opening for new RX memory models.
>>>>
>>>> Another advantage of a return API.  There is also an opportunity for
>>>> avoiding the DMA map on TX. As we need to know the from-device.  Thus,
>>>> we can add a DMA API, where we can query if the two devices uses the
>>>> same DMA engine, and can reuse the same DMA address the RX-side already
>>>> knows.
>>>>
>>>>
>>>>> Let me discuss this further with Andy to see if we can come up with a
>>>>> good scheme.
>>>>
>>>> Sound good, looking forward to hear what you come-up with :-)
>>>>
>>>
>>> I guess by this thread we will see a broadcom nic with redirect support
>>> soon ;)
>>
>> Yes, Andy actually has finished the coding for XDP_REDIRECT, but the
>> buffer recycling scheme has some problems.  We can make it work for
>> Broadcom to Broadcom only, but we want a better solution.
> 
> (Sorry for the radio silence I was AFK last week...)
> 
> I finished it a little while ago, but Michael and I both have concerns
> that in a heterogenous hardware setup one can quickly run into issues
> and haven't had time to work-up a few solutions before bringing this up
> formally.  It also isn't a major problem until the second
> optimized/native XDP driver appears on the scene.
> 
> I can run a test where XDP redirects from an ixgbe <-> bnxt_en based
> device I get OOM kills after only a few seconds, due to the lack of
> feedback between the different drivers that the pointer to xdp->data can
> be freed/reused/etc and the different buffer allocation schemes used.
> 

hmm so how do you get OOM here, I expect the number of in-flight xdp
bufs should be limited by the number of xdps that can be posted to the
outgoing interface. If we are hitting OOM that _should_ mean the size of
the tx queue is too large. Ixgbe should be free'ing the buffer if an error
is returned from xdp xmit routines (will check this today). And bnxt should
return an error if we hit some high water mark on xmit.

> Initially I did not think this was an issue and that xdp_do_flush_map()
> would handle this, but I think there is a still a need to be able to
> signal back to the receving device that the buffer allocated has been
> xmitted by the transmitter and can be freed.  Since there is really no
> guarantee that completion of an XDP_REDIRECT action means that it is
> safe to free area pointed to by xdp->data area that contains the packet
> to be xmitted.  Since the packet done interrupt handler in a driver
> cannot signal back the the receiving driver that the buffer is now safe
> to reuse/free there is a chance for trouble.  

There should be some high water mark on how many outstanding packets
can be in-flight. At the moment I assumed this was something related to
queue lengths a more explicit high water mark could added to the xmit path
and tracked in xdp infrastructure.

> 
> I was hoping to spend some time this week cooking up a patch that just
> did not allow use of XDP_REDIRECT when the ifindex of the outgoing
> device did not match that of the device to which the XDP prog was
> attached, but that probably is not worth the trouble when we would just
> fix it for real.  (It would also require some really terrible hacks to
> enforce this in the kernel when all that is being done is setting up a
> map that contains the redirect table, so it is probably not useful.)
> 

I would prefer to solve the problem vs limiting the implementation

> The basic prototype would be something like this:
> 
> (rx packet interrupt on eth0, leads to napi_poll)
> napi_poll (eth0)
>   call xdp_prog (eth0)
>     xdp_do_redirect (eth0)
>       ndo_xdp_xmit (eth1)
>       mark buffer with information netdev/ring/etc
>       place buffer on tx ring for eth1
> 
> (tx done interrupt on eth1, leads to napi_poll)
> napi_poll (eth1)
>   process tx interrupt (eth1)
>     look up information about netdev/ring/etc
>     ndo_xdp_data_free (eth0, ring, etc)
> 
> Thoughts?
> 

^ permalink raw reply

* Re: [PATCH net 0/2] netfilter: ipvs: some fixes in sctp_conn_schedule
From: Pablo Neira Ayuso @ 2017-08-28 16:17 UTC (permalink / raw)
  To: Xin Long
  Cc: netfilter-devel, Alex Gartrell, lvs-devel, netdev, horms, ja,
	wensong
In-Reply-To: <cover.1503207076.git.lucien.xin@gmail.com>

On Sun, Aug 20, 2017 at 01:38:06PM +0800, Xin Long wrote:
> Patch 1/2 fixes the regression introduced by commit 5e26b1b3abce.
> Patch 2/2 makes ipvs not create conn for sctp ABORT packet.

Will wait for Julian and Simon to tell me what I should do with this.

Thanks!

^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH] e1000e: apply burst mode settings only on default
From: Alexander Duyck @ 2017-08-28 16:25 UTC (permalink / raw)
  To: Neftin, Sasha
  Cc: Willem de Bruijn, "jeffrey.t.kirsher, Netdev,
	Willem de Bruijn, intel-wired-lan
In-Reply-To: <291b863f-552e-2f3b-f658-e812d0848949@intel.com>

On Sun, Aug 27, 2017 at 1:30 AM, Neftin, Sasha <sasha.neftin@intel.com> wrote:
> On 8/25/2017 18:06, Willem de Bruijn wrote:
>>
>> From: Willem de Bruijn <willemb@google.com>
>>
>> Devices that support FLAG2_DMA_BURST have different default values
>> for RDTR and RADV. Apply burst mode default settings only when no
>> explicit value was passed at module load.
>>
>> The RDTR default is zero. If the module is loaded for low latency
>> operation with RxIntDelay=0, do not override this value with a burst
>> default of 32.
>>
>> Move the decision to apply burst values earlier, where explicitly
>> initialized module variables can be distinguished from defaults.
>>
>> Signed-off-by: Willem de Bruijn <willemb@google.com>
>
> This patch looks good for me, but I would like hear second opinion.

This code is an improvement over what was already there. I am good with this.

Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>

^ permalink raw reply

* Re: [ethtool] ethtool: Remove UDP Fragmentation Offload use from ethtool
From: David Miller @ 2017-08-28 16:38 UTC (permalink / raw)
  To: eric.dumazet; +Cc: tariqt, linville, netdev, eranbe, shakerd
In-Reply-To: <1503932411.11498.67.camel@edumazet-glaptop3.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 28 Aug 2017 08:00:11 -0700

> On Mon, 2017-08-28 at 15:38 +0300, Tariq Toukan wrote:
>> From: Shaker Daibes <shakerd@mellanox.com>
>> 
>> UFO was removed in kernel, here we remove it in ethtool app.
>> 
>> Fixes the following issue:
>> Features for ens8:
>> Cannot get device udp-fragmentation-offload settings: Operation not supported
>> 
>> Tested with "make check"
>> 
>> Signed-off-by: Shaker Daibes <shakerd@mellanox.com>
>> Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
>> ---
> 
> 
> Hi guys
> 
> I would rather remove the warning, but leave the ability to switch UFO
> on machines running old kernel but a recent ethtool.
> 
> ethtool does not need to be downgraded every time we boot an old
> kernel ;)

Agreed, we shouldn't be completely removing support for UFO from
ethtool.

^ permalink raw reply

* Re: [Intel-wired-lan] how to submit fixes for i40e/i40evf?
From: Alexander Duyck @ 2017-08-28 17:00 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: David S. Miller, Jeff Kirsher, Netdev, intel-wired-lan
In-Reply-To: <20170826002852.751d5f07@elisabeth>

On Fri, Aug 25, 2017 at 3:28 PM, Stefano Brivio <sbrivio@redhat.com> wrote:
> On Fri, 25 Aug 2017 15:10:08 -0700
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> On Fri, Aug 25, 2017 at 1:52 PM, Stefano Brivio <sbrivio@redhat.com> wrote:
>>
>> [...]
>>
>> > Once patches reach Intel's patchwork, will they need to wait for some
>> > kind of periodically scheduled pull request process?
>>
>> Once in the patchwork they go through testing and after they have
>> passed testing Jeff will try to push them to Dave.
>
> Ok, the whole part above is clear, thanks a lot for clarifying.
>
>> > I don't know if a process is actually defined at this level of detail,
>> > but still I feel it's wrong that an obvious fix for a potential crash is
>> > waiting in some sort of limbo for 10 days now. Sure, worse things
>> > happen in the world, but I can't understand what this patch is waiting
>> > for.
>>
>> Well in the case of your patch it was rejected as it didn't apply to
>> Jeff's tree
>
> It actually did when I posted it.
>
>> and conflicted with Jacob Keller's patch. He submitted a v2 on Tuesday
>> which has only been applied for a few days. Once it receives a
>> "Tested-by:"
>
> Which, if I understood correctly, only comes after some internal testing
> process, right?
>
>> it will be ready for submission assuming it passes testing.
>
> Now that patch is again in a v2 pull request for net-next, without the
> changes I suggested for the commit message. And the same exact code
> changes were around for two weeks. IMHO there's room for improvement,
> so to speak.
>
>> I hope that helps to clarify things.
>
> It did to some extent, and thanks again for that.

One other thing I forgot that adds to the confusion is that you will
probably want to base you patch on the dev-queue branch of those
trees, not the master branch. The master branch is what Jeff submits
to Dave if I am not mistaken, while dev-queue is the location for the
ongoing development.

- Alex

^ permalink raw reply

* Re: [PATCH] igb: check memory allocation failure
From: Christophe JAILLET @ 2017-08-28 17:12 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter, Kirsher, Jeffrey T
  Cc: intel-wired-lan@lists.osuosl.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-janitors@vger.kernel.org
In-Reply-To: <E0D909EE5BB15A4699798539EA149D7F0779685B@ORSMSX103.amr.corp.intel.com>

Le 28/08/2017 à 01:09, Waskiewicz Jr, Peter a écrit :
> On 8/27/17 2:42 AM, Christophe JAILLET wrote:
>> Check memory allocation failures and return -ENOMEM in such cases, as
>> already done for other memory allocations in this function.
>>
>> This avoids NULL pointers dereference.
>>
>> Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
>> ---
>>    drivers/net/ethernet/intel/igb/igb_main.c | 2 ++
>>    1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
>> index fd4a46b03cc8..837d9b46a390 100644
>> --- a/drivers/net/ethernet/intel/igb/igb_main.c
>> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
>> @@ -3162,6 +3162,8 @@ static int igb_sw_init(struct igb_adapter *adapter)
>>    	/* Setup and initialize a copy of the hw vlan table array */
>>    	adapter->shadow_vfta = kcalloc(E1000_VLAN_FILTER_TBL_SIZE, sizeof(u32),
>>    				       GFP_ATOMIC);
>> +	if (!adapter->shadow_vfta)
>> +		return -ENOMEM;
> Looks reasonable to me.
>
> A larger issue though I see in this function is that if we return
> -ENOMEM here, and if we return -ENOMEM from igb_init_interrupt_scheme()
> below on failure, we leak adapter->mac_table (and adapter->shadow_vfta
> in the latter).  We should add a proper unwind to free up the memory on
> failure.
>
> -PJ
>
Hi,

in fact, there is no leak because the only caller of 'igb_sw_init()' 
(i.e. 'igb_probe()'), already frees these resources in case of error, 
see [1]

These resources are also freed  in 'igb_remove()'.

Best reagrds,
CJ

[1]: 
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/drivers/net/ethernet/intel/igb/igb_main.c#n2775

^ permalink raw reply

* Re: [PATCH net-next] selftests/bpf: check the instruction dumps are populated
From: Martin KaFai Lau @ 2017-08-28 17:19 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netdev, daniel, oss-drivers
In-Reply-To: <20170825213957.4768-1-jakub.kicinski@netronome.com>

On Fri, Aug 25, 2017 at 02:39:57PM -0700, Jakub Kicinski wrote:
> Add a basic test for checking whether kernel is populating
> the jited and xlated BPF images.  It was used to confirm
> the behaviour change from commit d777b2ddbecf ("bpf: don't
> zero out the info struct in bpf_obj_get_info_by_fd()"),
> which made bpf_obj_get_info_by_fd() usable for retrieving
> the image dumps.
Acked-by: Martin KaFai Lau <kafai@fb.com>

>
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> ---
>  tools/testing/selftests/bpf/test_progs.c | 16 ++++++++++++----
>  1 file changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
> index 1cb037803679..11ee25cea227 100644
> --- a/tools/testing/selftests/bpf/test_progs.c
> +++ b/tools/testing/selftests/bpf/test_progs.c
> @@ -279,7 +279,7 @@ static void test_bpf_obj_id(void)
>  	/* +1 to test for the info_len returned by kernel */
>  	struct bpf_prog_info prog_infos[nr_iters + 1];
>  	struct bpf_map_info map_infos[nr_iters + 1];
> -	char jited_insns[128], xlated_insns[128];
> +	char jited_insns[128], xlated_insns[128], zeros[128];
>  	__u32 i, next_id, info_len, nr_id_found, duration = 0;
>  	int sysctl_fd, jit_enabled = 0, err = 0;
>  	__u64 array_value;
> @@ -305,6 +305,7 @@ static void test_bpf_obj_id(void)
>  		objs[i] = NULL;
>
>  	/* Check bpf_obj_get_info_by_fd() */
> +	bzero(zeros, sizeof(zeros));
>  	for (i = 0; i < nr_iters; i++) {
>  		err = bpf_prog_load(file, BPF_PROG_TYPE_SOCKET_FILTER,
>  				    &objs[i], &prog_fds[i]);
> @@ -318,6 +319,8 @@ static void test_bpf_obj_id(void)
>  		/* Check getting prog info */
>  		info_len = sizeof(struct bpf_prog_info) * 2;
>  		bzero(&prog_infos[i], info_len);
> +		bzero(jited_insns, sizeof(jited_insns));
> +		bzero(xlated_insns, sizeof(xlated_insns));
>  		prog_infos[i].jited_prog_insns = ptr_to_u64(jited_insns);
>  		prog_infos[i].jited_prog_len = sizeof(jited_insns);
>  		prog_infos[i].xlated_prog_insns = ptr_to_u64(xlated_insns);
> @@ -328,15 +331,20 @@ static void test_bpf_obj_id(void)
>  			  prog_infos[i].type != BPF_PROG_TYPE_SOCKET_FILTER ||
>  			  info_len != sizeof(struct bpf_prog_info) ||
>  			  (jit_enabled && !prog_infos[i].jited_prog_len) ||
> -			  !prog_infos[i].xlated_prog_len,
> +			  (jit_enabled &&
> +			   !memcmp(jited_insns, zeros, sizeof(zeros))) ||
> +			  !prog_infos[i].xlated_prog_len ||
> +			  !memcmp(xlated_insns, zeros, sizeof(zeros)),
>  			  "get-prog-info(fd)",
> -			  "err %d errno %d i %d type %d(%d) info_len %u(%lu) jit_enabled %d jited_prog_len %u xlated_prog_len %u\n",
> +			  "err %d errno %d i %d type %d(%d) info_len %u(%lu) jit_enabled %d jited_prog_len %u xlated_prog_len %u jited_prog %d xlated_prog %d\n",
>  			  err, errno, i,
>  			  prog_infos[i].type, BPF_PROG_TYPE_SOCKET_FILTER,
>  			  info_len, sizeof(struct bpf_prog_info),
>  			  jit_enabled,
>  			  prog_infos[i].jited_prog_len,
> -			  prog_infos[i].xlated_prog_len))
> +			  prog_infos[i].xlated_prog_len,
> +			  !!memcmp(jited_insns, zeros, sizeof(zeros)),
> +			  !!memcmp(xlated_insns, zeros, sizeof(zeros))))
>  			goto done;
>
>  		map_fds[i] = bpf_find_map(__func__, objs[i], "test_map_id");
> --
> 2.11.0
>

^ permalink raw reply

* Re: [net-next PATCH 1/9] bpf: convert sockmap field attach_bpf_fd2 to type
From: Alexei Starovoitov @ 2017-08-28 17:22 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, davem, netdev
In-Reply-To: <20170828141004.14143.19285.stgit@john-Precision-Tower-5810>

On Mon, Aug 28, 2017 at 07:10:04AM -0700, John Fastabend wrote:
> In the initial sockmap API we provided strparser and verdict programs
> using a single attach command by extending the attach API with a the
> attach_bpf_fd2 field.
> 
> However, if we add other programs in the future we will be adding a
> field for every new possible type, attach_bpf_fd(3,4,..). This
> seems a bit clumsy for an API. So lets push the programs using two
> new type fields.
> 
>    BPF_SK_SKB_STREAM_PARSER
>    BPF_SK_SKB_STREAM_VERDICT
> 
> This has the advantage of having a readable name and can easily be
> extended in the future.
> 
> Updates to samples and sockmap included here also generalize tests
> slightly to support upcoming patch for multiple map support.
> 
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
> Suggested-by: Alexei Starovoitov <ast@kernel.org>

lgtm
Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [net-next PATCH 2/9] bpf: sockmap, remove STRPARSER map_flags and add multi-map support
From: Alexei Starovoitov @ 2017-08-28 17:31 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, davem, netdev
In-Reply-To: <20170828141025.14143.23990.stgit@john-Precision-Tower-5810>

On Mon, Aug 28, 2017 at 07:10:25AM -0700, John Fastabend wrote:
> The addition of map_flags BPF_SOCKMAP_STRPARSER flags was to handle a
> specific use case where we want to have BPF parse program disabled on
> an entry in a sockmap.
> 
> However, Alexei found the API a bit cumbersome and I agreed. Lets
> remove the STRPARSER flag and support the use case by allowing socks
> to be in multiple maps. This allows users to create two maps one with
> programs attached and one without. When socks are added to maps they
> now inherit any programs attached to the map. This is a nice
> generalization and IMO improves the API.
> 
> The API rules are less ambiguous and do not need a flag:
> 
>   - When a sock is added to a sockmap we have two cases,
> 
>      i. The sock map does not have any attached programs so
>         we can add sock to map without inheriting bpf programs.
>         The sock may exist in 0 or more other maps.
> 
>     ii. The sock map has an attached BPF program. To avoid duplicate
>         bpf programs we only add the sock entry if it does not have
>         an existing strparser/verdict attached, returning -EBUSY if
>         a program is already attached. Otherwise attach the program
>         and inherit strparser/verdict programs from the sock map.
> 
> This allows for socks to be in a multiple maps for redirects and
> inherit a BPF program from a single map.
> 
> Also this patch simplifies the logic around BPF_{EXIST|NOEXIST|ANY}
> flags. In the original patch I tried to be extra clever and only
> update map entries when necessary. Now I've decided the complexity
> is not worth it. If users constantly update an entry with the same
> sock for no reason (i.e. update an entry without actually changing
> any parameters on map or sock) we still do an alloc/release. Using
> this and allowing multiple entries of a sock to exist in a map the
> logic becomes much simpler.
> 
> Note: Now that multiple maps are supported the "maps" pointer called
> when a socket is closed becomes a list of maps to remove the sock from.
> To keep the map up to date when a sock is added to the sockmap we must
> add the map/elem in the list. Likewise when it is removed we must
> remove it from the list. This results in searching the per psock list
> on delete operation. On TCP_CLOSE events we walk the list and remove
> the psock from all map/entry locations. I don't see any perf
> implications in this because at most I have a psock in two maps. If
> a psock were to be in many maps its possibly this might be noticeable
> on delete but I can't think of a reason to dup a psock in many maps.
> The sk_callback_lock is used to protect read/writes to the list. This
> was convenient because in all locations we were taking the lock
> anyways just after working on the list. Also the lock is per sock so
> in normal cases we shouldn't see any contention.
> 
> Suggested-by: Alexei Starovoitov <ast@kernel.org>
> Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>

nice. I think implementation got cleaner too.
thanks for addresing the comments quickly!
Acked-by: Alexei Starovoitov <ast@kernel.org>

> ---
>  include/uapi/linux/bpf.h |    3 -
>  kernel/bpf/sockmap.c     |  269 ++++++++++++++++++++++++++++------------------
>  2 files changed, 165 insertions(+), 107 deletions(-)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 97227be..08c206a 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -143,9 +143,6 @@ enum bpf_attach_type {
>  
>  #define MAX_BPF_ATTACH_TYPE __MAX_BPF_ATTACH_TYPE
>  
> -/* If BPF_SOCKMAP_STRPARSER is used sockmap will use strparser on receive */
> -#define BPF_SOCKMAP_STRPARSER	(1U << 0)
> -

Please update tools/include/uapi/linux/bpf.h as well.
Right now it's kinda messed up and still has:
enum bpf_sockmap_flags {
        BPF_SOCKMAP_UNSPEC,
        BPF_SOCKMAP_STRPARSER,
        __MAX_BPF_SOCKMAP_FLAG
};
that's separate patch of course. Not a blocker for this set.

^ permalink raw reply

* [iproute PATCH] ss: Fix for added diag support check
From: Phil Sutter @ 2017-08-28 17:31 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev

Commit 9f66764e308e9 ("libnetlink: Add test for error code returned from
netlink reply") changed rtnl_dump_filter_l() to return an error in case
NLMSG_DONE would contain one, even if it was ENOENT.

This in turn breaks ss when it tries to dump DCCP sockets on a system
without support for it: The function tcp_show(), which is shared between
TCP and DCCP, will start parsing /proc since inet_show_netlink() returns
an error - yet it parses /proc/net/tcp which doesn't make sense for DCCP
sockets at all.

On my system, a call to 'ss' without further arguments prints the list
of connected TCP sockets twice.

Fix this by introducing a dedicated function dccp_show() which does not
have a fallback to /proc, just like sctp_show(). And since tcp_show()
is no longer "multi-purpose", drop it's socktype parameter.

Fixes: 9f66764e308e9 ("libnetlink: Add test for error code returned from netlink reply")
Signed-off-by: Phil Sutter <phil@nwl.cc>
---
 misc/ss.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index fcc3cf9282c49..2c9e80e696595 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -2753,7 +2753,7 @@ static int tcp_show_netlink_file(struct filter *f)
 	return err;
 }
 
-static int tcp_show(struct filter *f, int socktype)
+static int tcp_show(struct filter *f)
 {
 	FILE *fp = NULL;
 	char *buf = NULL;
@@ -2768,7 +2768,7 @@ static int tcp_show(struct filter *f, int socktype)
 		return tcp_show_netlink_file(f);
 
 	if (!getenv("PROC_NET_TCP") && !getenv("PROC_ROOT")
-	    && inet_show_netlink(f, NULL, socktype) == 0)
+	    && inet_show_netlink(f, NULL, IPPROTO_TCP) == 0)
 		return 0;
 
 	/* Sigh... We have to parse /proc/net/tcp... */
@@ -2836,6 +2836,18 @@ outerr:
 	} while (0);
 }
 
+static int dccp_show(struct filter *f)
+{
+	if (!filter_af_get(f, AF_INET) && !filter_af_get(f, AF_INET6))
+		return 0;
+
+	if (!getenv("PROC_NET_DCCP") && !getenv("PROC_ROOT")
+	    && inet_show_netlink(f, NULL, IPPROTO_DCCP) == 0)
+		return 0;
+
+	return 0;
+}
+
 static int sctp_show(struct filter *f)
 {
 	if (!filter_af_get(f, AF_INET) && !filter_af_get(f, AF_INET6))
@@ -4390,9 +4402,9 @@ int main(int argc, char *argv[])
 	if (current_filter.dbs & (1<<UDP_DB))
 		udp_show(&current_filter);
 	if (current_filter.dbs & (1<<TCP_DB))
-		tcp_show(&current_filter, IPPROTO_TCP);
+		tcp_show(&current_filter);
 	if (current_filter.dbs & (1<<DCCP_DB))
-		tcp_show(&current_filter, IPPROTO_DCCP);
+		dccp_show(&current_filter);
 	if (current_filter.dbs & (1<<SCTP_DB))
 		sctp_show(&current_filter);
 
-- 
2.13.1

^ permalink raw reply related

* Re: [net-next PATCH 6/9] bpf: harden sockmap program attach to ensure correct map type
From: Alexei Starovoitov @ 2017-08-28 17:33 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, davem, netdev
In-Reply-To: <20170828141143.14143.89655.stgit@john-Precision-Tower-5810>

On Mon, Aug 28, 2017 at 07:11:43AM -0700, John Fastabend wrote:
> When attaching a program to sockmap we need to check map type
> is correct.
> 
> Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>

good catch. Great to see that you're adding a test for it too.

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [net-next PATCH 8/9] bpf: sockmap requires STREAM_PARSER add Kconfig entry
From: Alexei Starovoitov @ 2017-08-28 17:34 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, davem, netdev
In-Reply-To: <20170828141221.14143.27371.stgit@john-Precision-Tower-5810>

On Mon, Aug 28, 2017 at 07:12:21AM -0700, John Fastabend wrote:
> SOCKMAP uses strparser code (compiled with Kconfig option
> CONFIG_STREAM_PARSER) to run the parser BPF program. Without this
> config option set sockmap wont be compiled. However, at the moment
> the only way to pull in the strparser code is to enable KCM.
> 
> To resolve this create a BPF specific config option to pull
> only the strparser piece in that sockmap needs. This also
> allows folks who want to use BPF/syscall/maps but don't need
> sockmap to easily opt out.
> 
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [net-next PATCH 4/9] bpf: additional sockmap self tests
From: Alexei Starovoitov @ 2017-08-28 17:36 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, davem, netdev
In-Reply-To: <20170828141105.14143.31609.stgit@john-Precision-Tower-5810>

On Mon, Aug 28, 2017 at 07:11:05AM -0700, John Fastabend wrote:
> Add some more sockmap tests to cover,
> 
>  - forwarding to NULL entries
>  - more than two maps to test list ops
>  - forwarding to different map
> 
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [net-next PATCH 5/9] bpf: more SK_SKB selftests
From: Alexei Starovoitov @ 2017-08-28 17:38 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, davem, netdev
In-Reply-To: <20170828141124.14143.92130.stgit@john-Precision-Tower-5810>

On Mon, Aug 28, 2017 at 07:11:24AM -0700, John Fastabend wrote:
> Tests packet read/writes and additional skb fields.
> 
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>

Acked-by: Alexei Starovoitov <ast@kernel.org>

>  	{
> +		"invalid access of tc_classid for SK_SKB",
> +		.insns = {
> +			BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
> +				    offsetof(struct __sk_buff, tc_classid)),
> +			BPF_EXIT_INSN(),
> +		},
> +		.result = REJECT,
> +		.prog_type = BPF_PROG_TYPE_SK_SKB,
> +		.errstr = "invalid bpf_context access",

it shouldn't be readable eight, right?

^ permalink raw reply

* Re: [PATCH net-next 3/3 v9] drivers: net: ethernet: qualcomm: rmnet: Initial implementation
From: Subash Abhinov Kasiviswanathan @ 2017-08-28 17:40 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, fengguang.wu, dcbw, jiri, stephen, David.Laight, marcel,
	andrew
In-Reply-To: <20170825.200623.1467784779129980276.davem@davemloft.net>

> 
> In these code paths where you are the writer, you have to rely upon
> the RTNL mutex (or some other mutual exclusion mechanism) to protect
> the update operation.  RCU locking itself does not provide this.
> 
> So you should use something like rcu_dereference_rtnl() or similar.
> 
> So this would be rmnet_force_unassociate_device() and rmnet_dellink()
> 
> RCU is subtle and the writer paths have the be handled differently
> from the reader paths.  Please take some time to study how RCU should
> be applied properly in these situations rather than just slapping a
> patch together overnight.
> 
> Thank you.

Sorry about the mess David.

Can you tell me if the following design is correct -

The shared resource which needs to be protected is 
real_dev->rx_handler_data.

For the writer path, this needs to be protected using rtnl_lock() and 
rcu.
The writer paths are rmnet_newlink(), rmnet_delink() and
rmnet_force_unassociate_device(). These paths are already called with
rtnl_lock() acquired in, so we just need to acquire the rcu_read_lock(). 
To
dereference here, we will need to use rtnl_dereference().

For the reader path, the real_dev->rx_handler_data is called in the TX / 
RX
path. We only need rcu_read_lock() for these scenarios. In these cases,
the rcu_read_lock() is held in __dev_queue_xmit() and
netif_receive_skb_internal(), so readers need to use 
rcu_dereference_rtnl()
to get the relevant information.

^ permalink raw reply

* [PATCH net-next 00/11] bnxt_en: Updates.
From: Michael Chan @ 2017-08-28 17:40 UTC (permalink / raw)
  To: davem; +Cc: netdev

Various changes including updated firmware interface, improved TX ring
allocation scheme, improved out-of-memory logic in NAPI loop, reduced
default rings on multi-port devices, new PCI IDs. Of particular note,

CPU affinity hints from Vasundhara Volam.

TC Flower eswitch support from Sathya Perla.

Michael Chan (4):
  bnxt_en: Update firmware interface spec. to 1.8.1.4.
  bnxt_en: Improve tx ring reservation logic.
  bnxt_en: Improve -ENOMEM logic in NAPI poll loop.
  bnxt_en: Reduce default rings on multi-port cards.

Ray Jui (1):
  bnxt: Add PCIe device IDs for bcm58802/bcm58808

Sathya Perla (4):
  bnxt_en: fix clearing devlink ptr from bnxt struct
  bnxt_en: bnxt: add TC flower filter offload support
  bnxt_en: add TC flower offload flow_alloc/free FW cmds
  bnxt_en: add code to query TC flower offload stats

Scott Branden (1):
  bnxt: initialize board_info values with proper enums

Vasundhara Volam (1):
  bnxt_en: assign CPU affinity hints to bnxt_en IRQs

 drivers/net/ethernet/broadcom/Kconfig             |   9 +
 drivers/net/ethernet/broadcom/bnxt/Makefile       |   2 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c         | 198 +++--
 drivers/net/ethernet/broadcom/bnxt/bnxt.h         |  41 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c |   3 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_hsi.h     | 186 ++++-
 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c      | 834 ++++++++++++++++++++++
 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.h      | 158 ++++
 drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c     |  22 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.h     |  22 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c     |   4 +-
 11 files changed, 1407 insertions(+), 72 deletions(-)
 create mode 100644 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
 create mode 100644 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.h

-- 
1.8.3.1

^ permalink raw reply

* [PATCH net-next 01/11] bnxt_en: Update firmware interface spec. to 1.8.1.4.
From: Michael Chan @ 2017-08-28 17:40 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1503942035-24924-1-git-send-email-michael.chan@broadcom.com>

Flow APIs are added in this firmware interface.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_hsi.h | 186 +++++++++++++++++++++++++-
 1 file changed, 181 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_hsi.h b/drivers/net/ethernet/broadcom/bnxt/bnxt_hsi.h
index 3ba22e8..cb04cc7 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_hsi.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_hsi.h
@@ -11,14 +11,14 @@
 #ifndef BNXT_HSI_H
 #define BNXT_HSI_H
 
-/* HSI and HWRM Specification 1.8.0 */
+/* HSI and HWRM Specification 1.8.1 */
 #define HWRM_VERSION_MAJOR	1
 #define HWRM_VERSION_MINOR	8
-#define HWRM_VERSION_UPDATE	0
+#define HWRM_VERSION_UPDATE	1
 
-#define HWRM_VERSION_RSVD	0 /* non-zero means beta version */
+#define HWRM_VERSION_RSVD	4 /* non-zero means beta version */
 
-#define HWRM_VERSION_STR	"1.8.0.0"
+#define HWRM_VERSION_STR	"1.8.1.4"
 /*
  * Following is the signature for HWRM message field that indicates not
  * applicable (All F's). Need to cast it the size of the field if needed.
@@ -946,6 +946,7 @@ struct hwrm_func_cfg_input {
 	#define FUNC_CFG_REQ_FLAGS_STD_TX_RING_MODE_DISABLE	    0x400UL
 	#define FUNC_CFG_REQ_FLAGS_VIRT_MAC_PERSIST		    0x800UL
 	#define FUNC_CFG_REQ_FLAGS_NO_AUTOCLEAR_STATISTIC	    0x1000UL
+	#define FUNC_CFG_REQ_FLAGS_TX_ASSETS_TEST		    0x2000UL
 	__le32 enables;
 	#define FUNC_CFG_REQ_ENABLES_MTU			    0x1UL
 	#define FUNC_CFG_REQ_ENABLES_MRU			    0x2UL
@@ -1972,7 +1973,12 @@ struct hwrm_port_phy_qcaps_output {
 	#define PORT_PHY_QCAPS_RESP_FLAGS_EEE_SUPPORTED	    0x1UL
 	#define PORT_PHY_QCAPS_RESP_FLAGS_RSVD1_MASK		    0xfeUL
 	#define PORT_PHY_QCAPS_RESP_FLAGS_RSVD1_SFT		    1
-	u8 unused_0;
+	u8 port_cnt;
+	#define PORT_PHY_QCAPS_RESP_PORT_CNT_UNKNOWN		   0x0UL
+	#define PORT_PHY_QCAPS_RESP_PORT_CNT_1			   0x1UL
+	#define PORT_PHY_QCAPS_RESP_PORT_CNT_2			   0x2UL
+	#define PORT_PHY_QCAPS_RESP_PORT_CNT_3			   0x3UL
+	#define PORT_PHY_QCAPS_RESP_PORT_CNT_4			   0x4UL
 	__le16 supported_speeds_force_mode;
 	#define PORT_PHY_QCAPS_RESP_SUPPORTED_SPEEDS_FORCE_MODE_100MBHD 0x1UL
 	#define PORT_PHY_QCAPS_RESP_SUPPORTED_SPEEDS_FORCE_MODE_100MB 0x2UL
@@ -4407,6 +4413,164 @@ struct hwrm_cfa_ntuple_filter_cfg_output {
 	u8 valid;
 };
 
+/* hwrm_cfa_flow_alloc */
+/* Input (128 bytes) */
+struct hwrm_cfa_flow_alloc_input {
+	__le16 req_type;
+	__le16 cmpl_ring;
+	__le16 seq_id;
+	__le16 target_id;
+	__le64 resp_addr;
+	__le16 flags;
+	#define CFA_FLOW_ALLOC_REQ_FLAGS_TUNNEL		    0x1UL
+	#define CFA_FLOW_ALLOC_REQ_FLAGS_NUM_VLAN_MASK		    0x6UL
+	#define CFA_FLOW_ALLOC_REQ_FLAGS_NUM_VLAN_SFT		    1
+	#define CFA_FLOW_ALLOC_REQ_FLAGS_NUM_VLAN_NONE		   (0x0UL << 1)
+	#define CFA_FLOW_ALLOC_REQ_FLAGS_NUM_VLAN_ONE		   (0x1UL << 1)
+	#define CFA_FLOW_ALLOC_REQ_FLAGS_NUM_VLAN_TWO		   (0x2UL << 1)
+	#define CFA_FLOW_ALLOC_REQ_FLAGS_NUM_VLAN_LAST    CFA_FLOW_ALLOC_REQ_FLAGS_NUM_VLAN_TWO
+	#define CFA_FLOW_ALLOC_REQ_FLAGS_FLOWTYPE_MASK		    0x38UL
+	#define CFA_FLOW_ALLOC_REQ_FLAGS_FLOWTYPE_SFT		    3
+	#define CFA_FLOW_ALLOC_REQ_FLAGS_FLOWTYPE_L2		   (0x0UL << 3)
+	#define CFA_FLOW_ALLOC_REQ_FLAGS_FLOWTYPE_IPV4		   (0x1UL << 3)
+	#define CFA_FLOW_ALLOC_REQ_FLAGS_FLOWTYPE_IPV6		   (0x2UL << 3)
+	#define CFA_FLOW_ALLOC_REQ_FLAGS_FLOWTYPE_LAST    CFA_FLOW_ALLOC_REQ_FLAGS_FLOWTYPE_IPV6
+	__le16 src_fid;
+	__le32 tunnel_handle;
+	__le16 action_flags;
+	#define CFA_FLOW_ALLOC_REQ_ACTION_FLAGS_FWD		    0x1UL
+	#define CFA_FLOW_ALLOC_REQ_ACTION_FLAGS_RECYCLE	    0x2UL
+	#define CFA_FLOW_ALLOC_REQ_ACTION_FLAGS_DROP		    0x4UL
+	#define CFA_FLOW_ALLOC_REQ_ACTION_FLAGS_METER		    0x8UL
+	#define CFA_FLOW_ALLOC_REQ_ACTION_FLAGS_TUNNEL		    0x10UL
+	#define CFA_FLOW_ALLOC_REQ_ACTION_FLAGS_NAT_SRC	    0x20UL
+	#define CFA_FLOW_ALLOC_REQ_ACTION_FLAGS_NAT_DEST	    0x40UL
+	#define CFA_FLOW_ALLOC_REQ_ACTION_FLAGS_NAT_IPV4_ADDRESS   0x80UL
+	#define CFA_FLOW_ALLOC_REQ_ACTION_FLAGS_L2_HEADER_REWRITE  0x100UL
+	#define CFA_FLOW_ALLOC_REQ_ACTION_FLAGS_TTL_DECREMENT      0x200UL
+	__le16 dst_fid;
+	__be16 l2_rewrite_vlan_tpid;
+	__be16 l2_rewrite_vlan_tci;
+	__le16 act_meter_id;
+	__le16 ref_flow_handle;
+	__be16 ethertype;
+	__be16 outer_vlan_tci;
+	__be16 dmac[3];
+	__be16 inner_vlan_tci;
+	__be16 smac[3];
+	u8 ip_dst_mask_len;
+	u8 ip_src_mask_len;
+	__be32 ip_dst[4];
+	__be32 ip_src[4];
+	__be16 l4_src_port;
+	__be16 l4_src_port_mask;
+	__be16 l4_dst_port;
+	__be16 l4_dst_port_mask;
+	__be32 nat_ip_address[4];
+	__be16 l2_rewrite_dmac[3];
+	__be16 nat_port;
+	__be16 l2_rewrite_smac[3];
+	u8 ip_proto;
+	u8 unused_0;
+};
+
+/* Output (16 bytes) */
+struct hwrm_cfa_flow_alloc_output {
+	__le16 error_code;
+	__le16 req_type;
+	__le16 seq_id;
+	__le16 resp_len;
+	__le16 flow_handle;
+	u8 unused_0;
+	u8 unused_1;
+	u8 unused_2;
+	u8 unused_3;
+	u8 unused_4;
+	u8 valid;
+};
+
+/* hwrm_cfa_flow_free */
+/* Input (24 bytes) */
+struct hwrm_cfa_flow_free_input {
+	__le16 req_type;
+	__le16 cmpl_ring;
+	__le16 seq_id;
+	__le16 target_id;
+	__le64 resp_addr;
+	__le16 flow_handle;
+	__le16 unused_0[3];
+};
+
+/* Output (32 bytes) */
+struct hwrm_cfa_flow_free_output {
+	__le16 error_code;
+	__le16 req_type;
+	__le16 seq_id;
+	__le16 resp_len;
+	__le64 packet;
+	__le64 byte;
+	__le32 unused_0;
+	u8 unused_1;
+	u8 unused_2;
+	u8 unused_3;
+	u8 valid;
+};
+
+/* hwrm_cfa_flow_stats */
+/* Input (40 bytes) */
+struct hwrm_cfa_flow_stats_input {
+	__le16 req_type;
+	__le16 cmpl_ring;
+	__le16 seq_id;
+	__le16 target_id;
+	__le64 resp_addr;
+	__le16 num_flows;
+	__le16 flow_handle_0;
+	__le16 flow_handle_1;
+	__le16 flow_handle_2;
+	__le16 flow_handle_3;
+	__le16 flow_handle_4;
+	__le16 flow_handle_5;
+	__le16 flow_handle_6;
+	__le16 flow_handle_7;
+	__le16 flow_handle_8;
+	__le16 flow_handle_9;
+	__le16 unused_0;
+};
+
+/* Output (176 bytes) */
+struct hwrm_cfa_flow_stats_output {
+	__le16 error_code;
+	__le16 req_type;
+	__le16 seq_id;
+	__le16 resp_len;
+	__le64 packet_0;
+	__le64 packet_1;
+	__le64 packet_2;
+	__le64 packet_3;
+	__le64 packet_4;
+	__le64 packet_5;
+	__le64 packet_6;
+	__le64 packet_7;
+	__le64 packet_8;
+	__le64 packet_9;
+	__le64 byte_0;
+	__le64 byte_1;
+	__le64 byte_2;
+	__le64 byte_3;
+	__le64 byte_4;
+	__le64 byte_5;
+	__le64 byte_6;
+	__le64 byte_7;
+	__le64 byte_8;
+	__le64 byte_9;
+	__le32 unused_0;
+	u8 unused_1;
+	u8 unused_2;
+	u8 unused_3;
+	u8 valid;
+};
+
 /* hwrm_cfa_vfr_alloc */
 /* Input (32 bytes) */
 struct hwrm_cfa_vfr_alloc_input {
@@ -5534,11 +5698,15 @@ struct hwrm_selftest_qlist_output {
 	#define SELFTEST_QLIST_RESP_AVAILABLE_TESTS_LINK_TEST      0x2UL
 	#define SELFTEST_QLIST_RESP_AVAILABLE_TESTS_REGISTER_TEST  0x4UL
 	#define SELFTEST_QLIST_RESP_AVAILABLE_TESTS_MEMORY_TEST    0x8UL
+	#define SELFTEST_QLIST_RESP_AVAILABLE_TESTS_PCIE_EYE_TEST  0x10UL
+	#define SELFTEST_QLIST_RESP_AVAILABLE_TESTS_ETHERNET_EYE_TEST 0x20UL
 	u8 offline_tests;
 	#define SELFTEST_QLIST_RESP_OFFLINE_TESTS_NVM_TEST	    0x1UL
 	#define SELFTEST_QLIST_RESP_OFFLINE_TESTS_LINK_TEST	    0x2UL
 	#define SELFTEST_QLIST_RESP_OFFLINE_TESTS_REGISTER_TEST    0x4UL
 	#define SELFTEST_QLIST_RESP_OFFLINE_TESTS_MEMORY_TEST      0x8UL
+	#define SELFTEST_QLIST_RESP_OFFLINE_TESTS_PCIE_EYE_TEST    0x10UL
+	#define SELFTEST_QLIST_RESP_OFFLINE_TESTS_ETHERNET_EYE_TEST 0x20UL
 	u8 unused_0;
 	__le16 test_timeout;
 	u8 unused_1;
@@ -5566,6 +5734,8 @@ struct hwrm_selftest_exec_input {
 	#define SELFTEST_EXEC_REQ_FLAGS_LINK_TEST		    0x2UL
 	#define SELFTEST_EXEC_REQ_FLAGS_REGISTER_TEST		    0x4UL
 	#define SELFTEST_EXEC_REQ_FLAGS_MEMORY_TEST		    0x8UL
+	#define SELFTEST_EXEC_REQ_FLAGS_PCIE_EYE_TEST		    0x10UL
+	#define SELFTEST_EXEC_REQ_FLAGS_ETHERNET_EYE_TEST	    0x20UL
 	u8 unused_0[7];
 };
 
@@ -5580,11 +5750,15 @@ struct hwrm_selftest_exec_output {
 	#define SELFTEST_EXEC_RESP_REQUESTED_TESTS_LINK_TEST       0x2UL
 	#define SELFTEST_EXEC_RESP_REQUESTED_TESTS_REGISTER_TEST   0x4UL
 	#define SELFTEST_EXEC_RESP_REQUESTED_TESTS_MEMORY_TEST     0x8UL
+	#define SELFTEST_EXEC_RESP_REQUESTED_TESTS_PCIE_EYE_TEST   0x10UL
+	#define SELFTEST_EXEC_RESP_REQUESTED_TESTS_ETHERNET_EYE_TEST 0x20UL
 	u8 test_success;
 	#define SELFTEST_EXEC_RESP_TEST_SUCCESS_NVM_TEST	    0x1UL
 	#define SELFTEST_EXEC_RESP_TEST_SUCCESS_LINK_TEST	    0x2UL
 	#define SELFTEST_EXEC_RESP_TEST_SUCCESS_REGISTER_TEST      0x4UL
 	#define SELFTEST_EXEC_RESP_TEST_SUCCESS_MEMORY_TEST	    0x8UL
+	#define SELFTEST_EXEC_RESP_TEST_SUCCESS_PCIE_EYE_TEST      0x10UL
+	#define SELFTEST_EXEC_RESP_TEST_SUCCESS_ETHERNET_EYE_TEST  0x20UL
 	__le16 unused_0[3];
 };
 
@@ -5767,6 +5941,7 @@ struct cmd_nums {
 	#define HWRM_SELFTEST_QLIST				   (0x200UL)
 	#define HWRM_SELFTEST_EXEC				   (0x201UL)
 	#define HWRM_SELFTEST_IRQ				   (0x202UL)
+	#define HWRM_SELFTEST_RETREIVE_EYE_DATA		   (0x203UL)
 	#define HWRM_DBG_READ_DIRECT				   (0xff10UL)
 	#define HWRM_DBG_READ_INDIRECT				   (0xff11UL)
 	#define HWRM_DBG_WRITE_DIRECT				   (0xff12UL)
@@ -5984,6 +6159,7 @@ struct hwrm_struct_hdr {
 	#define STRUCT_HDR_STRUCT_ID_LLDP_DEVICE		   0x426UL
 	#define STRUCT_HDR_STRUCT_ID_AFM_OPAQUE		   0x1UL
 	#define STRUCT_HDR_STRUCT_ID_PORT_DESCRIPTION		   0xaUL
+	#define STRUCT_HDR_STRUCT_ID_RSS_V2			   0x64UL
 	__le16 len;
 	u8 version;
 	u8 count;
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next 02/11] bnxt_en: Improve tx ring reservation logic.
From: Michael Chan @ 2017-08-28 17:40 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1503942035-24924-1-git-send-email-michael.chan@broadcom.com>

When the number of TX rings is changed (e.g. ethtool -L, enabling XDP TX
rings, etc), the current code tries to reserve the new number of TX rings
before closing and re-opening the NIC.  If we are unable to reserve the
new TX rings, we abort the operation and keep the current TX rings.

The problem is that the firmware will disable the current TX rings even
when it cannot reserve the new set of TX rings.  We fix it as follows:

1. Instead of reserving the new set of TX rings, just ask the firmware
to check if the new set of TX rings is available.  There is a flag in
the firmware message to do that.  If not available, abort and the
current TX rings will not be disabled.

2. Do the actual TX ring reservation in the path that opens the NIC.
We keep the number of TX rings currently successfully reserved.  If the
number of TX rings is different than the reserved TX rings, we call
firmware and reserve again.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c         | 46 +++++++++++++++++++----
 drivers/net/ethernet/broadcom/bnxt/bnxt.h         |  5 ++-
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c |  3 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c     |  4 +-
 4 files changed, 44 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 6e14fc4..4b0a807 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -4461,9 +4461,33 @@ static int bnxt_hwrm_reserve_tx_rings(struct bnxt *bp, int *tx_rings)
 	mutex_lock(&bp->hwrm_cmd_lock);
 	rc = __bnxt_hwrm_get_tx_rings(bp, 0xffff, tx_rings);
 	mutex_unlock(&bp->hwrm_cmd_lock);
+	if (!rc)
+		bp->tx_reserved_rings = *tx_rings;
 	return rc;
 }
 
+static int bnxt_hwrm_check_tx_rings(struct bnxt *bp, int tx_rings)
+{
+	struct hwrm_func_cfg_input req = {0};
+	int rc;
+
+	if (bp->hwrm_spec_code < 0x10801)
+		return 0;
+
+	if (BNXT_VF(bp))
+		return 0;
+
+	bnxt_hwrm_cmd_hdr_init(bp, &req, HWRM_FUNC_CFG, -1, -1);
+	req.fid = cpu_to_le16(0xffff);
+	req.flags = cpu_to_le32(FUNC_CFG_REQ_FLAGS_TX_ASSETS_TEST);
+	req.enables = cpu_to_le32(FUNC_CFG_REQ_ENABLES_NUM_TX_RINGS);
+	req.num_tx_rings = cpu_to_le16(tx_rings);
+	rc = hwrm_send_message_silent(bp, &req, sizeof(req), HWRM_CMD_TIMEOUT);
+	if (rc)
+		return -ENOMEM;
+	return 0;
+}
+
 static void bnxt_hwrm_set_coal_params(struct bnxt *bp, u32 max_bufs,
 	u32 buf_tmrs, u16 flags,
 	struct hwrm_ring_cmpl_ring_cfg_aggint_params_input *req)
@@ -5115,6 +5139,15 @@ static int bnxt_init_chip(struct bnxt *bp, bool irq_re_init)
 				   rc);
 			goto err_out;
 		}
+		if (bp->tx_reserved_rings != bp->tx_nr_rings) {
+			int tx = bp->tx_nr_rings;
+
+			if (bnxt_hwrm_reserve_tx_rings(bp, &tx) ||
+			    tx < bp->tx_nr_rings) {
+				rc = -ENOMEM;
+				goto err_out;
+			}
+		}
 	}
 
 	rc = bnxt_hwrm_ring_alloc(bp);
@@ -6998,8 +7031,8 @@ static void bnxt_sp_task(struct work_struct *work)
 }
 
 /* Under rtnl_lock */
-int bnxt_reserve_rings(struct bnxt *bp, int tx, int rx, bool sh, int tcs,
-		       int tx_xdp)
+int bnxt_check_rings(struct bnxt *bp, int tx, int rx, bool sh, int tcs,
+		     int tx_xdp)
 {
 	int max_rx, max_tx, tx_sets = 1;
 	int tx_rings_needed;
@@ -7019,10 +7052,7 @@ int bnxt_reserve_rings(struct bnxt *bp, int tx, int rx, bool sh, int tcs,
 	if (max_tx < tx_rings_needed)
 		return -ENOMEM;
 
-	if (bnxt_hwrm_reserve_tx_rings(bp, &tx_rings_needed) ||
-	    tx_rings_needed < (tx * tx_sets + tx_xdp))
-		return -ENOMEM;
-	return 0;
+	return bnxt_hwrm_check_tx_rings(bp, tx_rings_needed);
 }
 
 static void bnxt_unmap_bars(struct bnxt *bp, struct pci_dev *pdev)
@@ -7211,8 +7241,8 @@ int bnxt_setup_mq_tc(struct net_device *dev, u8 tc)
 	if (bp->flags & BNXT_FLAG_SHARED_RINGS)
 		sh = true;
 
-	rc = bnxt_reserve_rings(bp, bp->tx_nr_rings_per_tc, bp->rx_nr_rings,
-				sh, tc, bp->tx_nr_rings_xdp);
+	rc = bnxt_check_rings(bp, bp->tx_nr_rings_per_tc, bp->rx_nr_rings,
+			      sh, tc, bp->tx_nr_rings_xdp);
 	if (rc)
 		return rc;
 
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 2d84d57..568516f 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -1118,6 +1118,7 @@ struct bnxt {
 	int			tx_nr_rings;
 	int			tx_nr_rings_per_tc;
 	int			tx_nr_rings_xdp;
+	int			tx_reserved_rings;
 
 	int			tx_wake_thresh;
 	int			tx_push_thresh;
@@ -1346,8 +1347,8 @@ int bnxt_hwrm_func_rgtr_async_events(struct bnxt *bp, unsigned long *bmap,
 int bnxt_half_open_nic(struct bnxt *bp);
 void bnxt_half_close_nic(struct bnxt *bp);
 int bnxt_close_nic(struct bnxt *, bool, bool);
-int bnxt_reserve_rings(struct bnxt *bp, int tx, int rx, bool sh, int tcs,
-		       int tx_xdp);
+int bnxt_check_rings(struct bnxt *bp, int tx, int rx, bool sh, int tcs,
+		     int tx_xdp);
 int bnxt_setup_mq_tc(struct net_device *dev, u8 tc);
 int bnxt_get_max_rings(struct bnxt *, int *, int *, bool);
 void bnxt_restore_pf_fw_resources(struct bnxt *bp);
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
index 08b870d..8eff05a 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
@@ -435,8 +435,7 @@ static int bnxt_set_channels(struct net_device *dev,
 		}
 		tx_xdp = req_rx_rings;
 	}
-	rc = bnxt_reserve_rings(bp, req_tx_rings, req_rx_rings, sh, tcs,
-				tx_xdp);
+	rc = bnxt_check_rings(bp, req_tx_rings, req_rx_rings, sh, tcs, tx_xdp);
 	if (rc) {
 		netdev_warn(dev, "Unable to allocate the requested rings\n");
 		return rc;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
index 3961a68..d8f0c83 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
@@ -169,8 +169,8 @@ static int bnxt_xdp_set(struct bnxt *bp, struct bpf_prog *prog)
 	tc = netdev_get_num_tc(dev);
 	if (!tc)
 		tc = 1;
-	rc = bnxt_reserve_rings(bp, bp->tx_nr_rings_per_tc, bp->rx_nr_rings,
-				true, tc, tx_xdp);
+	rc = bnxt_check_rings(bp, bp->tx_nr_rings_per_tc, bp->rx_nr_rings,
+			      true, tc, tx_xdp);
 	if (rc) {
 		netdev_warn(dev, "Unable to reserve enough TX rings to support XDP.\n");
 		return rc;
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next 03/11] bnxt_en: assign CPU affinity hints to bnxt_en IRQs
From: Michael Chan @ 2017-08-28 17:40 UTC (permalink / raw)
  To: davem; +Cc: netdev, Vasundhara Volam
In-Reply-To: <1503942035-24924-1-git-send-email-michael.chan@broadcom.com>

From: Vasundhara Volam <vasundhara-v.volam@broadcom.com>

This patch provides hints to irqbalance to map bnxt_en device IRQs
to specific CPU cores. cpumask_local_spread() is used, which first
maps IRQs to near NUMA cores; when those cores are exhausted, IRQs
are mapped to far NUMA cores.

Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 25 ++++++++++++++++++++++++-
 drivers/net/ethernet/broadcom/bnxt/bnxt.h |  4 +++-
 2 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 4b0a807..9657bfb 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -49,6 +49,7 @@
 #include <linux/aer.h>
 #include <linux/bitmap.h>
 #include <linux/cpu_rmap.h>
+#include <linux/cpumask.h>
 
 #include "bnxt_hsi.h"
 #include "bnxt.h"
@@ -5554,8 +5555,15 @@ static void bnxt_free_irq(struct bnxt *bp)
 
 	for (i = 0; i < bp->cp_nr_rings; i++) {
 		irq = &bp->irq_tbl[i];
-		if (irq->requested)
+		if (irq->requested) {
+			if (irq->have_cpumask) {
+				irq_set_affinity_hint(irq->vector, NULL);
+				free_cpumask_var(irq->cpu_mask);
+				irq->have_cpumask = 0;
+			}
 			free_irq(irq->vector, bp->bnapi[i]);
+		}
+
 		irq->requested = 0;
 	}
 }
@@ -5588,6 +5596,21 @@ static int bnxt_request_irq(struct bnxt *bp)
 			break;
 
 		irq->requested = 1;
+
+		if (zalloc_cpumask_var(&irq->cpu_mask, GFP_KERNEL)) {
+			int numa_node = dev_to_node(&bp->pdev->dev);
+
+			irq->have_cpumask = 1;
+			cpumask_set_cpu(cpumask_local_spread(i, numa_node),
+					irq->cpu_mask);
+			rc = irq_set_affinity_hint(irq->vector, irq->cpu_mask);
+			if (rc) {
+				netdev_warn(bp->dev,
+					    "Set affinity failed, IRQ = %d\n",
+					    irq->vector);
+				break;
+			}
+		}
 	}
 	return rc;
 }
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 568516f..801c0f7 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -701,8 +701,10 @@ struct bnxt_napi {
 struct bnxt_irq {
 	irq_handler_t	handler;
 	unsigned int	vector;
-	u8		requested;
+	u8		requested:1;
+	u8		have_cpumask:1;
 	char		name[IFNAMSIZ + 2];
+	cpumask_var_t	cpu_mask;
 };
 
 #define HWRM_RING_ALLOC_TX	0x1
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next 04/11] bnxt: Add PCIe device IDs for bcm58802/bcm58808
From: Michael Chan @ 2017-08-28 17:40 UTC (permalink / raw)
  To: davem; +Cc: netdev, Ray Jui
In-Reply-To: <1503942035-24924-1-git-send-email-michael.chan@broadcom.com>

From: Ray Jui <ray.jui@broadcom.com>

Add PCIe device ID for bcm58802 and bcm58808. Also add chip number
update to declare bcm588xx as chip class phase 4 and later

Signed-off-by: Ray Jui <ray.jui@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 8 +++++++-
 drivers/net/ethernet/broadcom/bnxt/bnxt.h | 8 ++++++++
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 9657bfb..500681d 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -104,6 +104,8 @@ enum board_idx {
 	BCM57416_NPAR,
 	BCM57452,
 	BCM57454,
+	BCM58802,
+	BCM58808,
 	NETXTREME_E_VF,
 	NETXTREME_C_VF,
 };
@@ -140,11 +142,14 @@ enum board_idx {
 	{ "Broadcom BCM57416 NetXtreme-E Ethernet Partition" },
 	{ "Broadcom BCM57452 NetXtreme-E 10Gb/25Gb/40Gb/50Gb Ethernet" },
 	{ "Broadcom BCM57454 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb Ethernet" },
+	{ "Broadcom BCM58802 NetXtreme-S 10Gb/25Gb/40Gb/50Gb Ethernet" },
+	{ "Broadcom BCM58808 NetXtreme-S 10Gb/25Gb/40Gb/50Gb/100Gb Ethernet" },
 	{ "Broadcom NetXtreme-E Ethernet Virtual Function" },
 	{ "Broadcom NetXtreme-C Ethernet Virtual Function" },
 };
 
 static const struct pci_device_id bnxt_pci_tbl[] = {
+	{ PCI_VDEVICE(BROADCOM, 0x1614), .driver_data = BCM57454 },
 	{ PCI_VDEVICE(BROADCOM, 0x16c0), .driver_data = BCM57417_NPAR },
 	{ PCI_VDEVICE(BROADCOM, 0x16c8), .driver_data = BCM57301 },
 	{ PCI_VDEVICE(BROADCOM, 0x16c9), .driver_data = BCM57302 },
@@ -175,8 +180,9 @@ enum board_idx {
 	{ PCI_VDEVICE(BROADCOM, 0x16ed), .driver_data = BCM57414_NPAR },
 	{ PCI_VDEVICE(BROADCOM, 0x16ee), .driver_data = BCM57416_NPAR },
 	{ PCI_VDEVICE(BROADCOM, 0x16ef), .driver_data = BCM57416_NPAR },
+	{ PCI_VDEVICE(BROADCOM, 0x16f0), .driver_data = BCM58808 },
 	{ PCI_VDEVICE(BROADCOM, 0x16f1), .driver_data = BCM57452 },
-	{ PCI_VDEVICE(BROADCOM, 0x1614), .driver_data = BCM57454 },
+	{ PCI_VDEVICE(BROADCOM, 0xd802), .driver_data = BCM58802 },
 #ifdef CONFIG_BNXT_SRIOV
 	{ PCI_VDEVICE(BROADCOM, 0x1606), .driver_data = NETXTREME_E_VF },
 	{ PCI_VDEVICE(BROADCOM, 0x1609), .driver_data = NETXTREME_E_VF },
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 801c0f7..3521e46 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -990,6 +990,9 @@ struct bnxt {
 
 #define CHIP_NUM_5745X		0xd730
 
+#define CHIP_NUM_58802		0xd802
+#define CHIP_NUM_58808		0xd808
+
 #define BNXT_CHIP_NUM_5730X(chip_num)		\
 	((chip_num) >= CHIP_NUM_57301 &&	\
 	 (chip_num) <= CHIP_NUM_57304)
@@ -1021,6 +1024,10 @@ struct bnxt {
 #define BNXT_CHIP_NUM_57X1X(chip_num)		\
 	(BNXT_CHIP_NUM_5731X(chip_num) || BNXT_CHIP_NUM_5741X(chip_num))
 
+#define BNXT_CHIP_NUM_588XX(chip_num)		\
+	((chip_num) == CHIP_NUM_58802 ||	\
+	 (chip_num) == CHIP_NUM_58808)
+
 	struct net_device	*dev;
 	struct pci_dev		*pdev;
 
@@ -1079,6 +1086,7 @@ struct bnxt {
 #define BNXT_CHIP_P4_PLUS(bp)			\
 	(BNXT_CHIP_NUM_57X1X((bp)->chip_num) ||	\
 	 BNXT_CHIP_NUM_5745X((bp)->chip_num) ||	\
+	 BNXT_CHIP_NUM_588XX((bp)->chip_num) ||	\
 	 (BNXT_CHIP_NUM_58700((bp)->chip_num) &&	\
 	  !BNXT_CHIP_TYPE_NITRO_A0(bp)))
 
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next 05/11] bnxt: initialize board_info values with proper enums
From: Michael Chan @ 2017-08-28 17:40 UTC (permalink / raw)
  To: davem; +Cc: netdev, Scott Branden
In-Reply-To: <1503942035-24924-1-git-send-email-michael.chan@broadcom.com>

From: Scott Branden <scott.branden@broadcom.com>

initialize board_info values with proper enums for defensive programming
purposes.  This will avoid any errors of the enums being declared not
lining up with the board_info array.

Signed-off-by: Scott Branden <scott.branden@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 64 +++++++++++++++----------------
 1 file changed, 32 insertions(+), 32 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 500681d..1afb408 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -114,38 +114,38 @@ enum board_idx {
 static const struct {
 	char *name;
 } board_info[] = {
-	{ "Broadcom BCM57301 NetXtreme-C 10Gb Ethernet" },
-	{ "Broadcom BCM57302 NetXtreme-C 10Gb/25Gb Ethernet" },
-	{ "Broadcom BCM57304 NetXtreme-C 10Gb/25Gb/40Gb/50Gb Ethernet" },
-	{ "Broadcom BCM57417 NetXtreme-E Ethernet Partition" },
-	{ "Broadcom BCM58700 Nitro 1Gb/2.5Gb/10Gb Ethernet" },
-	{ "Broadcom BCM57311 NetXtreme-C 10Gb Ethernet" },
-	{ "Broadcom BCM57312 NetXtreme-C 10Gb/25Gb Ethernet" },
-	{ "Broadcom BCM57402 NetXtreme-E 10Gb Ethernet" },
-	{ "Broadcom BCM57404 NetXtreme-E 10Gb/25Gb Ethernet" },
-	{ "Broadcom BCM57406 NetXtreme-E 10GBase-T Ethernet" },
-	{ "Broadcom BCM57402 NetXtreme-E Ethernet Partition" },
-	{ "Broadcom BCM57407 NetXtreme-E 10GBase-T Ethernet" },
-	{ "Broadcom BCM57412 NetXtreme-E 10Gb Ethernet" },
-	{ "Broadcom BCM57414 NetXtreme-E 10Gb/25Gb Ethernet" },
-	{ "Broadcom BCM57416 NetXtreme-E 10GBase-T Ethernet" },
-	{ "Broadcom BCM57417 NetXtreme-E 10GBase-T Ethernet" },
-	{ "Broadcom BCM57412 NetXtreme-E Ethernet Partition" },
-	{ "Broadcom BCM57314 NetXtreme-C 10Gb/25Gb/40Gb/50Gb Ethernet" },
-	{ "Broadcom BCM57417 NetXtreme-E 10Gb/25Gb Ethernet" },
-	{ "Broadcom BCM57416 NetXtreme-E 10Gb Ethernet" },
-	{ "Broadcom BCM57404 NetXtreme-E Ethernet Partition" },
-	{ "Broadcom BCM57406 NetXtreme-E Ethernet Partition" },
-	{ "Broadcom BCM57407 NetXtreme-E 25Gb Ethernet" },
-	{ "Broadcom BCM57407 NetXtreme-E Ethernet Partition" },
-	{ "Broadcom BCM57414 NetXtreme-E Ethernet Partition" },
-	{ "Broadcom BCM57416 NetXtreme-E Ethernet Partition" },
-	{ "Broadcom BCM57452 NetXtreme-E 10Gb/25Gb/40Gb/50Gb Ethernet" },
-	{ "Broadcom BCM57454 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb Ethernet" },
-	{ "Broadcom BCM58802 NetXtreme-S 10Gb/25Gb/40Gb/50Gb Ethernet" },
-	{ "Broadcom BCM58808 NetXtreme-S 10Gb/25Gb/40Gb/50Gb/100Gb Ethernet" },
-	{ "Broadcom NetXtreme-E Ethernet Virtual Function" },
-	{ "Broadcom NetXtreme-C Ethernet Virtual Function" },
+	[BCM57301] = { "Broadcom BCM57301 NetXtreme-C 10Gb Ethernet" },
+	[BCM57302] = { "Broadcom BCM57302 NetXtreme-C 10Gb/25Gb Ethernet" },
+	[BCM57304] = { "Broadcom BCM57304 NetXtreme-C 10Gb/25Gb/40Gb/50Gb Ethernet" },
+	[BCM57417_NPAR] = { "Broadcom BCM57417 NetXtreme-E Ethernet Partition" },
+	[BCM58700] = { "Broadcom BCM58700 Nitro 1Gb/2.5Gb/10Gb Ethernet" },
+	[BCM57311] = { "Broadcom BCM57311 NetXtreme-C 10Gb Ethernet" },
+	[BCM57312] = { "Broadcom BCM57312 NetXtreme-C 10Gb/25Gb Ethernet" },
+	[BCM57402] = { "Broadcom BCM57402 NetXtreme-E 10Gb Ethernet" },
+	[BCM57404] = { "Broadcom BCM57404 NetXtreme-E 10Gb/25Gb Ethernet" },
+	[BCM57406] = { "Broadcom BCM57406 NetXtreme-E 10GBase-T Ethernet" },
+	[BCM57402_NPAR] = { "Broadcom BCM57402 NetXtreme-E Ethernet Partition" },
+	[BCM57407] = { "Broadcom BCM57407 NetXtreme-E 10GBase-T Ethernet" },
+	[BCM57412] = { "Broadcom BCM57412 NetXtreme-E 10Gb Ethernet" },
+	[BCM57414] = { "Broadcom BCM57414 NetXtreme-E 10Gb/25Gb Ethernet" },
+	[BCM57416] = { "Broadcom BCM57416 NetXtreme-E 10GBase-T Ethernet" },
+	[BCM57417] = { "Broadcom BCM57417 NetXtreme-E 10GBase-T Ethernet" },
+	[BCM57412_NPAR] = { "Broadcom BCM57412 NetXtreme-E Ethernet Partition" },
+	[BCM57314] = { "Broadcom BCM57314 NetXtreme-C 10Gb/25Gb/40Gb/50Gb Ethernet" },
+	[BCM57417_SFP] = { "Broadcom BCM57417 NetXtreme-E 10Gb/25Gb Ethernet" },
+	[BCM57416_SFP] = { "Broadcom BCM57416 NetXtreme-E 10Gb Ethernet" },
+	[BCM57404_NPAR] = { "Broadcom BCM57404 NetXtreme-E Ethernet Partition" },
+	[BCM57406_NPAR] = { "Broadcom BCM57406 NetXtreme-E Ethernet Partition" },
+	[BCM57407_SFP] = { "Broadcom BCM57407 NetXtreme-E 25Gb Ethernet" },
+	[BCM57407_NPAR] = { "Broadcom BCM57407 NetXtreme-E Ethernet Partition" },
+	[BCM57414_NPAR] = { "Broadcom BCM57414 NetXtreme-E Ethernet Partition" },
+	[BCM57416_NPAR] = { "Broadcom BCM57416 NetXtreme-E Ethernet Partition" },
+	[BCM57452] = { "Broadcom BCM57452 NetXtreme-E 10Gb/25Gb/40Gb/50Gb Ethernet" },
+	[BCM57454] = { "Broadcom BCM57454 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb Ethernet" },
+	[BCM58802] = { "Broadcom BCM58802 NetXtreme-S 10Gb/25Gb/40Gb/50Gb Ethernet" },
+	[BCM58808] = { "Broadcom BCM58808 NetXtreme-S 10Gb/25Gb/40Gb/50Gb/100Gb Ethernet" },
+	[NETXTREME_E_VF] = { "Broadcom NetXtreme-E Ethernet Virtual Function" },
+	[NETXTREME_C_VF] = { "Broadcom NetXtreme-C Ethernet Virtual Function" },
 };
 
 static const struct pci_device_id bnxt_pci_tbl[] = {
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next 06/11] bnxt_en: Improve -ENOMEM logic in NAPI poll loop.
From: Michael Chan @ 2017-08-28 17:40 UTC (permalink / raw)
  To: davem; +Cc: netdev, Martin KaFai Lau
In-Reply-To: <1503942035-24924-1-git-send-email-michael.chan@broadcom.com>

If we cannot allocate RX buffers in the NAPI poll loop when processing
an RX event, the current code does not count that event towards the NAPI
budget.  This can cause us to potentially loop forever in NAPI if we
consistently cannot allocate new buffers.  Improve it by counting
-ENOMEM event as 1 towards the NAPI budget.

Cc: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 1afb408..a34fcdd 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -1850,6 +1850,13 @@ static int bnxt_poll_work(struct bnxt *bp, struct bnxt_napi *bnapi, int budget)
 							   &event);
 			if (likely(rc >= 0))
 				rx_pkts += rc;
+			/* Increment rx_pkts when rc is -ENOMEM to count towards
+			 * the NAPI budget.  Otherwise, we may potentially loop
+			 * here forever if we consistently cannot allocate
+			 * buffers.
+			 */
+			else if (rc == -ENOMEM)
+				rx_pkts++;
 			else if (rc == -EBUSY)	/* partial completion */
 				break;
 		} else if (unlikely((TX_CMP_TYPE(txcmp) ==
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next 07/11] bnxt_en: Reduce default rings on multi-port cards.
From: Michael Chan @ 2017-08-28 17:40 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1503942035-24924-1-git-send-email-michael.chan@broadcom.com>

Reduce default rings from 8 to 4 on multi-port cards to reduce memory
usage.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 13 +++++++++----
 drivers/net/ethernet/broadcom/bnxt/bnxt.h |  1 +
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index a34fcdd..4406f91 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -5795,6 +5795,8 @@ static int bnxt_hwrm_phy_qcaps(struct bnxt *bp)
 		link_info->support_auto_speeds =
 			le16_to_cpu(resp->supported_speeds_auto_mode);
 
+	bp->port_count = resp->port_cnt;
+
 hwrm_phy_qcaps_exit:
 	mutex_unlock(&bp->hwrm_cmd_lock);
 	return rc;
@@ -7877,6 +7879,9 @@ static int bnxt_set_dflt_rings(struct bnxt *bp, bool sh)
 	if (sh)
 		bp->flags |= BNXT_FLAG_SHARED_RINGS;
 	dflt_rings = netif_get_num_default_rss_queues();
+	/* Reduce default rings to reduce memory usage on multi-port cards */
+	if (bp->port_count > 1)
+		dflt_rings = min_t(int, dflt_rings, 4);
 	rc = bnxt_get_dflt_rings(bp, &max_rx_rings, &max_tx_rings, sh);
 	if (rc)
 		return rc;
@@ -8049,6 +8054,10 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	bnxt_ethtool_init(bp);
 	bnxt_dcb_init(bp);
 
+	rc = bnxt_probe_phy(bp);
+	if (rc)
+		goto init_err_pci_clean;
+
 	bnxt_set_rx_skb_mode(bp, false);
 	bnxt_set_tpa_flags(bp);
 	bnxt_set_ring_params(bp);
@@ -8083,10 +8092,6 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	if (dev->hw_features & NETIF_F_HW_VLAN_CTAG_RX)
 		bp->flags |= BNXT_FLAG_STRIP_VLAN;
 
-	rc = bnxt_probe_phy(bp);
-	if (rc)
-		goto init_err_pci_clean;
-
 	rc = bnxt_init_int_mode(bp);
 	if (rc)
 		goto init_err_pci_clean;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 3521e46..86af8ea 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -1207,6 +1207,7 @@ struct bnxt {
 	u8			nge_port_cnt;
 	__le16			nge_fw_dst_port_id;
 	u8			port_partition_type;
+	u8			port_count;
 	u16			br_mode;
 
 	u16			rx_coal_ticks;
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next 08/11] bnxt_en: fix clearing devlink ptr from bnxt struct
From: Michael Chan @ 2017-08-28 17:40 UTC (permalink / raw)
  To: davem; +Cc: netdev, Sathya Perla
In-Reply-To: <1503942035-24924-1-git-send-email-michael.chan@broadcom.com>

From: Sathya Perla <sathya.perla@broadcom.com>

The routine bnxt_link_bp_to_dl() is used to set the devlink ptr
in bnxt struct (bp) and also to set the bnxt back ptr in
the devlink struct.  If devlink_register() fails, bp->dl must
be cleared which is not happening currently. This patch fixes
bnxt_link_bp_to_dl() to clear bp->dl by passing  a NULL dl ptr.

Fixes: 4ab0c6a8ffd7 ("bnxt_en: add support to enable VF-representors")
Signed-off-by: Sathya Perla <sathya.perla@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c |  4 ++--
 drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.h | 14 +++++++++-----
 2 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c
index 86cce6f..c365d3c 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c
@@ -468,11 +468,11 @@ int bnxt_dl_register(struct bnxt *bp)
 		return -ENOMEM;
 	}
 
-	bnxt_link_bp_to_dl(dl, bp);
+	bnxt_link_bp_to_dl(bp, dl);
 	bp->eswitch_mode = DEVLINK_ESWITCH_MODE_LEGACY;
 	rc = devlink_register(dl, &bp->pdev->dev);
 	if (rc) {
-		bnxt_link_bp_to_dl(dl, NULL);
+		bnxt_link_bp_to_dl(bp, NULL);
 		devlink_free(dl);
 		netdev_warn(bp->dev, "devlink_register failed. rc=%d", rc);
 		return rc;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.h b/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.h
index e55a3b6..3e997c9 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.h
@@ -24,13 +24,17 @@ static inline struct bnxt *bnxt_get_bp_from_dl(struct devlink *dl)
 	return ((struct bnxt_dl *)devlink_priv(dl))->bp;
 }
 
-static inline void bnxt_link_bp_to_dl(struct devlink *dl, struct bnxt *bp)
+/* To clear devlink pointer from bp, pass NULL dl */
+static inline void bnxt_link_bp_to_dl(struct bnxt *bp, struct devlink *dl)
 {
-	struct bnxt_dl *bp_dl = devlink_priv(dl);
+	bp->dl = dl;
 
-	bp_dl->bp = bp;
-	if (bp)
-		bp->dl = dl;
+	/* add a back pointer in dl to bp */
+	if (dl) {
+		struct bnxt_dl *bp_dl = devlink_priv(dl);
+
+		bp_dl->bp = bp;
+	}
 }
 
 int bnxt_dl_register(struct bnxt *bp);
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next 09/11] bnxt_en: bnxt: add TC flower filter offload support
From: Michael Chan @ 2017-08-28 17:40 UTC (permalink / raw)
  To: davem; +Cc: netdev, Sathya Perla
In-Reply-To: <1503942035-24924-1-git-send-email-michael.chan@broadcom.com>

From: Sathya Perla <sathya.perla@broadcom.com>

This patch adds support for offloading TC based flow
rules and actions for the 'flower' classifier in the bnxt_en driver.
It includes logic to parse flow rules and actions received from the
TC subsystem, store them and issue the corresponding
hwrm_cfa_flow_alloc/free FW cmds. L2/IPv4/IPv6 flows and drop,
redir, vlan push/pop actions are supported in this patch.

In this patch the hwrm_cfa_flow_xxx routines are just stubs.
The code for these routines is introduced in the next patch for easier
review. Also, the code to query the TC/flower action stats will
be introduced in a subsequent patch.

Signed-off-by: Sathya Perla <sathya.perla@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
---
 drivers/net/ethernet/broadcom/Kconfig         |   9 +
 drivers/net/ethernet/broadcom/bnxt/Makefile   |   2 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c     |  39 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.h     |  23 +
 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c  | 602 ++++++++++++++++++++++++++
 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.h  | 158 +++++++
 drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c |  18 +
 drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.h |   8 +
 8 files changed, 850 insertions(+), 9 deletions(-)
 create mode 100644 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
 create mode 100644 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.h

diff --git a/drivers/net/ethernet/broadcom/Kconfig b/drivers/net/ethernet/broadcom/Kconfig
index 1456cb1..67134ec 100644
--- a/drivers/net/ethernet/broadcom/Kconfig
+++ b/drivers/net/ethernet/broadcom/Kconfig
@@ -212,6 +212,15 @@ config BNXT_SRIOV
 	  Virtualization support in the NetXtreme-C/E products. This
 	  allows for virtual function acceleration in virtual environments.
 
+config BNXT_FLOWER_OFFLOAD
+	bool "TC Flower offload support for NetXtreme-C/E"
+	depends on BNXT
+	default y
+	---help---
+	  This configuration parameter enables TC Flower packet classifier
+	  offload for eswitch.  This option enables SR-IOV switchdev eswitch
+	  offload.
+
 config BNXT_DCB
 	bool "Data Center Bridging (DCB) Support"
 	default n
diff --git a/drivers/net/ethernet/broadcom/bnxt/Makefile b/drivers/net/ethernet/broadcom/bnxt/Makefile
index d141a22..4f0cb8e 100644
--- a/drivers/net/ethernet/broadcom/bnxt/Makefile
+++ b/drivers/net/ethernet/broadcom/bnxt/Makefile
@@ -1,3 +1,3 @@
 obj-$(CONFIG_BNXT) += bnxt_en.o
 
-bnxt_en-y := bnxt.o bnxt_sriov.o bnxt_ethtool.o bnxt_dcb.o bnxt_ulp.o bnxt_xdp.o bnxt_vfr.o
+bnxt_en-y := bnxt.o bnxt_sriov.o bnxt_ethtool.o bnxt_dcb.o bnxt_ulp.o bnxt_xdp.o bnxt_vfr.o bnxt_tc.o
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 4406f91..d6367c1 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -50,6 +50,7 @@
 #include <linux/bitmap.h>
 #include <linux/cpu_rmap.h>
 #include <linux/cpumask.h>
+#include <net/pkt_cls.h>
 
 #include "bnxt_hsi.h"
 #include "bnxt.h"
@@ -59,6 +60,7 @@
 #include "bnxt_dcb.h"
 #include "bnxt_xdp.h"
 #include "bnxt_vfr.h"
+#include "bnxt_tc.h"
 
 #define BNXT_TX_TIMEOUT		(5 * HZ)
 
@@ -7305,17 +7307,33 @@ int bnxt_setup_mq_tc(struct net_device *dev, u8 tc)
 	return 0;
 }
 
-static int bnxt_setup_tc(struct net_device *dev, enum tc_setup_type type,
-			 void *type_data)
+static int bnxt_setup_flower(struct net_device *dev,
+			     struct tc_cls_flower_offload *cls_flower)
 {
-	struct tc_mqprio_qopt *mqprio = type_data;
+	struct bnxt *bp = netdev_priv(dev);
 
-	if (type != TC_SETUP_MQPRIO)
+	if (BNXT_VF(bp))
 		return -EOPNOTSUPP;
 
-	mqprio->hw = TC_MQPRIO_HW_OFFLOAD_TCS;
+	return bnxt_tc_setup_flower(bp, bp->pf.fw_fid, cls_flower);
+}
+
+static int bnxt_setup_tc(struct net_device *dev, enum tc_setup_type type,
+			 void *type_data)
+{
+	switch (type) {
+	case TC_SETUP_CLSFLOWER:
+		return bnxt_setup_flower(dev, type_data);
+	case TC_SETUP_MQPRIO: {
+		struct tc_mqprio_qopt *mqprio = type_data;
+
+		mqprio->hw = TC_MQPRIO_HW_OFFLOAD_TCS;
 
-	return bnxt_setup_mq_tc(dev, mqprio->num_tc);
+		return bnxt_setup_mq_tc(dev, mqprio->num_tc);
+	}
+	default:
+		return -EOPNOTSUPP;
+	}
 }
 
 #ifdef CONFIG_RFS_ACCEL
@@ -7711,6 +7729,7 @@ static void bnxt_remove_one(struct pci_dev *pdev)
 
 	pci_disable_pcie_error_reporting(pdev);
 	unregister_netdev(dev);
+	bnxt_shutdown_tc(bp);
 	cancel_work_sync(&bp->sp_task);
 	bp->sp_event = 0;
 
@@ -8102,9 +8121,12 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	else
 		device_set_wakeup_capable(&pdev->dev, false);
 
+	if (BNXT_PF(bp))
+		bnxt_init_tc(bp);
+
 	rc = register_netdev(dev);
 	if (rc)
-		goto init_err_clr_int;
+		goto init_err_cleanup_tc;
 
 	if (BNXT_PF(bp))
 		bnxt_dl_register(bp);
@@ -8117,7 +8139,8 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 
 	return 0;
 
-init_err_clr_int:
+init_err_cleanup_tc:
+	bnxt_shutdown_tc(bp);
 	bnxt_clear_int_mode(bp);
 
 init_err_pci_clean:
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 86af8ea..7b888d4 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -19,6 +19,7 @@
 #define DRV_VER_UPD	0
 
 #include <linux/interrupt.h>
+#include <linux/rhashtable.h>
 #include <net/devlink.h>
 #include <net/dst_metadata.h>
 #include <net/switchdev.h>
@@ -943,6 +944,27 @@ struct bnxt_test_info {
 #define BNXT_CAG_REG_LEGACY_INT_STATUS	0x4014
 #define BNXT_CAG_REG_BASE		0x300000
 
+struct bnxt_tc_info {
+	bool				enabled;
+
+	/* hash table to store TC offloaded flows */
+	struct rhashtable		flow_table;
+	struct rhashtable_params	flow_ht_params;
+
+	/* hash table to store L2 keys of TC flows */
+	struct rhashtable		l2_table;
+	struct rhashtable_params	l2_ht_params;
+
+	/* lock to atomically add/del an l2 node when a flow is
+	 * added or deleted.
+	 */
+	struct mutex			lock;
+
+	/* Stat counter mask (width) */
+	u64				bytes_mask;
+	u64				packets_mask;
+};
+
 struct bnxt_vf_rep_stats {
 	u64			packets;
 	u64			bytes;
@@ -1289,6 +1311,7 @@ struct bnxt {
 	enum devlink_eswitch_mode eswitch_mode;
 	struct bnxt_vf_rep	**vf_reps; /* array of vf-rep ptrs */
 	u16			*cfa_code_map; /* cfa_code -> vf_idx map */
+	struct bnxt_tc_info	tc_info;
 };
 
 #define BNXT_RX_STATS_OFFSET(counter)			\
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
new file mode 100644
index 0000000..a10df27
--- /dev/null
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
@@ -0,0 +1,602 @@
+/* Broadcom NetXtreme-C/E network driver.
+ *
+ * Copyright (c) 2017 Broadcom Limited
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/netdevice.h>
+#include <linux/inetdevice.h>
+#include <linux/if_vlan.h>
+#include <net/flow_dissector.h>
+#include <net/pkt_cls.h>
+#include <net/tc_act/tc_gact.h>
+#include <net/tc_act/tc_skbedit.h>
+#include <net/tc_act/tc_mirred.h>
+#include <net/tc_act/tc_vlan.h>
+
+#include "bnxt_hsi.h"
+#include "bnxt.h"
+#include "bnxt_sriov.h"
+#include "bnxt_tc.h"
+#include "bnxt_vfr.h"
+
+#ifdef CONFIG_BNXT_FLOWER_OFFLOAD
+
+#define BNXT_FID_INVALID			0xffff
+#define VLAN_TCI(vid, prio)	((vid) | ((prio) << VLAN_PRIO_SHIFT))
+
+/* Return the dst fid of the func for flow forwarding
+ * For PFs: src_fid is the fid of the PF
+ * For VF-reps: src_fid the fid of the VF
+ */
+static u16 bnxt_flow_get_dst_fid(struct bnxt *pf_bp, struct net_device *dev)
+{
+	struct bnxt *bp;
+
+	/* check if dev belongs to the same switch */
+	if (!switchdev_port_same_parent_id(pf_bp->dev, dev)) {
+		netdev_info(pf_bp->dev, "dev(ifindex=%d) not on same switch",
+			    dev->ifindex);
+		return BNXT_FID_INVALID;
+	}
+
+	/* Is dev a VF-rep? */
+	if (dev != pf_bp->dev)
+		return bnxt_vf_rep_get_fid(dev);
+
+	bp = netdev_priv(dev);
+	return bp->pf.fw_fid;
+}
+
+static int bnxt_tc_parse_redir(struct bnxt *bp,
+			       struct bnxt_tc_actions *actions,
+			       const struct tc_action *tc_act)
+{
+	int ifindex = tcf_mirred_ifindex(tc_act);
+	struct net_device *dev;
+	u16 dst_fid;
+
+	dev = __dev_get_by_index(dev_net(bp->dev), ifindex);
+	if (!dev) {
+		netdev_info(bp->dev, "no dev for ifindex=%d", ifindex);
+		return -EINVAL;
+	}
+
+	/* find the FID from dev */
+	dst_fid = bnxt_flow_get_dst_fid(bp, dev);
+	if (dst_fid == BNXT_FID_INVALID) {
+		netdev_info(bp->dev, "can't get fid for ifindex=%d", ifindex);
+		return -EINVAL;
+	}
+
+	actions->flags |= BNXT_TC_ACTION_FLAG_FWD;
+	actions->dst_fid = dst_fid;
+	actions->dst_dev = dev;
+	return 0;
+}
+
+static void bnxt_tc_parse_vlan(struct bnxt *bp,
+			       struct bnxt_tc_actions *actions,
+			       const struct tc_action *tc_act)
+{
+	if (tcf_vlan_action(tc_act) == TCA_VLAN_ACT_POP) {
+		actions->flags |= BNXT_TC_ACTION_FLAG_POP_VLAN;
+	} else if (tcf_vlan_action(tc_act) == TCA_VLAN_ACT_PUSH) {
+		actions->flags |= BNXT_TC_ACTION_FLAG_PUSH_VLAN;
+		actions->push_vlan_tci = htons(tcf_vlan_push_vid(tc_act));
+		actions->push_vlan_tpid = tcf_vlan_push_proto(tc_act);
+	}
+}
+
+static int bnxt_tc_parse_actions(struct bnxt *bp,
+				 struct bnxt_tc_actions *actions,
+				 struct tcf_exts *tc_exts)
+{
+	const struct tc_action *tc_act;
+	LIST_HEAD(tc_actions);
+	int rc;
+
+	if (!tcf_exts_has_actions(tc_exts)) {
+		netdev_info(bp->dev, "no actions");
+		return -EINVAL;
+	}
+
+	tcf_exts_to_list(tc_exts, &tc_actions);
+	list_for_each_entry(tc_act, &tc_actions, list) {
+		/* Drop action */
+		if (is_tcf_gact_shot(tc_act)) {
+			actions->flags |= BNXT_TC_ACTION_FLAG_DROP;
+			return 0; /* don't bother with other actions */
+		}
+
+		/* Redirect action */
+		if (is_tcf_mirred_egress_redirect(tc_act)) {
+			rc = bnxt_tc_parse_redir(bp, actions, tc_act);
+			if (rc)
+				return rc;
+			continue;
+		}
+
+		/* Push/pop VLAN */
+		if (is_tcf_vlan(tc_act)) {
+			bnxt_tc_parse_vlan(bp, actions, tc_act);
+			continue;
+		}
+	}
+
+	return 0;
+}
+
+#define GET_KEY(flow_cmd, key_type)					\
+		skb_flow_dissector_target((flow_cmd)->dissector, key_type,\
+					  (flow_cmd)->key)
+#define GET_MASK(flow_cmd, key_type)					\
+		skb_flow_dissector_target((flow_cmd)->dissector, key_type,\
+					  (flow_cmd)->mask)
+
+static int bnxt_tc_parse_flow(struct bnxt *bp,
+			      struct tc_cls_flower_offload *tc_flow_cmd,
+			      struct bnxt_tc_flow *flow)
+{
+	struct flow_dissector *dissector = tc_flow_cmd->dissector;
+	u16 addr_type = 0;
+
+	/* KEY_CONTROL and KEY_BASIC are needed for forming a meaningful key */
+	if ((dissector->used_keys & BIT(FLOW_DISSECTOR_KEY_CONTROL)) == 0 ||
+	    (dissector->used_keys & BIT(FLOW_DISSECTOR_KEY_BASIC)) == 0) {
+		netdev_info(bp->dev, "cannot form TC key: used_keys = 0x%x",
+			    dissector->used_keys);
+		return -EOPNOTSUPP;
+	}
+
+	if (dissector_uses_key(dissector, FLOW_DISSECTOR_KEY_CONTROL)) {
+		struct flow_dissector_key_control *key =
+			GET_KEY(tc_flow_cmd, FLOW_DISSECTOR_KEY_CONTROL);
+
+		addr_type = key->addr_type;
+	}
+
+	if (dissector_uses_key(dissector, FLOW_DISSECTOR_KEY_BASIC)) {
+		struct flow_dissector_key_basic *key =
+			GET_KEY(tc_flow_cmd, FLOW_DISSECTOR_KEY_BASIC);
+		struct flow_dissector_key_basic *mask =
+			GET_MASK(tc_flow_cmd, FLOW_DISSECTOR_KEY_BASIC);
+
+		flow->l2_key.ether_type = key->n_proto;
+		flow->l2_mask.ether_type = mask->n_proto;
+
+		if (key->n_proto == htons(ETH_P_IP) ||
+		    key->n_proto == htons(ETH_P_IPV6)) {
+			flow->l4_key.ip_proto = key->ip_proto;
+			flow->l4_mask.ip_proto = mask->ip_proto;
+		}
+	}
+
+	if (dissector_uses_key(dissector, FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
+		struct flow_dissector_key_eth_addrs *key =
+			GET_KEY(tc_flow_cmd, FLOW_DISSECTOR_KEY_ETH_ADDRS);
+		struct flow_dissector_key_eth_addrs *mask =
+			GET_MASK(tc_flow_cmd, FLOW_DISSECTOR_KEY_ETH_ADDRS);
+
+		flow->flags |= BNXT_TC_FLOW_FLAGS_ETH_ADDRS;
+		ether_addr_copy(flow->l2_key.dmac, key->dst);
+		ether_addr_copy(flow->l2_mask.dmac, mask->dst);
+		ether_addr_copy(flow->l2_key.smac, key->src);
+		ether_addr_copy(flow->l2_mask.smac, mask->src);
+	}
+
+	if (dissector_uses_key(dissector, FLOW_DISSECTOR_KEY_VLAN)) {
+		struct flow_dissector_key_vlan *key =
+			GET_KEY(tc_flow_cmd, FLOW_DISSECTOR_KEY_VLAN);
+		struct flow_dissector_key_vlan *mask =
+			GET_MASK(tc_flow_cmd, FLOW_DISSECTOR_KEY_VLAN);
+
+		flow->l2_key.inner_vlan_tci =
+		   cpu_to_be16(VLAN_TCI(key->vlan_id, key->vlan_priority));
+		flow->l2_mask.inner_vlan_tci =
+		   cpu_to_be16((VLAN_TCI(mask->vlan_id, mask->vlan_priority)));
+		flow->l2_key.inner_vlan_tpid = htons(ETH_P_8021Q);
+		flow->l2_mask.inner_vlan_tpid = htons(0xffff);
+		flow->l2_key.num_vlans = 1;
+	}
+
+	if (dissector_uses_key(dissector, FLOW_DISSECTOR_KEY_IPV4_ADDRS)) {
+		struct flow_dissector_key_ipv4_addrs *key =
+			GET_KEY(tc_flow_cmd, FLOW_DISSECTOR_KEY_IPV4_ADDRS);
+		struct flow_dissector_key_ipv4_addrs *mask =
+			GET_MASK(tc_flow_cmd, FLOW_DISSECTOR_KEY_IPV4_ADDRS);
+
+		flow->flags |= BNXT_TC_FLOW_FLAGS_IPV4_ADDRS;
+		flow->l3_key.ipv4.daddr.s_addr = key->dst;
+		flow->l3_mask.ipv4.daddr.s_addr = mask->dst;
+		flow->l3_key.ipv4.saddr.s_addr = key->src;
+		flow->l3_mask.ipv4.saddr.s_addr = mask->src;
+	} else if (dissector_uses_key(dissector,
+				      FLOW_DISSECTOR_KEY_IPV6_ADDRS)) {
+		struct flow_dissector_key_ipv6_addrs *key =
+			GET_KEY(tc_flow_cmd, FLOW_DISSECTOR_KEY_IPV6_ADDRS);
+		struct flow_dissector_key_ipv6_addrs *mask =
+			GET_MASK(tc_flow_cmd, FLOW_DISSECTOR_KEY_IPV6_ADDRS);
+
+		flow->flags |= BNXT_TC_FLOW_FLAGS_IPV6_ADDRS;
+		flow->l3_key.ipv6.daddr = key->dst;
+		flow->l3_mask.ipv6.daddr = mask->dst;
+		flow->l3_key.ipv6.saddr = key->src;
+		flow->l3_mask.ipv6.saddr = mask->src;
+	}
+
+	if (dissector_uses_key(dissector, FLOW_DISSECTOR_KEY_PORTS)) {
+		struct flow_dissector_key_ports *key =
+			GET_KEY(tc_flow_cmd, FLOW_DISSECTOR_KEY_PORTS);
+		struct flow_dissector_key_ports *mask =
+			GET_MASK(tc_flow_cmd, FLOW_DISSECTOR_KEY_PORTS);
+
+		flow->flags |= BNXT_TC_FLOW_FLAGS_PORTS;
+		flow->l4_key.ports.dport = key->dst;
+		flow->l4_mask.ports.dport = mask->dst;
+		flow->l4_key.ports.sport = key->src;
+		flow->l4_mask.ports.sport = mask->src;
+	}
+
+	if (dissector_uses_key(dissector, FLOW_DISSECTOR_KEY_ICMP)) {
+		struct flow_dissector_key_icmp *key =
+			GET_KEY(tc_flow_cmd, FLOW_DISSECTOR_KEY_ICMP);
+		struct flow_dissector_key_icmp *mask =
+			GET_MASK(tc_flow_cmd, FLOW_DISSECTOR_KEY_ICMP);
+
+		flow->flags |= BNXT_TC_FLOW_FLAGS_ICMP;
+		flow->l4_key.icmp.type = key->type;
+		flow->l4_key.icmp.code = key->code;
+		flow->l4_mask.icmp.type = mask->type;
+		flow->l4_mask.icmp.code = mask->code;
+	}
+
+	return bnxt_tc_parse_actions(bp, &flow->actions, tc_flow_cmd->exts);
+}
+
+static int bnxt_hwrm_cfa_flow_free(struct bnxt *bp, __le16 flow_handle)
+{
+	return 0;
+}
+
+static int bnxt_hwrm_cfa_flow_alloc(struct bnxt *bp, struct bnxt_tc_flow *flow,
+				    __le16 ref_flow_handle, __le16 *flow_handle)
+{
+	return 0;
+}
+
+static int bnxt_tc_put_l2_node(struct bnxt *bp,
+			       struct bnxt_tc_flow_node *flow_node)
+{
+	struct bnxt_tc_l2_node *l2_node = flow_node->l2_node;
+	struct bnxt_tc_info *tc_info = &bp->tc_info;
+	int rc;
+
+	/* remove flow_node from the L2 shared flow list */
+	list_del(&flow_node->l2_list_node);
+	if (--l2_node->refcount == 0) {
+		rc =  rhashtable_remove_fast(&tc_info->l2_table, &l2_node->node,
+					     tc_info->l2_ht_params);
+		if (rc)
+			netdev_err(bp->dev,
+				   "Error: %s: rhashtable_remove_fast: %d",
+				   __func__, rc);
+		kfree_rcu(l2_node, rcu);
+	}
+	return 0;
+}
+
+static struct bnxt_tc_l2_node *
+bnxt_tc_get_l2_node(struct bnxt *bp, struct rhashtable *l2_table,
+		    struct rhashtable_params ht_params,
+		    struct bnxt_tc_l2_key *l2_key)
+{
+	struct bnxt_tc_l2_node *l2_node;
+	int rc;
+
+	l2_node = rhashtable_lookup_fast(l2_table, l2_key, ht_params);
+	if (!l2_node) {
+		l2_node = kzalloc(sizeof(*l2_node), GFP_KERNEL);
+		if (!l2_node) {
+			rc = -ENOMEM;
+			return NULL;
+		}
+
+		l2_node->key = *l2_key;
+		rc = rhashtable_insert_fast(l2_table, &l2_node->node,
+					    ht_params);
+		if (rc) {
+			kfree(l2_node);
+			netdev_err(bp->dev,
+				   "Error: %s: rhashtable_insert_fast: %d",
+				   __func__, rc);
+			return NULL;
+		}
+		INIT_LIST_HEAD(&l2_node->common_l2_flows);
+	}
+	return l2_node;
+}
+
+/* Get the ref_flow_handle for a flow by checking if there are any other
+ * flows that share the same L2 key as this flow.
+ */
+static int
+bnxt_tc_get_ref_flow_handle(struct bnxt *bp, struct bnxt_tc_flow *flow,
+			    struct bnxt_tc_flow_node *flow_node,
+			    __le16 *ref_flow_handle)
+{
+	struct bnxt_tc_info *tc_info = &bp->tc_info;
+	struct bnxt_tc_flow_node *ref_flow_node;
+	struct bnxt_tc_l2_node *l2_node;
+
+	l2_node = bnxt_tc_get_l2_node(bp, &tc_info->l2_table,
+				      tc_info->l2_ht_params,
+				      &flow->l2_key);
+	if (!l2_node)
+		return -1;
+
+	/* If any other flow is using this l2_node, use it's flow_handle
+	 * as the ref_flow_handle
+	 */
+	if (l2_node->refcount > 0) {
+		ref_flow_node = list_first_entry(&l2_node->common_l2_flows,
+						 struct bnxt_tc_flow_node,
+						 l2_list_node);
+		*ref_flow_handle = ref_flow_node->flow_handle;
+	} else {
+		*ref_flow_handle = cpu_to_le16(0xffff);
+	}
+
+	/* Insert the l2_node into the flow_node so that subsequent flows
+	 * with a matching l2 key can use the flow_handle of this flow
+	 * as their ref_flow_handle
+	 */
+	flow_node->l2_node = l2_node;
+	list_add(&flow_node->l2_list_node, &l2_node->common_l2_flows);
+	l2_node->refcount++;
+	return 0;
+}
+
+/* After the flow parsing is done, this routine is used for checking
+ * if there are any aspects of the flow that prevent it from being
+ * offloaded.
+ */
+static bool bnxt_tc_can_offload(struct bnxt *bp, struct bnxt_tc_flow *flow)
+{
+	/* If L4 ports are specified then ip_proto must be TCP or UDP */
+	if ((flow->flags & BNXT_TC_FLOW_FLAGS_PORTS) &&
+	    (flow->l4_key.ip_proto != IPPROTO_TCP &&
+	     flow->l4_key.ip_proto != IPPROTO_UDP)) {
+		netdev_info(bp->dev, "Cannot offload non-TCP/UDP (%d) ports",
+			    flow->l4_key.ip_proto);
+		return false;
+	}
+
+	return true;
+}
+
+static int __bnxt_tc_del_flow(struct bnxt *bp,
+			      struct bnxt_tc_flow_node *flow_node)
+{
+	struct bnxt_tc_info *tc_info = &bp->tc_info;
+	int rc;
+
+	/* send HWRM cmd to free the flow-id */
+	bnxt_hwrm_cfa_flow_free(bp, flow_node->flow_handle);
+
+	mutex_lock(&tc_info->lock);
+
+	/* release reference to l2 node */
+	bnxt_tc_put_l2_node(bp, flow_node);
+
+	mutex_unlock(&tc_info->lock);
+
+	rc = rhashtable_remove_fast(&tc_info->flow_table, &flow_node->node,
+				    tc_info->flow_ht_params);
+	if (rc)
+		netdev_err(bp->dev, "Error: %s: rhashtable_remove_fast rc=%d",
+			   __func__, rc);
+
+	kfree_rcu(flow_node, rcu);
+	return 0;
+}
+
+/* Add a new flow or replace an existing flow.
+ * Notes on locking:
+ * There are essentially two critical sections here.
+ * 1. while adding a new flow
+ *    a) lookup l2-key
+ *    b) issue HWRM cmd and get flow_handle
+ *    c) link l2-key with flow
+ * 2. while deleting a flow
+ *    a) unlinking l2-key from flow
+ * A lock is needed to protect these two critical sections.
+ *
+ * The hash-tables are already protected by the rhashtable API.
+ */
+static int bnxt_tc_add_flow(struct bnxt *bp, u16 src_fid,
+			    struct tc_cls_flower_offload *tc_flow_cmd)
+{
+	struct bnxt_tc_flow_node *new_node, *old_node;
+	struct bnxt_tc_info *tc_info = &bp->tc_info;
+	struct bnxt_tc_flow *flow;
+	__le16 ref_flow_handle;
+	int rc;
+
+	/* allocate memory for the new flow and it's node */
+	new_node = kzalloc(sizeof(*new_node), GFP_KERNEL);
+	if (!new_node) {
+		rc = -ENOMEM;
+		goto done;
+	}
+	new_node->cookie = tc_flow_cmd->cookie;
+	flow = &new_node->flow;
+
+	rc = bnxt_tc_parse_flow(bp, tc_flow_cmd, flow);
+	if (rc)
+		goto free_node;
+	flow->src_fid = src_fid;
+
+	if (!bnxt_tc_can_offload(bp, flow)) {
+		rc = -ENOSPC;
+		goto free_node;
+	}
+
+	/* If a flow exists with the same cookie, delete it */
+	old_node = rhashtable_lookup_fast(&tc_info->flow_table,
+					  &tc_flow_cmd->cookie,
+					  tc_info->flow_ht_params);
+	if (old_node)
+		__bnxt_tc_del_flow(bp, old_node);
+
+	/* Check if the L2 part of the flow has been offloaded already.
+	 * If so, bump up it's refcnt and get it's reference handle.
+	 */
+	mutex_lock(&tc_info->lock);
+	rc = bnxt_tc_get_ref_flow_handle(bp, flow, new_node, &ref_flow_handle);
+	if (rc)
+		goto unlock;
+
+	/* send HWRM cmd to alloc the flow */
+	rc = bnxt_hwrm_cfa_flow_alloc(bp, flow, ref_flow_handle,
+				      &new_node->flow_handle);
+	if (rc)
+		goto put_l2;
+
+	/* add new flow to flow-table */
+	rc = rhashtable_insert_fast(&tc_info->flow_table, &new_node->node,
+				    tc_info->flow_ht_params);
+	if (rc)
+		goto hwrm_flow_free;
+
+	mutex_unlock(&tc_info->lock);
+	return 0;
+
+hwrm_flow_free:
+	bnxt_hwrm_cfa_flow_free(bp, new_node->flow_handle);
+put_l2:
+	bnxt_tc_put_l2_node(bp, new_node);
+unlock:
+	mutex_unlock(&tc_info->lock);
+free_node:
+	kfree(new_node);
+done:
+	netdev_err(bp->dev, "Error: %s: cookie=0x%lx error=%d",
+		   __func__, tc_flow_cmd->cookie, rc);
+	return rc;
+}
+
+static int bnxt_tc_del_flow(struct bnxt *bp,
+			    struct tc_cls_flower_offload *tc_flow_cmd)
+{
+	struct bnxt_tc_info *tc_info = &bp->tc_info;
+	struct bnxt_tc_flow_node *flow_node;
+
+	flow_node = rhashtable_lookup_fast(&tc_info->flow_table,
+					   &tc_flow_cmd->cookie,
+					   tc_info->flow_ht_params);
+	if (!flow_node) {
+		netdev_info(bp->dev, "ERROR: no flow_node for cookie %lx",
+			    tc_flow_cmd->cookie);
+		return -EINVAL;
+	}
+
+	return __bnxt_tc_del_flow(bp, flow_node);
+}
+
+static int bnxt_tc_get_flow_stats(struct bnxt *bp,
+				  struct tc_cls_flower_offload *tc_flow_cmd)
+{
+	return 0;
+}
+
+int bnxt_tc_setup_flower(struct bnxt *bp, u16 src_fid,
+			 struct tc_cls_flower_offload *cls_flower)
+{
+	int rc = 0;
+
+	switch (cls_flower->command) {
+	case TC_CLSFLOWER_REPLACE:
+		rc = bnxt_tc_add_flow(bp, src_fid, cls_flower);
+		break;
+
+	case TC_CLSFLOWER_DESTROY:
+		rc = bnxt_tc_del_flow(bp, cls_flower);
+		break;
+
+	case TC_CLSFLOWER_STATS:
+		rc = bnxt_tc_get_flow_stats(bp, cls_flower);
+		break;
+	}
+	return rc;
+}
+
+static const struct rhashtable_params bnxt_tc_flow_ht_params = {
+	.head_offset = offsetof(struct bnxt_tc_flow_node, node),
+	.key_offset = offsetof(struct bnxt_tc_flow_node, cookie),
+	.key_len = sizeof(((struct bnxt_tc_flow_node *)0)->cookie),
+	.automatic_shrinking = true
+};
+
+static const struct rhashtable_params bnxt_tc_l2_ht_params = {
+	.head_offset = offsetof(struct bnxt_tc_l2_node, node),
+	.key_offset = offsetof(struct bnxt_tc_l2_node, key),
+	.key_len = BNXT_TC_L2_KEY_LEN,
+	.automatic_shrinking = true
+};
+
+/* convert counter width in bits to a mask */
+#define mask(width)		((u64)~0 >> (64 - (width)))
+
+int bnxt_init_tc(struct bnxt *bp)
+{
+	struct bnxt_tc_info *tc_info = &bp->tc_info;
+	int rc;
+
+	if (bp->hwrm_spec_code < 0x10800) {
+		netdev_warn(bp->dev,
+			    "Firmware does not support TC flower offload.\n");
+		return -ENOTSUPP;
+	}
+	mutex_init(&tc_info->lock);
+
+	/* Counter widths are programmed by FW */
+	tc_info->bytes_mask = mask(36);
+	tc_info->packets_mask = mask(28);
+
+	tc_info->flow_ht_params = bnxt_tc_flow_ht_params;
+	rc = rhashtable_init(&tc_info->flow_table, &tc_info->flow_ht_params);
+	if (rc)
+		return rc;
+
+	tc_info->l2_ht_params = bnxt_tc_l2_ht_params;
+	rc = rhashtable_init(&tc_info->l2_table, &tc_info->l2_ht_params);
+	if (rc)
+		goto destroy_flow_table;
+
+	tc_info->enabled = true;
+	bp->dev->hw_features |= NETIF_F_HW_TC;
+	bp->dev->features |= NETIF_F_HW_TC;
+	return 0;
+
+destroy_flow_table:
+	rhashtable_destroy(&tc_info->flow_table);
+	return rc;
+}
+
+void bnxt_shutdown_tc(struct bnxt *bp)
+{
+	struct bnxt_tc_info *tc_info = &bp->tc_info;
+
+	if (!tc_info->enabled)
+		return;
+
+	rhashtable_destroy(&tc_info->flow_table);
+	rhashtable_destroy(&tc_info->l2_table);
+}
+
+#else
+#endif
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.h b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.h
new file mode 100644
index 0000000..6c4c1ed
--- /dev/null
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.h
@@ -0,0 +1,158 @@
+/* Broadcom NetXtreme-C/E network driver.
+ *
+ * Copyright (c) 2017 Broadcom Limited
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation.
+ */
+
+#ifndef BNXT_TC_H
+#define BNXT_TC_H
+
+#ifdef CONFIG_BNXT_FLOWER_OFFLOAD
+
+/* Structs used for storing the filter/actions of the TC cmd.
+ */
+struct bnxt_tc_l2_key {
+	u8		dmac[ETH_ALEN];
+	u8		smac[ETH_ALEN];
+	__be16		inner_vlan_tpid;
+	__be16		inner_vlan_tci;
+	__be16		ether_type;
+	u8		num_vlans;
+};
+
+struct bnxt_tc_l3_key {
+	union {
+		struct {
+			struct in_addr daddr;
+			struct in_addr saddr;
+		} ipv4;
+		struct {
+			struct in6_addr daddr;
+			struct in6_addr saddr;
+		} ipv6;
+	};
+};
+
+struct bnxt_tc_l4_key {
+	u8  ip_proto;
+	union {
+		struct {
+			__be16 sport;
+			__be16 dport;
+		} ports;
+		struct {
+			u8 type;
+			u8 code;
+		} icmp;
+	};
+};
+
+struct bnxt_tc_actions {
+	u32				flags;
+#define BNXT_TC_ACTION_FLAG_FWD			BIT(0)
+#define BNXT_TC_ACTION_FLAG_FWD_VXLAN		BIT(1)
+#define BNXT_TC_ACTION_FLAG_PUSH_VLAN		BIT(3)
+#define BNXT_TC_ACTION_FLAG_POP_VLAN		BIT(4)
+#define BNXT_TC_ACTION_FLAG_DROP		BIT(5)
+
+	u16				dst_fid;
+	struct net_device		*dst_dev;
+	__be16				push_vlan_tpid;
+	__be16				push_vlan_tci;
+};
+
+struct bnxt_tc_flow_stats {
+	u64		packets;
+	u64		bytes;
+};
+
+struct bnxt_tc_flow {
+	u32				flags;
+#define BNXT_TC_FLOW_FLAGS_ETH_ADDRS		BIT(1)
+#define BNXT_TC_FLOW_FLAGS_IPV4_ADDRS		BIT(2)
+#define BNXT_TC_FLOW_FLAGS_IPV6_ADDRS		BIT(3)
+#define BNXT_TC_FLOW_FLAGS_PORTS		BIT(4)
+#define BNXT_TC_FLOW_FLAGS_ICMP			BIT(5)
+
+	/* flow applicable to pkts ingressing on this fid */
+	u16				src_fid;
+	struct bnxt_tc_l2_key		l2_key;
+	struct bnxt_tc_l2_key		l2_mask;
+	struct bnxt_tc_l3_key		l3_key;
+	struct bnxt_tc_l3_key		l3_mask;
+	struct bnxt_tc_l4_key		l4_key;
+	struct bnxt_tc_l4_key		l4_mask;
+
+	struct bnxt_tc_actions		actions;
+
+	/* updated stats accounting for hw-counter wrap-around */
+	struct bnxt_tc_flow_stats	stats;
+	/* previous snap-shot of stats */
+	struct bnxt_tc_flow_stats	prev_stats;
+	unsigned long			lastused; /* jiffies */
+};
+
+/* L2 hash table
+ * This data-struct is used for L2-flow table.
+ * The L2 part of a flow is stored in a hash table.
+ * A flow that shares the same L2 key/mask with an
+ * already existing flow must refer to it's flow handle.
+ */
+struct bnxt_tc_l2_node {
+	/* hash key: first 16b of key */
+#define BNXT_TC_L2_KEY_LEN			16
+	struct bnxt_tc_l2_key	key;
+	struct rhash_head	node;
+
+	/* a linked list of flows that share the same l2 key */
+	struct list_head	common_l2_flows;
+
+	/* number of flows sharing the l2 key */
+	u16			refcount;
+
+	struct rcu_head		rcu;
+};
+
+struct bnxt_tc_flow_node {
+	/* hash key: provided by TC */
+	unsigned long			cookie;
+	struct rhash_head		node;
+
+	struct bnxt_tc_flow		flow;
+
+	__le16				flow_handle;
+
+	/* L2 node in l2 hashtable that shares flow's l2 key */
+	struct bnxt_tc_l2_node		*l2_node;
+	/* for the shared_flows list maintained in l2_node */
+	struct list_head		l2_list_node;
+
+	struct rcu_head			rcu;
+};
+
+int bnxt_tc_setup_flower(struct bnxt *bp, u16 src_fid,
+			 struct tc_cls_flower_offload *cls_flower);
+int bnxt_init_tc(struct bnxt *bp);
+void bnxt_shutdown_tc(struct bnxt *bp);
+
+#else /* CONFIG_BNXT_FLOWER_OFFLOAD */
+
+static inline int bnxt_tc_setup_flower(struct bnxt *bp, u16 src_fid,
+				       struct tc_cls_flower_offload *cls_flower)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int bnxt_init_tc(struct bnxt *bp)
+{
+	return 0;
+}
+
+static inline void bnxt_shutdown_tc(struct bnxt *bp)
+{
+}
+#endif /* CONFIG_BNXT_FLOWER_OFFLOAD */
+#endif /* BNXT_TC_H */
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c
index c365d3c..e75db04 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c
@@ -11,10 +11,12 @@
 #include <linux/etherdevice.h>
 #include <linux/rtnetlink.h>
 #include <linux/jhash.h>
+#include <net/pkt_cls.h>
 
 #include "bnxt_hsi.h"
 #include "bnxt.h"
 #include "bnxt_vfr.h"
+#include "bnxt_tc.h"
 
 #ifdef CONFIG_BNXT_SRIOV
 
@@ -113,6 +115,21 @@ static netdev_tx_t bnxt_vf_rep_xmit(struct sk_buff *skb,
 	stats->tx_bytes = vf_rep->tx_stats.bytes;
 }
 
+static int bnxt_vf_rep_setup_tc(struct net_device *dev, enum tc_setup_type type,
+				void *type_data)
+{
+	struct bnxt_vf_rep *vf_rep = netdev_priv(dev);
+	struct bnxt *bp = vf_rep->bp;
+	int vf_fid = bp->pf.vf[vf_rep->vf_idx].fw_fid;
+
+	switch (type) {
+	case TC_SETUP_CLSFLOWER:
+		return bnxt_tc_setup_flower(bp, vf_fid, type_data);
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
 struct net_device *bnxt_get_vf_rep(struct bnxt *bp, u16 cfa_code)
 {
 	u16 vf_idx;
@@ -182,6 +199,7 @@ static int bnxt_vf_rep_port_attr_get(struct net_device *dev,
 	.ndo_stop		= bnxt_vf_rep_close,
 	.ndo_start_xmit		= bnxt_vf_rep_xmit,
 	.ndo_get_stats64	= bnxt_vf_rep_get_stats64,
+	.ndo_setup_tc		= bnxt_vf_rep_setup_tc,
 	.ndo_get_phys_port_name = bnxt_vf_rep_get_phys_port_name
 };
 
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.h b/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.h
index 3e997c9..d8b5f89 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.h
@@ -45,6 +45,14 @@ static inline void bnxt_link_bp_to_dl(struct bnxt *bp, struct devlink *dl)
 void bnxt_vf_rep_rx(struct bnxt *bp, struct sk_buff *skb);
 struct net_device *bnxt_get_vf_rep(struct bnxt *bp, u16 cfa_code);
 
+static inline u16 bnxt_vf_rep_get_fid(struct net_device *dev)
+{
+	struct bnxt_vf_rep *vf_rep = netdev_priv(dev);
+	struct bnxt *bp = vf_rep->bp;
+
+	return bp->pf.vf[vf_rep->vf_idx].fw_fid;
+}
+
 #else
 
 static inline int bnxt_dl_register(struct bnxt *bp)
-- 
1.8.3.1

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox