Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] net/mlx5e: Fix zero table prio set by user.
From: Saeed Mahameed @ 2019-07-25 21:22 UTC (permalink / raw)
  To: wenxu@ucloud.cn, Roi Dayan, Or Gerlitz, Mark Bloch, Paul Blakey
  Cc: netdev@vger.kernel.org
In-Reply-To: <1564053847-28756-1-git-send-email-wenxu@ucloud.cn>

On Thu, 2019-07-25 at 19:24 +0800, wenxu@ucloud.cn wrote:
> From: wenxu <wenxu@ucloud.cn>
> 
> The flow_cls_common_offload prio is zero
> 
> It leads the invalid table prio in hw.
> 
> Error: Could not process rule: Invalid argument
> 
> kernel log:
> mlx5_core 0000:81:00.0: E-Switch: Failed to create FDB Table err -22
> (table prio: 65535, level: 0, size: 4194304)
> 
> table_prio = (chain * FDB_MAX_PRIO) + prio - 1;
> should check (chain * FDB_MAX_PRIO) + prio is not 0
> 
> Signed-off-by: wenxu <wenxu@ucloud.cn>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git
> a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> index 089ae4d..64ca90f 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> @@ -970,7 +970,9 @@ static int esw_add_fdb_miss_rule(struct 

this piece of code isn't in this function, weird how it got to the
diff, patch applies correctly though !

> mlx5_eswitch *esw)
>  		flags |= (MLX5_FLOW_TABLE_TUNNEL_EN_REFORMAT |
>  			  MLX5_FLOW_TABLE_TUNNEL_EN_DECAP);
>  
> -	table_prio = (chain * FDB_MAX_PRIO) + prio - 1;
> +	table_prio = (chain * FDB_MAX_PRIO) + prio;
> +	if (table_prio)
> +		table_prio = table_prio - 1;
>  

This is black magic, even before this fix.
this -1 seems to be needed in order to call
create_next_size_table(table_prio) with the previous "table prio" ?
(table_prio - 1)  ?

The whole thing looks wrong to me since when prio is 0 and chain is 0,
there is not such thing table_prio - 1.

mlnx eswitch guys in the cc, please advise.

Thanks,
Saeed.

>  	/* create earlier levels for correct fs_core lookup when
>  	 * connecting tables

^ permalink raw reply

* Re: [PATCH net-next 2/2] mlx4/en_netdev: call notifiers when hw_enc_features change
From: Saeed Mahameed @ 2019-07-25 21:27 UTC (permalink / raw)
  To: dcaratti@redhat.com, davem@davemloft.net, Tariq Toukan,
	netdev@vger.kernel.org
  Cc: Eran Ben Elisha
In-Reply-To: <73cd7a2a29db5a32d669273d367566cdf6652f4e.camel@redhat.com>

On Thu, 2019-07-25 at 14:25 +0200, Davide Caratti wrote:
> On Wed, 2019-07-24 at 20:47 +0000, Saeed Mahameed wrote:
> > On Wed, 2019-07-24 at 16:02 +0200, Davide Caratti wrote:
> > > ensure to call netdev_features_change() when the driver flips its
> > > hw_enc_features bits.
> > > 
> > > Signed-off-by: Davide Caratti <dcaratti@redhat.com>
> > 
> > The patch is correct, 
> 
> hello Saeed, and thanks for looking at this!
> 
> > but can you explain how did you come to this ? 
> > did you encounter any issue with the current code ?
> > 
> > I am asking just because i think the whole dynamic changing of dev-
> > > hw_enc_features is redundant since mlx4 has the featutres_check
> > callback.
> 
> we need it to ensure that vlan_transfer_features() updates
> the (new) value of hw_enc_features in the overlying vlan: otherwise,
> segmentation will happen anyway when skb passes from vxlan to vlan,
> if the
> vxlan is added after the vlan device has been created (see:
> 7dad9937e064
> ("net: vlan: add support for tunnel offload") ).
> 

but in previous patch you made sure that the vlan always sees the
correct hw_enc_features on driver load, we don't need to have this
dynamic update mechanism, features_check ndo should take care of
protocols we don't support.

> thanks!

^ permalink raw reply

* Re: [PATCH net-next 06/11] net: hns3: modify firmware version display format
From: Saeed Mahameed @ 2019-07-25 21:32 UTC (permalink / raw)
  To: tanhuazhong@huawei.com, davem@davemloft.net
  Cc: lipeng321@huawei.com, yisen.zhuang@huawei.com,
	salil.mehta@huawei.com, linuxarm@huawei.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	moyufeng@huawei.com
In-Reply-To: <95783289-9b3b-f085-876b-49815b07d595@huawei.com>

On Thu, 2019-07-25 at 10:34 +0800, tanhuazhong wrote:
> 
> On 2019/7/25 2:34, Saeed Mahameed wrote:
> > On Wed, 2019-07-24 at 11:18 +0800, Huazhong Tan wrote:
> > > From: Yufeng Mo <moyufeng@huawei.com>
> > > 
> > > This patch modifies firmware version display format in
> > > hclge(vf)_cmd_init() and hns3_get_drvinfo(). Also, adds
> > > some optimizations for firmware version display format.
> > > 
> > > Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
> > > Signed-off-by: Peng Li <lipeng321@huawei.com>
> > > Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
> > > ---
> > >   drivers/net/ethernet/hisilicon/hns3/hnae3.h              |  9
> > > +++++++++
> > >   drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c       | 15
> > > +++++++++++++--
> > >   drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c   | 10
> > > +++++++++-
> > >   drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.c | 11
> > > +++++++++--
> > >   4 files changed, 40 insertions(+), 5 deletions(-)
> > > 
> > > 

[...]

> > > --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c
> > > +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c
> > > @@ -419,7 +419,15 @@ int hclge_cmd_init(struct hclge_dev *hdev)
> > >   	}
> > >   	hdev->fw_version = version;
> > >   
> > > -	dev_info(&hdev->pdev->dev, "The firmware version is %08x\n",
> > > version);
> > > +	pr_info_once("The firmware version is %lu.%lu.%lu.%lu\n",
> > > +		     hnae3_get_field(version,
> > > HNAE3_FW_VERSION_BYTE3_MASK,
> > > +				     HNAE3_FW_VERSION_BYTE3_SHIFT),
> > > +		     hnae3_get_field(version,
> > > HNAE3_FW_VERSION_BYTE2_MASK,
> > > +				     HNAE3_FW_VERSION_BYTE2_SHIFT),
> > > +		     hnae3_get_field(version,
> > > HNAE3_FW_VERSION_BYTE1_MASK,
> > > +				     HNAE3_FW_VERSION_BYTE1_SHIFT),
> > > +		     hnae3_get_field(version,
> > > HNAE3_FW_VERSION_BYTE0_MASK,
> > > +				     HNAE3_FW_VERSION_BYTE0_SHIFT));
> > >   
> > 
> > Device name/string will not be printed now, what happens if i have
> > multiple devices ? at least print the device name as it was before
> > 
> Since on each board we only have one firmware, the firmware
> version is same per device, and will not change when running.
> So pr_info_once() looks good for this case.
> 

boards change too often to have such static assumption.

> BTW, maybe we should change below print in the end of
> hclge_init_ae_dev(), use dev_info() instead of pr_info(),
> then we can know that which device has already initialized.
> I will send other patch to do that, is it acceptable for you?
> 
> "pr_info("%s driver initialization finished.\n", HCLGE_DRIVER_NAME);"
> 

I would avoid using pr_info when i can ! if you have the option to
print with dev information as it was before that is preferable.

Thanks,
Saeed.


^ permalink raw reply

* Re: [PATCH net-next 1/2] net sched: update skbedit action for batched events operations
From: Roman Mashak @ 2019-07-25 21:39 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, kernel, jhs, xiyou.wangcong, jiri
In-Reply-To: <20190713.192344.1454771658469437265.davem@davemloft.net>

David Miller <davem@davemloft.net> writes:

> From: Roman Mashak <mrv@mojatatu.com>
> Date: Fri, 12 Jul 2019 18:21:59 -0400
>
>> Add get_fill_size() routine used to calculate the action size
>> when building a batch of events.
>> 
>> Signed-off-by: Roman Mashak <mrv@mojatatu.com>
>
> Please add an appropriate Fixes: tag, and also when you respin this provide the
> required "[PATCH 0/N] " header posting explaining what the series is doing at
> a high level, how it is doing it, and why it is doing it that way.

Thanks for feedback, I've sent v2 of this patchset.

^ permalink raw reply

* Re: general protection fault in tls_trim_both_msgs
From: Jakub Kicinski @ 2019-07-25 21:39 UTC (permalink / raw)
  To: syzbot
  Cc: ast, aviadye, borisp, bpf, daniel, davejwatson, davem,
	john.fastabend, kafai, linux-kernel, netdev, songliubraving,
	syzkaller-bugs, yhs
In-Reply-To: <0000000000002b4896058e7abf78@google.com>

On Wed, 24 Jul 2019 22:32:07 -0700, syzbot wrote:
> Hello,
> 
> syzbot found the following crash on:
> 
> HEAD commit:    9e6dfe80 Add linux-next specific files for 20190724
> git tree:       linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=1046971fa00000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=6cbb8fc2cf2842d7
> dashboard link: https://syzkaller.appspot.com/bug?extid=0e0fedcad708d12d3032
> compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
> 
> Unfortunately, I don't have any reproducer for this crash yet.

Looks very like the issue we mentioned in the cover letter for unhash
fixes. TX is waiting for mem, the connection dies, we free ctx, TX
wakes up with a now stale ctx pointer. I'm testing a fix for this,
Netronome team was actually able to trigger a NULL-deref on the RX
side, because there ctx is reloaded but not NULL-checked.

> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+0e0fedcad708d12d3032@syzkaller.appspotmail.com
> 
> kasan: CONFIG_KASAN_INLINE enabled
> kasan: GPF could be caused by NULL-ptr deref or user memory access
> general protection fault: 0000 [#1] PREEMPT SMP KASAN
> CPU: 1 PID: 15517 Comm: syz-executor.4 Not tainted 5.3.0-rc1-next-20190724  
> #50
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
> Google 01/01/2011
> RIP: 0010:tls_trim_both_msgs+0x54/0x130 net/tls/tls_sw.c:268
> Code: 48 c1 ea 03 80 3c 02 00 0f 85 e3 00 00 00 4d 8b b5 b0 06 00 00 48 b8  
> 00 00 00 00 00 fc ff df 49 8d 7e 28 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f  
> 85 b3 00 00 00 48 b8 00 00 00 00 00 fc ff df 49 8b
> RSP: 0018:ffff8880612cfac0 EFLAGS: 00010206
> RAX: dffffc0000000000 RBX: ffff8880a8794340 RCX: ffffc9000e7b9000
> RDX: 0000000000000005 RSI: ffffffff86298656 RDI: 0000000000000028
> RBP: ffff8880612cfae0 R08: ffff88805ae4c580 R09: fffffbfff14a8155
> R10: fffffbfff14a8154 R11: ffffffff8a540aa7 R12: 0000000000000000
> R13: ffff888061d82e00 R14: 0000000000000000 R15: 00000000ffffffe0
> FS:  00007f7d33516700(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000001b2fa2f000 CR3: 000000009fcf1000 CR4: 00000000001406e0
> Call Trace:
>   tls_sw_sendmsg+0xe38/0x17b0 net/tls/tls_sw.c:1057
>   inet6_sendmsg+0x9e/0xe0 net/ipv6/af_inet6.c:576
>   sock_sendmsg_nosec net/socket.c:637 [inline]
>   sock_sendmsg+0xd7/0x130 net/socket.c:657
>   __sys_sendto+0x262/0x380 net/socket.c:1952
>   __do_sys_sendto net/socket.c:1964 [inline]
>   __se_sys_sendto net/socket.c:1960 [inline]
>   __x64_sys_sendto+0xe1/0x1a0 net/socket.c:1960
>   do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x459829
> Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7  
> 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff  
> ff 0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
> RSP: 002b:00007f7d33515c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
> RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 0000000000459829
> RDX: ffffffffffffffc1 RSI: 00000000200005c0 RDI: 0000000000000003
> RBP: 000000000075bf20 R08: 0000000000000000 R09: 1201000000003618
> R10: 0000000000000000 R11: 0000000000000246 R12: 00007f7d335166d4
> R13: 00000000004c7669 R14: 00000000004dcc70 R15: 00000000ffffffff
> Modules linked in:
> ---[ end trace 2dd728cceb39a185 ]---
> RIP: 0010:tls_trim_both_msgs+0x54/0x130 net/tls/tls_sw.c:268
> Code: 48 c1 ea 03 80 3c 02 00 0f 85 e3 00 00 00 4d 8b b5 b0 06 00 00 48 b8  
> 00 00 00 00 00 fc ff df 49 8d 7e 28 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f  
> 85 b3 00 00 00 48 b8 00 00 00 00 00 fc ff df 49 8b
> RSP: 0018:ffff8880612cfac0 EFLAGS: 00010206
> RAX: dffffc0000000000 RBX: ffff8880a8794340 RCX: ffffc9000e7b9000
> RDX: 0000000000000005 RSI: ffffffff86298656 RDI: 0000000000000028
> RBP: ffff8880612cfae0 R08: ffff88805ae4c580 R09: fffffbfff14a8155
> R10: fffffbfff14a8154 R11: ffffffff8a540aa7 R12: 0000000000000000
> R13: ffff888061d82e00 R14: 0000000000000000 R15: 00000000ffffffe0
> FS:  00007f7d33516700(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00000000019dbe80 CR3: 000000009fcf1000 CR4: 00000000001406e0
> 
> 
> ---
> This bug is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkaller@googlegroups.com.
> 
> syzbot will keep track of this bug report. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.


^ permalink raw reply

* Re: [PATCH net-next 07/11] net: hns3: adds debug messages to identify eth down cause
From: Saeed Mahameed @ 2019-07-25 21:59 UTC (permalink / raw)
  To: tanhuazhong@huawei.com, davem@davemloft.net,
	liuyonglong@huawei.com
  Cc: lipeng321@huawei.com, yisen.zhuang@huawei.com,
	salil.mehta@huawei.com, linuxarm@huawei.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <30483e38-5e4a-0111-f431-4742ceb1aa62@huawei.com>

On Thu, 2019-07-25 at 20:28 +0800, liuyonglong wrote:
> 
> On 2019/7/25 3:12, Saeed Mahameed wrote:
> > On Wed, 2019-07-24 at 11:18 +0800, Huazhong Tan wrote:
> > > From: Yonglong Liu <liuyonglong@huawei.com>
> > > 
> > > Some times just see the eth interface have been down/up via
> > > dmesg, but can not know why the eth down. So adds some debug
> > > messages to identify the cause for this.
> > > 
> > 
> > I really don't like this. your default msg lvl has NETIF_MSG_IFDOWN
> > turned on .. dumping every single operation that happens on your
> > device
> > by default to kernel log is too much ! 
> > 
> > We should really consider using trace buffers with well defined
> > structures for vendor specific events. so we can use bpf filters
> > and
> > state of the art tools for netdev debugging.
> > 
> 
> We do this because we can just see a link down message in dmesg, and
> had
> take a long time to found the cause of link down, just because
> another
> user changed the settings.
> 
> We can change the net_open/net_stop/dcbnl_ops to msg_drv (not default
> turned on),  and want to keep the others default print to kernel log,
> is it acceptable?
> 

acceptable as long as debug information are kept off by default and
your driver doens't spam the kernel log.

you should use dynamic debug [1] and/or "off by default" msg lvls for
debugging information..

I couldn't find any rules regarding what to put in kernel log, Maybe
someone can share ?. but i vaguely remember that the recommendation
for device drivers is to put nothing, only error/warning messages.

[1] 
https://www.kernel.org/doc/html/v4.15/admin-guide/dynamic-debug-howto.html

> > > @@ -1593,6 +1603,11 @@ static int hns3_ndo_set_vf_vlan(struct
> > > net_device *netdev, int vf, u16 vlan,
> > >  	struct hnae3_handle *h = hns3_get_handle(netdev);
> > >  	int ret = -EIO;
> > >  
> > > +	if (netif_msg_ifdown(h))
> > 
> > why msg_ifdown ? looks like netif_msg_drv is more appropriate, for
> > many
> > of the cases in this patch.
> > 
> 
> This operation may cause link down, so we use msg_ifdown.
> 

ifdown isn't link down.. 

to be honest, I couldn't find any documentation explaining how/when to
use msg lvls, (i didn't look too deep though), by looking at other
drivers, my interpretations is:

ifdup (open/boot up flow)
ifdwon (close/teardown flow)
drv (driver based or dynamic flows) 
etc .. 

-Saeed.

^ permalink raw reply

* [PATCH bpf-next v3 0/7] bpf/flow_dissector: support input flags
From: Stanislav Fomichev @ 2019-07-25 22:52 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, Stanislav Fomichev, Song Liu,
	Willem de Bruijn, Petar Penkov

C flow dissector supports input flags that tell it to customize parsing
by either stopping early or trying to parse as deep as possible.
BPF flow dissector always parses as deep as possible which is sub-optimal.
Pass input flags to the BPF flow dissector as well so it can make the same
decisions.

Series outline:
* remove unused FLOW_DISSECTOR_F_STOP_AT_L3 flag
* export FLOW_DISSECTOR_F_XXX flags as uapi and pass them to BPF
  flow dissector
* add documentation for the export flags
* support input flags in BPF_PROG_TEST_RUN via ctx_{in,out}
* sync uapi to tools
* support FLOW_DISSECTOR_F_PARSE_1ST_FRAG in selftest
* support FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL in kernel and selftest
* support FLOW_DISSECTOR_F_STOP_AT_ENCAP in selftest

Pros:
* makes BPF flow dissector faster by avoiding burning extra cycles
* existing BPF progs continue to work by ignoring the flags and always
  parsing as deep as possible

Cons:
* new UAPI which we need to support (OTOH, if we need to deprecate some
  flags, we can just stop setting them upon calling BPF programs)

Some numbers (with .repeat = 4000000 in test_flow_dissector):
        test_flow_dissector:PASS:ipv4-frag 35 nsec
        test_flow_dissector:PASS:ipv4-frag 35 nsec
        test_flow_dissector:PASS:ipv4-no-frag 32 nsec
        test_flow_dissector:PASS:ipv4-no-frag 32 nsec

        test_flow_dissector:PASS:ipv6-frag 39 nsec
        test_flow_dissector:PASS:ipv6-frag 39 nsec
        test_flow_dissector:PASS:ipv6-no-frag 36 nsec
        test_flow_dissector:PASS:ipv6-no-frag 36 nsec

        test_flow_dissector:PASS:ipv6-flow-label 36 nsec
        test_flow_dissector:PASS:ipv6-flow-label 36 nsec
        test_flow_dissector:PASS:ipv6-no-flow-label 33 nsec
        test_flow_dissector:PASS:ipv6-no-flow-label 33 nsec

        test_flow_dissector:PASS:ipip-encap 38 nsec
        test_flow_dissector:PASS:ipip-encap 38 nsec
        test_flow_dissector:PASS:ipip-no-encap 32 nsec
        test_flow_dissector:PASS:ipip-no-encap 32 nsec

The improvement is around 10%, but it's in a tight cache-hot
BPF_PROG_TEST_RUN loop.

Cc: Song Liu <songliubraving@fb.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>

Stanislav Fomichev (7):
  bpf/flow_dissector: pass input flags to BPF flow dissector program
  bpf/flow_dissector: document flags
  bpf/flow_dissector: support flags in BPF_PROG_TEST_RUN
  tools/bpf: sync bpf_flow_keys flags
  selftests/bpf: support BPF_FLOW_DISSECTOR_F_PARSE_1ST_FRAG
  bpf/flow_dissector: support ipv6 flow_label and
    BPF_FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL
  selftests/bpf: support BPF_FLOW_DISSECTOR_F_STOP_AT_ENCAP

 Documentation/bpf/prog_flow_dissector.rst     |  18 ++
 include/linux/skbuff.h                        |   2 +-
 include/uapi/linux/bpf.h                      |   6 +
 net/bpf/test_run.c                            |  39 ++-
 net/core/flow_dissector.c                     |  21 +-
 tools/include/uapi/linux/bpf.h                |   6 +
 .../selftests/bpf/prog_tests/flow_dissector.c | 243 +++++++++++++++++-
 tools/testing/selftests/bpf/progs/bpf_flow.c  |  47 +++-
 8 files changed, 368 insertions(+), 14 deletions(-)

-- 
2.22.0.709.g102302147b-goog

^ permalink raw reply

* [PATCH bpf-next v3 1/7] bpf/flow_dissector: pass input flags to BPF flow dissector program
From: Stanislav Fomichev @ 2019-07-25 22:52 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, Stanislav Fomichev, Petar Penkov,
	Willem de Bruijn, Song Liu
In-Reply-To: <20190725225231.195090-1-sdf@google.com>

C flow dissector supports input flags that tell it to customize parsing
by either stopping early or trying to parse as deep as possible. Pass
those flags to the BPF flow dissector so it can make the same
decisions. In the next commits I'll add support for those flags to
our reference bpf_flow.c

v3:
* Export copy of flow dissector flags instead of moving (Alexei Starovoitov)

Acked-by: Petar Penkov <ppenkov@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/linux/skbuff.h    |  2 +-
 include/uapi/linux/bpf.h  |  5 +++++
 net/bpf/test_run.c        |  2 +-
 net/core/flow_dissector.c | 12 ++++++++++--
 4 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 718742b1c505..9b7a8038beec 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1271,7 +1271,7 @@ static inline int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr)
 
 struct bpf_flow_dissector;
 bool bpf_flow_dissect(struct bpf_prog *prog, struct bpf_flow_dissector *ctx,
-		      __be16 proto, int nhoff, int hlen);
+		      __be16 proto, int nhoff, int hlen, unsigned int flags);
 
 bool __skb_flow_dissect(const struct net *net,
 			const struct sk_buff *skb,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index fa1c753dcdbc..88b9d743036f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3507,6 +3507,10 @@ enum bpf_task_fd_type {
 	BPF_FD_TYPE_URETPROBE,		/* filename + offset */
 };
 
+#define BPF_FLOW_DISSECTOR_F_PARSE_1ST_FRAG		(1U << 0)
+#define BPF_FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL		(1U << 1)
+#define BPF_FLOW_DISSECTOR_F_STOP_AT_ENCAP		(1U << 2)
+
 struct bpf_flow_keys {
 	__u16	nhoff;
 	__u16	thoff;
@@ -3528,6 +3532,7 @@ struct bpf_flow_keys {
 			__u32	ipv6_dst[4];	/* in6_addr; network order */
 		};
 	};
+	__u32	flags;
 };
 
 struct bpf_func_info {
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 80e6f3a6864d..4e41d15a1098 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -419,7 +419,7 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 	time_start = ktime_get_ns();
 	for (i = 0; i < repeat; i++) {
 		retval = bpf_flow_dissect(prog, &ctx, eth->h_proto, ETH_HLEN,
-					  size);
+					  size, 0);
 
 		if (signal_pending(current)) {
 			preempt_enable();
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 3e6fedb57bc1..50ed1a688709 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -784,7 +784,7 @@ static void __skb_flow_bpf_to_target(const struct bpf_flow_keys *flow_keys,
 }
 
 bool bpf_flow_dissect(struct bpf_prog *prog, struct bpf_flow_dissector *ctx,
-		      __be16 proto, int nhoff, int hlen)
+		      __be16 proto, int nhoff, int hlen, unsigned int flags)
 {
 	struct bpf_flow_keys *flow_keys = ctx->flow_keys;
 	u32 result;
@@ -795,6 +795,14 @@ bool bpf_flow_dissect(struct bpf_prog *prog, struct bpf_flow_dissector *ctx,
 	flow_keys->nhoff = nhoff;
 	flow_keys->thoff = flow_keys->nhoff;
 
+	BUILD_BUG_ON((int)BPF_FLOW_DISSECTOR_F_PARSE_1ST_FRAG !=
+		     (int)FLOW_DISSECTOR_F_PARSE_1ST_FRAG);
+	BUILD_BUG_ON((int)BPF_FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL !=
+		     (int)FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
+	BUILD_BUG_ON((int)BPF_FLOW_DISSECTOR_F_STOP_AT_ENCAP !=
+		     (int)FLOW_DISSECTOR_F_STOP_AT_ENCAP);
+	flow_keys->flags = flags;
+
 	preempt_disable();
 	result = BPF_PROG_RUN(prog, ctx);
 	preempt_enable();
@@ -914,7 +922,7 @@ bool __skb_flow_dissect(const struct net *net,
 			}
 
 			ret = bpf_flow_dissect(attached, &ctx, n_proto, nhoff,
-					       hlen);
+					       hlen, flags);
 			__skb_flow_bpf_to_target(&flow_keys, flow_dissector,
 						 target_container);
 			rcu_read_unlock();
-- 
2.22.0.709.g102302147b-goog


^ permalink raw reply related

* [PATCH bpf-next v3 2/7] bpf/flow_dissector: document flags
From: Stanislav Fomichev @ 2019-07-25 22:52 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, Stanislav Fomichev, Petar Penkov,
	Willem de Bruijn, Song Liu
In-Reply-To: <20190725225231.195090-1-sdf@google.com>

Describe what each input flag does and who uses it.

Acked-by: Petar Penkov <ppenkov@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 Documentation/bpf/prog_flow_dissector.rst | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/Documentation/bpf/prog_flow_dissector.rst b/Documentation/bpf/prog_flow_dissector.rst
index ed343abe541e..a78bf036cadd 100644
--- a/Documentation/bpf/prog_flow_dissector.rst
+++ b/Documentation/bpf/prog_flow_dissector.rst
@@ -26,6 +26,7 @@ and output arguments.
   * ``nhoff`` - initial offset of the networking header
   * ``thoff`` - initial offset of the transport header, initialized to nhoff
   * ``n_proto`` - L3 protocol type, parsed out of L2 header
+  * ``flags`` - optional flags
 
 Flow dissector BPF program should fill out the rest of the ``struct
 bpf_flow_keys`` fields. Input arguments ``nhoff/thoff/n_proto`` should be
@@ -101,6 +102,23 @@ can be called for both cases and would have to be written carefully to
 handle both cases.
 
 
+Flags
+=====
+
+``flow_keys->flags`` might contain optional input flags that work as follows:
+
+* ``BPF_FLOW_DISSECTOR_F_PARSE_1ST_FRAG`` - tells BPF flow dissector to
+  continue parsing first fragment; the default expected behavior is that
+  flow dissector returns as soon as it finds out that the packet is fragmented;
+  used by ``eth_get_headlen`` to estimate length of all headers for GRO.
+* ``BPF_FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL`` - tells BPF flow dissector to
+  stop parsing as soon as it reaches IPv6 flow label; used by
+  ``___skb_get_hash`` and ``__skb_get_hash_symmetric`` to get flow hash.
+* ``BPF_FLOW_DISSECTOR_F_STOP_AT_ENCAP`` - tells BPF flow dissector to stop
+  parsing as soon as it reaches encapsulated headers; used by routing
+  infrastructure.
+
+
 Reference Implementation
 ========================
 
-- 
2.22.0.709.g102302147b-goog


^ permalink raw reply related

* [PATCH bpf-next v3 3/7] bpf/flow_dissector: support flags in BPF_PROG_TEST_RUN
From: Stanislav Fomichev @ 2019-07-25 22:52 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, Stanislav Fomichev, Petar Penkov,
	Willem de Bruijn, Song Liu
In-Reply-To: <20190725225231.195090-1-sdf@google.com>

This will allow us to write tests for those flags.

v2:
* Swap kfree(data) and kfree(user_ctx) (Song Liu)

Acked-by: Petar Penkov <ppenkov@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 net/bpf/test_run.c | 39 +++++++++++++++++++++++++++++++++++----
 1 file changed, 35 insertions(+), 4 deletions(-)

diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 4e41d15a1098..1153bbcdff72 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -377,6 +377,22 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
 	return ret;
 }
 
+static int verify_user_bpf_flow_keys(struct bpf_flow_keys *ctx)
+{
+	/* make sure the fields we don't use are zeroed */
+	if (!range_is_zero(ctx, 0, offsetof(struct bpf_flow_keys, flags)))
+		return -EINVAL;
+
+	/* flags is allowed */
+
+	if (!range_is_zero(ctx, offsetof(struct bpf_flow_keys, flags) +
+			   FIELD_SIZEOF(struct bpf_flow_keys, flags),
+			   sizeof(struct bpf_flow_keys)))
+		return -EINVAL;
+
+	return 0;
+}
+
 int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 				     const union bpf_attr *kattr,
 				     union bpf_attr __user *uattr)
@@ -384,9 +400,11 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 	u32 size = kattr->test.data_size_in;
 	struct bpf_flow_dissector ctx = {};
 	u32 repeat = kattr->test.repeat;
+	struct bpf_flow_keys *user_ctx;
 	struct bpf_flow_keys flow_keys;
 	u64 time_start, time_spent = 0;
 	const struct ethhdr *eth;
+	unsigned int flags = 0;
 	u32 retval, duration;
 	void *data;
 	int ret;
@@ -395,9 +413,6 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 	if (prog->type != BPF_PROG_TYPE_FLOW_DISSECTOR)
 		return -EINVAL;
 
-	if (kattr->test.ctx_in || kattr->test.ctx_out)
-		return -EINVAL;
-
 	if (size < ETH_HLEN)
 		return -EINVAL;
 
@@ -410,6 +425,18 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 	if (!repeat)
 		repeat = 1;
 
+	user_ctx = bpf_ctx_init(kattr, sizeof(struct bpf_flow_keys));
+	if (IS_ERR(user_ctx)) {
+		kfree(data);
+		return PTR_ERR(user_ctx);
+	}
+	if (user_ctx) {
+		ret = verify_user_bpf_flow_keys(user_ctx);
+		if (ret)
+			goto out;
+		flags = user_ctx->flags;
+	}
+
 	ctx.flow_keys = &flow_keys;
 	ctx.data = data;
 	ctx.data_end = (__u8 *)data + size;
@@ -419,7 +446,7 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 	time_start = ktime_get_ns();
 	for (i = 0; i < repeat; i++) {
 		retval = bpf_flow_dissect(prog, &ctx, eth->h_proto, ETH_HLEN,
-					  size, 0);
+					  size, flags);
 
 		if (signal_pending(current)) {
 			preempt_enable();
@@ -450,8 +477,12 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 
 	ret = bpf_test_finish(kattr, uattr, &flow_keys, sizeof(flow_keys),
 			      retval, duration);
+	if (!ret)
+		ret = bpf_ctx_finish(kattr, uattr, user_ctx,
+				     sizeof(struct bpf_flow_keys));
 
 out:
+	kfree(user_ctx);
 	kfree(data);
 	return ret;
 }
-- 
2.22.0.709.g102302147b-goog


^ permalink raw reply related

* [PATCH bpf-next v3 4/7] tools/bpf: sync bpf_flow_keys flags
From: Stanislav Fomichev @ 2019-07-25 22:52 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, Stanislav Fomichev, Petar Penkov,
	Willem de Bruijn, Song Liu
In-Reply-To: <20190725225231.195090-1-sdf@google.com>

Export bpf_flow_keys flags to tools/libbpf/selftests.

Acked-by: Petar Penkov <ppenkov@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/include/uapi/linux/bpf.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4e455018da65..2e4b0848d795 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3504,6 +3504,10 @@ enum bpf_task_fd_type {
 	BPF_FD_TYPE_URETPROBE,		/* filename + offset */
 };
 
+#define BPF_FLOW_DISSECTOR_F_PARSE_1ST_FRAG		(1U << 0)
+#define BPF_FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL		(1U << 1)
+#define BPF_FLOW_DISSECTOR_F_STOP_AT_ENCAP		(1U << 2)
+
 struct bpf_flow_keys {
 	__u16	nhoff;
 	__u16	thoff;
@@ -3525,6 +3529,7 @@ struct bpf_flow_keys {
 			__u32	ipv6_dst[4];	/* in6_addr; network order */
 		};
 	};
+	__u32	flags;
 };
 
 struct bpf_func_info {
-- 
2.22.0.709.g102302147b-goog


^ permalink raw reply related

* [PATCH bpf-next v3 5/7] selftests/bpf: support BPF_FLOW_DISSECTOR_F_PARSE_1ST_FRAG
From: Stanislav Fomichev @ 2019-07-25 22:52 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, Stanislav Fomichev, Petar Penkov,
	Willem de Bruijn, Song Liu
In-Reply-To: <20190725225231.195090-1-sdf@google.com>

bpf_flow.c: exit early unless BPF_FLOW_DISSECTOR_F_PARSE_1ST_FRAG is
passed in flags. Also, set ip_proto earlier, this makes sure we have
correct value with fragmented packets.

Add selftest cases to test ipv4/ipv6 fragments and skip eth_get_headlen
tests that don't have BPF_FLOW_DISSECTOR_F_PARSE_1ST_FRAG flag.

eth_get_headlen calls flow dissector with
BPF_FLOW_DISSECTOR_F_PARSE_1ST_FRAG flag so we can't run tests that
have different set of input flags against it.

v2:
 * sefltests -> selftests (Willem de Bruijn)
 * Reword a comment about eth_get_headlen flags (Song Liu)

Acked-by: Petar Penkov <ppenkov@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 .../selftests/bpf/prog_tests/flow_dissector.c | 133 +++++++++++++++++-
 tools/testing/selftests/bpf/progs/bpf_flow.c  |  29 +++-
 2 files changed, 155 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/flow_dissector.c b/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
index c938283ac232..8e8c18aced9b 100644
--- a/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
+++ b/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
@@ -5,6 +5,10 @@
 #include <linux/if_tun.h>
 #include <sys/uio.h>
 
+#ifndef IP_MF
+#define IP_MF 0x2000
+#endif
+
 #define CHECK_FLOW_KEYS(desc, got, expected)				\
 	CHECK_ATTR(memcmp(&got, &expected, sizeof(got)) != 0,		\
 	      desc,							\
@@ -49,6 +53,18 @@ struct ipv6_pkt {
 	struct tcphdr tcp;
 } __packed;
 
+struct ipv6_frag_pkt {
+	struct ethhdr eth;
+	struct ipv6hdr iph;
+	struct frag_hdr {
+		__u8 nexthdr;
+		__u8 reserved;
+		__be16 frag_off;
+		__be32 identification;
+	} ipf;
+	struct tcphdr tcp;
+} __packed;
+
 struct dvlan_ipv6_pkt {
 	struct ethhdr eth;
 	__u16 vlan_tci;
@@ -65,9 +81,11 @@ struct test {
 		struct ipv4_pkt ipv4;
 		struct svlan_ipv4_pkt svlan_ipv4;
 		struct ipv6_pkt ipv6;
+		struct ipv6_frag_pkt ipv6_frag;
 		struct dvlan_ipv6_pkt dvlan_ipv6;
 	} pkt;
 	struct bpf_flow_keys keys;
+	__u32 flags;
 };
 
 #define VLAN_HLEN	4
@@ -143,6 +161,102 @@ struct test tests[] = {
 			.n_proto = __bpf_constant_htons(ETH_P_IPV6),
 		},
 	},
+	{
+		.name = "ipv4-frag",
+		.pkt.ipv4 = {
+			.eth.h_proto = __bpf_constant_htons(ETH_P_IP),
+			.iph.ihl = 5,
+			.iph.protocol = IPPROTO_TCP,
+			.iph.tot_len = __bpf_constant_htons(MAGIC_BYTES),
+			.iph.frag_off = __bpf_constant_htons(IP_MF),
+			.tcp.doff = 5,
+			.tcp.source = 80,
+			.tcp.dest = 8080,
+		},
+		.keys = {
+			.flags = BPF_FLOW_DISSECTOR_F_PARSE_1ST_FRAG,
+			.nhoff = ETH_HLEN,
+			.thoff = ETH_HLEN + sizeof(struct iphdr),
+			.addr_proto = ETH_P_IP,
+			.ip_proto = IPPROTO_TCP,
+			.n_proto = __bpf_constant_htons(ETH_P_IP),
+			.is_frag = true,
+			.is_first_frag = true,
+			.sport = 80,
+			.dport = 8080,
+		},
+		.flags = BPF_FLOW_DISSECTOR_F_PARSE_1ST_FRAG,
+	},
+	{
+		.name = "ipv4-no-frag",
+		.pkt.ipv4 = {
+			.eth.h_proto = __bpf_constant_htons(ETH_P_IP),
+			.iph.ihl = 5,
+			.iph.protocol = IPPROTO_TCP,
+			.iph.tot_len = __bpf_constant_htons(MAGIC_BYTES),
+			.iph.frag_off = __bpf_constant_htons(IP_MF),
+			.tcp.doff = 5,
+			.tcp.source = 80,
+			.tcp.dest = 8080,
+		},
+		.keys = {
+			.nhoff = ETH_HLEN,
+			.thoff = ETH_HLEN + sizeof(struct iphdr),
+			.addr_proto = ETH_P_IP,
+			.ip_proto = IPPROTO_TCP,
+			.n_proto = __bpf_constant_htons(ETH_P_IP),
+			.is_frag = true,
+			.is_first_frag = true,
+		},
+	},
+	{
+		.name = "ipv6-frag",
+		.pkt.ipv6_frag = {
+			.eth.h_proto = __bpf_constant_htons(ETH_P_IPV6),
+			.iph.nexthdr = IPPROTO_FRAGMENT,
+			.iph.payload_len = __bpf_constant_htons(MAGIC_BYTES),
+			.ipf.nexthdr = IPPROTO_TCP,
+			.tcp.doff = 5,
+			.tcp.source = 80,
+			.tcp.dest = 8080,
+		},
+		.keys = {
+			.flags = BPF_FLOW_DISSECTOR_F_PARSE_1ST_FRAG,
+			.nhoff = ETH_HLEN,
+			.thoff = ETH_HLEN + sizeof(struct ipv6hdr) +
+				sizeof(struct frag_hdr),
+			.addr_proto = ETH_P_IPV6,
+			.ip_proto = IPPROTO_TCP,
+			.n_proto = __bpf_constant_htons(ETH_P_IPV6),
+			.is_frag = true,
+			.is_first_frag = true,
+			.sport = 80,
+			.dport = 8080,
+		},
+		.flags = BPF_FLOW_DISSECTOR_F_PARSE_1ST_FRAG,
+	},
+	{
+		.name = "ipv6-no-frag",
+		.pkt.ipv6_frag = {
+			.eth.h_proto = __bpf_constant_htons(ETH_P_IPV6),
+			.iph.nexthdr = IPPROTO_FRAGMENT,
+			.iph.payload_len = __bpf_constant_htons(MAGIC_BYTES),
+			.ipf.nexthdr = IPPROTO_TCP,
+			.tcp.doff = 5,
+			.tcp.source = 80,
+			.tcp.dest = 8080,
+		},
+		.keys = {
+			.nhoff = ETH_HLEN,
+			.thoff = ETH_HLEN + sizeof(struct ipv6hdr) +
+				sizeof(struct frag_hdr),
+			.addr_proto = ETH_P_IPV6,
+			.ip_proto = IPPROTO_TCP,
+			.n_proto = __bpf_constant_htons(ETH_P_IPV6),
+			.is_frag = true,
+			.is_first_frag = true,
+		},
+	},
 };
 
 static int create_tap(const char *ifname)
@@ -225,6 +339,13 @@ void test_flow_dissector(void)
 			.data_size_in = sizeof(tests[i].pkt),
 			.data_out = &flow_keys,
 		};
+		static struct bpf_flow_keys ctx = {};
+
+		if (tests[i].flags) {
+			tattr.ctx_in = &ctx;
+			tattr.ctx_size_in = sizeof(ctx);
+			ctx.flags = tests[i].flags;
+		}
 
 		err = bpf_prog_test_run_xattr(&tattr);
 		CHECK_ATTR(tattr.data_size_out != sizeof(flow_keys) ||
@@ -251,10 +372,20 @@ void test_flow_dissector(void)
 	CHECK(err, "ifup", "err %d errno %d\n", err, errno);
 
 	for (i = 0; i < ARRAY_SIZE(tests); i++) {
-		struct bpf_flow_keys flow_keys = {};
+		/* Keep in sync with 'flags' from eth_get_headlen. */
+		__u32 eth_get_headlen_flags =
+			BPF_FLOW_DISSECTOR_F_PARSE_1ST_FRAG;
 		struct bpf_prog_test_run_attr tattr = {};
+		struct bpf_flow_keys flow_keys = {};
 		__u32 key = 0;
 
+		/* For skb-less case we can't pass input flags; run
+		 * only the tests that have a matching set of flags.
+		 */
+
+		if (tests[i].flags != eth_get_headlen_flags)
+			continue;
+
 		err = tx_tap(tap_fd, &tests[i].pkt, sizeof(tests[i].pkt));
 		CHECK(err < 0, "tx_tap", "err %d errno %d\n", err, errno);
 
diff --git a/tools/testing/selftests/bpf/progs/bpf_flow.c b/tools/testing/selftests/bpf/progs/bpf_flow.c
index 5ae485a6af3f..f931cd3db9d4 100644
--- a/tools/testing/selftests/bpf/progs/bpf_flow.c
+++ b/tools/testing/selftests/bpf/progs/bpf_flow.c
@@ -153,7 +153,6 @@ static __always_inline int parse_ip_proto(struct __sk_buff *skb, __u8 proto)
 	struct tcphdr *tcp, _tcp;
 	struct udphdr *udp, _udp;
 
-	keys->ip_proto = proto;
 	switch (proto) {
 	case IPPROTO_ICMP:
 		icmp = bpf_flow_dissect_get_header(skb, sizeof(*icmp), &_icmp);
@@ -231,7 +230,6 @@ static __always_inline int parse_ipv6_proto(struct __sk_buff *skb, __u8 nexthdr)
 {
 	struct bpf_flow_keys *keys = skb->flow_keys;
 
-	keys->ip_proto = nexthdr;
 	switch (nexthdr) {
 	case IPPROTO_HOPOPTS:
 	case IPPROTO_DSTOPTS:
@@ -266,6 +264,7 @@ PROG(IP)(struct __sk_buff *skb)
 	keys->addr_proto = ETH_P_IP;
 	keys->ipv4_src = iph->saddr;
 	keys->ipv4_dst = iph->daddr;
+	keys->ip_proto = iph->protocol;
 
 	keys->thoff += iph->ihl << 2;
 	if (data + keys->thoff > data_end)
@@ -273,13 +272,20 @@ PROG(IP)(struct __sk_buff *skb)
 
 	if (iph->frag_off & bpf_htons(IP_MF | IP_OFFSET)) {
 		keys->is_frag = true;
-		if (iph->frag_off & bpf_htons(IP_OFFSET))
+		if (iph->frag_off & bpf_htons(IP_OFFSET)) {
 			/* From second fragment on, packets do not have headers
 			 * we can parse.
 			 */
 			done = true;
-		else
+		} else {
 			keys->is_first_frag = true;
+			/* No need to parse fragmented packet unless
+			 * explicitly asked for.
+			 */
+			if (!(keys->flags &
+			      BPF_FLOW_DISSECTOR_F_PARSE_1ST_FRAG))
+				done = true;
+		}
 	}
 
 	if (done)
@@ -301,6 +307,7 @@ PROG(IPV6)(struct __sk_buff *skb)
 	memcpy(&keys->ipv6_src, &ip6h->saddr, 2*sizeof(ip6h->saddr));
 
 	keys->thoff += sizeof(struct ipv6hdr);
+	keys->ip_proto = ip6h->nexthdr;
 
 	return parse_ipv6_proto(skb, ip6h->nexthdr);
 }
@@ -317,7 +324,8 @@ PROG(IPV6OP)(struct __sk_buff *skb)
 	/* hlen is in 8-octets and does not include the first 8 bytes
 	 * of the header
 	 */
-	skb->flow_keys->thoff += (1 + ip6h->hdrlen) << 3;
+	keys->thoff += (1 + ip6h->hdrlen) << 3;
+	keys->ip_proto = ip6h->nexthdr;
 
 	return parse_ipv6_proto(skb, ip6h->nexthdr);
 }
@@ -333,9 +341,18 @@ PROG(IPV6FR)(struct __sk_buff *skb)
 
 	keys->thoff += sizeof(*fragh);
 	keys->is_frag = true;
-	if (!(fragh->frag_off & bpf_htons(IP6_OFFSET)))
+	keys->ip_proto = fragh->nexthdr;
+
+	if (!(fragh->frag_off & bpf_htons(IP6_OFFSET))) {
 		keys->is_first_frag = true;
 
+		/* No need to parse fragmented packet unless
+		 * explicitly asked for.
+		 */
+		if (!(keys->flags & BPF_FLOW_DISSECTOR_F_PARSE_1ST_FRAG))
+			return export_flow_keys(keys, BPF_OK);
+	}
+
 	return parse_ipv6_proto(skb, fragh->nexthdr);
 }
 
-- 
2.22.0.709.g102302147b-goog


^ permalink raw reply related

* [PATCH bpf-next v3 6/7] bpf/flow_dissector: support ipv6 flow_label and BPF_FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL
From: Stanislav Fomichev @ 2019-07-25 22:52 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, Stanislav Fomichev, Petar Penkov,
	Willem de Bruijn, Song Liu
In-Reply-To: <20190725225231.195090-1-sdf@google.com>

Add support for exporting ipv6 flow label via bpf_flow_keys.
Export flow label from bpf_flow.c and also return early when
BPF_FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL is passed.

Acked-by: Petar Penkov <ppenkov@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/uapi/linux/bpf.h                      |  1 +
 net/core/flow_dissector.c                     |  9 ++++
 tools/include/uapi/linux/bpf.h                |  1 +
 .../selftests/bpf/prog_tests/flow_dissector.c | 46 +++++++++++++++++++
 tools/testing/selftests/bpf/progs/bpf_flow.c  | 10 ++++
 5 files changed, 67 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 88b9d743036f..e985f07a98ed 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3533,6 +3533,7 @@ struct bpf_flow_keys {
 		};
 	};
 	__u32	flags;
+	__be32	flow_label;
 };
 
 struct bpf_func_info {
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 50ed1a688709..9741b593ea53 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -737,6 +737,7 @@ static void __skb_flow_bpf_to_target(const struct bpf_flow_keys *flow_keys,
 	struct flow_dissector_key_basic *key_basic;
 	struct flow_dissector_key_addrs *key_addrs;
 	struct flow_dissector_key_ports *key_ports;
+	struct flow_dissector_key_tags *key_tags;
 
 	key_control = skb_flow_dissector_target(flow_dissector,
 						FLOW_DISSECTOR_KEY_CONTROL,
@@ -781,6 +782,14 @@ static void __skb_flow_bpf_to_target(const struct bpf_flow_keys *flow_keys,
 		key_ports->src = flow_keys->sport;
 		key_ports->dst = flow_keys->dport;
 	}
+
+	if (dissector_uses_key(flow_dissector,
+			       FLOW_DISSECTOR_KEY_FLOW_LABEL)) {
+		key_tags = skb_flow_dissector_target(flow_dissector,
+						     FLOW_DISSECTOR_KEY_FLOW_LABEL,
+						     target_container);
+		key_tags->flow_label = ntohl(flow_keys->flow_label);
+	}
 }
 
 bool bpf_flow_dissect(struct bpf_prog *prog, struct bpf_flow_dissector *ctx,
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 2e4b0848d795..3d7fc67ec1b8 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3530,6 +3530,7 @@ struct bpf_flow_keys {
 		};
 	};
 	__u32	flags;
+	__be32	flow_label;
 };
 
 struct bpf_func_info {
diff --git a/tools/testing/selftests/bpf/prog_tests/flow_dissector.c b/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
index 8e8c18aced9b..ef83f145a6f1 100644
--- a/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
+++ b/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
@@ -20,6 +20,7 @@
 	      "is_encap=%u/%u "						\
 	      "ip_proto=0x%x/0x%x "					\
 	      "n_proto=0x%x/0x%x "					\
+	      "flow_label=0x%x/0x%x "					\
 	      "sport=%u/%u "						\
 	      "dport=%u/%u\n",						\
 	      got.nhoff, expected.nhoff,				\
@@ -30,6 +31,7 @@
 	      got.is_encap, expected.is_encap,				\
 	      got.ip_proto, expected.ip_proto,				\
 	      got.n_proto, expected.n_proto,				\
+	      got.flow_label, expected.flow_label,			\
 	      got.sport, expected.sport,				\
 	      got.dport, expected.dport)
 
@@ -257,6 +259,50 @@ struct test tests[] = {
 			.is_first_frag = true,
 		},
 	},
+	{
+		.name = "ipv6-flow-label",
+		.pkt.ipv6 = {
+			.eth.h_proto = __bpf_constant_htons(ETH_P_IPV6),
+			.iph.nexthdr = IPPROTO_TCP,
+			.iph.payload_len = __bpf_constant_htons(MAGIC_BYTES),
+			.iph.flow_lbl = { 0xb, 0xee, 0xef },
+			.tcp.doff = 5,
+			.tcp.source = 80,
+			.tcp.dest = 8080,
+		},
+		.keys = {
+			.nhoff = ETH_HLEN,
+			.thoff = ETH_HLEN + sizeof(struct ipv6hdr),
+			.addr_proto = ETH_P_IPV6,
+			.ip_proto = IPPROTO_TCP,
+			.n_proto = __bpf_constant_htons(ETH_P_IPV6),
+			.sport = 80,
+			.dport = 8080,
+			.flow_label = __bpf_constant_htonl(0xbeeef),
+		},
+	},
+	{
+		.name = "ipv6-no-flow-label",
+		.pkt.ipv6 = {
+			.eth.h_proto = __bpf_constant_htons(ETH_P_IPV6),
+			.iph.nexthdr = IPPROTO_TCP,
+			.iph.payload_len = __bpf_constant_htons(MAGIC_BYTES),
+			.iph.flow_lbl = { 0xb, 0xee, 0xef },
+			.tcp.doff = 5,
+			.tcp.source = 80,
+			.tcp.dest = 8080,
+		},
+		.keys = {
+			.flags = BPF_FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL,
+			.nhoff = ETH_HLEN,
+			.thoff = ETH_HLEN + sizeof(struct ipv6hdr),
+			.addr_proto = ETH_P_IPV6,
+			.ip_proto = IPPROTO_TCP,
+			.n_proto = __bpf_constant_htons(ETH_P_IPV6),
+			.flow_label = __bpf_constant_htonl(0xbeeef),
+		},
+		.flags = BPF_FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL,
+	},
 };
 
 static int create_tap(const char *ifname)
diff --git a/tools/testing/selftests/bpf/progs/bpf_flow.c b/tools/testing/selftests/bpf/progs/bpf_flow.c
index f931cd3db9d4..7fbfa22f33df 100644
--- a/tools/testing/selftests/bpf/progs/bpf_flow.c
+++ b/tools/testing/selftests/bpf/progs/bpf_flow.c
@@ -83,6 +83,12 @@ static __always_inline int export_flow_keys(struct bpf_flow_keys *keys,
 	return ret;
 }
 
+#define IPV6_FLOWLABEL_MASK		__bpf_constant_htonl(0x000FFFFF)
+static inline __be32 ip6_flowlabel(const struct ipv6hdr *hdr)
+{
+	return *(__be32 *)hdr & IPV6_FLOWLABEL_MASK;
+}
+
 static __always_inline void *bpf_flow_dissect_get_header(struct __sk_buff *skb,
 							 __u16 hdr_size,
 							 void *buffer)
@@ -308,6 +314,10 @@ PROG(IPV6)(struct __sk_buff *skb)
 
 	keys->thoff += sizeof(struct ipv6hdr);
 	keys->ip_proto = ip6h->nexthdr;
+	keys->flow_label = ip6_flowlabel(ip6h);
+
+	if (keys->flags & BPF_FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL)
+		return export_flow_keys(keys, BPF_OK);
 
 	return parse_ipv6_proto(skb, ip6h->nexthdr);
 }
-- 
2.22.0.709.g102302147b-goog


^ permalink raw reply related

* [PATCH bpf-next v3 7/7] selftests/bpf: support BPF_FLOW_DISSECTOR_F_STOP_AT_ENCAP
From: Stanislav Fomichev @ 2019-07-25 22:52 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, Stanislav Fomichev, Petar Penkov,
	Willem de Bruijn, Song Liu
In-Reply-To: <20190725225231.195090-1-sdf@google.com>

Exit as soon as we found that packet is encapped when
BPF_FLOW_DISSECTOR_F_STOP_AT_ENCAP is passed.
Add appropriate selftest cases.

v2:
* Subtract sizeof(struct iphdr) from .iph_inner.tot_len (Willem de Bruijn)

Acked-by: Petar Penkov <ppenkov@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 .../selftests/bpf/prog_tests/flow_dissector.c | 64 +++++++++++++++++++
 tools/testing/selftests/bpf/progs/bpf_flow.c  |  8 +++
 2 files changed, 72 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/flow_dissector.c b/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
index ef83f145a6f1..700d73d2f22a 100644
--- a/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
+++ b/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
@@ -41,6 +41,13 @@ struct ipv4_pkt {
 	struct tcphdr tcp;
 } __packed;
 
+struct ipip_pkt {
+	struct ethhdr eth;
+	struct iphdr iph;
+	struct iphdr iph_inner;
+	struct tcphdr tcp;
+} __packed;
+
 struct svlan_ipv4_pkt {
 	struct ethhdr eth;
 	__u16 vlan_tci;
@@ -82,6 +89,7 @@ struct test {
 	union {
 		struct ipv4_pkt ipv4;
 		struct svlan_ipv4_pkt svlan_ipv4;
+		struct ipip_pkt ipip;
 		struct ipv6_pkt ipv6;
 		struct ipv6_frag_pkt ipv6_frag;
 		struct dvlan_ipv6_pkt dvlan_ipv6;
@@ -303,6 +311,62 @@ struct test tests[] = {
 		},
 		.flags = BPF_FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL,
 	},
+	{
+		.name = "ipip-encap",
+		.pkt.ipip = {
+			.eth.h_proto = __bpf_constant_htons(ETH_P_IP),
+			.iph.ihl = 5,
+			.iph.protocol = IPPROTO_IPIP,
+			.iph.tot_len = __bpf_constant_htons(MAGIC_BYTES),
+			.iph_inner.ihl = 5,
+			.iph_inner.protocol = IPPROTO_TCP,
+			.iph_inner.tot_len =
+				__bpf_constant_htons(MAGIC_BYTES) -
+				sizeof(struct iphdr),
+			.tcp.doff = 5,
+			.tcp.source = 80,
+			.tcp.dest = 8080,
+		},
+		.keys = {
+			.nhoff = 0,
+			.nhoff = ETH_HLEN,
+			.thoff = ETH_HLEN + sizeof(struct iphdr) +
+				sizeof(struct iphdr),
+			.addr_proto = ETH_P_IP,
+			.ip_proto = IPPROTO_TCP,
+			.n_proto = __bpf_constant_htons(ETH_P_IP),
+			.is_encap = true,
+			.sport = 80,
+			.dport = 8080,
+		},
+	},
+	{
+		.name = "ipip-no-encap",
+		.pkt.ipip = {
+			.eth.h_proto = __bpf_constant_htons(ETH_P_IP),
+			.iph.ihl = 5,
+			.iph.protocol = IPPROTO_IPIP,
+			.iph.tot_len = __bpf_constant_htons(MAGIC_BYTES),
+			.iph_inner.ihl = 5,
+			.iph_inner.protocol = IPPROTO_TCP,
+			.iph_inner.tot_len =
+				__bpf_constant_htons(MAGIC_BYTES) -
+				sizeof(struct iphdr),
+			.tcp.doff = 5,
+			.tcp.source = 80,
+			.tcp.dest = 8080,
+		},
+		.keys = {
+			.flags = BPF_FLOW_DISSECTOR_F_STOP_AT_ENCAP,
+			.nhoff = ETH_HLEN,
+			.thoff = ETH_HLEN + sizeof(struct iphdr),
+			.addr_proto = ETH_P_IP,
+			.ip_proto = IPPROTO_IPIP,
+			.n_proto = __bpf_constant_htons(ETH_P_IP),
+			.is_encap = true,
+		},
+		.flags = BPF_FLOW_DISSECTOR_F_STOP_AT_ENCAP,
+	},
 };
 
 static int create_tap(const char *ifname)
diff --git a/tools/testing/selftests/bpf/progs/bpf_flow.c b/tools/testing/selftests/bpf/progs/bpf_flow.c
index 7fbfa22f33df..08bd8b9d58d0 100644
--- a/tools/testing/selftests/bpf/progs/bpf_flow.c
+++ b/tools/testing/selftests/bpf/progs/bpf_flow.c
@@ -167,9 +167,15 @@ static __always_inline int parse_ip_proto(struct __sk_buff *skb, __u8 proto)
 		return export_flow_keys(keys, BPF_OK);
 	case IPPROTO_IPIP:
 		keys->is_encap = true;
+		if (keys->flags & BPF_FLOW_DISSECTOR_F_STOP_AT_ENCAP)
+			return export_flow_keys(keys, BPF_OK);
+
 		return parse_eth_proto(skb, bpf_htons(ETH_P_IP));
 	case IPPROTO_IPV6:
 		keys->is_encap = true;
+		if (keys->flags & BPF_FLOW_DISSECTOR_F_STOP_AT_ENCAP)
+			return export_flow_keys(keys, BPF_OK);
+
 		return parse_eth_proto(skb, bpf_htons(ETH_P_IPV6));
 	case IPPROTO_GRE:
 		gre = bpf_flow_dissect_get_header(skb, sizeof(*gre), &_gre);
@@ -189,6 +195,8 @@ static __always_inline int parse_ip_proto(struct __sk_buff *skb, __u8 proto)
 			keys->thoff += 4; /* Step over sequence number */
 
 		keys->is_encap = true;
+		if (keys->flags & BPF_FLOW_DISSECTOR_F_STOP_AT_ENCAP)
+			return export_flow_keys(keys, BPF_OK);
 
 		if (gre->proto == bpf_htons(ETH_P_TEB)) {
 			eth = bpf_flow_dissect_get_header(skb, sizeof(*eth),
-- 
2.22.0.709.g102302147b-goog


^ permalink raw reply related

* Re: [net-next 0/2] tipc: link changeover issues
From: David Miller @ 2019-07-25 22:56 UTC (permalink / raw)
  To: tuong.t.lien; +Cc: jon.maloy, maloy, ying.xue, netdev, tipc-discussion
In-Reply-To: <20190724015612.2518-1-tuong.t.lien@dektech.com.au>

From: Tuong Lien <tuong.t.lien@dektech.com.au>
Date: Wed, 24 Jul 2019 08:56:10 +0700

> This patch series is to resolve some issues found with the current link
> changeover mechanism, it also includes an optimization for the link
> synching.

Series applied, thank you.

^ permalink raw reply

* [PATCH net-next] mvpp2: document HW checksum behaviour
From: Matteo Croce @ 2019-07-25 23:15 UTC (permalink / raw)
  To: netdev
  Cc: Antoine Tenart, Maxime Chevallier, Marcin Wojtas, Stefan Chulski,
	LKML

The hardware can only offload checksum calculation on first port due to
the Tx FIFO size limitation. Document this in a comment.

Fixes: 576193f2d579 ("net: mvpp2: jumbo frames support")
Signed-off-by: Matteo Croce <mcroce@redhat.com>
---
 drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
index d8e5241097a9..2f7286bd203b 100644
--- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
+++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
@@ -843,7 +843,10 @@ static int mvpp2_bm_update_mtu(struct net_device *dev, int mtu)
 		/* Add port to new short & long pool */
 		mvpp2_swf_bm_pool_init(port);
 
-		/* Update L4 checksum when jumbo enable/disable on port */
+		/* Update L4 checksum when jumbo enable/disable on port.
+		 * Only port 0 supports hardware checksum offload due to
+		 * the Tx FIFO size limitation.
+		 */
 		if (new_long_pool == MVPP2_BM_JUMBO && port->id != 0) {
 			dev->features &= ~(NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM);
 			dev->hw_features &= ~(NETIF_F_IP_CSUM |
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH bpf-next 02/10] libbpf: implement BPF CO-RE offset relocation algorithm
From: Alexei Starovoitov @ 2019-07-25 23:18 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, netdev, ast, daniel, yhs, andrii.nakryiko, kernel-team
In-Reply-To: <20190724192742.1419254-3-andriin@fb.com>

On Wed, Jul 24, 2019 at 12:27:34PM -0700, Andrii Nakryiko wrote:
> This patch implements the core logic for BPF CO-RE offsets relocations.
> All the details are described in code comments.
> 
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> ---
>  tools/lib/bpf/libbpf.c | 866 ++++++++++++++++++++++++++++++++++++++++-
>  tools/lib/bpf/libbpf.h |   1 +
>  2 files changed, 861 insertions(+), 6 deletions(-)
> 
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 8741c39adb1c..86d87bf10d46 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -38,6 +38,7 @@
>  #include <sys/stat.h>
>  #include <sys/types.h>
>  #include <sys/vfs.h>
> +#include <sys/utsname.h>
>  #include <tools/libc_compat.h>
>  #include <libelf.h>
>  #include <gelf.h>
> @@ -47,6 +48,7 @@
>  #include "btf.h"
>  #include "str_error.h"
>  #include "libbpf_internal.h"
> +#include "hashmap.h"
>  
>  #ifndef EM_BPF
>  #define EM_BPF 247
> @@ -1013,16 +1015,22 @@ static int bpf_object__init_user_maps(struct bpf_object *obj, bool strict)
>  }
>  
>  static const struct btf_type *skip_mods_and_typedefs(const struct btf *btf,
> -						     __u32 id)
> +						     __u32 id,
> +						     __u32 *res_id)

I think it would be more readable to format it like:
static const struct btf_type *
skip_mods_and_typedefs(const struct btf *btf, __u32 id, __u32 *res_id)

> +	} else if (class == BPF_ST && BPF_MODE(insn->code) == BPF_MEM) {
> +		if (insn->imm != orig_off)
> +			return -EINVAL;
> +		insn->imm = new_off;
> +		pr_debug("prog '%s': patched insn #%d (ST | MEM) imm %d -> %d\n",
> +			 bpf_program__title(prog, false),
> +			 insn_idx, orig_off, new_off);

I'm pretty sure llvm was not capable of emitting BPF_ST insn.
When did that change?

> +/* 
> + * CO-RE relocate single instruction.
> + *
> + * The outline and important points of the algorithm:
> + * 1. For given local type, find corresponding candidate target types.
> + *    Candidate type is a type with the same "essential" name, ignoring
> + *    everything after last triple underscore (___). E.g., `sample`,
> + *    `sample___flavor_one`, `sample___flavor_another_one`, are all candidates
> + *    for each other. Names with triple underscore are referred to as
> + *    "flavors" and are useful, among other things, to allow to
> + *    specify/support incompatible variations of the same kernel struct, which
> + *    might differ between different kernel versions and/or build
> + *    configurations.

"flavors" is a convention of bpftool btf2c converter, right?
May be mention it here with pointer to the code?

> +	pr_debug("prog '%s': relo #%d: insn_off=%d, [%d] (%s) + %s\n",
> +		 prog_name, relo_idx, relo->insn_off,
> +		 local_id, local_name, spec_str);
> +
> +	err = bpf_core_spec_parse(local_btf, local_id, spec_str, &local_spec);
> +	if (err) {
> +		pr_warning("prog '%s': relo #%d: parsing [%d] (%s) + %s failed: %d\n",
> +			   prog_name, relo_idx, local_id, local_name, spec_str,
> +			   err);
> +		return -EINVAL;
> +	}
> +	pr_debug("prog '%s': relo #%d: [%d] (%s) + %s is off %u, len %d, raw_len %d\n",
> +		 prog_name, relo_idx, local_id, local_name, spec_str,
> +		 local_spec.offset, local_spec.len, local_spec.raw_len);

one warn and two debug that print more or less the same info seems like overkill.

> +	for (i = 0, j = 0; i < cand_ids->len; i++) {
> +		cand_id = cand_ids->data[j];
> +		cand_type = btf__type_by_id(targ_btf, cand_id);
> +		cand_name = btf__name_by_offset(targ_btf, cand_type->name_off);
> +
> +		err = bpf_core_spec_match(&local_spec, targ_btf,
> +					  cand_id, &cand_spec);
> +		if (err < 0) {
> +			pr_warning("prog '%s': relo #%d: failed to match spec [%d] (%s) + %s to candidate #%d [%d] (%s): %d\n",
> +				   prog_name, relo_idx, local_id, local_name,
> +				   spec_str, i, cand_id, cand_name, err);
> +			return err;
> +		}
> +		if (err == 0) {
> +			pr_debug("prog '%s': relo #%d: candidate #%d [%d] (%s) doesn't match spec\n",
> +				 prog_name, relo_idx, i, cand_id, cand_name);
> +			continue;
> +		}
> +
> +		pr_debug("prog '%s': relo #%d: candidate #%d ([%d] %s) is off %u, len %d, raw_len %d\n",
> +			 prog_name, relo_idx, i, cand_id, cand_name,
> +			 cand_spec.offset, cand_spec.len, cand_spec.raw_len);

have the same feeling about 3 printfs above.


^ permalink raw reply

* [PATCH net] mvpp2: refactor MTU change code
From: Matteo Croce @ 2019-07-25 23:19 UTC (permalink / raw)
  To: netdev
  Cc: Antoine Tenart, Maxime Chevallier, Marcin Wojtas, Stefan Chulski,
	LKML

The MTU change code can call napi_disable() with the device already down,
leading to a deadlock. Also, lot of code is duplicated unnecessarily.

Rework mvpp2_change_mtu() to avoid the deadlock and remove duplicated code.

Signed-off-by: Matteo Croce <mcroce@redhat.com>
---
 .../net/ethernet/marvell/mvpp2/mvpp2_main.c   | 41 ++++++-------------
 1 file changed, 13 insertions(+), 28 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
index 2f7286bd203b..60eb98f99571 100644
--- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
+++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
@@ -3612,6 +3612,7 @@ static int mvpp2_set_mac_address(struct net_device *dev, void *p)
 static int mvpp2_change_mtu(struct net_device *dev, int mtu)
 {
 	struct mvpp2_port *port = netdev_priv(dev);
+	bool running = netif_running(dev);
 	int err;
 
 	if (!IS_ALIGNED(MVPP2_RX_PKT_SIZE(mtu), 8)) {
@@ -3620,40 +3621,24 @@ static int mvpp2_change_mtu(struct net_device *dev, int mtu)
 		mtu = ALIGN(MVPP2_RX_PKT_SIZE(mtu), 8);
 	}
 
-	if (!netif_running(dev)) {
-		err = mvpp2_bm_update_mtu(dev, mtu);
-		if (!err) {
-			port->pkt_size =  MVPP2_RX_PKT_SIZE(mtu);
-			return 0;
-		}
-
-		/* Reconfigure BM to the original MTU */
-		err = mvpp2_bm_update_mtu(dev, dev->mtu);
-		if (err)
-			goto log_error;
-	}
-
-	mvpp2_stop_dev(port);
+	if (running)
+		mvpp2_stop_dev(port);
 
 	err = mvpp2_bm_update_mtu(dev, mtu);
-	if (!err) {
+	if (err) {
+		netdev_err(dev, "failed to change MTU\n");
+		/* Reconfigure BM to the original MTU */
+		mvpp2_bm_update_mtu(dev, dev->mtu);
+	} else {
 		port->pkt_size =  MVPP2_RX_PKT_SIZE(mtu);
-		goto out_start;
 	}
 
-	/* Reconfigure BM to the original MTU */
-	err = mvpp2_bm_update_mtu(dev, dev->mtu);
-	if (err)
-		goto log_error;
-
-out_start:
-	mvpp2_start_dev(port);
-	mvpp2_egress_enable(port);
-	mvpp2_ingress_enable(port);
+	if (running) {
+		mvpp2_start_dev(port);
+		mvpp2_egress_enable(port);
+		mvpp2_ingress_enable(port);
+	}
 
-	return 0;
-log_error:
-	netdev_err(dev, "failed to change MTU\n");
 	return err;
 }
 
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
From: Brian Vazquez @ 2019-07-25 23:25 UTC (permalink / raw)
  To: Song Liu
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, open list, Networking, bpf
In-Reply-To: <CAPhsuW5NzzeDmNmgqRh0kwHnoQfaD90L44NJ9AbydG_tGJkKiQ@mail.gmail.com>

> > > If prev_key is deleted before map_get_next_key(), we get the first key
> > > again. This is pretty weird.
> >
> > Yes, I know. But note that the current scenario happens even for the
> > old interface (imagine you are walking a map from userspace and you
> > tried get_next_key the prev_key was removed, you will start again from
> > the beginning without noticing it).
> > I tried to sent a patch in the past but I was missing some context:
> > before NULL was used to get the very first_key the interface relied in
> > a random (non existent) key to retrieve the first_key in the map, and
> > I was told what we still have to support that scenario.
>
> BPF_MAP_DUMP is slightly different, as you may return the first key
> multiple times in the same call. Also, BPF_MAP_DUMP is new, so we
> don't have to support legacy scenarios.
>
> Since BPF_MAP_DUMP keeps a list of elements. It is possible to try
> to look up previous keys. Would something down this direction work?

I've been thinking about it and I think first we need a way to detect
that since key was not present we got the first_key instead:

- One solution I had in mind was to explicitly asked for the first key
with map_get_next_key(map, NULL, first_key) and while walking the map
check that map_get_next_key(map, prev_key, key) doesn't return the
same key. This could be done using memcmp.
- Discussing with Stan, he mentioned that another option is to support
a flag in map_get_next_key to let it know that we want an error
instead of the first_key.

After detecting the problem we also need to define what we want to do,
here some options:

a) Return the error to the caller
b) Try with previous keys if any (which be limited to the keys that we
have traversed so far in this dump call)
c) continue with next entries in the map. array is easy just get the
next valid key (starting on i+1), but hmap might be difficult since
starting on the next bucket could potentially skip some keys that were
concurrently added to the same bucket where key used to be, and
starting on the same bucket could lead us to return repeated elements.

Or maybe we could support those 3 cases via flags and let the caller
decide which one to use?

Let me know what are your thoughts.

Thanks,
Brian

^ permalink raw reply

* Re: [PATCH bpf-next 06/10] selftests/bpf: add CO-RE relocs array tests
From: Alexei Starovoitov @ 2019-07-25 23:26 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, netdev, ast, daniel, yhs, andrii.nakryiko, kernel-team
In-Reply-To: <20190724192742.1419254-7-andriin@fb.com>

On Wed, Jul 24, 2019 at 12:27:38PM -0700, Andrii Nakryiko wrote:
> Add tests for various array handling/relocation scenarios.
> 
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
...
> +
> +#define CORE_READ(dst, src) \
> +	bpf_probe_read(dst, sizeof(*src), __builtin_preserve_access_index(src))

This is the key accessor that all progs will use.
Please split just this single macro into individual commit and add
detailed comment about its purpose and
what __builtin_preserve_access_index() does underneath.

> +SEC("raw_tracepoint/sys_enter")
> +int test_core_nesting(void *ctx)
> +{
> +	struct core_reloc_arrays *in = (void *)&data.in;
> +	struct core_reloc_arrays_output *out = (void *)&data.out;
> +
> +	/* in->a[2] */
> +	if (CORE_READ(&out->a2, &in->a[2]))
> +		return 1;
> +	/* in->b[1][2][3] */
> +	if (CORE_READ(&out->b123, &in->b[1][2][3]))
> +		return 1;

^ permalink raw reply

* Re: [PATCH bpf-next 3/6] bpf: keep bpf.h in sync with tools/
From: Brian Vazquez @ 2019-07-25 23:27 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, open list, Networking, bpf
In-Reply-To: <CAEf4BzaCUBA40DKUYm6rSa0v-jQMK7aPu867oYkZhfZGB4wiSA@mail.gmail.com>

On Wed, Jul 24, 2019 at 4:10 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Wed, Jul 24, 2019 at 10:10 AM Brian Vazquez <brianvv@google.com> wrote:
> >
> > Adds bpf_attr.dump structure to libbpf.
> >
> > Suggested-by: Stanislav Fomichev <sdf@google.com>
> > Signed-off-by: Brian Vazquez <brianvv@google.com>
> > ---
> >  tools/include/uapi/linux/bpf.h | 9 +++++++++
> >  tools/lib/bpf/libbpf.map       | 2 ++
> >  2 files changed, 11 insertions(+)
> >
> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index 4e455018da65f..e127f16e4e932 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> > @@ -106,6 +106,7 @@ enum bpf_cmd {
> >         BPF_TASK_FD_QUERY,
> >         BPF_MAP_LOOKUP_AND_DELETE_ELEM,
> >         BPF_MAP_FREEZE,
> > +       BPF_MAP_DUMP,
> >  };
> >
> >  enum bpf_map_type {
> > @@ -388,6 +389,14 @@ union bpf_attr {
> >                 __u64           flags;
> >         };
> >
> > +       struct { /* struct used by BPF_MAP_DUMP command */
> > +               __aligned_u64   prev_key;
> > +               __aligned_u64   buf;
> > +               __aligned_u64   buf_len; /* input/output: len of buf */
> > +               __u64           flags;
> > +               __u32           map_fd;
> > +       } dump;
> > +
> >         struct { /* anonymous struct used by BPF_PROG_LOAD command */
> >                 __u32           prog_type;      /* one of enum bpf_prog_type */
> >                 __u32           insn_cnt;
> > diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
> > index f9d316e873d8d..cac3723d5c45c 100644
> > --- a/tools/lib/bpf/libbpf.map
> > +++ b/tools/lib/bpf/libbpf.map
> > @@ -183,4 +183,6 @@ LIBBPF_0.0.4 {
>
> LIBBPF_0.0.4 is closed, this needs to go into LIBBPF_0.0.5.

Sorry my bad, I didn't closely look at the rebase so this got it wrong.

>
> >                 perf_buffer__new;
> >                 perf_buffer__new_raw;
> >                 perf_buffer__poll;
> > +               bpf_map_dump;
> > +               bpf_map_dump_flags;
>
> As the general rule, please keep those lists of functions in alphabetical order.

right.

I will fix it in next version, thanks for reviewing it!

^ permalink raw reply

* Slowness forming TIPC cluster with explicit node addresses
From: Chris Packham @ 2019-07-25 23:37 UTC (permalink / raw)
  To: tipc-discussion@lists.sourceforge.net
  Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org

Hi,

I'm having problems forming a TIPC cluster between 2 nodes.

This is the basic steps I'm going through on each node.

modprobe tipc
ip link set eth2 up
tipc node set addr 1.1.5 # or 1.1.6
tipc bearer enable media eth dev eth0

Then to confirm if the cluster is formed I use tipc link list

[root@node-5 ~]# tipc link list
broadcast-link: up
...

Looking at tcpdump the two nodes are sending packets 

22:30:05.782320 TIPC v2.0 1.1.5 > 0.0.0, headerlength 60 bytes,
MessageSize 76 bytes, Neighbor Detection Protocol internal, messageType
Link request
22:30:05.863555 TIPC v2.0 1.1.6 > 0.0.0, headerlength 60 bytes,
MessageSize 76 bytes, Neighbor Detection Protocol internal, messageType
Link request

Eventually (after a few minutes) the link does come up

[root@node-6 ~]# tipc link list
broadcast-link: up
1001006:eth2-1001005:eth2: up

[root@node-5 ~]# tipc link list
broadcast-link: up
1001005:eth2-1001006:eth2: up

When I remove the "tipc node set addr" things seem to kick into life
straight away

[root@node-5 ~]# tipc link list
broadcast-link: up
0050b61bd2aa:eth2-0050b61e6dfa:eth2: up

So there appears to be some difference in behaviour between having an
explicit node address and using the default. Unfortunately our
application relies on setting the node addresses.

[root@node-5 ~]# uname -a
Linux linuxbox 5.2.0-at1+ #8 SMP Thu Jul 25 23:22:41 UTC 2019 ppc
GNU/Linux

Any thoughts on the problem?

^ permalink raw reply

* Re: [PATCH v4 net-next 14/19] ionic: Add Tx and Rx handling
From: Saeed Mahameed @ 2019-07-25 23:48 UTC (permalink / raw)
  To: snelson@pensando.io, netdev@vger.kernel.org, davem@davemloft.net
In-Reply-To: <20190722214023.9513-15-snelson@pensando.io>

On Mon, 2019-07-22 at 14:40 -0700, Shannon Nelson wrote:
> Add both the Tx and Rx queue setup and handling.  The related
> stats display comes later.  Instead of using the generic napi
> routines used by the slow-path commands, the Tx and Rx paths
> are simplified and inlined in one file in order to get better
> compiler optimizations.
> 
> Signed-off-by: Shannon Nelson <snelson@pensando.io>
> ---
>  drivers/net/ethernet/pensando/ionic/Makefile  |   2 +-
>  .../net/ethernet/pensando/ionic/ionic_lif.c   | 387 ++++++++
>  .../net/ethernet/pensando/ionic/ionic_lif.h   |  52 ++
>  .../net/ethernet/pensando/ionic/ionic_txrx.c  | 879
> ++++++++++++++++++
>  .../net/ethernet/pensando/ionic/ionic_txrx.h  |  15 +
>  5 files changed, 1334 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/net/ethernet/pensando/ionic/ionic_txrx.c
>  create mode 100644 drivers/net/ethernet/pensando/ionic/ionic_txrx.h
> 
> diff --git a/drivers/net/ethernet/pensando/ionic/Makefile
> b/drivers/net/ethernet/pensando/ionic/Makefile
> index 9b19bf57a489..0e2dc53f08d4 100644
> --- a/drivers/net/ethernet/pensando/ionic/Makefile
> +++ b/drivers/net/ethernet/pensando/ionic/Makefile
> @@ -4,4 +4,4 @@
>  obj-$(CONFIG_IONIC) := ionic.o
>  
>  ionic-y := ionic_main.o ionic_bus_pci.o ionic_dev.o ionic_ethtool.o
> \
> -	   ionic_lif.o ionic_rx_filter.o ionic_debugfs.o
> +	   ionic_lif.o ionic_rx_filter.o ionic_txrx.o ionic_debugfs.o
> diff --git a/drivers/net/ethernet/pensando/ionic/ionic_lif.c
> b/drivers/net/ethernet/pensando/ionic/ionic_lif.c
> index 2bd8ce61c4a0..40d3b1cb362a 100644
> --- a/drivers/net/ethernet/pensando/ionic/ionic_lif.c
> +++ b/drivers/net/ethernet/pensando/ionic/ionic_lif.c
> @@ -10,6 +10,7 @@
>  #include "ionic.h"
>  #include "ionic_bus.h"
>  #include "ionic_lif.h"
> +#include "ionic_txrx.h"
>  #include "ionic_ethtool.h"
>  #include "ionic_debugfs.h"
>  
> @@ -18,6 +19,13 @@ static int ionic_lif_addr_add(struct lif *lif,
> const u8 *addr);
>  static int ionic_lif_addr_del(struct lif *lif, const u8 *addr);
>  static void ionic_link_status_check(struct lif *lif);
>  
> +static int ionic_lif_stop(struct lif *lif);
> +static int ionic_txrx_alloc(struct lif *lif);
> +static int ionic_txrx_init(struct lif *lif);
> +static void ionic_qcq_free(struct lif *lif, struct qcq *qcq);
> +static int ionic_lif_txqs_init(struct lif *lif);
> +static int ionic_lif_rxqs_init(struct lif *lif);
> +static void ionic_lif_qcq_deinit(struct lif *lif, struct qcq *qcq);
>  static int ionic_set_nic_features(struct lif *lif, netdev_features_t
> features);
>  static int ionic_notifyq_clean(struct lif *lif, int budget);
>  
> @@ -66,12 +74,96 @@ static void ionic_lif_deferred_enqueue(struct
> ionic_deferred *def,
>  	schedule_work(&def->work);
>  }

Bottom up or top down ? your current design is very mixed and I had to
to scroll down and up too often just to understand what a function is
doing, i strongly suggest to pick an use one approach.

[1] 
https://en.wikipedia.org/wiki/Top-down_and_bottom-up_design#Programming


>  
> +static int ionic_qcq_enable(struct qcq *qcq)
> +{
> +	struct queue *q = &qcq->q;
> +	struct lif *lif = q->lif;
> +	struct device *dev = lif->ionic->dev;
> +	struct ionic_dev *idev = &lif->ionic->idev;
> +	struct ionic_admin_ctx ctx = {
> +		.work = COMPLETION_INITIALIZER_ONSTACK(ctx.work),
> +		.cmd.q_control = {
> +			.opcode = CMD_OPCODE_Q_CONTROL,
> +			.lif_index = cpu_to_le16(lif->index),
> +			.type = q->type,
> +			.index = cpu_to_le32(q->index),
> +			.oper = IONIC_Q_ENABLE,
> +		},
> +	};
> +
> +	dev_dbg(dev, "q_enable.index %d q_enable.qtype %d\n",
> +		ctx.cmd.q_control.index, ctx.cmd.q_control.type);
> +
> +	if (qcq->flags & QCQ_F_INTR) {
> +		irq_set_affinity_hint(qcq->intr.vector,
> +				      &qcq->intr.affinity_mask);
> +		napi_enable(&qcq->napi);
> +		ionic_intr_clean(idev->intr_ctrl, qcq->intr.index);
> +		ionic_intr_mask(idev->intr_ctrl, qcq->intr.index,
> +				IONIC_INTR_MASK_CLEAR);
> +	}
> +
> +	return ionic_adminq_post_wait(lif, &ctx);
> +}
> +
> +static int ionic_qcq_disable(struct qcq *qcq)
> +{
> +	struct queue *q = &qcq->q;
> +	struct lif *lif = q->lif;
> +	struct device *dev = lif->ionic->dev;
> +	struct ionic_dev *idev = &lif->ionic->idev;
> +	struct ionic_admin_ctx ctx = {
> +		.work = COMPLETION_INITIALIZER_ONSTACK(ctx.work),
> +		.cmd.q_control = {
> +			.opcode = CMD_OPCODE_Q_CONTROL,
> +			.lif_index = cpu_to_le16(lif->index),
> +			.type = q->type,
> +			.index = cpu_to_le32(q->index),
> +			.oper = IONIC_Q_DISABLE,
> +		},
> +	};
> +
> +	dev_dbg(dev, "q_disable.index %d q_disable.qtype %d\n",
> +		ctx.cmd.q_control.index, ctx.cmd.q_control.type);
> +
> +	if (qcq->flags & QCQ_F_INTR) {
> +		ionic_intr_mask(idev->intr_ctrl, qcq->intr.index,
> +				IONIC_INTR_MASK_SET);
> +		synchronize_irq(qcq->intr.vector);
> +		irq_set_affinity_hint(qcq->intr.vector, NULL);
> +		napi_disable(&qcq->napi);
> +	}
> +
> +	return ionic_adminq_post_wait(lif, &ctx);
> +}
> +
>  int ionic_open(struct net_device *netdev)
>  {
>  	struct lif *lif = netdev_priv(netdev);
> +	unsigned int i;
> +	int err;
>  
>  	netif_carrier_off(netdev);
>  
> +	err = ionic_txrx_alloc(lif);
> +	if (err)
> +		return err;
> +
> +	err = ionic_txrx_init(lif);
> +	if (err)
> +		goto err_out;
> +
> +	for (i = 0; i < lif->nxqs; i++) {
> +		err = ionic_qcq_enable(lif->txqcqs[i].qcq);
> +		if (err)
> +			goto err_out;
> +
> +		ionic_rx_fill(&lif->rxqcqs[i].qcq->q);
> +		err = ionic_qcq_enable(lif->rxqcqs[i].qcq);
> +		if (err)
> +			goto err_out;
> +	}
> +
>  	set_bit(LIF_UP, lif->state);
>  
>  	ionic_link_status_check(lif);
> @@ -79,11 +171,16 @@ int ionic_open(struct net_device *netdev)
>  		netif_tx_wake_all_queues(netdev);
>  
>  	return 0;
> +
> +err_out:
> +	ionic_lif_stop(lif);

This is dangerous, maybe now it is ok to stop a partially open driver, 
but future development might break this assumption, and to avoid this i
strongly recommend to have a symmetrical cleanup flow, or have a
documentation that says ionic_lif_stop might be called with partially
open resources and proper checks must be done.

> +	return err;
>  }
>  
>  static int ionic_lif_stop(struct lif *lif)
>  {
>  	struct net_device *ndev = lif->netdev;
> +	unsigned int i;
>  	int err = 0;
>  
>  	if (!test_bit(LIF_UP, lif->state)) {
> @@ -100,6 +197,21 @@ static int ionic_lif_stop(struct lif *lif)
>  	netif_tx_disable(ndev);
>  	synchronize_rcu();
>  
> +	for (i = 0; i < lif->nxqs; i++) {
> +		(void)ionic_qcq_disable(lif->txqcqs[i].qcq);
> +		ionic_tx_flush(&lif->txqcqs[i].qcq->cq);
> +		ionic_lif_qcq_deinit(lif, lif->txqcqs[i].qcq);
> +		ionic_qcq_free(lif, lif->txqcqs[i].qcq);
> +		lif->txqcqs[i].qcq = NULL;
> +
> +		(void)ionic_qcq_disable(lif->rxqcqs[i].qcq);
> +		ionic_rx_flush(&lif->rxqcqs[i].qcq->cq);
> +		ionic_lif_qcq_deinit(lif, lif->rxqcqs[i].qcq);
> +		ionic_rx_empty(&lif->rxqcqs[i].qcq->q);
> +		ionic_qcq_free(lif, lif->rxqcqs[i].qcq);
> +		lif->rxqcqs[i].qcq = NULL;
> +	}
> +

Why don't you break this down to stages/functions (disbale/uninit/free)
as you did on open (alloc/ini/enable) ?  will be easier to review and
to spot inconsistencies
 
>  	return err;
>  }
>  
> @@ -694,6 +806,7 @@ static int ionic_vlan_rx_kill_vid(struct
> net_device *netdev, __be16 proto,
>  static const struct net_device_ops ionic_netdev_ops = {
>  	.ndo_open               = ionic_open,
>  	.ndo_stop               = ionic_stop,
> +	.ndo_start_xmit		= ionic_start_xmit,
>  	.ndo_get_stats64	= ionic_get_stats64,
>  	.ndo_set_rx_mode	= ionic_set_rx_mode,
>  	.ndo_set_features	= ionic_set_features,
> @@ -909,10 +1022,83 @@ static void ionic_qcq_free(struct lif *lif,
> struct qcq *qcq)
>  	devm_kfree(dev, qcq);
>  }
>  
> +static int ionic_txrx_alloc(struct lif *lif)
> +{
> +	unsigned int flags;
> +	unsigned int i;
> +	int err = 0;
> +
> +	flags = QCQ_F_TX_STATS | QCQ_F_SG;
> +	for (i = 0; i < lif->nxqs; i++) {
> +		err = ionic_qcq_alloc(lif, IONIC_QTYPE_TXQ, i, "tx",
> flags,
> +				      lif->ntxq_descs,
> +				      sizeof(struct txq_desc),
> +				      sizeof(struct txq_comp),
> +				      sizeof(struct txq_sg_desc),
> +				      lif->kern_pid, &lif-
> >txqcqs[i].qcq);
> +		if (err)
> +			goto err_out_free_txqcqs;
> +
> +		lif->txqcqs[i].qcq->stats = lif->txqcqs[i].stats;
> +	}
> +
> +	flags = QCQ_F_RX_STATS | QCQ_F_INTR;
> +	for (i = 0; i < lif->nxqs; i++) {
> +		err = ionic_qcq_alloc(lif, IONIC_QTYPE_RXQ, i, "rx",
> flags,
> +				      lif->nrxq_descs,
> +				      sizeof(struct rxq_desc),
> +				      sizeof(struct rxq_comp),
> +				      0, lif->kern_pid, &lif-
> >rxqcqs[i].qcq);
> +		if (err)
> +			goto err_out_free_rxqcqs;
> +
> +		lif->rxqcqs[i].qcq->stats = lif->rxqcqs[i].stats;
> +
> +		ionic_link_qcq_interrupts(lif->rxqcqs[i].qcq,
> +					  lif->txqcqs[i].qcq);
> +	}
> +
> +	return 0;
> +
> +err_out_free_rxqcqs:
> +	for (i = 0; i < lif->nxqs; i++)
> +		ionic_qcq_free(lif, lif->rxqcqs[i].qcq);
> +err_out_free_txqcqs:
> +	for (i = 0; i < lif->nxqs; i++)
> +		ionic_qcq_free(lif, lif->txqcqs[i].qcq);
> +
> +	return err;
> +}
> +
> +static int ionic_txrx_init(struct lif *lif)
> +{
> +	int err;
> +
> +	err = ionic_lif_txqs_init(lif);
> +	if (err)
> +		return err;
> +
> +	err = ionic_lif_rxqs_init(lif);
> +	if (err)
> +		goto err_out;
> +
> +	ionic_set_rx_mode(lif->netdev);
> +
> +	return 0;
> +
> +err_out:
> +	ionic_stop(lif->netdev);

I would move error handling to the caller, keep the code flow and err
path symmetrical with good path.

> +
> +	return err;
> +}
> +
>  static int ionic_qcqs_alloc(struct lif *lif)
>  {
> +	struct device *dev = lif->ionic->dev;
> +	unsigned int q_list_size;
>  	unsigned int flags;
>  	int err;
> +	int i;
>  
>  	flags = QCQ_F_INTR;
>  	err = ionic_qcq_alloc(lif, IONIC_QTYPE_ADMINQ, 0, "admin",
> flags,
> @@ -937,8 +1123,47 @@ static int ionic_qcqs_alloc(struct lif *lif)
>  		ionic_link_qcq_interrupts(lif->adminqcq, lif-
> >notifyqcq);
>  	}
>  
> +	q_list_size = sizeof(*lif->txqcqs) * lif->nxqs;
> +	err = -ENOMEM;
> +	lif->txqcqs = devm_kzalloc(dev, q_list_size, GFP_KERNEL);
> +	if (!lif->txqcqs)
> +		goto err_out_free_notifyqcq;
> +	for (i = 0; i < lif->nxqs; i++) {
> +		lif->txqcqs[i].stats = devm_kzalloc(dev, sizeof(struct
> q_stats),
> +						    GFP_KERNEL);

why not inlineing stats into txqcq struct and avoid this kzalloc  ? 

> +		if (!lif->txqcqs[i].stats)
> +			goto err_out_free_tx_stats;
> +	}
> +
> +	lif->rxqcqs = devm_kzalloc(dev, q_list_size, GFP_KERNEL);
> +	if (!lif->rxqcqs)
> +		goto err_out_free_tx_stats;
> +	for (i = 0; i < lif->nxqs; i++) {
> +		lif->rxqcqs[i].stats = devm_kzalloc(dev, sizeof(struct
> q_stats),
> +						    GFP_KERNEL);
> +		if (!lif->rxqcqs[i].stats)
> +			goto err_out_free_rx_stats;

same 

> +	}
> +
>  	return 0;
>  
> +err_out_free_rx_stats:
> +	for (i = 0; i < lif->nxqs; i++)
> +		if (lif->rxqcqs[i].stats)
> +			devm_kfree(dev, lif->rxqcqs[i].stats);
> +	devm_kfree(dev, lif->rxqcqs);
> +	lif->rxqcqs = NULL;
> +err_out_free_tx_stats:
> +	for (i = 0; i < lif->nxqs; i++)
> +		if (lif->txqcqs[i].stats)
> +			devm_kfree(dev, lif->txqcqs[i].stats);
> +	devm_kfree(dev, lif->txqcqs);
> +	lif->txqcqs = NULL;
> +err_out_free_notifyqcq:
> +	if (lif->notifyqcq) {
> +		ionic_qcq_free(lif, lif->notifyqcq);
> +		lif->notifyqcq = NULL;
> +	}
>  err_out_free_adminqcq:
>  	ionic_qcq_free(lif, lif->adminqcq);
>  	lif->adminqcq = NULL;
> @@ -948,6 +1173,9 @@ static int ionic_qcqs_alloc(struct lif *lif)
>  
>  static void ionic_qcqs_free(struct lif *lif)
>  {
> +	struct device *dev = lif->ionic->dev;
> +	unsigned int i;
> +
>  	if (lif->notifyqcq) {
>  		ionic_qcq_free(lif, lif->notifyqcq);
>  		lif->notifyqcq = NULL;
> @@ -957,6 +1185,20 @@ static void ionic_qcqs_free(struct lif *lif)
>  		ionic_qcq_free(lif, lif->adminqcq);
>  		lif->adminqcq = NULL;
>  	}
> +
> +	for (i = 0; i < lif->nxqs; i++)
> +		if (lif->rxqcqs[i].stats)
> +			devm_kfree(dev, lif->rxqcqs[i].stats);
> +
> +	devm_kfree(dev, lif->rxqcqs);
> +	lif->rxqcqs = NULL;
> +
> +	for (i = 0; i < lif->nxqs; i++)
> +		if (lif->txqcqs[i].stats)
> +			devm_kfree(dev, lif->txqcqs[i].stats);
> +
> +	devm_kfree(dev, lif->txqcqs);
> +	lif->txqcqs = NULL;
>  }
>  
>  static struct lif *ionic_lif_alloc(struct ionic *ionic, unsigned int
> index)
> @@ -992,6 +1234,8 @@ static struct lif *ionic_lif_alloc(struct ionic
> *ionic, unsigned int index)
>  
>  	lif->ionic = ionic;
>  	lif->index = index;
> +	lif->ntxq_descs = IONIC_DEF_TXRX_DESC;
> +	lif->nrxq_descs = IONIC_DEF_TXRX_DESC;
>  
>  	snprintf(lif->name, sizeof(lif->name), "lif%u", index);
>  
> @@ -1432,6 +1676,147 @@ static int ionic_init_nic_features(struct lif
> *lif)
>  	return 0;
>  }
>  
> +static int ionic_lif_txq_init(struct lif *lif, struct qcq *qcq)
> +{
> +	struct device *dev = lif->ionic->dev;
> +	struct queue *q = &qcq->q;
> +	struct cq *cq = &qcq->cq;
> +	struct ionic_admin_ctx ctx = {
> +		.work = COMPLETION_INITIALIZER_ONSTACK(ctx.work),
> +		.cmd.q_init = {
> +			.opcode = CMD_OPCODE_Q_INIT,
> +			.lif_index = cpu_to_le16(lif->index),
> +			.type = q->type,
> +			.index = cpu_to_le32(q->index),
> +			.flags = cpu_to_le16(IONIC_QINIT_F_IRQ |
> +					     IONIC_QINIT_F_SG),
> +			.intr_index = cpu_to_le16(lif->rxqcqs[q-
> >index].qcq->intr.index),
> +			.pid = cpu_to_le16(q->pid),
> +			.ring_size = ilog2(q->num_descs),
> +			.ring_base = cpu_to_le64(q->base_pa),
> +			.cq_ring_base = cpu_to_le64(cq->base_pa),
> +			.sg_ring_base = cpu_to_le64(q->sg_base_pa),
> +		},
> +	};
> +	int err;
> +
> +	dev_dbg(dev, "txq_init.pid %d\n", ctx.cmd.q_init.pid);
> +	dev_dbg(dev, "txq_init.index %d\n", ctx.cmd.q_init.index);
> +	dev_dbg(dev, "txq_init.ring_base 0x%llx\n",
> ctx.cmd.q_init.ring_base);
> +	dev_dbg(dev, "txq_init.ring_size %d\n",
> ctx.cmd.q_init.ring_size);
> +
> +	err = ionic_adminq_post_wait(lif, &ctx);
> +	if (err)
> +		return err;
> +
> +	q->hw_type = ctx.comp.q_init.hw_type;
> +	q->hw_index = le32_to_cpu(ctx.comp.q_init.hw_index);
> +	q->dbval = IONIC_DBELL_QID(q->hw_index);
> +
> +	dev_dbg(dev, "txq->hw_type %d\n", q->hw_type);
> +	dev_dbg(dev, "txq->hw_index %d\n", q->hw_index);
> +
> +	qcq->flags |= QCQ_F_INITED;
> +
> +	ionic_debugfs_add_qcq(lif, qcq);
> +
> +	return 0;
> +}
> +
> +static int ionic_lif_txqs_init(struct lif *lif)
> +{
> +	unsigned int i;
> +	int err;
> +
> +	for (i = 0; i < lif->nxqs; i++) {
> +		err = ionic_lif_txq_init(lif, lif->txqcqs[i].qcq);
> +		if (err)
> +			goto err_out;
> +	}
> +
> +	return 0;
> +
> +err_out:
> +	for (; i > 0; i--)
> +		ionic_lif_qcq_deinit(lif, lif->txqcqs[i-1].qcq);
> +
> +	return err;
> +}
> +
> +static int ionic_lif_rxq_init(struct lif *lif, struct qcq *qcq)
> +{
> +	struct device *dev = lif->ionic->dev;
> +	struct queue *q = &qcq->q;
> +	struct cq *cq = &qcq->cq;
> +	struct ionic_admin_ctx ctx = {
> +		.work = COMPLETION_INITIALIZER_ONSTACK(ctx.work),
> +		.cmd.q_init = {
> +			.opcode = CMD_OPCODE_Q_INIT,
> +			.lif_index = cpu_to_le16(lif->index),
> +			.type = q->type,
> +			.index = cpu_to_le32(q->index),
> +			.flags = cpu_to_le16(IONIC_QINIT_F_IRQ),
> +			.intr_index = cpu_to_le16(cq->bound_intr-
> >index),
> +			.pid = cpu_to_le16(q->pid),
> +			.ring_size = ilog2(q->num_descs),
> +			.ring_base = cpu_to_le64(q->base_pa),
> +			.cq_ring_base = cpu_to_le64(cq->base_pa),
> +		},
> +	};
> +	int err;
> +
> +	dev_dbg(dev, "rxq_init.pid %d\n", ctx.cmd.q_init.pid);
> +	dev_dbg(dev, "rxq_init.index %d\n", ctx.cmd.q_init.index);
> +	dev_dbg(dev, "rxq_init.ring_base 0x%llx\n",
> ctx.cmd.q_init.ring_base);
> +	dev_dbg(dev, "rxq_init.ring_size %d\n",
> ctx.cmd.q_init.ring_size);
> +
> +	err = ionic_adminq_post_wait(lif, &ctx);
> +	if (err)
> +		return err;
> +
> +	q->hw_type = ctx.comp.q_init.hw_type;
> +	q->hw_index = le32_to_cpu(ctx.comp.q_init.hw_index);
> +	q->dbval = IONIC_DBELL_QID(q->hw_index);
> +
> +	dev_dbg(dev, "rxq->hw_type %d\n", q->hw_type);
> +	dev_dbg(dev, "rxq->hw_index %d\n", q->hw_index);
> +
> +	netif_napi_add(lif->netdev, &qcq->napi, ionic_rx_napi,
> +		       NAPI_POLL_WEIGHT);
> +
> +	err = ionic_request_irq(lif, qcq);

Again, your init/deinit objects code paths is very asymmetrical 
I was looking for where you do free_irq, and eventually found it in 
ionic_lif_qcq_deinit which is shared between rx/tx queues, but it is
very hard to follow who sets the QCQ_F_INTR flag so
ionic_lif_qcq_deinit would free the irq. so i gave up !

I strongly suggest for you to make the code symmetrical as much as
possible by following two simple rules:
1) each function "do_foo" must have a reverse function "undo_foo".. 
2) each function "do_foo" should handle its own error path rollback and
NOT by blindly calling "undo_foo"

This is a very well defined code structure and useful technique to help
reviewers and developers keep track of what is happening, and most
importantly a very good practice that reduces the number of current
bugs and future bugs by an order of magnitude!

example:

open()
 alloc_txrx_queues()
 init_txrx_queues()
 request_irqs()
 enable_txrx_queues()

close()
 disable_txrx_queues()
 free_irqs()
 deinit_txrx_queues()
 dealloc_txrxqs()

this will remove the need for some flags and unnecessary state tracking
for when a resources is open/close/inited, if each function is
symmetrical and handles its own rollback, so at any point in the code
you would know what to do without any extra if statement branches.

> +	if (err) {
> +		netif_napi_del(&qcq->napi);
> +		return err;
> +	}
> +
> +	qcq->flags |= QCQ_F_INITED;
> +
> +	ionic_debugfs_add_qcq(lif, qcq);
> +
> +	return 0;
> +}
> +
> +static int ionic_lif_rxqs_init(struct lif *lif)
> +{
> +	unsigned int i;
> +	int err;
> +
> +	for (i = 0; i < lif->nxqs; i++) {
> +		err = ionic_lif_rxq_init(lif, lif->rxqcqs[i].qcq);
> +		if (err)
> +			goto err_out;
> +	}
> +
> +	return 0;
> +
> +err_out:
> +	for (; i > 0; i--)
> +		ionic_lif_qcq_deinit(lif, lif->rxqcqs[i-1].qcq);
> +
> +	return err;
> +}
> +
>  static int ionic_station_set(struct lif *lif)
>  {
>  	struct net_device *netdev = lif->netdev;
> @@ -1531,6 +1916,8 @@ static int ionic_lif_init(struct lif *lif)
>  	if (err)
>  		goto err_out_notifyq_deinit;
>  
> +	lif->rx_copybreak = IONIC_RX_COPYBREAK_DEFAULT;
> +
>  	set_bit(LIF_INITED, lif->state);
>  
>  	ionic_link_status_check(lif);
> diff --git a/drivers/net/ethernet/pensando/ionic/ionic_lif.h
> b/drivers/net/ethernet/pensando/ionic/ionic_lif.h
> index d8589a306aa5..88d5bf8f58a1 100644
> --- a/drivers/net/ethernet/pensando/ionic/ionic_lif.h
> +++ b/drivers/net/ethernet/pensando/ionic/ionic_lif.h
> @@ -14,20 +14,38 @@
>  #define MAX_NUM_NAPI_CNTR	(NAPI_POLL_WEIGHT + 1)
>  #define GET_SG_CNTR_IDX(num_sg_elems)	(num_sg_elems)
>  #define MAX_NUM_SG_CNTR		(IONIC_TX_MAX_SG_ELEMS + 1)
> +#define IONIC_RX_COPYBREAK_DEFAULT		256
>  
>  struct tx_stats {
> +	u64 dma_map_err;
>  	u64 pkts;
>  	u64 bytes;
> +	u64 clean;
> +	u64 linearize;
> +	u64 no_csum;
> +	u64 csum;
> +	u64 crc32_csum;
> +	u64 tso;
> +	u64 frags;
> +	u64 sg_cntr[MAX_NUM_SG_CNTR];
>  };
>  
>  struct rx_stats {
> +	u64 dma_map_err;
> +	u64 alloc_err;
>  	u64 pkts;
>  	u64 bytes;
> +	u64 csum_none;
> +	u64 csum_complete;
> +	u64 csum_error;
> +	u64 buffers_posted;
>  };
>  
>  #define QCQ_F_INITED		BIT(0)
>  #define QCQ_F_SG		BIT(1)
>  #define QCQ_F_INTR		BIT(2)
> +#define QCQ_F_TX_STATS		BIT(3)
> +#define QCQ_F_RX_STATS		BIT(4)
>  #define QCQ_F_NOTIFYQ		BIT(5)
>  
>  struct napi_stats {
> @@ -56,7 +74,14 @@ struct qcq {
>  	struct dentry *dentry;
>  };
>  
> +struct qcqst {
> +	struct qcq *qcq;
> +	struct q_stats *stats;
> +};
> +
>  #define q_to_qcq(q)		container_of(q, struct qcq, q)
> +#define q_to_tx_stats(q)	(&q_to_qcq(q)->stats->tx)
> +#define q_to_rx_stats(q)	(&q_to_qcq(q)->stats->rx)
>  #define napi_to_qcq(napi)	container_of(napi, struct qcq, napi)
>  #define napi_to_cq(napi)	(&napi_to_qcq(napi)->cq)
>  
> @@ -108,11 +133,14 @@ struct lif {
>  	spinlock_t adminq_lock;		/* lock for AdminQ operations
> */
>  	struct qcq *adminqcq;
>  	struct qcq *notifyqcq;
> +	struct qcqst *txqcqs;
> +	struct qcqst *rxqcqs;
>  	u64 last_eid;
>  	unsigned int neqs;
>  	unsigned int nxqs;
>  	unsigned int ntxq_descs;
>  	unsigned int nrxq_descs;
> +	u32 rx_copybreak;
>  	unsigned int rx_mode;
>  	u64 hw_features;
>  	bool mc_overflow;
> @@ -134,6 +162,11 @@ struct lif {
>  	u32 flags;
>  };
>  
> +#define lif_to_txqcq(lif, i)	((lif)->txqcqs[i].qcq)
> +#define lif_to_rxqcq(lif, i)	((lif)->rxqcqs[i].qcq)
> +#define lif_to_txq(lif, i)	(&lif_to_txqcq((lif), i)->q)
> +#define lif_to_rxq(lif, i)	(&lif_to_txqcq((lif), i)->q)
> +
>  static inline bool ionic_is_pf(struct ionic *ionic)
>  {
>  	return ionic->pdev &&
> @@ -173,6 +206,22 @@ int ionic_open(struct net_device *netdev);
>  int ionic_stop(struct net_device *netdev);
>  int ionic_reset_queues(struct lif *lif);
>  
> +static inline void debug_stats_txq_post(struct qcq *qcq,
> +					struct txq_desc *desc, bool
> dbell)
> +{
> +	u8 num_sg_elems = ((le64_to_cpu(desc->cmd) >>
> IONIC_TXQ_DESC_NSGE_SHIFT)
> +						&
> IONIC_TXQ_DESC_NSGE_MASK);
> +	u8 sg_cntr_idx;
> +
> +	qcq->q.dbell_count += dbell;
> +
> +	sg_cntr_idx = GET_SG_CNTR_IDX(num_sg_elems);
> +	if (sg_cntr_idx > (MAX_NUM_SG_CNTR - 1))
> +		sg_cntr_idx = MAX_NUM_SG_CNTR - 1;
> +
> +	qcq->stats->tx.sg_cntr[sg_cntr_idx]++;
> +}
> +
>  static inline void debug_stats_napi_poll(struct qcq *qcq,
>  					 unsigned int work_done)
>  {
> @@ -188,7 +237,10 @@ static inline void debug_stats_napi_poll(struct
> qcq *qcq,
>  }
>  
>  #define DEBUG_STATS_CQE_CNT(cq)		((cq)->compl_count++)
> +#define DEBUG_STATS_RX_BUFF_CNT(qcq)	((qcq)->stats-
> >rx.buffers_posted++)
>  #define DEBUG_STATS_INTR_REARM(intr)	((intr)->rearm_count++)
> +#define DEBUG_STATS_TXQ_POST(qcq, txdesc, dbell) \
> +	debug_stats_txq_post(qcq, txdesc, dbell)
>  #define DEBUG_STATS_NAPI_POLL(qcq, work_done) \
>  	debug_stats_napi_poll(qcq, work_done)
>  
> diff --git a/drivers/net/ethernet/pensando/ionic/ionic_txrx.c
> b/drivers/net/ethernet/pensando/ionic/ionic_txrx.c
> new file mode 100644
> index 000000000000..7787e7dd5504
> --- /dev/null
> +++ b/drivers/net/ethernet/pensando/ionic/ionic_txrx.c
> @@ -0,0 +1,879 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2017 - 2019 Pensando Systems, Inc */
> +
> +#include <linux/ip.h>
> +#include <linux/ipv6.h>
> +#include <linux/if_vlan.h>
> +#include <net/ip6_checksum.h>
> +
> +#include "ionic.h"
> +#include "ionic_lif.h"
> +#include "ionic_txrx.h"
> +
> +static void ionic_tx_clean(struct queue *q, struct desc_info
> *desc_info,
> +			   struct cq_info *cq_info, void *cb_arg);
> +static void ionic_rx_clean(struct queue *q, struct desc_info
> *desc_info,
> +			   struct cq_info *cq_info, void *cb_arg);
> +
> +static inline void ionic_txq_post(struct queue *q, bool ring_dbell,
> +				  desc_cb cb_func, void *cb_arg)
> +{
> +	DEBUG_STATS_TXQ_POST(q_to_qcq(q), q->head->desc, ring_dbell);
> +
> +	ionic_q_post(q, ring_dbell, cb_func, cb_arg);
> +}
> +
> +static inline void ionic_rxq_post(struct queue *q, bool ring_dbell,
> +				  desc_cb cb_func, void *cb_arg)
> +{
> +	ionic_q_post(q, ring_dbell, cb_func, cb_arg);
> +
> +	DEBUG_STATS_RX_BUFF_CNT(q_to_qcq(q));
> +}
> +
> +static void ionic_rx_recycle(struct queue *q, struct desc_info
> *desc_info,
> +			     struct sk_buff *skb)
> +{
> +	struct rxq_desc *old = desc_info->desc;
> +	struct rxq_desc *new = q->head->desc;
> +
> +	new->addr = old->addr;
> +	new->len = old->len;
> +
> +	ionic_rxq_post(q, true, ionic_rx_clean, skb);
> +}
> +
> +static bool ionic_rx_copybreak(struct queue *q, struct desc_info
> *desc_info,
> +			       struct cq_info *cq_info, struct sk_buff
> **skb)
> +{
> +	struct net_device *netdev = q->lif->netdev;
> +	struct device *dev = q->lif->ionic->dev;
> +	struct rxq_desc *desc = desc_info->desc;
> +	struct rxq_comp *comp = cq_info->cq_desc;
> +	struct sk_buff *new_skb;
> +	u16 clen, dlen;
> +
> +	clen = le16_to_cpu(comp->len);
> +	dlen = le16_to_cpu(desc->len);
> +	if (clen > q->lif->rx_copybreak) {
> +		dma_unmap_single(dev, (dma_addr_t)le64_to_cpu(desc-
> >addr),
> +				 dlen, DMA_FROM_DEVICE);
> +		return false;
> +	}
> +
> +	new_skb = netdev_alloc_skb_ip_align(netdev, clen);
> +	if (!new_skb) {
> +		dma_unmap_single(dev, (dma_addr_t)le64_to_cpu(desc-
> >addr),
> +				 dlen, DMA_FROM_DEVICE);
> +		return false;
> +	}
> +
> +	dma_sync_single_for_cpu(dev, (dma_addr_t)le64_to_cpu(desc-
> >addr),
> +				clen, DMA_FROM_DEVICE);
> +
> +	memcpy(new_skb->data, (*skb)->data, clen);
> +
> +	ionic_rx_recycle(q, desc_info, *skb);
> +	*skb = new_skb;
> +
> +	return true;
> +}
> +
> +static void ionic_rx_clean(struct queue *q, struct desc_info
> *desc_info,
> +			   struct cq_info *cq_info, void *cb_arg)
> +{
> +	struct rxq_comp *comp = cq_info->cq_desc;
> +	struct sk_buff *skb = cb_arg;
> +	struct qcq *qcq = q_to_qcq(q);
> +	struct net_device *netdev;
> +	struct rx_stats *stats;
> +
> +	stats = q_to_rx_stats(q);
> +	netdev = q->lif->netdev;
> +
> +	if (comp->status) {
> +		ionic_rx_recycle(q, desc_info, skb);
> +		return;
> +	}
> +
> +	if (unlikely(test_bit(LIF_QUEUE_RESET, q->lif->state))) {
> +		/* no packet processing while resetting */
> +		ionic_rx_recycle(q, desc_info, skb);
> +		return;
> +	}
> +
> +	stats->pkts++;
> +	stats->bytes += le16_to_cpu(comp->len);
> +
> +	ionic_rx_copybreak(q, desc_info, cq_info, &skb);
> +
> +	skb_put(skb, le16_to_cpu(comp->len));
> +	skb->protocol = eth_type_trans(skb, netdev);
> +
> +	skb_record_rx_queue(skb, q->index);
> +
> +	if (netdev->features & NETIF_F_RXHASH) {
> +		switch (comp->pkt_type_color &
> IONIC_RXQ_COMP_PKT_TYPE_MASK) {
> +		case PKT_TYPE_IPV4:
> +		case PKT_TYPE_IPV6:
> +			skb_set_hash(skb, le32_to_cpu(comp->rss_hash),
> +				     PKT_HASH_TYPE_L3);
> +			break;
> +		case PKT_TYPE_IPV4_TCP:
> +		case PKT_TYPE_IPV6_TCP:
> +		case PKT_TYPE_IPV4_UDP:
> +		case PKT_TYPE_IPV6_UDP:
> +			skb_set_hash(skb, le32_to_cpu(comp->rss_hash),
> +				     PKT_HASH_TYPE_L4);
> +			break;
> +		}
> +	}
> +
> +	if (netdev->features & NETIF_F_RXCSUM) {
> +		if (comp->csum_flags & IONIC_RXQ_COMP_CSUM_F_CALC) {
> +			skb->ip_summed = CHECKSUM_COMPLETE;
> +			skb->csum = (__wsum)le16_to_cpu(comp->csum);
> +			stats->csum_complete++;
> +		}
> +	} else {
> +		stats->csum_none++;
> +	}
> +
> +	if ((comp->csum_flags & IONIC_RXQ_COMP_CSUM_F_TCP_BAD) ||
> +	    (comp->csum_flags & IONIC_RXQ_COMP_CSUM_F_UDP_BAD) ||
> +	    (comp->csum_flags & IONIC_RXQ_COMP_CSUM_F_IP_BAD))
> +		stats->csum_error++;
> +
> +	if (netdev->features & NETIF_F_HW_VLAN_CTAG_RX) {
> +		if (comp->csum_flags & IONIC_RXQ_COMP_CSUM_F_VLAN)
> +			__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q),
> +					       le16_to_cpu(comp-
> >vlan_tci));
> +	}
> +
> +	napi_gro_receive(&qcq->napi, skb);
> +}
> +
> +static bool ionic_rx_service(struct cq *cq, struct cq_info *cq_info)
> +{
> +	struct rxq_comp *comp = cq_info->cq_desc;
> +	struct queue *q = cq->bound_q;
> +	struct desc_info *desc_info;
> +
> +	if (!color_match(comp->pkt_type_color, cq->done_color))
> +		return false;
> +
> +	/* check for empty queue */
> +	if (q->tail->index == q->head->index)
> +		return false;
> +
> +	desc_info = q->tail;
> +	if (desc_info->index != le16_to_cpu(comp->comp_index))
> +		return false;
> +
> +	q->tail = desc_info->next;
> +
> +	/* clean the related q entry, only one per qc completion */
> +	ionic_rx_clean(q, desc_info, cq_info, desc_info->cb_arg);
> +
> +	desc_info->cb = NULL;
> +	desc_info->cb_arg = NULL;
> +
> +	return true;
> +}
> +
> +static u32 ionic_rx_walk_cq(struct cq *rxcq, u32 limit)
> +{
> +	u32 work_done = 0;
> +
> +	while (ionic_rx_service(rxcq, rxcq->tail)) {
> +		if (rxcq->tail->last)
> +			rxcq->done_color = !rxcq->done_color;
> +		rxcq->tail = rxcq->tail->next;
> +		DEBUG_STATS_CQE_CNT(rxcq);
> +
> +		if (++work_done >= limit)
> +			break;
> +	}
> +
> +	return work_done;
> +}
> +
> +void ionic_rx_flush(struct cq *cq)
> +{
> +	struct ionic_dev *idev = &cq->lif->ionic->idev;
> +	u32 work_done;
> +
> +	work_done = ionic_rx_walk_cq(cq, cq->num_descs);
> +
> +	if (work_done)
> +		ionic_intr_credits(idev->intr_ctrl, cq->bound_intr-
> >index,
> +				   work_done,
> IONIC_INTR_CRED_RESET_COALESCE);
> +}

mixing between rx and tx unrelated flows, very hard to review.
 
> +
> +void ionic_tx_flush(struct cq *cq)
> +{
> +	struct ionic_dev *idev = &cq->lif->ionic->idev;
> +	struct txq_comp *comp = cq->tail->cq_desc;
> +	struct queue *q = cq->bound_q;
> +	struct desc_info *desc_info;
> +	unsigned int work_done = 0;
> +
> +	/* walk the completed cq entries */
> +	while (work_done < cq->num_descs &&
> +	       color_match(comp->color, cq->done_color)) {
> +
> +		/* clean the related q entries, there could be
> +		 * several q entries completed for each cq completion
> +		 */
> +		do {
> +			desc_info = q->tail;
> +			q->tail = desc_info->next;
> +			ionic_tx_clean(q, desc_info, cq->tail,
> +				       desc_info->cb_arg);
> +			desc_info->cb = NULL;
> +			desc_info->cb_arg = NULL;
> +		} while (desc_info->index != le16_to_cpu(comp-
> >comp_index));
> +
> +		if (cq->tail->last)
> +			cq->done_color = !cq->done_color;
> +
> +		cq->tail = cq->tail->next;
> +		comp = cq->tail->cq_desc;
> +		DEBUG_STATS_CQE_CNT(cq);
> +
> +		work_done++;
> +	}
> +
> +	if (work_done)
> +		ionic_intr_credits(idev->intr_ctrl, cq->bound_intr-
> >index,
> +				   work_done, 0);
> +}
> +
> +static struct sk_buff *ionic_rx_skb_alloc(struct queue *q, unsigned
> int len,
> +					  dma_addr_t *dma_addr)
> +{
> +	struct lif *lif = q->lif;
> +	struct net_device *netdev = lif->netdev;
> +	struct device *dev = lif->ionic->dev;
> +	struct rx_stats *stats;
> +	struct sk_buff *skb;
> +
> +	stats = q_to_rx_stats(q);
> +	skb = netdev_alloc_skb_ip_align(netdev, len);
> +	if (!skb) {
> +		net_warn_ratelimited("%s: SKB alloc failed on %s!\n",
> +				     netdev->name, q->name);
> +		stats->alloc_err++;
> +		return NULL;
> +	}
> +
> +	*dma_addr = dma_map_single(dev, skb->data, len,
> DMA_FROM_DEVICE);
> +	if (dma_mapping_error(dev, *dma_addr)) {
> +		dev_kfree_skb(skb);
> +		net_warn_ratelimited("%s: DMA single map failed on
> %s!\n",
> +				     netdev->name, q->name);
> +		stats->dma_map_err++;
> +		return NULL;
> +	}
> +
> +	return skb;
> +}
> +
> +static void ionic_rx_skb_free(struct queue *q, struct sk_buff *skb,
> +			      unsigned int len, dma_addr_t dma_addr)
> +{
> +	struct device *dev = q->lif->ionic->dev;
> +
> +	dma_unmap_single(dev, dma_addr, len, DMA_FROM_DEVICE);
> +	dev_kfree_skb(skb);
> +}
> +
> +#define RX_RING_DOORBELL_STRIDE		((1 << 2) - 1)
> +
> +void ionic_rx_fill(struct queue *q)
> +{
> +	struct net_device *netdev = q->lif->netdev;
> +	struct rxq_desc *desc;
> +	struct sk_buff *skb;
> +	dma_addr_t dma_addr;
> +	bool ring_doorbell;
> +	unsigned int len;
> +	unsigned int i;
> +
> +	len = netdev->mtu + ETH_HLEN;
> +
> +	for (i = ionic_q_space_avail(q); i; i--) {
> +		skb = ionic_rx_skb_alloc(q, len, &dma_addr);
> +		if (!skb)
> +			return;
> +
> +		desc = q->head->desc;
> +		desc->addr = cpu_to_le64(dma_addr);
> +		desc->len = cpu_to_le16(len);
> +		desc->opcode = RXQ_DESC_OPCODE_SIMPLE;
> +
> +		ring_doorbell = ((q->head->index + 1) &
> +				RX_RING_DOORBELL_STRIDE) == 0;
> +
> +		ionic_rxq_post(q, ring_doorbell, ionic_rx_clean, skb);
> +	}
> +}
> +
> +static void ionic_rx_fill_cb(void *arg)
> +{
> +	ionic_rx_fill(arg);
> +}
> +
> +void ionic_rx_empty(struct queue *q)
> +{
> +	struct desc_info *cur = q->tail;
> +	struct rxq_desc *desc;
> +
> +	while (cur != q->head) {
> +		desc = cur->desc;
> +
> +		ionic_rx_skb_free(q, cur->cb_arg, le16_to_cpu(desc-
> >len),
> +				  le64_to_cpu(desc->addr));
> +		cur->cb_arg = NULL;
> +
> +		cur = cur->next;
> +	}
> +}
> +
> +int ionic_rx_napi(struct napi_struct *napi, int budget)
> +{
> +	struct qcq *qcq = napi_to_qcq(napi);
> +	struct cq *rxcq = napi_to_cq(napi);
> +	unsigned int qi = rxcq->bound_q->index;
> +	struct lif *lif = rxcq->bound_q->lif;
> +	struct ionic_dev *idev = &lif->ionic->idev;
> +	struct cq *txcq = &lif->txqcqs[qi].qcq->cq;
> +	u32 work_done = 0;
> +	u32 flags = 0;
> +
> +	ionic_tx_flush(txcq);
> +
> +	work_done = ionic_rx_walk_cq(rxcq, budget);
> +
> +	if (work_done)
> +		ionic_rx_fill_cb(rxcq->bound_q);
> +
> +	if (work_done < budget && napi_complete_done(napi, work_done))
> {
> +		flags |= IONIC_INTR_CRED_UNMASK;
> +		DEBUG_STATS_INTR_REARM(rxcq->bound_intr);
> +	}
> +
> +	if (work_done || flags) {
> +		flags |= IONIC_INTR_CRED_RESET_COALESCE;
> +		ionic_intr_credits(idev->intr_ctrl, rxcq->bound_intr-
> >index,
> +				   work_done, flags);
> +	}
> +
> +	DEBUG_STATS_NAPI_POLL(qcq, work_done);
> +
> +	return work_done;
> +}
> +
> +static dma_addr_t ionic_tx_map_single(struct queue *q, void *data,
> size_t len)
> +{
> +	struct tx_stats *stats = q_to_tx_stats(q);
> +	struct device *dev = q->lif->ionic->dev;
> +	dma_addr_t dma_addr;
> +
> +	dma_addr = dma_map_single(dev, data, len, DMA_TO_DEVICE);
> +	if (dma_mapping_error(dev, dma_addr)) {
> +		net_warn_ratelimited("%s: DMA single map failed on
> %s!\n",
> +				     q->lif->netdev->name, q->name);
> +		stats->dma_map_err++;
> +		return 0;
> +	}
> +	return dma_addr;
> +}
> +
> +static dma_addr_t ionic_tx_map_frag(struct queue *q, const
> skb_frag_t *frag,
> +				    size_t offset, size_t len)
> +{
> +	struct tx_stats *stats = q_to_tx_stats(q);
> +	struct device *dev = q->lif->ionic->dev;
> +	dma_addr_t dma_addr;
> +
> +	dma_addr = skb_frag_dma_map(dev, frag, offset, len,
> DMA_TO_DEVICE);
> +	if (dma_mapping_error(dev, dma_addr)) {
> +		net_warn_ratelimited("%s: DMA frag map failed on
> %s!\n",
> +				     q->lif->netdev->name, q->name);
> +		stats->dma_map_err++;
> +		return 0;
> +	}
> +	return dma_addr;
> +}
> +
> +static void ionic_tx_clean(struct queue *q, struct desc_info
> *desc_info,
> +			   struct cq_info *cq_info, void *cb_arg)
> +{
> +	struct txq_sg_desc *sg_desc = desc_info->sg_desc;
> +	struct txq_sg_elem *elem = sg_desc->elems;
> +	struct tx_stats *stats = q_to_tx_stats(q);
> +	struct txq_desc *desc = desc_info->desc;
> +	struct device *dev = q->lif->ionic->dev;
> +	struct sk_buff *skb = cb_arg;
> +	u8 opcode, flags, nsge;
> +	u16 queue_index;
> +	unsigned int i;
> +	u64 addr;
> +
> +	decode_txq_desc_cmd(le64_to_cpu(desc->cmd),
> +			    &opcode, &flags, &nsge, &addr);
> +
> +	dma_unmap_page(dev, (dma_addr_t)addr,
> +		       le16_to_cpu(desc->len), DMA_TO_DEVICE);
> +	for (i = 0; i < nsge; i++, elem++)
> +		dma_unmap_page(dev, (dma_addr_t)le64_to_cpu(elem-
> >addr),
> +			       le16_to_cpu(elem->len), DMA_TO_DEVICE);
> +
> +	if (skb) {
when skb is null and queue is stopped what do you do ? 
also you should do this at the end of napi cycle.. and not per packet. 

> +		queue_index = skb_get_queue_mapping(skb);
> +		if (unlikely(__netif_subqueue_stopped(q->lif->netdev,
> +						      queue_index))) {
> +			netif_wake_subqueue(q->lif->netdev,
> queue_index);
> +			q->wake++;
> +		}
> +		dev_kfree_skb_any(skb);
> +		stats->clean++;
> +	}
> +}
> +
> +static void ionic_tx_tcp_inner_pseudo_csum(struct sk_buff *skb)
> +{
> +	skb_cow_head(skb, 0);
> +
> +	if (skb->protocol == cpu_to_be16(ETH_P_IP)) {
> +		inner_ip_hdr(skb)->check = 0;
> +		inner_tcp_hdr(skb)->check =
> +			~csum_tcpudp_magic(inner_ip_hdr(skb)->saddr,
> +					   inner_ip_hdr(skb)->daddr,
> +					   0, IPPROTO_TCP, 0);
> +	} else if (skb->protocol == cpu_to_be16(ETH_P_IPV6)) {
> +		inner_tcp_hdr(skb)->check =
> +			~csum_ipv6_magic(&inner_ipv6_hdr(skb)->saddr,
> +					 &inner_ipv6_hdr(skb)->daddr,
> +					 0, IPPROTO_TCP, 0);
> +	}
> +}
> +
> +static void ionic_tx_tcp_pseudo_csum(struct sk_buff *skb)
> +{
> +	skb_cow_head(skb, 0);
> +
> +	if (skb->protocol == cpu_to_be16(ETH_P_IP)) {
> +		ip_hdr(skb)->check = 0;
> +		tcp_hdr(skb)->check =
> +			~csum_tcpudp_magic(ip_hdr(skb)->saddr,
> +					   ip_hdr(skb)->daddr,
> +					   0, IPPROTO_TCP, 0);
> +	} else if (skb->protocol == cpu_to_be16(ETH_P_IPV6)) {
> +		tcp_hdr(skb)->check =
> +			~csum_ipv6_magic(&ipv6_hdr(skb)->saddr,
> +					 &ipv6_hdr(skb)->daddr,
> +					 0, IPPROTO_TCP, 0);
> +	}
> +}
> +
> +static void ionic_tx_tso_post(struct queue *q, struct txq_desc
> *desc,
> +			      struct sk_buff *skb,
> +			      dma_addr_t addr, u8 nsge, u16 len,
> +			      unsigned int hdrlen, unsigned int mss,
> +			      bool outer_csum,
> +			      u16 vlan_tci, bool has_vlan,
> +			      bool start, bool done)
> +{
> +	u8 flags = 0;
> +	u64 cmd;
> +
> +	flags |= has_vlan ? IONIC_TXQ_DESC_FLAG_VLAN : 0;
> +	flags |= outer_csum ? IONIC_TXQ_DESC_FLAG_ENCAP : 0;
> +	flags |= start ? IONIC_TXQ_DESC_FLAG_TSO_SOT : 0;
> +	flags |= done ? IONIC_TXQ_DESC_FLAG_TSO_EOT : 0;
> +
> +	cmd = encode_txq_desc_cmd(IONIC_TXQ_DESC_OPCODE_TSO, flags,
> nsge, addr);
> +	desc->cmd = cpu_to_le64(cmd);
> +	desc->len = cpu_to_le16(len);
> +	desc->vlan_tci = cpu_to_le16(vlan_tci);
> +	desc->hdr_len = cpu_to_le16(hdrlen);
> +	desc->mss = cpu_to_le16(mss);
> +
> +	if (done) {
> +		skb_tx_timestamp(skb);
> +		ionic_txq_post(q, !netdev_xmit_more(), ionic_tx_clean,
> skb);
> +	} else {
> +		ionic_txq_post(q, false, ionic_tx_clean, NULL);
> +	}
> +}
> +
> +static struct txq_desc *ionic_tx_tso_next(struct queue *q,
> +					  struct txq_sg_elem **elem)
> +{
> +	struct txq_sg_desc *sg_desc = q->head->sg_desc;
> +	struct txq_desc *desc = q->head->desc;
> +
> +	*elem = sg_desc->elems;
> +	return desc;
> +}
> +
> +static int ionic_tx_tso(struct queue *q, struct sk_buff *skb)
> +{
> +	struct tx_stats *stats = q_to_tx_stats(q);
> +	struct desc_info *abort = q->head;
> +	struct desc_info *rewind = abort;
> +	unsigned int frag_left = 0;
> +	struct txq_sg_elem *elem;
> +	unsigned int offset = 0;
> +	unsigned int len_left;
> +	struct txq_desc *desc;
> +	dma_addr_t desc_addr;
> +	unsigned int hdrlen;
> +	unsigned int nfrags;
> +	unsigned int seglen;
> +	u64 total_bytes = 0;
> +	u64 total_pkts = 0;
> +	unsigned int left;
> +	unsigned int len;
> +	unsigned int mss;
> +	skb_frag_t *frag;
> +	bool start, done;
> +	bool outer_csum;
> +	bool has_vlan;
> +	u16 desc_len;
> +	u8 desc_nsge;
> +	u16 vlan_tci;
> +	bool encap;
> +
> +	mss = skb_shinfo(skb)->gso_size;
> +	nfrags = skb_shinfo(skb)->nr_frags;
> +	len_left = skb->len - skb_headlen(skb);
> +	outer_csum = (skb_shinfo(skb)->gso_type & SKB_GSO_GRE_CSUM) ||
> +		     (skb_shinfo(skb)->gso_type &
> SKB_GSO_UDP_TUNNEL_CSUM);
> +	has_vlan = !!skb_vlan_tag_present(skb);
> +	vlan_tci = skb_vlan_tag_get(skb);
> +	encap = skb->encapsulation;
> +
> +	/* Preload inner-most TCP csum field with IP pseudo hdr
> +	 * calculated with IP length set to zero.  HW will later
> +	 * add in length to each TCP segment resulting from the TSO.
> +	 */
> +
> +	if (encap)
> +		ionic_tx_tcp_inner_pseudo_csum(skb);
> +	else
> +		ionic_tx_tcp_pseudo_csum(skb);
> +
> +	if (encap)
> +		hdrlen = skb_inner_transport_header(skb) - skb->data +
> +			 inner_tcp_hdrlen(skb);
> +	else
> +		hdrlen = skb_transport_offset(skb) + tcp_hdrlen(skb);
> +
> +	seglen = hdrlen + mss;
> +	left = skb_headlen(skb);
> +
> +	desc = ionic_tx_tso_next(q, &elem);
> +	start = true;
> +
> +	/* Chop skb->data up into desc segments */
> +
> +	while (left > 0) {
> +		len = min(seglen, left);
> +		frag_left = seglen - len;
> +		desc_addr = ionic_tx_map_single(q, skb->data + offset,
> len);
> +		if (!desc_addr)
> +			goto err_out_abort;
> +		desc_len = len;
> +		desc_nsge = 0;
> +		left -= len;
> +		offset += len;
> +		if (nfrags > 0 && frag_left > 0)
> +			continue;
> +		done = (nfrags == 0 && left == 0);
> +		ionic_tx_tso_post(q, desc, skb,
> +				  desc_addr, desc_nsge, desc_len,
> +				  hdrlen, mss,
> +				  outer_csum,
> +				  vlan_tci, has_vlan,
> +				  start, done);
> +		total_pkts++;
> +		total_bytes += start ? len : len + hdrlen;
> +		desc = ionic_tx_tso_next(q, &elem);
> +		start = false;
> +		seglen = mss;
> +	}
> +
> +	/* Chop skb frags into desc segments */
> +
> +	for (frag = skb_shinfo(skb)->frags; len_left; frag++) {
> +		offset = 0;
> +		left = skb_frag_size(frag);
> +		len_left -= left;
> +		nfrags--;
> +		stats->frags++;
> +
> +		while (left > 0) {
> +			if (frag_left > 0) {
> +				len = min(frag_left, left);
> +				frag_left -= len;
> +				elem->addr =
> +				    cpu_to_le64(ionic_tx_map_frag(q,
> frag,
> +								  offse
> t, len));
> +				if (!elem->addr)
> +					goto err_out_abort;
> +				elem->len = cpu_to_le16(len);
> +				elem++;
> +				desc_nsge++;
> +				left -= len;
> +				offset += len;
> +				if (nfrags > 0 && frag_left > 0)
> +					continue;
> +				done = (nfrags == 0 && left == 0);
> +				ionic_tx_tso_post(q, desc, skb,
> desc_addr,
> +						  desc_nsge, desc_len,
> +						  hdrlen, mss,
> outer_csum,
> +						  vlan_tci, has_vlan,
> +						  start, done);
> +				total_pkts++;
> +				total_bytes += start ? len : len +
> hdrlen;
> +				desc = ionic_tx_tso_next(q, &elem);
> +				start = false;
> +			} else {
> +				len = min(mss, left);
> +				frag_left = mss - len;
> +				desc_addr = ionic_tx_map_frag(q, frag,
> +							      offset,
> len);
> +				if (!desc_addr)
> +					goto err_out_abort;
> +				desc_len = len;
> +				desc_nsge = 0;
> +				left -= len;
> +				offset += len;
> +				if (nfrags > 0 && frag_left > 0)
> +					continue;
> +				done = (nfrags == 0 && left == 0);
> +				ionic_tx_tso_post(q, desc, skb,
> desc_addr,
> +						  desc_nsge, desc_len,
> +						  hdrlen, mss,
> outer_csum,
> +						  vlan_tci, has_vlan,
> +						  start, done);
> +				total_pkts++;
> +				total_bytes += start ? len : len +
> hdrlen;
> +				desc = ionic_tx_tso_next(q, &elem);
> +				start = false;
> +			}
> +		}
> +	}
> +
> +	stats->pkts += total_pkts;
> +	stats->bytes += total_bytes;
> +	stats->tso++;
> +
> +	return 0;
> +
> +err_out_abort:
> +	while (rewind->desc != q->head->desc) {
> +		ionic_tx_clean(q, rewind, NULL, NULL);
> +		rewind = rewind->next;
> +	}
> +	q->head = abort;
> +
> +	return -ENOMEM;
> +}
> +
> +static int ionic_tx_calc_csum(struct queue *q, struct sk_buff *skb)
> +{
> +	struct tx_stats *stats = q_to_tx_stats(q);
> +	struct txq_desc *desc = q->head->desc;
> +	dma_addr_t addr;
> +	bool has_vlan;
> +	u8 flags = 0;
> +	bool encap;
> +	u64 cmd;
> +
> +	has_vlan = !!skb_vlan_tag_present(skb);
> +	encap = skb->encapsulation;
> +
> +	addr = ionic_tx_map_single(q, skb->data, skb_headlen(skb));
> +	if (!addr)
> +		return -ENOMEM;
> +
> +	flags |= has_vlan ? IONIC_TXQ_DESC_FLAG_VLAN : 0;
> +	flags |= encap ? IONIC_TXQ_DESC_FLAG_ENCAP : 0;
> +
> +	cmd = encode_txq_desc_cmd(IONIC_TXQ_DESC_OPCODE_CSUM_PARTIAL,
> +				  flags, skb_shinfo(skb)->nr_frags,
> addr);
> +	desc->cmd = cpu_to_le64(cmd);
> +	desc->len = cpu_to_le16(skb_headlen(skb));
> +	desc->vlan_tci = cpu_to_le16(skb_vlan_tag_get(skb));
> +	desc->csum_start = cpu_to_le16(skb_checksum_start_offset(skb));
> +	desc->csum_offset = cpu_to_le16(skb->csum_offset);
> +
> +	if (skb->csum_not_inet)
> +		stats->crc32_csum++;
> +	else
> +		stats->csum++;
> +
> +	return 0;
> +}
> +
> +static int ionic_tx_calc_no_csum(struct queue *q, struct sk_buff
> *skb)
> +{
> +	struct tx_stats *stats = q_to_tx_stats(q);
> +	struct txq_desc *desc = q->head->desc;
> +	dma_addr_t addr;
> +	bool has_vlan;
> +	u8 flags = 0;
> +	bool encap;
> +	u64 cmd;
> +
> +	has_vlan = !!skb_vlan_tag_present(skb);
> +	encap = skb->encapsulation;
> +
> +	addr = ionic_tx_map_single(q, skb->data, skb_headlen(skb));
> +	if (!addr)
> +		return -ENOMEM;
> +
> +	flags |= has_vlan ? IONIC_TXQ_DESC_FLAG_VLAN : 0;
> +	flags |= encap ? IONIC_TXQ_DESC_FLAG_ENCAP : 0;
> +
> +	cmd = encode_txq_desc_cmd(IONIC_TXQ_DESC_OPCODE_CSUM_NONE,
> +				  flags, skb_shinfo(skb)->nr_frags,
> addr);
> +	desc->cmd = cpu_to_le64(cmd);
> +	desc->len = cpu_to_le16(skb_headlen(skb));
> +	desc->vlan_tci = cpu_to_le16(skb_vlan_tag_get(skb));
> +
> +	stats->no_csum++;
> +
> +	return 0;
> +}
> +
> +static int ionic_tx_skb_frags(struct queue *q, struct sk_buff *skb)
> +{
> +	unsigned int len_left = skb->len - skb_headlen(skb);
> +	struct txq_sg_desc *sg_desc = q->head->sg_desc;
> +	struct txq_sg_elem *elem = sg_desc->elems;
> +	struct tx_stats *stats = q_to_tx_stats(q);
> +	dma_addr_t dma_addr;
> +	skb_frag_t *frag;
> +	u16 len;
> +
> +	for (frag = skb_shinfo(skb)->frags; len_left; frag++, elem++) {
> +		len = skb_frag_size(frag);
> +		elem->len = cpu_to_le16(len);
> +		dma_addr = ionic_tx_map_frag(q, frag, 0, len);
> +		if (!dma_addr)
> +			return -ENOMEM;
> +		elem->addr = cpu_to_le64(dma_addr);
> +		len_left -= len;
> +		stats->frags++;
> +	}
> +
> +	return 0;
> +}
> +
> +static int ionic_tx(struct queue *q, struct sk_buff *skb)
> +{
> +	struct tx_stats *stats = q_to_tx_stats(q);
> +	int err;
> +
> +	if (skb->ip_summed == CHECKSUM_PARTIAL)
> +		err = ionic_tx_calc_csum(q, skb);
> +	else
> +		err = ionic_tx_calc_no_csum(q, skb);
> +	if (err)
> +		return err;
> +
> +	err = ionic_tx_skb_frags(q, skb);
> +	if (err)
> +		return err;
> +
> +	skb_tx_timestamp(skb);
> +	stats->pkts++;
> +	stats->bytes += skb->len;
> +
> +	ionic_txq_post(q, !netdev_xmit_more(), ionic_tx_clean, skb);
> +
> +	return 0;
> +}
> +
> +static int ionic_tx_descs_needed(struct queue *q, struct sk_buff
> *skb)
> +{
> +	struct tx_stats *stats = q_to_tx_stats(q);
> +	int err;
> +
> +	/* If TSO, need roundup(skb->len/mss) descs */
> +	if (skb_is_gso(skb))
> +		return (skb->len / skb_shinfo(skb)->gso_size) + 1;
> +
> +	/* If non-TSO, just need 1 desc and nr_frags sg elems */
> +	if (skb_shinfo(skb)->nr_frags <= IONIC_TX_MAX_SG_ELEMS)
> +		return 1;
> +
> +	/* Too many frags, so linearize */
> +	err = skb_linearize(skb);
> +	if (err)
> +		return err;
> +
> +	stats->linearize++;
> +
> +	/* Need 1 desc and zero sg elems */
> +	return 1;
> +}
> +
> +netdev_tx_t ionic_start_xmit(struct sk_buff *skb, struct net_device
> *netdev)
> +{
> +	u16 queue_index = skb_get_queue_mapping(skb);
> +	struct lif *lif = netdev_priv(netdev);
> +	struct queue *q;
> +	int ndescs;
> +	int err;
> +
> +	if (unlikely(!test_bit(LIF_UP, lif->state))) {
> +		dev_kfree_skb(skb);
> +		return NETDEV_TX_OK;
> +	}
> +
> +	if (likely(lif_to_txqcq(lif, queue_index)))
> +		q = lif_to_txq(lif, queue_index);
> +	else
> +		q = lif_to_txq(lif, 0);
> +
> +	ndescs = ionic_tx_descs_needed(q, skb);
> +	if (ndescs < 0)
> +		goto err_out_drop;
> +
> +	if (!ionic_q_has_space(q, ndescs)) {
> +		netif_stop_subqueue(netdev, queue_index);
> +		q->stop++;
> +
> +		/* Might race with ionic_tx_clean, check again */
> +		smp_rmb();
> +		if (ionic_q_has_space(q, ndescs)) {
> +			netif_wake_subqueue(netdev, queue_index);
> +			q->wake++;
> +		} else {
> +			return NETDEV_TX_BUSY;
> +		}
> +	}
> +

You are missing netdev_tx_sent_queue() here
and corresponding netdev_tx_completed_queue() on completion.

> +	if (skb_is_gso(skb))
> +		err = ionic_tx_tso(q, skb);
> +	else
> +		err = ionic_tx(q, skb);
> +
> +	if (err)
> +		goto err_out_drop;
> +
> +	return NETDEV_TX_OK;
> +
> +err_out_drop:
> +	netif_stop_subqueue(netdev, queue_index);
> +	q->stop++;
> +	q->drop++;
> +	dev_kfree_skb(skb);
> +	return NETDEV_TX_OK;
> +}
> diff --git a/drivers/net/ethernet/pensando/ionic/ionic_txrx.h
> b/drivers/net/ethernet/pensando/ionic/ionic_txrx.h
> new file mode 100644
> index 000000000000..2391a0eec65a
> --- /dev/null
> +++ b/drivers/net/ethernet/pensando/ionic/ionic_txrx.h
> @@ -0,0 +1,15 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright(c) 2017 - 2019 Pensando Systems, Inc */
> +
> +#ifndef _IONIC_TXRX_H_
> +#define _IONIC_TXRX_H_
> +
> +void ionic_rx_flush(struct cq *cq);
> +void ionic_tx_flush(struct cq *cq);
> +
> +void ionic_rx_fill(struct queue *q);
> +void ionic_rx_empty(struct queue *q);
> +int ionic_rx_napi(struct napi_struct *napi, int budget);
> +netdev_tx_t ionic_start_xmit(struct sk_buff *skb, struct net_device
> *netdev);
> +
> +#endif /* _IONIC_TXRX_H_ */

^ permalink raw reply

* Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
From: Alexei Starovoitov @ 2019-07-25 23:54 UTC (permalink / raw)
  To: Brian Vazquez
  Cc: Song Liu, Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, open list, Networking, bpf
In-Reply-To: <CABCgpaV7mj5DhFqh44rUNVj5XMAyP+n79LrMobW_=DfvEaS4BQ@mail.gmail.com>

On Thu, Jul 25, 2019 at 04:25:53PM -0700, Brian Vazquez wrote:
> > > > If prev_key is deleted before map_get_next_key(), we get the first key
> > > > again. This is pretty weird.
> > >
> > > Yes, I know. But note that the current scenario happens even for the
> > > old interface (imagine you are walking a map from userspace and you
> > > tried get_next_key the prev_key was removed, you will start again from
> > > the beginning without noticing it).
> > > I tried to sent a patch in the past but I was missing some context:
> > > before NULL was used to get the very first_key the interface relied in
> > > a random (non existent) key to retrieve the first_key in the map, and
> > > I was told what we still have to support that scenario.
> >
> > BPF_MAP_DUMP is slightly different, as you may return the first key
> > multiple times in the same call. Also, BPF_MAP_DUMP is new, so we
> > don't have to support legacy scenarios.
> >
> > Since BPF_MAP_DUMP keeps a list of elements. It is possible to try
> > to look up previous keys. Would something down this direction work?
> 
> I've been thinking about it and I think first we need a way to detect
> that since key was not present we got the first_key instead:
> 
> - One solution I had in mind was to explicitly asked for the first key
> with map_get_next_key(map, NULL, first_key) and while walking the map
> check that map_get_next_key(map, prev_key, key) doesn't return the
> same key. This could be done using memcmp.
> - Discussing with Stan, he mentioned that another option is to support
> a flag in map_get_next_key to let it know that we want an error
> instead of the first_key.
> 
> After detecting the problem we also need to define what we want to do,
> here some options:
> 
> a) Return the error to the caller
> b) Try with previous keys if any (which be limited to the keys that we
> have traversed so far in this dump call)
> c) continue with next entries in the map. array is easy just get the
> next valid key (starting on i+1), but hmap might be difficult since
> starting on the next bucket could potentially skip some keys that were
> concurrently added to the same bucket where key used to be, and
> starting on the same bucket could lead us to return repeated elements.
> 
> Or maybe we could support those 3 cases via flags and let the caller
> decide which one to use?

this type of indecision is the reason why I wasn't excited about
batch dumping in the first place and gave 'soft yes' when Stan
mentioned it during lsf/mm/bpf uconf.
We probably shouldn't do it.
It feels this map_dump makes api more complex and doesn't really
give much benefit to the user other than large map dump becomes faster.
I think we gotta solve this problem differently.


^ permalink raw reply

* Re: [PATCH v4 net-next 15/19] ionic: Add netdev-event handling
From: Saeed Mahameed @ 2019-07-25 23:55 UTC (permalink / raw)
  To: snelson@pensando.io, netdev@vger.kernel.org, davem@davemloft.net
In-Reply-To: <20190722214023.9513-16-snelson@pensando.io>

On Mon, 2019-07-22 at 14:40 -0700, Shannon Nelson wrote:
> When the netdev gets a new name from userland, pass that name
> down to the NIC for internal tracking.
> 

Just out of curiosity, why your NIC internal device/firmware need to
keep tracking of the netdev name ?



^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox