* Re: [PATCH net] ipvlan: use ETH_MAX_MTU as max mtu
From: David Miller @ 2018-06-19 22:54 UTC (permalink / raw)
To: lucien.xin; +Cc: netdev, jarod, maheshb
In-Reply-To: <3a944c2cbcff6df280847770decca4875b671959.1529309757.git.lucien.xin@gmail.com>
From: Xin Long <lucien.xin@gmail.com>
Date: Mon, 18 Jun 2018 16:15:57 +0800
> Similar to the fixes on team and bonding, this restores the ability
> to set an ipvlan device's mtu to anything higher than 1500.
>
> Fixes: 91572088e3fd ("net: use core MTU range checking in core net infra")
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
Applied, thanks.
^ permalink raw reply
* Re: [PATCH v2] net: hamradio: use eth_broadcast_addr
From: David Miller @ 2018-06-19 22:52 UTC (permalink / raw)
To: stefan; +Cc: netdev, linux-kernel
In-Reply-To: <20180617214058.6926-1-stefan@agner.ch>
From: Stefan Agner <stefan@agner.ch>
Date: Sun, 17 Jun 2018 23:40:53 +0200
> The array bpq_eth_addr is only used to get the size of an
> address, whereas the bcast_addr is used to set the broadcast
> address. This leads to a warning when using clang:
> drivers/net/hamradio/bpqether.c:94:13: warning: variable 'bpq_eth_addr' is not
> needed and will not be emitted [-Wunneeded-internal-declaration]
> static char bpq_eth_addr[6];
> ^
>
> Remove both variables and use the common eth_broadcast_addr
> to set the broadcast address.
>
> Signed-off-by: Stefan Agner <stefan@agner.ch>
Applied.
^ permalink raw reply
* Re: [PATCH net] enic: initialize enic->rfs_h.lock in enic_probe
From: David Miller @ 2018-06-19 22:49 UTC (permalink / raw)
To: gvaradar; +Cc: benve, netdev
In-Reply-To: <20180619151524.1326-1-gvaradar@cisco.com>
From: Govindarajulu Varadarajan <gvaradar@cisco.com>
Date: Tue, 19 Jun 2018 08:15:24 -0700
> lockdep spotted that we are using rfs_h.lock in enic_get_rxnfc() without
> initializing. rfs_h.lock is initialized in enic_open(). But ethtool_ops
> can be called when interface is down.
>
> Move enic_rfs_flw_tbl_init to enic_probe.
...
> Signed-off-by: Govindarajulu Varadarajan <gvaradar@cisco.com>
Applied, thanks.
^ permalink raw reply
* Re: [net] bpf, xdp, i40e: fix i40e_build_skb skb reserve and truesize
From: David Miller @ 2018-06-19 22:41 UTC (permalink / raw)
To: jeffrey.t.kirsher
Cc: daniel, netdev, nhorman, sassmann, jogreene, bjorn.topel,
john.fastabend
In-Reply-To: <20180619213354.30986-1-jeffrey.t.kirsher@intel.com>
From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 19 Jun 2018 14:33:54 -0700
> From: Daniel Borkmann <daniel@iogearbox.net>
>
> Using skb_reserve(skb, I40E_SKB_PAD + (xdp->data - xdp->data_hard_start))
> is clearly wrong since I40E_SKB_PAD already points to the offset where
> the original xdp->data was sitting since xdp->data_hard_start is defined
> as xdp->data - i40e_rx_offset(rx_ring) where latter offsets to I40E_SKB_PAD
> when build skb is used.
>
> However, also before cc5b114dcf98 ("bpf, i40e: add meta data support")
> this seems broken since bpf_xdp_adjust_head() helper could have been used
> to alter headroom and enlarge / shrink the frame and with that the assumption
> that the xdp->data remains unchanged does not hold and would push a bogus
> packet to upper stack.
>
> ixgbe got this right in 924708081629 ("ixgbe: add XDP support for pass and
> drop actions"). In any case, fix it by removing the I40E_SKB_PAD from both
> skb_reserve() and truesize calculation.
>
> Fixes: cc5b114dcf98 ("bpf, i40e: add meta data support")
> Fixes: 0c8493d90b6b ("i40e: add XDP support for pass and drop actions")
> Reported-by: Keith Busch <keith.busch@linux.intel.com>
> Reported-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Björn Töpel <bjorn.topel@intel.com>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Tested-by: Keith Busch <keith.busch@linux.intel.com>
> Acked-by: John Fastabend <john.fastabend@gmail.com>
> Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Applied.
^ permalink raw reply
* Re: [PATCH] Revert "net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends"
From: Eric Dumazet @ 2018-06-19 22:40 UTC (permalink / raw)
To: Andreas Schwab, Eric Dumazet
Cc: Mathieu Malaterre, David S. Miller, Eric Dumazet, LKML,
Christophe LEROY, Meelis Roos, netdev, linuxppc-dev
In-Reply-To: <87lgbav24y.fsf@igel.home>
On 06/19/2018 03:32 PM, Andreas Schwab wrote:
> On Jun 19 2018, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>> diff --git a/drivers/net/ethernet/sun/sungem.c b/drivers/net/ethernet/sun/sungem.c
>> index 7a16d40a72d13cf1d522e8a3a396c826fe76f9b9..672d6748ab44f0890e92d5ca55d6ff6834c20dc9 100644
>> --- a/drivers/net/ethernet/sun/sungem.c
>> +++ b/drivers/net/ethernet/sun/sungem.c
>> @@ -60,8 +60,7 @@
>> #include <linux/sungem_phy.h>
>> #include "sungem.h"
>>
>> -/* Stripping FCS is causing problems, disabled for now */
>> -#undef STRIP_FCS
>> +#define STRIP_FCS
>>
>> #define DEFAULT_MSG (NETIF_MSG_DRV | \
>> NETIF_MSG_PROBE | \
>> @@ -435,7 +434,7 @@ static int gem_rxmac_reset(struct gem *gp)
>> writel(desc_dma & 0xffffffff, gp->regs + RXDMA_DBLOW);
>> writel(RX_RING_SIZE - 4, gp->regs + RXDMA_KICK);
>> val = (RXDMA_CFG_BASE | (RX_OFFSET << 10) |
>> - ((14 / 2) << 13) | RXDMA_CFG_FTHRESH_128);
>> + (ETH_HLEN << 13) | RXDMA_CFG_FTHRESH_128);
>> writel(val, gp->regs + RXDMA_CFG);
>> if (readl(gp->regs + GREG_BIFCFG) & GREG_BIFCFG_M66EN)
>> writel(((5 & RXDMA_BLANK_IPKTS) |
>> @@ -857,6 +856,14 @@ static int gem_rx(struct gem *gp, int work_to_do)
>>
>> csum = (__force __sum16)htons((status & RXDCTRL_TCPCSUM) ^ 0xffff);
>> skb->csum = csum_unfold(csum);
>> + {
>> + __wsum rsum = csum_partial(skb->data + ETH_HLEN, len - ETH_HLEN, 0);
>> + if (csum != csum_fold(rsum) && net_ratelimit())
>> + pr_err("sungem wrong csum : %x/%x, len %u bytes\n",
>> + csum, csum_fold(rsum), len);
>> + print_hex_dump(KERN_ERR, "raw data: ", DUMP_PREFIX_OFFSET,
>> + 16, 1, skb->data, len, true);
>> + }
>> skb->ip_summed = CHECKSUM_COMPLETE;
>> skb->protocol = eth_type_trans(skb, gp->dev);
>>
>> @@ -1761,7 +1768,7 @@ static void gem_init_dma(struct gem *gp)
>> writel(0, gp->regs + TXDMA_KICK);
>>
>> val = (RXDMA_CFG_BASE | (RX_OFFSET << 10) |
>> - ((14 / 2) << 13) | RXDMA_CFG_FTHRESH_128);
>> + (ETH_HLEN << 13) | RXDMA_CFG_FTHRESH_128);
>> writel(val, gp->regs + RXDMA_CFG);
>>
>> writel(desc_dma >> 32, gp->regs + RXDMA_DBHI);
>
> With that patch I still get the wrong csum messages, but no longer the
> hw csum failure messages (tested on a PowerMac G5).
>
> [ 662.659767] sungem: sungem wrong csum : 8359/7ca6, len 86 bytes, c0000001fee9cc02
> [ 662.659775] raw data: 00000000: 00 0d 93 43 81 62 d4 3d 7e 4c 48 b7 86 dd 61 01 ...C.b.=~LH...a.
> [ 662.659778] raw data: 00000010: 1c 1e 00 20 06 40 20 01 0a 62 17 11 88 01 00 00 ... .@ ..b......
> [ 662.659780] raw data: 00000020: 00 00 00 00 0a 38 20 01 0a 62 17 11 88 01 00 00 .....8 ..b......
> [ 662.659783] raw data: 00000030: 00 00 00 00 00 07 9a 18 00 16 c1 9a 7e ea ea 44 ............~..D
> [ 662.659785] raw data: 00000040: fb 4a 80 10 05 93 44 08 00 00 01 01 08 0a 59 68 .J....D.......Yh
> [ 662.659788] raw data: 00000050: ba e2 0e bb ac ae ......
>
> Andreas.
>
Note that 8359 and 7ca6 are the same really (a missing ~ to invert csum_partial())
So the bug was that :
1) Driver programmed a wrong start offset for the csum (7 bytes instead of 14 bytes to skip Ethernet Header)
2) FCS was not stripped.
Basically CHECKSUM_COMPLETE support never worked, but this was hidden by the fact that linux stack
had to throw away this CHECKSUM_COMPLETE because the FCS had to be removed.
Thanks !
^ permalink raw reply
* Re: [PATCH] net: nixge: Add __packed attribute to DMA descriptor struct
From: David Miller @ 2018-06-19 22:37 UTC (permalink / raw)
To: f.fainelli; +Cc: mdf, keescook, netdev, linux-kernel
In-Reply-To: <832bebb8-300d-e911-2946-5edfe82dc30a@gmail.com>
From: Florian Fainelli <f.fainelli@gmail.com>
Date: Tue, 19 Jun 2018 10:13:55 -0700
> How could padding be inserted given than all of the structure members
> are naturally aligned (all u32 type). Compiler bug?
Agreed, this looks completely unnecessary.
__packed should only be used when absolutely necessary because using
it generates less efficient code on some architectures.
^ permalink raw reply
* Re: [PATCH 1/5] net: emaclite: Use __func__ instead of hardcoded name
From: Joe Perches @ 2018-06-19 22:37 UTC (permalink / raw)
To: Andy Shevchenko, Radhey Shyam Pandey
Cc: David S. Miller, Andrew Lunn, Michal Simek, netdev,
linux-arm Mailing List, Linux Kernel Mailing List
In-Reply-To: <CAHp75VcKuqnUgynqXvACS7zY5=6aRxqymGE1w3Yh8sshehSm2g@mail.gmail.com>
On Wed, 2018-06-20 at 00:36 +0300, Andy Shevchenko wrote:
> On Mon, Jun 18, 2018 at 2:08 PM, Radhey Shyam Pandey
> <radhey.shyam.pandey@xilinx.com> wrote:
> > Switch hardcoded function name with a reference to __func__ making
> > the code more maintainable. Address below checkpatch warning:
> >
> > WARNING: Prefer using '"%s...", __func__' to using 'xemaclite_mdio_read',
> > this function's name, in a string
> > + "xemaclite_mdio_read(phy_id=%i, reg=%x) == %x\n",
> >
> > WARNING: Prefer using '"%s...", __func__' to using 'xemaclite_mdio_write',
> > this function's name, in a string
> > + "xemaclite_mdio_write(phy_id=%i, reg=%x, val=%x)\n",
> >
>
> For dev_dbg() the __func__ should be completely dropped away.
Not really the same.
dev_dbg without CONFIG_DYNAMIC_DEBUG does not have
the ability to prefix __func__.
^ permalink raw reply
* Re: [PATCH] Revert "net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends"
From: Andreas Schwab @ 2018-06-19 22:32 UTC (permalink / raw)
To: Eric Dumazet
Cc: Mathieu Malaterre, David S. Miller, Eric Dumazet, LKML,
Christophe LEROY, Meelis Roos, netdev, linuxppc-dev
In-Reply-To: <05645b90-d3bc-466d-116f-548f3ee39de9@gmail.com>
On Jun 19 2018, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> diff --git a/drivers/net/ethernet/sun/sungem.c b/drivers/net/ethernet/sun/sungem.c
> index 7a16d40a72d13cf1d522e8a3a396c826fe76f9b9..672d6748ab44f0890e92d5ca55d6ff6834c20dc9 100644
> --- a/drivers/net/ethernet/sun/sungem.c
> +++ b/drivers/net/ethernet/sun/sungem.c
> @@ -60,8 +60,7 @@
> #include <linux/sungem_phy.h>
> #include "sungem.h"
>
> -/* Stripping FCS is causing problems, disabled for now */
> -#undef STRIP_FCS
> +#define STRIP_FCS
>
> #define DEFAULT_MSG (NETIF_MSG_DRV | \
> NETIF_MSG_PROBE | \
> @@ -435,7 +434,7 @@ static int gem_rxmac_reset(struct gem *gp)
> writel(desc_dma & 0xffffffff, gp->regs + RXDMA_DBLOW);
> writel(RX_RING_SIZE - 4, gp->regs + RXDMA_KICK);
> val = (RXDMA_CFG_BASE | (RX_OFFSET << 10) |
> - ((14 / 2) << 13) | RXDMA_CFG_FTHRESH_128);
> + (ETH_HLEN << 13) | RXDMA_CFG_FTHRESH_128);
> writel(val, gp->regs + RXDMA_CFG);
> if (readl(gp->regs + GREG_BIFCFG) & GREG_BIFCFG_M66EN)
> writel(((5 & RXDMA_BLANK_IPKTS) |
> @@ -857,6 +856,14 @@ static int gem_rx(struct gem *gp, int work_to_do)
>
> csum = (__force __sum16)htons((status & RXDCTRL_TCPCSUM) ^ 0xffff);
> skb->csum = csum_unfold(csum);
> + {
> + __wsum rsum = csum_partial(skb->data + ETH_HLEN, len - ETH_HLEN, 0);
> + if (csum != csum_fold(rsum) && net_ratelimit())
> + pr_err("sungem wrong csum : %x/%x, len %u bytes\n",
> + csum, csum_fold(rsum), len);
> + print_hex_dump(KERN_ERR, "raw data: ", DUMP_PREFIX_OFFSET,
> + 16, 1, skb->data, len, true);
> + }
> skb->ip_summed = CHECKSUM_COMPLETE;
> skb->protocol = eth_type_trans(skb, gp->dev);
>
> @@ -1761,7 +1768,7 @@ static void gem_init_dma(struct gem *gp)
> writel(0, gp->regs + TXDMA_KICK);
>
> val = (RXDMA_CFG_BASE | (RX_OFFSET << 10) |
> - ((14 / 2) << 13) | RXDMA_CFG_FTHRESH_128);
> + (ETH_HLEN << 13) | RXDMA_CFG_FTHRESH_128);
> writel(val, gp->regs + RXDMA_CFG);
>
> writel(desc_dma >> 32, gp->regs + RXDMA_DBHI);
With that patch I still get the wrong csum messages, but no longer the
hw csum failure messages (tested on a PowerMac G5).
[ 662.659767] sungem: sungem wrong csum : 8359/7ca6, len 86 bytes, c0000001fee9cc02
[ 662.659775] raw data: 00000000: 00 0d 93 43 81 62 d4 3d 7e 4c 48 b7 86 dd 61 01 ...C.b.=~LH...a.
[ 662.659778] raw data: 00000010: 1c 1e 00 20 06 40 20 01 0a 62 17 11 88 01 00 00 ... .@ ..b......
[ 662.659780] raw data: 00000020: 00 00 00 00 0a 38 20 01 0a 62 17 11 88 01 00 00 .....8 ..b......
[ 662.659783] raw data: 00000030: 00 00 00 00 00 07 9a 18 00 16 c1 9a 7e ea ea 44 ............~..D
[ 662.659785] raw data: 00000040: fb 4a 80 10 05 93 44 08 00 00 01 01 08 0a 59 68 .J....D.......Yh
[ 662.659788] raw data: 00000050: ba e2 0e bb ac ae ......
Andreas.
--
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1
"And now for something completely different."
^ permalink raw reply
* Re: [PATCH v2 0/4] Slience NCSI logging
From: David Miller @ 2018-06-19 22:27 UTC (permalink / raw)
To: joel; +Cc: sam, netdev, joe
In-Reply-To: <20180619053834.12257-1-joel@jms.id.au>
From: Joel Stanley <joel@jms.id.au>
Date: Tue, 19 Jun 2018 15:08:30 +0930
> v2:
> Fix indent issue and commit message based on Joe's feedback
> Add Sam's acks
>
> Here are three changes to silence unnecessary warnings in the ncsi code.
>
> The final patch adds Sam as the maintainer for NCSI.
Series applied.
^ permalink raw reply
* Re: [PATCH net-next,RFC 00/13] New fast forwarding path
From: Daniel Borkmann @ 2018-06-19 22:22 UTC (permalink / raw)
To: Steffen Klassert; +Cc: David Miller, pablo, netfilter-devel, netdev
In-Reply-To: <20180617092304.fviy36bvgnhexwtb@gauss3.secunet.de>
On 06/17/2018 11:23 AM, Steffen Klassert wrote:
[...]
>> Would be curious about
>> the numbers. You'd get implicit batching for the forwarding via devmap
>> as well if you're required to flush it out via different device with
>> XDP_REDIRECT; otherwise XDP_TX of course. Given we have recently
>> integrated helpers for XDP to do a FIB and neighbor lookup from the
>> kernel tables, where it's thus shared and integrated with the rest of
>> the stack and tooling, it would be awesome to get to the same point
>> with xfrm as well. Eyal recently did a start on that for xfrm for tc
>> progs; would be nice to have integration on XDP as well, potentially
>> it might also result in a bigger plus on the forwarding numbers.
>
> It might make sense to intrgrate XDP with xfrm to
> be able to compare numbers etc. But I need a working
> XDP setup and some understanding about it first, what
> could take some time.
Okay, no prob. If you have any questions feel free to shoot an email.
Thanks,
Daniel
^ permalink raw reply
* Re: [PATCH net 0/3] qed*: Fix series.
From: David Miller @ 2018-06-19 22:18 UTC (permalink / raw)
To: sudarsana.kalluru; +Cc: netdev, Ariel.Elior, Michal.Kalderon
In-Reply-To: <20180619045802.24050-1-sudarsana.kalluru@cavium.com>
From: Sudarsana Reddy Kalluru <sudarsana.kalluru@cavium.com>
Date: Mon, 18 Jun 2018 21:57:59 -0700
> From: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com>
>
> The patch series fixes few issues in the qed/qede drivers.
> Please consider applying this series to "net".
Series applied.
^ permalink raw reply
* [PATCH net] enic: initialize enic->rfs_h.lock in enic_probe
From: Govindarajulu Varadarajan @ 2018-06-19 15:15 UTC (permalink / raw)
To: davem, benve, netdev; +Cc: Govindarajulu Varadarajan
lockdep spotted that we are using rfs_h.lock in enic_get_rxnfc() without
initializing. rfs_h.lock is initialized in enic_open(). But ethtool_ops
can be called when interface is down.
Move enic_rfs_flw_tbl_init to enic_probe.
INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
CPU: 18 PID: 1189 Comm: ethtool Not tainted 4.17.0-rc7-devel+ #27
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014
Call Trace:
dump_stack+0x85/0xc0
register_lock_class+0x550/0x560
? __handle_mm_fault+0xa8b/0x1100
__lock_acquire+0x81/0x670
lock_acquire+0xb9/0x1e0
? enic_get_rxnfc+0x139/0x2b0 [enic]
_raw_spin_lock_bh+0x38/0x80
? enic_get_rxnfc+0x139/0x2b0 [enic]
enic_get_rxnfc+0x139/0x2b0 [enic]
ethtool_get_rxnfc+0x8d/0x1c0
dev_ethtool+0x16c8/0x2400
? __mutex_lock+0x64d/0xa00
? dev_load+0x6a/0x150
dev_ioctl+0x253/0x4b0
sock_do_ioctl+0x9a/0x130
sock_ioctl+0x1af/0x350
do_vfs_ioctl+0x8e/0x670
? syscall_trace_enter+0x1e2/0x380
ksys_ioctl+0x60/0x90
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x5a/0x170
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Signed-off-by: Govindarajulu Varadarajan <gvaradar@cisco.com>
---
drivers/net/ethernet/cisco/enic/enic_clsf.c | 3 +--
drivers/net/ethernet/cisco/enic/enic_main.c | 3 ++-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/cisco/enic/enic_clsf.c b/drivers/net/ethernet/cisco/enic/enic_clsf.c
index 973c1fb70d09..99038dfc7fbe 100644
--- a/drivers/net/ethernet/cisco/enic/enic_clsf.c
+++ b/drivers/net/ethernet/cisco/enic/enic_clsf.c
@@ -79,7 +79,6 @@ void enic_rfs_flw_tbl_init(struct enic *enic)
enic->rfs_h.max = enic->config.num_arfs;
enic->rfs_h.free = enic->rfs_h.max;
enic->rfs_h.toclean = 0;
- enic_rfs_timer_start(enic);
}
void enic_rfs_flw_tbl_free(struct enic *enic)
@@ -88,7 +87,6 @@ void enic_rfs_flw_tbl_free(struct enic *enic)
enic_rfs_timer_stop(enic);
spin_lock_bh(&enic->rfs_h.lock);
- enic->rfs_h.free = 0;
for (i = 0; i < (1 << ENIC_RFS_FLW_BITSHIFT); i++) {
struct hlist_head *hhead;
struct hlist_node *tmp;
@@ -99,6 +97,7 @@ void enic_rfs_flw_tbl_free(struct enic *enic)
enic_delfltr(enic, n->fltr_id);
hlist_del(&n->node);
kfree(n);
+ enic->rfs_h.free++;
}
}
spin_unlock_bh(&enic->rfs_h.lock);
diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c b/drivers/net/ethernet/cisco/enic/enic_main.c
index 30d2eaa18c04..e6ad581eadd8 100644
--- a/drivers/net/ethernet/cisco/enic/enic_main.c
+++ b/drivers/net/ethernet/cisco/enic/enic_main.c
@@ -1971,7 +1971,7 @@ static int enic_open(struct net_device *netdev)
vnic_intr_unmask(&enic->intr[i]);
enic_notify_timer_start(enic);
- enic_rfs_flw_tbl_init(enic);
+ enic_rfs_timer_start(enic);
return 0;
@@ -2904,6 +2904,7 @@ static int enic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
timer_setup(&enic->notify_timer, enic_notify_timer, 0);
+ enic_rfs_flw_tbl_init(enic);
enic_set_rx_coal_setting(enic);
INIT_WORK(&enic->reset, enic_reset);
INIT_WORK(&enic->tx_hang_reset, enic_tx_hang_reset);
--
2.17.1
^ permalink raw reply related
* Re: [PATCH] tc, bpf: add option to dump bpf verifier as C program fragment
From: Daniel Borkmann @ 2018-06-19 22:13 UTC (permalink / raw)
To: David Ahern, Jakub Kicinski, Ophir Munk
Cc: netdev, Stephen Hemminger, Thomas Monjalon, Olga Shern, ast
In-Reply-To: <30ac6974-a434-926b-5749-25c594e5841c@gmail.com>
On 06/18/2018 11:44 PM, David Ahern wrote:
> On 6/18/18 2:18 PM, Jakub Kicinski wrote:
>> On Sun, 17 Jun 2018 08:48:41 +0000, Ophir Munk wrote:
>>> Similar to cbpf used within tcpdump utility with a "-d" option to dump
>>> the compiled packet-matching code in a human readable form - tc has the
>>> "verbose" option to dump ebpf verifier output.
>>> Another useful option of cbpf using tcpdump "-dd" option is to dump
>>> packet-matching code a C program fragment. Similar to this - this commit
>>> adds a new tc ebpf option named "code" to dump ebpf verifier as C program
>>> fragment.
>>>
>>> Existing "verbose" option sample output:
>>>
>>> Verifier analysis:
>>> 0: (61) r2 = *(u32 *)(r1 +52)
>>> 1: (18) r3 = 0xdeadbeef
>>> 3: (63) *(u32 *)(r10 -4) = r3
>>> .
>>> .
>>> 11: (63) *(u32 *)(r1 +52) = r2
>>> 12: (18) r0 = 0xffffffff
>>> 14: (95) exit
>>>
>>> New "code" option sample output:
>>>
>>> /* struct bpf_insn cls_q_code[] = { */
>>> {0x61, 2, 1, 52, 0x00000000},
>>> {0x18, 3, 0, 0, 0xdeadbeef},
>>> {0x00, 0, 0, 0, 0x00000000},
>>> .
>>> .
>>> {0x63, 1, 2, 52, 0x00000000},
>>> {0x18, 0, 0, 0, 0xffffffff},
>>> {0x00, 0, 0, 0, 0x00000000},
>>> {0x95, 0, 0, 0, 0x00000000},
>>>
>>> Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
>>
>> Hmm... printing C arrays looks like hacky integration with some C
>> code... Would you not be better served by simply using libbpf in
>> whatever is consuming this output?
>
> I was thinking the same. bpftool would provide options too -- print the
> above, print in macro encodings and verifier. I gave an example of this
> side by side dump at netconf 2.1. Does not look like the slides made it
> online; see attached.
+1, I would also doubt that this adds a lot in terms of debuggability
when you're trying to load an object file with thousands of insns. Better
way would be to use llvm-objdump on the obj file to get to the annotated
disassembly, see also example in [0]. A .o to .c converter is wip for
libbpf/bpftool as presented in [1], which should provide the flexibility
to embedd an obj file.
Cheers,
Daniel
[0] http://cilium.readthedocs.io/en/latest/bpf/#llvm
[1] http://vger.kernel.org/netconf2018_files/AlexeiStarovoitov_netconf2018.pdf page 22
^ permalink raw reply
* Re: [PATCH] Revert "net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends"
From: Eric Dumazet @ 2018-06-19 22:10 UTC (permalink / raw)
To: Andreas Schwab
Cc: Mathieu Malaterre, David S. Miller, Eric Dumazet, LKML,
Christophe LEROY, Meelis Roos, netdev, linuxppc-dev
In-Reply-To: <e1790720-911b-4d04-2215-752ec612e3d7@gmail.com>
On 06/19/2018 01:10 PM, Eric Dumazet wrote:
>
>
> On 06/19/2018 12:10 PM, Andreas Schwab wrote:
>> On Jun 18 2018, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>>> DUMP_PREFIX_ADDRESS might give us more information (say alignment problem, or crossing page boundaries)
>>
>> DUMP_PREFIX_ADDRESS is useless for that purpose.
>>
>> Here are some samples of broken csums:
>>
>> [ 853.849225] sungem: sungem wrong csum : 9886/07be, len 94 bytes, c0000001fa187e02
>> [ 853.849232] raw data: 00000000: 00 0d 93 43 81 62 18 d6 c7 51 b8 1c 08 00 45 10 ...C.b...Q....E.
>> [ 853.849235] raw data: 00000010: 00 4c cb a0 40 00 40 11 d9 97 c0 a8 0a 01 c0 a8 .L..@.@.........
>> [ 853.849237] raw data: 00000020: 0a 07 00 7b 00 7b 00 38 69 e1 1c 03 0c f7 00 00 ...{.{.8i.......
>> [ 853.849240] raw data: 00000030: 08 f0 00 00 15 f0 c0 35 67 67 de d3 ca c9 d9 5b .......5gg.....[
>> [ 853.849242] raw data: 00000040: 1f ff de d3 d2 86 8f 67 fa f2 de d3 d2 86 8f 38 .......g.......8
>> [ 853.849244] raw data: 00000050: 2f ff de d3 d2 86 8f 3b ff ff d1 93 bc 50 /......;.....P
>
> Thanks.
>
> 4 bytes in excess.
>
> Might be the FCS, and it does not look like provided csum has a relation with it.
>
> For some reason FCS stripping was disabled by :
>
> commit 3e32011d4da6424b3bc65b1e1a047e30ac9882c7
> Author: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Date: Mon May 19 09:39:11 2003 -0700
>
> [SUNGEM]: Updates from PowerPC people.
>
> Support more chips and split out all the complex PHY
> handling into a seperate file.
>
>
> Since this NIC never had CHECKSUM_COMPLETE support (since we have to trim each skb,
> thus were forcing ip_summed to CHECKSUM_NONE) we probably should remove it and be happy.
>
> Unless you guys find a way to let the NIC strip the FCS, and double check the csum is a real csum ;)
>
> diff --git a/drivers/net/ethernet/sun/sungem.c b/drivers/net/ethernet/sun/sungem.c
> index 7a16d40a72d13cf1d522e8a3a396c826fe76f9b9..85439308375e95c3854e4a1561697d69ec85399b 100644
> --- a/drivers/net/ethernet/sun/sungem.c
> +++ b/drivers/net/ethernet/sun/sungem.c
> @@ -760,7 +760,6 @@ static int gem_rx(struct gem *gp, int work_to_do)
> struct net_device *dev = gp->dev;
> int entry, drops, work_done = 0;
> u32 done;
> - __sum16 csum;
>
> if (netif_msg_rx_status(gp))
> printk(KERN_DEBUG "%s: rx interrupt, done: %d, rx_new: %d\n",
> @@ -855,9 +854,6 @@ static int gem_rx(struct gem *gp, int work_to_do)
> skb = copy_skb;
> }
>
> - csum = (__force __sum16)htons((status & RXDCTRL_TCPCSUM) ^ 0xffff);
> - skb->csum = csum_unfold(csum);
> - skb->ip_summed = CHECKSUM_COMPLETE;
> skb->protocol = eth_type_trans(skb, gp->dev);
>
> napi_gro_receive(&gp->napi, skb);
>
If you have time, you also could check if changing the suspect (14 / 2) to ETH_HLEN would help
(and also enabling STRIP_FCS)
diff --git a/drivers/net/ethernet/sun/sungem.c b/drivers/net/ethernet/sun/sungem.c
index 7a16d40a72d13cf1d522e8a3a396c826fe76f9b9..672d6748ab44f0890e92d5ca55d6ff6834c20dc9 100644
--- a/drivers/net/ethernet/sun/sungem.c
+++ b/drivers/net/ethernet/sun/sungem.c
@@ -60,8 +60,7 @@
#include <linux/sungem_phy.h>
#include "sungem.h"
-/* Stripping FCS is causing problems, disabled for now */
-#undef STRIP_FCS
+#define STRIP_FCS
#define DEFAULT_MSG (NETIF_MSG_DRV | \
NETIF_MSG_PROBE | \
@@ -435,7 +434,7 @@ static int gem_rxmac_reset(struct gem *gp)
writel(desc_dma & 0xffffffff, gp->regs + RXDMA_DBLOW);
writel(RX_RING_SIZE - 4, gp->regs + RXDMA_KICK);
val = (RXDMA_CFG_BASE | (RX_OFFSET << 10) |
- ((14 / 2) << 13) | RXDMA_CFG_FTHRESH_128);
+ (ETH_HLEN << 13) | RXDMA_CFG_FTHRESH_128);
writel(val, gp->regs + RXDMA_CFG);
if (readl(gp->regs + GREG_BIFCFG) & GREG_BIFCFG_M66EN)
writel(((5 & RXDMA_BLANK_IPKTS) |
@@ -857,6 +856,14 @@ static int gem_rx(struct gem *gp, int work_to_do)
csum = (__force __sum16)htons((status & RXDCTRL_TCPCSUM) ^ 0xffff);
skb->csum = csum_unfold(csum);
+ {
+ __wsum rsum = csum_partial(skb->data + ETH_HLEN, len - ETH_HLEN, 0);
+ if (csum != csum_fold(rsum) && net_ratelimit())
+ pr_err("sungem wrong csum : %x/%x, len %u bytes\n",
+ csum, csum_fold(rsum), len);
+ print_hex_dump(KERN_ERR, "raw data: ", DUMP_PREFIX_OFFSET,
+ 16, 1, skb->data, len, true);
+ }
skb->ip_summed = CHECKSUM_COMPLETE;
skb->protocol = eth_type_trans(skb, gp->dev);
@@ -1761,7 +1768,7 @@ static void gem_init_dma(struct gem *gp)
writel(0, gp->regs + TXDMA_KICK);
val = (RXDMA_CFG_BASE | (RX_OFFSET << 10) |
- ((14 / 2) << 13) | RXDMA_CFG_FTHRESH_128);
+ (ETH_HLEN << 13) | RXDMA_CFG_FTHRESH_128);
writel(val, gp->regs + RXDMA_CFG);
writel(desc_dma >> 32, gp->regs + RXDMA_DBHI);
^ permalink raw reply related
* Re: [PATCH] [Bug 16494] NFS client over TCP hangs due to packet loss
From: Joe Perches @ 2018-06-19 21:56 UTC (permalink / raw)
To: Andy Chittenden, 'Trond Myklebust'
Cc: 'Andrew Morton', 'David Miller', kuznet, pekkas,
jmorris, yoshfuji, kaber, eric.dumazet, William.Allen.Simpson,
gilad, ilpo.jarvinen, netdev, linux-kernel, linux-nfs,
'J. Bruce Fields', 'Neil Brown',
'Chuck Lever', 'Benny Halevy',
'Alexandros Batsakis', Andy Chittenden
In-Reply-To: <4c5fc9f3.487e0e0a.2960.ffffe4ec@mx.google.com>
On Mon, 2010-08-09 at 10:27 +0100, Andy Chittenden wrote:
You really need to check your clock.
Mail sent in the year 2010?
^ permalink raw reply
* Re: [PATCH] PCI: allow drivers to limit the number of VFs to 0
From: Bjorn Helgaas @ 2018-06-19 21:37 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Bjorn Helgaas, linux-pci, netdev, Sathya Perla, Felix Manlunas,
alexander.duyck, john.fastabend, Jacob Keller, Donald Dutile,
oss-drivers, Christoph Hellwig
In-Reply-To: <20180525140223.GA45098@bhelgaas-glaptop.roam.corp.google.com>
On Fri, May 25, 2018 at 09:02:23AM -0500, Bjorn Helgaas wrote:
> On Thu, May 24, 2018 at 06:20:15PM -0700, Jakub Kicinski wrote:
> > Hi Bjorn!
> >
> > On Thu, 24 May 2018 18:57:48 -0500, Bjorn Helgaas wrote:
> > > On Mon, Apr 02, 2018 at 03:46:52PM -0700, Jakub Kicinski wrote:
> > > > Some user space depends on enabling sriov_totalvfs number of VFs
> > > > to not fail, e.g.:
> > > >
> > > > $ cat .../sriov_totalvfs > .../sriov_numvfs
> > > >
> > > > For devices which VF support depends on loaded FW we have the
> > > > pci_sriov_{g,s}et_totalvfs() API. However, this API uses 0 as
> > > > a special "unset" value, meaning drivers can't limit sriov_totalvfs
> > > > to 0. Remove the special values completely and simply initialize
> > > > driver_max_VFs to total_VFs. Then always use driver_max_VFs.
> > > > Add a helper for drivers to reset the VF limit back to total.
> > >
> > > I still can't really make sense out of the changelog.
> > >
> > > I think part of the reason it's confusing is because there are two
> > > things going on:
> > >
> > > 1) You want this:
> > >
> > > pci_sriov_set_totalvfs(dev, 0);
> > > x = pci_sriov_get_totalvfs(dev)
> > >
> > > to return 0 instead of total_VFs. That seems to connect with
> > > your subject line. It means "sriov_totalvfs" in sysfs could be
> > > 0, but I don't know how that is useful (I'm sure it is; just
> > > educate me :))
> >
> > Let me just quote the bug report that got filed on our internal bug
> > tracker :)
> >
> > When testing Juju Openstack with Ubuntu 18.04, enabling SR-IOV causes
> > errors because Juju gets the sriov_totalvfs for SR-IOV-capable device
> > then tries to set that as the sriov_numvfs parameter.
> >
> > For SR-IOV incapable FW, the sriov_totalvfs parameter should be 0,
> > but it's set to max. When FW is switched to flower*, the correct
> > sriov_totalvfs value is presented.
> >
> > * flower is a project name
>
> From the point of view of the PCI core (which knows nothing about
> device firmware and relies on the architected config space described
> by the PCIe spec), this sounds like an erratum: with some firmware
> installed, the device is not capable of SR-IOV, but still advertises
> an SR-IOV capability with "TotalVFs > 0".
>
> Regardless of whether that's an erratum, we do allow PF drivers to use
> pci_sriov_set_totalvfs() to limit the number of VFs that may be
> enabled by writing to the PF's "sriov_numvfs" sysfs file.
>
> But the current implementation does not allow a PF driver to limit VFs
> to 0, and that does seem nonsensical.
>
> > My understanding is OpenStack uses sriov_totalvfs to determine how many
> > VFs can be enabled, looks like this is the code:
> >
> > http://git.openstack.org/cgit/openstack/charm-neutron-openvswitch/tree/hooks/neutron_ovs_utils.py#n464
> >
> > > 2) You're adding the pci_sriov_reset_totalvfs() interface. I'm not
> > > sure what you intend for this. Is *every* driver supposed to
> > > call it in .remove()? Could/should this be done in the core
> > > somehow instead of depending on every driver?
> >
> > Good question, I was just thinking yesterday we may want to call it
> > from the core, but I don't think it's strictly necessary nor always
> > sufficient (we may reload FW without re-probing).
> >
> > We have a device which supports different number of VFs based on the FW
> > loaded. Some legacy FWs does not inform the driver how many VFs it can
> > support, because it supports max. So the flow in our driver is this:
> >
> > load_fw(dev);
> > ...
> > max_vfs = ask_fw_for_max_vfs(dev);
> > if (max_vfs >= 0)
> > return pci_sriov_set_totalvfs(dev, max_vfs);
> > else /* FW didn't tell us, assume max */
> > return pci_sriov_reset_totalvfs(dev);
> >
> > We also reset the max on device remove, but that's not strictly
> > necessary.
> >
> > Other users of pci_sriov_set_totalvfs() always know the value to set
> > the total to (either always get it from FW or it's a constant).
> >
> > If you prefer we can work out the correct max for those legacy cases in
> > the driver as well, although it seemed cleaner to just ask the core,
> > since it already has total_VFs value handy :)
> >
> > > I'm also having a hard time connecting your user-space command example
> > > with the rest of this. Maybe it will make more sense to me tomorrow
> > > after some coffee.
> >
> > OpenStack assumes it will always be able to set sriov_numvfs to
> > sriov_totalvfs, see this 'if':
> >
> > http://git.openstack.org/cgit/openstack/charm-neutron-openvswitch/tree/hooks/neutron_ovs_utils.py#n512
>
> Thanks for educating me. I think there are two issues here that we
> can separate. I extracted the patch below for the first.
>
> The second is the question of resetting driver_max_VFs. I think we
> currently have a general issue in the core:
>
> - load PF driver 1
> - driver calls pci_sriov_set_totalvfs() to reduce driver_max_VFs
> - unload PF driver 1
> - load PF driver 2
>
> Now driver_max_VFs is still stuck at the lower value set by driver 1.
> I don't think that's the way this should work.
>
> I guess this is partly a consequence of setting driver_max_VFs in
> sriov_init(), which is called before driver attach and should only
> depend on hardware characteristics, so it is related to the patch
> below. But I think we should fix it in general, not just for
> netronome.
Hi Jakub, the patch below is in v4.18-rc1 as 8d85a7a4f2c9 ("PCI/IOV:
Allow PF drivers to limit total_VFs to 0"). If there's more we need
to do here, would you mind rebasing what's left to v4.18-rc1 and
reposting it?
> commit 4a338bc6f94b9ad824ac944f5dfc249d6838719c
> Author: Jakub Kicinski <jakub.kicinski@netronome.com>
> Date: Fri May 25 08:18:34 2018 -0500
>
> PCI/IOV: Allow PF drivers to limit total_VFs to 0
>
> Some SR-IOV PF drivers implement .sriov_configure(), which allows
> user-space to enable VFs by writing the desired number of VFs to the sysfs
> "sriov_numvfs" file (see sriov_numvfs_store()).
>
> The PCI core limits the number of VFs to the TotalVFs advertised by the
> device in its SR-IOV capability. The PF driver can limit the number of VFs
> to even fewer (it may have pre-allocated data structures or knowledge of
> device limitations) by calling pci_sriov_set_totalvfs(), but previously it
> could not limit the VFs to 0.
>
> Change pci_sriov_get_totalvfs() so it always respects the VF limit imposed
> by the PF driver, even if the limit is 0.
>
> This sequence:
>
> pci_sriov_set_totalvfs(dev, 0);
> x = pci_sriov_get_totalvfs(dev);
>
> previously set "x" to TotalVFs from the SR-IOV capability. Now it will set
> "x" to 0.
>
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 192b82898a38..d0d73dbbd5ca 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -469,6 +469,7 @@ static int sriov_init(struct pci_dev *dev, int pos)
> iov->nres = nres;
> iov->ctrl = ctrl;
> iov->total_VFs = total;
> + iov->driver_max_VFs = total;
> pci_read_config_word(dev, pos + PCI_SRIOV_VF_DID, &iov->vf_device);
> iov->pgsz = pgsz;
> iov->self = dev;
> @@ -827,10 +828,7 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev)
> if (!dev->is_physfn)
> return 0;
>
> - if (dev->sriov->driver_max_VFs)
> - return dev->sriov->driver_max_VFs;
> -
> - return dev->sriov->total_VFs;
> + return dev->sriov->driver_max_VFs;
> }
> EXPORT_SYMBOL_GPL(pci_sriov_get_totalvfs);
>
^ permalink raw reply
* Re: [PATCH 1/5] net: emaclite: Use __func__ instead of hardcoded name
From: Andy Shevchenko @ 2018-06-19 21:36 UTC (permalink / raw)
To: Radhey Shyam Pandey
Cc: David S. Miller, Andrew Lunn, Michal Simek, netdev,
linux-arm Mailing List, Linux Kernel Mailing List
In-Reply-To: <1529320103-7711-2-git-send-email-radhey.shyam.pandey@xilinx.com>
On Mon, Jun 18, 2018 at 2:08 PM, Radhey Shyam Pandey
<radhey.shyam.pandey@xilinx.com> wrote:
> Switch hardcoded function name with a reference to __func__ making
> the code more maintainable. Address below checkpatch warning:
>
> WARNING: Prefer using '"%s...", __func__' to using 'xemaclite_mdio_read',
> this function's name, in a string
> + "xemaclite_mdio_read(phy_id=%i, reg=%x) == %x\n",
>
> WARNING: Prefer using '"%s...", __func__' to using 'xemaclite_mdio_write',
> this function's name, in a string
> + "xemaclite_mdio_write(phy_id=%i, reg=%x, val=%x)\n",
>
For dev_dbg() the __func__ should be completely dropped away.
--
With Best Regards,
Andy Shevchenko
^ permalink raw reply
* [net] bpf, xdp, i40e: fix i40e_build_skb skb reserve and truesize
From: Jeff Kirsher @ 2018-06-19 21:33 UTC (permalink / raw)
To: davem
Cc: Daniel Borkmann, netdev, nhorman, sassmann, jogreene,
Björn Töpel, John Fastabend, Jeff Kirsher
From: Daniel Borkmann <daniel@iogearbox.net>
Using skb_reserve(skb, I40E_SKB_PAD + (xdp->data - xdp->data_hard_start))
is clearly wrong since I40E_SKB_PAD already points to the offset where
the original xdp->data was sitting since xdp->data_hard_start is defined
as xdp->data - i40e_rx_offset(rx_ring) where latter offsets to I40E_SKB_PAD
when build skb is used.
However, also before cc5b114dcf98 ("bpf, i40e: add meta data support")
this seems broken since bpf_xdp_adjust_head() helper could have been used
to alter headroom and enlarge / shrink the frame and with that the assumption
that the xdp->data remains unchanged does not hold and would push a bogus
packet to upper stack.
ixgbe got this right in 924708081629 ("ixgbe: add XDP support for pass and
drop actions"). In any case, fix it by removing the I40E_SKB_PAD from both
skb_reserve() and truesize calculation.
Fixes: cc5b114dcf98 ("bpf, i40e: add meta data support")
Fixes: 0c8493d90b6b ("i40e: add XDP support for pass and drop actions")
Reported-by: Keith Busch <keith.busch@linux.intel.com>
Reported-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Tested-by: Keith Busch <keith.busch@linux.intel.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 8ffb7454e67c..ed6dbcfd4e96 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2103,9 +2103,8 @@ static struct sk_buff *i40e_build_skb(struct i40e_ring *rx_ring,
unsigned int truesize = i40e_rx_pg_size(rx_ring) / 2;
#else
unsigned int truesize = SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) +
- SKB_DATA_ALIGN(I40E_SKB_PAD +
- (xdp->data_end -
- xdp->data_hard_start));
+ SKB_DATA_ALIGN(xdp->data_end -
+ xdp->data_hard_start);
#endif
struct sk_buff *skb;
@@ -2124,7 +2123,7 @@ static struct sk_buff *i40e_build_skb(struct i40e_ring *rx_ring,
return NULL;
/* update pointers within the skb to store the data */
- skb_reserve(skb, I40E_SKB_PAD + (xdp->data - xdp->data_hard_start));
+ skb_reserve(skb, xdp->data - xdp->data_hard_start);
__skb_put(skb, xdp->data_end - xdp->data);
if (metasize)
skb_metadata_set(skb, metasize);
--
2.17.1
^ permalink raw reply related
* Re: [PATCH bpf-net] bpf: Change bpf_fib_lookup to return lookup status
From: Martin KaFai Lau @ 2018-06-19 21:24 UTC (permalink / raw)
To: David Ahern; +Cc: dsahern, netdev, borkmann, ast, davem
In-Reply-To: <7697c6fd-7871-4d96-bd1c-c81a7d70ec39@gmail.com>
On Tue, Jun 19, 2018 at 02:16:53PM -0600, David Ahern wrote:
> On 6/19/18 10:36 AM, Martin KaFai Lau wrote:
> > On Tue, Jun 19, 2018 at 09:34:28AM -0600, David Ahern wrote:
> >> On 6/19/18 9:25 AM, Martin KaFai Lau wrote:
> >>> On Mon, Jun 18, 2018 at 03:35:25PM -0600, David Ahern wrote:
> >>>> On 6/18/18 2:55 PM, Martin KaFai Lau wrote:
> >>>>>> /* rc > 0 case */
> >>>>>> switch(rc) {
> >>>>>> case BPF_FIB_LKUP_RET_BLACKHOLE:
> >>>>>> case BPF_FIB_LKUP_RET_UNREACHABLE:
> >>>>>> case BPF_FIB_LKUP_RET_PROHIBIT:
> >>>>>> return XDP_DROP;
> >>>>>> }
> >>>>>>
> >>>>>> For the others it becomes a question of do we share why the stack needs
> >>>>>> to be involved? Maybe the program wants to collect stats to show traffic
> >>>>>> patterns that can be improved (BPF_FIB_LKUP_RET_FRAG_NEEDED) or support
> >>>>>> in the kernel needs to be improved (BPF_FIB_LKUP_RET_UNSUPP_LWT) or an
> >>>>>> interface is misconfigured (BPF_FIB_LKUP_RET_FWD_DISABLED).
> >>>>> Thanks for the explanation.
> >>>>>
> >>>>> Agree on the bpf able to collect stats will be useful.
> >>>>>
> >>>>> I am wondering, if a new BPF_FIB_LKUP_RET_XYZ is added later,
> >>>>> how may the old xdp_prog work/not-work? As of now, the return value
> >>>>> is straight forward, FWD, PASS (to stack) or DROP (error).
> >>>>> With this change, the xdp_prog needs to match/switch() the
> >>>>> BPF_FIB_LKUP_RET_* to at least PASS and DROP.
> >>>>
> >>>> IMO, programs should only call XDP_DROP for known reasons - like the 3
> >>>> above. Anything else punt to the stack.
> >>>>
> >>>> If a new RET_XYZ comes along:
> >>>> 1. the new XYZ is a new ACL response where the packet is to be dropped.
> >>>> If the program does not understand XYZ and punts to the stack
> >>>> (recommendation), then a second lookup is done during normal packet
> >>>> processing and the stack drops it.
> >>>>
> >>>> 2. the new XYZ is a new path in the kernel that is unsupported with
> >>>> respect to XDP forwarding, nothing new for the program to do.
> >>>>
> >>>> Either way I would expect stats on BPF_FIB_LKUP_RET_* to give a hint to
> >>>> the program writer.
> >>>>
> >>>> Worst case of punting packets to the stack for any rc != 0 means the
> >>>> stack is doing 2 lookups - 1 in XDP based on its lookup parameters and 1
> >>>> in normal stack processing - to handle the packet.
> >>> Instead of having the xdp_prog to follow the meaning of what RET_SYZ is,
> >>> should the bpf_*_fib_lookup() return value be kept as is such that
> >>> the xdp_prog is clear what to do. The reason can be returned in
> >>> the 'struct bpf_fib_lookup'. The number of reasons can be extended.
> >>> If the xdp_prog does not understand a reason, it still will not
> >>> affect its decision because the return value is clear.
> >>> I think the situation here is similar to regular syscall which usually
> >>> uses -1 to clearly states error and errno to spells out the reason.
> >>>
> >>
> >> I did consider returning the status in struct bpf_fib_lookup. However,
> >> it is 64 bytes and can not be extended without a big performance
> >> penalty, so the only option there is to make an existing entry a union
> >> the most logical of which is the ifindex. It seemed odd to me to have
> >> the result by hidden in the struct as a union on ifindex and returning
> >> the egress index from the function:
> >>
> >> @@ -2625,7 +2636,11 @@ struct bpf_fib_lookup {
> >>
> >> /* total length of packet from network header - used for MTU
> >> check */
> >> __u16 tot_len;
> >> - __u32 ifindex; /* L3 device index for lookup */
> >> +
> >> + union {
> >> + __u32 ifindex; /* input: L3 device index for lookup */
> >> + __u32 result; /* output: one of BPF_FIB_LKUP_RET_* */
> >> + };
> >>
> >>
> >> It seemed more natural to have ifindex stay ifindex and only change
> >> value on return:
> >>
> >> @@ -2625,7 +2639,11 @@ struct bpf_fib_lookup {
> >>
> >> /* total length of packet from network header - used for MTU check */
> >> __u16 tot_len;
> >> - __u32 ifindex; /* L3 device index for lookup */
> >> +
> >> + /* input: L3 device index for lookup
> >> + * output: nexthop device index from FIB lookup
> >> + */
> >> + __u32 ifindex;
> >>
> >> union {
> >> /* inputs to lookup */
> >>
> >>
> >> From a program's perspective:
> >>
> >> rc < 0 -- program is passing incorrect data
> >> rc == 0 -- packet can be forwarded
> >> rc > 0 -- packet can not be forwarded.
> >>
> >> BPF programs are not required to track the LKUP_RET values any more than
> >> a function returning multiple negative values - the caller just checks
> >> rc < 0 means failure. If the program cares it can look at specific
> >> values of rc to see the specific value.
> >>
> >> The same applies with the LKUP_RET values - they are there to provide
> >> insight into why the packet is not forwarded directly if the program
> >> cares to know why.
> > hmm...ic. My concern is, the prog can interpret rc > 0 (in this patch) to be
> > drop vs pass (although we can advise them in bpf.h to always pass if it does
> > not understand a rc but it is not a strong contract), it may catch people
> > a surprise if a xdp_prog suddenly drops everything when running in a
> > newer kernel where the upper stack can actually handle it.
> >
> > while the current behavior (i.e. before this patch, rc == 0) is always pass
> > to the stack.
> >
> > I think at least comments should be put in the enum such that
> > the xdp/tc_prog should expect the enum could be extended later, so
> > the suggested behavior should be a pass for unknown LKUP_RET and let
> > the stack to decide.
> >
>
> All APIs with enums have the inherent quality that more can be added.
> Nothing about rc > 0 says it is ok to drop the packet and nothing in the
> documentation says it is ok to drop the packet. The program author needs
> to look at the extra information provided by the rc. Specific values are
> a hint that yes the packet can be dropped; others merely say 'packet
> needs help from the stack' with a reason why it needs help.
Right, I understand there are values that hint drop (as your earlier example)
and there are values that hint pass to stack. and either set of these values
could be extended later.
>
> My intention is to allow the XDP program to understand FIB based ACLs.
> To that end a fib lookup result of blackhole, unreachable and prohibit
> needs to be returned to the xdp program with the effective summary of
> "can drop in the xdp program".
>
> If the other return codes are going to cause confusion then less shorten
> the list:
>
> enum {
> BPF_FIB_LKUP_RET_SUCCESS, /* packet is to be forwareded */
> BPF_FIB_LKUP_RET_CAN_DROP, /* XDP program can drop the packet */
> BPF_FIB_LKUP_RET_NEED_STACK, /* packet needs full stack assist */
> };
>
> But, that still does not solve the problem of rc > 0 means xdp program
> can drop the packet, but then that is not the intention of rc > 0. The
> intention is only "here's more information about why this packet can not
> be forwarded at this layer"
ok. Please spell out this intention in bpf.h so that the writer will not
default to DROP for the reason that it does not understand.
^ permalink raw reply
* Re: [PATCH net] ipvlan: call dev_change_flags when reset ipvlan mode
From: Cong Wang @ 2018-06-19 21:10 UTC (permalink / raw)
To: Hangbin Liu
Cc: Linux Kernel Network Developers, Stefano Brivio, Paolo Abeni,
David Miller, Mahesh Bandewar
In-Reply-To: <1529330677-15328-1-git-send-email-liuhangbin@gmail.com>
On Mon, Jun 18, 2018 at 7:04 AM, Hangbin Liu <liuhangbin@gmail.com> wrote:
> @@ -94,10 +95,13 @@ static int ipvlan_set_port_mode(struct ipvl_port *port, u16 nval)
> mdev->l3mdev_ops = NULL;
> }
> list_for_each_entry(ipvlan, &port->ipvlans, pnode) {
> + flags = ipvlan->dev->flags;
> if (nval == IPVLAN_MODE_L3 || nval == IPVLAN_MODE_L3S)
> - ipvlan->dev->flags |= IFF_NOARP;
> + dev_change_flags(ipvlan->dev,
> + flags | IFF_NOARP);
> else
> - ipvlan->dev->flags &= ~IFF_NOARP;
> + dev_change_flags(ipvlan->dev,
> + flags & ~IFF_NOARP);
You need to check the return value of dev_change_flags().
^ permalink raw reply
* Re: iproute2 won't compile without AF_VSOCK
From: Steve Wise @ 2018-06-19 20:41 UTC (permalink / raw)
To: David Ahern, Stephen Hemminger; +Cc: netdev
In-Reply-To: <d6f6b60d-6dad-fd0c-cdc5-b560fe6da12d@gmail.com>
On 6/19/2018 3:29 PM, David Ahern wrote:
> On 6/19/18 2:27 PM, David Ahern wrote:
>> On 6/19/18 9:47 AM, Stephen Hemminger wrote:
>>> On Tue, 19 Jun 2018 10:17:45 -0500
>>> Steve Wise <swise@opengridcomputing.com> wrote:
>>>
>>>> Hey David,
>>>>
>>>> I'm trying to compile the latest iproute2 on an RHEL-7.3 distro, and it
>>>> fails to compile because AF_VSOCK is not defined. Should this
>>>> functionality be a configure option to disable it on older distros?
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Steve.
>>>>
>>>> ----
>>>>
>>>> misc
>>>> CC ss.o
>>>> ss.c:301:27: error: ‘AF_VSOCK’ undeclared here (not in a function)
>>>> .families = FAMILY_MASK(AF_VSOCK),
>>>> ^
>>>> ss.c:252:46: note: in definition of macro ‘FAMILY_MASK’
>>>> #define FAMILY_MASK(family) ((uint64_t)1 << (family))
>>>> ^
>>>> ss.c:334:2: error: array index in initializer not of integer type
>>>> [AF_VSOCK] = {
>>>> ^
>>>> ss.c:334:2: error: (near initialization for ‘default_afs’)
>>>> make[1]: *** [ss.o] Error 1
>>>> make: *** [all] Error 2
>>>>
>>> Probably should just add an #ifdef to takeout that if not present
>>>
>> Most userspace tools have a compat header for cases like this.
>>
>> #ifndef AF_VSOCK
>> #define AF_VSOCK 40
>> #endif
>>
> Add the above to include//utils.h; AF_MPLS is already there.
I'll send out a patch.
Thanks,
Steve.
^ permalink raw reply
* Re: [bug] cxgb4: vrf stopped working with cxgb4 card
From: AMG Zollner Robert @ 2018-06-19 20:32 UTC (permalink / raw)
To: Ganesh Goudar; +Cc: office, dsa, netdev
In-Reply-To: <20180619132425.GA6576@chelsio.com>
On 19.06.2018 16:24, Ganesh Goudar wrote:
> On Monday, June 06/11/18, 2018 at 14:47:55 +0530, Ganesh Goudar wrote:
>> On Saturday, June 06/09/18, 2018 at 18:47:55 -0600, David Ahern wrote:
>>> Ganesh:
>>>
>>> On 6/4/18 9:03 AM, AMG Zollner Robert wrote:
>>>> I have noticed that vrf is not working with kernel v4.15.0 but was
>>>> working with v4.13.0 when using cxgb4 Chelsio driver (T520-cr)
>>>>
>>>> Setup:
>>>> Two metal servers with a T520-cr card each, directly connected without a
>>>> switch in between.
>>>>
>>>> SVR1 only ipfwd SVR2 with vrf
>>>> .----------------------------. .----------------------------------.
>>>> | | | |
>>>> | 192.168.8.1 [ ens2f4]--|---------|--[ens1f4] 192.168.8.2 |
>>>> | 192.168.9.1 [ens2f4d1]--|---------|--<ens1f4d1> 192.168.9.2 VRF=10 |
>>>> `----------------------------' `----------------------------------'
>>>>
>>>> When vrf is not working there are no error messages (dmesg or iproute
>>>> commands), tcpdump on the interface (SVR2.ens1f4d1) enslaved in vrf 10
>>>> shows packets(arp req/reply) coming in and going out, but outgoing
>>>> packets(arp reply) do not reach the other server SVR1.ens2f4d1
>>>>
>>>>
>>>> Bisect:
>>>> Found this commit to be the problem after doing a git bisect between
>>>> v4.13..v4.15:
>>>>
>>>> commit ba581f77df23c8ee70b372966e69cf10bc5453d8
>>>> Author: Ganesh Goudar <ganeshgr@chelsio.com>
>>>> Date: Sat Sep 23 16:07:28 2017 +0530
>>>>
>>>> cxgb4: do DCB state reset in couple of places
>>>>
>>>> reset the driver's DCB state in couple of places
>>>> where it was missing.
>>>
>>> Are you working on a fix for this or should a revert of the above patch
>>> be sent?
>> Will look into it and fix/revert it soon, Thanks for responding to Robert.
>>>
>>>
>>>>
>>>>
>>>> A bisect step was considered good when:
>>>> - successful ping from SVR1 to SVR2.ens1f4d1 vrf interface
>>>> - successful ping from SVR2 global to SVR2 vrf interface trough SVR1(l3
>>>> forwarding) (this check was redundant,both tests fail or pass simultaneous)
>>>>
>>>> The problem is still present on recent kernels also, checked v4.16.0 and
>>>> v4.17.rc7
>>>>
>>>> Disabling DCB for the card support fixes the problem ( Compiling kernel
>>>> with "CONFIG_CHELSIO_T4_DCB=n")
>>>>
>>>>
>>>>
>>>> This is my first time reporting a bug to the linux kernel and hope I
>>>> have included the right amount of information. Please let me know if I
>>>> have missed something.
>>>>
>>>>
>>>>
>>>> Thank you,
>>>> Zollner Robert
>>>>
>>>>
>>>> --------
>>>> Logs:
>>>>
>>>> VRF configured using folowing commands:
>>>>
>>>> #!/bin/sh
>>>>
>>>> CHDEV=ens1f4
>>>> VRF=vrf-recv
>>>>
>>>> sysctl -w net.ipv4.tcp_l3mdev_accept=1
>>>> sysctl -w net.ipv4.udp_l3mdev_accept=1
>>>> sysctl -w net.ipv4.conf.all.accept_local=1
>>>>
>>>> ifconfig ${CHDEV} 192.168.8.2/24
>>>> ifconfig ${CHDEV}d1 192.168.9.2/24
>>>>
>>>> ip link add ${VRF} type vrf table 10
>>>> ip link set dev ${VRF} up
>>>>
>>>> ip rule add pref 32765 table local
>>>> ip rule del pref 0
>>>>
>>>> ip route add table 10 unreachable default metric 4278198272
>>>>
>>>> ip link set dev ${CHDEV}d1 master ${VRF}
>>>>
>>>> ip route add table 10 default via 192.168.9.1
>>>> ip route add 192.168.9.0/24 via 192.168.8.1
>>>>
>>>>
>>>>
>>>>
>>>
> -netdev, Please feel free to add if needed.
>
> Hi Robert,
>
> My knowledge of VRF is very limited, I am trying to bring
> up VRF setup, I just wanted to check if you are doing anything
> related DCB and also please let me know how did you setup SRV1.
>
> Thanks
>
Hello Ganesh,
SRV1 is just forwarding(l3) between the two physical ports of the
T520-CR card.
ifconfig ens1f4 192.168.8.1/24
ifconfig ens1f4d1 192.168.9.1/24
sysctl -w net.ipv4.ip_forward=1
- No VRF is configured on this box
- DCB is also not used
SVR2 is using VRF and is configured with the script inlined in the first
email.
^ permalink raw reply
* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Michael S. Tsirkin @ 2018-06-19 20:32 UTC (permalink / raw)
To: Cornelia Huck
Cc: Siwei Liu, Samudrala, Sridhar, Alexander Duyck, virtio-dev,
aaron.f.brown, Jiri Pirko, Jakub Kicinski, Netdev, qemu-devel,
virtualization, konrad.wilk, boris.ostrovsky, Joao Martins,
Venu Busireddy, vijay.balakrishna
In-Reply-To: <20180619125453.2d2dfb2d.cohuck@redhat.com>
On Tue, Jun 19, 2018 at 12:54:53PM +0200, Cornelia Huck wrote:
> Sorry about dragging mainframes into this, but this will only work for
> homogenous device coupling, not for heterogenous. Consider my vfio-pci
> + virtio-net-ccw example again: The guest cannot find out that the two
> belong together by checking some group ID, it has to either use the MAC
> or some needs-to-be-architectured property.
>
> Alternatively, we could propose that mechanism as pci-only, which means
> we can rely on mechanisms that won't necessarily work on non-pci
> transports. (FWIW, I don't see a use case for using vfio-ccw to pass
> through a network card anytime in the near future, due to the nature of
> network cards currently in use on s390.)
That's what it boils down to, yes. If there's need to have this for
non-pci devices, then we should put it in config space.
Cornelia, what do you think?
--
MST
^ permalink raw reply
* Re: iproute2 won't compile without AF_VSOCK
From: David Ahern @ 2018-06-19 20:29 UTC (permalink / raw)
To: Stephen Hemminger, Steve Wise; +Cc: netdev
In-Reply-To: <f8c68944-57b6-18a6-aa50-c019baa2a63b@gmail.com>
On 6/19/18 2:27 PM, David Ahern wrote:
> On 6/19/18 9:47 AM, Stephen Hemminger wrote:
>> On Tue, 19 Jun 2018 10:17:45 -0500
>> Steve Wise <swise@opengridcomputing.com> wrote:
>>
>>> Hey David,
>>>
>>> I'm trying to compile the latest iproute2 on an RHEL-7.3 distro, and it
>>> fails to compile because AF_VSOCK is not defined. Should this
>>> functionality be a configure option to disable it on older distros?
>>>
>>>
>>> Thanks,
>>>
>>> Steve.
>>>
>>> ----
>>>
>>> misc
>>> CC ss.o
>>> ss.c:301:27: error: ‘AF_VSOCK’ undeclared here (not in a function)
>>> .families = FAMILY_MASK(AF_VSOCK),
>>> ^
>>> ss.c:252:46: note: in definition of macro ‘FAMILY_MASK’
>>> #define FAMILY_MASK(family) ((uint64_t)1 << (family))
>>> ^
>>> ss.c:334:2: error: array index in initializer not of integer type
>>> [AF_VSOCK] = {
>>> ^
>>> ss.c:334:2: error: (near initialization for ‘default_afs’)
>>> make[1]: *** [ss.o] Error 1
>>> make: *** [all] Error 2
>>>
>>
>> Probably should just add an #ifdef to takeout that if not present
>>
>
> Most userspace tools have a compat header for cases like this.
>
> #ifndef AF_VSOCK
> #define AF_VSOCK 40
> #endif
>
Add the above to include//utils.h; AF_MPLS is already there.
^ permalink raw reply
* Re: iproute2 won't compile without AF_VSOCK
From: David Ahern @ 2018-06-19 20:27 UTC (permalink / raw)
To: Stephen Hemminger, Steve Wise; +Cc: netdev
In-Reply-To: <20180619084732.0de6d75d@xeon-e3>
On 6/19/18 9:47 AM, Stephen Hemminger wrote:
> On Tue, 19 Jun 2018 10:17:45 -0500
> Steve Wise <swise@opengridcomputing.com> wrote:
>
>> Hey David,
>>
>> I'm trying to compile the latest iproute2 on an RHEL-7.3 distro, and it
>> fails to compile because AF_VSOCK is not defined. Should this
>> functionality be a configure option to disable it on older distros?
>>
>>
>> Thanks,
>>
>> Steve.
>>
>> ----
>>
>> misc
>> CC ss.o
>> ss.c:301:27: error: ‘AF_VSOCK’ undeclared here (not in a function)
>> .families = FAMILY_MASK(AF_VSOCK),
>> ^
>> ss.c:252:46: note: in definition of macro ‘FAMILY_MASK’
>> #define FAMILY_MASK(family) ((uint64_t)1 << (family))
>> ^
>> ss.c:334:2: error: array index in initializer not of integer type
>> [AF_VSOCK] = {
>> ^
>> ss.c:334:2: error: (near initialization for ‘default_afs’)
>> make[1]: *** [ss.o] Error 1
>> make: *** [all] Error 2
>>
>
> Probably should just add an #ifdef to takeout that if not present
>
Most userspace tools have a compat header for cases like this.
#ifndef AF_VSOCK
#define AF_VSOCK 40
#endif
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox