Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next,RFC 00/13] New fast forwarding path
From: Daniel Borkmann @ 2018-06-19 22:22 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: David Miller, pablo, netfilter-devel, netdev
In-Reply-To: <20180617092304.fviy36bvgnhexwtb@gauss3.secunet.de>

On 06/17/2018 11:23 AM, Steffen Klassert wrote:
[...]
>> Would be curious about
>> the numbers. You'd get implicit batching for the forwarding via devmap
>> as well if you're required to flush it out via different device with
>> XDP_REDIRECT; otherwise XDP_TX of course. Given we have recently
>> integrated helpers for XDP to do a FIB and neighbor lookup from the
>> kernel tables, where it's thus shared and integrated with the rest of
>> the stack and tooling, it would be awesome to get to the same point
>> with xfrm as well. Eyal recently did a start on that for xfrm for tc
>> progs; would be nice to have integration on XDP as well, potentially
>> it might also result in a bigger plus on the forwarding numbers.
> 
> It might make sense to intrgrate XDP with xfrm to
> be able to compare numbers etc. But I need a working
> XDP setup and some understanding about it first, what
> could take some time.

Okay, no prob. If you have any questions feel free to shoot an email.

Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH net 0/3] qed*: Fix series.
From: David Miller @ 2018-06-19 22:18 UTC (permalink / raw)
  To: sudarsana.kalluru; +Cc: netdev, Ariel.Elior, Michal.Kalderon
In-Reply-To: <20180619045802.24050-1-sudarsana.kalluru@cavium.com>

From: Sudarsana Reddy Kalluru <sudarsana.kalluru@cavium.com>
Date: Mon, 18 Jun 2018 21:57:59 -0700

> From: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com>
> 
> The patch series fixes few issues in the qed/qede drivers.
> Please consider applying this series to "net".

Series applied.

^ permalink raw reply

* [PATCH net] enic: initialize enic->rfs_h.lock in enic_probe
From: Govindarajulu Varadarajan @ 2018-06-19 15:15 UTC (permalink / raw)
  To: davem, benve, netdev; +Cc: Govindarajulu Varadarajan

lockdep spotted that we are using rfs_h.lock in enic_get_rxnfc() without
initializing. rfs_h.lock is initialized in enic_open(). But ethtool_ops
can be called when interface is down.

Move enic_rfs_flw_tbl_init to enic_probe.

INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
CPU: 18 PID: 1189 Comm: ethtool Not tainted 4.17.0-rc7-devel+ #27
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014
Call Trace:
dump_stack+0x85/0xc0
register_lock_class+0x550/0x560
? __handle_mm_fault+0xa8b/0x1100
__lock_acquire+0x81/0x670
lock_acquire+0xb9/0x1e0
?  enic_get_rxnfc+0x139/0x2b0 [enic]
_raw_spin_lock_bh+0x38/0x80
? enic_get_rxnfc+0x139/0x2b0 [enic]
enic_get_rxnfc+0x139/0x2b0 [enic]
ethtool_get_rxnfc+0x8d/0x1c0
dev_ethtool+0x16c8/0x2400
? __mutex_lock+0x64d/0xa00
? dev_load+0x6a/0x150
dev_ioctl+0x253/0x4b0
sock_do_ioctl+0x9a/0x130
sock_ioctl+0x1af/0x350
do_vfs_ioctl+0x8e/0x670
? syscall_trace_enter+0x1e2/0x380
ksys_ioctl+0x60/0x90
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x5a/0x170
entry_SYSCALL_64_after_hwframe+0x49/0xbe

Signed-off-by: Govindarajulu Varadarajan <gvaradar@cisco.com>
---
 drivers/net/ethernet/cisco/enic/enic_clsf.c | 3 +--
 drivers/net/ethernet/cisco/enic/enic_main.c | 3 ++-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/cisco/enic/enic_clsf.c b/drivers/net/ethernet/cisco/enic/enic_clsf.c
index 973c1fb70d09..99038dfc7fbe 100644
--- a/drivers/net/ethernet/cisco/enic/enic_clsf.c
+++ b/drivers/net/ethernet/cisco/enic/enic_clsf.c
@@ -79,7 +79,6 @@ void enic_rfs_flw_tbl_init(struct enic *enic)
 	enic->rfs_h.max = enic->config.num_arfs;
 	enic->rfs_h.free = enic->rfs_h.max;
 	enic->rfs_h.toclean = 0;
-	enic_rfs_timer_start(enic);
 }
 
 void enic_rfs_flw_tbl_free(struct enic *enic)
@@ -88,7 +87,6 @@ void enic_rfs_flw_tbl_free(struct enic *enic)
 
 	enic_rfs_timer_stop(enic);
 	spin_lock_bh(&enic->rfs_h.lock);
-	enic->rfs_h.free = 0;
 	for (i = 0; i < (1 << ENIC_RFS_FLW_BITSHIFT); i++) {
 		struct hlist_head *hhead;
 		struct hlist_node *tmp;
@@ -99,6 +97,7 @@ void enic_rfs_flw_tbl_free(struct enic *enic)
 			enic_delfltr(enic, n->fltr_id);
 			hlist_del(&n->node);
 			kfree(n);
+			enic->rfs_h.free++;
 		}
 	}
 	spin_unlock_bh(&enic->rfs_h.lock);
diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c b/drivers/net/ethernet/cisco/enic/enic_main.c
index 30d2eaa18c04..e6ad581eadd8 100644
--- a/drivers/net/ethernet/cisco/enic/enic_main.c
+++ b/drivers/net/ethernet/cisco/enic/enic_main.c
@@ -1971,7 +1971,7 @@ static int enic_open(struct net_device *netdev)
 		vnic_intr_unmask(&enic->intr[i]);
 
 	enic_notify_timer_start(enic);
-	enic_rfs_flw_tbl_init(enic);
+	enic_rfs_timer_start(enic);
 
 	return 0;
 
@@ -2904,6 +2904,7 @@ static int enic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 
 	timer_setup(&enic->notify_timer, enic_notify_timer, 0);
 
+	enic_rfs_flw_tbl_init(enic);
 	enic_set_rx_coal_setting(enic);
 	INIT_WORK(&enic->reset, enic_reset);
 	INIT_WORK(&enic->tx_hang_reset, enic_tx_hang_reset);
-- 
2.17.1

^ permalink raw reply related

* Re: [PATCH] tc, bpf: add option to dump bpf verifier as C program fragment
From: Daniel Borkmann @ 2018-06-19 22:13 UTC (permalink / raw)
  To: David Ahern, Jakub Kicinski, Ophir Munk
  Cc: netdev, Stephen Hemminger, Thomas Monjalon, Olga Shern, ast
In-Reply-To: <30ac6974-a434-926b-5749-25c594e5841c@gmail.com>

On 06/18/2018 11:44 PM, David Ahern wrote:
> On 6/18/18 2:18 PM, Jakub Kicinski wrote:
>> On Sun, 17 Jun 2018 08:48:41 +0000, Ophir Munk wrote:
>>> Similar to cbpf used within tcpdump utility with a "-d" option to dump
>>> the compiled packet-matching code in a human readable form - tc has the
>>> "verbose" option to dump ebpf verifier output.
>>> Another useful option of cbpf using tcpdump "-dd" option is to dump
>>> packet-matching code a C program fragment. Similar to this - this commit
>>> adds a new tc ebpf option named "code" to dump ebpf verifier as C program
>>> fragment.
>>>
>>> Existing "verbose" option sample output:
>>>
>>> Verifier analysis:
>>> 0: (61) r2 = *(u32 *)(r1 +52)
>>> 1: (18) r3 = 0xdeadbeef
>>> 3: (63) *(u32 *)(r10 -4) = r3
>>> .
>>> .
>>> 11: (63) *(u32 *)(r1 +52) = r2
>>> 12: (18) r0 = 0xffffffff
>>> 14: (95) exit
>>>
>>> New "code" option sample output:
>>>
>>> /* struct bpf_insn cls_q_code[] = { */
>>> {0x61,    2,    1,       52, 0x00000000},
>>> {0x18,    3,    0,        0, 0xdeadbeef},
>>> {0x00,    0,    0,        0, 0x00000000},
>>> .
>>> .
>>> {0x63,    1,    2,       52, 0x00000000},
>>> {0x18,    0,    0,        0, 0xffffffff},
>>> {0x00,    0,    0,        0, 0x00000000},
>>> {0x95,    0,    0,        0, 0x00000000},
>>>
>>> Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
>>
>> Hmm... printing C arrays looks like hacky integration with some C
>> code...  Would you not be better served by simply using libbpf in
>> whatever is consuming this output?
> 
> I was thinking the same. bpftool would provide options too -- print the
> above, print in macro encodings and verifier. I gave an example of this
> side by side dump at netconf 2.1. Does not look like the slides made it
> online; see attached.

+1, I would also doubt that this adds a lot in terms of debuggability
when you're trying to load an object file with thousands of insns. Better
way would be to use llvm-objdump on the obj file to get to the annotated
disassembly, see also example in [0]. A .o to .c converter is wip for
libbpf/bpftool as presented in [1], which should provide the flexibility
to embedd an obj file.

Cheers,
Daniel

  [0] http://cilium.readthedocs.io/en/latest/bpf/#llvm
  [1] http://vger.kernel.org/netconf2018_files/AlexeiStarovoitov_netconf2018.pdf page 22

^ permalink raw reply

* Re: [PATCH] Revert "net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends"
From: Eric Dumazet @ 2018-06-19 22:10 UTC (permalink / raw)
  To: Andreas Schwab
  Cc: Mathieu Malaterre, David S. Miller, Eric Dumazet, LKML,
	Christophe LEROY, Meelis Roos, netdev, linuxppc-dev
In-Reply-To: <e1790720-911b-4d04-2215-752ec612e3d7@gmail.com>



On 06/19/2018 01:10 PM, Eric Dumazet wrote:
> 
> 
> On 06/19/2018 12:10 PM, Andreas Schwab wrote:
>> On Jun 18 2018, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>>> DUMP_PREFIX_ADDRESS might give us more information (say alignment problem, or crossing page boundaries)
>>
>> DUMP_PREFIX_ADDRESS is useless for that purpose.
>>
>> Here are some samples of broken csums:
>>
>> [  853.849225] sungem: sungem wrong csum : 9886/07be, len 94 bytes, c0000001fa187e02
>> [  853.849232] raw data: 00000000: 00 0d 93 43 81 62 18 d6 c7 51 b8 1c 08 00 45 10  ...C.b...Q....E.
>> [  853.849235] raw data: 00000010: 00 4c cb a0 40 00 40 11 d9 97 c0 a8 0a 01 c0 a8  .L..@.@.........
>> [  853.849237] raw data: 00000020: 0a 07 00 7b 00 7b 00 38 69 e1 1c 03 0c f7 00 00  ...{.{.8i.......
>> [  853.849240] raw data: 00000030: 08 f0 00 00 15 f0 c0 35 67 67 de d3 ca c9 d9 5b  .......5gg.....[
>> [  853.849242] raw data: 00000040: 1f ff de d3 d2 86 8f 67 fa f2 de d3 d2 86 8f 38  .......g.......8
>> [  853.849244] raw data: 00000050: 2f ff de d3 d2 86 8f 3b ff ff d1 93 bc 50        /......;.....P
> 
> Thanks.
> 
> 4 bytes in excess.
> 
> Might be the FCS, and it does not look like provided csum has a relation with it.
> 
> For some reason FCS stripping was disabled by :
> 
> commit 3e32011d4da6424b3bc65b1e1a047e30ac9882c7
> Author: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Date:   Mon May 19 09:39:11 2003 -0700
> 
>     [SUNGEM]: Updates from PowerPC people.
>     
>     Support more chips and split out all the complex PHY
>     handling into a seperate file.
> 
> 
> Since this NIC never had CHECKSUM_COMPLETE support (since we have to trim each skb,
> thus were forcing ip_summed to CHECKSUM_NONE) we probably should remove it and be happy.
> 
> Unless you guys find a way to let the NIC strip the FCS, and double check the csum is a real csum ;)
> 
> diff --git a/drivers/net/ethernet/sun/sungem.c b/drivers/net/ethernet/sun/sungem.c
> index 7a16d40a72d13cf1d522e8a3a396c826fe76f9b9..85439308375e95c3854e4a1561697d69ec85399b 100644
> --- a/drivers/net/ethernet/sun/sungem.c
> +++ b/drivers/net/ethernet/sun/sungem.c
> @@ -760,7 +760,6 @@ static int gem_rx(struct gem *gp, int work_to_do)
>         struct net_device *dev = gp->dev;
>         int entry, drops, work_done = 0;
>         u32 done;
> -       __sum16 csum;
>  
>         if (netif_msg_rx_status(gp))
>                 printk(KERN_DEBUG "%s: rx interrupt, done: %d, rx_new: %d\n",
> @@ -855,9 +854,6 @@ static int gem_rx(struct gem *gp, int work_to_do)
>                         skb = copy_skb;
>                 }
>  
> -               csum = (__force __sum16)htons((status & RXDCTRL_TCPCSUM) ^ 0xffff);
> -               skb->csum = csum_unfold(csum);
> -               skb->ip_summed = CHECKSUM_COMPLETE;
>                 skb->protocol = eth_type_trans(skb, gp->dev);
>  
>                 napi_gro_receive(&gp->napi, skb);
> 


If you have time, you also could check if changing the suspect (14 / 2) to ETH_HLEN would help
(and also enabling STRIP_FCS)



diff --git a/drivers/net/ethernet/sun/sungem.c b/drivers/net/ethernet/sun/sungem.c
index 7a16d40a72d13cf1d522e8a3a396c826fe76f9b9..672d6748ab44f0890e92d5ca55d6ff6834c20dc9 100644
--- a/drivers/net/ethernet/sun/sungem.c
+++ b/drivers/net/ethernet/sun/sungem.c
@@ -60,8 +60,7 @@
 #include <linux/sungem_phy.h>
 #include "sungem.h"
 
-/* Stripping FCS is causing problems, disabled for now */
-#undef STRIP_FCS
+#define STRIP_FCS
 
 #define DEFAULT_MSG    (NETIF_MSG_DRV          | \
                         NETIF_MSG_PROBE        | \
@@ -435,7 +434,7 @@ static int gem_rxmac_reset(struct gem *gp)
        writel(desc_dma & 0xffffffff, gp->regs + RXDMA_DBLOW);
        writel(RX_RING_SIZE - 4, gp->regs + RXDMA_KICK);
        val = (RXDMA_CFG_BASE | (RX_OFFSET << 10) |
-              ((14 / 2) << 13) | RXDMA_CFG_FTHRESH_128);
+              (ETH_HLEN << 13) | RXDMA_CFG_FTHRESH_128);
        writel(val, gp->regs + RXDMA_CFG);
        if (readl(gp->regs + GREG_BIFCFG) & GREG_BIFCFG_M66EN)
                writel(((5 & RXDMA_BLANK_IPKTS) |
@@ -857,6 +856,14 @@ static int gem_rx(struct gem *gp, int work_to_do)
 
                csum = (__force __sum16)htons((status & RXDCTRL_TCPCSUM) ^ 0xffff);
                skb->csum = csum_unfold(csum);
+               {
+               __wsum rsum = csum_partial(skb->data + ETH_HLEN, len - ETH_HLEN, 0);
+               if (csum != csum_fold(rsum) && net_ratelimit())
+                       pr_err("sungem wrong csum : %x/%x, len %u bytes\n",
+                               csum, csum_fold(rsum), len);
+                       print_hex_dump(KERN_ERR, "raw data: ", DUMP_PREFIX_OFFSET,
+                                           16, 1, skb->data, len, true);
+               }
                skb->ip_summed = CHECKSUM_COMPLETE;
                skb->protocol = eth_type_trans(skb, gp->dev);
 
@@ -1761,7 +1768,7 @@ static void gem_init_dma(struct gem *gp)
        writel(0, gp->regs + TXDMA_KICK);
 
        val = (RXDMA_CFG_BASE | (RX_OFFSET << 10) |
-              ((14 / 2) << 13) | RXDMA_CFG_FTHRESH_128);
+              (ETH_HLEN << 13) | RXDMA_CFG_FTHRESH_128);
        writel(val, gp->regs + RXDMA_CFG);
 
        writel(desc_dma >> 32, gp->regs + RXDMA_DBHI);

^ permalink raw reply related

* Re: [PATCH] [Bug 16494] NFS client over TCP hangs due to packet loss
From: Joe Perches @ 2018-06-19 21:56 UTC (permalink / raw)
  To: Andy Chittenden, 'Trond Myklebust'
  Cc: 'Andrew Morton', 'David Miller', kuznet, pekkas,
	jmorris, yoshfuji, kaber, eric.dumazet, William.Allen.Simpson,
	gilad, ilpo.jarvinen, netdev, linux-kernel, linux-nfs,
	'J. Bruce Fields', 'Neil Brown',
	'Chuck Lever', 'Benny Halevy',
	'Alexandros Batsakis', Andy Chittenden
In-Reply-To: <4c5fc9f3.487e0e0a.2960.ffffe4ec@mx.google.com>

On Mon, 2010-08-09 at 10:27 +0100, Andy Chittenden wrote:

You really need to check your clock.
Mail sent in the year 2010?

^ permalink raw reply

* Re: [PATCH] PCI: allow drivers to limit the number of VFs to 0
From: Bjorn Helgaas @ 2018-06-19 21:37 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Bjorn Helgaas, linux-pci, netdev, Sathya Perla, Felix Manlunas,
	alexander.duyck, john.fastabend, Jacob Keller, Donald Dutile,
	oss-drivers, Christoph Hellwig
In-Reply-To: <20180525140223.GA45098@bhelgaas-glaptop.roam.corp.google.com>

On Fri, May 25, 2018 at 09:02:23AM -0500, Bjorn Helgaas wrote:
> On Thu, May 24, 2018 at 06:20:15PM -0700, Jakub Kicinski wrote:
> > Hi Bjorn!
> > 
> > On Thu, 24 May 2018 18:57:48 -0500, Bjorn Helgaas wrote:
> > > On Mon, Apr 02, 2018 at 03:46:52PM -0700, Jakub Kicinski wrote:
> > > > Some user space depends on enabling sriov_totalvfs number of VFs
> > > > to not fail, e.g.:
> > > > 
> > > > $ cat .../sriov_totalvfs > .../sriov_numvfs
> > > > 
> > > > For devices which VF support depends on loaded FW we have the
> > > > pci_sriov_{g,s}et_totalvfs() API.  However, this API uses 0 as
> > > > a special "unset" value, meaning drivers can't limit sriov_totalvfs
> > > > to 0.  Remove the special values completely and simply initialize
> > > > driver_max_VFs to total_VFs.  Then always use driver_max_VFs.
> > > > Add a helper for drivers to reset the VF limit back to total.  
> > > 
> > > I still can't really make sense out of the changelog.
> > >
> > > I think part of the reason it's confusing is because there are two
> > > things going on:
> > > 
> > >   1) You want this:
> > >   
> > >        pci_sriov_set_totalvfs(dev, 0);
> > >        x = pci_sriov_get_totalvfs(dev) 
> > > 
> > >      to return 0 instead of total_VFs.  That seems to connect with
> > >      your subject line.  It means "sriov_totalvfs" in sysfs could be
> > >      0, but I don't know how that is useful (I'm sure it is; just
> > >      educate me :))
> > 
> > Let me just quote the bug report that got filed on our internal bug
> > tracker :)
> > 
> >   When testing Juju Openstack with Ubuntu 18.04, enabling SR-IOV causes
> >   errors because Juju gets the sriov_totalvfs for SR-IOV-capable device
> >   then tries to set that as the sriov_numvfs parameter.
> > 
> >   For SR-IOV incapable FW, the sriov_totalvfs parameter should be 0, 
> >   but it's set to max.  When FW is switched to flower*, the correct 
> >   sriov_totalvfs value is presented.
> > 
> > * flower is a project name
> 
> From the point of view of the PCI core (which knows nothing about
> device firmware and relies on the architected config space described
> by the PCIe spec), this sounds like an erratum: with some firmware
> installed, the device is not capable of SR-IOV, but still advertises
> an SR-IOV capability with "TotalVFs > 0".
> 
> Regardless of whether that's an erratum, we do allow PF drivers to use
> pci_sriov_set_totalvfs() to limit the number of VFs that may be
> enabled by writing to the PF's "sriov_numvfs" sysfs file.
> 
> But the current implementation does not allow a PF driver to limit VFs
> to 0, and that does seem nonsensical.
> 
> > My understanding is OpenStack uses sriov_totalvfs to determine how many
> > VFs can be enabled, looks like this is the code:
> > 
> > http://git.openstack.org/cgit/openstack/charm-neutron-openvswitch/tree/hooks/neutron_ovs_utils.py#n464
> > 
> > >   2) You're adding the pci_sriov_reset_totalvfs() interface.  I'm not
> > >      sure what you intend for this.  Is *every* driver supposed to
> > >      call it in .remove()?  Could/should this be done in the core
> > >      somehow instead of depending on every driver?
> > 
> > Good question, I was just thinking yesterday we may want to call it
> > from the core, but I don't think it's strictly necessary nor always
> > sufficient (we may reload FW without re-probing).
> > 
> > We have a device which supports different number of VFs based on the FW
> > loaded.  Some legacy FWs does not inform the driver how many VFs it can
> > support, because it supports max.  So the flow in our driver is this:
> > 
> > load_fw(dev);
> > ...
> > max_vfs = ask_fw_for_max_vfs(dev);
> > if (max_vfs >= 0)
> > 	return pci_sriov_set_totalvfs(dev, max_vfs);
> > else /* FW didn't tell us, assume max */
> > 	return pci_sriov_reset_totalvfs(dev); 
> > 
> > We also reset the max on device remove, but that's not strictly
> > necessary.
> > 
> > Other users of pci_sriov_set_totalvfs() always know the value to set
> > the total to (either always get it from FW or it's a constant).
> > 
> > If you prefer we can work out the correct max for those legacy cases in
> > the driver as well, although it seemed cleaner to just ask the core,
> > since it already has total_VFs value handy :)
> > 
> > > I'm also having a hard time connecting your user-space command example
> > > with the rest of this.  Maybe it will make more sense to me tomorrow
> > > after some coffee.
> > 
> > OpenStack assumes it will always be able to set sriov_numvfs to
> > sriov_totalvfs, see this 'if':
> > 
> > http://git.openstack.org/cgit/openstack/charm-neutron-openvswitch/tree/hooks/neutron_ovs_utils.py#n512
> 
> Thanks for educating me.  I think there are two issues here that we
> can separate.  I extracted the patch below for the first.
> 
> The second is the question of resetting driver_max_VFs.  I think we
> currently have a general issue in the core:
> 
>   - load PF driver 1
>   - driver calls pci_sriov_set_totalvfs() to reduce driver_max_VFs
>   - unload PF driver 1
>   - load PF driver 2
> 
> Now driver_max_VFs is still stuck at the lower value set by driver 1.
> I don't think that's the way this should work.
> 
> I guess this is partly a consequence of setting driver_max_VFs in
> sriov_init(), which is called before driver attach and should only
> depend on hardware characteristics, so it is related to the patch
> below.  But I think we should fix it in general, not just for
> netronome.

Hi Jakub, the patch below is in v4.18-rc1 as 8d85a7a4f2c9 ("PCI/IOV:
Allow PF drivers to limit total_VFs to 0").  If there's more we need
to do here, would you mind rebasing what's left to v4.18-rc1 and
reposting it?

> commit 4a338bc6f94b9ad824ac944f5dfc249d6838719c
> Author: Jakub Kicinski <jakub.kicinski@netronome.com>
> Date:   Fri May 25 08:18:34 2018 -0500
> 
>     PCI/IOV: Allow PF drivers to limit total_VFs to 0
>     
>     Some SR-IOV PF drivers implement .sriov_configure(), which allows
>     user-space to enable VFs by writing the desired number of VFs to the sysfs
>     "sriov_numvfs" file (see sriov_numvfs_store()).
>     
>     The PCI core limits the number of VFs to the TotalVFs advertised by the
>     device in its SR-IOV capability.  The PF driver can limit the number of VFs
>     to even fewer (it may have pre-allocated data structures or knowledge of
>     device limitations) by calling pci_sriov_set_totalvfs(), but previously it
>     could not limit the VFs to 0.
>     
>     Change pci_sriov_get_totalvfs() so it always respects the VF limit imposed
>     by the PF driver, even if the limit is 0.
>     
>     This sequence:
>     
>       pci_sriov_set_totalvfs(dev, 0);
>       x = pci_sriov_get_totalvfs(dev);
>     
>     previously set "x" to TotalVFs from the SR-IOV capability.  Now it will set
>     "x" to 0.
>     
>     Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
>     Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> 
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 192b82898a38..d0d73dbbd5ca 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -469,6 +469,7 @@ static int sriov_init(struct pci_dev *dev, int pos)
>  	iov->nres = nres;
>  	iov->ctrl = ctrl;
>  	iov->total_VFs = total;
> +	iov->driver_max_VFs = total;
>  	pci_read_config_word(dev, pos + PCI_SRIOV_VF_DID, &iov->vf_device);
>  	iov->pgsz = pgsz;
>  	iov->self = dev;
> @@ -827,10 +828,7 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev)
>  	if (!dev->is_physfn)
>  		return 0;
>  
> -	if (dev->sriov->driver_max_VFs)
> -		return dev->sriov->driver_max_VFs;
> -
> -	return dev->sriov->total_VFs;
> +	return dev->sriov->driver_max_VFs;
>  }
>  EXPORT_SYMBOL_GPL(pci_sriov_get_totalvfs);
>  

^ permalink raw reply

* Re: [PATCH 1/5] net: emaclite: Use __func__ instead of hardcoded name
From: Andy Shevchenko @ 2018-06-19 21:36 UTC (permalink / raw)
  To: Radhey Shyam Pandey
  Cc: David S. Miller, Andrew Lunn, Michal Simek, netdev,
	linux-arm Mailing List, Linux Kernel Mailing List
In-Reply-To: <1529320103-7711-2-git-send-email-radhey.shyam.pandey@xilinx.com>

On Mon, Jun 18, 2018 at 2:08 PM, Radhey Shyam Pandey
<radhey.shyam.pandey@xilinx.com> wrote:
> Switch hardcoded function name with a reference to __func__ making
> the code more maintainable. Address below checkpatch warning:
>
> WARNING: Prefer using '"%s...", __func__' to using 'xemaclite_mdio_read',
> this function's name, in a string
> +               "xemaclite_mdio_read(phy_id=%i, reg=%x) == %x\n",
>
> WARNING: Prefer using '"%s...", __func__' to using 'xemaclite_mdio_write',
> this function's name, in a string
> +               "xemaclite_mdio_write(phy_id=%i, reg=%x, val=%x)\n",
>

For dev_dbg() the __func__ should be completely dropped away.

-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply

* [net] bpf, xdp, i40e: fix i40e_build_skb skb reserve and truesize
From: Jeff Kirsher @ 2018-06-19 21:33 UTC (permalink / raw)
  To: davem
  Cc: Daniel Borkmann, netdev, nhorman, sassmann, jogreene,
	Björn Töpel, John Fastabend, Jeff Kirsher

From: Daniel Borkmann <daniel@iogearbox.net>

Using skb_reserve(skb, I40E_SKB_PAD + (xdp->data - xdp->data_hard_start))
is clearly wrong since I40E_SKB_PAD already points to the offset where
the original xdp->data was sitting since xdp->data_hard_start is defined
as xdp->data - i40e_rx_offset(rx_ring) where latter offsets to I40E_SKB_PAD
when build skb is used.

However, also before cc5b114dcf98 ("bpf, i40e: add meta data support")
this seems broken since bpf_xdp_adjust_head() helper could have been used
to alter headroom and enlarge / shrink the frame and with that the assumption
that the xdp->data remains unchanged does not hold and would push a bogus
packet to upper stack.

ixgbe got this right in 924708081629 ("ixgbe: add XDP support for pass and
drop actions"). In any case, fix it by removing the I40E_SKB_PAD from both
skb_reserve() and truesize calculation.

Fixes: cc5b114dcf98 ("bpf, i40e: add meta data support")
Fixes: 0c8493d90b6b ("i40e: add XDP support for pass and drop actions")
Reported-by: Keith Busch <keith.busch@linux.intel.com>
Reported-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Tested-by: Keith Busch <keith.busch@linux.intel.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 8ffb7454e67c..ed6dbcfd4e96 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2103,9 +2103,8 @@ static struct sk_buff *i40e_build_skb(struct i40e_ring *rx_ring,
 	unsigned int truesize = i40e_rx_pg_size(rx_ring) / 2;
 #else
 	unsigned int truesize = SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) +
-				SKB_DATA_ALIGN(I40E_SKB_PAD +
-					       (xdp->data_end -
-						xdp->data_hard_start));
+				SKB_DATA_ALIGN(xdp->data_end -
+					       xdp->data_hard_start);
 #endif
 	struct sk_buff *skb;
 
@@ -2124,7 +2123,7 @@ static struct sk_buff *i40e_build_skb(struct i40e_ring *rx_ring,
 		return NULL;
 
 	/* update pointers within the skb to store the data */
-	skb_reserve(skb, I40E_SKB_PAD + (xdp->data - xdp->data_hard_start));
+	skb_reserve(skb, xdp->data - xdp->data_hard_start);
 	__skb_put(skb, xdp->data_end - xdp->data);
 	if (metasize)
 		skb_metadata_set(skb, metasize);
-- 
2.17.1

^ permalink raw reply related

* Re: [PATCH bpf-net] bpf: Change bpf_fib_lookup to return lookup status
From: Martin KaFai Lau @ 2018-06-19 21:24 UTC (permalink / raw)
  To: David Ahern; +Cc: dsahern, netdev, borkmann, ast, davem
In-Reply-To: <7697c6fd-7871-4d96-bd1c-c81a7d70ec39@gmail.com>

On Tue, Jun 19, 2018 at 02:16:53PM -0600, David Ahern wrote:
> On 6/19/18 10:36 AM, Martin KaFai Lau wrote:
> > On Tue, Jun 19, 2018 at 09:34:28AM -0600, David Ahern wrote:
> >> On 6/19/18 9:25 AM, Martin KaFai Lau wrote:
> >>> On Mon, Jun 18, 2018 at 03:35:25PM -0600, David Ahern wrote:
> >>>> On 6/18/18 2:55 PM, Martin KaFai Lau wrote:
> >>>>>> 	/* rc > 0 case */
> >>>>>> 	switch(rc) {
> >>>>>> 	case BPF_FIB_LKUP_RET_BLACKHOLE:
> >>>>>> 	case BPF_FIB_LKUP_RET_UNREACHABLE:
> >>>>>> 	case BPF_FIB_LKUP_RET_PROHIBIT:
> >>>>>> 		return XDP_DROP;
> >>>>>> 	}
> >>>>>>
> >>>>>> For the others it becomes a question of do we share why the stack needs
> >>>>>> to be involved? Maybe the program wants to collect stats to show traffic
> >>>>>> patterns that can be improved (BPF_FIB_LKUP_RET_FRAG_NEEDED) or support
> >>>>>> in the kernel needs to be improved (BPF_FIB_LKUP_RET_UNSUPP_LWT) or an
> >>>>>> interface is misconfigured (BPF_FIB_LKUP_RET_FWD_DISABLED).
> >>>>> Thanks for the explanation.
> >>>>>
> >>>>> Agree on the bpf able to collect stats will be useful.
> >>>>>
> >>>>> I am wondering, if a new BPF_FIB_LKUP_RET_XYZ is added later,
> >>>>> how may the old xdp_prog work/not-work?  As of now, the return value
> >>>>> is straight forward, FWD, PASS (to stack) or DROP (error).
> >>>>> With this change, the xdp_prog needs to match/switch() the
> >>>>> BPF_FIB_LKUP_RET_* to at least PASS and DROP.
> >>>>
> >>>> IMO, programs should only call XDP_DROP for known reasons - like the 3
> >>>> above. Anything else punt to the stack.
> >>>>
> >>>> If a new RET_XYZ comes along:
> >>>> 1. the new XYZ is a new ACL response where the packet is to be dropped.
> >>>> If the program does not understand XYZ and punts to the stack
> >>>> (recommendation), then a second lookup is done during normal packet
> >>>> processing and the stack drops it.
> >>>>
> >>>> 2. the new XYZ is a new path in the kernel that is unsupported with
> >>>> respect to XDP forwarding, nothing new for the program to do.
> >>>>
> >>>> Either way I would expect stats on BPF_FIB_LKUP_RET_* to give a hint to
> >>>> the program writer.
> >>>>
> >>>> Worst case of punting packets to the stack for any rc != 0 means the
> >>>> stack is doing 2 lookups - 1 in XDP based on its lookup parameters and 1
> >>>> in normal stack processing - to handle the packet.
> >>> Instead of having the xdp_prog to follow the meaning of what RET_SYZ is,
> >>> should the bpf_*_fib_lookup() return value be kept as is such that
> >>> the xdp_prog is clear what to do.  The reason can be returned in
> >>> the 'struct bpf_fib_lookup'.  The number of reasons can be extended.
> >>> If the xdp_prog does not understand a reason, it still will not
> >>> affect its decision because the return value is clear.
> >>> I think the situation here is similar to regular syscall which usually
> >>> uses -1 to clearly states error and errno to spells out the reason.
> >>>
> >>
> >> I did consider returning the status in struct bpf_fib_lookup. However,
> >> it is 64 bytes and can not be extended without a big performance
> >> penalty, so the only option there is to make an existing entry a union
> >> the most logical of which is the ifindex. It seemed odd to me to have
> >> the result by hidden in the struct as a union on ifindex and returning
> >> the egress index from the function:
> >>
> >> @@ -2625,7 +2636,11 @@ struct bpf_fib_lookup {
> >>
> >>         /* total length of packet from network header - used for MTU
> >> check */
> >>         __u16   tot_len;
> >> -       __u32   ifindex;  /* L3 device index for lookup */
> >> +
> >> +       union {
> >> +               __u32   ifindex;  /* input: L3 device index for lookup */
> >> +               __u32   result;   /* output: one of BPF_FIB_LKUP_RET_* */
> >> +       };
> >>
> >>
> >> It seemed more natural to have ifindex stay ifindex and only change
> >> value on return:
> >>
> >> @@ -2625,7 +2639,11 @@ struct bpf_fib_lookup {
> >>
> >>  	/* total length of packet from network header - used for MTU check */
> >>  	__u16	tot_len;
> >> -	__u32	ifindex;  /* L3 device index for lookup */
> >> +
> >> +	/* input: L3 device index for lookup
> >> +	 * output: nexthop device index from FIB lookup
> >> +	 */
> >> +	__u32	ifindex;
> >>
> >>  	union {
> >>  		/* inputs to lookup */
> >>
> >>
> >> From a program's perspective:
> >>
> >> rc < 0  -- program is passing incorrect data
> >> rc == 0 -- packet can be forwarded
> >> rc > 0  -- packet can not be forwarded.
> >>
> >> BPF programs are not required to track the LKUP_RET values any more than
> >> a function returning multiple negative values - the caller just checks
> >> rc < 0 means failure. If the program cares it can look at specific
> >> values of rc to see the specific value.
> >>
> >> The same applies with the LKUP_RET values - they are there to provide
> >> insight into why the packet is not forwarded directly if the program
> >> cares to know why.
> > hmm...ic. My concern is, the prog can interpret rc > 0 (in this patch) to be
> > drop vs pass (although we can advise them in bpf.h to always pass if it does
> > not understand a rc but it is not a strong contract),  it may catch people
> > a surprise if a xdp_prog suddenly drops everything when running in a
> > newer kernel where the upper stack can actually handle it.
> > 
> > while the current behavior (i.e. before this patch, rc == 0) is always pass
> > to the stack.
> > 
> > I think at least comments should be put in the enum such that
> > the xdp/tc_prog should expect the enum could be extended later, so
> > the suggested behavior should be a pass for unknown LKUP_RET and let
> > the stack to decide.
> > 
> 
> All APIs with enums have the inherent quality that more can be added.
> Nothing about rc > 0 says it is ok to drop the packet and nothing in the
> documentation says it is ok to drop the packet. The program author needs
> to look at the extra information provided by the rc. Specific values are
> a hint that yes the packet can be dropped; others merely say 'packet
> needs help from the stack' with a reason why it needs help.
Right, I understand there are values that hint drop (as your earlier example)
and there are values that hint pass to stack.  and either set of these values
could be extended later.

> 
> My intention is to allow the XDP program to understand FIB based ACLs.
> To that end a fib lookup result of blackhole, unreachable and prohibit
> needs to be returned to the xdp program with the effective summary of
> "can drop in the xdp program".
> 
> If the other return codes are going to cause confusion then less shorten
> the list:
> 
> enum {
> 	BPF_FIB_LKUP_RET_SUCCESS,      /* packet is to be forwareded */
> 	BPF_FIB_LKUP_RET_CAN_DROP,     /* XDP program can drop the packet */
> 	BPF_FIB_LKUP_RET_NEED_STACK,   /* packet needs full stack assist */
> };
> 
> But, that still does not solve the problem of rc > 0 means xdp program
> can drop the packet, but then that is not the intention of rc > 0. The
> intention is only "here's more information about why this packet can not
> be forwarded at this layer"
ok. Please spell out this intention in bpf.h so that the writer will not
default to DROP for the reason that it does not understand.

^ permalink raw reply

* Re: [PATCH net] ipvlan: call dev_change_flags when reset ipvlan mode
From: Cong Wang @ 2018-06-19 21:10 UTC (permalink / raw)
  To: Hangbin Liu
  Cc: Linux Kernel Network Developers, Stefano Brivio, Paolo Abeni,
	David Miller, Mahesh Bandewar
In-Reply-To: <1529330677-15328-1-git-send-email-liuhangbin@gmail.com>

On Mon, Jun 18, 2018 at 7:04 AM, Hangbin Liu <liuhangbin@gmail.com> wrote:
> @@ -94,10 +95,13 @@ static int ipvlan_set_port_mode(struct ipvl_port *port, u16 nval)
>                         mdev->l3mdev_ops = NULL;
>                 }
>                 list_for_each_entry(ipvlan, &port->ipvlans, pnode) {
> +                       flags = ipvlan->dev->flags;
>                         if (nval == IPVLAN_MODE_L3 || nval == IPVLAN_MODE_L3S)
> -                               ipvlan->dev->flags |= IFF_NOARP;
> +                               dev_change_flags(ipvlan->dev,
> +                                                flags | IFF_NOARP);
>                         else
> -                               ipvlan->dev->flags &= ~IFF_NOARP;
> +                               dev_change_flags(ipvlan->dev,
> +                                                flags & ~IFF_NOARP);

You need to check the return value of dev_change_flags().

^ permalink raw reply

* Re: iproute2 won't compile without AF_VSOCK
From: Steve Wise @ 2018-06-19 20:41 UTC (permalink / raw)
  To: David Ahern, Stephen Hemminger; +Cc: netdev
In-Reply-To: <d6f6b60d-6dad-fd0c-cdc5-b560fe6da12d@gmail.com>



On 6/19/2018 3:29 PM, David Ahern wrote:
> On 6/19/18 2:27 PM, David Ahern wrote:
>> On 6/19/18 9:47 AM, Stephen Hemminger wrote:
>>> On Tue, 19 Jun 2018 10:17:45 -0500
>>> Steve Wise <swise@opengridcomputing.com> wrote:
>>>
>>>> Hey David,
>>>>
>>>> I'm trying to compile the latest iproute2 on an RHEL-7.3 distro, and it
>>>> fails to compile because AF_VSOCK is not defined.  Should this
>>>> functionality be a configure option to disable it on older distros?
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Steve.
>>>>
>>>> ----
>>>>
>>>> misc
>>>>     CC       ss.o
>>>> ss.c:301:27: error: ‘AF_VSOCK’ undeclared here (not in a function)
>>>>    .families = FAMILY_MASK(AF_VSOCK),
>>>>                            ^
>>>> ss.c:252:46: note: in definition of macro ‘FAMILY_MASK’
>>>>  #define FAMILY_MASK(family) ((uint64_t)1 << (family))
>>>>                                               ^
>>>> ss.c:334:2: error: array index in initializer not of integer type
>>>>   [AF_VSOCK] = {
>>>>   ^
>>>> ss.c:334:2: error: (near initialization for ‘default_afs’)
>>>> make[1]: *** [ss.o] Error 1
>>>> make: *** [all] Error 2
>>>>
>>> Probably should just add an #ifdef to takeout that if not present
>>>
>> Most userspace tools have a compat header for cases like this.
>>
>> #ifndef AF_VSOCK
>> #define AF_VSOCK 	40
>> #endif
>>
> Add the above to include//utils.h; AF_MPLS is already there.

I'll send out a patch.

Thanks,

Steve.

^ permalink raw reply

* Re: [bug] cxgb4: vrf stopped working with cxgb4 card
From: AMG Zollner Robert @ 2018-06-19 20:32 UTC (permalink / raw)
  To: Ganesh Goudar; +Cc: office, dsa, netdev
In-Reply-To: <20180619132425.GA6576@chelsio.com>

On 19.06.2018 16:24, Ganesh Goudar wrote:
> On Monday, June 06/11/18, 2018 at 14:47:55 +0530, Ganesh Goudar wrote:
>> On Saturday, June 06/09/18, 2018 at 18:47:55 -0600, David Ahern wrote:
>>> Ganesh:
>>>
>>> On 6/4/18 9:03 AM, AMG Zollner Robert wrote:
>>>> I have noticed that vrf is not working with kernel v4.15.0 but was
>>>> working with v4.13.0 when using cxgb4 Chelsio driver (T520-cr)
>>>>
>>>> Setup:
>>>> Two metal servers with a T520-cr card each, directly connected without a
>>>> switch in between.
>>>>
>>>>         SVR1  only ipfwd                 SVR2     with vrf
>>>> .----------------------------. .----------------------------------.
>>>> |                            |         |             |
>>>> |    192.168.8.1 [  ens2f4]--|---------|--[ens1f4] 192.168.8.2   |
>>>> |    192.168.9.1 [ens2f4d1]--|---------|--<ens1f4d1> 192.168.9.2 VRF=10   |
>>>> `----------------------------' `----------------------------------'
>>>>
>>>> When vrf is not working there are no error messages (dmesg or iproute
>>>> commands), tcpdump on the interface (SVR2.ens1f4d1) enslaved in vrf 10
>>>> shows packets(arp req/reply) coming in and going out, but outgoing
>>>> packets(arp reply) do not reach the other server SVR1.ens2f4d1
>>>>
>>>>
>>>> Bisect:
>>>> Found this commit to be the problem after doing a git bisect between
>>>> v4.13..v4.15:
>>>>
>>>> commit ba581f77df23c8ee70b372966e69cf10bc5453d8
>>>> Author: Ganesh Goudar <ganeshgr@chelsio.com>
>>>> Date:   Sat Sep 23 16:07:28 2017 +0530
>>>>
>>>>      cxgb4: do DCB state reset in couple of places
>>>>
>>>>      reset the driver's DCB state in couple of places
>>>>      where it was missing.
>>>
>>> Are you working on a fix for this or should a revert of the above patch
>>> be sent?
>> Will look into it and fix/revert it soon, Thanks for responding to Robert.
>>>
>>>
>>>>
>>>>
>>>> A bisect step was considered good when:
>>>> - successful ping from SVR1 to SVR2.ens1f4d1 vrf interface
>>>> - successful ping from SVR2 global to SVR2 vrf interface trough SVR1(l3
>>>> forwarding) (this check was redundant,both tests fail or pass simultaneous)
>>>>
>>>> The problem is still present on recent kernels also, checked v4.16.0 and
>>>> v4.17.rc7
>>>>
>>>> Disabling DCB for the card support fixes the problem ( Compiling kernel
>>>> with "CONFIG_CHELSIO_T4_DCB=n")
>>>>
>>>>
>>>>
>>>> This is my first time reporting a bug to the linux kernel and hope I
>>>> have included the right amount of information. Please let me know if I
>>>> have missed something.
>>>>
>>>>
>>>>
>>>> Thank you,
>>>> Zollner Robert
>>>>
>>>>
>>>> --------
>>>> Logs:
>>>>
>>>> VRF configured using folowing commands:
>>>>
>>>> #!/bin/sh
>>>>
>>>> CHDEV=ens1f4
>>>> VRF=vrf-recv
>>>>
>>>> sysctl -w net.ipv4.tcp_l3mdev_accept=1
>>>> sysctl -w net.ipv4.udp_l3mdev_accept=1
>>>> sysctl -w net.ipv4.conf.all.accept_local=1
>>>>
>>>> ifconfig ${CHDEV}   192.168.8.2/24
>>>> ifconfig ${CHDEV}d1 192.168.9.2/24
>>>>
>>>> ip link add ${VRF} type vrf table 10
>>>> ip link set dev ${VRF} up
>>>>
>>>> ip rule add pref 32765 table local
>>>> ip rule del pref 0
>>>>
>>>> ip route add table 10 unreachable default metric 4278198272
>>>>
>>>> ip link set dev ${CHDEV}d1 master ${VRF}
>>>>
>>>> ip route add table 10 default via 192.168.9.1
>>>> ip route add 192.168.9.0/24 via 192.168.8.1
>>>>
>>>>
>>>>
>>>>
>>>
> -netdev, Please feel free to add if needed.
> 
> Hi Robert,
> 
> My knowledge of VRF is very limited, I am trying to bring
> up VRF setup, I just wanted to check if you are doing anything
> related DCB and also please let me know how did you setup SRV1.
> 
> Thanks
> 

Hello Ganesh,

SRV1 is just forwarding(l3) between the two physical ports of the 
T520-CR card.

ifconfig ens1f4   192.168.8.1/24
ifconfig ens1f4d1 192.168.9.1/24
sysctl -w net.ipv4.ip_forward=1

- No VRF is configured on this box
- DCB is also not used


SVR2 is using VRF and is configured with the script inlined in the first 
email.

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Michael S. Tsirkin @ 2018-06-19 20:32 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Siwei Liu, Samudrala, Sridhar, Alexander Duyck, virtio-dev,
	aaron.f.brown, Jiri Pirko, Jakub Kicinski, Netdev, qemu-devel,
	virtualization, konrad.wilk, boris.ostrovsky, Joao Martins,
	Venu Busireddy, vijay.balakrishna
In-Reply-To: <20180619125453.2d2dfb2d.cohuck@redhat.com>

On Tue, Jun 19, 2018 at 12:54:53PM +0200, Cornelia Huck wrote:
> Sorry about dragging mainframes into this, but this will only work for
> homogenous device coupling, not for heterogenous. Consider my vfio-pci
> + virtio-net-ccw example again: The guest cannot find out that the two
> belong together by checking some group ID, it has to either use the MAC
> or some needs-to-be-architectured property.
> 
> Alternatively, we could propose that mechanism as pci-only, which means
> we can rely on mechanisms that won't necessarily work on non-pci
> transports. (FWIW, I don't see a use case for using vfio-ccw to pass
> through a network card anytime in the near future, due to the nature of
> network cards currently in use on s390.)

That's what it boils down to, yes.  If there's need to have this for
non-pci devices, then we should put it in config space.
Cornelia, what do you think?

-- 
MST

^ permalink raw reply

* Re: iproute2 won't compile without AF_VSOCK
From: David Ahern @ 2018-06-19 20:29 UTC (permalink / raw)
  To: Stephen Hemminger, Steve Wise; +Cc: netdev
In-Reply-To: <f8c68944-57b6-18a6-aa50-c019baa2a63b@gmail.com>

On 6/19/18 2:27 PM, David Ahern wrote:
> On 6/19/18 9:47 AM, Stephen Hemminger wrote:
>> On Tue, 19 Jun 2018 10:17:45 -0500
>> Steve Wise <swise@opengridcomputing.com> wrote:
>>
>>> Hey David,
>>>
>>> I'm trying to compile the latest iproute2 on an RHEL-7.3 distro, and it
>>> fails to compile because AF_VSOCK is not defined.  Should this
>>> functionality be a configure option to disable it on older distros?
>>>
>>>
>>> Thanks,
>>>
>>> Steve.
>>>
>>> ----
>>>
>>> misc
>>>     CC       ss.o
>>> ss.c:301:27: error: ‘AF_VSOCK’ undeclared here (not in a function)
>>>    .families = FAMILY_MASK(AF_VSOCK),
>>>                            ^
>>> ss.c:252:46: note: in definition of macro ‘FAMILY_MASK’
>>>  #define FAMILY_MASK(family) ((uint64_t)1 << (family))
>>>                                               ^
>>> ss.c:334:2: error: array index in initializer not of integer type
>>>   [AF_VSOCK] = {
>>>   ^
>>> ss.c:334:2: error: (near initialization for ‘default_afs’)
>>> make[1]: *** [ss.o] Error 1
>>> make: *** [all] Error 2
>>>
>>
>> Probably should just add an #ifdef to takeout that if not present
>>
> 
> Most userspace tools have a compat header for cases like this.
> 
> #ifndef AF_VSOCK
> #define AF_VSOCK 	40
> #endif
> 

Add the above to include//utils.h; AF_MPLS is already there.

^ permalink raw reply

* Re: iproute2 won't compile without AF_VSOCK
From: David Ahern @ 2018-06-19 20:27 UTC (permalink / raw)
  To: Stephen Hemminger, Steve Wise; +Cc: netdev
In-Reply-To: <20180619084732.0de6d75d@xeon-e3>

On 6/19/18 9:47 AM, Stephen Hemminger wrote:
> On Tue, 19 Jun 2018 10:17:45 -0500
> Steve Wise <swise@opengridcomputing.com> wrote:
> 
>> Hey David,
>>
>> I'm trying to compile the latest iproute2 on an RHEL-7.3 distro, and it
>> fails to compile because AF_VSOCK is not defined.  Should this
>> functionality be a configure option to disable it on older distros?
>>
>>
>> Thanks,
>>
>> Steve.
>>
>> ----
>>
>> misc
>>     CC       ss.o
>> ss.c:301:27: error: ‘AF_VSOCK’ undeclared here (not in a function)
>>    .families = FAMILY_MASK(AF_VSOCK),
>>                            ^
>> ss.c:252:46: note: in definition of macro ‘FAMILY_MASK’
>>  #define FAMILY_MASK(family) ((uint64_t)1 << (family))
>>                                               ^
>> ss.c:334:2: error: array index in initializer not of integer type
>>   [AF_VSOCK] = {
>>   ^
>> ss.c:334:2: error: (near initialization for ‘default_afs’)
>> make[1]: *** [ss.o] Error 1
>> make: *** [all] Error 2
>>
> 
> Probably should just add an #ifdef to takeout that if not present
> 

Most userspace tools have a compat header for cases like this.

#ifndef AF_VSOCK
#define AF_VSOCK 	40
#endif

^ permalink raw reply

* Re: [PATCH bpf-net] bpf: Change bpf_fib_lookup to return lookup status
From: David Ahern @ 2018-06-19 20:16 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: dsahern, netdev, borkmann, ast, davem
In-Reply-To: <20180619163644.gwdrauesfeq6vc5v@kafai-mbp.dhcp.thefacebook.com>

On 6/19/18 10:36 AM, Martin KaFai Lau wrote:
> On Tue, Jun 19, 2018 at 09:34:28AM -0600, David Ahern wrote:
>> On 6/19/18 9:25 AM, Martin KaFai Lau wrote:
>>> On Mon, Jun 18, 2018 at 03:35:25PM -0600, David Ahern wrote:
>>>> On 6/18/18 2:55 PM, Martin KaFai Lau wrote:
>>>>>> 	/* rc > 0 case */
>>>>>> 	switch(rc) {
>>>>>> 	case BPF_FIB_LKUP_RET_BLACKHOLE:
>>>>>> 	case BPF_FIB_LKUP_RET_UNREACHABLE:
>>>>>> 	case BPF_FIB_LKUP_RET_PROHIBIT:
>>>>>> 		return XDP_DROP;
>>>>>> 	}
>>>>>>
>>>>>> For the others it becomes a question of do we share why the stack needs
>>>>>> to be involved? Maybe the program wants to collect stats to show traffic
>>>>>> patterns that can be improved (BPF_FIB_LKUP_RET_FRAG_NEEDED) or support
>>>>>> in the kernel needs to be improved (BPF_FIB_LKUP_RET_UNSUPP_LWT) or an
>>>>>> interface is misconfigured (BPF_FIB_LKUP_RET_FWD_DISABLED).
>>>>> Thanks for the explanation.
>>>>>
>>>>> Agree on the bpf able to collect stats will be useful.
>>>>>
>>>>> I am wondering, if a new BPF_FIB_LKUP_RET_XYZ is added later,
>>>>> how may the old xdp_prog work/not-work?  As of now, the return value
>>>>> is straight forward, FWD, PASS (to stack) or DROP (error).
>>>>> With this change, the xdp_prog needs to match/switch() the
>>>>> BPF_FIB_LKUP_RET_* to at least PASS and DROP.
>>>>
>>>> IMO, programs should only call XDP_DROP for known reasons - like the 3
>>>> above. Anything else punt to the stack.
>>>>
>>>> If a new RET_XYZ comes along:
>>>> 1. the new XYZ is a new ACL response where the packet is to be dropped.
>>>> If the program does not understand XYZ and punts to the stack
>>>> (recommendation), then a second lookup is done during normal packet
>>>> processing and the stack drops it.
>>>>
>>>> 2. the new XYZ is a new path in the kernel that is unsupported with
>>>> respect to XDP forwarding, nothing new for the program to do.
>>>>
>>>> Either way I would expect stats on BPF_FIB_LKUP_RET_* to give a hint to
>>>> the program writer.
>>>>
>>>> Worst case of punting packets to the stack for any rc != 0 means the
>>>> stack is doing 2 lookups - 1 in XDP based on its lookup parameters and 1
>>>> in normal stack processing - to handle the packet.
>>> Instead of having the xdp_prog to follow the meaning of what RET_SYZ is,
>>> should the bpf_*_fib_lookup() return value be kept as is such that
>>> the xdp_prog is clear what to do.  The reason can be returned in
>>> the 'struct bpf_fib_lookup'.  The number of reasons can be extended.
>>> If the xdp_prog does not understand a reason, it still will not
>>> affect its decision because the return value is clear.
>>> I think the situation here is similar to regular syscall which usually
>>> uses -1 to clearly states error and errno to spells out the reason.
>>>
>>
>> I did consider returning the status in struct bpf_fib_lookup. However,
>> it is 64 bytes and can not be extended without a big performance
>> penalty, so the only option there is to make an existing entry a union
>> the most logical of which is the ifindex. It seemed odd to me to have
>> the result by hidden in the struct as a union on ifindex and returning
>> the egress index from the function:
>>
>> @@ -2625,7 +2636,11 @@ struct bpf_fib_lookup {
>>
>>         /* total length of packet from network header - used for MTU
>> check */
>>         __u16   tot_len;
>> -       __u32   ifindex;  /* L3 device index for lookup */
>> +
>> +       union {
>> +               __u32   ifindex;  /* input: L3 device index for lookup */
>> +               __u32   result;   /* output: one of BPF_FIB_LKUP_RET_* */
>> +       };
>>
>>
>> It seemed more natural to have ifindex stay ifindex and only change
>> value on return:
>>
>> @@ -2625,7 +2639,11 @@ struct bpf_fib_lookup {
>>
>>  	/* total length of packet from network header - used for MTU check */
>>  	__u16	tot_len;
>> -	__u32	ifindex;  /* L3 device index for lookup */
>> +
>> +	/* input: L3 device index for lookup
>> +	 * output: nexthop device index from FIB lookup
>> +	 */
>> +	__u32	ifindex;
>>
>>  	union {
>>  		/* inputs to lookup */
>>
>>
>> From a program's perspective:
>>
>> rc < 0  -- program is passing incorrect data
>> rc == 0 -- packet can be forwarded
>> rc > 0  -- packet can not be forwarded.
>>
>> BPF programs are not required to track the LKUP_RET values any more than
>> a function returning multiple negative values - the caller just checks
>> rc < 0 means failure. If the program cares it can look at specific
>> values of rc to see the specific value.
>>
>> The same applies with the LKUP_RET values - they are there to provide
>> insight into why the packet is not forwarded directly if the program
>> cares to know why.
> hmm...ic. My concern is, the prog can interpret rc > 0 (in this patch) to be
> drop vs pass (although we can advise them in bpf.h to always pass if it does
> not understand a rc but it is not a strong contract),  it may catch people
> a surprise if a xdp_prog suddenly drops everything when running in a
> newer kernel where the upper stack can actually handle it.
> 
> while the current behavior (i.e. before this patch, rc == 0) is always pass
> to the stack.
> 
> I think at least comments should be put in the enum such that
> the xdp/tc_prog should expect the enum could be extended later, so
> the suggested behavior should be a pass for unknown LKUP_RET and let
> the stack to decide.
> 

All APIs with enums have the inherent quality that more can be added.
Nothing about rc > 0 says it is ok to drop the packet and nothing in the
documentation says it is ok to drop the packet. The program author needs
to look at the extra information provided by the rc. Specific values are
a hint that yes the packet can be dropped; others merely say 'packet
needs help from the stack' with a reason why it needs help.

My intention is to allow the XDP program to understand FIB based ACLs.
To that end a fib lookup result of blackhole, unreachable and prohibit
needs to be returned to the xdp program with the effective summary of
"can drop in the xdp program".

If the other return codes are going to cause confusion then less shorten
the list:

enum {
	BPF_FIB_LKUP_RET_SUCCESS,      /* packet is to be forwareded */
	BPF_FIB_LKUP_RET_CAN_DROP,     /* XDP program can drop the packet */
	BPF_FIB_LKUP_RET_NEED_STACK,   /* packet needs full stack assist */
};

But, that still does not solve the problem of rc > 0 means xdp program
can drop the packet, but then that is not the intention of rc > 0. The
intention is only "here's more information about why this packet can not
be forwarded at this layer"

^ permalink raw reply

* Re: [PATCH net] net/sched: act_ife: fix recursive lock and idr leak
From: Cong Wang @ 2018-06-19 20:13 UTC (permalink / raw)
  To: Davide Caratti
  Cc: Jamal Hadi Salim, David S. Miller,
	Linux Kernel Network Developers
In-Reply-To: <40b45f70ef007b222b36a4676174e597c41d697f.1529415169.git.dcaratti@redhat.com>

On Tue, Jun 19, 2018 at 6:39 AM, Davide Caratti <dcaratti@redhat.com> wrote:
> a recursive lock warning [1] can be observed with the following script,
>
>  # $TC actions add action ife encode allow prio pass index 42
>  IFE type 0xED3E
>  # $TC actions replace action ife encode allow tcindex pass index 42
>
> in case the kernel was unable to run the last command (e.g. because of
> the impossibility to load 'act_meta_skbtcindex'). For a similar reason,
> the kernel can leak idr in the error path of tcf_ife_init(), because
> tcf_idr_release() is not called after successful idr reservation:
>
>  # $TC actions add action ife encode allow tcindex index 47
>  IFE type 0xED3E
>  RTNETLINK answers: No such file or directory
>  We have an error talking to the kernel
>  # $TC actions add action ife encode allow tcindex index 47
>  IFE type 0xED3E
>  RTNETLINK answers: No space left on device
>  We have an error talking to the kernel
>  # $TC actions add action ife encode use mark 7 type 0xfefe pass index 47
>  IFE type 0xFEFE
>  RTNETLINK answers: No space left on device
>  We have an error talking to the kernel
>
> Since tcfa_lock is already taken when the action is being edited, a call
> to tcf_idr_release() wrongly makes tcf_idr_cleanup() take the same lock
> again. On the other hand, tcf_idr_release() needs to be called in the
> error path of tcf_ife_init(), to undo the last tcf_idr_create() invocation.
> Fix both problems in tcf_ife_init().
> Since the cleanup() routine can now be called when ife->params is NULL,
> also add a NULL pointer check to avoid calling kfree_rcu(NULL, rcu).

Acked-by: Cong Wang <xiyou.wangcong@gmail.com>

^ permalink raw reply

* Re: [PATCH] Revert "net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends"
From: Eric Dumazet @ 2018-06-19 20:10 UTC (permalink / raw)
  To: Andreas Schwab, Eric Dumazet
  Cc: Mathieu Malaterre, David S. Miller, Eric Dumazet, LKML,
	Christophe LEROY, Meelis Roos, netdev, linuxppc-dev
In-Reply-To: <87po0mvbgl.fsf@igel.home>



On 06/19/2018 12:10 PM, Andreas Schwab wrote:
> On Jun 18 2018, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
>> DUMP_PREFIX_ADDRESS might give us more information (say alignment problem, or crossing page boundaries)
> 
> DUMP_PREFIX_ADDRESS is useless for that purpose.
> 
> Here are some samples of broken csums:
> 
> [  853.849225] sungem: sungem wrong csum : 9886/07be, len 94 bytes, c0000001fa187e02
> [  853.849232] raw data: 00000000: 00 0d 93 43 81 62 18 d6 c7 51 b8 1c 08 00 45 10  ...C.b...Q....E.
> [  853.849235] raw data: 00000010: 00 4c cb a0 40 00 40 11 d9 97 c0 a8 0a 01 c0 a8  .L..@.@.........
> [  853.849237] raw data: 00000020: 0a 07 00 7b 00 7b 00 38 69 e1 1c 03 0c f7 00 00  ...{.{.8i.......
> [  853.849240] raw data: 00000030: 08 f0 00 00 15 f0 c0 35 67 67 de d3 ca c9 d9 5b  .......5gg.....[
> [  853.849242] raw data: 00000040: 1f ff de d3 d2 86 8f 67 fa f2 de d3 d2 86 8f 38  .......g.......8
> [  853.849244] raw data: 00000050: 2f ff de d3 d2 86 8f 3b ff ff d1 93 bc 50        /......;.....P

Thanks.

4 bytes in excess.

Might be the FCS, and it does not look like provided csum has a relation with it.

For some reason FCS stripping was disabled by :

commit 3e32011d4da6424b3bc65b1e1a047e30ac9882c7
Author: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date:   Mon May 19 09:39:11 2003 -0700

    [SUNGEM]: Updates from PowerPC people.
    
    Support more chips and split out all the complex PHY
    handling into a seperate file.


Since this NIC never had CHECKSUM_COMPLETE support (since we have to trim each skb,
thus were forcing ip_summed to CHECKSUM_NONE) we probably should remove it and be happy.

Unless you guys find a way to let the NIC strip the FCS, and double check the csum is a real csum ;)

diff --git a/drivers/net/ethernet/sun/sungem.c b/drivers/net/ethernet/sun/sungem.c
index 7a16d40a72d13cf1d522e8a3a396c826fe76f9b9..85439308375e95c3854e4a1561697d69ec85399b 100644
--- a/drivers/net/ethernet/sun/sungem.c
+++ b/drivers/net/ethernet/sun/sungem.c
@@ -760,7 +760,6 @@ static int gem_rx(struct gem *gp, int work_to_do)
        struct net_device *dev = gp->dev;
        int entry, drops, work_done = 0;
        u32 done;
-       __sum16 csum;
 
        if (netif_msg_rx_status(gp))
                printk(KERN_DEBUG "%s: rx interrupt, done: %d, rx_new: %d\n",
@@ -855,9 +854,6 @@ static int gem_rx(struct gem *gp, int work_to_do)
                        skb = copy_skb;
                }
 
-               csum = (__force __sum16)htons((status & RXDCTRL_TCPCSUM) ^ 0xffff);
-               skb->csum = csum_unfold(csum);
-               skb->ip_summed = CHECKSUM_COMPLETE;
                skb->protocol = eth_type_trans(skb, gp->dev);
 
                napi_gro_receive(&gp->napi, skb);

^ permalink raw reply related

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Siwei Liu @ 2018-06-19 20:09 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Samudrala, Sridhar, Alexander Duyck, virtio-dev, aaron.f.brown,
	Jiri Pirko, Michael S. Tsirkin, Jakub Kicinski, Netdev,
	qemu-devel, virtualization, konrad.wilk, boris.ostrovsky,
	Joao Martins, Venu Busireddy, vijay.balakrishna
In-Reply-To: <20180619125453.2d2dfb2d.cohuck@redhat.com>

On Tue, Jun 19, 2018 at 3:54 AM, Cornelia Huck <cohuck@redhat.com> wrote:
> On Fri, 15 Jun 2018 10:06:07 -0700
> Siwei Liu <loseweigh@gmail.com> wrote:
>
>> On Fri, Jun 15, 2018 at 4:48 AM, Cornelia Huck <cohuck@redhat.com> wrote:
>> > On Thu, 14 Jun 2018 18:57:11 -0700
>> > Siwei Liu <loseweigh@gmail.com> wrote:
>> >
>> >> Thank you for sharing your thoughts, Cornelia. With questions below, I
>> >> think you raised really good points, some of which I don't have answer
>> >> yet and would also like to explore here.
>> >>
>> >> First off, I don't want to push the discussion to the extreme at this
>> >> point, or sell anything about having QEMU manage everything
>> >> automatically. Don't get me wrong, it's not there yet. Let's don't
>> >> assume we are tied to a specific or concerte solution. I think the key
>> >> for our discussion might be to define or refine the boundary between
>> >> VM and guest,  e.g. what each layer is expected to control and manage
>> >> exactly.
>> >>
>> >> In my view, there might be possibly 3 different options to represent
>> >> the failover device conceipt to QEMU and libvirt (or any upper layer
>> >> software):
>> >>
>> >> a. Seperate device: in this model, virtio and passthough remains
>> >> separate devices just as today. QEMU exposes the standby feature bit
>> >> for virtio, and publish status/event around the negotiation process of
>> >> this feature bit for libvirt to react upon. Since Libvirt has the
>> >> pairing relationship itself, maybe through MAC address or something
>> >> else, it can control the presence of primary by hot plugging or
>> >> unplugging the passthrough device, although it has to work tightly
>> >> with virtio's feature negotation process. Not just for migration but
>> >> also various corner scenarios (driver/feature ok, device reset,
>> >> reboot, legacy guest etc) along virtio's feature negotiation.
>> >
>> > Yes, that one has obvious tie-ins to virtio's modus operandi.
>> >
>> >>
>> >> b. Coupled device: in this model, virtio and passthough devices are
>> >> weakly coupled using some group ID, i.e. QEMU match the passthough
>> >> device for a standby virtio instance by comparing the group ID value
>> >> present behind each device's bridge. Libvirt provides QEMU the group
>> >> ID for both type of devices, and only deals with hot plug for
>> >> migration, by checking some migration status exposed (e.g. the feature
>> >> negotiation status on the virtio device) by QEMU. QEMU manages the
>> >> visibility of the primary in guest along virtio's feature negotiation
>> >> process.
>> >
>> > I'm a bit confused here. What, exactly, ties the two devices together?
>>
>> The group UUID. Since QEMU VFIO dvice does not have insight of MAC
>> address (which it doesn't have to), the association between VFIO
>> passthrough and standby must be specificed for QEMU to understand the
>> relationship with this model. Note, standby feature is no longer
>> required to be exposed under this model.
>
> Isn't that a bit limiting, though?
>
> With this model, you can probably tie a vfio-pci device and a
> virtio-net-pci device together. But this will fail if you have
> different transports: Consider tying together a vfio-pci device and a
> virtio-net-ccw device on s390, for example. The standby feature bit is
> on the virtio-net level and should not have any dependency on the
> transport used.

Probably we'd limit the support for grouping to virtio-net-pci device
and vfio-pci device only. For virtio-net-pci, as you might see with
Venu's patch, we store the group UUID on the config space of
virtio-pci, which is only applicable to PCI transport.

If virtio-net-ccw needs to support the same, I think similar grouping
interface should be defined on the VirtIO CCW transport. I think the
current implementation of the Linux failover driver assumes that it's
SR-IOV VF with same MAC address which the virtio-net-pci needs to pair
with, and that the PV path is on same PF without needing to update
network of the port-MAC association change. If we need to extend the
grouping mechanism to virtio-net-ccw, it has to pass such failover
mode to virtio driver specifically through some other option I guess.

>
>>
>> > If libvirt already has the knowledge that it should manage the two as a
>> > couple, why do we need the group id (or something else for other
>> > architectures)? (Maybe I'm simply missing something because I'm not
>> > that familiar with pci.)
>>
>> The idea is to have QEMU control the visibility and enumeration order
>> of the passthrough VFIO for the failover scenario. Hotplug can be one
>> way to achieve it, and perhaps there's other way around also. The
>> group ID is not just for QEMU to couple devices, it's also helpful to
>> guest too as grouping using MAC address is just not safe.
>
> Sorry about dragging mainframes into this, but this will only work for
> homogenous device coupling, not for heterogenous. Consider my vfio-pci
> + virtio-net-ccw example again: The guest cannot find out that the two
> belong together by checking some group ID, it has to either use the MAC
> or some needs-to-be-architectured property.
>
> Alternatively, we could propose that mechanism as pci-only, which means
> we can rely on mechanisms that won't necessarily work on non-pci
> transports. (FWIW, I don't see a use case for using vfio-ccw to pass
> through a network card anytime in the near future, due to the nature of
> network cards currently in use on s390.)

Yes, let's do this just for PCI transport (homogenous) for now.

>
>>
>> >
>> >>
>> >> c. Fully combined device: in this model, virtio and passthough devices
>> >> are viewed as a single VM interface altogther. QEMU not just controls
>> >> the visibility of the primary in guest, but can also manage the
>> >> exposure of the passthrough for migratability. It can be like that
>> >> libvirt supplies the group ID to QEMU. Or libvirt does not even have
>> >> to provide group ID for grouping the two devices, if just one single
>> >> combined device is exposed by QEMU. In either case, QEMU manages all
>> >> aspect of such internal construct, including virtio feature
>> >> negotiation, presence of the primary, and live migration.
>> >
>> > Same question as above.
>> >
>> >>
>> >> It looks like to me that, in your opinion, you seem to prefer go with
>> >> (a). While I'm actually okay with either (b) or (c). Do I understand
>> >> your point correctly?
>> >
>> > I'm not yet preferring anything, as I'm still trying to understand how
>> > this works :) I hope we can arrive at a model that covers the use case
>> > and that is also flexible enough to be extended to other platforms.
>> >
>> >>
>> >> The reason that I feel that (a) might not be ideal, just as Michael
>> >> alluded to (quoting below), is that as management stack, it really
>> >> doesn't need to care about the detailed process of feature negotiation
>> >> (if we view the guest presence of the primary as part of feature
>> >> negotiation at an extended level not just virtio). All it needs to be
>> >> done is to hand in the required devices to QEMU and that's all. Why do
>> >> we need to addd various hooks, events for whichever happens internally
>> >> within the guest?
>> >>
>> >> ''
>> >> Primary device is added with a special "primary-failover" flag.
>> >> A virtual machine is then initialized with just a standby virtio
>> >> device. Primary is not yet added.
>> >>
>> >> Later QEMU detects that guest driver device set DRIVER_OK.
>> >> It then exposes the primary device to the guest, and triggers
>> >> a device addition event (hot-plug event) for it.
>> >>
>> >> If QEMU detects guest driver removal, it initiates a hot-unplug sequence
>> >> to remove the primary driver.  In particular, if QEMU detects guest
>> >> re-initialization (e.g. by detecting guest reset) it immediately removes
>> >> the primary device.
>> >> ''
>> >>
>> >> and,
>> >>
>> >> ''
>> >> management just wants to give the primary to guest and later take it back,
>> >> it really does not care about the details of the process,
>> >> so I don't see what does pushing it up the stack buy you.
>> >>
>> >> So I don't think it *needs* to be done in libvirt. It probably can if you
>> >> add a bunch of hooks so it knows whenever vm reboots, driver binds and
>> >> unbinds from device, and can check that backup flag was set.
>> >> If you are pushing for a setup like that please get a buy-in
>> >> from libvirt maintainers or better write a patch.
>> >> ''
>> >
>> > This actually seems to mean the opposite to me: We need to know what
>> > the guest is doing and when, as it directly drives what we need to do
>> > with the devices. If we switch to a visibility vs a hotplug model (see
>> > the other mail), we might be able to handle that part within qemu.
>>
>> In the model of (b), I think it essentially turns hotplug to one of
>> mechanisms for QEMU to control the visibility. The libvirt can still
>> manage the hotplug of individual devices during live migration or in
>> normal situation to hot add/remove devices. Though the visibility of
>> the VFIO is under the controll of QEMU, and it's possible that the hot
>> add/remove request does not involve actual hot plug activity in guest
>> at all.
>
> That depends on how you model visibility, I guess. You'll probably want
> to stop traffic flowing through one or the other of the cards; would
> link down or similar be enough for the virtio device?

I'm not sure if it is a good idea. The guest user will see two devices
with same MAC but one of them is down. Do you expect user to use it or
not? And since the guest is going to be migrated, we need to unplug a
broken VF from guest before migrating, why do we bother plugging in
this useless VF at the first place?


Thanks,
-Siwei
>
>>
>> In the model of (c), the hotplug semantics of the combined device
>> would mean differently - it would end up with devices plugged in or
>> out altogther. To make this work, we either have to build a brand new
>> bond-like QEMU device consist of virtio and VFIO internally, or need
>> to have some abstraction in place for libvirt to manipulate the
>> combined device (and prohibit libvirt from operating on individual
>> internal device directly). Note with this model the group ID doesn't
>> even need to get exposed to libvirt, just imagine libvirt to supply
>> all options required to configure two regular virtio-net and VFIO
>> devices for a single device object, and QEMU will deal with the
>> device's visibility and enumeration, such when to hot plug VFIO device
>> in to or out from the guest.
>>
>> It might be complicated to implement (c) though.
>
> I think (c) would be even more complicated for heterogenous setups.

^ permalink raw reply

* Re: [PATCH net] net/sched: act_ife: preserve the action control in case of error
From: Cong Wang @ 2018-06-19 19:59 UTC (permalink / raw)
  To: Davide Caratti
  Cc: Jamal Hadi Salim, David S. Miller,
	Linux Kernel Network Developers
In-Reply-To: <717d0682de4fb24ed6818f2bf264202d22e1e8be.1529415179.git.dcaratti@redhat.com>

On Tue, Jun 19, 2018 at 6:45 AM, Davide Caratti <dcaratti@redhat.com> wrote:
> in the following script
>
>  # tc actions add action ife encode allow prio pass index 42
>  # tc actions replace action ife encode allow tcindex drop index 42
>
> the action control should remain equal to 'pass', if the kernel failed
> to replace the TC action. Pospone the assignment of the action control,
> to ensure it is not overwritten in the error path of tcf_ife_init().
>
> Fixes: ef6980b6becb ("introduce IFE action")
> Signed-off-by: Davide Caratti <dcaratti@redhat.com>

Acked-by: Cong Wang <xiyou.wangcong@gmail.com>

^ permalink raw reply

* Re: [PATCH] dt-bindings: Fix unbalanced quotation marks
From: Florian Fainelli @ 2018-06-19 19:49 UTC (permalink / raw)
  To: Jonathan Neuschäfer, devicetree
  Cc: Mark Rutland, linux-mips, Ulf Hansson, Madalin Bucur, James Hogan,
	Linus Walleij, alsa-devel, linux-kernel, Thierry Reding,
	Alexandre Torgue, linux-samsung-soc, Kevin Hilman,
	Rafał Miłecki, Krzysztof Kozlowski, Jonathan Hunter,
	Kukjin Kim, linux-input, Philipp Zabel, Jason Cooper, linux-pm,
	Marc Zyngier
In-Reply-To: <20180617143127.11421-1-j.neuschaefer@gmx.net>

On 06/17/2018 07:31 AM, Jonathan Neuschäfer wrote:
> Multiple binding documents have various forms of unbalanced quotation
> marks. Fix them.
> 
> Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
> ---

[snip]

>  Documentation/devicetree/bindings/mips/brcm/soc.txt             | 2 +-

Acked-by: Florian Fainelli <f.fainelli@gmail.com>
-- 
Florian

^ permalink raw reply

* Re: [PATCH V2] brcmfmac: stop watchdog before detach and free everything
From: Andy Shevchenko @ 2018-06-19 19:37 UTC (permalink / raw)
  To: Arend van Spriel
  Cc: Kalle Valo, Michael Trimarchi, Franky Lin, Hante Meuleman,
	Chi-Hsien Lin, Wright Feng, David S. Miller,
	Pieter-Paul Giesberts, Ian Molton, open list:TI WILINK WIRELES...,
	open list:BROADCOM BRCM80211 IEEE802.11n WIRELESS DRIVER,
	brcm80211-dev-list, netdev, Linux Kernel Mailing List
In-Reply-To: <5B29590C.3040800@broadcom.com>

On Tue, Jun 19, 2018 at 10:27 PM, Arend van Spriel
<arend.vanspriel@broadcom.com> wrote:
> On 6/19/2018 9:06 PM, Kalle Valo wrote:
>>
>> Arend van Spriel <arend.vanspriel@broadcom.com> writes:
>>
>>> On 5/30/2018 11:06 AM, Michael Trimarchi wrote:
>>>>
>>>> Using built-in in kernel image without a firmware in filesystem
>>>> or in the kernel image can lead to a kernel NULL pointer deference.
>>>> Watchdog need to be stopped in brcmf_sdio_remove
>>>>
>>>> The system is going down NOW!
>>>> [ 1348.110759] Unable to handle kernel NULL pointer dereference at
>>>> virtual address 000002f8
>>>> Sent SIGTERM to all processes
>>>> [ 1348.121412] Mem abort info:
>>>> [ 1348.126962]   ESR = 0x96000004
>>>> [ 1348.130023]   Exception class = DABT (current EL), IL = 32 bits
>>>> [ 1348.135948]   SET = 0, FnV = 0
>>>> [ 1348.138997]   EA = 0, S1PTW = 0
>>>> [ 1348.142154] Data abort info:
>>>> [ 1348.145045]   ISV = 0, ISS = 0x00000004
>>>> [ 1348.148884]   CM = 0, WnR = 0
>>>> [ 1348.151861] user pgtable: 4k pages, 48-bit VAs, pgdp =
>>>> (____ptrval____)
>>>> [ 1348.158475] [00000000000002f8] pgd=0000000000000000
>>>> [ 1348.163364] Internal error: Oops: 96000004 [#1] PREEMPT SMP
>>>> [ 1348.168927] Modules linked in: ipv6
>>>> [ 1348.172421] CPU: 3 PID: 1421 Comm: brcmf_wdog/mmc0 Not tainted
>>>> 4.17.0-rc5-next-20180517 #18
>>>> [ 1348.180757] Hardware name: Amarula A64-Relic (DT)
>>>> [ 1348.185455] pstate: 60000005 (nZCv daif -PAN -UAO)
>>>> [ 1348.190251] pc : brcmf_sdiod_freezer_count+0x0/0x20
>>>> [ 1348.195124] lr : brcmf_sdio_watchdog_thread+0x64/0x290
>>>> [ 1348.200253] sp : ffff00000b85be30
>>>> [ 1348.203561] x29: ffff00000b85be30 x28: 0000000000000000
>>>> [ 1348.208868] x27: ffff00000b6cb918 x26: ffff80003b990638
>>>> [ 1348.214176] x25: ffff0000087b1a20 x24: ffff80003b94f800
>>>> [ 1348.219483] x23: ffff000008e620c8 x22: ffff000008f0b660
>>>> [ 1348.224790] x21: ffff000008c6a858 x20: 00000000fffffe00
>>>> [ 1348.230097] x19: ffff80003b94f800 x18: 0000000000000001
>>>> [ 1348.235404] x17: 0000ffffab2e8a74 x16: ffff0000080d7de8
>>>> [ 1348.240711] x15: 0000000000000000 x14: 0000000000000400
>>>> [ 1348.246018] x13: 0000000000000400 x12: 0000000000000001
>>>> [ 1348.251324] x11: 00000000000002c4 x10: 0000000000000a10
>>>> [ 1348.256631] x9 : ffff00000b85bc40 x8 : ffff80003be11870
>>>> [ 1348.261937] x7 : ffff80003dfc7308 x6 : 000000078ff08b55
>>>> [ 1348.267243] x5 : 00000139e1058400 x4 : 0000000000000000
>>>> [ 1348.272550] x3 : dead000000000100 x2 : 958f2788d6618100
>>>> [ 1348.277856] x1 : 00000000fffffe00 x0 : 0000000000000000
>>>
>>>
>>> Forgot about this one.
>>>
>>> Acked-by: Arend van Spriel <arend.vanspriel@broadcom.com>
>>
>>
>> Should this go to 4.18?
>
>
> It is a bit of a corner case, but yeah let's do that.
>
> Regards,
> Arend

Guys, sorry, didn't have time to test. Do it ASAP.

-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply

* Re: [PATCH V2] brcmfmac: stop watchdog before detach and free everything
From: Arend van Spriel @ 2018-06-19 19:27 UTC (permalink / raw)
  To: Kalle Valo
  Cc: Michael Trimarchi, Franky Lin, Hante Meuleman, Chi-Hsien Lin,
	Wright Feng, David S. Miller, Pieter-Paul Giesberts, Ian Molton,
	linux-wireless, brcm80211-dev-list.pdl, brcm80211-dev-list,
	netdev, linux-kernel
In-Reply-To: <87fu1izjcs.fsf@kamboji.qca.qualcomm.com>

On 6/19/2018 9:06 PM, Kalle Valo wrote:
> Arend van Spriel <arend.vanspriel@broadcom.com> writes:
>
>> On 5/30/2018 11:06 AM, Michael Trimarchi wrote:
>>> Using built-in in kernel image without a firmware in filesystem
>>> or in the kernel image can lead to a kernel NULL pointer deference.
>>> Watchdog need to be stopped in brcmf_sdio_remove
>>>
>>> The system is going down NOW!
>>> [ 1348.110759] Unable to handle kernel NULL pointer dereference at
>>> virtual address 000002f8
>>> Sent SIGTERM to all processes
>>> [ 1348.121412] Mem abort info:
>>> [ 1348.126962]   ESR = 0x96000004
>>> [ 1348.130023]   Exception class = DABT (current EL), IL = 32 bits
>>> [ 1348.135948]   SET = 0, FnV = 0
>>> [ 1348.138997]   EA = 0, S1PTW = 0
>>> [ 1348.142154] Data abort info:
>>> [ 1348.145045]   ISV = 0, ISS = 0x00000004
>>> [ 1348.148884]   CM = 0, WnR = 0
>>> [ 1348.151861] user pgtable: 4k pages, 48-bit VAs, pgdp = (____ptrval____)
>>> [ 1348.158475] [00000000000002f8] pgd=0000000000000000
>>> [ 1348.163364] Internal error: Oops: 96000004 [#1] PREEMPT SMP
>>> [ 1348.168927] Modules linked in: ipv6
>>> [ 1348.172421] CPU: 3 PID: 1421 Comm: brcmf_wdog/mmc0 Not tainted
>>> 4.17.0-rc5-next-20180517 #18
>>> [ 1348.180757] Hardware name: Amarula A64-Relic (DT)
>>> [ 1348.185455] pstate: 60000005 (nZCv daif -PAN -UAO)
>>> [ 1348.190251] pc : brcmf_sdiod_freezer_count+0x0/0x20
>>> [ 1348.195124] lr : brcmf_sdio_watchdog_thread+0x64/0x290
>>> [ 1348.200253] sp : ffff00000b85be30
>>> [ 1348.203561] x29: ffff00000b85be30 x28: 0000000000000000
>>> [ 1348.208868] x27: ffff00000b6cb918 x26: ffff80003b990638
>>> [ 1348.214176] x25: ffff0000087b1a20 x24: ffff80003b94f800
>>> [ 1348.219483] x23: ffff000008e620c8 x22: ffff000008f0b660
>>> [ 1348.224790] x21: ffff000008c6a858 x20: 00000000fffffe00
>>> [ 1348.230097] x19: ffff80003b94f800 x18: 0000000000000001
>>> [ 1348.235404] x17: 0000ffffab2e8a74 x16: ffff0000080d7de8
>>> [ 1348.240711] x15: 0000000000000000 x14: 0000000000000400
>>> [ 1348.246018] x13: 0000000000000400 x12: 0000000000000001
>>> [ 1348.251324] x11: 00000000000002c4 x10: 0000000000000a10
>>> [ 1348.256631] x9 : ffff00000b85bc40 x8 : ffff80003be11870
>>> [ 1348.261937] x7 : ffff80003dfc7308 x6 : 000000078ff08b55
>>> [ 1348.267243] x5 : 00000139e1058400 x4 : 0000000000000000
>>> [ 1348.272550] x3 : dead000000000100 x2 : 958f2788d6618100
>>> [ 1348.277856] x1 : 00000000fffffe00 x0 : 0000000000000000
>>
>> Forgot about this one.
>>
>> Acked-by: Arend van Spriel <arend.vanspriel@broadcom.com>
>
> Should this go to 4.18?

It is a bit of a corner case, but yeah let's do that.

Regards,
Arend

^ permalink raw reply

* Re: [PATCH] Revert "net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends"
From: Andreas Schwab @ 2018-06-19 19:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Mathieu Malaterre, David S. Miller, Eric Dumazet, LKML,
	Christophe LEROY, Meelis Roos, netdev, linuxppc-dev
In-Reply-To: <dd6f13bc-c5f2-85f8-c08d-837bc024fc7c@gmail.com>

On Jun 18 2018, Eric Dumazet <eric.dumazet@gmail.com> wrote:

> DUMP_PREFIX_ADDRESS might give us more information (say alignment problem, or crossing page boundaries)

DUMP_PREFIX_ADDRESS is useless for that purpose.

Here are some samples of broken csums:

[  853.849225] sungem: sungem wrong csum : 9886/07be, len 94 bytes, c0000001fa187e02
[  853.849232] raw data: 00000000: 00 0d 93 43 81 62 18 d6 c7 51 b8 1c 08 00 45 10  ...C.b...Q....E.
[  853.849235] raw data: 00000010: 00 4c cb a0 40 00 40 11 d9 97 c0 a8 0a 01 c0 a8  .L..@.@.........
[  853.849237] raw data: 00000020: 0a 07 00 7b 00 7b 00 38 69 e1 1c 03 0c f7 00 00  ...{.{.8i.......
[  853.849240] raw data: 00000030: 08 f0 00 00 15 f0 c0 35 67 67 de d3 ca c9 d9 5b  .......5gg.....[
[  853.849242] raw data: 00000040: 1f ff de d3 d2 86 8f 67 fa f2 de d3 d2 86 8f 38  .......g.......8
[  853.849244] raw data: 00000050: 2f ff de d3 d2 86 8f 3b ff ff d1 93 bc 50        /......;.....P

[  857.883052] sungem: sungem wrong csum : dbb4/c48f, len 94 bytes, c0000001fa185882
[  857.883058] raw data: 00000000: 00 0d 93 43 81 62 18 d6 c7 51 b8 1c 08 00 45 00  ...C.b...Q....E.
[  857.883070] raw data: 00000010: 00 4c a1 97 40 00 3a 11 ce ed d9 5b 2c 11 c0 a8  .L..@.:....[,...
[  857.883080] raw data: 00000020: 0a 07 00 7b 00 7b 00 38 14 4b 24 02 06 ea 00 00  ...{.{.8.K$.....
[  857.883085] raw data: 00000030: 00 0b 00 00 02 99 c0 a8 64 09 de d3 d2 5a 36 e4  ........d....Z6.
[  857.883090] raw data: 00000040: bc f5 de d3 d2 8a 8f 2c 17 44 de d3 d2 8a 93 8b  .......,.D......
[  857.883094] raw data: 00000050: d7 b7 de d3 d2 8a 93 97 69 6e 39 7b d2 5a        ........in9{.Z

[  858.124689] sungem: sungem wrong csum : 1f4f/02d0, len 118 bytes, c0000001fa185602
[  858.124700] raw data: 00000000: 00 0d 93 43 81 62 d4 3d 7e 4c 48 b7 86 dd 61 01  ...C.b.=~LH...a.
[  858.124705] raw data: 00000010: 1e b1 00 3c 06 40 20 01 0a 62 17 11 88 01 00 00  ...<.@ ..b......
[  858.124709] raw data: 00000020: 00 00 00 00 0a 38 20 01 0a 62 17 11 88 01 00 00  .....8 ..b......
[  858.124714] raw data: 00000030: 00 00 00 00 00 07 94 b4 00 16 86 f5 29 e8 36 cb  ............).6.
[  858.124718] raw data: 00000040: 50 49 80 18 05 93 9a 53 00 00 01 01 08 0a 58 b2  PI.....S......X.
[  858.124723] raw data: 00000050: de 54 61 5f 2f 3c 00 00 00 10 cc 08 55 f7 da 21  .Ta_/<......U..!
[  858.124727] raw data: 00000060: f4 60 0a 6b 3c aa b9 b3 7e 61 10 b8 c2 be 9a 0b  .`.k<...~a......
[  858.124731] raw data: 00000070: c7 e9 5b 97 1b ac                                ..[...

[  858.126522] sungem: sungem wrong csum : 0836/19e9, len 90 bytes, c0000001fa185382
[  858.126530] raw data: 00000000: 00 0d 93 43 81 62 d4 3d 7e 4c 48 b7 86 dd 61 01  ...C.b.=~LH...a.
[  858.126532] raw data: 00000010: 1e b1 00 20 06 40 20 01 0a 62 17 11 88 01 00 00  ... .@ ..b......
[  858.126535] raw data: 00000020: 00 00 00 00 0a 38 20 01 0a 62 17 11 88 01 00 00  .....8 ..b......
[  858.126537] raw data: 00000030: 00 00 00 00 00 07 94 b4 00 16 86 f5 2a 04 36 cb  ............*.6.
[  858.126540] raw data: 00000040: 50 65 80 10 05 93 3e 56 00 00 01 01 08 0a 58 b2  Pe....>V......X.
[  858.126542] raw data: 00000050: de 56 61 5f 30 4d 1d 58 42 d2                    .Va_0M.XB.

[  858.131559] sungem: sungem wrong csum : 5891/c98d, len 90 bytes, c0000001fa185102
[  858.131567] raw data: 00000000: 00 0d 93 43 81 62 d4 3d 7e 4c 48 b7 86 dd 61 01  ...C.b.=~LH...a.
[  858.131570] raw data: 00000010: 1e b1 00 20 06 40 20 01 0a 62 17 11 88 01 00 00  ... .@ ..b......
[  858.131572] raw data: 00000020: 00 00 00 00 0a 38 20 01 0a 62 17 11 88 01 00 00  .....8 ..b......
[  858.131574] raw data: 00000030: 00 00 00 00 00 07 94 b4 00 16 86 f5 2a 04 36 cb  ............*.6.
[  858.131577] raw data: 00000040: 50 a1 80 10 05 93 3e 10 00 00 01 01 08 0a 58 b2  P.....>.......X.
[  858.131579] raw data: 00000050: de 5b 61 5f 30 52 3f ea 70 9b                    .[a_0R?.p.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox