Netdev List
 help / color / mirror / Atom feed
* Re: [Patch v6 net-next 0/2] Intel Wired LAN Driver Updates
From: David Miller @ 2014-01-18  2:38 UTC (permalink / raw)
  To: aaron.f.brown; +Cc: netdev, gospo, sassmann, ethan.kernel
In-Reply-To: <1389930065-3330-1-git-send-email-aaron.f.brown@intel.com>

From: Aaron Brown <aaron.f.brown@intel.com>
Date: Thu, 16 Jan 2014 19:41:03 -0800

> This series contains updates to ixgbe Ethan Zhao.  The first one replaces
> the magic number "63" with a macro, IXGBE_MAX_VFS_DRV_LIMIT, the second 
> moves the call to set driver_max_VFS to before SRIOV is enabled.
> 
> The code of these patches match the v3 (1/2) and v2 (2/2) versions sent
> to the e1000-devel and netdev mailing lists.  The intermediate versions
> (v4, v5) are from sorting out style issues, mostly tabs to spaces and
> split lines probably introduced via mailer.

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next] net: vxlan: do not use vxlan_net before checking event type
From: David Miller @ 2014-01-18  2:50 UTC (permalink / raw)
  To: dborkman; +Cc: netdev, ebiederm, jesse.brandeburg, xiyou.wangcong
In-Reply-To: <1389959706-30976-1-git-send-email-dborkman@redhat.com>

From: Daniel Borkmann <dborkman@redhat.com>
Date: Fri, 17 Jan 2014 12:55:06 +0100

> Jesse Brandeburg reported that commit acaf4e70997f caused a panic
> when adding a network namespace while vxlan module was present in
> the system:
 ...
> Apparently loopback device is being registered first and thus we
> receive an event notification when vxlan_net is not ready. Hence,
> when we call net_generic() and request vxlan_net_id, we seem to
> access garbage at that point in time. In setup_net() where we set
> up a newly allocated network namespace, we traverse the list of
> pernet ops ...
> 
> list_for_each_entry(ops, &pernet_list, list) {
> 	error = ops_init(ops, net);
> 	if (error < 0)
> 		goto out_undo;
> }
> 
> ... and loopback_net_init() is invoked first here, so in the middle
> of setup_net() we get this notification in vxlan. As currently we
> only care about devices that unregister, move access through
> net_generic() there. Fix is based on Cong Wang's proposal, but
> only changes what is needed here. It sucks a bit as we only work
> around the actual cure: right now it seems the only way to check if
> a netns actually finished traversing all init ops would be to check
> if it's part of net_namespace_list. But that I find quite expensive
> each time we go through a notifier callback. Anyway, did a couple
> of tests and it seems good for now.
> 
> Fixes: acaf4e70997f ("net: vxlan: when lower dev unregisters remove vxlan dev as well")
> Reported-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next v2 0/2] bonding: add slave netlink and sysfs support
From: David Miller @ 2014-01-18  2:53 UTC (permalink / raw)
  To: sfeldma; +Cc: vfalico, fubar, andy, netdev, roopa, shm, dingtianhong
In-Reply-To: <20140117065316.3194.94624.stgit@monster-03.cumulusnetworks.com>

From: Scott Feldman <sfeldma@cumulusnetworks.com>
Date: Thu, 16 Jan 2014 22:57:42 -0800

> v2:
> 
>   - Address review comment from Ding (and Veacesiav): handle kobj cleanup
>     if sysfs_create_file() fails when adding slave attribute nodes.
> 
> v1:
> 
>   The following series adds bonding slave netlink and sysfs interfaces.
>   Slave interfaces get a new IFLA_SLAVE set of netlink attributes, along
>   with RTM_NEWLINK notification when slave's active status changes.  The
>   sysfs interface adds read-only nodes for slave attributes under a /slave
>   dir, simliar to how bond interfaces get a /bonding dir for bonding
>   attributes.

Applied, thanks Scott.

^ permalink raw reply

* Re: [PATCH net-next] net: ftgmac100: use kfree_skb() where appropriate
From: David Miller @ 2014-01-18  2:54 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1389944304.31367.476.camel@edumazet-glaptop2.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 16 Jan 2014 23:38:24 -0800

> From: Eric Dumazet <edumazet@google.com>
> 
> In order to get correct drop monitor notifications for dropped
> packets, we should call kfree_skb() instead of dev_kfree_skb()
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks Eric.

^ permalink raw reply

* Re: [PATCH v2 net] bpf: do not use reciprocal divide
From: David Miller @ 2014-01-18  2:56 UTC (permalink / raw)
  To: heiko.carstens
  Cc: eric.dumazet, schwidefsky, hannes, netdev, dborkman, darkjames-ws,
	mgherzan, rmk+kernel, matt
In-Reply-To: <20140117085916.GA4208@osiris>

From: Heiko Carstens <heiko.carstens@de.ibm.com>
Date: Fri, 17 Jan 2014 09:59:16 +0100

> Could you please also apply the patch below to your tree? It would only
> generate a merge conflict, that would need fixing, if it would sit in the
> s390 tree.

Applied and I queued it up for -stable so I can combine it with
Eric's original change when I submit it to -stable.

^ permalink raw reply

* Re: [net-next 0/3] Intel Wired LAN Driver Updates
From: David Miller @ 2014-01-18  2:58 UTC (permalink / raw)
  To: aaron.f.brown; +Cc: netdev, gospo, sassmann
In-Reply-To: <1389950498-8820-1-git-send-email-aaron.f.brown@intel.com>

From: Aaron Brown <aaron.f.brown@intel.com>
Date: Fri, 17 Jan 2014 01:21:35 -0800

> This series contains an updates to ixgbe and ixgbevf.
> 
> Jacob add braces around some ixgbe_qv_lock_* calls lto better adhere 
> to Kernel style guidelines.  Don bumps the versions on ixgbe and
> ixgbevf to match internal driver functionality better.

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next] virtio-net: fix build error when CONFIG_AVERAGE is not enabled
From: David Miller @ 2014-01-18  2:58 UTC (permalink / raw)
  To: mwdalton; +Cc: mst, netdev, virtualization, edumazet
In-Reply-To: <1389950828-24039-1-git-send-email-mwdalton@google.com>

From: Michael Dalton <mwdalton@google.com>
Date: Fri, 17 Jan 2014 01:27:08 -0800

> Commit ab7db91705e9 ("virtio-net: auto-tune mergeable rx buffer size for
> improved performance") introduced a virtio-net dependency on EWMA.
> The inclusion of EWMA is controlled by CONFIG_AVERAGE. Fix build error
> when CONFIG_AVERAGE is not enabled by adding select AVERAGE to
> virtio-net's Kconfig entry.
> 
> Build failure reported using config make ARCH=s390 defconfig.
> 
> Signed-off-by: Michael Dalton <mwdalton@google.com>

Applied.

^ permalink raw reply

* Re: [PATCH 0/4] Patchset - Support for configurable RSS hash key
From: David Miller @ 2014-01-18  3:01 UTC (permalink / raw)
  To: VenkatKumar.Duvvuru; +Cc: netdev
In-Reply-To: <BF3270C86E8B1349A26C34E4EC1C44CB2C83D8B7@CMEXMB1.ad.emulex.com>

From: Venkata Duvvuru <VenkatKumar.Duvvuru@Emulex.Com>
Date: Fri, 17 Jan 2014 13:02:10 +0000

> NIC drivers that support RSS use either a hard-coded value or a random value for the RSS hash key. Irrespective of the type of the key used, the user would want to change the hash key if he/she is not satisfied with the effectiveness of the default hash-key in spreading the incoming flows evenly across the RSS queues.
> 
> This patch set provides support for configuring the RSS hash-key via the ethtool interface.
> 
> The patch set consists of:
> a) ethtool user-land patches
> b) ethtool kernel patch
> c) be2net patch that implements the ethtool hooks

Your submission is confusing.

Changes for the kernel side of things should be submitted as a separate
series, and you do not need to mention where the tree was, via SHA ID,
in the commit message.  It is sufficient to say that your patches are
against the 'net-net' tree, but after the "---" delimiter.  It's not
useful in the commit message proper.

^ permalink raw reply

* Re: [PATCH 17/41] net: Replace __this_cpu_inc in route.c with raw_cpu_inc
From: David Miller @ 2014-01-18  3:05 UTC (permalink / raw)
  To: cl; +Cc: tj, akpm, rostedt, linux-kernel, mingo, peterz, tglx, netdev,
	edumazet
In-Reply-To: <20140117151836.140608046@linux.com>

From: Christoph Lameter <cl@linux.com>
Date: Fri, 17 Jan 2014 09:18:29 -0600

> Acked-by: Ingo Molnar <mingo@kernel.org>
> Signed-off-by: Christoph Lameter <cl@linux.com>

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply

* Re: [PATCH 24/41] net: Replace get_cpu_var through this_cpu_ptr
From: David Miller @ 2014-01-18  3:05 UTC (permalink / raw)
  To: cl; +Cc: tj, akpm, rostedt, linux-kernel, mingo, peterz, tglx, netdev,
	edumazet
In-Reply-To: <20140117151836.883704411@linux.com>

From: Christoph Lameter <cl@linux.com>
Date: Fri, 17 Jan 2014 09:18:36 -0600

> [Patch depends on another patch in this series that introduces raw_cpu_ops]
> 
> Replace uses of get_cpu_var for address calculation through this_cpu_ptr.
> 
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: netdev@vger.kernel.org
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Christoph Lameter <cl@linux.com>

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply

* Re: [PATCH] Bluetooth: remove direct compilation of 6lowpan_iphc.c
From: David Miller @ 2014-01-18  3:10 UTC (permalink / raw)
  To: swarren
  Cc: dbaryshkov, marcel, sfr, linux-next, linux-kernel,
	linux-zigbee-devel, alex.bluesman.smirnov, netdev, jukka.rissanen,
	swarren
In-Reply-To: <1389986964-5177-1-git-send-email-swarren@wwwdotorg.org>

From: Stephen Warren <swarren@wwwdotorg.org>
Date: Fri, 17 Jan 2014 12:29:24 -0700

> From: Stephen Warren <swarren@nvidia.com>
> 
> It's now built as a separate utility module, and enabling BT selects
> that module in Kconfig. This fixes:
 ...
> (this change probably simply wasn't "git add"d to a53d34c3465b)
> 
> Fixes: a53d34c3465b ("net: move 6lowpan compression code to separate module")
> Fixes: 18722c247023 ("Bluetooth: Enable 6LoWPAN support for BT LE devices")
> Signed-off-by: Stephen Warren <swarren@nvidia.com>

Applied to net-next, thanks a lot.

^ permalink raw reply

* Re: pull request: wireless-next 2014-01-17
From: David Miller @ 2014-01-18  3:11 UTC (permalink / raw)
  To: linville; +Cc: linux-wireless, netdev
In-Reply-To: <20140117200028.GC15145@tuxdriver.com>

From: "John W. Linville" <linville@tuxdriver.com>
Date: Fri, 17 Jan 2014 15:00:29 -0500

> Please pull this batch of updates for the 3.14 stream!

Pulled, thanks John.

^ permalink raw reply

* Re: [net-next 0/9] Intel Wired LAN Driver Updates
From: David Miller @ 2014-01-18  3:17 UTC (permalink / raw)
  To: aaron.f.brown; +Cc: netdev, gospo, sassmann
In-Reply-To: <1390001799-19425-1-git-send-email-aaron.f.brown@intel.com>

From: Aaron Brown <aaron.f.brown@intel.com>
Date: Fri, 17 Jan 2014 15:36:30 -0800

> This series contains updates to i40e.  
> 
> Neerav implements DCB and DCBNL support and adds DCB options
> to Kconfig.  DCB is disabled by default.
> 
> Anjali refactors flow control director to fix inconsistencies
> that were preventing clean unloads of the driver, move the
> queues for handling flow director error into their own hardware
> VSI and implement a corrected version of the basic ethtool add 
> ntuple rule.
> 
> Jesse provides fixes for a compiler warning, firmware workaround,
> white space fixes and renames some defines.
> 
> Shannon reworks the device ID #defines to follow the
> DEV_ID_ convention followed by our other drivers.

Series applied.

^ permalink raw reply

* Re: [net-next 0/7] Intel Wired LAN Driver Updates
From: David Miller @ 2014-01-18  3:18 UTC (permalink / raw)
  To: aaron.f.brown; +Cc: netdev, gospo, sassmann
In-Reply-To: <1390012205-21995-1-git-send-email-aaron.f.brown@intel.com>

From: Aaron Brown <aaron.f.brown@intel.com>
Date: Fri, 17 Jan 2014 18:29:58 -0800

> This series contains updates from Emil to ixgbevf.
> 
> He cleans up the code by removing the adapter structure as a
> parameter from multiple functions in favor of using the ixgbevf_ring
> structure and moves hot-path specific statistic int the ring 
> structure for anticipated performance gains.
> 
> He also removes the Tx/Rx counters for checksum offload and adds 
> counters for tx_restart_queue and tx_timeout_count.
> 
> Next he makes it so that the first tx_buffer structure acts as a
> central storage location for most the skb info we are about to
> transmit, then takes advantage of the dma buffer always being
> present in the first descriptor and mapped as single allowing a 
> call to dma_unmap_single which alleviates the need to check for
> DMA mapping in ixgbevf_clean_tx_irq().  
> 
> Finally he merges the ixgbevf_tx_map call and the ixgbevf_tx_queue
> call into a single function.

Series applied, thanks.

^ permalink raw reply

* [GIT] Networking
From: David Miller @ 2014-01-18  3:25 UTC (permalink / raw)
  To: torvalds; +Cc: akpm, netdev, linux-kernel


1) The value choosen for the new SO_MAX_PACING_RATE socket option on
   parisc was very poorly choosen, let's fix it while we still can.
   From Eric Dumazet.

2) Our generic reciprocal divide was found to handle some edge cases
   incorrectly, part of this is encoded into the BPF as deep as the
   JIT engines themselves.  Just use a real divide throughout for now.
   From Eric Dumazet.

3) Because the initial lookup is lockless, the TCP metrics engine
   can end up creating two entries for the same lookup key.  Fix this
   by doing a second lookup under the lock before we actually create
   the new entry.  From Christoph Paasch.

4) Fix scatter-gather list init in usbnet driver, from Bjørn Mork.

5) Fix unintended 32-bit truncation in cxgb4 driver's bit shifting.
   From Dan Carpenter.

6) Netlink socket dumping uses the wrong socket state for timewait
   sockets.  Fix from Neal Cardwell.

7) Fix netlink memory leak in ieee802154_add_iface(), from Christian
   Engelmayer.

8) Multicast forwarding in ipv4 can overflow the per-rule reference
   counts, causing all multicast traffic to cease.  Fix from
   Hannes Frederic Sowa.

9) via-rhine needs to stop all TX queues when it resets the device,
   from Richard Weinberger.

10) Fix RDS per-cpu accesses broken by the this_cpu_* conversions.
    From Gerald Schaefer.

Please pull, thanks a lot!

The following changes since commit 228fdc083b017eaf90e578fa86fb1ecfd5ffae87:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2014-01-11 06:37:11 +0700)

are available in the git repository at:


  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git master

for you to fetch changes up to 3af57f78c38131b7a66e2b01e06fdacae01992a3:

  s390/bpf,jit: fix 32 bit divisions, use unsigned divide instructions (2014-01-17 18:54:49 -0800)

----------------------------------------------------------------
Bjørn Mork (1):
      net: usbnet: fix SG initialisation

Christian Engelmayer (1):
      ieee802154: Fix memory leak in ieee802154_add_iface()

Christoph Paasch (1):
      tcp: metrics: Avoid duplicate entries with the same destination-IP

Dan Carpenter (1):
      cxgb4: silence shift wrapping static checker warning

David S. Miller (1):
      Merge tag 'batman-adv-fix-for-davem' of git://git.open-mesh.org/linux-merge

Eric Dumazet (2):
      bpf: do not use reciprocal divide
      parisc: fix SO_MAX_PACING_RATE typo

Gerald Schaefer (1):
      net: rds: fix per-cpu helper usage

Hannes Frederic Sowa (2):
      net: avoid reference counter overflows on fib_rules in multicast forwarding
      ipv6: simplify detection of first operational link-local address on interface

Heiko Carstens (1):
      s390/bpf,jit: fix 32 bit divisions, use unsigned divide instructions

Ivan Vecera (1):
      be2net: add dma_mapping_error() check for dma_map_page()

Jitendra Kalsaria (1):
      qlge: Fix vlan netdev features.

Marek Lindner (1):
      batman-adv: fix batman-adv header overhead calculation

Michael S. Tsirkin (1):
      MAINTAINERS: add virtio-dev ML for virtio

Mika Westerberg (1):
      e1000e: Fix compilation warning when !CONFIG_PM_SLEEP

Neal Cardwell (1):
      inet_diag: fix inet_diag_dump_icsk() to use correct state for timewait sockets

Peter Korsgaard (1):
      dm9601: add USB IDs for new dm96xx variants

Richard Weinberger (1):
      net,via-rhine: Fix tx_timeout handling

Yuval Mintz (1):
      bnx2x: Don't release PCI bars on shutdown

 MAINTAINERS                                      |  3 +++
 arch/arm/net/bpf_jit_32.c                        |  6 +++---
 arch/parisc/include/uapi/asm/socket.h            |  2 +-
 arch/powerpc/net/bpf_jit_comp.c                  |  7 ++++---
 arch/s390/net/bpf_jit_comp.c                     | 29 +++++++++++++++++-----------
 arch/sparc/net/bpf_jit_comp.c                    | 17 ++++++++++++++---
 arch/x86/net/bpf_jit_comp.c                      | 14 ++++++++++----
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 29 ++++++++++++++--------------
 drivers/net/ethernet/chelsio/cxgb4/l2t.c         |  2 +-
 drivers/net/ethernet/emulex/benet/be_main.c      | 11 +++++++++--
 drivers/net/ethernet/intel/e1000e/netdev.c       |  8 ++------
 drivers/net/ethernet/qlogic/qlge/qlge_main.c     |  2 ++
 drivers/net/ethernet/via/via-rhine.c             |  1 +
 drivers/net/usb/dm9601.c                         | 12 ++++++++++++
 drivers/net/usb/usbnet.c                         |  2 +-
 include/net/if_inet6.h                           |  1 -
 net/batman-adv/main.c                            |  2 +-
 net/core/filter.c                                | 30 ++---------------------------
 net/ieee802154/nl-phy.c                          |  6 ++++--
 net/ipv4/inet_diag.c                             |  5 ++++-
 net/ipv4/ipmr.c                                  |  7 +++++--
 net/ipv4/tcp_metrics.c                           | 51 +++++++++++++++++++++++++++++++-------------------
 net/ipv6/addrconf.c                              | 38 +++++++++++++++++--------------------
 net/ipv6/ip6mr.c                                 |  7 +++++--
 net/rds/ib_recv.c                                |  7 +++----
 25 files changed, 169 insertions(+), 130 deletions(-)

^ permalink raw reply

* Re: [PATCH net-next] net: vxlan: do not use vxlan_net before checking event type
From: Cong Wang @ 2014-01-18  3:50 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: David Miller, Linux Kernel Network Developers, Eric W. Biederman,
	Jesse Brandeburg
In-Reply-To: <52D97746.1040408@redhat.com>

On Fri, Jan 17, 2014 at 10:32 AM, Daniel Borkmann <dborkman@redhat.com> wrote:
>
>
> If you want to do cleanups, whatever, I really don't care.
> You had your chance to complain about that when you reviewed
> the initial version ... it has nothing to do with the fix.

This is not for stable, as long as it doesn't harm the readability
we are free to do any cleanup's.

If unsure, check Eric's patch for tunnel dst cache.

BTW, I am the original author of the patch, you just updated
it *trivially* and set yourself as the author. :) I don't mind, but
remember that this may be not appropriate for others. At
very least I didn't and don't do this myself.

^ permalink raw reply

* Re: [PATCH RFC 4/6] net: rfkill: gpio: add device tree support
From: Chen-Yu Tsai @ 2014-01-18  4:41 UTC (permalink / raw)
  To: Linus Walleij
  Cc: Mika Westerberg, Heikki Krogerus, Alexandre Courbot,
	Arnd Bergmann, linux-arm-kernel, Johannes Berg, David S. Miller,
	devicetree, netdev, linux-wireless, linux-sunxi, linux-kernel,
	Maxime Ripard
In-Reply-To: <CACRpkdZOD4zeA8T5kbJ4c5NsnuzHCg1mw8rRMYNT9c4R-Qnc6A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Sat, Jan 18, 2014 at 7:11 AM, Linus Walleij <linus.walleij-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org> wrote:
> On Fri, Jan 17, 2014 at 6:43 PM, Chen-Yu Tsai <wens-jdAy2FN1RRM@public.gmane.org> wrote:
>> On Sat, Jan 18, 2014 at 12:47 AM, Arnd Bergmann <arnd-r2nGTMty4D4@public.gmane.org> wrote:
>
>>>> +- NAME_shutdown-gpios  : GPIO phandle to shutdown control
>>>> +                         (phandle must be the second)
>>>> +- NAME_reset-gpios     : GPIO phandle to reset control
>>>> +
>>>> +NAME must match the rfkill-name property. NAME_shutdown-gpios or
>>>> +NAME_reset-gpios, or both, must be defined.
>>>> +
>>>
>>> I don't understand this part. Why do you include the name in the
>>> gpios property, rather than just hardcoding the property strings
>>> to "shutdown-gpios" and "reset-gpios"?
>>
>> This quirk is a result of how gpiod_get_index implements device tree
>> lookup.
>
> Why can't it just have a single property "gpios", where the first
> element is the reset GPIO and the second is the shutdown GPIO?
>
> rfkill-gpio does this:
>
> gpio = devm_gpiod_get_index(&pdev->dev, rfkill->reset_name, 0);
> gpio = devm_gpiod_get_index(&pdev->dev, rfkill->shutdown_name, 1);
>
> The passed con ID name parameter is only there for the device
> tree case it seems. (ACPI ignores it.) So what about you just
> don't pass it at all and patch it to do like this instead:
>
> gpio = devm_gpiod_get_index(&pdev->dev, NULL, 0);
> gpio = devm_gpiod_get_index(&pdev->dev, NULL, 1);

I'd like that. It's much cleaner.

> Heikki, are you OK with this change?
>
> I think this is actually necessary if the ACPI and DT unification
> pipe dream shall limp forward, we cannot have arguments passed
> that have a semantic effect on DT but not on ACPI... Drivers
> that are supposed to use both ACPI and DT will always
> have to pass NULL as con ID.
>
>> If con_id is given, it is prepended to "gpios" as the property string.
>> con_id is also used as the label passed to gpiod_request, which is
>> then shown in /sys/kernel/debug/gpio.
>
> If your problem  is really what turns up in debugfs, then we need
> to figure out a way to label gpios outside of the *gpiod_get* calls.

Let's add a gpiod_set_label call. Currently there's a desc_set_label
in gpiolib, which is static inlined. We can either rename and promote
it to non-static, or add a new wrapping function.

> The string passed in *gpiod_get* is a "connection ID" not a proper
> name for the GPIO.

I see. Perhaps we should not pass this to gpiod_request as the label,
or add a comment stating consumers can use the new gpiod_set_label call
to change it.


Cheers,
ChenYu

^ permalink raw reply

* Re: [RFC PATCH net-next 3/3] virtio-net: Add accelerated RFS support
From: Tom Herbert @ 2014-01-18  4:59 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Zhi Yong Wu, Stefan Hajnoczi, Linux Netdev List, Eric Dumazet,
	David S. Miller, Zhi Yong Wu
In-Reply-To: <1389979208.27141.11.camel@bwh-desktop.uk.level5networks.com>

Ben,

I've never quite understood why flow management in aRFS has to be done
with separate messages, and if I recall this seems to mitigate
performance gains to a large extent. It seems like we should be able
to piggyback on a TX descriptor for a connection information about the
RX side for that connection, namely the rxhash and queue mapping.
State creation should be implicit by just seeing a new rxhash value,
tear down might be accomplished with a separate flag on the final TX
packet on the connection (this would need some additional logic in the
stack). Is this method not feasible in either NICs or virtio-net?

On Fri, Jan 17, 2014 at 9:20 AM, Ben Hutchings
<bhutchings@solarflare.com> wrote:
> On Sat, 2014-01-18 at 00:54 +0800, Zhi Yong Wu wrote:
>> On Fri, Jan 17, 2014 at 7:16 AM, Ben Hutchings
>> <bhutchings@solarflare.com> wrote:
> [...]
>> > However, to take advantage of ARFS on a physical net driver, it would be
>> > necessary to send a control request for part 2.
>> aRFS on a physical net driver? What is this physical net driver? I
>> thought that in order to enable aRFS, guest virtio_net driver should
>> send a control request to its emulated virtio_net NIC.
> [...]
>
> If the backend is connected to a macvlan device on top of a physical net
> device that supports ARFS, then there is further potential for improving
> performance by steering to the best physical RX queue and CPU as well as
> the best virtio_net RX queue and vCPU.
>
> Ben.
>
> --
> Ben Hutchings, Staff Engineer, Solarflare
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.
>

^ permalink raw reply

* Re: [PATCH net] net: core: orphan frags before queuing to slow qdisc
From: Jason Wang @ 2014-01-18  5:35 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, netdev, linux-kernel, Michael S. Tsirkin
In-Reply-To: <1389968897.31367.489.camel@edumazet-glaptop2.roam.corp.google.com>

On 01/17/2014 10:28 PM, Eric Dumazet wrote:
> On Fri, 2014-01-17 at 17:42 +0800, Jason Wang wrote:
>> Many qdiscs can queue a packet for a long time, this will lead an issue
>> with zerocopy skb. It means the frags will not be orphaned in an expected
>> short time, this breaks the assumption that virtio-net will transmit the
>> packet in time.
>>
>> So if guest packets were queued through such kind of qdisc and hit the
>> limitation of the max pending packets for virtio/vhost. All packets that
>> go to another destination from guest will also be blocked.
>>
>> A case for reproducing the issue:
>>
>> - Boot two VMs and connect them to the same bridge kvmbr.
>> - Setup tbf with a very low rate/burst on eth0 which is a port of kvmbr.
>> - Let VM1 send lots of packets thorugh eth0
>> - After a while, VM1 is unable to send any packets out since the number of
>>    pending packets (queued to tbf) were exceeds the limitation of vhost/virito
> So whats the problem ? If the limit is low, you cannot sent packets.

It was just an extreme case. The problem is if zercopy packets of vm1 
were throttled by qdisc in eth0, probably all packets from vm1 were 
throttled even if it was not go through eth0.
> Solution : increase the limit, or tell the vm to lower its rate.
>
> Oh wait, are you bitten because you did some prior skb_orphan() to allow
> the vm to send unlimited number of skbs ???
>

The problem is sndbuf were defaulted to INT_MAX to prevent a similar 
issue for non-zerocopy packets. For zerocopy, only after the frags were 
orphaned can vhost notify the completion of tx for virtio-net. So 
INT_MAX sndbuf is not enough.
>> Solve this issue by orphaning the frags before queuing it to a slow qdisc (the
>> one without TCQ_F_CAN_BYPASS).
> Why orphaning the frags only solves the problem ? A skb without zerocopy
> frags should also be blocked for a while.

It's ok for non-zerocopy packet to be blocked since VM1 thought the 
packets has been sent instead of pending in the virtqueue. So VM1 can 
still send packet to other destination.
> Seriously, lets admit this zero copy stuff is utterly broken.
>
>
> TCQ_F_CAN_BYPASS is not enough. Some NIC have separate queues with
> strict priorities.
>

Yes, but looks less serious than traffic shaping.
> It seems to me that you are pushing to use FIFO (the only qdisc setting
> TCQ_F_CAN_BYPASS), by adding yet another test in fast path (I do not
> know how we can still call it a fast path), while we already have smart
> qdisc to avoid the inherent HOL and unfairness problems of FIFO.
>

It was just a workaround like the case of sndbuf before we had a better 
solution. So looks like using sfq or fq in guest can mitigate the issue?
>> Cc: Michael S. Tsirkin<mst@redhat.com>
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
>> ---
>>   net/core/dev.c | 7 +++++++
>>   1 file changed, 7 insertions(+)
>>
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 0ce469e..1209774 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -2700,6 +2700,12 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>>   	contended = qdisc_is_running(q);
>>   	if (unlikely(contended))
>>   		spin_lock(&q->busylock);
>> +	if (!(q->flags&  TCQ_F_CAN_BYPASS)&&
>> +	    unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) {
>> +		kfree_skb(skb);
>> +		rc = NET_XMIT_DROP;
>> +		goto out;
>> +	}
> Are you aware that copying stuff takes time ?
>
> If yes, why is it done after taking the busylock spinlock ?
>

Yes and it should be done outside the spinlock.
>>
>>   	spin_lock(root_lock);
>>   	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED,&q->state))) {
>> @@ -2739,6 +2745,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>>   		}
>>   	}
>>   	spin_unlock(root_lock);
>> +out:
>>   	if (unlikely(contended))
>>   		spin_unlock(&q->busylock);
>>   	return rc;
>
>

^ permalink raw reply

* Re: ipv4_dst_destroy panic regression after 3.10.15
From: Eric Dumazet @ 2014-01-18  6:49 UTC (permalink / raw)
  To: dormando; +Cc: netdev, linux-kernel
In-Reply-To: <alpine.DEB.2.10.1401171720560.15759@dinf>

On Fri, 2014-01-17 at 17:25 -0800, dormando wrote:
> Hi,
> 
> Upgraded a few kernels to the latest 3.10 stable tree while tracking down
> a rare kernel panic, seems to have introduced a much more frequent kernel
> panic. Takes anywhere from 4 hours to 2 days to trigger:
> 
> <4>[196727.311203] general protection fault: 0000 [#1] SMP
> <4>[196727.311224] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge coretemp crc32_pclmul ghash_clmulni_intel gpio_ich microcode ipmi_watchdog ipmi_devintf sb_edac edac_core lpc_ich mfd_core tpm_tis tpm tpm_bios ipmi_si ipmi_msghandler isci igb libsas i2c_algo_bit ixgbe ptp pps_core mdio
> <4>[196727.311333] CPU: 17 PID: 0 Comm: swapper/17 Not tainted 3.10.26 #1
> <4>[196727.311344] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013
> <4>[196727.311364] task: ffff885e6f069700 ti: ffff885e6f072000 task.ti: ffff885e6f072000
> <4>[196727.311377] RIP: 0010:[<ffffffff815f8c7f>]  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
> <4>[196727.311399] RSP: 0018:ffff885effd23a70  EFLAGS: 00010282
> <4>[196727.311409] RAX: dead000000200200 RBX: ffff8854c398ecc0 RCX: 0000000000000040
> <4>[196727.311423] RDX: dead000000100100 RSI: dead000000100100 RDI: dead000000200200
> <4>[196727.311437] RBP: ffff885effd23a80 R08: ffffffff815fd9e0 R09: ffff885d5a590800
> <4>[196727.311451] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> <4>[196727.311464] R13: ffffffff81c8c280 R14: 0000000000000000 R15: ffff880e85ee16ce
> <4>[196727.311510] FS:  0000000000000000(0000) GS:ffff885effd20000(0000) knlGS:0000000000000000
> <4>[196727.311554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> <4>[196727.311581] CR2: 00007a46751eb000 CR3: 0000005e65688000 CR4: 00000000000407e0
> <4>[196727.311625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> <4>[196727.311669] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> <4>[196727.311713] Stack:
> <4>[196727.311733]  ffff8854c398ecc0 ffff8854c398ecc0 ffff885effd23ab0 ffffffff815b7f42
> <4>[196727.311784]  ffff88be6595bc00 ffff8854c398ecc0 0000000000000000 ffff8854c398ecc0
> <4>[196727.311834]  ffff885effd23ad0 ffffffff815b86c6 ffff885d5a590800 ffff8816827821c0
> <4>[196727.311885] Call Trace:
> <4>[196727.311907]  <IRQ>
> <4>[196727.311912]  [<ffffffff815b7f42>] dst_destroy+0x32/0xe0
> <4>[196727.311959]  [<ffffffff815b86c6>] dst_release+0x56/0x80
> <4>[196727.311986]  [<ffffffff81620bd5>] tcp_v4_do_rcv+0x2a5/0x4a0
> <4>[196727.312013]  [<ffffffff81622b5a>] tcp_v4_rcv+0x7da/0x820
> <4>[196727.312041]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
> <4>[196727.312070]  [<ffffffff815de02d>] ? nf_hook_slow+0x7d/0x150
> <4>[196727.312097]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
> <4>[196727.312125]  [<ffffffff815fda92>] ip_local_deliver_finish+0xb2/0x230
> <4>[196727.312154]  [<ffffffff815fdd9a>] ip_local_deliver+0x4a/0x90
> <4>[196727.312183]  [<ffffffff815fd799>] ip_rcv_finish+0x119/0x360
> <4>[196727.312212]  [<ffffffff815fe00b>] ip_rcv+0x22b/0x340
> <4>[196727.312242]  [<ffffffffa0339680>] ? macvlan_broadcast+0x160/0x160 [macvlan]
> <4>[196727.312275]  [<ffffffff815b0c62>] __netif_receive_skb_core+0x512/0x640
> <4>[196727.312308]  [<ffffffff811427fb>] ? kmem_cache_alloc+0x13b/0x150
> <4>[196727.312338]  [<ffffffff815b0db1>] __netif_receive_skb+0x21/0x70
> <4>[196727.312368]  [<ffffffff815b0fa1>] netif_receive_skb+0x31/0xa0
> <4>[196727.312397]  [<ffffffff815b1ae8>] napi_gro_receive+0xe8/0x140
> <4>[196727.312433]  [<ffffffffa00274f1>] ixgbe_poll+0x551/0x11f0 [ixgbe]
> <4>[196727.312463]  [<ffffffff815fe00b>] ? ip_rcv+0x22b/0x340
> <4>[196727.312491]  [<ffffffff815b1691>] net_rx_action+0x111/0x210
> <4>[196727.312521]  [<ffffffff815b0db1>] ? __netif_receive_skb+0x21/0x70
> <4>[196727.312552]  [<ffffffff810519d0>] __do_softirq+0xd0/0x270
> <4>[196727.312583]  [<ffffffff816cef3c>] call_softirq+0x1c/0x30
> <4>[196727.312613]  [<ffffffff81004205>] do_softirq+0x55/0x90
> <4>[196727.312640]  [<ffffffff81051c85>] irq_exit+0x55/0x60
> <4>[196727.312668]  [<ffffffff816cf5c3>] do_IRQ+0x63/0xe0
> <4>[196727.312696]  [<ffffffff816c5aaa>] common_interrupt+0x6a/0x6a
> <4>[196727.312722]  <EOI>
> <4>[196727.312727]  [<ffffffff8100a150>] ? default_idle+0x20/0xe0
> <4>[196727.312775]  [<ffffffff8100a8ff>] arch_cpu_idle+0xf/0x20
> <4>[196727.312803]  [<ffffffff8108d330>] cpu_startup_entry+0xc0/0x270
> <4>[196727.312833]  [<ffffffff816b276e>] start_secondary+0x1f9/0x200
> <4>[196727.312860] Code: 4a 9f e9 81 e8 13 cb 0c 00 48 8b 93 b0 00 00 00 48 bf 00 02 20 00 00 00 ad de 48 8b 83 b8 00 00 00 48 be 00 01 10 00 00 00 ad de <48> 89 42 08 48 89 10 48 89 bb b8 00 00 00 48 c7 c7 4a 9f e9 81
> <1>[196727.313071] RIP  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
> <4>[196727.313100]  RSP <ffff885effd23a70>
> <4>[196727.313377] ---[ end trace 64b3f14fae0f2e29 ]---
> <0>[196727.380908] Kernel panic - not syncing: Fatal exception in interrupt
> 
> 
> ... bisecting it's going to be a pain... I tried eyeballing the diffs and
> am trying a revert or two.
> 
> We've hit it in .25, .26 so far. I have .27 running but not sure if it
> crashed, so the change exists between .15 and .25.

Please try following fix, thanks for the report !

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 25071b48921c..78a50a22298a 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1333,7 +1333,7 @@ static void ipv4_dst_destroy(struct dst_entry
*dst)
 
 	if (!list_empty(&rt->rt_uncached)) {
 		spin_lock_bh(&rt_uncached_lock);
-		list_del(&rt->rt_uncached);
+		list_del_init(&rt->rt_uncached);
 		spin_unlock_bh(&rt_uncached_lock);
 	}
 }

^ permalink raw reply related

* Re: ipv4_dst_destroy panic regression after 3.10.15
From: Eric Dumazet @ 2014-01-18  7:09 UTC (permalink / raw)
  To: dormando; +Cc: netdev, linux-kernel, Alexei Starovoitov
In-Reply-To: <1390027758.31367.505.camel@edumazet-glaptop2.roam.corp.google.com>

On Fri, 2014-01-17 at 22:49 -0800, Eric Dumazet wrote:
> On Fri, 2014-01-17 at 17:25 -0800, dormando wrote:
> > Hi,
> > 
> > Upgraded a few kernels to the latest 3.10 stable tree while tracking down
> > a rare kernel panic, seems to have introduced a much more frequent kernel
> > panic. Takes anywhere from 4 hours to 2 days to trigger:
> > 
> > <4>[196727.311203] general protection fault: 0000 [#1] SMP
> > <4>[196727.311224] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge coretemp crc32_pclmul ghash_clmulni_intel gpio_ich microcode ipmi_watchdog ipmi_devintf sb_edac edac_core lpc_ich mfd_core tpm_tis tpm tpm_bios ipmi_si ipmi_msghandler isci igb libsas i2c_algo_bit ixgbe ptp pps_core mdio
> > <4>[196727.311333] CPU: 17 PID: 0 Comm: swapper/17 Not tainted 3.10.26 #1
> > <4>[196727.311344] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013
> > <4>[196727.311364] task: ffff885e6f069700 ti: ffff885e6f072000 task.ti: ffff885e6f072000
> > <4>[196727.311377] RIP: 0010:[<ffffffff815f8c7f>]  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
> > <4>[196727.311399] RSP: 0018:ffff885effd23a70  EFLAGS: 00010282
> > <4>[196727.311409] RAX: dead000000200200 RBX: ffff8854c398ecc0 RCX: 0000000000000040
> > <4>[196727.311423] RDX: dead000000100100 RSI: dead000000100100 RDI: dead000000200200
> > <4>[196727.311437] RBP: ffff885effd23a80 R08: ffffffff815fd9e0 R09: ffff885d5a590800
> > <4>[196727.311451] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> > <4>[196727.311464] R13: ffffffff81c8c280 R14: 0000000000000000 R15: ffff880e85ee16ce
> > <4>[196727.311510] FS:  0000000000000000(0000) GS:ffff885effd20000(0000) knlGS:0000000000000000
> > <4>[196727.311554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > <4>[196727.311581] CR2: 00007a46751eb000 CR3: 0000005e65688000 CR4: 00000000000407e0
> > <4>[196727.311625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > <4>[196727.311669] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > <4>[196727.311713] Stack:
> > <4>[196727.311733]  ffff8854c398ecc0 ffff8854c398ecc0 ffff885effd23ab0 ffffffff815b7f42
> > <4>[196727.311784]  ffff88be6595bc00 ffff8854c398ecc0 0000000000000000 ffff8854c398ecc0
> > <4>[196727.311834]  ffff885effd23ad0 ffffffff815b86c6 ffff885d5a590800 ffff8816827821c0
> > <4>[196727.311885] Call Trace:
> > <4>[196727.311907]  <IRQ>
> > <4>[196727.311912]  [<ffffffff815b7f42>] dst_destroy+0x32/0xe0
> > <4>[196727.311959]  [<ffffffff815b86c6>] dst_release+0x56/0x80
> > <4>[196727.311986]  [<ffffffff81620bd5>] tcp_v4_do_rcv+0x2a5/0x4a0
> > <4>[196727.312013]  [<ffffffff81622b5a>] tcp_v4_rcv+0x7da/0x820
> > <4>[196727.312041]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
> > <4>[196727.312070]  [<ffffffff815de02d>] ? nf_hook_slow+0x7d/0x150
> > <4>[196727.312097]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
> > <4>[196727.312125]  [<ffffffff815fda92>] ip_local_deliver_finish+0xb2/0x230
> > <4>[196727.312154]  [<ffffffff815fdd9a>] ip_local_deliver+0x4a/0x90
> > <4>[196727.312183]  [<ffffffff815fd799>] ip_rcv_finish+0x119/0x360
> > <4>[196727.312212]  [<ffffffff815fe00b>] ip_rcv+0x22b/0x340
> > <4>[196727.312242]  [<ffffffffa0339680>] ? macvlan_broadcast+0x160/0x160 [macvlan]
> > <4>[196727.312275]  [<ffffffff815b0c62>] __netif_receive_skb_core+0x512/0x640
> > <4>[196727.312308]  [<ffffffff811427fb>] ? kmem_cache_alloc+0x13b/0x150
> > <4>[196727.312338]  [<ffffffff815b0db1>] __netif_receive_skb+0x21/0x70
> > <4>[196727.312368]  [<ffffffff815b0fa1>] netif_receive_skb+0x31/0xa0
> > <4>[196727.312397]  [<ffffffff815b1ae8>] napi_gro_receive+0xe8/0x140
> > <4>[196727.312433]  [<ffffffffa00274f1>] ixgbe_poll+0x551/0x11f0 [ixgbe]
> > <4>[196727.312463]  [<ffffffff815fe00b>] ? ip_rcv+0x22b/0x340
> > <4>[196727.312491]  [<ffffffff815b1691>] net_rx_action+0x111/0x210
> > <4>[196727.312521]  [<ffffffff815b0db1>] ? __netif_receive_skb+0x21/0x70
> > <4>[196727.312552]  [<ffffffff810519d0>] __do_softirq+0xd0/0x270
> > <4>[196727.312583]  [<ffffffff816cef3c>] call_softirq+0x1c/0x30
> > <4>[196727.312613]  [<ffffffff81004205>] do_softirq+0x55/0x90
> > <4>[196727.312640]  [<ffffffff81051c85>] irq_exit+0x55/0x60
> > <4>[196727.312668]  [<ffffffff816cf5c3>] do_IRQ+0x63/0xe0
> > <4>[196727.312696]  [<ffffffff816c5aaa>] common_interrupt+0x6a/0x6a
> > <4>[196727.312722]  <EOI>
> > <4>[196727.312727]  [<ffffffff8100a150>] ? default_idle+0x20/0xe0
> > <4>[196727.312775]  [<ffffffff8100a8ff>] arch_cpu_idle+0xf/0x20
> > <4>[196727.312803]  [<ffffffff8108d330>] cpu_startup_entry+0xc0/0x270
> > <4>[196727.312833]  [<ffffffff816b276e>] start_secondary+0x1f9/0x200
> > <4>[196727.312860] Code: 4a 9f e9 81 e8 13 cb 0c 00 48 8b 93 b0 00 00 00 48 bf 00 02 20 00 00 00 ad de 48 8b 83 b8 00 00 00 48 be 00 01 10 00 00 00 ad de <48> 89 42 08 48 89 10 48 89 bb b8 00 00 00 48 c7 c7 4a 9f e9 81
> > <1>[196727.313071] RIP  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
> > <4>[196727.313100]  RSP <ffff885effd23a70>
> > <4>[196727.313377] ---[ end trace 64b3f14fae0f2e29 ]---
> > <0>[196727.380908] Kernel panic - not syncing: Fatal exception in interrupt
> > 
> > 
> > ... bisecting it's going to be a pain... I tried eyeballing the diffs and
> > am trying a revert or two.
> > 
> > We've hit it in .25, .26 so far. I have .27 running but not sure if it
> > crashed, so the change exists between .15 and .25.
> 
> Please try following fix, thanks for the report !
> 
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 25071b48921c..78a50a22298a 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -1333,7 +1333,7 @@ static void ipv4_dst_destroy(struct dst_entry
> *dst)
>  
>  	if (!list_empty(&rt->rt_uncached)) {
>  		spin_lock_bh(&rt_uncached_lock);
> -		list_del(&rt->rt_uncached);
> +		list_del_init(&rt->rt_uncached);
>  		spin_unlock_bh(&rt_uncached_lock);
>  	}
>  }
> 

Problem could come from this commit, in linux 3.10.23,
you also could try to revert it
 
commit 62713c4b6bc10c2d082ee1540e11b01a2b2162ab
Author: Alexei Starovoitov <ast@plumgrid.com>
Date:   Tue Nov 19 19:12:34 2013 -0800

    ipv4: fix race in concurrent ip_route_input_slow()
    
    [ Upstream commit dcdfdf56b4a6c9437fc37dbc9cee94a788f9b0c4 ]
    
    CPUs can ask for local route via ip_route_input_noref() concurrently.
    if nh_rth_input is not cached yet, CPUs will proceed to allocate
    equivalent DSTs on 'lo' and then will try to cache them in nh_rth_input
    via rt_cache_route()
    Most of the time they succeed, but on occasion the following two lines:
        orig = *p;
        prev = cmpxchg(p, orig, rt);
    in rt_cache_route() do race and one of the cpus fails to complete cmpxchg.
    But ip_route_input_slow() doesn't check the return code of rt_cache_route(),
    so dst is leaking. dst_destroy() is never called and 'lo' device
    refcnt doesn't go to zero, which can be seen in the logs as:
        unregister_netdevice: waiting for lo to become free. Usage count = 1
    Adding mdelay() between above two lines makes it easily reproducible.
    Fix it similar to nh_pcpu_rth_output case.
    
    Fixes: d2d68ba9fe8b ("ipv4: Cache input routes in fib_info nexthops.")
    Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

^ permalink raw reply

* Re: ipv4_dst_destroy panic regression after 3.10.15
From: dormando @ 2014-01-18  7:16 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, linux-kernel, Alexei Starovoitov
In-Reply-To: <1390028976.31367.512.camel@edumazet-glaptop2.roam.corp.google.com>

> On Fri, 2014-01-17 at 22:49 -0800, Eric Dumazet wrote:
> > On Fri, 2014-01-17 at 17:25 -0800, dormando wrote:
> > > Hi,
> > >
> > > Upgraded a few kernels to the latest 3.10 stable tree while tracking down
> > > a rare kernel panic, seems to have introduced a much more frequent kernel
> > > panic. Takes anywhere from 4 hours to 2 days to trigger:
> > >
> > > <4>[196727.311203] general protection fault: 0000 [#1] SMP
> > > <4>[196727.311224] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge coretemp crc32_pclmul ghash_clmulni_intel gpio_ich microcode ipmi_watchdog ipmi_devintf sb_edac edac_core lpc_ich mfd_core tpm_tis tpm tpm_bios ipmi_si ipmi_msghandler isci igb libsas i2c_algo_bit ixgbe ptp pps_core mdio
> > > <4>[196727.311333] CPU: 17 PID: 0 Comm: swapper/17 Not tainted 3.10.26 #1
> > > <4>[196727.311344] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013
> > > <4>[196727.311364] task: ffff885e6f069700 ti: ffff885e6f072000 task.ti: ffff885e6f072000
> > > <4>[196727.311377] RIP: 0010:[<ffffffff815f8c7f>]  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
> > > <4>[196727.311399] RSP: 0018:ffff885effd23a70  EFLAGS: 00010282
> > > <4>[196727.311409] RAX: dead000000200200 RBX: ffff8854c398ecc0 RCX: 0000000000000040
> > > <4>[196727.311423] RDX: dead000000100100 RSI: dead000000100100 RDI: dead000000200200
> > > <4>[196727.311437] RBP: ffff885effd23a80 R08: ffffffff815fd9e0 R09: ffff885d5a590800
> > > <4>[196727.311451] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> > > <4>[196727.311464] R13: ffffffff81c8c280 R14: 0000000000000000 R15: ffff880e85ee16ce
> > > <4>[196727.311510] FS:  0000000000000000(0000) GS:ffff885effd20000(0000) knlGS:0000000000000000
> > > <4>[196727.311554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > <4>[196727.311581] CR2: 00007a46751eb000 CR3: 0000005e65688000 CR4: 00000000000407e0
> > > <4>[196727.311625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > <4>[196727.311669] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > <4>[196727.311713] Stack:
> > > <4>[196727.311733]  ffff8854c398ecc0 ffff8854c398ecc0 ffff885effd23ab0 ffffffff815b7f42
> > > <4>[196727.311784]  ffff88be6595bc00 ffff8854c398ecc0 0000000000000000 ffff8854c398ecc0
> > > <4>[196727.311834]  ffff885effd23ad0 ffffffff815b86c6 ffff885d5a590800 ffff8816827821c0
> > > <4>[196727.311885] Call Trace:
> > > <4>[196727.311907]  <IRQ>
> > > <4>[196727.311912]  [<ffffffff815b7f42>] dst_destroy+0x32/0xe0
> > > <4>[196727.311959]  [<ffffffff815b86c6>] dst_release+0x56/0x80
> > > <4>[196727.311986]  [<ffffffff81620bd5>] tcp_v4_do_rcv+0x2a5/0x4a0
> > > <4>[196727.312013]  [<ffffffff81622b5a>] tcp_v4_rcv+0x7da/0x820
> > > <4>[196727.312041]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
> > > <4>[196727.312070]  [<ffffffff815de02d>] ? nf_hook_slow+0x7d/0x150
> > > <4>[196727.312097]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
> > > <4>[196727.312125]  [<ffffffff815fda92>] ip_local_deliver_finish+0xb2/0x230
> > > <4>[196727.312154]  [<ffffffff815fdd9a>] ip_local_deliver+0x4a/0x90
> > > <4>[196727.312183]  [<ffffffff815fd799>] ip_rcv_finish+0x119/0x360
> > > <4>[196727.312212]  [<ffffffff815fe00b>] ip_rcv+0x22b/0x340
> > > <4>[196727.312242]  [<ffffffffa0339680>] ? macvlan_broadcast+0x160/0x160 [macvlan]
> > > <4>[196727.312275]  [<ffffffff815b0c62>] __netif_receive_skb_core+0x512/0x640
> > > <4>[196727.312308]  [<ffffffff811427fb>] ? kmem_cache_alloc+0x13b/0x150
> > > <4>[196727.312338]  [<ffffffff815b0db1>] __netif_receive_skb+0x21/0x70
> > > <4>[196727.312368]  [<ffffffff815b0fa1>] netif_receive_skb+0x31/0xa0
> > > <4>[196727.312397]  [<ffffffff815b1ae8>] napi_gro_receive+0xe8/0x140
> > > <4>[196727.312433]  [<ffffffffa00274f1>] ixgbe_poll+0x551/0x11f0 [ixgbe]
> > > <4>[196727.312463]  [<ffffffff815fe00b>] ? ip_rcv+0x22b/0x340
> > > <4>[196727.312491]  [<ffffffff815b1691>] net_rx_action+0x111/0x210
> > > <4>[196727.312521]  [<ffffffff815b0db1>] ? __netif_receive_skb+0x21/0x70
> > > <4>[196727.312552]  [<ffffffff810519d0>] __do_softirq+0xd0/0x270
> > > <4>[196727.312583]  [<ffffffff816cef3c>] call_softirq+0x1c/0x30
> > > <4>[196727.312613]  [<ffffffff81004205>] do_softirq+0x55/0x90
> > > <4>[196727.312640]  [<ffffffff81051c85>] irq_exit+0x55/0x60
> > > <4>[196727.312668]  [<ffffffff816cf5c3>] do_IRQ+0x63/0xe0
> > > <4>[196727.312696]  [<ffffffff816c5aaa>] common_interrupt+0x6a/0x6a
> > > <4>[196727.312722]  <EOI>
> > > <4>[196727.312727]  [<ffffffff8100a150>] ? default_idle+0x20/0xe0
> > > <4>[196727.312775]  [<ffffffff8100a8ff>] arch_cpu_idle+0xf/0x20
> > > <4>[196727.312803]  [<ffffffff8108d330>] cpu_startup_entry+0xc0/0x270
> > > <4>[196727.312833]  [<ffffffff816b276e>] start_secondary+0x1f9/0x200
> > > <4>[196727.312860] Code: 4a 9f e9 81 e8 13 cb 0c 00 48 8b 93 b0 00 00 00 48 bf 00 02 20 00 00 00 ad de 48 8b 83 b8 00 00 00 48 be 00 01 10 00 00 00 ad de <48> 89 42 08 48 89 10 48 89 bb b8 00 00 00 48 c7 c7 4a 9f e9 81
> > > <1>[196727.313071] RIP  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
> > > <4>[196727.313100]  RSP <ffff885effd23a70>
> > > <4>[196727.313377] ---[ end trace 64b3f14fae0f2e29 ]---
> > > <0>[196727.380908] Kernel panic - not syncing: Fatal exception in interrupt
> > >
> > >
> > > ... bisecting it's going to be a pain... I tried eyeballing the diffs and
> > > am trying a revert or two.
> > >
> > > We've hit it in .25, .26 so far. I have .27 running but not sure if it
> > > crashed, so the change exists between .15 and .25.
> >
> > Please try following fix, thanks for the report !
> >
> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > index 25071b48921c..78a50a22298a 100644
> > --- a/net/ipv4/route.c
> > +++ b/net/ipv4/route.c
> > @@ -1333,7 +1333,7 @@ static void ipv4_dst_destroy(struct dst_entry
> > *dst)
> >
> >  	if (!list_empty(&rt->rt_uncached)) {
> >  		spin_lock_bh(&rt_uncached_lock);
> > -		list_del(&rt->rt_uncached);
> > +		list_del_init(&rt->rt_uncached);
> >  		spin_unlock_bh(&rt_uncached_lock);
> >  	}
> >  }
> >
>
> Problem could come from this commit, in linux 3.10.23,
> you also could try to revert it
>
> commit 62713c4b6bc10c2d082ee1540e11b01a2b2162ab
> Author: Alexei Starovoitov <ast@plumgrid.com>
> Date:   Tue Nov 19 19:12:34 2013 -0800
>
>     ipv4: fix race in concurrent ip_route_input_slow()
>
>     [ Upstream commit dcdfdf56b4a6c9437fc37dbc9cee94a788f9b0c4 ]
>
>     CPUs can ask for local route via ip_route_input_noref() concurrently.
>     if nh_rth_input is not cached yet, CPUs will proceed to allocate
>     equivalent DSTs on 'lo' and then will try to cache them in nh_rth_input
>     via rt_cache_route()
>     Most of the time they succeed, but on occasion the following two lines:
>         orig = *p;
>         prev = cmpxchg(p, orig, rt);
>     in rt_cache_route() do race and one of the cpus fails to complete cmpxchg.
>     But ip_route_input_slow() doesn't check the return code of rt_cache_route(),
>     so dst is leaking. dst_destroy() is never called and 'lo' device
>     refcnt doesn't go to zero, which can be seen in the logs as:
>         unregister_netdevice: waiting for lo to become free. Usage count = 1
>     Adding mdelay() between above two lines makes it easily reproducible.
>     Fix it similar to nh_pcpu_rth_output case.
>
>     Fixes: d2d68ba9fe8b ("ipv4: Cache input routes in fib_info nexthops.")
>     Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
>     Signed-off-by: David S. Miller <davem@davemloft.net>
>     Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>

Heh. I spent an hour squinting at the difflog from .15 to .25 and this was
my best guess. I have a kernel running in production with only this
reverted as of ~5 hours ago. Won't know if it helps for a day or two.

I'm building a kernel now with your route patch, but 62713c4 *not*
reverted, which I can throw on a different machine. Does this sound like a
good idea?

Thanks for your quick help as always!

^ permalink raw reply

* [PATCH net-next v3 0/2] bonding: fix primary problem for bonding
From: Ding Tianhong @ 2014-01-18  8:28 UTC (permalink / raw)
  To: Jay Vosburgh, Veaceslav Falico, Netdev, David S. Miller

If the slave's name changed, and the bond params primary is exist,
the bond should deal with the situation in two ways:

1) If the slave was the primary slave yet, clean the primary slave
   and reselect active slave.
2) If the slave's new name is as same as bond primary, set the slave
   as primary slave and reselect active slave.

If the new primary is not matching any slave in the bond, the bond should
record it to params, clean the primary slave and select a new active slave.

Update bonding.txt for primary description.

v2.1->v1: Because there are too many indentions and useless verification, so rewrite
	  the logic for updating the primary slave.
	  Modify some comments for to clean the typos.

v3->v2.1: Veaceslav disagree the first patch and modify the logic for it
	  (bonding: update the primary slave when changing slave's name)
	  and resend it himself (bonding: handle slave's name change with primary_slave logic),
	  so remove the first patch and send the last two patches.

Ding Tianhong (2):
  bonding: clean the primary slave if there is no slave matching new
    primary
  bonding: update bonding.txt for primary description.

 Documentation/networking/bonding.txt | 3 ++-
 drivers/net/bonding/bond_options.c   | 6 ++++++
 2 files changed, 8 insertions(+), 1 deletion(-)

-- 
1.8.0

^ permalink raw reply

* [PATCH net-next v3 1/2] bonding: clean the primary slave if there is no slave matching new primary
From: Ding Tianhong @ 2014-01-18  8:28 UTC (permalink / raw)
  To: Jay Vosburgh, Veaceslav Falico, David S. Miller, Netdev

If the new primay is not matching any slave in the bond, the bond should
record it to params, clean the primary slave and select a new active slave.

Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
---
 drivers/net/bonding/bond_options.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
index 945a666..0ee0bfe 100644
--- a/drivers/net/bonding/bond_options.c
+++ b/drivers/net/bonding/bond_options.c
@@ -512,6 +512,12 @@ int bond_option_primary_set(struct bonding *bond, const char *primary)
 		}
 	}
 
+	if (bond->primary_slave) {
+		pr_info("%s: Setting primary slave to None.\n",
+			bond->dev->name);
+		bond->primary_slave = NULL;
+		bond_select_active_slave(bond);
+	}
 	strncpy(bond->params.primary, primary, IFNAMSIZ);
 	bond->params.primary[IFNAMSIZ - 1] = 0;
 
-- 
1.8.0

^ permalink raw reply related

* [PATCH net-next v3 2/2] bonding: update bonding.txt for primary description
From: Ding Tianhong @ 2014-01-18  8:28 UTC (permalink / raw)
  To: Jay Vosburgh, Veaceslav Falico, David S. Miller, Netdev

Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
---
 Documentation/networking/bonding.txt | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
index a4d925e..5cdb229 100644
--- a/Documentation/networking/bonding.txt
+++ b/Documentation/networking/bonding.txt
@@ -657,7 +657,8 @@ primary
 	one slave is preferred over another, e.g., when one slave has
 	higher throughput than another.
 
-	The primary option is only valid for active-backup mode.
+	The primary option is only valid for active-backup(1),
+	balance-tlb (5) and balance-alb (6) mode.
 
 primary_reselect
 
-- 
1.8.0

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox