Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH v2] sungem: Spring cleaning and GRO support
From: David Miller @ 2011-06-02 21:14 UTC (permalink / raw)
  To: benh; +Cc: netdev, ruediger.herbst, bhamilton04
In-Reply-To: <1306912630.29297.34.camel@pasglop>

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Wed, 01 Jun 2011 17:17:10 +1000

> It still needs testing on sparc and maybe a bit more torturing on
> powerpc. I've done some cursory link up/down tests on a g5 and a
> powerbook g4 here, and some suspend/resume tests, so far with
> success, but I'll do more tomorrow just in case.

I'll give this a quick spin on one of my sparc64 boxes that has
a sungem in it.  If all goes well I'll just apply this to net-next-2.6

Thanks!

^ permalink raw reply

* Re: [PATCH] bonding: reset queue mapping prior to transmission to physical device
From: David Miller @ 2011-06-02 21:10 UTC (permalink / raw)
  To: bhutchings; +Cc: nicolas.2p.debian, nhorman, netdev, fubar, andy
In-Reply-To: <1307047906.2812.61.camel@bwh-desktop>

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Thu, 02 Jun 2011 21:51:46 +0100

> On Thu, 2011-06-02 at 13:46 -0700, David Miller wrote:
>> From: Nicolas de Pesloüan <nicolas.2p.debian@gmail.com>
>> Date: Thu, 02 Jun 2011 22:13:57 +0200
>> 
>> > To be more precise, due to the way bonding use queue mapping for slave
>> > selection, it is desirable to clear the mapping before sending to the
>> > slave, because the meaning of the mapping for the slave interface
>> > might be really different from the meaning for the bonding
>> > interface. Arguably, this is the mapping usage in bonding which is
>> > "different" from other usages, but...
>> 
>> This just confirms my reasoning behind why I wanted to discourage
>> drivers from providing explicit ->ndo_select_queue() methods unless
>> absolutely necessary.
>> 
>> Information now gets lost in cases like this bonding issue.
>> 
>> Bonding should definitely, as I suggested, remember the original
>> rxhash value and restore it when sending to the slave.
> 
> Surely RX queue (queue_mapping), not RX hash (rxhash, which is unchanged
> anyway AFAIK).

Right.

^ permalink raw reply

* Re: [PATCH] ftrace: tracepoint of net_dev_xmit sees freed skb and causes panic
From: David Miller @ 2011-06-02 21:04 UTC (permalink / raw)
  To: rostedt
  Cc: sanagi.koki, linux-kernel, netdev, nhorman, mingo, fweisbec,
	mathieu.desnoyers, tglx, kosaki.motohiro, izumi.taku,
	kaneshige.kenji
In-Reply-To: <1307048512.3667.28.camel@gandalf.stny.rr.com>

From: Steven Rostedt <rostedt@goodmis.org>
Date: Thu, 02 Jun 2011 17:01:52 -0400

> On Thu, 2011-06-02 at 13:59 -0700, David Miller wrote:
>> From: Steven Rostedt <rostedt@goodmis.org>
>> Date: Tue, 31 May 2011 11:14:27 -0400
>> 
>> > Note, the subject should not be "ftrace:", but "net:" or maybe even
>> > "net/tracing", as this really has nothing to do with ftrace code. The
>> > tracepoints are more generic than ftrace.
>> 
>> Do you guys mind if I take this in via the net-2.6 tree?
> 
> I have no issue with that. Heck, I prefer it.

Great, thanks!

^ permalink raw reply

* Re: [PATCH net-next when it opens] net: 8021q: Add pr_fmt
From: David Miller @ 2011-06-02 21:03 UTC (permalink / raw)
  To: joe; +Cc: kaber, greearb, netdev, linux-kernel
In-Reply-To: <6d2f6a1e6365b8865fd210e88d257ca0bc7ad0fa.1306442986.git.joe@perches.com>

From: Joe Perches <joe@perches.com>
Date: Thu, 26 May 2011 13:58:31 -0700

> Use the current logging style.
> 
> Add #define pr_fmt and remove embedded prefix from formats.
> 
> Not converting the current pr_<level> uses to netdev_<level>
> because all the output here is nicely prefaced with "8021q: ".
> 
> Remove __func__ use from proc registration failure message.
> 
> Signed-off-by: Joe Perches <joe@perches.com>

I'll apply this, thanks Joe.

^ permalink raw reply

* Re: [PATCH] ftrace: tracepoint of net_dev_xmit sees freed skb and causes panic
From: Steven Rostedt @ 2011-06-02 21:01 UTC (permalink / raw)
  To: David Miller
  Cc: sanagi.koki, linux-kernel, netdev, nhorman, mingo, fweisbec,
	mathieu.desnoyers, tglx, kosaki.motohiro, izumi.taku,
	kaneshige.kenji
In-Reply-To: <20110602.135913.2065173997860104604.davem@davemloft.net>

On Thu, 2011-06-02 at 13:59 -0700, David Miller wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
> Date: Tue, 31 May 2011 11:14:27 -0400
> 
> > Note, the subject should not be "ftrace:", but "net:" or maybe even
> > "net/tracing", as this really has nothing to do with ftrace code. The
> > tracepoints are more generic than ftrace.
> 
> Do you guys mind if I take this in via the net-2.6 tree?

I have no issue with that. Heck, I prefer it.

-- Steve

^ permalink raw reply

* Re: TX watchdog vs link-layer flow control
From: Ben Hutchings @ 2011-06-02 21:01 UTC (permalink / raw)
  To: netdev; +Cc: linux-net-drivers
In-Reply-To: <1307047720.2812.59.camel@bwh-desktop>

Also: given that the watchdog fires if *any* TX queue is stopped, that
seems to mean that on a device with hardware TX priority it could be
triggered for a TX queue with low priority if the link is filled with
higher-priority TX packets for long enough.  (This depends on how 'hard'
the priorities are, of course.)

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH] ftrace: tracepoint of net_dev_xmit sees freed skb and causes panic
From: David Miller @ 2011-06-02 20:59 UTC (permalink / raw)
  To: rostedt
  Cc: sanagi.koki, linux-kernel, netdev, nhorman, mingo, fweisbec,
	mathieu.desnoyers, tglx, kosaki.motohiro, izumi.taku,
	kaneshige.kenji
In-Reply-To: <1306854867.11899.31.camel@gandalf.stny.rr.com>

From: Steven Rostedt <rostedt@goodmis.org>
Date: Tue, 31 May 2011 11:14:27 -0400

> Note, the subject should not be "ftrace:", but "net:" or maybe even
> "net/tracing", as this really has nothing to do with ftrace code. The
> tracepoints are more generic than ftrace.

Do you guys mind if I take this in via the net-2.6 tree?

^ permalink raw reply

* [v5 00/39] faster tree-based sysctl implementation
From: Lucian Adrian Grijincu @ 2011-06-02 20:58 UTC (permalink / raw)
  To: Eric W . Biederman, linux-kernel
  Cc: netdev, Lucian Adrian Grijincu, Octavian Purdila,
	David S . Miller

Hi,

This is version 5 of a patch series that introduces a faster/leaner
sysctl internal implementation. Due to high number of patches and low
general interest I'll just point you to the tree/branch:

 git://github.com/luciang/linux-2.6-new-sysctl.git  v5-new-sysctl-alg



Eric: I have a separate set of patches that make changes to the rest
      of the kernel. Those patches only change sysctl registrations:
      they replace complex tree registrations (with explicit sysctl
      directories), with registrations of arrays of sysctl files.

      Those patches DON'T depend on anything from this series.

      Examples:
       http://thread.gmane.org/gmane.linux.kernel/1137032/focus=1137089
       http://thread.gmane.org/gmane.linux.kernel/1137032/focus=1137087

      I'd like an OK-GO from you to start sending the patches for
      review to their respective maintainers. Also please say if you
      want to be CCed in those patches.




Changes since v4:
  http://thread.gmane.org/gmane.linux.network/196495/focus=1144143
- rebased on 3.0-rc1
- added a new patch manually register kernel/usermodehelper which
  was added in 3.0-rc1
- minor changes to the "sysctl: simplify find_in_table" patch


Changes since v3:
  http://thread.gmane.org/gmane.linux.network/196495/
- removed a bad patch that shrinked a counter from int to u8


Changes since v2:
  http://thread.gmane.org/gmane.linux.kernel/1137032/focus=3D194748
- added a compatibility layer to support old registering complex
  sysctl trees. This layer will be deleted once all users of the old
  are changed.
- subdirectories and netns correspondent dirs are now held in rbtrees
- split of from the patches that make changes in the rest of the tree
- rebased on top of 2.6.39


Changes since v1:
  http://thread.gmane.org/gmane.linux.kernel/1133667
- rebased on top of 2.6.39-rc6
- split the patch that adds the new algorithm and data structures.
- fixed a few bugs lingering in the old code
- shrinked a reference counter
- added a new reference counter to maintain ownership information
- added method to register an empty sysctl dir and converted some users
- added checks enforcing the rule that a non-netns specific directory may
  not be registered after a netns specific one has already been registered.
- added cookie support: register a piece of data with the header to be
  used to make simple conversions on the ctl_table. This saves memory where





Part 1: introduce compatibility layer:

  sysctl: introduce temporary sysctl wrappers
  sysctl: register only tables of sysctl files


Part 2: minimal changes to sysctl users:

  sysctl: call sysctl_init before the first sysctl registration
  sysctl: no-child: manually register kernel/random
  sysctl: no-child: manually register kernel/usermodehelper
  sysctl: no-child: manually register kernel/keys
  sysctl: no-child: manually register fs/inotify
  sysctl: no-child: manually register fs/epoll
  sysctl: no-child: manually register root tables


Part 3: cleanups simplifying the new algorithm (created when
        asked to break up the new algorithm patch):

  sysctl: faster reimplementation of sysctl_check_table
  sysctl: remove useless ctl_table->parent field
  sysctl: simplify find_in_table
  sysctl: sysctl_head_grab defaults to root header on NULL
  sysctl: delete useless grab_header function
  sysctl: rename ->used to ->ctl_use_refs
  sysctl: rename sysctl_head_grab/finish to sysctl_use_header/unuse
  sysctl: rename sysctl_head_next to sysctl_use_next_header
  sysctl: split ->count into ctl_procfs_refs and ctl_header_refs
  sysctl: rename sysctl_head_get/put to sysctl_proc_inode_get/put
  sysctl: rename (un)use_table to __sysctl_(un)use_header
  sysctl: simplify ->permissions hook
  sysctl: group root-specific operations
  sysctl: introduce ctl_table_group
  sysctl: move removal from list out of start_unregistering


Part 4: new algorithm/data structures:

  sysctl: faster tree-based sysctl implementation


Part 5: checks/warns requested during review:

  sysctl: add duplicate entry and sanity ctl_table checks
  sysctl: alloc ctl_table_header with kmem_cache
  sysctl: check netns-specific registration order respected
  sysctl: warn if registration/unregistration order is not respected
  RFC: sysctl: always perform sysctl checks


Part 6: Eric requested rbtrees for subdirs. I also used rbtrees for
        netns correspondent dirs. Hope that adding rbtrees after
        using the old list-based implementation is good enough.
        The rbtree code complicates things a bit and would uglify
        the patch adding the new algorithm.

  sysctl: reorder members of ctl_table_header (cleanup)
  sysctl: add ctl_type member
  RFC: sysctl: replace subdirectory list with rbtree
  RFC: sysctl: replace netns corresp list with rbtree
  sysctl: union-ize some ctl_table_header fields


Part 7: Eric requested ability to register an empty dir:

  sysctl: add register_sysctl_dir: register an empty sysctl directory


Part 8: unrequested feature I'd like to piggy back :)

  sysctl: add ctl_cookie and ctl_cookie_handler
  sysctl: add cookie to __register_sysctl_paths
  sysctl: add register_net_sysctl_table_net_cookie



 drivers/char/random.c            |   27 +-
 fs/eventpoll.c                   |   22 +-
 fs/notify/inotify/inotify_user.c |   22 +-
 fs/proc/inode.c                  |    2 +-
 fs/proc/proc_sysctl.c            |  236 +++++---
 include/linux/inotify.h          |    2 -
 include/linux/key.h              |    3 -
 include/linux/kmod.h             |    3 -
 include/linux/poll.h             |    2 -
 include/linux/sysctl.h           |  221 +++++---
 include/net/net_namespace.h      |    4 +-
 init/main.c                      |    1 +
 kernel/Makefile                  |    5 +-
 kernel/kmod.c                    |   14 +-
 kernel/sysctl.c                  | 1165 +++++++++++++++++++++++++++-----------
 kernel/sysctl_check.c            |  316 +++++++----
 lib/Kconfig.debug                |    8 -
 net/sysctl_net.c                 |   86 ++--
 security/keys/key.c              |    7 +
 security/keys/sysctl.c           |   18 +-
 20 files changed, 1492 insertions(+), 672 deletions(-)


-- 
 .
..: Lucian

^ permalink raw reply

* Re: [PATCH] Use unsigned variables for packet lengths in ip[6]_queue.
From: David Miller @ 2011-06-02 20:57 UTC (permalink / raw)
  To: davej; +Cc: netdev, netfilter-devel, pablo, kaber
In-Reply-To: <20110528003651.GA8380@redhat.com>

From: Dave Jones <davej@redhat.com>
Date: Fri, 27 May 2011 20:36:51 -0400

> On Tue, Apr 19, 2011 at 08:41:05PM -0700, David Miller wrote:
>  > From: Dave Jones <davej@redhat.com>
>  > Date: Tue, 19 Apr 2011 21:42:22 -0400
>  > 
>  > > Not catastrophic, but ipqueue seems to be too trusting of what it gets
>  > > passed from userspace, and passes it on down to the page allocator,
>  > > where it will spew warnings if the page order is too high.
>  > > 
>  > > __ipq_rcv_skb has several checks for lengths too small, but doesn't
>  > > seem to have any for oversized ones.   I'm not sure what the maximum
>  > > we should check for is. I'll code up a diff if anyone has any ideas
>  > > on a sane maximum.
>  > 
>  > Maybe the thing to do is to simply pass __GFP_NOWARN to nlmsg_new()
>  > in netlink_ack()?
>  > 
>  > Anyone else have a better idea?
> 
> So I went back to this today, and found something that doesn't look right.
> After adding some instrumentation, and re-running my tests, I found that
> the reason we were blowing up with enormous allocations was that we
> were passing down a nlmsglen's like -1061109568
> 
> Is there any reason for that to be signed ?
> The nlmsg_len entry of nlmsghdr is a u32, so I'm assuming this is a bug.
> 
> With the patch below, I haven't been able to reproduce the problem, but
> I don't know if I've inadvertantly broken some other behaviour somewhere
> deeper in netlink where this is valid.

netfilter-devel and maintainers CC:'d

> -- 
> 
> Netlink message lengths can't be negative, so use unsigned variables.
> 
> Signed-off-by: Dave Jones <davej@redhat.com>
> 
> diff --git a/net/ipv4/netfilter/ip_queue.c b/net/ipv4/netfilter/ip_queue.c
> index d2c1311..f7f9bd7 100644
> --- a/net/ipv4/netfilter/ip_queue.c
> +++ b/net/ipv4/netfilter/ip_queue.c
> @@ -402,7 +402,8 @@ ipq_dev_drop(int ifindex)
>  static inline void
>  __ipq_rcv_skb(struct sk_buff *skb)
>  {
> -	int status, type, pid, flags, nlmsglen, skblen;
> +	int status, type, pid, flags;
> +	unsigned int nlmsglen, skblen;
>  	struct nlmsghdr *nlh;
>  
>  	skblen = skb->len;
> diff --git a/net/ipv6/netfilter/ip6_queue.c b/net/ipv6/netfilter/ip6_queue.c
> index 413ab07..065fe40 100644
> --- a/net/ipv6/netfilter/ip6_queue.c
> +++ b/net/ipv6/netfilter/ip6_queue.c
> @@ -403,7 +403,8 @@ ipq_dev_drop(int ifindex)
>  static inline void
>  __ipq_rcv_skb(struct sk_buff *skb)
>  {
> -	int status, type, pid, flags, nlmsglen, skblen;
> +	int status, type, pid, flags;
> +	unsigned int nlmsglen, skblen;
>  	struct nlmsghdr *nlh;
>  
>  	skblen = skb->len;

^ permalink raw reply

* Re: TX watchdog vs link-layer flow control
From: David Miller @ 2011-06-02 20:55 UTC (permalink / raw)
  To: bhutchings; +Cc: netdev, linux-net-drivers
In-Reply-To: <1307047720.2812.59.camel@bwh-desktop>

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Thu, 02 Jun 2011 21:48:40 +0100

> However, even if the link is up it can still be blocked by link-layer
> flow control.  A customer report (which has not yet been reproduced
> here) suggests that when Ethernet flow control is enabled a switch may
> in some circumstances throttle the TX packet rate to the extent that a
> TX queue cannot be unblocked before the watchdog fires.  It is certainly
> possible for a misbehaving link partner to do this, and this should
> probably not be considered as a bug in the local hardware or driver!
> 
> TX may also be blocked by a 'remote fault' indication.  This should
> possibly be translated into netif_carrier_off(), but I'm not sure that
> all drivers will be able to detect remote fault without polling.
> 
> Perhaps dev_watchdog() should support a driver operation to poll for
> cases like this before it decides that the local device is actually
> misbehaving?
> 
> Even then, I can't think of a reliable way to detect a pause frame
> flood.  Also, drivers might well require process context for such an
> operation.

Frankly, if the switch can't take packets for several seconds I want a
notification as that's a serious condition.

In lieu of a reliable way to poll for these kinds of cases in order to
distinguish properly, I think what happens now is the best we can do.

^ permalink raw reply

* Re: [PATCH] bonding: reset queue mapping prior to transmission to physical device
From: Ben Hutchings @ 2011-06-02 20:51 UTC (permalink / raw)
  To: David Miller; +Cc: nicolas.2p.debian, nhorman, netdev, fubar, andy
In-Reply-To: <20110602.134637.1101638217263464114.davem@davemloft.net>

On Thu, 2011-06-02 at 13:46 -0700, David Miller wrote:
> From: Nicolas de Pesloüan <nicolas.2p.debian@gmail.com>
> Date: Thu, 02 Jun 2011 22:13:57 +0200
> 
> > To be more precise, due to the way bonding use queue mapping for slave
> > selection, it is desirable to clear the mapping before sending to the
> > slave, because the meaning of the mapping for the slave interface
> > might be really different from the meaning for the bonding
> > interface. Arguably, this is the mapping usage in bonding which is
> > "different" from other usages, but...
> 
> This just confirms my reasoning behind why I wanted to discourage
> drivers from providing explicit ->ndo_select_queue() methods unless
> absolutely necessary.
> 
> Information now gets lost in cases like this bonding issue.
> 
> Bonding should definitely, as I suggested, remember the original
> rxhash value and restore it when sending to the slave.

Surely RX queue (queue_mapping), not RX hash (rxhash, which is unchanged
anyway AFAIK).

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* TX watchdog vs link-layer flow control
From: Ben Hutchings @ 2011-06-02 20:48 UTC (permalink / raw)
  To: netdev; +Cc: linux-net-drivers

The TX watchdog will fire if and only if a TX queue remains stopped for
a certain period for no apparent reason.  Specifically, it requires
netif_device_present(dev) && netif_running(dev) &&
netif_carrier_ok(dev).

However, even if the link is up it can still be blocked by link-layer
flow control.  A customer report (which has not yet been reproduced
here) suggests that when Ethernet flow control is enabled a switch may
in some circumstances throttle the TX packet rate to the extent that a
TX queue cannot be unblocked before the watchdog fires.  It is certainly
possible for a misbehaving link partner to do this, and this should
probably not be considered as a bug in the local hardware or driver!

TX may also be blocked by a 'remote fault' indication.  This should
possibly be translated into netif_carrier_off(), but I'm not sure that
all drivers will be able to detect remote fault without polling.

Perhaps dev_watchdog() should support a driver operation to poll for
cases like this before it decides that the local device is actually
misbehaving?

Even then, I can't think of a reliable way to detect a pause frame
flood.  Also, drivers might well require process context for such an
operation.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH] bonding: reset queue mapping prior to transmission to physical device
From: David Miller @ 2011-06-02 20:46 UTC (permalink / raw)
  To: nicolas.2p.debian; +Cc: nhorman, bhutchings, netdev, fubar, andy
In-Reply-To: <4DE7EF05.8030505@gmail.com>

From: Nicolas de Pesloüan <nicolas.2p.debian@gmail.com>
Date: Thu, 02 Jun 2011 22:13:57 +0200

> To be more precise, due to the way bonding use queue mapping for slave
> selection, it is desirable to clear the mapping before sending to the
> slave, because the meaning of the mapping for the slave interface
> might be really different from the meaning for the bonding
> interface. Arguably, this is the mapping usage in bonding which is
> "different" from other usages, but...

This just confirms my reasoning behind why I wanted to discourage
drivers from providing explicit ->ndo_select_queue() methods unless
absolutely necessary.

Information now gets lost in cases like this bonding issue.

Bonding should definitely, as I suggested, remember the original
rxhash value and restore it when sending to the slave.

^ permalink raw reply

* RE: forwarding pings out ingress interface
From: Jeff Haran @ 2011-06-02 20:33 UTC (permalink / raw)
  To: Nicolas de Pesloüan; +Cc: netdev
In-Reply-To: <4DE7E8AA.20000@gmail.com>

> -----Original Message-----
> From: Nicolas de Pesloüan [mailto:nicolas.2p.debian@gmail.com]
> Sent: Thursday, June 02, 2011 12:47 PM
> To: Jeff Haran
> Cc: netdev@vger.kernel.org
> Subject: Re: forwarding pings out ingress interface
> 
> Le 02/06/2011 21:10, Jeff Haran a écrit :
> > Running Redhat Enterprize 6.0, seeing the following behavior.
> >
> > Two Ethernet network segments: 172.30/16 and 169.254.160/24.
> >
> > Three "devices":
> >
> > [linux] is the blade running RHEL6.0. It has multiple network interfaces
> > and has IP forwarding enabled.
> > [host] is an IP host of some sort.
> > [lb] is an L3 load balancer.
> >
> > They are connected like so:
> >
> > [host] 172.30.0.254<->  172.30.0.2 [lb] 169.254.160.20<->  169.254.160.1
> > [linux]
> >
> > I have not shown the other network interfaces that [linux] is attached
> > to for brevity in the diagram.
> >
> > One of the peculiarities of [lb] is if [host] pings it at 172.30.0.2, it
> > will not respond to the ping itself but instead forward the ping to
> > [linux]. I realize this is broken, but I have to deal with it (its
> > cooked into [lb]'s silicon). [lb]'s vendor assures me that if [linux]
> > would only forward the ping back to [lb] out the same interface it came
> > in on, then [lb] will generate a ping response back to [host].
> >
> > My problem is [linux] won't forward the ping back to [lb]. [linux] has a
> > route to 172.30/16 out the interface that connects it to [lb] and I can
> > see from tcpdump that [linux] is getting the ICMP Echo requests with
> > source address 172.30.0.254 and destination address 172.30.0.2, but
> > nothing gets transmitted out that interface in response.
> 
> As per RFC3927, section 2.6.2 "[a] host MUST NOT send a packet with an IPv4
> Link-Local destination
> address to any router for forwarding." Your linux host use 169.254.160/24,
> which is in the IPv4
> Link-Local range. So it is normal for your linux host not to reply to a ping
> request coming from
> outside of the local subnet.

Nicolas,

Thanks for the response, but what you are saying above is not the case. The source and destination IP addresses in the ping are both on the 172.30/64 net, which is routable. [linux] has assigned addresses in the link-local  local range to its own interfaces, but in this case I just want [linux] to forward the IP packet unmodified (except for decrementing the TTL and the MAC addresses before the IP packet, of course). Plus, as I mentioned above [linux] forwards these packets fine if I set to route to the 172.30/16 nets out another of its interfaces.

The issue seems to be related to forwarding a packet out the same interface it came in on, but I am guessing here.
 
> Why do you use an IP in the link local range? For as far as I remember (but I
> failed to find the
> exact section), this RFC also forbid static configuration of an IP in this range.
> 
> 	Nicolas.

Same reason everybody uses these things. These networks are on chassis backplanes and they are supposed to be hidden from the outside world, particularly we don't want the packets with 169.254 net addresses in them to get routed outside of those networks. The 169.254 network is the closest thing in IPv4 that comes to providing this functionality.

Jeff Haran




^ permalink raw reply

* Re: [PATCH] bonding: reset queue mapping prior to transmission to physical device
From: Nicolas de Pesloüan @ 2011-06-02 20:22 UTC (permalink / raw)
  To: David Miller; +Cc: nhorman, netdev, fubar, andy
In-Reply-To: <20110602.130710.1904222486883754792.davem@davemloft.net>

Le 02/06/2011 22:07, David Miller a écrit :
> From: Neil Horman<nhorman@tuxdriver.com>
> Date: Thu,  2 Jun 2011 14:03:19 -0400
>
>> The bonding driver is multiqueue enabled, in which each queue represents a slave
>> to enable optional steering of output frames to given slaves against the default
>> output policy.  However, it needs to reset the skb->queue_mapping prior to
>> queuing to the physical device or the physical slave (if it is multiqueue) could
>> wind up transmitting on an unintended tx queue (one that was reserved for
>> specific traffic classes for instance)
>>
>> Signed-off-by: Neil Horman<nhorman@tuxdriver.com>
>
> Since, as I mentioned, the idea when we are forwarding and bridging is that
> we use the input receive classification to influence the spread on transmit,
> I think things like this bonding case should remember the rxhash setting
> before they override it and then restore that value right before invoking
> dev_queue_xmit().

Ok, now I understand. Maybe, using queue mapping for special slave selection wasn't such a good idea 
at the very beginning, because it pollutes the RX mapping that is expected to be propagated up to 
TX. Restoring the original value before invoking dev_queue_xmit() would fix this, but I'm not sure 
it is the cleanest way to do it.

	Nicolas.

^ permalink raw reply

* Re: [PATCH] bonding: reset queue mapping prior to transmission to physical device
From: Nicolas de Pesloüan @ 2011-06-02 20:13 UTC (permalink / raw)
  To: David Miller; +Cc: nhorman, bhutchings, netdev, fubar, andy
In-Reply-To: <20110602.130448.917724318321380583.davem@davemloft.net>

Le 02/06/2011 22:04, David Miller a écrit :
> From: Neil Horman<nhorman@tuxdriver.com>
> Date: Thu, 2 Jun 2011 15:46:21 -0400
>
>> Potentially, yes.  I only fixed this because I was looking at bonding and its
>> queue_mapping behavior, and saw that this needed fixing.  Bridging and IP
>> forwarding should also likely clear the queue mapping in the forwarding path
>> somewhere to avoid selecting an output tx queue that is a function of whatever
>> queue and device it arrived on during ingress.  I've not yet looked to see if
>> thats already being done.
>
> No we do not do this, intentionally.
>
> That way the input classification and queue selection is mirrored
> on the transmit side, which we absolutely want to happen.

Can you confirm that this is the expected behavior for IP forwarding and bridge but not for bonding?

To be more precise, due to the way bonding use queue mapping for slave selection, it is desirable to 
clear the mapping before sending to the slave, because the meaning of the mapping for the slave 
interface might be really different from the meaning for the bonding interface. Arguably, this is 
the mapping usage in bonding which is "different" from other usages, but...

	Nicolas.

^ permalink raw reply

* Re: [PATCH] bonding: reset queue mapping prior to transmission to physical device
From: Ben Hutchings @ 2011-06-02 20:13 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev, Jay Vosburgh, Andy Gospodarek, David S. Miller
In-Reply-To: <20110602194621.GB2749@hmsreliant.think-freely.org>

On Thu, 2011-06-02 at 15:46 -0400, Neil Horman wrote:
> On Thu, Jun 02, 2011 at 08:09:49PM +0100, Ben Hutchings wrote:
> > On Thu, 2011-06-02 at 14:56 -0400, Neil Horman wrote:
> > > On Thu, Jun 02, 2011 at 07:35:53PM +0100, Ben Hutchings wrote:
> > > > On Thu, 2011-06-02 at 14:03 -0400, Neil Horman wrote:
> > > > > The bonding driver is multiqueue enabled, in which each queue represents a slave
> > > > > to enable optional steering of output frames to given slaves against the default
> > > > > output policy.  However, it needs to reset the skb->queue_mapping prior to
> > > > > queuing to the physical device or the physical slave (if it is multiqueue) could
> > > > > wind up transmitting on an unintended tx queue (one that was reserved for
> > > > > specific traffic classes for instance)
> > [...]
> > > > So far as I can see, this has no effect, because dev_queue_xmit() always
> > > > sets queue_mapping (in dev_pick_tx()).
> > > > 
> > > it resets the queue mapping exactly as you would expect it to.  bonding is a
> > > multiqueue enabled device and selects a potentially non-zero queue based on the
> > > output of bond_select_queue.
> > > 
> > > > What is the problem you're seeing?
> > > > 
> > > The problem is exctly that.  dev_pick_tx() on the bond device sets the
> > > queue_mapping as per the result of bond_select_queue (the ndo_select_queue
> > > method for the bonding driver).  The implementation there is based on the use of
> > > tc with bonding, so that output slaves can be selected for certain types of
> > > traffic.  But when that mechanism is used, skb->queue_mapping is preserved when
> > > the bonding driver queues the frame to the underlying slave.  This denies the
> > > slave (if its also a multiqueue device) the opportunity to reselect the queue
> > > properly, because of this call path:
> > > 
> > > bond_queue_xmit
> > >  dev_queue_xmit(slave_dev)
> > >   dev_pick_tx()
> > >    skb_tx_hash()
> > >     __skb_tx_hash()
> > > 
> > > __skb_tx_hash sees that skb_queue_recorded returns true, and assigns a hardware queue mapping
> > > based on what the bonding driver chose using its own internal logic.  Since
> > > bonding uses the multiqueue infrastructure to do slave output selection without
> > > any regard for slave output queue selection, it seems to me we should really
> > > reset the queue mapping to zero so the slave device can pick its own tx queue.
> > 
> > So you're effectively clearing the *RX queue* number (as this is before
> > dev_pick_tx()) in order to influence TX queue selection.
> > 
> No, you're not seeing it properly.  In bonding (as with all stacked devices) we
> make two passes through dev_pick_tx.

Only if the stacked device has its own software queues and schedulers;
e.g. VLAN devices don't.

> 1) The first time we call dev_pick_tx is when the network stack calls
> dev_queue_xmit on the bond device.  here we call ndo_select_queue, which calls
> bond_select_queue.  This method tells the stack which queue to enqueue the skb
> on for the bond device.  We can use tc's skbedit action to select a particular
> queue on the bond device for various classes of traffic, and those queues
> correspond to individual slave devices.
> 
> 2) the second time we call dev_pick_tx is when the bonding bond_queue_xmit
> routine (which is the bonding drivers ndo_start_xmit method, called after the
> frist dev_pick_tx), calls dev_queue_xmit, passing the slave net_device pointer).
> In this case we do whatever the slave device has configured (either reset the
> queue to zero, or call the slaves ndo_select queue method).
> 
> What I'm fixing is the fact that the bonding drivers queue_mappnig value is
> 'leaking' down to the slave.

And this is because dev_queue_xmit() assumes that it's a record of the
RX queue number.

The differing interpretation of queue_mapping for RX and TX is annoying
and I think we should change the initial value of queue_mapping to -1.
But that's a separate issue.

> Lets say, for example we're bonding two multiqueue devices, and have the bonding
> driver configured to send all traffic to 192.168.1.1 over the first slave (which
> we can accomplish using an appropriate tc rule on the bond device, setting
> frames with that dest ip to have a skb->queue_mapping of 1).
> 
> In the above example, frames bound for 192.168.1.1 have skb->queue_mapping set
> to 1 when they enter bond_queue_xmit.  bond_queue_xmit, calls
> dev_queue_xmit(slave, skb), which calls dev_pick_tx(slave, skb).  If we (for
> simplicity's sake) assume the slave has no ndo_select_queue method, dev_pick_tx
> calls __skb_tx_hash to determine the output queue.  But the bonding driver
> already set skb->queue_mapping to 1 (because it wanted this frame output on the
> first slave, not because it wanted to transmit the frame on the slaves tx queue
> 1).  Nevertheless, __skb_tx_hash() sees that skb_rx_queue_recorded returns true
> here and as a result, calls skb_rx_get_queue, which returns 0.  Because of that
> we wind up transmitting on hardware queue 0 all the time, which is not at all
> what we want.  What we want is for the bonding driver to be able to use
> queue_mapping for its own purposes, and then allow the physical device to make
> its own output queue selection independently.
[...]

I think I understand - the bonding device is effectively single-queue
and shouldn't record an RX queue number.  But I think you should define
and use skb_clear_rx_queue() to set queue_mapping=0, rather than abusing
skb_set_queue_mapping() which is meant to take a TX queue number.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCH] bonding: reset queue mapping prior to transmission to physical device
From: David Miller @ 2011-06-02 20:07 UTC (permalink / raw)
  To: nhorman; +Cc: netdev, fubar, andy
In-Reply-To: <1307037799-32315-1-git-send-email-nhorman@tuxdriver.com>

From: Neil Horman <nhorman@tuxdriver.com>
Date: Thu,  2 Jun 2011 14:03:19 -0400

> The bonding driver is multiqueue enabled, in which each queue represents a slave
> to enable optional steering of output frames to given slaves against the default
> output policy.  However, it needs to reset the skb->queue_mapping prior to
> queuing to the physical device or the physical slave (if it is multiqueue) could
> wind up transmitting on an unintended tx queue (one that was reserved for
> specific traffic classes for instance)
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>

Since, as I mentioned, the idea when we are forwarding and bridging is that
we use the input receive classification to influence the spread on transmit,
I think things like this bonding case should remember the rxhash setting
before they override it and then restore that value right before invoking
dev_queue_xmit().

^ permalink raw reply

* Re: [PATCH] bonding: reset queue mapping prior to transmission to physical device
From: David Miller @ 2011-06-02 20:04 UTC (permalink / raw)
  To: nhorman; +Cc: bhutchings, netdev, fubar, andy
In-Reply-To: <20110602194621.GB2749@hmsreliant.think-freely.org>

From: Neil Horman <nhorman@tuxdriver.com>
Date: Thu, 2 Jun 2011 15:46:21 -0400

> Potentially, yes.  I only fixed this because I was looking at bonding and its
> queue_mapping behavior, and saw that this needed fixing.  Bridging and IP
> forwarding should also likely clear the queue mapping in the forwarding path
> somewhere to avoid selecting an output tx queue that is a function of whatever
> queue and device it arrived on during ingress.  I've not yet looked to see if
> thats already being done.

No we do not do this, intentionally.

That way the input classification and queue selection is mirrored
on the transmit side, which we absolutely want to happen.

^ permalink raw reply

* Re: [PATCH] bonding: reset queue mapping prior to transmission to physical device
From: David Miller @ 2011-06-02 19:59 UTC (permalink / raw)
  To: bhutchings; +Cc: nhorman, netdev, fubar, andy
In-Reply-To: <1307039753.2812.16.camel@bwh-desktop>

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Thu, 02 Jun 2011 19:35:53 +0100

> So far as I can see, this has no effect, because dev_queue_xmit() always
> sets queue_mapping (in dev_pick_tx()).

__skb_tx_hash() uses any hash value recorded in the SKB before trying
to manually it.

^ permalink raw reply

* Re: [PATCH] bonding: reset queue mapping prior to transmission to physical device
From: Nicolas de Pesloüan @ 2011-06-02 19:52 UTC (permalink / raw)
  To: Neil Horman
  Cc: Ben Hutchings, netdev, Jay Vosburgh, Andy Gospodarek,
	David S. Miller
In-Reply-To: <20110602194621.GB2749@hmsreliant.think-freely.org>

Le 02/06/2011 21:46, Neil Horman a écrit :
> On Thu, Jun 02, 2011 at 08:09:49PM +0100, Ben Hutchings wrote:
>> On Thu, 2011-06-02 at 14:56 -0400, Neil Horman wrote:
>>> On Thu, Jun 02, 2011 at 07:35:53PM +0100, Ben Hutchings wrote:
>>>> On Thu, 2011-06-02 at 14:03 -0400, Neil Horman wrote:
>>>>> The bonding driver is multiqueue enabled, in which each queue represents a slave
>>>>> to enable optional steering of output frames to given slaves against the default
>>>>> output policy.  However, it needs to reset the skb->queue_mapping prior to
>>>>> queuing to the physical device or the physical slave (if it is multiqueue) could
>>>>> wind up transmitting on an unintended tx queue (one that was reserved for
>>>>> specific traffic classes for instance)
>> [...]
>>>> So far as I can see, this has no effect, because dev_queue_xmit() always
>>>> sets queue_mapping (in dev_pick_tx()).
>>>>
>>> it resets the queue mapping exactly as you would expect it to.  bonding is a
>>> multiqueue enabled device and selects a potentially non-zero queue based on the
>>> output of bond_select_queue.
>>>
>>>> What is the problem you're seeing?
>>>>
>>> The problem is exctly that.  dev_pick_tx() on the bond device sets the
>>> queue_mapping as per the result of bond_select_queue (the ndo_select_queue
>>> method for the bonding driver).  The implementation there is based on the use of
>>> tc with bonding, so that output slaves can be selected for certain types of
>>> traffic.  But when that mechanism is used, skb->queue_mapping is preserved when
>>> the bonding driver queues the frame to the underlying slave.  This denies the
>>> slave (if its also a multiqueue device) the opportunity to reselect the queue
>>> properly, because of this call path:
>>>
>>> bond_queue_xmit
>>>   dev_queue_xmit(slave_dev)
>>>    dev_pick_tx()
>>>     skb_tx_hash()
>>>      __skb_tx_hash()
>>>
>>> __skb_tx_hash sees that skb_queue_recorded returns true, and assigns a hardware queue mapping
>>> based on what the bonding driver chose using its own internal logic.  Since
>>> bonding uses the multiqueue infrastructure to do slave output selection without
>>> any regard for slave output queue selection, it seems to me we should really
>>> reset the queue mapping to zero so the slave device can pick its own tx queue.
>>
>> So you're effectively clearing the *RX queue* number (as this is before
>> dev_pick_tx()) in order to influence TX queue selection.
>>
> No, you're not seeing it properly.  In bonding (as with all stacked devices) we
> make two passes through dev_pick_tx.
>
> 1) The first time we call dev_pick_tx is when the network stack calls
> dev_queue_xmit on the bond device.  here we call ndo_select_queue, which calls
> bond_select_queue.  This method tells the stack which queue to enqueue the skb
> on for the bond device.  We can use tc's skbedit action to select a particular
> queue on the bond device for various classes of traffic, and those queues
> correspond to individual slave devices.
>
> 2) the second time we call dev_pick_tx is when the bonding bond_queue_xmit
> routine (which is the bonding drivers ndo_start_xmit method, called after the
> frist dev_pick_tx), calls dev_queue_xmit, passing the slave net_device pointer).
> In this case we do whatever the slave device has configured (either reset the
> queue to zero, or call the slaves ndo_select queue method).
>
> What I'm fixing is the fact that the bonding drivers queue_mappnig value is
> 'leaking' down to the slave.
>
> Lets say, for example we're bonding two multiqueue devices, and have the bonding
> driver configured to send all traffic to 192.168.1.1 over the first slave (which
> we can accomplish using an appropriate tc rule on the bond device, setting
> frames with that dest ip to have a skb->queue_mapping of 1).
>
> In the above example, frames bound for 192.168.1.1 have skb->queue_mapping set
> to 1 when they enter bond_queue_xmit.  bond_queue_xmit, calls
> dev_queue_xmit(slave, skb), which calls dev_pick_tx(slave, skb).  If we (for
> simplicity's sake) assume the slave has no ndo_select_queue method, dev_pick_tx
> calls __skb_tx_hash to determine the output queue.  But the bonding driver
> already set skb->queue_mapping to 1 (because it wanted this frame output on the
> first slave, not because it wanted to transmit the frame on the slaves tx queue
> 1).  Nevertheless, __skb_tx_hash() sees that skb_rx_queue_recorded returns true
> here and as a result, calls skb_rx_get_queue, which returns 0.  Because of that
> we wind up transmitting on hardware queue 0 all the time, which is not at all
> what we want.  What we want is for the bonding driver to be able to use
> queue_mapping for its own purposes, and then allow the physical device to make
> its own output queue selection independently.  To do this, queue_mapping needs
> to be zero, prior to queuenig the skb to the slave, which is exactly what this
> patch does.

This sounds good to me.

Reviewed-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>

>
>> Here, the bonding device seems to be behaving as a forwarding device.
>> If TX queue selection can go wrong for certain combinations of queue
>> configuration when forwarding, then this is a problem for IP forwarding
>> and bridging as well, isn't it?
>>
> Potentially, yes.  I only fixed this because I was looking at bonding and its
> queue_mapping behavior, and saw that this needed fixing.  Bridging and IP
> forwarding should also likely clear the queue mapping in the forwarding path
> somewhere to avoid selecting an output tx queue that is a function of whatever
> queue and device it arrived on during ingress.  I've not yet looked to see if
> thats already being done.
>
> Neil

^ permalink raw reply

* Re: forwarding pings out ingress interface
From: Nicolas de Pesloüan @ 2011-06-02 19:46 UTC (permalink / raw)
  To: Jeff Haran; +Cc: netdev
In-Reply-To: <6F5DE7538AFCDA45A114F5E7510424A7027D8388@hq-exchange01.bytemobile.com>

Le 02/06/2011 21:10, Jeff Haran a écrit :
> Running Redhat Enterprize 6.0, seeing the following behavior.
>
> Two Ethernet network segments: 172.30/16 and 169.254.160/24.
>
> Three "devices":
>
> [linux] is the blade running RHEL6.0. It has multiple network interfaces
> and has IP forwarding enabled.
> [host] is an IP host of some sort.
> [lb] is an L3 load balancer.
>
> They are connected like so:
>
> [host] 172.30.0.254<->  172.30.0.2 [lb] 169.254.160.20<->  169.254.160.1
> [linux]
>
> I have not shown the other network interfaces that [linux] is attached
> to for brevity in the diagram.
>
> One of the peculiarities of [lb] is if [host] pings it at 172.30.0.2, it
> will not respond to the ping itself but instead forward the ping to
> [linux]. I realize this is broken, but I have to deal with it (its
> cooked into [lb]'s silicon). [lb]'s vendor assures me that if [linux]
> would only forward the ping back to [lb] out the same interface it came
> in on, then [lb] will generate a ping response back to [host].
>
> My problem is [linux] won't forward the ping back to [lb]. [linux] has a
> route to 172.30/16 out the interface that connects it to [lb] and I can
> see from tcpdump that [linux] is getting the ICMP Echo requests with
> source address 172.30.0.254 and destination address 172.30.0.2, but
> nothing gets transmitted out that interface in response.

As per RFC3927, section 2.6.2 "[a] host MUST NOT send a packet with an IPv4 Link-Local destination 
address to any router for forwarding." Your linux host use 169.254.160/24, which is in the IPv4 
Link-Local range. So it is normal for your linux host not to reply to a ping request coming from 
outside of the local subnet.

Why do you use an IP in the link local range? For as far as I remember (but I failed to find the 
exact section), this RFC also forbid static configuration of an IP in this range.

	Nicolas.

^ permalink raw reply

* Re: [PATCH] bonding: reset queue mapping prior to transmission to physical device
From: Neil Horman @ 2011-06-02 19:46 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: netdev, Jay Vosburgh, Andy Gospodarek, David S. Miller
In-Reply-To: <1307041789.2812.27.camel@bwh-desktop>

On Thu, Jun 02, 2011 at 08:09:49PM +0100, Ben Hutchings wrote:
> On Thu, 2011-06-02 at 14:56 -0400, Neil Horman wrote:
> > On Thu, Jun 02, 2011 at 07:35:53PM +0100, Ben Hutchings wrote:
> > > On Thu, 2011-06-02 at 14:03 -0400, Neil Horman wrote:
> > > > The bonding driver is multiqueue enabled, in which each queue represents a slave
> > > > to enable optional steering of output frames to given slaves against the default
> > > > output policy.  However, it needs to reset the skb->queue_mapping prior to
> > > > queuing to the physical device or the physical slave (if it is multiqueue) could
> > > > wind up transmitting on an unintended tx queue (one that was reserved for
> > > > specific traffic classes for instance)
> [...]
> > > So far as I can see, this has no effect, because dev_queue_xmit() always
> > > sets queue_mapping (in dev_pick_tx()).
> > > 
> > it resets the queue mapping exactly as you would expect it to.  bonding is a
> > multiqueue enabled device and selects a potentially non-zero queue based on the
> > output of bond_select_queue.
> > 
> > > What is the problem you're seeing?
> > > 
> > The problem is exctly that.  dev_pick_tx() on the bond device sets the
> > queue_mapping as per the result of bond_select_queue (the ndo_select_queue
> > method for the bonding driver).  The implementation there is based on the use of
> > tc with bonding, so that output slaves can be selected for certain types of
> > traffic.  But when that mechanism is used, skb->queue_mapping is preserved when
> > the bonding driver queues the frame to the underlying slave.  This denies the
> > slave (if its also a multiqueue device) the opportunity to reselect the queue
> > properly, because of this call path:
> > 
> > bond_queue_xmit
> >  dev_queue_xmit(slave_dev)
> >   dev_pick_tx()
> >    skb_tx_hash()
> >     __skb_tx_hash()
> > 
> > __skb_tx_hash sees that skb_queue_recorded returns true, and assigns a hardware queue mapping
> > based on what the bonding driver chose using its own internal logic.  Since
> > bonding uses the multiqueue infrastructure to do slave output selection without
> > any regard for slave output queue selection, it seems to me we should really
> > reset the queue mapping to zero so the slave device can pick its own tx queue.
> 
> So you're effectively clearing the *RX queue* number (as this is before
> dev_pick_tx()) in order to influence TX queue selection.
> 
No, you're not seeing it properly.  In bonding (as with all stacked devices) we
make two passes through dev_pick_tx.

1) The first time we call dev_pick_tx is when the network stack calls
dev_queue_xmit on the bond device.  here we call ndo_select_queue, which calls
bond_select_queue.  This method tells the stack which queue to enqueue the skb
on for the bond device.  We can use tc's skbedit action to select a particular
queue on the bond device for various classes of traffic, and those queues
correspond to individual slave devices.

2) the second time we call dev_pick_tx is when the bonding bond_queue_xmit
routine (which is the bonding drivers ndo_start_xmit method, called after the
frist dev_pick_tx), calls dev_queue_xmit, passing the slave net_device pointer).
In this case we do whatever the slave device has configured (either reset the
queue to zero, or call the slaves ndo_select queue method).

What I'm fixing is the fact that the bonding drivers queue_mappnig value is
'leaking' down to the slave.

Lets say, for example we're bonding two multiqueue devices, and have the bonding
driver configured to send all traffic to 192.168.1.1 over the first slave (which
we can accomplish using an appropriate tc rule on the bond device, setting
frames with that dest ip to have a skb->queue_mapping of 1).

In the above example, frames bound for 192.168.1.1 have skb->queue_mapping set
to 1 when they enter bond_queue_xmit.  bond_queue_xmit, calls
dev_queue_xmit(slave, skb), which calls dev_pick_tx(slave, skb).  If we (for
simplicity's sake) assume the slave has no ndo_select_queue method, dev_pick_tx
calls __skb_tx_hash to determine the output queue.  But the bonding driver
already set skb->queue_mapping to 1 (because it wanted this frame output on the
first slave, not because it wanted to transmit the frame on the slaves tx queue
1).  Nevertheless, __skb_tx_hash() sees that skb_rx_queue_recorded returns true
here and as a result, calls skb_rx_get_queue, which returns 0.  Because of that
we wind up transmitting on hardware queue 0 all the time, which is not at all
what we want.  What we want is for the bonding driver to be able to use
queue_mapping for its own purposes, and then allow the physical device to make
its own output queue selection independently.  To do this, queue_mapping needs
to be zero, prior to queuenig the skb to the slave, which is exactly what this
patch does.

> Here, the bonding device seems to be behaving as a forwarding device.
> If TX queue selection can go wrong for certain combinations of queue
> configuration when forwarding, then this is a problem for IP forwarding
> and bridging as well, isn't it?
> 
Potentially, yes.  I only fixed this because I was looking at bonding and its
queue_mapping behavior, and saw that this needed fixing.  Bridging and IP
forwarding should also likely clear the queue mapping in the forwarding path
somewhere to avoid selecting an output tx queue that is a function of whatever
queue and device it arrived on during ingress.  I've not yet looked to see if
thats already being done.

Neil

> Ben.
> 
> -- 
> Ben Hutchings, Senior Software Engineer, Solarflare
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* Re: [Bugme-new] [Bug 36562] New: e1000 network driver causes kernel oops
From: Andrew Morton @ 2011-06-02 19:32 UTC (permalink / raw)
  To: netdev; +Cc: bugzilla-daemon, bugme-daemon, bryan.christ
In-Reply-To: <bug-36562-10286@https.bugzilla.kernel.org/>


(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Thu, 2 Jun 2011 19:20:08 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=36562
> 
>            Summary: e1000 network driver causes kernel oops
>            Product: Drivers
>            Version: 2.5
>     Kernel Version: 2.6.38.6-26.rc1.fc14.x86_64
>           Platform: All
>         OS/Version: Linux
>               Tree: Fedora
>             Status: NEW
>           Severity: high
>           Priority: P1
>          Component: Network
>         AssignedTo: drivers_network@kernel-bugs.osdl.org
>         ReportedBy: bryan.christ@gmail.com
>         Regression: No
> 
> 
> After running for about 12-48 hours we typically have several servers which
> kernel oops on the e1000
> 
> http://www.mediafire.com/imgbnc.php/e65c4a4b220096cabfa8746195898c1c5520b65cdde108921f027a309c179f6a6g.jpg
> 

It crashed deep down in TCP.  That may not be e1000-related.

I think we still have two drivers: e1000 and e1000e.  You're using
e1000e.


^ permalink raw reply

* Re: [PATCH] Use unsigned variables for packet lengths in ip[6]_queue.
From: Dave Jones @ 2011-06-02 19:24 UTC (permalink / raw)
  To: netdev
In-Reply-To: <20110528003651.GA8380@redhat.com>

On Fri, May 27, 2011 at 08:36:51PM -0400, Dave Jones wrote:

 >  > > Not catastrophic, but ipqueue seems to be too trusting of what it gets
 >  > > passed from userspace, and passes it on down to the page allocator,
 >  > > where it will spew warnings if the page order is too high.
 >  > > 
 >  > > __ipq_rcv_skb has several checks for lengths too small, but doesn't
 >  > > seem to have any for oversized ones.   I'm not sure what the maximum
 >  > > we should check for is. I'll code up a diff if anyone has any ideas
 >  > > on a sane maximum.
 >  > 
 >  > Maybe the thing to do is to simply pass __GFP_NOWARN to nlmsg_new()
 >  > in netlink_ack()?
 >  > 
 >  > Anyone else have a better idea?
 > 
 > So I went back to this today, and found something that doesn't look right.
 > After adding some instrumentation, and re-running my tests, I found that
 > the reason we were blowing up with enormous allocations was that we
 > were passing down a nlmsglen's like -1061109568
 > 
 > Is there any reason for that to be signed ?
 > The nlmsg_len entry of nlmsghdr is a u32, so I'm assuming this is a bug.
 > 
 > With the patch below, I haven't been able to reproduce the problem, but
 > I don't know if I've inadvertantly broken some other behaviour somewhere
 > deeper in netlink where this is valid.

any feedback on this ? am I barking up the wrong tree ?

	Dave


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox