* Re: [RFC PATCH] af_packet: don't to defrag shared skb
From: David Miller @ 2012-12-07 19:10 UTC (permalink / raw)
To: eric; +Cc: netdev
In-Reply-To: <1354906561-4695-1-git-send-email-eric@regit.org>
From: Eric Leblond <eric@regit.org>
Date: Fri, 7 Dec 2012 19:56:01 +0100
> This patch is adding a check on skb before trying to defrag the
> packet for the hash computation in fanout mode. The goal of this
> patch is to avoid an kernel crash in pskb_expand_head.
> It appears that under some specific condition there is a shared
> skb reaching the defrag code and this lead to a crash due to the
> following code:
>
> if (skb_shared(skb))
> BUG();
>
> I've observed this crash under the following condition:
> 1. a program is listening to an wifi interface (let say wlan0)
> 2. it is using fanout capture in flow load balancing mode
> 3. defrag option is on on the fanout socket
> 4. the interface disconnect (radio down for example)
> 5. the interface reconnect (radio switched up)
> 6. once reconnected a single packet is seen with skb->users=2
> 7. the kernel crash in pskb_expand_head at skbuff.c:1035
...
> Signed-off-by: Eric Leblond <eric@regit.org>
Thanks Eric. I'll try to figure out if we should instead
change the wireless code to avoid sending shared SKBs into
the input path like that.
^ permalink raw reply
* Re: [patch v2] bridge: make buffer larger in br_setlink()
From: walter harms @ 2012-12-07 19:15 UTC (permalink / raw)
To: Dan Carpenter
Cc: Stephen Hemminger, David S. Miller, bridge, netdev,
kernel-janitors, Thomas Graf
In-Reply-To: <20121207185359.GP22569@mwanda>
Am 07.12.2012 19:53, schrieb Dan Carpenter:
> On Fri, Dec 07, 2012 at 05:07:24PM +0100, walter harms wrote:
>>
>>
>> Am 07.12.2012 12:10, schrieb Dan Carpenter:
>>> We pass IFLA_BRPORT_MAX to nla_parse_nested() so we need
>>> IFLA_BRPORT_MAX + 1 elements. Also Smatch complains that we read past
>>> the end of the array when in br_set_port_flag() when it's called with
>>> IFLA_BRPORT_FAST_LEAVE.
>>>
>>
>>
>>
>> I have no clue why nla_parse_nested() need IFLA_BRPORT_MAX elements.
>> but the majory of loop look like
>> for(i=0;i<max;++)
>> most programmers will think this way.
>> So it seems the place to fix is nla_parse_nested().
>> doing not so is asking for trouble (in the long run).
>> At least this function needs a big warning label that (max-1)
>> is actually needed.
>>
>
> Yeah, nla_parse_nested() is actually documented already.
>
documenting unexspected behavier is not as much helpfull as changing it.
just my 2 cents,
wh
^ permalink raw reply
* Re: [PATCH 0/2 v2] sctp: RCU protection when accessing assoc members for procfs
From: David Miller @ 2012-12-07 19:15 UTC (permalink / raw)
To: tgraf; +Cc: linux-sctp, netdev, vyasevich, nhorman
In-Reply-To: <cover.1354821623.git.tgraf@suug.ch>
From: Thomas Graf <tgraf@suug.ch>
Date: Thu, 6 Dec 2012 20:25:03 +0100
> @Dave: This is a respin of the following patches:
> [PATCH] sctp: proc: protect bind_addr->address_list accesses with rcu_read_lock()
> [PATCH] sctp: Add RCU protection to assoc->transport_addr_list
>
> Thomas Graf (2):
> sctp: proc: protect bind_addr->address_list accesses with
> rcu_read_lock()
> sctp: Add RCU protection to assoc->transport_addr_list
Series applied, thanks.
^ permalink raw reply
* Re: [PATCH net-next] rps: overflow prevention for saturated cpus
From: David Miller @ 2012-12-07 19:20 UTC (permalink / raw)
To: willemb; +Cc: netdev, edumazet, therbert
In-Reply-To: <1354826194-9289-1-git-send-email-willemb@google.com>
From: Willem de Bruijn <willemb@google.com>
Date: Thu, 6 Dec 2012 15:36:34 -0500
> This patch maintains flow affinity in normal conditions, but
> trades it for throughput when a cpu becomes saturated. Then, packets
> destined to that cpu (only) are redirected to the lightest loaded cpu
> in the rxqueue's rps_map. This breaks flow affinity under high load
> for some flows, in favor of processing packets up to the capacity
> of the complete rps_map cpuset in all circumstances.
We specifically built-in very strict checks to make sure we never
deliver packets out-of-order. Those mechanisms must be used and
enforced in any change of this nature.
^ permalink raw reply
* Re: [PATCH net-next 03/10] tipc: sk_recv_queue size check only for connectionless sockets
From: Neil Horman @ 2012-12-07 19:20 UTC (permalink / raw)
To: Paul Gortmaker; +Cc: David Miller, netdev, Jon Maloy, Ying Xue
In-Reply-To: <1354890498-6448-4-git-send-email-paul.gortmaker@windriver.com>
On Fri, Dec 07, 2012 at 09:28:11AM -0500, Paul Gortmaker wrote:
> From: Ying Xue <ying.xue@windriver.com>
>
> The sk_receive_queue limit control is currently performed for
> all arriving messages, disregarding socket and message type.
> But for connected sockets this check is redundant, since the protocol
> flow control already makes queue overflow impossible.
>
Can you explain where that occurs? I see where the tipc dispatch function calls
sk_add_backlog, which checks the per socket recieve queue (regardless of weather
the receiving socket is connection oriented or connectionless), but if the
receiver doesn't call receive very often, This just adds a check against your
global limit, doing nothing for your per-socket limits. In fact it seems to
repeat the same check twice, as in the worst case of the incomming message being
TIPC_LOW_IMPORTANCE, its just going to check that the global limit is exactly
OVERLOAD_LIMIT_BASE/2 again.
Neil
^ permalink raw reply
* Re: [PATCH] drivers/net: fix up function prototypes after __dev* removals
From: David Miller @ 2012-12-07 19:22 UTC (permalink / raw)
To: gregkh; +Cc: netdev, wfp5p
In-Reply-To: <20121207003056.GA30167@kroah.com>
From: Greg KH <gregkh@linuxfoundation.org>
Date: Thu, 6 Dec 2012 16:30:56 -0800
> From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>
> The __dev* removal patches for the network drivers ended up messing up
> the function prototypes for a bunch of drivers. This patch fixes all of
> them back up to be properly aligned.
>
> Bonus is that this almost removes 100 lines of code, always a nice
> surprise.
>
> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Applied, thanks Greg.
^ permalink raw reply
* Re: [PATCH v3 1/4] net: Add support for hardware-offloaded encapsulation
From: David Miller @ 2012-12-07 19:28 UTC (permalink / raw)
To: joseph.gasparakis
Cc: bhutchings, alexander.h.duyck, shemminger, chrisw, gospo, netdev,
linux-kernel, dmitry, saeed.bishara, peter.p.waskiewicz.jr
In-Reply-To: <alpine.LFD.2.02.1212071013200.30243@morpheus.jf.intel.com>
From: Joseph Gasparakis <joseph.gasparakis@intel.com>
Date: Fri, 7 Dec 2012 10:24:17 -0800 (PST)
> So the idea here is that the driver will use the headers for checksumming
> if the skb->encapsulation bit is on. The bit should be set in the protocol
> driver.
>
> To answer the second comment, the flags that we use in this series of
> patches is NETIF_F_IP_CSUM, NETIF_F_IPV6_CSUM and NETIF_F_SG. These are
> the bits that we propose will be used for checksumming of encapsulation.
> As per a previous comment in v2, the hw_enc_features field should be used
> also in the future when NICs have more encap offloads, so one could
> indicate these features there from the driver.
>
> Furthermore, I submitted a patch for Rx checksumming, where NETIF_F_RXCSUM
> is used, again in conjunction with skb->encapsulation flag. As I mention
> in my logs, the driver is expected to set the ip_summed to UNNECESSARY and
> turn the skb->encapsulation on, to indicate that the inner headers are
> already HW checksummed.
>
This is the kind of language that belongs in the commit message and
code comments.
^ permalink raw reply
* Re: [PATCH net-next 2/2] net: doc: add default value for neighbour parameters
From: David Miller @ 2012-12-07 19:31 UTC (permalink / raw)
To: shanwei88; +Cc: bhutchings, eric.dumazet, netdev
In-Reply-To: <50C15427.9060803@gmail.com>
From: Shan Wei <shanwei88@gmail.com>
Date: Fri, 07 Dec 2012 10:27:51 +0800
> [PATCH net-next] net: doc : use more suitable word 'unexpected' to replace 'secluded'
>
> 'secluded' is used to describe places, not suitable here.
>
> Suggested-by: Ben Hutchings <bhutchings@solarflare.com>
> Signed-off-by: Shan Wei <davidshan@tencent.com>
Applied.
^ permalink raw reply
* Re: [PATCH net-next v5] bridge: export multicast database via netlink
From: David Miller @ 2012-12-07 19:33 UTC (permalink / raw)
To: tgraf; +Cc: amwang, netdev, bridge, herbert, brouer, shemminger
In-Reply-To: <20121207103629.GB2996@casper.infradead.org>
From: Thomas Graf <tgraf@suug.ch>
Date: Fri, 7 Dec 2012 10:36:29 +0000
> On 12/07/12 at 06:04pm, Cong Wang wrote:
>> From: Cong Wang <amwang@redhat.com>
>>
>> V5: fix two bugs pointed out by Thomas
>> remove seq check for now, mark it as TODO
>>
>> V4: remove some useless #include
>> some coding style fix
>>
>> V3: drop debugging printk's
>> update selinux perm table as well
>>
>> V2: drop patch 1/2, export ifindex directly
>> Redesign netlink attributes
>> Improve netlink seq check
>> Handle IPv6 addr as well
>>
>> This patch exports bridge multicast database via netlink
>> message type RTM_GETMDB. Similar to fdb, but currently bridge-specific.
>> We may need to support modify multicast database too (RTM_{ADD,DEL}MDB).
>>
>> (Thanks to Thomas for patient reviews)
>>
>> Cc: Herbert Xu <herbert@gondor.apana.org.au>
>> Cc: Stephen Hemminger <shemminger@vyatta.com>
>> Cc: "David S. Miller" <davem@davemloft.net>
>> Cc: Thomas Graf <tgraf@suug.ch>
>> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
>> Signed-off-by: Cong Wang <amwang@redhat.com>
>
> Acked-by: Thomas Graf <tgraf@suug.ch>
Applied, thanks.
^ permalink raw reply
* Re: [PATCH v3 1/4] net: Add support for hardware-offloaded encapsulation
From: David Miller @ 2012-12-07 19:37 UTC (permalink / raw)
To: joseph.gasparakis
Cc: bhutchings, alexander.h.duyck, shemminger, chrisw, gospo, netdev,
linux-kernel, dmitry, saeed.bishara, peter.p.waskiewicz.jr
In-Reply-To: <alpine.LFD.2.02.1212071139190.4985@morpheus.jf.intel.com>
From: Joseph Gasparakis <joseph.gasparakis@intel.com>
Date: Fri, 7 Dec 2012 11:41:46 -0800 (PST)
>
>
> On Fri, 7 Dec 2012, David Miller wrote:
>
>> From: Joseph Gasparakis <joseph.gasparakis@intel.com>
>> Date: Fri, 7 Dec 2012 10:24:17 -0800 (PST)
>>
>> > So the idea here is that the driver will use the headers for checksumming
>> > if the skb->encapsulation bit is on. The bit should be set in the protocol
>> > driver.
>> >
>> > To answer the second comment, the flags that we use in this series of
>> > patches is NETIF_F_IP_CSUM, NETIF_F_IPV6_CSUM and NETIF_F_SG. These are
>> > the bits that we propose will be used for checksumming of encapsulation.
>> > As per a previous comment in v2, the hw_enc_features field should be used
>> > also in the future when NICs have more encap offloads, so one could
>> > indicate these features there from the driver.
>> >
>> > Furthermore, I submitted a patch for Rx checksumming, where NETIF_F_RXCSUM
>> > is used, again in conjunction with skb->encapsulation flag. As I mention
>> > in my logs, the driver is expected to set the ip_summed to UNNECESSARY and
>> > turn the skb->encapsulation on, to indicate that the inner headers are
>> > already HW checksummed.
>> >
>>
>> This is the kind of language that belongs in the commit message and
>> code comments.
>>
> Sure. I'll wait to gather some more feedback if there is any and I will
> re-spin this off adding more code comments and clarify this in the logs.
Great.
Please note that this request applies to your receive side change too.
^ permalink raw reply
* Re: [patch v2] bridge: make buffer larger in br_setlink()
From: David Miller @ 2012-12-07 19:40 UTC (permalink / raw)
To: dan.carpenter; +Cc: tgraf, netdev, shemminger, bridge, kernel-janitors
In-Reply-To: <20121207111045.GA9676@elgon.mountain>
From: Dan Carpenter <dan.carpenter@oracle.com>
Date: Fri, 7 Dec 2012 14:10:46 +0300
> We pass IFLA_BRPORT_MAX to nla_parse_nested() so we need
> IFLA_BRPORT_MAX + 1 elements. Also Smatch complains that we read past
> the end of the array when in br_set_port_flag() when it's called with
> IFLA_BRPORT_FAST_LEAVE.
>
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Applied.
^ permalink raw reply
* Re: [PATCH] tcp: bug fix Fast Open client retransmission
From: David Miller @ 2012-12-07 19:41 UTC (permalink / raw)
To: ncardwell; +Cc: ycheng, nanditad, edumazet, netdev
In-Reply-To: <CADVnQymE7ubywMytDCabQ=3gMWkp=gVrHZke0HpiufQNbF=1KQ@mail.gmail.com>
From: Neal Cardwell <ncardwell@google.com>
Date: Thu, 6 Dec 2012 14:50:58 -0500
> On Thu, Dec 6, 2012 at 1:45 PM, Yuchung Cheng <ycheng@google.com> wrote:
>> If SYN-ACK partially acks SYN-data, the client retransmits the
>> remaining data by tcp_retransmit_skb(). This increments lost recovery
>> state variables like tp->retrans_out in Open state. If loss recovery
>> happens before the retransmission is acked, it triggers the WARN_ON
>> check in tcp_fastretrans_alert(). For example: the client sends
>> SYN-data, gets SYN-ACK acking only ISN, retransmits data, sends
>> another 4 data packets and get 3 dupacks.
>>
>> Since the retransmission is not caused by network drop it should not
>> update the recovery state variables. Further the server may return a
>> smaller MSS than the cached MSS used for SYN-data, so the retranmission
>> needs a loop. Otherwise some data will not be retransmitted until timeout
>> or other loss recovery events.
>>
>> Signed-off-by: Yuchung Cheng <ycheng@google.com>
>> ---
>
> Acked-by: Neal Cardwell <ncardwell@google.com>
Applied.
^ permalink raw reply
* Re: [PATCH] net: gro: fix possible panic in skb_gro_receive()
From: David Miller @ 2012-12-07 19:41 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev
In-Reply-To: <1354838099.10983.10.camel@edumazet-glaptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 06 Dec 2012 15:54:59 -0800
> From: Eric Dumazet <edumazet@google.com>
>
> commit 2e71a6f8084e (net: gro: selective flush of packets) added
> a bug for skbs using frag_list. This part of the GRO stack is rarely
> used, as it needs skb not using a page fragment for their skb->head.
>
> Most drivers do use a page fragment, but some of them use GFP_KERNEL
> allocations for the initial fill of their RX ring buffer.
>
> napi_gro_flush() overwrite skb->prev that was used for these skb to
> point to the last skb in frag_list.
>
> Fix this using a separate field in struct napi_gro_cb to point to the
> last fragment.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Applied, thanks Eric.
^ permalink raw reply
* Re: [PATCH v3 1/4] net: Add support for hardware-offloaded encapsulation
From: Joseph Gasparakis @ 2012-12-07 19:41 UTC (permalink / raw)
To: David Miller
Cc: joseph.gasparakis, bhutchings, alexander.h.duyck, shemminger,
chrisw, gospo, netdev, linux-kernel, dmitry, saeed.bishara,
peter.p.waskiewicz.jr
In-Reply-To: <20121207.142845.1145515122357569641.davem@davemloft.net>
On Fri, 7 Dec 2012, David Miller wrote:
> From: Joseph Gasparakis <joseph.gasparakis@intel.com>
> Date: Fri, 7 Dec 2012 10:24:17 -0800 (PST)
>
> > So the idea here is that the driver will use the headers for checksumming
> > if the skb->encapsulation bit is on. The bit should be set in the protocol
> > driver.
> >
> > To answer the second comment, the flags that we use in this series of
> > patches is NETIF_F_IP_CSUM, NETIF_F_IPV6_CSUM and NETIF_F_SG. These are
> > the bits that we propose will be used for checksumming of encapsulation.
> > As per a previous comment in v2, the hw_enc_features field should be used
> > also in the future when NICs have more encap offloads, so one could
> > indicate these features there from the driver.
> >
> > Furthermore, I submitted a patch for Rx checksumming, where NETIF_F_RXCSUM
> > is used, again in conjunction with skb->encapsulation flag. As I mention
> > in my logs, the driver is expected to set the ip_summed to UNNECESSARY and
> > turn the skb->encapsulation on, to indicate that the inner headers are
> > already HW checksummed.
> >
>
> This is the kind of language that belongs in the commit message and
> code comments.
>
Sure. I'll wait to gather some more feedback if there is any and I will
re-spin this off adding more code comments and clarify this in the logs.
^ permalink raw reply
* Re: [PATCH net-next] bonding: Fix check for ethtool get_link operation support
From: David Miller @ 2012-12-07 19:41 UTC (permalink / raw)
To: bhutchings; +Cc: fubar, andy, netdev
In-Reply-To: <1354896932.2707.5.camel@bwh-desktop.uk.solarflarecom.com>
From: Ben Hutchings <bhutchings@solarflare.com>
Date: Fri, 7 Dec 2012 16:15:32 +0000
> Since commit 2c60db037034 ('net: provide a default dev->ethtool_ops')
> all devices have a non-null ethtool_ops. Test only
> dev->ethtool_ops->get_link in both places where we care.
>
> Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Applied.
^ permalink raw reply
* Re: [PATCH net-next 10/10] tipc: refactor accept() code for improved readability
From: Neil Horman @ 2012-12-07 19:42 UTC (permalink / raw)
To: Paul Gortmaker; +Cc: David Miller, netdev, Jon Maloy
In-Reply-To: <1354890498-6448-11-git-send-email-paul.gortmaker@windriver.com>
On Fri, Dec 07, 2012 at 09:28:18AM -0500, Paul Gortmaker wrote:
> In TIPC's accept() routine, there is a large block of code relating
> to initialization of a new socket, all within an if condition checking
> if the allocation succeeded.
>
> Here, we simply factor out that init code within the accept() function
> to its own separate function, which improves readability, and makes
> it easier to track which lock_sock() calls are operating on existing
> vs. new sockets.
>
> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
> ---
> net/tipc/socket.c | 93 +++++++++++++++++++++++++++++--------------------------
> 1 file changed, 49 insertions(+), 44 deletions(-)
>
> diff --git a/net/tipc/socket.c b/net/tipc/socket.c
> index 38613cf..56661c8 100644
> --- a/net/tipc/socket.c
> +++ b/net/tipc/socket.c
> @@ -1507,6 +1507,53 @@ static int listen(struct socket *sock, int len)
> return res;
> }
>
> +static void tipc_init_socket(struct sock *sk, struct socket *new_sock,
> + struct sk_buff *buf)
> +{
Can you rename this to something more specific to its purpose? tipc_init_socket
makes me wonder why you're not calling it internally from tipc_create. This
seems more like a tipc_init_accept_sock, or some such. Alternatively, since
you're ony using it from your accept call, you might consider not factoring it
out at all, and just reversing the logic in your accept function so that you do:
res = tipc_create(...)
if (res)
goto exit;
<rest of tipc_init_socket goes here>
That gives you the same level of readability, without the additional potential
call instruction, plus you put the unlikely case inside the if conditional.
Neil
^ permalink raw reply
* Re: [PATCH] fix initializer entry defined twice issue
From: Patrick Trantham @ 2012-12-07 19:51 UTC (permalink / raw)
To: Cong Ding
Cc: David S. Miller, Otavio Salvador, Javier Martinez Canillas,
Jiri Kosina, netdev, linux-kernel
In-Reply-To: <20121207000511.GD2510@gmail.com>
On 12/06/2012 06:05 PM, Cong Ding wrote:
> On Thu, Dec 06, 2012 at 05:59:21PM -0600, Patrick Trantham wrote:
>> On 12/06/2012 05:16 PM, Cong Ding wrote:
>>> the ".config_intr" is defined twice in both line 208 and 212.
>>>
>>> Signed-off-by: Cong Ding <dinggnu@gmail.com>
>>> ---
>>> drivers/net/phy/smsc.c | 1 -
>>> 1 files changed, 0 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/drivers/net/phy/smsc.c b/drivers/net/phy/smsc.c
>>> index 16dceed..5cee6bd 100644
>>> --- a/drivers/net/phy/smsc.c
>>> +++ b/drivers/net/phy/smsc.c
>>> @@ -205,7 +205,6 @@ static struct phy_driver smsc_phy_driver[] = {
>>> /* basic functions */
>>> .config_aneg = genphy_config_aneg,
>>> .read_status = lan87xx_read_status,
>>> - .config_intr = smsc_phy_config_intr,
>>> /* IRQ related */
>>> .ack_interrupt = smsc_phy_ack_interrupt,
>> Hi Cong,
>>
>> That looks like a mistake from my previous patch to that file. Copy
>> and paste fail.
>>
>> The line should read:
>> .config_init = smsc_phy_config_init,
>>
>> I can submit a patch once I get off work unless you beat me to it.
> Thanks for the note. It's better that you submit a patch together with your
> code.
>
> - cong
Patch submitted, thanks for pointing this out!
- Patrick
^ permalink raw reply
* Re: [PATCH v3 1/4] net: Add support for hardware-offloaded encapsulation
From: David Miller @ 2012-12-07 19:52 UTC (permalink / raw)
To: joseph.gasparakis
Cc: bhutchings, alexander.h.duyck, shemminger, chrisw, gospo, netdev,
linux-kernel, dmitry, saeed.bishara, peter.p.waskiewicz.jr
In-Reply-To: <alpine.LFD.2.02.1212071149150.4985@morpheus.jf.intel.com>
From: Joseph Gasparakis <joseph.gasparakis@intel.com>
Date: Fri, 7 Dec 2012 11:52:43 -0800 (PST)
>
>
> On Fri, 7 Dec 2012, David Miller wrote:
>
>> From: Joseph Gasparakis <joseph.gasparakis@intel.com>
>> Date: Fri, 7 Dec 2012 11:41:46 -0800 (PST)
>>
>> >
>> >
>> > On Fri, 7 Dec 2012, David Miller wrote:
>> >
>> >> From: Joseph Gasparakis <joseph.gasparakis@intel.com>
>> >> Date: Fri, 7 Dec 2012 10:24:17 -0800 (PST)
>> >>
>> >> > So the idea here is that the driver will use the headers for checksumming
>> >> > if the skb->encapsulation bit is on. The bit should be set in the protocol
>> >> > driver.
>> >> >
>> >> > To answer the second comment, the flags that we use in this series of
>> >> > patches is NETIF_F_IP_CSUM, NETIF_F_IPV6_CSUM and NETIF_F_SG. These are
>> >> > the bits that we propose will be used for checksumming of encapsulation.
>> >> > As per a previous comment in v2, the hw_enc_features field should be used
>> >> > also in the future when NICs have more encap offloads, so one could
>> >> > indicate these features there from the driver.
>> >> >
>> >> > Furthermore, I submitted a patch for Rx checksumming, where NETIF_F_RXCSUM
>> >> > is used, again in conjunction with skb->encapsulation flag. As I mention
>> >> > in my logs, the driver is expected to set the ip_summed to UNNECESSARY and
>> >> > turn the skb->encapsulation on, to indicate that the inner headers are
>> >> > already HW checksummed.
>> >> >
>> >>
>> >> This is the kind of language that belongs in the commit message and
>> >> code comments.
>> >>
>> > Sure. I'll wait to gather some more feedback if there is any and I will
>> > re-spin this off adding more code comments and clarify this in the logs.
>>
>> Great.
>>
>> Please note that this request applies to your receive side change too.
>>
> I believe the logs are sufficient in the rx patch:
>
> This patch adds capability in vxlan to identify received
> checksummed inner packets and signal them to the upper layers of
> the stack. The driver needs to set the skb->encapsulation bit
> and also set the skb->ip_summed to CHECKSUM_UNNECESSARY.
>
> ...but I can add more comments to the code if this is what you are
> referring to.
Yes, please do.
^ permalink raw reply
* Re: [PATCH v3 1/4] net: Add support for hardware-offloaded encapsulation
From: Joseph Gasparakis @ 2012-12-07 19:52 UTC (permalink / raw)
To: David Miller
Cc: joseph.gasparakis, bhutchings, alexander.h.duyck, shemminger,
chrisw, gospo, netdev, linux-kernel, dmitry, saeed.bishara,
peter.p.waskiewicz.jr
In-Reply-To: <20121207.143725.1501926484331042018.davem@davemloft.net>
On Fri, 7 Dec 2012, David Miller wrote:
> From: Joseph Gasparakis <joseph.gasparakis@intel.com>
> Date: Fri, 7 Dec 2012 11:41:46 -0800 (PST)
>
> >
> >
> > On Fri, 7 Dec 2012, David Miller wrote:
> >
> >> From: Joseph Gasparakis <joseph.gasparakis@intel.com>
> >> Date: Fri, 7 Dec 2012 10:24:17 -0800 (PST)
> >>
> >> > So the idea here is that the driver will use the headers for checksumming
> >> > if the skb->encapsulation bit is on. The bit should be set in the protocol
> >> > driver.
> >> >
> >> > To answer the second comment, the flags that we use in this series of
> >> > patches is NETIF_F_IP_CSUM, NETIF_F_IPV6_CSUM and NETIF_F_SG. These are
> >> > the bits that we propose will be used for checksumming of encapsulation.
> >> > As per a previous comment in v2, the hw_enc_features field should be used
> >> > also in the future when NICs have more encap offloads, so one could
> >> > indicate these features there from the driver.
> >> >
> >> > Furthermore, I submitted a patch for Rx checksumming, where NETIF_F_RXCSUM
> >> > is used, again in conjunction with skb->encapsulation flag. As I mention
> >> > in my logs, the driver is expected to set the ip_summed to UNNECESSARY and
> >> > turn the skb->encapsulation on, to indicate that the inner headers are
> >> > already HW checksummed.
> >> >
> >>
> >> This is the kind of language that belongs in the commit message and
> >> code comments.
> >>
> > Sure. I'll wait to gather some more feedback if there is any and I will
> > re-spin this off adding more code comments and clarify this in the logs.
>
> Great.
>
> Please note that this request applies to your receive side change too.
>
I believe the logs are sufficient in the rx patch:
This patch adds capability in vxlan to identify received
checksummed inner packets and signal them to the upper layers of
the stack. The driver needs to set the skb->encapsulation bit
and also set the skb->ip_summed to CHECKSUM_UNNECESSARY.
...but I can add more comments to the code if this is what you are
referring to.
^ permalink raw reply
* Re: [PATCH net-next 02/10] tipc: eliminate aggregate sk_receive_queue limit
From: Paul Gortmaker @ 2012-12-07 19:54 UTC (permalink / raw)
To: David Miller; +Cc: nhorman, netdev, jon.maloy, ying.xue
In-Reply-To: <20121207.123644.431593821406881255.davem@davemloft.net>
[Re: [PATCH net-next 02/10] tipc: eliminate aggregate sk_receive_queue limit] On 07/12/2012 (Fri 12:36) David Miller wrote:
> From: Neil Horman <nhorman@tuxdriver.com>
> Date: Fri, 7 Dec 2012 11:07:33 -0500
>
> > On Fri, Dec 07, 2012 at 09:28:10AM -0500, Paul Gortmaker wrote:
> >> From: Ying Xue <ying.xue@windriver.com>
> >>
> >> As a complement to the per-socket sk_recv_queue limit, TIPC keeps a
> >> global atomic counter for the sum of sk_recv_queue sizes across all
> >> tipc sockets. When incremented, the counter is compared to an upper
> >> threshold value, and if this is reached, the message is rejected
> >> with error code TIPC_OVERLOAD.
> >>
> >> This check was originally meant to protect the node against
> >> buffer exhaustion and general CPU overload. However, all experience
> >> indicates that the feature not only is redundant on Linux, but even
> >> harmful. Users run into the limit very often, causing disturbances
> >> for their applications, while removing it seems to have no negative
> >> effects at all. We have also seen that overall performance is
> >> boosted significantly when this bottleneck is removed.
> >>
> >> Furthermore, we don't see any other network protocols maintaining
> >> such a mechanism, something strengthening our conviction that this
> >> control can be eliminated.
> >>
> >> Signed-off-by: Ying Xue <ying.xue@windriver.com>
> >> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
> >> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
> ...
> >> @@ -1241,11 +1241,6 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
> >> }
> >>
> >> /* Reject message if there isn't room to queue it */
> >> - recv_q_len = (u32)atomic_read(&tipc_queue_size);
> >> - if (unlikely(recv_q_len >= OVERLOAD_LIMIT_BASE)) {
> >> - if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE))
> >> - return TIPC_ERR_OVERLOAD;
> >> - }
> > If you're going to remove the one place that you read this variable, don't you
> > also want to remove the points where you increment/decrement the atomic as well,
> > and for that matter eliminate the definition itself?
>
> There's another reader, a getsockopt() call.
>
> I would just make it return zero or similar.
Updated commit below for reference, just to double check that I'm
doing what is requested properly.
Paul.
--
From 9da3d475874f4da49057767913af95ce01063ba3 Mon Sep 17 00:00:00 2001
From: Ying Xue <ying.xue@windriver.com>
Date: Tue, 27 Nov 2012 06:15:27 -0500
Subject: [PATCH] tipc: eliminate aggregate sk_receive_queue limit
As a complement to the per-socket sk_recv_queue limit, TIPC keeps a
global atomic counter for the sum of sk_recv_queue sizes across all
tipc sockets. When incremented, the counter is compared to an upper
threshold value, and if this is reached, the message is rejected
with error code TIPC_OVERLOAD.
This check was originally meant to protect the node against
buffer exhaustion and general CPU overload. However, all experience
indicates that the feature not only is redundant on Linux, but even
harmful. Users run into the limit very often, causing disturbances
for their applications, while removing it seems to have no negative
effects at all. We have also seen that overall performance is
boosted significantly when this bottleneck is removed.
Furthermore, we don't see any other network protocols maintaining
such a mechanism, something strengthening our conviction that this
control can be eliminated.
As a result, the atomic variable tipc_queue_size is now unused
and so it can be deleted. There is a getsockopt call that used
to allow reading it; we retain that but just return zero for
maximum compatibility.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
[PG: phase out tipc_queue_size as pointed out by Neil Horman]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 1a720c8..848be69 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -2,7 +2,7 @@
* net/tipc/socket.c: TIPC socket API
*
* Copyright (c) 2001-2007, Ericsson AB
- * Copyright (c) 2004-2008, 2010-2011, Wind River Systems
+ * Copyright (c) 2004-2008, 2010-2012, Wind River Systems
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
@@ -73,8 +73,6 @@ static struct proto tipc_proto;
static int sockets_enabled;
-static atomic_t tipc_queue_size = ATOMIC_INIT(0);
-
/*
* Revised TIPC socket locking policy:
*
@@ -128,7 +126,6 @@ static atomic_t tipc_queue_size = ATOMIC_INIT(0);
static void advance_rx_queue(struct sock *sk)
{
kfree_skb(__skb_dequeue(&sk->sk_receive_queue));
- atomic_dec(&tipc_queue_size);
}
/**
@@ -140,10 +137,8 @@ static void discard_rx_queue(struct sock *sk)
{
struct sk_buff *buf;
- while ((buf = __skb_dequeue(&sk->sk_receive_queue))) {
- atomic_dec(&tipc_queue_size);
+ while ((buf = __skb_dequeue(&sk->sk_receive_queue)))
kfree_skb(buf);
- }
}
/**
@@ -155,10 +150,8 @@ static void reject_rx_queue(struct sock *sk)
{
struct sk_buff *buf;
- while ((buf = __skb_dequeue(&sk->sk_receive_queue))) {
+ while ((buf = __skb_dequeue(&sk->sk_receive_queue)))
tipc_reject_msg(buf, TIPC_ERR_NO_PORT);
- atomic_dec(&tipc_queue_size);
- }
}
/**
@@ -280,7 +273,6 @@ static int release(struct socket *sock)
buf = __skb_dequeue(&sk->sk_receive_queue);
if (buf == NULL)
break;
- atomic_dec(&tipc_queue_size);
if (TIPC_SKB_CB(buf)->handle != 0)
kfree_skb(buf);
else {
@@ -1241,11 +1233,6 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
}
/* Reject message if there isn't room to queue it */
- recv_q_len = (u32)atomic_read(&tipc_queue_size);
- if (unlikely(recv_q_len >= OVERLOAD_LIMIT_BASE)) {
- if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE))
- return TIPC_ERR_OVERLOAD;
- }
recv_q_len = skb_queue_len(&sk->sk_receive_queue);
if (unlikely(recv_q_len >= (OVERLOAD_LIMIT_BASE / 2))) {
if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE / 2))
@@ -1254,7 +1241,6 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
/* Enqueue message (finally!) */
TIPC_SKB_CB(buf)->handle = 0;
- atomic_inc(&tipc_queue_size);
__skb_queue_tail(&sk->sk_receive_queue, buf);
/* Initiate connection termination for an incoming 'FIN' */
@@ -1578,7 +1564,6 @@ restart:
/* Disconnect and send a 'FIN+' or 'FIN-' message to peer */
buf = __skb_dequeue(&sk->sk_receive_queue);
if (buf) {
- atomic_dec(&tipc_queue_size);
if (TIPC_SKB_CB(buf)->handle != 0) {
kfree_skb(buf);
goto restart;
@@ -1717,7 +1702,7 @@ static int getsockopt(struct socket *sock,
/* no need to set "res", since already 0 at this point */
break;
case TIPC_NODE_RECVQ_DEPTH:
- value = (u32)atomic_read(&tipc_queue_size);
+ value = 0; /* was tipc_queue_size, now obsolete */
break;
case TIPC_SOCK_RECVQ_DEPTH:
value = skb_queue_len(&sk->sk_receive_queue);
--
1.7.12.1
>
> Paul please do so and respin this series.
>
> Thanks.
^ permalink raw reply related
* Re: [PATCH net-next 10/10] tipc: refactor accept() code for improved readability
From: Paul Gortmaker @ 2012-12-07 19:56 UTC (permalink / raw)
To: Neil Horman; +Cc: David Miller, netdev, Jon Maloy
In-Reply-To: <20121207194226.GB30339@hmsreliant.think-freely.org>
[Re: [PATCH net-next 10/10] tipc: refactor accept() code for improved readability] On 07/12/2012 (Fri 14:42) Neil Horman wrote:
> On Fri, Dec 07, 2012 at 09:28:18AM -0500, Paul Gortmaker wrote:
> > In TIPC's accept() routine, there is a large block of code relating
> > to initialization of a new socket, all within an if condition checking
> > if the allocation succeeded.
> >
> > Here, we simply factor out that init code within the accept() function
> > to its own separate function, which improves readability, and makes
> > it easier to track which lock_sock() calls are operating on existing
> > vs. new sockets.
> >
> > Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
> > ---
> > net/tipc/socket.c | 93 +++++++++++++++++++++++++++++--------------------------
> > 1 file changed, 49 insertions(+), 44 deletions(-)
> >
> > diff --git a/net/tipc/socket.c b/net/tipc/socket.c
> > index 38613cf..56661c8 100644
> > --- a/net/tipc/socket.c
> > +++ b/net/tipc/socket.c
> > @@ -1507,6 +1507,53 @@ static int listen(struct socket *sock, int len)
> > return res;
> > }
> >
> > +static void tipc_init_socket(struct sock *sk, struct socket *new_sock,
> > + struct sk_buff *buf)
> > +{
> Can you rename this to something more specific to its purpose? tipc_init_socket
> makes me wonder why you're not calling it internally from tipc_create. This
> seems more like a tipc_init_accept_sock, or some such. Alternatively, since
> you're ony using it from your accept call, you might consider not factoring it
> out at all, and just reversing the logic in your accept function so that you do:
>
> res = tipc_create(...)
> if (res)
> goto exit;
> <rest of tipc_init_socket goes here>
>
> That gives you the same level of readability, without the additional potential
> call instruction, plus you put the unlikely case inside the if conditional.
I'm inclined to do the latter and just flip the sense of the "if" since
I already scratched my head trying to think of a good name (and
apparently failed in the end).
Thanks,
Paul.
>
> Neil
>
^ permalink raw reply
* Re: BUG: scheduling while atomic: ifup-bonding/3711/0x00000002 -- V3.6.7
From: Linda Walsh @ 2012-12-07 20:06 UTC (permalink / raw)
To: Jay Vosburgh; +Cc: Cong Wang, LKML, Linux Kernel Network Developers
In-Reply-To: <4913.1354154236@death.nxdomain>
Sorry for the delay.... my distro (Suse) has made rebooting my system
a chore (have to often boot from rescue to get it to come up because
they put mount libs in /usr/lib expecting they will always boot
from their ram disk -- preventing those of use who boot directly
from disk from doing so easily...grrr.
Jay Vosburgh wrote:
> The miimon functionality is used to check link state and notice
> when slaves lose carrier.
---
If I am running 'rr' on 2 channels -- specifically for the purpose
of link speed aggregation (getting 1 20Gb channel out of 2 10Gb channels)
I'm not sure I see how miimon would provide benefit. -- if 1 link dies,
the other, being on the same card is likely to be dead too, so would
it really serve a purpose?
> Running without it will not detect failure of
> the bonding slaves, which is likely not what you want. The mode,
> balance-rr in your case, is what selects the load balance to use, and is
> separate from the miimon.
>
----
Wouldn't the entire link die if a slave dies -- like RAID0, 1 disk
dies, the entire link goes down?
The other end (windows) doesn't dynamically config for a static-link
aggregation, so I don't think it would provide benefit.
> That said, the problem you're seeing appears to be caused by two
> things: bonding holds a lock (in addition to RTNL) when calling
> __ethtool_get_settings, and an ixgbe function in the call path to
> retrieve the settings, ixgbe_acquire_swfw_sync_X540, can sleep.
>
> The test patch above handles one case in bond_enslave, but there
> is another case in bond_miimon_commit when a slave changes link state
> from down to up, which will occur shortly after the slave is added.
>
----
Added your 2nd patch -- no more error messages...
however -- likely unrelated, the max speed read or write I am seeing
is about 500MB/s, and that is rare -- usually it's barely <3x a 1Gb
network speed. (119/125 MB R/W). I'm not at all sure it's really
combining the links
properly. Anyway to verify that?
On the windows side it shows the bond-link as a 20Gb connection, but
I don't see anyplace for something similar on linux.
^ permalink raw reply
* Re: [PATCH v3 1/4] net: Add support for hardware-offloaded encapsulation
From: Joseph Gasparakis @ 2012-12-07 20:18 UTC (permalink / raw)
To: David Miller
Cc: joseph.gasparakis, bhutchings, alexander.h.duyck, shemminger,
chrisw, gospo, netdev, linux-kernel, dmitry, saeed.bishara,
peter.p.waskiewicz.jr
In-Reply-To: <20121207.145239.1408511750725975741.davem@davemloft.net>
On Fri, 7 Dec 2012, David Miller wrote:
> From: Joseph Gasparakis <joseph.gasparakis@intel.com>
> Date: Fri, 7 Dec 2012 11:52:43 -0800 (PST)
>
> >
> >
> > On Fri, 7 Dec 2012, David Miller wrote:
> >
> >> From: Joseph Gasparakis <joseph.gasparakis@intel.com>
> >> Date: Fri, 7 Dec 2012 11:41:46 -0800 (PST)
> >>
> >> >
> >> >
> >> > On Fri, 7 Dec 2012, David Miller wrote:
> >> >
> >> >> From: Joseph Gasparakis <joseph.gasparakis@intel.com>
> >> >> Date: Fri, 7 Dec 2012 10:24:17 -0800 (PST)
> >> >>
> >> >> > So the idea here is that the driver will use the headers for checksumming
> >> >> > if the skb->encapsulation bit is on. The bit should be set in the protocol
> >> >> > driver.
> >> >> >
> >> >> > To answer the second comment, the flags that we use in this series of
> >> >> > patches is NETIF_F_IP_CSUM, NETIF_F_IPV6_CSUM and NETIF_F_SG. These are
> >> >> > the bits that we propose will be used for checksumming of encapsulation.
> >> >> > As per a previous comment in v2, the hw_enc_features field should be used
> >> >> > also in the future when NICs have more encap offloads, so one could
> >> >> > indicate these features there from the driver.
> >> >> >
> >> >> > Furthermore, I submitted a patch for Rx checksumming, where NETIF_F_RXCSUM
> >> >> > is used, again in conjunction with skb->encapsulation flag. As I mention
> >> >> > in my logs, the driver is expected to set the ip_summed to UNNECESSARY and
> >> >> > turn the skb->encapsulation on, to indicate that the inner headers are
> >> >> > already HW checksummed.
> >> >> >
> >> >>
> >> >> This is the kind of language that belongs in the commit message and
> >> >> code comments.
> >> >>
> >> > Sure. I'll wait to gather some more feedback if there is any and I will
> >> > re-spin this off adding more code comments and clarify this in the logs.
> >>
> >> Great.
> >>
> >> Please note that this request applies to your receive side change too.
> >>
> > I believe the logs are sufficient in the rx patch:
> >
> > This patch adds capability in vxlan to identify received
> > checksummed inner packets and signal them to the upper layers of
> > the stack. The driver needs to set the skb->encapsulation bit
> > and also set the skb->ip_summed to CHECKSUM_UNNECESSARY.
> >
> > ...but I can add more comments to the code if this is what you are
> > referring to.
>
> Yes, please do.
>
I will! Thanks.
^ permalink raw reply
* Re: [RFC PATCH] af_packet: don't to defrag shared skb
From: David Miller @ 2012-12-07 20:31 UTC (permalink / raw)
To: eric; +Cc: netdev, linux-wireless, johannes, linville
In-Reply-To: <1354906561-4695-1-git-send-email-eric@regit.org>
From: Eric Leblond <eric@regit.org>
Date: Fri, 7 Dec 2012 19:56:01 +0100
Wireless folks, please take a look. The issue is that,
under the circumstances listed below, we get SKBs in
the AF_PACKET input path that are shared.
Given the logic present in ieee80211_deliver_skb() I think
the mac80211 code doesn't expect this either.
More commentary from me below:
> This patch is adding a check on skb before trying to defrag the
> packet for the hash computation in fanout mode. The goal of this
> patch is to avoid an kernel crash in pskb_expand_head.
> It appears that under some specific condition there is a shared
> skb reaching the defrag code and this lead to a crash due to the
> following code:
>
> if (skb_shared(skb))
> BUG();
>
> I've observed this crash under the following condition:
> 1. a program is listening to an wifi interface (let say wlan0)
> 2. it is using fanout capture in flow load balancing mode
> 3. defrag option is on on the fanout socket
> 4. the interface disconnect (radio down for example)
> 5. the interface reconnect (radio switched up)
> 6. once reconnected a single packet is seen with skb->users=2
> 7. the kernel crash in pskb_expand_head at skbuff.c:1035
>
> [BBB55:744364] [<ffffffff812a2761>] ? __pskb_pull_tail+0x43x0x26f
> [BB8S5.744395] [<ffffffff812d29Tb>] ? ip_check_defrag+ox3a/0x14a
> [BBB55.744422] [<ffffffffB1344459>] ? packet_rcv_fanout+ox5e/oxf9
> [BBBS5.7444S0] [<ffffffffB12aaS9b>] ? __netif_receive_skb+ox444/ox4f9
> [BBB55.T4447B] [<ffffffffB12aa?e1>] ? netif_receive_skb+ox6d/0x?3
> [BBB55.T4447B] [<ffffffffB12aa?e1>] ? ieee80211_deliver_skb+0xbd/0xfa [mac80211]
> [BBB55.T4447B] [<ffffffffB12aa?e1>] ? ieee80211_rx_h_data+0x1e0/0x21a [mac80211]
> [BBB55.T4447B] [<ffffffffB12aa?e1>] ? ieee80211_rx_handlers+0x3d5/0x480 [mac80211]
> [BBB55.T4447B] [<ffffffffB12aa?e1>] ? __wake_up
> [BBB55.T4447B] [<ffffffffB12aa?e1>] ? evdev_eventr+0xc0/0xcf [evdev]
>
> Signed-off-by: Eric Leblond <eric@regit.org>
So if we look at ieee80211_deliver_skb(), it has code to deal with unaligned
packet headers, wherein it memoves() the data into a better aligned location.
But if these SKBs really are skb_shared(), this packet data
modification is illegal.
I suspect that the assumptions built into this unaligned data handling
code, and AF_PACKET, are correct. Meaning that we should never see
skb_shared() packets here. We just have a missing skb_copy()
somewhere in mac80211, Johannes can you please take a look?
Thanks.
^ permalink raw reply
* [GIT] Networking
From: David Miller @ 2012-12-07 20:35 UTC (permalink / raw)
To: torvalds; +Cc: akpm, netdev, linux-kernel
Two stragglers:
1) The new code that adds new flushing semantics to GRO can cause
SKB pointer list corruption, manage the lists differently to
avoid the OOPS. Fix from Eric Dumazet.
2) When TCP fast open does a retransmit of data in a SYN-ACK or
similar, we update retransmit state that we shouldn't triggering
a WARN_ON later. Fix from Yuchung Cheng.
Please pull, thanks a lot!
The following changes since commit 1afa471706963643ceeda7cbbe9c605a1e883d53:
Merge tag 'mmc-fixes-for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc (2012-12-07 09:15:20 -0800)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git master
for you to fetch changes up to c3c7c254b2e8cd99b0adf288c2a1bddacd7ba255:
net: gro: fix possible panic in skb_gro_receive() (2012-12-07 14:39:29 -0500)
----------------------------------------------------------------
Eric Dumazet (1):
net: gro: fix possible panic in skb_gro_receive()
Yuchung Cheng (1):
tcp: bug fix Fast Open client retransmission
include/linux/netdevice.h | 3 +++
include/net/tcp.h | 1 +
net/core/dev.c | 2 ++
net/core/skbuff.c | 6 +++---
net/ipv4/tcp_input.c | 6 +++++-
net/ipv4/tcp_output.c | 15 ++++++++++-----
6 files changed, 24 insertions(+), 9 deletions(-)
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox