Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [patch net-next RFC] netfilter: ip6_tables: use reasm skb for matching
From: Jiri Pirko @ 2013-10-30 14:13 UTC (permalink / raw)
  To: Florian Westphal
  Cc: netdev, davem, pablo, netfilter-devel, yoshfuji, kadlec, kaber,
	mleitner
In-Reply-To: <20131030134100.GD16615@breakpoint.cc>

Wed, Oct 30, 2013 at 02:41:00PM CET, fw@strlen.de wrote:
>Jiri Pirko <jiri@resnulli.us> wrote:
>> Currently, when ipv6 fragment goes through the netfilter, match
>> functions are called on them directly. This might cause match function
>> to fail. So benefit from the fact that nf_defrag_ipv6 constructs
>> reassembled skb for us and use this reassembled skb for matching.
>> 
>> This patch fixes for example following situation:
>> On HOSTA do:
>> ip6tables -I INPUT -p icmpv6 -j DROP
>> ip6tables -I INPUT -p icmpv6 -m icmp6 --icmpv6-type 128 -j ACCEPT
>> 
>> and on HOSTB you do:
>> ping6 HOSTA -s2000    (MTU is 1500)
>> 
>> Incoming echo requests will be filtered out on HOSTA. This issue does
>> not occur with smaller packets than MTU (where fragmentation does not happen).
>
>[..]
>> diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
>> index 44400c2..5421beb0 100644
>> --- a/net/ipv6/netfilter/ip6_tables.c
>> +++ b/net/ipv6/netfilter/ip6_tables.c
>> @@ -328,6 +328,7 @@ ip6t_do_table(struct sk_buff *skb,
>>  	const struct xt_table_info *private;
>>  	struct xt_action_param acpar;
>>  	unsigned int addend;
>> +	struct sk_buff *reasm = skb->nfct_reasm ? skb->nfct_reasm : skb;
>>  
>>  	/* Initialization */
>>  	indev = in ? in->name : nulldevname;
>> @@ -363,7 +364,7 @@ ip6t_do_table(struct sk_buff *skb,
>>  
>>  		IP_NF_ASSERT(e);
>>  		acpar.thoff = 0;
>> -		if (!ip6_packet_match(skb, indev, outdev, &e->ipv6,
>> +		if (!ip6_packet_match(reasm, indev, outdev, &e->ipv6,
>>  		    &acpar.thoff, &acpar.fragoff, &acpar.hotdrop)) {
>
>[..]
>
>This is a bit backwards, I think.
>- We gather frags
>- Then we invoke ip6t_do_table for each individual fragment
>
>So basically your patch is equivalent to
>for_each_frag( )
>  ip6t_do_table(reassembled_skb)
>
>Which makes no sense to me - why traverse the ruleset n times with the same
>packet?


Because each fragment need to be pushed through separately.

What different approach would you suggest?

Thanks

Jiri

^ permalink raw reply

* Re: [PATCH net-next 1/5] lib: crc32: clean up spacing in test cases
From: Daniel Borkmann @ 2013-10-30 14:15 UTC (permalink / raw)
  To: David Laight; +Cc: Joe Perches, davem, netdev, linux-sctp, linux-kernel
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6026B73C8@saturn3.aculab.com>

On 10/30/2013 03:10 PM, David Laight wrote:
>>> +	{0x674bf11d, 0x00000038, 0x00000542, 0x0af6d466, 0xd8b6e4c1, 0xf6e93d6c},
>>
>> these could be
>>
>> +	{0x674bf11d, 0x38, 0x542, 0x0af6d466, 0xd8b6e4c1, 0xf6e93d6c},
>
> Or even:
> #define X(a, b, c, d, e, f) {0x##a, 0x##b, 0x##c, 0x##d, 0x##e. 0x##f}
> 	X(674bf11d, 38, 542, 0af6d466, d8b6e4c1, f6e93d6c),
> ...
> #undef X

Sure, sounds good to me. We could do that as a follow-up.

> 	David
>
>
>

^ permalink raw reply

* Re: [PATCH 1/4 net-next] net: phy: add Generic Netlink Ethernet switch configuration API
From: Maxime Bizon @ 2013-10-30 14:28 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Felix Fietkau, Florian Fainelli, Neil Horman, John Fastabend,
	netdev, David Miller, Sascha Hauer, John Crispin, Jonas Gorski,
	Gary Thomas, Vlad Yasevich, Stephen Hemminger
In-Reply-To: <5270F282.6000708@mojatatu.com>


On Wed, 2013-10-30 at 07:50 -0400, Jamal Hadi Salim wrote:

> The important part is all the APIs stay consistent. I can use
> same netlink calls. ifconfig works.
> iproute2 works. People have written books on this stuff - we dont

these books usually start by telling people to assign IP address to
interfaces, not applicable here.

> If i can get stats by doing ifconfig - that should provide illusion that
> the netdevice is sending/receiving packets.

4 separated netdevices looks like 4 ethernet segments to me, and nothing
will prevent me from setting a different ip network on each device.

ENOTSUPP cannot be returned by ndo_start_xmit, the ability for a
netdevice to be able to receive/send packet from host is IMO
fundamental.

> This is a good arguement.
> Can we hear a little more about this?

see this kind of old threads:

http://rt2x00.serialmonkey.com/phpBB/viewtopic.php?f=5&t=4378
http://www.linuxquestions.org/questions/linux-networking-3/what-is-wmaster0-728708/
http://forums.debian.net/viewtopic.php?p=219440

> I think that would be a reasonable thing to do if it becomes
> necessary.

with rough naming:

- struct netdevice
- struct netdev_queue
- struct network_port (something to call ethtool on)
- struct bridge_dev (something you create/destroy vlan on, control FDB)
- struct bridge_port (something you set path cost on, ...)
- struct sw_bridge_dev (netdevice + underlying bridge_dev)
- struct sw_bridge_port (netdevice + underlying bridge_port)

old netdevice => (netdevice + netdev_queue * x + network_port)

ethtool works on netdevice or network_port

brctl addbr/addif creates sw_bridge_dev/sw_bridge_port, other commands
work on bridge_dev/bridge_port

drivers can register bridge_dev / bridge_port / network_port

simple case of a system with single ethernet mac & directly attached 4
ports switch:

netdevice: eth0
bridge_dev: hwbr0
bridge_port: hwbr0p0, hwbr0p1, hwbr0p2, hwbr0p3
network ports: eth0np0, hwbr0np0, hwbr0np1, hwbr0np2, hwbr0np3

ifconfig, ip link show only eth0
brctl show hwbr0
ethtool works on eth0 or eth0p0, hwbr0npX

-- 
Maxime

^ permalink raw reply

* Re: [PATCH net-next 5/5] net: sctp: fix and consolidate SCTP checksumming code
From: Vlad Yasevich @ 2013-10-30 14:29 UTC (permalink / raw)
  To: Daniel Borkmann, davem; +Cc: netdev, linux-sctp
In-Reply-To: <1383130252-1515-6-git-send-email-dborkman@redhat.com>

On 10/30/2013 06:50 AM, Daniel Borkmann wrote:
> This fixes an outstanding bug found through IPVS, where SCTP packets
> with skb->data_len > 0 (non-linearized) and empty frag_list, but data
> accumulated in frags[] member, are forwarded with incorrect checksum
> letting SCTP initial handshake fail on some systems. Linearizing each
> SCTP skb in IPVS to prevent that would not be a good solution as
> this leads to an additional and unnecessary performance penalty on
> the load-balancer itself for no good reason (as we actually only want
> to update the checksum, and can do that in a different/better way
> presented here).
>
> The actual problem is elsewhere, namely, that SCTP's checksumming
> in sctp_compute_cksum() does not take frags[] into account like
> skb_checksum() does. So while we are fixing this up, we better reuse
> the existing code that we have anyway in __skb_checksum() and use it
> for walking through the data doing checksumming. This will not only
> fix this issue, but also consolidates some SCTP code with core
> sk_buff code, bringing it closer together and removing respectively
> avoiding reimplementation of skb_checksum() for no good reason.
>
> As crc32c() can use hardware implementation within the crypto layer,
> we leave that intact (it wraps around / falls back to e.g. slice-by-8
> algorithm in __crc32c_le() otherwise); plus use the __crc32c_le_combine()
> combinator for crc32c blocks.
>
> Also, we remove all other SCTP checksumming code, so that we only
> have to use sctp_compute_cksum() from now on; for doing that, we need
> to transform SCTP checkumming in output path slightly, and can leave
> the rest intact.
>
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>


Daniel

Here is a follow-on idea that might help even more.
What if we put a pointer to skb_checksum_ops() in the skb
somewhere (I was thinking of skb_shinfo).  Then
skb_checksum can simply use the data from there.  This would
allow us to get rid of all the special cases in SCTP that do
checksumming.  We can just set it to partial, set up the right
fields and let HW or SW always do the right thing.

-vlad

> ---
>   include/net/sctp/checksum.h | 56 +++++++++++++++------------------------------
>   net/sctp/output.c           |  9 +-------
>   2 files changed, 20 insertions(+), 45 deletions(-)
>
> diff --git a/include/net/sctp/checksum.h b/include/net/sctp/checksum.h
> index 259924d..6bd44fe 100644
> --- a/include/net/sctp/checksum.h
> +++ b/include/net/sctp/checksum.h
> @@ -42,56 +42,38 @@
>   #include <linux/types.h>
>   #include <net/sctp/sctp.h>
>   #include <linux/crc32c.h>
> +#include <linux/crc32.h>
>
> -static inline __u32 sctp_crc32c(__u32 crc, u8 *buffer, u16 length)
> +static inline __wsum sctp_csum_update(const void *buff, int len, __wsum sum)
>   {
> -	return crc32c(crc, buffer, length);
> -}
> -
> -static inline __u32 sctp_start_cksum(__u8 *buffer, __u16 length)
> -{
> -	__u32 crc = ~(__u32)0;
> -	__u8  zero[sizeof(__u32)] = {0};
> -
> -	/* Optimize this routine to be SCTP specific, knowing how
> -	 * to skip the checksum field of the SCTP header.
> +	/* This uses the crypto implementation of crc32c, which is either
> +	 * implemented w/ hardware support or resolves to __crc32c_le().
>   	 */
> -
> -	/* Calculate CRC up to the checksum. */
> -	crc = sctp_crc32c(crc, buffer, sizeof(struct sctphdr) - sizeof(__u32));
> -
> -	/* Skip checksum field of the header. */
> -	crc = sctp_crc32c(crc, zero, sizeof(__u32));
> -
> -	/* Calculate the rest of the CRC. */
> -	crc = sctp_crc32c(crc, &buffer[sizeof(struct sctphdr)],
> -			    length - sizeof(struct sctphdr));
> -	return crc;
> -}
> -
> -static inline __u32 sctp_update_cksum(__u8 *buffer, __u16 length, __u32 crc32)
> -{
> -	return sctp_crc32c(crc32, buffer, length);
> +	return crc32c(sum, buff, len);
>   }
>
> -static inline __le32 sctp_end_cksum(__u32 crc32)
> +static inline __wsum sctp_csum_combine(__wsum csum, __wsum csum2,
> +				       int offset, int len)
>   {
> -	return cpu_to_le32(~crc32);
> +	return __crc32c_le_combine(csum, csum2, len);
>   }
>
> -/* Calculate the CRC32C checksum of an SCTP packet.  */
>   static inline __le32 sctp_compute_cksum(const struct sk_buff *skb,
>   					unsigned int offset)
>   {
> -	const struct sk_buff *iter;
> +	struct sctphdr *sh = sctp_hdr(skb);
> +        __le32 ret, old = sh->checksum;
> +	const struct skb_checksum_ops ops = {
> +		.update  = sctp_csum_update,
> +		.combine = sctp_csum_combine,
> +	};
>
> -	__u32 crc32 = sctp_start_cksum(skb->data + offset,
> -				       skb_headlen(skb) - offset);
> -	skb_walk_frags(skb, iter)
> -		crc32 = sctp_update_cksum((__u8 *) iter->data,
> -					  skb_headlen(iter), crc32);
> +	sh->checksum = 0;
> +	ret = cpu_to_le32(~__skb_checksum(skb, offset, skb->len - offset,
> +					  ~(__u32)0, &ops));
> +	sh->checksum = old;
>
> -	return sctp_end_cksum(crc32);
> +	return ret;
>   }
>
>   #endif /* __sctp_checksum_h__ */
> diff --git a/net/sctp/output.c b/net/sctp/output.c
> index 3191373..e650978 100644
> --- a/net/sctp/output.c
> +++ b/net/sctp/output.c
> @@ -390,7 +390,6 @@ int sctp_packet_transmit(struct sctp_packet *packet)
>   	__u8 has_data = 0;
>   	struct dst_entry *dst = tp->dst;
>   	unsigned char *auth = NULL;	/* pointer to auth in skb data */
> -	__u32 cksum_buf_len = sizeof(struct sctphdr);
>
>   	pr_debug("%s: packet:%p\n", __func__, packet);
>
> @@ -493,7 +492,6 @@ int sctp_packet_transmit(struct sctp_packet *packet)
>   		if (chunk == packet->auth)
>   			auth = skb_tail_pointer(nskb);
>
> -		cksum_buf_len += chunk->skb->len;
>   		memcpy(skb_put(nskb, chunk->skb->len),
>   			       chunk->skb->data, chunk->skb->len);
>
> @@ -538,12 +536,7 @@ int sctp_packet_transmit(struct sctp_packet *packet)
>   	if (!sctp_checksum_disable) {
>   		if (!(dst->dev->features & NETIF_F_SCTP_CSUM) ||
>   		    (dst_xfrm(dst) != NULL) || packet->ipfragok) {
> -			__u32 crc32 = sctp_start_cksum((__u8 *)sh, cksum_buf_len);
> -
> -			/* 3) Put the resultant value into the checksum field in the
> -			 *    common header, and leave the rest of the bits unchanged.
> -			 */
> -			sh->checksum = sctp_end_cksum(crc32);
> +			sh->checksum = sctp_compute_cksum(nskb, 0);
>   		} else {
>   			/* no need to seed pseudo checksum for SCTP */
>   			nskb->ip_summed = CHECKSUM_PARTIAL;
>

^ permalink raw reply

* Re: [PATCH net-next 0/5] SCTP fix/updates
From: Vlad Yasevich @ 2013-10-30 14:29 UTC (permalink / raw)
  To: Daniel Borkmann, davem; +Cc: netdev, linux-sctp
In-Reply-To: <1383130252-1515-1-git-send-email-dborkman@redhat.com>

On 10/30/2013 06:50 AM, Daniel Borkmann wrote:
> Please see patch 5 for the main description/motivation, the rest just
> brings in the needed functionality for that. Although this is actually
> a fix, I've based it against net-next as some additional work for
> fixing it was needed.
>
> Thanks!
>
> Daniel Borkmann (5):
>    lib: crc32: clean up spacing in test cases
>    lib: crc32: add functionality to combine two crc32{,c}s in GF(2)
>    lib: crc32: add test cases for crc32{,c}_combine routines
>    net: skb_checksum: allow custom update/combine for walking skb
>    net: sctp: fix and consolidate SCTP checksumming code
>
>   include/linux/crc32.h       |  40 ++++
>   include/linux/skbuff.h      |  13 +-
>   include/net/checksum.h      |   6 +
>   include/net/sctp/checksum.h |  56 ++----
>   lib/crc32.c                 | 453 +++++++++++++++++++++++++-------------------
>   net/core/skbuff.c           |  31 ++-
>   net/sctp/output.c           |   9 +-
>   7 files changed, 350 insertions(+), 258 deletions(-)
>

Series
Acked-by: Vlad Yasevich <vyasevich@gmail.com>

-vlad

^ permalink raw reply

* Re: [patch net-next RFC] netfilter: ip6_tables: use reasm skb for matching
From: Florian Westphal @ 2013-10-30 14:44 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Florian Westphal, netdev, davem, pablo, netfilter-devel, yoshfuji,
	kadlec, kaber, mleitner
In-Reply-To: <20131030141354.GB1456@minipsycho.orion>

Jiri Pirko <jiri@resnulli.us> wrote:
> >This is a bit backwards, I think.
> >- We gather frags
> >- Then we invoke ip6t_do_table for each individual fragment
> >
> >So basically your patch is equivalent to
> >for_each_frag( )
> >  ip6t_do_table(reassembled_skb)
> >
> >Which makes no sense to me - why traverse the ruleset n times with the same
> >packet?
> 
> Because each fragment need to be pushed through separately.

Why?  AFAIU we only need to ensure that (in forwarding case) we
send out the original fragments instead of the reassembled packet.

> What different approach would you suggest?

I am sure that current behaviour is intentional, so I'd first like to
understand WHY this was implemented this way.

Also, this would change very long standing behaviour so one might argue that
this is a no-go anyway.

What is the exact problem that this is supposed to solve?

^ permalink raw reply

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
From: Neil Horman @ 2013-10-30 14:52 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Ingo Molnar, Eric Dumazet, linux-kernel, netdev, David Laight
In-Reply-To: <52710B09.6090302@redhat.com>

On Wed, Oct 30, 2013 at 09:35:05AM -0400, Doug Ledford wrote:
> On 10/30/2013 07:02 AM, Neil Horman wrote:
> 
> >That does makes sense, but it then begs the question, whats the advantage of
> >having multiple alu's at all?
> 
> There's lots of ALU operations that don't operate on the flags or
> other entities that can be run in parallel.
> 
> >If they're just going to serialize on the
> >updating of the condition register, there doesn't seem to be much advantage in
> >having multiple alu's at all, especially if a common use case (parallelizing an
> >operation on a large linear dataset) resulted in lower performance.
> >
> >/me wonders if rearranging the instructions into this order:
> >adcq 0*8(src), res1
> >adcq 1*8(src), res2
> >adcq 2*8(src), res1
> >
> >would prevent pipeline stalls.  That would be interesting data, and (I think)
> >support your theory, Doug.  I'll give that a try
> 
> Just to avoid spending too much time on various combinations, here
> are the methods I've tried:
> 
> Original code
> 2 chains doing interleaved memory accesses
> 2 chains doing serial memory accesses (as above)
> 4 chains doing serial memory accesses
> 4 chains using 32bit values in 64bit registers so you can always use
> add instead of adc and never need the carry flag
> 
> And I've done all of the above with simple prefetch and smart prefetch.
> 
Yup, I just tried the 2 chains doing interleaved access and came up with the
same results for both prefetch cases.

> In all cases, the result is basically that the add method doesn't
> matter much in the grand scheme of things, but the prefetch does,
> and smart prefetch always beat simple prefetch.
> 
> My simple prefetch was to just go into the main while() loop for the
> csum operation and always prefetch 5*64 into the future.
> 
> My smart prefetch looks like this:
> 
> static inline void prefetch_line(unsigned long *cur_line,
>                                  unsigned long *end_line,
>                                  size_t size)
> {
>         size_t fetched = 0;
> 
>         while (*cur_line <= *end_line && fetched < size) {
>                 prefetch((void *)*cur_line);
>                 *cur_line += cache_line_size();
>                 fetched += cache_line_size();
>         }
> }
> 
I've done this too, but I've come up with results that are very close to simple
prefetch.

> I was going to tinker today and tomorrow with this function once I
> get a toolchain that will compile it (I reinstalled all my rhel6
> hosts as f20 and I'm hoping that does the trick, if not I need to do
> more work):
> 
> #define ADCXQ_64                                        \
>         asm("xorq %[res1],%[res1]\n\t"                  \
>             "adcxq 0*8(%[src]),%[res1]\n\t"             \
>             "adoxq 1*8(%[src]),%[res2]\n\t"             \
>             "adcxq 2*8(%[src]),%[res1]\n\t"             \
>             "adoxq 3*8(%[src]),%[res2]\n\t"             \
>             "adcxq 4*8(%[src]),%[res1]\n\t"             \
>             "adoxq 5*8(%[src]),%[res2]\n\t"             \
>             "adcxq 6*8(%[src]),%[res1]\n\t"             \
>             "adoxq 7*8(%[src]),%[res2]\n\t"             \
>             "adcxq %[zero],%[res1]\n\t"                 \
>             "adoxq %[zero],%[res2]\n\t"                 \
>             : [res1] "=r" (result1),                    \
>               [res2] "=r" (result2)                     \
>             : [src] "r" (buff), [zero] "r" (zero),      \
>               "[res1]" (result1), "[res2]" (result2))
> 
I've tried using this method also (HPA suggested it early in the thread, but its
not going to be usefull for awhile.  The compiler supports it already, but
theres not hardware available with support for these instructions yet (at least
not that I have available).

Neil

^ permalink raw reply

* [PATCH net-next] net: extend net_device allocation to vmalloc()
From: Eric Dumazet @ 2013-10-30 15:01 UTC (permalink / raw)
  To: Joby Poriyath, David Miller
  Cc: netdev, wei.liu2, ian.campbell, xen-devel, andrew.bennieston,
	david.vrabel, malcolm.crossley
In-Reply-To: <1383141242.4857.36.camel@edumazet-glaptop.roam.corp.google.com>

From: Eric Dumazet <edumazet@google.com>

Joby Poriyath provided a xen-netback patch to reduce the size of
xenvif structure as some netdev allocation could fail under
memory pressure/fragmentation.

This patch is handling the problem at the core level, allowing
any netdev structures to use vmalloc() if kmalloc() failed.

As vmalloc() adds overhead on a critical network path, add __GFP_REPEAT
to kzalloc() flags to do this fallback only when really needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Joby Poriyath <joby.poriyath@citrix.com>
---
 Documentation/networking/netdevices.txt |    7 ++++---
 include/linux/netdevice.h               |    1 +
 net/core/dev.c                          |   22 +++++++++++++++++-----
 net/core/net-sysfs.c                    |    2 +-
 4 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/Documentation/networking/netdevices.txt b/Documentation/networking/netdevices.txt
index c7ecc70..df7cd19 100644
--- a/Documentation/networking/netdevices.txt
+++ b/Documentation/networking/netdevices.txt
@@ -10,9 +10,10 @@ network devices.
 struct net_device allocation rules
 ==================================
 Network device structures need to persist even after module is unloaded and
-must be allocated with kmalloc.  If device has registered successfully,
-it will be freed on last use by free_netdev.  This is required to handle the
-pathologic case cleanly (example: rmmod mydriver </sys/class/net/myeth/mtu )
+must be allocated with kmalloc() or vmalloc(). If device has registered
+successfully, it will be freed on last use by free_netdev.
+This is required to handle the pathologic case cleanly
+(example: rmmod mydriver </sys/class/net/myeth/mtu )
 
 There are routines in net_init.c to handle the common cases of
 alloc_etherdev, alloc_netdev.  These reserve extra space for driver
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cb1d918..e6353ca 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1800,6 +1800,7 @@ static inline void unregister_netdevice(struct net_device *dev)
 
 int netdev_refcnt_read(const struct net_device *dev);
 void free_netdev(struct net_device *dev);
+void netdev_freemem(struct net_device *dev);
 void synchronize_net(void);
 int init_dummy_netdev(struct net_device *dev);
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 0054c8c..0e61365 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6196,6 +6196,16 @@ void netdev_set_default_ethtool_ops(struct net_device *dev,
 }
 EXPORT_SYMBOL_GPL(netdev_set_default_ethtool_ops);
 
+void netdev_freemem(struct net_device *dev)
+{
+	char *addr = (char *)dev - dev->padded;
+
+	if (is_vmalloc_addr(addr))
+		vfree(addr);
+	else
+		kfree(addr);
+}
+
 /**
  *	alloc_netdev_mqs - allocate network device
  *	@sizeof_priv:	size of private data to allocate space for
@@ -6239,7 +6249,9 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	/* ensure 32-byte alignment of whole construct */
 	alloc_size += NETDEV_ALIGN - 1;
 
-	p = kzalloc(alloc_size, GFP_KERNEL);
+	p = kzalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+	if (!p)
+		p = vzalloc(alloc_size);
 	if (!p)
 		return NULL;
 
@@ -6248,7 +6260,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 
 	dev->pcpu_refcnt = alloc_percpu(int);
 	if (!dev->pcpu_refcnt)
-		goto free_p;
+		goto free_dev;
 
 	if (dev_addr_init(dev))
 		goto free_pcpu;
@@ -6301,8 +6313,8 @@ free_pcpu:
 	kfree(dev->_rx);
 #endif
 
-free_p:
-	kfree(p);
+free_dev:
+	netdev_freemem(dev);
 	return NULL;
 }
 EXPORT_SYMBOL(alloc_netdev_mqs);
@@ -6339,7 +6351,7 @@ void free_netdev(struct net_device *dev)
 
 	/*  Compatibility with error handling in drivers */
 	if (dev->reg_state == NETREG_UNINITIALIZED) {
-		kfree((char *)dev - dev->padded);
+		netdev_freemem(dev);
 		return;
 	}
 
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index d954b56..d03f2c9 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1263,7 +1263,7 @@ static void netdev_release(struct device *d)
 	BUG_ON(dev->reg_state != NETREG_RELEASED);
 
 	kfree(dev->ifalias);
-	kfree((char *)dev - dev->padded);
+	netdev_freemem(dev);
 }
 
 static const void *net_namespace(struct device *d)

^ permalink raw reply related

* Re: [PATCH net-next 1/5] lib: crc32: clean up spacing in test cases
From: Joe Perches @ 2013-10-30 15:14 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: David Laight, davem, netdev, linux-sctp, linux-kernel
In-Reply-To: <52711481.8030003@redhat.com>

On Wed, 2013-10-30 at 15:15 +0100, Daniel Borkmann wrote:
> On 10/30/2013 03:10 PM, David Laight wrote:
> >>> +	{0x674bf11d, 0x00000038, 0x00000542, 0x0af6d466, 0xd8b6e4c1, 0xf6e93d6c},
> >> these could be
> >> +	{0x674bf11d, 0x38, 0x542, 0x0af6d466, 0xd8b6e4c1, 0xf6e93d6c},
> > Or even:
> > #define X(a, b, c, d, e, f) {0x##a, 0x##b, 0x##c, 0x##d, 0x##e. 0x##f}
> > 	X(674bf11d, 38, 542, 0af6d466, d8b6e4c1, f6e93d6c),
> > ...
> > #undef X
> 
> Sure, sounds good to me. We could do that as a follow-up.

I personally don't care for that sort of token-pasting macro
but, <shrug> if you want, I've no real objection either.

^ permalink raw reply

* Re: [PATCH] route: Exporting ip6_route_add() so that Bluetooth 6LoWPAN can use it
From: David Miller @ 2013-10-30 17:05 UTC (permalink / raw)
  To: jukka.rissanen; +Cc: netdev
In-Reply-To: <1383124637-15533-1-git-send-email-jukka.rissanen@linux.intel.com>

From: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Date: Wed, 30 Oct 2013 11:17:17 +0200

> Signed-off-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>

Sorry, no.

There is no reason that external subsystems need to add ipv6 routes
using direct calls to these functions.

^ permalink raw reply

* [patch v2] libertas: potential oops in debugfs
From: Dan Carpenter @ 2013-10-30 17:12 UTC (permalink / raw)
  To: John W. Linville
  Cc: libertas-dev, linux-wireless, netdev, linux-kernel,
	kernel-janitors
In-Reply-To: <20131025144452.GA28451@ngolde.de>

If we do a zero size allocation then it will oops.  Also we can't be
sure the user passes us a NUL terminated string so I've added a
terminator.

This code can only be triggered by root.

Reported-by: Nico Golde <nico@ngolde.de>
Reported-by: Fabian Yamaguchi <fabs@goesec.de>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

diff --git a/drivers/net/wireless/libertas/debugfs.c b/drivers/net/wireless/libertas/debugfs.c
index 668dd27..1917348 100644
--- a/drivers/net/wireless/libertas/debugfs.c
+++ b/drivers/net/wireless/libertas/debugfs.c
@@ -913,7 +913,10 @@ static ssize_t lbs_debugfs_write(struct file *f, const char __user *buf,
 	char *p2;
 	struct debug_data *d = f->private_data;
 
-	pdata = kmalloc(cnt, GFP_KERNEL);
+	if (cnt == 0)
+		return 0;
+
+	pdata = kmalloc(cnt + 1, GFP_KERNEL);
 	if (pdata == NULL)
 		return 0;
 
@@ -922,6 +925,7 @@ static ssize_t lbs_debugfs_write(struct file *f, const char __user *buf,
 		kfree(pdata);
 		return 0;
 	}
+	pdata[cnt] = '\0';
 
 	p0 = pdata;
 	for (i = 0; i < num_of_items; i++) {

^ permalink raw reply related

* [net-next PATCH] net: codel: Avoid undefined behavior from signed overflow
From: Jesper Dangaard Brouer @ 2013-10-30 17:23 UTC (permalink / raw)
  To: netdev; +Cc: Eric Dumazet, Paul E. McKenney, Dave Taht, Jesper Dangaard Brouer

From: Jesper Dangaard Brouer <netoptimizer@brouer.com>

As described in commit 5a581b367 (jiffies: Avoid undefined
behavior from signed overflow), according to the C standard
3.4.3p3, overflow of a signed integer results in undefined
behavior.

To fix this, do as the above commit, and do an unsigned
subtraction, and interpreting the result as a signed
two's-complement number.  This is based on the theory from
RFC 1982 and is nicely described in wikipedia here:
 https://en.wikipedia.org/wiki/Serial_number_arithmetic#General_Solution

A side-note, I have seen practical issues with the previous logic
when dealing with 16-bit, on a 64-bit machine (gcc version
4.4.5). This were 32-bit, which I have not observed issues with.

Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Jesper Dangaard Brouer <netoptimizer@brouer.com>
---

 include/net/codel.h |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/net/codel.h b/include/net/codel.h
index 389cf62..700fcdf 100644
--- a/include/net/codel.h
+++ b/include/net/codel.h
@@ -72,10 +72,10 @@ static inline codel_time_t codel_get_time(void)
 	return ns >> CODEL_SHIFT;
 }

-#define codel_time_after(a, b)		((s32)(a) - (s32)(b) > 0)
-#define codel_time_after_eq(a, b)	((s32)(a) - (s32)(b) >= 0)
-#define codel_time_before(a, b)		((s32)(a) - (s32)(b) < 0)
-#define codel_time_before_eq(a, b)	((s32)(a) - (s32)(b) <= 0)
+#define codel_time_after(a, b)		((s32)((a) - (b)) > 0)
+#define codel_time_after_eq(a, b)	((s32)((a) - (b)) >= 0)
+#define codel_time_before(a, b) 	((s32)((a) - (b)) < 0)
+#define codel_time_before_eq(a, b)	((s32)((a) - (b)) <= 0)

 /* Qdiscs using codel plugin must use codel_skb_cb in their own cb[] */
 struct codel_skb_cb {

^ permalink raw reply related

* Re: [PATCH 1/4 net-next] net: phy: add Generic Netlink Ethernet switch configuration API
From: Lennert Buytenhek @ 2013-10-30 17:34 UTC (permalink / raw)
  To: Jamal Hadi Salim, Felix Fietkau
  Cc: Florian Fainelli, Neil Horman, John Fastabend, netdev,
	David Miller, Sascha Hauer, John Crispin, Jonas Gorski,
	Gary Thomas, Vlad Yasevich, Stephen Hemminger, Chris Healy
In-Reply-To: <20131030172756.GM8570@wantstofly.org>

On Wed, Oct 30, 2013 at 06:27:56PM +0100, Lennert Buytenhek wrote:

> > >This means that all per-port netdevs will be dummy ports which don't
> > >include the data path.
> 
> And I think that's fine.
> 
> Look, even if you're not going to address data traffic to individual
> ports on your switch chip, there's still a plethora of per-port
> operations that you want to be able to do: administratively setting
> the link state on ports up and down, controlling autonegotiation and
> other PHY settings on individual ports, etc.
> 
> You can either let the administrator do this with the standard ifconfig
> / ip link / ethtool tools, or you can make up a parallel API and
> corresponding set of userland tools to duplicate most of the existing
> functionality -- I know which option I prefer.
> 
> Presenting each switch port as an individual Linux netdevice to the OS
> is an orthogonal decision to actually using those netdevices for data
> traffic, and conflating the two by arguing that you need special tools
> to do per-port operations for the sole reason that your switch chip
> cannot address individual ports is a rather confused argument.

Forgot to add: there's a patch for net/dsa that adds exactly such an
option.  We called it 'unmanaged' mode, and it doesn't enable packet
tagging on the CPU<->switch chip interface, so that data only ever
flows over a single network interface ("eth0"), while the other
("dummy") network interfaces ("port1", "port2", etc) are used for
setting link state with ip link, setting PHY settings with ethtool,
getting ethtool statistics, etc, with 100% unmodified userland tools.
This patch is currently buried inside a vendor tree, but I'd be happy
to dig it out and submit it.

^ permalink raw reply

* Re: [PATCH 1/4 net-next] net: phy: add Generic Netlink Ethernet switch configuration API
From: Lennert Buytenhek @ 2013-10-30 17:27 UTC (permalink / raw)
  To: Jamal Hadi Salim, Felix Fietkau
  Cc: Florian Fainelli, Neil Horman, John Fastabend, netdev,
	David Miller, Sascha Hauer, John Crispin, Jonas Gorski,
	Gary Thomas, Vlad Yasevich, Stephen Hemminger, Chris Healy
In-Reply-To: <526EEAE9.6090508@mojatatu.com>

I didn't follow the rest of this thread, but..

On Mon, Oct 28, 2013 at 06:53:29PM -0400, Jamal Hadi Salim wrote:

> >That question does not make any sense to me. Aside from low level
> >control frames like pause frames for flow control, the switch has no
> >need to send packets to the CPU port on its own.

..a lot of people want to be able to do Spanning Tree, LLDP, 802.1x,
you name it, on their routers and access points, and that requires
that your CPU can send/receive packets to/from individual ports on
your switch chip.  In a lot of markets, your product is a non-starter
if it can't provide any or all of the above.  Excluding this entire
class of use cases _by software design_ is somewhat myopic and stupid.

(It's a different thing if your switch chip is dumb and can't actually
address individual ports, but then there's still no reason to impose
the same restrictions on your software design.)

> >DSA does this, and last time I looked, it pushes *all* bridge traffic
> >through the CPU, making it completely unusable for slower embedded CPUs.
>
> [...]
>
> >If I remember correctly, adding support 'bridge acceleration' was left
> >as an exercise for the reader and never actually implemented.

This patch does exactly that:

	http://patchwork.ozlabs.org/patch/16578/

This patch is in production use in a couple of million DSL gateways,
as well as in a bunch of airplane in-flight entertainment systems, so
by all means I would say that it works rather well.

If there is renewed interest in having such functionality upstream,
I would be happy to update the patch and submit it for inclusion.

> >Sure, this could be fixed somehow, but even then the model and
> >assumptions that DSA is built on simply don't work for some of the
> >dumber switches that we support.

What model and assumptions would those be?

> >One of the currently very common switches in many embedded devices is
> >the RTL8366/RTL8367. It has some flexibility when it comes to
> >configuring VLANs, and it's one of the few ones where you can configure
> >a forwarding table for a VLAN (which spans multiple ports), which allows
> >software bridging between multiple VLANs.
> >However, what this switch does *not* support is adding a header/trailer
> >to packets to indicate the originating port.

The ingress/egress port doesn't _have_ to be conveyed in the data
packets themselves.

>From a quick look at the RTL8366 datasheet, you can control the
egress port by creating a temporary MAC address table entry (this
seems to work both for unicast and multicast packets).

Admittedly, I didn't have a very thorough look at the datasheet,
but it also mentions the Spanning Tree protocol, and contains this
remark related to receiving BPDUs: "The CPU port should carry the
ingress port number of the receiving BPDU.".  If this switch chip
can't do per-port addressing, how can it actually ever speak STP
at all?  Is the datasheet just lying about this?

> >This means that all per-port netdevs will be dummy ports which don't
> >include the data path.

And I think that's fine.

Look, even if you're not going to address data traffic to individual
ports on your switch chip, there's still a plethora of per-port
operations that you want to be able to do: administratively setting
the link state on ports up and down, controlling autonegotiation and
other PHY settings on individual ports, etc.

You can either let the administrator do this with the standard ifconfig
/ ip link / ethtool tools, or you can make up a parallel API and
corresponding set of userland tools to duplicate most of the existing
functionality -- I know which option I prefer.

Presenting each switch port as an individual Linux netdevice to the OS
is an orthogonal decision to actually using those netdevices for data
traffic, and conflating the two by arguing that you need special tools
to do per-port operations for the sole reason that your switch chip
cannot address individual ports is a rather confused argument.

> My view is that netdevs are still valuable even if only they get
> used for control path. Like you said earlier - you can still pull
> stats, flow control messages still make it through etc. They provide
> you
> the consistent api to configure the switch above, ex:
> If i was to use the FDB api for this switch as long as i can
> abstract it in software as a bridge, I could send it a switch config
> via its ops which says:
> "I am giving you this entry with vland 400 for port 2, but i want you to
> send it to the hardware not to your local entry"

Fully agreed on this.

> >So let's say you have a configuration where you're using VLAN ID 4 on
> >port 1, and you want to bridge it to VLAN ID 400 on port 2.
> >
> >Sounds easy enough, you can easily create a bridge that spans port1.4
> >and port2.400. Except, this particular switch (like pretty much any
> >other switch supported by swconfig) isn't actually able to handle such a
> >configuration on its own.

Neither can DSA switch chips.

You can always find things that Linux can do that your switch chip
cannot (e.g. stateful firewalling between ports), and that isn't much
of an argument for or against anything.

> >With swconfig, you create two VLANs: VLAN 4, containing CPU and port1;
> >VLAN 400, containing CPU and port2. You then create a software bridge
> >between eth0.4 and eth0.400 (assuming eth0 is the NIC connected to the
> >switch).

With DSA, you would bridge between port1.4 and port2.400.  I'm still
not sure what your argument is arguing for or against.

> >In a different scenario, the code would also have to detect
> >configurations that the switch isn't able to handle, e.g.: bridging
> >port1.4 to eth1 and port2.4 to eth2.
> >Such a configuration wouldn't work at all with such a switch, because
> >the CPU isn't able to tell apart traffic from port1 and port2, and
> >there's no way to tell the switch that port1.4 and port2.4 should not be
> >connected to each other, but both should go to the CPU.

And it's quite easy to detect what your switch chip can do and offload
that part to the hardware, and keep doing the rest in software.

> >Trying to make all of these cases work in the code will make the whole
> >thing a lot more difficult to deal with and maintain. It will also make
> >it much harder for the user to figure out, what configurations work, and
> >what configurations don't.

It's actually quite easy, and certainly a lot less total effort than
forcing all of your users to learn a new set of userland tools (unless
you're not aiming to ever have a lot of users, that is..).

thanks,
Lennert

^ permalink raw reply

* Re: [PATCH v2 net-next] net: introduce gro_frag_list_enable sysctl
From: Jerry Chu @ 2013-10-30 17:39 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Eric Dumazet, David Miller, Christoph Paasch,
	netdev@vger.kernel.org, Michael Dalton
In-Reply-To: <20131030044212.GA3364@gondor.apana.org.au>

On Tue, Oct 29, 2013 at 9:42 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Tue, Oct 29, 2013 at 09:34:41PM -0700, Eric Dumazet wrote:
> >
> > What matters ?
> >
> > GRO ?
>
> What matters is that you should not treat the forwarding case
> separately from the host case.
>
> For virtualisation the host case looks exactly like the forwarding
> case.

Not sure I agree - there are two different "forwarding" cases - forwarding
to another physical NIC (to go out to the wire hence need to do GSO),
and (for virtualization) forwarding to a virtual NIC and consumed internally
(e.g., VM). For the latter we should strive to push GSO pkts all the way
to the VM stack w/o breaking them up. So for virtualization GRO is all
goodness but not sure about the regular forwarding path. (From the
perf perspective it boils down to if the cost of GSO/GRO will offset
the benefit of GRO. Sure if one manages to get the cost close to zero
than there is not reason to leave GRO always on. But it's still a big if for
now.)

Best,

Jerry

>
>
> IOW, if having a 64KB packet matters for the host, then it matters
> for forwarding as well.
>
> > Before my patch, GRO packets were 17 MSS, and nobody complained packets
> > were too small, so what are you saying exactly ?
>
> I'm not criticsing your mega-GRO patch at all.  That one is great
> and means that we'll get aggregated packets up to 64K.  What we need
> to do is just to patch up the GSO code so that it can handle these
> mega-packets properly.
>
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 1/4 net-next] net: phy: add Generic Netlink Ethernet switch configuration API
From: John Fastabend @ 2013-10-30 17:56 UTC (permalink / raw)
  To: Lennert Buytenhek
  Cc: Jamal Hadi Salim, Felix Fietkau, Florian Fainelli, Neil Horman,
	netdev, David Miller, Sascha Hauer, John Crispin, Jonas Gorski,
	Gary Thomas, Vlad Yasevich, Stephen Hemminger, Chris Healy
In-Reply-To: <20131030173416.GA32234@wantstofly.org>

On 10/30/2013 10:34 AM, Lennert Buytenhek wrote:
> On Wed, Oct 30, 2013 at 06:27:56PM +0100, Lennert Buytenhek wrote:
>
>>>> This means that all per-port netdevs will be dummy ports which don't
>>>> include the data path.
>>
>> And I think that's fine.
>>
>> Look, even if you're not going to address data traffic to individual
>> ports on your switch chip, there's still a plethora of per-port
>> operations that you want to be able to do: administratively setting
>> the link state on ports up and down, controlling autonegotiation and
>> other PHY settings on individual ports, etc.
>>
>> You can either let the administrator do this with the standard ifconfig
>> / ip link / ethtool tools, or you can make up a parallel API and
>> corresponding set of userland tools to duplicate most of the existing
>> functionality -- I know which option I prefer.
>>
>> Presenting each switch port as an individual Linux netdevice to the OS
>> is an orthogonal decision to actually using those netdevices for data
>> traffic, and conflating the two by arguing that you need special tools
>> to do per-port operations for the sole reason that your switch chip
>> cannot address individual ports is a rather confused argument.
>
> Forgot to add: there's a patch for net/dsa that adds exactly such an
> option.  We called it 'unmanaged' mode, and it doesn't enable packet
> tagging on the CPU<->switch chip interface, so that data only ever
> flows over a single network interface ("eth0"), while the other
> ("dummy") network interfaces ("port1", "port2", etc) are used for
> setting link state with ip link, setting PHY settings with ethtool,
> getting ethtool statistics, etc, with 100% unmodified userland tools.
> This patch is currently buried inside a vendor tree, but I'd be happy
> to dig it out and submit it.
>

A "dummy" network interface is something I've been thinking about
for SR-IOV as well. In the SR-IOV case we have an embedded bridge
in the hardware but the virtual functions may be direct assigned
to a guest and not visible to the host.

It would be easier to manage the ports and assign them to different
bridge/QOS objects (OVS, bridge, nftables) if the ports were visible
and manageable in the host even though there is no data path. Today
we special ndo ops that only work for VFs but this is a bit clumsy
and gets more clumsy as the nic switch becomes more like a real switch.

.John

^ permalink raw reply

* Re: [PATCH 1/4 net-next] net: phy: add Generic Netlink Ethernet switch configuration API
From: John Fastabend @ 2013-10-30 17:56 UTC (permalink / raw)
  To: Lennert Buytenhek
  Cc: Jamal Hadi Salim, Felix Fietkau, Florian Fainelli, Neil Horman,
	netdev, David Miller, Sascha Hauer, John Crispin, Jonas Gorski,
	Gary Thomas, Vlad Yasevich, Stephen Hemminger, Chris Healy
In-Reply-To: <20131030173416.GA32234@wantstofly.org>

On 10/30/2013 10:34 AM, Lennert Buytenhek wrote:
> On Wed, Oct 30, 2013 at 06:27:56PM +0100, Lennert Buytenhek wrote:
>
>>>> This means that all per-port netdevs will be dummy ports which don't
>>>> include the data path.
>>
>> And I think that's fine.
>>
>> Look, even if you're not going to address data traffic to individual
>> ports on your switch chip, there's still a plethora of per-port
>> operations that you want to be able to do: administratively setting
>> the link state on ports up and down, controlling autonegotiation and
>> other PHY settings on individual ports, etc.
>>
>> You can either let the administrator do this with the standard ifconfig
>> / ip link / ethtool tools, or you can make up a parallel API and
>> corresponding set of userland tools to duplicate most of the existing
>> functionality -- I know which option I prefer.
>>
>> Presenting each switch port as an individual Linux netdevice to the OS
>> is an orthogonal decision to actually using those netdevices for data
>> traffic, and conflating the two by arguing that you need special tools
>> to do per-port operations for the sole reason that your switch chip
>> cannot address individual ports is a rather confused argument.
>
> Forgot to add: there's a patch for net/dsa that adds exactly such an
> option.  We called it 'unmanaged' mode, and it doesn't enable packet
> tagging on the CPU<->switch chip interface, so that data only ever
> flows over a single network interface ("eth0"), while the other
> ("dummy") network interfaces ("port1", "port2", etc) are used for
> setting link state with ip link, setting PHY settings with ethtool,
> getting ethtool statistics, etc, with 100% unmodified userland tools.
> This patch is currently buried inside a vendor tree, but I'd be happy
> to dig it out and submit it.
>

A "dummy" network interface is something I've been thinking about
for SR-IOV nics as well. In the SR-IOV case we have an embedded bridge
in the hardware but the virtual functions may be direct assigned
to a guest and not visible to the host.

It would be easier to manage the ports and assign them to different
bridge/QOS objects (OVS, bridge, nftables) if the ports were visible
and manageable in the host even though there is no data path. Today
we special ndo ops that only work for VFs but this is a bit clumsy
and gets more clumsy as the nic switch becomes more like a real switch.

.John

^ permalink raw reply

* Re: [net-next PATCH] net: codel: Avoid undefined behavior from signed overflow
From: Eric Dumazet @ 2013-10-30 18:01 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: netdev, Paul E. McKenney, Dave Taht
In-Reply-To: <20131030172341.19203.93490.stgit@dragon>

On Wed, 2013-10-30 at 18:23 +0100, Jesper Dangaard Brouer wrote:
> From: Jesper Dangaard Brouer <netoptimizer@brouer.com>
> 
> As described in commit 5a581b367 (jiffies: Avoid undefined
> behavior from signed overflow), according to the C standard
> 3.4.3p3, overflow of a signed integer results in undefined
> behavior.
> 
> To fix this, do as the above commit, and do an unsigned
> subtraction, and interpreting the result as a signed
> two's-complement number.  This is based on the theory from
> RFC 1982 and is nicely described in wikipedia here:
>  https://en.wikipedia.org/wiki/Serial_number_arithmetic#General_Solution
> 
> A side-note, I have seen practical issues with the previous logic
> when dealing with 16-bit, on a 64-bit machine (gcc version
> 4.4.5). This were 32-bit, which I have not observed issues with.
> 
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Signed-off-by: Jesper Dangaard Brouer <netoptimizer@brouer.com>
> ---
> 
>  include/net/codel.h |    8 ++++----
>  1 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/include/net/codel.h b/include/net/codel.h
> index 389cf62..700fcdf 100644
> --- a/include/net/codel.h
> +++ b/include/net/codel.h
> @@ -72,10 +72,10 @@ static inline codel_time_t codel_get_time(void)
>  	return ns >> CODEL_SHIFT;
>  }
>  
> -#define codel_time_after(a, b)		((s32)(a) - (s32)(b) > 0)
> -#define codel_time_after_eq(a, b)	((s32)(a) - (s32)(b) >= 0)
> -#define codel_time_before(a, b)		((s32)(a) - (s32)(b) < 0)
> -#define codel_time_before_eq(a, b)	((s32)(a) - (s32)(b) <= 0)
> +#define codel_time_after(a, b)		((s32)((a) - (b)) > 0)
> +#define codel_time_after_eq(a, b)	((s32)((a) - (b)) >= 0)
> +#define codel_time_before(a, b) 	((s32)((a) - (b)) < 0)
> +#define codel_time_before_eq(a, b)	((s32)((a) - (b)) <= 0)
>  

I see nothing enforcing an unsigned subtraction as claimed in your
changelog.

a / b could be signed.

Paul commit 5a581b367b5 was OK because of existing typecheck(unsigned
long, ....)

^ permalink raw reply

* Re: [BUG 3.12.rc4] Oops: unable to handle kernel paging request during shutdown
From: Pablo Neira Ayuso @ 2013-10-30 18:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Patrick McHardy, Jozsef Kadlecsik, David Miller,
	Knut Petersen, Ingo Molnar, Paul McKenney,
	Frédéric Weisbecker, Greg KH, linux-kernel,
	Network Development, netfilter-devel
In-Reply-To: <CA+55aFxuVpDo3LvmG9j-Hu7sENLTcsB6_CY2TA4mE-k+CuTeGg@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 2059 bytes --]

On Sun, Oct 27, 2013 at 08:39:47PM +0000, Linus Torvalds wrote:
> On Sun, Oct 27, 2013 at 8:20 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Appended is a warning I get with DEBUG_TIMER_OBJECTS. Seems to be a
> > device-mapper issue.
> 
> .. and here's another one. This time it looks like nf_conntrack_free()
> is freeing something that has a delayed work in it (again, likely an
> embedded 'struct kobject'). Looks like it is the
> 
>     kmem_cache_destroy(net->ct.nf_conntrack_cachep);
> 
> that triggers this. Which probably means that there are still slab
> entries on that slab cache or something, but I didn't dig any deeper..
> 
> David? Patrick? Pablo? Jozsef? Any ideas? This was immediately preceded by
> 
>   [ 1136.316280] kobject: 'nf_conntrack_ffff8800b74d0000'
> (ffff8801196fac78): kobject_uevent_env
>   [ 1136.316287] kobject: 'nf_conntrack_ffff8800b74d0000'
> (ffff8801196fac78): fill_kobj_path: path =
> '/kernel/slab/nf_conntrack_ffff8800b74d0000'
>   [ 1136.316331] kobject: 'nf_conntrack_ffff8800b74d0000'
> (ffff8801196fac78): kobject_release, parent           (null) (delayed)
> 
> and I think it's that delayed "kobject_release()" that triggers this.
> 
> Notice that kobject_release() can be delayed *without* the magic
> kobject debugging option by simply having a reference count on it from
> some external source. So this particular issue is probably triggered
> by my extra debug options in this case (I'm running with all those
> nasty "try to find bad object freeing" options, and doing module
> unloading etc), but can happen without it (it's just very hard to
> trigger in practice without the debug options).

nf_conntrack_free() is decrementing our object counter (net->ct.count)
before releasing the object. That counter is used in the
nf_conntrack_cleanup_net_list path to check if it's time to
kmem_cache_destroy our cache of conntrack objects. I think we have a
race there that should be easier to trigger (although still hard) with
CONFIG_DEBUG_OBJECTS_FREE as object releases become slowier.

[-- Attachment #2: linus.patch --]
[-- Type: text/x-diff, Size: 528 bytes --]

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 5d892fe..d60cf16 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -764,9 +764,10 @@ void nf_conntrack_free(struct nf_conn *ct)
 	struct net *net = nf_ct_net(ct);
 
 	nf_ct_ext_destroy(ct);
-	atomic_dec(&net->ct.count);
 	nf_ct_ext_free(ct);
 	kmem_cache_free(net->ct.nf_conntrack_cachep, ct);
+	smp_mb__before_atomic_dec();
+	atomic_dec(&net->ct.count);
 }
 EXPORT_SYMBOL_GPL(nf_conntrack_free);
 

^ permalink raw reply related

* Re: [PATCH v2 net-next] net: introduce gro_frag_list_enable sysctl
From: Vlad Yasevich @ 2013-10-30 18:09 UTC (permalink / raw)
  To: Jerry Chu, Herbert Xu
  Cc: Eric Dumazet, David Miller, Christoph Paasch,
	netdev@vger.kernel.org, Michael Dalton
In-Reply-To: <CAPshTCg8j-y6fGfh0tRMir34-Lzy77qyzibvwjMSMpR0Rr1rsA@mail.gmail.com>

On 10/30/2013 01:39 PM, Jerry Chu wrote:
> On Tue, Oct 29, 2013 at 9:42 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>>
>> On Tue, Oct 29, 2013 at 09:34:41PM -0700, Eric Dumazet wrote:
>>>
>>> What matters ?
>>>
>>> GRO ?
>>
>> What matters is that you should not treat the forwarding case
>> separately from the host case.
>>
>> For virtualisation the host case looks exactly like the forwarding
>> case.
>
>
> Not sure I agree - there are two different "forwarding" cases - forwarding
> to another physical NIC (to go out to the wire hence need to do GSO),
> and (for virtualization) forwarding to a virtual NIC and consumed internally
> (e.g., VM).

I don't think you can really differentiate these 2 case.  VM are
very commonly used as routers/forwarders.  In some cases, to get
better throughput,  VFs are assigned to the VMs as the externally
facing ports.  So, you still end up forwarding to another physical
NIC.

-vlad

> For the latter we should strive to push GSO pkts all the way
> to the VM stack w/o breaking them up. So for virtualization GRO is all
> goodness but not sure about the regular forwarding path. (From the
> perf perspective it boils down to if the cost of GSO/GRO will offset
> the benefit of GRO. Sure if one manages to get the cost close to zero
> than there is not reason to leave GRO always on. But it's still a big if for
> now.)
>
> Best,
>
> Jerry
>
>>
>>
>> IOW, if having a 64KB packet matters for the host, then it matters
>> for forwarding as well.
>>
>>> Before my patch, GRO packets were 17 MSS, and nobody complained packets
>>> were too small, so what are you saying exactly ?
>>
>> I'm not criticsing your mega-GRO patch at all.  That one is great
>> and means that we'll get aggregated packets up to 64K.  What we need
>> to do is just to patch up the GSO code so that it can handle these
>> mega-packets properly.
>>
>> Cheers,
>> --
>> Email: Herbert Xu <herbert@gondor.apana.org.au>
>> Home Page: http://gondor.apana.org.au/~herbert/
>> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply

* Re: [gpio:for-next 67/67] pch_gbe_main.c:undefined reference to `devm_gpio_request_one'
From: Linus Walleij @ 2013-10-30 18:21 UTC (permalink / raw)
  To: David Cohen
  Cc: Darren Hart, David S. Miller, netdev@vger.kernel.org,
	Fengguang Wu, Alexandre Courbot, linux-gpio@vger.kernel.org
In-Reply-To: <526CA546.7040406@linux.intel.com>

On Sat, Oct 26, 2013 at 10:31 PM, David Cohen
<david.a.cohen@linux.intel.com> wrote:

> Maybe if GPIOLIB has the static inline stubs returning -ENODEV we could
> use a patch similar to the one attached here.

I've added stubs returning -EINVAL just like all siblings there, if you
like to have this changed, send a patch changing it to -ENODEV
and this patch on top as 2/2 please.

Yours,
Linus Walleij

^ permalink raw reply

* Re: [PATCH] [trivial]doc:net: Fix typo in Documentation/networking
From: Randy Dunlap @ 2013-10-30 18:22 UTC (permalink / raw)
  To: Masanari Iida, trivial, linux-kenrel, netdev
In-Reply-To: <1383119175-1963-1-git-send-email-standby24x7@gmail.com>

On 10/30/13 00:46, Masanari Iida wrote:
> Correct spelling typo in Documentation/networking
> 
> Signed-off-by: Masanari Iida <standby24x7@gmail.com>
> ---
>  Documentation/networking/dccp.txt         | 4 ++--
>  Documentation/networking/e100.txt         | 2 +-
>  Documentation/networking/ieee802154.txt   | 4 ++--
>  Documentation/networking/l2tp.txt         | 2 +-
>  Documentation/networking/netdev-FAQ.txt   | 2 +-
>  Documentation/networking/netlink_mmap.txt | 6 +++---
>  Documentation/networking/operstates.txt   | 2 +-
>  Documentation/networking/rxrpc.txt        | 2 +-
>  Documentation/networking/stmmac.txt       | 8 ++++----
>  Documentation/networking/vortex.txt       | 4 ++--
>  Documentation/networking/x25-iface.txt    | 2 +-
>  11 files changed, 19 insertions(+), 19 deletions(-)

> diff --git a/Documentation/networking/ieee802154.txt b/Documentation/networking/ieee802154.txt
> index 09eb573..98c8b2b 100644
> --- a/Documentation/networking/ieee802154.txt
> +++ b/Documentation/networking/ieee802154.txt
> @@ -66,7 +66,7 @@ net_device, with .type = ARPHRD_IEEE802154. Data is exchanged with socket family
>  code via plain sk_buffs. On skb reception skb->cb must contain additional
>  info as described in the struct ieee802154_mac_cb. During packet transmission
>  the skb->cb is used to provide additional data to device's header_ops->create
> -function. Be aware, that this data can be overriden later (when socket code
> +function. Be aware, that this data can be overridden later (when socket code

fwiw, I would drop the comma above...

>  submits skb to qdisc), so if you need something from that cb later, you should
>  store info in the skb->data on your own.
>  
> diff --git a/Documentation/networking/operstates.txt b/Documentation/networking/operstates.txt
> index 9769457..c40f4bb 100644
> --- a/Documentation/networking/operstates.txt
> +++ b/Documentation/networking/operstates.txt
> @@ -89,7 +89,7 @@ packets. The name 'carrier' and the inversion are historical, think of
>  it as lower layer.
>  
>  Note that for certain kind of soft-devices, which are not managing any
> -real hardware, there is possible to set this bit from userpsace.
> +real hardware, there is possible to set this bit from userspace.

                  it is possible

>  One should use TVL IFLA_CARRIER to do so.
>  
>  netif_carrier_ok() can be used to query that bit.

Acked-by: Randy Dunlap <rdunlap@infradead.org>

Thanks.


-- 
~Randy

^ permalink raw reply

* Fw: [Bug 64081] New: mdiobus_read fail and return -110 when called from get_phy_id  r6040: 0000:00:08.0 failed to register MII bus
From: Stephen Hemminger @ 2013-10-30 18:23 UTC (permalink / raw)
  To: netdev



Begin forwarded message:

Date: Wed, 30 Oct 2013 11:10:47 -0700
From: "bugzilla-daemon@bugzilla.kernel.org" <bugzilla-daemon@bugzilla.kernel.org>
To: "stephen@networkplumber.org" <stephen@networkplumber.org>
Subject: [Bug 64081] New: mdiobus_read fail and return -110 when called from get_phy_id  r6040: 0000:00:08.0 failed to register MII bus


https://bugzilla.kernel.org/show_bug.cgi?id=64081

            Bug ID: 64081
           Summary: mdiobus_read fail and return -110 when called from
                    get_phy_id  r6040: 0000:00:08.0 failed to register MII
                    bus
           Product: Networking
           Version: 2.5
    Kernel Version: 3.10.17
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: blocking
          Priority: P1
         Component: Other
          Assignee: shemminger@linux-foundation.org
          Reporter: nils.koehler@ibt-interfaces.de
        Regression: No

I have tracked down an issue I have to get the device driver r6040.c to run on
Kernel 3.10.17.

It fails called get_phy_id.
In fails sometimes randomly on a phy id between 0 and 31.

the chance to get the error is 50% in my case after insmod r6040.ko

The error is caused by the following function:

int mdiobus_read(struct mii_bus *bus, int addr, u32 regnum)
{
    int retval;

    BUG_ON(in_interrupt());

    mutex_lock(&bus->mdio_lock);
    retval = bus->read(bus, addr, regnum);
    mutex_unlock(&bus->mdio_lock);

printk(KERN_INFO "IBT Debug 1 mdiobus_read retval:%d addr:%d
regnum:%d\n",retval,addr,regnum);

    return retval;
}
EXPORT_SYMBOL(mdiobus_read);

I have added a debug trace and in case of failure retval returns -110

This cause a break in get_phy_id flow and driver init fully fails because it
returns the an -EIO 

I saw that the real existing phy id 1 in my case always was propper detected
but the feeding loop try to init up to 31 Phy's and mostly one of them fail.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply

* Re: Fw: [Bug 64081] New: mdiobus_read fail and return -110 when called from get_phy_id r6040: 0000:00:08.0 failed to register MII bus
From: Florian Fainelli @ 2013-10-30 18:33 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20131030112348.58ceec8a@nehalam.linuxnetplumber.net>

2013/10/30 Stephen Hemminger <stephen@networkplumber.org>:
>
>
> Begin forwarded message:
>
> Date: Wed, 30 Oct 2013 11:10:47 -0700
> From: "bugzilla-daemon@bugzilla.kernel.org" <bugzilla-daemon@bugzilla.kernel.org>
> To: "stephen@networkplumber.org" <stephen@networkplumber.org>
> Subject: [Bug 64081] New: mdiobus_read fail and return -110 when called from get_phy_id  r6040: 0000:00:08.0 failed to register MII bus

Thanks Stephen, investigating with the reporter of the bug.
--
Florian

^ permalink raw reply

* [RESEND] [PATCH] Fix tc stats when using -batch mode
From: Nigel Kukard @ 2013-10-30 18:44 UTC (permalink / raw)
  To: netdev

There are two global variables in tc/tc_class.c:
__u32 filter_qdisc;
__u32 filter_classid;

These are not re-initialized for each line received in -batch mode:
class show dev eth0 parent 1: classid 1:1
class show dev eth0 parent 1: classid 1:1
Error: duplicate "classid": "1:1" is the second value.

This patch fixes the issue by initializing the two globals when we
enter print_class().

Signed-off-by: Nigel Kukard <nkukard@lbsd.net>
---
 tc/tc_class.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tc/tc_class.c b/tc/tc_class.c
index 9d4eea5..8043448 100644
--- a/tc/tc_class.c
+++ b/tc/tc_class.c
@@ -241,6 +241,9 @@ int tc_class_list(int argc, char **argv)
 	t.tcm_family = AF_UNSPEC;
 	memset(d, 0, sizeof(d));
 
+	filter_qdisc = 0;
+	filter_classid = 0;
+
 	while (argc > 0) {
 		if (strcmp(*argv, "dev") == 0) {
 			NEXT_ARG();
-- 
1.8.4.rc3

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox