Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next] net: qmi_wwan: make dynamic device IDs work
From: David Miller @ 2012-07-18 16:32 UTC (permalink / raw)
  To: bjorn; +Cc: netdev, linux-usb
In-Reply-To: <1342559672-5893-1-git-send-email-bjorn@mork.no>

From: Bjørn Mork <bjorn@mork.no>
Date: Tue, 17 Jul 2012 23:14:32 +0200

> The usbnet API use the device ID table to store a pointer to
> a minidriver. Setting a generic pointer for dynamic device
> IDs will in most cases make them work as expected.  usbnet
> will otherwise treat the dynamic IDs as blacklisted. That is
> rarely useful.
> 
> There is no standard class describing devices supported by
> this driver, and most vendors don't even provide enough
> information to allow vendor specific wildcard matching. The
> result is that most of the supported devices must be
> explicitly listed in the device table.  Allowing dynamic IDs
> to work both simplifies testing and verification of new
> devices, and provides a way for end users to use a device
> before the ID is added to the driver.
> 
> Signed-off-by: Bjørn Mork <bjorn@mork.no>

Applied.

^ permalink raw reply

* Re: [PATCH net-next v3] ipv6: add ipv6_addr_hash() helper
From: Eric Dumazet @ 2012-07-18 16:30 UTC (permalink / raw)
  To: David Miller; +Cc: joe, netdev, andrewmcgr, dave.taht, therbert
In-Reply-To: <20120718.091928.492512069081953171.davem@davemloft.net>

On Wed, 2012-07-18 at 09:19 -0700, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Wed, 18 Jul 2012 09:16:47 -0700 (PDT)
> 
> > 
> > Just to clarify, I made sure to use this version.
> 
> Did you actually try to compile this?
> 
> net/ipv6/addrconf.c already has a function named ipv6_addr_hash()
> and you didn't remove it in your patch.


Oops, sorry, will send a v4 then.

^ permalink raw reply

* Re: [PATCH v2] b44: add 64 bit stats
From: David Miller @ 2012-07-18 16:30 UTC (permalink / raw)
  To: eric.dumazet; +Cc: kgroeneveld, netdev
In-Reply-To: <1342583402.2626.1376.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 18 Jul 2012 05:50:02 +0200

> On Tue, 2012-07-17 at 23:46 -0400, Kevin Groeneveld wrote:
>> From: Kevin Groeneveld <kgroeneveld@gmail.com>
>> 
>> Add support for 64 bit stats to Broadcom b44 ethernet driver.
>> 
>> Signed-off-by: Kevin Groeneveld <kgroeneveld@gmail.com>
>> ---
>> v2: use u64_stats_fetch_begin_bh/u64_stats_fetch_retry_bh instead of
>>     u64_stats_fetch_begin/u64_stats_fetch_retry as stats update happens in a
>>     timer interrupt
> 
> Seems good this time, thanks
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next 0/2] be2net fixes
From: David Miller @ 2012-07-18 16:28 UTC (permalink / raw)
  To: padmanabh.ratnakar; +Cc: netdev
In-Reply-To: <9a423faa-bc47-459e-a59b-182dd94b8465@exht1.ad.emulex.com>

From: Padmanabh Ratnakar <padmanabh.ratnakar@emulex.com>
Date: Wed, 18 Jul 2012 18:21:45 +0530

> Driver fixes for recent Lancer FW changes.

Both applied, thanks.

^ permalink raw reply

* Re: [RFC PATCH] net: cgroup: null ptr dereference in netprio cgroup during init
From: David Miller @ 2012-07-18 16:26 UTC (permalink / raw)
  To: john.r.fastabend; +Cc: gaofeng, nhorman, mark.d.rustad, netdev, eric.dumazet
In-Reply-To: <20120718003316.2979.49278.stgit@jf-dev1-dcblab>

From: John Fastabend <john.r.fastabend@intel.com>
Date: Tue, 17 Jul 2012 17:33:16 -0700

> When the netprio cgroup is built in the kernel cgroup_init will call
> cgrp_create which eventually calls update_netdev_tables. This is
> being called before do_initcalls() so a null ptr dereference occurs
> on init_net.
> 
> This patch adds a check on init_net.count to verify the structure
> has been initialized. The failure was introduced here,
> 
> commit ef209f15980360f6945873df3cd710c5f62f2a3e
> Author: Gao feng <gaofeng@cn.fujitsu.com>
> Date:   Wed Jul 11 21:50:15 2012 +0000
> 
>     net: cgroup: fix access the unallocated memory in netprio cgroup
> 
> Tested with ping with netprio_cgroup as a module and built in.
> 
> Marked RFC for now I think DaveM might have a reason why this needs
> some improvement.
> 
> Reported-by: Mark Rustad <mark.d.rustad@intel.com>
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>

John, just so I can sleep better at night, can you add an explicit
initializer to init_net in net/core/net_namespace.c in this patch
so that "count" is explicitly set to ATOMIC_INIT(0)?

This is done elsewhere in the tree for similar situations.

Otherwise this patch looks great, thanks a lot.

^ permalink raw reply

* Re: [RFC] r8169 : why SG / TX checksum are default disabled
From: David Miller @ 2012-07-18 16:23 UTC (permalink / raw)
  To: hayeswang; +Cc: romieu, eric.dumazet, netdev
In-Reply-To: <AC35CD5731C3479B9BF5E53BF541A81F@realtek.com.tw>

From: hayeswang <hayeswang@realtek.com>
Date: Wed, 18 Jul 2012 14:45:55 +0800

> Francois Romieu [mailto:romieu@fr.zoreil.com] 
> [...]
> 
>> Hayes, should we not add into the kernel driver something similar to
>> the rtl8168_start_xmit::skb_checksum_help stuff in Realtek's 
>> 8168 driver ?
>> There seems to be a bug for (skb->len < 60 && RTL_GIGA_MAC_VER_34.
> 
> For RTL8168E-VL (RTL_GIGA_MAC_VER_34), the hardware wouldn't send the packet
> with the length less than 60 bytes. The hardware should pad this kind of packet
> to 60 bytes, but it wouldn't. Therefore, the software has to pad the packet to
> 60 bytes. However, the hw checksum would be incorrect for the modified packet,
> so the software checksum is necessary.

I wonder how the hardware checksum can be incorrectly calculated if the padding
is done with zeros?

^ permalink raw reply

* Re: [net-next 0/6][pull request] Intel Wired LAN Driver Updates
From: David Miller @ 2012-07-18 16:21 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, sassmann
In-Reply-To: <1342578045-17778-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 17 Jul 2012 19:20:39 -0700

> This series contains updates to ixgbe & ixgbevf.
> 
> The following are changes since commit 5abf7f7e0f6bdbfcac737f636497d7016d9507eb:
>   ipv4: fix rcu splat
> and are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next master

Pulled, thanks Jeff.

^ permalink raw reply

* Re: pull request: sfc-next 2012-07-17
From: Ben Hutchings @ 2012-07-18 16:19 UTC (permalink / raw)
  To: David Miller; +Cc: linux-net-drivers, netdev
In-Reply-To: <20120718.091010.2180226373071597034.davem@davemloft.net>

On Wed, 2012-07-18 at 09:10 -0700, David Miller wrote:
> From: Ben Hutchings <bhutchings@solarflare.com>
> Date: Tue, 17 Jul 2012 18:05:40 +0100
> 
> > The following changes since commit 141e369de698f2e17bf716b83fcc647ddcb2220c:
> > 
> >   xfrm: Initialize the struct xfrm_dst behind the dst_enty field (2012-07-14 00:29:12 -0700)
> > 
> > are available in the git repository at:
> >   git://git.kernel.org/pub/scm/linux/kernel/git/bwh/sfc-next.git for-davem
> > 
> > (commit c2dbab39db1c3c2ccbdbb2c6bac6f07cc7a7c1f6)
> > 
> > 1. Fix potential badness when running a self-test with SR-IOV enabled.
> > 2. Fix calculation of some interface statistics that could run backward.
> > 3. Miscellaneous cleanup.
> 
> Looks good, pulled, thanks Ben.
> 
> Out of curiosity, why the conversion to the generic DMA interfaces?
> Do you plan on using something uniquely provided by them vs. the PCI
> specific DMA interfaces (ability to specify GFP flags, stuff like
> that) or do you really plan on having non-PCI devices in the future?

Some DMA operations were already converted because GFP_KERNEL was wanted
or as part of the skb frags API.  I changed the rest to be consistent;
I'm not aware of any plans for non-PCI devices.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH net-next v3] ipv6: add ipv6_addr_hash() helper
From: David Miller @ 2012-07-18 16:19 UTC (permalink / raw)
  To: eric.dumazet; +Cc: joe, netdev, andrewmcgr, dave.taht, therbert
In-Reply-To: <20120718.091647.2208692760443941786.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Wed, 18 Jul 2012 09:16:47 -0700 (PDT)

> 
> Just to clarify, I made sure to use this version.

Did you actually try to compile this?

net/ipv6/addrconf.c already has a function named ipv6_addr_hash()
and you didn't remove it in your patch.

^ permalink raw reply

* Re: [PATCH net-next v3] ipv6: add ipv6_addr_hash() helper
From: David Miller @ 2012-07-18 16:16 UTC (permalink / raw)
  To: eric.dumazet; +Cc: joe, netdev, andrewmcgr, dave.taht, therbert
In-Reply-To: <1342621670.2626.2818.camel@edumazet-glaptop>


Just to clarify, I made sure to use this version.

^ permalink raw reply

* Re: [PATCH net-next] ipv6: add ipv6_addr_hash() helper
From: David Miller @ 2012-07-18 16:16 UTC (permalink / raw)
  To: eric.dumazet; +Cc: joe, netdev, andrewmcgr, dave.taht, therbert
In-Reply-To: <1342620841.2626.2786.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 18 Jul 2012 16:14:01 +0200

> but hash_long() on 64bit sounds a bit expensive for our needs...

Agreed.

^ permalink raw reply

* Re: [PATCH net-next] ipv6: add ipv6_addr_hash() helper
From: David Miller @ 2012-07-18 16:15 UTC (permalink / raw)
  To: eric.dumazet; +Cc: David.Laight, netdev, andrewmcgr, dave.taht, therbert
In-Reply-To: <1342620366.2626.2764.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 18 Jul 2012 16:06:06 +0200

> [PATCH net-next v2] ipv6: add ipv6_addr_hash() helper
> 
> Introduce ipv6_addr_hash() helper doing a XOR on all bits
> of an IPv6 address, with an optimized x86_64 version.
> 
> Use it in flow dissector, as suggested by Andrew McGregor,
> to reduce hash collision probabilities in fq_codel (and other
> users of flow dissector)
> 
> Use it in ip6_tunnel.c and use more bit shuffling, as suggested
> by David Laight, as existing hash was ignoring most of them.
> 
> Use it in sunrpc and use more bit shuffling, using hash_32().
> 
> As cleanup, use it in net/ipv4/tcp_metrics.c
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Andrew McGregor <andrewmcgr@gmail.com>

Looks good, applied.

^ permalink raw reply

* Re: [PATCH net-next 0/2] Pull request for 'davem-next.r8169' branch
From: David Miller @ 2012-07-18 16:11 UTC (permalink / raw)
  To: romieu; +Cc: netdev, hayeswang
In-Reply-To: <cover.1342562326.git.romieu@fr.zoreil.com>

From: Francois Romieu <romieu@fr.zoreil.com>
Date: Wed, 18 Jul 2012 00:09:42 +0200

> Please pull from branch 'davem-next.r8169' in repository
> 
> git://violet.fr.zoreil.com/romieu/linux davem-next.r8169
> 
> to get the changes below.

Pulled, thanks Francois.

^ permalink raw reply

* Re: getsockopt/setsockopt with SO_RCVBUF and SO_SNDBUF "non-standard" behaviour
From: Eric Dumazet @ 2012-07-18 16:11 UTC (permalink / raw)
  To: Eugen Dedu; +Cc: linux-kernel, netdev
In-Reply-To: <5006DD6B.3030300@pu-pm.univ-fcomte.fr>

On Wed, 2012-07-18 at 17:59 +0200, Eugen Dedu wrote:
> Any idea?
> 
> On 17/07/12 11:27, Eugen Dedu wrote:
> > Hi all,
> >
> > I looked on Internet and at the old thread
> > http://lkml.indiana.edu/hypermail/linux/kernel/0108.0/0275.html, but the
> > issue is still not settled as far as I see.
> >
> > I need to have the highest memory available for snd/rcv buffer and I
> > need to know/confirm how much it allocated for my process (how much I
> > can use).
> >
> > So with Linux we need to do something like:
> > setsockopt (..., SO_RCVBUF, 256000, ...)
> > getsockopt (..., SO_RCVBUF, &i, ...)
> > i /= 2;
> >
> > where i is the size I am looking for.
> >
> > Now, to make this code work for other OSes it should be changed to:
> > setsockopt (..., SO_RCVBUF, 256000, ...)
> > getsockopt (..., SO_RCVBUF, &i, ...)
> > #ifdef LINUX
> > i /= 2;
> > #endif
> >
> > First question, is this code correct? If not, what code gives the amount
> > of memory useable for my process?
> >
> > Second, it seems to me that linux is definitely "non-standard" here.
> > Saying that linux uses twice as memory has nothing to do with that,
> > since getsockopt should return what the application can count on, not
> > what is the internal use. It is like a hypothetical malloc (10) would
> > return not 10, but 20 (including meta-information). Is that right?
> >
> > Cheers,

That the way it's done on linux since day 0

You can probably find a lot of pages on the web explaining the
rationale.

If your application handles UDP frames, what SO_RCVBUF should count ?

If its the amount of payload bytes, you could have a pathological
situation where an attacker sends 1-byte UDP frames fast enough and
could consume a lot of kernel memory.

Each frame consumes a fair amount of kernel memory (between 512 bytes
and 8 Kbytes depending on the driver).

So linux says : If user expect to receive  XXXX bytes, set a limit of
_kernel_ memory used to store these bytes, and use an estimation of 100%
of overhead. That is : allow 2*XXXX bytes to be allocated for socket
receive buffers.

^ permalink raw reply

* Re: pull request: sfc-next 2012-07-17
From: David Miller @ 2012-07-18 16:10 UTC (permalink / raw)
  To: bhutchings; +Cc: linux-net-drivers, netdev
In-Reply-To: <1342544740.2698.13.camel@bwh-desktop.uk.solarflarecom.com>

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Tue, 17 Jul 2012 18:05:40 +0100

> The following changes since commit 141e369de698f2e17bf716b83fcc647ddcb2220c:
> 
>   xfrm: Initialize the struct xfrm_dst behind the dst_enty field (2012-07-14 00:29:12 -0700)
> 
> are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/bwh/sfc-next.git for-davem
> 
> (commit c2dbab39db1c3c2ccbdbb2c6bac6f07cc7a7c1f6)
> 
> 1. Fix potential badness when running a self-test with SR-IOV enabled.
> 2. Fix calculation of some interface statistics that could run backward.
> 3. Miscellaneous cleanup.

Looks good, pulled, thanks Ben.

Out of curiosity, why the conversion to the generic DMA interfaces?
Do you plan on using something uniquely provided by them vs. the PCI
specific DMA interfaces (ability to specify GFP flags, stuff like
that) or do you really plan on having non-PCI devices in the future?

^ permalink raw reply

* Re: [PATCH 0/5] Long term PMTU/redirect storage in ipv4.
From: David Miller @ 2012-07-18 16:07 UTC (permalink / raw)
  To: ja; +Cc: eric.dumazet, netdev
In-Reply-To: <alpine.LFD.2.00.1207181105320.2154@ja.ssi.bg>

From: Julian Anastasov <ja@ssi.bg>
Date: Wed, 18 Jul 2012 11:36:08 +0300 (EEST)

> 	Is the cost of read_seqbegin a problem? Here is a
> 2nd version, I still keep this first check for now.

No, the read side of seqlocks are extremely cheap, it's just a plain
read and compare of a read-mostly integer.

> Subject: [PATCH v2] ipv4: use seqlock for nh_exceptions
> 
> From: Julian Anastasov <ja@ssi.bg>
> 
> 	Use global seqlock for the nh_exceptions. Call
> fnhe_oldest with the right hash chain. Correct the diff
> value for dst_set_expires.
> 
> v2: after suggestions from Eric Dumazet:
> * get rid of spin lock fnhe_lock, rearrange update_or_create_fnhe
> * continue daddr search in rt_bind_exception
> 
> Signed-off-by: Julian Anastasov <ja@ssi.bg>
> ---

I think if you get a seqlock mis-compare, you will need to branch back
to rescan the hash chain from the beginning.

Otherwise I like these changes a lot.

We should perhaps consider doing something similar in the TCP metrics
code.

Thanks!

^ permalink raw reply

* Re: [PATCH] cipso: don't follow a NULL pointer when setsockopt() is called
From: David Miller @ 2012-07-18 16:02 UTC (permalink / raw)
  To: pmoore; +Cc: netdev
In-Reply-To: <20120717210738.22790.23522.stgit@sifl>

From: Paul Moore <pmoore@redhat.com>
Date: Tue, 17 Jul 2012 17:07:47 -0400

> As reported by Alan Cox, and verified by Lin Ming, when a user
> attempts to add a CIPSO option to a socket using the CIPSO_V4_TAG_LOCAL
> tag the kernel dies a terrible death when it attempts to follow a NULL
> pointer (the skb argument to cipso_v4_validate() is NULL when called via
> the setsockopt() syscall).
> 
> This patch fixes this by first checking to ensure that the skb is
> non-NULL before using it to find the incoming network interface.  In
> the unlikely case where the skb is NULL and the user attempts to add
> a CIPSO option with the _TAG_LOCAL tag we return an error as this is
> not something we want to allow.
> 
> A simple reproducer, kindly supplied by Lin Ming, although you must
> have the CIPSO DOI #3 configure on the system first or you will be
> caught early in cipso_v4_validate():
 ...
> CC: Lin Ming <mlin@ss.pku.edu.cn>
> Reported-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
> Signed-off-by: Paul Moore <pmoore@redhat.com>

Applied and queued up for -stable, thanks Paul.

^ permalink raw reply

* Re: [PATCH net-next] ipv6: fix inet6_csk_xmit()
From: David Miller @ 2012-07-18 16:00 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, ncardwell, ycheng
In-Reply-To: <1342597084.2626.1851.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 18 Jul 2012 09:38:04 +0200

> From: Eric Dumazet <edumazet@google.com>
> 
> We should provide to inet6_csk_route_socket a struct flowi6 pointer,
> so that net6_csk_xmit() works correctly instead of sending garbage.
> 
> Also add some consts 
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Yuchung Cheng <ycheng@google.com>

Thanks a lot for fixing this Eric.

Applied.

^ permalink raw reply

* Re: [RFC PATCH] net: cgroup: null ptr dereference in netprio cgroup during init
From: David Miller @ 2012-07-18 15:53 UTC (permalink / raw)
  To: nhorman; +Cc: john.r.fastabend, gaofeng, mark.d.rustad, netdev, eric.dumazet
In-Reply-To: <20120718152520.GG25563@hmsreliant.think-freely.org>

From: Neil Horman <nhorman@tuxdriver.com>
Date: Wed, 18 Jul 2012 11:25:20 -0400

> Yeah, I see what you mean.  Seems like what we need is to either:
> 1) move cgroup_init to later in the boot process.  If you're not early_init,
> then I don't see why the subsystem can't wait until later in the boot process
> (i.e. make cgroup_init a late_initcall or some such).
> 
> or
> 
> 2) Allow module based cgroups to flag themselves as needing late init after the
> rest of the kernel has booted.

These are way too complicated compared to John's currently proposed
fix for this recently introduced regression.

I want a one liner which I can prove is going to remove the crash.

All of this talk of rearranging initcall ordering for cgroup stuff
is too ambitious this late in the -rc.

^ permalink raw reply

* Re: [RFC PATCH] net: cgroup: null ptr dereference in netprio cgroup during init
From: Neil Horman @ 2012-07-18 15:25 UTC (permalink / raw)
  To: John Fastabend; +Cc: davem, gaofeng, mark.d.rustad, netdev, eric.dumazet
In-Reply-To: <5006C679.2040605@intel.com>

On Wed, Jul 18, 2012 at 07:21:45AM -0700, John Fastabend wrote:
> On 7/18/2012 5:45 AM, Neil Horman wrote:
> >On Tue, Jul 17, 2012 at 05:33:16PM -0700, John Fastabend wrote:
> >>When the netprio cgroup is built in the kernel cgroup_init will call
> >>cgrp_create which eventually calls update_netdev_tables. This is
> >>being called before do_initcalls() so a null ptr dereference occurs
> >>on init_net.
> >>
> >>This patch adds a check on init_net.count to verify the structure
> >>has been initialized. The failure was introduced here,
> >>
> >>commit ef209f15980360f6945873df3cd710c5f62f2a3e
> >>Author: Gao feng <gaofeng@cn.fujitsu.com>
> >>Date:   Wed Jul 11 21:50:15 2012 +0000
> >>
> >>     net: cgroup: fix access the unallocated memory in netprio cgroup
> >>
> >>Tested with ping with netprio_cgroup as a module and built in.
> >>
> >>Marked RFC for now I think DaveM might have a reason why this needs
> >>some improvement.
> >>
> >>Reported-by: Mark Rustad <mark.d.rustad@intel.com>
> >>Cc: Neil Horman <nhorman@tuxdriver.com>
> >>Cc: Eric Dumazet <edumazet@google.com>
> >>Cc: Gao feng <gaofeng@cn.fujitsu.com>
> >>Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> >>---
> >>
> >>  net/core/netprio_cgroup.c |    3 +++
> >>  1 files changed, 3 insertions(+), 0 deletions(-)
> >>
> >>diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
> >>index b2e9caa..e9fd7fd 100644
> >>--- a/net/core/netprio_cgroup.c
> >>+++ b/net/core/netprio_cgroup.c
> >>@@ -116,6 +116,9 @@ static int update_netdev_tables(void)
> >>  	u32 max_len;
> >>  	struct netprio_map *map;
> >>
> >>+	if (!atomic_read(&init_net.count))
> >>+		return ret;
> >>+
> >>  	rtnl_lock();
> >>  	max_len = atomic_read(&max_prioidx) + 1;
> >>  	for_each_netdev(&init_net, dev) {
> >>
> >>
> >
> >John, do you have a stack trace of this.  I'm having a hard time seeing how we
> >get into this path prior to the network stack being initalized.
> 
> Mark had a partial trace
> 
> [    0.003455] Dentry cache hash table entries: 262144 (order: 9,
> 2097152 bytes)
> [    0.005550] Inode-cache hash table entries: 131072 (order: 8,
> 1048576 bytes)
> [    0.007165] Mount-cache hash table entries: 256
> [    0.010289] Initializing cgroup subsys net_cls
> [    0.010947] Initializing cgroup subsys net_prio
> [    0.011039] BUG: unable to handle kernel NULL pointer dereference
> at 0000000000000828
> [    0.011998] IP: [<ffffffff814202c8>] update_netdev_tables+0x68/0xe0
> 
> 
Well, I was really hoping to see what call path got us there, so this doesn't
really help.  I'll try to setup a system here to reproduce later today.

> >
> >It also brings up another point.  If this is happening, and we're creating the
> >root cgroup from start_kernel, Then we're actually initalizing some cgroups
> >twice, because a few cgroups register themselves via cgroup_load_subsys in
> >module_init specified routines.  So if you're building netprio_cgroup or
> >net_cls_cgroup as part of the monolithic kernel, you'll get cgroup_create called
> >prior to your module_init() call.  Thats not good.
> 
> Well your module_init() wouldn't be called in this case right? I think
> netprio has a bug where we only register a netdevice notifier when
> its built as a module.
> 
> same issue with cls_cgroup and register_tcf_proto_ops?
> 
No.  When not built monolitically, module_init is defined as __initcall, so it
still gets called during the boot process

> >
> >In fact, the cgroup_subsys struct has an early_init flag that cgroup_init
> >appears to use to skip the initialization of subsystems that don't need to be
> >initialized that early in boot (assuming thats the path we're going down to get
> >to this oops).
> 
> Do you mean ss->early_init? Not sure that helps us either we get called
> by cgroup_init because we don't have an early_init callback or we get
> called via cgroup_init_early even earlier.
> 
Yeah, I see what you mean.  Seems like what we need is to either:
1) move cgroup_init to later in the boot process.  If you're not early_init,
then I don't see why the subsystem can't wait until later in the boot process
(i.e. make cgroup_init a late_initcall or some such).

or

2) Allow module based cgroups to flag themselves as needing late init after the
rest of the kernel has booted.


Neil

> >
> >If you can post the call stack, I'd appreciate it, I'd like to dig a bit deeper
> >into this.
> 
> Yes I'll do this shortly.
> 
> >Neil
> >
> 

^ permalink raw reply

* r8169: link up, link down
From: J. Christopher Pereira @ 2012-07-18 15:21 UTC (permalink / raw)
  To: netdev

Hi:

An old FC11 server died recently so we changed the hardware.
The new hardware has a r8169 nic with the following driver version (dmesg
output):

        r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
        r8169 0000:02:01.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
        r8169 0000:02:01.0: no PCI Express capability

Once in a while, the nic stops working with many "link up" and "link down"
messages:

        r8169: eth1: link up
        r8169: eth1: link down

The problem is solved by reseting the server.

I found some bug reports out there, but couldn't find any clear info about
the problem or if it has been fixed in newer driver versions.
The server is using the latest FC11 kernel
(kernel-2.6.30.10-105.2.23.fc11.x86_64) and FC11 reached its end of life.
Replacing the nic for another model is obviously the simplest solution, but.

Is there any solution or workarround?

^ permalink raw reply

* Re: [RFC PATCH] net: cgroup: null ptr dereference in netprio cgroup during init
From: John Fastabend @ 2012-07-18 15:14 UTC (permalink / raw)
  To: Neil Horman; +Cc: davem, gaofeng, mark.d.rustad, netdev, eric.dumazet
In-Reply-To: <5006C679.2040605@intel.com>

On 7/18/2012 7:21 AM, John Fastabend wrote:
> On 7/18/2012 5:45 AM, Neil Horman wrote:
>> On Tue, Jul 17, 2012 at 05:33:16PM -0700, John Fastabend wrote:
>>> When the netprio cgroup is built in the kernel cgroup_init will call
>>> cgrp_create which eventually calls update_netdev_tables. This is
>>> being called before do_initcalls() so a null ptr dereference occurs
>>> on init_net.
>>>
>>> This patch adds a check on init_net.count to verify the structure
>>> has been initialized. The failure was introduced here,
>>>
>>> commit ef209f15980360f6945873df3cd710c5f62f2a3e
>>> Author: Gao feng <gaofeng@cn.fujitsu.com>
>>> Date:   Wed Jul 11 21:50:15 2012 +0000
>>>
>>>      net: cgroup: fix access the unallocated memory in netprio cgroup
>>>
>>> Tested with ping with netprio_cgroup as a module and built in.
>>>
>>> Marked RFC for now I think DaveM might have a reason why this needs
>>> some improvement.
>>>
>>> Reported-by: Mark Rustad <mark.d.rustad@intel.com>
>>> Cc: Neil Horman <nhorman@tuxdriver.com>
>>> Cc: Eric Dumazet <edumazet@google.com>
>>> Cc: Gao feng <gaofeng@cn.fujitsu.com>
>>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>>> ---
>>>
>>>   net/core/netprio_cgroup.c |    3 +++
>>>   1 files changed, 3 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
>>> index b2e9caa..e9fd7fd 100644
>>> --- a/net/core/netprio_cgroup.c
>>> +++ b/net/core/netprio_cgroup.c
>>> @@ -116,6 +116,9 @@ static int update_netdev_tables(void)
>>>       u32 max_len;
>>>       struct netprio_map *map;
>>>
>>> +    if (!atomic_read(&init_net.count))
>>> +        return ret;
>>> +
>>>       rtnl_lock();
>>>       max_len = atomic_read(&max_prioidx) + 1;
>>>       for_each_netdev(&init_net, dev) {
>>>
>>>
>>
>> John, do you have a stack trace of this.  I'm having a hard time
>> seeing how we
>> get into this path prior to the network stack being initalized.
>
> Mark had a partial trace
>
> [    0.003455] Dentry cache hash table entries: 262144 (order: 9,
> 2097152 bytes)
> [    0.005550] Inode-cache hash table entries: 131072 (order: 8, 1048576
> bytes)
> [    0.007165] Mount-cache hash table entries: 256
> [    0.010289] Initializing cgroup subsys net_cls
> [    0.010947] Initializing cgroup subsys net_prio
> [    0.011039] BUG: unable to handle kernel NULL pointer dereference at
> 0000000000000828
> [    0.011998] IP: [<ffffffff814202c8>] update_netdev_tables+0x68/0xe0
>
>
>>
>> It also brings up another point.  If this is happening, and we're
>> creating the
>> root cgroup from start_kernel, Then we're actually initalizing some
>> cgroups
>> twice, because a few cgroups register themselves via
>> cgroup_load_subsys in
>> module_init specified routines.  So if you're building netprio_cgroup or
>> net_cls_cgroup as part of the monolithic kernel, you'll get
>> cgroup_create called
>> prior to your module_init() call.  Thats not good.
>
> Well your module_init() wouldn't be called in this case right? I think
> netprio has a bug where we only register a netdevice notifier when
> its built as a module.
>
> same issue with cls_cgroup and register_tcf_proto_ops?
>

Neil, I was very unclear in the above. What I meant here was
cgroup_load_subsys() checks ss->module so you should _not_
get two create calls. And returns 0 so the register calls for
netdev notifiers should get setup.

I missed the return 0 part and so I thought we might abort before
this occurs but it looks ok to me on second glance.

^ permalink raw reply

* Confirm Your Identity Webmail
From: Christine BALAGUE @ 2012-07-18 13:45 UTC (permalink / raw)







Confirm Your Identity Webmail

Your mailbox has exceeded one or more size limits set by your  
administrator. You may not be able to send or receive new messages  
until your mailbox size is reduced. To make more space available,  
please click http://thanaban.com/main/wp-admin/account-update/  to  
correct account details.

Thank you and we apologize for the inconvenience.

System Administrator.


----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.

^ permalink raw reply

* Re: [PATCH] mlx4_en: map entire pages to increase throughput
From: Or Gerlitz @ 2012-07-18 14:59 UTC (permalink / raw)
  To: Thadeu Lima de Souza Cascardo, Yevgeny Petrilin
  Cc: Or Gerlitz, Rick Jones, davem@davemloft.net,
	netdev@vger.kernel.org, amirv@mellanox.com,
	brking@linux.vnet.ibm.com, leitao@linux.vnet.ibm.com,
	klebers@linux.vnet.ibm.com, linuxppc-dev@lists.ozlabs.org,
	anton@samba.org
In-Reply-To: <20120716205708.GB16137@oc1711230544.ibm.com>

On 7/16/2012 11:57 PM, Thadeu Lima de Souza Cascardo wrote:
> On Mon, Jul 16, 2012 at 11:43:33PM +0300, Or Gerlitz wrote:
>>
>>
>> TCP_STREAM from this setup before the patch would be good to know as well
>>
>
> Does the stream test that I did with uperf using messages of 64000 bytes fit?

netperf/TCP_STREAM is very common and it would help to better compare 
the numbers
you get on your systems before/after the patch which runs done here. As 
for review for
the patch itself and the related discussion, Yevgeny Petrilin should be 
looking on your
patch, he'll be in by early next week.

Or.

^ permalink raw reply

* AW: RFC: (now non Base64) replace packets already in queue
From: Erdt, Ralph @ 2012-07-18 14:50 UTC (permalink / raw)
  To: Nicolas de Pesloüan; +Cc: netdev@vger.kernel.org, Eric Dumazet, Rick Jones
In-Reply-To: <4FF4A873.1000001@gmail.com>

Hello.

I'm sorry for the very late answer. But I had top-priority family issues.

> I suggest you try and send a properly formated patch with your code, so
> that people here can have a look at it and evaluate the interest of
> integrating it into main line kernel.

Attached at the button of the eMail.


> That being said, I really think you should try to manage a userspace
> queue, [..] you can
> add many nice features into userspace to enhance the speed/quality 
> [..]
> And I really see your packet replacement system as one of those nice
> features and cannot imagine a good reason not to put it in userspace.

All this features are done already. E.g. we are using RoHC.
But we also want to use the TC stuff - its already there - why reprogramming?

Here the patch. But I didn't find which git tree I should use. This patch is against Linux-2.6. I'm sorry. Can you tell me, which tree I've to use?
-------------
>From 52f27fa2b0867de821af38c731c2ebc763afb1f1 Mon Sep 17 00:00:00 2001
From: Ralph Erdt <Ralph.Robert.Erdt@fkie.fraunhofer.de>
Date: Wed, 18 Jul 2012 16:43:44 +0200
Subject: [PATCH] RFC: TC qdisc "replace packet in queue"

This adds a new TC qdisc, which replaces packets in the queue. It
compares every incoming packet with all of the packets in the queue.
If the incoming and the compared packet meet all these conditions:
 - UDPv4
 - not fragmented
 - TOS like the given value(s)
 - same TOS
 - same source IP
 - same destination IP
 - same destination port
the packet in the queue will be replaced with the incoming packet.

The variable "overlimit" is the counter of replaced packets

Background:
In very low bandwidth networks (<=9.6Kbps, shared, etc.) it's hard
(rather: impossible) to get all packets sent.
But some of the packets contain information, which gets obsolete over
time. E.g. (GPS) positions, which will be sent periodically. If the
application sends a new packet while an old position packet is still in
the queue, the old packet is obsolete. This can be dropped. But just
dropping the old packet and queuing the new packet will result in never
sending a packet of this type. So this qdisc replace the old packet with
the new one. The information gets the chance to get sent - with the
newest available information.

Code-Status:
RFC for discussing.
The configuration by "debug-fs" is ... not optimal. But following the
"litte step" rules this is a first option. A configuration with tc will
be done later (if this patch got offical).

Signed-off-by: Ralph Erdt <Ralph.Robert.Erdt@fkie.fraunhofer.de>
---
 net/sched/Kconfig  |   16 +++
 net/sched/Makefile |    1 +
 net/sched/sch_pr.c |  264 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 281 insertions(+), 0 deletions(-)
 create mode 100644 net/sched/sch_pr.c

diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index e7a8976..e29ad48 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -308,6 +308,22 @@ config NET_SCH_PLUG
 	  To compile this code as a module, choose M here: the
 	  module will be called sch_plug.
 
+config NET_SCH_PR
+	tristate "Packet Replace"
+	help
+	  Say Y here if you want to use the "Packet Replace"
+	  packet scheduling algorithm.
+
+	  This qdisc will replace packets in the queue, if this is a packet
+	  from the same UDP stream (IP/Port).
+
+	  See the top of <file:net/sched/sch_pr.c> for more details.
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called sch_pr.
+
+	  If unsure, say N.
+
 comment "Classification"
 
 config NET_CLS
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 5940a19..ef669ff 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -39,6 +39,7 @@ obj-$(CONFIG_NET_SCH_CHOKE)	+= sch_choke.o
 obj-$(CONFIG_NET_SCH_QFQ)	+= sch_qfq.o
 obj-$(CONFIG_NET_SCH_CODEL)	+= sch_codel.o
 obj-$(CONFIG_NET_SCH_FQ_CODEL)	+= sch_fq_codel.o
+obj-$(CONFIG_NET_SCH_PR)	+= sch_pr.o
 
 obj-$(CONFIG_NET_CLS_U32)	+= cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)	+= cls_route.o
diff --git a/net/sched/sch_pr.c b/net/sched/sch_pr.c
new file mode 100644
index 0000000..5cbf8d8
--- /dev/null
+++ b/net/sched/sch_pr.c
@@ -0,0 +1,264 @@
+/*
+ * net/sched/sch_pr.c	"packet replace"
+ *
+ * Copyrigth (c) 2012 Fraunhofer FKIE, all rigths reserved.
+ *
+ *		This program is free software; you can redistribute it and/or
+ *		modify it under the terms of the GNU General Public License
+ *		as published by the Free Software Foundation; either version
+ *		2 of the License, or (at your option) any later version.
+ *
+ * Authors:	Ralph Erdt (Fraunhofer FKIE),
+ *                                 <ralph.robert.erdt@fkie.fraunhofer.de>
+ */
+
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/skbuff.h>
+#include <net/pkt_sched.h>
+
+#include <linux/ip.h>
+#include <net/ip.h>
+#include <linux/debugfs.h>
+
+/*
+ * replace packet in queue
+ * ==========================
+ * This is a modified fifo queue (fifo by Alexey Kuznetsov).
+ *
+ * This packet compares every incoming packet with all of the packets in the
+ * queue.
+ * If the incoming and the compared packet meet all these conditions:
+ *  - UDPv4
+ *  - not fragmented
+ *  - TOS like the given value(s)
+ *  - same TOS
+ *  - same source IP
+ *  - same destination IP
+ *  - same destination port
+ * the packet in the queue will be replaced with the incoming packet.
+ *
+ * The variable "overlimit" is the counter of replaced packets
+ *
+ * Background:
+ * In very low bandwidth networks (<=9.6Kbps, shared, etc.) it's hard
+ * (rather: impossible) to get all packets sent.
+ * But some of the packets contain information, which gets obsolete over time.
+ * E.g. (GPS) positions, which will be sent periodically. If the application
+ * sends a new packet while an old position packet is still in the queue, the
+ * old packet is obsolete. This can be dropped. But just dropping the old
+ * packet and queuing the new packet will result in never sending a packet
+ * of this type.
+ * So this qdisc replace the old packet with the new one. The information gets
+ * the chance to get sent - with the newest available information.
+ *
+ * DRAWBACKS:
+ * Its not very CPU cycle saving. But on very low bandwith networks the
+ * application have to be careful with sending packets. And with a propper
+ * configuration, this will be OK.
+ */
+
+struct dentry *dgdir, *dgfile;
+
+#define TOSBITMASK 0
+#define TOSCOMPARE 1
+/* tos Flag. 1.: BitMask. 2.: Compare with */
+static u8 tos[] = {0xFF, 0xFF};
+
+bool pr_packet_to_work_with(struct sk_buff *pkt)
+{
+	struct iphdr *hdr;
+
+	if (unlikely(pkt->protocol != htons(ETH_P_IP)))
+		return false;
+
+	/* Only compare UDP - Layer 4 must be there */
+	if (unlikely(pkt->network_header == NULL))
+		return false;
+
+	hdr = ip_hdr(pkt);
+
+	/* Check for UDPv4 */
+	if (unlikely(hdr->protocol != IPPROTO_UDP))
+		return false;
+
+	/* no fragmented packets */
+	if (unlikely(ip_is_fragment(hdr)))
+		return false;
+
+	/* Correct TOS ? */
+	if ((hdr->tos & tos[TOSBITMASK]) != tos[TOSCOMPARE])
+		return false;
+
+	return true;
+}
+
+bool comp(struct sk_buff *a, struct sk_buff *b)
+{
+	struct iphdr *ah = NULL;
+	struct iphdr *bh = NULL;
+	u32 ipA, ipB;
+	u16 portsA, portsB;
+	int poff;
+	/* The packet has a header
+	 *  - the existence was already checked by "pr_packet_to_work_with" */
+	ah = ip_hdr(a);
+	bh = ip_hdr(b);
+
+	/* TOS must be the same */
+	if (ah->tos != bh->tos)
+		return false;
+
+	/* IP and Port must be the same */
+	ipA = (__force u32)ah->daddr;
+	ipB = (__force u32)bh->daddr;
+	if ((ipA != ipB))
+		return false;
+	ipA = (__force u32)ah->saddr;
+	ipB = (__force u32)bh->saddr;
+	if ((ipA != ipB))
+		return false;
+
+	poff = proto_ports_offset(IPPROTO_UDP);
+	if (unlikely(poff < 0))
+		/* This should be impossible.. */
+		return false;
+
+	/* Src Ports are always different - just compare destination ports */
+	portsA = *(u16 *)((void *)ah + bh->ihl * 4 + poff + 2);
+	portsB = *(u16 *)((void *)bh + ah->ihl * 4 + poff + 2);
+	if ((portsA != portsB))
+		return false;
+
+	return true;
+}
+
+static int pr_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+{
+	struct sk_buff *replace = NULL;
+
+	/* Search, if there is a packet with same IDs */
+	/* Only search, if this packet is worth it */
+	if (pr_packet_to_work_with(skb)) {
+		struct sk_buff *it;
+		skb_queue_walk((&(sch->q)), it) {
+			/* If the other packet is worth it? */
+			if (pr_packet_to_work_with(it)) {
+				if (comp(skb, it)) {
+					replace = it;
+					break;
+				}
+			}
+		}
+	}
+
+	if (replace == NULL) {
+		/* a new kind of packet. Just enqueue */
+		if (likely(skb_queue_len(&sch->q) < sch->limit))
+			return qdisc_enqueue_tail(skb, sch);
+		return qdisc_reshape_fail(skb, sch);
+	} else {
+		/* replace the packet */
+		sch->qstats.overlimits++;
+		/* There is no drop nor replace. So do the replace myself */
+		skb->next = replace->next;
+		skb->prev = replace->prev;
+		if (replace->next != NULL)
+			replace->next->prev = skb;
+		if (replace->prev != NULL)
+			replace->prev->next = skb;
+		kfree_skb(replace);
+		return NET_XMIT_SUCCESS;
+	}
+}
+
+static int pr_init(struct Qdisc *sch, struct nlattr *opt)
+{
+	sch->flags |= TCQ_F_CAN_BYPASS; /* sounds good, but what? */
+	sch->limit = qdisc_dev(sch)->tx_queue_len ? : 1;
+	return 0;
+}
+
+static int pr_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct tc_fifo_qopt opt = { .limit = sch->limit };
+
+	if (nla_put(skb, TCA_OPTIONS, sizeof(opt), &opt))
+		goto nla_put_failure;
+
+	return skb->len;
+
+nla_put_failure:
+	return -1;
+}
+
+struct Qdisc_ops pr_qdisc_ops __read_mostly = {
+	.id		=	"pr",
+	.priv_size	=	0,
+	.enqueue	=	pr_enqueue,
+	.dequeue	=	qdisc_dequeue_head,
+	.peek		=	qdisc_peek_head,
+	.drop		=	qdisc_queue_drop,
+	.init		=	pr_init,
+	.reset		=	qdisc_reset_queue,
+	.change		=	pr_init,
+	.dump		=	pr_dump,
+	.owner		=	THIS_MODULE,
+};
+EXPORT_SYMBOL(pr_qdisc_ops);
+
+/* DebugFS interface as first shot configuration */
+static ssize_t dg_read_file(struct file *file, char __user *userbuf,
+					size_t count, loff_t *ppos)
+{
+	return simple_read_from_buffer(userbuf, count, ppos, tos, 2);
+}
+
+static ssize_t dg_write_file(struct file *file, const char __user *buf,
+					size_t count, loff_t *ppos)
+{
+	u8 tmp[] = {0xFF, 0xFF};
+	int res;
+	if (count != 2)
+		return -EINVAL;
+
+	res = copy_from_user(tmp, buf, count);
+	if (res != 0)
+		return -EINVAL;
+
+	/* Two bytes to copy.. for this a memcpy with errorhandling?!? */
+	tos[0] = tmp[0];
+	tos[1] = tmp[1];
+
+	return count;
+}
+
+static const struct file_operations dgfops = {
+	.read = dg_read_file,
+	.write = dg_write_file,
+};
+
+static int __init pr_module_init(void)
+{
+	bool ret = register_qdisc(&pr_qdisc_ops);
+	if (!ret) {
+		/* open Communication channel */
+		dgdir = debugfs_create_dir("sch_pr", NULL);
+		dgfile = debugfs_create_file("tos", 0644, dgdir, tos, &dgfops);
+	}
+	return ret;
+}
+
+static void __exit pr_module_exit(void)
+{
+	debugfs_remove(dgfile);
+	debugfs_remove(dgdir);
+	unregister_qdisc(&pr_qdisc_ops);
+}
+
+module_init(pr_module_init);
+module_exit(pr_module_exit);
+MODULE_LICENSE("GPL");
-- 
1.7.7

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox