Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] tcp: Fix slowness in read /proc/net/tcp
From: David Miller @ 2010-06-07  7:55 UTC (permalink / raw)
  To: eric.dumazet; +Cc: therbert, netdev, iler.ml
In-Reply-To: <1275895167.2545.8.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 07 Jun 2010 09:19:27 +0200

> Le dimanche 06 juin 2010 à 20:27 -0700, Tom Herbert a écrit :
>> Signed-off-by: Tom Herbert <therbert@google.com>
>> ---
> 
> This problem raises every year,  (last attempt from Yakov Lerner :
> http://kerneltrap.org/mailarchive/linux-netdev/2009/9/26/6256119 )
> 
> And finally, someone motivated enough to use /proc/net/tcp found the
> right answer ;)
> 
> Most netdev people tend to push inet_diag (netlink) interface instead of
> old /proc/net/tcp, but it seems old interface will still be used in
> 2030, so :
> 
> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

Indeed this is the best attempt of this I've seen so far,
applied to net-next-2.6, thanks Tom.

> BTW, another problem of /proc/net/tcp is the buffer size used by netstat
> utility : 1024 bytes instead of PAGE_SIZE, making O(N^2) behavior even
> more palpable.

Probably cap it at 8K like we do for netlink atomic reads, because
there are all sorts of crazy PAGE_SIZE configurations possible out
there, even as high as 512K.

^ permalink raw reply

* Re: [PATCH] asix: check packet size against mtu+ETH_HLEN instead of ETH_FRAME_LEN
From: David Miller @ 2010-06-07  7:57 UTC (permalink / raw)
  To: jussi.kivilinna; +Cc: netdev, dhollis
In-Reply-To: <20100606233531.23585.10292.stgit@fate.lan>

From: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Date: Mon, 07 Jun 2010 02:35:31 +0300

> Driver checks received packet is too large in asix_rx_fixup() and fails if it is. Problem is
> that MTU might be set larger than 1500 and asix fails to work correctly with VLAN tagged
> packets. The check should be 'dev->net->mtu + ETH_HLEN' instead.
> 
> Tested with AX88772.
> 
> Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next-2.6] net-caif: Added missing lock validator constants
From: David Miller @ 2010-06-07  8:01 UTC (permalink / raw)
  To: alex.lorca; +Cc: sjur.brandeland, netdev
In-Reply-To: <1275761704-8758-1-git-send-email-alex.lorca@gmail.com>

From: Alex Lorca <alex.lorca@gmail.com>
Date: Sat,  5 Jun 2010 18:15:04 +0000

> CAIF is using "xxx-AF_MAX" strings for the lock validator. It should use
> its own strings.
> 
> Signed-off-by: Alex Lorca <alex.lorca@gmail.com>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH net-next-2.6] raw: avoid two atomics in xmit
From: David Miller @ 2010-06-07  8:09 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1275639837.2482.17.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 04 Jun 2010 10:23:57 +0200

> Avoid two atomic ops per raw_send_hdrinc() call
> 
> Avoid two atomic ops per raw6_send_hdrinc() call
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.

I'm starting to feel like we now use the implicit skb dst ref stuff
more than the usual cloning, which is great! :-)

^ permalink raw reply

* Re: [PATCH net-next-2.6] net sched: make pedit check for clones instead
From: David Miller @ 2010-06-07  8:09 UTC (permalink / raw)
  To: herbert; +Cc: hadi, jpirko, netdev
In-Reply-To: <20100604130511.GA10925@gondor.apana.org.au>

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 4 Jun 2010 23:05:11 +1000

> On Fri, Jun 04, 2010 at 08:43:06AM -0400, jamal wrote:
>> And here's the second one. I noticed Changli's changes are not yet in
>> net-next. I will wait for that change to propagate and then send the
>> fix to pedit that i discussed with Herbert.
> 
> Thanks Jamal!
> 
> BTW, I think this patch should go in first (before the one that
> removes the OK2MUNGE setting) to be on the safe side.

I applied them this way to net-next-2.6, thanks!

^ permalink raw reply

* Re: [PATCH] 8139too: fix buffer overrun in rtl8139_init_board
From: David Miller @ 2010-06-07  8:14 UTC (permalink / raw)
  To: dkirjanov; +Cc: jeff, netdev
In-Reply-To: <20100605094549.GA11559@hera.kernel.org>

From: Denis Kirjanov <dkirjanov@hera.kernel.org>
Date: Sat, 5 Jun 2010 09:45:49 +0000

> Fix rtl_chip_info buffer overrun when we can't identify the chip.
> (i = ARRAY_SIZE (rtl_chip_info) in this case)
> 
> Signed-off-by: Denis Kirjanov <dkirjanov@kernel.org>

Applied, thanks!

^ permalink raw reply

* Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
From: Mitchell Erblich @ 2010-06-07  8:17 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Stephen Hemminger, xiaohui.xin, netdev, kvm, linux-kernel, mst,
	mingo, davem, herbert, jdike
In-Reply-To: <87pr03gu1c.fsf@basil.nowhere.org>


On Jun 7, 2010, at 12:51 AM, Andi Kleen wrote:

> Stephen Hemminger <shemminger@vyatta.com> writes:
> 
>> Still not sure this is a good idea for a couple of reasons:
>> 
>> 1. We already have lots of special cases with skb's (frags and fraglist),
>>   and skb's travel through a lot of different parts of the kernel.  So any
>>   new change like this creates lots of exposed points for new bugs. Look
>>   at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
>>   and ppp and ...
>> 
>> 2. SKB's can have infinite lifetime in the kernel. If these buffers come from
>>   a fixed size pool in an external device, they can easily all get tied up
>>   if you have a slow listener. What happens then?
> 
> 3. If they come from an internal pool what happens when the kernel runs 
> low on memory? How is that pool balanced against other kernel
> memory users?
> 
> -Andi
> 
> -- 
> ak@linux.intel.com -- Speaking for myself only.

In general,

When an internal pool is created/used, their SHOULD be a reason.
Maybe, to keep allocation latency to a min, OR ...

Now IMO, 

internal pool objects should have a ref count and
if that count is 0, then under memory pressure and/or num 
of objects are above a high water mark, then they are freed,

OR if there is a last reference age field, then the object is to be 
cleaned if dirty, then freed,  

Else, the pool is allowed to grow if the number of objects in the
pool is below a set max (max COULD equal Infinity).



Mitchell Erblich

^ permalink raw reply

* Re: [PATCH] phylib: Add support for the LXT973 phy.
From: David Miller @ 2010-06-07  8:18 UTC (permalink / raw)
  To: richardcochran; +Cc: afleming, netdev
In-Reply-To: <20100605140017.GA2750@riccoc20.at.omicron.at>

From: Richard Cochran <richardcochran@gmail.com>
Date: Sat, 5 Jun 2010 16:00:17 +0200

> On Wed, Jun 02, 2010 at 02:32:11PM -0500, Andy Fleming wrote:
>> Yeah, I was clearly not thinking clearly.  dev_flags will be
>> overwritten, and is not meant for this.  I believe, what we should do
>> is add a "port" field to the PHY device, and if PCR_FIBER_SELECT is
>> set, then set the port field to PORT_FIBRE.  I'm not entirely clear on
>> the semantics of that field in the ethtool cmd, but it seems like the
>> right idea.
> 
> Here is another try. Is that more like it?

I think this is overkill.

One, and only one, PHY driver wants to maintain a boolean state and
we're adding a full 32-bit flags member to the generic PHY device
struct to accomodate this?

Andy if you have ideas to use that state via ethtool or whatever in
the future, you come back to us when the future arrives and you've
implemented the code in the generic PHY lib code to do that stuff.

As it stands right now, that code doesn't exist so accomodating it is
just silly.

I also think worrying about the phy_dev->priv pointer being misused in
the future inside of this driver is a completely bogus argument by any
measure.

We have many cases all over the kernel tree, in drivers and elsewhere,
where we encode integer states in what are normally pointer values
when we need to maintain a small piece of state and don't need to do a
full blown memory allocation to allocate a piece of extra memory to
hold that state.

Richard, please just resubmit your original patch and I will apply it.

Thanks.

^ permalink raw reply

* Re: [PATCH] greth: Remove unnecessary memset of napi member in netdev private data
From: David Miller @ 2010-06-07  8:21 UTC (permalink / raw)
  To: tklauser; +Cc: kristoffer, netdev
In-Reply-To: <1275632661-16406-1-git-send-email-tklauser@distanz.ch>

From: Tobias Klauser <tklauser@distanz.ch>
Date: Fri,  4 Jun 2010 08:24:21 +0200

> The memory for the private data is allocated using kzalloc in
> alloc_etherdev (or alloc_netdev_mq respectively) so there is no need to
> set the napi member to 0 explicitely.
> 
> Signed-off-by: Tobias Klauser <tklauser@distanz.ch>

Applied.

^ permalink raw reply

* Re: [PATCH] htb: remove two unnecessary assignments
From: David Miller @ 2010-06-07  8:21 UTC (permalink / raw)
  To: xiaosuo; +Cc: hadi, netdev
In-Reply-To: <1275651213-3343-1-git-send-email-xiaosuo@gmail.com>

From: Changli Gao <xiaosuo@gmail.com>
Date: Fri,  4 Jun 2010 19:33:33 +0800

> remove two unnecessary assignments
> 
> we don't need to assign NULL when initialize structure objects.
> 
> Signed-off-by: Changli Gao <xiaosuo@gmail.com>

Applied.

^ permalink raw reply

* Re: [RFC] act_cpu: redirect skb receiving to a special CPU.
From: Andi Kleen @ 2010-06-07  8:43 UTC (permalink / raw)
  To: hadi
  Cc: Changli Gao, Eric Dumazet, David S. Miller, Tom Herbert,
	Linux Netdev List
In-Reply-To: <1275746045.3490.60.camel@bigi>

jamal <hadi@cyberus.ca> writes:
>
> I would look at it as "messaging of remote CPU" which may not result
> in an IRQ. I am pretty sure if you tried hard you could use HT in AMD
> hardware - the remote cpu may have an IRQ triggered but it wont be as
> expensive as IPI.

It's unlikely you'll find any way on x86 to do an IPI that is cheaper
than an standard IPI. That is unless you dedicate the receiver to poll
or monitor.

On recent higher end Intel CPUs X2APIC IPIs will be somewhat cheaper
than classical APIC IPIs.

But for a normal "IPI user" like networking it looks all the same,
it's hidden by the architecture code.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply

* Re: [PATCH] tcp: Fix slowness in read /proc/net/tcp
From: Andi Kleen @ 2010-06-07  8:49 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Tom Herbert, davem, netdev, Yakov Lerner
In-Reply-To: <1275895167.2545.8.camel@edumazet-laptop>

Eric Dumazet <eric.dumazet@gmail.com> writes:
>
> BTW, another problem of /proc/net/tcp is the buffer size used by netstat
> utility : 1024 bytes instead of PAGE_SIZE, making O(N^2) behavior even
> more palpable.

That would be easily fixable in netstat.

But I'm not sure net-tools is still maintained as a separate package,
afaik the standard procedure is to submit a patch to a big distribution
and the others take it from there.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply

* Re: [Patch 1/2] s2io: add dynamic LRO disable support
From: Cong Wang @ 2010-06-07  9:01 UTC (permalink / raw)
  To: Ramkrishna Vepa
  Cc: Michal Schmidt, netdev@vger.kernel.org, herbert.xu@redhat.com,
	nhorman@redhat.com, sgruszka@redhat.com, davem@davemloft.net
In-Reply-To: <FCA91A92EE52B041906A0358FC28FCC38EF129DB8F@FRE1EXCH02.hq.exar.com>

On 06/05/10 16:53, Ramkrishna Vepa wrote:
>> +static int s2io_ethtool_set_flags(struct net_device *dev, u32 data)
>> +{
>> +	struct s2io_nic *sp = netdev_priv(dev);
>> +	int rc = 0;
>> +	int changed = 0;
>> +
>> +	if (data&  ETH_FLAG_LRO) {
>> +		if (lro_enable) {
>> +			if (!(dev->features&  NETIF_F_LRO)) {
>> +				dev->features |= NETIF_F_LRO;
>> +				changed = 1;
>> +			}
>> +		} else
>> +			rc = -EINVAL;
>> +	} else if (dev->features&  NETIF_F_LRO) {
>> +		dev->features&= ~NETIF_F_LRO;
>> +		changed = 1;
>> +	}
>> +
>> +	if (changed&&  netif_running(dev)) {
>> +		s2io_stop_all_tx_queue(sp);
>> +		s2io_card_down(sp);
>> +		sp->lro = dev->features&  NETIF_F_LRO;
>> +		rc = s2io_card_up(sp);
> In s2io_card_up, update ring->lro too as it is used in the fast path -
> 		struct ring_info *ring =&mac_control->rings[i];
>
>   		ring->mtu = dev->mtu;
> +		ring->lro = sp->lro;
>

Yeah, I missed this important part.


>> +		s2io_start_all_tx_queue(sp);
>
> The following line in init_shared_mem() is redundant and can be removed.
> -	ring->lro = lro_enable;
>

Okay.

I will send an updated version soon.

Thanks a lot!

^ permalink raw reply

* Re: [Patch] infiniband: check local reserved ports
From: Cong Wang @ 2010-06-07  9:04 UTC (permalink / raw)
  To: Roland Dreier
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Tetsuo Handa,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	sean.hefty-ral2JQCrhuEAvxtiuMwx3w
In-Reply-To: <ada1vcmpyvu.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>

On 06/05/10 00:04, Roland Dreier wrote:
>   >  >  Should this inet_is_reserved_local_port() test apply to all the "port
>   >  >  spaces" that this code is handling?  I honestly am ignorant of the
>   >  >  intended semantics of the new local_reserved_ports stuff, hence my question.
>
>   >  Yes, but I only found this case, is there any else?
>
> My question was more in the other direction: should this test apply to
> all the "port spaces" handled here?  From looking at the code, it
> appears the answer is yes -- it seems that putting a port in
> local_reserved_ports reserves that port for IPv4 and IPv6, UDP, TCP,
> SCTP, DCCP, everything, so we should probably reserve all RDMA CM ports too.

Yes.

So this patch looks good for you? :)

Thanks.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] virtio_net: indicate oom when addbuf returns failure
From: Michael S. Tsirkin @ 2010-06-07  9:15 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Bruce Rogers, netdev, Shirley Ma, virtualization, stable
In-Reply-To: <20100606222441.GA5992@gondor.apana.org.au>

On Mon, Jun 07, 2010 at 08:24:41AM +1000, Herbert Xu wrote:
> On Sun, Jun 06, 2010 at 11:13:00PM +0300, Michael S. Tsirkin wrote:
> >
> > Actually this code looks strange:
> > Note that add_buf inicates out of memory
> > condition with a positive return value, and ring full
> > (which is not an error!) with -ENOSPC.
> 
> Indeed, this ultimately came from
> 
> commit 9ab86bbcf8be755256f0a5e994e0b38af6b4d399
> Author: Shirley Ma <mashirle@us.ibm.com>
> Date:   Fri Jan 29 03:20:04 2010 +0000
> 
>     virtio_net: Defer skb allocation in receive path Date: Wed, 13 Jan 2010 12:53:38 -0800
> 
> (Greg, please don't apply this even though I've just given you
> the upstream commit ID that you were asking for :)
> 
> where it confuses a memory allocation error with an add_buf failure.
> 
> > Possibly the right thing to do is to
> > 1. handle ENOMEM specially
> > 2. fix add_buf to return ENOMEM on error
> 
> I think we should make it so that only a memory allocation error
> is returned as before.  There is no need for returning the add_buf
> error unless add_buf is now doing an allocation itself that needs
> to be retried.

That's what my patch did, right? Ack it?

> Thanks,
> -- 
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* in_dev->refcnt and dev->refcnt
From: ratheesh k @ 2010-06-07  9:20 UTC (permalink / raw)
  To: netdev, linux-net

Is there any relation between in_dev->refcnt and dev->refcnt ?  . My
question is : if  put or get on one variable affects other ?

-Ratheesh

^ permalink raw reply

* Re: in_dev->refcnt and dev->refcnt
From: Eric Dumazet @ 2010-06-07  9:30 UTC (permalink / raw)
  To: ratheesh k; +Cc: netdev, linux-net
In-Reply-To: <AANLkTinijGlJRnqhcEbXhc84MXc62gbshxvTa6JeXo2K@mail.gmail.com>

Le lundi 07 juin 2010 à 14:50 +0530, ratheesh k a écrit :
> Is there any relation between in_dev->refcnt and dev->refcnt ?  . My
> question is : if  put or get on one variable affects other ?
> 

No, there is no relation between them.

You can increase / decrease one refcount or the other, using separate
API.




^ permalink raw reply

* RE: [PATCH] r8169: fix random mdio_write failures
From: hayeswang @ 2010-06-07  9:26 UTC (permalink / raw)
  To: 'Francois Romieu', 'Timo Teräs'; +Cc: netdev, davem
In-Reply-To: <20100605124103.GA3213@electric-eye.fr.zoreil.com>

Our hardware engineer suggests that check the completed indication
per 100 micro seconds. And it needs 20 micro seconds delay after the 
completed indication for the next command.
 
Best Regards,
Hayes


-----Original Message-----
From: Francois Romieu [mailto:romieu@fr.zoreil.com] 
Sent: Saturday, June 05, 2010 8:41 PM
To: Timo Teräs
Cc: netdev@vger.kernel.org; Edward Hsu; Hayeswang; davem@davemloft.net
Subject: Re: [PATCH] r8169: fix random mdio_write failures

Timo Teräs <timo.teras@iki.fi> :
[...]
> diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c index 
> 217e709..03a8318 100644
> --- a/drivers/net/r8169.c
> +++ b/drivers/net/r8169.c
> @@ -559,6 +559,11 @@ static void mdio_write(void __iomem *ioaddr, int
reg_addr, int value)
>  			break;
>  		udelay(25);
>  	}
> +	/*
> +	 * Some configurations require a small delay even after the write
> +	 * completed indication or the next write might fail.
> +	 */
> +	udelay(25);

Acked-off-by: Francois Romieu <romieu@fr.zoreil.com>

Good work.

I wonder if increasing the in-loop delay as well would help the write
succeed faster (or slower ?).

--
Ueimor


------Please consider the environment before printing this e-mail. 




^ permalink raw reply

* Re: [Patch 2/2] mlx4: add dynamic LRO disable support
From: Cong Wang @ 2010-06-07  8:51 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: netdev, herbert.xu, nhorman, sgruszka, davem
In-Reply-To: <1275661552.2095.13.camel@achroite.uk.solarflarecom.com>

On 06/04/10 22:25, Ben Hutchings wrote:
> On Fri, 2010-06-04 at 09:56 +0800, Cong Wang wrote:
>> On 06/03/10 20:37, Ben Hutchings wrote:
>>> On Wed, 2010-06-02 at 23:39 -0400, Amerigo Wang wrote:
>>>> This patch adds dynamic LRO diable support for mlx4 net driver.
>>>> It also fixes a bug of mlx4, which checks NETIF_F_LRO flag in rx
>>>> path without rtnl lock.
>>> [...]
>>>
>>> Is that flag test actually unsafe - and if so, how is testing num_lro
>>> any better?  Perhaps access to net_device::features should be wrapped
>>> with ACCESS_ONCE() to ensure that reads and writes are atomic.
>>>
>>
>> At least, I don't find there is any race with 'num_lro', thus
>> no lock is needed.
>
> In both cases there is a race condition but it is harmless so long as
> the read and the write are atomic.  There is a general assumption in
> networking code that this is the case for int and long.  Personally I
> would prefer to see this made explicit using ACCESS_ONCE(), but I don't
> see any specific problem in mlx4 (not that I'm familiar with this driver
> either).

Hmm, right, it seems mlx4_en_add() is async too.
I will pick your suggestion.


>
> Now that I look at the patch again, I see you're using a static (i.e.
> global) variable to 'back up' the non-zero (enabled) value of num_lro.
> This is introducing a bug!  The correct value is apparently set in
> mlx4_en_get_profile(); you would need to replicate that.
>

Oh, probably, but unfortunately 'num_lro' is static so only visible
in en_main.c.

Thanks!


^ permalink raw reply

* Re: [RFC][PATCH] Fix another namespace issue with devices assigned to  classes
From: Johannes Berg @ 2010-06-07  9:42 UTC (permalink / raw)
  To: Kay Sievers; +Cc: Eric W. Biederman, Greg KH, netdev
In-Reply-To: <AANLkTinUAR2gkY7ipgXKy-pMHcmLuouEeyXxnP97U_7N@mail.gmail.com>

On Sun, 2010-06-06 at 19:17 +0200, Kay Sievers wrote:

> There is no real difference between classes and buses. Actually we're
> working on merging them completely inside the kernel. Just declare a
> "struct bus_type" instead of a "struct class".

Can you please tell me then how to device_create() without a class? I
cannot seem to create devices without a class at all, even using manual
allocation (yuck) and device_register crashes the kernel.

johannes

^ permalink raw reply

* Re: [Bugme-new] [Bug 16120] New: Oops: 0000 [#1] SMP, unable to handle kernel NULL pointer dereference at (null)
From: Eric Dumazet @ 2010-06-07  9:48 UTC (permalink / raw)
  To: Andrew Morton, David Miller
  Cc: netdev, bugzilla-daemon, bugme-daemon, alex.vizor,
	Patrick McHardy
In-Reply-To: <1275730457.5238.14.camel@edumazet-laptop>

Le samedi 05 juin 2010 à 11:34 +0200, Eric Dumazet a écrit :
> Le samedi 05 juin 2010 à 11:17 +0200, Eric Dumazet a écrit :
> > Le vendredi 04 juin 2010 à 16:17 -0700, Andrew Morton a écrit :
> > > (switched to email.  Please respond via emailed reply-to-all, not via the
> > > bugzilla web interface).
> > > 
> > > On Fri, 4 Jun 2010 09:25:58 GMT
> > > bugzilla-daemon@bugzilla.kernel.org wrote:
> > > 
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=16120
> > > > 
> > > >            Summary: Oops: 0000 [#1] SMP, unable to handle kernel NULL
> > > >                     pointer dereference at (null)
> > > >            Product: Platform Specific/Hardware
> > > >            Version: 2.5
> > > >     Kernel Version: 2.6.35-rc1
> > > >           Platform: All
> > > >         OS/Version: Linux
> > > >               Tree: Mainline
> > > >             Status: NEW
> > > >           Severity: high
> > > >           Priority: P1
> > > >          Component: x86-64
> > > >         AssignedTo: platform_x86_64@kernel-bugs.osdl.org
> > > >         ReportedBy: alex.vizor@gmail.com
> > > >         Regression: Yes
> > > > 
> > > > 
> > > > Created an attachment (id=26647)
> > > >  --> (https://bugzilla.kernel.org/attachment.cgi?id=26647) id)
> > 
> > > > 2.6.35-rc1 kernel log
> > > > 
> > > > It happens randomly, almost a week I used 2.6.35-rc1 and don't have any
> > > > problems. But since last day it happened twice.
> > > > 
> > > > I attached kernel log, please inform me if I can help in investigation.
> > > > 
> > > 
> > > ip6mr_sk_done() oopsed.
> > 
> > Only thing I found a first glance is a typo but this should not be the
> > root of the problem.
> > 
> 

I was able to reproduce the problem here, and following patch solves it.

(I see David already committed first patch about macro typo)

Thanks !

[PATCH net-2.6] ipmr: dont corrupt lists

ipmr_rules_exit() and ip6mr_rules_exit() free a list of items, but
forget to properly remove these items from list. List head is not
changed and still points to freed memory.

This can trigger a fault later when icmpv6_sk_exit() is called.

Fix is to either reinit list, or use list_del() to properly remove items
from list before freeing them.

bugzilla report : https://bugzilla.kernel.org/show_bug.cgi?id=16120

Introduced by commit d1db275dd3f6e4 (ipv6: ip6mr: support multiple
tables) and commit f0ad0860d01e (ipv4: ipmr: support multiple tables)

Reported-by: Alex Zhavnerchik <alex.vizor@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Patrick McHardy <kaber@trash.net>
---
 net/ipv4/ipmr.c  |    4 +++-
 net/ipv6/ip6mr.c |    4 +++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 856123f..757f25e 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -267,8 +267,10 @@ static void __net_exit ipmr_rules_exit(struct net *net)
 {
 	struct mr_table *mrt, *next;
 
-	list_for_each_entry_safe(mrt, next, &net->ipv4.mr_tables, list)
+	list_for_each_entry_safe(mrt, next, &net->ipv4.mr_tables, list) {
+		list_del(&mrt->list);
 		kfree(mrt);
+	}
 	fib_rules_unregister(net->ipv4.mr_rules_ops);
 }
 #else
diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index 89c0b07..66078da 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -254,8 +254,10 @@ static void __net_exit ip6mr_rules_exit(struct net *net)
 {
 	struct mr6_table *mrt, *next;
 
-	list_for_each_entry_safe(mrt, next, &net->ipv6.mr6_tables, list)
+	list_for_each_entry_safe(mrt, next, &net->ipv6.mr6_tables, list) {
+		list_del(&mrt->list);
 		ip6mr_free_table(mrt);
+	}
 	fib_rules_unregister(net->ipv6.mr6_rules_ops);
 }
 #else



^ permalink raw reply related

* Re: [v5 Patch 1/3] netpoll: add generic support for bridge and bonding devices
From: Cong Wang @ 2010-06-07  9:57 UTC (permalink / raw)
  To: Andy Gospodarek
  Cc: Jay Vosburgh, Flavio Leitner, linux-kernel, Matt Mackall, netdev,
	bridge, Andy Gospodarek, Neil Horman, Jeff Moyer,
	Stephen Hemminger, bonding-devel, David Miller
In-Reply-To: <20100604191841.GM7497@gospo.rdu.redhat.com>

On 06/05/10 03:18, Andy Gospodarek wrote:
> On Wed, Jun 02, 2010 at 06:04:45PM +0800, Cong Wang wrote:
>> On 06/02/10 02:42, Jay Vosburgh wrote:
>>> Cong Wang<amwang@redhat.com>   wrote:
>>>
>>>> On 06/01/10 03:08, Flavio Leitner wrote:
>>>>> On Mon, May 31, 2010 at 01:56:52PM +0800, Cong Wang wrote:
>>>>>> Hi, Flavio,
>>>>>>
>>>>>> Please use the attached patch instead, try to see if it solves
>>>>>> all your problems.
>>>>>
>>>>> I tried and it hangs. No backtraces this time.
>>>>> The bond_change_active_slave() prints before NETDEV_BONDING_FAILOVER
>>>>> notification, so I think it won't work.
>>>>
>>>> Ah, I thought the same.
>>>>
>>>>>
>>>>> Please, correct if I'm wrong, but when a failover happens with your
>>>>> patch applied, the netconsole would be disabled forever even with
>>>>> another healthy slave, right?
>>>>>
>>>>
>>>> Yes, this is an easy solution, because bonding has several modes,
>>>> it is complex to make netpoll work in different modes.
>>>
>>> 	If I understand correctly, the root cause of the problem with
>>> netconsole and bonding is that bonding is, ultimately, performing
>>> printks with a write lock held, and when netconsole recursively calls
>>> into bonding to send the printk over the netconsole, there is a deadlock
>>> (when the bonding xmit function attempts to acquire the same lock for
>>> read).
>>
>>
>> Yes.
>>
>>>
>>> 	You're trying to avoid the deadlock by shutting off netconsole
>>> (permanently, it looks like) for one problem case: a failover, which
>>> does some printks with a write lock held.
>>>
>>> 	This doesn't look to me like a complete solution, there are
>>> other cases in bonding that will do printk with write locks held.  I
>>> suspect those will also hang netconsole as things exist today, and won't
>>> be affected by your patch below.
>>
>>
>> I can expect that, bonding modes are complex.
>>
>>>
>>> 	For example:
>>>
>>> 	The sysfs functions to set the primary (bonding_store_primary)
>>> or active (bonding_store_active_slave) options: a pr_info is called to
>>> provide a log message of the results.  These could be tested by setting
>>> the primary or active options via sysfs, e.g.,
>>>
>>> echo eth0>   /sys/class/net/bond0/bonding/primary
>>> echo eth0>   /sys/class/net/bond0/bonding/active
>>>
>>> 	If the kernel is defined with DEBUG, there are a few pr_debug
>>> calls within write_locks (bond_del_vlan, for example).
>>>
>>> 	If the slave's underlying device driver's ndo_vlan_rx_register
>>> or ndo_vlan_rx_kill_vid functions call printk (and it looks like some do
>>> for error cases, e.g., igbvf, ehea, enic), those would also presumably
>>> deadlock (because bonding holds its write_lock when calling the ndo_
>>> vlan functions).
>>>
>>> 	It also appears that (with the patch below) some nominally
>>> normal usage patterns will immediately disable netconsole.  The one that
>>> comes to mind is if the primary= option is set (to "eth1" for this
>>> example), but that slave not enslaved first (the slaves are added, say,
>>> eth0 then eth1).  In that situation, when the primary slave (eth1 here)
>>> is added, the first thing that will happen is a failover, and that will
>>> disable netconsole.
>>>
>>
>> Thanks for your detailed explanation!
>>
>> This is why I said bonding is complex. I guess we would have to adjust
>> netpoll code for different bonding cases, one solution seems not fix all.
>> I am not sure how much work to do, since I am not familiar with bonding
>> code. Maybe Andy can help?
>>
>
> Sorry I've been silent until now.  This does seem quite similar to a
> problem I've previously encountered when dealing with bonding+netpoll on
> some old 2.6.9-based kernels.  There is no guarantee the methods used
> there will apply here, but I'll talk about them anyway.
>
> As Flavio noticed, recursive calls into the bond transmit routines were
> not a good idea.  I discovered the same and worked around this issue by
> checking to see if we could take the bond->lock for writing before
> continuing.  If we could not get, I wanted to signal that this should be
> queued for transmission later.  Based on the flow of netpoll_send_skb
> (or possibly for another reason that is escaping me right now) I added
> one of these checks in bond_poll_controller too.  These aren't the
> prettiest fixes, but seemed to work well for me when I did this work in
> the past.  I realize the differences are not that great compared to some
> of the patches posted by Flavio, but I think they are worth trying.


Hmm, I still feel like this way is ugly, although it may work.
I guess David doesn't like it either.

Anyway, Flavio, could you try the following patch as well?

Thanks a lot!

>
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index ef60244..d7b9b99 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -1290,6 +1290,12 @@ static bool slaves_support_netpoll(struct net_device *bond_dev)
>   static void bond_poll_controller(struct net_device *bond_dev)
>   {
>   	struct net_device *dev = bond_dev->npinfo->netpoll->real_dev;
> +	struct bonding *bond = netdev_priv(bond_dev);
> +
> +	if (!write_trylock(&bond->lock))
> +		return;
> +	write_unlock(&bond->lock);
> +
>   	if (dev != bond_dev)
>   		netpoll_poll_dev(dev);
>   }
> @@ -4418,7 +4424,11 @@ static void bond_set_xmit_hash_policy(struct bonding *bond)
>
>   static netdev_tx_t bond_start_xmit(struct sk_buff *skb, struct net_device *dev)
>   {
> -	const struct bonding *bond = netdev_priv(dev);
> +	struct bonding *bond = netdev_priv(dev);
> +
> +	if (!write_trylock(&bond->lock))
> +		return NETDEV_TX_BUSY;
> +	write_unlock(&bond->lock);
>
>   	switch (bond->params.mode) {
>   	case BOND_MODE_ROUNDROBIN:
>
> The other key to all of this is to make sure that queuing is done
> correctly now that we expect to queue these frames and have them sent at
> some point when there is a member of the bond that is actually capable
> of sending them out.
>
> The new style of sending queued skbs in a workqueue is much better than
> what was done in the 2.6.9 timeframe, but careful attention should still
> be paid to txq lock and which processor is the owner.  Returning
> something other than NETDEV_TX_OK from bond_start_xmit and checking for
> locks being held there should also help with any deadlocks that show up
> while running in queue_process (though they would not be recursive).
>
> I'm not in a good spot to test this right now, but I can take a look at
> next week and we can try and track down any of the other deadlocks that
> currently exist as I suspect this will not resolve all of the issues.

^ permalink raw reply

* Re: [PATCH v2] netdev:bfin_mac: reclaim and free tx skb as soon as possible after transfer
From: Eric Dumazet @ 2010-06-07  9:58 UTC (permalink / raw)
  To: Sonic Zhang; +Cc: David Miller, netdev, uclinux-dist-devel
In-Reply-To: <AANLkTimO47rtvIwYAwpaMYtK5SEnhM0m3AxL5q-8lvA7@mail.gmail.com>

Le vendredi 04 juin 2010 à 12:44 +0800, Sonic Zhang a écrit :

> 
> Yes, you are right. dev_kfree_skb_irq() queues used skb to the
> complete queue. But, it is actually freed in the other soft irq
> NET_TX_SOFTIRQ.

I guess you didnt understood my mail, so I'll re-explain :

dev_kfree_skb_irq() queues skb only if packet is not already orphaned.

As most packets are now orphaned (in recent kernels where your patch
applies), dev_kfree_skb_irq() can free packet immediately, with no
NET_TX_SOFTIRQ overhead.

^ permalink raw reply

* Re: [Bugme-new] [Bug 16120] New: Oops: 0000 [#1] SMP, unable to handle kernel NULL pointer dereference at (null)
From: David Miller @ 2010-06-07  9:58 UTC (permalink / raw)
  To: eric.dumazet
  Cc: akpm, netdev, bugzilla-daemon, bugme-daemon, alex.vizor, kaber
In-Reply-To: <1275904120.2545.40.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 07 Jun 2010 11:48:40 +0200

> [PATCH net-2.6] ipmr: dont corrupt lists
> 
> ipmr_rules_exit() and ip6mr_rules_exit() free a list of items, but
> forget to properly remove these items from list. List head is not
> changed and still points to freed memory.
> 
> This can trigger a fault later when icmpv6_sk_exit() is called.
> 
> Fix is to either reinit list, or use list_del() to properly remove items
> from list before freeing them.
> 
> bugzilla report : https://bugzilla.kernel.org/show_bug.cgi?id=16120
> 
> Introduced by commit d1db275dd3f6e4 (ipv6: ip6mr: support multiple
> tables) and commit f0ad0860d01e (ipv4: ipmr: support multiple tables)
> 
> Reported-by: Alex Zhavnerchik <alex.vizor@gmail.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Patrick McHardy <kaber@trash.net>

Applied, thanks a lot Eric.

^ permalink raw reply

* Re: [RFC][PATCH] Fix another namespace issue with devices assigned to classes
From: Kay Sievers @ 2010-06-07  9:53 UTC (permalink / raw)
  To: Johannes Berg; +Cc: Eric W. Biederman, Greg KH, netdev
In-Reply-To: <1275903773.29978.1.camel@jlt3.sipsolutions.net>

On Mon, Jun 7, 2010 at 11:42, Johannes Berg <johannes@sipsolutions.net> wrote:
> On Sun, 2010-06-06 at 19:17 +0200, Kay Sievers wrote:
>
>> There is no real difference between classes and buses. Actually we're
>> working on merging them completely inside the kernel. Just declare a
>> "struct bus_type" instead of a "struct class".
>
> Can you please tell me then how to device_create() without a class? I
> cannot seem to create devices without a class at all, even using manual
> allocation (yuck) and device_register crashes the kernel.

Right, this "convenience API" does not exist for buses. It's not doing
much, just allocates a "struct device" and fills in the few values and
calls device_register().

Does your device create a device node? If not, device_create() should
not be used anyway, because the corresponding device_destroy() will
not do anything.

Kay

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox