Netdev List
 help / color / mirror / Atom feed
* Re: [patch net-next-2.6] net: convert bonding to use rx_handler
From: Jiri Pirko @ 2011-02-18 18:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Patrick McHardy, netdev, davem, shemminger, fubar,
	nicolas.2p.debian, andy
In-Reply-To: <1298045670.6201.73.camel@edumazet-laptop>

Fri, Feb 18, 2011 at 05:14:30PM CET, eric.dumazet@gmail.com wrote:
>Le vendredi 18 février 2011 à 16:50 +0100, Patrick McHardy a écrit :
>> On 18.02.2011 15:58, Jiri Pirko wrote:
>> > Fri, Feb 18, 2011 at 03:46:45PM CET, kaber@trash.net wrote:
>> >> Am 18.02.2011 15:27, schrieb Eric Dumazet:
>> >>> Le vendredi 18 février 2011 à 15:14 +0100, Jiri Pirko a écrit :
>> >>>
>> >>>> Do not know how to do it better. As for percpu variable, not only
>> >>>> origdev would have to be remembered but also probably skb pointer to
>> >>>> know if it's the first run on the skb or not. Can't really figure out a
>> >>>> better solution. Can you?
>> >>>
>> >>> I'll try and let you know.
>> >>
>> >> Why not simply do a lookup on skb->iif?
>> > 
>> > Well I was trying to avoid iterating over list of devices for each
>> > incoming frame.
>> > 
>> 
>> Well, there are a couple of holes on 64 bit, perhaps you can rearrange
>> things and eliminate either iif or input_dev without increasing size
>> since they appear to be redundant.
>
>Jiri
>
>I dont understand why netif_rx() is needed in your patch.

I used netif_rx() because bridge and macvlan does that too. I did not see
a reason to not to do the same.

>
>Can we stack 10 bond devices or so ???
>
>If we avoid this stage and call the real thing (netif_receive_skb()),
>then we dont need adding a field in each skb, since it can be carried by
>a global variable (per cpu of course)
>
I'm probably missing something. How do netif_receive_skb() and
netif_rx() differ in this point of view, since both are calling:
"ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);"
?

Still I see a problem with the percpu global variable. We would have to
store skb pointer there as well and in each __netif_receive_skb() call it
would have to be checked if it's different from the current one.
In that case store new skb and orig_Dev.

Leaving aside that global variables are evil in general, I still think
this is not nicer solution then to add skb->input_dev (although I
understand your arguments).


>bond_handle_frame() being called from __netif_receive_skb() I believe it
>can use netif_receive_skb() instead of netif_rx().
>
>Same remark for vlan_on_bond_hook()
>
>
>

^ permalink raw reply

* Re: BUG: Bad page map in process udevd (anon_vma: (null)) in 2.6.38-rc4
From: Linus Torvalds @ 2011-02-18 18:48 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Michal Hocko, Ingo Molnar, linux-mm, LKML, David Miller,
	Eric Dumazet, netdev, Arnaldo Carvalho de Melo
In-Reply-To: <m17hcx43m3.fsf@fess.ebiederm.org>

On Fri, Feb 18, 2011 at 10:08 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> I am still getting programs segfaulting but that is happening on other
> machines running on older kernels so I am going to chalk that up to a
> buggy test and a false positive.

Ok.

> I am have OOM problems getting my tests run to complete.  On a good
> day that happens about 1 time in 3 right now.  I'm guess I will have
> to turn off DEBUG_PAGEALLOC to get everything to complete.
> DEBUG_PAGEALLOC causes us to use more memory doesn't it?

It does use a bit more memory, but it shouldn't be _that_ noticeable.
The real cost of DEBUG_PAGEALLOC is all the crazy page table
operations and TLB flushes we do for each allocation/deallocation. So
DEBUG_PAGEALLOC is very CPU-intensive, but it shouldn't have _that_
much of a memory overhead - just some trivial overhead due to not
being able to use largepages for the normal kernel identity mappings.

But there might be some other interaction with OOM that I haven't thought about.

> The most interesting thing I have right now is a networking lockdep
> issue.  Does anyone know what is going on there?

This seems to be a fairly straightforward bug.

In net/ipv4/inet_timewait_sock.c we have this:

  /* These are always called from BH context.  See callers in
   * tcp_input.c to verify this.
   */

  /* This is for handling early-kills of TIME_WAIT sockets. */
  void inet_twsk_deschedule(struct inet_timewait_sock *tw,
                            struct inet_timewait_death_row *twdr)
  {
          spin_lock(&twdr->death_lock);
          ..

and the intention is clearly that that spin_lock is BH-safe because
it's called from BH context.

Except that clearly isn't true. It's called from a worker thread:

> stack backtrace:
> Pid: 10833, comm: kworker/u:1 Not tainted 2.6.38-rc4-359399.2010AroraKernelBeta.fc14.x86_64 #1
> Call Trace:
>  [<ffffffff81460e69>] ? inet_twsk_deschedule+0x29/0xa0
>  [<ffffffff81460fd6>] ? inet_twsk_purge+0xf6/0x180
>  [<ffffffff81460f10>] ? inet_twsk_purge+0x30/0x180
>  [<ffffffff814760fc>] ? tcp_sk_exit_batch+0x1c/0x20
>  [<ffffffff8141c1d3>] ? ops_exit_list.clone.0+0x53/0x60
>  [<ffffffff8141c520>] ? cleanup_net+0x100/0x1b0
>  [<ffffffff81068c47>] ? process_one_work+0x187/0x4b0
>  [<ffffffff81068be1>] ? process_one_work+0x121/0x4b0
>  [<ffffffff8141c420>] ? cleanup_net+0x0/0x1b0
>  [<ffffffff8106a65c>] ? worker_thread+0x15c/0x330

so it can deadlock with a BH happening at the same time, afaik.

The code (and comment) is all from 2005, it looks like the BH->worker
thread has broken the code. But somebody who knows that code better
should take a deeper look at it.

Added acme to the cc, since the code is attributed to him back in 2005
;). Although I don't know how active he's been in networking lately
(seems to be all perf-related). Whatever, it can't hurt.

                   Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: BUG: Bad page map in process udevd (anon_vma: (null)) in 2.6.38-rc4
From: Arnaldo Carvalho de Melo @ 2011-02-18 19:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Michal Hocko, Ingo Molnar, linux-mm, LKML,
	David Miller, Eric Dumazet, netdev
In-Reply-To: <AANLkTikh4oaR6CBK3NBazer7yjhE0VndsUB5FCDRsbJc@mail.gmail.com>

Em Fri, Feb 18, 2011 at 10:48:18AM -0800, Linus Torvalds escreveu:
> On Fri, Feb 18, 2011 at 10:08 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
> >
> > I am still getting programs segfaulting but that is happening on other
> > machines running on older kernels so I am going to chalk that up to a
> > buggy test and a false positive.
> 
> Ok.
> 
> > I am have OOM problems getting my tests run to complete.  On a good
> > day that happens about 1 time in 3 right now.  I'm guess I will have
> > to turn off DEBUG_PAGEALLOC to get everything to complete.
> > DEBUG_PAGEALLOC causes us to use more memory doesn't it?
> 
> It does use a bit more memory, but it shouldn't be _that_ noticeable.
> The real cost of DEBUG_PAGEALLOC is all the crazy page table
> operations and TLB flushes we do for each allocation/deallocation. So
> DEBUG_PAGEALLOC is very CPU-intensive, but it shouldn't have _that_
> much of a memory overhead - just some trivial overhead due to not
> being able to use largepages for the normal kernel identity mappings.
> 
> But there might be some other interaction with OOM that I haven't thought about.
> 
> > The most interesting thing I have right now is a networking lockdep
> > issue.  Does anyone know what is going on there?
> 
> This seems to be a fairly straightforward bug.
> 
> In net/ipv4/inet_timewait_sock.c we have this:
> 
>   /* These are always called from BH context.  See callers in
>    * tcp_input.c to verify this.
>    */
> 
>   /* This is for handling early-kills of TIME_WAIT sockets. */
>   void inet_twsk_deschedule(struct inet_timewait_sock *tw,
>                             struct inet_timewait_death_row *twdr)
>   {
>           spin_lock(&twdr->death_lock);
>           ..
> 
> and the intention is clearly that that spin_lock is BH-safe because
> it's called from BH context.
> 
> Except that clearly isn't true. It's called from a worker thread:
> 
> > stack backtrace:
> > Pid: 10833, comm: kworker/u:1 Not tainted 2.6.38-rc4-359399.2010AroraKernelBeta.fc14.x86_64 #1
> > Call Trace:
> >  [<ffffffff81460e69>] ? inet_twsk_deschedule+0x29/0xa0
> >  [<ffffffff81460fd6>] ? inet_twsk_purge+0xf6/0x180
> >  [<ffffffff81460f10>] ? inet_twsk_purge+0x30/0x180
> >  [<ffffffff814760fc>] ? tcp_sk_exit_batch+0x1c/0x20
> >  [<ffffffff8141c1d3>] ? ops_exit_list.clone.0+0x53/0x60
> >  [<ffffffff8141c520>] ? cleanup_net+0x100/0x1b0
> >  [<ffffffff81068c47>] ? process_one_work+0x187/0x4b0
> >  [<ffffffff81068be1>] ? process_one_work+0x121/0x4b0
> >  [<ffffffff8141c420>] ? cleanup_net+0x0/0x1b0
> >  [<ffffffff8106a65c>] ? worker_thread+0x15c/0x330
> 
> so it can deadlock with a BH happening at the same time, afaik.
> 
> The code (and comment) is all from 2005, it looks like the BH->worker
> thread has broken the code. But somebody who knows that code better
> should take a deeper look at it.
> 
> Added acme to the cc, since the code is attributed to him back in 2005
> ;). Although I don't know how active he's been in networking lately
> (seems to be all perf-related). Whatever, it can't hurt.

Original code is ANK's, I just made it possible to use with DCCP, and
yeah, the smiley is appropriate, something 6 years old and the world
around it changing continually... well, thanks for the git blame ;-)

- Arnaldo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: BUG: Bad page map in process udevd (anon_vma: (null)) in 2.6.38-rc4
From: Arnaldo Carvalho de Melo @ 2011-02-18 19:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Michal Hocko, Ingo Molnar, linux-mm, LKML,
	David Miller, Eric Dumazet, netdev, Pavel Emelyanov
In-Reply-To: <20110218190128.GF13211@ghostprotocols.net>

Em Fri, Feb 18, 2011 at 05:01:28PM -0200, Arnaldo Carvalho de Melo escreveu:
> Em Fri, Feb 18, 2011 at 10:48:18AM -0800, Linus Torvalds escreveu:
> > This seems to be a fairly straightforward bug.
> > 
> > In net/ipv4/inet_timewait_sock.c we have this:
> > 
> >   /* These are always called from BH context.  See callers in
> >    * tcp_input.c to verify this.
> >    */
> > 
> >   /* This is for handling early-kills of TIME_WAIT sockets. */
> >   void inet_twsk_deschedule(struct inet_timewait_sock *tw,
> >                             struct inet_timewait_death_row *twdr)
> >   {
> >           spin_lock(&twdr->death_lock);
> >           ..
> > 
> > and the intention is clearly that that spin_lock is BH-safe because
> > it's called from BH context.
> > 
> > Except that clearly isn't true. It's called from a worker thread:
> > 
> > > stack backtrace:
> > > Pid: 10833, comm: kworker/u:1 Not tainted 2.6.38-rc4-359399.2010AroraKernelBeta.fc14.x86_64 #1
> > > Call Trace:
> > >  [<ffffffff81460e69>] ? inet_twsk_deschedule+0x29/0xa0
> > >  [<ffffffff81460fd6>] ? inet_twsk_purge+0xf6/0x180
> > >  [<ffffffff81460f10>] ? inet_twsk_purge+0x30/0x180
> > >  [<ffffffff814760fc>] ? tcp_sk_exit_batch+0x1c/0x20
> > >  [<ffffffff8141c1d3>] ? ops_exit_list.clone.0+0x53/0x60
> > >  [<ffffffff8141c520>] ? cleanup_net+0x100/0x1b0
> > >  [<ffffffff81068c47>] ? process_one_work+0x187/0x4b0
> > >  [<ffffffff81068be1>] ? process_one_work+0x121/0x4b0
> > >  [<ffffffff8141c420>] ? cleanup_net+0x0/0x1b0
> > >  [<ffffffff8106a65c>] ? worker_thread+0x15c/0x330
> > 
> > so it can deadlock with a BH happening at the same time, afaik.
> > 
> > The code (and comment) is all from 2005, it looks like the BH->worker
> > thread has broken the code. But somebody who knows that code better
> > should take a deeper look at it.
> > 
> > Added acme to the cc, since the code is attributed to him back in 2005
> > ;). Although I don't know how active he's been in networking lately
> > (seems to be all perf-related). Whatever, it can't hurt.
> 
> Original code is ANK's, I just made it possible to use with DCCP, and
> yeah, the smiley is appropriate, something 6 years old and the world
> around it changing continually... well, thanks for the git blame ;-)

But yeah, your analisys seems correct, with the bug being introduced by
one of these world around it changing continually issues, networking
namespaces broke the rules of the game on its cleanup_net() routine,
adding Pavel to the CC list since it doesn't hurt ;-)

- Arnaldo

^ permalink raw reply

* Re: BUG: Bad page map in process udevd (anon_vma: (null)) in 2.6.38-rc4
From: Eric Dumazet @ 2011-02-18 19:13 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Linus Torvalds, Eric W. Biederman, Michal Hocko, Ingo Molnar,
	linux-mm, LKML, David Miller, netdev
In-Reply-To: <20110218190128.GF13211@ghostprotocols.net>

Le vendredi 18 février 2011 à 17:01 -0200, Arnaldo Carvalho de Melo a
écrit :

> Original code is ANK's, I just made it possible to use with DCCP, and
> yeah, the smiley is appropriate, something 6 years old and the world
> around it changing continually... well, thanks for the git blame ;-)
> 

At that time, net namespaces did not exist.

I would blame people implementing them ;)




^ permalink raw reply

* Re: Possible netfilter-related memory corruption in 2.6.37
From: Eric Dumazet @ 2011-02-18 19:14 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Jan Engelhardt, Avi Kivity, netfilter-devel, Marcelo Tosatti,
	nicolas prochazka, KVM list, netdev
In-Reply-To: <4D5EBC6D.6070200@trash.net>

Le vendredi 18 février 2011 à 19:37 +0100, Patrick McHardy a écrit :
> Am 14.02.2011 17:52, schrieb Patrick McHardy:
> > Am 14.02.2011 17:48, schrieb Eric Dumazet:
> >> I am not sure, but I guess nf_reinject() needs a fix too ;)
> > 
> > I agree. That one looks uglier though, I guess we'll have to
> > iterate through all hooks to note the previous one.
> 
> How about this? Unfortunately I don't think we can avoid
> iterating through all hooks without violating RCU rules.
> 
> 

       /* Continue traversal iff userspace said ok... */
        if (verdict == NF_REPEAT) {
-               elem = elem->prev;
-               verdict = NF_ACCEPT;
+               prev = NULL;
+               list_for_each_entry_rcu(i,
&nf_hooks[entry->pf][entry->hook],
+                                       list) {
+                       if (&i->list == elem)
+                               break;
+                       prev = i;

	
Hmm... what happens if "elem" was the first elem in list ?

We exit with prev = NULL  --> NF_DROP ?

I must miss something...

+               }
+
+               if (prev == NULL ||
+                   &i->list == &nf_hooks[entry->pf][entry->hook])
+                       verdict = NF_DROP;
+               else {
+                       elem = &prev->list;
+                       verdict = NF_ACCEPT;
+               }
        }



--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [patch net-next-2.6] net: convert bonding to use rx_handler
From: Eric Dumazet @ 2011-02-18 19:17 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Patrick McHardy, netdev, davem, shemminger, fubar,
	nicolas.2p.debian, andy
In-Reply-To: <20110218184725.GA2602@psychotron.redhat.com>

Le vendredi 18 février 2011 à 19:47 +0100, Jiri Pirko a écrit :
> Fri, Feb 18, 2011 at 05:14:30PM CET, eric.dumazet@gmail.com wrote:
> >Le vendredi 18 février 2011 à 16:50 +0100, Patrick McHardy a écrit :
> >> On 18.02.2011 15:58, Jiri Pirko wrote:
> >> > Fri, Feb 18, 2011 at 03:46:45PM CET, kaber@trash.net wrote:
> >> >> Am 18.02.2011 15:27, schrieb Eric Dumazet:
> >> >>> Le vendredi 18 février 2011 à 15:14 +0100, Jiri Pirko a écrit :
> >> >>>
> >> >>>> Do not know how to do it better. As for percpu variable, not only
> >> >>>> origdev would have to be remembered but also probably skb pointer to
> >> >>>> know if it's the first run on the skb or not. Can't really figure out a
> >> >>>> better solution. Can you?
> >> >>>
> >> >>> I'll try and let you know.
> >> >>
> >> >> Why not simply do a lookup on skb->iif?
> >> > 
> >> > Well I was trying to avoid iterating over list of devices for each
> >> > incoming frame.
> >> > 
> >> 
> >> Well, there are a couple of holes on 64 bit, perhaps you can rearrange
> >> things and eliminate either iif or input_dev without increasing size
> >> since they appear to be redundant.
> >
> >Jiri
> >
> >I dont understand why netif_rx() is needed in your patch.
> 
> I used netif_rx() because bridge and macvlan does that too. I did not see
> a reason to not to do the same.
> 
> >
> >Can we stack 10 bond devices or so ???
> >
> >If we avoid this stage and call the real thing (netif_receive_skb()),
> >then we dont need adding a field in each skb, since it can be carried by
> >a global variable (per cpu of course)
> >
> I'm probably missing something. How do netif_receive_skb() and
> netif_rx() differ in this point of view, since both are calling:
> "ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);"
> ?
> 
> Still I see a problem with the percpu global variable. We would have to
> store skb pointer there as well and in each __netif_receive_skb() call it
> would have to be checked if it's different from the current one.
> In that case store new skb and orig_Dev.
> 
> Leaving aside that global variables are evil in general, I still think
> this is not nicer solution then to add skb->input_dev (although I
> understand your arguments).

Really I must miss something about "global variables" thing/fear.

Kernel is full of global variables, they are not evil if properly used.

Take a look at net/core/dev.c :

static DEFINE_PER_CPU(int, xmit_recursion);

For an example of what I have in mind.




^ permalink raw reply

* Re: [PATCH v6 0/9] net: Unified offload configuration
From: Michał Mirosław @ 2011-02-18 19:27 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: David Miller, netdev
In-Reply-To: <1298039371.2211.9.camel@localhost>

On Fri, Feb 18, 2011 at 02:29:31PM +0000, Ben Hutchings wrote:
> On Fri, 2011-02-18 at 15:22 +0100, Michał Mirosław wrote:
> > On Thu, Feb 17, 2011 at 02:56:11PM -0800, David Miller wrote:
> [...]
> > > Please get rid of that annoying message spit out by netif_features_change(),
> > > it's just spam.  If we want notifications for stuff like this, use a
> > > non-unicast netlink message so those who want to hear it can do so.
> > You mean netdev_update_features() "Features changed" message? Is it ok
> > to just demote it to DEBUG level or you want to remove it altogether?
> > What about netdev_fix_features() messages?
> I think you need to emit these messages at 'error' severity when fixing
> up features for a newly-added device, but at 'debug' later on.

Why the difference? Those messages on startup matter mostly to developers
and later mostly to users. Anyway, the conditions in netdev_fix_features()
are all constant (dependent only on net core implementation).

Best Regards,
Michał Mirosław

^ permalink raw reply

* Re: [patch net-next-2.6] net: convert bonding to use rx_handler
From: Jiri Pirko @ 2011-02-18 19:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Patrick McHardy, netdev, davem, shemminger, fubar,
	nicolas.2p.debian, andy
In-Reply-To: <1298056657.2425.28.camel@edumazet-laptop>

Fri, Feb 18, 2011 at 08:17:37PM CET, eric.dumazet@gmail.com wrote:
>Le vendredi 18 février 2011 à 19:47 +0100, Jiri Pirko a écrit :
>> Fri, Feb 18, 2011 at 05:14:30PM CET, eric.dumazet@gmail.com wrote:
>> >Le vendredi 18 février 2011 à 16:50 +0100, Patrick McHardy a écrit :
>> >> On 18.02.2011 15:58, Jiri Pirko wrote:
>> >> > Fri, Feb 18, 2011 at 03:46:45PM CET, kaber@trash.net wrote:
>> >> >> Am 18.02.2011 15:27, schrieb Eric Dumazet:
>> >> >>> Le vendredi 18 février 2011 à 15:14 +0100, Jiri Pirko a écrit :
>> >> >>>
>> >> >>>> Do not know how to do it better. As for percpu variable, not only
>> >> >>>> origdev would have to be remembered but also probably skb pointer to
>> >> >>>> know if it's the first run on the skb or not. Can't really figure out a
>> >> >>>> better solution. Can you?
>> >> >>>
>> >> >>> I'll try and let you know.
>> >> >>
>> >> >> Why not simply do a lookup on skb->iif?
>> >> > 
>> >> > Well I was trying to avoid iterating over list of devices for each
>> >> > incoming frame.
>> >> > 
>> >> 
>> >> Well, there are a couple of holes on 64 bit, perhaps you can rearrange
>> >> things and eliminate either iif or input_dev without increasing size
>> >> since they appear to be redundant.
>> >
>> >Jiri
>> >
>> >I dont understand why netif_rx() is needed in your patch.
>> 
>> I used netif_rx() because bridge and macvlan does that too. I did not see
>> a reason to not to do the same.
>> 
>> >
>> >Can we stack 10 bond devices or so ???
>> >
>> >If we avoid this stage and call the real thing (netif_receive_skb()),
>> >then we dont need adding a field in each skb, since it can be carried by
>> >a global variable (per cpu of course)
>> >
>> I'm probably missing something. How do netif_receive_skb() and
>> netif_rx() differ in this point of view, since both are calling:
>> "ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);"
>> ?
>> 
>> Still I see a problem with the percpu global variable. We would have to
>> store skb pointer there as well and in each __netif_receive_skb() call it
>> would have to be checked if it's different from the current one.
>> In that case store new skb and orig_Dev.
>> 
>> Leaving aside that global variables are evil in general, I still think
>> this is not nicer solution then to add skb->input_dev (although I
>> understand your arguments).
>
>Really I must miss something about "global variables" thing/fear.
>
>Kernel is full of global variables, they are not evil if properly used.

I know. But that doesn't mean it's ok. But I see your point.

>
>Take a look at net/core/dev.c :
>
>static DEFINE_PER_CPU(int, xmit_recursion);
>
>For an example of what I have in mind.

Yes I saw this. We would have to do something like:

struct skb_rx_context {
	struct sk_buff *skb;
	struct net_device *orig_dev;
};

static DEFINE_PER_CPU(struct skb_rx_context, skb_rx_context);

and then in __netif_receive_skb():

struct skb_rx_context *cont = __this_cpu_read(skb_rx_context);

if (cont->skb != skb) {
	cont->skb = skb;
	orig_dev = cont->orig_dev = skb->dev;
} else {
	orig_dev = cont->orig_dev;
}


Does this make sense?

>
>
>

^ permalink raw reply

* Re: [PATCH v2] net: provide default_advmss() methods to blackhole dst_ops
From: David Miller @ 2011-02-18 19:39 UTC (permalink / raw)
  To: eric.dumazet; +Cc: xiaosuo, linux, linux-kernel, netdev, roland, dbaluta
In-Reply-To: <1298036672.6201.65.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 18 Feb 2011 14:44:32 +0100

> Le vendredi 18 février 2011 à 14:33 +0100, Eric Dumazet a écrit :
> 
>> I had this exact idea but found we need struct net pointer to get this
>> value, not provided in parameters, so I falled back to the 256 value.
>> 
>> 
> 
> Hmm, reading again this stuff, maybe we can just use
> ipv4_default_advmss() instead of a custom one.
> 
> dst->dev should be available
> 
> [PATCH] net: provide default_advmss() methods to blackhole dst_ops
> 
> Commit 0dbaee3b37e118a (net: Abstract default ADVMSS behind an
> accessor.) introduced a possible crash in tcp_connect_init(), when
> dst->default_advmss() is called from dst_metric_advmss()
> 
> Reported-by: George Spelvin <linux@horizon.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Yes, this is a lot better, applied.

Thanks!

^ permalink raw reply

* Re: [PATCH] net: Add default_advmss() methods to blackhole dst_ops
From: George Spelvin @ 2011-02-18 19:58 UTC (permalink / raw)
  To: davem, eric.dumazet, linux; +Cc: linux-kernel, netdev, roland
In-Reply-To: <1298023023.2595.170.camel@edumazet-laptop>

This makes sense to me; I was having trouble at the time with a loose
ethernet cable (broken RJ45 retention clip), so the cable could have
come unplugged at the time I tried to make the connection.

Thank you very much!

^ permalink raw reply

* Re: [patch net-next-2.6] net: convert bonding to use rx_handler
From: Eric Dumazet @ 2011-02-18 19:58 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Patrick McHardy, netdev, davem, shemminger, fubar,
	nicolas.2p.debian, andy
In-Reply-To: <20110218192856.GB2602@psychotron.redhat.com>

Le vendredi 18 février 2011 à 20:28 +0100, Jiri Pirko a écrit :

> Yes I saw this. We would have to do something like:
> 
> struct skb_rx_context {
> 	struct sk_buff *skb;
> 	struct net_device *orig_dev;
> };
> 
> static DEFINE_PER_CPU(struct skb_rx_context, skb_rx_context);
> 
> and then in __netif_receive_skb():
> 
> struct skb_rx_context *cont = __this_cpu_read(skb_rx_context);
> 
> if (cont->skb != skb) {
> 	cont->skb = skb;
> 	orig_dev = cont->orig_dev = skb->dev;
> } else {
> 	orig_dev = cont->orig_dev;
> }
> 
> 
> Does this make sense?

Well, yes, something like that, but I think you dont need to keep a
pointer to current skb. (It would not work if one handled/freed, same
'skb pointer' is reused a bit later)

Sorry, I wont have time to look at this right now, its now 21h00 in
France, time to get some time with family ;)

See you !



^ permalink raw reply

* Re: [patch net-next-2.6] net: convert bonding to use rx_handler
From: Jiri Pirko @ 2011-02-18 20:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Patrick McHardy, netdev, davem, shemminger, fubar,
	nicolas.2p.debian, andy
In-Reply-To: <1298059106.2425.31.camel@edumazet-laptop>

Fri, Feb 18, 2011 at 08:58:26PM CET, eric.dumazet@gmail.com wrote:
>Le vendredi 18 février 2011 à 20:28 +0100, Jiri Pirko a écrit :
>
>> Yes I saw this. We would have to do something like:
>> 
>> struct skb_rx_context {
>> 	struct sk_buff *skb;
>> 	struct net_device *orig_dev;
>> };
>> 
>> static DEFINE_PER_CPU(struct skb_rx_context, skb_rx_context);
>> 
>> and then in __netif_receive_skb():
>> 
>> struct skb_rx_context *cont = __this_cpu_read(skb_rx_context);
>> 
>> if (cont->skb != skb) {
>> 	cont->skb = skb;
>> 	orig_dev = cont->orig_dev = skb->dev;
>> } else {
>> 	orig_dev = cont->orig_dev;
>> }
>> 
>> 
>> Does this make sense?
>
>Well, yes, something like that, but I think you dont need to keep a
>pointer to current skb. (It would not work if one handled/freed, same
>'skb pointer' is reused a bit later)

Well I think I need. How else should I distinguish that new skb (first
time in __netif_receive_skb) is there and I need to remember orig_dev?

>
>Sorry, I wont have time to look at this right now, its now 21h00 in
>France, time to get some time with family ;)

Np, same time here in CZ :)


>
>See you !
>
>

^ permalink raw reply

* Re: [PATCH v6 0/9] net: Unified offload configuration
From: David Miller @ 2011-02-18 20:06 UTC (permalink / raw)
  To: bhutchings; +Cc: mirq-linux, netdev
In-Reply-To: <1298039371.2211.9.camel@localhost>

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Fri, 18 Feb 2011 14:29:31 +0000

> On Fri, 2011-02-18 at 15:22 +0100, Michał Mirosław wrote:
>> On Thu, Feb 17, 2011 at 02:56:11PM -0800, David Miller wrote:
> [...]
>> > Please get rid of that annoying message spit out by netif_features_change(),
>> > it's just spam.  If we want notifications for stuff like this, use a
>> > non-unicast netlink message so those who want to hear it can do so.
>> 
>> You mean netdev_update_features() "Features changed" message? Is it ok
>> to just demote it to DEBUG level or you want to remove it altogether?
>> What about netdev_fix_features() messages?
> 
> I think you need to emit these messages at 'error' severity when fixing
> up features for a newly-added device, but at 'debug' later on.

I get one several minutes after every boot for a completely
unregistered device for some reason:

[119704.730965] (unregistered net_device): Features changed: 0x00011065 -> 0x00015065

^ permalink raw reply

* Re: [patch net-next-2.6] net: convert bonding to use rx_handler
From: David Miller @ 2011-02-18 20:06 UTC (permalink / raw)
  To: jpirko
  Cc: kaber, eric.dumazet, netdev, shemminger, fubar, nicolas.2p.debian,
	andy
In-Reply-To: <20110218145850.GF2939@psychotron.redhat.com>

From: Jiri Pirko <jpirko@redhat.com>
Date: Fri, 18 Feb 2011 15:58:51 +0100

> Fri, Feb 18, 2011 at 03:46:45PM CET, kaber@trash.net wrote:
>>Am 18.02.2011 15:27, schrieb Eric Dumazet:
>>> Le vendredi 18 février 2011 à 15:14 +0100, Jiri Pirko a écrit :
>>> 
>>>> Do not know how to do it better. As for percpu variable, not only
>>>> origdev would have to be remembered but also probably skb pointer to
>>>> know if it's the first run on the skb or not. Can't really figure out a
>>>> better solution. Can you?
>>> 
>>> I'll try and let you know.
>>
>>Why not simply do a lookup on skb->iif?
> 
> Well I was trying to avoid iterating over list of devices for each
> incoming frame.

It is not list, it is hash :-)

^ permalink raw reply

* Re: [PATCH ethtool] ethtool: Warn if we cannot set the advertising mask for specified speed/duplex
From: Ben Hutchings @ 2011-02-18 20:08 UTC (permalink / raw)
  To: netdev; +Cc: linux-net-drivers, Naveen Chouksey
In-Reply-To: <1298044598.2570.4.camel@bwh-desktop>

On Fri, 2011-02-18 at 15:56 +0000, Ben Hutchings wrote:
> On Thu, 2011-02-17 at 19:06 +0000, Ben Hutchings wrote:
> > When autonegotiation is enabled, drivers must determine link speed and
> > duplex through the autonegotiation process and will generally ignore
> > the speed and duplex specified in struct ethtool_cmd.  If the user
> > specifies a recognised combination of speed and duplex, we set the
> > advertising mask to include only the matching mode.  Otherwise, we
> > advertise all supported modes.  We should really limit the advertised
> > modes separately by speed and duplex, but for now we just warn.
> > 
> > Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
> > ---
> > This is a longstanding bug in ethtool which people keep getting bitten
> > by.  Naveen was just the latest to report it.  I have been working on
> > some changes to improve link advertising and reporting, but until those
> > are ready I'm adding this warning.
> 
> That patch fixes an additional bug: we will now update the advertising
> mask based on speed/duplex when autoneg is already on.  Unfortunately it
> introduces a bug: we will update advertising even if none of autoneg,
> speed or duplex are specified!
> 
> Here's a second version which I think will do the right thing in all
> cases, though I haven't tested it yet.  I'll include this in ethtool
> 2.6.38 if no-one can spot a flaw.
[...]
> +			if (autoneg_wanted == AUTONEG_ENABLE &&
> +			    advertising_wanted == 0) {
> +				ecmd.advertising = ecmd.supported &
> +					(ADVERTISED_10baseT_Half |
> +					 ADVERTISED_10baseT_Full |
> +					 ADVERTISED_100baseT_Half |
> +					 ADVERTISED_100baseT_Full |
> +					 ADVERTISED_1000baseT_Half |
> +					 ADVERTISED_1000baseT_Full |
> +					 ADVERTISED_2500baseX_Full |
> +					 ADVERTISED_10000baseT_Full);
> +			} else if (advertising_wanted != -1) {

This condition should be (advertising_wanted > 0).  With that last
change, ethtool seems to do the right thing.

Ben.

> +				ecmd.advertising = advertising_wanted;
>  			}
>  
>  			/* Try to perform the update. */

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCH 1/2] ipv4: Add hash table of interface addresses.
From: David Miller @ 2011-02-18 20:12 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1298010150.2642.5.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 18 Feb 2011 07:22:30 +0100

> Maybe you should take into account net pointer here, or machines with
> many net namespaces will hash collide for 127.0.0.1

Good idea.  I'll post a new series with that fixed up later.

Thanks!

^ permalink raw reply

* Re: [patch net-next-2.6] net: convert bonding to use rx_handler
From: Jiri Pirko @ 2011-02-18 20:13 UTC (permalink / raw)
  To: David Miller
  Cc: kaber, eric.dumazet, netdev, shemminger, fubar, nicolas.2p.debian,
	andy
In-Reply-To: <20110218.120656.104048936.davem@davemloft.net>

Fri, Feb 18, 2011 at 09:06:56PM CET, davem@davemloft.net wrote:
>From: Jiri Pirko <jpirko@redhat.com>
>Date: Fri, 18 Feb 2011 15:58:51 +0100
>
>> Fri, Feb 18, 2011 at 03:46:45PM CET, kaber@trash.net wrote:
>>>Am 18.02.2011 15:27, schrieb Eric Dumazet:
>>>> Le vendredi 18 février 2011 à 15:14 +0100, Jiri Pirko a écrit :
>>>> 
>>>>> Do not know how to do it better. As for percpu variable, not only
>>>>> origdev would have to be remembered but also probably skb pointer to
>>>>> know if it's the first run on the skb or not. Can't really figure out a
>>>>> better solution. Can you?
>>>> 
>>>> I'll try and let you know.
>>>
>>>Why not simply do a lookup on skb->iif?
>> 
>> Well I was trying to avoid iterating over list of devices for each
>> incoming frame.
>
>It is not list, it is hash :-)

Even if, Do you think that this would be ok?


^ permalink raw reply

* Re: [PATCH 1/2] net: dont leave active on stack LIST_HEAD
From: David Miller @ 2011-02-18 20:14 UTC (permalink / raw)
  To: eric.dumazet
  Cc: torvalds, ebiederm, opurdila, mingo, mhocko, linux-mm,
	linux-kernel, netdev
In-Reply-To: <1298019278.2595.83.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 18 Feb 2011 09:54:38 +0100

> From: Linus Torvalds <torvalds@linux-foundation.org>
> 
> Eric W. Biderman and Michal Hocko reported various memory corruptions
> that we suspected to be related to a LIST head located on stack, that
> was manipulated after thread left function frame (and eventually exited,
> so its stack was freed and reused).
> 
> Eric Dumazet suggested the problem was probably coming from commit
> 443457242beb (net: factorize
> sync-rcu call in unregister_netdevice_many)
> 
> This patch fixes __dev_close() and dev_close() to properly deinit their
> respective LIST_HEAD(single) before exiting.
> 
> References: https://lkml.org/lkml/2011/2/16/304
> References: https://lkml.org/lkml/2011/2/14/223
> 
> Reported-by: Michal Hocko <mhocko@suse.cz>
> Reported-by: Eric W. Biderman <ebiderman@xmission.com>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 2/2] net: deinit automatic LIST_HEAD
From: David Miller @ 2011-02-18 20:14 UTC (permalink / raw)
  To: eric.dumazet
  Cc: torvalds, ebiederm, mingo, opurdila, linux-kernel, linux-mm,
	mhocko, netdev, stable
In-Reply-To: <1298019559.2595.92.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 18 Feb 2011 09:59:19 +0100

> commit 9b5e383c11b08784 (net: Introduce
> unregister_netdevice_many()) left an active LIST_HEAD() in
> rollback_registered(), with possible memory corruption.
> 
> Even if device is freed without touching its unreg_list (and therefore
> touching the previous memory location holding LISTE_HEAD(single), better
> close the bug for good, since its really subtle.
> 
> (Same fix for default_device_exit_batch() for completeness)
> 
> Reported-by: Michal Hocko <mhocko@suse.cz>
> Reported-by: Eric W. Biderman <ebiderman@xmission.com>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: BUG: Bad page map in process udevd (anon_vma: (null)) in 2.6.38-rc4
From: Eric W. Biederman @ 2011-02-18 20:38 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Linus Torvalds, Michal Hocko, Ingo Molnar, linux-mm, LKML,
	David Miller, Eric Dumazet, netdev, Pavel Emelyanov,
	Daniel Lezcano
In-Reply-To: <20110218191146.GG13211@ghostprotocols.net>

Arnaldo Carvalho de Melo <acme@redhat.com> writes:

> Em Fri, Feb 18, 2011 at 05:01:28PM -0200, Arnaldo Carvalho de Melo escreveu:
>> Em Fri, Feb 18, 2011 at 10:48:18AM -0800, Linus Torvalds escreveu:
>> > This seems to be a fairly straightforward bug.
>> > 
>> > In net/ipv4/inet_timewait_sock.c we have this:
>> > 
>> >   /* These are always called from BH context.  See callers in
>> >    * tcp_input.c to verify this.
>> >    */
>> > 
>> >   /* This is for handling early-kills of TIME_WAIT sockets. */
>> >   void inet_twsk_deschedule(struct inet_timewait_sock *tw,
>> >                             struct inet_timewait_death_row *twdr)
>> >   {
>> >           spin_lock(&twdr->death_lock);
>> >           ..
>> > 
>> > and the intention is clearly that that spin_lock is BH-safe because
>> > it's called from BH context.
>> > 
>> > Except that clearly isn't true. It's called from a worker thread:
>> > 
>> > > stack backtrace:
>> > > Pid: 10833, comm: kworker/u:1 Not tainted 2.6.38-rc4-359399.2010AroraKernelBeta.fc14.x86_64 #1
>> > > Call Trace:
>> > >  [<ffffffff81460e69>] ? inet_twsk_deschedule+0x29/0xa0
>> > >  [<ffffffff81460fd6>] ? inet_twsk_purge+0xf6/0x180
>> > >  [<ffffffff81460f10>] ? inet_twsk_purge+0x30/0x180
>> > >  [<ffffffff814760fc>] ? tcp_sk_exit_batch+0x1c/0x20
>> > >  [<ffffffff8141c1d3>] ? ops_exit_list.clone.0+0x53/0x60
>> > >  [<ffffffff8141c520>] ? cleanup_net+0x100/0x1b0
>> > >  [<ffffffff81068c47>] ? process_one_work+0x187/0x4b0
>> > >  [<ffffffff81068be1>] ? process_one_work+0x121/0x4b0
>> > >  [<ffffffff8141c420>] ? cleanup_net+0x0/0x1b0
>> > >  [<ffffffff8106a65c>] ? worker_thread+0x15c/0x330
>> > 
>> > so it can deadlock with a BH happening at the same time, afaik.
>> > 
>> > The code (and comment) is all from 2005, it looks like the BH->worker
>> > thread has broken the code. But somebody who knows that code better
>> > should take a deeper look at it.
>> > 
>> > Added acme to the cc, since the code is attributed to him back in 2005
>> > ;). Although I don't know how active he's been in networking lately
>> > (seems to be all perf-related). Whatever, it can't hurt.
>> 
>> Original code is ANK's, I just made it possible to use with DCCP, and
>> yeah, the smiley is appropriate, something 6 years old and the world
>> around it changing continually... well, thanks for the git blame ;-)
>
> But yeah, your analisys seems correct, with the bug being introduced by
> one of these world around it changing continually issues, networking
> namespaces broke the rules of the game on its cleanup_net() routine,
> adding Pavel to the CC list since it doesn't hurt ;-)

Which probably gets the bug back around to me.

I guess this must be one of those ipv4 cases that where the cleanup
simply did not exist in the rmmod sense that we had to invent.

I think that was Daniel who did the time wait sockets.  I do remember
they were a real pain.

Would a bh_disable be sufficient?  I guess I should stop remembering and
look at the code now.

Eric

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [patch net-next-2.6 V2] net: convert bonding to use rx_handler
From: Jiri Pirko @ 2011-02-18 20:58 UTC (permalink / raw)
  To: David Miller
  Cc: kaber, eric.dumazet, netdev, shemminger, fubar, nicolas.2p.debian,
	andy
In-Reply-To: <20110218.120656.104048936.davem@davemloft.net>

This patch converts bonding to use rx_handler. Results in cleaner
__netif_receive_skb() with much less exceptions needed. Also bond-specific
work is moved into bond code.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>

v1->v2:
	using skb_iif instead of new input_dev to remember original device

---
 drivers/net/bonding/bond_main.c |   75 ++++++++++++++++++++++++++-
 net/core/dev.c                  |  111 ++++++++-------------------------------
 2 files changed, 97 insertions(+), 89 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 77e3c6a..a856a11 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1423,6 +1423,68 @@ static void bond_setup_by_slave(struct net_device *bond_dev,
 	bond->setup_by_slave = 1;
 }
 
+/* On bonding slaves other than the currently active slave, suppress
+ * duplicates except for 802.3ad ETH_P_SLOW, alb non-mcast/bcast, and
+ * ARP on active-backup slaves with arp_validate enabled.
+ */
+static bool bond_should_deliver_exact_match(struct sk_buff *skb,
+					    struct net_device *slave_dev,
+					    struct net_device *bond_dev)
+{
+	if (slave_dev->priv_flags & IFF_SLAVE_INACTIVE) {
+		if (slave_dev->priv_flags & IFF_SLAVE_NEEDARP &&
+		    skb->protocol == __cpu_to_be16(ETH_P_ARP))
+			return false;
+
+		if (bond_dev->priv_flags & IFF_MASTER_ALB &&
+		    skb->pkt_type != PACKET_BROADCAST &&
+		    skb->pkt_type != PACKET_MULTICAST)
+				return false;
+
+		if (bond_dev->priv_flags & IFF_MASTER_8023AD &&
+		    skb->protocol == __cpu_to_be16(ETH_P_SLOW))
+			return false;
+
+		return true;
+	}
+	return false;
+}
+
+static struct sk_buff *bond_handle_frame(struct sk_buff *skb)
+{
+	struct net_device *slave_dev;
+	struct net_device *bond_dev;
+
+	skb = skb_share_check(skb, GFP_ATOMIC);
+	if (unlikely(!skb))
+		return NULL;
+	slave_dev = skb->dev;
+	bond_dev = ACCESS_ONCE(slave_dev->master);
+	if (unlikely(!bond_dev))
+		return skb;
+
+	if (bond_dev->priv_flags & IFF_MASTER_ARPMON)
+		slave_dev->last_rx = jiffies;
+
+	if (bond_should_deliver_exact_match(skb, slave_dev, bond_dev)) {
+		skb->deliver_no_wcard = 1;
+		return skb;
+	}
+
+	skb->dev = bond_dev;
+
+	if (bond_dev->priv_flags & IFF_MASTER_ALB &&
+	    bond_dev->priv_flags & IFF_BRIDGE_PORT &&
+	    skb->pkt_type == PACKET_HOST) {
+		u16 *dest = (u16 *) eth_hdr(skb)->h_dest;
+
+		memcpy(dest, bond_dev->dev_addr, ETH_ALEN);
+	}
+
+	netif_rx(skb);
+	return NULL;
+}
+
 /* enslave device <slave> to bond device <master> */
 int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 {
@@ -1599,11 +1661,17 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 		pr_debug("Error %d calling netdev_set_bond_master\n", res);
 		goto err_restore_mac;
 	}
+	res = netdev_rx_handler_register(slave_dev, bond_handle_frame, NULL);
+	if (res) {
+		pr_debug("Error %d calling netdev_rx_handler_register\n", res);
+		goto err_unset_master;
+	}
+
 	/* open the slave since the application closed it */
 	res = dev_open(slave_dev);
 	if (res) {
 		pr_debug("Opening slave %s failed\n", slave_dev->name);
-		goto err_unset_master;
+		goto err_unreg_rxhandler;
 	}
 
 	new_slave->dev = slave_dev;
@@ -1811,6 +1879,9 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 err_close:
 	dev_close(slave_dev);
 
+err_unreg_rxhandler:
+	netdev_rx_handler_unregister(slave_dev);
+
 err_unset_master:
 	netdev_set_bond_master(slave_dev, NULL);
 
@@ -1992,6 +2063,7 @@ int bond_release(struct net_device *bond_dev, struct net_device *slave_dev)
 		netif_addr_unlock_bh(bond_dev);
 	}
 
+	netdev_rx_handler_unregister(slave_dev);
 	netdev_set_bond_master(slave_dev, NULL);
 
 #ifdef CONFIG_NET_POLL_CONTROLLER
@@ -2114,6 +2186,7 @@ static int bond_release_all(struct net_device *bond_dev)
 			netif_addr_unlock_bh(bond_dev);
 		}
 
+		netdev_rx_handler_unregister(slave_dev);
 		netdev_set_bond_master(slave_dev, NULL);
 
 		/* close slave before restoring its mac address */
diff --git a/net/core/dev.c b/net/core/dev.c
index 4f69439..580cff1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3092,63 +3092,31 @@ void netdev_rx_handler_unregister(struct net_device *dev)
 }
 EXPORT_SYMBOL_GPL(netdev_rx_handler_unregister);
 
-static inline void skb_bond_set_mac_by_master(struct sk_buff *skb,
-					      struct net_device *master)
+static void vlan_on_bond_hook(struct sk_buff *skb)
 {
-	if (skb->pkt_type == PACKET_HOST) {
-		u16 *dest = (u16 *) eth_hdr(skb)->h_dest;
-
-		memcpy(dest, master->dev_addr, ETH_ALEN);
-	}
-}
-
-/* On bonding slaves other than the currently active slave, suppress
- * duplicates except for 802.3ad ETH_P_SLOW, alb non-mcast/bcast, and
- * ARP on active-backup slaves with arp_validate enabled.
- */
-static int __skb_bond_should_drop(struct sk_buff *skb,
-				  struct net_device *master)
-{
-	struct net_device *dev = skb->dev;
-
-	if (master->priv_flags & IFF_MASTER_ARPMON)
-		dev->last_rx = jiffies;
-
-	if ((master->priv_flags & IFF_MASTER_ALB) &&
-	    (master->priv_flags & IFF_BRIDGE_PORT)) {
-		/* Do address unmangle. The local destination address
-		 * will be always the one master has. Provides the right
-		 * functionality in a bridge.
-		 */
-		skb_bond_set_mac_by_master(skb, master);
-	}
-
-	if (dev->priv_flags & IFF_SLAVE_INACTIVE) {
-		if ((dev->priv_flags & IFF_SLAVE_NEEDARP) &&
-		    skb->protocol == __cpu_to_be16(ETH_P_ARP))
-			return 0;
-
-		if (master->priv_flags & IFF_MASTER_ALB) {
-			if (skb->pkt_type != PACKET_BROADCAST &&
-			    skb->pkt_type != PACKET_MULTICAST)
-				return 0;
-		}
-		if (master->priv_flags & IFF_MASTER_8023AD &&
-		    skb->protocol == __cpu_to_be16(ETH_P_SLOW))
-			return 0;
+	/*
+	 * Make sure ARP frames received on VLAN interfaces stacked on
+	 * bonding interfaces still make their way to any base bonding
+	 * device that may have registered for a specific ptype.
+	 */
+	if (skb->dev->priv_flags & IFF_802_1Q_VLAN &&
+	    vlan_dev_real_dev(skb->dev)->priv_flags & IFF_BONDING &&
+	    skb->protocol == htons(ETH_P_ARP)) {
+		struct sk_buff *skb2 = skb_clone(skb, GFP_ATOMIC);
 
-		return 1;
+		if (!skb2)
+			return;
+		skb2->dev = vlan_dev_real_dev(skb->dev);
+		netif_rx(skb2);
 	}
-	return 0;
 }
 
 static int __netif_receive_skb(struct sk_buff *skb)
 {
 	struct packet_type *ptype, *pt_prev;
 	rx_handler_func_t *rx_handler;
+	struct net_device *null_or_dev;
 	struct net_device *orig_dev;
-	struct net_device *null_or_orig;
-	struct net_device *orig_or_bond;
 	int ret = NET_RX_DROP;
 	__be16 type;
 
@@ -3164,30 +3132,6 @@ static int __netif_receive_skb(struct sk_buff *skb)
 	if (!skb->skb_iif)
 		skb->skb_iif = skb->dev->ifindex;
 
-	/*
-	 * bonding note: skbs received on inactive slaves should only
-	 * be delivered to pkt handlers that are exact matches.  Also
-	 * the deliver_no_wcard flag will be set.  If packet handlers
-	 * are sensitive to duplicate packets these skbs will need to
-	 * be dropped at the handler.
-	 */
-	null_or_orig = NULL;
-	orig_dev = skb->dev;
-	if (skb->deliver_no_wcard)
-		null_or_orig = orig_dev;
-	else if (netif_is_bond_slave(orig_dev)) {
-		struct net_device *bond_master = ACCESS_ONCE(orig_dev->master);
-
-		if (likely(bond_master)) {
-			if (__skb_bond_should_drop(skb, bond_master)) {
-				skb->deliver_no_wcard = 1;
-				/* deliver only exact match */
-				null_or_orig = orig_dev;
-			} else
-				skb->dev = bond_master;
-		}
-	}
-
 	__this_cpu_inc(softnet_data.processed);
 	skb_reset_network_header(skb);
 	skb_reset_transport_header(skb);
@@ -3196,6 +3140,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
 	pt_prev = NULL;
 
 	rcu_read_lock();
+	orig_dev = dev_get_by_index_rcu(dev_net(skb->dev), skb->skb_iif);
 
 #ifdef CONFIG_NET_CLS_ACT
 	if (skb->tc_verd & TC_NCLS) {
@@ -3205,8 +3150,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
 #endif
 
 	list_for_each_entry_rcu(ptype, &ptype_all, list) {
-		if (ptype->dev == null_or_orig || ptype->dev == skb->dev ||
-		    ptype->dev == orig_dev) {
+		if (!ptype->dev || ptype->dev == skb->dev) {
 			if (pt_prev)
 				ret = deliver_skb(skb, pt_prev, orig_dev);
 			pt_prev = ptype;
@@ -3220,7 +3164,6 @@ static int __netif_receive_skb(struct sk_buff *skb)
 ncls:
 #endif
 
-	/* Handle special case of bridge or macvlan */
 	rx_handler = rcu_dereference(skb->dev->rx_handler);
 	if (rx_handler) {
 		if (pt_prev) {
@@ -3244,24 +3187,16 @@ ncls:
 			goto out;
 	}
 
-	/*
-	 * Make sure frames received on VLAN interfaces stacked on
-	 * bonding interfaces still make their way to any base bonding
-	 * device that may have registered for a specific ptype.  The
-	 * handler may have to adjust skb->dev and orig_dev.
-	 */
-	orig_or_bond = orig_dev;
-	if ((skb->dev->priv_flags & IFF_802_1Q_VLAN) &&
-	    (vlan_dev_real_dev(skb->dev)->priv_flags & IFF_BONDING)) {
-		orig_or_bond = vlan_dev_real_dev(skb->dev);
-	}
+	vlan_on_bond_hook(skb);
+
+	/* deliver only exact match when indicated */
+	null_or_dev = skb->deliver_no_wcard ? skb->dev : NULL;
 
 	type = skb->protocol;
 	list_for_each_entry_rcu(ptype,
 			&ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
-		if (ptype->type == type && (ptype->dev == null_or_orig ||
-		     ptype->dev == skb->dev || ptype->dev == orig_dev ||
-		     ptype->dev == orig_or_bond)) {
+		if (ptype->type == type &&
+		    (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
 			if (pt_prev)
 				ret = deliver_skb(skb, pt_prev, orig_dev);
 			pt_prev = ptype;
-- 
1.7.3.4


^ permalink raw reply related

* Re: IGMP and rwlock: Dead ocurred again on TILEPro
From: Chris Metcalf @ 2011-02-18 21:51 UTC (permalink / raw)
  To: Cypher Wu
  Cc: David Miller, xiyou.wangcong, linux-kernel, eric.dumazet, netdev
In-Reply-To: <AANLkTim1F679mmnbHnYkvi6ZDojxD-tPGer4F+61C+di@mail.gmail.com>

On 2/17/2011 10:16 PM, Cypher Wu wrote:
> On Fri, Feb 18, 2011 at 7:18 AM, Chris Metcalf <cmetcalf@tilera.com> wrote:
>> The interrupt architecture on Tile allows a write to a special-purpose
>> register to put you into a "critical section" where no interrupts or faults
>> are delivered.  So we just need to bracket the read_lock operations with
>> two SPR writes; each takes six machine cycles, so we're only adding 12
>> cycles to the total cost of taking or releasing a read lock on an rwlock
> I agree that just lock interrupt for read operations should be enough,
> but read_unlock() is also the place we should lock interrupt, right?
> If interrupt occurred when it hold lock-val after TNS deadlock still
> can occur.

Correct; that's what I meant by "read_lock operations".  This include lock,
trylock, and unlock.

> When will you release out that patch? Since time is tight, so maybe
> I've to fix-up it myself.

I heard from one of our support folks that you were asking through that
channel, so I asked him to go ahead and give you the spinlock sources
directly.  I will be spending time next week syncing up our internal tree
with the public git repository so you'll see it on LKML at that time.

> 1. If we use SPR_INTERRUPT_CRITICAL_SECTION it will disable all the
> interrupt which claimed 'CM', is that right? Should we have to same
> its original value and restore it later?

We don't need to save and restore, since INTERRUPT_CRITICAL_SECTION is
almost always zero except in very specific situations.

> 2. Should we lock interrupt for the whole operation of
> read_lock()/read_unlock(), or we should leave interrupt critical
> section if it run into  __raw_read_lock_slow() and before have to
> delay_backoff() some time, and re-enter interrupt critical section
> again before TNS?

Correct, the fix only holds the critical section around the tns and the
write-back, not during the delay_backoff().

> Bye the way, other RISC platforms, say ARM and MIPS, use store
> conditional rather that TNS a temp value for lock-val, does Fx have
> similar instructions?

TILEPro does not have anything more than test-and-set; TILE-Gx (the 64-bit
processor) has a full array of atomic instructions.

> Adding that to SPR writes should be fine, but it may cause interrupt
> delay a little more that other platform's read_lock()?

A little, but I think it's in the noise relative to the basic cost of
read_lock in the absence of full-fledged atomic instructions.

> Another question: What NMI in the former mail means?

Non-maskable interrupt, such as performance counter interrupts.

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com

^ permalink raw reply

* [GIT] Networking
From: David Miller @ 2011-02-18 21:52 UTC (permalink / raw)
  To: torvalds; +Cc: akpm, netdev, linux-kernel


1) The on-stack list_head memory corruption fixes from Linus and Eric
   Dumazet.

2) ISDN hisax doesn't check alloc_skb() failures.

3) ieee80211_reconfig doesn't do necessary locking, fix from Eliad
   Peller.

4) Work around buffer size hw limitation in ixgbe driver, from Amir
   Hanania.

5) Fix crash in ixgbe driver due to NULL ptr deref, from Andy
   Gospodarek.

6) Fix crash in tcp_connect() due to blackhole dst_ops missing
   default_mss() method, fix from Eric Dumazet.

7) Multicast snooping list corruption et al. fixes in bridging from
   Herbert Xu.

8) NETDEV_NOTIFY_PEERS events should send gratuitous ARP unconditionally
   as it's used to trigger fail-over, from Ian Campbell.

9) xfrm_alloc_dst() can OOPS due to mis-coded conditional, from
   Hiroaki SHIMODA.

10) Several NULL deref fixes from Jesper Juhl in DCB, USB Net, USB HSO,
    and ATM Solos driver.

11) Device shutdown races and fixes in e1000e, from Jesse Brandeburg.

12) Fix CAN softing driver Kconfig deps, from Kurt Van Dijck.

13) Fix RCU usage in netfilter nf_iterate, from Patrick McHardy.

14) A cleanup change lost the setting of IPPROT_GRE value in the key
    used to lookup routes in ipgre_tunnel_xmit, fix from Steffen Klassert.

15) RX padding and MAC address loading fixes in pch_gbe driver from
    Toshiharu Okada.

16) If the firmward has control of the tg3 PHY, don't allow ioctl access
    otherwise we'll corrupt state.  Fix from Matt Carlson.

The following changes since commit a5bbef0b2deb7b943f095181309ecc9e1fc91c0f:

  Merge branch 'for-linus/bugfixes' of git://xenbits.xen.org/people/ianc/linux-2.6 (2011-02-18 12:44:41 -0800)

are available in the git repository at:

  master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6.git master

Amir Hanania (1):
      ixgbe: work around for DDP last buffer size

Andy Gospodarek (1):
      ixgbe: fix panic due to uninitialised pointer

Bao Liang (1):
      Bluetooth: Set conn state to BT_DISCONN to avoid multiple responses

Casey Leedom (4):
      cxgb4vf: Check driver parameters in the right place ...
      cxgb4vf: Behave properly when CONFIG_DEBUG_FS isn't defined ...
      cxgb4vf: Quiesce Virtual Interfaces on shutdown ...
      cxgb4vf: Use defined Mailbox Timeout

Cho, Yu-Chen (1):
      Bluetooth: add Atheros BT AR9285 fw supported

David S. Miller (5):
      hisax: Fix unchecked alloc_skb() return.
      Merge branch 'master' of git://git.kernel.org/.../kaber/nf-2.6
      Merge branch 'master' of git://git.kernel.org/.../linville/wireless-2.6
      iwlwifi: Delete iwl3945_good_plcp_health.
      isdn: hisax: Use l2headersize() instead of dup (and buggy) func.

Eliad Peller (1):
      mac80211: add missing locking in ieee80211_reconfig

Eric Dumazet (2):
      net: provide default_advmss() methods to blackhole dst_ops
      net: deinit automatic LIST_HEAD

Giuseppe Cavallaro (1):
      stmmac: enable wol via magic frame by default.

Herbert Xu (3):
      bridge: Fix mglist corruption that leads to memory corruption
      bridge: Fix timer typo that may render snooping less effective
      bridge: Replace mp->mglist hlist with a bool

Hiroaki SHIMODA (1):
      xfrm: avoid possible oopse in xfrm_alloc_dst

Ian Campbell (1):
      arp_notify: unconditionally send gratuitous ARP for NETDEV_NOTIFY_PEERS.

Ivan Vecera (1):
      drivers/net: Call netif_carrier_off at the end of the probe

Jesper Juhl (4):
      Don't potentially dereference NULL in net/dcb/dcbnl.c:dcbnl_getapp()
      USB Network driver infrastructure: Fix leak when usb_autopm_get_interface() returns less than zero in kevent().
      Net, USB, Option, hso: Do not dereference NULL pointer
      ATM, Solos PCI ADSL2+: Don't deref NULL pointer if net_ratelimit() and alloc_skb() interact badly.

Jesse Brandeburg (2):
      e1000e: check down flag in tasks
      e1000e: flush all writebacks before unload

John Fastabend (1):
      net: dcb: application priority is per net_device

John W. Linville (1):
      Merge branch 'master' of git://git.kernel.org/.../padovan/bluetooth-2.6

Kurt Van Dijck (1):
      net/can/softing: make CAN_SOFTING_CS depend on CAN_SOFTING

Linus Torvalds (1):
      net: dont leave active on stack LIST_HEAD

Matt Carlson (1):
      tg3: Restrict phy ioctl access

Patrick McHardy (1):
      netfilter: nf_iterate: fix incorrect RCU usage

Randy Dunlap (1):
      net: fix ifenslave build flags

Stanislaw Gruszka (1):
      iwl3945: remove plcp check

Steffen Klassert (1):
      ip_gre: Add IPPROTO_GRE to flowi in ipgre_tunnel_xmit

Toshiharu Okada (2):
      pch_gbe: Fix the issue that the receiving data is not normal.
      pch_gbe: Fix the MAC Address load issue.

 Documentation/networking/Makefile       |    2 +
 drivers/atm/solos-pci.c                 |    5 +-
 drivers/bluetooth/ath3k.c               |    2 +
 drivers/bluetooth/btusb.c               |    3 +
 drivers/isdn/hisax/isdnl2.c             |   28 ++++-----
 drivers/net/can/softing/Kconfig         |    2 +-
 drivers/net/cxgb4vf/cxgb4vf_main.c      |   80 ++++++++++++++++++------
 drivers/net/cxgb4vf/t4vf_hw.c           |    2 +-
 drivers/net/e1000e/netdev.c             |   52 ++++++++++++---
 drivers/net/forcedeth.c                 |    2 +
 drivers/net/ixgbe/ixgbe_fcoe.c          |   51 +++++++++++++++-
 drivers/net/ixgbe/ixgbe_fcoe.h          |    2 +
 drivers/net/ixgbe/ixgbe_main.c          |    6 +-
 drivers/net/pch_gbe/pch_gbe.h           |    2 +-
 drivers/net/pch_gbe/pch_gbe_main.c      |  104 ++++++++++++++++++------------
 drivers/net/r8169.c                     |    2 +
 drivers/net/stmmac/stmmac_main.c        |    4 +-
 drivers/net/tg3.c                       |    8 ++-
 drivers/net/usb/hso.c                   |   12 ++--
 drivers/net/usb/usbnet.c                |    4 +-
 drivers/net/wireless/iwlwifi/iwl-3945.c |   67 --------------------
 net/bluetooth/l2cap.c                   |    1 +
 net/bridge/br_input.c                   |    2 +-
 net/bridge/br_multicast.c               |   19 +++---
 net/bridge/br_private.h                 |    3 +-
 net/core/dev.c                          |    9 ++-
 net/dcb/dcbnl.c                         |    9 +++-
 net/ipv4/devinet.c                      |   30 ++++++---
 net/ipv4/ip_gre.c                       |    1 +
 net/ipv4/route.c                        |    1 +
 net/ipv6/route.c                        |    1 +
 net/mac80211/util.c                     |    2 +
 net/netfilter/core.c                    |    3 +-
 net/xfrm/xfrm_policy.c                  |    7 ++-
 34 files changed, 327 insertions(+), 201 deletions(-)

^ permalink raw reply

* Confirmation
From: Western Union Money Transfer @ 2011-02-18 22:10 UTC (permalink / raw)


You have a money transfer of $85,000. Confirm receipt via e-mail: wudept4@w.cn

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox