Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH 1/2] ipv4: Add hash table of interface addresses.
From: David Miller @ 2011-02-18 20:12 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1298010150.2642.5.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 18 Feb 2011 07:22:30 +0100

> Maybe you should take into account net pointer here, or machines with
> many net namespaces will hash collide for 127.0.0.1

Good idea.  I'll post a new series with that fixed up later.

Thanks!

^ permalink raw reply

* Re: [PATCH ethtool] ethtool: Warn if we cannot set the advertising mask for specified speed/duplex
From: Ben Hutchings @ 2011-02-18 20:08 UTC (permalink / raw)
  To: netdev; +Cc: linux-net-drivers, Naveen Chouksey
In-Reply-To: <1298044598.2570.4.camel@bwh-desktop>

On Fri, 2011-02-18 at 15:56 +0000, Ben Hutchings wrote:
> On Thu, 2011-02-17 at 19:06 +0000, Ben Hutchings wrote:
> > When autonegotiation is enabled, drivers must determine link speed and
> > duplex through the autonegotiation process and will generally ignore
> > the speed and duplex specified in struct ethtool_cmd.  If the user
> > specifies a recognised combination of speed and duplex, we set the
> > advertising mask to include only the matching mode.  Otherwise, we
> > advertise all supported modes.  We should really limit the advertised
> > modes separately by speed and duplex, but for now we just warn.
> > 
> > Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
> > ---
> > This is a longstanding bug in ethtool which people keep getting bitten
> > by.  Naveen was just the latest to report it.  I have been working on
> > some changes to improve link advertising and reporting, but until those
> > are ready I'm adding this warning.
> 
> That patch fixes an additional bug: we will now update the advertising
> mask based on speed/duplex when autoneg is already on.  Unfortunately it
> introduces a bug: we will update advertising even if none of autoneg,
> speed or duplex are specified!
> 
> Here's a second version which I think will do the right thing in all
> cases, though I haven't tested it yet.  I'll include this in ethtool
> 2.6.38 if no-one can spot a flaw.
[...]
> +			if (autoneg_wanted == AUTONEG_ENABLE &&
> +			    advertising_wanted == 0) {
> +				ecmd.advertising = ecmd.supported &
> +					(ADVERTISED_10baseT_Half |
> +					 ADVERTISED_10baseT_Full |
> +					 ADVERTISED_100baseT_Half |
> +					 ADVERTISED_100baseT_Full |
> +					 ADVERTISED_1000baseT_Half |
> +					 ADVERTISED_1000baseT_Full |
> +					 ADVERTISED_2500baseX_Full |
> +					 ADVERTISED_10000baseT_Full);
> +			} else if (advertising_wanted != -1) {

This condition should be (advertising_wanted > 0).  With that last
change, ethtool seems to do the right thing.

Ben.

> +				ecmd.advertising = advertising_wanted;
>  			}
>  
>  			/* Try to perform the update. */

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [patch net-next-2.6] net: convert bonding to use rx_handler
From: David Miller @ 2011-02-18 20:06 UTC (permalink / raw)
  To: jpirko
  Cc: kaber, eric.dumazet, netdev, shemminger, fubar, nicolas.2p.debian,
	andy
In-Reply-To: <20110218145850.GF2939@psychotron.redhat.com>

From: Jiri Pirko <jpirko@redhat.com>
Date: Fri, 18 Feb 2011 15:58:51 +0100

> Fri, Feb 18, 2011 at 03:46:45PM CET, kaber@trash.net wrote:
>>Am 18.02.2011 15:27, schrieb Eric Dumazet:
>>> Le vendredi 18 février 2011 à 15:14 +0100, Jiri Pirko a écrit :
>>> 
>>>> Do not know how to do it better. As for percpu variable, not only
>>>> origdev would have to be remembered but also probably skb pointer to
>>>> know if it's the first run on the skb or not. Can't really figure out a
>>>> better solution. Can you?
>>> 
>>> I'll try and let you know.
>>
>>Why not simply do a lookup on skb->iif?
> 
> Well I was trying to avoid iterating over list of devices for each
> incoming frame.

It is not list, it is hash :-)

^ permalink raw reply

* Re: [PATCH v6 0/9] net: Unified offload configuration
From: David Miller @ 2011-02-18 20:06 UTC (permalink / raw)
  To: bhutchings; +Cc: mirq-linux, netdev
In-Reply-To: <1298039371.2211.9.camel@localhost>

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Fri, 18 Feb 2011 14:29:31 +0000

> On Fri, 2011-02-18 at 15:22 +0100, Michał Mirosław wrote:
>> On Thu, Feb 17, 2011 at 02:56:11PM -0800, David Miller wrote:
> [...]
>> > Please get rid of that annoying message spit out by netif_features_change(),
>> > it's just spam.  If we want notifications for stuff like this, use a
>> > non-unicast netlink message so those who want to hear it can do so.
>> 
>> You mean netdev_update_features() "Features changed" message? Is it ok
>> to just demote it to DEBUG level or you want to remove it altogether?
>> What about netdev_fix_features() messages?
> 
> I think you need to emit these messages at 'error' severity when fixing
> up features for a newly-added device, but at 'debug' later on.

I get one several minutes after every boot for a completely
unregistered device for some reason:

[119704.730965] (unregistered net_device): Features changed: 0x00011065 -> 0x00015065

^ permalink raw reply

* Re: [patch net-next-2.6] net: convert bonding to use rx_handler
From: Jiri Pirko @ 2011-02-18 20:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Patrick McHardy, netdev, davem, shemminger, fubar,
	nicolas.2p.debian, andy
In-Reply-To: <1298059106.2425.31.camel@edumazet-laptop>

Fri, Feb 18, 2011 at 08:58:26PM CET, eric.dumazet@gmail.com wrote:
>Le vendredi 18 février 2011 à 20:28 +0100, Jiri Pirko a écrit :
>
>> Yes I saw this. We would have to do something like:
>> 
>> struct skb_rx_context {
>> 	struct sk_buff *skb;
>> 	struct net_device *orig_dev;
>> };
>> 
>> static DEFINE_PER_CPU(struct skb_rx_context, skb_rx_context);
>> 
>> and then in __netif_receive_skb():
>> 
>> struct skb_rx_context *cont = __this_cpu_read(skb_rx_context);
>> 
>> if (cont->skb != skb) {
>> 	cont->skb = skb;
>> 	orig_dev = cont->orig_dev = skb->dev;
>> } else {
>> 	orig_dev = cont->orig_dev;
>> }
>> 
>> 
>> Does this make sense?
>
>Well, yes, something like that, but I think you dont need to keep a
>pointer to current skb. (It would not work if one handled/freed, same
>'skb pointer' is reused a bit later)

Well I think I need. How else should I distinguish that new skb (first
time in __netif_receive_skb) is there and I need to remember orig_dev?

>
>Sorry, I wont have time to look at this right now, its now 21h00 in
>France, time to get some time with family ;)

Np, same time here in CZ :)


>
>See you !
>
>

^ permalink raw reply

* Re: [patch net-next-2.6] net: convert bonding to use rx_handler
From: Eric Dumazet @ 2011-02-18 19:58 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Patrick McHardy, netdev, davem, shemminger, fubar,
	nicolas.2p.debian, andy
In-Reply-To: <20110218192856.GB2602@psychotron.redhat.com>

Le vendredi 18 février 2011 à 20:28 +0100, Jiri Pirko a écrit :

> Yes I saw this. We would have to do something like:
> 
> struct skb_rx_context {
> 	struct sk_buff *skb;
> 	struct net_device *orig_dev;
> };
> 
> static DEFINE_PER_CPU(struct skb_rx_context, skb_rx_context);
> 
> and then in __netif_receive_skb():
> 
> struct skb_rx_context *cont = __this_cpu_read(skb_rx_context);
> 
> if (cont->skb != skb) {
> 	cont->skb = skb;
> 	orig_dev = cont->orig_dev = skb->dev;
> } else {
> 	orig_dev = cont->orig_dev;
> }
> 
> 
> Does this make sense?

Well, yes, something like that, but I think you dont need to keep a
pointer to current skb. (It would not work if one handled/freed, same
'skb pointer' is reused a bit later)

Sorry, I wont have time to look at this right now, its now 21h00 in
France, time to get some time with family ;)

See you !



^ permalink raw reply

* Re: [PATCH] net: Add default_advmss() methods to blackhole dst_ops
From: George Spelvin @ 2011-02-18 19:58 UTC (permalink / raw)
  To: davem, eric.dumazet, linux; +Cc: linux-kernel, netdev, roland
In-Reply-To: <1298023023.2595.170.camel@edumazet-laptop>

This makes sense to me; I was having trouble at the time with a loose
ethernet cable (broken RJ45 retention clip), so the cable could have
come unplugged at the time I tried to make the connection.

Thank you very much!

^ permalink raw reply

* Re: [PATCH v2] net: provide default_advmss() methods to blackhole dst_ops
From: David Miller @ 2011-02-18 19:39 UTC (permalink / raw)
  To: eric.dumazet; +Cc: xiaosuo, linux, linux-kernel, netdev, roland, dbaluta
In-Reply-To: <1298036672.6201.65.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 18 Feb 2011 14:44:32 +0100

> Le vendredi 18 février 2011 à 14:33 +0100, Eric Dumazet a écrit :
> 
>> I had this exact idea but found we need struct net pointer to get this
>> value, not provided in parameters, so I falled back to the 256 value.
>> 
>> 
> 
> Hmm, reading again this stuff, maybe we can just use
> ipv4_default_advmss() instead of a custom one.
> 
> dst->dev should be available
> 
> [PATCH] net: provide default_advmss() methods to blackhole dst_ops
> 
> Commit 0dbaee3b37e118a (net: Abstract default ADVMSS behind an
> accessor.) introduced a possible crash in tcp_connect_init(), when
> dst->default_advmss() is called from dst_metric_advmss()
> 
> Reported-by: George Spelvin <linux@horizon.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Yes, this is a lot better, applied.

Thanks!

^ permalink raw reply

* Re: [patch net-next-2.6] net: convert bonding to use rx_handler
From: Jiri Pirko @ 2011-02-18 19:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Patrick McHardy, netdev, davem, shemminger, fubar,
	nicolas.2p.debian, andy
In-Reply-To: <1298056657.2425.28.camel@edumazet-laptop>

Fri, Feb 18, 2011 at 08:17:37PM CET, eric.dumazet@gmail.com wrote:
>Le vendredi 18 février 2011 à 19:47 +0100, Jiri Pirko a écrit :
>> Fri, Feb 18, 2011 at 05:14:30PM CET, eric.dumazet@gmail.com wrote:
>> >Le vendredi 18 février 2011 à 16:50 +0100, Patrick McHardy a écrit :
>> >> On 18.02.2011 15:58, Jiri Pirko wrote:
>> >> > Fri, Feb 18, 2011 at 03:46:45PM CET, kaber@trash.net wrote:
>> >> >> Am 18.02.2011 15:27, schrieb Eric Dumazet:
>> >> >>> Le vendredi 18 février 2011 à 15:14 +0100, Jiri Pirko a écrit :
>> >> >>>
>> >> >>>> Do not know how to do it better. As for percpu variable, not only
>> >> >>>> origdev would have to be remembered but also probably skb pointer to
>> >> >>>> know if it's the first run on the skb or not. Can't really figure out a
>> >> >>>> better solution. Can you?
>> >> >>>
>> >> >>> I'll try and let you know.
>> >> >>
>> >> >> Why not simply do a lookup on skb->iif?
>> >> > 
>> >> > Well I was trying to avoid iterating over list of devices for each
>> >> > incoming frame.
>> >> > 
>> >> 
>> >> Well, there are a couple of holes on 64 bit, perhaps you can rearrange
>> >> things and eliminate either iif or input_dev without increasing size
>> >> since they appear to be redundant.
>> >
>> >Jiri
>> >
>> >I dont understand why netif_rx() is needed in your patch.
>> 
>> I used netif_rx() because bridge and macvlan does that too. I did not see
>> a reason to not to do the same.
>> 
>> >
>> >Can we stack 10 bond devices or so ???
>> >
>> >If we avoid this stage and call the real thing (netif_receive_skb()),
>> >then we dont need adding a field in each skb, since it can be carried by
>> >a global variable (per cpu of course)
>> >
>> I'm probably missing something. How do netif_receive_skb() and
>> netif_rx() differ in this point of view, since both are calling:
>> "ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);"
>> ?
>> 
>> Still I see a problem with the percpu global variable. We would have to
>> store skb pointer there as well and in each __netif_receive_skb() call it
>> would have to be checked if it's different from the current one.
>> In that case store new skb and orig_Dev.
>> 
>> Leaving aside that global variables are evil in general, I still think
>> this is not nicer solution then to add skb->input_dev (although I
>> understand your arguments).
>
>Really I must miss something about "global variables" thing/fear.
>
>Kernel is full of global variables, they are not evil if properly used.

I know. But that doesn't mean it's ok. But I see your point.

>
>Take a look at net/core/dev.c :
>
>static DEFINE_PER_CPU(int, xmit_recursion);
>
>For an example of what I have in mind.

Yes I saw this. We would have to do something like:

struct skb_rx_context {
	struct sk_buff *skb;
	struct net_device *orig_dev;
};

static DEFINE_PER_CPU(struct skb_rx_context, skb_rx_context);

and then in __netif_receive_skb():

struct skb_rx_context *cont = __this_cpu_read(skb_rx_context);

if (cont->skb != skb) {
	cont->skb = skb;
	orig_dev = cont->orig_dev = skb->dev;
} else {
	orig_dev = cont->orig_dev;
}


Does this make sense?

>
>
>

^ permalink raw reply

* Re: [PATCH v6 0/9] net: Unified offload configuration
From: Michał Mirosław @ 2011-02-18 19:27 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: David Miller, netdev
In-Reply-To: <1298039371.2211.9.camel@localhost>

On Fri, Feb 18, 2011 at 02:29:31PM +0000, Ben Hutchings wrote:
> On Fri, 2011-02-18 at 15:22 +0100, Michał Mirosław wrote:
> > On Thu, Feb 17, 2011 at 02:56:11PM -0800, David Miller wrote:
> [...]
> > > Please get rid of that annoying message spit out by netif_features_change(),
> > > it's just spam.  If we want notifications for stuff like this, use a
> > > non-unicast netlink message so those who want to hear it can do so.
> > You mean netdev_update_features() "Features changed" message? Is it ok
> > to just demote it to DEBUG level or you want to remove it altogether?
> > What about netdev_fix_features() messages?
> I think you need to emit these messages at 'error' severity when fixing
> up features for a newly-added device, but at 'debug' later on.

Why the difference? Those messages on startup matter mostly to developers
and later mostly to users. Anyway, the conditions in netdev_fix_features()
are all constant (dependent only on net core implementation).

Best Regards,
Michał Mirosław

^ permalink raw reply

* Re: [patch net-next-2.6] net: convert bonding to use rx_handler
From: Eric Dumazet @ 2011-02-18 19:17 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Patrick McHardy, netdev, davem, shemminger, fubar,
	nicolas.2p.debian, andy
In-Reply-To: <20110218184725.GA2602@psychotron.redhat.com>

Le vendredi 18 février 2011 à 19:47 +0100, Jiri Pirko a écrit :
> Fri, Feb 18, 2011 at 05:14:30PM CET, eric.dumazet@gmail.com wrote:
> >Le vendredi 18 février 2011 à 16:50 +0100, Patrick McHardy a écrit :
> >> On 18.02.2011 15:58, Jiri Pirko wrote:
> >> > Fri, Feb 18, 2011 at 03:46:45PM CET, kaber@trash.net wrote:
> >> >> Am 18.02.2011 15:27, schrieb Eric Dumazet:
> >> >>> Le vendredi 18 février 2011 à 15:14 +0100, Jiri Pirko a écrit :
> >> >>>
> >> >>>> Do not know how to do it better. As for percpu variable, not only
> >> >>>> origdev would have to be remembered but also probably skb pointer to
> >> >>>> know if it's the first run on the skb or not. Can't really figure out a
> >> >>>> better solution. Can you?
> >> >>>
> >> >>> I'll try and let you know.
> >> >>
> >> >> Why not simply do a lookup on skb->iif?
> >> > 
> >> > Well I was trying to avoid iterating over list of devices for each
> >> > incoming frame.
> >> > 
> >> 
> >> Well, there are a couple of holes on 64 bit, perhaps you can rearrange
> >> things and eliminate either iif or input_dev without increasing size
> >> since they appear to be redundant.
> >
> >Jiri
> >
> >I dont understand why netif_rx() is needed in your patch.
> 
> I used netif_rx() because bridge and macvlan does that too. I did not see
> a reason to not to do the same.
> 
> >
> >Can we stack 10 bond devices or so ???
> >
> >If we avoid this stage and call the real thing (netif_receive_skb()),
> >then we dont need adding a field in each skb, since it can be carried by
> >a global variable (per cpu of course)
> >
> I'm probably missing something. How do netif_receive_skb() and
> netif_rx() differ in this point of view, since both are calling:
> "ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);"
> ?
> 
> Still I see a problem with the percpu global variable. We would have to
> store skb pointer there as well and in each __netif_receive_skb() call it
> would have to be checked if it's different from the current one.
> In that case store new skb and orig_Dev.
> 
> Leaving aside that global variables are evil in general, I still think
> this is not nicer solution then to add skb->input_dev (although I
> understand your arguments).

Really I must miss something about "global variables" thing/fear.

Kernel is full of global variables, they are not evil if properly used.

Take a look at net/core/dev.c :

static DEFINE_PER_CPU(int, xmit_recursion);

For an example of what I have in mind.




^ permalink raw reply

* Re: Possible netfilter-related memory corruption in 2.6.37
From: Eric Dumazet @ 2011-02-18 19:14 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Jan Engelhardt, Avi Kivity, netfilter-devel, Marcelo Tosatti,
	nicolas prochazka, KVM list, netdev
In-Reply-To: <4D5EBC6D.6070200@trash.net>

Le vendredi 18 février 2011 à 19:37 +0100, Patrick McHardy a écrit :
> Am 14.02.2011 17:52, schrieb Patrick McHardy:
> > Am 14.02.2011 17:48, schrieb Eric Dumazet:
> >> I am not sure, but I guess nf_reinject() needs a fix too ;)
> > 
> > I agree. That one looks uglier though, I guess we'll have to
> > iterate through all hooks to note the previous one.
> 
> How about this? Unfortunately I don't think we can avoid
> iterating through all hooks without violating RCU rules.
> 
> 

       /* Continue traversal iff userspace said ok... */
        if (verdict == NF_REPEAT) {
-               elem = elem->prev;
-               verdict = NF_ACCEPT;
+               prev = NULL;
+               list_for_each_entry_rcu(i,
&nf_hooks[entry->pf][entry->hook],
+                                       list) {
+                       if (&i->list == elem)
+                               break;
+                       prev = i;

	
Hmm... what happens if "elem" was the first elem in list ?

We exit with prev = NULL  --> NF_DROP ?

I must miss something...

+               }
+
+               if (prev == NULL ||
+                   &i->list == &nf_hooks[entry->pf][entry->hook])
+                       verdict = NF_DROP;
+               else {
+                       elem = &prev->list;
+                       verdict = NF_ACCEPT;
+               }
        }



--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: BUG: Bad page map in process udevd (anon_vma: (null)) in 2.6.38-rc4
From: Eric Dumazet @ 2011-02-18 19:13 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Linus Torvalds, Eric W. Biederman, Michal Hocko, Ingo Molnar,
	linux-mm, LKML, David Miller, netdev
In-Reply-To: <20110218190128.GF13211@ghostprotocols.net>

Le vendredi 18 février 2011 à 17:01 -0200, Arnaldo Carvalho de Melo a
écrit :

> Original code is ANK's, I just made it possible to use with DCCP, and
> yeah, the smiley is appropriate, something 6 years old and the world
> around it changing continually... well, thanks for the git blame ;-)
> 

At that time, net namespaces did not exist.

I would blame people implementing them ;)




^ permalink raw reply

* Re: BUG: Bad page map in process udevd (anon_vma: (null)) in 2.6.38-rc4
From: Arnaldo Carvalho de Melo @ 2011-02-18 19:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Michal Hocko, Ingo Molnar, linux-mm, LKML,
	David Miller, Eric Dumazet, netdev, Pavel Emelyanov
In-Reply-To: <20110218190128.GF13211@ghostprotocols.net>

Em Fri, Feb 18, 2011 at 05:01:28PM -0200, Arnaldo Carvalho de Melo escreveu:
> Em Fri, Feb 18, 2011 at 10:48:18AM -0800, Linus Torvalds escreveu:
> > This seems to be a fairly straightforward bug.
> > 
> > In net/ipv4/inet_timewait_sock.c we have this:
> > 
> >   /* These are always called from BH context.  See callers in
> >    * tcp_input.c to verify this.
> >    */
> > 
> >   /* This is for handling early-kills of TIME_WAIT sockets. */
> >   void inet_twsk_deschedule(struct inet_timewait_sock *tw,
> >                             struct inet_timewait_death_row *twdr)
> >   {
> >           spin_lock(&twdr->death_lock);
> >           ..
> > 
> > and the intention is clearly that that spin_lock is BH-safe because
> > it's called from BH context.
> > 
> > Except that clearly isn't true. It's called from a worker thread:
> > 
> > > stack backtrace:
> > > Pid: 10833, comm: kworker/u:1 Not tainted 2.6.38-rc4-359399.2010AroraKernelBeta.fc14.x86_64 #1
> > > Call Trace:
> > >  [<ffffffff81460e69>] ? inet_twsk_deschedule+0x29/0xa0
> > >  [<ffffffff81460fd6>] ? inet_twsk_purge+0xf6/0x180
> > >  [<ffffffff81460f10>] ? inet_twsk_purge+0x30/0x180
> > >  [<ffffffff814760fc>] ? tcp_sk_exit_batch+0x1c/0x20
> > >  [<ffffffff8141c1d3>] ? ops_exit_list.clone.0+0x53/0x60
> > >  [<ffffffff8141c520>] ? cleanup_net+0x100/0x1b0
> > >  [<ffffffff81068c47>] ? process_one_work+0x187/0x4b0
> > >  [<ffffffff81068be1>] ? process_one_work+0x121/0x4b0
> > >  [<ffffffff8141c420>] ? cleanup_net+0x0/0x1b0
> > >  [<ffffffff8106a65c>] ? worker_thread+0x15c/0x330
> > 
> > so it can deadlock with a BH happening at the same time, afaik.
> > 
> > The code (and comment) is all from 2005, it looks like the BH->worker
> > thread has broken the code. But somebody who knows that code better
> > should take a deeper look at it.
> > 
> > Added acme to the cc, since the code is attributed to him back in 2005
> > ;). Although I don't know how active he's been in networking lately
> > (seems to be all perf-related). Whatever, it can't hurt.
> 
> Original code is ANK's, I just made it possible to use with DCCP, and
> yeah, the smiley is appropriate, something 6 years old and the world
> around it changing continually... well, thanks for the git blame ;-)

But yeah, your analisys seems correct, with the bug being introduced by
one of these world around it changing continually issues, networking
namespaces broke the rules of the game on its cleanup_net() routine,
adding Pavel to the CC list since it doesn't hurt ;-)

- Arnaldo

^ permalink raw reply

* Re: BUG: Bad page map in process udevd (anon_vma: (null)) in 2.6.38-rc4
From: Arnaldo Carvalho de Melo @ 2011-02-18 19:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Michal Hocko, Ingo Molnar, linux-mm, LKML,
	David Miller, Eric Dumazet, netdev
In-Reply-To: <AANLkTikh4oaR6CBK3NBazer7yjhE0VndsUB5FCDRsbJc@mail.gmail.com>

Em Fri, Feb 18, 2011 at 10:48:18AM -0800, Linus Torvalds escreveu:
> On Fri, Feb 18, 2011 at 10:08 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
> >
> > I am still getting programs segfaulting but that is happening on other
> > machines running on older kernels so I am going to chalk that up to a
> > buggy test and a false positive.
> 
> Ok.
> 
> > I am have OOM problems getting my tests run to complete.  On a good
> > day that happens about 1 time in 3 right now.  I'm guess I will have
> > to turn off DEBUG_PAGEALLOC to get everything to complete.
> > DEBUG_PAGEALLOC causes us to use more memory doesn't it?
> 
> It does use a bit more memory, but it shouldn't be _that_ noticeable.
> The real cost of DEBUG_PAGEALLOC is all the crazy page table
> operations and TLB flushes we do for each allocation/deallocation. So
> DEBUG_PAGEALLOC is very CPU-intensive, but it shouldn't have _that_
> much of a memory overhead - just some trivial overhead due to not
> being able to use largepages for the normal kernel identity mappings.
> 
> But there might be some other interaction with OOM that I haven't thought about.
> 
> > The most interesting thing I have right now is a networking lockdep
> > issue.  Does anyone know what is going on there?
> 
> This seems to be a fairly straightforward bug.
> 
> In net/ipv4/inet_timewait_sock.c we have this:
> 
>   /* These are always called from BH context.  See callers in
>    * tcp_input.c to verify this.
>    */
> 
>   /* This is for handling early-kills of TIME_WAIT sockets. */
>   void inet_twsk_deschedule(struct inet_timewait_sock *tw,
>                             struct inet_timewait_death_row *twdr)
>   {
>           spin_lock(&twdr->death_lock);
>           ..
> 
> and the intention is clearly that that spin_lock is BH-safe because
> it's called from BH context.
> 
> Except that clearly isn't true. It's called from a worker thread:
> 
> > stack backtrace:
> > Pid: 10833, comm: kworker/u:1 Not tainted 2.6.38-rc4-359399.2010AroraKernelBeta.fc14.x86_64 #1
> > Call Trace:
> >  [<ffffffff81460e69>] ? inet_twsk_deschedule+0x29/0xa0
> >  [<ffffffff81460fd6>] ? inet_twsk_purge+0xf6/0x180
> >  [<ffffffff81460f10>] ? inet_twsk_purge+0x30/0x180
> >  [<ffffffff814760fc>] ? tcp_sk_exit_batch+0x1c/0x20
> >  [<ffffffff8141c1d3>] ? ops_exit_list.clone.0+0x53/0x60
> >  [<ffffffff8141c520>] ? cleanup_net+0x100/0x1b0
> >  [<ffffffff81068c47>] ? process_one_work+0x187/0x4b0
> >  [<ffffffff81068be1>] ? process_one_work+0x121/0x4b0
> >  [<ffffffff8141c420>] ? cleanup_net+0x0/0x1b0
> >  [<ffffffff8106a65c>] ? worker_thread+0x15c/0x330
> 
> so it can deadlock with a BH happening at the same time, afaik.
> 
> The code (and comment) is all from 2005, it looks like the BH->worker
> thread has broken the code. But somebody who knows that code better
> should take a deeper look at it.
> 
> Added acme to the cc, since the code is attributed to him back in 2005
> ;). Although I don't know how active he's been in networking lately
> (seems to be all perf-related). Whatever, it can't hurt.

Original code is ANK's, I just made it possible to use with DCCP, and
yeah, the smiley is appropriate, something 6 years old and the world
around it changing continually... well, thanks for the git blame ;-)

- Arnaldo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: BUG: Bad page map in process udevd (anon_vma: (null)) in 2.6.38-rc4
From: Linus Torvalds @ 2011-02-18 18:48 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Michal Hocko, Ingo Molnar, linux-mm, LKML, David Miller,
	Eric Dumazet, netdev, Arnaldo Carvalho de Melo
In-Reply-To: <m17hcx43m3.fsf@fess.ebiederm.org>

On Fri, Feb 18, 2011 at 10:08 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> I am still getting programs segfaulting but that is happening on other
> machines running on older kernels so I am going to chalk that up to a
> buggy test and a false positive.

Ok.

> I am have OOM problems getting my tests run to complete.  On a good
> day that happens about 1 time in 3 right now.  I'm guess I will have
> to turn off DEBUG_PAGEALLOC to get everything to complete.
> DEBUG_PAGEALLOC causes us to use more memory doesn't it?

It does use a bit more memory, but it shouldn't be _that_ noticeable.
The real cost of DEBUG_PAGEALLOC is all the crazy page table
operations and TLB flushes we do for each allocation/deallocation. So
DEBUG_PAGEALLOC is very CPU-intensive, but it shouldn't have _that_
much of a memory overhead - just some trivial overhead due to not
being able to use largepages for the normal kernel identity mappings.

But there might be some other interaction with OOM that I haven't thought about.

> The most interesting thing I have right now is a networking lockdep
> issue.  Does anyone know what is going on there?

This seems to be a fairly straightforward bug.

In net/ipv4/inet_timewait_sock.c we have this:

  /* These are always called from BH context.  See callers in
   * tcp_input.c to verify this.
   */

  /* This is for handling early-kills of TIME_WAIT sockets. */
  void inet_twsk_deschedule(struct inet_timewait_sock *tw,
                            struct inet_timewait_death_row *twdr)
  {
          spin_lock(&twdr->death_lock);
          ..

and the intention is clearly that that spin_lock is BH-safe because
it's called from BH context.

Except that clearly isn't true. It's called from a worker thread:

> stack backtrace:
> Pid: 10833, comm: kworker/u:1 Not tainted 2.6.38-rc4-359399.2010AroraKernelBeta.fc14.x86_64 #1
> Call Trace:
>  [<ffffffff81460e69>] ? inet_twsk_deschedule+0x29/0xa0
>  [<ffffffff81460fd6>] ? inet_twsk_purge+0xf6/0x180
>  [<ffffffff81460f10>] ? inet_twsk_purge+0x30/0x180
>  [<ffffffff814760fc>] ? tcp_sk_exit_batch+0x1c/0x20
>  [<ffffffff8141c1d3>] ? ops_exit_list.clone.0+0x53/0x60
>  [<ffffffff8141c520>] ? cleanup_net+0x100/0x1b0
>  [<ffffffff81068c47>] ? process_one_work+0x187/0x4b0
>  [<ffffffff81068be1>] ? process_one_work+0x121/0x4b0
>  [<ffffffff8141c420>] ? cleanup_net+0x0/0x1b0
>  [<ffffffff8106a65c>] ? worker_thread+0x15c/0x330

so it can deadlock with a BH happening at the same time, afaik.

The code (and comment) is all from 2005, it looks like the BH->worker
thread has broken the code. But somebody who knows that code better
should take a deeper look at it.

Added acme to the cc, since the code is attributed to him back in 2005
;). Although I don't know how active he's been in networking lately
(seems to be all perf-related). Whatever, it can't hurt.

                   Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [patch net-next-2.6] net: convert bonding to use rx_handler
From: Jiri Pirko @ 2011-02-18 18:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Patrick McHardy, netdev, davem, shemminger, fubar,
	nicolas.2p.debian, andy
In-Reply-To: <1298045670.6201.73.camel@edumazet-laptop>

Fri, Feb 18, 2011 at 05:14:30PM CET, eric.dumazet@gmail.com wrote:
>Le vendredi 18 février 2011 à 16:50 +0100, Patrick McHardy a écrit :
>> On 18.02.2011 15:58, Jiri Pirko wrote:
>> > Fri, Feb 18, 2011 at 03:46:45PM CET, kaber@trash.net wrote:
>> >> Am 18.02.2011 15:27, schrieb Eric Dumazet:
>> >>> Le vendredi 18 février 2011 à 15:14 +0100, Jiri Pirko a écrit :
>> >>>
>> >>>> Do not know how to do it better. As for percpu variable, not only
>> >>>> origdev would have to be remembered but also probably skb pointer to
>> >>>> know if it's the first run on the skb or not. Can't really figure out a
>> >>>> better solution. Can you?
>> >>>
>> >>> I'll try and let you know.
>> >>
>> >> Why not simply do a lookup on skb->iif?
>> > 
>> > Well I was trying to avoid iterating over list of devices for each
>> > incoming frame.
>> > 
>> 
>> Well, there are a couple of holes on 64 bit, perhaps you can rearrange
>> things and eliminate either iif or input_dev without increasing size
>> since they appear to be redundant.
>
>Jiri
>
>I dont understand why netif_rx() is needed in your patch.

I used netif_rx() because bridge and macvlan does that too. I did not see
a reason to not to do the same.

>
>Can we stack 10 bond devices or so ???
>
>If we avoid this stage and call the real thing (netif_receive_skb()),
>then we dont need adding a field in each skb, since it can be carried by
>a global variable (per cpu of course)
>
I'm probably missing something. How do netif_receive_skb() and
netif_rx() differ in this point of view, since both are calling:
"ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);"
?

Still I see a problem with the percpu global variable. We would have to
store skb pointer there as well and in each __netif_receive_skb() call it
would have to be checked if it's different from the current one.
In that case store new skb and orig_Dev.

Leaving aside that global variables are evil in general, I still think
this is not nicer solution then to add skb->input_dev (although I
understand your arguments).


>bond_handle_frame() being called from __netif_receive_skb() I believe it
>can use netif_receive_skb() instead of netif_rx().
>
>Same remark for vlan_on_bond_hook()
>
>
>

^ permalink raw reply

* Re: Possible netfilter-related memory corruption in 2.6.37
From: Patrick McHardy @ 2011-02-18 18:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jan Engelhardt, Avi Kivity, netfilter-devel, Marcelo Tosatti,
	nicolas prochazka, KVM list, netdev
In-Reply-To: <4D595DBD.3090005@trash.net>

[-- Attachment #1: Type: text/plain, Size: 391 bytes --]

Am 14.02.2011 17:52, schrieb Patrick McHardy:
> Am 14.02.2011 17:48, schrieb Eric Dumazet:
>> I am not sure, but I guess nf_reinject() needs a fix too ;)
> 
> I agree. That one looks uglier though, I guess we'll have to
> iterate through all hooks to note the previous one.

How about this? Unfortunately I don't think we can avoid
iterating through all hooks without violating RCU rules.



[-- Attachment #2: x.diff --]
[-- Type: text/plain, Size: 1009 bytes --]

diff --git a/net/netfilter/nf_queue.c b/net/netfilter/nf_queue.c
index 74aebed..834bb07 100644
--- a/net/netfilter/nf_queue.c
+++ b/net/netfilter/nf_queue.c
@@ -235,6 +235,7 @@ int nf_queue(struct sk_buff *skb,
 void nf_reinject(struct nf_queue_entry *entry, unsigned int verdict)
 {
 	struct sk_buff *skb = entry->skb;
+	struct nf_hook_ops *i, *prev;
 	struct list_head *elem = &entry->elem->list;
 	const struct nf_afinfo *afinfo;
 
@@ -244,8 +245,21 @@ void nf_reinject(struct nf_queue_entry *entry, unsigned int verdict)
 
 	/* Continue traversal iff userspace said ok... */
 	if (verdict == NF_REPEAT) {
-		elem = elem->prev;
-		verdict = NF_ACCEPT;
+		prev = NULL;
+		list_for_each_entry_rcu(i, &nf_hooks[entry->pf][entry->hook],
+					list) {
+			if (&i->list == elem)
+				break;
+			prev = i;
+		}
+
+		if (prev == NULL ||
+		    &i->list == &nf_hooks[entry->pf][entry->hook])
+			verdict = NF_DROP;
+		else {
+			elem = &prev->list;
+			verdict = NF_ACCEPT;
+		}
 	}
 
 	if (verdict == NF_ACCEPT) {

^ permalink raw reply related

* Re: BUG: Bad page map in process udevd (anon_vma: (null)) in 2.6.38-rc4
From: Eric W. Biederman @ 2011-02-18 18:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michal Hocko, Ingo Molnar, linux-mm, LKML, David Miller,
	Eric Dumazet, netdev
In-Reply-To: <AANLkTimO=M5xG_mnDBSxPKwSOTrp3JhHVBa8=wHsiVHY@mail.gmail.com>

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Fri, Feb 18, 2011 at 8:26 AM, Michal Hocko <mhocko@suse.cz> wrote:
>>> Now, I will try with the 2 patches patches in this thread. I will also
>>> turn on DEBUG_LIST and DEBUG_PAGEALLOC.
>>
>> I am not able to reproduce with those 2 patches applied.
>
> Thanks for verifying. Davem/EricD - you can add Michal's tested-by to
> the patches too.
>
> And I think we can consider this whole thing solved. It hopefully also
> explains all the other random crashes that EricB saw - just random
> memory corruption in other datastructures.
>
> EricB - do all your stress-testers run ok now?

Things are looking better and PAGEALLOC debug isn't firing.
So this looks like one bug down.  I have not seen the bad page map
symptom.

I am still getting programs segfaulting but that is happening on other
machines running on older kernels so I am going to chalk that up to a
buggy test and a false positive.

I am have OOM problems getting my tests run to complete.  On a good
day that happens about 1 time in 3 right now.  I'm guess I will have
to turn off DEBUG_PAGEALLOC to get everything to complete.
DEBUG_PAGEALLOC causes us to use more memory doesn't it?

The most interesting thing I have right now is a networking lockdep
issue.  Does anyone know what is going on there?

Eric


=================================
[ INFO: inconsistent lock state ]
2.6.38-rc4-359399.2010AroraKernelBeta.fc14.x86_64 #1
---------------------------------
inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
kworker/u:1/10833 [HC0[0]:SC0[0]:HE1:SE1] takes:
 (tcp_death_row.death_lock){+.?...}, at: [<ffffffff81460e69>] inet_twsk_deschedule+0x29/0xa0
{IN-SOFTIRQ-W} state was registered at:
  [<ffffffff810840ce>] __lock_acquire+0x70e/0x1d30
  [<ffffffff81085cff>] lock_acquire+0x9f/0x120
  [<ffffffff814deb6c>] _raw_spin_lock+0x2c/0x40
  [<ffffffff8146066b>] inet_twsk_schedule+0x3b/0x1e0
  [<ffffffff8147bf7d>] tcp_time_wait+0x20d/0x380
  [<ffffffff8146b46e>] tcp_fin.clone.39+0x10e/0x1c0
  [<ffffffff8146c628>] tcp_data_queue+0x798/0xd50
  [<ffffffff8146fdd9>] tcp_rcv_state_process+0x799/0xbb0
  [<ffffffff814786d8>] tcp_v4_do_rcv+0x238/0x500
  [<ffffffff8147a90a>] tcp_v4_rcv+0x86a/0xbe0
  [<ffffffff81455d4d>] ip_local_deliver_finish+0x10d/0x380
  [<ffffffff81456180>] ip_local_deliver+0x80/0x90
  [<ffffffff81455832>] ip_rcv_finish+0x192/0x5a0
  [<ffffffff814563c4>] ip_rcv+0x234/0x300
  [<ffffffff81420e83>] __netif_receive_skb+0x443/0x700
  [<ffffffff81421d68>] netif_receive_skb+0xb8/0xf0
  [<ffffffff81421ed8>] napi_skb_finish+0x48/0x60
  [<ffffffff81422d35>] napi_gro_receive+0xb5/0xc0
  [<ffffffffa006b4cf>] igb_poll+0x89f/0xd20 [igb]
  [<ffffffff81422279>] net_rx_action+0x149/0x270
  [<ffffffff81054bc0>] __do_softirq+0xc0/0x1f0
  [<ffffffff81003d1c>] call_softirq+0x1c/0x30
  [<ffffffff81005825>] do_softirq+0xa5/0xe0
  [<ffffffff81054dfd>] irq_exit+0x8d/0xa0
  [<ffffffff81005391>] do_IRQ+0x61/0xe0
  [<ffffffff814df793>] ret_from_intr+0x0/0x1a
  [<ffffffff810ec9ed>] ____pagevec_lru_add+0x16d/0x1a0
  [<ffffffff810ed073>] lru_add_drain+0x73/0xe0
  [<ffffffff8110a44c>] exit_mmap+0x5c/0x180
  [<ffffffff8104aad2>] mmput+0x52/0xe0
  [<ffffffff810513c0>] exit_mm+0x120/0x150
  [<ffffffff81051522>] do_exit+0x132/0x8c0
  [<ffffffff81051f39>] do_group_exit+0x59/0xd0
  [<ffffffff81051fc2>] sys_exit_group+0x12/0x20
  [<ffffffff81002d92>] system_call_fastpath+0x16/0x1b
irq event stamp: 187417
hardirqs last  enabled at (187417): [<ffffffff81127db5>] kmem_cache_free+0x125/0x160
hardirqs last disabled at (187416): [<ffffffff81127d02>] kmem_cache_free+0x72/0x160
softirqs last  enabled at (187410): [<ffffffff81411c52>] sk_common_release+0x62/0xc0
softirqs last disabled at (187408): [<ffffffff814dec41>] _raw_write_lock_bh+0x11/0x40
other info that might help us debug this:
3 locks held by kworker/u:1/10833:
 #0:  (netns){.+.+.+}, at: [<ffffffff81068be1>] process_one_work+0x121/0x4b0
 #1:  (net_cleanup_work){+.+.+.}, at: [<ffffffff81068be1>] process_one_work+0x121/0x4b0
 #2:  (net_mutex){+.+.+.}, at: [<ffffffff8141c4a0>] cleanup_net+0x80/0x1b0

stack backtrace:
Pid: 10833, comm: kworker/u:1 Not tainted 2.6.38-rc4-359399.2010AroraKernelBeta.fc14.x86_64 #1
Call Trace:
 [<ffffffff810835b0>] ? print_usage_bug+0x170/0x180
 [<ffffffff8108393f>] ? mark_lock+0x37f/0x400
 [<ffffffff81084150>] ? __lock_acquire+0x790/0x1d30
 [<ffffffff81083d8f>] ? __lock_acquire+0x3cf/0x1d30
 [<ffffffff81124acf>] ? check_object+0xaf/0x270
 [<ffffffff81460e69>] ? inet_twsk_deschedule+0x29/0xa0
 [<ffffffff81085cff>] ? lock_acquire+0x9f/0x120
 [<ffffffff81460e69>] ? inet_twsk_deschedule+0x29/0xa0
 [<ffffffff81410f49>] ? __sk_free+0xd9/0x160
 [<ffffffff814deb6c>] ? _raw_spin_lock+0x2c/0x40
 [<ffffffff81460e69>] ? inet_twsk_deschedule+0x29/0xa0
 [<ffffffff81460e69>] ? inet_twsk_deschedule+0x29/0xa0
 [<ffffffff81460fd6>] ? inet_twsk_purge+0xf6/0x180
 [<ffffffff81460f10>] ? inet_twsk_purge+0x30/0x180
 [<ffffffff814760fc>] ? tcp_sk_exit_batch+0x1c/0x20
 [<ffffffff8141c1d3>] ? ops_exit_list.clone.0+0x53/0x60
 [<ffffffff8141c520>] ? cleanup_net+0x100/0x1b0
 [<ffffffff81068c47>] ? process_one_work+0x187/0x4b0
 [<ffffffff81068be1>] ? process_one_work+0x121/0x4b0
 [<ffffffff8141c420>] ? cleanup_net+0x0/0x1b0
 [<ffffffff8106a65c>] ? worker_thread+0x15c/0x330
 [<ffffffff8106a500>] ? worker_thread+0x0/0x330
 [<ffffffff8106f226>] ? kthread+0xb6/0xc0
 [<ffffffff8108678d>] ? trace_hardirqs_on_caller+0x13d/0x180
 [<ffffffff81003c24>] ? kernel_thread_helper+0x4/0x10
 [<ffffffff814df854>] ? restore_args+0x0/0x30
 [<ffffffff8106f170>] ? kthread+0x0/0xc0
 [<ffffffff81003c20>] ? kernel_thread_helper+0x0/0x10

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* how to listen() on single IP address but very many ports?
From: Chris Friesen @ 2011-02-18 17:55 UTC (permalink / raw)
  To: netdev


I have an application team that needs to listen() for tcp connections on
many ports (and by many I mean pretty much all 64K ports).  However, the
connections are short-lived, and the number of active connections at any
given time is small.

Apparently when they tried this before on an older kernel the
performance of the naive "open 60K sockets and call listen()" solution
was not acceptable, so they used NAT with port mapping to direct all the
incoming packets to a single real port.  However, they now want to add
support for IPv6 and this solution won't work.

What's the recommended method for efficiently listening on this many
ports?  Should I be able to efficiently listen() on that many sockets
using epoll or similar?  If there isn't a way to do this, is there an
equivalent IPv6 workaround?

One possible solution that came up was to implement a PORT_ANY which
would match any incoming request that didn't already have an explicit
listener.  Even better would be a way to bind a single listening socket
to a range of ports.

Has anyone ever considered something like this?

Chris

-- 
Chris Friesen
Software Developer
GENBAND
chris.friesen@genband.com
www.genband.com

^ permalink raw reply

* Re: IGMPMSG_WRONGVIF only happens when true VIF is an OIF of correct IIF?
From: David Lamparter @ 2011-02-18 17:12 UTC (permalink / raw)
  To: Dan Langlois; +Cc: netdev
In-Reply-To: <1298044016.2898.12.camel@thor>

[-- Attachment #1: Type: text/plain, Size: 3732 bytes --]

On Fri, Feb 18, 2011 at 04:46:56PM +0100, Dan Langlois wrote:
> Hi, I have a question about the IGMPMSG_WRONGVIF message.
> I have MRT_ASSERT enabled but do not receive notifications.
> However, if I enable MRT_PIM, I do get them.  I'm curious
> why MRT_ASSERT alone isn't sufficient.
> 
> This if-statement appears in ip_mr_forward() in net/ipv4/ipmr.c
> 
>        if (true_vifi >= 0 && net->ipv4.mroute_do_assert &&
>             /* pimsm uses asserts, when switching from RPT to SPT, 
>                so that we cannot check that packet arrived on an oif. 
>                It is bad, but otherwise we would need to move pretty
>                large chunk of pimd to kernel. Ough... --ANK
>              */      
>             (net->ipv4.mroute_do_pim ||
>              cache->mfc_un.res.ttls[true_vifi] < 255) && 
>             time_after(jiffies,
>                        cache->mfc_un.res.last_assert +
> MFC_ASSERT_THRESH)) {
>                 cache->mfc_un.res.last_assert = jiffies;
>                 ipmr_cache_report(net, skb, true_vifi,
> IGMPMSG_WRONGVIF);
>         }       
> 
> 
> By the time this statement is reached, it is known that the packet
> did not arrive on the expected incoming interface (IIF) and thus a
> "WRONGVIF" condition.  This if-statement decides whether a PIM-assert
> notification needs to be sent to the multicast routing daemon.
> 
> The part of this if-statement I'm questioning is:
>     (net->ipv4.mroute_do_pim ||
>      cache->mfc_un.res.ttls[true_vifi] < 255) &&
> 
> I read this as:
> "send WRONGVIF if PIM is enabled -OR-
> the packet arrived on an outgoing interface OIF of the correct IIF"
> 
> So this statement is always true when PIM is enabled.
> However, if only ASSERT is enabled, this statement is only true when
> a packet is reflected back on an OIF.
> 
> Why not always send the assert for any WRONGVIF condition regardless
> of the interface it came in on?  I mean, isn't that the point of
> enabling MRT_ASSERT in the first place?  If the assertions weren't
> wanted, just don't enable that feature, right?

The point of MRT_ASSERT is to aid in resolving duplication issues where
you end up with multiple mrouters thinking they're responsible for the
same subnet. This condition is indicated by more than 1 router on the
subnet having the subnet as OIF on the (*,G) or (S,G). Therefore, your
daemon receives the assertion if it is one of the forwarding mrouters,
which it will only be if it's an OIF.

If your daemon doesn't have the target subnet listed as oif, the
assumption is that a different mrouter has been elected to forward
packets for this G/S,G to this subnet. The kernel knows that your daemon
decided to not forward things to this subnet, so you are not involved in
duplicates, so you don't get an assert.

The "PIM or" is a kludge to make PIM work, as indicated by the comment
above the if. I'd have to refreshen my PIM knowledge to fully understand
it or explain it ;)

> In our system, multicast streams may get rerouted through a different
> VIF than what was first installed in the MFC cache.  I would very much
> like to get these WRONGVIF assertions in this case, which is NOT the
> case that a packet is seen on an OIF.
> 
> Of course I can simply enable MRT_PIM to get the messages, but in that
> case I don't understand the reason for having two separate toggles
> (MRT_ASSERT versus MRT_PIM).

I don't really understand you use case -- is this a case of another
router showing up on the subnet and directing traffic to it? If so, why
wasn't the first router directing traffic to it? Do you have a primary
multicast forwarder election system in place?


-David


[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: [PATCH][RFC] sctp: fix reporting of unknown parameters
From: Vladislav Yasevich @ 2011-02-18 14:43 UTC (permalink / raw)
  To: Jiri Bohac; +Cc: Shan Wei, linux-sctp, Neil Horman
In-Reply-To: <20110218120142.GA24525@midget.suse.cz>

On 02/18/2011 07:01 AM, Jiri Bohac wrote:
> On Fri, Feb 18, 2011 at 07:26:18PM +0800, Shan Wei wrote:
>> Jiri Bohac wrote, at 02/18/2011 07:12 AM:
>>> commit 5fa782c2f5ef6c2e4f04d3e228412c9b4a4c8809 re-worked the 
>>> handling of unknown parameters. sctp_init_cause_fixed() can now
>>> return -ENOSPC if there is not enough tailroom in the error
>>> chunk skb. When this happens, the error header is not appended to
>>> the error chunk. In that case, the payload of the unknown parameter
>>> should not be appended either.
>>>
>>> Signed-off-by: Jiri Bohac <jbohac@suse.cz>
>>
>> For this case, there is no more tailroom in skb
> 
> not always true:
> 
> both sctp_init_cause_fixed() and sctp_addto_chunk_fixed() get
> passesd WORD_ROUND(ntohs(param.p->length) as the paylen/len
> parameter.
> 
> sctp_init_cause_fixed() does:
> 	len = sizeof(sctp_errhdr_t) + paylen;
>         if (skb_tailroom(chunk->skb) < len)
> 	                return -ENOSPC;
> 
> if paylen is only slightly smaller than the tailroom (by less
> than sizeof(sctp_errhdr_t) bytes), -ENOSPEC will be returned and
> the header will not be appended to the error chunk.
> 
> But later sctp_addto_chunk_fixed() does this check:
> 	if (skb_tailroom(chunk->skb) >= len)
> which will pass and the unknown parameter value will be appended
> to the error chunk without the error header.
> 
>> we send incomplete INIT-ACK chunk. With your patch, this chunk also 
>> can't info the sender of INIT Chunk which parameter is not unrecognized.
>> So maybe this handle is not perfect. 
> 
> The logic of not reporting unknown parameters for which we don't
> have space in the pre-allocated buffer remains unchanged.
> 


Hi Jiri

I agree.  Good catch.  If we can not add the error header to that packet, then we definitely
should not be adding the error payload either.

We also don't want to abort as Shan Wei suggested.  It is allowed, when exceeding the MTU to
drop errors that would not otherwise fit into the packet.  It just may so happen that we
can report subsequent unknown parameters, but not this one in particular.

Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com>

-vlad




^ permalink raw reply

* Re: [PATCH][RFC] sctp: fix reporting of unknown parameters
From: Jiri Bohac @ 2011-02-18 12:01 UTC (permalink / raw)
  To: Shan Wei; +Cc: Jiri Bohac, linux-sctp, Neil Horman, Vlad Yasevich
In-Reply-To: <4D5E575A.9000905@cn.fujitsu.com>

On Fri, Feb 18, 2011 at 07:26:18PM +0800, Shan Wei wrote:
> Jiri Bohac wrote, at 02/18/2011 07:12 AM:
> > commit 5fa782c2f5ef6c2e4f04d3e228412c9b4a4c8809 re-worked the 
> > handling of unknown parameters. sctp_init_cause_fixed() can now
> > return -ENOSPC if there is not enough tailroom in the error
> > chunk skb. When this happens, the error header is not appended to
> > the error chunk. In that case, the payload of the unknown parameter
> > should not be appended either.
> > 
> > Signed-off-by: Jiri Bohac <jbohac@suse.cz>
> 
> For this case, there is no more tailroom in skb

not always true:

both sctp_init_cause_fixed() and sctp_addto_chunk_fixed() get
passesd WORD_ROUND(ntohs(param.p->length) as the paylen/len
parameter.

sctp_init_cause_fixed() does:
	len = sizeof(sctp_errhdr_t) + paylen;
        if (skb_tailroom(chunk->skb) < len)
	                return -ENOSPC;

if paylen is only slightly smaller than the tailroom (by less
than sizeof(sctp_errhdr_t) bytes), -ENOSPEC will be returned and
the header will not be appended to the error chunk.

But later sctp_addto_chunk_fixed() does this check:
	if (skb_tailroom(chunk->skb) >= len)
which will pass and the unknown parameter value will be appended
to the error chunk without the error header.

> we send incomplete INIT-ACK chunk. With your patch, this chunk also 
> can't info the sender of INIT Chunk which parameter is not unrecognized.
> So maybe this handle is not perfect. 

The logic of not reporting unknown parameters for which we don't
have space in the pre-allocated buffer remains unchanged.

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ


^ permalink raw reply

* Re: [PATCH][RFC] sctp: fix reporting of unknown parameters
From: Shan Wei @ 2011-02-18 11:26 UTC (permalink / raw)
  To: Jiri Bohac; +Cc: linux-sctp, Neil Horman, Vlad Yasevich
In-Reply-To: <20110217231208.GA28281@midget.suse.cz>

(cc vlad Yasevich)

Jiri Bohac wrote, at 02/18/2011 07:12 AM:
> Hi,
> 
> commit 5fa782c2f5ef6c2e4f04d3e228412c9b4a4c8809 re-worked the 
> handling of unknown parameters. sctp_init_cause_fixed() can now
> return -ENOSPC if there is not enough tailroom in the error
> chunk skb. When this happens, the error header is not appended to
> the error chunk. In that case, the payload of the unknown parameter
> should not be appended either.
> 
> Signed-off-by: Jiri Bohac <jbohac@suse.cz>

For this case, there is no more tailroom in skb, originally
we send incomplete INIT-ACK chunk. With your patch, this chunk also 
can't info the sender of INIT Chunk which parameter is not unrecognized.
So maybe this handle is not perfect. 

So, for this case, i think send ABORT is more appropriate.
Just take same handle with SCTP_IERROR_NOMEM

> 
> 
> diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
> index 2cc46f0..b23428f 100644
> --- a/net/sctp/sm_make_chunk.c
> +++ b/net/sctp/sm_make_chunk.c
> @@ -2029,11 +2029,11 @@ static sctp_ierror_t sctp_process_unk_param(const struct sctp_association *asoc,
>  			*errp = sctp_make_op_error_fixed(asoc, chunk);
>  
>  		if (*errp) {
> -			sctp_init_cause_fixed(*errp, SCTP_ERROR_UNKNOWN_PARAM,
> -					WORD_ROUND(ntohs(param.p->length)));
> -			sctp_addto_chunk_fixed(*errp,
> -					WORD_ROUND(ntohs(param.p->length)),
> -					param.v);
> +			if (!sctp_init_cause_fixed(*errp, SCTP_ERROR_UNKNOWN_PARAM,
> +					WORD_ROUND(ntohs(param.p->length))))
> +				sctp_addto_chunk_fixed(*errp,
> +						WORD_ROUND(ntohs(param.p->length)),
> +						param.v);
>  		} else {
>  			/* If there is no memory for generating the ERROR
>  			 * report as specified, an ABORT will be triggered
> 
> 


-- 

Best Regards
-----
Shan Wei

^ permalink raw reply

* [PATCH][RFC] sctp: fix reporting of unknown parameters
From: Jiri Bohac @ 2011-02-17 23:12 UTC (permalink / raw)
  To: linux-sctp, Neil Horman

Hi,

commit 5fa782c2f5ef6c2e4f04d3e228412c9b4a4c8809 re-worked the 
handling of unknown parameters. sctp_init_cause_fixed() can now
return -ENOSPC if there is not enough tailroom in the error
chunk skb. When this happens, the error header is not appended to
the error chunk. In that case, the payload of the unknown parameter
should not be appended either.

Signed-off-by: Jiri Bohac <jbohac@suse.cz>


diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index 2cc46f0..b23428f 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -2029,11 +2029,11 @@ static sctp_ierror_t sctp_process_unk_param(const struct sctp_association *asoc,
 			*errp = sctp_make_op_error_fixed(asoc, chunk);
 
 		if (*errp) {
-			sctp_init_cause_fixed(*errp, SCTP_ERROR_UNKNOWN_PARAM,
-					WORD_ROUND(ntohs(param.p->length)));
-			sctp_addto_chunk_fixed(*errp,
-					WORD_ROUND(ntohs(param.p->length)),
-					param.v);
+			if (!sctp_init_cause_fixed(*errp, SCTP_ERROR_UNKNOWN_PARAM,
+					WORD_ROUND(ntohs(param.p->length))))
+				sctp_addto_chunk_fixed(*errp,
+						WORD_ROUND(ntohs(param.p->length)),
+						param.v);
 		} else {
 			/* If there is no memory for generating the ERROR
 			 * report as specified, an ABORT will be triggered


-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox