Netdev List

* Re: [PATCH 001/001] ipv4: enable use of 240/4 address space
From: YOSHIFUJI Hideaki / 吉藤英明 @ 2008-01-11 12:41 UTC (permalink / raw)
  To: andi; +Cc: vaf, netdev, linux-kernel, yoshfuji
In-Reply-To: <p73bq7smx1t.fsf@bingen.suse.de>

In article <p73bq7smx1t.fsf@bingen.suse.de> (at Fri, 11 Jan 2008 12:17:02 +0100), Andi Kleen <andi@firstfloor.org> says:

> Vince Fuller <vaf@cisco.com> writes:
> 
> > from Vince Fuller <vaf@vaf.net>
> >
> > This set of diffs modify the 2.6.20 kernel to enable use of the 240/4
> > (aka "class-E") address space as consistent with the Internet Draft
> > draft-fuller-240space-00.txt.
> 
> Wouldn't it be wise to at least wait for it becoming an RFC first? 

I do think so, too.

There was no positive consensus on this draft
at the intarea meeting in Vancouver, right?

We cannot / should not enable that space until we have reached
a consensus on it.

--yoshfuji

* Re: [PATCH 1/5] spidernet: add missing initialization
From: Jens Osterkamp @ 2008-01-11 12:44 UTC (permalink / raw)
  To: Ishizaki Kou; +Cc: linasvepstas, netdev, cbe-oss-dev, Jeff Garzik
In-Reply-To: <20080111.153859.-1300526764.kouish@swc.toshiba.co.jp>

On Friday 11 January 2008, Ishizaki Kou wrote:
> This patch fixes initialization of "aneg_count" and "medium" fields in
> spider_net_card so that the spidernet driver correctly sets "link status".
> 
> Signed-off-by: Kou Ishizaki <kou.ishizaki@toshiba.co.jp>

Hi Ishizaki,

Linas has left the company and is no longer doing kernel related stuff,
so I suggest, given Jeff is ok with that, that the two of us take over
spidernet maintainership.

Jens

---

Change maintainership for spidernet.

Signed-off-by: Jens Osterkamp <jens@de.ibm.com>

Index: linux-2.6/MAINTAINERS
===================================================================
--- linux-2.6.orig/MAINTAINERS	2008-01-11 13:32:04.000000000 +0100
+++ linux-2.6/MAINTAINERS	2008-01-11 13:41:32.000000000 +0100
@@ -3613,8 +3613,10 @@
 S:	Supported
 
 SPIDERNET NETWORK DRIVER for CELL
-P:	Linas Vepstas
-M:	linas@austin.ibm.com
+P:	Ishizaki Kou
+M:	kou.ishizaki@toshiba.co.jp
+P:	Jens Osterkamp
+M:	jens@de.ibm.com
 L:	netdev@vger.kernel.org
 S:	Supported
 

* Re: [PATCH 2.6.23+] ingress classify to [nf]mark
From: Dzianis Kahanovich @ 2008-01-11 17:37 UTC (permalink / raw)
  To: netdev
In-Reply-To: <47865613.1000902@trash.net>

Patrick McHardy wrote:

>> --- linux-2.6.23-gentoo-r2/net/sched/sch_ingress.c
>> +++ linux-2.6.23-gentoo-r2.fixed/net/sched/sch_ingress.c
>> @@ -161,2 +161,5 @@
>>              skb->tc_index = TC_H_MIN(res.classid);
>> +#ifdef CONFIG_NET_SCH_INGRESS_TC2MARK
>> +            skb->mark = (skb->mark&(res.classid>>16))|TC_H_MIN(res.classid);
>> +#endif
>>          default:
> 
> 
> Behaviour like this shouldn't depend on compile-time options.

I would also like to move it outside of the NET_CLS_ACT dependency, but I am
unsure how it behaves without NET_CLS_ACT.

But it does reduce the code.

-- 
WBR,
Denis Kaganovich,  mahatma@eu.by  http://mahatma.bspu.unibel.by

* Re: [NET] ROUTE: fix rcu_dereference() uses in /proc/net/rt_cache
From: Jarek Poplawski @ 2008-01-11 14:13 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Eric Dumazet, Herbert Xu, davem, dipankar, netdev
In-Reply-To: <20080110235111.GF9586@linux.vnet.ibm.com>

On Thu, Jan 10, 2008 at 03:51:11PM -0800, Paul E. McKenney wrote:
> On Fri, Jan 11, 2008 at 12:10:42AM +0100, Jarek Poplawski wrote:
> > Eric Dumazet wrote, On 01/09/2008 11:37 AM:
> > ...
> > > [NET] ROUTE: fix rcu_dereference() uses in /proc/net/rt_cache
> > ...
> > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > > index d337706..28484f3 100644
> > > --- a/net/ipv4/route.c
> > > +++ b/net/ipv4/route.c
> > > @@ -283,12 +283,12 @@ static struct rtable *rt_cache_get_first(struct seq_file *seq)
> > >  			break;
> > >  		rcu_read_unlock_bh();
> > >  	}
> > > -	return r;
> > > +	return rcu_dereference(r);
> > >  }
> > >  
> > >  static struct rtable *rt_cache_get_next(struct seq_file *seq, struct rtable *r)
> > >  {
> > > -	struct rt_cache_iter_state *st = rcu_dereference(seq->private);
> > > +	struct rt_cache_iter_state *st = seq->private;
> > >  
> > >  	r = r->u.dst.rt_next;
> > >  	while (!r) {
> > > @@ -298,7 +298,7 @@ static struct rtable *rt_cache_get_next(struct seq_file *seq, struct rtable *r)
> > >  		rcu_read_lock_bh();
> > >  		r = rt_hash_table[st->bucket].chain;
> > >  	}
> > > -	return r;
> > > +	return rcu_dereference(r);
> > >  }
> > 
> > It seems this optimization could've a side effect: if during such a
> > loop updates are done, and r is seen !NULL during while() check, but
> > NULL after rcu_dereference(), the listing/counting could stop too
> > soon. So, IMHO, probably the first version of this patch is more
> > reliable. (Or alternatively additional check is needed before return.)
> 
> Looks to me like "r" is a local variable (argument list), so there
> should not be any possibility of it being changed by some other
> task, right?

It seems words can be stronger than logic (in some cases)...
After forgetting what dereference is usually for, it's all right!

Thanks,
Jarek P.

* Re: [PATCH 2.6.23+] ingress classify to [nf]mark
From: Dzianis Kahanovich @ 2008-01-11 17:24 UTC (permalink / raw)
  To: netdev
In-Reply-To: <1200001167.4443.38.camel@localhost>

jamal wrote:

>> To "classid x:y" = "mark=mark&x|y" ("classid :y" = "-j MARK --set-mark y", etc).
>>
>> --- linux-2.6.23-gentoo-r2/net/sched/Kconfig
>> +++ linux-2.6.23-gentoo-r2.fixed/net/sched/Kconfig
>> @@ -222,6 +222,16 @@
> [..]
>>   			skb->tc_index = TC_H_MIN(res.classid);
>> +#ifdef CONFIG_NET_SCH_INGRESS_TC2MARK
>> +			skb->mark = (skb->mark&(res.classid>>16))|TC_H_MIN(res.classid);
>> +#endif
>>   		default:
> 
> 
> Please either use ipt action and netfilter fwmarker for this activity or

Sorry. That was only an unsuccessful attempt to popularize my working solution.
In practice I just use "#define tc_index mark" (in skbuff.h or sch_ingress.c) or
something like this:

--- linux-2.6.23-gentoo-r2/net/sched/Kconfig
+++ linux-2.6.23-gentoo-r2.fixed/net/sched/Kconfig
@@ -222,6 +222,16 @@
  	  To compile this code as a module, choose M here: the
  	  module will be called sch_ingress.

+config NET_SCH_INGRESS_TC2MARK
+	bool "ingress tc_index -> mark"
+	depends on NET_SCH_INGRESS && NET_CLS_ACT
+	---help---
+	  This enables access to "mark" value via "tc_index" alias
+	  in ingress and unify this values (usage example: set "flowid :2"
+	  in ingress and use it value as "mark" in any way - netfilter, etc).
+	
+	  But tc_index may be undefined - use "flowid :0".
+
  comment "Classification"

  config NET_CLS
--- linux-2.6.23-gentoo-r2/net/sched/sch_ingress.c
+++ linux-2.6.23-gentoo-r2.fixed/net/sched/sch_ingress.c
@@ -18,6 +18,9 @@
  #include <net/netlink.h>
  #include <net/pkt_sched.h>

+#ifdef CONFIG_NET_SCH_INGRESS_TC2MARK
+#define tc_index mark
+#endif

  #undef DEBUG_INGRESS



> create a new action. 
> If you choose the latter (for example, because you want to dynamically compute
> the mark), look at net/sched/act_simple.c to start from and i can help
> you if you have any questions.
>  
> If you want to use ipt action, the syntax would be something like:
> 
> ---
> tc qdisc add dev XXX ingress
> tc filter add dev XXX parent ffff: protocol ip prio 5 \
> u32 blah bleh \
> flowid 1:12 action ipt -j MARK --set-mark 13

Yes, I do that. But this is simpler:
---
if [[ $[TC_INDEX2MARK] == 0 ]] ; then
  c=${c//action ipt -j MARK --set-mark /flowid :}
fi
$c
---

Simplest:
--- linux-2.6.23-gentoo-r2/net/sched/sch_ingress.c
+++ linux-2.6.23-gentoo-r2.fixed/net/sched/sch_ingress.c
@@ -222,6 +222,16 @@
-   			skb->tc_index = TC_H_MIN(res.classid);
+   			skb->tc_index = TC_H_MIN(mark=res.classid);


-- 
WBR,
Denis Kaganovich,  mahatma@eu.by  http://mahatma.bspu.unibel.by

* Re: [PATCH 2.6.23+] ingress classify to [nf]mark
From: jamal @ 2008-01-11 14:59 UTC (permalink / raw)
  To: mahatma; +Cc: netdev
In-Reply-To: <4787A663.4030204@bspu.unibel.by>

On Fri, 2008-11-01 at 15:24 -0200, Dzianis Kahanovich wrote:
> jamal wrote:

> > tc qdisc add dev XXX ingress
> > tc filter add dev XXX parent ffff: protocol ip prio 5 \
> > u32 blah bleh \
> > flowid 1:12 action ipt -j MARK --set-mark 13
> 
> Yes, I do that. But this is simpler:
> ---
> if [[ $[TC_INDEX2MARK] == 0 ]] ; then
>   c=${c//action ipt -j MARK --set-mark /flowid :}
> fi
> $c
> ---

I didn't quite understand what you have above. Does your script above
read the flowid and set the MARK to some dynamic value based on flowid?
If that's what you are doing, it sounds sensible and much more clever
than what is posted. And it doesn't require any kernel patch.

> Simplest:
> --- linux-2.6.23-gentoo-r2/net/sched/sch_ingress.c
> +++ linux-2.6.23-gentoo-r2.fixed/net/sched/sch_ingress.c
> @@ -222,6 +222,16 @@
> -   			skb->tc_index = TC_H_MIN(res.classid);
> +   			skb->tc_index = TC_H_MIN(mark=res.classid);

Just write a metaset action and you can have all sorts of policies on
what tc_index, mark, etc. you want. It is something that's needed in any
case.
When we did tc_index it made sense because it was for "tc" to use
some default policy. Enforcing policies in the kernel is not the best
thing to do; as an example, you want to specify the policy for mark to
be: classid major>>16|minor. I am sure you have good reasons; however,
for the next person who wants to set it to major>>8|minor for their own
good reason, there's a conflict.
My offer to help you is still open.

cheers,
jamal
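
The policy conflict described above can be sketched in plain C. This is only an illustration, not kernel code: the `SKETCH_` macros are local stand-ins for the handle macros in the kernel headers, and all function names are hypothetical. The first function follows the expression in the posted patch; the other two follow the "major>>16|minor" and "major>>8|minor" examples.

```c
#include <stdint.h>

/* Local stand-ins (assumed here) for the kernel's classid handle macros. */
#define SKETCH_TC_H_MAJ(h) ((h) & 0xFFFF0000U)
#define SKETCH_TC_H_MIN(h) ((h) & 0x0000FFFFU)

/* Expression from the posted patch: keep the bits of the old mark selected
 * by the classid major number, then OR in the classid minor number. */
static uint32_t sketch_mark_posted(uint32_t mark, uint32_t classid)
{
    return (mark & (classid >> 16)) | SKETCH_TC_H_MIN(classid);
}

/* "major>>16|minor": one person's policy. */
static uint32_t sketch_mark_shift16(uint32_t classid)
{
    return (SKETCH_TC_H_MAJ(classid) >> 16) | SKETCH_TC_H_MIN(classid);
}

/* "major>>8|minor": the next person's policy, which yields a different mark
 * for the same classid. */
static uint32_t sketch_mark_shift8(uint32_t classid)
{
    return (SKETCH_TC_H_MAJ(classid) >> 8) | SKETCH_TC_H_MIN(classid);
}
```

For classid 1:12 (0x00010012) the two shift policies give 0x13 and 0x112 respectively, which is the kind of disagreement a configurable action would avoid hard-coding.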

* Re: questions on NAPI processing latency and dropped network packets
From: Chris Friesen @ 2008-01-11 14:59 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-kernel
In-Reply-To: <20080110.172049.118174993.davem@davemloft.net>

David Miller wrote:

> You have to be kidding, coming here for help with a nearly
> 4 year old kernel.

I figured it couldn't hurt to ask...if I can't ask the original authors, 
who else is there?

I'd love to work on newer kernels, but we have a commitment to our 
customers to support multiple releases for a significant amount of time.

Chris

* doubt in e1000_io_write()
From: Jeba Anandhan @ 2008-01-11 15:13 UTC (permalink / raw)
  To: netdev

Hi all,
I have a question about e1000_io_write().

void
e1000_io_write(struct e1000_hw *hw, unsigned long port, uint32_t value)
{
        outl(value, port);
}

kernel version: 2.6.12.3

Even though the hw structure is not used, why is it passed into the
e1000_io_write function?

Thanks
Jeba
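
The thread does not answer this question. One plausible reason, offered only as an assumption, is that an OS-abstraction hook keeps a uniform signature so that other back ends can use the hardware structure even when the port-I/O one does not. A minimal sketch with hypothetical names (`sketch_hw`, `sketch_io_write_port`, `sketch_io_write_mmio`) and a simulated port array in place of outl():

```c
#include <stdint.h>

/* Hypothetical hardware descriptor; not the real struct e1000_hw. */
struct sketch_hw {
    uint32_t *mmio_base;   /* only meaningful to the memory-mapped back end */
};

/* Simulated port space standing in for outl(). */
static uint32_t sketch_ports[16];

/* Port-I/O back end: the port number is global, so hw is not needed. */
static void sketch_io_write_port(struct sketch_hw *hw, unsigned long port,
                                 uint32_t value)
{
    (void)hw;   /* unused, kept so both back ends share one signature */
    sketch_ports[port] = value;
}

/* Memory-mapped back end: same signature, but hw locates the register. */
static void sketch_io_write_mmio(struct sketch_hw *hw, unsigned long port,
                                 uint32_t value)
{
    hw->mmio_base[port] = value;
}

/* Callers can be written once against this common function type. */
typedef void (*sketch_io_write_fn)(struct sketch_hw *, unsigned long, uint32_t);
```

With a shared function type, code that calls the hook does not need to change when the back end does.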

* Re: [PROCFS] [NETNS] issue with /proc/net entries
From: Benjamin Thery @ 2008-01-11 16:00 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: netdev, linux-kernel
In-Reply-To: <m1ir21v4jv.fsf@ebiederm.dsl.xmission.com>

Eric W. Biederman wrote:
> Benjamin Thery <benjamin.thery@bull.net> writes:
> 
>> Hi Eric,
>>
>> While testing the current network namespace stuff merged in net-2.6.25,
>> I bumped into the following problem with the /proc/net/ entries.
>> It doesn't always display the actual data of the current namespace,
>> but sometime displays data from other namespaces.
>>
>> I bisected the problem to the commit:
>> "proc: remove/Fix proc generic d_revalidate"
>> 3790ee4bd86396558eedd86faac1052cb782e4e1
>>
>> The problem: If a process in a particular network namespace changes
>> current directory to /proc/net, then processes in other network
>> namespaces trying to look at /proc/net entries will see data from the
>> first namespace (the one with CWD /proc/net). (See test case below).
>>
>> As you comments in the commit suggest, you seem to be aware of some
>> issues when CONFIG_NET_NS=y. Is it one of these corner cases you
>> identified? Any idea on how we can fix it?
> 
> Yes.  It isn't especially hard.   I have most of it in my queue
> I just need to get the silly patches out of there.
> 
> Essentially we need to fix the caching of proc_generic entries,
> So that we can have a proper d_revalidate implementation.
> 
> To get d_revalidate and the caching correct for /proc/net will take
> just a bit more work.  We need to make /proc/net a symlink
> to something like /proc/self/net so that we don't get excess
> revalidates when switching between different processes.
> 
> Or else we can't properly implement the case you have described.
> Where being in the directory causes the wrong version of /proc/net
> to show up. Changing the contents of the dentry for /proc/net
> should only happen during unshare.  Not when we switch between
> processes or else we get into the d_revalidate leaks mount points
> problem again.
> 
> We also need the check to see if something is mounted on top of
> us before we call drop the dentry.  But if we don't even try until
> we know the dentry is invalid it should not be too bad.

Thanks for all the details.
I'll put this issue on my "netns current limitations" list until
it's solved.

Benjamin


> 
> Eric
> 


-- 
B e n j a m i n   T h e r y  - BULL/DT/Open Software R&D

    http://www.bull.com

* RE: e1000 performance issue in 4 simultaneous links
From: Breno Leitao @ 2008-01-11 16:20 UTC (permalink / raw)
  To: Brandeburg, Jesse, rick.jones2; +Cc: netdev
In-Reply-To: <36D9DB17C6DE9E40B059440DB8D95F5204275B04@orsmsx418.amr.corp.intel.com>

On Thu, 2008-01-10 at 12:52 -0800, Brandeburg, Jesse wrote:
> Breno Leitao wrote:
> > When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
> > of transfer rate. If I run 4 netperf against 4 different interfaces, I
> > get around 720 * 10^6 bits/sec.
> 
> I hope this explanation makes sense, but what it comes down to is that
> combining hardware round robin balancing with NAPI is a BAD IDEA.  In
> general the behavior of hardware round robin balancing is bad and I'm
> sure it is causing all sorts of other performance issues that you may
> not even be aware of.
I ran another test with the ppc IRQ round-robin scheme removed and each
interface (eth6, eth7, eth16 and eth17) bound to a different CPU (CPU1,
CPU2, CPU3 and CPU4), and I still get around 720 * 10^6 bits/s on
average.

Take a look at the interrupt table this time: 

io-dolphins:~/leitao # cat /proc/interrupts  | grep eth[1]*[67]
277:         15    1362450         13         14         13         14         15         18   XICS      Level     eth6
278:         12         13    1348681         19         13         15         10         11   XICS      Level     eth7
323:         11         18         17    1348426         18         11         11         13   XICS      Level     eth16
324:         12         16         11         19    1402709         13         14         11   XICS      Level     eth17


I also tried binding all 4 interface IRQs to a single CPU (CPU0)
using the noirqdistrib boot parameter, and the performance was a little
worse.

Rick,
  The 2-interface test that I showed in my first email was run on two
different NICs. Also, I am running netperf with the following command:
"netperf -H <hostname> -T 0,8", while netserver is running without any
arguments at all. Running vmstat in parallel shows that there is no
bottleneck in the CPU. Take a look:

procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 6714732  16168 227440    0    0     8     2  203   21  0  1 98  0  0
 0  0      0 6715120  16176 227440    0    0     0    28 16234  505  0 16 83  0  1
 0  0      0 6715516  16176 227440    0    0     0     0 16251  518  0 16 83  0  1
 1  0      0 6715252  16176 227440    0    0     0     1 16316  497  0 15 84  0  1
 0  0      0 6716092  16176 227440    0    0     0     0 16300  520  0 16 83  0  1
 0  0      0 6716320  16180 227440    0    0     0     1 16354  486  0 15 84  0  1
 

Thanks!

-- 
Breno Leitao <leitao@linux.vnet.ibm.com>


* Re: e1000 performance issue in 4 simultaneous links
From: Eric Dumazet @ 2008-01-11 16:48 UTC (permalink / raw)
  To: Breno Leitao; +Cc: Brandeburg, Jesse, rick.jones2, netdev
In-Reply-To: <1200068444.9349.20.camel@cafe>

Breno Leitao a écrit :
> On Thu, 2008-01-10 at 12:52 -0800, Brandeburg, Jesse wrote:
>   
>> Breno Leitao wrote:
>>     
>>> When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
>>> of transfer rate. If I run 4 netperf against 4 different interfaces, I
>>> get around 720 * 10^6 bits/sec.
>>>       
>> I hope this explanation makes sense, but what it comes down to is that
>> combining hardware round robin balancing with NAPI is a BAD IDEA.  In
>> general the behavior of hardware round robin balancing is bad and I'm
>> sure it is causing all sorts of other performance issues that you may
>> not even be aware of.
>>     
> I've made another test removing the ppc IRQ Round Robin scheme, bonded
> each interface (eth6, eth7, eth16 and eth17) to different CPUs (CPU1,
> CPU2, CPU3 and CPU4) and I also get around around 720 * 10^6 bits/s in
> average.
>
> Take a look at the interrupt table this time: 
>
> io-dolphins:~/leitao # cat /proc/interrupts  | grep eth[1]*[67]
> 277:         15    1362450         13         14         13         14         15         18   XICS      Level     eth6
> 278:         12         13    1348681         19         13         15         10         11   XICS      Level     eth7
> 323:         11         18         17    1348426         18         11         11         13   XICS      Level     eth16
> 324:         12         16         11         19    1402709         13         14         11   XICS      Level     eth17
>
>
> I also tried to bound all the 4 interface IRQ to a single CPU (CPU0)
> using the noirqdistrib boot paramenter, and the performance was a little
> worse.
>
> Rick, 
>   The 2 interface test that I showed in my first email, was run in two
> different NIC. Also, I am running netperf with the following command
> "netperf -H <hostname> -T 0,8" while netserver is running without any
> argument at all. Also, running vmstat in parallel shows that there is no
> bottleneck in the CPU. Take a look: 
>
> procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
>  2  0      0 6714732  16168 227440    0    0     8     2  203   21  0  1 98  0  0
>  0  0      0 6715120  16176 227440    0    0     0    28 16234  505  0 16 83  0  1
>  0  0      0 6715516  16176 227440    0    0     0     0 16251  518  0 16 83  0  1
>  1  0      0 6715252  16176 227440    0    0     0     1 16316  497  0 15 84  0  1
>  0  0      0 6716092  16176 227440    0    0     0     0 16300  520  0 16 83  0  1
>  0  0      0 6716320  16180 227440    0    0     0     1 16354  486  0 15 84  0  1
>  
>
>   
If your machine has 8 CPUs, then your vmstat output shows a bottleneck :)

(100/8 = 12.5), so I guess one of your CPUs is full
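
The arithmetic above can be written out as a small check. This is a sketch under the stated premise of 8 CPUs; the function names are hypothetical, and the result only shows that the machine-wide figures are consistent with one CPU being saturated, not which CPU it is.

```c
/* Share of the machine-wide 100% that a single CPU represents. */
static double sketch_one_cpu_share(int ncpus)
{
    return 100.0 / ncpus;
}

/* Returns 1 if the machine-wide busy percentage (100 - idle) is at least
 * one CPU's worth, i.e. a single fully loaded CPU could explain it. */
static int sketch_could_saturate_one_cpu(double idle_pct, int ncpus)
{
    double busy_pct = 100.0 - idle_pct;
    return busy_pct >= sketch_one_cpu_share(ncpus);
}
```

With the vmstat rows above (idle around 83 to 84), the busy share is 16 to 17 percent, which exceeds the 12.5 percent that one of 8 CPUs contributes.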






* Re: [PATCH 1/5] spidernet: add missing initialization
From: Linas Vepstas @ 2008-01-11 16:48 UTC (permalink / raw)
  To: Jens Osterkamp; +Cc: Ishizaki Kou, netdev, cbe-oss-dev, Jeff Garzik
In-Reply-To: <200801111344.35652.jens@de.ibm.com>

Hi,

On 11/01/2008, Jens Osterkamp <jens@de.ibm.com> wrote:
> Hi Ishizaki,
>
> Linas has left the company and is no longer doing kernel related stuff,
> so I suggest, given Jeff is ok with that, that the two of us take over
> spidernet maintainership.
>
> Jens
>
> ---
>
> Change maintainership for spidernet.
>
> Signed-off-by: Jens Osterkamp <jens@de.ibm.com>

Fine with me ...

Acked-by: Linas Vepstas <linasvepstas@gmail.com>

> Index: linux-2.6/MAINTAINERS
> ===================================================================
> --- linux-2.6.orig/MAINTAINERS  2008-01-11 13:32:04.000000000 +0100
> +++ linux-2.6/MAINTAINERS       2008-01-11 13:41:32.000000000 +0100
> @@ -3613,8 +3613,10 @@
>  S:     Supported
>
>  SPIDERNET NETWORK DRIVER for CELL
> -P:     Linas Vepstas
> -M:     linas@austin.ibm.com
> +P:     Ishizaki Kou
> +M:     kou.ishizaki@toshiba.co.jp
> +P:     Jens Osterkamp
> +M:     jens@de.ibm.com
>  L:     netdev@vger.kernel.org
>  S:     Supported
>
>

* iproute2: removing primary address removes secondaries
From: martin f krafft @ 2008-01-11 16:31 UTC (permalink / raw)
  To: netdev discussion list

Dear list,

When I add an address to an interface whose network prefix is the
same as that of an address already bound to the interface, the new
address becomes a secondary address. As per
http://www.policyrouting.org/iproute2.doc.html:

  "secondary --- this address is not used when selecting the default
  source address for outgoing packets. An IP address becomes
  secondary if another address within the same prefix (network)
  already exists. The first address within the prefix is primary and
  is the tag address for the group of all the secondary addresses.
  When the primary address is deleted all of the secondaries are
  purged too."

In the following, I want to argue that this is not necessary.
I think that removal of a primary address should cause the next
address to be promoted to be the default source address and the
link-scoped route to be retained. This is basically out of
http://bugs.debian.org/429689, the maintainer asked me to turn
directly to this list.

If I add an address to a device with 'ip add', ip also implicitly
adds a link-scoped route according to the netmask. It only does this
for primary addresses, so if I add a second address within the same
network, the route is not duplicated.

Thus, the net effect on the routing table is the same for the
following two commands:

  ip a a 172.16.0.100/12 dev eth0 && ip a a 172.16.0.200/12 dev eth0
  ip a a 172.16.0.100/12 dev eth0 && ip a a 172.16.0.200/32 dev eth0
                                                        ^^^^
In the first case, the .200 address becomes a secondary of the .100
address. In the second case, they are both primaries. In both cases,
only one /12 link-scoped route will be created.
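
The primary/secondary distinction in the two cases can be sketched as a prefix comparison. This is an assumption-laden illustration, not the kernel's code: it assumes an added address is secondary only when it has the same prefix length as an existing primary and falls in the same network, which matches the /12 versus /32 behaviour described here. All names are hypothetical.

```c
#include <stdint.h>

/* Builds an IPv4 address in host byte order from four octets. */
static uint32_t sketch_ip(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    return (a << 24) | (b << 16) | (c << 8) | d;
}

/* Netmask for a prefix length in the range 0..32. */
static uint32_t sketch_mask(int prefix_len)
{
    return prefix_len == 0 ? 0U : 0xFFFFFFFFU << (32 - prefix_len);
}

/* Assumed rule: the added address is secondary when it shares both the
 * prefix length and the network of the existing primary address. */
static int sketch_is_secondary(uint32_t primary, int primary_len,
                               uint32_t added, int added_len)
{
    uint32_t mask;

    if (primary_len != added_len)
        return 0;
    mask = sketch_mask(primary_len);
    return (primary & mask) == (added & mask);
}
```

Under this rule, 172.16.0.200/12 added after 172.16.0.100/12 is secondary, while 172.16.0.200/32 is a primary of its own.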

However, in both cases, if I remove the .100 address, the .200 is
affected: if it's secondary, it ceases to exist, and if it's
primary (i.e. in the /32 case), then the host can no longer use it
to communicate to hosts in the same link segment, only to hosts on
the other side of the default gateway.

I thus question the point of purging secondary addresses. Obviously,
only one address can be primary (it is used as source address for
packets leaving the machine by the respective route). But if the
primary address is removed, the next secondary should be promoted
and the route should *not* be deleted.

Comments?

Cheers,

-- 
martin | http://madduck.net/ | http://two.sentenc.es/

microsoft: for when quality, reliability, and security
           just aren't that important!

spamtraps: madduck.bogus@madduck.net

* Re: iproute2: removing primary address removes secondaries
From: Daniel Lezcano @ 2008-01-11 17:13 UTC (permalink / raw)
  To: netdev discussion list
In-Reply-To: <20080111163155.GA17637@piper.oerlikon.madduck.net>

martin f krafft wrote:
> Dear list,
> 
> When I add an address to an interface whose network prefix is the
> same as that of an address already bound to the interface, the new
> address becomes a secondary address. As per
> http://www.policyrouting.org/iproute2.doc.html:
> 
>   "secondary --- this address is not used when selecting the default
>   source address for outgoing packets. An IP address becomes
>   secondary if another address within the same prefix (network)
>   already exists. The first address within the prefix is primary and
>   is the tag address for the group of all the secondary addresses.
>   When the primary address is deleted all of the secondaries are
>   purged too."
> 
> In the following, I want to argue that this is not necessary.
> I think that removal of a primary address should cause the next
> address to be promoted to be the default source address and the
> link-scoped route to be retained. This is basically out of
> http://bugs.debian.org/429689, the maintainer asked me to turn
> directly to this list.
> 
> If I add an address to a device with 'ip add', ip also implicitly
> adds a link-scoped route according to the netmask. It only does this
> for primary addresses, so if I add a second address within the same
> network, the route is not duplicated.
> 
> Thus, the net effect on the routing table is the same for the
> following two commands:
> 
>   ip a a 172.16.0.100/12 dev eth0 && ip a a 172.16.0.200/12 dev eth0
>   ip a a 172.16.0.100/12 dev eth0 && ip a a 172.16.0.200/32 dev eth0
>                                                         ^^^^
> In the first case, the .200 address becomes a secondary of the .100
> address. In the second case, they are both primaries. In both cases,
> only one /12 link-scoped route will be created.
> 
> However, in both cases, if I remove the .100 address, the .200 is
> affected: if it's secondary, it ceases to exist, and if it's
> primary (i.e. in the /32 case), then the host can no longer use it
> to communicate to hosts in the same link segment, only to hosts on
> the other side of the default gateway.
> 
> I thus question the point of purging secondary addresses. Obviously,
> only one address can be primary (it is used as source address for
> packets leaving the machine by the respective route). But if the
> primary address is removed, the next secondary should be promoted
> and the route should *not* be deleted.
> 
> Comments?
> 
> Cheers,

There is a tweak in /proc/sys which activates promotion of secondaries when
a primary is deleted.

/proc/sys/net/ipv4/conf/all/promote_secondaries

I think it changes the behavior to the one you wish.

Regards
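
Turning the tweak on amounts to writing "1" to that file. A minimal C sketch, assuming the caller has permission to write there (normally root); the helper names are hypothetical, and only the /proc path comes from this message.

```c
#include <stdio.h>

/* Writes value plus a newline to path. Returns 0 on success, -1 on failure. */
static int sketch_write_sysctl(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (f == NULL)
        return -1;
    ok = fputs(value, f) >= 0 && fputc('\n', f) != EOF;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Enables promotion of secondary addresses for all interfaces. */
static int sketch_enable_promote_secondaries(void)
{
    return sketch_write_sysctl(
        "/proc/sys/net/ipv4/conf/all/promote_secondaries", "1");
}
```

The setting made this way does not persist across reboots; a distribution's sysctl configuration would be the usual place to make it permanent.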

* Simple question about LARTC theory
From: slavon @ 2008-01-11 17:17 UTC (permalink / raw)
  To: netdev

Hello all.
Sorry for the off-topic post. I am subscribed only to netdev@vger.kernel.org;
I tried to send this to lartc@vger.kernel.org and got "Undelivered Mail
Returned to Sender". May I go slightly off-topic? This mailing list has many
people who know LARTC at the code level, and I hope that helps with my idea. Thanks.

Simple Question

Legend
[] - qdisc
() - class
** - filter

[htb 1:0 root] *match X FLOWID 3:5*
(1:2 htb)(2:3 htb)(3:5 htb)[sfq 5]
(1:6 htb)(6:7 htb)(7:8 htb)[sfq 8]

The packet goes:
IN -> [htb 1:0] -> (class 1:2 - GREEN) -> (class 2:3 GREEN) -> (class
3:5 - GREEN) -> [sfq 5] -> OUT

Then I create:

[prio 3 bound 10:0] *match X flowid 10:2*
+(10:1 htb) -- [sfq 101]
+(10:2 htb) -- [sfq 102]
+(10:3 htb) -- [sfq 103]

How do I add a filter to [sfq 5] and [sfq 8] so that a packet leaving
them goes to [prio 3 bound 10:0] and is classified by its filter?

flowid works if it can see the beginning and end of the links, but I need
something like a GOTO. If I give [prio 3 bound 10:0] a parent ID, flowid
finds the path, but I need [prio 3 bound 10:0] to have more than one parent.

I looked at "link", but if I understand correctly it works only for hash tables.
I looked at classid, but it goes to class 10:X, not to [prio 3 bound 10:0],
and the filter is not processed.

Or do I misunderstand the theory?

What do I need? I need 3 groups in tc.
The 1st group gets all traffic and does HTB shaping (defence against ICMP and UDP storms):
a) icmp rate 100mbs ceil 500mbs
b) udp rate 100mbs ceil 500mbs
c) other rate 300mbs ceil 500mbs
all with prio = 0 so that the ceil rate behaves normally

The 2nd group does prio (icmp and udp must go first because they have no
transmit check):
icmp = 1
udp = 2
other = 3

The 3rd group does per-IP speed limiting (shaping); this part is ready.

I want everything that leaves group 1 to go to the group 2 filters, and
everything that leaves group 2 to go to group 3.

Thanks. Slavon

* why does promote_secondaries default to off? (was: iproute2: removing primary address removes secondaries)
From: martin f krafft @ 2008-01-11 17:26 UTC (permalink / raw)
  To: netdev discussion list
In-Reply-To: <4787A3AB.4000205@fr.ibm.com>

also sprach Daniel Lezcano <dlezcano@fr.ibm.com> [2008.01.11.1813 +0100]:
> There is a tweak in /proc/sys which activate secondaries promotion when a 
> primary is deleted.
>
> /proc/sys/net/ipv4/conf/all/promote_secondaries
>
> I think it changes the behavior to the one you wish.

Totally. That is the last place I would have looked.
Thank you!

Do you have any idea why this isn't on by default?

-- 
martin | http://madduck.net/ | http://two.sentenc.es/

"i never go without my dinner. no one ever does, except vegetarians
 and people like that."
                                                        -- oscar wilde

spamtraps: madduck.bogus@madduck.net

* Re: [PATCH 001/001] ipv4: enable use of 240/4 address space
From: Vince Fuller @ 2008-01-11 17:29 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Vince Fuller, netdev, linux-kernel
In-Reply-To: <p73bq7smx1t.fsf@bingen.suse.de>

On Fri, Jan 11, 2008 at 12:17:02PM +0100, Andi Kleen wrote:
> Vince Fuller <vaf@cisco.com> writes:
> 
> > from Vince Fuller <vaf@vaf.net>
> >
> > This set of diffs modify the 2.6.20 kernel to enable use of the 240/4
> > (aka "class-E") address space as consistent with the Internet Draft
> > draft-fuller-240space-00.txt.
> 
> Wouldn't it be wise to at least wait for it becoming an RFC first? 

There is reasonable consensus on making use of 240/4; some applications,
such as ISAKMP and automatic ipv6-to-IPv4 tunneling, still need to determine
if they should treat the space as "public" or "private" but that shouldn't
affect whether kernel support is added.

Solaris recently added support for 240/4 and OSX already has it. I thought
the Linux kernel developers might appreciate having patches to do likewise.

I leave it up to you, the developers, to decide if you want to use these
patches.

	--Vince

* Re: why does promote_secondaries default to off?
From: Daniel Lezcano @ 2008-01-11 17:33 UTC (permalink / raw)
  To: netdev discussion list
In-Reply-To: <20080111172641.GA22449@piper.oerlikon.madduck.net>

martin f krafft wrote:
> also sprach Daniel Lezcano <dlezcano@fr.ibm.com> [2008.01.11.1813 +0100]:
>> There is a tweak in /proc/sys which activate secondaries promotion when a 
>> primary is deleted.
>>
>> /proc/sys/net/ipv4/conf/all/promote_secondaries
>>
>> I think it changes the behavior to the one you wish.
> 
> Totally. That would have been the last place I had looked.
> Thank you!
> 
> Do you have any idea why this isn't on by default?

This tweak is "recent" (2.6.16 as far as I remember), so I suppose the
reason is to not puzzle people with a changed default behavior.

* Re: [PATCH 0/4] Pull request for 'ipg-fixes' branch
From: Francois Romieu @ 2008-01-11 17:22 UTC (permalink / raw)
  To: linux; +Cc: davem, akpm, jeff, netdev
In-Reply-To: <20080111015851.25008.qmail@science.horizon.com>

linux@horizon.com <linux@horizon.com> :
[...]
> I notice that the vendor-supplied driver doesn't have these bugs.

The M in POMS stands for "my".

[...]
> Would you be interested in some cleanup patches ?

Yes.

> In particular, I think I can get rid of tx->lock entirely, or at least
> take it off the fast path. All it's protecting is the write to
> sp->tx_current, and a few judicious memory barriers can deal with that.

I have done a kind of memory barrier trick for the r8169 in the past but
it is not clear that I would do it again. Today I would argue more strongly
in the direction of similar locking amongst different drivers. The tg3 driver
is a good model imho.

Anyway you have been here for some time so I see no reason to kill any
different/new locking scheme you could come up with.

Off until sunday.

-- 
Ueimor

^ permalink raw reply

* Re: e1000 performance issue in 4 simultaneous links
From: Denys Fedoryshchenko @ 2008-01-11 17:36 UTC (permalink / raw)
  To: netdev
In-Reply-To: <47879DE4.8080603@cosmosbay.com>

Maybe it would be a good idea to use sysstat?

http://perso.wanadoo.fr/sebastien.godard/

For example:

visp-1 ~ # mpstat -P ALL 1
Linux 2.6.24-rc7-devel (visp-1)         01/11/08

19:27:57     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
19:27:58     all    0.00    0.00    0.00    0.00    0.00    2.51    0.00   97.49   7707.00
19:27:58       0    0.00    0.00    0.00    0.00    0.00    4.00    0.00   96.00   1926.00
19:27:58       1    0.00    0.00    0.00    0.00    0.00    1.01    0.00   98.99   1926.00
19:27:58       2    0.00    0.00    0.00    0.00    0.00    5.00    0.00   95.00   1927.00
19:27:58       3    0.00    0.00    0.00    0.00    0.00    0.99    0.00   99.01   1927.00
19:27:58       4    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00



> >>     
> >>> When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
> >>> of transfer rate. If I run 4 netperf against 4 different interfaces, I
> >>> get around 720 * 10^6 bits/sec.
> >>>       
> >> I hope this explanation makes sense, but what it comes down to is that
> >> combining hardware round robin balancing with NAPI is a BAD IDEA.  In
> >> general the behavior of hardware round robin balancing is bad and I'm
> >> sure it is causing all sorts of other performance issues that you may
> >> not even be aware of.
> >>     
> > I've made another test removing the ppc IRQ Round Robin scheme, bound
> > each interface (eth6, eth7, eth16 and eth17) to a different CPU (CPU1,
> > CPU2, CPU3 and CPU4) and I also get around 720 * 10^6 bits/s on
> > average.
> >
> > Take a look at the interrupt table this time: 
> >
> > io-dolphins:~/leitao # cat /proc/interrupts  | grep eth[1]*[67]
> > 277:         15    1362450         13         14         13         14         15         18   XICS      Level     eth6
> > 278:         12         13    1348681         19         13         15         10         11   XICS      Level     eth7
> > 323:         11         18         17    1348426         18         11         11         13   XICS      Level     eth16
> > 324:         12         16         11         19    1402709         13         14         11   XICS      Level     eth17
> >
> >
> > I also tried to bind all 4 interface IRQs to a single CPU (CPU0)
> > using the noirqdistrib boot parameter, and the performance was a little
> > worse.
> >
> > Rick, 
> >   The 2 interface test that I showed in my first email was run on two
> > different NICs. Also, I am running netperf with the following command
> > "netperf -H <hostname> -T 0,8" while netserver is running without any
> > argument at all. Also, running vmstat in parallel shows that there is no
> > bottleneck in the CPU. Take a look: 
> >
> > procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
> >  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
> >  2  0      0 6714732  16168 227440    0    0     8     2  203   21  0  1 98  0  0
> >  0  0      0 6715120  16176 227440    0    0     0    28 16234  505  0 16 83  0  1
> >  0  0      0 6715516  16176 227440    0    0     0     0 16251  518  0 16 83  0  1
> >  1  0      0 6715252  16176 227440    0    0     0     1 16316  497  0 15 84  0  1
> >  0  0      0 6716092  16176 227440    0    0     0     0 16300  520  0 16 83  0  1
> >  0  0      0 6716320  16180 227440    0    0     0     1 16354  486  0 15 84  0  1
> >  
> >
> >   
> If your machine has 8 cpus, then your vmstat output shows a 
> bottleneck :)
> 
> (100/8 = 12.5), so I guess one of your CPU is full
> 


--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.


^ permalink raw reply

* Re: why does promote_secondaries default to off?
From: martin f krafft @ 2008-01-11 17:43 UTC (permalink / raw)
  To: netdev discussion list
In-Reply-To: <4787A863.3060506@fr.ibm.com>

also sprach Daniel Lezcano <dlezcano@fr.ibm.com> [2008.01.11.1833 +0100]:
> This tweak is "recent" (2.6.16 as far as I remember), so I suppose
> the reason is to not puzzle people with a changed default
> behavior.

Your instant and helpful responses are most appreciated!

-- 
martin | http://madduck.net/ | http://two.sentenc.es/
 
a common mistake that people make
when trying to design something completely foolproof
was to underestimate the ingenuity of complete fools.
                                 -- douglas adams, "mostly harmless"
 
spamtraps: madduck.bogus@madduck.net

^ permalink raw reply

* Re: [PATCH] ibm_newemac: Increase number of default rx-/tx-buffers
From: Eugene Surovegin @ 2008-01-11 17:48 UTC (permalink / raw)
  To: Stefan Roese; +Cc: linuxppc-dev, netdev
In-Reply-To: <200801051338.17957.sr@denx.de>

On Sat, Jan 05, 2008 at 01:38:17PM +0100, Stefan Roese wrote:
> On Saturday 05 January 2008, Benjamin Herrenschmidt wrote:
> > On Sat, 2008-01-05 at 10:50 +0100, Stefan Roese wrote:
> > > Performance tests done by AMCC have shown that 256 buffers increase the
> > > performance of the Linux EMAC driver. So let's update the default
> > > values to match this setup.
> > >
> > > Signed-off-by: Stefan Roese <sr@denx.de>
> > > ---
> >
> > Do we have the numbers ? Did they also measure latency ?
> 
> I hoped this question would not come. ;) No, unfortunately I don't have any 
> numbers. Just the recommendation from AMCC to always use 256 buffers.

This cannot be true for all chips. The default numbers I selected weren't 
random. In particular, 256 for Tx doesn't make a lot of sense for the 405. 
You are just going to waste memory.

I'd be quite reluctant to follow such advice from AMCC without actual 
details. 

-- 
Eugene

^ permalink raw reply

* Re: [PATCH 2.6.23+] ingress classify to [nf]mark
From: Dzianis Kahanovich @ 2008-01-11 20:42 UTC (permalink / raw)
  To: netdev
In-Reply-To: <1200063541.4483.42.camel@localhost>

jamal wrote:

>> Yes, I do so. But there are simple:
>> ---
>> if [[ $[TC_INDEX2MARK] == 0 ]] ; then
==1
>>   c=${c//action ipt -j MARK --set-mark /flowid :}
    c=${c//action ipt -j MARK --set-mark 0x/flowid :}
>> fi
>> $c
>> ---
> 
> I didn't quite understand what you have above. Does your script above
> read the flowid and set the MARK to some dynamic value based on flowid?
> If that's what you are doing, it sounds sensible and much more clever
> than what is posted. And it doesn't require any kernel patch.

I suggest simply using classid to set mark/nfmark in ingress. As far as I can
see, classid is nearly unused in ingress (no classes, etc.), and for many setups
classid in ingress filters could be used only for nfmarking. I also suggest
using both parts (major & minor) of classid: major could be the "and" value,
minor the "or" value. In its current place this may be useful only for
overwriting (if at all, I am unsure) netfilter "raw" table marks, but if it is
moved outside the current "CLS_ACT" block, tc filter rules could operate on
mark bits more usefully.

About the script example:
While I compose the filter, I check a flag ($TC_INDEX2MARK) that tells me whether
the patch is applied. If not, I use the usual "-j MARK --set-mark"; otherwise I
use classid to change the mark. All in ingress only. For example:
tc filter add dev eth0 parent ffff: protocol ip u32 ... action ipt -j MARK --set-mark 0x10
is changed to:
tc filter add dev eth0 parent ffff: protocol ip u32 ... flowid :10

- it uses less code and fewer modules and, in many cases, may be the single/main
goal of ingress usage: pre-marking packets.
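
To make the major/minor idea concrete, here is a small Python sketch of the
bit arithmetic only. It is not the kernel patch. The two masks mirror the
kernel's TC_H_MAJ/TC_H_MIN macros; the "and/or" combination is my reading of
the proposal above, and the function names are made up for illustration.

```python
TC_H_MAJ_MASK = 0xFFFF0000  # upper 16 bits of a classid (major)
TC_H_MIN_MASK = 0x0000FFFF  # lower 16 bits of a classid (minor)


def parse_classid(text: str) -> int:
    """Parse tc's "major:minor" notation; both halves are hex, empty means 0."""
    major, _, minor = text.partition(":")
    return (int(major or "0", 16) << 16) | int(minor or "0", 16)


def mark_direct(classid: int) -> int:
    """Simplest variant: the mark is just the classid (mark=res.classid)."""
    return classid


def mark_and_or(old_mark: int, classid: int) -> int:
    """Hypothetical variant: major is an AND mask, minor is an OR value."""
    and_mask = (classid & TC_H_MAJ_MASK) >> 16
    or_value = classid & TC_H_MIN_MASK
    return (old_mark & and_mask) | or_value


if __name__ == "__main__":
    cid = parse_classid("ff:10")
    print(hex(mark_direct(parse_classid(":10"))))
    print(hex(mark_and_or(0x1204, cid)))
```

With "flowid :10" the major half is 0, so under the and/or reading the old
mark is cleared and only the minor value remains.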

>> Simpliest:
>> --- linux-2.6.23-gentoo-r2/net/sched/sch_ingress.c
>> +++ linux-2.6.23-gentoo-r2.fixed/net/sched/sch_ingress.c
>> @@ -222,6 +222,16 @@
>> -   			skb->tc_index = TC_H_MIN(res.classid);
>> +   			skb->tc_index = TC_H_MIN(mark=res.classid);
> 
> Just write a metaset action and you can have all sorts of policies on
> what tc_index, mark etc you want. It is something that's needed in any
> case.
> When we did tc_index it made sense then because it was for "tc" to use
> some default policy. Enforcing policies in the kernel is not the best
> thing to do; as an example you want to specify the policy for mark to
> be: classid major>>16|minor. I am sure you have good reasons; however,
> for the next person who wants to set it to major>>8|minor for their own
> good reason, there's a conflict.
> My offer to help you is still open.

OK, I understand this is not very transparent for future usage, but I see too
few applications of ingress/classid that it would conflict with.

Thanks, I will try to understand "metaset actions", but I think it will not be
as elegant for my usage as my "#define tc_index mark" at the beginning of
sch_ingress.c. Or maybe I will use the "and/or" behaviour, but for now "#define
tc_index mark" has worked on my router for many months (I could also use
-j MARK, with one flag in my script, but that pulls in a lot of unneeded code).

This code (ingress/classifying[/CLS_ACT]) is executed every time, and the
changes I suggest range from none (changing the target variable from "tc_index"
to "mark") to a few "and/or" atomic operations for useful functionality. With
"mark=res.classid" only (which I may use myself, but do not suggest for the
kernel) it is even less code than the default (no TC_H_MIN) and fully satisfies
many goals (traffic marking without netfilter, but compatible with it).

-- 
WBR,
Denis Kaganovich,  mahatma@eu.by  http://mahatma.bspu.unibel.by

^ permalink raw reply

* Re: Netperf TCP_RR(loopback) 10% regression in 2.6.24-rc6, comparing with 2.6.22
From: Rick Jones @ 2008-01-11 17:56 UTC (permalink / raw)
  To: Zhang, Yanmin; +Cc: LKML, netdev
In-Reply-To: <1200043854.3265.24.camel@ymzhang>

>>The test command is:
>>#sudo taskset -c 7 ./netserver
>>#sudo taskset -c 0 ./netperf -t TCP_RR -l 60 -H 127.0.0.1 -i 50,3 -I 99,5 -- -r 1,1

A couple of comments/questions on the command lines:

*) netperf/netserver support CPU affinity within themselves with the 
global -T option to netperf.  Is the result with taskset much different? 
   The equivalent to the above would be to run netperf with:

./netperf -T 0,7 ...

The one possibly salient difference between the two is that when done 
within netperf, the initial process creation will take place wherever 
the scheduler wants it.
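
As a rough illustration of that difference (a Python sketch, not netperf's
actual code): a process that binds itself, as with the global -T option, has
already been created and scheduled somewhere before the affinity call runs,
whereas taskset sets the mask before the program starts. The pin_self name is
made up, and os.sched_setaffinity is Linux-only.

```python
import os


def pin_self(cpu: int) -> set:
    """Pin the calling process to one CPU after it has already started."""
    os.sched_setaffinity(0, {cpu})  # pid 0 means the calling process
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    allowed = sorted(os.sched_getaffinity(0))
    print("allowed before pinning:", allowed)
    # Everything up to this point ran wherever the scheduler placed us.
    print("allowed after pinning:", sorted(pin_self(allowed[0])))
```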

*) The -i option to set the confidence iteration count will silently cap 
the max at 30.

happy benchmarking,

rick jones

^ permalink raw reply

* Re: e1000 performance issue in 4 simultaneous links
From: Breno Leitao @ 2008-01-11 18:19 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Brandeburg, Jesse, rick.jones2, netdev
In-Reply-To: <47879DE4.8080603@cosmosbay.com>

On Fri, 2008-01-11 at 17:48 +0100, Eric Dumazet wrote:
> Breno Leitao a écrit :
> > Take a look at the interrupt table this time: 
> >
> > io-dolphins:~/leitao # cat /proc/interrupts  | grep eth[1]*[67]
> > 277:         15    1362450         13         14         13         14         15         18   XICS      Level     eth6
> > 278:         12         13    1348681         19         13         15         10         11   XICS      Level     eth7
> > 323:         11         18         17    1348426         18         11         11         13   XICS      Level     eth16
> > 324:         12         16         11         19    1402709         13         14         11   XICS      Level     eth17
> >
> >
> >   
> If your machine has 8 cpus, then your vmstat output shows a bottleneck :)
> 
> (100/8 = 12.5), so I guess one of your CPU is full

Well, if I run top while running the test, I see this load distributed
among the CPUs, mainly those that have a NIC IRQ bound. Take a look:

Tasks: 133 total,   2 running, 130 sleeping,   0 stopped,   1 zombie
Cpu0  :  0.3%us, 19.5%sy,  0.0%ni, 73.5%id,  0.0%wa,  0.0%hi,  0.0%si,  6.6%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni, 75.1%id,  0.0%wa,  0.7%hi, 24.3%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni, 73.1%id,  0.0%wa,  0.7%hi, 26.2%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni, 76.1%id,  0.0%wa,  0.7%hi, 23.3%si,  0.0%st
Cpu4  :  0.0%us,  0.3%sy,  0.0%ni, 70.4%id,  0.7%wa,  0.3%hi, 28.2%si,  0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Note that this average scenario doesn't change during the entire
benchmarking test.
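
To put numbers on this, here is a small Python sketch of the arithmetic. The
idle figures are copied from the top output above; the function name is made
up. The 100/8 = 12.5 rule only says that an aggregate busy figure above 12.5%
is compatible with one saturated CPU. The per-CPU view is what tells the two
cases apart.

```python
# Idle percentages for Cpu0..Cpu7, copied from the top output above.
IDLE = [73.5, 75.1, 73.1, 76.1, 70.4, 100.0, 99.7, 100.0]


def summarize(idle):
    """Return the average busy %, the busiest CPU's busy %, and CPU equivalents."""
    busy = [100.0 - pct for pct in idle]
    return {
        "average_busy": sum(busy) / len(busy),  # what vmstat-style totals show
        "max_busy": max(busy),                  # busiest single CPU
        "cpu_equivalents": sum(busy) / 100.0,   # load in whole-CPU units
    }


if __name__ == "__main__":
    one_cpu_share = 100.0 / len(IDLE)  # 12.5 on an 8-CPU machine
    stats = summarize(IDLE)
    print("one full CPU would show as", one_cpu_share, "% in the aggregate")
    for name, value in stats.items():
        print(name, round(value, 2))
```

With these figures the average is a bit above 16%, in line with the vmstat
output, but the busiest CPU is only about 30% busy, so no single CPU is full.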

Thanks!

-- 
Breno Leitao <leitao@linux.vnet.ibm.com>


^ permalink raw reply
