Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 1/3] vlan: Do not support clearing VLAN_FLAG_REORDER_HDR
From: Eric W. Biederman @ 2011-05-23 22:48 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: David Miller, greearb, nicolas.2p.debian, jpirko, xiaosuo, netdev,
	kaber, fubar, eric.dumazet, andy, jesse
In-Reply-To: <20110523151631.2c34740f@nehalam>

Stephen Hemminger <shemminger@linux-foundation.org> writes:

> On Mon, 23 May 2011 15:05:54 -0700
> ebiederm@xmission.com (Eric W. Biederman) wrote:
>
>> David Miller <davem@davemloft.net> writes:
>> 
>> > From: Stephen Hemminger <shemminger@linux-foundation.org>
>> > Date: Mon, 23 May 2011 14:00:48 -0700
>> >
>> >> IMHO the REORDER_HDR flag was a design mistake. It means supporting
>> >> two different API's for the application. For a packet capture application
>> >> it means the format of the packet will be different based on how
>> >> the VLAN was created. And the vlan was created outside the application.
>> >> 
>> >> If there was a requirement to support both formats, then it should
>> >> be a property of the AF_PACKET socket.  The application can then do
>> >> an setsockopt() to control the format of the data. The issue is
>> >> for things like mmap/AF_PACKET the re-inserted form will be slower
>> >> since data has to be copied.
>> >
>> > Indeed, it was a foolish thing to support in the first place.
>> >
>> > I guess we couldn't see the hw offloads in the future at that
>> > point.
>> >
>> > I'm all for finding a way to deprecate this over time.
>> 
>> 
>> There are two very similar issues here, that affect two
>> slightly different cases.
>> 
>> Let's assume the configuration is:
>> eth0 - Talks to the outside world.
>> vlan2000 - Is the vlan interface of interest.
>> 
>> 
>> With REORDER_HDR set when I tcpdump on vlan2000 I don't see the
>> vlan header, however I continue to see the vlan header when I tcpdump on eth0.
>> 
>> With vlan hardware acceleration.  When I tcpdump on eth0 I don't
>> see the vlan header.  Nor do I see the vlan header when I tcpdump
>> on vlan2000.
>> 
>> 
>> Right now both cases are problematic in Linus's tree.
>> 
>> For inbound packets We are testing the wrong interface for the to see if
>> we should play with REORDER_HDR.
>> 
>> Hardware acceleration vlan tagging we are hiding the vlan header from
>> portable applications that simply dump eth0.  Which is non-intuitive
>> and apparent to everyone now that this happens on both software and
>> hardware interfaces.
>> 
>> 
>> So we have a couple of questions.
>> 1) Do we revert the software emulation of vlan header tagging to fix
>>    it's implementation issues in 2.6.40?
>> 
>>    The big issues.
>>    - vlan_untag is currently reading undefined data.
>>    - Clearing REORDER_HDR no longer works.
>> 
>>    I expect the issues are small enough that we can address them now.
>> 
>> 2) Do we instead move the software implementation of vlan header
>>    acceleration back into the net-next.
>> 
>> 3) What do we do with pf_packet and vlan hardware acceleration when
>>    dumping not the vlan interface but the interface below the vlan
>>    interface?
>> 
>>    Do we provide an option to keep the vlan header?  Should that option
>>    be on by default?
>
> Why not just make libpcap smarter? it already has Linux specific
> stuff for checksum offload. The portability to BSD for AF_PACKET is
> a non-issue.

That is not an option I had considered, and it sounds intriguing.

Even if libpcap can cope we are still not backwards compatible with
older linux applications.  Some of which work up to 2.6.39 because
the interface the vlan tagged packets come in on does not implement
vlan hardware acceleration.

There is also the issue that socket filters can't currently see
skb->tci so we are really messed up in the case we want vlan hardware
acceleration.

Eric




^ permalink raw reply

* Re: 2.6.38.x, 2.6.39 sfq? kernel panic in sfq_enqueue
From: Denys Fedoryshchenko @ 2011-05-23 22:35 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, hadi
In-Reply-To: <1306163146.20687.19.camel@edumazet-laptop>

 On Mon, 23 May 2011 17:05:46 +0200, Eric Dumazet wrote:
> Le lundi 23 mai 2011 à 17:27 +0300, Denys Fedoryshchenko a écrit :
>
>> > Thanks !
>>  By objdump or he must recompile kernel with DEBUG_INFO and use gdb?
>>
>>
>>
>
>
> Just objdump of sch_sfq.o (or sch_sfq.ko) file, (to keep existing
> offsets), thanks
 Sorry for delay, i had to contact person who run that kernel :)

 www.nuclearcat.com/files/disasm.txt
 www.nuclearcat.com/files/sch_sfq.ko



^ permalink raw reply

* Re: [PATCH 1/3] vlan: Do not support clearing VLAN_FLAG_REORDER_HDR
From: Ben Greear @ 2011-05-23 22:23 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Miller, shemminger, nicolas.2p.debian, jpirko, xiaosuo,
	netdev, kaber, fubar, eric.dumazet, andy, jesse
In-Reply-To: <m1y61x2htp.fsf@fess.ebiederm.org>

On 05/23/2011 03:05 PM, Eric W. Biederman wrote:
> David Miller<davem@davemloft.net>  writes:
>
>> From: Stephen Hemminger<shemminger@linux-foundation.org>
>> Date: Mon, 23 May 2011 14:00:48 -0700
>>
>>> IMHO the REORDER_HDR flag was a design mistake. It means supporting
>>> two different API's for the application. For a packet capture application
>>> it means the format of the packet will be different based on how
>>> the VLAN was created. And the vlan was created outside the application.
>>>
>>> If there was a requirement to support both formats, then it should
>>> be a property of the AF_PACKET socket.  The application can then do
>>> an setsockopt() to control the format of the data. The issue is
>>> for things like mmap/AF_PACKET the re-inserted form will be slower
>>> since data has to be copied.
>>
>> Indeed, it was a foolish thing to support in the first place.
>>
>> I guess we couldn't see the hw offloads in the future at that
>> point.
>>
>> I'm all for finding a way to deprecate this over time.
>
>
> There are two very similar issues here, that affect two
> slightly different cases.
>
> Let's assume the configuration is:
> eth0 - Talks to the outside world.
> vlan2000 - Is the vlan interface of interest.
>
>
> With REORDER_HDR set when I tcpdump on vlan2000 I don't see the
> vlan header, however I continue to see the vlan header when I tcpdump on eth0.

That seems good.  This is what originally let dhcp function.  Not sure
if it's still required for dhcp, but there could easily be other programs that
have similar issues.

With REORDER_HDR off, we should see vlan tagged packets on both eth0
and vlan2000.

If REORDER_HDR dissappears entirely, I think you have to default to
stripping the header on vlan2000.

>
> With vlan hardware acceleration.  When I tcpdump on eth0 I don't
> see the vlan header.  Nor do I see the vlan header when I tcpdump
> on vlan2000.

I think you should see the header on eth0 regardless of hw acceleration
or not.  Users should not have to know if their NIC/driver supports
vlan tag stripping in one mode or another.


> Right now both cases are problematic in Linus's tree.
>
> For inbound packets We are testing the wrong interface for the to see if
> we should play with REORDER_HDR.
>
> Hardware acceleration vlan tagging we are hiding the vlan header from
> portable applications that simply dump eth0.  Which is non-intuitive
> and apparent to everyone now that this happens on both software and
> hardware interfaces.
>
>
> So we have a couple of questions.
> 1) Do we revert the software emulation of vlan header tagging to fix
>     it's implementation issues in 2.6.40?
>
>     The big issues.
>     - vlan_untag is currently reading undefined data.
>     - Clearing REORDER_HDR no longer works.
>
>     I expect the issues are small enough that we can address them now.
>
> 2) Do we instead move the software implementation of vlan header
>     acceleration back into the net-next.
>
> 3) What do we do with pf_packet and vlan hardware acceleration when
>     dumping not the vlan interface but the interface below the vlan
>     interface?
>
>     Do we provide an option to keep the vlan header?  Should that option
>     be on by default?

At the least we need to have the header kept when using pf_packet on eth0.

I think it's best to have it available on vlan2000, but perhaps have it
stripped by default for backwards compatibility.

Thanks,
Ben


>
> Eric


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply

* [patch 1/1] net: convert %p usage to %pK
From: akpm @ 2011-05-23 22:17 UTC (permalink / raw)
  To: davem
  Cc: netdev, akpm, drosenberg, a.p.zijlstra, eparis, eric.dumazet,
	eugeneteo, jmorris, kees.cook, mingo, tgraf

From: Dan Rosenberg <drosenberg@vsecurity.com>

The %pK format specifier is designed to hide exposed kernel pointers,
specifically via /proc interfaces.  Exposing these pointers provides an
easy target for kernel write vulnerabilities, since they reveal the
locations of writable structures containing easily triggerable function
pointers.  The behavior of %pK depends on the kptr_restrict sysctl.

If kptr_restrict is set to 0, no deviation from the standard %p behavior
occurs.  If kptr_restrict is set to 1, the default, if the current user
(intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG
(currently in the LSM tree), kernel pointers using %pK are printed as 0's.
 If kptr_restrict is set to 2, kernel pointers using %pK are printed as
0's regardless of privileges.  Replacing with 0's was chosen over the
default "(null)", which cannot be parsed by userland %p, which expects
"(nil)".

The supporting code for kptr_restrict and %pK are currently in the -mm
tree.  This patch converts users of %p in net/ to %pK.  Cases of printing
pointers to the syslog are not covered, since this would eliminate useful
information for postmortem debugging and the reading of the syslog is
already optionally protected by the dmesg_restrict sysctl.

Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
Cc: James Morris <jmorris@namei.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Thomas Graf <tgraf@infradead.org>
Cc: Eugene Teo <eugeneteo@kernel.org>
Cc: Kees Cook <kees.cook@canonical.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David S. Miller <davem@davemloft.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 net/atm/proc.c           |    4 ++--
 net/can/bcm.c            |    6 +++---
 net/ipv4/raw.c           |    2 +-
 net/ipv4/tcp_ipv4.c      |    6 +++---
 net/ipv4/udp.c           |    2 +-
 net/ipv6/raw.c           |    2 +-
 net/ipv6/tcp_ipv6.c      |    6 +++---
 net/ipv6/udp.c           |    2 +-
 net/key/af_key.c         |    2 +-
 net/netlink/af_netlink.c |    2 +-
 net/packet/af_packet.c   |    2 +-
 net/phonet/socket.c      |    2 +-
 net/sctp/proc.c          |    4 ++--
 net/unix/af_unix.c       |    2 +-
 14 files changed, 22 insertions(+), 22 deletions(-)

diff -puN net/atm/proc.c~net-convert-%p-usage-to-%pk net/atm/proc.c
--- a/net/atm/proc.c~net-convert-%p-usage-to-%pk
+++ a/net/atm/proc.c
@@ -191,7 +191,7 @@ static void vcc_info(struct seq_file *se
 {
 	struct sock *sk = sk_atm(vcc);
 
-	seq_printf(seq, "%p ", vcc);
+	seq_printf(seq, "%pK ", vcc);
 	if (!vcc->dev)
 		seq_printf(seq, "Unassigned    ");
 	else
@@ -218,7 +218,7 @@ static void svc_info(struct seq_file *se
 {
 	if (!vcc->dev)
 		seq_printf(seq, sizeof(void *) == 4 ?
-			   "N/A@%p%10s" : "N/A@%p%2s", vcc, "");
+			   "N/A@%pK%10s" : "N/A@%pK%2s", vcc, "");
 	else
 		seq_printf(seq, "%3d %3d %5d         ",
 			   vcc->dev->number, vcc->vpi, vcc->vci);
diff -puN net/can/bcm.c~net-convert-%p-usage-to-%pk net/can/bcm.c
--- a/net/can/bcm.c~net-convert-%p-usage-to-%pk
+++ a/net/can/bcm.c
@@ -165,9 +165,9 @@ static int bcm_proc_show(struct seq_file
 	struct bcm_sock *bo = bcm_sk(sk);
 	struct bcm_op *op;
 
-	seq_printf(m, ">>> socket %p", sk->sk_socket);
-	seq_printf(m, " / sk %p", sk);
-	seq_printf(m, " / bo %p", bo);
+	seq_printf(m, ">>> socket %pK", sk->sk_socket);
+	seq_printf(m, " / sk %pK", sk);
+	seq_printf(m, " / bo %pK", bo);
 	seq_printf(m, " / dropped %lu", bo->dropped_usr_msgs);
 	seq_printf(m, " / bound %s", bcm_proc_getifname(ifname, bo->ifindex));
 	seq_printf(m, " <<<\n");
diff -puN net/ipv4/raw.c~net-convert-%p-usage-to-%pk net/ipv4/raw.c
--- a/net/ipv4/raw.c~net-convert-%p-usage-to-%pk
+++ a/net/ipv4/raw.c
@@ -979,7 +979,7 @@ static void raw_sock_seq_show(struct seq
 	      srcp  = inet->inet_num;
 
 	seq_printf(seq, "%4d: %08X:%04X %08X:%04X"
-		" %02X %08X:%08X %02X:%08lX %08X %5d %8d %lu %d %p %d\n",
+		" %02X %08X:%08X %02X:%08lX %08X %5d %8d %lu %d %pK %d\n",
 		i, src, srcp, dest, destp, sp->sk_state,
 		sk_wmem_alloc_get(sp),
 		sk_rmem_alloc_get(sp),
diff -puN net/ipv4/tcp_ipv4.c~net-convert-%p-usage-to-%pk net/ipv4/tcp_ipv4.c
--- a/net/ipv4/tcp_ipv4.c~net-convert-%p-usage-to-%pk
+++ a/net/ipv4/tcp_ipv4.c
@@ -2371,7 +2371,7 @@ static void get_openreq4(struct sock *sk
 	int ttd = req->expires - jiffies;
 
 	seq_printf(f, "%4d: %08X:%04X %08X:%04X"
-		" %02X %08X:%08X %02X:%08lX %08X %5d %8d %u %d %p%n",
+		" %02X %08X:%08X %02X:%08lX %08X %5d %8d %u %d %pK%n",
 		i,
 		ireq->loc_addr,
 		ntohs(inet_sk(sk)->inet_sport),
@@ -2426,7 +2426,7 @@ static void get_tcp4_sock(struct sock *s
 		rx_queue = max_t(int, tp->rcv_nxt - tp->copied_seq, 0);
 
 	seq_printf(f, "%4d: %08X:%04X %08X:%04X %02X %08X:%08X %02X:%08lX "
-			"%08X %5d %8d %lu %d %p %lu %lu %u %u %d%n",
+			"%08X %5d %8d %lu %d %pK %lu %lu %u %u %d%n",
 		i, src, srcp, dest, destp, sk->sk_state,
 		tp->write_seq - tp->snd_una,
 		rx_queue,
@@ -2461,7 +2461,7 @@ static void get_timewait4_sock(struct in
 	srcp  = ntohs(tw->tw_sport);
 
 	seq_printf(f, "%4d: %08X:%04X %08X:%04X"
-		" %02X %08X:%08X %02X:%08lX %08X %5d %8d %d %d %p%n",
+		" %02X %08X:%08X %02X:%08lX %08X %5d %8d %d %d %pK%n",
 		i, src, srcp, dest, destp, tw->tw_substate, 0, 0,
 		3, jiffies_to_clock_t(ttd), 0, 0, 0, 0,
 		atomic_read(&tw->tw_refcnt), tw, len);
diff -puN net/ipv4/udp.c~net-convert-%p-usage-to-%pk net/ipv4/udp.c
--- a/net/ipv4/udp.c~net-convert-%p-usage-to-%pk
+++ a/net/ipv4/udp.c
@@ -2090,7 +2090,7 @@ static void udp4_format_sock(struct sock
 	__u16 srcp	  = ntohs(inet->inet_sport);
 
 	seq_printf(f, "%5d: %08X:%04X %08X:%04X"
-		" %02X %08X:%08X %02X:%08lX %08X %5d %8d %lu %d %p %d%n",
+		" %02X %08X:%08X %02X:%08lX %08X %5d %8d %lu %d %pK %d%n",
 		bucket, src, srcp, dest, destp, sp->sk_state,
 		sk_wmem_alloc_get(sp),
 		sk_rmem_alloc_get(sp),
diff -puN net/ipv6/raw.c~net-convert-%p-usage-to-%pk net/ipv6/raw.c
--- a/net/ipv6/raw.c~net-convert-%p-usage-to-%pk
+++ a/net/ipv6/raw.c
@@ -1240,7 +1240,7 @@ static void raw6_sock_seq_show(struct se
 	srcp  = inet_sk(sp)->inet_num;
 	seq_printf(seq,
 		   "%4d: %08X%08X%08X%08X:%04X %08X%08X%08X%08X:%04X "
-		   "%02X %08X:%08X %02X:%08lX %08X %5d %8d %lu %d %p %d\n",
+		   "%02X %08X:%08X %02X:%08lX %08X %5d %8d %lu %d %pK %d\n",
 		   i,
 		   src->s6_addr32[0], src->s6_addr32[1],
 		   src->s6_addr32[2], src->s6_addr32[3], srcp,
diff -puN net/ipv6/tcp_ipv6.c~net-convert-%p-usage-to-%pk net/ipv6/tcp_ipv6.c
--- a/net/ipv6/tcp_ipv6.c~net-convert-%p-usage-to-%pk
+++ a/net/ipv6/tcp_ipv6.c
@@ -2036,7 +2036,7 @@ static void get_openreq6(struct seq_file
 
 	seq_printf(seq,
 		   "%4d: %08X%08X%08X%08X:%04X %08X%08X%08X%08X:%04X "
-		   "%02X %08X:%08X %02X:%08lX %08X %5d %8d %d %d %p\n",
+		   "%02X %08X:%08X %02X:%08lX %08X %5d %8d %d %d %pK\n",
 		   i,
 		   src->s6_addr32[0], src->s6_addr32[1],
 		   src->s6_addr32[2], src->s6_addr32[3],
@@ -2087,7 +2087,7 @@ static void get_tcp6_sock(struct seq_fil
 
 	seq_printf(seq,
 		   "%4d: %08X%08X%08X%08X:%04X %08X%08X%08X%08X:%04X "
-		   "%02X %08X:%08X %02X:%08lX %08X %5d %8d %lu %d %p %lu %lu %u %u %d\n",
+		   "%02X %08X:%08X %02X:%08lX %08X %5d %8d %lu %d %pK %lu %lu %u %u %d\n",
 		   i,
 		   src->s6_addr32[0], src->s6_addr32[1],
 		   src->s6_addr32[2], src->s6_addr32[3], srcp,
@@ -2129,7 +2129,7 @@ static void get_timewait6_sock(struct se
 
 	seq_printf(seq,
 		   "%4d: %08X%08X%08X%08X:%04X %08X%08X%08X%08X:%04X "
-		   "%02X %08X:%08X %02X:%08lX %08X %5d %8d %d %d %p\n",
+		   "%02X %08X:%08X %02X:%08lX %08X %5d %8d %d %d %pK\n",
 		   i,
 		   src->s6_addr32[0], src->s6_addr32[1],
 		   src->s6_addr32[2], src->s6_addr32[3], srcp,
diff -puN net/ipv6/udp.c~net-convert-%p-usage-to-%pk net/ipv6/udp.c
--- a/net/ipv6/udp.c~net-convert-%p-usage-to-%pk
+++ a/net/ipv6/udp.c
@@ -1391,7 +1391,7 @@ static void udp6_sock_seq_show(struct se
 	srcp  = ntohs(inet->inet_sport);
 	seq_printf(seq,
 		   "%5d: %08X%08X%08X%08X:%04X %08X%08X%08X%08X:%04X "
-		   "%02X %08X:%08X %02X:%08lX %08X %5d %8d %lu %d %p %d\n",
+		   "%02X %08X:%08X %02X:%08lX %08X %5d %8d %lu %d %pK %d\n",
 		   bucket,
 		   src->s6_addr32[0], src->s6_addr32[1],
 		   src->s6_addr32[2], src->s6_addr32[3], srcp,
diff -puN net/key/af_key.c~net-convert-%p-usage-to-%pk net/key/af_key.c
--- a/net/key/af_key.c~net-convert-%p-usage-to-%pk
+++ a/net/key/af_key.c
@@ -3656,7 +3656,7 @@ static int pfkey_seq_show(struct seq_fil
 	if (v == SEQ_START_TOKEN)
 		seq_printf(f ,"sk       RefCnt Rmem   Wmem   User   Inode\n");
 	else
-		seq_printf(f ,"%p %-6d %-6u %-6u %-6u %-6lu\n",
+		seq_printf(f, "%pK %-6d %-6u %-6u %-6u %-6lu\n",
 			       s,
 			       atomic_read(&s->sk_refcnt),
 			       sk_rmem_alloc_get(s),
diff -puN net/netlink/af_netlink.c~net-convert-%p-usage-to-%pk net/netlink/af_netlink.c
--- a/net/netlink/af_netlink.c~net-convert-%p-usage-to-%pk
+++ a/net/netlink/af_netlink.c
@@ -1985,7 +1985,7 @@ static int netlink_seq_show(struct seq_f
 		struct sock *s = v;
 		struct netlink_sock *nlk = nlk_sk(s);
 
-		seq_printf(seq, "%p %-3d %-6d %08x %-8d %-8d %p %-8d %-8d %-8lu\n",
+		seq_printf(seq, "%pK %-3d %-6d %08x %-8d %-8d %pK %-8d %-8d %-8lu\n",
 			   s,
 			   s->sk_protocol,
 			   nlk->pid,
diff -puN net/packet/af_packet.c~net-convert-%p-usage-to-%pk net/packet/af_packet.c
--- a/net/packet/af_packet.c~net-convert-%p-usage-to-%pk
+++ a/net/packet/af_packet.c
@@ -2706,7 +2706,7 @@ static int packet_seq_show(struct seq_fi
 		const struct packet_sock *po = pkt_sk(s);
 
 		seq_printf(seq,
-			   "%p %-6d %-4d %04x   %-5d %1d %-6u %-6u %-6lu\n",
+			   "%pK %-6d %-4d %04x   %-5d %1d %-6u %-6u %-6lu\n",
 			   s,
 			   atomic_read(&s->sk_refcnt),
 			   s->sk_type,
diff -puN net/phonet/socket.c~net-convert-%p-usage-to-%pk net/phonet/socket.c
--- a/net/phonet/socket.c~net-convert-%p-usage-to-%pk
+++ a/net/phonet/socket.c
@@ -607,7 +607,7 @@ static int pn_sock_seq_show(struct seq_f
 		struct pn_sock *pn = pn_sk(sk);
 
 		seq_printf(seq, "%2d %04X:%04X:%02X %02X %08X:%08X %5d %lu "
-			"%d %p %d%n",
+			"%d %pK %d%n",
 			sk->sk_protocol, pn->sobject, pn->dobject,
 			pn->resource, sk->sk_state,
 			sk_wmem_alloc_get(sk), sk_rmem_alloc_get(sk),
diff -puN net/sctp/proc.c~net-convert-%p-usage-to-%pk net/sctp/proc.c
--- a/net/sctp/proc.c~net-convert-%p-usage-to-%pk
+++ a/net/sctp/proc.c
@@ -212,7 +212,7 @@ static int sctp_eps_seq_show(struct seq_
 	sctp_for_each_hentry(epb, node, &head->chain) {
 		ep = sctp_ep(epb);
 		sk = epb->sk;
-		seq_printf(seq, "%8p %8p %-3d %-3d %-4d %-5d %5d %5lu ", ep, sk,
+		seq_printf(seq, "%8pK %8pK %-3d %-3d %-4d %-5d %5d %5lu ", ep, sk,
 			   sctp_sk(sk)->type, sk->sk_state, hash,
 			   epb->bind_addr.port,
 			   sock_i_uid(sk), sock_i_ino(sk));
@@ -316,7 +316,7 @@ static int sctp_assocs_seq_show(struct s
 		assoc = sctp_assoc(epb);
 		sk = epb->sk;
 		seq_printf(seq,
-			   "%8p %8p %-3d %-3d %-2d %-4d "
+			   "%8pK %8pK %-3d %-3d %-2d %-4d "
 			   "%4d %8d %8d %7d %5lu %-5d %5d ",
 			   assoc, sk, sctp_sk(sk)->type, sk->sk_state,
 			   assoc->state, hash,
diff -puN net/unix/af_unix.c~net-convert-%p-usage-to-%pk net/unix/af_unix.c
--- a/net/unix/af_unix.c~net-convert-%p-usage-to-%pk
+++ a/net/unix/af_unix.c
@@ -2254,7 +2254,7 @@ static int unix_seq_show(struct seq_file
 		struct unix_sock *u = unix_sk(s);
 		unix_state_lock(s);
 
-		seq_printf(seq, "%p: %08X %08X %08X %04X %02X %5lu",
+		seq_printf(seq, "%pK: %08X %08X %08X %04X %02X %5lu",
 			s,
 			atomic_read(&s->sk_refcnt),
 			0,
_

^ permalink raw reply

* [patch 1/1] net/irda: convert bfin_sir to common Blackfin UART header
From: akpm @ 2011-05-23 22:17 UTC (permalink / raw)
  To: samuel; +Cc: netdev, akpm, vapier, davem

From: Mike Frysinger <vapier@gentoo.org>

No need to duplicate these defines now that the common Blackfin code has
unified these for all UART devices.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Cc: Samuel Ortiz <samuel@sortiz.org>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/net/irda/bfin_sir.c |   59 +++++++++++++++----------------
 drivers/net/irda/bfin_sir.h |   63 ++--------------------------------
 2 files changed, 33 insertions(+), 89 deletions(-)

diff -puN drivers/net/irda/bfin_sir.c~net-irda-convert-bfin_sir-to-common-blackfin-uart-header drivers/net/irda/bfin_sir.c
--- a/drivers/net/irda/bfin_sir.c~net-irda-convert-bfin_sir-to-common-blackfin-uart-header
+++ a/drivers/net/irda/bfin_sir.c
@@ -67,27 +67,27 @@ static void bfin_sir_stop_tx(struct bfin
 	disable_dma(port->tx_dma_channel);
 #endif
 
-	while (!(SIR_UART_GET_LSR(port) & THRE)) {
+	while (!(UART_GET_LSR(port) & THRE)) {
 		cpu_relax();
 		continue;
 	}
 
-	SIR_UART_STOP_TX(port);
+	UART_CLEAR_IER(port, ETBEI);
 }
 
 static void bfin_sir_enable_tx(struct bfin_sir_port *port)
 {
-	SIR_UART_ENABLE_TX(port);
+	UART_SET_IER(port, ETBEI);
 }
 
 static void bfin_sir_stop_rx(struct bfin_sir_port *port)
 {
-	SIR_UART_STOP_RX(port);
+	UART_CLEAR_IER(port, ERBFI);
 }
 
 static void bfin_sir_enable_rx(struct bfin_sir_port *port)
 {
-	SIR_UART_ENABLE_RX(port);
+	UART_SET_IER(port, ERBFI);
 }
 
 static int bfin_sir_set_speed(struct bfin_sir_port *port, int speed)
@@ -116,7 +116,7 @@ static int bfin_sir_set_speed(struct bfi
 
 		do {
 			udelay(utime);
-			lsr = SIR_UART_GET_LSR(port);
+			lsr = UART_GET_LSR(port);
 		} while (!(lsr & TEMT) && count--);
 
 		/* The useconds for 1 bits to transmit */
@@ -125,27 +125,27 @@ static int bfin_sir_set_speed(struct bfi
 		/* Clear UCEN bit to reset the UART state machine
 		 * and control registers
 		 */
-		val = SIR_UART_GET_GCTL(port);
+		val = UART_GET_GCTL(port);
 		val &= ~UCEN;
-		SIR_UART_PUT_GCTL(port, val);
+		UART_PUT_GCTL(port, val);
 
 		/* Set DLAB in LCR to Access THR RBR IER */
-		SIR_UART_SET_DLAB(port);
+		UART_SET_DLAB(port);
 		SSYNC();
 
-		SIR_UART_PUT_DLL(port, quot & 0xFF);
-		SIR_UART_PUT_DLH(port, (quot >> 8) & 0xFF);
+		UART_PUT_DLL(port, quot & 0xFF);
+		UART_PUT_DLH(port, (quot >> 8) & 0xFF);
 		SSYNC();
 
 		/* Clear DLAB in LCR */
-		SIR_UART_CLEAR_DLAB(port);
+		UART_CLEAR_DLAB(port);
 		SSYNC();
 
-		SIR_UART_PUT_LCR(port, lcr);
+		UART_PUT_LCR(port, lcr);
 
-		val = SIR_UART_GET_GCTL(port);
+		val = UART_GET_GCTL(port);
 		val |= UCEN;
-		SIR_UART_PUT_GCTL(port, val);
+		UART_PUT_GCTL(port, val);
 
 		ret = 0;
 		break;
@@ -154,12 +154,12 @@ static int bfin_sir_set_speed(struct bfi
 		break;
 	}
 
-	val = SIR_UART_GET_GCTL(port);
+	val = UART_GET_GCTL(port);
 	/* If not add the 'RPOLC', we can't catch the receive interrupt.
 	 * It's related with the HW layout and the IR transiver.
 	 */
 	val |= IREN | RPOLC;
-	SIR_UART_PUT_GCTL(port, val);
+	UART_PUT_GCTL(port, val);
 	return ret;
 }
 
@@ -168,7 +168,7 @@ static int bfin_sir_is_receiving(struct 
 	struct bfin_sir_self *self = netdev_priv(dev);
 	struct bfin_sir_port *port = self->sir_port;
 
-	if (!(SIR_UART_GET_IER(port) & ERBFI))
+	if (!(UART_GET_IER(port) & ERBFI))
 		return 0;
 	return self->rx_buff.state != OUTSIDE_FRAME;
 }
@@ -182,7 +182,7 @@ static void bfin_sir_tx_chars(struct net
 
 	if (self->tx_buff.len != 0) {
 		chr = *(self->tx_buff.data);
-		SIR_UART_PUT_CHAR(port, chr);
+		UART_PUT_CHAR(port, chr);
 		self->tx_buff.data++;
 		self->tx_buff.len--;
 	} else {
@@ -206,8 +206,8 @@ static void bfin_sir_rx_chars(struct net
 	struct bfin_sir_port *port = self->sir_port;
 	unsigned char ch;
 
-	SIR_UART_CLEAR_LSR(port);
-	ch = SIR_UART_GET_CHAR(port);
+	UART_CLEAR_LSR(port);
+	ch = UART_GET_CHAR(port);
 	async_unwrap_char(dev, &self->stats, &self->rx_buff, ch);
 	dev->last_rx = jiffies;
 }
@@ -219,7 +219,7 @@ static irqreturn_t bfin_sir_rx_int(int i
 	struct bfin_sir_port *port = self->sir_port;
 
 	spin_lock(&self->lock);
-	while ((SIR_UART_GET_LSR(port) & DR))
+	while ((UART_GET_LSR(port) & DR))
 		bfin_sir_rx_chars(dev);
 	spin_unlock(&self->lock);
 
@@ -233,7 +233,7 @@ static irqreturn_t bfin_sir_tx_int(int i
 	struct bfin_sir_port *port = self->sir_port;
 
 	spin_lock(&self->lock);
-	if (SIR_UART_GET_LSR(port) & THRE)
+	if (UART_GET_LSR(port) & THRE)
 		bfin_sir_tx_chars(dev);
 	spin_unlock(&self->lock);
 
@@ -312,7 +312,7 @@ static void bfin_sir_dma_rx_chars(struct
 	struct bfin_sir_port *port = self->sir_port;
 	int i;
 
-	SIR_UART_CLEAR_LSR(port);
+	UART_CLEAR_LSR(port);
 
 	for (i = port->rx_dma_buf.head; i < port->rx_dma_buf.tail; i++)
 		async_unwrap_char(dev, &self->stats, &self->rx_buff, port->rx_dma_buf.buf[i]);
@@ -430,11 +430,10 @@ static void bfin_sir_shutdown(struct bfi
 	unsigned short val;
 
 	bfin_sir_stop_rx(port);
-	SIR_UART_DISABLE_INTS(port);
 
-	val = SIR_UART_GET_GCTL(port);
+	val = UART_GET_GCTL(port);
 	val &= ~(UCEN | IREN | RPOLC);
-	SIR_UART_PUT_GCTL(port, val);
+	UART_PUT_GCTL(port, val);
 
 #ifdef CONFIG_SIR_BFIN_DMA
 	disable_dma(port->tx_dma_channel);
@@ -518,12 +517,12 @@ static void bfin_sir_send_work(struct wo
 	 * sending data. We also can set the speed, which will
 	 * reset all the UART.
 	 */
-	val = SIR_UART_GET_GCTL(port);
+	val = UART_GET_GCTL(port);
 	val &= ~(IREN | RPOLC);
-	SIR_UART_PUT_GCTL(port, val);
+	UART_PUT_GCTL(port, val);
 	SSYNC();
 	val |= IREN | RPOLC;
-	SIR_UART_PUT_GCTL(port, val);
+	UART_PUT_GCTL(port, val);
 	SSYNC();
 	/* bfin_sir_set_speed(port, self->speed); */
 
diff -puN drivers/net/irda/bfin_sir.h~net-irda-convert-bfin_sir-to-common-blackfin-uart-header drivers/net/irda/bfin_sir.h
--- a/drivers/net/irda/bfin_sir.h~net-irda-convert-bfin_sir-to-common-blackfin-uart-header
+++ a/drivers/net/irda/bfin_sir.h
@@ -26,7 +26,6 @@
 #include <asm/cacheflush.h>
 #include <asm/dma.h>
 #include <asm/portmux.h>
-#include <mach/bfin_serial_5xx.h>
 #undef DRIVER_NAME
 
 #ifdef CONFIG_SIR_BFIN_DMA
@@ -83,64 +82,10 @@ struct bfin_sir_self {
 
 #define DRIVER_NAME "bfin_sir"
 
-#define SIR_UART_GET_CHAR(port)    bfin_read16((port)->membase + OFFSET_RBR)
-#define SIR_UART_GET_DLL(port)     bfin_read16((port)->membase + OFFSET_DLL)
-#define SIR_UART_GET_DLH(port)     bfin_read16((port)->membase + OFFSET_DLH)
-#define SIR_UART_GET_LCR(port)     bfin_read16((port)->membase + OFFSET_LCR)
-#define SIR_UART_GET_GCTL(port)    bfin_read16((port)->membase + OFFSET_GCTL)
-
-#define SIR_UART_PUT_CHAR(port, v) bfin_write16(((port)->membase + OFFSET_THR), v)
-#define SIR_UART_PUT_DLL(port, v)  bfin_write16(((port)->membase + OFFSET_DLL), v)
-#define SIR_UART_PUT_DLH(port, v)  bfin_write16(((port)->membase + OFFSET_DLH), v)
-#define SIR_UART_PUT_LCR(port, v)  bfin_write16(((port)->membase + OFFSET_LCR), v)
-#define SIR_UART_PUT_GCTL(port, v) bfin_write16(((port)->membase + OFFSET_GCTL), v)
-
-#ifdef CONFIG_BF54x
-#define SIR_UART_GET_LSR(port)     bfin_read16((port)->membase + OFFSET_LSR)
-#define SIR_UART_GET_IER(port)     bfin_read16((port)->membase + OFFSET_IER_SET)
-#define SIR_UART_SET_IER(port, v)  bfin_write16(((port)->membase + OFFSET_IER_SET), v)
-#define SIR_UART_CLEAR_IER(port, v) bfin_write16(((port)->membase + OFFSET_IER_CLEAR), v)
-#define SIR_UART_PUT_LSR(port, v)  bfin_write16(((port)->membase + OFFSET_LSR), v)
-#define SIR_UART_CLEAR_LSR(port)   bfin_write16(((port)->membase + OFFSET_LSR), -1)
-
-#define SIR_UART_SET_DLAB(port)
-#define SIR_UART_CLEAR_DLAB(port)
-
-#define SIR_UART_ENABLE_INTS(port, v) SIR_UART_SET_IER(port, v)
-#define SIR_UART_DISABLE_INTS(port)   SIR_UART_CLEAR_IER(port, 0xF)
-#define SIR_UART_STOP_TX(port)     do { SIR_UART_PUT_LSR(port, TFI); SIR_UART_CLEAR_IER(port, ETBEI); } while (0)
-#define SIR_UART_ENABLE_TX(port)   do { SIR_UART_SET_IER(port, ETBEI); } while (0)
-#define SIR_UART_STOP_RX(port)     do { SIR_UART_CLEAR_IER(port, ERBFI); } while (0)
-#define SIR_UART_ENABLE_RX(port)   do { SIR_UART_SET_IER(port, ERBFI); } while (0)
-#else
-
-#define SIR_UART_GET_IIR(port)     bfin_read16((port)->membase + OFFSET_IIR)
-#define SIR_UART_GET_IER(port)     bfin_read16((port)->membase + OFFSET_IER)
-#define SIR_UART_PUT_IER(port, v)  bfin_write16(((port)->membase + OFFSET_IER), v)
-
-#define SIR_UART_SET_DLAB(port)    do { SIR_UART_PUT_LCR(port, SIR_UART_GET_LCR(port) | DLAB); } while (0)
-#define SIR_UART_CLEAR_DLAB(port)  do { SIR_UART_PUT_LCR(port, SIR_UART_GET_LCR(port) & ~DLAB); } while (0)
-
-#define SIR_UART_ENABLE_INTS(port, v) SIR_UART_PUT_IER(port, v)
-#define SIR_UART_DISABLE_INTS(port)   SIR_UART_PUT_IER(port, 0)
-#define SIR_UART_STOP_TX(port)     do { SIR_UART_PUT_IER(port, SIR_UART_GET_IER(port) & ~ETBEI); } while (0)
-#define SIR_UART_ENABLE_TX(port)   do { SIR_UART_PUT_IER(port, SIR_UART_GET_IER(port) | ETBEI); } while (0)
-#define SIR_UART_STOP_RX(port)     do { SIR_UART_PUT_IER(port, SIR_UART_GET_IER(port) & ~ERBFI); } while (0)
-#define SIR_UART_ENABLE_RX(port)   do { SIR_UART_PUT_IER(port, SIR_UART_GET_IER(port) | ERBFI); } while (0)
-
-static inline unsigned int SIR_UART_GET_LSR(struct bfin_sir_port *port)
-{
-	unsigned int lsr = bfin_read16(port->membase + OFFSET_LSR);
-	port->lsr |= (lsr & (BI|FE|PE|OE));
-	return lsr | port->lsr;
-}
-
-static inline void SIR_UART_CLEAR_LSR(struct bfin_sir_port *port)
-{
-	port->lsr = 0;
-	bfin_read16(port->membase + OFFSET_LSR);
-}
-#endif
+#define port_membase(port)     (((struct bfin_sir_port *)(port))->membase)
+#define get_lsr_cache(port)    (((struct bfin_sir_port *)(port))->lsr)
+#define put_lsr_cache(port, v) (((struct bfin_sir_port *)(port))->lsr = (v))
+#include <asm/bfin_serial.h>
 
 static const unsigned short per[][4] = {
 	/* rx pin      tx pin     NULL  uart_number */
_

^ permalink raw reply

* Re: [PATCH 1/3] vlan: Do not support clearing VLAN_FLAG_REORDER_HDR
From: Stephen Hemminger @ 2011-05-23 22:16 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Miller, greearb, nicolas.2p.debian, jpirko, xiaosuo, netdev,
	kaber, fubar, eric.dumazet, andy, jesse
In-Reply-To: <m1y61x2htp.fsf@fess.ebiederm.org>

On Mon, 23 May 2011 15:05:54 -0700
ebiederm@xmission.com (Eric W. Biederman) wrote:

> David Miller <davem@davemloft.net> writes:
> 
> > From: Stephen Hemminger <shemminger@linux-foundation.org>
> > Date: Mon, 23 May 2011 14:00:48 -0700
> >
> >> IMHO the REORDER_HDR flag was a design mistake. It means supporting
> >> two different API's for the application. For a packet capture application
> >> it means the format of the packet will be different based on how
> >> the VLAN was created. And the vlan was created outside the application.
> >> 
> >> If there was a requirement to support both formats, then it should
> >> be a property of the AF_PACKET socket.  The application can then do
> >> an setsockopt() to control the format of the data. The issue is
> >> for things like mmap/AF_PACKET the re-inserted form will be slower
> >> since data has to be copied.
> >
> > Indeed, it was a foolish thing to support in the first place.
> >
> > I guess we couldn't see the hw offloads in the future at that
> > point.
> >
> > I'm all for finding a way to deprecate this over time.
> 
> 
> There are two very similar issues here, that affect two
> slightly different cases.
> 
> Let's assume the configuration is:
> eth0 - Talks to the outside world.
> vlan2000 - Is the vlan interface of interest.
> 
> 
> With REORDER_HDR set when I tcpdump on vlan2000 I don't see the
> vlan header, however I continue to see the vlan header when I tcpdump on eth0.
> 
> With vlan hardware acceleration.  When I tcpdump on eth0 I don't
> see the vlan header.  Nor do I see the vlan header when I tcpdump
> on vlan2000.
> 
> 
> Right now both cases are problematic in Linus's tree.
> 
> For inbound packets We are testing the wrong interface for the to see if
> we should play with REORDER_HDR.
> 
> Hardware acceleration vlan tagging we are hiding the vlan header from
> portable applications that simply dump eth0.  Which is non-intuitive
> and apparent to everyone now that this happens on both software and
> hardware interfaces.
> 
> 
> So we have a couple of questions.
> 1) Do we revert the software emulation of vlan header tagging to fix
>    it's implementation issues in 2.6.40?
> 
>    The big issues.
>    - vlan_untag is currently reading undefined data.
>    - Clearing REORDER_HDR no longer works.
> 
>    I expect the issues are small enough that we can address them now.
> 
> 2) Do we instead move the software implementation of vlan header
>    acceleration back into the net-next.
> 
> 3) What do we do with pf_packet and vlan hardware acceleration when
>    dumping not the vlan interface but the interface below the vlan
>    interface?
> 
>    Do we provide an option to keep the vlan header?  Should that option
>    be on by default?

Why not just make libpcap smarter? it already has Linux specific
stuff for checksum offload. The portability to BSD for AF_PACKET is
a non-issue.


-- 

^ permalink raw reply

* Re: [PATCH 1/3] vlan: Do not support clearing VLAN_FLAG_REORDER_HDR
From: Eric W. Biederman @ 2011-05-23 22:05 UTC (permalink / raw)
  To: David Miller
  Cc: shemminger, greearb, nicolas.2p.debian, jpirko, xiaosuo, netdev,
	kaber, fubar, eric.dumazet, andy, jesse
In-Reply-To: <20110523.172047.1438754754048434316.davem@davemloft.net>

David Miller <davem@davemloft.net> writes:

> From: Stephen Hemminger <shemminger@linux-foundation.org>
> Date: Mon, 23 May 2011 14:00:48 -0700
>
>> IMHO the REORDER_HDR flag was a design mistake. It means supporting
>> two different API's for the application. For a packet capture application
>> it means the format of the packet will be different based on how
>> the VLAN was created. And the vlan was created outside the application.
>> 
>> If there was a requirement to support both formats, then it should
>> be a property of the AF_PACKET socket.  The application can then do
>> an setsockopt() to control the format of the data. The issue is
>> for things like mmap/AF_PACKET the re-inserted form will be slower
>> since data has to be copied.
>
> Indeed, it was a foolish thing to support in the first place.
>
> I guess we couldn't see the hw offloads in the future at that
> point.
>
> I'm all for finding a way to deprecate this over time.

There are two very similar issues here, that affect two
slightly different cases.

Let's assume the configuration is:
eth0 - Talks to the outside world.
vlan2000 - Is the vlan interface of interest.

With REORDER_HDR set when I tcpdump on vlan2000 I don't see the
vlan header, however I continue to see the vlan header when I tcpdump on eth0.

With vlan hardware acceleration.  When I tcpdump on eth0 I don't
see the vlan header.  Nor do I see the vlan header when I tcpdump
on vlan2000.

Right now both cases are problematic in Linus's tree.

For inbound packets We are testing the wrong interface for the to see if
we should play with REORDER_HDR.

Hardware acceleration vlan tagging we are hiding the vlan header from
portable applications that simply dump eth0.  Which is non-intuitive
and apparent to everyone now that this happens on both software and
hardware interfaces.

So we have a couple of questions.
1) Do we revert the software emulation of vlan header tagging to fix
   it's implementation issues in 2.6.40?

   The big issues.
   - vlan_untag is currently reading undefined data.
   - Clearing REORDER_HDR no longer works.

   I expect the issues are small enough that we can address them now.

2) Do we instead move the software implementation of vlan header
   acceleration back into the net-next.

3) What do we do with pf_packet and vlan hardware acceleration when
   dumping not the vlan interface but the interface below the vlan
   interface?

   Do we provide an option to keep the vlan header?  Should that option
   be on by default?

Eric

^ permalink raw reply

* Re: [PATCH v3] CDC NCM: release interfaces fix in unbind()
From: David Miller @ 2011-05-23 21:46 UTC (permalink / raw)
  To: alexey.orishko-Re5JQEeQqe8AvxtiuMwx3w
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, oliver-GvhC2dPhHPQdnm+yROfE0A,
	linux-usb-u79uwXL29TY76Z2rM5mHXA, gregkh-l3A5Bk7waGM,
	stern-nwvwT67g6+6dFdvTe/nMLpVzexx5G7lz,
	alexey.orishko-0IS4wlFg1OjSUeElwK9/Pw
In-Reply-To: <1305938536-5343-1-git-send-email-alexey.orishko-0IS4wlFg1OjSUeElwK9/Pw@public.gmane.org>

From: Alexey Orishko <alexey.orishko-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Date: Sat, 21 May 2011 02:42:16 +0200

> Changes:
> - claim slave/data interface during bind() and release
>  interfaces in unbind() unconditionally
> - in case of error during bind(), release claimed data
>  interface in the same function
> - remove obsolited "*_claimed" entries from driver context
> 
> Signed-off-by: Alexey Orishko <alexey.orishko-0IS4wlFg1OjSUeElwK9/Pw@public.gmane.org>

Oliver, this look good to you?
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 0/2] Add and use WARN_RATELIMIT
From: David Miller @ 2011-05-23 21:38 UTC (permalink / raw)
  To: joe; +Cc: greearb, linux-arch, arnd, netdev, linux-kernel
In-Reply-To: <cover.1305999731.git.joe@perches.com>

From: Joe Perches <joe@perches.com>
Date: Sat, 21 May 2011 10:48:38 -0700

> Generic mechanism to ratelimit WARN uses.
> 
> Joe Perches (2):
>   bug.h: Add WARN_RATELIMIT
>   net: filter: Use WARN_RATELIMIT
> 
>  include/asm-generic/bug.h |   16 ++++++++++++++++
>  net/core/filter.c         |    4 +++-
>  2 files changed, 19 insertions(+), 1 deletions(-)

This looks fine to me, and no objections have been voiced otherwise,
so applied to net-2.6, thanks!

^ permalink raw reply

* Re: [PATCH] sch_sfq: avoid giving spurious NET_XMIT_CN signals
From: David Miller @ 2011-05-23 21:36 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, kaber, jarkao2, hadi, shemminger
In-Reply-To: <1306184562.2638.14.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 23 May 2011 23:02:42 +0200

> While chasing a possible net_sched bug, I found that IP fragments have
> litle chance to pass a congestioned SFQ qdisc :
> 
> - Say SFQ qdisc is full because one flow is non responsive.
> - ip_fragment() wants to send two fragments belonging to an idle flow.
> - sfq_enqueue() queues first packet, but see queue limit reached :
> - sfq_enqueue() drops one packet from 'big consumer', and returns
> NET_XMIT_CN.
> - ip_fragment() cancel remaining fragments.
> 
> This patch restores fairness, making sure we return NET_XMIT_CN only if
> we dropped a packet from the same flow.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Looks good, applied, thanks Eric.

^ permalink raw reply

* [PATCH 4/4] ipv4: Kill 'rt_src' from 'struct rtable'
From: David Miller @ 2011-05-23 21:34 UTC (permalink / raw)
  To: netdev


Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/route.h     |    1 -
 net/ipv4/route.c        |   34 +++++++++++++++-------------------
 net/ipv4/xfrm4_policy.c |    1 -
 3 files changed, 15 insertions(+), 21 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index d293db3..40e9713 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -50,7 +50,6 @@ struct rtable {
 	__u16			rt_type;
 
 	__be32			rt_dst;	/* Path destination	*/
-	__be32			rt_src;	/* Path source		*/
 	int			rt_route_iif;
 	int			rt_iif;
 	int			rt_oif;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 0cdbdb0..1f9d43b 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1165,7 +1165,6 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	rth->rt_flags	= RTCF_MULTICAST;
 	rth->rt_type	= RTN_MULTICAST;
 	rth->rt_dst	= daddr;
-	rth->rt_src	= saddr;
 	rth->rt_route_iif = dev->ifindex;
 	rth->rt_iif	= dev->ifindex;
 	rth->rt_oif	= 0;
@@ -1298,7 +1297,6 @@ static int __mkroute_input(struct sk_buff *skb,
 	rth->rt_flags = flags;
 	rth->rt_type = res->type;
 	rth->rt_dst	= daddr;
-	rth->rt_src	= saddr;
 	rth->rt_route_iif = in_dev->dev->ifindex;
 	rth->rt_iif 	= in_dev->dev->ifindex;
 	rth->rt_oif 	= 0;
@@ -1470,7 +1468,6 @@ local_input:
 	rth->rt_flags 	= flags|RTCF_LOCAL;
 	rth->rt_type	= res.type;
 	rth->rt_dst	= daddr;
-	rth->rt_src	= saddr;
 #ifdef CONFIG_IP_ROUTE_CLASSID
 	rth->dst.tclassid = itag;
 #endif
@@ -1634,7 +1631,6 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 	rth->rt_flags	= flags;
 	rth->rt_type	= type;
 	rth->rt_dst	= fl4->daddr;
-	rth->rt_src	= fl4->saddr;
 	rth->rt_route_iif = 0;
 	rth->rt_iif	= orig_oif ? : dev_out->ifindex;
 	rth->rt_oif	= orig_oif;
@@ -1919,7 +1915,6 @@ struct dst_entry *ipv4_blackhole_route(struct net *net, struct dst_entry *dst_or
 		rt->rt_flags = ort->rt_flags;
 		rt->rt_type = ort->rt_type;
 		rt->rt_dst = ort->rt_dst;
-		rt->rt_src = ort->rt_src;
 		rt->rt_gateway = ort->rt_gateway;
 		rt->rt_spec_dst = ort->rt_spec_dst;
 		rt->peer = ort->peer;
@@ -1954,7 +1949,7 @@ struct rtable *ip_route_output_flow(struct net *net, struct flowi4 *flp4,
 }
 EXPORT_SYMBOL_GPL(ip_route_output_flow);
 
-static int rt_fill_info(struct net *net,  __be32 src, u8 tos,
+static int rt_fill_info(struct net *net,  __be32 src, struct flowi4 *fl4,
 			struct sk_buff *skb, u32 pid, u32 seq, int event,
 			int nowait, unsigned int flags)
 {
@@ -1972,7 +1967,7 @@ static int rt_fill_info(struct net *net,  __be32 src, u8 tos,
 	r->rtm_family	 = AF_INET;
 	r->rtm_dst_len	= 32;
 	r->rtm_src_len	= 0;
-	r->rtm_tos	= tos;
+	r->rtm_tos	= fl4->flowi4_tos;
 	r->rtm_table	= RT_TABLE_MAIN;
 	NLA_PUT_U32(skb, RTA_TABLE, RT_TABLE_MAIN);
 	r->rtm_type	= rt->rt_type;
@@ -1996,10 +1991,10 @@ static int rt_fill_info(struct net *net,  __be32 src, u8 tos,
 #endif
 	if (rt_is_input_route(rt))
 		NLA_PUT_BE32(skb, RTA_PREFSRC, rt->rt_spec_dst);
-	else if (rt->rt_src != src)
-		NLA_PUT_BE32(skb, RTA_PREFSRC, rt->rt_src);
+	else if (fl4->saddr != src)
+		NLA_PUT_BE32(skb, RTA_PREFSRC, fl4->saddr);
 
-	if (rt->rt_dst != rt->rt_gateway)
+	if (fl4->daddr != rt->rt_gateway)
 		NLA_PUT_BE32(skb, RTA_GATEWAY, rt->rt_gateway);
 
 	if (rtnetlink_put_metrics(skb, dst_metrics_ptr(&rt->dst)) < 0)
@@ -2027,7 +2022,7 @@ static int rt_fill_info(struct net *net,  __be32 src, u8 tos,
 		if (ipv4_is_multicast(dst) && !ipv4_is_local_multicast(dst) &&
 		    IPV4_DEVCONF_ALL(net, MC_FORWARDING)) {
 			int err = ipmr_get_route(net, skb,
-						 rt->rt_src, rt->rt_dst,
+						 fl4->saddr, fl4->daddr,
 						 r, nowait);
 			if (err <= 0) {
 				if (!nowait) {
@@ -2062,6 +2057,7 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr* nlh, void
 	struct rtmsg *rtm;
 	struct nlattr *tb[RTA_MAX+1];
 	struct rtable *rt = NULL;
+	struct flowi4 fl4;
 	__be32 dst = 0;
 	__be32 src = 0;
 	u32 iif;
@@ -2096,6 +2092,13 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr* nlh, void
 	iif = tb[RTA_IIF] ? nla_get_u32(tb[RTA_IIF]) : 0;
 	mark = tb[RTA_MARK] ? nla_get_u32(tb[RTA_MARK]) : 0;
 
+	memset(&fl4, 0, sizeof(fl4));
+	fl4.daddr = dst;
+	fl4.saddr = src;
+	fl4.flowi4_tos = rtm->rtm_tos;
+	fl4.flowi4_oif = tb[RTA_OIF] ? nla_get_u32(tb[RTA_OIF]) : 0;
+	fl4.flowi4_mark = mark;
+
 	if (iif) {
 		struct net_device *dev;
 
@@ -2116,13 +2119,6 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr* nlh, void
 		if (err == 0 && rt->dst.error)
 			err = -rt->dst.error;
 	} else {
-		struct flowi4 fl4 = {
-			.daddr = dst,
-			.saddr = src,
-			.flowi4_tos = rtm->rtm_tos,
-			.flowi4_oif = tb[RTA_OIF] ? nla_get_u32(tb[RTA_OIF]) : 0,
-			.flowi4_mark = mark,
-		};
 		rt = ip_route_output_key(net, &fl4);
 
 		err = 0;
@@ -2137,7 +2133,7 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr* nlh, void
 	if (rtm->rtm_flags & RTM_F_NOTIFY)
 		rt->rt_flags |= RTCF_NOTIFY;
 
-	err = rt_fill_info(net, src, rtm->rtm_tos, skb,
+	err = rt_fill_info(net, src, &fl4, skb,
 			   NETLINK_CB(in_skb).pid, nlh->nlmsg_seq,
 			   RTM_NEWROUTE, 0, 0);
 	if (err <= 0)
diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c
index ea0eeb1..ff2f330 100644
--- a/net/ipv4/xfrm4_policy.c
+++ b/net/ipv4/xfrm4_policy.c
@@ -96,7 +96,6 @@ static int xfrm4_fill_dst(struct xfrm_dst *xdst, struct net_device *dev,
 	xdst->u.rt.rt_flags = rt->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST |
 					      RTCF_LOCAL);
 	xdst->u.rt.rt_type = rt->rt_type;
-	xdst->u.rt.rt_src = rt->rt_src;
 	xdst->u.rt.rt_dst = rt->rt_dst;
 	xdst->u.rt.rt_gateway = rt->rt_gateway;
 	xdst->u.rt.rt_spec_dst = rt->rt_spec_dst;
-- 
1.7.5.2


^ permalink raw reply related

* [PATCH 3/4] ipv4: Remove rt_key_{src,dst,tos} from struct rtable.
From: David Miller @ 2011-05-23 21:34 UTC (permalink / raw)
  To: netdev


They are always used in contexts where they can be reconstituted,
or where the finally resolved rt->rt_{src,dst} is semantically
equivalent.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/route.h     |    5 -----
 net/ipv4/route.c        |   37 ++++++++-----------------------------
 net/ipv4/xfrm4_policy.c |    3 ---
 3 files changed, 8 insertions(+), 37 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 34c9bc5..d293db3 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -45,14 +45,9 @@ struct fib_info;
 struct rtable {
 	struct dst_entry	dst;
 
-	/* Lookup key. */
-	__be32			rt_key_dst;
-	__be32			rt_key_src;
-
 	int			rt_genid;
 	unsigned		rt_flags;
 	__u16			rt_type;
-	__u8			rt_key_tos;
 
 	__be32			rt_dst;	/* Path destination	*/
 	__be32			rt_src;	/* Path source		*/
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 8af1985..0cdbdb0 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1161,12 +1161,9 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 #endif
 	rth->dst.output = ip_rt_bug;
 
-	rth->rt_key_dst	= daddr;
-	rth->rt_key_src	= saddr;
 	rth->rt_genid	= rt_genid(dev_net(dev));
 	rth->rt_flags	= RTCF_MULTICAST;
 	rth->rt_type	= RTN_MULTICAST;
-	rth->rt_key_tos	= tos;
 	rth->rt_dst	= daddr;
 	rth->rt_src	= saddr;
 	rth->rt_route_iif = dev->ifindex;
@@ -1297,12 +1294,9 @@ static int __mkroute_input(struct sk_buff *skb,
 		goto cleanup;
 	}
 
-	rth->rt_key_dst	= daddr;
-	rth->rt_key_src	= saddr;
 	rth->rt_genid = rt_genid(dev_net(rth->dst.dev));
 	rth->rt_flags = flags;
 	rth->rt_type = res->type;
-	rth->rt_key_tos	= tos;
 	rth->rt_dst	= daddr;
 	rth->rt_src	= saddr;
 	rth->rt_route_iif = in_dev->dev->ifindex;
@@ -1472,12 +1466,9 @@ local_input:
 	rth->dst.tclassid = itag;
 #endif
 
-	rth->rt_key_dst	= daddr;
-	rth->rt_key_src	= saddr;
 	rth->rt_genid = rt_genid(net);
 	rth->rt_flags 	= flags|RTCF_LOCAL;
 	rth->rt_type	= res.type;
-	rth->rt_key_tos	= tos;
 	rth->rt_dst	= daddr;
 	rth->rt_src	= saddr;
 #ifdef CONFIG_IP_ROUTE_CLASSID
@@ -1590,12 +1581,10 @@ EXPORT_SYMBOL(ip_route_input);
 /* called with rcu_read_lock() */
 static struct rtable *__mkroute_output(const struct fib_result *res,
 				       const struct flowi4 *fl4,
-				       __be32 orig_daddr, __be32 orig_saddr,
 				       int orig_oif, struct net_device *dev_out,
 				       unsigned int flags)
 {
 	struct fib_info *fi = res->fi;
-	u32 tos = RT_FL_TOS(fl4);
 	struct in_device *in_dev;
 	u16 type = res->type;
 	struct rtable *rth;
@@ -1641,12 +1630,9 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 
 	rth->dst.output = ip_output;
 
-	rth->rt_key_dst	= orig_daddr;
-	rth->rt_key_src	= orig_saddr;
 	rth->rt_genid = rt_genid(dev_net(dev_out));
 	rth->rt_flags	= flags;
 	rth->rt_type	= type;
-	rth->rt_key_tos	= tos;
 	rth->rt_dst	= fl4->daddr;
 	rth->rt_src	= fl4->saddr;
 	rth->rt_route_iif = 0;
@@ -1699,8 +1685,6 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
 	unsigned int flags = 0;
 	struct fib_result res;
 	struct rtable *rth;
-	__be32 orig_daddr;
-	__be32 orig_saddr;
 	int orig_oif;
 
 	res.fi		= NULL;
@@ -1708,8 +1692,6 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
 	res.r		= NULL;
 #endif
 
-	orig_daddr = fl4->daddr;
-	orig_saddr = fl4->saddr;
 	orig_oif = fl4->flowi4_oif;
 
 	fl4->flowi4_iif = net->loopback_dev->ifindex;
@@ -1870,8 +1852,7 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
 
 
 make_route:
-	rth = __mkroute_output(&res, fl4, orig_daddr, orig_saddr, orig_oif,
-			       dev_out, flags);
+	rth = __mkroute_output(&res, fl4, orig_oif, dev_out, flags);
 	if (!IS_ERR(rth))
 		rth = rt_finalize(rth, NULL);
 
@@ -1929,9 +1910,6 @@ struct dst_entry *ipv4_blackhole_route(struct net *net, struct dst_entry *dst_or
 		if (new->dev)
 			dev_hold(new->dev);
 
-		rt->rt_key_dst = ort->rt_key_dst;
-		rt->rt_key_src = ort->rt_key_src;
-		rt->rt_key_tos = ort->rt_key_tos;
 		rt->rt_route_iif = ort->rt_route_iif;
 		rt->rt_iif = ort->rt_iif;
 		rt->rt_oif = ort->rt_oif;
@@ -1976,7 +1954,7 @@ struct rtable *ip_route_output_flow(struct net *net, struct flowi4 *flp4,
 }
 EXPORT_SYMBOL_GPL(ip_route_output_flow);
 
-static int rt_fill_info(struct net *net,
+static int rt_fill_info(struct net *net,  __be32 src, u8 tos,
 			struct sk_buff *skb, u32 pid, u32 seq, int event,
 			int nowait, unsigned int flags)
 {
@@ -1994,7 +1972,7 @@ static int rt_fill_info(struct net *net,
 	r->rtm_family	 = AF_INET;
 	r->rtm_dst_len	= 32;
 	r->rtm_src_len	= 0;
-	r->rtm_tos	= rt->rt_key_tos;
+	r->rtm_tos	= tos;
 	r->rtm_table	= RT_TABLE_MAIN;
 	NLA_PUT_U32(skb, RTA_TABLE, RT_TABLE_MAIN);
 	r->rtm_type	= rt->rt_type;
@@ -2006,9 +1984,9 @@ static int rt_fill_info(struct net *net,
 
 	NLA_PUT_BE32(skb, RTA_DST, rt->rt_dst);
 
-	if (rt->rt_key_src) {
+	if (src) {
 		r->rtm_src_len = 32;
-		NLA_PUT_BE32(skb, RTA_SRC, rt->rt_key_src);
+		NLA_PUT_BE32(skb, RTA_SRC, src);
 	}
 	if (rt->dst.dev)
 		NLA_PUT_U32(skb, RTA_OIF, rt->dst.dev->ifindex);
@@ -2018,7 +1996,7 @@ static int rt_fill_info(struct net *net,
 #endif
 	if (rt_is_input_route(rt))
 		NLA_PUT_BE32(skb, RTA_PREFSRC, rt->rt_spec_dst);
-	else if (rt->rt_src != rt->rt_key_src)
+	else if (rt->rt_src != src)
 		NLA_PUT_BE32(skb, RTA_PREFSRC, rt->rt_src);
 
 	if (rt->rt_dst != rt->rt_gateway)
@@ -2159,7 +2137,8 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr* nlh, void
 	if (rtm->rtm_flags & RTM_F_NOTIFY)
 		rt->rt_flags |= RTCF_NOTIFY;
 
-	err = rt_fill_info(net, skb, NETLINK_CB(in_skb).pid, nlh->nlmsg_seq,
+	err = rt_fill_info(net, src, rtm->rtm_tos, skb,
+			   NETLINK_CB(in_skb).pid, nlh->nlmsg_seq,
 			   RTM_NEWROUTE, 0, 0);
 	if (err <= 0)
 		goto errout_free;
diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c
index 981e43e..ea0eeb1 100644
--- a/net/ipv4/xfrm4_policy.c
+++ b/net/ipv4/xfrm4_policy.c
@@ -79,9 +79,6 @@ static int xfrm4_fill_dst(struct xfrm_dst *xdst, struct net_device *dev,
 	struct rtable *rt = (struct rtable *)xdst->route;
 	const struct flowi4 *fl4 = &fl->u.ip4;
 
-	rt->rt_key_dst = fl4->daddr;
-	rt->rt_key_src = fl4->saddr;
-	rt->rt_key_tos = fl4->flowi4_tos;
 	rt->rt_route_iif = fl4->flowi4_iif;
 	rt->rt_iif = fl4->flowi4_iif;
 	rt->rt_oif = fl4->flowi4_oif;
-- 
1.7.5.2


^ permalink raw reply related

* [PATCH 2/4] ipv4: Kill ip_route_input_noref().
From: David Miller @ 2011-05-23 21:33 UTC (permalink / raw)
  To: netdev


The "noref" argument to ip_route_input_common() is now always ignored
because we do not cache routes, and in that case we must always grab
a reference to the resulting 'dst'.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/route.h    |   16 ++--------------
 net/ipv4/arp.c         |    2 +-
 net/ipv4/ip_fragment.c |    4 ++--
 net/ipv4/ip_input.c    |    4 ++--
 net/ipv4/route.c       |    6 +++---
 net/ipv4/xfrm4_input.c |    4 ++--
 6 files changed, 12 insertions(+), 24 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 8861144..34c9bc5 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -165,20 +165,8 @@ static inline struct rtable *ip_route_output_gre(struct net *net, struct flowi4
 	return ip_route_output_key(net, fl4);
 }
 
-extern int ip_route_input_common(struct sk_buff *skb, __be32 dst, __be32 src,
-				 u8 tos, struct net_device *devin, bool noref);
-
-static inline int ip_route_input(struct sk_buff *skb, __be32 dst, __be32 src,
-				 u8 tos, struct net_device *devin)
-{
-	return ip_route_input_common(skb, dst, src, tos, devin, false);
-}
-
-static inline int ip_route_input_noref(struct sk_buff *skb, __be32 dst, __be32 src,
-				       u8 tos, struct net_device *devin)
-{
-	return ip_route_input_common(skb, dst, src, tos, devin, true);
-}
+extern int ip_route_input(struct sk_buff *skb, __be32 dst, __be32 src,
+			  u8 tos, struct net_device *devin);
 
 extern unsigned short	ip_rt_frag_needed(struct net *net, const struct iphdr *iph,
 					  unsigned short new_mtu, struct net_device *dev);
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 1b74d3b..ef44a91 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -875,7 +875,7 @@ static int arp_process(struct sk_buff *skb)
 	}
 
 	if (arp->ar_op == htons(ARPOP_REQUEST) &&
-	    ip_route_input_noref(skb, tip, sip, 0, dev) == 0) {
+	    ip_route_input(skb, tip, sip, 0, dev) == 0) {
 
 		rt = skb_rtable(skb);
 		addr_type = rt->rt_type;
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 0ad6035..831c922 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -252,8 +252,8 @@ static void ip_expire(unsigned long arg)
 		/* skb dst is stale, drop it, and perform route lookup again */
 		skb_dst_drop(head);
 		iph = ip_hdr(head);
-		err = ip_route_input_noref(head, iph->daddr, iph->saddr,
-					   iph->tos, head->dev);
+		err = ip_route_input(head, iph->daddr, iph->saddr,
+				     iph->tos, head->dev);
 		if (err)
 			goto out_rcu_unlock;
 
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index c8f48ef..c21ff9b 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -324,8 +324,8 @@ static int ip_rcv_finish(struct sk_buff *skb)
 	 *	how the packet travels inside Linux networking.
 	 */
 	if (skb_dst(skb) == NULL) {
-		int err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
-					       iph->tos, skb->dev);
+		int err = ip_route_input(skb, iph->daddr, iph->saddr,
+					 iph->tos, skb->dev);
 		if (unlikely(err)) {
 			if (err == -EHOSTUNREACH)
 				IP_INC_STATS_BH(dev_net(skb->dev),
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 7e05fb5..8af1985 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1541,8 +1541,8 @@ martian_source_keep_err:
 	goto out;
 }
 
-int ip_route_input_common(struct sk_buff *skb, __be32 daddr, __be32 saddr,
-			   u8 tos, struct net_device *dev, bool noref)
+int ip_route_input(struct sk_buff *skb, __be32 daddr, __be32 saddr,
+		   u8 tos, struct net_device *dev)
 {
 	int res;
 
@@ -1585,7 +1585,7 @@ int ip_route_input_common(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	rcu_read_unlock();
 	return res;
 }
-EXPORT_SYMBOL(ip_route_input_common);
+EXPORT_SYMBOL(ip_route_input);
 
 /* called with rcu_read_lock() */
 static struct rtable *__mkroute_output(const struct fib_result *res,
diff --git a/net/ipv4/xfrm4_input.c b/net/ipv4/xfrm4_input.c
index 06814b6..58d23a5 100644
--- a/net/ipv4/xfrm4_input.c
+++ b/net/ipv4/xfrm4_input.c
@@ -27,8 +27,8 @@ static inline int xfrm4_rcv_encap_finish(struct sk_buff *skb)
 	if (skb_dst(skb) == NULL) {
 		const struct iphdr *iph = ip_hdr(skb);
 
-		if (ip_route_input_noref(skb, iph->daddr, iph->saddr,
-					 iph->tos, skb->dev))
+		if (ip_route_input(skb, iph->daddr, iph->saddr,
+				   iph->tos, skb->dev))
 			goto drop;
 	}
 	return dst_input(skb);
-- 
1.7.5.2


^ permalink raw reply related

* [PATCH 1/4] ipv4: Delete routing cache.
From: David Miller @ 2011-05-23 21:33 UTC (permalink / raw)
  To: netdev


Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/route.h     |    1 -
 net/ipv4/fib_frontend.c |    5 -
 net/ipv4/route.c        |  889 +----------------------------------------------
 3 files changed, 19 insertions(+), 876 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index db7b343..8861144 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -114,7 +114,6 @@ extern int		ip_rt_init(void);
 extern void		ip_rt_redirect(__be32 old_gw, __be32 dst, __be32 new_gw,
 				       __be32 src, struct net_device *dev);
 extern void		rt_cache_flush(struct net *net, int how);
-extern void		rt_cache_flush_batch(struct net *net);
 extern struct rtable *__ip_route_output_key(struct net *, struct flowi4 *flp);
 extern struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp,
 					   struct sock *sk);
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 2252471..33bbbda 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -1022,11 +1022,6 @@ static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
 		rt_cache_flush(dev_net(dev), 0);
 		break;
 	case NETDEV_UNREGISTER_BATCH:
-		/* The batch unregister is only called on the first
-		 * device in the list of devices being unregistered.
-		 * Therefore we should not pass dev_net(dev) in here.
-		 */
-		rt_cache_flush_batch(NULL);
 		break;
 	}
 	return NOTIFY_DONE;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index b24d58e..7e05fb5 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -129,7 +129,6 @@ static int ip_rt_gc_elasticity __read_mostly	= 8;
 static int ip_rt_mtu_expires __read_mostly	= 10 * 60 * HZ;
 static int ip_rt_min_pmtu __read_mostly		= 512 + 20 + 20;
 static int ip_rt_min_advmss __read_mostly	= 256;
-static int rt_chain_length_max __read_mostly	= 20;
 
 /*
  *	Interface to generic destination cache.
@@ -142,7 +141,6 @@ static void		 ipv4_dst_destroy(struct dst_entry *dst);
 static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst);
 static void		 ipv4_link_failure(struct sk_buff *skb);
 static void		 ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu);
-static int rt_garbage_collect(struct dst_ops *ops);
 
 static void ipv4_dst_ifdown(struct dst_entry *dst, struct net_device *dev,
 			    int how)
@@ -187,7 +185,6 @@ static u32 *ipv4_cow_metrics(struct dst_entry *dst, unsigned long old)
 static struct dst_ops ipv4_dst_ops = {
 	.family =		AF_INET,
 	.protocol =		cpu_to_be16(ETH_P_IP),
-	.gc =			rt_garbage_collect,
 	.check =		ipv4_dst_check,
 	.default_advmss =	ipv4_default_advmss,
 	.default_mtu =		ipv4_default_mtu,
@@ -222,184 +219,30 @@ const __u8 ip_tos2prio[16] = {
 };
 
 
-/*
- * Route cache.
- */
-
-/* The locking scheme is rather straight forward:
- *
- * 1) Read-Copy Update protects the buckets of the central route hash.
- * 2) Only writers remove entries, and they hold the lock
- *    as they look at rtable reference counts.
- * 3) Only readers acquire references to rtable entries,
- *    they do so with atomic increments and with the
- *    lock held.
- */
-
-struct rt_hash_bucket {
-	struct rtable __rcu	*chain;
-};
-
-#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || \
-	defined(CONFIG_PROVE_LOCKING)
-/*
- * Instead of using one spinlock for each rt_hash_bucket, we use a table of spinlocks
- * The size of this table is a power of two and depends on the number of CPUS.
- * (on lockdep we have a quite big spinlock_t, so keep the size down there)
- */
-#ifdef CONFIG_LOCKDEP
-# define RT_HASH_LOCK_SZ	256
-#else
-# if NR_CPUS >= 32
-#  define RT_HASH_LOCK_SZ	4096
-# elif NR_CPUS >= 16
-#  define RT_HASH_LOCK_SZ	2048
-# elif NR_CPUS >= 8
-#  define RT_HASH_LOCK_SZ	1024
-# elif NR_CPUS >= 4
-#  define RT_HASH_LOCK_SZ	512
-# else
-#  define RT_HASH_LOCK_SZ	256
-# endif
-#endif
-
-static spinlock_t	*rt_hash_locks;
-# define rt_hash_lock_addr(slot) &rt_hash_locks[(slot) & (RT_HASH_LOCK_SZ - 1)]
-
-static __init void rt_hash_lock_init(void)
-{
-	int i;
-
-	rt_hash_locks = kmalloc(sizeof(spinlock_t) * RT_HASH_LOCK_SZ,
-			GFP_KERNEL);
-	if (!rt_hash_locks)
-		panic("IP: failed to allocate rt_hash_locks\n");
-
-	for (i = 0; i < RT_HASH_LOCK_SZ; i++)
-		spin_lock_init(&rt_hash_locks[i]);
-}
-#else
-# define rt_hash_lock_addr(slot) NULL
-
-static inline void rt_hash_lock_init(void)
-{
-}
-#endif
-
-static struct rt_hash_bucket 	*rt_hash_table __read_mostly;
-static unsigned			rt_hash_mask __read_mostly;
-static unsigned int		rt_hash_log  __read_mostly;
-
 static DEFINE_PER_CPU(struct rt_cache_stat, rt_cache_stat);
 #define RT_CACHE_STAT_INC(field) __this_cpu_inc(rt_cache_stat.field)
 
-static inline unsigned int rt_hash(__be32 daddr, __be32 saddr, int idx,
-				   int genid)
-{
-	return jhash_3words((__force u32)daddr, (__force u32)saddr,
-			    idx, genid)
-		& rt_hash_mask;
-}
-
 static inline int rt_genid(struct net *net)
 {
 	return atomic_read(&net->ipv4.rt_genid);
 }
 
 #ifdef CONFIG_PROC_FS
-struct rt_cache_iter_state {
-	struct seq_net_private p;
-	int bucket;
-	int genid;
-};
-
-static struct rtable *rt_cache_get_first(struct seq_file *seq)
-{
-	struct rt_cache_iter_state *st = seq->private;
-	struct rtable *r = NULL;
-
-	for (st->bucket = rt_hash_mask; st->bucket >= 0; --st->bucket) {
-		if (!rcu_dereference_raw(rt_hash_table[st->bucket].chain))
-			continue;
-		rcu_read_lock_bh();
-		r = rcu_dereference_bh(rt_hash_table[st->bucket].chain);
-		while (r) {
-			if (dev_net(r->dst.dev) == seq_file_net(seq) &&
-			    r->rt_genid == st->genid)
-				return r;
-			r = rcu_dereference_bh(r->dst.rt_next);
-		}
-		rcu_read_unlock_bh();
-	}
-	return r;
-}
-
-static struct rtable *__rt_cache_get_next(struct seq_file *seq,
-					  struct rtable *r)
-{
-	struct rt_cache_iter_state *st = seq->private;
-
-	r = rcu_dereference_bh(r->dst.rt_next);
-	while (!r) {
-		rcu_read_unlock_bh();
-		do {
-			if (--st->bucket < 0)
-				return NULL;
-		} while (!rcu_dereference_raw(rt_hash_table[st->bucket].chain));
-		rcu_read_lock_bh();
-		r = rcu_dereference_bh(rt_hash_table[st->bucket].chain);
-	}
-	return r;
-}
-
-static struct rtable *rt_cache_get_next(struct seq_file *seq,
-					struct rtable *r)
-{
-	struct rt_cache_iter_state *st = seq->private;
-	while ((r = __rt_cache_get_next(seq, r)) != NULL) {
-		if (dev_net(r->dst.dev) != seq_file_net(seq))
-			continue;
-		if (r->rt_genid == st->genid)
-			break;
-	}
-	return r;
-}
-
-static struct rtable *rt_cache_get_idx(struct seq_file *seq, loff_t pos)
-{
-	struct rtable *r = rt_cache_get_first(seq);
-
-	if (r)
-		while (pos && (r = rt_cache_get_next(seq, r)))
-			--pos;
-	return pos ? NULL : r;
-}
-
 static void *rt_cache_seq_start(struct seq_file *seq, loff_t *pos)
 {
-	struct rt_cache_iter_state *st = seq->private;
 	if (*pos)
-		return rt_cache_get_idx(seq, *pos - 1);
-	st->genid = rt_genid(seq_file_net(seq));
+		return NULL;
 	return SEQ_START_TOKEN;
 }
 
 static void *rt_cache_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 {
-	struct rtable *r;
-
-	if (v == SEQ_START_TOKEN)
-		r = rt_cache_get_first(seq);
-	else
-		r = rt_cache_get_next(seq, v);
 	++*pos;
-	return r;
+	return NULL;
 }
 
 static void rt_cache_seq_stop(struct seq_file *seq, void *v)
 {
-	if (v && v != SEQ_START_TOKEN)
-		rcu_read_unlock_bh();
 }
 
 static int rt_cache_seq_show(struct seq_file *seq, void *v)
@@ -409,29 +252,6 @@ static int rt_cache_seq_show(struct seq_file *seq, void *v)
 			   "Iface\tDestination\tGateway \tFlags\t\tRefCnt\tUse\t"
 			   "Metric\tSource\t\tMTU\tWindow\tIRTT\tTOS\tHHRef\t"
 			   "HHUptod\tSpecDst");
-	else {
-		struct rtable *r = v;
-		int len;
-
-		seq_printf(seq, "%s\t%08X\t%08X\t%8X\t%d\t%u\t%d\t"
-			      "%08X\t%d\t%u\t%u\t%02X\t%d\t%1d\t%08X%n",
-			r->dst.dev ? r->dst.dev->name : "*",
-			(__force u32)r->rt_dst,
-			(__force u32)r->rt_gateway,
-			r->rt_flags, atomic_read(&r->dst.__refcnt),
-			r->dst.__use, 0, (__force u32)r->rt_src,
-			dst_metric_advmss(&r->dst) + 40,
-			dst_metric(&r->dst, RTAX_WINDOW),
-			(int)((dst_metric(&r->dst, RTAX_RTT) >> 3) +
-			      dst_metric(&r->dst, RTAX_RTTVAR)),
-			r->rt_key_tos,
-			r->dst.hh ? atomic_read(&r->dst.hh->hh_refcnt) : -1,
-			r->dst.hh ? (r->dst.hh->hh_output ==
-				       dev_queue_xmit) : 0,
-			r->rt_spec_dst, &len);
-
-		seq_printf(seq, "%*s\n", 127 - len, "");
-	}
 	return 0;
 }
 
@@ -444,8 +264,7 @@ static const struct seq_operations rt_cache_seq_ops = {
 
 static int rt_cache_seq_open(struct inode *inode, struct file *file)
 {
-	return seq_open_net(inode, file, &rt_cache_seq_ops,
-			sizeof(struct rt_cache_iter_state));
+	return seq_open(file, &rt_cache_seq_ops);
 }
 
 static const struct file_operations rt_cache_seq_fops = {
@@ -453,7 +272,7 @@ static const struct file_operations rt_cache_seq_fops = {
 	.open	 = rt_cache_seq_open,
 	.read	 = seq_read,
 	.llseek	 = seq_lseek,
-	.release = seq_release_net,
+	.release = seq_release,
 };
 
 
@@ -643,184 +462,12 @@ static inline int ip_rt_proc_init(void)
 }
 #endif /* CONFIG_PROC_FS */
 
-static inline void rt_free(struct rtable *rt)
-{
-	call_rcu_bh(&rt->dst.rcu_head, dst_rcu_free);
-}
-
-static inline void rt_drop(struct rtable *rt)
-{
-	ip_rt_put(rt);
-	call_rcu_bh(&rt->dst.rcu_head, dst_rcu_free);
-}
-
-static inline int rt_fast_clean(struct rtable *rth)
-{
-	/* Kill broadcast/multicast entries very aggresively, if they
-	   collide in hash table with more useful entries */
-	return (rth->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST)) &&
-		rt_is_input_route(rth) && rth->dst.rt_next;
-}
-
-static inline int rt_valuable(struct rtable *rth)
-{
-	return (rth->rt_flags & (RTCF_REDIRECTED | RTCF_NOTIFY)) ||
-		(rth->peer && rth->peer->pmtu_expires);
-}
-
-static int rt_may_expire(struct rtable *rth, unsigned long tmo1, unsigned long tmo2)
-{
-	unsigned long age;
-	int ret = 0;
-
-	if (atomic_read(&rth->dst.__refcnt))
-		goto out;
-
-	age = jiffies - rth->dst.lastuse;
-	if ((age <= tmo1 && !rt_fast_clean(rth)) ||
-	    (age <= tmo2 && rt_valuable(rth)))
-		goto out;
-	ret = 1;
-out:	return ret;
-}
-
-/* Bits of score are:
- * 31: very valuable
- * 30: not quite useless
- * 29..0: usage counter
- */
-static inline u32 rt_score(struct rtable *rt)
-{
-	u32 score = jiffies - rt->dst.lastuse;
-
-	score = ~score & ~(3<<30);
-
-	if (rt_valuable(rt))
-		score |= (1<<31);
-
-	if (rt_is_output_route(rt) ||
-	    !(rt->rt_flags & (RTCF_BROADCAST|RTCF_MULTICAST|RTCF_LOCAL)))
-		score |= (1<<30);
-
-	return score;
-}
-
-static inline bool rt_caching(const struct net *net)
-{
-	return net->ipv4.current_rt_cache_rebuild_count <=
-		net->ipv4.sysctl_rt_cache_rebuild_count;
-}
-
-static inline bool compare_hash_inputs(const struct rtable *rt1,
-				       const struct rtable *rt2)
-{
-	return ((((__force u32)rt1->rt_key_dst ^ (__force u32)rt2->rt_key_dst) |
-		((__force u32)rt1->rt_key_src ^ (__force u32)rt2->rt_key_src) |
-		(rt1->rt_iif ^ rt2->rt_iif)) == 0);
-}
-
-static inline int compare_keys(struct rtable *rt1, struct rtable *rt2)
-{
-	return (((__force u32)rt1->rt_key_dst ^ (__force u32)rt2->rt_key_dst) |
-		((__force u32)rt1->rt_key_src ^ (__force u32)rt2->rt_key_src) |
-		(rt1->rt_mark ^ rt2->rt_mark) |
-		(rt1->rt_key_tos ^ rt2->rt_key_tos) |
-		(rt1->rt_oif ^ rt2->rt_oif) |
-		(rt1->rt_iif ^ rt2->rt_iif)) == 0;
-}
-
-static inline int compare_netns(struct rtable *rt1, struct rtable *rt2)
-{
-	return net_eq(dev_net(rt1->dst.dev), dev_net(rt2->dst.dev));
-}
-
 static inline int rt_is_expired(struct rtable *rth)
 {
 	return rth->rt_genid != rt_genid(dev_net(rth->dst.dev));
 }
 
 /*
- * Perform a full scan of hash table and free all entries.
- * Can be called by a softirq or a process.
- * In the later case, we want to be reschedule if necessary
- */
-static void rt_do_flush(struct net *net, int process_context)
-{
-	unsigned int i;
-	struct rtable *rth, *next;
-
-	for (i = 0; i <= rt_hash_mask; i++) {
-		struct rtable __rcu **pprev;
-		struct rtable *list;
-
-		if (process_context && need_resched())
-			cond_resched();
-		rth = rcu_dereference_raw(rt_hash_table[i].chain);
-		if (!rth)
-			continue;
-
-		spin_lock_bh(rt_hash_lock_addr(i));
-
-		list = NULL;
-		pprev = &rt_hash_table[i].chain;
-		rth = rcu_dereference_protected(*pprev,
-			lockdep_is_held(rt_hash_lock_addr(i)));
-
-		while (rth) {
-			next = rcu_dereference_protected(rth->dst.rt_next,
-				lockdep_is_held(rt_hash_lock_addr(i)));
-
-			if (!net ||
-			    net_eq(dev_net(rth->dst.dev), net)) {
-				rcu_assign_pointer(*pprev, next);
-				rcu_assign_pointer(rth->dst.rt_next, list);
-				list = rth;
-			} else {
-				pprev = &rth->dst.rt_next;
-			}
-			rth = next;
-		}
-
-		spin_unlock_bh(rt_hash_lock_addr(i));
-
-		for (; list; list = next) {
-			next = rcu_dereference_protected(list->dst.rt_next, 1);
-			rt_free(list);
-		}
-	}
-}
-
-/*
- * While freeing expired entries, we compute average chain length
- * and standard deviation, using fixed-point arithmetic.
- * This to have an estimation of rt_chain_length_max
- *  rt_chain_length_max = max(elasticity, AVG + 4*SD)
- * We use 3 bits for frational part, and 29 (or 61) for magnitude.
- */
-
-#define FRACT_BITS 3
-#define ONE (1UL << FRACT_BITS)
-
-/*
- * Given a hash chain and an item in this hash chain,
- * find if a previous entry has the same hash_inputs
- * (but differs on tos, mark or oif)
- * Returns 0 if an alias is found.
- * Returns ONE if rth has no alias before itself.
- */
-static int has_noalias(const struct rtable *head, const struct rtable *rth)
-{
-	const struct rtable *aux = head;
-
-	while (aux != rth) {
-		if (compare_hash_inputs(aux, rth))
-			return 0;
-		aux = rcu_dereference_protected(aux->dst.rt_next, 1);
-	}
-	return ONE;
-}
-
-/*
  * Perturbation of rt_genid by a small quantity [1..256]
  * Using 8 bits of shuffling ensure we can call rt_cache_invalidate()
  * many times (2^24) without giving recent rt_genid.
@@ -841,346 +488,21 @@ static void rt_cache_invalidate(struct net *net)
 void rt_cache_flush(struct net *net, int delay)
 {
 	rt_cache_invalidate(net);
-	if (delay >= 0)
-		rt_do_flush(net, !in_softirq());
-}
-
-/* Flush previous cache invalidated entries from the cache */
-void rt_cache_flush_batch(struct net *net)
-{
-	rt_do_flush(net, !in_softirq());
-}
-
-static void rt_emergency_hash_rebuild(struct net *net)
-{
-	if (net_ratelimit())
-		printk(KERN_WARNING "Route hash chain too long!\n");
-	rt_cache_invalidate(net);
-}
-
-/*
-   Short description of GC goals.
-
-   We want to build algorithm, which will keep routing cache
-   at some equilibrium point, when number of aged off entries
-   is kept approximately equal to newly generated ones.
-
-   Current expiration strength is variable "expire".
-   We try to adjust it dynamically, so that if networking
-   is idle expires is large enough to keep enough of warm entries,
-   and when load increases it reduces to limit cache size.
- */
-
-static int rt_garbage_collect(struct dst_ops *ops)
-{
-	static unsigned long expire = RT_GC_TIMEOUT;
-	static unsigned long last_gc;
-	static int rover;
-	static int equilibrium;
-	struct rtable *rth;
-	struct rtable __rcu **rthp;
-	unsigned long now = jiffies;
-	int goal;
-	int entries = dst_entries_get_fast(&ipv4_dst_ops);
-
-	/*
-	 * Garbage collection is pretty expensive,
-	 * do not make it too frequently.
-	 */
-
-	RT_CACHE_STAT_INC(gc_total);
-
-	if (now - last_gc < ip_rt_gc_min_interval &&
-	    entries < ip_rt_max_size) {
-		RT_CACHE_STAT_INC(gc_ignored);
-		goto out;
-	}
-
-	entries = dst_entries_get_slow(&ipv4_dst_ops);
-	/* Calculate number of entries, which we want to expire now. */
-	goal = entries - (ip_rt_gc_elasticity << rt_hash_log);
-	if (goal <= 0) {
-		if (equilibrium < ipv4_dst_ops.gc_thresh)
-			equilibrium = ipv4_dst_ops.gc_thresh;
-		goal = entries - equilibrium;
-		if (goal > 0) {
-			equilibrium += min_t(unsigned int, goal >> 1, rt_hash_mask + 1);
-			goal = entries - equilibrium;
-		}
-	} else {
-		/* We are in dangerous area. Try to reduce cache really
-		 * aggressively.
-		 */
-		goal = max_t(unsigned int, goal >> 1, rt_hash_mask + 1);
-		equilibrium = entries - goal;
-	}
-
-	if (now - last_gc >= ip_rt_gc_min_interval)
-		last_gc = now;
-
-	if (goal <= 0) {
-		equilibrium += goal;
-		goto work_done;
-	}
-
-	do {
-		int i, k;
-
-		for (i = rt_hash_mask, k = rover; i >= 0; i--) {
-			unsigned long tmo = expire;
-
-			k = (k + 1) & rt_hash_mask;
-			rthp = &rt_hash_table[k].chain;
-			spin_lock_bh(rt_hash_lock_addr(k));
-			while ((rth = rcu_dereference_protected(*rthp,
-					lockdep_is_held(rt_hash_lock_addr(k)))) != NULL) {
-				if (!rt_is_expired(rth) &&
-					!rt_may_expire(rth, tmo, expire)) {
-					tmo >>= 1;
-					rthp = &rth->dst.rt_next;
-					continue;
-				}
-				*rthp = rth->dst.rt_next;
-				rt_free(rth);
-				goal--;
-			}
-			spin_unlock_bh(rt_hash_lock_addr(k));
-			if (goal <= 0)
-				break;
-		}
-		rover = k;
-
-		if (goal <= 0)
-			goto work_done;
-
-		/* Goal is not achieved. We stop process if:
-
-		   - if expire reduced to zero. Otherwise, expire is halfed.
-		   - if table is not full.
-		   - if we are called from interrupt.
-		   - jiffies check is just fallback/debug loop breaker.
-		     We will not spin here for long time in any case.
-		 */
-
-		RT_CACHE_STAT_INC(gc_goal_miss);
-
-		if (expire == 0)
-			break;
-
-		expire >>= 1;
-
-		if (dst_entries_get_fast(&ipv4_dst_ops) < ip_rt_max_size)
-			goto out;
-	} while (!in_softirq() && time_before_eq(jiffies, now));
-
-	if (dst_entries_get_fast(&ipv4_dst_ops) < ip_rt_max_size)
-		goto out;
-	if (dst_entries_get_slow(&ipv4_dst_ops) < ip_rt_max_size)
-		goto out;
-	if (net_ratelimit())
-		printk(KERN_WARNING "dst cache overflow\n");
-	RT_CACHE_STAT_INC(gc_dst_overflow);
-	return 1;
-
-work_done:
-	expire += ip_rt_gc_min_interval;
-	if (expire > ip_rt_gc_timeout ||
-	    dst_entries_get_fast(&ipv4_dst_ops) < ipv4_dst_ops.gc_thresh ||
-	    dst_entries_get_slow(&ipv4_dst_ops) < ipv4_dst_ops.gc_thresh)
-		expire = ip_rt_gc_timeout;
-out:	return 0;
 }
 
-/*
- * Returns number of entries in a hash chain that have different hash_inputs
- */
-static int slow_chain_length(const struct rtable *head)
-{
-	int length = 0;
-	const struct rtable *rth = head;
-
-	while (rth) {
-		length += has_noalias(head, rth);
-		rth = rcu_dereference_protected(rth->dst.rt_next, 1);
-	}
-	return length >> FRACT_BITS;
-}
-
-static struct rtable *rt_intern_hash(unsigned hash, struct rtable *rt,
-				     struct sk_buff *skb, int ifindex)
+static struct rtable *rt_finalize(struct rtable *rt, struct sk_buff *skb)
 {
-	struct rtable	*rth, *cand;
-	struct rtable __rcu **rthp, **candp;
-	unsigned long	now;
-	u32 		min_score;
-	int		chain_length;
-	int attempts = !in_softirq();
-
-restart:
-	chain_length = 0;
-	min_score = ~(u32)0;
-	cand = NULL;
-	candp = NULL;
-	now = jiffies;
-
-	if (!rt_caching(dev_net(rt->dst.dev))) {
-		/*
-		 * If we're not caching, just tell the caller we
-		 * were successful and don't touch the route.  The
-		 * caller hold the sole reference to the cache entry, and
-		 * it will be released when the caller is done with it.
-		 * If we drop it here, the callers have no way to resolve routes
-		 * when we're not caching.  Instead, just point *rp at rt, so
-		 * the caller gets a single use out of the route
-		 * Note that we do rt_free on this new route entry, so that
-		 * once its refcount hits zero, we are still able to reap it
-		 * (Thanks Alexey)
-		 * Note: To avoid expensive rcu stuff for this uncached dst,
-		 * we set DST_NOCACHE so that dst_release() can free dst without
-		 * waiting a grace period.
-		 */
-
-		rt->dst.flags |= DST_NOCACHE;
-		if (rt->rt_type == RTN_UNICAST || rt_is_output_route(rt)) {
-			int err = arp_bind_neighbour(&rt->dst);
-			if (err) {
-				if (net_ratelimit())
-					printk(KERN_WARNING
-					    "Neighbour table failure & not caching routes.\n");
-				ip_rt_put(rt);
-				return ERR_PTR(err);
-			}
-		}
-
-		goto skip_hashing;
-	}
-
-	rthp = &rt_hash_table[hash].chain;
-
-	spin_lock_bh(rt_hash_lock_addr(hash));
-	while ((rth = rcu_dereference_protected(*rthp,
-			lockdep_is_held(rt_hash_lock_addr(hash)))) != NULL) {
-		if (rt_is_expired(rth)) {
-			*rthp = rth->dst.rt_next;
-			rt_free(rth);
-			continue;
-		}
-		if (compare_keys(rth, rt) && compare_netns(rth, rt)) {
-			/* Put it first */
-			*rthp = rth->dst.rt_next;
-			/*
-			 * Since lookup is lockfree, the deletion
-			 * must be visible to another weakly ordered CPU before
-			 * the insertion at the start of the hash chain.
-			 */
-			rcu_assign_pointer(rth->dst.rt_next,
-					   rt_hash_table[hash].chain);
-			/*
-			 * Since lookup is lockfree, the update writes
-			 * must be ordered for consistency on SMP.
-			 */
-			rcu_assign_pointer(rt_hash_table[hash].chain, rth);
-
-			dst_use(&rth->dst, now);
-			spin_unlock_bh(rt_hash_lock_addr(hash));
-
-			rt_drop(rt);
-			if (skb)
-				skb_dst_set(skb, &rth->dst);
-			return rth;
-		}
-
-		if (!atomic_read(&rth->dst.__refcnt)) {
-			u32 score = rt_score(rth);
-
-			if (score <= min_score) {
-				cand = rth;
-				candp = rthp;
-				min_score = score;
-			}
-		}
-
-		chain_length++;
-
-		rthp = &rth->dst.rt_next;
-	}
-
-	if (cand) {
-		/* ip_rt_gc_elasticity used to be average length of chain
-		 * length, when exceeded gc becomes really aggressive.
-		 *
-		 * The second limit is less certain. At the moment it allows
-		 * only 2 entries per bucket. We will see.
-		 */
-		if (chain_length > ip_rt_gc_elasticity) {
-			*candp = cand->dst.rt_next;
-			rt_free(cand);
-		}
-	} else {
-		if (chain_length > rt_chain_length_max &&
-		    slow_chain_length(rt_hash_table[hash].chain) > rt_chain_length_max) {
-			struct net *net = dev_net(rt->dst.dev);
-			int num = ++net->ipv4.current_rt_cache_rebuild_count;
-			if (!rt_caching(net)) {
-				printk(KERN_WARNING "%s: %d rebuilds is over limit, route caching disabled\n",
-					rt->dst.dev->name, num);
-			}
-			rt_emergency_hash_rebuild(net);
-			spin_unlock_bh(rt_hash_lock_addr(hash));
-
-			hash = rt_hash(rt->rt_key_dst, rt->rt_key_src,
-					ifindex, rt_genid(net));
-			goto restart;
-		}
-	}
-
-	/* Try to bind route to arp only if it is output
-	   route or unicast forwarding path.
-	 */
 	if (rt->rt_type == RTN_UNICAST || rt_is_output_route(rt)) {
 		int err = arp_bind_neighbour(&rt->dst);
 		if (err) {
-			spin_unlock_bh(rt_hash_lock_addr(hash));
-
-			if (err != -ENOBUFS) {
-				rt_drop(rt);
-				return ERR_PTR(err);
-			}
-
-			/* Neighbour tables are full and nothing
-			   can be released. Try to shrink route cache,
-			   it is most likely it holds some neighbour records.
-			 */
-			if (attempts-- > 0) {
-				int saved_elasticity = ip_rt_gc_elasticity;
-				int saved_int = ip_rt_gc_min_interval;
-				ip_rt_gc_elasticity	= 1;
-				ip_rt_gc_min_interval	= 0;
-				rt_garbage_collect(&ipv4_dst_ops);
-				ip_rt_gc_min_interval	= saved_int;
-				ip_rt_gc_elasticity	= saved_elasticity;
-				goto restart;
-			}
-
 			if (net_ratelimit())
-				printk(KERN_WARNING "ipv4: Neighbour table overflow.\n");
-			rt_drop(rt);
-			return ERR_PTR(-ENOBUFS);
+				printk(KERN_WARNING
+				       "Neighbour table failure & not caching routes.\n");
+			ip_rt_put(rt);
+			return ERR_PTR(err);
 		}
 	}
 
-	rt->dst.rt_next = rt_hash_table[hash].chain;
-
-	/*
-	 * Since lookup is lockfree, we must make sure
-	 * previous writes to rt are committed to memory
-	 * before making rt visible to other CPUS.
-	 */
-	rcu_assign_pointer(rt_hash_table[hash].chain, rt);
-
-	spin_unlock_bh(rt_hash_lock_addr(hash));
-
-skip_hashing:
 	if (skb)
 		skb_dst_set(skb, &rt->dst);
 	return rt;
@@ -1248,26 +570,6 @@ void __ip_select_ident(struct iphdr *iph, struct dst_entry *dst, int more)
 }
 EXPORT_SYMBOL(__ip_select_ident);
 
-static void rt_del(unsigned hash, struct rtable *rt)
-{
-	struct rtable __rcu **rthp;
-	struct rtable *aux;
-
-	rthp = &rt_hash_table[hash].chain;
-	spin_lock_bh(rt_hash_lock_addr(hash));
-	ip_rt_put(rt);
-	while ((aux = rcu_dereference_protected(*rthp,
-			lockdep_is_held(rt_hash_lock_addr(hash)))) != NULL) {
-		if (aux == rt || rt_is_expired(aux)) {
-			*rthp = aux->dst.rt_next;
-			rt_free(aux);
-			continue;
-		}
-		rthp = &aux->dst.rt_next;
-	}
-	spin_unlock_bh(rt_hash_lock_addr(hash));
-}
-
 /* called in rcu_read_lock() section */
 void ip_rt_redirect(__be32 old_gw, __be32 daddr, __be32 new_gw,
 		    __be32 saddr, struct net_device *dev)
@@ -1326,10 +628,7 @@ static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst)
 			ip_rt_put(rt);
 			ret = NULL;
 		} else if (rt->rt_flags & RTCF_REDIRECTED) {
-			unsigned hash = rt_hash(rt->rt_key_dst, rt->rt_key_src,
-						rt->rt_oif,
-						rt_genid(dev_net(dst->dev)));
-			rt_del(hash, rt);
+			ip_rt_put(rt);
 			ret = NULL;
 		} else if (rt->peer &&
 			   rt->peer->pmtu_expires &&
@@ -1818,7 +1117,7 @@ static struct rtable *rt_dst_alloc(struct net_device *dev,
 				   bool nopolicy, bool noxfrm)
 {
 	return dst_alloc(&ipv4_dst_ops, dev, 1, -1,
-			 DST_HOST |
+			 DST_HOST | DST_NOCACHE |
 			 (nopolicy ? DST_NOPOLICY : 0) |
 			 (noxfrm ? DST_NOXFRM : 0));
 }
@@ -1827,7 +1126,6 @@ static struct rtable *rt_dst_alloc(struct net_device *dev,
 static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 				u8 tos, struct net_device *dev, int our)
 {
-	unsigned int hash;
 	struct rtable *rth;
 	__be32 spec_dst;
 	struct in_device *in_dev = __in_dev_get_rcu(dev);
@@ -1891,8 +1189,7 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 #endif
 	RT_CACHE_STAT_INC(in_slow_mc);
 
-	hash = rt_hash(daddr, saddr, dev->ifindex, rt_genid(dev_net(dev)));
-	rth = rt_intern_hash(hash, rth, skb, dev->ifindex);
+	rth = rt_finalize(rth, skb);
 	err = 0;
 	if (IS_ERR(rth))
 		err = PTR_ERR(rth);
@@ -2037,7 +1334,6 @@ static int ip_mkroute_input(struct sk_buff *skb,
 {
 	struct rtable* rth = NULL;
 	int err;
-	unsigned hash;
 
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 	if (res->fi && res->fi->fib_nhs > 1)
@@ -2049,10 +1345,7 @@ static int ip_mkroute_input(struct sk_buff *skb,
 	if (err)
 		return err;
 
-	/* put it into the cache */
-	hash = rt_hash(daddr, saddr, fl4->flowi4_iif,
-		       rt_genid(dev_net(rth->dst.dev)));
-	rth = rt_intern_hash(hash, rth, skb, fl4->flowi4_iif);
+	rth = rt_finalize(rth, skb);
 	if (IS_ERR(rth))
 		return PTR_ERR(rth);
 	return 0;
@@ -2078,7 +1371,6 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	unsigned	flags = 0;
 	u32		itag = 0;
 	struct rtable * rth;
-	unsigned	hash;
 	__be32		spec_dst;
 	int		err = -EINVAL;
 	struct net    * net = dev_net(dev);
@@ -2205,8 +1497,7 @@ local_input:
 		rth->dst.error= -err;
 		rth->rt_flags 	&= ~RTCF_LOCAL;
 	}
-	hash = rt_hash(daddr, saddr, fl4.flowi4_iif, rt_genid(net));
-	rth = rt_intern_hash(hash, rth, skb, fl4.flowi4_iif);
+	rth = rt_finalize(rth, skb);
 	err = 0;
 	if (IS_ERR(rth))
 		err = PTR_ERR(rth);
@@ -2253,47 +1544,10 @@ martian_source_keep_err:
 int ip_route_input_common(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 			   u8 tos, struct net_device *dev, bool noref)
 {
-	struct rtable * rth;
-	unsigned	hash;
-	int iif = dev->ifindex;
-	struct net *net;
 	int res;
 
-	net = dev_net(dev);
-
 	rcu_read_lock();
 
-	if (!rt_caching(net))
-		goto skip_cache;
-
-	tos &= IPTOS_RT_MASK;
-	hash = rt_hash(daddr, saddr, iif, rt_genid(net));
-
-	for (rth = rcu_dereference(rt_hash_table[hash].chain); rth;
-	     rth = rcu_dereference(rth->dst.rt_next)) {
-		if ((((__force u32)rth->rt_key_dst ^ (__force u32)daddr) |
-		     ((__force u32)rth->rt_key_src ^ (__force u32)saddr) |
-		     (rth->rt_iif ^ iif) |
-		     rth->rt_oif |
-		     (rth->rt_key_tos ^ tos)) == 0 &&
-		    rth->rt_mark == skb->mark &&
-		    net_eq(dev_net(rth->dst.dev), net) &&
-		    !rt_is_expired(rth)) {
-			if (noref) {
-				dst_use_noref(&rth->dst, jiffies);
-				skb_dst_set_noref(skb, &rth->dst);
-			} else {
-				dst_use(&rth->dst, jiffies);
-				skb_dst_set(skb, &rth->dst);
-			}
-			RT_CACHE_STAT_INC(in_hit);
-			rcu_read_unlock();
-			return 0;
-		}
-		RT_CACHE_STAT_INC(in_hlist_search);
-	}
-
-skip_cache:
 	/* Multicast recognition logic is moved from route cache to here.
 	   The problem was that too many Ethernet cards have broken/missing
 	   hardware multicast filters :-( As result the host on multicasting
@@ -2436,10 +1690,9 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 
 /*
  * Major route resolver routine.
- * called with rcu_read_lock();
  */
 
-static struct rtable *ip_route_output_slow(struct net *net, struct flowi4 *fl4)
+struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
 {
 	struct net_device *dev_out = NULL;
 	u32 tos	= RT_FL_TOS(fl4);
@@ -2619,57 +1872,13 @@ static struct rtable *ip_route_output_slow(struct net *net, struct flowi4 *fl4)
 make_route:
 	rth = __mkroute_output(&res, fl4, orig_daddr, orig_saddr, orig_oif,
 			       dev_out, flags);
-	if (!IS_ERR(rth)) {
-		unsigned int hash;
-
-		hash = rt_hash(orig_daddr, orig_saddr, orig_oif,
-			       rt_genid(dev_net(dev_out)));
-		rth = rt_intern_hash(hash, rth, NULL, orig_oif);
-	}
+	if (!IS_ERR(rth))
+		rth = rt_finalize(rth, NULL);
 
 out:
 	rcu_read_unlock();
 	return rth;
 }
-
-struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *flp4)
-{
-	struct rtable *rth;
-	unsigned int hash;
-
-	if (!rt_caching(net))
-		goto slow_output;
-
-	hash = rt_hash(flp4->daddr, flp4->saddr, flp4->flowi4_oif, rt_genid(net));
-
-	rcu_read_lock_bh();
-	for (rth = rcu_dereference_bh(rt_hash_table[hash].chain); rth;
-		rth = rcu_dereference_bh(rth->dst.rt_next)) {
-		if (rth->rt_key_dst == flp4->daddr &&
-		    rth->rt_key_src == flp4->saddr &&
-		    rt_is_output_route(rth) &&
-		    rth->rt_oif == flp4->flowi4_oif &&
-		    rth->rt_mark == flp4->flowi4_mark &&
-		    !((rth->rt_key_tos ^ flp4->flowi4_tos) &
-			    (IPTOS_RT_MASK | RTO_ONLINK)) &&
-		    net_eq(dev_net(rth->dst.dev), net) &&
-		    !rt_is_expired(rth)) {
-			dst_use(&rth->dst, jiffies);
-			RT_CACHE_STAT_INC(out_hit);
-			rcu_read_unlock_bh();
-			if (!flp4->saddr)
-				flp4->saddr = rth->rt_src;
-			if (!flp4->daddr)
-				flp4->daddr = rth->rt_dst;
-			return rth;
-		}
-		RT_CACHE_STAT_INC(out_hlist_search);
-	}
-	rcu_read_unlock_bh();
-
-slow_output:
-	return ip_route_output_slow(net, flp4);
-}
 EXPORT_SYMBOL_GPL(__ip_route_output_key);
 
 static struct dst_entry *ipv4_blackhole_dst_check(struct dst_entry *dst, u32 cookie)
@@ -2966,43 +2175,6 @@ errout_free:
 
 int ip_rt_dump(struct sk_buff *skb,  struct netlink_callback *cb)
 {
-	struct rtable *rt;
-	int h, s_h;
-	int idx, s_idx;
-	struct net *net;
-
-	net = sock_net(skb->sk);
-
-	s_h = cb->args[0];
-	if (s_h < 0)
-		s_h = 0;
-	s_idx = idx = cb->args[1];
-	for (h = s_h; h <= rt_hash_mask; h++, s_idx = 0) {
-		if (!rt_hash_table[h].chain)
-			continue;
-		rcu_read_lock_bh();
-		for (rt = rcu_dereference_bh(rt_hash_table[h].chain), idx = 0; rt;
-		     rt = rcu_dereference_bh(rt->dst.rt_next), idx++) {
-			if (!net_eq(dev_net(rt->dst.dev), net) || idx < s_idx)
-				continue;
-			if (rt_is_expired(rt))
-				continue;
-			skb_dst_set_noref(skb, &rt->dst);
-			if (rt_fill_info(net, skb, NETLINK_CB(cb->skb).pid,
-					 cb->nlh->nlmsg_seq, RTM_NEWROUTE,
-					 1, NLM_F_MULTI) <= 0) {
-				skb_dst_drop(skb);
-				rcu_read_unlock_bh();
-				goto done;
-			}
-			skb_dst_drop(skb);
-		}
-		rcu_read_unlock_bh();
-	}
-
-done:
-	cb->args[0] = h;
-	cb->args[1] = idx;
 	return skb->len;
 }
 
@@ -3237,16 +2409,6 @@ static __net_initdata struct pernet_operations rt_genid_ops = {
 struct ip_rt_acct __percpu *ip_rt_acct __read_mostly;
 #endif /* CONFIG_IP_ROUTE_CLASSID */
 
-static __initdata unsigned long rhash_entries;
-static int __init set_rhash_entries(char *str)
-{
-	if (!str)
-		return 0;
-	rhash_entries = simple_strtoul(str, &str, 0);
-	return 1;
-}
-__setup("rhash_entries=", set_rhash_entries);
-
 int __init ip_rt_init(void)
 {
 	int rc = 0;
@@ -3269,21 +2431,8 @@ int __init ip_rt_init(void)
 	if (dst_entries_init(&ipv4_dst_blackhole_ops) < 0)
 		panic("IP: failed to allocate ipv4_dst_blackhole_ops counter\n");
 
-	rt_hash_table = (struct rt_hash_bucket *)
-		alloc_large_system_hash("IP route cache",
-					sizeof(struct rt_hash_bucket),
-					rhash_entries,
-					(totalram_pages >= 128 * 1024) ?
-					15 : 17,
-					0,
-					&rt_hash_log,
-					&rt_hash_mask,
-					rhash_entries ? 0 : 512 * 1024);
-	memset(rt_hash_table, 0, (rt_hash_mask + 1) * sizeof(struct rt_hash_bucket));
-	rt_hash_lock_init();
-
-	ipv4_dst_ops.gc_thresh = (rt_hash_mask + 1);
-	ip_rt_max_size = (rt_hash_mask + 1) * 16;
+	ipv4_dst_ops.gc_thresh = ~0;
+	ip_rt_max_size = INT_MAX;
 
 	devinet_init();
 	ip_fib_init();
-- 
1.7.5.2


^ permalink raw reply related

* [PATCH v7 0/4] rtcache removal
From: David Miller @ 2011-05-23 21:33 UTC (permalink / raw)
  To: netdev


Besides dealing with ongoing conflicts, new this time is patch #4,
which deletes rt->rt_src completely.

Enjoy.

^ permalink raw reply

* Re: [PATCH 1/3] vlan: Do not support clearing VLAN_FLAG_REORDER_HDR
From: David Miller @ 2011-05-23 21:20 UTC (permalink / raw)
  To: shemminger
  Cc: greearb, nicolas.2p.debian, ebiederm, jpirko, xiaosuo, netdev,
	kaber, fubar, eric.dumazet, andy, jesse
In-Reply-To: <20110523140048.777fb378@nehalam>

From: Stephen Hemminger <shemminger@linux-foundation.org>
Date: Mon, 23 May 2011 14:00:48 -0700

> IMHO the REORDER_HDR flag was a design mistake. It means supporting
> two different API's for the application. For a packet capture application
> it means the format of the packet will be different based on how
> the VLAN was created. And the vlan was created outside the application.
> 
> If there was a requirement to support both formats, then it should
> be a property of the AF_PACKET socket.  The application can then do
> an setsockopt() to control the format of the data. The issue is
> for things like mmap/AF_PACKET the re-inserted form will be slower
> since data has to be copied.

Indeed, it was a foolish thing to support in the first place.

I guess we couldn't see the hw offloads in the future at that
point.

I'm all for finding a way to deprecate this over time.

^ permalink raw reply

* [GIT PULL] Namespace file descriptors for 2.6.40
From: Eric W. Biederman @ 2011-05-23 21:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Linux Containers, netdev, James Bottomley,
	Geert Uytterhoeven


Please pull the namespace file descriptor git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/linux-2.6-nsfd.git

Because other syscall work has happened in other trees there
are conflicts on alpha and m68k.

For alpha all that is needed is a simple incrementing of the syscall
number in my tree and adding of my syscall to the end of the list.

For m68k please just delete all of the syscall entries the conflict will
add to arch/m68k/kernel/entry_mm.S.  The m68k tree has consolidated
everything in arch/m68k/kernel/syscalltable.S


This tree adds the files /proc/<pid>/ns/net, /proc/<pid>/ns/ipc,
/proc/<pid>/ns/uts that can be opened to refer to the namespaces of a
process at the time those files are opened, and can be bind mounted to
keep the specified namespace alive without a process.

This tree adds the setns system call that can be used to change the
specified namespace of a process to the namespace specified by a system
call.

This tree adds a new rtnetlink attribute that allows for moving a
network device into a network namespace specified by a file descriptor.

Support for the other namespaces is planned but is not ready for 2.6.40.

These changes dramatically simplify what a userspace process has to do
to keep a namespace alive, and to execute system calls in it.

The shortlog:

Stephen Rothwell (1):
      net: fix get_net_ns_by_fd for !CONFIG_NET_NS

Eric W. Biederman (11):
      ns: proc files for namespace naming policy.
      ns: Introduce the setns syscall
      ns proc: Add support for the network namespace.
      ns proc: Add support for the uts namespace
      ns proc: Add support for the ipc namespace
      net: Allow setting the network namespace by fd
      Merge commit '2e7bad5f34b5beed47542490c760ed26574e38ba' into HEAD
      Merge commit '7143b7d41218d4fc2ea33e6056c73609527ae687' into HEAD
      ns: Wire up the setns system call
      ns: Declare sys_setns in syscalls.h
      ns proc: Return -ENOENT for a nonexistent /proc/self/ns/ entry.

The diffstat:

 arch/alpha/include/asm/unistd.h        |    3 +-
 arch/alpha/kernel/systbls.S            |    1 +
 arch/arm/include/asm/unistd.h          |    1 +
 arch/arm/kernel/calls.S                |    1 +
 arch/avr32/include/asm/unistd.h        |    3 +-
 arch/avr32/kernel/syscall_table.S      |    1 +
 arch/blackfin/include/asm/unistd.h     |    3 +-
 arch/blackfin/mach-common/entry.S      |    1 +
 arch/cris/arch-v10/kernel/entry.S      |    1 +
 arch/cris/arch-v32/kernel/entry.S      |    1 +
 arch/cris/include/asm/unistd.h         |    3 +-
 arch/frv/include/asm/unistd.h          |    3 +-
 arch/frv/kernel/entry.S                |    1 +
 arch/h8300/include/asm/unistd.h        |    3 +-
 arch/h8300/kernel/syscalls.S           |    1 +
 arch/ia64/include/asm/unistd.h         |    3 +-
 arch/ia64/kernel/entry.S               |    1 +
 arch/m32r/include/asm/unistd.h         |    3 +-
 arch/m32r/kernel/syscall_table.S       |    1 +
 arch/m68k/include/asm/unistd.h         |    3 +-
 arch/m68k/kernel/syscalltable.S        |    1 +
 arch/microblaze/include/asm/unistd.h   |    3 +-
 arch/microblaze/kernel/syscall_table.S |    1 +
 arch/mips/include/asm/unistd.h         |   15 ++-
 arch/mips/kernel/scall32-o32.S         |    1 +
 arch/mips/kernel/scall64-64.S          |    1 +
 arch/mips/kernel/scall64-n32.S         |    1 +
 arch/mips/kernel/scall64-o32.S         |    1 +
 arch/mn10300/include/asm/unistd.h      |    3 +-
 arch/mn10300/kernel/entry.S            |    1 +
 arch/parisc/include/asm/unistd.h       |    4 +-
 arch/parisc/kernel/syscall_table.S     |    1 +
 arch/powerpc/include/asm/systbl.h      |    1 +
 arch/powerpc/include/asm/unistd.h      |    3 +-
 arch/s390/include/asm/unistd.h         |    3 +-
 arch/s390/kernel/syscalls.S            |    1 +
 arch/sh/include/asm/unistd_32.h        |    3 +-
 arch/sh/include/asm/unistd_64.h        |    3 +-
 arch/sh/kernel/syscalls_32.S           |    1 +
 arch/sh/kernel/syscalls_64.S           |    1 +
 arch/sparc/include/asm/unistd.h        |    3 +-
 arch/sparc/kernel/systbls_32.S         |    2 +-
 arch/sparc/kernel/systbls_64.S         |    4 +-
 arch/x86/ia32/ia32entry.S              |    1 +
 arch/x86/include/asm/unistd_32.h       |    3 +-
 arch/x86/include/asm/unistd_64.h       |    2 +
 arch/x86/kernel/syscall_table_32.S     |    1 +
 arch/xtensa/include/asm/unistd.h       |    4 +-
 fs/proc/Makefile                       |    1 +
 fs/proc/base.c                         |   20 ++--
 fs/proc/inode.c                        |    7 +
 fs/proc/internal.h                     |   18 +++
 fs/proc/namespaces.c                   |  198 ++++++++++++++++++++++++++++++++
 include/asm-generic/unistd.h           |    4 +-
 include/linux/if_link.h                |    1 +
 include/linux/proc_fs.h                |   21 ++++
 include/linux/syscalls.h               |    1 +
 include/net/net_namespace.h            |    1 +
 ipc/namespace.c                        |   37 ++++++
 kernel/nsproxy.c                       |   42 +++++++
 kernel/utsname.c                       |   39 ++++++
 net/core/net_namespace.c               |   65 +++++++++++
 net/core/rtnetlink.c                   |    5 +-
 63 files changed, 525 insertions(+), 42 deletions(-)

Thanks,
Eric

^ permalink raw reply

* [PATCH] sch_sfq: avoid giving spurious NET_XMIT_CN signals
From: Eric Dumazet @ 2011-05-23 21:02 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Patrick McHardy, Jarek Poplawski, Jamal Hadi Salim,
	Stephen Hemminger

While chasing a possible net_sched bug, I found that IP fragments have
litle chance to pass a congestioned SFQ qdisc :

- Say SFQ qdisc is full because one flow is non responsive.
- ip_fragment() wants to send two fragments belonging to an idle flow.
- sfq_enqueue() queues first packet, but see queue limit reached :
- sfq_enqueue() drops one packet from 'big consumer', and returns
NET_XMIT_CN.
- ip_fragment() cancel remaining fragments.

This patch restores fairness, making sure we return NET_XMIT_CN only if
we dropped a packet from the same flow.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Patrick McHardy <kaber@trash.net>
CC: Jarek Poplawski <jarkao2@gmail.com>
CC: Jamal Hadi Salim <hadi@cyberus.ca>
CC: Stephen Hemminger <shemminger@vyatta.com>
---
 net/sched/sch_sfq.c |    8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
index 7ef87f9..b1d00f8 100644
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -361,7 +361,7 @@ sfq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 {
 	struct sfq_sched_data *q = qdisc_priv(sch);
 	unsigned int hash;
-	sfq_index x;
+	sfq_index x, qlen;
 	struct sfq_slot *slot;
 	int uninitialized_var(ret);
 
@@ -405,8 +405,12 @@ sfq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	if (++sch->q.qlen <= q->limit)
 		return NET_XMIT_SUCCESS;
 
+	qlen = slot->qlen;
 	sfq_drop(sch);
-	return NET_XMIT_CN;
+	/* Return Congestion Notification only if we dropped a packet
+	 * from this flow.
+	 */
+	return (qlen != slot->qlen) ? NET_XMIT_CN : NET_XMIT_SUCCESS;
 }
 
 static struct sk_buff *



^ permalink raw reply related

* Re: [PATCH 1/3] vlan: Do not support clearing VLAN_FLAG_REORDER_HDR
From: Stephen Hemminger @ 2011-05-23 21:00 UTC (permalink / raw)
  To: Ben Greear
  Cc: Nicolas de Pesloüan, Eric W. Biederman, David Miller,
	Jiri Pirko, Changli Gao, netdev, kaber, fubar, eric.dumazet, andy,
	Jesse Gross
In-Reply-To: <4DDAC26C.2070601@candelatech.com>

> Well, yes..unless perhaps when the REORDER_HDR flag is enabled.
> 
> As for when the tag is added/removed, as long as we don't end up doing
> a remove and then an add (in software), and as long as the pkt is correct
> going to user-space, it doesn't matter to me.
> 
> It would be lame to remove it in software and then re-add it,
> unless absolutely required for some reason.

IMHO the REORDER_HDR flag was a design mistake. It means supporting
two different API's for the application. For a packet capture application
it means the format of the packet will be different based on how
the VLAN was created. And the vlan was created outside the application.

If there was a requirement to support both formats, then it should
be a property of the AF_PACKET socket.  The application can then do
an setsockopt() to control the format of the data. The issue is
for things like mmap/AF_PACKET the re-inserted form will be slower
since data has to be copied.

^ permalink raw reply

* Re: [PATCH] ehea: Fix multicast registration on semi-promiscuous mode
From: David Miller @ 2011-05-23 20:34 UTC (permalink / raw)
  To: leitao; +Cc: netdev
In-Reply-To: <1306157795-13377-1-git-send-email-leitao@linux.vnet.ibm.com>

From: leitao@linux.vnet.ibm.com
Date: Mon, 23 May 2011 10:36:35 -0300

> Ehea will not register multicast groups in phyp if the physical
> interface is in promiscuous mode. But it should register if the
> logical port is in promiscuous mode, but the physical port is not.
> 
> Ehea physical promiscuous mode is defined by ehea_port->promisc,
> while logical port is defined by IFF_PROMISC.
> 
> So currently, if the user set the interface in promiscuous mode,
> IGMP will not be registred in PHYP, and PHYP will never pass
> the multicast packet to the logical port, which is bad
> 
> So, this patch just fixes it, assuring that we register in phyp
> if the physical port is not on promiscuous mode.
> 
> Signed-off-by: Breno Leitao <leitao@linux.vnet.ibm.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH] ipv6: xfrm6: fix dubious code
From: David Miller @ 2011-05-23 20:33 UTC (permalink / raw)
  To: bhutchings; +Cc: eric.dumazet, netdev
In-Reply-To: <1306164960.3456.42.camel@localhost>

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Mon, 23 May 2011 08:36:00 -0700

> On Mon, 2011-05-23 at 10:42 +0200, Eric Dumazet wrote:
>> net/ipv6/xfrm6_tunnel.c: In function ‘xfrm6_tunnel_rcv’:
>> net/ipv6/xfrm6_tunnel.c:244:53: warning: the omitted middle operand
>> in ?: will always be ‘true’, suggest explicit middle operand
>> 
>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>> ---
>>  net/ipv6/xfrm6_tunnel.c |    2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/net/ipv6/xfrm6_tunnel.c b/net/ipv6/xfrm6_tunnel.c
>> index a6770a0..fb9b0c3 100644
>> --- a/net/ipv6/xfrm6_tunnel.c
>> +++ b/net/ipv6/xfrm6_tunnel.c
>> @@ -241,7 +241,7 @@ static int xfrm6_tunnel_rcv(struct sk_buff *skb)
>>  	__be32 spi;
>>  
>>  	spi = xfrm6_tunnel_spi_lookup(net, (const xfrm_address_t *)&iph->saddr);
>> -	return xfrm6_rcv_spi(skb, IPPROTO_IPV6, spi) > 0 ? : 0;
>> +	return xfrm6_rcv_spi(skb, IPPROTO_IPV6, spi) > 0 ? 1 : 0;
>>  }
>>  
>>  static int xfrm6_tunnel_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
> 
> I suspect that this was intended to return the result of xfrm6_rcv_spi()
> if > 0.

I also suspect this was the intent, but I'm not sure why it matters
at all.

The equivalent code implementing the same operations on the ipv4
side return xfrm4_rcv_spi()'s return value directly.

So we need to either decide that we can do the same thing here on the
ipv6 side, or document exactly why we can't.

^ permalink raw reply

* Re: [PATCH] net: ping: cleanups ping_v4_unhash()
From: David Miller @ 2011-05-23 20:29 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, segoon
In-Reply-To: <1306138981.2869.2.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 23 May 2011 10:23:00 +0200

> net/ipv4/ping.c: In function ‘ping_v4_unhash’:
> net/ipv4/ping.c:140:28: warning: variable ‘hslot’ set but not used
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Vasiliy Kulikov <segoon@openwall.com>

Applied.

^ permalink raw reply

* Re: [PATCH] snap: remove one synchronize_net()
From: David Miller @ 2011-05-23 20:30 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1306140108.2869.7.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 23 May 2011 10:41:48 +0200

> No need to wait for a rcu grace period after list insertion.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH 1/3] vlan: Do not support clearing VLAN_FLAG_REORDER_HDR
From: Ben Greear @ 2011-05-23 20:24 UTC (permalink / raw)
  To: Nicolas de Pesloüan
  Cc: Eric W. Biederman, David Miller, Jiri Pirko, Changli Gao, netdev,
	shemminger, kaber, fubar, eric.dumazet, andy, Jesse Gross
In-Reply-To: <4DDAB72B.9050101@gmail.com>

On 05/23/2011 12:36 PM, Nicolas de Pesloüan wrote:
> Le 23/05/2011 18:33, Ben Greear a écrit :
>> On 05/23/2011 02:00 AM, Eric W. Biederman wrote:
>>> Ben Greear<greearb@candelatech.com> writes:
>>>
>>>> I believe we have been getting tagged VLAN packets properly
>>>> in our test cases. We would not be creating any VLAN devices
>>>> in this case, so perhaps the NIC isn't doing any stripping.
>>>>
>>>> To me, it seems like we should get the fully tagged packet
>>>> without having to go muck with aux-data, though it would
>>>> be fine if it were *also* in aux-data.
>>>
>>> Given that pf_packet is a portable interface that works on multiple OS's
>>> I tend to agree. Certainly my users would be happier if they don't
>>> have to change their code and not having to change tcpdump would
>>> also be nice.
>>>
>>> I'm not certain exactly where in the code it makes sense to put the
>>> vlan header back on for pf_packet sockets. The simplest thing would
>>> be just before we run the socket filter. If we don't do the simplest
>>> thing this raises the question how do we avoid breaking socket filters
>>> that look at the packet data and know there is going to be a vlan
>>> header there.
>>
>> That is going to be tricky, since the VLAN header would adjust
>> offsets and users could be using some filter that uses offsets
>> with no actual mention of VLANs (but expecting it to take
>> the VLAN header into account).
>>
>>> Still the current situation is better than seeing vlan 0 tagged packets
>>> twice.
>>>
>>> My gut feel says if we can cheaply get the socket filters to act like it
>>> sees the vlan tag (where the vlan tag belongs) we should not actually
>>> put the vlan tag back on until we copy the packet to userspace.
>>
>> Maybe keep a count of how many sockets with filters and/or pf_packet
>> sockets are open, and how many things are registered in
>> the 'ptype_all' chain, and only re-add (or never remove) the header if
>> that is > 0?
>>
>> (And, let the bridging and other kernel logic deal with vlans
>> via auxillary methods as well as checking in-line headers.)
>
> Well, this doesn't sound very different from my previous proposal: if a
> protocol handler is registered at parent interface level, can't we
> simply assume this protocol handler expect the raw packet?

Well, yes..unless perhaps when the REORDER_HDR flag is enabled.

As for when the tag is added/removed, as long as we don't end up doing
a remove and then an add (in software), and as long as the pkt is correct
going to user-space, it doesn't matter to me.

It would be lame to remove it in software and then re-add it,
unless absolutely required for some reason.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply

* Re: [PATCH]net:8021q:vlan.c Fix pr_info to read on line in the syslog.
From: Ben Greear @ 2011-05-23 20:16 UTC (permalink / raw)
  To: David Miller; +Cc: joe, justinmattock, netdev, linux-kernel
In-Reply-To: <20110523.155630.1404321441037491870.davem@davemloft.net>

On 05/23/2011 12:56 PM, David Miller wrote:
> From: Joe Perches<joe@perches.com>
> Date: Mon, 23 May 2011 09:23:01 -0700
>
>> though I think that emitting names on startup isn't necessary and
>> this is enough:
>
> Agreed, it's not like Ben and I are Napoleon or anything...

I'm fine with that as well.  It was good for the ego, but I
won't mind not getting emails from folks that accidentally
look at their DSL router logs and assume I wrote (and can debug)
the entire AT&T network :)

Thanks,
Ben

> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox