Netdev List


* [PATCH] 3c503: Fix IRQ probing
From: Ben Hutchings @ 2010-04-04 16:33 UTC (permalink / raw)
  To: Paul Gortmaker; +Cc: netdev, 566522, Piotr Skólski

The driver attempts to select an IRQ for the NIC automatically by
testing which of the supported IRQs are available and then probing
each available IRQ with probe_irq_{on,off}().  There are obvious race
conditions here, besides which:
1. The test for availability is done by passing a NULL handler, which
   now always returns -EINVAL, thus the device cannot be opened:
   <http://bugs.debian.org/566522>
2. probe_irq_off() will report only the first ISA IRQ handled,
   potentially leading to a false negative.

There was another bug that meant it ignored all error codes from
request_irq() except -EBUSY, so it would 'succeed' despite this
(possibly causing conflicts with other ISA devices).  This was fixed
by ab08999d6029bb2c79c16be5405d63d2bedbdfea 'WARNING: some
request_irq() failures ignored in el2_open()', which exposed bug 1.

This patch:
1. Replaces the use of probe_irq_{on,off}() with a real interrupt handler
2. Adds a delay before checking the interrupt-seen flag
3. Disables interrupts on all failure paths
4. Distinguishes error codes from the second request_irq() call,
   consistently with the first
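
The probing order aimed for here can be modelled in user space. This is
only a sketch, not the driver code: struct irq_line, its fields and
pick_irq() are hypothetical stand-ins for request_irq(), the probe
handler and the hardware. It stops at the first line on which the
interrupt is seen and, like the patch, reports every failure as -EAGAIN.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one candidate IRQ line (not a kernel structure). */
struct irq_line {
	int irq;		/* IRQ number; 0 terminates the table */
	int request_result;	/* what request_irq() would return */
	bool fires;		/* whether the probe handler would run */
};

/*
 * Walk a 0-terminated table of candidate lines and return the first
 * usable IRQ number, or -EAGAIN if no line works.
 */
int pick_irq(const struct irq_line *line)
{
	for (; line->irq != 0; line++) {
		if (line->request_result == -EBUSY)
			continue;	/* line in use: try the next one */
		if (line->request_result < 0)
			return -EAGAIN;	/* any other error: give up */
		if (!line->fires)
			continue;	/* handler never ran: wrong line */
		return line->irq;	/* interrupt seen: use this line */
	}
	return -EAGAIN;
}
```

A busy line is skipped, while any other request failure ends the search
immediately instead of being silently ignored.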

Compile-tested only.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
---
 drivers/net/3c503.c |   42 ++++++++++++++++++++++++++++++------------
 1 files changed, 30 insertions(+), 12 deletions(-)

diff --git a/drivers/net/3c503.c b/drivers/net/3c503.c
index 66e0323..b74a0ea 100644
--- a/drivers/net/3c503.c
+++ b/drivers/net/3c503.c
@@ -380,6 +380,12 @@ out:
     return retval;
 }

+static irqreturn_t el2_probe_interrupt(int irq, void *seen)
+{
+	*(bool *)seen = true;
+	return IRQ_HANDLED;
+}
+
 static int
 el2_open(struct net_device *dev)
 {
@@ -391,23 +397,35 @@ el2_open(struct net_device *dev)

 	outb(EGACFR_NORM, E33G_GACFR);	/* Enable RAM and interrupts. */
 	do {
-	    retval = request_irq(*irqp, NULL, 0, "bogus", dev);
-	    if (retval >= 0) {
+		bool seen;
+
+		retval = request_irq(*irqp, el2_probe_interrupt, 0,
+				     dev->name, &seen);
+		if (retval == -EBUSY)
+			continue;
+		if (retval < 0)
+			goto err_disable;
+
 		/* Twinkle the interrupt, and check if it's seen. */
-		unsigned long cookie = probe_irq_on();
+		seen = false;
+		smp_wmb();
 		outb_p(0x04 << ((*irqp == 9) ? 2 : *irqp), E33G_IDCFR);
 		outb_p(0x00, E33G_IDCFR);
-		if (*irqp == probe_irq_off(cookie) &&	/* It's a good IRQ line! */
-		    ((retval = request_irq(dev->irq = *irqp,
-					   eip_interrupt, 0,
-					   dev->name, dev)) == 0))
-		    break;
-	    } else {
-		    if (retval != -EBUSY)
-			    return retval;
-	    }
+		msleep(1);
+		free_irq(*irqp, &seen);
+		if (!seen)
+			continue;
+
+		retval = request_irq(dev->irq = *irqp, eip_interrupt, 0,
+				     dev->name, dev);
+		if (retval == -EBUSY)
+			continue;
+		if (retval < 0)
+			goto err_disable;
 	} while (*++irqp);
+
 	if (*irqp == 0) {
+	err_disable:
 	    outb(EGACFR_IRQOFF, E33G_GACFR);	/* disable interrupts. */
 	    return -EAGAIN;
 	}
-- 
1.7.0.3


* Patch to fix kernel bug 15678 - x25 code accesses fields beyond the end of packet.
From: John Hughes @ 2010-04-04 16:40 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 257 bytes --]

The current X.25 code attempts to decode fields in X.25 packets that are 
not present.  Here is a little patch that checks the received packet 
length before attempting to decode the missing fields.  It also improves 
error checking for malformed packets.
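
The length check can be illustrated with a small user-space sketch. It
is not the kernel code: address_block_len() is a hypothetical helper,
and it assumes that X.25 addresses are packed two 4-bit digits per byte,
so the digit count is rounded up to whole bytes. The attached patch
computes the required length more conservatively, as one byte per digit.

```c
#include <stddef.h>

/*
 * Return the number of bytes occupied by the address block, 0 if the
 * packet has no address block, or -1 if the packet is too short to
 * hold the addresses it claims to hold.
 */
int address_block_len(const unsigned char *data, size_t len)
{
	unsigned int digits;
	size_t needed;

	if (len < 1)
		return 0;	/* nothing left in the packet */

	/* The first byte holds two 4-bit digit counts. */
	digits = (data[0] >> 4) + (data[0] & 0x0f);

	/* Assumption: two digits per byte, rounded up, plus the length byte. */
	needed = 1 + (digits + 1) / 2;

	if (len < needed)
		return -1;	/* truncated packet */

	return (int)needed;
}
```

The point is that the length byte is only trusted after checking that
the packet really contains that many bytes.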


[-- Attachment #2: x25-overrun.patch --]
[-- Type: text/x-patch, Size: 4972 bytes --]

From: John Hughes <john@calva.com>
Subject: Patch to fix bug 15678 - x25 accesses fields beyond end of packet.

Here is a patch to stop X.25 examining fields beyond the end of the packet.

For example, when a simple CALL ACCEPTED was received:

	10 10 0f

x25_parse_facilities was attempting to decode the FACILITIES field, but this
packet contains no facilities field.

Signed-off-by: John Hughes <john@calva.com>

diff --git a/include/net/x25.h b/include/net/x25.h
index 9baa07d..33f67fb 100644
--- a/include/net/x25.h
+++ b/include/net/x25.h
@@ -182,6 +182,10 @@ extern int  sysctl_x25_clear_request_timeout;
 extern int  sysctl_x25_ack_holdback_timeout;
 extern int  sysctl_x25_forward;
 
+extern int x25_parse_address_block(struct sk_buff *skb,
+		struct x25_address *called_addr,
+		struct x25_address *calling_addr);
+
 extern int  x25_addr_ntoa(unsigned char *, struct x25_address *,
 			  struct x25_address *);
 extern int  x25_addr_aton(unsigned char *, struct x25_address *,
diff --git a/net/x25/af_x25.c b/net/x25/af_x25.c
index 9796f3e..fe26c01 100644
--- a/net/x25/af_x25.c
+++ b/net/x25/af_x25.c
@@ -82,6 +82,41 @@ struct compat_x25_subscrip_struct {
 };
 #endif
 
+
+int x25_parse_address_block(struct sk_buff *skb,
+		struct x25_address *called_addr,
+		struct x25_address *calling_addr)
+{
+	unsigned char len;
+	int needed;
+	int rc;
+
+	if (skb->len < 1) {
+		/* packet has no address block */
+		rc = 0;
+		goto empty;
+	}
+
+	len = *skb->data;
+	needed = 1 + (len >> 4) + (len & 0x0f);
+
+	if (skb->len < needed) {
+		/* packet is too short to hold the addresses it claims
+		   to hold */
+		rc = -1;
+		goto empty;
+	}
+
+	return x25_addr_ntoa(skb->data, called_addr, calling_addr);
+
+empty:
+	*called_addr->x25_addr = 0;
+	*calling_addr->x25_addr = 0;
+
+	return rc;
+}
+
+
 int x25_addr_ntoa(unsigned char *p, struct x25_address *called_addr,
 		  struct x25_address *calling_addr)
 {
@@ -921,16 +956,26 @@ int x25_rx_call_request(struct sk_buff *skb, struct x25_neigh *nb,
 	/*
 	 *	Extract the X.25 addresses and convert them to ASCII strings,
 	 *	and remove them.
+	 *
+	 *	Address block is mandatory in call request packets
 	 */
-	addr_len = x25_addr_ntoa(skb->data, &source_addr, &dest_addr);
+	addr_len = x25_parse_address_block(skb, &source_addr, &dest_addr);
+	if (addr_len <= 0)
+		goto out_clear_request;
 	skb_pull(skb, addr_len);
 
 	/*
 	 *	Get the length of the facilities, skip past them for the moment
 	 *	get the call user data because this is needed to determine
 	 *	the correct listener
+	 *
+	 *	Facilities length is mandatory in call request packets
 	 */
+	if (skb->len < 1)
+		goto out_clear_request;
 	len = skb->data[0] + 1;
+	if (skb->len < len)
+		goto out_clear_request;
 	skb_pull(skb,len);
 
 	/*
diff --git a/net/x25/x25_facilities.c b/net/x25/x25_facilities.c
index a21f664..a2765c6 100644
--- a/net/x25/x25_facilities.c
+++ b/net/x25/x25_facilities.c
@@ -35,7 +35,7 @@ int x25_parse_facilities(struct sk_buff *skb, struct x25_facilities *facilities,
 		struct x25_dte_facilities *dte_facs, unsigned long *vc_fac_mask)
 {
 	unsigned char *p = skb->data;
-	unsigned int len = *p++;
+	unsigned int len;
 
 	*vc_fac_mask = 0;
 
@@ -50,6 +50,14 @@ int x25_parse_facilities(struct sk_buff *skb, struct x25_facilities *facilities,
 	memset(dte_facs->called_ae, '\0', sizeof(dte_facs->called_ae));
 	memset(dte_facs->calling_ae, '\0', sizeof(dte_facs->calling_ae));
 
+	if (skb->len < 1)
+		return 0;
+
+	len = *p++;
+
+	if (len >= skb->len)
+		return -1;
+
 	while (len > 0) {
 		switch (*p & X25_FAC_CLASS_MASK) {
 		case X25_FAC_CLASS_A:
@@ -247,6 +255,8 @@ int x25_negotiate_facilities(struct sk_buff *skb, struct sock *sk,
 	memcpy(new, ours, sizeof(*new));
 
 	len = x25_parse_facilities(skb, &theirs, dte, &x25->vc_facil_mask);
+	if (len < 0)
+		return len;
 
 	/*
 	 *	They want reverse charging, we won't accept it.
diff --git a/net/x25/x25_in.c b/net/x25/x25_in.c
index 96d9227..b39072f 100644
--- a/net/x25/x25_in.c
+++ b/net/x25/x25_in.c
@@ -89,6 +89,7 @@ static int x25_queue_rx_frame(struct sock *sk, struct sk_buff *skb, int more)
 static int x25_state1_machine(struct sock *sk, struct sk_buff *skb, int frametype)
 {
 	struct x25_address source_addr, dest_addr;
+	int len;
 
 	switch (frametype) {
 		case X25_CALL_ACCEPTED: {
@@ -106,11 +107,17 @@ static int x25_state1_machine(struct sock *sk, struct sk_buff *skb, int frametyp
 			 *	Parse the data in the frame.
 			 */
 			skb_pull(skb, X25_STD_MIN_LEN);
-			skb_pull(skb, x25_addr_ntoa(skb->data, &source_addr, &dest_addr));
-			skb_pull(skb,
-				 x25_parse_facilities(skb, &x25->facilities,
+
+			len = x25_parse_address_block(skb, &source_addr,
+						&dest_addr);
+			if (len > 0)
+				skb_pull(skb, len);
+
+			len = x25_parse_facilities(skb, &x25->facilities,
 						&x25->dte_facilities,
-						&x25->vc_facil_mask));
+						&x25->vc_facil_mask);
+			if (len > 0)
+				skb_pull(skb, len);
 			/*
 			 *	Copy any Call User Data.
 			 */


* patch to improve x.25 throughput negotiation
From: John Hughes @ 2010-04-04 16:48 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 518 bytes --]

The current X.25 code has some bugs in throughput negotiation:

   1. It negotiates in all cases, although usually there is no need.
   2. It incorrectly attempts to negotiate the throughput class in one
      direction only.  There are separate throughput classes for input
      and output, and if either is negotiated both must be negotiated.

This is bug https://bugzilla.kernel.org/show_bug.cgi?id=15681

This bug was first reported by Daniel Ferenci to the linux-x25 mailing 
list on 6/8/2004, but is still present.
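
The two-direction handling can be sketched in user space as follows.
This is an illustration, not the kernel code: the function names are
hypothetical, and DEFAULT_CLASS is assumed to be 0x0A, the value the
current code tries to negotiate. The low nibble is treated as the
inbound class and the high nibble as the outbound class, as in the
attached patch.

```c
#include <stdbool.h>

#define DEFAULT_CLASS 0x0A	/* assumed default throughput class */

/*
 * Validate a user-supplied throughput byte. Zero means "do not
 * negotiate". If only one direction is given, the other is filled in
 * with the default. Returns false for an out-of-range class.
 */
bool normalise_throughput(unsigned char *tp)
{
	unsigned char out = *tp & 0xf0;
	unsigned char in = *tp & 0x0f;

	if (*tp == 0)
		return true;
	if (!out)
		*tp |= (unsigned char)(DEFAULT_CLASS << 4);
	else if (out < 0x30 || out > 0xD0)
		return false;
	if (!in)
		*tp |= DEFAULT_CLASS;
	else if (in < 0x03 || in > 0x0D)
		return false;
	return true;
}

/*
 * Negotiate each direction separately: take the peer's class when we
 * have no preference or when the peer asks for a lower class.
 */
unsigned char negotiate_throughput(unsigned char ours, unsigned char theirs)
{
	unsigned char result = ours;

	if (!theirs)
		return result;
	if (!(ours & 0x0f) || (theirs & 0x0f) < (ours & 0x0f))
		result = (unsigned char)((result & 0xf0) | (theirs & 0x0f));
	if (!(ours & 0xf0) || (theirs & 0xf0) < (ours & 0xf0))
		result = (unsigned char)((result & 0x0f) | (theirs & 0xf0));
	return result;
}
```

With these rules a request of 0x0A is completed to 0xAA rather than
being sent with a zero class in one direction.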


[-- Attachment #2: throughput.patch --]
[-- Type: text/x-patch, Size: 3294 bytes --]

From: John Hughes <john@calva.com>
Subject: x.25 attempts to negotiate invalid throughput

The current (2.6.34) x.25 code doesn't seem to know that the X.25 throughput
facility includes two values, one for the required throughput outbound, one
for inbound.

This causes it to attempt to negotiate throughput 0x0A, which is throughput
9600 inbound and the illegal value "0" for outbound throughput.

Because of this some X.25 devices (e.g. Cisco 1600) refuse to connect to Linux
X.25.

The following patch fixes this behaviour.  Unless the user specifies a required
throughput it does not attempt to negotiate.  If the user does not specify
a throughput it accepts the suggestion of the remote X.25 system.  If the
user requests a throughput then it validates both the input and output
throughputs and correctly negotiates them with the remote end.

Signed-off-by: John Hughes <john@calva.com>

diff --git a/net/x25/af_x25.c b/net/x25/af_x25.c
index 9796f3e..f391f61 100644
--- a/net/x25/af_x25.c
+++ b/net/x25/af_x25.c
@@ -553,7 +553,8 @@ static int x25_create(struct net *net, struct socket *sock, int protocol,
 	x25->facilities.winsize_out = X25_DEFAULT_WINDOW_SIZE;
 	x25->facilities.pacsize_in  = X25_DEFAULT_PACKET_SIZE;
 	x25->facilities.pacsize_out = X25_DEFAULT_PACKET_SIZE;
-	x25->facilities.throughput  = X25_DEFAULT_THROUGHPUT;
+	x25->facilities.throughput  = 0;	/* by default don't negotiate
+						   throughput */
 	x25->facilities.reverse     = X25_DEFAULT_REVERSE;
 	x25->dte_facilities.calling_len = 0;
 	x25->dte_facilities.called_len = 0;
@@ -1414,9 +1415,20 @@ static int x25_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
 			if (facilities.winsize_in < 1 ||
 			    facilities.winsize_in > 127)
 				break;
-			if (facilities.throughput < 0x03 ||
-			    facilities.throughput > 0xDD)
-				break;
+			if (facilities.throughput) {
+				int out = facilities.throughput & 0xf0;
+				int in  = facilities.throughput & 0x0f;
+				if (!out)
+					facilities.throughput |=
+						X25_DEFAULT_THROUGHPUT << 4;
+				else if (out < 0x30 || out > 0xD0)
+					break;
+				if (!in)
+					facilities.throughput |=
+						X25_DEFAULT_THROUGHPUT;
+				else if (in < 0x03 || in > 0x0D)
+					break;
+			}
 			if (facilities.reverse &&
 				(facilities.reverse & 0x81) != 0x81)
 				break;
diff --git a/net/x25/x25_facilities.c b/net/x25/x25_facilities.c
index a21f664..b447a66 100644
--- a/net/x25/x25_facilities.c
+++ b/net/x25/x25_facilities.c
@@ -259,9 +259,18 @@ int x25_negotiate_facilities(struct sk_buff *skb, struct sock *sk,
 	new->reverse = theirs.reverse;
 
 	if (theirs.throughput) {
-		if (theirs.throughput < ours->throughput) {
-			SOCK_DEBUG(sk, "X.25: throughput negotiated down\n");
-			new->throughput = theirs.throughput;
+		int theirs_in =  theirs.throughput & 0x0f;
+		int theirs_out = theirs.throughput & 0xf0;
+		int ours_in  = ours->throughput & 0x0f;
+		int ours_out = ours->throughput & 0xf0;
+		if (!ours_in || theirs_in < ours_in) {
+			SOCK_DEBUG(sk, "X.25: inbound throughput negotiated\n");
+			new->throughput = (new->throughput & 0xf0) | theirs_in;
+		}
+		if (!ours_out || theirs_out < ours_out) {
+			SOCK_DEBUG(sk,
+				"X.25: outbound throughput negotiated\n");
+			new->throughput = (new->throughput & 0x0f) | theirs_out;
 		}
 	}
 


* Re: small packets sent through ne2k-pci delayed
From: Stephen Hemminger @ 2010-04-04 17:23 UTC (permalink / raw)
  To: Florian Zumbiehl; +Cc: p gortmaker, netdev
In-Reply-To: <27697562.59751270401801591.JavaMail.root@tahiti.vyatta.com>


----- "Florian Zumbiehl" <florz@florz.de> wrote:

> Hi,
> 
> I noticed today that a 2.6.33 kernel with an ne2k-pci card of mine
> transmits small packets (those that result in frames < 61 bytes)
> only with some major delay - or more exactly, it seems that they are
> being transmitted only when the next packet to be transmitted comes
> along.
> 
> Now, this patch seems to fix it for me, but I am not that sure that
> that's
> how it should be fixed:
> 
> di

> I _think_ that this problem did not exist with 2.6.22-rc4 - but I
> didn't
> have a chance to (re-)test that yet, I just think that I did things in
> the
> past (when that kernel was still running on that machine) that would
> imply
> that the problem did not exist at the time ...
> 
> So, any suggestions what I should try, other than re-testing with
> 2.6.22-rc4 (some time soon, I can do that, too, just not now)?
> 
> What might be interesting to know: those 61-byte frames do actually
> arrive
> at the recipient as 61-byte frames ...
> 
> Florian
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

The standard correct way to do this is skb_padto()


* Bug#575970: iproute2: silence errors about kernel missing 6rd on "ip tun show".
From: Alexandre Cassen @ 2010-04-04 20:23 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Andreas Henriksson, 575970, netdev
In-Reply-To: <20100331160642.75766629@s6510>

Hello,

On Wed, 31 Mar 2010, Stephen Hemminger wrote:
> On Wed, 31 Mar 2010 10:08:54 +0200
> Andreas Henriksson <andreas@fatal.se> wrote:
>
>> Hello!
>>
>> As reported in http://bugs.debian.org/575970 there is currently a warning
>> printed for every tunnel when using latest iproute2 on at least <= 2.6.32
>> kernels (missing 6rd?!).
>>
>> The attached patch avoids perror when errno is EINVAL, which I assume
>> is the way to detect missing 6rd support. A better/cleaner
>> method to detect and avoid 6rd when there's no kernel support
>> is more than welcome.
>>
>> Regards,
>> Andreas Henriksson
>>
>
> I will wait (a little while) to see if Alexandre has a preferred alternative.

IMHO, the proper way to detect missing 6rd kernel support is to catch
EINVAL, but not other errno values, since they might be useful in some cases.

Reading the code again, there is also a need to test the tunnel protocol,
since 6rd applies to ipv6/ip tunnels only.

I will send another email with a proposed patch to fix those 2 issues.

Regs,
Alexandre


* [PATCH][iproute2] Detect 6rd kernel missing support / 6rd tunnel scope
From: Alexandre Cassen @ 2010-04-04 20:40 UTC (permalink / raw)
  To: netdev

This patch fixes two issues:

* If the kernel does not support 6rd, the ioctl() call
  returns EINVAL; in that case just skip the perror() call.

* 6rd applies only to ipv6/ip tunnels. Don't try to fetch
  6rd tunnel parms if tunnel protocol != IPPROTO_IPV6.
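
The first point can be sketched outside iproute2 as follows. This is a
minimal illustration, not the patch itself: should_report() and
quiet_ioctl() are hypothetical names, and the sketch saves errno right
after the call so that later calls cannot change it.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/ioctl.h>

/* Decide whether a failed call deserves a diagnostic. */
int should_report(int err, int saved_errno, int skiperr)
{
	return err != 0 && saved_errno != skiperr;
}

/*
 * Hypothetical wrapper: issue an ioctl and stay quiet about one
 * expected errno value. Passing -1 as skiperr reports every failure,
 * since errno is never negative.
 */
int quiet_ioctl(int fd, unsigned long cmd, void *arg, int skiperr)
{
	int err = ioctl(fd, cmd, arg);
	int saved_errno = errno;	/* keep the value set by ioctl() */

	if (should_report(err, saved_errno, skiperr)) {
		errno = saved_errno;
		perror("ioctl");
	}
	return err;
}
```

The caller still sees the failure through the return value; only the
message for the expected errno is suppressed.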

Signed-off-by: Alexandre Cassen <acassen@freebox.fr>
---
 ip/iptunnel.c |    2 +-
 ip/tunnel.c   |   11 ++++++-----
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/ip/iptunnel.c b/ip/iptunnel.c
index 1cd9fbd..3525fbb 100644
--- a/ip/iptunnel.c
+++ b/ip/iptunnel.c
@@ -365,7 +365,7 @@ static void print_tunnel(struct ip_tunnel_parm *p)
 	if (!(p->iph.frag_off&htons(IP_DF)))
 		printf(" nopmtudisc");
 
-	if (!tnl_ioctl_get_6rd(p->name, &ip6rd) && ip6rd.prefixlen) {
+	if (p->iph.protocol == IPPROTO_IPV6 && !tnl_ioctl_get_6rd(p->name, &ip6rd) && ip6rd.prefixlen) {
 		printf(" 6rd-prefix %s/%u ",
 		       inet_ntop(AF_INET6, &ip6rd.prefix, s1, sizeof(s1)),
 		       ip6rd.prefixlen);
diff --git a/ip/tunnel.c b/ip/tunnel.c
index d389e86..6efbd2d 100644
--- a/ip/tunnel.c
+++ b/ip/tunnel.c
@@ -26,6 +26,7 @@
 #include <stdio.h>
 #include <string.h>
 #include <unistd.h>
+#include <errno.h>
 #include <sys/types.h>
 #include <sys/socket.h>
 #include <sys/ioctl.h>
@@ -168,7 +169,7 @@ int tnl_del_ioctl(const char *basedev, const char *name, void *p)
 	return err;
 }
 
-static int tnl_gen_ioctl(int cmd, const char *name, void *p)
+static int tnl_gen_ioctl(int cmd, const char *name, void *p, int skiperr)
 {
 	struct ifreq ifr;
 	int fd;
@@ -178,7 +179,7 @@ static int tnl_gen_ioctl(int cmd, const char *name, void *p)
 	ifr.ifr_ifru.ifru_data = p;
 	fd = socket(preferred_family, SOCK_DGRAM, 0);
 	err = ioctl(fd, cmd, &ifr);
-	if (err)
+	if (err && errno != skiperr)
 		perror("ioctl");
 	close(fd);
 	return err;
@@ -186,15 +187,15 @@ static int tnl_gen_ioctl(int cmd, const char *name, void *p)
 
 int tnl_prl_ioctl(int cmd, const char *name, void *p)
 {
-	return tnl_gen_ioctl(cmd, name, p);
+	return tnl_gen_ioctl(cmd, name, p, -1);
 }
 
 int tnl_6rd_ioctl(int cmd, const char *name, void *p)
 {
-	return tnl_gen_ioctl(cmd, name, p);
+	return tnl_gen_ioctl(cmd, name, p, -1);
 }
 
 int tnl_ioctl_get_6rd(const char *name, void *p)
 {
-	return tnl_gen_ioctl(SIOCGET6RD, name, p);
+	return tnl_gen_ioctl(SIOCGET6RD, name, p, EINVAL);
 }
-- 
1.6.3.3



* [PATCH] ARM: dmabounce: fix partial sync in dma_sync_single_* API
From: FUJITA Tomonori @ 2010-04-05  3:39 UTC (permalink / raw)
  To: linux; +Cc: linux-arm-kernel, netdev, davem, linux-kernel

I don't have ARM hardware that uses dmabounce, so I can't confirm the
problem, but it seems that dmabounce doesn't work for some drivers...

=
From: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Subject: [PATCH] ARM: dmabounce: fix partial sync in dma_sync_single_* API

Some network drivers do a partial sync with
dma_sync_single_for_{device|cpu}, so the dma_addr argument might not be
the same as the one passed into the mapping API.

This makes find_safe_buffer() accept, for
dma_sync_single_for_{device|cpu}, any address that falls inside a
mapped buffer.
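
The lookup for a partial sync can be sketched in user space like this.
It is only an illustration: struct bounce_buf and find_buf() are
hypothetical, simplified stand-ins for the driver's safe_buffer list,
and locking is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a bounce buffer descriptor. */
struct bounce_buf {
	uintptr_t dma_addr;	/* start of the mapped region */
	size_t size;		/* length of the mapped region */
};

/*
 * Find the buffer that addr belongs to. With for_sync set, any address
 * inside a buffer matches; otherwise only the start address matches.
 * On success *off receives the offset of addr within the buffer.
 */
const struct bounce_buf *find_buf(const struct bounce_buf *bufs, size_t n,
				  uintptr_t addr, int for_sync, size_t *off)
{
	for (size_t i = 0; i < n; i++) {
		const struct bounce_buf *b = &bufs[i];
		int hit;

		if (for_sync)
			hit = addr >= b->dma_addr &&
			      addr - b->dma_addr < b->size;
		else
			hit = addr == b->dma_addr;

		if (hit) {
			*off = addr - b->dma_addr;
			return b;
		}
	}
	return NULL;
}
```

Unmapping keeps the exact-match rule, so a stray address is still
rejected there; only the sync paths accept an address in the middle of
a buffer.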

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
---
 arch/arm/common/dmabounce.c |   31 +++++++++++++++++++++----------
 1 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/arch/arm/common/dmabounce.c b/arch/arm/common/dmabounce.c
index cc0a932..87eb160 100644
--- a/arch/arm/common/dmabounce.c
+++ b/arch/arm/common/dmabounce.c
@@ -163,7 +163,8 @@ alloc_safe_buffer(struct dmabounce_device_info *device_info, void *ptr,
 
 /* determine if a buffer is from our "safe" pool */
 static inline struct safe_buffer *
-find_safe_buffer(struct dmabounce_device_info *device_info, dma_addr_t safe_dma_addr)
+find_safe_buffer(struct dmabounce_device_info *device_info, dma_addr_t safe_dma_addr,
+		 int for_sync)
 {
 	struct safe_buffer *b, *rb = NULL;
 	unsigned long flags;
@@ -171,10 +172,17 @@ find_safe_buffer(struct dmabounce_device_info *device_info, dma_addr_t safe_dma_
 	read_lock_irqsave(&device_info->lock, flags);
 
 	list_for_each_entry(b, &device_info->safe_buffers, node)
-		if (b->safe_dma_addr == safe_dma_addr) {
-			rb = b;
-			break;
-		}
+		if (for_sync) {
+			if (b->safe_dma_addr <= safe_dma_addr &&
+			    safe_dma_addr < b->safe_dma_addr + b->size) {
+				rb = b;
+				break;
+			}
+		} else
+			if (b->safe_dma_addr == safe_dma_addr) {
+				rb = b;
+				break;
+			}
 
 	read_unlock_irqrestore(&device_info->lock, flags);
 	return rb;
@@ -205,7 +213,8 @@ free_safe_buffer(struct dmabounce_device_info *device_info, struct safe_buffer *
 /* ************************************************** */
 
 static struct safe_buffer *find_safe_buffer_dev(struct device *dev,
-		dma_addr_t dma_addr, const char *where)
+						dma_addr_t dma_addr, const char *where,
+						int for_sync)
 {
 	if (!dev || !dev->archdata.dmabounce)
 		return NULL;
@@ -216,7 +225,7 @@ static struct safe_buffer *find_safe_buffer_dev(struct device *dev,
 			pr_err("unknown device: Trying to %s invalid mapping\n", where);
 		return NULL;
 	}
-	return find_safe_buffer(dev->archdata.dmabounce, dma_addr);
+	return find_safe_buffer(dev->archdata.dmabounce, dma_addr, for_sync);
 }
 
 static inline dma_addr_t map_single(struct device *dev, void *ptr, size_t size,
@@ -286,7 +295,7 @@ static inline dma_addr_t map_single(struct device *dev, void *ptr, size_t size,
 static inline void unmap_single(struct device *dev, dma_addr_t dma_addr,
 		size_t size, enum dma_data_direction dir)
 {
-	struct safe_buffer *buf = find_safe_buffer_dev(dev, dma_addr, "unmap");
+	struct safe_buffer *buf = find_safe_buffer_dev(dev, dma_addr, "unmap", 0);
 
 	if (buf) {
 		BUG_ON(buf->size != size);
@@ -398,7 +407,7 @@ int dmabounce_sync_for_cpu(struct device *dev, dma_addr_t addr,
 	dev_dbg(dev, "%s(dma=%#x,off=%#lx,sz=%zx,dir=%x)\n",
 		__func__, addr, off, sz, dir);
 
-	buf = find_safe_buffer_dev(dev, addr, __func__);
+	buf = find_safe_buffer_dev(dev, addr, __func__, 1);
 	if (!buf)
 		return 1;
 
@@ -411,6 +420,8 @@ int dmabounce_sync_for_cpu(struct device *dev, dma_addr_t addr,
 	DO_STATS(dev->archdata.dmabounce->bounce_count++);
 
 	if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL) {
+		if (addr != buf->safe_dma_addr)
+			off = addr - buf->safe_dma_addr;
 		dev_dbg(dev, "%s: copy back safe %p to unsafe %p size %d\n",
 			__func__, buf->safe + off, buf->ptr + off, sz);
 		memcpy(buf->ptr + off, buf->safe + off, sz);
@@ -427,7 +438,7 @@ int dmabounce_sync_for_device(struct device *dev, dma_addr_t addr,
 	dev_dbg(dev, "%s(dma=%#x,off=%#lx,sz=%zx,dir=%x)\n",
 		__func__, addr, off, sz, dir);
 
-	buf = find_safe_buffer_dev(dev, addr, __func__);
+	buf = find_safe_buffer_dev(dev, addr, __func__, 1);
 	if (!buf)
 		return 1;
 
-- 
1.7.0



* Re: [PATCH 1/3] IPv6: Generic TTL Security Mechanism (original version)
From: YOSHIFUJI Hideaki @ 2010-04-05  4:48 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: davem, Pekka Savola, Nick Hilliard, netdev, YOSHIFUJI Hideaki
In-Reply-To: <20100403232922.489187907@vyatta.com>

Hi,

(2010/04/04 8:21), Stephen Hemminger wrote:
> The original proposed code; the IPV6 and IPV4 socket options are separate.
> With this method, the server does have to deal with both IPv4 and IPv6
> socket options and the client has to handle the difference for each
> family.

I am for 1/3 (original), not for 2/3 or 3/3, because we should
allow users to set separate values for IPv4 and IPv6, as we
allow them to do for TTL and hop limit themselves.

--yoshfuji



* [PATCH] mac80211: Ensure initializing private mc_list in prepare_multicast().
From: YOSHIFUJI Hideaki @ 2010-04-05  3:59 UTC (permalink / raw)
  To: davem; +Cc: jpirko, yoshfuji, netdev

Fix a kernel panic caused by a NULL pointer dereference in the context of
ieee80211_ops->prepare_multicast().

This bug was introduced by commit 22bedad3c.. ("net: convert
multicast list to list_head").

Call __hw_addr_init() in ieee80211_alloc_hw() to initialize
list_head of private device multicast list, like we do in
bond_init().

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
---
 net/mac80211/main.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/net/mac80211/main.c b/net/mac80211/main.c
index 84ad249..0b82cd2 100644
--- a/net/mac80211/main.c
+++ b/net/mac80211/main.c
@@ -388,6 +388,9 @@ struct ieee80211_hw *ieee80211_alloc_hw(size_t priv_data_len,
 	local->uapsd_max_sp_len = IEEE80211_DEFAULT_MAX_SP_LEN;

 	INIT_LIST_HEAD(&local->interfaces);
+
+	__hw_addr_init(&local->mc_list);
+
 	mutex_init(&local->iflist_mtx);
 	mutex_init(&local->scan_mtx);

-- 
1.5.6.5


* [RFC PATCH 0/2] netdev: implement a buffer to log network driver's information
From: Koki Sanagi @ 2010-04-05  6:50 UTC (permalink / raw)
  To: netdev
  Cc: izumi.taku, kaneshige.kenji, davem, nhorman, jeffrey.t.kirsher,
	jesse.brandeburg, bruce.w.allan, alexander.h.duyck,
	peter.p.waskiewicz.jr, john.ronciak

This patch set implements a buffer for recording network driver messages.
It extends the patch below so that other network drivers can use it.

http://marc.info/?l=e1000-devel&m=126690500618157&w=2

When I investigate network driver trouble, I often want more detailed
debug information: for example, all device register contents (like
ethtool -d) at the time of a device reset, or the movement of the tx/rx
rings when the network stream is not smooth.
Syslog and ftrace can currently record such information, but both have
weak points.
Syslog is not appropriate when messages are large (e.g. recording all
register contents) or numerous (e.g. tracing internal ring movement).
Ftrace is appropriate for recording such messages, but it has only one
buffer for all ftrace events in the kernel. As a result, one adapter's
events may be flushed out by another's (of course, syslog has the same
weak point).

This patch set implements a buffer system that addresses those weak points.

Its features are:

1. Each interface can have its own buffer.
     This prevents recorded data from being pushed out by another
     interface's data.

2. An interface can have multiple buffers.
     This lets one adapter have a separate buffer for each purpose.
     For example, one to trace the driver's internal ring movement and
     another to log error messages and some relevant information.

3. Each buffer can be resized and switched on or off.
     If you use this patch's buffer and register it in the driver's
     probe function, all adapters that use that driver get a buffer of
     the same size. If you want to trace one adapter but not another,
     you can switch the other adapter's buffer off.

Patch 2 is an example implementation for igb.

HOW TO USE:

For usage from the driver side, see patch 2.
The user side is shown below.

# mount -t debugfs nodev /sys/kernel/debug
# ls /sys/kernel/debug/ndrvbuf
igb-trace-0000:03:00.0 igb-trace-0000:03:00.1
# ls /sys/kernel/debug/ndrvbuf/igb-trace-0000:03:00.0
buffer  buffer_size

"buffer" is the output interface. If you set a read_format function in
register_ndrvbuf, it is used. If not, the default read function is used,
which displays the recorded data as a hex dump.

# cat buffer
[  1] 50462.369207: clean_tx qidx=1 ntu=154->156
[  0] 50462.369241: clean_rx qidx=0 ntu=111->112
[  0] 50462.369250: xmit qidx=1 ntu=156->158
[  1] 50462.369256: clean_tx qidx=1 ntu=156->158
[  1] 50462.369342: clean_rx qidx=0 ntu=113->114
[  1] 50462.369439: clean_rx qidx=0 ntu=114->115
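
The "[cpu] seconds.microseconds:" prefix on each line comes from a
nanosecond time stamp (see _typical_ndrvbuf_header() in patch 1). A
user-space sketch of the same formatting, with format_header() as a
hypothetical name and plain 64-bit division in place of do_div():

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Write "[cpu] secs.usecs: " for a time stamp given in nanoseconds.
 * Returns what snprintf() returns: the length the full header needs.
 */
int format_header(char *dst, size_t size, int cpu, uint64_t ts_ns)
{
	/* Round to the nearest microsecond before splitting. */
	uint64_t usecs = (ts_ns + 500) / 1000;

	return snprintf(dst, size, "[%3d] %5" PRIu64 ".%06" PRIu64 ": ",
			cpu, usecs / 1000000, usecs % 1000000);
}
```

For example, a time stamp of 50462369207000 ns on CPU 1 yields the
prefix "[  1] 50462.369207: ".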

"buffer_size" is the size of the buffer per CPU. To change it:
# echo 1000000 > buffer_size
# cat buffer_size
1000000

To disable recording:
# echo 0 > buffer_size

Thanks,
Koki Sanagi.


* [RFC PATCH 1/2] netdev: buffer infrastructure to log network driver's information
From: Koki Sanagi @ 2010-04-05  6:52 UTC (permalink / raw)
  To: netdev
  Cc: izumi.taku, kaneshige.kenji, davem, nhorman, jeffrey.t.kirsher,
	jesse.brandeburg, bruce.w.allan, alexander.h.duyck,
	peter.p.waskiewicz.jr, john.ronciak
In-Reply-To: <4BB98828.5030302@jp.fujitsu.com>

This patch implements the buffer infrastructure under drivers/net.
The buffer records information from network drivers.

Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
---
  drivers/net/Kconfig     |    8 +
  drivers/net/Makefile    |    1 +
  drivers/net/ndrvbuf.c   |  535 +++++++++++++++++++++++++++++++++++++++++++++++
  include/linux/ndrvbuf.h |   57 +++++
  4 files changed, 601 insertions(+), 0 deletions(-)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 7029cd5..98ac929 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -219,6 +219,14 @@ config MII
  	  or internal device.  It is safe to say Y or M here even if your
  	  ethernet card lack MII.
  
+config NDRVBUF
+	tristate "Use buffer for network device driver"
+	default m
+	help
+	  The ndrvbuf is a general-purpose buffer for network drivers. It can
+	  record events or logging information you want to preserve. ring_buffer
+	  is used as the buffer infrastructure.
+
  config MACB
  	tristate "Atmel MACB support"
  	depends on AVR32 || ARCH_AT91SAM9260 || ARCH_AT91SAM9263 || ARCH_AT91SAM9G20 || ARCH_AT91SAM9G45 || ARCH_AT91CAP9
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 4788862..84319ba 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -5,6 +5,7 @@
  obj-$(CONFIG_MII) += mii.o
  obj-$(CONFIG_MDIO) += mdio.o
  obj-$(CONFIG_PHYLIB) += phy/
+obj-$(CONFIG_NDRVBUF) += ndrvbuf.o
  
  obj-$(CONFIG_TI_DAVINCI_EMAC) += davinci_emac.o
  
diff --git a/drivers/net/ndrvbuf.c b/drivers/net/ndrvbuf.c
new file mode 100644
index 0000000..7401334
--- /dev/null
+++ b/drivers/net/ndrvbuf.c
@@ -0,0 +1,535 @@
+#include <linux/ndrvbuf.h>
+
+MODULE_LICENSE("GPL");
+static struct dentry *ndrvbuf_root;
+
+/* This list links all ndrvbuf registered */
+static LIST_HEAD(ndrvbuf_list);
+static DEFINE_MUTEX(ndrvbuf_list_lock);
+static atomic_t ndrvbuf_disabled;
+/* control reading ring_buffer */
+struct ndrvbuf_reader {
+	struct ndrvbuf *ndrvbuf;
+	loff_t idx;
+	struct ring_buffer_iter **iter;
+	char *print;
+	size_t print_len;
+	loff_t print_pos;
+};
+
+/**
+ * typical_ndrvbuf_header - write a typical header
+ * @dst: buffer to write
+ * @cpu: The processor number
+ * @ts: The time stamp
+ *
+ * Write a typical header to buffer
+ **/
+size_t _typical_ndrvbuf_header(void *dst, size_t size, int cpu, u64 ts)
+{
+	int ret;
+	unsigned long usecs_rem, secs;
+
+	ts += 500;
+	do_div(ts, NSEC_PER_USEC);
+	usecs_rem = do_div(ts, USEC_PER_SEC);
+	secs = (unsigned long)ts;
+	ret = snprintf(dst, size, "[%3d] %5lu.%06lu: ", cpu, secs, usecs_rem);
+	return ret;
+}
+EXPORT_SYMBOL(_typical_ndrvbuf_header);
+
+/**
+ * ndrvbuf_default_read - print the event using hex format
+ * @ubuf: The buffer to write
+ * @bufsize: The maximum number of bytes to write
+ * @entry: The address to read
+ * @len: The size of the entry
+ * @cpu: The processor number
+ * @ts: The time stamp
+ *
+ * If read==NULL in register_ndrvbuf, this function is used when printing.
+ **/
+static size_t ndrvbuf_default_read(void *ubuf, size_t bufsize, void *entry,
+				unsigned len, int cpu, u64 ts)
+{
+	int bufpos = 0, lpos = 0;
+
+	bufpos += _typical_ndrvbuf_header(ubuf, bufsize, cpu, ts);
+	bufpos += snprintf(ubuf + bufpos, bufsize - bufpos, "\n");
+	while (lpos < len) {
+		hex_dump_to_buffer(entry + lpos, len - lpos, 16, 4,
+				ubuf + bufpos, bufsize - bufpos, 0);
+		bufpos = strlen(ubuf);
+		bufpos += snprintf(ubuf + bufpos, bufsize - bufpos, "\n");
+		lpos += 16;
+	}
+	return bufpos;
+}
+
+static int is_in_ndrvbuf_list(struct ndrvbuf *ndrvbuf)
+{
+	struct list_head *cur;
+	struct ndrvbuf *cbuf;
+
+	list_for_each(cur, &ndrvbuf_list) {
+		cbuf = list_entry(cur, struct ndrvbuf, list);
+		if (cbuf == ndrvbuf)
+			break;
+	}
+	if (cur == &ndrvbuf_list)
+		return 0;
+	else
+		return 1;
+}
+
+/**
+ * ndrvbuf_buffer_open - open ring_buffer and set iterator to read
+ * @inode: contain ndrvbuf pointer
+ * @file: The pointer to attach the ndrvbuf pointer
+ **/
+static int ndrvbuf_buffer_open(struct inode *inode, struct file *file)
+{
+	struct ndrvbuf *ndrvbuf = inode->i_private;
+	struct ndrvbuf_reader *reader;
+	int cpu;
+
+	reader = kzalloc(sizeof(struct ndrvbuf_reader), GFP_KERNEL);
+	if (!reader) {
+		pr_warning("Could not alloc reader\n");
+		goto out;
+	}
+	reader->iter = kmalloc(sizeof(struct ring_buffer_iter *) * nr_cpu_ids,
+				GFP_KERNEL);
+	if (!reader->iter) {
+		pr_warning("Could not alloc reader->iter\n");
+		goto out_free_reader;
+	}
+	mutex_lock(&ndrvbuf_list_lock);
+	if (!is_in_ndrvbuf_list(ndrvbuf)) {
+		mutex_unlock(&ndrvbuf_list_lock);
+		goto out_free_iter;
+	}
+	atomic_inc(&ndrvbuf->reader_count);
+	mutex_unlock(&ndrvbuf_list_lock);
+	for_each_online_cpu(cpu)
+		reader->iter[cpu] = ring_buffer_read_start(ndrvbuf->rbuf, cpu);
+	reader->print = kmalloc(NDRVBUF_STR_LEN, GFP_KERNEL);
+	if (!reader->print) {
+		pr_warning("Could not alloc reader->print\n");
+		goto out_free_iter;
+	}
+	reader->ndrvbuf = ndrvbuf;
+	file->private_data = reader;
+	return 0;
+
+out_free_iter:
+	kfree(reader->iter);
+out_free_reader:
+	kfree(reader);
+out:
+	return -ENOMEM;
+}
+
+static int ndrvbuf_buffer_release(struct inode *inode, struct file *file)
+{
+	struct ndrvbuf_reader *reader = file->private_data;
+	struct ndrvbuf *ndrvbuf = reader->ndrvbuf;
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		ring_buffer_read_finish(reader->iter[cpu]);
+	}
+	kfree(reader->iter);
+	kfree(reader->print);
+	kfree(reader);
+	atomic_dec(&ndrvbuf->reader_count);
+	return 0;
+}
+
+static loff_t _ndrvbuf_lseek(struct ndrvbuf_reader *reader, loff_t pos)
+{
+	struct ndrvbuf *ndrvbuf = reader->ndrvbuf;
+	struct ring_buffer_iter **iter = reader->iter;
+	struct ring_buffer_event *event;
+	void *entry;
+	int cpu, next_cpu;
+	u64 ts, next_ts;
+	unsigned len;
+	int strlen;
+
+	reader->idx = 0;
+	for_each_online_cpu(cpu) {
+		ring_buffer_iter_reset(iter[cpu]);
+	}
+	while (1) {
+		next_ts = 0;
+		next_cpu = -1;
+		for_each_online_cpu(cpu) {
+			if (!iter[cpu])
+				continue;
+			event = ring_buffer_iter_peek(iter[cpu], &ts);
+			if (!event)
+				continue;
+			if (!next_ts || ts < next_ts) {
+				next_ts = ts;
+				next_cpu = cpu;
+			}
+		}
+		if (next_cpu < 0)
+			return -EINVAL;
+		event = ring_buffer_read(iter[next_cpu], &ts);
+		if (!event)
+			return -EINVAL;
+		entry = ring_buffer_event_data(event);
+		len = ring_buffer_event_length(event);
+		strlen = ndrvbuf->read(reader->print, NDRVBUF_STR_LEN,
+					entry, len, next_cpu, ts);
+		if (reader->idx + strlen > pos)
+			break;
+		reader->idx += strlen;
+	}
+	reader->print_len = strlen;
+	reader->print_pos = reader->idx + strlen - pos;
+	reader->idx = pos;
+	return pos;
+}
+
+/**
+ * ndrvbuf_buffer_read - read an event and write it to the user buffer
+ * @file: the file to read
+ * @buf: the user buffer to write to
+ * @nbytes: the maximum number of bytes to write
+ * @ppos: the position to read from
+ *
+ * Read an event from the ring_buffer and write it to the user buffer.
+ * If a read format function is set, use it.
+ **/
+static ssize_t ndrvbuf_buffer_read(struct file *file, char __user *buf,
+					size_t nbytes, loff_t *ppos)
+{
+	struct ndrvbuf_reader *reader = file->private_data;
+	struct ndrvbuf *ndrvbuf = reader->ndrvbuf;
+	struct ring_buffer_iter **iter = reader->iter;
+	struct ring_buffer_event *event;
+	void *entry;
+	unsigned len;
+	int cpu, next_cpu = -1;
+	u64 ts, next_ts = 0;
+	int ret;
+	size_t copy, strlen;
+	loff_t pos = *ppos;
+
+	if (pos != reader->idx)
+		_ndrvbuf_lseek(reader, pos);
+	if (reader->print_len) {
+		copy = min(reader->print_len, nbytes);
+		ret = copy_to_user(buf, reader->print + reader->print_pos,
+				copy);
+		copy -= ret;
+		reader->print_len -= copy;
+		reader->print_pos += copy;
+		reader->idx += copy;
+		*ppos += copy;
+		return copy;
+	}
+	for_each_online_cpu(cpu) {
+		if (!iter[cpu])
+			continue;
+		event = ring_buffer_iter_peek(iter[cpu], &ts);
+		if (!event)
+			continue;
+		if (!next_ts || ts < next_ts) {
+			next_ts = ts;
+			next_cpu = cpu;
+		}
+	}
+	if (next_cpu < 0)
+		return 0;
+	event = ring_buffer_read(iter[next_cpu], &ts);
+	if (!event)
+		return 0;
+	entry = ring_buffer_event_data(event);
+	len = ring_buffer_event_length(event);
+
+	strlen = ndrvbuf->read(reader->print, NDRVBUF_STR_LEN, entry,
+				len, next_cpu, ts);
+	reader->print_len = strlen;
+	copy = min(strlen, nbytes);
+	ret = copy_to_user(buf, reader->print, copy);
+	copy -= ret;
+	reader->print_len -= copy;
+	reader->print_pos = copy;
+	*ppos = pos + copy;
+	reader->idx = *ppos;
+	return copy;
+}
+
+
+static const struct file_operations buffer_file_ops = {
+	.owner =	THIS_MODULE,
+	.open =		ndrvbuf_buffer_open,
+	.read =		ndrvbuf_buffer_read,
+	.release =	ndrvbuf_buffer_release,
+};
+
+static int ndrvbuf_buffer_size_open(struct inode *inode, struct file *file)
+{
+	file->private_data = inode->i_private;
+	return 0;
+}
+
+static ssize_t ndrvbuf_buffer_size_read(struct file *file, char __user *ubuf,
+					size_t nbytes, loff_t *ppos)
+{
+	struct ndrvbuf *ndrvbuf = file->private_data;
+	int ret;
+	char buf[16];
+	ssize_t size;
+
+	mutex_lock(&ndrvbuf_list_lock);
+	if (!is_in_ndrvbuf_list(ndrvbuf)) {
+		mutex_unlock(&ndrvbuf_list_lock);
+		return -ENODEV;
+
+	}
+	ret = snprintf(buf, 16, "%lu\n", ndrvbuf->buffer_size);
+	size = simple_read_from_buffer(ubuf, nbytes, ppos, buf, ret);
+	mutex_unlock(&ndrvbuf_list_lock);
+	return size;
+}
+
+/**
+ * ndrvbuf_buffer_size_write - resize the ring_buffer
+ * @file: the buffer_size file being written
+ * @ubuf: the user buffer holding the new size as a decimal string
+ * @nbytes: the number of bytes in @ubuf
+ * @ppos: the file position (unused)
+ **/
+static ssize_t ndrvbuf_buffer_size_write(struct file *file,
+			const char __user *ubuf, size_t nbytes, loff_t *ppos)
+{
+	struct ndrvbuf *ndrvbuf = file->private_data;
+	unsigned long cur_size = ndrvbuf->buffer_size;
+	unsigned long new_size;
+	int ret;
+	char buf[64];
+
+	if (nbytes >= sizeof(buf))
+		return -EINVAL;
+	if (copy_from_user(&buf, ubuf, nbytes))
+		return -EFAULT;
+	buf[nbytes] = 0;
+	ret = strict_strtoul(buf, 10, &new_size);
+	if (ret < 0)
+		return ret;
+	if (cur_size == new_size)
+		return nbytes;
+schedule:
+	mutex_lock(&ndrvbuf_list_lock);
+	if (atomic_read(&ndrvbuf->reader_count) != 0) {
+		mutex_unlock(&ndrvbuf_list_lock);
+		schedule();
+		goto schedule;
+	}
+	atomic_inc(&ndrvbuf_disabled);
+	synchronize_sched();
+	if (!is_in_ndrvbuf_list(ndrvbuf)) {
+		atomic_dec(&ndrvbuf_disabled);
+		mutex_unlock(&ndrvbuf_list_lock);
+		return -ENODEV;
+	}
+	ret = ring_buffer_resize(ndrvbuf->rbuf, new_size);
+	if (ret < 0)
+		pr_warning("Could not change buffer size\n");
+	ndrvbuf->buffer_size = new_size;
+	if (new_size)
+		atomic_set(&ndrvbuf->disabled, 0);
+	else
+		atomic_set(&ndrvbuf->disabled, 1);
+	atomic_dec(&ndrvbuf_disabled);
+	mutex_unlock(&ndrvbuf_list_lock);
+
+	return nbytes;
+}
+
+static const struct file_operations buffer_size_file_ops = {
+	.owner =	THIS_MODULE,
+	.open =		ndrvbuf_buffer_size_open,
+	.read =		ndrvbuf_buffer_size_read,
+	.write =	ndrvbuf_buffer_size_write,
+};
+
+static struct ndrvbuf *create_ndrvbuf(const char *name, size_t size)
+{
+	struct ndrvbuf *ndrvbuf;
+	struct dentry *buf_dir;
+
+	ndrvbuf = kzalloc(sizeof(struct ndrvbuf), GFP_KERNEL);
+	if (!ndrvbuf)
+		goto out;
+	strcpy(ndrvbuf->name, name);
+	buf_dir = debugfs_create_dir(name, ndrvbuf_root);
+	if (!buf_dir) {
+		pr_warning("Could not create debugfs dir '%s'\n", name);
+		goto out_free_ndrvbuf;
+	}
+	ndrvbuf->buf_dir = buf_dir;
+	ndrvbuf->buffer_dent = debugfs_create_file("buffer",
+				S_IFREG|S_IRUGO|S_IWUSR,
+				buf_dir, ndrvbuf, &buffer_file_ops);
+	if (!ndrvbuf->buffer_dent) {
+		pr_warning("Could not create debugfs file 'buffer'\n");
+		goto out_rem_buf_dir;
+	}
+	ndrvbuf->buffer_size_dent = debugfs_create_file("buffer_size",
+				S_IFREG|S_IRUGO|S_IWUSR,
+				buf_dir, ndrvbuf, &buffer_size_file_ops);
+	if (!ndrvbuf->buffer_size_dent) {
+		pr_warning("Could not create debugfs file 'buffer_size'\n");
+		goto out_rem_buffer;
+	}
+	ndrvbuf->rbuf = ring_buffer_alloc(size, RB_FL_OVERWRITE);
+	if (!ndrvbuf->rbuf) {
+		pr_warning("Could not alloc ring_buffer for %s\n",
+							name);
+		goto out_rem_buffer_size;
+	}
+	ndrvbuf->buffer_size = size;
+	if (size)
+		atomic_set(&ndrvbuf->disabled, 0);
+	else
+		atomic_set(&ndrvbuf->disabled, 1);
+	return ndrvbuf;
+
+out_rem_buffer_size:
+	debugfs_remove(ndrvbuf->buffer_size_dent);
+out_rem_buffer:
+	debugfs_remove(ndrvbuf->buffer_dent);
+out_rem_buf_dir:
+	debugfs_remove(ndrvbuf->buf_dir);
+out_free_ndrvbuf:
+	kfree(ndrvbuf);
+out:
+	return NULL;
+}
+
+static void remove_ndrvbuf(struct ndrvbuf *ndrvbuf)
+{
+	if (!ndrvbuf)
+		return;
+	debugfs_remove(ndrvbuf->buffer_size_dent);
+	debugfs_remove(ndrvbuf->buffer_dent);
+	debugfs_remove(ndrvbuf->buf_dir);
+	ring_buffer_free(ndrvbuf->rbuf);
+	kfree(ndrvbuf);
+}
+
+/**
+ * register_ndrvbuf - create ndrvbuf struct
+ * @name: The buffer directory name. Usually, it is created under
+ *        /sys/kernel/debug/ndrvbuf
+ * @size: The size of ring_buffer per cpu
+ * @read: The read format function. If NULL, use ndrvbuf_default_read.
+ *
+ * This is called when a network driver wants to set up an ndrvbuf.
+ * If registration fails, NULL is returned.
+ **/
+struct ndrvbuf *_register_ndrvbuf(const char *name, size_t size,
+		size_t (*read)(void *, size_t, void *, size_t, int, u64))
+{
+	struct ndrvbuf *cbuf;
+	struct list_head *cur;
+
+	mutex_lock(&ndrvbuf_list_lock);
+	atomic_inc(&ndrvbuf_disabled);
+	synchronize_sched();
+	list_for_each(cur, &ndrvbuf_list) {
+		cbuf = list_entry(cur, struct ndrvbuf, list);
+		if (!strncmp(cbuf->name, name, NDRVBUF_NAME_SIZE))
+			break;
+	}
+	if (cur != &ndrvbuf_list) {
+		pr_warning("%s already exists\n", name);
+		cbuf = NULL;
+		goto out;
+	}
+	cbuf = create_ndrvbuf(name, size);
+	if (!cbuf)
+		goto out;
+
+	if (read)
+		cbuf->read = read;
+	else
+		cbuf->read = ndrvbuf_default_read;
+	list_add(&cbuf->list, &ndrvbuf_list);
+out:
+	atomic_dec(&ndrvbuf_disabled);
+	mutex_unlock(&ndrvbuf_list_lock);
+	return cbuf;
+}
+EXPORT_SYMBOL(_register_ndrvbuf);
+
+/**
+ * unregister_ndrvbuf - free ndrvbuf struct and its resources
+ * @ndrvbuf: The pointer to the ndrvbuf
+ **/
+void _unregister_ndrvbuf(struct ndrvbuf *ndrvbuf)
+{
+schedule:
+	mutex_lock(&ndrvbuf_list_lock);
+	if (atomic_read(&ndrvbuf->reader_count) != 0) {
+		mutex_unlock(&ndrvbuf_list_lock);
+		schedule();
+		goto schedule;
+	}
+	atomic_inc(&ndrvbuf_disabled);
+	synchronize_sched();
+	list_del(&ndrvbuf->list);
+	atomic_dec(&ndrvbuf_disabled);
+	mutex_unlock(&ndrvbuf_list_lock);
+
+	remove_ndrvbuf(ndrvbuf);
+}
+EXPORT_SYMBOL(_unregister_ndrvbuf);
+
+/**
+ * _write_ndrvbuf - Write a trace event to buffer
+ * @ndrvbuf: buffer to write
+ * @size: The size of trace event
+ * @buf: The pointer to read from
+ **/
+void _write_ndrvbuf(struct ndrvbuf *ndrvbuf, size_t size, void *buf)
+{
+	unsigned long flags;
+
+	preempt_disable();
+	local_irq_save(flags);
+	if (!atomic_read(&ndrvbuf_disabled) && is_in_ndrvbuf_list(ndrvbuf)
+		&& !atomic_read(&ndrvbuf->disabled))
+		ring_buffer_write(ndrvbuf->rbuf, size, buf);
+	local_irq_restore(flags);
+	preempt_enable();
+}
+EXPORT_SYMBOL(_write_ndrvbuf);
+
+static int __init ndrvbuf_init(void)
+{
+	ndrvbuf_root = debugfs_create_dir("ndrvbuf", NULL);
+	if (!ndrvbuf_root) {
+		pr_warning("Could not create debugfs dir 'ndrvbuf'\n");
+		return -ENODEV;
+	}
+	atomic_set(&ndrvbuf_disabled, 0);
+	return 0;
+}
+
+module_init(ndrvbuf_init);
+
+static void __exit ndrvbuf_exit(void)
+{
+	if (ndrvbuf_root)
+		debugfs_remove(ndrvbuf_root);
+}
+
+module_exit(ndrvbuf_exit);
diff --git a/include/linux/ndrvbuf.h b/include/linux/ndrvbuf.h
new file mode 100644
index 0000000..1d56b79
--- /dev/null
+++ b/include/linux/ndrvbuf.h
@@ -0,0 +1,57 @@
+#ifndef _NDRVBUF_H_
+#define _NDRVBUF_H_
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/ring_buffer.h>
+#include <linux/debugfs.h>
+#include <linux/netdevice.h>
+
+#define NDRVBUF_STR_LEN (2*PAGE_SIZE)
+#define NDRVBUF_NAME_SIZE 32
+
+struct ndrvbuf {
+	struct list_head list;
+	char name[NDRVBUF_NAME_SIZE];
+	struct dentry *buf_dir;
+	struct dentry *enable_dent;
+	struct dentry *buffer_size_dent;
+	struct dentry *buffer_dent;
+	unsigned long buffer_size;
+	atomic_t disabled;
+	size_t (*read)(void *ubuf, size_t bufsize, void *entry,
+			size_t size, int cpu, u64 ts);
+	struct ring_buffer *rbuf;
+	atomic_t reader_count;
+};
+
+extern __attribute__((weak)) struct ndrvbuf *_register_ndrvbuf(
+		const char *name, size_t size,
+		size_t (*read)(void *, size_t, void *, size_t, int, u64));
+extern __attribute__((weak)) void _unregister_ndrvbuf(struct ndrvbuf *ndrvbuf);
+extern __attribute__((weak)) void _write_ndrvbuf(struct ndrvbuf *ndrvbuf,
+		size_t size, void *buf);
+extern __attribute__((weak)) size_t _typical_ndrvbuf_header(void *dst,
+		size_t size, int cpu, u64 ts);
+
+#define register_ndrvbuf(name, size, read) ((_register_ndrvbuf) ? \
+	_register_ndrvbuf(name, size, read) : NULL)
+
+#define unregister_ndrvbuf(ndrvbuf)					\
+	do {								\
+		if (_unregister_ndrvbuf)				\
+			_unregister_ndrvbuf(ndrvbuf);			\
+	} while (0)
+
+#define write_ndrvbuf(ndrvbuf, size, buf)				\
+	do {								\
+		if (_write_ndrvbuf)					\
+			_write_ndrvbuf(ndrvbuf, size, buf);		\
+	} while (0)
+
+#define typical_ndrvbuf_header(dst, size, cpu, ts) (			\
+	(_typical_ndrvbuf_header) ? 					\
+	_typical_ndrvbuf_header(dst, size, cpu, ts) : 0)
+
+#endif /* _NDRVBUF_H_ */
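
The reader in the patch above merges the per-CPU iterators by repeatedly
picking the CPU whose next event has the smallest timestamp. Below is a
minimal user-space sketch of that selection step only. struct cpu_queue,
next_cpu_by_ts() and merge_events() are hypothetical stand-ins for the
ring_buffer iterators and are not part of the patch. Unlike the patch, the
sketch tracks "nothing found yet" with next_cpu == -1 instead of treating a
timestamp of 0 as unset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one per-CPU iterator: an array of
 * timestamps (oldest first) plus a read cursor. */
struct cpu_queue {
	const uint64_t *ts;	/* event timestamps, oldest first */
	size_t len;		/* number of events */
	size_t pos;		/* index of the next event to read */
};

/* Return the index of the queue whose next event is oldest, or -1 when
 * every queue is drained. Ties go to the lowest CPU index. */
static int next_cpu_by_ts(const struct cpu_queue *q, int ncpus)
{
	int cpu, next_cpu = -1;
	uint64_t next_ts = 0;

	for (cpu = 0; cpu < ncpus; cpu++) {
		if (q[cpu].pos >= q[cpu].len)
			continue;	/* nothing left on this CPU */
		if (next_cpu < 0 || q[cpu].ts[q[cpu].pos] < next_ts) {
			next_ts = q[cpu].ts[q[cpu].pos];
			next_cpu = cpu;
		}
	}
	return next_cpu;
}

/* Drain all queues in global timestamp order into out[]; returns the
 * number of events written (at most max). */
static size_t merge_events(struct cpu_queue *q, int ncpus,
			   uint64_t *out, size_t max)
{
	size_t n = 0;
	int cpu;

	assert(q && out);
	while (n < max && (cpu = next_cpu_by_ts(q, ncpus)) >= 0)
		out[n++] = q[cpu].ts[q[cpu].pos++];
	return n;
}
```

With two queues holding {10, 40} and {20, 30, 50}, merge_events() yields the
events in the order 10, 20, 30, 40, 50.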



* [RFC PATCH 2/2] netdev: an usage example on igb
From: Koki Sanagi @ 2010-04-05  6:54 UTC (permalink / raw)
  To: netdev
  Cc: izumi.taku, kaneshige.kenji, davem, nhorman, jeffrey.t.kirsher,
	jesse.brandeburg, bruce.w.allan, alexander.h.duyck,
	peter.p.waskiewicz.jr, john.ronciak
In-Reply-To: <4BB98828.5030302@jp.fujitsu.com>

This patch is a usage example of the previous patch's buffer, applied to igb.
The output looks like this:

# cat /sys/kernel/debug/ndrvbuf/igb-trace-0000\:03\:00.0/buffer
[  1] 50462.369207: clean_tx qidx=1 ntu=154->156
[  0] 50462.369241: clean_rx qidx=0 ntu=111->112
[  0] 50462.369250: xmit qidx=1 ntu=156->158
[  1] 50462.369256: clean_tx qidx=1 ntu=156->158
[  1] 50462.369342: clean_rx qidx=0 ntu=113->114
[  1] 50462.369439: clean_rx qidx=0 ntu=114->115

This example prints in its own format because it passes its own print
function (igb_trace_read) when registering:

register_ndrvbuf(bufname, 1000000, igb_trace_read);

If you pass NULL as the third argument, events are printed in the ndrvbuf
default style.
If you pass 0 as the size (second argument), recording is initially disabled
(but a small buffer is still allocated).
Once the size is set to a non-zero value, recording is enabled.
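
To make the line format concrete, here is a minimal user-space sketch of how
such a print function can build one output line. It assumes the timestamp is
in nanoseconds and is rounded to the nearest microsecond. format_header(),
format_event(), struct sample_event and the EV_* values are hypothetical names
for illustration only; they mirror, but are not, the kernel helpers in the
patches.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical event IDs and record layout for this sketch. */
enum { EV_XMIT = 1, EV_CLEAN_TX = 2, EV_CLEAN_RX = 3 };

struct sample_event {
	unsigned short type;
	unsigned char qidx;
	int from;
	int to;
};

/* Write "[cpu] secs.usecs: " for a nanosecond timestamp, rounding to
 * the nearest microsecond. Returns the snprintf() result. */
static int format_header(char *dst, size_t size, int cpu, uint64_t ts_ns)
{
	uint64_t usecs = (ts_ns + 500) / 1000;
	unsigned long secs = (unsigned long)(usecs / 1000000);
	unsigned long usecs_rem = (unsigned long)(usecs % 1000000);

	return snprintf(dst, size, "[%3d] %5lu.%06lu: ", cpu, secs, usecs_rem);
}

/* Write the header followed by the event body. Returns the length the
 * full line needs, which may exceed size if the output was truncated. */
static int format_event(char *dst, size_t size,
			const struct sample_event *ev, int cpu, uint64_t ts_ns)
{
	const char *name;
	int pos = format_header(dst, size, cpu, ts_ns);

	if (pos < 0 || (size_t)pos >= size)
		return pos;	/* error, or no room left for the body */

	switch (ev->type) {
	case EV_XMIT:     name = "xmit";     break;
	case EV_CLEAN_TX: name = "clean_tx"; break;
	case EV_CLEAN_RX: name = "clean_rx"; break;
	default:          name = "unknown";  break;
	}
	pos += snprintf(dst + pos, size - (size_t)pos,
			"%s qidx=%u ntu=%d->%d\n",
			name, (unsigned)ev->qidx, ev->from, ev->to);
	return pos;
}
```

The bounds check after the header matters because snprintf() returns the
length it wanted to write, so adding it to the position without a check can
move the position past the end of the buffer.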

Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
---
  drivers/net/igb/Makefile    |    2 +-
  drivers/net/igb/igb.h       |    1 +
  drivers/net/igb/igb_main.c  |   10 +++++-
  drivers/net/igb/igb_trace.c |   81 +++++++++++++++++++++++++++++++++++++++++++
  drivers/net/igb/igb_trace.h |   21 +++++++++++
  5 files changed, 113 insertions(+), 2 deletions(-)

diff --git a/drivers/net/igb/Makefile b/drivers/net/igb/Makefile
index 8372cb9..286541e 100644
--- a/drivers/net/igb/Makefile
+++ b/drivers/net/igb/Makefile
@@ -33,5 +33,5 @@
  obj-$(CONFIG_IGB) += igb.o
  
  igb-objs := igb_main.o igb_ethtool.o e1000_82575.o \
-	    e1000_mac.o e1000_nvm.o e1000_phy.o e1000_mbx.o
+	    e1000_mac.o e1000_nvm.o e1000_phy.o e1000_mbx.o igb_trace.o
  
diff --git a/drivers/net/igb/igb.h b/drivers/net/igb/igb.h
index a177570..533c5e6 100644
--- a/drivers/net/igb/igb.h
+++ b/drivers/net/igb/igb.h
@@ -315,6 +315,7 @@ struct igb_adapter {
  	unsigned int vfs_allocated_count;
  	struct vf_data_storage *vf_data;
  	u32 rss_queues;
+	struct ndrvbuf *trace;
  };
  
  #define IGB_FLAG_HAS_MSI           (1 << 0)
diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
index 583a21c..f754fd1 100644
--- a/drivers/net/igb/igb_main.c
+++ b/drivers/net/igb/igb_main.c
@@ -48,6 +48,7 @@
  #include <linux/dca.h>
  #endif
  #include "igb.h"
+#include "igb_trace.h"
  
  #define DRV_VERSION "2.1.0-k2"
  char igb_driver_name[] = "igb";
@@ -1412,6 +1413,7 @@ static int __devinit igb_probe(struct pci_dev *pdev,
  	int err, pci_using_dac;
  	u16 eeprom_apme_mask = IGB_EEPROM_APME;
  	u32 part_num;
+	char bufname[NDRVBUF_NAME_SIZE];
  
  	err = pci_enable_device_mem(pdev);
  	if (err)
@@ -1674,6 +1676,8 @@ static int __devinit igb_probe(struct pci_dev *pdev,
  		(adapter->flags & IGB_FLAG_HAS_MSI) ? "MSI" : "legacy",
  		adapter->num_rx_queues, adapter->num_tx_queues);
  
+	sprintf(bufname, "igb-trace-%s", pci_name(pdev));
+	adapter->trace = register_ndrvbuf(bufname, 1000000, igb_trace_read);
  	return 0;
  
  err_register:
@@ -1734,6 +1738,7 @@ static void __devexit igb_remove(struct pci_dev *pdev)
  	 * would have already happened in close and is redundant. */
  	igb_release_hw_control(adapter);
  
+	unregister_ndrvbuf(adapter->trace);
  	unregister_netdev(netdev);
  
  	igb_clear_interrupt_scheme(adapter);
@@ -3814,6 +3819,7 @@ netdev_tx_t igb_xmit_frame_ring_adv(struct sk_buff *skb,
  	}
  
  	igb_tx_queue_adv(tx_ring, tx_flags, count, skb->len, hdr_len);
+	igb_trace_write_xmit(adapter->trace, tx_ring, first);
  
  	/* Make sure there is space in the ring for the next send. */
  	igb_maybe_stop_tx(tx_ring, MAX_SKB_FRAGS + 4);
@@ -5039,7 +5045,7 @@ static bool igb_clean_tx_irq(struct igb_q_vector *q_vector)
  		eop = tx_ring->buffer_info[i].next_to_watch;
  		eop_desc = E1000_TX_DESC_ADV(*tx_ring, eop);
  	}
-
+	igb_trace_write_clean_tx(adapter->trace, tx_ring, i);
  	tx_ring->next_to_clean = i;
  
  	if (unlikely(count &&
@@ -5195,6 +5201,7 @@ static bool igb_clean_rx_irq_adv(struct igb_q_vector *q_vector,
  {
  	struct igb_ring *rx_ring = q_vector->rx_ring;
  	struct net_device *netdev = rx_ring->netdev;
+	struct igb_adapter *adapter = netdev_priv(netdev);
  	struct pci_dev *pdev = rx_ring->pdev;
  	union e1000_adv_rx_desc *rx_desc , *next_rxd;
  	struct igb_buffer *buffer_info , *next_buffer;
@@ -5309,6 +5316,7 @@ next_desc:
  		staterr = le32_to_cpu(rx_desc->wb.upper.status_error);
  	}
  
+	igb_trace_write_clean_rx(adapter->trace, rx_ring, i);
  	rx_ring->next_to_clean = i;
  	cleaned_count = igb_desc_unused(rx_ring);
  
diff --git a/drivers/net/igb/igb_trace.c b/drivers/net/igb/igb_trace.c
new file mode 100644
index 0000000..ab96300
--- /dev/null
+++ b/drivers/net/igb/igb_trace.c
@@ -0,0 +1,81 @@
+#include "igb_trace.h"
+
+struct trace_data_common {
+	unsigned short type;
+	u8 qidx;
+	int from;
+	int to;
+};
+
+void igb_trace_write_xmit(struct ndrvbuf *ndrvbuf,
+				struct igb_ring *tx_ring, int i)
+{
+	struct trace_data_common tdata;
+
+	tdata.type = IGB_TRACE_EVENT_XMIT;
+	tdata.qidx = tx_ring->queue_index;
+	tdata.from = i;
+	tdata.to = tx_ring->next_to_use;
+
+	write_ndrvbuf(ndrvbuf, sizeof(struct trace_data_common), &tdata);
+}
+
+void igb_trace_write_clean_tx(struct ndrvbuf *ndrvbuf,
+				struct igb_ring *tx_ring, int i)
+{
+	struct trace_data_common tdata;
+
+	tdata.type = IGB_TRACE_EVENT_CLEAN_TX;
+	tdata.qidx = tx_ring->queue_index;
+	tdata.from = tx_ring->next_to_clean;
+	tdata.to = i;
+
+	write_ndrvbuf(ndrvbuf, sizeof(struct trace_data_common), &tdata);
+}
+
+void igb_trace_write_clean_rx(struct ndrvbuf *ndrvbuf,
+				struct igb_ring *rx_ring, int i)
+{
+	struct trace_data_common tdata;
+
+	tdata.type = IGB_TRACE_EVENT_CLEAN_RX;
+	tdata.qidx = rx_ring->queue_index;
+	tdata.from = rx_ring->next_to_clean;
+	tdata.to = i;
+
+	write_ndrvbuf(ndrvbuf, sizeof(struct trace_data_common), &tdata);
+}
+
+size_t igb_trace_read(void *ubuf, size_t bufsize, void *entry,
+			size_t size, int cpu, u64 ts)
+{
+	struct trace_data_common *tdata = entry;
+	int bufpos = 0;
+	size_t headlen;
+
+	headlen = typical_ndrvbuf_header(ubuf, bufsize, cpu, ts);
+	bufpos += headlen;
+
+	switch (tdata->type) {
+	case IGB_TRACE_EVENT_XMIT:
+		bufpos += snprintf(ubuf + bufpos, bufsize - bufpos,
+					"xmit qidx=%u ntu=%d->%d\n",
+					tdata->qidx, tdata->from, tdata->to);
+		break;
+	case IGB_TRACE_EVENT_CLEAN_TX:
+		bufpos += snprintf(ubuf + bufpos, bufsize - bufpos,
+					"clean_tx qidx=%u ntu=%d->%d\n",
+					tdata->qidx, tdata->from, tdata->to);
+		break;
+	case IGB_TRACE_EVENT_CLEAN_RX:
+		bufpos += snprintf(ubuf + bufpos, bufsize - bufpos,
+					"clean_rx qidx=%u ntu=%d->%d\n",
+					tdata->qidx, tdata->from, tdata->to);
+		break;
+	default:
+		bufpos += snprintf(ubuf + bufpos, bufsize - bufpos,
+					"EVENT ID:%u is not defined\n",
+					tdata->type);
+	}
+	return bufpos;
+}
diff --git a/drivers/net/igb/igb_trace.h b/drivers/net/igb/igb_trace.h
new file mode 100644
index 0000000..fa130e1
--- /dev/null
+++ b/drivers/net/igb/igb_trace.h
@@ -0,0 +1,21 @@
+#ifndef _IGB_TRACE_H_
+#define _IGB_TRACE_H_
+
+#include <linux/ndrvbuf.h>
+#include "igb.h"
+
+#define IGB_TRACE_EVENT_XMIT		0x01
+#define IGB_TRACE_EVENT_CLEAN_TX	0x02
+#define IGB_TRACE_EVENT_CLEAN_RX	0x03
+
+extern void igb_trace_write_xmit(struct ndrvbuf *ndrvbuf,
+			struct igb_ring *tx_ring, int i);
+extern void igb_trace_write_clean_tx(struct ndrvbuf *ndrvbuf,
+			struct igb_ring *tx_ring, int i);
+extern void igb_trace_write_clean_rx(struct ndrvbuf *ndrvbuf,
+			struct igb_ring *rx_ring, int i);
+
+extern size_t igb_trace_read(void *ubuf, size_t bufsize, void *entry,
+			size_t size, int cpu, u64 ts);
+
+#endif /*_IGB_TRACE_H_*/



* [PATCH 0/4] caching bundles, iteration 4
From: Timo Teras @ 2010-04-05  7:00 UTC (permalink / raw)
  To: netdev; +Cc: Herbert Xu, Timo Teras

Changes since last iteration:
  - wrapped flow_cache_ops* in struct flow_cache_object for
    readability, Herbert's request
  - constified flow_cache_ops, noticed by Eric Dumazet
  - NETDEV_DOWN hook now calls garbage collect function to also process
    per-socket bundles (instead of the plain flow_cache_flush)
  - some coding style fixes

Timo Teras (4):
  flow: virtualize flow cache entry methods
  xfrm: cache bundles instead of policies for outgoing flows
  xfrm: remove policy garbage collection
  flow: delayed deletion of flow cache entries

 include/net/flow.h      |   23 ++-
 include/net/xfrm.h      |   12 +-
 net/core/flow.c         |  201 +++++++-----
 net/ipv4/xfrm4_policy.c |   22 --
 net/ipv6/xfrm6_policy.c |   31 --
 net/xfrm/xfrm_policy.c  |  818 +++++++++++++++++++++++++----------------------
 6 files changed, 584 insertions(+), 523 deletions(-)



* [PATCH 1/4] flow: virtualize flow cache entry methods
From: Timo Teras @ 2010-04-05  7:00 UTC (permalink / raw)
  To: netdev; +Cc: Herbert Xu, Timo Teras
In-Reply-To: <1270450824-2928-1-git-send-email-timo.teras@iki.fi>

This allows the cached object to be validated before it is returned.
It also allows the object to be destroyed properly if the last reference
was held by the flow cache. This is also preparation for caching
bundles in the flow cache.

In return for virtualizing the methods, we save on:
- not having to regenerate the whole flow cache on policy removal:
  each flow matching a killed policy gets refreshed when the getter
  function notices that the policy is dead.
- not having to call flow_cache_flush from policy gc, since the
  flow cache now properly deletes the object if it held any references
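
The pattern can be sketched in user space as follows. This is only an
illustration under simplifying assumptions: struct cache_object, struct
cache_ops, struct policy and cache_hit() are hypothetical names, the third
method is called drop here, and the reference count is a plain int rather
than an atomic_t.

```c
#include <assert.h>
#include <stddef.h>

struct cache_object;

/* Method table the cache calls through; it never sees the real type. */
struct cache_ops {
	struct cache_object *(*get)(struct cache_object *);
	int (*check)(struct cache_object *);
	void (*drop)(struct cache_object *);
};

struct cache_object {
	const struct cache_ops *ops;
};

/* Hypothetical cached type: a refcounted policy with a "dead" flag. */
struct policy {
	int refcnt;
	int dead;
	struct cache_object flo;	/* embedded handle stored in the cache */
};

/* Recover the enclosing policy from its embedded handle. */
#define obj_to_policy(o) \
	((struct policy *)((char *)(o) - offsetof(struct policy, flo)))

/* Take a reference for the caller, or refuse if the policy is dead. */
static struct cache_object *policy_get(struct cache_object *o)
{
	struct policy *pol = obj_to_policy(o);

	if (pol->dead)
		return NULL;	/* stale entry: caller must re-resolve */
	pol->refcnt++;
	return o;
}

/* Report whether the cached entry is still usable. */
static int policy_check(struct cache_object *o)
{
	return !obj_to_policy(o)->dead;
}

/* Release the reference that the cache held. */
static void policy_drop(struct cache_object *o)
{
	obj_to_policy(o)->refcnt--;
}

static const struct cache_ops policy_ops = {
	.get = policy_get,
	.check = policy_check,
	.drop = policy_drop,
};

/* Cache-side hit path: validates and references through the ops only. */
static struct cache_object *cache_hit(struct cache_object *cached)
{
	if (!cached)
		return NULL;
	return cached->ops->get(cached);
}
```

Because the cache only calls through the ops table, a second object type can
later be cached by supplying another table, without changing the cache.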

Signed-off-by: Timo Teras <timo.teras@iki.fi>
---
 include/net/flow.h     |   23 ++++++++--
 include/net/xfrm.h     |    2 +
 net/core/flow.c        |  117 ++++++++++++++++++++++++------------------------
 net/xfrm/xfrm_policy.c |  112 ++++++++++++++++++++++++++++++---------------
 4 files changed, 154 insertions(+), 100 deletions(-)

diff --git a/include/net/flow.h b/include/net/flow.h
index 809970b..bb08692 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -86,11 +86,26 @@ struct flowi {
 
 struct net;
 struct sock;
-typedef int (*flow_resolve_t)(struct net *net, struct flowi *key, u16 family,
-			      u8 dir, void **objp, atomic_t **obj_refp);
+struct flow_cache_ops;
+
+struct flow_cache_object {
+	const struct flow_cache_ops *ops;
+};
+
+struct flow_cache_ops {
+	struct flow_cache_object *(*get)(struct flow_cache_object *);
+	int (*check)(struct flow_cache_object *);
+	void (*delete)(struct flow_cache_object *);
+};
+
+typedef struct flow_cache_object *(*flow_resolve_t)(
+		struct net *net, struct flowi *key, u16 family,
+		u8 dir, struct flow_cache_object *oldobj, void *ctx);
+
+extern struct flow_cache_object *flow_cache_lookup(
+		struct net *net, struct flowi *key, u16 family,
+		u8 dir, flow_resolve_t resolver, void *ctx);
 
-extern void *flow_cache_lookup(struct net *net, struct flowi *key, u16 family,
-			       u8 dir, flow_resolve_t resolver);
 extern void flow_cache_flush(void);
 extern atomic_t flow_cache_genid;
 
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index d74e080..35396e2 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -19,6 +19,7 @@
 #include <net/route.h>
 #include <net/ipv6.h>
 #include <net/ip6_fib.h>
+#include <net/flow.h>
 
 #include <linux/interrupt.h>
 
@@ -481,6 +482,7 @@ struct xfrm_policy {
 	atomic_t		refcnt;
 	struct timer_list	timer;
 
+	struct flow_cache_object flo;
 	u32			priority;
 	u32			index;
 	struct xfrm_mark	mark;
diff --git a/net/core/flow.c b/net/core/flow.c
index 1d27ca6..15151f5 100644
--- a/net/core/flow.c
+++ b/net/core/flow.c
@@ -26,17 +26,16 @@
 #include <linux/security.h>
 
 struct flow_cache_entry {
-	struct flow_cache_entry	*next;
-	u16			family;
-	u8			dir;
-	u32			genid;
-	struct flowi		key;
-	void			*object;
-	atomic_t		*object_ref;
+	struct flow_cache_entry		*next;
+	u16				family;
+	u8				dir;
+	u32				genid;
+	struct flowi			key;
+	struct flow_cache_object	*object;
 };
 
 struct flow_cache_percpu {
-	struct flow_cache_entry **	hash_table;
+	struct flow_cache_entry		**hash_table;
 	int				hash_count;
 	u32				hash_rnd;
 	int				hash_rnd_recalc;
@@ -44,7 +43,7 @@ struct flow_cache_percpu {
 };
 
 struct flow_flush_info {
-	struct flow_cache *		cache;
+	struct flow_cache		*cache;
 	atomic_t			cpuleft;
 	struct completion		completion;
 };
@@ -52,7 +51,7 @@ struct flow_flush_info {
 struct flow_cache {
 	u32				hash_shift;
 	unsigned long			order;
-	struct flow_cache_percpu *	percpu;
+	struct flow_cache_percpu	*percpu;
 	struct notifier_block		hotcpu_notifier;
 	int				low_watermark;
 	int				high_watermark;
@@ -78,12 +77,21 @@ static void flow_cache_new_hashrnd(unsigned long arg)
 	add_timer(&fc->rnd_timer);
 }
 
+static int flow_entry_valid(struct flow_cache_entry *fle)
+{
+	if (atomic_read(&flow_cache_genid) != fle->genid)
+		return 0;
+	if (fle->object && !fle->object->ops->check(fle->object))
+		return 0;
+	return 1;
+}
+
 static void flow_entry_kill(struct flow_cache *fc,
 			    struct flow_cache_percpu *fcp,
 			    struct flow_cache_entry *fle)
 {
 	if (fle->object)
-		atomic_dec(fle->object_ref);
+		fle->object->ops->delete(fle->object);
 	kmem_cache_free(flow_cachep, fle);
 	fcp->hash_count--;
 }
@@ -96,16 +104,18 @@ static void __flow_cache_shrink(struct flow_cache *fc,
 	int i;
 
 	for (i = 0; i < flow_cache_hash_size(fc); i++) {
-		int k = 0;
+		int saved = 0;
 
 		flp = &fcp->hash_table[i];
-		while ((fle = *flp) != NULL && k < shrink_to) {
-			k++;
-			flp = &fle->next;
-		}
 		while ((fle = *flp) != NULL) {
-			*flp = fle->next;
-			flow_entry_kill(fc, fcp, fle);
+			if (saved < shrink_to &&
+			    flow_entry_valid(fle)) {
+				saved++;
+				flp = &fle->next;
+			} else {
+				*flp = fle->next;
+				flow_entry_kill(fc, fcp, fle);
+			}
 		}
 	}
 }
@@ -166,18 +176,21 @@ static int flow_key_compare(struct flowi *key1, struct flowi *key2)
 	return 0;
 }
 
-void *flow_cache_lookup(struct net *net, struct flowi *key, u16 family, u8 dir,
-			flow_resolve_t resolver)
+struct flow_cache_object *
+flow_cache_lookup(struct net *net, struct flowi *key, u16 family, u8 dir,
+		  flow_resolve_t resolver, void *ctx)
 {
 	struct flow_cache *fc = &flow_cache_global;
 	struct flow_cache_percpu *fcp;
 	struct flow_cache_entry *fle, **head;
+	struct flow_cache_object *flo;
 	unsigned int hash;
 
 	local_bh_disable();
 	fcp = per_cpu_ptr(fc->percpu, smp_processor_id());
 
 	fle = NULL;
+	flo = NULL;
 	/* Packet really early in init?  Making flow_cache_init a
 	 * pre-smp initcall would solve this.  --RR */
 	if (!fcp->hash_table)
@@ -185,24 +198,14 @@ void *flow_cache_lookup(struct net *net, struct flowi *key, u16 family, u8 dir,
 
 	if (fcp->hash_rnd_recalc)
 		flow_new_hash_rnd(fc, fcp);
-	hash = flow_hash_code(fc, fcp, key);
 
+	hash = flow_hash_code(fc, fcp, key);
 	head = &fcp->hash_table[hash];
 	for (fle = *head; fle; fle = fle->next) {
 		if (fle->family == family &&
 		    fle->dir == dir &&
-		    flow_key_compare(key, &fle->key) == 0) {
-			if (fle->genid == atomic_read(&flow_cache_genid)) {
-				void *ret = fle->object;
-
-				if (ret)
-					atomic_inc(fle->object_ref);
-				local_bh_enable();
-
-				return ret;
-			}
+		    flow_key_compare(key, &fle->key) == 0)
 			break;
-		}
 	}
 
 	if (!fle) {
@@ -219,33 +222,32 @@ void *flow_cache_lookup(struct net *net, struct flowi *key, u16 family, u8 dir,
 			fle->object = NULL;
 			fcp->hash_count++;
 		}
+	} else if (fle->genid == atomic_read(&flow_cache_genid)) {
+		flo = fle->object;
+		if (!flo)
+			goto ret_object;
+		flo = flo->ops->get(flo);
+		if (flo)
+			goto ret_object;
 	}
 
 nocache:
-	{
-		int err;
-		void *obj;
-		atomic_t *obj_ref;
-
-		err = resolver(net, key, family, dir, &obj, &obj_ref);
-
-		if (fle && !err) {
-			fle->genid = atomic_read(&flow_cache_genid);
-
-			if (fle->object)
-				atomic_dec(fle->object_ref);
-
-			fle->object = obj;
-			fle->object_ref = obj_ref;
-			if (obj)
-				atomic_inc(fle->object_ref);
+	flo = resolver(net, key, family, dir, fle ? fle->object : NULL, ctx);
+	if (fle) {
+		fle->genid = atomic_read(&flow_cache_genid);
+		if (IS_ERR(flo)) {
+			fle->genid--;
+			fle->object = NULL;
+		} else {
+			fle->object = flo;
 		}
-		local_bh_enable();
-
-		if (err)
-			obj = ERR_PTR(err);
-		return obj;
+	} else {
+		if (flo && !IS_ERR(flo))
+			flo->ops->delete(flo);
 	}
+ret_object:
+	local_bh_enable();
+	return flo;
 }
 
 static void flow_cache_flush_tasklet(unsigned long data)
@@ -261,13 +263,12 @@ static void flow_cache_flush_tasklet(unsigned long data)
 
 		fle = fcp->hash_table[i];
 		for (; fle; fle = fle->next) {
-			unsigned genid = atomic_read(&flow_cache_genid);
-
-			if (!fle->object || fle->genid == genid)
+			if (flow_entry_valid(fle))
 				continue;
 
+			if (fle->object)
+				fle->object->ops->delete(fle->object);
 			fle->object = NULL;
-			atomic_dec(fle->object_ref);
 		}
 	}
 
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 82789cf..7722bae 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -216,6 +216,35 @@ expired:
 	xfrm_pol_put(xp);
 }
 
+static struct flow_cache_object *xfrm_policy_flo_get(struct flow_cache_object *flo)
+{
+	struct xfrm_policy *pol = container_of(flo, struct xfrm_policy, flo);
+
+	if (unlikely(pol->walk.dead))
+		flo = NULL;
+	else
+		xfrm_pol_hold(pol);
+
+	return flo;
+}
+
+static int xfrm_policy_flo_check(struct flow_cache_object *flo)
+{
+	struct xfrm_policy *pol = container_of(flo, struct xfrm_policy, flo);
+
+	return !pol->walk.dead;
+}
+
+static void xfrm_policy_flo_delete(struct flow_cache_object *flo)
+{
+	xfrm_pol_put(container_of(flo, struct xfrm_policy, flo));
+}
+
+static const struct flow_cache_ops xfrm_policy_fc_ops = {
+	.get = xfrm_policy_flo_get,
+	.check = xfrm_policy_flo_check,
+	.delete = xfrm_policy_flo_delete,
+};
 
 /* Allocate xfrm_policy. Not used here, it is supposed to be used by pfkeyv2
  * SPD calls.
@@ -236,6 +265,7 @@ struct xfrm_policy *xfrm_policy_alloc(struct net *net, gfp_t gfp)
 		atomic_set(&policy->refcnt, 1);
 		setup_timer(&policy->timer, xfrm_policy_timer,
 				(unsigned long)policy);
+		policy->flo.ops = &xfrm_policy_fc_ops;
 	}
 	return policy;
 }
@@ -269,9 +299,6 @@ static void xfrm_policy_gc_kill(struct xfrm_policy *policy)
 	if (del_timer(&policy->timer))
 		atomic_dec(&policy->refcnt);
 
-	if (atomic_read(&policy->refcnt) > 1)
-		flow_cache_flush();
-
 	xfrm_pol_put(policy);
 }
 
@@ -661,10 +688,8 @@ struct xfrm_policy *xfrm_policy_bysel_ctx(struct net *net, u32 mark, u8 type,
 	}
 	write_unlock_bh(&xfrm_policy_lock);
 
-	if (ret && delete) {
-		atomic_inc(&flow_cache_genid);
+	if (ret && delete)
 		xfrm_policy_kill(ret);
-	}
 	return ret;
 }
 EXPORT_SYMBOL(xfrm_policy_bysel_ctx);
@@ -703,10 +728,8 @@ struct xfrm_policy *xfrm_policy_byid(struct net *net, u32 mark, u8 type,
 	}
 	write_unlock_bh(&xfrm_policy_lock);
 
-	if (ret && delete) {
-		atomic_inc(&flow_cache_genid);
+	if (ret && delete)
 		xfrm_policy_kill(ret);
-	}
 	return ret;
 }
 EXPORT_SYMBOL(xfrm_policy_byid);
@@ -822,7 +845,6 @@ int xfrm_policy_flush(struct net *net, u8 type, struct xfrm_audit *audit_info)
 	}
 	if (!cnt)
 		err = -ESRCH;
-	atomic_inc(&flow_cache_genid);
 out:
 	write_unlock_bh(&xfrm_policy_lock);
 	return err;
@@ -976,32 +998,35 @@ fail:
 	return ret;
 }
 
-static int xfrm_policy_lookup(struct net *net, struct flowi *fl, u16 family,
-			      u8 dir, void **objp, atomic_t **obj_refp)
+static struct flow_cache_object *
+xfrm_policy_lookup(struct net *net, struct flowi *fl, u16 family,
+		   u8 dir, struct flow_cache_object *old_obj, void *ctx)
 {
 	struct xfrm_policy *pol;
-	int err = 0;
+
+	if (old_obj)
+		xfrm_pol_put(container_of(old_obj, struct xfrm_policy, flo));
 
 #ifdef CONFIG_XFRM_SUB_POLICY
 	pol = xfrm_policy_lookup_bytype(net, XFRM_POLICY_TYPE_SUB, fl, family, dir);
-	if (IS_ERR(pol)) {
-		err = PTR_ERR(pol);
-		pol = NULL;
-	}
-	if (pol || err)
-		goto end;
+	if (IS_ERR(pol))
+		return ERR_CAST(pol);
+	if (pol)
+		goto found;
 #endif
 	pol = xfrm_policy_lookup_bytype(net, XFRM_POLICY_TYPE_MAIN, fl, family, dir);
-	if (IS_ERR(pol)) {
-		err = PTR_ERR(pol);
-		pol = NULL;
-	}
-#ifdef CONFIG_XFRM_SUB_POLICY
-end:
-#endif
-	if ((*objp = (void *) pol) != NULL)
-		*obj_refp = &pol->refcnt;
-	return err;
+	if (IS_ERR(pol))
+		return ERR_CAST(pol);
+	if (pol)
+		goto found;
+	return NULL;
+
+found:
+	/* Resolver returns two references:
+	 * one for cache and one for caller of flow_cache_lookup() */
+	xfrm_pol_hold(pol);
+
+	return &pol->flo;
 }
 
 static inline int policy_to_flow_dir(int dir)
@@ -1091,8 +1116,6 @@ int xfrm_policy_delete(struct xfrm_policy *pol, int dir)
 	pol = __xfrm_policy_unlink(pol, dir);
 	write_unlock_bh(&xfrm_policy_lock);
 	if (pol) {
-		if (dir < XFRM_POLICY_MAX)
-			atomic_inc(&flow_cache_genid);
 		xfrm_policy_kill(pol);
 		return 0;
 	}
@@ -1578,18 +1601,24 @@ restart:
 	}
 
 	if (!policy) {
+		struct flow_cache_object *flo;
+
 		/* To accelerate a bit...  */
 		if ((dst_orig->flags & DST_NOXFRM) ||
 		    !net->xfrm.policy_count[XFRM_POLICY_OUT])
 			goto nopol;
 
-		policy = flow_cache_lookup(net, fl, dst_orig->ops->family,
-					   dir, xfrm_policy_lookup);
-		err = PTR_ERR(policy);
-		if (IS_ERR(policy)) {
+		flo = flow_cache_lookup(net, fl, dst_orig->ops->family,
+					dir, xfrm_policy_lookup, NULL);
+		err = PTR_ERR(flo);
+		if (IS_ERR(flo)) {
 			XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTPOLERROR);
 			goto dropdst;
 		}
+		if (flo)
+			policy = container_of(flo, struct xfrm_policy, flo);
+		else
+			policy = NULL;
 	}
 
 	if (!policy)
@@ -1939,9 +1968,16 @@ int __xfrm_policy_check(struct sock *sk, int dir, struct sk_buff *skb,
 		}
 	}
 
-	if (!pol)
-		pol = flow_cache_lookup(net, &fl, family, fl_dir,
-					xfrm_policy_lookup);
+	if (!pol) {
+		struct flow_cache_object *flo;
+
+		flo = flow_cache_lookup(net, &fl, family, fl_dir,
+					xfrm_policy_lookup, NULL);
+		if (IS_ERR_OR_NULL(flo))
+			pol = ERR_CAST(flo);
+		else
+			pol = container_of(flo, struct xfrm_policy, flo);
+	}
 
 	if (IS_ERR(pol)) {
 		XFRM_INC_STATS(net, LINUX_MIB_XFRMINPOLERROR);
-- 
1.6.3.3


* [PATCH 3/4] xfrm: remove policy garbage collection
From: Timo Teras @ 2010-04-05  7:00 UTC (permalink / raw)
  To: netdev; +Cc: Herbert Xu, Timo Teras
In-Reply-To: <1270450824-2928-1-git-send-email-timo.teras@iki.fi>

Policies are now properly reference counted and destroyed from
all code paths. The delayed gc is now just overhead and can
be removed.

Signed-off-by: Timo Teras <timo.teras@iki.fi>
---
 net/xfrm/xfrm_policy.c |   39 +++++----------------------------------
 1 files changed, 5 insertions(+), 34 deletions(-)
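
The hunks below fold the old gc work item into xfrm_policy_kill(): the
policy is marked dead, the timer's reference is dropped if the timer
was still pending, and then the list's reference is dropped. A minimal
userspace sketch of that ordering follows. Every toy_* name is
hypothetical, a plain int stands in for the kernel's atomic_t, and a
flag stands in for the actual free, so this illustrates the idea rather
than the kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct xfrm_policy; not the kernel type. */
struct toy_policy {
	int refcnt;		/* plain int; the kernel uses atomic_t */
	bool dead;		/* mirrors policy->walk.dead */
	bool timer_armed;	/* an armed timer owns one reference */
	bool destroyed;		/* set where the kernel would free the object */
};

/* Drop one reference; the last put destroys the object at once. */
static void toy_pol_put(struct toy_policy *p)
{
	assert(p->refcnt > 0);
	if (--p->refcnt == 0)
		p->destroyed = true;
}

/* Same ordering as the new xfrm_policy_kill(): no deferred work item. */
static void toy_policy_kill(struct toy_policy *p)
{
	p->dead = true;
	if (p->timer_armed) {		/* plays the role of del_timer() */
		p->timer_armed = false;
		toy_pol_put(p);		/* the timer's reference */
	}
	toy_pol_put(p);			/* the list's reference */
}

/* One cached user still holds a reference when the policy is killed.
 * Returns the reference count left after the kill. */
static int toy_kill_demo(void)
{
	struct toy_policy p = { .refcnt = 3, .timer_armed = true };

	toy_policy_kill(&p);
	assert(p.dead && !p.destroyed);
	return p.refcnt;
}
```

In this sketch the object outlives the kill only for as long as another
holder keeps its reference; the final toy_pol_put() from that holder is
what destroys it.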

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index d353de5..0b803fd 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -46,9 +46,6 @@ static struct xfrm_policy_afinfo *xfrm_policy_afinfo[NPROTO];
 
 static struct kmem_cache *xfrm_dst_cache __read_mostly;
 
-static HLIST_HEAD(xfrm_policy_gc_list);
-static DEFINE_SPINLOCK(xfrm_policy_gc_lock);
-
 static struct xfrm_policy_afinfo *xfrm_policy_get_afinfo(unsigned short family);
 static void xfrm_policy_put_afinfo(struct xfrm_policy_afinfo *afinfo);
 static void xfrm_init_pmtu(struct dst_entry *dst);
@@ -288,32 +285,6 @@ void xfrm_policy_destroy(struct xfrm_policy *policy)
 }
 EXPORT_SYMBOL(xfrm_policy_destroy);
 
-static void xfrm_policy_gc_kill(struct xfrm_policy *policy)
-{
-	atomic_inc(&policy->genid);
-
-	if (del_timer(&policy->timer))
-		atomic_dec(&policy->refcnt);
-
-	xfrm_pol_put(policy);
-}
-
-static void xfrm_policy_gc_task(struct work_struct *work)
-{
-	struct xfrm_policy *policy;
-	struct hlist_node *entry, *tmp;
-	struct hlist_head gc_list;
-
-	spin_lock_bh(&xfrm_policy_gc_lock);
-	gc_list.first = xfrm_policy_gc_list.first;
-	INIT_HLIST_HEAD(&xfrm_policy_gc_list);
-	spin_unlock_bh(&xfrm_policy_gc_lock);
-
-	hlist_for_each_entry_safe(policy, entry, tmp, &gc_list, bydst)
-		xfrm_policy_gc_kill(policy);
-}
-static DECLARE_WORK(xfrm_policy_gc_work, xfrm_policy_gc_task);
-
 /* Rule must be locked. Release descentant resources, announce
  * entry dead. The rule must be unlinked from lists to the moment.
  */
@@ -322,11 +293,12 @@ static void xfrm_policy_kill(struct xfrm_policy *policy)
 {
 	policy->walk.dead = 1;
 
-	spin_lock_bh(&xfrm_policy_gc_lock);
-	hlist_add_head(&policy->bydst, &xfrm_policy_gc_list);
-	spin_unlock_bh(&xfrm_policy_gc_lock);
+	atomic_inc(&policy->genid);
 
-	schedule_work(&xfrm_policy_gc_work);
+	if (del_timer(&policy->timer))
+		xfrm_pol_put(policy);
+
+	xfrm_pol_put(policy);
 }
 
 static unsigned int xfrm_policy_hashmax __read_mostly = 1 * 1024 * 1024;
@@ -2605,7 +2577,6 @@ static void xfrm_policy_fini(struct net *net)
 	audit_info.sessionid = -1;
 	audit_info.secid = 0;
 	xfrm_policy_flush(net, XFRM_POLICY_TYPE_MAIN, &audit_info);
-	flush_work(&xfrm_policy_gc_work);
 
 	WARN_ON(!list_empty(&net->xfrm.policy_all));
 
-- 
1.6.3.3


* [PATCH 2/4] xfrm: cache bundles instead of policies for outgoing flows
From: Timo Teras @ 2010-04-05  7:00 UTC (permalink / raw)
  To: netdev; +Cc: Herbert Xu, Timo Teras
In-Reply-To: <1270450824-2928-1-git-send-email-timo.teras@iki.fi>

__xfrm_lookup() is called for each packet transmitted out of the
system. xfrm_find_bundle() does a linear search, which can
kill system performance depending on how many bundles are
required per policy.

This modifies __xfrm_lookup() to store bundles directly in
the flow cache. If we do not get a hit, we just create a new
bundle instead of doing a slow search. This means that we can now
get multiple xfrm_dst's for the same flow (on a per-CPU basis).

Signed-off-by: Timo Teras <timo.teras@iki.fi>
---
 include/net/xfrm.h      |   10 +-
 net/ipv4/xfrm4_policy.c |   22 --
 net/ipv6/xfrm6_policy.c |   31 --
 net/xfrm/xfrm_policy.c  |  709 +++++++++++++++++++++++++----------------------
 4 files changed, 385 insertions(+), 387 deletions(-)
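
The caching scheme described above can be sketched in userspace C as
follows. A bundle records a snapshot of its policy's generation counter
when it is built (compare xdst->policy_genid in the hunks below), and a
lookup that misses or finds a stale entry simply builds a fresh bundle
instead of searching a list. All toy_* names are hypothetical. The
sketch assumes that validity is decided by comparing the snapshot with
the current counter, which the hunks suggest but this excerpt does not
show, and it uses one small table where the real flow cache is per-CPU.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins; none of these are kernel types. */
struct toy_policy {
	unsigned int genid;		/* bumped whenever the policy changes */
};

struct toy_bundle {
	const struct toy_policy *pol;
	unsigned int policy_genid;	/* snapshot taken at build time */
};

struct toy_cache_slot {
	int key;			/* stands in for the flow key */
	int used;
	struct toy_bundle bundle;
};

#define TOY_SLOTS 8
static struct toy_cache_slot toy_cache[TOY_SLOTS];
static int toy_builds;			/* counts bundle constructions */

/* Assumed validity rule: the snapshot must match the live counter. */
static int toy_bundle_valid(const struct toy_bundle *b)
{
	return b->policy_genid == b->pol->genid;
}

static struct toy_bundle *toy_lookup(int key, const struct toy_policy *pol)
{
	struct toy_cache_slot *s = &toy_cache[(unsigned int)key % TOY_SLOTS];

	if (s->used && s->key == key && s->bundle.pol == pol &&
	    toy_bundle_valid(&s->bundle))
		return &s->bundle;	/* hit: no search, no rebuild */

	/* Miss or stale entry: build a new bundle in place. */
	s->key = key;
	s->used = 1;
	s->bundle.pol = pol;
	s->bundle.policy_genid = pol->genid;
	toy_builds++;
	return &s->bundle;
}

/* Returns how many bundles were built for three lookups of one flow. */
static int toy_cache_demo(void)
{
	struct toy_policy pol = { .genid = 1 };

	memset(toy_cache, 0, sizeof(toy_cache));
	toy_builds = 0;

	toy_lookup(42, &pol);	/* miss: first build */
	toy_lookup(42, &pol);	/* hit: snapshot still matches */
	pol.genid++;		/* policy changed, as atomic_inc() would do */
	toy_lookup(42, &pol);	/* stale: rebuilt */
	assert(toy_bundle_valid(toy_lookup(42, &pol)));
	return toy_builds;
}
```

Bumping a counter makes invalidation cheap for the writer: nothing has
to walk and free per-policy bundle lists, and each stale entry is
replaced the next time its flow is looked up.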

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 35396e2..625dd61 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -267,7 +267,6 @@ struct xfrm_policy_afinfo {
 					       xfrm_address_t *saddr,
 					       xfrm_address_t *daddr);
 	int			(*get_saddr)(struct net *net, xfrm_address_t *saddr, xfrm_address_t *daddr);
-	struct dst_entry	*(*find_bundle)(struct flowi *fl, struct xfrm_policy *policy);
 	void			(*decode_session)(struct sk_buff *skb,
 						  struct flowi *fl,
 						  int reverse);
@@ -483,13 +482,13 @@ struct xfrm_policy {
 	struct timer_list	timer;
 
 	struct flow_cache_object flo;
+	atomic_t		genid;
 	u32			priority;
 	u32			index;
 	struct xfrm_mark	mark;
 	struct xfrm_selector	selector;
 	struct xfrm_lifetime_cfg lft;
 	struct xfrm_lifetime_cur curlft;
-	struct dst_entry       *bundles;
 	struct xfrm_policy_walk_entry walk;
 	u8			type;
 	u8			action;
@@ -879,11 +878,15 @@ struct xfrm_dst {
 		struct rt6_info		rt6;
 	} u;
 	struct dst_entry *route;
+	struct flow_cache_object flo;
+	struct xfrm_policy *pols[XFRM_POLICY_TYPE_MAX];
+	int num_pols, num_xfrms;
 #ifdef CONFIG_XFRM_SUB_POLICY
 	struct flowi *origin;
 	struct xfrm_selector *partner;
 #endif
-	u32 genid;
+	u32 xfrm_genid;
+	u32 policy_genid;
 	u32 route_mtu_cached;
 	u32 child_mtu_cached;
 	u32 route_cookie;
@@ -893,6 +896,7 @@ struct xfrm_dst {
 #ifdef CONFIG_XFRM
 static inline void xfrm_dst_destroy(struct xfrm_dst *xdst)
 {
+	xfrm_pols_put(xdst->pols, xdst->num_pols);
 	dst_release(xdst->route);
 	if (likely(xdst->u.dst.xfrm))
 		xfrm_state_put(xdst->u.dst.xfrm);
diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c
index e4a1483..1705476 100644
--- a/net/ipv4/xfrm4_policy.c
+++ b/net/ipv4/xfrm4_policy.c
@@ -59,27 +59,6 @@ static int xfrm4_get_saddr(struct net *net,
 	return 0;
 }
 
-static struct dst_entry *
-__xfrm4_find_bundle(struct flowi *fl, struct xfrm_policy *policy)
-{
-	struct dst_entry *dst;
-
-	read_lock_bh(&policy->lock);
-	for (dst = policy->bundles; dst; dst = dst->next) {
-		struct xfrm_dst *xdst = (struct xfrm_dst *)dst;
-		if (xdst->u.rt.fl.oif == fl->oif &&	/*XXX*/
-		    xdst->u.rt.fl.fl4_dst == fl->fl4_dst &&
-		    xdst->u.rt.fl.fl4_src == fl->fl4_src &&
-		    xdst->u.rt.fl.fl4_tos == fl->fl4_tos &&
-		    xfrm_bundle_ok(policy, xdst, fl, AF_INET, 0)) {
-			dst_clone(dst);
-			break;
-		}
-	}
-	read_unlock_bh(&policy->lock);
-	return dst;
-}
-
 static int xfrm4_get_tos(struct flowi *fl)
 {
 	return fl->fl4_tos;
@@ -259,7 +238,6 @@ static struct xfrm_policy_afinfo xfrm4_policy_afinfo = {
 	.dst_ops =		&xfrm4_dst_ops,
 	.dst_lookup =		xfrm4_dst_lookup,
 	.get_saddr =		xfrm4_get_saddr,
-	.find_bundle = 		__xfrm4_find_bundle,
 	.decode_session =	_decode_session4,
 	.get_tos =		xfrm4_get_tos,
 	.init_path =		xfrm4_init_path,
diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index ae18165..8c452fd 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -67,36 +67,6 @@ static int xfrm6_get_saddr(struct net *net,
 	return 0;
 }
 
-static struct dst_entry *
-__xfrm6_find_bundle(struct flowi *fl, struct xfrm_policy *policy)
-{
-	struct dst_entry *dst;
-
-	/* Still not clear if we should set fl->fl6_{src,dst}... */
-	read_lock_bh(&policy->lock);
-	for (dst = policy->bundles; dst; dst = dst->next) {
-		struct xfrm_dst *xdst = (struct xfrm_dst*)dst;
-		struct in6_addr fl_dst_prefix, fl_src_prefix;
-
-		ipv6_addr_prefix(&fl_dst_prefix,
-				 &fl->fl6_dst,
-				 xdst->u.rt6.rt6i_dst.plen);
-		ipv6_addr_prefix(&fl_src_prefix,
-				 &fl->fl6_src,
-				 xdst->u.rt6.rt6i_src.plen);
-		if (ipv6_addr_equal(&xdst->u.rt6.rt6i_dst.addr, &fl_dst_prefix) &&
-		    ipv6_addr_equal(&xdst->u.rt6.rt6i_src.addr, &fl_src_prefix) &&
-		    xfrm_bundle_ok(policy, xdst, fl, AF_INET6,
-				   (xdst->u.rt6.rt6i_dst.plen != 128 ||
-				    xdst->u.rt6.rt6i_src.plen != 128))) {
-			dst_clone(dst);
-			break;
-		}
-	}
-	read_unlock_bh(&policy->lock);
-	return dst;
-}
-
 static int xfrm6_get_tos(struct flowi *fl)
 {
 	return 0;
@@ -291,7 +261,6 @@ static struct xfrm_policy_afinfo xfrm6_policy_afinfo = {
 	.dst_ops =		&xfrm6_dst_ops,
 	.dst_lookup =		xfrm6_dst_lookup,
 	.get_saddr = 		xfrm6_get_saddr,
-	.find_bundle =		__xfrm6_find_bundle,
 	.decode_session =	_decode_session6,
 	.get_tos =		xfrm6_get_tos,
 	.init_path =		xfrm6_init_path,
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 7722bae..d353de5 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -37,6 +37,8 @@
 DEFINE_MUTEX(xfrm_cfg_mutex);
 EXPORT_SYMBOL(xfrm_cfg_mutex);
 
+static DEFINE_SPINLOCK(xfrm_policy_sk_bundle_lock);
+static struct dst_entry *xfrm_policy_sk_bundles;
 static DEFINE_RWLOCK(xfrm_policy_lock);
 
 static DEFINE_RWLOCK(xfrm_policy_afinfo_lock);
@@ -50,6 +52,7 @@ static DEFINE_SPINLOCK(xfrm_policy_gc_lock);
 static struct xfrm_policy_afinfo *xfrm_policy_get_afinfo(unsigned short family);
 static void xfrm_policy_put_afinfo(struct xfrm_policy_afinfo *afinfo);
 static void xfrm_init_pmtu(struct dst_entry *dst);
+static int stale_bundle(struct dst_entry *dst);
 
 static struct xfrm_policy *__xfrm_policy_unlink(struct xfrm_policy *pol,
 						int dir);
@@ -277,8 +280,6 @@ void xfrm_policy_destroy(struct xfrm_policy *policy)
 {
 	BUG_ON(!policy->walk.dead);
 
-	BUG_ON(policy->bundles);
-
 	if (del_timer(&policy->timer))
 		BUG();
 
@@ -289,12 +290,7 @@ EXPORT_SYMBOL(xfrm_policy_destroy);
 
 static void xfrm_policy_gc_kill(struct xfrm_policy *policy)
 {
-	struct dst_entry *dst;
-
-	while ((dst = policy->bundles) != NULL) {
-		policy->bundles = dst->next;
-		dst_free(dst);
-	}
+	atomic_inc(&policy->genid);
 
 	if (del_timer(&policy->timer))
 		atomic_dec(&policy->refcnt);
@@ -572,7 +568,6 @@ int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl)
 	struct xfrm_policy *delpol;
 	struct hlist_head *chain;
 	struct hlist_node *entry, *newpos;
-	struct dst_entry *gc_list;
 	u32 mark = policy->mark.v & policy->mark.m;
 
 	write_lock_bh(&xfrm_policy_lock);
@@ -623,33 +618,11 @@ int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl)
 		schedule_work(&net->xfrm.policy_hash_work);
 
 	read_lock_bh(&xfrm_policy_lock);
-	gc_list = NULL;
 	entry = &policy->bydst;
-	hlist_for_each_entry_continue(policy, entry, bydst) {
-		struct dst_entry *dst;
-
-		write_lock(&policy->lock);
-		dst = policy->bundles;
-		if (dst) {
-			struct dst_entry *tail = dst;
-			while (tail->next)
-				tail = tail->next;
-			tail->next = gc_list;
-			gc_list = dst;
-
-			policy->bundles = NULL;
-		}
-		write_unlock(&policy->lock);
-	}
+	hlist_for_each_entry_continue(policy, entry, bydst)
+		atomic_inc(&policy->genid);
 	read_unlock_bh(&xfrm_policy_lock);
 
-	while (gc_list) {
-		struct dst_entry *dst = gc_list;
-
-		gc_list = dst->next;
-		dst_free(dst);
-	}
-
 	return 0;
 }
 EXPORT_SYMBOL(xfrm_policy_insert);
@@ -998,6 +971,19 @@ fail:
 	return ret;
 }
 
+static struct xfrm_policy *
+__xfrm_policy_lookup(struct net *net, struct flowi *fl, u16 family, u8 dir)
+{
+#ifdef CONFIG_XFRM_SUB_POLICY
+	struct xfrm_policy *pol;
+
+	pol = xfrm_policy_lookup_bytype(net, XFRM_POLICY_TYPE_SUB, fl, family, dir);
+	if (pol != NULL)
+		return pol;
+#endif
+	return xfrm_policy_lookup_bytype(net, XFRM_POLICY_TYPE_MAIN, fl, family, dir);
+}
+
 static struct flow_cache_object *
 xfrm_policy_lookup(struct net *net, struct flowi *fl, u16 family,
 		   u8 dir, struct flow_cache_object *old_obj, void *ctx)
@@ -1007,21 +993,10 @@ xfrm_policy_lookup(struct net *net, struct flowi *fl, u16 family,
 	if (old_obj)
 		xfrm_pol_put(container_of(old_obj, struct xfrm_policy, flo));
 
-#ifdef CONFIG_XFRM_SUB_POLICY
-	pol = xfrm_policy_lookup_bytype(net, XFRM_POLICY_TYPE_SUB, fl, family, dir);
-	if (IS_ERR(pol))
+	pol = __xfrm_policy_lookup(net, fl, family, dir);
+	if (IS_ERR_OR_NULL(pol))
 		return ERR_CAST(pol);
-	if (pol)
-		goto found;
-#endif
-	pol = xfrm_policy_lookup_bytype(net, XFRM_POLICY_TYPE_MAIN, fl, family, dir);
-	if (IS_ERR(pol))
-		return ERR_CAST(pol);
-	if (pol)
-		goto found;
-	return NULL;
 
-found:
 	/* Resolver returns two references:
 	 * one for cache and one for caller of flow_cache_lookup() */
 	xfrm_pol_hold(pol);
@@ -1313,18 +1288,6 @@ xfrm_tmpl_resolve(struct xfrm_policy **pols, int npols, struct flowi *fl,
  * still valid.
  */
 
-static struct dst_entry *
-xfrm_find_bundle(struct flowi *fl, struct xfrm_policy *policy, unsigned short family)
-{
-	struct dst_entry *x;
-	struct xfrm_policy_afinfo *afinfo = xfrm_policy_get_afinfo(family);
-	if (unlikely(afinfo == NULL))
-		return ERR_PTR(-EINVAL);
-	x = afinfo->find_bundle(fl, policy);
-	xfrm_policy_put_afinfo(afinfo);
-	return x;
-}
-
 static inline int xfrm_get_tos(struct flowi *fl, int family)
 {
 	struct xfrm_policy_afinfo *afinfo = xfrm_policy_get_afinfo(family);
@@ -1340,6 +1303,54 @@ static inline int xfrm_get_tos(struct flowi *fl, int family)
 	return tos;
 }
 
+static struct flow_cache_object *xfrm_bundle_flo_get(struct flow_cache_object *flo)
+{
+	struct xfrm_dst *xdst = container_of(flo, struct xfrm_dst, flo);
+	struct dst_entry *dst = &xdst->u.dst;
+
+	if (xdst->route == NULL) {
+		/* Dummy bundle - if it has xfrms we were not
+		 * able to build bundle as template resolution failed.
+		 * It means we need to try again resolving. */
+		if (xdst->num_xfrms > 0)
+			return NULL;
+	} else {
+		/* Real bundle */
+		if (stale_bundle(dst))
+			return NULL;
+	}
+
+	dst_hold(dst);
+	return flo;
+}
+
+static int xfrm_bundle_flo_check(struct flow_cache_object *flo)
+{
+	struct xfrm_dst *xdst = container_of(flo, struct xfrm_dst, flo);
+	struct dst_entry *dst = &xdst->u.dst;
+
+	if (!xdst->route)
+		return 0;
+	if (stale_bundle(dst))
+		return 0;
+
+	return 1;
+}
+
+static void xfrm_bundle_flo_delete(struct flow_cache_object *flo)
+{
+	struct xfrm_dst *xdst = container_of(flo, struct xfrm_dst, flo);
+	struct dst_entry *dst = &xdst->u.dst;
+
+	dst_free(dst);
+}
+
+static const struct flow_cache_ops xfrm_bundle_fc_ops = {
+	.get = xfrm_bundle_flo_get,
+	.check = xfrm_bundle_flo_check,
+	.delete = xfrm_bundle_flo_delete,
+};
+
 static inline struct xfrm_dst *xfrm_alloc_dst(struct net *net, int family)
 {
 	struct xfrm_policy_afinfo *afinfo = xfrm_policy_get_afinfo(family);
@@ -1362,9 +1373,10 @@ static inline struct xfrm_dst *xfrm_alloc_dst(struct net *net, int family)
 		BUG();
 	}
 	xdst = dst_alloc(dst_ops) ?: ERR_PTR(-ENOBUFS);
-
 	xfrm_policy_put_afinfo(afinfo);
 
+	if (!IS_ERR(xdst))
+		xdst->flo.ops = &xfrm_bundle_fc_ops;
 	return xdst;
 }
 
@@ -1402,6 +1414,7 @@ static inline int xfrm_fill_dst(struct xfrm_dst *xdst, struct net_device *dev,
 	return err;
 }
 
+
 /* Allocate chain of dst_entry's, attach known xfrm's, calculate
  * all the metrics... Shortly, bundle a bundle.
  */
@@ -1465,7 +1478,7 @@ static struct dst_entry *xfrm_bundle_create(struct xfrm_policy *policy,
 			dst_hold(dst);
 
 		dst1->xfrm = xfrm[i];
-		xdst->genid = xfrm[i]->genid;
+		xdst->xfrm_genid = xfrm[i]->genid;
 
 		dst1->obsolete = -1;
 		dst1->flags |= DST_HOST;
@@ -1558,7 +1571,186 @@ xfrm_dst_update_origin(struct dst_entry *dst, struct flowi *fl)
 #endif
 }
 
-static int stale_bundle(struct dst_entry *dst);
+static int xfrm_expand_policies(struct flowi *fl, u16 family,
+				struct xfrm_policy **pols,
+				int *num_pols, int *num_xfrms)
+{
+	int i;
+
+	if (*num_pols == 0 || !pols[0]) {
+		*num_pols = 0;
+		*num_xfrms = 0;
+		return 0;
+	}
+	if (IS_ERR(pols[0]))
+		return PTR_ERR(pols[0]);
+
+	*num_xfrms = pols[0]->xfrm_nr;
+
+#ifdef CONFIG_XFRM_SUB_POLICY
+	if (pols[0] && pols[0]->action == XFRM_POLICY_ALLOW &&
+	    pols[0]->type != XFRM_POLICY_TYPE_MAIN) {
+		pols[1] = xfrm_policy_lookup_bytype(xp_net(pols[0]),
+						    XFRM_POLICY_TYPE_MAIN,
+						    fl, family,
+						    XFRM_POLICY_OUT);
+		if (pols[1]) {
+			if (IS_ERR(pols[1])) {
+				xfrm_pols_put(pols, *num_pols);
+				return PTR_ERR(pols[1]);
+			}
+			(*num_pols) ++;
+			(*num_xfrms) += pols[1]->xfrm_nr;
+		}
+	}
+#endif
+	for (i = 0; i < *num_pols; i++) {
+		if (pols[i]->action != XFRM_POLICY_ALLOW) {
+			*num_xfrms = -1;
+			break;
+		}
+	}
+
+	return 0;
+
+}
+
+static struct xfrm_dst *
+xfrm_resolve_and_create_bundle(struct xfrm_policy **pols, int num_pols,
+			       struct flowi *fl, u16 family,
+			       struct dst_entry *dst_orig)
+{
+	struct net *net = xp_net(pols[0]);
+	struct xfrm_state *xfrm[XFRM_MAX_DEPTH];
+	struct dst_entry *dst;
+	struct xfrm_dst *xdst;
+	int err;
+
+	/* Try to instantiate a bundle */
+	err = xfrm_tmpl_resolve(pols, num_pols, fl, xfrm, family);
+	if (err < 0) {
+		if (err != -EAGAIN)
+			XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTPOLERROR);
+		return ERR_PTR(err);
+	}
+
+	dst = xfrm_bundle_create(pols[0], xfrm, err, fl, dst_orig);
+	if (IS_ERR(dst)) {
+		XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTBUNDLEGENERROR);
+		return ERR_CAST(dst);
+	}
+
+	xdst = (struct xfrm_dst *)dst;
+	xdst->num_xfrms = err;
+	if (num_pols > 1)
+		err = xfrm_dst_update_parent(dst, &pols[1]->selector);
+	else
+		err = xfrm_dst_update_origin(dst, fl);
+	if (unlikely(err)) {
+		dst_free(dst);
+		XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTBUNDLECHECKERROR);
+		return ERR_PTR(err);
+	}
+
+	xdst->num_pols = num_pols;
+	memcpy(xdst->pols, pols, sizeof(struct xfrm_policy*) * num_pols);
+	xdst->policy_genid = atomic_read(&pols[0]->genid);
+
+	return xdst;
+}
+
+static struct flow_cache_object *
+xfrm_bundle_lookup(struct net *net, struct flowi *fl, u16 family, u8 dir,
+		   struct flow_cache_object *oldflo, void *ctx)
+{
+	struct dst_entry *dst_orig = (struct dst_entry *)ctx;
+	struct xfrm_policy *pols[XFRM_POLICY_TYPE_MAX];
+	struct xfrm_dst *xdst, *new_xdst;
+	int num_pols = 0, num_xfrms = 0, i, err, pol_dead;
+
+	/* Check if the policies from old bundle are usable */
+	xdst = NULL;
+	if (oldflo) {
+		xdst = container_of(oldflo, struct xfrm_dst, flo);
+		num_pols = xdst->num_pols;
+		num_xfrms = xdst->num_xfrms;
+		pol_dead = 0;
+		for (i = 0; i < num_pols; i++) {
+			pols[i] = xdst->pols[i];
+			pol_dead |= pols[i]->walk.dead;
+		}
+		if (pol_dead) {
+			dst_free(&xdst->u.dst);
+			xdst = NULL;
+			num_pols = 0;
+			num_xfrms = 0;
+			oldflo = NULL;
+		}
+	}
+
+	/* Resolve policies to use if we couldn't get them from
+	 * previous cache entry */
+	if (xdst == NULL) {
+		num_pols = 1;
+		pols[0] = __xfrm_policy_lookup(net, fl, family, dir);
+		err = xfrm_expand_policies(fl, family, pols,
+					   &num_pols, &num_xfrms);
+		if (err < 0)
+			goto inc_error;
+		if (num_pols == 0)
+			return NULL;
+		if (num_xfrms <= 0)
+			goto make_dummy_bundle;
+	}
+
+	new_xdst = xfrm_resolve_and_create_bundle(pols, num_pols, fl, family, dst_orig);
+	if (IS_ERR(new_xdst)) {
+		err = PTR_ERR(new_xdst);
+		if (err != -EAGAIN)
+			goto error;
+		if (oldflo == NULL)
+			goto make_dummy_bundle;
+		dst_hold(&xdst->u.dst);
+		return oldflo;
+	}
+
+	/* Kill the previous bundle */
+	if (xdst) {
+		/* The policies were stolen for newly generated bundle */
+		xdst->num_pols = 0;
+		dst_free(&xdst->u.dst);
+	}
+
+	/* Flow cache does not have reference, it dst_free()'s,
+	 * but we do need to return one reference for original caller */
+	dst_hold(&new_xdst->u.dst);
+	return &new_xdst->flo;
+
+make_dummy_bundle:
+	/* We found policies, but there's no bundles to instantiate:
+	 * either because the policy blocks, has no transformations or
+	 * we could not build template (no xfrm_states).*/
+	xdst = xfrm_alloc_dst(net, family);
+	if (IS_ERR(xdst)) {
+		xfrm_pols_put(pols, num_pols);
+		return ERR_CAST(xdst);
+	}
+	xdst->num_pols = num_pols;
+	xdst->num_xfrms = num_xfrms;
+	memcpy(xdst->pols, pols, sizeof(struct xfrm_policy*) * num_pols);
+
+	dst_hold(&xdst->u.dst);
+	return &xdst->flo;
+
+inc_error:
+	XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTPOLERROR);
+error:
+	if (xdst != NULL)
+		dst_free(&xdst->u.dst);
+	else
+		xfrm_pols_put(pols, num_pols);
+	return ERR_PTR(err);
+}
 
 /* Main function: finds/creates a bundle for given flow.
  *
@@ -1568,248 +1760,152 @@ static int stale_bundle(struct dst_entry *dst);
 int __xfrm_lookup(struct net *net, struct dst_entry **dst_p, struct flowi *fl,
 		  struct sock *sk, int flags)
 {
-	struct xfrm_policy *policy;
 	struct xfrm_policy *pols[XFRM_POLICY_TYPE_MAX];
-	int npols;
-	int pol_dead;
-	int xfrm_nr;
-	int pi;
-	struct xfrm_state *xfrm[XFRM_MAX_DEPTH];
-	struct dst_entry *dst, *dst_orig = *dst_p;
-	int nx = 0;
-	int err;
-	u32 genid;
-	u16 family;
+	struct flow_cache_object *flo;
+	struct xfrm_dst *xdst;
+	struct dst_entry *dst, *dst_orig = *dst_p, *route;
+	u16 family = dst_orig->ops->family;
 	u8 dir = policy_to_flow_dir(XFRM_POLICY_OUT);
+	int i, err, num_pols, num_xfrms, drop_pols = 0;
 
 restart:
-	genid = atomic_read(&flow_cache_genid);
-	policy = NULL;
-	for (pi = 0; pi < ARRAY_SIZE(pols); pi++)
-		pols[pi] = NULL;
-	npols = 0;
-	pol_dead = 0;
-	xfrm_nr = 0;
+	dst = NULL;
+	xdst = NULL;
+	route = NULL;
 
 	if (sk && sk->sk_policy[XFRM_POLICY_OUT]) {
-		policy = xfrm_sk_policy_lookup(sk, XFRM_POLICY_OUT, fl);
-		err = PTR_ERR(policy);
-		if (IS_ERR(policy)) {
-			XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTPOLERROR);
+		num_pols = 1;
+		pols[0] = xfrm_sk_policy_lookup(sk, XFRM_POLICY_OUT, fl);
+		err = xfrm_expand_policies(fl, family, pols,
+					   &num_pols, &num_xfrms);
+		if (err < 0)
 			goto dropdst;
+
+		if (num_pols) {
+			if (num_xfrms <= 0) {
+				drop_pols = num_pols;
+				goto no_transform;
+			}
+
+			xdst = xfrm_resolve_and_create_bundle(
+					pols, num_pols, fl,
+					family, dst_orig);
+			if (IS_ERR(xdst)) {
+				xfrm_pols_put(pols, num_pols);
+				err = PTR_ERR(xdst);
+				goto dropdst;
+			}
+
+			spin_lock_bh(&xfrm_policy_sk_bundle_lock);
+			xdst->u.dst.next = xfrm_policy_sk_bundles;
+			xfrm_policy_sk_bundles = &xdst->u.dst;
+			spin_unlock_bh(&xfrm_policy_sk_bundle_lock);
+
+			route = xdst->route;
 		}
 	}
 
-	if (!policy) {
-		struct flow_cache_object *flo;
-
+	if (xdst == NULL) {
 		/* To accelerate a bit...  */
 		if ((dst_orig->flags & DST_NOXFRM) ||
 		    !net->xfrm.policy_count[XFRM_POLICY_OUT])
 			goto nopol;
 
-		flo = flow_cache_lookup(net, fl, dst_orig->ops->family,
-					dir, xfrm_policy_lookup, NULL);
-		err = PTR_ERR(flo);
+		flo = flow_cache_lookup(net, fl, family, dir,
+					xfrm_bundle_lookup, dst_orig);
+		if (flo == NULL)
+			goto nopol;
 		if (IS_ERR(flo)) {
-			XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTPOLERROR);
+			err = PTR_ERR(flo);
 			goto dropdst;
 		}
-		if (flo)
-			policy = container_of(flo, struct xfrm_policy, flo);
-		else
-			policy = NULL;
+		xdst = container_of(flo, struct xfrm_dst, flo);
+
+		num_pols = xdst->num_pols;
+		num_xfrms = xdst->num_xfrms;
+		memcpy(pols, xdst->pols, sizeof(struct xfrm_policy*) * num_pols);
+		route = xdst->route;
+	}
+
+	dst = &xdst->u.dst;
+	if (route == NULL && num_xfrms > 0) {
+		/* The only case when xfrm_bundle_lookup() returns a
+		 * bundle with null route, is when the template could
+		 * not be resolved. It means policies are there, but
+		 * bundle could not be created, since we don't yet
+		 * have the xfrm_state's. We need to wait for KM to
+		 * negotiate new SA's or bail out with error.*/
+		if (net->xfrm.sysctl_larval_drop) {
+			/* EREMOTE tells the caller to generate
+			 * a one-shot blackhole route. */
+			dst_release(dst);
+			xfrm_pols_put(pols, num_pols);
+			XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTNOSTATES);
+			return -EREMOTE;
+		}
+		if (flags & XFRM_LOOKUP_WAIT) {
+			DECLARE_WAITQUEUE(wait, current);
+
+			add_wait_queue(&net->xfrm.km_waitq, &wait);
+			set_current_state(TASK_INTERRUPTIBLE);
+			schedule();
+			set_current_state(TASK_RUNNING);
+			remove_wait_queue(&net->xfrm.km_waitq, &wait);
+
+			if (!signal_pending(current)) {
+				dst_release(dst);
+				goto restart;
+			}
+
+			err = -ERESTART;
+		} else
+			err = -EAGAIN;
+
+		XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTNOSTATES);
+		goto error;
 	}
 
-	if (!policy)
+no_transform:
+	if (num_pols == 0)
 		goto nopol;
 
-	family = dst_orig->ops->family;
-	pols[0] = policy;
-	npols ++;
-	xfrm_nr += pols[0]->xfrm_nr;
-
-	err = -ENOENT;
-	if ((flags & XFRM_LOOKUP_ICMP) && !(policy->flags & XFRM_POLICY_ICMP))
+	if ((flags & XFRM_LOOKUP_ICMP) &&
+	    !(pols[0]->flags & XFRM_POLICY_ICMP)) {
+		err = -ENOENT;
 		goto error;
+	}
 
-	policy->curlft.use_time = get_seconds();
+	for (i = 0; i < num_pols; i++)
+		pols[i]->curlft.use_time = get_seconds();
 
-	switch (policy->action) {
-	default:
-	case XFRM_POLICY_BLOCK:
+	if (num_xfrms < 0) {
 		/* Prohibit the flow */
 		XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTPOLBLOCK);
 		err = -EPERM;
 		goto error;
-
-	case XFRM_POLICY_ALLOW:
-#ifndef CONFIG_XFRM_SUB_POLICY
-		if (policy->xfrm_nr == 0) {
-			/* Flow passes not transformed. */
-			xfrm_pol_put(policy);
-			return 0;
-		}
-#endif
-
-		/* Try to find matching bundle.
-		 *
-		 * LATER: help from flow cache. It is optional, this
-		 * is required only for output policy.
-		 */
-		dst = xfrm_find_bundle(fl, policy, family);
-		if (IS_ERR(dst)) {
-			XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTBUNDLECHECKERROR);
-			err = PTR_ERR(dst);
-			goto error;
-		}
-
-		if (dst)
-			break;
-
-#ifdef CONFIG_XFRM_SUB_POLICY
-		if (pols[0]->type != XFRM_POLICY_TYPE_MAIN) {
-			pols[1] = xfrm_policy_lookup_bytype(net,
-							    XFRM_POLICY_TYPE_MAIN,
-							    fl, family,
-							    XFRM_POLICY_OUT);
-			if (pols[1]) {
-				if (IS_ERR(pols[1])) {
-					XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTPOLERROR);
-					err = PTR_ERR(pols[1]);
-					goto error;
-				}
-				if (pols[1]->action == XFRM_POLICY_BLOCK) {
-					XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTPOLBLOCK);
-					err = -EPERM;
-					goto error;
-				}
-				npols ++;
-				xfrm_nr += pols[1]->xfrm_nr;
-			}
-		}
-
-		/*
-		 * Because neither flowi nor bundle information knows about
-		 * transformation template size. On more than one policy usage
-		 * we can realize whether all of them is bypass or not after
-		 * they are searched. See above not-transformed bypass
-		 * is surrounded by non-sub policy configuration, too.
-		 */
-		if (xfrm_nr == 0) {
-			/* Flow passes not transformed. */
-			xfrm_pols_put(pols, npols);
-			return 0;
-		}
-
-#endif
-		nx = xfrm_tmpl_resolve(pols, npols, fl, xfrm, family);
-
-		if (unlikely(nx<0)) {
-			err = nx;
-			if (err == -EAGAIN && net->xfrm.sysctl_larval_drop) {
-				/* EREMOTE tells the caller to generate
-				 * a one-shot blackhole route.
-				 */
-				XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTNOSTATES);
-				xfrm_pol_put(policy);
-				return -EREMOTE;
-			}
-			if (err == -EAGAIN && (flags & XFRM_LOOKUP_WAIT)) {
-				DECLARE_WAITQUEUE(wait, current);
-
-				add_wait_queue(&net->xfrm.km_waitq, &wait);
-				set_current_state(TASK_INTERRUPTIBLE);
-				schedule();
-				set_current_state(TASK_RUNNING);
-				remove_wait_queue(&net->xfrm.km_waitq, &wait);
-
-				nx = xfrm_tmpl_resolve(pols, npols, fl, xfrm, family);
-
-				if (nx == -EAGAIN && signal_pending(current)) {
-					XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTNOSTATES);
-					err = -ERESTART;
-					goto error;
-				}
-				if (nx == -EAGAIN ||
-				    genid != atomic_read(&flow_cache_genid)) {
-					xfrm_pols_put(pols, npols);
-					goto restart;
-				}
-				err = nx;
-			}
-			if (err < 0) {
-				XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTNOSTATES);
-				goto error;
-			}
-		}
-		if (nx == 0) {
-			/* Flow passes not transformed. */
-			xfrm_pols_put(pols, npols);
-			return 0;
-		}
-
-		dst = xfrm_bundle_create(policy, xfrm, nx, fl, dst_orig);
-		err = PTR_ERR(dst);
-		if (IS_ERR(dst)) {
-			XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTBUNDLEGENERROR);
-			goto error;
-		}
-
-		for (pi = 0; pi < npols; pi++)
-			pol_dead |= pols[pi]->walk.dead;
-
-		write_lock_bh(&policy->lock);
-		if (unlikely(pol_dead || stale_bundle(dst))) {
-			/* Wow! While we worked on resolving, this
-			 * policy has gone. Retry. It is not paranoia,
-			 * we just cannot enlist new bundle to dead object.
-			 * We can't enlist stable bundles either.
-			 */
-			write_unlock_bh(&policy->lock);
-			dst_free(dst);
-
-			if (pol_dead)
-				XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTPOLDEAD);
-			else
-				XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTBUNDLECHECKERROR);
-			err = -EHOSTUNREACH;
-			goto error;
-		}
-
-		if (npols > 1)
-			err = xfrm_dst_update_parent(dst, &pols[1]->selector);
-		else
-			err = xfrm_dst_update_origin(dst, fl);
-		if (unlikely(err)) {
-			write_unlock_bh(&policy->lock);
-			dst_free(dst);
-			XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTBUNDLECHECKERROR);
-			goto error;
-		}
-
-		dst->next = policy->bundles;
-		policy->bundles = dst;
-		dst_hold(dst);
-		write_unlock_bh(&policy->lock);
+	} else if (num_xfrms > 0) {
+		/* Flow transformed */
+		*dst_p = dst;
+		dst_release(dst_orig);
+	} else {
+		/* Flow passes untransformed */
+		dst_release(dst);
 	}
-	*dst_p = dst;
-	dst_release(dst_orig);
-	xfrm_pols_put(pols, npols);
+ok:
+	xfrm_pols_put(pols, drop_pols);
 	return 0;
 
+nopol:
+	if (!(flags & XFRM_LOOKUP_ICMP))
+		goto ok;
+	err = -ENOENT;
 error:
-	xfrm_pols_put(pols, npols);
+	dst_release(dst);
 dropdst:
 	dst_release(dst_orig);
 	*dst_p = NULL;
+	xfrm_pols_put(pols, drop_pols);
 	return err;
-
-nopol:
-	err = -ENOENT;
-	if (flags & XFRM_LOOKUP_ICMP)
-		goto dropdst;
-	return 0;
 }
 EXPORT_SYMBOL(__xfrm_lookup);
 
@@ -2161,71 +2257,24 @@ static struct dst_entry *xfrm_negative_advice(struct dst_entry *dst)
 	return dst;
 }
 
-static void prune_one_bundle(struct xfrm_policy *pol, int (*func)(struct dst_entry *), struct dst_entry **gc_list_p)
-{
-	struct dst_entry *dst, **dstp;
-
-	write_lock(&pol->lock);
-	dstp = &pol->bundles;
-	while ((dst=*dstp) != NULL) {
-		if (func(dst)) {
-			*dstp = dst->next;
-			dst->next = *gc_list_p;
-			*gc_list_p = dst;
-		} else {
-			dstp = &dst->next;
-		}
-	}
-	write_unlock(&pol->lock);
-}
-
-static void xfrm_prune_bundles(struct net *net, int (*func)(struct dst_entry *))
+static void __xfrm_garbage_collect(struct net *net)
 {
-	struct dst_entry *gc_list = NULL;
-	int dir;
+	struct dst_entry *head, *next;
 
-	read_lock_bh(&xfrm_policy_lock);
-	for (dir = 0; dir < XFRM_POLICY_MAX * 2; dir++) {
-		struct xfrm_policy *pol;
-		struct hlist_node *entry;
-		struct hlist_head *table;
-		int i;
+	flow_cache_flush();
 
-		hlist_for_each_entry(pol, entry,
-				     &net->xfrm.policy_inexact[dir], bydst)
-			prune_one_bundle(pol, func, &gc_list);
+	spin_lock_bh(&xfrm_policy_sk_bundle_lock);
+	head = xfrm_policy_sk_bundles;
+	xfrm_policy_sk_bundles = NULL;
+	spin_unlock_bh(&xfrm_policy_sk_bundle_lock);
 
-		table = net->xfrm.policy_bydst[dir].table;
-		for (i = net->xfrm.policy_bydst[dir].hmask; i >= 0; i--) {
-			hlist_for_each_entry(pol, entry, table + i, bydst)
-				prune_one_bundle(pol, func, &gc_list);
-		}
-	}
-	read_unlock_bh(&xfrm_policy_lock);
-
-	while (gc_list) {
-		struct dst_entry *dst = gc_list;
-		gc_list = dst->next;
-		dst_free(dst);
+	while (head) {
+		next = head->next;
+		dst_free(head);
+		head = next;
 	}
 }
 
-static int unused_bundle(struct dst_entry *dst)
-{
-	return !atomic_read(&dst->__refcnt);
-}
-
-static void __xfrm_garbage_collect(struct net *net)
-{
-	xfrm_prune_bundles(net, unused_bundle);
-}
-
-static int xfrm_flush_bundles(struct net *net)
-{
-	xfrm_prune_bundles(net, stale_bundle);
-	return 0;
-}
-
 static void xfrm_init_pmtu(struct dst_entry *dst)
 {
 	do {
@@ -2283,7 +2332,9 @@ int xfrm_bundle_ok(struct xfrm_policy *pol, struct xfrm_dst *first,
 			return 0;
 		if (dst->xfrm->km.state != XFRM_STATE_VALID)
 			return 0;
-		if (xdst->genid != dst->xfrm->genid)
+		if (xdst->xfrm_genid != dst->xfrm->genid)
+			return 0;
+		if (xdst->policy_genid != atomic_read(&xdst->pols[0]->genid))
 			return 0;
 
 		if (strict && fl &&
@@ -2448,7 +2499,7 @@ static int xfrm_dev_event(struct notifier_block *this, unsigned long event, void
 
 	switch (event) {
 	case NETDEV_DOWN:
-		xfrm_flush_bundles(dev_net(dev));
+		__xfrm_garbage_collect(dev_net(dev));
 	}
 	return NOTIFY_DONE;
 }
@@ -2780,7 +2831,6 @@ static int xfrm_policy_migrate(struct xfrm_policy *pol,
 			       struct xfrm_migrate *m, int num_migrate)
 {
 	struct xfrm_migrate *mp;
-	struct dst_entry *dst;
 	int i, j, n = 0;
 
 	write_lock_bh(&pol->lock);
@@ -2805,10 +2855,7 @@ static int xfrm_policy_migrate(struct xfrm_policy *pol,
 			       sizeof(pol->xfrm_vec[i].saddr));
 			pol->xfrm_vec[i].encap_family = mp->new_family;
 			/* flush bundles */
-			while ((dst = pol->bundles) != NULL) {
-				pol->bundles = dst->next;
-				dst_free(dst);
-			}
+			atomic_inc(&pol->genid);
 		}
 	}
 
-- 
1.6.3.3


^ permalink raw reply related

* [PATCH 4/4] flow: delayed deletion of flow cache entries
From: Timo Teras @ 2010-04-05  7:00 UTC (permalink / raw)
  To: netdev; +Cc: Herbert Xu, Timo Teras
In-Reply-To: <1270450824-2928-1-git-send-email-timo.teras@iki.fi>

Speed up lookups by freeing flow cache entries later. After
virtualizing flow cache entry operations, the flow cache may now
end up calling policy or bundle destructor which can be slowish.

Because the gc list is more efficient as a doubly linked list, the flow
cache is converted to use the common hlist and list macros where appropriate.

Signed-off-by: Timo Teras <timo.teras@iki.fi>
---
 net/core/flow.c |  100 ++++++++++++++++++++++++++++++++++++++-----------------
 1 files changed, 69 insertions(+), 31 deletions(-)

diff --git a/net/core/flow.c b/net/core/flow.c
index 15151f5..5d71e24 100644
--- a/net/core/flow.c
+++ b/net/core/flow.c
@@ -26,7 +26,10 @@
 #include <linux/security.h>
 
 struct flow_cache_entry {
-	struct flow_cache_entry		*next;
+	union {
+		struct hlist_node	hlist;
+		struct list_head	gc_list;
+	} u;
 	u16				family;
 	u8				dir;
 	u32				genid;
@@ -35,7 +38,7 @@ struct flow_cache_entry {
 };
 
 struct flow_cache_percpu {
-	struct flow_cache_entry		**hash_table;
+	struct hlist_head		*hash_table;
 	int				hash_count;
 	u32				hash_rnd;
 	int				hash_rnd_recalc;
@@ -62,6 +65,9 @@ atomic_t flow_cache_genid = ATOMIC_INIT(0);
 static struct flow_cache flow_cache_global;
 static struct kmem_cache *flow_cachep;
 
+static DEFINE_SPINLOCK(flow_cache_gc_lock);
+static LIST_HEAD(flow_cache_gc_list);
+
 #define flow_cache_hash_size(cache)	(1 << (cache)->hash_shift)
 #define FLOW_HASH_RND_PERIOD		(10 * 60 * HZ)
 
@@ -86,38 +92,66 @@ static int flow_entry_valid(struct flow_cache_entry *fle)
 	return 1;
 }
 
-static void flow_entry_kill(struct flow_cache *fc,
-			    struct flow_cache_percpu *fcp,
-			    struct flow_cache_entry *fle)
+static void flow_entry_kill(struct flow_cache_entry *fle)
 {
 	if (fle->object)
 		fle->object->ops->delete(fle->object);
 	kmem_cache_free(flow_cachep, fle);
-	fcp->hash_count--;
+}
+
+static void flow_cache_gc_task(struct work_struct *work)
+{
+	struct list_head gc_list;
+	struct flow_cache_entry *fce, *n;
+
+	INIT_LIST_HEAD(&gc_list);
+	spin_lock_bh(&flow_cache_gc_lock);
+	list_splice_tail_init(&flow_cache_gc_list, &gc_list);
+	spin_unlock_bh(&flow_cache_gc_lock);
+
+	list_for_each_entry_safe(fce, n, &gc_list, u.gc_list)
+		flow_entry_kill(fce);
+}
+static DECLARE_WORK(flow_cache_gc_work, flow_cache_gc_task);
+
+static void flow_cache_queue_garbage(struct flow_cache_percpu *fcp,
+				     int deleted, struct list_head *gc_list)
+{
+	if (deleted) {
+		fcp->hash_count -= deleted;
+		spin_lock_bh(&flow_cache_gc_lock);
+		list_splice_tail(gc_list, &flow_cache_gc_list);
+		spin_unlock_bh(&flow_cache_gc_lock);
+		schedule_work(&flow_cache_gc_work);
+	}
 }
 
 static void __flow_cache_shrink(struct flow_cache *fc,
 				struct flow_cache_percpu *fcp,
 				int shrink_to)
 {
-	struct flow_cache_entry *fle, **flp;
-	int i;
+	struct flow_cache_entry *fle;
+	struct hlist_node *entry, *tmp;
+	LIST_HEAD(gc_list);
+	int i, deleted = 0;
 
 	for (i = 0; i < flow_cache_hash_size(fc); i++) {
 		int saved = 0;
 
-		flp = &fcp->hash_table[i];
-		while ((fle = *flp) != NULL) {
+		hlist_for_each_entry_safe(fle, entry, tmp,
+					  &fcp->hash_table[i], u.hlist) {
 			if (saved < shrink_to &&
 			    flow_entry_valid(fle)) {
 				saved++;
-				flp = &fle->next;
 			} else {
-				*flp = fle->next;
-				flow_entry_kill(fc, fcp, fle);
+				deleted++;
+				hlist_del(&fle->u.hlist);
+				list_add_tail(&fle->u.gc_list, &gc_list);
 			}
 		}
 	}
+
+	flow_cache_queue_garbage(fcp, deleted, &gc_list);
 }
 
 static void flow_cache_shrink(struct flow_cache *fc,
@@ -182,7 +216,8 @@ flow_cache_lookup(struct net *net, struct flowi *key, u16 family, u8 dir,
 {
 	struct flow_cache *fc = &flow_cache_global;
 	struct flow_cache_percpu *fcp;
-	struct flow_cache_entry *fle, **head;
+	struct flow_cache_entry *fle, *tfle;
+	struct hlist_node *entry;
 	struct flow_cache_object *flo;
 	unsigned int hash;
 
@@ -200,12 +235,13 @@ flow_cache_lookup(struct net *net, struct flowi *key, u16 family, u8 dir,
 		flow_new_hash_rnd(fc, fcp);
 
 	hash = flow_hash_code(fc, fcp, key);
-	head = &fcp->hash_table[hash];
-	for (fle = *head; fle; fle = fle->next) {
-		if (fle->family == family &&
-		    fle->dir == dir &&
-		    flow_key_compare(key, &fle->key) == 0)
+	hlist_for_each_entry(tfle, entry, &fcp->hash_table[hash], u.hlist) {
+		if (tfle->family == family &&
+		    tfle->dir == dir &&
+		    flow_key_compare(key, &tfle->key) == 0) {
+			fle = tfle;
 			break;
+		}
 	}
 
 	if (!fle) {
@@ -214,12 +250,11 @@ flow_cache_lookup(struct net *net, struct flowi *key, u16 family, u8 dir,
 
 		fle = kmem_cache_alloc(flow_cachep, GFP_ATOMIC);
 		if (fle) {
-			fle->next = *head;
-			*head = fle;
 			fle->family = family;
 			fle->dir = dir;
 			memcpy(&fle->key, key, sizeof(*key));
 			fle->object = NULL;
+			hlist_add_head(&fle->u.hlist, &fcp->hash_table[hash]);
 			fcp->hash_count++;
 		}
 	} else if (fle->genid == atomic_read(&flow_cache_genid)) {
@@ -255,23 +290,26 @@ static void flow_cache_flush_tasklet(unsigned long data)
 	struct flow_flush_info *info = (void *)data;
 	struct flow_cache *fc = info->cache;
 	struct flow_cache_percpu *fcp;
-	int i;
+	struct flow_cache_entry *fle;
+	struct hlist_node *entry, *tmp;
+	LIST_HEAD(gc_list);
+	int i, deleted = 0;
 
 	fcp = per_cpu_ptr(fc->percpu, smp_processor_id());
 	for (i = 0; i < flow_cache_hash_size(fc); i++) {
-		struct flow_cache_entry *fle;
-
-		fle = fcp->hash_table[i];
-		for (; fle; fle = fle->next) {
+		hlist_for_each_entry_safe(fle, entry, tmp,
+					  &fcp->hash_table[i], u.hlist) {
 			if (flow_entry_valid(fle))
 				continue;
 
-			if (fle->object)
-				fle->object->ops->delete(fle->object);
-			fle->object = NULL;
+			deleted++;
+			hlist_del(&fle->u.hlist);
+			list_add_tail(&fle->u.gc_list, &gc_list);
 		}
 	}
 
+	flow_cache_queue_garbage(fcp, deleted, &gc_list);
+
 	if (atomic_dec_and_test(&info->cpuleft))
 		complete(&info->completion);
 }
@@ -313,7 +351,7 @@ void flow_cache_flush(void)
 static void __init flow_cache_cpu_prepare(struct flow_cache *fc,
 					  struct flow_cache_percpu *fcp)
 {
-	fcp->hash_table = (struct flow_cache_entry **)
+	fcp->hash_table = (struct hlist_head *)
 		__get_free_pages(GFP_KERNEL|__GFP_ZERO, fc->order);
 	if (!fcp->hash_table)
 		panic("NET: failed to allocate flow cache order %lu\n", fc->order);
@@ -347,7 +385,7 @@ static int flow_cache_init(struct flow_cache *fc)
 
 	for (order = 0;
 	     (PAGE_SIZE << order) <
-		     (sizeof(struct flow_cache_entry *)*flow_cache_hash_size(fc));
+		     (sizeof(struct hlist_head)*flow_cache_hash_size(fc));
 	     order++)
 		/* NOTHING */;
 	fc->order = order;
-- 
1.6.3.3


^ permalink raw reply related
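
The deferred-deletion pattern described in the patch above can be sketched
in userspace C. This is only an illustration under stated assumptions: a
singly linked pending list replaces the kernel's list_head, the places where
flow_cache_gc_lock would be held are marked with comments, and gc_task() is
called directly instead of being scheduled on a workqueue. All names below
(struct entry, queue_garbage, gc_task, new_entry) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct flow_cache_entry. */
struct entry {
	struct entry *next;
	int id;
};

static struct entry *pending_head;	/* entries queued for destruction */
static int destroyed;			/* counts destructor calls */

/* Fast path: the caller has already unlinked e; only queue it here. */
static void queue_garbage(struct entry *e)
{
	/* the kernel code holds flow_cache_gc_lock around these two lines */
	e->next = pending_head;
	pending_head = e;
}

/* Slow path: detach the whole pending list, then destroy it unlocked. */
static int gc_task(void)
{
	struct entry *list, *next;
	int n = 0;

	/* lock would be taken here */
	list = pending_head;
	pending_head = NULL;
	/* lock would be dropped here, before any destructor runs */

	while (list) {
		next = list->next;	/* read next before freeing */
		free(list);
		destroyed++;
		n++;
		list = next;
	}
	return n;
}

/* Allocate a test entry; returns NULL on allocation failure. */
static struct entry *new_entry(int id)
{
	struct entry *e = malloc(sizeof(*e));

	if (e) {
		e->next = NULL;
		e->id = id;
	}
	return e;
}
```

The point of the split is that the lookup path only pays for a list
insertion, while the potentially slow destructors run later, outside the
lock.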

* Re: [PATCH 2/2] benet: fix the misusage of zero dma address
From: Sathya Perla @ 2010-04-05  7:10 UTC (permalink / raw)
  To: FUJITA Tomonori; +Cc: sathyap, subbus, sarveshwarb, ajitk, netdev
In-Reply-To: <1270176803-8561-2-git-send-email-fujita.tomonori@lab.ntt.co.jp>

Hi Fujita, thanks for the patch; please see below:

On 02/04/10 11:53 +0900, FUJITA Tomonori wrote:
> benet driver wrongly assumes that zero is an invalid dma address
> (calls dma_unmap_page for only non zero dma addresses). Zero is a
> valid dma address on some architectures. The dma length can be used
> here.
> 
> Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
> ---
>  drivers/net/benet/be_main.c |    6 ++++--
>  1 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/benet/be_main.c b/drivers/net/benet/be_main.c
> index 8d5e27b..8828c7d 100644
> --- a/drivers/net/benet/be_main.c
> +++ b/drivers/net/benet/be_main.c
> @@ -394,13 +394,15 @@ static void unmap_tx_frag(struct pci_dev *pdev, struct be_eth_wrb *wrb,
>  	be_dws_le_to_cpu(wrb, sizeof(*wrb));
>  
>  	dma = (u64)wrb->frag_pa_hi << 32 | (u64)wrb->frag_pa_lo;
> -	if (dma != 0) {
> +	if (wrb->frag_len) {
>  		if (unmap_single)
>  			pci_unmap_single(pdev, dma, wrb->frag_len,
>  				PCI_DMA_TODEVICE);
>  		else
>  			pci_unmap_page(pdev, dma, wrb->frag_len,
>  				PCI_DMA_TODEVICE);
> +
> +		wrb->frag_len = 0;
Why does wrb->frag_len need to be reset here?
In the TX path, it is set to the proper value for data wrbs and zero
for dummy and hdr wrbs.
>  	}
>  }
>  
> @@ -466,9 +468,9 @@ dma_err:
>  	txq->head = map_head;
>  	while (copied) {
>  		wrb = queue_head_node(txq);
> +		copied -= wrb->frag_len;
>  		unmap_tx_frag(pdev, wrb, map_single);
>  		map_single = false;
> -		copied -= wrb->frag_len;
>  		queue_head_inc(txq);
>  	}
>  	return 0;
> -- 
> 1.7.0
> 

^ permalink raw reply

* Re: [PATCH 2/2] benet: fix the misusage of zero dma address
From: FUJITA Tomonori @ 2010-04-05  7:40 UTC (permalink / raw)
  To: sathyap; +Cc: fujita.tomonori, subbus, sarveshwarb, ajitk, netdev
In-Reply-To: <20100405071059.GA32671@serverengines.com>

On Mon, 5 Apr 2010 12:40:59 +0530
Sathya Perla <sathyap@serverengines.com> wrote:

> Hi Fujita, thanks for the patch; please see below:
> 
> On 02/04/10 11:53 +0900, FUJITA Tomonori wrote:
> > benet driver wrongly assumes that zero is an invalid dma address
> > (calls dma_unmap_page for only non zero dma addresses). Zero is a
> > valid dma address on some architectures. The dma length can be used
> > here.
> > 
> > Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
> > ---
> >  drivers/net/benet/be_main.c |    6 ++++--
> >  1 files changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/net/benet/be_main.c b/drivers/net/benet/be_main.c
> > index 8d5e27b..8828c7d 100644
> > --- a/drivers/net/benet/be_main.c
> > +++ b/drivers/net/benet/be_main.c
> > @@ -394,13 +394,15 @@ static void unmap_tx_frag(struct pci_dev *pdev, struct be_eth_wrb *wrb,
> >  	be_dws_le_to_cpu(wrb, sizeof(*wrb));
> >  
> >  	dma = (u64)wrb->frag_pa_hi << 32 | (u64)wrb->frag_pa_lo;
> > -	if (dma != 0) {
> > +	if (wrb->frag_len) {
> >  		if (unmap_single)
> >  			pci_unmap_single(pdev, dma, wrb->frag_len,
> >  				PCI_DMA_TODEVICE);
> >  		else
> >  			pci_unmap_page(pdev, dma, wrb->frag_len,
> >  				PCI_DMA_TODEVICE);
> > +
> > +		wrb->frag_len = 0;
> Why does wrb->frag_len need to be reset here?
> In the TX path, it is set to the proper value for data wrbs and zero
> for dummy and hdr wrbs.

I guess I misunderstood why unmap_tx_frag() checks the dma address.
Is the check there to avoid calling the pci_unmap_* API for dummy and
hdr wrbs?

Anyway, if wrb->frag_len doesn't need to be reset here, is the following
patch ok?

=
From: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Subject: [PATCH v2 2/2] benet: fix the misusage of zero dma address

benet driver wrongly assumes that zero is an invalid dma address
(calls dma_unmap_page for only non zero dma addresses). Zero is a
valid dma address on some architectures. The dma length can be used
here.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
---
 drivers/net/benet/be_main.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/benet/be_main.c b/drivers/net/benet/be_main.c
index a27a0a1..c95e1a2 100644
--- a/drivers/net/benet/be_main.c
+++ b/drivers/net/benet/be_main.c
@@ -404,7 +404,7 @@ static void unmap_tx_frag(struct pci_dev *pdev, struct be_eth_wrb *wrb,
 	be_dws_le_to_cpu(wrb, sizeof(*wrb));
 
 	dma = (u64)wrb->frag_pa_hi << 32 | (u64)wrb->frag_pa_lo;
-	if (dma != 0) {
+	if (wrb->frag_len) {
 		if (unmap_single)
 			pci_unmap_single(pdev, dma, wrb->frag_len,
 				PCI_DMA_TODEVICE);
-- 
1.7.0



^ permalink raw reply related
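
The idea in the v2 patch above, using the fragment length rather than the
dma address to decide whether a descriptor was mapped, can be sketched as
follows. This is a simplified userspace illustration, not the driver code:
struct tx_desc is a hypothetical stand-in for struct be_eth_wrb, and
fake_unmap() stands in for pci_unmap_single()/pci_unmap_page().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified descriptor; the real struct be_eth_wrb differs. */
struct tx_desc {
	uint32_t addr_hi;
	uint32_t addr_lo;
	uint32_t len;	/* 0 for header and dummy descriptors */
};

static int unmap_calls;
static uint64_t last_unmapped;

/* Stand-in for pci_unmap_single()/pci_unmap_page(). */
static void fake_unmap(uint64_t dma, uint32_t len)
{
	(void)len;
	last_unmapped = dma;
	unmap_calls++;
}

/* Unmap only descriptors that carry data; address 0 is treated as valid. */
static bool unmap_desc(const struct tx_desc *d)
{
	uint64_t dma = (uint64_t)d->addr_hi << 32 | d->addr_lo;

	if (d->len == 0)
		return false;	/* header or dummy: nothing was mapped */
	fake_unmap(dma, d->len);
	return true;
}
```

With this check, a data descriptor whose mapping happens to sit at dma
address 0 is still unmapped, while header and dummy descriptors are skipped
because their length is zero.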

* Re: [PATCH 1/6] sysfs: Basic support for multiple super blocks
From: Tejun Heo @ 2010-04-05  7:45 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg Kroah-Hartman, Kay Sievers, linux-kernel, Cornelia Huck,
	linux-fsdevel, Eric Dumazet, Benjamin LaHaise, Serge Hallyn,
	netdev
In-Reply-To: <m1634d82e0.fsf@fess.ebiederm.org>

Hello, Eric.

On 03/31/2010 02:51 PM, Eric W. Biederman wrote:
>> I haven't looked at later patches but I suppose this is gonna be
>> filled with more meaningful stuff later. 
> 
> Yes it will.
> 
>> One (possibly silly) thing
>> that stands out compared to get_sb_single() is missing remount
>> handling.  Is it intended?
> 
> There is nothing for a remount to do so I ignore it.   The only
> thing that would possibly be meaningful is a read-only mount,
> and nothing I know of sysfs suggests read-only mounts of sysfs
> work, or make any sense.

I see.  Wouldn't it be better to make that design choice evident by
stating the choice in the comment or at least in the patch
description?  As it currently stands, you're burying a clear
functional change in a seemingly innocent patch which contains zero
lines of comments and two lines of description.  The same pattern holds
for this whole patchset.  Where are the comments and descriptions
about the design and implementation?  :-(

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH 3/6] sysfs: Implement sysfs tagged directory support.
From: Tejun Heo @ 2010-04-05  8:17 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg Kroah-Hartman, Kay Sievers, linux-kernel, Cornelia Huck,
	linux-fsdevel, Eric Dumazet, Benjamin LaHaise, Serge Hallyn,
	netdev, Benjamin Thery
In-Reply-To: <m1oci4zv7h.fsf@fess.ebiederm.org>

Hello, Eric.

On 03/31/2010 06:39 PM, Eric W. Biederman wrote:
> Let me try a happy medium between overwhelming and too little
> information by giving you some excerpts, and a bit of overview.
> 
> (Ugh, after having written this I certainly will agree that we
>  have so many layers in the device model that they become
>  obfuscating abstractions).

Yeah, exactly, and this patchset is pushing it further with no
documentation and indirections to high heavens.  As someone who
doesn't have much experience with namespaces, I can't make much sense
of this patchset and it obfuscates the whole kobject thing more and
that's a bad direction to be heading toward.

> Looking through my code there are 3 types of callbacks.
> - Callbacks to find the namespace type of a kobject's children.
>   .child_ns_type

Can you please also explain the relationships among kobjects, ns_types
and NSes?

> - Callbacks to find the namespace of a kobject.
>   .namespace
> - Callbacks on a namespace type to find the namespace
>   of a particular context.
>   .current_ns
>   .initial_ns  (not used in my patchset)
>   .netlink_ns  (not used in my patchset)
> 
> In a world of weird explicitness I expect .child_ns_type and
> .namespace could be made to go away by pushing through explicit
> ns_type, and namespace parameters everywhere. But that seems
> like an awful lot of unnecessary code churn and bloat with
> the only real advantage being that we have an abstraction
> stored explicitly at each layer.

* How much churn would it be?  I would be willing to trade quite a bit
  if the following can go away.  The sheer amount of indirection there
  scares me a lot.

  struct kobj_type {
  ...
	const struct kobj_ns_type_operations *(*child_ns_type)(struct kobject *kobj);
  ...
  };

* Is it necessary to teach kobject layer the concept of namespaces?
  Wouldn't it be possible to let kobject and sysfs deal with tags and
  make namespaces use them?

> static int kobj_bcast_filter(struct sock *dest_sk, struct sk_buff *skb, void *data)
> {
> 	struct kobject *kobj = data;
> 	const struct kobj_ns_type_operations *ops;
> 
> 	ops = kobj_ns_ops(kobj);
> 	if (ops) {
> 		const void *sock_ns, *ns;
> 		ns = kobj->ktype->namespace(kobj);
> 		sock_ns = ops->netlink_ns(dest_sk);
> 		return sock_ns != ns;
> 	}
> 
> 	return 0;
> }
> 
> initial_ns is used to figure out what the initial/default
> namespace is for a class of namespaces.  We only report
> /sbin/hotplug events in the initial network namespace.
> At least for now.
> 
> static int kobj_usermode_filter(struct kobject *kobj)
> {
> 	const struct kobj_ns_type_operations *ops;
> 
> 	ops = kobj_ns_ops(kobj);
> 	if (ops) {
> 		const void *init_ns, *ns;
> 		ns = kobj->ktype->namespace(kobj);
> 		init_ns = ops->initial_ns();
> 		return ns != init_ns;
> 	}
> 
> 	return 0;
> }

I can understand you would need two different ways of establishing the
accessor depending on the mode of access (file IO or netlink) but can
initial_ns ever be dynamic?  Can't it just be void *initial_ns instead
of a callback?

Thanks.

-- 
tejun

^ permalink raw reply
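
The question above, whether initial_ns needs to be a callback or could be a
plain pointer, can be made concrete with a small sketch. The types below are
hypothetical simplifications and not the kernel's kobj_ns_type_operations;
the sketch only assumes that the initial namespace is a fixed object whose
address is known when the operations table is defined.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified types; not the kernel's definitions. */
struct ns_ops_cb {
	const void *(*initial_ns)(void);	/* callback form */
};

struct ns_ops_data {
	const void *initial_ns;			/* plain pointer form */
};

static int init_net_ns;		/* stands in for the initial namespace */
static int other_net_ns;	/* stands in for any other namespace */

static const void *net_initial_ns(void)
{
	return &init_net_ns;
}

static const struct ns_ops_cb net_ops_cb = { .initial_ns = net_initial_ns };
static const struct ns_ops_data net_ops_data = { .initial_ns = &init_net_ns };

/* Both filters return nonzero for objects outside the initial namespace. */
static int filter_cb(const struct ns_ops_cb *ops, const void *ns)
{
	return ns != ops->initial_ns();
}

static int filter_data(const struct ns_ops_data *ops, const void *ns)
{
	return ns != ops->initial_ns;
}
```

If the initial namespace never changes, both forms give the same answer and
the pointer form saves one indirection. The callback form only matters if
the value has to be computed at call time.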

* Re: [PATCH 2/2] benet: fix the misusage of zero dma address
From: Sathya Perla @ 2010-04-05  8:22 UTC (permalink / raw)
  To: FUJITA Tomonori; +Cc: sathyap, subbus, sarveshwarb, ajitk, netdev
In-Reply-To: <20100405163942P.fujita.tomonori@lab.ntt.co.jp>

On 05/04/10 16:40 +0900, FUJITA Tomonori wrote:
> > > +		wrb->frag_len = 0;
> > Why does wrb->frag_len need to be reset here?
> > In the TX path, it is set to the proper value for data wrbs and zero
> > for dummy and hdr wrbs.
> 
> I guess I misunderstood why unmap_tx_frag() checks the dma address.
> Is the check there to avoid calling the pci_unmap_* API for dummy and
> hdr wrbs?
Yes.
> 
> Anyway, if wrb->frag_len doesn't need to be reset here, is the following
> patch ok?
Yes. Thanks.

Acked-by: Sathya Perla <sathyap@serverengines.com>
> 
> =
> From: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
> Subject: [PATCH v2 2/2] benet: fix the misusage of zero dma address
> 
> benet driver wrongly assumes that zero is an invalid dma address
> (calls dma_unmap_page for only non zero dma addresses). Zero is a
> valid dma address on some architectures. The dma length can be used
> here.
> 
> Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
> ---
>  drivers/net/benet/be_main.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/net/benet/be_main.c b/drivers/net/benet/be_main.c
> index a27a0a1..c95e1a2 100644
> --- a/drivers/net/benet/be_main.c
> +++ b/drivers/net/benet/be_main.c
> @@ -404,7 +404,7 @@ static void unmap_tx_frag(struct pci_dev *pdev, struct be_eth_wrb *wrb,
>  	be_dws_le_to_cpu(wrb, sizeof(*wrb));
>  
>  	dma = (u64)wrb->frag_pa_hi << 32 | (u64)wrb->frag_pa_lo;
> -	if (dma != 0) {
> +	if (wrb->frag_len) {
>  		if (unmap_single)
>  			pci_unmap_single(pdev, dma, wrb->frag_len,
>  				PCI_DMA_TODEVICE);
> -- 
> 1.7.0
> 
> 

^ permalink raw reply

* Re: [RFC PATCH 2/2] netdev: an usage example on igb
From: Eric Dumazet @ 2010-04-05  8:30 UTC (permalink / raw)
  To: Koki Sanagi
  Cc: netdev, izumi.taku, kaneshige.kenji, davem, nhorman,
	jeffrey.t.kirsher, jesse.brandeburg, bruce.w.allan,
	alexander.h.duyck, peter.p.waskiewicz.jr, john.ronciak
In-Reply-To: <4BB98940.5070003@jp.fujitsu.com>

On Monday, 5 April 2010 at 15:54 +0900, Koki Sanagi wrote:
> This patch is a usage example of the previous patch's buffer on igb.
> The output is like below.
> 
> # cat /sys/kernel/debug/ndrvbuf/igb-trace-0000\:03\:00.0/buffer
> [  1] 50462.369207: clean_tx qidx=1 ntu=154->156
> [  0] 50462.369241: clean_rx qidx=0 ntu=111->112
> [  0] 50462.369250: xmit qidx=1 ntu=156->158
> [  1] 50462.369256: clean_tx qidx=1 ntu=156->158
> [  1] 50462.369342: clean_rx qidx=0 ntu=113->114
> [  1] 50462.369439: clean_rx qidx=0 ntu=114->115
> 
> This example prints in its own style, because it passes its own print
> function (igb_trace_read) at registration.
> 
> register_ndrvbuf(buname, 1000000, igb_trace_read);
> 
> If you pass NULL as arg3, output uses the ndrvbuf default style.
> If you pass 0 as the size (arg2), recording is initially disabled (but a
> small buffer is still allocated).
> When you set the size to a non-zero value, recording is enabled.
> 
> Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
> ---
>   drivers/net/igb/Makefile    |    2 +-
>   drivers/net/igb/igb.h       |    1 +
>   drivers/net/igb/igb_main.c  |   10 +++++-
>   drivers/net/igb/igb_trace.c |   81 +++++++++++++++++++++++++++++++++++++++++++
>   drivers/net/igb/igb_trace.h |   21 +++++++++++
>   5 files changed, 113 insertions(+), 2 deletions(-)
> 

This depends on NDRVBUF, yet I see no Kconfig change in this patch.




^ permalink raw reply

* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Herbert Xu @ 2010-04-05  8:33 UTC (permalink / raw)
  To: Timo Teras; +Cc: netdev
In-Reply-To: <1270450824-2928-2-git-send-email-timo.teras@iki.fi>

On Mon, Apr 05, 2010 at 10:00:21AM +0300, Timo Teras wrote:
>
> @@ -219,33 +222,32 @@ void *flow_cache_lookup(struct net *net, struct flowi *key, u16 family, u8 dir,
>  			fle->object = NULL;
>  			fcp->hash_count++;
>  		}
> +	} else if (fle->genid == atomic_read(&flow_cache_genid)) {
> +		flo = fle->object;
> +		if (!flo)
> +			goto ret_object;
> +		flo = flo->ops->get(flo);
> +		if (flo)
> +			goto ret_object;
>  	}
>  
>  nocache:
> -	{
> -		int err;
> -		void *obj;
> -		atomic_t *obj_ref;
> -
> -		err = resolver(net, key, family, dir, &obj, &obj_ref);
> -
> -		if (fle && !err) {
> -			fle->genid = atomic_read(&flow_cache_genid);
> -
> -			if (fle->object)
> -				atomic_dec(fle->object_ref);
> -
> -			fle->object = obj;
> -			fle->object_ref = obj_ref;
> -			if (obj)
> -				atomic_inc(fle->object_ref);
> +	flo = resolver(net, key, family, dir, fle ? fle->object : NULL, ctx);
> +	if (fle) {
> +		fle->genid = atomic_read(&flow_cache_genid);
> +		if (IS_ERR(flo)) {
> +			fle->genid--;
> +			fle->object = NULL;

Shouldn't we call fle->object->ops->delete here?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Timo Teräs @ 2010-04-05  8:36 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev
In-Reply-To: <20100405083302.GA16636@gondor.apana.org.au>

Herbert Xu wrote:
> On Mon, Apr 05, 2010 at 10:00:21AM +0300, Timo Teras wrote:
>> @@ -219,33 +222,32 @@ void *flow_cache_lookup(struct net *net, struct flowi *key, u16 family, u8 dir,
>> +	flo = resolver(net, key, family, dir, fle ? fle->object : NULL, ctx);
>> +	if (fle) {
>> +		fle->genid = atomic_read(&flow_cache_genid);
>> +		if (IS_ERR(flo)) {
>> +			fle->genid--;
>> +			fle->object = NULL;
> 
> Shouldn't we call fle->object->ops->delete here?

The resolver function releases the old object.

It might actually make more sense to pass struct flow_cache_object**
so the resolver can twiddle the flow_cache_entry's object. Then it'd
be more explicit that the resolver is replacing entries.


^ permalink raw reply
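
The alternative mentioned in the reply above, passing the address of the
entry's object slot so that the resolver itself replaces the cached object,
might look roughly like this. It is a sketch under stated assumptions: the
types and the resolve_into() helper are hypothetical, and "release" is
reduced to setting a flag rather than calling ops->delete.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the flow cache types. */
struct cache_object {
	int released;	/* set when the owner drops the object */
};

struct cache_entry {
	struct cache_object *object;
};

static void release_object(struct cache_object *obj)
{
	if (obj)
		obj->released = 1;
}

/*
 * Resolver that receives the address of the entry's slot. It releases
 * the old object and stores the replacement (NULL on failure), so the
 * caller never touches a stale pointer.
 */
static int resolve_into(struct cache_object **slot, struct cache_object *fresh)
{
	release_object(*slot);
	*slot = fresh;
	return fresh ? 0 : -1;
}
```

Written this way, ownership of the old object visibly moves into the
resolver, which makes it clearer why the caller does not need to call the
delete operation itself.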
