Netdev List


* Re: unregister_netdevice: waiting for lo to become free. Usage count = 8
From: Julian Anastasov @ 2011-04-15 20:11 UTC (permalink / raw)
  To: Hans Schillstrom; +Cc: Simon Horman, netdev, lvs-devel, Eric W. Biederman
In-Reply-To: <201104150901.47214.hans@schillstrom.com>


	Hello,

On Fri, 15 Apr 2011, Hans Schillstrom wrote:

> Hello Julian
> 
> I'm trying to fix the cleanup process when a namespace gets "killed",
> which is a new feature for ipvs. However an old problem appears again
> 
> When there has been traffic through ipvs where the destination is unreachable
> the usage count on loopback dev increases by one for every packet....

	What is the kernel version?

> I guess that's because of this rule:
> 
> # ip route list table all
> ...
> unreachable default dev lo  table 0  proto kernel  metric 4294967295  error -101 hoplimit 25
> ...
> 
> I made a test just forwarding packets through the same container (ipvs loaded)
> to an unreachable destination and that test had a balanced count i.e. it was possible to reboot the container.

	Can you explain what you mean by an unreachable
destination? Are you adding some rejecting route?

> Do you have an idea why this happens in the ipvs case?

	Do you see the "Removing destination" messages with debug
level 3? Only real servers can hold a dest->dst_cache reference
to the dev, which can be a problem because the real servers are not
deleted immediately; on traffic they are moved to the trash
list. But ip_vs_trash_cleanup() should remove any leftover
structures. You should check in the debug output that all servers
are deleted. If all real server structures are freed but the
problem remains, we should look more deeply into the
dest->dst_cache usage. Is DR or NAT used?

	I assume cleanup really happens in this order:

ip_vs_cleanup():
	nf_unregister_hooks()
	...
	ip_vs_conn_cleanup()
	...
	ip_vs_control_cleanup()

Regards

--
Julian Anastasov <ja@ssi.bg>


* Re: Feature request: "inverted" ping -a (beep on failure)
From: Christian Boltz @ 2011-04-15 20:11 UTC (permalink / raw)
  To: Randy Dunlap, netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20110415124937.6e746646.rdunlap-/UHa2rfvQTnk1uMJSBkQmQ@public.gmane.org>

Hello,

On Friday, 15 April 2011, Randy Dunlap wrote:
> On Fri, 15 Apr 2011 21:35:32 +0200 Christian Boltz wrote:
> > I'd like to have the exact opposite of it: beep when pinging fails.
[...]
> Couldn't you look for exit code (status) 1 and then do a bell/beep
> (or play a sound file :)?

That would require that I know in advance when exactly the server is 
unreachable - but in this case, I wouldn't need to ping it ;-)

To have this working, ping would need an option "exit on error", which 
it doesn't have AFAIK.

A workaround is to run ping -c1 in a loop:

while true ; do
    ping -c1 $server || beep
    sleep 1
done

but I'd prefer to have something like this directly in ping ;-)

> Or do you want ping to beep and then continue running?

Yes, that's exactly what I want.


Regards

Christian Boltz
-- 
> I would like to scan some users who send their mail through a mail server
> (sendmail preferred, postfix also possible).
A photocopier is enough for that. Pants down, sit the users on it and press
"Copy"!   [> Ralf Thomas and Sandy Drobic in suse-linux]


* Re: Feature request: "inverted" ping -a (beep on failure)
From: Randy Dunlap @ 2011-04-15 20:14 UTC (permalink / raw)
  To: Christian Boltz; +Cc: netdev
In-Reply-To: <201104152211.46180@tux.boltz.de.vu>

On Fri, 15 Apr 2011 22:11:45 +0200 Christian Boltz wrote:

> Hello,
> 
> On Friday, 15 April 2011, Randy Dunlap wrote:
> > On Fri, 15 Apr 2011 21:35:32 +0200 Christian Boltz wrote:
> > > I'd like to have the exact opposite of it: beep when pinging fails.
> [...]
> > Couldn't you look for exit code (status) 1 and then do a bell/beep
> > (or play a sound file :)?
> 
> That would require that I know in advance when exactly the server is 
> unreachable - but in this case, I wouldn't need to ping it ;-)

I didn't follow that, but it's OK.

> To have this working, ping would need an option "exit on error", which 
> it doesn't have AFAIK.

'man ping' discusses exit status codes:

       If  ping  does  not  receive any reply packets at all it will exit with
       code 1. If a packet count and deadline are both  specified,  and  fewer
       than  count  packets are received by the time the deadline has arrived,
       it will also exit with code 1.  On other error it exits  with  code  2.
       Otherwise  it exits with code 0. This makes it possible to use the exit
       code to see if a host is alive or not.
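
As a rough illustration of the exit codes quoted above, here is a minimal
sketch in C that sends one probe and rings the terminal bell when no reply
comes back. It assumes a ping binary on the PATH that accepts -c1; the names
classify_ping_exit and probe_once are made up for this example and are not
part of iputils.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

enum ping_result { PING_ALIVE, PING_NO_REPLY, PING_ERROR };

/* Map ping's exit code to the three cases in the man page excerpt. */
enum ping_result classify_ping_exit(int exit_code)
{
	switch (exit_code) {
	case 0:
		return PING_ALIVE;
	case 1:
		return PING_NO_REPLY;
	default:
		return PING_ERROR;
	}
}

/*
 * Send a single probe and ring the terminal bell if no reply arrived.
 * 'host' is pasted into a shell command, so it must come from a trusted
 * source.
 */
enum ping_result probe_once(const char *host)
{
	char cmd[256];
	enum ping_result res;
	int n, status;

	n = snprintf(cmd, sizeof(cmd), "ping -c1 %s >/dev/null 2>&1", host);
	if (n < 0 || (size_t)n >= sizeof(cmd))
		return PING_ERROR;

	status = system(cmd);
	if (status == -1 || !WIFEXITED(status))
		return PING_ERROR;

	res = classify_ping_exit(WEXITSTATUS(status));
	if (res == PING_NO_REPLY) {
		fputc('\a', stderr);	/* audible bell */
		fflush(stderr);
	}
	return res;
}
```

Calling probe_once() from a loop with a one second sleep would keep running
after each beep. Any other failure, such as an unknown host, is reported as
PING_ERROR without a beep.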


> A workaround is to run ping -c1 in a loop:
> 
> while true ; do
>     ping -c1 $server || beep
>     sleep 1
> done
> 
> but I'd prefer to have something like this directly in ping ;-)
> 
> > Or do you want ping to beep and then continue running?
> 
> Yes, that's exactly what I want.


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***


* net: Automatic IRQ siloing for network devices
From: Neil Horman @ 2011-04-15 20:17 UTC (permalink / raw)
  To: netdev; +Cc: davem

Automatic IRQ siloing for network devices

At last year's netconf:
http://vger.kernel.org/netconf2010.html

Tom Herbert gave a talk in which he outlined some of the things we can do to
improve scalability and throughput in our network stack.

One of the big items on the slides was the notion of siloing irqs, which is the
practice of setting irq affinity to a cpu or cpu set that was 'close' to the
process that would be consuming data.  The idea was to ensure that a hard irq
for a nic (and its subsequent softirq) would execute on the same cpu as the
process consuming the data, increasing cache hit rates and speeding up overall
throughput.

I had taken an idea away from that talk, and have finally gotten around to
implementing it.  One of the problems with the above approach is that it's all
quite manual.  I.e. to properly enact this siloing, you have to do a few things
by hand:

1) decide which process is the heaviest user of a given rx queue 
2) restrict the cpus which that task will run on
3) identify the irq which the rx queue in (1) maps to
4) manually set the affinity for the irq in (3) to cpus which match the cpus in
(2)

That configuration of course has to change in response to workload changes (what
if your consumer process gets reworked so that it's no longer the largest network
user, etc).
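
For reference, step 4 above comes down to writing a hex CPU bitmask to
/proc/irq/<n>/smp_affinity.  A minimal userspace sketch in C, assuming only
CPUs 0..63 are of interest; cpus_to_mask_hex and set_irq_affinity are
illustrative names and are not part of this patch set or of irqbalance:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Build the hex bitmask accepted by /proc/irq/<n>/smp_affinity.
 * Limited to CPUs 0..63 for brevity.  Returns 0 on success, -1 on bad input.
 */
int cpus_to_mask_hex(const unsigned int *cpus, size_t ncpus,
		     char *buf, size_t len)
{
	uint64_t mask = 0;
	size_t i;
	int n;

	for (i = 0; i < ncpus; i++) {
		if (cpus[i] >= 64)
			return -1;
		mask |= (uint64_t)1 << cpus[i];
	}

	n = snprintf(buf, len, "%llx", (unsigned long long)mask);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write the mask for one irq.  Needs root.  Returns 0 on success. */
int set_irq_affinity(unsigned int irq, const unsigned int *cpus, size_t ncpus)
{
	char path[64], mask[32];
	FILE *f;
	int rc;

	if (cpus_to_mask_hex(cpus, ncpus, mask, sizeof(mask)))
		return -1;

	snprintf(path, sizeof(path), "/proc/irq/%u/smp_affinity", irq);
	f = fopen(path, "w");
	if (!f)
		return -1;

	rc = fprintf(f, "%s\n", mask) < 0 ? -1 : 0;
	if (fclose(f) != 0)
		rc = -1;
	return rc;
}
```

Steps 1 to 3 still have to be done by hand before calling this, which is the
part that does not scale.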

I thought it would be good if we could automate some amount of this, and I think
I've found a way to do that.  With this patch set I introduce the ability to:

A) Register common affinity monitoring routines against a given irq which can
implement various algorithms to determine a suggested placement of said irq's
affinity

B) Add an algorithm to the network subsystem to track the amount of data that
flows through each entry in a given rx_queue's rps_flow_table, and use that data
to suggest an affinity for the irq associated with that rx queue.
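
The selection in (B) can be pictured with a small userspace model.  This is
only a sketch under assumptions: struct flow, NO_CPU and heaviest_flow_cpu are
stand-ins invented for illustration, not the kernel's rps_dev_flow or the code
in the patches.

```c
#include <stddef.h>
#include <stdint.h>

#define NO_CPU 0xffff	/* stand-in for "flow has no cpu assigned" */

/* Simplified flow table entry: the steering cpu and bytes seen so far. */
struct flow {
	uint16_t cpu;
	uint32_t weight;
};

/*
 * Return the cpu of the heaviest flow that has a cpu assigned, or NO_CPU
 * if no such flow has carried any data.
 */
uint16_t heaviest_flow_cpu(const struct flow *flows, size_t n)
{
	uint32_t max_weight = 0;
	uint16_t best = NO_CPU;
	size_t i;

	for (i = 0; i < n; i++) {
		/* Check the cpu first so an unassigned flow is never picked. */
		if (flows[i].cpu == NO_CPU)
			continue;
		if (flows[i].weight > max_weight) {
			max_weight = flows[i].weight;
			best = flows[i].cpu;
		}
	}
	return best;
}
```

The returned cpu is what would end up as the single bit set in the hint mask.
A return of NO_CPU corresponds to leaving the hint empty.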

This patchset lets these affinity suggestions get exported via the
/proc/irq/<n>/affinity_hint interface (which is unused in the kernel with the
exception of ixgbe).  It also exports a new proc file affinity_alg which informs
anyone interested in the affinity_hint how the hint is being computed.
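
A consumer of the hint has to turn the cpumask text read from the proc file
back into a cpu number.  A minimal sketch, assuming the usual proc cpumask
format of hex digits, optionally split into comma-separated groups with the
most significant group first; first_cpu_in_mask is a hypothetical helper name:

```c
#include <ctype.h>
#include <string.h>

/*
 * Return the lowest cpu number set in a proc cpumask string such as
 * "00000000,00000004", or -1 if the mask is empty or malformed.
 */
int first_cpu_in_mask(const char *s)
{
	size_t len = strlen(s);
	int bit = 0;		/* bit index of the current digit's low bit */
	int found = -1;

	/* Walk from the least significant (rightmost) hex digit. */
	while (len > 0) {
		char c = s[--len];
		int v, b;

		if (c == ',' || c == '\n')
			continue;
		if (!isxdigit((unsigned char)c))
			return -1;

		v = isdigit((unsigned char)c) ?
			c - '0' : tolower((unsigned char)c) - 'a' + 10;
		for (b = 0; b < 4; b++) {
			if (found < 0 && (v & (1 << b)))
				found = bit + b;
		}
		bit += 4;
	}
	return found;
}
```

With the rfs max weight algorithm the mask carries a single cpu, so the lowest
set bit is the whole answer.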

Testing:
	I've been running this patchset on my dual core system here with a cxgb4
as my network interface.  I've been running a TCP_STREAM test from netperf in 2
minute increments under various conditions.  I've found experimentally that (as
you might expect) optimal performance is reached when irq affinity is bound to a
core that is not the cpu core identified by the largest RFS flow, but is as
close to it as possible (ideally sharing an L2 cache).  That way we
avoid cpu contention between the softirq and the application, while still
maximizing cache hits.  In conjunction with the irqbalance patch I hacked up
here:

http://people.redhat.com/nhorman/irqbalance.patch

to steer irqs that have affinity using the rfs max weight algorithm to cpus that
are as close as possible to the hinted cpu, I'm able to get approximately a 3%
speedup in receive rates over the pessimal case, and about a 1% speedup over the
nominal case (statically setting irq affinity to a single cpu).

Note: Currently this patch set only updates cxgb4 to use the new hinting
mechanism.  If this gets accepted, I have more cards to test with and plan to
update them, but I thought for a first pass it would be better to simply update
what I tested with.

Thoughts/Opinions appreciated

Thanks & Regards
Neil


* [PATCH 1/3] irq: Add registered affinity guidance infrastructure
From: Neil Horman @ 2011-04-15 20:17 UTC (permalink / raw)
  To: netdev
  Cc: davem, nhorman, Dimitris Michailidis, Thomas Gleixner,
	David Howells, Eric Dumazet, Tom Herbert
In-Reply-To: <1302898677-3833-1-git-send-email-nhorman@tuxdriver.com>

From: nhorman <nhorman@devel2.think-freely.org>

This patch adds the needed data to the irq_desc struct, as well as the needed
API calls to allow the requester of an irq to register a handler function to
determine the affinity_hint of that irq when queried from user space.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>

CC: Dimitris Michailidis <dm@chelsio.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: David Howells <dhowells@redhat.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Tom Herbert <therbert@google.com>
---
 include/linux/interrupt.h |   38 +++++++++++++++++++++++++++++++++++++-
 include/linux/irq.h       |    9 +++++++++
 include/linux/irqdesc.h   |    4 ++++
 kernel/irq/Kconfig        |   12 +++++++++++-
 kernel/irq/manage.c       |   39 +++++++++++++++++++++++++++++++++++++++
 kernel/irq/proc.c         |   35 +++++++++++++++++++++++++++++++++++
 6 files changed, 135 insertions(+), 2 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 59b72ca..6edb364 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -118,6 +118,17 @@ struct irqaction {
 } ____cacheline_internodealigned_in_smp;
 
 extern irqreturn_t no_action(int cpl, void *dev_id);
+#ifdef CONFIG_AFFINITY_UPDATE
+extern int setup_affinity_data(int irq, irq_affinity_init_t, void *);
+#else
+static inline int setup_affinity_data(int irq,
+				      irq_affinity_init_t init, void *d)
+{
+	return 0;
+}
+#endif
+
+extern void free_irq(unsigned int, void *);
 
 #ifdef CONFIG_GENERIC_HARDIRQS
 extern int __must_check
@@ -125,6 +136,32 @@ request_threaded_irq(unsigned int irq, irq_handler_t handler,
 		     irq_handler_t thread_fn,
 		     unsigned long flags, const char *name, void *dev);
 
+#ifdef CONFIG_AFFINITY_UPDATE
+static inline int __must_check
+request_affinity_irq(unsigned int irq, irq_handler_t handler,
+		     irq_handler_t thread_fn,
+		     unsigned long flags, const char *name, void *dev,
+		     irq_affinity_init_t af_init, void *af_priv)
+{
+	int rc;
+
+	rc = request_threaded_irq(irq, handler, thread_fn, flags, name, dev);
+	if (rc)
+		goto out;
+
+	if (af_init)
+		rc = setup_affinity_data(irq, af_init, af_priv);
+	if (rc)
+		free_irq(irq, dev);
+
+out:
+	return rc;
+}
+#else
+#define request_affinity_irq(irq, hnd, tfn, flg, nm, dev, init, priv) \
+	request_threaded_irq(irq, hnd, tfn, flg, nm, dev)
+#endif
+
 static inline int __must_check
 request_irq(unsigned int irq, irq_handler_t handler, unsigned long flags,
 	    const char *name, void *dev)
@@ -167,7 +204,6 @@ request_any_context_irq(unsigned int irq, irq_handler_t handler,
 static inline void exit_irq_thread(void) { }
 #endif
 
-extern void free_irq(unsigned int, void *);
 
 struct device;
 
diff --git a/include/linux/irq.h b/include/linux/irq.h
index 1d3577f..4bff14f 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -30,6 +30,15 @@
 
 struct irq_desc;
 struct irq_data;
+struct affin_data {
+	void *priv;
+	char *affinity_alg;
+	void (*affin_update)(int irq, struct affin_data *ad);
+	void (*affin_cleanup)(int irq, struct affin_data *ad);
+};
+
+typedef int (*irq_affinity_init_t)(int, struct affin_data*, void *);
+
 typedef	void (*irq_flow_handler_t)(unsigned int irq,
 					    struct irq_desc *desc);
 typedef	void (*irq_preflow_handler_t)(struct irq_data *data);
diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h
index 0021837..14a22fb 100644
--- a/include/linux/irqdesc.h
+++ b/include/linux/irqdesc.h
@@ -64,6 +64,10 @@ struct irq_desc {
 	struct timer_rand_state *timer_rand_state;
 	unsigned int __percpu	*kstat_irqs;
 	irq_flow_handler_t	handle_irq;
+#ifdef CONFIG_AFFINITY_UPDATE
+	struct affin_data	*af_data;
+#endif
+
 #ifdef CONFIG_IRQ_PREFLOW_FASTEOI
 	irq_preflow_handler_t	preflow_handler;
 #endif
diff --git a/kernel/irq/Kconfig b/kernel/irq/Kconfig
index 09bef82..abaf19c 100644
--- a/kernel/irq/Kconfig
+++ b/kernel/irq/Kconfig
@@ -51,6 +51,17 @@ config IRQ_PREFLOW_FASTEOI
 config IRQ_FORCED_THREADING
        bool
 
+config AFFINITY_UPDATE
+	bool "Support irq affinity direction"
+	depends on GENERIC_HARDIRQS
+	---help---
+
+	Affinity updating adds the ability for requestors of irqs to
+	register affinity update methods against the irq in question.
+	In so doing, the requestor can be informed every time user space
+	queries an irq for its optimal affinity, giving the requestor the
+	chance to tell user space where the irq can be optimally handled.
+
 config SPARSE_IRQ
 	bool "Support sparse irq numbering"
 	depends on HAVE_SPARSE_IRQ
@@ -64,6 +75,5 @@ config SPARSE_IRQ
 	    out the interrupt descriptors in a more NUMA-friendly way. )
 
 	  If you don't know what to do here, say N.
-
 endmenu
 endif
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index acd599a..257ea4d 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1159,6 +1159,17 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
 
 	unregister_handler_proc(irq, action);
 
+#ifdef CONFIG_AFFINITY_UPDATE
+	/*
+	 * Have to do this after we unregister proc accessors
+	 */
+	if (desc->af_data) {
+		if (desc->af_data->affin_cleanup)
+			desc->af_data->affin_cleanup(irq, desc->af_data);
+		kfree(desc->af_data);
+		desc->af_data = NULL;
+	}
+#endif
 	/* Make sure it's not being used on another CPU: */
 	synchronize_irq(irq);
 
@@ -1345,6 +1356,34 @@ int request_threaded_irq(unsigned int irq, irq_handler_t handler,
 }
 EXPORT_SYMBOL(request_threaded_irq);
 
+#ifdef CONFIG_AFFINITY_UPDATE
+int setup_affinity_data(int irq, irq_affinity_init_t af_init, void *af_priv)
+{
+	struct affin_data *data;
+	struct irq_desc *desc;
+	int rc;
+
+	desc = irq_to_desc(irq);
+	if (!desc)
+		return -ENOENT;
+
+	data = kzalloc(sizeof(struct affin_data), GFP_KERNEL);
+	if (!data)
+		return -ENOMEM;
+
+	rc = af_init(irq, data, af_priv);
+	if (rc) {
+		kfree(data);
+		return rc;
+	}
+
+	desc->af_data = data;
+
+	return 0;
+}
+EXPORT_SYMBOL(setup_affinity_data);
+#endif
+
 /**
  *	request_any_context_irq - allocate an interrupt line
  *	@irq: Interrupt line to allocate
diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c
index 4cc2e5e..8fecb05 100644
--- a/kernel/irq/proc.c
+++ b/kernel/irq/proc.c
@@ -42,6 +42,11 @@ static int irq_affinity_hint_proc_show(struct seq_file *m, void *v)
 	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
 		return -ENOMEM;
 
+#ifdef CONFIG_AFFINITY_UPDATE
+	if (desc->af_data && desc->af_data->affin_update)
+		desc->af_data->affin_update((long)m->private, desc->af_data);
+#endif
+
 	raw_spin_lock_irqsave(&desc->lock, flags);
 	if (desc->affinity_hint)
 		cpumask_copy(mask, desc->affinity_hint);
@@ -54,6 +59,19 @@ static int irq_affinity_hint_proc_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+static int irq_affinity_alg_proc_show(struct seq_file *m, void *v)
+{
+	char *alg = "none";
+#ifdef CONFIG_AFFINITY_UPDATE
+	struct irq_desc *desc = irq_to_desc((long)m->private);
+
+	if (desc->af_data->affinity_alg)
+		alg = desc->af_data->affinity_alg;
+#endif
+	seq_printf(m, "%s\n", alg);
+	return 0;
+}
+
 #ifndef is_affinity_mask_valid
 #define is_affinity_mask_valid(val) 1
 #endif
@@ -110,6 +128,11 @@ static int irq_affinity_hint_proc_open(struct inode *inode, struct file *file)
 	return single_open(file, irq_affinity_hint_proc_show, PDE(inode)->data);
 }
 
+static int irq_affinity_alg_proc_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, irq_affinity_alg_proc_show, PDE(inode)->data);
+}
+
 static const struct file_operations irq_affinity_proc_fops = {
 	.open		= irq_affinity_proc_open,
 	.read		= seq_read,
@@ -125,6 +148,13 @@ static const struct file_operations irq_affinity_hint_proc_fops = {
 	.release	= single_release,
 };
 
+static const struct file_operations irq_affinity_alg_proc_fops = {
+	.open		= irq_affinity_alg_proc_open,
+	.read           = seq_read,
+	.llseek         = seq_lseek,
+	.release        = single_release,
+};
+
 static int default_affinity_show(struct seq_file *m, void *v)
 {
 	seq_cpumask(m, irq_default_affinity);
@@ -288,6 +318,11 @@ void register_irq_proc(unsigned int irq, struct irq_desc *desc)
 	/* create /proc/irq/<irq>/affinity_hint */
 	proc_create_data("affinity_hint", 0400, desc->dir,
 			 &irq_affinity_hint_proc_fops, (void *)(long)irq);
+#ifdef CONFIG_AFFINITY_UPDATE
+	/* Create /proc/irq/<irq>/affinity_alg */
+	proc_create_data("affinity_alg", 0400, desc->dir,
+			&irq_affinity_alg_proc_fops, (void *)(long)irq);
+#endif
 
 	proc_create_data("node", 0444, desc->dir,
 			 &irq_node_proc_fops, (void *)(long)irq);
-- 
1.7.4.2



* [PATCH 2/3] net: Add net device irq siloing feature
From: Neil Horman @ 2011-04-15 20:17 UTC (permalink / raw)
  To: netdev
  Cc: davem, Neil Horman, Dimitris Michailidis, Thomas Gleixner,
	David Howells, Eric Dumazet, Tom Herbert
In-Reply-To: <1302898677-3833-1-git-send-email-nhorman@tuxdriver.com>

Using the irq affinity infrastructure, we can now allow net devices to call
request_irq using a new wrapper function (request_net_irq), which will attach a
common affinity update handler to each requested irq.  This affinity update
mechanism correlates each tracked irq to the flow(s) that said irq processes
most frequently.  The highest traffic flow is noted, marked and exported to user
space via the affinity_hint proc file for each irq. In this way, utilities like
irqbalance are able to determine which cpu is receiving the most data from each
rx queue on a given NIC, and set irq affinity accordingly.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>

CC: Dimitris Michailidis <dm@chelsio.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: David Howells <dhowells@redhat.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Tom Herbert <therbert@google.com>
---
 include/linux/netdevice.h  |   18 +++++++
 kernel/irq/proc.c          |    2 +-
 net/Kconfig                |   12 +++++
 net/core/dev.c             |  107 ++++++++++++++++++++++++++++++++++++++++++++
 net/core/sysctl_net_core.c |    9 ++++
 5 files changed, 147 insertions(+), 1 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5eeb2cd..ba6191f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -609,6 +609,9 @@ struct rps_map {
 struct rps_dev_flow {
 	u16 cpu;
 	u16 filter;
+#ifdef CONFIG_RFS_SILOING
+	u32 weight;
+#endif
 	unsigned int last_qtail;
 };
 #define RPS_NO_FILTER 0xffff
@@ -1631,6 +1634,21 @@ static inline void unregister_netdevice(struct net_device *dev)
 	unregister_netdevice_queue(dev, NULL);
 }
 
+#ifdef CONFIG_RFS_SILOING
+extern int netdev_rxq_silo_init(int irq, struct affin_data *afd, void *priv);
+extern int sysctl_irq_siloing_period;
+
+static inline int __must_check
+request_net_irq(unsigned int irq, irq_handler_t handler, unsigned long flags,
+		const char *name, void *dev, struct net_device *ndev, int rxq)
+{
+	return request_affinity_irq(irq, handler, NULL, flags, name, dev,
+				    netdev_rxq_silo_init, &ndev->_rx[rxq]);
+}
+#else
+#define request_net_irq(i, h, f, n, d, nd, r) request_irq(i, h, f, n, d)
+#endif
+
 extern int 		netdev_refcnt_read(const struct net_device *dev);
 extern void		free_netdev(struct net_device *dev);
 extern void		synchronize_net(void);
diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c
index 8fecb05..d5a7e4d 100644
--- a/kernel/irq/proc.c
+++ b/kernel/irq/proc.c
@@ -65,7 +65,7 @@ static int irq_affinity_alg_proc_show(struct seq_file *m, void *v)
 #ifdef CONFIG_AFFINITY_UPDATE
 	struct irq_desc *desc = irq_to_desc((long)m->private);
 
-	if (desc->af_data->affinity_alg)
+	if (desc->af_data && desc->af_data->affinity_alg)
 		alg = desc->af_data->affinity_alg;
 #endif
 	seq_printf(m, "%s\n", alg);
diff --git a/net/Kconfig b/net/Kconfig
index 79cabf1..d6ef6f5 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -232,6 +232,18 @@ config XPS
 	depends on SMP && SYSFS && USE_GENERIC_SMP_HELPERS
 	default y
 
+config RFS_SILOING
+	boolean
+	depends on RFS_ACCEL && AFFINITY_UPDATE
+	default y
+	---help---
+	 This feature allows appropriately enabled network drivers to
+	 export affinity_hint data to user space based on the RFS flow hash
+	 table for the rx queue associated with a given interrupt.  This allows
+	 userspace to optimize irq affinity such that a given rx queue has its
+	 interrupt serviced on the same cpu/l2 cache/numa node running the process
+	 that consumes most of its data.
+
 menu "Network testing"
 
 config NET_PKTGEN
diff --git a/net/core/dev.c b/net/core/dev.c
index 0b88eba..4d86137 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -173,6 +173,9 @@
 #define PTYPE_HASH_SIZE	(16)
 #define PTYPE_HASH_MASK	(PTYPE_HASH_SIZE - 1)
 
+#ifdef CONFIG_RFS_SILOING
+int sysctl_irq_siloing_period;
+#endif
 static DEFINE_SPINLOCK(ptype_lock);
 static struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
 static struct list_head ptype_all __read_mostly;	/* Taps */
@@ -2640,6 +2643,9 @@ set_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 		rflow->filter = rc;
 		if (old_rflow->filter == rflow->filter)
 			old_rflow->filter = RPS_NO_FILTER;
+#ifdef CONFIG_RFS_SILOING
+		old_rflow->weight = rflow->weight = 0;
+#endif
 	out:
 #endif
 		rflow->last_qtail =
@@ -2723,6 +2729,10 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 		      rflow->last_qtail)) >= 0))
 			rflow = set_rps_cpu(dev, skb, rflow, next_cpu);
 
+#ifdef CONFIG_RFS_SILOING
+		rflow->weight += skb->len;
+#endif
+
 		if (tcpu != RPS_NO_CPU && cpu_online(tcpu)) {
 			*rflowp = rflow;
 			cpu = tcpu;
@@ -6224,6 +6234,103 @@ static struct hlist_head *netdev_create_hash(void)
 	return hash;
 }
 
+#ifdef CONFIG_RFS_SILOING
+struct netdev_rxq_affin_data {
+	struct netdev_rx_queue *q;
+	unsigned long last_update;
+	cpumask_var_t affinity_mask;
+};
+
+static void netdev_rxq_silo_affin_update(int irq, struct affin_data *afd)
+{
+	struct netdev_rxq_affin_data *afdp = afd->priv;
+	struct netdev_rx_queue *q = afdp->q;
+	struct rps_dev_flow_table *flow_table;
+	int i;
+	u16 tcpu;
+	u32 mw;
+	unsigned long next_update;
+
+	mw = tcpu = 0;
+
+	next_update = afdp->last_update + (sysctl_irq_siloing_period * HZ);
+
+	if (time_after(next_update, jiffies))
+		return;
+
+	afdp->last_update = jiffies;
+
+	irq_set_affinity_hint(irq, NULL);
+	cpumask_clear(afdp->affinity_mask);
+	rcu_read_lock();
+	flow_table = rcu_dereference(q->rps_flow_table);
+
+	if (!flow_table)
+		goto out;
+
+	for (i = 0; (i & flow_table->mask) == i; i++) {
+		if (mw < flow_table->flows[i].weight) {
+			tcpu = ACCESS_ONCE(flow_table->flows[i].cpu);
+			if (tcpu == RPS_NO_CPU)
+				continue;
+			mw = flow_table->flows[i].weight;
+		}
+	}
+
+
+	if (mw) {
+		cpumask_set_cpu(tcpu, afdp->affinity_mask);
+		irq_set_affinity_hint(irq, afdp->affinity_mask);
+	}
+out:
+	rcu_read_unlock();
+	return;
+}
+
+static void netdev_rxq_silo_cleanup(int irq, struct affin_data *afd)
+{
+	struct netdev_rxq_affin_data *afdp = afd->priv;
+
+	free_cpumask_var(afdp->affinity_mask);
+	kfree(afdp);
+	afd->priv = NULL;
+}
+
+/**
+ *	netdev_rxq_silo_init - setup an irq to be siloed
+ *
+ *	initializes the irq data required to allow the networking
+ *	subsystem to determine which cpu is best suited to
+ *	service the passed in irq, and then export that data
+ *	via the affinity_hint proc interface
+ */
+int netdev_rxq_silo_init(int irq, struct affin_data *afd, void *priv)
+{
+	struct netdev_rxq_affin_data *afdp;
+
+	afd->priv = afdp = kzalloc(sizeof(struct netdev_rxq_affin_data),
+				   GFP_KERNEL);
+	if (!afdp)
+		return -ENOMEM;
+
+	if (!alloc_cpumask_var(&afdp->affinity_mask, GFP_KERNEL)) {
+		kfree(afdp);
+		return -ENOMEM;
+	}
+
+	cpumask_clear(afdp->affinity_mask);
+
+	afdp->q = priv;
+	afdp->last_update = jiffies;
+	afd->affin_update = netdev_rxq_silo_affin_update;
+	afd->affin_cleanup = netdev_rxq_silo_cleanup;
+	afd->affinity_alg = "net:rfs max weight";
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(netdev_rxq_silo_init);
+#endif
+
 /* Initialize per network namespace state */
 static int __net_init netdev_init(struct net *net)
 {
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 385b609..b5c733e 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -158,6 +158,15 @@ static struct ctl_table net_core_table[] = {
 		.proc_handler	= rps_sock_flow_sysctl
 	},
 #endif
+#ifdef CONFIG_RFS_SILOING
+	{
+		.procname	= "irq_siloing_period",
+		.data		= &sysctl_irq_siloing_period,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
+#endif
 #endif /* CONFIG_NET */
 	{
 		.procname	= "netdev_budget",
-- 
1.7.4.2



* [PATCH 3/3] net: Adding siloing irqs to cxgb4 driver
From: Neil Horman @ 2011-04-15 20:17 UTC (permalink / raw)
  To: netdev
  Cc: davem, Neil Horman, Dimitris Michailidis, Thomas Gleixner,
	David Howells, Eric Dumazet, Tom Herbert
In-Reply-To: <1302898677-3833-1-git-send-email-nhorman@tuxdriver.com>

cxgb4 hardware has been tested here and shows correct functionality with
the affinity hinting infrastructure.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>

CC: Dimitris Michailidis <dm@chelsio.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: David Howells <dhowells@redhat.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Tom Herbert <therbert@google.com>
---
 drivers/net/cxgb4/cxgb4_main.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/cxgb4/cxgb4_main.c b/drivers/net/cxgb4/cxgb4_main.c
index 5352c8a..11aeef6 100644
--- a/drivers/net/cxgb4/cxgb4_main.c
+++ b/drivers/net/cxgb4/cxgb4_main.c
@@ -562,9 +562,11 @@ static int request_msix_queue_irqs(struct adapter *adap)
 		return err;
 
 	for_each_ethrxq(s, ethqidx) {
-		err = request_irq(adap->msix_info[msi].vec, t4_sge_intr_msix, 0,
+		err = request_net_irq(adap->msix_info[msi].vec, t4_sge_intr_msix, 0,
 				  adap->msix_info[msi].desc,
-				  &s->ethrxq[ethqidx].rspq);
+				  &s->ethrxq[ethqidx].rspq,
+				  adap->port[ethqidx/MAX_NPORTS],
+				  ethqidx % MAX_NPORTS);
 		if (err)
 			goto unwind;
 		msi++;
-- 
1.7.4.2



* Re: Feature request: "inverted" ping -a (beep on failure)
From: Denys Fedoryshchenko @ 2011-04-15 20:10 UTC (permalink / raw)
  To: Randy Dunlap; +Cc: Christian Boltz, netdev
In-Reply-To: <20110415124937.6e746646.rdunlap@xenotime.net>

 On Fri, 15 Apr 2011 12:49:37 -0700, Randy Dunlap wrote:
> On Fri, 15 Apr 2011 21:35:32 +0200 Christian Boltz wrote:
>
>> Hello,
>>
>> ping -a (beep on ping success) is a quite useful command, but it can be
>> annoying.
>>
>> I'd like to have the exact opposite of it: beep when pinging fails.
>>
>> I understand that this is slightly difficult because "ping success" is
>> easier to detect (incoming packet) than "ping failure" (no incoming
>> packet or firewall reject) - my proposal is to have a timeout for every
>> packet (if no reply packet comes in) and beep if no reply is seen
>> after the timeout is over.
>>
>> For the timeout, the -W option could be used. The default timeout seems
>> to be 10 seconds, which is OK.
>>
>> Usecase / why this would be useful for me:
>> Basically for server monitoring. The exact usecase is that I have rented
>> a "root server" and asked the hoster to exchange a broken harddisk.
>> With the "inverted" ping -a, it would be easy to notice when they switch
>> off the server to replace the disk.
>>
>> Please consider this feature for the next version of ping ;-)
>>
>>
>> (The iputils homepage does not list any bugtracker or similar, therefore
>> I'm asking here.)
>
> Couldn't you look for exit code (status) 1 and then do a bell/beep
> (or play a sound file :)?
>
> Or do you want ping to beep and then continue running?
>
 I wrote my own tool and call it ping watchdog (I saw ideas for a ping
 watchdog in other projects and just improved on them a little) :-)
 It could probably be useful here: it can run a script if ping fails for
 more than N packets... It is a bit undocumented and cryptic, but I can
 improve it.

 http://code.google.com/p/sysadmin-tools/source/browse/trunk/pingwdog/pingwdog.c



* Re: SMSC 8720a/MDIO/PHY help.
From: Andy Fleming @ 2011-04-15 20:29 UTC (permalink / raw)
  To: ANDY KENNEDY; +Cc: michael, netdev
In-Reply-To: <9AC3F0E75060224C8BBC5BA2DDC8853A1FA8E8D4@EXV1.corp.adtran.com>

On Wed, Apr 13, 2011 at 4:38 PM, ANDY KENNEDY <ANDY.KENNEDY@adtran.com> wrote:
>> -----Original Message-----
>> From: Michael Riesch [mailto:michael@riesch.at]
>> Sent: Wednesday, April 13, 2011 4:19 PM
>> To: netdev@vger.kernel.org
>> Cc: ANDY KENNEDY
>> Subject: Re: SMSC 8720a/MDIO/PHY help.
>>
>>
>> > If you have an idea of something for me to try, I'd love to
>> entertain
>> > it.
>>
>> I am rather new to PHYLIB, but these are my ideas:
>>
>>  1) make sure phy_connect is executed (AFAIK called by MDIO bus
>> driver)
>
> Going through the phy.txt doc under Documentation/networking:
> PHY Abstraction Layer
> (Updated 2008-04-08)
> though it may be a bit out-of-date, I did see what you are talking about.  What I'm hung up on at the moment is the behavior of adjust_link().  It appears that I only need to start the queues, though I don’t know.
>
>>
>>  2) maybe you need to call phy_start / phy_stop (AFAIK from the PHY
>> driver's open / close function)
>
> Currently, when I do this I only get the call to adjust_link() over and over again.


...this means that the state machine is running.  The PHY is polling
every couple seconds to report the current state. It calls
adjust_link() to keep the net_device up-to-date on that state. What
other behavior are you expecting to see?


Andy


* Re: SMSC 8720a/MDIO/PHY help.
From: Andy Fleming @ 2011-04-15 20:36 UTC (permalink / raw)
  To: ANDY KENNEDY; +Cc: netdev
In-Reply-To: <9AC3F0E75060224C8BBC5BA2DDC8853A1FA8E8FD@EXV1.corp.adtran.com>

On Wed, Apr 13, 2011 at 11:08 PM, ANDY KENNEDY <ANDY.KENNEDY@adtran.com> wrote:
>> > -----Original Message-----
>> > From: Michael Riesch [mailto:michael@riesch.at]
>> > Sent: Wednesday, April 13, 2011 4:19 PM
>> > To: netdev@vger.kernel.org
>> > Cc: ANDY KENNEDY
>> > Subject: Re: SMSC 8720a/MDIO/PHY help.
>> >
>> >
>> > > If you have an idea of something for me to try, I'd love to
>> > entertain
>> > > it.
>> >
>> > I am rather new to PHYLIB, but these are my ideas:
>> >
>> >  1) make sure phy_connect is executed (AFAIK called by MDIO bus
>> > driver)
>
> Along this line of thought:  phy_connect requires struct net_device, which has a struct net_device_ops within it.  When I do a phy_connect am I supposed to provide the minimal functions for netdev_ops (correct this list if I am mistaken):
> ndo_open
> ndo_stop
> ndo_start_xmit
> ndo_get_stats
> ndo_set_multicast_list
> As well as populate the dev->dev_addr within the struct net_device.
>
> The part that confuses me is that the smsc.c ??driver?? under drivers/net/phy/smsc.c doesn’t do any of this.  This is a phy supported by this file, so should I have to do all this to get the device up?

Hmm....where are you calling phy_connect from?  phy_connect() is
called from a net_device driver, to connect the net device to the PHY.
The net_device should be filled in by the net driver. The PHY Lib
doesn't use the struct net_device * itself.  It merely passes that
structure to the registered adjust_link() callback, as context.

We could theoretically make the net_device a void *, and let the
caller of phy_connect() determine its own context, but that didn't
seem necessary at the time.  It also might make sense for
adjust_link() to pass the struct phy_device.

But those are all just possible enhancements for the future.
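
The relationship described above can be sketched in plain user-space C.
This is only an illustration under loud assumptions: mock_phy_connect()
and mock_phy_poll() are hypothetical and merely imitate the shape of the
real phylib calls. The point is that the PHY side stores the net device
pointer without looking inside it and hands it back to the handler.

```c
#include <stddef.h>

struct mock_net_device;  /* opaque to the PHY side */

/* Hypothetical PHY-side object. */
struct mock_phy_device {
    int link;                              /* last polled link state */
    struct mock_net_device *attached_dev;  /* context, never inspected */
    void (*adjust_link)(struct mock_net_device *dev);
};

/* Owned and filled in by the (hypothetical) network driver. */
struct mock_net_device {
    struct mock_phy_device *phydev;
    int carrier_ok;
    int callbacks_seen;
};

/* Imitates the net driver attaching itself to a PHY. */
void mock_phy_connect(struct mock_net_device *dev,
                      struct mock_phy_device *phy,
                      void (*handler)(struct mock_net_device *dev))
{
    phy->attached_dev = dev;
    phy->adjust_link = handler;
    dev->phydev = phy;
}

/* Imitates one tick of the PHY state machine. */
void mock_phy_poll(struct mock_phy_device *phy, int link_now)
{
    phy->link = link_now;
    if (phy->adjust_link != NULL)
        phy->adjust_link(phy->attached_dev);
}

/* The driver's handler: it reads PHY state through its own device. */
void mock_driver_adjust_link(struct mock_net_device *dev)
{
    dev->carrier_ok = dev->phydev->link;
    dev->callbacks_seen++;
}
```

Everything in mock_net_device is set up by the network driver; the PHY
side only keeps the pointer so the callback has its context.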

Andy

* RE: SMSC 8720a/MDIO/PHY help.
From: ANDY KENNEDY @ 2011-04-15 20:53 UTC (permalink / raw)
  To: Andy Fleming; +Cc: michael, netdev
In-Reply-To: <BANLkTik7xeMnx0S2m0easY0hVT_UmomjzA@mail.gmail.com>

> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-
> owner@vger.kernel.org] On Behalf Of Andy Fleming
> Sent: Friday, April 15, 2011 3:30 PM
> To: ANDY KENNEDY
> Cc: michael@riesch.at; netdev@vger.kernel.org
> Subject: Re: SMSC 8720a/MDIO/PHY help.
> 
> On Wed, Apr 13, 2011 at 4:38 PM, ANDY KENNEDY
> <ANDY.KENNEDY@adtran.com> wrote:
> >> -----Original Message-----
> >> From: Michael Riesch [mailto:michael@riesch.at]
> >> Sent: Wednesday, April 13, 2011 4:19 PM
> >> To: netdev@vger.kernel.org
> >> Cc: ANDY KENNEDY
> >> Subject: Re: SMSC 8720a/MDIO/PHY help.
> >>
> >>
> >> > If you have an idea of something for me to try, I'd love to
> >> entertain
> >> > it.
> >>
> >> I am rather new to PHYLIB, but these are my ideas:
> >>
> >>  1) make sure phy_connect is executed (AFAIK called by MDIO bus
> >> driver)
> >
> > Going through the phy.txt doc under Documentation/networking:
> > PHY Abstraction Layer
> > (Updated 2008-04-08)
> > though it may be a bit out-of-date, I did see what you are
> talking about.  What I'm hung up on at the moment is the behavior
> of adjust_link().  It appears that I only need to start the queues,
> though I don't know.
> >
> >>
> >>  2) maybe you need to call phy_start / phy_stop (AFAIK from the
> PHY
> >> driver's open / close function)
> >
> > Currently, when I do this I only get the call to adjust_link()
> over and over again.
> 
> 
> ...this means that the state machine is running.  The PHY is
> polling
> every couple seconds to report the current state. It calls
> adjust_link() to keep the net_device up-to-date on that state. What
> other behavior are you expecting to see?

Well, you see I was expecting it to be up and running at that point (to be able to assign an IP, pass traffic, etc) -- but that is due to ignorance of (1) network device drivers, (2) PHY device drivers, (3) MDIO bus drivers, (4) General level 2 networking, (5) ;) get the point?

See, I'm totally new to networking (at this level).  Device drivers, yes, but not networking.

Though, after Michael's e-mails, I have discovered that I have to
1) Make the MDIO bus work
2) Establish communications with the MDIO driver (in this case smsc.c under net/phy)
3) Make all my NDO required functions for controlling the "real" network device -- I was unaware that the PHY _WASN'T_ the network device.
4) Call phy_connect_direct (so I don't trash the already existent smsc.c as before stated)
5) Finally, after the NDO functions are written, register the NDO with the networking layer of the Kernel.

One thing I have done wrong (realized AFTER depending upon the work I've done) is that I should have split out the MDIO and the network device.  

Now, where is the document that explains all this?  The PHY document is very informative; however, if you don't know that a PHY is NOT a network device, you're kinda outta luck.  Even Wiki reports a PHY as the Physical Link layer of the OSI model.  Which, again, doesn't tell the ignorant much.

I have a bit more knowledge now, however, and I think I've about got my network device up (that would be I'm only 80% ignorant now ;).

Andy

* Re: [Bugme-new] [Bug 33042] New: Marvell 88E1145 phy configured incorrectly in fiber mode
From: Andy Fleming @ 2011-04-15 20:57 UTC (permalink / raw)
  To: Alex Dubov
  Cc: Andrew Morton, David Daney, netdev, bugzilla-daemon, bugme-daemon,
	Grant Likely, Andy Fleming
In-Reply-To: <903944.53826.qm@web37604.mail.mud.yahoo.com>

On Thu, Apr 14, 2011 at 2:59 AM, Alex Dubov <oakad@yahoo.com> wrote:
>
>
> --- On Thu, 14/4/11, Andy Fleming <afleming@gmail.com> wrote:
>
>>
>> I've just rewritten the U-Boot code for PHY management, so
>> I'd be
>> interested in hearing if this breaks your board.  But
>> what's
>> interesting to me is that, in order for U-Boot to report
>> that the link
>> is a "fiber" link, something had to set the TSEC_FIBER
>> flag, and only
>> one PHY in the public source did.  This implies to me
>> that your board
>> isn't supported by mainline U-Boot, and suggests that
>> someone may have
>> modified the 88e1145 driver. Otherwise, I don't see any
>> fiber-related
>> differences between the U-Boot 1145 driver, and the Linux
>> one.
>
> I had not seen any difference, that's true. But the problem somehow
> creeps in.
>
> The u-boot is standard stock u-boot pulled from the recent git,
> no special configuration involved.

Are you seeing this message when you run ethernet in u-boot?

"Speed: 1000, full duplex, fiber mode"

Because that last part only shows up if someone sets TSEC_FIBER in the
tsec's "flags" field...

> I tried to prevent kernel from reconfiguring the phy, but to no avail.
> It seems very weird to me, because I did quite a lot of testing with
> u-boot and network just works on that interface. However, when kernel
> starts booting it suddenly loses the ability to talk to it.

Believe me, I feel your pain.  These devices are often remarkably
fickle. The kernel tries to be more robust, but sometimes the PHYs
just don't like to be touched at all.

You could probably change to use a fixed link by removing the
phy-handle property from your ethernet device node, and adding:
"fixed-link=<0 1000 1 0 0>".  If that works, then the issue is that
Linux is breaking something when it connects. It might be good enough
for you to use fixed-link, though it would be good to actually find
out what's going wrong with the PHY driver.

Andy

* Re: SMSC 8720a/MDIO/PHY help.
From: Andy Fleming @ 2011-04-15 21:02 UTC (permalink / raw)
  To: ANDY KENNEDY; +Cc: michael, netdev
In-Reply-To: <9AC3F0E75060224C8BBC5BA2DDC8853A1FB11170@EXV1.corp.adtran.com>

On Fri, Apr 15, 2011 at 3:53 PM, ANDY KENNEDY <ANDY.KENNEDY@adtran.com> wrote:
>> -----Original Message-----
>> From: netdev-owner@vger.kernel.org [mailto:netdev-
>> owner@vger.kernel.org] On Behalf Of Andy Fleming
>> Sent: Friday, April 15, 2011 3:30 PM
>> To: ANDY KENNEDY
>> Cc: michael@riesch.at; netdev@vger.kernel.org
>> Subject: Re: SMSC 8720a/MDIO/PHY help.
>>
>> On Wed, Apr 13, 2011 at 4:38 PM, ANDY KENNEDY
>> <ANDY.KENNEDY@adtran.com> wrote:
>> >> -----Original Message-----
>> >> From: Michael Riesch [mailto:michael@riesch.at]
>> >> Sent: Wednesday, April 13, 2011 4:19 PM
>> >> To: netdev@vger.kernel.org
>> >> Cc: ANDY KENNEDY
>> >> Subject: Re: SMSC 8720a/MDIO/PHY help.
>> >>
>> >>
>> >> > If you have an idea of something for me to try, I'd love to
>> >> entertain
>> >> > it.
>> >>
>> >> I am rather new to PHYLIB, but these are my ideas:
>> >>
>> >>  1) make sure phy_connect is executed (AFAIK called by MDIO bus
>> >> driver)
>> >
>> > Going through the phy.txt doc under Documentation/networking:
>> > PHY Abstraction Layer
>> > (Updated 2008-04-08)
>> > though it may be a bit out-of-date, I did see what you are
>> talking about.  What I'm hung up on at the moment is the behavior
>> of adjust_link().  It appears that I only need to start the queues,
>> though I don't know.
>> >
>> >>
>> >>  2) maybe you need to call phy_start / phy_stop (AFAIK from the
>> PHY
>> >> driver's open / close function)
>> >
>> > Currently, when I do this I only get the call to adjust_link()
>> over and over again.
>>
>>
>> ...this means that the state machine is running.  The PHY is
>> polling
>> every couple seconds to report the current state. It calls
>> adjust_link() to keep the net_device up-to-date on that state. What
>> other behavior are you expecting to see?
>
> Well, you see I was expecting it to be up and running at that point (to be able to assign an IP, pass traffic, etc) -- but that is due to ignorance of (1) network device drivers, (2) PHY device drivers, (3) MDIO bus drivers, (4) General level 2 networking, (5) ;) get the point?
>
> See, I'm totally new to networking (at this level).  Device drivers, yes, but not networking.
>
> Though, after Michael's e-mails, I have discovered that I have to
> 1) Make the MDIO bus work
> 2) Establish communications with the MDIO driver (in this case smsc.c under net/phy)
> 3) Make all my NDO required functions for controlling the "real" network device -- I was unaware that the PHY _WASN'T_ the network device.
> 4) Call phy_connect_direct (so I don't trash the already existent smsc.c as before stated)
> 5) Finally, after the NDO functions are written, register the NDO with the networking layer of the Kernel.
>
> One thing I have done wrong (realized AFTER depending upon the work I've done) is that I should have split out the MDIO and the network device.
>
> Now, where is the document that explains all this.  The PHY document is very informative, however, if you don't know that a PHY is NOT a network device, you're kinda outta luck.  Even Wiki reports a PHY as the Physical Link layer of the OSI model.  Which, again, doesn't tell the ignorant much.


Take a look at Documentation/networking/netdevices.txt

Also grep around in drivers/net to see networking drivers that have
been ported to use phylib (look for phy_connect or phy_attach).

* RE: SMSC 8720a/MDIO/PHY help.
From: ANDY KENNEDY @ 2011-04-15 21:17 UTC (permalink / raw)
  To: Andy Fleming; +Cc: michael, netdev
In-Reply-To: <BANLkTi=rzTCTackMiKRW=vciN=+ThiCGJg@mail.gmail.com>

> -----Original Message-----
> From: Andy Fleming [mailto:afleming@gmail.com]
> Sent: Friday, April 15, 2011 4:02 PM
> To: ANDY KENNEDY
> Cc: michael@riesch.at; netdev@vger.kernel.org
> Subject: Re: SMSC 8720a/MDIO/PHY help.
> > Now, where is the document that explains all this.  The PHY
> document is very informative, however, if you don't know that a PHY
> is NOT a network device, you're kinda outta luck.  Even Wiki
> reports a PHY as the Physical Link layer of the OSI model.  Which,
> again, doesn't tell the ignorant much.
> 
> 
> Take a look at Documentation/networking/netdevices.txt
> 
> Also grep around in drivers/net to see networking drivers that have
> been ported to use phylib (look for phy_connect or phy_attach).

That's pretty much what I did.  After that, I was able (using my function stubs) to do an ifconfig eth0 and see that at least the kernel assigned my IP address to the device (though nothing else worked, but that was further than I was getting).

Like I said, I'm not completely ignorant anymore.  I look forward to the day (perhaps Monday) when I'm only half ignorant ;).

Thanks for your help!

Andy

* Re: The bonding driver should notify userspace of MAC address change
From: Jay Vosburgh @ 2011-04-15 21:45 UTC (permalink / raw)
  To: =?UTF-8?B?Tmljb2xhcyBkZSBQZXNsb8O8YW4=?=
  Cc: =?UTF-8?B?TWljaGHFgiBHw7Nybnk=?=, netdev, roy, Andy Gospodarek
In-Reply-To: <4DA89ADC.7040808@gmail.com>

Nicolas de Pesloüan <nicolas.2p.debian@gmail.com> wrote:

>Agreed.
>
>> 	Is there some race window there between the register and the
>> netif_carrier_off?
>
>It might be that dhcpcd does not wait for link to be up before starting to send DHCP requests.

	It looks like it's not related to carrier state at all:

#212: dhcpcd requires restart to get an IP address for bonded interface
-----------------------+-----------------
  Reporter:  mgorny@…  |      Owner:  roy
      Type:  defect    |     Status:  new
  Priority:  major     |  Milestone:
 Component:  dhcpcd    |    Version:  5.1
Resolution:            |   Keywords:
-----------------------+-----------------

Comment (by roy):

 Sorry, the above isn't too clear.

 dhcpcd will read the hardware address when the interface is marked IFF_UP
 or when given RTM_NEWLINK with ifi->ifi_change = ~0U, the latter being
 sent by some drivers to tell userland that an interface characteristic has
 changed - like say a hardware address - if the driver supports such a
 change whilst still up. Normal behaviour is to mark device as DOWN before
 changing hardware address. bonding does this whilst marked UP, hence this
 issue.

 carrier going up / down is just that, it's not a signal to re-read the
 interface characteristics.
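
 The rule in the comment above can be written down as a small predicate.
 This is a sketch of the description only, not dhcpcd's source: the
 struct and function names are hypothetical, and the two constants are
 local copies of the RTM_NEWLINK and IFF_UP values so the example
 compiles without Linux headers.

```c
#include <stdbool.h>

/* Local copies of the Linux values for RTM_NEWLINK and IFF_UP. */
enum { SKETCH_RTM_NEWLINK = 16, SKETCH_IFF_UP = 0x1 };

/* Hypothetical summary of one link message seen by a DHCP client. */
struct sketch_link_msg {
    int          type;    /* netlink message type */
    unsigned int flags;   /* interface flags, as in ifi_flags */
    unsigned int change;  /* change mask, as in ifi_change */
};

/*
 * Returns true when the client should re-read the hardware address:
 * either the interface has just been marked up, or the driver sent
 * RTM_NEWLINK with the change mask set to ~0U.
 */
bool sketch_should_reread_hwaddr(const struct sketch_link_msg *msg,
                                 bool was_up)
{
    bool now_up;

    if (msg->type != SKETCH_RTM_NEWLINK)
        return false;

    now_up = (msg->flags & SKETCH_IFF_UP) != 0;
    if (now_up && !was_up)
        return true;            /* interface just came up */

    return msg->change == ~0U;  /* "a characteristic has changed" */
}
```

 A MAC change on an interface that stays up, announced without the ~0U
 change mask, matches neither branch, which is the case reported for
 bonding.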


	Now this confuses me again; I thought that running the dhcp
client (dhcpcd) over bonding has worked for years, although I've not
personally tried it recently.  Perhaps it varies by distro.  In any
event, this behavior of bonding (setting the bond's MAC without a
down/up flip) has never been different in my memory.

	I've not yet dug down to see if NETDEV_CHANGEADDR will result in
an RTM_NEWLINK to user space.  At first glance it doesn't look like it.

	When bonding goes link up, however, I think linkwatch_do_dev
will issue an RTM_NEWLINK (via a call to netdev_state_change), or,
alternately, dev_change_flags will do it at IFF_UP time.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

* Re: Feature request: "inverted" ping -a (beep on failure)
From: Martin Topholm @ 2011-04-15 21:57 UTC (permalink / raw)
  To: Christian Boltz; +Cc: netdev
In-Reply-To: <201104152135.33171@tux.boltz.de.vu>

On Fri, 15 Apr 2011, Christian Boltz wrote:
> I'd like to have the exact opposite of it: beep when pinging fails.

I too have missed this feature (from the BSDs ping). Also I needed
adhoc tracking of multiple hosts. So I experimented with libevent2 and
some code from the BSD ping...

You can see the result here http://hoth.dk/xping/screenshot.jpeg
or http://hoth.dk/xping/xping-20110415.tar.gz .

> I understand that this is slightly difficult because "ping success" is 
> easier to detect (incoming package) than "ping failure" (no incoming 
> package or firewall reject)

I used the transmit interval for timeout. There are probably a lot of
corner cases I haven't thought about, but it works fairly well.
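
One way to read "use the transmit interval as the timeout", sketched in
C. This is not taken from xping; struct probe and probe_lost() are
hypothetical names, and times are plain seconds for simplicity.

```c
#include <stdbool.h>

/* Hypothetical record of one echo request. */
struct probe {
    double sent_at;  /* time the request was sent, in seconds */
    bool   replied;  /* set once a matching echo reply arrives */
};

/*
 * A probe counts as lost once a full transmit interval has passed
 * without a reply, i.e. by the time the next request is due.
 */
bool probe_lost(const struct probe *p, double now, double interval)
{
    return !p->replied && (now - p->sent_at) >= interval;
}
```

An inverted "ping -a" would then beep whenever probe_lost() is true for
the previous request at the moment the next one is sent. Replies that
arrive later than one interval are treated as failures under this rule,
which is one of the corner cases.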

Regards, Martin

* [PATCH v5 0/7] rtcache removal respin
From: David Miller @ 2011-04-15 22:39 UTC (permalink / raw)
  To: netdev

This is just a respin of the routing cache removal patches,
to deal with conflicts that have arisen since v4.

I'm leaving out the netlink patch from now on because that
change is totally unrelated to this work.

No functional changes are present since the last respin.

* [PATCH v5 1/7] ipv4: Delete routing cache.
From: David Miller @ 2011-04-15 22:39 UTC (permalink / raw)
  To: netdev


Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/route.h     |    1 -
 net/ipv4/fib_frontend.c |    5 -
 net/ipv4/route.c        |  908 ++---------------------------------------------
 3 files changed, 23 insertions(+), 891 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 3782cdd..f09d08f 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -122,7 +122,6 @@ extern int		ip_rt_init(void);
 extern void		ip_rt_redirect(__be32 old_gw, __be32 dst, __be32 new_gw,
 				       __be32 src, struct net_device *dev);
 extern void		rt_cache_flush(struct net *net, int how);
-extern void		rt_cache_flush_batch(struct net *net);
 extern struct rtable *__ip_route_output_key(struct net *, const struct flowi4 *flp);
 extern struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp,
 					   struct sock *sk);
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 2252471..33bbbda 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -1022,11 +1022,6 @@ static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
 		rt_cache_flush(dev_net(dev), 0);
 		break;
 	case NETDEV_UNREGISTER_BATCH:
-		/* The batch unregister is only called on the first
-		 * device in the list of devices being unregistered.
-		 * Therefore we should not pass dev_net(dev) in here.
-		 */
-		rt_cache_flush_batch(NULL);
 		break;
 	}
 	return NOTIFY_DONE;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index e9aee81..8033171 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -129,7 +129,6 @@ static int ip_rt_gc_elasticity __read_mostly	= 8;
 static int ip_rt_mtu_expires __read_mostly	= 10 * 60 * HZ;
 static int ip_rt_min_pmtu __read_mostly		= 512 + 20 + 20;
 static int ip_rt_min_advmss __read_mostly	= 256;
-static int rt_chain_length_max __read_mostly	= 20;
 
 /*
  *	Interface to generic destination cache.
@@ -142,7 +141,6 @@ static void		 ipv4_dst_destroy(struct dst_entry *dst);
 static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst);
 static void		 ipv4_link_failure(struct sk_buff *skb);
 static void		 ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu);
-static int rt_garbage_collect(struct dst_ops *ops);
 
 static void ipv4_dst_ifdown(struct dst_entry *dst, struct net_device *dev,
 			    int how)
@@ -187,7 +185,6 @@ static u32 *ipv4_cow_metrics(struct dst_entry *dst, unsigned long old)
 static struct dst_ops ipv4_dst_ops = {
 	.family =		AF_INET,
 	.protocol =		cpu_to_be16(ETH_P_IP),
-	.gc =			rt_garbage_collect,
 	.check =		ipv4_dst_check,
 	.default_advmss =	ipv4_default_advmss,
 	.default_mtu =		ipv4_default_mtu,
@@ -222,184 +219,30 @@ const __u8 ip_tos2prio[16] = {
 };
 
 
-/*
- * Route cache.
- */
-
-/* The locking scheme is rather straight forward:
- *
- * 1) Read-Copy Update protects the buckets of the central route hash.
- * 2) Only writers remove entries, and they hold the lock
- *    as they look at rtable reference counts.
- * 3) Only readers acquire references to rtable entries,
- *    they do so with atomic increments and with the
- *    lock held.
- */
-
-struct rt_hash_bucket {
-	struct rtable __rcu	*chain;
-};
-
-#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || \
-	defined(CONFIG_PROVE_LOCKING)
-/*
- * Instead of using one spinlock for each rt_hash_bucket, we use a table of spinlocks
- * The size of this table is a power of two and depends on the number of CPUS.
- * (on lockdep we have a quite big spinlock_t, so keep the size down there)
- */
-#ifdef CONFIG_LOCKDEP
-# define RT_HASH_LOCK_SZ	256
-#else
-# if NR_CPUS >= 32
-#  define RT_HASH_LOCK_SZ	4096
-# elif NR_CPUS >= 16
-#  define RT_HASH_LOCK_SZ	2048
-# elif NR_CPUS >= 8
-#  define RT_HASH_LOCK_SZ	1024
-# elif NR_CPUS >= 4
-#  define RT_HASH_LOCK_SZ	512
-# else
-#  define RT_HASH_LOCK_SZ	256
-# endif
-#endif
-
-static spinlock_t	*rt_hash_locks;
-# define rt_hash_lock_addr(slot) &rt_hash_locks[(slot) & (RT_HASH_LOCK_SZ - 1)]
-
-static __init void rt_hash_lock_init(void)
-{
-	int i;
-
-	rt_hash_locks = kmalloc(sizeof(spinlock_t) * RT_HASH_LOCK_SZ,
-			GFP_KERNEL);
-	if (!rt_hash_locks)
-		panic("IP: failed to allocate rt_hash_locks\n");
-
-	for (i = 0; i < RT_HASH_LOCK_SZ; i++)
-		spin_lock_init(&rt_hash_locks[i]);
-}
-#else
-# define rt_hash_lock_addr(slot) NULL
-
-static inline void rt_hash_lock_init(void)
-{
-}
-#endif
-
-static struct rt_hash_bucket 	*rt_hash_table __read_mostly;
-static unsigned			rt_hash_mask __read_mostly;
-static unsigned int		rt_hash_log  __read_mostly;
-
 static DEFINE_PER_CPU(struct rt_cache_stat, rt_cache_stat);
 #define RT_CACHE_STAT_INC(field) __this_cpu_inc(rt_cache_stat.field)
 
-static inline unsigned int rt_hash(__be32 daddr, __be32 saddr, int idx,
-				   int genid)
-{
-	return jhash_3words((__force u32)daddr, (__force u32)saddr,
-			    idx, genid)
-		& rt_hash_mask;
-}
-
 static inline int rt_genid(struct net *net)
 {
 	return atomic_read(&net->ipv4.rt_genid);
 }
 
 #ifdef CONFIG_PROC_FS
-struct rt_cache_iter_state {
-	struct seq_net_private p;
-	int bucket;
-	int genid;
-};
-
-static struct rtable *rt_cache_get_first(struct seq_file *seq)
-{
-	struct rt_cache_iter_state *st = seq->private;
-	struct rtable *r = NULL;
-
-	for (st->bucket = rt_hash_mask; st->bucket >= 0; --st->bucket) {
-		if (!rcu_dereference_raw(rt_hash_table[st->bucket].chain))
-			continue;
-		rcu_read_lock_bh();
-		r = rcu_dereference_bh(rt_hash_table[st->bucket].chain);
-		while (r) {
-			if (dev_net(r->dst.dev) == seq_file_net(seq) &&
-			    r->rt_genid == st->genid)
-				return r;
-			r = rcu_dereference_bh(r->dst.rt_next);
-		}
-		rcu_read_unlock_bh();
-	}
-	return r;
-}
-
-static struct rtable *__rt_cache_get_next(struct seq_file *seq,
-					  struct rtable *r)
-{
-	struct rt_cache_iter_state *st = seq->private;
-
-	r = rcu_dereference_bh(r->dst.rt_next);
-	while (!r) {
-		rcu_read_unlock_bh();
-		do {
-			if (--st->bucket < 0)
-				return NULL;
-		} while (!rcu_dereference_raw(rt_hash_table[st->bucket].chain));
-		rcu_read_lock_bh();
-		r = rcu_dereference_bh(rt_hash_table[st->bucket].chain);
-	}
-	return r;
-}
-
-static struct rtable *rt_cache_get_next(struct seq_file *seq,
-					struct rtable *r)
-{
-	struct rt_cache_iter_state *st = seq->private;
-	while ((r = __rt_cache_get_next(seq, r)) != NULL) {
-		if (dev_net(r->dst.dev) != seq_file_net(seq))
-			continue;
-		if (r->rt_genid == st->genid)
-			break;
-	}
-	return r;
-}
-
-static struct rtable *rt_cache_get_idx(struct seq_file *seq, loff_t pos)
-{
-	struct rtable *r = rt_cache_get_first(seq);
-
-	if (r)
-		while (pos && (r = rt_cache_get_next(seq, r)))
-			--pos;
-	return pos ? NULL : r;
-}
-
 static void *rt_cache_seq_start(struct seq_file *seq, loff_t *pos)
 {
-	struct rt_cache_iter_state *st = seq->private;
 	if (*pos)
-		return rt_cache_get_idx(seq, *pos - 1);
-	st->genid = rt_genid(seq_file_net(seq));
+		return NULL;
 	return SEQ_START_TOKEN;
 }
 
 static void *rt_cache_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 {
-	struct rtable *r;
-
-	if (v == SEQ_START_TOKEN)
-		r = rt_cache_get_first(seq);
-	else
-		r = rt_cache_get_next(seq, v);
 	++*pos;
-	return r;
+	return NULL;
 }
 
 static void rt_cache_seq_stop(struct seq_file *seq, void *v)
 {
-	if (v && v != SEQ_START_TOKEN)
-		rcu_read_unlock_bh();
 }
 
 static int rt_cache_seq_show(struct seq_file *seq, void *v)
@@ -409,29 +252,6 @@ static int rt_cache_seq_show(struct seq_file *seq, void *v)
 			   "Iface\tDestination\tGateway \tFlags\t\tRefCnt\tUse\t"
 			   "Metric\tSource\t\tMTU\tWindow\tIRTT\tTOS\tHHRef\t"
 			   "HHUptod\tSpecDst");
-	else {
-		struct rtable *r = v;
-		int len;
-
-		seq_printf(seq, "%s\t%08X\t%08X\t%8X\t%d\t%u\t%d\t"
-			      "%08X\t%d\t%u\t%u\t%02X\t%d\t%1d\t%08X%n",
-			r->dst.dev ? r->dst.dev->name : "*",
-			(__force u32)r->rt_dst,
-			(__force u32)r->rt_gateway,
-			r->rt_flags, atomic_read(&r->dst.__refcnt),
-			r->dst.__use, 0, (__force u32)r->rt_src,
-			dst_metric_advmss(&r->dst) + 40,
-			dst_metric(&r->dst, RTAX_WINDOW),
-			(int)((dst_metric(&r->dst, RTAX_RTT) >> 3) +
-			      dst_metric(&r->dst, RTAX_RTTVAR)),
-			r->rt_tos,
-			r->dst.hh ? atomic_read(&r->dst.hh->hh_refcnt) : -1,
-			r->dst.hh ? (r->dst.hh->hh_output ==
-				       dev_queue_xmit) : 0,
-			r->rt_spec_dst, &len);
-
-		seq_printf(seq, "%*s\n", 127 - len, "");
-	}
 	return 0;
 }
 
@@ -444,8 +264,7 @@ static const struct seq_operations rt_cache_seq_ops = {
 
 static int rt_cache_seq_open(struct inode *inode, struct file *file)
 {
-	return seq_open_net(inode, file, &rt_cache_seq_ops,
-			sizeof(struct rt_cache_iter_state));
+	return seq_open(file, &rt_cache_seq_ops);
 }
 
 static const struct file_operations rt_cache_seq_fops = {
@@ -453,7 +272,7 @@ static const struct file_operations rt_cache_seq_fops = {
 	.open	 = rt_cache_seq_open,
 	.read	 = seq_read,
 	.llseek	 = seq_lseek,
-	.release = seq_release_net,
+	.release = seq_release,
 };
 
 
@@ -643,184 +462,12 @@ static inline int ip_rt_proc_init(void)
 }
 #endif /* CONFIG_PROC_FS */
 
-static inline void rt_free(struct rtable *rt)
-{
-	call_rcu_bh(&rt->dst.rcu_head, dst_rcu_free);
-}
-
-static inline void rt_drop(struct rtable *rt)
-{
-	ip_rt_put(rt);
-	call_rcu_bh(&rt->dst.rcu_head, dst_rcu_free);
-}
-
-static inline int rt_fast_clean(struct rtable *rth)
-{
-	/* Kill broadcast/multicast entries very aggresively, if they
-	   collide in hash table with more useful entries */
-	return (rth->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST)) &&
-		rt_is_input_route(rth) && rth->dst.rt_next;
-}
-
-static inline int rt_valuable(struct rtable *rth)
-{
-	return (rth->rt_flags & (RTCF_REDIRECTED | RTCF_NOTIFY)) ||
-		(rth->peer && rth->peer->pmtu_expires);
-}
-
-static int rt_may_expire(struct rtable *rth, unsigned long tmo1, unsigned long tmo2)
-{
-	unsigned long age;
-	int ret = 0;
-
-	if (atomic_read(&rth->dst.__refcnt))
-		goto out;
-
-	age = jiffies - rth->dst.lastuse;
-	if ((age <= tmo1 && !rt_fast_clean(rth)) ||
-	    (age <= tmo2 && rt_valuable(rth)))
-		goto out;
-	ret = 1;
-out:	return ret;
-}
-
-/* Bits of score are:
- * 31: very valuable
- * 30: not quite useless
- * 29..0: usage counter
- */
-static inline u32 rt_score(struct rtable *rt)
-{
-	u32 score = jiffies - rt->dst.lastuse;
-
-	score = ~score & ~(3<<30);
-
-	if (rt_valuable(rt))
-		score |= (1<<31);
-
-	if (rt_is_output_route(rt) ||
-	    !(rt->rt_flags & (RTCF_BROADCAST|RTCF_MULTICAST|RTCF_LOCAL)))
-		score |= (1<<30);
-
-	return score;
-}
-
-static inline bool rt_caching(const struct net *net)
-{
-	return net->ipv4.current_rt_cache_rebuild_count <=
-		net->ipv4.sysctl_rt_cache_rebuild_count;
-}
-
-static inline bool compare_hash_inputs(const struct rtable *rt1,
-				       const struct rtable *rt2)
-{
-	return ((((__force u32)rt1->rt_key_dst ^ (__force u32)rt2->rt_key_dst) |
-		((__force u32)rt1->rt_key_src ^ (__force u32)rt2->rt_key_src) |
-		(rt1->rt_iif ^ rt2->rt_iif)) == 0);
-}
-
-static inline int compare_keys(struct rtable *rt1, struct rtable *rt2)
-{
-	return (((__force u32)rt1->rt_key_dst ^ (__force u32)rt2->rt_key_dst) |
-		((__force u32)rt1->rt_key_src ^ (__force u32)rt2->rt_key_src) |
-		(rt1->rt_mark ^ rt2->rt_mark) |
-		(rt1->rt_tos ^ rt2->rt_tos) |
-		(rt1->rt_oif ^ rt2->rt_oif) |
-		(rt1->rt_iif ^ rt2->rt_iif)) == 0;
-}
-
-static inline int compare_netns(struct rtable *rt1, struct rtable *rt2)
-{
-	return net_eq(dev_net(rt1->dst.dev), dev_net(rt2->dst.dev));
-}
-
 static inline int rt_is_expired(struct rtable *rth)
 {
 	return rth->rt_genid != rt_genid(dev_net(rth->dst.dev));
 }
 
 /*
- * Perform a full scan of hash table and free all entries.
- * Can be called by a softirq or a process.
- * In the later case, we want to be reschedule if necessary
- */
-static void rt_do_flush(struct net *net, int process_context)
-{
-	unsigned int i;
-	struct rtable *rth, *next;
-
-	for (i = 0; i <= rt_hash_mask; i++) {
-		struct rtable __rcu **pprev;
-		struct rtable *list;
-
-		if (process_context && need_resched())
-			cond_resched();
-		rth = rcu_dereference_raw(rt_hash_table[i].chain);
-		if (!rth)
-			continue;
-
-		spin_lock_bh(rt_hash_lock_addr(i));
-
-		list = NULL;
-		pprev = &rt_hash_table[i].chain;
-		rth = rcu_dereference_protected(*pprev,
-			lockdep_is_held(rt_hash_lock_addr(i)));
-
-		while (rth) {
-			next = rcu_dereference_protected(rth->dst.rt_next,
-				lockdep_is_held(rt_hash_lock_addr(i)));
-
-			if (!net ||
-			    net_eq(dev_net(rth->dst.dev), net)) {
-				rcu_assign_pointer(*pprev, next);
-				rcu_assign_pointer(rth->dst.rt_next, list);
-				list = rth;
-			} else {
-				pprev = &rth->dst.rt_next;
-			}
-			rth = next;
-		}
-
-		spin_unlock_bh(rt_hash_lock_addr(i));
-
-		for (; list; list = next) {
-			next = rcu_dereference_protected(list->dst.rt_next, 1);
-			rt_free(list);
-		}
-	}
-}
-
-/*
- * While freeing expired entries, we compute average chain length
- * and standard deviation, using fixed-point arithmetic.
- * This to have an estimation of rt_chain_length_max
- *  rt_chain_length_max = max(elasticity, AVG + 4*SD)
- * We use 3 bits for frational part, and 29 (or 61) for magnitude.
- */
-
-#define FRACT_BITS 3
-#define ONE (1UL << FRACT_BITS)
-
-/*
- * Given a hash chain and an item in this hash chain,
- * find if a previous entry has the same hash_inputs
- * (but differs on tos, mark or oif)
- * Returns 0 if an alias is found.
- * Returns ONE if rth has no alias before itself.
- */
-static int has_noalias(const struct rtable *head, const struct rtable *rth)
-{
-	const struct rtable *aux = head;
-
-	while (aux != rth) {
-		if (compare_hash_inputs(aux, rth))
-			return 0;
-		aux = rcu_dereference_protected(aux->dst.rt_next, 1);
-	}
-	return ONE;
-}
-
-/*
  * Perturbation of rt_genid by a small quantity [1..256]
  * Using 8 bits of shuffling ensure we can call rt_cache_invalidate()
  * many times (2^24) without giving recent rt_genid.
@@ -841,364 +488,25 @@ static void rt_cache_invalidate(struct net *net)
 void rt_cache_flush(struct net *net, int delay)
 {
 	rt_cache_invalidate(net);
-	if (delay >= 0)
-		rt_do_flush(net, !in_softirq());
 }
 
-/* Flush previous cache invalidated entries from the cache */
-void rt_cache_flush_batch(struct net *net)
+static struct rtable *rt_finalize(struct rtable *rt, struct sk_buff *skb)
 {
-	rt_do_flush(net, !in_softirq());
-}
-
-static void rt_emergency_hash_rebuild(struct net *net)
-{
-	if (net_ratelimit())
-		printk(KERN_WARNING "Route hash chain too long!\n");
-	rt_cache_invalidate(net);
-}
-
-/*
-   Short description of GC goals.
-
-   We want to build algorithm, which will keep routing cache
-   at some equilibrium point, when number of aged off entries
-   is kept approximately equal to newly generated ones.
-
-   Current expiration strength is variable "expire".
-   We try to adjust it dynamically, so that if networking
-   is idle expires is large enough to keep enough of warm entries,
-   and when load increases it reduces to limit cache size.
- */
-
-static int rt_garbage_collect(struct dst_ops *ops)
-{
-	static unsigned long expire = RT_GC_TIMEOUT;
-	static unsigned long last_gc;
-	static int rover;
-	static int equilibrium;
-	struct rtable *rth;
-	struct rtable __rcu **rthp;
-	unsigned long now = jiffies;
-	int goal;
-	int entries = dst_entries_get_fast(&ipv4_dst_ops);
-
-	/*
-	 * Garbage collection is pretty expensive,
-	 * do not make it too frequently.
-	 */
-
-	RT_CACHE_STAT_INC(gc_total);
-
-	if (now - last_gc < ip_rt_gc_min_interval &&
-	    entries < ip_rt_max_size) {
-		RT_CACHE_STAT_INC(gc_ignored);
-		goto out;
-	}
-
-	entries = dst_entries_get_slow(&ipv4_dst_ops);
-	/* Calculate number of entries, which we want to expire now. */
-	goal = entries - (ip_rt_gc_elasticity << rt_hash_log);
-	if (goal <= 0) {
-		if (equilibrium < ipv4_dst_ops.gc_thresh)
-			equilibrium = ipv4_dst_ops.gc_thresh;
-		goal = entries - equilibrium;
-		if (goal > 0) {
-			equilibrium += min_t(unsigned int, goal >> 1, rt_hash_mask + 1);
-			goal = entries - equilibrium;
-		}
-	} else {
-		/* We are in dangerous area. Try to reduce cache really
-		 * aggressively.
-		 */
-		goal = max_t(unsigned int, goal >> 1, rt_hash_mask + 1);
-		equilibrium = entries - goal;
-	}
-
-	if (now - last_gc >= ip_rt_gc_min_interval)
-		last_gc = now;
-
-	if (goal <= 0) {
-		equilibrium += goal;
-		goto work_done;
-	}
-
-	do {
-		int i, k;
-
-		for (i = rt_hash_mask, k = rover; i >= 0; i--) {
-			unsigned long tmo = expire;
-
-			k = (k + 1) & rt_hash_mask;
-			rthp = &rt_hash_table[k].chain;
-			spin_lock_bh(rt_hash_lock_addr(k));
-			while ((rth = rcu_dereference_protected(*rthp,
-					lockdep_is_held(rt_hash_lock_addr(k)))) != NULL) {
-				if (!rt_is_expired(rth) &&
-					!rt_may_expire(rth, tmo, expire)) {
-					tmo >>= 1;
-					rthp = &rth->dst.rt_next;
-					continue;
-				}
-				*rthp = rth->dst.rt_next;
-				rt_free(rth);
-				goal--;
-			}
-			spin_unlock_bh(rt_hash_lock_addr(k));
-			if (goal <= 0)
-				break;
-		}
-		rover = k;
-
-		if (goal <= 0)
-			goto work_done;
-
-		/* Goal is not achieved. We stop process if:
-
-		   - if expire reduced to zero. Otherwise, expire is halfed.
-		   - if table is not full.
-		   - if we are called from interrupt.
-		   - jiffies check is just fallback/debug loop breaker.
-		     We will not spin here for long time in any case.
-		 */
-
-		RT_CACHE_STAT_INC(gc_goal_miss);
-
-		if (expire == 0)
-			break;
-
-		expire >>= 1;
-#if RT_CACHE_DEBUG >= 2
-		printk(KERN_DEBUG "expire>> %u %d %d %d\n", expire,
-				dst_entries_get_fast(&ipv4_dst_ops), goal, i);
-#endif
-
-		if (dst_entries_get_fast(&ipv4_dst_ops) < ip_rt_max_size)
-			goto out;
-	} while (!in_softirq() && time_before_eq(jiffies, now));
-
-	if (dst_entries_get_fast(&ipv4_dst_ops) < ip_rt_max_size)
-		goto out;
-	if (dst_entries_get_slow(&ipv4_dst_ops) < ip_rt_max_size)
-		goto out;
-	if (net_ratelimit())
-		printk(KERN_WARNING "dst cache overflow\n");
-	RT_CACHE_STAT_INC(gc_dst_overflow);
-	return 1;
-
-work_done:
-	expire += ip_rt_gc_min_interval;
-	if (expire > ip_rt_gc_timeout ||
-	    dst_entries_get_fast(&ipv4_dst_ops) < ipv4_dst_ops.gc_thresh ||
-	    dst_entries_get_slow(&ipv4_dst_ops) < ipv4_dst_ops.gc_thresh)
-		expire = ip_rt_gc_timeout;
-#if RT_CACHE_DEBUG >= 2
-	printk(KERN_DEBUG "expire++ %u %d %d %d\n", expire,
-			dst_entries_get_fast(&ipv4_dst_ops), goal, rover);
-#endif
-out:	return 0;
-}
-
-/*
- * Returns number of entries in a hash chain that have different hash_inputs
- */
-static int slow_chain_length(const struct rtable *head)
-{
-	int length = 0;
-	const struct rtable *rth = head;
-
-	while (rth) {
-		length += has_noalias(head, rth);
-		rth = rcu_dereference_protected(rth->dst.rt_next, 1);
-	}
-	return length >> FRACT_BITS;
-}
-
-static struct rtable *rt_intern_hash(unsigned hash, struct rtable *rt,
-				     struct sk_buff *skb, int ifindex)
-{
-	struct rtable	*rth, *cand;
-	struct rtable __rcu **rthp, **candp;
-	unsigned long	now;
-	u32 		min_score;
-	int		chain_length;
-	int attempts = !in_softirq();
-
-restart:
-	chain_length = 0;
-	min_score = ~(u32)0;
-	cand = NULL;
-	candp = NULL;
-	now = jiffies;
-
-	if (!rt_caching(dev_net(rt->dst.dev))) {
-		/*
-		 * If we're not caching, just tell the caller we
-		 * were successful and don't touch the route.  The
-		 * caller hold the sole reference to the cache entry, and
-		 * it will be released when the caller is done with it.
-		 * If we drop it here, the callers have no way to resolve routes
-		 * when we're not caching.  Instead, just point *rp at rt, so
-		 * the caller gets a single use out of the route
-		 * Note that we do rt_free on this new route entry, so that
-		 * once its refcount hits zero, we are still able to reap it
-		 * (Thanks Alexey)
-		 * Note: To avoid expensive rcu stuff for this uncached dst,
-		 * we set DST_NOCACHE so that dst_release() can free dst without
-		 * waiting a grace period.
-		 */
-
-		rt->dst.flags |= DST_NOCACHE;
-		if (rt->rt_type == RTN_UNICAST || rt_is_output_route(rt)) {
-			int err = arp_bind_neighbour(&rt->dst);
-			if (err) {
-				if (net_ratelimit())
-					printk(KERN_WARNING
-					    "Neighbour table failure & not caching routes.\n");
-				ip_rt_put(rt);
-				return ERR_PTR(err);
-			}
-		}
-
-		goto skip_hashing;
-	}
-
-	rthp = &rt_hash_table[hash].chain;
-
-	spin_lock_bh(rt_hash_lock_addr(hash));
-	while ((rth = rcu_dereference_protected(*rthp,
-			lockdep_is_held(rt_hash_lock_addr(hash)))) != NULL) {
-		if (rt_is_expired(rth)) {
-			*rthp = rth->dst.rt_next;
-			rt_free(rth);
-			continue;
-		}
-		if (compare_keys(rth, rt) && compare_netns(rth, rt)) {
-			/* Put it first */
-			*rthp = rth->dst.rt_next;
-			/*
-			 * Since lookup is lockfree, the deletion
-			 * must be visible to another weakly ordered CPU before
-			 * the insertion at the start of the hash chain.
-			 */
-			rcu_assign_pointer(rth->dst.rt_next,
-					   rt_hash_table[hash].chain);
-			/*
-			 * Since lookup is lockfree, the update writes
-			 * must be ordered for consistency on SMP.
-			 */
-			rcu_assign_pointer(rt_hash_table[hash].chain, rth);
-
-			dst_use(&rth->dst, now);
-			spin_unlock_bh(rt_hash_lock_addr(hash));
-
-			rt_drop(rt);
-			if (skb)
-				skb_dst_set(skb, &rth->dst);
-			return rth;
-		}
-
-		if (!atomic_read(&rth->dst.__refcnt)) {
-			u32 score = rt_score(rth);
-
-			if (score <= min_score) {
-				cand = rth;
-				candp = rthp;
-				min_score = score;
-			}
-		}
-
-		chain_length++;
-
-		rthp = &rth->dst.rt_next;
-	}
-
-	if (cand) {
-		/* ip_rt_gc_elasticity used to be average length of chain
-		 * length, when exceeded gc becomes really aggressive.
-		 *
-		 * The second limit is less certain. At the moment it allows
-		 * only 2 entries per bucket. We will see.
-		 */
-		if (chain_length > ip_rt_gc_elasticity) {
-			*candp = cand->dst.rt_next;
-			rt_free(cand);
-		}
-	} else {
-		if (chain_length > rt_chain_length_max &&
-		    slow_chain_length(rt_hash_table[hash].chain) > rt_chain_length_max) {
-			struct net *net = dev_net(rt->dst.dev);
-			int num = ++net->ipv4.current_rt_cache_rebuild_count;
-			if (!rt_caching(net)) {
-				printk(KERN_WARNING "%s: %d rebuilds is over limit, route caching disabled\n",
-					rt->dst.dev->name, num);
-			}
-			rt_emergency_hash_rebuild(net);
-			spin_unlock_bh(rt_hash_lock_addr(hash));
-
-			hash = rt_hash(rt->rt_key_dst, rt->rt_key_src,
-					ifindex, rt_genid(net));
-			goto restart;
-		}
-	}
-
-	/* Try to bind route to arp only if it is output
-	   route or unicast forwarding path.
+	/* To avoid expensive rcu stuff for this uncached dst, we set
+	 * DST_NOCACHE so that dst_release() can free dst without
+	 * waiting a grace period.
 	 */
+	rt->dst.flags |= DST_NOCACHE;
 	if (rt->rt_type == RTN_UNICAST || rt_is_output_route(rt)) {
 		int err = arp_bind_neighbour(&rt->dst);
 		if (err) {
-			spin_unlock_bh(rt_hash_lock_addr(hash));
-
-			if (err != -ENOBUFS) {
-				rt_drop(rt);
-				return ERR_PTR(err);
-			}
-
-			/* Neighbour tables are full and nothing
-			   can be released. Try to shrink route cache,
-			   it is most likely it holds some neighbour records.
-			 */
-			if (attempts-- > 0) {
-				int saved_elasticity = ip_rt_gc_elasticity;
-				int saved_int = ip_rt_gc_min_interval;
-				ip_rt_gc_elasticity	= 1;
-				ip_rt_gc_min_interval	= 0;
-				rt_garbage_collect(&ipv4_dst_ops);
-				ip_rt_gc_min_interval	= saved_int;
-				ip_rt_gc_elasticity	= saved_elasticity;
-				goto restart;
-			}
-
 			if (net_ratelimit())
-				printk(KERN_WARNING "ipv4: Neighbour table overflow.\n");
-			rt_drop(rt);
-			return ERR_PTR(-ENOBUFS);
+				printk(KERN_WARNING
+				       "Neighbour table failure & not caching routes.\n");
+			ip_rt_put(rt);
+			return ERR_PTR(err);
 		}
 	}
-
-	rt->dst.rt_next = rt_hash_table[hash].chain;
-
-#if RT_CACHE_DEBUG >= 2
-	if (rt->dst.rt_next) {
-		struct rtable *trt;
-		printk(KERN_DEBUG "rt_cache @%02x: %pI4",
-		       hash, &rt->rt_dst);
-		for (trt = rt->dst.rt_next; trt; trt = trt->dst.rt_next)
-			printk(" . %pI4", &trt->rt_dst);
-		printk("\n");
-	}
-#endif
-	/*
-	 * Since lookup is lockfree, we must make sure
-	 * previous writes to rt are committed to memory
-	 * before making rt visible to other CPUS.
-	 */
-	rcu_assign_pointer(rt_hash_table[hash].chain, rt);
-
-	spin_unlock_bh(rt_hash_lock_addr(hash));
-
-skip_hashing:
 	if (skb)
 		skb_dst_set(skb, &rt->dst);
 	return rt;
@@ -1266,26 +574,6 @@ void __ip_select_ident(struct iphdr *iph, struct dst_entry *dst, int more)
 }
 EXPORT_SYMBOL(__ip_select_ident);
 
-static void rt_del(unsigned hash, struct rtable *rt)
-{
-	struct rtable __rcu **rthp;
-	struct rtable *aux;
-
-	rthp = &rt_hash_table[hash].chain;
-	spin_lock_bh(rt_hash_lock_addr(hash));
-	ip_rt_put(rt);
-	while ((aux = rcu_dereference_protected(*rthp,
-			lockdep_is_held(rt_hash_lock_addr(hash)))) != NULL) {
-		if (aux == rt || rt_is_expired(aux)) {
-			*rthp = aux->dst.rt_next;
-			rt_free(aux);
-			continue;
-		}
-		rthp = &aux->dst.rt_next;
-	}
-	spin_unlock_bh(rt_hash_lock_addr(hash));
-}
-
 /* called in rcu_read_lock() section */
 void ip_rt_redirect(__be32 old_gw, __be32 daddr, __be32 new_gw,
 		    __be32 saddr, struct net_device *dev)
@@ -1344,14 +632,11 @@ static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst)
 			ip_rt_put(rt);
 			ret = NULL;
 		} else if (rt->rt_flags & RTCF_REDIRECTED) {
-			unsigned hash = rt_hash(rt->rt_key_dst, rt->rt_key_src,
-						rt->rt_oif,
-						rt_genid(dev_net(dst->dev)));
 #if RT_CACHE_DEBUG >= 1
 			printk(KERN_DEBUG "ipv4_negative_advice: redirect to %pI4/%02x dropped\n",
-				&rt->rt_dst, rt->rt_tos);
+			       &rt->rt_dst, rt->rt_tos);
 #endif
-			rt_del(hash, rt);
+			ip_rt_put(rt);
 			ret = NULL;
 		} else if (rt->peer &&
 			   rt->peer->pmtu_expires &&
@@ -1850,7 +1135,6 @@ static struct rtable *rt_dst_alloc(bool nopolicy, bool noxfrm)
 static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 				u8 tos, struct net_device *dev, int our)
 {
-	unsigned int hash;
 	struct rtable *rth;
 	__be32 spec_dst;
 	struct in_device *in_dev = __in_dev_get_rcu(dev);
@@ -1912,8 +1196,7 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 #endif
 	RT_CACHE_STAT_INC(in_slow_mc);
 
-	hash = rt_hash(daddr, saddr, dev->ifindex, rt_genid(dev_net(dev)));
-	rth = rt_intern_hash(hash, rth, skb, dev->ifindex);
+	rth = rt_finalize(rth, skb);
 	err = 0;
 	if (IS_ERR(rth))
 		err = PTR_ERR(rth);
@@ -2056,7 +1339,6 @@ static int ip_mkroute_input(struct sk_buff *skb,
 {
 	struct rtable* rth = NULL;
 	int err;
-	unsigned hash;
 
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 	if (res->fi && res->fi->fib_nhs > 1)
@@ -2068,10 +1350,7 @@ static int ip_mkroute_input(struct sk_buff *skb,
 	if (err)
 		return err;
 
-	/* put it into the cache */
-	hash = rt_hash(daddr, saddr, fl4->flowi4_iif,
-		       rt_genid(dev_net(rth->dst.dev)));
-	rth = rt_intern_hash(hash, rth, skb, fl4->flowi4_iif);
+	rth = rt_finalize(rth, skb);
 	if (IS_ERR(rth))
 		return PTR_ERR(rth);
 	return 0;
@@ -2097,7 +1376,6 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	unsigned	flags = 0;
 	u32		itag = 0;
 	struct rtable * rth;
-	unsigned	hash;
 	__be32		spec_dst;
 	int		err = -EINVAL;
 	struct net    * net = dev_net(dev);
@@ -2218,8 +1496,7 @@ local_input:
 		rth->rt_flags 	&= ~RTCF_LOCAL;
 	}
 	rth->rt_type	= res.type;
-	hash = rt_hash(daddr, saddr, fl4.flowi4_iif, rt_genid(net));
-	rth = rt_intern_hash(hash, rth, skb, fl4.flowi4_iif);
+	rth = rt_finalize(rth, skb);
 	err = 0;
 	if (IS_ERR(rth))
 		err = PTR_ERR(rth);
@@ -2266,47 +1543,10 @@ martian_source_keep_err:
 int ip_route_input_common(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 			   u8 tos, struct net_device *dev, bool noref)
 {
-	struct rtable * rth;
-	unsigned	hash;
-	int iif = dev->ifindex;
-	struct net *net;
 	int res;
 
-	net = dev_net(dev);
-
 	rcu_read_lock();
 
-	if (!rt_caching(net))
-		goto skip_cache;
-
-	tos &= IPTOS_RT_MASK;
-	hash = rt_hash(daddr, saddr, iif, rt_genid(net));
-
-	for (rth = rcu_dereference(rt_hash_table[hash].chain); rth;
-	     rth = rcu_dereference(rth->dst.rt_next)) {
-		if ((((__force u32)rth->rt_key_dst ^ (__force u32)daddr) |
-		     ((__force u32)rth->rt_key_src ^ (__force u32)saddr) |
-		     (rth->rt_iif ^ iif) |
-		     rth->rt_oif |
-		     (rth->rt_tos ^ tos)) == 0 &&
-		    rth->rt_mark == skb->mark &&
-		    net_eq(dev_net(rth->dst.dev), net) &&
-		    !rt_is_expired(rth)) {
-			if (noref) {
-				dst_use_noref(&rth->dst, jiffies);
-				skb_dst_set_noref(skb, &rth->dst);
-			} else {
-				dst_use(&rth->dst, jiffies);
-				skb_dst_set(skb, &rth->dst);
-			}
-			RT_CACHE_STAT_INC(in_hit);
-			rcu_read_unlock();
-			return 0;
-		}
-		RT_CACHE_STAT_INC(in_hlist_search);
-	}
-
-skip_cache:
 	/* Multicast recognition logic is moved from route cache to here.
 	   The problem was that too many Ethernet cards have broken/missing
 	   hardware multicast filters :-( As result the host on multicasting
@@ -2448,11 +1688,9 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 
 /*
  * Major route resolver routine.
- * called with rcu_read_lock();
  */
 
-static struct rtable *ip_route_output_slow(struct net *net,
-					   const struct flowi4 *oldflp4)
+struct rtable *__ip_route_output_key(struct net *net, const struct flowi4 *oldflp4)
 {
 	u32 tos	= RT_FL_TOS(oldflp4);
 	struct flowi4 fl4;
@@ -2629,53 +1867,13 @@ static struct rtable *ip_route_output_slow(struct net *net,
 
 make_route:
 	rth = __mkroute_output(&res, &fl4, oldflp4, dev_out, flags);
-	if (!IS_ERR(rth)) {
-		unsigned int hash;
-
-		hash = rt_hash(oldflp4->daddr, oldflp4->saddr, oldflp4->flowi4_oif,
-			       rt_genid(dev_net(dev_out)));
-		rth = rt_intern_hash(hash, rth, NULL, oldflp4->flowi4_oif);
-	}
+	if (!IS_ERR(rth))
+		rth = rt_finalize(rth, NULL);
 
 out:
 	rcu_read_unlock();
 	return rth;
 }
-
-struct rtable *__ip_route_output_key(struct net *net, const struct flowi4 *flp4)
-{
-	struct rtable *rth;
-	unsigned int hash;
-
-	if (!rt_caching(net))
-		goto slow_output;
-
-	hash = rt_hash(flp4->daddr, flp4->saddr, flp4->flowi4_oif, rt_genid(net));
-
-	rcu_read_lock_bh();
-	for (rth = rcu_dereference_bh(rt_hash_table[hash].chain); rth;
-		rth = rcu_dereference_bh(rth->dst.rt_next)) {
-		if (rth->rt_key_dst == flp4->daddr &&
-		    rth->rt_key_src == flp4->saddr &&
-		    rt_is_output_route(rth) &&
-		    rth->rt_oif == flp4->flowi4_oif &&
-		    rth->rt_mark == flp4->flowi4_mark &&
-		    !((rth->rt_tos ^ flp4->flowi4_tos) &
-			    (IPTOS_RT_MASK | RTO_ONLINK)) &&
-		    net_eq(dev_net(rth->dst.dev), net) &&
-		    !rt_is_expired(rth)) {
-			dst_use(&rth->dst, jiffies);
-			RT_CACHE_STAT_INC(out_hit);
-			rcu_read_unlock_bh();
-			return rth;
-		}
-		RT_CACHE_STAT_INC(out_hlist_search);
-	}
-	rcu_read_unlock_bh();
-
-slow_output:
-	return ip_route_output_slow(net, flp4);
-}
 EXPORT_SYMBOL_GPL(__ip_route_output_key);
 
 static struct dst_entry *ipv4_blackhole_dst_check(struct dst_entry *dst, u32 cookie)
@@ -2968,43 +2166,6 @@ errout_free:
 
 int ip_rt_dump(struct sk_buff *skb,  struct netlink_callback *cb)
 {
-	struct rtable *rt;
-	int h, s_h;
-	int idx, s_idx;
-	struct net *net;
-
-	net = sock_net(skb->sk);
-
-	s_h = cb->args[0];
-	if (s_h < 0)
-		s_h = 0;
-	s_idx = idx = cb->args[1];
-	for (h = s_h; h <= rt_hash_mask; h++, s_idx = 0) {
-		if (!rt_hash_table[h].chain)
-			continue;
-		rcu_read_lock_bh();
-		for (rt = rcu_dereference_bh(rt_hash_table[h].chain), idx = 0; rt;
-		     rt = rcu_dereference_bh(rt->dst.rt_next), idx++) {
-			if (!net_eq(dev_net(rt->dst.dev), net) || idx < s_idx)
-				continue;
-			if (rt_is_expired(rt))
-				continue;
-			skb_dst_set_noref(skb, &rt->dst);
-			if (rt_fill_info(net, skb, NETLINK_CB(cb->skb).pid,
-					 cb->nlh->nlmsg_seq, RTM_NEWROUTE,
-					 1, NLM_F_MULTI) <= 0) {
-				skb_dst_drop(skb);
-				rcu_read_unlock_bh();
-				goto done;
-			}
-			skb_dst_drop(skb);
-		}
-		rcu_read_unlock_bh();
-	}
-
-done:
-	cb->args[0] = h;
-	cb->args[1] = idx;
 	return skb->len;
 }
 
@@ -3239,16 +2400,6 @@ static __net_initdata struct pernet_operations rt_genid_ops = {
 struct ip_rt_acct __percpu *ip_rt_acct __read_mostly;
 #endif /* CONFIG_IP_ROUTE_CLASSID */
 
-static __initdata unsigned long rhash_entries;
-static int __init set_rhash_entries(char *str)
-{
-	if (!str)
-		return 0;
-	rhash_entries = simple_strtoul(str, &str, 0);
-	return 1;
-}
-__setup("rhash_entries=", set_rhash_entries);
-
 int __init ip_rt_init(void)
 {
 	int rc = 0;
@@ -3271,21 +2422,8 @@ int __init ip_rt_init(void)
 	if (dst_entries_init(&ipv4_dst_blackhole_ops) < 0)
 		panic("IP: failed to allocate ipv4_dst_blackhole_ops counter\n");
 
-	rt_hash_table = (struct rt_hash_bucket *)
-		alloc_large_system_hash("IP route cache",
-					sizeof(struct rt_hash_bucket),
-					rhash_entries,
-					(totalram_pages >= 128 * 1024) ?
-					15 : 17,
-					0,
-					&rt_hash_log,
-					&rt_hash_mask,
-					rhash_entries ? 0 : 512 * 1024);
-	memset(rt_hash_table, 0, (rt_hash_mask + 1) * sizeof(struct rt_hash_bucket));
-	rt_hash_lock_init();
-
-	ipv4_dst_ops.gc_thresh = (rt_hash_mask + 1);
-	ip_rt_max_size = (rt_hash_mask + 1) * 16;
+	ipv4_dst_ops.gc_thresh = ~0;
+	ip_rt_max_size = INT_MAX;
 
 	devinet_init();
 	ip_fib_init();
-- 
1.7.4.3



* [PATCH v5 2/7] ipv4: Kill ip_route_input_noref().
From: David Miller @ 2011-04-15 22:39 UTC (permalink / raw)
  To: netdev


The "noref" argument to ip_route_input_common() is now always ignored
because we do not cache routes, and in that case we must always grab
a reference to the resulting 'dst'.
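
The reasoning above can be illustrated with a small user-space model. This is
only a sketch, not kernel code: toy_dst, toy_route_input() and toy_release()
are hypothetical names, and a plain int stands in for the kernel's atomic
reference count. With no cache there is no second owner keeping the entry
alive, so the lookup always hands back a counted reference and the caller
drops it when done.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a dst entry; not the kernel's struct dst_entry. */
struct toy_dst {
	int refcnt;
};

/* Number of toy_dst objects currently alive, kept for illustration only. */
static int toy_live;

/* With no cache, every lookup builds a fresh entry owned solely by the
 * caller, so it always returns with exactly one reference held.
 */
static struct toy_dst *toy_route_input(void)
{
	struct toy_dst *dst = calloc(1, sizeof(*dst));

	if (!dst)
		return NULL;
	dst->refcnt = 1;
	toy_live++;
	return dst;
}

/* Drop one reference and free the entry when the last one goes away. */
static void toy_release(struct toy_dst *dst)
{
	if (dst && --dst->refcnt == 0) {
		free(dst);
		toy_live--;
	}
}
```

A "noref" variant would have nothing to borrow from in this model, which is
why a single entry point is enough.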

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/route.h    |   16 ++--------------
 net/ipv4/arp.c         |    2 +-
 net/ipv4/ip_input.c    |    4 ++--
 net/ipv4/route.c       |    6 +++---
 net/ipv4/xfrm4_input.c |    4 ++--
 5 files changed, 10 insertions(+), 22 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index f09d08f..b2a44a9 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -175,20 +175,8 @@ static inline struct rtable *ip_route_output_gre(struct net *net,
 	return ip_route_output_key(net, &fl4);
 }
 
-extern int ip_route_input_common(struct sk_buff *skb, __be32 dst, __be32 src,
-				 u8 tos, struct net_device *devin, bool noref);
-
-static inline int ip_route_input(struct sk_buff *skb, __be32 dst, __be32 src,
-				 u8 tos, struct net_device *devin)
-{
-	return ip_route_input_common(skb, dst, src, tos, devin, false);
-}
-
-static inline int ip_route_input_noref(struct sk_buff *skb, __be32 dst, __be32 src,
-				       u8 tos, struct net_device *devin)
-{
-	return ip_route_input_common(skb, dst, src, tos, devin, true);
-}
+extern int ip_route_input(struct sk_buff *skb, __be32 dst, __be32 src,
+			  u8 tos, struct net_device *devin);
 
 extern unsigned short	ip_rt_frag_needed(struct net *net, struct iphdr *iph, unsigned short new_mtu, struct net_device *dev);
 extern void		ip_rt_send_redirect(struct sk_buff *skb);
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 1b74d3b..ef44a91 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -875,7 +875,7 @@ static int arp_process(struct sk_buff *skb)
 	}
 
 	if (arp->ar_op == htons(ARPOP_REQUEST) &&
-	    ip_route_input_noref(skb, tip, sip, 0, dev) == 0) {
+	    ip_route_input(skb, tip, sip, 0, dev) == 0) {
 
 		rt = skb_rtable(skb);
 		addr_type = rt->rt_type;
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index d7b2b09..577eb45 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -324,8 +324,8 @@ static int ip_rcv_finish(struct sk_buff *skb)
 	 *	how the packet travels inside Linux networking.
 	 */
 	if (skb_dst(skb) == NULL) {
-		int err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
-					       iph->tos, skb->dev);
+		int err = ip_route_input(skb, iph->daddr, iph->saddr,
+					 iph->tos, skb->dev);
 		if (unlikely(err)) {
 			if (err == -EHOSTUNREACH)
 				IP_INC_STATS_BH(dev_net(skb->dev),
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 8033171..4ed7788 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1540,8 +1540,8 @@ martian_source_keep_err:
 	goto out;
 }
 
-int ip_route_input_common(struct sk_buff *skb, __be32 daddr, __be32 saddr,
-			   u8 tos, struct net_device *dev, bool noref)
+int ip_route_input(struct sk_buff *skb, __be32 daddr, __be32 saddr,
+		   u8 tos, struct net_device *dev)
 {
 	int res;
 
@@ -1584,7 +1584,7 @@ int ip_route_input_common(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	rcu_read_unlock();
 	return res;
 }
-EXPORT_SYMBOL(ip_route_input_common);
+EXPORT_SYMBOL(ip_route_input);
 
 /* called with rcu_read_lock() */
 static struct rtable *__mkroute_output(const struct fib_result *res,
diff --git a/net/ipv4/xfrm4_input.c b/net/ipv4/xfrm4_input.c
index 06814b6..58d23a5 100644
--- a/net/ipv4/xfrm4_input.c
+++ b/net/ipv4/xfrm4_input.c
@@ -27,8 +27,8 @@ static inline int xfrm4_rcv_encap_finish(struct sk_buff *skb)
 	if (skb_dst(skb) == NULL) {
 		const struct iphdr *iph = ip_hdr(skb);
 
-		if (ip_route_input_noref(skb, iph->daddr, iph->saddr,
-					 iph->tos, skb->dev))
+		if (ip_route_input(skb, iph->daddr, iph->saddr,
+				   iph->tos, skb->dev))
 			goto drop;
 	}
 	return dst_input(skb);
-- 
1.7.4.3



* [PATCH v5 3/7] ipv4: Set DST_NOCACHE in rt_dst_alloc().
From: David Miller @ 2011-04-15 22:39 UTC (permalink / raw)
  To: netdev


Set the flag at allocation time instead of using a read/modify/write in
rt_finalize().
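
As a rough user-space illustration of the difference (a sketch only: the
toy_* names and the flag values are hypothetical and do not match the
kernel's DST_* definitions), the old shape stores the flags once and ORs in
another bit later, while the new shape performs a single store when the
entry is allocated. Both end up with the same flag word.

```c
#include <stdlib.h>

/* Hypothetical flag bits; the values are arbitrary for this sketch. */
#define TOY_DST_HOST    0x1u
#define TOY_DST_NOCACHE 0x2u

struct toy_rt {
	unsigned int flags;
};

/* Old shape, step 1: allocate with the base flags only. */
static struct toy_rt *toy_alloc_old(void)
{
	struct toy_rt *rt = calloc(1, sizeof(*rt));

	if (rt)
		rt->flags = TOY_DST_HOST;
	return rt;
}

/* Old shape, step 2: read/modify/write to add the no-cache bit later. */
static void toy_finalize_old(struct toy_rt *rt)
{
	rt->flags |= TOY_DST_NOCACHE;
}

/* New shape: one plain store at allocation time, nothing to patch up. */
static struct toy_rt *toy_alloc_new(void)
{
	struct toy_rt *rt = calloc(1, sizeof(*rt));

	if (rt)
		rt->flags = TOY_DST_NOCACHE | TOY_DST_HOST;
	return rt;
}
```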

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/route.c |   11 +++++------
 1 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 4ed7788..f66898c 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -492,11 +492,6 @@ void rt_cache_flush(struct net *net, int delay)
 
 static struct rtable *rt_finalize(struct rtable *rt, struct sk_buff *skb)
 {
-	/* To avoid expensive rcu stuff for this uncached dst, we set
-	 * DST_NOCACHE so that dst_release() can free dst without
-	 * waiting a grace period.
-	 */
-	rt->dst.flags |= DST_NOCACHE;
 	if (rt->rt_type == RTN_UNICAST || rt_is_output_route(rt)) {
 		int err = arp_bind_neighbour(&rt->dst);
 		if (err) {
@@ -1124,7 +1119,11 @@ static struct rtable *rt_dst_alloc(bool nopolicy, bool noxfrm)
 	if (rt) {
 		rt->dst.obsolete = -1;
 
-		rt->dst.flags = DST_HOST |
+		/* To avoid expensive rcu stuff for this uncached dst, we set
+		 * DST_NOCACHE so that dst_release() can free dst without
+		 * waiting a grace period.
+		 */
+		rt->dst.flags = DST_NOCACHE | DST_HOST |
 			(nopolicy ? DST_NOPOLICY : 0) |
 			(noxfrm ? DST_NOXFRM : 0);
 	}
-- 
1.7.4.3



* [PATCH v5 4/7] net: Make dst_alloc() take more explicit initializations.
From: David Miller @ 2011-04-15 22:39 UTC (permalink / raw)
  To: netdev


Now the dst->dev, dst->obsolete, and dst->flags values can
be specified as well.
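
The calling convention can be sketched in user space as follows. This is a
simplified model under stated assumptions, not the kernel implementation:
toy_dev, toy_dst, toy_dev_hold() and toy_dst_alloc() are hypothetical names,
plain ints replace atomic counters, and only the fields named above are
modelled. The point it shows is that the allocator takes the device
reference itself when a device is supplied, so callers no longer pair every
allocation with their own hold.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a network device with a reference count. */
struct toy_dev {
	int refcnt;
};

/* Hypothetical stand-in for a dst entry; only the relevant fields. */
struct toy_dst {
	struct toy_dev *dev;
	int refcnt;
	int obsolete;
	int flags;
};

/* Take one reference on the device. */
static void toy_dev_hold(struct toy_dev *dev)
{
	dev->refcnt++;
}

/* Allocate an entry and initialise device, refcount, obsolete and flags
 * in one place. A NULL dev is allowed and takes no device reference.
 */
static struct toy_dst *toy_dst_alloc(struct toy_dev *dev, int initial_ref,
				     int initial_obsolete, int flags)
{
	struct toy_dst *dst = calloc(1, sizeof(*dst));

	if (!dst)
		return NULL;
	dst->dev = dev;
	if (dev)
		toy_dev_hold(dev);
	dst->obsolete = initial_obsolete;
	dst->refcnt = initial_ref;
	dst->flags = flags;
	return dst;
}
```

A caller that previously set the device, took the hold, and assigned the
flags by hand would pass all of them to the allocator instead, for example
toy_dst_alloc(&dev, 1, -1, flags) in this model.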

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/dst.h      |    3 ++-
 net/core/dst.c         |   18 +++++++++++++-----
 net/decnet/dn_route.c  |   13 ++-----------
 net/ipv4/route.c       |   48 +++++++++++++++++++-----------------------------
 net/ipv6/route.c       |   29 +++++++++++------------------
 net/xfrm/xfrm_policy.c |    2 +-
 6 files changed, 48 insertions(+), 65 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index 75b95df..9fc2ada 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -352,7 +352,8 @@ static inline struct dst_entry *skb_dst_pop(struct sk_buff *skb)
 }
 
 extern int dst_discard(struct sk_buff *skb);
-extern void *dst_alloc(struct dst_ops * ops, int initial_ref);
+extern void *dst_alloc(struct dst_ops * ops, struct net_device *dev,
+		       int initial_ref, int initial_obsolete, int flags);
 extern void __dst_free(struct dst_entry * dst);
 extern struct dst_entry *dst_destroy(struct dst_entry * dst);
 
diff --git a/net/core/dst.c b/net/core/dst.c
index 91104d3..9505778 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -166,7 +166,8 @@ EXPORT_SYMBOL(dst_discard);
 
 const u32 dst_default_metrics[RTAX_MAX];
 
-void *dst_alloc(struct dst_ops *ops, int initial_ref)
+void *dst_alloc(struct dst_ops *ops, struct net_device *dev,
+		int initial_ref, int initial_obsolete, int flags)
 {
 	struct dst_entry *dst;
 
@@ -177,12 +178,19 @@ void *dst_alloc(struct dst_ops *ops, int initial_ref)
 	dst = kmem_cache_zalloc(ops->kmem_cachep, GFP_ATOMIC);
 	if (!dst)
 		return NULL;
-	atomic_set(&dst->__refcnt, initial_ref);
 	dst->ops = ops;
-	dst->lastuse = jiffies;
-	dst->path = dst;
-	dst->input = dst->output = dst_discard;
+	dst->dev = dev;
+	if (dev)
+		dev_hold(dev);
 	dst_init_metrics(dst, dst_default_metrics, true);
+	dst->path = dst;
+	dst->input = dst_discard;
+	dst->output = dst_discard;
+
+	dst->obsolete = initial_obsolete;
+	atomic_set(&dst->__refcnt, initial_ref);
+	dst->lastuse = jiffies;
+	dst->flags = flags;
 #if RT_CACHE_DEBUG >= 2
 	atomic_inc(&dst_total);
 #endif
diff --git a/net/decnet/dn_route.c b/net/decnet/dn_route.c
index 9f09d4f..f489b08 100644
--- a/net/decnet/dn_route.c
+++ b/net/decnet/dn_route.c
@@ -1125,13 +1125,10 @@ make_route:
 	if (dev_out->flags & IFF_LOOPBACK)
 		flags |= RTCF_LOCAL;
 
-	rt = dst_alloc(&dn_dst_ops, 0);
+	rt = dst_alloc(&dn_dst_ops, dev_out, 1, 0, DST_HOST);
 	if (rt == NULL)
 		goto e_nobufs;
 
-	atomic_set(&rt->dst.__refcnt, 1);
-	rt->dst.flags   = DST_HOST;
-
 	rt->fld.saddr        = oldflp->saddr;
 	rt->fld.daddr        = oldflp->daddr;
 	rt->fld.flowidn_oif  = oldflp->flowidn_oif;
@@ -1146,8 +1143,6 @@ make_route:
 	rt->rt_dst_map    = fld.daddr;
 	rt->rt_src_map    = fld.saddr;
 
-	rt->dst.dev = dev_out;
-	dev_hold(dev_out);
 	rt->dst.neighbour = neigh;
 	neigh = NULL;
 
@@ -1399,7 +1394,7 @@ static int dn_route_input_slow(struct sk_buff *skb)
 	}
 
 make_route:
-	rt = dst_alloc(&dn_dst_ops, 0);
+	rt = dst_alloc(&dn_dst_ops, out_dev, 0, 0, DST_HOST);
 	if (rt == NULL)
 		goto e_nobufs;
 
@@ -1419,9 +1414,7 @@ make_route:
 	rt->fld.flowidn_iif  = in_dev->ifindex;
 	rt->fld.flowidn_mark = fld.flowidn_mark;
 
-	rt->dst.flags = DST_HOST;
 	rt->dst.neighbour = neigh;
-	rt->dst.dev = out_dev;
 	rt->dst.lastuse = jiffies;
 	rt->dst.output = dn_rt_bug;
 	switch(res.type) {
@@ -1440,8 +1433,6 @@ make_route:
 			rt->dst.input = dst_discard;
 	}
 	rt->rt_flags = flags;
-	if (rt->dst.dev)
-		dev_hold(rt->dst.dev);
 
 	err = dn_rt_set_next_hop(rt, &res);
 	if (err)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index f66898c..61a96ca 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1113,21 +1113,17 @@ static void rt_set_nexthop(struct rtable *rt, const struct flowi4 *oldflp4,
 	rt->rt_type = type;
 }
 
-static struct rtable *rt_dst_alloc(bool nopolicy, bool noxfrm)
+static struct rtable *rt_dst_alloc(struct net_device *dev,
+				   bool nopolicy, bool noxfrm)
 {
-	struct rtable *rt = dst_alloc(&ipv4_dst_ops, 1);
-	if (rt) {
-		rt->dst.obsolete = -1;
-
-		/* To avoid expensive rcu stuff for this uncached dst, we set
-		 * DST_NOCACHE so that dst_release() can free dst without
-		 * waiting a grace period.
-		 */
-		rt->dst.flags = DST_NOCACHE | DST_HOST |
-			(nopolicy ? DST_NOPOLICY : 0) |
-			(noxfrm ? DST_NOXFRM : 0);
-	}
-	return rt;
+	/* To avoid expensive rcu stuff for this uncached dst, we set
+	 * DST_NOCACHE so that dst_release() can free dst without
+	 * waiting a grace period.
+	 */
+	return dst_alloc(&ipv4_dst_ops, dev, 1, -1,
+			 DST_NOCACHE | DST_HOST |
+			 (nopolicy ? DST_NOPOLICY : 0) |
+			 (noxfrm ? DST_NOXFRM : 0));
 }
 
 /* called in rcu_read_lock() section */
@@ -1159,7 +1155,8 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 		if (err < 0)
 			goto e_err;
 	}
-	rth = rt_dst_alloc(IN_DEV_CONF_GET(in_dev, NOPOLICY), false);
+	rth = rt_dst_alloc(init_net.loopback_dev,
+			   IN_DEV_CONF_GET(in_dev, NOPOLICY), false);
 	if (!rth)
 		goto e_nobufs;
 
@@ -1176,8 +1173,6 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 #endif
 	rth->rt_route_iif = dev->ifindex;
 	rth->rt_iif	= dev->ifindex;
-	rth->dst.dev	= init_net.loopback_dev;
-	dev_hold(rth->dst.dev);
 	rth->rt_oif	= 0;
 	rth->rt_gateway	= daddr;
 	rth->rt_spec_dst= spec_dst;
@@ -1295,7 +1290,8 @@ static int __mkroute_input(struct sk_buff *skb,
 		}
 	}
 
-	rth = rt_dst_alloc(IN_DEV_CONF_GET(in_dev, NOPOLICY),
+	rth = rt_dst_alloc(out_dev->dev,
+			   IN_DEV_CONF_GET(in_dev, NOPOLICY),
 			   IN_DEV_CONF_GET(out_dev, NOXFRM));
 	if (!rth) {
 		err = -ENOBUFS;
@@ -1311,8 +1307,6 @@ static int __mkroute_input(struct sk_buff *skb,
 	rth->rt_gateway	= daddr;
 	rth->rt_route_iif = in_dev->dev->ifindex;
 	rth->rt_iif 	= in_dev->dev->ifindex;
-	rth->dst.dev	= (out_dev)->dev;
-	dev_hold(rth->dst.dev);
 	rth->rt_oif 	= 0;
 	rth->rt_spec_dst= spec_dst;
 
@@ -1465,7 +1459,8 @@ brd_input:
 	RT_CACHE_STAT_INC(in_brd);
 
 local_input:
-	rth = rt_dst_alloc(IN_DEV_CONF_GET(in_dev, NOPOLICY), false);
+	rth = rt_dst_alloc(net->loopback_dev,
+			   IN_DEV_CONF_GET(in_dev, NOPOLICY), false);
 	if (!rth)
 		goto e_nobufs;
 
@@ -1483,8 +1478,6 @@ local_input:
 #endif
 	rth->rt_route_iif = dev->ifindex;
 	rth->rt_iif	= dev->ifindex;
-	rth->dst.dev	= net->loopback_dev;
-	dev_hold(rth->dst.dev);
 	rth->rt_gateway	= daddr;
 	rth->rt_spec_dst= spec_dst;
 	rth->dst.input= ip_local_deliver;
@@ -1631,7 +1624,8 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 			fi = NULL;
 	}
 
-	rth = rt_dst_alloc(IN_DEV_CONF_GET(in_dev, NOPOLICY),
+	rth = rt_dst_alloc(dev_out,
+			   IN_DEV_CONF_GET(in_dev, NOPOLICY),
 			   IN_DEV_CONF_GET(in_dev, NOXFRM));
 	if (!rth)
 		return ERR_PTR(-ENOBUFS);
@@ -1645,10 +1639,6 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 	rth->rt_src	= fl4->saddr;
 	rth->rt_route_iif = 0;
 	rth->rt_iif	= oldflp4->flowi4_oif ? : dev_out->ifindex;
-	/* get references to the devices that are to be hold by the routing
-	   cache entry */
-	rth->dst.dev	= dev_out;
-	dev_hold(dev_out);
 	rth->rt_gateway = fl4->daddr;
 	rth->rt_spec_dst= fl4->saddr;
 
@@ -1901,7 +1891,7 @@ static struct dst_ops ipv4_dst_blackhole_ops = {
 
 struct dst_entry *ipv4_blackhole_route(struct net *net, struct dst_entry *dst_orig)
 {
-	struct rtable *rt = dst_alloc(&ipv4_dst_blackhole_ops, 1);
+	struct rtable *rt = dst_alloc(&ipv4_dst_blackhole_ops, NULL, 1, 0, 0);
 	struct rtable *ort = (struct rtable *) dst_orig;
 
 	if (rt) {
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 843406f..d3f55cc 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -220,9 +220,10 @@ static struct rt6_info ip6_blk_hole_entry_template = {
 #endif
 
 /* allocate dst with ip6_dst_ops */
-static inline struct rt6_info *ip6_dst_alloc(struct dst_ops *ops)
+static inline struct rt6_info *ip6_dst_alloc(struct dst_ops *ops,
+					     struct net_device *dev)
 {
-	return (struct rt6_info *)dst_alloc(ops, 0);
+	return (struct rt6_info *)dst_alloc(ops, dev, 0, 0, 0);
 }
 
 static void ip6_dst_destroy(struct dst_entry *dst)
@@ -874,10 +875,10 @@ EXPORT_SYMBOL(ip6_route_output);
 
 struct dst_entry *ip6_blackhole_route(struct net *net, struct dst_entry *dst_orig)
 {
-	struct rt6_info *rt = dst_alloc(&ip6_dst_blackhole_ops, 1);
-	struct rt6_info *ort = (struct rt6_info *) dst_orig;
+	struct rt6_info *rt, *ort = (struct rt6_info *) dst_orig;
 	struct dst_entry *new = NULL;
 
+	rt = dst_alloc(&ip6_dst_blackhole_ops, ort->dst.dev, 1, 0, 0);
 	if (rt) {
 		new = &rt->dst;
 
@@ -886,9 +887,6 @@ struct dst_entry *ip6_blackhole_route(struct net *net, struct dst_entry *dst_ori
 		new->output = dst_discard;
 
 		dst_copy_metrics(new, &ort->dst);
-		new->dev = ort->dst.dev;
-		if (new->dev)
-			dev_hold(new->dev);
 		rt->rt6i_idev = ort->rt6i_idev;
 		if (rt->rt6i_idev)
 			in6_dev_hold(rt->rt6i_idev);
@@ -1031,13 +1029,12 @@ struct dst_entry *icmp6_dst_alloc(struct net_device *dev,
 	if (unlikely(idev == NULL))
 		return NULL;
 
-	rt = ip6_dst_alloc(&net->ipv6.ip6_dst_ops);
+	rt = ip6_dst_alloc(&net->ipv6.ip6_dst_ops, dev);
 	if (unlikely(rt == NULL)) {
 		in6_dev_put(idev);
 		goto out;
 	}
 
-	dev_hold(dev);
 	if (neigh)
 		neigh_hold(neigh);
 	else {
@@ -1046,7 +1043,6 @@ struct dst_entry *icmp6_dst_alloc(struct net_device *dev,
 			neigh = NULL;
 	}
 
-	rt->rt6i_dev	  = dev;
 	rt->rt6i_idev     = idev;
 	rt->rt6i_nexthop  = neigh;
 	atomic_set(&rt->dst.__refcnt, 1);
@@ -1205,7 +1201,7 @@ int ip6_route_add(struct fib6_config *cfg)
 		goto out;
 	}
 
-	rt = ip6_dst_alloc(&net->ipv6.ip6_dst_ops);
+	rt = ip6_dst_alloc(&net->ipv6.ip6_dst_ops, NULL);
 
 	if (rt == NULL) {
 		err = -ENOMEM;
@@ -1714,7 +1710,8 @@ void rt6_pmtu_discovery(struct in6_addr *daddr, struct in6_addr *saddr,
 static struct rt6_info * ip6_rt_copy(struct rt6_info *ort)
 {
 	struct net *net = dev_net(ort->rt6i_dev);
-	struct rt6_info *rt = ip6_dst_alloc(&net->ipv6.ip6_dst_ops);
+	struct rt6_info *rt = ip6_dst_alloc(&net->ipv6.ip6_dst_ops,
+					    ort->dst.dev);
 
 	if (rt) {
 		rt->dst.input = ort->dst.input;
@@ -1722,9 +1719,6 @@ static struct rt6_info * ip6_rt_copy(struct rt6_info *ort)
 
 		dst_copy_metrics(&rt->dst, &ort->dst);
 		rt->dst.error = ort->dst.error;
-		rt->dst.dev = ort->dst.dev;
-		if (rt->dst.dev)
-			dev_hold(rt->dst.dev);
 		rt->rt6i_idev = ort->rt6i_idev;
 		if (rt->rt6i_idev)
 			in6_dev_hold(rt->rt6i_idev);
@@ -1994,7 +1988,8 @@ struct rt6_info *addrconf_dst_alloc(struct inet6_dev *idev,
 				    int anycast)
 {
 	struct net *net = dev_net(idev->dev);
-	struct rt6_info *rt = ip6_dst_alloc(&net->ipv6.ip6_dst_ops);
+	struct rt6_info *rt = ip6_dst_alloc(&net->ipv6.ip6_dst_ops,
+					    net->loopback_dev);
 	struct neighbour *neigh;
 
 	if (rt == NULL) {
@@ -2004,13 +1999,11 @@ struct rt6_info *addrconf_dst_alloc(struct inet6_dev *idev,
 		return ERR_PTR(-ENOMEM);
 	}
 
-	dev_hold(net->loopback_dev);
 	in6_dev_hold(idev);
 
 	rt->dst.flags = DST_HOST;
 	rt->dst.input = ip6_input;
 	rt->dst.output = ip6_output;
-	rt->rt6i_dev = net->loopback_dev;
 	rt->rt6i_idev = idev;
 	dst_metric_set(&rt->dst, RTAX_HOPLIMIT, -1);
 	rt->dst.obsolete = -1;
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 15792d8..70552c4 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1348,7 +1348,7 @@ static inline struct xfrm_dst *xfrm_alloc_dst(struct net *net, int family)
 	default:
 		BUG();
 	}
-	xdst = dst_alloc(dst_ops, 0);
+	xdst = dst_alloc(dst_ops, NULL, 0, 0, 0);
 	xfrm_policy_put_afinfo(afinfo);
 
 	if (likely(xdst))
-- 
1.7.4.3


^ permalink raw reply related

* [PATCH v5 5/7] net: Use non-zero allocations in dst_alloc().
From: David Miller @ 2011-04-15 22:39 UTC (permalink / raw)
  To: netdev


Make dst_alloc() and its users explicitly initialize the entire
entry.

The zero'ing done by kmem_cache_zalloc() was almost entirely
redundant.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
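
For readers following along, a minimal userspace sketch of the pattern
described above: allocate without zeroing, set every field of the common
header explicitly, and let each embedding type clear only its own tail.
The names base_entry, route_entry, base_alloc() and route_alloc() are
hypothetical stand-ins, not kernel symbols, and malloc() stands in for
kmem_cache_alloc(). Checking the allocation before clearing the tail and
using offsetof() for the tail length are choices of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the common header (struct dst_entry). */
struct base_entry {
	struct base_entry *next;
	int refcnt;
	int error;
	unsigned long expires;
};

/* Hypothetical route type that embeds the header as its first member. */
struct route_entry {
	struct base_entry base;
	void *table;		/* first protocol-specific field */
	unsigned int flags;
	unsigned int metric;
};

/* Allocate without zeroing, then set each header field explicitly. */
static void *base_alloc(size_t size, int initial_ref)
{
	struct base_entry *b;

	assert(size >= sizeof(struct base_entry));
	b = malloc(size);
	if (!b)
		return NULL;
	b->next = NULL;
	b->refcnt = initial_ref;
	b->error = 0;
	b->expires = 0UL;
	return b;
}

/* Clear only the protocol-specific tail; the header is already set. */
static struct route_entry *route_alloc(void)
{
	struct route_entry *rt = base_alloc(sizeof(*rt), 1);

	if (!rt)
		return NULL;
	memset(&rt->table, 0,
	       sizeof(*rt) - offsetof(struct route_entry, table));
	return rt;
}
```

The header fields are written exactly once, and the memset() covers only
the bytes that the header initialization does not touch.
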
 net/core/dst.c         |   20 ++++++++++--
 net/decnet/dn_route.c  |    2 +
 net/ipv4/route.c       |   78 +++++++++++++++++++++++++++++-------------------
 net/ipv6/route.c       |    8 ++++-
 net/xfrm/xfrm_policy.c |    1 +
 5 files changed, 74 insertions(+), 35 deletions(-)

diff --git a/net/core/dst.c b/net/core/dst.c
index 9505778..30f0093 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -175,22 +175,36 @@ void *dst_alloc(struct dst_ops *ops, struct net_device *dev,
 		if (ops->gc(ops))
 			return NULL;
 	}
-	dst = kmem_cache_zalloc(ops->kmem_cachep, GFP_ATOMIC);
+	dst = kmem_cache_alloc(ops->kmem_cachep, GFP_ATOMIC);
 	if (!dst)
 		return NULL;
-	dst->ops = ops;
+	dst->child = NULL;
 	dst->dev = dev;
 	if (dev)
 		dev_hold(dev);
+	dst->ops = ops;
 	dst_init_metrics(dst, dst_default_metrics, true);
+	dst->expires = 0UL;
 	dst->path = dst;
+	dst->neighbour = NULL;
+	dst->hh = NULL;
+#ifdef CONFIG_XFRM
+	dst->xfrm = NULL;
+#endif
 	dst->input = dst_discard;
 	dst->output = dst_discard;
-
+	dst->error = 0;
 	dst->obsolete = initial_obsolete;
+	dst->header_len = 0;
+	dst->trailer_len = 0;
+#ifdef CONFIG_IP_ROUTE_CLASSID
+	dst->tclassid = 0;
+#endif
 	atomic_set(&dst->__refcnt, initial_ref);
+	dst->__use = 0;
 	dst->lastuse = jiffies;
 	dst->flags = flags;
+	dst->next = NULL;
 #if RT_CACHE_DEBUG >= 2
 	atomic_inc(&dst_total);
 #endif
diff --git a/net/decnet/dn_route.c b/net/decnet/dn_route.c
index f489b08..74544bc 100644
--- a/net/decnet/dn_route.c
+++ b/net/decnet/dn_route.c
@@ -1129,6 +1129,7 @@ make_route:
 	if (rt == NULL)
 		goto e_nobufs;
 
+	memset(&rt->fld, 0, sizeof(rt->fld));
 	rt->fld.saddr        = oldflp->saddr;
 	rt->fld.daddr        = oldflp->daddr;
 	rt->fld.flowidn_oif  = oldflp->flowidn_oif;
@@ -1398,6 +1399,7 @@ make_route:
 	if (rt == NULL)
 		goto e_nobufs;
 
+	memset(&rt->fld, 0, sizeof(rt->fld));
 	rt->rt_saddr      = fld.saddr;
 	rt->rt_daddr      = fld.daddr;
 	rt->rt_gateway    = fld.daddr;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 61a96ca..b6ad9dc 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1110,7 +1110,6 @@ static void rt_set_nexthop(struct rtable *rt, const struct flowi4 *oldflp4,
 #endif
 	set_class_tag(rt, itag);
 #endif
-	rt->rt_type = type;
 }
 
 static struct rtable *rt_dst_alloc(struct net_device *dev,
@@ -1160,25 +1159,28 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	if (!rth)
 		goto e_nobufs;
 
+#ifdef CONFIG_IP_ROUTE_CLASSID
+	rth->dst.tclassid = itag;
+#endif
 	rth->dst.output = ip_rt_bug;
 
 	rth->rt_key_dst	= daddr;
-	rth->rt_dst	= daddr;
-	rth->rt_tos	= tos;
-	rth->rt_mark    = skb->mark;
 	rth->rt_key_src	= saddr;
+	rth->rt_genid	= rt_genid(dev_net(dev));
+	rth->rt_flags	= RTCF_MULTICAST;
+	rth->rt_type	= RTN_MULTICAST;
+	rth->rt_tos	= tos;
+	rth->rt_dst	= daddr;
 	rth->rt_src	= saddr;
-#ifdef CONFIG_IP_ROUTE_CLASSID
-	rth->dst.tclassid = itag;
-#endif
 	rth->rt_route_iif = dev->ifindex;
 	rth->rt_iif	= dev->ifindex;
 	rth->rt_oif	= 0;
+	rth->rt_mark    = skb->mark;
 	rth->rt_gateway	= daddr;
 	rth->rt_spec_dst= spec_dst;
-	rth->rt_genid	= rt_genid(dev_net(dev));
-	rth->rt_flags	= RTCF_MULTICAST;
-	rth->rt_type	= RTN_MULTICAST;
+	rth->rt_peer_genid = 0;
+	rth->peer = NULL;
+	rth->fi = NULL;
 	if (our) {
 		rth->dst.input= ip_local_deliver;
 		rth->rt_flags |= RTCF_LOCAL;
@@ -1299,25 +1301,28 @@ static int __mkroute_input(struct sk_buff *skb,
 	}
 
 	rth->rt_key_dst	= daddr;
-	rth->rt_dst	= daddr;
-	rth->rt_tos	= tos;
-	rth->rt_mark    = skb->mark;
 	rth->rt_key_src	= saddr;
+	rth->rt_genid = rt_genid(dev_net(rth->dst.dev));
+	rth->rt_flags = flags;
+	rth->rt_type = res->type;
+	rth->rt_tos	= tos;
+	rth->rt_dst	= daddr;
 	rth->rt_src	= saddr;
-	rth->rt_gateway	= daddr;
 	rth->rt_route_iif = in_dev->dev->ifindex;
 	rth->rt_iif 	= in_dev->dev->ifindex;
 	rth->rt_oif 	= 0;
+	rth->rt_mark    = skb->mark;
+	rth->rt_gateway	= daddr;
 	rth->rt_spec_dst= spec_dst;
+	rth->rt_peer_genid = 0;
+	rth->peer = NULL;
+	rth->fi = NULL;
 
 	rth->dst.input = ip_forward;
 	rth->dst.output = ip_output;
-	rth->rt_genid = rt_genid(dev_net(rth->dst.dev));
 
 	rt_set_nexthop(rth, NULL, res, res->fi, res->type, itag);
 
-	rth->rt_flags = flags;
-
 	*result = rth;
 	err = 0;
  cleanup:
@@ -1464,30 +1469,37 @@ local_input:
 	if (!rth)
 		goto e_nobufs;
 
+	rth->dst.input= ip_local_deliver;
 	rth->dst.output= ip_rt_bug;
-	rth->rt_genid = rt_genid(net);
+#ifdef CONFIG_IP_ROUTE_CLASSID
+	rth->dst.tclassid = itag;
+#endif
 
 	rth->rt_key_dst	= daddr;
-	rth->rt_dst	= daddr;
-	rth->rt_tos	= tos;
-	rth->rt_mark    = skb->mark;
 	rth->rt_key_src	= saddr;
+	rth->rt_genid = rt_genid(net);
+	rth->rt_flags 	= flags|RTCF_LOCAL;
+	rth->rt_type	= res.type;
+	rth->rt_tos	= tos;
+	rth->rt_dst	= daddr;
 	rth->rt_src	= saddr;
 #ifdef CONFIG_IP_ROUTE_CLASSID
 	rth->dst.tclassid = itag;
 #endif
 	rth->rt_route_iif = dev->ifindex;
 	rth->rt_iif	= dev->ifindex;
+	rth->rt_oif	= 0;
+	rth->rt_mark    = skb->mark;
 	rth->rt_gateway	= daddr;
 	rth->rt_spec_dst= spec_dst;
-	rth->dst.input= ip_local_deliver;
-	rth->rt_flags 	= flags|RTCF_LOCAL;
+	rth->rt_peer_genid = 0;
+	rth->peer = NULL;
+	rth->fi = NULL;
 	if (res.type == RTN_UNREACHABLE) {
 		rth->dst.input= ip_error;
 		rth->dst.error= -err;
 		rth->rt_flags 	&= ~RTCF_LOCAL;
 	}
-	rth->rt_type	= res.type;
 	rth = rt_finalize(rth, skb);
 	err = 0;
 	if (IS_ERR(rth))
@@ -1630,20 +1642,25 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 	if (!rth)
 		return ERR_PTR(-ENOBUFS);
 
+	rth->dst.output = ip_output;
+
 	rth->rt_key_dst	= oldflp4->daddr;
-	rth->rt_tos	= tos;
 	rth->rt_key_src	= oldflp4->saddr;
-	rth->rt_oif	= oldflp4->flowi4_oif;
-	rth->rt_mark    = oldflp4->flowi4_mark;
+	rth->rt_genid = rt_genid(dev_net(dev_out));
+	rth->rt_flags	= flags;
+	rth->rt_type	= type;
+	rth->rt_tos	= tos;
 	rth->rt_dst	= fl4->daddr;
 	rth->rt_src	= fl4->saddr;
 	rth->rt_route_iif = 0;
 	rth->rt_iif	= oldflp4->flowi4_oif ? : dev_out->ifindex;
+	rth->rt_oif	= oldflp4->flowi4_oif;
+	rth->rt_mark    = oldflp4->flowi4_mark;
 	rth->rt_gateway = fl4->daddr;
 	rth->rt_spec_dst= fl4->saddr;
-
-	rth->dst.output=ip_output;
-	rth->rt_genid = rt_genid(dev_net(dev_out));
+	rth->rt_peer_genid = 0;
+	rth->peer = NULL;
+	rth->fi = NULL;
 
 	RT_CACHE_STAT_INC(out_slow_tot);
 
@@ -1671,7 +1688,6 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 
 	rt_set_nexthop(rth, oldflp4, res, fi, type, 0);
 
-	rth->rt_flags = flags;
 	return rth;
 }
 
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index d3f55cc..4a1d5a0 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -223,7 +223,11 @@ static struct rt6_info ip6_blk_hole_entry_template = {
 static inline struct rt6_info *ip6_dst_alloc(struct dst_ops *ops,
 					     struct net_device *dev)
 {
-	return (struct rt6_info *)dst_alloc(ops, dev, 0, 0, 0);
+	struct rt6_info *rt = dst_alloc(ops, dev, 0, 0, 0);
+
+	if (rt)
+		memset(&rt->rt6i_table, 0, sizeof(*rt) - sizeof(struct dst_entry));
+	return rt;
 }
 
 static void ip6_dst_destroy(struct dst_entry *dst)
@@ -880,6 +884,8 @@ struct dst_entry *ip6_blackhole_route(struct net *net, struct dst_entry *dst_ori
 
 	rt = dst_alloc(&ip6_dst_blackhole_ops, ort->dst.dev, 1, 0, 0);
 	if (rt) {
+		memset(&rt->rt6i_table, 0, sizeof(*rt) - sizeof(struct dst_entry));
+
 		new = &rt->dst;
 
 		new->__use = 1;
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 70552c4..00bcb88 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1349,6 +1349,7 @@ static inline struct xfrm_dst *xfrm_alloc_dst(struct net *net, int family)
 		BUG();
 	}
 	xdst = dst_alloc(dst_ops, NULL, 0, 0, 0);
+	memset(&xdst->u.rt6.rt6i_table, 0, sizeof(*xdst) - sizeof(struct dst_entry));
 	xfrm_policy_put_afinfo(afinfo);
 
 	if (likely(xdst))
-- 
1.7.4.3


^ permalink raw reply related

* [PATCH v5 6/7] ipv4: Kill rt_key_{src,dst} from struct rtable.
From: David Miller @ 2011-04-15 22:39 UTC (permalink / raw)
  To: netdev


They are always used in contexts where they can be reconstituted,
or where the finally resolved rt->rt_{src,dst} is semantically
equivalent.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
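
As an illustration of "can be reconstituted", a small userspace sketch
of the idea behind the rt_fill_info() change: the route keeps only the
resolved addresses, and the caller passes the original lookup key as an
argument. The names route, route_report and fill_report() are
hypothetical, and only the output-route case is modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The route stores only the resolved addresses, not the lookup key. */
struct route {
	uint32_t dst;
	uint32_t src;
};

/* Hypothetical report, loosely modeled on the netlink attributes. */
struct route_report {
	int has_src;
	uint32_t src;
	int has_prefsrc;
	uint32_t prefsrc;
};

/* key_src is the source address the caller originally asked for. */
static void fill_report(uint32_t key_src, const struct route *rt,
			struct route_report *rep)
{
	assert(rt != NULL && rep != NULL);

	rep->has_src = 0;
	rep->src = 0;
	rep->has_prefsrc = 0;
	rep->prefsrc = 0;

	/* Report the requested source only if one was given. */
	if (key_src) {
		rep->has_src = 1;
		rep->src = key_src;
	}
	/* Report the resolved source when it differs from the request. */
	if (rt->src != key_src) {
		rep->has_prefsrc = 1;
		rep->prefsrc = rt->src;
	}
}
```

Because the caller still holds the key at the point of use, nothing is
lost by dropping it from the stored structure.
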
 include/net/route.h     |    4 ----
 net/ipv4/ipmr.c         |    4 ++--
 net/ipv4/route.c        |   24 +++++++-----------------
 net/ipv4/xfrm4_policy.c |    2 --
 4 files changed, 9 insertions(+), 25 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index b2a44a9..1337917 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -53,10 +53,6 @@ struct fib_info;
 struct rtable {
 	struct dst_entry	dst;
 
-	/* Lookup key. */
-	__be32			rt_key_dst;
-	__be32			rt_key_src;
-
 	int			rt_genid;
 	unsigned		rt_flags;
 	__u16			rt_type;
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 1f62eae..0441b26 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1791,8 +1791,8 @@ dont_forward:
 static struct mr_table *ipmr_rt_fib_lookup(struct net *net, struct rtable *rt)
 {
 	struct flowi4 fl4 = {
-		.daddr = rt->rt_key_dst,
-		.saddr = rt->rt_key_src,
+		.daddr = rt->rt_dst,
+		.saddr = rt->rt_src,
 		.flowi4_tos = rt->rt_tos,
 		.flowi4_oif = rt->rt_oif,
 		.flowi4_iif = rt->rt_iif,
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index b6ad9dc..e9244e0 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -988,8 +988,8 @@ void ip_rt_get_source(u8 *addr, struct rtable *rt)
 		src = rt->rt_src;
 	else {
 		struct flowi4 fl4 = {
-			.daddr = rt->rt_key_dst,
-			.saddr = rt->rt_key_src,
+			.daddr = rt->rt_dst,
+			.saddr = rt->rt_src,
 			.flowi4_tos = rt->rt_tos,
 			.flowi4_oif = rt->rt_oif,
 			.flowi4_iif = rt->rt_iif,
@@ -1164,8 +1164,6 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 #endif
 	rth->dst.output = ip_rt_bug;
 
-	rth->rt_key_dst	= daddr;
-	rth->rt_key_src	= saddr;
 	rth->rt_genid	= rt_genid(dev_net(dev));
 	rth->rt_flags	= RTCF_MULTICAST;
 	rth->rt_type	= RTN_MULTICAST;
@@ -1300,8 +1298,6 @@ static int __mkroute_input(struct sk_buff *skb,
 		goto cleanup;
 	}
 
-	rth->rt_key_dst	= daddr;
-	rth->rt_key_src	= saddr;
 	rth->rt_genid = rt_genid(dev_net(rth->dst.dev));
 	rth->rt_flags = flags;
 	rth->rt_type = res->type;
@@ -1475,8 +1471,6 @@ local_input:
 	rth->dst.tclassid = itag;
 #endif
 
-	rth->rt_key_dst	= daddr;
-	rth->rt_key_src	= saddr;
 	rth->rt_genid = rt_genid(net);
 	rth->rt_flags 	= flags|RTCF_LOCAL;
 	rth->rt_type	= res.type;
@@ -1644,8 +1638,6 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 
 	rth->dst.output = ip_output;
 
-	rth->rt_key_dst	= oldflp4->daddr;
-	rth->rt_key_src	= oldflp4->saddr;
 	rth->rt_genid = rt_genid(dev_net(dev_out));
 	rth->rt_flags	= flags;
 	rth->rt_type	= type;
@@ -1922,8 +1914,6 @@ struct dst_entry *ipv4_blackhole_route(struct net *net, struct dst_entry *dst_or
 		if (new->dev)
 			dev_hold(new->dev);
 
-		rt->rt_key_dst = ort->rt_key_dst;
-		rt->rt_key_src = ort->rt_key_src;
 		rt->rt_tos = ort->rt_tos;
 		rt->rt_route_iif = ort->rt_route_iif;
 		rt->rt_iif = ort->rt_iif;
@@ -1974,7 +1964,7 @@ struct rtable *ip_route_output_flow(struct net *net, struct flowi4 *flp4,
 }
 EXPORT_SYMBOL_GPL(ip_route_output_flow);
 
-static int rt_fill_info(struct net *net,
+static int rt_fill_info(struct net *net,  __be32 src,
 			struct sk_buff *skb, u32 pid, u32 seq, int event,
 			int nowait, unsigned int flags)
 {
@@ -2004,9 +1994,9 @@ static int rt_fill_info(struct net *net,
 
 	NLA_PUT_BE32(skb, RTA_DST, rt->rt_dst);
 
-	if (rt->rt_key_src) {
+	if (src) {
 		r->rtm_src_len = 32;
-		NLA_PUT_BE32(skb, RTA_SRC, rt->rt_key_src);
+		NLA_PUT_BE32(skb, RTA_SRC, src);
 	}
 	if (rt->dst.dev)
 		NLA_PUT_U32(skb, RTA_OIF, rt->dst.dev->ifindex);
@@ -2016,7 +2006,7 @@ static int rt_fill_info(struct net *net,
 #endif
 	if (rt_is_input_route(rt))
 		NLA_PUT_BE32(skb, RTA_PREFSRC, rt->rt_spec_dst);
-	else if (rt->rt_src != rt->rt_key_src)
+	else if (rt->rt_src != src)
 		NLA_PUT_BE32(skb, RTA_PREFSRC, rt->rt_src);
 
 	if (rt->rt_dst != rt->rt_gateway)
@@ -2155,7 +2145,7 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr* nlh, void
 	if (rtm->rtm_flags & RTM_F_NOTIFY)
 		rt->rt_flags |= RTCF_NOTIFY;
 
-	err = rt_fill_info(net, skb, NETLINK_CB(in_skb).pid, nlh->nlmsg_seq,
+	err = rt_fill_info(net, src, skb, NETLINK_CB(in_skb).pid, nlh->nlmsg_seq,
 			   RTM_NEWROUTE, 0, 0);
 	if (err <= 0)
 		goto errout_free;
diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c
index d20a05e..4a592d0 100644
--- a/net/ipv4/xfrm4_policy.c
+++ b/net/ipv4/xfrm4_policy.c
@@ -71,8 +71,6 @@ static int xfrm4_fill_dst(struct xfrm_dst *xdst, struct net_device *dev,
 	struct rtable *rt = (struct rtable *)xdst->route;
 	const struct flowi4 *fl4 = &fl->u.ip4;
 
-	rt->rt_key_dst = fl4->daddr;
-	rt->rt_key_src = fl4->saddr;
 	rt->rt_tos = fl4->flowi4_tos;
 	rt->rt_route_iif = fl4->flowi4_iif;
 	rt->rt_iif = fl4->flowi4_iif;
-- 
1.7.4.3


^ permalink raw reply related

* [PATCH v5 7/7] ipv4: Use caller's on-stack flowi as-is in output route lookups.
From: David Miller @ 2011-04-15 22:39 UTC (permalink / raw)
  To: netdev


Signed-off-by: David S. Miller <davem@davemloft.net>
---
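
A rough userspace sketch of what using the caller's flow key "as-is"
implies: the resolver fills in missing fields directly in the structure
it was handed instead of in a private copy, so the caller sees the
resolved values afterwards. The names flow_key, resolve_output(),
LOOPBACK_IFINDEX and LOOPBACK_ADDR are hypothetical, only the
"no destination given" branch is modeled, and byte order is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOOPBACK_IFINDEX 1		/* assumed ifindex of "lo" */
#define LOOPBACK_ADDR    0x7f000001u	/* 127.0.0.1, host order here */

/* Hypothetical, much reduced stand-in for struct flowi4. */
struct flow_key {
	int oif;
	int iif;
	uint32_t daddr;
	uint32_t saddr;
};

/* Resolve in place: the caller's key is updated, no local copy. */
static int resolve_output(struct flow_key *fl)
{
	assert(fl != NULL);

	fl->iif = LOOPBACK_IFINDEX;

	if (!fl->daddr) {
		/* No destination: fall back to the source, else loopback. */
		fl->daddr = fl->saddr;
		if (!fl->daddr)
			fl->daddr = fl->saddr = LOOPBACK_ADDR;
		fl->oif = LOOPBACK_IFINDEX;
	}
	return 0;
}
```

The cost of this design is that the key can no longer be const, which
matches the prototype change to __ip_route_output_key() below.
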
 include/net/route.h |    2 +-
 net/ipv4/route.c    |  136 ++++++++++++++++++++++++--------------------------
 2 files changed, 66 insertions(+), 72 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 1337917..b7cab6f 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -118,7 +118,7 @@ extern int		ip_rt_init(void);
 extern void		ip_rt_redirect(__be32 old_gw, __be32 dst, __be32 new_gw,
 				       __be32 src, struct net_device *dev);
 extern void		rt_cache_flush(struct net *net, int how);
-extern struct rtable *__ip_route_output_key(struct net *, const struct flowi4 *flp);
+extern struct rtable *__ip_route_output_key(struct net *, struct flowi4 *flp);
 extern struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp,
 					   struct sock *sk);
 extern struct dst_entry *ipv4_blackhole_route(struct net *net, struct dst_entry *dst_orig);
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index e9244e0..c55620d 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1047,7 +1047,7 @@ static unsigned int ipv4_default_mtu(const struct dst_entry *dst)
 	return mtu;
 }
 
-static void rt_init_metrics(struct rtable *rt, const struct flowi4 *oldflp4,
+static void rt_init_metrics(struct rtable *rt, const struct flowi4 *fl4,
 			    struct fib_info *fi)
 {
 	struct inet_peer *peer;
@@ -1056,7 +1056,7 @@ static void rt_init_metrics(struct rtable *rt, const struct flowi4 *oldflp4,
 	/* If a peer entry exists for this destination, we must hook
 	 * it up in order to get at cached metrics.
 	 */
-	if (oldflp4 && (oldflp4->flowi4_flags & FLOWI_FLAG_PRECOW_METRICS))
+	if (fl4 && (fl4->flowi4_flags & FLOWI_FLAG_PRECOW_METRICS))
 		create = 1;
 
 	rt->peer = peer = inet_getpeer_v4(rt->rt_dst, create);
@@ -1083,7 +1083,7 @@ static void rt_init_metrics(struct rtable *rt, const struct flowi4 *oldflp4,
 	}
 }
 
-static void rt_set_nexthop(struct rtable *rt, const struct flowi4 *oldflp4,
+static void rt_set_nexthop(struct rtable *rt, const struct flowi4 *fl4,
 			   const struct fib_result *res,
 			   struct fib_info *fi, u16 type, u32 itag)
 {
@@ -1093,7 +1093,7 @@ static void rt_set_nexthop(struct rtable *rt, const struct flowi4 *oldflp4,
 		if (FIB_RES_GW(*res) &&
 		    FIB_RES_NH(*res).nh_scope == RT_SCOPE_LINK)
 			rt->rt_gateway = FIB_RES_GW(*res);
-		rt_init_metrics(rt, oldflp4, fi);
+		rt_init_metrics(rt, fl4, fi);
 #ifdef CONFIG_IP_ROUTE_CLASSID
 		dst->tclassid = FIB_RES_NH(*res).nh_tclassid;
 #endif
@@ -1587,12 +1587,11 @@ EXPORT_SYMBOL(ip_route_input);
 /* called with rcu_read_lock() */
 static struct rtable *__mkroute_output(const struct fib_result *res,
 				       const struct flowi4 *fl4,
-				       const struct flowi4 *oldflp4,
 				       struct net_device *dev_out,
 				       unsigned int flags)
 {
 	struct fib_info *fi = res->fi;
-	u32 tos = RT_FL_TOS(oldflp4);
+	u32 tos = RT_FL_TOS(fl4);
 	struct in_device *in_dev;
 	u16 type = res->type;
 	struct rtable *rth;
@@ -1619,8 +1618,8 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 		fi = NULL;
 	} else if (type == RTN_MULTICAST) {
 		flags |= RTCF_MULTICAST | RTCF_LOCAL;
-		if (!ip_check_mc_rcu(in_dev, oldflp4->daddr, oldflp4->saddr,
-				     oldflp4->flowi4_proto))
+		if (!ip_check_mc_rcu(in_dev, fl4->daddr, fl4->saddr,
+				     fl4->flowi4_proto))
 			flags &= ~RTCF_LOCAL;
 		/* If multicast route do not exist use
 		 * default one, but do not gateway in this case.
@@ -1645,9 +1644,9 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 	rth->rt_dst	= fl4->daddr;
 	rth->rt_src	= fl4->saddr;
 	rth->rt_route_iif = 0;
-	rth->rt_iif	= oldflp4->flowi4_oif ? : dev_out->ifindex;
-	rth->rt_oif	= oldflp4->flowi4_oif;
-	rth->rt_mark    = oldflp4->flowi4_mark;
+	rth->rt_iif	= fl4->flowi4_oif ? : dev_out->ifindex;
+	rth->rt_oif	= fl4->flowi4_oif;
+	rth->rt_mark    = fl4->flowi4_mark;
 	rth->rt_gateway = fl4->daddr;
 	rth->rt_spec_dst= fl4->saddr;
 	rth->rt_peer_genid = 0;
@@ -1670,7 +1669,7 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 #ifdef CONFIG_IP_MROUTE
 		if (type == RTN_MULTICAST) {
 			if (IN_DEV_MFORWARD(in_dev) &&
-			    !ipv4_is_local_multicast(oldflp4->daddr)) {
+			    !ipv4_is_local_multicast(fl4->daddr)) {
 				rth->dst.input = ip_mr_input;
 				rth->dst.output = ip_mc_output;
 			}
@@ -1678,7 +1677,7 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 #endif
 	}
 
-	rt_set_nexthop(rth, oldflp4, res, fi, type, 0);
+	rt_set_nexthop(rth, fl4, res, fi, type, 0);
 
 	return rth;
 }
@@ -1687,13 +1686,12 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
  * Major route resolver routine.
  */
 
-struct rtable *__ip_route_output_key(struct net *net, const struct flowi4 *oldflp4)
+struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
 {
-	u32 tos	= RT_FL_TOS(oldflp4);
-	struct flowi4 fl4;
-	struct fib_result res;
-	unsigned int flags = 0;
 	struct net_device *dev_out = NULL;
+	u32 tos	= RT_FL_TOS(fl4);
+	unsigned int flags = 0;
+	struct fib_result res;
 	struct rtable *rth;
 
 	res.fi		= NULL;
@@ -1701,21 +1699,17 @@ struct rtable *__ip_route_output_key(struct net *net, const struct flowi4 *oldfl
 	res.r		= NULL;
 #endif
 
-	fl4.flowi4_oif = oldflp4->flowi4_oif;
-	fl4.flowi4_iif = net->loopback_dev->ifindex;
-	fl4.flowi4_mark = oldflp4->flowi4_mark;
-	fl4.daddr = oldflp4->daddr;
-	fl4.saddr = oldflp4->saddr;
-	fl4.flowi4_tos = tos & IPTOS_RT_MASK;
-	fl4.flowi4_scope = ((tos & RTO_ONLINK) ?
-			RT_SCOPE_LINK : RT_SCOPE_UNIVERSE);
+	fl4->flowi4_iif = net->loopback_dev->ifindex;
+	fl4->flowi4_tos = tos & IPTOS_RT_MASK;
+	fl4->flowi4_scope = ((tos & RTO_ONLINK) ?
+			 RT_SCOPE_LINK : RT_SCOPE_UNIVERSE);
 
 	rcu_read_lock();
-	if (oldflp4->saddr) {
+	if (fl4->saddr) {
 		rth = ERR_PTR(-EINVAL);
-		if (ipv4_is_multicast(oldflp4->saddr) ||
-		    ipv4_is_lbcast(oldflp4->saddr) ||
-		    ipv4_is_zeronet(oldflp4->saddr))
+		if (ipv4_is_multicast(fl4->saddr) ||
+		    ipv4_is_lbcast(fl4->saddr) ||
+		    ipv4_is_zeronet(fl4->saddr))
 			goto out;
 
 		/* I removed check for oif == dev_out->oif here.
@@ -1726,11 +1720,11 @@ struct rtable *__ip_route_output_key(struct net *net, const struct flowi4 *oldfl
 		      of another iface. --ANK
 		 */
 
-		if (oldflp4->flowi4_oif == 0 &&
-		    (ipv4_is_multicast(oldflp4->daddr) ||
-		     ipv4_is_lbcast(oldflp4->daddr))) {
+		if (fl4->flowi4_oif == 0 &&
+		    (ipv4_is_multicast(fl4->daddr) ||
+		     ipv4_is_lbcast(fl4->daddr))) {
 			/* It is equivalent to inet_addr_type(saddr) == RTN_LOCAL */
-			dev_out = __ip_dev_find(net, oldflp4->saddr, false);
+			dev_out = __ip_dev_find(net, fl4->saddr, false);
 			if (dev_out == NULL)
 				goto out;
 
@@ -1749,20 +1743,20 @@ struct rtable *__ip_route_output_key(struct net *net, const struct flowi4 *oldfl
 			   Luckily, this hack is good workaround.
 			 */
 
-			fl4.flowi4_oif = dev_out->ifindex;
+			fl4->flowi4_oif = dev_out->ifindex;
 			goto make_route;
 		}
 
-		if (!(oldflp4->flowi4_flags & FLOWI_FLAG_ANYSRC)) {
+		if (!(fl4->flowi4_flags & FLOWI_FLAG_ANYSRC)) {
 			/* It is equivalent to inet_addr_type(saddr) == RTN_LOCAL */
-			if (!__ip_dev_find(net, oldflp4->saddr, false))
+			if (!__ip_dev_find(net, fl4->saddr, false))
 				goto out;
 		}
 	}
 
 
-	if (oldflp4->flowi4_oif) {
-		dev_out = dev_get_by_index_rcu(net, oldflp4->flowi4_oif);
+	if (fl4->flowi4_oif) {
+		dev_out = dev_get_by_index_rcu(net, fl4->flowi4_oif);
 		rth = ERR_PTR(-ENODEV);
 		if (dev_out == NULL)
 			goto out;
@@ -1772,37 +1766,37 @@ struct rtable *__ip_route_output_key(struct net *net, const struct flowi4 *oldfl
 			rth = ERR_PTR(-ENETUNREACH);
 			goto out;
 		}
-		if (ipv4_is_local_multicast(oldflp4->daddr) ||
-		    ipv4_is_lbcast(oldflp4->daddr)) {
-			if (!fl4.saddr)
-				fl4.saddr = inet_select_addr(dev_out, 0,
-							     RT_SCOPE_LINK);
+		if (ipv4_is_local_multicast(fl4->daddr) ||
+		    ipv4_is_lbcast(fl4->daddr)) {
+			if (!fl4->saddr)
+				fl4->saddr = inet_select_addr(dev_out, 0,
+							      RT_SCOPE_LINK);
 			goto make_route;
 		}
-		if (!fl4.saddr) {
-			if (ipv4_is_multicast(oldflp4->daddr))
-				fl4.saddr = inet_select_addr(dev_out, 0,
-							     fl4.flowi4_scope);
-			else if (!oldflp4->daddr)
-				fl4.saddr = inet_select_addr(dev_out, 0,
-							     RT_SCOPE_HOST);
+		if (!fl4->saddr) {
+			if (ipv4_is_multicast(fl4->daddr))
+				fl4->saddr = inet_select_addr(dev_out, 0,
+							      fl4->flowi4_scope);
+			else if (!fl4->daddr)
+				fl4->saddr = inet_select_addr(dev_out, 0,
+							      RT_SCOPE_HOST);
 		}
 	}
 
-	if (!fl4.daddr) {
-		fl4.daddr = fl4.saddr;
-		if (!fl4.daddr)
-			fl4.daddr = fl4.saddr = htonl(INADDR_LOOPBACK);
+	if (!fl4->daddr) {
+		fl4->daddr = fl4->saddr;
+		if (!fl4->daddr)
+			fl4->daddr = fl4->saddr = htonl(INADDR_LOOPBACK);
 		dev_out = net->loopback_dev;
-		fl4.flowi4_oif = net->loopback_dev->ifindex;
+		fl4->flowi4_oif = net->loopback_dev->ifindex;
 		res.type = RTN_LOCAL;
 		flags |= RTCF_LOCAL;
 		goto make_route;
 	}
 
-	if (fib_lookup(net, &fl4, &res)) {
+	if (fib_lookup(net, fl4, &res)) {
 		res.fi = NULL;
-		if (oldflp4->flowi4_oif) {
+		if (fl4->flowi4_oif) {
 			/* Apparently, routing tables are wrong. Assume,
 			   that the destination is on link.
 
@@ -1821,9 +1815,9 @@ struct rtable *__ip_route_output_key(struct net *net, const struct flowi4 *oldfl
 			   likely IPv6, but we do not.
 			 */
 
-			if (fl4.saddr == 0)
-				fl4.saddr = inet_select_addr(dev_out, 0,
-							     RT_SCOPE_LINK);
+			if (fl4->saddr == 0)
+				fl4->saddr = inet_select_addr(dev_out, 0,
+							      RT_SCOPE_LINK);
 			res.type = RTN_UNICAST;
 			goto make_route;
 		}
@@ -1832,38 +1826,38 @@ struct rtable *__ip_route_output_key(struct net *net, const struct flowi4 *oldfl
 	}
 
 	if (res.type == RTN_LOCAL) {
-		if (!fl4.saddr) {
+		if (!fl4->saddr) {
 			if (res.fi->fib_prefsrc)
-				fl4.saddr = res.fi->fib_prefsrc;
+				fl4->saddr = res.fi->fib_prefsrc;
 			else
-				fl4.saddr = fl4.daddr;
+				fl4->saddr = fl4->daddr;
 		}
 		dev_out = net->loopback_dev;
-		fl4.flowi4_oif = dev_out->ifindex;
+		fl4->flowi4_oif = dev_out->ifindex;
 		res.fi = NULL;
 		flags |= RTCF_LOCAL;
 		goto make_route;
 	}
 
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
-	if (res.fi->fib_nhs > 1 && fl4.flowi4_oif == 0)
+	if (res.fi->fib_nhs > 1 && fl4->flowi4_oif == 0)
 		fib_select_multipath(&res);
 	else
 #endif
 	if (!res.prefixlen &&
 	    res.table->tb_num_default > 1 &&
-	    res.type == RTN_UNICAST && !fl4.flowi4_oif)
+	    res.type == RTN_UNICAST && !fl4->flowi4_oif)
 		fib_select_default(&res);
 
-	if (!fl4.saddr)
-		fl4.saddr = FIB_RES_PREFSRC(net, res);
+	if (!fl4->saddr)
+		fl4->saddr = FIB_RES_PREFSRC(net, res);
 
 	dev_out = FIB_RES_DEV(res);
-	fl4.flowi4_oif = dev_out->ifindex;
+	fl4->flowi4_oif = dev_out->ifindex;
 
 
 make_route:
-	rth = __mkroute_output(&res, &fl4, oldflp4, dev_out, flags);
+	rth = __mkroute_output(&res, fl4, dev_out, flags);
 	if (!IS_ERR(rth))
 		rth = rt_finalize(rth, NULL);
 
-- 
1.7.4.3


^ permalink raw reply related

* Re: [PATCH 1/1] ipv6: ignore looped-back NA while dad is running
From: David Miller @ 2011-04-15 22:44 UTC (permalink / raw)
  To: sahne; +Cc: netdev, linux-kernel
In-Reply-To: <20110414070925.GA78446@0x90.at>

From: Daniel Walter <sahne@0x90.at>
Date: Thu, 14 Apr 2011 09:09:25 +0200

> [ipv6] Ignore looped-back NAs while in Duplicate Address Detection
> 
> If we send an unsolicited NA shortly after bringing up an
> IPv6 address, the duplicate address detection algorithm
> fails and the ip stays in tentative mode forever. 
> This is due to a missing check if the NA is looped-back to us.
> 
> Signed-off-by: Daniel Walter <dwalter@barracuda.com>

Applied to net-next-2.6

^ permalink raw reply
