Netdev List

* Re: Network performance with small packets
From: Michael S. Tsirkin @ 2011-04-14 16:03 UTC (permalink / raw)
  To: Rusty Russell
  Cc: habanero, Shirley Ma, Krishna Kumar2, David Miller, kvm, netdev,
	steved, Tom Lendacky, borntraeger
In-Reply-To: <87bp09ax7a.fsf@rustcorp.com.au>

On Thu, Apr 14, 2011 at 08:58:41PM +0930, Rusty Russell wrote:
> On Tue, 12 Apr 2011 23:01:12 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > On Thu, Mar 10, 2011 at 12:19:42PM +1030, Rusty Russell wrote:
> > > Here's an old patch where I played with implementing this:
> > 
> > ...
> > 
> > > 
> > > virtio: put last_used and last_avail index into ring itself.
> > > 
> > > Generally, the other end of the virtio ring doesn't need to see where
> > > you're up to in consuming the ring.  However, to completely understand
> > > what's going on from the outside, this information must be exposed.
> > > For example, if you want to save and restore a virtio_ring, but you're
> > > not the consumer because the kernel is using it directly.
> > > 
> > > Fortunately, we have room to expand:
> > 
> > This seems to be true for x86 kvm and lguest but is it true
> > for s390?
> 
> Yes, as the ring is page aligned so there's always room.
> 
> > Will this last bit work on s390?
> > If I understand correctly the memory is allocated by host there?
> 
> They have to offer the feature, so if they have some way of allocating
> non-page-aligned amounts of memory, they'll have to add those extra 2
> bytes.
> 
> So I think it's OK...
> Rusty.

To clarify, my concern is that we always seem to try to map
these extra 2 bytes, which conceivably might fail?

-- 
MST
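
The "always room" argument can be checked with a little arithmetic. The
sketch below is a user-space illustration only. It assumes the ring layout
of that era (16-byte descriptors, an avail ring of 2 + num 16-bit words, a
used ring of two 16-bit words plus num 8-byte elements, with the used ring
starting on an alignment boundary) and assumes the extra indices would sit
directly after the avail and used rings. All function names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed element sizes: 16-byte descriptors, 8-byte used elements. */
enum { DEMO_DESC_SIZE = 16, DEMO_USED_ELEM_SIZE = 8 };

/* Round n up to a multiple of align (align must be a power of two). */
static size_t demo_round_up(size_t n, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (n + align - 1) & ~(align - 1);
}

/* Bytes taken by the descriptor table plus the avail ring. */
static size_t demo_avail_end(size_t num)
{
	return DEMO_DESC_SIZE * num + sizeof(uint16_t) * (2 + num);
}

/* Bytes taken by the used ring: flags, idx, ring[num]. */
static size_t demo_used_bytes(size_t num)
{
	return sizeof(uint16_t) * 2 + DEMO_USED_ELEM_SIZE * num;
}

/* Padding between the end of the avail ring and the aligned used ring. */
static size_t demo_slack_after_avail(size_t num, size_t align)
{
	return demo_round_up(demo_avail_end(num), align) - demo_avail_end(num);
}

/* Padding between the end of the used ring and the end of its last page. */
static size_t demo_slack_after_used(size_t num, size_t align)
{
	return demo_round_up(demo_used_bytes(num), align) - demo_used_bytes(num);
}
```

For a 256-entry ring with 4096-byte alignment, both paddings are far larger
than 2 bytes. The second figure only helps if the allocation itself is
rounded up to whole pages, which is the case being questioned for s390.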


* [PATCH v2] ip: ip_options_compile() resilient to NULL skb route
From: Eric Dumazet @ 2011-04-14 15:55 UTC (permalink / raw)
  To: Hiroaki SHIMODA; +Cc: Stephen Hemminger, David Miller, lkml, netdev
In-Reply-To: <20110414123058.d4ffe7fb.shimoda.hiroaki@gmail.com>

Scot Doyle demonstrated ip_options_compile() could be called with an skb
without an attached route, using a setup involving a bridge, netfilter,
and forged IP packets.

Let's make ip_options_compile() and ip_options_rcv_srr() a bit more
robust, instead of changing bridge/netfilter code.

With help from Hiroaki SHIMODA.

Reported-by: Scot Doyle <lkml@scotdoyle.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com>
---
v2: ip_options_rcv_srr() fix as well, from Hiroaki

 net/ipv4/ip_options.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/ip_options.c b/net/ipv4/ip_options.c
index 28a736f..2391b24 100644
--- a/net/ipv4/ip_options.c
+++ b/net/ipv4/ip_options.c
@@ -329,7 +329,7 @@ int ip_options_compile(struct net *net,
 					pp_ptr = optptr + 2;
 					goto error;
 				}
-				if (skb) {
+				if (rt) {
 					memcpy(&optptr[optptr[2]-1], &rt->rt_spec_dst, 4);
 					opt->is_changed = 1;
 				}
@@ -371,7 +371,7 @@ int ip_options_compile(struct net *net,
 						goto error;
 					}
 					opt->ts = optptr - iph;
-					if (skb) {
+					if (rt)  {
 						memcpy(&optptr[optptr[2]-1], &rt->rt_spec_dst, 4);
 						timeptr = (__be32*)&optptr[optptr[2]+3];
 					}
@@ -603,7 +603,7 @@ int ip_options_rcv_srr(struct sk_buff *skb)
 	unsigned long orefdst;
 	int err;
 
-	if (!opt->srr)
+	if (!opt->srr || !rt)
 		return 0;
 
 	if (skb->pkt_type != PACKET_HOST)
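
The pattern in the hunks above is to test the pointer that is actually
dereferenced (rt) rather than a related one (skb). A minimal user-space
sketch of that guard follows. The demo_route and demo_skb types and the
fill_option_addr() helper are hypothetical stand-ins, not kernel
structures or functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the route and the packet buffer. */
struct demo_route { uint32_t spec_dst; };
struct demo_skb { struct demo_route *route; /* may be NULL */ };

/*
 * Copy the route's address into a 4-byte option slot.
 * Returns 1 if the slot was written, 0 if no route was attached.
 */
static int fill_option_addr(const struct demo_skb *skb, uint8_t slot[4])
{
	const struct demo_route *rt = skb ? skb->route : NULL;

	assert(slot != NULL);
	if (!rt)	/* guard the pointer that is dereferenced below */
		return 0;
	memcpy(slot, &rt->spec_dst, 4);
	return 1;
}
```

A packet without a route leaves the option untouched instead of
dereferencing a NULL pointer.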




* Re: [PATCH v2] net: filter: Just In Time compiler
From: Avi Kivity @ 2011-04-14 15:53 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Hagen Paul Pfeifer, David Miller, netdev,
	Arnaldo Carvalho de Melo, Ben Hutchings
In-Reply-To: <1302795951.3248.14.camel@edumazet-laptop>

On 04/14/2011 06:45 PM, Eric Dumazet wrote:
> Le jeudi 14 avril 2011 à 18:41 +0300, Avi Kivity a écrit :
>
> >  I'm talking about optimizing the generated code.  For example, bpf has
> >  just two registers so a complex program generates a lot of loads and
> >  stores.  An optimizing compiler can use extra target registers to avoid
> >  those spills, and doesn't need to keep A and X in fixed registers.
> >
>
> That's not exactly true.
>
> A bpf filter also uses up to 16 mem[] 'registers'.
>

That's what I referred to as loads and stores.  Since you can't use mem[] 
to index into a packet, you have to spill X into mem[], calculate a new 
X, use it to access the packet, and reload X.

> A risc cpu (with a lot of registers) could use registers to hold part of
> the mem[] array.

An optimizing compiler will dynamically assign mem[] into registers, 
even on i386.  Liveness analysis means the same machine register can be 
used for different mem[] locations.
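
To make the spill pattern concrete, here is a small sketch of a
hypothetical filter fragment (written for illustration, not taken from the
patch) translated into C twice: once literally, with A, X and mem[] as in
the bpf machine, and once with plain locals. Both compute the same value.
With optimization enabled a compiler would be expected to keep the saved
index in a register in either version, though that depends on the compiler
and target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Literal translation of this hypothetical fragment:
 *   ldxb 4*([0]&0xf)   X = header length
 *   stx  M[0]          spill X
 *   ldb  [x+0]         A = pkt[X]
 *   tax                X = A (new index)
 *   ldb  [x+0]         A = pkt[X]
 *   ldx  M[0]          reload X
 *   add  x             A += X
 *   ret  a
 * Out-of-bounds loads return 0, as a bpf runtime would reject the packet.
 */
static uint32_t demo_run_literal(const uint8_t *pkt, size_t len)
{
	uint32_t A, X, mem[16] = { 0 };

	assert(pkt != NULL);
	if (len < 1)
		return 0;
	X = 4 * (pkt[0] & 0xf);
	mem[0] = X;			/* spill */
	if (X >= len)
		return 0;
	A = pkt[X];
	X = A;
	if (X >= len)
		return 0;
	A = pkt[X];
	X = mem[0];			/* reload */
	return A + X;
}

/* Same computation with named locals and no fixed A/X/mem[] roles. */
static uint32_t demo_run_locals(const uint8_t *pkt, size_t len)
{
	uint32_t hdr_len, idx;

	assert(pkt != NULL);
	if (len < 1)
		return 0;
	hdr_len = 4 * (pkt[0] & 0xf);
	if (hdr_len >= len)
		return 0;
	idx = pkt[hdr_len];
	if (idx >= len)
		return 0;
	return pkt[idx] + hdr_len;
}
```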

> >  If you translate the bpf program to C and optimize that with gcc you'll
> >  probably get much better machine code than the jit in the patch.
> >
>
> Well, gcc won't optimize a bpf program much, if you ask me.

IMO, it will.  I'll try to have gcc optimize your example filter later.

> You would do better to make tcpdump generate C code directly instead of bpf.

That involves breaking the interface (plus, we might not trust tcpdump).

-- 
error compiling committee.c: too many arguments to function



* [PATCH net-next] qlge: make nic_operations struct const
From: Stephen Hemminger @ 2011-04-14 15:51 UTC (permalink / raw)
  To: Ron Mercer, netdev; +Cc: linux-driver

The struct nic_operations is just function pointers and should be
declared const for added security.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

--- a/drivers/net/qlge/qlge.h	2011-04-14 08:31:15.863779054 -0700
+++ b/drivers/net/qlge/qlge.h	2011-04-14 08:31:19.231840948 -0700
@@ -2134,7 +2134,7 @@ struct ql_adapter {
 	struct delayed_work mpi_idc_work;
 	struct delayed_work mpi_core_to_log;
 	struct completion ide_completion;
-	struct nic_operations *nic_ops;
+	const struct nic_operations *nic_ops;
 	u16 device_id;
 	struct timer_list timer;
 	atomic_t lb_count;
--- a/drivers/net/qlge/qlge_main.c	2011-04-14 08:30:32.311185557 -0700
+++ b/drivers/net/qlge/qlge_main.c	2011-04-14 08:31:07.595627135 -0700
@@ -4412,12 +4412,12 @@ error:
 	rtnl_unlock();
 }
 
-static struct nic_operations qla8012_nic_ops = {
+static const struct nic_operations qla8012_nic_ops = {
 	.get_flash		= ql_get_8012_flash_params,
 	.port_initialize	= ql_8012_port_initialize,
 };
 
-static struct nic_operations qla8000_nic_ops = {
+static const struct nic_operations qla8000_nic_ops = {
 	.get_flash		= ql_get_8000_flash_params,
 	.port_initialize	= ql_8000_port_initialize,
 };
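
As a stand-alone illustration of the change, the sketch below declares a
function-pointer table const and reaches it through a pointer-to-const
member. The demo_* names are hypothetical and only mirror the shape of the
driver code. A const object with static storage is typically placed in a
read-only section, and an assignment such as a->ops->get_flash = NULL is
rejected at compile time.

```c
#include <assert.h>
#include <stddef.h>

struct demo_adapter;

/* Table of operations: function pointers only. */
struct demo_nic_ops {
	int (*get_flash)(struct demo_adapter *a);
	int (*port_initialize)(struct demo_adapter *a);
};

struct demo_adapter {
	const struct demo_nic_ops *ops;	/* table cannot be modified via this */
	int flash_loaded;
	int port_up;
};

static int demo_get_flash(struct demo_adapter *a)
{
	a->flash_loaded = 1;
	return 0;
}

static int demo_port_initialize(struct demo_adapter *a)
{
	a->port_up = 1;
	return 0;
}

static const struct demo_nic_ops demo_ops = {
	.get_flash		= demo_get_flash,
	.port_initialize	= demo_port_initialize,
};

/* Run both operations through the const table; stop at the first error. */
static int demo_adapter_start(struct demo_adapter *a)
{
	int rc;

	assert(a != NULL && a->ops != NULL);
	rc = a->ops->get_flash(a);
	if (rc)
		return rc;
	return a->ops->port_initialize(a);
}
```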


* [PATCH net-next] sfc: make function tables const
From: Stephen Hemminger @ 2011-04-14 15:50 UTC (permalink / raw)
  To: Steve Hodgson, Ben Hutchings; +Cc: Solarflare linux maintainers, netdev

The phy, mac, and board information structures should be const.
Since the tables contain function pointers, this improves security
(at least theoretically).

Compile tested only.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

---
 drivers/net/sfc/efx.c          |    6 +++---
 drivers/net/sfc/falcon.c       |    4 ++--
 drivers/net/sfc/falcon_xmac.c  |    2 +-
 drivers/net/sfc/mac.h          |    4 ++--
 drivers/net/sfc/mcdi_mac.c     |    2 +-
 drivers/net/sfc/mcdi_phy.c     |    2 +-
 drivers/net/sfc/net_driver.h   |    6 +++---
 drivers/net/sfc/nic.h          |    6 +++---
 drivers/net/sfc/phy.h          |    8 ++++----
 drivers/net/sfc/qt202x_phy.c   |    2 +-
 drivers/net/sfc/siena.c        |    2 +-
 drivers/net/sfc/tenxpress.c    |    2 +-
 drivers/net/sfc/txc43128_phy.c |    2 +-
 13 files changed, 24 insertions(+), 24 deletions(-)

--- a/drivers/net/sfc/efx.c	2011-04-14 08:33:56.762355851 -0700
+++ b/drivers/net/sfc/efx.c	2011-04-14 08:43:13.544925178 -0700
@@ -2245,7 +2245,7 @@ static bool efx_port_dummy_op_poll(struc
 	return false;
 }
 
-static struct efx_phy_operations efx_dummy_phy_operations = {
+static const struct efx_phy_operations efx_dummy_phy_operations = {
 	.init		 = efx_port_dummy_op_int,
 	.reconfigure	 = efx_port_dummy_op_int,
 	.poll		 = efx_port_dummy_op_poll,
@@ -2261,7 +2261,7 @@ static struct efx_phy_operations efx_dum
 /* This zeroes out and then fills in the invariants in a struct
  * efx_nic (including all sub-structures).
  */
-static int efx_init_struct(struct efx_nic *efx, struct efx_nic_type *type,
+static int efx_init_struct(struct efx_nic *efx, const struct efx_nic_type *type,
 			   struct pci_dev *pci_dev, struct net_device *net_dev)
 {
 	int i;
@@ -2451,7 +2451,7 @@ static int efx_pci_probe_main(struct efx
 static int __devinit efx_pci_probe(struct pci_dev *pci_dev,
 				   const struct pci_device_id *entry)
 {
-	struct efx_nic_type *type = (struct efx_nic_type *) entry->driver_data;
+	const struct efx_nic_type *type = (const struct efx_nic_type *) entry->driver_data;
 	struct net_device *net_dev;
 	struct efx_nic *efx;
 	int i, rc;
--- a/drivers/net/sfc/mcdi_phy.c	2011-04-14 08:34:22.730711864 -0700
+++ b/drivers/net/sfc/mcdi_phy.c	2011-04-14 08:35:55.387770007 -0700
@@ -739,7 +739,7 @@ static const char *efx_mcdi_phy_test_nam
 	return NULL;
 }
 
-struct efx_phy_operations efx_mcdi_phy_ops = {
+const struct efx_phy_operations efx_mcdi_phy_ops = {
 	.probe		= efx_mcdi_phy_probe,
 	.init 	 	= efx_port_dummy_op_int,
 	.reconfigure	= efx_mcdi_phy_reconfigure,
--- a/drivers/net/sfc/net_driver.h	2011-04-14 08:34:34.814856688 -0700
+++ b/drivers/net/sfc/net_driver.h	2011-04-14 08:40:03.118873109 -0700
@@ -773,10 +773,10 @@ struct efx_nic {
 
 	struct efx_buffer stats_buffer;
 
-	struct efx_mac_operations *mac_op;
+	const struct efx_mac_operations *mac_op;
 
 	unsigned int phy_type;
-	struct efx_phy_operations *phy_op;
+	const struct efx_phy_operations *phy_op;
 	void *phy_data;
 	struct mdio_if_info mdio;
 	unsigned int mdio_bus;
@@ -897,7 +897,7 @@ struct efx_nic_type {
 	void (*resume_wol)(struct efx_nic *efx);
 	int (*test_registers)(struct efx_nic *efx);
 	int (*test_nvram)(struct efx_nic *efx);
-	struct efx_mac_operations *default_mac_ops;
+	const struct efx_mac_operations *default_mac_ops;
 
 	int revision;
 	unsigned int mem_map_size;
--- a/drivers/net/sfc/phy.h	2011-04-14 08:34:45.574982389 -0700
+++ b/drivers/net/sfc/phy.h	2011-04-14 08:35:08.507246412 -0700
@@ -13,14 +13,14 @@
 /****************************************************************************
  * 10Xpress (SFX7101) PHY
  */
-extern struct efx_phy_operations falcon_sfx7101_phy_ops;
+extern const struct efx_phy_operations falcon_sfx7101_phy_ops;
 
 extern void tenxpress_set_id_led(struct efx_nic *efx, enum efx_led_mode mode);
 
 /****************************************************************************
  * AMCC/Quake QT202x PHYs
  */
-extern struct efx_phy_operations falcon_qt202x_phy_ops;
+extern const struct efx_phy_operations falcon_qt202x_phy_ops;
 
 /* These PHYs provide various H/W control states for LEDs */
 #define QUAKE_LED_LINK_INVAL	(0)
@@ -39,7 +39,7 @@ extern void falcon_qt202x_set_led(struct
 /****************************************************************************
 * Transwitch CX4 retimer
 */
-extern struct efx_phy_operations falcon_txc_phy_ops;
+extern const struct efx_phy_operations falcon_txc_phy_ops;
 
 #define TXC_GPIO_DIR_INPUT	0
 #define TXC_GPIO_DIR_OUTPUT	1
@@ -50,7 +50,7 @@ extern void falcon_txc_set_gpio_val(stru
 /****************************************************************************
  * Siena managed PHYs
  */
-extern struct efx_phy_operations efx_mcdi_phy_ops;
+extern const struct efx_phy_operations efx_mcdi_phy_ops;
 
 extern int efx_mcdi_mdio_read(struct efx_nic *efx, unsigned int bus,
 			      unsigned int prtad, unsigned int devad,
--- a/drivers/net/sfc/qt202x_phy.c	2011-04-14 08:35:21.647395047 -0700
+++ b/drivers/net/sfc/qt202x_phy.c	2011-04-14 08:35:25.715440799 -0700
@@ -449,7 +449,7 @@ static void qt202x_phy_remove(struct efx
 	efx->phy_data = NULL;
 }
 
-struct efx_phy_operations falcon_qt202x_phy_ops = {
+const struct efx_phy_operations falcon_qt202x_phy_ops = {
 	.probe		 = qt202x_phy_probe,
 	.init		 = qt202x_phy_init,
 	.reconfigure	 = qt202x_phy_reconfigure,
--- a/drivers/net/sfc/tenxpress.c	2011-04-14 08:35:29.783486390 -0700
+++ b/drivers/net/sfc/tenxpress.c	2011-04-14 08:35:54.967764761 -0700
@@ -478,7 +478,7 @@ static void sfx7101_set_npage_adv(struct
 			  advertising & ADVERTISED_10000baseT_Full);
 }
 
-struct efx_phy_operations falcon_sfx7101_phy_ops = {
+const struct efx_phy_operations falcon_sfx7101_phy_ops = {
 	.probe		  = tenxpress_phy_probe,
 	.init             = tenxpress_phy_init,
 	.reconfigure      = tenxpress_phy_reconfigure,
--- a/drivers/net/sfc/txc43128_phy.c	2011-04-14 08:35:48.539694644 -0700
+++ b/drivers/net/sfc/txc43128_phy.c	2011-04-14 08:35:52.619739671 -0700
@@ -545,7 +545,7 @@ static void txc43128_get_settings(struct
 	mdio45_ethtool_gset(&efx->mdio, ecmd);
 }
 
-struct efx_phy_operations falcon_txc_phy_ops = {
+const struct efx_phy_operations falcon_txc_phy_ops = {
 	.probe		= txc43128_phy_probe,
 	.init		= txc43128_phy_init,
 	.reconfigure	= txc43128_phy_reconfigure,
--- a/drivers/net/sfc/falcon_xmac.c	2011-04-14 08:38:51.094018275 -0700
+++ b/drivers/net/sfc/falcon_xmac.c	2011-04-14 08:38:57.022090260 -0700
@@ -362,7 +362,7 @@ void falcon_poll_xmac(struct efx_nic *ef
 	falcon_ack_status_intr(efx);
 }
 
-struct efx_mac_operations falcon_xmac_operations = {
+const struct efx_mac_operations falcon_xmac_operations = {
 	.reconfigure	= falcon_reconfigure_xmac,
 	.update_stats	= falcon_update_stats_xmac,
 	.check_fault	= falcon_xmac_check_fault,
--- a/drivers/net/sfc/mac.h	2011-04-14 08:39:27.806461103 -0700
+++ b/drivers/net/sfc/mac.h	2011-04-14 08:39:36.686565988 -0700
@@ -13,8 +13,8 @@
 
 #include "net_driver.h"
 
-extern struct efx_mac_operations falcon_xmac_operations;
-extern struct efx_mac_operations efx_mcdi_mac_operations;
+extern const struct efx_mac_operations falcon_xmac_operations;
+extern const struct efx_mac_operations efx_mcdi_mac_operations;
 extern int efx_mcdi_mac_stats(struct efx_nic *efx, dma_addr_t dma_addr,
 			      u32 dma_len, int enable, int clear);
 
--- a/drivers/net/sfc/mcdi_mac.c	2011-04-14 08:39:00.734136111 -0700
+++ b/drivers/net/sfc/mcdi_mac.c	2011-04-14 08:39:06.630207635 -0700
@@ -138,7 +138,7 @@ static bool efx_mcdi_mac_check_fault(str
 }
 
 
-struct efx_mac_operations efx_mcdi_mac_operations = {
+const struct efx_mac_operations efx_mcdi_mac_operations = {
 	.reconfigure	= efx_mcdi_mac_reconfigure,
 	.update_stats	= efx_port_dummy_op_void,
 	.check_fault 	= efx_mcdi_mac_check_fault,
--- a/drivers/net/sfc/falcon.c	2011-04-14 08:43:16.856958923 -0700
+++ b/drivers/net/sfc/falcon.c	2011-04-14 08:43:26.809060907 -0700
@@ -1703,7 +1703,7 @@ static int falcon_set_wol(struct efx_nic
  **************************************************************************
  */
 
-struct efx_nic_type falcon_a1_nic_type = {
+const struct efx_nic_type falcon_a1_nic_type = {
 	.probe = falcon_probe_nic,
 	.remove = falcon_remove_nic,
 	.init = falcon_init_nic,
@@ -1744,7 +1744,7 @@ struct efx_nic_type falcon_a1_nic_type =
 	.reset_world_flags = ETH_RESET_IRQ,
 };
 
-struct efx_nic_type falcon_b0_nic_type = {
+const struct efx_nic_type falcon_b0_nic_type = {
 	.probe = falcon_probe_nic,
 	.remove = falcon_remove_nic,
 	.init = falcon_init_nic,
--- a/drivers/net/sfc/nic.h	2011-04-14 08:43:37.417167767 -0700
+++ b/drivers/net/sfc/nic.h	2011-04-14 08:43:50.801302635 -0700
@@ -150,9 +150,9 @@ struct siena_nic_data {
 	int wol_filter_id;
 };
 
-extern struct efx_nic_type falcon_a1_nic_type;
-extern struct efx_nic_type falcon_b0_nic_type;
-extern struct efx_nic_type siena_a0_nic_type;
+extern const struct efx_nic_type falcon_a1_nic_type;
+extern const struct efx_nic_type falcon_b0_nic_type;
+extern const struct efx_nic_type siena_a0_nic_type;
 
 /**************************************************************************
  *
--- a/drivers/net/sfc/siena.c	2011-04-14 08:43:55.377348789 -0700
+++ b/drivers/net/sfc/siena.c	2011-04-14 08:44:24.561640141 -0700
@@ -581,7 +581,7 @@ static void siena_init_wol(struct efx_ni
  **************************************************************************
  */
 
-struct efx_nic_type siena_a0_nic_type = {
+const struct efx_nic_type siena_a0_nic_type = {
 	.probe = siena_probe_nic,
 	.remove = siena_remove_nic,
 	.init = siena_init_nic,


* Re: [PATCH v2] net: filter: Just In Time compiler
From: Eric Dumazet @ 2011-04-14 15:45 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Hagen Paul Pfeifer, David Miller, netdev,
	Arnaldo Carvalho de Melo, Ben Hutchings
In-Reply-To: <4DA715BB.6050307@redhat.com>

Le jeudi 14 avril 2011 à 18:41 +0300, Avi Kivity a écrit :

> I'm talking about optimizing the generated code.  For example, bpf has 
> just two registers so a complex program generates a lot of loads and 
> stores.  An optimizing compiler can use extra target registers to avoid 
> those spills, and doesn't need to keep A and X in fixed registers.
> 

That's not exactly true.

A bpf filter also uses up to 16 mem[] 'registers'.

A risc cpu (with a lot of registers) could use registers to hold part of
the mem[] array.

> If you translate the bpf program to C and optimize that with gcc you'll 
> probably get much better machine code than the jit in the patch.
> 

Well, gcc won't optimize a bpf program much, if you ask me.

You would do better to make tcpdump generate C code directly instead of bpf.






* Re: [PATCH v2] net: filter: Just In Time compiler
From: Avi Kivity @ 2011-04-14 15:45 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Arnaldo Carvalho de Melo, Ben Hutchings,
	Hagen Paul Pfeifer
In-Reply-To: <1302795630.3248.10.camel@edumazet-laptop>

On 04/14/2011 06:40 PM, Eric Dumazet wrote:
> Le jeudi 14 avril 2011 à 17:40 +0300, Avi Kivity a écrit :
> >  On 04/03/2011 04:56 PM, Eric Dumazet wrote:
> >  >  In order to speedup packet filtering, here is an implementation of a JIT
> >  >  compiler for x86_64
> >  >
> >
> >  Have you considered putting the compiler in userspace?
> >
>
> Hmm, to be honest no.
>
> >  You could have a trusted compile server waiting on a pipe and compiling
> >  programs sent to it by the kernel, sending the results back down.  Use
> >  the interpreter until the compiler returns; if it doesn't, use the
> >  interpreter forever.
>
> I feel it might be too expensive in some cases, and a rather complex
> architecture.

It is, but the kernel-side complexity is lower.  And since we have a 
fallback, overall reliability is improved rather than reduced.
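
The fallback scheme can be sketched as a filter whose entry point starts
out as the interpreter and is swapped only if compiled code arrives. This
is a user-space illustration with hypothetical names. It uses C11 atomics
for the swap, and both entry points are trivial stand-ins that accept a
packet when its first byte is 0x45.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*demo_filter_fn)(const uint8_t *pkt, size_t len);

struct demo_filter {
	_Atomic(demo_filter_fn) run;	/* current entry point */
};

/* Stand-in for the interpreter: accept packets whose first byte is 0x45. */
static uint32_t demo_interpret(const uint8_t *pkt, size_t len)
{
	return (len > 0 && pkt[0] == 0x45) ? (uint32_t)len : 0;
}

/* Stand-in for code returned by the compile server; same semantics. */
static uint32_t demo_compiled(const uint8_t *pkt, size_t len)
{
	return (len > 0 && pkt[0] == 0x45) ? (uint32_t)len : 0;
}

/* A new filter always starts on the interpreter. */
static void demo_filter_init(struct demo_filter *f)
{
	atomic_init(&f->run, demo_interpret);
}

/* Called only if the compiler answers; otherwise nothing ever changes. */
static void demo_filter_install(struct demo_filter *f, demo_filter_fn fn)
{
	assert(fn != NULL);
	atomic_store(&f->run, fn);
}

/* Packet path: load the current entry point once, then call it. */
static uint32_t demo_filter_run(struct demo_filter *f,
				const uint8_t *pkt, size_t len)
{
	demo_filter_fn fn = atomic_load(&f->run);

	return fn(pkt, len);
}
```

If the compile server never answers, demo_filter_install() is never called
and the packet path keeps using the interpreter.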

> >
> >  The upside is that you can use established optimizing compilers like
> >  LLVM or GCC, which already support more target architectures.  It may
> >  not matter much for something simple like bpf, but other VMs may be a
> >  lot more complicated.
> >
>
> Not only is bpf very simple, it also needs to access skb fields and other
> parts of the kernel; we would need to tell the userland compiler about all
> these details.

A simple implementation would be to translate the bpf program into a C 
function which receives the same arguments as your bpf runtime, and 
optimize that with gcc.
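
As a sketch of what such a translation might look like, here is a
hand-written C function for a classic filter along the lines of
"ip and tcp dst port 80" on Ethernet. It is an illustration under
assumptions: the function takes a plain buffer and length rather than the
arguments of the in-kernel runtime, and the helper names are hypothetical.
Each load is bounds-checked and a failed load rejects the packet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bounds-checked big-endian 16-bit load; returns 0 on a short packet. */
static int demo_load_half(const uint8_t *pkt, size_t len, size_t off,
			  uint32_t *out)
{
	if (off + 2 > len)
		return 0;
	*out = ((uint32_t)pkt[off] << 8) | pkt[off + 1];
	return 1;
}

/* Bounds-checked byte load; returns 0 on a short packet. */
static int demo_load_byte(const uint8_t *pkt, size_t len, size_t off,
			  uint32_t *out)
{
	if (off >= len)
		return 0;
	*out = pkt[off];
	return 1;
}

/* Returns the number of bytes to accept, or 0 to drop the packet. */
static uint32_t demo_filter_tcp_dport80(const uint8_t *pkt, size_t len)
{
	uint32_t a, x;

	assert(pkt != NULL || len == 0);
	if (!demo_load_half(pkt, len, 12, &a) || a != 0x0800)
		return 0;			/* not IPv4 */
	if (!demo_load_byte(pkt, len, 23, &a) || a != 6)
		return 0;			/* not TCP */
	if (!demo_load_half(pkt, len, 20, &a) || (a & 0x1fff))
		return 0;			/* not the first fragment */
	if (!demo_load_byte(pkt, len, 14, &a))
		return 0;
	x = 4 * (a & 0xf);			/* IP header length */
	if (!demo_load_half(pkt, len, x + 16, &a) || a != 80)
		return 0;			/* dst port at 14 + x + 2 */
	return 0xffff;
}
```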

> We would need to load a kind of module (with a dynamic loader)

Well, we have one.

> Of course, making each bpf filter a module of its own has a benefit for
> perf profiling.

And stack unwind info, etc.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH v2] net: filter: Just In Time compiler
From: Avi Kivity @ 2011-04-14 15:41 UTC (permalink / raw)
  To: Hagen Paul Pfeifer
  Cc: Eric Dumazet, David Miller, netdev, Arnaldo Carvalho de Melo,
	Ben Hutchings
In-Reply-To: <0a7fdea6b816da546ea71f752d36b5c2@localhost>

On 04/14/2011 05:55 PM, Hagen Paul Pfeifer wrote:
> On Thu, 14 Apr 2011 17:40:03 +0300, Avi Kivity<avi@redhat.com>  wrote:
>
> >  Have you considered putting the compiler in userspace?
>
> Kernelspace (modules, threads, etc) can register BPF filters too. It is
> possible that there is no userspace involved at all.

A userspace jit would still work just fine, no?  I don't want the user 
who supplied the program to also supply the jit; rather, when the kernel 
installs the bpf program, it also asks an independent userspace compiler 
to translate it.

> >  The upside is that you can use established optimizing compilers like
> >  LLVM or GCC, which already support more target architectures.  It may
> >  not matter much for something simple like bpf, but other VMs may be a
> >  lot more complicated.
>
> BPF is another domain. Standard compiler optimizations are not comparable
> to BPF optimizations, so there is no gain there. Maybe writing a gcc front
> _and_ back-end may gain some valuable advantages.

I'm talking about optimizing the generated code.  For example, bpf has 
just two registers so a complex program generates a lot of loads and 
stores.  An optimizing compiler can use extra target registers to avoid 
those spills, and doesn't need to keep A and X in fixed registers.

If you translate the bpf program to C and optimize that with gcc you'll 
probably get much better machine code than the jit in the patch.

-- 
error compiling committee.c: too many arguments to function


* Re: [PATCH v2] net: filter: Just In Time compiler
From: Eric Dumazet @ 2011-04-14 15:40 UTC (permalink / raw)
  To: Avi Kivity
  Cc: David Miller, netdev, Arnaldo Carvalho de Melo, Ben Hutchings,
	Hagen Paul Pfeifer
In-Reply-To: <4DA70743.1050106@redhat.com>

Le jeudi 14 avril 2011 à 17:40 +0300, Avi Kivity a écrit :
> On 04/03/2011 04:56 PM, Eric Dumazet wrote:
> > In order to speedup packet filtering, here is an implementation of a JIT
> > compiler for x86_64
> >
> 
> Have you considered putting the compiler in userspace?
> 

Hmm, to be honest no.

> You could have a trusted compile server waiting on a pipe and compiling 
> programs sent to it by the kernel, sending the results back down.  Use 
> the interpreter until the compiler returns; if it doesn't, use the 
> interpreter forever.

I feel it might be too expensive in some cases, and a rather complex
architecture.

> 
> The upside is that you can use established optimizing compilers like 
> LLVM or GCC, which already support more target architectures.  It may 
> not matter much for something simple like bpf, but other VMs may be a 
> lot more complicated.
> 

Not only is bpf very simple, it also needs to access skb fields and other
parts of the kernel; we would need to tell the userland compiler about all
these details.

We would need to load a kind of module (with a dynamic loader)

Of course, making each bpf filter a module of its own has a benefit for
perf profiling.





* Re: [PATCH 1/1] ipv6: RTA_PREFSRC support for ipv6 route source address selection
From: Stephen Clark @ 2011-04-14 15:02 UTC (permalink / raw)
  To: Daniel Walter; +Cc: netdev, linux-kernel, davem
In-Reply-To: <20110414144954.GA79918@0x90.at>

On 04/14/2011 10:49 AM, Daniel Walter wrote:
> On Thu, Apr 14, 2011 at 10:01:09AM -0400, Stephen Clark wrote:
>    
>> On 04/14/2011 03:10 AM, Daniel Walter wrote:
>>      
>>> [ipv6] Add support for RTA_PREFSRC
>>>
>>> This patch allows a user to select the preferred source address
>>> for a specific IPv6-Route. It can be set via a netlink message
>>> setting RTA_PREFSRC to a valid IPv6 address which must be
>>> up on the device the route will be bound to.
>>>
>>>
>>> Signed-off-by: Daniel Walter <dwalter@barracuda.com>
>>> ---
>>> Repost patch, after fixing some warnings pointed out on netdev@
>>> applies clean against current linux-2.6 HEAD
>>>
>>>    include/net/ip6_fib.h   |    2 +
>>>    include/net/ip6_route.h |    7 ++++
>>>    net/ipv6/addrconf.c     |    2 +
>>>    net/ipv6/ip6_output.c   |    8 ++--
>>>    net/ipv6/route.c        |   72 +++++++++++++++++++++++++++++++++++++++++++++--
>>>    5 files changed, 84 insertions(+), 7 deletions(-)
>>>
>>> ---
>>> diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
>>> index bc3cde0..98348d5 100644
>>> --- a/include/net/ip6_fib.h
>>> +++ b/include/net/ip6_fib.h
>>> @@ -42,6 +42,7 @@ struct fib6_config {
>>>
>>>    	struct in6_addr	fc_dst;
>>>    	struct in6_addr	fc_src;
>>> +	struct in6_addr	fc_prefsrc;
>>>    	struct in6_addr	fc_gateway;
>>>
>>>    	unsigned long	fc_expires;
>>> @@ -107,6 +108,7 @@ struct rt6_info {
>>>    	struct rt6key			rt6i_dst ____cacheline_aligned_in_smp;
>>>    	u32				rt6i_flags;
>>>    	struct rt6key			rt6i_src;
>>> +	struct rt6key			rt6i_prefsrc;
>>>    	u32				rt6i_metric;
>>>    	u32				rt6i_peer_genid;
>>>
>>> diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
>>> index c850e5f..86b1cb4 100644
>>> --- a/include/net/ip6_route.h
>>> +++ b/include/net/ip6_route.h
>>> @@ -84,6 +84,12 @@ extern int			ip6_route_add(struct fib6_config *cfg);
>>>    extern int			ip6_ins_rt(struct rt6_info *);
>>>    extern int			ip6_del_rt(struct rt6_info *);
>>>
>>> +extern int			ip6_route_get_saddr(struct net *net,
>>> +						    struct rt6_info *rt,
>>> +						    struct in6_addr *daddr,
>>> +						    unsigned int prefs,
>>> +						    struct in6_addr *saddr);
>>> +
>>>    extern struct rt6_info		*rt6_lookup(struct net *net,
>>>    					    const struct in6_addr *daddr,
>>>    					    const struct in6_addr *saddr,
>>> @@ -141,6 +147,7 @@ struct rt6_rtnl_dump_arg {
>>>    extern int rt6_dump_route(struct rt6_info *rt, void *p_arg);
>>>    extern void rt6_ifdown(struct net *net, struct net_device *dev);
>>>    extern void rt6_mtu_change(struct net_device *dev, unsigned mtu);
>>> +extern void rt6_remove_prefsrc(struct inet6_ifaddr *ifp);
>>>
>>>
>>>    /*
>>> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
>>> index 1493534..129d7e1 100644
>>> --- a/net/ipv6/addrconf.c
>>> +++ b/net/ipv6/addrconf.c
>>> @@ -825,6 +825,8 @@ static void ipv6_del_addr(struct inet6_ifaddr *ifp)
>>>    		dst_release(&rt->dst);
>>>    	}
>>>
>>> +	/* clean up prefsrc entries */
>>> +	rt6_remove_prefsrc(ifp);
>>>    out:
>>>    	in6_ifa_put(ifp);
>>>    }
>>> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
>>> index 46cf7be..1f4c096 100644
>>> --- a/net/ipv6/ip6_output.c
>>> +++ b/net/ipv6/ip6_output.c
>>> @@ -930,10 +930,10 @@ static int ip6_dst_lookup_tail(struct sock *sk,
>>>    		goto out_err_release;
>>>
>>>    	if (ipv6_addr_any(&fl6->saddr)) {
>>> -		err = ipv6_dev_get_saddr(net, ip6_dst_idev(*dst)->dev,
>>> -					&fl6->daddr,
>>> -					 sk ? inet6_sk(sk)->srcprefs : 0,
>>> -					&fl6->saddr);
>>> +		struct rt6_info *rt = (struct rt6_info *) *dst;
>>> +		err = ip6_route_get_saddr(net, rt, &fl6->daddr,
>>> +					  sk ? inet6_sk(sk)->srcprefs : 0,
>>> +					&fl6->saddr);
>>>    		if (err)
>>>    			goto out_err_release;
>>>    	}
>>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>>> index 843406f..af26cc10 100644
>>> --- a/net/ipv6/route.c
>>> +++ b/net/ipv6/route.c
>>> @@ -1325,6 +1325,16 @@ int ip6_route_add(struct fib6_config *cfg)
>>>    	if (dev == NULL)
>>>    		goto out;
>>>
>>> +	if (!ipv6_addr_any(&cfg->fc_prefsrc)) {
>>> +		if (!ipv6_chk_addr(net, &cfg->fc_prefsrc, dev, 0)) {
>>> +			err = -EINVAL;
>>> +			goto out;
>>> +		}
>>> +		ipv6_addr_copy(&rt->rt6i_prefsrc.addr, &cfg->fc_prefsrc);
>>> +		rt->rt6i_prefsrc.plen = 128;
>>> +	} else
>>> +		rt->rt6i_prefsrc.plen = 0;
>>> +
>>>    	if (cfg->fc_flags & (RTF_GATEWAY | RTF_NONEXTHOP)) {
>>>    		rt->rt6i_nexthop = __neigh_lookup_errno(&nd_tbl, &rt->rt6i_gateway, dev);
>>>    		if (IS_ERR(rt->rt6i_nexthop)) {
>>> @@ -2037,6 +2047,55 @@ struct rt6_info *addrconf_dst_alloc(struct inet6_dev *idev,
>>>    	return rt;
>>>    }
>>>
>>> +int ip6_route_get_saddr(struct net *net,
>>> +			struct rt6_info *rt,
>>> +			struct in6_addr *daddr,
>>> +			unsigned int prefs,
>>> +			struct in6_addr *saddr)
>>> +{
>>> +	struct inet6_dev *idev = ip6_dst_idev((struct dst_entry*)rt);
>>> +	int err = 0;
>>> +	if (rt->rt6i_prefsrc.plen)
>>> +		ipv6_addr_copy(saddr, &rt->rt6i_prefsrc.addr);
>>> +	else
>>> +		err = ipv6_dev_get_saddr(net, idev ? idev->dev : NULL,
>>> +					 daddr, prefs, saddr);
>>> +	return err;
>>> +}
>>> +
>>> +/* remove deleted ip from prefsrc entries */
>>> +struct arg_dev_net_ip {
>>> +	struct net_device *dev;
>>> +	struct net *net;
>>> +	struct in6_addr *addr;
>>> +};
>>> +
>>> +static int fib6_remove_prefsrc(struct rt6_info *rt, void *arg)
>>> +{
>>> +	struct net_device *dev = ((struct arg_dev_net_ip *)arg)->dev;
>>> +	struct net *net = ((struct arg_dev_net_ip *)arg)->net;
>>> +	struct in6_addr *addr = ((struct arg_dev_net_ip *)arg)->addr;
>>> +
>>> +	if (((void *)rt->rt6i_dev == dev || dev == NULL) &&
>>> +	    rt != net->ipv6.ip6_null_entry &&
>>> +	    ipv6_addr_equal(addr, &rt->rt6i_prefsrc.addr)) {
>>> +		/* remove prefsrc entry */
>>> +		rt->rt6i_prefsrc.plen = 0;
>>> +	}
>>> +	return 0;
>>> +}
>>> +
>>> +void rt6_remove_prefsrc(struct inet6_ifaddr *ifp)
>>> +{
>>> +	struct net *net = dev_net(ifp->idev->dev);
>>> +	struct arg_dev_net_ip adni = {
>>> +		.dev = ifp->idev->dev,
>>> +		.net = net,
>>> +		.addr = &ifp->addr,
>>> +	};
>>> +	fib6_clean_all(net, fib6_remove_prefsrc, 0, &adni);
>>> +}
>>> +
>>>    struct arg_dev_net {
>>>    	struct net_device *dev;
>>>    	struct net *net;
>>> @@ -2183,6 +2242,9 @@ static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
>>>    		nla_memcpy(&cfg->fc_src, tb[RTA_SRC], plen);
>>>    	}
>>>
>>> +	if (tb[RTA_PREFSRC])
>>> +		nla_memcpy(&cfg->fc_prefsrc, tb[RTA_PREFSRC], 16);
>>> +
>>>    	if (tb[RTA_OIF])
>>>    		cfg->fc_ifindex = nla_get_u32(tb[RTA_OIF]);
>>>
>>> @@ -2325,13 +2387,17 @@ static int rt6_fill_node(struct net *net,
>>>    #endif
>>>    			NLA_PUT_U32(skb, RTA_IIF, iif);
>>>    	} else if (dst) {
>>> -		struct inet6_dev *idev = ip6_dst_idev(&rt->dst);
>>>    		struct in6_addr saddr_buf;
>>> -		if (ipv6_dev_get_saddr(net, idev ? idev->dev : NULL,
>>> -				       dst, 0, &saddr_buf) == 0)
>>> +		if (ip6_route_get_saddr(net, rt, dst, 0, &saddr_buf) == 0)
>>>    			NLA_PUT(skb, RTA_PREFSRC, 16, &saddr_buf);
>>>    	}
>>>
>>> +	if (rt->rt6i_prefsrc.plen) {
>>> +		struct in6_addr saddr_buf;
>>> +		ipv6_addr_copy(&saddr_buf, &rt->rt6i_prefsrc.addr);
>>> +		NLA_PUT(skb, RTA_PREFSRC, 16, &saddr_buf);
>>> +	}
>>> +
>>>    	if (rtnetlink_put_metrics(skb, dst_metrics_ptr(&rt->dst)) < 0)
>>>    		goto nla_put_failure;
>>>        
>> What userspace application will be used to set this?
>>      
> iproute2 already has support for this, since it is using
> RTA_PREFSRC for ipv4.
>
> ip -6 r a 2001:db8:a::/64 via 2001:db8:b::1 src 2001:db8:b::2
>
>    
Fantastic!

-- 

"They that give up essential liberty to obtain temporary safety,
deserve neither liberty nor safety."  (Ben Franklin)

"The course of history shows that as a government grows, liberty
decreases."  (Thomas Jefferson)


* Re: [PATCH v2] net: filter: Just In Time compiler
From: Hagen Paul Pfeifer @ 2011-04-14 14:55 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Eric Dumazet, David Miller, netdev, Arnaldo Carvalho de Melo,
	Ben Hutchings
In-Reply-To: <4DA70743.1050106@redhat.com>

On Thu, 14 Apr 2011 17:40:03 +0300, Avi Kivity <avi@redhat.com> wrote:

> Have you considered putting the compiler in userspace?

Kernelspace (modules, threads, etc) can register BPF filters too. It is
possible that there is no userspace involved at all.

> The upside is that you can use established optimizing compilers like
> LLVM or GCC, which already support more target architectures.  It may
> not matter much for something simple like bpf, but other VMs may be a
> lot more complicated.

BPF is another domain. Standard compiler optimizations are not comparable
to BPF optimizations, so there is no gain there. Maybe writing a gcc front
_and_ back-end may gain some valuable advantages.

Hagen 


* Re: [PATCH 1/1] ipv6: RTA_PREFSRC support for ipv6 route source address selection
From: Daniel Walter @ 2011-04-14 14:49 UTC (permalink / raw)
  To: Stephen Clark; +Cc: netdev, linux-kernel, davem
In-Reply-To: <4DA6FE25.70608@earthlink.net>

On Thu, Apr 14, 2011 at 10:01:09AM -0400, Stephen Clark wrote:
> On 04/14/2011 03:10 AM, Daniel Walter wrote:
> > [ipv6] Add support for RTA_PREFSRC
> >
> > This patch allows a user to select the preferred source address
> > for a specific IPv6-Route. It can be set via a netlink message
> > setting RTA_PREFSRC to a valid IPv6 address which must be
> > up on the device the route will be bound to.
> >
> >
> > Signed-off-by: Daniel Walter<dwalter@barracuda.com>
> > ---
> > Repost patch, after fixing some warnings pointed out on netdev@
> > applies clean against current linux-2.6 HEAD
> >
> >   include/net/ip6_fib.h   |    2 +
> >   include/net/ip6_route.h |    7 ++++
> >   net/ipv6/addrconf.c     |    2 +
> >   net/ipv6/ip6_output.c   |    8 ++--
> >   net/ipv6/route.c        |   72 +++++++++++++++++++++++++++++++++++++++++++++--
> >   5 files changed, 84 insertions(+), 7 deletions(-)
> >
> > ---
> > diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
> > index bc3cde0..98348d5 100644
> > --- a/include/net/ip6_fib.h
> > +++ b/include/net/ip6_fib.h
> > @@ -42,6 +42,7 @@ struct fib6_config {
> >
> >   	struct in6_addr	fc_dst;
> >   	struct in6_addr	fc_src;
> > +	struct in6_addr	fc_prefsrc;
> >   	struct in6_addr	fc_gateway;
> >
> >   	unsigned long	fc_expires;
> > @@ -107,6 +108,7 @@ struct rt6_info {
> >   	struct rt6key			rt6i_dst ____cacheline_aligned_in_smp;
> >   	u32				rt6i_flags;
> >   	struct rt6key			rt6i_src;
> > +	struct rt6key			rt6i_prefsrc;
> >   	u32				rt6i_metric;
> >   	u32				rt6i_peer_genid;
> >
> > diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
> > index c850e5f..86b1cb4 100644
> > --- a/include/net/ip6_route.h
> > +++ b/include/net/ip6_route.h
> > @@ -84,6 +84,12 @@ extern int			ip6_route_add(struct fib6_config *cfg);
> >   extern int			ip6_ins_rt(struct rt6_info *);
> >   extern int			ip6_del_rt(struct rt6_info *);
> >
> > +extern int			ip6_route_get_saddr(struct net *net,
> > +						    struct rt6_info *rt,
> > +						    struct in6_addr *daddr,
> > +						    unsigned int prefs,
> > +						    struct in6_addr *saddr);
> > +
> >   extern struct rt6_info		*rt6_lookup(struct net *net,
> >   					    const struct in6_addr *daddr,
> >   					    const struct in6_addr *saddr,
> > @@ -141,6 +147,7 @@ struct rt6_rtnl_dump_arg {
> >   extern int rt6_dump_route(struct rt6_info *rt, void *p_arg);
> >   extern void rt6_ifdown(struct net *net, struct net_device *dev);
> >   extern void rt6_mtu_change(struct net_device *dev, unsigned mtu);
> > +extern void rt6_remove_prefsrc(struct inet6_ifaddr *ifp);
> >
> >
> >   /*
> > diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> > index 1493534..129d7e1 100644
> > --- a/net/ipv6/addrconf.c
> > +++ b/net/ipv6/addrconf.c
> > @@ -825,6 +825,8 @@ static void ipv6_del_addr(struct inet6_ifaddr *ifp)
> >   		dst_release(&rt->dst);
> >   	}
> >
> > +	/* clean up prefsrc entries */
> > +	rt6_remove_prefsrc(ifp);
> >   out:
> >   	in6_ifa_put(ifp);
> >   }
> > diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> > index 46cf7be..1f4c096 100644
> > --- a/net/ipv6/ip6_output.c
> > +++ b/net/ipv6/ip6_output.c
> > @@ -930,10 +930,10 @@ static int ip6_dst_lookup_tail(struct sock *sk,
> >   		goto out_err_release;
> >
> >   	if (ipv6_addr_any(&fl6->saddr)) {
> > -		err = ipv6_dev_get_saddr(net, ip6_dst_idev(*dst)->dev,
> > -					&fl6->daddr,
> > -					 sk ? inet6_sk(sk)->srcprefs : 0,
> > -					&fl6->saddr);
> > +		struct rt6_info *rt = (struct rt6_info *) *dst;
> > +		err = ip6_route_get_saddr(net, rt,&fl6->daddr,
> > +					  sk ? inet6_sk(sk)->srcprefs : 0,
> > +					&fl6->saddr);
> >   		if (err)
> >   			goto out_err_release;
> >   	}
> > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> > index 843406f..af26cc10 100644
> > --- a/net/ipv6/route.c
> > +++ b/net/ipv6/route.c
> > @@ -1325,6 +1325,16 @@ int ip6_route_add(struct fib6_config *cfg)
> >   	if (dev == NULL)
> >   		goto out;
> >
> > +	if (!ipv6_addr_any(&cfg->fc_prefsrc)) {
> > +		if (!ipv6_chk_addr(net,&cfg->fc_prefsrc, dev, 0)) {
> > +			err = -EINVAL;
> > +			goto out;
> > +		}
> > +		ipv6_addr_copy(&rt->rt6i_prefsrc.addr,&cfg->fc_prefsrc);
> > +		rt->rt6i_prefsrc.plen = 128;
> > +	} else
> > +		rt->rt6i_prefsrc.plen = 0;
> > +
> >   	if (cfg->fc_flags&  (RTF_GATEWAY | RTF_NONEXTHOP)) {
> >   		rt->rt6i_nexthop = __neigh_lookup_errno(&nd_tbl,&rt->rt6i_gateway, dev);
> >   		if (IS_ERR(rt->rt6i_nexthop)) {
> > @@ -2037,6 +2047,55 @@ struct rt6_info *addrconf_dst_alloc(struct inet6_dev *idev,
> >   	return rt;
> >   }
> >
> > +int ip6_route_get_saddr(struct net *net,
> > +			struct rt6_info *rt,
> > +			struct in6_addr *daddr,
> > +			unsigned int prefs,
> > +			struct in6_addr *saddr)
> > +{
> > +	struct inet6_dev *idev = ip6_dst_idev((struct dst_entry*)rt);
> > +	int err = 0;
> > +	if (rt->rt6i_prefsrc.plen)
> > +		ipv6_addr_copy(saddr,&rt->rt6i_prefsrc.addr);
> > +	else
> > +		err = ipv6_dev_get_saddr(net, idev ? idev->dev : NULL,
> > +					 daddr, prefs, saddr);
> > +	return err;
> > +}
> > +
> > +/* remove deleted ip from prefsrc entries */
> > +struct arg_dev_net_ip {
> > +	struct net_device *dev;
> > +	struct net *net;
> > +	struct in6_addr *addr;
> > +};
> > +
> > +static int fib6_remove_prefsrc(struct rt6_info *rt, void *arg)
> > +{
> > +	struct net_device *dev = ((struct arg_dev_net_ip *)arg)->dev;
> > +	struct net *net = ((struct arg_dev_net_ip *)arg)->net;
> > +	struct in6_addr *addr = ((struct arg_dev_net_ip *)arg)->addr;
> > +
> > +	if (((void *)rt->rt6i_dev == dev || dev == NULL)&&
> > +	    rt != net->ipv6.ip6_null_entry&&
> > +	    ipv6_addr_equal(addr,&rt->rt6i_prefsrc.addr)) {
> > +		/* remove prefsrc entry */
> > +		rt->rt6i_prefsrc.plen = 0;
> > +	}
> > +	return 0;
> > +}
> > +
> > +void rt6_remove_prefsrc(struct inet6_ifaddr *ifp)
> > +{
> > +	struct net *net = dev_net(ifp->idev->dev);
> > +	struct arg_dev_net_ip adni = {
> > +		.dev = ifp->idev->dev,
> > +		.net = net,
> > +		.addr =&ifp->addr,
> > +	};
> > +	fib6_clean_all(net, fib6_remove_prefsrc, 0,&adni);
> > +}
> > +
> >   struct arg_dev_net {
> >   	struct net_device *dev;
> >   	struct net *net;
> > @@ -2183,6 +2242,9 @@ static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
> >   		nla_memcpy(&cfg->fc_src, tb[RTA_SRC], plen);
> >   	}
> >
> > +	if (tb[RTA_PREFSRC])
> > +		nla_memcpy(&cfg->fc_prefsrc, tb[RTA_PREFSRC], 16);
> > +
> >   	if (tb[RTA_OIF])
> >   		cfg->fc_ifindex = nla_get_u32(tb[RTA_OIF]);
> >
> > @@ -2325,13 +2387,17 @@ static int rt6_fill_node(struct net *net,
> >   #endif
> >   			NLA_PUT_U32(skb, RTA_IIF, iif);
> >   	} else if (dst) {
> > -		struct inet6_dev *idev = ip6_dst_idev(&rt->dst);
> >   		struct in6_addr saddr_buf;
> > -		if (ipv6_dev_get_saddr(net, idev ? idev->dev : NULL,
> > -				       dst, 0,&saddr_buf) == 0)
> > +		if (ip6_route_get_saddr(net, rt, dst, 0,&saddr_buf) == 0)
> >   			NLA_PUT(skb, RTA_PREFSRC, 16,&saddr_buf);
> >   	}
> >
> > +	if (rt->rt6i_prefsrc.plen) {
> > +		struct in6_addr saddr_buf;
> > +		ipv6_addr_copy(&saddr_buf,&rt->rt6i_prefsrc.addr);
> > +		NLA_PUT(skb, RTA_PREFSRC, 16,&saddr_buf);
> > +	}
> > +
> >   	if (rtnetlink_put_metrics(skb, dst_metrics_ptr(&rt->dst))<  0)
> >   		goto nla_put_failure;
> 
> What userspace application will be used to set this?

iproute2 already has support for this, since it is using
RTA_PREFSRC for ipv4. 

ip -6 r a 2001:db8:a::/64 via 2001:db8:b::1 src 2001:db8:b::2
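
For readers following the patch, the selection logic in ip6_route_get_saddr()
can be modelled in userspace roughly as below. This is only a sketch: struct
addr6, struct route6, dev_get_saddr() and route_get_saddr() are hypothetical
stand-ins for the kernel types and for ipv6_dev_get_saddr(), not kernel code.

```c
#include <string.h>

/* Hypothetical userspace stand-in for struct in6_addr. */
struct addr6 {
	unsigned char s6[16];
};

/* Hypothetical model of the two rt6_info fields the patch adds/uses. */
struct route6 {
	struct addr6 prefsrc;	/* models rt6i_prefsrc.addr */
	int prefsrc_plen;	/* models rt6i_prefsrc.plen: 128 if set, 0 if not */
};

/* Stand-in for ipv6_dev_get_saddr(): hands back one fixed device address. */
static int dev_get_saddr(const struct addr6 *dev_addr, struct addr6 *saddr)
{
	if (!dev_addr)
		return -1;	/* no usable address on the device */
	memcpy(saddr, dev_addr, sizeof(*saddr));
	return 0;
}

/* Prefer the route's configured source; otherwise fall back to the device. */
static int route_get_saddr(const struct route6 *rt,
			   const struct addr6 *dev_addr,
			   struct addr6 *saddr)
{
	if (rt->prefsrc_plen) {
		memcpy(saddr, &rt->prefsrc, sizeof(*saddr));
		return 0;
	}
	return dev_get_saddr(dev_addr, saddr);
}
```

With prefsrc_plen set to 128 the route's own address wins; once the address is
removed and plen is reset to 0, the lookup falls back to normal source address
selection.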


^ permalink raw reply

* Re: [PATCH v2] net: filter: Just In Time compiler
From: Avi Kivity @ 2011-04-14 14:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Arnaldo Carvalho de Melo, Ben Hutchings,
	Hagen Paul Pfeifer
In-Reply-To: <1301838968.2837.200.camel@edumazet-laptop>

On 04/03/2011 04:56 PM, Eric Dumazet wrote:
> In order to speedup packet filtering, here is an implementation of a JIT
> compiler for x86_64
>

Have you considered putting the compiler in userspace?

You could have a trusted compile server waiting on a pipe and compiling 
programs sent to it by the kernel, sending the results back down.  Use 
the interpreter until the compiler returns; if it doesn't, use the 
interpreter forever.

The upside is that you can use established optimizing compilers like 
LLVM or GCC, which already support more target architectures.  It may 
not matter much for something simple like bpf, but other VMs may be a 
lot more complicated.
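
A minimal sketch of the "interpreter until the compiler returns" idea, assuming
a per-filter function pointer that is swapped once compiled code arrives. All
names here (struct filter, filter_run, filter_install_compiled, the toy_*
functions) are hypothetical, and a real kernel version would need proper
synchronization when publishing the pointer.

```c
#include <stddef.h>

typedef unsigned int (*filter_fn)(const unsigned char *pkt, size_t len);

/* Hypothetical filter object: interpreter always set, compiled code optional. */
struct filter {
	filter_fn interp;	/* always available */
	filter_fn compiled;	/* NULL until (and unless) the compiler answers */
};

/* Run the compiled version if it exists, the interpreter otherwise. */
static unsigned int filter_run(const struct filter *f,
			       const unsigned char *pkt, size_t len)
{
	filter_fn fn = f->compiled ? f->compiled : f->interp;

	return fn(pkt, len);
}

/* Called when the (hypothetical) compile server sends its result back. */
static void filter_install_compiled(struct filter *f, filter_fn fn)
{
	f->compiled = fn;
}

/* Toy back-ends for illustration: both accept the whole packet. */
static int interp_calls, compiled_calls;

static unsigned int toy_interp(const unsigned char *pkt, size_t len)
{
	(void)pkt;
	interp_calls++;
	return (unsigned int)len;
}

static unsigned int toy_compiled(const unsigned char *pkt, size_t len)
{
	(void)pkt;
	compiled_calls++;
	return (unsigned int)len;
}
```

If the compiler never answers, filter_install_compiled() is never called and
the filter simply keeps running through the interpreter.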

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply

* Re: [PATCH 1/1] ipv6: RTA_PREFSRC support for ipv6 route source address selection
From: YOSHIFUJI Hideaki @ 2011-04-14 14:24 UTC (permalink / raw)
  To: sclark46; +Cc: Daniel Walter, netdev, linux-kernel, davem, yoshfuji
In-Reply-To: <4DA6FE25.70608@earthlink.net>

Stephen Clark wrote:
> On 04/14/2011 03:10 AM, Daniel Walter wrote:
> > [ipv6] Add support for RTA_PREFSRC
> >
> > This patch allows a user to select the preferred source address
> > for a specific IPv6-Route. It can be set via a netlink message
> > setting RTA_PREFSRC to a valid IPv6 address which must be
> > up on the device the route will be bound to.

> What userspace application will be used to set this?

I do expect Daniel will submit an appropriate patch for the iproute2 package
shortly :-)

--yoshfuji

^ permalink raw reply

* Re: [net-next-2.6 RFC PATCH v2 08/13] pcnet32: set ethtool set_phys_id on/off cycle frequency to 2/sec
From: Don Fry @ 2011-04-14 14:00 UTC (permalink / raw)
  To: Bruce Allan; +Cc: netdev
In-Reply-To: <20110413195927.25901.35278.stgit@gitlad.jf.intel.com>

> Date: Wed, 13 Apr 2011 12:59:27 -0700
> 
> Physical identification frequency based on how it was done prior to the
> introduction of set_phys_id.  Compile tested only.
> 
> Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
> Cc: Don Fry <pcnet32@frontier.com>

Acked-by: Don Fry <pcnet32@frontier.com>




^ permalink raw reply

* Re: [PATCH 1/1] ipv6: RTA_PREFSRC support for ipv6 route source address selection
From: Stephen Clark @ 2011-04-14 14:01 UTC (permalink / raw)
  To: Daniel Walter; +Cc: netdev, linux-kernel, davem
In-Reply-To: <20110414071057.GB78446@0x90.at>

On 04/14/2011 03:10 AM, Daniel Walter wrote:
> [ipv6] Add support for RTA_PREFSRC
>
> This patch allows a user to select the preferred source address
> for a specific IPv6-Route. It can be set via a netlink message
> setting RTA_PREFSRC to a valid IPv6 address which must be
> up on the device the route will be bound to.
>
>
> Signed-off-by: Daniel Walter<dwalter@barracuda.com>
> ---
> Repost patch, after fixing some warnings pointed out on netdev@
> applies clean against current linux-2.6 HEAD
>
>   include/net/ip6_fib.h   |    2 +
>   include/net/ip6_route.h |    7 ++++
>   net/ipv6/addrconf.c     |    2 +
>   net/ipv6/ip6_output.c   |    8 ++--
>   net/ipv6/route.c        |   72 +++++++++++++++++++++++++++++++++++++++++++++--
>   5 files changed, 84 insertions(+), 7 deletions(-)
>
> ---
> diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
> index bc3cde0..98348d5 100644
> --- a/include/net/ip6_fib.h
> +++ b/include/net/ip6_fib.h
> @@ -42,6 +42,7 @@ struct fib6_config {
>
>   	struct in6_addr	fc_dst;
>   	struct in6_addr	fc_src;
> +	struct in6_addr	fc_prefsrc;
>   	struct in6_addr	fc_gateway;
>
>   	unsigned long	fc_expires;
> @@ -107,6 +108,7 @@ struct rt6_info {
>   	struct rt6key			rt6i_dst ____cacheline_aligned_in_smp;
>   	u32				rt6i_flags;
>   	struct rt6key			rt6i_src;
> +	struct rt6key			rt6i_prefsrc;
>   	u32				rt6i_metric;
>   	u32				rt6i_peer_genid;
>
> diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
> index c850e5f..86b1cb4 100644
> --- a/include/net/ip6_route.h
> +++ b/include/net/ip6_route.h
> @@ -84,6 +84,12 @@ extern int			ip6_route_add(struct fib6_config *cfg);
>   extern int			ip6_ins_rt(struct rt6_info *);
>   extern int			ip6_del_rt(struct rt6_info *);
>
> +extern int			ip6_route_get_saddr(struct net *net,
> +						    struct rt6_info *rt,
> +						    struct in6_addr *daddr,
> +						    unsigned int prefs,
> +						    struct in6_addr *saddr);
> +
>   extern struct rt6_info		*rt6_lookup(struct net *net,
>   					    const struct in6_addr *daddr,
>   					    const struct in6_addr *saddr,
> @@ -141,6 +147,7 @@ struct rt6_rtnl_dump_arg {
>   extern int rt6_dump_route(struct rt6_info *rt, void *p_arg);
>   extern void rt6_ifdown(struct net *net, struct net_device *dev);
>   extern void rt6_mtu_change(struct net_device *dev, unsigned mtu);
> +extern void rt6_remove_prefsrc(struct inet6_ifaddr *ifp);
>
>
>   /*
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index 1493534..129d7e1 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -825,6 +825,8 @@ static void ipv6_del_addr(struct inet6_ifaddr *ifp)
>   		dst_release(&rt->dst);
>   	}
>
> +	/* clean up prefsrc entries */
> +	rt6_remove_prefsrc(ifp);
>   out:
>   	in6_ifa_put(ifp);
>   }
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index 46cf7be..1f4c096 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -930,10 +930,10 @@ static int ip6_dst_lookup_tail(struct sock *sk,
>   		goto out_err_release;
>
>   	if (ipv6_addr_any(&fl6->saddr)) {
> -		err = ipv6_dev_get_saddr(net, ip6_dst_idev(*dst)->dev,
> -					&fl6->daddr,
> -					 sk ? inet6_sk(sk)->srcprefs : 0,
> -					&fl6->saddr);
> +		struct rt6_info *rt = (struct rt6_info *) *dst;
> +		err = ip6_route_get_saddr(net, rt,&fl6->daddr,
> +					  sk ? inet6_sk(sk)->srcprefs : 0,
> +					&fl6->saddr);
>   		if (err)
>   			goto out_err_release;
>   	}
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 843406f..af26cc10 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -1325,6 +1325,16 @@ int ip6_route_add(struct fib6_config *cfg)
>   	if (dev == NULL)
>   		goto out;
>
> +	if (!ipv6_addr_any(&cfg->fc_prefsrc)) {
> +		if (!ipv6_chk_addr(net,&cfg->fc_prefsrc, dev, 0)) {
> +			err = -EINVAL;
> +			goto out;
> +		}
> +		ipv6_addr_copy(&rt->rt6i_prefsrc.addr,&cfg->fc_prefsrc);
> +		rt->rt6i_prefsrc.plen = 128;
> +	} else
> +		rt->rt6i_prefsrc.plen = 0;
> +
>   	if (cfg->fc_flags&  (RTF_GATEWAY | RTF_NONEXTHOP)) {
>   		rt->rt6i_nexthop = __neigh_lookup_errno(&nd_tbl,&rt->rt6i_gateway, dev);
>   		if (IS_ERR(rt->rt6i_nexthop)) {
> @@ -2037,6 +2047,55 @@ struct rt6_info *addrconf_dst_alloc(struct inet6_dev *idev,
>   	return rt;
>   }
>
> +int ip6_route_get_saddr(struct net *net,
> +			struct rt6_info *rt,
> +			struct in6_addr *daddr,
> +			unsigned int prefs,
> +			struct in6_addr *saddr)
> +{
> +	struct inet6_dev *idev = ip6_dst_idev((struct dst_entry*)rt);
> +	int err = 0;
> +	if (rt->rt6i_prefsrc.plen)
> +		ipv6_addr_copy(saddr,&rt->rt6i_prefsrc.addr);
> +	else
> +		err = ipv6_dev_get_saddr(net, idev ? idev->dev : NULL,
> +					 daddr, prefs, saddr);
> +	return err;
> +}
> +
> +/* remove deleted ip from prefsrc entries */
> +struct arg_dev_net_ip {
> +	struct net_device *dev;
> +	struct net *net;
> +	struct in6_addr *addr;
> +};
> +
> +static int fib6_remove_prefsrc(struct rt6_info *rt, void *arg)
> +{
> +	struct net_device *dev = ((struct arg_dev_net_ip *)arg)->dev;
> +	struct net *net = ((struct arg_dev_net_ip *)arg)->net;
> +	struct in6_addr *addr = ((struct arg_dev_net_ip *)arg)->addr;
> +
> +	if (((void *)rt->rt6i_dev == dev || dev == NULL)&&
> +	    rt != net->ipv6.ip6_null_entry&&
> +	    ipv6_addr_equal(addr,&rt->rt6i_prefsrc.addr)) {
> +		/* remove prefsrc entry */
> +		rt->rt6i_prefsrc.plen = 0;
> +	}
> +	return 0;
> +}
> +
> +void rt6_remove_prefsrc(struct inet6_ifaddr *ifp)
> +{
> +	struct net *net = dev_net(ifp->idev->dev);
> +	struct arg_dev_net_ip adni = {
> +		.dev = ifp->idev->dev,
> +		.net = net,
> +		.addr =&ifp->addr,
> +	};
> +	fib6_clean_all(net, fib6_remove_prefsrc, 0,&adni);
> +}
> +
>   struct arg_dev_net {
>   	struct net_device *dev;
>   	struct net *net;
> @@ -2183,6 +2242,9 @@ static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
>   		nla_memcpy(&cfg->fc_src, tb[RTA_SRC], plen);
>   	}
>
> +	if (tb[RTA_PREFSRC])
> +		nla_memcpy(&cfg->fc_prefsrc, tb[RTA_PREFSRC], 16);
> +
>   	if (tb[RTA_OIF])
>   		cfg->fc_ifindex = nla_get_u32(tb[RTA_OIF]);
>
> @@ -2325,13 +2387,17 @@ static int rt6_fill_node(struct net *net,
>   #endif
>   			NLA_PUT_U32(skb, RTA_IIF, iif);
>   	} else if (dst) {
> -		struct inet6_dev *idev = ip6_dst_idev(&rt->dst);
>   		struct in6_addr saddr_buf;
> -		if (ipv6_dev_get_saddr(net, idev ? idev->dev : NULL,
> -				       dst, 0,&saddr_buf) == 0)
> +		if (ip6_route_get_saddr(net, rt, dst, 0,&saddr_buf) == 0)
>   			NLA_PUT(skb, RTA_PREFSRC, 16,&saddr_buf);
>   	}
>
> +	if (rt->rt6i_prefsrc.plen) {
> +		struct in6_addr saddr_buf;
> +		ipv6_addr_copy(&saddr_buf,&rt->rt6i_prefsrc.addr);
> +		NLA_PUT(skb, RTA_PREFSRC, 16,&saddr_buf);
> +	}
> +
>   	if (rtnetlink_put_metrics(skb, dst_metrics_ptr(&rt->dst))<  0)
>   		goto nla_put_failure;

What userspace application will be used to set this?

^ permalink raw reply

* Re: [PATCH] ip: ip_options_compile() resilient to NULL skb route
From: Scot Doyle @ 2011-04-14 13:34 UTC (permalink / raw)
  To: Hiroaki SHIMODA, Eric Dumazet; +Cc: Stephen Hemminger, David Miller, netdev
In-Reply-To: <20110414131552.1822142f.shimoda.hiroaki@gmail.com>

I tested the three patches linked below, plus the two patches previously 
accepted by David in this thread, with 2.6.39-rc3 commit 
85f2e689a5c8fb6ed8fdbee00109e7f6e5fefcb6. No panics :-)

http://article.gmane.org/gmane.linux.network/192293
http://article.gmane.org/gmane.linux.network/192299
http://article.gmane.org/gmane.linux.network/192301

------------

diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index 008ff6c..10ac127 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -249,11 +249,9 @@ static int br_parse_ip_options(struct sk_buff *skb)
                 goto drop;
         }

-       /* Zero out the CB buffer if no options present */
-       if (iph->ihl == 5) {
-               memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
+       memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
+       if (iph->ihl == 5)
                 return 0;
-       }

         opt->optlen = iph->ihl*4 - sizeof(struct iphdr);
         if (ip_options_compile(dev_net(dev), opt, skb))
@@ -265,7 +263,7 @@ static int br_parse_ip_options(struct sk_buff *skb)
                 if (in_dev && !IN_DEV_SOURCE_ROUTE(in_dev))
                         goto drop;

-               if (ip_options_rcv_srr(skb))
+               if (skb_rtable(skb) && ip_options_rcv_srr(skb))
                         goto drop;
         }

diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c
index dd1b20e..9df4e63 100644
--- a/net/ipv4/inetpeer.c
+++ b/net/ipv4/inetpeer.c
@@ -354,7 +354,8 @@ static void inetpeer_free_rcu(struct rcu_head *head)
  }

  /* May be called with local BH enabled. */
-static void unlink_from_pool(struct inet_peer *p, struct inet_peer_base *base)
+static void unlink_from_pool(struct inet_peer *p, struct inet_peer_base *base,
+                            struct inet_peer __rcu **stack[PEER_MAXDEPTH])
  {
         int do_free;

@@ -368,7 +369,6 @@ static void unlink_from_pool(struct inet_peer *p, struct inet_peer_base *base)
          * We use refcnt=-1 to alert lockless readers this entry is deleted.
          */
         if (atomic_cmpxchg(&p->refcnt, 1, -1) == 1) {
-               struct inet_peer __rcu **stack[PEER_MAXDEPTH];
                 struct inet_peer __rcu ***stackptr, ***delp;
                 if (lookup(&p->daddr, stack, base) != p)
                         BUG();
@@ -422,7 +422,7 @@ static struct inet_peer_base *peer_to_base(struct inet_peer *p)
  }

  /* May be called with local BH enabled. */
-static int cleanup_once(unsigned long ttl)
+static int cleanup_once(unsigned long ttl, struct inet_peer __rcu **stack[PEER_MAXDEPTH])
  {
         struct inet_peer *p = NULL;

@@ -454,7 +454,7 @@ static int cleanup_once(unsigned long ttl)
                  * happen because of entry limits in route cache. */
                 return -1;

-       unlink_from_pool(p, peer_to_base(p));
+       unlink_from_pool(p, peer_to_base(p), stack);
         return 0;
  }

@@ -524,7 +524,7 @@ struct inet_peer *inet_getpeer(struct inetpeer_addr *daddr, int create)

         if (base->total >= inet_peer_threshold)
                 /* Remove one less-recently-used entry. */
-               cleanup_once(0);
+               cleanup_once(0, stack);

         return p;
  }
@@ -540,6 +540,7 @@ static void peer_check_expire(unsigned long dummy)
  {
         unsigned long now = jiffies;
         int ttl, total;
+       struct inet_peer __rcu **stack[PEER_MAXDEPTH];

         total = compute_total();
         if (total >= inet_peer_threshold)
@@ -548,7 +549,7 @@ static void peer_check_expire(unsigned long dummy)
                 ttl = inet_peer_maxttl
                                 - (inet_peer_maxttl - inet_peer_minttl) / HZ *
                                         total / inet_peer_threshold * HZ;
-       while (!cleanup_once(ttl)) {
+       while (!cleanup_once(ttl, stack)) {
                 if (jiffies != now)
                         break;
         }
diff --git a/net/ipv4/ip_options.c b/net/ipv4/ip_options.c
index 28a736f..546dd02 100644
--- a/net/ipv4/ip_options.c
+++ b/net/ipv4/ip_options.c
@@ -329,7 +329,7 @@ int ip_options_compile(struct net *net,
                                         pp_ptr = optptr + 2;
                                         goto error;
                                 }
-                               if (skb) {
+                               if (rt) {
                                         memcpy(&optptr[optptr[2]-1], &rt->rt_spec_dst, 4);
                                         opt->is_changed = 1;
                                 }
@@ -371,7 +371,7 @@ int ip_options_compile(struct net *net,
                                                 goto error;
                                         }
                                         opt->ts = optptr - iph;
-                                       if (skb) {
+                                       if (rt) {
                                                 memcpy(&optptr[optptr[2]-1], &rt->rt_spec_dst, 4);
                                                 timeptr = (__be32*)&optptr[optptr[2]+3];
                                         }
@@ -606,7 +606,7 @@ int ip_options_rcv_srr(struct sk_buff *skb)
         if (!opt->srr)
                 return 0;

-       if (skb->pkt_type != PACKET_HOST)
+       if (skb->pkt_type != PACKET_HOST || !rt)
                 return -EINVAL;
         if (rt->rt_type == RTN_UNICAST) {
                 if (!opt->is_strictroute)

^ permalink raw reply related

* Re: [Bug 32772] New: PROBLEM: kernel BUG at net/ipv4/inetpeer.c:386
From: Eric Dumazet @ 2011-04-14 13:32 UTC (permalink / raw)
  To: Dmitry Novikov; +Cc: David Miller, shemminger, netdev
In-Reply-To: <BANLkTi=A59zBTwrf2TsVbSrouxBhLsrH3w@mail.gmail.com>

On Thursday 14 April 2011 at 16:02 +0300, Dmitry Novikov wrote:
> Thanks. Patch applied. Will wait for results.

Thanks to you Dmitry, for your patience



^ permalink raw reply

* Re: [Bug 32772] New: PROBLEM: kernel BUG at net/ipv4/inetpeer.c:386
From: Dmitry Novikov @ 2011-04-14 13:02 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, shemminger, netdev
In-Reply-To: <20110413.132403.179942105.davem@davemloft.net>

Thanks. Patch applied. Will wait for results.

2011/4/13 David Miller <davem@davemloft.net>:
> From: Dmitry Novikov <dimetrios@gmail.com>
> Date: Wed, 13 Apr 2011 23:14:03 +0300
>
>> Crash again after 7 days of uptime. slub_nomerge is set
>
> Looks like too deep stack, try this patch which is in net-2.6:
>
> --------------------
> inetpeer: reduce stack usage
>
> On 64bit arches, we use 752 bytes of stack when cleanup_once() is called
> from inet_getpeer().
>

^ permalink raw reply

* [PATCH net-next-2.6] rndis_host: Quirky devices are still 'point-to-point'
From: Ben Hutchings @ 2011-04-14 12:51 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

My changes in commit 4d42d417be75d750b82798922b6e775915e11bce were
written some time before the introduction of FLAG_POINTTOPOINT, so
didn't include that flag in the new driver_info.  Change the new
driver_info to be consistent.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
---
 drivers/net/usb/rndis_host.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/usb/rndis_host.c b/drivers/net/usb/rndis_host.c
index 6d6c1da..255d6a4 100644
--- a/drivers/net/usb/rndis_host.c
+++ b/drivers/net/usb/rndis_host.c
@@ -592,7 +592,7 @@ static const struct driver_info	rndis_info = {
 
 static const struct driver_info	rndis_poll_status_info = {
 	.description =	"RNDIS device (poll status before control)",
-	.flags =	FLAG_ETHER | FLAG_FRAMING_RN | FLAG_NO_SETINT,
+	.flags =	FLAG_ETHER | FLAG_POINTTOPOINT | FLAG_FRAMING_RN | FLAG_NO_SETINT,
 	.data =		RNDIS_DRIVER_DATA_POLL_STATUS,
 	.bind =		rndis_bind,
 	.unbind =	rndis_unbind,
-- 
1.7.4.1



^ permalink raw reply related

* Re: Network performance with small packets
From: Michael S. Tsirkin @ 2011-04-14 12:40 UTC (permalink / raw)
  To: Rusty Russell
  Cc: habanero, Shirley Ma, Krishna Kumar2, David Miller, kvm, netdev,
	steved, Tom Lendacky, borntraeger
In-Reply-To: <87bp09ax7a.fsf@rustcorp.com.au>

On Thu, Apr 14, 2011 at 08:58:41PM +0930, Rusty Russell wrote:
> On Tue, 12 Apr 2011 23:01:12 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > On Thu, Mar 10, 2011 at 12:19:42PM +1030, Rusty Russell wrote:
> > > Here's an old patch where I played with implementing this:
> > 
> > ...
> > 
> > > 
> > > virtio: put last_used and last_avail index into ring itself.
> > > 
> > > Generally, the other end of the virtio ring doesn't need to see where
> > > you're up to in consuming the ring.  However, to completely understand
> > > what's going on from the outside, this information must be exposed.
> > > For example, if you want to save and restore a virtio_ring, but you're
> > > not the consumer because the kernel is using it directly.
> > > 
> > > Fortunately, we have room to expand:
> > 
> > This seems to be true for x86 kvm and lguest but is it true
> > for s390?
> 
> Yes, as the ring is page aligned so there's always room.
> 
> > Will this last bit work on s390?
> > If I understand correctly the memory is allocated by host there?
> 
> They have to offer the feature, so if they have some way of allocating
> non-page-aligned amounts of memory, they'll have to add those extra 2
> bytes.
> 
> So I think it's OK...
> Rusty.

Correct. I wonder whether we need to pass the relevant flag
to vring_size. If we do we'll need to add a new function
for that though as vring_size is exported to userspace.

-- 
MST

^ permalink raw reply

* Re: Network performance with small packets
From: Rusty Russell @ 2011-04-14 11:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: habanero, Shirley Ma, Krishna Kumar2, David Miller, kvm, netdev,
	steved, Tom Lendacky, borntraeger
In-Reply-To: <20110412200112.GA19729@redhat.com>

On Tue, 12 Apr 2011 23:01:12 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Thu, Mar 10, 2011 at 12:19:42PM +1030, Rusty Russell wrote:
> > Here's an old patch where I played with implementing this:
> 
> ...
> 
> > 
> > virtio: put last_used and last_avail index into ring itself.
> > 
> > Generally, the other end of the virtio ring doesn't need to see where
> > you're up to in consuming the ring.  However, to completely understand
> > what's going on from the outside, this information must be exposed.
> > For example, if you want to save and restore a virtio_ring, but you're
> > not the consumer because the kernel is using it directly.
> > 
> > Fortunately, we have room to expand:
> 
> This seems to be true for x86 kvm and lguest but is it true
> for s390?

Yes, as the ring is page aligned so there's always room.

> Will this last bit work on s390?
> If I understand correctly the memory is allocated by host there?

They have to offer the feature, so if they have some way of allocating
non-page-aligned amounts of memory, they'll have to add those extra 2
bytes.

So I think it's OK...
Rusty.
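
To make the "always room" argument concrete, here is a rough userspace sketch
of the ring size arithmetic. It assumes the vring layout of that time
(descriptor table, avail ring, padding to the alignment, then used ring) and
typical struct sizes; vring_size_sketch() and vring_tail_slack() are
hypothetical helper names, not the functions from the virtio ring header.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layouts: 16-byte descriptors, 8-byte used elements. */
struct vring_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t flags;
	uint16_t next;
};

struct vring_used_elem {
	uint32_t id;
	uint32_t len;
};

/*
 * Bytes occupied by a ring of 'num' entries: descriptors plus avail ring
 * (flags, idx, num entries), padded up to 'align', then the used ring
 * (flags, idx, num elements). 'align' must be a power of two.
 */
static size_t vring_size_sketch(unsigned int num, size_t align)
{
	return ((sizeof(struct vring_desc) * num
		 + sizeof(uint16_t) * (2 + num)
		 + align - 1) & ~(align - 1))
		+ sizeof(uint16_t) * 2
		+ sizeof(struct vring_used_elem) * num;
}

/* Unused bytes between the end of the used ring and the next page boundary. */
static size_t vring_tail_slack(unsigned int num, size_t page)
{
	size_t size = vring_size_sketch(num, page);
	size_t rounded = (size + page - 1) & ~(page - 1);

	return rounded - size;
}
```

Under these assumptions the used ring is 4 + 8 * num bytes long, which is never
a multiple of the page size, so a page-granular allocation always leaves a few
bytes free after it. An allocator that hands out exactly vring_size bytes has
no such slack, which is the s390 concern raised above.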

^ permalink raw reply

* [PATCH 12/12] mm: Throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage
From: Mel Gorman @ 2011-04-14 10:41 UTC (permalink / raw)
  To: Linux-MM, Linux-Netdev; +Cc: LKML, Peter Zijlstra, Mel Gorman
In-Reply-To: <1302777698-28237-1-git-send-email-mgorman@suse.de>

If swap is backed by network storage such as NBD, there is a risk that a
large number of reclaimers can hang the system by consuming all
PF_MEMALLOC reserves. To avoid these hangs, the administrator must tune
min_free_kbytes in advance. This patch will throttle direct reclaimers
if half the PF_MEMALLOC reserves are in use as the system is at risk of
hanging. A message will be displayed so the administrator knows that
min_free_kbytes should be tuned to a higher value to avoid the
throttling in the future.
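
As a rough illustration of the "half the reserves" check, the watermark test
can be modelled in userspace as below. struct zone_model and
pfmemalloc_ok_model() are hypothetical names for this sketch only; the real
check walks the node's zones and reads the kernel's per-zone counters.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-zone snapshot: min watermark and free pages. */
struct zone_model {
	unsigned long min_wmark_pages;
	unsigned long free_pages;
};

/*
 * Returns true while more than half of the summed min watermarks
 * (treated here as the PF_MEMALLOC reserve) is still free.
 */
static bool pfmemalloc_ok_model(const struct zone_model *zones, size_t nr_zones)
{
	unsigned long reserve = 0;
	unsigned long free_pages = 0;
	size_t i;

	for (i = 0; i < nr_zones; i++) {
		reserve += zones[i].min_wmark_pages;
		free_pages += zones[i].free_pages;
	}

	return free_pages > reserve / 2;
}
```

Note the comparison is strict: with a reserve of 300 pages, exactly 150 free
pages already counts as "not ok" and the caller would be throttled.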

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h |    1 +
 mm/page_alloc.c        |    1 +
 mm/vmscan.c            |   66 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 68 insertions(+), 0 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 02ecb01..e86dcaf 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -638,6 +638,7 @@ typedef struct pglist_data {
 					     range, including holes */
 	int node_id;
 	wait_queue_head_t kswapd_wait;
+	wait_queue_head_t pfmemalloc_wait;
 	struct task_struct *kswapd;
 	int kswapd_max_order;
 	enum zone_type classzone_idx;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2b87dfd..4b1170f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4160,6 +4160,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 	pgdat_resize_init(pgdat);
 	pgdat->nr_zones = 0;
 	init_waitqueue_head(&pgdat->kswapd_wait);
+	init_waitqueue_head(&pgdat->pfmemalloc_wait);
 	pgdat->kswapd_max_order = 0;
 	pgdat_page_cgroup_init(pgdat);
 	
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6771ea7..2dad23d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -42,6 +42,8 @@
 #include <linux/delayacct.h>
 #include <linux/sysctl.h>
 
+#include <net/sock.h>
+
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
 
@@ -2115,6 +2117,61 @@ out:
 	return 0;
 }
 
+static bool pfmemalloc_watermark_ok(pg_data_t *pgdat, int high_zoneidx)
+{
+	struct zone *zone;
+	unsigned long pfmemalloc_reserve = 0;
+	unsigned long free_pages = 0;
+	int i;
+
+	for (i = 0; i <= high_zoneidx; i++) {
+		zone = &pgdat->node_zones[i];
+		pfmemalloc_reserve += min_wmark_pages(zone);
+		free_pages += zone_page_state(zone, NR_FREE_PAGES);
+	}
+
+	return (free_pages > pfmemalloc_reserve / 2) ? true : false;
+}
+
+/*
+ * Throttle direct reclaimers if backing storage is backed by the network
+ * and the PFMEMALLOC reserve for the preferred node is getting dangerously
+ * depleted. kswapd will continue to make progress and wake the processes
+ * when the low watermark is reached
+ */
+static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
+					nodemask_t *nodemask)
+{
+	struct zone *zone;
+	int high_zoneidx = gfp_zone(gfp_mask);
+	DEFINE_WAIT(wait);
+
+	/*
+	 * Only worry about the PFMEMALLOC reserves when network-backed
+	 * storage is configured.
+	 */
+	if (!sk_memalloc_socks())
+		return;
+
+	/* Check if the pfmemalloc reserves are ok */
+	first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
+	if (pfmemalloc_watermark_ok(zone->zone_pgdat, high_zoneidx))
+		return;
+
+	/* Throttle */
+	if (printk_ratelimit())
+		printk(KERN_INFO "Throttling %s due to reclaim pressure on "
+				 "network storage\n",
+			current->comm);
+	do {
+		prepare_to_wait(&zone->zone_pgdat->pfmemalloc_wait, &wait,
+							TASK_INTERRUPTIBLE);
+		schedule();
+		finish_wait(&zone->zone_pgdat->pfmemalloc_wait, &wait);
+	} while (!pfmemalloc_watermark_ok(zone->zone_pgdat, high_zoneidx) &&
+			!fatal_signal_pending(current));
+}
+
 unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 				gfp_t gfp_mask, nodemask_t *nodemask)
 {
@@ -2131,6 +2188,8 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 		.nodemask = nodemask,
 	};
 
+	throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
+
 	trace_mm_vmscan_direct_reclaim_begin(order,
 				sc.may_writepage,
 				gfp_mask);
@@ -2482,6 +2541,13 @@ loop_again:
 			}
 
 		}
+
+		/* Wake throttled direct reclaimers once the pfmemalloc watermark is met */
+		if (sk_memalloc_socks() &&
+				waitqueue_active(&pgdat->pfmemalloc_wait) &&
+				pfmemalloc_watermark_ok(pgdat, MAX_NR_ZONES - 1))
+			wake_up_interruptible(&pgdat->pfmemalloc_wait);
+
 		if (all_zones_ok || (order && pgdat_balanced(pgdat, balanced, *classzone_idx)))
 			break;		/* kswapd: all done */
 		/*
-- 
1.7.3.4
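
For readers following the arithmetic in pfmemalloc_watermark_ok(), here is a
minimal userspace sketch of the same check. It is not kernel code: struct
zone_model and watermark_ok_model() are hypothetical names introduced only
for illustration, and the per-zone values are passed in directly instead of
being read through min_wmark_pages() and zone_page_state().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-zone values the patch reads. */
struct zone_model {
	unsigned long min_wmark_pages;	/* models min_wmark_pages(zone) */
	unsigned long nr_free_pages;	/* models the zone's NR_FREE_PAGES */
};

/*
 * Model of pfmemalloc_watermark_ok(): sum the min watermarks and the free
 * pages of zones 0..high_zoneidx, then require that the free pages exceed
 * half of the summed reserve.
 */
static bool watermark_ok_model(const struct zone_model *zones, size_t nr_zones,
			       size_t high_zoneidx)
{
	unsigned long reserve = 0;
	unsigned long free_pages = 0;
	size_t i;

	assert(zones != NULL && high_zoneidx < nr_zones);

	for (i = 0; i <= high_zoneidx; i++) {
		reserve += zones[i].min_wmark_pages;
		free_pages += zones[i].nr_free_pages;
	}

	return free_pages > reserve / 2;
}
```

With three zones whose (min watermark, free pages) pairs are (100, 20),
(300, 150) and (600, 400), the check fails for high_zoneidx 0 (20 is not
above 50) and for high_zoneidx 1 (170 is not above 200), and passes for
high_zoneidx 2 (570 is above 500). Because the sums run over every zone up
to high_zoneidx, a large free higher zone can cover for a depleted lower one.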

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH 11/12] nbd: Set SOCK_MEMALLOC for access to PFMEMALLOC reserves
From: Mel Gorman @ 2011-04-14 10:41 UTC (permalink / raw)
  To: Linux-MM, Linux-Netdev; +Cc: LKML, Peter Zijlstra, Mel Gorman
In-Reply-To: <1302777698-28237-1-git-send-email-mgorman@suse.de>

Set SOCK_MEMALLOC on the NBD socket to allow access to the PFMEMALLOC
reserves so that pages backed by NBD, particularly swap pages, can be
cleaned and the machine does not deadlock. The PFMEMALLOC reserves can
still be depleted, resulting in deadlock, but the administrator can
resolve this by increasing min_free_kbytes.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 drivers/block/nbd.c |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index e6fc716..322cef8 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -156,6 +156,7 @@ static int sock_xmit(struct nbd_device *lo, int send, void *buf, int size,
 	struct msghdr msg;
 	struct kvec iov;
 	sigset_t blocked, oldset;
+	unsigned long pflags = current->flags;
 
 	if (unlikely(!sock)) {
 		printk(KERN_ERR "%s: Attempted %s on closed socket in sock_xmit\n",
@@ -168,8 +169,9 @@ static int sock_xmit(struct nbd_device *lo, int send, void *buf, int size,
 	siginitsetinv(&blocked, sigmask(SIGKILL));
 	sigprocmask(SIG_SETMASK, &blocked, &oldset);
 
+	current->flags |= PF_MEMALLOC;
 	do {
-		sock->sk->sk_allocation = GFP_NOIO;
+		sock->sk->sk_allocation = GFP_NOIO | __GFP_MEMALLOC;
 		iov.iov_base = buf;
 		iov.iov_len = size;
 		msg.msg_name = NULL;
@@ -214,6 +216,7 @@ static int sock_xmit(struct nbd_device *lo, int send, void *buf, int size,
 	} while (size > 0);
 
 	sigprocmask(SIG_SETMASK, &oldset, NULL);
+	tsk_restore_flags(current, pflags, PF_MEMALLOC);
 
 	return result;
 }
@@ -404,6 +407,8 @@ static int nbd_do_it(struct nbd_device *lo)
 
 	BUG_ON(lo->magic != LO_MAGIC);
 
+	sk_set_memalloc(lo->sock->sk);
+
 	lo->pid = current->pid;
 	ret = sysfs_create_file(&disk_to_dev(lo->disk)->kobj, &pid_attr.attr);
 	if (ret) {
-- 
1.7.3.4
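
The save/set/restore pattern that sock_xmit() applies to current->flags can
be sketched in userspace as below. This is an illustration under stated
assumptions, not the kernel implementation: struct task_model,
PF_MEMALLOC_MODEL, restore_flags_model() and xmit_model() are hypothetical
names, and the sketch assumes that tsk_restore_flags() (introduced elsewhere
in this series) clears the masked bits and copies them back from the saved
value.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag bit; the value is an assumption of this sketch. */
#define PF_MEMALLOC_MODEL 0x00000800UL

/* Hypothetical stand-in for the flags word of a task. */
struct task_model {
	unsigned long flags;
};

/*
 * Assumed semantics of tsk_restore_flags(): for the bits in mask, put back
 * whatever the saved copy held; leave every other bit untouched.
 */
static void restore_flags_model(struct task_model *task,
				unsigned long orig_flags, unsigned long mask)
{
	assert(task != NULL);
	task->flags &= ~mask;
	task->flags |= orig_flags & mask;
}

/* Models the shape of sock_xmit(): save, set, do the work, restore. */
static void xmit_model(struct task_model *task)
{
	unsigned long pflags = task->flags;	/* saved before modification */

	task->flags |= PF_MEMALLOC_MODEL;
	/* ... the send/receive loop would run here ... */
	restore_flags_model(task, pflags, PF_MEMALLOC_MODEL);
}
```

The point of restoring from the saved copy, instead of simply clearing the
bit, is that a caller which already had the flag set before entering
sock_xmit() keeps it afterwards.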


^ permalink raw reply related

* [PATCH 10/12] mm: Micro-optimise slab to avoid a function call
From: Mel Gorman @ 2011-04-14 10:41 UTC (permalink / raw)
  To: Linux-MM, Linux-Netdev; +Cc: LKML, Peter Zijlstra, Mel Gorman
In-Reply-To: <1302777698-28237-1-git-send-email-mgorman@suse.de>

Getting and putting objects in SLAB currently requires a function call
but the bulk of the work is related to PFMEMALLOC reserves which are
only consumed when network-backed storage is critical. Use an inline
function to determine if the function call is required.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/slab.c |   28 ++++++++++++++++++++++++++--
 1 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 8f81d17..0e9980b 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -116,6 +116,8 @@
 #include	<linux/kmemcheck.h>
 #include	<linux/memory.h>
 
+#include	<net/sock.h>
+
 #include	<asm/cacheflush.h>
 #include	<asm/tlbflush.h>
 #include	<asm/page.h>
@@ -941,7 +943,7 @@ static void check_ac_pfmemalloc(struct kmem_cache *cachep,
 	ac->pfmemalloc = false;
 }
 
-static void *ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
+static void *__ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
 						gfp_t flags, bool force_refill)
 {
 	int i;
@@ -988,7 +990,20 @@ static void *ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
 	return objp;
 }
 
-static void ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
+static inline void *ac_get_obj(struct kmem_cache *cachep,
+			struct array_cache *ac, gfp_t flags, bool force_refill)
+{
+	void *objp;
+
+	if (unlikely(sk_memalloc_socks()))
+		objp = __ac_get_obj(cachep, ac, flags, force_refill);
+	else
+		objp = ac->entry[--ac->avail];
+
+	return objp;
+}
+
+static void *__ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
 								void *objp)
 {
 	struct slab *slabp;
@@ -1001,6 +1016,15 @@ static void ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
 			set_obj_pfmemalloc(&objp);
 	}
 
+	return objp;
+}
+
+static inline void ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
+								void *objp)
+{
+	if (unlikely(sk_memalloc_socks()))
+		objp = __ac_put_obj(cachep, ac, objp);
+
 	ac->entry[ac->avail++] = objp;
 }
 
-- 
1.7.3.4
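
The shape of this optimisation, an inline wrapper that only pays for the
out-of-line call when a rarely true condition holds, can be sketched in
userspace as follows. All names here (struct ac_model, memalloc_socks_active,
slow_put_model() and so on) are hypothetical; the boolean stands in for
sk_memalloc_socks(), and the slow paths only count their calls instead of
doing the PFMEMALLOC bookkeeping of __ac_get_obj() and __ac_put_obj().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define AC_LIMIT 8

/* Hypothetical stand-in for struct array_cache: a small LIFO of pointers. */
struct ac_model {
	size_t avail;
	void *entry[AC_LIMIT];
};

static bool memalloc_socks_active;	/* models sk_memalloc_socks() */
static unsigned long slow_path_calls;	/* counts out-of-line calls */

/* Out-of-line slow paths; the real ones do the pfmemalloc bookkeeping. */
static void *slow_get_model(struct ac_model *ac)
{
	slow_path_calls++;
	return ac->entry[--ac->avail];
}

static void *slow_put_model(void *objp)
{
	slow_path_calls++;
	return objp;
}

/* Inline fast path: pop directly unless the rare condition is set. */
static inline void *get_obj_model(struct ac_model *ac)
{
	assert(ac != NULL && ac->avail > 0);
	if (memalloc_socks_active)
		return slow_get_model(ac);
	return ac->entry[--ac->avail];
}

/* Inline fast path: push directly, detouring only when required. */
static inline void put_obj_model(struct ac_model *ac, void *objp)
{
	assert(ac != NULL && ac->avail < AC_LIMIT);
	if (memalloc_socks_active)
		objp = slow_put_model(objp);
	ac->entry[ac->avail++] = objp;
}
```

While memalloc_socks_active is false, get and put touch only the array and
the counter stays at zero, which is the common case the patch wants to keep
cheap.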


^ permalink raw reply related
