Netdev List


* Re: Fwd: LVS on local node
From: Changli Gao @ 2010-07-22  9:10 UTC
  To: Eric Dumazet; +Cc: Franchoze Eric, wensong, lvs-devel, netdev, netfilter-devel
In-Reply-To: <1279781811.2405.15.camel@edumazet-laptop>

On Thu, Jul 22, 2010 at 2:56 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> lvs seems not very SMP friendly and a bit complex.
>
> I would use an iptables setup and a slightly modified REDIRECT target
> (and/or a nf_nat_setup_info() change)
>
> Say you have 8 daemons listening on different ports (1000 to 1007)
>
> iptables -t nat -A PREROUTING -p tcp --dport 1234 -j REDIRECT --rxhash-dist --to-port 1000-1007
>
> rxhash would be provided by RPS on recent kernels or locally computed if
> not already provided by core network (or old kernel)
>
> This rule would be triggered only at connection establishment.
> Conntrack takes care of the following packets and is SMP friendly.
>
>

I think plain REDIRECT may be enough. If the public port is one of the
real ports, you need to add the "--random" option to the REDIRECT
target. If not, "REDIRECT --to-ports 1000-1007" is good enough, and
the destination port will be selected in a round-robin manner.
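
The two selection policies discussed above can be sketched in user-space C.
This is only an illustration under stated assumptions: the struct and function
names are hypothetical, the rotation counter stands in for whatever state the
NAT code keeps, and the hash variant models the proposed rxhash-based
distribution rather than any existing option.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type; it does not exist in the kernel. */
struct port_range {
	uint16_t min;	/* first real port, e.g. 1000 */
	uint16_t max;	/* last real port, e.g. 1007 */
};

/* Rotation: each new connection gets the next port in the range. */
static uint16_t pick_round_robin(const struct port_range *r,
				 unsigned int *rover)
{
	unsigned int span;

	assert(r->max >= r->min);
	span = (unsigned int)r->max - r->min + 1;
	return (uint16_t)(r->min + (*rover)++ % span);
}

/* Hash mapping: the same flow hash always lands on the same port. */
static uint16_t pick_by_hash(const struct port_range *r, uint32_t rxhash)
{
	unsigned int span;

	assert(r->max >= r->min);
	span = (unsigned int)r->max - r->min + 1;
	return (uint16_t)(r->min + rxhash % span);
}
```

Either way the choice is made once, when the connection is created; later
packets of the same connection follow the conntrack entry.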

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)


* Re: [RFC PATCH v3 1/5] irq: add tracepoint to softirq_raise
From: Koki Sanagi @ 2010-07-22  8:41 UTC
  To: Neil Horman
  Cc: netdev, linux-kernel, davem, kaneshige.kenji, izumi.taku,
	kosaki.motohiro, laijs, scott.a.mcmillan, rostedt, eric.dumazet,
	fweisbec, mathieu.desnoyers
In-Reply-To: <20100721111419.GB21259@hmsreliant.think-freely.org>

(2010/07/21 20:14), Neil Horman wrote:
> On Wed, Jul 21, 2010 at 03:57:05PM +0900, Koki Sanagi wrote:
>> (2010/07/20 20:04), Neil Horman wrote:
>>> On Tue, Jul 20, 2010 at 09:45:31AM +0900, Koki Sanagi wrote:
>>>> From: Lai Jiangshan <laijs@cn.fujitsu.com>
>>>>
>>>> Add a tracepoint for tracing when softirq action is raised.
>>>>
>>>> Together with the existing tracepoints, it completes softirq's tracepoints:
>>>> softirq_raise, softirq_entry and softirq_exit.
>>>>
>>>> And when this tracepoint is used in combination with
>>>> the softirq_entry tracepoint we can determine
>>>> the softirq raise latency.
>>>>
>>>> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
>>>> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
>>>> Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
>>>>
>>>> [ factorize softirq events with DECLARE_EVENT_CLASS ]
>>>> Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
>>>> ---
>>>>  include/linux/interrupt.h  |    8 +++++-
>>>>  include/trace/events/irq.h |   57 ++++++++++++++++++++++++++-----------------
>>>>  kernel/softirq.c           |    4 +-
>>>>  3 files changed, 43 insertions(+), 26 deletions(-)
>>>>
>>>> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
>>>> index c233113..1cb5726 100644
>>>> --- a/include/linux/interrupt.h
>>>> +++ b/include/linux/interrupt.h
>>>> @@ -18,6 +18,7 @@
>>>>  #include <asm/atomic.h>
>>>>  #include <asm/ptrace.h>
>>>>  #include <asm/system.h>
>>>> +#include <trace/events/irq.h>
>>>>  
>>>>  /*
>>>>   * These correspond to the IORESOURCE_IRQ_* defines in
>>>> @@ -402,7 +403,12 @@ asmlinkage void do_softirq(void);
>>>>  asmlinkage void __do_softirq(void);
>>>>  extern void open_softirq(int nr, void (*action)(struct softirq_action *));
>>>>  extern void softirq_init(void);
>>>> -#define __raise_softirq_irqoff(nr) do { or_softirq_pending(1UL << (nr)); } while (0)
>>>> +static inline void __raise_softirq_irqoff(unsigned int nr)
>>>> +{
>>>> +	trace_softirq_raise(nr);
>>>> +	or_softirq_pending(1UL << nr);
>>>> +}
>>>> +
>>> We already have tracepoints in irq_enter and irq_exit.  If the goal here is to
>>> detect latency during packet processing, cant the delta in time between those
>>> two points be used to determine interrupt handling latency?
>>
>> Certainly, the time between irq_entry and irq_exit is not directly related to
>> latency during packet processing, but it is indirectly related.
>> softirq_entry isn't reached until the irq exits, and the softirq_entry time is
>> related to packet processing latency. So I show it as a reference.
>>
> It's not directly related, no, but look at it: the amount of processing between
> irq_exit and softirq_entry is minimal.  The information you are trying to
> extract by computing the delta from irq_entry to softirq_entry is almost exactly
> the same as that from irq_entry to irq_exit.  For that matter, since you're
> trying to gauge latency for packet processing, I expect you could get the same
> delta by measuring irq_entry to napi_poll tracepoint time, and save the hassle
> of needing to filter on softirq processing that doesn't relate to packet
> processing.

Yeah, to determine interrupt latency, we need either irq_exit or
softirq_entry, not both.
I think softirq_entry should be kept, because there is a possibility that
the softirq isn't triggered immediately after irq_exit.
softirq_exit isn't needed because it is not related to packet processing.
softirq_raise is needed because it connects irq_entry and softirq_entry, but
there is no need to show it. Currently, my idea is like the following.

irq_entry(+0.000000msec,irq=77:eth3)
    |
softirq_entry(+0.003562msec)
    |
    |---netif_receive_skb(+0.006279msec,len=100)
    |            |
    |   skb_copy_datagram_iovec(+0.038778msec, 2285:sshd)
    |
napi_poll_exit(+0.017160msec, eth3)
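
The offsets in a tree like this are plain timestamp differences. A minimal
sketch, assuming each event carries a nanosecond timestamp and the first
event in the array is irq_entry; the struct and function names are
hypothetical and not part of the patch series:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one trace event; timestamp in nanoseconds. */
struct trace_event {
	const char *name;
	unsigned long long ts_ns;
};

/* Offset of event i from the first event (irq_entry), in nanoseconds. */
static unsigned long long offset_from_irq_entry(const struct trace_event *ev,
						size_t n, size_t i)
{
	assert(n > 0 && i < n);
	assert(ev[i].ts_ns >= ev[0].ts_ns);
	return ev[i].ts_ns - ev[0].ts_ns;
}

/* Time between raising a softirq and its handler starting to run. */
static unsigned long long softirq_raise_latency_ns(unsigned long long raise_ns,
						   unsigned long long entry_ns)
{
	assert(entry_ns >= raise_ns);
	return entry_ns - raise_ns;
}
```

With irq_entry at 1000000 ns and softirq_entry at 1003562 ns, the offset is
3562 ns, i.e. the +0.003562msec shown above.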

> 
>>>
>>>
>>>>  extern void raise_softirq_irqoff(unsigned int nr);
>>>>  extern void raise_softirq(unsigned int nr);
>>>>  extern void wakeup_softirqd(void);
>>>> diff --git a/include/trace/events/irq.h b/include/trace/events/irq.h
>>>> index 0e4cfb6..717744c 100644
>>>> --- a/include/trace/events/irq.h
>>>> +++ b/include/trace/events/irq.h
>>>> @@ -5,7 +5,9 @@
>>>>  #define _TRACE_IRQ_H
>>>>  
>>>>  #include <linux/tracepoint.h>
>>>> -#include <linux/interrupt.h>
>>>> +
>>>> +struct irqaction;
>>>> +struct softirq_action;
>>>>  
>>>>  #define softirq_name(sirq) { sirq##_SOFTIRQ, #sirq }
>>>>  #define show_softirq_name(val)				\
>>>> @@ -84,56 +86,65 @@ TRACE_EVENT(irq_handler_exit,
>>>>  
>>>>  DECLARE_EVENT_CLASS(softirq,
>>>>  
>>>> -	TP_PROTO(struct softirq_action *h, struct softirq_action *vec),
>>>> +	TP_PROTO(unsigned int nr),
>>>>  
>>>> -	TP_ARGS(h, vec),
>>>> +	TP_ARGS(nr),
>>>>  
>>>>  	TP_STRUCT__entry(
>>>> -		__field(	int,	vec			)
>>>> +		__field(	unsigned int,	vec	)
>>>>  	),
>>>>  
>>>>  	TP_fast_assign(
>>>> -		__entry->vec = (int)(h - vec);
>>>> +		__entry->vec	= nr;
>>>>  	),
>>>>  
>>>>  	TP_printk("vec=%d [action=%s]", __entry->vec,
>>>> -		  show_softirq_name(__entry->vec))
>>>> +		show_softirq_name(__entry->vec))
>>>> +);
>>>> +
>>>> +/**
>>>> + * softirq_raise - called immediately when a softirq is raised
>>>> + * @nr: softirq vector number
>>>> + *
>>>> + * Tracepoint for tracing when softirq action is raised.
>>>> + * Also, when used in combination with the softirq_entry tracepoint
>>>> + * we can determine the softirq raise latency.
>>>> + */
>>>> +DEFINE_EVENT(softirq, softirq_raise,
>>>> +
>>>> +	TP_PROTO(unsigned int nr),
>>>> +
>>>> +	TP_ARGS(nr)
>>>>  );
>>>>  
>>>>  /**
>>>>   * softirq_entry - called immediately before the softirq handler
>>>> - * @h: pointer to struct softirq_action
>>>> - * @vec: pointer to first struct softirq_action in softirq_vec array
>>>> + * @nr: softirq vector number
>>>>   *
>>>> - * The @h parameter, contains a pointer to the struct softirq_action
>>>> - * which has a pointer to the action handler that is called. By subtracting
>>>> - * the @vec pointer from the @h pointer, we can determine the softirq
>>>> - * number. Also, when used in combination with the softirq_exit tracepoint
>>>> + * Tracepoint for tracing when softirq action starts.
>>>> + * Also, when used in combination with the softirq_exit tracepoint
>>>>   * we can determine the softirq latency.
>>>>   */
>>>>  DEFINE_EVENT(softirq, softirq_entry,
>>>>  
>>>> -	TP_PROTO(struct softirq_action *h, struct softirq_action *vec),
>>>> +	TP_PROTO(unsigned int nr),
>>>>  
>>>> -	TP_ARGS(h, vec)
>>>> +	TP_ARGS(nr)
>>>>  );
>>>>  
>>>>  /**
>>>>   * softirq_exit - called immediately after the softirq handler returns
>>>> - * @h: pointer to struct softirq_action
>>>> - * @vec: pointer to first struct softirq_action in softirq_vec array
>>>> + * @nr: softirq vector number
>>>>   *
>>>> - * The @h parameter contains a pointer to the struct softirq_action
>>>> - * that has handled the softirq. By subtracting the @vec pointer from
>>>> - * the @h pointer, we can determine the softirq number. Also, when used in
>>>> - * combination with the softirq_entry tracepoint we can determine the softirq
>>>> - * latency.
>>>> + * Tracepoint for tracing when softirq action ends.
>>>> + * Also, when used in combination with the softirq_entry tracepoint
>>>> + * we can determine the softirq latency.
>>>>   */
>>>>  DEFINE_EVENT(softirq, softirq_exit,
>>>>  
>>>> -	TP_PROTO(struct softirq_action *h, struct softirq_action *vec),
>>>> +	TP_PROTO(unsigned int nr),
>>>>  
>>>> -	TP_ARGS(h, vec)
>>>> +	TP_ARGS(nr)
>>>>  );
>>>>  
>>>>  #endif /*  _TRACE_IRQ_H */
>>>> diff --git a/kernel/softirq.c b/kernel/softirq.c
>>>> index 825e112..6790599 100644
>>>> --- a/kernel/softirq.c
>>>> +++ b/kernel/softirq.c
>>>> @@ -215,9 +215,9 @@ restart:
>>>>  			int prev_count = preempt_count();
>>>>  			kstat_incr_softirqs_this_cpu(h - softirq_vec);
>>>>  
>>>> -			trace_softirq_entry(h, softirq_vec);
>>>> +			trace_softirq_entry(h - softirq_vec);
>>>>  			h->action(h);
>>>> -			trace_softirq_exit(h, softirq_vec);
>>>> +			trace_softirq_exit(h - softirq_vec);
>>>
>>> You're losing information here by reducing the number of parameters in this
>>> tracepoint.  How many other tracepoint scripts rely on having both pointers
>>> handy?  Why not just do the pointer math inside your tracehook instead?
>>
>> In the __raise_softirq_irqoff macro there is no way to refer to softirq_vec, so
>> it can't use the softirq DECLARE_EVENT_CLASS as is.
>> Currently, there is no script using softirq_entry or softirq_exit.
>>
> That shouldn't matter, just pass in NULL for softirq_vec in
> __raise_softirq_irqoff as the second argument to the trace function.  You may
> need to fix up the class definition so that the assignment or printk doesn't try
> to dereference that pointer when it's NULL, but that's easy enough, and it avoids
> breaking any other perf scripts floating out there.
> Neil
> 
>> Thanks,
>> Koki Sanagi.
>>
>>>
>>>>  			if (unlikely(prev_count != preempt_count())) {
>>>>  				printk(KERN_ERR "huh, entered softirq %td %s %p"
>>>>  				       "with preempt_count %08x,"
>>>>
>>>>
>>>
>>>
>>
>>
>>
> 
> 
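
The pointer arithmetic discussed in the review can be illustrated in user
space. This is a sketch under assumptions, not the patch: the _demo names are
stand-ins, and encoding the vector number in the first pointer when the base
is NULL is only one possible reading of the "pass in NULL" suggestion, which
the thread does not spell out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_SOFTIRQS_DEMO 10	/* stand-in for the kernel's NR_SOFTIRQS */

struct softirq_action_demo {
	void (*action)(struct softirq_action_demo *);
};

static struct softirq_action_demo softirq_vec_demo[NR_SOFTIRQS_DEMO];

/*
 * Two-pointer form: the vector number is the element pointer minus the
 * array base. Assumed variant for the NULL case: the caller has no
 * access to the array, so it smuggles the number in through h and the
 * function must not do pointer arithmetic on it.
 */
static int vec_from_tracepoint_args(const struct softirq_action_demo *h,
				    const struct softirq_action_demo *vec)
{
	if (vec == NULL)
		return (int)(uintptr_t)h;

	assert(h >= vec && h < vec + NR_SOFTIRQS_DEMO);
	return (int)(h - vec);
}
```

Keeping the two-argument prototype is the point of the suggestion: existing
consumers of softirq_entry and softirq_exit would see no change.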


* Re: [RFC PATCH v3 4/5] skb: add tracepoints to freeing skb
From: Koki Sanagi @ 2010-07-22  8:39 UTC
  To: Neil Horman
  Cc: netdev, linux-kernel, davem, kaneshige.kenji, izumi.taku,
	kosaki.motohiro, laijs, scott.a.mcmillan, rostedt, eric.dumazet,
	fweisbec, mathieu.desnoyers
In-Reply-To: <20100721105645.GA21259@hmsreliant.think-freely.org>

(2010/07/21 19:56), Neil Horman wrote:
> On Wed, Jul 21, 2010 at 04:02:57PM +0900, Koki Sanagi wrote:
>> (2010/07/20 20:50), Neil Horman wrote:
>>> On Tue, Jul 20, 2010 at 09:49:10AM +0900, Koki Sanagi wrote:
>>>> [RFC PATCH v3 4/5] skb: add tracepoints to freeing skb
>>>> This patch adds tracepoints to consume_skb, dev_kfree_skb_irq and
>>>> skb_free_datagram_locked. Combined with the tracepoint on dev_hard_start_xmit,
>>>> they let us check how long it takes to free transmitted packets, and from that
>>>> calculate how many packets the driver held at a given time. This is useful when
>>>> drops of transmitted packets are a problem.
>>>>
>>>>           <idle>-0     [001] 241409.218333: consume_skb: skbaddr=dd6b2fb8
>>>>           <idle>-0     [001] 241409.490555: dev_kfree_skb_irq: skbaddr=f5e29840
>>>>
>>>>         udp-recv-302   [001] 515031.206008: skb_free_datagram_locked: skbaddr=f5b1d900
>>>>
>>>>
>>>> Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
>>>> ---
>>>>  include/trace/events/skb.h |   42 ++++++++++++++++++++++++++++++++++++++++++
>>>>  net/core/datagram.c        |    1 +
>>>>  net/core/dev.c             |    2 ++
>>>>  net/core/skbuff.c          |    1 +
>>>>  4 files changed, 46 insertions(+), 0 deletions(-)
>>>>
>>>> diff --git a/include/trace/events/skb.h b/include/trace/events/skb.h
>>>> index 4b2be6d..84c9041 100644
>>>> --- a/include/trace/events/skb.h
>>>> +++ b/include/trace/events/skb.h
>>>> @@ -35,6 +35,48 @@ TRACE_EVENT(kfree_skb,
>>>>  		__entry->skbaddr, __entry->protocol, __entry->location)
>>>>  );
>>>>  
>>>> +DECLARE_EVENT_CLASS(free_skb,
>>>> +
>>>> +	TP_PROTO(struct sk_buff *skb),
>>>> +
>>>> +	TP_ARGS(skb),
>>>> +
>>>> +	TP_STRUCT__entry(
>>>> +		__field(	void *,	skbaddr	)
>>>> +	),
>>>> +
>>>> +	TP_fast_assign(
>>>> +		__entry->skbaddr = skb;
>>>> +	),
>>>> +
>>>> +	TP_printk("skbaddr=%p", __entry->skbaddr)
>>>> +
>>>> +);
>>>> +
>>>> +DEFINE_EVENT(free_skb, consume_skb,
>>>> +
>>>> +	TP_PROTO(struct sk_buff *skb),
>>>> +
>>>> +	TP_ARGS(skb)
>>>> +
>>>> +);
>>>> +
>>>> +DEFINE_EVENT(free_skb, dev_kfree_skb_irq,
>>>> +
>>>> +	TP_PROTO(struct sk_buff *skb),
>>>> +
>>>> +	TP_ARGS(skb)
>>>> +
>>>> +);
>>>> +
>>>> +DEFINE_EVENT(free_skb, skb_free_datagram_locked,
>>>> +
>>>> +	TP_PROTO(struct sk_buff *skb),
>>>> +
>>>> +	TP_ARGS(skb)
>>>> +
>>>> +);
>>>> +
>>>
>>> Why create these last two tracepoints at all?  dev_kfree_skb_irq will eventually
>>> pass through kfree_skb anyway, getting picked up by the tracepoint there. The
>>> latter won't (since it uses __kfree_skb instead), but I think that could
>>> be fixed by adding a call to trace_kfree_skb there directly, saving you two
>>> tracepoints.
>>>
>>> Neil
>>>
>> I think dev_kfree_skb_irq isn't fully covered by trace_kfree_skb or
>> trace_consume_skb, because net_tx_action frees the skb with __kfree_skb. So it
>> is better to add trace_kfree_skb before it. The same applies to
>> skb_free_datagram_locked.
>>
> It isn't, you're right, but that was the point I made above.  Those missed areas
> could be easily handled by adding calls to trace_kfree_skb, which already exists,
> to the missed areas.  Then you don't need to create those new tracepoints.  The
> way you're doing this, if someone wants to trace all skb frees in debugfs, they
> would have to enable three tracepoints, not just one.  Not that that's the point
> of your patch, but it's something to consider, and it simplifies your code.
> Neil
> 

O.K. I've reworked the patch to use trace_kfree_skb instead of
trace_dev_kfree_skb_irq and trace_skb_free_datagram_locked.
But I've run into a problem.
Instead of __builtin_return_address, I should use a macro or function that
returns the current address, but I don't know of one. Do you know of any solution?

Koki Sanagi.
---
 include/trace/events/skb.h |   17 +++++++++++++++++
 net/core/datagram.c        |    1 +
 net/core/dev.c             |    2 ++
 net/core/skbuff.c          |    1 +
 4 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/skb.h b/include/trace/events/skb.h
index 4b2be6d..75ce9d5 100644
--- a/include/trace/events/skb.h
+++ b/include/trace/events/skb.h
@@ -35,6 +35,23 @@ TRACE_EVENT(kfree_skb,
 		__entry->skbaddr, __entry->protocol, __entry->location)
 );
 
+TRACE_EVENT(consume_skb,
+
+	TP_PROTO(struct sk_buff *skb),
+
+	TP_ARGS(skb),
+
+	TP_STRUCT__entry(
+		__field(	void *,	skbaddr	)
+	),
+
+	TP_fast_assign(
+		__entry->skbaddr = skb;
+	),
+
+	TP_printk("skbaddr=%p", __entry->skbaddr)
+);
+
 TRACE_EVENT(skb_copy_datagram_iovec,
 
 	TP_PROTO(const struct sk_buff *skb, int len),
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 251997a..96dab4f 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -243,6 +243,7 @@ void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb)
 	unlock_sock_fast(sk, slow);
 
 	/* skb is now orphaned, can be freed outside of locked section */
+	trace_kfree_skb(skb, __builtin_return_address(0));
 	__kfree_skb(skb);
 }
 EXPORT_SYMBOL(skb_free_datagram_locked);
diff --git a/net/core/dev.c b/net/core/dev.c
index e6a911f..faded6f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -131,6 +131,7 @@
 #include <linux/random.h>
 #include <trace/events/napi.h>
 #include <trace/events/net.h>
+#include <trace/events/skb.h>
 #include <linux/pci.h>
 
 #include "net-sysfs.h"
@@ -2577,6 +2578,7 @@ static void net_tx_action(struct softirq_action *h)
 			clist = clist->next;
 
 			WARN_ON(atomic_read(&skb->users));
+			trace_kfree_skb(skb, __builtin_return_address(0));
 			__kfree_skb(skb);
 		}
 	}
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 76d33ca..ce0bc36 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -466,6 +466,7 @@ void consume_skb(struct sk_buff *skb)
 		smp_rmb();
 	else if (likely(!atomic_dec_and_test(&skb->users)))
 		return;
+	trace_consume_skb(skb);
 	__kfree_skb(skb);
 }
 EXPORT_SYMBOL(consume_skb);
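
On the question of obtaining the current address rather than the caller's:
one possibility, assuming GCC's local-label and label-address extensions, is
a macro built like the kernel's _THIS_IP_. The user-space sketch below only
contrasts the two values; the macro and function names are hypothetical, and
whether this is the right choice for the patch is left open.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Address of the instruction where the macro is expanded, using a
 * GCC statement expression with a local label. The kernel's
 * _THIS_IP_ macro is built the same way.
 */
#define CURRENT_IP_DEMO() \
	({ __label__ here_label; here_label: (uintptr_t)&&here_label; })

/* Address inside this function: identifies the site itself. */
static __attribute__((noinline)) uintptr_t free_site_ip(void)
{
	uintptr_t ip = CURRENT_IP_DEMO();

	assert(ip != 0);
	return ip;
}

/* Address in the caller: what __builtin_return_address(0) reports. */
static __attribute__((noinline)) uintptr_t caller_ip(void)
{
	return (uintptr_t)__builtin_return_address(0);
}
```

In a function such as net_tx_action, __builtin_return_address(0) points into
the function that called it, not at the line that frees the skb, which is why
the two values are not interchangeable for a "location" field.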



 


* [patch v2.8 1/4] netfilter: xt_ipvs (netfilter matcher for IPVS)
From: Simon Horman @ 2010-07-22  7:35 UTC
  To: lvs-devel, netdev, linux-kernel, netfilter, netfilter-devel
  Cc: Malcolm Turnbull, Mark Brooks, Wensong Zhang, Julius Volz,
	Patrick McHardy, David S. Miller, Hannes Eder, Jan Engelhardt
In-Reply-To: <20100722073547.504156161@vergenet.net>

[-- Attachment #1: netfilter-xt_ipvs-netfilter-matcher-for-IPVS.patch --]
[-- Type: text/plain, Size: 8937 bytes --]

From:	Hannes Eder <heder@google.com>

This implements the kernel-space side of the netfilter matcher xt_ipvs.

[ minor fixes by Simon Horman <horms@verge.net.au> ]
Signed-off-by: Hannes Eder <heder@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>

--- 

 include/linux/netfilter/xt_ipvs.h |   27 ++++
 net/netfilter/Kconfig             |   10 +
 net/netfilter/Makefile            |    1 
 net/netfilter/ipvs/ip_vs_proto.c  |    1 
 net/netfilter/xt_ipvs.c           |  189 +++++++++++++++++++++++++++++++++++++
 5 files changed, 228 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/netfilter/xt_ipvs.h
 create mode 100644 net/netfilter/xt_ipvs.c

v2.8
* Trivial rediff
As suggested by Jan Engelhardt
* Use an enum instead of #ifdefs for flags and masks

v2.5
* Use nf_ct_is_untracked(ct) instead of ct == nf_ct_is_untracked(ct),
  the latter is blatantly incorrect.

v2.4
As per advice from Patrick McHardy
* Reduce size of l4proto and fwd_method members of struct xt_ipvs_mtinfo
  from __u16 to __u8
* Use nf_conntrack_untracked() instead of &nf_conntrack_untracked

v2.3
As per advice from Patrick McHardy
* Don't define a value for _XT_IPVS_H in xt_ipvs.h
* Depend on NF_CONNTRACK
* Update to new API
  - ipvs_mt_check() should return an int rather than a bool
  - Change type of ipvs_mt()'s par parameter from
    struct xt_match_param to struct xt_action_param
  - Make ipvs_mt()'s par parameter non-const

v2.1, v2.2
No Change
Index: nf-next-2.6/include/linux/netfilter/xt_ipvs.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ nf-next-2.6/include/linux/netfilter/xt_ipvs.h	2010-07-22 10:39:44.000000000 +0900
@@ -0,0 +1,27 @@
+#ifndef _XT_IPVS_H
+#define _XT_IPVS_H
+
+enum {
+	XT_IPVS_IPVS_PROPERTY =	1 << 0, /* all other options imply this one */
+	XT_IPVS_PROTO =		1 << 1,
+	XT_IPVS_VADDR =		1 << 2,
+	XT_IPVS_VPORT =		1 << 3,
+	XT_IPVS_DIR =		1 << 4,
+	XT_IPVS_METHOD =	1 << 5,
+	XT_IPVS_VPORTCTL =	1 << 6,
+	XT_IPVS_MASK =		(1 << 7) - 1,
+	XT_IPVS_ONCE_MASK =	XT_IPVS_MASK & ~XT_IPVS_IPVS_PROPERTY
+};
+
+struct xt_ipvs_mtinfo {
+	union nf_inet_addr	vaddr, vmask;
+	__be16			vport;
+	__u8			l4proto;
+	__u8			fwd_method;
+	__be16			vportctl;
+
+	__u8			invert;
+	__u8			bitmask;
+};
+
+#endif /* _XT_IPVS_H */
Index: nf-next-2.6/net/netfilter/Kconfig
===================================================================
--- nf-next-2.6.orig/net/netfilter/Kconfig	2010-07-22 10:13:21.000000000 +0900
+++ nf-next-2.6/net/netfilter/Kconfig	2010-07-22 10:13:42.000000000 +0900
@@ -742,6 +742,16 @@ config NETFILTER_XT_MATCH_IPRANGE
 
 	If unsure, say M.
 
+config NETFILTER_XT_MATCH_IPVS
+	tristate '"ipvs" match support'
+	depends on IP_VS
+	depends on NETFILTER_ADVANCED
+	depends on NF_CONNTRACK
+	help
+	  This option allows you to match against IPVS properties of a packet.
+
+	  If unsure, say N.
+
 config NETFILTER_XT_MATCH_LENGTH
 	tristate '"length" match support'
 	depends on NETFILTER_ADVANCED
Index: nf-next-2.6/net/netfilter/Makefile
===================================================================
--- nf-next-2.6.orig/net/netfilter/Makefile	2010-07-22 10:13:21.000000000 +0900
+++ nf-next-2.6/net/netfilter/Makefile	2010-07-22 10:13:42.000000000 +0900
@@ -77,6 +77,7 @@ obj-$(CONFIG_NETFILTER_XT_MATCH_HASHLIMI
 obj-$(CONFIG_NETFILTER_XT_MATCH_HELPER) += xt_helper.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_HL) += xt_hl.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_IPRANGE) += xt_iprange.o
+obj-$(CONFIG_NETFILTER_XT_MATCH_IPVS) += xt_ipvs.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_LENGTH) += xt_length.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_LIMIT) += xt_limit.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_MAC) += xt_mac.o
Index: nf-next-2.6/net/netfilter/ipvs/ip_vs_proto.c
===================================================================
--- nf-next-2.6.orig/net/netfilter/ipvs/ip_vs_proto.c	2010-07-22 10:13:21.000000000 +0900
+++ nf-next-2.6/net/netfilter/ipvs/ip_vs_proto.c	2010-07-22 10:13:42.000000000 +0900
@@ -98,6 +98,7 @@ struct ip_vs_protocol * ip_vs_proto_get(
 
 	return NULL;
 }
+EXPORT_SYMBOL(ip_vs_proto_get);
 
 
 /*
Index: nf-next-2.6/net/netfilter/xt_ipvs.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ nf-next-2.6/net/netfilter/xt_ipvs.c	2010-07-22 10:13:42.000000000 +0900
@@ -0,0 +1,189 @@
+/*
+ *	xt_ipvs - kernel module to match IPVS connection properties
+ *
+ *	Author: Hannes Eder <heder@google.com>
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/spinlock.h>
+#include <linux/skbuff.h>
+#ifdef CONFIG_IP_VS_IPV6
+#include <net/ipv6.h>
+#endif
+#include <linux/ip_vs.h>
+#include <linux/types.h>
+#include <linux/netfilter/x_tables.h>
+#include <linux/netfilter/x_tables.h>
+#include <linux/netfilter/xt_ipvs.h>
+#include <net/netfilter/nf_conntrack.h>
+
+#include <net/ip_vs.h>
+
+MODULE_AUTHOR("Hannes Eder <heder@google.com>");
+MODULE_DESCRIPTION("Xtables: match IPVS connection properties");
+MODULE_LICENSE("GPL");
+MODULE_ALIAS("ipt_ipvs");
+MODULE_ALIAS("ip6t_ipvs");
+
+/* borrowed from xt_conntrack */
+static bool ipvs_mt_addrcmp(const union nf_inet_addr *kaddr,
+			    const union nf_inet_addr *uaddr,
+			    const union nf_inet_addr *umask,
+			    unsigned int l3proto)
+{
+	if (l3proto == NFPROTO_IPV4)
+		return ((kaddr->ip ^ uaddr->ip) & umask->ip) == 0;
+#ifdef CONFIG_IP_VS_IPV6
+	else if (l3proto == NFPROTO_IPV6)
+		return ipv6_masked_addr_cmp(&kaddr->in6, &umask->in6,
+		       &uaddr->in6) == 0;
+#endif
+	else
+		return false;
+}
+
+static bool
+ipvs_mt(const struct sk_buff *skb, struct xt_action_param *par)
+{
+	const struct xt_ipvs_mtinfo *data = par->matchinfo;
+	/* ipvs_mt_check ensures that family is only NFPROTO_IPV[46]. */
+	const u_int8_t family = par->family;
+	struct ip_vs_iphdr iph;
+	struct ip_vs_protocol *pp;
+	struct ip_vs_conn *cp;
+	bool match = true;
+
+	if (data->bitmask == XT_IPVS_IPVS_PROPERTY) {
+		match = skb->ipvs_property ^
+			!!(data->invert & XT_IPVS_IPVS_PROPERTY);
+		goto out;
+	}
+
+	/* other flags than XT_IPVS_IPVS_PROPERTY are set */
+	if (!skb->ipvs_property) {
+		match = false;
+		goto out;
+	}
+
+	ip_vs_fill_iphdr(family, skb_network_header(skb), &iph);
+
+	if (data->bitmask & XT_IPVS_PROTO)
+		if ((iph.protocol == data->l4proto) ^
+		    !(data->invert & XT_IPVS_PROTO)) {
+			match = false;
+			goto out;
+		}
+
+	pp = ip_vs_proto_get(iph.protocol);
+	if (unlikely(!pp)) {
+		match = false;
+		goto out;
+	}
+
+	/*
+	 * Check if the packet belongs to an existing entry
+	 */
+	cp = pp->conn_out_get(family, skb, pp, &iph, iph.len, 1 /* inverse */);
+	if (unlikely(cp == NULL)) {
+		match = false;
+		goto out;
+	}
+
+	/*
+	 * We found a connection, i.e. cp != NULL, make sure to call
+	 * __ip_vs_conn_put before returning.  In our case jump to out_put_cp.
+	 */
+
+	if (data->bitmask & XT_IPVS_VPORT)
+		if ((cp->vport == data->vport) ^
+		    !(data->invert & XT_IPVS_VPORT)) {
+			match = false;
+			goto out_put_cp;
+		}
+
+	if (data->bitmask & XT_IPVS_VPORTCTL)
+		if ((cp->control != NULL &&
+		     cp->control->vport == data->vportctl) ^
+		    !(data->invert & XT_IPVS_VPORTCTL)) {
+			match = false;
+			goto out_put_cp;
+		}
+
+	if (data->bitmask & XT_IPVS_DIR) {
+		enum ip_conntrack_info ctinfo;
+		struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
+
+		if (ct == NULL || nf_ct_is_untracked(ct)) {
+			match = false;
+			goto out_put_cp;
+		}
+
+		if ((ctinfo >= IP_CT_IS_REPLY) ^
+		    !!(data->invert & XT_IPVS_DIR)) {
+			match = false;
+			goto out_put_cp;
+		}
+	}
+
+	if (data->bitmask & XT_IPVS_METHOD)
+		if (((cp->flags & IP_VS_CONN_F_FWD_MASK) == data->fwd_method) ^
+		    !(data->invert & XT_IPVS_METHOD)) {
+			match = false;
+			goto out_put_cp;
+		}
+
+	if (data->bitmask & XT_IPVS_VADDR) {
+		if (ipvs_mt_addrcmp(&cp->vaddr, &data->vaddr,
+				    &data->vmask, family) ^
+		    !(data->invert & XT_IPVS_VADDR)) {
+			match = false;
+			goto out_put_cp;
+		}
+	}
+
+out_put_cp:
+	__ip_vs_conn_put(cp);
+out:
+	pr_debug("match=%d\n", match);
+	return match;
+}
+
+static int ipvs_mt_check(const struct xt_mtchk_param *par)
+{
+	if (par->family != NFPROTO_IPV4
+#ifdef CONFIG_IP_VS_IPV6
+	    && par->family != NFPROTO_IPV6
+#endif
+		) {
+		pr_info("protocol family %u not supported\n", par->family);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static struct xt_match xt_ipvs_mt_reg __read_mostly = {
+	.name       = "ipvs",
+	.revision   = 0,
+	.family     = NFPROTO_UNSPEC,
+	.match      = ipvs_mt,
+	.checkentry = ipvs_mt_check,
+	.matchsize  = XT_ALIGN(sizeof(struct xt_ipvs_mtinfo)),
+	.me         = THIS_MODULE,
+};
+
+static int __init ipvs_mt_init(void)
+{
+	return xt_register_match(&xt_ipvs_mt_reg);
+}
+
+static void __exit ipvs_mt_exit(void)
+{
+	xt_unregister_match(&xt_ipvs_mt_reg);
+}
+
+module_init(ipvs_mt_init);
+module_exit(ipvs_mt_exit);
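
The per-option tests in ipvs_mt() all share one idiom: the raw comparison is
XORed with the negated invert bit, and a true result means the rule does not
match. A minimal user-space sketch of that idiom and of the masked IPv4
comparison follows; the names are hypothetical and plain uint32_t values
stand in for the kernel's address types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum {
	DEMO_VPORT = 1 << 3,	/* mirrors XT_IPVS_VPORT */
};

/*
 * Returns true when the rule should reject the packet for this option.
 * "equal" is the raw comparison result, "invert" holds the per-option
 * negation bits. A plain rule mismatches on inequality; an inverted
 * rule mismatches on equality.
 */
static bool option_mismatch(bool equal, uint8_t invert, uint8_t flag)
{
	assert(flag != 0);
	return equal ^ !(invert & flag);
}

/* Masked comparison: true when both addresses agree under the mask. */
static bool ipv4_masked_equal(uint32_t kaddr, uint32_t uaddr, uint32_t umask)
{
	return ((kaddr ^ uaddr) & umask) == 0;
}
```

XOR and AND work bit by bit, so the masked comparison gives the same answer in
host or network byte order as long as all three values use the same one.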



* Re: macvtap: Limit packet queue length
From: Chris Wright @ 2010-07-22  7:47 UTC
  To: Herbert Xu
  Cc: David S. Miller, netdev, Arnd Bergmann, Mark Wagner, Chris Wright
In-Reply-To: <20100722074431.GA26744@gondor.apana.org.au>

* Herbert Xu (herbert@gondor.hengli.com.au) wrote:
> On Thu, Jul 22, 2010 at 02:41:57PM +0800, Herbert Xu wrote:
> > Hi:
> > 
> > macvtap: Limit packet queue length
> 
> Chris has informed me that he's already tried a similar patch
> and it only makes the problem worse :)
> 
> The issue is that the macvtap TX queue length defaults to zero.
> 
> So here is an updated patch which addresses this:
> 
> macvtap: Limit packet queue length
> 
> Mark Wagner reported OOM symptoms when sending UDP traffic over
> a macvtap link to a kvm receiver.
> 
> This appears to be caused by the fact that macvtap packet queues
> are unlimited in length.  This means that if the receiver can't
> keep up with the rate of flow, then we will hit OOM. Of course
> it gets worse if the OOM killer then decides to kill the receiver.
> 
> This patch imposes a cap on the packet queue length, in the same
> way as the tuntap driver, using the device TX queue length.
> 
> Please note that macvtap currently has no way of giving congestion
> notification. That means the software device TX queue cannot be
> used and packets will always be dropped once the macvtap driver
> queue fills up.
> 
> This shouldn't be a great problem for the scenario where macvtap
> is used to feed a kvm receiver, as the traffic is most likely
> external in origin so congestion notification can't be applied
> anyway.
> 
> Of course, if anybody decides to complain about guest-to-guest
> UDP packet loss down the track, then we may have to revisit this.
> 
> Incidentally, this patch also fixes a real memory leak when
> macvtap_get_queue fails.
> 
> Chris Wright noticed that for this patch to work, we need a
> non-zero TX queue length.  This patch includes his work to change
> the default macvtap TX queue length to 500.
> 
> Reported-by: Mark Wagner <mwagner@redhat.com>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Acked-by: Chris Wright <chrisw@sous-sol.org>

Thanks Herbert.
-chris


* Re: macvtap: Limit packet queue length
From: Herbert Xu @ 2010-07-22  7:44 UTC
  To: David S. Miller, netdev, Arnd Bergmann; +Cc: Mark Wagner, Chris Wright
In-Reply-To: <20100722064157.GA25913@gondor.apana.org.au>

On Thu, Jul 22, 2010 at 02:41:57PM +0800, Herbert Xu wrote:
> Hi:
> 
> macvtap: Limit packet queue length

Chris has informed me that he's already tried a similar patch
and it only makes the problem worse :)

The issue is that the macvtap TX queue length defaults to zero.

So here is an updated patch which addresses this:

macvtap: Limit packet queue length

Mark Wagner reported OOM symptoms when sending UDP traffic over
a macvtap link to a kvm receiver.

This appears to be caused by the fact that macvtap packet queues
are unlimited in length.  This means that if the receiver can't
keep up with the rate of flow, then we will hit OOM. Of course
it gets worse if the OOM killer then decides to kill the receiver.

This patch imposes a cap on the packet queue length, in the same
way as the tuntap driver, using the device TX queue length.

Please note that macvtap currently has no way of giving congestion
notification. That means the software device TX queue cannot be
used and packets will always be dropped once the macvtap driver
queue fills up.

This shouldn't be a great problem for the scenario where macvtap
is used to feed a kvm receiver, as the traffic is most likely
external in origin so congestion notification can't be applied
anyway.

Of course, if anybody decides to complain about guest-to-guest
UDP packet loss down the track, then we may have to revisit this.

Incidentally, this patch also fixes a real memory leak when
macvtap_get_queue fails.

Chris Wright noticed that for this patch to work, we need a
non-zero TX queue length.  This patch includes his work to change
the default macvtap TX queue length to 500.

Reported-by: Mark Wagner <mwagner@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 87e8d4c..f15fe2c 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -499,7 +499,7 @@ static const struct net_device_ops macvlan_netdev_ops = {
 	.ndo_validate_addr	= eth_validate_addr,
 };

-static void macvlan_setup(struct net_device *dev)
+void macvlan_common_setup(struct net_device *dev)
 {
 	ether_setup(dev);

@@ -508,6 +508,12 @@ static void macvlan_setup(struct net_device *dev)
 	dev->destructor		= free_netdev;
 	dev->header_ops		= &macvlan_hard_header_ops,
 	dev->ethtool_ops	= &macvlan_ethtool_ops;
+}
+EXPORT_SYMBOL_GPL(macvlan_common_setup);
+
+static void macvlan_setup(struct net_device *dev)
+{
+	macvlan_common_setup(dev);
 	dev->tx_queue_len	= 0;
 }

@@ -705,7 +711,6 @@ int macvlan_link_register(struct rtnl_link_ops *ops)
 	/* common fields */
 	ops->priv_size		= sizeof(struct macvlan_dev);
 	ops->get_tx_queues	= macvlan_get_tx_queues;
-	ops->setup		= macvlan_setup;
 	ops->validate		= macvlan_validate;
 	ops->maxtype		= IFLA_MACVLAN_MAX;
 	ops->policy		= macvlan_policy;
@@ -719,6 +724,7 @@ EXPORT_SYMBOL_GPL(macvlan_link_register);

 static struct rtnl_link_ops macvlan_link_ops = {
 	.kind		= "macvlan",
+	.setup		= macvlan_setup,
 	.newlink	= macvlan_newlink,
 	.dellink	= macvlan_dellink,
 };
diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index a8a94e2..ff02b83 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -180,11 +180,18 @@ static int macvtap_forward(struct net_device *dev, struct sk_buff *skb)
 {
 	struct macvtap_queue *q = macvtap_get_queue(dev, skb);
 	if (!q)
-		return -ENOLINK;
+		goto drop;
+
+	if (skb_queue_len(&q->sk.sk_receive_queue) >= dev->tx_queue_len)
+		goto drop;

 	skb_queue_tail(&q->sk.sk_receive_queue, skb);
 	wake_up_interruptible_poll(sk_sleep(&q->sk), POLLIN | POLLRDNORM | POLLRDBAND);
-	return 0;
+	return NET_RX_SUCCESS;
+
+drop:
+	kfree_skb(skb);
+	return NET_RX_DROP;
 }

 /*
@@ -235,8 +242,15 @@ static void macvtap_dellink(struct net_device *dev,
 	macvlan_dellink(dev, head);
 }

+static void macvtap_setup(struct net_device *dev)
+{
+	macvlan_common_setup(dev);
+	dev->tx_queue_len = TUN_READQ_SIZE;
+}
+
 static struct rtnl_link_ops macvtap_link_ops __read_mostly = {
 	.kind		= "macvtap",
+	.setup		= macvtap_setup,
 	.newlink	= macvtap_newlink,
 	.dellink	= macvtap_dellink,
 };
diff --git a/include/linux/if_macvlan.h b/include/linux/if_macvlan.h
index 9ea047a..1ffaeff 100644
--- a/include/linux/if_macvlan.h
+++ b/include/linux/if_macvlan.h
@@ -67,6 +67,8 @@ static inline void macvlan_count_rx(const struct macvlan_dev *vlan,
 	}
 }

+extern void macvlan_common_setup(struct net_device *dev);
+
 extern int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
 				  struct nlattr *tb[], struct nlattr *data[],
 				  int (*receive)(struct sk_buff *skb),

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related

* [patch v2.8 4/4] [patch v2.2 4/4] [PATCH v2.1 4/4] libxt_ipvs: user-space lib for netfilter matcher xt_ipvs
From: Simon Horman @ 2010-07-22  7:35 UTC (permalink / raw)
  To: lvs-devel, netdev, linux-kernel, netfilter, netfilter-devel
  Cc: Malcolm Turnbull, Mark Brooks, Wensong Zhang, Julius Volz,
	Patrick McHardy, David S. Miller, Hannes Eder, Jan Engelhardt
In-Reply-To: <20100722073547.504156161@vergenet.net>

[-- Attachment #1: libxt_ipvs-user-space-lib-for-netfilter-matcher-xt_ipvs.patch --]
[-- Type: text/plain, Size: 14174 bytes --]

From:	Hannes Eder <heder@google.com>

The user-space library for the netfilter matcher xt_ipvs.

[ trivial up-port by Simon Horman <horms@verge.net.au> ]
Signed-off-by: Hannes Eder <heder@google.com>
Acked-by: Simon Horman <horms@verge.net.au>

 configure.ac                      |   10 -
 extensions/libxt_ipvs.c           |  365 +++++++++++++++++++++++++++++++++++++
 extensions/libxt_ipvs.man         |   24 ++
 include/linux/netfilter/xt_ipvs.h |   27 +++
 4 files changed, 424 insertions(+), 2 deletions(-)
 create mode 100644 extensions/libxt_ipvs.c
 create mode 100644 extensions/libxt_ipvs.man
 create mode 100644 include/linux/netfilter/xt_ipvs.h

v2.8
* There is no need to define _XT_IPVS_H as a value
As suggested by Jan Engelhardt
* Use an enum instead of #ifdefs for flags and masks

v2.7
* Update struct xt_ipvs_mtinfo to use __u8 instead of __u16 for the l4proto
  and fwd_method to reflect the same change to the kernel copy
  of struct xt_ipvs_mtinfo.

v2.1
* Trivial up-port
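
As a rough illustration of how the flag enum is used to reject a
repeated option, here is a minimal standalone sketch.  The enum values
are copied from the xt_ipvs.h hunk of this patch; the helper name
ipvs_opt_repeated is hypothetical and only mirrors the
"*flags & op & XT_IPVS_ONCE_MASK" test in ipvs_mt_parse():

```c
#include <assert.h>

/* Flag values copied from the xt_ipvs.h hunk of this patch. */
enum {
	XT_IPVS_IPVS_PROPERTY =	1 << 0, /* all other options imply this one */
	XT_IPVS_PROTO =		1 << 1,
	XT_IPVS_VADDR =		1 << 2,
	XT_IPVS_VPORT =		1 << 3,
	XT_IPVS_DIR =		1 << 4,
	XT_IPVS_METHOD =	1 << 5,
	XT_IPVS_VPORTCTL =	1 << 6,
	XT_IPVS_MASK =		(1 << 7) - 1,
	XT_IPVS_ONCE_MASK =	XT_IPVS_MASK & ~XT_IPVS_IPVS_PROPERTY
};

/*
 * Hypothetical helper: returns nonzero if option 'op' is already set in
 * 'flags' and belongs to the set of options that may be given only once.
 */
static int ipvs_opt_repeated(unsigned int flags, unsigned int op)
{
	assert((op & ~(unsigned int)XT_IPVS_MASK) == 0);
	return (flags & op & XT_IPVS_ONCE_MASK) != 0;
}
```

Because XT_IPVS_IPVS_PROPERTY is excluded from XT_IPVS_ONCE_MASK, a
repeated --ipvs is not flagged by this check, while any other option
given twice is.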

Index: iptables/configure.ac
===================================================================
--- iptables.orig/configure.ac	2010-07-22 10:41:05.000000000 +0900
+++ iptables/configure.ac	2010-07-22 10:41:13.000000000 +0900
@@ -52,12 +52,18 @@ AC_ARG_WITH([pkgconfigdir], AS_HELP_STRI
 	[Path to the pkgconfig directory [[LIBDIR/pkgconfig]]]),
 	[pkgconfigdir="$withval"], [pkgconfigdir='${libdir}/pkgconfig'])
 
-AC_CHECK_HEADER([linux/dccp.h])
-
 blacklist_modules="";
+
+AC_CHECK_HEADER([linux/dccp.h])
 if test "$ac_cv_header_linux_dccp_h" != "yes"; then
 	blacklist_modules="$blacklist_modules dccp";
 fi;
+
+AC_CHECK_HEADER([linux/ip_vs.h])
+if test "$ac_cv_header_linux_ip_vs_h" != "yes"; then
+	blacklist_modules="$blacklist_modules ipvs";
+fi;
+
 AC_SUBST([blacklist_modules])
 
 AM_CONDITIONAL([ENABLE_STATIC], [test "$enable_static" = "yes"])
Index: iptables/extensions/libxt_ipvs.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ iptables/extensions/libxt_ipvs.c	2010-07-22 10:41:13.000000000 +0900
@@ -0,0 +1,365 @@
+/*
+ * Shared library add-on to iptables to add IPVS matching.
+ *
+ * Detailed doc is in the kernel module source net/netfilter/xt_ipvs.c
+ *
+ * Author: Hannes Eder <heder@google.com>
+ */
+#include <sys/types.h>
+#include <assert.h>
+#include <ctype.h>
+#include <errno.h>
+#include <getopt.h>
+#include <netdb.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <xtables.h>
+#include <linux/ip_vs.h>
+#include <linux/netfilter/xt_ipvs.h>
+
+static const struct option ipvs_mt_opts[] = {
+	{ .name = "ipvs",     .has_arg = false, .val = '0' },
+	{ .name = "vproto",   .has_arg = true,  .val = '1' },
+	{ .name = "vaddr",    .has_arg = true,  .val = '2' },
+	{ .name = "vport",    .has_arg = true,  .val = '3' },
+	{ .name = "vdir",     .has_arg = true,  .val = '4' },
+	{ .name = "vmethod",  .has_arg = true,  .val = '5' },
+	{ .name = "vportctl", .has_arg = true,  .val = '6' },
+	{ .name = NULL }
+};
+
+static void ipvs_mt_help(void)
+{
+	printf(
+"IPVS match options:\n"
+"[!] --ipvs                      packet belongs to an IPVS connection\n"
+"\n"
+"Any of the following options implies --ipvs (even negated)\n"
+"[!] --vproto protocol           VIP protocol to match; by number or name,\n"
+"                                e.g. \"tcp\"\n"
+"[!] --vaddr address[/mask]      VIP address to match\n"
+"[!] --vport port                VIP port to match; by number or name,\n"
+"                                e.g. \"http\"\n"
+"    --vdir {ORIGINAL|REPLY}     flow direction of packet\n"
+"[!] --vmethod {GATE|IPIP|MASQ}  IPVS forwarding method used\n"
+"[!] --vportctl port             VIP port of the controlling connection to\n"
+"                                match, e.g. 21 for FTP\n"
+		);
+}
+
+static void ipvs_mt_parse_addr_and_mask(const char *arg,
+					union nf_inet_addr *address,
+					union nf_inet_addr *mask,
+					unsigned int family)
+{
+	struct in_addr *addr = NULL;
+	struct in6_addr *addr6 = NULL;
+	unsigned int naddrs = 0;
+
+	if (family == NFPROTO_IPV4) {
+		xtables_ipparse_any(arg, &addr, &mask->in, &naddrs);
+		if (naddrs > 1)
+			xtables_error(PARAMETER_PROBLEM,
+				      "multiple IP addresses not allowed");
+		if (naddrs == 1)
+			memcpy(&address->in, addr, sizeof(*addr));
+	} else if (family == NFPROTO_IPV6) {
+		xtables_ip6parse_any(arg, &addr6, &mask->in6, &naddrs);
+		if (naddrs > 1)
+			xtables_error(PARAMETER_PROBLEM,
+				      "multiple IP addresses not allowed");
+		if (naddrs == 1)
+			memcpy(&address->in6, addr6, sizeof(*addr6));
+	} else {
+		/* Hu? */
+		assert(false);
+	}
+}
+
+/* Function which parses command options; returns true if it ate an option */
+static int ipvs_mt_parse(int c, char **argv, int invert, unsigned int *flags,
+			 const void *entry, struct xt_entry_match **match,
+			 unsigned int family)
+{
+	struct xt_ipvs_mtinfo *data = (void *)(*match)->data;
+	char *p = NULL;
+	u_int8_t op = 0;
+
+	if ('0' <= c && c <= '6') {
+		static const int ops[] = {
+			XT_IPVS_IPVS_PROPERTY,
+			XT_IPVS_PROTO,
+			XT_IPVS_VADDR,
+			XT_IPVS_VPORT,
+			XT_IPVS_DIR,
+			XT_IPVS_METHOD,
+			XT_IPVS_VPORTCTL
+		};
+		op = ops[c - '0'];
+	} else
+		return 0;
+
+	if (*flags & op & XT_IPVS_ONCE_MASK)
+		goto multiple_use;
+
+	switch (c) {
+	case '0': /* --ipvs */
+		/* Nothing to do here. */
+		break;
+
+	case '1': /* --vproto */
+		/* Canonicalize into lower case */
+		for (p = optarg; *p != '\0'; ++p)
+			*p = tolower(*p);
+
+		data->l4proto = xtables_parse_protocol(optarg);
+		break;
+
+	case '2': /* --vaddr */
+		ipvs_mt_parse_addr_and_mask(optarg, &data->vaddr,
+					    &data->vmask, family);
+		break;
+
+	case '3': /* --vport */
+		data->vport = htons(xtables_parse_port(optarg, "tcp"));
+		break;
+
+	case '4': /* --vdir */
+		xtables_param_act(XTF_NO_INVERT, "ipvs", "--vdir", invert);
+		if (strcasecmp(optarg, "ORIGINAL") == 0) {
+			data->bitmask |= XT_IPVS_DIR;
+			data->invert   &= ~XT_IPVS_DIR;
+		} else if (strcasecmp(optarg, "REPLY") == 0) {
+			data->bitmask |= XT_IPVS_DIR;
+			data->invert  |= XT_IPVS_DIR;
+		} else {
+			xtables_param_act(XTF_BAD_VALUE,
+					  "ipvs", "--vdir", optarg);
+		}
+		break;
+
+	case '5': /* --vmethod */
+		if (strcasecmp(optarg, "GATE") == 0)
+			data->fwd_method = IP_VS_CONN_F_DROUTE;
+		else if (strcasecmp(optarg, "IPIP") == 0)
+			data->fwd_method = IP_VS_CONN_F_TUNNEL;
+		else if (strcasecmp(optarg, "MASQ") == 0)
+			data->fwd_method = IP_VS_CONN_F_MASQ;
+		else
+			xtables_param_act(XTF_BAD_VALUE,
+					  "ipvs", "--vmethod", optarg);
+		break;
+
+	case '6': /* --vportctl */
+		data->vportctl = htons(xtables_parse_port(optarg, "tcp"));
+		break;
+
+	default:
+		/* Hu? How did we come here? */
+		assert(false);
+		return 0;
+	}
+
+	if (op & XT_IPVS_ONCE_MASK) {
+		if (data->invert & XT_IPVS_IPVS_PROPERTY)
+			xtables_error(PARAMETER_PROBLEM,
+				      "! --ipvs cannot be together with"
+				      " other options");
+		data->bitmask |= XT_IPVS_IPVS_PROPERTY;
+	}
+
+	data->bitmask |= op;
+	if (invert)
+		data->invert |= op;
+	*flags |= op;
+	return 1;
+
+multiple_use:
+	xtables_error(PARAMETER_PROBLEM,
+		      "multiple use of the same IPVS option is not allowed");
+}
+
+static int ipvs_mt4_parse(int c, char **argv, int invert, unsigned int *flags,
+			  const void *entry, struct xt_entry_match **match)
+{
+	return ipvs_mt_parse(c, argv, invert, flags, entry, match,
+			     NFPROTO_IPV4);
+}
+
+static int ipvs_mt6_parse(int c, char **argv, int invert, unsigned int *flags,
+			  const void *entry, struct xt_entry_match **match)
+{
+	return ipvs_mt_parse(c, argv, invert, flags, entry, match,
+			     NFPROTO_IPV6);
+}
+
+static void ipvs_mt_check(unsigned int flags)
+{
+	if (flags == 0)
+		xtables_error(PARAMETER_PROBLEM,
+			      "IPVS: At least one option is required");
+}
+
+/* Shamelessly copied from libxt_conntrack.c */
+static void ipvs_mt_dump_addr(const union nf_inet_addr *addr,
+			      const union nf_inet_addr *mask,
+			      unsigned int family, bool numeric)
+{
+	char buf[BUFSIZ];
+
+	if (family == NFPROTO_IPV4) {
+		if (!numeric && addr->ip == 0) {
+			printf("anywhere ");
+			return;
+		}
+		if (numeric)
+			strcpy(buf, xtables_ipaddr_to_numeric(&addr->in));
+		else
+			strcpy(buf, xtables_ipaddr_to_anyname(&addr->in));
+		strcat(buf, xtables_ipmask_to_numeric(&mask->in));
+		printf("%s ", buf);
+	} else if (family == NFPROTO_IPV6) {
+		if (!numeric && addr->ip6[0] == 0 && addr->ip6[1] == 0 &&
+		    addr->ip6[2] == 0 && addr->ip6[3] == 0) {
+			printf("anywhere ");
+			return;
+		}
+		if (numeric)
+			strcpy(buf, xtables_ip6addr_to_numeric(&addr->in6));
+		else
+			strcpy(buf, xtables_ip6addr_to_anyname(&addr->in6));
+		strcat(buf, xtables_ip6mask_to_numeric(&mask->in6));
+		printf("%s ", buf);
+	}
+}
+
+static void ipvs_mt_dump(const void *ip, const struct xt_ipvs_mtinfo *data,
+			 unsigned int family, bool numeric, const char *prefix)
+{
+	if (data->bitmask == XT_IPVS_IPVS_PROPERTY) {
+		if (data->invert & XT_IPVS_IPVS_PROPERTY)
+			printf("! ");
+		printf("%sipvs ", prefix);
+	}
+
+	if (data->bitmask & XT_IPVS_PROTO) {
+		if (data->invert & XT_IPVS_PROTO)
+			printf("! ");
+		printf("%sproto %u ", prefix, data->l4proto);
+	}
+
+	if (data->bitmask & XT_IPVS_VADDR) {
+		if (data->invert & XT_IPVS_VADDR)
+			printf("! ");
+
+		printf("%svaddr ", prefix);
+		ipvs_mt_dump_addr(&data->vaddr, &data->vmask, family, numeric);
+	}
+
+	if (data->bitmask & XT_IPVS_VPORT) {
+		if (data->invert & XT_IPVS_VPORT)
+			printf("! ");
+
+		printf("%svport %u ", prefix, ntohs(data->vport));
+	}
+
+	if (data->bitmask & XT_IPVS_DIR) {
+		if (data->invert & XT_IPVS_DIR)
+			printf("%svdir REPLY ", prefix);
+		else
+			printf("%svdir ORIGINAL ", prefix);
+	}
+
+	if (data->bitmask & XT_IPVS_METHOD) {
+		if (data->invert & XT_IPVS_METHOD)
+			printf("! ");
+
+		printf("%svmethod ", prefix);
+		switch (data->fwd_method) {
+		case IP_VS_CONN_F_DROUTE:
+			printf("GATE ");
+			break;
+		case IP_VS_CONN_F_TUNNEL:
+			printf("IPIP ");
+			break;
+		case IP_VS_CONN_F_MASQ:
+			printf("MASQ ");
+			break;
+		default:
+			/* Hu? */
+			printf("UNKNOWN ");
+			break;
+		}
+	}
+
+	if (data->bitmask & XT_IPVS_VPORTCTL) {
+		if (data->invert & XT_IPVS_VPORTCTL)
+			printf("! ");
+
+		printf("%svportctl %u ", prefix, ntohs(data->vportctl));
+	}
+}
+
+static void ipvs_mt4_print(const void *ip, const struct xt_entry_match *match,
+			   int numeric)
+{
+	const struct xt_ipvs_mtinfo *data = (const void *)match->data;
+	ipvs_mt_dump(ip, data, NFPROTO_IPV4, numeric, "");
+}
+
+static void ipvs_mt6_print(const void *ip, const struct xt_entry_match *match,
+			   int numeric)
+{
+	const struct xt_ipvs_mtinfo *data = (const void *)match->data;
+	ipvs_mt_dump(ip, data, NFPROTO_IPV6, numeric, "");
+}
+
+static void ipvs_mt4_save(const void *ip, const struct xt_entry_match *match)
+{
+	const struct xt_ipvs_mtinfo *data = (const void *)match->data;
+	ipvs_mt_dump(ip, data, NFPROTO_IPV4, true, "--");
+}
+
+static void ipvs_mt6_save(const void *ip, const struct xt_entry_match *match)
+{
+	const struct xt_ipvs_mtinfo *data = (const void *)match->data;
+	ipvs_mt_dump(ip, data, NFPROTO_IPV6, true, "--");
+}
+
+static struct xtables_match ipvs_matches_reg[] = {
+	{
+		.version       = XTABLES_VERSION,
+		.name          = "ipvs",
+		.revision      = 0,
+		.family        = NFPROTO_IPV4,
+		.size          = XT_ALIGN(sizeof(struct xt_ipvs_mtinfo)),
+		.userspacesize = XT_ALIGN(sizeof(struct xt_ipvs_mtinfo)),
+		.help          = ipvs_mt_help,
+		.parse         = ipvs_mt4_parse,
+		.final_check   = ipvs_mt_check,
+		.print         = ipvs_mt4_print,
+		.save          = ipvs_mt4_save,
+		.extra_opts    = ipvs_mt_opts,
+	},
+	{
+		.version       = XTABLES_VERSION,
+		.name          = "ipvs",
+		.revision      = 0,
+		.family        = NFPROTO_IPV6,
+		.size          = XT_ALIGN(sizeof(struct xt_ipvs_mtinfo)),
+		.userspacesize = XT_ALIGN(sizeof(struct xt_ipvs_mtinfo)),
+		.help          = ipvs_mt_help,
+		.parse         = ipvs_mt6_parse,
+		.final_check   = ipvs_mt_check,
+		.print         = ipvs_mt6_print,
+		.save          = ipvs_mt6_save,
+		.extra_opts    = ipvs_mt_opts,
+	},
+};
+
+void _init(void)
+{
+	xtables_register_matches(ipvs_matches_reg,
+				 ARRAY_SIZE(ipvs_matches_reg));
+}
Index: iptables/extensions/libxt_ipvs.man
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ iptables/extensions/libxt_ipvs.man	2010-07-22 10:41:13.000000000 +0900
@@ -0,0 +1,24 @@
+Match IPVS connection properties.
+.TP
+[\fB!\fR] \fB\-\-ipvs\fP
+packet belongs to an IPVS connection
+.TP
+Any of the following options implies \-\-ipvs (even negated)
+.TP
+[\fB!\fR] \fB\-\-vproto\fP \fIprotocol\fP
+VIP protocol to match; by number or name, e.g. "tcp"
+.TP
+[\fB!\fR] \fB\-\-vaddr\fP \fIaddress\fP[\fB/\fP\fImask\fP]
+VIP address to match
+.TP
+[\fB!\fR] \fB\-\-vport\fP \fIport\fP
+VIP port to match; by number or name, e.g. "http"
+.TP
+\fB\-\-vdir\fP {\fBORIGINAL\fP|\fBREPLY\fP}
+flow direction of packet
+.TP
+[\fB!\fR] \fB\-\-vmethod\fP {\fBGATE\fP|\fBIPIP\fP|\fBMASQ\fP}
+IPVS forwarding method used
+.TP
+[\fB!\fR] \fB\-\-vportctl\fP \fIport\fP
+VIP port of the controlling connection to match, e.g. 21 for FTP
Index: iptables/include/linux/netfilter/xt_ipvs.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ iptables/include/linux/netfilter/xt_ipvs.h	2010-07-22 10:39:44.000000000 +0900
@@ -0,0 +1,27 @@
+#ifndef _XT_IPVS_H
+#define _XT_IPVS_H
+
+enum {
+	XT_IPVS_IPVS_PROPERTY =	1 << 0, /* all other options imply this one */
+	XT_IPVS_PROTO =		1 << 1,
+	XT_IPVS_VADDR =		1 << 2,
+	XT_IPVS_VPORT =		1 << 3,
+	XT_IPVS_DIR =		1 << 4,
+	XT_IPVS_METHOD =	1 << 5,
+	XT_IPVS_VPORTCTL =	1 << 6,
+	XT_IPVS_MASK =		(1 << 7) - 1,
+	XT_IPVS_ONCE_MASK =	XT_IPVS_MASK & ~XT_IPVS_IPVS_PROPERTY
+};
+
+struct xt_ipvs_mtinfo {
+	union nf_inet_addr	vaddr, vmask;
+	__be16			vport;
+	__u8			l4proto;
+	__u8			fwd_method;
+	__be16			vportctl;
+
+	__u8			invert;
+	__u8			bitmask;
+};
+
+#endif /* _XT_IPVS_H */


^ permalink raw reply

* [patch v2.8 3/4] IPVS: make FTP work with full NAT support
From: Simon Horman @ 2010-07-22  7:35 UTC (permalink / raw)
  To: lvs-devel, netdev, linux-kernel, netfilter, netfilter-devel
  Cc: Malcolm Turnbull, Mark Brooks, Wensong Zhang, Julius Volz,
	Patrick McHardy, David S. Miller, Hannes Eder, Jan Engelhardt
In-Reply-To: <20100722073547.504156161@vergenet.net>

[-- Attachment #1: IPVS-make-FTP-work-with-full-NAT-support.patch --]
[-- Type: text/plain, Size: 12646 bytes --]

From:	Hannes Eder <heder@google.com>

Use nf_conntrack/nf_nat code to do the packet mangling and the TCP
sequence adjusting.  The function 'ip_vs_skb_replace' is now dead
code, so it is removed.

To SNAT FTP, use something like:

% iptables -t nat -A POSTROUTING -m ipvs --vaddr 192.168.100.30/32 \
> --vport 21 -j SNAT --to-source 192.168.10.10

and for the data connections in passive mode:

% iptables -t nat -A POSTROUTING -m ipvs --vaddr 192.168.100.30/32 \
> --vportctl 21 -j SNAT --to-source 192.168.10.10

Using '-m state --state RELATED' would also work.

Make sure the kernel modules ip_vs_ftp, nf_conntrack_ftp, and
nf_nat_ftp are loaded.
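
To show why sequence adjusting is needed at all, here is a minimal
userspace sketch of the payload rewrite.  It assumes the usual
"a,b,c,d,p1,p2" address/port tuple of a 227 reply; the helper names
pasv_format and pasv_len_delta are hypothetical and do not appear in
the patch:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format an IPv4 address (four octets) and a
 * host-order port as the "a,b,c,d,p1,p2" tuple of a 227 reply.
 * Returns the string length, or -1 if 'buf' is too small.
 */
static int pasv_format(const unsigned char ip[4], unsigned int port,
		       char *buf, size_t size)
{
	int n;

	assert(port <= 65535);
	n = snprintf(buf, size, "%u,%u,%u,%u,%u,%u",
		     (unsigned int)ip[0], (unsigned int)ip[1],
		     (unsigned int)ip[2], (unsigned int)ip[3],
		     port >> 8, port & 0xff);
	if (n < 0 || (size_t)n >= size)
		return -1;
	return n;
}

/* Change in payload length when 'old' is replaced by 'repl'. */
static int pasv_len_delta(const char *old, const char *repl)
{
	return (int)strlen(repl) - (int)strlen(old);
}
```

Whenever the delta is non-zero, all later TCP sequence numbers on the
connection are off by that amount, so the adjustment has to be made
in exactly one place.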

[ up-port and minor fixes by Simon Horman <horms@verge.net.au> ]
Signed-off-by: Hannes Eder <heder@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>

--- 

 include/net/ip_vs.h             |    2 
 net/netfilter/ipvs/Kconfig      |    2 
 net/netfilter/ipvs/ip_vs_app.c  |   43 ---------
 net/netfilter/ipvs/ip_vs_core.c |    1 
 net/netfilter/ipvs/ip_vs_ftp.c  |  174 ++++++++++++++++++++++++++++++++++++---
 5 files changed, 164 insertions(+), 58 deletions(-)

v2.6
* pointer arguments for %pI4

v2.5
* Use nf_ct_is_untracked(ct) instead of nf_ct_is_untracked(),
  the latter is blatantly incorrect
* Return 0 (and thus drop the packet) if mangling wasn't attempted

v2.4
As suggested by Patrick McHardy
* Use nf_conntrack_untracked() instead of &nf_conntrack_untracked
* Fix ip_vs_ftp_out logic
  - Don't call nf_nat_mangle_tcp_packet() unless ct is valid and tracked
  - Only call ip_vs_expect_related() if nf_nat_mangle_tcp_packet()
    succeeds
  - Note that packets are dropped if mangling fails
Other
* Drop unrelated cosmetic change to sizing of buf in ip_vs_ftp_out()

v2.3
* Up-port
* Drop buf_len = snprintf() change - it's a separate, cosmetic fix
As suggested by Patrick McHardy
* Use %pI4 instead of NIPQUAD

v2.2
* No change

v2.1
* Up-port

Index: nf-next-2.6/include/net/ip_vs.h
===================================================================
--- nf-next-2.6.orig/include/net/ip_vs.h	2010-07-11 17:30:19.000000000 +0900
+++ nf-next-2.6/include/net/ip_vs.h	2010-07-11 17:33:33.000000000 +0900
@@ -736,8 +736,6 @@ extern void ip_vs_app_inc_put(struct ip_
 
 extern int ip_vs_app_pkt_out(struct ip_vs_conn *, struct sk_buff *skb);
 extern int ip_vs_app_pkt_in(struct ip_vs_conn *, struct sk_buff *skb);
-extern int ip_vs_skb_replace(struct sk_buff *skb, gfp_t pri,
-			     char *o_buf, int o_len, char *n_buf, int n_len);
 extern int ip_vs_app_init(void);
 extern void ip_vs_app_cleanup(void);
 
Index: nf-next-2.6/net/netfilter/ipvs/Kconfig
===================================================================
--- nf-next-2.6.orig/net/netfilter/ipvs/Kconfig	2010-07-11 17:33:06.000000000 +0900
+++ nf-next-2.6/net/netfilter/ipvs/Kconfig	2010-07-11 17:33:33.000000000 +0900
@@ -235,7 +235,7 @@ comment 'IPVS application helper'
 
 config	IP_VS_FTP
   	tristate "FTP protocol helper"
-        depends on IP_VS_PROTO_TCP
+        depends on IP_VS_PROTO_TCP && NF_NAT
 	---help---
 	  FTP is a protocol that transfers IP address and/or port number in
 	  the payload. In the virtual server via Network Address Translation,
Index: nf-next-2.6/net/netfilter/ipvs/ip_vs_app.c
===================================================================
--- nf-next-2.6.orig/net/netfilter/ipvs/ip_vs_app.c	2010-07-11 17:30:19.000000000 +0900
+++ nf-next-2.6/net/netfilter/ipvs/ip_vs_app.c	2010-07-11 17:33:33.000000000 +0900
@@ -569,49 +569,6 @@ static const struct file_operations ip_v
 };
 #endif
 
-
-/*
- *	Replace a segment of data with a new segment
- */
-int ip_vs_skb_replace(struct sk_buff *skb, gfp_t pri,
-		      char *o_buf, int o_len, char *n_buf, int n_len)
-{
-	int diff;
-	int o_offset;
-	int o_left;
-
-	EnterFunction(9);
-
-	diff = n_len - o_len;
-	o_offset = o_buf - (char *)skb->data;
-	/* The length of left data after o_buf+o_len in the skb data */
-	o_left = skb->len - (o_offset + o_len);
-
-	if (diff <= 0) {
-		memmove(o_buf + n_len, o_buf + o_len, o_left);
-		memcpy(o_buf, n_buf, n_len);
-		skb_trim(skb, skb->len + diff);
-	} else if (diff <= skb_tailroom(skb)) {
-		skb_put(skb, diff);
-		memmove(o_buf + n_len, o_buf + o_len, o_left);
-		memcpy(o_buf, n_buf, n_len);
-	} else {
-		if (pskb_expand_head(skb, skb_headroom(skb), diff, pri))
-			return -ENOMEM;
-		skb_put(skb, diff);
-		memmove(skb->data + o_offset + n_len,
-			skb->data + o_offset + o_len, o_left);
-		skb_copy_to_linear_data_offset(skb, o_offset, n_buf, n_len);
-	}
-
-	/* must update the iph total length here */
-	ip_hdr(skb)->tot_len = htons(skb->len);
-
-	LeaveFunction(9);
-	return 0;
-}
-
-
 int __init ip_vs_app_init(void)
 {
 	/* we will replace it with proc_net_ipvs_create() soon */
Index: nf-next-2.6/net/netfilter/ipvs/ip_vs_core.c
===================================================================
--- nf-next-2.6.orig/net/netfilter/ipvs/ip_vs_core.c	2010-07-11 17:33:06.000000000 +0900
+++ nf-next-2.6/net/netfilter/ipvs/ip_vs_core.c	2010-07-11 17:33:33.000000000 +0900
@@ -54,7 +54,6 @@
 
 EXPORT_SYMBOL(register_ip_vs_scheduler);
 EXPORT_SYMBOL(unregister_ip_vs_scheduler);
-EXPORT_SYMBOL(ip_vs_skb_replace);
 EXPORT_SYMBOL(ip_vs_proto_name);
 EXPORT_SYMBOL(ip_vs_conn_new);
 EXPORT_SYMBOL(ip_vs_conn_in_get);
Index: nf-next-2.6/net/netfilter/ipvs/ip_vs_ftp.c
===================================================================
--- nf-next-2.6.orig/net/netfilter/ipvs/ip_vs_ftp.c	2010-07-11 17:30:19.000000000 +0900
+++ nf-next-2.6/net/netfilter/ipvs/ip_vs_ftp.c	2010-07-11 18:01:58.000000000 +0900
@@ -20,6 +20,17 @@
  *
  * Author:	Wouter Gadeyne
  *
+ *
+ * Code for ip_vs_expect_related and ip_vs_expect_callback is taken from
+ * http://www.ssi.bg/~ja/nfct/:
+ *
+ * ip_vs_nfct.c:	Netfilter connection tracking support for IPVS
+ *
+ * Portions Copyright (C) 2001-2002
+ * Antefacto Ltd, 181 Parnell St, Dublin 1, Ireland.
+ *
+ * Portions Copyright (C) 2003-2008
+ * Julian Anastasov
  */
 
 #define KMSG_COMPONENT "IPVS"
@@ -32,6 +43,9 @@
 #include <linux/in.h>
 #include <linux/ip.h>
 #include <linux/netfilter.h>
+#include <net/netfilter/nf_conntrack.h>
+#include <net/netfilter/nf_conntrack_expect.h>
+#include <net/netfilter/nf_nat_helper.h>
 #include <linux/gfp.h>
 #include <net/protocol.h>
 #include <net/tcp.h>
@@ -43,6 +57,16 @@
 #define SERVER_STRING "227 Entering Passive Mode ("
 #define CLIENT_STRING "PORT "
 
+#define FMT_TUPLE	"%pI4:%u->%pI4:%u/%u"
+#define ARG_TUPLE(T)	&(T)->src.u3.ip, ntohs((T)->src.u.all), \
+			&(T)->dst.u3.ip, ntohs((T)->dst.u.all), \
+			(T)->dst.protonum
+
+#define FMT_CONN	"%pI4:%u->%pI4:%u->%pI4:%u/%u:%u"
+#define ARG_CONN(C)	&((C)->caddr.ip), ntohs((C)->cport), \
+			&((C)->vaddr.ip), ntohs((C)->vport), \
+			&((C)->daddr.ip), ntohs((C)->dport), \
+			(C)->protocol, (C)->state
 
 /*
  * List of ports (up to IP_VS_APP_MAX_PORTS) to be handled by helper
@@ -123,6 +147,119 @@ static int ip_vs_ftp_get_addrport(char *
 	return 1;
 }
 
+/*
+ * Called from init_conntrack() as expectfn handler.
+ */
+static void
+ip_vs_expect_callback(struct nf_conn *ct,
+		      struct nf_conntrack_expect *exp)
+{
+	struct nf_conntrack_tuple *orig, new_reply;
+	struct ip_vs_conn *cp;
+
+	if (exp->tuple.src.l3num != PF_INET)
+		return;
+
+	/*
+	 * We assume that no NF locks are held before this callback.
+	 * ip_vs_conn_out_get and ip_vs_conn_in_get should match their
+	 * expectations even if they use wildcard values, now we provide the
+	 * actual values from the newly created original conntrack direction.
+	 * The conntrack is confirmed when packet reaches IPVS hooks.
+	 */
+
+	/* RS->CLIENT */
+	orig = &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple;
+	cp = ip_vs_conn_out_get(exp->tuple.src.l3num, orig->dst.protonum,
+				&orig->src.u3, orig->src.u.tcp.port,
+				&orig->dst.u3, orig->dst.u.tcp.port);
+	if (cp) {
+		/* Change reply CLIENT->RS to CLIENT->VS */
+		new_reply = ct->tuplehash[IP_CT_DIR_REPLY].tuple;
+		IP_VS_DBG(7, "%s(): ct=%p, status=0x%lX, tuples=" FMT_TUPLE ", "
+			  FMT_TUPLE ", found inout cp=" FMT_CONN "\n",
+			  __func__, ct, ct->status,
+			  ARG_TUPLE(orig), ARG_TUPLE(&new_reply),
+			  ARG_CONN(cp));
+		new_reply.dst.u3 = cp->vaddr;
+		new_reply.dst.u.tcp.port = cp->vport;
+		IP_VS_DBG(7, "%s(): ct=%p, new tuples=" FMT_TUPLE ", " FMT_TUPLE
+			  ", inout cp=" FMT_CONN "\n",
+			  __func__, ct,
+			  ARG_TUPLE(orig), ARG_TUPLE(&new_reply),
+			  ARG_CONN(cp));
+		goto alter;
+	}
+
+	/* CLIENT->VS */
+	cp = ip_vs_conn_in_get(exp->tuple.src.l3num, orig->dst.protonum,
+			       &orig->src.u3, orig->src.u.tcp.port,
+			       &orig->dst.u3, orig->dst.u.tcp.port);
+	if (cp) {
+		/* Change reply VS->CLIENT to RS->CLIENT */
+		new_reply = ct->tuplehash[IP_CT_DIR_REPLY].tuple;
+		IP_VS_DBG(7, "%s(): ct=%p, status=0x%lX, tuples=" FMT_TUPLE ", "
+			  FMT_TUPLE ", found outin cp=" FMT_CONN "\n",
+			  __func__, ct, ct->status,
+			  ARG_TUPLE(orig), ARG_TUPLE(&new_reply),
+			  ARG_CONN(cp));
+		new_reply.src.u3 = cp->daddr;
+		new_reply.src.u.tcp.port = cp->dport;
+		IP_VS_DBG(7, "%s(): ct=%p, new tuples=" FMT_TUPLE ", "
+			  FMT_TUPLE ", outin cp=" FMT_CONN "\n",
+			  __func__, ct,
+			  ARG_TUPLE(orig), ARG_TUPLE(&new_reply),
+			  ARG_CONN(cp));
+		goto alter;
+	}
+
+	IP_VS_DBG(7, "%s(): ct=%p, status=0x%lX, tuple=" FMT_TUPLE
+		  " - unknown expect\n",
+		  __func__, ct, ct->status, ARG_TUPLE(orig));
+	return;
+
+alter:
+	/* Never alter conntrack for non-NAT conns */
+	if (IP_VS_FWD_METHOD(cp) == IP_VS_CONN_F_MASQ)
+		nf_conntrack_alter_reply(ct, &new_reply);
+	ip_vs_conn_put(cp);
+	return;
+}
+
+/*
+ * Create NF conntrack expectation with wildcard (optional) source port.
+ * Then the default callback function will alter the reply and will confirm
+ * the conntrack entry when the first packet comes.
+ */
+static void
+ip_vs_expect_related(struct sk_buff *skb, struct nf_conn *ct,
+		     struct ip_vs_conn *cp, u_int8_t proto,
+		     const __be16 *port, int from_rs)
+{
+	struct nf_conntrack_expect *exp;
+
+	BUG_ON(!ct || ct == &nf_conntrack_untracked);
+
+	exp = nf_ct_expect_alloc(ct);
+	if (!exp)
+		return;
+
+	if (from_rs)
+		nf_ct_expect_init(exp, NF_CT_EXPECT_CLASS_DEFAULT,
+				  nf_ct_l3num(ct), &cp->daddr, &cp->caddr,
+				  proto, port, &cp->cport);
+	else
+		nf_ct_expect_init(exp, NF_CT_EXPECT_CLASS_DEFAULT,
+				  nf_ct_l3num(ct), &cp->caddr, &cp->vaddr,
+				  proto, port, &cp->vport);
+
+	exp->expectfn = ip_vs_expect_callback;
+
+	IP_VS_DBG(7, "%s(): ct=%p, expect tuple=" FMT_TUPLE "\n",
+		  __func__, ct, ARG_TUPLE(&exp->tuple));
+	nf_ct_expect_related(exp);
+	nf_ct_expect_put(exp);
+}
 
 /*
  * Look at outgoing ftp packets to catch the response to a PASV command
@@ -149,7 +286,9 @@ static int ip_vs_ftp_out(struct ip_vs_ap
 	struct ip_vs_conn *n_cp;
 	char buf[24];		/* xxx.xxx.xxx.xxx,ppp,ppp\000 */
 	unsigned buf_len;
-	int ret;
+	int ret = 0;
+	enum ip_conntrack_info ctinfo;
+	struct nf_conn *ct;
 
 #ifdef CONFIG_IP_VS_IPV6
 	/* This application helper doesn't work with IPv6 yet,
@@ -219,19 +358,26 @@ static int ip_vs_ftp_out(struct ip_vs_ap
 
 		buf_len = strlen(buf);
 
+		ct = nf_ct_get(skb, &ctinfo);
+		if (ct && !nf_ct_is_untracked(ct)) {
+			/* If mangling fails this function will return 0
+			 * which will cause the packet to be dropped.
+			 * Mangling can only fail under memory pressure,
+			 * hopefully it will succeed on the retransmitted
+			 * packet.
+			 */
+			ret = nf_nat_mangle_tcp_packet(skb, ct, ctinfo,
+						       start-data, end-start,
+						       buf, buf_len);
+			if (ret)
+				ip_vs_expect_related(skb, ct, n_cp,
+						     IPPROTO_TCP, NULL, 0);
+		}
+
 		/*
-		 * Calculate required delta-offset to keep TCP happy
+		 * Not setting 'diff' is intentional, otherwise the sequence
+		 * would be adjusted twice.
 		 */
-		*diff = buf_len - (end-start);
-
-		if (*diff == 0) {
-			/* simply replace it with new passive address */
-			memcpy(start, buf, buf_len);
-			ret = 1;
-		} else {
-			ret = !ip_vs_skb_replace(skb, GFP_ATOMIC, start,
-					  end-start, buf, buf_len);
-		}
 
 		cp->app_data = NULL;
 		ip_vs_tcp_conn_listen(n_cp);
@@ -263,6 +409,7 @@ static int ip_vs_ftp_in(struct ip_vs_app
 	union nf_inet_addr to;
 	__be16 port;
 	struct ip_vs_conn *n_cp;
+	struct nf_conn *ct;
 
 #ifdef CONFIG_IP_VS_IPV6
 	/* This application helper doesn't work with IPv6 yet,
@@ -349,6 +496,11 @@ static int ip_vs_ftp_in(struct ip_vs_app
 		ip_vs_control_add(n_cp, cp);
 	}
 
+	ct = (struct nf_conn *)skb->nfct;
+	if (ct && ct != &nf_conntrack_untracked)
+		ip_vs_expect_related(skb, ct, n_cp,
+				     IPPROTO_TCP, &n_cp->dport, 1);
+
 	/*
 	 *	Move tunnel to listen state
 	 */


^ permalink raw reply

* [patch v2.8 2/4] IPVS: make friends with nf_conntrack
From: Simon Horman @ 2010-07-22  7:35 UTC (permalink / raw)
  To: lvs-devel, netdev, linux-kernel, netfilter, netfilter-devel
  Cc: Malcolm Turnbull, Mark Brooks, Wensong Zhang, Julius Volz,
	Patrick McHardy, David S. Miller, Hannes Eder, Jan Engelhardt
In-Reply-To: <20100722073547.504156161@vergenet.net>

[-- Attachment #1: IPVS-make-friends-with-nf_conntrack.patch --]
[-- Type: text/plain, Size: 5649 bytes --]

From:	Hannes Eder <heder@google.com>

Update the nf_conntrack tuple in reply direction, as we will see
traffic from the real server (RIP) to the client (CIP).  Once this is
done we can use netfilter's SNAT in POSTROUTING, especially with
xt_ipvs, to do source NAT, e.g.:

% iptables -t nat -A POSTROUTING -m ipvs --vaddr 192.168.100.30/32 --vport 80 \
> -j SNAT --to-source 192.168.10.10
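
The reply-direction update can be pictured with a small standalone
sketch.  The types endpoint and tuple and the helpers tuple_invert
and reply_after_ipvs are hypothetical simplifications, not the
nf_conntrack structures; the sketch only models the idea that replies
arrive from the real server rather than from the VIP:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for conntrack tuple data. */
struct endpoint {
	uint32_t addr;	/* IPv4 address */
	uint16_t port;	/* L4 port */
};

struct tuple {
	struct endpoint src, dst;
};

/* Reply tuple of an untranslated flow: the original with src/dst swapped. */
static struct tuple tuple_invert(struct tuple orig)
{
	struct tuple reply = { orig.dst, orig.src };

	return reply;
}

/*
 * After IPVS has forwarded CIP->VIP to the real server 'rs', replies
 * come from 'rs', so the reply tuple changes from VIP->CIP to RIP->CIP.
 */
static struct tuple reply_after_ipvs(struct tuple orig, struct endpoint rs)
{
	struct tuple reply = tuple_invert(orig);

	assert(rs.addr != 0);	/* a real server needs an address */
	reply.src = rs;
	return reply;
}
```

Only the source side of the reply tuple changes; the client address
and port stay as they were in the original direction.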

[ minor fixes by Simon Horman <horms@verge.net.au> ]
Signed-off-by: Hannes Eder <heder@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>

--- 
 net/netfilter/ipvs/Kconfig      |    2 +-
 net/netfilter/ipvs/ip_vs_core.c |   36 ------------------------------------
 net/netfilter/ipvs/ip_vs_xmit.c |   29 +++++++++++++++++++++++++++++
 3 files changed, 30 insertions(+), 37 deletions(-)

v2.4
As per advice from Patrick McHardy
* Use nf_conntrack_untracked() instead of &nf_conntrack_untracked

v2.1, v2.2, v2.3
No change

Index: nf-next-2.6/net/netfilter/ipvs/Kconfig
===================================================================
--- nf-next-2.6.orig/net/netfilter/ipvs/Kconfig	2010-07-07 13:24:31.000000000 +0900
+++ nf-next-2.6/net/netfilter/ipvs/Kconfig	2010-07-07 13:38:23.000000000 +0900
@@ -3,7 +3,7 @@
 #
 menuconfig IP_VS
 	tristate "IP virtual server support"
-	depends on NET && INET && NETFILTER
+	depends on NET && INET && NETFILTER && NF_CONNTRACK
 	---help---
 	  IP Virtual Server support will let you build a high-performance
 	  virtual server based on cluster of two or more real servers. This
Index: nf-next-2.6/net/netfilter/ipvs/ip_vs_core.c
===================================================================
--- nf-next-2.6.orig/net/netfilter/ipvs/ip_vs_core.c	2010-07-07 13:23:37.000000000 +0900
+++ nf-next-2.6/net/netfilter/ipvs/ip_vs_core.c	2010-07-07 13:38:23.000000000 +0900
@@ -536,26 +536,6 @@ int ip_vs_leave(struct ip_vs_service *sv
 	return NF_DROP;
 }
 
-
-/*
- *      It is hooked before NF_IP_PRI_NAT_SRC at the NF_INET_POST_ROUTING
- *      chain, and is used for VS/NAT.
- *      It detects packets for VS/NAT connections and sends the packets
- *      immediately. This can avoid that iptable_nat mangles the packets
- *      for VS/NAT.
- */
-static unsigned int ip_vs_post_routing(unsigned int hooknum,
-				       struct sk_buff *skb,
-				       const struct net_device *in,
-				       const struct net_device *out,
-				       int (*okfn)(struct sk_buff *))
-{
-	if (!skb->ipvs_property)
-		return NF_ACCEPT;
-	/* The packet was sent from IPVS, exit this chain */
-	return NF_STOP;
-}
-
 __sum16 ip_vs_checksum_complete(struct sk_buff *skb, int offset)
 {
 	return csum_fold(skb_checksum(skb, offset, skb->len - offset, 0));
@@ -1499,14 +1479,6 @@ static struct nf_hook_ops ip_vs_ops[] __
 		.hooknum        = NF_INET_FORWARD,
 		.priority       = 99,
 	},
-	/* Before the netfilter connection tracking, exit from POST_ROUTING */
-	{
-		.hook		= ip_vs_post_routing,
-		.owner		= THIS_MODULE,
-		.pf		= PF_INET,
-		.hooknum        = NF_INET_POST_ROUTING,
-		.priority       = NF_IP_PRI_NAT_SRC-1,
-	},
 #ifdef CONFIG_IP_VS_IPV6
 	/* After packet filtering, forward packet through VS/DR, VS/TUN,
 	 * or VS/NAT(change destination), so that filtering rules can be
@@ -1535,14 +1507,6 @@ static struct nf_hook_ops ip_vs_ops[] __
 		.hooknum        = NF_INET_FORWARD,
 		.priority       = 99,
 	},
-	/* Before the netfilter connection tracking, exit from POST_ROUTING */
-	{
-		.hook		= ip_vs_post_routing,
-		.owner		= THIS_MODULE,
-		.pf		= PF_INET6,
-		.hooknum        = NF_INET_POST_ROUTING,
-		.priority       = NF_IP6_PRI_NAT_SRC-1,
-	},
 #endif
 };
 
Index: nf-next-2.6/net/netfilter/ipvs/ip_vs_xmit.c
===================================================================
--- nf-next-2.6.orig/net/netfilter/ipvs/ip_vs_xmit.c	2010-07-07 13:23:37.000000000 +0900
+++ nf-next-2.6/net/netfilter/ipvs/ip_vs_xmit.c	2010-07-07 13:42:22.000000000 +0900
@@ -28,6 +28,7 @@
 #include <net/ip6_route.h>
 #include <linux/icmpv6.h>
 #include <linux/netfilter.h>
+#include <net/netfilter/nf_conntrack.h>
 #include <linux/netfilter_ipv4.h>
 
 #include <net/ip_vs.h>
@@ -348,6 +349,30 @@ ip_vs_bypass_xmit_v6(struct sk_buff *skb
 }
 #endif
 
+static void
+ip_vs_update_conntrack(struct sk_buff *skb, struct ip_vs_conn *cp)
+{
+	struct nf_conn *ct = (struct nf_conn *)skb->nfct;
+	struct nf_conntrack_tuple new_tuple;
+
+	if (ct == NULL || nf_ct_is_untracked(ct) || nf_ct_is_confirmed(ct))
+		return;
+
+	/*
+	 * The connection is not yet in the hashtable, so we update it.
+	 * CIP->VIP will remain the same, so leave the tuple in
+	 * IP_CT_DIR_ORIGINAL untouched.  When the reply comes back from the
+	 * real-server we will see RIP->DIP.
+	 */
+	new_tuple = ct->tuplehash[IP_CT_DIR_REPLY].tuple;
+	new_tuple.src.u3 = cp->daddr;
+	/*
+	 * This will also take care of UDP and other protocols.
+	 */
+	new_tuple.src.u.tcp.port = cp->dport;
+	nf_conntrack_alter_reply(ct, &new_tuple);
+}
+
 /*
  *      NAT transmitter (only for outside-to-inside nat forwarding)
  *      Not used for related ICMP
@@ -403,6 +428,8 @@ ip_vs_nat_xmit(struct sk_buff *skb, stru
 
 	IP_VS_DBG_PKT(10, pp, skb, 0, "After DNAT");
 
+	ip_vs_update_conntrack(skb, cp);
+
 	/* FIXME: when application helper enlarges the packet and the length
 	   is larger than the MTU of outgoing device, there will be still
 	   MTU problem. */
@@ -479,6 +506,8 @@ ip_vs_nat_xmit_v6(struct sk_buff *skb, s
 
 	IP_VS_DBG_PKT(10, pp, skb, 0, "After DNAT");
 
+	ip_vs_update_conntrack(skb, cp);
+
 	/* FIXME: when application helper enlarges the packet and the length
 	   is larger than the MTU of outgoing device, there will be still
 	   MTU problem. */


^ permalink raw reply

* [patch v2.8 0/4] IPVS full NAT support + netfilter 'ipvs' match support
From: Simon Horman @ 2010-07-22  7:35 UTC (permalink / raw)
  To: lvs-devel, netdev, linux-kernel, netfilter, netfilter-devel
  Cc: Malcolm Turnbull, Mark Brooks, Wensong Zhang, Julius Volz,
	Patrick McHardy, David S. Miller, Hannes Eder, Jan Engelhardt

This is a repost of a patch-series posted by Hannes Eder last September.
This is v2 of the patch series and I don't see any outstanding objections to
it in the mailing list archives.

Series v2.8 Convert XT_IPVS_IPVS_* from #defines to an enum,
       as suggested by Jan Engelhardt <jengelh@medozas.de>

Series v2.7 fixes a header mismatch between kernel and user-space

Series v2.6 fixes the arguments to %pI4

Series v2.5 fixes some problems introduced in v2.4.

Series v2.4 addresses all of the concerns that Patrick McHardy raised
with the v2.3 series.

The original cover-email from Hannes follows.
The diffstat output has been updated to reflect changes by me.

Mark Brooks has tested the v2.7 patchset, and found no problems.
Details of his test follow Hannes's cover. I have made minor
edits to Mark's email but not the results.

----------------------------------------------------------------------

From:	Hannes Eder <heder@google.com>

The following series implements full NAT support for IPVS.  The
approach is via a minimal change to IPVS (make friends with
nf_conntrack) and adding a netfilter matcher, kernel- and user-space
part, i.e. xt_ipvs and libxt_ipvs.

Example usage:

% ipvsadm -A -t 192.168.100.30:80 -s rr
% ipvsadm -a -t 192.168.100.30:80 -r 192.168.10.20:80 -m
# ...

# Source NAT for VIP 192.168.100.30:80
% iptables -t nat -A POSTROUTING -m ipvs --vaddr 192.168.100.30/32 \
> --vport 80 -j SNAT --to-source 192.168.10.10

or SNAT-ing only a specific real server:

% iptables -t nat -A POSTROUTING --dst 192.168.11.20 \
> -m ipvs --vaddr 192.168.100.30/32 -j SNAT --to-source 192.168.10.10


First of all, thanks for all the feedback.  This is the changelog for v2:

- Make ip_vs_ftp work again.  Setup nf_conntrack expectations for
  related data connections (based on Julian's patch, see
  http://www.ssi.bg/~ja/nfct/) and let nf_conntrack/nf_nat do the
  packet mangling and the TCP sequence adjusting.

  This change raises the question of how to deal with ip_vs_sync.  Does it
  work together with conntrackd?  Wild idea: what about getting rid of
  ip_vs_sync, piggybacking everything on nf_conntrack, and using conntrackd?

  Any comments on this?

- xt_ipvs: add new rule '--vportctl port' to match the VIP port of the
  controlling connection, e.g. port 21 for FTP.  Can be used to match
  a related data connection for FTP:

  # SNAT FTP control connection
  % iptables -t nat -A POSTROUTING -m ipvs --vaddr 192.168.100.30/32 \
  > --vport 21 -j SNAT --to-source 192.168.10.10
  
  # SNAT FTP passive data connection
  % iptables -t nat -A POSTROUTING -m ipvs --vaddr 192.168.100.30/32 \
  > --vportctl 21 -j SNAT --to-source 192.168.10.10

- xt_ipvs: use 'par->family' instead of 'skb->protocol'

- xt_ipvs: add ipvs_mt_check and restrict to NFPROTO_IPV4 and NFPROTO_IPV6

- Call nf_conntrack_alter_reply(), so helper lookup is performed based
  on the changed tuple.
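
The reply-tuple change behind "make friends with nf_conntrack" can be
modelled in user space roughly as follows.  This is only a sketch: `struct
tuple` and `reply_after_ipvs` are hypothetical, IPv4-only stand-ins and not
the kernel's nf_conntrack types.  It shows that, for a connection CIP->VIP
forwarded to a real server, the expected reply comes from RIP:rport rather
than from VIP:vport.

```c
#include <stdint.h>

/* Simplified stand-in for a conntrack tuple (IPv4 only, hypothetical). */
struct tuple {
	uint32_t src_ip, dst_ip;
	uint16_t src_port, dst_port;
};

/*
 * Build the expected reply tuple for an original tuple CIP -> VIP once
 * IPVS has picked a real server.  The reply source becomes RIP:rport;
 * the reply destination stays the client (CIP:cport).
 */
static struct tuple reply_after_ipvs(struct tuple orig, uint32_t rip,
				     uint16_t rport)
{
	struct tuple reply;

	reply.src_ip = rip;
	reply.src_port = rport;
	reply.dst_ip = orig.src_ip;
	reply.dst_port = orig.src_port;
	return reply;
}
```

The original-direction tuple is deliberately left alone in this model,
mirroring the comment in the patch that CIP->VIP remains the same.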

Changes to the linux kernel
(nf-next-2.6, "bridge: add per bridge device controls for invoking iptables")

Hannes Eder (3):
      netfilter: xt_ipvs (netfilter matcher for IPVS)
      IPVS: make friends with nf_conntrack
      IPVS: make FTP work with full NAT support


 include/linux/netfilter/xt_ipvs.h |   27 +++++
 include/net/ip_vs.h              |    2 
 net/netfilter/Kconfig            |   10 ++
 net/netfilter/Makefile           |    1 
 net/netfilter/ipvs/Kconfig       |    4 
 net/netfilter/ipvs/ip_vs_app.c   |   43 ---------
 net/netfilter/ipvs/ip_vs_core.c  |   37 --------
 net/netfilter/ipvs/ip_vs_ftp.c   |  174 +++++++++++++++++++++++++++++++++++---
 net/netfilter/ipvs/ip_vs_proto.c |    1 
 net/netfilter/ipvs/ip_vs_xmit.c  |   29 ++++++
 net/netfilter/xt_ipvs.c           |  189 +++++++++++++++++++++++++++++++++++++
 11 files changed, 422 insertions(+), 95 deletions(-)
 create mode 100644 include/linux/netfilter/xt_ipvs.h
 create mode 100644 net/netfilter/xt_ipvs.c


Changes to iptables
(iptables.git, "xt_quota: also document negation")

Hannes Eder (1):
      libxt_ipvs: user-space lib for netfilter matcher xt_ipvs

 configure.ac                      |   10 1
 extensions/libxt_ipvs.c           |  365 +++++++++++++++++++++++++++++++++++++
 extensions/libxt_ipvs.man         |   24 ++
 include/linux/netfilter/xt_ipvs.h |   27 +++
 4 files changed, 424 insertions(+), 2 deletions(-)
 create mode 100644 extensions/libxt_ipvs.c
 create mode 100644 extensions/libxt_ipvs.man
 create mode 100644 include/linux/netfilter/xt_ipvs.h

----------------------------------------------------------------------

From: Mark Brooks <mark@loadbalancer.org>

I'm going to detail my setup and what I did to test/confirm this (you can
probably skip this bit if you want but I thought I should include it for
completeness)

The Loadbalancer

IPVS 1.2.1
iptables 1.4.8 --patched
kernel - 2.6.35-rc1 --patched

eth0 ip 192.168.17.93
eth0:45 192.168.18.21 (I would have used eth1 but couldn't find a test box
spare with 2 network cards in)

My test box -

eth0 192.168.18.1

The Webserver -

192.168.17.4:80

Commands to setup ipvs and iptables

IPVS
ipvsadm -A -t 192.168.18.21:80 -s rr
ipvsadm -a -t 192.168.18.21:80 -r 192.168.17.4:80 -m

iptables
/usr/local/sbin/iptables -t nat -A POSTROUTING -m ipvs --vaddr \
192.168.18.21/24 --vport 80 -j SNAT --to-source 192.168.17.93

iptables shows -

iptables -t nat -L
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination

Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
SNAT       all  --  anywhere             anywhere            vaddr
192.168.18.0/24 vport 80 to:192.168.17.93

ipvsadm shows -

ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.18.21:80 rr
  -> 192.168.17.4:80              Masq    1      0          0



I connected to the IP from my browser, the page loaded fine, and you can
see this in the apache log -

"192.168.17.93 - - [21/Jul/2010:08:44:00 -0400] "GET / HTTP/1.1" 200 82"

Finally I ran a couple of tests with httperf for about 6 hours to see if
anything strange happened

httperf --hog --server 192.168.18.21 --num-con 250 --ra $NUMBER --timeout 5

A maximum number of connections of 250 at rates between 1 and 250
connections per second. Every connection completed fine and there appeared
to be no problems.


^ permalink raw reply

* Re: [PATCH net-next] sysfs: add entry to indicate network interfaces with random MAC address
From: Ian Campbell @ 2010-07-22  7:12 UTC (permalink / raw)
  To: David Miller
  Cc: gregory.v.rose, leedom, shemminger, andy, harald, bhutchings,
	sassmann, netdev, linux-kernel, gospo, alexander.h.duyck
In-Reply-To: <20100721.123324.237334251.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 2163 bytes --]

On Wed, 2010-07-21 at 12:33 -0700, David Miller wrote:
> From: "Rose, Gregory V" <gregory.v.rose@intel.com>
> Date: Wed, 21 Jul 2010 12:02:17 -0700
> 
> >>From: David Miller <davem@davemloft.net>
> >>Date: Wed, 21 Jul 2010 11:48:51 -0700 (PDT)
> >>
> >>> You could do things like have the PF controller use the root
> >>filesystem
> >>> ID label to construct the VF's MAC address, or something like that.
> >>
> >>And here I of course mean the root filesystem of the guest the VF will
> >>be given to.
> > 
> > I suppose you could do that but then the VM is going to have to be
> > allowed to set its own MAC address.  There is a lot of opposition
> > and concern about allowing VMs to set their own MAC address.
> 
> Why would that be necessary?  The host with the PF creating the guest
> has access to the "device" and thus the root filesystem of the guest,
> and thus could pull in the root filesystem "key" and instantiate the
> VF's MAC before booting the guest.

Most VM host toolstacks allow you to store a MAC address for each
virtual NIC in the metadata associated with the VM. This MAC address is
either given by the user when they create the virtual NIC, random with
locally administered bit set or random in the VM vendor's OUI space. This
ensures the VM configuration remains consistent over time.
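
As a rough illustration of the "random with locally administered bit set"
case, a toolstack might generate an address along these lines.  This is a
user-space sketch only; `make_local_unicast` and `random_local_mac` are
hypothetical names, not taken from any toolstack, and rand() is used just
to keep the example short.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Force an address to be locally administered and unicast:
 * bit 1 of the first octet set (locally administered),
 * bit 0 of the first octet cleared (not multicast).
 */
static void make_local_unicast(uint8_t mac[6])
{
	mac[0] |= 0x02;
	mac[0] &= (uint8_t)~0x01;
}

/* Fill mac[] with pseudo-random octets, then fix up the two flag bits. */
static void random_local_mac(uint8_t mac[6])
{
	int i;

	for (i = 0; i < 6; i++)
		mac[i] = (uint8_t)(rand() & 0xff);
	make_local_unicast(mac);
}
```

The generated address would then be stored in the VM metadata so that it
stays the same across restarts.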

Why would they not continue to do the same for SR-IOV passthrough NICs?

As a fallback some toolstacks will generate a random address if the NIC
configuration doesn't specify one but if you want a persistent address
for a guest why would you not just configure it that way? Accessing the
guest root filesystem might be a nicer fallback than random generation
when users haven't explicitly configured a MAC but isn't there a chance
of a VM admin controlling the MAC address by manipulating the root
filesystem? What do you do if there is an address clash in this case?
Relabelling the root filesystem is a bit of a faff. Also the root
filesystem could be contained within an LVM volume or encrypted or
whatever.

Ian.
-- 
Ian Campbell

Military intelligence is a contradiction in terms.
		-- Groucho Marx

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: [patch v2.6 4/4] libxt_ipvs: user-space lib for netfilter matcher xt_ipvs
From: Simon Horman @ 2010-07-22  6:57 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: lvs-devel, netdev, linux-kernel, netfilter, netfilter-devel,
	Malcolm Turnbull, Wensong Zhang, Julius Volz, Patrick McHardy,
	David S. Miller, Hannes Eder
In-Reply-To: <alpine.LSU.2.01.1007220823340.4980@obet.zrqbmnf.qr>

On Thu, Jul 22, 2010 at 08:25:01AM +0200, Jan Engelhardt wrote:
> 
> On Thursday 2010-07-22 03:38, Simon Horman wrote:
> >
> >I must confess that I'm not familiar with using enum in this way.
> >Can I confirm that you are suggesting the following?
> >
> >enum {
> >	XT_IPVS_IPVS_PROPERTY =	1 << 0, /* all other options imply this one */
> >	XT_IPVS_PROTO =		1 << 1,
> >	XT_IPVS_VADDR =		1 << 2,
> >	XT_IPVS_VPORT =		1 << 3,
> >	XT_IPVS_DIR =		1 << 4,
> >	XT_IPVS_METHOD =	1 << 5,
> >	XT_IPVS_VPORTCTL =	1 << 6,
> >	XT_IPVS_MASK =		(1 << 7) - 1,
> >	XT_IPVS_ONCE_MASK =	(XT_IPVS_MASK & ~XT_IPVS_IPVS_PROPERTY)
> >};
> 
> Yes; You may drop the () in ONCE_MASK though.

Thanks; and yes, silliness on my part.

^ permalink raw reply

* Re: Fwd: LVS on local node
From: Eric Dumazet @ 2010-07-22  6:56 UTC (permalink / raw)
  To: Franchoze Eric; +Cc: wensong, lvs-devel, netdev, netfilter-devel
In-Reply-To: <27901279770680@web67.yandex.ru>

Le jeudi 22 juillet 2010 à 07:51 +0400, Franchoze Eric a écrit :
> Hello,
> 
> I'm trying to do load balancing of incoming traffic to my applications. These applications are not very SMP friendly, so I want to try running several instances, according to the number of CPUs, on a single machine, and balance the load of incoming traffic/connections across them.
> It looks like it should be similar to http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.localnode.html
> 
>  linux kernel 2.6.32 with  or without hide interface patches.  Tried different configurations but could not see packets on application layer.
> 
> 192.168.1.165 - eth0 - interface for external connections
> 195.0.0.1 - dummy0 - virtual interface, real application is bound to that address.
> 
> Configuration is:
> -A -t 192.168.1.165:1234 -s wlc
> -a -t 192.168.1.165:1234 -r 195.0.0.1:1234 -g -w
> 
> #ipvsadm -L -n
> IP Virtual Server version 1.2.1 (size=4096)
> Prot LocalAddress:Port Scheduler Flags
>   -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
> TCP  192.168.1.165:1234 wlc
>   -> 195.0.0.1:1234               Local   1      0          0        
> #
> 
> Log:
> [ 2106.897409] IPVS: lookup/out TCP 192.168.1.165:44847->192.168.1.165:1234 not hit
> [ 2106.897412] IPVS: lookup service: fwm 0 TCP 192.168.1.165:1234 hit
> [ 2106.897414] IPVS: ip_vs_wlc_schedule(): Scheduling...
> [ 2106.897416] IPVS: WLC: server 195.0.0.1:1234 activeconns 0 refcnt 2 weight 1 overhead 1
> [ 2106.897418] IPVS: Enter: ip_vs_conn_new, net/netfilter/ipvs/ip_vs_conn.c line 693
> [ 2106.897421] IPVS: Bind-dest TCP c:192.168.1.165:44847 v:192.168.1.165:1234 d:195.0.0.1:1234 fwd:L s:0 conn->flags:181 conn->refcnt:1 dest->refcnt:3
> [ 2106.897425] IPVS: Schedule fwd:L c:192.168.1.165:44847 v:192.168.1.165:1234 d:195.0.0.1:1234 conn->flags:1C1 conn->refcnt:2
> [ 2106.897429] IPVS: TCP input  [S...] 195.0.0.1:1234->192.168.1.165:44847 state: NONE->SYN_RECV conn->refcnt:2
> [ 2106.897431] IPVS: Enter: ip_vs_null_xmit, net/netfilter/ipvs/ip_vs_xmit.c line 212
> [ 2106.897439] IPVS: lookup/in TCP 192.168.1.165:1234->192.168.1.165:44847 not hit
> [ 2106.897441] IPVS: lookup/out TCP 192.168.1.165:1234->192.168.1.165:44847 not hit
> [ 2107.277535] IPVS: packet type=1 proto=17 daddr=255.255.255.255 ignored
> [ 2108.542691] IPVS: packet type=1 proto=17 daddr=192.168.1.255 ignored
> 
> As a result, the server application does not receive anything on accept(). I tried to make dummy0 a hidden device and play with arp settings, but without result.
> 
> I will be happy to hear any idea how to do connection in this environment.
> 

lvs seems not very SMP friendly and a bit complex.

I would use an iptables setup and a slightly modified REDIRECT target
(and/or a nf_nat_setup_info() change)

Say you have 8 daemons listening on different ports (1000 to 1007)

iptables -t nat -A PREROUTING -p tcp --dport 1234 -j REDIRECT --rxhash-dist --to-port 1000-1007

rxhash would be provided by RPS on recent kernels or locally computed if
not already provided by core network (or old kernel)

This rule would be triggered only at connection establishment.
conntracking takes care of subsequent packets and is SMP friendly.
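
The proposed distribution can be sketched in user space as follows.  Note
that --rxhash-dist is only a proposal in this message, not an existing
iptables option, and `pick_port` is a hypothetical helper.  It maps a flow
hash onto the inclusive port range so that the same hash always selects
the same daemon.

```c
#include <stdint.h>

/*
 * Map a 32-bit flow hash onto the inclusive range [min_port, max_port].
 * Assumes min_port <= max_port.
 */
static uint16_t pick_port(uint32_t rxhash, uint16_t min_port, uint16_t max_port)
{
	uint32_t range = (uint32_t)max_port - min_port + 1;

	return (uint16_t)(min_port + rxhash % range);
}
```

With the range 1000-1007 from the rule above, a hash of 0 would select port
1000 and a hash of 7 would select port 1007.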




^ permalink raw reply

* Re: [PATCH net-next] sysfs: add entry to indicate network interfaces with random MAC address
From: Stefan Assmann @ 2010-07-22  6:53 UTC (permalink / raw)
  To: Rose, Gregory V
  Cc: Casey Leedom, David Miller, shemminger@vyatta.com,
	andy@greyhouse.net, harald@redhat.com, bhutchings@solarflare.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	gospo@redhat.com, Duyck, Alexander H
In-Reply-To: <43F901BD926A4E43B106BF17856F0755F184620A@orsmsx508.amr.corp.intel.com>

On 21.07.2010 20:43, Rose, Gregory V wrote:
> I'm curious, what happens when the VM using the VF migrates to a new machine and has another VF assigned to it with a different MAC address?
> 
> Intel's view of things is that we don't use persistent MAC addresses in our VFs because the MAC address belongs to the VM and when it migrates it's going to want to use another VF with the same MAC address.  If they're persistent I'm wondering how that can be done.
> 
> This discussion has come about because some folks want to use the VF in the Host VMM.  The original design goal for Intel was that VFs would be assigned to VMs and that VMM vendors would want to assign MAC addresses with their own assigned OUI's.

Using the VF in the host is a feature and I'm sure people will think of
ways to make good use of it. However the actual problem we've seen is a
more practical one. So to pass-through a VF to a VM the host has to be
aware that the VF exists. Therefore you usually have to enable the VF in
the host (i.e. specify the max_vfs parameter). The device will be
discovered by the system and because of the random MAC address udev
ignores the new device. With the additional information we provide with
our solution udev will be able to recognize the device by its "device
path" and handle it properly (until you decide to pass it to a VM or
just be happy with it in the host).

Remember the issue that led to the proposal of renaming VFs to vfeth?
That's exactly the problem we try to fix. Additional benefit of an
"address assignment type" as Ben likes to call it would be the handling
of MAC address stealing NICs.

  Stefan
--
Stefan Assmann         | Red Hat GmbH
Software Engineer      | Otto-Hahn-Strasse 20, 85609 Dornach
                       | HR: Amtsgericht Muenchen HRB 153243
                       | GF: Brendan Lane, Charlie Peters,
sassmann at redhat.com |     Michael Cunningham, Charles Cachera

^ permalink raw reply

* macvtap: Limit packet queue length
From: Herbert Xu @ 2010-07-22  6:41 UTC (permalink / raw)
  To: David S. Miller, netdev, Arnd Bergmann; +Cc: Mark Wagner

Hi:

macvtap: Limit packet queue length

Mark Wagner reported OOM symptoms when sending UDP traffic over
a macvtap link to a kvm receiver.

This appears to be caused by the fact that macvtap packet queues
are unlimited in length.  This means that if the receiver can't
keep up with the rate of flow, then we will hit OOM. Of course
it gets worse if the OOM killer then decides to kill the receiver.

This patch imposes a cap on the packet queue length, in the same
way as the tuntap driver, using the device TX queue length.

Please note that macvtap currently has no way of giving congestion
notification, that means the software device TX queue cannot be
used and packets will always be dropped once the macvtap driver
queue fills up.

This shouldn't be a great problem for the scenario where macvtap
is used to feed a kvm receiver, as the traffic is most likely
external in origin so congestion notification can't be applied
anyway.

Of course, if anybody decides to complain about guest-to-guest
UDP packet loss down the track, then we may have to revisit this.

Incidentally, this patch also fixes a real memory leak when
macvtap_get_queue fails.
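
The capping logic described above can be modelled in user space roughly as
follows.  This is a sketch, not kernel code: `struct pkt_queue`,
`queue_forward` and the RX_* values are hypothetical stand-ins for the
socket receive queue, macvtap_forward() and NET_RX_SUCCESS/NET_RX_DROP.

```c
#include <stddef.h>

enum { RX_SUCCESS = 0, RX_DROP = 1 };

/* Stand-in for the receive queue: only the length is tracked here. */
struct pkt_queue {
	size_t len;		/* packets currently queued */
	size_t max_len;		/* cap, playing the role of dev->tx_queue_len */
	size_t dropped;		/* packets rejected because the cap was hit */
};

/*
 * Queue one packet, or drop it when there is no queue or the cap is
 * reached.  In the kernel the drop path must also free the skb, which
 * is where the leak mentioned above is fixed.
 */
static int queue_forward(struct pkt_queue *q)
{
	if (!q)
		return RX_DROP;
	if (q->len >= q->max_len) {
		q->dropped++;
		return RX_DROP;
	}
	q->len++;
	return RX_SUCCESS;
}
```
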

Reported-by: Mark Wagner <mwagner@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index a8a94e2..488d3b9 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -180,11 +180,18 @@ static int macvtap_forward(struct net_device *dev, struct sk_buff *skb)
 {
 	struct macvtap_queue *q = macvtap_get_queue(dev, skb);
 	if (!q)
-		return -ENOLINK;
+		goto drop;
+
+	if (skb_queue_len(&q->sk.sk_receive_queue) >= dev->tx_queue_len)
+		goto drop;

 	skb_queue_tail(&q->sk.sk_receive_queue, skb);
 	wake_up_interruptible_poll(sk_sleep(&q->sk), POLLIN | POLLRDNORM | POLLRDBAND);
-	return 0;
+	return NET_RX_SUCCESS;
+
+drop:
+	kfree_skb(skb);
+	return NET_RX_DROP;
 }

 /*

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related

* RE: [PATCH 0/3 net-2.6] bnx2x: Bug fixes in statistics handling
From: Vladislav Zolotarov @ 2010-07-22  6:35 UTC (permalink / raw)
  To: David Miller; +Cc: netdev@vger.kernel.org, Eilon Greenstein, Dmitry Kravkov
In-Reply-To: <20100721.111243.112856946.davem@davemloft.net>

> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Wednesday, July 21, 2010 9:13 PM
> To: Vladislav Zolotarov
> Cc: netdev@vger.kernel.org; Eilon Greenstein; Dmitry Kravkov
> Subject: Re: [PATCH 0/3 net-2.6] bnx2x: Bug fixes in statistics
> handling
> 
> From: "Vladislav Zolotarov" <vladz@broadcom.com>
> Date: Wed, 21 Jul 2010 18:58:55 +0300
> 
> > Dave, pls., consider applying the following patches to the net-2.6
> tree.
> > They include 2 bugs fixes in the statistics handling flow:
> 
> All applied.

Thanks.

> 
> There is a space character at the beginning of every "Signed-off-by:"
> line in your patch postings, please get rid of them for future
> submissions.

Hmmm... No prob., will do. My git creates it this way for some reason.
Probably some configuration glitch.

Thanks again,
vlad



^ permalink raw reply

* Re: [patch v2.6 4/4] libxt_ipvs: user-space lib for netfilter matcher xt_ipvs
From: Jan Engelhardt @ 2010-07-22  6:25 UTC (permalink / raw)
  To: Simon Horman
  Cc: lvs-devel, netdev, linux-kernel, netfilter, netfilter-devel,
	Malcolm Turnbull, Wensong Zhang, Julius Volz, Patrick McHardy,
	David S. Miller, Hannes Eder
In-Reply-To: <20100722013817.GB15008@verge.net.au>


On Thursday 2010-07-22 03:38, Simon Horman wrote:
>
>I must confess that I'm not familiar with using enum in this way.
>Can I confirm that you are suggesting the following?
>
>enum {
>	XT_IPVS_IPVS_PROPERTY =	1 << 0, /* all other options imply this one */
>	XT_IPVS_PROTO =		1 << 1,
>	XT_IPVS_VADDR =		1 << 2,
>	XT_IPVS_VPORT =		1 << 3,
>	XT_IPVS_DIR =		1 << 4,
>	XT_IPVS_METHOD =	1 << 5,
>	XT_IPVS_VPORTCTL =	1 << 6,
>	XT_IPVS_MASK =		(1 << 7) - 1,
>	XT_IPVS_ONCE_MASK =	(XT_IPVS_MASK & ~XT_IPVS_IPVS_PROPERTY)
>};

Yes; You may drop the () in ONCE_MASK though.
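
The parentheses are redundant because an enumerator's value is the whole
constant expression after the `=`, so nothing outside can bind to it.  A
small stand-alone check, reusing the enumerator names from the snippet
quoted above (the `once_mask_ok` helper is hypothetical):

```c
enum {
	XT_IPVS_IPVS_PROPERTY =	1 << 0, /* all other options imply this one */
	XT_IPVS_PROTO =		1 << 1,
	XT_IPVS_VADDR =		1 << 2,
	XT_IPVS_VPORT =		1 << 3,
	XT_IPVS_DIR =		1 << 4,
	XT_IPVS_METHOD =	1 << 5,
	XT_IPVS_VPORTCTL =	1 << 6,
	XT_IPVS_MASK =		(1 << 7) - 1,
	/* No outer parentheses: ~ binds tighter than &. */
	XT_IPVS_ONCE_MASK =	XT_IPVS_MASK & ~XT_IPVS_IPVS_PROPERTY
};

/* Returns 1 if ONCE_MASK covers every flag except IPVS_PROPERTY. */
static int once_mask_ok(void)
{
	return XT_IPVS_ONCE_MASK == (XT_IPVS_MASK ^ XT_IPVS_IPVS_PROPERTY);
}
```
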

^ permalink raw reply

* Re: [PATCH] fec: use interrupt for MDIO completion indication
From: Baruch Siach @ 2010-07-22  6:17 UTC (permalink / raw)
  To: Wolfram Sang
  Cc: Bryan Wu, netdev, linux-arm-kernel, Sascha Hauer, Greg Ungerer
In-Reply-To: <20100721125113.GA2651@pengutronix.de>

Hi Wolfram, Bryan,

On Wed, Jul 21, 2010 at 02:51:13PM +0200, Wolfram Sang wrote:
> > > Thanks for this patch, we tested on our i.MX51 board with Ubuntu. It 
> > > works fine.
> > > 
> > > Wolfram, you can pick up this, too. -;)
> > 
> > Dave has already applied this patch to his net-next tree.
> 
> Bryan, thanks for letting me know, I missed this one.
> 
> However, have you guys ever tried pulling the cable off/on or restarting
> the interface with 'ifconfig down/up'? This always caused a stalled PHY
> for me. This patch helps:

I can confirm the problem and the fix on a i.MX25 based system. Thanks 
Wolfram.

baruch

> From: Wolfram Sang <w.sang@pengutronix.de>
> Subject: [PATCH] net/fec: restore interrupt mask after software-reset in fec_stop()
> 
> After the change from mdio polling to irq, it became necessary to
> restore the interrupt mask after resetting the chip in fec_stop().
> Otherwise, with all irqs disabled, no communication with the PHY will be
> possible after e.g. un-/replugging the cable and the device gets
> stalled.
> 
> Signed-off-by: Wolfram Sang <w.sang@pengutronix.de>
> ---
>  drivers/net/fec.c |    7 ++++---
>  1 files changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/fec.c b/drivers/net/fec.c
> index 391a553..768b840 100644
> --- a/drivers/net/fec.c
> +++ b/drivers/net/fec.c
> @@ -118,6 +118,8 @@ static unsigned char	fec_mac_default[] = {
>  #define FEC_ENET_MII	((uint)0x00800000)	/* MII interrupt */
>  #define FEC_ENET_EBERR	((uint)0x00400000)	/* SDMA bus error */
>  
> +#define FEC_DEFAULT_IMASK (FEC_ENET_TXF | FEC_ENET_RXF | FEC_ENET_MII)
> +
>  /* The FEC stores dest/src/type, data, and checksum for receive packets.
>   */
>  #define PKT_MAXBUF_SIZE		1518
> @@ -1213,8 +1215,7 @@ fec_restart(struct net_device *dev, int duplex)
>  	writel(0, fep->hwp + FEC_R_DES_ACTIVE);
>  
>  	/* Enable interrupts we wish to service */
> -	writel(FEC_ENET_TXF | FEC_ENET_RXF | FEC_ENET_MII,
> -			fep->hwp + FEC_IMASK);
> +	writel(FEC_DEFAULT_IMASK, fep->hwp + FEC_IMASK);
>  }
>  
>  static void
> @@ -1233,8 +1234,8 @@ fec_stop(struct net_device *dev)
>  	/* Whack a reset.  We should wait for this. */
>  	writel(1, fep->hwp + FEC_ECNTRL);
>  	udelay(10);
> -
>  	writel(fep->phy_speed, fep->hwp + FEC_MII_SPEED);
> +	writel(FEC_DEFAULT_IMASK, fep->hwp + FEC_IMASK);
>  }
>  
>  static int __devinit
> -- 
> 1.7.1
> 
> ==========================
> 
> BUT, while it helps and may possibly be a quick fix for 2.6.35,
> resetting the chip in fec_stop() looks like a wrong thing to do for me.
> In the long run, it probably is better to make sure the chip is set up
> correctly during initialization, so the reset in fec_stop() is not
> needed at all. I had a quick shot at this, but seem to have missed
> something as it didn't work. As I will be away from the computers for
> two weeks in about 24 hours, I at least wanted to bring up the issue.
> 
> Regards,
> 
>    Wolfram
> 
> -- 
> Pengutronix e.K.                           | Wolfram Sang                |
> Industrial Linux Solutions                 | http://www.pengutronix.de/  |



-- 
                                                     ~. .~   Tk Open Systems
=}------------------------------------------------ooO--U--Ooo------------{=
   - baruch@tkos.co.il - tel: +972.2.679.5364, http://www.tkos.co.il -

^ permalink raw reply

* Re: [PATCH] LSM: Add post recvmsg() hook.
From: David Miller @ 2010-07-22  5:06 UTC (permalink / raw)
  To: penguin-kernel
  Cc: kuznet, pekkas, jmorris, yoshfuji, kaber, paul.moore, netdev,
	linux-security-module
In-Reply-To: <201007220502.o6M52GJU098071@www262.sakura.ne.jp>

From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Thu, 22 Jul 2010 14:02:16 +0900

> Then, why does below proposal lose information?

Peek changes state, now it's possible that two processes end up
receiving the packet.

Please consider deeply how your desired semantics are unobtainable
without breaking things fundamentally.

^ permalink raw reply

* Re: [PATCH] LSM: Add post recvmsg() hook.
From: Tetsuo Handa @ 2010-07-22  5:02 UTC (permalink / raw)
  To: davem
  Cc: kuznet, pekkas, jmorris, yoshfuji, kaber, paul.moore, netdev,
	linux-security-module
In-Reply-To: <20100721.214517.236270570.davem@davemloft.net>

David Miller wrote:
> From: Tetsuo Handa
> Date: Thu, 22 Jul 2010 13:41:38 +0900
> 
> > Excuse me, below check is made inside recvmsg() and may return error if
> > SELinux's policy has changed after the select() said "ready" and before
> > security_socket_recvmsg() is called. No?
> 
> It does this before pulling the packet out of the receive queue of the
> socket.  It's like signalling a parameter error to the process, no
> socket state is changed.

So, we agreed that security_socket_recvmsg() is allowed to return an error
code rather than the available data even if both conditions

1) Application makes poll() on UDP socket in blocking mode, and UDP
   reports that receive data is available

and

2) Application, after such a poll() call, makes a blocking recvmsg() call
   and no other activity has occurred on the socket meanwhile

are met.

Then, why does the proposal below lose information?
The message is not removed if security_socket_post_recvmsg() returns an error code.

 int udp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		size_t len, int noblock, int flags, int *addr_len)
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct sockaddr_in *sin = (struct sockaddr_in *)msg->msg_name;
 	struct sk_buff *skb;
 	unsigned int ulen;
 	int peeked;
 	int err;
 	int is_udplite = IS_UDPLITE(sk);
 	bool slow;
+	bool peek_forced;
 
 	/*
 	 *      Check any passed addresses
 	 */
 	if (addr_len)
 		*addr_len = sizeof(*sin);
 
 	if (flags & MSG_ERRQUEUE)
 		return ip_recv_error(sk, msg, len);
 
+	/* LSM wants to decide permission based on skb? */
+	peek_forced = security_socket_recvmsg_force_peek(sk);
 try_again:
-	skb = __skb_recv_datagram(sk, flags | (noblock ? MSG_DONTWAIT : 0),
-				  &peeked, &err);
+	skb = __skb_recv_datagram(sk, flags | (noblock ? MSG_DONTWAIT : 0) |
+				  (peek_forced ? MSG_PEEK : 0), &peeked, &err);
 	if (!skb)
 		goto out;
+	if (peek_forced) {
+		err = security_socket_post_recvmsg(sk, skb);
+		if (err < 0) {
+			/*
+			 * Do not remove this message from queue because LSM
+			 * decided not to deliver this message to the caller.
+			 */
+			peek_forced = false;
+			goto out_free;
+		}
+	}
 
 	ulen = skb->len - sizeof(struct udphdr);
 	if (len > ulen)
 		len = ulen;
 	else if (len < ulen)
 		msg->msg_flags |= MSG_TRUNC;
(...snipped...)
 out_free:
+	if (peek_forced && !(flags & MSG_PEEK)) {
+		/*
+		 * Remove this message from queue because this message was
+		 * peeked for LSM but the caller did not ask to peek.
+		 */
+		slow = lock_sock_fast(sk);
+		skb_kill_datagram(sk, skb, flags);
+		unlock_sock_fast(sk, slow);
+		goto out;
+	}
 	skb_free_datagram_locked(sk, skb);
 out:
 	return err;
(...snipped...)
 }



* Re: [PATCH] LSM: Add post recvmsg() hook.
From: David Miller @ 2010-07-22  4:45 UTC (permalink / raw)
  To: penguin-kernel
  Cc: kuznet, pekkas, jmorris, yoshfuji, kaber, paul.moore, netdev,
	linux-security-module
In-Reply-To: <201007220441.o6M4fcmC093106@www262.sakura.ne.jp>

From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Thu, 22 Jul 2010 13:41:38 +0900

> Excuse me, but the check below is made inside recvmsg() and may return an
> error if SELinux's policy has changed after select() said "ready" and
> before security_socket_recvmsg() is called. No?

It does this before pulling the packet out of the receive queue of the
socket.  It's like signalling a parameter error to the process, no
socket state is changed.


* Re: [PATCH] LSM: Add post recvmsg() hook.
From: Tetsuo Handa @ 2010-07-22  4:41 UTC (permalink / raw)
  To: davem
  Cc: kuznet, pekkas, jmorris, yoshfuji, kaber, paul.moore, netdev,
	linux-security-module
In-Reply-To: <20100721.210636.197931242.davem@davemloft.net>

David Miller wrote:
> Your analysis is wrong, and what Tomoyo is doing is so fundamentally
> different than what the existing SELINUX hooks do.
> 
> The existing LSM hooks do not break BSD socket behavior.  Do you know
> why?  Because someone who understood all of this spent a great deal
> of time carefully designing them.
> 
> The existing hooks do not drop packets on recvmsg() when a security
> check does not pass, they signal the error long before the socket
> receive queue is even looked at.  It is just like seeing a -EFAULT,
> -ENFILE, or similar error.
> 
> Checks are always made _BEFORE_ major state changes are made to the
> socket.

Excuse me, but the check below is made inside recvmsg() and may return an
error if SELinux's policy has changed after select() said "ready" and
before security_socket_recvmsg() is called. No?

int avc_has_perm_noaudit(u32 ssid, u32 tsid,
                         u16 tclass, u32 requested,
                         unsigned flags,
                         struct av_decision *in_avd)
{
        struct avc_node *node;
        struct av_decision avd_entry, *avd;
        int rc = 0;
        u32 denied;

        BUG_ON(!requested);

        rcu_read_lock();

        node = avc_lookup(ssid, tsid, tclass);
        if (!node) {
                rcu_read_unlock();

                if (in_avd)
                        avd = in_avd;
                else
                        avd = &avd_entry;

                security_compute_av(ssid, tsid, tclass, avd);
                rcu_read_lock();
                node = avc_insert(ssid, tsid, tclass, avd);
        } else {
                if (in_avd)
                        memcpy(in_avd, &node->ae.avd, sizeof(*in_avd));
                avd = &node->ae.avd;
        }

        denied = requested & ~(avd->allowed);

        if (denied) {
                if (flags & AVC_STRICT)
                        rc = -EACCES;
                else if (!selinux_enforcing || (avd->flags & AVD_FLAGS_PERMISSIVE))
                        avc_update_node(AVC_CALLBACK_GRANT, requested, ssid,
                                        tsid, tclass, avd->seqno);
                else
                        rc = -EACCES;
        }

        rcu_read_unlock();
        return rc;
}

int avc_has_perm(u32 ssid, u32 tsid, u16 tclass,
                 u32 requested, struct common_audit_data *auditdata)
{
        struct av_decision avd;
        int rc;

        rc = avc_has_perm_noaudit(ssid, tsid, tclass, requested, 0, &avd);
        avc_audit(ssid, tsid, tclass, requested, &avd, rc, auditdata);
        return rc;
}

static int socket_has_perm(struct task_struct *task, struct socket *sock,
                           u32 perms)
{
        struct inode_security_struct *isec;
        struct common_audit_data ad;
        u32 sid;
        int err = 0;

        isec = SOCK_INODE(sock)->i_security;

        if (isec->sid == SECINITSID_KERNEL)
                goto out;
        sid = task_sid(task);

        COMMON_AUDIT_DATA_INIT(&ad, NET);
        ad.u.net.sk = sock->sk;
        err = avc_has_perm(sid, isec->sid, isec->sclass, perms, &ad);

out:
        return err;
}

static int selinux_socket_recvmsg(struct socket *sock, struct msghdr *msg,
                                  int size, int flags)
{
        return socket_has_perm(current, sock, SOCKET__READ);
}

static struct security_operations selinux_ops = {
(...snipped...)
	.socket_recvmsg =               selinux_socket_recvmsg,
(...snipped...)
};

Are you saying that selinux_socket_recvmsg() always returns 0?

> That is critically important, and it's what you seem to fail to see.
> 
> The hooks you propose _LOSE_ information.  So even if another process
> has the 'fd' for a socket, and they would be allowed to receive the
> packet by LSM checks, the post hook does not allow that to happen
> because the failing 'fd' just frees up the packet and loses it
> forever.
> 
> The existing hooks signal before we pull the new connection out of the
> accept queue during accept(), therefore avoiding the illegal situation
> your post ->accept() hook would create since there is absolutely no
> way (and there should not be a way) to push a connection back into the
> sock accept queue after we've taken it from the protocol layer.
> 
> And again here, the proposed hooks _LOSE_ information.  The accepted
> connection is lost forever, another process with valid security
> credentials cannot accept the connection.  It is completely gone.
> 
> And I'm not even going to entertain adding facilities to allow pushing
> things back into the socket state after they've been removed for
> inspection.
> 
> I think we've been through this issue enough times that we have covered
> the issues in their entirety, and nothing you have written convinces
> me that my position is wrong and that it is valid to put the Tomoyo
> post-recvmsg and post-accept hooks into the tree.
> 
> Sorry, but I'm not applying your patches, they are fundamentally flawed
> unlike the existing hooks.

Did the idea described in my previous mail _LOSE_ information?
I made udp_recvmsg() force MSG_PEEK so that the message is not removed from
the queue if security_socket_post_recvmsg() returns an error code. The message
is removed from the queue only if security_socket_post_recvmsg() returns 0 and
the caller did not pass MSG_PEEK.
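
As a rough userspace illustration of the peek-then-decide idea described
above (not the kernel patch itself), the sketch below peeks at a datagram
with MSG_PEEK, applies a check, and dequeues the datagram only when the check
passes. The names policy_allows(), checked_recv() and peek_demo() are
hypothetical, and the "deny" prefix rule is an assumption invented for the
example.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for a security decision based on message contents. */
static int policy_allows(const char *buf, ssize_t len)
{
	return !(len >= 4 && memcmp(buf, "deny", 4) == 0);
}

/*
 * Peek first, check, and dequeue only when the check passes.
 * Returns the number of bytes received, or -1 if recv() fails or the
 * check rejects the datagram (in which case it stays in the queue).
 */
static ssize_t checked_recv(int fd, char *buf, size_t len)
{
	ssize_t n;

	assert(buf != NULL && len > 0);
	n = recv(fd, buf, len, MSG_PEEK | MSG_DONTWAIT);
	if (n < 0)
		return -1;
	if (!policy_allows(buf, n))
		return -1;	/* nothing was dequeued */
	return recv(fd, buf, len, MSG_DONTWAIT);	/* now really dequeue */
}

/* Returns 0 when both scenarios behave as described in the comments. */
int peek_demo(void)
{
	int sv[2];
	char buf[16];
	int rc = -1;

	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
		return -1;

	/* A rejected datagram must still be readable afterwards. */
	if (send(sv[0], "deny-me", 7, 0) != 7)
		goto out;
	if (checked_recv(sv[1], buf, sizeof(buf)) != -1)
		goto out;
	if (recv(sv[1], buf, sizeof(buf), MSG_DONTWAIT) != 7)
		goto out;

	/* An accepted datagram is consumed exactly once. */
	if (send(sv[0], "hello", 5, 0) != 5)
		goto out;
	if (checked_recv(sv[1], buf, sizeof(buf)) != 5)
		goto out;
	if (recv(sv[1], buf, sizeof(buf), MSG_DONTWAIT) != -1)
		goto out;	/* queue should be empty now */
	rc = 0;
out:
	close(sv[0]);
	close(sv[1]);
	return rc;
}
```

Calling peek_demo() is expected to return 0 on Linux. Unlike the proposed
udp_recvmsg() change, which works on the peeked skb within a single call,
this userspace version uses two recv() calls, so another reader of the same
socket could take the datagram between them.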


* Re: linux-next: build warning after merge of the net tree
From: David Miller @ 2010-07-22  4:11 UTC (permalink / raw)
  To: sfr; +Cc: netdev, linux-next, linux-kernel, richardcochran, florian
In-Reply-To: <20100722120639.cc8d56a5.sfr@canb.auug.org.au>

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Thu, 22 Jul 2010 12:06:39 +1000

> After merging the net tree, today's linux-next build (x86_64 allmodconfig)
> produced this warning:
> 
> drivers/net/r6040.c: In function 'r6040_ioctl':
> drivers/net/r6040.c:513: warning: passing argument 2 of 'phy_mii_ioctl' from incompatible pointer type
> include/linux/phy.h:522: note: expected 'struct ifreq *' but argument is of type 'struct mii_ioctl_data *'
> 
> Introduced by commit 28b041139e344ecd0f144d6205b004ae354cfa1e ("net:
> preserve ifreq parameter when calling generic phy_mii_ioctl()") (which
> changed the phy_mii_ioctl() API) interacting with commit
> 3831861b4ad8fd0ad7110048eb3e155628799d2b ("r6040: implement phylib")
> (which added a use of that function).

Thanks Stephen, should be fixed as follows:

--------------------
r6040: Fix args to phy_mii_ioctl().

Reported by Stephen Rothwell.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/r6040.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/r6040.c b/drivers/net/r6040.c
index 7d482a2..142c381 100644
--- a/drivers/net/r6040.c
+++ b/drivers/net/r6040.c
@@ -510,7 +510,7 @@ static int r6040_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 	if (!lp->phydev)
 		return -EINVAL;
 
-	return phy_mii_ioctl(lp->phydev, if_mii(rq), cmd);
+	return phy_mii_ioctl(lp->phydev, rq, cmd);
 }
 
 static int r6040_rx(struct net_device *dev, int limit)
-- 
1.7.1.1
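
The shape of the fix can be sketched outside the kernel as follows. This is
a simplified model only: the struct layouts and every name ending in _sketch
are hypothetical stand-ins, and the assumption that the callee extracts the
MII data from the request itself is inferred from the warning text rather
than taken from the phylib source.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for the MII ioctl payload (not the kernel layout). */
struct mii_data_sketch {
	uint16_t phy_id;
	uint16_t reg_num;
	uint16_t val_in;
	uint16_t val_out;
};

/* Simplified stand-in for an interface request carrying that payload. */
struct ifreq_sketch {
	char name[16];
	union {
		struct mii_data_sketch mii;
		char raw[24];
	} u;
};

/* Stand-in for a helper that locates the MII payload inside the request. */
static struct mii_data_sketch *if_mii_sketch(struct ifreq_sketch *rq)
{
	return &rq->u.mii;
}

/* Callee with the newer-style signature: it receives the whole request. */
int phy_ioctl_sketch(struct ifreq_sketch *ifr)
{
	struct mii_data_sketch *mii = if_mii_sketch(ifr);

	/* Fake register read so the caller can observe a result. */
	mii->val_out = (uint16_t)(mii->reg_num + 0x100);
	return 0;
}

/* Driver-side wrapper after the fix: pass the request through unchanged. */
int driver_ioctl_sketch(struct ifreq_sketch *rq)
{
	assert(rq != NULL);
	return phy_ioctl_sketch(rq);
}

/* Returns the fake register value, or -1 on failure. */
int mii_demo(void)
{
	struct ifreq_sketch rq;

	memset(&rq, 0, sizeof(rq));
	rq.u.mii.reg_num = 2;
	if (driver_ioctl_sketch(&rq) != 0)
		return -1;
	return rq.u.mii.val_out;
}
```

Passing the already-converted payload pointer to phy_ioctl_sketch() would be
the same type mismatch the compiler warned about above; passing the request
itself keeps the conversion in one place.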


* Re: linux-next: build warning after merge of the net tree
From: David Miller @ 2010-07-22  4:09 UTC (permalink / raw)
  To: sfr; +Cc: netdev, linux-next, linux-kernel
In-Reply-To: <20100722121157.344c8462.sfr@canb.auug.org.au>

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Thu, 22 Jul 2010 12:11:57 +1000

> After merging the net tree, today's linux-next build (x86_64 allmodconfig)
> produced this warning:
> 
> drivers/vhost/net.c: In function 'vhost_net_set_backend':
> drivers/vhost/net.c:536: warning: label 'done' defined but not used
> 
> Introduced by commit 11fe883936980fe242869d671092a466cf1db3e3 ("Merge
> branch 'master' of
> master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6") which
> kept the unneeded label.  Sorry if I misguided you.

I've fixed this, thanks Stephen.  Don't worry, not your fault :)


* Re: [PATCH] LSM: Add post recvmsg() hook.
From: David Miller @ 2010-07-22  4:06 UTC (permalink / raw)
  To: penguin-kernel
  Cc: kuznet, pekkas, jmorris, yoshfuji, kaber, paul.moore, netdev,
	linux-security-module
In-Reply-To: <201007220338.o6M3citW076383@www262.sakura.ne.jp>

Your analysis is wrong, and what Tomoyo is doing is so fundamentally
different than what the existing SELINUX hooks do.

The existing LSM hooks do not break BSD socket behavior.  Do you know
why?  Because someone who understood all of this spent a great deal
of time carefully designing them.

The existing hooks do not drop packets on recvmsg() when a security
check does not pass, they signal the error long before the socket
receive queue is even looked at.  It is just like seeing a -EFAULT,
-ENFILE, or similar error.

Checks are always made _BEFORE_ major state changes are made to the
socket.

That is critically important, and it's what you seem to fail to see.

The hooks you propose _LOSE_ information.  So even if another process
has the 'fd' for a socket, and they would be allowed to receive the
packet by LSM checks, the post hook does not allow that to happen
because the failing 'fd' just frees up the packet and loses it
forever.

The existing hooks signal before we pull the new connection out of the
accept queue during accept(), therefore avoiding the illegal situation
your post ->accept() hook would create since there is absolutely no
way (and there should not be a way) to push a connection back into the
sock accept queue after we've taken it from the protocol layer.

And again here, the proposed hooks _LOSE_ information.  The accepted
connection is lost forever, another process with valid security
credentials cannot accept the connection.  It is completely gone.

And I'm not even going to entertain adding facilities to allow pushing
things back into the socket state after they've been removed for
inspection.

I think we've been through this issue enough times that we have covered
the issues in their entirety, and nothing you have written convinces
me that my position is wrong and that it is valid to put the Tomoyo
post-recvmsg and post-accept hooks into the tree.

Sorry, but I'm not applying your patches, they are fundamentally flawed
unlike the existing hooks.
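
The difference between checking before and after the dequeue can be shown
with a toy model. This is only a sketch under stated assumptions: toy_queue,
recv_pre_hook(), recv_post_hook() and lost_with() are hypothetical names, the
queue is a plain array rather than a socket receive queue, and the post-hook
variant models a hook that simply drops the item on failure.

```c
#include <assert.h>
#include <stddef.h>

#define QCAP 8

/* Hypothetical stand-in for a receive queue shared by several readers. */
struct toy_queue {
	int item[QCAP];
	size_t len;
};

static void toy_push(struct toy_queue *q, int v)
{
	assert(q->len < QCAP);
	q->item[q->len++] = v;
}

/* Remove the oldest item; returns -1 if the queue is empty. */
static int toy_pop(struct toy_queue *q, int *v)
{
	size_t i;

	if (q->len == 0)
		return -1;
	*v = q->item[0];
	for (i = 1; i < q->len; i++)
		q->item[i - 1] = q->item[i];
	q->len--;
	return 0;
}

/* Pre-hook style: the permission check runs before the queue is touched. */
int recv_pre_hook(struct toy_queue *q, int reader_allowed, int *v)
{
	if (!reader_allowed)
		return -1;	/* plain error, queue state unchanged */
	return toy_pop(q, v);
}

/* Post-hook style: dequeue first, then check; a failure drops the item. */
int recv_post_hook(struct toy_queue *q, int reader_allowed, int *v)
{
	int tmp;

	if (toy_pop(q, &tmp) < 0)
		return -1;
	if (!reader_allowed)
		return -1;	/* tmp is discarded: gone for every reader */
	*v = tmp;
	return 0;
}

/*
 * A denied reader tries first, then a permitted reader.
 * Returns 1 if the permitted reader finds the item gone, 0 otherwise.
 */
int lost_with(int (*recv_fn)(struct toy_queue *, int, int *))
{
	struct toy_queue q = { .len = 0 };
	int v = 0;

	toy_push(&q, 42);
	(void)recv_fn(&q, 0, &v);
	return recv_fn(&q, 1, &v) < 0;
}
```

In this model lost_with(recv_pre_hook) yields 0 and
lost_with(recv_post_hook) yields 1, which is the information loss described
above: the second reader had valid credentials but the item was already
consumed by the failed attempt.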

