Netdev List

* Re: [RFC PATCH v8 00/16] Provide a zero-copy method on KVM virtio-net.
From: Avi Kivity @ 2010-08-02 16:11 UTC (permalink / raw)
  To: Shirley Ma
  Cc: xiaohui.xin, netdev, kvm, linux-kernel, mst, mingo, davem,
	herbert, jdike
In-Reply-To: <1280764918.22830.7.camel@localhost.localdomain>

  On 08/02/2010 07:01 PM, Shirley Ma wrote:
> Hello Avi,
>
> On Sun, 2010-08-01 at 11:18 +0300, Avi Kivity wrote:
>> I don't understand.  Under what conditions do you use
>> get_user_pages()
>> instead of get_user_pages_fast()?  Why?
> The code always calls get_user_pages_fast; however, the page will be
> unpinned in skb_free if the same page is not used again for a new
> buffer. The reason for unpinning the page is that we don't want to pin all
> of the guest kernel memory (memory overcommit).

That is fine.

> So get_user_pages_fast will
> call slow path get_user_pages.

I don't understand this. gup_fast() only calls gup() if the page is 
swapped out or read-only.

> Your previous comment is suggesting to keep the page pinned for
> get_user_pages_fast fast path?
>

Right now I'm not sure I understand what's happening.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply

* Re: [RFC PATCH v8 00/16] Provide a zero-copy method on KVM virtio-net.
From: Shirley Ma @ 2010-08-02 16:25 UTC (permalink / raw)
  To: Avi Kivity
  Cc: xiaohui.xin, netdev, kvm, linux-kernel, mst, mingo, davem,
	herbert, jdike
In-Reply-To: <4C56EE3D.1050203@redhat.com>

On Mon, 2010-08-02 at 19:11 +0300, Avi Kivity wrote:
> I don't understand this. gup_fast() only calls gup() if the page is 
> swapped out or read-only.

Oh, I used the page as read-only on xmit path. Should I use write
instead?

Thanks
Shirley


^ permalink raw reply

* Re: [PATCH v2 2/2] macvtap: Implement multiqueue macvtap driver
From: Krishna Kumar2 @ 2010-08-02 16:28 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: bhutchings, davem, mst, netdev, therbert
In-Reply-To: <201008021752.28475.arnd@arndb.de>

Hi Arnd,

Thanks for your comments. The declaration was in the 2nd patch
since the function was not used outside net/core/dev.c after the
1st patch is applied, but now I think you are right.

Regarding min/min_t, I had tried both and got this error:

"include/linux/if_macvlan.h:57: error: braced-group within expression
	allowed only inside a function"
Please let me know if there is any alternative (curly braces cannot be
used outside of functions). Otherwise one change required is to add:

#ifndef MIN
#endif

I will wait for a few hours and resubmit the patches.

thanks,

- KK

Arnd Bergmann <arnd@arndb.de> wrote on 08/02/2010 09:22:28 PM:

> On Monday 02 August 2010, Krishna Kumar wrote:
> > Implement multiqueue facility for macvtap driver. The idea is that
> > a macvtap device can be opened multiple times and the fd's can be
> > used to register eg, as backend for vhost.
> >
> > Please review.
>
> Only two very minor points from my side:
>
> > diff -ruNp org/include/linux/netdevice.h new/include/linux/netdevice.h
> > --- org/include/linux/netdevice.h   2010-07-25 16:57:07.000000000 +0530
> > +++ new/include/linux/netdevice.h   2010-08-02 16:05:57.000000000 +0530
> > @@ -2253,6 +2253,7 @@ static inline const char *netdev_name(co
> >     return dev->name;
> >  }
> >
> > +extern int skb_calculate_flow(struct net_device *dev, struct sk_buff *skb);
> >  extern int netdev_printk(const char *level, const struct net_device *dev,
> >            const char *format, ...)
> >     __attribute__ ((format (printf, 3, 4)));
>
> This logically belongs into the first patch.
>
> > diff -ruNp org/include/linux/if_macvlan.h new/include/linux/if_macvlan.h
> > --- org/include/linux/if_macvlan.h   2010-08-02 15:32:33.000000000 +0530
> > +++ new/include/linux/if_macvlan.h   2010-08-02 15:32:33.000000000 +0530
> > @@ -40,6 +40,14 @@ struct macvlan_rx_stats {
> >     unsigned long      rx_errors;
> >  };
> >
> > +#define MIN(x, y)      (((x) < (y)) ? (x) : (y))
> > +
> > +/*
> > + * Maximum times a macvtap device can be opened. This can be used to
> > + * configure the number of receive queue, e.g. for multiqueue virtio.
> > + */
> > +#define MAX_MACVTAP_QUEUES   MIN(16, NR_CPUS)
> > +
>
> Please use the existing min() or min_t() macro instead of providing your own.
>
>    Arnd


^ permalink raw reply

* Re: 2.6.35-rc6-git6: Reported regressions from 2.6.34
From: Tejun Heo @ 2010-08-02 16:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rafael J. Wysocki, Jens Axboe, Linux Kernel Mailing List,
	Maciej Rutecki, Andrew Morton, Kernel Testers List,
	Network Development, Linux ACPI, Linux PM List, Linux SCSI List,
	Linux Wireless List, DRI
In-Reply-To: <AANLkTimcH7+Bq1UEbaSU7SQpzArPgmSLegiqE13V8=CF@mail.gmail.com>

Hello, Linus.

On 08/01/2010 08:01 PM, Linus Torvalds wrote:
> This has a proposed patch. I don't know what the status of it is, though. Jens?
> 
>    http://marc.info/?l=linux-kernel&m=127950018204029&w=2
> 
>> Bug-Entry       : http://bugzilla.kernel.org/show_bug.cgi?id=16393
>> Subject         : kernel BUG at fs/block_dev.c:765!
>> Submitter       : Markus Trippelsdorf <markus@trippelsdorf.de>
>> Date            : 2010-07-14 13:52 (19 days old)
>> Message-ID      : <20100714135217.GA1797@arch.tripp.de>
>> References      : http://marc.info/?l=linux-kernel&m=127911564213748&w=2
> 
> This one is interesting. And I think I perhaps see where it's coming from.
> 
> bd_start_claiming() (through bd_prepare_to_claim()) has two separate
> success cases: either there was no holder (bd_claiming is NULL) or the
> new holder was already claiming it (bd_claiming == holder).
> 
> Note in particular the case of the holder _already_ holding it. What happens is:
> 
>  - bd_start_claiming() succeeds because we had _already_ claimed it
> with the same holder
> 
>  - then some error happens, and we call bd_abort_claiming(), which
> does whole->bd_claiming = NULL;
> 
>  - the original holder thinks it still holds the bd, but it has been released!
> 
>  - a new claimer comes in, and succeeds because bd_claiming is now NULL.
> 
>  - we now have two "owners" of the bd, but bd_claiming only points to
> the second one.
> 
> I think bd_start_claiming() needs to do some kind of refcount for the
> nested holder case, and bd_abort_claiming() needs to decrement the
> refcount and only clear the bd_claiming field when it goes down to
> zero.
> 
> I dunno. Maybe there's something else going on, but it does look
> suspicious, and the above would explain the BUG_ON().

Yeah, that definitely sounds plausible.  I think the condition check
in bd_prepare_to_claim() should have been "if (whole->bd_claiming)"
instead of "if (whole->bd_claiming && whole->bd_claiming != holder)".
It doesn't make much sense to allow multiple parallel claiming
operations anyway and the comment above already says - "This function
fails if @bdev is already claimed by another holder and waits if
another claiming is in progress."
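
To make the sequence Linus describes easier to follow, here is a small user-space model of the two condition checks. It is only a sketch: struct bdev_model, prepare_permissive(), prepare_strict(), abort_claiming() and double_owner() are made-up names, and unlike the kernel the strict variant simply fails instead of waiting for the claim in progress to finish.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified model of a block device; not the kernel's struct block_device. */
struct bdev_model {
	const void *bd_claiming;	/* claimer currently in progress, or NULL */
};

/* Current check: a second claim by the same holder also succeeds. */
static bool prepare_permissive(struct bdev_model *b, const void *holder)
{
	if (b->bd_claiming && b->bd_claiming != holder)
		return false;
	b->bd_claiming = holder;
	return true;
}

/* Proposed check: any claim in progress blocks a new one. */
static bool prepare_strict(struct bdev_model *b, const void *holder)
{
	if (b->bd_claiming)
		return false;
	b->bd_claiming = holder;
	return true;
}

static void abort_claiming(struct bdev_model *b)
{
	b->bd_claiming = NULL;
}

/*
 * Replays the sequence: holder A claims, A claims again (nested), the
 * nested claim hits an error and aborts, then a second holder tries.
 * Returns true if the second holder gets in while A's first claim is
 * still outstanding.
 */
static bool double_owner(bool (*prepare)(struct bdev_model *, const void *))
{
	struct bdev_model b = { NULL };
	int a, other;	/* only their addresses are used, as holder cookies */

	if (!prepare(&b, &a))
		return false;
	if (prepare(&b, &a))		/* nested claim by the same holder */
		abort_claiming(&b);	/* error path of the nested claim */
	return prepare(&b, &other);
}
```

In this model double_owner(prepare_permissive) reports the double-owner state and double_owner(prepare_strict) does not, because the nested claim never succeeds and so never gets to clear bd_claiming.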

I'll try to build a test case and verify it.

Thank you.

-- 
tejun

^ permalink raw reply

* Re: [RFC PATCH v8 00/16] Provide a zero-copy method on KVM virtio-net.
From: Avi Kivity @ 2010-08-02 16:32 UTC (permalink / raw)
  To: Shirley Ma
  Cc: xiaohui.xin, netdev, kvm, linux-kernel, mst, mingo, davem,
	herbert, jdike
In-Reply-To: <1280766319.22830.24.camel@localhost.localdomain>

  On 08/02/2010 07:25 PM, Shirley Ma wrote:
> On Mon, 2010-08-02 at 19:11 +0300, Avi Kivity wrote:
>> I don't understand this. gup_fast() only calls gup() if the page is
>> swapped out or read-only.
> Oh, I used the page as read-only on xmit path. Should I use write
> instead?

No, for xmit getting the page as read only is fine.

I was inaccurate, gup_fast() performs as follows:

- if .write = 1, gup_fast() will be fast if the page is mapped and writeable
- if .write = 0, gup_fast() will be fast if the page is mapped

so, using .write = 0 for the xmit path will be faster in more cases than 
.write = 1.

When are you seeing gup_fast() fall back to gup()?  It should be at most 
once per page (when a guest starts up none of its pages are mapped, it 
faults them in on demand).
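
The two rules above can be written down as a tiny decision function. This is a toy model, not kernel code: struct pte_model and fast_path_ok() are names invented for illustration, and real page-table checks involve more state than two booleans.

```c
#include <stdbool.h>

/* Toy stand-in for the state of a page-table entry. */
struct pte_model {
	bool mapped;	/* page is present in the page tables */
	bool writable;	/* mapping allows writes */
};

/*
 * Returns true when the lockless fast path could pin the page without
 * falling back to the slow get_user_pages() path.
 */
static bool fast_path_ok(struct pte_model pte, bool write)
{
	if (!pte.mapped)
		return false;		/* not faulted in yet: slow path */
	if (write && !pte.writable)
		return false;		/* write access to a read-only mapping */
	return true;
}
```

A mapped read-only page passes with write = false but not with write = true, which is why .write = 0 stays on the fast path in more cases.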

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply

* Re: [PATCH v2 2/2] macvtap: Implement multiqueue macvtap driver
From: Arnd Bergmann @ 2010-08-02 16:34 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: bhutchings, davem, mst, netdev, therbert
In-Reply-To: <OFE0F15F8D.2C34C2AD-ON65257773.00579311-65257773.005A60A9@in.ibm.com>

On Monday 02 August 2010, Krishna Kumar2 wrote:
> "include/linux/if_macvlan.h:57: error: braced-group within expression
>         allowed only inside a function"
> Please let me know if there is any alternative (curly braces cannot be
> used outside of functions). Otherwise one change required is to add:
> 
> #ifndef MIN
> #endif
> 
> I will wait for a few hours and resubmit the patches.

Maybe just open-code the minimum computation:

#define MAX_MACVTAP_QUEUES   (NR_CPUS < 16 ? NR_CPUS : 16)
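
For reference, a minimal sketch of why the open-coded form works where min() does not: the kernel's min() is built on a GCC statement expression, which is what the "braced-group within expression" error refers to, while a plain conditional expression is an integer constant expression and can size a file-scope array. NR_CPUS is hard-coded to an assumed value here, and queue_state and max_queues() are hypothetical names.

```c
/* NR_CPUS normally comes from the kernel configuration; 8 is assumed. */
#define NR_CPUS 8

/* Plain conditional expression: usable outside of any function. */
#define MAX_MACVTAP_QUEUES (NR_CPUS < 16 ? NR_CPUS : 16)

/* File-scope array sized by the macro. */
static int queue_state[MAX_MACVTAP_QUEUES];

/* Number of queue slots the array above provides. */
static unsigned int max_queues(void)
{
	return sizeof(queue_state) / sizeof(queue_state[0]);
}
```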

	Arnd

^ permalink raw reply

* Re: [PATCH] tc: make symbols loaded from tc action modules global.
From: Stephen Hemminger @ 2010-08-02 16:55 UTC (permalink / raw)
  To: Andreas Henriksson; +Cc: netdev
In-Reply-To: <20100802073032.GA32046@amd64.fatal.se>

On Mon, 2 Aug 2010 09:30:33 +0200
Andreas Henriksson <andreas@fatal.se> wrote:

> Fixes problems with xtables based MARK target ("ipt" module).
> When tc loads the "ipt" (xt) module it kept the symbols local,
> this made loading of libxtables not find the required struct.
> 
> Currently ipt/xt is the only tc action module.
> iproute2 never seems to call dlclose.
> Hopefully the modules don't export more symbols than needed.
> 
> In this situation, hopefully the RTLD_GLOBAL flag won't hurt us.
> 
> I've been using this patch in the Debian package of iproute for
> the last 3 weeks and no one has complained.
> ( This fixes http://bugs.debian.org/584898 )
> 
> Signed-off-by: Andreas Henriksson <andreas@fatal.se>
> ---
>  tc/m_action.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/tc/m_action.c b/tc/m_action.c
> index a198158..6464b2e 100644
> --- a/tc/m_action.c
> +++ b/tc/m_action.c
> @@ -99,7 +99,7 @@ restart_s:
>  	}
>  
>  	snprintf(buf, sizeof(buf), "%s/m_%s.so", get_tc_lib(), str);
> -	dlh = dlopen(buf, RTLD_LAZY);
> +	dlh = dlopen(buf, RTLD_LAZY | RTLD_GLOBAL);
>  	if (dlh == NULL) {
>  		dlh = aBODY;
>  		if (dlh == NULL) {


Applied

^ permalink raw reply

* Re: why do we need printk on sending syn flood cookie?
From: Mitchell Erblich @ 2010-08-02 18:10 UTC (permalink / raw)
  To: Franchoze Eric; +Cc: Florian Westphal, netdev
In-Reply-To: <23001280765498@web50.yandex.ru>

On Aug 2, 2010, at 9:11 AM, Franchoze Eric wrote:

> 
> 
> 02.08.10, 12:17, "Florian Westphal" <fw@strlen.de>:
> 
>> Franchoze Eric  wrote:
>>> Just curious: why do we need a printk each 1 second (60*HZ) about a possible SYN flood? It really floods dmesg. Is there something dangerous? I suggest turning off the printk about sending a TCP cookie each 1 second.
>> 
>> It is handled exactly like other printks in the networking path,
>> e.g. receipt of tcp wscale == 15.
>> 
>> Why does this need special treatment?
>> 
> 
> For now I see the "possible SYN flooding on port %d. Sending cookies.\n" message each second on my server. I know that there are a lot of SYNs and I know that the kernel sends cookies. Why do I need so many printks?
> So I suggest adding a new value to /proc/sys/net/ipv4/tcp_syncookies that enables cookies but turns this printk off.

One print per second is a very good GENERIC informative message to an admin that
this system either has some very small configured or default values
(normally set up as a percentage of memory, via a socket option, and/or ...),
and/or that for some reason a large number of SYNs are being received,
and/or that a number of connections are being intentionally or
unintentionally retried and/or dropped.

Remember that each printk may represent only a small fraction of the number
of SYNs received, and this fraction COULD depend on the Mb/Gb speed of the
interface(s) or, more likely, on some type of average or sum over the number
of network paths involved.
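
As a rough illustration of that last point, here is a user-space sketch of a time-based limiter. It is not the kernel's implementation: struct warn_limiter, warn_allowed() and simulate() are hypothetical names, and time is just a counter.

```c
#include <stdbool.h>

/* Hypothetical limiter state. */
struct warn_limiter {
	unsigned long last;		/* time of the last emitted warning */
	unsigned long interval;		/* minimum spacing between warnings */
	unsigned long suppressed;	/* events seen since the last warning */
	bool primed;			/* false until the first warning */
};

/* Returns true when the caller should print its warning. */
static bool warn_allowed(struct warn_limiter *wl, unsigned long now)
{
	if (wl->primed && now - wl->last < wl->interval) {
		wl->suppressed++;
		return false;
	}
	wl->primed = true;
	wl->last = now;
	wl->suppressed = 0;
	return true;
}

/* Feeds `events` SYNs per tick for `ticks` ticks; returns warnings printed. */
static unsigned long simulate(unsigned long ticks, unsigned long events,
			      unsigned long interval)
{
	struct warn_limiter wl = { 0, interval, 0, false };
	unsigned long printed = 0;

	for (unsigned long t = 0; t < ticks; t++)
		for (unsigned long e = 0; e < events; e++)
			if (warn_allowed(&wl, t))
				printed++;
	return printed;
}
```

With 1000 events per tick over 10 ticks and an interval of 1, this prints 10 warnings for 10000 events, so the message count says little about the actual SYN rate.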

Mitchell Erblich

> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* softirq warnings when calling dev_kfree_skb_irq - bug in conntrack?
From: Jeremy Fitzhardinge @ 2010-08-02 18:54 UTC (permalink / raw)
  To: NetDev
  Cc: Xu, Dongxiao, Xen-devel@lists.xensource.com, Ian Campbell,
	Patrick McHardy, Eric Dumazet

  Hi,

I'm seeing this in the current linux-next tree:

------------[ cut here ]------------
WARNING: at kernel/softirq.c:143 local_bh_enable+0x40/0x87()
Modules linked in: xt_state dm_mirror dm_region_hash dm_log microcode [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted 2.6.35-rc6-next-20100729+ #29
Call Trace:
  <IRQ>   [<ffffffff81030de3>] warn_slowpath_common+0x80/0x98
  [<ffffffff81030e10>] warn_slowpath_null+0x15/0x17
  [<ffffffff81035ff3>] local_bh_enable+0x40/0x87
  [<ffffffff814236e5>] destroy_conntrack+0x78/0x9e
  [<ffffffff810bea55>] ? __kmalloc_track_caller+0xc3/0x135
  [<ffffffff814203b4>] nf_conntrack_destroy+0x16/0x18
  [<ffffffff813fadee>] skb_release_head_state+0x97/0xd9
  [<ffffffff813fabbe>] __kfree_skb+0x11/0x7a
  [<ffffffff813fac4e>] consume_skb+0x27/0x29
  [<ffffffff81402d3a>] dev_kfree_skb_irq+0x18/0x62
  [<ffffffff8130a762>] xennet_tx_buf_gc+0xfc/0x192
  [<ffffffff8130a8fb>] smart_poll_function+0x50/0x121
  [<ffffffff8130a8ab>] ? smart_poll_function+0x0/0x121
  [<ffffffff8104b8d1>] __run_hrtimer+0xcc/0x127
  [<ffffffff8104bad3>] hrtimer_interrupt+0x9c/0x17b
  [<ffffffff81005f24>] xen_timer_interrupt+0x2a/0x13e
  [<ffffffff81006180>] ? check_events+0x12/0x22
  [<ffffffff81005be9>] ? xen_force_evtchn_callback+0xd/0xf
  [<ffffffff81005be9>] ? xen_force_evtchn_callback+0xd/0xf
  [<ffffffff81077641>] handle_IRQ_event+0x52/0x119
  [<ffffffff81079abe>] handle_level_irq+0x6c/0xb2
  [<ffffffff8127b3dd>] __xen_evtchn_do_upcall+0xa9/0x12a
  [<ffffffff8100616d>] ? xen_restore_fl_direct_end+0x0/0x1
  [<ffffffff8127b491>] xen_evtchn_do_upcall+0x28/0x39
  [<ffffffff810097ac>] xen_do_hypervisor_callback+0x1c/0x30
  <EOI>   [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
  [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
  [<ffffffff81005c2d>] ? xen_safe_halt+0x10/0x1a
  [<ffffffff81003eb4>] ? xen_idle+0x38/0x44
  [<ffffffff81007de4>] ? cpu_idle+0x82/0xe9
  [<ffffffff814b84e3>] ? rest_init+0x67/0x69
  [<ffffffff81afcc10>] ? start_kernel+0x387/0x392
  [<ffffffff81afc2c8>] ? x86_64_start_reservations+0xb3/0xb7
  [<ffffffff81affed2>] ? xen_start_kernel+0x4be/0x4c2
---[ end trace 755676650ea49003 ]---


The warning is:

	WARN_ON_ONCE(in_irq() || irqs_disabled());


It seems the basic problem is that xennet_tx_buf_gc() is being called in 
interrupt context: with smartpoll it's from the timer interrupt, but 
even without it, it is being called from xennet_interrupt(), which in turn 
calls dev_kfree_skb_irq().

Since this should be perfectly OK, it appears the problem is actually in 
conntrack.  I'm not sure when this bug started happening, but it's 
relatively recent, I think.

Thanks,
     J

^ permalink raw reply

* Re: [PATCH] ip_fragment: fix subtracting PPPOE_SES_HLEN from mtu twice
From: Bart De Schuymer @ 2010-08-02 19:20 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Changli Gao, David S. Miller, netdev
In-Reply-To: <4C56E961.9070107@trash.net>

Patrick McHardy wrote:
> On 01.08.2010 01:25, Changli Gao wrote:
>   
>> 6c79bf0f2440fd250c8fce8d9b82fcf03d4e8350 subtracts PPPOE_SES_HLEN from mtu at
>> the front of ip_fragment(). So the later subtraction should be removed. The
>> MTU of 802.1q is also 1500, so MTU should not be changed.
>>     
>
> Bart, please review, thanks.
>
>   
The patch looks correct. The commit Changli refers to fixed the case 
where fragments are already available but broke the slow_path. The MTU 
for 802.1Q is indeed also 1500...
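
To put numbers on the double subtraction, here is a small arithmetic sketch. The helper names frag_mtu_double() and frag_mtu_once() are made up for illustration; the constants are the usual Ethernet payload size and PPPoE session header length.

```c
#define ETH_DATA_LEN	1500	/* standard Ethernet payload size */
#define PPPOE_SES_HLEN	8	/* PPPoE session header length */

/* Header length subtracted up front and then again on the slow path. */
static unsigned int frag_mtu_double(unsigned int mtu, unsigned int pad)
{
	mtu -= pad;	/* subtraction at the front of ip_fragment() */
	mtu -= pad;	/* second subtraction on the slow path */
	return mtu;
}

/* Header length subtracted once, as intended. */
static unsigned int frag_mtu_once(unsigned int mtu, unsigned int pad)
{
	return mtu - pad;
}
```

For bridged PPPoE traffic on a 1500-byte link this gives 1484 instead of 1492, so every slow-path fragment is 8 bytes smaller than it needs to be.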

cheers,
Bart

Signed-off-by: Bart De Schuymer <bdschuym@pandora.be>
>> Signed-off-by: Changli Gao <xiaosuo@gmail.com>
>> ----
>>  net/ipv4/ip_output.c |    6 ++----
>>  1 file changed, 2 insertions(+), 4 deletions(-)
>> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
>> index 6652bd9..04b6989 100644
>> --- a/net/ipv4/ip_output.c
>> +++ b/net/ipv4/ip_output.c
>> @@ -446,7 +446,7 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
>>  	int ptr;
>>  	struct net_device *dev;
>>  	struct sk_buff *skb2;
>> -	unsigned int mtu, hlen, left, len, ll_rs, pad;
>> +	unsigned int mtu, hlen, left, len, ll_rs;
>>  	int offset;
>>  	__be16 not_last_frag;
>>  	struct rtable *rt = skb_rtable(skb);
>> @@ -585,9 +585,7 @@ slow_path:
>>  	/* for bridged IP traffic encapsulated inside f.e. a vlan header,
>>  	 * we need to make room for the encapsulating header
>>  	 */
>> -	pad = nf_bridge_pad(skb);
>> -	ll_rs = LL_RESERVED_SPACE_EXTRA(rt->dst.dev, pad);
>> -	mtu -= pad;
>> +	ll_rs = LL_RESERVED_SPACE_EXTRA(rt->dst.dev, nf_bridge_pad(skb));
>>  
>>  	/*
>>  	 *	Fragment the datagram.
>>
>>     
>
>
>   


-- 
Bart De Schuymer
www.artinalgorithms.be


^ permalink raw reply

* [PATCH 1/2] phy/marvell: add 88e1121 interface mode support
From: Cyril Chemparathy @ 2010-08-02 19:44 UTC (permalink / raw)
  To: netdev; +Cc: Cyril Chemparathy
In-Reply-To: <1280778294-2993-1-git-send-email-cyril@ti.com>

This patch adds support for RGMII RX/TX delay configuration on marvell 88e1121
and derivatives.  With this patch, PHY_INTERFACE_MODE_RGMII_*ID modes are now
supported on these devices.

Signed-off-by: Cyril Chemparathy <cyril@ti.com>
---
 drivers/net/phy/marvell.c |   35 ++++++++++++++++++++++++++++++++---
 1 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c
index 78b74e8..b1413ae 100644
--- a/drivers/net/phy/marvell.c
+++ b/drivers/net/phy/marvell.c
@@ -69,6 +69,12 @@
 #define MII_M1111_COPPER		0
 #define MII_M1111_FIBER			1
 
+#define MII_88E1121_PHY_MSCR_PAGE	2
+#define MII_88E1121_PHY_MSCR_REG	21
+#define MII_88E1121_PHY_MSCR_RX_DELAY	BIT(5)
+#define MII_88E1121_PHY_MSCR_TX_DELAY	BIT(4)
+#define MII_88E1121_PHY_MSCR_DELAY_MASK	(~(0x3 << 4))
+
 #define MII_88E1121_PHY_LED_CTRL	16
 #define MII_88E1121_PHY_LED_PAGE	3
 #define MII_88E1121_PHY_LED_DEF		0x0030
@@ -180,7 +186,30 @@ static int marvell_config_aneg(struct phy_device *phydev)
 
 static int m88e1121_config_aneg(struct phy_device *phydev)
 {
-	int err, temp;
+	int err, oldpage, mscr;
+
+	oldpage = phy_read(phydev, MII_88E1121_PHY_PAGE);
+
+	err = phy_write(phydev, MII_88E1121_PHY_PAGE,
+			MII_88E1121_PHY_MSCR_PAGE);
+	if (err < 0)
+		return err;
+	mscr = phy_read(phydev, MII_88E1121_PHY_MSCR_REG) &
+		MII_88E1121_PHY_MSCR_DELAY_MASK;
+
+	if (phydev->interface == PHY_INTERFACE_MODE_RGMII_ID)
+		mscr |= (MII_88E1121_PHY_MSCR_RX_DELAY |
+			 MII_88E1121_PHY_MSCR_TX_DELAY);
+	else if (phydev->interface == PHY_INTERFACE_MODE_RGMII_RXID)
+		mscr |= MII_88E1121_PHY_MSCR_RX_DELAY;
+	else if (phydev->interface == PHY_INTERFACE_MODE_RGMII_TXID)
+		mscr |= MII_88E1121_PHY_MSCR_TX_DELAY;
+
+	err = phy_write(phydev, MII_88E1121_PHY_MSCR_REG, mscr);
+	if (err < 0)
+		return err;
+
+	phy_write(phydev, MII_88E1121_PHY_PAGE, oldpage);
 
 	err = phy_write(phydev, MII_BMCR, BMCR_RESET);
 	if (err < 0)
@@ -191,11 +220,11 @@ static int m88e1121_config_aneg(struct phy_device *phydev)
 	if (err < 0)
 		return err;
 
-	temp = phy_read(phydev, MII_88E1121_PHY_PAGE);
+	oldpage = phy_read(phydev, MII_88E1121_PHY_PAGE);
 
 	phy_write(phydev, MII_88E1121_PHY_PAGE, MII_88E1121_PHY_LED_PAGE);
 	phy_write(phydev, MII_88E1121_PHY_LED_CTRL, MII_88E1121_PHY_LED_DEF);
-	phy_write(phydev, MII_88E1121_PHY_PAGE, temp);
+	phy_write(phydev, MII_88E1121_PHY_PAGE, oldpage);
 
 	err = genphy_config_aneg(phydev);
 
-- 
1.7.0.4
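
For readers who want the bit manipulation in the patch above in isolation, here is a stand-alone sketch of how the delay bits are derived from the interface mode. The enum rgmii_mode and mscr_for_mode() are hypothetical stand-ins for phy_interface_t and the register read-modify-write in the driver; the bit positions are the ones defined in the patch.

```c
#define BIT(n)			(1U << (n))
#define MSCR_RX_DELAY		BIT(5)
#define MSCR_TX_DELAY		BIT(4)
#define MSCR_DELAY_MASK		(~(0x3U << 4))

/* Stand-in for the PHY_INTERFACE_MODE_RGMII* values. */
enum rgmii_mode {
	MODE_RGMII,		/* no internal delay */
	MODE_RGMII_ID,		/* RX and TX delay */
	MODE_RGMII_RXID,	/* RX delay only */
	MODE_RGMII_TXID,	/* TX delay only */
};

/* Computes the new register value from the old one and the mode. */
static unsigned int mscr_for_mode(unsigned int mscr, enum rgmii_mode mode)
{
	mscr &= MSCR_DELAY_MASK;	/* clear both delay bits first */

	switch (mode) {
	case MODE_RGMII_ID:
		mscr |= MSCR_RX_DELAY | MSCR_TX_DELAY;
		break;
	case MODE_RGMII_RXID:
		mscr |= MSCR_RX_DELAY;
		break;
	case MODE_RGMII_TXID:
		mscr |= MSCR_TX_DELAY;
		break;
	default:
		break;			/* plain RGMII: leave both cleared */
	}
	return mscr;
}
```

Clearing both bits before setting the requested ones means a previously configured delay does not leak into a mode that should not have it.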


^ permalink raw reply related

* [PATCH 0/2] Minor extensions to marvell phy driver
From: Cyril Chemparathy @ 2010-08-02 19:44 UTC (permalink / raw)
  To: netdev; +Cc: Cyril Chemparathy

This patch series adds a couple of minor extensions to the marvell phy driver.
The first patch in the series allows for RGMII TX and RX delay configuration
via interface mode.  The second patch adds support for a new device (88ec048).

Cyril Chemparathy (2):
  phy/marvell: add 88e1121 interface mode support
  phy/marvell: add 88ec048 support

 drivers/net/phy/marvell.c |   76 +++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 73 insertions(+), 3 deletions(-)

^ permalink raw reply

* [PATCH 2/2] phy/marvell: add 88ec048 support
From: Cyril Chemparathy @ 2010-08-02 19:44 UTC (permalink / raw)
  To: netdev; +Cc: Cyril Chemparathy
In-Reply-To: <1280778294-2993-1-git-send-email-cyril@ti.com>

Marvell 88ec048 is a derivative of its 88e1121r device.  From the programmer's
perspective, the one major difference is an additional control bit in
Page 2 Register 16, used to control the padding of odd nibble preambles.

This patch adds support for this new device, while inheriting as much code as
possible from the existing 88e1121r implementation.

Signed-off-by: Cyril Chemparathy <cyril@ti.com>
---
 drivers/net/phy/marvell.c |   41 +++++++++++++++++++++++++++++++++++++++++
 1 files changed, 41 insertions(+), 0 deletions(-)

diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c
index b1413ae..0887218 100644
--- a/drivers/net/phy/marvell.c
+++ b/drivers/net/phy/marvell.c
@@ -75,6 +75,9 @@
 #define MII_88E1121_PHY_MSCR_TX_DELAY	BIT(4)
 #define MII_88E1121_PHY_MSCR_DELAY_MASK	(~(0x3 << 4))
 
+#define MII_88EC048_PHY_MSCR1_REG	16
+#define MII_88EC048_PHY_MSCR1_PAD_ODD	BIT(6)
+
 #define MII_88E1121_PHY_LED_CTRL	16
 #define MII_88E1121_PHY_LED_PAGE	3
 #define MII_88E1121_PHY_LED_DEF		0x0030
@@ -231,6 +234,31 @@ static int m88e1121_config_aneg(struct phy_device *phydev)
 	return err;
 }
 
+static int m88ec048_config_aneg(struct phy_device *phydev)
+{
+	int err, oldpage, mscr;
+
+	oldpage = phy_read(phydev, MII_88E1121_PHY_PAGE);
+
+	err = phy_write(phydev, MII_88E1121_PHY_PAGE,
+			MII_88E1121_PHY_MSCR_PAGE);
+	if (err < 0)
+		return err;
+
+	mscr = phy_read(phydev, MII_88EC048_PHY_MSCR1_REG);
+	mscr |= MII_88EC048_PHY_MSCR1_PAD_ODD;
+
+	err = phy_write(phydev, MII_88EC048_PHY_MSCR1_REG, mscr);
+	if (err < 0)
+		return err;
+
+	err = phy_write(phydev, MII_88E1121_PHY_PAGE, oldpage);
+	if (err < 0)
+		return err;
+
+	return m88e1121_config_aneg(phydev);
+}
+
 static int m88e1111_config_init(struct phy_device *phydev)
 {
 	int err;
@@ -622,6 +650,19 @@ static struct phy_driver marvell_drivers[] = {
 		.driver = { .owner = THIS_MODULE },
 	},
 	{
+		.phy_id = 0x01410e90,
+		.phy_id_mask = 0xfffffff0,
+		.name = "Marvell 88EC048",
+		.features = PHY_GBIT_FEATURES,
+		.flags = PHY_HAS_INTERRUPT,
+		.config_aneg = &m88ec048_config_aneg,
+		.read_status = &marvell_read_status,
+		.ack_interrupt = &marvell_ack_interrupt,
+		.config_intr = &marvell_config_intr,
+		.did_interrupt = &m88e1121_did_interrupt,
+		.driver = { .owner = THIS_MODULE },
+	},
+	{
 		.phy_id = 0x01410cd0,
 		.phy_id_mask = 0xfffffff0,
 		.name = "Marvell 88E1145",
-- 
1.7.0.4


^ permalink raw reply related

* Re: [PATCH 01/11] pcmcia: use pcmica_{read,write}_config_byte
From: Dominik Brodowski @ 2010-08-02 19:52 UTC (permalink / raw)
  To: Komuro; +Cc: Michael Buesch, netdev, linux-pcmcia, linux-wireless,
	linux-serial
In-Reply-To: <15238373.192661280750366920.komurojun-mbn@nifty.com>

Hey,

On Mon, Aug 02, 2010 at 08:59:26PM +0900, Komuro wrote:
> >--- a/drivers/net/pcmcia/xirc2ps_cs.c
> >+++ b/drivers/net/pcmcia/xirc2ps_cs.c
> 
> 
> >+	if (err)
> > 	    goto config_error;
> >-	reg.Action = CS_WRITE;
> >-	reg.Offset = CISREG_IOBASE_1;
> >-	reg.Value = (link->io.BasePort2 >> 8) & 0xff;
> >-	if ((err = pcmcia_access_configuration_register(link, &reg)))
> >+
> >+	err = pcmcia_write_config_byte(link, CISREG_IOBASE_1,
> >+				link->io.BasePort2 & 0xff);
> 
> It should be
> 
> 	err = pcmcia_write_config_byte(link, CISREG_IOBASE_1,
> 				(link->io.BasePort2 >> 8) & 0xff);
> 

Fixed, thanks.
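
For anyone following along, a tiny sketch of the byte split in question: the low byte of the 16-bit I/O base goes into one register and the high byte into the next. The helper names io_base_low() and io_base_high() are made up for illustration.

```c
/* Low byte of a 16-bit I/O base address. */
static unsigned char io_base_low(unsigned int port)
{
	return port & 0xff;
}

/* High byte of a 16-bit I/O base address. */
static unsigned char io_base_high(unsigned int port)
{
	return (port >> 8) & 0xff;
}
```

For a base of 0x02f8 the high byte is 0x02; without the shift the register would have received 0xf8 a second time.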

Best,
	Dominik

^ permalink raw reply

* [PATCH 01/28] netfilter: nf_conntrack_reasm: add fast path for in-order fragments
From: kaber @ 2010-08-02 19:57 UTC (permalink / raw)
  To: davem; +Cc: netfilter-devel, netdev
In-Reply-To: <1280779065-9333-1-git-send-email-kaber@trash.net>

From: Changli Gao <xiaosuo@gmail.com>

As fragments are sent in order by most OSes, such as Windows, Darwin and
FreeBSD, new fragments are likely to belong at the end of the inet_frag_queue.
In the fast path, we check whether the skb at the end of the inet_frag_queue is
the prev we expect.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 net/ipv6/netfilter/nf_conntrack_reasm.c |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 9254008..098a050 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -269,6 +269,11 @@ static int nf_ct_frag6_queue(struct nf_ct_frag6_queue *fq, struct sk_buff *skb,
 	 * in the chain of fragments so far.  We must know where to put
 	 * this fragment, right?
 	 */
+	prev = fq->q.fragments_tail;
+	if (!prev || NFCT_FRAG6_CB(prev)->offset < offset) {
+		next = NULL;
+		goto found;
+	}
 	prev = NULL;
 	for (next = fq->q.fragments; next != NULL; next = next->next) {
 		if (NFCT_FRAG6_CB(next)->offset >= offset)
@@ -276,6 +281,7 @@ static int nf_ct_frag6_queue(struct nf_ct_frag6_queue *fq, struct sk_buff *skb,
 		prev = next;
 	}
 
+found:
 	/* We found where to put this one.  Check for overlap with
 	 * preceding fragment, and, if needed, align things so that
 	 * any overlaps are eliminated.
@@ -341,6 +347,8 @@ static int nf_ct_frag6_queue(struct nf_ct_frag6_queue *fq, struct sk_buff *skb,
 
 	/* Insert this fragment in the chain of fragments. */
 	skb->next = next;
+	if (!next)
+		fq->q.fragments_tail = skb;
 	if (prev)
 		prev->next = skb;
 	else
@@ -464,6 +472,7 @@ nf_ct_frag6_reasm(struct nf_ct_frag6_queue *fq, struct net_device *dev)
 					  head->csum);
 
 	fq->q.fragments = NULL;
+	fq->q.fragments_tail = NULL;
 
 	/* all original skbs are linked into the NFCT_FRAG6_CB(head).orig */
 	fp = skb_shinfo(head)->frag_list;
-- 
1.7.1
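
The idea of the patch above, reduced to a plain sorted list with a tail pointer, might look like the following sketch. struct frag, struct frag_queue and frag_insert() are simplified stand-ins rather than the kernel structures, and overlap handling is left out entirely.

```c
#include <stddef.h>

struct frag {
	unsigned int offset;
	struct frag *next;
};

struct frag_queue {
	struct frag *head;
	struct frag *tail;
};

/*
 * Inserts f in offset order. Returns the number of nodes visited by the
 * list walk, which is 0 when the tail fast path applies.
 */
static unsigned int frag_insert(struct frag_queue *q, struct frag *f)
{
	struct frag *prev = q->tail;
	struct frag *next = NULL;
	unsigned int walked = 0;

	/* Fast path: empty queue, or f belongs after the current tail. */
	if (prev && prev->offset >= f->offset) {
		/* Slow path: walk from the head to find the position. */
		prev = NULL;
		for (next = q->head; next != NULL; next = next->next) {
			walked++;
			if (next->offset >= f->offset)
				break;
			prev = next;
		}
	}

	f->next = next;
	if (!next)
		q->tail = f;	/* f is the new last fragment */
	if (prev)
		prev->next = f;
	else
		q->head = f;
	return walked;
}
```

In-order arrivals never enter the loop, so the cost per fragment stays constant no matter how long the queue already is.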


^ permalink raw reply related

* [PATCH 04/28] ipvs: Kconfig cleanup
From: kaber @ 2010-08-02 19:57 UTC (permalink / raw)
  To: davem; +Cc: netfilter-devel, netdev
In-Reply-To: <1280779065-9333-1-git-send-email-kaber@trash.net>

From: Michal Marek <mmarek@suse.cz>

IP_VS_PROTO_AH_ESP should be set iff either of IP_VS_PROTO_{AH,ESP} is
selected. Express this with standard kconfig syntax.

Signed-off-by: Michal Marek <mmarek@suse.cz>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 net/netfilter/ipvs/Kconfig |    5 +----
 1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/ipvs/Kconfig b/net/netfilter/ipvs/Kconfig
index 712ccad..d80b41a 100644
--- a/net/netfilter/ipvs/Kconfig
+++ b/net/netfilter/ipvs/Kconfig
@@ -87,19 +87,16 @@ config	IP_VS_PROTO_UDP
 	  protocol. Say Y if unsure.
 
 config	IP_VS_PROTO_AH_ESP
-	bool
-	depends on UNDEFINED
+	def_bool IP_VS_PROTO_ESP || IP_VS_PROTO_AH
 
 config	IP_VS_PROTO_ESP
 	bool "ESP load balancing support"
-	select IP_VS_PROTO_AH_ESP
 	---help---
 	  This option enables support for load balancing ESP (Encapsulation
 	  Security Payload) transport protocol. Say Y if unsure.
 
 config	IP_VS_PROTO_AH
 	bool "AH load balancing support"
-	select IP_VS_PROTO_AH_ESP
 	---help---
 	  This option enables support for load balancing AH (Authentication
 	  Header) transport protocol. Say Y if unsure.
-- 
1.7.1


^ permalink raw reply related

* [PATCH 06/28] netfilter: xt_TPROXY: the length of lines should be within 80
From: kaber @ 2010-08-02 19:57 UTC (permalink / raw)
  To: davem; +Cc: netfilter-devel, netdev
In-Reply-To: <1280779065-9333-1-git-send-email-kaber@trash.net>

From: Changli Gao <xiaosuo@gmail.com>

According to Documentation/CodingStyle, lines should be no longer than
80 columns.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 net/netfilter/xt_TPROXY.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/xt_TPROXY.c b/net/netfilter/xt_TPROXY.c
index e1a0ded..c61294d 100644
--- a/net/netfilter/xt_TPROXY.c
+++ b/net/netfilter/xt_TPROXY.c
@@ -37,8 +37,10 @@ tproxy_tg(struct sk_buff *skb, const struct xt_action_param *par)
 		return NF_DROP;
 
 	sk = nf_tproxy_get_sock_v4(dev_net(skb->dev), iph->protocol,
-				   iph->saddr, tgi->laddr ? tgi->laddr : iph->daddr,
-				   hp->source, tgi->lport ? tgi->lport : hp->dest,
+				   iph->saddr,
+				   tgi->laddr ? tgi->laddr : iph->daddr,
+				   hp->source,
+				   tgi->lport ? tgi->lport : hp->dest,
 				   par->in, true);
 
 	/* NOTE: assign_sock consumes our sk reference */
-- 
1.7.1


^ permalink raw reply related

* [PATCH 08/28] netfilter: nf_ct_tcp: fix flow recovery with TCP window tracking enabled
From: kaber @ 2010-08-02 19:57 UTC (permalink / raw)
  To: davem; +Cc: netfilter-devel, netdev
In-Reply-To: <1280779065-9333-1-git-send-email-kaber@trash.net>

From: Pablo Neira Ayuso <pablo@netfilter.org>

This patch adds the missing bits to support the recovery of TCP flows
without disabling window tracking (aka be_liberal). To ensure a
successful recovery, we have to inject the window scale factor via
ctnetlink.

This patch has been tested with a development snapshot of conntrackd
and the new clause `TCPWindowTracking', which allows strict TCP window
tracking recovery across fail-overs.

With this patch, we don't update the receiver's window until it has been
initialized. We require this to perform a successful recovery. Jozsef
confirmed in a private email that this spotted a real issue, since that
should not happen.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 net/netfilter/nf_conntrack_proto_tcp.c |   10 +++++++++-
 1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
index 802dbff..c4c885d 100644
--- a/net/netfilter/nf_conntrack_proto_tcp.c
+++ b/net/netfilter/nf_conntrack_proto_tcp.c
@@ -585,8 +585,16 @@ static bool tcp_in_window(const struct nf_conn *ct,
 			 * Let's try to use the data from the packet.
 			 */
 			sender->td_end = end;
+			win <<= sender->td_scale;
 			sender->td_maxwin = (win == 0 ? 1 : win);
 			sender->td_maxend = end + sender->td_maxwin;
+			/*
+			 * We haven't seen traffic in the other direction yet
+			 * but we have to tweak window tracking to pass III
+			 * and IV until that happens.
+			 */
+			if (receiver->td_maxwin == 0)
+				receiver->td_end = receiver->td_maxend = sack;
 		}
 	} else if (((state->state == TCP_CONNTRACK_SYN_SENT
 		     && dir == IP_CT_DIR_ORIGINAL)
@@ -680,7 +688,7 @@ static bool tcp_in_window(const struct nf_conn *ct,
 		/*
 		 * Update receiver data.
 		 */
-		if (after(end, sender->td_maxend))
+		if (receiver->td_maxwin != 0 && after(end, sender->td_maxend))
 			receiver->td_maxwin += end - sender->td_maxend;
 		if (after(sack + win, receiver->td_maxend - 1)) {
 			receiver->td_maxend = sack + win;
-- 
1.7.1


^ permalink raw reply related

* [PATCH 10/28] netfilter: correct CHECKSUM header and export it
From: kaber @ 2010-08-02 19:57 UTC (permalink / raw)
  To: davem; +Cc: netfilter-devel, netdev
In-Reply-To: <1280779065-9333-1-git-send-email-kaber@trash.net>

From: Michael S. Tsirkin <mst@redhat.com>

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 include/linux/netfilter/Kbuild        |    1 +
 include/linux/netfilter/xt_CHECKSUM.h |    8 +++++---
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/netfilter/Kbuild b/include/linux/netfilter/Kbuild
index bb103f4..b93b64d 100644
--- a/include/linux/netfilter/Kbuild
+++ b/include/linux/netfilter/Kbuild
@@ -3,6 +3,7 @@ header-y += nf_conntrack_tuple_common.h
 header-y += nfnetlink_conntrack.h
 header-y += nfnetlink_log.h
 header-y += nfnetlink_queue.h
+header-y += xt_CHECKSUM.h
 header-y += xt_CLASSIFY.h
 header-y += xt_CONNMARK.h
 header-y += xt_CONNSECMARK.h
diff --git a/include/linux/netfilter/xt_CHECKSUM.h b/include/linux/netfilter/xt_CHECKSUM.h
index 3b4fb77..9a2e466 100644
--- a/include/linux/netfilter/xt_CHECKSUM.h
+++ b/include/linux/netfilter/xt_CHECKSUM.h
@@ -6,8 +6,10 @@
  *
  * This software is distributed under GNU GPL v2, 1991
 */
-#ifndef _IPT_CHECKSUM_TARGET_H
-#define _IPT_CHECKSUM_TARGET_H
+#ifndef _XT_CHECKSUM_TARGET_H
+#define _XT_CHECKSUM_TARGET_H
+
+#include <linux/types.h>
 
 #define XT_CHECKSUM_OP_FILL	0x01	/* fill in checksum in IP header */
 
@@ -15,4 +17,4 @@ struct xt_CHECKSUM_info {
 	__u8 operation;	/* bitset of operations */
 };
 
-#endif /* _IPT_CHECKSUM_TARGET_H */
+#endif /* _XT_CHECKSUM_TARGET_H */
-- 
1.7.1


^ permalink raw reply related

* [PATCH 12/28] IPVS: make friends with nf_conntrack
From: kaber @ 2010-08-02 19:57 UTC (permalink / raw)
  To: davem; +Cc: netfilter-devel, netdev
In-Reply-To: <1280779065-9333-1-git-send-email-kaber@trash.net>

From: Hannes Eder <heder@google.com>

Update the nf_conntrack tuple in reply direction, as we will see
traffic from the real server (RIP) to the client (CIP).  Once this is
done we can use netfilter's SNAT in POSTROUTING, especially with
xt_ipvs, to do source NAT, e.g.:

% iptables -t nat -A POSTROUTING -m ipvs --vaddr 192.168.100.30/32 --vport 80 \
		  -j SNAT --to-source 192.168.10.10

[ minor fixes by Simon Horman <horms@verge.net.au> ]
Signed-off-by: Hannes Eder <heder@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Patrick McHardy <kaber@trash.net>
---
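The effect of the reply tuple update can be pictured with a small userspace
model. This is a sketch under simplifying assumptions, not the conntrack
data structures: struct endpoint, struct tuple and alter_reply_src() are
hypothetical names, and addresses are plain host-order integers.

```c
#include <stdint.h>

/* Hypothetical, simplified endpoint and tuple types. */
struct endpoint {
	uint32_t addr;
	uint16_t port;
};

struct tuple {
	struct endpoint src;
	struct endpoint dst;
};

/*
 * Before the update the expected reply is VIP:vport -> CIP:cport.
 * After DNAT the reply really comes from the real server, so only the
 * source side is replaced, giving RIP:dport -> CIP:cport. The original
 * direction (CIP -> VIP) is not touched.
 */
static struct tuple alter_reply_src(struct tuple reply,
				    struct endpoint real_server)
{
	reply.src = real_server;
	return reply;
}
```

The function takes the tuple by value and returns the altered copy, which
loosely mirrors how the patch builds new_tuple from the existing reply
tuple before handing it to nf_conntrack_alter_reply().
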
 net/netfilter/ipvs/Kconfig      |    2 +-
 net/netfilter/ipvs/ip_vs_core.c |   36 ------------------------------------
 net/netfilter/ipvs/ip_vs_xmit.c |   29 +++++++++++++++++++++++++++++
 3 files changed, 30 insertions(+), 37 deletions(-)

diff --git a/net/netfilter/ipvs/Kconfig b/net/netfilter/ipvs/Kconfig
index d80b41a..3662444 100644
--- a/net/netfilter/ipvs/Kconfig
+++ b/net/netfilter/ipvs/Kconfig
@@ -3,7 +3,7 @@
 #
 menuconfig IP_VS
 	tristate "IP virtual server support"
-	depends on NET && INET && NETFILTER
+	depends on NET && INET && NETFILTER && NF_CONNTRACK
 	---help---
 	  IP Virtual Server support will let you build a high-performance
 	  virtual server based on cluster of two or more real servers. This
diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index 50907d8..58f82df 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -536,26 +536,6 @@ int ip_vs_leave(struct ip_vs_service *svc, struct sk_buff *skb,
 	return NF_DROP;
 }
 
-
-/*
- *      It is hooked before NF_IP_PRI_NAT_SRC at the NF_INET_POST_ROUTING
- *      chain, and is used for VS/NAT.
- *      It detects packets for VS/NAT connections and sends the packets
- *      immediately. This can avoid that iptable_nat mangles the packets
- *      for VS/NAT.
- */
-static unsigned int ip_vs_post_routing(unsigned int hooknum,
-				       struct sk_buff *skb,
-				       const struct net_device *in,
-				       const struct net_device *out,
-				       int (*okfn)(struct sk_buff *))
-{
-	if (!skb->ipvs_property)
-		return NF_ACCEPT;
-	/* The packet was sent from IPVS, exit this chain */
-	return NF_STOP;
-}
-
 __sum16 ip_vs_checksum_complete(struct sk_buff *skb, int offset)
 {
 	return csum_fold(skb_checksum(skb, offset, skb->len - offset, 0));
@@ -1499,14 +1479,6 @@ static struct nf_hook_ops ip_vs_ops[] __read_mostly = {
 		.hooknum        = NF_INET_FORWARD,
 		.priority       = 99,
 	},
-	/* Before the netfilter connection tracking, exit from POST_ROUTING */
-	{
-		.hook		= ip_vs_post_routing,
-		.owner		= THIS_MODULE,
-		.pf		= PF_INET,
-		.hooknum        = NF_INET_POST_ROUTING,
-		.priority       = NF_IP_PRI_NAT_SRC-1,
-	},
 #ifdef CONFIG_IP_VS_IPV6
 	/* After packet filtering, forward packet through VS/DR, VS/TUN,
 	 * or VS/NAT(change destination), so that filtering rules can be
@@ -1535,14 +1507,6 @@ static struct nf_hook_ops ip_vs_ops[] __read_mostly = {
 		.hooknum        = NF_INET_FORWARD,
 		.priority       = 99,
 	},
-	/* Before the netfilter connection tracking, exit from POST_ROUTING */
-	{
-		.hook		= ip_vs_post_routing,
-		.owner		= THIS_MODULE,
-		.pf		= PF_INET6,
-		.hooknum        = NF_INET_POST_ROUTING,
-		.priority       = NF_IP6_PRI_NAT_SRC-1,
-	},
 #endif
 };
 
diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 02b078e..21e1a5e 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -28,6 +28,7 @@
 #include <net/ip6_route.h>
 #include <linux/icmpv6.h>
 #include <linux/netfilter.h>
+#include <net/netfilter/nf_conntrack.h>
 #include <linux/netfilter_ipv4.h>
 
 #include <net/ip_vs.h>
@@ -348,6 +349,30 @@ ip_vs_bypass_xmit_v6(struct sk_buff *skb, struct ip_vs_conn *cp,
 }
 #endif
 
+static void
+ip_vs_update_conntrack(struct sk_buff *skb, struct ip_vs_conn *cp)
+{
+	struct nf_conn *ct = (struct nf_conn *)skb->nfct;
+	struct nf_conntrack_tuple new_tuple;
+
+	if (ct == NULL || nf_ct_is_untracked(ct) || nf_ct_is_confirmed(ct))
+		return;
+
+	/*
+	 * The connection is not yet in the hashtable, so we update it.
+	 * CIP->VIP will remain the same, so leave the tuple in
+	 * IP_CT_DIR_ORIGINAL untouched.  When the reply comes back from the
+	 * real-server we will see RIP->DIP.
+	 */
+	new_tuple = ct->tuplehash[IP_CT_DIR_REPLY].tuple;
+	new_tuple.src.u3 = cp->daddr;
+	/*
+	 * This will also take care of UDP and other protocols.
+	 */
+	new_tuple.src.u.tcp.port = cp->dport;
+	nf_conntrack_alter_reply(ct, &new_tuple);
+}
+
 /*
  *      NAT transmitter (only for outside-to-inside nat forwarding)
  *      Not used for related ICMP
@@ -403,6 +428,8 @@ ip_vs_nat_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
 
 	IP_VS_DBG_PKT(10, pp, skb, 0, "After DNAT");
 
+	ip_vs_update_conntrack(skb, cp);
+
 	/* FIXME: when application helper enlarges the packet and the length
 	   is larger than the MTU of outgoing device, there will be still
 	   MTU problem. */
@@ -479,6 +506,8 @@ ip_vs_nat_xmit_v6(struct sk_buff *skb, struct ip_vs_conn *cp,
 
 	IP_VS_DBG_PKT(10, pp, skb, 0, "After DNAT");
 
+	ip_vs_update_conntrack(skb, cp);
+
 	/* FIXME: when application helper enlarges the packet and the length
 	   is larger than the MTU of outgoing device, there will be still
 	   MTU problem. */
-- 
1.7.1


^ permalink raw reply related

* [PATCH 13/28] IPVS: make FTP work with full NAT support
From: kaber @ 2010-08-02 19:57 UTC (permalink / raw)
  To: davem; +Cc: netfilter-devel, netdev
In-Reply-To: <1280779065-9333-1-git-send-email-kaber@trash.net>

From: Hannes Eder <heder@google.com>

Use nf_conntrack/nf_nat code to do the packet mangling and the TCP
sequence adjusting.  The function 'ip_vs_skb_replace' is now dead
code, so it is removed.

To SNAT FTP, use something like:

% iptables -t nat -A POSTROUTING -m ipvs --vaddr 192.168.100.30/32 \
    --vport 21 -j SNAT --to-source 192.168.10.10
and for the data connections in passive mode:

% iptables -t nat -A POSTROUTING -m ipvs --vaddr 192.168.100.30/32 \
    --vportctl 21 -j SNAT --to-source 192.168.10.10
Using '-m state --state RELATED' would also work.

Make sure the kernel modules ip_vs_ftp, nf_conntrack_ftp, and
nf_nat_ftp are loaded.

[ up-port and minor fixes by Simon Horman <horms@verge.net.au> ]
Signed-off-by: Hannes Eder <heder@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Patrick McHardy <kaber@trash.net>
---
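As background on why sequence adjusting is needed at all: rewriting the
address in a "227 Entering Passive Mode (h1,h2,h3,h4,p1,p2)" reply can
change the payload length. The userspace sketch below only illustrates
that arithmetic; ftp_format_addrport() and ftp_rewrite_delta() are
hypothetical helpers, and addresses and ports are taken in host byte
order for simplicity.

```c
#include <stdint.h>
#include <stdio.h>

/* Format an IPv4 address and port the way FTP carries them in the
 * payload: "h1,h2,h3,h4,p1,p2". Returns the string length. */
static int ftp_format_addrport(char *buf, size_t size,
			       uint32_t addr, uint16_t port)
{
	return snprintf(buf, size, "%u,%u,%u,%u,%u,%u",
			(unsigned)((addr >> 24) & 0xff),
			(unsigned)((addr >> 16) & 0xff),
			(unsigned)((addr >> 8) & 0xff),
			(unsigned)(addr & 0xff),
			(unsigned)(port >> 8),
			(unsigned)(port & 0xff));
}

/* Number of bytes the payload grows (or shrinks, if negative) when the
 * old address/port string is replaced by the new one. */
static int ftp_rewrite_delta(uint32_t old_addr, uint16_t old_port,
			     uint32_t new_addr, uint16_t new_port)
{
	char old_buf[24], new_buf[24];	/* xxx.xxx.xxx.xxx,ppp,ppp\0 */

	return ftp_format_addrport(new_buf, sizeof(new_buf),
				   new_addr, new_port) -
	       ftp_format_addrport(old_buf, sizeof(old_buf),
				   old_addr, old_port);
}
```

A non-zero delta shifts every later TCP sequence number on the connection,
which is the bookkeeping this patch leaves to the nf_nat code instead of
doing it in IPVS.
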
 include/net/ip_vs.h             |    2 -
 net/netfilter/ipvs/Kconfig      |    2 +-
 net/netfilter/ipvs/ip_vs_app.c  |   43 ----------
 net/netfilter/ipvs/ip_vs_core.c |    1 -
 net/netfilter/ipvs/ip_vs_ftp.c  |  176 ++++++++++++++++++++++++++++++++++++---
 5 files changed, 165 insertions(+), 59 deletions(-)

diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index fe82b1e..1f9e511 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -736,8 +736,6 @@ extern void ip_vs_app_inc_put(struct ip_vs_app *inc);
 
 extern int ip_vs_app_pkt_out(struct ip_vs_conn *, struct sk_buff *skb);
 extern int ip_vs_app_pkt_in(struct ip_vs_conn *, struct sk_buff *skb);
-extern int ip_vs_skb_replace(struct sk_buff *skb, gfp_t pri,
-			     char *o_buf, int o_len, char *n_buf, int n_len);
 extern int ip_vs_app_init(void);
 extern void ip_vs_app_cleanup(void);
 
diff --git a/net/netfilter/ipvs/Kconfig b/net/netfilter/ipvs/Kconfig
index 3662444..be10f65 100644
--- a/net/netfilter/ipvs/Kconfig
+++ b/net/netfilter/ipvs/Kconfig
@@ -235,7 +235,7 @@ comment 'IPVS application helper'
 
 config	IP_VS_FTP
   	tristate "FTP protocol helper"
-        depends on IP_VS_PROTO_TCP
+        depends on IP_VS_PROTO_TCP && NF_NAT
 	---help---
 	  FTP is a protocol that transfers IP address and/or port number in
 	  the payload. In the virtual server via Network Address Translation,
diff --git a/net/netfilter/ipvs/ip_vs_app.c b/net/netfilter/ipvs/ip_vs_app.c
index 1cb0e83..e76f87f 100644
--- a/net/netfilter/ipvs/ip_vs_app.c
+++ b/net/netfilter/ipvs/ip_vs_app.c
@@ -569,49 +569,6 @@ static const struct file_operations ip_vs_app_fops = {
 };
 #endif
 
-
-/*
- *	Replace a segment of data with a new segment
- */
-int ip_vs_skb_replace(struct sk_buff *skb, gfp_t pri,
-		      char *o_buf, int o_len, char *n_buf, int n_len)
-{
-	int diff;
-	int o_offset;
-	int o_left;
-
-	EnterFunction(9);
-
-	diff = n_len - o_len;
-	o_offset = o_buf - (char *)skb->data;
-	/* The length of left data after o_buf+o_len in the skb data */
-	o_left = skb->len - (o_offset + o_len);
-
-	if (diff <= 0) {
-		memmove(o_buf + n_len, o_buf + o_len, o_left);
-		memcpy(o_buf, n_buf, n_len);
-		skb_trim(skb, skb->len + diff);
-	} else if (diff <= skb_tailroom(skb)) {
-		skb_put(skb, diff);
-		memmove(o_buf + n_len, o_buf + o_len, o_left);
-		memcpy(o_buf, n_buf, n_len);
-	} else {
-		if (pskb_expand_head(skb, skb_headroom(skb), diff, pri))
-			return -ENOMEM;
-		skb_put(skb, diff);
-		memmove(skb->data + o_offset + n_len,
-			skb->data + o_offset + o_len, o_left);
-		skb_copy_to_linear_data_offset(skb, o_offset, n_buf, n_len);
-	}
-
-	/* must update the iph total length here */
-	ip_hdr(skb)->tot_len = htons(skb->len);
-
-	LeaveFunction(9);
-	return 0;
-}
-
-
 int __init ip_vs_app_init(void)
 {
 	/* we will replace it with proc_net_ipvs_create() soon */
diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index 58f82df..4f8ddba 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -54,7 +54,6 @@
 
 EXPORT_SYMBOL(register_ip_vs_scheduler);
 EXPORT_SYMBOL(unregister_ip_vs_scheduler);
-EXPORT_SYMBOL(ip_vs_skb_replace);
 EXPORT_SYMBOL(ip_vs_proto_name);
 EXPORT_SYMBOL(ip_vs_conn_new);
 EXPORT_SYMBOL(ip_vs_conn_in_get);
diff --git a/net/netfilter/ipvs/ip_vs_ftp.c b/net/netfilter/ipvs/ip_vs_ftp.c
index 2ae747a..f228a17 100644
--- a/net/netfilter/ipvs/ip_vs_ftp.c
+++ b/net/netfilter/ipvs/ip_vs_ftp.c
@@ -20,6 +20,17 @@
  *
  * Author:	Wouter Gadeyne
  *
+ *
+ * Code for ip_vs_expect_related and ip_vs_expect_callback is taken from
+ * http://www.ssi.bg/~ja/nfct/:
+ *
+ * ip_vs_nfct.c:	Netfilter connection tracking support for IPVS
+ *
+ * Portions Copyright (C) 2001-2002
+ * Antefacto Ltd, 181 Parnell St, Dublin 1, Ireland.
+ *
+ * Portions Copyright (C) 2003-2008
+ * Julian Anastasov
  */
 
 #define KMSG_COMPONENT "IPVS"
@@ -32,6 +43,9 @@
 #include <linux/in.h>
 #include <linux/ip.h>
 #include <linux/netfilter.h>
+#include <net/netfilter/nf_conntrack.h>
+#include <net/netfilter/nf_conntrack_expect.h>
+#include <net/netfilter/nf_nat_helper.h>
 #include <linux/gfp.h>
 #include <net/protocol.h>
 #include <net/tcp.h>
@@ -43,6 +57,16 @@
 #define SERVER_STRING "227 Entering Passive Mode ("
 #define CLIENT_STRING "PORT "
 
+#define FMT_TUPLE	"%pI4:%u->%pI4:%u/%u"
+#define ARG_TUPLE(T)	&(T)->src.u3.ip, ntohs((T)->src.u.all), \
+			&(T)->dst.u3.ip, ntohs((T)->dst.u.all), \
+			(T)->dst.protonum
+
+#define FMT_CONN	"%pI4:%u->%pI4:%u->%pI4:%u/%u:%u"
+#define ARG_CONN(C)	&((C)->caddr.ip), ntohs((C)->cport), \
+			&((C)->vaddr.ip), ntohs((C)->vport), \
+			&((C)->daddr.ip), ntohs((C)->dport), \
+			(C)->protocol, (C)->state
 
 /*
  * List of ports (up to IP_VS_APP_MAX_PORTS) to be handled by helper
@@ -123,6 +147,119 @@ static int ip_vs_ftp_get_addrport(char *data, char *data_limit,
 	return 1;
 }
 
+/*
+ * Called from init_conntrack() as expectfn handler.
+ */
+static void
+ip_vs_expect_callback(struct nf_conn *ct,
+		      struct nf_conntrack_expect *exp)
+{
+	struct nf_conntrack_tuple *orig, new_reply;
+	struct ip_vs_conn *cp;
+
+	if (exp->tuple.src.l3num != PF_INET)
+		return;
+
+	/*
+	 * We assume that no NF locks are held before this callback.
+	 * ip_vs_conn_out_get and ip_vs_conn_in_get should match their
+	 * expectations even if they use wildcard values, now we provide the
+	 * actual values from the newly created original conntrack direction.
+	 * The conntrack is confirmed when packet reaches IPVS hooks.
+	 */
+
+	/* RS->CLIENT */
+	orig = &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple;
+	cp = ip_vs_conn_out_get(exp->tuple.src.l3num, orig->dst.protonum,
+				&orig->src.u3, orig->src.u.tcp.port,
+				&orig->dst.u3, orig->dst.u.tcp.port);
+	if (cp) {
+		/* Change reply CLIENT->RS to CLIENT->VS */
+		new_reply = ct->tuplehash[IP_CT_DIR_REPLY].tuple;
+		IP_VS_DBG(7, "%s(): ct=%p, status=0x%lX, tuples=" FMT_TUPLE ", "
+			  FMT_TUPLE ", found inout cp=" FMT_CONN "\n",
+			  __func__, ct, ct->status,
+			  ARG_TUPLE(orig), ARG_TUPLE(&new_reply),
+			  ARG_CONN(cp));
+		new_reply.dst.u3 = cp->vaddr;
+		new_reply.dst.u.tcp.port = cp->vport;
+		IP_VS_DBG(7, "%s(): ct=%p, new tuples=" FMT_TUPLE ", " FMT_TUPLE
+			  ", inout cp=" FMT_CONN "\n",
+			  __func__, ct,
+			  ARG_TUPLE(orig), ARG_TUPLE(&new_reply),
+			  ARG_CONN(cp));
+		goto alter;
+	}
+
+	/* CLIENT->VS */
+	cp = ip_vs_conn_in_get(exp->tuple.src.l3num, orig->dst.protonum,
+			       &orig->src.u3, orig->src.u.tcp.port,
+			       &orig->dst.u3, orig->dst.u.tcp.port);
+	if (cp) {
+		/* Change reply VS->CLIENT to RS->CLIENT */
+		new_reply = ct->tuplehash[IP_CT_DIR_REPLY].tuple;
+		IP_VS_DBG(7, "%s(): ct=%p, status=0x%lX, tuples=" FMT_TUPLE ", "
+			  FMT_TUPLE ", found outin cp=" FMT_CONN "\n",
+			  __func__, ct, ct->status,
+			  ARG_TUPLE(orig), ARG_TUPLE(&new_reply),
+			  ARG_CONN(cp));
+		new_reply.src.u3 = cp->daddr;
+		new_reply.src.u.tcp.port = cp->dport;
+		IP_VS_DBG(7, "%s(): ct=%p, new tuples=" FMT_TUPLE ", "
+			  FMT_TUPLE ", outin cp=" FMT_CONN "\n",
+			  __func__, ct,
+			  ARG_TUPLE(orig), ARG_TUPLE(&new_reply),
+			  ARG_CONN(cp));
+		goto alter;
+	}
+
+	IP_VS_DBG(7, "%s(): ct=%p, status=0x%lX, tuple=" FMT_TUPLE
+		  " - unknown expect\n",
+		  __func__, ct, ct->status, ARG_TUPLE(orig));
+	return;
+
+alter:
+	/* Never alter conntrack for non-NAT conns */
+	if (IP_VS_FWD_METHOD(cp) == IP_VS_CONN_F_MASQ)
+		nf_conntrack_alter_reply(ct, &new_reply);
+	ip_vs_conn_put(cp);
+	return;
+}
+
+/*
+ * Create NF conntrack expectation with wildcard (optional) source port.
+ * Then the default callback function will alter the reply and will confirm
+ * the conntrack entry when the first packet comes.
+ */
+static void
+ip_vs_expect_related(struct sk_buff *skb, struct nf_conn *ct,
+		     struct ip_vs_conn *cp, u_int8_t proto,
+		     const __be16 *port, int from_rs)
+{
+	struct nf_conntrack_expect *exp;
+
+	BUG_ON(!ct || ct == &nf_conntrack_untracked);
+
+	exp = nf_ct_expect_alloc(ct);
+	if (!exp)
+		return;
+
+	if (from_rs)
+		nf_ct_expect_init(exp, NF_CT_EXPECT_CLASS_DEFAULT,
+				  nf_ct_l3num(ct), &cp->daddr, &cp->caddr,
+				  proto, port, &cp->cport);
+	else
+		nf_ct_expect_init(exp, NF_CT_EXPECT_CLASS_DEFAULT,
+				  nf_ct_l3num(ct), &cp->caddr, &cp->vaddr,
+				  proto, port, &cp->vport);
+
+	exp->expectfn = ip_vs_expect_callback;
+
+	IP_VS_DBG(7, "%s(): ct=%p, expect tuple=" FMT_TUPLE "\n",
+		  __func__, ct, ARG_TUPLE(&exp->tuple));
+	nf_ct_expect_related(exp);
+	nf_ct_expect_put(exp);
+}
 
 /*
  * Look at outgoing ftp packets to catch the response to a PASV command
@@ -149,7 +286,9 @@ static int ip_vs_ftp_out(struct ip_vs_app *app, struct ip_vs_conn *cp,
 	struct ip_vs_conn *n_cp;
 	char buf[24];		/* xxx.xxx.xxx.xxx,ppp,ppp\000 */
 	unsigned buf_len;
-	int ret;
+	int ret = 0;
+	enum ip_conntrack_info ctinfo;
+	struct nf_conn *ct;
 
 #ifdef CONFIG_IP_VS_IPV6
 	/* This application helper doesn't work with IPv6 yet,
@@ -219,19 +358,26 @@ static int ip_vs_ftp_out(struct ip_vs_app *app, struct ip_vs_conn *cp,
 
 		buf_len = strlen(buf);
 
+		ct = nf_ct_get(skb, &ctinfo);
+		if (ct && !nf_ct_is_untracked(ct)) {
+			/* If mangling fails this function will return 0
+			 * which will cause the packet to be dropped.
+			 * Mangling can only fail under memory pressure,
+			 * hopefully it will succeed on the retransmitted
+			 * packet.
+			 */
+			ret = nf_nat_mangle_tcp_packet(skb, ct, ctinfo,
+						       start-data, end-start,
+						       buf, buf_len);
+			if (ret)
+				ip_vs_expect_related(skb, ct, n_cp,
+						     IPPROTO_TCP, NULL, 0);
+		}
+
 		/*
-		 * Calculate required delta-offset to keep TCP happy
+		 * Not setting 'diff' is intentional, otherwise the sequence
+		 * would be adjusted twice.
 		 */
-		*diff = buf_len - (end-start);
-
-		if (*diff == 0) {
-			/* simply replace it with new passive address */
-			memcpy(start, buf, buf_len);
-			ret = 1;
-		} else {
-			ret = !ip_vs_skb_replace(skb, GFP_ATOMIC, start,
-					  end-start, buf, buf_len);
-		}
 
 		cp->app_data = NULL;
 		ip_vs_tcp_conn_listen(n_cp);
@@ -263,6 +409,7 @@ static int ip_vs_ftp_in(struct ip_vs_app *app, struct ip_vs_conn *cp,
 	union nf_inet_addr to;
 	__be16 port;
 	struct ip_vs_conn *n_cp;
+	struct nf_conn *ct;
 
 #ifdef CONFIG_IP_VS_IPV6
 	/* This application helper doesn't work with IPv6 yet,
@@ -349,6 +496,11 @@ static int ip_vs_ftp_in(struct ip_vs_app *app, struct ip_vs_conn *cp,
 		ip_vs_control_add(n_cp, cp);
 	}
 
+	ct = (struct nf_conn *)skb->nfct;
+	if (ct && ct != &nf_conntrack_untracked)
+		ip_vs_expect_related(skb, ct, n_cp,
+				     IPPROTO_TCP, &n_cp->dport, 1);
+
 	/*
 	 *	Move tunnel to listen state
 	 */
-- 
1.7.1


^ permalink raw reply related

* [PATCH 15/28] netfilter: nf_nat_core: merge the same lines
From: kaber @ 2010-08-02 19:57 UTC (permalink / raw)
  To: davem; +Cc: netfilter-devel, netdev
In-Reply-To: <1280779065-9333-1-git-send-email-kaber@trash.net>

From: Changli Gao <xiaosuo@gmail.com>

proto->unique_tuple() is always called last if the earlier checks fail. This
patch tests for !(range->flags & IP_NAT_RANGE_PROTO_RANDOM) in the "already
in range and unique" check instead, which removes the duplicate call to
proto->unique_tuple().

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
---
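The equivalence of the old and new control flow can be checked exhaustively
in userspace. The sketch below models each kernel predicate as a plain
boolean; old_calls_unique_tuple(), new_calls_unique_tuple() and
count_mismatches() are hypothetical names used only for this illustration.

```c
#include <stdbool.h>

/* Old flow: true if proto->unique_tuple() would end up being called. */
static bool old_calls_unique_tuple(bool random, bool specified,
				   bool in_range, bool used)
{
	if (random)
		return true;	/* early unique_tuple(), then goto out */
	if ((!specified || in_range) && !used)
		return false;	/* goto out, tuple kept as it is */
	return true;		/* fall through to unique_tuple() */
}

/* New flow: the RANDOM test is folded into the "keep as is" check. */
static bool new_calls_unique_tuple(bool random, bool specified,
				   bool in_range, bool used)
{
	if (!random && (!specified || in_range) && !used)
		return false;	/* goto out */
	return true;		/* single unique_tuple() call site */
}

/* Count the input combinations on which the two flows disagree. */
static int count_mismatches(void)
{
	int mismatches = 0;

	for (int bits = 0; bits < 16; bits++) {
		bool r = bits & 1, s = bits & 2;
		bool i = bits & 4, u = bits & 8;

		if (old_calls_unique_tuple(r, s, i, u) !=
		    new_calls_unique_tuple(r, s, i, u))
			mismatches++;
	}
	return mismatches;
}
```

Because && short-circuits, the new form also skips the in_range and
used-tuple checks when RANDOM is set, as the old early exit did.
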
 net/ipv4/netfilter/nf_nat_core.c |    9 ++-------
 1 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/netfilter/nf_nat_core.c b/net/ipv4/netfilter/nf_nat_core.c
index c7719b2..037a3a6 100644
--- a/net/ipv4/netfilter/nf_nat_core.c
+++ b/net/ipv4/netfilter/nf_nat_core.c
@@ -261,14 +261,9 @@ get_unique_tuple(struct nf_conntrack_tuple *tuple,
 	rcu_read_lock();
 	proto = __nf_nat_proto_find(orig_tuple->dst.protonum);
 
-	/* Change protocol info to have some randomization */
-	if (range->flags & IP_NAT_RANGE_PROTO_RANDOM) {
-		proto->unique_tuple(tuple, range, maniptype, ct);
-		goto out;
-	}
-
 	/* Only bother mapping if it's not already in range and unique */
-	if ((!(range->flags & IP_NAT_RANGE_PROTO_SPECIFIED) ||
+	if (!(range->flags & IP_NAT_RANGE_PROTO_RANDOM) &&
+	    (!(range->flags & IP_NAT_RANGE_PROTO_SPECIFIED) ||
 	     proto->in_range(tuple, maniptype, &range->min, &range->max)) &&
 	    !nf_nat_used_tuple(tuple, ct))
 		goto out;
-- 
1.7.1


^ permalink raw reply related

* [PATCH 19/28] netfilter: ip6tables: use skb->len for accounting
From: kaber @ 2010-08-02 19:57 UTC (permalink / raw)
  To: davem; +Cc: netfilter-devel, netdev
In-Reply-To: <1280779065-9333-1-git-send-email-kaber@trash.net>

From: Changli Gao <xiaosuo@gmail.com>

ipv6_hdr(skb)->payload_len is zero, and so cannot be used for accounting,
when the packet carries a Jumbo Payload as specified in RFC 2675.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
---
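To make the undercount concrete, here is a small userspace sketch of the
two accounting formulas. It assumes the packet length passed in covers the
packet from the IPv6 header onward; IPV6_HDR_LEN, bytes_from_header() and
undercount() are hypothetical names for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define IPV6_HDR_LEN 40u	/* size of the fixed IPv6 header */

/* Old accounting: payload length field (host order) plus the header. */
static size_t bytes_from_header(uint16_t payload_len_host)
{
	return (size_t)payload_len_host + IPV6_HDR_LEN;
}

/* Bytes the old formula misses for a packet of pkt_len bytes whose
 * header carries the given payload length field. */
static size_t undercount(size_t pkt_len, uint16_t payload_len_host)
{
	size_t counted = bytes_from_header(payload_len_host);

	return pkt_len > counted ? pkt_len - counted : 0;
}
```

For an ordinary 1500-byte packet both formulas agree. For a jumbogram the
payload length field is zero, so the old formula counts only the 40 header
bytes regardless of the real size.
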
 net/ipv6/netfilter/ip6_tables.c |    4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index dc41d6d..33113c1 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -387,9 +387,7 @@ ip6t_do_table(struct sk_buff *skb,
 				goto no_match;
 		}
 
-		ADD_COUNTER(e->counters,
-			    ntohs(ipv6_hdr(skb)->payload_len) +
-			    sizeof(struct ipv6hdr), 1);
+		ADD_COUNTER(e->counters, skb->len, 1);
 
 		t = ip6t_get_target_c(e);
 		IP_NF_ASSERT(t->u.kernel.target);
-- 
1.7.1


^ permalink raw reply related

* [PATCH 20/28] netfilter: iptables: use skb->len for accounting
From: kaber @ 2010-08-02 19:57 UTC (permalink / raw)
  To: davem; +Cc: netfilter-devel, netdev
In-Reply-To: <1280779065-9333-1-git-send-email-kaber@trash.net>

From: Changli Gao <xiaosuo@gmail.com>

Use skb->len for accounting as xt_quota does.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 net/ipv4/netfilter/ip_tables.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index b38c118..3c584a6 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -364,7 +364,7 @@ ipt_do_table(struct sk_buff *skb,
 				goto no_match;
 		}
 
-		ADD_COUNTER(e->counters, ntohs(ip->tot_len), 1);
+		ADD_COUNTER(e->counters, skb->len, 1);
 
 		t = ipt_get_target(e);
 		IP_NF_ASSERT(t->u.kernel.target);
-- 
1.7.1


^ permalink raw reply related
