Netdev List


* Re: [PATCH] mlx4: Add support for EEH error recovery
From: Or Gerlitz @ 2012-07-23 21:26 UTC
  To: Kleber Sacilotto de Souza
  Cc: Or Gerlitz, David Miller, netdev, jackm, yevgenyp, cascardo,
	brking, shlomop
In-Reply-To: <500DB9CE.5080100@linux.vnet.ibm.com>

Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com> wrote:
>> For powerpc we have an IBM internal user space tool that injects the
>> error on the bus with the aid of the system firmware. The kernel used
>> was built with the option:
>> CONFIG_EEH=y
>> and without the AER options. I will run some more tests with the AER
>> options activated.

> I tested the powerpc error injection with
>
> CONFIG_EEH=y
> CONFIG_PCIEAER=y
> CONFIG_PCIEAER_INJECT=m
>
> and with the aer_inject module loaded and it didn't affect the EEH
> recovery, the adapter recovered as expected.

I'm not sure I follow what you meant by "it didn't affect the EEH
recovery". How did you use the aer_inject module? Is that through a
user-space tool which is available to us?

Or.

* Regression: ping -R crashes over Ipsec
From: Stephen Hemminger @ 2012-07-23 21:30 UTC
  To: David Miller, James Davidson; +Cc: netdev

James is investigating a bug that occurs when record route is used
over ipsec.

  https://bugzilla.vyatta.com/show_bug.cgi?id=8218

It appears that this regression was introduced by:

commit 8e36360ae876995e92d3a7538dda70548e64e685
Author: David S. Miller <davem@davemloft.net>
Date:   Fri May 13 17:29:41 2011 -0400

    ipv4: Remove route key identity dependencies in ip_rt_get_source().
    
    Pass in the sk_buff so that we can fetch the necessary keys from
    the packet header when working with input routes.
    
    Signed-off-by: David S. Miller <davem@davemloft.net>


The problem is that ip_rt_get_source() assumes skb->dev is a valid
pointer that can be used instead of rt->iif. It looks like this isn't
true when running through IPsec.
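
As a rough illustration of the failure mode described above (a userspace
sketch, not the kernel code and not a proposed fix), the snippet below
models an unconditional skb->dev dereference next to a guarded one. The
fake_* types and the iif_* helpers are hypothetical names invented for
this example. The faulting address 00000070 with ECX = 0 in the oops is
consistent with reading a field at a small offset from a NULL pointer.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; only the fields
 * needed for the illustration are modelled. */
struct fake_dev { int ifindex; };
struct fake_skb { struct fake_dev *dev; };
struct fake_rt  { int rt_iif; };

/* Unsafe pattern: dereferences skb->dev unconditionally, which faults
 * if the pointer has been cleared. Shown only for contrast. */
int iif_unchecked(const struct fake_skb *skb)
{
	return skb->dev->ifindex;
}

/* Defensive variant: fall back to the interface index stored in the
 * route when skb->dev is NULL. */
int iif_checked(const struct fake_skb *skb, const struct fake_rt *rt)
{
	if (skb->dev != NULL)
		return skb->dev->ifindex;
	return rt->rt_iif;
}
```

Whether falling back to the route's stored index is the right behaviour
for the IPsec path is a separate question; the sketch only shows where
the NULL check would sit.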


[   60.740704] BUG: unable to handle kernel NULL pointer dereference at 00000070
[   60.748066] IP: [<c122dfac>] ip_rt_get_source+0x54/0xd1
[   60.753431] *pde = 00000000
[   60.756455] Oops: 0000 [#1] SMP
[   60.759881] Modules linked in: xt_policy authenc xfrm6_mode_tunnel xfrm4_mode_tunnel deflate zlib_deflate ctr twofish_generic twofish_i586 twofish_common camellia serpent blowfish cast5 des_generic cbc aes_i586 aes_generic xcbc rmd160 sha512_generic sha256_generic crypto_null iptable_nat ip6table_filter ip6table_raw ip6_tables iptable_filter xt_NOTRACK xt_CT iptable_raw nf_nat_pptp nf_conntrack_pptp nf_conntrack_proto_gre nf_nat_h323 nf_conntrack_h323 nf_nat_sip nf_conntrack_sip nf_nat_proto_gre nf_nat_tftp nf_nat_ftp nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack_tftp nf_conntrack_ftp nf_conntrack acpi_cpufreq mperf xfrm_user cpufreq_userspace cpufreq_stats xfrm4_tunnel tunnel4 cpufreq_powersave ipcomp cpufreq_ondemand freq_table xfrm_ipcomp esp4 cpufreq_conservative ipv6 ah4 af_key dcdbas evdev intel_agp container intel_gtt i2c_i801 i2c_core agpgart pcspkr ghes hed button processor battery usb_storage ohci_hcd squashfs loop ext4 jbd2 crc16 raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 multipath linear md_mod usbhid hid fan thermal thermal_sys ahci libahci libata igb dca bnx2 [last unloaded: scsi_wait_scan]
[   60.871342]
[   60.872904] Pid: 0, comm: swapper Not tainted 3.0.23-1-586-vyatta #1 Dell Inc. PowerEdge R210 II/09T7VV
[   60.882593] EIP: 0060:[<c122dfac>] EFLAGS: 00010246 CPU: 0
[   60.888143] EIP is at ip_rt_get_source+0x54/0xd1
[   60.892820] EAX: f3f80000 EBX: f3a4323c ECX: 00000000 EDX: f3829c00
[   60.899157] ESI: f3f00000 EDI: f440ddc0 EBP: f440dda0 ESP: f440dd9c
[   60.905485]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[   60.910947] Process swapper (pid: 0, ti=f440c000 task=c138dee0 task.ti=c1388000)
[   60.918419] Stack:
[   60.920500]  f3a4325b 00000002 00000000 00000000 00000000 00000000 64002cac 010021ac
[   60.928898]  00000000 0000003c f47e0240 00000020 00000010 00000028 f3829c18 f382e0f8
[   60.937295]  f3a43278 f3a4323c c1233483 f3829c00 f3a43250 f47e02f0 f440de98 f3829c00
[   60.945714] Call Trace:
[   60.948232]  [<c1233483>] ? ip_options_build+0x7e/0x12b
[   60.953527]  [<c1234126>] ? __ip_make_skb+0x230/0x280
[   60.958645]  [<c123502c>] ? ip_push_pending_frames+0x13/0x20
[   60.964375]  [<c12520bf>] ? icmp_reply+0x114/0x135
[   60.969230]  [<c12521f5>] ? icmp_echo+0x57/0x5c
[   60.973828]  [<c1252ac9>] ? icmp_rcv+0x176/0x191
[   60.978510]  [<c1231570>] ? ip_local_deliver_finish+0x100/0x19c
[   60.984496]  [<c1231470>] ? T.971+0x41/0x41
[   60.988745]  [<c1231642>] ? T.972+0x36/0x39
[   60.992997]  [<c123167b>] ? ip_local_deliver+0x36/0x39
[   60.998200]  [<c1231470>] ? T.971+0x41/0x41
[   61.002449]  [<c123134f>] ? ip_rcv_finish+0x2cb/0x2f0
[   61.007565]  [<c1231084>] ? inet_del_protocol+0x26/0x26
[   61.012858]  [<c1231642>] ? T.972+0x36/0x39
[   61.017107]  [<c12104b1>] ? __netif_receive_skb+0x393/0x3ba
[   61.022745]  [<c1231084>] ? inet_del_protocol+0x26/0x26
[   61.028035]  [<c1210572>] ? process_backlog+0x9a/0x132
[   61.033236]  [<c103106e>] ? irq_enter+0x49/0x49
[   61.037836]  [<c1210ccd>] ? net_rx_action+0x92/0x19a
[   61.042865]  [<c103106e>] ? irq_enter+0x49/0x49
[   61.047460]  [<c1031104>] ? __do_softirq+0x96/0x144
[   61.052404]  [<c103106e>] ? irq_enter+0x49/0x49
[   61.057001]  <IRQ>
[   61.059247]  [<c1030f55>] ? irq_exit+0x2f/0x91
[   61.063754]  [<c10035d8>] ? do_IRQ+0x73/0x84
[   61.068089]  [<c128bca9>] ? common_interrupt+0x29/0x30
[   61.073290]  [<c103007b>] ? do_setitimer+0xdf/0x1a3
[   61.078233]  [<c1166afe>] ? intel_idle+0x9c/0xb9
[   61.082917]  [<c11fc59d>] ? cpuidle_idle_call+0xcf/0x15a
[   61.088294]  [<c1001b18>] ? cpu_idle+0x41/0x5d
[   61.092796]  [<c13ba6eb>] ? start_kernel+0x2b2/0x2b5
[   61.097825] Code: 00 00 89 ef f3 ab 8b 43 10 89 44 24 18 8b 43 0c 89 44 24 1c 8a 43 01 83 e0 1e 88 44 24 10 8b 46 0c 8b 48 70 89 4c 24 04 8b 4a 14 <8b> 49 70 89 4c 24 08 8b 92 90 00 00 00 8d 4c 24 24 89 54 24 0c
[   61.121450] EIP: [<c122dfac>] ip_rt_get_source+0x54/0xd1 SS:ESP 0068:f440dd9c
[   61.128795] CR2: 0000000000000070
[   61.132180] ---[ end trace d5716a30ffe983e9 ]---

Message from syslogd@West at Jul 13 13:05:19 ...
 kernel:[   60.756455] Oops: 0000 [#1] SMP

[   61.136923] Kernel panic - not syncing: Fatal exception in interrupt
[   61.136924] Pid: 0, comm: swapper Tainted: G      D     3.0.23-1-586-vyatta #1
[   61.136925] Call Trace:
[   61.136927]  [<c1288eba>] ? panic+0x4d/0x12b
[   61.136929]  [<c1004756>] ? oops_end+0x6c/0x76
[   61.136931]  [<c101b23f>] ? no_context+0x10d/0x116
[   61.136933]  [<c101b37b>] ? bad_area_nosemaphore+0xa/0xc
[   61.136934]  [<c101b75d>] ? do_page_fault+0x131/0x2ec
[   61.136936]  [<c1230f24>] ? inet_getpeer+0x252/0x290
[   61.136938]  [<c1206dac>] ? skb_copy_and_csum_bits+0x50/0x225
[   61.136939]  [<c101b62c>] ? vmalloc_sync_all+0xc4/0xc4

* Re: [PATCH] mlx4: Add support for EEH error recovery
From: David Miller @ 2012-07-23 21:34 UTC
  To: or.gerlitz
  Cc: klebers, ogerlitz, netdev, jackm, yevgenyp, cascardo, brking,
	shlomop
In-Reply-To: <CAJZOPZJP9OaZVB03kKGtWvrUUb8B--wUNX9z3MG_MNmKr8U3kQ@mail.gmail.com>

From: Or Gerlitz <or.gerlitz@gmail.com>
Date: Tue, 24 Jul 2012 00:26:51 +0300

> I'm not sure I follow what you meant by "it didn't affect the EEH
> recovery". How did you use the aer_inject module? Is that through a
> user-space tool which is available to us?

Can we please move forward, if he implemented the feature properly
and he tested it successfully, unless you can find a logic or
stylistic flaw in his patch please ACK it.

You can't hold his changes back while you work out how _YOU_ can
test it to your liking.

* Re: Regression: ping -R crashes over Ipsec
From: David Miller @ 2012-07-23 21:37 UTC
  To: shemminger; +Cc: james.davidson, netdev
In-Reply-To: <20120723143038.4ad5ac7a@nehalam.linuxnetplumber.net>

Stephen please work on a fix if you can, I'm already overloaded
with the issues Julian has reported, thanks.

* Re: [PATCH] mlx4: Add support for EEH error recovery
From: Or Gerlitz @ 2012-07-23 21:42 UTC
  To: David Miller
  Cc: klebers, ogerlitz, netdev, jackm, yevgenyp, cascardo, brking,
	shlomop
In-Reply-To: <20120723.143436.2124127996154789223.davem@davemloft.net>

On Tue, Jul 24, 2012 at 12:34 AM, David Miller <davem@davemloft.net> wrote:

> Can we please move forward, if he implemented the feature properly
> and he tested it successfully, unless you can find a logic or
> stylistic flaw in his patch please ACK it.
>
> You can't hold his changes back while you work out how _YOU_ can
> test it to your liking.

Hi Dave,

We're trying to act in an R/R (Responsive and Responsible) manner -
namely, Shlomo did a code review of the patches and we want to further
evaluate them by testing. I think it's fully legitimate to test a patch
before ACK-ing. Doing these types of tests isn't part of my typical
daily menu, and I'm asking the author for some directions on how to
run that testing; I don't see what's wrong here. We're planning anyway
to go deeper into this area and enhance the PCI hotplug/error handling
related code in the driver, so there's an initial learning curve here.
Makes sense? We can move the Q&A for the testing off-list if you prefer
it to go that way.

Or.

* Re: [PATCH] mlx4: Add support for EEH error recovery
From: David Miller @ 2012-07-23 21:44 UTC
  To: or.gerlitz
  Cc: klebers, ogerlitz, netdev, jackm, yevgenyp, cascardo, brking,
	shlomop
In-Reply-To: <CAJZOPZKp4su8sOV1Cpm5CG2DYY2xVKUjrSFge5qcYn2pnkC-YA@mail.gmail.com>

From: Or Gerlitz <or.gerlitz@gmail.com>
Date: Tue, 24 Jul 2012 00:42:08 +0300

> On Tue, Jul 24, 2012 at 12:34 AM, David Miller <davem@davemloft.net> wrote:
> 
>> Can we please move forward, if he implemented the feature properly
>> and he tested it successfully, unless you can find a logic or
>> stylistic flaw in his patch please ACK it.
>>
>> You can't hold his changes back while you work out how _YOU_ can
>> test it to your liking.
> 
> Hi Dave,
> 
> We're trying to act in an R/R (Responsive and Responsible) manner -
> namely, Shlomo did a code review of the patches and we want to further
> evaluate them by testing. I think it's fully legitimate to test a patch
> before ACK-ing. Doing these types of tests isn't part of my typical
> daily menu, and I'm asking the author for some directions on how to
> run that testing; I don't see what's wrong here. We're planning anyway
> to go deeper into this area and enhance the PCI hotplug/error handling
> related code in the driver, so there's an initial learning curve here.
> Makes sense? We can move the Q&A for the testing off-list if you prefer
> it to go that way.

But ACK his patch, because you have not found any problems with it.

This is taking days, and you're stalling further progress.

I never let patches rot in patchwork more than a few days, as this
patch has already.

Either ACK or provide a legitimate reason to reject it now.

* Re: Regression: ping -R crashes over Ipsec
From: Stephen Hemminger @ 2012-07-23 21:48 UTC
  To: David Miller; +Cc: james.davidson, netdev
In-Reply-To: <20120723.143728.595164943890165527.davem@davemloft.net>

On Mon, 23 Jul 2012 14:37:28 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> 
> Stephen please work on a fix if you can, I'm already overloaded
> with the issues Julian has reported, thanks.

James is working on one; I just wanted a wider audience. For now we
may have to just silently drop the ping request.

* Re: [PATCH] mlx4: Add support for EEH error recovery
From: Or Gerlitz @ 2012-07-23 22:02 UTC
  To: David Miller
  Cc: klebers, ogerlitz, netdev, jackm, yevgenyp, cascardo, brking,
	shlomop
In-Reply-To: <20120723.144421.1741050474189216927.davem@davemloft.net>

On Tue, Jul 24, 2012 at 12:44 AM, David Miller <davem@davemloft.net> wrote:
> But ACK his patch, because you have not found any problems with it.

Again, we wanted to test the patch before providing an ACK, and it takes
us a bit to catch up on testing in this area; it happens.

> This is taking days, and you're stalling further progress.
> I never let patches rot in patchwork more than a few days, as this patch has already.
> Either ACK or provide a legitimate reason to reject it now.

Understood. As it's a bit outside working hours here now, let's see what
tomorrow yields testing-wise, but by tomorrow we will either ACK or
provide a reason to reject/change it.

Or.

* Re: [PATCH] mlx4: Add support for EEH error recovery
From: David Miller @ 2012-07-23 22:21 UTC
  To: or.gerlitz
  Cc: klebers, ogerlitz, netdev, jackm, yevgenyp, cascardo, brking,
	shlomop
In-Reply-To: <CAJZOPZL=Dw4GK0fV+_cC-TTRoYxkEhe9KBg-6G9LE=xrK3b7jQ@mail.gmail.com>

From: Or Gerlitz <or.gerlitz@gmail.com>
Date: Tue, 24 Jul 2012 01:02:10 +0300

> As it's a bit outside working hours here now, let's see what tomorrow
> yields testing-wise, but by tomorrow we will either ACK or provide a
> reason to reject/change it.

Thank you.

* Re: [PATCH 2/2] ipv4: Change rt->rt_iif encoding.
From: David Miller @ 2012-07-23 22:22 UTC
  To: netdev; +Cc: ja
In-Reply-To: <20120723.140737.822125500680433996.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Mon, 23 Jul 2012 14:07:37 -0700 (PDT)

> On input packet processing, rt->rt_iif will be zero if we should
> use skb->dev->ifindex.
> 
> Since we access rt->rt_iif consistently via inet_iif(), that is
> the only spot whose interpretation have to adjust.
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>

Julian, as you seem to have feared, this turns out to not work.

We zap the skb->dev apparently, so I'll need to come up with
a different scheme to handle this.

* Re: [PATCH 2/2] ipv4: Change rt->rt_iif encoding.
From: Julian Anastasov @ 2012-07-23 23:11 UTC
  To: David Miller; +Cc: netdev
In-Reply-To: <20120723.152220.822947775372843073.davem@davemloft.net>


	Hello,

On Mon, 23 Jul 2012, David Miller wrote:

> From: David Miller <davem@davemloft.net>
> Date: Mon, 23 Jul 2012 14:07:37 -0700 (PDT)
> 
> > On input packet processing, rt->rt_iif will be zero if we should
> > use skb->dev->ifindex.
> > 
> > Since we access rt->rt_iif consistently via inet_iif(), that is
> > the only spot whose interpretation have to adjust.
> > 
> > Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> Julian, as you seem to have feared, this turns out to not work.
> 
> We zap the skb->dev apparently, so I'll need to come up with
> a different scheme to handle this.

	I was just replying that I don't see any problems
with the 4 patches, but now that you say it, maybe the net/sched
code will work with the output device on forwarding, not input.
Is skb_iif a good source? From include/linux/skbuff.h:

skb_iif: ifindex of device we arrived on

	If it has hidden semantics, maybe we can make this
comment more specific. Does it survive some loops?

	Anyways, the idea for new encoding of rt_iif is very good.

Regards

--
Julian Anastasov <ja@ssi.bg>

* Re: [PATCH 2/2] ipv4: Change rt->rt_iif encoding.
From: David Miller @ 2012-07-23 23:04 UTC
  To: netdev; +Cc: ja
In-Reply-To: <20120723.152220.822947775372843073.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Mon, 23 Jul 2012 15:22:20 -0700 (PDT)

> From: David Miller <davem@davemloft.net>
> Date: Mon, 23 Jul 2012 14:07:37 -0700 (PDT)
> 
>> On input packet processing, rt->rt_iif will be zero if we should
>> use skb->dev->ifindex.
>> 
>> Since we access rt->rt_iif consistently via inet_iif(), that is
>> the only spot whose interpretation have to adjust.
>> 
>> Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> Julian, as you seem to have feared, this turns out to not work.
> 
> We zap the skb->dev apparently, so I'll need to come up with
> a different scheme to handle this.

Just to clarify, the specific issue is that tcp input will set
skb->dev to NULL right before we lock the socket and try to process
the packet in that socket's context.

So inet_iif() implemented conditionally using skb->dev won't
work without some kind of adjustment.

And we need to do inet_iif() even after we have the top-level
socket in-hand, in order to lookup listener socket children,
and maintain the cached RX dst.
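
To make the problem above concrete, here is a small userspace sketch (not
kernel code) of the encoding under discussion: a zero rt_iif defers to
the skb. The fake_* types and both iif_via_* helpers are hypothetical
names made up for this example. The second helper follows the skb_iif
idea raised elsewhere in this thread and is shown only as one possible
direction, not as the fix that was chosen.

```c
#include <stddef.h>

struct fake_dev { int ifindex; };

struct fake_skb {
	struct fake_dev *dev;	/* may be set to NULL by TCP input */
	int skb_iif;		/* ifindex of device we arrived on */
};

struct fake_rt { int rt_iif; };	/* 0 means "derive it from the skb" */

/* Encoding as described in the patch: a zero rt_iif defers to
 * skb->dev->ifindex. Returns -1 where real code would dereference
 * a NULL pointer. */
int iif_via_dev(const struct fake_rt *rt, const struct fake_skb *skb)
{
	if (rt->rt_iif != 0)
		return rt->rt_iif;
	if (skb->dev == NULL)
		return -1;	/* skb->dev was zapped; nothing to read */
	return skb->dev->ifindex;
}

/* Alternative that reads the index recorded in the skb itself, so it
 * does not depend on skb->dev still being set. */
int iif_via_skb_iif(const struct fake_rt *rt, const struct fake_skb *skb)
{
	if (rt->rt_iif != 0)
		return rt->rt_iif;
	return skb->skb_iif;
}
```

The sketch assumes skb_iif keeps its value after skb->dev is cleared;
whether that holds on every receive path is exactly the "hidden
semantics" question and is not settled here.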

Oddly enough, I had expected that this sort of change came from Eric
Dumazet as some kind of optimization, but it's rather a poorly
documented change from many years ago, apparently written by yours
truly :-)

--------------------
commit 72ffc30d67f33b69ce4596b940a66e45bf5994b3
Author: davem <davem>
Date:   Fri Dec 8 17:15:53 2000 +0000

    Kill both is_clone and rx_dev from sk_buff, both
    were superfluous.

diff --git a/drivers/net/pppoe.c b/drivers/net/pppoe.c
index feb7236..8744f61 100644
--- a/drivers/net/pppoe.c
+++ b/drivers/net/pppoe.c
@@ -785,8 +785,7 @@ int pppoe_sendmsg(struct socket *sock, struct msghdr *m,
 	skb_reserve(skb, dev->hard_header_len);
 	skb->nh.raw = skb->data;
 
-	skb->rx_dev = skb->dev = dev;
-	dev_hold(skb->rx_dev);
+	skb->dev = dev;
 
 	skb->priority = sk->priority;
 	skb->protocol = __constant_htons(ETH_P_PPP_SES);
@@ -869,11 +868,7 @@ int __pppoe_xmit(struct sock *sk, struct sk_buff *skb)
 
 	skb->nh.raw = skb->data;
 
-	/* Change device of skb, update reference counts */
-	if(skb->rx_dev)
-	    dev_put(skb->rx_dev);
-	skb->rx_dev = skb->dev = dev;
-	dev_hold(skb->rx_dev);
+	skb->dev = dev;
 
 	dev->hard_header(skb, dev, ETH_P_PPP_SES,
 			 sk->protinfo.pppox->pppoe_pa.remote,
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c67f8c8..17e48d0 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -65,8 +65,7 @@ struct sk_buff {
 	struct sk_buff_head * list;		/* List we are on				*/
 	struct sock	*sk;			/* Socket we are owned by 			*/
 	struct timeval	stamp;			/* Time we arrived				*/
-	struct net_device	*dev;			/* Device we arrived on/are leaving by		*/
-	struct net_device	*rx_dev;
+	struct net_device	*dev;		/* Device we arrived on/are leaving by		*/
 
 	/* Transport layer header */
 	union
@@ -110,8 +109,7 @@ struct sk_buff {
 	unsigned int 	len;			/* Length of actual data			*/
 	unsigned int	csum;			/* Checksum 					*/
 	volatile char 	used;			/* Data moved to user and not MSG_PEEK		*/
-	unsigned char	is_clone,		/* We are a clone				*/
-			cloned, 		/* head may be cloned (check refcnt to be sure). */
+	unsigned char	cloned, 		/* head may be cloned (check refcnt to be sure). */
   			pkt_type,		/* Packet class					*/
   			ip_summed;		/* Driver fed us an IP checksum			*/
 	__u32		priority;		/* Packet queueing priority			*/
diff --git a/include/net/sock.h b/include/net/sock.h
index 4b3a82b..8550282 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1169,6 +1169,7 @@ static inline int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 	}
 #endif /* CONFIG_FILTER */
 
+	skb->dev = NULL;
 	skb_set_owner_r(skb, sk);
 	skb_queue_tail(&sk->receive_queue, skb);
 	if (!sk->dead)
diff --git a/net/core/dev.c b/net/core/dev.c
index 0f097b8..f8d5437 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -617,7 +617,7 @@ void netdev_state_change(struct net_device *dev)
 
 void dev_load(const char *name)
 {
-	if (!__dev_get_by_name(name) && capable(CAP_SYS_MODULE))
+	if (!dev_get(name) && capable(CAP_SYS_MODULE))
 		request_module(name);
 }
 
@@ -875,8 +875,6 @@ void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev)
 
 			skb2->h.raw = skb2->nh.raw;
 			skb2->pkt_type = PACKET_OUTGOING;
-			skb2->rx_dev = skb->dev;
-			dev_hold(skb2->rx_dev);
 			ptype->func(skb2, skb->dev, ptype);
 		}
 	}
@@ -1129,10 +1127,7 @@ int netif_rx(struct sk_buff *skb)
 				goto drop;
 
 enqueue:
-			if (skb->rx_dev)
-				dev_put(skb->rx_dev);
-			skb->rx_dev = skb->dev;
-			dev_hold(skb->rx_dev);
+			dev_hold(skb->dev);
 			__skb_queue_tail(&queue->input_pkt_queue,skb);
 			__cpu_raise_softirq(this_cpu, NET_RX_SOFTIRQ);
 			local_irq_restore(flags);
@@ -1206,11 +1201,11 @@ static int deliver_to_old_ones(struct packet_type *pt, struct sk_buff *skb, int
  */
 static __inline__ void skb_bond(struct sk_buff *skb)
 {
-	struct net_device *dev = skb->rx_dev;
+	struct net_device *dev = skb->dev;
 	
 	if (dev->master) {
 		dev_hold(dev->master);
-		skb->dev = skb->rx_dev = dev->master;
+		skb->dev = dev->master;
 		dev_put(dev);
 	}
 }
@@ -1320,6 +1315,7 @@ static void net_rx_action(struct softirq_action *h)
 
 	for (;;) {
 		struct sk_buff *skb;
+		struct net_device *rx_dev;
 
 		local_irq_disable();
 		skb = __skb_dequeue(&queue->input_pkt_queue);
@@ -1330,10 +1326,13 @@ static void net_rx_action(struct softirq_action *h)
 
 		skb_bond(skb);
 
+		rx_dev = skb->dev;
+
 #ifdef CONFIG_NET_FASTROUTE
 		if (skb->pkt_type == PACKET_FASTROUTE) {
 			netdev_rx_stat[this_cpu].fastroute_deferred_out++;
 			dev_queue_xmit(skb);
+			dev_put(rx_dev);
 			continue;
 		}
 #endif
@@ -1369,6 +1368,7 @@ static void net_rx_action(struct softirq_action *h)
 			if (skb->dev->br_port != NULL &&
 			    br_handle_frame_hook != NULL) {
 				handle_bridge(skb, pt_prev);
+				dev_put(rx_dev)
 				continue;
 			}
 #endif
@@ -1399,6 +1399,8 @@ static void net_rx_action(struct softirq_action *h)
 				kfree_skb(skb);
 		}
 
+		dev_put(rx_dev);
+
 		if (bugdet-- < 0 || jiffies - start_time > 1)
 			goto softnet_break;
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d7f575d..1d8dc09 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4,7 +4,7 @@
  *	Authors:	Alan Cox <iiitac@pyr.swan.ac.uk>
  *			Florian La Roche <rzsfl@rz.uni-sb.de>
  *
- *	Version:	$Id: skbuff.c,v 1.74 2000-12-06 11:45:19 anton Exp $
+ *	Version:	$Id: skbuff.c,v 1.75 2000-12-08 17:15:53 davem Exp $
  *
  *	Fixes:	
  *		Alan Cox	:	Fixed the worst of the load balancer bugs.
@@ -202,7 +202,6 @@ struct sk_buff *alloc_skb(unsigned int size,int gfp_mask)
 
 	/* Set up other state */
 	skb->len = 0;
-	skb->is_clone = 0;
 	skb->cloned = 0;
 
 	atomic_set(&skb->users, 1); 
@@ -233,7 +232,6 @@ static inline void skb_headerinit(void *p, kmem_cache_t *cache,
 	skb->ip_summed = 0;
 	skb->security = 0;	/* By default packets are insecure */
 	skb->dst = NULL;
-	skb->rx_dev = NULL;
 #ifdef CONFIG_NETFILTER
 	skb->nfmark = skb->nfcache = 0;
 	skb->nfct = NULL;
@@ -287,10 +285,6 @@ void __kfree_skb(struct sk_buff *skb)
 #ifdef CONFIG_NETFILTER
 	nf_conntrack_put(skb->nfct);
 #endif
-#ifdef CONFIG_NET		
-	if(skb->rx_dev)
-		dev_put(skb->rx_dev);
-#endif		
 	skb_headerinit(skb, NULL, 0);  /* clean state */
 	kfree_skbmem(skb);
 }
@@ -325,12 +319,10 @@ struct sk_buff *skb_clone(struct sk_buff *skb, int gfp_mask)
 	skb->cloned = 1;
        
 	dst_clone(n->dst);
-	n->rx_dev = NULL;
 	n->cloned = 1;
 	n->next = n->prev = NULL;
 	n->list = NULL;
 	n->sk = NULL;
-	n->is_clone = 1;
 	atomic_set(&n->users, 1);
 	n->destructor = NULL;
 #ifdef CONFIG_NETFILTER
@@ -349,7 +341,6 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 	new->list=NULL;
 	new->sk=NULL;
 	new->dev=old->dev;
-	new->rx_dev=NULL;
 	new->priority=old->priority;
 	new->protocol=old->protocol;
 	new->dst=dst_clone(old->dst);
@@ -358,7 +349,6 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 	new->mac.raw=old->mac.raw+offset;
 	memcpy(new->cb, old->cb, sizeof(old->cb));
 	new->used=old->used;
-	new->is_clone=0;
 	atomic_set(&new->users, 1);
 	new->pkt_type=old->pkt_type;
 	new->stamp=old->stamp;
diff --git a/net/decnet/dn_nsp_in.c b/net/decnet/dn_nsp_in.c
index 4754cd8..3617294 100644
--- a/net/decnet/dn_nsp_in.c
+++ b/net/decnet/dn_nsp_in.c
@@ -78,9 +78,9 @@ extern int decnet_log_martians;
 static void dn_log_martian(struct sk_buff *skb, const char *msg)
 {
 	if (decnet_log_martians && net_ratelimit()) {
-		char *devname = skb->rx_dev ? skb->rx_dev->name : "???";
+		char *devname = skb->dev ? skb->dev->name : "???";
 		struct dn_skb_cb *cb = (struct dn_skb_cb *)skb->cb;
-		printk(KERN_INFO "DECnet: Martian packet (%s) rx_dev=%s src=0x%04hx dst=0x%04hx srcport=0x%04hx dstport=0x%04hx\n", msg, devname, cb->src, cb->dst, cb->src_port, cb->dst_port);
+		printk(KERN_INFO "DECnet: Martian packet (%s) dev=%s src=0x%04hx dst=0x%04hx srcport=0x%04hx dstport=0x%04hx\n", msg, devname, cb->src, cb->dst, cb->src_port, cb->dst_port);
 	}
 }
 
@@ -782,7 +782,7 @@ free_out:
 
 int dn_nsp_rx(struct sk_buff *skb)
 {
-	return NF_HOOK(PF_DECnet, NF_DN_LOCAL_IN, skb, skb->rx_dev, NULL, dn_nsp_rx_packet);
+	return NF_HOOK(PF_DECnet, NF_DN_LOCAL_IN, skb, skb->dev, NULL, dn_nsp_rx_packet);
 }
 
 /*
diff --git a/net/decnet/dn_route.c b/net/decnet/dn_route.c
index 20ec07a..70646fc 100644
--- a/net/decnet/dn_route.c
+++ b/net/decnet/dn_route.c
@@ -526,6 +526,7 @@ static int dn_forward(struct sk_buff *skb)
 {
 	struct dn_skb_cb *cb = (struct dn_skb_cb *)skb->cb;
 	struct dst_entry *dst = skb->dst;
+	struct net_device *dev = skb->dev;
 	struct neighbour *neigh;
 	int err = -EINVAL;
 
@@ -551,7 +552,7 @@ static int dn_forward(struct sk_buff *skb)
 	else
 		cb->rt_flags &= ~DN_RT_F_IE;
 
-	return NF_HOOK(PF_DECnet, NF_DN_FORWARD, skb, skb->rx_dev, skb->dev, neigh->output);
+	return NF_HOOK(PF_DECnet, NF_DN_FORWARD, skb, dev, skb->dev, neigh->output);
 
 
 error:
@@ -985,7 +986,6 @@ int dn_cache_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh, void *arg)
 		}
 		skb->protocol = __constant_htons(ETH_P_DNA_RT);
 		skb->dev = dev;
-		skb->rx_dev = dev;
 		cb->src = src;
 		cb->dst = dst;
 		local_bh_disable();
@@ -1002,7 +1002,6 @@ int dn_cache_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh, void *arg)
 	if (skb->dev)
 		dev_put(skb->dev);
 	skb->dev = NULL;
-	skb->rx_dev = NULL;
 	if (err)
 		goto out_free;
 	skb->dst = &rt->u.dst;
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index b520899..3dec2c7 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -5,7 +5,7 @@
  *
  *		The IP fragmentation functionality.
  *		
- * Version:	$Id: ip_fragment.c,v 1.52 2000-11-28 13:32:54 davem Exp $
+ * Version:	$Id: ip_fragment.c,v 1.53 2000-12-08 17:15:53 davem Exp $
  *
  * Authors:	Fred N. van Kempen <waltje@uWalt.NL.Mugnet.ORG>
  *		Alan Cox <Alan.Cox@linux.org>
@@ -83,7 +83,7 @@ struct ipq {
 	atomic_t	refcnt;
 	struct timer_list timer;	/* when will this queue expire?		*/
 	struct ipq	**pprev;
-	struct net_device	*dev;	/* Device - for icmp replies */
+	int		iif;		/* Device index - for icmp replies	*/
 };
 
 /* Hash table. */
@@ -255,8 +255,13 @@ static void ip_expire(unsigned long arg)
 	IP_INC_STATS_BH(IpReasmFails);
 
 	if ((qp->last_in&FIRST_IN) && qp->fragments != NULL) {
+		struct sk_buff *head = qp->fragments;
+
 		/* Send an ICMP "Fragment Reassembly Timeout" message. */
-		icmp_send(qp->fragments, ICMP_TIME_EXCEEDED, ICMP_EXC_FRAGTIME, 0);
+		if ((head->dev = dev_get_by_index(qp->iif)) != NULL) {
+			icmp_send(head, ICMP_TIME_EXCEEDED, ICMP_EXC_FRAGTIME, 0);
+			dev_put(head->dev);
+		}
 	}
 out:
 	spin_unlock(&qp->lock);
@@ -480,7 +485,8 @@ static void ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	else
 		qp->fragments = skb;
 
-	qp->dev = skb->dev;
+	qp->iif = skb->dev->ifindex;
+	skb->dev = NULL;
 	qp->meat += skb->len;
 	atomic_add(skb->truesize, &ip_frag_mem);
 	if (offset == 0)
@@ -499,7 +505,7 @@ err:
  * of bits on input. Until the new skb data handling is in I'm not going
  * to touch this with a bargepole. 
  */
-static struct sk_buff *ip_frag_reasm(struct ipq *qp)
+static struct sk_buff *ip_frag_reasm(struct ipq *qp, struct net_device *dev)
 {
 	struct sk_buff *skb;
 	struct iphdr *iph;
@@ -546,7 +552,7 @@ static struct sk_buff *ip_frag_reasm(struct ipq *qp)
 	skb->dst = dst_clone(head->dst);
 	skb->pkt_type = head->pkt_type;
 	skb->protocol = head->protocol;
-	skb->dev = qp->dev;
+	skb->dev = dev;
 
 	/*
 	*  Clearly bogus, because security markings of the individual
@@ -595,6 +601,7 @@ struct sk_buff *ip_defrag(struct sk_buff *skb)
 {
 	struct iphdr *iph = skb->nh.iph;
 	struct ipq *qp;
+	struct net_device *dev;
 	
 	IP_INC_STATS_BH(IpReasmReqds);
 
@@ -602,6 +609,8 @@ struct sk_buff *ip_defrag(struct sk_buff *skb)
 	if (atomic_read(&ip_frag_mem) > sysctl_ipfrag_high_thresh)
 		ip_evictor();
 
+	dev = skb->dev;
+
 	/* Lookup (or create) queue header */
 	if ((qp = ip_find(iph)) != NULL) {
 		struct sk_buff *ret = NULL;
@@ -612,7 +621,7 @@ struct sk_buff *ip_defrag(struct sk_buff *skb)
 
 		if (qp->last_in == (FIRST_IN|LAST_IN) &&
 		    qp->meat == qp->len)
-			ret = ip_frag_reasm(qp);
+			ret = ip_frag_reasm(qp, dev);
 
 		spin_unlock(&qp->lock);
 		ipq_put(qp);
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 120c844..c98617d 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -5,7 +5,7 @@
  *
  *		The Internet Protocol (IP) module.
  *
- * Version:	$Id: ip_input.c,v 1.50 2000-10-24 22:54:26 davem Exp $
+ * Version:	$Id: ip_input.c,v 1.51 2000-12-08 17:15:53 davem Exp $
  *
  * Authors:	Ross Biro, <bir7@leland.Stanford.Edu>
  *		Fred N. van Kempen, <waltje@uWalt.NL.Mugnet.ORG>
@@ -225,12 +225,6 @@ static inline int ip_local_deliver_finish(struct sk_buff *skb)
 	nf_debug_ip_local_deliver(skb);
 #endif /*CONFIG_NETFILTER_DEBUG*/
 
-	/* Free rx_dev before enqueueing to sockets */
-	if (skb->rx_dev) {
-		dev_put(skb->rx_dev);
-		skb->rx_dev = NULL;
-	}
-
         /* Point into the IP datagram, just past the header. */
         skb->h.raw = skb->nh.raw + iph->ihl*4;
 
diff --git a/net/ipv4/netfilter/ip_queue.c b/net/ipv4/netfilter/ip_queue.c
index 73fd4ea..9c8d493 100644
--- a/net/ipv4/netfilter/ip_queue.c
+++ b/net/ipv4/netfilter/ip_queue.c
@@ -400,13 +400,6 @@ static struct sk_buff *netlink_build_message(ipq_queue_element_t *e, int *errp)
 	if (e->info->outdev) strcpy(pm->outdev_name, e->info->outdev->name);
 	else pm->outdev_name[0] = '\0';
 	pm->hw_protocol = e->skb->protocol;
-	if (e->skb->rx_dev) {
-		pm->hw_type = e->skb->rx_dev->type;
-		if (e->skb->rx_dev->hard_header_parse)
-			pm->hw_addrlen =
-				e->skb->rx_dev->hard_header_parse(e->skb,
-				                                  pm->hw_addr);
-	}
 	if (data_len)
 		memcpy(pm->payload, e->skb->data, data_len);
 	nlh->nlmsg_len = skb->tail - old_tail;
diff --git a/net/ipv4/netfilter/ipt_MIRROR.c b/net/ipv4/netfilter/ipt_MIRROR.c
index cb5362d..9449c51 100644
--- a/net/ipv4/netfilter/ipt_MIRROR.c
+++ b/net/ipv4/netfilter/ipt_MIRROR.c
@@ -50,7 +50,7 @@ static int route_mirror(struct sk_buff *skb)
 
 	/* check if the interface we are leaving by is the same as the
            one we arrived on */
-	if (skb->rx_dev == rt->u.dst.dev) {
+	if (skb->dev == rt->u.dst.dev) {
 		/* Drop old route. */
 		dst_release(skb->dst);
 		skb->dst = &rt->u.dst;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 27ff7be..7c06a6b 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -5,7 +5,7 @@
  *
  *		Implementation of the Transmission Control Protocol(TCP).
  *
- * Version:	$Id: tcp_ipv4.c,v 1.221 2000-11-28 17:04:10 davem Exp $
+ * Version:	$Id: tcp_ipv4.c,v 1.222 2000-12-08 17:15:53 davem Exp $
  *
  *		IPv4 specific functions
  *
@@ -1649,6 +1649,8 @@ process:
 	if (sk->state == TCP_TIME_WAIT)
 		goto do_time_wait;
 
+	skb->dev = NULL;
+
 	bh_lock_sock(sk);
 	ret = 0;
 	if (!sk->lock.users) {
diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index c5b645d..41b42a4 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -6,7 +6,7 @@
  *	Pedro Roque		<roque@di.fc.ul.pt>
  *	Ian P. Morris		<I.P.Morris@soton.ac.uk>
  *
- *	$Id: ip6_input.c,v 1.17 2000-02-27 19:42:53 davem Exp $
+ *	$Id: ip6_input.c,v 1.18 2000-12-08 17:15:54 davem Exp $
  *
  *	Based in linux/net/ipv4/ip_input.c
  *
@@ -146,11 +146,6 @@ static inline int ip6_input_finish(struct sk_buff *skb)
 	}
 	len = skb->tail - skb->h.raw;
 
-	if (skb->rx_dev) {
-		dev_put(skb->rx_dev);
-		skb->rx_dev = NULL;
-	}
-
 	raw_sk = raw_v6_htable[nexthdr&(MAX_INET_PROTOS-1)];
 	if (raw_sk)
 		raw_sk = ipv6_raw_deliver(skb, nexthdr, len);
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 4111466..edfdc23 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -5,7 +5,7 @@
  *	Authors:
  *	Pedro Roque		<roque@di.fc.ul.pt>	
  *
- *	$Id: reassembly.c,v 1.20 2000-11-28 13:48:03 davem Exp $
+ *	$Id: reassembly.c,v 1.21 2000-12-08 17:15:54 davem Exp $
  *
  *	Based on: net/ipv4/ip_fragment.c
  *
@@ -78,7 +78,6 @@ struct frag_queue
 	struct sk_buff		*fragments;
 	int			len;
 	int			meat;
-	struct net_device	*dev;
 	int			iif;
 	__u8			last_in;	/* has first/last segment arrived? */
 #define COMPLETE		4
@@ -476,8 +475,8 @@ static void ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
 	else
 		fq->fragments = skb;
 
-	fq->dev = skb->dev;
 	fq->iif = skb->dev->ifindex;
+	skb->dev = NULL;
 	fq->meat += skb->len;
 	atomic_add(skb->truesize, &ip6_frag_mem);
 
@@ -507,7 +506,8 @@ err:
  *	queue is eligible for reassembly i.e. it is not COMPLETE,
  *	the last and the first frames arrived and all the bits are here.
  */
-static u8* ip6_frag_reasm(struct frag_queue *fq, struct sk_buff **skb_in)
+static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff **skb_in,
+			  struct net_device *dev)
 {
 	struct sk_buff *fp, *head = fq->fragments;
 	struct sk_buff *skb;
@@ -541,7 +541,7 @@ static u8* ip6_frag_reasm(struct frag_queue *fq, struct sk_buff **skb_in)
 
 	skb->mac.raw = skb->data;
 	skb->nh.ipv6h = (struct ipv6hdr *) skb->data;
-	skb->dev = fq->dev;
+	skb->dev = dev;
 	skb->protocol = __constant_htons(ETH_P_IPV6);
 	skb->pkt_type = head->pkt_type;
 	FRAG6_CB(skb)->h = FRAG6_CB(head)->h;
@@ -579,6 +579,7 @@ u8* ipv6_reassembly(struct sk_buff **skbp, __u8 *nhptr)
 {
 	struct sk_buff *skb = *skbp; 
 	struct frag_hdr *fhdr = (struct frag_hdr *) (skb->h.raw);
+	struct net_device *dev = skb->dev;
 	struct frag_queue *fq;
 	struct ipv6hdr *hdr;
 
@@ -616,7 +617,7 @@ u8* ipv6_reassembly(struct sk_buff **skbp, __u8 *nhptr)
 
 		if (fq->last_in == (FIRST_IN|LAST_IN) &&
 		    fq->meat == fq->len)
-			ret = ip6_frag_reasm(fq, skbp);
+			ret = ip6_frag_reasm(fq, skbp, dev);
 
 		spin_unlock(&fq->lock);
 		fq_put(fq);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 0801f14..bfb5aef 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -5,7 +5,7 @@
  *	Authors:
  *	Pedro Roque		<roque@di.fc.ul.pt>	
  *
- *	$Id: tcp_ipv6.c,v 1.127 2000-11-28 17:04:10 davem Exp $
+ *	$Id: tcp_ipv6.c,v 1.128 2000-12-08 17:15:54 davem Exp $
  *
  *	Based on: 
  *	linux/net/ipv4/tcp.c
@@ -1576,6 +1576,8 @@ process:
 	if(sk->state == TCP_TIME_WAIT)
 		goto do_time_wait;
 
+	skb->dev = NULL;
+
 	bh_lock_sock(sk);
 	ret = 0;
 	if (!sk->lock.users) {
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index b30efe7..b76a320 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -5,7 +5,7 @@
  *
  *		PACKET - implements raw packet sockets.
  *
- * Version:	$Id: af_packet.c,v 1.46 2000-10-24 21:26:19 davem Exp $
+ * Version:	$Id: af_packet.c,v 1.47 2000-12-08 17:15:54 davem Exp $
  *
  * Authors:	Ross Biro, <bir7@leland.Stanford.Edu>
  *		Fred N. van Kempen, <waltje@uWalt.NL.Mugnet.ORG>
@@ -264,11 +264,6 @@ static int packet_rcv_spkt(struct sk_buff *skb, struct net_device *dev,  struct
 	strncpy(spkt->spkt_device, dev->name, sizeof(spkt->spkt_device));
 	spkt->spkt_protocol = skb->protocol;
 
-	if (skb->rx_dev) {
-		dev_put(skb->rx_dev);
-		skb->rx_dev = NULL;
-	}
-
 	/*
 	 *	Charge the memory to the socket. This is done specifically
 	 *	to prevent sockets using all the memory up.
@@ -482,17 +477,13 @@ static int packet_rcv(struct sk_buff *skb, struct net_device *dev,  struct packe
 	if (dev->hard_header_parse)
 		sll->sll_halen = dev->hard_header_parse(skb, sll->sll_addr);
 
-	if (skb->rx_dev) {
-		dev_put(skb->rx_dev);
-		skb->rx_dev = NULL;
-	}
-
 #ifdef CONFIG_FILTER
 	if (skb->len > snaplen)
 		__skb_trim(skb, snaplen);
 #endif
 
 	skb_set_owner_r(skb, sk);
+	skb->dev = NULL;
 	spin_lock(&sk->receive_queue.lock);
 	po->stats.tp_packets++;
 	__skb_queue_tail(&sk->receive_queue, skb);

^ permalink raw reply related

* Re: [PATCH 2/2] ipv4: Change rt->rt_iif encoding.
From: David Miller @ 2012-07-23 23:05 UTC (permalink / raw)
  To: ja; +Cc: netdev
In-Reply-To: <alpine.LFD.2.00.1207240157120.2404@ja.ssi.bg>

From: Julian Anastasov <ja@ssi.bg>
Date: Tue, 24 Jul 2012 02:11:38 +0300 (EEST)

> skb_iif: ifindex of device we arrived on
> 
> 	If it has hidden semantics, maybe we can make this
> comment more specific. Does it survive some loops?

Strangely I hadn't noticed this, and after the email I just sent
out I was going to look into adding such a value :-)

Great!

I'll respin my patches to use that thing, thanks Julian.

^ permalink raw reply

* Re: [PATCH 1/2] ipvs: ip_vs_ftp depends on nf_conntrack_ftp helper
From: Simon Horman @ 2012-07-23 23:11 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Julian Anastasov, lvs-devel, netdev, netfilter-devel,
	Wensong Zhang, Hans Schillstrom, Jesper Dangaard Brouer
In-Reply-To: <20120723173906.GA1430@1984>

On Mon, Jul 23, 2012 at 07:39:06PM +0200, Pablo Neira Ayuso wrote:
> On Mon, Jul 23, 2012 at 03:48:18PM +0900, Simon Horman wrote:
> > On Thu, Jul 12, 2012 at 10:43:22PM +0300, Julian Anastasov wrote:
> > > 
> > > 	Hello,
> > > 
> > > On Thu, 12 Jul 2012, Pablo Neira Ayuso wrote:
> > > 
> > > > On Wed, Jul 11, 2012 at 09:25:26AM +0900, Simon Horman wrote:
> > > > > From: Julian Anastasov <ja@ssi.bg>
> > > > > 
> > > > > 	The FTP application indirectly depends on the
> > > > > nf_conntrack_ftp helper for proper NAT support. If the
> > > > > module is not loaded, IPVS can resize the packets for the
> > > > > command connection, eg. PASV response but the SEQ adjustment
> > > > > logic in ipv4_confirm is not called without helper.
> > > > > 
> > > > > Signed-off-by: Julian Anastasov <ja@ssi.bg>
> > > > > Signed-off-by: Simon Horman <horms@verge.net.au>
> > > > > ---
> > > > >  net/netfilter/ipvs/Kconfig | 3 ++-
> > > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/net/netfilter/ipvs/Kconfig b/net/netfilter/ipvs/Kconfig
> > > > > index f987138..8b2cffd 100644
> > > > > --- a/net/netfilter/ipvs/Kconfig
> > > > > +++ b/net/netfilter/ipvs/Kconfig
> > > > > @@ -250,7 +250,8 @@ comment 'IPVS application helper'
> > > > >  
> > > > >  config	IP_VS_FTP
> > > > >    	tristate "FTP protocol helper"
> > > > > -        depends on IP_VS_PROTO_TCP && NF_CONNTRACK && NF_NAT
> > > > > +	depends on IP_VS_PROTO_TCP && NF_CONNTRACK && NF_NAT && \
> > > > > +		NF_CONNTRACK_FTP
> > > > 
> > > > If you require FTP NAT support, then this depends on NF_NAT_FTP
> > > > instead of NF_CONNTRACK_FTP.
> > > 
> > > 	No, I just checked again, it works without nf_nat_ftp,
> > > only nf_nat, nf_conntrack_ftp and iptable_nat are needed.
> > > We use packet mangling part from nf_nat (nf_nat_mangle_tcp_packet).
> > 
> > Is there a consensus on this?
> 
> Fine with me, I just wanted to make sure this is what you wanted. Thanks,
> Simon.

Thanks. I'll include this in a pull request after rebasing ipvs-next.
I plan to do that today.


^ permalink raw reply

* Re: [PATCH 2/2] ipv4: Change rt->rt_iif encoding.
From: David Miller @ 2012-07-23 23:14 UTC (permalink / raw)
  To: ja; +Cc: netdev
In-Reply-To: <20120723.160541.184307938805782289.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Mon, 23 Jul 2012 16:05:41 -0700 (PDT)

> From: Julian Anastasov <ja@ssi.bg>
> Date: Tue, 24 Jul 2012 02:11:38 +0300 (EEST)
> 
>> skb_iif: ifindex of device we arrived on
>> 
>> 	If it has hidden semantics, maybe we can make this
>> comment more specific. Does it survive some loops?
> 
> Strangely I hadn't noticed this, and after the email I just sent
> out I was going to look into adding such a value :-)
> 
> Great!
> 
> I'll respin my patches to use that thing, thanks Julian.

Hmmm, the problem is that when we decapsulate VLAN devices, we're left
with the parent device's index in skb->skb_iif.

That's what all of that "orig_dev" logic in __netif_receive_skb() is
all about.  The skb->dev is rewritten by vlan_do_receive() for that
case.

I wonder if we should just get rid of all of that orig_dev logic and
simply update skb->skb_iif every time we hit the code starting at
label "another_round".

I'll keep looking into this.
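
The idea floated above can be illustrated with a small user-space model.
This is only a sketch under stated assumptions, not kernel code: toy_dev,
toy_skb and toy_receive() are hypothetical stand-ins for struct net_device,
struct sk_buff and the loop around "another_round", and the real
__netif_receive_skb() does far more than this.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for struct net_device / struct sk_buff. */
struct toy_dev {
	int ifindex;
	struct toy_dev *vlan_child;	/* device a tagged frame is handed to, or NULL */
};

struct toy_skb {
	struct toy_dev *dev;
	int skb_iif;
	int vlan_tagged;
};

/*
 * Models the loop around "another_round": every pass first records the
 * ifindex of the current device, then, if the frame is still tagged,
 * hands it to the VLAN device and goes around again.
 */
static void toy_receive(struct toy_skb *skb)
{
	for (;;) {
		/* the proposed change: refresh skb_iif on every round */
		skb->skb_iif = skb->dev->ifindex;

		if (!skb->vlan_tagged || skb->dev->vlan_child == NULL)
			break;

		/* roughly what the mail says vlan_do_receive() does to skb->dev */
		skb->dev = skb->dev->vlan_child;
		skb->vlan_tagged = 0;
	}

	/* after the loop, skb_iif names the device the frame ended up on */
	assert(skb->skb_iif == skb->dev->ifindex);
}
```

In this model a tagged frame arriving on the parent device leaves the loop
with skb_iif set to the VLAN device's index rather than the parent's, which
is the behaviour the proposal is after.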

^ permalink raw reply

* [GIT PULL nf-next] IPVS
From: Simon Horman @ 2012-07-23 23:28 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: lvs-devel, netdev, netfilter-devel, Wensong Zhang,
	Julian Anastasov

Hi Pablo,

please consider the following enhancements to IPVS for inclusion in 3.6.

----------------------------------------------------------------
The following changes since commit 9b70749e64132e17ab02239b82fcb4a2c55554d1:

  niu: Change niu_rbr_fill() to use unlikely() to check niu_rbr_add_page() return value (2012-07-22 23:31:07 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next.git master

for you to fetch changes up to a805cfbcaaf819ab71a052d8a9d5d4c88cf2aba0:

  ipvs: add pmtu_disc option to disable IP DF for TUN packets (2012-07-24 08:23:06 +0900)

----------------------------------------------------------------
Claudiu Ghioc (1):
      ipvs: fixed sparse warning

Julian Anastasov (4):
      ipvs: ip_vs_ftp depends on nf_conntrack_ftp helper
      ipvs: generalize app registration in netns
      ipvs: implement passive PMTUD for IPIP packets
      ipvs: add pmtu_disc option to disable IP DF for TUN packets

 include/net/ip_vs.h             | 16 ++++++--
 net/netfilter/ipvs/Kconfig      |  3 +-
 net/netfilter/ipvs/ip_vs_app.c  | 58 ++++++++++++++++++++--------
 net/netfilter/ipvs/ip_vs_core.c | 76 +++++++++++++++++++++++++++++++++++--
 net/netfilter/ipvs/ip_vs_ctl.c  | 16 ++++++--
 net/netfilter/ipvs/ip_vs_ftp.c  | 21 +++--------
 net/netfilter/ipvs/ip_vs_xmit.c | 83 ++++++++++++++++++++++++++++-------------
 7 files changed, 204 insertions(+), 69 deletions(-)

^ permalink raw reply

* [PATCH 1/5] ipvs: ip_vs_ftp depends on nf_conntrack_ftp helper
From: Simon Horman @ 2012-07-23 23:28 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: lvs-devel, netdev, netfilter-devel, Wensong Zhang,
	Julian Anastasov, Simon Horman
In-Reply-To: <1343086141-9086-1-git-send-email-horms@verge.net.au>

From: Julian Anastasov <ja@ssi.bg>

	The FTP application indirectly depends on the
nf_conntrack_ftp helper for proper NAT support. If the
module is not loaded, IPVS can resize the packets for the
command connection, e.g. the PASV response, but the SEQ adjustment
logic in ipv4_confirm is not called without the helper.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
 net/netfilter/ipvs/Kconfig | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/ipvs/Kconfig b/net/netfilter/ipvs/Kconfig
index f987138..8b2cffd 100644
--- a/net/netfilter/ipvs/Kconfig
+++ b/net/netfilter/ipvs/Kconfig
@@ -250,7 +250,8 @@ comment 'IPVS application helper'
 
 config	IP_VS_FTP
   	tristate "FTP protocol helper"
-        depends on IP_VS_PROTO_TCP && NF_CONNTRACK && NF_NAT
+	depends on IP_VS_PROTO_TCP && NF_CONNTRACK && NF_NAT && \
+		NF_CONNTRACK_FTP
 	select IP_VS_NFCT
 	---help---
 	  FTP is a protocol that transfers IP address and/or port number in
-- 
1.7.10.2.484.gcd07cc5


^ permalink raw reply related

* [PATCH 2/5] ipvs: generalize app registration in netns
From: Simon Horman @ 2012-07-23 23:28 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: lvs-devel, netdev, netfilter-devel, Wensong Zhang,
	Julian Anastasov, Simon Horman
In-Reply-To: <1343086141-9086-1-git-send-email-horms@verge.net.au>

From: Julian Anastasov <ja@ssi.bg>

	Get rid of the ftp_app pointer and allow applications
to be registered without adding fields in the netns_ipvs structure.

v2: fix coding style as suggested by Pablo Neira Ayuso <pablo@netfilter.org>

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
 include/net/ip_vs.h            |  5 ++--
 net/netfilter/ipvs/ip_vs_app.c | 58 ++++++++++++++++++++++++++++++------------
 net/netfilter/ipvs/ip_vs_ftp.c | 21 ++++-----------
 3 files changed, 49 insertions(+), 35 deletions(-)

diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 95374d1..4b8f18f 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -808,8 +808,6 @@ struct netns_ipvs {
 	struct list_head	rs_table[IP_VS_RTAB_SIZE];
 	/* ip_vs_app */
 	struct list_head	app_list;
-	/* ip_vs_ftp */
-	struct ip_vs_app	*ftp_app;
 	/* ip_vs_proto */
 	#define IP_VS_PROTO_TAB_SIZE	32	/* must be power of 2 */
 	struct ip_vs_proto_data *proto_data_table[IP_VS_PROTO_TAB_SIZE];
@@ -1179,7 +1177,8 @@ extern void ip_vs_service_net_cleanup(struct net *net);
  *      (from ip_vs_app.c)
  */
 #define IP_VS_APP_MAX_PORTS  8
-extern int register_ip_vs_app(struct net *net, struct ip_vs_app *app);
+extern struct ip_vs_app *register_ip_vs_app(struct net *net,
+					    struct ip_vs_app *app);
 extern void unregister_ip_vs_app(struct net *net, struct ip_vs_app *app);
 extern int ip_vs_bind_app(struct ip_vs_conn *cp, struct ip_vs_protocol *pp);
 extern void ip_vs_unbind_app(struct ip_vs_conn *cp);
diff --git a/net/netfilter/ipvs/ip_vs_app.c b/net/netfilter/ipvs/ip_vs_app.c
index 64f9e8f..9713e6e 100644
--- a/net/netfilter/ipvs/ip_vs_app.c
+++ b/net/netfilter/ipvs/ip_vs_app.c
@@ -180,22 +180,38 @@ register_ip_vs_app_inc(struct net *net, struct ip_vs_app *app, __u16 proto,
 }
 
 
-/*
- *	ip_vs_app registration routine
- */
-int register_ip_vs_app(struct net *net, struct ip_vs_app *app)
+/* Register application for netns */
+struct ip_vs_app *register_ip_vs_app(struct net *net, struct ip_vs_app *app)
 {
 	struct netns_ipvs *ipvs = net_ipvs(net);
-	/* increase the module use count */
-	ip_vs_use_count_inc();
+	struct ip_vs_app *a;
+	int err = 0;
+
+	if (!ipvs)
+		return ERR_PTR(-ENOENT);
 
 	mutex_lock(&__ip_vs_app_mutex);
 
-	list_add(&app->a_list, &ipvs->app_list);
+	list_for_each_entry(a, &ipvs->app_list, a_list) {
+		if (!strcmp(app->name, a->name)) {
+			err = -EEXIST;
+			goto out_unlock;
+		}
+	}
+	a = kmemdup(app, sizeof(*app), GFP_KERNEL);
+	if (!a) {
+		err = -ENOMEM;
+		goto out_unlock;
+	}
+	INIT_LIST_HEAD(&a->incs_list);
+	list_add(&a->a_list, &ipvs->app_list);
+	/* increase the module use count */
+	ip_vs_use_count_inc();
 
+out_unlock:
 	mutex_unlock(&__ip_vs_app_mutex);
 
-	return 0;
+	return err ? ERR_PTR(err) : a;
 }
 
 
@@ -205,20 +221,29 @@ int register_ip_vs_app(struct net *net, struct ip_vs_app *app)
  */
 void unregister_ip_vs_app(struct net *net, struct ip_vs_app *app)
 {
-	struct ip_vs_app *inc, *nxt;
+	struct netns_ipvs *ipvs = net_ipvs(net);
+	struct ip_vs_app *a, *anxt, *inc, *nxt;
+
+	if (!ipvs)
+		return;
 
 	mutex_lock(&__ip_vs_app_mutex);
 
-	list_for_each_entry_safe(inc, nxt, &app->incs_list, a_list) {
-		ip_vs_app_inc_release(net, inc);
-	}
+	list_for_each_entry_safe(a, anxt, &ipvs->app_list, a_list) {
+		if (app && strcmp(app->name, a->name))
+			continue;
+		list_for_each_entry_safe(inc, nxt, &a->incs_list, a_list) {
+			ip_vs_app_inc_release(net, inc);
+		}
 
-	list_del(&app->a_list);
+		list_del(&a->a_list);
+		kfree(a);
 
-	mutex_unlock(&__ip_vs_app_mutex);
+		/* decrease the module use count */
+		ip_vs_use_count_dec();
+	}
 
-	/* decrease the module use count */
-	ip_vs_use_count_dec();
+	mutex_unlock(&__ip_vs_app_mutex);
 }
 
 
@@ -586,5 +611,6 @@ int __net_init ip_vs_app_net_init(struct net *net)
 
 void __net_exit ip_vs_app_net_cleanup(struct net *net)
 {
+	unregister_ip_vs_app(net, NULL /* all */);
 	proc_net_remove(net, "ip_vs_app");
 }
diff --git a/net/netfilter/ipvs/ip_vs_ftp.c b/net/netfilter/ipvs/ip_vs_ftp.c
index b20b29c..ad70b7e 100644
--- a/net/netfilter/ipvs/ip_vs_ftp.c
+++ b/net/netfilter/ipvs/ip_vs_ftp.c
@@ -441,16 +441,10 @@ static int __net_init __ip_vs_ftp_init(struct net *net)
 
 	if (!ipvs)
 		return -ENOENT;
-	app = kmemdup(&ip_vs_ftp, sizeof(struct ip_vs_app), GFP_KERNEL);
-	if (!app)
-		return -ENOMEM;
-	INIT_LIST_HEAD(&app->a_list);
-	INIT_LIST_HEAD(&app->incs_list);
-	ipvs->ftp_app = app;
 
-	ret = register_ip_vs_app(net, app);
-	if (ret)
-		goto err_exit;
+	app = register_ip_vs_app(net, &ip_vs_ftp);
+	if (IS_ERR(app))
+		return PTR_ERR(app);
 
 	for (i = 0; i < ports_count; i++) {
 		if (!ports[i])
@@ -464,9 +458,7 @@ static int __net_init __ip_vs_ftp_init(struct net *net)
 	return 0;
 
 err_unreg:
-	unregister_ip_vs_app(net, app);
-err_exit:
-	kfree(ipvs->ftp_app);
+	unregister_ip_vs_app(net, &ip_vs_ftp);
 	return ret;
 }
 /*
@@ -474,10 +466,7 @@ err_exit:
  */
 static void __ip_vs_ftp_exit(struct net *net)
 {
-	struct netns_ipvs *ipvs = net_ipvs(net);
-
-	unregister_ip_vs_app(net, ipvs->ftp_app);
-	kfree(ipvs->ftp_app);
+	unregister_ip_vs_app(net, &ip_vs_ftp);
 }
 
 static struct pernet_operations ip_vs_ftp_ops = {
-- 
1.7.10.2.484.gcd07cc5
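
The registration pattern in the patch above (return the per-netns copy, or
an errno encoded in the pointer) can be tried out in user space. The sketch
below is an illustration only: the toy_* names are hypothetical, the
ERR_PTR()/IS_ERR()/PTR_ERR() imitations are simplified re-implementations,
and the mutex taken by the real code is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOY_MAX_ERRNO 4095

/* User-space imitations of the kernel's ERR_PTR()/PTR_ERR()/IS_ERR(). */
static inline void *toy_err_ptr(long err) { return (void *)(intptr_t)err; }
static inline long toy_ptr_err(const void *p) { return (long)(intptr_t)p; }
static inline int toy_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)(-TOY_MAX_ERRNO);
}

struct toy_app {
	struct toy_app *next;
	const char *name;
};

struct toy_netns {
	struct toy_app *app_list;
};

/* Register a private copy of @tmpl; returns the copy or an encoded errno. */
static struct toy_app *toy_register_app(struct toy_netns *ns,
					const struct toy_app *tmpl)
{
	struct toy_app *a;

	if (!ns)
		return toy_err_ptr(-ENOENT);

	/* refuse a second registration under the same name */
	for (a = ns->app_list; a; a = a->next)
		if (!strcmp(a->name, tmpl->name))
			return toy_err_ptr(-EEXIST);

	a = malloc(sizeof(*a));
	if (!a)
		return toy_err_ptr(-ENOMEM);
	*a = *tmpl;
	a->next = ns->app_list;
	ns->app_list = a;
	return a;
}

/* Remove the app matching @tmpl by name, or every app when @tmpl is NULL. */
static void toy_unregister_app(struct toy_netns *ns, const struct toy_app *tmpl)
{
	struct toy_app **pp;

	assert(ns != NULL);
	pp = &ns->app_list;
	while (*pp) {
		struct toy_app *a = *pp;

		if (tmpl && strcmp(tmpl->name, a->name)) {
			pp = &a->next;
			continue;
		}
		*pp = a->next;
		free(a);
	}
}
```

Because lookup is by name, a caller can unregister with the same static
template it registered with, which is how __ip_vs_ftp_exit() in the patch
gets by without the ftp_app pointer.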


^ permalink raw reply related

* [PATCH 3/5] ipvs: fixed sparse warning
From: Simon Horman @ 2012-07-23 23:28 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: lvs-devel, netdev, netfilter-devel, Wensong Zhang,
	Julian Anastasov, Claudiu Ghioc, Claudiu Ghioc, Simon Horman
In-Reply-To: <1343086141-9086-1-git-send-email-horms@verge.net.au>

From: Claudiu Ghioc <claudiughioc@gmail.com>

Removed the following sparse warnings, whether CONFIG_SYSCTL
is defined or not:
*       warning: symbol 'ip_vs_control_net_init_sysctl' was not
	declared. Should it be static?
*       warning: symbol 'ip_vs_control_net_cleanup_sysctl' was
	not declared. Should it be static?

Signed-off-by: Claudiu Ghioc <claudiu.ghioc@gmail.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
 net/netfilter/ipvs/ip_vs_ctl.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 84444dd..d6d5cca 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -3675,7 +3675,7 @@ static void ip_vs_genl_unregister(void)
  * per netns intit/exit func.
  */
 #ifdef CONFIG_SYSCTL
-int __net_init ip_vs_control_net_init_sysctl(struct net *net)
+static int __net_init ip_vs_control_net_init_sysctl(struct net *net)
 {
 	int idx;
 	struct netns_ipvs *ipvs = net_ipvs(net);
@@ -3743,7 +3743,7 @@ int __net_init ip_vs_control_net_init_sysctl(struct net *net)
 	return 0;
 }
 
-void __net_exit ip_vs_control_net_cleanup_sysctl(struct net *net)
+static void __net_exit ip_vs_control_net_cleanup_sysctl(struct net *net)
 {
 	struct netns_ipvs *ipvs = net_ipvs(net);
 
@@ -3754,8 +3754,8 @@ void __net_exit ip_vs_control_net_cleanup_sysctl(struct net *net)
 
 #else
 
-int __net_init ip_vs_control_net_init_sysctl(struct net *net) { return 0; }
-void __net_exit ip_vs_control_net_cleanup_sysctl(struct net *net) { }
+static int __net_init ip_vs_control_net_init_sysctl(struct net *net) { return 0; }
+static void __net_exit ip_vs_control_net_cleanup_sysctl(struct net *net) { }
 
 #endif
 
-- 
1.7.10.2.484.gcd07cc5


^ permalink raw reply related

* [PATCH 4/5] ipvs: implement passive PMTUD for IPIP packets
From: Simon Horman @ 2012-07-23 23:28 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: lvs-devel, netdev, netfilter-devel, Wensong Zhang,
	Julian Anastasov, Simon Horman
In-Reply-To: <1343086141-9086-1-git-send-email-horms@verge.net.au>

From: Julian Anastasov <ja@ssi.bg>

	IPVS is missing the logic to update PMTU in routing
for its IPIP packets. We monitor the dst_mtu and can return
FRAG_NEEDED messages, but if the tunneled packets get an ICMP
error we cannot rely on other traffic to save the lowest
MTU.

	The following patch adds ICMP handling for IPIP
packets in the incoming direction, from some remote host to
our local IP used as saddr in the outer header. This
way we can forward any related ICMP traffic if it is for an IPVS
TUN connection. For the special case of PMTUD we update the
routing and, if the client requested DF, we can forward the
error.

	To properly update the routing we have to bind
the cached route (dest->dst_cache) to the selected saddr
because ipv4_update_pmtu uses saddr for dst lookup.
Add the IP_VS_RT_MODE_CONNECT flag to force such binding with
a second route.

	Update ip_vs_tunnel_xmit to provide IP_VS_RT_MODE_CONNECT
and change the code to copy DF. For now we prefer not to
force PMTU discovery (outer DF=1) because we don't have
a configuration option to enable or disable PMTUD. As we
do not keep any packets to resend, we prefer not to
play games with packets without the DF bit because the sender
is not informed when they are rejected.

	Also, change ops->update_pmtu to be called only
for local clients because there is no point in updating the
MTU for input routes; in our case skb->dst->dev is lo.
It seems the code is copied from ipip.c, where the skb
dst points to the tunnel device.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
 net/netfilter/ipvs/ip_vs_core.c | 76 +++++++++++++++++++++++++++++++++++++--
 net/netfilter/ipvs/ip_vs_xmit.c | 79 ++++++++++++++++++++++++++++-------------
 2 files changed, 128 insertions(+), 27 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index b54ecce..58918e2 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -1303,7 +1303,8 @@ ip_vs_in_icmp(struct sk_buff *skb, int *related, unsigned int hooknum)
 	struct ip_vs_conn *cp;
 	struct ip_vs_protocol *pp;
 	struct ip_vs_proto_data *pd;
-	unsigned int offset, ihl, verdict;
+	unsigned int offset, offset2, ihl, verdict;
+	bool ipip;
 
 	*related = 1;
 
@@ -1345,6 +1346,21 @@ ip_vs_in_icmp(struct sk_buff *skb, int *related, unsigned int hooknum)
 
 	net = skb_net(skb);
 
+	/* Special case for errors for IPIP packets */
+	ipip = false;
+	if (cih->protocol == IPPROTO_IPIP) {
+		if (unlikely(cih->frag_off & htons(IP_OFFSET)))
+			return NF_ACCEPT;
+		/* Error for our IPIP must arrive at LOCAL_IN */
+		if (!(skb_rtable(skb)->rt_flags & RTCF_LOCAL))
+			return NF_ACCEPT;
+		offset += cih->ihl * 4;
+		cih = skb_header_pointer(skb, offset, sizeof(_ciph), &_ciph);
+		if (cih == NULL)
+			return NF_ACCEPT; /* The packet looks wrong, ignore */
+		ipip = true;
+	}
+
 	pd = ip_vs_proto_data_get(net, cih->protocol);
 	if (!pd)
 		return NF_ACCEPT;
@@ -1358,11 +1374,14 @@ ip_vs_in_icmp(struct sk_buff *skb, int *related, unsigned int hooknum)
 	IP_VS_DBG_PKT(11, AF_INET, pp, skb, offset,
 		      "Checking incoming ICMP for");
 
+	offset2 = offset;
 	offset += cih->ihl * 4;
 
 	ip_vs_fill_iphdr(AF_INET, cih, &ciph);
-	/* The embedded headers contain source and dest in reverse order */
-	cp = pp->conn_in_get(AF_INET, skb, &ciph, offset, 1);
+	/* The embedded headers contain source and dest in reverse order.
+	 * For IPIP this is error for request, not for reply.
+	 */
+	cp = pp->conn_in_get(AF_INET, skb, &ciph, offset, ipip ? 0 : 1);
 	if (!cp)
 		return NF_ACCEPT;
 
@@ -1376,6 +1395,57 @@ ip_vs_in_icmp(struct sk_buff *skb, int *related, unsigned int hooknum)
 		goto out;
 	}
 
+	if (ipip) {
+		__be32 info = ic->un.gateway;
+
+		/* Update the MTU */
+		if (ic->type == ICMP_DEST_UNREACH &&
+		    ic->code == ICMP_FRAG_NEEDED) {
+			struct ip_vs_dest *dest = cp->dest;
+			u32 mtu = ntohs(ic->un.frag.mtu);
+
+			/* Strip outer IP and ICMP, go to IPIP header */
+			__skb_pull(skb, ihl + sizeof(_icmph));
+			offset2 -= ihl + sizeof(_icmph);
+			skb_reset_network_header(skb);
+			IP_VS_DBG(12, "ICMP for IPIP %pI4->%pI4: mtu=%u\n",
+				&ip_hdr(skb)->saddr, &ip_hdr(skb)->daddr, mtu);
+			rcu_read_lock();
+			ipv4_update_pmtu(skb, dev_net(skb->dev),
+					 mtu, 0, 0, 0, 0);
+			rcu_read_unlock();
+			/* Client uses PMTUD? */
+			if (!(cih->frag_off & htons(IP_DF)))
+				goto ignore_ipip;
+			/* Prefer the resulting PMTU */
+			if (dest) {
+				spin_lock(&dest->dst_lock);
+				if (dest->dst_cache)
+					mtu = dst_mtu(dest->dst_cache);
+				spin_unlock(&dest->dst_lock);
+			}
+			if (mtu > 68 + sizeof(struct iphdr))
+				mtu -= sizeof(struct iphdr);
+			info = htonl(mtu);
+		}
+		/* Strip outer IP, ICMP and IPIP, go to IP header of
+		 * original request.
+		 */
+		__skb_pull(skb, offset2);
+		skb_reset_network_header(skb);
+		IP_VS_DBG(12, "Sending ICMP for %pI4->%pI4: t=%u, c=%u, i=%u\n",
+			&ip_hdr(skb)->saddr, &ip_hdr(skb)->daddr,
+			ic->type, ic->code, ntohl(info));
+		icmp_send(skb, ic->type, ic->code, info);
+		/* ICMP can be shorter but anyways, account it */
+		ip_vs_out_stats(cp, skb);
+
+ignore_ipip:
+		consume_skb(skb);
+		verdict = NF_STOLEN;
+		goto out;
+	}
+
 	/* do the statistics and put it back */
 	ip_vs_in_stats(cp, skb);
 	if (IPPROTO_TCP == cih->protocol || IPPROTO_UDP == cih->protocol)
diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 65b616a..c2275ba 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -49,6 +49,7 @@ enum {
 	IP_VS_RT_MODE_RDR	= 4, /* Allow redirect from remote daddr to
 				      * local
 				      */
+	IP_VS_RT_MODE_CONNECT	= 8, /* Always bind route to saddr */
 };
 
 /*
@@ -84,6 +85,42 @@ __ip_vs_dst_check(struct ip_vs_dest *dest, u32 rtos)
 	return dst;
 }
 
+/* Get route to daddr, update *saddr, optionally bind route to saddr */
+static struct rtable *do_output_route4(struct net *net, __be32 daddr,
+				       u32 rtos, int rt_mode, __be32 *saddr)
+{
+	struct flowi4 fl4;
+	struct rtable *rt;
+	int loop = 0;
+
+	memset(&fl4, 0, sizeof(fl4));
+	fl4.daddr = daddr;
+	fl4.saddr = (rt_mode & IP_VS_RT_MODE_CONNECT) ? *saddr : 0;
+	fl4.flowi4_tos = rtos;
+
+retry:
+	rt = ip_route_output_key(net, &fl4);
+	if (IS_ERR(rt)) {
+		/* Invalid saddr ? */
+		if (PTR_ERR(rt) == -EINVAL && *saddr &&
+		    rt_mode & IP_VS_RT_MODE_CONNECT && !loop) {
+			*saddr = 0;
+			flowi4_update_output(&fl4, 0, rtos, daddr, 0);
+			goto retry;
+		}
+		IP_VS_DBG_RL("ip_route_output error, dest: %pI4\n", &daddr);
+		return NULL;
+	} else if (!*saddr && rt_mode & IP_VS_RT_MODE_CONNECT && fl4.saddr) {
+		ip_rt_put(rt);
+		*saddr = fl4.saddr;
+		flowi4_update_output(&fl4, 0, rtos, daddr, fl4.saddr);
+		loop++;
+		goto retry;
+	}
+	*saddr = fl4.saddr;
+	return rt;
+}
+
 /* Get route to destination or remote server */
 static struct rtable *
 __ip_vs_get_out_rt(struct sk_buff *skb, struct ip_vs_dest *dest,
@@ -98,20 +135,13 @@ __ip_vs_get_out_rt(struct sk_buff *skb, struct ip_vs_dest *dest,
 		spin_lock(&dest->dst_lock);
 		if (!(rt = (struct rtable *)
 		      __ip_vs_dst_check(dest, rtos))) {
-			struct flowi4 fl4;
-
-			memset(&fl4, 0, sizeof(fl4));
-			fl4.daddr = dest->addr.ip;
-			fl4.flowi4_tos = rtos;
-			rt = ip_route_output_key(net, &fl4);
-			if (IS_ERR(rt)) {
+			rt = do_output_route4(net, dest->addr.ip, rtos,
+					      rt_mode, &dest->dst_saddr.ip);
+			if (!rt) {
 				spin_unlock(&dest->dst_lock);
-				IP_VS_DBG_RL("ip_route_output error, dest: %pI4\n",
-					     &dest->addr.ip);
 				return NULL;
 			}
 			__ip_vs_dst_set(dest, rtos, dst_clone(&rt->dst), 0);
-			dest->dst_saddr.ip = fl4.saddr;
 			IP_VS_DBG(10, "new dst %pI4, src %pI4, refcnt=%d, "
 				  "rtos=%X\n",
 				  &dest->addr.ip, &dest->dst_saddr.ip,
@@ -122,19 +152,17 @@ __ip_vs_get_out_rt(struct sk_buff *skb, struct ip_vs_dest *dest,
 			*ret_saddr = dest->dst_saddr.ip;
 		spin_unlock(&dest->dst_lock);
 	} else {
-		struct flowi4 fl4;
+		__be32 saddr = htonl(INADDR_ANY);
 
-		memset(&fl4, 0, sizeof(fl4));
-		fl4.daddr = daddr;
-		fl4.flowi4_tos = rtos;
-		rt = ip_route_output_key(net, &fl4);
-		if (IS_ERR(rt)) {
-			IP_VS_DBG_RL("ip_route_output error, dest: %pI4\n",
-				     &daddr);
+		/* For such unconfigured boxes avoid many route lookups
+		 * for performance reasons because we do not remember saddr
+		 */
+		rt_mode &= ~IP_VS_RT_MODE_CONNECT;
+		rt = do_output_route4(net, daddr, rtos, rt_mode, &saddr);
+		if (!rt)
 			return NULL;
-		}
 		if (ret_saddr)
-			*ret_saddr = fl4.saddr;
+			*ret_saddr = saddr;
 	}
 
 	local = rt->rt_flags & RTCF_LOCAL;
@@ -331,6 +359,7 @@ ip_vs_dst_reset(struct ip_vs_dest *dest)
 	old_dst = dest->dst_cache;
 	dest->dst_cache = NULL;
 	dst_release(old_dst);
+	dest->dst_saddr.ip = 0;
 }
 
 #define IP_VS_XMIT_TUNNEL(skb, cp)				\
@@ -771,7 +800,7 @@ ip_vs_tunnel_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
 	struct net_device *tdev;		/* Device to other host */
 	struct iphdr  *old_iph = ip_hdr(skb);
 	u8     tos = old_iph->tos;
-	__be16 df = old_iph->frag_off;
+	__be16 df;
 	struct iphdr  *iph;			/* Our new IP header */
 	unsigned int max_headroom;		/* The extra header space needed */
 	int    mtu;
@@ -781,7 +810,8 @@ ip_vs_tunnel_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
 
 	if (!(rt = __ip_vs_get_out_rt(skb, cp->dest, cp->daddr.ip,
 				      RT_TOS(tos), IP_VS_RT_MODE_LOCAL |
-						   IP_VS_RT_MODE_NON_LOCAL,
+						   IP_VS_RT_MODE_NON_LOCAL |
+						   IP_VS_RT_MODE_CONNECT,
 						   &saddr)))
 		goto tx_error_icmp;
 	if (rt->rt_flags & RTCF_LOCAL) {
@@ -796,10 +826,11 @@ ip_vs_tunnel_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
 		IP_VS_DBG_RL("%s(): mtu less than 68\n", __func__);
 		goto tx_error_put;
 	}
-	if (skb_dst(skb))
+	if (rt_is_output_route(skb_rtable(skb)))
 		skb_dst(skb)->ops->update_pmtu(skb_dst(skb), NULL, skb, mtu);
 
-	df |= (old_iph->frag_off & htons(IP_DF));
+	/* Copy DF, reset fragment offset and MF */
+	df = old_iph->frag_off & htons(IP_DF);
 
 	if ((old_iph->frag_off & htons(IP_DF) &&
 	    mtu < ntohs(old_iph->tot_len) && !skb_is_gso(skb))) {
-- 
1.7.10.2.484.gcd07cc5
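
The MTU arithmetic in the ip_vs_in_icmp() hunk above is small enough to
check by hand. The sketch below is a user-space illustration only:
toy_inner_mtu() and the TOY_* constants are hypothetical names, and the
20-byte value assumes an IPv4 header without options, i.e.
sizeof(struct iphdr).

```c
#include <assert.h>
#include <stdint.h>

#define TOY_IPV4_HDR_LEN 20u	/* sizeof(struct iphdr), no options */
#define TOY_IPV4_MIN_MTU 68u	/* the constant 68 used in the patch */

/*
 * MTU to report to the original sender: the tunnel path MTU minus the
 * outer IPv4 header, unless that would drop to 68 or below, in which
 * case the value is passed through unchanged (as the patch does).
 */
static uint32_t toy_inner_mtu(uint32_t tunnel_mtu)
{
	if (tunnel_mtu > TOY_IPV4_MIN_MTU + TOY_IPV4_HDR_LEN)
		tunnel_mtu -= TOY_IPV4_HDR_LEN;
	return tunnel_mtu;
}
```

For example, a tunnel path MTU of 1500 is reported to the client as 1480,
while a value of 88 or less is forwarded as is.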


^ permalink raw reply related

* [PATCH 5/5] ipvs: add pmtu_disc option to disable IP DF for TUN packets
From: Simon Horman @ 2012-07-23 23:29 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: lvs-devel, netdev, netfilter-devel, Wensong Zhang,
	Julian Anastasov, Simon Horman
In-Reply-To: <1343086141-9086-1-git-send-email-horms@verge.net.au>

From: Julian Anastasov <ja@ssi.bg>

	Disabling PMTU discovery can increase the output packet
rate, but some users have enough resources and prefer to fragment
rather than drop traffic. By default, we copy the DF bit, but if
pmtu_disc is disabled we do not send FRAG_NEEDED messages anymore.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
 include/net/ip_vs.h             | 11 +++++++++++
 net/netfilter/ipvs/ip_vs_ctl.c  |  8 ++++++++
 net/netfilter/ipvs/ip_vs_xmit.c |  6 +++---
 3 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 4b8f18f..ee75ccd 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -888,6 +888,7 @@ struct netns_ipvs {
 	unsigned int		sysctl_sync_refresh_period;
 	int			sysctl_sync_retries;
 	int			sysctl_nat_icmp_send;
+	int			sysctl_pmtu_disc;
 
 	/* ip_vs_lblc */
 	int			sysctl_lblc_expiration;
@@ -974,6 +975,11 @@ static inline int sysctl_sync_sock_size(struct netns_ipvs *ipvs)
 	return ipvs->sysctl_sync_sock_size;
 }
 
+static inline int sysctl_pmtu_disc(struct netns_ipvs *ipvs)
+{
+	return ipvs->sysctl_pmtu_disc;
+}
+
 #else
 
 static inline int sysctl_sync_threshold(struct netns_ipvs *ipvs)
@@ -1016,6 +1022,11 @@ static inline int sysctl_sync_sock_size(struct netns_ipvs *ipvs)
 	return 0;
 }
 
+static inline int sysctl_pmtu_disc(struct netns_ipvs *ipvs)
+{
+	return 1;
+}
+
 #endif
 
 /*
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index d6d5cca..03d3fc6 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -1801,6 +1801,12 @@ static struct ctl_table vs_vars[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "pmtu_disc",
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 #ifdef CONFIG_IP_VS_DEBUG
 	{
 		.procname	= "debug_level",
@@ -3726,6 +3732,8 @@ static int __net_init ip_vs_control_net_init_sysctl(struct net *net)
 	ipvs->sysctl_sync_retries = clamp_t(int, DEFAULT_SYNC_RETRIES, 0, 3);
 	tbl[idx++].data = &ipvs->sysctl_sync_retries;
 	tbl[idx++].data = &ipvs->sysctl_nat_icmp_send;
+	ipvs->sysctl_pmtu_disc = 1;
+	tbl[idx++].data = &ipvs->sysctl_pmtu_disc;
 
 
 	ipvs->sysctl_hdr = register_net_sysctl(net, "net/ipv4/vs", tbl);
diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index c2275ba..543a554 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -795,6 +795,7 @@ int
 ip_vs_tunnel_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
 		  struct ip_vs_protocol *pp)
 {
+	struct netns_ipvs *ipvs = net_ipvs(skb_net(skb));
 	struct rtable *rt;			/* Route to the other host */
 	__be32 saddr;				/* Source for tunnel */
 	struct net_device *tdev;		/* Device to other host */
@@ -830,10 +831,9 @@ ip_vs_tunnel_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
 		skb_dst(skb)->ops->update_pmtu(skb_dst(skb), NULL, skb, mtu);
 
 	/* Copy DF, reset fragment offset and MF */
-	df = old_iph->frag_off & htons(IP_DF);
+	df = sysctl_pmtu_disc(ipvs) ? old_iph->frag_off & htons(IP_DF) : 0;
 
-	if ((old_iph->frag_off & htons(IP_DF) &&
-	    mtu < ntohs(old_iph->tot_len) && !skb_is_gso(skb))) {
+	if (df && mtu < ntohs(old_iph->tot_len) && !skb_is_gso(skb)) {
 		icmp_send(skb, ICMP_DEST_UNREACH,ICMP_FRAG_NEEDED, htonl(mtu));
 		IP_VS_DBG_RL("%s(): frag needed\n", __func__);
 		goto tx_error_put;
-- 
1.7.10.2.484.gcd07cc5
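
For readers following the ip_vs_xmit.c hunk, the DF decision can be
sketched in plain userspace C. This is only an illustration, not the
kernel code: tunnel_df() and tunnel_needs_frag_error() are hypothetical
names, and frag_off is assumed to be in host byte order, whereas the
kernel compares against htons(IP_DF).

```c
#include <stdbool.h>
#include <stdint.h>

#define IP_DF 0x4000	/* "don't fragment" flag, host byte order here */

/* Hypothetical helper: DF bits to copy into the outer tunnel header.
 * With pmtu_disc off the DF bit is dropped, so the packet may be
 * fragmented further down the path. */
static uint16_t tunnel_df(uint16_t frag_off, bool pmtu_disc)
{
	return pmtu_disc ? (uint16_t)(frag_off & IP_DF) : 0;
}

/* Hypothetical helper: true when the sender would get FRAG_NEEDED
 * instead of having the packet forwarded. */
static bool tunnel_needs_frag_error(uint16_t frag_off, bool pmtu_disc,
				    unsigned int mtu, unsigned int tot_len,
				    bool is_gso)
{
	return tunnel_df(frag_off, pmtu_disc) && mtu < tot_len && !is_gso;
}
```

With pmtu_disc set, a 1500 byte packet carrying DF over a 1480 byte
path MTU takes the FRAG_NEEDED branch; with pmtu_disc cleared the same
packet is passed on with DF reset.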

^ permalink raw reply related

* Re: [GIT PULL nf-next] IPVS
From: David Miller @ 2012-07-23 23:35 UTC (permalink / raw)
  To: horms; +Cc: pablo, lvs-devel, netdev, netfilter-devel, wensong, ja
In-Reply-To: <1343086141-9086-1-git-send-email-horms@verge.net.au>

From: Simon Horman <horms@verge.net.au>
Date: Tue, 24 Jul 2012 08:28:55 +0900

> please consider the following enhancements to IPVS for inclusion in 3.6.

The merge window has just opened, therefore any new work should have
been submitted and queued up already.

^ permalink raw reply

* Re: [3.5 regression / mcs7830 / bisected] bridge constantly toggling between disabled and forwarding
From: Michael Leun @ 2012-07-23 23:36 UTC (permalink / raw)
  To: linux; +Cc: davem, netdev, linux-kernel, gregkh
In-Reply-To: <20120723091504.2d035d28@xenia.leun.net>

On Mon, 23 Jul 2012 09:15:04 +0200
Michael Leun <lkml20120218@newton.leun.net> wrote:

[see issue description below]

Bisecting yielded

b1ff4f96fd1c63890d78d8939c6e0f2b44ce3113 is the first bad commit
commit b1ff4f96fd1c63890d78d8939c6e0f2b44ce3113
Author: Ondrej Zary <linux@rainbow-software.org>
Date:   Fri Jun 1 10:29:08 2012 +0000

    mcs7830: Implement link state detection

    Add .status callback that detects link state changes.
    Tested with MCS7832CV-AA chip (9710:7830, identified as rev.C by the driver).
    Fixes https://bugzilla.kernel.org/show_bug.cgi?id=28532

    Signed-off-by: Ondrej Zary <linux@rainbow-software.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

:040000 040000 5480780cb5e75c57122a621fc3bab0108c16be27 d97efd9cc0a465dff76bcd3a3c547f718f2a5345 M    drivers


Reverting that from 3.5 makes the issue go away.

> Hi,
> 
> when I use my usb ethernet adapter
> 
> # > lsusb
> [...]
> Bus 002 Device 009: ID 9710:7830 MosChip Semiconductor MCS7830 10/100 Mbps Ethernet adapter
> [...]
> 
> as part of a bridge
> 
> > # brctl addbr br0
> > # brctl addif br0 eth0
> > # brctl addif br0 ue5
> > # ifconfig ue5 up
> > # ifconfig br0 up
> 
> (Also does happen when eth0 is not part of the bridge, but the logs I
> had available were from that situation...)
> 
> I constantly get messages showing the interface toggling between
> disabled and forwarding state:
> 
> Jul 23 07:40:50 elektra kernel: [ 1539.497337] br0: port 2(ue5) entered disabled state
> Jul 23 07:40:50 elektra kernel: [ 1539.554992] br0: port 2(ue5) entered forwarding state
> Jul 23 07:40:50 elektra kernel: [ 1539.555005] br0: port 2(ue5) entered forwarding state
> Jul 23 07:40:51 elektra kernel: [ 1540.496242] br0: port 2(ue5) entered disabled state
> Jul 23 07:40:51 elektra kernel: [ 1540.552534] br0: port 2(ue5) entered forwarding state
> Jul 23 07:40:51 elektra kernel: [ 1540.552548] br0: port 2(ue5) entered forwarding state
> Jul 23 07:40:52 elektra kernel: [ 1541.550413] br0: port 2(ue5) entered forwarding state
> Jul 23 07:40:53 elektra kernel: [ 1542.529672] br0: port 2(ue5) entered disabled state
> Jul 23 07:40:53 elektra kernel: [ 1542.587162] br0: port 2(ue5) entered forwarding state
> Jul 23 07:40:53 elektra kernel: [ 1542.587175] br0: port 2(ue5) entered forwarding state
> Jul 23 07:40:54 elektra kernel: [ 1543.585309] br0: port 2(ue5) entered forwarding state
> Jul 23 07:41:00 elektra kernel: [ 1549.360600] br0: port 2(ue5) entered disabled state
> Jul 23 07:41:00 elektra kernel: [ 1549.442998] br0: port 2(ue5) entered forwarding state
> Jul 23 07:41:00 elektra kernel: [ 1549.443011] br0: port 2(ue5) entered forwarding state
> Jul 23 07:41:01 elektra kernel: [ 1550.357686] br0: port 2(ue5) entered disabled state
> Jul 23 07:41:01 elektra kernel: [ 1550.408208] br0: port 2(ue5) entered forwarding state
> Jul 23 07:41:01 elektra kernel: [ 1550.408222] br0: port 2(ue5) entered forwarding state
> Jul 23 07:41:02 elektra kernel: [ 1551.407656] br0: port 2(ue5) entered forwarding state
> Jul 23 07:41:03 elektra kernel: [ 1552.401578] br0: port 2(ue5) entered disabled state
> Jul 23 07:41:03 elektra kernel: [ 1552.474773] br0: port 2(ue5) entered forwarding state
> Jul 23 07:41:03 elektra kernel: [ 1552.474786] br0: port 2(ue5) entered forwarding state
> Jul 23 07:41:04 elektra kernel: [ 1553.472487] br0: port 2(ue5) entered forwarding state
> Jul 23 07:41:05 elektra kernel: [ 1554.356138] br0: port 2(ue5) entered disabled state
> [...]
> 
> This does not happen with 3.4.5 in the same situation, with nothing
> changed other than the kernel.
> 
> Does anybody have an idea what the issue might be or do I need to bisect?


-- 
MfG,

Michael Leun

^ permalink raw reply

* [PATCH 1/3] ipv4: Prepare for change of rt->rt_iif encoding.
From: David Miller @ 2012-07-23 23:45 UTC (permalink / raw)
  To: netdev; +Cc: ja


Use inet_iif() consistently, and for TCP record the input interface of
cached RX dst in inet sock.

rt->rt_iif is going to be encoded differently, so that we can
legitimately cache input routes in the FIB info more aggressively.

When the input interface is "use SKB device index" the rt->rt_iif will
be set to zero.

This forces us to move the TCP RX dst cache installation into the
ipv4-specific code, which is where it belongs anyway: caching the
route for ipv6 is pointless at the moment, since the ipv6 input paths
do not inspect it yet.

Also, remove the unlikely() on dst->obsolete: all ipv4 dsts have
obsolete set to a non-zero value to force invocation of the check
callback.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/inet_sock.h |    1 +
 net/dccp/ipv4.c         |    2 +-
 net/ipv4/icmp.c         |    2 +-
 net/ipv4/ip_sockglue.c  |    5 ++---
 net/ipv4/route.c        |    2 +-
 net/ipv4/tcp_input.c    |   12 ------------
 net/ipv4/tcp_ipv4.c     |   24 ++++++++++++++++++------
 net/sched/cls_route.c   |    2 +-
 net/sched/em_meta.c     |    2 +-
 net/sctp/protocol.c     |    2 +-
 10 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 924d7b9..613cfa4 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -172,6 +172,7 @@ struct inet_sock {
 	int			uc_index;
 	int			mc_index;
 	__be32			mc_addr;
+	int			rx_dst_ifindex;
 	struct ip_mc_socklist __rcu	*mc_list;
 	struct inet_cork_full	cork;
 };
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 25428d0..176ecdb 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -481,7 +481,7 @@ static struct dst_entry* dccp_v4_route_skb(struct net *net, struct sock *sk,
 	struct rtable *rt;
 	const struct iphdr *iph = ip_hdr(skb);
 	struct flowi4 fl4 = {
-		.flowi4_oif = skb_rtable(skb)->rt_iif,
+		.flowi4_oif = inet_iif(skb),
 		.daddr = iph->saddr,
 		.saddr = iph->daddr,
 		.flowi4_tos = RT_CONN_FLAGS(sk),
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index f2a06be..f2eccd5 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -571,7 +571,7 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
 		rcu_read_lock();
 		if (rt_is_input_route(rt) &&
 		    net->ipv4.sysctl_icmp_errors_use_inbound_ifaddr)
-			dev = dev_get_by_index_rcu(net, rt->rt_iif);
+			dev = dev_get_by_index_rcu(net, inet_iif(skb_in));
 
 		if (dev)
 			saddr = inet_select_addr(dev, 0, RT_SCOPE_LINK);
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index de29f46..5eea4a8 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -1027,10 +1027,9 @@ e_inval:
 void ipv4_pktinfo_prepare(struct sk_buff *skb)
 {
 	struct in_pktinfo *pktinfo = PKTINFO_SKB_CB(skb);
-	const struct rtable *rt = skb_rtable(skb);
 
-	if (rt) {
-		pktinfo->ipi_ifindex = rt->rt_iif;
+	if (skb_rtable(skb)) {
+		pktinfo->ipi_ifindex = inet_iif(skb);
 		pktinfo->ipi_spec_dst.s_addr = fib_compute_spec_dst(skb);
 	} else {
 		pktinfo->ipi_ifindex = 0;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 34017be..f6be781 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -848,7 +848,7 @@ void ip_rt_send_redirect(struct sk_buff *skb)
 		if (log_martians &&
 		    peer->rate_tokens == ip_rt_redirect_number)
 			net_warn_ratelimited("host %pI4/if%d ignores redirects for %pI4 to %pI4\n",
-					     &ip_hdr(skb)->saddr, rt->rt_iif,
+					     &ip_hdr(skb)->saddr, inet_iif(skb),
 					     &ip_hdr(skb)->daddr, &rt->rt_gateway);
 #endif
 	}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 21d7f8f..3e07a64 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5391,18 +5391,6 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (sk->sk_rx_dst) {
-		struct dst_entry *dst = sk->sk_rx_dst;
-		if (unlikely(dst->obsolete)) {
-			if (dst->ops->check(dst, 0) == NULL) {
-				dst_release(dst);
-				sk->sk_rx_dst = NULL;
-			}
-		}
-	}
-	if (unlikely(sk->sk_rx_dst == NULL))
-		sk->sk_rx_dst = dst_clone(skb_dst(skb));
-
 	/*
 	 *	Header prediction.
 	 *	The code loosely follows the one in the famous
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index bc5432e..3e30548 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1618,6 +1618,20 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
 
 	if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
 		sock_rps_save_rxhash(sk, skb);
+		if (sk->sk_rx_dst) {
+			struct dst_entry *dst = sk->sk_rx_dst;
+			if (dst->ops->check(dst, 0) == NULL) {
+				dst_release(dst);
+				sk->sk_rx_dst = NULL;
+			}
+		}
+		if (unlikely(sk->sk_rx_dst == NULL)) {
+			struct inet_sock *icsk = inet_sk(sk);
+			struct rtable *rt = skb_rtable(skb);
+
+			sk->sk_rx_dst = dst_clone(&rt->dst);
+			icsk->rx_dst_ifindex = inet_iif(skb);
+		}
 		if (tcp_rcv_established(sk, skb, tcp_hdr(skb), skb->len)) {
 			rsk = sk;
 			goto reset;
@@ -1700,14 +1714,12 @@ void tcp_v4_early_demux(struct sk_buff *skb)
 		skb->destructor = sock_edemux;
 		if (sk->sk_state != TCP_TIME_WAIT) {
 			struct dst_entry *dst = sk->sk_rx_dst;
+			struct inet_sock *icsk = inet_sk(sk);
 			if (dst)
 				dst = dst_check(dst, 0);
-			if (dst) {
-				struct rtable *rt = (struct rtable *) dst;
-
-				if (rt->rt_iif == dev->ifindex)
-					skb_dst_set_noref(skb, dst);
-			}
+			if (dst &&
+			    icsk->rx_dst_ifindex == dev->ifindex)
+				skb_dst_set_noref(skb, dst);
 		}
 	}
 }
diff --git a/net/sched/cls_route.c b/net/sched/cls_route.c
index 36fec42..44f405c 100644
--- a/net/sched/cls_route.c
+++ b/net/sched/cls_route.c
@@ -143,7 +143,7 @@ static int route4_classify(struct sk_buff *skb, const struct tcf_proto *tp,
 	if (head == NULL)
 		goto old_method;
 
-	iif = ((struct rtable *)dst)->rt_iif;
+	iif = inet_iif(skb);
 
 	h = route4_fastmap_hash(id, iif);
 	if (id == head->fastmap[h].id &&
diff --git a/net/sched/em_meta.c b/net/sched/em_meta.c
index 4790c69..4ab6e33 100644
--- a/net/sched/em_meta.c
+++ b/net/sched/em_meta.c
@@ -264,7 +264,7 @@ META_COLLECTOR(int_rtiif)
 	if (unlikely(skb_rtable(skb) == NULL))
 		*err = -1;
 	else
-		dst->value = skb_rtable(skb)->rt_iif;
+		dst->value = inet_iif(skb);
 }
 
 /**************************************************************************
diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index 9c90811..1f89c4e 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -568,7 +568,7 @@ static void sctp_v4_get_saddr(struct sctp_sock *sk,
 /* What interface did this skb arrive on? */
 static int sctp_v4_skb_iif(const struct sk_buff *skb)
 {
-	return skb_rtable(skb)->rt_iif;
+	return inet_iif(skb);
 }
 
 /* Was this packet marked by Explicit Congestion Notification? */
-- 
1.7.10.4
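
The socket-side bookkeeping from the tcp_ipv4.c hunks can be sketched
outside the kernel as follows. This is a simplified model under stated
assumptions: the toy_* types and functions are hypothetical stand-ins,
and reference counting and the dst validity check (dst_check) are left
out on purpose.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct dst_entry and struct inet_sock. */
struct toy_dst {
	int id;
};

struct toy_sock {
	struct toy_dst *rx_dst;	/* cached input route, may be NULL */
	int rx_dst_ifindex;	/* interface the cached route was seen on */
};

/* Established fast path: install the cache once and remember the
 * input interface alongside it, instead of reading it from the route. */
static void toy_cache_rx_dst(struct toy_sock *sk, struct toy_dst *dst,
			     int iif)
{
	if (sk->rx_dst == NULL) {
		sk->rx_dst = dst;
		sk->rx_dst_ifindex = iif;
	}
}

/* Early demux: reuse the cached route only when the packet arrived on
 * the interface recorded in the socket. */
static struct toy_dst *toy_early_demux(const struct toy_sock *sk,
				       int dev_ifindex)
{
	if (sk->rx_dst && sk->rx_dst_ifindex == dev_ifindex)
		return sk->rx_dst;
	return NULL;
}
```

Keeping the interface index in the socket means the early demux check
no longer depends on how rt->rt_iif is encoded.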

^ permalink raw reply related

* [PATCH 2/3] net: Make skb->skb_iif always track skb->dev
From: David Miller @ 2012-07-23 23:45 UTC (permalink / raw)
  To: netdev; +Cc: ja


Make it follow device decapsulation, from things such as VLAN and
bonding.

Code that actually cares about pre-demuxed device pointers is handled
by the "orig_dev" variable in __netif_receive_skb(), and the only
consumer of that is the po->origdev feature of AF_PACKET sockets.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/core/dev.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index cca02ae..0ebaea1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3173,8 +3173,6 @@ static int __netif_receive_skb(struct sk_buff *skb)
 	if (netpoll_receive_skb(skb))
 		return NET_RX_DROP;
 
-	if (!skb->skb_iif)
-		skb->skb_iif = skb->dev->ifindex;
 	orig_dev = skb->dev;
 
 	skb_reset_network_header(skb);
@@ -3186,6 +3184,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
 	rcu_read_lock();
 
 another_round:
+	skb->skb_iif = skb->dev->ifindex;
 
 	__this_cpu_inc(softnet_data.processed);
 
-- 
1.7.10.4
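
A toy model of the receive loop may help show the effect of moving the
assignment under another_round. It is a sketch only: toy_dev, toy_skb
and toy_receive() are hypothetical, and the "upper" pointer stands in
for whatever VLAN or bonding decapsulation would select as the next
device.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct net_device. */
struct toy_dev {
	int ifindex;
	struct toy_dev *upper;	/* device after decapsulation, or NULL */
};

/* Hypothetical stand-in for struct sk_buff. */
struct toy_skb {
	struct toy_dev *dev;
	int skb_iif;
};

/* Returns the ifindex of the device the packet originally arrived on
 * (the role played by orig_dev), while skb_iif ends up following
 * skb->dev through every round of decapsulation. */
static int toy_receive(struct toy_skb *skb)
{
	int orig_ifindex = skb->dev->ifindex;

	for (;;) {
		/* Mirrors the assignment placed under another_round. */
		skb->skb_iif = skb->dev->ifindex;
		if (skb->dev->upper == NULL)
			break;
		skb->dev = skb->dev->upper;
	}
	return orig_ifindex;
}
```

With the old "if (!skb->skb_iif)" guard, skb_iif would have stayed at
the first device's index; in this model it always matches the final
skb->dev.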

^ permalink raw reply related
