[PATCH next v2 0/7] Introduce l3_dev pointer for L3 processing

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH next v2 0/7] Introduce l3_dev pointer for L3 processing
@ 2016-03-09 21:49 Mahesh Bandewar
  2016-03-10  9:47 ` Nicolas Dichtel
  2016-03-14  1:50 ` David Miller
  0 siblings, 2 replies; 12+ messages in thread
From: Mahesh Bandewar @ 2016-03-09 21:49 UTC (permalink / raw)
  To: David Miller; +Cc: Mahesh Bandewar, Eric Dumazet, netdev

From: Mahesh Bandewar <maheshb@google.com>

One of the major request (for enhancement) that I have received
from various users of IPvlan in L3 mode is its inability to handle
IPtables.

While looking at the code and how we handle ingress, the problem
can be attributed to the asymmetry in the way packets get processed
for IPvlan devices configured in L3 mode. L3 mode is supposed to
be restrictive and all the L3 decisions need to be taken for the
traffic in master's ns. This does happen as expected for egress
traffic however on ingress traffic, the IPvlan packet-handler
changes the skb->dev and this forces packet to be processed with
the IPvlan slave and it's associated ns. This causes above mentioned
problem and few other which are not yet reported / attempted. e.g.
IPsec with L3 mode or even ingress routing.

This could have been solved if we had a way to handover packet to
slave and associated ns after completing the L3 phase. This is a
non-trivial issue to fix especially looking at IPsec code.

This patch series attempts to solve this problem by introducing the
device pointer l3_dev which resides in net_device structure in the
RX cache line. We initialize the l3_dev to self. This would mean
there is no complex logic to when-and-how-to initialize it. Now
the stack will use this dev pointer during the L3 phase. This should
not alter any existing properties / behavior and also there should
not be any additional penalties since it resides in the same RX
cache line. 

Now coming back to IPvlan setup. The typical IPvlan L3 setup where
master is in default-ns while the slaves are put inside different
(slave) net-ns. During the initialization phase, the driver can
make l3_dev for each slave to point to its master-device. This will
solve the current IPT problem as well as make the packet processing
symmetric from stack perspective. For all other packet processing
since dev->l3_dev is same as dev, there should not be any behavioral
or functional change. 

Mahesh Bandewar (7):
  dev: Add netif_get_l3_dev() helper
  dev: Update packet dispatcher to pass l3_dev as an input device
  ipv4: Use l3_dev for L3 ingress processing
  ipv6: Use l3_dev for L3 ingress processing
  raw: Use l3_dev for L3 ingress raw packet delivery
  xfrm: Use l3_dev for xfrm policy checks.
  ipvlan: Implement L3-symmetric mode.

 drivers/net/ipvlan/ipvlan_main.c | 16 +++++++++-------
 include/linux/netdevice.h        |  6 ++++++
 include/net/xfrm.h               |  2 +-
 net/core/dev.c                   | 10 +++++++---
 net/ipv4/ip_input.c              | 12 +++++++-----
 net/ipv4/ip_options.c            |  3 ++-
 net/ipv4/raw.c                   | 11 +++++------
 net/ipv6/ip6_input.c             | 14 ++++++++------
 net/ipv6/raw.c                   |  2 +-
 net/ipv6/route.c                 |  7 ++++---
 net/xfrm/xfrm_policy.c           |  4 ++--
 11 files changed, 52 insertions(+), 35 deletions(-)

-- 
2.7.0.rc3.207.g0ac5344

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH next v2 0/7] Introduce l3_dev pointer for L3 processing
  2016-03-09 21:49 [PATCH next v2 0/7] Introduce l3_dev pointer for L3 processing Mahesh Bandewar
@ 2016-03-10  9:47 ` Nicolas Dichtel
  2016-03-10 21:29   ` Cong Wang
  2016-03-13 23:44   ` Mahesh Bandewar
  2016-03-14  1:50 ` David Miller
  1 sibling, 2 replies; 12+ messages in thread
From: Nicolas Dichtel @ 2016-03-10  9:47 UTC (permalink / raw)
  To: Mahesh Bandewar, David Miller
  Cc: Mahesh Bandewar, Eric Dumazet, netdev, Eric W. Biederman,
	Cong Wang

Le 09/03/2016 22:49, Mahesh Bandewar a écrit :
> From: Mahesh Bandewar <maheshb@google.com>
>
> One of the major request (for enhancement) that I have received
> from various users of IPvlan in L3 mode is its inability to handle
> IPtables.
>
> While looking at the code and how we handle ingress, the problem
> can be attributed to the asymmetry in the way packets get processed
> for IPvlan devices configured in L3 mode. L3 mode is supposed to
> be restrictive and all the L3 decisions need to be taken for the
> traffic in master's ns. This does happen as expected for egress
> traffic however on ingress traffic, the IPvlan packet-handler
> changes the skb->dev and this forces packet to be processed with
> the IPvlan slave and it's associated ns. This causes above mentioned
> problem and few other which are not yet reported / attempted. e.g.
> IPsec with L3 mode or even ingress routing.
>
> This could have been solved if we had a way to handover packet to
> slave and associated ns after completing the L3 phase. This is a
> non-trivial issue to fix especially looking at IPsec code.
>
> This patch series attempts to solve this problem by introducing the
> device pointer l3_dev which resides in net_device structure in the
> RX cache line. We initialize the l3_dev to self. This would mean
> there is no complex logic to when-and-how-to initialize it. Now
> the stack will use this dev pointer during the L3 phase. This should
> not alter any existing properties / behavior and also there should
> not be any additional penalties since it resides in the same RX
> cache line.
If I understand correctly (and as Cong already said), information are leaking
between netns during the input phase. On the tx side, skb_scrub_packet() is
called, but not on the rx side. I think it's wrong. There should be an explicit
boundary.

Another small comment: maybe finding another name than l3_dev could help to
avoid confusion with the existing l3mdev.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH next v2 0/7] Introduce l3_dev pointer for L3 processing
  2016-03-10  9:47 ` Nicolas Dichtel
@ 2016-03-10 21:29   ` Cong Wang
  2016-03-14  0:01     ` Mahesh Bandewar
  2016-03-13 23:44   ` Mahesh Bandewar
  1 sibling, 1 reply; 12+ messages in thread
From: Cong Wang @ 2016-03-10 21:29 UTC (permalink / raw)
  To: Nicolas Dichtel
  Cc: Mahesh Bandewar, David Miller, Mahesh Bandewar, Eric Dumazet,
	netdev, Eric W. Biederman, Cong Wang

On Thu, Mar 10, 2016 at 1:47 AM, Nicolas Dichtel
<nicolas.dichtel@6wind.com> wrote:
> Le 09/03/2016 22:49, Mahesh Bandewar a écrit :
>>
>> From: Mahesh Bandewar <maheshb@google.com>
>>
>> One of the major request (for enhancement) that I have received
>> from various users of IPvlan in L3 mode is its inability to handle
>> IPtables.
>>
>> While looking at the code and how we handle ingress, the problem
>> can be attributed to the asymmetry in the way packets get processed
>> for IPvlan devices configured in L3 mode. L3 mode is supposed to
>> be restrictive and all the L3 decisions need to be taken for the
>> traffic in master's ns. This does happen as expected for egress
>> traffic however on ingress traffic, the IPvlan packet-handler
>> changes the skb->dev and this forces packet to be processed with
>> the IPvlan slave and it's associated ns. This causes above mentioned
>> problem and few other which are not yet reported / attempted. e.g.
>> IPsec with L3 mode or even ingress routing.
>>
>> This could have been solved if we had a way to handover packet to
>> slave and associated ns after completing the L3 phase. This is a
>> non-trivial issue to fix especially looking at IPsec code.
>>
>> This patch series attempts to solve this problem by introducing the
>> device pointer l3_dev which resides in net_device structure in the
>> RX cache line. We initialize the l3_dev to self. This would mean
>> there is no complex logic to when-and-how-to initialize it. Now
>> the stack will use this dev pointer during the L3 phase. This should
>> not alter any existing properties / behavior and also there should
>> not be any additional penalties since it resides in the same RX
>> cache line.
>
> If I understand correctly (and as Cong already said), information are
> leaking
> between netns during the input phase. On the tx side, skb_scrub_packet() is
> called, but not on the rx side. I think it's wrong. There should be an
> explicit
> boundary.

That is not what I am complaining about.

I dislike the trick of switching skb->dev pointer with skb->dev->l3_dev.
This is not how we switch netns, nor the way how netns works.

Look at veth pair or dev_change_net_namespace(), each time when we
switch netns, we need to do a full reregistration or a full reentrance, we
never just switch some pointers to switch netns. This is why I said it breaks
isolation.

Also, it is ugly to hide such a ipvlan-specific pointer for half of the RX code
path.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH next v2 0/7] Introduce l3_dev pointer for L3 processing
  2016-03-10  9:47 ` Nicolas Dichtel
  2016-03-10 21:29   ` Cong Wang
@ 2016-03-13 23:44   ` Mahesh Bandewar
  1 sibling, 0 replies; 12+ messages in thread
From: Mahesh Bandewar @ 2016-03-13 23:44 UTC (permalink / raw)
  To: nicolas.dichtel
  Cc: Mahesh Bandewar, David Miller, Eric Dumazet, netdev,
	Eric W. Biederman, Cong Wang

> If I understand correctly (and as Cong already said), information are
> leaking
> between netns during the input phase. On the tx side, skb_scrub_packet() is
> called, but not on the rx side. I think it's wrong. There should be an
> explicit
> boundary.
>
I think we used to do dev_forward_skb() in the RX path which used to
do skb_scrub_packet().
When we added the optimization to avoid queuing second time just to
deliver packet to the
slave (by doing RX_HANDLER_ANOTHER), we lost the skb_scrub_packet()
which we used to
get automatically. I believe we can add that! Thanks for pointing it out.

> Another small comment: maybe finding another name than l3_dev could help to
> avoid confusion with the existing l3mdev.
I have absolutely no preference for the name. I used it to indicate
this is the *L3 device*.
Please suggest.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH next v2 0/7] Introduce l3_dev pointer for L3 processing
  2016-03-10 21:29   ` Cong Wang
@ 2016-03-14  0:01     ` Mahesh Bandewar
  2016-03-14 18:13       ` Cong Wang
  0 siblings, 1 reply; 12+ messages in thread
From: Mahesh Bandewar @ 2016-03-14  0:01 UTC (permalink / raw)
  To: Cong Wang
  Cc: Nicolas Dichtel, Mahesh Bandewar, David Miller, Eric Dumazet,
	netdev, Eric W. Biederman, Cong Wang

>> If I understand correctly (and as Cong already said), information are
>> leaking
>> between netns during the input phase. On the tx side, skb_scrub_packet() is
>> called, but not on the rx side. I think it's wrong. There should be an
>> explicit
>> boundary.
>
> That is not what I am complaining about.
>
> I dislike the trick of switching skb->dev pointer with skb->dev->l3_dev.
> This is not how we switch netns, nor the way how netns works.
>
How it is different from what we are doing currently?

Current: Use skb->dev for L3 processing and derive netns from skb->dev
Proposal: use skb->dev->l3_dev for L3 processing and derive netns from
skb->dev->l3_dev

> Look at veth pair or dev_change_net_namespace(), each time when we
> switch netns, we need to do a full reregistration or a full reentrance, we
> never just switch some pointers to switch netns. This is why I said it breaks
> isolation.
>
> Also, it is ugly to hide such a ipvlan-specific pointer for half of the RX code
> path.
I think I have already mentioned, I'm adding RX code now and later
I'll add TX code to use
l3_dev to make it symmetric. This way all L3 (Tx/Rx) will use this
device reference
always.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH next v2 0/7] Introduce l3_dev pointer for L3 processing
  2016-03-09 21:49 [PATCH next v2 0/7] Introduce l3_dev pointer for L3 processing Mahesh Bandewar
  2016-03-10  9:47 ` Nicolas Dichtel
@ 2016-03-14  1:50 ` David Miller
  2016-03-14  2:29   ` Mahesh Bandewar
  1 sibling, 1 reply; 12+ messages in thread
From: David Miller @ 2016-03-14  1:50 UTC (permalink / raw)
  To: mahesh; +Cc: maheshb, edumazet, netdev

From: Mahesh Bandewar <mahesh@bandewar.net>
Date: Wed,  9 Mar 2016 13:49:49 -0800

> This does happen as expected for egress traffic however on ingress
> traffic, the IPvlan packet-handler changes the skb->dev and this
> forces packet to be processed with the IPvlan slave and it's
> associated ns. This causes above mentioned problem and few other
> which are not yet reported / attempted. e.g.  IPsec with L3 mode or
> even ingress routing.

And now your changes make it so that we can't run netfilter rules on
the slave's device within the slave's ns.

If someone is doing that now you're breaking things for them.

It doesn't matter whether doing so or not makes sense.

You're going to have to find a way to do both, and also I'm concerned
about how you're leaking the source namespace's "stuff" into the
destination's.  That's very worrisome to me.

There is no way I'm applying this as-is, sorry.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH next v2 0/7] Introduce l3_dev pointer for L3 processing
  2016-03-14  1:50 ` David Miller
@ 2016-03-14  2:29   ` Mahesh Bandewar
  2016-03-14  3:53     ` David Miller
  0 siblings, 1 reply; 12+ messages in thread
From: Mahesh Bandewar @ 2016-03-14  2:29 UTC (permalink / raw)
  To: David Miller; +Cc: mahesh, Eric Dumazet, linux-netdev

On Sun, Mar 13, 2016 at 6:50 PM, David Miller <davem@davemloft.net> wrote:
> From: Mahesh Bandewar <mahesh@bandewar.net>
> Date: Wed,  9 Mar 2016 13:49:49 -0800
>
>> This does happen as expected for egress traffic however on ingress
>> traffic, the IPvlan packet-handler changes the skb->dev and this
>> forces packet to be processed with the IPvlan slave and it's
>> associated ns. This causes above mentioned problem and few other
>> which are not yet reported / attempted. e.g.  IPsec with L3 mode or
>> even ingress routing.
>
> And now your changes make it so that we can't run netfilter rules on
> the slave's device within the slave's ns.
>
> If someone is doing that now you're breaking things for them.
>
Yes, I agree. If someone is using netfilter hooks in slave-ns, then this would
break for them. If we do this as a separate mode (e.g. L3s) keeping the
current mode as it is. Would that be an acceptable solution?

> It doesn't matter whether doing so or not makes sense.
>
> You're going to have to find a way to do both, and also I'm concerned
> about how you're leaking the source namespace's "stuff" into the
> destination's.  That's very worrisome to me.
>
If we add a new mode (e.g. L3s) and preserve current mode as is it, then
that should address your first concern.

Now about leaking ns "stuff". I do not want to leak stuff across ns. In the
current L3 mode there is an issue that Nicolas has pointed out and I'll fix
that soon. In the new mode (L3s); right from the beginning the L2 and L3
processing always (and consistently) will happen with master device
(proposed skb->dev->l3_dev).

> There is no way I'm applying this as-is, sorry.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH next v2 0/7] Introduce l3_dev pointer for L3 processing
  2016-03-14  2:29   ` Mahesh Bandewar
@ 2016-03-14  3:53     ` David Miller
  2016-03-14 17:57       ` Mahesh Bandewar
  2016-03-14 18:15       ` Cong Wang
  0 siblings, 2 replies; 12+ messages in thread
From: David Miller @ 2016-03-14  3:53 UTC (permalink / raw)
  To: maheshb; +Cc: mahesh, edumazet, netdev

From: Mahesh Bandewar <maheshb@google.com>
Date: Sun, 13 Mar 2016 19:29:58 -0700

> On Sun, Mar 13, 2016 at 6:50 PM, David Miller <davem@davemloft.net> wrote:
>> It doesn't matter whether doing so or not makes sense.
>>
>> You're going to have to find a way to do both, and also I'm concerned
>> about how you're leaking the source namespace's "stuff" into the
>> destination's.  That's very worrisome to me.
>
> If we add a new mode (e.g. L3s) and preserve current mode as is it,
> then that should address your first concern.

Also, I don't want all of this device translation stuff all over the
place.

Furthermore, when you walk across the ns boundary, that old device has
to disappear.  That's why that is the device assigned to skb->dev.

Please stop pretending that this device switching is ok, it's not.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH next v2 0/7] Introduce l3_dev pointer for L3 processing
  2016-03-14  3:53     ` David Miller
@ 2016-03-14 17:57       ` Mahesh Bandewar
  2016-03-17  8:47         ` Nicolas Dichtel
  2016-03-14 18:15       ` Cong Wang
  1 sibling, 1 reply; 12+ messages in thread
From: Mahesh Bandewar @ 2016-03-14 17:57 UTC (permalink / raw)
  To: David Miller; +Cc: mahesh, Eric Dumazet, linux-netdev

On Sun, Mar 13, 2016 at 8:53 PM, David Miller <davem@davemloft.net> wrote:
> From: Mahesh Bandewar <maheshb@google.com>
> Date: Sun, 13 Mar 2016 19:29:58 -0700
>
>> On Sun, Mar 13, 2016 at 6:50 PM, David Miller <davem@davemloft.net> wrote:
>>> It doesn't matter whether doing so or not makes sense.
>>>
>>> You're going to have to find a way to do both, and also I'm concerned
>>> about how you're leaking the source namespace's "stuff" into the
>>> destination's.  That's very worrisome to me.
>>
>> If we add a new mode (e.g. L3s) and preserve current mode as is it,
>> then that should address your first concern.
>
> Also, I don't want all of this device translation stuff all over the
> place.
>
I could add skb->dev. Is that OK? Then non of this translation / helper-stuff
is required. I'm definitely open for suggestions.

> Furthermore, when you walk across the ns boundary, that old device has
> to disappear.  That's why that is the device assigned to skb->dev.
>
The layer boundaries are not that well maintained. We do check for the xfrm
policies in L4 and expect the skb->dev pointing to the L3 device. So unless we
have a way to derive a L3 dev from skb->dev, I don't think xfrm will
work. Unless
some Xfrm-expert asserts that this is not needed.

> Please stop pretending that this device switching is ok, it's not.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH next v2 0/7] Introduce l3_dev pointer for L3 processing
  2016-03-14  0:01     ` Mahesh Bandewar
@ 2016-03-14 18:13       ` Cong Wang
  0 siblings, 0 replies; 12+ messages in thread
From: Cong Wang @ 2016-03-14 18:13 UTC (permalink / raw)
  To: Mahesh Bandewar
  Cc: Nicolas Dichtel, Mahesh Bandewar, David Miller, Eric Dumazet,
	netdev, Eric W. Biederman, Cong Wang

On Sun, Mar 13, 2016 at 5:01 PM, Mahesh Bandewar <maheshb@google.com> wrote:
>>> If I understand correctly (and as Cong already said), information are
>>> leaking
>>> between netns during the input phase. On the tx side, skb_scrub_packet() is
>>> called, but not on the rx side. I think it's wrong. There should be an
>>> explicit
>>> boundary.
>>
>> That is not what I am complaining about.
>>
>> I dislike the trick of switching skb->dev pointer with skb->dev->l3_dev.
>> This is not how we switch netns, nor the way how netns works.
>>
> How it is different from what we are doing currently?
>
> Current: Use skb->dev for L3 processing and derive netns from skb->dev
> Proposal: use skb->dev->l3_dev for L3 processing and derive netns from
> skb->dev->l3_dev


If you ever read the part you quote below, you will have the answer.


>
>> Look at veth pair or dev_change_net_namespace(), each time when we
>> switch netns, we need to do a full reregistration or a full reentrance, we
>> never just switch some pointers to switch netns. This is why I said it breaks
>> isolation.

^ You miss this part.


>>
>> Also, it is ugly to hide such a ipvlan-specific pointer for half of the RX code
>> path.
> I think I have already mentioned, I'm adding RX code now and later
> I'll add TX code to use
> l3_dev to make it symmetric. This way all L3 (Tx/Rx) will use this
> device reference
> always.

You are trying to convince me by telling me you will add more ugly code??
Seriously??

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH next v2 0/7] Introduce l3_dev pointer for L3 processing
  2016-03-14  3:53     ` David Miller
  2016-03-14 17:57       ` Mahesh Bandewar
@ 2016-03-14 18:15       ` Cong Wang
  1 sibling, 0 replies; 12+ messages in thread
From: Cong Wang @ 2016-03-14 18:15 UTC (permalink / raw)
  To: David Miller
  Cc: Mahesh Bandewar, Mahesh Bandewar, Eric Dumazet,
	Linux Kernel Network Developers

On Sun, Mar 13, 2016 at 8:53 PM, David Miller <davem@davemloft.net> wrote:
>
> Please stop pretending that this device switching is ok, it's not.

+1
This is what I have been complaining about since v1...

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH next v2 0/7] Introduce l3_dev pointer for L3 processing
  2016-03-14 17:57       ` Mahesh Bandewar
@ 2016-03-17  8:47         ` Nicolas Dichtel
  0 siblings, 0 replies; 12+ messages in thread
From: Nicolas Dichtel @ 2016-03-17  8:47 UTC (permalink / raw)
  To: Mahesh Bandewar, David Miller; +Cc: mahesh, Eric Dumazet, linux-netdev

Le 14/03/2016 18:57, Mahesh Bandewar a écrit :
> On Sun, Mar 13, 2016 at 8:53 PM, David Miller <davem@davemloft.net> wrote:
[snip]
>> Furthermore, when you walk across the ns boundary, that old device has
>> to disappear.  That's why that is the device assigned to skb->dev.
>>
> The layer boundaries are not that well maintained. We do check for the xfrm
> policies in L4 and expect the skb->dev pointing to the L3 device. So unless we
> have a way to derive a L3 dev from skb->dev, I don't think xfrm will
> work. Unless
> some Xfrm-expert asserts that this is not needed.
Adding a hook "at the right place" to do the switch is probably the better way.
For xfrm, you will need to handle it in this hook or rearrange things.
I don't think that a quick and easy solution will be possible.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2016-03-17  8:48 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-03-09 21:49 [PATCH next v2 0/7] Introduce l3_dev pointer for L3 processing Mahesh Bandewar
2016-03-10  9:47 ` Nicolas Dichtel
2016-03-10 21:29   ` Cong Wang
2016-03-14  0:01     ` Mahesh Bandewar
2016-03-14 18:13       ` Cong Wang
2016-03-13 23:44   ` Mahesh Bandewar
2016-03-14  1:50 ` David Miller
2016-03-14  2:29   ` Mahesh Bandewar
2016-03-14  3:53     ` David Miller
2016-03-14 17:57       ` Mahesh Bandewar
2016-03-17  8:47         ` Nicolas Dichtel
2016-03-14 18:15       ` Cong Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).