Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: ipv4: Use standard iovec primitive in raw_probe_proto_opt
From: Al Viro @ 2014-11-06  6:43 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, netdev, linux-kernel, bcrl, Masahide Nakamura,
	Hideaki YOSHIFUJI
In-Reply-To: <20141106055023.GA28865@gondor.apana.org.au>

On Thu, Nov 06, 2014 at 01:50:23PM +0800, Herbert Xu wrote:
> +	/* We only need the first two bytes. */
> +	err = memcpy_fromiovecend((void *)&icmph, msg->msg_iov, 0, 2);
> +	if (err)
> +		return err;
> +
> +	fl4->fl4_icmp_type = icmph.type;
> +	fl4->fl4_icmp_code = icmph.code;

That's more readable, but that exposes another problem in there - we read
the same piece of userland data twice, with no promise whatsoever that we'll
get the same value both times...

^ permalink raw reply

* Re: ipv4: Use standard iovec primitive in raw_probe_proto_opt
From: Herbert Xu @ 2014-11-06  6:46 UTC (permalink / raw)
  To: Al Viro
  Cc: David Miller, netdev, linux-kernel, bcrl, Masahide Nakamura,
	Hideaki YOSHIFUJI
In-Reply-To: <20141106064318.GW7996@ZenIV.linux.org.uk>

On Thu, Nov 06, 2014 at 06:43:18AM +0000, Al Viro wrote:
> On Thu, Nov 06, 2014 at 01:50:23PM +0800, Herbert Xu wrote:
> > +	/* We only need the first two bytes. */
> > +	err = memcpy_fromiovecend((void *)&icmph, msg->msg_iov, 0, 2);
> > +	if (err)
> > +		return err;
> > +
> > +	fl4->fl4_icmp_type = icmph.type;
> > +	fl4->fl4_icmp_code = icmph.code;
> 
> That's more readable, but that exposes another problem in there - we read
> the same piece of userland data twice, with no promise whatsoever that we'll
> get the same value both times...

Sure, but you have to be root anyway to write to raw sockets.

Patches are welcome :)

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: mlx4+vxlan offload breaks gre tunnels
From: Or Gerlitz @ 2014-11-06  7:03 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Florian Westphal, Linux Netdev List, Jesse Gross, Amir Vadai
In-Reply-To: <CA+mtBx_oxJxteeBnbyboGTSKn3JdBRx2Nb3a9EuCZzG+_U-8Zw@mail.gmail.com>

On Wed, Nov 5, 2014 at 11:59 PM, Tom Herbert <therbert@google.com> wrote:
> On Wed, Nov 5, 2014 at 8:17 AM, Or Gerlitz <ogerlitz@mellanox.com> wrote:
>> On 11/5/2014 5:04 PM, Florian Westphal wrote:
>>>
>>> tl,dr: all tcp packets sent via gre tunnel have broken tcp csum if vxlan
>>> offload
>>> is enabled with mlx4 driver.
>>>
>>> Given following config on tx-side:
>>> dev=enp3s0
>>> ip addr add dev $dev 192.168.23.1/24
>>> ip link set $dev up
>>> ip link add mygre type gretap remote 192.168.23.2 local 192.168.23.1
>>> ip addr add dev mygre 192.168.42.1/24
>>> ip link set gre0 up
>>> ip link set mygre up
>>>
>>> and
>>>
>>> options mlx4_core log_num_mgm_entry_size=-1 debug_level=1
>>> port_type_array=2,2
>>>
>>> in
>>> /etc/modprobe.d/mlx4.conf
>>>
>>> all tcp packets sent to destinations over the gre tunnel have bogus tcp
>>> checksums (and are tossed on rx side when stack validates tcp checksum).
>>>
>>> net-next head is commit 30349bdbc4da5ecf0efa25556e3caff9c9b8c5f7 .
>>>
>>> What makes things work for me:
>>> either
>>>
>>> options mlx4_core 1 debug_level=1 port_type_array=2,2
>>>
>>> (ie. no MLX4_TUNNEL_OFFLOAD_MODE_VXLAN)
>>>
>>> or not setting NETIF_F_IP_CSUM in enc_features:
>>>
>>> --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
>>> +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
>>> @@ -2579,10 +2579,12 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev,
>>> int port,
>>>                  dev->priv_flags |= IFF_UNICAST_FLT;
>>>            if (mdev->dev->caps.tunnel_offload_mode ==
>>> MLX4_TUNNEL_OFFLOAD_MODE_VXLAN) {
>>> -               dev->hw_enc_features |= NETIF_F_IP_CSUM | NETIF_F_RXCSUM |
>>> +               dev->hw_enc_features |= NETIF_F_RXCSUM |
>>>                                          NETIF_F_TSO |
>>> NETIF_F_GSO_UDP_TUNNEL;
>>>
>>> I am not sure if its right fix, but to my eyes this basically looks like
>>> mlx4 is telling stack that it can handle tcp checksum offload within
>>> tunnels, and that doesn't seem to be the case for all types (e.g. gre).
>>>
>>> Could someone who understand the enc_features specifics better confirm
>>> that
>>> above patch is correct (or provide a better/proper fix)?
>>
>>
>> Yep, I can see now the problem. It comes into play with ConnectX3-pro NICs
>> that support VXLAN offloads (but not with ConnectX3 NIC which don't) when
>> you enable the offloads support on the CX3-pro.
>>
>> The problem originates from the fact that we can't advertize something like
>> "the HW can offload the inner checksum of UDP/VXLAN encapsulated (but not
>> for GRE)", e.g in a similar manner that exists in the GSO space, where you
>> have NETIF_F_GSO _YYY for each yyy in {UDP, SIT, GRE, etc} tunneling scheme.
>>
>> I think the best effort we can do now is
>>
>> 1. come up with something such as the below patch for 3.18 which is
>> back-ward portable for -stable kernels, it will only arm the hw offloads if
>> the OS tells us there's VXLAN in action
>>
>> 2. come  up with proper kernel APIs to let NICs advertize which encap
>> schemes they can actually offload the inner checksum, Tom... your work which
>> now runs over netdev.

> Possibly #3: add ndo_gso_check to detect nested tunneling. In this
> case it would see that gso_type has both SKB_GSO_GRE and
> SKB_GSO_UDP_TUNNEL set.

I am not with you, where do we have nested tunneling here? the below
setting is simple GRE tunnel
to host (say) TCP traffic

>> tested to work with the  following which is a bit different, tell me if it
>> works for you
>>
>> # node A - with mlx4_en address192.168.31.18
>> ip tunnel add gre1 mode gre local 192.168.31.18 remote 192.168.31.17 ttl 255
>> ifconfig gre1 10.10.10.18/24 up
>> ifconfig gre1 mtu 1450
>>
>> # node B - with mlx4_en address192.168.31.17
>> ip tunnel add gre1 mode gre local 192.168.31.17 remote 192.168.31.18 ttl 255
>> ifconfig gre1 10.10.10.17/24 up
>> ifconfig gre1 mtu 1450

^ permalink raw reply

* Re: M_CAN message RAM initialization AppNote  - was: Re: [PATCH V3 3/3] can: m_can: workaround for transmit data less than 4 bytes
From: Oliver Hartkopp @ 2014-11-06  7:04 UTC (permalink / raw)
  To: Dong Aisheng
  Cc: Marc Kleine-Budde, linux-can, wg, varkabhadram, netdev,
	linux-arm-kernel
In-Reply-To: <20141106015716.GB7642@shlinux1.ap.freescale.net>

On 06.11.2014 02:57, Dong Aisheng wrote:
> On Wed, Nov 05, 2014 at 07:15:10PM +0100, Oliver Hartkopp wrote:

>> The Message RAM is usually equipped with a parity or ECC functionality.
>> But RAM cells suffer a hardware reset and can therefore hold
>> arbitrary content at startup - including parity and/or ECC bits.
>>
>> So when you write only the CAN ID and the first four bytes the last
>> four bytes remain untouched. Then the M_CAN starts to read in 32bit
>> words from the start of the Tx Message element. So it is very likely
>> to trigger the message RAM error when reading the uninitialized
>> 32bit word from the last four bytes.
>>
>> Finally it turns out that an initial writing (with any kind of data)
>> to the entire message RAM is mandatory to create valid parity/ECC
>> checksums.
>>
>> That's it.
>>
>
> Thanks for sharing this information.
> Does it mean this issue is related to the nature of Message RAM and is
> supposed to exist on all M_CAN IP versions?

 From what I know from the 3.1.x revision there's no change regarding IR.BRU 
and IR.BEC - so I would assume this to stay on all M_CAN IP revisions.

But after some sleep I wonder if this patch [3/3] would need an update too.

Writing to the TX message RAM is obviously no workaround but a valid and 
needed initialization process.

I would tend to make this patch:

---

can: m_can: add missing TX message RAM initialization

The M_CAN message RAM is usually equipped with a parity or ECC functionality.
But RAM cells suffer a hardware reset and can therefore hold arbitrary content 
at startup - including parity and/or ECC bits.

To prevent the M_CAN controller detecting checksum errors when reading 
potentially uninitialized TX message RAM content to transmit CAN frames the TX 
message RAM has to be written with (any kind of) initial data.

---

Then the code should memset() the entire TX FIFO element - and not only the 8 
data bytes we are addressing now.

Maybe it makes sense to send the entire updated patch set (3) again ...

[1/3] can: add can_is_canfd_skb() API
[2/3] can: m_can: update to support CAN FD features
[3/3] can: m_can: add missing message RAM initialization

Are you ok with that?

Regards,
Oliver

^ permalink raw reply

* Re: [PATCH] bridge: missing null bridge device check causing null pointer dereference (bugfix)
From: Toshiaki Makita @ 2014-11-06  7:07 UTC (permalink / raw)
  To: Su-Hyun Park, Stephen Hemminger, David S. Miller
  Cc: netdev, bridge, linux-kernel
In-Reply-To: <1415255192-13584-1-git-send-email-suhyun.park@ahnlab.com>

On 2014/11/06 15:26, Su-Hyun Park wrote:
> the bridge device can be null if the bridge is being deleted while processing 
> the packet, which causes the null pointer dereference in switch statement.

How can this happen??
It is guarded by rcu.
netdev_rx_handler_unregister() ensures rx_handler_data is non NULL.

Thanks,
Toshiaki Makita

> 
> crash dump snippet:
> 
> <1>BUG: unable to handle kernel NULL pointer dereference at 0000000000000021
> <1>IP: [<ffffffff814179f6>] br_handle_frame+0xe6/0x270
> 
> <0>Code: 4c 0f 44 f0 89 f8 66 33 15 32 52 24 00 66 33 05 29 52 24 00 09 c2 89 
> f0 66 33 05 22 52 24 00 80 e4 f0 66 09 c2 0f 84 eb 00 00 00 <41> 0f b6 46 21 
> 3c 02 74 61 3c 03 74 1d 48 89 df e8 d5 bc f0 ff
> ---
>  net/bridge/br_input.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
> index 6fd5522..7e899ca 100644
> --- a/net/bridge/br_input.c
> +++ b/net/bridge/br_input.c
> @@ -176,6 +176,8 @@ rx_handler_result_t br_handle_frame(struct sk_buff **pskb)
>  		return RX_HANDLER_CONSUMED;
>  
>  	p = br_port_get_rcu(skb->dev);
> +	if (!p)
> +		goto drop;
>  
>  	if (unlikely(is_link_local_ether_addr(dest))) {
>  		u16 fwd_mask = p->br->group_fwd_mask_required;
> 

^ permalink raw reply

* Re: ipv4: Use standard iovec primitive in raw_probe_proto_opt
From: Al Viro @ 2014-11-06  7:11 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, netdev, linux-kernel, bcrl, Masahide Nakamura,
	Hideaki YOSHIFUJI
In-Reply-To: <20141106064629.GA29321@gondor.apana.org.au>

On Thu, Nov 06, 2014 at 02:46:29PM +0800, Herbert Xu wrote:
> On Thu, Nov 06, 2014 at 06:43:18AM +0000, Al Viro wrote:
> > On Thu, Nov 06, 2014 at 01:50:23PM +0800, Herbert Xu wrote:
> > > +	/* We only need the first two bytes. */
> > > +	err = memcpy_fromiovecend((void *)&icmph, msg->msg_iov, 0, 2);
> > > +	if (err)
> > > +		return err;
> > > +
> > > +	fl4->fl4_icmp_type = icmph.type;
> > > +	fl4->fl4_icmp_code = icmph.code;
> > 
> > That's more readable, but that exposes another problem in there - we read
> > the same piece of userland data twice, with no promise whatsoever that we'll
> > get the same value both times...
> 
> Sure, but you have to be root anyway to write to raw sockets.

Point, but that might very well be a pattern to watch for - there's at least
one more instance in TIPC (also not exploitable, according to TIPC folks)
and such bugs are easily repeated...

BTW, I've picked the tun and macvtap related bits from another part of old
queue; see vfs.git#untested-macvtap - it's on top of #iov_iter-net and it's
really completely untested.  Back then I was mostly interested in killing
as many ->aio_write() instances as I could, so it's only the "send" side of
things.

^ permalink raw reply

* RE: mlx4+vxlan offload breaks gre tunnels
From: Sathya Perla @ 2014-11-06  7:21 UTC (permalink / raw)
  To: Or Gerlitz, Florian Westphal, netdev@vger.kernel.org, Tom Herbert,
	Jesse Gross
  Cc: amirv@mellanox.com, Sathya Perla
In-Reply-To: <545A4DB7.5010603@mellanox.com>

> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-
> 
> On 11/5/2014 5:04 PM, Florian Westphal wrote:
> > tl,dr: all tcp packets sent via gre tunnel have broken tcp csum if vxlan
> offload
> > is enabled with mlx4 driver.
> >
> > Given following config on tx-side:
> > dev=enp3s0
> > ip addr add dev $dev 192.168.23.1/24
> > ip link set $dev up
> > ip link add mygre type gretap remote 192.168.23.2 local 192.168.23.1
> > ip addr add dev mygre 192.168.42.1/24
> > ip link set gre0 up
> > ip link set mygre up
> >
> > and
> >
> > options mlx4_core log_num_mgm_entry_size=-1 debug_level=1
> > port_type_array=2,2
> >
> > in
> > /etc/modprobe.d/mlx4.conf
> >
> > all tcp packets sent to destinations over the gre tunnel have bogus tcp
> > checksums (and are tossed on rx side when stack validates tcp checksum).
> >
> > net-next head is commit 30349bdbc4da5ecf0efa25556e3caff9c9b8c5f7 .
> >
> > What makes things work for me:
> > either
> >
> > options mlx4_core 1 debug_level=1 port_type_array=2,2
> >
> > (ie. no MLX4_TUNNEL_OFFLOAD_MODE_VXLAN)
> >
> > or not setting NETIF_F_IP_CSUM in enc_features:
> >
> > --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> > @@ -2579,10 +2579,12 @@ int mlx4_en_init_netdev(struct mlx4_en_dev
> *mdev, int port,
> >                  dev->priv_flags |= IFF_UNICAST_FLT;
> >
> >          if (mdev->dev->caps.tunnel_offload_mode ==
> MLX4_TUNNEL_OFFLOAD_MODE_VXLAN) {
> > -               dev->hw_enc_features |= NETIF_F_IP_CSUM | NETIF_F_RXCSUM
> |
> > +               dev->hw_enc_features |= NETIF_F_RXCSUM |
> >                                          NETIF_F_TSO | NETIF_F_GSO_UDP_TUNNEL;
> >
> > I am not sure if its right fix, but to my eyes this basically looks like
> > mlx4 is telling stack that it can handle tcp checksum offload within
> > tunnels, and that doesn't seem to be the case for all types (e.g. gre).
> >
> > Could someone who understand the enc_features specifics better confirm
> that
> > above patch is correct (or provide a better/proper fix)?
> 
> Yep, I can see now the problem. It comes into play with ConnectX3-pro
> NICs that support VXLAN offloads (but not with ConnectX3 NIC which
> don't) when you enable the offloads support on the CX3-pro.
> 
> The problem originates from the fact that we can't advertize something
> like "the HW can offload the inner checksum of UDP/VXLAN encapsulated
> (but not for GRE)", e.g in a similar manner that exists in the GSO
> space, where you have NETIF_F_GSO _YYY for each yyy in {UDP, SIT, GRE,
> etc} tunneling scheme.
> 
> I think the best effort we can do now is
> 
> 1. come up with something such as the below patch for 3.18 which is
> back-ward portable for -stable kernels, it will only arm the hw offloads
> if the OS tells us there's VXLAN in action

Or, wouldn't the patch below not work (i.e., the same issue would persist)
when there is both VXLAN and some other (say GRE) tunnel in the system
and the NIC HW is capable of supporting checksum offload only on VxLAN.

Do you expect a user who uses VxLAN to not use other kinds of tunnels?

> 
> 2. come  up with proper kernel APIs to let NICs advertize which encap
> schemes they can actually offload the inner checksum, Tom... your work
> which now runs over netdev.
> 
> Tom/Jesse- thoughts? are you +1-ing the below approach?
> 
> Or.
> 
> tested to work with the  following which is a bit different, tell me if
> it works for you
> 
> # node A - with mlx4_en address192.168.31.18
> ip tunnel add gre1 mode gre local 192.168.31.18 remote 192.168.31.17 ttl 255
> ifconfig gre1 10.10.10.18/24 up
> ifconfig gre1 mtu 1450
> 
> # node B - with mlx4_en address192.168.31.17
> ip tunnel add gre1 mode gre local 192.168.31.17 remote 192.168.31.18 ttl 255
> ifconfig gre1 10.10.10.17/24 up
> ifconfig gre1 mtu 1450
> 
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> index 0efbae9..7753833 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> @@ -2292,6 +2292,12 @@ static void mlx4_en_add_vxlan_offloads(struct
> work_struct *work)
>   out:
>          if (ret)
>                  en_err(priv, "failed setting L2 tunnel configuration
> ret %d\n", ret);
> +
> +       /* set offloads */
> +       priv->dev->hw_enc_features |= NETIF_F_IP_CSUM |
> NETIF_F_RXCSUM |
> +                                     NETIF_F_TSO | NETIF_F_GSO_UDP_TUNNEL;
> +       priv->dev->hw_features |= NETIF_F_GSO_UDP_TUNNEL;
> +       priv->dev->features    |= NETIF_F_GSO_UDP_TUNNEL;
>   }
> 
>   static void mlx4_en_del_vxlan_offloads(struct work_struct *work)
> @@ -2299,6 +2305,10 @@ static void mlx4_en_del_vxlan_offloads(struct
> work_struct *work)
>          int ret;
>          struct mlx4_en_priv *priv = container_of(work, struct mlx4_en_priv,
> vxlan_del_task);
> +       /* unset offloads */
> +       priv->dev->hw_enc_features = 0;
> +       priv->dev->hw_features &= ~NETIF_F_GSO_UDP_TUNNEL;
> +       priv->dev->features    &= ~NETIF_F_GSO_UDP_TUNNEL;
> 
>          ret = mlx4_SET_PORT_VXLAN(priv->mdev->dev, priv->port,
>                                    VXLAN_STEER_BY_OUTER_MAC, 0);
> @@ -2578,13 +2588,6 @@ int mlx4_en_init_netdev(struct mlx4_en_dev
> *mdev,
> int port,
>          if (mdev->dev->caps.steering_mode != MLX4_STEERING_MODE_A0)
>                  dev->priv_flags |= IFF_UNICAST_FLT;
> 
> -       if (mdev->dev->caps.tunnel_offload_mode ==
> MLX4_TUNNEL_OFFLOAD_MODE_VXLAN) {
> -               dev->hw_enc_features |= NETIF_F_IP_CSUM | NETIF_F_RXCSUM |
> -                                       NETIF_F_TSO |
> NETIF_F_GSO_UDP_TUNNEL;
> -               dev->hw_features |= NETIF_F_GSO_UDP_TUNNEL;
> -               dev->features    |= NETIF_F_GSO_UDP_TUNNEL;
> -       }
> -
>          mdev->pndev[port] = dev;
> 
>          netif_carrier_off(dev);
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: mlx4+vxlan offload breaks gre tunnels
From: Or Gerlitz @ 2014-11-06  7:53 UTC (permalink / raw)
  To: Sathya Perla
  Cc: Or Gerlitz, Florian Westphal, netdev@vger.kernel.org, Tom Herbert,
	Jesse Gross, amirv@mellanox.com
In-Reply-To: <637ce8e1-89e5-4254-a928-4ab2a4eb8c29@CMEXHTCAS1.ad.emulex.com>

On Thu, Nov 6, 2014 at 9:21 AM, Sathya Perla <Sathya.Perla@emulex.com> wrote:
>> I think the best effort we can do now is
>>
>> 1. come up with something such as the below patch for 3.18 which is
>> back-ward portable for -stable kernels, it will only arm the hw offloads
>> if the OS tells us there's VXLAN in action

> Or, wouldn't the patch below not work (i.e., the same issue would persist)
> when there is both VXLAN and some other (say GRE) tunnel in the system
> and the NIC HW is capable of supporting checksum offload only on VxLAN.

Indeed, this would be the case. But the patch below will make things
to work when
only GRE is used (or when  only VXLAN is used). So we're making
progress vs. the current
situation where GRE breaks over a HW which is capable to do VXLAN offloads even
if there's no VXLAN tunnel around.


Or.


>> 2. come  up with proper kernel APIs to let NICs advertize which encap
>> schemes they can actually offload the inner checksum, Tom... your work
>> which now runs over netdev.

>> tested to work with the  following which is a bit different, tell me if
>> it works for you
>>
>> # node A - with mlx4_en address192.168.31.18
>> ip tunnel add gre1 mode gre local 192.168.31.18 remote 192.168.31.17 ttl 255
>> ifconfig gre1 10.10.10.18/24 up
>> ifconfig gre1 mtu 1450
>>
>> # node B - with mlx4_en address192.168.31.17
>> ip tunnel add gre1 mode gre local 192.168.31.17 remote 192.168.31.18 ttl 255
>> ifconfig gre1 10.10.10.17/24 up
>> ifconfig gre1 mtu 1450

^ permalink raw reply

* Re: getting rid of ->splice_write?
From: Christoph Hellwig @ 2014-11-06  7:55 UTC (permalink / raw)
  To: Al Viro; +Cc: Miklos Szeredi, linux-fsdevel, netdev
In-Reply-To: <20141105184945.GS7996@ZenIV.linux.org.uk>

On Wed, Nov 05, 2014 at 06:49:45PM +0000, Al Viro wrote:
> Not really.  A minor nitpick is that you've missed port_fops_splice_write(),
> but the real bitch isn't that and not even the sockets (see the fun with
> iov_iter sendmsg/recvmsg work getting resurrected).  It's that NULL
> ->splice_write() means default_file_splice_write.  IOW, you'd need either
> ->write_iter() for _everything_ (with support of bvec-backed ones included,
> since that's what iter_file_splice_write() will feed to ->write_iter()),
> or you need to have do_splice_from() check if ->write_iter is NULL and
> go for default_file_splice_write() instead of iter_file_splice_write().
> 
> The latter might be doable, but the former is really over the top - for
> that we'd need to touch every driver instance of ->write() out there.
> You want to do that, it's your funeral...

The latter is what I thought off.  And yes, the socket work looks good,
especially if we can get rid of ->sendpage as well.  That'll require
passing new flags somewhere, the ones in the iocb added for
preadv2/pwritev2 might be usable.

> > Similarly it seems to be like we could kill ->splice_read by
> > implementing an equivalent iteration over ->read_iter.
> 
> Hard to do.  I agree that we want to, but it'll take quite a bit of work
> on iov_iter primitives, I'm afraid.  At the very least, we want a variant
> of iov_iter that could steal pages.  Don't forget that a large part of
> the rationale behind splice_read was the ability to play zero-copy games.
> 
> I'm not sure if it will happen this cycle; there's more than enough fun
> on the net/* side.  _If_ that ends up done faster than I expect it to be,
> ->splice_read() is the obvious next target.

And zero copy games would become a lot less nasty if they could go
straight through ->read_iter instead of the current abuses of splice
infrastructure.

Same for sendfile, btw.

^ permalink raw reply

* RE: [PATCH] bridge: missing null bridge device check causing null pointer dereference (bugfix)
From: 박수현 @ 2014-11-06  7:58 UTC (permalink / raw)
  To: Toshiaki Makita, Stephen Hemminger, David S. Miller
  Cc: bridge@lists.linux-foundation.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <545B1E27.3080302@lab.ntt.co.jp>

>-----Original Message-----
>From: Toshiaki Makita [mailto:makita.toshiaki@lab.ntt.co.jp]
>Sent: Thursday, November 06, 2014 4:07 PM
>To: 박수현; Stephen Hemminger; David S. Miller
>Cc: bridge@lists.linux-foundation.org; netdev@vger.kernel.org; linux-
>kernel@vger.kernel.org
>Subject: Re: [PATCH] bridge: missing null bridge device check causing null
>pointer dereference (bugfix)
>
>On 2014/11/06 15:26, Su-Hyun Park wrote:
>> the bridge device can be null if the bridge is being deleted while
>> processing the packet, which causes the null pointer dereference in
>switch statement.
>
>How can this happen??
>It is guarded by rcu.
>netdev_rx_handler_unregister() ensures rx_handler_data is non NULL.
>

The RCU protect rx_handler_data, not the bridge member port. It can be NULL according to below code.

static inline struct net_bridge_port *br_port_get_rcu(const struct net_device *dev) {
	struct net_bridge_port *port = rcu_dereference(dev->rx_handler_data);
	return br_port_exists(dev) ? port : NULL; 
}

The crash happens at the below switch statement in br_handle_frame, where p is NULL.

	switch (p->state)

>>
>> crash dump snippet:
>>
>> <1>BUG: unable to handle kernel NULL pointer dereference at
>> 0000000000000021
>> <1>IP: [<ffffffff814179f6>] br_handle_frame+0xe6/0x270
>>
>> <0>Code: 4c 0f 44 f0 89 f8 66 33 15 32 52 24 00 66 33 05 29 52 24 00
>> 09 c2 89
>> f0 66 33 05 22 52 24 00 80 e4 f0 66 09 c2 0f 84 eb 00 00 00 <41> 0f b6
>> 46 21 3c 02 74 61 3c 03 74 1d 48 89 df e8 d5 bc f0 ff
>> ---
>>  net/bridge/br_input.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c index
>> 6fd5522..7e899ca 100644
>> --- a/net/bridge/br_input.c
>> +++ b/net/bridge/br_input.c
>> @@ -176,6 +176,8 @@ rx_handler_result_t br_handle_frame(struct sk_buff
>**pskb)
>>  		return RX_HANDLER_CONSUMED;
>>
>>  	p = br_port_get_rcu(skb->dev);
>> +	if (!p)
>> +		goto drop;
>>
>>  	if (unlikely(is_link_local_ether_addr(dest))) {
>>  		u16 fwd_mask = p->br->group_fwd_mask_required;
>>
>


^ permalink raw reply

* Re: M_CAN message RAM initialization AppNote  - was: Re: [PATCH V3 3/3] can: m_can: workaround for transmit data less than 4 bytes
From: Dong Aisheng @ 2014-11-06  8:09 UTC (permalink / raw)
  To: Oliver Hartkopp
  Cc: Marc Kleine-Budde, linux-can, wg, varkabhadram, netdev,
	linux-arm-kernel
In-Reply-To: <545B1D71.4000408@hartkopp.net>

On Thu, Nov 06, 2014 at 08:04:17AM +0100, Oliver Hartkopp wrote:
> On 06.11.2014 02:57, Dong Aisheng wrote:
> >On Wed, Nov 05, 2014 at 07:15:10PM +0100, Oliver Hartkopp wrote:
> 
> >>The Message RAM is usually equipped with a parity or ECC functionality.
> >>But RAM cells suffer a hardware reset and can therefore hold
> >>arbitrary content at startup - including parity and/or ECC bits.
> >>
> >>So when you write only the CAN ID and the first four bytes the last
> >>four bytes remain untouched. Then the M_CAN starts to read in 32bit
> >>words from the start of the Tx Message element. So it is very likely
> >>to trigger the message RAM error when reading the uninitialized
> >>32bit word from the last four bytes.
> >>
> >>Finally it turns out that an initial writing (with any kind of data)
> >>to the entire message RAM is mandatory to create valid parity/ECC
> >>checksums.
> >>
> >>That's it.
> >>
> >
> >Thanks for sharing this information.
> >Does it mean this issue is related to the nature of Message RAM and is
> >supposed to exist on all M_CAN IP versions?
> 
> From what I know from the 3.1.x revision there's no change regarding
> IR.BRU and IR.BEC - so I would assume this to stay on all M_CAN IP
> revisions.
> 
> But after some sleep I wonder if this patch [3/3] would need an update too.
> 
> Writing to the TX message RAM is obviously no workaround but a valid
> and needed initialization process.
> 
> I would tend to make this patch:
> 
> ---
> 
> can: m_can: add missing TX message RAM initialization
> 
> The M_CAN message RAM is usually equipped with a parity or ECC functionality.
> But RAM cells suffer a hardware reset and can therefore hold
> arbitrary content at startup - including parity and/or ECC bits.
> 
> To prevent the M_CAN controller detecting checksum errors when
> reading potentially uninitialized TX message RAM content to transmit
> CAN frames the TX message RAM has to be written with (any kind of)
> initial data.
> 

The key point of the issue is that why M_CAN will read potentially uninitialized
TX message RAM content which should not happen?
e.g. for our case of the issue, if we sending a no data frame or a less
than 4 bytes frame, why m_can will read extra 4 bytes uninitialized/unset
data which seems not reasonable?

>From IP code logic, it will read full 8 bytes of data no matter how many data
actually to be transfered which is strange.

For sending data over 4 bytes, since the Message RAM content will be filled
with the real data to be transfered so there's no such issue.

> ---
> 
> Then the code should memset() the entire TX FIFO element - and not
> only the 8 data bytes we are addressing now.
> 

Our IC guy told me the issue only happened on transferring a data size
of less than 4 bytes and my test also proved that.
So i'm not sure memset() the entire TX FIFO element is neccesary...

Do you think we could keep the current solution firstly and updated later
if needed?

> Maybe it makes sense to send the entire updated patch set (3) again ...
> 
> [1/3] can: add can_is_canfd_skb() API
> [2/3] can: m_can: update to support CAN FD features
> [3/3] can: m_can: add missing message RAM initialization
> 
> Are you ok with that?
> 
> Regards,
> Oliver
> 

Regards
Dong Aisheng

^ permalink raw reply

* Re: [PATCH net 0/5] Implement ndo_gso_check() for vxlan nics
From: Or Gerlitz @ 2014-11-06  8:23 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David Miller, Joe Stringer, Linux Netdev List, Sathya Perla,
	Jeff Kirsher, linux.nics, Amir Vadai, shahed.shaikh,
	dept-gelinuxnicdev, LKML
In-Reply-To: <CA+mtBx-Js62X3BpfjQkcsh=w=-cgvSyM5ueCpufGVv0uu1nphw@mail.gmail.com>

On Thu, Nov 6, 2014 at 4:44 AM, Tom Herbert <therbert@google.com> wrote:
> On Wed, Nov 5, 2014 at 6:15 PM, David Miller <davem@davemloft.net> wrote:
>> From: Joe Stringer <joestringer@nicira.com>
>> Date: Wed, 5 Nov 2014 17:06:46 -0800
>>
>>> My impression was that the changes are more likely to be
>>> hardware-specific (like the i40e changes) rather than software-specific,
>>> like changes that might be integrated into the helper.
>>
>> I think there is more commonality amongst hardware capabilities,
>> and this is why I want the helper to play itself out.
>>
>>> That said, I can rework for one helper. The way I see it would be the
>>> same code as these patches, as "vxlan_gso_check(struct sk_buff *)" in
>>> drivers/net/vxlan.c which would be called from each driver. Is that what
>>> you had in mind?
>>
>> Yes.
>
> Note that this code is not VXLAN specific, it will also accept NVGRE
> and GRE/UDP with keyid and TEB. I imagine all these cases should be
> indistinguishable to the hardware so they probably just work (which
> would be cool!). It might be better to name and locate the helper
> function to reflect that.

Unlike the VXLAN case, currently there's no indication from the
network stack towards the driver that an NVGRE tunnel was set, so in
our case we're not arming the HW offloads for NVGRE. I'll look into
that, maybe we can make them work always. Also re the math there to be
the same for VXLAN/NVGRE  -- skb_inner_mac_header(skb) -
skb_transport_header(skb) is exactly 8 (sizeof struct gre_base_hdr),
isn't that?

^ permalink raw reply

* Re: [PATCH 1/4] inet: Add skb_copy_datagram_iter
From: Herbert Xu @ 2014-11-06  8:23 UTC (permalink / raw)
  To: David Miller; +Cc: viro, netdev, linux-kernel, bcrl
In-Reply-To: <20141105.152410.1775725940252546246.davem@davemloft.net>

On Wed, Nov 05, 2014 at 03:24:10PM -0500, David Miller wrote:
> 
> Herbert, please provide a cover letter for this series, and the most recent
> version of patch #2 gets various rejects when I try to apply it to net-next.

Sure, I'll regenerate them.  However, while doing so I noticed that
a number of my patches on tun/macvtap that you have previously set
as accepted are missing from net-next.  Could this be why you got
the rejects?

For example, this patch wasn't in net-next when I just did a pull.

https://patchwork.ozlabs.org/patch/405966/

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* [PATCH 0/4] Replace skb_copy_datagram_const_iovec with iterator version
From: Herbert Xu @ 2014-11-06  8:27 UTC (permalink / raw)
  To: David Miller; +Cc: viro, netdev, linux-kernel, bcrl, YOSHIFUJI Hideaki
In-Reply-To: <20141105.152410.1775725940252546246.davem@davemloft.net>

Hi Dave:

This patch series adds the helper skb_copy_datagram_iter, which
is meant to replace both skb_copy_datagram_iovec and its evil
twin skb_copy_datagram_const_iovec.

It then converts tun and macvtap over to the new helper and finally
removes skb_copy_datagram_const_iovec which is only used by tun
and macvtap.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* [PATCH 1/4] inet: Add skb_copy_datagram_iter
From: Herbert Xu @ 2014-11-06  8:28 UTC (permalink / raw)
  To: David Miller, viro, netdev, linux-kernel, bcrl, YOSHIFUJI Hideaki
In-Reply-To: <20141106082704.GB29800@gondor.apana.org.au>

This patch adds skb_copy_datagram_iter, which is identical to
skb_copy_datagram_iovec except that it operates on iov_iter
instead of iovec.

Eventually all users of skb_copy_datagram_iovec should switch
over to iov_iter and then we can remove skb_copy_datagram_iovec.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/linux/skbuff.h |    3 +
 net/core/datagram.c    |   82 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 85 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 39ec753..a405013 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -150,6 +150,7 @@
 struct net_device;
 struct scatterlist;
 struct pipe_inode_info;
+struct iov_iter;
 
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
 struct nf_conntrack {
@@ -2655,6 +2656,8 @@ int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *frm,
 int skb_copy_datagram_const_iovec(const struct sk_buff *from, int offset,
 				  const struct iovec *to, int to_offset,
 				  int size);
+int skb_copy_datagram_iter(const struct sk_buff *from, int offset,
+			   struct iov_iter *to, int size);
 void skb_free_datagram(struct sock *sk, struct sk_buff *skb);
 void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb);
 int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags);
diff --git a/net/core/datagram.c b/net/core/datagram.c
index fdbc9a8..45a9d4d 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -49,6 +49,7 @@
 #include <linux/spinlock.h>
 #include <linux/slab.h>
 #include <linux/pagemap.h>
+#include <linux/uio.h>
 
 #include <net/protocol.h>
 #include <linux/skbuff.h>
@@ -482,6 +483,87 @@ fault:
 EXPORT_SYMBOL(skb_copy_datagram_const_iovec);
 
 /**
+ *	skb_copy_datagram_iter - Copy a datagram to an iovec iterator.
+ *	@skb: buffer to copy
+ *	@offset: offset in the buffer to start copying from
+ *	@to: iovec iterator to copy to
+ *	@len: amount of data to copy from buffer to iovec
+ */
+int skb_copy_datagram_iter(const struct sk_buff *skb, int offset,
+			   struct iov_iter *to, int len)
+{
+	int start = skb_headlen(skb);
+	int i, copy = start - offset;
+	struct sk_buff *frag_iter;
+
+	trace_skb_copy_datagram_iovec(skb, len);
+
+	/* Copy header. */
+	if (copy > 0) {
+		if (copy > len)
+			copy = len;
+		if (copy_to_iter(skb->data + offset, copy, to))
+			goto fault;
+		if ((len -= copy) == 0)
+			return 0;
+		offset += copy;
+	}
+
+	/* Copy paged appendix. Hmm... why does this look so complicated? */
+	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+		int end;
+		const skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+
+		WARN_ON(start > offset + len);
+
+		end = start + skb_frag_size(frag);
+		if ((copy = end - offset) > 0) {
+			int err;
+			u8  *vaddr;
+			struct page *page = skb_frag_page(frag);
+
+			if (copy > len)
+				copy = len;
+			vaddr = kmap(page);
+			err = copy_to_iter(vaddr + frag->page_offset +
+					   offset - start, copy, to);
+			kunmap(page);
+			if (err)
+				goto fault;
+			if (!(len -= copy))
+				return 0;
+			offset += copy;
+		}
+		start = end;
+	}
+
+	skb_walk_frags(skb, frag_iter) {
+		int end;
+
+		WARN_ON(start > offset + len);
+
+		end = start + frag_iter->len;
+		if ((copy = end - offset) > 0) {
+			if (copy > len)
+				copy = len;
+			if (skb_copy_datagram_iter(frag_iter, offset - start,
+						   to, copy))
+				goto fault;
+			if ((len -= copy) == 0)
+				return 0;
+			offset += copy;
+		}
+		start = end;
+	}
+	if (!len)
+		return 0;
+
+fault:
+	return -EFAULT;
+}
+EXPORT_SYMBOL(skb_copy_datagram_iter);
+
+/**
  *	skb_copy_datagram_from_iovec - Copy a datagram from an iovec.
  *	@skb: buffer to copy
  *	@offset: offset in the buffer to start copying to

^ permalink raw reply related

* [PATCH 2/4] tun: Use iovec iterators
From: Herbert Xu @ 2014-11-06  8:28 UTC (permalink / raw)
  To: David Miller, viro, netdev, linux-kernel, bcrl, YOSHIFUJI Hideaki
In-Reply-To: <20141106082704.GB29800@gondor.apana.org.au>

This patch removes the use of skb_copy_datagram_const_iovec in
favour of the iovec iterator-based skb_copy_datagram_iter.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 drivers/net/tun.c |   65 ++++++++++++++++++++++++------------------------------
 1 file changed, 30 insertions(+), 35 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 9dd3746..b4ac4d5 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -71,6 +71,7 @@
 #include <net/rtnetlink.h>
 #include <net/sock.h>
 #include <linux/seq_file.h>
+#include <linux/uio.h>
 
 #include <asm/uaccess.h>
 
@@ -1230,11 +1231,11 @@ static ssize_t tun_chr_aio_write(struct kiocb *iocb, const struct iovec *iv,
 static ssize_t tun_put_user(struct tun_struct *tun,
 			    struct tun_file *tfile,
 			    struct sk_buff *skb,
-			    const struct iovec *iv, int len)
+			    struct iov_iter *iter)
 {
 	struct tun_pi pi = { 0, skb->protocol };
-	ssize_t total = 0;
-	int vlan_offset = 0, copied;
+	ssize_t total;
+	int vlan_offset;
 	int vlan_hlen = 0;
 	int vnet_hdr_sz = 0;
 
@@ -1244,23 +1245,25 @@ static ssize_t tun_put_user(struct tun_struct *tun,
 	if (tun->flags & TUN_VNET_HDR)
 		vnet_hdr_sz = tun->vnet_hdr_sz;
 
+	total = skb->len + vlan_hlen + vnet_hdr_sz;
+
 	if (!(tun->flags & TUN_NO_PI)) {
-		if ((len -= sizeof(pi)) < 0)
+		if (iov_iter_count(iter) < sizeof(pi))
 			return -EINVAL;
 
-		if (len < skb->len + vlan_hlen + vnet_hdr_sz) {
+		total += sizeof(pi);
+		if (iov_iter_count(iter) < total) {
 			/* Packet will be striped */
 			pi.flags |= TUN_PKT_STRIP;
 		}
 
-		if (memcpy_toiovecend(iv, (void *) &pi, 0, sizeof(pi)))
+		if (copy_to_iter(&pi, sizeof(pi), iter))
 			return -EFAULT;
-		total += sizeof(pi);
 	}
 
 	if (vnet_hdr_sz) {
 		struct virtio_net_hdr gso = { 0 }; /* no info leak */
-		if ((len -= vnet_hdr_sz) < 0)
+		if (iov_iter_count(iter) < vnet_hdr_sz)
 			return -EINVAL;
 
 		if (skb_is_gso(skb)) {
@@ -1299,17 +1302,12 @@ static ssize_t tun_put_user(struct tun_struct *tun,
 			gso.flags = VIRTIO_NET_HDR_F_DATA_VALID;
 		} /* else everything is zero */
 
-		if (unlikely(memcpy_toiovecend(iv, (void *)&gso, total,
-					       sizeof(gso))))
+		if (copy_to_iter(&gso, sizeof(gso), iter))
 			return -EFAULT;
-		total += vnet_hdr_sz;
 	}
 
-	copied = total;
-	len = min_t(int, skb->len + vlan_hlen, len);
-	total += skb->len + vlan_hlen;
 	if (vlan_hlen) {
-		int copy, ret;
+		int ret;
 		struct {
 			__be16 h_vlan_proto;
 			__be16 h_vlan_TCI;
@@ -1320,36 +1318,32 @@ static ssize_t tun_put_user(struct tun_struct *tun,
 
 		vlan_offset = offsetof(struct vlan_ethhdr, h_vlan_proto);
 
-		copy = min_t(int, vlan_offset, len);
-		ret = skb_copy_datagram_const_iovec(skb, 0, iv, copied, copy);
-		len -= copy;
-		copied += copy;
-		if (ret || !len)
+		ret = skb_copy_datagram_iter(skb, 0, iter, vlan_offset);
+		if (ret || !iov_iter_count(iter))
 			goto done;
 
-		copy = min_t(int, sizeof(veth), len);
-		ret = memcpy_toiovecend(iv, (void *)&veth, copied, copy);
-		len -= copy;
-		copied += copy;
-		if (ret || !len)
+		ret = copy_to_iter(&veth, sizeof(veth), iter);
+		if (ret || !iov_iter_count(iter))
 			goto done;
 	}
 
-	skb_copy_datagram_const_iovec(skb, vlan_offset, iv, copied, len);
+	skb_copy_datagram_iter(skb, vlan_offset, iter, skb->len - vlan_offset);
 
 done:
 	tun->dev->stats.tx_packets++;
-	tun->dev->stats.tx_bytes += len;
+	tun->dev->stats.tx_bytes += skb->len + vlan_hlen;
 
 	return total;
 }
 
 static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
-			   const struct iovec *iv, ssize_t len, int noblock)
+			   const struct iovec *iv, unsigned long segs,
+			   ssize_t len, int noblock)
 {
 	struct sk_buff *skb;
 	ssize_t ret = 0;
 	int peeked, err, off = 0;
+	struct iov_iter iter;
 
 	tun_debug(KERN_INFO, tun, "tun_do_read\n");
 
@@ -1362,11 +1356,12 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
 	/* Read frames from queue */
 	skb = __skb_recv_datagram(tfile->socket.sk, noblock ? MSG_DONTWAIT : 0,
 				  &peeked, &off, &err);
-	if (skb) {
-		ret = tun_put_user(tun, tfile, skb, iv, len);
-		kfree_skb(skb);
-	} else
-		ret = err;
+	if (!skb)
+		return ret;
+
+	iov_iter_init(&iter, READ, iv, segs, len);
+	ret = tun_put_user(tun, tfile, skb, &iter);
+	kfree_skb(skb);
 
 	return ret;
 }
@@ -1387,7 +1382,7 @@ static ssize_t tun_chr_aio_read(struct kiocb *iocb, const struct iovec *iv,
 		goto out;
 	}
 
-	ret = tun_do_read(tun, tfile, iv, len,
+	ret = tun_do_read(tun, tfile, iv, count, len,
 			  file->f_flags & O_NONBLOCK);
 	ret = min_t(ssize_t, ret, len);
 	if (ret > 0)
@@ -1488,7 +1483,7 @@ static int tun_recvmsg(struct kiocb *iocb, struct socket *sock,
 					 SOL_PACKET, TUN_TX_TIMESTAMP);
 		goto out;
 	}
-	ret = tun_do_read(tun, tfile, m->msg_iov, total_len,
+	ret = tun_do_read(tun, tfile, m->msg_iov, m->msg_iovlen, total_len,
 			  flags & MSG_DONTWAIT);
 	if (ret > total_len) {
 		m->msg_flags |= MSG_TRUNC;

^ permalink raw reply related

* [PATCH 3/4] macvtap: Use iovec iterators
From: Herbert Xu @ 2014-11-06  8:28 UTC (permalink / raw)
  To: David Miller, viro, netdev, linux-kernel, bcrl, YOSHIFUJI Hideaki
In-Reply-To: <20141106082704.GB29800@gondor.apana.org.au>

This patch removes the use of skb_copy_datagram_const_iovec in
favour of the iovec iterator-based skb_copy_datagram_iter.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 drivers/net/macvtap.c |   45 ++++++++++++++++++++-------------------------
 1 file changed, 20 insertions(+), 25 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 880cc09..a0e1dd7 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -15,6 +15,7 @@
 #include <linux/cdev.h>
 #include <linux/idr.h>
 #include <linux/fs.h>
+#include <linux/uio.h>
 
 #include <net/ipv6.h>
 #include <net/net_namespace.h>
@@ -778,31 +779,28 @@ static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
 /* Put packet to the user space buffer */
 static ssize_t macvtap_put_user(struct macvtap_queue *q,
 				const struct sk_buff *skb,
-				const struct iovec *iv, int len)
+				struct iov_iter *iter)
 {
 	int ret;
 	int vnet_hdr_len = 0;
 	int vlan_offset = 0;
-	int copied, total;
+	int total;
 
 	if (q->flags & IFF_VNET_HDR) {
 		struct virtio_net_hdr vnet_hdr;
 		vnet_hdr_len = q->vnet_hdr_sz;
-		if ((len -= vnet_hdr_len) < 0)
+		if (iov_iter_count(iter) < vnet_hdr_len)
 			return -EINVAL;
 
 		macvtap_skb_to_vnet_hdr(skb, &vnet_hdr);
 
-		if (memcpy_toiovecend(iv, (void *)&vnet_hdr, 0, sizeof(vnet_hdr)))
+		if (copy_to_iter(&vnet_hdr, sizeof(vnet_hdr), iter))
 			return -EFAULT;
 	}
-	total = copied = vnet_hdr_len;
+	total = vnet_hdr_len;
 	total += skb->len;
 
-	if (!vlan_tx_tag_present(skb))
-		len = min_t(int, skb->len, len);
-	else {
-		int copy;
+	if (vlan_tx_tag_present(skb)) {
 		struct {
 			__be16 h_vlan_proto;
 			__be16 h_vlan_TCI;
@@ -811,37 +809,33 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
 		veth.h_vlan_TCI = htons(vlan_tx_tag_get(skb));
 
 		vlan_offset = offsetof(struct vlan_ethhdr, h_vlan_proto);
-		len = min_t(int, skb->len + VLAN_HLEN, len);
 		total += VLAN_HLEN;
 
-		copy = min_t(int, vlan_offset, len);
-		ret = skb_copy_datagram_const_iovec(skb, 0, iv, copied, copy);
-		len -= copy;
-		copied += copy;
-		if (ret || !len)
+		ret = skb_copy_datagram_iter(skb, 0, iter, vlan_offset);
+		if (ret || !iov_iter_count(iter))
 			goto done;
 
-		copy = min_t(int, sizeof(veth), len);
-		ret = memcpy_toiovecend(iv, (void *)&veth, copied, copy);
-		len -= copy;
-		copied += copy;
-		if (ret || !len)
+		ret = copy_to_iter(&veth, sizeof(veth), iter);
+		if (ret || !iov_iter_count(iter))
 			goto done;
 	}
 
-	ret = skb_copy_datagram_const_iovec(skb, vlan_offset, iv, copied, len);
+	ret = skb_copy_datagram_iter(skb, vlan_offset, iter,
+				     skb->len - vlan_offset);
 
 done:
 	return ret ? ret : total;
 }
 
 static ssize_t macvtap_do_read(struct macvtap_queue *q,
-			       const struct iovec *iv, unsigned long len,
+			       const struct iovec *iv, unsigned long segs,
+			       unsigned long len,
 			       int noblock)
 {
 	DEFINE_WAIT(wait);
 	struct sk_buff *skb;
 	ssize_t ret = 0;
+	struct iov_iter iter;
 
 	while (len) {
 		if (!noblock)
@@ -863,7 +857,8 @@ static ssize_t macvtap_do_read(struct macvtap_queue *q,
 			schedule();
 			continue;
 		}
-		ret = macvtap_put_user(q, skb, iv, len);
+		iov_iter_init(&iter, READ, iv, segs, len);
+		ret = macvtap_put_user(q, skb, &iter);
 		kfree_skb(skb);
 		break;
 	}
@@ -886,7 +881,7 @@ static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
 		goto out;
 	}
 
-	ret = macvtap_do_read(q, iv, len, file->f_flags & O_NONBLOCK);
+	ret = macvtap_do_read(q, iv, count, len, file->f_flags & O_NONBLOCK);
 	ret = min_t(ssize_t, ret, len);
 	if (ret > 0)
 		iocb->ki_pos = ret;
@@ -1117,7 +1112,7 @@ static int macvtap_recvmsg(struct kiocb *iocb, struct socket *sock,
 	int ret;
 	if (flags & ~(MSG_DONTWAIT|MSG_TRUNC))
 		return -EINVAL;
-	ret = macvtap_do_read(q, m->msg_iov, total_len,
+	ret = macvtap_do_read(q, m->msg_iov, m->msg_iovlen, total_len,
 			  flags & MSG_DONTWAIT);
 	if (ret > total_len) {
 		m->msg_flags |= MSG_TRUNC;

^ permalink raw reply related

* [PATCH 4/4] net: Kill skb_copy_datagram_const_iovec
From: Herbert Xu @ 2014-11-06  8:28 UTC (permalink / raw)
  To: David Miller, viro, netdev, linux-kernel, bcrl, YOSHIFUJI Hideaki
In-Reply-To: <20141106082704.GB29800@gondor.apana.org.au>

Now that both macvtap and tun are using skb_copy_datagram_iter, we
can kill the abomination that is skb_copy_datagram_const_iovec.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/linux/skbuff.h |    3 -
 net/core/datagram.c    |   89 -------------------------------------------------
 2 files changed, 92 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a405013..da59580 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2653,9 +2653,6 @@ int skb_copy_datagram_from_iovec(struct sk_buff *skb, int offset,
 				 int len);
 int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *frm,
 			   int offset, size_t count);
-int skb_copy_datagram_const_iovec(const struct sk_buff *from, int offset,
-				  const struct iovec *to, int to_offset,
-				  int size);
 int skb_copy_datagram_iter(const struct sk_buff *from, int offset,
 			   struct iov_iter *to, int size);
 void skb_free_datagram(struct sock *sk, struct sk_buff *skb);
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 45a9d4d..93054b9 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -394,95 +394,6 @@ fault:
 EXPORT_SYMBOL(skb_copy_datagram_iovec);
 
 /**
- *	skb_copy_datagram_const_iovec - Copy a datagram to an iovec.
- *	@skb: buffer to copy
- *	@offset: offset in the buffer to start copying from
- *	@to: io vector to copy to
- *	@to_offset: offset in the io vector to start copying to
- *	@len: amount of data to copy from buffer to iovec
- *
- *	Returns 0 or -EFAULT.
- *	Note: the iovec is not modified during the copy.
- */
-int skb_copy_datagram_const_iovec(const struct sk_buff *skb, int offset,
-				  const struct iovec *to, int to_offset,
-				  int len)
-{
-	int start = skb_headlen(skb);
-	int i, copy = start - offset;
-	struct sk_buff *frag_iter;
-
-	/* Copy header. */
-	if (copy > 0) {
-		if (copy > len)
-			copy = len;
-		if (memcpy_toiovecend(to, skb->data + offset, to_offset, copy))
-			goto fault;
-		if ((len -= copy) == 0)
-			return 0;
-		offset += copy;
-		to_offset += copy;
-	}
-
-	/* Copy paged appendix. Hmm... why does this look so complicated? */
-	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
-		int end;
-		const skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
-
-		WARN_ON(start > offset + len);
-
-		end = start + skb_frag_size(frag);
-		if ((copy = end - offset) > 0) {
-			int err;
-			u8  *vaddr;
-			struct page *page = skb_frag_page(frag);
-
-			if (copy > len)
-				copy = len;
-			vaddr = kmap(page);
-			err = memcpy_toiovecend(to, vaddr + frag->page_offset +
-						offset - start, to_offset, copy);
-			kunmap(page);
-			if (err)
-				goto fault;
-			if (!(len -= copy))
-				return 0;
-			offset += copy;
-			to_offset += copy;
-		}
-		start = end;
-	}
-
-	skb_walk_frags(skb, frag_iter) {
-		int end;
-
-		WARN_ON(start > offset + len);
-
-		end = start + frag_iter->len;
-		if ((copy = end - offset) > 0) {
-			if (copy > len)
-				copy = len;
-			if (skb_copy_datagram_const_iovec(frag_iter,
-							  offset - start,
-							  to, to_offset,
-							  copy))
-				goto fault;
-			if ((len -= copy) == 0)
-				return 0;
-			offset += copy;
-			to_offset += copy;
-		}
-		start = end;
-	}
-	if (!len)
-		return 0;
-
-fault:
-	return -EFAULT;
-}
-EXPORT_SYMBOL(skb_copy_datagram_const_iovec);
-
-/**
  *	skb_copy_datagram_iter - Copy a datagram to an iovec iterator.
  *	@skb: buffer to copy
  *	@offset: offset in the buffer to start copying from

^ permalink raw reply related

* Re: [PATCH] bridge: missing null bridge device check causing null pointer dereference (bugfix)
From: Toshiaki Makita @ 2014-11-06  8:28 UTC (permalink / raw)
  To: 박수현, Stephen Hemminger, David S. Miller
  Cc: netdev@vger.kernel.org, bridge@lists.linux-foundation.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <8D1F1238A24CE743B8F3CED0F137C69E408AA087@EXMB02.ahnbang.ahnlab.com>

On 2014/11/06 16:58, 박수현 wrote:
>> -----Original Message-----
>> From: Toshiaki Makita [mailto:makita.toshiaki@lab.ntt.co.jp]
>> Sent: Thursday, November 06, 2014 4:07 PM
>> To: 박수현; Stephen Hemminger; David S. Miller
>> Cc: bridge@lists.linux-foundation.org; netdev@vger.kernel.org; linux-
>> kernel@vger.kernel.org
>> Subject: Re: [PATCH] bridge: missing null bridge device check causing null
>> pointer dereference (bugfix)
>>
>> On 2014/11/06 15:26, Su-Hyun Park wrote:
>>> the bridge device can be null if the bridge is being deleted while
>>> processing the packet, which causes the null pointer dereference in
>> switch statement.
>>
>> How can this happen??
>> It is guarded by rcu.
>> netdev_rx_handler_unregister() ensures rx_handler_data is non NULL.
>>
> 
> The RCU protect rx_handler_data, not the bridge member port. It can be NULL according to below code.
> 
> static inline struct net_bridge_port *br_port_get_rcu(const struct net_device *dev) {
> 	struct net_bridge_port *port = rcu_dereference(dev->rx_handler_data);
> 	return br_port_exists(dev) ? port : NULL; 
> }

Seems to have been fixed for a year.
716ec052d228 ("bridge: fix NULL pointer deref of br_port_get_rcu")

Thanks,
Toshiaki Makita

^ permalink raw reply

* Re: M_CAN message RAM initialization AppNote  - was: Re: [PATCH V3 3/3] can: m_can: workaround for transmit data less than 4 bytes
From: Marc Kleine-Budde @ 2014-11-06  9:00 UTC (permalink / raw)
  To: Oliver Hartkopp, Dong Aisheng
  Cc: linux-can, wg, varkabhadram, netdev, linux-arm-kernel
In-Reply-To: <545B1D71.4000408@hartkopp.net>

[-- Attachment #1: Type: text/plain, Size: 2447 bytes --]

On 11/06/2014 08:04 AM, Oliver Hartkopp wrote:
> On 06.11.2014 02:57, Dong Aisheng wrote:
>> On Wed, Nov 05, 2014 at 07:15:10PM +0100, Oliver Hartkopp wrote:
> 
>>> The Message RAM is usually equipped with a parity or ECC functionality.
>>> But RAM cells suffer a hardware reset and can therefore hold
>>> arbitrary content at startup - including parity and/or ECC bits.
>>>
>>> So when you write only the CAN ID and the first four bytes the last
>>> four bytes remain untouched. Then the M_CAN starts to read in 32bit
>>> words from the start of the Tx Message element. So it is very likely
>>> to trigger the message RAM error when reading the uninitialized
>>> 32bit word from the last four bytes.
>>>
>>> Finally it turns out that an initial writing (with any kind of data)
>>> to the entire message RAM is mandatory to create valid parity/ECC
>>> checksums.
>>>
>>> That's it.
>>>
>>
>> Thanks for sharing this information.
>> Does it mean this issue is related to the nature of Message RAM and is
>> supposed to exist on all M_CAN IP versions?
> 
> From what I know from the 3.1.x revision there's no change regarding
> IR.BRU and IR.BEC - so I would assume this to stay on all M_CAN IP
> revisions.
> 
> But after some sleep I wonder if this patch [3/3] would need an update too.
> 
> Writing to the TX message RAM is obviously no workaround but a valid and
> needed initialization process.
> 
> I would tend to make this patch:
> 
> ---
> 
> can: m_can: add missing TX message RAM initialization
> 
> The M_CAN message RAM is usually equipped with a parity or ECC
> functionality.
> But RAM cells suffer a hardware reset and can therefore hold arbitrary
> content at startup - including parity and/or ECC bits.
> 
> To prevent the M_CAN controller detecting checksum errors when reading
> potentially uninitialized TX message RAM content to transmit CAN frames
> the TX message RAM has to be written with (any kind of) initial data.
> 
> ---
> 
> Then the code should memset() the entire TX FIFO element - and not only
> the 8 data bytes we are addressing now.

No literal memset() as this is iomem

Marc

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* Re: "asix: Don't reset PHY on if_up for ASIX 88772" breaks net on arndale platform
From: Riku Voipio @ 2014-11-06  9:06 UTC (permalink / raw)
  To: Charles Keepax
  Cc: Stam, Michel [FINT], freddy, davem, linux-usb, netdev,
	linux-kernel, linux-samsung-soc
In-Reply-To: <20141105150258.GR23178@opensource.wolfsonmicro.com>

On Wed, Nov 05, 2014 at 03:02:58PM +0000, Charles Keepax wrote:
> On Wed, Nov 05, 2014 at 01:04:37PM +0100, Stam, Michel [FINT] wrote:
> > Hello Charles,
> > 
> > After looking around I found the reset value for the 8772 chip, which
> > seems to be 0x1E1 (ANAR register).
> > 
> > This equates to (according to include/uapi/linux/mii.h)
> > ADVERTISE_ALL | ADVERTISE_CSMA.
> > 
> > The register only seems to become 0 if the software reset fails.
 
> Odd it definitely reads back as zero on Arndale. I am guessing
> that the root of the problem here is that for some reason Arndale
> POR of the ethernet is pants and it needs a full software reset
> before it will work and the patch removes the full reset
> callback.

The asix on arndale comes semi-configured from u-boot, which I guess is
not the state kernel expects it to come in. At least in my case where
I use tftp from u-boot to load my kernel.

So probably the full reset is needed here to make the asix chip come
to a truly pristine state.

The commit that Michel partially reverted (by returning to use
ax88772_link_reset instead of ax88772_reset), indicates that a strong reset
is needed for suspend/resume as well:

commit 4ad1438f025ed8d1e4e95a796ca7f0ad5a22c378
Author: Grant Grundler <grundler@chromium.org>
Date:   Tue Oct 4 09:55:16 2011 +0000

    NET: fix phy init for AX88772 USB ethernet
        
    Fix phy initialization for AX88772 (USB 2.0 100BT). Failure
    was occasionally DHCP wouldn't work after reboot or
    suspend/resume cycle.
 
> > Unfortunately, this is exactly what I get when the patch is applied;
> > asix 1-2:1.0 eth1: Failed to send software reset: ffffffb5
> > asix 1-2:1.0 eth1: link reset failed (-75) usbnet usb-0000:00:1d.0-2, 
> > ASIX AX88772 USB 2.0 Ethernet
> > asix 1-2:1.0 eth1: Failed to send software reset: ffffffb5
> > asix 1-2:1.0 eth1: link reset failed (-75) usbnet usb-0000:00:1d.0-2, 
> > ASIX AX88772 USB 2.0 Ethernet
> 
> Ok so I am guessing you have a value in the register which is
> neither the reset value or 0 and this causing problems later in
> the reset/on the next reset. I do find the naming confusing in
> the error message there as it says link reset failed but the
> link_reset callback can't fail in the driver and I modified the
> reset callback. But I guess that is just oddities of the network
> stack I am not familiar with.
> 
> The other thing that feels odd is (and again apologies as I know
> next to nothing about the networking stack) how come it is
> unexpected that the reset callback destroys the state of the
> device. Naively I would have expected that a reset callback would
> reset the device back to its default state. Here we seem to be
> trying to avoid that happening.

Indeed, it would seems some tracing would be neede to figure out in
which order the .reset and .link_reset callbacks are being called.

^ permalink raw reply

* Re: bifurcated driver
From: Nicolas Dichtel @ 2014-11-06  9:10 UTC (permalink / raw)
  To: Alex Markuze, Zhou, Danny
  Cc: dev-VfR2kkLFssw@public.gmane.org, Fastabend, John R, Or Gerlitz,
	netdev
In-Reply-To: <CAKfHP0VaCW_zBb9-uJYwwDQ-+sz-DZ=b6hcWn0HfMmMzhiOfUA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Also CC netdev, this thread may interest network folks.

Le 06/11/2014 09:13, Alex Markuze a écrit :
> Danny sums up the issue perfectly IMHO.
> While both verbs and DPDK aim to provide generic user space networking, the
> similarities end there.
> verbs and RDMA HW are closely coupled and behave differently then standard
> eth nics and are not related to netdev mechanisms.
>
> Or, welcome to this discussion.
>
> Those interested can read the IB spec's (+1K pages) available from
> openfabrics*.
> *https://www.openfabrics.org/index.php
>
>
>
>
> On Thu, Nov 6, 2014 at 6:45 AM, Zhou, Danny <danny.zhou-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
>
>> I roughly read libibverbs related code and relevant infiniband/rdma
>> documents, and found though
>> many concepts in libibverbs looks similar to bifurcated driver, but there
>> are still lots of differences as
>> illustrated below based on my understanding:
>>
>> 1) Queue pair defined in RDMA specification are abstract concept, where
>> the queue pairs term used in
>>    bifurcated driver are rx/tx queue pairs in the NIC.
>> 2) Bifurcated PMD in DPDK directly access NIC resources as a slave driver
>> (no NIC control), while libibverbs
>>    as a user space library rather than driver offloads certain operations
>> to kernel driver and NIC by invoking
>>    "verbs" APIs.
>> 3) Libibverbs invokes infiniband specific system calls to allow
>> user/kernel space communication based on
>>    "verbs" defined in infiniband/RDMA spec, while bifurcated driver build
>> on top of af_packet module
>>    and new socket options to do things like hw queue split-off , map
>> certain pages on I/O space to user space
>>    operations, etc.
>> 4) There is a specific embedded MMU unit in Infiniband/RDMA to provides
>> memory protection, while
>>    bifurcated driver uses IOMMU rather than NIC to provide memory
>> protection.
>>
>> IMHO, libibverbs and corresponding kernel modules/drivers are specifically
>> designed and implemented for
>> direct access to RDMA hardware from userspace, and it highly depends on
>> "verbs" related system calls
>> supported by infiniband/rdma mechanism in kernel, rather than netdev
>> mechanism that bifurcated driver
>> solution depends on.
>>
>>> -----Original Message-----
>>> From: Vincent JARDIN [mailto:vincent.jardin-pdR9zngts4EAvxtiuMwx3w@public.gmane.org]
>>> Sent: Thursday, November 06, 2014 9:31 AM
>>> To: Zhou, Danny
>>> Cc: Thomas Monjalon; dev-VfR2kkLFssw@public.gmane.org; Fastabend, John R; Or Gerlitz
>>> Subject: Re: [dpdk-dev] bifurcated driver
>>>
>>> +Or
>>>
>>> On 05/11/2014 23:48, Zhou, Danny wrote:
>>>> Hi Thomas,
>>>>
>>>> Thanks for sharing the links to ibverbs, I will take a close look at
>> it and compare it to bifurcated driver. My take
>>>> after a rough review is that idea is very much similar, but bifurcated
>> driver implementation is generic for any
>>>> Ethernet device based on existing af_packet mechanism, with extension
>> of exchanging the messages between
>>>> user space and kernel space driver.
>>>>
>>>> I have an internal document to summary the pros and cons of below
>> solutions, except for ibvers, but
>>>> will be adding it shortly.
>>>>
>>>> - igb_uio
>>>> - uio_pci_generic
>>>> - VFIO
>>>> - bifurcated driver
>>>>
>>>> Short answers to your questions:
>>>>>     - upstream status
>>>> Adding IOMMU based memory protection and generic descriptor
>> description support now, into version 2
>>>> kernel patches.
>>>>
>>>>>     - usable with kernel netdev
>>>> af_packet based, and relevant patchset will be submitted to netdev for
>> sure.
>>>>
>>>>>     - usable in a vm
>>>> No, it does no coexist with SRIOV for number of reasons. but if you
>> pass-through a PF to a VM, it works perfect.
>>>>
>>>>>     - usable for Ethernet
>>>> It could work with all Ethernet NICs, as flow director is available
>> and NIC driver support new net_ops to split off
>>>> queue pairs for user space.
>>>>
>>>>>     - hardware requirements
>>>> No specific hardware requirements. All mainstream NICs have multiple
>> qpairs and flow director support.
>>>>
>>>>>     - security protection
>>>> Leverage IOMMU to provide memory protection on Intel platform. Other
>> archs provide similar memory protection
>>>> mechanism, so we only use arch-agnostic DMA memory allocation APIs in
>> kernel to support memory protection.
>>>>
>>>>>     - performance
>>>> DPDK native performance on user space queues, as long as drop_en is
>> enabled to avoid head-of-line blocking.
>>>>
>>>> -Danny
>>>>
>>>>> -----Original Message-----
>>>>> From: Thomas Monjalon [mailto:thomas.monjalon-pdR9zngts4EAvxtiuMwx3w@public.gmane.org]
>>>>> Sent: Wednesday, November 05, 2014 9:01 PM
>>>>> To: Zhou, Danny
>>>>> Cc: dev-VfR2kkLFssw@public.gmane.org; Fastabend, John R
>>>>> Subject: Re: [dpdk-dev] bifurcated driver
>>>>>
>>>>> Hi Danny,
>>>>>
>>>>> 2014-10-31 17:36, O'driscoll, Tim:
>>>>>> Bifurcated Driver (Danny.Zhou-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org)
>>>>>
>>>>> Thanks for the presentation of bifurcated driver during the community
>> call.
>>>>> I asked if you looked at ibverbs and you wanted a link to check.
>>>>> The kernel module is here:
>>>>>
>> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/infiniband/core
>>>>> The userspace library:
>>>>>     http://git.kernel.org/cgit/libs/infiniband/libibverbs.git
>>>>>
>>>>> Extract from Kconfig:
>>>>> "
>>>>> config INFINIBAND_USER_ACCESS
>>>>>     tristate "InfiniBand userspace access (verbs and CM)"
>>>>>     select ANON_INODES
>>>>>     ---help---
>>>>>       Userspace InfiniBand access support.  This enables the
>>>>>       kernel side of userspace verbs and the userspace
>>>>>       communication manager (CM).  This allows userspace processes
>>>>>       to set up connections and directly access InfiniBand
>>>>>       hardware for fast-path operations.  You will also need
>>>>>       libibverbs, libibcm and a hardware driver library from
>>>>>       <http://www.openfabrics.org/git/>.
>>>>> "
>>>>>
>>>>> It seems to be close to the bifurcated driver needs.
>>>>> Not sure if it can solve the security issues if there is no dedicated
>> MMU
>>>>> in the NIC.
>>>>>
>>>>> I feel we should sum up pros and cons of
>>>>>     - igb_uio
>>>>>     - uio_pci_generic
>>>>>     - VFIO
>>>>>     - ibverbs
>>>>>     - bifurcated driver
>>>>> I suggest to consider these criterias:
>>>>>     - upstream status
>>>>>     - usable with kernel netdev
>>>>>     - usable in a vm
>>>>>     - usable for ethernet
>>>>>     - hardware requirements
>>>>>     - security protection
>>>>>     - performance
>>>>>
>>>>> --
>>>>> Thomas
>>
>>

^ permalink raw reply

* [GIT net-next v2] Open vSwitch
From: Pravin B Shelar @ 2014-11-06  9:18 UTC (permalink / raw)
  To: davem; +Cc: netdev

First two patches are related to OVS MPLS support. Rest of patches
are mostly refactoring and minor improvements to openvswitch.

v1-v2:
 - Fix conflicts due to "gue: Remote checksum offload"
----------------------------------------------------------------

The following changes since commit e1b2cb655060e081e73b384b1fc8b2e978f73467:

  fou: Fix typo in returning flags in netlink (2014-11-05 22:18:20 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/pshelar/openvswitch.git net_next_ovs

for you to fetch changes up to a85311bf1f9f8185682990cafdd4e0572c0ed373:

  openvswitch: Avoid NULL mask check while building mask (2014-11-05 23:52:35 -0800)

----------------------------------------------------------------
Andy Zhou (2):
      openvswitch: refactor do_output() to move NULL check out of fast path
      openvswitch: Refactor get_dp() function into multiple access APIs.

Chunhe Li (1):
      openvswitch: Drop packets when interdev is not up

Jarno Rajahalme (1):
      openvswitch: Fix the type of struct ovs_key_nd nd_target field.

Jesse Gross (1):
      openvswitch: Additional logging for -EINVAL on flow setups.

Joe Stringer (3):
      openvswitch: Remove redundant tcp_flags code.
      openvswitch: Refactor ovs_flow_cmd_fill_info().
      openvswitch: Move key_attr_size() to flow_netlink.h.

Lorand Jakab (1):
      openvswitch: Remove flow member from struct ovs_skb_cb

Pravin B Shelar (4):
      net: Remove MPLS GSO feature.
      openvswitch: Move table destroy to dp-rcu callback.
      openvswitch: Refactor action alloc and copy api.
      openvswitch: Avoid NULL mask check while building mask

Simon Horman (1):
      openvswitch: Add basic MPLS support to kernel

 include/linux/netdev_features.h      |   5 +-
 include/linux/netdevice.h            |   1 -
 include/linux/skbuff.h               |   4 +-
 include/net/mpls.h                   |  39 +++++
 include/uapi/linux/openvswitch.h     |  38 ++++-
 net/core/dev.c                       |   3 +-
 net/core/ethtool.c                   |   1 -
 net/ipv4/af_inet.c                   |   1 -
 net/ipv4/tcp_offload.c               |   1 -
 net/ipv4/udp_offload.c               |   3 +-
 net/ipv6/ip6_offload.c               |   1 -
 net/ipv6/udp_offload.c               |   3 +-
 net/mpls/mpls_gso.c                  |   3 +-
 net/openvswitch/Kconfig              |   1 +
 net/openvswitch/actions.c            | 136 ++++++++++++---
 net/openvswitch/datapath.c           | 215 ++++++++++++-----------
 net/openvswitch/datapath.h           |   4 +-
 net/openvswitch/flow.c               |  30 ++++
 net/openvswitch/flow.h               |  17 +-
 net/openvswitch/flow_netlink.c       | 322 +++++++++++++++++++++++++----------
 net/openvswitch/flow_netlink.h       |   5 +-
 net/openvswitch/flow_table.c         |  11 +-
 net/openvswitch/flow_table.h         |   2 +-
 net/openvswitch/vport-internal_dev.c |   5 +
 24 files changed, 606 insertions(+), 245 deletions(-)
 create mode 100644 include/net/mpls.h

^ permalink raw reply

* [PATCH net-next v2 01/14] net: Remove MPLS GSO feature.
From: Pravin B Shelar @ 2014-11-06  9:18 UTC (permalink / raw)
  To: davem; +Cc: netdev, Pravin B Shelar

Device can export MPLS GSO support in dev->mpls_features same way
it export vlan features in dev->vlan_features. So it is safe to
remove NETIF_F_GSO_MPLS redundant flag.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
---
v1-v2:
Fixed conflicts for latest net-next.
---
 include/linux/netdev_features.h | 5 +----
 include/linux/netdevice.h       | 1 -
 include/linux/skbuff.h          | 4 +---
 net/core/ethtool.c              | 1 -
 net/ipv4/af_inet.c              | 1 -
 net/ipv4/tcp_offload.c          | 1 -
 net/ipv4/udp_offload.c          | 3 +--
 net/ipv6/ip6_offload.c          | 1 -
 net/ipv6/udp_offload.c          | 3 +--
 net/mpls/mpls_gso.c             | 3 +--
 10 files changed, 5 insertions(+), 18 deletions(-)

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index 8c94b07..8e30685 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -47,7 +47,6 @@ enum {
 	NETIF_F_GSO_SIT_BIT,		/* ... SIT tunnel with TSO */
 	NETIF_F_GSO_UDP_TUNNEL_BIT,	/* ... UDP TUNNEL with TSO */
 	NETIF_F_GSO_UDP_TUNNEL_CSUM_BIT,/* ... UDP TUNNEL with TSO & CSUM */
-	NETIF_F_GSO_MPLS_BIT,		/* ... MPLS segmentation */
 	NETIF_F_GSO_TUNNEL_REMCSUM_BIT, /* ... TUNNEL with TSO & REMCSUM */
 	/**/NETIF_F_GSO_LAST =		/* last bit, see GSO_MASK */
 		NETIF_F_GSO_TUNNEL_REMCSUM_BIT,
@@ -119,7 +118,6 @@ enum {
 #define NETIF_F_GSO_SIT		__NETIF_F(GSO_SIT)
 #define NETIF_F_GSO_UDP_TUNNEL	__NETIF_F(GSO_UDP_TUNNEL)
 #define NETIF_F_GSO_UDP_TUNNEL_CSUM __NETIF_F(GSO_UDP_TUNNEL_CSUM)
-#define NETIF_F_GSO_MPLS	__NETIF_F(GSO_MPLS)
 #define NETIF_F_GSO_TUNNEL_REMCSUM __NETIF_F(GSO_TUNNEL_REMCSUM)
 #define NETIF_F_HW_VLAN_STAG_FILTER __NETIF_F(HW_VLAN_STAG_FILTER)
 #define NETIF_F_HW_VLAN_STAG_RX	__NETIF_F(HW_VLAN_STAG_RX)
@@ -183,7 +181,6 @@ enum {
 				 NETIF_F_GSO_IPIP |			\
 				 NETIF_F_GSO_SIT |			\
 				 NETIF_F_GSO_UDP_TUNNEL |		\
-				 NETIF_F_GSO_UDP_TUNNEL_CSUM |		\
-				 NETIF_F_GSO_MPLS)
+				 NETIF_F_GSO_UDP_TUNNEL_CSUM)
 
 #endif	/* _LINUX_NETDEV_FEATURES_H */
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 4767f54..90ac959 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3583,7 +3583,6 @@ static inline bool net_gso_ok(netdev_features_t features, int gso_type)
 	BUILD_BUG_ON(SKB_GSO_SIT     != (NETIF_F_GSO_SIT >> NETIF_F_GSO_SHIFT));
 	BUILD_BUG_ON(SKB_GSO_UDP_TUNNEL != (NETIF_F_GSO_UDP_TUNNEL >> NETIF_F_GSO_SHIFT));
 	BUILD_BUG_ON(SKB_GSO_UDP_TUNNEL_CSUM != (NETIF_F_GSO_UDP_TUNNEL_CSUM >> NETIF_F_GSO_SHIFT));
-	BUILD_BUG_ON(SKB_GSO_MPLS    != (NETIF_F_GSO_MPLS >> NETIF_F_GSO_SHIFT));
 	BUILD_BUG_ON(SKB_GSO_TUNNEL_REMCSUM != (NETIF_F_GSO_TUNNEL_REMCSUM >> NETIF_F_GSO_SHIFT));
 
 	return (features & feature) == feature;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 39ec753..53f4f6c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -372,9 +372,7 @@ enum {
 
 	SKB_GSO_UDP_TUNNEL_CSUM = 1 << 11,
 
-	SKB_GSO_MPLS = 1 << 12,
-
-	SKB_GSO_TUNNEL_REMCSUM = 1 << 13,
+	SKB_GSO_TUNNEL_REMCSUM = 1 << 12,
 };
 
 #if BITS_PER_LONG > 32
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 06dfb29..b0f84f5 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -84,7 +84,6 @@ static const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN]
 	[NETIF_F_GSO_IPIP_BIT] =	 "tx-ipip-segmentation",
 	[NETIF_F_GSO_SIT_BIT] =		 "tx-sit-segmentation",
 	[NETIF_F_GSO_UDP_TUNNEL_BIT] =	 "tx-udp_tnl-segmentation",
-	[NETIF_F_GSO_MPLS_BIT] =	 "tx-mpls-segmentation",
 
 	[NETIF_F_FCOE_CRC_BIT] =         "tx-checksum-fcoe-crc",
 	[NETIF_F_SCTP_CSUM_BIT] =        "tx-checksum-sctp",
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index ed2c672..3a096bb 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1223,7 +1223,6 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
 		       SKB_GSO_UDP_TUNNEL |
 		       SKB_GSO_UDP_TUNNEL_CSUM |
 		       SKB_GSO_TUNNEL_REMCSUM |
-		       SKB_GSO_MPLS |
 		       0)))
 		goto out;
 
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index a1b2a56..9d7930b 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -94,7 +94,6 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
 			       SKB_GSO_GRE_CSUM |
 			       SKB_GSO_IPIP |
 			       SKB_GSO_SIT |
-			       SKB_GSO_MPLS |
 			       SKB_GSO_UDP_TUNNEL |
 			       SKB_GSO_UDP_TUNNEL_CSUM |
 			       SKB_GSO_TUNNEL_REMCSUM |
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 0a5a70d..d3e537e 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -207,8 +207,7 @@ static struct sk_buff *udp4_ufo_fragment(struct sk_buff *skb,
 				      SKB_GSO_UDP_TUNNEL_CSUM |
 				      SKB_GSO_TUNNEL_REMCSUM |
 				      SKB_GSO_IPIP |
-				      SKB_GSO_GRE | SKB_GSO_GRE_CSUM |
-				      SKB_GSO_MPLS) ||
+				      SKB_GSO_GRE | SKB_GSO_GRE_CSUM) ||
 			     !(type & (SKB_GSO_UDP))))
 			goto out;
 
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index e976707..fd76ce9 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -79,7 +79,6 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff *skb,
 		       SKB_GSO_UDP_TUNNEL |
 		       SKB_GSO_UDP_TUNNEL_CSUM |
 		       SKB_GSO_TUNNEL_REMCSUM |
-		       SKB_GSO_MPLS |
 		       SKB_GSO_TCPV6 |
 		       0)))
 		goto out;
diff --git a/net/ipv6/udp_offload.c b/net/ipv6/udp_offload.c
index 637ba2e..b6aa8ed 100644
--- a/net/ipv6/udp_offload.c
+++ b/net/ipv6/udp_offload.c
@@ -46,8 +46,7 @@ static struct sk_buff *udp6_ufo_fragment(struct sk_buff *skb,
 				      SKB_GSO_GRE |
 				      SKB_GSO_GRE_CSUM |
 				      SKB_GSO_IPIP |
-				      SKB_GSO_SIT |
-				      SKB_GSO_MPLS) ||
+				      SKB_GSO_SIT) ||
 			     !(type & (SKB_GSO_UDP))))
 			goto out;
 
diff --git a/net/mpls/mpls_gso.c b/net/mpls/mpls_gso.c
index e3545f2..ca27837 100644
--- a/net/mpls/mpls_gso.c
+++ b/net/mpls/mpls_gso.c
@@ -34,8 +34,7 @@ static struct sk_buff *mpls_gso_segment(struct sk_buff *skb,
 				  SKB_GSO_TCP_ECN |
 				  SKB_GSO_GRE |
 				  SKB_GSO_GRE_CSUM |
-				  SKB_GSO_IPIP |
-				  SKB_GSO_MPLS)))
+				  SKB_GSO_IPIP)))
 		goto out;
 
 	/* Setup inner SKB. */
-- 
1.9.3

^ permalink raw reply related

* [PATCH net-next v2 02/14] openvswitch: Add basic MPLS support to kernel
From: Pravin B Shelar @ 2014-11-06  9:18 UTC (permalink / raw)
  To: davem
  Cc: netdev, Simon Horman, Ravi K, Leo Alterman, Isaku Yamahata,
	Joe Stringer, Jesse Gross, Pravin B Shelar

From: Simon Horman <horms@verge.net.au>

Allow datapath to recognize and extract MPLS labels into flow keys
and execute actions which push, pop, and set labels on packets.

Based heavily on work by Leo Alterman, Ravi K, Isaku Yamahata and Joe Stringer.

Cc: Ravi K <rkerur@gmail.com>
Cc: Leo Alterman <lalterman@nicira.com>
Cc: Isaku Yamahata <yamahata@valinux.co.jp>
Cc: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
---
 include/net/mpls.h               |  39 +++++++++++
 include/uapi/linux/openvswitch.h |  32 +++++++++
 net/core/dev.c                   |   3 +-
 net/openvswitch/Kconfig          |   1 +
 net/openvswitch/actions.c        | 106 ++++++++++++++++++++++++++++-
 net/openvswitch/datapath.c       |   6 +-
 net/openvswitch/flow.c           |  30 +++++++++
 net/openvswitch/flow.h           |  17 +++--
 net/openvswitch/flow_netlink.c   | 139 ++++++++++++++++++++++++++++++++++-----
 net/openvswitch/flow_netlink.h   |   2 +-
 10 files changed, 345 insertions(+), 30 deletions(-)
 create mode 100644 include/net/mpls.h

diff --git a/include/net/mpls.h b/include/net/mpls.h
new file mode 100644
index 0000000..5b3b5ad
--- /dev/null
+++ b/include/net/mpls.h
@@ -0,0 +1,39 @@
+/*
+ * Copyright (c) 2014 Nicira, Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#ifndef _NET_MPLS_H
+#define _NET_MPLS_H 1
+
+#include <linux/if_ether.h>
+#include <linux/netdevice.h>
+
+#define MPLS_HLEN 4
+
+static inline bool eth_p_mpls(__be16 eth_type)
+{
+	return eth_type == htons(ETH_P_MPLS_UC) ||
+		eth_type == htons(ETH_P_MPLS_MC);
+}
+
+/*
+ * For non-MPLS skbs this will correspond to the network header.
+ * For MPLS skbs it will be before the network_header as the MPLS
+ * label stack lies between the end of the mac header and the network
+ * header. That is, for MPLS skbs the end of the mac header
+ * is the top of the MPLS label stack.
+ */
+static inline unsigned char *skb_mpls_header(struct sk_buff *skb)
+{
+	return skb_mac_header(skb) + skb->mac_len;
+}
+#endif
diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 435eabc..631056b 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -293,6 +293,9 @@ enum ovs_key_attr {
 	OVS_KEY_ATTR_DP_HASH,      /* u32 hash value. Value 0 indicates the hash
 				   is not computed by the datapath. */
 	OVS_KEY_ATTR_RECIRC_ID, /* u32 recirc id */
+	OVS_KEY_ATTR_MPLS,      /* array of struct ovs_key_mpls.
+				 * The implementation may restrict
+				 * the accepted length of the array. */
 
 #ifdef __KERNEL__
 	OVS_KEY_ATTR_TUNNEL_INFO,  /* struct ovs_tunnel_info */
@@ -340,6 +343,10 @@ struct ovs_key_ethernet {
 	__u8	 eth_dst[ETH_ALEN];
 };
 
+struct ovs_key_mpls {
+	__be32 mpls_lse;
+};
+
 struct ovs_key_ipv4 {
 	__be32 ipv4_src;
 	__be32 ipv4_dst;
@@ -484,6 +491,19 @@ enum ovs_userspace_attr {
 #define OVS_USERSPACE_ATTR_MAX (__OVS_USERSPACE_ATTR_MAX - 1)
 
 /**
+ * struct ovs_action_push_mpls - %OVS_ACTION_ATTR_PUSH_MPLS action argument.
+ * @mpls_lse: MPLS label stack entry to push.
+ * @mpls_ethertype: Ethertype to set in the encapsulating ethernet frame.
+ *
+ * The only values @mpls_ethertype should ever be given are %ETH_P_MPLS_UC and
+ * %ETH_P_MPLS_MC, indicating MPLS unicast or multicast. Other are rejected.
+ */
+struct ovs_action_push_mpls {
+	__be32 mpls_lse;
+	__be16 mpls_ethertype; /* Either %ETH_P_MPLS_UC or %ETH_P_MPLS_MC */
+};
+
+/**
  * struct ovs_action_push_vlan - %OVS_ACTION_ATTR_PUSH_VLAN action argument.
  * @vlan_tpid: Tag protocol identifier (TPID) to push.
  * @vlan_tci: Tag control identifier (TCI) to push.  The CFI bit must be set
@@ -534,6 +554,15 @@ struct ovs_action_hash {
  * @OVS_ACTION_ATTR_POP_VLAN: Pop the outermost 802.1Q header off the packet.
  * @OVS_ACTION_ATTR_SAMPLE: Probabilitically executes actions, as specified in
  * the nested %OVS_SAMPLE_ATTR_* attributes.
+ * @OVS_ACTION_ATTR_PUSH_MPLS: Push a new MPLS label stack entry onto the
+ * top of the packets MPLS label stack.  Set the ethertype of the
+ * encapsulating frame to either %ETH_P_MPLS_UC or %ETH_P_MPLS_MC to
+ * indicate the new packet contents.
+ * @OVS_ACTION_ATTR_POP_MPLS: Pop an MPLS label stack entry off of the
+ * packet's MPLS label stack.  Set the encapsulating frame's ethertype to
+ * indicate the new packet contents. This could potentially still be
+ * %ETH_P_MPLS if the resulting MPLS label stack is not empty.  If there
+ * is no MPLS label stack, as determined by ethertype, no action is taken.
  *
  * Only a single header can be set with a single %OVS_ACTION_ATTR_SET.  Not all
  * fields within a header are modifiable, e.g. the IPv4 protocol and fragment
@@ -550,6 +579,9 @@ enum ovs_action_attr {
 	OVS_ACTION_ATTR_SAMPLE,       /* Nested OVS_SAMPLE_ATTR_*. */
 	OVS_ACTION_ATTR_RECIRC,       /* u32 recirc_id. */
 	OVS_ACTION_ATTR_HASH,	      /* struct ovs_action_hash. */
+	OVS_ACTION_ATTR_PUSH_MPLS,    /* struct ovs_action_push_mpls. */
+	OVS_ACTION_ATTR_POP_MPLS,     /* __be16 ethertype. */
+
 	__OVS_ACTION_ATTR_MAX
 };
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 40be481..70bb609 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -118,6 +118,7 @@
 #include <linux/if_vlan.h>
 #include <linux/ip.h>
 #include <net/ip.h>
+#include <net/mpls.h>
 #include <linux/ipv6.h>
 #include <linux/in.h>
 #include <linux/jhash.h>
@@ -2530,7 +2531,7 @@ static netdev_features_t net_mpls_features(struct sk_buff *skb,
 					   netdev_features_t features,
 					   __be16 type)
 {
-	if (type == htons(ETH_P_MPLS_UC) || type == htons(ETH_P_MPLS_MC))
+	if (eth_p_mpls(type))
 		features &= skb->dev->mpls_features;
 
 	return features;
diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig
index 2a9673e..454ce12 100644
--- a/net/openvswitch/Kconfig
+++ b/net/openvswitch/Kconfig
@@ -30,6 +30,7 @@ config OPENVSWITCH
 
 config OPENVSWITCH_GRE
 	tristate "Open vSwitch GRE tunneling support"
+	select NET_MPLS_GSO
 	depends on INET
 	depends on OPENVSWITCH
 	depends on NET_IPGRE_DEMUX
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 922c133..930b1b6 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -28,10 +28,12 @@
 #include <linux/in6.h>
 #include <linux/if_arp.h>
 #include <linux/if_vlan.h>
+
 #include <net/ip.h>
 #include <net/ipv6.h>
 #include <net/checksum.h>
 #include <net/dsfield.h>
+#include <net/mpls.h>
 #include <net/sctp/checksum.h>
 
 #include "datapath.h"
@@ -118,6 +120,92 @@ static int make_writable(struct sk_buff *skb, int write_len)
 	return pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
 }
 
+static int push_mpls(struct sk_buff *skb,
+		     const struct ovs_action_push_mpls *mpls)
+{
+	__be32 *new_mpls_lse;
+	struct ethhdr *hdr;
+
+	/* Networking stack do not allow simultaneous Tunnel and MPLS GSO. */
+	if (skb->encapsulation)
+		return -ENOTSUPP;
+
+	if (skb_cow_head(skb, MPLS_HLEN) < 0)
+		return -ENOMEM;
+
+	skb_push(skb, MPLS_HLEN);
+	memmove(skb_mac_header(skb) - MPLS_HLEN, skb_mac_header(skb),
+		skb->mac_len);
+	skb_reset_mac_header(skb);
+
+	new_mpls_lse = (__be32 *)skb_mpls_header(skb);
+	*new_mpls_lse = mpls->mpls_lse;
+
+	if (skb->ip_summed == CHECKSUM_COMPLETE)
+		skb->csum = csum_add(skb->csum, csum_partial(new_mpls_lse,
+							     MPLS_HLEN, 0));
+
+	hdr = eth_hdr(skb);
+	hdr->h_proto = mpls->mpls_ethertype;
+
+	skb_set_inner_protocol(skb, skb->protocol);
+	skb->protocol = mpls->mpls_ethertype;
+
+	return 0;
+}
+
+static int pop_mpls(struct sk_buff *skb, const __be16 ethertype)
+{
+	struct ethhdr *hdr;
+	int err;
+
+	err = make_writable(skb, skb->mac_len + MPLS_HLEN);
+	if (unlikely(err))
+		return err;
+
+	if (skb->ip_summed == CHECKSUM_COMPLETE)
+		skb->csum = csum_sub(skb->csum,
+				     csum_partial(skb_mpls_header(skb),
+						  MPLS_HLEN, 0));
+
+	memmove(skb_mac_header(skb) + MPLS_HLEN, skb_mac_header(skb),
+		skb->mac_len);
+
+	__skb_pull(skb, MPLS_HLEN);
+	skb_reset_mac_header(skb);
+
+	/* skb_mpls_header() is used to locate the ethertype
+	 * field correctly in the presence of VLAN tags.
+	 */
+	hdr = (struct ethhdr *)(skb_mpls_header(skb) - ETH_HLEN);
+	hdr->h_proto = ethertype;
+	if (eth_p_mpls(skb->protocol))
+		skb->protocol = ethertype;
+	return 0;
+}
+
+static int set_mpls(struct sk_buff *skb, const __be32 *mpls_lse)
+{
+	__be32 *stack;
+	int err;
+
+	err = make_writable(skb, skb->mac_len + MPLS_HLEN);
+	if (unlikely(err))
+		return err;
+
+	stack = (__be32 *)skb_mpls_header(skb);
+	if (skb->ip_summed == CHECKSUM_COMPLETE) {
+		__be32 diff[] = { ~(*stack), *mpls_lse };
+
+		skb->csum = ~csum_partial((char *)diff, sizeof(diff),
+					  ~skb->csum);
+	}
+
+	*stack = *mpls_lse;
+
+	return 0;
+}
+
 /* remove VLAN header from packet and update csum accordingly. */
 static int __pop_vlan_tci(struct sk_buff *skb, __be16 *current_tci)
 {
@@ -140,10 +228,12 @@ static int __pop_vlan_tci(struct sk_buff *skb, __be16 *current_tci)
 
 	vlan_set_encap_proto(skb, vhdr);
 	skb->mac_header += VLAN_HLEN;
+
 	if (skb_network_offset(skb) < ETH_HLEN)
 		skb_set_network_header(skb, ETH_HLEN);
-	skb_reset_mac_len(skb);
 
+	/* Update mac_len for subsequent MPLS actions */
+	skb_reset_mac_len(skb);
 	return 0;
 }
 
@@ -186,6 +276,8 @@ static int push_vlan(struct sk_buff *skb, const struct ovs_action_push_vlan *vla
 
 		if (!__vlan_put_tag(skb, skb->vlan_proto, current_tag))
 			return -ENOMEM;
+		/* Update mac_len for subsequent MPLS actions */
+		skb->mac_len += VLAN_HLEN;
 
 		if (skb->ip_summed == CHECKSUM_COMPLETE)
 			skb->csum = csum_add(skb->csum, csum_partial(skb->data
@@ -612,6 +704,10 @@ static int execute_set_action(struct sk_buff *skb,
 	case OVS_KEY_ATTR_SCTP:
 		err = set_sctp(skb, nla_data(nested_attr));
 		break;
+
+	case OVS_KEY_ATTR_MPLS:
+		err = set_mpls(skb, nla_data(nested_attr));
+		break;
 	}
 
 	return err;
@@ -690,6 +786,14 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
 			execute_hash(skb, key, a);
 			break;
 
+		case OVS_ACTION_ATTR_PUSH_MPLS:
+			err = push_mpls(skb, nla_data(a));
+			break;
+
+		case OVS_ACTION_ATTR_POP_MPLS:
+			err = pop_mpls(skb, nla_get_be16(a));
+			break;
+
 		case OVS_ACTION_ATTR_PUSH_VLAN:
 			err = push_vlan(skb, nla_data(a));
 			if (unlikely(err)) /* skb already freed. */
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index f18302f..688cb9b 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -560,7 +560,7 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info)
 		goto err_flow_free;
 
 	err = ovs_nla_copy_actions(a[OVS_PACKET_ATTR_ACTIONS],
-				   &flow->key, 0, &acts);
+				   &flow->key, &acts);
 	if (err)
 		goto err_flow_free;
 
@@ -846,7 +846,7 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
 		goto err_kfree_flow;
 
 	error = ovs_nla_copy_actions(a[OVS_FLOW_ATTR_ACTIONS], &new_flow->key,
-				     0, &acts);
+				     &acts);
 	if (error) {
 		OVS_NLERR("Flow actions may not be safe on all matching packets.\n");
 		goto err_kfree_acts;
@@ -953,7 +953,7 @@ static struct sw_flow_actions *get_flow_actions(const struct nlattr *a,
 		return acts;
 
 	ovs_flow_mask_key(&masked_key, key, mask);
-	error = ovs_nla_copy_actions(a, &masked_key, 0, &acts);
+	error = ovs_nla_copy_actions(a, &masked_key, &acts);
 	if (error) {
 		OVS_NLERR("Flow actions may not be safe on all matching packets.\n");
 		kfree(acts);
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 2b78789..90a2101 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -32,6 +32,7 @@
 #include <linux/if_arp.h>
 #include <linux/ip.h>
 #include <linux/ipv6.h>
+#include <linux/mpls.h>
 #include <linux/sctp.h>
 #include <linux/smp.h>
 #include <linux/tcp.h>
@@ -42,6 +43,7 @@
 #include <net/ip.h>
 #include <net/ip_tunnels.h>
 #include <net/ipv6.h>
+#include <net/mpls.h>
 #include <net/ndisc.h>
 
 #include "datapath.h"
@@ -480,6 +482,7 @@ static int key_extract(struct sk_buff *skb, struct sw_flow_key *key)
 		return -ENOMEM;
 
 	skb_reset_network_header(skb);
+	skb_reset_mac_len(skb);
 	__skb_push(skb, skb->data - skb_mac_header(skb));
 
 	/* Network layer. */
@@ -584,6 +587,33 @@ static int key_extract(struct sk_buff *skb, struct sw_flow_key *key)
 			memset(&key->ip, 0, sizeof(key->ip));
 			memset(&key->ipv4, 0, sizeof(key->ipv4));
 		}
+	} else if (eth_p_mpls(key->eth.type)) {
+		size_t stack_len = MPLS_HLEN;
+
+		/* In the presence of an MPLS label stack the end of the L2
+		 * header and the beginning of the L3 header differ.
+		 *
+		 * Advance network_header to the beginning of the L3
+		 * header. mac_len corresponds to the end of the L2 header.
+		 */
+		while (1) {
+			__be32 lse;
+
+			error = check_header(skb, skb->mac_len + stack_len);
+			if (unlikely(error))
+				return 0;
+
+			memcpy(&lse, skb_network_header(skb), MPLS_HLEN);
+
+			if (stack_len == MPLS_HLEN)
+				memcpy(&key->mpls.top_lse, &lse, MPLS_HLEN);
+
+			skb_set_network_header(skb, skb->mac_len + stack_len);
+			if (lse & htonl(MPLS_LS_S_MASK))
+				break;
+
+			stack_len += MPLS_HLEN;
+		}
 	} else if (key->eth.type == htons(ETH_P_IPV6)) {
 		int nh_len;             /* IPv6 Header + Extensions */
 
diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index 7181331..4962bee 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -102,12 +102,17 @@ struct sw_flow_key {
 		__be16 tci;		/* 0 if no VLAN, VLAN_TAG_PRESENT set otherwise. */
 		__be16 type;		/* Ethernet frame type. */
 	} eth;
-	struct {
-		u8     proto;		/* IP protocol or lower 8 bits of ARP opcode. */
-		u8     tos;		/* IP ToS. */
-		u8     ttl;		/* IP TTL/hop limit. */
-		u8     frag;		/* One of OVS_FRAG_TYPE_*. */
-	} ip;
+	union {
+		struct {
+			__be32 top_lse;	/* top label stack entry */
+		} mpls;
+		struct {
+			u8     proto;	/* IP protocol or lower 8 bits of ARP opcode. */
+			u8     tos;	    /* IP ToS. */
+			u8     ttl;	    /* IP TTL/hop limit. */
+			u8     frag;	/* One of OVS_FRAG_TYPE_*. */
+		} ip;
+	};
 	struct {
 		__be16 src;		/* TCP/UDP/SCTP source port. */
 		__be16 dst;		/* TCP/UDP/SCTP destination port. */
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index 939bcb3..569309c 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -46,6 +46,7 @@
 #include <net/ip.h>
 #include <net/ipv6.h>
 #include <net/ndisc.h>
+#include <net/mpls.h>
 
 #include "flow_netlink.h"
 
@@ -134,7 +135,8 @@ static bool match_validate(const struct sw_flow_match *match,
 			| (1 << OVS_KEY_ATTR_ICMP)
 			| (1 << OVS_KEY_ATTR_ICMPV6)
 			| (1 << OVS_KEY_ATTR_ARP)
-			| (1 << OVS_KEY_ATTR_ND));
+			| (1 << OVS_KEY_ATTR_ND)
+			| (1 << OVS_KEY_ATTR_MPLS));
 
 	/* Always allowed mask fields. */
 	mask_allowed |= ((1 << OVS_KEY_ATTR_TUNNEL)
@@ -149,6 +151,12 @@ static bool match_validate(const struct sw_flow_match *match,
 			mask_allowed |= 1 << OVS_KEY_ATTR_ARP;
 	}
 
+	if (eth_p_mpls(match->key->eth.type)) {
+		key_expected |= 1 << OVS_KEY_ATTR_MPLS;
+		if (match->mask && (match->mask->key.eth.type == htons(0xffff)))
+			mask_allowed |= 1 << OVS_KEY_ATTR_MPLS;
+	}
+
 	if (match->key->eth.type == htons(ETH_P_IP)) {
 		key_expected |= 1 << OVS_KEY_ATTR_IPV4;
 		if (match->mask && (match->mask->key.eth.type == htons(0xffff)))
@@ -266,6 +274,7 @@ static const int ovs_key_lens[OVS_KEY_ATTR_MAX + 1] = {
 	[OVS_KEY_ATTR_RECIRC_ID] = sizeof(u32),
 	[OVS_KEY_ATTR_DP_HASH] = sizeof(u32),
 	[OVS_KEY_ATTR_TUNNEL] = -1,
+	[OVS_KEY_ATTR_MPLS] = sizeof(struct ovs_key_mpls),
 };
 
 static bool is_all_zero(const u8 *fp, size_t size)
@@ -735,6 +744,16 @@ static int ovs_key_from_nlattrs(struct sw_flow_match *match, u64 attrs,
 		attrs &= ~(1 << OVS_KEY_ATTR_ARP);
 	}
 
+	if (attrs & (1 << OVS_KEY_ATTR_MPLS)) {
+		const struct ovs_key_mpls *mpls_key;
+
+		mpls_key = nla_data(a[OVS_KEY_ATTR_MPLS]);
+		SW_FLOW_KEY_PUT(match, mpls.top_lse,
+				mpls_key->mpls_lse, is_mask);
+
+		attrs &= ~(1 << OVS_KEY_ATTR_MPLS);
+	 }
+
 	if (attrs & (1 << OVS_KEY_ATTR_TCP)) {
 		const struct ovs_key_tcp *tcp_key;
 
@@ -1140,6 +1159,14 @@ int ovs_nla_put_flow(const struct sw_flow_key *swkey,
 		arp_key->arp_op = htons(output->ip.proto);
 		ether_addr_copy(arp_key->arp_sha, output->ipv4.arp.sha);
 		ether_addr_copy(arp_key->arp_tha, output->ipv4.arp.tha);
+	} else if (eth_p_mpls(swkey->eth.type)) {
+		struct ovs_key_mpls *mpls_key;
+
+		nla = nla_reserve(skb, OVS_KEY_ATTR_MPLS, sizeof(*mpls_key));
+		if (!nla)
+			goto nla_put_failure;
+		mpls_key = nla_data(nla);
+		mpls_key->mpls_lse = output->mpls.top_lse;
 	}
 
 	if ((swkey->eth.type == htons(ETH_P_IP) ||
@@ -1336,9 +1363,15 @@ static inline void add_nested_action_end(struct sw_flow_actions *sfa,
 	a->nla_len = sfa->actions_len - st_offset;
 }
 
+static int ovs_nla_copy_actions__(const struct nlattr *attr,
+				  const struct sw_flow_key *key,
+				  int depth, struct sw_flow_actions **sfa,
+				  __be16 eth_type, __be16 vlan_tci);
+
 static int validate_and_copy_sample(const struct nlattr *attr,
 				    const struct sw_flow_key *key, int depth,
-				    struct sw_flow_actions **sfa)
+				    struct sw_flow_actions **sfa,
+				    __be16 eth_type, __be16 vlan_tci)
 {
 	const struct nlattr *attrs[OVS_SAMPLE_ATTR_MAX + 1];
 	const struct nlattr *probability, *actions;
@@ -1375,7 +1408,8 @@ static int validate_and_copy_sample(const struct nlattr *attr,
 	if (st_acts < 0)
 		return st_acts;
 
-	err = ovs_nla_copy_actions(actions, key, depth + 1, sfa);
+	err = ovs_nla_copy_actions__(actions, key, depth + 1, sfa,
+				     eth_type, vlan_tci);
 	if (err)
 		return err;
 
@@ -1385,10 +1419,10 @@ static int validate_and_copy_sample(const struct nlattr *attr,
 	return 0;
 }
 
-static int validate_tp_port(const struct sw_flow_key *flow_key)
+static int validate_tp_port(const struct sw_flow_key *flow_key,
+			    __be16 eth_type)
 {
-	if ((flow_key->eth.type == htons(ETH_P_IP) ||
-	     flow_key->eth.type == htons(ETH_P_IPV6)) &&
+	if ((eth_type == htons(ETH_P_IP) || eth_type == htons(ETH_P_IPV6)) &&
 	    (flow_key->tp.src || flow_key->tp.dst))
 		return 0;
 
@@ -1483,7 +1517,7 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
 static int validate_set(const struct nlattr *a,
 			const struct sw_flow_key *flow_key,
 			struct sw_flow_actions **sfa,
-			bool *set_tun)
+			bool *set_tun, __be16 eth_type)
 {
 	const struct nlattr *ovs_key = nla_data(a);
 	int key_type = nla_type(ovs_key);
@@ -1508,6 +1542,9 @@ static int validate_set(const struct nlattr *a,
 		break;
 
 	case OVS_KEY_ATTR_TUNNEL:
+		if (eth_p_mpls(eth_type))
+			return -EINVAL;
+
 		*set_tun = true;
 		err = validate_and_copy_set_tun(a, sfa);
 		if (err)
@@ -1515,7 +1552,7 @@ static int validate_set(const struct nlattr *a,
 		break;
 
 	case OVS_KEY_ATTR_IPV4:
-		if (flow_key->eth.type != htons(ETH_P_IP))
+		if (eth_type != htons(ETH_P_IP))
 			return -EINVAL;
 
 		if (!flow_key->ip.proto)
@@ -1531,7 +1568,7 @@ static int validate_set(const struct nlattr *a,
 		break;
 
 	case OVS_KEY_ATTR_IPV6:
-		if (flow_key->eth.type != htons(ETH_P_IPV6))
+		if (eth_type != htons(ETH_P_IPV6))
 			return -EINVAL;
 
 		if (!flow_key->ip.proto)
@@ -1553,19 +1590,24 @@ static int validate_set(const struct nlattr *a,
 		if (flow_key->ip.proto != IPPROTO_TCP)
 			return -EINVAL;
 
-		return validate_tp_port(flow_key);
+		return validate_tp_port(flow_key, eth_type);
 
 	case OVS_KEY_ATTR_UDP:
 		if (flow_key->ip.proto != IPPROTO_UDP)
 			return -EINVAL;
 
-		return validate_tp_port(flow_key);
+		return validate_tp_port(flow_key, eth_type);
+
+	case OVS_KEY_ATTR_MPLS:
+		if (!eth_p_mpls(eth_type))
+			return -EINVAL;
+		break;
 
 	case OVS_KEY_ATTR_SCTP:
 		if (flow_key->ip.proto != IPPROTO_SCTP)
 			return -EINVAL;
 
-		return validate_tp_port(flow_key);
+		return validate_tp_port(flow_key, eth_type);
 
 	default:
 		return -EINVAL;
@@ -1609,12 +1651,13 @@ static int copy_action(const struct nlattr *from,
 	return 0;
 }
 
-int ovs_nla_copy_actions(const struct nlattr *attr,
-			 const struct sw_flow_key *key,
-			 int depth,
-			 struct sw_flow_actions **sfa)
+static int ovs_nla_copy_actions__(const struct nlattr *attr,
+				  const struct sw_flow_key *key,
+				  int depth, struct sw_flow_actions **sfa,
+				  __be16 eth_type, __be16 vlan_tci)
 {
 	const struct nlattr *a;
+	bool out_tnl_port = false;
 	int rem, err;
 
 	if (depth >= SAMPLE_ACTION_DEPTH)
@@ -1626,6 +1669,8 @@ int ovs_nla_copy_actions(const struct nlattr *attr,
 			[OVS_ACTION_ATTR_OUTPUT] = sizeof(u32),
 			[OVS_ACTION_ATTR_RECIRC] = sizeof(u32),
 			[OVS_ACTION_ATTR_USERSPACE] = (u32)-1,
+			[OVS_ACTION_ATTR_PUSH_MPLS] = sizeof(struct ovs_action_push_mpls),
+			[OVS_ACTION_ATTR_POP_MPLS] = sizeof(__be16),
 			[OVS_ACTION_ATTR_PUSH_VLAN] = sizeof(struct ovs_action_push_vlan),
 			[OVS_ACTION_ATTR_POP_VLAN] = 0,
 			[OVS_ACTION_ATTR_SET] = (u32)-1,
@@ -1655,6 +1700,8 @@ int ovs_nla_copy_actions(const struct nlattr *attr,
 		case OVS_ACTION_ATTR_OUTPUT:
 			if (nla_get_u32(a) >= DP_MAX_PORTS)
 				return -EINVAL;
+			out_tnl_port = false;
+
 			break;
 
 		case OVS_ACTION_ATTR_HASH: {
@@ -1671,6 +1718,7 @@ int ovs_nla_copy_actions(const struct nlattr *attr,
 		}
 
 		case OVS_ACTION_ATTR_POP_VLAN:
+			vlan_tci = htons(0);
 			break;
 
 		case OVS_ACTION_ATTR_PUSH_VLAN:
@@ -1679,19 +1727,66 @@ int ovs_nla_copy_actions(const struct nlattr *attr,
 				return -EINVAL;
 			if (!(vlan->vlan_tci & htons(VLAN_TAG_PRESENT)))
 				return -EINVAL;
+			vlan_tci = vlan->vlan_tci;
 			break;
 
 		case OVS_ACTION_ATTR_RECIRC:
 			break;
 
+		case OVS_ACTION_ATTR_PUSH_MPLS: {
+			const struct ovs_action_push_mpls *mpls = nla_data(a);
+
+			/* Networking stack do not allow simultaneous Tunnel
+			 * and MPLS GSO.
+			 */
+			if (out_tnl_port)
+				return -EINVAL;
+
+			if (!eth_p_mpls(mpls->mpls_ethertype))
+				return -EINVAL;
+			/* Prohibit push MPLS other than to a white list
+			 * for packets that have a known tag order.
+			 */
+			if (vlan_tci & htons(VLAN_TAG_PRESENT) ||
+			    (eth_type != htons(ETH_P_IP) &&
+			     eth_type != htons(ETH_P_IPV6) &&
+			     eth_type != htons(ETH_P_ARP) &&
+			     eth_type != htons(ETH_P_RARP) &&
+			     !eth_p_mpls(eth_type)))
+				return -EINVAL;
+			eth_type = mpls->mpls_ethertype;
+			break;
+		}
+
+		case OVS_ACTION_ATTR_POP_MPLS:
+			if (vlan_tci & htons(VLAN_TAG_PRESENT) ||
+			    !eth_p_mpls(eth_type))
+				return -EINVAL;
+
+			/* Disallow subsequent L2.5+ set and mpls_pop actions
+			 * as there is no check here to ensure that the new
+			 * eth_type is valid and thus set actions could
+			 * write off the end of the packet or otherwise
+			 * corrupt it.
+			 *
+			 * Support for these actions is planned using packet
+			 * recirculation.
+			 */
+			eth_type = htons(0);
+			break;
+
 		case OVS_ACTION_ATTR_SET:
-			err = validate_set(a, key, sfa, &skip_copy);
+			err = validate_set(a, key, sfa,
+					   &out_tnl_port, eth_type);
 			if (err)
 				return err;
+
+			skip_copy = out_tnl_port;
 			break;
 
 		case OVS_ACTION_ATTR_SAMPLE:
-			err = validate_and_copy_sample(a, key, depth, sfa);
+			err = validate_and_copy_sample(a, key, depth, sfa,
+						       eth_type, vlan_tci);
 			if (err)
 				return err;
 			skip_copy = true;
@@ -1713,6 +1808,14 @@ int ovs_nla_copy_actions(const struct nlattr *attr,
 	return 0;
 }
 
+int ovs_nla_copy_actions(const struct nlattr *attr,
+			 const struct sw_flow_key *key,
+			 struct sw_flow_actions **sfa)
+{
+	return ovs_nla_copy_actions__(attr, key, 0, sfa, key->eth.type,
+				      key->eth.tci);
+}
+
 static int sample_action_to_attr(const struct nlattr *attr, struct sk_buff *skb)
 {
 	const struct nlattr *a;
diff --git a/net/openvswitch/flow_netlink.h b/net/openvswitch/flow_netlink.h
index 206e45a..6355b1d 100644
--- a/net/openvswitch/flow_netlink.h
+++ b/net/openvswitch/flow_netlink.h
@@ -49,7 +49,7 @@ int ovs_nla_get_match(struct sw_flow_match *match,
 		      const struct nlattr *);
 
 int ovs_nla_copy_actions(const struct nlattr *attr,
-			 const struct sw_flow_key *key, int depth,
+			 const struct sw_flow_key *key,
 			 struct sw_flow_actions **sfa);
 int ovs_nla_put_actions(const struct nlattr *attr,
 			int len, struct sk_buff *skb);
-- 
1.9.3

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox