Netdev List
* Re: [PATCH v3 0/5] Introduce variable length mdev alias
From: Cornelia Huck @ 2019-09-11 16:29 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Alex Williamson, Jiri Pirko, kwankhede@nvidia.com,
	davem@davemloft.net, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <AM0PR05MB48668DFF8E816F0D2D3041BFD1B10@AM0PR05MB4866.eurprd05.prod.outlook.com>

On Wed, 11 Sep 2019 15:30:40 +0000
Parav Pandit <parav@mellanox.com> wrote:

> Hi Alex,
> 
> > -----Original Message-----
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, September 11, 2019 8:56 AM
> > To: Parav Pandit <parav@mellanox.com>
> > Cc: Jiri Pirko <jiri@mellanox.com>; kwankhede@nvidia.com;
> > cohuck@redhat.com; davem@davemloft.net; kvm@vger.kernel.org;
> > linux-kernel@vger.kernel.org; netdev@vger.kernel.org
> > Subject: Re: [PATCH v3 0/5] Introduce variable length mdev alias
> > 
> > On Mon, 9 Sep 2019 20:42:32 +0000
> > Parav Pandit <parav@mellanox.com> wrote:
> >   
> > > Hi Alex,
> > >  
> > > > -----Original Message-----
> > > > From: Parav Pandit <parav@mellanox.com>
> > > > Sent: Sunday, September 1, 2019 11:25 PM
> > > > To: alex.williamson@redhat.com; Jiri Pirko <jiri@mellanox.com>;
> > > > kwankhede@nvidia.com; cohuck@redhat.com; davem@davemloft.net
> > > > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > > netdev@vger.kernel.org; Parav Pandit <parav@mellanox.com>
> > > > Subject: [PATCH v3 0/5] Introduce variable length mdev alias
> > > >
> > > > To have consistent naming for the netdevice of a mdev and to have
> > > > consistent naming of the devlink port [1] of a mdev, which is formed
> > > > using phys_port_name of the devlink port, current UUID is not usable
> > > > because UUID is too long.
> > > >
> > > > UUID in string format is 36-characters long and in binary 128-bit.
> > > > > Both formats are not able to fit within 15 characters limit of netdev
> > > > > name.
> > > >
> > > > It is desired to have mdev device naming consistent using UUID.
> > > > So that widely used user space framework such as ovs [2] can make
> > > > > use of mdev representor in similar way as PCIe SR-IOV VF and PF
> > > > > representors.
> > > >
> > > > Hence,
> > > > > (a) mdev alias is created which is derived using sha1 from the mdev
> > > > > name.
> > > > (b) Vendor driver describes how long an alias should be for the
> > > > child mdev created for a given parent.
> > > > (c) Mdev aliases are unique at system level.
> > > > (d) alias is created optionally whenever parent requested.
> > > > This ensures that non networking mdev parents can function without
> > > > alias creation overhead.
> > > >
> > > > This design is discussed at [3].
> > > >
> > > > An example systemd/udev extension will have,
> > > >
> > > > 1. netdev name created using mdev alias available in sysfs.
> > > >
> > > > mdev UUID=83b8f4f2-509f-382f-3c1e-e6bfe0fa1001
> > > > mdev 12 character alias=cd5b146a80a5
> > > >
> > > > > netdev name of this mdev = enmcd5b146a80a5
> > > > > Here en = Ethernet link, m = mediated device
> > > >
> > > > 2. devlink port phys_port_name created using mdev alias.
> > > > devlink phys_port_name=pcd5b146a80a5
> > > >
> > > > This patchset enables mdev core to maintain unique alias for a mdev.
> > > >
> > > > Patch-1 Introduces mdev alias using sha1.
> > > > Patch-2 Ensures that mdev alias is unique in a system.
> > > > > Patch-3 Exposes mdev alias in a sysfs hierarchy, update Documentation
> > > > Patch-4 Introduces mdev_alias() API.
> > > > Patch-5 Extends mtty driver to optionally provide alias generation.
> > > > This also enables to test UUID based sha1 collision and trigger
> > > > error handling for duplicate sha1 results.
> > > >
> > > > [1] http://man7.org/linux/man-pages/man8/devlink-port.8.html
> > > > [2] https://docs.openstack.org/os-vif/latest/user/plugins/ovs.html
> > > > [3] https://patchwork.kernel.org/cover/11084231/
> > > >
> > > > ---
> > > > Changelog:
> > > > v2->v3:
> > > >  - Addressed comment from Yunsheng Lin
> > > >  - Changed strcmp() ==0 to !strcmp()
> > > > >  - Addressed comment from Cornelia Huck
> > > > >  - Merged sysfs Documentation patch with sysfs patch
> > > >  - Added more description for alias return value  
> > >
> > > > Did you get a chance to review this updated series?
> > > > I addressed Cornelia's and your comments.
> > > > I do not think allocating alias memory twice (once for comparison and
> > > > once for storing) is a good idea, nor is moving the alias generation
> > > > logic inside the mdev_list_lock. So I didn't address that suggestion
> > > > of Cornelia's.
> > 
> > Sorry, I'm at LPC this week.  I agree, I don't think the double allocation is
> > necessary, I thought the comment was sufficient to clarify null'ing the
> > variable.  It's awkward, but seems correct.

Not hot about it, but no real complaints.

However, please give me some more time, as I'm at LPC as well.

> > 
> > I'm not sure what we do with this patch series though, has the real
> > consumer of this even been proposed?  It feels optimistic to include at this
> > point.  We've used the sample driver as a placeholder in the past for
> > mdev_uuid(), but we arrived at that via a conversion rather than explicitly
> > adding the API.  Please let me know where the consumer patches stand,
> > perhaps it would make more sense for them to go in together rather than
> > risk adding an unused API.  Thanks,
> >   
> Given that the consumer patch series is relatively large (around 15+ patches), I was considering merging this one as a pre-series to it.
> It's OK to combine this with the consumer patch series.
> But I wanted to have it reviewed beforehand, so that there is less churn in the actual consumer series, which is more mlx5_core and devlink/netdev centric.
> So if you can add a Reviewed-by, it will be easier to combine with the consumer series.
> 
> And if we merge it with consumer series, it will come through Dave Miller's tree instead of your tree.
> Would that work for you?

It would be easier to see what to do here if we could see the consumer
for this. If those patches are fine, we could maybe queue this series
via both trees?

^ permalink raw reply
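
The naming scheme from the cover letter quoted above ("enm" + alias, within the 15-character netdev name limit) can be sketched in userspace C roughly as follows. This is only an illustration under stated assumptions: `mdev_netdev_name()` and `NETDEV_NAME_MAX` are hypothetical names, not part of the patch set, and the alias is taken as given rather than derived with sha1.

```c
#include <stdio.h>
#include <string.h>

/* 15 usable characters, as stated in the cover letter. */
#define NETDEV_NAME_MAX 15

/*
 * Hypothetical helper: build "enm<alias>" into buf.
 * Returns 0 on success, -1 if the name would exceed the limit
 * or would not fit into buf (including the trailing NUL).
 */
static int mdev_netdev_name(const char *alias, char *buf, size_t buflen)
{
	static const char prefix[] = "enm"; /* en = Ethernet, m = mediated device */
	size_t need = strlen(prefix) + strlen(alias);

	if (need > NETDEV_NAME_MAX || need + 1 > buflen)
		return -1;

	snprintf(buf, buflen, "%s%s", prefix, alias);
	return 0;
}
```

With the 12-character alias from the cover letter the result is exactly 15 characters long, whereas passing the 36-character UUID string is rejected, which is the motivation given for introducing the alias.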

* Re: [PATCH v2] vhost: block speculation of translated descriptors
From: Will Deacon @ 2019-09-11 16:25 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, Jason Wang, kvm, virtualization, netdev, security
In-Reply-To: <20190911095147-mutt-send-email-mst@kernel.org>

On Wed, Sep 11, 2019 at 09:52:25AM -0400, Michael S. Tsirkin wrote:
> On Wed, Sep 11, 2019 at 08:10:00AM -0400, Michael S. Tsirkin wrote:
> > iovec addresses coming from vhost are assumed to be
> > pre-validated, but in fact can be speculated to a value
> > out of range.
> > 
> > Userspace addresses are later validated with array_index_nospec so we can
> > be sure kernel info does not leak through these addresses, but vhost
> > must also not leak userspace info outside the allowed memory table to
> > guests.
> > 
> > Following the defence in depth principle, make sure
> > the address is not validated out of node range.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > Acked-by: Jason Wang <jasowang@redhat.com>
> > Tested-by: Jason Wang <jasowang@redhat.com>
> > ---
> 
> Cc: security@kernel.org
> 
> Pls advise on whether you'd like me to merge this directly,
> Cc stable, or handle it in some other way.

I think you're fine taking it directly, with a cc stable and a Fixes: tag.

Cheers,

Will

^ permalink raw reply
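
The commit message above relies on array_index_nospec to keep an index in range even under speculation. A minimal userspace sketch of the underlying idea, a mask computed without a conditional branch, might look like this. `clamp_index_nospec()` is a hypothetical stand-in, not the kernel macro, and it assumes `size` is at most half of the `size_t` range; a real implementation also has to stop the compiler from reintroducing a branch, which this sketch does not attempt.

```c
#include <limits.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the kernel's array_index_nospec():
 * yields idx when idx < size and 0 otherwise, using a mask
 * instead of a conditional branch.
 */
static size_t clamp_index_nospec(size_t idx, size_t size)
{
	/* The top bit of (idx | (size - 1 - idx)) is set iff idx >= size. */
	size_t top = (idx | (size - 1 - idx)) >> (sizeof(size_t) * CHAR_BIT - 1);

	/* top == 0 -> mask is all ones; top == 1 -> mask is zero. */
	size_t mask = top - 1;

	return idx & mask;
}
```

An out-of-range index therefore collapses to 0, so a dependent load cannot reach beyond the array even if a preceding bounds check was mispredicted.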

* Re: [PATCH] sctp: Fix the link time qualifier of 'sctp_ctrlsock_exit()'
From: Marcelo Ricardo Leitner @ 2019-09-11 16:23 UTC (permalink / raw)
  To: Christophe JAILLET
  Cc: davem, vyasevich, nhorman, linux-sctp, netdev, linux-kernel,
	kernel-janitors
In-Reply-To: <20190911160239.10734-1-christophe.jaillet@wanadoo.fr>

On Wed, Sep 11, 2019 at 06:02:39PM +0200, Christophe JAILLET wrote:
> The '.exit' functions from 'pernet_operations' structure should be marked
> as __net_exit, not __net_init.
> 
> Fixes: 8e2d61e0aed2 ("sctp: fix race on protocol/netns initialization")
> Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>

Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>

^ permalink raw reply

* Re: VRF Issue Since kernel 5
From: David Ahern @ 2019-09-11 16:09 UTC (permalink / raw)
  To: Gowen, Alexis Bauvin, mmanning@vyatta.att-mail.com; +Cc: netdev@vger.kernel.org
In-Reply-To: <CWLP265MB155424EF95E39E98C4502F86FDB10@CWLP265MB1554.GBRP265.PROD.OUTLOOK.COM>

On 9/11/19 3:01 PM, Gowen wrote:
> Hi all,
> 
> It looks like ip vrf exec checks /etc/resolv.conf (found with strace -e
> trace=file sudo ip vrf exec mgmt-vrf host www.google.co.uk &>
> ~/straceFileOfVrfHost.txt) , but as I'm on an Azure machine using
> netplan, this file isn't updated with DNS servers. I have added my DNS
> server to resolv.conf and now can update the cache with "sudo ip vrf
> exec sudo apt update", if I am correct (which I'm not sure about as not
> really my area) then this might be affecting more than just me.
> 
> Also still not able to fix the updating cache from global VRF - which
> would cause bother in prod environment to others as well so think it
> would be good to get an RCA for it?
> 
> thanks for your help so far, has been really interesting.
> 
> Gareth
> 
> 
> ------------------------------------------------------------------------
> *From:* Gowen <gowen@potatocomputing.co.uk>
> *Sent:* 11 September 2019 13:48
> *To:* David Ahern <dsahern@gmail.com>; Alexis Bauvin
> <abauvin@online.net>; mmanning@vyatta.att-mail.com
> <mmanning@vyatta.att-mail.com>
> *Cc:* netdev@vger.kernel.org <netdev@vger.kernel.org>
> *Subject:* Re: VRF Issue Since kernel 5
>  
> yep no problem:
> 
> Admin@NETM06:~$ sudo sysctl -a | grep l3mdev
> net.ipv4.raw_l3mdev_accept = 1
> net.ipv4.tcp_l3mdev_accept = 1
> net.ipv4.udp_l3mdev_accept = 1
> 
> 
> The source of the DNS issue in the vrf exec command is something to do
> with networkd managing the DNS servers, I can fix it by explicitly
> mentioning the DNS server:
> 
> systemd-resolve --status --no-page
> 
> <OUTPUT OMITTED>
> 
> Link 4 (mgmt-vrf)
>       Current Scopes: none
>        LLMNR setting: yes
> MulticastDNS setting: no
>       DNSSEC setting: no
>     DNSSEC supported: no
> 
> Link 3 (eth1)
>       Current Scopes: DNS
>        LLMNR setting: yes
> MulticastDNS setting: no
>       DNSSEC setting: no
>     DNSSEC supported: no
>          DNS Servers: 10.24.65.203
>                       10.24.65.204
>                       10.25.65.203
>                       10.25.65.204
>           DNS Domain: reddog.microsoft.com
> 
> Link 2 (eth0)
>       Current Scopes: DNS
>        LLMNR setting: yes
> MulticastDNS setting: no
>       DNSSEC setting: no
>     DNSSEC supported: no
>          DNS Servers: 10.24.65.203
>                       10.24.65.204
>                       10.25.65.203
>                       10.25.65.204
>           DNS Domain: reddog.microsoft.com
> 
> there is no DNS server when I use ip vrf exec command (tcpdump shows
> only loopback traffic when invoked without my DNS server explicitly
> entered) - odd, as mgmt-vrf isn't an L3 device, so I thought it would
> pick up eth0 DNS servers?
> 
> I dont think this helps with my update cache traffic from global vrf
> though on port 80
> 

Let's back up a bit: your subject line says vrf issue since kernel 5.
Did you update / change the OS as well?

i.e., for the previous version that worked, what is the OS and kernel version?
What is the OS and kernel version with the problem?

^ permalink raw reply

* [PATCH] sctp: Fix the link time qualifier of 'sctp_ctrlsock_exit()'
From: Christophe JAILLET @ 2019-09-11 16:02 UTC (permalink / raw)
  To: davem, vyasevich, nhorman, marcelo.leitner
  Cc: linux-sctp, netdev, linux-kernel, kernel-janitors,
	Christophe JAILLET

The '.exit' functions from 'pernet_operations' structure should be marked
as __net_exit, not __net_init.

Fixes: 8e2d61e0aed2 ("sctp: fix race on protocol/netns initialization")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
---
 net/sctp/protocol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index 2d47adcb4cbe..53746ffeeca3 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -1336,7 +1336,7 @@ static int __net_init sctp_ctrlsock_init(struct net *net)
 	return status;
 }
 
-static void __net_init sctp_ctrlsock_exit(struct net *net)
+static void __net_exit sctp_ctrlsock_exit(struct net *net)
 {
 	/* Free the control endpoint.  */
 	inet_ctl_sock_destroy(net->sctp.ctl_sock);
-- 
2.20.1


^ permalink raw reply related

* RE: [PATCH net-next 1/5] enetc: Fix if_mode extraction
From: Claudiu Manoil @ 2019-09-11 16:01 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: David S . Miller, Alexandru Marginean, netdev@vger.kernel.org
In-Reply-To: <20190910074412.GA31298@lunn.ch>

>-----Original Message-----
>From: Andrew Lunn <andrew@lunn.ch>
>Sent: Tuesday, September 10, 2019 10:44 AM
>To: Claudiu Manoil <claudiu.manoil@nxp.com>
>Cc: David S . Miller <davem@davemloft.net>; Alexandru Marginean
><alexandru.marginean@nxp.com>; netdev@vger.kernel.org
>Subject: Re: [PATCH net-next 1/5] enetc: Fix if_mode extraction
>
>On Mon, Sep 09, 2019 at 04:24:01PM +0000, Claudiu Manoil wrote:
[...]
>>
>> Hi Andrew,
>>
>> The MAC2MAC connections are defined as fixed-link too, but without
>> phy-mode/phy-connection-type properties.  We don't want to de-register
>> these links.  Initial code was bogus in this regard.
>
>Hi Claudiu
>
>This is what is not clear in the change log. That this code is removed
>because it is wrong. Please could you expand the explanation to make
>this clearer.
>

I agree, but I also need to modify the patch to handle both the error case
of invalid phy-mode for mdio and normal fixed link phy connections, and
the mac2mac connection case.  The mac2mac connection case can be also
deferred to a later patch, when the switch driver - Felix - will be available
(there's no use for it in the current enetc upstream driver).

>> Current proposal is:
>> 			ethernet@0,2 { /* SoC internal, connected to switch port 4 */
>> 				compatible = "fsl,enetc";
>> 				reg = <0x000200 0 0 0 0>;
>> 				fixed-link {
>> 					speed = <1000>;
>> 					full-duplex;
>> 				};
>> 			};
>> 			switch@0,5 {
>> 				compatible = "mscc,felix-switch";
>> 				[...]
>> 				ports {
>> 					#address-cells = <1>;
>> 					#size-cells = <0>;
>>
>> 					/* external ports */
>> 					[...]
>> 					/* internal SoC ports */
>> 					port@4 { /* connected to ENETC port2 */
>> 						reg = <4>;
>> 						fixed-link {
>> 							speed = <1000>;
>> 							full-duplex;
>> 						};
>> 					};
>
>So this connection between the SoC and the switch does not use tags?
>Can it use tags? Does the hardware allow you to have two CPU ports,
>and load balance over them?
>

Unfortunately the switch can handle only one port with tags.  There's only
one CPU port, switch port 4 is just like another front panel port.  On top of
that, the CPU port is not capable of flow control (pause frames don't work with
tagged traffic on the switch side).  So we may be forced to use port 4 to mitigate
this.  Note that the switch is inside the SoC.

>This second half is just standard DSA. This looks good.
>

Thanks for the confirmation and the rest of the review, all valid findings.

Regards,
Claudiu

^ permalink raw reply

* Re: [PATCH 0/7] net: dsa: mv88e6xxx: features to handle network storms
From: Vivien Didelot @ 2019-09-11 15:31 UTC (permalink / raw)
  To: Robert Beckett
  Cc: netdev, Andrew Lunn, Florian Fainelli, David S. Miller,
	bob.beckett
In-Reply-To: <3f265c5afcb2eea48410ec607d65e8f4e6a20373.camel@collabora.com>

Hi Robert,

On Wed, 11 Sep 2019 10:46:05 +0100, Robert Beckett <bob.beckett@collabora.com> wrote:
> > Feature series targeting netdev must be prefixed "PATCH net-next". As
> 
> Thanks for the info. Out of curiosity, where should I have gleaned this
> info from? This is my first contribution to netdev, so I wasn't familiar
> with the etiquette.
> 
> > this approach was a PoC, sending it as "RFC net-next" would be even more
> > appropriate.

Netdev being a huge subsystem has specific rules for subject prefix or merge
window, which are described in Documentation/networking/netdev-FAQ.rst


Thank you,

	Vivien

^ permalink raw reply

* RE: [PATCH v3 0/5] Introduce variable length mdev alias
From: Parav Pandit @ 2019-09-11 15:30 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jiri Pirko, kwankhede@nvidia.com, cohuck@redhat.com,
	davem@davemloft.net, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <20190911145610.453b32ec@x1.home>

Hi Alex,

> -----Original Message-----
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, September 11, 2019 8:56 AM
> To: Parav Pandit <parav@mellanox.com>
> Cc: Jiri Pirko <jiri@mellanox.com>; kwankhede@nvidia.com;
> cohuck@redhat.com; davem@davemloft.net; kvm@vger.kernel.org;
> linux-kernel@vger.kernel.org; netdev@vger.kernel.org
> Subject: Re: [PATCH v3 0/5] Introduce variable length mdev alias
> 
> On Mon, 9 Sep 2019 20:42:32 +0000
> Parav Pandit <parav@mellanox.com> wrote:
> 
> > Hi Alex,
> >
> > > -----Original Message-----
> > > From: Parav Pandit <parav@mellanox.com>
> > > Sent: Sunday, September 1, 2019 11:25 PM
> > > To: alex.williamson@redhat.com; Jiri Pirko <jiri@mellanox.com>;
> > > kwankhede@nvidia.com; cohuck@redhat.com; davem@davemloft.net
> > > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > netdev@vger.kernel.org; Parav Pandit <parav@mellanox.com>
> > > Subject: [PATCH v3 0/5] Introduce variable length mdev alias
> > >
> > > To have consistent naming for the netdevice of a mdev and to have
> > > consistent naming of the devlink port [1] of a mdev, which is formed
> > > using phys_port_name of the devlink port, current UUID is not usable
> > > because UUID is too long.
> > >
> > > UUID in string format is 36-characters long and in binary 128-bit.
> > > Both formats are not able to fit within 15 characters limit of netdev
> > > name.
> > >
> > > It is desired to have mdev device naming consistent using UUID.
> > > So that widely used user space framework such as ovs [2] can make
> > > use of mdev representor in similar way as PCIe SR-IOV VF and PF
> > > representors.
> > >
> > > Hence,
> > > (a) mdev alias is created which is derived using sha1 from the mdev
> > > name.
> > > (b) Vendor driver describes how long an alias should be for the
> > > child mdev created for a given parent.
> > > (c) Mdev aliases are unique at system level.
> > > (d) alias is created optionally whenever parent requested.
> > > This ensures that non networking mdev parents can function without
> > > alias creation overhead.
> > >
> > > This design is discussed at [3].
> > >
> > > An example systemd/udev extension will have,
> > >
> > > 1. netdev name created using mdev alias available in sysfs.
> > >
> > > mdev UUID=83b8f4f2-509f-382f-3c1e-e6bfe0fa1001
> > > mdev 12 character alias=cd5b146a80a5
> > >
> > > netdev name of this mdev = enmcd5b146a80a5
> > > Here en = Ethernet link, m = mediated device
> > >
> > > 2. devlink port phys_port_name created using mdev alias.
> > > devlink phys_port_name=pcd5b146a80a5
> > >
> > > This patchset enables mdev core to maintain unique alias for a mdev.
> > >
> > > Patch-1 Introduces mdev alias using sha1.
> > > Patch-2 Ensures that mdev alias is unique in a system.
> > > Patch-3 Exposes mdev alias in a sysfs hierarchy, update Documentation
> > > Patch-4 Introduces mdev_alias() API.
> > > Patch-5 Extends mtty driver to optionally provide alias generation.
> > > This also enables to test UUID based sha1 collision and trigger
> > > error handling for duplicate sha1 results.
> > >
> > > [1] http://man7.org/linux/man-pages/man8/devlink-port.8.html
> > > [2] https://docs.openstack.org/os-vif/latest/user/plugins/ovs.html
> > > [3] https://patchwork.kernel.org/cover/11084231/
> > >
> > > ---
> > > Changelog:
> > > v2->v3:
> > >  - Addressed comment from Yunsheng Lin
> > >  - Changed strcmp() ==0 to !strcmp()
> > >  - Addressed comment from Cornelia Huck
> > >  - Merged sysfs Documentation patch with sysfs patch
> > >  - Added more description for alias return value
> >
> > Did you get a chance to review this updated series?
> > I addressed Cornelia's and your comments.
> > I do not think allocating alias memory twice (once for comparison and
> > once for storing) is a good idea, nor is moving the alias generation
> > logic inside the mdev_list_lock. So I didn't address that suggestion
> > of Cornelia's.
> 
> Sorry, I'm at LPC this week.  I agree, I don't think the double allocation is
> necessary, I thought the comment was sufficient to clarify null'ing the
> variable.  It's awkward, but seems correct.
> 
> I'm not sure what we do with this patch series though, has the real
> consumer of this even been proposed?  It feels optimistic to include at this
> point.  We've used the sample driver as a placeholder in the past for
> mdev_uuid(), but we arrived at that via a conversion rather than explicitly
> adding the API.  Please let me know where the consumer patches stand,
> perhaps it would make more sense for them to go in together rather than
> risk adding an unused API.  Thanks,
> 
Given that the consumer patch series is relatively large (around 15+ patches), I was considering merging this one as a pre-series to it.
It's OK to combine this with the consumer patch series.
But I wanted to have it reviewed beforehand, so that there is less churn in the actual consumer series, which is more mlx5_core and devlink/netdev centric.
So if you can add a Reviewed-by, it will be easier to combine with the consumer series.

And if we merge it with consumer series, it will come through Dave Miller's tree instead of your tree.
Would that work for you?

^ permalink raw reply

* [PATCH net-next] nfp: read chip model from the PluDevice register
From: Simon Horman @ 2019-09-11 15:21 UTC (permalink / raw)
  To: David Miller
  Cc: Jakub Kicinski, netdev, oss-drivers, Dirk van der Merwe,
	Simon Horman

From: Dirk van der Merwe <dirk.vandermerwe@netronome.com>

The PluDevice register provides the authoritative chip model/revision.

Since the model number is purely used for reporting purposes, follow
the hardware team convention of subtracting 0x10 from the PluDevice
register to obtain the chip model/revision number.

Suggested-by: Francois H. Theron <francois.theron@netronome.com>
Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
---
 drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cpplib.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cpplib.c b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cpplib.c
index 3cfecf105bde..85734c6badf5 100644
--- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cpplib.c
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cpplib.c
@@ -24,8 +24,9 @@
 /* NFP6000 PL */
 #define NFP_PL_DEVICE_ID			0x00000004
 #define   NFP_PL_DEVICE_ID_MASK			GENMASK(7, 0)
-
-#define NFP6000_ARM_GCSR_SOFTMODEL0		0x00400144
+#define   NFP_PL_DEVICE_PART_MASK		GENMASK(31, 16)
+#define NFP_PL_DEVICE_MODEL_MASK		(NFP_PL_DEVICE_PART_MASK | \
+						 NFP_PL_DEVICE_ID_MASK)
 
 /**
  * nfp_cpp_readl() - Read a u32 word from a CPP location
@@ -120,22 +121,17 @@ int nfp_cpp_writeq(struct nfp_cpp *cpp, u32 cpp_id,
  */
 int nfp_cpp_model_autodetect(struct nfp_cpp *cpp, u32 *model)
 {
-	const u32 arm_id = NFP_CPP_ID(NFP_CPP_TARGET_ARM, 0, 0);
 	u32 reg;
 	int err;
 
-	err = nfp_cpp_readl(cpp, arm_id, NFP6000_ARM_GCSR_SOFTMODEL0, model);
-	if (err < 0)
-		return err;
-
-	/* The PL's PluDeviceID revision code is authoratative */
-	*model &= ~0xff;
 	err = nfp_xpb_readl(cpp, NFP_XPB_DEVICE(1, 1, 16) + NFP_PL_DEVICE_ID,
 			    &reg);
 	if (err < 0)
 		return err;
 
-	*model |= (NFP_PL_DEVICE_ID_MASK & reg) - 0x10;
+	*model = reg & NFP_PL_DEVICE_MODEL_MASK;
+	if (*model & NFP_PL_DEVICE_ID_MASK)
+		*model -= 0x10;
 
 	return 0;
 }
-- 
2.11.0


^ permalink raw reply related
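
The arithmetic in the patch above can be tried out in isolation with a small userspace sketch. The masks mirror the GENMASK(7, 0) and GENMASK(31, 16) definitions from the diff; `model_from_plu_device()` is a hypothetical name, and any register values fed to it are made up for illustration rather than read from hardware.

```c
#include <stdint.h>

#define PL_DEVICE_ID_MASK    0x000000ffu /* GENMASK(7, 0)   */
#define PL_DEVICE_PART_MASK  0xffff0000u /* GENMASK(31, 16) */
#define PL_DEVICE_MODEL_MASK (PL_DEVICE_PART_MASK | PL_DEVICE_ID_MASK)

/*
 * Hypothetical helper mirroring the new logic: keep the part number
 * and revision fields, then subtract 0x10 only when the revision
 * field is non-zero.
 */
static uint32_t model_from_plu_device(uint32_t reg)
{
	uint32_t model = reg & PL_DEVICE_MODEL_MASK;

	if (model & PL_DEVICE_ID_MASK)
		model -= 0x10;

	return model;
}
```

Bits 8 to 15 of the register are discarded by the mask, and a zero revision field is left untouched, so the subtraction does not borrow from the part number in that case.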

* Re: VRF Issue Since kernel 5
From: Mike Manning @ 2019-09-11 12:15 UTC (permalink / raw)
  To: Gowen, David Ahern, Alexis Bauvin; +Cc: netdev@vger.kernel.org
In-Reply-To: <CWLP265MB1554604C9DB9B28D245E47A2FDB10@CWLP265MB1554.GBRP265.PROD.OUTLOOK.COM>

Hi Gareth,
Could you please also check that all the following are set to 1, I
appreciate you've confirmed that the one for tcp is set to 1, and by
default the one for raw is also set to 1:

sudo sysctl -a | grep l3mdev

If not,
sudo sysctl net.ipv4.raw_l3mdev_accept=1
sudo sysctl net.ipv4.udp_l3mdev_accept=1
sudo sysctl net.ipv4.tcp_l3mdev_accept=1


Thanks
Mike




^ permalink raw reply

* [PATCH v2] net: qrtr: fix memory leak in qrtr_tun_write_iter
From: Navid Emamdoost @ 2019-09-11 15:09 UTC (permalink / raw)
  To: davem; +Cc: emamd001, smccaman, kjlu, Navid Emamdoost, netdev, linux-kernel
In-Reply-To: <20190911.101320.682967997452798874.davem@davemloft.net>

In qrtr_tun_write_iter the allocated kbuf should be released in case of
error or success return.

v2 Update: Thanks to David Miller for pointing out the release on success
path as well.

Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com>
---
 net/qrtr/tun.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/qrtr/tun.c b/net/qrtr/tun.c
index ccff1e544c21..e35869e81766 100644
--- a/net/qrtr/tun.c
+++ b/net/qrtr/tun.c
@@ -84,11 +84,14 @@ static ssize_t qrtr_tun_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (!kbuf)
 		return -ENOMEM;
 
-	if (!copy_from_iter_full(kbuf, len, from))
+	if (!copy_from_iter_full(kbuf, len, from)) {
+		kfree(kbuf);
 		return -EFAULT;
+	}
 
 	ret = qrtr_endpoint_post(&tun->ep, kbuf, len);
 
+	kfree(kbuf);
 	return ret < 0 ? ret : len;
 }
 
-- 
2.17.1


^ permalink raw reply related
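
The ownership rule behind the patch above (the temporary buffer is freed on the copy-failure path and after posting, on success and failure alike) can be sketched in plain userspace C. All names here are hypothetical stand-ins: `post_copy()` plays the role of qrtr_endpoint_post() under the assumption that it does not keep the buffer, and a NULL source simulates a failed copy_from_iter_full().

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical consumer: uses the buffer during the call, never keeps it. */
static int post_copy(const void *buf, size_t len)
{
	(void)buf;
	return len ? 0 : -EINVAL;
}

/* Hypothetical writer mirroring the control flow of the patched function. */
static long write_buf(const void *src, size_t len)
{
	void *kbuf = malloc(len ? len : 1);
	int ret;

	if (!kbuf)
		return -ENOMEM;

	if (!src) {		/* stands in for a failed copy */
		free(kbuf);
		return -EFAULT;
	}
	memcpy(kbuf, src, len);

	ret = post_copy(kbuf, len);

	free(kbuf);		/* released whether the post succeeded or not */
	return ret < 0 ? ret : (long)len;
}
```

Every return after the allocation is preceded by a matching free, which is the property the v2 patch restores.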

* [PATCH iproute2-next] devlink: unknown 'fw_load_policy' string validation
From: Simon Horman @ 2019-09-11 14:56 UTC (permalink / raw)
  To: David Ahern
  Cc: Jiri Pirko, netdev, oss-drivers, Jakub Kicinski,
	Dirk van der Merwe, Simon Horman

From: Dirk van der Merwe <dirk.vandermerwe@netronome.com>

The 'fw_load_policy' devlink parameter now supports an unknown value.

Suggested-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
---

Sorry about these dependencies, some related changes came through
in separate patch-sets.

1. Depends on iproute2-next patch sent earlier today:
   [PATCH iproute2-next] devlink: add 'reset_dev_on_drv_probe' devlink param

2. Depends on devlink.h changes present in net-next commit:
   64f658ded48e ("devlink: add unknown 'fw_load_policy' value")

   Which in turn depends on other devlink.h changes present in net-next commit:
   5bbd21df5a07 ("devlink: add 'reset_dev_on_drv_probe' param")
---
 devlink/devlink.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/devlink/devlink.c b/devlink/devlink.c
index 15877a04f5d6..e4b494eb3e5d 100644
--- a/devlink/devlink.c
+++ b/devlink/devlink.c
@@ -2259,6 +2259,11 @@ static const struct param_val_conv param_val_conv[] = {
 		.vuint = DEVLINK_PARAM_RESET_DEV_ON_DRV_PROBE_VALUE_UNKNOWN,
 	},
 	{
+		.name = "fw_load_policy",
+		.vstr = "unknown",
+		.vuint = DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_UNKNOWN,
+	},
+	{
 		.name = "reset_dev_on_drv_probe",
 		.vstr = "always",
 		.vuint = DEVLINK_PARAM_RESET_DEV_ON_DRV_PROBE_VALUE_ALWAYS,
-- 
2.11.0


^ permalink raw reply related
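
The table entry added by the patch above maps a parameter name and a string value to an integer. A self-contained sketch of that kind of lookup is given below. The field names `name`, `vstr` and `vuint` follow the diff; the enum constants, the table contents and `str_to_uint()` are hypothetical and only illustrate the pattern, since the real values come from the devlink uapi header.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical values; the real constants live in the devlink uapi header. */
enum {
	FW_LOAD_POLICY_DRIVER,
	FW_LOAD_POLICY_FLASH,
	FW_LOAD_POLICY_UNKNOWN,
};

struct param_val_conv {
	const char *name;
	const char *vstr;
	unsigned int vuint;
};

/* Illustrative table, not the one from devlink.c. */
static const struct param_val_conv conv_table[] = {
	{ "fw_load_policy", "driver",  FW_LOAD_POLICY_DRIVER },
	{ "fw_load_policy", "flash",   FW_LOAD_POLICY_FLASH },
	{ "fw_load_policy", "unknown", FW_LOAD_POLICY_UNKNOWN },
};

/* Returns 0 and stores the value in *out, or -1 if no entry matches. */
static int str_to_uint(const char *name, const char *vstr, unsigned int *out)
{
	size_t i;

	for (i = 0; i < sizeof(conv_table) / sizeof(conv_table[0]); i++) {
		if (!strcmp(conv_table[i].name, name) &&
		    !strcmp(conv_table[i].vstr, vstr)) {
			*out = conv_table[i].vuint;
			return 0;
		}
	}
	return -1;
}
```

Without an "unknown" row, a value of that kind reported by the kernel could not be translated in either direction, which appears to be the gap the patch closes.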

* Re: ixgbe: driver drops packets routed from an IPSec interface with a "bad sa_idx" error
From: Michael Marley @ 2019-09-11 14:50 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: Shannon Nelson, netdev, Jeff Kirsher
In-Reply-To: <20190911061547.GR2879@gauss3.secunet.de>

On 2019-09-11 02:15, Steffen Klassert wrote:
> On Tue, Sep 10, 2019 at 06:53:30PM -0400, Michael Marley wrote:
>> 
>> StrongSwan has hardware offload disabled by default, and I didn't enable
>> it explicitly.  I also already tried turning off all those switches with
>> ethtool and it has no effect.  This doesn't surprise me though, because
>> as I said, I don't actually have the IPSec connection running over the
>> ixgbe device.  The IPSec connection runs over another network adapter
>> that doesn't support IPSec offload at all.  The problem comes when
>> traffic received over the IPSec interface is then routed back out
>> (unencrypted) through the ixgbe device into the local network.
> 
> 
> Seems like the ixgbe driver tries to use the sec_path
> from RX to setup an offload at the TX side.
> 
> Can you please try this (completely untested) patch?
> 
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index 9bcae44e9883..ae31bd57127c 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -36,6 +36,7 @@
>  #include <net/vxlan.h>
>  #include <net/mpls.h>
>  #include <net/xdp_sock.h>
> +#include <net/xfrm.h>
> 
>  #include "ixgbe.h"
>  #include "ixgbe_common.h"
> @@ -8696,7 +8697,7 @@ netdev_tx_t ixgbe_xmit_frame_ring(struct sk_buff *skb,
>  #endif /* IXGBE_FCOE */
> 
>  #ifdef CONFIG_IXGBE_IPSEC
> -	if (secpath_exists(skb) &&
> +	if (xfrm_offload(skb) &&
>  	    !ixgbe_ipsec_tx(tx_ring, first, &ipsec_tx))
>  		goto out_drop;
>  #endif
With the patch, the problem is gone.  Thanks!

Michael

^ permalink raw reply

* Re: [PATCH] ftgmac100: Disable HW checksum generation on AST2500
From: Joel Stanley @ 2019-09-11 14:48 UTC (permalink / raw)
  To: Florian Fainelli, Benjamin Herrenschmidt
  Cc: Vijay Khemka, David S. Miller, YueHaibing, Andrew Lunn,
	Kate Stewart, Mauro Carvalho Chehab, Luis Chamberlain,
	Thomas Gleixner, netdev, Linux Kernel Mailing List,
	openbmc @ lists . ozlabs . org, linux-aspeed, Sai Dasari
In-Reply-To: <bd5eab2e-6ba6-9e27-54d4-d9534da9d5f7@gmail.com>

Hi Ben,

On Tue, 10 Sep 2019 at 22:05, Florian Fainelli <f.fainelli@gmail.com> wrote:
>
> On 9/10/19 2:37 PM, Vijay Khemka wrote:
> > HW checksum generation is not working for AST2500, especially with IPV6
> > over NCSI. All TCP packets with IPv6 get dropped. By disabling this
> > it works perfectly fine with IPV6.
> >
> > Verified with IPV6 enabled and can do ssh.
>
> How about IPv4, do these packets have problem? If not, can you continue
> advertising NETIF_F_IP_CSUM but take out NETIF_F_IPV6_CSUM?
>
> >
> > Signed-off-by: Vijay Khemka <vijaykhemka@fb.com>
> > ---
> >  drivers/net/ethernet/faraday/ftgmac100.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/faraday/ftgmac100.c b/drivers/net/ethernet/faraday/ftgmac100.c
> > index 030fed65393e..591c9725002b 100644
> > --- a/drivers/net/ethernet/faraday/ftgmac100.c
> > +++ b/drivers/net/ethernet/faraday/ftgmac100.c
> > @@ -1839,8 +1839,9 @@ static int ftgmac100_probe(struct platform_device *pdev)
> >       if (priv->use_ncsi)
> >               netdev->hw_features |= NETIF_F_HW_VLAN_CTAG_FILTER;
> >
> > -     /* AST2400  doesn't have working HW checksum generation */
> > -     if (np && (of_device_is_compatible(np, "aspeed,ast2400-mac")))
> > +     /* AST2400  and AST2500 doesn't have working HW checksum generation */
> > +     if (np && (of_device_is_compatible(np, "aspeed,ast2400-mac") ||
> > +                of_device_is_compatible(np, "aspeed,ast2500-mac")))

Do you recall under what circumstances we need to disable hardware checksumming?

Cheers,

Joel

> >               netdev->hw_features &= ~NETIF_F_HW_CSUM;
> >       if (np && of_get_property(np, "no-hw-checksum", NULL))
> >               netdev->hw_features &= ~(NETIF_F_HW_CSUM | NETIF_F_RXCSUM);
> >
>
>
> --
> Florian

^ permalink raw reply

* Re: [PATCH net 1/2] sctp: remove redundant assignment when call sctp_get_port_local
From: Marcelo Ricardo Leitner @ 2019-09-11 14:39 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: maowenan, vyasevich, nhorman, davem, linux-sctp, netdev,
	linux-kernel, kernel-janitors
In-Reply-To: <20190911143008.GD3499@localhost.localdomain>

On Wed, Sep 11, 2019 at 11:30:08AM -0300, Marcelo Ricardo Leitner wrote:
> On Wed, Sep 11, 2019 at 11:30:38AM +0300, Dan Carpenter wrote:
> > On Wed, Sep 11, 2019 at 09:30:47AM +0800, maowenan wrote:
> > > 
> > > 
> > > On 2019/9/11 3:22, Dan Carpenter wrote:
> > > > On Tue, Sep 10, 2019 at 09:57:10PM +0300, Dan Carpenter wrote:
> > > >> On Tue, Sep 10, 2019 at 03:13:42PM +0800, Mao Wenan wrote:
> > > >>> There are extra parentheses in the if clause when calling sctp_get_port_local
> > > >>> in sctp_do_bind, and a redundant assignment to 'ret'. This patch
> > > >>> cleans that up.
> > > >>>
> > > >>> Signed-off-by: Mao Wenan <maowenan@huawei.com>
> > > >>> ---
> > > >>>  net/sctp/socket.c | 3 +--
> > > >>>  1 file changed, 1 insertion(+), 2 deletions(-)
> > > >>>
> > > >>> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> > > >>> index 9d1f83b10c0a..766b68b55ebe 100644
> > > >>> --- a/net/sctp/socket.c
> > > >>> +++ b/net/sctp/socket.c
> > > >>> @@ -399,9 +399,8 @@ static int sctp_do_bind(struct sock *sk, union sctp_addr *addr, int len)
> > > >>>  	 * detection.
> > > >>>  	 */
> > > >>>  	addr->v4.sin_port = htons(snum);
> > > >>> -	if ((ret = sctp_get_port_local(sk, addr))) {
> > > >>> +	if (sctp_get_port_local(sk, addr))
> > > >>>  		return -EADDRINUSE;
> > > >>
> > > >> sctp_get_port_local() returns a long which is either 0, 1 or a pointer
> > > >> cast to long.  It's not documented what it means and neither of the
> > > >> callers uses the return since commit 62208f12451f ("net: sctp: simplify
> > > >> sctp_get_port").
> > > > 
> > > > Actually it was commit 4e54064e0a13 ("sctp: Allow only 1 listening
> > > > socket with SO_REUSEADDR") from 11 years ago.  That patch fixed a bug,
> > > > because before the code assumed that a pointer cast to an int was the
> > > > same as a pointer cast to a long.
> > > 
> > > commit 4e54064e0a13 treated non-zero return value as unexpected, so the current
> > > cleanup is ok?
> > 
> > Yeah.  It's fine, I was just confused why we weren't preserving the
> > error code and then I saw that we didn't return errors at all and got
> > confused.
> 
> But please let's seize the moment and do the change Dean suggested.

*Dan*, sorry.

> This was the last place saving this return value somewhere. It makes
> sense to clean up sctp_get_port_local() now and remove that masked
> pointer return.
> 
> Then you may also clean up:
> socket.c:       return !!sctp_get_port_local(sk, &addr);
> as it will be a direct map.
> 
>   Marcelo
> 

* Re: [PATCH net 1/2] sctp: remove redundant assignment when call sctp_get_port_local
From: Marcelo Ricardo Leitner @ 2019-09-11 14:30 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: maowenan, vyasevich, nhorman, davem, linux-sctp, netdev,
	linux-kernel, kernel-janitors
In-Reply-To: <20190911083038.GF20699@kadam>

On Wed, Sep 11, 2019 at 11:30:38AM +0300, Dan Carpenter wrote:
> On Wed, Sep 11, 2019 at 09:30:47AM +0800, maowenan wrote:
> > 
> > 
> > On 2019/9/11 3:22, Dan Carpenter wrote:
> > > On Tue, Sep 10, 2019 at 09:57:10PM +0300, Dan Carpenter wrote:
> > >> On Tue, Sep 10, 2019 at 03:13:42PM +0800, Mao Wenan wrote:
> > >>> There are extra parentheses in the if clause when calling sctp_get_port_local
> > >>> in sctp_do_bind, and a redundant assignment to 'ret'. This patch
> > >>> cleans that up.
> > >>>
> > >>> Signed-off-by: Mao Wenan <maowenan@huawei.com>
> > >>> ---
> > >>>  net/sctp/socket.c | 3 +--
> > >>>  1 file changed, 1 insertion(+), 2 deletions(-)
> > >>>
> > >>> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> > >>> index 9d1f83b10c0a..766b68b55ebe 100644
> > >>> --- a/net/sctp/socket.c
> > >>> +++ b/net/sctp/socket.c
> > >>> @@ -399,9 +399,8 @@ static int sctp_do_bind(struct sock *sk, union sctp_addr *addr, int len)
> > >>>  	 * detection.
> > >>>  	 */
> > >>>  	addr->v4.sin_port = htons(snum);
> > >>> -	if ((ret = sctp_get_port_local(sk, addr))) {
> > >>> +	if (sctp_get_port_local(sk, addr))
> > >>>  		return -EADDRINUSE;
> > >>
> > >> sctp_get_port_local() returns a long which is either 0, 1 or a pointer
> > >> cast to long.  It's not documented what it means and neither of the
> > >> callers uses the return since commit 62208f12451f ("net: sctp: simplify
> > >> sctp_get_port").
> > > 
> > > Actually it was commit 4e54064e0a13 ("sctp: Allow only 1 listening
> > > socket with SO_REUSEADDR") from 11 years ago.  That patch fixed a bug,
> > > because before the code assumed that a pointer cast to an int was the
> > > same as a pointer cast to a long.
> > 
> > commit 4e54064e0a13 treated non-zero return value as unexpected, so the current
> > cleanup is ok?
> 
> Yeah.  It's fine, I was just confused why we weren't preserving the
> error code and then I saw that we didn't return errors at all and got
> confused.

But please let's seize the moment and do the change Dean suggested.
This was the last place saving this return value somewhere. It makes
sense to clean up sctp_get_port_local() now and remove that masked
pointer return.

Then you may also clean up:
socket.c:       return !!sctp_get_port_local(sk, &addr);
as it will be a direct map.

  Marcelo
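
To make the "masked pointer return" point concrete, here is a minimal
userspace sketch, not the SCTP code itself. All names (struct bucket,
get_port_masked, get_port_old, get_port_clean, demo_bucket) are
hypothetical, and the pointer-to-long cast assumes an LP64 target where
long is as wide as a pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct bucket { int port; };

/* Example object, only here so the functions below can be exercised. */
struct bucket demo_bucket = { 80 };

/* Old style: 0 on success, otherwise 1 or a pointer squeezed into a long. */
long get_port_masked(struct bucket *conflict, int in_use)
{
	if (conflict)
		return (long)(intptr_t)conflict;
	return in_use ? 1 : 0;
}

/* A caller that only wants "failed or not" has to normalize with !!. */
int get_port_old(struct bucket *conflict, int in_use)
{
	return !!get_port_masked(conflict, in_use);
}

/* Cleaned up: the function itself returns 0 or 1, so callers map directly. */
int get_port_clean(struct bucket *conflict, int in_use)
{
	return conflict || in_use;
}
```

With the cleaned-up variant the wrapper becomes a plain pass-through,
which is the "direct map" mentioned above.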

* Re: [PATCH 2/2] dt-bindings: net: dwmac: document 'mac-mode' property
From: David Miller @ 2019-09-11 14:28 UTC (permalink / raw)
  To: alexandru.ardelean
  Cc: netdev, devicetree, linux-kernel, robh+dt, peppe.cavallaro,
	alexandre.torgue
In-Reply-To: <20190906130256.10321-2-alexandru.ardelean@analog.com>

From: Alexandru Ardelean <alexandru.ardelean@analog.com>
Date: Fri, 6 Sep 2019 16:02:56 +0300

> This change documents the 'mac-mode' property that was introduced in the
> 'stmmac' driver to support passive mode converters that can sit in-between
> the MAC & PHY.
> 
> Signed-off-by: Alexandru Ardelean <alexandru.ardelean@analog.com>

Applied to net-next.

* Re: [PATCH 1/2] net: stmmac: implement support for passive mode converters via dt
From: David Miller @ 2019-09-11 14:28 UTC (permalink / raw)
  To: alexandru.ardelean
  Cc: netdev, devicetree, linux-kernel, robh+dt, peppe.cavallaro,
	alexandre.torgue
In-Reply-To: <20190906130256.10321-1-alexandru.ardelean@analog.com>

From: Alexandru Ardelean <alexandru.ardelean@analog.com>
Date: Fri, 6 Sep 2019 16:02:55 +0300

> In-between the MAC & PHY there can be a mode converter, which converts one
> mode to another (e.g. GMII-to-RGMII).
> 
> The converter can be passive (i.e. no driver or OS/SW information
> required), so the MAC & PHY need to be configured differently.
> 
> For the `stmmac` driver, this is implemented via a `mac-mode` property in
> the device-tree, which configures the MAC into a certain mode, and for the
> PHY a `phy_interface` field will hold the mode of the PHY. The mode of the
> PHY will be passed to the PHY and from there on it will work in a different
> mode. If unspecified, the default `phy-mode` will be used for both.
> 
> Signed-off-by: Alexandru Ardelean <alexandru.ardelean@analog.com>

Applied to net-next.
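
The fallback rule in the quoted description ("if unspecified, the default
phy-mode is used for both") can be sketched as follows. This is a
userspace illustration only: the enum, struct and function names are
hypothetical and do not mirror the stmmac driver's data structures.

```c
#include <assert.h>

/* Hypothetical subset of interface modes; MODE_UNSET marks "property absent". */
enum if_mode { MODE_UNSET = -1, MODE_GMII, MODE_RGMII };

struct modes {
	enum if_mode mac;  /* mode the MAC is configured for */
	enum if_mode phy;  /* mode handed to the PHY */
};

/*
 * The PHY always gets phy-mode. The MAC gets mac-mode when it is
 * present and falls back to phy-mode otherwise.
 */
struct modes resolve_modes(enum if_mode phy_mode, enum if_mode mac_mode)
{
	struct modes m;

	m.phy = phy_mode;
	m.mac = (mac_mode == MODE_UNSET) ? phy_mode : mac_mode;
	return m;
}
```

For a GMII-to-RGMII converter, the call would be
resolve_modes(MODE_RGMII, MODE_GMII): the MAC runs GMII while the PHY
is told RGMII.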

* Re: [PATCH] mlx4: fix spelling mistake "veify" -> "verify"
From: David Miller @ 2019-09-11 14:20 UTC (permalink / raw)
  To: colin.king; +Cc: tariqt, netdev, linux-rdma, kernel-janitors, linux-kernel
In-Reply-To: <20190911141811.8370-1-colin.king@canonical.com>

From: Colin King <colin.king@canonical.com>
Date: Wed, 11 Sep 2019 15:18:11 +0100

> From: Colin Ian King <colin.king@canonical.com>
> 
> There is a spelling mistake in a mlx4_err error message. Fix it.
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Applied.

* [PATCH] mlx4: fix spelling mistake "veify" -> "verify"
From: Colin King @ 2019-09-11 14:18 UTC (permalink / raw)
  To: Tariq Toukan, David S . Miller, netdev, linux-rdma
  Cc: kernel-janitors, linux-kernel

From: Colin Ian King <colin.king@canonical.com>

There is a spelling mistake in a mlx4_err error message. Fix it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
 drivers/net/ethernet/mellanox/mlx4/main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
index 07c204bd3fc4..a48a40c1278e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -2240,7 +2240,7 @@ static int mlx4_validate_optimized_steering(struct mlx4_dev *dev)
 	for (i = 1; i <= dev->caps.num_ports; i++) {
 		if (mlx4_dev_port(dev, i, &port_cap)) {
 			mlx4_err(dev,
-				 "QUERY_DEV_CAP command failed, can't veify DMFS high rate steering.\n");
+				 "QUERY_DEV_CAP command failed, can't verify DMFS high rate steering.\n");
 		} else if ((dev->caps.dmfs_high_steer_mode !=
 			    MLX4_STEERING_DMFS_A0_DEFAULT) &&
 			   (port_cap.dmfs_optimized_state ==
-- 
2.20.1


* Re: [PATCH] net: hns3: fix spelling mistake "undeflow" -> "underflow"
From: David Miller @ 2019-09-11 14:17 UTC (permalink / raw)
  To: colin.king
  Cc: yisen.zhuang, salil.mehta, tanhuazhong, netdev, kernel-janitors,
	linux-kernel
In-Reply-To: <20190911140817.20173-1-colin.king@canonical.com>

From: Colin King <colin.king@canonical.com>
Date: Wed, 11 Sep 2019 15:08:16 +0100

> From: Colin Ian King <colin.king@canonical.com>
> 
> There is a spelling mistake in a .msg literal string. Fix it.
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Applied.

* Re: [PATCH net-next 0/2] qed* Fix series.
From: David Miller @ 2019-09-11 14:15 UTC (permalink / raw)
  To: skalluru; +Cc: netdev, mkalderon, aelior
In-Reply-To: <20190911114251.7013-1-skalluru@marvell.com>

From: Sudarsana Reddy Kalluru <skalluru@marvell.com>
Date: Wed, 11 Sep 2019 04:42:49 -0700

> The patch series addresses a couple of issues in the recent commits.
> Patch (1) populates the actual dump-size of config attribute instead of
> providing a fixed size value.
> Patch (2) updates frame format of flash config buffer as required by
> management FW (MFW).
> 
> Please consider applying it to net-next.

Series applied, thanks.

* Re: [PATCH] net: lmc: fix spelling mistake "runnin" -> "running"
From: David Miller @ 2019-09-11 14:13 UTC (permalink / raw)
  To: colin.king; +Cc: netdev, kernel-janitors, linux-kernel
In-Reply-To: <20190911113734.26185-1-colin.king@canonical.com>

From: Colin King <colin.king@canonical.com>
Date: Wed, 11 Sep 2019 12:37:34 +0100

> From: Colin Ian King <colin.king@canonical.com>
> 
> There is a spelling mistake in the lmc_trace message. Fix it.
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Applied, thanks.

* [PATCH ipsec-next v2 5/6] esp4: split esp_output_udp_encap and introduce esp_output_encap
From: Sabrina Dubroca @ 2019-09-11 14:13 UTC (permalink / raw)
  To: netdev; +Cc: Herbert Xu, Steffen Klassert, Sabrina Dubroca
In-Reply-To: <cover.1568192824.git.sd@queasysnail.net>

Co-developed-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
---
 net/ipv4/esp4.c | 57 ++++++++++++++++++++++++++++++++-----------------
 1 file changed, 37 insertions(+), 20 deletions(-)

diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index c5d826642229..033c61d27148 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -225,45 +225,62 @@ static void esp_output_fill_trailer(u8 *tail, int tfclen, int plen, __u8 proto)
 	tail[plen - 1] = proto;
 }
 
-static int esp_output_udp_encap(struct xfrm_state *x, struct sk_buff *skb, struct esp_info *esp)
+static struct ip_esp_hdr *esp_output_udp_encap(struct sk_buff *skb,
+					       int encap_type,
+					       struct esp_info *esp,
+					       __be16 sport,
+					       __be16 dport)
 {
-	int encap_type;
 	struct udphdr *uh;
 	__be32 *udpdata32;
-	__be16 sport, dport;
-	struct xfrm_encap_tmpl *encap = x->encap;
-	struct ip_esp_hdr *esph = esp->esph;
 	unsigned int len;
 
-	spin_lock_bh(&x->lock);
-	sport = encap->encap_sport;
-	dport = encap->encap_dport;
-	encap_type = encap->encap_type;
-	spin_unlock_bh(&x->lock);
-
 	len = skb->len + esp->tailen - skb_transport_offset(skb);
 	if (len + sizeof(struct iphdr) >= IP_MAX_MTU)
-		return -EMSGSIZE;
+		return ERR_PTR(-EMSGSIZE);
 
-	uh = (struct udphdr *)esph;
+	uh = (struct udphdr *)esp->esph;
 	uh->source = sport;
 	uh->dest = dport;
 	uh->len = htons(len);
 	uh->check = 0;
 
+	*skb_mac_header(skb) = IPPROTO_UDP;
+
+	if (encap_type == UDP_ENCAP_ESPINUDP_NON_IKE) {
+		udpdata32 = (__be32 *)(uh + 1);
+		udpdata32[0] = udpdata32[1] = 0;
+		return (struct ip_esp_hdr *)(udpdata32 + 2);
+	}
+
+	return (struct ip_esp_hdr *)(uh + 1);
+}
+
+static int esp_output_encap(struct xfrm_state *x, struct sk_buff *skb,
+			    struct esp_info *esp)
+{
+	struct xfrm_encap_tmpl *encap = x->encap;
+	struct ip_esp_hdr *esph;
+	__be16 sport, dport;
+	int encap_type;
+
+	spin_lock_bh(&x->lock);
+	sport = encap->encap_sport;
+	dport = encap->encap_dport;
+	encap_type = encap->encap_type;
+	spin_unlock_bh(&x->lock);
+
 	switch (encap_type) {
 	default:
 	case UDP_ENCAP_ESPINUDP:
-		esph = (struct ip_esp_hdr *)(uh + 1);
-		break;
 	case UDP_ENCAP_ESPINUDP_NON_IKE:
-		udpdata32 = (__be32 *)(uh + 1);
-		udpdata32[0] = udpdata32[1] = 0;
-		esph = (struct ip_esp_hdr *)(udpdata32 + 2);
+		esph = esp_output_udp_encap(skb, encap_type, esp, sport, dport);
 		break;
 	}
 
-	*skb_mac_header(skb) = IPPROTO_UDP;
+	if (IS_ERR(esph))
+		return PTR_ERR(esph);
+
 	esp->esph = esph;
 
 	return 0;
@@ -281,7 +298,7 @@ int esp_output_head(struct xfrm_state *x, struct sk_buff *skb, struct esp_info *
 
 	/* this is non-NULL only with UDP Encapsulation */
 	if (x->encap) {
-		int err = esp_output_udp_encap(x, skb, esp);
+		int err = esp_output_encap(x, skb, esp);
 
 		if (err < 0)
 			return err;
-- 
2.22.0


* [PATCH ipsec-next v2 6/6] xfrm: add espintcp (RFC 8229)
From: Sabrina Dubroca @ 2019-09-11 14:13 UTC (permalink / raw)
  To: netdev; +Cc: Herbert Xu, Steffen Klassert, Sabrina Dubroca
In-Reply-To: <cover.1568192824.git.sd@queasysnail.net>

TCP encapsulation of IKE and IPsec messages (RFC 8229) is implemented
as a TCP ULP, overriding in particular the sendmsg and recvmsg
operations. A Stream Parser is used to extract messages out of the TCP
stream using the first 2 bytes as length marker. Received IKE messages
are put on "ike_queue", waiting to be dequeued by the custom recvmsg
implementation. Received ESP messages are sent to XFRM, like with UDP
encapsulation.

Some of this code is taken from the original submission by Herbert
Xu. Currently, only IPv4 is supported, like for UDP encapsulation.
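
As a rough userspace illustration of the framing described above (not
the kernel code in this patch): each message starts with a 2-byte
big-endian length that counts the length field itself, followed by
either four zero bytes (the non-ESP marker, meaning IKE) or an ESP SPI.
The enum and function names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum frame_kind { FRAME_INCOMPLETE, FRAME_INVALID, FRAME_IKE, FRAME_ESP };

/*
 * Classify the first frame in buf, given avail bytes received so far.
 * On FRAME_IKE/FRAME_ESP, *frame_len (if non-NULL) is set to the total
 * frame length, including the 2-byte prefix.
 */
enum frame_kind classify_frame(const uint8_t *buf, size_t avail,
			       size_t *frame_len)
{
	size_t len;

	if (avail < 2)
		return FRAME_INCOMPLETE;

	len = ((size_t)buf[0] << 8) | buf[1];	/* big-endian length */
	if (len < 6)				/* prefix + 4-byte marker/SPI */
		return FRAME_INVALID;
	if (avail < len)
		return FRAME_INCOMPLETE;

	if (frame_len)
		*frame_len = len;

	/* Four zero bytes after the prefix are the non-ESP marker (IKE). */
	if ((buf[2] | buf[3] | buf[4] | buf[5]) == 0)
		return FRAME_IKE;
	return FRAME_ESP;
}
```

A stream parser would call something like this repeatedly, waiting for
more data on FRAME_INCOMPLETE and consuming frame_len bytes otherwise.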

Co-developed-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
---
v2:
  - remove unneeded goto and improve error handling in
    esp_output_tcp_finish
  - clean up the ifdefs by providing dummy implementations of those
    functions
  - fix Kconfig select, missing NET_SOCK_MSG

 include/net/espintcp.h   |  38 +++
 include/net/xfrm.h       |   1 +
 include/uapi/linux/udp.h |   1 +
 net/ipv4/esp4.c          | 191 ++++++++++++++-
 net/xfrm/Kconfig         |  10 +
 net/xfrm/Makefile        |   1 +
 net/xfrm/espintcp.c      | 505 +++++++++++++++++++++++++++++++++++++++
 net/xfrm/xfrm_policy.c   |   7 +
 net/xfrm/xfrm_state.c    |   3 +
 9 files changed, 754 insertions(+), 3 deletions(-)
 create mode 100644 include/net/espintcp.h
 create mode 100644 net/xfrm/espintcp.c

diff --git a/include/net/espintcp.h b/include/net/espintcp.h
new file mode 100644
index 000000000000..02fc28c82d30
--- /dev/null
+++ b/include/net/espintcp.h
@@ -0,0 +1,38 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _NET_ESPINTCP_H
+#define _NET_ESPINTCP_H
+
+#include <net/strparser.h>
+#include <linux/skmsg.h>
+
+void __init espintcp_init(void);
+
+int espintcp_push_skb(struct sock *sk, struct sk_buff *skb);
+int espintcp_queue_out(struct sock *sk, struct sk_buff *skb);
+bool tcp_is_ulp_esp(struct sock *sk);
+
+struct espintcp_msg {
+	struct sk_buff *skb;
+	struct sk_msg skmsg;
+	int offset;
+	int len;
+};
+
+struct espintcp_ctx {
+	struct strparser strp;
+	struct sk_buff_head ike_queue;
+	struct sk_buff_head out_queue;
+	struct espintcp_msg partial;
+	void (*saved_data_ready)(struct sock *sk);
+	void (*saved_write_space)(struct sock *sk);
+	struct work_struct work;
+	bool tx_running;
+};
+
+static inline struct espintcp_ctx *espintcp_getctx(const struct sock *sk)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+
+	return icsk->icsk_ulp_data;
+}
+#endif
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index afedc9210c4b..3dd3c199ecfa 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -193,6 +193,7 @@ struct xfrm_state {
 
 	/* Data for encapsulator */
 	struct xfrm_encap_tmpl	*encap;
+	struct sock __rcu	*encap_sk;
 
 	/* Data for care-of address */
 	xfrm_address_t	*coaddr;
diff --git a/include/uapi/linux/udp.h b/include/uapi/linux/udp.h
index 30baccb6c9c4..4828794efcf8 100644
--- a/include/uapi/linux/udp.h
+++ b/include/uapi/linux/udp.h
@@ -42,5 +42,6 @@ struct udphdr {
 #define UDP_ENCAP_GTP0		4 /* GSM TS 09.60 */
 #define UDP_ENCAP_GTP1U		5 /* 3GPP TS 29.060 */
 #define UDP_ENCAP_RXRPC		6
+#define TCP_ENCAP_ESPINTCP	7 /* Yikes, this is really xfrm encap types. */
 
 #endif /* _UAPI_LINUX_UDP_H */
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 033c61d27148..140a97805752 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -18,6 +18,8 @@
 #include <net/icmp.h>
 #include <net/protocol.h>
 #include <net/udp.h>
+#include <net/tcp.h>
+#include <net/espintcp.h>
 
 #include <linux/highmem.h>
 
@@ -117,6 +119,132 @@ static void esp_ssg_unref(struct xfrm_state *x, void *tmp)
 			put_page(sg_page(sg));
 }
 
+#ifdef CONFIG_XFRM_ESPINTCP
+struct esp_tcp_sk {
+	struct sock *sk;
+	struct rcu_head rcu;
+};
+
+static void esp_free_tcp_sk(struct rcu_head *head)
+{
+	struct esp_tcp_sk *esk = container_of(head, struct esp_tcp_sk, rcu);
+
+	sock_put(esk->sk);
+	kfree(esk);
+}
+
+static struct sock *esp_find_tcp_sk(struct xfrm_state *x)
+{
+	struct xfrm_encap_tmpl *encap = x->encap;
+	struct esp_tcp_sk *esk;
+	__be16 sport, dport;
+	struct sock *nsk;
+	struct sock *sk;
+
+	sk = rcu_dereference(x->encap_sk);
+	if (sk && sk->sk_state == TCP_ESTABLISHED)
+		return sk;
+
+	spin_lock_bh(&x->lock);
+	sport = encap->encap_sport;
+	dport = encap->encap_dport;
+	nsk = rcu_dereference_protected(x->encap_sk,
+					lockdep_is_held(&x->lock));
+	if (sk && sk == nsk) {
+		esk = kmalloc(sizeof(*esk), GFP_ATOMIC);
+		if (!esk) {
+			spin_unlock_bh(&x->lock);
+			return ERR_PTR(-ENOMEM);
+		}
+		RCU_INIT_POINTER(x->encap_sk, NULL);
+		esk->sk = sk;
+		call_rcu(&esk->rcu, esp_free_tcp_sk);
+	}
+	spin_unlock_bh(&x->lock);
+
+	sk = inet_lookup_established(xs_net(x), &tcp_hashinfo, x->id.daddr.a4,
+				     dport, x->props.saddr.a4, sport, 0);
+	if (!sk)
+		return ERR_PTR(-ENOENT);
+
+	if (!tcp_is_ulp_esp(sk)) {
+		sock_put(sk);
+		return ERR_PTR(-EINVAL);
+	}
+
+	spin_lock_bh(&x->lock);
+	nsk = rcu_dereference_protected(x->encap_sk,
+					lockdep_is_held(&x->lock));
+	if (encap->encap_sport != sport ||
+	    encap->encap_dport != dport) {
+		sock_put(sk);
+		sk = nsk ?: ERR_PTR(-EREMCHG);
+	} else if (sk == nsk) {
+		sock_put(sk);
+	} else {
+		rcu_assign_pointer(x->encap_sk, sk);
+	}
+	spin_unlock_bh(&x->lock);
+
+	return sk;
+}
+
+static int esp_output_tcp_finish(struct xfrm_state *x, struct sk_buff *skb)
+{
+	struct sock *sk;
+	int err;
+
+	rcu_read_lock();
+
+	sk = esp_find_tcp_sk(x);
+	err = PTR_ERR_OR_ZERO(sk);
+	if (err)
+		goto out;
+
+	bh_lock_sock(sk);
+	if (sock_owned_by_user(sk))
+		err = espintcp_queue_out(sk, skb);
+	else
+		err = espintcp_push_skb(sk, skb);
+	bh_unlock_sock(sk);
+
+out:
+	rcu_read_unlock();
+	return err;
+}
+
+static int esp_output_tcp_encap_cb(struct net *net, struct sock *sk,
+				   struct sk_buff *skb)
+{
+	struct dst_entry *dst = skb_dst(skb);
+	struct xfrm_state *x = dst->xfrm;
+
+	return esp_output_tcp_finish(x, skb);
+}
+
+static int esp_output_tail_tcp(struct xfrm_state *x, struct sk_buff *skb)
+{
+	int err;
+
+	local_bh_disable();
+	err = xfrm_trans_queue_net(xs_net(x), skb, esp_output_tcp_encap_cb);
+	local_bh_enable();
+
+	/* EINPROGRESS just happens to do the right thing.  It
+	 * actually means that the skb has been consumed and
+	 * isn't coming back.
+	 */
+	return err ?: -EINPROGRESS;
+}
+#else
+static int esp_output_tail_tcp(struct xfrm_state *x, struct sk_buff *skb)
+{
+	kfree_skb(skb);
+
+	return -EOPNOTSUPP;
+}
+#endif
+
 static void esp_output_done(struct crypto_async_request *base, int err)
 {
 	struct sk_buff *skb = base->data;
@@ -147,7 +275,11 @@ static void esp_output_done(struct crypto_async_request *base, int err)
 		secpath_reset(skb);
 		xfrm_dev_resume(skb);
 	} else {
-		xfrm_output_resume(skb, err);
+		if (!err &&
+		    x->encap && x->encap->encap_type == TCP_ENCAP_ESPINTCP)
+			esp_output_tail_tcp(x, skb);
+		else
+			xfrm_output_resume(skb, err);
 	}
 }
 
@@ -236,7 +368,7 @@ static struct ip_esp_hdr *esp_output_udp_encap(struct sk_buff *skb,
 	unsigned int len;
 
 	len = skb->len + esp->tailen - skb_transport_offset(skb);
-	if (len + sizeof(struct iphdr) >= IP_MAX_MTU)
+	if (len + sizeof(struct iphdr) > IP_MAX_MTU)
 		return ERR_PTR(-EMSGSIZE);
 
 	uh = (struct udphdr *)esp->esph;
@@ -256,6 +388,41 @@ static struct ip_esp_hdr *esp_output_udp_encap(struct sk_buff *skb,
 	return (struct ip_esp_hdr *)(uh + 1);
 }
 
+#ifdef CONFIG_XFRM_ESPINTCP
+static struct ip_esp_hdr *esp_output_tcp_encap(struct xfrm_state *x,
+						    struct sk_buff *skb,
+						    struct esp_info *esp)
+{
+	__be16 *lenp = (void *)esp->esph;
+	struct ip_esp_hdr *esph;
+	unsigned int len;
+	struct sock *sk;
+
+	len = skb->len + esp->tailen - skb_transport_offset(skb);
+	if (len > IP_MAX_MTU)
+		return ERR_PTR(-EMSGSIZE);
+
+	rcu_read_lock();
+	sk = esp_find_tcp_sk(x);
+	rcu_read_unlock();
+
+	if (IS_ERR(sk))
+		return ERR_CAST(sk);
+
+	*lenp = htons(len);
+	esph = (struct ip_esp_hdr *)(lenp + 1);
+
+	return esph;
+}
+#else
+static struct ip_esp_hdr *esp_output_tcp_encap(struct xfrm_state *x,
+						    struct sk_buff *skb,
+						    struct esp_info *esp)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+#endif
+
 static int esp_output_encap(struct xfrm_state *x, struct sk_buff *skb,
 			    struct esp_info *esp)
 {
@@ -276,6 +443,9 @@ static int esp_output_encap(struct xfrm_state *x, struct sk_buff *skb,
 	case UDP_ENCAP_ESPINUDP_NON_IKE:
 		esph = esp_output_udp_encap(skb, encap_type, esp, sport, dport);
 		break;
+	case TCP_ENCAP_ESPINTCP:
+		esph = esp_output_tcp_encap(x, skb, esp);
+		break;
 	}
 
 	if (IS_ERR(esph))
@@ -296,7 +466,7 @@ int esp_output_head(struct xfrm_state *x, struct sk_buff *skb, struct esp_info *
 	struct sk_buff *trailer;
 	int tailen = esp->tailen;
 
-	/* this is non-NULL only with UDP Encapsulation */
+	/* this is non-NULL only with TCP/UDP Encapsulation */
 	if (x->encap) {
 		int err = esp_output_encap(x, skb, esp);
 
@@ -491,6 +661,9 @@ int esp_output_tail(struct xfrm_state *x, struct sk_buff *skb, struct esp_info *
 	if (sg != dsg)
 		esp_ssg_unref(x, tmp);
 
+	if (!err && x->encap && x->encap->encap_type == TCP_ENCAP_ESPINTCP)
+		err = esp_output_tail_tcp(x, skb);
+
 error_free:
 	kfree(tmp);
 error:
@@ -617,10 +790,14 @@ int esp_input_done2(struct sk_buff *skb, int err)
 
 	if (x->encap) {
 		struct xfrm_encap_tmpl *encap = x->encap;
+		struct tcphdr *th = (void *)(skb_network_header(skb) + ihl);
 		struct udphdr *uh = (void *)(skb_network_header(skb) + ihl);
 		__be16 source;
 
 		switch (x->encap->encap_type) {
+		case TCP_ENCAP_ESPINTCP:
+			source = th->source;
+			break;
 		case UDP_ENCAP_ESPINUDP:
 		case UDP_ENCAP_ESPINUDP_NON_IKE:
 			source = uh->source;
@@ -1017,6 +1194,14 @@ static int esp_init_state(struct xfrm_state *x)
 		case UDP_ENCAP_ESPINUDP_NON_IKE:
 			x->props.header_len += sizeof(struct udphdr) + 2 * sizeof(u32);
 			break;
+#ifdef CONFIG_XFRM_ESPINTCP
+		case TCP_ENCAP_ESPINTCP:
+			/* only the length field, TCP encap is done by
+			 * the socket
+			 */
+			x->props.header_len += 2;
+			break;
+#endif
 		}
 	}
 
diff --git a/net/xfrm/Kconfig b/net/xfrm/Kconfig
index 51bb6018f3bf..e67044527fb7 100644
--- a/net/xfrm/Kconfig
+++ b/net/xfrm/Kconfig
@@ -73,6 +73,16 @@ config XFRM_IPCOMP
 	select CRYPTO
 	select CRYPTO_DEFLATE
 
+config XFRM_ESPINTCP
+	bool "ESP in TCP encapsulation (RFC 8229)"
+	depends on XFRM && INET_ESP
+	select STREAM_PARSER
+	select NET_SOCK_MSG
+	help
+	  Support for RFC 8229 encapsulation of ESP and IKE over TCP sockets.
+
+	  If unsure, say N.
+
 config NET_KEY
 	tristate "PF_KEY sockets"
 	select XFRM_ALGO
diff --git a/net/xfrm/Makefile b/net/xfrm/Makefile
index fbc4552d17b8..2d4bb4b9f75e 100644
--- a/net/xfrm/Makefile
+++ b/net/xfrm/Makefile
@@ -11,3 +11,4 @@ obj-$(CONFIG_XFRM_ALGO) += xfrm_algo.o
 obj-$(CONFIG_XFRM_USER) += xfrm_user.o
 obj-$(CONFIG_XFRM_IPCOMP) += xfrm_ipcomp.o
 obj-$(CONFIG_XFRM_INTERFACE) += xfrm_interface.o
+obj-$(CONFIG_XFRM_ESPINTCP) += espintcp.o
diff --git a/net/xfrm/espintcp.c b/net/xfrm/espintcp.c
new file mode 100644
index 000000000000..1d561a00c4b0
--- /dev/null
+++ b/net/xfrm/espintcp.c
@@ -0,0 +1,505 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <net/tcp.h>
+#include <net/strparser.h>
+#include <net/xfrm.h>
+#include <net/esp.h>
+#include <net/espintcp.h>
+#include <linux/skmsg.h>
+#include <net/inet_common.h>
+
+static void handle_nonesp(struct espintcp_ctx *ctx, struct sk_buff *skb,
+			  struct sock *sk)
+{
+	if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf ||
+	    !sk_rmem_schedule(sk, skb, skb->truesize)) {
+		kfree_skb(skb);
+		return;
+	}
+
+	skb_set_owner_r(skb, sk);
+
+	memset(skb->cb, 0, sizeof(skb->cb));
+	skb_queue_tail(&ctx->ike_queue, skb);
+	ctx->saved_data_ready(sk);
+}
+
+static void handle_esp(struct sk_buff *skb, struct sock *sk)
+{
+	skb_reset_transport_header(skb);
+	memset(skb->cb, 0, sizeof(skb->cb));
+
+	rcu_read_lock();
+	skb->dev = dev_get_by_index_rcu(sock_net(sk), skb->skb_iif);
+	local_bh_disable();
+	xfrm4_rcv_encap(skb, IPPROTO_ESP, 0, TCP_ENCAP_ESPINTCP);
+	local_bh_enable();
+	rcu_read_unlock();
+}
+
+static void espintcp_rcv(struct strparser *strp, struct sk_buff *skb)
+{
+	struct espintcp_ctx *ctx = container_of(strp, struct espintcp_ctx,
+						strp);
+	struct strp_msg *rxm = strp_msg(skb);
+	u32 nonesp_marker;
+	int err;
+
+	err = skb_copy_bits(skb, rxm->offset + 2, &nonesp_marker,
+			    sizeof(nonesp_marker));
+	if (err < 0) {
+		kfree_skb(skb);
+		return;
+	}
+
+	/* remove header, leave non-ESP marker/SPI */
+	if (!__pskb_pull(skb, rxm->offset + 2)) {
+		kfree_skb(skb);
+		return;
+	}
+
+	if (pskb_trim(skb, rxm->full_len - 2) != 0) {
+		kfree_skb(skb);
+		return;
+	}
+
+	if (nonesp_marker == 0)
+		handle_nonesp(ctx, skb, strp->sk);
+	else
+		handle_esp(skb, strp->sk);
+}
+
+static int espintcp_parse(struct strparser *strp, struct sk_buff *skb)
+{
+	struct strp_msg *rxm = strp_msg(skb);
+	__be16 blen;
+	u16 len;
+	int err;
+
+	if (skb->len < rxm->offset + 2)
+		return 0;
+
+	err = skb_copy_bits(skb, rxm->offset, &blen, sizeof(blen));
+	if (err < 0)
+		return err;
+
+	len = be16_to_cpu(blen);
+	if (len < 6)
+		return -EINVAL;
+
+	return len;
+}
+
+static int espintcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
+			    int nonblock, int flags, int *addr_len)
+{
+	struct espintcp_ctx *ctx = espintcp_getctx(sk);
+	struct sk_buff *skb;
+	int err = 0;
+	int copied;
+	int off = 0;
+
+	flags |= nonblock ? MSG_DONTWAIT : 0;
+
+	skb = __skb_recv_datagram(sk, &ctx->ike_queue, flags, NULL, &off, &err);
+	if (!skb)
+		return err;
+
+	copied = len;
+	if (copied > skb->len)
+		copied = skb->len;
+	else if (copied < skb->len)
+		msg->msg_flags |= MSG_TRUNC;
+
+	err = skb_copy_datagram_msg(skb, 0, msg, copied);
+	if (unlikely(err)) {
+		kfree_skb(skb);
+		return err;
+	}
+
+	if (flags & MSG_TRUNC)
+		copied = skb->len;
+	kfree_skb(skb);
+	return copied;
+}
+
+int espintcp_queue_out(struct sock *sk, struct sk_buff *skb)
+{
+	struct espintcp_ctx *ctx = espintcp_getctx(sk);
+
+	if (skb_queue_len(&ctx->out_queue) >= netdev_max_backlog)
+		return -ENOBUFS;
+
+	__skb_queue_tail(&ctx->out_queue, skb);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(espintcp_queue_out);
+
+/* espintcp length field is 2B and length includes the length field's size */
+#define MAX_ESPINTCP_MSG (((1 << 16) - 1) - 2)
+
+static int espintcp_sendskb_locked(struct sock *sk, struct espintcp_msg *emsg,
+				   int flags)
+{
+	do {
+		int ret;
+
+		ret = skb_send_sock_locked(sk, emsg->skb,
+					   emsg->offset, emsg->len);
+		if (ret < 0)
+			return ret;
+
+		emsg->len -= ret;
+		emsg->offset += ret;
+	} while (emsg->len > 0);
+
+	kfree_skb(emsg->skb);
+	memset(emsg, 0, sizeof(*emsg));
+
+	return 0;
+}
+
+static int espintcp_sendskmsg_locked(struct sock *sk,
+				     struct espintcp_msg *emsg, int flags)
+{
+	struct sk_msg *skmsg = &emsg->skmsg;
+	struct scatterlist *sg;
+	int done = 0;
+	int ret;
+
+	flags |= MSG_SENDPAGE_NOTLAST;
+	sg = &skmsg->sg.data[skmsg->sg.start];
+	do {
+		size_t size = sg->length - emsg->offset;
+		int offset = sg->offset + emsg->offset;
+		struct page *p;
+
+		emsg->offset = 0;
+
+		if (sg_is_last(sg))
+			flags &= ~MSG_SENDPAGE_NOTLAST;
+
+		p = sg_page(sg);
+retry:
+		ret = do_tcp_sendpages(sk, p, offset, size, flags);
+		if (ret < 0) {
+			emsg->offset = offset - sg->offset;
+			skmsg->sg.start += done;
+			return ret;
+		}
+
+		if (ret != size) {
+			offset += ret;
+			size -= ret;
+			goto retry;
+		}
+
+		done++;
+		put_page(p);
+		sk_mem_uncharge(sk, sg->length);
+		sg = sg_next(sg);
+	} while (sg);
+
+	memset(emsg, 0, sizeof(*emsg));
+
+	return 0;
+}
+
+static int espintcp_push_msgs(struct sock *sk)
+{
+	struct espintcp_ctx *ctx = espintcp_getctx(sk);
+	struct espintcp_msg *emsg = &ctx->partial;
+	int err;
+
+	if (!emsg->len)
+		return 0;
+
+	if (ctx->tx_running)
+		return -EAGAIN;
+	ctx->tx_running = 1;
+
+	if (emsg->skb)
+		err = espintcp_sendskb_locked(sk, emsg, 0);
+	else
+		err = espintcp_sendskmsg_locked(sk, emsg, 0);
+	if (err == -EAGAIN) {
+		ctx->tx_running = 0;
+		return 0;
+	}
+	if (!err)
+		memset(emsg, 0, sizeof(*emsg));
+
+	ctx->tx_running = 0;
+
+	return err;
+}
+
+int espintcp_push_skb(struct sock *sk, struct sk_buff *skb)
+{
+	struct espintcp_ctx *ctx = espintcp_getctx(sk);
+	struct espintcp_msg *emsg = &ctx->partial;
+	unsigned int len;
+	int offset;
+
+	if (sk->sk_state != TCP_ESTABLISHED) {
+		kfree_skb(skb);
+		return -ECONNRESET;
+	}
+
+	offset = skb_transport_offset(skb);
+	len = skb->len - offset;
+
+	espintcp_push_msgs(sk);
+
+	if (emsg->len) {
+		kfree_skb(skb);
+		return -ENOBUFS;
+	}
+
+	skb_set_owner_w(skb, sk);
+
+	emsg->offset = offset;
+	emsg->len = len;
+	emsg->skb = skb;
+
+	espintcp_push_msgs(sk);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(espintcp_push_skb);
+
+static int espintcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
+{
+	long timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
+	struct espintcp_ctx *ctx = espintcp_getctx(sk);
+	struct espintcp_msg *emsg = &ctx->partial;
+	struct iov_iter pfx_iter;
+	struct kvec pfx_iov = {};
+	size_t msglen = size + 2;
+	char buf[2] = {0};
+	int err, end;
+
+	if (msg->msg_flags)
+		return -EOPNOTSUPP;
+
+	if (size > MAX_ESPINTCP_MSG)
+		return -EMSGSIZE;
+
+	if (msg->msg_controllen)
+		return -EOPNOTSUPP;
+
+	lock_sock(sk);
+
+	err = espintcp_push_msgs(sk);
+	if (err < 0) {
+		err = -ENOBUFS;
+		goto unlock;
+	}
+
+	sk_msg_init(&emsg->skmsg);
+	while (1) {
+		/* only -ENOMEM is possible since we don't coalesce */
+		err = sk_msg_alloc(sk, &emsg->skmsg, msglen, 0);
+		if (!err)
+			break;
+
+		err = sk_stream_wait_memory(sk, &timeo);
+		if (err)
+			goto fail;
+	}
+
+	*((__be16 *)buf) = cpu_to_be16(msglen);
+	pfx_iov.iov_base = buf;
+	pfx_iov.iov_len = sizeof(buf);
+	iov_iter_kvec(&pfx_iter, WRITE, &pfx_iov, 1, pfx_iov.iov_len);
+
+	err = sk_msg_memcopy_from_iter(sk, &pfx_iter, &emsg->skmsg,
+				       pfx_iov.iov_len);
+	if (err < 0)
+		goto fail;
+
+	err = sk_msg_memcopy_from_iter(sk, &msg->msg_iter, &emsg->skmsg, size);
+	if (err < 0)
+		goto fail;
+
+	end = emsg->skmsg.sg.end;
+	emsg->len = size;
+	sk_msg_iter_var_prev(end);
+	sg_mark_end(sk_msg_elem(&emsg->skmsg, end));
+
+	tcp_rate_check_app_limited(sk);
+
+	err = espintcp_push_msgs(sk);
+	/* this message could be partially sent, keep it */
+	if (err < 0)
+		goto unlock;
+	release_sock(sk);
+
+	return size;
+
+fail:
+	sk_msg_free(sk, &emsg->skmsg);
+	memset(emsg, 0, sizeof(*emsg));
+unlock:
+	release_sock(sk);
+	return err;
+}
+
+static struct proto espintcp_prot __ro_after_init;
+static struct proto_ops espintcp_ops __ro_after_init;
+
+static void espintcp_data_ready(struct sock *sk)
+{
+	struct espintcp_ctx *ctx = espintcp_getctx(sk);
+
+	strp_data_ready(&ctx->strp);
+}
+
+static void espintcp_tx_work(struct work_struct *work)
+{
+	struct espintcp_ctx *ctx = container_of(work,
+						struct espintcp_ctx, work);
+	struct sock *sk = ctx->strp.sk;
+
+	lock_sock(sk);
+	if (!ctx->tx_running)
+		espintcp_push_msgs(sk);
+	release_sock(sk);
+}
+
+static void espintcp_write_space(struct sock *sk)
+{
+	struct espintcp_ctx *ctx = espintcp_getctx(sk);
+
+	schedule_work(&ctx->work);
+	ctx->saved_write_space(sk);
+}
+
+static void espintcp_destruct(struct sock *sk)
+{
+	struct espintcp_ctx *ctx = espintcp_getctx(sk);
+
+	kfree(ctx);
+}
+
+bool tcp_is_ulp_esp(struct sock *sk)
+{
+	return sk->sk_prot == &espintcp_prot;
+}
+EXPORT_SYMBOL_GPL(tcp_is_ulp_esp);
+
+static int espintcp_init_sk(struct sock *sk)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+	struct strp_callbacks cb = {
+		.rcv_msg = espintcp_rcv,
+		.parse_msg = espintcp_parse,
+	};
+	struct espintcp_ctx *ctx;
+	int err;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return -ENOMEM;
+
+	err = strp_init(&ctx->strp, sk, &cb);
+	if (err)
+		goto free;
+
+	__sk_dst_reset(sk);
+
+	strp_check_rcv(&ctx->strp);
+	skb_queue_head_init(&ctx->ike_queue);
+	skb_queue_head_init(&ctx->out_queue);
+	sk->sk_prot = &espintcp_prot;
+	sk->sk_socket->ops = &espintcp_ops;
+	ctx->saved_data_ready = sk->sk_data_ready;
+	ctx->saved_write_space = sk->sk_write_space;
+	sk->sk_data_ready = espintcp_data_ready;
+	sk->sk_write_space = espintcp_write_space;
+	sk->sk_destruct = espintcp_destruct;
+	icsk->icsk_ulp_data = ctx;
+	INIT_WORK(&ctx->work, espintcp_tx_work);
+
+	/* avoid using task_frag */
+	sk->sk_allocation = GFP_ATOMIC;
+
+	return 0;
+
+free:
+	kfree(ctx);
+	return err;
+}
+
+static void espintcp_release(struct sock *sk)
+{
+	struct espintcp_ctx *ctx = espintcp_getctx(sk);
+	struct sk_buff_head queue;
+	struct sk_buff *skb;
+
+	__skb_queue_head_init(&queue);
+	skb_queue_splice_init(&ctx->out_queue, &queue);
+
+	while ((skb = __skb_dequeue(&queue)))
+		espintcp_push_skb(sk, skb);
+
+	tcp_release_cb(sk);
+}
+
+static void espintcp_close(struct sock *sk, long timeout)
+{
+	struct espintcp_ctx *ctx = espintcp_getctx(sk);
+	struct espintcp_msg *emsg = &ctx->partial;
+
+	strp_stop(&ctx->strp);
+
+	sk->sk_prot = &tcp_prot;
+	barrier();
+
+	cancel_work_sync(&ctx->work);
+	strp_done(&ctx->strp);
+
+	skb_queue_purge(&ctx->out_queue);
+	skb_queue_purge(&ctx->ike_queue);
+
+	if (emsg->len) {
+		if (emsg->skb)
+			kfree_skb(emsg->skb);
+		else
+			sk_msg_free(sk, &emsg->skmsg);
+	}
+
+	tcp_close(sk, timeout);
+}
+
+static __poll_t espintcp_poll(struct file *file, struct socket *sock,
+			      poll_table *wait)
+{
+	__poll_t mask = datagram_poll(file, sock, wait);
+	struct sock *sk = sock->sk;
+	struct espintcp_ctx *ctx = espintcp_getctx(sk);
+
+	if (!skb_queue_empty(&ctx->ike_queue))
+		mask |= EPOLLIN | EPOLLRDNORM;
+
+	return mask;
+}
+
+static struct tcp_ulp_ops espintcp_ulp __read_mostly = {
+	.name = "espintcp",
+	.owner = THIS_MODULE,
+	.init = espintcp_init_sk,
+};
+
+void __init espintcp_init(void)
+{
+	memcpy(&espintcp_prot, &tcp_prot, sizeof(tcp_prot));
+	memcpy(&espintcp_ops, &inet_stream_ops, sizeof(inet_stream_ops));
+	espintcp_prot.sendmsg = espintcp_sendmsg;
+	espintcp_prot.recvmsg = espintcp_recvmsg;
+	espintcp_prot.close = espintcp_close;
+	espintcp_prot.release_cb = espintcp_release;
+	espintcp_ops.poll = espintcp_poll;
+
+	tcp_register_ulp(&espintcp_ulp);
+}
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index ec94f5795ea4..686307ed6920 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -39,6 +39,9 @@
 #ifdef CONFIG_XFRM_STATISTICS
 #include <net/snmp.h>
 #endif
+#ifdef CONFIG_XFRM_ESPINTCP
+#include <net/espintcp.h>
+#endif
 
 #include "xfrm_hash.h"
 
@@ -4155,6 +4158,10 @@ void __init xfrm_init(void)
 	seqcount_init(&xfrm_policy_hash_generation);
 	xfrm_input_init();
 
+#ifdef CONFIG_XFRM_ESPINTCP
+	espintcp_init();
+#endif
+
 	RCU_INIT_POINTER(xfrm_if_cb, NULL);
 	synchronize_rcu();
 }
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index c6f3c4a1bd99..acef2d54f869 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -668,6 +668,9 @@ int __xfrm_state_delete(struct xfrm_state *x)
 		net->xfrm.state_num--;
 		spin_unlock(&net->xfrm.xfrm_state_lock);
 
+		if (x->encap_sk)
+			sock_put(rcu_dereference_raw(x->encap_sk));
+
 		xfrm_dev_state_delete(x);
 
 		/* All xfrm_state objects are created by xfrm_state_alloc.
-- 
2.22.0
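
For readers following the framing done in espintcp_sendmsg() above: the
message is sent with a 2-byte big-endian length prefix, and the length
value counts the prefix itself (msglen = size + 2). Below is a minimal
userspace sketch of that framing only. It is not part of the patch;
sketch_frame() and SKETCH_MAX_PAYLOAD are hypothetical names made up for
illustration, and the limit is an assumption derived from the 16-bit
length field rather than the patch's own MAX_ESPINTCP_MSG definition.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical limit for this sketch: the 16-bit length field also counts
 * the 2-byte prefix, so the payload can be at most 65535 - 2 bytes. */
#define SKETCH_MAX_PAYLOAD (UINT16_MAX - 2)

/*
 * Write a length-prefixed frame into out: a 2-byte big-endian length
 * (payload size + 2) followed by the payload bytes.
 * Returns the total frame size, or 0 if the payload is too large or
 * out is too small to hold the frame.
 */
static size_t sketch_frame(const uint8_t *payload, size_t size,
			   uint8_t *out, size_t out_cap)
{
	size_t msglen = size + 2;

	if (size > SKETCH_MAX_PAYLOAD || out_cap < msglen)
		return 0;

	/* big-endian, same byte order cpu_to_be16() produces */
	out[0] = (uint8_t)(msglen >> 8);
	out[1] = (uint8_t)(msglen & 0xff);
	memcpy(out + 2, payload, size);

	return msglen;
}
```

A 3-byte payload therefore goes out as 5 bytes on the wire, with the
first two bytes holding the value 5.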

