Netdev List


* [PATCH net 1/4] nfp: ignore signals when communicating with management FW
From: Jakub Kicinski @ 2018-04-11 23:47 UTC
  To: davem; +Cc: netdev, oss-drivers, Jakub Kicinski
In-Reply-To: <20180411234738.6766-1-jakub.kicinski@netronome.com>

We currently allow signals to interrupt the wait for management FW
commands.  Exiting the wait should not cause trouble: the FW will
just finish executing the command in the background and new commands
will wait for the old one to finish.

However, this may not be what users expect (Ctrl-C not actually stopping
the command).  Moreover, some systems routinely request link information
with signals pending (Ubuntu 14.04 runs a landscape-sysinfo python tool
from MOTD), worrying users with errors like these:

nfp 0000:04:00.0: nfp_nsp: Error -512 waiting for code 0x0007 to start
nfp 0000:04:00.0: nfp: reading port table failed -512

Make the wait for management FW responses non-interruptible.

Fixes: 1a64821c6af7 ("nfp: add support for service processor access")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
---
 drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c
index 99bb679a9801..2abee0fe3a7c 100644
--- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c
@@ -281,8 +281,7 @@ nfp_nsp_wait_reg(struct nfp_cpp *cpp, u64 *reg, u32 nsp_cpp, u64 addr,
 		if ((*reg & mask) == val)
 			return 0;

-		if (msleep_interruptible(25))
-			return -ERESTARTSYS;
+		msleep(25);

 		if (time_after(start_time, wait_until))
 			return -ETIMEDOUT;
-- 
2.16.2
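
As an illustration of the difference between the two sleeps, here is a minimal userspace sketch (not the driver code) of a polling wait that a signal cannot cut short. The names wait_reg(), fake_reg and sleep_ms_ignoring_signals() are made up for this example. nanosleep() restarted on EINTR stands in for the kernel's msleep(), and the loop is bounded by a poll count rather than by jiffies.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for a FW status register: it reports "done"
 * (bit 0 set) only after polls_left reads have returned 0. */
struct fake_reg {
	unsigned int polls_left;
};

uint64_t fake_reg_read(void *ctx)
{
	struct fake_reg *r = ctx;

	if (r->polls_left == 0)
		return 0x1;
	r->polls_left--;
	return 0;
}

/* Sleep for the whole interval.  If a signal interrupts nanosleep(),
 * resume with the remaining time instead of returning early. */
static void sleep_ms_ignoring_signals(long ms)
{
	struct timespec ts = {
		.tv_sec = ms / 1000,
		.tv_nsec = (ms % 1000) * 1000000L,
	};

	while (nanosleep(&ts, &ts) == -1 && errno == EINTR)
		;
}

/* Poll until (reg & mask) == val, giving up after max_polls reads.
 * Returns 0 on success and -ETIMEDOUT otherwise; there is no
 * "interrupted" return path. */
int wait_reg(uint64_t (*read_reg)(void *), void *ctx, uint64_t mask,
	     uint64_t val, long poll_ms, unsigned int max_polls)
{
	unsigned int i;

	assert(read_reg != NULL);

	for (i = 0; i < max_polls; i++) {
		if ((read_reg(ctx) & mask) == val)
			return 0;
		sleep_ms_ignoring_signals(poll_ms);
	}
	return -ETIMEDOUT;
}
```

With this shape the caller only ever sees success or a timeout, which is the behaviour the patch aims for.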


* [PATCH net 0/4] nfp: improve signal handing on FW waits and flower control message processing
From: Jakub Kicinski @ 2018-04-11 23:47 UTC
  To: davem; +Cc: netdev, oss-drivers, Jakub Kicinski

Hi!

The first part of this set aims to improve handling of interrupted
waits.  Patch 1 makes waiting for management FW responses
uninterruptible, while patch 2 adds a message when a signal arrives
while waiting for an NFP mutex.  We can't interrupt execution of
FW commands, so uninterruptible sleep seems reasonable there.
Exiting a wait for a mutex should be clean and have no side effects,
so we allow it to be aborted.  Note that both waits have rather
large timeouts (tens of seconds).

Patches 3 and 4 improve flower offload operation under heavy load.
Currently there is no cap on the number of queued FW notifications.
Some of the notifications have to be processed from a workqueue,
which may lead to a very large number of messages getting queued
if the workqueue never gets a chance to run.  Pieter puts a limit
on the number of queued messages, tries to drop some messages we
ignore without queuing them, and processes more important messages first.

Jakub Kicinski (2):
  nfp: ignore signals when communicating with management FW
  nfp: print a message when mutex wait is interrupted

Pieter Jansen van Vuuren (2):
  nfp: flower: move route ack control messages out of the workqueue
  nfp: flower: split and limit cmsg skb lists

 drivers/net/ethernet/netronome/nfp/flower/cmsg.c   | 44 ++++++++++++++++++----
 drivers/net/ethernet/netronome/nfp/flower/cmsg.h   |  2 +
 drivers/net/ethernet/netronome/nfp/flower/main.c   |  6 ++-
 drivers/net/ethernet/netronome/nfp/flower/main.h   |  8 +++-
 .../net/ethernet/netronome/nfp/nfpcore/nfp_mutex.c |  5 ++-
 .../net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c   |  3 +-
 6 files changed, 54 insertions(+), 14 deletions(-)

-- 
2.16.2
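
The idea behind patches 3 and 4 (bounded lists, early drop of ignored messages, important messages first) can be sketched in plain C. This is only an illustration under assumptions: the queue capacity, the three priority classes and all names below are hypothetical and are not taken from the flower code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAP 4	/* hypothetical limit, not the driver's value */

enum msg_prio { MSG_IGNORED, MSG_LOW, MSG_HIGH };

struct msg {
	enum msg_prio prio;
	int id;
};

/* Fixed-size FIFO ring buffer. */
struct ring {
	struct msg slots[QUEUE_CAP];
	size_t head, len;
};

struct ctrl_queues {
	struct ring high, low;
	unsigned long dropped;
};

static bool ring_push(struct ring *r, struct msg m)
{
	if (r->len == QUEUE_CAP)
		return false;
	r->slots[(r->head + r->len) % QUEUE_CAP] = m;
	r->len++;
	return true;
}

static bool ring_pop(struct ring *r, struct msg *out)
{
	if (r->len == 0)
		return false;
	*out = r->slots[r->head];
	r->head = (r->head + 1) % QUEUE_CAP;
	r->len--;
	return true;
}

/* Queue a message on the list matching its priority.  Ignored messages
 * are dropped without queuing, as are messages for a full list. */
bool ctrl_enqueue(struct ctrl_queues *q, struct msg m)
{
	bool queued = false;

	assert(q != NULL);

	if (m.prio == MSG_HIGH)
		queued = ring_push(&q->high, m);
	else if (m.prio == MSG_LOW)
		queued = ring_push(&q->low, m);

	if (!queued)
		q->dropped++;
	return queued;
}

/* Drain the high-priority list before touching the low-priority one. */
bool ctrl_dequeue(struct ctrl_queues *q, struct msg *out)
{
	return ring_pop(&q->high, out) || ring_pop(&q->low, out);
}
```

Splitting the queue in two keeps a flood of low-priority notifications from delaying or displacing the important ones.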


* Re: [RFC net-next 1/2] net: net-porcfs: Reduce rcu lock critical section
From: Saeed Mahameed @ 2018-04-11 23:47 UTC
  To: eric.dumazet@gmail.com, davem@davemloft.net; +Cc: netdev@vger.kernel.org
In-Reply-To: <f1b3bc40-cfdf-5d89-9cfc-cf6996f99d8c@gmail.com>

On Wed, 2018-04-11 at 15:30 -0700, Eric Dumazet wrote:
> 
> On 04/11/2018 11:59 AM, Saeed Mahameed wrote:
> > On Tue, 2018-04-10 at 13:35 -0700, Eric Dumazet wrote:
> > > 
> > > On 04/10/2018 10:16 AM, David Miller wrote:
> > > > 
> > > > The tradeoff here is that now you are doing two unnecessary
> > > > atomic
> > > > operations per stats dump.
> > > > 
> > > > That is what the RCU lock allows us to avoid.
> > > > 
> > > 
> > > dev_hold() and dev_put() are actually per cpu increment and
> > > decrements,
> > > pretty cheap these days.
> > > 
> > 
> > Yes, i am adding only 2 cpu instructions here.
> > I think the trade-off here is too small and the price to finally
> > have
> > get_stats64 called from non atomic context is really worth it.
> 
> Oh... but you have not mentioned this reason in your changelog.
> 

From the commit message:

"This is really greedy and demanding from device drivers since
ndo_get_stats64 called from dev_seq_show while the rcu lock is held"

Sorry if this wasn't clear enough; I will fix it if this goes through.

> What about bonding stats ?
> 
> By sending partial patches like that, others have to take care of the
> details
> and this is not really acceptable.
> 

This is an RFC just to show the approach. If the approach is acceptable,
I will of course provide a full series that handles all the other
places; the change should be the same. I already have 2 other patches
to address ovs and netlink stats, I just didn't want to waste your time
on small details like netlink messages.

> > 
> > It  looks really odd to me that the device chain locks are held for
> > such long periods, while we already have the means to avoid this,
> > same
> > goes for rtnl_lock, same trick can work here for many use cases and
> > many ndos, we are just over protective for no reasons.
> > 
> > 
> > > Problem here is that any preemption of the process holding device
> > > reference
> > > might trigger warnings in device unregister.
> > > 
> > 
> > This is true for any other place dev_hold is used,
> > as explained in the commit message dev_hold is used for a very
> > brief
> > moment before calling the stats ndo and released directly after.
> 
> Not really.
> 
> Other places usually have notifiers to remove the refcount when
> needed.
> 

Other places hold the netdev for its whole lifetime and don't know
when to release it, which is why they need notifiers.

In this patch the approach is:

hold
call ndo
put

A notifier is not needed in this case.
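
The hold / call ndo / put sequence can be modelled in userspace as follows. This is a sketch under assumptions, not kernel code: struct fake_dev, fake_dev_hold(), fake_dev_put() and read_stats_held() are invented names, and a single C11 atomic counter stands in for the per-cpu refcount mentioned earlier in the thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct fake_stats {
	unsigned long rx_packets;
};

struct fake_dev {
	atomic_int refcnt;
	unsigned long hw_rx_packets;
	int refcnt_seen_by_ndo;	/* recorded for demonstration only */
	void (*get_stats)(struct fake_dev *dev, struct fake_stats *out);
};

void fake_dev_hold(struct fake_dev *dev)
{
	atomic_fetch_add(&dev->refcnt, 1);
}

void fake_dev_put(struct fake_dev *dev)
{
	atomic_fetch_sub(&dev->refcnt, 1);
}

/* Example callback: a real driver might sleep here while querying FW. */
void fake_get_stats(struct fake_dev *dev, struct fake_stats *out)
{
	dev->refcnt_seen_by_ndo = atomic_load(&dev->refcnt);
	out->rx_packets = dev->hw_rx_packets;
}

/* The reference is held only for the duration of the callback, so no
 * notifier is needed to find and release it later. */
void read_stats_held(struct fake_dev *dev, struct fake_stats *out)
{
	assert(dev != NULL && out != NULL && dev->get_stats != NULL);

	fake_dev_hold(dev);		/* hold */
	dev->get_stats(dev, out);	/* call ndo */
	fake_dev_put(dev);		/* put */
}
```

The sketch leaves out the part under discussion in this thread, namely what happens if the caller is preempted for a long time between hold and put.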

> We worked quite hard in the past to remove extra dev_hold()
> (before we finally converted it to percpu counter)
> 
> > 
> > looking at 
> > 
> > netdev_wait_allrefs(...)
> > [...]
> > 	msleep(250);
> > 
> > 	refcnt = netdev_refcnt_read(dev);
> > 
> > 	if (time_after(jiffies, warning_time + 10 * HZ)) {
> > 		pr_emerg("unregister_netdevice: waiting for %s to
> > become free. Usage count = %d\n",
> > 			 dev->name, refcnt);
> > 		warning_time = jiffies;
> > 	}
> > 
> > The holder will get enough time to release the netdev way before
> > the
> > warning is triggered.
> > 
> > The warning is triggered only if someone holds the dev for more
> > than 10
> > seconds which is impossible for the stats ndo to take more than
> > this,
> > in fact i just did a quick measurement and it seems that in average
> > get_stats64 ndo takes 0.5us !
> 
> Average is nice, but what about max time ?
> 

Well, if we allow devices to access HW counters via FW command
interfaces in ndo_get_stats, then judging by mlx5, where we query up to 5
hw registers, it could take 100us. Still, this is way smaller than 10sec
 :) and it is really a nice rate to fetch HW stats on demand.

> Sleeping more than 10 seconds to satisfy GFP_KERNEL memory allocation
> can definitely
> happen in the real world, or simply if you get preempted by some
> RT/high prio tasks.
> 

The same issue can occur without this patch if an IRQ is triggered
while under the rcu lock.
And an RT/high prio task shouldn't take more than 10sec.

If a GFP_KERNEL memory allocation takes more than 10sec, you will
already get tons of OOM warnings.
A netdev warning, in the case where the netdev really is being
unregistered and ndo_get_stats was preempted for more than 10 seconds,
will be the least of your problems.

> Just say no to 'might sleep in ndo_get_stats()', and you will save a
> lot of issues.

We are just being over protective here.

Saying no to 'might sleep in ndo_get_stats()' by itself creates a
lot of issues: many drivers out there have a background thread running
N times a second just to cache HW stats, which is really silly and
consumes CPU for no reason.

The rate of ndo_get_stats should be equal to the rate at which the driver
can actually provide fresh stats; any background thread is just a waste
of CPU. Counters should be fetched from HW on demand rather than
periodically for no reason.

The same goes for set_rx_mode ndo.

Bottom line: it looks like the rcu locking today is only needed
to synchronize access to the netdev ndo with the device chain in
question (namespace/ovs/bonding/etc.).

There are many places where dev_get_stats is called from non-atomic
context, where the caller knows it is safe to access the netdev ndo.

Example:
drivers/net/bonding/bond_main.c: bond_enslave(..)

/* initialize slave stats */
dev_get_stats(new_slave->dev, &new_slave->slave_stats);


* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
From: Jesus Sanchez-Palencia @ 2018-04-11 23:38 UTC
  To: Thomas Gleixner
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	anna-maria, henrik, John Stultz, levi.pearson, edumazet, willemb,
	mlichvar
In-Reply-To: <alpine.DEB.2.21.1804112209450.1564@nanos.tec.linutronix.de>

Hi,

On 04/11/2018 01:16 PM, Thomas Gleixner wrote:
>>>> Putting it all together, we end up with:
>>>>
>>>> 1) a new txtime aware qdisc, tbs, to be used per queue. Its cli will look like:
>>>> $ tc qdisc add (...) tbs clockid CLOCK_REALTIME delta 150000 offload sorting
>>>
>>> Why CLOCK_REALTIME? The only interesting time in a TSN network is
>>> CLOCK_TAI, really.
>>
>> REALTIME was just an example here to show that the qdisc has to be configured
>> with a clockid parameter. Are you suggesting that instead both of the new qdiscs
>> (i.e. tbs and taprio) should always be using CLOCK_TAI implicitly?
> 
> I think so. It's _the_ network time on which everything is based.

Yes, but more on this below.

> 
>>>> 2) a new cmsg-interface for setting a per-packet timestamp that will be used
>>>> either as a txtime or as deadline by tbs (and further the NIC driver for the
>>>> offload case): SCM_TXTIME.
>>>>
>>>> 3) a new socket option: SO_TXTIME. It will be used to enable the feature for a
>>>> socket, and will have as parameters a clockid and a txtime mode (deadline or
>>>> explicit), that defines the semantics of the timestamp set on packets using
>>>> SCM_TXTIME.
>>>>
>>>> 4) a new #define DYNAMIC_CLOCKID 15 added to include/uapi/linux/time.h .
>>>
>>> Can you remind me why we would need that?
>>
>> So there is a "clockid" that can be used for the full hw offload modes. On this
>> case, the txtimes are in reference to the NIC's PTP clock, and, as discussed, we
>> can't just use a clockid that was computed from the fd pointing to /dev/ptpX .
> 
> And the NICs PTP clock is CLOCK_TAI, so there should be no reason to have
> yet another clock, right?

Just breaking this down a bit: yes, TAI is the network time base, and the NICs'
PTP clocks use that because PTP is (commonly) based on TAI. After the PHCs have
been synchronized over the network (e.g. with ptp4l), my understanding is that
if applications want to use the clockid_t CLOCK_TAI as a network clock reference,
it's required that something (e.g. phc2sys) is synchronizing the PHCs and the
system clock, and also that something calls adjtimex() to apply the TAI vs UTC
offset to CLOCK_TAI.

If we are fine with those 'dependencies', then I agree there is no need for
another clock.
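
One way to check the second of those dependencies from userspace is to compare CLOCK_TAI with CLOCK_REALTIME. The helper below is a rough sketch assuming Linux; tai_minus_realtime_seconds() is a made-up name. If nothing has applied the TAI vs UTC offset, the two clocks read the same and the reported difference is 0.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <time.h>

#ifndef CLOCK_TAI
#define CLOCK_TAI 11	/* Linux clock id; some libc headers do not define it */
#endif

/* Store CLOCK_TAI - CLOCK_REALTIME, rounded to whole seconds, in *offset.
 * Returns 0 on success and -1 if either clock cannot be read. */
int tai_minus_realtime_seconds(long *offset)
{
	struct timespec rt, tai;
	double diff;

	assert(offset != NULL);

	if (clock_gettime(CLOCK_REALTIME, &rt) != 0 ||
	    clock_gettime(CLOCK_TAI, &tai) != 0)
		return -1;

	diff = (double)(tai.tv_sec - rt.tv_sec) +
	       (double)(tai.tv_nsec - rt.tv_nsec) / 1e9;
	*offset = (long)(diff + 0.5);	/* round to whole seconds */
	return 0;
}
```

A result of 0 suggests the offset has not been configured on that system, in which case CLOCK_TAI is not yet usable as the network time reference.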

I was thinking about the full offload use-cases, i.e. when no scheduling is
happening inside the qdiscs. Applications could just read the time from the PHC
clocks directly without having to rely on any of the above. In this case,
userspace would use DYNAMIC_CLOCKID just to flag that this is the case, but I must
admit it's not clear to me how common a use-case that is, or even if it makes
sense.

Thanks,
Jesus

> 
> Thanks,
> 
> 	tglx
> 


* [net  1/1] tipc: fix missing initializer in tipc_sendmsg()
From: Jon Maloy @ 2018-04-11 23:15 UTC
  To: davem, netdev
  Cc: mohan.krishna.ghanta.krishnamurthy, tung.q.nguyen, hoang.h.le,
	jon.maloy, canh.d.luu, ying.xue, tipc-discussion

The stack variable 'dnode' in __tipc_sendmsg() may theoretically
end up in tipc_node_get_mtu() as an uninitialized variable.

We fix this by initializing the variable at declaration. We also add
a default else clause to the two conditional ones already there, so
that we never end up in the named function if the given address
type is illegal.

Reported-by: syzbot+b0975ce9355b347c1546@syzkaller.appspotmail.com
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
---
 net/tipc/socket.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 1fd1c8b..252a52ae 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -1278,7 +1278,7 @@ static int __tipc_sendmsg(struct socket *sock, struct msghdr *m, size_t dlen)
 	struct tipc_msg *hdr = &tsk->phdr;
 	struct tipc_name_seq *seq;
 	struct sk_buff_head pkts;
-	u32 dnode, dport;
+	u32 dport, dnode = 0;
 	u32 type, inst;
 	int mtu, rc;
 
@@ -1348,6 +1348,8 @@ static int __tipc_sendmsg(struct socket *sock, struct msghdr *m, size_t dlen)
 		msg_set_destnode(hdr, dnode);
 		msg_set_destport(hdr, dest->addr.id.ref);
 		msg_set_hdr_sz(hdr, BASIC_H_SIZE);
+	} else {
+		return -EINVAL;
 	}
 
 	/* Block or return if destination link is congested */
-- 
2.1.4
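
The pattern applied by this fix, reduced to a standalone sketch: initialize the variable at declaration and reject unknown address types before the value is used. The address type constants, struct fake_dest and resolve_dnode() are hypothetical and only mirror the shape of the change, not the TIPC code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical address types for illustration. */
enum fake_addrtype { FAKE_ADDR_NAME = 1, FAKE_ADDR_ID = 2 };

struct fake_dest {
	int addrtype;
	uint32_t node;
};

/* Resolve the destination node for a message.  Returns 0 and writes
 * *dnode_out on success, or -EINVAL for an unknown address type. */
int resolve_dnode(const struct fake_dest *dest, uint32_t *dnode_out)
{
	uint32_t dnode = 0;	/* initialized at declaration */

	assert(dest != NULL && dnode_out != NULL);

	if (dest->addrtype == FAKE_ADDR_NAME) {
		dnode = 0x1001;	/* stand-in for a name table lookup */
	} else if (dest->addrtype == FAKE_ADDR_ID) {
		dnode = dest->node;
	} else {
		return -EINVAL;	/* default clause: never use a bogus dnode */
	}

	*dnode_out = dnode;
	return 0;
}
```

Either measure alone would silence the report; together they also make the illegal case an explicit error.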


* Re: [PATCH net-next 1/5] virtio: Add support for SCTP checksum offloading
From: Michael S. Tsirkin @ 2018-04-11 22:49 UTC
  To: Vladislav Yasevich
  Cc: netdev, linux-sctp, virtualization, jasowang, nhorman,
	Vladislav Yasevich
In-Reply-To: <20180402134006.10111-2-vyasevic@redhat.com>

On Mon, Apr 02, 2018 at 09:40:02AM -0400, Vladislav Yasevich wrote:
> To support SCTP checksum offloading, we need to add a new feature
> to virtio_net, so we can negotiate support between the hypervisor
> and the guest.
> 
> The signalling to the guest that an alternate checksum needs to
> be used is done via a new flag in the virtio_net_hdr.  If the
> flag is set, the host will know to perform an alternate checksum
> calculation, which right now is only CRC32c.
> 
> Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
> ---
>  drivers/net/virtio_net.c        | 11 ++++++++---
>  include/linux/virtio_net.h      |  6 ++++++
>  include/uapi/linux/virtio_net.h |  2 ++
>  3 files changed, 16 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 7b187ec..b601294 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -2724,9 +2724,14 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	/* Do we support "hardware" checksums? */
>  	if (virtio_has_feature(vdev, VIRTIO_NET_F_CSUM)) {
>  		/* This opens up the world of extra features. */
> -		dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_SG;
> +		netdev_features_t sctp = 0;
> +
> +		if (virtio_has_feature(vdev, VIRTIO_NET_F_SCTP_CSUM))
> +			sctp |= NETIF_F_SCTP_CRC;
> +
> +		dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_SG | sctp;
>  		if (csum)
> -			dev->features |= NETIF_F_HW_CSUM | NETIF_F_SG;
> +			dev->features |= NETIF_F_HW_CSUM | NETIF_F_SG | sctp;
>  
>  		if (virtio_has_feature(vdev, VIRTIO_NET_F_GSO)) {
>  			dev->hw_features |= NETIF_F_TSO
> @@ -2952,7 +2957,7 @@ static struct virtio_device_id id_table[] = {
>  };
>  
>  #define VIRTNET_FEATURES \
> -	VIRTIO_NET_F_CSUM, VIRTIO_NET_F_GUEST_CSUM, \
> +	VIRTIO_NET_F_CSUM, VIRTIO_NET_F_GUEST_CSUM,  VIRTIO_NET_F_SCTP_CSUM, \
>  	VIRTIO_NET_F_MAC, \
>  	VIRTIO_NET_F_HOST_TSO4, VIRTIO_NET_F_HOST_UFO, VIRTIO_NET_F_HOST_TSO6, \
>  	VIRTIO_NET_F_HOST_ECN, VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6, \
> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> index f144216..2e7a64a 100644
> --- a/include/linux/virtio_net.h
> +++ b/include/linux/virtio_net.h
> @@ -39,6 +39,9 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
>  
>  		if (!skb_partial_csum_set(skb, start, off))
>  			return -EINVAL;
> +
> +		if (hdr->flags & VIRTIO_NET_HDR_F_CSUM_NOT_INET)
> +			skb->csum_not_inet = 1;
>  	}
>  
>  	if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
> @@ -96,6 +99,9 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb,
>  		hdr->flags = VIRTIO_NET_HDR_F_DATA_VALID;
>  	} /* else everything is zero */
>  
> +	if (skb->csum_not_inet)
> +		hdr->flags &= VIRTIO_NET_HDR_F_CSUM_NOT_INET;
> +
>  	return 0;
>  }
>  
> diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
> index 5de6ed3..3f279c8 100644
> --- a/include/uapi/linux/virtio_net.h
> +++ b/include/uapi/linux/virtio_net.h
> @@ -36,6 +36,7 @@
>  #define VIRTIO_NET_F_GUEST_CSUM	1	/* Guest handles pkts w/ partial csum */
>  #define VIRTIO_NET_F_CTRL_GUEST_OFFLOADS 2 /* Dynamic offload configuration. */
>  #define VIRTIO_NET_F_MTU	3	/* Initial MTU advice */
> +#define VIRTIO_NET_F_SCTP_CSUM  4	/* SCTP checksum offload support */
>  #define VIRTIO_NET_F_MAC	5	/* Host has given MAC address. */
>  #define VIRTIO_NET_F_GUEST_TSO4	7	/* Guest can handle TSOv4 in. */
>  #define VIRTIO_NET_F_GUEST_TSO6	8	/* Guest can handle TSOv6 in. */

Is this a guest or a host checksum? We should differentiate between the
two.


> @@ -101,6 +102,7 @@ struct virtio_net_config {
>  struct virtio_net_hdr_v1 {
>  #define VIRTIO_NET_HDR_F_NEEDS_CSUM	1	/* Use csum_start, csum_offset */
>  #define VIRTIO_NET_HDR_F_DATA_VALID	2	/* Csum is valid */
> +#define VIRTIO_NET_HDR_F_CSUM_NOT_INET  4       /* Checksum is not inet */
>  	__u8 flags;
>  #define VIRTIO_NET_HDR_GSO_NONE		0	/* Not a GSO frame */
>  #define VIRTIO_NET_HDR_GSO_TCPV4	1	/* GSO frame, IPv4 TCP (TSO) */
> -- 
> 2.9.5


* Re: [PATCH net-next 1/5] virtio: Add support for SCTP checksum offloading
From: Marcelo Ricardo Leitner @ 2018-04-11 22:39 UTC
  To: Vladislav Yasevich
  Cc: netdev, linux-sctp, virtualization, mst, jasowang, nhorman,
	Vladislav Yasevich
In-Reply-To: <20180402134006.10111-2-vyasevic@redhat.com>

On Mon, Apr 02, 2018 at 09:40:02AM -0400, Vladislav Yasevich wrote:
> To support SCTP checksum offloading, we need to add a new feature
> to virtio_net, so we can negotiate support between the hypervisor
> and the guest.
>
> The signalling to the guest that an alternate checksum needs to
> be used is done via a new flag in the virtio_net_hdr.  If the
> flag is set, the host will know to perform an alternate checksum
> calculation, which right now is only CRC32c.
>
> Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
> ---
>  drivers/net/virtio_net.c        | 11 ++++++++---
>  include/linux/virtio_net.h      |  6 ++++++
>  include/uapi/linux/virtio_net.h |  2 ++
>  3 files changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 7b187ec..b601294 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -2724,9 +2724,14 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	/* Do we support "hardware" checksums? */
>  	if (virtio_has_feature(vdev, VIRTIO_NET_F_CSUM)) {
>  		/* This opens up the world of extra features. */
> -		dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_SG;
> +		netdev_features_t sctp = 0;
> +
> +		if (virtio_has_feature(vdev, VIRTIO_NET_F_SCTP_CSUM))
> +			sctp |= NETIF_F_SCTP_CRC;
> +
> +		dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_SG | sctp;
>  		if (csum)
> -			dev->features |= NETIF_F_HW_CSUM | NETIF_F_SG;
> +			dev->features |= NETIF_F_HW_CSUM | NETIF_F_SG | sctp;
>
>  		if (virtio_has_feature(vdev, VIRTIO_NET_F_GSO)) {
>  			dev->hw_features |= NETIF_F_TSO
> @@ -2952,7 +2957,7 @@ static struct virtio_device_id id_table[] = {
>  };
>
>  #define VIRTNET_FEATURES \
> -	VIRTIO_NET_F_CSUM, VIRTIO_NET_F_GUEST_CSUM, \
> +	VIRTIO_NET_F_CSUM, VIRTIO_NET_F_GUEST_CSUM,  VIRTIO_NET_F_SCTP_CSUM, \
>  	VIRTIO_NET_F_MAC, \
>  	VIRTIO_NET_F_HOST_TSO4, VIRTIO_NET_F_HOST_UFO, VIRTIO_NET_F_HOST_TSO6, \
>  	VIRTIO_NET_F_HOST_ECN, VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6, \
> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> index f144216..2e7a64a 100644
> --- a/include/linux/virtio_net.h
> +++ b/include/linux/virtio_net.h
> @@ -39,6 +39,9 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
>
>  		if (!skb_partial_csum_set(skb, start, off))
>  			return -EINVAL;
> +
> +		if (hdr->flags & VIRTIO_NET_HDR_F_CSUM_NOT_INET)
> +			skb->csum_not_inet = 1;
>  	}
>
>  	if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
> @@ -96,6 +99,9 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb,
>  		hdr->flags = VIRTIO_NET_HDR_F_DATA_VALID;
>  	} /* else everything is zero */
>
> +	if (skb->csum_not_inet)
> +		hdr->flags &= VIRTIO_NET_HDR_F_CSUM_NOT_INET;

Shouldn't this be |= instead?

> +
>  	return 0;
>  }
>
> diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
> index 5de6ed3..3f279c8 100644
> --- a/include/uapi/linux/virtio_net.h
> +++ b/include/uapi/linux/virtio_net.h
> @@ -36,6 +36,7 @@
>  #define VIRTIO_NET_F_GUEST_CSUM	1	/* Guest handles pkts w/ partial csum */
>  #define VIRTIO_NET_F_CTRL_GUEST_OFFLOADS 2 /* Dynamic offload configuration. */
>  #define VIRTIO_NET_F_MTU	3	/* Initial MTU advice */
> +#define VIRTIO_NET_F_SCTP_CSUM  4	/* SCTP checksum offload support */
>  #define VIRTIO_NET_F_MAC	5	/* Host has given MAC address. */
>  #define VIRTIO_NET_F_GUEST_TSO4	7	/* Guest can handle TSOv4 in. */
>  #define VIRTIO_NET_F_GUEST_TSO6	8	/* Guest can handle TSOv6 in. */
> @@ -101,6 +102,7 @@ struct virtio_net_config {
>  struct virtio_net_hdr_v1 {
>  #define VIRTIO_NET_HDR_F_NEEDS_CSUM	1	/* Use csum_start, csum_offset */
>  #define VIRTIO_NET_HDR_F_DATA_VALID	2	/* Csum is valid */
> +#define VIRTIO_NET_HDR_F_CSUM_NOT_INET  4       /* Checksum is not inet */
>  	__u8 flags;
>  #define VIRTIO_NET_HDR_GSO_NONE		0	/* Not a GSO frame */
>  #define VIRTIO_NET_HDR_GSO_TCPV4	1	/* GSO frame, IPv4 TCP (TSO) */
> --
> 2.9.5
>
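
To make the &= versus |= question above concrete, here is a small standalone sketch. The flag values are the ones from the quoted patch; the macro and function names are shortened and invented for this example.

```c
#include <assert.h>
#include <stdint.h>

#define HDR_F_NEEDS_CSUM	1
#define HDR_F_DATA_VALID	2
#define HDR_F_CSUM_NOT_INET	4	/* value proposed in the quoted patch */
#define HDR_F_KNOWN		(HDR_F_NEEDS_CSUM | HDR_F_DATA_VALID | \
				 HDR_F_CSUM_NOT_INET)

/* What the patch does: AND with the new bit.  Since the bit is not set
 * yet, this clears every flag, including the ones set earlier. */
uint8_t mark_not_inet_and(uint8_t flags)
{
	assert((flags & ~HDR_F_KNOWN) == 0);
	flags &= HDR_F_CSUM_NOT_INET;
	return flags;
}

/* What the reply suggests: OR the new bit in and keep the others. */
uint8_t mark_not_inet_or(uint8_t flags)
{
	assert((flags & ~HDR_F_KNOWN) == 0);
	flags |= HDR_F_CSUM_NOT_INET;
	return flags;
}
```

Starting from flags == HDR_F_NEEDS_CSUM, the &= variant yields 0 and the |= variant yields 5, so only the latter tells the other side both that a checksum is needed and that it is not an inet checksum.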


* Re: [RFC net-next 1/2] net: net-porcfs: Reduce rcu lock critical section
From: Eric Dumazet @ 2018-04-11 22:30 UTC
  To: Saeed Mahameed, davem@davemloft.net; +Cc: netdev@vger.kernel.org
In-Reply-To: <1523473143.3402.55.camel@mellanox.com>



On 04/11/2018 11:59 AM, Saeed Mahameed wrote:
> On Tue, 2018-04-10 at 13:35 -0700, Eric Dumazet wrote:
>>
>> On 04/10/2018 10:16 AM, David Miller wrote:
>>>
>>> The tradeoff here is that now you are doing two unnecessary atomic
>>> operations per stats dump.
>>>
>>> That is what the RCU lock allows us to avoid.
>>>
>>
>> dev_hold() and dev_put() are actually per cpu increment and
>> decrements,
>> pretty cheap these days.
>>
> 
> Yes, i am adding only 2 cpu instructions here.
> I think the trade-off here is too small and the price to finally have
> get_stats64 called from non atomic context is really worth it.

Oh... but you have not mentioned this reason in your changelog.

What about bonding stats ?

By sending partial patches like that, others have to take care of the details
and this is not really acceptable.

> 
> It  looks really odd to me that the device chain locks are held for
> such long periods, while we already have the means to avoid this, same
> goes for rtnl_lock, same trick can work here for many use cases and
> many ndos, we are just over protective for no reasons.
> 
> 
>> Problem here is that any preemption of the process holding device
>> reference
>> might trigger warnings in device unregister.
>>
> 
> This is true for any other place dev_hold is used,
> as explained in the commit message dev_hold is used for a very brief
> moment before calling the stats ndo and released directly after.

Not really.

Other places usually have notifiers to remove the refcount when needed.

We worked quite hard in the past to remove extra dev_hold()
(before we finally converted it to percpu counter)

> 
> looking at 
> 
> netdev_wait_allrefs(...)
> [...]
> 	msleep(250);
> 
> 	refcnt = netdev_refcnt_read(dev);
> 
> 	if (time_after(jiffies, warning_time + 10 * HZ)) {
> 		pr_emerg("unregister_netdevice: waiting for %s to
> become free. Usage count = %d\n",
> 			 dev->name, refcnt);
> 		warning_time = jiffies;
> 	}
> 
> The holder will get enough time to release the netdev way before the
> warning is triggered.
> 
> The warning is triggered only if someone holds the dev for more than 10
> seconds which is impossible for the stats ndo to take more than this,
> in fact i just did a quick measurement and it seems that in average
> get_stats64 ndo takes 0.5us !

Average is nice, but what about max time ?

Sleeping more than 10 seconds to satisfy GFP_KERNEL memory allocation can definitely
happen in the real world, or simply if you get preempted by some RT/high prio tasks.

Just say no to 'might sleep in ndo_get_stats()', and you will save a lot of issues.


* Re: [PATCH bpf-next v8 05/11] seccomp,landlock: Enforce Landlock programs per process hierarchy
From: Mickaël Salaün @ 2018-04-11 22:18 UTC
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Daniel Borkmann, LKML, Alexei Starovoitov,
	Arnaldo Carvalho de Melo, Casey Schaufler, David Drysdale,
	David S . Miller, Eric W . Biederman, Jann Horn, Jonathan Corbet,
	Michael Kerrisk, Kees Cook, Paul Moore, Sargun Dhillon,
	Serge E . Hallyn, Shuah Khan, Tejun Heo, Thomas Graf,
	Tycho Andersen, Will Drewry, Kernel 
In-Reply-To: <20180410044821.tllxbaq2uj6gtzpn@ast-mbp.dhcp.thefacebook.com>




On 04/10/2018 06:48 AM, Alexei Starovoitov wrote:
> On Mon, Apr 09, 2018 at 12:01:59AM +0200, Mickaël Salaün wrote:
>>
>> On 04/08/2018 11:06 PM, Andy Lutomirski wrote:
>>> On Sun, Apr 8, 2018 at 6:13 AM, Mickaël Salaün <mic@digikod.net> wrote:
>>>>
>>>> On 02/27/2018 10:48 PM, Mickaël Salaün wrote:
>>>>>
>>>>> On 27/02/2018 17:39, Andy Lutomirski wrote:
>>>>>> On Tue, Feb 27, 2018 at 5:32 AM, Alexei Starovoitov
>>>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>>>> On Tue, Feb 27, 2018 at 05:20:55AM +0000, Andy Lutomirski wrote:
>>>>>>>> On Tue, Feb 27, 2018 at 4:54 AM, Alexei Starovoitov
>>>>>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>>>>>> On Tue, Feb 27, 2018 at 04:40:34AM +0000, Andy Lutomirski wrote:
>>>>>>>>>> On Tue, Feb 27, 2018 at 2:08 AM, Alexei Starovoitov
>>>>>>>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>>>>>>>> On Tue, Feb 27, 2018 at 01:41:15AM +0100, Mickaël Salaün wrote:
>>>>>>>>>>>> The seccomp(2) syscall can be used by a task to apply a Landlock program
>>>>>>>>>>>> to itself. As a seccomp filter, a Landlock program is enforced for the
>>>>>>>>>>>> current task and all its future children. A program is immutable and a
>>>>>>>>>>>> task can only add new restricting programs to itself, forming a list of
>>>>>>>>>>>> programs.
>>>>>>>>>>>>
>>>>>>>>>>>> A Landlock program is tied to a Landlock hook. If the action on a kernel
>>>>>>>>>>>> object is allowed by the other Linux security mechanisms (e.g. DAC,
>>>>>>>>>>>> capabilities, other LSM), then a Landlock hook related to this kind of
>>>>>>>>>>>> object is triggered. The list of programs for this hook is then
>>>>>>>>>>>> evaluated. Each program returns a 32-bit value which can deny the action
>>>>>>>>>>>> on a kernel object with a non-zero value. If every program in the list
>>>>>>>>>>>> returns zero, then the action on the object is allowed.
>>>>>>>>>>>>
>>>>>>>>>>>> Multiple Landlock programs can be chained to share a 64-bit value for a
>>>>>>>>>>>> call chain (e.g. evaluating multiple elements of a file path).  This
>>>>>>>>>>>> chaining is restricted when a process constructs this chain by loading a
>>>>>>>>>>>> program, but additional checks are performed when it requests to apply
>>>>>>>>>>>> this chain of programs to itself.  The restrictions ensure that it is
>>>>>>>>>>>> not possible to call multiple programs in a way that would require
>>>>>>>>>>>> handling multiple shared values (i.e. cookies) for one chain.  For now,
>>>>>>>>>>>> only a fs_pick program can be chained to the same type of program,
>>>>>>>>>>>> because it may make sense if they have different triggers (cf. next
>>>>>>>>>>>> commits).  These restrictions still allow reusing Landlock programs in
>>>>>>>>>>>> a safe way (e.g. use the same loaded fs_walk program with multiple
>>>>>>>>>>>> chains of fs_pick programs).
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Mickaël Salaün <mic@digikod.net>
>>>>>>>>>>>
>>>>>>>>>>> ...
>>>>>>>>>>>
>>>>>>>>>>>> +struct landlock_prog_set *landlock_prepend_prog(
>>>>>>>>>>>> +             struct landlock_prog_set *current_prog_set,
>>>>>>>>>>>> +             struct bpf_prog *prog)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +     struct landlock_prog_set *new_prog_set = current_prog_set;
>>>>>>>>>>>> +     unsigned long pages;
>>>>>>>>>>>> +     int err;
>>>>>>>>>>>> +     size_t i;
>>>>>>>>>>>> +     struct landlock_prog_set tmp_prog_set = {};
>>>>>>>>>>>> +
>>>>>>>>>>>> +     if (prog->type != BPF_PROG_TYPE_LANDLOCK_HOOK)
>>>>>>>>>>>> +             return ERR_PTR(-EINVAL);
>>>>>>>>>>>> +
>>>>>>>>>>>> +     /* validate memory size allocation */
>>>>>>>>>>>> +     pages = prog->pages;
>>>>>>>>>>>> +     if (current_prog_set) {
>>>>>>>>>>>> +             size_t i;
>>>>>>>>>>>> +
>>>>>>>>>>>> +             for (i = 0; i < ARRAY_SIZE(current_prog_set->programs); i++) {
>>>>>>>>>>>> +                     struct landlock_prog_list *walker_p;
>>>>>>>>>>>> +
>>>>>>>>>>>> +                     for (walker_p = current_prog_set->programs[i];
>>>>>>>>>>>> +                                     walker_p; walker_p = walker_p->prev)
>>>>>>>>>>>> +                             pages += walker_p->prog->pages;
>>>>>>>>>>>> +             }
>>>>>>>>>>>> +             /* count a struct landlock_prog_set if we need to allocate one */
>>>>>>>>>>>> +             if (refcount_read(&current_prog_set->usage) != 1)
>>>>>>>>>>>> +                     pages += round_up(sizeof(*current_prog_set), PAGE_SIZE)
>>>>>>>>>>>> +                             / PAGE_SIZE;
>>>>>>>>>>>> +     }
>>>>>>>>>>>> +     if (pages > LANDLOCK_PROGRAMS_MAX_PAGES)
>>>>>>>>>>>> +             return ERR_PTR(-E2BIG);
>>>>>>>>>>>> +
>>>>>>>>>>>> +     /* ensure early that we can allocate enough memory for the new
>>>>>>>>>>>> +      * prog_lists */
>>>>>>>>>>>> +     err = store_landlock_prog(&tmp_prog_set, current_prog_set, prog);
>>>>>>>>>>>> +     if (err)
>>>>>>>>>>>> +             return ERR_PTR(err);
>>>>>>>>>>>> +
>>>>>>>>>>>> +     /*
>>>>>>>>>>>> +      * Each task_struct points to an array of prog list pointers.  These
>>>>>>>>>>>> +      * tables are duplicated when additions are made (which means each
>>>>>>>>>>>> +      * table needs to be refcounted for the processes using it). When a new
>>>>>>>>>>>> +      * table is created, all the refcounters on the prog_list are bumped (to
>>>>>>>>>>>> +      * track each table that references the prog). When a new prog is
>>>>>>>>>>>> +      * added, it's just prepended to the list for the new table to point
>>>>>>>>>>>> +      * at.
>>>>>>>>>>>> +      *
>>>>>>>>>>>> +      * Manage all the possible errors before this step to not uselessly
>>>>>>>>>>>> +      * duplicate current_prog_set and avoid a rollback.
>>>>>>>>>>>> +      */
>>>>>>>>>>>> +     if (!new_prog_set) {
>>>>>>>>>>>> +             /*
>>>>>>>>>>>> +              * If there is no Landlock program set used by the current task,
>>>>>>>>>>>> +              * then create a new one.
>>>>>>>>>>>> +              */
>>>>>>>>>>>> +             new_prog_set = new_landlock_prog_set();
>>>>>>>>>>>> +             if (IS_ERR(new_prog_set))
>>>>>>>>>>>> +                     goto put_tmp_lists;
>>>>>>>>>>>> +     } else if (refcount_read(&current_prog_set->usage) > 1) {
>>>>>>>>>>>> +             /*
>>>>>>>>>>>> +              * If the current task is not the sole user of its Landlock
>>>>>>>>>>>> +              * program set, then duplicate them.
>>>>>>>>>>>> +              */
>>>>>>>>>>>> +             new_prog_set = new_landlock_prog_set();
>>>>>>>>>>>> +             if (IS_ERR(new_prog_set))
>>>>>>>>>>>> +                     goto put_tmp_lists;
>>>>>>>>>>>> +             for (i = 0; i < ARRAY_SIZE(new_prog_set->programs); i++) {
>>>>>>>>>>>> +                     new_prog_set->programs[i] =
>>>>>>>>>>>> +                             READ_ONCE(current_prog_set->programs[i]);
>>>>>>>>>>>> +                     if (new_prog_set->programs[i])
>>>>>>>>>>>> +                             refcount_inc(&new_prog_set->programs[i]->usage);
>>>>>>>>>>>> +             }
>>>>>>>>>>>> +
>>>>>>>>>>>> +             /*
>>>>>>>>>>>> +              * Landlock program set from the current task will not be freed
>>>>>>>>>>>> +              * here because the usage is strictly greater than 1. It is
>>>>>>>>>>>> +              * only prevented to be freed by another task thanks to the
>>>>>>>>>>>> +              * caller of landlock_prepend_prog() which should be locked if
>>>>>>>>>>>> +              * needed.
>>>>>>>>>>>> +              */
>>>>>>>>>>>> +             landlock_put_prog_set(current_prog_set);
>>>>>>>>>>>> +     }
>>>>>>>>>>>> +
>>>>>>>>>>>> +     /* prepend tmp_prog_set to new_prog_set */
>>>>>>>>>>>> +     for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++) {
>>>>>>>>>>>> +             /* get the last new list */
>>>>>>>>>>>> +             struct landlock_prog_list *last_list =
>>>>>>>>>>>> +                     tmp_prog_set.programs[i];
>>>>>>>>>>>> +
>>>>>>>>>>>> +             if (last_list) {
>>>>>>>>>>>> +                     while (last_list->prev)
>>>>>>>>>>>> +                             last_list = last_list->prev;
>>>>>>>>>>>> +                     /* no need to increment usage (pointer replacement) */
>>>>>>>>>>>> +                     last_list->prev = new_prog_set->programs[i];
>>>>>>>>>>>> +                     new_prog_set->programs[i] = tmp_prog_set.programs[i];
>>>>>>>>>>>> +             }
>>>>>>>>>>>> +     }
>>>>>>>>>>>> +     new_prog_set->chain_last = tmp_prog_set.chain_last;
>>>>>>>>>>>> +     return new_prog_set;
>>>>>>>>>>>> +
>>>>>>>>>>>> +put_tmp_lists:
>>>>>>>>>>>> +     for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++)
>>>>>>>>>>>> +             put_landlock_prog_list(tmp_prog_set.programs[i]);
>>>>>>>>>>>> +     return new_prog_set;
>>>>>>>>>>>> +}
>>>>>>>>>>>
>>>>>>>>>>> Nack on the chaining concept.
>>>>>>>>>>> Please do not reinvent the wheel.
>>>>>>>>>>> There is an existing mechanism for attaching/detaching/querying multiple
>>>>>>>>>>> programs attached to cgroup and tracing hooks that are also
>>>>>>>>>>> efficiently executed via BPF_PROG_RUN_ARRAY.
>>>>>>>>>>> Please use that instead.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I don't see how that would help.  Suppose you add a filter, then
>>>>>>>>>> fork(), and then the child adds another filter.  Do you want to
>>>>>>>>>> duplicate the entire array?  You certainly can't *modify* the array
>>>>>>>>>> because you'll affect processes that shouldn't be affected.
>>>>>>>>>>
>>>>>>>>>> In contrast, doing this through seccomp like the earlier patches
>>>>>>>>>> seemed just fine to me, and seccomp already had the right logic.
>>>>>>>>>
>>>>>>>>> it doesn't look to me that existing seccomp side of managing fork
>>>>>>>>> situation can be reused. Here there is an attempt to add 'chaining'
>>>>>>>>> concept which sort of an extension of existing seccomp style,
>>>>>>>>> but somehow heavily done on bpf side and contradicts cgroup/tracing.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I don't see why the seccomp way can't be used.  I agree with you that
>>>>>>>> the seccomp *style* shouldn't be used in bpf code like this, but I
>>>>>>>> think that Landlock programs can and should just live in the existing
>>>>>>>> seccomp chain.  If the existing seccomp code needs some modification
>>>>>>>> to make this work, then so be it.
>>>>>>>
>>>>>>> +1
>>>>>>> if that was the case...
>>>>>>> but that's not my reading of the patch set.
>>>>>>
>>>>>> An earlier version of the patch set used the seccomp filter chain.
>>>>>> Mickaël, what exactly was wrong with that approach other than that the
>>>>>> seccomp() syscall was awkward for you to use?  You could add a
>>>>>> seccomp_add_landlock_rule() syscall if you needed to.
>>>>>
>>>>> Nothing was wrong with that; this part did not change (see my
>>>>> next comment).
>>>>>
>>>>>>
>>>>>> As a side comment, why is this an LSM at all, let alone a non-stacking
>>>>>> LSM?  It would make a lot more sense to me to make Landlock depend on
>>>>>> having LSMs configured in but to call the landlock hooks directly from
>>>>>> the security_xyz() hooks.
>>>>>
>>>>> See Casey's answer and his patch series: https://lwn.net/Articles/741963/
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> In other words, the kernel already has two kinds of chaining:
>>>>>>>> seccomp's and bpf's.  bpf's doesn't work right for this type of usage
>>>>>>>> across fork(), whereas seccomp's already handles that case correctly.
>>>>>>>> (In contrast, seccomp's is totally wrong for cgroup-attached filters.)
>>>>>>>>  So IMO Landlock should use the seccomp core code and call into bpf
>>>>>>>> for the actual filtering.
>>>>>>>
>>>>>>> +1
>>>>>>> in cgroup we had to invent this new BPF_PROG_RUN_ARRAY mechanism,
>>>>>>> since cgroup hierarchy can be complicated with bpf progs attached
>>>>>>> at different levels with different override/multiprog properties,
>>>>>>> so walking link list and checking all flags at run-time would have
>>>>>>> been too slow. That's why we added compute_effective_progs().
>>>>>>
>>>>>> If we start adding override flags to Landlock, I think we're doing it
>>>>>> wrong.   With cgroup bpf programs, the whole mess is set up by the
>>>>>> administrator.  With seccomp, and with Landlock if done correctly, it
>>>>>> *won't* be set up by the administrator, so the chance that everyone
>>>>>> gets all the flags right is about zero.  All attached filters should
>>>>>> run unconditionally.
>>>>>
>>>>>
>>>>> There is a misunderstanding about this chaining mechanism. This should
>>>>> not be confused with the list of seccomp filters nor the cgroup
>>>>> hierarchies. Landlock programs can be stacked the same way seccomp's
>>>>> filters can (cf. struct landlock_prog_set, the "chain_last" field is an
>>>>> optimization which is not used for this struct handling). This stackable
>>>>> property did not change from the previous patch series. The chaining
>>>>> mechanism is for another use case, which does not make sense for seccomp
>>>>> filters nor other eBPF program types, at least for now, from what I can
>>>>> tell.
>>>>>
>>>>> You may want to get a look at my talk at FOSDEM
>>>>> (https://landlock.io/talks/2018-02-04_landlock-fosdem.pdf), especially
>>>>> slides 11 and 12.
>>>>>
>>>>> Let me explain my reasoning about this program chaining thing.
>>>>>
>>>>> To check if an action on a file is allowed, we first need to identify
>>>>> this file and match it to the security policy. In a previous
>>>>> (non-public) patch series, I tried to use one type of eBPF program to
>>>>> check every kind of access to a file. To be able to identify a file, I
>>>>> relied on an eBPF map, similar to the current inode map. This map stores
>>>>> a set of references to file descriptors. I then created a function
>>>>> bpf_is_file_beneath() to check if the requested file was beneath a file
>>>>> in the map. This way, no chaining, only one eBPF program type to check
>>>>> an access to a file... but some issues then emerged. First, this design
>>>>> creates a side-channel which helps an attacker using such a program to
>>>>> infer some information not normally available, for example to get a hint
>>>>> on where a file descriptor (received from a UNIX socket) comes from.
>>>>> Another issue is that this type of program would be called for each
>>>>> component of a path. Indeed, when the kernel checks if an access to a
>>>>> file is allowed, it walks through all of the directories in its path
>>>>> (checking if the current process is allowed to execute them). That first
>>>>> attempt led me to rethink the way we could filter an access to a file
>>>>> *path*.
>>>>>
>>>>> To minimize the number of calls to an eBPF program dedicated to
>>>>> validating an access to a file path, I decided to create three subtypes
>>>>> of eBPF programs. The FS_WALK type is called when walking through every
>>>>> directory of a file path (except the last one if it is the target). We
>>>>> can then restrict this type of program to the minimum set of functions
>>>>> it is allowed to call and the minimum set of data available from its
>>>>> context. The first implicit chaining is for this type of program. To be
>>>>> able to evaluate a path while being called for all its components, this
>>>>> program needs to store a state (to remember what was the parent directory
>>>>> of this path). There is no "previous" field in the subtype for this
>>>>> program because it is chained with itself, for each directory. This
>>>>> makes it possible to create a FS_WALK program to evaluate a file
>>>>> hierarchy, thanks to the inode map which can be used to check if a
>>>>> directory of this hierarchy is part of an allowed (or denied) list of
>>>>> directories. This design makes it possible to express a file hierarchy
>>>>> in a programmatic way, without requiring an eBPF helper to do the job
>>>>> (unlike my first experiment).
>>>>>
>>>>> The explicit chaining is used to tie a path evaluation (with a FS_WALK
>>>>> program) to an access to the actual file being requested (the last
>>>>> component of a file path), with a FS_PICK program. It is only at this
>>>>> time that the kernel checks for the requested action (e.g. read, write,
>>>>> chdir, append...). To be able to filter such access requests we can have
>>>>> one call to the same program for every action and let this program check
>>>>> for which action it was called. However, this design does not allow the
>>>>> kernel to know if the current action is indeed handled by this program.
>>>>> Hence, it is not possible to implement a cache mechanism to only call
>>>>> this program if it knows how to handle this action.
>>>>>
>>>>> The approach I took for this FS_PICK type of program is to add to its
>>>>> subtype which actions it can handle (with the "triggers" bitfield, seen
>>>>> as ORed actions). This way, the kernel knows if a call to a FS_PICK
>>>>> program is necessary. If the user wants to enforce a different security
>>>>> policy according to the action requested on a file, then they need
>>>>> multiple FS_PICK programs. However, to reduce the number of such
>>>>> programs, this patch series allows a FS_PICK program to be chained with
>>>>> another, the same way a FS_WALK is chained with itself. This way, if the
>>>>> user wants to check if the action is, for example, an "open" and a "read"
>>>>> and not a "map" and a "read", then they can chain multiple FS_PICK
>>>>> programs with different trigger actions. The OR check performed by the
>>>>> kernel is not a limitation then, only a way to know if a call to an eBPF
>>>>> program is needed.
>>>>>
>>>>> The last type of program is FS_GET. This one is called when a process
>>>>> gets a struct file or changes its working directory. This is the only
>>>>> program type able (and allowed) to tag a file. This restriction is
>>>>> important to avoid being subject to resource exhaustion attacks (i.e.
>>>>> tagging every inode accessible to an attacker, which would allocate too
>>>>> much kernel memory).
>>>>>
>>>>> This design gives room for improvements to create a cache of eBPF
>>>>> context (input data, including maps if any), with the result of an eBPF
>>>>> program. This would help limit the number of calls to an eBPF program the
>>>>> same way SELinux or other kernel components do to limit costly checks.
>>>>>
>>>>> The eBPF maps of progs are useful to call the same type of eBPF
>>>>> program. They do not fit this use case because we may want multiple
>>>>> eBPF programs according to the action requested on a kernel object (e.g.
>>>>> FS_GET). The other reason is that the eBPF program does not know what
>>>>> will be the next (type of) access check performed by the kernel.
>>>>>
>>>>> To say it another way, this chaining mechanism is a way to split a
>>>>> kernel object evaluation with multiple specialized programs, each of
>>>>> them being able to deal with data tied to their type. Using a monolithic
>>>>> eBPF program to check everything does not scale and does not fit with
>>>>> unprivileged use either.
>>>>>
>>>>> As a side note, the cookie value is only an ephemeral value to keep a
>>>>> state between multiple program calls. It can be used to create a state
>>>>> machine for an object evaluation.
>>>>>
>>>>> I don't see a way to do an efficient and programmatic path evaluation,
>>>>> with different access checks, with the current eBPF features. Please let
>>>>> me know if you know how to do it another way.
>>>>>
>>>>
>>>> Andy, Alexei, Daniel, what do you think about this Landlock program
>>>> chaining and cookie?
>>>>
>>>
>>> Can you give a small pseudocode real world example that actually needs
>>> chaining?  The mechanism is quite complicated and I'd like to
>>> understand how it'll be used.
>>>
>>
>> Here is the interesting part from the example (patch 09/11):
>>
>> +SEC("maps")
>> +struct bpf_map_def inode_map = {
>> +	.type = BPF_MAP_TYPE_INODE,
>> +	.key_size = sizeof(u32),
>> +	.value_size = sizeof(u64),
>> +	.max_entries = 20,
>> +};
>> +
>> +SEC("subtype/landlock1")
>> +static union bpf_prog_subtype _subtype1 = {
>> +	.landlock_hook = {
>> +		.type = LANDLOCK_HOOK_FS_WALK,
>> +	}
>> +};
>> +
>> +static __always_inline __u64 update_cookie(__u64 cookie, __u8 lookup,
>> +		void *inode, void *chain, bool freeze)
>> +{
>> +	__u64 map_allow = 0;
>> +
>> +	if (cookie == 0) {
>> +		cookie = bpf_inode_get_tag(inode, chain);
>> +		if (cookie)
>> +			return cookie;
>> +		/* only look for the first match in the map, ignore nested
>> +		 * paths in this example */
>> +		map_allow = bpf_inode_map_lookup(&inode_map, inode);
>> +		if (map_allow)
>> +			cookie = 1 | map_allow;
>> +	} else {
>> +		if (cookie & COOKIE_VALUE_FREEZED)
>> +			return cookie;
>> +		map_allow = cookie & _MAP_MARK_MASK;
>> +		cookie &= ~_MAP_MARK_MASK;
>> +		switch (lookup) {
>> +		case LANDLOCK_CTX_FS_WALK_INODE_LOOKUP_DOTDOT:
>> +			cookie--;
>> +			break;
>> +		case LANDLOCK_CTX_FS_WALK_INODE_LOOKUP_DOT:
>> +			break;
>> +		default:
>> +			/* ignore _MAP_MARK_MASK overflow in this example */
>> +			cookie++;
>> +			break;
>> +		}
>> +		if (cookie >= 1)
>> +			cookie |= map_allow;
>> +	}
>> +	/* do not modify the cookie for each fs_pick */
>> +	if (freeze && cookie)
>> +		cookie |= COOKIE_VALUE_FREEZED;
>> +	return cookie;
>> +}
>> +
>> +SEC("landlock1")
>> +int fs_walk(struct landlock_ctx_fs_walk *ctx)
>> +{
>> +	ctx->cookie = update_cookie(ctx->cookie, ctx->inode_lookup,
>> +			(void *)ctx->inode, (void *)ctx->chain, false);
>> +	return LANDLOCK_RET_ALLOW;
>> +}
>>
>> The program "landlock1" is called for every directory execution (except
>> the last one if it is the leaf of a path). This makes it possible to
>> identify a file hierarchy with only a (one-dimensional) list of file
>> descriptors (i.e. inode_map).
>>
>> Underneath, the Landlock LSM part checks whether there is an associated
>> path walk (nameidata) with each inode access request. If there is one,
>> then the cookie associated with the path walk (if any) is made available
>> through the eBPF program context. This makes it possible to develop a
>> state machine with an eBPF program to "evaluate" a file path (without
>> string parsing).
>>
>> The goal with this chaining mechanism is to be able to express a complex
>> kernel object like a file, with multiple runs of one or more eBPF
>> programs, as a multilayer evaluation. These semantics may only make sense
>> for the user/developer and their security policy. We must keep in mind
>> that this object identification should be available to unprivileged
>> processes. This means that we must be very careful about what kind of
>> information is available to an eBPF program because this can then leak
>> to a process (e.g. through a map). With this mechanism, only information
>> already available to user space is available to the eBPF program.
>>
>> In this example, the complexity of the path evaluation is in the eBPF
>> program. We can then keep the kernel code simpler and more generic. This
>> enables more flexibility for a security policy definition.
> 
> it all sounds correct on paper, but it's pretty novel
> approach and I'm not sure I see all the details in the patch.
> When people say "inode" they most of the time mean inode integer number,
> whereas in this patch do you mean a raw pointer to in-kernel
> 'struct inode' ?
> To avoid confusion it should probably be called differently.

It's indeed a pointer to a "struct inode", not an inode number.

I was thinking about generalizing the BPF_MAP_TYPE_INODE by renaming it
to BPF_MAP_TYPE_FD. This map type could then be used either to identify
a set of inodes (pointers) or other kernel objects identifiable by a
file descriptor. A "subtype" (similar to the BPF prog subtype introduced
in this patch series) may be used to specialize such a map to statically
identify the kind of content it may hold. We could then add more
subtypes to identify sockets, devices, processes, and so on.

> 
> If you meant inode as a number then why inode only?
> where is superblock, device, mount point?
> How bpf side can compare inodes without this additional info?
> How bpf side will know what inode to compare to?
> What if inode number is reused?

This pointer can be used to check whether a given inode is the same as
the one referenced by a file descriptor (or a file path).


> This approach is an optimization to compare inodes
> instead of strings passed into sys_open ?

Comparing paths as strings is not only less efficient, it is also very
error-prone. Another advantage of using file descriptors is for
unprivileged processes: we can be sure that these processes are allowed
to access a file referred to by a file descriptor (opened file). Indeed we
check (security_inode_getattr) that the process is allowed to stat an
opened file. This way, a malicious process can't infer information by
crafting path strings.


> 
> If you meant inode as a pointer how bpf side will
> know the pointer before the walk begins?

The BPF map is filled by user space with file descriptors pointing to
opened files. When a path walk begins, the LSM part of Landlock is
notified that a process is requesting access to the first element of
the path (e.g. "/"). This first element may be part of a map or not. The
BPF program can then choose if this request is legitimate or not.


> What guarantees that it's not a stale pointer?

When user space updates a map with a new file descriptor, the kernel
checks if this FD is valid. If this is the case, then the inode's usage
counter is incremented and its address is stored in the map.
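
The update path described above can be sketched in user-space C. This is only an illustration: struct fake_inode, struct fake_map and the fd table are hypothetical stand-ins, not the kernel's struct inode, the BPF map API, or the patch's implementation.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins: not the kernel's struct inode or fd table. */
struct fake_inode {
	int usage;	/* reference count */
};

#define FD_TABLE_SIZE 4
#define MAP_SIZE 4

struct fake_map {
	struct fake_inode *slots[MAP_SIZE];
	size_t used;
};

/* Resolve fd against a table of open files; NULL means fd is not valid. */
static struct fake_inode *fd_to_inode(struct fake_inode *table[FD_TABLE_SIZE],
				      int fd)
{
	if (fd < 0 || fd >= FD_TABLE_SIZE)
		return NULL;
	return table[fd];
}

/* Store the inode behind fd in the map, taking a reference so the stored
 * pointer cannot go stale while the map holds it. */
static int map_add_fd(struct fake_map *map,
		      struct fake_inode *table[FD_TABLE_SIZE], int fd)
{
	struct fake_inode *inode = fd_to_inode(table, fd);

	if (!inode)
		return -EBADF;
	if (map->used >= MAP_SIZE)
		return -E2BIG;
	inode->usage++;
	map->slots[map->used++] = inode;
	return 0;
}

/* Check whether a given inode is one of those stored in the map. */
static int map_contains(const struct fake_map *map,
			const struct fake_inode *inode)
{
	size_t i;

	for (i = 0; i < map->used; i++)
		if (map->slots[i] == inode)
			return 1;
	return 0;
}
```

The lookup compares pointers only, which is why the reference taken at update time matters: without it the address could be reused for a different inode.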


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* [PATCH net] strparser: Fix incorrect strp->need_bytes value.
From: Doron Roberts-Kedes @ 2018-04-11 22:05 UTC (permalink / raw)
  To: davem; +Cc: netdev, Doron Roberts-Kedes

strp_data_ready resets strp->need_bytes to 0 if strp_peek_len indicates
that the remainder of the message has been received. However,
do_strp_work does not reset strp->need_bytes to 0. If do_strp_work
completes a partial message, the value of strp->need_bytes will continue
to reflect the needed bytes of the previous message, causing
future invocations of strp_data_ready to return early if
strp->need_bytes is greater than strp_peek_len. Resetting strp->need_bytes
to 0 in __strp_recv on handing a full message to the upper layer solves
this problem.

__strp_recv also calculates strp->need_bytes using stm->accum_len before
stm->accum_len has been incremented by cand_len. This can cause
strp->need_bytes to be equal to the full length of the message instead
of the full length minus the accumulated length. This, in turn, causes
strp_data_ready to return early, even when there is sufficient data to
complete the partial message. Incrementing stm->accum_len before using
it to calculate strp->need_bytes solves this problem.
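
The ordering issue can be illustrated with a small user-space sketch. The struct and function names below are hypothetical stand-ins, not the strparser API; only the arithmetic mirrors the description above.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the per-message parser state. */
struct msg_state {
	size_t full_len;	/* total length of the message being assembled */
	size_t accum_len;	/* bytes accumulated so far */
};

/* Old ordering: the remainder is computed before accum_len is updated,
 * so the bytes just consumed (cand_len) are not subtracted. */
static size_t need_bytes_old(struct msg_state *stm, size_t cand_len)
{
	size_t need = stm->full_len - stm->accum_len;

	stm->accum_len += cand_len;
	return need;
}

/* Fixed ordering: account for cand_len first, then compute the remainder. */
static size_t need_bytes_fixed(struct msg_state *stm, size_t cand_len)
{
	stm->accum_len += cand_len;
	return stm->full_len - stm->accum_len;
}
```

For a 100-byte message whose first 40 bytes have just arrived, the old ordering reports 100 bytes still needed while the fixed ordering reports 60.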

Found while testing net/tls_sw recv path.

Fixes: 43a0c6751a322847 ("strparser: Stream parser for messages")
Signed-off-by: Doron Roberts-Kedes <doronrk@fb.com>
---
 net/strparser/strparser.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/net/strparser/strparser.c b/net/strparser/strparser.c
index b9283ce..805b139 100644
--- a/net/strparser/strparser.c
+++ b/net/strparser/strparser.c
@@ -296,9 +296,9 @@ static int __strp_recv(read_descriptor_t *desc, struct sk_buff *orig_skb,
 					strp_start_timer(strp, timeo);
 				}

+				stm->accum_len += cand_len;
 				strp->need_bytes = stm->strp.full_len -
 						       stm->accum_len;
-				stm->accum_len += cand_len;
 				stm->early_eaten = cand_len;
 				STRP_STATS_ADD(strp->stats.bytes, cand_len);
 				desc->count = 0; /* Stop reading socket */
@@ -321,6 +321,7 @@ static int __strp_recv(read_descriptor_t *desc, struct sk_buff *orig_skb,
 		/* Hurray, we have a new message! */
 		cancel_delayed_work(&strp->msg_timer_work);
 		strp->skb_head = NULL;
+		strp->need_bytes = 0;
 		STRP_STATS_INCR(strp->stats.msgs);

 		/* Give skb to upper layer */
@@ -410,9 +411,7 @@ void strp_data_ready(struct strparser *strp)
 		return;

 	if (strp->need_bytes) {
-		if (strp_peek_len(strp) >= strp->need_bytes)
-			strp->need_bytes = 0;
-		else
+		if (strp_peek_len(strp) < strp->need_bytes)
 			return;
 	}

-- 
2.9.5

^ permalink raw reply related

* [GIT] Networking
From: David Miller @ 2018-04-11 21:53 UTC (permalink / raw)
  To: torvalds; +Cc: akpm, netdev, linux-kernel


1) In ip_gre tunnel, handle the conflict between TUNNEL_{SEQ,CSUM} and
   GSO/LLTX properly.  From Sabrina Dubroca.

2) Stop properly on error in lan78xx_read_otp(), from Phil Elwell.

3) Don't uncompress in slip before rstate is initialized, from Tejaswi
   Tanikella.

4) When using 1.x firmware on aquantia, issue a deinit before we
   hardware reset the chip, otherwise we break dirty wake WOL.  From
   Igor Russkikh.

5) Correct log check in vhost_vq_access_ok(), from Stefan Hajnoczi.

6) Fix ethtool -x crashes in bnxt_en, from Michael Chan.

7) Fix races in l2tp tunnel creation and duplicate tunnel detection,
   from Guillaume Nault.

Please pull, thanks a lot!

The following changes since commit c18bb396d3d261ebbb4efbc05129c5d354c541e4:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2018-04-09 17:04:10 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 

for you to fetch changes up to 0c84cee8f131a090f77f5a3dea5d6a7bd99c00db:

  Merge branch 'l2tp-tunnel-creation-fixes' (2018-04-11 17:41:28 -0400)

----------------------------------------------------------------
Andy Gospodarek (1):
      bnxt_en: do not allow wildcard matches for L2 flows

Bassem Boubaker (1):
      cdc_ether: flag the Cinterion AHS8 modem by gemalto as WWAN

David S. Miller (4):
      Merge branch 'Aquantia-atlantic-critical-fixes-04-2018'
      Merge branch 'vhost-fix-vhost_vq_access_ok-log-check'
      Merge branch 'bnxt_en-Fixes-for-net'
      Merge branch 'l2tp-tunnel-creation-fixes'

Eric Auger (1):
      vhost: Fix vhost_copy_to_user()

Guillaume Nault (2):
      l2tp: fix races in tunnel creation
      l2tp: fix race in duplicate tunnel detection

Igor Russkikh (2):
      net: aquantia: Regression on reset with 1.x firmware
      net: aquantia: oops when shutdown on already stopped device

Ka-Cheong Poon (1):
      rds: MP-RDS may use an invalid c_path

Michael Chan (3):
      bnxt_en: Fix ethtool -x crash when device is down.
      bnxt_en: Need to include RDMA rings in bnxt_check_rings().
      bnxt_en: Fix NULL pointer dereference at bnxt_free_irq().

Phil Elwell (3):
      lan78xx: Correctly indicate invalid OTP
      lan78xx: Avoid spurious kevent 4 "error"
      lan78xx: Don't reset the interface on open

Sabrina Dubroca (3):
      ip_gre: clear feature flags when incompatible o_flags are set
      tun: set the flags before registering the netdevice
      tun: send netlink notification when the device is modified

Sriharsha Basavapatna (2):
      bnxt_en: Ignore src port field in decap filter nodes
      bnxt_en: Support max-mtu with VF-reps

Stefan Hajnoczi (2):
      vhost: fix vhost_vq_access_ok() log check
      vhost: return bool from *_access_ok() functions

Tejaswi Tanikella (1):
      slip: Check if rstate is initialized before uncompressing

 drivers/net/ethernet/aquantia/atlantic/aq_nic.c              |   8 +-
 drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c |  16 ++++
 drivers/net/ethernet/broadcom/bnxt/bnxt.c                    |   4 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c            |  11 ++-
 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c                 |  63 +++++++++++++++-
 drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c                |  30 ++++++++
 drivers/net/slip/slhc.c                                      |   5 ++
 drivers/net/tun.c                                            |  33 ++++++--
 drivers/net/usb/cdc_ether.c                                  |   6 ++
 drivers/net/usb/lan78xx.c                                    |   9 +--
 drivers/vhost/vhost.c                                        |  72 +++++++++---------
 drivers/vhost/vhost.h                                        |   4 +-
 include/net/slhc_vj.h                                        |   1 +
 net/ipv4/ip_gre.c                                            |   6 ++
 net/l2tp/l2tp_core.c                                         | 225 ++++++++++++++++++++++++-------------------------------
 net/l2tp/l2tp_core.h                                         |   4 +-
 net/l2tp/l2tp_netlink.c                                      |  22 +++---
 net/l2tp/l2tp_ppp.c                                          |   9 +++
 net/rds/send.c                                               |  15 ++--
 19 files changed, 345 insertions(+), 198 deletions(-)

^ permalink raw reply

* Re: [PATCH net] net: validate attribute sizes in neigh_dump_table()
From: David Ahern @ 2018-04-11 21:48 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller; +Cc: netdev, Eric Dumazet
In-Reply-To: <20180411214600.203361-1-edumazet@google.com>

On 4/11/18 3:46 PM, Eric Dumazet wrote:
> Since neigh_dump_table() calls nlmsg_parse() without giving policy
> constraints, attributes can have arbitrary size that we must validate
> 

...

> 
> Fixes: 21fdd092acc7 ("net: Add support for filtering neigh dump by master device")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: David Ahern <dsa@cumulusnetworks.com>
> Reported-by: syzbot <syzkaller@googlegroups.com>
> ---
>  net/core/neighbour.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 

Acked-by: David Ahern <dsa@cumulusnetworks.com>

Thanks for fixing, Eric.

^ permalink raw reply

* [PATCH net] net: validate attribute sizes in neigh_dump_table()
From: Eric Dumazet @ 2018-04-11 21:46 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Eric Dumazet, David Ahern

Since neigh_dump_table() calls nlmsg_parse() without giving policy
constraints, attributes can have arbitrary size that we must validate.
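
The kind of check this calls for can be sketched in user-space C. struct attr and attr_get_u32_checked() are hypothetical names used for illustration; they are not the netlink API, and the real patch uses nla_len() on the parsed attributes.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified attribute: a length plus a payload pointer. */
struct attr {
	uint16_t len;		/* payload length in bytes, as sent by user space */
	const void *data;	/* payload */
};

/* Read a u32 from an attribute only if its payload is exactly 4 bytes.
 * Returns 0 on success and -EINVAL if the size does not match. */
static int attr_get_u32_checked(const struct attr *a, uint32_t *out)
{
	if (a->len != sizeof(uint32_t))
		return -EINVAL;
	memcpy(out, a->data, sizeof(*out));
	return 0;
}
```

Without the length test, a two-byte payload would still be read as four bytes, which is the uninitialized read KMSAN flags.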

Reported by syzbot/KMSAN :

BUG: KMSAN: uninit-value in neigh_master_filtered net/core/neighbour.c:2292 [inline]
BUG: KMSAN: uninit-value in neigh_dump_table net/core/neighbour.c:2348 [inline]
BUG: KMSAN: uninit-value in neigh_dump_info+0x1af0/0x2250 net/core/neighbour.c:2438
CPU: 1 PID: 3575 Comm: syzkaller268891 Not tainted 4.16.0+ #83
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x185/0x1d0 lib/dump_stack.c:53
 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
 neigh_master_filtered net/core/neighbour.c:2292 [inline]
 neigh_dump_table net/core/neighbour.c:2348 [inline]
 neigh_dump_info+0x1af0/0x2250 net/core/neighbour.c:2438
 netlink_dump+0x9ad/0x1540 net/netlink/af_netlink.c:2225
 __netlink_dump_start+0x1167/0x12a0 net/netlink/af_netlink.c:2322
 netlink_dump_start include/linux/netlink.h:214 [inline]
 rtnetlink_rcv_msg+0x1435/0x1560 net/core/rtnetlink.c:4598
 netlink_rcv_skb+0x355/0x5f0 net/netlink/af_netlink.c:2447
 rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:4653
 netlink_unicast_kernel net/netlink/af_netlink.c:1311 [inline]
 netlink_unicast+0x1672/0x1750 net/netlink/af_netlink.c:1337
 netlink_sendmsg+0x1048/0x1310 net/netlink/af_netlink.c:1900
 sock_sendmsg_nosec net/socket.c:630 [inline]
 sock_sendmsg net/socket.c:640 [inline]
 ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046
 __sys_sendmsg net/socket.c:2080 [inline]
 SYSC_sendmsg+0x2a3/0x3d0 net/socket.c:2091
 SyS_sendmsg+0x54/0x80 net/socket.c:2087
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x43fed9
RSP: 002b:00007ffddbee2798 EFLAGS: 00000213 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 000000000043fed9
RDX: 0000000000000000 RSI: 0000000020005000 RDI: 0000000000000003
RBP: 00000000006ca018 R08: 00000000004002c8 R09: 00000000004002c8
R10: 00000000004002c8 R11: 0000000000000213 R12: 0000000000401800
R13: 0000000000401890 R14: 0000000000000000 R15: 0000000000000000

Uninit was created at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
 kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
 kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
 kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321
 slab_post_alloc_hook mm/slab.h:445 [inline]
 slab_alloc_node mm/slub.c:2737 [inline]
 __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369
 __kmalloc_reserve net/core/skbuff.c:138 [inline]
 __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206
 alloc_skb include/linux/skbuff.h:984 [inline]
 netlink_alloc_large_skb net/netlink/af_netlink.c:1183 [inline]
 netlink_sendmsg+0x9a6/0x1310 net/netlink/af_netlink.c:1875
 sock_sendmsg_nosec net/socket.c:630 [inline]
 sock_sendmsg net/socket.c:640 [inline]
 ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046
 __sys_sendmsg net/socket.c:2080 [inline]
 SYSC_sendmsg+0x2a3/0x3d0 net/socket.c:2091
 SyS_sendmsg+0x54/0x80 net/socket.c:2087
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Fixes: 21fdd092acc7 ("net: Add support for filtering neigh dump by master device")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Ahern <dsa@cumulusnetworks.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
---
 net/core/neighbour.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 7b7a14abba28e2b77c6448f1c3d151287afc79ad..a8bc02bb339f9f4c914ae7b23408cd5ccc8b3b8e 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -2323,12 +2323,16 @@ static int neigh_dump_table(struct neigh_table *tbl, struct sk_buff *skb,
 
 	err = nlmsg_parse(nlh, sizeof(struct ndmsg), tb, NDA_MAX, NULL, NULL);
 	if (!err) {
-		if (tb[NDA_IFINDEX])
+		if (tb[NDA_IFINDEX]) {
+			if (nla_len(tb[NDA_IFINDEX]) != sizeof(u32))
+				return -EINVAL;
 			filter_idx = nla_get_u32(tb[NDA_IFINDEX]);
-
-		if (tb[NDA_MASTER])
+		}
+		if (tb[NDA_MASTER]) {
+			if (nla_len(tb[NDA_MASTER]) != sizeof(u32))
+				return -EINVAL;
 			filter_master_idx = nla_get_u32(tb[NDA_MASTER]);
-
+		}
 		if (filter_idx || filter_master_idx)
 			flags |= NLM_F_DUMP_FILTERED;
 	}
-- 
2.17.0.484.g0c8726318c-goog
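
For readers following the diff above, here is a minimal userspace sketch of
the length check it adds. It assumes the usual reason for such a check: a
u32 getter reads four payload bytes without validating the attribute
length itself, so a shorter attribute would expose uninitialized skb data.
All demo_* names are hypothetical stand-ins for illustration, not kernel API.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-in for the kernel's struct nlattr (header + payload). */
struct demo_nlattr {
	uint16_t nla_len;          /* total length: header plus payload */
	uint16_t nla_type;
	unsigned char payload[8];
};

#define DEMO_NLA_HDRLEN 4          /* nla_len + nla_type */

/* Payload length, modeled on what nla_len() reports. */
static int demo_nla_len(const struct demo_nlattr *nla)
{
	return nla->nla_len - DEMO_NLA_HDRLEN;
}

/*
 * Read a u32 attribute only if the payload is exactly four bytes.
 * Returns 0 and stores the value, or -EINVAL for any other length.
 */
static int demo_get_u32_checked(const struct demo_nlattr *nla, uint32_t *out)
{
	if (demo_nla_len(nla) != (int)sizeof(uint32_t))
		return -EINVAL;
	memcpy(out, nla->payload, sizeof(*out));
	return 0;
}
```

The real patch performs the same comparison inline before each
nla_get_u32() call and returns -EINVAL from neigh_dump_table().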

^ permalink raw reply related

* Re: [PATCH net 0/2] l2tp: tunnel creation fixes
From: David Miller @ 2018-04-11 21:42 UTC (permalink / raw)
  To: g.nault; +Cc: netdev, jchapman, tparkin
In-Reply-To: <cover.1523385906.git.g.nault@alphalink.fr>

From: Guillaume Nault <g.nault@alphalink.fr>
Date: Tue, 10 Apr 2018 21:01:07 +0200

> L2TP tunnel creation is racy. We need to make sure that the tunnel
> returned by l2tp_tunnel_create() isn't going to be freed while the
> caller is using it. This is done in patch #1, by separating tunnel
> creation from tunnel registration.
> 
> With the tunnel registration code in place, we can now check for
> duplicate tunnels in a race-free way. This is done in patch #2, which
> incidentally removes the last use of l2tp_tunnel_find().

Series applied and queued up for -stable, thanks.

^ permalink raw reply

* [PATCH net] tcp: md5: reject TCP_MD5SIG or TCP_MD5SIG_EXT on established sockets
From: Eric Dumazet @ 2018-04-11 21:36 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Eric Dumazet

syzbot/KMSAN reported an uninit-value in tcp_parse_options() [1]

I believe this was caused by TCP_MD5SIG being set on a live
flow.

This is highly unexpected, since TCP option space is limited.

For instance, the presence of the TCP MD5 option automatically disables
the TCP Timestamp option at SYN/SYNACK time, which we cannot do
once the flow has been established.

Really, adding/deleting an MD5 key only makes sense on sockets
in CLOSE or LISTEN state.

[1]
BUG: KMSAN: uninit-value in tcp_parse_options+0xd74/0x1a30 net/ipv4/tcp_input.c:3720
CPU: 1 PID: 6177 Comm: syzkaller192004 Not tainted 4.16.0+ #83
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x185/0x1d0 lib/dump_stack.c:53
 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
 tcp_parse_options+0xd74/0x1a30 net/ipv4/tcp_input.c:3720
 tcp_fast_parse_options net/ipv4/tcp_input.c:3858 [inline]
 tcp_validate_incoming+0x4f1/0x2790 net/ipv4/tcp_input.c:5184
 tcp_rcv_established+0xf60/0x2bb0 net/ipv4/tcp_input.c:5453
 tcp_v4_do_rcv+0x6cd/0xd90 net/ipv4/tcp_ipv4.c:1469
 sk_backlog_rcv include/net/sock.h:908 [inline]
 __release_sock+0x2d6/0x680 net/core/sock.c:2271
 release_sock+0x97/0x2a0 net/core/sock.c:2786
 tcp_sendmsg+0xd6/0x100 net/ipv4/tcp.c:1464
 inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
 sock_sendmsg_nosec net/socket.c:630 [inline]
 sock_sendmsg net/socket.c:640 [inline]
 SYSC_sendto+0x6c3/0x7e0 net/socket.c:1747
 SyS_sendto+0x8a/0xb0 net/socket.c:1715
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x448fe9
RSP: 002b:00007fd472c64d38 EFLAGS: 00000216 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 00000000006e5a30 RCX: 0000000000448fe9
RDX: 000000000000029f RSI: 0000000020a88f88 RDI: 0000000000000004
RBP: 00000000006e5a34 R08: 0000000020e68000 R09: 0000000000000010
R10: 00000000200007fd R11: 0000000000000216 R12: 0000000000000000
R13: 00007fff074899ef R14: 00007fd472c659c0 R15: 0000000000000009

Uninit was created at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
 kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
 kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
 kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321
 slab_post_alloc_hook mm/slab.h:445 [inline]
 slab_alloc_node mm/slub.c:2737 [inline]
 __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369
 __kmalloc_reserve net/core/skbuff.c:138 [inline]
 __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206
 alloc_skb include/linux/skbuff.h:984 [inline]
 tcp_send_ack+0x18c/0x910 net/ipv4/tcp_output.c:3624
 __tcp_ack_snd_check net/ipv4/tcp_input.c:5040 [inline]
 tcp_ack_snd_check net/ipv4/tcp_input.c:5053 [inline]
 tcp_rcv_established+0x2103/0x2bb0 net/ipv4/tcp_input.c:5469
 tcp_v4_do_rcv+0x6cd/0xd90 net/ipv4/tcp_ipv4.c:1469
 sk_backlog_rcv include/net/sock.h:908 [inline]
 __release_sock+0x2d6/0x680 net/core/sock.c:2271
 release_sock+0x97/0x2a0 net/core/sock.c:2786
 tcp_sendmsg+0xd6/0x100 net/ipv4/tcp.c:1464
 inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
 sock_sendmsg_nosec net/socket.c:630 [inline]
 sock_sendmsg net/socket.c:640 [inline]
 SYSC_sendto+0x6c3/0x7e0 net/socket.c:1747
 SyS_sendto+0x8a/0xb0 net/socket.c:1715
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Fixes: cfb6eeb4c860 ("[TCP]: MD5 Signature Option (RFC2385) support.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
---
 net/ipv4/tcp.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index bccc4c2700870b8c7ff592a6bd27acebd9bc6471..4fa3f812b9ff8954a9b6a018c648ff12ab995721 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2813,8 +2813,10 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 #ifdef CONFIG_TCP_MD5SIG
 	case TCP_MD5SIG:
 	case TCP_MD5SIG_EXT:
-		/* Read the IP->Key mappings from userspace */
-		err = tp->af_specific->md5_parse(sk, optname, optval, optlen);
+		if ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))
+			err = tp->af_specific->md5_parse(sk, optname, optval, optlen);
+		else
+			err = -EINVAL;
 		break;
 #endif
 	case TCP_USER_TIMEOUT:
-- 
2.17.0.484.g0c8726318c-goog
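
The state test in the diff above uses the "one bit per state" idiom. A
minimal userspace sketch of it follows. The state numbers are believed to
match include/net/tcp_states.h, but the demo_* and DEMO_* names are
hypothetical and only a subset of states is shown.

```c
#include <errno.h>

/* Subset of TCP state numbers (assumed to match the kernel's values). */
enum demo_tcp_state {
	DEMO_TCP_ESTABLISHED = 1,
	DEMO_TCP_CLOSE       = 7,
	DEMO_TCP_LISTEN      = 10,
};

/* One flag bit per state, the same scheme as the kernel's TCPF_* masks. */
#define DEMO_TCPF_CLOSE  (1 << DEMO_TCP_CLOSE)
#define DEMO_TCPF_LISTEN (1 << DEMO_TCP_LISTEN)

/*
 * Hypothetical helper: 0 if an MD5 key may be installed in this state,
 * -EINVAL otherwise. Models the test added to do_tcp_setsockopt().
 */
static int demo_md5_key_allowed(int sk_state)
{
	if ((1 << sk_state) & (DEMO_TCPF_CLOSE | DEMO_TCPF_LISTEN))
		return 0;
	return -EINVAL;
}
```

Shifting 1 by the state number turns a set-membership question into a
single AND against a mask, which is why several states can be accepted
with one comparison.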

^ permalink raw reply related

* Re: [PATCH net v2 2/6] bnxt_en: do not allow wildcard matches for L2 flows
From: Jakub Kicinski @ 2018-04-11 21:19 UTC (permalink / raw)
  To: Michael Chan, Andy Gospodarek; +Cc: David Miller, Netdev
In-Reply-To: <CACKFLinjsSyMDSyut3h1Qyzrn0HGjaGTROdA-5U-syezEopkxQ@mail.gmail.com>

On Wed, 11 Apr 2018 13:55:11 -0700, Michael Chan wrote:
> On Wed, Apr 11, 2018 at 1:50 PM, Andy Gospodarek wrote:
> > On Wed, Apr 11, 2018 at 01:41:31PM -0700, Michael Chan wrote:  
> > True, but I'm not sure that tc_cls_common_offload is used in all cases.
> > Take red_offload() as one of those.  
> 
> For Flower, we know we have the extack pointer in
> tc_cls_common_offload struct and we can use it to set the netlink
> error message.  The point is that we don't have to modify
> ndo_setup_tc().

Yes, the extack is actually only populated when skip_sw is specified to
avoid warning users who don't care about offloads.

Flower offloads don't go via .ndo_setup_tc but TC block callbacks.  But
one day we will hopefully find a reasonable way to pass extack to qdisc
offloads as well..

FWIW your driver is actually already using extack under the veil of
tc_cls_can_offload_and_chain0() :)

^ permalink raw reply

* BUG: corrupted list in team_nl_cmd_options_set
From: syzbot @ 2018-04-11 21:02 UTC (permalink / raw)
  To: jiri, linux-kernel, netdev, syzkaller-bugs

Hello,

syzbot hit the following crash on upstream commit
b284d4d5a6785f8cd07eda2646a95782373cd01e (Tue Apr 10 19:25:30 2018 +0000)
Merge tag 'ceph-for-4.17-rc1' of git://github.com/ceph/ceph-client
syzbot dashboard link: https://syzkaller.appspot.com/bug?extid=4d4af685432dc0e56c91

C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6161158629228544
syzkaller reproducer: https://syzkaller.appspot.com/x/repro.syz?id=5600380654190592
Raw console output: https://syzkaller.appspot.com/x/log.txt?id=4627738266697728
Kernel config: https://syzkaller.appspot.com/x/.config?id=-1223000601505858474
compiler: gcc (GCC) 8.0.1 20180301 (experimental)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+4d4af685432dc0e56c91@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for details.
If you forward the report, please keep this part and the footer.

8021q: adding VLAN 0 to HW filter on device team0
netlink: 'syzkaller556835': attribute type 3 has an invalid length.
netlink: 'syzkaller556835': attribute type 3 has an invalid length.
list_add double add: new=0000000004f859c0, prev=00000000c9745291, next=0000000004f859c0.
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:31!
invalid opcode: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
    (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 4466 Comm: syzkaller556835 Not tainted 4.16.0+ #17
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__list_add_valid+0xaa/0xb0 lib/list_debug.c:29
RSP: 0018:ffff8801b04bf248 EFLAGS: 00010286
RAX: 0000000000000058 RBX: ffff8801c8fc7a90 RCX: 0000000000000000
RDX: 0000000000000058 RSI: ffffffff815fbf41 RDI: ffffed0036097e3f
RBP: ffff8801b04bf260 R08: ffff8801b0b2a700 R09: ffffed003b604f90
R10: ffffed003b604f90 R11: ffff8801db027c87 R12: ffff8801c8fc7a90
R13: ffff8801c8fc7a90 R14: dffffc0000000000 R15: 0000000000000000
FS:  0000000000b98880(0000) GS:ffff8801db000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000043fc30 CR3: 00000001afe8e000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  __list_add include/linux/list.h:60 [inline]
  list_add include/linux/list.h:79 [inline]
  team_nl_cmd_options_set+0x9ff/0x12b0 drivers/net/team/team.c:2571
  genl_family_rcv_msg+0x889/0x1120 net/netlink/genetlink.c:599
  genl_rcv_msg+0xc6/0x170 net/netlink/genetlink.c:624
  netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2448
  genl_rcv+0x28/0x40 net/netlink/genetlink.c:635
  netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
  netlink_unicast+0x58b/0x740 net/netlink/af_netlink.c:1336
  netlink_sendmsg+0x9f0/0xfa0 net/netlink/af_netlink.c:1901
  sock_sendmsg_nosec net/socket.c:629 [inline]
  sock_sendmsg+0xd5/0x120 net/socket.c:639
  ___sys_sendmsg+0x805/0x940 net/socket.c:2117
  __sys_sendmsg+0x115/0x270 net/socket.c:2155
  SYSC_sendmsg net/socket.c:2164 [inline]
  SyS_sendmsg+0x29/0x30 net/socket.c:2162
  do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x4458b9
RSP: 002b:00007ffd1d4a7278 EFLAGS: 00000213 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 000000000000001b RCX: 00000000004458b9
RDX: 0000000000000010 RSI: 0000000020000d00 RDI: 0000000000000004
RBP: 00000000004a74ed R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000213 R12: 00007ffd1d4a7348
R13: 0000000000402a60 R14: 0000000000000000 R15: 0000000000000000
Code: 75 e8 eb a9 48 89 f7 48 89 75 e8 e8 d1 85 7b fe 48 8b 75 e8 eb bb 48 89 f2 48 89 d9 4c 89 e6 48 c7 c7 a0 84 d8 87 e8 ea 67 28 fe <0f> 0b 0f 1f 40 00 48 b8 00 00 00 00 00 fc ff df 55 48 89 e5 41
RIP: __list_add_valid+0xaa/0xb0 lib/list_debug.c:29 RSP: ffff8801b04bf248
---[ end trace b4f71d7dd7ca6d10 ]---


---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to syzkaller@googlegroups.com.

syzbot will keep track of this bug report.
If you forgot to add the Reported-by tag, once the fix for this bug is
merged into any tree, please reply to this email with:
#syz fix: exact-commit-title
If you want to test a patch for this bug, please reply with:
#syz test: git://repo/address.git branch
and provide the patch inline or as an attachment.
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug report.
Note: all commands must start from beginning of the line in the email body.

^ permalink raw reply

* Re: [PATCH net v2 2/6] bnxt_en: do not allow wildcard matches for L2 flows
From: Michael Chan @ 2018-04-11 20:55 UTC (permalink / raw)
  To: Andy Gospodarek; +Cc: Jakub Kicinski, David Miller, Netdev
In-Reply-To: <20180411205014.GE33938@C02RW35GFVH8.dhcp.broadcom.net>

On Wed, Apr 11, 2018 at 1:50 PM, Andy Gospodarek
<andrew.gospodarek@broadcom.com> wrote:
> On Wed, Apr 11, 2018 at 01:41:31PM -0700, Michael Chan wrote:
>> On Wed, Apr 11, 2018 at 1:31 PM, Andy Gospodarek
>> <andrew.gospodarek@broadcom.com> wrote:
>> > On Wed, Apr 11, 2018 at 11:43:14AM -0700, Jakub Kicinski wrote:
>> >> On Wed, 11 Apr 2018 11:50:14 -0400, Michael Chan wrote:
>> >> > @@ -764,6 +788,41 @@ static bool bnxt_tc_can_offload(struct bnxt *bp, struct bnxt_tc_flow *flow)
>> >> >             return false;
>> >> >     }
>> >> >
>> >> > +   /* Currently source/dest MAC cannot be partial wildcard  */
>> >> > +   if (bits_set(&flow->l2_key.smac, sizeof(flow->l2_key.smac)) &&
>> >> > +       !is_exactmatch(flow->l2_mask.smac, sizeof(flow->l2_mask.smac))) {
>> >> > +           netdev_info(bp->dev, "Wildcard match unsupported for Source MAC\n");
>> >>
>> >> This wouldn't be something to do in net, but how do you feel about
>> >> using extack for messages like this?
>> >>
>> >
>> > I agree 'net' would not have been the place for a change like that, but
>> > I do think that would be a good idea.  It looks like we could easily
>> > change the ndo_setup_tc to something like this:
>> >
>> >         int                     (*ndo_setup_tc)(struct net_device *dev,
>> >                                                 enum tc_setup_type type,
>> >                                                 void *type_data,
>> >                                                 struct netlink_ext_ack *extack);
>>
>> I think the extack pointer is already in the tc_cls_common_offload
>> struct inside tc_cls_flower_offload struct.
>
> True, but I'm not sure that tc_cls_common_offload is used in all cases.
> Take red_offload() as one of those.

For Flower, we know we have the extack pointer in
tc_cls_common_offload struct and we can use it to set the netlink
error message.  The point is that we don't have to modify
ndo_setup_tc().

^ permalink raw reply

* [net 1/1] tipc: fix unbalanced reference counter
From: Jon Maloy @ 2018-04-11 20:52 UTC (permalink / raw)
  To: davem, netdev; +Cc: tipc-discussion, mohan.krishna.ghanta.krishnamurthy

When a topology subscription is created, we may encounter (or KASAN
may provoke) a failure to create a corresponding service instance in
the binding table. Instead of letting the tipc_nametbl_subscribe()
report the failure back to the caller, the function just makes a warning
printout and returns, without incrementing the subscription reference
counter as expected by the caller.

This makes the caller believe that the subscription was successful, so
it will at a later moment try to unsubscribe the item. This involves
a sub_put() call. Since the reference counter never was incremented
in the first place, we get a premature delete of the subscription item,
followed by a "use-after-free" warning.

We fix this by adding a return value to tipc_nametbl_subscribe() and
make the caller aware of the failure to subscribe.

This bug seems to always have been around, but this fix only applies
back to the commit shown below. Given the low risk of this happening
we believe this to be sufficient.

Fixes: 218527fe27ad ("tipc: replace name table service range array with rb tree")
Reported-by: syzbot+aa245f26d42b8305d157@syzkaller.appspotmail.com

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
---
 net/tipc/name_table.c | 5 ++++-
 net/tipc/name_table.h | 2 +-
 net/tipc/subscr.c     | 5 ++++-
 3 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
index b1fe209..4068eaa 100644
--- a/net/tipc/name_table.c
+++ b/net/tipc/name_table.c
@@ -665,13 +665,14 @@ int tipc_nametbl_withdraw(struct net *net, u32 type, u32 lower,
 /**
  * tipc_nametbl_subscribe - add a subscription object to the name table
  */
-void tipc_nametbl_subscribe(struct tipc_subscription *sub)
+bool tipc_nametbl_subscribe(struct tipc_subscription *sub)
 {
 	struct name_table *nt = tipc_name_table(sub->net);
 	struct tipc_net *tn = tipc_net(sub->net);
 	struct tipc_subscr *s = &sub->evt.s;
 	u32 type = tipc_sub_read(s, seq.type);
 	struct tipc_service *sc;
+	bool res = true;
 
 	spin_lock_bh(&tn->nametbl_lock);
 	sc = tipc_service_find(sub->net, type);
@@ -685,8 +686,10 @@ void tipc_nametbl_subscribe(struct tipc_subscription *sub)
 		pr_warn("Failed to subscribe for {%u,%u,%u}\n", type,
 			tipc_sub_read(s, seq.lower),
 			tipc_sub_read(s, seq.upper));
+		res = false;
 	}
 	spin_unlock_bh(&tn->nametbl_lock);
+	return res;
 }
 
 /**
diff --git a/net/tipc/name_table.h b/net/tipc/name_table.h
index 4b14fc2..0febba4 100644
--- a/net/tipc/name_table.h
+++ b/net/tipc/name_table.h
@@ -126,7 +126,7 @@ struct publication *tipc_nametbl_insert_publ(struct net *net, u32 type,
 struct publication *tipc_nametbl_remove_publ(struct net *net, u32 type,
 					     u32 lower, u32 upper,
 					     u32 node, u32 key);
-void tipc_nametbl_subscribe(struct tipc_subscription *s);
+bool tipc_nametbl_subscribe(struct tipc_subscription *s);
 void tipc_nametbl_unsubscribe(struct tipc_subscription *s);
 int tipc_nametbl_init(struct net *net);
 void tipc_nametbl_stop(struct net *net);
diff --git a/net/tipc/subscr.c b/net/tipc/subscr.c
index b7d80bc..f340e53 100644
--- a/net/tipc/subscr.c
+++ b/net/tipc/subscr.c
@@ -153,7 +153,10 @@ struct tipc_subscription *tipc_sub_subscribe(struct net *net,
 	memcpy(&sub->evt.s, s, sizeof(*s));
 	spin_lock_init(&sub->lock);
 	kref_init(&sub->kref);
-	tipc_nametbl_subscribe(sub);
+	if (!tipc_nametbl_subscribe(sub)) {
+		kfree(sub);
+		return NULL;
+	}
 	timer_setup(&sub->timer, tipc_sub_timeout, 0);
 	timeout = tipc_sub_read(&sub->evt.s, timeout);
 	if (timeout != TIPC_WAIT_FOREVER)
-- 
2.1.4
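
The reference-count contract described in the commit message above can be
sketched in userspace as follows. This is a simplified model under stated
assumptions, not the TIPC code: a plain int stands in for the kref, a
boolean parameter stands in for "the service instance could be created",
and every demo_* name is hypothetical.

```c
#include <stdbool.h>
#include <stdlib.h>

struct demo_sub {
	int refcount;    /* stands in for the kref */
	bool in_table;   /* set once the name table holds the subscription */
};

/* Returns true and takes a reference on success; takes none on failure. */
static bool demo_nametbl_subscribe(struct demo_sub *sub, bool service_ok)
{
	if (!service_ok)
		return false;
	sub->refcount++;
	sub->in_table = true;
	return true;
}

/* Caller side: on failure, free the object instead of handing it out. */
static struct demo_sub *demo_sub_subscribe(bool service_ok)
{
	struct demo_sub *sub = calloc(1, sizeof(*sub));

	if (!sub)
		return NULL;
	sub->refcount = 1;   /* like kref_init() */
	if (!demo_nametbl_subscribe(sub, service_ok)) {
		free(sub);
		return NULL;
	}
	return sub;
}
```

With the failure reported, the caller never holds a subscription whose
count is one lower than it believes, so a later put cannot free the
object early.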

^ permalink raw reply related

* Re: [PATCH net v2 2/6] bnxt_en: do not allow wildcard matches for L2 flows
From: Andy Gospodarek @ 2018-04-11 20:50 UTC (permalink / raw)
  To: Michael Chan; +Cc: Andy Gospodarek, Jakub Kicinski, David Miller, Netdev
In-Reply-To: <CACKFLi=W=0a+2pap5Yis7_ZCRrP+cGxxik6XBLnEPws+ge873g@mail.gmail.com>

On Wed, Apr 11, 2018 at 01:41:31PM -0700, Michael Chan wrote:
> On Wed, Apr 11, 2018 at 1:31 PM, Andy Gospodarek
> <andrew.gospodarek@broadcom.com> wrote:
> > On Wed, Apr 11, 2018 at 11:43:14AM -0700, Jakub Kicinski wrote:
> >> On Wed, 11 Apr 2018 11:50:14 -0400, Michael Chan wrote:
> >> > @@ -764,6 +788,41 @@ static bool bnxt_tc_can_offload(struct bnxt *bp, struct bnxt_tc_flow *flow)
> >> >             return false;
> >> >     }
> >> >
> >> > +   /* Currently source/dest MAC cannot be partial wildcard  */
> >> > +   if (bits_set(&flow->l2_key.smac, sizeof(flow->l2_key.smac)) &&
> >> > +       !is_exactmatch(flow->l2_mask.smac, sizeof(flow->l2_mask.smac))) {
> >> > +           netdev_info(bp->dev, "Wildcard match unsupported for Source MAC\n");
> >>
> >> This wouldn't be something to do in net, but how do you feel about
> >> using extack for messages like this?
> >>
> >
> > I agree 'net' would not have been the place for a change like that, but
> > I do think that would be a good idea.  It looks like we could easily
> > change the ndo_setup_tc to something like this:
> >
> >         int                     (*ndo_setup_tc)(struct net_device *dev,
> >                                                 enum tc_setup_type type,
> >                                                 void *type_data,
> >                                                 struct netlink_ext_ack *extack);
> 
> I think the extack pointer is already in the tc_cls_common_offload
> struct inside tc_cls_flower_offload struct.

True, but I'm not sure that tc_cls_common_offload is used in all cases.
Take red_offload() as one of those.

^ permalink raw reply

* Re: [PATCH net v2 2/6] bnxt_en: do not allow wildcard matches for L2 flows
From: Michael Chan @ 2018-04-11 20:41 UTC (permalink / raw)
  To: Andy Gospodarek; +Cc: Jakub Kicinski, David Miller, Netdev
In-Reply-To: <20180411203152.GD33938@C02RW35GFVH8.dhcp.broadcom.net>

On Wed, Apr 11, 2018 at 1:31 PM, Andy Gospodarek
<andrew.gospodarek@broadcom.com> wrote:
> On Wed, Apr 11, 2018 at 11:43:14AM -0700, Jakub Kicinski wrote:
>> On Wed, 11 Apr 2018 11:50:14 -0400, Michael Chan wrote:
>> > @@ -764,6 +788,41 @@ static bool bnxt_tc_can_offload(struct bnxt *bp, struct bnxt_tc_flow *flow)
>> >             return false;
>> >     }
>> >
>> > +   /* Currently source/dest MAC cannot be partial wildcard  */
>> > +   if (bits_set(&flow->l2_key.smac, sizeof(flow->l2_key.smac)) &&
>> > +       !is_exactmatch(flow->l2_mask.smac, sizeof(flow->l2_mask.smac))) {
>> > +           netdev_info(bp->dev, "Wildcard match unsupported for Source MAC\n");
>>
>> This wouldn't be something to do in net, but how do you feel about
>> using extack for messages like this?
>>
>
> I agree 'net' would not have been the place for a change like that, but
> I do think that would be a good idea.  It looks like we could easily
> change the ndo_setup_tc to something like this:
>
>         int                     (*ndo_setup_tc)(struct net_device *dev,
>                                                 enum tc_setup_type type,
>                                                 void *type_data,
>                                                 struct netlink_ext_ack *extack);

I think the extack pointer is already in the tc_cls_common_offload
struct inside tc_cls_flower_offload struct.

^ permalink raw reply

* Re: [PATCH net v2 2/6] bnxt_en: do not allow wildcard matches for L2 flows
From: Andy Gospodarek @ 2018-04-11 20:31 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Michael Chan, davem, netdev
In-Reply-To: <20180411114303.6f927c45@cakuba.netronome.com>

On Wed, Apr 11, 2018 at 11:43:14AM -0700, Jakub Kicinski wrote:
> On Wed, 11 Apr 2018 11:50:14 -0400, Michael Chan wrote:
> > @@ -764,6 +788,41 @@ static bool bnxt_tc_can_offload(struct bnxt *bp, struct bnxt_tc_flow *flow)
> >  		return false;
> >  	}
> >  
> > +	/* Currently source/dest MAC cannot be partial wildcard  */
> > +	if (bits_set(&flow->l2_key.smac, sizeof(flow->l2_key.smac)) &&
> > +	    !is_exactmatch(flow->l2_mask.smac, sizeof(flow->l2_mask.smac))) {
> > +		netdev_info(bp->dev, "Wildcard match unsupported for Source MAC\n");
> 
> This wouldn't be something to do in net, but how do you feel about
> using extack for messages like this?
> 

I agree 'net' would not have been the place for a change like that, but
I do think that would be a good idea.  It looks like we could easily
change the ndo_setup_tc to something like this:

        int                     (*ndo_setup_tc)(struct net_device *dev,
                                                enum tc_setup_type type,
                                                void *type_data,
                                                struct netlink_ext_ack *extack);

It also looks like most of the callers of ndo_setup_tc have infra in
place to pass extack easily when the call is sourced from a netlink
message.   The others can just pass in NULL or define a local
netlink_ext_ack variable for short-term use.

^ permalink raw reply

* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
From: Ivan Briano @ 2018-04-11 20:31 UTC (permalink / raw)
  To: Thomas Gleixner, Jesus Sanchez-Palencia
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	anna-maria, henrik, John Stultz, levi.pearson, edumazet, willemb,
	mlichvar
In-Reply-To: <alpine.DEB.2.21.1804112209450.1564@nanos.tec.linutronix.de>



On 04/11/2018 01:16 PM, Thomas Gleixner wrote:
> On Tue, 10 Apr 2018, Jesus Sanchez-Palencia wrote:
>>>> This will be provided by tbs if the socket which is transmitting packets is
>>>> configured for deadline mode.
>>>
>>> You don't want the socket to decide that. The qdisc into which a socket
>>> feeds defines the mode and the qdisc rejects requests with the wrong mode.
>>>
>>> Making a qdisc doing both and let the user decide what he wants it to be is
>>> not really going to fly. Especially if you have different users which want
>>> a different mode. It's clearly distinct functionality.
>>
>>
>> Ok, so just to make sure I got this right, are you suggesting that both the
>> 'tbs' qdisc *and* the socket (i.e. through SO_TXTIME) should have a config
>> parameter for specifying the txtime mode? This way if there is a mismatch,
>> packets from that socket are rejected by the qdisc.
> 
> Correct. The same is true if you try to set SO_TXTIME for something which
> is just routing regular traffic.
> 
>> (...)
>>>
>>>> Another question for this mode (but perhaps that applies to both modes) is, what
>>>> if the qdisc misses the deadline for *any* reason? I'm assuming it should drop
>>>> the packet during dequeue.
>>>
>>> There the question is how user space is notified about that issue. The
>>> application which queued the packet on time does rightfully assume that
>>> it's going to be on the wire on time.
>>>
>>> This is a violation of the overall scheduling plan, so you need to have
>>> a sane design to handle that.
>>
>> In addition to the qdisc stats, we could look into using the socket's error
>> queue to notify the application about that.
> 
> Makes sense.
>  
>>>> Putting it all together, we end up with:
>>>>
>>>> 1) a new txtime aware qdisc, tbs, to be used per queue. Its cli will look like:
>>>> $ tc qdisc add (...) tbs clockid CLOCK_REALTIME delta 150000 offload sorting
>>>
>>> Why CLOCK_REALTIME? The only interesting time in a TSN network is
>>> CLOCK_TAI, really.
>>
>> REALTIME was just an example here to show that the qdisc has to be configured
>> with a clockid parameter. Are you suggesting that instead both of the new qdiscs
>> (i.e. tbs and taprio) should always be using CLOCK_TAI implicitly?
> 
> I think so. It's _the_ network time on which everything is based.
> 
>>>> 2) a new cmsg-interface for setting a per-packet timestamp that will be used
>>>> either as a txtime or as deadline by tbs (and further the NIC driver for the
>>>> offload case): SCM_TXTIME.
>>>>
>>>> 3) a new socket option: SO_TXTIME. It will be used to enable the feature for a
>>>> socket, and will have as parameters a clockid and a txtime mode (deadline or
>>>> explicit), that defines the semantics of the timestamp set on packets using
>>>> SCM_TXTIME.
>>>>
>>>> 4) a new #define DYNAMIC_CLOCKID 15 added to include/uapi/linux/time.h .
>>>
>>> Can you remind me why we would need that?
>>
>> So there is a "clockid" that can be used for the full hw offload modes. On this
>> case, the txtimes are in reference to the NIC's PTP clock, and, as discussed, we
>> can't just use a clockid that was computed from the fd pointing to /dev/ptpX .
> 
> And the NICs PTP clock is CLOCK_TAI, so there should be no reason to have
> yet another clock, right?
> 

Most likely, though you can technically have a different time domain
that is not based on TAI.

> Thanks,
> 
> 	tglx
> 

^ permalink raw reply

* Re: [PATCH net v2 2/6] bnxt_en: do not allow wildcard matches for L2 flows
From: Michael Chan @ 2018-04-11 20:21 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Andy Gospodarek, David Miller, Netdev
In-Reply-To: <20180411114303.6f927c45@cakuba.netronome.com>

On Wed, Apr 11, 2018 at 11:43 AM, Jakub Kicinski <kubakici@wp.pl> wrote:
> On Wed, 11 Apr 2018 11:50:14 -0400, Michael Chan wrote:
>> @@ -764,6 +788,41 @@ static bool bnxt_tc_can_offload(struct bnxt *bp, struct bnxt_tc_flow *flow)
>>               return false;
>>       }
>>
>> +     /* Currently source/dest MAC cannot be partial wildcard  */
>> +     if (bits_set(&flow->l2_key.smac, sizeof(flow->l2_key.smac)) &&
>> +         !is_exactmatch(flow->l2_mask.smac, sizeof(flow->l2_mask.smac))) {
>> +             netdev_info(bp->dev, "Wildcard match unsupported for Source MAC\n");
>
> This wouldn't be something to do in net, but how do you feel about
> using extack for messages like this?
>

Sounds reasonable to me.  Just need to pass in the extack pointer to
this function to set the netlink error message.
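
A rough userspace sketch of what passing extack into the check could look
like. It is only a model: struct demo_ext_ack stands in for struct
netlink_ext_ack, and the two helpers are assumed to behave like the
driver's bits_set() and is_exactmatch() (any non-zero key byte, and an
all-0xff mask, respectively). None of the demo_* names exist in the kernel.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for struct netlink_ext_ack; only the message is modeled. */
struct demo_ext_ack {
	const char *msg;
};

/* Record the message if the caller supplied an extack (it may be NULL). */
static void demo_set_err_msg(struct demo_ext_ack *extack, const char *msg)
{
	if (extack)
		extack->msg = msg;
}

/* True if any byte of the key is non-zero. */
static bool demo_bits_set(const unsigned char *key, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (key[i])
			return true;
	return false;
}

/* True if every mask byte is 0xff, i.e. there are no wildcard bits. */
static bool demo_is_exactmatch(const unsigned char *mask, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (mask[i] != 0xff)
			return false;
	return true;
}

/* Reject a partially wildcarded source MAC and report why via extack. */
static bool demo_can_offload_smac(const unsigned char key[6],
				  const unsigned char mask[6],
				  struct demo_ext_ack *extack)
{
	if (demo_bits_set(key, 6) && !demo_is_exactmatch(mask, 6)) {
		demo_set_err_msg(extack,
				 "Wildcard match unsupported for Source MAC");
		return false;
	}
	return true;
}
```

The message then travels back to the requesting process with the netlink
error instead of landing only in the kernel log.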

^ permalink raw reply

* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
From: Thomas Gleixner @ 2018-04-11 20:16 UTC (permalink / raw)
  To: Jesus Sanchez-Palencia
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	anna-maria, henrik, John Stultz, levi.pearson, edumazet, willemb,
	mlichvar
In-Reply-To: <e758b90b-508b-d1e8-bd1b-41e7b40c357b@intel.com>

On Tue, 10 Apr 2018, Jesus Sanchez-Palencia wrote:
> >> This will be provided by tbs if the socket which is transmitting packets is
> >> configured for deadline mode.
> > 
> > You don't want the socket to decide that. The qdisc into which a socket
> > feeds defines the mode and the qdisc rejects requests with the wrong mode.
> > 
> > Making a qdisc doing both and let the user decide what he wants it to be is
> > not really going to fly. Especially if you have different users which want
> > a different mode. It's clearly distinct functionality.
> 
> 
> Ok, so just to make sure I got this right, are you suggesting that both the
> 'tbs' qdisc *and* the socket (i.e. through SO_TXTIME) should have a config
> parameter for specifying the txtime mode? This way if there is a mismatch,
> packets from that socket are rejected by the qdisc.

Correct. The same is true if you try to set SO_TXTIME for something which
is just routing regular traffic.

> (...)
> > 
> >> Another question for this mode (but perhaps that applies to both modes) is, what
> >> if the qdisc misses the deadline for *any* reason? I'm assuming it should drop
> >> the packet during dequeue.
> > 
> > There the question is how user space is notified about that issue. The
> > application which queued the packet on time does rightfully assume that
> > it's going to be on the wire on time.
> > 
> > This is a violation of the overall scheduling plan, so you need to have
> > a sane design to handle that.
> 
> In addition to the qdisc stats, we could look into using the socket's error
> queue to notify the application about that.

Makes sense.
 
> >> Putting it all together, we end up with:
> >>
> >> 1) a new txtime aware qdisc, tbs, to be used per queue. Its cli will look like:
> >> $ tc qdisc add (...) tbs clockid CLOCK_REALTIME delta 150000 offload sorting
> > 
> > Why CLOCK_REALTIME? The only interesting time in a TSN network is
> > CLOCK_TAI, really.
> 
> REALTIME was just an example here to show that the qdisc has to be configured
> with a clockid parameter. Are you suggesting that instead both of the new qdiscs
> (i.e. tbs and taprio) should always be using CLOCK_TAI implicitly?

I think so. It's _the_ network time on which everything is based.

> >> 2) a new cmsg-interface for setting a per-packet timestamp that will be used
> >> either as a txtime or as deadline by tbs (and further the NIC driver for the
> >> offload case): SCM_TXTIME.
> >>
> >> 3) a new socket option: SO_TXTIME. It will be used to enable the feature for a
> >> socket, and will have as parameters a clockid and a txtime mode (deadline or
> >> explicit), that defines the semantics of the timestamp set on packets using
> >> SCM_TXTIME.
> >>
> >> 4) a new #define DYNAMIC_CLOCKID 15 added to include/uapi/linux/time.h .
> > 
> > Can you remind me why we would need that?
> 
> So there is a "clockid" that can be used for the full hw offload modes. On this
> case, the txtimes are in reference to the NIC's PTP clock, and, as discussed, we
> can't just use a clockid that was computed from the fd pointing to /dev/ptpX .

And the NICs PTP clock is CLOCK_TAI, so there should be no reason to have
yet another clock, right?

Thanks,

	tglx
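
To make the two points agreed on in this thread concrete (the qdisc owns
the txtime mode and rejects mismatching sockets, and a packet that misses
its time is dropped and accounted), here is a small userspace model. It is
a sketch under assumptions, not the RFC's code: the demo_* names, the
return codes, and the reading of "delta" as the lead time before txtime at
which a packet becomes eligible are all illustrative guesses. Error-queue
notification is not modeled; the drop counter stands in for qdisc stats.

```c
#include <errno.h>
#include <stdint.h>

enum demo_txtime_mode { DEMO_TXTIME_EXPLICIT, DEMO_TXTIME_DEADLINE };

struct demo_pkt {
	enum demo_txtime_mode mode; /* mode the sending socket was set to */
	int64_t txtime_ns;          /* per-packet timestamp (SCM_TXTIME) */
};

struct demo_qdisc {
	enum demo_txtime_mode mode; /* the qdisc defines the mode */
	int64_t delta_ns;           /* e.g. "delta 150000" in the tc example */
	unsigned int drops;         /* packets dropped for missing their time */
};

/* Enqueue: reject packets whose socket mode disagrees with the qdisc. */
static int demo_enqueue(const struct demo_qdisc *q, const struct demo_pkt *p)
{
	return p->mode == q->mode ? 0 : -EINVAL;
}

/*
 * Dequeue at time now_ns: 1 = transmit, 0 = not due yet,
 * -ETIME = time already passed, packet dropped and counted.
 */
static int demo_dequeue(struct demo_qdisc *q, const struct demo_pkt *p,
			int64_t now_ns)
{
	if (now_ns > p->txtime_ns) {
		q->drops++;
		return -ETIME;
	}
	if (q->mode == DEMO_TXTIME_DEADLINE)
		return 1;   /* anything before the deadline may go out */
	return now_ns >= p->txtime_ns - q->delta_ns ? 1 : 0;
}
```

In a real setup now_ns would come from the clock the qdisc is bound to,
which per the discussion above would most likely be CLOCK_TAI.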

^ permalink raw reply
