Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH bpf-next v2 0/7] BTF uapi cleanup
From: Daniel Borkmann @ 2018-05-23 10:29 UTC (permalink / raw)
  To: Martin KaFai Lau, netdev; +Cc: Alexei Starovoitov, Yonghong Song, kernel-team
In-Reply-To: <20180522215723.784040-1-kafai@fb.com>

On 05/22/2018 11:57 PM, Martin KaFai Lau wrote:
> This patch set makes some changes to cleanup the unused
> bits in BTF uapi.  It also makes the btf_header extensible.
> 
> Please see individual patches for details.
> 
> v2:
> - Remove NR_SECS from patch 2
> - Remove "unsigned" check on array->index_type from patch 3
> - Remove BTF_INT_VARARGS and further limit BTF_INT_ENCODING
>   from 8 bits to 4 bits in patch 4
> - Adjustments in test_btf.c to reflect changes in v2
> 
> Martin KaFai Lau (7):
>   bpf: Expose check_uarg_tail_zero()
>   bpf: btf: Change how section is supported in btf_header
>   bpf: btf: Check array->index_type
>   bpf: btf: Remove unused bits from uapi/linux/btf.h
>   bpf: btf: Rename btf_key_id and btf_value_id in bpf_map_info
>   bpf: btf: Sync bpf.h and btf.h to tools
>   bpf: btf: Add tests for the btf uapi changes
> 
>  include/linux/bpf.h                    |   6 +-
>  include/uapi/linux/bpf.h               |   8 +-
>  include/uapi/linux/btf.h               |  37 +--
>  kernel/bpf/arraymap.c                  |   2 +-
>  kernel/bpf/btf.c                       | 335 +++++++++++++++------
>  kernel/bpf/syscall.c                   |  32 +-
>  tools/include/uapi/linux/bpf.h         |   8 +-
>  tools/include/uapi/linux/btf.h         |  37 +--
>  tools/lib/bpf/bpf.c                    |   4 +-
>  tools/lib/bpf/bpf.h                    |   4 +-
>  tools/lib/bpf/btf.c                    |   5 +-
>  tools/lib/bpf/libbpf.c                 |  34 +--
>  tools/lib/bpf/libbpf.h                 |   4 +-
>  tools/testing/selftests/bpf/test_btf.c | 521 +++++++++++++++++++++++++++------
>  14 files changed, 753 insertions(+), 284 deletions(-)
> 

Applied to bpf-next, thanks Martin!

^ permalink raw reply

* Re: [PATCH v2] selftests: net: reuseport_bpf_numa: don't fail if no numa support
From: Daniel Borkmann @ 2018-05-23 10:29 UTC (permalink / raw)
  To: Anders Roxell, davem, shuah; +Cc: netdev, linux-kselftest, linux-kernel
In-Reply-To: <20180518222737.16006-1-anders.roxell@linaro.org>

On 05/19/2018 12:27 AM, Anders Roxell wrote:
> The reuseport_bpf_numa test case fails there's no numa support.  The
> test shouldn't fail if there's no support it should be skipped.
> 
> Fixes: 3c2c3c16aaf6 ("reuseport, bpf: add test case for bpf_get_numa_node_id")
> Signed-off-by: Anders Roxell <anders.roxell@linaro.org>

Applied to bpf tree, thanks Anders!

^ permalink raw reply

* Re: [bpf-next V4 PATCH 2/8] bpf: devmap prepare xdp frames for bulking
From: Jesper Dangaard Brouer @ 2018-05-23 10:29 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: netdev, Daniel Borkmann, Alexei Starovoitov, Christoph Hellwig,
	BjörnTöpel, Magnus Karlsson, makita.toshiaki, brouer
In-Reply-To: <1db29c53-1568-a3b5-f2d3-1c830aefe33b@iogearbox.net>

On Wed, 23 May 2018 11:54:38 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> Please fix this bug up and send a fresh series, so we can fix this
> right away without needing a follow-up in bpf-next.

Okay.  So I assume this means you reverted/didn't push this patchset in
bpf-next tree... so I have time to respin the series, and I will send a V5.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [bpf-next V4 PATCH 2/8] bpf: devmap prepare xdp frames for bulking
From: Daniel Borkmann @ 2018-05-23  9:54 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov
  Cc: Christoph Hellwig, BjörnTöpel, Magnus Karlsson,
	makita.toshiaki
In-Reply-To: <152665048175.21055.15345477060144555643.stgit@firesoul>

On 05/18/2018 03:34 PM, Jesper Dangaard Brouer wrote:
> Like cpumap create queue for xdp frames that will be bulked.  For now,
> this patch simply invoke ndo_xdp_xmit foreach frame.  This happens,
> either when the map flush operation is envoked, or when the limit
> DEV_MAP_BULK_SIZE is reached.
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  kernel/bpf/devmap.c |   77 ++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 73 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> index 808808bf2bf2..cab72c100bb5 100644
> --- a/kernel/bpf/devmap.c
> +++ b/kernel/bpf/devmap.c
> @@ -54,11 +54,18 @@
>  #define DEV_CREATE_FLAG_MASK \
>  	(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
>  
> +#define DEV_MAP_BULK_SIZE 16
> +struct xdp_bulk_queue {
> +	struct xdp_frame *q[DEV_MAP_BULK_SIZE];
> +	unsigned int count;
> +};
> +
>  /* objects in the map */
>  struct bpf_dtab_netdev {
>  	struct net_device *dev; /* must be first member, due to tracepoint */
>  	struct bpf_dtab *dtab;
>  	unsigned int bit;
> +	struct xdp_bulk_queue __percpu *bulkq;
>  	struct rcu_head rcu;
>  };
>  
> @@ -209,6 +216,38 @@ void __dev_map_insert_ctx(struct bpf_map *map, u32 bit)
>  	__set_bit(bit, bitmap);
>  }
>  
> +static int bq_xmit_all(struct bpf_dtab_netdev *obj,
> +			 struct xdp_bulk_queue *bq)
> +{
> +	unsigned int processed = 0, drops = 0;
> +	struct net_device *dev = obj->dev;
> +	int i;
> +
> +	if (unlikely(!bq->count))
> +		return 0;
> +
> +	for (i = 0; i < bq->count; i++) {
> +		struct xdp_frame *xdpf = bq->q[i];
> +
> +		prefetch(xdpf);
> +	}
> +
> +	for (i = 0; i < bq->count; i++) {
> +		struct xdp_frame *xdpf = bq->q[i];
> +		int err;
> +
> +		err = dev->netdev_ops->ndo_xdp_xmit(dev, xdpf);
> +		if (err) {
> +			drops++;
> +			xdp_return_frame(xdpf);
> +		}
> +		processed++;

This sort of thing makes it really hard to review. 'processed' and
'drops' are not read anywhere in this function. So I need to go and
check all the other patches whether there's further logic involved here
or not. I had to review your series after applying all patches in a
local branch, please never do this. Add the logic in a patch where it's
self-contained and obvious to review.

> +	}
> +	bq->count = 0;
> +
> +	return 0;
> +}
> +
>  /* __dev_map_flush is called from xdp_do_flush_map() which _must_ be signaled
>   * from the driver before returning from its napi->poll() routine. The poll()
>   * routine is called either from busy_poll context or net_rx_action signaled
> @@ -224,6 +263,7 @@ void __dev_map_flush(struct bpf_map *map)
>  
>  	for_each_set_bit(bit, bitmap, map->max_entries) {
>  		struct bpf_dtab_netdev *dev = READ_ONCE(dtab->netdev_map[bit]);
> +		struct xdp_bulk_queue *bq;
>  		struct net_device *netdev;
>  
>  		/* This is possible if the dev entry is removed by user space
> @@ -233,6 +273,9 @@ void __dev_map_flush(struct bpf_map *map)
>  			continue;
>  
>  		__clear_bit(bit, bitmap);
> +
> +		bq = this_cpu_ptr(dev->bulkq);
> +		bq_xmit_all(dev, bq);
>  		netdev = dev->dev;
>  		if (likely(netdev->netdev_ops->ndo_xdp_flush))
>  			netdev->netdev_ops->ndo_xdp_flush(netdev);
> @@ -255,6 +298,20 @@ struct bpf_dtab_netdev *__dev_map_lookup_elem(struct bpf_map *map, u32 key)
>  	return obj;
>  }
>  
> +/* Runs under RCU-read-side, plus in softirq under NAPI protection.
> + * Thus, safe percpu variable access.
> + */
> +static int bq_enqueue(struct bpf_dtab_netdev *obj, struct xdp_frame *xdpf)
> +{
> +	struct xdp_bulk_queue *bq = this_cpu_ptr(obj->bulkq);
> +
> +	if (unlikely(bq->count == DEV_MAP_BULK_SIZE))
> +		bq_xmit_all(obj, bq);
> +
> +	bq->q[bq->count++] = xdpf;
> +	return 0;
> +}
> +
>  int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp)
>  {
>  	struct net_device *dev = dst->dev;
> @@ -268,8 +325,7 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp)
>  	if (unlikely(!xdpf))
>  		return -EOVERFLOW;
>  
> -	/* TODO: implement a bulking/enqueue step later */
> -	err = dev->netdev_ops->ndo_xdp_xmit(dev, xdpf);
> +	err = bq_enqueue(dst, xdpf);
>  	if (err)
>  		return err;
>  
> @@ -288,13 +344,18 @@ static void dev_map_flush_old(struct bpf_dtab_netdev *dev)
>  {
>  	if (dev->dev->netdev_ops->ndo_xdp_flush) {
>  		struct net_device *fl = dev->dev;
> +		struct xdp_bulk_queue *bq;
>  		unsigned long *bitmap;
> +
>  		int cpu;

Please remove the newline from above.

>  		for_each_online_cpu(cpu) {
>  			bitmap = per_cpu_ptr(dev->dtab->flush_needed, cpu);
>  			__clear_bit(dev->bit, bitmap);
>  
> +			bq = per_cpu_ptr(dev->bulkq, cpu);
> +			bq_xmit_all(dev, bq);
> +
>  			fl->netdev_ops->ndo_xdp_flush(dev->dev);
>  		}
>  	}
> @@ -306,6 +367,7 @@ static void __dev_map_entry_free(struct rcu_head *rcu)
>  
>  	dev = container_of(rcu, struct bpf_dtab_netdev, rcu);
>  	dev_map_flush_old(dev);
> +	free_percpu(dev->bulkq);
>  	dev_put(dev->dev);
>  	kfree(dev);
>  }
> @@ -338,6 +400,7 @@ static int dev_map_update_elem(struct bpf_map *map, void *key, void *value,
>  {
>  	struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
>  	struct net *net = current->nsproxy->net_ns;
> +	gfp_t gfp = GFP_ATOMIC | __GFP_NOWARN;
>  	struct bpf_dtab_netdev *dev, *old_dev;
>  	u32 i = *(u32 *)key;
>  	u32 ifindex = *(u32 *)value;
> @@ -352,11 +415,17 @@ static int dev_map_update_elem(struct bpf_map *map, void *key, void *value,
>  	if (!ifindex) {
>  		dev = NULL;
>  	} else {
> -		dev = kmalloc_node(sizeof(*dev), GFP_ATOMIC | __GFP_NOWARN,
> -				   map->numa_node);
> +		dev = kmalloc_node(sizeof(*dev), gfp, map->numa_node);
>  		if (!dev)
>  			return -ENOMEM;
>  
> +		dev->bulkq = __alloc_percpu_gfp(sizeof(*dev->bulkq),
> +						sizeof(void *), gfp);
> +		if (!dev->bulkq) {
> +			kfree(dev);
> +			return -ENOMEM;
> +		}
> +
>  		dev->dev = dev_get_by_index(net, ifindex);
>  		if (!dev->dev) {
>  			kfree(dev);

Ahh well, and I just realized that here you are leaking memory in the error path. :(

A free_percpu(dev->bulkq) is missing.

Please fix this bug up and send a fresh series, so we can fix this right away without
needing a follow-up in bpf-next.

Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH net-next] sctp: add support for SCTP_REUSE_PORT sockopt
From: Neil Horman @ 2018-05-23  9:48 UTC (permalink / raw)
  To: Xin Long
  Cc: Michael Tuexen, Marcelo Ricardo Leitner, network dev, linux-sctp,
	davem
In-Reply-To: <CADvbK_fRLwT=SHHahPwpJvxnFQWbuwrM6PWLW+20vmXhUgAZxA@mail.gmail.com>

On Wed, May 23, 2018 at 03:04:53PM +0800, Xin Long wrote:
> On Tue, May 22, 2018 at 7:51 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
> > On Tue, May 22, 2018 at 03:07:57PM +0800, Xin Long wrote:
> >> On Mon, May 21, 2018 at 9:48 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
> >> > On Mon, May 21, 2018 at 02:16:56PM +0200, Michael Tuexen wrote:
> >> >> > On 21. May 2018, at 13:39, Neil Horman <nhorman@tuxdriver.com> wrote:
> >> >> >
> >> >> > On Sun, May 20, 2018 at 10:54:04PM -0300, Marcelo Ricardo Leitner wrote:
> >> >> >> On Sun, May 20, 2018 at 08:50:59PM -0400, Neil Horman wrote:
> >> >> >>> On Sat, May 19, 2018 at 03:44:40PM +0800, Xin Long wrote:
> >> >> >>>> This feature is actually already supported by sk->sk_reuse which can be
> >> >> >>>> set by SO_REUSEADDR. But it's not working exactly as RFC6458 demands in
> >> >> >>>> section 8.1.27, like:
> >> >> >>>>
> >> >> >>>>  - This option only supports one-to-one style SCTP sockets
> >> >> >>>>  - This socket option must not be used after calling bind()
> >> >> >>>>    or sctp_bindx().
> >> >> >>>>
> >> >> >>>> Besides, SCTP_REUSE_PORT sockopt should be provided for user's programs.
> >> >> >>>> Otherwise, the programs with SCTP_REUSE_PORT from other systems will not
> >> >> >>>> work in linux.
> >> >> >>>>
> >> >> >>>> This patch reuses sk->sk_reuse and works pretty much as SO_REUSEADDR,
> >> >> >>>> just with some extra setup limitations that are neeeded when it is being
> >> >> >>>> enabled.
> >> >> >>>>
> >> >> >>>> "It should be noted that the behavior of the socket-level socket option
> >> >> >>>> to reuse ports and/or addresses for SCTP sockets is unspecified", so it
> >> >> >>>> leaves SO_REUSEADDR as is for the compatibility.
> >> >> >>>>
> >> >> >>>> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> >> >> >>>> ---
> >> >> >>>> include/uapi/linux/sctp.h |  1 +
> >> >> >>>> net/sctp/socket.c         | 48 +++++++++++++++++++++++++++++++++++++++++++++++
> >> >> >>>> 2 files changed, 49 insertions(+)
> >> >> >>>>
> >> >> >>> A few things:
> >> >> >>>
> >> >> >>> 1) I agree with Tom, this feature is a complete duplication of the SK_REUSEPORT
> >> >> >>> socket option.  I understand that this is an implementation of the option in the
> >> >> >>> RFC, but its definately a duplication of a feature, which makes several things
> >> >> >>> really messy.
> >> >> >>>
> >> >> >>> 2) The overloading of the sk_reuse opeion is a bad idea, for several reasons.
> >> >> >>> Chief among them is the behavioral interference between this patch and the
> >> >> >>> SO_REUSEADDR socket level option, that also sets this feature.  If you set
> >> >> >>> sk_reuse via SO_REUSEADDR, you will set the SCTP port reuse feature regardless
> >> >> >>> of the bind or 1:1/1:m state of the socket.  Vice versa, if you set this socket
> >> >> >>> option via the SCTP_PORT_REUSE option you will inadvertently turn on address
> >> >> >>> reuse for the socket.  We can't do that.
> >> >> >>
> >> >> >> Given your comments, going a bit further here, one other big
> >> >> >> implication is that a port would never be able to be considered to
> >> >> >> fully meet SCTP standards regarding reuse because a rogue application
> >> >> >> may always abuse of the socket level opt to gain access to the port.
> >> >> >>
> >> >> >> IOW, the patch allows the application to use such restrictions against
> >> >> >> itself and nothing else, which undermines the patch idea.
> >> >> >>
> >> >> > Agreed.
> >> >> >
> >> >> >> I lack the knowledge on why the SCTP option was proposed in the RFC. I
> >> >> >> guess they had a good reason to add the restriction on 1:1/1:m style.
> >> >> >> Does the usage of the current imply in any risk to SCTP sockets? If
> >> >> >> yes, that would give some grounds for going forward with the SCTP
> >> >> >> option.
> >> >> >>
> >> >> > I'm also not privy to why the sctp option was proposed, though I expect that the
> >> >> > lack of standardization of SO_REUSEPORT probably had something to do with it.
> >> >> > As for the reasoning behind restriction to only 1:1 sockets, if I had to guess,
> >> >> > I would say it likely because it creates ordering difficulty at the application
> >> >> > level.
> >> >> >
> >> >> > CC-ing Michael Tuxen, who I believe had some input on this RFC.  Hopefully he
> >> >> > can shed some light on this.
> >> >> Dear all,
> >> >>
> >> >> the reason this was added is to have a specified way to allow a system to
> >> >> behave like a client and server making use of the INIT collision.
> >> >>
> >> >> For 1-to-many style sockets you can do this by creating a socket, binding it,
> >> >> calling listen on it and trying to connect to the peer.
> >> >>
> >> >> For 1-to-1 style sockets you need two sockets for it. One listener and one
> >> >> you use to connect (and close it in case of failure, open a new one...).
> >> >>
> >> >> It was not clear if one can achieve this with SO_REUSEPORT and/or SO_REUSEADDR
> >> >> on all platforms. We left that unspecified.
> >> >>
> >> >> I hope this makes the intention clearer.
> >> >>
> >> > I think it makes the intention clearer yes, but it unfortunately does nothing in
> >> > my mind to clarify how the implementation should best handle the potential
> >> > overlap in functionality.  What I see here is that we have two functional paths
> >> > (the SO_REUSEPORT path and the SCTP_PORT_REUSE path), which may or may not
> >> > (depending on the OS implementation achieve the same functional goal (allowing
> >> > multiple sockets to share a port while allowing one socket to listen and the
> >> > other connect to a remote peer).  If both implementations do the same thing on a
> >> > given platform, we can either just alias one to another and be done, but if they
> >> > don't then we either have to implement both paths, and ensure that the
> >> > SO_REUSEPORT path is a no-op/error return for SCTP sockets, or that each path
> >> > implements a distinct feature set that is cleaarly documented.
> >> >
> >> > That said, I think we may be in luck.  Looking at the connect and listen paths,
> >> > it appears to me that:
> >> >
> >> > 1) Sockets ignore SO_REUSEPORT in the connect and listen paths (save for any
> >> > autobinding) so it would appear that the intent of the SCTP rfc can be honored
> >> > via SO_REUSEPORT on linux.
> >> >
> >> > 2) SO_REUSEPORT prevents changing state after a bind has occured, so we can honr
> >> > that part of the SCTP RFC.
> >> >
> >> > The only missing part is the restriction that SCTP_REUSE_PORT has which is
> >> > unaccounted for is that 1:M sockets aren't allowed to enable port reuse.
> >> > However, I think the implication from Michaels description above is that port
> >> > reuse on a 1:M socket is implicit because a single socket can connect and listen
> >> > in that use case, rather than there being a danger to doing so.
> >> >
> >> > As such, I would propose that we implement this socket option by simply setting
> >> > the sk->sk_reuseport field in the sock structure, and document the fact that
> >> > linux does not restrict port reuse from 1:M sockets.
> >> Note that, sk->sk_reuseport is not affecting linux SCTP socket at all now.
> >> linux SCTP socket doesn't really have SO_REUSEADDR (sk->sk_reuse)
> >> support, but use sk->sk_reuse as REUSE_PORT, (yes, it is confusing).
> >> Pls refer to sctp_get_port_local().
> >>
> > No, its not used now, but if you do use it to do something specific to SCTP (via
> > the SCTP_REUSE_PORT socket option), you risk aliasing SO_REUSEPORT behavior to
> > it, and if it doesn't match what the RFC behavior mandates, thats a problem.
> >
> >> So I'm not sure using sk->sk_reuseport here means we will drop sk->sk_reuse
> >> use in linux SCTP but use sk->sk_reuseport instead, or we will think that socket
> >> enables 'port reuse' when either of them is set.
> >>
> > I don't think we would drop the behavior of sk_reuse here, why would we?  As far
> > as I can see, the behavior of SO_REUSEADDR (not SO_REUSEPORT), isn't in
> > question, is it?
> >
> >> Note some users may be already using SO_REUSEADDR to enable the 'port
> >> reuse' in linux sctp socket. If we're changing to sk->sk_reuseport, we may face
> >> a compatibility problem.
> >>
> > I don't see how the behavior of SO_REUSEADDR is in question here.  All I'm
> > suggesting is that you simplify this patch so that the SCTP_REUSE_PORT socket
> > option set sk_reuseport, as that option to my eyes conforms to the sctp rfc
> > requirements.  Or am I' missing something?
> No, I am :)
> sk_reuseport seems more complicated than sk_reuse. I kind of mixed them.
> I need to check more beofore continuing. Thanks.
> 
Thank you!
Neil

> >
> > Neil
> >
> >>
> >> >
> >> > Thoughts?
> >> > Neil
> >> >
> >>
> 

^ permalink raw reply

* pull-request: mac80211 2018-05-23
From: Johannes Berg @ 2018-05-23  9:47 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-wireless

Hi Dave,

Just another handful of fixes as we wind down towards the
merge window.

Please pull and let me know if there's any problem.

Thanks,
johannes



The following changes since commit 6caf9fb3bda17df59de4ed6ed4950c43ca1361e3:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf (2018-05-17 23:33:52 -0400)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211.git tags/mac80211-for-davem-2018-05-23

for you to fetch changes up to fed4825096cfbbfd654cb292ab6eb193911aef01:

  mac80211_hwsim: Fix radio dump for radio idx 0 (2018-05-22 10:24:17 +0200)

----------------------------------------------------------------
A handful of fixes:
 * hwsim radio dump wasn't working for the first radio
 * mesh was updating statistics incorrectly
 * a netlink message allocation was possibly too short
 * wiphy name limit was still too long
 * in certain cases regdb query could find a NULL pointer

----------------------------------------------------------------
Andrew Zaborowski (1):
      mac80211_hwsim: Fix radio dump for radio idx 0

Bob Copeland (1):
      mac80211: mesh: fix premature update of rc stats

Dedy Lansky (1):
      nl80211: fix nlmsg allocation in cfg80211_ft_event

Eric Biggers (1):
      cfg80211: further limit wiphy names to 64 bytes

Haim Dreyfuss (1):
      cfg80211: fix NULL pointer derference when querying regdb

 drivers/net/wireless/mac80211_hwsim.c | 4 ++--
 include/uapi/linux/nl80211.h          | 2 +-
 net/mac80211/mesh_plink.c             | 8 ++++----
 net/wireless/nl80211.c                | 3 ++-
 net/wireless/reg.c                    | 3 +++
 5 files changed, 12 insertions(+), 8 deletions(-)

^ permalink raw reply

* Re: [PATCH bpf-next 0/5] fix test_sockmap
From: Prashant Bhole @ 2018-05-23  9:44 UTC (permalink / raw)
  To: John Fastabend, Alexei Starovoitov, Daniel Borkmann
  Cc: David S . Miller, Shuah Khan, netdev
In-Reply-To: <f9f46ed7-43ae-4421-4546-ffb09fabd1fe@gmail.com>



On 5/22/2018 2:08 AM, John Fastabend wrote:
> On 05/20/2018 10:13 PM, Prashant Bhole wrote:
>>
>>
>> On 5/19/2018 1:42 AM, John Fastabend wrote:
>>> On 05/18/2018 12:17 AM, Prashant Bhole wrote:
>>>> This series fixes bugs in test_sockmap code. They weren't caught
>>>> previously because failure in RX/TX thread was not notified to the
>>>> main thread.
>>>>
>>>> Also fixed data verification logic and slightly improved test output
>>>> such that parameters values (cork, apply, start, end) of failed test
>>>> can be easily seen.
>>>>
>>>
>>> Great, this was on my list so thanks for taking care of it.
>>>
>>>> Note: Even after fixing above problems there are issues with tests
>>>> which set cork parameter. Tests fail (RX thread timeout) when cork
>>>> value is non-zero and overall data sent by TX thread isn't multiples
>>>> of cork value.
>>>
>>>
>>> This is expected. When 'cork' is set the sender should only xmit
>>> the data when 'cork' bytes are available. If the user doesn't
>>> provide the N bytes the data is cork'ed waiting for the bytes and
>>> if the socket is closed the state is cleaned up. What these tests
>>> are testing is the cleanup path when a user doesn't provide the
>>> N bytes. In practice this is used to validate headers and prevent
>>> users from sending partial headers. We want to keep these tests because
>>> they verify a tear-down path in the code.
>>
>> Ok.
>>
>>>
>>> After your changes do these get reported as failures? If so we
>>> need to account for the above in the calculations.
>>
>> Yes, cork related test are reported as failures because of RX thread
>> timeout.
>>
>> So with your above description, I think we need to differentiate cork
>> tests with partial data and full data. In partial data test we can have
>> something like "timeout_expected" flag. Any other way to fix it?
>>
> 
> Adding a flag seems reasonable to me. Lets do this for now. Also I
> plan to add more negative tests so we can either use the same
> flag or a new one for those cases as well.
> 

John,
I worked on this for some time and noticed that the RX-timeout of tests 
with cork parameter is dependent on various parameters. So we can not 
set a flag like the way 'drop_expected' flag is set before executing the 
test.

So I decided to write a function which judges all parameters before each 
test and decides whether a test with cork parameter will timeout or not. 
Then the conditions in the function became complicated. For example some 
tests fail if opt->rate < 17 (with some other conditions). Here is 17 is 
related to FRAGS_PER_SKB. Consider following two examples.

./test_sockmap --cgroup /mnt/cgroup2 -r 16 -i 1 -l 30 -t sendpage 
--txmsg --txmsg_cork 1024   # RX timeout occurs

./test_sockmap --cgroup /mnt/cgroup2 -r 17 -i 1 -l 30 -t sendpage 
--txmsg --txmsg_cork 1024   # Success!

Do we need to keep such tests? if yes, then I will continue with adding 
such conditions in the function.


-Prashant

^ permalink raw reply

* Re: [net-next 1/6] net/dcb: Add dcbnl buffer attribute
From: Jiri Pirko @ 2018-05-23  9:43 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Saeed Mahameed, David S. Miller, netdev, Huy Nguyen, Or Gerlitz
In-Reply-To: <20180521222026.4f54f479@cakuba>

Tue, May 22, 2018 at 07:20:26AM CEST, jakub.kicinski@netronome.com wrote:
>On Mon, 21 May 2018 14:04:57 -0700, Saeed Mahameed wrote:
>> From: Huy Nguyen <huyn@mellanox.com>
>> 
>> In this patch, we add dcbnl buffer attribute to allow user
>> change the NIC's buffer configuration such as priority
>> to buffer mapping and buffer size of individual buffer.
>> 
>> This attribute combined with pfc attribute allows advance user to
>> fine tune the qos setting for specific priority queue. For example,
>> user can give dedicated buffer for one or more prirorities or user
>> can give large buffer to certain priorities.
>> 
>> We present an use case scenario where dcbnl buffer attribute configured
>> by advance user helps reduce the latency of messages of different sizes.
>> 
>> Scenarios description:
>> On ConnectX-5, we run latency sensitive traffic with
>> small/medium message sizes ranging from 64B to 256KB and bandwidth sensitive
>> traffic with large messages sizes 512KB and 1MB. We group small, medium,
>> and large message sizes to their own pfc enables priorities as follow.
>>   Priorities 1 & 2 (64B, 256B and 1KB)
>>   Priorities 3 & 4 (4KB, 8KB, 16KB, 64KB, 128KB and 256KB)
>>   Priorities 5 & 6 (512KB and 1MB)
>> 
>> By default, ConnectX-5 maps all pfc enabled priorities to a single
>> lossless fixed buffer size of 50% of total available buffer space. The
>> other 50% is assigned to lossy buffer. Using dcbnl buffer attribute,
>> we create three equal size lossless buffers. Each buffer has 25% of total
>> available buffer space. Thus, the lossy buffer size reduces to 25%. Priority
>> to lossless  buffer mappings are set as follow.
>>   Priorities 1 & 2 on lossless buffer #1
>>   Priorities 3 & 4 on lossless buffer #2
>>   Priorities 5 & 6 on lossless buffer #3
>> 
>> We observe improvements in latency for small and medium message sizes
>> as follows. Please note that the large message sizes bandwidth performance is
>> reduced but the total bandwidth remains the same.
>>   256B message size (42 % latency reduction)
>>   4K message size (21% latency reduction)
>>   64K message size (16% latency reduction)
>> 
>> Signed-off-by: Huy Nguyen <huyn@mellanox.com>
>> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
>
>On a cursory look this bares a lot of resemblance to devlink shared
>buffer configuration ABI.  Did you look into using that?  
>
>Just to be clear devlink shared buffer ABIs don't require representors
>and "switchdev mode".

If the CX5 buffer they are trying to utilize here is per port and not a
shared one, it would seem ok for me to not have it in "devlink sb".

^ permalink raw reply

* [PATCH net-next] sfc: stop the TX queue before pushing new buffers
From: Martin Habets @ 2018-05-23  9:41 UTC (permalink / raw)
  To: linux-net-drivers, davem; +Cc: netdev, jarod

efx_enqueue_skb() can push new buffers for the xmit_more functionality.
We must stops the TX queue before this or else the TX queue does not get
restarted and we get a netdev watchdog.

In the error handling we may now need to unwind more than 1 packet, and
we may need to push the new buffers onto the partner queue.

Fixes: e9117e5099ea ("sfc: Firmware-Assisted TSO version 2")
Reported-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: Martin Habets <mhabets@solarflare.com>
---

Dave, could you please also queue up this patch for stable?

 drivers/net/ethernet/sfc/tx.c |   31 +++++++++++++++++++++++--------
 1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/sfc/tx.c b/drivers/net/ethernet/sfc/tx.c
index cece961f2e82..17e0697f42d5 100644
--- a/drivers/net/ethernet/sfc/tx.c
+++ b/drivers/net/ethernet/sfc/tx.c
@@ -435,17 +435,18 @@ static int efx_tx_map_data(struct efx_tx_queue *tx_queue, struct sk_buff *skb,
 	} while (1);
 }
 
-/* Remove buffers put into a tx_queue.  None of the buffers must have
- * an skb attached.
+/* Remove buffers put into a tx_queue for the current packet.
+ * None of the buffers must have an skb attached.
  */
-static void efx_enqueue_unwind(struct efx_tx_queue *tx_queue)
+static void efx_enqueue_unwind(struct efx_tx_queue *tx_queue,
+			       unsigned int insert_count)
 {
 	struct efx_tx_buffer *buffer;
 	unsigned int bytes_compl = 0;
 	unsigned int pkts_compl = 0;
 
 	/* Work backwards until we hit the original insert pointer value */
-	while (tx_queue->insert_count != tx_queue->write_count) {
+	while (tx_queue->insert_count != insert_count) {
 		--tx_queue->insert_count;
 		buffer = __efx_tx_queue_get_insert_buffer(tx_queue);
 		efx_dequeue_buffer(tx_queue, buffer, &pkts_compl, &bytes_compl);
@@ -504,6 +505,8 @@ static int efx_tx_tso_fallback(struct efx_tx_queue *tx_queue,
  */
 netdev_tx_t efx_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb)
 {
+	unsigned int old_insert_count = tx_queue->insert_count;
+	bool xmit_more = skb->xmit_more;
 	bool data_mapped = false;
 	unsigned int segments;
 	unsigned int skb_len;
@@ -553,8 +556,10 @@ netdev_tx_t efx_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb)
 	/* Update BQL */
 	netdev_tx_sent_queue(tx_queue->core_txq, skb_len);
 
+	efx_tx_maybe_stop_queue(tx_queue);
+
 	/* Pass off to hardware */
-	if (!skb->xmit_more || netif_xmit_stopped(tx_queue->core_txq)) {
+	if (!xmit_more || netif_xmit_stopped(tx_queue->core_txq)) {
 		struct efx_tx_queue *txq2 = efx_tx_queue_partner(tx_queue);
 
 		/* There could be packets left on the partner queue if those
@@ -577,14 +582,24 @@ netdev_tx_t efx_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb)
 		tx_queue->tx_packets++;
 	}
 
-	efx_tx_maybe_stop_queue(tx_queue);
-
 	return NETDEV_TX_OK;
 
 
 err:
-	efx_enqueue_unwind(tx_queue);
+	efx_enqueue_unwind(tx_queue, old_insert_count);
 	dev_kfree_skb_any(skb);
+
+	/* If we're not expecting another transmit and we had something to push
+	 * on this queue or a partner queue then we need to push here to get the
+	 * previous packets out.
+	 */
+	if (!xmit_more) {
+		struct efx_tx_queue *txq2 = efx_tx_queue_partner(tx_queue);
+
+		if (txq2->xmit_more_available)
+			efx_nic_push_buffers(txq2);
+	}
+
 	return NETDEV_TX_OK;
 }
 

^ permalink raw reply related

* Re: [bpf-next V4 PATCH 1/8] bpf: devmap introduce dev_map_enqueue
From: Daniel Borkmann @ 2018-05-23  9:34 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov
  Cc: Christoph Hellwig, BjörnTöpel, Magnus Karlsson,
	makita.toshiaki
In-Reply-To: <152665047666.21055.16395048652428831814.stgit@firesoul>

On 05/18/2018 03:34 PM, Jesper Dangaard Brouer wrote:
> Functionality is the same, but the ndo_xdp_xmit call is now
> simply invoked from inside the devmap.c code.
> 
> V2: Fix compile issue reported by kbuild test robot <lkp@intel.com>
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  include/linux/bpf.h        |   14 +++++++++++---
>  include/trace/events/xdp.h |    9 ++++++++-
>  kernel/bpf/devmap.c        |   37 +++++++++++++++++++++++++++++++------
>  net/core/filter.c          |   15 ++-------------
>  4 files changed, 52 insertions(+), 23 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index ed0122b45b63..fc1459bdcafc 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -485,14 +485,15 @@ int bpf_check(struct bpf_prog **fp, union bpf_attr *attr);
>  void bpf_patch_call_args(struct bpf_insn *insn, u32 stack_depth);
>  
>  /* Map specifics */
> -struct net_device  *__dev_map_lookup_elem(struct bpf_map *map, u32 key);
> +struct xdp_buff;
> +struct bpf_dtab_netdev *__dev_map_lookup_elem(struct bpf_map *map, u32 key);

When you have some follow-up patches, would be great if you could clean this
up a bit. At least a newline after the struct declaration would make it a bit
more readable.

>  void __dev_map_insert_ctx(struct bpf_map *map, u32 index);
>  void __dev_map_flush(struct bpf_map *map);
> +int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp);
>  
>  struct bpf_cpu_map_entry *__cpu_map_lookup_elem(struct bpf_map *map, u32 key);
>  void __cpu_map_insert_ctx(struct bpf_map *map, u32 index);
>  void __cpu_map_flush(struct bpf_map *map);
> -struct xdp_buff;
>  int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_buff *xdp,
>  		    struct net_device *dev_rx);
>  
> @@ -571,6 +572,14 @@ static inline void __dev_map_flush(struct bpf_map *map)
>  {
>  }
>  
> +struct xdp_buff;
> +struct bpf_dtab_netdev;
> +static inline

In particular here as well.

> +int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp)
> +{
> +	return 0;
> +}
> +
>  static inline
>  struct bpf_cpu_map_entry *__cpu_map_lookup_elem(struct bpf_map *map, u32 key)
>  {
> @@ -585,7 +594,6 @@ static inline void __cpu_map_flush(struct bpf_map *map)
>  {
>  }
>  
> -struct xdp_buff;
>  static inline int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu,
>  				  struct xdp_buff *xdp,
>  				  struct net_device *dev_rx)
> diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
> index 8989a92c571a..96104610d40e 100644
> --- a/include/trace/events/xdp.h
> +++ b/include/trace/events/xdp.h
> @@ -138,11 +138,18 @@ DEFINE_EVENT_PRINT(xdp_redirect_template, xdp_redirect_map_err,
>  		  __entry->map_id, __entry->map_index)
>  );
>  
> +#ifndef __DEVMAP_OBJ_TYPE
> +#define __DEVMAP_OBJ_TYPE
> +struct _bpf_dtab_netdev {
> +	struct net_device *dev;
> +};
> +#endif /* __DEVMAP_OBJ_TYPE */

The __DEVMAP_OBJ_TYPE is not used anywhere, what's its purpose?

Also if you define struct _bpf_dtab_netdev this is rather fragile when mapping
to struct bpf_dtab_netdev. Best way of guarding this is to make a BUILD_BUG_ON()
where you compare both offsets in the struct and bail out compilation whenever
this changes.

>  #define devmap_ifindex(fwd, map)				\
>  	(!fwd ? 0 :						\
>  	 (!map ? 0 :						\
>  	  ((map->map_type == BPF_MAP_TYPE_DEVMAP) ?		\
> -	   ((struct net_device *)fwd)->ifindex : 0)))
> +	   ((struct _bpf_dtab_netdev *)fwd)->dev->ifindex : 0)))
>  
>  #define _trace_xdp_redirect_map(dev, xdp, fwd, map, idx)		\
>  	 trace_xdp_redirect_map(dev, xdp, devmap_ifindex(fwd, map),	\
> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> index 565f9ece9115..808808bf2bf2 100644
> --- a/kernel/bpf/devmap.c
> +++ b/kernel/bpf/devmap.c
> @@ -48,18 +48,21 @@
>   * calls will fail at this point.
>   */
>  #include <linux/bpf.h>
> +#include <net/xdp.h>
>  #include <linux/filter.h>
>  
>  #define DEV_CREATE_FLAG_MASK \
>  	(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
>  
> +/* objects in the map */

This comment is unnecessary.

>  struct bpf_dtab_netdev {
> -	struct net_device *dev;
> +	struct net_device *dev; /* must be first member, due to tracepoint */
>  	struct bpf_dtab *dtab;
>  	unsigned int bit;
>  	struct rcu_head rcu;
>  };
>  
> +/* bpf map container */

Ditto. Why add it? If it's unclear from the code, then it would probably be
better to clean up the code a bit to make it more obvious. Comments should
explain *why* we do certain things, not *what* the code is doing. Latter is
just a sign that the code itself should be improved potentially. :)

>  struct bpf_dtab {
>  	struct bpf_map map;
>  	struct bpf_dtab_netdev **netdev_map;
> @@ -240,21 +243,43 @@ void __dev_map_flush(struct bpf_map *map)
>   * update happens in parallel here a dev_put wont happen until after reading the
>   * ifindex.
>   */
> -struct net_device  *__dev_map_lookup_elem(struct bpf_map *map, u32 key)
> +struct bpf_dtab_netdev *__dev_map_lookup_elem(struct bpf_map *map, u32 key)
>  {
>  	struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
> -	struct bpf_dtab_netdev *dev;
> +	struct bpf_dtab_netdev *obj;
>  
>  	if (key >= map->max_entries)
>  		return NULL;
>  
> -	dev = READ_ONCE(dtab->netdev_map[key]);
> -	return dev ? dev->dev : NULL;
> +	obj = READ_ONCE(dtab->netdev_map[key]);
> +	return obj;
> +}
> +
> +int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp)
> +{
> +	struct net_device *dev = dst->dev;
> +	struct xdp_frame *xdpf;
> +	int err;
> +
> +	if (!dev->netdev_ops->ndo_xdp_xmit)
> +		return -EOPNOTSUPP;
> +
> +	xdpf = convert_to_xdp_frame(xdp);
> +	if (unlikely(!xdpf))
> +		return -EOVERFLOW;
> +
> +	/* TODO: implement a bulking/enqueue step later */
> +	err = dev->netdev_ops->ndo_xdp_xmit(dev, xdpf);
> +	if (err)
> +		return err;
> +
> +	return 0;

The 'err' is just unnecessary, lets just do:

  return dev->netdev_ops->ndo_xdp_xmit(dev, xdpf);

Later after the other patches this becomes:

  return bq_enqueue(dst, xdpf, dev_rx);

>  }
>  
>  static void *dev_map_lookup_elem(struct bpf_map *map, void *key)
>  {
> -	struct net_device *dev = __dev_map_lookup_elem(map, *(u32 *)key);
> +	struct bpf_dtab_netdev *obj = __dev_map_lookup_elem(map, *(u32 *)key);
> +	struct net_device *dev = dev = obj ? obj->dev : NULL;
>  
>  	return dev ? &dev->ifindex : NULL;
>  }
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 6d0d1560bd70..1447ec94ef74 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3061,20 +3061,9 @@ static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
>  
>  	switch (map->map_type) {
>  	case BPF_MAP_TYPE_DEVMAP: {
> -		struct net_device *dev = fwd;
> -		struct xdp_frame *xdpf;
> +		struct bpf_dtab_netdev *dst = fwd;
>  
> -		if (!dev->netdev_ops->ndo_xdp_xmit)
> -			return -EOPNOTSUPP;
> -
> -		xdpf = convert_to_xdp_frame(xdp);
> -		if (unlikely(!xdpf))
> -			return -EOVERFLOW;
> -
> -		/* TODO: move to inside map code instead, for bulk support
> -		 * err = dev_map_enqueue(dev, xdp);
> -		 */
> -		err = dev->netdev_ops->ndo_xdp_xmit(dev, xdpf);
> +		err = dev_map_enqueue(dst, xdp);
>  		if (err)
>  			return err;
>  		__dev_map_insert_ctx(map, index);
> 

^ permalink raw reply

* Re: [net-next 1/6] net/dcb: Add dcbnl buffer attribute
From: Jiri Pirko @ 2018-05-23  9:33 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Huy Nguyen, Saeed Mahameed, David S. Miller, netdev, Or Gerlitz,
	Parav Pandit, Ido Schimmel
In-Reply-To: <20180523022314.783e47fa@cakuba>

Wed, May 23, 2018 at 11:23:14AM CEST, jakub.kicinski@netronome.com wrote:
>On Tue, 22 May 2018 20:01:21 -0500, Huy Nguyen wrote:
>> On 5/22/2018 1:32 PM, Jakub Kicinski wrote:
>> > On Tue, 22 May 2018 10:36:17 -0500, Huy Nguyen wrote:  
>> >> On 5/22/2018 12:20 AM, Jakub Kicinski wrote:  
>> >>> On Mon, 21 May 2018 14:04:57 -0700, Saeed Mahameed wrote:  
>> >>>> From: Huy Nguyen <huyn@mellanox.com>
>> >>>>
>> >>>> In this patch, we add dcbnl buffer attribute to allow user
>> >>>> change the NIC's buffer configuration such as priority
>> >>>> to buffer mapping and buffer size of individual buffer.
>> >>>>
>> >>>> This attribute combined with pfc attribute allows advance user to
>> >>>> fine tune the qos setting for specific priority queue. For example,
>> >>>> user can give dedicated buffer for one or more prirorities or user
>> >>>> can give large buffer to certain priorities.
>> >>>>
>> >>>> We present an use case scenario where dcbnl buffer attribute configured
>> >>>> by advance user helps reduce the latency of messages of different sizes.
>> >>>>
>> >>>> Scenarios description:
>> >>>> On ConnectX-5, we run latency sensitive traffic with
>> >>>> small/medium message sizes ranging from 64B to 256KB and bandwidth sensitive
>> >>>> traffic with large messages sizes 512KB and 1MB. We group small, medium,
>> >>>> and large message sizes to their own pfc enables priorities as follow.
>> >>>>     Priorities 1 & 2 (64B, 256B and 1KB)
>> >>>>     Priorities 3 & 4 (4KB, 8KB, 16KB, 64KB, 128KB and 256KB)
>> >>>>     Priorities 5 & 6 (512KB and 1MB)
>> >>>>
>> >>>> By default, ConnectX-5 maps all pfc enabled priorities to a single
>> >>>> lossless fixed buffer size of 50% of total available buffer space. The
>> >>>> other 50% is assigned to lossy buffer. Using dcbnl buffer attribute,
>> >>>> we create three equal size lossless buffers. Each buffer has 25% of total
>> >>>> available buffer space. Thus, the lossy buffer size reduces to 25%. Priority
>> >>>> to lossless  buffer mappings are set as follow.
>> >>>>     Priorities 1 & 2 on lossless buffer #1
>> >>>>     Priorities 3 & 4 on lossless buffer #2
>> >>>>     Priorities 5 & 6 on lossless buffer #3
>> >>>>
>> >>>> We observe improvements in latency for small and medium message sizes
>> >>>> as follows. Please note that the large message sizes bandwidth performance is
>> >>>> reduced but the total bandwidth remains the same.
>> >>>>     256B message size (42 % latency reduction)
>> >>>>     4K message size (21% latency reduction)
>> >>>>     64K message size (16% latency reduction)
>> >>>>
>> >>>> Signed-off-by: Huy Nguyen <huyn@mellanox.com>
>> >>>> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>  
>> >>> On a cursory look this bares a lot of resemblance to devlink shared
>> >>> buffer configuration ABI.  Did you look into using that?
>> >>>
>> >>> Just to be clear devlink shared buffer ABIs don't require representors
>> >>> and "switchdev mode".
>> >>> .  
>> >> [HQN] Dear Jakub, there are several reasons that devlink shared buffer
>> >> ABI cannot be used:
>> >> 1. The devlink shared buffer ABI is written based on the switch cli
>> >> which you can find out more
>> >> from this link https://community.mellanox.com/docs/DOC-2558.  
>> > Devlink API accommodates requirements of simpler (SwitchX2?) and more
>> > advanced schemes (present in Spectrum).  The simpler/basic static
>> > threshold configurations is exactly what you are doing here, AFAIU.  
>> [HQN] Devlink API is tailored specifically for switch.
>
>I hope that is not true, since we (Netronome) are trying to use it for
>NIC configuration, too.  We should generalize the API if need be.

Sure it is not true. I have no clue why anyone thinks so :/


>
>> We don't configure threshold configuration explicitly. It is done via
>> PFC. Once PFC is enabled on priority, threshold is setup based on our
>> proprietary formula that were tested rigorously for performance.
>
>Are you referring to XOFF/XON thresholds?  I don't think the "threshold
>type" in devlink API implies we are setting XON/XOFF thresholds
>directly :S  If PFC is enabled we may be setting them indirectly,
>obviously.
>
>My understanding is that for static threshold type the size parameter
>specifies the max amount of memory given pool can consume.
>
>> >> 2. The dcbnl interfaces have been used for QoS settings.  
>> > QoS settings != shared buffer configuration.  
>> [HQN] I think we have different definition about "shared buffer".
>> Please refer to this below switch cli link.
>> It explained in detail what is the "shared buffer" in switch means.
>> Our NIC does not have "shared buffer" supported.
>> https://community.mellanox.com/docs/DOC-2591
>
>Yes, we must have a different definitions of "shared buffer" :)  That
>link, however, didn't clarify much for me...  In mlx5 you seem to have a
>buffer which is shared between priorities, even if it's not what would
>be referred to as shared buffer in switch context.

We introduced "shared buffer" in a devlink with "devlink handle" because
the buffer is shared among the whole ASIC, between multiple
ports/netdevs.


>
>> >> In NIC, the  buffer configuration are tied to priority (ETS PFC).  
>> > Some customers use DCB, a lot (most?) of them don't.  I don't think
>> > the "this is a logical extension of a commonly used API" really
>> > stands here.  
>> [HQN] DCBNL are being actively used. The whole point of this patch
>> is to tie buffer configuration with IEEE's priority and is IEEE's PFC 
>> configuration.
>>
>> Ambitious future is to have the switch configure the NIC's buffer
>> size and buffer mapping
>> via TLV packet and this DCBNL interface. But we won't go too far here.
>
>I think I can understand the motivation, and I think it's a nice thing
>to expose!  The only questions are: does it really belong to DCBNL and
>can existing API be used?
> 
>From patch description it seems like your default setup is shared
>buffer split 50% (lossy)/50% (all prios) and the example you give
>changes that to 25% (lossy)/25%x3 prio groups.
>
>With existing devlink API could this be modelled by three ingress pools
>with 2 TCs bound each?
>
>> >> The buffer configuration are not tied to port like switch.  
>> > It's tied to a port and TCs, you just have one port but still have 8
>> > TCs exactly like a switch...  
>> [HQN] No. Our buffer ties to priority not to TCs.
>
>Right, that is a valid point.  Although TCs can be mapped to
>priorities.  Some switches may tie buffers to priorities, too.  So
>perhaps it's worth extending devlink?
>
>> >> 3. Shared buffer, alpha, threshold are switch specific terms.  
>> > IDK how talking about alpha is relevant, it's just one threshold
>> > type the API supports.  As far as shared buffer and threshold I
>> > don't know if these are switch terms (or how "switch" differs from
>> > "NIC" at that level) - I personally find carving shared buffer into
>> > pools very intuitive.  
>> [HQN] Yes, I understand your point too. The NIC's buffer shares some 
>> characteristics with the switch's buffer settings. 
>
>Yes, and if it's not a perfect match we can extend it.
>
>> But this DCB buffer setting is to improve the performance and work
>> together with the PFC setting. We would like to keep all the qos
>> setting under DCB Netlink as they are designed to be this way.
>
>DCBNL seems to carry standard-based information, which this is not.
>mlxsw supports DCBNL, will it also support this buffer configuration
>mechanism?

Ido would provide you more and accurate info. Basically, in mlxsw we use
dcbnl for the things in can cover and was designed for. And for those
things, the netdev is a handle. Config is specific to the netdev. On the
other hand, devlink shared buffer is used for buffer shared between all
netdevs.



>
>> > Could you give examples of commands/configs one can use with your
>> > new ABI?  
>> [HQN] The plan is to add the support in lldptool once the kernel code
>> is accepted. To test the kernel code,
>> I am using small python scripts that works on top of the netlink
>> library. It will be like this format which is similar to other
>> options in lldptool priority2buffer: 0,2,5,7,1,2,3,6 maps priorities
>> 0,1,2,3,4,5,6,7 to buffer 0,2,5,7,1,2,3,6
>>      buffer_size: 87296,87296,0,87296,0,0,0,0 set receive buffer size 
>> for buffer 0,1,2,3,4,5,6,7 respectively
>> >    How does one query the total size of the buffer to be carved?  
>> [HQN] This is not necessary. If the total size is too big, error will
>> be return via DCB netlink interface.
>
>Right, I'm not saying it's a bug :)  It's just nice when user can be
>told the total size without having to probe for it :)

^ permalink raw reply

* Re: [PATCH V3 8/8] dt-bindings: stm32: add compatible for syscon
From: Christophe ROULLIER @ 2018-05-23  9:32 UTC (permalink / raw)
  To: Rob Herring
  Cc: mark.rutland@arm.com, mcoquelin.stm32@gmail.com, Alexandre TORGUE,
	Peppe CAVALLARO, devicetree@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, netdev@vger.kernel.org,
	andrew@lunn.ch
In-Reply-To: <20180522172238.GA26145@rob-hp-laptop>

On 05/22/2018 07:22 PM, Rob Herring wrote:
> On Mon, May 21, 2018 at 10:07:26AM +0200, Christophe Roullier wrote:
>> This patch describes syscon DT bindings.
>>
>> Signed-off-by: Christophe Roullier <christophe.roullier@st.com>
>> ---
>>   Documentation/devicetree/bindings/arm/stm32.txt | 5 +++++
>>   1 file changed, 5 insertions(+)
>>
>> diff --git a/Documentation/devicetree/bindings/arm/stm32.txt b/Documentation/devicetree/bindings/arm/stm32.txt
>> index 6808ed9..e46ebad 100644
>> --- a/Documentation/devicetree/bindings/arm/stm32.txt
>> +++ b/Documentation/devicetree/bindings/arm/stm32.txt
>> @@ -8,3 +8,8 @@ using one of the following compatible strings:
>>     st,stm32f746
>>     st,stm32h743
>>     st,stm32mp157
>> +
>> +Required nodes:
>> +- syscon: the soc bus node must have a system controller node pointing to the
>> +  global control registers, with the compatible string
>> +  "st,stm32mp157-syscfg", "syscon";
> 
> Please don't mix soc/board bindings with other nodes. So perhaps
> stm32-syscon.txt.
> 
> Rob
> 

Hi Rob,

Is it ok for you with this tree file:

Documentation/devicetree/bindings/arm/stm32/stm32.txt
Documentation/devicetree/bindings/arm/stm32/stm32-syscon.txt
	With stm32-syscon.txt:

---------------------------------------------------
STMicroelectronics STM32 Platforms System Controller

Properties:
  - compatible : should contain two values. First value must be :
		- " st,stm32mp157-syscfg " - for stm32mp157 based SoCs,
		second value must be always "syscon".
  - reg : offset and length of the register set.

Example:
	syscfg: system-config@50020000 {
		compatible = "st,stm32mp157-syscfg", "syscon";
		reg = <0x50020000 0x400>;
	};
---------------------------------------------------

Do we need to update also all MCU family (stm32f4, stm32h7, stm32f7) 
property to be coherent ?

Thanks for your feedback.

Christophe.


^ permalink raw reply

* Re: [bpf-next V4 PATCH 0/8] xdp: introduce bulking for ndo_xdp_xmit API
From: Daniel Borkmann @ 2018-05-23  9:24 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov
  Cc: Christoph Hellwig, BjörnTöpel, Magnus Karlsson,
	makita.toshiaki
In-Reply-To: <152665044141.21055.1276346542020340263.stgit@firesoul>

On 05/18/2018 03:34 PM, Jesper Dangaard Brouer wrote:
> This patchset change ndo_xdp_xmit API to take a bulk of xdp frames.
> 
> In this V4 patchset, I've split-out the patches from 4 to 8 patches.
> I cannot split the driver changes from the NDO change, but I've tried
> to isolated the NDO change together with the driver change as much as
> possible.
> 
> When kernel is compiled with CONFIG_RETPOLINE, every indirect function
> pointer (branch) call hurts performance. For XDP this have a huge
> negative performance impact.
> 
> This patchset reduce the needed (indirect) calls to ndo_xdp_xmit, but
> also prepares for further optimizations.  The DMA APIs use of indirect
> function pointer calls is the primary source the regression.  It is
> left for a followup patchset, to use bulking calls towards the DMA API
> (via the scatter-gatter calls).
> 
> The other advantage of this API change is that drivers can easier
> amortize the cost of any sync/locking scheme, over the bulk of
> packets.  The assumption of the current API is that the driver
> implemementing the NDO will also allocate a dedicated XDP TX queue for
> every CPU in the system.  Which is not always possible or practical to
> configure. E.g. ixgbe cannot load an XDP program on a machine with
> more than 96 CPUs, due to limited hardware TX queues.  E.g. virtio_net
> is hard to configure as it requires manually increasing the
> queues. E.g. tun driver chooses to use a per XDP frame producer lock
> modulo smp_processor_id over avail queues.
> 
> I'm considered adding 'flags' to ndo_xdp_xmit, but it's not part of
> this patchset.  This will be a followup patchset, once we know if this
> will be needed (e.g. for non-map xdp_redirect flush-flag, and if
> AF_XDP chooses to use ndo_xdp_xmit for TX).
> 
> ---
> 
> Jesper Dangaard Brouer (8):
>       bpf: devmap introduce dev_map_enqueue
>       bpf: devmap prepare xdp frames for bulking
>       xdp: add tracepoint for devmap like cpumap have
>       samples/bpf: xdp_monitor use tracepoint xdp:xdp_devmap_xmit
>       xdp: introduce xdp_return_frame_rx_napi
>       xdp: change ndo_xdp_xmit API to support bulking
>       xdp/trace: extend tracepoint in devmap with an err
>       samples/bpf: xdp_monitor use err code from tracepoint xdp:xdp_devmap_xmit

Series applied to bpf-next, thanks Jesper. (Some minor comments in the patches.)

^ permalink raw reply

* Re: [net-next 1/6] net/dcb: Add dcbnl buffer attribute
From: Jakub Kicinski @ 2018-05-23  9:23 UTC (permalink / raw)
  To: Huy Nguyen
  Cc: Saeed Mahameed, David S. Miller, netdev, Jiri Pirko, Or Gerlitz,
	Parav Pandit, Ido Schimmel
In-Reply-To: <1576e986-6e04-63f3-3569-a105493929a6@mellanox.com>

On Tue, 22 May 2018 20:01:21 -0500, Huy Nguyen wrote:
> On 5/22/2018 1:32 PM, Jakub Kicinski wrote:
> > On Tue, 22 May 2018 10:36:17 -0500, Huy Nguyen wrote:  
> >> On 5/22/2018 12:20 AM, Jakub Kicinski wrote:  
> >>> On Mon, 21 May 2018 14:04:57 -0700, Saeed Mahameed wrote:  
> >>>> From: Huy Nguyen <huyn@mellanox.com>
> >>>>
> >>>> In this patch, we add dcbnl buffer attribute to allow user
> >>>> change the NIC's buffer configuration such as priority
> >>>> to buffer mapping and buffer size of individual buffer.
> >>>>
> >>>> This attribute combined with pfc attribute allows advance user to
> >>>> fine tune the qos setting for specific priority queue. For example,
> >>>> user can give dedicated buffer for one or more prirorities or user
> >>>> can give large buffer to certain priorities.
> >>>>
> >>>> We present an use case scenario where dcbnl buffer attribute configured
> >>>> by advance user helps reduce the latency of messages of different sizes.
> >>>>
> >>>> Scenarios description:
> >>>> On ConnectX-5, we run latency sensitive traffic with
> >>>> small/medium message sizes ranging from 64B to 256KB and bandwidth sensitive
> >>>> traffic with large messages sizes 512KB and 1MB. We group small, medium,
> >>>> and large message sizes to their own pfc enables priorities as follow.
> >>>>     Priorities 1 & 2 (64B, 256B and 1KB)
> >>>>     Priorities 3 & 4 (4KB, 8KB, 16KB, 64KB, 128KB and 256KB)
> >>>>     Priorities 5 & 6 (512KB and 1MB)
> >>>>
> >>>> By default, ConnectX-5 maps all pfc enabled priorities to a single
> >>>> lossless fixed buffer size of 50% of total available buffer space. The
> >>>> other 50% is assigned to lossy buffer. Using dcbnl buffer attribute,
> >>>> we create three equal size lossless buffers. Each buffer has 25% of total
> >>>> available buffer space. Thus, the lossy buffer size reduces to 25%. Priority
> >>>> to lossless  buffer mappings are set as follow.
> >>>>     Priorities 1 & 2 on lossless buffer #1
> >>>>     Priorities 3 & 4 on lossless buffer #2
> >>>>     Priorities 5 & 6 on lossless buffer #3
> >>>>
> >>>> We observe improvements in latency for small and medium message sizes
> >>>> as follows. Please note that the large message sizes bandwidth performance is
> >>>> reduced but the total bandwidth remains the same.
> >>>>     256B message size (42 % latency reduction)
> >>>>     4K message size (21% latency reduction)
> >>>>     64K message size (16% latency reduction)
> >>>>
> >>>> Signed-off-by: Huy Nguyen <huyn@mellanox.com>
> >>>> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>  
> >>> On a cursory look this bares a lot of resemblance to devlink shared
> >>> buffer configuration ABI.  Did you look into using that?
> >>>
> >>> Just to be clear devlink shared buffer ABIs don't require representors
> >>> and "switchdev mode".
> >>> .  
> >> [HQN] Dear Jakub, there are several reasons that devlink shared buffer
> >> ABI cannot be used:
> >> 1. The devlink shared buffer ABI is written based on the switch cli
> >> which you can find out more
> >> from this link https://community.mellanox.com/docs/DOC-2558.  
> > Devlink API accommodates requirements of simpler (SwitchX2?) and more
> > advanced schemes (present in Spectrum).  The simpler/basic static
> > threshold configurations is exactly what you are doing here, AFAIU.  
> [HQN] Devlink API is tailored specifically for switch.

I hope that is not true, since we (Netronome) are trying to use it for
NIC configuration, too.  We should generalize the API if need be.

> We don't configure threshold configuration explicitly. It is done via
> PFC. Once PFC is enabled on priority, threshold is setup based on our
> proprietary formula that were tested rigorously for performance.

Are you referring to XOFF/XON thresholds?  I don't think the "threshold
type" in devlink API implies we are setting XON/XOFF thresholds
directly :S  If PFC is enabled we may be setting them indirectly,
obviously.

My understanding is that for static threshold type the size parameter
specifies the max amount of memory given pool can consume.

> >> 2. The dcbnl interfaces have been used for QoS settings.  
> > QoS settings != shared buffer configuration.  
> [HQN] I think we have different definition about "shared buffer".
> Please refer to this below switch cli link.
> It explained in detail what is the "shared buffer" in switch means.
> Our NIC does not have "shared buffer" supported.
> https://community.mellanox.com/docs/DOC-2591

Yes, we must have a different definitions of "shared buffer" :)  That
link, however, didn't clarify much for me...  In mlx5 you seem to have a
buffer which is shared between priorities, even if it's not what would
be referred to as shared buffer in switch context.

> >> In NIC, the  buffer configuration are tied to priority (ETS PFC).  
> > Some customers use DCB, a lot (most?) of them don't.  I don't think
> > the "this is a logical extension of a commonly used API" really
> > stands here.  
> [HQN] DCBNL are being actively used. The whole point of this patch
> is to tie buffer configuration with IEEE's priority and is IEEE's PFC 
> configuration.
>
> Ambitious future is to have the switch configure the NIC's buffer
> size and buffer mapping
> via TLV packet and this DCBNL interface. But we won't go too far here.

I think I can understand the motivation, and I think it's a nice thing
to expose!  The only questions are: does it really belong to DCBNL and
can existing API be used?
 
From patch description it seems like your default setup is shared
buffer split 50% (lossy)/50% (all prios) and the example you give
changes that to 25% (lossy)/25%x3 prio groups.

With existing devlink API could this be modelled by three ingress pools
with 2 TCs bound each?

> >> The buffer configuration are not tied to port like switch.  
> > It's tied to a port and TCs, you just have one port but still have 8
> > TCs exactly like a switch...  
> [HQN] No. Our buffer ties to priority not to TCs.

Right, that is a valid point.  Although TCs can be mapped to
priorities.  Some switches may tie buffers to priorities, too.  So
perhaps it's worth extending devlink?

> >> 3. Shared buffer, alpha, threshold are switch specific terms.  
> > IDK how talking about alpha is relevant, it's just one threshold
> > type the API supports.  As far as shared buffer and threshold I
> > don't know if these are switch terms (or how "switch" differs from
> > "NIC" at that level) - I personally find carving shared buffer into
> > pools very intuitive.  
> [HQN] Yes, I understand your point too. The NIC's buffer shares some 
> characteristics with the switch's buffer settings. 

Yes, and if it's not a perfect match we can extend it.

> But this DCB buffer setting is to improve the performance and work
> together with the PFC setting. We would like to keep all the qos
> setting under DCB Netlink as they are designed to be this way.

DCBNL seems to carry standard-based information, which this is not.
mlxsw supports DCBNL, will it also support this buffer configuration
mechanism?

> > Could you give examples of commands/configs one can use with your
> > new ABI?  
> [HQN] The plan is to add the support in lldptool once the kernel code
> is accepted. To test the kernel code,
> I am using small python scripts that works on top of the netlink
> library. It will be like this format which is similar to other
> options in lldptool priority2buffer: 0,2,5,7,1,2,3,6 maps priorities
> 0,1,2,3,4,5,6,7 to buffer 0,2,5,7,1,2,3,6
>      buffer_size: 87296,87296,0,87296,0,0,0,0 set receive buffer size 
> for buffer 0,1,2,3,4,5,6,7 respectively
> >    How does one query the total size of the buffer to be carved?  
> [HQN] This is not necessary. If the total size is too big, error will
> be return via DCB netlink interface.

Right, I'm not saying it's a bug :)  It's just nice when user can be
told the total size without having to probe for it :)

^ permalink raw reply

* Re: [PATCH bpf-next v3 10/10] tools: bpftool: add delimiters to multi-function JITed dumps
From: Daniel Borkmann @ 2018-05-23  9:08 UTC (permalink / raw)
  To: Jakub Kicinski, Sandipan Das
  Cc: ast, netdev, linuxppc-dev, mpe, naveen.n.rao, Quentin Monnet
In-Reply-To: <20180522125544.541c68c8@cakuba>

On 05/22/2018 09:55 PM, Jakub Kicinski wrote:
> On Tue, 22 May 2018 22:46:13 +0530, Sandipan Das wrote:
>> +		if (info.nr_jited_func_lens && info.jited_func_lens) {
>> +			struct kernel_sym *sym = NULL;
>> +			unsigned char *img = buf;
>> +			__u64 *ksyms = NULL;
>> +			__u32 *lens;
>> +			__u32 i;
>> +
>> +			if (info.nr_jited_ksyms) {
>> +				kernel_syms_load(&dd);
>> +				ksyms = (__u64 *) info.jited_ksyms;
>> +			}
>> +
>> +			lens = (__u32 *) info.jited_func_lens;
>> +			for (i = 0; i < info.nr_jited_func_lens; i++) {
>> +				if (ksyms) {
>> +					sym = kernel_syms_search(&dd, ksyms[i]);
>> +					if (sym)
>> +						printf("%s:\n", sym->name);
>> +					else
>> +						printf("%016llx:\n", ksyms[i]);
>> +				}
>> +
>> +				disasm_print_insn(img, lens[i], opcodes, name);
>> +				img += lens[i];
>> +				printf("\n");
>> +			}
>> +		} else {
> 
> The output doesn't seem to be JSON-compatible :(  We try to make sure
> all bpftool command can produce valid JSON when run with -j (or -p)
> switch.
> 
> Would it be possible to make each function a separate JSON object with
> "name" and "insn" array?  Would that work?

Sandipan, could you take a look at this? Given there's json output today we
should definitely try not to break it; presumably this would be one final
respin of your series with this fixed.

Thanks,
Daniel

^ permalink raw reply

* Re: [RFC V4 PATCH 7/8] vhost: packed ring support
From: Jason Wang @ 2018-05-23  8:57 UTC (permalink / raw)
  To: Wei Xu; +Cc: mst, kvm, virtualization, netdev, linux-kernel, jfreimann,
	tiwei.bie
In-Reply-To: <20180523071727.GA13373@wei-ubt>



On 2018年05月23日 15:17, Wei Xu wrote:
> On Wed, May 23, 2018 at 09:39:28AM +0800, Jason Wang wrote:
>>
>> On 2018年05月23日 00:54, Wei Xu wrote:
>>> On Wed, May 16, 2018 at 08:32:20PM +0800, Jason Wang wrote:
>>>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>>>> ---
>>>>   drivers/vhost/net.c   |   3 +-
>>>>   drivers/vhost/vhost.c | 539 ++++++++++++++++++++++++++++++++++++++++++++++----
>>>>   drivers/vhost/vhost.h |   8 +-
>>>>   3 files changed, 513 insertions(+), 37 deletions(-)
>>>>
>>>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>>>> index 8304c30..f2a0f5b 100644
>>>> --- a/drivers/vhost/vhost.c
>>>> +++ b/drivers/vhost/vhost.c
>>>> @@ -1358,6 +1382,8 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
>>>>   			break;
>>>>   		}
>>>>   		vq->last_avail_idx = s.num;
>>>> +		if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
>>>> +			vq->avail_wrap_counter = s.num >> 31;
>>>>   		/* Forget the cached index value. */
>>>>   		vq->avail_idx = vq->last_avail_idx;
>>>>   		break;
>>>> @@ -1366,6 +1392,8 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
>>>>   		s.num = vq->last_avail_idx;
>>>>   		if (copy_to_user(argp, &s, sizeof s))
>>>>   			r = -EFAULT;
>>>> +		if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
>>>> +			s.num |= vq->avail_wrap_counter << 31;
>>>>   		break;
>>>>   	case VHOST_SET_VRING_ADDR:
>>>>   		if (copy_from_user(&a, argp, sizeof a)) {
>>> 'last_used_idx' also needs to be saved/restored here.
>>>
>>> I have figured out the root cause of broken device after reloading
>>> 'virtio-net' module, all indices have been reset for a reloading but
>>> 'last_used_idx' is not properly reset in this case. This confuses
>>> handle_rx()/tx().
>>>
>>> Wei
>>>
>> Good catch, so we probably need a new ioctl to sync between qemu and vhost.
>>
>> Something like VHOST_SET/GET_USED_BASE.
> Sure, or can we expand 'vhost_vring_state' to keep them done in a bunch?

It's port of uapi, so we can't.

Thanks

>
>> Thanks
>>

^ permalink raw reply

* Re: [PATCH] [RFC] bpf: tracing: new helper bpf_get_current_cgroup_ino
From: Daniel Borkmann @ 2018-05-23  8:57 UTC (permalink / raw)
  To: Y Song, Alexei Starovoitov
  Cc: Alban Crequy, netdev, linux-kernel, containers, cgroups,
	Alban Crequy, tj
In-Reply-To: <CAH3MdRVwmKd84ePvNX+NuAj3TfA_28BObEmzBqGXv=P5_A=8fQ@mail.gmail.com>

On 05/23/2018 06:31 AM, Y Song wrote:
> On Tue, May 22, 2018 at 8:35 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Tue, May 22, 2018 at 08:33:24PM -0700, Y Song wrote:
>>> +       struct cgroup *cgrp = task_dfl_cgroup(current);
>>> +       if (!cgrp)
>>> +               return -EINVAL;
>>
>> why this check is needed?
> 
> No reason :-) Originally I am concerned whether it is possible cgrp
> could be NULL.
> By looking at the code, it SEEMS to me that it could not be NULL, but I am not
> 100% sure (as I am not a cgroup expert). Since you are asking,
> probably means it cannot be NULL, so will remove it in formal upstream patch.

Aside from this the cgrp->kn->id.id is also u64, so overlapping this with
error codes we'll get into a similar issue as with bpf_perf_event_read().

^ permalink raw reply

* [PATCH net-next v3] net: sched: don't disable bh when accessing action idr
From: Vlad Buslov @ 2018-05-23  8:52 UTC (permalink / raw)
  To: jiri; +Cc: netdev, jhs, xiyou.wangcong, davem, linux-kernel, Vlad Buslov
In-Reply-To: <20180523073239.GC3155@nanopsycho>

Initial net_device implementation used ingress_lock spinlock to synchronize
ingress path of device. This lock was used in both process and bh context.
In some code paths action map lock was obtained while holding ingress_lock.
Commit e1e992e52faa ("[NET_SCHED] protect action config/dump from irqs")
modified actions to always disable bh, while using action map lock, in
order to prevent deadlock on ingress_lock in softirq. This lock was removed
in commit 555353cfa1ae ("netdev: The ingress_lock member is no longer
needed.").

Another reason to disable bh was filters delete code, that released actions
in rcu callback. This code was changed to release actions from workqueue
context in patch set "net_sched: close the race between call_rcu() and
cleanup_net()".

With these changes it is no longer necessary to continue disable bh while
accessing action map.

Replace all action idr spinlock usage with regular calls that do not
disable bh.

Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
---
Changes from V2 to V3:
 - Expanded commit message.

Changes from V1 to V2:
 - Expanded commit message.

 net/sched/act_api.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 72251241665a..3f4cf930f809 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -77,9 +77,9 @@ static void free_tcf(struct tc_action *p)
 
 static void tcf_idr_remove(struct tcf_idrinfo *idrinfo, struct tc_action *p)
 {
-	spin_lock_bh(&idrinfo->lock);
+	spin_lock(&idrinfo->lock);
 	idr_remove(&idrinfo->action_idr, p->tcfa_index);
-	spin_unlock_bh(&idrinfo->lock);
+	spin_unlock(&idrinfo->lock);
 	gen_kill_estimator(&p->tcfa_rate_est);
 	free_tcf(p);
 }
@@ -156,7 +156,7 @@ static int tcf_dump_walker(struct tcf_idrinfo *idrinfo, struct sk_buff *skb,
 	struct tc_action *p;
 	unsigned long id = 1;
 
-	spin_lock_bh(&idrinfo->lock);
+	spin_lock(&idrinfo->lock);
 
 	s_i = cb->args[0];
 
@@ -191,7 +191,7 @@ static int tcf_dump_walker(struct tcf_idrinfo *idrinfo, struct sk_buff *skb,
 	if (index >= 0)
 		cb->args[0] = index + 1;
 
-	spin_unlock_bh(&idrinfo->lock);
+	spin_unlock(&idrinfo->lock);
 	if (n_i) {
 		if (act_flags & TCA_FLAG_LARGE_DUMP_ON)
 			cb->args[1] = n_i;
@@ -261,9 +261,9 @@ static struct tc_action *tcf_idr_lookup(u32 index, struct tcf_idrinfo *idrinfo)
 {
 	struct tc_action *p = NULL;
 
-	spin_lock_bh(&idrinfo->lock);
+	spin_lock(&idrinfo->lock);
 	p = idr_find(&idrinfo->action_idr, index);
-	spin_unlock_bh(&idrinfo->lock);
+	spin_unlock(&idrinfo->lock);
 
 	return p;
 }
@@ -323,7 +323,7 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est,
 	}
 	spin_lock_init(&p->tcfa_lock);
 	idr_preload(GFP_KERNEL);
-	spin_lock_bh(&idrinfo->lock);
+	spin_lock(&idrinfo->lock);
 	/* user doesn't specify an index */
 	if (!index) {
 		index = 1;
@@ -331,7 +331,7 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est,
 	} else {
 		err = idr_alloc_u32(idr, NULL, &index, index, GFP_ATOMIC);
 	}
-	spin_unlock_bh(&idrinfo->lock);
+	spin_unlock(&idrinfo->lock);
 	idr_preload_end();
 	if (err)
 		goto err3;
@@ -369,9 +369,9 @@ void tcf_idr_insert(struct tc_action_net *tn, struct tc_action *a)
 {
 	struct tcf_idrinfo *idrinfo = tn->idrinfo;
 
-	spin_lock_bh(&idrinfo->lock);
+	spin_lock(&idrinfo->lock);
 	idr_replace(&idrinfo->action_idr, a, a->tcfa_index);
-	spin_unlock_bh(&idrinfo->lock);
+	spin_unlock(&idrinfo->lock);
 }
 EXPORT_SYMBOL(tcf_idr_insert);
 
-- 
2.7.5

^ permalink raw reply related

* Re: [PATCH] netfilter: uapi: includes linux/types.h
From: Pablo Neira Ayuso @ 2018-05-23  7:49 UTC (permalink / raw)
  To: YueHaibing; +Cc: kadlec, linux-kernel, netdev, coreteam
In-Reply-To: <20180523070326.15968-1-yuehaibing@huawei.com>

On Wed, May 23, 2018 at 03:03:26PM +0800, YueHaibing wrote:
> gcc-7.3.0 report following warning:
> ./usr/include/linux/netfilter/nf_osf.h:27: found __[us]{8,16,32,64} type without #include <linux/types.h>
> 
> includes linux/types.h to fix it.

Thanks.

There's already a fix for this in the nf-next queue.

commit 01cd267bff52619a53fa05c930ea5ed53493d21a
Author: Florian Westphal <fw@strlen.de>
Date:   Tue May 8 10:05:38 2018 +0200

    netfilter: fix fallout from xt/nf osf separation

^ permalink raw reply

* [PATCH net] net/mlx4: Fix irq-unsafe spinlock usage
From: Tariq Toukan @ 2018-05-23  7:41 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Eran Ben Elisha, Jack Morgenstein, Tariq Toukan

From: Jack Morgenstein <jackm@dev.mellanox.co.il>

spin_lock/unlock was used instead of spin_un/lock_irq
in a procedure used in process space, on a spinlock
which can be grabbed in an interrupt.

This caused the stack trace below to be displayed (on kernel
4.17.0-rc1 compiled with Lock Debugging enabled):

[  154.661474] WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
[  154.668909] 4.17.0-rc1-rdma_rc_mlx+ #3 Tainted: G          I
[  154.675856] -----------------------------------------------------
[  154.682706] modprobe/10159 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[  154.690254] 00000000f3b0e495 (&(&qp_table->lock)->rlock){+.+.}, at: mlx4_qp_remove+0x20/0x50 [mlx4_core]
[  154.700927]
and this task is already holding:
[  154.707461] 0000000094373b5d (&(&cq->lock)->rlock/1){....}, at: destroy_qp_common+0x111/0x560 [mlx4_ib]
[  154.718028] which would create a new lock dependency:
[  154.723705]  (&(&cq->lock)->rlock/1){....} -> (&(&qp_table->lock)->rlock){+.+.}
[  154.731922]
but this new dependency connects a SOFTIRQ-irq-safe lock:
[  154.740798]  (&(&cq->lock)->rlock){..-.}
[  154.740800]
... which became SOFTIRQ-irq-safe at:
[  154.752163]   _raw_spin_lock_irqsave+0x3e/0x50
[  154.757163]   mlx4_ib_poll_cq+0x36/0x900 [mlx4_ib]
[  154.762554]   ipoib_tx_poll+0x4a/0xf0 [ib_ipoib]
...
to a SOFTIRQ-irq-unsafe lock:
[  154.815603]  (&(&qp_table->lock)->rlock){+.+.}
[  154.815604]
... which became SOFTIRQ-irq-unsafe at:
[  154.827718] ...
[  154.827720]   _raw_spin_lock+0x35/0x50
[  154.833912]   mlx4_qp_lookup+0x1e/0x50 [mlx4_core]
[  154.839302]   mlx4_flow_attach+0x3f/0x3d0 [mlx4_core]

Since mlx4_qp_lookup() is called only in process space, we can
simply replace the spin_un/lock calls with spin_un/lock_irq calls.

Fixes: 6dc06c08bef1 ("net/mlx4: Fix the check in attaching steering rules")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
---

Hi Dave, please queue for -stable >= 4.12.

---
 drivers/net/ethernet/mellanox/mlx4/qp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/qp.c b/drivers/net/ethernet/mellanox/mlx4/qp.c
index 3aaf4bad6c5a..427e7a31862c 100644
--- a/drivers/net/ethernet/mellanox/mlx4/qp.c
+++ b/drivers/net/ethernet/mellanox/mlx4/qp.c
@@ -393,11 +393,11 @@ struct mlx4_qp *mlx4_qp_lookup(struct mlx4_dev *dev, u32 qpn)
 	struct mlx4_qp_table *qp_table = &mlx4_priv(dev)->qp_table;
 	struct mlx4_qp *qp;
 
-	spin_lock(&qp_table->lock);
+	spin_lock_irq(&qp_table->lock);
 
 	qp = __mlx4_qp_lookup(dev, qpn);
 
-	spin_unlock(&qp_table->lock);
+	spin_unlock_irq(&qp_table->lock);
 	return qp;
 }
 
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH net-next v2] net: sched: don't disable bh when accessing action idr
From: Jiri Pirko @ 2018-05-23  7:32 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: davem, netdev, jhs, xiyou.wangcong, linux-kernel
In-Reply-To: <1526932984-11544-1-git-send-email-vladbu@mellanox.com>

Mon, May 21, 2018 at 10:03:04PM CEST, vladbu@mellanox.com wrote:
>Initial net_device implementation used ingress_lock spinlock to synchronize
>ingress path of device. This lock was used in both process and bh context.
>In some code paths action map lock was obtained while holding ingress_lock.
>Commit e1e992e52faa ("[NET_SCHED] protect action config/dump from irqs")
>modified actions to always disable bh, while using action map lock, in
>order to prevent deadlock on ingress_lock in softirq. This lock was removed
>from net_device, so disabling bh, while accessing action map, is no longer
>necessary.
>
>Replace all action idr spinlock usage with regular calls that do not
>disable bh.
>
>Signed-off-by: Vlad Buslov <vladbu@mellanox.com>

Please add my tag to v3, with the description changes requested by Cong.
Acked-by: Jiri Pirko <jiri@mellanox.com>

Thanks!

^ permalink raw reply

* Re: [PATCH net-next v1] netfilter: provide input interface for route lookup for rpfilter
From: Pablo Neira Ayuso @ 2018-05-23  7:26 UTC (permalink / raw)
  To: Vincent Bernat
  Cc: Jozsef Kadlecsik, Florian Westphal, David S. Miller,
	netfilter-devel, netdev
In-Reply-To: <20180520110338.9376-1-vincent@bernat.im>

On Sun, May 20, 2018 at 01:03:38PM +0200, Vincent Bernat wrote:
> In commit 47b7e7f82802, this bit was removed at the same time the
> RT6_LOOKUP_F_IFACE flag was removed. However, it is needed when
> link-local addresses are used, which is a very common case: when
> packets are routed, neighbor solicitations are done using link-local
> addresses. For example, the following neighbor solicitation is not
> matched by "-m rpfilter":
> 
>     IP6 fe80::5254:33ff:fe00:1 > ff02::1:ff00:3: ICMP6, neighbor
>     solicitation, who has 2001:db8::5254:33ff:fe00:3, length 32
> 
> Commit 47b7e7f82802 doesn't quite explain why we shouldn't use
> RT6_LOOKUP_F_IFACE in the rpfilter case. I suppose the interface check
> later in the function would make it redundant. However, the remaining
> of the routing code is using RT6_LOOKUP_F_IFACE when there is no
> source address (which matches rpfilter's case with a non-unicast
> destination, like with neighbor solicitation).

Applied, thanks Vincent.

^ permalink raw reply

* Re: [PATCH][V2] net/mlx4: fix spelling mistake: "Inrerface" -> "Interface" and rephrase message
From: Tariq Toukan @ 2018-05-23  7:26 UTC (permalink / raw)
  To: Colin King, Tariq Toukan, David S . Miller, netdev, linux-rdma
  Cc: kernel-janitors, linux-kernel
In-Reply-To: <20180522154251.16789-1-colin.king@canonical.com>



On 22/05/2018 6:42 PM, Colin King wrote:
> From: Colin Ian King <colin.king@canonical.com>
> 
> Trivial fix to spelling mistake in mlx4_dbg debug message and also
> change the phrasing of the message so that is is more readable
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>
> 
> ---
> V2: rephrase message, as helpfully suggested by Tariq Toukan
> ---
>   drivers/net/ethernet/mellanox/mlx4/intf.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/intf.c b/drivers/net/ethernet/mellanox/mlx4/intf.c
> index 2edcce98ab2d..65482f004e50 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/intf.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/intf.c
> @@ -172,7 +172,7 @@ int mlx4_do_bond(struct mlx4_dev *dev, bool enable)
>   		list_add_tail(&dev_ctx->list, &priv->ctx_list);
>   		spin_unlock_irqrestore(&priv->ctx_lock, flags);
>   
> -		mlx4_dbg(dev, "Inrerface for protocol %d restarted with when bonded mode is %s\n",
> +		mlx4_dbg(dev, "Interface for protocol %d restarted with bonded mode %s\n",
>   			 dev_ctx->intf->protocol, enable ?
>   			 "enabled" : "disabled");
>   	}
> 

Thanks Colin.

Reviewed-by: Tariq Toukan <tariqt@mellanox.com>

^ permalink raw reply

* Re: [RFC V4 PATCH 7/8] vhost: packed ring support
From: Wei Xu @ 2018-05-23  7:17 UTC (permalink / raw)
  To: Jason Wang
  Cc: mst, kvm, virtualization, netdev, linux-kernel, jfreimann,
	tiwei.bie
In-Reply-To: <e12d4055-6ae6-4d00-ae8b-1acd88633f48@redhat.com>

On Wed, May 23, 2018 at 09:39:28AM +0800, Jason Wang wrote:
> 
> 
> On 2018年05月23日 00:54, Wei Xu wrote:
> >On Wed, May 16, 2018 at 08:32:20PM +0800, Jason Wang wrote:
> >>Signed-off-by: Jason Wang <jasowang@redhat.com>
> >>---
> >>  drivers/vhost/net.c   |   3 +-
> >>  drivers/vhost/vhost.c | 539 ++++++++++++++++++++++++++++++++++++++++++++++----
> >>  drivers/vhost/vhost.h |   8 +-
> >>  3 files changed, 513 insertions(+), 37 deletions(-)
> >>
> >>diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> >>index 8304c30..f2a0f5b 100644
> >>--- a/drivers/vhost/vhost.c
> >>+++ b/drivers/vhost/vhost.c
> >>@@ -1358,6 +1382,8 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
> >>  			break;
> >>  		}
> >>  		vq->last_avail_idx = s.num;
> >>+		if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
> >>+			vq->avail_wrap_counter = s.num >> 31;
> >>  		/* Forget the cached index value. */
> >>  		vq->avail_idx = vq->last_avail_idx;
> >>  		break;
> >>@@ -1366,6 +1392,8 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
> >>  		s.num = vq->last_avail_idx;
> >>  		if (copy_to_user(argp, &s, sizeof s))
> >>  			r = -EFAULT;
> >>+		if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
> >>+			s.num |= vq->avail_wrap_counter << 31;
> >>  		break;
> >>  	case VHOST_SET_VRING_ADDR:
> >>  		if (copy_from_user(&a, argp, sizeof a)) {
> >'last_used_idx' also needs to be saved/restored here.
> >
> >I have figured out the root cause of broken device after reloading
> >'virtio-net' module, all indices have been reset for a reloading but
> >'last_used_idx' is not properly reset in this case. This confuses
> >handle_rx()/tx().
> >
> >Wei
> >
> 
> Good catch, so we probably need a new ioctl to sync between qemu and vhost.
> 
> Something like VHOST_SET/GET_USED_BASE.

Sure, or can we expand 'vhost_vring_state' to keep them done in a bunch?

> 
> Thanks
> 

^ permalink raw reply

* Re: [PATCH net-next] sctp: add support for SCTP_REUSE_PORT sockopt
From: Xin Long @ 2018-05-23  7:04 UTC (permalink / raw)
  To: Neil Horman
  Cc: Michael Tuexen, Marcelo Ricardo Leitner, network dev, linux-sctp,
	davem
In-Reply-To: <20180522115151.GA28740@hmswarspite.think-freely.org>

On Tue, May 22, 2018 at 7:51 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
> On Tue, May 22, 2018 at 03:07:57PM +0800, Xin Long wrote:
>> On Mon, May 21, 2018 at 9:48 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
>> > On Mon, May 21, 2018 at 02:16:56PM +0200, Michael Tuexen wrote:
>> >> > On 21. May 2018, at 13:39, Neil Horman <nhorman@tuxdriver.com> wrote:
>> >> >
>> >> > On Sun, May 20, 2018 at 10:54:04PM -0300, Marcelo Ricardo Leitner wrote:
>> >> >> On Sun, May 20, 2018 at 08:50:59PM -0400, Neil Horman wrote:
>> >> >>> On Sat, May 19, 2018 at 03:44:40PM +0800, Xin Long wrote:
>> >> >>>> This feature is actually already supported by sk->sk_reuse which can be
>> >> >>>> set by SO_REUSEADDR. But it's not working exactly as RFC6458 demands in
>> >> >>>> section 8.1.27, like:
>> >> >>>>
>> >> >>>>  - This option only supports one-to-one style SCTP sockets
>> >> >>>>  - This socket option must not be used after calling bind()
>> >> >>>>    or sctp_bindx().
>> >> >>>>
>> >> >>>> Besides, SCTP_REUSE_PORT sockopt should be provided for user's programs.
>> >> >>>> Otherwise, the programs with SCTP_REUSE_PORT from other systems will not
>> >> >>>> work in linux.
>> >> >>>>
>> >> >>>> This patch reuses sk->sk_reuse and works pretty much as SO_REUSEADDR,
>> >> >>>> just with some extra setup limitations that are neeeded when it is being
>> >> >>>> enabled.
>> >> >>>>
>> >> >>>> "It should be noted that the behavior of the socket-level socket option
>> >> >>>> to reuse ports and/or addresses for SCTP sockets is unspecified", so it
>> >> >>>> leaves SO_REUSEADDR as is for the compatibility.
>> >> >>>>
>> >> >>>> Signed-off-by: Xin Long <lucien.xin@gmail.com>
>> >> >>>> ---
>> >> >>>> include/uapi/linux/sctp.h |  1 +
>> >> >>>> net/sctp/socket.c         | 48 +++++++++++++++++++++++++++++++++++++++++++++++
>> >> >>>> 2 files changed, 49 insertions(+)
>> >> >>>>
>> >> >>> A few things:
>> >> >>>
>> >> >>> 1) I agree with Tom, this feature is a complete duplication of the SK_REUSEPORT
>> >> >>> socket option.  I understand that this is an implementation of the option in the
>> >> >>> RFC, but its definately a duplication of a feature, which makes several things
>> >> >>> really messy.
>> >> >>>
>> >> >>> 2) The overloading of the sk_reuse opeion is a bad idea, for several reasons.
>> >> >>> Chief among them is the behavioral interference between this patch and the
>> >> >>> SO_REUSEADDR socket level option, that also sets this feature.  If you set
>> >> >>> sk_reuse via SO_REUSEADDR, you will set the SCTP port reuse feature regardless
>> >> >>> of the bind or 1:1/1:m state of the socket.  Vice versa, if you set this socket
>> >> >>> option via the SCTP_PORT_REUSE option you will inadvertently turn on address
>> >> >>> reuse for the socket.  We can't do that.
>> >> >>
>> >> >> Given your comments, going a bit further here, one other big
>> >> >> implication is that a port would never be able to be considered to
>> >> >> fully meet SCTP standards regarding reuse because a rogue application
>> >> >> may always abuse of the socket level opt to gain access to the port.
>> >> >>
>> >> >> IOW, the patch allows the application to use such restrictions against
>> >> >> itself and nothing else, which undermines the patch idea.
>> >> >>
>> >> > Agreed.
>> >> >
>> >> >> I lack the knowledge on why the SCTP option was proposed in the RFC. I
>> >> >> guess they had a good reason to add the restriction on 1:1/1:m style.
>> >> >> Does the usage of the current imply in any risk to SCTP sockets? If
>> >> >> yes, that would give some grounds for going forward with the SCTP
>> >> >> option.
>> >> >>
>> >> > I'm also not privy to why the sctp option was proposed, though I expect that the
>> >> > lack of standardization of SO_REUSEPORT probably had something to do with it.
>> >> > As for the reasoning behind restriction to only 1:1 sockets, if I had to guess,
>> >> > I would say it likely because it creates ordering difficulty at the application
>> >> > level.
>> >> >
>> >> > CC-ing Michael Tuxen, who I believe had some input on this RFC.  Hopefully he
>> >> > can shed some light on this.
>> >> Dear all,
>> >>
>> >> the reason this was added is to have a specified way to allow a system to
>> >> behave like a client and server making use of the INIT collision.
>> >>
>> >> For 1-to-many style sockets you can do this by creating a socket, binding it,
>> >> calling listen on it and trying to connect to the peer.
>> >>
>> >> For 1-to-1 style sockets you need two sockets for it. One listener and one
>> >> you use to connect (and close it in case of failure, open a new one...).
>> >>
>> >> It was not clear if one can achieve this with SO_REUSEPORT and/or SO_REUSEADDR
>> >> on all platforms. We left that unspecified.
>> >>
>> >> I hope this makes the intention clearer.
>> >>
>> > I think it makes the intention clearer yes, but it unfortunately does nothing in
>> > my mind to clarify how the implementation should best handle the potential
>> > overlap in functionality.  What I see here is that we have two functional paths
>> > (the SO_REUSEPORT path and the SCTP_PORT_REUSE path), which may or may not
>> > (depending on the OS implementation achieve the same functional goal (allowing
>> > multiple sockets to share a port while allowing one socket to listen and the
>> > other connect to a remote peer).  If both implementations do the same thing on a
>> > given platform, we can either just alias one to another and be done, but if they
>> > don't then we either have to implement both paths, and ensure that the
>> > SO_REUSEPORT path is a no-op/error return for SCTP sockets, or that each path
>> > implements a distinct feature set that is cleaarly documented.
>> >
>> > That said, I think we may be in luck.  Looking at the connect and listen paths,
>> > it appears to me that:
>> >
>> > 1) Sockets ignore SO_REUSEPORT in the connect and listen paths (save for any
>> > autobinding) so it would appear that the intent of the SCTP rfc can be honored
>> > via SO_REUSEPORT on linux.
>> >
>> > 2) SO_REUSEPORT prevents changing state after a bind has occured, so we can honr
>> > that part of the SCTP RFC.
>> >
>> > The only missing part is the restriction that SCTP_REUSE_PORT has which is
>> > unaccounted for is that 1:M sockets aren't allowed to enable port reuse.
>> > However, I think the implication from Michaels description above is that port
>> > reuse on a 1:M socket is implicit because a single socket can connect and listen
>> > in that use case, rather than there being a danger to doing so.
>> >
>> > As such, I would propose that we implement this socket option by simply setting
>> > the sk->sk_reuseport field in the sock structure, and document the fact that
>> > linux does not restrict port reuse from 1:M sockets.
>> Note that, sk->sk_reuseport is not affecting linux SCTP socket at all now.
>> linux SCTP socket doesn't really have SO_REUSEADDR (sk->sk_reuse)
>> support, but use sk->sk_reuse as REUSE_PORT, (yes, it is confusing).
>> Pls refer to sctp_get_port_local().
>>
> No, its not used now, but if you do use it to do something specific to SCTP (via
> the SCTP_REUSE_PORT socket option), you risk aliasing SO_REUSEPORT behavior to
> it, and if it doesn't match what the RFC behavior mandates, thats a problem.
>
>> So I'm not sure using sk->sk_reuseport here means we will drop sk->sk_reuse
>> use in linux SCTP but use sk->sk_reuseport instead, or we will think that socket
>> enables 'port reuse' when either of them is set.
>>
> I don't think we would drop the behavior of sk_reuse here, why would we?  As far
> as I can see, the behavior of SO_REUSEADDR (not SO_REUSEPORT), isn't in
> question, is it?
>
>> Note some users may be already using SO_REUSEADDR to enable the 'port
>> reuse' in linux sctp socket. If we're changing to sk->sk_reuseport, we may face
>> a compatibility problem.
>>
> I don't see how the behavior of SO_REUSEADDR is in question here.  All I'm
> suggesting is that you simplify this patch so that the SCTP_REUSE_PORT socket
> option set sk_reuseport, as that option to my eyes conforms to the sctp rfc
> requirements.  Or am I' missing something?
No, I am :)
sk_reuseport seems more complicated than sk_reuse. I kind of mixed them.
I need to check more beofore continuing. Thanks.

>
> Neil
>
>>
>> >
>> > Thoughts?
>> > Neil
>> >
>>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox