Netdev List
 help / color / mirror / Atom feed
* Re: [nf-next] netfilter: extend SRH match to support matching previous, next and last SID
From: Ahmed Abdelsalam @ 2018-04-23 20:01 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: fw, davem, dav.lebrun, linux-kernel, netfilter-devel, coreteam,
	netdev
In-Reply-To: <20180423173047.gsf2xjlmpichyvte@salvia>

On Mon, 23 Apr 2018 19:30:47 +0200
Pablo Neira Ayuso <pablo@netfilter.org> wrote:

> On Mon, Apr 23, 2018 at 05:48:22AM -0500, Ahmed Abdelsalam wrote:
> > Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
> > ---
> >  include/uapi/linux/netfilter_ipv6/ip6t_srh.h | 22 +++++++++++++--
> >  net/ipv6/netfilter/ip6t_srh.c                | 41 +++++++++++++++++++++++++++-
> >  2 files changed, 60 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/uapi/linux/netfilter_ipv6/ip6t_srh.h b/include/uapi/linux/netfilter_ipv6/ip6t_srh.h
> > index f3cc0ef..9808382 100644
> > --- a/include/uapi/linux/netfilter_ipv6/ip6t_srh.h
> > +++ b/include/uapi/linux/netfilter_ipv6/ip6t_srh.h
> > @@ -17,7 +17,10 @@
> >  #define IP6T_SRH_LAST_GT        0x0100
> >  #define IP6T_SRH_LAST_LT        0x0200
> >  #define IP6T_SRH_TAG            0x0400
> > -#define IP6T_SRH_MASK           0x07FF
> > +#define IP6T_SRH_PSID           0x0800
> > +#define IP6T_SRH_NSID           0x1000
> > +#define IP6T_SRH_LSID           0x2000
> > +#define IP6T_SRH_MASK           0x3FFF
> >  
> >  /* Values for "mt_invflags" field in struct ip6t_srh */
> >  #define IP6T_SRH_INV_NEXTHDR    0x0001
> > @@ -31,7 +34,10 @@
> >  #define IP6T_SRH_INV_LAST_GT    0x0100
> >  #define IP6T_SRH_INV_LAST_LT    0x0200
> >  #define IP6T_SRH_INV_TAG        0x0400
> > -#define IP6T_SRH_INV_MASK       0x07FF
> > +#define IP6T_SRH_INV_PSID       0x0800
> > +#define IP6T_SRH_INV_NSID       0x1000
> > +#define IP6T_SRH_INV_LSID       0x2000
> > +#define IP6T_SRH_INV_MASK       0x3FFF
> >  
> >  /**
> >   *      struct ip6t_srh - SRH match options
> > @@ -40,6 +46,12 @@
> >   *      @ segs_left: Segments left field of SRH
> >   *      @ last_entry: Last entry field of SRH
> >   *      @ tag: Tag field of SRH
> > + *      @ psid_addr: Address of previous SID in SRH SID list
> > + *      @ nsid_addr: Address of NEXT SID in SRH SID list
> > + *      @ lsid_addr: Address of LAST SID in SRH SID list
> > + *      @ psid_msk: Mask of previous SID in SRH SID list
> > + *      @ nsid_msk: Mask of next SID in SRH SID list
> > + *      @ lsid_msk: MAsk of last SID in SRH SID list
> >   *      @ mt_flags: match options
> >   *      @ mt_invflags: Invert the sense of match options
> >   */
> > @@ -50,6 +62,12 @@ struct ip6t_srh {
> >  	__u8                    segs_left;
> >  	__u8                    last_entry;
> >  	__u16                   tag;
> > +	struct in6_addr		psid_addr;
> > +	struct in6_addr		nsid_addr;
> > +	struct in6_addr		lsid_addr;
> > +	struct in6_addr		psid_msk;
> > +	struct in6_addr		nsid_msk;
> > +	struct in6_addr		lsid_msk;
> 
> This is changing something exposed through UAPI, so you will need a
> new revision for this.

Could you please advice what should be done in this case? 

> 
> >  	__u16                   mt_flags;
> >  	__u16                   mt_invflags;
> >  };
> > diff --git a/net/ipv6/netfilter/ip6t_srh.c b/net/ipv6/netfilter/ip6t_srh.c
> > index 33719d5..2b5cc73 100644
> > --- a/net/ipv6/netfilter/ip6t_srh.c
> > +++ b/net/ipv6/netfilter/ip6t_srh.c
> > @@ -30,7 +30,9 @@ static bool srh_mt6(const struct sk_buff *skb, struct xt_action_param *par)
> >  	const struct ip6t_srh *srhinfo = par->matchinfo;
> >  	struct ipv6_sr_hdr *srh;
> >  	struct ipv6_sr_hdr _srh;
> > -	int hdrlen, srhoff = 0;
> > +	int hdrlen, psidoff, nsidoff, lsidoff, srhoff = 0;
> > +	struct in6_addr *psid, *nsid, *lsid;
> > +	struct in6_addr _psid, _nsid, _lsid;
> 
> Could you rearrange variable definitions? ie. longest line first, eg.
> 
> 	int hdrlen, psidoff, nsidoff, lsidoff, srhoff = 0;
>   	const struct ip6t_srh *srhinfo = par->matchinfo;
> 	struct in6_addr *psid, *nsid, *lsid;
>   	struct ipv6_sr_hdr *srh;
>   	struct ipv6_sr_hdr _srh;
> 

Ok I will re-arrange them in reverse christmas tree form. 

Ahmed 

-- 
Ahmed Abdelsalam <amsalam20@gmail.com>

^ permalink raw reply

* Re: [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt
From: Björn Töpel @ 2018-04-23 20:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Willem de Bruijn, Daniel Borkmann, Netdev, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali,
	Zhang, Qi Z
In-Reply-To: <20180423190615-mutt-send-email-mst@kernel.org>

2018-04-23 18:18 GMT+02:00 Michael S. Tsirkin <mst@redhat.com>:

[...]

>> +static void xdp_umem_unpin_pages(struct xdp_umem *umem)
>> +{
>> +     unsigned int i;
>> +
>> +     if (umem->pgs) {
>> +             for (i = 0; i < umem->npgs; i++)
>
> Since you pin them with FOLL_WRITE, I assume these pages
> are written to.
> Don't you need set_page_dirty_lock here?
>

Hmm, I actually *removed* it from the RFC V2, but after doing some
homework, I think you're right. Thanks for pointing this out!

Thinking more about this; This function is called from sk_destruct,
and in the Tx case the sk_destruct can be called from interrupt
context, where set_page_dirty_lock cannot be called.

Are there any preferred ways of solving this? Scheduling the whole
xsk_destruct call to a workqueue is one way (I think). Any
cleaner/better way?

[...]

>> +static int __xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
>> +{
>> +     u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
>> +     u64 addr = mr->addr, size = mr->len;
>> +     unsigned int nframes;
>> +     int size_chk, err;
>> +
>> +     if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
>> +             /* Strictly speaking we could support this, if:
>> +              * - huge pages, or*
>
> what does "or*" here mean?
>

Oops, I'll change to just 'or' in the next revision.


Thanks!
Björn

^ permalink raw reply

* Re: [RFC V3 PATCH 0/8] Packed ring for vhost
From: Michael S. Tsirkin @ 2018-04-23 19:59 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Jason Wang, kvm, virtualization, netdev, linux-kernel, tiwei.bie,
	jfreimann, wexu
In-Reply-To: <20180423193120.GD30033@char.us.oracle.com>

On Mon, Apr 23, 2018 at 03:31:20PM -0400, Konrad Rzeszutek Wilk wrote:
> On Mon, Apr 23, 2018 at 01:34:52PM +0800, Jason Wang wrote:
> > Hi all:
> > 
> > This RFC implement packed ring layout. The code were tested with
> > Tiwei's RFC V2 a thttps://lkml.org/lkml/2018/4/1/48. Some fixups and
> > tweaks were needed on top of Tiwei's code to make it run. TCP stream
> > and pktgen does not show obvious difference compared with split ring.
> 
> I have to ask then - what is the benefit of this?

You can use this with dpdk within guest.
The DPDK version did see a performance improvement so hopefully with
future versions this will too.

> > 
> > Changes from V2:
> > - do not use & in checking desc_event_flags
> > - off should be most significant bit
> > - remove the workaround of mergeable buffer for dpdk prototype
> > - id should be in the last descriptor in the chain
> > - keep _F_WRITE for write descriptor when adding used
> > - device flags updating should use ADDR_USED type
> > - return error on unexpected unavail descriptor in a chain
> > - return false in vhost_ve_avail_empty is descriptor is available
> > - track last seen avail_wrap_counter
> > - correctly examine available descriptor in get_indirect_packed()
> > - vhost_idx_diff should return u16 instead of bool
> > 
> > Changes from V1:
> > 
> > - Refactor vhost used elem code to avoid open coding on used elem
> > - Event suppression support (compile test only).
> > - Indirect descriptor support (compile test only).
> > - Zerocopy support.
> > - vIOMMU support.
> > - SCSI/VSOCK support (compile test only).
> > - Fix several bugs
> > 
> > For simplicity, I don't implement batching or other optimizations.
> > 
> > Please review.
> > 
> > Jason Wang (8):
> >   vhost: move get_rx_bufs to vhost.c
> >   vhost: hide used ring layout from device
> >   vhost: do not use vring_used_elem
> >   vhost_net: do not explicitly manipulate vhost_used_elem
> >   vhost: vhost_put_user() can accept metadata type
> >   virtio: introduce packed ring defines
> >   vhost: packed ring support
> >   vhost: event suppression for packed ring
> > 
> >  drivers/vhost/net.c                | 136 ++----
> >  drivers/vhost/scsi.c               |  62 +--
> >  drivers/vhost/vhost.c              | 824 ++++++++++++++++++++++++++++++++++---
> >  drivers/vhost/vhost.h              |  47 ++-
> >  drivers/vhost/vsock.c              |  42 +-
> >  include/uapi/linux/virtio_config.h |   9 +
> >  include/uapi/linux/virtio_ring.h   |  32 ++
> >  7 files changed, 926 insertions(+), 226 deletions(-)
> > 
> > -- 
> > 2.7.4
> > 

^ permalink raw reply

* Re: [PATCH bpf-next v5 06/10] tools/bpf: add bpf_get_stack helper to tools headers
From: Alexei Starovoitov @ 2018-04-23 19:48 UTC (permalink / raw)
  To: Yonghong Song; +Cc: ast, daniel, netdev, ecree, kernel-team
In-Reply-To: <20180423175417.4104464-7-yhs@fb.com>

On Mon, Apr 23, 2018 at 10:54:13AM -0700, Yonghong Song wrote:
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---

please add few words to commit log. Don't leave it empty
even for trivial patches.

^ permalink raw reply

* Re: [PATCH bpf-next v5 10/10] tools/bpf: add a test for bpf_get_stack with tracepoint prog
From: Alexei Starovoitov @ 2018-04-23 19:47 UTC (permalink / raw)
  To: Yonghong Song; +Cc: ast, daniel, netdev, ecree, kernel-team
In-Reply-To: <20180423175417.4104464-11-yhs@fb.com>

On Mon, Apr 23, 2018 at 10:54:17AM -0700, Yonghong Song wrote:
> The test_stacktrace_map and test_stacktrace_build_id are
> enhanced to call bpf_get_stack in the helper to get the
> stack trace as well.  The stack traces from bpf_get_stack
> and bpf_get_stackid are compared to ensure that for the
> same stack as represented as the same hash, their ip addresses
> or build id's must be the same.
> 
> Signed-off-by: Yonghong Song <yhs@fb.com>
...
>  /* taken from /sys/kernel/debug/tracing/events/sched/sched_switch/format */
> @@ -44,7 +51,10 @@ struct sched_switch_args {
>  SEC("tracepoint/sched/sched_switch")
>  int oncpu(struct sched_switch_args *ctx)
>  {
> +	__u32 max_len = PERF_MAX_STACK_DEPTH * sizeof(__u64);
>  	__u32 key = 0, val = 0, *value_p;
> +	void *stack_p;
> +

nit: unnecessary extra empty line.
otherwise
Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Siwei Liu @ 2018-04-23 19:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stephen Hemminger, Jiri Pirko, Sridhar Samudrala, David Miller,
	Netdev, virtualization, virtio-dev, Brandeburg, Jesse,
	Alexander Duyck, Jakub Kicinski, Jason Wang
In-Reply-To: <20180423205019-mutt-send-email-mst@kernel.org>

On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:
>> On Mon, 23 Apr 2018 20:24:56 +0300
>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>
>> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:
>> > > > >
>> > > > >I will NAK patches to change to common code for netvsc especially the
>> > > > >three device model.  MS worked hard with distro vendors to support transparent
>> > > > >mode, ans we really can't have a new model; or do backport.
>> > > > >
>> > > > >Plus, DPDK is now dependent on existing model.
>> > > >
>> > > > Sorry, but nobody here cares about dpdk or other similar oddities.
>> > >
>> > > The network device model is a userspace API, and DPDK is a userspace application.
>> >
>> > It is userspace but are you sure dpdk is actually poking at netdevs?
>> > AFAIK it's normally banging device registers directly.
>> >
>> > > You can't go breaking userspace even if you don't like the application.
>> >
>> > Could you please explain how is the proposed patchset breaking
>> > userspace? Ignoring DPDK for now, I don't think it changes the userspace
>> > API at all.
>> >
>>
>> The DPDK has a device driver vdev_netvsc which scans the Linux network devices
>> to look for Linux netvsc device and the paired VF device and setup the
>> DPDK environment.  This setup creates a DPDK failsafe (bondingish) instance
>> and sets up TAP support over the Linux netvsc device as well as the Mellanox
>> VF device.
>>
>> So it depends on existing 2 device model. You can't go to a 3 device model
>> or start hiding devices from userspace.
>
> Okay so how does the existing patch break that? IIUC does not go to
> a 3 device model since netvsc calls failover_register directly.
>
>> Also, I am working on associating netvsc and VF device based on serial number
>> rather than MAC address. The serial number is how Windows works now, and it makes
>> sense for Linux and Windows to use the same mechanism if possible.
>
> Maybe we should support same for virtio ...
> Which serial do you mean? From vpd?
>
> I guess you will want to keep supporting MAC for old hypervisors?
>
> It all seems like a reasonable thing to support in the generic core.

That's the reason why I chose explicit identifier rather than rely on
MAC address to bind/pair a device. MAC address can change. Even if it
can't, malicious guest user can fake MAC address to skip binding.

-Siwei


>
> --
> MST

^ permalink raw reply

* Re: [PATCH bpf-next v5 03/10] bpf/verifier: refine retval R0 state for bpf_get_stack helper
From: Alexei Starovoitov @ 2018-04-23 19:42 UTC (permalink / raw)
  To: Yonghong Song; +Cc: ast, daniel, netdev, ecree, kernel-team
In-Reply-To: <20180423175417.4104464-4-yhs@fb.com>

On Mon, Apr 23, 2018 at 10:54:10AM -0700, Yonghong Song wrote:
> The special property of return values for helpers bpf_get_stack
> and bpf_probe_read_str are captured in verifier.
> Both helpers return a negative error code or
> a length, which is equal to or smaller than the buffer
> size argument. This additional information in the
> verifier can avoid the condition such as "retval > bufsize"
> in the bpf program. For example, for the code blow,
>     usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
>     if (usize < 0 || usize > max_len)
>         return 0;
> The verifier may have the following errors:
>     52: (85) call bpf_get_stack#65
>      R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
>      R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
>      R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
>      R9_w=inv800 R10=fp0,call_-1
>     53: (bf) r8 = r0
>     54: (bf) r1 = r8
>     55: (67) r1 <<= 32
>     56: (bf) r2 = r1
>     57: (77) r2 >>= 32
>     58: (25) if r2 > 0x31f goto pc+33
>      R0=inv(id=0) R1=inv(id=0,smax_value=9223372032559808512,
>                          umax_value=18446744069414584320,
>                          var_off=(0x0; 0xffffffff00000000))
>      R2=inv(id=0,umax_value=799,var_off=(0x0; 0x3ff))
>      R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
>      R8=inv(id=0) R9=inv800 R10=fp0,call_-1
>     59: (1f) r9 -= r8
>     60: (c7) r1 s>>= 32
>     61: (bf) r2 = r7
>     62: (0f) r2 += r1
>     math between map_value pointer and register with unbounded
>     min value is not allowed
> The failure is due to llvm compiler optimization where register "r2",
> which is a copy of "r1", is tested for condition while later on "r1"
> is used for map_ptr operation. The verifier is not able to track such
> inst sequence effectively.
> 
> Without the "usize > max_len" condition, there is no llvm optimization
> and the below generated code passed verifier:
>     52: (85) call bpf_get_stack#65
>      R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
>      R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
>      R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
>      R9_w=inv800 R10=fp0,call_-1
>     53: (b7) r1 = 0
>     54: (bf) r8 = r0
>     55: (67) r8 <<= 32
>     56: (c7) r8 s>>= 32
>     57: (6d) if r1 s> r8 goto pc+24
>      R0=inv(id=0,umax_value=800) R1=inv0 R6=ctx(id=0,off=0,imm=0)
>      R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
>      R8=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff)) R9=inv800
>      R10=fp0,call_-1
>     58: (bf) r2 = r7
>     59: (0f) r2 += r8
>     60: (1f) r9 -= r8
>     61: (bf) r1 = r6
> 
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  kernel/bpf/verifier.c | 25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
> 
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index aba9425..d00bf53 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -164,6 +164,8 @@ struct bpf_call_arg_meta {
>  	bool pkt_access;
>  	int regno;
>  	int access_size;
> +	s64 msize_smax_value;
> +	u64 msize_umax_value;
>  };
>  
>  static DEFINE_MUTEX(bpf_verifier_lock);
> @@ -1994,6 +1996,12 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
>  	} else if (arg_type_is_mem_size(arg_type)) {
>  		bool zero_size_allowed = (arg_type == ARG_CONST_SIZE_OR_ZERO);
>  
> +		/* remember the mem_size which may be used later
> +		 * to refine return values.
> +		 */
> +		meta->msize_smax_value = reg->smax_value;
> +		meta->msize_umax_value = reg->umax_value;
> +
>  		/* The register is SCALAR_VALUE; the access check
>  		 * happens using its boundaries.
>  		 */
> @@ -2333,6 +2341,21 @@ static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
>  	return 0;
>  }
>  
> +static void do_refine_retval_range(struct bpf_reg_state *regs, int ret_type,
> +				   int func_id,
> +				   struct bpf_call_arg_meta *meta)
> +{
> +	struct bpf_reg_state *ret_reg = &regs[BPF_REG_0];
> +
> +	if (ret_type != RET_INTEGER ||
> +	    (func_id != BPF_FUNC_get_stack &&
> +	     func_id != BPF_FUNC_probe_read_str))
> +		return;
> +
> +	ret_reg->smax_value = meta->msize_smax_value;
> +	ret_reg->umax_value = meta->msize_umax_value;

should we call:
        __reg_deduce_bounds(ret_reg);
        __reg_bound_offset(ret_reg);
here?
It doesn't seem that it will be called later and
ret_reg->var_off state will be more conservative than necessary.

^ permalink raw reply

* Re: [bpf PATCH 2/2] bpf: sockmap sample use clang flag, -target bpf
From: Alexei Starovoitov @ 2018-04-23 19:35 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, netdev
In-Reply-To: <20180423191107.21348.15124.stgit@john-Precision-Tower-5810>

On Mon, Apr 23, 2018 at 12:11:08PM -0700, John Fastabend wrote:
> Per Documentation/bpf/bpf_devel_QA.txt add the -target flag to the
> sockmap Makefile. Relevant text quoted here,
> 
>    Otherwise, you can use bpf target. Additionally, you _must_ use
>    bpf target when:
> 
>  - Your program uses data structures with pointer or long / unsigned
>    long types that interface with BPF helpers or context data
>    structures. Access into these structures is verified by the BPF
>    verifier and may result in verification failures if the native
>    architecture is not aligned with the BPF architecture, e.g. 64-bit.
>    An example of this is BPF_PROG_TYPE_SK_MSG require '-target bpf'
> 
> Fixes: 69e8cc134bcb ("bpf: sockmap sample program")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [bpf PATCH 1/2] bpf: Document sockmap '-target bpf' requirement for PROG_TYPE_SK_MSG
From: Alexei Starovoitov @ 2018-04-23 19:35 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, netdev
In-Reply-To: <20180423191102.21348.85601.stgit@john-Precision-Tower-5810>

On Mon, Apr 23, 2018 at 12:11:02PM -0700, John Fastabend wrote:
> BPF_PROG_TYPE_SK_MSG programs use a 'void *' for both data and the
> data_end pointers. Additionally, the verifier ensures that every
> accesses into the values is a __u64 read. This correctly maps on
> to the BPF 64-bit architecture.
> 
> However, to ensure that when building on 32bit architectures that
> clang uses correct types the '-target bpf' option _must_ be
> specified. To make this clear add a note to the Documentation.
> 
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [bpf PATCH v2 0/3] BPF, a couple sockmap fixes
From: Alexei Starovoitov @ 2018-04-23 19:34 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, netdev
In-Reply-To: <20180423182726.17999.269.stgit@john-Precision-Tower-5810>

On Mon, Apr 23, 2018 at 11:29:24AM -0700, John Fastabend wrote:
> While testing sockmap with more programs (besides our test programs)
> I found a couple issues.
> 
> The attached series fixes an issue where pinned maps were not
> working correctly, blocking sockets returned zero, and an error
> path that when the sock hit an out of memory case resulted in a
> double page_put() while doing ingress redirects.
> 
> See individual patches for more details.
> 
> v2: Incorporated Daniel's feedback to use map ops for uref put op
>     which also fixed the build error discovered in v1.

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [RFC V3 PATCH 0/8] Packed ring for vhost
From: Konrad Rzeszutek Wilk @ 2018-04-23 19:31 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, mst, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <1524461700-5469-1-git-send-email-jasowang@redhat.com>

On Mon, Apr 23, 2018 at 01:34:52PM +0800, Jason Wang wrote:
> Hi all:
> 
> This RFC implement packed ring layout. The code were tested with
> Tiwei's RFC V2 a thttps://lkml.org/lkml/2018/4/1/48. Some fixups and
> tweaks were needed on top of Tiwei's code to make it run. TCP stream
> and pktgen does not show obvious difference compared with split ring.

I have to ask then - what is the benefit of this?

> 
> Changes from V2:
> - do not use & in checking desc_event_flags
> - off should be most significant bit
> - remove the workaround of mergeable buffer for dpdk prototype
> - id should be in the last descriptor in the chain
> - keep _F_WRITE for write descriptor when adding used
> - device flags updating should use ADDR_USED type
> - return error on unexpected unavail descriptor in a chain
> - return false in vhost_ve_avail_empty is descriptor is available
> - track last seen avail_wrap_counter
> - correctly examine available descriptor in get_indirect_packed()
> - vhost_idx_diff should return u16 instead of bool
> 
> Changes from V1:
> 
> - Refactor vhost used elem code to avoid open coding on used elem
> - Event suppression support (compile test only).
> - Indirect descriptor support (compile test only).
> - Zerocopy support.
> - vIOMMU support.
> - SCSI/VSOCK support (compile test only).
> - Fix several bugs
> 
> For simplicity, I don't implement batching or other optimizations.
> 
> Please review.
> 
> Jason Wang (8):
>   vhost: move get_rx_bufs to vhost.c
>   vhost: hide used ring layout from device
>   vhost: do not use vring_used_elem
>   vhost_net: do not explicitly manipulate vhost_used_elem
>   vhost: vhost_put_user() can accept metadata type
>   virtio: introduce packed ring defines
>   vhost: packed ring support
>   vhost: event suppression for packed ring
> 
>  drivers/vhost/net.c                | 136 ++----
>  drivers/vhost/scsi.c               |  62 +--
>  drivers/vhost/vhost.c              | 824 ++++++++++++++++++++++++++++++++++---
>  drivers/vhost/vhost.h              |  47 ++-
>  drivers/vhost/vsock.c              |  42 +-
>  include/uapi/linux/virtio_config.h |   9 +
>  include/uapi/linux/virtio_ring.h   |  32 ++
>  7 files changed, 926 insertions(+), 226 deletions(-)
> 
> -- 
> 2.7.4
> 

^ permalink raw reply

* [bpf PATCH 2/2] bpf: sockmap sample use clang flag, -target bpf
From: John Fastabend @ 2018-04-23 19:11 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev
In-Reply-To: <20180423191102.21348.85601.stgit@john-Precision-Tower-5810>

Per Documentation/bpf/bpf_devel_QA.txt add the -target flag to the
sockmap Makefile. Relevant text quoted here,

   Otherwise, you can use bpf target. Additionally, you _must_ use
   bpf target when:

 - Your program uses data structures with pointer or long / unsigned
   long types that interface with BPF helpers or context data
   structures. Access into these structures is verified by the BPF
   verifier and may result in verification failures if the native
   architecture is not aligned with the BPF architecture, e.g. 64-bit.
   An example of this is BPF_PROG_TYPE_SK_MSG require '-target bpf'

Fixes: 69e8cc134bcb ("bpf: sockmap sample program")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 samples/sockmap/Makefile |    7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/samples/sockmap/Makefile b/samples/sockmap/Makefile
index 9bf2881..fa53f4d 100644
--- a/samples/sockmap/Makefile
+++ b/samples/sockmap/Makefile
@@ -65,11 +65,14 @@ $(src)/*.c: verify_target_bpf
 # asm/sysreg.h - inline assembly used by it is incompatible with llvm.
 # But, there is no easy way to fix it, so just exclude it since it is
 # useless for BPF samples.
+#
+# -target bpf option required with SK_MSG programs, this is to ensure
+#  reading 'void *' data types for data and data_end are __u64 reads.
 $(obj)/%.o: $(src)/%.c
 	$(CLANG) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) -I$(obj) \
 		-D__KERNEL__ -D__ASM_SYSREG_H -Wno-unused-value -Wno-pointer-sign \
 		-Wno-compare-distinct-pointer-types \
 		-Wno-gnu-variable-sized-type-not-at-end \
 		-Wno-address-of-packed-member -Wno-tautological-compare \
-		-Wno-unknown-warning-option \
-		-O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=obj -o $@
+		-Wno-unknown-warning-option -O2 -target bpf \
+		-emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=obj -o $@

^ permalink raw reply related

* [bpf PATCH 1/2] bpf: Document sockmap '-target bpf' requirement for PROG_TYPE_SK_MSG
From: John Fastabend @ 2018-04-23 19:11 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev

BPF_PROG_TYPE_SK_MSG programs use a 'void *' for both data and the
data_end pointers. Additionally, the verifier ensures that every
accesses into the values is a __u64 read. This correctly maps on
to the BPF 64-bit architecture.

However, to ensure that when building on 32bit architectures that
clang uses correct types the '-target bpf' option _must_ be
specified. To make this clear add a note to the Documentation.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 0 files changed

diff --git a/Documentation/bpf/bpf_devel_QA.txt b/Documentation/bpf/bpf_devel_QA.txt
index 1a0b704..da57601 100644
--- a/Documentation/bpf/bpf_devel_QA.txt
+++ b/Documentation/bpf/bpf_devel_QA.txt
@@ -557,6 +557,14 @@ A: Although LLVM IR generation and optimization try to stay architecture
        pulls in some header files containing file scope host assembly codes.
      - You can add "-fno-jump-tables" to work around the switch table issue.
 
-   Otherwise, you can use bpf target.
+   Otherwise, you can use bpf target. Additionally, you _must_ use bpf target
+   when:
+
+     - Your program uses data structures with pointer or long / unsigned long
+       types that interface with BPF helpers or context data structures. Access
+       into these structures is verified by the BPF verifier and may result
+       in verification failures if the native architecture is not aligned with
+       the BPF architecture, e.g. 64-bit. An example of this is
+       BPF_PROG_TYPE_SK_MSG require '-target bpf'
 
 Happy BPF hacking!

^ permalink raw reply related

* [PATCH net-next 2/2] net/sctp: Replace in/out stream arrays with flex_array
From: Oleg Babin @ 2018-04-23 18:41 UTC (permalink / raw)
  To: netdev, linux-sctp
  Cc: David S. Miller, Vlad Yasevich, Neil Horman, Xin Long,
	Marcelo Ricardo Leitner, Andrey Ryabinin
In-Reply-To: <1524508866-317485-1-git-send-email-obabin@virtuozzo.com>

This path replaces physically contiguous memory arrays
allocated using kmalloc_array() with flexible arrays.
This enables to avoid memory allocation failures on the
systems under a memory stress.

Signed-off-by: Oleg Babin <obabin@virtuozzo.com>
---
 include/net/sctp/structs.h |  1 +
 net/sctp/stream.c          | 78 +++++++++++++++++++++++++++++++++++-----------
 2 files changed, 61 insertions(+), 18 deletions(-)

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 578bb40..c7f42b4 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -57,6 +57,7 @@
 #include <linux/atomic.h>		/* This gets us atomic counters.  */
 #include <linux/skbuff.h>	/* We need sk_buff_head. */
 #include <linux/workqueue.h>	/* We need tq_struct.	 */
+#include <linux/flex_array.h>	/* We need flex_array.   */
 #include <linux/sctp.h>		/* We need sctp* header structs.  */
 #include <net/sctp/auth.h>	/* We need auth specific structs */
 #include <net/ip.h>		/* For inet_skb_parm */
diff --git a/net/sctp/stream.c b/net/sctp/stream.c
index 16e36c0..be372b0 100644
--- a/net/sctp/stream.c
+++ b/net/sctp/stream.c
@@ -40,13 +40,60 @@
 struct sctp_stream_out *sctp_stream_out_ptr(const struct sctp_stream *stream,
 					    __u16 sid)
 {
-	return ((struct sctp_stream_out *)(stream->out)) + sid;
+	return flex_array_get(stream->out, sid);
 }
 
 struct sctp_stream_in *sctp_stream_in_ptr(const struct sctp_stream *stream,
 					  __u16 sid)
 {
-	return ((struct sctp_stream_in *)(stream->in)) + sid;
+	return flex_array_get(stream->in, sid);
+}
+
+static struct flex_array *fa_alloc(size_t elem_size, size_t elem_count,
+				   gfp_t gfp)
+{
+	struct flex_array *result;
+	int err;
+
+	result = flex_array_alloc(elem_size, elem_count, gfp);
+	if (result) {
+		err = flex_array_prealloc(result, 0, elem_count, gfp);
+		if (err) {
+			flex_array_free(result);
+			result = NULL;
+		}
+	}
+
+	return result;
+}
+
+static void fa_free(struct flex_array *fa)
+{
+	if (fa)
+		flex_array_free(fa);
+}
+
+static void fa_copy(struct flex_array *fa, struct flex_array *from,
+		    size_t index, size_t count)
+{
+	void *elem;
+
+	while (count--) {
+		elem = flex_array_get(from, index);
+		flex_array_put(fa, index, elem, 0);
+		index++;
+	}
+}
+
+static void fa_zero(struct flex_array *fa, size_t index, size_t count)
+{
+	void *elem;
+
+	while (count--) {
+		elem = flex_array_get(fa, index);
+		memset(elem, 0, fa->element_size);
+		index++;
+	}
 }
 
 /* Migrates chunks from stream queues to new stream queues if needed,
@@ -106,19 +153,17 @@ static int sctp_stream_alloc_out(struct sctp_stream *stream, __u16 outcnt,
 	struct flex_array *out;
 	size_t elem_size = sizeof(struct sctp_stream_out);
 
-	out = kmalloc_array(outcnt, elem_size, gfp);
+	out = fa_alloc(elem_size, outcnt, gfp);
 	if (!out)
 		return -ENOMEM;
 
 	if (stream->out) {
-		memcpy(out, stream->out, min(outcnt, stream->outcnt) *
-					 elem_size);
-		kfree(stream->out);
+		fa_copy(out, stream->out, 0, min(outcnt, stream->outcnt));
+		fa_free(stream->out);
 	}
 
 	if (outcnt > stream->outcnt)
-		memset(((struct sctp_stream_out *)out) + stream->outcnt, 0,
-		       (outcnt - stream->outcnt) * elem_size);
+		fa_zero(out, stream->outcnt, (outcnt - stream->outcnt));
 
 	stream->out = out;
 
@@ -131,20 +176,17 @@ static int sctp_stream_alloc_in(struct sctp_stream *stream, __u16 incnt,
 	struct flex_array *in;
 	size_t elem_size = sizeof(struct sctp_stream_in);
 
-	in = kmalloc_array(incnt, elem_size, gfp);
-
+	in = fa_alloc(elem_size, incnt, gfp);
 	if (!in)
 		return -ENOMEM;
 
 	if (stream->in) {
-		memcpy(in, stream->in, min(incnt, stream->incnt) *
-				       elem_size);
-		kfree(stream->in);
+		fa_copy(in, stream->in, 0, min(incnt, stream->incnt));
+		fa_free(stream->in);
 	}
 
 	if (incnt > stream->incnt)
-		memset(((struct sctp_stream_in *)in) + stream->incnt, 0,
-		       (incnt - stream->incnt) * elem_size);
+		fa_zero(in, stream->incnt, (incnt - stream->incnt));
 
 	stream->in = in;
 
@@ -188,7 +230,7 @@ int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
 	ret = sctp_stream_alloc_in(stream, incnt, gfp);
 	if (ret) {
 		sched->free(stream);
-		kfree(stream->out);
+		fa_free(stream->out);
 		stream->out = NULL;
 		stream->outcnt = 0;
 		goto out;
@@ -220,8 +262,8 @@ void sctp_stream_free(struct sctp_stream *stream)
 	sched->free(stream);
 	for (i = 0; i < stream->outcnt; i++)
 		kfree(SCTP_SO(stream, i)->ext);
-	kfree(stream->out);
-	kfree(stream->in);
+	fa_free(stream->out);
+	fa_free(stream->in);
 }
 
 void sctp_stream_clear(struct sctp_stream *stream)
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next 1/2] net/sctp: Make wrappers for accessing in/out streams
From: Oleg Babin @ 2018-04-23 18:41 UTC (permalink / raw)
  To: netdev, linux-sctp
  Cc: David S. Miller, Vlad Yasevich, Neil Horman, Xin Long,
	Marcelo Ricardo Leitner, Andrey Ryabinin
In-Reply-To: <1524508866-317485-1-git-send-email-obabin@virtuozzo.com>

This patch introduces wrappers for accessing in/out streams indirectly.
This will enable to replace physically contiguous memory arrays
of streams with flexible arrays (or maybe any other appropriate
mechanism) which do memory allocation on a per-page basis.

Signed-off-by: Oleg Babin <obabin@virtuozzo.com>
---
 include/net/sctp/structs.h   |  30 +++++++-----
 net/sctp/chunk.c             |   6 ++-
 net/sctp/outqueue.c          |  11 +++--
 net/sctp/socket.c            |   4 +-
 net/sctp/stream.c            | 107 +++++++++++++++++++++++++------------------
 net/sctp/stream_interleave.c |   2 +-
 net/sctp/stream_sched.c      |  13 +++---
 net/sctp/stream_sched_prio.c |  22 ++++-----
 net/sctp/stream_sched_rr.c   |   8 ++--
 9 files changed, 116 insertions(+), 87 deletions(-)

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index a0ec462..578bb40 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -394,37 +394,37 @@ int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
 
 /* What is the current SSN number for this stream? */
 #define sctp_ssn_peek(stream, type, sid) \
-	((stream)->type[sid].ssn)
+	(sctp_stream_##type##_ptr((stream), (sid))->ssn)
 
 /* Return the next SSN number for this stream.	*/
 #define sctp_ssn_next(stream, type, sid) \
-	((stream)->type[sid].ssn++)
+	(sctp_stream_##type##_ptr((stream), (sid))->ssn++)
 
 /* Skip over this ssn and all below. */
 #define sctp_ssn_skip(stream, type, sid, ssn) \
-	((stream)->type[sid].ssn = ssn + 1)
+	(sctp_stream_##type##_ptr((stream), (sid))->ssn = ssn + 1)
 
 /* What is the current MID number for this stream? */
 #define sctp_mid_peek(stream, type, sid) \
-	((stream)->type[sid].mid)
+	(sctp_stream_##type##_ptr((stream), (sid))->mid)
 
 /* Return the next MID number for this stream.  */
 #define sctp_mid_next(stream, type, sid) \
-	((stream)->type[sid].mid++)
+	(sctp_stream_##type##_ptr((stream), (sid))->mid++)
 
 /* Skip over this mid and all below. */
 #define sctp_mid_skip(stream, type, sid, mid) \
-	((stream)->type[sid].mid = mid + 1)
+	(sctp_stream_##type##_ptr((stream), (sid))->mid = mid + 1)
 
-#define sctp_stream_in(asoc, sid) (&(asoc)->stream.in[sid])
+#define sctp_stream_in(asoc, sid) sctp_stream_in_ptr(&(asoc)->stream, (sid))
 
 /* What is the current MID_uo number for this stream? */
 #define sctp_mid_uo_peek(stream, type, sid) \
-	((stream)->type[sid].mid_uo)
+	(sctp_stream_##type##_ptr((stream), (sid))->mid_uo)
 
 /* Return the next MID_uo number for this stream.  */
 #define sctp_mid_uo_next(stream, type, sid) \
-	((stream)->type[sid].mid_uo++)
+	(sctp_stream_##type##_ptr((stream), (sid))->mid_uo++)
 
 /*
  * Pointers to address related SCTP functions.
@@ -1428,8 +1428,8 @@ struct sctp_stream_in {
 };
 
 struct sctp_stream {
-	struct sctp_stream_out *out;
-	struct sctp_stream_in *in;
+	struct flex_array *out;
+	struct flex_array *in;
 	__u16 outcnt;
 	__u16 incnt;
 	/* Current stream being sent, if any */
@@ -1451,6 +1451,14 @@ struct sctp_stream {
 	struct sctp_stream_interleave *si;
 };
 
+struct sctp_stream_out *sctp_stream_out_ptr(const struct sctp_stream *stream,
+					    __u16 sid);
+struct sctp_stream_in *sctp_stream_in_ptr(const struct sctp_stream *stream,
+					  __u16 sid);
+
+#define SCTP_SO(s, i) sctp_stream_out_ptr((s), (i))
+#define SCTP_SI(s, i) sctp_stream_in_ptr((s), (i))
+
 #define SCTP_STREAM_CLOSED		0x00
 #define SCTP_STREAM_OPEN		0x01
 
diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c
index be296d6..4b9310e 100644
--- a/net/sctp/chunk.c
+++ b/net/sctp/chunk.c
@@ -333,7 +333,8 @@ int sctp_chunk_abandoned(struct sctp_chunk *chunk)
 	if (SCTP_PR_TTL_ENABLED(chunk->sinfo.sinfo_flags) &&
 	    time_after(jiffies, chunk->msg->expires_at)) {
 		struct sctp_stream_out *streamout =
-			&chunk->asoc->stream.out[chunk->sinfo.sinfo_stream];
+			SCTP_SO(&chunk->asoc->stream,
+				chunk->sinfo.sinfo_stream);
 
 		if (chunk->sent_count) {
 			chunk->asoc->abandoned_sent[SCTP_PR_INDEX(TTL)]++;
@@ -347,7 +348,8 @@ int sctp_chunk_abandoned(struct sctp_chunk *chunk)
 	} else if (SCTP_PR_RTX_ENABLED(chunk->sinfo.sinfo_flags) &&
 		   chunk->sent_count > chunk->sinfo.sinfo_timetolive) {
 		struct sctp_stream_out *streamout =
-			&chunk->asoc->stream.out[chunk->sinfo.sinfo_stream];
+			SCTP_SO(&chunk->asoc->stream,
+				chunk->sinfo.sinfo_stream);
 
 		chunk->asoc->abandoned_sent[SCTP_PR_INDEX(RTX)]++;
 		streamout->ext->abandoned_sent[SCTP_PR_INDEX(RTX)]++;
diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index f211b3d..8d5d811 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -80,7 +80,7 @@ static inline void sctp_outq_head_data(struct sctp_outq *q,
 	q->out_qlen += ch->skb->len;
 
 	stream = sctp_chunk_stream_no(ch);
-	oute = q->asoc->stream.out[stream].ext;
+	oute = SCTP_SO(&q->asoc->stream, stream)->ext;
 	list_add(&ch->stream_list, &oute->outq);
 }
 
@@ -101,7 +101,7 @@ static inline void sctp_outq_tail_data(struct sctp_outq *q,
 	q->out_qlen += ch->skb->len;
 
 	stream = sctp_chunk_stream_no(ch);
-	oute = q->asoc->stream.out[stream].ext;
+	oute = SCTP_SO(&q->asoc->stream, stream)->ext;
 	list_add_tail(&ch->stream_list, &oute->outq);
 }
 
@@ -372,7 +372,7 @@ static int sctp_prsctp_prune_sent(struct sctp_association *asoc,
 		sctp_insert_list(&asoc->outqueue.abandoned,
 				 &chk->transmitted_list);
 
-		streamout = &asoc->stream.out[chk->sinfo.sinfo_stream];
+		streamout = SCTP_SO(&asoc->stream, chk->sinfo.sinfo_stream);
 		asoc->sent_cnt_removable--;
 		asoc->abandoned_sent[SCTP_PR_INDEX(PRIO)]++;
 		streamout->ext->abandoned_sent[SCTP_PR_INDEX(PRIO)]++;
@@ -416,7 +416,7 @@ static int sctp_prsctp_prune_unsent(struct sctp_association *asoc,
 		asoc->abandoned_unsent[SCTP_PR_INDEX(PRIO)]++;
 		if (chk->sinfo.sinfo_stream < asoc->stream.outcnt) {
 			struct sctp_stream_out *streamout =
-				&asoc->stream.out[chk->sinfo.sinfo_stream];
+				SCTP_SO(&asoc->stream, chk->sinfo.sinfo_stream);
 
 			streamout->ext->abandoned_unsent[SCTP_PR_INDEX(PRIO)]++;
 		}
@@ -1050,6 +1050,7 @@ static void sctp_outq_flush(struct sctp_outq *q, int rtx_timeout, gfp_t gfp)
 		/* Finally, transmit new packets.  */
 		while ((chunk = sctp_outq_dequeue_data(q)) != NULL) {
 			__u32 sid = ntohs(chunk->subh.data_hdr->stream);
+			__u8 stream_state = SCTP_SO(&asoc->stream, sid)->state;
 
 			/* Has this chunk expired? */
 			if (sctp_chunk_abandoned(chunk)) {
@@ -1059,7 +1060,7 @@ static void sctp_outq_flush(struct sctp_outq *q, int rtx_timeout, gfp_t gfp)
 				continue;
 			}
 
-			if (asoc->stream.out[sid].state == SCTP_STREAM_CLOSED) {
+			if (stream_state == SCTP_STREAM_CLOSED) {
 				sctp_outq_head_data(q, chunk);
 				goto sctp_flush_out;
 			}
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 80835ac..3442f7c 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -1907,7 +1907,7 @@ static int sctp_sendmsg_to_asoc(struct sctp_association *asoc,
 		goto err;
 	}
 
-	if (unlikely(!asoc->stream.out[sinfo->sinfo_stream].ext)) {
+	if (unlikely(!SCTP_SO(&asoc->stream, sinfo->sinfo_stream)->ext)) {
 		err = sctp_stream_init_ext(&asoc->stream, sinfo->sinfo_stream);
 		if (err)
 			goto err;
@@ -6942,7 +6942,7 @@ static int sctp_getsockopt_pr_streamstatus(struct sock *sk, int len,
 	if (!asoc || params.sprstat_sid >= asoc->stream.outcnt)
 		goto out;
 
-	streamoute = asoc->stream.out[params.sprstat_sid].ext;
+	streamoute = SCTP_SO(&asoc->stream, params.sprstat_sid)->ext;
 	if (!streamoute) {
 		/* Not allocated yet, means all stats are 0 */
 		params.sprstat_abandoned_unsent = 0;
diff --git a/net/sctp/stream.c b/net/sctp/stream.c
index f799043..16e36c0 100644
--- a/net/sctp/stream.c
+++ b/net/sctp/stream.c
@@ -37,6 +37,18 @@
 #include <net/sctp/sm.h>
 #include <net/sctp/stream_sched.h>
 
+struct sctp_stream_out *sctp_stream_out_ptr(const struct sctp_stream *stream,
+					    __u16 sid)
+{
+	return ((struct sctp_stream_out *)(stream->out)) + sid;
+}
+
+struct sctp_stream_in *sctp_stream_in_ptr(const struct sctp_stream *stream,
+					  __u16 sid)
+{
+	return ((struct sctp_stream_in *)(stream->in)) + sid;
+}
+
 /* Migrates chunks from stream queues to new stream queues if needed,
  * but not across associations. Also, removes those chunks to streams
  * higher than the new max.
@@ -78,34 +90,35 @@ static void sctp_stream_outq_migrate(struct sctp_stream *stream,
 		 * sctp_stream_update will swap ->out pointers.
 		 */
 		for (i = 0; i < outcnt; i++) {
-			kfree(new->out[i].ext);
-			new->out[i].ext = stream->out[i].ext;
-			stream->out[i].ext = NULL;
+			kfree(SCTP_SO(new, i)->ext);
+			SCTP_SO(new, i)->ext = SCTP_SO(stream, i)->ext;
+			SCTP_SO(stream, i)->ext = NULL;
 		}
 	}
 
 	for (i = outcnt; i < stream->outcnt; i++)
-		kfree(stream->out[i].ext);
+		kfree(SCTP_SO(stream, i)->ext);
 }
 
 static int sctp_stream_alloc_out(struct sctp_stream *stream, __u16 outcnt,
 				 gfp_t gfp)
 {
-	struct sctp_stream_out *out;
+	struct flex_array *out;
+	size_t elem_size = sizeof(struct sctp_stream_out);
 
-	out = kmalloc_array(outcnt, sizeof(*out), gfp);
+	out = kmalloc_array(outcnt, elem_size, gfp);
 	if (!out)
 		return -ENOMEM;
 
 	if (stream->out) {
 		memcpy(out, stream->out, min(outcnt, stream->outcnt) *
-					 sizeof(*out));
+					 elem_size);
 		kfree(stream->out);
 	}
 
 	if (outcnt > stream->outcnt)
-		memset(out + stream->outcnt, 0,
-		       (outcnt - stream->outcnt) * sizeof(*out));
+		memset(((struct sctp_stream_out *)out) + stream->outcnt, 0,
+		       (outcnt - stream->outcnt) * elem_size);
 
 	stream->out = out;
 
@@ -115,22 +128,23 @@ static int sctp_stream_alloc_out(struct sctp_stream *stream, __u16 outcnt,
 static int sctp_stream_alloc_in(struct sctp_stream *stream, __u16 incnt,
 				gfp_t gfp)
 {
-	struct sctp_stream_in *in;
+	struct flex_array *in;
+	size_t elem_size = sizeof(struct sctp_stream_in);
 
-	in = kmalloc_array(incnt, sizeof(*stream->in), gfp);
+	in = kmalloc_array(incnt, elem_size, gfp);
 
 	if (!in)
 		return -ENOMEM;
 
 	if (stream->in) {
 		memcpy(in, stream->in, min(incnt, stream->incnt) *
-				       sizeof(*in));
+				       elem_size);
 		kfree(stream->in);
 	}
 
 	if (incnt > stream->incnt)
-		memset(in + stream->incnt, 0,
-		       (incnt - stream->incnt) * sizeof(*in));
+		memset(((struct sctp_stream_in *)in) + stream->incnt, 0,
+		       (incnt - stream->incnt) * elem_size);
 
 	stream->in = in;
 
@@ -162,7 +176,7 @@ int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
 
 	stream->outcnt = outcnt;
 	for (i = 0; i < stream->outcnt; i++)
-		stream->out[i].state = SCTP_STREAM_OPEN;
+		SCTP_SO(stream, i)->state = SCTP_STREAM_OPEN;
 
 	sched->init(stream);
 
@@ -193,7 +207,7 @@ int sctp_stream_init_ext(struct sctp_stream *stream, __u16 sid)
 	soute = kzalloc(sizeof(*soute), GFP_KERNEL);
 	if (!soute)
 		return -ENOMEM;
-	stream->out[sid].ext = soute;
+	SCTP_SO(stream, sid)->ext = soute;
 
 	return sctp_sched_init_sid(stream, sid, GFP_KERNEL);
 }
@@ -205,7 +219,7 @@ void sctp_stream_free(struct sctp_stream *stream)
 
 	sched->free(stream);
 	for (i = 0; i < stream->outcnt; i++)
-		kfree(stream->out[i].ext);
+		kfree(SCTP_SO(stream, i)->ext);
 	kfree(stream->out);
 	kfree(stream->in);
 }
@@ -215,12 +229,12 @@ void sctp_stream_clear(struct sctp_stream *stream)
 	int i;
 
 	for (i = 0; i < stream->outcnt; i++) {
-		stream->out[i].mid = 0;
-		stream->out[i].mid_uo = 0;
+		SCTP_SO(stream, i)->mid = 0;
+		SCTP_SO(stream, i)->mid_uo = 0;
 	}
 
 	for (i = 0; i < stream->incnt; i++)
-		stream->in[i].mid = 0;
+		SCTP_SI(stream, i)->mid = 0;
 }
 
 void sctp_stream_update(struct sctp_stream *stream, struct sctp_stream *new)
@@ -271,8 +285,8 @@ static bool sctp_stream_outq_is_empty(struct sctp_stream *stream,
 	for (i = 0; i < str_nums; i++) {
 		__u16 sid = ntohs(str_list[i]);
 
-		if (stream->out[sid].ext &&
-		    !list_empty(&stream->out[sid].ext->outq))
+		if (SCTP_SO(stream, sid)->ext &&
+		    !list_empty(&SCTP_SO(stream, sid)->ext->outq))
 			return false;
 	}
 
@@ -359,11 +373,11 @@ int sctp_send_reset_streams(struct sctp_association *asoc,
 	if (out) {
 		if (str_nums)
 			for (i = 0; i < str_nums; i++)
-				stream->out[str_list[i]].state =
+				SCTP_SO(stream, str_list[i])->state =
 						       SCTP_STREAM_CLOSED;
 		else
 			for (i = 0; i < stream->outcnt; i++)
-				stream->out[i].state = SCTP_STREAM_CLOSED;
+				SCTP_SO(stream, i)->state = SCTP_STREAM_CLOSED;
 	}
 
 	asoc->strreset_chunk = chunk;
@@ -378,11 +392,11 @@ int sctp_send_reset_streams(struct sctp_association *asoc,
 
 		if (str_nums)
 			for (i = 0; i < str_nums; i++)
-				stream->out[str_list[i]].state =
+				SCTP_SO(stream, str_list[i])->state =
 						       SCTP_STREAM_OPEN;
 		else
 			for (i = 0; i < stream->outcnt; i++)
-				stream->out[i].state = SCTP_STREAM_OPEN;
+				SCTP_SO(stream, i)->state = SCTP_STREAM_OPEN;
 
 		goto out;
 	}
@@ -416,7 +430,7 @@ int sctp_send_reset_assoc(struct sctp_association *asoc)
 
 	/* Block further xmit of data until this request is completed */
 	for (i = 0; i < stream->outcnt; i++)
-		stream->out[i].state = SCTP_STREAM_CLOSED;
+		SCTP_SO(stream, i)->state = SCTP_STREAM_CLOSED;
 
 	asoc->strreset_chunk = chunk;
 	sctp_chunk_hold(asoc->strreset_chunk);
@@ -427,7 +441,7 @@ int sctp_send_reset_assoc(struct sctp_association *asoc)
 		asoc->strreset_chunk = NULL;
 
 		for (i = 0; i < stream->outcnt; i++)
-			stream->out[i].state = SCTP_STREAM_OPEN;
+			SCTP_SO(stream, i)->state = SCTP_STREAM_OPEN;
 
 		return retval;
 	}
@@ -607,10 +621,10 @@ struct sctp_chunk *sctp_process_strreset_outreq(
 		}
 
 		for (i = 0; i < nums; i++)
-			stream->in[ntohs(str_p[i])].mid = 0;
+			SCTP_SI(stream, ntohs(str_p[i]))->mid = 0;
 	} else {
 		for (i = 0; i < stream->incnt; i++)
-			stream->in[i].mid = 0;
+			SCTP_SI(stream, i)->mid = 0;
 	}
 
 	result = SCTP_STRRESET_PERFORMED;
@@ -681,11 +695,11 @@ struct sctp_chunk *sctp_process_strreset_inreq(
 
 	if (nums)
 		for (i = 0; i < nums; i++)
-			stream->out[ntohs(str_p[i])].state =
+			SCTP_SO(stream, ntohs(str_p[i]))->state =
 					       SCTP_STREAM_CLOSED;
 	else
 		for (i = 0; i < stream->outcnt; i++)
-			stream->out[i].state = SCTP_STREAM_CLOSED;
+			SCTP_SO(stream, i)->state = SCTP_STREAM_CLOSED;
 
 	asoc->strreset_chunk = chunk;
 	asoc->strreset_outstanding = 1;
@@ -784,11 +798,11 @@ struct sctp_chunk *sctp_process_strreset_tsnreq(
 	 *      incoming and outgoing streams.
 	 */
 	for (i = 0; i < stream->outcnt; i++) {
-		stream->out[i].mid = 0;
-		stream->out[i].mid_uo = 0;
+		SCTP_SO(stream, i)->mid = 0;
+		SCTP_SO(stream, i)->mid_uo = 0;
 	}
 	for (i = 0; i < stream->incnt; i++)
-		stream->in[i].mid = 0;
+		SCTP_SI(stream, i)->mid = 0;
 
 	result = SCTP_STRRESET_PERFORMED;
 
@@ -977,15 +991,18 @@ struct sctp_chunk *sctp_process_strreset_resp(
 		       sizeof(__u16);
 
 		if (result == SCTP_STRRESET_PERFORMED) {
+			struct sctp_stream_out *sout;
 			if (nums) {
 				for (i = 0; i < nums; i++) {
-					stream->out[ntohs(str_p[i])].mid = 0;
-					stream->out[ntohs(str_p[i])].mid_uo = 0;
+					sout = SCTP_SO(stream, ntohs(str_p[i]));
+					sout->mid = 0;
+					sout->mid_uo = 0;
 				}
 			} else {
 				for (i = 0; i < stream->outcnt; i++) {
-					stream->out[i].mid = 0;
-					stream->out[i].mid_uo = 0;
+					sout = SCTP_SO(stream, i);
+					sout->mid = 0;
+					sout->mid_uo = 0;
 				}
 			}
 
@@ -993,7 +1010,7 @@ struct sctp_chunk *sctp_process_strreset_resp(
 		}
 
 		for (i = 0; i < stream->outcnt; i++)
-			stream->out[i].state = SCTP_STREAM_OPEN;
+			SCTP_SO(stream, i)->state = SCTP_STREAM_OPEN;
 
 		*evp = sctp_ulpevent_make_stream_reset_event(asoc, flags,
 			nums, str_p, GFP_ATOMIC);
@@ -1048,15 +1065,15 @@ struct sctp_chunk *sctp_process_strreset_resp(
 			asoc->adv_peer_ack_point = asoc->ctsn_ack_point;
 
 			for (i = 0; i < stream->outcnt; i++) {
-				stream->out[i].mid = 0;
-				stream->out[i].mid_uo = 0;
+				SCTP_SO(stream, i)->mid = 0;
+				SCTP_SO(stream, i)->mid_uo = 0;
 			}
 			for (i = 0; i < stream->incnt; i++)
-				stream->in[i].mid = 0;
+				SCTP_SI(stream, i)->mid = 0;
 		}
 
 		for (i = 0; i < stream->outcnt; i++)
-			stream->out[i].state = SCTP_STREAM_OPEN;
+			SCTP_SO(stream, i)->state = SCTP_STREAM_OPEN;
 
 		*evp = sctp_ulpevent_make_assoc_reset_event(asoc, flags,
 			stsn, rtsn, GFP_ATOMIC);
@@ -1070,7 +1087,7 @@ struct sctp_chunk *sctp_process_strreset_resp(
 
 		if (result == SCTP_STRRESET_PERFORMED)
 			for (i = number; i < stream->outcnt; i++)
-				stream->out[i].state = SCTP_STREAM_OPEN;
+				SCTP_SO(stream, i)->state = SCTP_STREAM_OPEN;
 		else
 			stream->outcnt = number;
 
diff --git a/net/sctp/stream_interleave.c b/net/sctp/stream_interleave.c
index d3764c1..46f9fb6 100644
--- a/net/sctp/stream_interleave.c
+++ b/net/sctp/stream_interleave.c
@@ -1053,7 +1053,7 @@ static void sctp_intl_abort_pd(struct sctp_ulpq *ulpq, gfp_t gfp)
 	__u16 sid;
 
 	for (sid = 0; sid < stream->incnt; sid++) {
-		struct sctp_stream_in *sin = &stream->in[sid];
+		struct sctp_stream_in *sin = SCTP_SI(stream, sid);
 		__u32 mid;
 
 		if (sin->pd_mode_uo) {
diff --git a/net/sctp/stream_sched.c b/net/sctp/stream_sched.c
index f5fcd42..a6c04a9 100644
--- a/net/sctp/stream_sched.c
+++ b/net/sctp/stream_sched.c
@@ -161,7 +161,7 @@ int sctp_sched_set_sched(struct sctp_association *asoc,
 
 		/* Give the next scheduler a clean slate. */
 		for (i = 0; i < asoc->stream.outcnt; i++) {
-			void *p = asoc->stream.out[i].ext;
+			void *p = SCTP_SO(&asoc->stream, i)->ext;
 
 			if (!p)
 				continue;
@@ -175,7 +175,7 @@ int sctp_sched_set_sched(struct sctp_association *asoc,
 	asoc->outqueue.sched = n;
 	n->init(&asoc->stream);
 	for (i = 0; i < asoc->stream.outcnt; i++) {
-		if (!asoc->stream.out[i].ext)
+		if (!SCTP_SO(&asoc->stream, i)->ext)
 			continue;
 
 		ret = n->init_sid(&asoc->stream, i, GFP_KERNEL);
@@ -217,7 +217,7 @@ int sctp_sched_set_value(struct sctp_association *asoc, __u16 sid,
 	if (sid >= asoc->stream.outcnt)
 		return -EINVAL;
 
-	if (!asoc->stream.out[sid].ext) {
+	if (!SCTP_SO(&asoc->stream, sid)->ext) {
 		int ret;
 
 		ret = sctp_stream_init_ext(&asoc->stream, sid);
@@ -234,7 +234,7 @@ int sctp_sched_get_value(struct sctp_association *asoc, __u16 sid,
 	if (sid >= asoc->stream.outcnt)
 		return -EINVAL;
 
-	if (!asoc->stream.out[sid].ext)
+	if (!SCTP_SO(&asoc->stream, sid)->ext)
 		return 0;
 
 	return asoc->outqueue.sched->get(&asoc->stream, sid, value);
@@ -252,7 +252,7 @@ void sctp_sched_dequeue_done(struct sctp_outq *q, struct sctp_chunk *ch)
 		 * priority stream comes in.
 		 */
 		sid = sctp_chunk_stream_no(ch);
-		sout = &q->asoc->stream.out[sid];
+		sout = SCTP_SO(&q->asoc->stream, sid);
 		q->asoc->stream.out_curr = sout;
 		return;
 	}
@@ -272,8 +272,9 @@ void sctp_sched_dequeue_common(struct sctp_outq *q, struct sctp_chunk *ch)
 int sctp_sched_init_sid(struct sctp_stream *stream, __u16 sid, gfp_t gfp)
 {
 	struct sctp_sched_ops *sched = sctp_sched_ops_from_stream(stream);
+	struct sctp_stream_out_ext *ext = SCTP_SO(stream, sid)->ext;
 
-	INIT_LIST_HEAD(&stream->out[sid].ext->outq);
+	INIT_LIST_HEAD(&ext->outq);
 	return sched->init_sid(stream, sid, gfp);
 }
 
diff --git a/net/sctp/stream_sched_prio.c b/net/sctp/stream_sched_prio.c
index 7997d35..2245083 100644
--- a/net/sctp/stream_sched_prio.c
+++ b/net/sctp/stream_sched_prio.c
@@ -75,10 +75,10 @@ static struct sctp_stream_priorities *sctp_sched_prio_get_head(
 
 	/* No luck. So we search on all streams now. */
 	for (i = 0; i < stream->outcnt; i++) {
-		if (!stream->out[i].ext)
+		if (!SCTP_SO(stream, i)->ext)
 			continue;
 
-		p = stream->out[i].ext->prio_head;
+		p = SCTP_SO(stream, i)->ext->prio_head;
 		if (!p)
 			/* Means all other streams won't be initialized
 			 * as well.
@@ -165,7 +165,7 @@ static void sctp_sched_prio_sched(struct sctp_stream *stream,
 static int sctp_sched_prio_set(struct sctp_stream *stream, __u16 sid,
 			       __u16 prio, gfp_t gfp)
 {
-	struct sctp_stream_out *sout = &stream->out[sid];
+	struct sctp_stream_out *sout = SCTP_SO(stream, sid);
 	struct sctp_stream_out_ext *soute = sout->ext;
 	struct sctp_stream_priorities *prio_head, *old;
 	bool reschedule = false;
@@ -186,7 +186,7 @@ static int sctp_sched_prio_set(struct sctp_stream *stream, __u16 sid,
 		return 0;
 
 	for (i = 0; i < stream->outcnt; i++) {
-		soute = stream->out[i].ext;
+		soute = SCTP_SO(stream, i)->ext;
 		if (soute && soute->prio_head == old)
 			/* It's still in use, nothing else to do here. */
 			return 0;
@@ -201,7 +201,7 @@ static int sctp_sched_prio_set(struct sctp_stream *stream, __u16 sid,
 static int sctp_sched_prio_get(struct sctp_stream *stream, __u16 sid,
 			       __u16 *value)
 {
-	*value = stream->out[sid].ext->prio_head->prio;
+	*value = SCTP_SO(stream, sid)->ext->prio_head->prio;
 	return 0;
 }
 
@@ -215,7 +215,7 @@ static int sctp_sched_prio_init(struct sctp_stream *stream)
 static int sctp_sched_prio_init_sid(struct sctp_stream *stream, __u16 sid,
 				    gfp_t gfp)
 {
-	INIT_LIST_HEAD(&stream->out[sid].ext->prio_list);
+	INIT_LIST_HEAD(&SCTP_SO(stream, sid)->ext->prio_list);
 	return sctp_sched_prio_set(stream, sid, 0, gfp);
 }
 
@@ -233,9 +233,9 @@ static void sctp_sched_prio_free(struct sctp_stream *stream)
 	 */
 	sctp_sched_prio_unsched_all(stream);
 	for (i = 0; i < stream->outcnt; i++) {
-		if (!stream->out[i].ext)
+		if (!SCTP_SO(stream, i)->ext)
 			continue;
-		prio = stream->out[i].ext->prio_head;
+		prio = SCTP_SO(stream, i)->ext->prio_head;
 		if (prio && list_empty(&prio->prio_sched))
 			list_add(&prio->prio_sched, &list);
 	}
@@ -255,7 +255,7 @@ static void sctp_sched_prio_enqueue(struct sctp_outq *q,
 	ch = list_first_entry(&msg->chunks, struct sctp_chunk, frag_list);
 	sid = sctp_chunk_stream_no(ch);
 	stream = &q->asoc->stream;
-	sctp_sched_prio_sched(stream, stream->out[sid].ext);
+	sctp_sched_prio_sched(stream, SCTP_SO(stream, sid)->ext);
 }
 
 static struct sctp_chunk *sctp_sched_prio_dequeue(struct sctp_outq *q)
@@ -297,7 +297,7 @@ static void sctp_sched_prio_dequeue_done(struct sctp_outq *q,
 	 * this priority.
 	 */
 	sid = sctp_chunk_stream_no(ch);
-	soute = q->asoc->stream.out[sid].ext;
+	soute = SCTP_SO(&q->asoc->stream, sid)->ext;
 	prio = soute->prio_head;
 
 	sctp_sched_prio_next_stream(prio);
@@ -317,7 +317,7 @@ static void sctp_sched_prio_sched_all(struct sctp_stream *stream)
 		__u16 sid;
 
 		sid = sctp_chunk_stream_no(ch);
-		sout = &stream->out[sid];
+		sout = SCTP_SO(stream, sid);
 		if (sout->ext)
 			sctp_sched_prio_sched(stream, sout->ext);
 	}
diff --git a/net/sctp/stream_sched_rr.c b/net/sctp/stream_sched_rr.c
index 1155692..52ba743 100644
--- a/net/sctp/stream_sched_rr.c
+++ b/net/sctp/stream_sched_rr.c
@@ -100,7 +100,7 @@ static int sctp_sched_rr_init(struct sctp_stream *stream)
 static int sctp_sched_rr_init_sid(struct sctp_stream *stream, __u16 sid,
 				  gfp_t gfp)
 {
-	INIT_LIST_HEAD(&stream->out[sid].ext->rr_list);
+	INIT_LIST_HEAD(&SCTP_SO(stream, sid)->ext->rr_list);
 
 	return 0;
 }
@@ -120,7 +120,7 @@ static void sctp_sched_rr_enqueue(struct sctp_outq *q,
 	ch = list_first_entry(&msg->chunks, struct sctp_chunk, frag_list);
 	sid = sctp_chunk_stream_no(ch);
 	stream = &q->asoc->stream;
-	sctp_sched_rr_sched(stream, stream->out[sid].ext);
+	sctp_sched_rr_sched(stream, SCTP_SO(stream, sid)->ext);
 }
 
 static struct sctp_chunk *sctp_sched_rr_dequeue(struct sctp_outq *q)
@@ -154,7 +154,7 @@ static void sctp_sched_rr_dequeue_done(struct sctp_outq *q,
 
 	/* Last chunk on that msg, move to the next stream */
 	sid = sctp_chunk_stream_no(ch);
-	soute = q->asoc->stream.out[sid].ext;
+	soute = SCTP_SO(&q->asoc->stream, sid)->ext;
 
 	sctp_sched_rr_next_stream(&q->asoc->stream);
 
@@ -173,7 +173,7 @@ static void sctp_sched_rr_sched_all(struct sctp_stream *stream)
 		__u16 sid;
 
 		sid = sctp_chunk_stream_no(ch);
-		soute = stream->out[sid].ext;
+		soute = SCTP_SO(stream, sid)->ext;
 		if (soute)
 			sctp_sched_rr_sched(stream, soute);
 	}
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next 0/2] net/sctp: Avoid allocating high order memory with kmalloc()
From: Oleg Babin @ 2018-04-23 18:41 UTC (permalink / raw)
  To: netdev, linux-sctp
  Cc: David S. Miller, Vlad Yasevich, Neil Horman, Xin Long,
	Marcelo Ricardo Leitner, Andrey Ryabinin

Each SCTP association can have up to 65535 input and output streams.
For each stream type an array of sctp_stream_in or sctp_stream_out
structures is allocated using kmalloc_array() function. This function
allocates physically contiguous memory regions, so this can lead
to allocation of memory regions of very high order, i.e.:

  sizeof(struct sctp_stream_out) == 24,
  ((65535 * 24) / 4096) == 383 memory pages (4096 byte per page),
  which means 9th memory order.

This can lead to a memory allocation failures on the systems
under a memory stress.

We actually do not need these arrays of memory to be physically
contiguous. Possible simple solution would be to use kvmalloc()
instread of kmalloc() as kvmalloc() can allocate physically scattered
pages if contiguous pages are not available. But the problem
is that the allocation can happed in a softirq context with
GFP_ATOMIC flag set, and kvmalloc() cannot be used in this scenario.

So the other possible solution is to use flexible arrays instead of
contiguios arrays of memory so that the memory would be allocated
on a per-page basis.

This patchset replaces kvmalloc() with flex_array usage.
It consists of two parts:

  * First patch is preparatory - it mechanically wraps all direct
    access to assoc->stream.out[] and assoc->stream.in[] arrays
    with SCTP_SO() and SCTP_SI() wrappers so that later a direct
    array access could be easily changed to an access to a
    flex_array (or any other possible alternative).
  * Second patch replaces kmalloc_array() with flex_array usage.

Oleg Babin (2):
  net/sctp: Make wrappers for accessing in/out streams
  net/sctp: Replace in/out stream arrays with flex_array

 include/net/sctp/structs.h   |  31 +++++---
 net/sctp/chunk.c             |   6 +-
 net/sctp/outqueue.c          |  11 +--
 net/sctp/socket.c            |   4 +-
 net/sctp/stream.c            | 165 +++++++++++++++++++++++++++++--------------
 net/sctp/stream_interleave.c |   2 +-
 net/sctp/stream_sched.c      |  13 ++--
 net/sctp/stream_sched_prio.c |  22 +++---
 net/sctp/stream_sched_rr.c   |   8 +--
 9 files changed, 167 insertions(+), 95 deletions(-)

-- 
1.8.3.1

^ permalink raw reply

* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
From: Jesus Sanchez-Palencia @ 2018-04-23 18:21 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	intel-wired-lan, anna-maria, henrik, john.stultz, levi.pearson,
	edumazet, willemb, mlichvar
In-Reply-To: <alpine.DEB.2.21.1803211407520.3754@nanos.tec.linutronix.de>

Hi Thomas,


On 03/21/2018 06:46 AM, Thomas Gleixner wrote:
> On Tue, 6 Mar 2018, Jesus Sanchez-Palencia wrote:
>> +struct tbs_sched_data {
>> +	bool sorting;
>> +	int clockid;
>> +	int queue;
>> +	s32 delta; /* in ns */
>> +	ktime_t last; /* The txtime of the last skb sent to the netdevice. */
>> +	struct rb_root head;
> 
> Hmm. You are reimplementing timerqueue open coded. Have you checked whether
> you could reuse the timerqueue implementation?
> 
> That requires to add a timerqueue node to struct skbuff
> 
> @@ -671,7 +671,8 @@ struct sk_buff {
>  				unsigned long		dev_scratch;
>  			};
>  		};
> -		struct rb_node	rbnode; /* used in netem & tcp stack */
> +		struct rb_node		rbnode; /* used in netem & tcp stack */
> +		struct timerqueue_node	tqnode;
>  	};
>  	struct sock		*sk;
> 
> Then you can use timerqueue_head in your scheduler data and all the open
> coded rbtree handling goes away.


I just noticed that doing the above increases the size of struct sk_buff by 8
bytes - struct timerqueue_node is 32bytes long while struct rb_node is only
24bytes long.

Given the feedback we got here before against touching struct sk_buff at all for
non-generic use cases, I will keep the implementation of sch_tbs.c as is, thus
keeping the open-coded version for now, ok?

Thanks,
Jesus


(...)

^ permalink raw reply

* [PATCH net-next 2/2] net/ipv6: Fix missing rcu dereferences on from
From: David Ahern @ 2018-04-23 18:32 UTC (permalink / raw)
  To: netdev; +Cc: David Ahern
In-Reply-To: <20180423183207.9124-1-dsahern@gmail.com>

kbuild test robot reported 2 uses of rt->from not properly accessed
using rcu_dereference:
1. add rcu_dereference_protected to rt6_remove_exception_rt and make
   sure it is always called with rcu lock held.

2. change rt6_do_redirect to take a reference on 'from' when accessed
   the first time so it can be used the sceond time outside of the lock

Fixes: a68886a69180 ("net/ipv6: Make from in rt6_info rcu protected")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
---
 net/ipv6/route.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 354a5b8d016f..ac3e51631c65 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1541,11 +1541,13 @@ static struct rt6_info *rt6_find_cached_rt(struct fib6_info *rt,
 static int rt6_remove_exception_rt(struct rt6_info *rt)
 {
 	struct rt6_exception_bucket *bucket;
-	struct fib6_info *from = rt->from;
 	struct in6_addr *src_key = NULL;
 	struct rt6_exception *rt6_ex;
+	struct fib6_info *from;
 	int err;
 
+	from = rcu_dereference_protected(rt->from,
+					 lockdep_is_held(&rt6_exception_lock));
 	if (!from ||
 	    !(rt->rt6i_flags & RTF_CACHE))
 		return -EINVAL;
@@ -2223,6 +2225,7 @@ static void ip6_link_failure(struct sk_buff *skb)
 
 	rt = (struct rt6_info *) skb_dst(skb);
 	if (rt) {
+		rcu_read_lock();
 		if (rt->rt6i_flags & RTF_CACHE) {
 			if (dst_hold_safe(&rt->dst))
 				rt6_remove_exception_rt(rt);
@@ -2230,15 +2233,14 @@ static void ip6_link_failure(struct sk_buff *skb)
 			struct fib6_info *from;
 			struct fib6_node *fn;
 
-			rcu_read_lock();
 			from = rcu_dereference(rt->from);
 			if (from) {
 				fn = rcu_dereference(from->fib6_node);
 				if (fn && (rt->rt6i_flags & RTF_DEFAULT))
 					fn->fn_sernum = -1;
 			}
-			rcu_read_unlock();
 		}
+		rcu_read_unlock();
 	}
 }
 
@@ -3340,8 +3342,10 @@ static void rt6_do_redirect(struct dst_entry *dst, struct sock *sk, struct sk_bu
 
 	rcu_read_lock();
 	from = rcu_dereference(rt->from);
-	nrt = ip6_rt_cache_alloc(from, &msg->dest, NULL);
+	fib6_info_hold(from);
 	rcu_read_unlock();
+
+	nrt = ip6_rt_cache_alloc(from, &msg->dest, NULL);
 	if (!nrt)
 		goto out;
 
@@ -3355,7 +3359,7 @@ static void rt6_do_redirect(struct dst_entry *dst, struct sock *sk, struct sk_bu
 	 * a cached route because rt6_insert_exception() will
 	 * takes care of it
 	 */
-	if (rt6_insert_exception(nrt, rt->from)) {
+	if (rt6_insert_exception(nrt, from)) {
 		dst_release_immediate(&nrt->dst);
 		goto out;
 	}
@@ -3367,6 +3371,7 @@ static void rt6_do_redirect(struct dst_entry *dst, struct sock *sk, struct sk_bu
 	call_netevent_notifiers(NETEVENT_REDIRECT, &netevent);
 
 out:
+	fib6_info_release(from);
 	neigh_release(neigh);
 }
 
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next 1/2] net/ipv6: add rcu locking to ip6_negative_advice
From: David Ahern @ 2018-04-23 18:32 UTC (permalink / raw)
  To: netdev; +Cc: David Ahern
In-Reply-To: <20180423183207.9124-1-dsahern@gmail.com>

syzbot reported a suspicious rcu_dereference_check:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
  lockdep_rcu_suspicious+0x14a/0x153 kernel/locking/lockdep.c:4592
  rt6_check_expired+0x38b/0x3e0 net/ipv6/route.c:410
  ip6_negative_advice+0x67/0xc0 net/ipv6/route.c:2204
  dst_negative_advice include/net/sock.h:1786 [inline]
  sock_setsockopt+0x138f/0x1fe0 net/core/sock.c:1051
  __sys_setsockopt+0x2df/0x390 net/socket.c:1899
  SYSC_setsockopt net/socket.c:1914 [inline]
  SyS_setsockopt+0x34/0x50 net/socket.c:1911

Add rcu locking around call to rt6_check_expired in
ip6_negative_advice.

Fixes: a68886a69180 ("net/ipv6: Make from in rt6_info rcu protected")
Reported-by: syzbot+2422c9e35796659d2273@syzkaller.appspotmail.com
Signed-off-by: David Ahern <dsahern@gmail.com>
---
 net/ipv6/route.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 0407bbc5a028..354a5b8d016f 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2201,10 +2201,12 @@ static struct dst_entry *ip6_negative_advice(struct dst_entry *dst)
 
 	if (rt) {
 		if (rt->rt6i_flags & RTF_CACHE) {
+			rcu_read_lock();
 			if (rt6_check_expired(rt)) {
 				rt6_remove_exception_rt(rt);
 				dst = NULL;
 			}
+			rcu_read_unlock();
 		} else {
 			dst_release(dst);
 			dst = NULL;
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next 0/2] net/ipv6: couple of fixes for rcu change to from
From: David Ahern @ 2018-04-23 18:32 UTC (permalink / raw)
  To: netdev; +Cc: David Ahern

So many details... I am thankful for all the robots running the
permutations and tools.

Two bug fixes from the rcu change to rt->from:
1. missing rcu lock in ip6_negative_advice
2. rcu dereferences in 2 sites

David Ahern (2):
  net/ipv6: add rcu locking to ip6_negative_advice
  net/ipv6: Fix missing rcu dereferences on from

 net/ipv6/route.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

-- 
2.11.0

^ permalink raw reply

* [bpf PATCH v2 3/3] bpf: sockmap, fix double page_put on ENOMEM error in redirect path
From: John Fastabend @ 2018-04-23 18:29 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev
In-Reply-To: <20180423182726.17999.269.stgit@john-Precision-Tower-5810>

In the case where the socket memory boundary is hit the redirect
path returns an ENOMEM error. However, before checking for this
condition the redirect scatterlist buffer is setup with a valid
page and length. This is never unwound so when the buffers are
released latter in the error path we do a put_page() and clear
the scatterlist fields. But, because the initial error happens
before completing the scatterlist buffer we end up with both the
original buffer and the redirect buffer pointing to the same page
resulting in duplicate put_page() calls.

To fix this simply move the initial configuration of the redirect
scatterlist buffer below the sock memory check.

Found this while running TCP_STREAM test with netperf using Cilium.

Fixes: fa246693a111 ("bpf: sockmap, BPF_F_INGRESS flag for BPF_SK_SKB_STREAM_VERDICT")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 kernel/bpf/sockmap.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index a7e0536..2389022 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -524,8 +524,6 @@ static int bpf_tcp_ingress(struct sock *sk, int apply_bytes,
 	i = md->sg_start;
 
 	do {
-		r->sg_data[i] = md->sg_data[i];
-
 		size = (apply && apply_bytes < md->sg_data[i].length) ?
 			apply_bytes : md->sg_data[i].length;
 
@@ -536,6 +534,7 @@ static int bpf_tcp_ingress(struct sock *sk, int apply_bytes,
 		}
 
 		sk_mem_charge(sk, size);
+		r->sg_data[i] = md->sg_data[i];
 		r->sg_data[i].length = size;
 		md->sg_data[i].length -= size;
 		md->sg_data[i].offset += size;

^ permalink raw reply related

* [bpf PATCH v2 2/3] bpf: sockmap, sk_wait_event needed to handle blocking cases
From: John Fastabend @ 2018-04-23 18:29 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev
In-Reply-To: <20180423182726.17999.269.stgit@john-Precision-Tower-5810>

In the recvmsg handler we need to add a wait event to support the
blocking use cases. Without this we return zero and may confuse
user applications. In the wait event any data received on the
sk either via sk_receive_queue or the psock ingress list will
wake up the sock.

Fixes: fa246693a111 ("bpf: sockmap, BPF_F_INGRESS flag for BPF_SK_SKB_STREAM_VERDICT")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 kernel/bpf/sockmap.c |   44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 347e51d..a7e0536 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -43,6 +43,7 @@
 #include <net/tcp.h>
 #include <linux/ptr_ring.h>
 #include <net/inet_common.h>
+#include <linux/sched/signal.h>
 
 #define SOCK_CREATE_FLAG_MASK \
 	(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
@@ -732,6 +733,26 @@ static int bpf_exec_tx_verdict(struct smap_psock *psock,
 	return err;
 }
 
+static int bpf_wait_data(struct sock *sk,
+			 struct smap_psock *psk, int flags,
+			 long timeo, int *err)
+{
+	int rc;
+
+	DEFINE_WAIT_FUNC(wait, woken_wake_function);
+
+	add_wait_queue(sk_sleep(sk), &wait);
+	sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
+	rc = sk_wait_event(sk, &timeo,
+			   !list_empty(&psk->ingress) ||
+			   !skb_queue_empty(&sk->sk_receive_queue),
+			   &wait);
+	sk_clear_bit(SOCKWQ_ASYNC_WAITDATA, sk);
+	remove_wait_queue(sk_sleep(sk), &wait);
+
+	return rc;
+}
+
 static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 			   int nonblock, int flags, int *addr_len)
 {
@@ -755,6 +776,7 @@ static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		return tcp_recvmsg(sk, msg, len, nonblock, flags, addr_len);
 
 	lock_sock(sk);
+bytes_ready:
 	while (copied != len) {
 		struct scatterlist *sg;
 		struct sk_msg_buff *md;
@@ -809,6 +831,28 @@ static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		}
 	}
 
+	if (!copied) {
+		long timeo;
+		int data;
+		int err = 0;
+
+		timeo = sock_rcvtimeo(sk, nonblock);
+		data = bpf_wait_data(sk, psock, flags, timeo, &err);
+
+		if (data) {
+			if (!skb_queue_empty(&sk->sk_receive_queue)) {
+				release_sock(sk);
+				smap_release_sock(psock, sk);
+				copied = tcp_recvmsg(sk, msg, len, nonblock, flags, addr_len);
+				return copied;
+			}
+			goto bytes_ready;
+		}
+
+		if (err)
+			copied = err;
+	}
+
 	release_sock(sk);
 	smap_release_sock(psock, sk);
 	return copied;

^ permalink raw reply related

* [bpf PATCH v2 1/3] bpf: sockmap, map_release does not hold refcnt for pinned maps
From: John Fastabend @ 2018-04-23 18:29 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev
In-Reply-To: <20180423182726.17999.269.stgit@john-Precision-Tower-5810>

Relying on map_release hook to decrement the reference counts when a
map is removed only works if the map is not being pinned. In the
pinned case the ref is decremented immediately and the BPF programs
released. After this BPF programs may not be in-use which is not
what the user would expect.

This patch moves the release logic into bpf_map_put_uref() and brings
sockmap in-line with how a similar case is handled in prog array maps.

Fixes: 3d9e952697de ("bpf: sockmap, fix leaking maps with attached but not detached progs")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 include/linux/bpf.h   |    2 +-
 kernel/bpf/arraymap.c |    3 ++-
 kernel/bpf/sockmap.c  |    4 ++--
 kernel/bpf/syscall.c  |    4 ++--
 4 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 486e65e..8401e78 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -31,6 +31,7 @@ struct bpf_map_ops {
 	void (*map_release)(struct bpf_map *map, struct file *map_file);
 	void (*map_free)(struct bpf_map *map);
 	int (*map_get_next_key)(struct bpf_map *map, void *key, void *next_key);
+	void (*map_put_uref)(struct bpf_map *map);
 
 	/* funcs callable from userspace and from eBPF programs */
 	void *(*map_lookup_elem)(struct bpf_map *map, void *key);
@@ -434,7 +435,6 @@ int bpf_percpu_array_update(struct bpf_map *map, void *key, void *value,
 int bpf_fd_array_map_update_elem(struct bpf_map *map, struct file *map_file,
 				 void *key, void *value, u64 map_flags);
 int bpf_fd_array_map_lookup_elem(struct bpf_map *map, void *key, u32 *value);
-void bpf_fd_array_map_clear(struct bpf_map *map);
 int bpf_fd_htab_map_update_elem(struct bpf_map *map, struct file *map_file,
 				void *key, void *value, u64 map_flags);
 int bpf_fd_htab_map_lookup_elem(struct bpf_map *map, void *key, u32 *value);
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 14750e7..3715572 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -476,7 +476,7 @@ static u32 prog_fd_array_sys_lookup_elem(void *ptr)
 }
 
 /* decrement refcnt of all bpf_progs that are stored in this map */
-void bpf_fd_array_map_clear(struct bpf_map *map)
+static void bpf_fd_array_map_clear(struct bpf_map *map)
 {
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
 	int i;
@@ -495,6 +495,7 @@ void bpf_fd_array_map_clear(struct bpf_map *map)
 	.map_fd_get_ptr = prog_fd_array_get_ptr,
 	.map_fd_put_ptr = prog_fd_array_put_ptr,
 	.map_fd_sys_lookup_elem = prog_fd_array_sys_lookup_elem,
+	.map_put_uref = bpf_fd_array_map_clear,
 };
 
 static struct bpf_event_entry *bpf_event_entry_gen(struct file *perf_file,
diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index a3b2138..347e51d 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -1831,7 +1831,7 @@ static int sock_map_update_elem(struct bpf_map *map,
 	return err;
 }
 
-static void sock_map_release(struct bpf_map *map, struct file *map_file)
+static void sock_map_release(struct bpf_map *map)
 {
 	struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
 	struct bpf_prog *orig;
@@ -1855,7 +1855,7 @@ static void sock_map_release(struct bpf_map *map, struct file *map_file)
 	.map_get_next_key = sock_map_get_next_key,
 	.map_update_elem = sock_map_update_elem,
 	.map_delete_elem = sock_map_delete_elem,
-	.map_release = sock_map_release,
+	.map_put_uref = sock_map_release,
 };
 
 BPF_CALL_4(bpf_sock_map_update, struct bpf_sock_ops_kern *, bpf_sock,
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 4ca46df..4b70439 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -257,8 +257,8 @@ static void bpf_map_free_deferred(struct work_struct *work)
 static void bpf_map_put_uref(struct bpf_map *map)
 {
 	if (atomic_dec_and_test(&map->usercnt)) {
-		if (map->map_type == BPF_MAP_TYPE_PROG_ARRAY)
-			bpf_fd_array_map_clear(map);
+		if (map->ops->map_put_uref)
+			map->ops->map_put_uref(map);
 	}
 }
 

^ permalink raw reply related

* [bpf PATCH v2 0/3] BPF, a couple sockmap fixes
From: John Fastabend @ 2018-04-23 18:29 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev

While testing sockmap with more programs (besides our test programs)
I found a couple issues.

The attached series fixes an issue where pinned maps were not
working correctly, blocking sockets returned zero, and an error
path that when the sock hit an out of memory case resulted in a
double page_put() while doing ingress redirects.

See individual patches for more details.

v2: Incorporated Daniel's feedback to use map ops for uref put op
    which also fixed the build error discovered in v1.

---

John Fastabend (3):
      bpf: sockmap, map_release does not hold refcnt for pinned maps
      bpf: sockmap, sk_wait_event needed to handle blocking cases
      bpf: sockmap, fix double page_put on ENOMEM error in redirect path


 include/linux/bpf.h   |    2 +-
 kernel/bpf/arraymap.c |    3 ++-
 kernel/bpf/sockmap.c  |   51 +++++++++++++++++++++++++++++++++++++++++++++----
 kernel/bpf/syscall.c  |    4 ++--
 4 files changed, 52 insertions(+), 8 deletions(-)

^ permalink raw reply

* Re: [PATCH] Bluetooth: use wait_event API instead of open-coding it
From: Marcel Holtmann @ 2018-04-23 17:58 UTC (permalink / raw)
  To: John Keeping; +Cc: BlueZ development, ML netdev, linux-kernel, Johan Hedberg
In-Reply-To: <20180419152937.13515-1-john@metanate.com>

Hi John,

> I've seen timeout errors from HCI commands where it looks like
> schedule_timeout() has returned immediately; additional logging for the
> error case gives:
> 
> 	req_status=1 req_result=0 remaining=10000 jiffies
> 
> so the device is still in state HCI_REQ_PEND and the value returned by
> schedule_timeout() is the same as the original timeout (HCI_INIT_TIMEOUT
> on a system with HZ=1000).
> 
> Use wait_event_interruptible_timeout() instead of open-coding similar
> behaviour which is subject to the spurious failure described above.
> 
> Signed-off-by: John Keeping <john@metanate.com>
> ---
> I saw problems with the -rt patchset on 4.9 and I'm not convinced that
> it's Bluetooth at fault for the problem described above, but I think
> this is a nice cleanup even if it's not a bug fix.
> 
> net/bluetooth/hci_request.c | 30 +++++++-----------------------
> 1 file changed, 7 insertions(+), 23 deletions(-)

patch has been applied to bluetooth-next tree.

Regards

Marcel

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox