Netdev List

* Re: [RFC V3 PATCH 0/8] Packed ring for vhost
From: Konrad Rzeszutek Wilk @ 2018-04-23 20:11 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <20180423224715-mutt-send-email-mst@kernel.org>

On Mon, Apr 23, 2018 at 10:59:43PM +0300, Michael S. Tsirkin wrote:
> On Mon, Apr 23, 2018 at 03:31:20PM -0400, Konrad Rzeszutek Wilk wrote:
> > On Mon, Apr 23, 2018 at 01:34:52PM +0800, Jason Wang wrote:
> > > Hi all:
> > > 
> > > This RFC implements packed ring layout. The code was tested with
> > > Tiwei's RFC V2 at https://lkml.org/lkml/2018/4/1/48. Some fixups and
> > > tweaks were needed on top of Tiwei's code to make it run. TCP stream
> > > and pktgen do not show obvious difference compared with split ring.
> > 
> > I have to ask then - what is the benefit of this?
> 
> You can use this with dpdk within guest.
> The DPDK version did see a performance improvement so hopefully with

Is there a link to this performance improvement document?

> future versions this will too.
> 
> > > 
> > > Changes from V2:
> > > - do not use & in checking desc_event_flags
> > > - off should be most significant bit
> > > - remove the workaround of mergeable buffer for dpdk prototype
> > > - id should be in the last descriptor in the chain
> > > - keep _F_WRITE for write descriptor when adding used
> > > - device flags updating should use ADDR_USED type
> > > - return error on unexpected unavail descriptor in a chain
> > > - return false in vhost_ve_avail_empty if descriptor is available
> > > - track last seen avail_wrap_counter
> > > - correctly examine available descriptor in get_indirect_packed()
> > > - vhost_idx_diff should return u16 instead of bool
> > > 
> > > Changes from V1:
> > > 
> > > - Refactor vhost used elem code to avoid open coding on used elem
> > > - Event suppression support (compile test only).
> > > - Indirect descriptor support (compile test only).
> > > - Zerocopy support.
> > > - vIOMMU support.
> > > - SCSI/VSOCK support (compile test only).
> > > - Fix several bugs
> > > 
> > > For simplicity, I don't implement batching or other optimizations.
> > > 
> > > Please review.
> > > 
> > > Jason Wang (8):
> > >   vhost: move get_rx_bufs to vhost.c
> > >   vhost: hide used ring layout from device
> > >   vhost: do not use vring_used_elem
> > >   vhost_net: do not explicitly manipulate vhost_used_elem
> > >   vhost: vhost_put_user() can accept metadata type
> > >   virtio: introduce packed ring defines
> > >   vhost: packed ring support
> > >   vhost: event suppression for packed ring
> > > 
> > >  drivers/vhost/net.c                | 136 ++----
> > >  drivers/vhost/scsi.c               |  62 +--
> > >  drivers/vhost/vhost.c              | 824 ++++++++++++++++++++++++++++++++++---
> > >  drivers/vhost/vhost.h              |  47 ++-
> > >  drivers/vhost/vsock.c              |  42 +-
> > >  include/uapi/linux/virtio_config.h |   9 +
> > >  include/uapi/linux/virtio_ring.h   |  32 ++
> > >  7 files changed, 926 insertions(+), 226 deletions(-)
> > > 
> > > -- 
> > > 2.7.4
> > > 

* Re: [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt
From: Michael S. Tsirkin @ 2018-04-23 20:11 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Willem de Bruijn, Daniel Borkmann, Netdev, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali,
	Zhang, Qi Z
In-Reply-To: <CAJ+HfNjwMFig+aJbacFK--5_1i8F2DLSyAUOvU12Xc-OvJBAzQ@mail.gmail.com>

On Mon, Apr 23, 2018 at 10:00:15PM +0200, Björn Töpel wrote:
> 2018-04-23 18:18 GMT+02:00 Michael S. Tsirkin <mst@redhat.com>:
> 
> [...]
> 
> >> +static void xdp_umem_unpin_pages(struct xdp_umem *umem)
> >> +{
> >> +     unsigned int i;
> >> +
> >> +     if (umem->pgs) {
> >> +             for (i = 0; i < umem->npgs; i++)
> >
> > Since you pin them with FOLL_WRITE, I assume these pages
> > are written to.
> > Don't you need set_page_dirty_lock here?
> >
> 
> Hmm, I actually *removed* it from the RFC V2, but after doing some
> homework, I think you're right. Thanks for pointing this out!
> 
> Thinking more about this; This function is called from sk_destruct,
> and in the Tx case the sk_destruct can be called from interrupt
> context, where set_page_dirty_lock cannot be called.
> 
> Are there any preferred ways of solving this? Scheduling the whole
> xsk_destruct call to a workqueue is one way (I think). Any
> cleaner/better way?
> 
> [...]

Defer unpinning pages until the next tx call?


> >> +static int __xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
> >> +{
> >> +     u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
> >> +     u64 addr = mr->addr, size = mr->len;
> >> +     unsigned int nframes;
> >> +     int size_chk, err;
> >> +
> >> +     if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
> >> +             /* Strictly speaking we could support this, if:
> >> +              * - huge pages, or*
> >
> > what does "or*" here mean?
> >
> 
> Oops, I'll change to just 'or' in the next revision.
> 
> 
> Thanks!
> Björn

* Re: [PATCH 0/3] bpf: Store/dump license string for loaded program
From: Alexei Starovoitov @ 2018-04-23 20:11 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Alexei Starovoitov, Daniel Borkmann, lkml, netdev, Quentin Monnet
In-Reply-To: <20180423065927.23127-1-jolsa@kernel.org>

On Mon, Apr 23, 2018 at 08:59:24AM +0200, Jiri Olsa wrote:
> hi,
> sending the change to store and dump the license
> info for loaded BPF programs. It's important for
> us to get the license info when investigating a
> screwed up machine.

hmm. boolean flag whether bpf prog is gpl or not
is already exposed via bpf_prog_info.
I see no point in wasting an extra 128 bytes of kernel memory.

* Re: [nf-next] netfilter: extend SRH match to support matching previous, next and last SID
From: Florian Westphal @ 2018-04-23 20:08 UTC (permalink / raw)
  To: Ahmed Abdelsalam
  Cc: Pablo Neira Ayuso, fw, davem, dav.lebrun, linux-kernel,
	netfilter-devel, coreteam, netdev
In-Reply-To: <20180423220148.03800031d0cb8e8a7a83dc31@gmail.com>

Ahmed Abdelsalam <amsalam20@gmail.com> wrote:
> > > @@ -50,6 +62,12 @@ struct ip6t_srh {
> > >  	__u8                    segs_left;
> > >  	__u8                    last_entry;
> > >  	__u16                   tag;
> > > +	struct in6_addr		psid_addr;
> > > +	struct in6_addr		nsid_addr;
> > > +	struct in6_addr		lsid_addr;
> > > +	struct in6_addr		psid_msk;
> > > +	struct in6_addr		nsid_msk;
> > > +	struct in6_addr		lsid_msk;
> > 
> > This is changing something exposed through UAPI, so you will need a
> > new revision for this.
> 
> Could you please advise what should be done in this case?

You need to add
struct ip6t_srh_v1 {
	/* copy of struct ip6t_srh here */

	/* new fields go here */
};


Look at xt_conntrack.c, conntrack_mt_reg[] for an example of
multi-revision match.

You can probably reorganise code to avoid too much duplication.
See 5a786232eb69a1f870ddc0cfd69d5bdef241a2ea in nf.git for an example,
it makes v0 into a v1 struct at runtime and re-uses new v1 code
for old v0.
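
A minimal userspace sketch of that last idea, under loud assumptions: the
struct layouts are simplified, and `addr128`, `srh_v0`, `srh_v1`,
`srh_v0_to_v1()`, `srh_match_v1()` and `srh_match_v0()` are hypothetical
names for illustration, not the real ip6t_srh definitions and not the code
from the commit above. Only the 0x0400 tag flag and the 0x07FF v0 mask are
taken from the patch under review. The old layout is widened into the new
one at runtime with the new fields zeroed, so a single match path serves
both revisions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct in6_addr (16 bytes). */
struct addr128 {
	uint8_t b[16];
};

/* Hypothetical v0 layout: what existing userspace already passes in. */
struct srh_v0 {
	uint8_t  segs_left;
	uint8_t  last_entry;
	uint16_t tag;
	uint16_t mt_flags;
	uint16_t mt_invflags;
};

/* Hypothetical v1 layout: a copy of v0 followed by the new fields. */
struct srh_v1 {
	uint8_t  segs_left;
	uint8_t  last_entry;
	uint16_t tag;
	uint16_t mt_flags;
	uint16_t mt_invflags;
	struct addr128 psid_addr;
	struct addr128 psid_msk;
};

#define SRH_TAG          0x0400 /* same bit as IP6T_SRH_TAG in the patch */
#define SRH_V0_FLAG_MASK 0x07FF /* flag bits that the v0 layout knows */

/* Widen a v0 rule into v1; new fields stay zero and new flags stay clear. */
static void srh_v0_to_v1(const struct srh_v0 *v0, struct srh_v1 *v1)
{
	assert(v0 && v1);
	memset(v1, 0, sizeof(*v1));
	v1->segs_left   = v0->segs_left;
	v1->last_entry  = v0->last_entry;
	v1->tag         = v0->tag;
	v1->mt_flags    = v0->mt_flags & SRH_V0_FLAG_MASK;
	v1->mt_invflags = v0->mt_invflags & SRH_V0_FLAG_MASK;
}

/* Single match path shared by both revisions (only the tag check shown). */
static int srh_match_v1(const struct srh_v1 *info, uint16_t pkt_tag)
{
	if (info->mt_flags & SRH_TAG) {
		int hit = (pkt_tag == info->tag);

		if (info->mt_invflags & SRH_TAG)
			hit = !hit;
		if (!hit)
			return 0;
	}
	return 1;
}

/* The v0 entry point converts, then re-uses the v1 code. */
static int srh_match_v0(const struct srh_v0 *info, uint16_t pkt_tag)
{
	struct srh_v1 wide;

	srh_v0_to_v1(info, &wide);
	return srh_match_v1(&wide, pkt_tag);
}
```

In the kernel the two entry points would presumably be registered as
separate revisions of the same match, the way conntrack_mt_reg[] does it;
the sketch only shows the conversion step.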

* Re: [bpf-next PATCH 3/3] bpf: add sample program to trace map events
From: Alexei Starovoitov @ 2018-04-23 20:08 UTC (permalink / raw)
  To: Sebastiano Miano
  Cc: Jesper Dangaard Brouer, netdev, ast, daniel, mingo, rostedt,
	fulvio.risso, David S. Miller
In-Reply-To: <062780b1-6a24-a7f3-175c-db3b02605850@polito.it>

On Mon, Apr 23, 2018 at 04:08:36PM +0200, Sebastiano Miano wrote:
> 
> That's in fact the real use case for the first two patches. Since bpf
> tracepoints are still a rather common (and easy to use) troubleshooting and
> monitoring tool why shouldn't we "enhance" their support with the newly
> added map/prog IDs?

because these tracepoints can be abused in the way that this patch demonstrated.
Whether to keep this patch in the series or not is irrelevant.

* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Michael S. Tsirkin @ 2018-04-23 20:06 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Stephen Hemminger, Jiri Pirko, Sridhar Samudrala, David Miller,
	Netdev, virtualization, virtio-dev, Brandeburg, Jesse,
	Alexander Duyck, Jakub Kicinski, Jason Wang
In-Reply-To: <CADGSJ20ge75T+ddxtUBT4d9m1i3=HLOAHJEoS7Cg0bqnXrutwA@mail.gmail.com>

On Mon, Apr 23, 2018 at 12:44:39PM -0700, Siwei Liu wrote:
> On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:
> >> On Mon, 23 Apr 2018 20:24:56 +0300
> >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >>
> >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:
> >> > > > >
> >> > > > >I will NAK patches to change to common code for netvsc especially the
> >> > > > >three device model.  MS worked hard with distro vendors to support transparent
> >> > > > >mode, and we really can't have a new model; or do backport.
> >> > > > >
> >> > > > >Plus, DPDK is now dependent on existing model.
> >> > > >
> >> > > > Sorry, but nobody here cares about dpdk or other similar oddities.
> >> > >
> >> > > The network device model is a userspace API, and DPDK is a userspace application.
> >> >
> >> > It is userspace but are you sure dpdk is actually poking at netdevs?
> >> > AFAIK it's normally banging device registers directly.
> >> >
> >> > > You can't go breaking userspace even if you don't like the application.
> >> >
> >> > Could you please explain how is the proposed patchset breaking
> >> > userspace? Ignoring DPDK for now, I don't think it changes the userspace
> >> > API at all.
> >> >
> >>
> >> The DPDK has a device driver vdev_netvsc which scans the Linux network devices
> >> to look for Linux netvsc device and the paired VF device and setup the
> >> DPDK environment.  This setup creates a DPDK failsafe (bondingish) instance
> >> and sets up TAP support over the Linux netvsc device as well as the Mellanox
> >> VF device.
> >>
> >> So it depends on existing 2 device model. You can't go to a 3 device model
> >> or start hiding devices from userspace.
> >
> > Okay so how does the existing patch break that? IIUC does not go to
> > a 3 device model since netvsc calls failover_register directly.
> >
> >> Also, I am working on associating netvsc and VF device based on serial number
> >> rather than MAC address. The serial number is how Windows works now, and it makes
> >> sense for Linux and Windows to use the same mechanism if possible.
> >
> > Maybe we should support same for virtio ...
> > Which serial do you mean? From vpd?
> >
> > I guess you will want to keep supporting MAC for old hypervisors?
> >
> > It all seems like a reasonable thing to support in the generic core.
> 
> That's the reason why I chose explicit identifier rather than rely on
> MAC address to bind/pair a device. MAC address can change. Even if it
> can't, malicious guest user can fake MAC address to skip binding.
> 
> -Siwei

Address should be sampled at device creation to prevent this
kind of hack. Not that it buys the malicious user much:
if you can poke at MAC addresses you probably already can
break networking.




> 
> >
> > --
> > MST

* Re: KASAN: null-ptr-deref Read in refcount_inc_not_zero
From: Cong Wang @ 2018-04-23 20:05 UTC (permalink / raw)
  To: syzbot
  Cc: David Miller, Denys Vlasenko, LKML,
	Linux Kernel Network Developers, syzkaller-bugs, xiaolou4617
In-Reply-To: <000000000000eeeeff056a79954c@google.com>

#syz fix: llc: fix NULL pointer deref for SOCK_ZAPPED

* Re: [PATCH bpf] bpf: disable and restore preemption in __BPF_PROG_RUN_ARRAY
From: Alexei Starovoitov @ 2018-04-23 20:05 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: netdev, kernel-team, Alexei Starovoitov, Daniel Borkmann
In-Reply-To: <20180423170921.16162-1-guro@fb.com>

On Mon, Apr 23, 2018 at 06:09:21PM +0100, Roman Gushchin wrote:
> Running bpf programs requires disabled preemption,
> however at least some* of the BPF_PROG_RUN_ARRAY users
> do not follow this rule.
> 
> To fix this bug, and also to make it not happen in the future,
> let's add explicit preemption disabling/re-enabling
> to the __BPF_PROG_RUN_ARRAY code.
> 
> * for example:
>  [   17.624472] RIP: 0010:__cgroup_bpf_run_filter_sk+0x1c4/0x1d0
>  ...
>  [   17.640890]  inet6_create+0x3eb/0x520
>  [   17.641405]  __sock_create+0x242/0x340
>  [   17.641939]  __sys_socket+0x57/0xe0
>  [   17.642370]  ? trace_hardirqs_off_thunk+0x1a/0x1c
>  [   17.642944]  SyS_socket+0xa/0x10
>  [   17.643357]  do_syscall_64+0x79/0x220
>  [   17.643879]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Daniel Borkmann <daniel@iogearbox.net>

Acked-by: Alexei Starovoitov <ast@kernel.org>

* Re: [nf-next] netfilter: extend SRH match to support matching previous, next and last SID
From: Ahmed Abdelsalam @ 2018-04-23 20:01 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: fw, davem, dav.lebrun, linux-kernel, netfilter-devel, coreteam,
	netdev
In-Reply-To: <20180423173047.gsf2xjlmpichyvte@salvia>

On Mon, 23 Apr 2018 19:30:47 +0200
Pablo Neira Ayuso <pablo@netfilter.org> wrote:

> On Mon, Apr 23, 2018 at 05:48:22AM -0500, Ahmed Abdelsalam wrote:
> > Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
> > ---
> >  include/uapi/linux/netfilter_ipv6/ip6t_srh.h | 22 +++++++++++++--
> >  net/ipv6/netfilter/ip6t_srh.c                | 41 +++++++++++++++++++++++++++-
> >  2 files changed, 60 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/uapi/linux/netfilter_ipv6/ip6t_srh.h b/include/uapi/linux/netfilter_ipv6/ip6t_srh.h
> > index f3cc0ef..9808382 100644
> > --- a/include/uapi/linux/netfilter_ipv6/ip6t_srh.h
> > +++ b/include/uapi/linux/netfilter_ipv6/ip6t_srh.h
> > @@ -17,7 +17,10 @@
> >  #define IP6T_SRH_LAST_GT        0x0100
> >  #define IP6T_SRH_LAST_LT        0x0200
> >  #define IP6T_SRH_TAG            0x0400
> > -#define IP6T_SRH_MASK           0x07FF
> > +#define IP6T_SRH_PSID           0x0800
> > +#define IP6T_SRH_NSID           0x1000
> > +#define IP6T_SRH_LSID           0x2000
> > +#define IP6T_SRH_MASK           0x3FFF
> >  
> >  /* Values for "mt_invflags" field in struct ip6t_srh */
> >  #define IP6T_SRH_INV_NEXTHDR    0x0001
> > @@ -31,7 +34,10 @@
> >  #define IP6T_SRH_INV_LAST_GT    0x0100
> >  #define IP6T_SRH_INV_LAST_LT    0x0200
> >  #define IP6T_SRH_INV_TAG        0x0400
> > -#define IP6T_SRH_INV_MASK       0x07FF
> > +#define IP6T_SRH_INV_PSID       0x0800
> > +#define IP6T_SRH_INV_NSID       0x1000
> > +#define IP6T_SRH_INV_LSID       0x2000
> > +#define IP6T_SRH_INV_MASK       0x3FFF
> >  
> >  /**
> >   *      struct ip6t_srh - SRH match options
> > @@ -40,6 +46,12 @@
> >   *      @ segs_left: Segments left field of SRH
> >   *      @ last_entry: Last entry field of SRH
> >   *      @ tag: Tag field of SRH
> > + *      @ psid_addr: Address of previous SID in SRH SID list
> > + *      @ nsid_addr: Address of NEXT SID in SRH SID list
> > + *      @ lsid_addr: Address of LAST SID in SRH SID list
> > + *      @ psid_msk: Mask of previous SID in SRH SID list
> > + *      @ nsid_msk: Mask of next SID in SRH SID list
> > + *      @ lsid_msk: MAsk of last SID in SRH SID list
> >   *      @ mt_flags: match options
> >   *      @ mt_invflags: Invert the sense of match options
> >   */
> > @@ -50,6 +62,12 @@ struct ip6t_srh {
> >  	__u8                    segs_left;
> >  	__u8                    last_entry;
> >  	__u16                   tag;
> > +	struct in6_addr		psid_addr;
> > +	struct in6_addr		nsid_addr;
> > +	struct in6_addr		lsid_addr;
> > +	struct in6_addr		psid_msk;
> > +	struct in6_addr		nsid_msk;
> > +	struct in6_addr		lsid_msk;
> 
> This is changing something exposed through UAPI, so you will need a
> new revision for this.

Could you please advise what should be done in this case?

> 
> >  	__u16                   mt_flags;
> >  	__u16                   mt_invflags;
> >  };
> > diff --git a/net/ipv6/netfilter/ip6t_srh.c b/net/ipv6/netfilter/ip6t_srh.c
> > index 33719d5..2b5cc73 100644
> > --- a/net/ipv6/netfilter/ip6t_srh.c
> > +++ b/net/ipv6/netfilter/ip6t_srh.c
> > @@ -30,7 +30,9 @@ static bool srh_mt6(const struct sk_buff *skb, struct xt_action_param *par)
> >  	const struct ip6t_srh *srhinfo = par->matchinfo;
> >  	struct ipv6_sr_hdr *srh;
> >  	struct ipv6_sr_hdr _srh;
> > -	int hdrlen, srhoff = 0;
> > +	int hdrlen, psidoff, nsidoff, lsidoff, srhoff = 0;
> > +	struct in6_addr *psid, *nsid, *lsid;
> > +	struct in6_addr _psid, _nsid, _lsid;
> 
> Could you rearrange variable definitions? ie. longest line first, eg.
> 
> 	int hdrlen, psidoff, nsidoff, lsidoff, srhoff = 0;
>   	const struct ip6t_srh *srhinfo = par->matchinfo;
> 	struct in6_addr *psid, *nsid, *lsid;
>   	struct ipv6_sr_hdr *srh;
>   	struct ipv6_sr_hdr _srh;
> 

Ok I will re-arrange them in reverse christmas tree form. 

Ahmed 

-- 
Ahmed Abdelsalam <amsalam20@gmail.com>

* Re: [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt
From: Björn Töpel @ 2018-04-23 20:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Willem de Bruijn, Daniel Borkmann, Netdev, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali,
	Zhang, Qi Z
In-Reply-To: <20180423190615-mutt-send-email-mst@kernel.org>

2018-04-23 18:18 GMT+02:00 Michael S. Tsirkin <mst@redhat.com>:

[...]

>> +static void xdp_umem_unpin_pages(struct xdp_umem *umem)
>> +{
>> +     unsigned int i;
>> +
>> +     if (umem->pgs) {
>> +             for (i = 0; i < umem->npgs; i++)
>
> Since you pin them with FOLL_WRITE, I assume these pages
> are written to.
> Don't you need set_page_dirty_lock here?
>

Hmm, I actually *removed* it from the RFC V2, but after doing some
homework, I think you're right. Thanks for pointing this out!

Thinking more about this; This function is called from sk_destruct,
and in the Tx case the sk_destruct can be called from interrupt
context, where set_page_dirty_lock cannot be called.

Are there any preferred ways of solving this? Scheduling the whole
xsk_destruct call to a workqueue is one way (I think). Any
cleaner/better way?

[...]

>> +static int __xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
>> +{
>> +     u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
>> +     u64 addr = mr->addr, size = mr->len;
>> +     unsigned int nframes;
>> +     int size_chk, err;
>> +
>> +     if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
>> +             /* Strictly speaking we could support this, if:
>> +              * - huge pages, or*
>
> what does "or*" here mean?
>

Oops, I'll change to just 'or' in the next revision.


Thanks!
Björn

* Re: [RFC V3 PATCH 0/8] Packed ring for vhost
From: Michael S. Tsirkin @ 2018-04-23 19:59 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Jason Wang, kvm, virtualization, netdev, linux-kernel, tiwei.bie,
	jfreimann, wexu
In-Reply-To: <20180423193120.GD30033@char.us.oracle.com>

On Mon, Apr 23, 2018 at 03:31:20PM -0400, Konrad Rzeszutek Wilk wrote:
> On Mon, Apr 23, 2018 at 01:34:52PM +0800, Jason Wang wrote:
> > Hi all:
> > 
> > This RFC implements packed ring layout. The code was tested with
> > Tiwei's RFC V2 at https://lkml.org/lkml/2018/4/1/48. Some fixups and
> > tweaks were needed on top of Tiwei's code to make it run. TCP stream
> > and pktgen do not show obvious difference compared with split ring.
> 
> I have to ask then - what is the benefit of this?

You can use this with dpdk within guest.
The DPDK version did see a performance improvement so hopefully with
future versions this will too.

> > 
> > Changes from V2:
> > - do not use & in checking desc_event_flags
> > - off should be most significant bit
> > - remove the workaround of mergeable buffer for dpdk prototype
> > - id should be in the last descriptor in the chain
> > - keep _F_WRITE for write descriptor when adding used
> > - device flags updating should use ADDR_USED type
> > - return error on unexpected unavail descriptor in a chain
> > - return false in vhost_ve_avail_empty if descriptor is available
> > - track last seen avail_wrap_counter
> > - correctly examine available descriptor in get_indirect_packed()
> > - vhost_idx_diff should return u16 instead of bool
> > 
> > Changes from V1:
> > 
> > - Refactor vhost used elem code to avoid open coding on used elem
> > - Event suppression support (compile test only).
> > - Indirect descriptor support (compile test only).
> > - Zerocopy support.
> > - vIOMMU support.
> > - SCSI/VSOCK support (compile test only).
> > - Fix several bugs
> > 
> > For simplicity, I don't implement batching or other optimizations.
> > 
> > Please review.
> > 
> > Jason Wang (8):
> >   vhost: move get_rx_bufs to vhost.c
> >   vhost: hide used ring layout from device
> >   vhost: do not use vring_used_elem
> >   vhost_net: do not explicitly manipulate vhost_used_elem
> >   vhost: vhost_put_user() can accept metadata type
> >   virtio: introduce packed ring defines
> >   vhost: packed ring support
> >   vhost: event suppression for packed ring
> > 
> >  drivers/vhost/net.c                | 136 ++----
> >  drivers/vhost/scsi.c               |  62 +--
> >  drivers/vhost/vhost.c              | 824 ++++++++++++++++++++++++++++++++++---
> >  drivers/vhost/vhost.h              |  47 ++-
> >  drivers/vhost/vsock.c              |  42 +-
> >  include/uapi/linux/virtio_config.h |   9 +
> >  include/uapi/linux/virtio_ring.h   |  32 ++
> >  7 files changed, 926 insertions(+), 226 deletions(-)
> > 
> > -- 
> > 2.7.4
> > 

* Re: [PATCH bpf-next v5 06/10] tools/bpf: add bpf_get_stack helper to tools headers
From: Alexei Starovoitov @ 2018-04-23 19:48 UTC (permalink / raw)
  To: Yonghong Song; +Cc: ast, daniel, netdev, ecree, kernel-team
In-Reply-To: <20180423175417.4104464-7-yhs@fb.com>

On Mon, Apr 23, 2018 at 10:54:13AM -0700, Yonghong Song wrote:
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---

please add a few words to the commit log. Don't leave it empty
even for trivial patches.

* Re: [PATCH bpf-next v5 10/10] tools/bpf: add a test for bpf_get_stack with tracepoint prog
From: Alexei Starovoitov @ 2018-04-23 19:47 UTC (permalink / raw)
  To: Yonghong Song; +Cc: ast, daniel, netdev, ecree, kernel-team
In-Reply-To: <20180423175417.4104464-11-yhs@fb.com>

On Mon, Apr 23, 2018 at 10:54:17AM -0700, Yonghong Song wrote:
> The test_stacktrace_map and test_stacktrace_build_id are
> enhanced to call bpf_get_stack in the helper to get the
> stack trace as well.  The stack traces from bpf_get_stack
> and bpf_get_stackid are compared to ensure that for the
> same stack as represented as the same hash, their ip addresses
> or build id's must be the same.
> 
> Signed-off-by: Yonghong Song <yhs@fb.com>
...
>  /* taken from /sys/kernel/debug/tracing/events/sched/sched_switch/format */
> @@ -44,7 +51,10 @@ struct sched_switch_args {
>  SEC("tracepoint/sched/sched_switch")
>  int oncpu(struct sched_switch_args *ctx)
>  {
> +	__u32 max_len = PERF_MAX_STACK_DEPTH * sizeof(__u64);
>  	__u32 key = 0, val = 0, *value_p;
> +	void *stack_p;
> +

nit: unnecessary extra empty line.
otherwise
Acked-by: Alexei Starovoitov <ast@kernel.org>

* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Siwei Liu @ 2018-04-23 19:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stephen Hemminger, Jiri Pirko, Sridhar Samudrala, David Miller,
	Netdev, virtualization, virtio-dev, Brandeburg, Jesse,
	Alexander Duyck, Jakub Kicinski, Jason Wang
In-Reply-To: <20180423205019-mutt-send-email-mst@kernel.org>

On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:
>> On Mon, 23 Apr 2018 20:24:56 +0300
>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>
>> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:
>> > > > >
>> > > > >I will NAK patches to change to common code for netvsc especially the
>> > > > >three device model.  MS worked hard with distro vendors to support transparent
>> > > > >mode, and we really can't have a new model; or do backport.
>> > > > >
>> > > > >Plus, DPDK is now dependent on existing model.
>> > > >
>> > > > Sorry, but nobody here cares about dpdk or other similar oddities.
>> > >
>> > > The network device model is a userspace API, and DPDK is a userspace application.
>> >
>> > It is userspace but are you sure dpdk is actually poking at netdevs?
>> > AFAIK it's normally banging device registers directly.
>> >
>> > > You can't go breaking userspace even if you don't like the application.
>> >
>> > Could you please explain how is the proposed patchset breaking
>> > userspace? Ignoring DPDK for now, I don't think it changes the userspace
>> > API at all.
>> >
>>
>> The DPDK has a device driver vdev_netvsc which scans the Linux network devices
>> to look for Linux netvsc device and the paired VF device and setup the
>> DPDK environment.  This setup creates a DPDK failsafe (bondingish) instance
>> and sets up TAP support over the Linux netvsc device as well as the Mellanox
>> VF device.
>>
>> So it depends on existing 2 device model. You can't go to a 3 device model
>> or start hiding devices from userspace.
>
> Okay so how does the existing patch break that? IIUC does not go to
> a 3 device model since netvsc calls failover_register directly.
>
>> Also, I am working on associating netvsc and VF device based on serial number
>> rather than MAC address. The serial number is how Windows works now, and it makes
>> sense for Linux and Windows to use the same mechanism if possible.
>
> Maybe we should support same for virtio ...
> Which serial do you mean? From vpd?
>
> I guess you will want to keep supporting MAC for old hypervisors?
>
> It all seems like a reasonable thing to support in the generic core.

That's the reason why I chose explicit identifier rather than rely on
MAC address to bind/pair a device. MAC address can change. Even if it
can't, malicious guest user can fake MAC address to skip binding.

-Siwei


>
> --
> MST

* Re: [PATCH bpf-next v5 03/10] bpf/verifier: refine retval R0 state for bpf_get_stack helper
From: Alexei Starovoitov @ 2018-04-23 19:42 UTC (permalink / raw)
  To: Yonghong Song; +Cc: ast, daniel, netdev, ecree, kernel-team
In-Reply-To: <20180423175417.4104464-4-yhs@fb.com>

On Mon, Apr 23, 2018 at 10:54:10AM -0700, Yonghong Song wrote:
> The special property of return values for helpers bpf_get_stack
> and bpf_probe_read_str are captured in verifier.
> Both helpers return a negative error code or
> a length, which is equal to or smaller than the buffer
> size argument. This additional information in the
> verifier can avoid the condition such as "retval > bufsize"
> in the bpf program. For example, for the code below,
>     usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
>     if (usize < 0 || usize > max_len)
>         return 0;
> The verifier may have the following errors:
>     52: (85) call bpf_get_stack#65
>      R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
>      R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
>      R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
>      R9_w=inv800 R10=fp0,call_-1
>     53: (bf) r8 = r0
>     54: (bf) r1 = r8
>     55: (67) r1 <<= 32
>     56: (bf) r2 = r1
>     57: (77) r2 >>= 32
>     58: (25) if r2 > 0x31f goto pc+33
>      R0=inv(id=0) R1=inv(id=0,smax_value=9223372032559808512,
>                          umax_value=18446744069414584320,
>                          var_off=(0x0; 0xffffffff00000000))
>      R2=inv(id=0,umax_value=799,var_off=(0x0; 0x3ff))
>      R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
>      R8=inv(id=0) R9=inv800 R10=fp0,call_-1
>     59: (1f) r9 -= r8
>     60: (c7) r1 s>>= 32
>     61: (bf) r2 = r7
>     62: (0f) r2 += r1
>     math between map_value pointer and register with unbounded
>     min value is not allowed
> The failure is due to llvm compiler optimization where register "r2",
> which is a copy of "r1", is tested for condition while later on "r1"
> is used for map_ptr operation. The verifier is not able to track such
> inst sequence effectively.
> 
> Without the "usize > max_len" condition, there is no llvm optimization
> and the below generated code passed verifier:
>     52: (85) call bpf_get_stack#65
>      R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
>      R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
>      R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
>      R9_w=inv800 R10=fp0,call_-1
>     53: (b7) r1 = 0
>     54: (bf) r8 = r0
>     55: (67) r8 <<= 32
>     56: (c7) r8 s>>= 32
>     57: (6d) if r1 s> r8 goto pc+24
>      R0=inv(id=0,umax_value=800) R1=inv0 R6=ctx(id=0,off=0,imm=0)
>      R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
>      R8=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff)) R9=inv800
>      R10=fp0,call_-1
>     58: (bf) r2 = r7
>     59: (0f) r2 += r8
>     60: (1f) r9 -= r8
>     61: (bf) r1 = r6
> 
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  kernel/bpf/verifier.c | 25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
> 
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index aba9425..d00bf53 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -164,6 +164,8 @@ struct bpf_call_arg_meta {
>  	bool pkt_access;
>  	int regno;
>  	int access_size;
> +	s64 msize_smax_value;
> +	u64 msize_umax_value;
>  };
>  
>  static DEFINE_MUTEX(bpf_verifier_lock);
> @@ -1994,6 +1996,12 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
>  	} else if (arg_type_is_mem_size(arg_type)) {
>  		bool zero_size_allowed = (arg_type == ARG_CONST_SIZE_OR_ZERO);
>  
> +		/* remember the mem_size which may be used later
> +		 * to refine return values.
> +		 */
> +		meta->msize_smax_value = reg->smax_value;
> +		meta->msize_umax_value = reg->umax_value;
> +
>  		/* The register is SCALAR_VALUE; the access check
>  		 * happens using its boundaries.
>  		 */
> @@ -2333,6 +2341,21 @@ static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
>  	return 0;
>  }
>  
> +static void do_refine_retval_range(struct bpf_reg_state *regs, int ret_type,
> +				   int func_id,
> +				   struct bpf_call_arg_meta *meta)
> +{
> +	struct bpf_reg_state *ret_reg = &regs[BPF_REG_0];
> +
> +	if (ret_type != RET_INTEGER ||
> +	    (func_id != BPF_FUNC_get_stack &&
> +	     func_id != BPF_FUNC_probe_read_str))
> +		return;
> +
> +	ret_reg->smax_value = meta->msize_smax_value;
> +	ret_reg->umax_value = meta->msize_umax_value;

should we call:
        __reg_deduce_bounds(ret_reg);
        __reg_bound_offset(ret_reg);
here?
It doesn't seem that it will be called later and
ret_reg->var_off state will be more conservative than necessary.
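
To make the var_off point concrete, here is a userspace sketch of how known
bits can be derived from an unsigned range. It follows my understanding of
the kernel's tnum_range() helper, which __reg_bound_offset() is believed to
intersect with the existing var_off; treat that link as an assumption. The
names `fls64_sketch()` and `tnum_range_sketch()` are illustrative, not
kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Tristate number: 'value' holds bits known to be 1, 'mask' holds unknown bits. */
struct tnum {
	uint64_t value;
	uint64_t mask;
};

/* Position of the highest set bit, counting from 1; returns 0 for x == 0. */
static unsigned int fls64_sketch(uint64_t x)
{
	unsigned int n = 0;

	while (x) {
		n++;
		x >>= 1;
	}
	return n;
}

/* Smallest tnum that covers every value in [min, max]. */
static struct tnum tnum_range_sketch(uint64_t min, uint64_t max)
{
	uint64_t chi = min ^ max; /* bits in which min and max differ */
	unsigned int bits = fls64_sketch(chi);
	uint64_t delta;
	struct tnum t;

	assert(min <= max);
	if (bits > 63) {
		/* 1ULL << 64 is undefined, so report "nothing known" */
		t.value = 0;
		t.mask = UINT64_MAX;
		return t;
	}
	/* every bit at or below the highest differing bit is unknown */
	delta = (1ULL << bits) - 1;
	t.value = min & ~delta;
	t.mask = delta;
	return t;
}
```

For the range [0, 800] this gives value 0x0 and mask 0x3ff, which is the
var_off=(0x0; 0x3ff) seen in the passing verifier log quoted above. Without
a step of this kind after setting umax_value, var_off would stay fully
unknown even though the bounds are tight.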

* Re: [bpf PATCH 2/2] bpf: sockmap sample use clang flag, -target bpf
From: Alexei Starovoitov @ 2018-04-23 19:35 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, netdev
In-Reply-To: <20180423191107.21348.15124.stgit@john-Precision-Tower-5810>

On Mon, Apr 23, 2018 at 12:11:08PM -0700, John Fastabend wrote:
> Per Documentation/bpf/bpf_devel_QA.txt add the -target flag to the
> sockmap Makefile. Relevant text quoted here,
> 
>    Otherwise, you can use bpf target. Additionally, you _must_ use
>    bpf target when:
> 
>  - Your program uses data structures with pointer or long / unsigned
>    long types that interface with BPF helpers or context data
>    structures. Access into these structures is verified by the BPF
>    verifier and may result in verification failures if the native
>    architecture is not aligned with the BPF architecture, e.g. 64-bit.
>    An example of this is BPF_PROG_TYPE_SK_MSG require '-target bpf'
> 
> Fixes: 69e8cc134bcb ("bpf: sockmap sample program")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [bpf PATCH 1/2] bpf: Document sockmap '-target bpf' requirement for PROG_TYPE_SK_MSG
From: Alexei Starovoitov @ 2018-04-23 19:35 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, netdev
In-Reply-To: <20180423191102.21348.85601.stgit@john-Precision-Tower-5810>

On Mon, Apr 23, 2018 at 12:11:02PM -0700, John Fastabend wrote:
> BPF_PROG_TYPE_SK_MSG programs use a 'void *' for both the data and
> data_end pointers. Additionally, the verifier ensures that every
> access to these values is a __u64 read. This maps correctly onto
> the 64-bit BPF architecture.
> 
> However, to ensure that clang uses the correct types when building
> on 32-bit architectures, the '-target bpf' option _must_ be
> specified. To make this clear, add a note to the Documentation.
> 
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [bpf PATCH v2 0/3] BPF, a couple sockmap fixes
From: Alexei Starovoitov @ 2018-04-23 19:34 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, netdev
In-Reply-To: <20180423182726.17999.269.stgit@john-Precision-Tower-5810>

On Mon, Apr 23, 2018 at 11:29:24AM -0700, John Fastabend wrote:
> While testing sockmap with more programs (besides our test programs)
> I found a couple of issues.
> 
> The attached series fixes three problems: pinned maps were not
> working correctly, blocking sockets returned zero, and an error
> path taken when the sock hit an out-of-memory case resulted in a
> double page_put() while doing ingress redirects.
> 
> See individual patches for more details.
> 
> v2: Incorporated Daniel's feedback to use map ops for uref put op
>     which also fixed the build error discovered in v1.

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [RFC V3 PATCH 0/8] Packed ring for vhost
From: Konrad Rzeszutek Wilk @ 2018-04-23 19:31 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, mst, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <1524461700-5469-1-git-send-email-jasowang@redhat.com>

On Mon, Apr 23, 2018 at 01:34:52PM +0800, Jason Wang wrote:
> Hi all:
> 
> This RFC implements the packed ring layout. The code was tested with
> Tiwei's RFC V2 at https://lkml.org/lkml/2018/4/1/48. Some fixups and
> tweaks were needed on top of Tiwei's code to make it run. TCP stream
> and pktgen do not show an obvious difference compared with split ring.

I have to ask then - what is the benefit of this?

> 
> Changes from V2:
> - do not use & in checking desc_event_flags
> - off should be most significant bit
> - remove the workaround of mergeable buffer for dpdk prototype
> - id should be in the last descriptor in the chain
> - keep _F_WRITE for write descriptor when adding used
> - device flags updating should use ADDR_USED type
> - return error on unexpected unavail descriptor in a chain
> - return false in vhost_ve_avail_empty if descriptor is available
> - track last seen avail_wrap_counter
> - correctly examine available descriptor in get_indirect_packed()
> - vhost_idx_diff should return u16 instead of bool
> 
> Changes from V1:
> 
> - Refactor vhost used elem code to avoid open coding on used elem
> - Event suppression support (compile test only).
> - Indirect descriptor support (compile test only).
> - Zerocopy support.
> - vIOMMU support.
> - SCSI/VSOCK support (compile test only).
> - Fix several bugs
> 
> For simplicity, I don't implement batching or other optimizations.
> 
> Please review.
> 
> Jason Wang (8):
>   vhost: move get_rx_bufs to vhost.c
>   vhost: hide used ring layout from device
>   vhost: do not use vring_used_elem
>   vhost_net: do not explicitly manipulate vhost_used_elem
>   vhost: vhost_put_user() can accept metadata type
>   virtio: introduce packed ring defines
>   vhost: packed ring support
>   vhost: event suppression for packed ring
> 
>  drivers/vhost/net.c                | 136 ++----
>  drivers/vhost/scsi.c               |  62 +--
>  drivers/vhost/vhost.c              | 824 ++++++++++++++++++++++++++++++++++---
>  drivers/vhost/vhost.h              |  47 ++-
>  drivers/vhost/vsock.c              |  42 +-
>  include/uapi/linux/virtio_config.h |   9 +
>  include/uapi/linux/virtio_ring.h   |  32 ++
>  7 files changed, 926 insertions(+), 226 deletions(-)
> 
> -- 
> 2.7.4
> 

^ permalink raw reply

* [bpf PATCH 2/2] bpf: sockmap sample use clang flag, -target bpf
From: John Fastabend @ 2018-04-23 19:11 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev
In-Reply-To: <20180423191102.21348.85601.stgit@john-Precision-Tower-5810>

Per Documentation/bpf/bpf_devel_QA.txt add the -target flag to the
sockmap Makefile. Relevant text quoted here,

   Otherwise, you can use bpf target. Additionally, you _must_ use
   bpf target when:

 - Your program uses data structures with pointer or long / unsigned
   long types that interface with BPF helpers or context data
   structures. Access into these structures is verified by the BPF
   verifier and may result in verification failures if the native
   architecture is not aligned with the BPF architecture, e.g. 64-bit.
   An example of this is BPF_PROG_TYPE_SK_MSG require '-target bpf'

Fixes: 69e8cc134bcb ("bpf: sockmap sample program")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 samples/sockmap/Makefile |    7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/samples/sockmap/Makefile b/samples/sockmap/Makefile
index 9bf2881..fa53f4d 100644
--- a/samples/sockmap/Makefile
+++ b/samples/sockmap/Makefile
@@ -65,11 +65,14 @@ $(src)/*.c: verify_target_bpf
 # asm/sysreg.h - inline assembly used by it is incompatible with llvm.
 # But, there is no easy way to fix it, so just exclude it since it is
 # useless for BPF samples.
+#
+# -target bpf option required with SK_MSG programs, this is to ensure
+#  reading 'void *' data types for data and data_end are __u64 reads.
 $(obj)/%.o: $(src)/%.c
 	$(CLANG) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) -I$(obj) \
 		-D__KERNEL__ -D__ASM_SYSREG_H -Wno-unused-value -Wno-pointer-sign \
 		-Wno-compare-distinct-pointer-types \
 		-Wno-gnu-variable-sized-type-not-at-end \
 		-Wno-address-of-packed-member -Wno-tautological-compare \
-		-Wno-unknown-warning-option \
-		-O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=obj -o $@
+		-Wno-unknown-warning-option -O2 -target bpf \
+		-emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=obj -o $@

^ permalink raw reply related

* [bpf PATCH 1/2] bpf: Document sockmap '-target bpf' requirement for PROG_TYPE_SK_MSG
From: John Fastabend @ 2018-04-23 19:11 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev

BPF_PROG_TYPE_SK_MSG programs use a 'void *' for both the data and
data_end pointers. Additionally, the verifier ensures that every
access to these values is a __u64 read. This maps correctly onto
the 64-bit BPF architecture.

However, to ensure that clang uses the correct types when building
on 32-bit architectures, the '-target bpf' option _must_ be
specified. To make this clear, add a note to the Documentation.
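
As a rough illustration of the mismatch, the sketch below compares two layouts of a context struct with two pointer fields. The struct and function names are hypothetical and are not taken from the kernel headers. The point is only that a pointer-typed field takes its width from the compile target, while the BPF side always uses 64-bit slots.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical context struct as the host compiler lays it out: each
 * field is as wide as a native pointer (4 bytes on a 32-bit host).
 */
struct ctx_native {
	void *data;
	void *data_end;
};

/* The same struct as a 64-bit BPF target lays it out. */
struct ctx_bpf {
	uint64_t data;
	uint64_t data_end;
};

/* Offset of data_end when compiled for the host. */
static size_t native_data_end_off(void)
{
	return offsetof(struct ctx_native, data_end);
}

/* Offset of data_end under the fixed 64-bit layout. */
static size_t bpf_data_end_off(void)
{
	return offsetof(struct ctx_bpf, data_end);
}
```

On a 64-bit host both offsets are 8. On a 32-bit host the native offset shrinks to 4, so code generated against the native layout reads data_end from an offset and with a width that the verifier does not expect.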

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 0 files changed

diff --git a/Documentation/bpf/bpf_devel_QA.txt b/Documentation/bpf/bpf_devel_QA.txt
index 1a0b704..da57601 100644
--- a/Documentation/bpf/bpf_devel_QA.txt
+++ b/Documentation/bpf/bpf_devel_QA.txt
@@ -557,6 +557,14 @@ A: Although LLVM IR generation and optimization try to stay architecture
        pulls in some header files containing file scope host assembly codes.
      - You can add "-fno-jump-tables" to work around the switch table issue.
 
-   Otherwise, you can use bpf target.
+   Otherwise, you can use bpf target. Additionally, you _must_ use bpf target
+   when:
+
+     - Your program uses data structures with pointer or long / unsigned long
+       types that interface with BPF helpers or context data structures. Access
+       into these structures is verified by the BPF verifier and may result
+       in verification failures if the native architecture is not aligned with
+       the BPF architecture, e.g. 64-bit. An example of this is
+       BPF_PROG_TYPE_SK_MSG require '-target bpf'
 
 Happy BPF hacking!

^ permalink raw reply related

* [PATCH net-next 2/2] net/sctp: Replace in/out stream arrays with flex_array
From: Oleg Babin @ 2018-04-23 18:41 UTC (permalink / raw)
  To: netdev, linux-sctp
  Cc: David S. Miller, Vlad Yasevich, Neil Horman, Xin Long,
	Marcelo Ricardo Leitner, Andrey Ryabinin
In-Reply-To: <1524508866-317485-1-git-send-email-obabin@virtuozzo.com>

This patch replaces physically contiguous memory arrays
allocated using kmalloc_array() with flexible arrays.
This makes it possible to avoid memory allocation failures
on systems under memory stress.
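
The idea can be sketched in userspace as an array built from page-sized parts, so that no single allocation has to be physically contiguous beyond one part. This is a simplified model and not the flex_array implementation. PART_BYTES and the chunked_* names are hypothetical.

```c
#include <stdlib.h>

#define PART_BYTES 4096	/* assumed stand-in for PAGE_SIZE */

struct chunked_array {
	size_t elem_size;
	size_t elems_per_part;
	size_t nr_parts;
	void **parts;		/* table of separately allocated parts */
};

static void chunked_free(struct chunked_array *ca)
{
	size_t i;

	if (!ca)
		return;
	if (ca->parts)
		for (i = 0; i < ca->nr_parts; i++)
			free(ca->parts[i]);
	free(ca->parts);
	free(ca);
}

/* Allocate room for 'count' zeroed elements, one part at a time. */
static struct chunked_array *chunked_alloc(size_t elem_size, size_t count)
{
	struct chunked_array *ca;
	size_t i;

	if (!elem_size || elem_size > PART_BYTES)
		return NULL;
	ca = calloc(1, sizeof(*ca));
	if (!ca)
		return NULL;
	ca->elem_size = elem_size;
	ca->elems_per_part = PART_BYTES / elem_size;
	ca->nr_parts = (count + ca->elems_per_part - 1) / ca->elems_per_part;
	ca->parts = calloc(ca->nr_parts ? ca->nr_parts : 1, sizeof(*ca->parts));
	if (!ca->parts) {
		free(ca);
		return NULL;
	}
	for (i = 0; i < ca->nr_parts; i++) {
		ca->parts[i] = calloc(1, PART_BYTES);
		if (!ca->parts[i]) {
			chunked_free(ca);	/* unset slots are NULL */
			return NULL;
		}
	}
	return ca;
}

/* Elements never straddle a part, so lookup is a divide and a modulo. */
static void *chunked_get(const struct chunked_array *ca, size_t index)
{
	size_t part = index / ca->elems_per_part;
	size_t off = (index % ca->elems_per_part) * ca->elem_size;

	if (part >= ca->nr_parts)
		return NULL;
	return (char *)ca->parts[part] + off;
}
```

The trade-off is that neighbouring elements are no longer guaranteed to be adjacent in memory, which is why pointer arithmetic on the old arrays has to go through an accessor.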

Signed-off-by: Oleg Babin <obabin@virtuozzo.com>
---
 include/net/sctp/structs.h |  1 +
 net/sctp/stream.c          | 78 +++++++++++++++++++++++++++++++++++-----------
 2 files changed, 61 insertions(+), 18 deletions(-)

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 578bb40..c7f42b4 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -57,6 +57,7 @@
 #include <linux/atomic.h>		/* This gets us atomic counters.  */
 #include <linux/skbuff.h>	/* We need sk_buff_head. */
 #include <linux/workqueue.h>	/* We need tq_struct.	 */
+#include <linux/flex_array.h>	/* We need flex_array.   */
 #include <linux/sctp.h>		/* We need sctp* header structs.  */
 #include <net/sctp/auth.h>	/* We need auth specific structs */
 #include <net/ip.h>		/* For inet_skb_parm */
diff --git a/net/sctp/stream.c b/net/sctp/stream.c
index 16e36c0..be372b0 100644
--- a/net/sctp/stream.c
+++ b/net/sctp/stream.c
@@ -40,13 +40,60 @@
 struct sctp_stream_out *sctp_stream_out_ptr(const struct sctp_stream *stream,
 					    __u16 sid)
 {
-	return ((struct sctp_stream_out *)(stream->out)) + sid;
+	return flex_array_get(stream->out, sid);
 }
 
 struct sctp_stream_in *sctp_stream_in_ptr(const struct sctp_stream *stream,
 					  __u16 sid)
 {
-	return ((struct sctp_stream_in *)(stream->in)) + sid;
+	return flex_array_get(stream->in, sid);
+}
+
+static struct flex_array *fa_alloc(size_t elem_size, size_t elem_count,
+				   gfp_t gfp)
+{
+	struct flex_array *result;
+	int err;
+
+	result = flex_array_alloc(elem_size, elem_count, gfp);
+	if (result) {
+		err = flex_array_prealloc(result, 0, elem_count, gfp);
+		if (err) {
+			flex_array_free(result);
+			result = NULL;
+		}
+	}
+
+	return result;
+}
+
+static void fa_free(struct flex_array *fa)
+{
+	if (fa)
+		flex_array_free(fa);
+}
+
+static void fa_copy(struct flex_array *fa, struct flex_array *from,
+		    size_t index, size_t count)
+{
+	void *elem;
+
+	while (count--) {
+		elem = flex_array_get(from, index);
+		flex_array_put(fa, index, elem, 0);
+		index++;
+	}
+}
+
+static void fa_zero(struct flex_array *fa, size_t index, size_t count)
+{
+	void *elem;
+
+	while (count--) {
+		elem = flex_array_get(fa, index);
+		memset(elem, 0, fa->element_size);
+		index++;
+	}
 }
 
 /* Migrates chunks from stream queues to new stream queues if needed,
@@ -106,19 +153,17 @@ static int sctp_stream_alloc_out(struct sctp_stream *stream, __u16 outcnt,
 	struct flex_array *out;
 	size_t elem_size = sizeof(struct sctp_stream_out);
 
-	out = kmalloc_array(outcnt, elem_size, gfp);
+	out = fa_alloc(elem_size, outcnt, gfp);
 	if (!out)
 		return -ENOMEM;
 
 	if (stream->out) {
-		memcpy(out, stream->out, min(outcnt, stream->outcnt) *
-					 elem_size);
-		kfree(stream->out);
+		fa_copy(out, stream->out, 0, min(outcnt, stream->outcnt));
+		fa_free(stream->out);
 	}
 
 	if (outcnt > stream->outcnt)
-		memset(((struct sctp_stream_out *)out) + stream->outcnt, 0,
-		       (outcnt - stream->outcnt) * elem_size);
+		fa_zero(out, stream->outcnt, (outcnt - stream->outcnt));
 
 	stream->out = out;
 
@@ -131,20 +176,17 @@ static int sctp_stream_alloc_in(struct sctp_stream *stream, __u16 incnt,
 	struct flex_array *in;
 	size_t elem_size = sizeof(struct sctp_stream_in);
 
-	in = kmalloc_array(incnt, elem_size, gfp);
-
+	in = fa_alloc(elem_size, incnt, gfp);
 	if (!in)
 		return -ENOMEM;
 
 	if (stream->in) {
-		memcpy(in, stream->in, min(incnt, stream->incnt) *
-				       elem_size);
-		kfree(stream->in);
+		fa_copy(in, stream->in, 0, min(incnt, stream->incnt));
+		fa_free(stream->in);
 	}
 
 	if (incnt > stream->incnt)
-		memset(((struct sctp_stream_in *)in) + stream->incnt, 0,
-		       (incnt - stream->incnt) * elem_size);
+		fa_zero(in, stream->incnt, (incnt - stream->incnt));
 
 	stream->in = in;
 
@@ -188,7 +230,7 @@ int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
 	ret = sctp_stream_alloc_in(stream, incnt, gfp);
 	if (ret) {
 		sched->free(stream);
-		kfree(stream->out);
+		fa_free(stream->out);
 		stream->out = NULL;
 		stream->outcnt = 0;
 		goto out;
@@ -220,8 +262,8 @@ void sctp_stream_free(struct sctp_stream *stream)
 	sched->free(stream);
 	for (i = 0; i < stream->outcnt; i++)
 		kfree(SCTP_SO(stream, i)->ext);
-	kfree(stream->out);
-	kfree(stream->in);
+	fa_free(stream->out);
+	fa_free(stream->in);
 }
 
 void sctp_stream_clear(struct sctp_stream *stream)
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next 1/2] net/sctp: Make wrappers for accessing in/out streams
From: Oleg Babin @ 2018-04-23 18:41 UTC (permalink / raw)
  To: netdev, linux-sctp
  Cc: David S. Miller, Vlad Yasevich, Neil Horman, Xin Long,
	Marcelo Ricardo Leitner, Andrey Ryabinin
In-Reply-To: <1524508866-317485-1-git-send-email-obabin@virtuozzo.com>

This patch introduces wrappers for accessing in/out streams indirectly.
This makes it possible to replace the physically contiguous memory arrays
of streams with flexible arrays (or any other appropriate
mechanism) that allocate memory on a per-page basis.
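
A stripped-down sketch of the pattern, using hypothetical names rather than the SCTP ones: callers go through an accessor selected by token pasting, so the storage behind it can change later without touching the call sites.

```c
#include <stdint.h>

struct stream_out { uint16_t ssn; };
struct stream_in  { uint16_t ssn; };

struct stream {
	void *out;	/* opaque storage; only the accessors know the layout */
	void *in;
};

/* Today the storage is a plain array; only these two functions would
 * need to change if it became a chunked or flexible array.
 */
static struct stream_out *stream_out_ptr(const struct stream *s, uint16_t sid)
{
	return (struct stream_out *)s->out + sid;
}

static struct stream_in *stream_in_ptr(const struct stream *s, uint16_t sid)
{
	return (struct stream_in *)s->in + sid;
}

/* 'type' is the bare token 'in' or 'out' and picks the accessor. */
#define ssn_peek(s, type, sid) (stream_##type##_ptr((s), (sid))->ssn)
#define ssn_next(s, type, sid) (stream_##type##_ptr((s), (sid))->ssn++)
```

For example, ssn_next(&s, out, 3) expands to stream_out_ptr((&s), (3))->ssn++, which returns the current value and then increments it.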

Signed-off-by: Oleg Babin <obabin@virtuozzo.com>
---
 include/net/sctp/structs.h   |  30 +++++++-----
 net/sctp/chunk.c             |   6 ++-
 net/sctp/outqueue.c          |  11 +++--
 net/sctp/socket.c            |   4 +-
 net/sctp/stream.c            | 107 +++++++++++++++++++++++++------------------
 net/sctp/stream_interleave.c |   2 +-
 net/sctp/stream_sched.c      |  13 +++---
 net/sctp/stream_sched_prio.c |  22 ++++-----
 net/sctp/stream_sched_rr.c   |   8 ++--
 9 files changed, 116 insertions(+), 87 deletions(-)

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index a0ec462..578bb40 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -394,37 +394,37 @@ int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
 
 /* What is the current SSN number for this stream? */
 #define sctp_ssn_peek(stream, type, sid) \
-	((stream)->type[sid].ssn)
+	(sctp_stream_##type##_ptr((stream), (sid))->ssn)
 
 /* Return the next SSN number for this stream.	*/
 #define sctp_ssn_next(stream, type, sid) \
-	((stream)->type[sid].ssn++)
+	(sctp_stream_##type##_ptr((stream), (sid))->ssn++)
 
 /* Skip over this ssn and all below. */
 #define sctp_ssn_skip(stream, type, sid, ssn) \
-	((stream)->type[sid].ssn = ssn + 1)
+	(sctp_stream_##type##_ptr((stream), (sid))->ssn = ssn + 1)
 
 /* What is the current MID number for this stream? */
 #define sctp_mid_peek(stream, type, sid) \
-	((stream)->type[sid].mid)
+	(sctp_stream_##type##_ptr((stream), (sid))->mid)
 
 /* Return the next MID number for this stream.  */
 #define sctp_mid_next(stream, type, sid) \
-	((stream)->type[sid].mid++)
+	(sctp_stream_##type##_ptr((stream), (sid))->mid++)
 
 /* Skip over this mid and all below. */
 #define sctp_mid_skip(stream, type, sid, mid) \
-	((stream)->type[sid].mid = mid + 1)
+	(sctp_stream_##type##_ptr((stream), (sid))->mid = mid + 1)
 
-#define sctp_stream_in(asoc, sid) (&(asoc)->stream.in[sid])
+#define sctp_stream_in(asoc, sid) sctp_stream_in_ptr(&(asoc)->stream, (sid))
 
 /* What is the current MID_uo number for this stream? */
 #define sctp_mid_uo_peek(stream, type, sid) \
-	((stream)->type[sid].mid_uo)
+	(sctp_stream_##type##_ptr((stream), (sid))->mid_uo)
 
 /* Return the next MID_uo number for this stream.  */
 #define sctp_mid_uo_next(stream, type, sid) \
-	((stream)->type[sid].mid_uo++)
+	(sctp_stream_##type##_ptr((stream), (sid))->mid_uo++)
 
 /*
  * Pointers to address related SCTP functions.
@@ -1428,8 +1428,8 @@ struct sctp_stream_in {
 };
 
 struct sctp_stream {
-	struct sctp_stream_out *out;
-	struct sctp_stream_in *in;
+	struct flex_array *out;
+	struct flex_array *in;
 	__u16 outcnt;
 	__u16 incnt;
 	/* Current stream being sent, if any */
@@ -1451,6 +1451,14 @@ struct sctp_stream {
 	struct sctp_stream_interleave *si;
 };
 
+struct sctp_stream_out *sctp_stream_out_ptr(const struct sctp_stream *stream,
+					    __u16 sid);
+struct sctp_stream_in *sctp_stream_in_ptr(const struct sctp_stream *stream,
+					  __u16 sid);
+
+#define SCTP_SO(s, i) sctp_stream_out_ptr((s), (i))
+#define SCTP_SI(s, i) sctp_stream_in_ptr((s), (i))
+
 #define SCTP_STREAM_CLOSED		0x00
 #define SCTP_STREAM_OPEN		0x01
 
diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c
index be296d6..4b9310e 100644
--- a/net/sctp/chunk.c
+++ b/net/sctp/chunk.c
@@ -333,7 +333,8 @@ int sctp_chunk_abandoned(struct sctp_chunk *chunk)
 	if (SCTP_PR_TTL_ENABLED(chunk->sinfo.sinfo_flags) &&
 	    time_after(jiffies, chunk->msg->expires_at)) {
 		struct sctp_stream_out *streamout =
-			&chunk->asoc->stream.out[chunk->sinfo.sinfo_stream];
+			SCTP_SO(&chunk->asoc->stream,
+				chunk->sinfo.sinfo_stream);
 
 		if (chunk->sent_count) {
 			chunk->asoc->abandoned_sent[SCTP_PR_INDEX(TTL)]++;
@@ -347,7 +348,8 @@ int sctp_chunk_abandoned(struct sctp_chunk *chunk)
 	} else if (SCTP_PR_RTX_ENABLED(chunk->sinfo.sinfo_flags) &&
 		   chunk->sent_count > chunk->sinfo.sinfo_timetolive) {
 		struct sctp_stream_out *streamout =
-			&chunk->asoc->stream.out[chunk->sinfo.sinfo_stream];
+			SCTP_SO(&chunk->asoc->stream,
+				chunk->sinfo.sinfo_stream);
 
 		chunk->asoc->abandoned_sent[SCTP_PR_INDEX(RTX)]++;
 		streamout->ext->abandoned_sent[SCTP_PR_INDEX(RTX)]++;
diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index f211b3d..8d5d811 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -80,7 +80,7 @@ static inline void sctp_outq_head_data(struct sctp_outq *q,
 	q->out_qlen += ch->skb->len;
 
 	stream = sctp_chunk_stream_no(ch);
-	oute = q->asoc->stream.out[stream].ext;
+	oute = SCTP_SO(&q->asoc->stream, stream)->ext;
 	list_add(&ch->stream_list, &oute->outq);
 }
 
@@ -101,7 +101,7 @@ static inline void sctp_outq_tail_data(struct sctp_outq *q,
 	q->out_qlen += ch->skb->len;
 
 	stream = sctp_chunk_stream_no(ch);
-	oute = q->asoc->stream.out[stream].ext;
+	oute = SCTP_SO(&q->asoc->stream, stream)->ext;
 	list_add_tail(&ch->stream_list, &oute->outq);
 }
 
@@ -372,7 +372,7 @@ static int sctp_prsctp_prune_sent(struct sctp_association *asoc,
 		sctp_insert_list(&asoc->outqueue.abandoned,
 				 &chk->transmitted_list);
 
-		streamout = &asoc->stream.out[chk->sinfo.sinfo_stream];
+		streamout = SCTP_SO(&asoc->stream, chk->sinfo.sinfo_stream);
 		asoc->sent_cnt_removable--;
 		asoc->abandoned_sent[SCTP_PR_INDEX(PRIO)]++;
 		streamout->ext->abandoned_sent[SCTP_PR_INDEX(PRIO)]++;
@@ -416,7 +416,7 @@ static int sctp_prsctp_prune_unsent(struct sctp_association *asoc,
 		asoc->abandoned_unsent[SCTP_PR_INDEX(PRIO)]++;
 		if (chk->sinfo.sinfo_stream < asoc->stream.outcnt) {
 			struct sctp_stream_out *streamout =
-				&asoc->stream.out[chk->sinfo.sinfo_stream];
+				SCTP_SO(&asoc->stream, chk->sinfo.sinfo_stream);
 
 			streamout->ext->abandoned_unsent[SCTP_PR_INDEX(PRIO)]++;
 		}
@@ -1050,6 +1050,7 @@ static void sctp_outq_flush(struct sctp_outq *q, int rtx_timeout, gfp_t gfp)
 		/* Finally, transmit new packets.  */
 		while ((chunk = sctp_outq_dequeue_data(q)) != NULL) {
 			__u32 sid = ntohs(chunk->subh.data_hdr->stream);
+			__u8 stream_state = SCTP_SO(&asoc->stream, sid)->state;
 
 			/* Has this chunk expired? */
 			if (sctp_chunk_abandoned(chunk)) {
@@ -1059,7 +1060,7 @@ static void sctp_outq_flush(struct sctp_outq *q, int rtx_timeout, gfp_t gfp)
 				continue;
 			}
 
-			if (asoc->stream.out[sid].state == SCTP_STREAM_CLOSED) {
+			if (stream_state == SCTP_STREAM_CLOSED) {
 				sctp_outq_head_data(q, chunk);
 				goto sctp_flush_out;
 			}
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 80835ac..3442f7c 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -1907,7 +1907,7 @@ static int sctp_sendmsg_to_asoc(struct sctp_association *asoc,
 		goto err;
 	}
 
-	if (unlikely(!asoc->stream.out[sinfo->sinfo_stream].ext)) {
+	if (unlikely(!SCTP_SO(&asoc->stream, sinfo->sinfo_stream)->ext)) {
 		err = sctp_stream_init_ext(&asoc->stream, sinfo->sinfo_stream);
 		if (err)
 			goto err;
@@ -6942,7 +6942,7 @@ static int sctp_getsockopt_pr_streamstatus(struct sock *sk, int len,
 	if (!asoc || params.sprstat_sid >= asoc->stream.outcnt)
 		goto out;
 
-	streamoute = asoc->stream.out[params.sprstat_sid].ext;
+	streamoute = SCTP_SO(&asoc->stream, params.sprstat_sid)->ext;
 	if (!streamoute) {
 		/* Not allocated yet, means all stats are 0 */
 		params.sprstat_abandoned_unsent = 0;
diff --git a/net/sctp/stream.c b/net/sctp/stream.c
index f799043..16e36c0 100644
--- a/net/sctp/stream.c
+++ b/net/sctp/stream.c
@@ -37,6 +37,18 @@
 #include <net/sctp/sm.h>
 #include <net/sctp/stream_sched.h>
 
+struct sctp_stream_out *sctp_stream_out_ptr(const struct sctp_stream *stream,
+					    __u16 sid)
+{
+	return ((struct sctp_stream_out *)(stream->out)) + sid;
+}
+
+struct sctp_stream_in *sctp_stream_in_ptr(const struct sctp_stream *stream,
+					  __u16 sid)
+{
+	return ((struct sctp_stream_in *)(stream->in)) + sid;
+}
+
 /* Migrates chunks from stream queues to new stream queues if needed,
  * but not across associations. Also, removes those chunks to streams
  * higher than the new max.
@@ -78,34 +90,35 @@ static void sctp_stream_outq_migrate(struct sctp_stream *stream,
 		 * sctp_stream_update will swap ->out pointers.
 		 */
 		for (i = 0; i < outcnt; i++) {
-			kfree(new->out[i].ext);
-			new->out[i].ext = stream->out[i].ext;
-			stream->out[i].ext = NULL;
+			kfree(SCTP_SO(new, i)->ext);
+			SCTP_SO(new, i)->ext = SCTP_SO(stream, i)->ext;
+			SCTP_SO(stream, i)->ext = NULL;
 		}
 	}
 
 	for (i = outcnt; i < stream->outcnt; i++)
-		kfree(stream->out[i].ext);
+		kfree(SCTP_SO(stream, i)->ext);
 }
 
 static int sctp_stream_alloc_out(struct sctp_stream *stream, __u16 outcnt,
 				 gfp_t gfp)
 {
-	struct sctp_stream_out *out;
+	struct flex_array *out;
+	size_t elem_size = sizeof(struct sctp_stream_out);
 
-	out = kmalloc_array(outcnt, sizeof(*out), gfp);
+	out = kmalloc_array(outcnt, elem_size, gfp);
 	if (!out)
 		return -ENOMEM;
 
 	if (stream->out) {
 		memcpy(out, stream->out, min(outcnt, stream->outcnt) *
-					 sizeof(*out));
+					 elem_size);
 		kfree(stream->out);
 	}
 
 	if (outcnt > stream->outcnt)
-		memset(out + stream->outcnt, 0,
-		       (outcnt - stream->outcnt) * sizeof(*out));
+		memset(((struct sctp_stream_out *)out) + stream->outcnt, 0,
+		       (outcnt - stream->outcnt) * elem_size);
 
 	stream->out = out;
 
@@ -115,22 +128,23 @@ static int sctp_stream_alloc_out(struct sctp_stream *stream, __u16 outcnt,
 static int sctp_stream_alloc_in(struct sctp_stream *stream, __u16 incnt,
 				gfp_t gfp)
 {
-	struct sctp_stream_in *in;
+	struct flex_array *in;
+	size_t elem_size = sizeof(struct sctp_stream_in);
 
-	in = kmalloc_array(incnt, sizeof(*stream->in), gfp);
+	in = kmalloc_array(incnt, elem_size, gfp);
 
 	if (!in)
 		return -ENOMEM;
 
 	if (stream->in) {
 		memcpy(in, stream->in, min(incnt, stream->incnt) *
-				       sizeof(*in));
+				       elem_size);
 		kfree(stream->in);
 	}
 
 	if (incnt > stream->incnt)
-		memset(in + stream->incnt, 0,
-		       (incnt - stream->incnt) * sizeof(*in));
+		memset(((struct sctp_stream_in *)in) + stream->incnt, 0,
+		       (incnt - stream->incnt) * elem_size);
 
 	stream->in = in;
 
@@ -162,7 +176,7 @@ int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
 
 	stream->outcnt = outcnt;
 	for (i = 0; i < stream->outcnt; i++)
-		stream->out[i].state = SCTP_STREAM_OPEN;
+		SCTP_SO(stream, i)->state = SCTP_STREAM_OPEN;
 
 	sched->init(stream);
 
@@ -193,7 +207,7 @@ int sctp_stream_init_ext(struct sctp_stream *stream, __u16 sid)
 	soute = kzalloc(sizeof(*soute), GFP_KERNEL);
 	if (!soute)
 		return -ENOMEM;
-	stream->out[sid].ext = soute;
+	SCTP_SO(stream, sid)->ext = soute;
 
 	return sctp_sched_init_sid(stream, sid, GFP_KERNEL);
 }
@@ -205,7 +219,7 @@ void sctp_stream_free(struct sctp_stream *stream)
 
 	sched->free(stream);
 	for (i = 0; i < stream->outcnt; i++)
-		kfree(stream->out[i].ext);
+		kfree(SCTP_SO(stream, i)->ext);
 	kfree(stream->out);
 	kfree(stream->in);
 }
@@ -215,12 +229,12 @@ void sctp_stream_clear(struct sctp_stream *stream)
 	int i;
 
 	for (i = 0; i < stream->outcnt; i++) {
-		stream->out[i].mid = 0;
-		stream->out[i].mid_uo = 0;
+		SCTP_SO(stream, i)->mid = 0;
+		SCTP_SO(stream, i)->mid_uo = 0;
 	}
 
 	for (i = 0; i < stream->incnt; i++)
-		stream->in[i].mid = 0;
+		SCTP_SI(stream, i)->mid = 0;
 }
 
 void sctp_stream_update(struct sctp_stream *stream, struct sctp_stream *new)
@@ -271,8 +285,8 @@ static bool sctp_stream_outq_is_empty(struct sctp_stream *stream,
 	for (i = 0; i < str_nums; i++) {
 		__u16 sid = ntohs(str_list[i]);
 
-		if (stream->out[sid].ext &&
-		    !list_empty(&stream->out[sid].ext->outq))
+		if (SCTP_SO(stream, sid)->ext &&
+		    !list_empty(&SCTP_SO(stream, sid)->ext->outq))
 			return false;
 	}
 
@@ -359,11 +373,11 @@ int sctp_send_reset_streams(struct sctp_association *asoc,
 	if (out) {
 		if (str_nums)
 			for (i = 0; i < str_nums; i++)
-				stream->out[str_list[i]].state =
+				SCTP_SO(stream, str_list[i])->state =
 						       SCTP_STREAM_CLOSED;
 		else
 			for (i = 0; i < stream->outcnt; i++)
-				stream->out[i].state = SCTP_STREAM_CLOSED;
+				SCTP_SO(stream, i)->state = SCTP_STREAM_CLOSED;
 	}
 
 	asoc->strreset_chunk = chunk;
@@ -378,11 +392,11 @@ int sctp_send_reset_streams(struct sctp_association *asoc,
 
 		if (str_nums)
 			for (i = 0; i < str_nums; i++)
-				stream->out[str_list[i]].state =
+				SCTP_SO(stream, str_list[i])->state =
 						       SCTP_STREAM_OPEN;
 		else
 			for (i = 0; i < stream->outcnt; i++)
-				stream->out[i].state = SCTP_STREAM_OPEN;
+				SCTP_SO(stream, i)->state = SCTP_STREAM_OPEN;
 
 		goto out;
 	}
@@ -416,7 +430,7 @@ int sctp_send_reset_assoc(struct sctp_association *asoc)
 
 	/* Block further xmit of data until this request is completed */
 	for (i = 0; i < stream->outcnt; i++)
-		stream->out[i].state = SCTP_STREAM_CLOSED;
+		SCTP_SO(stream, i)->state = SCTP_STREAM_CLOSED;
 
 	asoc->strreset_chunk = chunk;
 	sctp_chunk_hold(asoc->strreset_chunk);
@@ -427,7 +441,7 @@ int sctp_send_reset_assoc(struct sctp_association *asoc)
 		asoc->strreset_chunk = NULL;
 
 		for (i = 0; i < stream->outcnt; i++)
-			stream->out[i].state = SCTP_STREAM_OPEN;
+			SCTP_SO(stream, i)->state = SCTP_STREAM_OPEN;
 
 		return retval;
 	}
@@ -607,10 +621,10 @@ struct sctp_chunk *sctp_process_strreset_outreq(
 		}
 
 		for (i = 0; i < nums; i++)
-			stream->in[ntohs(str_p[i])].mid = 0;
+			SCTP_SI(stream, ntohs(str_p[i]))->mid = 0;
 	} else {
 		for (i = 0; i < stream->incnt; i++)
-			stream->in[i].mid = 0;
+			SCTP_SI(stream, i)->mid = 0;
 	}
 
 	result = SCTP_STRRESET_PERFORMED;
@@ -681,11 +695,11 @@ struct sctp_chunk *sctp_process_strreset_inreq(
 
 	if (nums)
 		for (i = 0; i < nums; i++)
-			stream->out[ntohs(str_p[i])].state =
+			SCTP_SO(stream, ntohs(str_p[i]))->state =
 					       SCTP_STREAM_CLOSED;
 	else
 		for (i = 0; i < stream->outcnt; i++)
-			stream->out[i].state = SCTP_STREAM_CLOSED;
+			SCTP_SO(stream, i)->state = SCTP_STREAM_CLOSED;
 
 	asoc->strreset_chunk = chunk;
 	asoc->strreset_outstanding = 1;
@@ -784,11 +798,11 @@ struct sctp_chunk *sctp_process_strreset_tsnreq(
 	 *      incoming and outgoing streams.
 	 */
 	for (i = 0; i < stream->outcnt; i++) {
-		stream->out[i].mid = 0;
-		stream->out[i].mid_uo = 0;
+		SCTP_SO(stream, i)->mid = 0;
+		SCTP_SO(stream, i)->mid_uo = 0;
 	}
 	for (i = 0; i < stream->incnt; i++)
-		stream->in[i].mid = 0;
+		SCTP_SI(stream, i)->mid = 0;
 
 	result = SCTP_STRRESET_PERFORMED;
 
@@ -977,15 +991,18 @@ struct sctp_chunk *sctp_process_strreset_resp(
 		       sizeof(__u16);
 
 		if (result == SCTP_STRRESET_PERFORMED) {
+			struct sctp_stream_out *sout;
 			if (nums) {
 				for (i = 0; i < nums; i++) {
-					stream->out[ntohs(str_p[i])].mid = 0;
-					stream->out[ntohs(str_p[i])].mid_uo = 0;
+					sout = SCTP_SO(stream, ntohs(str_p[i]));
+					sout->mid = 0;
+					sout->mid_uo = 0;
 				}
 			} else {
 				for (i = 0; i < stream->outcnt; i++) {
-					stream->out[i].mid = 0;
-					stream->out[i].mid_uo = 0;
+					sout = SCTP_SO(stream, i);
+					sout->mid = 0;
+					sout->mid_uo = 0;
 				}
 			}
 
@@ -993,7 +1010,7 @@ struct sctp_chunk *sctp_process_strreset_resp(
 		}
 
 		for (i = 0; i < stream->outcnt; i++)
-			stream->out[i].state = SCTP_STREAM_OPEN;
+			SCTP_SO(stream, i)->state = SCTP_STREAM_OPEN;
 
 		*evp = sctp_ulpevent_make_stream_reset_event(asoc, flags,
 			nums, str_p, GFP_ATOMIC);
@@ -1048,15 +1065,15 @@ struct sctp_chunk *sctp_process_strreset_resp(
 			asoc->adv_peer_ack_point = asoc->ctsn_ack_point;
 
 			for (i = 0; i < stream->outcnt; i++) {
-				stream->out[i].mid = 0;
-				stream->out[i].mid_uo = 0;
+				SCTP_SO(stream, i)->mid = 0;
+				SCTP_SO(stream, i)->mid_uo = 0;
 			}
 			for (i = 0; i < stream->incnt; i++)
-				stream->in[i].mid = 0;
+				SCTP_SI(stream, i)->mid = 0;
 		}
 
 		for (i = 0; i < stream->outcnt; i++)
-			stream->out[i].state = SCTP_STREAM_OPEN;
+			SCTP_SO(stream, i)->state = SCTP_STREAM_OPEN;
 
 		*evp = sctp_ulpevent_make_assoc_reset_event(asoc, flags,
 			stsn, rtsn, GFP_ATOMIC);
@@ -1070,7 +1087,7 @@ struct sctp_chunk *sctp_process_strreset_resp(
 
 		if (result == SCTP_STRRESET_PERFORMED)
 			for (i = number; i < stream->outcnt; i++)
-				stream->out[i].state = SCTP_STREAM_OPEN;
+				SCTP_SO(stream, i)->state = SCTP_STREAM_OPEN;
 		else
 			stream->outcnt = number;
 
diff --git a/net/sctp/stream_interleave.c b/net/sctp/stream_interleave.c
index d3764c1..46f9fb6 100644
--- a/net/sctp/stream_interleave.c
+++ b/net/sctp/stream_interleave.c
@@ -1053,7 +1053,7 @@ static void sctp_intl_abort_pd(struct sctp_ulpq *ulpq, gfp_t gfp)
 	__u16 sid;
 
 	for (sid = 0; sid < stream->incnt; sid++) {
-		struct sctp_stream_in *sin = &stream->in[sid];
+		struct sctp_stream_in *sin = SCTP_SI(stream, sid);
 		__u32 mid;
 
 		if (sin->pd_mode_uo) {
diff --git a/net/sctp/stream_sched.c b/net/sctp/stream_sched.c
index f5fcd42..a6c04a9 100644
--- a/net/sctp/stream_sched.c
+++ b/net/sctp/stream_sched.c
@@ -161,7 +161,7 @@ int sctp_sched_set_sched(struct sctp_association *asoc,
 
 		/* Give the next scheduler a clean slate. */
 		for (i = 0; i < asoc->stream.outcnt; i++) {
-			void *p = asoc->stream.out[i].ext;
+			void *p = SCTP_SO(&asoc->stream, i)->ext;
 
 			if (!p)
 				continue;
@@ -175,7 +175,7 @@ int sctp_sched_set_sched(struct sctp_association *asoc,
 	asoc->outqueue.sched = n;
 	n->init(&asoc->stream);
 	for (i = 0; i < asoc->stream.outcnt; i++) {
-		if (!asoc->stream.out[i].ext)
+		if (!SCTP_SO(&asoc->stream, i)->ext)
 			continue;
 
 		ret = n->init_sid(&asoc->stream, i, GFP_KERNEL);
@@ -217,7 +217,7 @@ int sctp_sched_set_value(struct sctp_association *asoc, __u16 sid,
 	if (sid >= asoc->stream.outcnt)
 		return -EINVAL;
 
-	if (!asoc->stream.out[sid].ext) {
+	if (!SCTP_SO(&asoc->stream, sid)->ext) {
 		int ret;
 
 		ret = sctp_stream_init_ext(&asoc->stream, sid);
@@ -234,7 +234,7 @@ int sctp_sched_get_value(struct sctp_association *asoc, __u16 sid,
 	if (sid >= asoc->stream.outcnt)
 		return -EINVAL;
 
-	if (!asoc->stream.out[sid].ext)
+	if (!SCTP_SO(&asoc->stream, sid)->ext)
 		return 0;
 
 	return asoc->outqueue.sched->get(&asoc->stream, sid, value);
@@ -252,7 +252,7 @@ void sctp_sched_dequeue_done(struct sctp_outq *q, struct sctp_chunk *ch)
 		 * priority stream comes in.
 		 */
 		sid = sctp_chunk_stream_no(ch);
-		sout = &q->asoc->stream.out[sid];
+		sout = SCTP_SO(&q->asoc->stream, sid);
 		q->asoc->stream.out_curr = sout;
 		return;
 	}
@@ -272,8 +272,9 @@ void sctp_sched_dequeue_common(struct sctp_outq *q, struct sctp_chunk *ch)
 int sctp_sched_init_sid(struct sctp_stream *stream, __u16 sid, gfp_t gfp)
 {
 	struct sctp_sched_ops *sched = sctp_sched_ops_from_stream(stream);
+	struct sctp_stream_out_ext *ext = SCTP_SO(stream, sid)->ext;
 
-	INIT_LIST_HEAD(&stream->out[sid].ext->outq);
+	INIT_LIST_HEAD(&ext->outq);
 	return sched->init_sid(stream, sid, gfp);
 }
 
diff --git a/net/sctp/stream_sched_prio.c b/net/sctp/stream_sched_prio.c
index 7997d35..2245083 100644
--- a/net/sctp/stream_sched_prio.c
+++ b/net/sctp/stream_sched_prio.c
@@ -75,10 +75,10 @@ static struct sctp_stream_priorities *sctp_sched_prio_get_head(
 
 	/* No luck. So we search on all streams now. */
 	for (i = 0; i < stream->outcnt; i++) {
-		if (!stream->out[i].ext)
+		if (!SCTP_SO(stream, i)->ext)
 			continue;
 
-		p = stream->out[i].ext->prio_head;
+		p = SCTP_SO(stream, i)->ext->prio_head;
 		if (!p)
 			/* Means all other streams won't be initialized
 			 * as well.
@@ -165,7 +165,7 @@ static void sctp_sched_prio_sched(struct sctp_stream *stream,
 static int sctp_sched_prio_set(struct sctp_stream *stream, __u16 sid,
 			       __u16 prio, gfp_t gfp)
 {
-	struct sctp_stream_out *sout = &stream->out[sid];
+	struct sctp_stream_out *sout = SCTP_SO(stream, sid);
 	struct sctp_stream_out_ext *soute = sout->ext;
 	struct sctp_stream_priorities *prio_head, *old;
 	bool reschedule = false;
@@ -186,7 +186,7 @@ static int sctp_sched_prio_set(struct sctp_stream *stream, __u16 sid,
 		return 0;
 
 	for (i = 0; i < stream->outcnt; i++) {
-		soute = stream->out[i].ext;
+		soute = SCTP_SO(stream, i)->ext;
 		if (soute && soute->prio_head == old)
 			/* It's still in use, nothing else to do here. */
 			return 0;
@@ -201,7 +201,7 @@ static int sctp_sched_prio_set(struct sctp_stream *stream, __u16 sid,
 static int sctp_sched_prio_get(struct sctp_stream *stream, __u16 sid,
 			       __u16 *value)
 {
-	*value = stream->out[sid].ext->prio_head->prio;
+	*value = SCTP_SO(stream, sid)->ext->prio_head->prio;
 	return 0;
 }
 
@@ -215,7 +215,7 @@ static int sctp_sched_prio_init(struct sctp_stream *stream)
 static int sctp_sched_prio_init_sid(struct sctp_stream *stream, __u16 sid,
 				    gfp_t gfp)
 {
-	INIT_LIST_HEAD(&stream->out[sid].ext->prio_list);
+	INIT_LIST_HEAD(&SCTP_SO(stream, sid)->ext->prio_list);
 	return sctp_sched_prio_set(stream, sid, 0, gfp);
 }
 
@@ -233,9 +233,9 @@ static void sctp_sched_prio_free(struct sctp_stream *stream)
 	 */
 	sctp_sched_prio_unsched_all(stream);
 	for (i = 0; i < stream->outcnt; i++) {
-		if (!stream->out[i].ext)
+		if (!SCTP_SO(stream, i)->ext)
 			continue;
-		prio = stream->out[i].ext->prio_head;
+		prio = SCTP_SO(stream, i)->ext->prio_head;
 		if (prio && list_empty(&prio->prio_sched))
 			list_add(&prio->prio_sched, &list);
 	}
@@ -255,7 +255,7 @@ static void sctp_sched_prio_enqueue(struct sctp_outq *q,
 	ch = list_first_entry(&msg->chunks, struct sctp_chunk, frag_list);
 	sid = sctp_chunk_stream_no(ch);
 	stream = &q->asoc->stream;
-	sctp_sched_prio_sched(stream, stream->out[sid].ext);
+	sctp_sched_prio_sched(stream, SCTP_SO(stream, sid)->ext);
 }
 
 static struct sctp_chunk *sctp_sched_prio_dequeue(struct sctp_outq *q)
@@ -297,7 +297,7 @@ static void sctp_sched_prio_dequeue_done(struct sctp_outq *q,
 	 * this priority.
 	 */
 	sid = sctp_chunk_stream_no(ch);
-	soute = q->asoc->stream.out[sid].ext;
+	soute = SCTP_SO(&q->asoc->stream, sid)->ext;
 	prio = soute->prio_head;
 
 	sctp_sched_prio_next_stream(prio);
@@ -317,7 +317,7 @@ static void sctp_sched_prio_sched_all(struct sctp_stream *stream)
 		__u16 sid;
 
 		sid = sctp_chunk_stream_no(ch);
-		sout = &stream->out[sid];
+		sout = SCTP_SO(stream, sid);
 		if (sout->ext)
 			sctp_sched_prio_sched(stream, sout->ext);
 	}
diff --git a/net/sctp/stream_sched_rr.c b/net/sctp/stream_sched_rr.c
index 1155692..52ba743 100644
--- a/net/sctp/stream_sched_rr.c
+++ b/net/sctp/stream_sched_rr.c
@@ -100,7 +100,7 @@ static int sctp_sched_rr_init(struct sctp_stream *stream)
 static int sctp_sched_rr_init_sid(struct sctp_stream *stream, __u16 sid,
 				  gfp_t gfp)
 {
-	INIT_LIST_HEAD(&stream->out[sid].ext->rr_list);
+	INIT_LIST_HEAD(&SCTP_SO(stream, sid)->ext->rr_list);
 
 	return 0;
 }
@@ -120,7 +120,7 @@ static void sctp_sched_rr_enqueue(struct sctp_outq *q,
 	ch = list_first_entry(&msg->chunks, struct sctp_chunk, frag_list);
 	sid = sctp_chunk_stream_no(ch);
 	stream = &q->asoc->stream;
-	sctp_sched_rr_sched(stream, stream->out[sid].ext);
+	sctp_sched_rr_sched(stream, SCTP_SO(stream, sid)->ext);
 }
 
 static struct sctp_chunk *sctp_sched_rr_dequeue(struct sctp_outq *q)
@@ -154,7 +154,7 @@ static void sctp_sched_rr_dequeue_done(struct sctp_outq *q,
 
 	/* Last chunk on that msg, move to the next stream */
 	sid = sctp_chunk_stream_no(ch);
-	soute = q->asoc->stream.out[sid].ext;
+	soute = SCTP_SO(&q->asoc->stream, sid)->ext;
 
 	sctp_sched_rr_next_stream(&q->asoc->stream);
 
@@ -173,7 +173,7 @@ static void sctp_sched_rr_sched_all(struct sctp_stream *stream)
 		__u16 sid;
 
 		sid = sctp_chunk_stream_no(ch);
-		soute = stream->out[sid].ext;
+		soute = SCTP_SO(stream, sid)->ext;
 		if (soute)
 			sctp_sched_rr_sched(stream, soute);
 	}
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next 0/2] net/sctp: Avoid allocating high order memory with kmalloc()
From: Oleg Babin @ 2018-04-23 18:41 UTC (permalink / raw)
  To: netdev, linux-sctp
  Cc: David S. Miller, Vlad Yasevich, Neil Horman, Xin Long,
	Marcelo Ricardo Leitner, Andrey Ryabinin

Each SCTP association can have up to 65535 input and output streams.
For each stream type an array of sctp_stream_in or sctp_stream_out
structures is allocated using the kmalloc_array() function. This
function allocates physically contiguous memory, so it can end up
requesting regions of very high order, e.g.:

  sizeof(struct sctp_stream_out) == 24,
  (65535 * 24) / 4096 rounds up to 384 memory pages (4096 bytes per page),
  which means an order-9 allocation (512 pages).

This can lead to memory allocation failures on systems
under memory pressure.
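
The arithmetic above can be checked with a small user-space sketch.
The helper names (demo_pages_needed, demo_alloc_order) and the
DEMO_PAGE_SIZE constant are made up for illustration and are not kernel
APIs; the sketch only assumes 4096-byte pages and a 24-byte element, as
stated above.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096u	/* assumed 4 KiB pages, as in the text */

/* Number of pages needed to hold `bytes`, rounding up. */
size_t demo_pages_needed(size_t bytes)
{
	return (bytes + DEMO_PAGE_SIZE - 1) / DEMO_PAGE_SIZE;
}

/* Smallest order such that (1 << order) pages cover `bytes`. */
unsigned int demo_alloc_order(size_t bytes)
{
	size_t pages = demo_pages_needed(bytes);
	unsigned int order = 0;

	assert(bytes > 0);
	while (((size_t)1 << order) < pages)
		order++;
	return order;
}
```

Calling demo_alloc_order(65535 * 24) gives the order of the contiguous
allocation needed for a full-size stream array under these assumptions.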

We do not actually need these arrays to be physically contiguous.
A simple solution would be to use kvmalloc() instead of kmalloc(),
as kvmalloc() can allocate physically scattered pages if contiguous
pages are not available. The problem is that the allocation can
happen in softirq context with the GFP_ATOMIC flag set, and
kvmalloc() cannot be used in this scenario.

So the other possible solution is to use flexible arrays instead of
contiguous arrays of memory, so that the memory is allocated
on a per-page basis.

This patchset replaces kmalloc_array() with flex_array usage.
It consists of two parts:

  * The first patch is preparatory: it mechanically wraps all direct
    accesses to the assoc->stream.out[] and assoc->stream.in[] arrays
    with SCTP_SO() and SCTP_SI() wrappers, so that a direct array
    access can later be changed easily to an access to a flex_array
    (or any other possible alternative).
  * The second patch replaces kmalloc_array() with flex_array usage.
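
The idea behind the two steps can be sketched in user-space C. This is
a simplified stand-in, not the kernel flex_array API and not the code
from the patches: struct demo_stream_out, struct demo_stream, demo_so()
and the other demo_* names are hypothetical. It shows elements stored
in separately allocated page-sized chunks, with every access going
through one accessor in the spirit of SCTP_SO(), so callers do not
depend on the storage layout.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_PAGE_SIZE 4096u	/* assumed page size */

/* Hypothetical stand-in for struct sctp_stream_out; fields are illustrative. */
struct demo_stream_out {
	uint32_t mid;
	uint32_t mid_uo;
	uint8_t state;
	void *ext;
};

/* How many elements fit in one page-sized chunk. */
#define DEMO_PER_PAGE (DEMO_PAGE_SIZE / sizeof(struct demo_stream_out))

struct demo_stream {
	struct demo_stream_out **pages;	/* one pointer per page-sized chunk */
	size_t npages;
	uint16_t outcnt;
};

/* Release every chunk and the chunk table. */
void demo_stream_free(struct demo_stream *s)
{
	size_t i;

	if (!s->pages)
		return;
	for (i = 0; i < s->npages; i++)
		free(s->pages[i]);	/* free(NULL) is a no-op */
	free(s->pages);
	s->pages = NULL;
}

/* Allocate zeroed storage for outcnt elements, one page at a time. */
int demo_stream_alloc(struct demo_stream *s, uint16_t outcnt)
{
	size_t i;

	s->outcnt = outcnt;
	s->npages = (outcnt + DEMO_PER_PAGE - 1) / DEMO_PER_PAGE;
	s->pages = calloc(s->npages ? s->npages : 1, sizeof(*s->pages));
	if (!s->pages)
		return -1;
	for (i = 0; i < s->npages; i++) {
		s->pages[i] = calloc(1, DEMO_PAGE_SIZE);
		if (!s->pages[i]) {
			demo_stream_free(s);
			return -1;
		}
	}
	return 0;
}

/* Accessor: callers never index the underlying storage directly. */
struct demo_stream_out *demo_so(struct demo_stream *s, uint16_t sid)
{
	assert(sid < s->outcnt);
	return &s->pages[sid / DEMO_PER_PAGE][sid % DEMO_PER_PAGE];
}
```

With this layout no single allocation is larger than one page, and
switching to a different backing store would only require changing
demo_so() and the alloc/free helpers.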

Oleg Babin (2):
  net/sctp: Make wrappers for accessing in/out streams
  net/sctp: Replace in/out stream arrays with flex_array

 include/net/sctp/structs.h   |  31 +++++---
 net/sctp/chunk.c             |   6 +-
 net/sctp/outqueue.c          |  11 +--
 net/sctp/socket.c            |   4 +-
 net/sctp/stream.c            | 165 +++++++++++++++++++++++++++++--------------
 net/sctp/stream_interleave.c |   2 +-
 net/sctp/stream_sched.c      |  13 ++--
 net/sctp/stream_sched_prio.c |  22 +++---
 net/sctp/stream_sched_rr.c   |   8 +--
 9 files changed, 167 insertions(+), 95 deletions(-)

-- 
1.8.3.1

^ permalink raw reply

* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
From: Jesus Sanchez-Palencia @ 2018-04-23 18:21 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	intel-wired-lan, anna-maria, henrik, john.stultz, levi.pearson,
	edumazet, willemb, mlichvar
In-Reply-To: <alpine.DEB.2.21.1803211407520.3754@nanos.tec.linutronix.de>

Hi Thomas,


On 03/21/2018 06:46 AM, Thomas Gleixner wrote:
> On Tue, 6 Mar 2018, Jesus Sanchez-Palencia wrote:
>> +struct tbs_sched_data {
>> +	bool sorting;
>> +	int clockid;
>> +	int queue;
>> +	s32 delta; /* in ns */
>> +	ktime_t last; /* The txtime of the last skb sent to the netdevice. */
>> +	struct rb_root head;
> 
> Hmm. You are reimplementing timerqueue open coded. Have you checked whether
> you could reuse the timerqueue implementation?
> 
> That requires to add a timerqueue node to struct skbuff
> 
> @@ -671,7 +671,8 @@ struct sk_buff {
>  				unsigned long		dev_scratch;
>  			};
>  		};
> -		struct rb_node	rbnode; /* used in netem & tcp stack */
> +		struct rb_node		rbnode; /* used in netem & tcp stack */
> +		struct timerqueue_node	tqnode;
>  	};
>  	struct sock		*sk;
> 
> Then you can use timerqueue_head in your scheduler data and all the open
> coded rbtree handling goes away.


I just noticed that doing the above increases the size of struct sk_buff by 8
bytes: struct timerqueue_node is 32 bytes long while struct rb_node is only
24 bytes long.
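
As a rough illustration of where the 8 bytes come from, here is a
user-space sketch with mock structures. The demo_* types are
hypothetical stand-ins, not the kernel definitions; they only assume
that an rb_node is three pointer-sized words and that a timerqueue
node is an rb_node plus a 64-bit expiry. A union is as large as its
largest member, so adding the bigger member grows it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mock of struct rb_node: parent/color word plus two child pointers. */
struct demo_rb_node {
	uintptr_t parent_color;
	struct demo_rb_node *right;
	struct demo_rb_node *left;
};

/* Mock of struct timerqueue_node: an rb_node plus a 64-bit expiry time. */
struct demo_timerqueue_node {
	struct demo_rb_node node;
	int64_t expires;
};

/* The union as it is today (simplified to the rb_node member only). */
union demo_before {
	struct demo_rb_node rbnode;
};

/* The union with the suggested timerqueue member added. */
union demo_after {
	struct demo_rb_node rbnode;
	struct demo_timerqueue_node tqnode;
};

/* Bytes by which the union grows when tqnode is added. */
size_t demo_union_growth(void)
{
	assert(sizeof(union demo_after) >= sizeof(union demo_before));
	return sizeof(union demo_after) - sizeof(union demo_before);
}
```

On a typical 64-bit build the mock sizes are 24 and 32 bytes, matching
the numbers above.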

Given the feedback we got here before against touching struct sk_buff at all for
non-generic use cases, I will keep the implementation of sch_tbs.c as is, thus
keeping the open-coded version for now, ok?

Thanks,
Jesus


(...)

^ permalink raw reply
