Netdev List


* Re: tools/bpf regression causing samples/bpf/ to hang
From: Björn Töpel @ 2018-09-11 19:01 UTC (permalink / raw)
  To: yhs; +Cc: Netdev, ast, Daniel Borkmann, Jesper Dangaard Brouer
In-Reply-To: <a5a23097-c59f-42a2-eeff-050b825cde11@fb.com>

On Tue, 11 Sep 2018 at 20:21, Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 9/11/18 10:15 AM, Björn Töpel wrote:
> > On Tue, 11 Sep 2018 at 18:47, Yonghong Song <yhs@fb.com> wrote:
> >>
> >>
> >>
> >> On 9/11/18 4:11 AM, Björn Töpel wrote:
> >>> Hi Yonghong, I tried to run the XDP samples from the bpf-next tip
> >>> today, and was hit by a regression.
> >>>
> >>> Commit f7010770fbac ("tools/bpf: move bpf/lib netlink related
> >>> functions into a new file") adds a while(1) around the recv call in
> >>> bpf_set_link_xdp_fd, which makes that call get stuck in an infinite
> >>> loop.
> >>>
> >>> I simply removed the loop, and that solved my problem (patch below).
> >>>
> >>> However, I don't know if removing the loop would break bpftool for
> >>> you. If not, I can submit the patch as a proper one for bpf-next.
> >>
> >> Hi, Björn, thanks for reporting the problem.
> >> The while loop is needed since the "recv" syscall buffer size
> >> may not be big enough to hold all the returned information, in
> >> which cases, multiple "recv" calls are needed.
> >>
> >> Could you try the following patch to see whether it fixed your
> >> issue? Thanks!
> >>
> >
> > Nope, it doesn't -- but if you move that hunk after the for-loop it works.
>
> Could you try this patch?
>

Works! Thanks!

Tested-by: Björn Töpel <bjorn.topel@intel.com>

> commit 9a7fb19899ce87594fe8012f8a23fc8fc7b6b764 (HEAD -> fix)
> Author: Yonghong Song <yhs@fb.com>
> Date:   Tue Sep 11 08:58:20 2018 -0700
>
>      tools/bpf: fix a netlink recv issue
>
>      Commit f7010770fbac ("tools/bpf: move bpf/lib netlink related
>      functions into a new file") introduced a while loop for the
>      netlink recv path. This while loop is needed since the
>      buffer in recv syscall may not be enough to hold all the
>      information and in such cases multiple recv calls are needed.
>
>      The above commit introduced a bug: the while loop may
>      block on the recv syscall when no more messages are
>      expected. The netlink message header flag NLM_F_MULTI
>      indicates that more messages are expected, and this patch
>      fixes the bug by issuing further recv syscalls only if a
>      multipart message is expected.
>
>      The patch also adds a fix for a message length of 0.
>      When netlink recv returns a message length of 0, no more
>      messages carrying data will follow, so the while loop
>      can end.
>
>      Fixes: f7010770fbac ("tools/bpf: move bpf/lib netlink related
> functions into a new file")
>      Reported-by: Björn Töpel <bjorn.topel@intel.com>
>      Signed-off-by: Yonghong Song <yhs@fb.com>
>
> diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c
> index 469e068dd0c5..fde1d7bf8199 100644
> --- a/tools/lib/bpf/netlink.c
> +++ b/tools/lib/bpf/netlink.c
> @@ -65,18 +65,23 @@ static int bpf_netlink_recv(int sock, __u32 nl_pid,
> int seq,
>                              __dump_nlmsg_t _fn, dump_nlmsg_t fn,
>                              void *cookie)
>   {
> +       bool multipart = true;
>          struct nlmsgerr *err;
>          struct nlmsghdr *nh;
>          char buf[4096];
>          int len, ret;
>
> -       while (1) {
> +       while (multipart) {
> +               multipart = false;
>                  len = recv(sock, buf, sizeof(buf), 0);
>                  if (len < 0) {
>                          ret = -errno;
>                          goto done;
>                  }
>
> +               if (len == 0)
> +                       break;
> +
>                  for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
>                       nh = NLMSG_NEXT(nh, len)) {
>                          if (nh->nlmsg_pid != nl_pid) {
> @@ -87,6 +92,8 @@ static int bpf_netlink_recv(int sock, __u32 nl_pid,
> int seq,
>                                  ret = -LIBBPF_ERRNO__INVSEQ;
>                                  goto done;
>                          }
> +                       if (nh->nlmsg_flags & NLM_F_MULTI)
> +                               multipart = true;
>                          switch (nh->nlmsg_type) {
>                          case NLMSG_ERROR:
>                                  err = (struct nlmsgerr *)NLMSG_DATA(nh);
>
>
> >
> > Cheers,
> > Björn
> >
> >> commit 3eb1c0249dfc3ea4ad61aa223dce32262af7e049 (HEAD -> fix)
> >> Author: Yonghong Song <yhs@fb.com>
> >> Date:   Tue Sep 11 08:58:20 2018 -0700
> >>
> >>       tools/bpf: fix a netlink recv issue
> >>
> >>       Commit f7010770fbac ("tools/bpf: move bpf/lib netlink related
> >>       functions into a new file") introduced a while loop for the
> >>       netlink recv path. This while loop is needed since the
> >>       buffer in recv syscall may not be big enough to hold all the
> >>       information and in such cases multiple recv calls are needed.
> >>
> >>       When netlink recv returns a message length of 0, no more
> >>       messages carrying data will follow, so the while loop
> >>       can end.
> >>
> >>       Fixes: f7010770fbac ("tools/bpf: move bpf/lib netlink related
> >> functions into a new file")
> >>       Reported-by: Björn Töpel <bjorn.topel@intel.com>
> >>       Signed-off-by: Yonghong Song <yhs@fb.com>
> >>
> >> diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c
> >> index 469e068dd0c5..37827319a50a 100644
> >> --- a/tools/lib/bpf/netlink.c
> >> +++ b/tools/lib/bpf/netlink.c
> >> @@ -77,6 +77,9 @@ static int bpf_netlink_recv(int sock, __u32 nl_pid,
> >> int seq,
> >>                           goto done;
> >>                   }
> >>
> >> +               if (len == 0)
> >> +                       break;
> >> +
> >>                   for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
> >>                        nh = NLMSG_NEXT(nh, len)) {
> >>                           if (nh->nlmsg_pid != nl_pid) {
> >>
> >>
> >>>
> >>> Thanks!
> >>> Björn
> >>>
> >>> From: Björn Töpel <bjorn.topel@intel.com>
> >>> Date: Tue, 11 Sep 2018 12:35:44 +0200
> >>> Subject: [PATCH] tools/bpf: remove loop around netlink recv
> >>>
> >>> Commit f7010770fbac ("tools/bpf: move bpf/lib netlink related
> >>> functions into a new file") moved the bpf_set_link_xdp_fd and split it
> >>> up into multiple functions. The added receive function
> >>> bpf_netlink_recv added a loop around the recv syscall leading to
> >>> multiple recv calls. This caused all XDP samples in
> >>> samples/bpf/ to stop working, since they were stuck in a blocking
> >>> recv.
> >>>
> >>> This commit removes the while (1) statement.
> >>>
> >>> Fixes: f7010770fbac ("tools/bpf: move bpf/lib netlink related
> >>> functions into a new file")
> >>> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> >>> ---
> >>>    tools/lib/bpf/netlink.c | 64 ++++++++++++++++++++---------------------
> >>>    1 file changed, 31 insertions(+), 33 deletions(-)
> >>>
> >>> diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c
> >>> index 469e068dd0c5..0eae1fbf46c6 100644
> >>> --- a/tools/lib/bpf/netlink.c
> >>> +++ b/tools/lib/bpf/netlink.c
> >>> @@ -70,41 +70,39 @@ static int bpf_netlink_recv(int sock, __u32 nl_pid, int seq,
> >>>        char buf[4096];
> >>>        int len, ret;
> >>>
> >>> -    while (1) {
> >>> -        len = recv(sock, buf, sizeof(buf), 0);
> >>> -        if (len < 0) {
> >>> -            ret = -errno;
> >>> +    len = recv(sock, buf, sizeof(buf), 0);
> >>> +    if (len < 0) {
> >>> +        ret = -errno;
> >>> +        goto done;
> >>> +    }
> >>> +
> >>> +    for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
> >>> +         nh = NLMSG_NEXT(nh, len)) {
> >>> +        if (nh->nlmsg_pid != nl_pid) {
> >>> +            ret = -LIBBPF_ERRNO__WRNGPID;
> >>>                goto done;
> >>>            }
> >>> -
> >>> -        for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
> >>> -             nh = NLMSG_NEXT(nh, len)) {
> >>> -            if (nh->nlmsg_pid != nl_pid) {
> >>> -                ret = -LIBBPF_ERRNO__WRNGPID;
> >>> -                goto done;
> >>> -            }
> >>> -            if (nh->nlmsg_seq != seq) {
> >>> -                ret = -LIBBPF_ERRNO__INVSEQ;
> >>> -                goto done;
> >>> -            }
> >>> -            switch (nh->nlmsg_type) {
> >>> -            case NLMSG_ERROR:
> >>> -                err = (struct nlmsgerr *)NLMSG_DATA(nh);
> >>> -                if (!err->error)
> >>> -                    continue;
> >>> -                ret = err->error;
> >>> -                nla_dump_errormsg(nh);
> >>> -                goto done;
> >>> -            case NLMSG_DONE:
> >>> -                return 0;
> >>> -            default:
> >>> -                break;
> >>> -            }
> >>> -            if (_fn) {
> >>> -                ret = _fn(nh, fn, cookie);
> >>> -                if (ret)
> >>> -                    return ret;
> >>> -            }
> >>> +        if (nh->nlmsg_seq != seq) {
> >>> +            ret = -LIBBPF_ERRNO__INVSEQ;
> >>> +            goto done;
> >>> +        }
> >>> +        switch (nh->nlmsg_type) {
> >>> +        case NLMSG_ERROR:
> >>> +            err = (struct nlmsgerr *)NLMSG_DATA(nh);
> >>> +            if (!err->error)
> >>> +                continue;
> >>> +            ret = err->error;
> >>> +            nla_dump_errormsg(nh);
> >>> +            goto done;
> >>> +        case NLMSG_DONE:
> >>> +            return 0;
> >>> +        default:
> >>> +            break;
> >>> +        }
> >>> +        if (_fn) {
> >>> +            ret = _fn(nh, fn, cookie);
> >>> +            if (ret)
> >>> +                return ret;
> >>>            }
> >>>        }
> >>>        ret = 0;
> >>>
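
For readers following the thread, the termination rule from the accepted fix can be sketched in isolation. This is only an illustrative userspace model, not the libbpf code: nl_batch_wants_more() and nl_put() are hypothetical helpers invented here, and the buffers are built in memory instead of coming from recv(). It assumes a Linux system where <linux/netlink.h> is available.

```c
#include <assert.h>
#include <linux/netlink.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper (not part of libbpf): walk one buffer as returned by
 * recv() and report whether the caller should issue another recv().
 * A length of 0 yields false, matching the "len == 0" break in the patch. */
static bool nl_batch_wants_more(void *buf, int len)
{
	struct nlmsghdr *nh;
	bool multipart = false;

	for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
	     nh = NLMSG_NEXT(nh, len)) {
		/* NLM_F_MULTI means further parts are expected ... */
		if (nh->nlmsg_flags & NLM_F_MULTI)
			multipart = true;
		/* ... until NLMSG_DONE closes the multipart sequence. */
		if (nh->nlmsg_type == NLMSG_DONE)
			return false;
	}
	return multipart;
}

/* Hypothetical helper for building sample input: append a header-only
 * netlink message at byte offset off and return the new total length. */
static int nl_put(void *buf, int off, unsigned short type,
		  unsigned short flags)
{
	struct nlmsghdr *nh = (struct nlmsghdr *)((char *)buf + off);

	assert(off % (int)NLMSG_ALIGNTO == 0);
	memset(nh, 0, NLMSG_HDRLEN);
	nh->nlmsg_len = NLMSG_HDRLEN;
	nh->nlmsg_type = type;
	nh->nlmsg_flags = flags;
	return off + (int)NLMSG_ALIGN(nh->nlmsg_len);
}
```

A single reply without NLM_F_MULTI reports false, so the caller never enters a second, blocking recv(); that is the case the XDP samples were hitting.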


* Re: kernels > v4.12 oops/crash with ipsec-traffic: bisected to b838d5e1c5b6e57b10ec8af2268824041e3ea911: ipv4: mark DST_NOGC and remove the operation of dst_free()
From: Tobias Hommel @ 2018-09-11 19:02 UTC (permalink / raw)
  To: Wolfgang Walter
  Cc: Steffen Klassert, Kristian Evensen, Network Development, weiwan,
	edumazet
In-Reply-To: <2028376.H0yIdbXTXp@stwm.de>

> > Subject: [PATCH RFC] xfrm: Fix NULL pointer dereference when skb_dst_force
> > clears the dst_entry.
> > 
> > Since commit 222d7dbd258d ("net: prevent dst uses after free")
> > skb_dst_force() might clear the dst_entry attached to the skb.
> > The xfrm code doesn't expect this to happen, so we crash with
> > a NULL pointer dereference in this case. Fix it by checking
> > skb_dst(skb) for NULL after skb_dst_force() and dropping the packet
> > in case the dst_entry was cleared.
> > 
> > Fixes: 222d7dbd258d ("net: prevent dst uses after free")
> > Reported-by: Tobias Hommel <netdev-list@genoetigt.de>
> > Reported-by: Kristian Evensen <kristian.evensen@gmail.com>
> > Reported-by: Wolfgang Walter <linux@stwm.de>
> > Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
> > ---
> >  net/xfrm/xfrm_output.c | 4 ++++
> >  net/xfrm/xfrm_policy.c | 4 ++++
> >  2 files changed, 8 insertions(+)
> > 
> > diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
> > index 89b178a78dc7..36d15a38ce5e 100644
> > --- a/net/xfrm/xfrm_output.c
> > +++ b/net/xfrm/xfrm_output.c
> > @@ -101,6 +101,10 @@ static int xfrm_output_one(struct sk_buff *skb, int
> > err) spin_unlock_bh(&x->lock);
> > 
> >  		skb_dst_force(skb);
> > +		if (!skb_dst(skb)) {
> > +			XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTERROR);
> > +			goto error_nolock;
> > +		}
> > 
> >  		if (xfrm_offload(skb)) {
> >  			x->type_offload->encap(x, skb);
> > diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
> > index 7c5e8978aeaa..626e0f4d1749 100644
> > --- a/net/xfrm/xfrm_policy.c
> > +++ b/net/xfrm/xfrm_policy.c
> > @@ -2548,6 +2548,10 @@ int __xfrm_route_forward(struct sk_buff *skb,
> > unsigned short family) }
> > 
> >  	skb_dst_force(skb);
> > +	if (!skb_dst(skb)) {
> > +		XFRM_INC_STATS(net, LINUX_MIB_XFRMFWDHDRERROR);
> > +		return 0;
> > +	}
> > 
> >  	dst = xfrm_lookup(net, skb_dst(skb), &fl, NULL, XFRM_LOOKUP_QUEUE);
> >  	if (IS_ERR(dst)) {
> 
> This patch fixes the problem here.
> 
> XfrmFwdHdrError reaches around 80 at the very beginning and stays there. Probably
> this happens when some routes are changed/set at that point.
> 
> Regards and thanks,

Same here, we've now been running stable for ~6 hours; XfrmFwdHdrError is at 220.
This is less than 1 lost packet per minute, which seems to be okay for now.
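
For context, the check-after-force pattern from the patch can be modelled in plain userspace C. Everything below is a toy with hypothetical names (toy_dst, toy_pkt, toy_dst_force(), toy_forward()); it is not kernel code. It assumes, for illustration only, that forcing a reference fails when the entry's refcount has already dropped to zero, in which case the pointer is cleared and the caller must drop the packet and bump an error counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for dst_entry and sk_buff; names are hypothetical. */
struct toy_dst { int refcnt; };
struct toy_pkt { struct toy_dst *dst; };

/* Assumption for this sketch: a reference can only be taken while the
 * count is still non-zero (an "increment unless zero" rule). */
static bool toy_dst_hold_safe(struct toy_dst *d)
{
	if (d->refcnt == 0)
		return false;
	d->refcnt++;
	return true;
}

/* Models the behaviour described in the patch: forcing a reference
 * may clear the dst pointer on the packet. */
static void toy_dst_force(struct toy_pkt *p)
{
	if (p->dst && !toy_dst_hold_safe(p->dst))
		p->dst = NULL;
}

/* Caller side: check for NULL after forcing, count the error and drop.
 * Returns true if the packet may be forwarded. */
static bool toy_forward(struct toy_pkt *p, unsigned long *err_count)
{
	assert(p != NULL && err_count != NULL);
	toy_dst_force(p);
	if (!p->dst) {
		(*err_count)++;
		return false;
	}
	return true;
}
```

In this model each dropped packet increments the counter exactly once, which is how a slowly growing XfrmFwdHdrError value can be read as a count of packets lost to the race.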


* Re: [RFC] managing PHY carrier from user space
From: Joakim Tjernlund @ 2018-09-11 19:21 UTC (permalink / raw)
  To: netdev@vger.kernel.org, f.fainelli@gmail.com, andrew@lunn.ch
In-Reply-To: <b1df2baf-4c6b-6645-4b6f-648ff22949d2@gmail.com>

On Tue, 2018-09-11 at 09:56 -0700, Florian Fainelli wrote:
> On 09/11/2018 09:41 AM, Joakim Tjernlund wrote:
> > I am looking for a way to induce carrier state from user space, primarily
> > for Fixed PHYs as these are always up. ifplugd/dhcp etc. does not behave properly
> > if the link is up when it really isn't.
> 
> Was my suggestion in my email to you somehow not working? This is
> obviously not acceptable for upstream, there is no reason, even for a
> fixed PHY, to attempt to mangle with the carrier state for any
> reasonable production purposes.

Ohh, I never got that mail. Scanning the netdev archives I found it though, thanks.
I will go the ndo_change_carrier() route and see whether I can work out what to
do w.r.t. the fixed link status callback.

 Thanks
         Jocke




* [PATCH net-next 1/5] bpf: use __GFP_COMP while allocating page
From: Tushar Dave @ 2018-09-11 19:38 UTC (permalink / raw)
  To: ast, daniel, davem, santosh.shilimkar, jakub.kicinski,
	quentin.monnet, jiong.wang, sandipan, john.fastabend, kafai, rdna,
	yhs, netdev, rds-devel, sowmini.varadhan
In-Reply-To: <1536694684-3200-1-git-send-email-tushar.n.dave@oracle.com>

Helper bpf_msg_pull_data() can allocate multiple pages while
linearizing multiple scatterlist elements into one shared page.
However, if the shared page has size > PAGE_SIZE, using
copy_page_to_iter() causes the warning below.

e.g.
[ 6367.019832] WARNING: CPU: 2 PID: 7410 at lib/iov_iter.c:825
page_copy_sane.part.8+0x0/0x8

To avoid the above warning, use __GFP_COMP while allocating multiple
contiguous pages.

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
---
 net/core/filter.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index d301134..0b40f95 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2344,7 +2344,8 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
 	if (unlikely(bytes_sg_total > copy))
 		return -EINVAL;
 
-	page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC, get_order(copy));
+	page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC | __GFP_COMP,
+			   get_order(copy));
 	if (unlikely(!page))
 		return -ENOMEM;
 	p = page_address(page);
-- 
1.8.3.1
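
As a rough illustration of when the allocation above spans more than one page, the order computation can be mimicked in userspace. This is a sketch under stated assumptions: a 4 KiB page size (TOY_PAGE_SHIFT of 12), and toy_get_order() / toy_needs_multi_page() are hypothetical stand-ins written for this note, not the kernel's get_order().

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 4 KiB pages. The real value is architecture dependent. */
#define TOY_PAGE_SHIFT 12

/* Userspace imitation of an order computation: the smallest order such
 * that (1 << order) pages hold size bytes. size must be non-zero. */
static int toy_get_order(unsigned long size)
{
	int order = 0;

	assert(size > 0);
	size = (size - 1) >> TOY_PAGE_SHIFT;
	while (size) {
		order++;
		size >>= 1;
	}
	return order;
}

/* An order above 0 means several contiguous pages are requested, which
 * is the case the commit message says triggers the warning. */
static bool toy_needs_multi_page(unsigned long copy)
{
	return toy_get_order(copy) > 0;
}
```

With these assumptions a copy of up to 4096 bytes stays at order 0, while anything larger asks for a higher-order allocation, which is the situation the commit message describes.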


* [PATCH net-next 0/5] eBPF and struct scatterlist
From: Tushar Dave @ 2018-09-11 19:37 UTC (permalink / raw)
  To: ast, daniel, davem, santosh.shilimkar, jakub.kicinski,
	quentin.monnet, jiong.wang, sandipan, john.fastabend, kafai, rdna,
	yhs, netdev, rds-devel, sowmini.varadhan

This non-RFC patch-set is a follow-up to the RFC v3 that was sent earlier.
(https://www.spinics.net/lists/netdev/msg519380.html)

The following changes are made in this patch-set,
RFC v3 -> this patch-set:

- "RFC v3 patch 3" is removed as it is no longer needed because
bpf_msg_pull_data() has all required bugs fixed. Thanks Daniel.

- Use __GFP_COMP while allocating pages in bpf_msg_pull_data() to avoid
the page_copy_sane warning while using the sg page in copy_page_to_iter()
(patch 1)

- In sg_filter_run(), after the BPF prog returns, mb.sg_data may have
changed while linearizing multiple scatterlist entries into one.
Therefore, make sure to update the original sg and mark the sg end
correctly before returning. (patch 3)

- A BPF program can write/modify an RDS packet; in that case the
modified packet data is represented in the scatterlist. Therefore use
the scatterlist (not the skb) while copying the payload back to userspace.
Also carefully release the scatterlist and associated pages, e.g.
get_page()/put_page() (patch 4)

Details:
--------
eBPF: Patch 1 uses __GFP_COMP while allocating pages in bpf_msg_pull_data()
to avoid the page_copy_sane warning.

eBPF: Patch 2 adds a new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER
which uses the existing socket filter infrastructure for bpf program
attach and load. An eBPF program of type BPF_PROG_TYPE_SOCKET_SG_FILTER
deals with struct scatterlist as bpf context, in contrast to
BPF_PROG_TYPE_SOCKET_FILTER which deals with struct sk_buff. This new eBPF
program type allows a socket filter to run on packet data that is in the
form of struct scatterlist.

eBPF: Patch 3 adds sg_filter_run() that runs BPF_PROG_TYPE_SOCKET_SG_FILTER.

RDS: Patch 4 allows rds_recv_incoming to invoke a socket filter program
which deals with struct scatterlist.

bpf/samples: Patch 5 adds a socket filter eBPF sample program that uses
patches 1 to 4. The sample program opens an RDS socket, attaches an eBPF
program (socksg i.e. BPF_PROG_TYPE_SOCKET_SG_FILTER) to the RDS socket and
uses the bpf_msg_pull_data() helper to inspect RDS packet data. For testing,
the current sample program only prints the first few bytes of packet data.

Background:
-----------
The motivation for this work is to allow eBPF based firewalling for
kernel modules that do not always get their packet as an sk_buff from
their downlink drivers. One such instance of this use-case is RDS, which
can be run either over IB (driver RDMA's a scatterlist to the RDS module)
or over TCP (TCP passes an sk_buff to the RDS module).

This patchset uses the existing socket filter infrastructure and extends it
with a new eBPF program type that deals with struct scatterlist.
The existing bpf helper bpf_msg_pull_data() is used to inspect packet data
that is in the form of struct scatterlist. For RDS, the integrated approach
treats the scatterlist as the common denominator, and allows the
application to write a filter for processing a scatterlist.

Testing:
---------
To confirm data accuracy and results, RDS packets of various sizes have
been tested with the socksg program along with various start and end values
for bpf_msg_pull_data(). All such tests show accurate results.

Thanks.

-Tushar

Tushar Dave (5):
  bpf: use __GFP_COMP while allocating page
  eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER
  ebpf: Add sg_filter_run()
  rds: invoke socket sg filter attached to rds socket
  ebpf: Add sample ebpf program for SOCKET_SG_FILTER

 include/linux/bpf_types.h      |   1 +
 include/linux/filter.h         |   8 +
 include/uapi/linux/bpf.h       |   7 +
 kernel/bpf/syscall.c           |   1 +
 kernel/bpf/verifier.c          |   1 +
 net/core/filter.c              |  93 ++++++++++-
 net/rds/ib.c                   |   1 +
 net/rds/ib.h                   |   1 +
 net/rds/ib_recv.c              |  12 ++
 net/rds/rds.h                  |   1 +
 net/rds/recv.c                 |  12 ++
 net/rds/tcp.c                  |   1 +
 net/rds/tcp.h                  |   2 +
 net/rds/tcp_recv.c             | 108 ++++++++++++-
 samples/bpf/Makefile           |   3 +
 samples/bpf/bpf_load.c         |  11 +-
 samples/bpf/rds_filter_kern.c  |  42 +++++
 samples/bpf/rds_filter_user.c  | 339 +++++++++++++++++++++++++++++++++++++++++
 tools/bpf/bpftool/prog.c       |   1 +
 tools/include/uapi/linux/bpf.h |   7 +
 tools/lib/bpf/libbpf.c         |   3 +
 tools/lib/bpf/libbpf.h         |   2 +
 22 files changed, 650 insertions(+), 7 deletions(-)
 create mode 100644 samples/bpf/rds_filter_kern.c
 create mode 100644 samples/bpf/rds_filter_user.c

-- 
1.8.3.1
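
To make the "linearize a byte range across scatterlist entries" idea from the cover letter concrete, here is a small userspace toy. It is only a model: toy_seg and toy_pull_data() are hypothetical names, plain memory segments stand in for struct scatterlist, and the real bpf_msg_pull_data() helper rewrites the scatterlist in place, which this sketch does not attempt.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one scatterlist entry. */
struct toy_seg {
	const char *data;
	size_t len;
};

/* Copy bytes [start, end) of the logical concatenation of all segments
 * into out, which must hold at least end - start bytes.
 * Returns the number of bytes copied, or -1 for an invalid range. */
static long toy_pull_data(const struct toy_seg *segs, size_t nsegs,
			  size_t start, size_t end, char *out)
{
	size_t total = 0, off = 0, copied = 0, i;

	assert(segs != NULL && out != NULL);
	for (i = 0; i < nsegs; i++)
		total += segs[i].len;
	if (start >= end || end > total)
		return -1;

	for (i = 0; i < nsegs && off < end; i++) {
		size_t seg_start = off;
		size_t seg_end = off + segs[i].len;
		/* Overlap of this segment with the requested range. */
		size_t lo = start > seg_start ? start : seg_start;
		size_t hi = end < seg_end ? end : seg_end;

		if (lo < hi) {
			memcpy(out + copied, segs[i].data + (lo - seg_start),
			       hi - lo);
			copied += hi - lo;
		}
		off = seg_end;
	}
	return (long)copied;
}
```

The start and end arguments play the same role as the start and end values mentioned under Testing: they select which bytes end up contiguous and therefore directly readable by the filter.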


* [PATCH net-next 2/5] eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER
From: Tushar Dave @ 2018-09-11 19:38 UTC (permalink / raw)
  To: ast, daniel, davem, santosh.shilimkar, jakub.kicinski,
	quentin.monnet, jiong.wang, sandipan, john.fastabend, kafai, rdna,
	yhs, netdev, rds-devel, sowmini.varadhan
In-Reply-To: <1536694684-3200-1-git-send-email-tushar.n.dave@oracle.com>

Add a new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER which uses the
existing socket filter infrastructure for bpf program attach and load.
A SOCKET_SG_FILTER eBPF program receives struct scatterlist as bpf context,
in contrast to SOCKET_FILTER which deals with struct sk_buff. This is useful
for kernel entities that don't have an skb to represent packet data but
want to run an eBPF socket filter on packet data that is in the form of
struct scatterlist, e.g. IB/RDMA.

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 include/linux/bpf_types.h      |  1 +
 include/uapi/linux/bpf.h       |  1 +
 kernel/bpf/syscall.c           |  1 +
 kernel/bpf/verifier.c          |  1 +
 net/core/filter.c              | 55 ++++++++++++++++++++++++++++++++++++++++--
 samples/bpf/bpf_load.c         | 11 ++++++---
 tools/bpf/bpftool/prog.c       |  1 +
 tools/include/uapi/linux/bpf.h |  1 +
 tools/lib/bpf/libbpf.c         |  3 +++
 tools/lib/bpf/libbpf.h         |  2 ++
 10 files changed, 72 insertions(+), 5 deletions(-)

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index cd26c09..7dc1503 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -16,6 +16,7 @@
 BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops)
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_SKB, sk_skb)
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_MSG, sk_msg)
+BPF_PROG_TYPE(BPF_PROG_TYPE_SOCKET_SG_FILTER, socksg_filter)
 #endif
 #ifdef CONFIG_BPF_EVENTS
 BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 66917a4..6ec1e32 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -152,6 +152,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_LWT_SEG6LOCAL,
 	BPF_PROG_TYPE_LIRC_MODE2,
 	BPF_PROG_TYPE_SK_REUSEPORT,
+	BPF_PROG_TYPE_SOCKET_SG_FILTER,
 };
 
 enum bpf_attach_type {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 3c9636f..5f302b7 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1361,6 +1361,7 @@ static int bpf_prog_load(union bpf_attr *attr)
 
 	if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
 	    type != BPF_PROG_TYPE_CGROUP_SKB &&
+	    type != BPF_PROG_TYPE_SOCKET_SG_FILTER &&
 	    !capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f4ff0c5..17fc4d2 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1234,6 +1234,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
 	case BPF_PROG_TYPE_LWT_XMIT:
 	case BPF_PROG_TYPE_SK_SKB:
 	case BPF_PROG_TYPE_SK_MSG:
+	case BPF_PROG_TYPE_SOCKET_SG_FILTER:
 		if (meta)
 			return meta->pkt_access;
 
diff --git a/net/core/filter.c b/net/core/filter.c
index 0b40f95..469c488 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1140,7 +1140,8 @@ static void bpf_release_orig_filter(struct bpf_prog *fp)
 
 static void __bpf_prog_release(struct bpf_prog *prog)
 {
-	if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER) {
+	if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER ||
+	    prog->type == BPF_PROG_TYPE_SOCKET_SG_FILTER) {
 		bpf_prog_put(prog);
 	} else {
 		bpf_release_orig_filter(prog);
@@ -1539,10 +1540,16 @@ int sk_reuseport_attach_filter(struct sock_fprog *fprog, struct sock *sk)
 
 static struct bpf_prog *__get_bpf(u32 ufd, struct sock *sk)
 {
+	struct bpf_prog *prog;
+
 	if (sock_flag(sk, SOCK_FILTER_LOCKED))
 		return ERR_PTR(-EPERM);
 
-	return bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_FILTER);
+	prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_FILTER);
+	if (IS_ERR(prog))
+		prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_SG_FILTER);
+
+	return prog;
 }
 
 int sk_attach_bpf(u32 ufd, struct sock *sk)
@@ -4935,6 +4942,17 @@ bool bpf_helper_changes_pkt_data(void *func)
 }
 
 static const struct bpf_func_proto *
+socksg_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	switch (func_id) {
+	case BPF_FUNC_msg_pull_data:
+		return &bpf_msg_pull_data_proto;
+	default:
+		return bpf_base_func_proto(func_id);
+	}
+}
+
+static const struct bpf_func_proto *
 tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
 	switch (func_id) {
@@ -6753,6 +6771,30 @@ static u32 sk_skb_convert_ctx_access(enum bpf_access_type type,
 	return insn - insn_buf;
 }
 
+static u32 socksg_filter_convert_ctx_access(enum bpf_access_type type,
+					    const struct bpf_insn *si,
+					    struct bpf_insn *insn_buf,
+					    struct bpf_prog *prog,
+					    u32 *target_size)
+{
+	struct bpf_insn *insn = insn_buf;
+
+	switch (si->off) {
+	case offsetof(struct sk_msg_md, data):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_msg_buff, data),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct sk_msg_buff, data));
+		break;
+	case offsetof(struct sk_msg_md, data_end):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_msg_buff, data_end),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct sk_msg_buff, data_end));
+		break;
+	}
+
+	return insn - insn_buf;
+}
+
 static u32 sk_msg_convert_ctx_access(enum bpf_access_type type,
 				     const struct bpf_insn *si,
 				     struct bpf_insn *insn_buf,
@@ -6891,6 +6933,15 @@ static u32 sk_msg_convert_ctx_access(enum bpf_access_type type,
 	.test_run		= bpf_prog_test_run_skb,
 };
 
+const struct bpf_verifier_ops socksg_filter_verifier_ops = {
+	.get_func_proto         = socksg_filter_func_proto,
+	.is_valid_access        = sk_msg_is_valid_access,
+	.convert_ctx_access     = socksg_filter_convert_ctx_access,
+};
+
+const struct bpf_prog_ops socksg_filter_prog_ops = {
+};
+
 const struct bpf_verifier_ops tc_cls_act_verifier_ops = {
 	.get_func_proto		= tc_cls_act_func_proto,
 	.is_valid_access	= tc_cls_act_is_valid_access,
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 904e775..3b1697d 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -69,6 +69,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 	bool is_sockops = strncmp(event, "sockops", 7) == 0;
 	bool is_sk_skb = strncmp(event, "sk_skb", 6) == 0;
 	bool is_sk_msg = strncmp(event, "sk_msg", 6) == 0;
+	bool is_socksg = strncmp(event, "socksg", 6) == 0;
+
 	size_t insns_cnt = size / sizeof(struct bpf_insn);
 	enum bpf_prog_type prog_type;
 	char buf[256];
@@ -102,6 +104,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 		prog_type = BPF_PROG_TYPE_SK_SKB;
 	} else if (is_sk_msg) {
 		prog_type = BPF_PROG_TYPE_SK_MSG;
+	} else if (is_socksg) {
+		prog_type = BPF_PROG_TYPE_SOCKET_SG_FILTER;
 	} else {
 		printf("Unknown event '%s'\n", event);
 		return -1;
@@ -122,8 +126,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 	if (is_xdp || is_perf_event || is_cgroup_skb || is_cgroup_sk)
 		return 0;
 
-	if (is_socket || is_sockops || is_sk_skb || is_sk_msg) {
-		if (is_socket)
+	if (is_socket || is_sockops || is_sk_skb || is_sk_msg || is_socksg) {
+		if (is_socket || is_socksg)
 			event += 6;
 		else
 			event += 7;
@@ -627,7 +631,8 @@ static int do_load_bpf_file(const char *path, fixup_map_cb fixup_map)
 		    memcmp(shname, "cgroup/", 7) == 0 ||
 		    memcmp(shname, "sockops", 7) == 0 ||
 		    memcmp(shname, "sk_skb", 6) == 0 ||
-		    memcmp(shname, "sk_msg", 6) == 0) {
+		    memcmp(shname, "sk_msg", 6) == 0 ||
+		    memcmp(shname, "socksg", 6) == 0) {
 			ret = load_and_attach(shname, data->d_buf,
 					      data->d_size);
 			if (ret != 0)
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index dce960d..9c57c4e 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -74,6 +74,7 @@
 	[BPF_PROG_TYPE_RAW_TRACEPOINT]	= "raw_tracepoint",
 	[BPF_PROG_TYPE_CGROUP_SOCK_ADDR] = "cgroup_sock_addr",
 	[BPF_PROG_TYPE_LIRC_MODE2]	= "lirc_mode2",
+	[BPF_PROG_TYPE_SOCKET_SG_FILTER] = "socket_sg_filter",
 };
 
 static void print_boot_time(__u64 nsecs, char *buf, unsigned int size)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 66917a4..6ec1e32 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -152,6 +152,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_LWT_SEG6LOCAL,
 	BPF_PROG_TYPE_LIRC_MODE2,
 	BPF_PROG_TYPE_SK_REUSEPORT,
+	BPF_PROG_TYPE_SOCKET_SG_FILTER,
 };
 
 enum bpf_attach_type {
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 2abd0f1..a7ac51c 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1502,6 +1502,7 @@ static bool bpf_prog_type__needs_kver(enum bpf_prog_type type)
 	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
 	case BPF_PROG_TYPE_LIRC_MODE2:
 	case BPF_PROG_TYPE_SK_REUSEPORT:
+	case BPF_PROG_TYPE_SOCKET_SG_FILTER:
 		return false;
 	case BPF_PROG_TYPE_UNSPEC:
 	case BPF_PROG_TYPE_KPROBE:
@@ -2077,6 +2078,7 @@ static bool bpf_program__is_type(struct bpf_program *prog,
 BPF_PROG_TYPE_FNS(raw_tracepoint, BPF_PROG_TYPE_RAW_TRACEPOINT);
 BPF_PROG_TYPE_FNS(xdp, BPF_PROG_TYPE_XDP);
 BPF_PROG_TYPE_FNS(perf_event, BPF_PROG_TYPE_PERF_EVENT);
+BPF_PROG_TYPE_FNS(socket_sg_filter, BPF_PROG_TYPE_SOCKET_SG_FILTER);
 
 void bpf_program__set_expected_attach_type(struct bpf_program *prog,
 					   enum bpf_attach_type type)
@@ -2129,6 +2131,7 @@ void bpf_program__set_expected_attach_type(struct bpf_program *prog,
 	BPF_SA_PROG_SEC("cgroup/sendmsg6", BPF_CGROUP_UDP6_SENDMSG),
 	BPF_S_PROG_SEC("cgroup/post_bind4", BPF_CGROUP_INET4_POST_BIND),
 	BPF_S_PROG_SEC("cgroup/post_bind6", BPF_CGROUP_INET6_POST_BIND),
+	BPF_PROG_SEC("socksg",          BPF_PROG_TYPE_SOCKET_SG_FILTER),
 };
 
 #undef BPF_PROG_SEC
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 96c55fa..7527ea4 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -208,6 +208,7 @@ int bpf_program__set_prep(struct bpf_program *prog, int nr_instance,
 void bpf_program__set_type(struct bpf_program *prog, enum bpf_prog_type type);
 void bpf_program__set_expected_attach_type(struct bpf_program *prog,
 					   enum bpf_attach_type type);
+int bpf_program__set_socket_sg_filter(struct bpf_program *prog);
 
 bool bpf_program__is_socket_filter(struct bpf_program *prog);
 bool bpf_program__is_tracepoint(struct bpf_program *prog);
@@ -217,6 +218,7 @@ void bpf_program__set_expected_attach_type(struct bpf_program *prog,
 bool bpf_program__is_sched_act(struct bpf_program *prog);
 bool bpf_program__is_xdp(struct bpf_program *prog);
 bool bpf_program__is_perf_event(struct bpf_program *prog);
+bool bpf_program__is_socket_sg_filter(struct bpf_program *prog);
 
 /*
  * No need for __attribute__((packed)), all members of 'bpf_map_def'
-- 
1.8.3.1
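
The samples/bpf/bpf_load.c hunk above selects the program type from the ELF section-name prefix. A stripped-down sketch of that dispatch follows. The enum and toy_classify() are hypothetical and only cover the socket-related prefixes visible in the diff; the prefix strings and compare lengths are taken from the hunk.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical enum for this sketch; the real loader maps the prefixes
 * to enum bpf_prog_type values. */
enum toy_prog_kind {
	TOY_UNKNOWN,
	TOY_SOCKET,
	TOY_SOCKOPS,
	TOY_SK_SKB,
	TOY_SK_MSG,
	TOY_SOCKSG,
};

/* Classify a section name by prefix, using the same strings and
 * lengths as the strncmp() calls in the diff. */
static enum toy_prog_kind toy_classify(const char *event)
{
	assert(event != NULL);
	if (strncmp(event, "socket", 6) == 0)
		return TOY_SOCKET;
	if (strncmp(event, "sockops", 7) == 0)
		return TOY_SOCKOPS;
	if (strncmp(event, "sk_skb", 6) == 0)
		return TOY_SK_SKB;
	if (strncmp(event, "sk_msg", 6) == 0)
		return TOY_SK_MSG;
	if (strncmp(event, "socksg", 6) == 0)
		return TOY_SOCKSG;
	return TOY_UNKNOWN;
}
```

Although "socket", "sockops" and "socksg" share the "sock" stem, they already differ within the compared lengths, so the order of the checks does not affect the result.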


* [PATCH net-next 3/5] ebpf: Add sg_filter_run()
From: Tushar Dave @ 2018-09-11 19:38 UTC (permalink / raw)
  To: ast, daniel, davem, santosh.shilimkar, jakub.kicinski,
	quentin.monnet, jiong.wang, sandipan, john.fastabend, kafai, rdna,
	yhs, netdev, rds-devel, sowmini.varadhan
In-Reply-To: <1536694684-3200-1-git-send-email-tushar.n.dave@oracle.com>

When sg_filter_run() is invoked it runs the attached eBPF
prog of type BPF_PROG_TYPE_SOCKET_SG_FILTER which deals with
struct scatterlist.

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 include/linux/filter.h         |  8 ++++++++
 include/uapi/linux/bpf.h       |  6 ++++++
 net/core/filter.c              | 35 +++++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |  6 ++++++
 4 files changed, 55 insertions(+)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 6791a0a..ae664a9 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1113,4 +1113,12 @@ struct bpf_sock_ops_kern {
 					 */
 };
 
+enum __socksg_action {
+	__SOCKSG_PASS = 0,
+	__SOCKSG_DROP,
+	__SOCKSG_REDIRECT,
+};
+
+int sg_filter_run(struct sock *sk, struct scatterlist *sg);
+
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6ec1e32..1e11789 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2428,6 +2428,12 @@ enum sk_action {
 	SK_PASS,
 };
 
+enum socksg_action {
+	SOCKSG_PASS = 0,
+	SOCKSG_DROP,
+	SOCKSG_REDIRECT,
+};
+
 /* user accessible metadata for SK_MSG packet hook, new fields must
  * be added to the end of this structure
  */
diff --git a/net/core/filter.c b/net/core/filter.c
index 469c488..a3afc61 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -121,6 +121,41 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
 }
 EXPORT_SYMBOL(sk_filter_trim_cap);
 
+int sg_filter_run(struct sock *sk, struct scatterlist *sg)
+{
+	struct sk_filter *filter;
+	int result = 0;
+
+	if (!sg)
+		return result;
+
+	rcu_read_lock();
+	filter = rcu_dereference(sk->sk_filter);
+	if (filter) {
+		struct sk_msg_buff mb = {0};
+
+		memcpy(mb.sg_data, sg, sizeof(*sg) * MAX_SKB_FRAGS);
+		mb.sg_start = 0;
+		mb.sg_end = sg_nents(sg);
+		mb.data = sg_virt(sg);
+		mb.data_end = mb.data + sg->length;
+		mb.sg_copy[mb.sg_end - 1] = true;
+
+		result = BPF_PROG_RUN(filter->prog, &mb);
+
+		/* BPF prog may have changed mb.sg_data e.g. may linearize
+		 * multiple scatterlist entries into one. Therefore, make sure
+		 * to update original sg and mark the sg end.
+		 */
+		memcpy(sg, mb.sg_data, sizeof(*sg) * MAX_SKB_FRAGS);
+		sg_mark_end(&sg[mb.sg_end - 1]);
+	}
+	rcu_read_unlock();
+
+	return result;
+}
+EXPORT_SYMBOL(sg_filter_run);
+
 BPF_CALL_1(bpf_skb_get_pay_offset, struct sk_buff *, skb)
 {
 	return skb_get_poff(skb);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 6ec1e32..1e11789 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2428,6 +2428,12 @@ enum sk_action {
 	SK_PASS,
 };
 
+enum socksg_action {
+	SOCKSG_PASS = 0,
+	SOCKSG_DROP,
+	SOCKSG_REDIRECT,
+};
+
 /* user accessible metadata for SK_MSG packet hook, new fields must
  * be added to the end of this structure
  */
-- 
1.8.3.1

* [PATCH net-next 5/5] ebpf: Add sample ebpf program for SOCKET_SG_FILTER
From: Tushar Dave @ 2018-09-11 19:38 UTC (permalink / raw)
  To: ast, daniel, davem, santosh.shilimkar, jakub.kicinski,
	quentin.monnet, jiong.wang, sandipan, john.fastabend, kafai, rdna,
	yhs, netdev, rds-devel, sowmini.varadhan
In-Reply-To: <1536694684-3200-1-git-send-email-tushar.n.dave@oracle.com>

Add a sample program that shows how a socksg program is used and
attached to a socket filter. The kernel-side sample program deals with
the struct scatterlist that is passed as the bpf context.

When run in server mode, the sample RDS program opens a PF_RDS socket
and attaches the eBPF program to it. The eBPF program then uses the
bpf_msg_pull_data helper to inspect packet data contained in struct
scatterlist and returns the appropriate action code back to the kernel.
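
The "check bounds, then read the first six bytes" step that the sample
eBPF program performs can be sketched in plain userspace C. This is an
illustration only: peek_header() and PEEK_LEN are hypothetical names,
and the length comparison is written as a pointer difference, which is
the userspace-safe form of the data + 6 > data_end test used in the
eBPF program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define PEEK_LEN 6	/* same number of bytes the sample program pulls */

/*
 * Format the first PEEK_LEN bytes of [data, data_end) as space-separated
 * hex into out. Returns 0 on success, -1 if fewer than PEEK_LEN bytes
 * are available or out is too small.
 */
static int peek_header(const unsigned char *data, const unsigned char *data_end,
		       char *out, size_t outlen)
{
	size_t off = 0;
	int i, n;

	assert(data != NULL && data_end != NULL && out != NULL);

	/* Refuse to read unless all PEEK_LEN bytes are inside the buffer. */
	if (data_end < data || (size_t)(data_end - data) < PEEK_LEN)
		return -1;

	for (i = 0; i < PEEK_LEN; i++) {
		n = snprintf(out + off, outlen - off, "%s%x",
			     i ? " " : "", (unsigned int)data[i]);
		if (n < 0 || (size_t)n >= outlen - off)
			return -1;
		off += (size_t)n;
	}
	return 0;
}
```

For a payload starting with the ASCII digits "012345", the formatted
string is "30 31 32 33 34 35", matching the bytes shown in the
trace_pipe output of this commit message.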

To ease testing, RDS client functionality is also added so that users
can generate RDS packets.

Server:
[root@lab71 bpf]# ./rds_filter -s 192.168.3.71 -t tcp
running server in a loop
transport tcp
server bound to address: 192.168.3.71 port 4000
server listening on 192.168.3.71

Client:
[root@lab70 bpf]# ./rds_filter -s 192.168.3.71 -c 192.168.3.70 -t tcp
transport tcp
client bound to address: 192.168.3.70 port 25278
client sending 8192 byte message  from 192.168.3.70 to 192.168.3.71 on
port 25278
payload contains:30 31 32 33 34 35 36 37 38 39 ...

Server output:
192.168.3.71 received a packet from 192.168.3.71 of len 8192 cmsg len 0,
on port 25278
payload contains:30 31 32 33 34 35 36 37 38 39 ...
server listening on 192.168.3.71

[root@lab71 tushar]# cat /sys/kernel/debug/tracing/trace_pipe
          <idle>-0     [038] ..s.   146.947362: 0: 30 31 32
          <idle>-0     [038] ..s.   146.947364: 0: 33 34 35

Similarly, specifying '-t ib' will run this over an IB link.

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 samples/bpf/Makefile          |   3 +
 samples/bpf/rds_filter_kern.c |  42 ++++++
 samples/bpf/rds_filter_user.c | 339 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 384 insertions(+)
 create mode 100644 samples/bpf/rds_filter_kern.c
 create mode 100644 samples/bpf/rds_filter_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index be0a961..bbac5ef 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -53,6 +53,7 @@ hostprogs-y += xdpsock
 hostprogs-y += xdp_fwd
 hostprogs-y += task_fd_query
 hostprogs-y += xdp_sample_pkts
+hostprogs-y += rds_filter
 
 # Libbpf dependencies
 LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
@@ -109,6 +110,7 @@ xdpsock-objs := xdpsock_user.o
 xdp_fwd-objs := xdp_fwd_user.o
 task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS)
 xdp_sample_pkts-objs := xdp_sample_pkts_user.o $(TRACE_HELPERS)
+rds_filter-objs := bpf_load.o rds_filter_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -167,6 +169,7 @@ always += xdpsock_kern.o
 always += xdp_fwd_kern.o
 always += task_fd_query_kern.o
 always += xdp_sample_pkts_kern.o
+always += rds_filter_kern.o
 
 KBUILD_HOSTCFLAGS += -I$(objtree)/usr/include
 KBUILD_HOSTCFLAGS += -I$(srctree)/tools/lib/
diff --git a/samples/bpf/rds_filter_kern.c b/samples/bpf/rds_filter_kern.c
new file mode 100644
index 0000000..633e687
--- /dev/null
+++ b/samples/bpf/rds_filter_kern.c
@@ -0,0 +1,42 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/filter.h>
+#include <linux/ptrace.h>
+#include <linux/version.h>
+#include <uapi/linux/bpf.h>
+#include <linux/rds.h>
+#include "bpf_helpers.h"
+
+#define bpf_printk(fmt, ...)				\
+({							\
+	char ____fmt[] = fmt;				\
+	bpf_trace_printk(____fmt, sizeof(____fmt),	\
+			##__VA_ARGS__);			\
+})
+
+SEC("socksg")
+int main_prog(struct sk_msg_md *msg)
+{
+	int start, end, err;
+	unsigned char *d;
+
+	start = 0;
+	end = 6;
+
+	err = bpf_msg_pull_data(msg, start, end, 0);
+	if (err) {
+		bpf_printk("socksg: pull_data err %i\n", err);
+		return SOCKSG_PASS;
+	}
+
+	if (msg->data + 6 > msg->data_end)
+		return SOCKSG_PASS;
+
+	d = (unsigned char *)msg->data;
+	bpf_printk("%x %x %x\n", d[0], d[1], d[2]);
+	bpf_printk("%x %x %x\n", d[3], d[4], d[5]);
+
+	return SOCKSG_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/rds_filter_user.c b/samples/bpf/rds_filter_user.c
new file mode 100644
index 0000000..1186345
--- /dev/null
+++ b/samples/bpf/rds_filter_user.c
@@ -0,0 +1,339 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <arpa/inet.h>
+#include <assert.h>
+#include "bpf_load.h"
+#include <getopt.h>
+#include <errno.h>
+#include <netinet/in.h>
+#include <limits.h>
+#include <linux/sockios.h>
+#include <linux/rds.h>
+#include <linux/errqueue.h>
+#include <linux/bpf.h>
+#include <strings.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <string.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <unistd.h>
+
+#define TESTPORT	4000
+#define BUFSIZE		8192
+
+int transport = -1;
+
+static int str2trans(const char *trans)
+{
+	if (strcmp(trans, "tcp") == 0)
+		return RDS_TRANS_TCP;
+	if (strcmp(trans, "ib") == 0)
+		return RDS_TRANS_IB;
+	return (RDS_TRANS_NONE);
+}
+
+static const char *trans2str(int trans)
+{
+	switch (trans) {
+	case RDS_TRANS_TCP:
+		return ("tcp");
+	case RDS_TRANS_IB:
+		return ("ib");
+	case RDS_TRANS_NONE:
+		return ("none");
+	default:
+		return ("unknown");
+	}
+}
+
+static int gettransport(int sock)
+{
+	int err;
+	int val;
+	socklen_t len = sizeof(val);
+
+	err = getsockopt(sock, SOL_RDS, SO_RDS_TRANSPORT,
+			 (char *)&val, &len);
+	if (err < 0) {
+		fprintf(stderr, "%s: getsockopt %s\n",
+			__func__, strerror(errno));
+		return err;
+	}
+	return (int)val;
+}
+
+static int settransport(int sock, int transport)
+{
+	int err;
+
+	err = setsockopt(sock, SOL_RDS, SO_RDS_TRANSPORT,
+			 (char *)&transport, sizeof(transport));
+	if (err < 0) {
+		fprintf(stderr, "could not set transport %s, %s\n",
+			trans2str(transport), strerror(errno));
+	}
+	return err;
+}
+
+static void print_sock_local_info(int fd, char *str, struct sockaddr_in *ret)
+{
+	socklen_t sin_size = sizeof(struct sockaddr_in);
+	struct sockaddr_in sin;
+	int err;
+
+	err = getsockname(fd, (struct sockaddr *)&sin, &sin_size);
+	if (err < 0) {
+		fprintf(stderr, "%s getsockname %s\n",
+			__func__, strerror(errno));
+		return;
+	}
+	printf("%s address: %s port %d\n",
+		(str ? str : ""), inet_ntoa(sin.sin_addr), ntohs(sin.sin_port));
+
+	if (ret != NULL)
+		*ret = sin;
+}
+
+static void print_payload(char *buf)
+{
+	int i;
+
+	printf("payload contains:");
+	for (i = 0; i < 10; i++)
+		printf("%x ", buf[i]);
+	printf("...\n");
+}
+
+static void server(char *address, in_port_t port)
+{
+	struct sockaddr_in sin, din;
+	struct msghdr msg;
+	struct iovec *iov;
+	int rc, sock;
+	char *buf;
+
+	buf = calloc(BUFSIZE, sizeof(char));
+	if (!buf) {
+		fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+		return;
+	}
+
+	sock = socket(PF_RDS, SOCK_SEQPACKET, 0);
+	if (sock < 0) {
+		fprintf(stderr, "%s: socket %s\n", __func__, strerror(errno));
+		goto out;
+	}
+	if (settransport(sock, transport) < 0)
+		goto out;
+
+	printf("transport %s\n", trans2str(gettransport(sock)));
+
+	memset(&sin, 0, sizeof(sin));
+	sin.sin_family = AF_INET;
+	sin.sin_addr.s_addr = inet_addr(address);
+	sin.sin_port = htons(port);
+
+	rc = bind(sock, (struct sockaddr *)&sin, sizeof(sin));
+	if (rc < 0) {
+		fprintf(stderr, "%s: bind %s\n", __func__, strerror(errno));
+		goto out;
+	}
+
+	/* attach bpf prog */
+	assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, prog_fd,
+			  sizeof(prog_fd[0])) == 0);
+
+	print_sock_local_info(sock, "server bound to", NULL);
+
+	iov = calloc(1, sizeof(struct iovec));
+	if (!iov) {
+		fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+		goto out;
+	}
+
+	while (1) {
+		memset(buf, 0, BUFSIZE);
+		iov[0].iov_base = buf;
+		iov[0].iov_len = BUFSIZE;
+
+		memset(&msg, 0, sizeof(msg));
+		msg.msg_name = &din;
+		msg.msg_namelen = sizeof(din);
+		msg.msg_iov = iov;
+		msg.msg_iovlen = 1;
+
+		printf("server listening on %s\n", inet_ntoa(sin.sin_addr));
+
+		rc = recvmsg(sock, &msg, 0);
+		if (rc < 0) {
+			fprintf(stderr, "%s: recvmsg %s\n",
+				__func__, strerror(errno));
+			break;
+		}
+
+		printf("%s received a packet from %s of len %d cmsg len %d, on port %d\n",
+			inet_ntoa(sin.sin_addr),
+			inet_ntoa(din.sin_addr),
+			(uint32_t) iov[0].iov_len,
+			(uint32_t) msg.msg_controllen,
+			ntohs(din.sin_port));
+
+		print_payload(buf);
+	}
+	free(iov);
+out:
+	free(buf);
+}
+
+static void create_message(char *buf)
+{
+	unsigned int i;
+
+	for (i = 0; i < BUFSIZE; i++) {
+		buf[i] = i + 0x30;
+	}
+}
+
+static int build_rds_packet(struct msghdr *msg, char *buf)
+{
+	struct iovec *iov;
+
+	iov = calloc(1, sizeof(struct iovec));
+	if (!iov) {
+		fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+		return -1;
+	}
+
+	msg->msg_iov = iov;
+	msg->msg_iovlen = 1;
+
+	iov[0].iov_base = buf;
+	iov[0].iov_len = BUFSIZE * sizeof(char);
+
+	return 0;
+}
+
+static void client(char *localaddr, char *remoteaddr, in_port_t server_port)
+{
+	struct sockaddr_in sin, din;
+	struct msghdr msg;
+	int rc, sock;
+	char *buf;
+
+	buf = calloc(BUFSIZE, sizeof(char));
+	if (!buf) {
+		fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+		return;
+	}
+
+	create_message(buf);
+
+	sock = socket(PF_RDS, SOCK_SEQPACKET, 0);
+	if (sock < 0) {
+		fprintf(stderr, "%s: socket %s\n", __func__, strerror(errno));
+		goto out;
+	}
+
+	if (settransport(sock, transport) < 0)
+		goto out;
+
+	printf("transport %s\n", trans2str(gettransport(sock)));
+
+	memset(&sin, 0, sizeof(sin));
+	sin.sin_family = AF_INET;
+	sin.sin_addr.s_addr = inet_addr(localaddr);
+	sin.sin_port = 0;
+
+	rc = bind(sock, (struct sockaddr *)&sin, sizeof(sin));
+	if (rc < 0) {
+		fprintf(stderr, "%s: bind %s\n", __func__, strerror(errno));
+		goto out;
+	}
+	print_sock_local_info(sock, "client bound to",  &sin);
+
+	memset(&msg, 0, sizeof(msg));
+	msg.msg_name = &din;
+	msg.msg_namelen = sizeof(din);
+
+	memset(&din, 0, sizeof(din));
+	din.sin_family = AF_INET;
+	din.sin_addr.s_addr = inet_addr(remoteaddr);
+	din.sin_port = htons(server_port);
+
+	rc = build_rds_packet(&msg, buf);
+	if (rc < 0)
+		goto out;
+
+	printf("client sending %d byte message from %s to %s on port %d\n",
+		(uint32_t) msg.msg_iov->iov_len, localaddr,
+		remoteaddr, ntohs(sin.sin_port));
+
+	rc = sendmsg(sock, &msg, 0);
+	if (rc < 0)
+		fprintf(stderr, "%s: sendmsg %s\n", __func__, strerror(errno));
+
+	print_payload(buf);
+
+	if (msg.msg_control)
+		free(msg.msg_control);
+	if (msg.msg_iov)
+		free(msg.msg_iov);
+out:
+	free(buf);
+
+	return;
+}
+
+static void usage(char *progname)
+{
+	fprintf(stderr, "Usage %s [-s srvaddr] [-c clientaddr] [-t transport]"
+		"\n", progname);
+}
+
+int main(int argc, char **argv)
+{
+	in_port_t server_port = TESTPORT;
+	char *serveraddr = NULL;
+	char *clientaddr = NULL;
+	char filename[256];
+	int opt;
+
+	while ((opt = getopt(argc, argv, "s:c:t:")) != -1) {
+		switch (opt) {
+		case 's':
+			serveraddr = optarg;
+			break;
+		case 'c':
+			clientaddr = optarg;
+			break;
+		case 't':
+			transport = str2trans(optarg);
+			if (transport == RDS_TRANS_NONE) {
+				fprintf(stderr,
+					"unknown transport %s\n", optarg);
+				usage(argv[0]);
+				return (-1);
+			}
+			break;
+		default:
+			usage(argv[0]);
+			return 1;
+		}
+	}
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	if (load_bpf_file(filename)) {
+		fprintf(stderr, "Error: load_bpf_file %s", bpf_log_buf);
+		return 1;
+	}
+
+	if (serveraddr && !clientaddr) {
+		printf("running server in a loop\n");
+		server(serveraddr, server_port);
+	} else if (serveraddr && clientaddr) {
+		client(clientaddr, serveraddr, server_port);
+	}
+
+	return 0;
+}
-- 
1.8.3.1

* [PATCH net-next 4/5] rds: invoke socket sg filter attached to rds socket
From: Tushar Dave @ 2018-09-11 19:38 UTC (permalink / raw)
  To: ast, daniel, davem, santosh.shilimkar, jakub.kicinski,
	quentin.monnet, jiong.wang, sandipan, john.fastabend, kafai, rdna,
	yhs, netdev, rds-devel, sowmini.varadhan
In-Reply-To: <1536694684-3200-1-git-send-email-tushar.n.dave@oracle.com>

The RDS module sits on top of TCP (rds_tcp) and IB (rds_rdma), so
messages arrive in the form of an skb (over TCP) or a scatterlist (over
IB/RDMA). However, because socket filters only deal with skbs (i.e.
struct sk_buff as the bpf context), we can only use a socket filter for
rds_tcp and not for rds_rdma.

Considering one filtering solution for RDS, it seems that the common
denominator between sk_buff and scatterlist is scatterlist. Therefore,
this patch converts the skb to an sgvec and invokes sg_filter_run for
rds_tcp, and simply invokes sg_filter_run for IB/rds_rdma.
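
The skb-to-sgvec conversion mentioned above can be pictured with a
userspace model. This is a sketch, not the kernel code: struct seg,
struct fake_skb, count_segs() and flatten_to_sg() are hypothetical, the
value of MAX_SG is an assumption standing in for MAX_SKB_FRAGS, and the
model always emits one entry for the linear head plus one per fragment.
Unlike the patch, the model also refuses to write past the table.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SG 17	/* assumption: stands in for MAX_SKB_FRAGS */

/* Hypothetical stand-in for one scatterlist entry. */
struct seg {
	const void *addr;
	size_t len;
	int is_last;	/* models the end marker set by sg_mark_end() */
};

/* Hypothetical stand-in for an skb: a linear head plus page fragments. */
struct fake_skb {
	struct seg head;
	const struct seg *frags;
	size_t nr_frags;
	const struct fake_skb *next;
};

/* One entry for the head plus one per fragment, summed over the queue. */
static size_t count_segs(const struct fake_skb *skb)
{
	size_t n = 0;

	for (; skb; skb = skb->next)
		n += skb->nr_frags + 1;
	return n;
}

/*
 * Flatten the queue into sg[0..max). Returns the number of entries
 * written, or -1 if the queue is empty or would not fit.
 */
static int flatten_to_sg(const struct fake_skb *skb, struct seg *sg, size_t max)
{
	size_t n = 0, i;

	assert(sg != NULL);
	if (count_segs(skb) > max)
		return -1;

	for (; skb; skb = skb->next) {
		sg[n] = skb->head;
		sg[n++].is_last = 0;
		for (i = 0; i < skb->nr_frags; i++) {
			sg[n] = skb->frags[i];
			sg[n++].is_last = 0;
		}
	}
	if (n == 0)
		return -1;

	sg[n - 1].is_last = 1;	/* mark the end of the flattened list */
	return (int)n;
}
```

Counting first and then filling keeps the table bounded, which is the
traversal hinted at in the comment inside rds_tcp_inc_to_sg_get().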

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 net/rds/ib.c       |   1 +
 net/rds/ib.h       |   1 +
 net/rds/ib_recv.c  |  12 ++++++
 net/rds/rds.h      |   1 +
 net/rds/recv.c     |  12 ++++++
 net/rds/tcp.c      |   1 +
 net/rds/tcp.h      |   2 +
 net/rds/tcp_recv.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 8 files changed, 137 insertions(+), 1 deletion(-)

diff --git a/net/rds/ib.c b/net/rds/ib.c
index eba75c1..6c40652 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -527,6 +527,7 @@ struct rds_transport rds_ib_transport = {
 	.conn_path_shutdown	= rds_ib_conn_path_shutdown,
 	.inc_copy_to_user	= rds_ib_inc_copy_to_user,
 	.inc_free		= rds_ib_inc_free,
+	.inc_to_sg_get		= rds_ib_inc_to_sg_get,
 	.cm_initiate_connect	= rds_ib_cm_initiate_connect,
 	.cm_handle_connect	= rds_ib_cm_handle_connect,
 	.cm_connect_complete	= rds_ib_cm_connect_complete,
diff --git a/net/rds/ib.h b/net/rds/ib.h
index 73427ff..0a12b41 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -404,6 +404,7 @@ int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev,
 void rds_ib_recv_free_caches(struct rds_ib_connection *ic);
 void rds_ib_recv_refill(struct rds_connection *conn, int prefill, gfp_t gfp);
 void rds_ib_inc_free(struct rds_incoming *inc);
+int rds_ib_inc_to_sg_get(struct rds_incoming *inc, struct scatterlist **sg);
 int rds_ib_inc_copy_to_user(struct rds_incoming *inc, struct iov_iter *to);
 void rds_ib_recv_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc,
 			     struct rds_ib_ack_state *state);
diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index 2f16146..0054c7c 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -219,6 +219,18 @@ void rds_ib_inc_free(struct rds_incoming *inc)
 	rds_ib_recv_cache_put(&ibinc->ii_cache_entry, &ic->i_cache_incs);
 }
 
+int rds_ib_inc_to_sg_get(struct rds_incoming *inc, struct scatterlist **sg)
+{
+	struct rds_ib_incoming *ibinc;
+	struct rds_page_frag *frag;
+
+	ibinc = container_of(inc, struct rds_ib_incoming, ii_inc);
+	frag = list_entry(ibinc->ii_frags.next, struct rds_page_frag, f_item);
+	*sg =  &frag->f_sg;
+
+	return 0;
+}
+
 static void rds_ib_recv_clear_one(struct rds_ib_connection *ic,
 				  struct rds_ib_recv_work *recv)
 {
diff --git a/net/rds/rds.h b/net/rds/rds.h
index 6bfaf05..9f3e4df 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -542,6 +542,7 @@ struct rds_transport {
 	int (*recv_path)(struct rds_conn_path *cp);
 	int (*inc_copy_to_user)(struct rds_incoming *inc, struct iov_iter *to);
 	void (*inc_free)(struct rds_incoming *inc);
+	int (*inc_to_sg_get)(struct rds_incoming *inc, struct scatterlist **sg);
 
 	int (*cm_handle_connect)(struct rdma_cm_id *cm_id,
 				 struct rdma_cm_event *event, bool isv6);
diff --git a/net/rds/recv.c b/net/rds/recv.c
index 1271965..424042e 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -290,6 +290,8 @@ void rds_recv_incoming(struct rds_connection *conn, struct in6_addr *saddr,
 	struct sock *sk;
 	unsigned long flags;
 	struct rds_conn_path *cp;
+	struct sk_filter *filter;
+	int result = __SOCKSG_PASS;
 
 	inc->i_conn = conn;
 	inc->i_rx_jiffies = jiffies;
@@ -374,6 +376,16 @@ void rds_recv_incoming(struct rds_connection *conn, struct in6_addr *saddr,
 	/* We can be racing with rds_release() which marks the socket dead. */
 	sk = rds_rs_to_sk(rs);
 
+	rcu_read_lock();
+	filter = rcu_dereference(sk->sk_filter);
+	if (filter && conn->c_trans->inc_to_sg_get) {
+		struct scatterlist *sg = NULL;
+
+		if (conn->c_trans->inc_to_sg_get(inc, &sg) == 0)
+			result = sg_filter_run(sk, sg);
+	}
+	rcu_read_unlock();
+
 	/* serialize with rds_release -> sock_orphan */
 	write_lock_irqsave(&rs->rs_recv_lock, flags);
 	if (!sock_flag(sk, SOCK_DEAD)) {
diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index b9bbcf3..b0683e6 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -464,6 +464,7 @@ struct rds_transport rds_tcp_transport = {
 	.conn_path_shutdown	= rds_tcp_conn_path_shutdown,
 	.inc_copy_to_user	= rds_tcp_inc_copy_to_user,
 	.inc_free		= rds_tcp_inc_free,
+	.inc_to_sg_get		= rds_tcp_inc_to_sg_get,
 	.stats_info_copy	= rds_tcp_stats_info_copy,
 	.exit			= rds_tcp_exit,
 	.t_owner		= THIS_MODULE,
diff --git a/net/rds/tcp.h b/net/rds/tcp.h
index 3c69361..e4ea16e 100644
--- a/net/rds/tcp.h
+++ b/net/rds/tcp.h
@@ -7,6 +7,7 @@
 struct rds_tcp_incoming {
 	struct rds_incoming	ti_inc;
 	struct sk_buff_head	ti_skb_list;
+	struct scatterlist	*sg;
 };
 
 struct rds_tcp_connection {
@@ -82,6 +83,7 @@ void rds_tcp_restore_callbacks(struct socket *sock,
 int rds_tcp_recv_path(struct rds_conn_path *cp);
 void rds_tcp_inc_free(struct rds_incoming *inc);
 int rds_tcp_inc_copy_to_user(struct rds_incoming *inc, struct iov_iter *to);
+int rds_tcp_inc_to_sg_get(struct rds_incoming *inc, struct scatterlist **sg);
 
 /* tcp_send.c */
 void rds_tcp_xmit_path_prepare(struct rds_conn_path *cp);
diff --git a/net/rds/tcp_recv.c b/net/rds/tcp_recv.c
index 42c5ff1..22d84f2 100644
--- a/net/rds/tcp_recv.c
+++ b/net/rds/tcp_recv.c
@@ -50,14 +50,113 @@ static void rds_tcp_inc_purge(struct rds_incoming *inc)
 void rds_tcp_inc_free(struct rds_incoming *inc)
 {
 	struct rds_tcp_incoming *tinc;
+	int i;
+
 	tinc = container_of(inc, struct rds_tcp_incoming, ti_inc);
 	rds_tcp_inc_purge(inc);
+
+	if (tinc->sg) {
+		for (i = 0; i < sg_nents(tinc->sg); i++) {
+			struct page *page;
+
+			page = sg_page(&tinc->sg[i]);
+			put_page(page);
+		}
+		kfree(tinc->sg);
+	}
+
 	rdsdebug("freeing tinc %p inc %p\n", tinc, inc);
 	kmem_cache_free(rds_tcp_incoming_slab, tinc);
 }
 
+#define MAX_SG MAX_SKB_FRAGS
+int rds_tcp_inc_to_sg_get(struct rds_incoming *inc, struct scatterlist **sg)
+{
+	struct rds_tcp_incoming *tinc;
+	struct sk_buff *skb;
+	int num_sg = 0;
+	int i;
+
+	tinc = container_of(inc, struct rds_tcp_incoming, ti_inc);
+
+	/* For now we are assuming that the max sg elements we need is MAX_SG.
+	 * To determine actual number of sg elements we need to traverse the
+	 * skb queue e.g.
+	 *
+	 * skb_queue_walk(&tinc->ti_skb_list, skb) {
+	 *	num_sg += skb_shinfo(skb)->nr_frags + 1;
+	 * }
+	 */
+	tinc->sg = kzalloc(sizeof(*tinc->sg) * MAX_SG, GFP_KERNEL);
+	if (!tinc->sg)
+		return -ENOMEM;
+
+	sg_init_table(tinc->sg, MAX_SG);
+	skb_queue_walk(&tinc->ti_skb_list, skb) {
+		num_sg += skb_to_sgvec_nomark(skb, &tinc->sg[num_sg], 0,
+					      skb->len);
+	}
+
+	/* packet can have zero length */
+	if (num_sg <= 0) {
+		kfree(tinc->sg);
+		tinc->sg = NULL;
+		return -ENODATA;
+	}
+
+	sg_mark_end(&tinc->sg[num_sg - 1]);
+	*sg = tinc->sg;
+
+	for (i = 0; i < num_sg; i++)
+		get_page(sg_page(&tinc->sg[i]));
+
+	return 0;
+}
+
+static int rds_tcp_inc_copy_sg_to_user(struct rds_incoming *inc,
+				       struct iov_iter *to)
+{
+	struct rds_tcp_incoming *tinc;
+	struct scatterlist *sg;
+	unsigned long copied = 0;
+	unsigned long len;
+	u8 i = 0;
+
+	tinc = container_of(inc, struct rds_tcp_incoming, ti_inc);
+	len = be32_to_cpu(inc->i_hdr.h_len);
+	sg = tinc->sg;
+
+	do {
+		struct page *page;
+		unsigned long n, copy, to_copy;
+
+		sg = &tinc->sg[i];
+		copy = sg->length;
+		page = sg_page(sg);
+		to_copy = iov_iter_count(to);
+		to_copy = min_t(unsigned long, to_copy, copy);
+
+		n = copy_page_to_iter(page, sg->offset, to_copy, to);
+		if (n != copy)
+			return -EFAULT;
+
+		rds_stats_add(s_copy_to_user, to_copy);
+		copied += to_copy;
+		sg->offset += to_copy;
+		sg->length -= to_copy;
+
+		if (!sg->length)
+			i++;
+
+		if (copied == len)
+			break;
+	} while (i != sg_nents(tinc->sg));
+	return copied;
+}
 /*
- * this is pretty lame, but, whatever.
+ * This is pretty lame, but, whatever.
+ * Note: bpf filter can change RDS packet and if so then the modified packet is
+ * contained in the form of scatterlist, not skb.
  */
 int rds_tcp_inc_copy_to_user(struct rds_incoming *inc, struct iov_iter *to)
 {
@@ -70,6 +169,12 @@ int rds_tcp_inc_copy_to_user(struct rds_incoming *inc, struct iov_iter *to)
 
 	tinc = container_of(inc, struct rds_tcp_incoming, ti_inc);
 
+	/* If tinc->sg is not NULL, the bpf filter ran on the packet, so the
+	 * packet is now in the form of a scatterlist.
+	 */
+	if (tinc->sg)
+		return rds_tcp_inc_copy_sg_to_user(inc, to);
+
 	skb_queue_walk(&tinc->ti_skb_list, skb) {
 		unsigned long to_copy, skb_off;
 		for (skb_off = 0; skb_off < skb->len; skb_off += to_copy) {
@@ -176,6 +281,7 @@ static int rds_tcp_data_recv(read_descriptor_t *desc, struct sk_buff *skb,
 				desc->error = -ENOMEM;
 				goto out;
 			}
+			tinc->sg = NULL;
 			tc->t_tinc = tinc;
 			rdsdebug("alloced tinc %p\n", tinc);
 			rds_inc_path_init(&tinc->ti_inc, cp,
-- 
1.8.3.1

* Re: [PATCH v3 net-next 5/6] dt-bindings: net: dsa: Add lantiq,xrx200-gswip DT bindings
From: Hauke Mehrtens @ 2018-09-11 21:01 UTC (permalink / raw)
  To: Rob Herring
  Cc: davem, netdev, andrew, vivien.didelot, f.fainelli, john,
	linux-mips, dev, hauke.mehrtens, devicetree
In-Reply-To: <20180910220119.GA32582@bogus>


On 09/11/2018 12:01 AM, Rob Herring wrote:
> On Sun, Sep 09, 2018 at 10:20:27PM +0200, Hauke Mehrtens wrote:
>> This adds the binding for the GSWIP (Gigabit switch) core found in the
>> xrx200 / VR9 Lantiq / Intel SoC.
>>
>> This part takes care of the switch, MDIO bus, and loading the FW into
>> the embedded GPHYs.
>>
>> Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
>> Cc: devicetree@vger.kernel.org
>> ---
>>  .../devicetree/bindings/net/dsa/lantiq-gswip.txt   | 141 +++++++++++++++++++++
>>  1 file changed, 141 insertions(+)
>>  create mode 100644 Documentation/devicetree/bindings/net/dsa/lantiq-gswip.txt
>>
>> diff --git a/Documentation/devicetree/bindings/net/dsa/lantiq-gswip.txt b/Documentation/devicetree/bindings/net/dsa/lantiq-gswip.txt
>> new file mode 100644
>> index 000000000000..a089f5856778
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/net/dsa/lantiq-gswip.txt
>> @@ -0,0 +1,141 @@
>> +Lantiq GSWIP Ethernet switches
>> +==================================
>> +
>> +Required properties for GSWIP core:
>> +
>> +- compatible	: "lantiq,xrx200-gswip" for the embedded GSWIP in the
>> +		  xRX200 SoC
>> +- reg		: memory range of the GSWIP core registers
>> +		: memory range of the GSWIP MDIO registers
>> +		: memory range of the GSWIP MII registers
>> +
>> +See Documentation/devicetree/bindings/net/dsa/dsa.txt for a list of
>> +additional required and optional properties.
>> +
>> +
>> +Required properties for MDIO bus:
>> +- compatible	: "lantiq,xrx200-mdio" for the MDIO bus inside the GSWIP
>> +		  core of the xRX200 SoC and the PHYs connected to it.
>> +
>> +See Documentation/devicetree/bindings/net/mdio.txt for a list of additional
>> +required and optional properties.
>> +
>> +
>> +Required properties for GPHY firmware loading:
>> +- compatible	: "lantiq,gphy-fw" and "lantiq,xrx200-gphy-fw",
>> +		  "lantiq,xrx200a1x-gphy-fw", "lantiq,xrx200a2x-gphy-fw",
>> +		  "lantiq,xrx300-gphy-fw", or "lantiq,xrx330-gphy-fw"
>> +		  for the loading of the firmware into the embedded
>> +		  GPHY core of the SoC.
> 
> One valid combination of compatibles per line please.

Ok, I will update this.

> 
>> +- lantiq,rcu	: reference to the rcu syscon
>> +
>> +The GPHY firmware loader has a list of GPHY entries, one for each
>> +embedded GPHY
>> +
>> +- reg		: Offset of the GPHY firmware register in the RCU
>> +		  register range
> 
> This use of reg is strange. This node should probably be a child of 
> the RCU.

The SoC designers put all registers for which they didn't want to create
a new register block into the RCU (Reset Controller Unit) range. The
switch itself is on the main crossbar and has its own memory range, but
the registers to load the GPHY FW are in the RCU register range. We have
to load the GPHY firmware before we can access the GPHY; after the FW is
loaded, we control the GPHY through the MDIO bus of the switch.

The GPHY is now part of the switch driver, so we also moved the GPHY
node to be a sub node of the switch. If it were under the RCU, we would
somehow have to make sure it gets loaded before the switch gets loaded,
which is more complicated. The GPHY itself is also part of the switch IP
block and not the reset controller unit.

>> +- resets	: list of resets of the embedded GPHY
>> +- reset-names	: list of names of the resets
>> +
>> +Example:
>> +
>> +Ethernet switch on the VRX200 SoC:
>> +
>> +gswip: gswip@E108000 {
> 
> switch@... or ethernet-switch@...
> 
> We need a standard name here and add it to the DT spec.

Ok, I will change this.

> 
>> +	#address-cells = <1>;
>> +	#size-cells = <0>;
>> +	compatible = "lantiq,xrx200-gswip";
>> +	reg = <	0xE108000 0x3000 /* switch */
>> +		0xE10B100 0x70 /* mdio */
>> +		0xE10B1D8 0x30 /* mii */
>> +		>;
>> +	dsa,member = <0 0>;
> 
> Not documented.

This is part of the general dsa binding.

> 
>> +
>> +	ports {
>> +		#address-cells = <1>;
>> +		#size-cells = <0>;
>> +
>> +		port@0 {
>> +			reg = <0>;
>> +			label = "lan3";
>> +			phy-mode = "rgmii";
>> +			phy-handle = <&phy0>;
>> +		};
>> +
>> +		port@1 {
>> +			reg = <1>;
>> +			label = "lan4";
>> +			phy-mode = "rgmii";
>> +			phy-handle = <&phy1>;
>> +		};
>> +
>> +		port@2 {
>> +			reg = <2>;
>> +			label = "lan2";
>> +			phy-mode = "internal";
>> +			phy-handle = <&phy11>;
>> +		};
>> +
>> +		port@4 {
>> +			reg = <4>;
>> +			label = "lan1";
>> +			phy-mode = "internal";
>> +			phy-handle = <&phy13>;
>> +		};
>> +
>> +		port@5 {
>> +			reg = <5>;
>> +			label = "wan";
>> +			phy-mode = "rgmii";
>> +			phy-handle = <&phy5>;
>> +		};
>> +
>> +		port@6 {
>> +			reg = <0x6>;
>> +			label = "cpu";
>> +			ethernet = <&eth0>;
>> +		};
>> +	};
>> +
>> +	mdio@0 {
> 
> What's the address 0 here?

I will remove this, there is only one MDIO bus under the switch.

> 
>> +		#address-cells = <1>;
>> +		#size-cells = <0>;
>> +		compatible = "lantiq,xrx200-mdio";
>> +		reg = <0>;
>> +
>> +		phy0: ethernet-phy@0 {
>> +			reg = <0x0>;
>> +		};
>> +		phy1: ethernet-phy@1 {
>> +			reg = <0x1>;
>> +		};
>> +		phy5: ethernet-phy@5 {
>> +			reg = <0x5>;
>> +		};
>> +		phy11: ethernet-phy@11 {
>> +			reg = <0x11>;
>> +		};
>> +		phy13: ethernet-phy@13 {
>> +			reg = <0x13>;
>> +		};
>> +	};
>> +
>> +	gphy-fw {
>> +		compatible = "lantiq,xrx200-gphy-fw", "lantiq,gphy-fw";
>> +		lantiq,rcu = <&rcu0>;
> 
> Missing #size-cells and #address-cells, but this should change as I said 
> above.

Ok, I will change this.

> 
>> +
>> +		gphy@20 {
>> +			reg = <0x20>;
>> +
>> +			resets = <&reset0 31 30>;
>> +			reset-names = "gphy";
>> +		};
>> +
>> +		gphy@68 {
>> +			reg = <0x68>;
>> +
>> +			resets = <&reset0 29 28>;
>> +			reset-names = "gphy";
>> +		};
>> +	};
>> +};
>> -- 
>> 2.11.0
>>



* Re: [PATCH v2 net-next 11/12] net: ethernet: Add helper for set_pauseparam for Pause
From: kbuild test robot @ 2018-09-11 21:01 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: kbuild-all, David Miller, netdev, Florian Fainelli, Andrew Lunn
In-Reply-To: <1536616350-15442-12-git-send-email-andrew@lunn.ch>

Hi Andrew,

I love your patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Andrew-Lunn/Preparing-for-phylib-limkmodes/20180911-204149
reproduce: make htmldocs

All warnings (new ones prefixed by >>):

   drivers/target/target_core_device.c:1: warning: no structured comments found
   drivers/usb/dwc3/gadget.c:510: warning: Excess function parameter 'dwc' description in 'dwc3_gadget_start_config'
   drivers/usb/typec/class.c:1497: warning: Excess function parameter 'drvdata' description in 'typec_port_register_altmode'
   drivers/usb/typec/bus.c:1: warning: no structured comments found
   drivers/usb/typec/bus.c:268: warning: Function parameter or member 'mode' not described in 'typec_match_altmode'
   drivers/usb/typec/class.c:1497: warning: Excess function parameter 'drvdata' description in 'typec_port_register_altmode'
   drivers/usb/typec/class.c:1: warning: no structured comments found
   include/linux/w1.h:281: warning: Function parameter or member 'of_match_table' not described in 'w1_family'
   fs/direct-io.c:257: warning: Excess function parameter 'offset' description in 'dio_complete'
   fs/file_table.c:1: warning: no structured comments found
   fs/libfs.c:477: warning: Excess function parameter 'available' description in 'simple_write_end'
   fs/posix_acl.c:646: warning: Function parameter or member 'inode' not described in 'posix_acl_update_mode'
   fs/posix_acl.c:646: warning: Function parameter or member 'mode_p' not described in 'posix_acl_update_mode'
   fs/posix_acl.c:646: warning: Function parameter or member 'acl' not described in 'posix_acl_update_mode'
   drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c:183: warning: Function parameter or member 'blockable' not described in 'amdgpu_mn_read_lock'
   drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c:254: warning: Function parameter or member 'blockable' not described in 'amdgpu_mn_invalidate_range_start_gfx'
   drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c:302: warning: Function parameter or member 'blockable' not described in 'amdgpu_mn_invalidate_range_start_hsa'
   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:3011: warning: Excess function parameter 'dev' description in 'amdgpu_vm_get_task_info'
   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:3012: warning: Function parameter or member 'adev' not described in 'amdgpu_vm_get_task_info'
   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:3012: warning: Excess function parameter 'dev' description in 'amdgpu_vm_get_task_info'
   include/drm/drm_drv.h:610: warning: Function parameter or member 'gem_prime_pin' not described in 'drm_driver'
   include/drm/drm_drv.h:610: warning: Function parameter or member 'gem_prime_unpin' not described in 'drm_driver'
   include/drm/drm_drv.h:610: warning: Function parameter or member 'gem_prime_res_obj' not described in 'drm_driver'
   include/drm/drm_drv.h:610: warning: Function parameter or member 'gem_prime_get_sg_table' not described in 'drm_driver'
   include/drm/drm_drv.h:610: warning: Function parameter or member 'gem_prime_import_sg_table' not described in 'drm_driver'
   include/drm/drm_drv.h:610: warning: Function parameter or member 'gem_prime_vmap' not described in 'drm_driver'
   include/drm/drm_drv.h:610: warning: Function parameter or member 'gem_prime_vunmap' not described in 'drm_driver'
   include/drm/drm_drv.h:610: warning: Function parameter or member 'gem_prime_mmap' not described in 'drm_driver'
   include/drm/drm_panel.h:98: warning: Function parameter or member 'link' not described in 'drm_panel'
   drivers/gpu/drm/i915/i915_vma.h:49: warning: cannot understand function prototype: 'struct i915_vma '
   drivers/gpu/drm/i915/i915_vma.h:1: warning: no structured comments found
   drivers/gpu/drm/i915/intel_guc_fwif.h:553: warning: cannot understand function prototype: 'struct guc_log_buffer_state '
   drivers/gpu/drm/i915/i915_trace.h:1: warning: no structured comments found
   include/linux/skbuff.h:860: warning: Function parameter or member 'dev_scratch' not described in 'sk_buff'
   include/linux/skbuff.h:860: warning: Function parameter or member 'list' not described in 'sk_buff'
   include/linux/skbuff.h:860: warning: Function parameter or member 'ip_defrag_offset' not described in 'sk_buff'
   include/linux/skbuff.h:860: warning: Function parameter or member 'skb_mstamp' not described in 'sk_buff'
   include/linux/skbuff.h:860: warning: Function parameter or member '__cloned_offset' not described in 'sk_buff'
   include/linux/skbuff.h:860: warning: Function parameter or member 'head_frag' not described in 'sk_buff'
   include/linux/skbuff.h:860: warning: Function parameter or member '__pkt_type_offset' not described in 'sk_buff'
   include/linux/skbuff.h:860: warning: Function parameter or member 'encapsulation' not described in 'sk_buff'
   include/linux/skbuff.h:860: warning: Function parameter or member 'encap_hdr_csum' not described in 'sk_buff'
   include/linux/skbuff.h:860: warning: Function parameter or member 'csum_valid' not described in 'sk_buff'
   include/linux/skbuff.h:860: warning: Function parameter or member 'csum_complete_sw' not described in 'sk_buff'
   include/linux/skbuff.h:860: warning: Function parameter or member 'csum_level' not described in 'sk_buff'
   include/linux/skbuff.h:860: warning: Function parameter or member 'inner_protocol_type' not described in 'sk_buff'
   include/linux/skbuff.h:860: warning: Function parameter or member 'remcsum_offload' not described in 'sk_buff'
   include/linux/skbuff.h:860: warning: Function parameter or member 'offload_fwd_mark' not described in 'sk_buff'
   include/linux/skbuff.h:860: warning: Function parameter or member 'offload_mr_fwd_mark' not described in 'sk_buff'
   include/linux/skbuff.h:860: warning: Function parameter or member 'sender_cpu' not described in 'sk_buff'
   include/linux/skbuff.h:860: warning: Function parameter or member 'reserved_tailroom' not described in 'sk_buff'
   include/linux/skbuff.h:860: warning: Function parameter or member 'inner_ipproto' not described in 'sk_buff'
   include/net/sock.h:238: warning: Function parameter or member 'skc_addrpair' not described in 'sock_common'
   include/net/sock.h:238: warning: Function parameter or member 'skc_portpair' not described in 'sock_common'
   include/net/sock.h:238: warning: Function parameter or member 'skc_ipv6only' not described in 'sock_common'
   include/net/sock.h:238: warning: Function parameter or member 'skc_net_refcnt' not described in 'sock_common'
   include/net/sock.h:238: warning: Function parameter or member 'skc_v6_daddr' not described in 'sock_common'
   include/net/sock.h:238: warning: Function parameter or member 'skc_v6_rcv_saddr' not described in 'sock_common'
   include/net/sock.h:238: warning: Function parameter or member 'skc_cookie' not described in 'sock_common'
   include/net/sock.h:238: warning: Function parameter or member 'skc_listener' not described in 'sock_common'
   include/net/sock.h:238: warning: Function parameter or member 'skc_tw_dr' not described in 'sock_common'
   include/net/sock.h:238: warning: Function parameter or member 'skc_rcv_wnd' not described in 'sock_common'
   include/net/sock.h:238: warning: Function parameter or member 'skc_tw_rcv_nxt' not described in 'sock_common'
   include/net/sock.h:509: warning: Function parameter or member 'sk_backlog.rmem_alloc' not described in 'sock'
   include/net/sock.h:509: warning: Function parameter or member 'sk_backlog.len' not described in 'sock'
   include/net/sock.h:509: warning: Function parameter or member 'sk_backlog.head' not described in 'sock'
   include/net/sock.h:509: warning: Function parameter or member 'sk_backlog.tail' not described in 'sock'
   include/net/sock.h:509: warning: Function parameter or member 'sk_wq_raw' not described in 'sock'
   include/net/sock.h:509: warning: Function parameter or member 'tcp_rtx_queue' not described in 'sock'
   include/net/sock.h:509: warning: Function parameter or member 'sk_route_forced_caps' not described in 'sock'
   include/net/sock.h:509: warning: Function parameter or member 'sk_txtime_report_errors' not described in 'sock'
   include/net/sock.h:509: warning: Function parameter or member 'sk_validate_xmit_skb' not described in 'sock'
   include/linux/netdevice.h:2044: warning: Function parameter or member 'adj_list.upper' not described in 'net_device'
   include/linux/netdevice.h:2044: warning: Function parameter or member 'adj_list.lower' not described in 'net_device'
   include/linux/netdevice.h:2044: warning: Function parameter or member 'gso_partial_features' not described in 'net_device'
   include/linux/netdevice.h:2044: warning: Function parameter or member 'switchdev_ops' not described in 'net_device'
   include/linux/netdevice.h:2044: warning: Function parameter or member 'l3mdev_ops' not described in 'net_device'
   include/linux/netdevice.h:2044: warning: Function parameter or member 'xfrmdev_ops' not described in 'net_device'
   include/linux/netdevice.h:2044: warning: Function parameter or member 'tlsdev_ops' not described in 'net_device'
   include/linux/netdevice.h:2044: warning: Function parameter or member 'name_assign_type' not described in 'net_device'
   include/linux/netdevice.h:2044: warning: Function parameter or member 'ieee802154_ptr' not described in 'net_device'
   include/linux/netdevice.h:2044: warning: Function parameter or member 'mpls_ptr' not described in 'net_device'
   include/linux/netdevice.h:2044: warning: Function parameter or member 'xdp_prog' not described in 'net_device'
   include/linux/netdevice.h:2044: warning: Function parameter or member 'gro_flush_timeout' not described in 'net_device'
   include/linux/netdevice.h:2044: warning: Function parameter or member 'nf_hooks_ingress' not described in 'net_device'
   include/linux/netdevice.h:2044: warning: Function parameter or member '____cacheline_aligned_in_smp' not described in 'net_device'
   include/linux/netdevice.h:2044: warning: Function parameter or member 'qdisc_hash' not described in 'net_device'
   include/linux/netdevice.h:2044: warning: Function parameter or member 'xps_cpus_map' not described in 'net_device'
   include/linux/netdevice.h:2044: warning: Function parameter or member 'xps_rxqs_map' not described in 'net_device'
>> drivers/net/phy/phy_device.c:1826: warning: Function parameter or member 'tx' not described in 'phy_set_sym_pause'
   include/linux/phylink.h:56: warning: Function parameter or member '__ETHTOOL_DECLARE_LINK_MODE_MASK(advertising' not described in 'phylink_link_state'
   include/linux/phylink.h:56: warning: Function parameter or member '__ETHTOOL_DECLARE_LINK_MODE_MASK(lp_advertising' not described in 'phylink_link_state'
   sound/soc/soc-core.c:2918: warning: Excess function parameter 'legacy_dai_naming' description in 'snd_soc_register_dais'
   Documentation/admin-guide/cgroup-v2.rst:1485: WARNING: Block quote ends without a blank line; unexpected unindent.
   Documentation/admin-guide/cgroup-v2.rst:1487: WARNING: Block quote ends without a blank line; unexpected unindent.
   Documentation/admin-guide/cgroup-v2.rst:1488: WARNING: Block quote ends without a blank line; unexpected unindent.
   Documentation/core-api/boot-time-mm.rst:78: ERROR: Error in "kernel-doc" directive:
   unknown option: "nodocs".

vim +1826 drivers/net/phy/phy_device.c

  1812	
  1813	/**
  1814	 * phy_set_sym_pause - Configure symmetric Pause
  1815	 * @phydev: target phy_device struct
  1816	 * @rx: Receiver Pause is supported
  1817	 * @autoneg: Auto neg should be used
  1818	 *
  1819	 * Description: Configure advertised Pause support depending on if
  1820	 * receiver pause and pause auto neg is supported. Generally called
  1821	 * from the set_pauseparam .ndo.
  1822	 */
  1823	void phy_set_sym_pause(struct phy_device *phydev, bool rx, bool tx,
  1824			       bool autoneg)
  1825	{
> 1826		phydev->supported &= ~SUPPORTED_Pause;
  1827	
  1828		if (rx && tx && autoneg)
  1829			phydev->supported |= SUPPORTED_Pause;
  1830	
  1831		phydev->advertising = phydev->supported;
  1832	}
  1833	EXPORT_SYMBOL(phy_set_sym_pause);
  1834	
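
The flagged warning says that the kernel-doc block documents @rx and
@autoneg but has no entry for the 'tx' parameter. A minimal userspace sketch
of what a complete comment might look like follows. The struct and the
SUPPORTED_Pause value are stand-ins so the snippet compiles on its own, and
the wording of the @tx line is an assumption, not the author's text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the kernel constant; value assumed for this sketch only. */
#define SUPPORTED_Pause (1u << 13)

/* Stand-in for the two struct phy_device fields the helper touches. */
struct phy_device {
	uint32_t supported;
	uint32_t advertising;
};

/**
 * phy_set_sym_pause - Configure symmetric Pause
 * @phydev: target phy_device struct
 * @rx: Receiver Pause is supported
 * @tx: Transmit Pause is supported (hypothetical wording)
 * @autoneg: Auto neg should be used
 *
 * Description: Configure advertised Pause support depending on whether
 * receiver pause, transmit pause and pause auto neg are supported.
 */
void phy_set_sym_pause(struct phy_device *phydev, bool rx, bool tx,
		       bool autoneg)
{
	/* Start from "no symmetric pause" and re-add it only if all agree. */
	phydev->supported &= ~SUPPORTED_Pause;

	if (rx && tx && autoneg)
		phydev->supported |= SUPPORTED_Pause;

	phydev->advertising = phydev->supported;
}
```

With every parameter listed, "make htmldocs" should have nothing left to
report for this function; the function body is unchanged from the excerpt.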

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 6586 bytes --]

^ permalink raw reply

* Re: [PATCH net-next 4/5] rds: invoke socket sg filter attached to rds socket
From: santosh.shilimkar @ 2018-09-11 21:06 UTC (permalink / raw)
  To: Tushar Dave, ast, daniel, davem, jakub.kicinski, quentin.monnet,
	jiong.wang, sandipan, john.fastabend, kafai, rdna, yhs, netdev,
	rds-devel, sowmini.varadhan
In-Reply-To: <1536694684-3200-5-git-send-email-tushar.n.dave@oracle.com>

On 9/11/18 12:38 PM, Tushar Dave wrote:
> The RDS module sits on top of TCP (rds_tcp) and IB (rds_rdma), so messages
> arrive in the form of skb (over TCP) and scatterlist (over IB/RDMA).
> However, because socket filters only deal with skb (e.g. struct skb as
> bpf context), we can only use a socket filter for rds_tcp and not for
> rds_rdma.
> 
> Considering one filtering solution for RDS, it seems that the common
> denominator between sk_buff and scatterlist is scatterlist. Therefore,
> this patch converts skb to sgvec and invokes sg_filter_run for
> rds_tcp, and simply invokes sg_filter_run for IB/rds_rdma.
> 
> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
> Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
I remember acking the earlier version. Here it is again..

Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

^ permalink raw reply

* [PATCH bpf-next] tools/bpf: fix a netlink recv issue
From: Yonghong Song @ 2018-09-11 21:09 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team

Commit f7010770fbac ("tools/bpf: move bpf/lib netlink related
functions into a new file") introduced a while loop for the
netlink recv path. This while loop is needed since the
buffer passed to the recv syscall may not be large enough to hold
all the information, and in such cases multiple recv calls are needed.

There is a bug introduced by the above commit: the while loop
may block on the recv syscall when no more messages are
expected. The netlink message header flag NLM_F_MULTI
indicates that more messages are expected, and this patch
fixes the bug by issuing a further recv syscall only if a
multipart message is expected.

The patch also adds a fix for a message length of 0.
When netlink recv returns a message length of 0, there will be
no more messages returning data, so the while loop
can end.

Fixes: f7010770fbac ("tools/bpf: move bpf/lib netlink related functions into a new file")
Reported-by: Björn Töpel <bjorn.topel@intel.com>
Tested-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/lib/bpf/netlink.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c
index 469e068dd0c5..fde1d7bf8199 100644
--- a/tools/lib/bpf/netlink.c
+++ b/tools/lib/bpf/netlink.c
@@ -65,18 +65,23 @@ static int bpf_netlink_recv(int sock, __u32 nl_pid, int seq,
 			    __dump_nlmsg_t _fn, dump_nlmsg_t fn,
 			    void *cookie)
 {
+	bool multipart = true;
 	struct nlmsgerr *err;
 	struct nlmsghdr *nh;
 	char buf[4096];
 	int len, ret;

-	while (1) {
+	while (multipart) {
+		multipart = false;
 		len = recv(sock, buf, sizeof(buf), 0);
 		if (len < 0) {
 			ret = -errno;
 			goto done;
 		}

+		if (len == 0)
+			break;
+
 		for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
 		     nh = NLMSG_NEXT(nh, len)) {
 			if (nh->nlmsg_pid != nl_pid) {
@@ -87,6 +92,8 @@ static int bpf_netlink_recv(int sock, __u32 nl_pid, int seq,
 				ret = -LIBBPF_ERRNO__INVSEQ;
 				goto done;
 			}
+			if (nh->nlmsg_flags & NLM_F_MULTI)
+				multipart = true;
 			switch (nh->nlmsg_type) {
 			case NLMSG_ERROR:
 				err = (struct nlmsgerr *)NLMSG_DATA(nh);
-- 
2.17.1
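
The decision the patch makes per recv call can be sketched on its own as
below. This is a rough illustration rather than the libbpf code: the helper
name nl_more_expected is hypothetical, and the early stop on NLMSG_DONE is
not part of the hunks above; it is added here on the assumption that the
end-of-dump message should end the loop as well.

```c
#include <stdbool.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/*
 * Walk the netlink messages in buf (len bytes, as returned by one recv
 * call) and report whether another recv should be issued: true only if
 * some message carried NLM_F_MULTI and no NLMSG_DONE was seen.
 * An empty buffer (len == 0) yields false, so the caller stops.
 */
bool nl_more_expected(void *buf, int len)
{
	struct nlmsghdr *nh;
	bool multipart = false;

	for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
	     nh = NLMSG_NEXT(nh, len)) {
		if (nh->nlmsg_flags & NLM_F_MULTI)
			multipart = true;
		/* Assumption: NLMSG_DONE terminates a multipart dump. */
		if (nh->nlmsg_type == NLMSG_DONE)
			return false;
	}

	return multipart;
}
```

A caller would loop on recv only while this returns true, which is what
avoids blocking on a socket that has nothing further to deliver.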

^ permalink raw reply related

* Re: [PATCH bpf-next 0/2] bpf: add bpffs/bpftool dump for prog_array and map_in_map maps
From: Alexei Starovoitov @ 2018-09-11 21:20 UTC (permalink / raw)
  To: Yonghong Song; +Cc: ast, daniel, netdev, kernel-team
In-Reply-To: <20180907002605.1408960-1-yhs@fb.com>

On Thu, Sep 06, 2018 at 05:26:03PM -0700, Yonghong Song wrote:
> The support to dump program array and map_in_map maps
> for bpffs and bpftool is added. Patch #1 added bpffs support
> and Patch #2 added bpftool support. Please see
> individual patches for example output.

Applied, Thanks

^ permalink raw reply

* [Patch net] net_sched: notify filter deletion when deleting a chain
From: Cong Wang @ 2018-09-11 21:22 UTC (permalink / raw)
  To: netdev; +Cc: Cong Wang, Jiri Pirko

When we delete a chain of filters, we need to notify
user-space that we are deleting each filter in this chain
too.

Fixes: 32a4f5ecd738 ("net: sched: introduce chain object to uapi")
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/sched/cls_api.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 1a67af8a6e8c..0a75cb2e5e7b 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -1902,6 +1902,8 @@ static int tc_ctl_chain(struct sk_buff *skb, struct nlmsghdr *n,
 				RTM_NEWCHAIN, false);
 		break;
 	case RTM_DELCHAIN:
+		tfilter_notify_chain(net, skb, block, q, parent, n,
+				     chain, RTM_DELTFILTER);
 		/* Flush the chain first as the user requested chain removal. */
 		tcf_chain_flush(chain);
 		/* In case the chain was successfully deleted, put a reference
-- 
2.14.4

^ permalink raw reply related

* Re: libbpf build broken on musl libc (Alpine Linux)
From: Alexei Starovoitov @ 2018-09-11 21:24 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Jakub Kicinski, Daniel Borkmann, Thomas Richter,
	Hendrik Brueckner, Linux Kernel Mailing List,
	Linux Networking Development Mailing List
In-Reply-To: <20180911121543.GB22689@kernel.org>

On Tue, Sep 11, 2018 at 09:15:43AM -0300, Arnaldo Carvalho de Melo wrote:
> Em Tue, Sep 11, 2018 at 12:22:18PM +0200, Jakub Kicinski escreveu:
> > On Mon, 10 Sep 2018 14:29:03 -0300, Arnaldo Carvalho de Melo wrote:
> > > After lunch I'll work on a patch to fix this, 
>  
> > Hi Arnaldo!
>  
> > Any luck?
> 
> Well, we need to apply the patch below and make tools/lib/str_error_r.c
> live in a library that libbpf and perf are linked to.

Do you want us to take the patch, or are you applying it yourself?

^ permalink raw reply

* Re: [PATCH bpf-next] tools/bpf: fix a netlink recv issue
From: Alexei Starovoitov @ 2018-09-11 21:27 UTC (permalink / raw)
  To: Yonghong Song; +Cc: ast, daniel, netdev, kernel-team
In-Reply-To: <20180911210911.3235080-1-yhs@fb.com>

On Tue, Sep 11, 2018 at 02:09:11PM -0700, Yonghong Song wrote:
> Commit f7010770fbac ("tools/bpf: move bpf/lib netlink related
> functions into a new file") introduced a while loop for the
> netlink recv path. This while loop is needed since the
> buffer passed to the recv syscall may not be large enough to hold
> all the information, and in such cases multiple recv calls are needed.
> 
> There is a bug introduced by the above commit: the while loop
> may block on the recv syscall when no more messages are
> expected. The netlink message header flag NLM_F_MULTI
> indicates that more messages are expected, and this patch
> fixes the bug by issuing a further recv syscall only if a
> multipart message is expected.
> 
> The patch also adds a fix for a message length of 0.
> When netlink recv returns a message length of 0, there will be
> no more messages returning data, so the while loop
> can end.
> 
> Fixes: f7010770fbac ("tools/bpf: move bpf/lib netlink related functions into a new file")
> Reported-by: Björn Töpel <bjorn.topel@intel.com>
> Tested-by: Björn Töpel <bjorn.topel@intel.com>
> Signed-off-by: Yonghong Song <yhs@fb.com>

Applied, Thanks

^ permalink raw reply

* Re: [PATCH v3 net-next 3/6] dt-bindings: net: Add lantiq,xrx200-net DT bindings
From: Hauke Mehrtens @ 2018-09-11 21:33 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: davem, netdev, vivien.didelot, f.fainelli, john, linux-mips, dev,
	hauke.mehrtens, devicetree
In-Reply-To: <20180910125308.GC30395@lunn.ch>


[-- Attachment #1.1: Type: text/plain, Size: 2267 bytes --]

On 09/10/2018 02:53 PM, Andrew Lunn wrote:
> On Sun, Sep 09, 2018 at 10:16:44PM +0200, Hauke Mehrtens wrote:
>> This adds the binding for the PMAC core between the CPU and the GSWIP
>> switch found on the xrx200 / VR9 Lantiq / Intel SoC.
>>
>> Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
>> Cc: devicetree@vger.kernel.org
>> ---
>>  .../devicetree/bindings/net/lantiq,xrx200-net.txt   | 21 +++++++++++++++++++++
>>  1 file changed, 21 insertions(+)
>>  create mode 100644 Documentation/devicetree/bindings/net/lantiq,xrx200-net.txt
>>
>> diff --git a/Documentation/devicetree/bindings/net/lantiq,xrx200-net.txt b/Documentation/devicetree/bindings/net/lantiq,xrx200-net.txt
>> new file mode 100644
>> index 000000000000..8a2fe5200cdc
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/net/lantiq,xrx200-net.txt
>> @@ -0,0 +1,21 @@
>> +Lantiq xRX200 GSWIP PMAC Ethernet driver
>> +==================================
>> +
>> +Required properties:
>> +
>> +- compatible	: "lantiq,xrx200-net" for the PMAC of the embedded
>> +		: GSWIP in the xXR200
>> +- reg		: memory range of the PMAC core inside of the GSWIP core
>> +- interrupts	: TX and RX DMA interrupts. Use interrupt-names "tx" for
>> +		: the TX interrupt and "rx" for the RX interrupt.
>> +
>> +Example:
>> +
>> +eth0: eth@E10B308 {
>> +	#address-cells = <1>;
>> +	#size-cells = <0>;
>> +	compatible = "lantiq,xrx200-net";
>> +	reg = <0xE10B308 0x30>;
> 
> Hi Hauke
> 
> This binding itself looks fine. I just find this address range a bit
> odd. What are 0xe10b300-0xe10b307 used for? Are all 0x30 bytes used in
> the range? The address range ending at 0xe10b338 seems a bit
> odd. 0xe10b33f would be more typical.
> 
> I'm asking because it can be messy when you find out you need to
> change the address range, and not break backwards compatibility.
> 
>      Andrew

Hi Andrew,

Thank you for the question, there were multiple problems with the
register size in this description.
It is a bit more complicated because the PMAC is part of the switch and
does not start at an even address.

It is correct that this starts at 0xE10B308, but the size is 0xCF8.

0xe10b300 is unused and 0xe10B304 is used to enable debug modes.

Hauke
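
The address arithmetic in this exchange can be checked with a small sketch.
The struct and helper names below are hypothetical; the base and the two
sizes are the ones quoted in this thread. With size 0x30 the window ends at
0xE10B338, as Andrew noted, and with the corrected size 0xCF8 it ends at
0xE10C000.

```c
#include <stdint.h>

/* Half-open register window [base, base + size). */
struct reg_range {
	uint32_t base;
	uint32_t size;
};

/* First address past the window. */
uint32_t reg_range_end(struct reg_range r)
{
	return r.base + r.size;
}

/* Address of the last byte inside the window (size must be non-zero). */
uint32_t reg_range_last(struct reg_range r)
{
	return r.base + r.size - 1;
}
```

For the corrected binding, reg_range_last() gives 0xE10BFFF.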


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH v2 net] MIPS: lantiq: dma: add dev pointer
From: Hauke Mehrtens @ 2018-09-11 21:36 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: davem, netdev, f.fainelli, john, linux-mips, hauke.mehrtens,
	paul.burton
In-Reply-To: <20180910124542.GB30395@lunn.ch>


[-- Attachment #1.1: Type: text/plain, Size: 946 bytes --]

On 09/10/2018 02:45 PM, Andrew Lunn wrote:
> On Sun, Sep 09, 2018 at 09:26:23PM +0200, Hauke Mehrtens wrote:
>> dma_zalloc_coherent() now crashes if no dev pointer is given.
>> Add a dev pointer to the ltq_dma_channel structure and fill it in the
>> driver using it.
>>
>> This fixes a bug introduced in kernel 4.19.
>>
>> Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
>> ---
>>
>> no changes since v1.
>>
>> This should go into kernel 4.19 and I have some other patches adding new 
>> features for kernel 4.20 which are depending on this, so I would prefer 
>> if this goes through the net tree. 
> 
> Hi Hauke
> 
> Is this a build time dependency, or a runtime dependency?
> 
> What we don't want to do is add the switch driver to net-next and find
> it does not compile because this change is not in net-next yet.
> 
>    Andrew
> 

Yes, this has a compile dependency because I had to extend the API.

Hauke


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH net-next] net: bridge: add support for sticky fdb entries
From: Roopa Prabhu @ 2018-09-11 21:41 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Nikolay Aleksandrov, netdev, David Miller, bridge
In-Reply-To: <20180910141802.268e85c5@shemminger-XPS-13-9360>

On Mon, Sep 10, 2018 at 2:18 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Mon, 10 Sep 2018 13:16:01 +0300
> Nikolay Aleksandrov <nikolay@cumulusnetworks.com> wrote:
>
>> Add support for entries which are "sticky", i.e. will not change their port
>> if they show up from a different one. A new ndm flag is introduced for that
>> purpose - NTF_STICKY. We allow to set it only to non-local entries.
>
> Is there a name for this in other network switch API's?


In all switch ASICs, static fdb entries implicitly don't move.
Since the kernel does not do the same for static fdb entries, we
want a new flag to explicitly prevent MAC moves.

As Nikolay says, the primary request for this came from an E-VPN RFC,
which uses the name 'sticky', hence the name:
https://tools.ietf.org/html/rfc7432#section-15.2
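
As a rough userspace sketch of the behaviour being discussed (not the
bridge code itself): an entry follows the source port on a move unless it is
marked sticky. The condition mirrors the br_fdb_update() hunk quoted below;
the struct and function names are hypothetical.

```c
#include <stdbool.h>

/* Simplified stand-in for a bridge fdb entry. */
struct fdb_entry {
	int dst_port;      /* port the MAC address is currently bound to */
	bool is_sticky;    /* set when the entry was added with NTF_STICKY */
};

/*
 * Called when a frame with this entry's MAC arrives on source_port.
 * Returns true if the entry was moved to the new port.
 */
bool fdb_update_port(struct fdb_entry *fdb, int source_port)
{
	if (source_port != fdb->dst_port && !fdb->is_sticky) {
		fdb->dst_port = source_port;
		return true;
	}

	return false;
}
```

A sticky entry therefore keeps its original port no matter where the
address shows up later.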



>
>>
>> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
>> ---
>> I'll send the selftest for sticky with the iproute2 patch if this one is
>> accepted. We've had multiple requests to support such flag and now it's
>> also needed for some eVPN and clag setups.
>>
>>  include/uapi/linux/neighbour.h |  1 +
>>  net/bridge/br_fdb.c            | 19 ++++++++++++++++---
>>  net/bridge/br_private.h        |  1 +
>>  3 files changed, 18 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/uapi/linux/neighbour.h b/include/uapi/linux/neighbour.h
>> index 904db6148476..998155444e0d 100644
>> --- a/include/uapi/linux/neighbour.h
>> +++ b/include/uapi/linux/neighbour.h
>> @@ -43,6 +43,7 @@ enum {
>>  #define NTF_PROXY    0x08    /* == ATF_PUBL */
>>  #define NTF_EXT_LEARNED      0x10
>>  #define NTF_OFFLOADED   0x20
>> +#define NTF_STICKY   0x40
>>  #define NTF_ROUTER   0x80
>>
>>  /*
>> diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
>> index 502f66349530..26569ed06a4d 100644
>> --- a/net/bridge/br_fdb.c
>> +++ b/net/bridge/br_fdb.c
>> @@ -584,7 +584,7 @@ void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source,
>>                       unsigned long now = jiffies;
>>
>>                       /* fastpath: update of existing entry */
>> -                     if (unlikely(source != fdb->dst)) {
>> +                     if (unlikely(source != fdb->dst && !fdb->is_sticky)) {
>>                               fdb->dst = source;
>>                               fdb_modified = true;
>>                               /* Take over HW learned entry */
>> @@ -656,6 +656,8 @@ static int fdb_fill_info(struct sk_buff *skb, const struct net_bridge *br,
>>               ndm->ndm_flags |= NTF_OFFLOADED;
>>       if (fdb->added_by_external_learn)
>>               ndm->ndm_flags |= NTF_EXT_LEARNED;
>> +     if (fdb->is_sticky)
>> +             ndm->ndm_flags |= NTF_STICKY;
>>
>>       if (nla_put(skb, NDA_LLADDR, ETH_ALEN, &fdb->key.addr))
>>               goto nla_put_failure;
>> @@ -772,7 +774,8 @@ int br_fdb_dump(struct sk_buff *skb,
>>
>>  /* Update (create or replace) forwarding database entry */
>>  static int fdb_add_entry(struct net_bridge *br, struct net_bridge_port *source,
>> -                      const __u8 *addr, __u16 state, __u16 flags, __u16 vid)
>> +                      const u8 *addr, u16 state, u16 flags, u16 vid,
>> +                      u8 is_sticky)
>
> Why not change the API to take a full ndm flags, someone is sure to add more later.

^ permalink raw reply

* Re: Fw: [Bug 201071] New: Creating a vxlan in state 'up' does not give proper RTM_NEWLINK message
From: Roopa Prabhu @ 2018-09-11 22:10 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20180910115516.46f0c47d@shemminger-XPS-13-9360>

On Mon, Sep 10, 2018 at 11:55 AM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
>
>
> Begin forwarded message:
>
> Date: Mon, 10 Sep 2018 04:04:37 +0000
> From: bugzilla-daemon@bugzilla.kernel.org
> To: stephen@networkplumber.org
> Subject: [Bug 201071] New: Creating a vxlan in state 'up' does not give proper RTM_NEWLINK message
>
>
> https://bugzilla.kernel.org/show_bug.cgi?id=201071
>
>             Bug ID: 201071
>            Summary: Creating a vxlan in state 'up' does not give proper
>                     RTM_NEWLINK message
>            Product: Networking
>            Version: 2.5
>     Kernel Version: 4.19-rc1
>           Hardware: All
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Other
>           Assignee: stephen@networkplumber.org
>           Reporter: liam.mcbirnie@boeing.com
>         Regression: Yes
>
> If a vxlan is created with state 'up', the RTM_NEWLINK message shows the state
> as down, and no other netlink messages are sent.
> As a result, processes listening to netlink are never notified that the vxlan
> link is up.

thanks for the fwd. looking...



>
> eg.
> # ip link add test up type vxlan id 8 group 224.224.224.224 dev eth0
>
> Output of ip monitor link
> # 4: test: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN group default
>       link/ether ee:cd:97:1a:cf:91 brd ff:ff:ff:ff:ff:ff
>
> Output of ip link show (expected from netlink message)
> # 4: test: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state
> UNKNOWN group default qlen 1000
>       link/ether ee:cd:97:1a:cf:91 brd ff:ff:ff:ff:ff:ff
>
> This is a regression introduced by the following patch series.
> https://patchwork.ozlabs.org/patch/947181/
>
> --
> You are receiving this mail because:
> You are the assignee for the bug.

^ permalink raw reply

* [Patch net] tipc: check return value of __tipc_dump_start()
From: Cong Wang @ 2018-09-11 22:12 UTC (permalink / raw)
  To: netdev; +Cc: tipc-discussion, Cong Wang, Jon Maloy, Ying Xue

When __tipc_dump_start() fails because it runs out of memory,
we have no reason to continue; in particular, we should avoid
calling tipc_dump_done().

Fixes: 8f5c5fcf3533 ("tipc: call start and done ops directly in __tipc_nl_compat_dumpit()")
Reported-and-tested-by: syzbot+3f8324abccfbf8c74a9f@syzkaller.appspotmail.com
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/tipc/netlink_compat.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/tipc/netlink_compat.c b/net/tipc/netlink_compat.c
index 82f665728382..6376467e78f8 100644
--- a/net/tipc/netlink_compat.c
+++ b/net/tipc/netlink_compat.c
@@ -185,7 +185,10 @@ static int __tipc_nl_compat_dumpit(struct tipc_nl_compat_cmd_dump *cmd,
 		return -ENOMEM;
 
 	buf->sk = msg->dst_sk;
-	__tipc_dump_start(&cb, msg->net);
+	if (__tipc_dump_start(&cb, msg->net)) {
+		kfree_skb(buf);
+		return -ENOMEM;
+	}
 
 	do {
 		int rem;
-- 
2.14.4

^ permalink raw reply related

* [PATCH] net: dsa: mv88e6xxx: Make sure to configure ports with external PHYs
From: Marek Vasut @ 2018-09-11 22:15 UTC (permalink / raw)
  To: netdev; +Cc: Marek Vasut, Andrew Lunn

The MV88E6xxx can have external PHYs attached to certain ports, and those
PHYs could even be on a different MDIO bus than the one within the switch.
This patch makes sure that ports with such PHYs are configured correctly
according to the information provided by the PHY.

Signed-off-by: Marek Vasut <marex@denx.de>
Cc: Andrew Lunn <andrew@lunn.ch>
---
 drivers/net/dsa/mv88e6xxx/chip.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 15427380e32e..dc92dd7b55ab 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -575,6 +575,13 @@ static int mv88e6xxx_port_setup_mac(struct mv88e6xxx_chip *chip, int port,
 	return err;
 }
 
+static int mv88e6xxx_phy_is_internal(struct dsa_switch *ds, int port)
+{
+	struct mv88e6xxx_chip *chip = ds->priv;
+
+	return port < chip->info->num_internal_phys;
+}
+
 /* We expect the switch to perform auto negotiation if there is a real
  * phy. However, in the case of a fixed link phy, we force the port
  * settings from the fixed link settings.
@@ -585,7 +592,8 @@ static void mv88e6xxx_adjust_link(struct dsa_switch *ds, int port,
 	struct mv88e6xxx_chip *chip = ds->priv;
 	int err;
 
-	if (!phy_is_pseudo_fixed_link(phydev))
+	if (!phy_is_pseudo_fixed_link(phydev) &&
+	    mv88e6xxx_phy_is_internal(ds, port))
 		return;
 
 	mutex_lock(&chip->reg_lock);
@@ -709,13 +717,17 @@ static void mv88e6xxx_mac_config(struct dsa_switch *ds, int port,
 	struct mv88e6xxx_chip *chip = ds->priv;
 	int speed, duplex, link, pause, err;
 
-	if (mode == MLO_AN_PHY)
+	if ((mode == MLO_AN_PHY) && mv88e6xxx_phy_is_internal(ds, port))
 		return;
 
 	if (mode == MLO_AN_FIXED) {
 		link = LINK_FORCED_UP;
 		speed = state->speed;
 		duplex = state->duplex;
+	} else if (!mv88e6xxx_phy_is_internal(ds, port)) {
+		link = state->link;
+		speed = state->speed;
+		duplex = state->duplex;
 	} else {
 		speed = SPEED_UNFORCED;
 		duplex = DUPLEX_UNFORCED;
-- 
2.18.0

* [PATCH net-next V2 00/11] vhost_net TX batching
From: Jason Wang @ 2018-09-12  3:16 UTC (permalink / raw)
  To: netdev, linux-kernel; +Cc: kvm, virtualization, mst, jasowang

Hi all:

This series batches packet submission to the underlying socket
through msg_control during sendmsg(). This is done by:

1) Doing the userspace copy inside vhost_net
2) Building an XDP buff
3) Batching at most 64 (VHOST_NET_BATCH) XDP buffs and submitting them
   at once through msg_control during sendmsg()
4) Letting the underlying socket use the XDP buffs directly when XDP is
   enabled, or build an skb from each XDP buff otherwise

For packets that cannot easily be built as XDP buffs, or when batch
submission is hard (e.g. sndbuf is limited), we fall back to the
previous slow path, passing an iov iterator to the underlying socket
through sendmsg() once per packet.
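
The batching idea can be sketched in plain userspace C as follows. This
is only an illustration under stated assumptions, not the vhost_net
code: struct pkt, struct batch, submit_fn and count_submit() are
hypothetical names, and the callback stands in for one sendmsg() call
that carries an array of buffers through msg_control.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_MAX 64 /* same limit as VHOST_NET_BATCH in the cover letter */

/* Hypothetical stand-in for an XDP buff. */
struct pkt {
	const void *data;
	size_t len;
};

/* Hypothetical callback: one call submits n buffers at once. */
typedef int (*submit_fn)(const struct pkt *pkts, size_t n, void *ctx);

struct batch {
	struct pkt pkts[BATCH_MAX];
	size_t n;
	submit_fn submit;
	void *ctx;
};

static void batch_init(struct batch *b, submit_fn submit, void *ctx)
{
	assert(submit != NULL);
	b->n = 0;
	b->submit = submit;
	b->ctx = ctx;
}

/* Submit everything queued with a single call; a no-op when empty. */
static int batch_flush(struct batch *b)
{
	int err = 0;

	if (b->n) {
		err = b->submit(b->pkts, b->n, b->ctx);
		b->n = 0;
	}
	return err;
}

/* Queue one packet and flush as soon as the batch is full. */
static int batch_add(struct batch *b, struct pkt p)
{
	b->pkts[b->n++] = p;
	if (b->n == BATCH_MAX)
		return batch_flush(b);
	return 0;
}

/* Example callback that only counts calls and packets. */
struct submit_stats {
	size_t calls;
	size_t pkts;
};

static int count_submit(const struct pkt *pkts, size_t n, void *ctx)
{
	struct submit_stats *st = ctx;

	(void)pkts;
	st->calls++;
	st->pkts += n;
	return 0;
}
```

Queuing 130 packets through batch_add() and then calling batch_flush()
once results in three callback invocations (64 + 64 + 2) instead of 130,
which is where the saving in indirect calls would come from.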

This can help to improve cache utilization and avoid lots of indirect
calls with sendmsg(). It can also cooperate with the batching support
of the underlying sockets (e.g. XDP redirection through maps).

Testpmd(txonly) in guest shows obvious improvements:

Test                /+pps%
XDP_DROP on TAP     /+44.8%
XDP_REDIRECT on TAP /+29%
macvtap (skb)       /+26%

Netperf TCP_STREAM TX from guest shows obvious improvements for small
packets:

    size/session/+thu%/+normalize%
       64/     1/   +2%/    0%
       64/     2/   +3%/   +1%
       64/     4/   +7%/   +5%
       64/     8/   +8%/   +6%
      256/     1/   +3%/    0%
      256/     2/  +10%/   +7%
      256/     4/  +26%/  +22%
      256/     8/  +27%/  +23%
      512/     1/   +3%/   +2%
      512/     2/  +19%/  +14%
      512/     4/  +43%/  +40%
      512/     8/  +45%/  +41%
     1024/     1/   +4%/    0%
     1024/     2/  +27%/  +21%
     1024/     4/  +38%/  +73%
     1024/     8/  +15%/  +24%
     2048/     1/  +10%/   +7%
     2048/     2/  +16%/  +12%
     2048/     4/    0%/   +2%
     2048/     8/    0%/   +2%
     4096/     1/  +36%/  +60%
     4096/     2/  -11%/  -26%
     4096/     4/    0%/  +14%
     4096/     8/    0%/   +4%
    16384/     1/   -1%/   +5%
    16384/     2/    0%/   +2%
    16384/     4/    0%/   -3%
    16384/     8/    0%/   +4%
    65535/     1/    0%/  +10%
    65535/     2/    0%/   +8%
    65535/     4/    0%/   +1%
    65535/     8/    0%/   +3%

Please review.

Thanks

Changes from V1:

- don't hold page refcnt for XDP_DROP
- release page if build_skb() fails
- introduce a helper to build skb for the path of no batching mode
- rename tun_do_xdp() as tun_xdp_act() and use a negative return value
  to report errors instead of a pointer to an integer
- introduce a num field in tun_msg_ctl
- introduce a new structure for storing metadata in the head of XDP
  packet
- do not store batched XDP buffs inside vhost_net_virtqueue

Jason Wang (11):
  net: sock: introduce SOCK_XDP
  tuntap: switch to use XDP_PACKET_HEADROOM
  tuntap: enable bh early during processing XDP
  tuntap: simplify error handling in tun_build_skb()
  tuntap: tweak on the path of skb XDP case in tun_build_skb()
  tuntap: split out XDP logic
  tuntap: move XDP flushing out of tun_do_xdp()
  tun: switch to new type of msg_control
  tuntap: accept an array of XDP buffs through sendmsg()
  tap: accept an array of XDP buffs through sendmsg()
  vhost_net: batch submitting XDP buffers to underlayer sockets

 drivers/net/tap.c      |  88 +++++++++++++-
 drivers/net/tun.c      | 267 ++++++++++++++++++++++++++++++++---------
 drivers/vhost/net.c    | 181 +++++++++++++++++++++++++---
 include/linux/if_tun.h |  14 +++
 include/net/sock.h     |   1 +
 5 files changed, 471 insertions(+), 80 deletions(-)

-- 
2.17.1

* [PATCH net-next V2 01/11] net: sock: introduce SOCK_XDP
From: Jason Wang @ 2018-09-12  3:16 UTC (permalink / raw)
  To: netdev, linux-kernel; +Cc: kvm, virtualization, mst, jasowang
In-Reply-To: <20180912031709.14112-1-jasowang@redhat.com>

This patch introduces a new sock flag, SOCK_XDP. It will be used to
notify the upper layer that an XDP program is attached to the lower
socket and that extra headroom is required.

TUN will be the first user.
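
A minimal userspace model of how such a flag could be set per queue and
consumed by an upper layer is sketched here. It is an assumption-laden
illustration, not kernel code: the model_* names are hypothetical, the
enum is shortened, and the headroom helper is only one plausible use of
the flag. XDP_PACKET_HEADROOM is 256 in the kernel UAPI header.

```c
#include <assert.h>
#include <stdbool.h>

#define XDP_PACKET_HEADROOM 256 /* value from include/uapi/linux/bpf.h */

/* Hypothetical, shortened flag enum; the real enum sock_flags is longer. */
enum model_sock_flags { MODEL_SOCK_TXTIME, MODEL_SOCK_XDP };

/* Hypothetical stand-in for struct sock: flags are bit positions. */
struct model_sock {
	unsigned long flags;
};

static void model_set_flag(struct model_sock *sk, enum model_sock_flags f)
{
	sk->flags |= 1UL << f;
}

static void model_reset_flag(struct model_sock *sk, enum model_sock_flags f)
{
	sk->flags &= ~(1UL << f);
}

static bool model_flag(const struct model_sock *sk, enum model_sock_flags f)
{
	return (sk->flags & (1UL << f)) != 0;
}

/* Mirrors the loop in tun_xdp_set(): update every queue's socket. */
static void model_xdp_set(struct model_sock *socks, int n, bool prog_attached)
{
	int i;

	assert(n >= 0);
	for (i = 0; i < n; i++) {
		if (prog_attached)
			model_set_flag(&socks[i], MODEL_SOCK_XDP);
		else
			model_reset_flag(&socks[i], MODEL_SOCK_XDP);
	}
}

/* One way an upper layer might size headroom from the flag. */
static unsigned int model_headroom(const struct model_sock *sk)
{
	return model_flag(sk, MODEL_SOCK_XDP) ? XDP_PACKET_HEADROOM : 0;
}
```

In this model, attaching a program flips the flag on every queue, so the
upper layer reserves 256 bytes of headroom; detaching it clears the flag
and the headroom drops back to zero.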

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/tun.c  | 19 +++++++++++++++++++
 include/net/sock.h |  1 +
 2 files changed, 20 insertions(+)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index ebd07ad82431..2c548bd20393 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -869,6 +869,9 @@ static int tun_attach(struct tun_struct *tun, struct file *file,
 		tun_napi_init(tun, tfile, napi);
 	}
 
+	if (rtnl_dereference(tun->xdp_prog))
+		sock_set_flag(&tfile->sk, SOCK_XDP);
+
 	tun_set_real_num_queues(tun);
 
 	/* device is allowed to go away first, so no need to hold extra
@@ -1241,13 +1244,29 @@ static int tun_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 		       struct netlink_ext_ack *extack)
 {
 	struct tun_struct *tun = netdev_priv(dev);
+	struct tun_file *tfile;
 	struct bpf_prog *old_prog;
+	int i;
 
 	old_prog = rtnl_dereference(tun->xdp_prog);
 	rcu_assign_pointer(tun->xdp_prog, prog);
 	if (old_prog)
 		bpf_prog_put(old_prog);
 
+	for (i = 0; i < tun->numqueues; i++) {
+		tfile = rtnl_dereference(tun->tfiles[i]);
+		if (prog)
+			sock_set_flag(&tfile->sk, SOCK_XDP);
+		else
+			sock_reset_flag(&tfile->sk, SOCK_XDP);
+	}
+	list_for_each_entry(tfile, &tun->disabled, next) {
+		if (prog)
+			sock_set_flag(&tfile->sk, SOCK_XDP);
+		else
+			sock_reset_flag(&tfile->sk, SOCK_XDP);
+	}
+
 	return 0;
 }
 
diff --git a/include/net/sock.h b/include/net/sock.h
index 433f45fc2d68..38cae35f6e16 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -800,6 +800,7 @@ enum sock_flags {
 	SOCK_SELECT_ERR_QUEUE, /* Wake select on error queue */
 	SOCK_RCU_FREE, /* wait rcu grace period in sk_destruct() */
 	SOCK_TXTIME,
+	SOCK_XDP, /* XDP is attached */
 };
 
 #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
-- 
2.17.1
