Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH bpf] bpf: use __GFP_COMP while allocating page
From: Tushar Dave @ 2018-09-12 20:15 UTC (permalink / raw)
  To: ast, daniel, davem, netdev, john.fastabend

Helper bpg_msg_pull_data() can allocate multiple pages while
linearizing multiple scatterlist elements into one shared page.
However, if the shared page has size > PAGE_SIZE, using
copy_page_to_iter() causes below warning.

e.g.
[ 6367.019832] WARNING: CPU: 2 PID: 7410 at lib/iov_iter.c:825
page_copy_sane.part.8+0x0/0x8

To avoid above warning, use __GFP_COMP while allocating multiple
contiguous pages.

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
---
 net/core/filter.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index d301134..0b40f95 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2344,7 +2344,8 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
 	if (unlikely(bytes_sg_total > copy))
 		return -EINVAL;
 
-	page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC, get_order(copy));
+	page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC | __GFP_COMP,
+			   get_order(copy));
 	if (unlikely(!page))
 		return -ENOMEM;
 	p = page_address(page);
-- 
1.8.3.1

^ permalink raw reply related

* Re: [Patch net] tipc: check return value of __tipc_dump_start()
From: David Miller @ 2018-09-12 20:15 UTC (permalink / raw)
  To: xiyou.wangcong; +Cc: netdev, tipc-discussion, jon.maloy, ying.xue
In-Reply-To: <20180911221217.23392-1-xiyou.wangcong@gmail.com>

From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Tue, 11 Sep 2018 15:12:17 -0700

> When __tipc_dump_start() fails with running out of memory,
> we have no reason to continue, especially we should avoid
> calling tipc_dump_done().
> 
> Fixes: 8f5c5fcf3533 ("tipc: call start and done ops directly in __tipc_nl_compat_dumpit()")
> Reported-and-tested-by: syzbot+3f8324abccfbf8c74a9f@syzkaller.appspotmail.com
> Cc: Jon Maloy <jon.maloy@ericsson.com>
> Cc: Ying Xue <ying.xue@windriver.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

Applied, thanks Cong.

^ permalink raw reply

* Re: [PATCH net 0/4] s390/qeth: fixes 2018-09-12
From: David Miller @ 2018-09-12 20:13 UTC (permalink / raw)
  To: jwi; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, ubraun
In-Reply-To: <20180912133135.12335-1-jwi@linux.ibm.com>

From: Julian Wiedmann <jwi@linux.ibm.com>
Date: Wed, 12 Sep 2018 15:31:31 +0200

> please apply the following qeth fixes for -net.
> 
> Patch 1 resolves a regression in an error path, while patch 2 enables
> the SG support by default that was newly introduced with 4.19.
> Patch 3 takes care of a longstanding problem with large-order
> allocations, and patch 4 fixes a potential out-of-bounds access.

Series applied, thanks Julian.

^ permalink raw reply

* Re: [PATCH iproute2] q_cake: Add printing of no-split-gso option
From: Stephen Hemminger @ 2018-09-12 20:07 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen; +Cc: netdev
In-Reply-To: <20180911223216.15609-1-toke@toke.dk>

On Wed, 12 Sep 2018 00:32:16 +0200
Toke Høiland-Jørgensen <toke@toke.dk> wrote:

> When the GSO splitting was turned into dual split-gso/no-split-gso options,
> the printing of the latter was left out. Add that, so output is consistent
> with the options passed.
> 
> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>

Applied. I noticed that nat/nonat and wash/nowash have similar missing output.

^ permalink raw reply

* Re: [PATCH bpf] bpf: btf: Fix end boundary calculation for type section
From: Daniel Borkmann @ 2018-09-12 20:02 UTC (permalink / raw)
  To: Martin KaFai Lau, netdev; +Cc: Alexei Starovoitov, kernel-team
In-Reply-To: <20180912172911.3609494-1-kafai@fb.com>

On 09/12/2018 07:29 PM, Martin KaFai Lau wrote:
> The end boundary math for type section is incorrect in
> btf_check_all_metas().  It just happens that hdr->type_off
> is always 0 for now because there are only two sections
> (type and string) and string section must be at the end (ensured
> in btf_parse_str_sec).
> 
> However, type_off may not be 0 if a new section would be added later.
> This patch fixes it.
> 
> Fixes: f80442a4cd18 ("bpf: btf: Change how section is supported in btf_header")
> Reported-by: Dmitry Vyukov <dvyukov@google.com>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>

Applied to bpf, thanks everyone!

^ permalink raw reply

* Re: [RFC v2 1/2] netlink: add NLA_REJECT policy type
From: Johannes Berg @ 2018-09-12 19:37 UTC (permalink / raw)
  To: Michal Kubecek; +Cc: David Miller, linux-wireless, netdev
In-Reply-To: <20180912192925.GD29691@unicorn.suse.cz>

On Wed, 2018-09-12 at 21:29 +0200, Michal Kubecek wrote:

> > 3) eventually replace nlmsg_parse() calls by nlmsg_parse_strict() and
> >    see what breaks? :-) We won't be able to rely on that any time soon
> >    though (unless userspace first checks with a guaranteed rejected
> >    attribute, e.g. one that has NLA_REJECT, perhaps the u64 pad
> >    attributes could be marked such since the kernel can't assume
> >    alignment anyway)
> 
> I'm not so sure we (eventually) want to reject unknown attributes
> everywhere. I don't have any concrete example in mind but I think there
> will be use cases where we want to ignore unrecognized attributes
> (probably per parse call). But it makes sense to require caller to
> explicitely declare this is the case.

Yeah, I think nla_parse() vs. nla_parse_strict() - along with
remembering in review to say "perhaps you should prefer
nla_parse_strict() for this new thing" might be all we want (or
realistically can do).

> > While we're talking about wishlist, I'm also toying with the idea of
> > having some sort of generic mechanism to convert netlink attributes
> > to/from structs, for internal kernel representation; so far though I
> > haven't been able to come up with anything useful.
> 
> I was also thinking about something like this. One motivation was to
> design extensible version of ethtool_ops, the other was allowing complex
> data types (structures, arrays) for ethtool tunables. But I have nothing
> more than a vague idea so far.

I played with macros a bit, but ultimately that wasn't so readable.
There's no way to do upper-casing in the preprocessor :-) Realistically,
I ended up with something that would require either a lot of boiler-
plate code estill, or perhaps double-including a header file (though I
never really finished the latter thought.)

This is what I toyed with earlier today:
https://p.sipsolutions.net/35e260d7debaa406.txt

johannes

^ permalink raw reply

* Re: [PATCH net-next 5/5] ebpf: Add sample ebpf program for SOCKET_SG_FILTER
From: Tushar Dave @ 2018-09-12 19:32 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: ast, daniel, davem, santosh.shilimkar, jakub.kicinski,
	quentin.monnet, jiong.wang, sandipan, john.fastabend, kafai, rdna,
	yhs, netdev, rds-devel, sowmini.varadhan
In-Reply-To: <20180912040038.oobnr4yfzoaajk6k@ast-mbp>



On 09/11/2018 09:00 PM, Alexei Starovoitov wrote:
> On Tue, Sep 11, 2018 at 09:38:04PM +0200, Tushar Dave wrote:
>> Add a sample program that shows how socksg program is used and attached
>> to socket filter. The kernel sample program deals with struct
>> scatterlist that is passed as bpf context.
>>
>> When run in server mode, the sample RDS program opens PF_RDS socket,
>> attaches eBPF program to RDS socket which then uses bpf_msg_pull_data
>> helper to inspect packet data contained in struct scatterlist and
>> returns appropriate action code back to kernel.
>>
>> To ease testing, RDS client functionality is also added so that users
>> can generate RDS packet.
>>
>> Server:
>> [root@lab71 bpf]# ./rds_filter -s 192.168.3.71 -t tcp
>> running server in a loop
>> transport tcp
>> server bound to address: 192.168.3.71 port 4000
>> server listening on 192.168.3.71
>>
>> Client:
>> [root@lab70 bpf]# ./rds_filter -s 192.168.3.71 -c 192.168.3.70 -t tcp
>> transport tcp
>> client bound to address: 192.168.3.70 port 25278
>> client sending 8192 byte message  from 192.168.3.70 to 192.168.3.71 on
>> port 25278
>> payload contains:30 31 32 33 34 35 36 37 38 39 ...
>>
>> Server output:
>> 192.168.3.71 received a packet from 192.168.3.71 of len 8192 cmsg len 0,
>> on port 25278
>> payload contains:30 31 32 33 34 35 36 37 38 39 ...
>> server listening on 192.168.3.71
>>
>> [root@lab71 tushar]# cat /sys/kernel/debug/tracing/trace_pipe
>>            <idle>-0     [038] ..s.   146.947362: 0: 30 31 32
>>            <idle>-0     [038] ..s.   146.947364: 0: 33 34 35
>>
>> Similarly specifying '-t ib' will run this on IB link.
>>
>> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
>> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
>> ---
>>   samples/bpf/Makefile          |   3 +
>>   samples/bpf/rds_filter_kern.c |  42 ++++++
>>   samples/bpf/rds_filter_user.c | 339 ++++++++++++++++++++++++++++++++++++++++++
> 
> please no samples.
> Add this as proper test to tools/testing/selftests/bpf
> that reports PASS/FAIL and can be run automatically.
> samples/bpf is effectively dead code.

Okay :(

-Tushar
> 
> 

^ permalink raw reply

* Re: [PATCH net-next 3/5] ebpf: Add sg_filter_run()
From: Tushar Dave @ 2018-09-12 19:27 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: ast, daniel, davem, santosh.shilimkar, jakub.kicinski,
	quentin.monnet, jiong.wang, sandipan, john.fastabend, kafai, rdna,
	yhs, netdev, rds-devel, sowmini.varadhan
In-Reply-To: <20180912035846.3lwf4wrbqky7vpwe@ast-mbp>



On 09/11/2018 08:58 PM, Alexei Starovoitov wrote:
> On Tue, Sep 11, 2018 at 09:38:02PM +0200, Tushar Dave wrote:
>> When sg_filter_run() is invoked it runs the attached eBPF
>> prog of type BPF_PROG_TYPE_SOCKET_SG_FILTER which deals with
>> struct scatterlist.
>>
>> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
>> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
>> ---
>>   include/linux/filter.h         |  8 ++++++++
>>   include/uapi/linux/bpf.h       |  6 ++++++
>>   net/core/filter.c              | 35 +++++++++++++++++++++++++++++++++++
>>   tools/include/uapi/linux/bpf.h |  6 ++++++
>>   4 files changed, 55 insertions(+)
>>
>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>> index 6791a0a..ae664a9 100644
>> --- a/include/linux/filter.h
>> +++ b/include/linux/filter.h
>> @@ -1113,4 +1113,12 @@ struct bpf_sock_ops_kern {
>>   					 */
>>   };
>>   
>> +enum __socksg_action {
>> +	__SOCKSG_PASS = 0,
>> +	__SOCKSG_DROP,
>> +	__SOCKSG_REDIRECT,
> 
> what is this? I see no code that handles it either in this patch
> or in the later patches?!

Yes, I am not handling these actions in RDS in this patch series. I am
thinking first I should have basic infrastructure in place (and want to
make sure it works and accepted by community) then I will add more
changes for RDS which includes handling socksg actions at RDS level.
But fine, I can add code that handles socksg actions.

Thanks

-Tushar
> 
> 

^ permalink raw reply

* Re: [PATCH net-next 2/5] eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER
From: Tushar Dave @ 2018-09-12 19:25 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: ast, daniel, davem, santosh.shilimkar, jakub.kicinski,
	quentin.monnet, jiong.wang, sandipan, john.fastabend, kafai, rdna,
	yhs, netdev, rds-devel, sowmini.varadhan
In-Reply-To: <20180912035719.ff5mcjg3gbrg52xt@ast-mbp>



On 09/11/2018 08:57 PM, Alexei Starovoitov wrote:
> On Tue, Sep 11, 2018 at 09:38:01PM +0200, Tushar Dave wrote:
>> Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER which uses the
>> existing socket filter infrastructure for bpf program attach and load.
>> SOCKET_SG_FILTER eBPF program receives struct scatterlist as bpf context
>> contrast to SOCKET_FILTER which deals with struct skb. This is useful
>> for kernel entities that don't have skb to represent packet data but
>> want to run eBPF socket filter on packet data that is in form of struct
>> scatterlist e.g. IB/RDMA
>>
>> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
>> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
>> ---
>>   include/linux/bpf_types.h      |  1 +
>>   include/uapi/linux/bpf.h       |  1 +
>>   kernel/bpf/syscall.c           |  1 +
>>   kernel/bpf/verifier.c          |  1 +
>>   net/core/filter.c              | 55 ++++++++++++++++++++++++++++++++++++++++--
>>   samples/bpf/bpf_load.c         | 11 ++++++---
>>   tools/bpf/bpftool/prog.c       |  1 +
>>   tools/include/uapi/linux/bpf.h |  1 +
>>   tools/lib/bpf/libbpf.c         |  3 +++
>>   tools/lib/bpf/libbpf.h         |  2 ++
>

Alexi,

Thank you for reviewing the patches.

> please do not mix core kernel and user space into single patch.
> split tools/include/uapi/linux/bpf.h sync into separate patch
> and changes to tools/lib/bpf as yet another patch.

Sure, I can do that.

> 
>>   10 files changed, 72 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
>> index cd26c09..7dc1503 100644
>> --- a/include/linux/bpf_types.h
>> +++ b/include/linux/bpf_types.h
>> @@ -16,6 +16,7 @@
>>   BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops)
>>   BPF_PROG_TYPE(BPF_PROG_TYPE_SK_SKB, sk_skb)
>>   BPF_PROG_TYPE(BPF_PROG_TYPE_SK_MSG, sk_msg)
>> +BPF_PROG_TYPE(BPF_PROG_TYPE_SOCKET_SG_FILTER, socksg_filter)
>>   #endif
>>   #ifdef CONFIG_BPF_EVENTS
>>   BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe)
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 66917a4..6ec1e32 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -152,6 +152,7 @@ enum bpf_prog_type {
>>   	BPF_PROG_TYPE_LWT_SEG6LOCAL,
>>   	BPF_PROG_TYPE_LIRC_MODE2,
>>   	BPF_PROG_TYPE_SK_REUSEPORT,
>> +	BPF_PROG_TYPE_SOCKET_SG_FILTER,
>>   };
>>   
>>   enum bpf_attach_type {
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index 3c9636f..5f302b7 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -1361,6 +1361,7 @@ static int bpf_prog_load(union bpf_attr *attr)
>>   
>>   	if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
>>   	    type != BPF_PROG_TYPE_CGROUP_SKB &&
>> +	    type != BPF_PROG_TYPE_SOCKET_SG_FILTER &&
> 
> I'm not comfortable to let unpriv use this right away.
> Can you live with root-only ?

Honestly, I prep this the same as how we treat
BPF_PROG_TYPE_SOCKET_FILTER. But sure I can live with root-only.

> 
>>   	    !capable(CAP_SYS_ADMIN))
>>   		return -EPERM;
>>   
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index f4ff0c5..17fc4d2 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -1234,6 +1234,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
>>   	case BPF_PROG_TYPE_LWT_XMIT:
>>   	case BPF_PROG_TYPE_SK_SKB:
>>   	case BPF_PROG_TYPE_SK_MSG:
>> +	case BPF_PROG_TYPE_SOCKET_SG_FILTER:
>>   		if (meta)
>>   			return meta->pkt_access;
>>   
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 0b40f95..469c488 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -1140,7 +1140,8 @@ static void bpf_release_orig_filter(struct bpf_prog *fp)
>>   
>>   static void __bpf_prog_release(struct bpf_prog *prog)
>>   {
>> -	if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER) {
>> +	if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER ||
>> +	    prog->type == BPF_PROG_TYPE_SOCKET_SG_FILTER) {
>>   		bpf_prog_put(prog);
> 
> this doesn't look right.
> Why this is needed?
> Are you using old-style setsockopt to attach?

Yes.

> I think new style of attaching that all bpf prog types that came
> after socket_filter are using is preferred.
> Pls take a look at BPF_PROG_ATTACH cmd.

well, I am not sure if that is going to work.

I did this way so I can attach eBPF prog type
BPF_PROG_TYPE_SOCKET_SG_FILTER to a socket. Like today we can attach
regular socket filter bpf program (e.g. BPF_PROG_TYPE_SOCKET_FILTER) to
TCP, UDP sockets using setsockopt. The only difference between them is,
BPF_PROG_TYPE_SOCKET_FILTER deals with struct sk_buff while
BPF_PROG_TYPE_SOCKET_SG_FILTER deals with strut scatterlist.

e.g. setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, prog_fd,
sizeof(prog_fd[0]));


> 
> Also it looks the first patch doesn't really add the useful logic, but adds
> few lines of code here and there. Then more code comes in patches 3 and 4.
> Please rearrange them that they're reviewable as logical pieces.

okay.

-Tushar
> 
> 

^ permalink raw reply

* Re: [PATCH net-next] virtio_net: ethtool tx napi configuration
From: Willem de Bruijn @ 2018-09-12 19:11 UTC (permalink / raw)
  To: f.fainelli
  Cc: Network Development, David Miller, caleb.raitto, Jason Wang,
	Michael S. Tsirkin, Jon Olson (Google Drive), Willem de Bruijn
In-Reply-To: <2fd8d46f-7466-e507-af31-c587693beb6e@gmail.com>

On Wed, Sep 12, 2018 at 2:16 PM Florian Fainelli <f.fainelli@gmail.com> wrote:
>
>
>
> On 9/12/2018 11:07 AM, Willem de Bruijn wrote:
> > On Wed, Sep 12, 2018 at 1:42 PM Florian Fainelli <f.fainelli@gmail.com> wrote:
> >>
> >>
> >>
> >> On 9/9/2018 3:44 PM, Willem de Bruijn wrote:
> >>> From: Willem de Bruijn <willemb@google.com>
> >>>
> >>> Implement ethtool .set_coalesce (-C) and .get_coalesce (-c) handlers.
> >>> Interrupt moderation is currently not supported, so these accept and
> >>> display the default settings of 0 usec and 1 frame.
> >>>
> >>> Toggle tx napi through a bit in tx-frames. So as to not interfere
> >>> with possible future interrupt moderation, use bit 10, well outside
> >>> the reasonable range of real interrupt moderation values.
> >>>
> >>> Changes are not atomic. The tx IRQ, napi BH and transmit path must
> >>> be quiesced when switching modes. Only allow changing this setting
> >>> when the device is down.
> >>
> >> Humm, would not a private ethtool flag to switch TX NAPI on/off be more
> >> appropriate rather than use the coalescing configuration API here?
> >
> > What do you mean by private ethtool flag? A new field in ethtool
> > --features (-k)?
>
> I meant using ethtool_drvinfo::n_priv_flags, ETH_SS_PRIV_FLAGS and then
> ETHTOOL_GFPFLAGS and ETHTOOL_SPFLAGS to control the toggling of that
> private flag. mlx5 has a number of privates flags for instance.

Interesting, thanks! I was not at all aware of those ethtool flags.
Am having a look. It definitely looks promising.

> > Configurable napi-tx is not a common feature across devices. We really
> > want virtio-net to also just convert to napi-tx as default, but need a
> > way to gradually convert with application opt-out if some workloads
> > see regressions.
>
> The rationale makes sense, no questions about it.
>
> > There is prior art in interpreting coalesce values as
> > more than a direct mapping to usec. The e1000 is one example.
> >
>
> Looked at both e1000 and e1000e and they both have a similar programming
> of the HW's interrupt target rate register, which is relevant to
> interrupt coalescing, what part of these drivers do you see as doing
> something not quite coalescing related?

It's all coalescing related, for sure. e1000_set_coalesce just does not
translate the tx-usecs values into microsecond latency directly.

It modifies both the interrupt throttle rate adapter->itr and interrupt mode
adapter->itr_setting, which are initially set in e1000_check_options from
module param InterruptThrottleRate.

Value 0 disables interrupt moderation. 1 and 3 program a dynamic mode.
2 is an illegal value as is 5..9. 10..10000 converts from usec to interrupt
rate/sec.

I took tx-napi to be a similar interrupt related option as, say, dynamic
conservative mode interrupt moderation.

^ permalink raw reply

* Re: [bpf-next, v2 1/3] flow_dissector: implements flow dissector BPF hook
From: Willem de Bruijn @ 2018-09-12 18:43 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Petar Penkov, Network Development, David Miller,
	Alexei Starovoitov, Daniel Borkmann, simon.horman, ecree,
	songliubraving, Tom Herbert, Petar Penkov, Willem de Bruijn
In-Reply-To: <20180912034632.bkxpktzvze242rvr@ast-mbp>

On Tue, Sep 11, 2018 at 11:47 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Sep 07, 2018 at 05:11:08PM -0700, Petar Penkov wrote:
> > From: Petar Penkov <ppenkov@google.com>
> >
> > Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
> > attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
> > path. The BPF program is per-network namespace.
> >
> > Signed-off-by: Petar Penkov <ppenkov@google.com>
> > Signed-off-by: Willem de Bruijn <willemb@google.com>
> > ---
> >  include/linux/bpf.h            |   1 +
> >  include/linux/bpf_types.h      |   1 +
> >  include/linux/skbuff.h         |   7 ++
> >  include/net/net_namespace.h    |   3 +
> >  include/net/sch_generic.h      |  12 ++-
> >  include/uapi/linux/bpf.h       |  25 ++++++
> >  kernel/bpf/syscall.c           |   8 ++
> >  kernel/bpf/verifier.c          |  32 ++++++++
> >  net/core/filter.c              |  67 ++++++++++++++++
> >  net/core/flow_dissector.c      | 136 +++++++++++++++++++++++++++++++++
> >  tools/bpf/bpftool/prog.c       |   1 +
> >  tools/include/uapi/linux/bpf.h |  25 ++++++
> >  tools/lib/bpf/libbpf.c         |   2 +
>
> please split up update to tools/include/uapi/linux/bpf.h as a separate patch 2.
> We often have conflicts in there, so best to have a separate.
> Also please split tools/lib and tools/bpf chnages into patch 3.

Will do in v3.

> >  struct qdisc_skb_cb {
> > -     unsigned int            pkt_len;
> > -     u16                     slave_dev_queue_mapping;
> > -     u16                     tc_classid;
> > +     union {
> > +             struct {
> > +                     unsigned int            pkt_len;
> > +                     u16                     slave_dev_queue_mapping;
> > +                     u16                     tc_classid;
> > +             };
> > +             struct bpf_flow_keys *flow_keys;
> > +     };
>
> is this magic really necessary? flow_dissector runs very early in recv path.
> There is no qdisc or conflicts with tcp/ip use of cb.
> I think the whole cb block can be used.

The flow dissector also runs in the context of TC, from flower.
But that is not a reason to use this struct.

We need both (a) data shared with the BPF application and between
applications using tail-calls, to pass along the parse offset (nhoff),
and (b) data not accessible by the program, to store the flow_keys
pointer.

qdisc_skb_cb already has this split, exposing only the 20B .data
field to the application. Flow dissection currently reuses the existing
bpf_convert_ctx_access logic for this field.

We could create a separate flowdissect_skb_cb struct with the
same split setup, but a second constraint is that relevant internal
BPF interfaces already expect qdisc_skb_cb, notably
bkf_skb_data_end. So the union was easier.

There is another way to pass nhoff besides cb[] (see below). But
I don't immediately see another place to store the flow_keys ptr.

At least, if we pass skb as context. One larger change would
be to introduce another ctx struct, similar to sk_reuseport_(kern|md).

> > @@ -2333,6 +2335,7 @@ struct __sk_buff {
> >       /* ... here. */
> >
> >       __u32 data_meta;
> > +     __u32 flow_keys;
>
> please use
> struct bpf_flow_keys *flow_keys;
> instead.
>
> See what we did in 'struct sk_msg_md' and in 'struct sk_reuseport_md'.
> There is no need to hide pointers in u32.
>

Will do in v3.

> > @@ -658,6 +754,46 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
> >                                             FLOW_DISSECTOR_KEY_BASIC,
> >                                             target_container);
> >
> > +     rcu_read_lock();
> > +     attached = skb ? rcu_dereference(dev_net(skb->dev)->flow_dissector_prog)
> > +                    : NULL;
> > +     if (attached) {
> > +             /* Note that even though the const qualifier is discarded
> > +              * throughout the execution of the BPF program, all changes(the
> > +              * control block) are reverted after the BPF program returns.
> > +              * Therefore, __skb_flow_dissect does not alter the skb.
> > +              */
> > +             struct bpf_flow_keys flow_keys = {};
> > +             struct qdisc_skb_cb cb_saved;
> > +             struct qdisc_skb_cb *cb;
> > +             u16 *pseudo_cb;
> > +             u32 result;
> > +
> > +             cb = qdisc_skb_cb(skb);
> > +             pseudo_cb = (u16 *)bpf_skb_cb((struct sk_buff *)skb);
> > +
> > +             /* Save Control Block */
> > +             memcpy(&cb_saved, cb, sizeof(cb_saved));
> > +             memset(cb, 0, sizeof(cb_saved));
> > +
> > +             /* Pass parameters to the BPF program */
> > +             cb->flow_keys = &flow_keys;
> > +             *pseudo_cb = nhoff;
>
> I don't understand this bit.
> What is this pseudo_cb and why nhoff goes in there?
> Some odd way to pass it into the prog?

Yes. nhoff passes the offset to the program to start parsing from.
Applications also pass this during tail calls.

Alternatively we can just add a new field to struct bpf_flow_keys.

^ permalink raw reply

* Re: [PATCH net-next v3 02/17] zinc: introduce minimal cryptography library
From: Andy Lutomirski @ 2018-09-12 23:45 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Jason A. Donenfeld, LKML, Netdev, David Miller,
	Greg Kroah-Hartman, Andrew Lutomirski, Samuel Neves,
	Jean-Philippe Aumasson, Linux Crypto Mailing List
In-Reply-To: <CAKv+Gu9RGo9rtGsmwMLZ_=-WSQ_h7har4agXP-2XOKupq-KYtA@mail.gmail.com>

On Wed, Sep 12, 2018 at 3:56 PM, Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> I realize you've put a lot of good and hard work into the existing
> I am also concerned about your claim that all software algorithms will
> be moved into this crypto library. You are not specific about whose
> responsibility it will be that this is going to happen in a timely
> fashion. But more importantly, it is not clear at all how you expect
> this to work for, e.g., h/w instruction based SHAxxx or AES in various
> chaining modes, which should be used only on cores that implement
> those instructions (note that on arm64, we have optional instructions
> for AES, PMULL, SHA1, SHA256, SHA512, SHA3, SM3 and SM4). Are all
> those implementations (only few of which will be used on a certain
> core) going to be part of the monolithic library? What are the APIs
> going to look like for block ciphers, taking chaining modes into
> account?

I'm not convinced that there's any real need for *all* crypto
algorithms to move into lib/zinc or to move at all.  As I see it,
there are two classes of crypto algorithms in the kernel:

a) Crypto that is used by code that chooses its algorithm statically
and wants synchronous operations.  These include everything in
drivers/char/random.c, but also a bunch of various networking things
that are hardcoded and basically everything that uses stack buffers.
(This means it includes all the code that I broke when I did
VMAP_STACK.  Sign.)

b) Crypto that is used dynamically.  This includes dm-crypt
(aes-xts-plain64, aes-cbc-essiv, etc), all the ALG_IF interfaces, a
lot of IPSEC stuff, possibly KCM, and probably many more.  These will
get comparatively little benefit from being converted to a zinc-like
interface.  For some of these cases, it wouldn't make any sense at all
to convert them.  Certainly the ones that do async hardware crypto
using DMA engines will never look at all like zinc, even under the
hood.

I think that, as a short-term goal, it makes a lot of sense to have
implementations of the crypto that *new* kernel code (like Wireguard)
wants to use in style (a) that live in /lib, and it obviously makes
sense to consolidate their implementations with the crypto/
implementations in a timely manner.  As a medium-term goal, adding
more algorithms as needed for things that could use the simpler APIs
(Bluetooth, perhaps) would make sense.

But I see no reason at all that /lib should ever contain a grab-bag of
crypto implementations just for the heck of it.  They should have real
in-kernel users IMO.  And this means that there will probably always
be some crypto implementations in crypto/ for things like aes-xts.

--Andy

^ permalink raw reply

* Re: [RFC v2 1/2] netlink: add NLA_REJECT policy type
From: Johannes Berg @ 2018-09-12 18:34 UTC (permalink / raw)
  To: David Miller; +Cc: linux-wireless, netdev, mkubecek
In-Reply-To: <20180912.111555.1317690378514849083.davem@davemloft.net>

On Wed, 2018-09-12 at 11:15 -0700, David Miller wrote:

> This looks great, no objections to this idea or the facility.

Great. I'll post this (with the fixups) for real tomorrow then, I guess.
A bit too late for me to do now.

> It does, however, remind me about about the classic problem of how bad
> we are at feature support detection because unrecognized attributes are
> ignored.
> 
> I do really hope we can fully solve that problem some day.

Yes.

There may be two or more levels to this.

It wouldn't be hard to reject attributes that are higher than maxtype -
we already pass that to nla_parse() wherever we call it, but we'd have
to find a way to make it optional I guess, for compatibility reasons.
Perhaps with a warning, like attribute validation. For genetlink, a flag
in the family (something like "strict attribute validation") would be
easy, but for "netlink proper" we have a lot of nlmsg_parse() calls to
patch, and/or replace by nlmsg_parse_strict().

I guess we should

1) implement nlmsg_parse_strict() for those new things that want it
   strictly - greenfield type stuff that doesn't need to work with
   existing applications

2) add a warning to nlmsg_parse() when a too high attribute is
   encountered

3) eventually replace nlmsg_parse() calls by nlmsg_parse_strict() and
   see what breaks? :-) We won't be able to rely on that any time soon
   though (unless userspace first checks with a guaranteed rejected
   attribute, e.g. one that has NLA_REJECT, perhaps the u64 pad
   attributes could be marked such since the kernel can't assume
   alignment anyway)

Perhaps we also have too many calls to nlmsg_parse() without a policy,
but that's orthogonal to this check.

On a second level though, with complex things like nl80211 it's often
not clear at all which attributes are used with which commands. Some
attributes (like NL80211_ATTR_IFINDEX) are (almost) universal, but there
are others that aren't. Perhaps this isn't all that important, since if
you try to trigger scanning and at the same time tell the kernel about a
key index, that clearly makes no sense at all. OTOH, we have no good way
of discovering what attribute is used where - we (try to) document this
well in the nl80211.h kernel-doc, but that isn't always complete.

So more introspection (of sorts) could be useful.

While we're talking about wishlist, I'm also toying with the idea of
having some sort of generic mechanism to convert netlink attributes
to/from structs, for internal kernel representation; so far though I
haven't been able to come up with anything useful.

johannes

^ permalink raw reply

* Re: [PATCH net-next] virtio_net: ethtool tx napi configuration
From: Florian Fainelli @ 2018-09-12 18:16 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David Miller, caleb.raitto, Jason Wang,
	Michael S. Tsirkin, Jon Olson (Google Drive), Willem de Bruijn
In-Reply-To: <CAF=yD-K4mrqY-YjsWL1JkYhqBOSciRTyvMtTv5CQU7Uq6GeypA@mail.gmail.com>



On 9/12/2018 11:07 AM, Willem de Bruijn wrote:
> On Wed, Sep 12, 2018 at 1:42 PM Florian Fainelli <f.fainelli@gmail.com> wrote:
>>
>>
>>
>> On 9/9/2018 3:44 PM, Willem de Bruijn wrote:
>>> From: Willem de Bruijn <willemb@google.com>
>>>
>>> Implement ethtool .set_coalesce (-C) and .get_coalesce (-c) handlers.
>>> Interrupt moderation is currently not supported, so these accept and
>>> display the default settings of 0 usec and 1 frame.
>>>
>>> Toggle tx napi through a bit in tx-frames. So as to not interfere
>>> with possible future interrupt moderation, use bit 10, well outside
>>> the reasonable range of real interrupt moderation values.
>>>
>>> Changes are not atomic. The tx IRQ, napi BH and transmit path must
>>> be quiesced when switching modes. Only allow changing this setting
>>> when the device is down.
>>
>> Humm, would not a private ethtool flag to switch TX NAPI on/off be more
>> appropriate rather than use the coalescing configuration API here?
> 
> What do you mean by private ethtool flag? A new field in ethtool
> --features (-k)?

I meant using ethtool_drvinfo::n_priv_flags, ETH_SS_PRIV_FLAGS and then 
ETHTOOL_GFPFLAGS and ETHTOOL_SPFLAGS to control the toggling of that 
private flag. mlx5 has a number of privates flags for instance.

> 
> Configurable napi-tx is not a common feature across devices. We really
> want virtio-net to also just convert to napi-tx as default, but need a
> way to gradually convert with application opt-out if some workloads
> see regressions.

The rationale makes sense, no questions about it.

> There is prior art in interpreting coalesce values as
> more than a direct mapping to usec. The e1000 is one example.
>

Looked at both e1000 and e1000e and they both have a similar programming 
of the HW's interrupt target rate register, which is relevant to 
interrupt coalescing, what part of these drivers do you see as doing 
something not quite coalescing related?
-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next] virtio_net: ethtool tx napi configuration
From: Willem de Bruijn @ 2018-09-12 18:07 UTC (permalink / raw)
  To: f.fainelli
  Cc: Network Development, David Miller, caleb.raitto, Jason Wang,
	Michael S. Tsirkin, Jon Olson (Google Drive), Willem de Bruijn
In-Reply-To: <a0272b54-3bf6-1c07-ab1d-62545d16a633@gmail.com>

On Wed, Sep 12, 2018 at 1:42 PM Florian Fainelli <f.fainelli@gmail.com> wrote:
>
>
>
> On 9/9/2018 3:44 PM, Willem de Bruijn wrote:
> > From: Willem de Bruijn <willemb@google.com>
> >
> > Implement ethtool .set_coalesce (-C) and .get_coalesce (-c) handlers.
> > Interrupt moderation is currently not supported, so these accept and
> > display the default settings of 0 usec and 1 frame.
> >
> > Toggle tx napi through a bit in tx-frames. So as to not interfere
> > with possible future interrupt moderation, use bit 10, well outside
> > the reasonable range of real interrupt moderation values.
> >
> > Changes are not atomic. The tx IRQ, napi BH and transmit path must
> > be quiesced when switching modes. Only allow changing this setting
> > when the device is down.
>
> Humm, would not a private ethtool flag to switch TX NAPI on/off be more
> appropriate rather than use the coalescing configuration API here?

What do you mean by private ethtool flag? A new field in ethtool
--features (-k)?

Configurable napi-tx is not a common feature across devices. We really
want virtio-net to also just convert to napi-tx as default, but need a
way to gradually convert with application opt-out if some workloads
see regressions. There is prior art in interpreting coalesce values as
more than a direct mapping to usec. The e1000 is one example.

^ permalink raw reply

* Re: [PATCH net-next] virtio_net: ethtool tx napi configuration
From: Florian Fainelli @ 2018-09-12 17:42 UTC (permalink / raw)
  To: Willem de Bruijn, netdev
  Cc: davem, caleb.raitto, jasowang, mst, jonolson, Willem de Bruijn
In-Reply-To: <20180909224449.203593-1-willemdebruijn.kernel@gmail.com>



On 9/9/2018 3:44 PM, Willem de Bruijn wrote:
> From: Willem de Bruijn <willemb@google.com>
> 
> Implement ethtool .set_coalesce (-C) and .get_coalesce (-c) handlers.
> Interrupt moderation is currently not supported, so these accept and
> display the default settings of 0 usec and 1 frame.
> 
> Toggle tx napi through a bit in tx-frames. So as to not interfere
> with possible future interrupt moderation, use bit 10, well outside
> the reasonable range of real interrupt moderation values.
> 
> Changes are not atomic. The tx IRQ, napi BH and transmit path must
> be quiesced when switching modes. Only allow changing this setting
> when the device is down.

Humm, would not a private ethtool flag to switch TX NAPI on/off be more 
appropriate rather than use the coalescing configuration API here?

> 
> Link: https://patchwork.ozlabs.org/patch/948149/
> Suggested-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> ---
>   drivers/net/virtio_net.c | 52 ++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 52 insertions(+)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 765920905226..b320b6b14749 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -66,6 +66,8 @@ DECLARE_EWMA(pkt_len, 0, 64)
>   
>   #define VIRTNET_DRIVER_VERSION "1.0.0"
>   
> +static const u32 ethtool_coalesce_napi_mask = (1UL << 10);
> +
>   static const unsigned long guest_offloads[] = {
>   	VIRTIO_NET_F_GUEST_TSO4,
>   	VIRTIO_NET_F_GUEST_TSO6,
> @@ -2181,6 +2183,54 @@ static int virtnet_get_link_ksettings(struct net_device *dev,
>   	return 0;
>   }
>   
> +static int virtnet_set_coalesce(struct net_device *dev,
> +				struct ethtool_coalesce *ec)
> +{
> +	const struct ethtool_coalesce ec_default = {
> +		.cmd = ETHTOOL_SCOALESCE,
> +		.rx_max_coalesced_frames = 1,
> +		.tx_max_coalesced_frames = 1,
> +	};
> +	struct virtnet_info *vi = netdev_priv(dev);
> +	int i, napi_weight = 0;
> +
> +	if (ec->tx_max_coalesced_frames & ethtool_coalesce_napi_mask) {
> +		ec->tx_max_coalesced_frames &= ~ethtool_coalesce_napi_mask;
> +		napi_weight = NAPI_POLL_WEIGHT;
> +	}
> +
> +	/* disallow changes to fields not explicitly tested above */
> +	if (memcmp(ec, &ec_default, sizeof(ec_default)))
> +		return -EINVAL;
> +
> +	if (napi_weight ^ vi->sq[0].napi.weight) {
> +		if (dev->flags & IFF_UP)
> +			return -EBUSY;
> +		for (i = 0; i < vi->max_queue_pairs; i++)
> +			vi->sq[i].napi.weight = napi_weight;
> +	}
> +
> +	return 0;
> +}
> +
> +static int virtnet_get_coalesce(struct net_device *dev,
> +				struct ethtool_coalesce *ec)
> +{
> +	const struct ethtool_coalesce ec_default = {
> +		.cmd = ETHTOOL_GCOALESCE,
> +		.rx_max_coalesced_frames = 1,
> +		.tx_max_coalesced_frames = 1,
> +	};
> +	struct virtnet_info *vi = netdev_priv(dev);
> +
> +	memcpy(ec, &ec_default, sizeof(ec_default));
> +
> +	if (vi->sq[0].napi.weight)
> +		ec->tx_max_coalesced_frames |= ethtool_coalesce_napi_mask;
> +
> +	return 0;
> +}
> +
>   static void virtnet_init_settings(struct net_device *dev)
>   {
>   	struct virtnet_info *vi = netdev_priv(dev);
> @@ -2219,6 +2269,8 @@ static const struct ethtool_ops virtnet_ethtool_ops = {
>   	.get_ts_info = ethtool_op_get_ts_info,
>   	.get_link_ksettings = virtnet_get_link_ksettings,
>   	.set_link_ksettings = virtnet_set_link_ksettings,
> +	.set_coalesce = virtnet_set_coalesce,
> +	.get_coalesce = virtnet_get_coalesce,
>   };
>   
>   static void virtnet_freeze_down(struct virtio_device *vdev)
> 

-- 
Florian

^ permalink raw reply

* Re: [PATCH bpf] bpf: btf: Fix end boundary calculation for type section
From: Yonghong Song @ 2018-09-12 17:38 UTC (permalink / raw)
  To: Martin KaFai Lau, netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team
In-Reply-To: <20180912172911.3609494-1-kafai@fb.com>



On 9/12/18 10:29 AM, Martin KaFai Lau wrote:
> The end boundary math for type section is incorrect in
> btf_check_all_metas().  It just happens that hdr->type_off
> is always 0 for now because there are only two sections
> (type and string) and string section must be at the end (ensured
> in btf_parse_str_sec).
> 
> However, type_off may not be 0 if a new section would be added later.
> This patch fixes it.
> 
> Fixes: f80442a4cd18 ("bpf: btf: Change how section is supported in btf_header")
> Reported-by: Dmitry Vyukov <dvyukov@google.com>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>

Acked-by: Yonghong Song <yhs@fb.com>

> ---
>   kernel/bpf/btf.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> index 2590700237c1..138f0302692e 100644
> --- a/kernel/bpf/btf.c
> +++ b/kernel/bpf/btf.c
> @@ -1844,7 +1844,7 @@ static int btf_check_all_metas(struct btf_verifier_env *env)
>   
>   	hdr = &btf->hdr;
>   	cur = btf->nohdr_data + hdr->type_off;
> -	end = btf->nohdr_data + hdr->type_len;
> +	end = cur + hdr->type_len;
>   
>   	env->log_type_id = 1;
>   	while (cur < end) {
> 

^ permalink raw reply

* [PATCH bpf] bpf: btf: Fix end boundary calculation for type section
From: Martin KaFai Lau @ 2018-09-12 17:29 UTC (permalink / raw)
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

The end boundary math for type section is incorrect in
btf_check_all_metas().  It just happens that hdr->type_off
is always 0 for now because there are only two sections
(type and string) and string section must be at the end (ensured
in btf_parse_str_sec).

However, type_off may not be 0 if a new section would be added later.
This patch fixes it.

Fixes: f80442a4cd18 ("bpf: btf: Change how section is supported in btf_header")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 kernel/bpf/btf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 2590700237c1..138f0302692e 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -1844,7 +1844,7 @@ static int btf_check_all_metas(struct btf_verifier_env *env)
 
 	hdr = &btf->hdr;
 	cur = btf->nohdr_data + hdr->type_off;
-	end = btf->nohdr_data + hdr->type_len;
+	end = cur + hdr->type_len;
 
 	env->log_type_id = 1;
 	while (cur < end) {
-- 
2.17.1

^ permalink raw reply related

* Re: [PATCH net-next 1/5] bpf: use __GFP_COMP while allocating page
From: John Fastabend @ 2018-09-12 16:51 UTC (permalink / raw)
  To: Tushar Dave, ast, daniel, davem, santosh.shilimkar,
	jakub.kicinski, quentin.monnet, jiong.wang, sandipan, kafai, rdna,
	yhs, netdev, rds-devel, sowmini.varadhan
In-Reply-To: <2fd601e3-5679-08f4-1610-b3c22de80935@oracle.com>

On 09/12/2018 09:21 AM, Tushar Dave wrote:
> 
> 
> On 09/11/2018 12:38 PM, Tushar Dave wrote:
>> Helper bpg_msg_pull_data() can allocate multiple pages while
>> linearizing multiple scatterlist elements into one shared page.
>> However, if the shared page has size > PAGE_SIZE, using
>> copy_page_to_iter() causes below warning.
>>
>> e.g.
>> [ 6367.019832] WARNING: CPU: 2 PID: 7410 at lib/iov_iter.c:825
>> page_copy_sane.part.8+0x0/0x8
>>
>> To avoid above warning, use __GFP_COMP while allocating multiple
>> contiguous pages.
>>
>> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
>> ---
>>   net/core/filter.c | 3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index d301134..0b40f95 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -2344,7 +2344,8 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>>       if (unlikely(bytes_sg_total > copy))
>>           return -EINVAL;
>>   -    page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC, get_order(copy));
>> +    page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC | __GFP_COMP,
>> +               get_order(copy));
>>       if (unlikely(!page))
>>           return -ENOMEM;
>>       p = page_address(page);
> 
> I should have mentioned that I could re-order this patch anywhere in
> patch series (as long as it doesn't break git bisect). I kept it first
> because I think it is more like a bug fix. I sent it along with these
> patch series considering we have a context of why and for what I need
> this patch!
> 
> Daniel, John,
> 
> Not sure if you guys hit this page_copy_sane warning. I hit it when RDS
> copy sg page to userspace using copy_page_to_iter().
> 

I have not hit this before but I'm working on a set of patches for
test_sockmap to test the bpf_msg_pull_data() so I'll add a case
for this. Currently, we only test the simple case where we pull
data out of a single page in selftests. This was sufficient for
my use case but missed a handful of other valid cases.

> example:
> 
> RDS packet size 8KB represented in scatterlist:
> sg_data[0].length = 1400
> sg_data[1].length = 1448
> sg_data[2].length = 1448
> sg_data[3].length = 1448
> sg_data[4].length = 1448
> sg_data[5].length = 1000
> 
> If start=0 and end=8192, bpf_msg_pull_data() will linearize all
> sg_data elements into one shared page. e.g. sg_data[0].length = 8192.
> Using this sg_data[0].page in function copy_page_to_iter() causes:
> WARNING: CPU: 2 PID: 7410 at lib/iov_iter.c:825
> page_copy_sane.part.8+0x0/0x8
> 
> (FYI, patch 4 has code that does copy_page_to_iter)
> 

How about sending it as a bugfix against bpf on its own. It
looks like we could reproduce this with a combination of
bpf_msg_pull_data() + redirect (to ingress) perhaps. Either
way seems like a candidate for the bpf fixes tree to me.

Thanks,
John

> 
> Comments?
> 
> Thanks in advance,
> -Tushar
> 
> 

^ permalink raw reply

* Need input on placement of driver
From: Sunil Kovvuri @ 2018-09-12 16:50 UTC (permalink / raw)
  To: David S. Miller; +Cc: Linux Netdev List, Arnd Bergmann

Hi David,

I am trying to submit a driver into drivers/soc folder and Arnd is of
the opinion that
the driver should be moved to drivers/net/ethernet.

Can you please go through below and give your feedback.

HW functionality in brief
# HW has a Admin function (AF) PCI device which has privilege access
to configure co-processors.
# Co-processors include network block, crypto block, ring buffer block
used by both network and crypto blocks, packet or anyother work
scheduler block, ingress/egress packet parser and forwarding block,
internal state machine caches etc.
# Each of these blocks are multiple in number and can be attached to
other PCI devices.
# Future variants of the same silicon might have additional functional
blocks in AF.
# There are other SRIOV PF/VF devices which are dumb at power-on and
acquire functionality based
    what blocks are attached to them by AF.
# So AF is the one which configures, facilities and manages all HW
resources (network and non-network).
   But doesn't handle any data.
# PF/VFs communicate with AF via a shared mailbox memory for functional block
   attach / detach requests, HW configuration etc etc.

AF driver will have logic not only the functionality needed by kernel
netdev or crypto drivers but
also the HW configuration logic needed by userspace application drivers.

Keeping current and future functionality in view we thought of having
3 drivers in kernel
# AF driver at drivers/soc
# PF/ VF netdev driver (network & ring buffer blocks attached to these
devices) at drivers/net/ethernet
# PF / VF crypto driver (ring buffer and crypto blocks attached to
these devices) at drivers/crypto.

I have submitted few patches for the AF driver
https://patchwork.kernel.org/cover/10587635/

Here Arnd has opined that all drivers should move into drivers/net/ethernet.
So wanted to check if you would be okay with this.

Thanks,
Sunil.

^ permalink raw reply

* [PATCH v2 04/30] inet: frags: Convert timers to use timer_setup()
From: Stephen Hemminger @ 2018-09-12 16:36 UTC (permalink / raw)
  To: davem, gregkh
  Cc: edumazet, Kees Cook, Alexander Aring, Stefan Schmidt,
	Alexey Kuznetsov, Hideaki YOSHIFUJI, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, linux-wpan, netdev,
	netfilter-devel, coreteam
In-Reply-To: <20180912163648.17611-1-sthemmin@microsoft.com>

From: Kees Cook <keescook@chromium.org>

In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.

Cc: Alexander Aring <alex.aring@gmail.com>
Cc: Stefan Schmidt <stefan@osg.samsung.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: Florian Westphal <fw@strlen.de>
Cc: linux-wpan@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: netfilter-devel@vger.kernel.org
Cc: coreteam@netfilter.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Stefan Schmidt <stefan@osg.samsung.com> # for ieee802154
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 78802011fbe34331bdef6f2dfb1634011f0e4c32)
---
 include/net/inet_frag.h                 | 2 +-
 net/ieee802154/6lowpan/reassembly.c     | 5 +++--
 net/ipv4/inet_fragment.c                | 4 ++--
 net/ipv4/ip_fragment.c                  | 5 +++--
 net/ipv6/netfilter/nf_conntrack_reasm.c | 5 +++--
 net/ipv6/reassembly.c                   | 5 +++--
 6 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index fd338293a095..69e531ed8189 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -97,7 +97,7 @@ struct inet_frags {
 	void			(*constructor)(struct inet_frag_queue *q,
 					       const void *arg);
 	void			(*destructor)(struct inet_frag_queue *);
-	void			(*frag_expire)(unsigned long data);
+	void			(*frag_expire)(struct timer_list *t);
 	struct kmem_cache	*frags_cachep;
 	const char		*frags_cache_name;
 };
diff --git a/net/ieee802154/6lowpan/reassembly.c b/net/ieee802154/6lowpan/reassembly.c
index 9ccb8458b5c3..6badc055555b 100644
--- a/net/ieee802154/6lowpan/reassembly.c
+++ b/net/ieee802154/6lowpan/reassembly.c
@@ -80,12 +80,13 @@ static void lowpan_frag_init(struct inet_frag_queue *q, const void *a)
 	fq->daddr = *arg->dst;
 }
 
-static void lowpan_frag_expire(unsigned long data)
+static void lowpan_frag_expire(struct timer_list *t)
 {
+	struct inet_frag_queue *frag = from_timer(frag, t, timer);
 	struct frag_queue *fq;
 	struct net *net;
 
-	fq = container_of((struct inet_frag_queue *)data, struct frag_queue, q);
+	fq = container_of(frag, struct frag_queue, q);
 	net = container_of(fq->q.net, struct net, ieee802154_lowpan.frags);
 
 	spin_lock(&fq->q.lock);
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 4b44f973c37f..97e747b1e9a0 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -150,7 +150,7 @@ inet_evict_bucket(struct inet_frags *f, struct inet_frag_bucket *hb)
 	spin_unlock(&hb->chain_lock);
 
 	hlist_for_each_entry_safe(fq, n, &expired, list_evictor)
-		f->frag_expire((unsigned long) fq);
+		f->frag_expire(&fq->timer);
 
 	return evicted;
 }
@@ -367,7 +367,7 @@ static struct inet_frag_queue *inet_frag_alloc(struct netns_frags *nf,
 	f->constructor(q, arg);
 	add_frag_mem_limit(nf, f->qsize);
 
-	setup_timer(&q->timer, f->frag_expire, (unsigned long)q);
+	timer_setup(&q->timer, f->frag_expire, 0);
 	spin_lock_init(&q->lock);
 	refcount_set(&q->refcnt, 1);
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 9d0b08c8ee00..5171c8cc0eb6 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -191,12 +191,13 @@ static bool frag_expire_skip_icmp(u32 user)
 /*
  * Oops, a fragment queue timed out.  Kill it and send an ICMP reply.
  */
-static void ip_expire(unsigned long arg)
+static void ip_expire(struct timer_list *t)
 {
+	struct inet_frag_queue *frag = from_timer(frag, t, timer);
 	struct ipq *qp;
 	struct net *net;
 
-	qp = container_of((struct inet_frag_queue *) arg, struct ipq, q);
+	qp = container_of(frag, struct ipq, q);
 	net = container_of(qp->q.net, struct net, ipv4.frags);
 
 	rcu_read_lock();
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 7ea2b4490672..bc776ef392ea 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -169,12 +169,13 @@ static unsigned int nf_hashfn(const struct inet_frag_queue *q)
 	return nf_hash_frag(nq->id, &nq->saddr, &nq->daddr);
 }
 
-static void nf_ct_frag6_expire(unsigned long data)
+static void nf_ct_frag6_expire(struct timer_list *t)
 {
+	struct inet_frag_queue *frag = from_timer(frag, t, timer);
 	struct frag_queue *fq;
 	struct net *net;
 
-	fq = container_of((struct inet_frag_queue *)data, struct frag_queue, q);
+	fq = container_of(frag, struct frag_queue, q);
 	net = container_of(fq->q.net, struct net, nf_frag.frags);
 
 	ip6_expire_frag_queue(net, fq);
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 26f737c3fc7b..b85ef051b75c 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -169,12 +169,13 @@ void ip6_expire_frag_queue(struct net *net, struct frag_queue *fq)
 }
 EXPORT_SYMBOL(ip6_expire_frag_queue);
 
-static void ip6_frag_expire(unsigned long data)
+static void ip6_frag_expire(struct timer_list *t)
 {
+	struct inet_frag_queue *frag = from_timer(frag, t, timer);
 	struct frag_queue *fq;
 	struct net *net;
 
-	fq = container_of((struct inet_frag_queue *)data, struct frag_queue, q);
+	fq = container_of(frag, struct frag_queue, q);
 	net = container_of(fq->q.net, struct net, ipv6.frags);
 
 	ip6_expire_frag_queue(net, fq);
-- 
2.18.0

^ permalink raw reply related

* Re: [PATCH net-next 1/5] bpf: use __GFP_COMP while allocating page
From: Tushar Dave @ 2018-09-12 16:21 UTC (permalink / raw)
  To: ast, daniel, davem, santosh.shilimkar, jakub.kicinski,
	quentin.monnet, jiong.wang, sandipan, john.fastabend, kafai, rdna,
	yhs, netdev, rds-devel, sowmini.varadhan
In-Reply-To: <1536694684-3200-2-git-send-email-tushar.n.dave@oracle.com>



On 09/11/2018 12:38 PM, Tushar Dave wrote:
> Helper bpg_msg_pull_data() can allocate multiple pages while
> linearizing multiple scatterlist elements into one shared page.
> However, if the shared page has size > PAGE_SIZE, using
> copy_page_to_iter() causes below warning.
> 
> e.g.
> [ 6367.019832] WARNING: CPU: 2 PID: 7410 at lib/iov_iter.c:825
> page_copy_sane.part.8+0x0/0x8
> 
> To avoid above warning, use __GFP_COMP while allocating multiple
> contiguous pages.
> 
> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
> ---
>   net/core/filter.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index d301134..0b40f95 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2344,7 +2344,8 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>   	if (unlikely(bytes_sg_total > copy))
>   		return -EINVAL;
>   
> -	page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC, get_order(copy));
> +	page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC | __GFP_COMP,
> +			   get_order(copy));
>   	if (unlikely(!page))
>   		return -ENOMEM;
>   	p = page_address(page);

I should have mentioned that I could re-order this patch anywhere in
patch series (as long as it doesn't break git bisect). I kept it first
because I think it is more like a bug fix. I sent it along with these
patch series considering we have a context of why and for what I need
this patch!

Daniel, John,

Not sure if you guys hit this page_copy_sane warning. I hit it when RDS
copy sg page to userspace using copy_page_to_iter().

example:

RDS packet size 8KB represented in scatterlist:
sg_data[0].length = 1400
sg_data[1].length = 1448
sg_data[2].length = 1448
sg_data[3].length = 1448
sg_data[4].length = 1448
sg_data[5].length = 1000

If start=0 and end=8192, bpf_msg_pull_data() will linearize all
sg_data elements into one shared page. e.g. sg_data[0].length = 8192.
Using this sg_data[0].page in function copy_page_to_iter() causes:
WARNING: CPU: 2 PID: 7410 at lib/iov_iter.c:825
page_copy_sane.part.8+0x0/0x8

(FYI, patch 4 has code that does copy_page_to_iter)


Comments?

Thanks in advance,
-Tushar

^ permalink raw reply

* Re: [PATCH v2 02/17] compat_ioctl: move drivers to generic_compat_ioctl_ptrarg
From: Arnd Bergmann @ 2018-09-12 16:20 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Al Viro, Linux FS-devel Mailing List, Sudip Mukherjee, gregkh,
	Peter Huewe, Jarkko Sakkinen, Stefan Richter, Jiri Kosina,
	Benjamin Tissoires, Alexander Shishkin, Winkler, Tomas,
	Artem Bityutskiy, Marek Vasut, David Miller, Alex Williamson,
	OGAWA Hirofumi, Linux Kernel Mailing List, linux
In-Reply-To: <20180912153300.GE5633@ziepe.ca>

On Wed, Sep 12, 2018 at 5:33 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Sep 12, 2018 at 05:01:03PM +0200, Arnd Bergmann wrote:
> > Each of these drivers has a copy of the same trivial helper function to
> > convert the pointer argument and then call the native ioctl handler.
> >
> > We now have a generic implementation of that, so use it.
> >
>
> For vtpm:
>
> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
>
> Arnd, would you consider including a patch as part of/after this
> series to make compat_ioctl in drivers/infiniband/core/uverbs_main.c
> use this as well?  Looks like a bug too?

That should be included in patch 5 in this series. I may have skipped
some Cc there since it had too many recipients (sent only to the
mailing lists instead).

        Arnd

^ permalink raw reply

* Re: [PATCH v2 05/17] compat_ioctl: move more drivers to generic_compat_ioctl_ptrarg
From: Mauro Carvalho Chehab @ 2018-09-12 16:01 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: viro, linux-fsdevel, Greg Kroah-Hartman, David S. Miller, devel,
	linux-kernel, qat-linux, linux-crypto, linux-media, dri-devel,
	linaro-mm-sig, amd-gfx, linux-input, linux-iio, linux-rdma,
	linux-nvdimm, linux-nvme, linux-pci, platform-driver-x86,
	linux-remoteproc, sparclinux, linux-scsi, linux-usb, linux-fbdev,
	linuxppc-dev, linux-btrfs
In-Reply-To: <20180912151134.436719-1-arnd@arndb.de>

Em Wed, 12 Sep 2018 17:08:52 +0200
Arnd Bergmann <arnd@arndb.de> escreveu:

> The .ioctl and .compat_ioctl file operations have the same prototype so
> they can both point to the same function, which works great almost all
> the time when all the commands are compatible.
> 
> One exception is the s390 architecture, where a compat pointer is only
> 31 bit wide, and converting it into a 64-bit pointer requires calling
> compat_ptr(). Most drivers here will ever run in s390, but since we now
> have a generic helper for it, it's easy enough to use it consistently.
> 
> I double-checked all these drivers to ensure that all ioctl arguments
> are used as pointers or are ignored, but are not interpreted as integer
> values.
> 
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---

>  drivers/media/rc/lirc_dev.c                 | 4 +---

> diff --git a/drivers/media/rc/lirc_dev.c b/drivers/media/rc/lirc_dev.c
> index f862f1b7f996..077209f414ed 100644
> --- a/drivers/media/rc/lirc_dev.c
> +++ b/drivers/media/rc/lirc_dev.c
> @@ -730,9 +730,7 @@ static const struct file_operations lirc_fops = {
>  	.owner		= THIS_MODULE,
>  	.write		= ir_lirc_transmit_ir,
>  	.unlocked_ioctl	= ir_lirc_ioctl,
> -#ifdef CONFIG_COMPAT
> -	.compat_ioctl	= ir_lirc_ioctl,
> -#endif
> +	.compat_ioctl	= generic_compat_ioctl_ptrarg,
>  	.read		= ir_lirc_read,
>  	.poll		= ir_lirc_poll,
>  	.open		= ir_lirc_open,

Adding an infrared remote controller to a s390 mainframe sounds fun :-)

I suspect that one could implement it on a s390 platform 
using gpio-ir-recv and/or gpio-ir-tx drivers. Perhaps one possible
practical usage would be to let the mainframe to send remote
controller codes to adjust the air conditioning system ;-)

From lirc driver's PoV, there's nothing that really prevents one to
do that and use lirc API, and the driver is generic enough to work
on any hardware platform.

I didn't check the implementation of generic_compat_ioctl_ptrarg(),
but assuming it is ok,

Acked-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>

Thanks,
Mauro

^ permalink raw reply

* [PATCH net v2 3/3] tls: clear key material from kernel memory when do_tls_setsockopt_conf fails
From: Sabrina Dubroca @ 2018-09-12 15:44 UTC (permalink / raw)
  To: netdev
  Cc: Sabrina Dubroca, Aviad Yehezkel, Boris Pismenny, Dave Watson,
	Vakul Garg
In-Reply-To: <cover.1536766755.git.sd@queasysnail.net>

Fixes: 3c4d7559159b ("tls: kernel TLS support")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
---
v2: use the new union tls_crypto_context

 net/tls/tls_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 737b3865be1b..523622dc74f8 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -509,7 +509,7 @@ static int do_tls_setsockopt_conf(struct sock *sk, char __user *optval,
 	goto out;
 
 err_crypto_info:
-	memset(crypto_info, 0, sizeof(*crypto_info));
+	memzero_explicit(crypto_info, sizeof(union tls_crypto_context));
 out:
 	return rc;
 }
-- 
2.18.0

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox