* [PATCH bpf] bpf: use __GFP_COMP while allocating page
From: Tushar Dave @ 2018-09-12 20:15 UTC (permalink / raw)
To: ast, daniel, davem, netdev, john.fastabend
Helper bpg_msg_pull_data() can allocate multiple pages while
linearizing multiple scatterlist elements into one shared page.
However, if the shared page has size > PAGE_SIZE, using
copy_page_to_iter() causes below warning.
e.g.
[ 6367.019832] WARNING: CPU: 2 PID: 7410 at lib/iov_iter.c:825
page_copy_sane.part.8+0x0/0x8
To avoid above warning, use __GFP_COMP while allocating multiple
contiguous pages.
Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
---
net/core/filter.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/core/filter.c b/net/core/filter.c
index d301134..0b40f95 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2344,7 +2344,8 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
if (unlikely(bytes_sg_total > copy))
return -EINVAL;
- page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC, get_order(copy));
+ page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC | __GFP_COMP,
+ get_order(copy));
if (unlikely(!page))
return -ENOMEM;
p = page_address(page);
--
1.8.3.1
^ permalink raw reply related
* Re: [PATCH net-next 1/5] bpf: use __GFP_COMP while allocating page
From: Tushar Dave @ 2018-09-12 20:15 UTC (permalink / raw)
To: John Fastabend, ast, daniel, davem, santosh.shilimkar,
jakub.kicinski, quentin.monnet, jiong.wang, sandipan, kafai, rdna,
yhs, netdev, rds-devel, sowmini.varadhan
In-Reply-To: <6855742d-5925-0d94-e4f3-74bf118ca3d2@gmail.com>
On 09/12/2018 09:51 AM, John Fastabend wrote:
> On 09/12/2018 09:21 AM, Tushar Dave wrote:
>>
>>
>> On 09/11/2018 12:38 PM, Tushar Dave wrote:
>>> Helper bpg_msg_pull_data() can allocate multiple pages while
>>> linearizing multiple scatterlist elements into one shared page.
>>> However, if the shared page has size > PAGE_SIZE, using
>>> copy_page_to_iter() causes below warning.
>>>
>>> e.g.
>>> [ 6367.019832] WARNING: CPU: 2 PID: 7410 at lib/iov_iter.c:825
>>> page_copy_sane.part.8+0x0/0x8
>>>
>>> To avoid above warning, use __GFP_COMP while allocating multiple
>>> contiguous pages.
>>>
>>> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
>>> ---
>>> net/core/filter.c | 3 ++-
>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>> index d301134..0b40f95 100644
>>> --- a/net/core/filter.c
>>> +++ b/net/core/filter.c
>>> @@ -2344,7 +2344,8 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>>> if (unlikely(bytes_sg_total > copy))
>>> return -EINVAL;
>>> - page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC, get_order(copy));
>>> + page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC | __GFP_COMP,
>>> + get_order(copy));
>>> if (unlikely(!page))
>>> return -ENOMEM;
>>> p = page_address(page);
>>
>> I should have mentioned that I could re-order this patch anywhere in
>> patch series (as long as it doesn't break git bisect). I kept it first
>> because I think it is more like a bug fix. I sent it along with these
>> patch series considering we have a context of why and for what I need
>> this patch!
>>
>> Daniel, John,
>>
>> Not sure if you guys hit this page_copy_sane warning. I hit it when RDS
>> copy sg page to userspace using copy_page_to_iter().
>>
>
> I have not hit this before but I'm working on a set of patches for
> test_sockmap to test the bpf_msg_pull_data() so I'll add a case
> for this. Currently, we only test the simple case where we pull
> data out of a single page in selftests. This was sufficient for
> my use case but missed a handful of other valid cases.
>
>> example:
>>
>> RDS packet size 8KB represented in scatterlist:
>> sg_data[0].length = 1400
>> sg_data[1].length = 1448
>> sg_data[2].length = 1448
>> sg_data[3].length = 1448
>> sg_data[4].length = 1448
>> sg_data[5].length = 1000
>>
>> If start=0 and end=8192, bpf_msg_pull_data() will linearize all
>> sg_data elements into one shared page. e.g. sg_data[0].length = 8192.
>> Using this sg_data[0].page in function copy_page_to_iter() causes:
>> WARNING: CPU: 2 PID: 7410 at lib/iov_iter.c:825
>> page_copy_sane.part.8+0x0/0x8
>>
>> (FYI, patch 4 has code that does copy_page_to_iter)
>>
>
> How about sending it as a bugfix against bpf on its own. It
> looks like we could reproduce this with a combination of
> bpf_msg_pull_data() + redirect (to ingress) perhaps. Either
> way seems like a candidate for the bpf fixes tree to me.
Done.
Thanks.
-Tushar
>
> Thanks,
> John
>
>>
>> Comments?
>>
>> Thanks in advance,
>> -Tushar
>>
>>
>
>
^ permalink raw reply
* Re: [PATCH net 0/2] nfp: flower: fixes for flower offload
From: David Miller @ 2018-09-12 20:19 UTC (permalink / raw)
To: jakub.kicinski; +Cc: netdev, oss-drivers
In-Reply-To: <20180911133845.14214-1-jakub.kicinski@netronome.com>
From: Jakub Kicinski <jakub.kicinski@netronome.com>
Date: Tue, 11 Sep 2018 06:38:43 -0700
> Two fixes for flower matching and tunnel encap. Pieter fixes
> VLAN matching if the entire VLAN id is masked out and match
> is only performed on the PCP field. Louis adds validation of
> tunnel flags for encap, most importantly we should not offload
> actions on IPv6 tunnels if it's not supported.
Series applied, thanks.
^ permalink raw reply
* Re: [PATCH net-next] nfp: report FW vNIC stats in interface stats
From: David Miller @ 2018-09-12 20:21 UTC (permalink / raw)
To: jakub.kicinski; +Cc: netdev, oss-drivers
In-Reply-To: <20180911134408.15129-1-jakub.kicinski@netronome.com>
From: Jakub Kicinski <jakub.kicinski@netronome.com>
Date: Tue, 11 Sep 2018 06:44:08 -0700
> Report in standard netdev stats drops and errors as well as
> RX multicast from the FW vNIC counters.
>
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> Reviewed-by: Simon Horman <simon.horman@netronome.com>
Applied, thank you.
^ permalink raw reply
* [PATCH net] net: rtnl_configure_link: fix dev flags changes arg to __dev_notify_flags
From: Roopa Prabhu @ 2018-09-12 20:21 UTC (permalink / raw)
To: davem; +Cc: netdev, stephen, liam.mcbirnie
From: Roopa Prabhu <roopa@cumulusnetworks.com>
This fix addresses https://bugzilla.kernel.org/show_bug.cgi?id=201071
Commit 5025f7f7d506 wrongly relied on __dev_change_flags to notify users of
dev flag changes in the case when dev->rtnl_link_state = RTNL_LINK_INITIALIZED.
Fix it by indicating flag changes explicitly to __dev_notify_flags.
Fixes: 5025f7f7d506 ("rtnetlink: add rtnl_link_state check in rtnl_configure_link")
Reported-By: Liam mcbirnie <liam.mcbirnie@boeing.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
---
Dave, if 5025f7f7d506 made it to stable, request you to pls queue this one up too. Thanks.
net/core/rtnetlink.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 60c9288..63ce2283 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -2810,7 +2810,7 @@ int rtnl_configure_link(struct net_device *dev, const struct ifinfomsg *ifm)
}
if (dev->rtnl_link_state == RTNL_LINK_INITIALIZED) {
- __dev_notify_flags(dev, old_flags, 0U);
+ __dev_notify_flags(dev, old_flags, (old_flags ^ dev->flags));
} else {
dev->rtnl_link_state = RTNL_LINK_INITIALIZED;
__dev_notify_flags(dev, old_flags, ~0U);
--
2.1.4
^ permalink raw reply related
* [PATCH bpf] bpf/verifier: disallow pointer subtraction
From: Alexei Starovoitov @ 2018-09-12 21:06 UTC (permalink / raw)
To: David S . Miller; +Cc: daniel, netdev, kernel-team
Subtraction of pointers was accidentally allowed for unpriv programs
by commit 82abbf8d2fc4. Revert that part of commit.
Fixes: 82abbf8d2fc4 ("bpf: do not allow root to mangle valid pointers")
Reported-by: Jann Horn <jannh@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
kernel/bpf/verifier.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 92246117d2b0..bb07e74b34a2 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3163,7 +3163,7 @@ static int adjust_reg_min_max_vals(struct bpf_verifier_env *env,
* an arbitrary scalar. Disallow all math except
* pointer subtraction
*/
- if (opcode == BPF_SUB){
+ if (opcode == BPF_SUB && env->allow_ptr_leaks) {
mark_reg_unknown(env, regs, insn->dst_reg);
return 0;
}
--
2.17.1
^ permalink raw reply related
* Re: [PATCH bpf] bpf/verifier: disallow pointer subtraction
From: Daniel Borkmann @ 2018-09-12 21:48 UTC (permalink / raw)
To: Alexei Starovoitov, David S . Miller; +Cc: netdev, kernel-team
In-Reply-To: <20180912210610.829166-1-ast@kernel.org>
On 09/12/2018 11:06 PM, Alexei Starovoitov wrote:
> Subtraction of pointers was accidentally allowed for unpriv programs
> by commit 82abbf8d2fc4. Revert that part of commit.
>
> Fixes: 82abbf8d2fc4 ("bpf: do not allow root to mangle valid pointers")
> Reported-by: Jann Horn <jannh@google.com>
> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Applied to bpf, thanks Alexei!
^ permalink raw reply
* Re: [PATCH bpf] bpf: use __GFP_COMP while allocating page
From: Daniel Borkmann @ 2018-09-12 21:49 UTC (permalink / raw)
To: Tushar Dave, ast, davem, netdev, john.fastabend
In-Reply-To: <1536783329-784-1-git-send-email-tushar.n.dave@oracle.com>
On 09/12/2018 10:15 PM, Tushar Dave wrote:
> Helper bpg_msg_pull_data() can allocate multiple pages while
> linearizing multiple scatterlist elements into one shared page.
> However, if the shared page has size > PAGE_SIZE, using
> copy_page_to_iter() causes below warning.
>
> e.g.
> [ 6367.019832] WARNING: CPU: 2 PID: 7410 at lib/iov_iter.c:825
> page_copy_sane.part.8+0x0/0x8
>
> To avoid above warning, use __GFP_COMP while allocating multiple
> contiguous pages.
>
> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Applied to bpf, thanks Tushar!
^ permalink raw reply
* Re: [PATCH net-next v2] net: bridge: add support for sticky fdb entries
From: David Miller @ 2018-09-13 3:30 UTC (permalink / raw)
To: nikolay; +Cc: netdev, roopa, bridge
In-Reply-To: <20180911063953.15008-1-nikolay@cumulusnetworks.com>
From: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Date: Tue, 11 Sep 2018 09:39:53 +0300
> Add support for entries which are "sticky", i.e. will not change their port
> if they show up from a different one. A new ndm flag is introduced for that
> purpose - NTF_STICKY. We allow to set it only to non-local entries.
>
> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
> ---
> v2: As Stephen suggested pass the whole ndm_flags because it will probably
> be used by someone else soon.
Applied, thanks Nikolay.
^ permalink raw reply
* Re: [bpf-next, v2 1/3] flow_dissector: implements flow dissector BPF hook
From: Alexei Starovoitov @ 2018-09-12 22:25 UTC (permalink / raw)
To: Willem de Bruijn
Cc: Petar Penkov, Network Development, David Miller,
Alexei Starovoitov, Daniel Borkmann, simon.horman, ecree,
songliubraving, Tom Herbert, Petar Penkov, Willem de Bruijn
In-Reply-To: <CAF=yD-LOvyj9t1O7MpqfXE7eYKpFZn96Gx144vmkFgppZe2UbQ@mail.gmail.com>
On Wed, Sep 12, 2018 at 02:43:37PM -0400, Willem de Bruijn wrote:
> On Tue, Sep 11, 2018 at 11:47 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Fri, Sep 07, 2018 at 05:11:08PM -0700, Petar Penkov wrote:
> > > From: Petar Penkov <ppenkov@google.com>
> > >
> > > Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
> > > attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
> > > path. The BPF program is per-network namespace.
> > >
> > > Signed-off-by: Petar Penkov <ppenkov@google.com>
> > > Signed-off-by: Willem de Bruijn <willemb@google.com>
> > > ---
> > > include/linux/bpf.h | 1 +
> > > include/linux/bpf_types.h | 1 +
> > > include/linux/skbuff.h | 7 ++
> > > include/net/net_namespace.h | 3 +
> > > include/net/sch_generic.h | 12 ++-
> > > include/uapi/linux/bpf.h | 25 ++++++
> > > kernel/bpf/syscall.c | 8 ++
> > > kernel/bpf/verifier.c | 32 ++++++++
> > > net/core/filter.c | 67 ++++++++++++++++
> > > net/core/flow_dissector.c | 136 +++++++++++++++++++++++++++++++++
> > > tools/bpf/bpftool/prog.c | 1 +
> > > tools/include/uapi/linux/bpf.h | 25 ++++++
> > > tools/lib/bpf/libbpf.c | 2 +
> >
> > please split up update to tools/include/uapi/linux/bpf.h as a separate patch 2.
> > We often have conflicts in there, so best to have a separate.
> > Also please split tools/lib and tools/bpf chnages into patch 3.
>
> Will do in v3.
>
> > > struct qdisc_skb_cb {
> > > - unsigned int pkt_len;
> > > - u16 slave_dev_queue_mapping;
> > > - u16 tc_classid;
> > > + union {
> > > + struct {
> > > + unsigned int pkt_len;
> > > + u16 slave_dev_queue_mapping;
> > > + u16 tc_classid;
> > > + };
> > > + struct bpf_flow_keys *flow_keys;
> > > + };
> >
> > is this magic really necessary? flow_dissector runs very early in recv path.
> > There is no qdisc or conflicts with tcp/ip use of cb.
> > I think the whole cb block can be used.
>
> The flow dissector also runs in the context of TC, from flower.
> But that is not a reason to use this struct.
>
> We need both (a) data shared with the BPF application and between
> applications using tail-calls, to pass along the parse offset (nhoff),
> and (b) data not accessible by the program, to store the flow_keys
> pointer.
>
> qdisc_skb_cb already has this split, exposing only the 20B .data
> field to the application. Flow dissection currently reuses the existing
> bpf_convert_ctx_access logic for this field.
>
> We could create a separate flowdissect_skb_cb struct with the
> same split setup, but a second constraint is that relevant internal
> BPF interfaces already expect qdisc_skb_cb, notably
> bkf_skb_data_end. So the union was easier.
got it. all makes sense.
>
> There is another way to pass nhoff besides cb[] (see below). But
> I don't immediately see another place to store the flow_keys ptr.
>
> At least, if we pass skb as context. One larger change would
> be to introduce another ctx struct, similar to sk_reuseport_(kern|md).
yeah. thought about this too, but your approach looks easier and faster.
Accesses to skb have one less dereference.
> > > @@ -2333,6 +2335,7 @@ struct __sk_buff {
> > > /* ... here. */
> > >
> > > __u32 data_meta;
> > > + __u32 flow_keys;
> >
> > please use
> > struct bpf_flow_keys *flow_keys;
> > instead.
> >
> > See what we did in 'struct sk_msg_md' and in 'struct sk_reuseport_md'.
> > There is no need to hide pointers in u32.
> >
>
> Will do in v3.
>
> > > @@ -658,6 +754,46 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
> > > FLOW_DISSECTOR_KEY_BASIC,
> > > target_container);
> > >
> > > + rcu_read_lock();
> > > + attached = skb ? rcu_dereference(dev_net(skb->dev)->flow_dissector_prog)
> > > + : NULL;
> > > + if (attached) {
> > > + /* Note that even though the const qualifier is discarded
> > > + * throughout the execution of the BPF program, all changes(the
> > > + * control block) are reverted after the BPF program returns.
> > > + * Therefore, __skb_flow_dissect does not alter the skb.
> > > + */
> > > + struct bpf_flow_keys flow_keys = {};
> > > + struct qdisc_skb_cb cb_saved;
> > > + struct qdisc_skb_cb *cb;
> > > + u16 *pseudo_cb;
> > > + u32 result;
> > > +
> > > + cb = qdisc_skb_cb(skb);
> > > + pseudo_cb = (u16 *)bpf_skb_cb((struct sk_buff *)skb);
> > > +
> > > + /* Save Control Block */
> > > + memcpy(&cb_saved, cb, sizeof(cb_saved));
> > > + memset(cb, 0, sizeof(cb_saved));
> > > +
> > > + /* Pass parameters to the BPF program */
> > > + cb->flow_keys = &flow_keys;
> > > + *pseudo_cb = nhoff;
> >
> > I don't understand this bit.
> > What is this pseudo_cb and why nhoff goes in there?
> > Some odd way to pass it into the prog?
>
> Yes. nhoff passes the offset to the program to start parsing from.
> Applications also pass this during tail calls.
>
> Alternatively we can just add a new field to struct bpf_flow_keys.
I think that certainly will be cleaner and easier to use from
bpf prog pov. Since flow_keys stay constant any change to nhoff
between tail_calls will be preserved too. I see no cons to such approach.
^ permalink raw reply
* Re: [PATCH bpf-next 4/4] tools/bpf: bpftool: add net support
From: Daniel Borkmann @ 2018-09-12 22:29 UTC (permalink / raw)
To: Yonghong Song, ast, netdev; +Cc: kernel-team
In-Reply-To: <20180905235806.1536396-5-yhs@fb.com>
On 09/06/2018 01:58 AM, Yonghong Song wrote:
> Add "bpftool net" support. Networking devices are enumerated
> to dump device index/name associated with xdp progs.
>
> For each networking device, tc classes and qdiscs are enumerated
> in order to check their bpf filters.
> In addition, root handle and clsact ingress/egress are also checked for
> bpf filters. Not all filter information is printed out. Only ifindex,
> kind, filter name, prog_id and tag are printed out, which are good
> enough to show attachment information. If the filter action
> is a bpf action, its bpf program id, bpf name and tag will be
> printed out as well.
>
> For example,
> $ ./bpftool net
> xdp [
> ifindex 2 devname eth0 prog_id 198
> ]
Could we make the output more terse? E.g. the 'ifindex' and 'devname' is basically
zero information but will take lots of space. 'eth0 (2)' would for example make it
shorter. Also info is missing whether the attached prog is driver/hw/generic XDP. :(
> tc_filters [
> ifindex 2 kind qdisc_htb name prefix_matcher.o:[cls_prefix_matcher_htb]
> prog_id 111727 tag d08fe3b4319bc2fd act []
> ifindex 2 kind qdisc_clsact_ingress name fbflow_icmp
> prog_id 130246 tag 3f265c7f26db62c9 act []
> ifindex 2 kind qdisc_clsact_egress name prefix_matcher.o:[cls_prefix_matcher_clsact]
> prog_id 111726 tag 99a197826974c876
> ifindex 2 kind qdisc_clsact_egress name cls_fg_dscp
> prog_id 108619 tag dc4630674fd72dcc act []
> ifindex 2 kind qdisc_clsact_egress name fbflow_egress
> prog_id 130245 tag 72d2d830d6888d2c
> ]
Similar comment here. Do we need the tag here? I think it's not really needed, e.g.
the output of bpftool perf [0] doesn't provide it either and therefore makes the list
as nice one-liners, so overview is much nicer there. Is there a reason that tc progs
do not show dev name as opposed to xdp progs? Can we shorten everything to make it
a one-liner like in bpftool perf?
Should we have a small indicator here if the tc prog was offloaded?
Does the dump work with tc shared blocks?
Should we also dump networking related cgroup BPF progs here under bpftool net?
Thanks,
Daniel
[0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b04df400c30235fa347313c9e2a0695549bd2c8e
^ permalink raw reply
* [PATCH v3] PCI: Reprogram bridge prefetch registers on resume
From: Daniel Drake @ 2018-09-13 3:37 UTC (permalink / raw)
To: bhelgaas-hpIqsD4AKlfQT0dZR+AlfA
Cc: andy.shevchenko-VuQAYsv1563Yd54FQh9/CA,
linux-pm-u79uwXL29TY76Z2rM5mHXA, linux-pci-u79uwXL29TY76Z2rM5mHXA,
rafael.j.wysocki-ral2JQCrhuEAvxtiuMwx3w,
nic_swsd-Rasf1IRRPZFBDgjK7y7TUQ,
keith.busch-ral2JQCrhuEAvxtiuMwx3w, netdev-u79uwXL29TY76Z2rM5mHXA,
nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
rchang-eYqpPyKDWXRBDgjK7y7TUQ, linux-6IF/jdPJHihWk0Htik3J/w,
davem-fT/PcQaiUtIeIZ0/mPfg9Q,
jonathan.derrick-ral2JQCrhuEAvxtiuMwx3w,
hkallweit1-Re5JQEeQqe8AvxtiuMwx3w
On 38+ Intel-based Asus products, the nvidia GPU becomes unusable
after S3 suspend/resume. The affected products include multiple
generations of nvidia GPUs and Intel SoCs. After resume, nouveau logs
many errors such as:
fifo: fault 00 [READ] at 0000005555555000 engine 00 [GR] client 04
[HUB/FE] reason 4a [] on channel -1 [007fa91000 unknown]
DRM: failed to idle channel 0 [DRM]
Similarly, the nvidia proprietary driver also fails after resume
(black screen, 100% CPU usage in Xorg process). We shipped a sample
to Nvidia for diagnosis, and their response indicated that it's a
problem with the parent PCI bridge (on the Intel SoC), not the GPU.
Runtime suspend/resume works fine, only S3 suspend is affected.
We found a workaround: on resume, rewrite the Intel PCI bridge
'Prefetchable Base Upper 32 Bits' register (PCI_PREF_BASE_UPPER32). In
the cases that I checked, this register has value 0 and we just have to
rewrite that value.
Linux already saves and restores PCI config space during suspend/resume,
but this register was being skipped because upon resume, it already
has value 0 (the correct, pre-suspend value).
Intel appear to have previously acknowledged this behaviour and the
requirement to rewrite this register.
https://bugzilla.kernel.org/show_bug.cgi?id=116851#c23
Based on that, rewrite the prefetch register values even when that
appears unnecessary.
We have confirmed this solution on all the affected models we have
in-hands (X542UQ, UX533FD, X530UN, V272UN).
Additionally, this solves an issue where r8169 MSI-X interrupts were
broken after S3 suspend/resume on Asus X441UAR. This issue was recently
worked around in commit 7bb05b85bc2d ("r8169: don't use MSI-X on
RTL8106e"). It also fixes the same issue on RTL6186evl/8111evl on an
Aimfor-tech laptop that we had not yet patched. I suspect it will also
fix the issue that was worked around in commit 7c53a722459c ("r8169:
don't use MSI-X on RTL8168g").
Thomas Martitz reports that this change also solves an issue where
the AMD Radeon Polaris 10 GPU on the HP Zbook 14u G5 is unresponsive
after S3 suspend/resume.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=201069
Signed-off-by: Daniel Drake <drake@endlessm.com>
---
drivers/pci/pci.c | 25 +++++++++++++++++--------
1 file changed, 17 insertions(+), 8 deletions(-)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 29ff9619b5fa..5d58220b6997 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1289,12 +1289,12 @@ int pci_save_state(struct pci_dev *dev)
EXPORT_SYMBOL(pci_save_state);
static void pci_restore_config_dword(struct pci_dev *pdev, int offset,
- u32 saved_val, int retry)
+ u32 saved_val, int retry, bool force)
{
u32 val;
pci_read_config_dword(pdev, offset, &val);
- if (val == saved_val)
+ if (!force && val == saved_val)
return;
for (;;) {
@@ -1313,25 +1313,34 @@ static void pci_restore_config_dword(struct pci_dev *pdev, int offset,
}
static void pci_restore_config_space_range(struct pci_dev *pdev,
- int start, int end, int retry)
+ int start, int end, int retry,
+ bool force)
{
int index;
for (index = end; index >= start; index--)
pci_restore_config_dword(pdev, 4 * index,
pdev->saved_config_space[index],
- retry);
+ retry, force);
}
static void pci_restore_config_space(struct pci_dev *pdev)
{
if (pdev->hdr_type == PCI_HEADER_TYPE_NORMAL) {
- pci_restore_config_space_range(pdev, 10, 15, 0);
+ pci_restore_config_space_range(pdev, 10, 15, 0, false);
/* Restore BARs before the command register. */
- pci_restore_config_space_range(pdev, 4, 9, 10);
- pci_restore_config_space_range(pdev, 0, 3, 0);
+ pci_restore_config_space_range(pdev, 4, 9, 10, false);
+ pci_restore_config_space_range(pdev, 0, 3, 0, false);
+ } else if (pdev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
+ pci_restore_config_space_range(pdev, 12, 15, 0, false);
+ /* Force rewriting of prefetch registers to avoid
+ * S3 resume issues on Intel PCI bridges that occur when
+ * these registers are not explicitly written.
+ */
+ pci_restore_config_space_range(pdev, 9, 11, 0, true);
+ pci_restore_config_space_range(pdev, 0, 8, 0, false);
} else {
- pci_restore_config_space_range(pdev, 0, 15, 0);
+ pci_restore_config_space_range(pdev, 0, 15, 0, false);
}
}
--
2.17.1
_______________________________________________
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau
^ permalink raw reply related
* Re: [PATCH] docs: net: Remove TCP congestion document
From: David Miller @ 2018-09-13 3:43 UTC (permalink / raw)
To: me; +Cc: edumazet, netdev, linux-doc, linux-kernel
In-Reply-To: <20180912081444.17258-1-me@tobin.cc>
From: "Tobin C. Harding" <me@tobin.cc>
Date: Wed, 12 Sep 2018 18:14:44 +1000
> Document is stale, let's remove it.
>
> Remove TCP congestion document.
>
> Signed-off-by: Tobin C. Harding <me@tobin.cc>
Applied, thanks.
^ permalink raw reply
* Re: [PATCH bpf-next 4/4] tools/bpf: bpftool: add net support
From: Daniel Borkmann @ 2018-09-12 22:40 UTC (permalink / raw)
To: Yonghong Song, ast, netdev; +Cc: kernel-team
In-Reply-To: <b5b3d620-80a5-3b00-e84f-12ebce6c6925@iogearbox.net>
On 09/13/2018 12:29 AM, Daniel Borkmann wrote:
> On 09/06/2018 01:58 AM, Yonghong Song wrote:
>> Add "bpftool net" support. Networking devices are enumerated
>> to dump device index/name associated with xdp progs.
>>
>> For each networking device, tc classes and qdiscs are enumerated
>> in order to check their bpf filters.
>> In addition, root handle and clsact ingress/egress are also checked for
>> bpf filters. Not all filter information is printed out. Only ifindex,
>> kind, filter name, prog_id and tag are printed out, which are good
>> enough to show attachment information. If the filter action
>> is a bpf action, its bpf program id, bpf name and tag will be
>> printed out as well.
>>
>> For example,
>> $ ./bpftool net
>> xdp [
>> ifindex 2 devname eth0 prog_id 198
>> ]
>
> Could we make the output more terse? E.g. the 'ifindex' and 'devname' is basically
> zero information but will take lots of space. 'eth0 (2)' would for example make it
> shorter. Also info is missing whether the attached prog is driver/hw/generic XDP. :(
>
>> tc_filters [
>> ifindex 2 kind qdisc_htb name prefix_matcher.o:[cls_prefix_matcher_htb]
>> prog_id 111727 tag d08fe3b4319bc2fd act []
>> ifindex 2 kind qdisc_clsact_ingress name fbflow_icmp
>> prog_id 130246 tag 3f265c7f26db62c9 act []
>> ifindex 2 kind qdisc_clsact_egress name prefix_matcher.o:[cls_prefix_matcher_clsact]
>> prog_id 111726 tag 99a197826974c876
>> ifindex 2 kind qdisc_clsact_egress name cls_fg_dscp
>> prog_id 108619 tag dc4630674fd72dcc act []
>> ifindex 2 kind qdisc_clsact_egress name fbflow_egress
>> prog_id 130245 tag 72d2d830d6888d2c
>> ]
>
> Similar comment here. Do we need the tag here? I think it's not really needed, e.g.
> the output of bpftool perf [0] doesn't provide it either and therefore makes the list
> as nice one-liners, so overview is much nicer there. Is there a reason that tc progs
> do not show dev name as opposed to xdp progs? Can we shorten everything to make it
> a one-liner like in bpftool perf?
>
> Should we have a small indicator here if the tc prog was offloaded?
>
> Does the dump work with tc shared blocks?
>
> Should we also dump networking related cgroup BPF progs here under bpftool net?
One more thought, I think it would make sense to also explicitly document current
limitations of the progs that this listing does not show where user should consult
other tools aka iproute2 e.g. lwt, seg6, tc bpf actions, etc, so the current scope
would be more clear for users.
> Thanks,
> Daniel
>
> [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b04df400c30235fa347313c9e2a0695549bd2c8e
>
^ permalink raw reply
* Re: [PATCH bpf-next 04/11] bpf: Add PTR_TO_SOCKET verifier type
From: Alexei Starovoitov @ 2018-09-12 22:50 UTC (permalink / raw)
To: Joe Stringer
Cc: daniel, netdev, ast, john.fastabend, tgraf, kafai, nitin.hande,
mauricio.vasquez
In-Reply-To: <20180912003640.28316-5-joe@wand.net.nz>
On Tue, Sep 11, 2018 at 05:36:33PM -0700, Joe Stringer wrote:
> Teach the verifier a little bit about a new type of pointer, a
> PTR_TO_SOCKET. This pointer type is accessed from BPF through the
> 'struct bpf_sock' structure.
>
> Signed-off-by: Joe Stringer <joe@wand.net.nz>
> ---
> include/linux/bpf.h | 17 +++++
> include/linux/bpf_verifier.h | 2 +
> kernel/bpf/verifier.c | 125 ++++++++++++++++++++++++++++++-----
> net/core/filter.c | 30 +++++----
> 4 files changed, 147 insertions(+), 27 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 523481a3471b..6ec93f3d66dd 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -154,6 +154,7 @@ enum bpf_arg_type {
>
> ARG_PTR_TO_CTX, /* pointer to context */
> ARG_ANYTHING, /* any (initialized) argument is ok */
> + ARG_PTR_TO_SOCKET, /* pointer to bpf_sock */
> };
>
> /* type of values returned from helper functions */
> @@ -162,6 +163,7 @@ enum bpf_return_type {
> RET_VOID, /* function doesn't return anything */
> RET_PTR_TO_MAP_VALUE, /* returns a pointer to map elem value */
> RET_PTR_TO_MAP_VALUE_OR_NULL, /* returns a pointer to map elem value or NULL */
> + RET_PTR_TO_SOCKET_OR_NULL, /* returns a pointer to a socket or NULL */
> };
>
> /* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF programs
> @@ -212,6 +214,8 @@ enum bpf_reg_type {
> PTR_TO_PACKET_META, /* skb->data - meta_len */
> PTR_TO_PACKET, /* reg points to skb->data */
> PTR_TO_PACKET_END, /* skb->data + headlen */
> + PTR_TO_SOCKET, /* reg points to struct bpf_sock */
> + PTR_TO_SOCKET_OR_NULL, /* reg points to struct bpf_sock or NULL */
> };
>
> /* The information passed from prog-specific *_is_valid_access
> @@ -334,6 +338,11 @@ const struct bpf_func_proto *bpf_get_trace_printk_proto(void);
>
> typedef unsigned long (*bpf_ctx_copy_t)(void *dst, const void *src,
> unsigned long off, unsigned long len);
> +typedef u32 (*bpf_convert_ctx_access_t)(enum bpf_access_type type,
> + const struct bpf_insn *src,
> + struct bpf_insn *dst,
> + struct bpf_prog *prog,
> + u32 *target_size);
>
> u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size,
> void *ctx, u64 ctx_size, bpf_ctx_copy_t ctx_copy);
> @@ -827,4 +836,12 @@ extern const struct bpf_func_proto bpf_get_local_storage_proto;
> void bpf_user_rnd_init_once(void);
> u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
>
> +bool bpf_sock_is_valid_access(int off, int size, enum bpf_access_type type,
> + struct bpf_insn_access_aux *info);
> +u32 bpf_sock_convert_ctx_access(enum bpf_access_type type,
> + const struct bpf_insn *si,
> + struct bpf_insn *insn_buf,
> + struct bpf_prog *prog,
> + u32 *target_size);
> +
> #endif /* _LINUX_BPF_H */
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index af262b97f586..23a2b17bfd75 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -58,6 +58,8 @@ struct bpf_reg_state {
> * offset, so they can share range knowledge.
> * For PTR_TO_MAP_VALUE_OR_NULL this is used to share which map value we
> * came from, when one is tested for != NULL.
> + * For PTR_TO_SOCKET this is used to share which pointers retain the
> + * same reference to the socket, to determine proper reference freeing.
> */
> u32 id;
> /* For scalar types (SCALAR_VALUE), this represents our knowledge of
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index f2357c8c90de..111e031cf65d 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -80,8 +80,8 @@ static const struct bpf_verifier_ops * const bpf_verifier_ops[] = {
> * (like pointer plus pointer becomes SCALAR_VALUE type)
> *
> * When verifier sees load or store instructions the type of base register
> - * can be: PTR_TO_MAP_VALUE, PTR_TO_CTX, PTR_TO_STACK. These are three pointer
> - * types recognized by check_mem_access() function.
> + * can be: PTR_TO_MAP_VALUE, PTR_TO_CTX, PTR_TO_STACK, PTR_TO_SOCKET. These are
> + * four pointer types recognized by check_mem_access() function.
> *
> * PTR_TO_MAP_VALUE means that this register is pointing to 'map element value'
> * and the range of [ptr, ptr + map's value_size) is accessible.
> @@ -266,6 +266,8 @@ static const char * const reg_type_str[] = {
> [PTR_TO_PACKET] = "pkt",
> [PTR_TO_PACKET_META] = "pkt_meta",
> [PTR_TO_PACKET_END] = "pkt_end",
> + [PTR_TO_SOCKET] = "sock",
> + [PTR_TO_SOCKET_OR_NULL] = "sock_or_null",
> };
>
> static char slot_type_char[] = {
> @@ -971,6 +973,8 @@ static bool is_spillable_regtype(enum bpf_reg_type type)
> case PTR_TO_PACKET_META:
> case PTR_TO_PACKET_END:
> case CONST_PTR_TO_MAP:
> + case PTR_TO_SOCKET:
> + case PTR_TO_SOCKET_OR_NULL:
> return true;
> default:
> return false;
> @@ -1326,6 +1330,28 @@ static int check_ctx_access(struct bpf_verifier_env *env, int insn_idx, int off,
> return -EACCES;
> }
>
> +static int check_sock_access(struct bpf_verifier_env *env, u32 regno, int off,
> + int size, enum bpf_access_type t)
> +{
> + struct bpf_reg_state *regs = cur_regs(env);
> + struct bpf_reg_state *reg = ®s[regno];
> + struct bpf_insn_access_aux info;
> +
> + if (reg->smin_value < 0) {
> + verbose(env, "R%d min value is negative, either use unsigned index or do a if (index >=0) check.\n",
> + regno);
> + return -EACCES;
> + }
> +
> + if (!bpf_sock_is_valid_access(off, size, t, &info)) {
> + verbose(env, "invalid bpf_sock_ops access off=%d size=%d\n",
> + off, size);
> + return -EACCES;
> + }
> +
> + return 0;
> +}
> +
> static bool __is_pointer_value(bool allow_ptr_leaks,
> const struct bpf_reg_state *reg)
> {
> @@ -1441,6 +1467,9 @@ static int check_ptr_alignment(struct bpf_verifier_env *env,
> */
> strict = true;
> break;
> + case PTR_TO_SOCKET:
> + pointer_desc = "sock ";
> + break;
> default:
> break;
> }
> @@ -1697,6 +1726,16 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
> err = check_packet_access(env, regno, off, size, false);
> if (!err && t == BPF_READ && value_regno >= 0)
> mark_reg_unknown(env, regs, value_regno);
> +
> + } else if (reg->type == PTR_TO_SOCKET) {
> + if (t == BPF_WRITE) {
> + verbose(env, "cannot write into socket\n");
> + return -EACCES;
> + }
> + err = check_sock_access(env, regno, off, size, t);
> + if (!err && value_regno >= 0)
> + mark_reg_unknown(env, regs, value_regno);
> +
> } else {
> verbose(env, "R%d invalid mem access '%s'\n", regno,
> reg_type_str[reg->type]);
> @@ -1918,6 +1957,10 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
> err = check_ctx_reg(env, reg, regno);
> if (err < 0)
> return err;
> + } else if (arg_type == ARG_PTR_TO_SOCKET) {
> + expected_type = PTR_TO_SOCKET;
> + if (type != expected_type)
> + goto err_type;
> } else if (arg_type_is_mem_ptr(arg_type)) {
> expected_type = PTR_TO_STACK;
> /* One exception here. In case function allows for NULL to be
> @@ -2511,6 +2554,10 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
> }
> regs[BPF_REG_0].map_ptr = meta.map_ptr;
> regs[BPF_REG_0].id = ++env->id_gen;
> + } else if (fn->ret_type == RET_PTR_TO_SOCKET_OR_NULL) {
> + mark_reg_known_zero(env, regs, BPF_REG_0);
> + regs[BPF_REG_0].type = PTR_TO_SOCKET_OR_NULL;
> + regs[BPF_REG_0].id = ++env->id_gen;
> } else {
> verbose(env, "unknown return type %d of func %s#%d\n",
> fn->ret_type, func_id_name(func_id), func_id);
> @@ -2648,6 +2695,8 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env,
> return -EACCES;
> case CONST_PTR_TO_MAP:
> case PTR_TO_PACKET_END:
> + case PTR_TO_SOCKET:
> + case PTR_TO_SOCKET_OR_NULL:
> verbose(env, "R%d pointer arithmetic on %s prohibited\n",
> dst, reg_type_str[ptr_reg->type]);
> return -EACCES;
> @@ -3595,6 +3644,8 @@ static void mark_ptr_or_null_reg(struct bpf_reg_state *reg, u32 id,
> } else {
> reg->type = PTR_TO_MAP_VALUE;
> }
> + } else if (reg->type == PTR_TO_SOCKET_OR_NULL) {
> + reg->type = PTR_TO_SOCKET;
> }
> /* We don't need id from this point onwards anymore, thus we
> * should better reset it, so that state pruning has chances
> @@ -4369,6 +4420,8 @@ static bool regsafe(struct bpf_reg_state *rold, struct bpf_reg_state *rcur,
> case PTR_TO_CTX:
> case CONST_PTR_TO_MAP:
> case PTR_TO_PACKET_END:
> + case PTR_TO_SOCKET:
> + case PTR_TO_SOCKET_OR_NULL:
> /* Only valid matches are exact, which memcmp() above
> * would have accepted
> */
> @@ -4646,6 +4699,37 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx)
> return 0;
> }
>
> +/* Return true if it's OK to have the same insn return a different type. */
> +static bool reg_type_mismatch_ok(enum bpf_reg_type type)
> +{
> + switch (type) {
> + case PTR_TO_CTX:
> + case PTR_TO_SOCKET:
> + case PTR_TO_SOCKET_OR_NULL:
> + return false;
> + default:
> + return true;
> + }
> +}
> +
> +/* If an instruction was previously used with particular pointer types, then we
> + * need to be careful to avoid cases such as the below, where it may be ok
> + * for one branch accessing the pointer, but not ok for the other branch:
> + *
> + * R1 = sock_ptr
> + * goto X;
> + * ...
> + * R1 = some_other_valid_ptr;
> + * goto X;
> + * ...
> + * R2 = *(u32 *)(R1 + 0);
> + */
> +static bool reg_type_mismatch(enum bpf_reg_type src, enum bpf_reg_type prev)
> +{
> + return src != prev && (!reg_type_mismatch_ok(src) ||
> + !reg_type_mismatch_ok(prev));
> +}
> +
> static int do_check(struct bpf_verifier_env *env)
> {
> struct bpf_verifier_state *state;
> @@ -4778,9 +4862,7 @@ static int do_check(struct bpf_verifier_env *env)
> */
> *prev_src_type = src_reg_type;
>
> - } else if (src_reg_type != *prev_src_type &&
> - (src_reg_type == PTR_TO_CTX ||
> - *prev_src_type == PTR_TO_CTX)) {
> + } else if (reg_type_mismatch(src_reg_type, *prev_src_type)) {
> /* ABuser program is trying to use the same insn
> * dst_reg = *(u32*) (src_reg + off)
> * with different pointer types:
> @@ -4826,8 +4908,8 @@ static int do_check(struct bpf_verifier_env *env)
> if (*prev_dst_type == NOT_INIT) {
> *prev_dst_type = dst_reg_type;
> } else if (dst_reg_type != *prev_dst_type &&
> - (dst_reg_type == PTR_TO_CTX ||
> - *prev_dst_type == PTR_TO_CTX)) {
> + (!reg_type_mismatch_ok(dst_reg_type) ||
> + !reg_type_mismatch_ok(*prev_dst_type))) {
reg_type_mismatch() could have been used here as well ?
> verbose(env, "same insn cannot be used with different pointers\n");
> return -EINVAL;
> }
> @@ -5244,10 +5326,14 @@ static void sanitize_dead_code(struct bpf_verifier_env *env)
> }
> }
>
> -/* convert load instructions that access fields of 'struct __sk_buff'
> - * into sequence of instructions that access fields of 'struct sk_buff'
> +/* convert load instructions that access fields of a context type into a
> + * sequence of instructions that access fields of the underlying structure:
> + * struct __sk_buff -> struct sk_buff
> + * struct bpf_sock_ops -> struct sock
> */
> -static int convert_ctx_accesses(struct bpf_verifier_env *env)
> +static int convert_ctx_accesses(struct bpf_verifier_env *env,
> + bpf_convert_ctx_access_t convert_ctx_access,
> + enum bpf_reg_type ctx_type)
> {
> const struct bpf_verifier_ops *ops = env->ops;
> int i, cnt, size, ctx_field_size, delta = 0;
> @@ -5274,12 +5360,14 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
> }
> }
>
> - if (!ops->convert_ctx_access || bpf_prog_is_dev_bound(env->prog->aux))
> + if (!convert_ctx_access || bpf_prog_is_dev_bound(env->prog->aux))
> return 0;
>
> insn = env->prog->insnsi + delta;
>
> for (i = 0; i < insn_cnt; i++, insn++) {
> + enum bpf_reg_type ptr_type;
> +
> if (insn->code == (BPF_LDX | BPF_MEM | BPF_B) ||
> insn->code == (BPF_LDX | BPF_MEM | BPF_H) ||
> insn->code == (BPF_LDX | BPF_MEM | BPF_W) ||
> @@ -5321,7 +5409,8 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
> continue;
> }
>
> - if (env->insn_aux_data[i + delta].ptr_type != PTR_TO_CTX)
> + ptr_type = env->insn_aux_data[i + delta].ptr_type;
> + if (ptr_type != ctx_type)
> continue;
>
> ctx_field_size = env->insn_aux_data[i + delta].ctx_field_size;
> @@ -5354,8 +5443,8 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
> }
>
> target_size = 0;
> - cnt = ops->convert_ctx_access(type, insn, insn_buf, env->prog,
> - &target_size);
> + cnt = convert_ctx_access(type, insn, insn_buf, env->prog,
> + &target_size);
> if (cnt == 0 || cnt >= ARRAY_SIZE(insn_buf) ||
> (ctx_field_size && !target_size)) {
> verbose(env, "bpf verifier is misconfigured\n");
> @@ -5899,7 +5988,13 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr)
>
> if (ret == 0)
> /* program is valid, convert *(u32*)(ctx + off) accesses */
> - ret = convert_ctx_accesses(env);
> + ret = convert_ctx_accesses(env, env->ops->convert_ctx_access,
> + PTR_TO_CTX);
> +
> + if (ret == 0)
> + /* Convert *(u32*)(sock_ops + off) accesses */
> + ret = convert_ctx_accesses(env, bpf_sock_convert_ctx_access,
> + PTR_TO_SOCKET);
can it be done in single pass instead?
env->insn_aux_data tells us what kind of conversion needs to happen.
while progs are small, I guess, it's ok, but would like to see a follow up
patch to this.
^ permalink raw reply
* Re: [PATCH bpf-next 05/11] bpf: Macrofy stack state copy
From: Alexei Starovoitov @ 2018-09-12 22:51 UTC (permalink / raw)
To: Joe Stringer
Cc: daniel, netdev, ast, john.fastabend, tgraf, kafai, nitin.hande,
mauricio.vasquez
In-Reply-To: <20180912003640.28316-6-joe@wand.net.nz>
On Tue, Sep 11, 2018 at 05:36:34PM -0700, Joe Stringer wrote:
> An upcoming commit will need very similar copy/realloc boilerplate, so
> refactor the existing stack copy/realloc functions into macros to
> simplify it.
>
> Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
^ permalink raw reply
* Re: [bpf-next, v2 1/3] flow_dissector: implements flow dissector BPF hook
From: Willem de Bruijn @ 2018-09-12 22:53 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Petar Penkov, Network Development, David Miller,
Alexei Starovoitov, Daniel Borkmann, simon.horman, ecree,
songliubraving, Tom Herbert, Petar Penkov, Willem de Bruijn
In-Reply-To: <20180912222500.57wnc7sjofgatuxu@ast-mbp>
On Wed, Sep 12, 2018 at 6:25 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Sep 12, 2018 at 02:43:37PM -0400, Willem de Bruijn wrote:
> > On Tue, Sep 11, 2018 at 11:47 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Fri, Sep 07, 2018 at 05:11:08PM -0700, Petar Penkov wrote:
> > > > From: Petar Penkov <ppenkov@google.com>
> > > >
> > > > Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
> > > > attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
> > > > path. The BPF program is per-network namespace.
> > > >
> > > > Signed-off-by: Petar Penkov <ppenkov@google.com>
> > > > Signed-off-by: Willem de Bruijn <willemb@google.com>
> > > > ---
> > > > include/linux/bpf.h | 1 +
> > > > include/linux/bpf_types.h | 1 +
> > > > include/linux/skbuff.h | 7 ++
> > > > include/net/net_namespace.h | 3 +
> > > > include/net/sch_generic.h | 12 ++-
> > > > include/uapi/linux/bpf.h | 25 ++++++
> > > > kernel/bpf/syscall.c | 8 ++
> > > > kernel/bpf/verifier.c | 32 ++++++++
> > > > net/core/filter.c | 67 ++++++++++++++++
> > > > net/core/flow_dissector.c | 136 +++++++++++++++++++++++++++++++++
> > > > tools/bpf/bpftool/prog.c | 1 +
> > > > tools/include/uapi/linux/bpf.h | 25 ++++++
> > > > tools/lib/bpf/libbpf.c | 2 +
> > >
> > > please split up update to tools/include/uapi/linux/bpf.h as a separate patch 2.
> > > We often have conflicts in there, so best to have a separate.
> > > Also please split tools/lib and tools/bpf chnages into patch 3.
> >
> > Will do in v3.
> >
> > > > struct qdisc_skb_cb {
> > > > - unsigned int pkt_len;
> > > > - u16 slave_dev_queue_mapping;
> > > > - u16 tc_classid;
> > > > + union {
> > > > + struct {
> > > > + unsigned int pkt_len;
> > > > + u16 slave_dev_queue_mapping;
> > > > + u16 tc_classid;
> > > > + };
> > > > + struct bpf_flow_keys *flow_keys;
> > > > + };
> > >
> > > is this magic really necessary? flow_dissector runs very early in recv path.
> > > There is no qdisc or conflicts with tcp/ip use of cb.
> > > I think the whole cb block can be used.
> >
> > The flow dissector also runs in the context of TC, from flower.
> > But that is not a reason to use this struct.
> >
> > We need both (a) data shared with the BPF application and between
> > applications using tail-calls, to pass along the parse offset (nhoff),
> > and (b) data not accessible by the program, to store the flow_keys
> > pointer.
> >
> > qdisc_skb_cb already has this split, exposing only the 20B .data
> > field to the application. Flow dissection currently reuses the existing
> > bpf_convert_ctx_access logic for this field.
> >
> > We could create a separate flowdissect_skb_cb struct with the
> > same split setup, but a second constraint is that relevant internal
> > BPF interfaces already expect qdisc_skb_cb, notably
> > bkf_skb_data_end. So the union was easier.
>
> got it. all makes sense.
>
> >
> > There is another way to pass nhoff besides cb[] (see below). But
> > I don't immediately see another place to store the flow_keys ptr.
> >
> > At least, if we pass skb as context. One larger change would
> > be to introduce another ctx struct, similar to sk_reuseport_(kern|md).
>
> yeah. thought about this too, but your approach looks easier and faster.
> Accesses to skb have one less dereference.
>
> > > > @@ -2333,6 +2335,7 @@ struct __sk_buff {
> > > > /* ... here. */
> > > >
> > > > __u32 data_meta;
> > > > + __u32 flow_keys;
> > >
> > > please use
> > > struct bpf_flow_keys *flow_keys;
> > > instead.
> > >
> > > See what we did in 'struct sk_msg_md' and in 'struct sk_reuseport_md'.
> > > There is no need to hide pointers in u32.
> > >
> >
> > Will do in v3.
> >
> > > > @@ -658,6 +754,46 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
> > > > FLOW_DISSECTOR_KEY_BASIC,
> > > > target_container);
> > > >
> > > > + rcu_read_lock();
> > > > + attached = skb ? rcu_dereference(dev_net(skb->dev)->flow_dissector_prog)
> > > > + : NULL;
> > > > + if (attached) {
> > > > + /* Note that even though the const qualifier is discarded
> > > > + * throughout the execution of the BPF program, all changes(the
> > > > + * control block) are reverted after the BPF program returns.
> > > > + * Therefore, __skb_flow_dissect does not alter the skb.
> > > > + */
> > > > + struct bpf_flow_keys flow_keys = {};
> > > > + struct qdisc_skb_cb cb_saved;
> > > > + struct qdisc_skb_cb *cb;
> > > > + u16 *pseudo_cb;
> > > > + u32 result;
> > > > +
> > > > + cb = qdisc_skb_cb(skb);
> > > > + pseudo_cb = (u16 *)bpf_skb_cb((struct sk_buff *)skb);
> > > > +
> > > > + /* Save Control Block */
> > > > + memcpy(&cb_saved, cb, sizeof(cb_saved));
> > > > + memset(cb, 0, sizeof(cb_saved));
> > > > +
> > > > + /* Pass parameters to the BPF program */
> > > > + cb->flow_keys = &flow_keys;
> > > > + *pseudo_cb = nhoff;
> > >
> > > I don't understand this bit.
> > > What is this pseudo_cb and why nhoff goes in there?
> > > Some odd way to pass it into the prog?
> >
> > Yes. nhoff passes the offset to the program to start parsing from.
> > Applications also pass this during tail calls.
> >
> > Alternatively we can just add a new field to struct bpf_flow_keys.
>
> I think that certainly will be cleaner and easier to use from
> bpf prog pov. Since flow_keys stay constant any change to nhoff
> between tail_calls will be preserved too. I see no cons to such approach.
Yes, it's definitely simpler. We'll do that.
Thanks!
^ permalink raw reply
* Re: [PATCH net-next v3 02/17] zinc: introduce minimal cryptography library
From: Ard Biesheuvel @ 2018-09-12 22:56 UTC (permalink / raw)
To: Jason A. Donenfeld
Cc: LKML, Netdev, David Miller, Greg Kroah-Hartman, Andrew Lutomirski,
Samuel Neves, Jean-Philippe Aumasson, Linux Crypto Mailing List
In-Reply-To: <CAHmME9rTEFLaYmzFrc719LW2DWS2Ua_vL6by1UXZ9uDvMpbhYw@mail.gmail.com>
On 11 September 2018 at 23:22, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> Hello Ard,
>
> I realize you've put a lot of good and hard work into the existing
> crypto API, and that my writing in these commit messages might be a
> bit too bristly and dismissive of that hard work. So I think in a
> sense it's understandable that you've responded as such here. But
> hopefully I can address your concerns. One thing to keep in mind is
> that Zinc endeavors to provide the basis for simple and direct
> functions to software algorithms. This is fairly modest goal. Just
> some functions that do some stuff in software. Around these you'll
> still be able to have complicated and impressive dynamic dispatch and
> asynchronous mechanisms such as the present crypto API. Zinc is merely
> getting the software implementation side done in a very simple and
> direct way. So I don't think there's a good reason for so much
> antagonism, despite a perhaps overbearing tone of my commit messages.
> Rather, I expect that we'll wind up working together on this quite a
> bit down the line.
>
No worries, it takes more than this to piss me off. I do take pride in
my work, but I am perfectly aware that I am not a rare talent like
Andy Polyakov, which is actually why I have worked extensively with
him in the past to get kernel specific changes to the ARM/arm64 NEON
code into upstream OpenSSL, so that we could use the upstream code
unmodified and benefit from upstream review. [1] [2]
In this series, you are dumping a huge volume of unannotated,
generated asm into the kernel which has been modified [by you] to
[among other things?] adhere to the kernel API (without documenting
what the changes are exactly). How does that live up to the promise of
better, peer reviewed code?
Then there is the performance claim. We know for instance that the
OpenSSL ARM NEON code for ChaCha20 is faster on cores that happen to
possess a micro-architectural property that ALU instructions are
essentially free when they are interleaved with SIMD instructions. But
we also know that a) Cortex-A7, which is a relevant target, is not one
of those cores, and b) that chip designers are not likely to optimize
for that particular usage pattern so relying on it in generic code is
unwise in general.
I am also concerned about your claim that all software algorithms will
be moved into this crypto library. You are not specific about whose
responsibility it will be that this is going to happen in a timely
fashion. But more importantly, it is not clear at all how you expect
this to work for, e.g., h/w instruction based SHAxxx or AES in various
chaining modes, which should be used only on cores that implement
those instructions (note that on arm64, we have optional instructions
for AES, PMULL, SHA1, SHA256, SHA512, SHA3, SM3 and SM4). Are all
those implementations (only few of which will be used on a certain
core) going to be part of the monolithic library? What are the APIs
going to look like for block ciphers, taking chaining modes into
account?
I'm sure it is rather simple to port the crypto API implementation of
ChaCha20 to use your library. I am more concerned about how your
library is going to expand to cover all other software algorithms that
we currently use in the kernel.
>> In spite of the wall of text, you fail to point out exactly why the
>> existing AEAD API in unsuitable, and why fixing it is not an option.
>
> I thought I had addressed this. Firstly, there's a need for more than
> just AEAD, but ignoring that, the AEAD API is a big full API that does
> lots of things, makes allocations, parses descriptors, and so forth.
> I'm sure this kind of highly-engineered approach will continue to
> improve over time in that highly engineered direction. Zinc is doing
> something a bit different: it's providing a series of simple functions
> for various cryptographic routines. This is a considerably different
> goal -- perhaps even a complementary one -- to the AEAD API.
>
I understand how your crypto library is different from the crypto API.
But the question was why WireGuard cannot make use of the crypto APIs
AEAD interface. And note that this is an honest question: I know that
the crypto API is cumbersome to use in general, but it would be very
helpful to understand where exactly the impedance mismatch is between
what your code needs and what the crypto API provides.
>> I don't think you have
>> convinced anyone else yet either.
>
> Please only speak for yourself and refrain from rhetoric like this,
> which is patently false.
>
Fair enough. You have clearly convinced Greg :-)
>> Please refrain from sending a v4 with just a couple of more tweaks on
>> top
>
> Sorry, no, I'm not going to stop working hard on this because you're
> wary of a new approach. I will continue to improve the submission
> until it is mergable, and I do not intend to stop.
>
Of course. But please respond to all the concerns, not just the low
hanging fruit. I have already mentioned a couple of times that I
object to dumping large volumes of generated assembly into the kernel
tree without any annotation whatsoever, or any documentation about how
the generated code was hand tweaked before submission. You have not
responded to that concern yet.
> Anyway, it sounds like this whole thing may have ruffled your feathers
> a bit. Will you be at Linux Plumbers Conference in November? I'm
> planning on attending, and perhaps we could find some time there to
> sit down and talk one on one a bit.
>
That would be good, yes. I will be there.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e4e7f10bfc4069925e99cc4b428c3434e30b6c3f
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7918ecef073fe80eeb399a37d8d48561864eedf1
^ permalink raw reply
* Re: [PATCH bpf-next 06/11] bpf: Add reference tracking to verifier
From: Alexei Starovoitov @ 2018-09-12 23:17 UTC (permalink / raw)
To: Joe Stringer
Cc: daniel, netdev, ast, john.fastabend, tgraf, kafai, nitin.hande,
mauricio.vasquez
In-Reply-To: <20180912003640.28316-7-joe@wand.net.nz>
On Tue, Sep 11, 2018 at 05:36:35PM -0700, Joe Stringer wrote:
> Allow helper functions to acquire a reference and return it into a
> register. Specific pointer types such as the PTR_TO_SOCKET will
> implicitly represent such a reference. The verifier must ensure that
> these references are released exactly once in each path through the
> program.
>
> To achieve this, this commit assigns an id to the pointer and tracks it
> in the 'bpf_func_state', then when the function or program exits,
> verifies that all of the acquired references have been freed. When the
> pointer is passed to a function that frees the reference, it is removed
> from the 'bpf_func_state` and all existing copies of the pointer in
> registers are marked invalid.
>
> Signed-off-by: Joe Stringer <joe@wand.net.nz>
> ---
> include/linux/bpf_verifier.h | 24 ++-
> kernel/bpf/verifier.c | 303 ++++++++++++++++++++++++++++++++---
> 2 files changed, 306 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index 23a2b17bfd75..23f222e0cb0b 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -104,6 +104,17 @@ struct bpf_stack_state {
> u8 slot_type[BPF_REG_SIZE];
> };
>
> +struct bpf_reference_state {
> + /* Track each reference created with a unique id, even if the same
> + * instruction creates the reference multiple times (eg, via CALL).
> + */
> + int id;
> + /* Instruction where the allocation of this reference occurred. This
> + * is used purely to inform the user of a reference leak.
> + */
> + int insn_idx;
> +};
> +
> /* state of the program:
> * type of all registers and stack info
> */
> @@ -121,7 +132,9 @@ struct bpf_func_state {
> */
> u32 subprogno;
>
> - /* should be second to last. See copy_func_state() */
> + /* The following fields should be last. See copy_func_state() */
> + int acquired_refs;
> + struct bpf_reference_state *refs;
> int allocated_stack;
> struct bpf_stack_state *stack;
> };
> @@ -217,11 +230,16 @@ __printf(2, 0) void bpf_verifier_vlog(struct bpf_verifier_log *log,
> __printf(2, 3) void bpf_verifier_log_write(struct bpf_verifier_env *env,
> const char *fmt, ...);
>
> -static inline struct bpf_reg_state *cur_regs(struct bpf_verifier_env *env)
> +static inline struct bpf_func_state *cur_func(struct bpf_verifier_env *env)
> {
> struct bpf_verifier_state *cur = env->cur_state;
>
> - return cur->frame[cur->curframe]->regs;
> + return cur->frame[cur->curframe];
> +}
> +
> +static inline struct bpf_reg_state *cur_regs(struct bpf_verifier_env *env)
> +{
> + return cur_func(env)->regs;
> }
>
> int bpf_prog_offload_verifier_prep(struct bpf_verifier_env *env);
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index faa83b3d7011..67c62ef67d37 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -1,5 +1,6 @@
> /* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
> * Copyright (c) 2016 Facebook
> + * Copyright (c) 2018 Covalent IO, Inc. http://covalent.io
> *
> * This program is free software; you can redistribute it and/or
> * modify it under the terms of version 2 of the GNU General Public
> @@ -140,6 +141,18 @@ static const struct bpf_verifier_ops * const bpf_verifier_ops[] = {
> *
> * After the call R0 is set to return type of the function and registers R1-R5
> * are set to NOT_INIT to indicate that they are no longer readable.
> + *
> + * The following reference types represent a potential reference to a kernel
> + * resource which, after first being allocated, must be checked and freed by
> + * the BPF program:
> + * - PTR_TO_SOCKET_OR_NULL, PTR_TO_SOCKET
> + *
> + * When the verifier sees a helper call return a reference type, it allocates a
> + * pointer id for the reference and stores it in the current function state.
> + * Similar to the way that PTR_TO_MAP_VALUE_OR_NULL is converted into
> + * PTR_TO_MAP_VALUE, PTR_TO_SOCKET_OR_NULL becomes PTR_TO_SOCKET when the type
> + * passes through a NULL-check conditional. For the branch wherein the state is
> + * changed to CONST_IMM, the verifier releases the reference.
> */
>
> /* verifier_state + insn_idx are pushed to stack when branch is encountered */
> @@ -189,6 +202,7 @@ struct bpf_call_arg_meta {
> int access_size;
> s64 msize_smax_value;
> u64 msize_umax_value;
> + int ptr_id;
> };
>
> static DEFINE_MUTEX(bpf_verifier_lock);
> @@ -251,7 +265,42 @@ static bool type_is_pkt_pointer(enum bpf_reg_type type)
>
> static bool reg_type_may_be_null(enum bpf_reg_type type)
> {
> - return type == PTR_TO_MAP_VALUE_OR_NULL;
> + return type == PTR_TO_MAP_VALUE_OR_NULL ||
> + type == PTR_TO_SOCKET_OR_NULL;
> +}
> +
> +static bool type_is_refcounted(enum bpf_reg_type type)
> +{
> + return type == PTR_TO_SOCKET;
> +}
> +
> +static bool type_is_refcounted_or_null(enum bpf_reg_type type)
> +{
> + return type == PTR_TO_SOCKET || type == PTR_TO_SOCKET_OR_NULL;
> +}
> +
> +static bool reg_is_refcounted(const struct bpf_reg_state *reg)
> +{
> + return type_is_refcounted(reg->type);
> +}
> +
> +static bool reg_is_refcounted_or_null(const struct bpf_reg_state *reg)
> +{
> + return type_is_refcounted_or_null(reg->type);
> +}
> +
> +static bool arg_type_is_refcounted(enum bpf_arg_type type)
> +{
> + return type == ARG_PTR_TO_SOCKET;
> +}
> +
> +/* Determine whether the function releases some resources allocated by another
> + * function call. The first reference type argument will be assumed to be
> + * released by release_reference().
> + */
> +static bool is_release_function(enum bpf_func_id func_id)
> +{
> + return false;
> }
>
> /* string representation of 'enum bpf_reg_type' */
> @@ -384,6 +433,12 @@ static void print_verifier_state(struct bpf_verifier_env *env,
> else
> verbose(env, "=%s", types_buf);
> }
> + if (state->acquired_refs && state->refs[0].id) {
> + verbose(env, " refs=%d", state->refs[0].id);
> + for (i = 1; i < state->acquired_refs; i++)
> + if (state->refs[i].id)
> + verbose(env, ",%d", state->refs[i].id);
> + }
> verbose(env, "\n");
> }
>
> @@ -402,6 +457,8 @@ static int copy_##NAME##_state(struct bpf_func_state *dst, \
> sizeof(*src->FIELD) * (src->COUNT / SIZE)); \
> return 0; \
> }
> +/* copy_reference_state() */
> +COPY_STATE_FN(reference, acquired_refs, refs, 1)
> /* copy_stack_state() */
> COPY_STATE_FN(stack, allocated_stack, stack, BPF_REG_SIZE)
> #undef COPY_STATE_FN
> @@ -440,6 +497,8 @@ static int realloc_##NAME##_state(struct bpf_func_state *state, int size, \
> state->FIELD = new_##FIELD; \
> return 0; \
> }
> +/* realloc_reference_state() */
> +REALLOC_STATE_FN(reference, acquired_refs, refs, 1)
> /* realloc_stack_state() */
> REALLOC_STATE_FN(stack, allocated_stack, stack, BPF_REG_SIZE)
> #undef REALLOC_STATE_FN
> @@ -451,16 +510,89 @@ REALLOC_STATE_FN(stack, allocated_stack, stack, BPF_REG_SIZE)
> * which realloc_stack_state() copies over. It points to previous
> * bpf_verifier_state which is never reallocated.
> */
> -static int realloc_func_state(struct bpf_func_state *state, int size,
> - bool copy_old)
> +static int realloc_func_state(struct bpf_func_state *state, int stack_size,
> + int refs_size, bool copy_old)
> {
> - return realloc_stack_state(state, size, copy_old);
> + int err = realloc_reference_state(state, refs_size, copy_old);
> + if (err)
> + return err;
> + return realloc_stack_state(state, stack_size, copy_old);
> +}
> +
> +/* Acquire a pointer id from the env and update the state->refs to include
> + * this new pointer reference.
> + * On success, returns a valid pointer id to associate with the register
> + * On failure, returns a negative errno.
> + */
> +static int acquire_reference_state(struct bpf_verifier_env *env, int insn_idx)
> +{
> + struct bpf_func_state *state = cur_func(env);
> + int new_ofs = state->acquired_refs;
> + int id, err;
> +
> + err = realloc_reference_state(state, state->acquired_refs + 1, true);
> + if (err)
> + return err;
> + id = ++env->id_gen;
> + state->refs[new_ofs].id = id;
> + state->refs[new_ofs].insn_idx = insn_idx;
> +
> + return id;
> +}
> +
> +/* release function corresponding to acquire_reference_state(). Idempotent. */
> +static int __release_reference_state(struct bpf_func_state *state, int ptr_id)
> +{
> + int i, last_idx;
> +
> + if (!ptr_id)
> + return 0;
Is this defensive programming or this condition can actually happen?
As far as I can see all callers suppose to pass valid ptr_id into it.
Acked-by: Alexei Starovoitov <ast@kernel.org>
^ permalink raw reply
* Re: [PATCH net-next] virtio_net: ethtool tx napi configuration
From: Willem de Bruijn @ 2018-09-12 23:27 UTC (permalink / raw)
To: f.fainelli
Cc: Network Development, David Miller, caleb.raitto, Jason Wang,
Michael S. Tsirkin, Jon Olson (Google Drive), Willem de Bruijn
In-Reply-To: <CAF=yD-+Aqe=9GbKGo_n748D97W2rJHdsYL+cay1gyR4eA2Hc=w@mail.gmail.com>
On Wed, Sep 12, 2018 at 3:11 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Wed, Sep 12, 2018 at 2:16 PM Florian Fainelli <f.fainelli@gmail.com> wrote:
> >
> >
> >
> > On 9/12/2018 11:07 AM, Willem de Bruijn wrote:
> > > On Wed, Sep 12, 2018 at 1:42 PM Florian Fainelli <f.fainelli@gmail.com> wrote:
> > >>
> > >>
> > >>
> > >> On 9/9/2018 3:44 PM, Willem de Bruijn wrote:
> > >>> From: Willem de Bruijn <willemb@google.com>
> > >>>
> > >>> Implement ethtool .set_coalesce (-C) and .get_coalesce (-c) handlers.
> > >>> Interrupt moderation is currently not supported, so these accept and
> > >>> display the default settings of 0 usec and 1 frame.
> > >>>
> > >>> Toggle tx napi through a bit in tx-frames. So as to not interfere
> > >>> with possible future interrupt moderation, use bit 10, well outside
> > >>> the reasonable range of real interrupt moderation values.
> > >>>
> > >>> Changes are not atomic. The tx IRQ, napi BH and transmit path must
> > >>> be quiesced when switching modes. Only allow changing this setting
> > >>> when the device is down.
> > >>
> > >> Humm, would not a private ethtool flag to switch TX NAPI on/off be more
> > >> appropriate rather than use the coalescing configuration API here?
> > >
> > > What do you mean by private ethtool flag? A new field in ethtool
> > > --features (-k)?
> >
> > I meant using ethtool_drvinfo::n_priv_flags, ETH_SS_PRIV_FLAGS and then
> > ETHTOOL_GFPFLAGS and ETHTOOL_SPFLAGS to control the toggling of that
> > private flag. mlx5 has a number of privates flags for instance.
>
> Interesting, thanks! I was not at all aware of those ethtool flags.
> Am having a look. It definitely looks promising.
Okay, I made that change. That is indeed much cleaner, thanks.
Let me send the patch, initially as RFC.
I've observed one issue where if we toggle the flag before bringing
up the device, it hits a kernel BUG at include/linux/netdevice.h:515
BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
^ permalink raw reply
* [PATCH net-next RFC] virtio_net: ethtool tx napi configuration
From: Willem de Bruijn @ 2018-09-12 23:29 UTC (permalink / raw)
To: netdev; +Cc: jasowang, mst, f.fainelli, Willem de Bruijn
From: Willem de Bruijn <willemb@google.com>
Implement ethtool .set_priv_flags and .get_priv_flags handlers
and use ethtool private flags to toggle transmit napi:
ethtool --set-priv-flags eth0 tx-napi on
ethtool --show-priv-flags eth0
Link: https://patchwork.ozlabs.org/patch/948149/
Suggested-by: Jason Wang <jasowang@redhat.com>
Suggested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
---
drivers/net/virtio_net.c | 49 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 49 insertions(+)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 765920905226..9ca7e0a0f0d9 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -73,6 +73,10 @@ static const unsigned long guest_offloads[] = {
VIRTIO_NET_F_GUEST_UFO
};
+static const char virtnet_ethtool_priv_flags[][ETH_GSTRING_LEN] = {
+ "tx-napi",
+};
+
struct virtnet_stat_desc {
char desc[ETH_GSTRING_LEN];
size_t offset;
@@ -2059,6 +2063,9 @@ static void virtnet_get_strings(struct net_device *dev, u32 stringset, u8 *data)
}
}
break;
+ case ETH_SS_PRIV_FLAGS:
+ memcpy(data, virtnet_ethtool_priv_flags,
+ sizeof(virtnet_ethtool_priv_flags));
}
}
@@ -2070,6 +2077,9 @@ static int virtnet_get_sset_count(struct net_device *dev, int sset)
case ETH_SS_STATS:
return vi->curr_queue_pairs * (VIRTNET_RQ_STATS_LEN +
VIRTNET_SQ_STATS_LEN);
+ case ETH_SS_PRIV_FLAGS:
+ return ARRAY_SIZE(virtnet_ethtool_priv_flags);
+
default:
return -EOPNOTSUPP;
}
@@ -2181,6 +2191,43 @@ static int virtnet_get_link_ksettings(struct net_device *dev,
return 0;
}
+static int virtnet_set_priv_flags(struct net_device *dev, u32 priv_flags)
+{
+ struct virtnet_info *vi = netdev_priv(dev);
+ int i, napi_weight;
+
+ napi_weight = priv_flags & 0x1 ? NAPI_POLL_WEIGHT : 0;
+
+ if (napi_weight ^ vi->sq[0].napi.weight) {
+ for (i = 0; i < vi->max_queue_pairs; i++) {
+ struct netdev_queue *txq =
+ netdev_get_tx_queue(vi->dev, i);
+
+ virtnet_napi_tx_disable(&vi->sq[i].napi);
+ __netif_tx_lock_bh(txq);
+ vi->sq[i].napi.weight = napi_weight;
+ if (!napi_weight)
+ virtqueue_enable_cb(vi->sq[i].vq);
+ __netif_tx_unlock_bh(txq);
+ virtnet_napi_tx_enable(vi, vi->sq[i].vq,
+ &vi->sq[i].napi);
+ }
+ }
+
+ return 0;
+}
+
+static u32 virtnet_get_priv_flags(struct net_device *dev)
+{
+ struct virtnet_info *vi = netdev_priv(dev);
+ int priv_flags = 0;
+
+ if (vi->sq[0].napi.weight)
+ priv_flags |= 0x1;
+
+ return priv_flags;
+}
+
static void virtnet_init_settings(struct net_device *dev)
{
struct virtnet_info *vi = netdev_priv(dev);
@@ -2219,6 +2266,8 @@ static const struct ethtool_ops virtnet_ethtool_ops = {
.get_ts_info = ethtool_op_get_ts_info,
.get_link_ksettings = virtnet_get_link_ksettings,
.set_link_ksettings = virtnet_set_link_ksettings,
+ .set_priv_flags = virtnet_set_priv_flags,
+ .get_priv_flags = virtnet_get_priv_flags,
};
static void virtnet_freeze_down(struct virtio_device *vdev)
--
2.19.0.rc2.392.g5ba43deb5a-goog
^ permalink raw reply related
* [PATCH iproute2] iproute2: fix use-after-free
From: Mahesh Bandewar @ 2018-09-12 23:29 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: netdev, Mahesh Bandewar
From: Mahesh Bandewar <maheshb@google.com>
A local program using iproute2 lib pointed out the issue and looking
at the code it is pretty obvious -
a = (struct nlmsghdr *)b;
...
free(b);
if (a->nlmsg_seq == seq)
...
Fixes: 86bf43c7c2fd ("lib/libnetlink: update rtnl_talk to support malloc buff at run time")
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
---
lib/libnetlink.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/lib/libnetlink.c b/lib/libnetlink.c
index 928de1dd16d8..016a5f0bcfb6 100644
--- a/lib/libnetlink.c
+++ b/lib/libnetlink.c
@@ -661,17 +661,24 @@ next:
if (l < sizeof(struct nlmsgerr)) {
fprintf(stderr, "ERROR truncated\n");
} else if (!err->error) {
+ unsigned int tmp_seq;
+
/* check messages from kernel */
nl_dump_ext_ack(h, errfn);
- if (answer)
+ tmp_seq = h->nlmsg_seq;
+ if (answer) {
*answer = (struct nlmsghdr *)buf;
- else
+ } else {
free(buf);
- if (h->nlmsg_seq == seq)
+ buf = NULL;
+ }
+ if (tmp_seq == seq) {
return 0;
- else if (i < iovlen)
+ } else if (i < iovlen) {
+ free(buf);
goto next;
+ }
return 0;
}
--
2.19.0.rc2.392.g5ba43deb5a-goog
^ permalink raw reply related
* Re: [PATCH net-next v3 01/17] asm: simd context helper API
From: Kevin Easton @ 2018-09-13 5:03 UTC (permalink / raw)
To: Jason A. Donenfeld
Cc: LKML, Netdev, David Miller, Greg Kroah-Hartman, Andrew Lutomirski,
Thomas Gleixner, Samuel Neves, linux-arch
In-Reply-To: <CAHmME9pmgu_xWJ0rvdS4RhzwjKPiNdkQ9nc6mqTNeb6A8xT8Eg@mail.gmail.com>
On Wed, Sep 12, 2018 at 08:10:41PM +0200, Jason A. Donenfeld wrote:
> On Wed, Sep 12, 2018 at 8:14 AM Kevin Easton <kevin@guarana.org> wrote:
> > Given that it's always supposed to be used like that, mightn't it be
> > better if simd_relax() took a pointer to the context, so the call is
> > just
> >
> > simd_relax(&simd_context);
> >
> > ?
> >
> > The inlining means that there won't actually be a pointer dereference in
> > the emitted code.
> >
> > If simd_put() also took a pointer then it could set the context back to
> > HAVE_NO_SIMD as well?
>
> That's sort of a neat idea. I guess in this scheme, you'd envision:
>
> simd_context_t simd_context;
>
> simd_get(&simd_context);
> simd_relax(&simd_context);
> simd_put(&simd_context);
>
> And this way, if simd_context ever becomes a heavier struct, it can be
> modified in place rather than returned by value from the function. On
> the other hand, it's a little bit more annoying to type and makes it
> harder to do declaration and initialization on the same line.
Yes. It's also how most get/put APIs already work in the kernel, eg
kref_get/put (mostly because they tend to be 'getting/putting' an
already-initialized object, though).
- Kevin
^ permalink raw reply
* Re: [PATCH bpf-next 07/11] bpf: Add helper to retrieve socket in BPF
From: Alexei Starovoitov @ 2018-09-13 0:06 UTC (permalink / raw)
To: Joe Stringer
Cc: daniel, netdev, ast, john.fastabend, tgraf, kafai, nitin.hande,
mauricio.vasquez
In-Reply-To: <20180912003640.28316-8-joe@wand.net.nz>
On Tue, Sep 11, 2018 at 05:36:36PM -0700, Joe Stringer wrote:
> This patch adds new BPF helper functions, bpf_sk_lookup_tcp() and
> bpf_sk_lookup_udp() which allows BPF programs to find out if there is a
> socket listening on this host, and returns a socket pointer which the
> BPF program can then access to determine, for instance, whether to
> forward or drop traffic. bpf_sk_lookup_xxx() may take a reference on the
> socket, so when a BPF program makes use of this function, it must
> subsequently pass the returned pointer into the newly added sk_release()
> to return the reference.
>
> By way of example, the following pseudocode would filter inbound
> connections at XDP if there is no corresponding service listening for
> the traffic:
>
> struct bpf_sock_tuple tuple;
> struct bpf_sock_ops *sk;
>
> populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
> sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
...
> +struct bpf_sock_tuple {
> + union {
> + __be32 ipv6[4];
> + __be32 ipv4;
> + } saddr;
> + union {
> + __be32 ipv6[4];
> + __be32 ipv4;
> + } daddr;
> + __be16 sport;
> + __be16 dport;
> + __u8 family;
> +};
since we can pass ptr_to_packet into map lookup and other helpers now,
can you move 'family' out of bpf_sock_tuple and combine with netns_id arg?
then progs wouldn't need to copy bytes from the packet into tuple
to do a lookup.
The rest looks great to me.
^ permalink raw reply
* Re: [PATCH bpf-next 08/11] selftests/bpf: Add tests for reference tracking
From: Alexei Starovoitov @ 2018-09-13 0:09 UTC (permalink / raw)
To: Joe Stringer
Cc: daniel, netdev, ast, john.fastabend, tgraf, kafai, nitin.hande,
mauricio.vasquez
In-Reply-To: <20180912003640.28316-9-joe@wand.net.nz>
On Tue, Sep 11, 2018 at 05:36:37PM -0700, Joe Stringer wrote:
> reference tracking: leak potential reference
> reference tracking: leak potential reference on stack
> reference tracking: leak potential reference on stack 2
> reference tracking: zero potential reference
> reference tracking: copy and zero potential references
> reference tracking: release reference without check
> reference tracking: release reference
> reference tracking: release reference twice
> reference tracking: release reference twice inside branch
> reference tracking: alloc, check, free in one subbranch
> reference tracking: alloc, check, free in both subbranches
> reference tracking in call: free reference in subprog
> reference tracking in call: free reference in subprog and outside
> reference tracking in call: alloc & leak reference in subprog
> reference tracking in call: alloc in subprog, release outside
> reference tracking in call: sk_ptr leak into caller stack
> reference tracking in call: sk_ptr spill into caller stack
>
> Signed-off-by: Joe Stringer <joe@wand.net.nz>
...
> + "reference tracking in call: alloc in subprog, release outside",
> + .insns = {
> + BPF_MOV64_REG(BPF_REG_4, BPF_REG_10),
> + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 1, 0, 5),
> + BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
> + BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
> + BPF_MOV64_IMM(BPF_REG_2, 0),
> + BPF_EMIT_CALL(BPF_FUNC_sk_release),
> + BPF_EXIT_INSN(),
> +
> + /* subprog 1 */
> + BPF_SK_LOOKUP,
> + BPF_EXIT_INSN(), /* return sk */
> + },
Thanks a lot for adding subprog tests and checking that return to the caller
and spill works too.
Awesome stuff!
Acked-by: Alexei Starovoitov <ast@kernel.org>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox