* Re: [PATCH net] Revert "net: phy: Correctly process PHY_HALTED in phy_stop_machine()"
From: Florian Fainelli @ 2017-08-31 19:18 UTC (permalink / raw)
To: Mason
Cc: Marc Gonzalez, David Daney, netdev, Geert Uytterhoeven,
David Miller, Andrew Lunn, Mans Rullgard
In-Reply-To: <d54d0555-c448-741c-eb63-5a773bed5a30@free.fr>
On 08/31/2017 12:09 PM, Mason wrote:
> On 31/08/2017 19:03, Florian Fainelli wrote:
>
>> On 08/31/2017 05:29 AM, Marc Gonzalez wrote:
>>
>>> On 31/08/2017 02:49, Florian Fainelli wrote:
>>>
>>>> The original motivation for this change originated from Marc Gonzalez
>>>> indicating that his network driver did not have its adjust_link callback
>>>> executing with phydev->link = 0 while he was expecting it.
>>>
>>> I expect the core to call phy_adjust_link() for link changes.
>>> This used to work back in 3.4 and was broken somewhere along
>>> the way.
>>
>> If that was working correctly in 3.4 surely we can look at the diff and
>> figure out what changed, even maybe find the offending commit, can you
>> do that?
>
> Bisecting would a be a huge pain because my platform was
> not upstream until v4.4
Then just diff the file and try to pinpoint which commit may have
changed that?
>
> You mentioned the guarantees made by PHYLIB.
> When is the adjust_link callback guaranteed to be called?
As long as the state machine is running after a call to phy_start()
adjust_link will be called if there a change in link and/or link
settings. Once you call phy_stop() no such guarantees are made.
>
>>>> PHYLIB has never made any such guarantees ever because phy_stop() merely
>>>> just tells the workqueue to move into PHY_HALTED state which will happen
>>>> asynchronously.
>>>
>>> My original proposal was to fix the issue in the driver.
>>> I'll try locating it in my archives.
>>
>> Yes I remember you telling that, by the way I don't think you ever
>> provided a clear explanation why this is absolutely necessary for your
>> driver though?
>
> 1) nb8800_link_reconfigure() calls phy_print_status()
> which prints the "Link down" and "Link up" messages
> to the console. With the patch reverted, nothing is
> printed when the link goes down, and the result is
> random when the link comes up. Sometimes, we get
> down + up, sometimes just up.
Nothing printed when you bring down the network interface as a result of
not signaling the link down, there is a small nuance here.
Seeing a random message upon bringing the interface back up suggests you
may not be re-initialization your old vs. new link state book keeping
variables and state transitions are not properly detected and therefore
not printed. In fact, I don't see where priv->link is ever set to say -1
to force the comparison between phydev->link != priv->link to be true,
oversight?
>
> 2) nb8800_link_reconfigure() does some HW init when
> the link state changes. If we miss some notifications,
> we might not perform some HW init, and stuff breaks.
Care to be more specific? What specific HW init is required during link
notification that if not done breaks the HW? There is both
nb8800_mac_config() and nb8800_pause_config() that are both called in
the adjust_link callback both could presumably be deferred until the
link is detected, so why do you need it during ndo_stop() absolutely?
--
Florian
^ permalink raw reply
* Re: net/ipv4: divide error in __tcp_select_window
From: David Miller @ 2017-08-31 19:25 UTC (permalink / raw)
To: ncardwell; +Cc: idaifish, kuznet, netdev, syzkaller, weiwan, edumazet
In-Reply-To: <CADVnQynuyK5H1SDOnzDZPOnVn701JhrSQRCORVJ_6c=z_EooBw@mail.gmail.com>
From: Neal Cardwell <ncardwell@google.com>
Date: Thu, 31 Aug 2017 07:11:28 -0400
> Thanks for the report. I believe this tcp_recvmsg => tcp_cleanup_rbuf
> => __tcp_select_window divide-by-zero issue was fixed in May by Wei,
> in:
>
> 499350a5a6e7 tcp: initialize rcv_mss to TCP_MIN_MSS instead of 0
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=499350a5a6e7
>
> Looks like we should probably mark this as a -stable candidate, so
> that it will eventually make it to 4.4.y, 4.9.y, 4.12.y users, etc. (I
> don't see the commit in those stable branches.)
Ok, queued up for -stable, thanks.
^ permalink raw reply
* Re: [PATCH] wl1251: add a missing spin_lock_init()
From: David Miller @ 2017-08-31 19:26 UTC (permalink / raw)
To: pavel
Cc: kvalo, torvalds, xiyou.wangcong, akpm, netdev, linux-kernel,
linux-wireless
In-Reply-To: <20170831185718.GA31245@amd>
From: Pavel Machek <pavel@ucw.cz>
Date: Thu, 31 Aug 2017 20:57:19 +0200
> Hi!
>
>> From: Pavel Machek <pavel@ucw.cz>
>> Date: Thu, 31 Aug 2017 16:47:43 +0200
>>
>> > Dave, Linus -- can you still take the patch?
>>
>> Pavel, please do not bypass maintainers like this.
>>
>> It's really rude, and if you do things like that instead of
>> trying to work properly with us, your relationship with
>> these maintainers will suffer in the long term.
>
> Do you mean I'm being rude to Kalle, or rude to you?
He said "to David", not "David and Linus".
^ permalink raw reply
* Re: tip -ENOBOOT - bisected to locking/refcounts, x86/asm: Implement fast refcount overflow protection
From: Kees Cook @ 2017-08-31 19:28 UTC (permalink / raw)
To: Mike Galbraith
Cc: David S. Miller, Peter Zijlstra, LKML, Ingo Molnar,
Reshetova, Elena, Network Development
In-Reply-To: <1504187918.27500.16.camel@gmx.de>
On Thu, Aug 31, 2017 at 6:58 AM, Mike Galbraith <efault@gmx.de> wrote:
> gdb) list *in6_dev_get+0x10
> 0xffffffff8166d3d0 is in in6_dev_get (./include/net/addrconf.h:318).
> 313 {
> 314 struct inet6_dev *idev;
> 315
> 316 rcu_read_lock();
> 317 idev = rcu_dereference(dev->ip6_ptr);
> 318 if (idev)
> 319 refcount_inc(&idev->refcnt);
> 320 rcu_read_unlock();
> 321 return idev;
> 322
And this is a completely different refcount from the other that
tripped. This one is quite simple, too, though I see it uses
refcount_dec(), which is a path to saturation. I've sent a patch to
try to clarify this further...
-Kees
--
Kees Cook
Pixel Security
^ permalink raw reply
* Re: net/ipv4: divide error in __tcp_select_window
From: Neal Cardwell @ 2017-08-31 19:37 UTC (permalink / raw)
To: David Miller
Cc: idaifish, Alexey Kuznetsov, Netdev, syzkaller, Wei Wang,
Eric Dumazet
In-Reply-To: <20170831.122543.754410467437404852.davem@davemloft.net>
On Thu, Aug 31, 2017 at 3:25 PM, David Miller <davem@davemloft.net> wrote:
>
> From: Neal Cardwell <ncardwell@google.com>
> Date: Thu, 31 Aug 2017 07:11:28 -0400
>
> > Thanks for the report. I believe this tcp_recvmsg => tcp_cleanup_rbuf
> > => __tcp_select_window divide-by-zero issue was fixed in May by Wei,
> > in:
> >
> > 499350a5a6e7 tcp: initialize rcv_mss to TCP_MIN_MSS instead of 0
> > https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=499350a5a6e7
> >
> > Looks like we should probably mark this as a -stable candidate, so
> > that it will eventually make it to 4.4.y, 4.9.y, 4.12.y users, etc. (I
> > don't see the commit in those stable branches.)
>
> Ok, queued up for -stable, thanks.
Great. Thanks!
neal
^ permalink raw reply
* Re: [PATCH] wl1251: add a missing spin_lock_init()
From: Pavel Machek @ 2017-08-31 19:59 UTC (permalink / raw)
To: David Miller
Cc: kvalo, torvalds, xiyou.wangcong, akpm, netdev, linux-kernel,
linux-wireless
In-Reply-To: <20170831.122645.872320547726000124.davem@davemloft.net>
[-- Attachment #1: Type: text/plain, Size: 1062 bytes --]
Hi!
On Thu 2017-08-31 12:26:45, David Miller wrote:
> From: Pavel Machek <pavel@ucw.cz>
> Date: Thu, 31 Aug 2017 20:57:19 +0200
>
> > Hi!
> >
> >> From: Pavel Machek <pavel@ucw.cz>
> >> Date: Thu, 31 Aug 2017 16:47:43 +0200
> >>
> >> > Dave, Linus -- can you still take the patch?
> >>
> >> Pavel, please do not bypass maintainers like this.
> >>
> >> It's really rude, and if you do things like that instead of
> >> trying to work properly with us, your relationship with
> >> these maintainers will suffer in the long term.
> >
> > Do you mean I'm being rude to Kalle, or rude to you?
>
> He said "to David", not "David and Linus".
Ok. If I knew you would be replying so quickly, I'd acted
differently. I did not want to be rude.
But I'd still like to get the patch in. Do you plan to send another
pull request to Linus, and can you take the patch, please?
Thanks,
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]
^ permalink raw reply
* Re: [PATCH] wl1251: add a missing spin_lock_init()
From: David Miller @ 2017-08-31 20:10 UTC (permalink / raw)
To: pavel
Cc: kvalo, torvalds, xiyou.wangcong, akpm, netdev, linux-kernel,
linux-wireless
In-Reply-To: <20170831195933.GA2788@amd>
From: Pavel Machek <pavel@ucw.cz>
Date: Thu, 31 Aug 2017 21:59:33 +0200
> Do you plan to send another pull request to Linus, and can you take
> the patch, please?
Yes and yes.
^ permalink raw reply
* Re: [PATCH v4 4/5] net: stmmac: dwmac-sun8i: choose internal PHY via phy-is-integrated
From: Rob Herring @ 2017-08-31 20:18 UTC (permalink / raw)
To: Andrew Lunn
Cc: Corentin Labbe, mark.rutland, maxime.ripard, wens, linux,
peppe.cavallaro, alexandre.torgue, f.fainelli, icenowy, netdev,
devicetree, linux-arm-kernel, linux-kernel
In-Reply-To: <20170826212051.GA10418@lunn.ch>
On Sat, Aug 26, 2017 at 11:20:51PM +0200, Andrew Lunn wrote:
> Hi Corentin
>
> I think we have now all agreed this is an mdio-mux, plus it is also an
> MII mux. We should represent that in device tree. This patchset does
> this. However, as it is now, the mux structure in DT is ignored. All
> it does is search for the phy-is-integrated flags and goes on that.
>
> I made the comment that the device tree representation cannot be
> implemented using an MDIO mux driver, because of driver loading
> issues. However, the core of the MDIO mux code is just a library,
> symbols exported as GPL, free for anything to use.
>
> What i think should happen is the mdio-mux is implemented inside the
> MAC driver, using the mux-core as a library. The device tree structure
> of a mix is then reflected within Linux. The mux switch callback is
> implemented within the MAC driver. So it can reset the MAC when the
> mux is switched. The 'phy-is-integrated' property is then no longer
> needed.
>
> I would suggest a binding something like:
This is looks better to me, but...
> emac: ethernet@1c0b000 {
> compatible = "allwinner,sun8i-h3-emac";
> syscon = <&syscon>;
> reg = <0x01c0b000 0x104>;
> interrupts = <GIC_SPI 82 IRQ_TYPE_LEVEL_HIGH>;
> interrupt-names = "macirq";
> resets = <&ccu RST_BUS_EMAC>;
> reset-names = "stmmaceth";
> clocks = <&ccu CLK_BUS_EMAC>;
> clock-names = "stmmaceth";
> #address-cells = <1>;
> #size-cells = <0>;
>
> phy-handle = <&int_mii_phy>;
> phy-mode = "mii";
> allwinner,leds-active-low;
>
> mdio: mdio {
> #address-cells = <1>;
> #size-cells = <0>;
> }
Why do you need this node still?
>
> mdio-mux {
> #address-cells = <1>;
> #size-cells = <0>;
>
> mdio@0 {
> reg = <0>;
> #address-cells = <1>;
> #size-cells = <0>;
>
> int_mii_phy: ethernet-phy@1 {
> reg = <1>;
> clocks = <&ccu CLK_BUS_EPHY>;
> resets = <&ccu RST_BUS_EPHY>;
> };
> };
> ext_mdio: mdio@0 {
> #address-cells = <1>;
> #size-cells = <0>;
>
> ext_rgmii_phy: ethernet-phy@1 {
> reg = <1>;
> };
> };
> };
> };
>
> Andrew
^ permalink raw reply
* [PATCH net-next 0/2] net: ubuf_info.refcnt conversion
From: Eric Dumazet @ 2017-08-31 20:30 UTC (permalink / raw)
To: David S . Miller; +Cc: netdev, Willem de Bruijn, Eric Dumazet, Eric Dumazet
Yet another atomic_t -> refcount_t conversion, split in two patches.
First patch prepares the automatic conversion done in the second patch.
Eric Dumazet (2):
net: prepare (struct ubuf_info)->refcnt conversion
net: convert (struct ubuf_info)->refcnt to refcount_t
include/linux/skbuff.h | 5 +++--
net/core/skbuff.c | 9 +++++----
net/ipv4/tcp.c | 2 --
3 files changed, 8 insertions(+), 8 deletions(-)
--
2.14.1.581.gf28d330327-goog
^ permalink raw reply
* [PATCH net-next 1/2] net: prepare (struct ubuf_info)->refcnt conversion
From: Eric Dumazet @ 2017-08-31 20:30 UTC (permalink / raw)
To: David S . Miller; +Cc: netdev, Willem de Bruijn, Eric Dumazet, Eric Dumazet
In-Reply-To: <20170831203013.9219-1-edumazet@google.com>
In order to convert this atomic_t refcnt to refcount_t,
we need to init the refcount to one to not trigger
a 0 -> 1 transition.
This also removes one atomic operation in fast path.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
net/core/skbuff.c | 3 ++-
net/ipv4/tcp.c | 2 --
2 files changed, 2 insertions(+), 3 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 917da73d3ab3b82163cf0a9ee944da09cb5a391f..65b9ca3945f8fd2b1bef4aef5dd774be04e5d128 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -963,7 +963,7 @@ struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size)
uarg->len = 1;
uarg->bytelen = size;
uarg->zerocopy = 1;
- atomic_set(&uarg->refcnt, 0);
+ atomic_set(&uarg->refcnt, 1);
sock_hold(sk);
return uarg;
@@ -1005,6 +1005,7 @@ struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size,
uarg->len++;
uarg->bytelen = bytelen;
atomic_set(&sk->sk_zckey, ++next);
+ sock_zerocopy_get(uarg);
return uarg;
}
}
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 21ca2df274c5130a13d31a391a1408d779af34af..fff4d7bc3b8b46174e7bd0a04d7c212307e55cb5 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1190,8 +1190,6 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
goto out_err;
}
- /* skb may be freed in main loop, keep extra ref on uarg */
- sock_zerocopy_get(uarg);
if (!(sk_check_csum_caps(sk) && sk->sk_route_caps & NETIF_F_SG))
uarg->zerocopy = 0;
}
--
2.14.1.581.gf28d330327-goog
^ permalink raw reply related
* [PATCH net-next 2/2] net: convert (struct ubuf_info)->refcnt to refcount_t
From: Eric Dumazet @ 2017-08-31 20:30 UTC (permalink / raw)
To: David S . Miller; +Cc: netdev, Willem de Bruijn, Eric Dumazet, Eric Dumazet
In-Reply-To: <20170831203013.9219-1-edumazet@google.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/linux/skbuff.h | 5 +++--
net/core/skbuff.c | 8 ++++----
2 files changed, 7 insertions(+), 6 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 7594e19bce622a38dc39c054093c3da15b99b67b..316a92b45351f53709886ee0099cbc83b66f1b15 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -22,6 +22,7 @@
#include <linux/cache.h>
#include <linux/rbtree.h>
#include <linux/socket.h>
+#include <linux/refcount.h>
#include <linux/atomic.h>
#include <asm/types.h>
@@ -456,7 +457,7 @@ struct ubuf_info {
u32 bytelen;
};
};
- atomic_t refcnt;
+ refcount_t refcnt;
struct mmpin {
struct user_struct *user;
@@ -472,7 +473,7 @@ struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size,
static inline void sock_zerocopy_get(struct ubuf_info *uarg)
{
- atomic_inc(&uarg->refcnt);
+ refcount_inc(&uarg->refcnt);
}
void sock_zerocopy_put(struct ubuf_info *uarg);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 65b9ca3945f8fd2b1bef4aef5dd774be04e5d128..ed86ca9afd9d8d1ac47983acf6006c179285a612 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -963,7 +963,7 @@ struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size)
uarg->len = 1;
uarg->bytelen = size;
uarg->zerocopy = 1;
- atomic_set(&uarg->refcnt, 1);
+ refcount_set(&uarg->refcnt, 1);
sock_hold(sk);
return uarg;
@@ -1086,7 +1086,7 @@ EXPORT_SYMBOL_GPL(sock_zerocopy_callback);
void sock_zerocopy_put(struct ubuf_info *uarg)
{
- if (uarg && atomic_dec_and_test(&uarg->refcnt)) {
+ if (uarg && refcount_dec_and_test(&uarg->refcnt)) {
if (uarg->callback)
uarg->callback(uarg, uarg->zerocopy);
else
@@ -1108,7 +1108,7 @@ void sock_zerocopy_put_abort(struct ubuf_info *uarg)
* avoid an skb send inside the main loop triggering uarg free.
*/
if (sk->sk_type != SOCK_STREAM)
- atomic_inc(&uarg->refcnt);
+ refcount_inc(&uarg->refcnt);
sock_zerocopy_put(uarg);
}
@@ -1490,7 +1490,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
if (skb_orphan_frags(skb, gfp_mask))
goto nofrags;
if (skb_zcopy(skb))
- atomic_inc(&skb_uarg(skb)->refcnt);
+ refcount_inc(&skb_uarg(skb)->refcnt);
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
skb_frag_ref(skb, i);
--
2.14.1.581.gf28d330327-goog
^ permalink raw reply related
* Re: [PATCH][net-next] net: qualcomm: rmnet: remove unused variable priv
From: Subash Abhinov Kasiviswanathan @ 2017-08-31 20:34 UTC (permalink / raw)
To: Colin King; +Cc: David S . Miller, netdev, kernel-janitors, linux-kernel
In-Reply-To: <20170831140727.28170-1-colin.king@canonical.com>
On 2017-08-31 08:07, Colin King wrote:
> From: Colin Ian King <colin.king@canonical.com>
>
> priv is being assigned but is never used, so remove it.
>
> Cleans up clang build warning:
> "warning: Value stored to 'priv' is never read"
>
> Fixes: ceed73a2cf4a ("drivers: net: ethernet: qualcomm: rmnet: Initial
> implementation")
> Signed-off-by: Colin Ian King <colin.king@canonical.com>
> ---
> drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c | 3 ---
> 1 file changed, 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
> b/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
> index c8b573d28dcf..bf7455fdafcc 100644
> --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
> +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
> @@ -73,9 +73,6 @@ static const struct net_device_ops rmnet_vnd_ops = {
> */
> void rmnet_vnd_setup(struct net_device *rmnet_dev)
> {
> - struct rmnet_priv *priv;
> -
> - priv = netdev_priv(rmnet_dev);
> netdev_dbg(rmnet_dev, "Setting up device %s\n", rmnet_dev->name);
>
> rmnet_dev->netdev_ops = &rmnet_vnd_ops;
Thanks for fixing this.
Acked-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
^ permalink raw reply
* Re: [PATCH v2 net-next 1/8] bpf: Add support for recursively running cgroup sock filters
From: David Ahern @ 2017-08-31 20:53 UTC (permalink / raw)
To: Tejun Heo, Alexei Starovoitov; +Cc: netdev, daniel, ast, davem
In-Reply-To: <20170831142201.GB1599492@devbig577.frc2.facebook.com>
On 8/31/17 8:22 AM, Tejun Heo wrote:
> On Sun, Aug 27, 2017 at 08:49:23AM -0600, David Ahern wrote:
>> On 8/25/17 8:49 PM, Alexei Starovoitov wrote:
>>>
>>>> + if (prog && curr_recursive && !new_recursive)
>>>> + /* if a parent has recursive prog attached, only
>>>> + * allow recursive programs in descendent cgroup
>>>> + */
>>>> + return -EINVAL;
>>>> +
>>>> old_prog = cgrp->bpf.prog[type];
>>>
>>> ... I'm struggling to completely understand how it interacts
>>> with BPF_F_ALLOW_OVERRIDE.
>>
>> The 2 flags are completely independent. The existing override logic is
>> unchanged. If a program can not be overridden, then the new recursive
>> flag is irrelevant.
>
> I'm not sure all four combo of the two flags makes sense. Can't we
> have something simpler like the following?
>
> 1. None: No further bpf programs allowed in the subtree.
>
> 2. Overridable: If a sub-cgroup installs the same bpf program, this
> one yields to that one.
>
> 3. Recursive: If a sub-cgroup installs the same bpf program, that
> cgroup program gets run in addition to this one.
>
> Note that we can have combinations of overridables and recursives -
> both allow further programs in the sub-hierarchy and the only
> distinction is whether that specific program behaves when that
> happens.
>
I am going to send v3 for patches 2-6 and 8 - the uncontested patches.
Alexei and I will meet in L.A. the week of Sept 11-15 to discuss the
recursive implementation (Patch 1 and its testing, patch 7).
^ permalink raw reply
* [PATCH 0/3] Security: add lsm hooks for checking permissions on eBPF objects
From: Chenbo Feng @ 2017-08-31 20:56 UTC (permalink / raw)
To: linux-security-module
Cc: Jeffrey Vander Stoep, netdev, SELinux, Alexei Starovoitov,
lorenzo, Chenbo Feng
From: Chenbo Feng <fengc@google.com>
Much like files and sockets, eBPF objects are accessed, controlled, and
shared via a file descriptor (FD). Unlike files and sockets, the existing
mechanism for eBPF object access control is very limited. Currently there
are two options for granting accessing to eBPF operations: grant access to
all processes, or only CAP_SYS_ADMIN processes. The CAP_SYS_ADMIN-only
mode is not ideal because most users do not have this capability and
granting a user CAP_SYS_ADMIN grants too many other security-sensitive
permissions. It also unnecessarily allows all CAP_SYS_ADMIN processes
access to eBPF functionality. Allowing all processes to access to eBPF
objects is also undesirable since it has potential to allow unprivileged
processes to consume kernel memory, and opens up attack surface to the
kernel.
Adding LSM hooks maintains the status quo for systems which do not use an
LSM, preserving compatibility with userspace, while allowing security
modules to choose how best to handle permissions on eBPF objects. Here is
a possible use case for the lsm hooks with selinux module:
The network-control daemon (netd) creates and loads an eBPF object for
network packet filtering and analysis. It passes the object FD to an
unprivileged network monitor app (netmonitor), which is not allowed to
create, modify or load eBPF objects, but is allowed to read the traffic
stats from the object.
Selinux could use these hooks to grant the following permissions:
allow netd self:bpf { create modify read…};
allow netmonitor netd:bpf read;
In this patch series, 5 security hooks is added to the eBPF syscall
implementations to do permissions checks. The LSM hooks introduced to eBPF
maps and programs can be summarized as follows:
Bpf_map_create: check for the ability of creating eBPF maps.
Bpf_map_modify: check the ability of update and delete eBPF map
entries.
Bpf_map_read: check the ability of lookup map element as well as
get map keys.
Bpf_post_create: initialize the security struct inside struct
bpf_map
Bpf_prog_load: check the ability for loading the eBPF program.
In order to store the ownership and security information about eBPF maps,
a security field pointer is added to the struct bpf_map. And a simple
implementation of selinux check on these hooks is added in selinux
subsystem.
Chenbo Feng (3):
security: bpf: Add eBPF LSM hooks to security module
security: bpf: Add eBPF LSM hooks and security field to eBPF map
selinux: bpf: Implement the selinux checks for eBPF object
include/linux/bpf.h | 3 +++
include/linux/lsm_hooks.h | 41 ++++++++++++++++++++++++++++
include/linux/security.h | 36 +++++++++++++++++++++++++
kernel/bpf/syscall.c | 28 +++++++++++++++++++
security/security.c | 28 +++++++++++++++++++
security/selinux/hooks.c | 54 +++++++++++++++++++++++++++++++++++++
security/selinux/include/classmap.h | 2 ++
security/selinux/include/objsec.h | 4 +++
8 files changed, 196 insertions(+)
--
2.14.1.581.gf28d330327-goog
^ permalink raw reply
* [PATCH 1/3] security: bpf: Add eBPF LSM hooks to security module
From: Chenbo Feng @ 2017-08-31 20:56 UTC (permalink / raw)
To: linux-security-module
Cc: Jeffrey Vander Stoep, netdev, SELinux, Alexei Starovoitov,
lorenzo, Chenbo Feng
In-Reply-To: <20170831205635.80256-1-chenbofeng.kernel@gmail.com>
From: Chenbo Feng <fengc@google.com>
Introduce 5 LSM hooks to provide finer granularity controls on eBPF
related operations including create eBPF maps, modify and read eBPF maps
content and load eBPF programs to the kernel. Hooks use the new security
pointer inside the eBPF map struct to store the owner's security
information and the different security modules can perform different
checks based on the information stored inside the security field.
Signed-off-by: Chenbo Feng <fengc@google.com>
---
include/linux/lsm_hooks.h | 41 +++++++++++++++++++++++++++++++++++++++++
include/linux/security.h | 36 ++++++++++++++++++++++++++++++++++++
security/security.c | 28 ++++++++++++++++++++++++++++
3 files changed, 105 insertions(+)
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index ce02f76a6188..3aaf9a08a983 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1353,6 +1353,32 @@
* @inode we wish to get the security context of.
* @ctx is a pointer in which to place the allocated security context.
* @ctxlen points to the place to put the length of @ctx.
+ *
+ * Security hooks for using the eBPF maps and programs functionalities through
+ * eBPF syscalls.
+ *
+ * @bpf_map_create:
+ * Check permissions prior to creating a new bpf map.
+ * Return 0 if the permission is granted.
+ *
+ * @bpf_map_modify:
+ * Check permission prior to insert, update and delete map content.
+ * @map pointer to the struct bpf_map that contains map information.
+ * Return 0 if the permission is granted.
+ *
+ * @bpf_map_read:
+ * Check permission prior to read a bpf map content.
+ * @map pointer to the struct bpf_map that contains map information.
+ * Return 0 if the permission is granted.
+ *
+ * @bpf_prog_load:
+ * Check permission prior to load eBPF program.
+ * Return 0 if the permission is granted.
+ *
+ * @bpf_post_create:
+ * Initialize the bpf object security field inside struct bpf_maps and
+ * it is used for future security checks.
+ *
*/
union security_list_options {
int (*binder_set_context_mgr)(struct task_struct *mgr);
@@ -1685,6 +1711,14 @@ union security_list_options {
struct audit_context *actx);
void (*audit_rule_free)(void *lsmrule);
#endif /* CONFIG_AUDIT */
+
+#ifdef CONFIG_BPF_SYSCALL
+ int (*bpf_map_create)(void);
+ int (*bpf_map_read)(struct bpf_map *map);
+ int (*bpf_map_modify)(struct bpf_map *map);
+ int (*bpf_prog_load)(void);
+ int (*bpf_post_create)(struct bpf_map *map);
+#endif /* CONFIG_BPF_SYSCALL */
};
struct security_hook_heads {
@@ -1905,6 +1939,13 @@ struct security_hook_heads {
struct list_head audit_rule_match;
struct list_head audit_rule_free;
#endif /* CONFIG_AUDIT */
+#ifdef CONFIG_BPF_SYSCALL
+ struct list_head bpf_map_create;
+ struct list_head bpf_map_read;
+ struct list_head bpf_map_modify;
+ struct list_head bpf_prog_load;
+ struct list_head bpf_post_create;
+#endif /* CONFIG_BPF_SYSCALL */
} __randomize_layout;
/*
diff --git a/include/linux/security.h b/include/linux/security.h
index 458e24bea2d4..0656a4f74d14 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -31,6 +31,7 @@
#include <linux/string.h>
#include <linux/mm.h>
#include <linux/fs.h>
+#include <linux/bpf.h>
struct linux_binprm;
struct cred;
@@ -1735,6 +1736,41 @@ static inline void securityfs_remove(struct dentry *dentry)
#endif
+#ifdef CONFIG_BPF_SYSCALL
+#ifdef CONFIG_SECURITY
+int security_map_create(void);
+int security_map_modify(struct bpf_map *map);
+int security_map_read(struct bpf_map *map);
+int security_prog_load(void);
+int security_post_create(struct bpf_map *map);
+#else
+static inline int security_map_create(void)
+{
+ return 0;
+}
+
+static inline int security_map_read(struct bpf_map *map)
+{
+ return 0;
+}
+
+static inline int security_map_modify(struct bpf_map *map)
+{
+ return 0;
+}
+
+static inline int security_prog_load(void)
+{
+ return 0;
+}
+
+static inline int security_post_create(struct bpf_map *map)
+{
+ return 0;
+}
+#endif /* CONFIG_SECURITY */
+#endif /* CONFIG_BPF_SYSCALL */
+
#ifdef CONFIG_SECURITY
static inline char *alloc_secdata(void)
diff --git a/security/security.c b/security/security.c
index 55b5997e4b72..02272f93a89e 100644
--- a/security/security.c
+++ b/security/security.c
@@ -12,6 +12,7 @@
* (at your option) any later version.
*/
+#include <linux/bpf.h>
#include <linux/capability.h>
#include <linux/dcache.h>
#include <linux/module.h>
@@ -1708,3 +1709,30 @@ int security_audit_rule_match(u32 secid, u32 field, u32 op, void *lsmrule,
actx);
}
#endif /* CONFIG_AUDIT */
+
+#ifdef CONFIG_BPF_SYSCALL
+int security_map_create(void)
+{
+ return call_int_hook(bpf_map_create, 0);
+}
+
+int security_map_modify(struct bpf_map *map)
+{
+ return call_int_hook(bpf_map_modify, 0, map);
+}
+
+int security_map_read(struct bpf_map *map)
+{
+ return call_int_hook(bpf_map_read, 0, map);
+}
+
+int security_prog_load(void)
+{
+ return call_int_hook(bpf_prog_load, 0);
+}
+
+int security_post_create(struct bpf_map *map)
+{
+ return call_int_hook(bpf_post_create, 0, map);
+}
+#endif /* CONFIG_BPF_SYSCALL */
--
2.14.1.581.gf28d330327-goog
^ permalink raw reply related
* [PATCH 2/3] security: bpf: Add eBPF LSM hooks and security field to eBPF map
From: Chenbo Feng @ 2017-08-31 20:56 UTC (permalink / raw)
To: linux-security-module
Cc: Jeffrey Vander Stoep, netdev, SELinux, Alexei Starovoitov,
lorenzo, Chenbo Feng
In-Reply-To: <20170831205635.80256-1-chenbofeng.kernel@gmail.com>
From: Chenbo Feng <fengc@google.com>
Introduce a pointer into struct bpf_map to hold the security information
about the map. The actual security struct varies based on the security
models implemented. Place the LSM hooks before each of the unrestricted
eBPF operations, the map_update_elem and map_delete_elem operations are
checked by security_map_modify. The map_lookup_elem and map_get_next_key
operations are checked by securtiy_map_read.
Signed-off-by: Chenbo Feng <fengc@google.com>
---
include/linux/bpf.h | 3 +++
kernel/bpf/syscall.c | 28 ++++++++++++++++++++++++++++
2 files changed, 31 insertions(+)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index b69e7a5869ff..ca3e6ff7091d 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -53,6 +53,9 @@ struct bpf_map {
struct work_struct work;
atomic_t usercnt;
struct bpf_map *inner_map_meta;
+#ifdef CONFIG_SECURITY
+ void *security;
+#endif
};
/* function argument constraints */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 045646da97cc..b15580bcf3b1 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -279,6 +279,10 @@ static int map_create(union bpf_attr *attr)
if (err)
return -EINVAL;
+ err = security_map_create();
+ if (err)
+ return -EACCES;
+
/* find map type and init map: hashtable vs rbtree vs bloom vs ... */
map = find_and_alloc_map(attr);
if (IS_ERR(map))
@@ -291,6 +295,10 @@ static int map_create(union bpf_attr *attr)
if (err)
goto free_map_nouncharge;
+ err = security_post_create(map);
+ if (err < 0)
+ goto free_map;
+
err = bpf_map_alloc_id(map);
if (err)
goto free_map;
@@ -410,6 +418,10 @@ static int map_lookup_elem(union bpf_attr *attr)
if (IS_ERR(map))
return PTR_ERR(map);
+ err = security_map_read(map);
+ if (err)
+ return -EACCES;
+
key = memdup_user(ukey, map->key_size);
if (IS_ERR(key)) {
err = PTR_ERR(key);
@@ -490,6 +502,10 @@ static int map_update_elem(union bpf_attr *attr)
if (IS_ERR(map))
return PTR_ERR(map);
+ err = security_map_modify(map);
+ if (err)
+ return -EACCES;
+
key = memdup_user(ukey, map->key_size);
if (IS_ERR(key)) {
err = PTR_ERR(key);
@@ -573,6 +589,10 @@ static int map_delete_elem(union bpf_attr *attr)
if (IS_ERR(map))
return PTR_ERR(map);
+ err = security_map_modify(map);
+ if (err)
+ return -EACCES;
+
key = memdup_user(ukey, map->key_size);
if (IS_ERR(key)) {
err = PTR_ERR(key);
@@ -616,6 +636,10 @@ static int map_get_next_key(union bpf_attr *attr)
if (IS_ERR(map))
return PTR_ERR(map);
+ err = security_map_read(map);
+ if (err)
+ return -EACCES;
+
if (ukey) {
key = memdup_user(ukey, map->key_size);
if (IS_ERR(key)) {
@@ -935,6 +959,10 @@ static int bpf_prog_load(union bpf_attr *attr)
if (CHECK_ATTR(BPF_PROG_LOAD))
return -EINVAL;
+ err = security_prog_load();
+ if (err)
+ return -EACCES;
+
if (attr->prog_flags & ~BPF_F_STRICT_ALIGNMENT)
return -EINVAL;
--
2.14.1.581.gf28d330327-goog
^ permalink raw reply related
* [PATCH 3/3] selinux: bpf: Implement the selinux checks for eBPF object
From: Chenbo Feng @ 2017-08-31 20:56 UTC (permalink / raw)
To: linux-security-module
Cc: Jeffrey Vander Stoep, netdev, SELinux, Alexei Starovoitov,
lorenzo, Chenbo Feng
In-Reply-To: <20170831205635.80256-1-chenbofeng.kernel@gmail.com>
From: Chenbo Feng <fengc@google.com>
Introduce 5 new selinux checks for eBPF object related operations. The
check is based on the ownership information of eBPF maps and the
capability of creating eBPF object.
Signed-off-by: Chenbo Feng <fengc@google.com>
---
security/selinux/hooks.c | 54 +++++++++++++++++++++++++++++++++++++
security/selinux/include/classmap.h | 2 ++
security/selinux/include/objsec.h | 4 +++
3 files changed, 60 insertions(+)
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 33fd061305c4..39ad7d9f335d 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -85,6 +85,7 @@
#include <linux/export.h>
#include <linux/msg.h>
#include <linux/shm.h>
+#include <linux/bpf.h>
#include "avc.h"
#include "objsec.h"
@@ -6245,6 +6246,52 @@ static void selinux_ib_free_security(void *ib_sec)
}
#endif
+#ifdef CONFIG_BPF_SYSCALL
+static int selinux_bpf_map_create(void)
+{
+ u32 sid = current_sid();
+
+ return avc_has_perm(sid, sid, SECCLASS_BPF, BPF__MAP_CREATE, NULL);
+}
+
+static int selinux_bpf_map_modify(struct bpf_map *map)
+{
+ struct bpf_security_struct *bpfsec = map->security;
+
+ return avc_has_perm(current_sid(), bpfsec->sid, SECCLASS_BPF,
+ BPF__MAP_MODIFY, NULL);
+}
+
+static int selinux_bpf_map_read(struct bpf_map *map)
+{
+ struct bpf_security_struct *bpfsec = map->security;
+
+ return avc_has_perm(current_sid(), bpfsec->sid, SECCLASS_BPF,
+ BPF__MAP_READ, NULL);
+}
+
+static int selinux_bpf_prog_load(void)
+{
+ u32 sid = current_sid();
+
+ return avc_has_perm(sid, sid, SECCLASS_BPF, BPF__PROG_LOAD, NULL);
+}
+
+static int selinux_bpf_post_create(struct bpf_map *map)
+{
+ struct bpf_security_struct *bpfsec;
+
+ bpfsec = kzalloc(sizeof(*bpfsec), GFP_KERNEL);
+ if (!bpfsec)
+ return -ENOMEM;
+
+ bpfsec->sid = current_sid();
+ map->security = bpfsec;
+
+ return 0;
+}
+#endif
+
static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
LSM_HOOK_INIT(binder_set_context_mgr, selinux_binder_set_context_mgr),
LSM_HOOK_INIT(binder_transaction, selinux_binder_transaction),
@@ -6465,6 +6512,13 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
LSM_HOOK_INIT(audit_rule_match, selinux_audit_rule_match),
LSM_HOOK_INIT(audit_rule_free, selinux_audit_rule_free),
#endif
+#ifdef CONFIG_BPF_SYSCALL
+ LSM_HOOK_INIT(bpf_map_create, selinux_bpf_map_create),
+ LSM_HOOK_INIT(bpf_map_modify, selinux_bpf_map_modify),
+ LSM_HOOK_INIT(bpf_map_read, selinux_bpf_map_read),
+ LSM_HOOK_INIT(bpf_prog_load, selinux_bpf_prog_load),
+ LSM_HOOK_INIT(bpf_post_create, selinux_bpf_post_create),
+#endif
};
static __init int selinux_init(void)
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index b9fe3434b036..83c880fb17b4 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -235,6 +235,8 @@ struct security_class_mapping secclass_map[] = {
{ "access", NULL } },
{ "infiniband_endport",
{ "manage_subnet", NULL } },
+ { "bpf",
+ {"map_create", "map_modify", "map_read", "prog_load" } },
{ NULL }
};
diff --git a/security/selinux/include/objsec.h b/security/selinux/include/objsec.h
index 6ebc61e370ff..ba564f662b0d 100644
--- a/security/selinux/include/objsec.h
+++ b/security/selinux/include/objsec.h
@@ -150,6 +150,10 @@ struct pkey_security_struct {
u32 sid; /* SID of pkey */
};
+struct bpf_security_struct {
+ u32 sid; /*SID of bpf obj creater*/
+};
+
extern unsigned int selinux_checkreqprot;
#endif /* _SELINUX_OBJSEC_H_ */
--
2.14.1.581.gf28d330327-goog
^ permalink raw reply related
* Re: [PATCH v4 4/5] net: stmmac: dwmac-sun8i: choose internal PHY via phy-is-integrated
From: Andrew Lunn @ 2017-08-31 20:59 UTC (permalink / raw)
To: Rob Herring
Cc: Corentin Labbe, mark.rutland-5wv7dgnIgG8,
maxime.ripard-wi1+55ScJUtKEb57/3fJTNBPR1lH4CV8, wens-jdAy2FN1RRM,
linux-I+IVW8TIWO2tmTQ+vhA3Yw, peppe.cavallaro-qxv4g6HH51o,
alexandre.torgue-qxv4g6HH51o, f.fainelli-Re5JQEeQqe8AvxtiuMwx3w,
icenowy-h8G6r0blFSE, netdev-u79uwXL29TY76Z2rM5mHXA,
devicetree-u79uwXL29TY76Z2rM5mHXA,
linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20170831201803.x6vwftfcllrfpk6q@rob-hp-laptop>
On Thu, Aug 31, 2017 at 03:18:03PM -0500, Rob Herring wrote:
> On Sat, Aug 26, 2017 at 11:20:51PM +0200, Andrew Lunn wrote:
> > Hi Corentin
> >
> > I think we have now all agreed this is an mdio-mux, plus it is also an
> > MII mux. We should represent that in device tree. This patchset does
> > this. However, as it is now, the mux structure in DT is ignored. All
> > it does is search for the phy-is-integrated flags and goes on that.
> >
> > I made the comment that the device tree representation cannot be
> > implemented using an MDIO mux driver, because of driver loading
> > issues. However, the core of the MDIO mux code is just a library,
> > symbols exported as GPL, free for anything to use.
> >
> > What i think should happen is the mdio-mux is implemented inside the
> > MAC driver, using the mux-core as a library. The device tree structure
> > of a mix is then reflected within Linux. The mux switch callback is
> > implemented within the MAC driver. So it can reset the MAC when the
> > mux is switched. The 'phy-is-integrated' property is then no longer
> > needed.
> >
> > I would suggest a binding something like:
>
> This is looks better to me, but...
>
> > emac: ethernet@1c0b000 {
> > compatible = "allwinner,sun8i-h3-emac";
> > syscon = <&syscon>;
> > reg = <0x01c0b000 0x104>;
> > interrupts = <GIC_SPI 82 IRQ_TYPE_LEVEL_HIGH>;
> > interrupt-names = "macirq";
> > resets = <&ccu RST_BUS_EMAC>;
> > reset-names = "stmmaceth";
> > clocks = <&ccu CLK_BUS_EMAC>;
> > clock-names = "stmmaceth";
> > #address-cells = <1>;
> > #size-cells = <0>;
> >
> > phy-handle = <&int_mii_phy>;
> > phy-mode = "mii";
> > allwinner,leds-active-low;
> >
> > mdio: mdio {
> > #address-cells = <1>;
> > #size-cells = <0>;
> > }
>
> Why do you need this node still?
Hi Rob
It might not be needed, depending on how it is implemented. But:
Documentation/devicetree/bindings/net/mdio-mux.txt
It is normal for an mdio bus mux to have a phandle back to the parent
mdio bus. Also, i think the stmmac driver will only instantiate the
mdio bus if there is a node for it in the device tree.
Andrew
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* [PATCH net-next] doc: document MSG_ZEROCOPY
From: Willem de Bruijn @ 2017-08-31 21:00 UTC (permalink / raw)
To: netdev; +Cc: davem, Willem de Bruijn
From: Willem de Bruijn <willemb@google.com>
Documentation for this feature was missing from the patchset.
Copied a lot from the netdev 2.1 paper, addressing some small
interface changes since then.
Signed-off-by: Willem de Bruijn <willemb@google.com>
---
Documentation/networking/msg_zerocopy.rst | 254 ++++++++++++++++++++++++++++++
1 file changed, 254 insertions(+)
create mode 100644 Documentation/networking/msg_zerocopy.rst
diff --git a/Documentation/networking/msg_zerocopy.rst b/Documentation/networking/msg_zerocopy.rst
new file mode 100644
index 000000000000..2e2a3410b947
--- /dev/null
+++ b/Documentation/networking/msg_zerocopy.rst
@@ -0,0 +1,254 @@
+
+============
+MSG_ZEROCOPY
+============
+
+Intro
+=====
+
+The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
+The feature is currently implemented for TCP sockets.
+
+
+Opportunity and Caveats
+-----------------------
+
+Copying large buffers between user process and kernel can be
+expensive. Linux supports various interfaces that eschew copying,
+such as sendpage and splice. The MSG_ZEROCOPY flag extends the
+underlying copy avoidance mechanism to common socket send calls.
+
+Copy avoidance is not a free lunch. As implemented, with page pinning,
+it replaces per byte copy cost with page accounting and completion
+notification overhead. As a result, MSG_ZEROCOPY is generally only
+effective at writes over around 10 KB.
+
+Page pinning also changes system call semantics. It temporarily shares
+the buffer between process and network stack. Unlike with copying, the
+process cannot immediately overwrite the buffer after system call
+return without possibly modifying the data in flight. Kernel integrity
+is not affected, but a buggy program can possibly corrupt its own data
+stream.
+
+The kernel returns a notification when it is safe to modify data.
+Converting an existing application to MSG_ZEROCOPY is not always as
+trivial as just passing the flag, then.
+
+
+More Info
+---------
+
+Much of this document was derived from a longer paper presented at
+netdev 2.1. For more in-depth information see that paper and talk,
+the excellent reporting over at LWN.net or read the original code.
+
+ paper, slides, video
+ https://netdevconf.org/2.1/session.html?debruijn
+
+ LWN article
+ https://lwn.net/Articles/726917/
+
+ patchset
+ [PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
+ https://lwn.net/Articles/730010/
+ https://www.spinics.net/lists/netdev/msg447552.html
+
+
+Interface
+=========
+
+Passing the MSG_ZEROCOPY flag is the most obvious step to enable copy
+avoidance, but not the only one.
+
+Socket Setup
+------------
+
+The kernel is permissive when applications pass undefined flags to the
+send system call. By default it simply ignores these. To avoid enabling
+copy avoidance mode for legacy processes that accidentally pass this
+flag, a process must first signal intent by setting a socket option:
+
+::
+
+ if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
+ error(1, errno, "setsockopt zerocopy");
+
+
+Transmission
+------------
+
+The change to send (or sendto, sendmsg, sendmmsg) itself is trivial.
+Pass the new flag.
+
+::
+
+ ret = send(fd, buf, sizeof(buf), MSG_ZEROCOPY);
+ if (ret != sizeof(buf))
+ error(1, errno, "send");
+
+
+Mixing copy avoidance and copying
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Many workloads have a mixture of large and small buffers. Because copy
+avoidance is more expensive than copying for small packets, the
+feature is implemented as a flag. It is safe to mix calls with the flag
+with those without.
+
+
+Notifications
+-------------
+
+The kernel has to notify the process when it is safe to reuse a
+previously passed buffer. It queues completion notifications on the
+socket error queue, akin to the transmit timestamping interface.
+
+The notification itself is a simple scalar value. Each socket
+maintains an internal 32-bit counter. Each send call with MSG_ZEROCOPY
+that successfully sends data increments the counter. The counter is
+not incremented on failure or if called with length zero.
+
+
+Notification Reception
+~~~~~~~~~~~~~~~~~~~~~~
+
+The below snippet demonstrates the API. In the simplest case, each
+send syscall is followed by a poll and recvmsg on the error queue.
+
+Reading from the error queue is always a non-blocking operation. The
+poll call is there to block until an error is outstanding. It will set
+POLLERR in its output flags. That flag does not have to be set in the
+events field: errors are signaled unconditionally.
+
+::
+
+ pfd.fd = fd;
+ pfd.events = 0;
+ if (poll(&pfd, 1, -1) != 1 || pfd.revents & POLLERR == 0)
+ error(1, errno, "poll");
+
+ ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
+ if (ret == -1)
+ error(1, errno, "recvmsg");
+
+ read_notification(msg);
+
+The example is for demonstration purpose only. In practice, it is more
+efficient to not wait for notifications, but read without blocking
+every couple of send calls.
+
+Notifications can be processed out of order with other operations on
+the socket. A socket that has an error queued would normally block
+other operations until the error is read. Zerocopy notifications have
+a zero error code, however, to not block send and recv calls.
+
+
+Notification Batching
+~~~~~~~~~~~~~~~~~~~~~
+
+Multiple outstanding packets can be read at once using the recvmmsg
+call. This is often not needed. In each message the kernel returns not
+a single value, but a range. It coalesces consecutive notifications
+while one is outstanding for reception on the error queue.
+
+When a new notification is about to be queued, it checks whether the
+new value extends the range of the notification at the tail of the
+queue. If so, it drops the new notification packet and instead increases
+the range upper value of the outstanding notification.
+
+For protocols that acknowledge data in-order, like TCP, each
+notification can be squashed into the previous one, so that no more
+than one notification is outstanding at any one point.
+
+Ordered delivery is the common case, but not guaranteed. Notifications
+may arrive out of order on retransmission and socket teardown.
+
+
+Notification Parsing
+~~~~~~~~~~~~~~~~~~~~
+
+The below snippet demonstrates how to parse the control message: the
+read_notification() call in the previous snippet. A notification
+is encoded in the standard error format, sock_extended_err.
+
+The level and type fields in the control data are protocol family
+specific, IP_RECVERR or IPV6_RECVERR.
+
+Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. The errno is zero,
+as explained before, to avoid blocking read and write system calls on
+the socket.
+
+The 32-bit notification range is encoded as [ee_info, ee_data]. This
+range is inclusive. Other fields in the struct must be treated as
+undefined, bar for ee_code, as discussed below.
+
+::
+
+ struct sock_extended_err *serr;
+ struct cmsghdr *cm;
+
+ cm = CMSG_FIRSTHDR(msg);
+ if (cm->cmsg_level != SOL_IP &&
+ cm->cmsg_type != IP_RECVERR)
+ error(1, 0, "cmsg");
+
+ serr = (void *) CMSG_DATA(cm);
+ if (serr->ee_errno != 0 ||
+ serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
+ error(1, 0, "serr");
+
+ printf("completed: %u..%u\n", serr->ee_info, serr->ee_data);
+
+
+Deferred copies
+~~~~~~~~~~~~~~~
+
+Passing flag MSG_ZEROCOPY is a hint to the kernel to apply copy
+avoidance, and a contract that the kernel will queue a completion
+notification. It is not a guarantee that the copy is elided.
+
+Copy avoidance is not always feasible. Devices that do not support
+scatter-gather I/O cannot send packets made up of kernel generated
+protocol headers plus zerocopy user data. A packet may need to be
+converted to having a private copy of data deep in the stack, say to
+compute a checksum.
+
+In all these cases, the kernel returns a completion notification when
+it releases its hold on the shared pages. The notification may arrive
+before the (copied) data is fully transmitted. A zerocopy completion
+notification is not a transmit completion notification, therefore.
+
+Deferred copies can be more expensive than a copy immediately in the
+system call, if the data is no longer warm in the cache. The process
+also incurs notification processing cost for no benefit. For this
+reason, the kernel signals if data was completed with a copy, by
+setting flag SO_EE_CODE_ZEROCOPY_COPIED in field ee_code on return.
+A process may use this signal to stop passing flag MSG_ZEROCOPY on
+subsequent requests on the same socket.
+
+Implementation
+==============
+
+Loopback
+--------
+
+Data sent to local sockets can be queued indefinitely if the receive
+process does not read its socket. Unbound notification latency is not
+acceptable. For this reason all packets generated with MSG_ZEROCOPY
+that are looped to a local socket will incur a deferred copy. This
+includes looping onto packet sockets (e.g., tcpdump) and tun devices.
+
+
+Testing
+=======
+
+More realistic example code can be found in the kernel source under
+tools/testing/selftests/net/msg_zerocopy.c.
+
+Be cognizant of the loopback constraint. The test can be run between
+a pair of hosts. But if run between a local pair of processes, for
+instance when run with msg_zerocopy.sh between a veth pair across
+namespaces, the test will not show any improvement. For testing, the
+loopback restriction can be temporarily avoided by making
+skb_orphan_frags_rx identical to skb_orphan_frags.
+
--
2.14.1.581.gf28d330327-goog
^ permalink raw reply related
* Re: [PATCH 2/3] security: bpf: Add eBPF LSM hooks and security field to eBPF map
From: Mimi Zohar @ 2017-08-31 21:17 UTC (permalink / raw)
To: Chenbo Feng, linux-security-module
Cc: Jeffrey Vander Stoep, netdev, SELinux, Alexei Starovoitov,
lorenzo, Chenbo Feng
In-Reply-To: <20170831205635.80256-3-chenbofeng.kernel@gmail.com>
On Thu, 2017-08-31 at 13:56 -0700, Chenbo Feng wrote:
> From: Chenbo Feng <fengc@google.com>
>
> Introduce a pointer into struct bpf_map to hold the security information
> about the map. The actual security struct varies based on the security
> models implemented. Place the LSM hooks before each of the unrestricted
> eBPF operations, the map_update_elem and map_delete_elem operations are
> checked by security_map_modify. The map_lookup_elem and map_get_next_key
> operations are checked by securtiy_map_read.
>
> Signed-off-by: Chenbo Feng <fengc@google.com>
> ---
> include/linux/bpf.h | 3 +++
> kernel/bpf/syscall.c | 28 ++++++++++++++++++++++++++++
> 2 files changed, 31 insertions(+)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index b69e7a5869ff..ca3e6ff7091d 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -53,6 +53,9 @@ struct bpf_map {
> struct work_struct work;
> atomic_t usercnt;
> struct bpf_map *inner_map_meta;
> +#ifdef CONFIG_SECURITY
> + void *security;
> +#endif
> };
>
> /* function argument constraints */
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 045646da97cc..b15580bcf3b1 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -279,6 +279,10 @@ static int map_create(union bpf_attr *attr)
> if (err)
> return -EINVAL;
>
> + err = security_map_create();
> + if (err)
> + return -EACCES;
Any reason not to just return err?
Mimi
> +
> /* find map type and init map: hashtable vs rbtree vs bloom vs ... */
> map = find_and_alloc_map(attr);
> if (IS_ERR(map))
> @@ -291,6 +295,10 @@ static int map_create(union bpf_attr *attr)
> if (err)
> goto free_map_nouncharge;
>
> + err = security_post_create(map);
> + if (err < 0)
> + goto free_map;
> +
> err = bpf_map_alloc_id(map);
> if (err)
> goto free_map;
> @@ -410,6 +418,10 @@ static int map_lookup_elem(union bpf_attr *attr)
> if (IS_ERR(map))
> return PTR_ERR(map);
>
> + err = security_map_read(map);
> + if (err)
> + return -EACCES;
> +
> key = memdup_user(ukey, map->key_size);
> if (IS_ERR(key)) {
> err = PTR_ERR(key);
> @@ -490,6 +502,10 @@ static int map_update_elem(union bpf_attr *attr)
> if (IS_ERR(map))
> return PTR_ERR(map);
>
> + err = security_map_modify(map);
> + if (err)
> + return -EACCES;
> +
> key = memdup_user(ukey, map->key_size);
> if (IS_ERR(key)) {
> err = PTR_ERR(key);
> @@ -573,6 +589,10 @@ static int map_delete_elem(union bpf_attr *attr)
> if (IS_ERR(map))
> return PTR_ERR(map);
>
> + err = security_map_modify(map);
> + if (err)
> + return -EACCES;
> +
> key = memdup_user(ukey, map->key_size);
> if (IS_ERR(key)) {
> err = PTR_ERR(key);
> @@ -616,6 +636,10 @@ static int map_get_next_key(union bpf_attr *attr)
> if (IS_ERR(map))
> return PTR_ERR(map);
>
> + err = security_map_read(map);
> + if (err)
> + return -EACCES;
> +
> if (ukey) {
> key = memdup_user(ukey, map->key_size);
> if (IS_ERR(key)) {
> @@ -935,6 +959,10 @@ static int bpf_prog_load(union bpf_attr *attr)
> if (CHECK_ATTR(BPF_PROG_LOAD))
> return -EINVAL;
>
> + err = security_prog_load();
> + if (err)
> + return -EACCES;
> +
> if (attr->prog_flags & ~BPF_F_STRICT_ALIGNMENT)
> return -EINVAL;
>
^ permalink raw reply
* Re: [PATCH] ath9k: remove cast to void pointer
From: Joe Perches @ 2017-08-31 21:29 UTC (permalink / raw)
To: Himanshu Jha, kvalo; +Cc: ath9k-devel, linux-wireless, netdev, linux-kernel
In-Reply-To: <1504184863-9732-1-git-send-email-himanshujha199640@gmail.com>
On Thu, 2017-08-31 at 18:37 +0530, Himanshu Jha wrote:
> casting to void pointer from any pointer type and vice-versa is done
> implicitly and therefore casting is not needed in such a case.
You said you were going to remember to mention the tool
and script that did this.
^ permalink raw reply
* Re: [PATCH net-next 7/8] net: hns3: add vlan filter config of Ports
From: David Miller @ 2017-08-31 21:38 UTC (permalink / raw)
To: lipeng321; +Cc: netdev, linux-kernel, linuxarm, yisen.zhuang, salil.mehta
In-Reply-To: <1504186749-8926-8-git-send-email-lipeng321@huawei.com>
From: Lipeng <lipeng321@huawei.com>
Date: Thu, 31 Aug 2017 21:39:08 +0800
> Config the self_define vlan_type as TPID(0x8100) for vlan identification.
> When normal port initialize vlan configure, set default vlan id as 0.
>
> Signed-off-by: Mingguang Qu <qumingguang@huawei.com>
> Signed-off-by: Lipeng <lipeng321@huawei.com>
No, that's not what this patch is doing.
> @@ -3308,6 +3308,7 @@ static int hclge_add_mac_vlan_tbl(struct hclge_vport *vport,
> mc_desc[1].flag |= cpu_to_le16(HCLGE_CMD_FLAG_NEXT);
> hclge_cmd_reuse_desc(&mc_desc[2], false);
> mc_desc[2].flag &= cpu_to_le16(~HCLGE_CMD_FLAG_NEXT);
> +
> memcpy(mc_desc[0].data, req,
> sizeof(struct hclge_mac_vlan_tbl_entry));
> ret = hclge_cmd_send(&hdev->hw, mc_desc, 3);
All it does is add an empty line.
^ permalink raw reply
* Re: [PATCH net-next] net/ncsi: Define {add, kill}_vid callbacks for !CONFIG_NET_NCSI
From: Benjamin Herrenschmidt @ 2017-08-31 21:14 UTC (permalink / raw)
To: Vernon Mauery, Samuel Mendoza-Jonas
Cc: David S . Miller, netdev, linux-kernel, OpenBMC Maillist,
Gavin Shan
In-Reply-To: <20170831152414.GB69617@mauery>
On Thu, 2017-08-31 at 08:24 -0700, Vernon Mauery wrote:
> +int ncsi_vlan_rx_add_vid(struct net_device *dev, __be16 proto, u16 vid)
> > +{
> > + return -ENOTTY;
> > +}
> > +int ncsi_vlan_rx_kill_vid(struct net_device *dev, __be16 proto, u16 vid)
> > +{
> > + return -ENOTTY;
> > +}
>
> These should be static functions because they are defined in the header
> file or you will get multiple symbol definitions.
static inline even or you'll get warning about them being unused iirc.
Cheers,
Ben.
^ permalink raw reply
* Re: [patch net-next repost 0/8] mlxsw: Add IPv6 host dpipe table
From: David Miller @ 2017-08-31 21:43 UTC (permalink / raw)
To: jiri; +Cc: netdev, arkadis, idosch, mlxsw
In-Reply-To: <20170831155919.1366-1-jiri@resnulli.us>
From: Jiri Pirko <jiri@resnulli.us>
Date: Thu, 31 Aug 2017 17:59:11 +0200
> This patchset adds IPv6 host dpipe table support. This will provide the
> ability to observe the hardware offloaded IPv6 neighbors.
Series applied, thanks.
I noticed while reviewing this we are pretty much split on how
to access neigh->primary_key.
Half the code goes "(type *) n->primary_key" and the other half
(mostly in ipv6) goes "(type *) &n->primary_key"
I know both forms are basically equivalent, but consistency
matters :-)
^ permalink raw reply
* Re: [PATCH net-next] bridge: add tracepoint in br_fdb_update
From: Jesper Dangaard Brouer @ 2017-08-31 21:50 UTC (permalink / raw)
To: David Miller; +Cc: brouer, roopa, netdev, nikolay, f.fainelli, andrew, bridge
In-Reply-To: <20170831.114325.2027598728601121750.davem@davemloft.net>
On Thu, 31 Aug 2017 11:43:25 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:
> From: Roopa Prabhu <roopa@cumulusnetworks.com>
> Date: Wed, 30 Aug 2017 22:18:13 -0700
>
> > From: Roopa Prabhu <roopa@cumulusnetworks.com>
> >
> > This extends bridge fdb table tracepoints to also cover
> > learned fdb entries in the br_fdb_update path. Note that
> > unlike other tracepoints I have moved this to when the fdb
> > is modified because this is in the datapath and can generate
> > a lot of noise in the trace output. br_fdb_update is also called
> > from added_by_user context in the NTF_USE case which is already
> > traced ..hence the !added_by_user check.
> >
> > Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
>
> Applied.
>
> Let's use dev->name for now and if the tooling can eventually
> do transparent ifindex->name then we can consider redoing
> a bunch of networking tracepoints.
I agree! :-)
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox