Netdev List

* Re: [BISECTED] v4.9: OCTEON ethernet crash
From: Florian Fainelli @ 2016-12-15  1:00 UTC
  To: Aaro Koskinen; +Cc: David S. Miller, David Daney, netdev
In-Reply-To: <20161215005751.457nm46shyjdt63q@raspberrypi-2.musicnaut.iki.fi>

On 12/14/2016 04:57 PM, Aaro Koskinen wrote:
> Hi,
> 
> On Wed, Dec 14, 2016 at 04:41:13PM -0800, Florian Fainelli wrote:
>> On 12/14/2016 04:32 PM, Aaro Koskinen wrote:
>>> Git bisect points to:
>>>
>>> commit ec988ad78ed6d184a7f4ca6b8e962b0e8f1de461
>>> Author: Florian Fainelli <f.fainelli@gmail.com>
>>> Date:   Tue Dec 6 20:54:43 2016 -0800
>>>
>>>     phy: Don't increment MDIO bus refcount unless it's a different owner
>>>
>>> Reverting this patch from v4.9 fixes the issue...
>>
>> This should help:
>>
>> diff --git a/drivers/staging/octeon/ethernet.c b/drivers/staging/octeon/ethernet.c
>> index 8130dfe89745..12ebc4d800c3 100644
>> --- a/drivers/staging/octeon/ethernet.c
>> +++ b/drivers/staging/octeon/ethernet.c
>> @@ -770,6 +770,7 @@ static int cvm_oct_probe(struct platform_device *pdev)
>>                         /* Initialize the device private structure. */
>>                         struct octeon_ethernet *priv = netdev_priv(dev);
>>
>> +                       SET_NETDEV_DEV(dev, &pdev->dev);
>>                         dev->netdev_ops = &cvm_oct_pow_netdev_ops;
>>                         priv->imode = CVMX_HELPER_INTERFACE_MODE_DISABLED;
>>                         priv->port = CVMX_PIP_NUM_INPUT_PORTS;
> 
> No, it's still crashing.

How about this:

diff --git a/drivers/staging/octeon/ethernet.c b/drivers/staging/octeon/ethernet.c
index 12ebc4d800c3..4971aa54756a 100644
--- a/drivers/staging/octeon/ethernet.c
+++ b/drivers/staging/octeon/ethernet.c
@@ -817,6 +817,7 @@ static int cvm_oct_probe(struct platform_device *pdev)
                        }

                        /* Initialize the device private structure. */
+                       SET_NETDEV_DEV(dev, &pdev->dev);
                        priv = netdev_priv(dev);
                        priv->netdev = dev;
                        priv->of_node = cvm_oct_node_for_port(pip, interface,
-- 
Florian


* Re: [BISECTED] v4.9: OCTEON ethernet crash
From: Aaro Koskinen @ 2016-12-15  0:57 UTC
  To: Florian Fainelli; +Cc: David S. Miller, David Daney, netdev
In-Reply-To: <f70acf45-191d-215a-0019-420627a35d98@gmail.com>

Hi,

On Wed, Dec 14, 2016 at 04:41:13PM -0800, Florian Fainelli wrote:
> On 12/14/2016 04:32 PM, Aaro Koskinen wrote:
> > Git bisect points to:
> > 
> > commit ec988ad78ed6d184a7f4ca6b8e962b0e8f1de461
> > Author: Florian Fainelli <f.fainelli@gmail.com>
> > Date:   Tue Dec 6 20:54:43 2016 -0800
> > 
> >     phy: Don't increment MDIO bus refcount unless it's a different owner
> > 
> > Reverting this patch from v4.9 fixes the issue...
> 
> This should help:
> 
> diff --git a/drivers/staging/octeon/ethernet.c b/drivers/staging/octeon/ethernet.c
> index 8130dfe89745..12ebc4d800c3 100644
> --- a/drivers/staging/octeon/ethernet.c
> +++ b/drivers/staging/octeon/ethernet.c
> @@ -770,6 +770,7 @@ static int cvm_oct_probe(struct platform_device *pdev)
>                         /* Initialize the device private structure. */
>                         struct octeon_ethernet *priv = netdev_priv(dev);
> 
> +                       SET_NETDEV_DEV(dev, &pdev->dev);
>                         dev->netdev_ops = &cvm_oct_pow_netdev_ops;
>                         priv->imode = CVMX_HELPER_INTERFACE_MODE_DISABLED;
>                         priv->port = CVMX_PIP_NUM_INPUT_PORTS;

No, it's still crashing.

A.


* [PATCH net-next 1/2] inet: Don't go into port scan when looking for specific bind port
From: Tom Herbert @ 2016-12-15  0:54 UTC
  To: davem, netdev; +Cc: kernel-team, jbacik, eric.dumazet, raigatgoog
In-Reply-To: <20161215005416.1561632-1-tom@herbertland.com>

inet_csk_get_port is called with a port number (snum argument) that may be
zero or nonzero. If it is zero, then the intent is to find an available
ephemeral port number to bind to. If snum is nonzero then the caller
is asking to allocate a specific port number. In the latter case we
never want to perform the scan in the ephemeral port range. It is
conceivable that this can happen if the "goto again" in "tb_found:"
is done. This patch adds a check that snum is zero before doing
the "goto again".

Signed-off-by: Tom Herbert <tom@herbertland.com>
---
 net/ipv4/inet_connection_sock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index d5d3ead..f59838a6 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -212,7 +212,7 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 			      sk->sk_reuseport &&
 			      !rcu_access_pointer(sk->sk_reuseport_cb) &&
 			      uid_eq(tb->fastuid, uid))) &&
-			    smallest_size != -1 && --attempts >= 0) {
+			    !snum && smallest_size != -1 && --attempts >= 0) {
 				spin_unlock_bh(&head->lock);
 				goto again;
 			}
-- 
2.9.3


* [PATCH net-next 2/2] inet: Fix get port to handle zero port number with soreuseport set
From: Tom Herbert @ 2016-12-15  0:54 UTC
  To: davem, netdev; +Cc: kernel-team, jbacik, eric.dumazet, raigatgoog
In-Reply-To: <20161215005416.1561632-1-tom@herbertland.com>

A user may call listen without binding an explicit port, with the intent
that the kernel will assign an available port to the socket. In this
case inet_csk_get_port does a port scan. For such sockets, the user may
also set soreuseport with the intent of creating more sockets for the
port that is selected. The problem is that the initial socket being
opened could inadvertently choose an existing and unrelated port
number that was already created with soreuseport.

This patch adds a boolean parameter to inet_bind_conflict that indicates
whether soreuseport is allowed for the check (in addition to
sk->sk_reuseport). In calls to inet_bind_conflict from inet_csk_get_port
the argument is set to true if an explicit port is being looked up (snum
argument is nonzero), and is false if a port scan is done.

Signed-off-by: Tom Herbert <tom@herbertland.com>
---
 include/net/inet6_connection_sock.h |  3 ++-
 include/net/inet_connection_sock.h  |  6 ++++--
 net/ipv4/inet_connection_sock.c     | 14 +++++++++-----
 net/ipv6/inet6_connection_sock.c    |  7 ++++---
 4 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/include/net/inet6_connection_sock.h b/include/net/inet6_connection_sock.h
index 954ad6b..3212b39 100644
--- a/include/net/inet6_connection_sock.h
+++ b/include/net/inet6_connection_sock.h
@@ -22,7 +22,8 @@ struct sock;
 struct sockaddr;
 
 int inet6_csk_bind_conflict(const struct sock *sk,
-			    const struct inet_bind_bucket *tb, bool relax);
+			    const struct inet_bind_bucket *tb, bool relax,
+			    bool soreuseport_ok);
 
 struct dst_entry *inet6_csk_route_req(const struct sock *sk, struct flowi6 *fl6,
 				      const struct request_sock *req, u8 proto);
diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 146054c..85ee387 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -63,7 +63,8 @@ struct inet_connection_sock_af_ops {
 #endif
 	void	    (*addr2sockaddr)(struct sock *sk, struct sockaddr *);
 	int	    (*bind_conflict)(const struct sock *sk,
-				     const struct inet_bind_bucket *tb, bool relax);
+				     const struct inet_bind_bucket *tb,
+				     bool relax, bool soreuseport_ok);
 	void	    (*mtu_reduced)(struct sock *sk);
 };
 
@@ -261,7 +262,8 @@ inet_csk_rto_backoff(const struct inet_connection_sock *icsk,
 struct sock *inet_csk_accept(struct sock *sk, int flags, int *err);
 
 int inet_csk_bind_conflict(const struct sock *sk,
-			   const struct inet_bind_bucket *tb, bool relax);
+			   const struct inet_bind_bucket *tb, bool relax,
+			   bool soreuseport_ok);
 int inet_csk_get_port(struct sock *sk, unsigned short snum);
 
 struct dst_entry *inet_csk_route_req(const struct sock *sk, struct flowi4 *fl4,
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index f59838a6..19ea045 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -45,11 +45,12 @@ void inet_get_local_port_range(struct net *net, int *low, int *high)
 EXPORT_SYMBOL(inet_get_local_port_range);
 
 int inet_csk_bind_conflict(const struct sock *sk,
-			   const struct inet_bind_bucket *tb, bool relax)
+			   const struct inet_bind_bucket *tb, bool relax,
+			   bool reuseport_ok)
 {
 	struct sock *sk2;
-	int reuse = sk->sk_reuse;
-	int reuseport = sk->sk_reuseport;
+	bool reuse = sk->sk_reuse;
+	bool reuseport = !!sk->sk_reuseport && reuseport_ok;
 	kuid_t uid = sock_i_uid((struct sock *)sk);
 
 	/*
@@ -105,6 +106,7 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 	struct inet_bind_bucket *tb;
 	kuid_t uid = sock_i_uid(sk);
 	u32 remaining, offset;
+	bool reuseport_ok = !!snum;
 
 	if (port) {
 have_port:
@@ -165,7 +167,8 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 					smallest_size = tb->num_owners;
 					smallest_port = port;
 				}
-				if (!inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, false))
+				if (!inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, false,
+									      reuseport_ok))
 					goto tb_found;
 				goto next_port;
 			}
@@ -206,7 +209,8 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 		      sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
 		    smallest_size == -1)
 			goto success;
-		if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) {
+		if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true,
+							     reuseport_ok)) {
 			if ((reuse ||
 			     (tb->fastreuseport > 0 &&
 			      sk->sk_reuseport &&
diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
index 1c86c47..7396e75 100644
--- a/net/ipv6/inet6_connection_sock.c
+++ b/net/ipv6/inet6_connection_sock.c
@@ -29,11 +29,12 @@
 #include <net/sock_reuseport.h>
 
 int inet6_csk_bind_conflict(const struct sock *sk,
-			    const struct inet_bind_bucket *tb, bool relax)
+			    const struct inet_bind_bucket *tb, bool relax,
+			    bool reuseport_ok)
 {
 	const struct sock *sk2;
-	int reuse = sk->sk_reuse;
-	int reuseport = sk->sk_reuseport;
+	bool reuse = !!sk->sk_reuse;
+	bool reuseport = !!sk->sk_reuseport && reuseport_ok;
 	kuid_t uid = sock_i_uid((struct sock *)sk);
 
 	/* We must walk the whole port owner list in this case. -DaveM */
-- 
2.9.3


* [PATCH net-next 0/2] inet: Fixes for inet_csk_get_port and soreuseport
From: Tom Herbert @ 2016-12-15  0:54 UTC
  To: davem, netdev; +Cc: kernel-team, jbacik, eric.dumazet, raigatgoog

This patch set fixes a couple of issues I noticed while debugging our
softlockup issue in inet_csk_get_port.

- Don't allow jump into port scan in inet_csk_get_port if the function
  was called with a non-zero port number (looking up an explicit port
  number).
- When inet_csk_get_port is called with a zero port number (i.e. perform
  scan) and reuseport is set on the socket, don't match sockets that
  also have reuseport set. The intent from the user should be
  to get a new port number and then explicitly bind other
  sockets to that number using soreuseport.

Tested:

Ran first patch on production workload with no ill effect.

For second patch, ran a little listener application and first
demonstrated that unbound sockets with soreuseport can indeed
be bound to unrelated soreuseport sockets.


Tom Herbert (2):
  inet: Don't go into port scan when looking for specific bind port
  inet: Fix get port to handle zero port number with soreuseport set

 include/net/inet6_connection_sock.h |  3 ++-
 include/net/inet_connection_sock.h  |  6 ++++--
 net/ipv4/inet_connection_sock.c     | 16 ++++++++++------
 net/ipv6/inet6_connection_sock.c    |  7 ++++---
 4 files changed, 20 insertions(+), 12 deletions(-)

-- 
2.9.3


* Re: [PATCH net-next 1/1] driver: ipvlan: Define common functions to decrease duplicated codes used to add or del IP address
From: Feng Gao @ 2016-12-15  0:50 UTC
  To: David S. Miller, Mahesh Bandewar, Eric Dumazet,
	Linux Kernel Network Developers, Feng Gao
In-Reply-To: <1481727165-18824-1-git-send-email-fgao@ikuai8.com>

On Wed, Dec 14, 2016 at 10:52 PM,  <fgao@ikuai8.com> wrote:
> From: Gao Feng <gfree.wind@gmail.com>
>
> There is some duplicated code in ipvlan_add_addr6/4 and
> ipvlan_del_addr6/4. Now define two common functions ipvlan_add_addr
> and ipvlan_del_addr to reduce the duplicated code.
> This should make the code easier to maintain.
>
> Signed-off-by: Gao Feng <gfree.wind@gmail.com>
> ---
>  drivers/net/ipvlan/ipvlan_main.c | 68 +++++++++++++++++-----------------------
>  1 file changed, 29 insertions(+), 39 deletions(-)
>
> diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
> index 693ec5b..5874d30 100644
> --- a/drivers/net/ipvlan/ipvlan_main.c
> +++ b/drivers/net/ipvlan/ipvlan_main.c
> @@ -669,23 +669,22 @@ static int ipvlan_device_event(struct notifier_block *unused,
>         return NOTIFY_DONE;
>  }
>
> -static int ipvlan_add_addr6(struct ipvl_dev *ipvlan, struct in6_addr *ip6_addr)
> +static int ipvlan_add_addr(struct ipvl_dev *ipvlan, void *iaddr, bool is_v6)
>  {
>         struct ipvl_addr *addr;
>
> -       if (ipvlan_addr_busy(ipvlan->port, ip6_addr, true)) {
> -               netif_err(ipvlan, ifup, ipvlan->dev,
> -                         "Failed to add IPv6=%pI6c addr for %s intf\n",
> -                         ip6_addr, ipvlan->dev->name);
> -               return -EINVAL;
> -       }
>         addr = kzalloc(sizeof(struct ipvl_addr), GFP_ATOMIC);
>         if (!addr)
>                 return -ENOMEM;
>
>         addr->master = ipvlan;
> -       memcpy(&addr->ip6addr, ip6_addr, sizeof(struct in6_addr));
> -       addr->atype = IPVL_IPV6;
> +       if (is_v6) {
> +               memcpy(&addr->ip6addr, iaddr, sizeof(struct in6_addr));
> +               addr->atype = IPVL_IPV6;
> +       } else {
> +               memcpy(&addr->ip4addr, iaddr, sizeof(struct in_addr));
> +               addr->atype = IPVL_IPV4;
> +       }
>         list_add_tail(&addr->anode, &ipvlan->addrs);
>
>         /* If the interface is not up, the address will be added to the hash
> @@ -697,11 +696,11 @@ static int ipvlan_add_addr6(struct ipvl_dev *ipvlan, struct in6_addr *ip6_addr)
>         return 0;
>  }
>
> -static void ipvlan_del_addr6(struct ipvl_dev *ipvlan, struct in6_addr *ip6_addr)
> +static void ipvlan_del_addr(struct ipvl_dev *ipvlan, void *iaddr, bool is_v6)
>  {
>         struct ipvl_addr *addr;
>
> -       addr = ipvlan_find_addr(ipvlan, ip6_addr, true);
> +       addr = ipvlan_find_addr(ipvlan, iaddr, is_v6);
>         if (!addr)
>                 return;
>
> @@ -712,6 +711,23 @@ static void ipvlan_del_addr6(struct ipvl_dev *ipvlan, struct in6_addr *ip6_addr)
>         return;
>  }
>
> +static int ipvlan_add_addr6(struct ipvl_dev *ipvlan, struct in6_addr *ip6_addr)
> +{
> +       if (ipvlan_addr_busy(ipvlan->port, ip6_addr, true)) {
> +               netif_err(ipvlan, ifup, ipvlan->dev,
> +                         "Failed to add IPv6=%pI6c addr for %s intf\n",
> +                         ip6_addr, ipvlan->dev->name);
> +               return -EINVAL;
> +       }
> +
> +       return ipvlan_add_addr(ipvlan, ip6_addr, true);
> +}
> +
> +static void ipvlan_del_addr6(struct ipvl_dev *ipvlan, struct in6_addr *ip6_addr)
> +{
> +       return ipvlan_del_addr(ipvlan, ip6_addr, true);
> +}
> +
>  static int ipvlan_addr6_event(struct notifier_block *unused,
>                               unsigned long event, void *ptr)
>  {
> @@ -745,45 +761,19 @@ static int ipvlan_addr6_event(struct notifier_block *unused,
>
>  static int ipvlan_add_addr4(struct ipvl_dev *ipvlan, struct in_addr *ip4_addr)
>  {
> -       struct ipvl_addr *addr;
> -
>         if (ipvlan_addr_busy(ipvlan->port, ip4_addr, false)) {
>                 netif_err(ipvlan, ifup, ipvlan->dev,
>                           "Failed to add IPv4=%pI4 on %s intf.\n",
>                           ip4_addr, ipvlan->dev->name);
>                 return -EINVAL;
>         }
> -       addr = kzalloc(sizeof(struct ipvl_addr), GFP_KERNEL);
> -       if (!addr)
> -               return -ENOMEM;
> -
> -       addr->master = ipvlan;
> -       memcpy(&addr->ip4addr, ip4_addr, sizeof(struct in_addr));
> -       addr->atype = IPVL_IPV4;
> -       list_add_tail(&addr->anode, &ipvlan->addrs);
> -
> -       /* If the interface is not up, the address will be added to the hash
> -        * list by ipvlan_open.
> -        */
> -       if (netif_running(ipvlan->dev))
> -               ipvlan_ht_addr_add(ipvlan, addr);
>
> -       return 0;
> +       return ipvlan_add_addr(ipvlan, ip4_addr, false);
>  }
>
>  static void ipvlan_del_addr4(struct ipvl_dev *ipvlan, struct in_addr *ip4_addr)
>  {
> -       struct ipvl_addr *addr;
> -
> -       addr = ipvlan_find_addr(ipvlan, ip4_addr, false);
> -       if (!addr)
> -               return;
> -
> -       ipvlan_ht_addr_del(addr);
> -       list_del(&addr->anode);
> -       kfree_rcu(addr, rcu);
> -
> -       return;
> +       return ipvlan_del_addr(ipvlan, ip4_addr, false);
>  }
>
>  static int ipvlan_addr4_event(struct notifier_block *unused,
> --
> 1.9.1
>
>

Sorry, I just remembered that "net-next" is closed.
Please ignore this patch; I will send a new one after "net-next" reopens.

Regards
Feng


* Re: [PATCH net] bpf, test_verifier: fix a test case error result on unprivileged
From: Alexei Starovoitov @ 2016-12-15  0:43 UTC
  To: Daniel Borkmann, davem; +Cc: netdev
In-Reply-To: <639d61f73c907b704001ed2b115208998990eb38.1481762158.git.daniel@iogearbox.net>

On 12/14/16 4:39 PM, Daniel Borkmann wrote:
> Running ./test_verifier as unprivileged lets 1 out of 98 tests fail:
>
>    [...]
>    #71 unpriv: check that printk is disallowed FAIL
>    Unexpected error message!
>    0: (7a) *(u64 *)(r10 -8) = 0
>    1: (bf) r1 = r10
>    2: (07) r1 += -8
>    3: (b7) r2 = 8
>    4: (bf) r3 = r1
>    5: (85) call bpf_trace_printk#6
>    unknown func bpf_trace_printk#6
>    [...]
>
> The test case is correct, just that the error outcome changed with
> ebb676daa1a3 ("bpf: Print function name in addition to function id").
> Same as with e00c7b216f34 ("bpf: fix multiple issues in selftest suite
> and samples") issue 2), so just fix up the function name.
>
> Fixes: ebb676daa1a3 ("bpf: Print function name in addition to function id")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

Was thinking of sending the same fix. Thank you for beating me to it :)
Acked-by: Alexei Starovoitov <ast@kernel.org>


* Re: [BISECTED] v4.9: OCTEON ethernet crash
From: Florian Fainelli @ 2016-12-15  0:41 UTC
  To: Aaro Koskinen, David S. Miller, David Daney, netdev
In-Reply-To: <20161215003253.un6hk2ytl3auiztn@raspberrypi-2.musicnaut.iki.fi>

On 12/14/2016 04:32 PM, Aaro Koskinen wrote:
> Hi,
> 
> I'm getting the following crash on every boot on OCTEON (EdgeRouter Lite)
> with v4.9 (right after setting up ethernet bridging):
> 
> [   16.814902] CPU 0 Unable to handle kernel paging request at virtual address 0000000000000080, epc == ffffffff81458570, ra == ffffffff81458804
> [   16.827805] Oops[#1]:
> [   16.830100] CPU: 0 PID: 706 Comm: ifconfig Not tainted 4.9.0-octeon-los_be07e6-00002-g29a0b7e #1
> [   16.838884] task: 800000041f9dec00 task.stack: 800000041f0d8000
> [   16.844801] $ 0   : 0000000000000000 0000000010108ce1 0000000000000000 0000000000000001
> [   16.852867] $ 4   : 800000041f98a000 800000041fb67800 0000000000000000 0000000000000002
> [   16.860932] $ 8   : 800000041fb67810 800000041f0dbb40 10434794771be290 771bf16800000001
> [   16.868997] $12   : 0000000000000000 ffffffff81383edc ffffffff81296950 0000000000000000
> [   16.877060] $16   : 800000041fb67800 800000041f98a000 ffffffff81508c48 800000041f817800
> [   16.885125] $20   : 0000000000000000 0000000000000002 0000000000000000 800000041f98a000
> [   16.893190] $24   : 0000000000000000 0000000000000000                                  
> [   16.901256] $28   : 800000041f0d8000 800000041f0dbb10 800000041e109410 ffffffff81458804
> [   16.909321] Hi    : 00000000000002c9
> [   16.912896] Lo    : 0000000000001c1d
> [   16.916484] epc   : ffffffff81458570 phy_attach_direct+0x38/0x1b0
> [   16.922580] ra    : ffffffff81458804 phy_connect_direct+0x24/0x88
> [   16.928671] Status: 10108ce3	KX SX UX KERNEL EXL IE 
> [   16.933723] Cause : 00800008 (ExcCode 02)
> [   16.937730] BadVA : 0000000000000080
> [   16.941306] PrId  : 000d0601 (Cavium Octeon+)
> [   16.945660] Modules linked in: at803x
> [   16.949353] Process ifconfig (pid: 706, threadinfo=800000041f0d8000, task=800000041f9dec00, tls=00000000771c3490)
> [   16.959605] Stack : 800000041fb67800 800000041f98a000 ffffffff81508c48 0000000000000002
> [   16.967671]         0000000000000000 0000000000000000 800000041e109400 ffffffff81458804
> [   16.975734]         800000041fb67800 800000041f98a000 ffffffff81508c48 ffffffff81504cd4
> [   16.983799]         800000041f98a000 800000041f98a000 ffffffff81509830 0000000000000000
> [   16.991864]         0000000000000000 ffffffff81508f58 0000000000000000 ffffffff815087e0
> [   16.999928]         800000041f98a000 800000041f98a048 ffffffff816fcc88 0000000000001302
> [   17.007993]         0000000000008914 ffffffff8150993c ffffffff816fcc88 0000000000001302
> [   17.016057]         800000041f98a000 ffffffff8153bc6c 800000041f98a000 0000000000000008
> [   17.024122]         8000000003c500d8 0000000000000000 800000041f98a000 0000000000000341
> [   17.032187]         0000000000001043 ffffffff8153bf94 00000000000000fe 800000041f98a000
> [   17.040251]         ...
> [   17.042722] Call Trace:
> [   17.045175] [<ffffffff81458570>] phy_attach_direct+0x38/0x1b0
> [   17.050927] [<ffffffff81458804>] phy_connect_direct+0x24/0x88
> [   17.056682] [<ffffffff81504cd4>] of_phy_connect+0x54/0xb0
> [   17.062089] [<ffffffff81508f58>] cvm_oct_phy_setup_device+0x48/0xc0
> [   17.068361] [<ffffffff815087e0>] cvm_oct_common_open+0x58/0x2a8
> [   17.074285] [<ffffffff8150993c>] cvm_oct_rgmii_open+0x1c/0x90
> [   17.080040] [<ffffffff8153bc6c>] __dev_open+0x104/0x198
> [   17.085270] [<ffffffff8153bf94>] __dev_change_flags+0x94/0x180
> [   17.091107] [<ffffffff8153c0a4>] dev_change_flags+0x24/0x68
> [   17.096687] [<ffffffff815c6e30>] devinet_ioctl+0x6a8/0x8b0
> [   17.102181] [<ffffffff81516e0c>] sock_do_ioctl.constprop.14+0x24/0x68
> [   17.108626] [<ffffffff81518338>] compat_sock_ioctl+0xd18/0xfc8
> [   17.114471] [<ffffffff81296a10>] compat_SyS_ioctl+0xc0/0x1980
> [   17.120222] [<ffffffff8113109c>] syscall_common+0x18/0x3c
> [   17.125621] Code: ffb20010  dc8204b8  dcb30298 <dc420080> de640000  dc520010  12440005  00a08025  0c46e4a6 
> [   17.135490] 
> [   17.137147] ---[ end trace f1d7b064cedee4e4 ]---
> [   17.141882] Kernel panic - not syncing: Fatal exception
> [   17.147140] ---[ end Kernel panic - not syncing: Fatal exception
> 
> Git bisect points to:
> 
> commit ec988ad78ed6d184a7f4ca6b8e962b0e8f1de461
> Author: Florian Fainelli <f.fainelli@gmail.com>
> Date:   Tue Dec 6 20:54:43 2016 -0800
> 
>     phy: Don't increment MDIO bus refcount unless it's a different owner
> 
> Reverting this patch from v4.9 fixes the issue...

This should help:

diff --git a/drivers/staging/octeon/ethernet.c b/drivers/staging/octeon/ethernet.c
index 8130dfe89745..12ebc4d800c3 100644
--- a/drivers/staging/octeon/ethernet.c
+++ b/drivers/staging/octeon/ethernet.c
@@ -770,6 +770,7 @@ static int cvm_oct_probe(struct platform_device *pdev)
                        /* Initialize the device private structure. */
                        struct octeon_ethernet *priv = netdev_priv(dev);

+                       SET_NETDEV_DEV(dev, &pdev->dev);
                        dev->netdev_ops = &cvm_oct_pow_netdev_ops;
                        priv->imode = CVMX_HELPER_INTERFACE_MODE_DISABLED;
                        priv->port = CVMX_PIP_NUM_INPUT_PORTS;
-- 
Florian


* [PATCH net] bpf, test_verifier: fix a test case error result on unprivileged
From: Daniel Borkmann @ 2016-12-15  0:39 UTC
  To: davem; +Cc: ast, netdev, Daniel Borkmann

Running ./test_verifier as unprivileged lets 1 out of 98 tests fail:

  [...]
  #71 unpriv: check that printk is disallowed FAIL
  Unexpected error message!
  0: (7a) *(u64 *)(r10 -8) = 0
  1: (bf) r1 = r10
  2: (07) r1 += -8
  3: (b7) r2 = 8
  4: (bf) r3 = r1
  5: (85) call bpf_trace_printk#6
  unknown func bpf_trace_printk#6
  [...]

The test case is correct, just that the error outcome changed with
ebb676daa1a3 ("bpf: Print function name in addition to function id").
Same as with e00c7b216f34 ("bpf: fix multiple issues in selftest suite
and samples") issue 2), so just fix up the function name.

Fixes: ebb676daa1a3 ("bpf: Print function name in addition to function id")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 tools/testing/selftests/bpf/test_verifier.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 072dc63..853d7e4 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -1059,7 +1059,7 @@ struct test_val {
 			BPF_MOV64_IMM(BPF_REG_0, 0),
 			BPF_EXIT_INSN(),
 		},
-		.errstr_unpriv = "unknown func 6",
+		.errstr_unpriv = "unknown func bpf_trace_printk#6",
 		.result_unpriv = REJECT,
 		.result = ACCEPT,
 	},
-- 
1.9.3


* [BISECTED] v4.9: OCTEON ethernet crash
From: Aaro Koskinen @ 2016-12-15  0:32 UTC
  To: Florian Fainelli, David S. Miller, David Daney, netdev

Hi,

I'm getting the following crash on every boot on OCTEON (EdgeRouter Lite)
with v4.9 (right after setting up ethernet bridging):

[   16.814902] CPU 0 Unable to handle kernel paging request at virtual address 0000000000000080, epc == ffffffff81458570, ra == ffffffff81458804
[   16.827805] Oops[#1]:
[   16.830100] CPU: 0 PID: 706 Comm: ifconfig Not tainted 4.9.0-octeon-los_be07e6-00002-g29a0b7e #1
[   16.838884] task: 800000041f9dec00 task.stack: 800000041f0d8000
[   16.844801] $ 0   : 0000000000000000 0000000010108ce1 0000000000000000 0000000000000001
[   16.852867] $ 4   : 800000041f98a000 800000041fb67800 0000000000000000 0000000000000002
[   16.860932] $ 8   : 800000041fb67810 800000041f0dbb40 10434794771be290 771bf16800000001
[   16.868997] $12   : 0000000000000000 ffffffff81383edc ffffffff81296950 0000000000000000
[   16.877060] $16   : 800000041fb67800 800000041f98a000 ffffffff81508c48 800000041f817800
[   16.885125] $20   : 0000000000000000 0000000000000002 0000000000000000 800000041f98a000
[   16.893190] $24   : 0000000000000000 0000000000000000                                  
[   16.901256] $28   : 800000041f0d8000 800000041f0dbb10 800000041e109410 ffffffff81458804
[   16.909321] Hi    : 00000000000002c9
[   16.912896] Lo    : 0000000000001c1d
[   16.916484] epc   : ffffffff81458570 phy_attach_direct+0x38/0x1b0
[   16.922580] ra    : ffffffff81458804 phy_connect_direct+0x24/0x88
[   16.928671] Status: 10108ce3	KX SX UX KERNEL EXL IE 
[   16.933723] Cause : 00800008 (ExcCode 02)
[   16.937730] BadVA : 0000000000000080
[   16.941306] PrId  : 000d0601 (Cavium Octeon+)
[   16.945660] Modules linked in: at803x
[   16.949353] Process ifconfig (pid: 706, threadinfo=800000041f0d8000, task=800000041f9dec00, tls=00000000771c3490)
[   16.959605] Stack : 800000041fb67800 800000041f98a000 ffffffff81508c48 0000000000000002
[   16.967671]         0000000000000000 0000000000000000 800000041e109400 ffffffff81458804
[   16.975734]         800000041fb67800 800000041f98a000 ffffffff81508c48 ffffffff81504cd4
[   16.983799]         800000041f98a000 800000041f98a000 ffffffff81509830 0000000000000000
[   16.991864]         0000000000000000 ffffffff81508f58 0000000000000000 ffffffff815087e0
[   16.999928]         800000041f98a000 800000041f98a048 ffffffff816fcc88 0000000000001302
[   17.007993]         0000000000008914 ffffffff8150993c ffffffff816fcc88 0000000000001302
[   17.016057]         800000041f98a000 ffffffff8153bc6c 800000041f98a000 0000000000000008
[   17.024122]         8000000003c500d8 0000000000000000 800000041f98a000 0000000000000341
[   17.032187]         0000000000001043 ffffffff8153bf94 00000000000000fe 800000041f98a000
[   17.040251]         ...
[   17.042722] Call Trace:
[   17.045175] [<ffffffff81458570>] phy_attach_direct+0x38/0x1b0
[   17.050927] [<ffffffff81458804>] phy_connect_direct+0x24/0x88
[   17.056682] [<ffffffff81504cd4>] of_phy_connect+0x54/0xb0
[   17.062089] [<ffffffff81508f58>] cvm_oct_phy_setup_device+0x48/0xc0
[   17.068361] [<ffffffff815087e0>] cvm_oct_common_open+0x58/0x2a8
[   17.074285] [<ffffffff8150993c>] cvm_oct_rgmii_open+0x1c/0x90
[   17.080040] [<ffffffff8153bc6c>] __dev_open+0x104/0x198
[   17.085270] [<ffffffff8153bf94>] __dev_change_flags+0x94/0x180
[   17.091107] [<ffffffff8153c0a4>] dev_change_flags+0x24/0x68
[   17.096687] [<ffffffff815c6e30>] devinet_ioctl+0x6a8/0x8b0
[   17.102181] [<ffffffff81516e0c>] sock_do_ioctl.constprop.14+0x24/0x68
[   17.108626] [<ffffffff81518338>] compat_sock_ioctl+0xd18/0xfc8
[   17.114471] [<ffffffff81296a10>] compat_SyS_ioctl+0xc0/0x1980
[   17.120222] [<ffffffff8113109c>] syscall_common+0x18/0x3c
[   17.125621] Code: ffb20010  dc8204b8  dcb30298 <dc420080> de640000  dc520010  12440005  00a08025  0c46e4a6 
[   17.135490] 
[   17.137147] ---[ end trace f1d7b064cedee4e4 ]---
[   17.141882] Kernel panic - not syncing: Fatal exception
[   17.147140] ---[ end Kernel panic - not syncing: Fatal exception

Git bisect points to:

commit ec988ad78ed6d184a7f4ca6b8e962b0e8f1de461
Author: Florian Fainelli <f.fainelli@gmail.com>
Date:   Tue Dec 6 20:54:43 2016 -0800

    phy: Don't increment MDIO bus refcount unless it's a different owner

Reverting this patch from v4.9 fixes the issue...

A.


* [PATCH net] bpf: fix regression on verifier pruning wrt map lookups
From: Daniel Borkmann @ 2016-12-15  0:30 UTC
  To: davem; +Cc: ast, jbacik, tgraf, netdev, Daniel Borkmann

Commit 57a09bf0a416 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL
registers") introduced a regression where existing programs stopped
loading due to reaching the verifier's maximum complexity limit,
whereas prior to this commit they were loading just fine; the affected
program has roughly 2k instructions.

What was found is that state pruning couldn't be performed effectively
anymore due to mismatches of the verifier's register state, in particular
in the id tracking. It doesn't mean that 57a09bf0a416 is incorrect per
se, but rather that verifier needs to perform a lot more work for the
same program with regards to involved map lookups.

Since commit 57a09bf0a416 is only about tracking registers with type
PTR_TO_MAP_VALUE_OR_NULL, the id is only needed to follow registers
until they are promoted through pattern matching with a NULL check to
either PTR_TO_MAP_VALUE or UNKNOWN_VALUE type. After that point, the
id becomes irrelevant for the transitioned types.

For UNKNOWN_VALUE, id is already reset to 0 via mark_reg_unknown_value(),
but not so for PTR_TO_MAP_VALUE where id is becoming stale. It's even
transferred further into other types that don't make use of it. Among
others, one example is where UNKNOWN_VALUE is set on function call
return with RET_INTEGER return type.

states_equal() will then fall through the memcmp() on register state;
note that the second memcmp() uses offsetofend(), so the id is part of
that since d2a4dd37f6b4 ("bpf: fix state equivalence"). But the bisect
pointed already to 57a09bf0a416, where we really reach beyond complexity
limit. What I found was that states_equal() often failed in this
case due to id mismatches in spilled regs with registers in type
PTR_TO_MAP_VALUE. Unlike non-spilled regs, spilled regs just perform
a memcmp() on their reg state and don't have any other optimizations
in place, so the id was also relevant in this case for making a
pruning decision.

We can safely reset id to 0 as well when converting to PTR_TO_MAP_VALUE.
For the affected program, it resulted in a ~17-fold reduction of
complexity and let the program load fine again. The selftest suite also
runs fine. The only other place where env->id_gen is currently used is
direct packet access, but in those cases the id is long-lived, thus
a different scenario.
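
A minimal standalone sketch of why a stale id blocks pruning, assuming
simplified stand-ins for the verifier's types (struct reg_state,
spilled_regs_equal() and promote_to_map_value() are hypothetical names,
not the kernel's):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified register types for illustration only. */
enum reg_type { UNKNOWN_VALUE, PTR_TO_MAP_VALUE_OR_NULL, PTR_TO_MAP_VALUE };

/* Two 32-bit fields, so there is no padding to confuse memcmp(). */
struct reg_state {
	uint32_t type;
	uint32_t id;
};

/* Spilled registers are compared bytewise, so the id takes part. */
static bool spilled_regs_equal(const struct reg_state *a,
			       const struct reg_state *b)
{
	return memcmp(a, b, sizeof(*a)) == 0;
}

/* After a successful NULL check the id is no longer needed: clear it. */
static void promote_to_map_value(struct reg_state *reg)
{
	reg->type = PTR_TO_MAP_VALUE;
	reg->id = 0;
}

/* True if two states that differ only in id match once promoted. */
static bool demo_pruning_after_reset(void)
{
	struct reg_state old_state = { PTR_TO_MAP_VALUE_OR_NULL, 3 };
	struct reg_state cur_state = { PTR_TO_MAP_VALUE_OR_NULL, 7 };

	if (spilled_regs_equal(&old_state, &cur_state))
		return false;	/* ids differ, so this must not match yet */

	promote_to_map_value(&old_state);
	promote_to_map_value(&cur_state);
	return spilled_regs_equal(&old_state, &cur_state);
}
```

With the id left in place, the two states above would never compare
equal even though they are equivalent for the rest of the program.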

Also, the current logic in mark_map_regs() is not fully correct when
marking NULL branch with UNKNOWN_VALUE. We need to cache the destination
reg's id in any case. Otherwise, once we marked that reg as UNKNOWN_VALUE,
its id is reset and any subsequent registers that hold the original id
and are of type PTR_TO_MAP_VALUE_OR_NULL won't be marked UNKNOWN_VALUE
anymore, since mark_map_reg() reuses the uncached regs[regno].id that
was just overridden. Note, we don't need to cache it outside of
mark_map_regs(), since it's called once on this_branch and the other
time on other_branch, which are both two independent verifier states.
A test case for this is added here, too.
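
The caching issue can be reproduced outside the kernel. The following
is only a sketch under simplifying assumptions: NUM_REGS, mark_one(),
mark_all_uncached() and mark_all_cached() are hypothetical names that
mimic the shape of mark_map_reg()/mark_map_regs(), not the real code:

```c
#include <stdint.h>

/* Hypothetical, simplified register types for illustration only. */
enum reg_type { UNKNOWN_VALUE, PTR_TO_MAP_VALUE_OR_NULL, PTR_TO_MAP_VALUE };

#define NUM_REGS 4

struct reg_state {
	enum reg_type type;
	uint32_t id;
};

/* Give every register the same OR_NULL type and the same id. */
static void init_shared_id(struct reg_state *regs, uint32_t id)
{
	for (int i = 0; i < NUM_REGS; i++) {
		regs[i].type = PTR_TO_MAP_VALUE_OR_NULL;
		regs[i].id = id;
	}
}

/* Convert one register if it carries the id we are looking for. */
static void mark_one(struct reg_state *regs, int i, uint32_t id,
		     enum reg_type type)
{
	if (regs[i].type == PTR_TO_MAP_VALUE_OR_NULL && regs[i].id == id) {
		regs[i].type = type;
		regs[i].id = 0;
	}
}

/* Re-reads regs[regno].id each time, after it may have been cleared. */
static void mark_all_uncached(struct reg_state *regs, int regno,
			      enum reg_type type)
{
	for (int i = 0; i < NUM_REGS; i++)
		mark_one(regs, i, regs[regno].id, type);
}

/* Reads the id once, before any register is modified. */
static void mark_all_cached(struct reg_state *regs, int regno,
			    enum reg_type type)
{
	uint32_t id = regs[regno].id;

	for (int i = 0; i < NUM_REGS; i++)
		mark_one(regs, i, id, type);
}

/* Count registers still waiting for their NULL check result. */
static int count_or_null(const struct reg_state *regs)
{
	int n = 0;

	for (int i = 0; i < NUM_REGS; i++)
		if (regs[i].type == PTR_TO_MAP_VALUE_OR_NULL)
			n++;
	return n;
}
```

In the uncached variant only the destination register itself gets
converted; every other register sharing the id is left behind.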

Fixes: 57a09bf0a416 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/verifier.c                       | 11 ++++++++---
 tools/testing/selftests/bpf/test_verifier.c | 28 ++++++++++++++++++++++++++++
 2 files changed, 36 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d28f9a3..81e267b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1970,6 +1970,11 @@ static void mark_map_reg(struct bpf_reg_state *regs, u32 regno, u32 id,

 	if (reg->type == PTR_TO_MAP_VALUE_OR_NULL && reg->id == id) {
 		reg->type = type;
+		/* We don't need id from this point onwards anymore, thus we
+		 * should better reset it, so that state pruning has chances
+		 * to take effect.
+		 */
+		reg->id = 0;
 		if (type == UNKNOWN_VALUE)
 			mark_reg_unknown_value(regs, regno);
 	}
@@ -1982,16 +1987,16 @@ static void mark_map_regs(struct bpf_verifier_state *state, u32 regno,
 			  enum bpf_reg_type type)
 {
 	struct bpf_reg_state *regs = state->regs;
+	u32 id = regs[regno].id;
 	int i;

 	for (i = 0; i < MAX_BPF_REG; i++)
-		mark_map_reg(regs, i, regs[regno].id, type);
+		mark_map_reg(regs, i, id, type);

 	for (i = 0; i < MAX_BPF_STACK; i += BPF_REG_SIZE) {
 		if (state->stack_slot_type[i] != STACK_SPILL)
 			continue;
-		mark_map_reg(state->spilled_regs, i / BPF_REG_SIZE,
-			     regs[regno].id, type);
+		mark_map_reg(state->spilled_regs, i / BPF_REG_SIZE, id, type);
 	}
 }

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 0103bf2..072dc63 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -2661,6 +2661,34 @@ struct test_val {
 		.prog_type = BPF_PROG_TYPE_SCHED_CLS
 	},
 	{
+		"multiple registers share map_lookup_elem bad reg type",
+		.insns = {
+			BPF_MOV64_IMM(BPF_REG_1, 10),
+			BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_1, -8),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+				     BPF_FUNC_map_lookup_elem),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
+			BPF_MOV64_REG(BPF_REG_3, BPF_REG_0),
+			BPF_MOV64_REG(BPF_REG_4, BPF_REG_0),
+			BPF_MOV64_REG(BPF_REG_5, BPF_REG_0),
+			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),
+			BPF_MOV64_IMM(BPF_REG_1, 1),
+			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),
+			BPF_MOV64_IMM(BPF_REG_1, 2),
+			BPF_JMP_IMM(BPF_JEQ, BPF_REG_3, 0, 1),
+			BPF_ST_MEM(BPF_DW, BPF_REG_3, 0, 0),
+			BPF_MOV64_IMM(BPF_REG_1, 3),
+			BPF_EXIT_INSN(),
+		},
+		.fixup_map1 = { 4 },
+		.result = REJECT,
+		.errstr = "R3 invalid mem access 'inv'",
+		.prog_type = BPF_PROG_TYPE_SCHED_CLS
+	},
+	{
 		"invalid map access from else condition",
 		.insns = {
 			BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
-- 
1.9.3


* Re: [PATCH v3 1/3] siphash: add cryptographically secure hashtable function
From: Linus Torvalds @ 2016-12-15  0:10 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Tom Herbert, Netdev, kernel-hardening@lists.openwall.com, LKML,
	Linux Crypto Mailing List, Jean-Philippe Aumasson,
	Daniel J . Bernstein, Eric Biggers, David Laight
In-Reply-To: <CAHmME9pu6No0wqPzPpaBwQR_b+5CXvh0kke7J8ouN=rx4pxMGg@mail.gmail.com>

On Wed, Dec 14, 2016 at 3:34 PM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> Or does your reasonable dislike of "word" still allow for the use of
> dword and qword, so that the current function names of:

dword really is confusing to people.

If you have a MIPS background, it means 64 bits. While to people with
Windows programming backgrounds it means 32 bits.

Please try to avoid using it.

As mentioned, I think almost everybody agrees on the "q" part being 64
bits, but that may just be me not having seen it in any other context.

And before anybody points it out - yes, we already have lots of uses
of "dword" in various places. But they tend to be mostly
hardware-specific - either architectures or drivers.

So I'd _prefer_ to try to keep "word" and "dword" away from generic
helper routines. But it's not like anything is really black and white.

           Linus


* Re: [PATCH v3 1/3] siphash: add cryptographically secure hashtable function
From: Jason A. Donenfeld @ 2016-12-14 23:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tom Herbert, Netdev, kernel-hardening@lists.openwall.com, LKML,
	Linux Crypto Mailing List, Jean-Philippe Aumasson,
	Daniel J . Bernstein, Eric Biggers, David Laight
In-Reply-To: <CA+55aFyBGQpEKiAcs0w58ZEie+L8OrWvf_2hvGx4E=L56p5hMg@mail.gmail.com>

Hey Linus,

On Thu, Dec 15, 2016 at 12:30 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> No. The bug is talking about "words" in the first place.
>
> Depending on your background, a "word" can generally be either 16
> bits or 32 bits (or, in some cases, 18 bits).
>
> In theory, a 64-bit entity can be a "word" too, but pretty much nobody
> uses that. Even architectures that started out with a 64-bit register
> size and never had any smaller historical baggage (eg alpha) tend to
> call 32-bit entities "words".
>
> So 16 bits can be a word, but some people/architectures will call it a
> "half-word".
>
> To make matters even more confusing, a "quadword" is generally always
> 64 bits, regardless of the size of "word".
>
> So please try to avoid the use of "word" entirely. It's too ambiguous,
> and it's not even helpful as a "size of the native register". It's
> almost purely random.
>
> For the kernel, we tend to use
>
>  - uX for types that have specific sizes (X being the number of bits)
>
>  - "[unsigned] long" for native register size
>
> But never "word".

The voice of reason. Have a desired name for this function family?

siphash_3u64s
siphash_3u64
siphash_three_u64
siphash_3sixityfourbitintegers

Or does your reasonable dislike of "word" still allow for the use of
dword and qword, so that the current function names of:

siphash_3qwords
siphash_6dwords

are okay?

Jason


* Re: [PATCH v3 1/3] siphash: add cryptographically secure hashtable function
From: Linus Torvalds @ 2016-12-14 23:30 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Tom Herbert, Netdev, kernel-hardening@lists.openwall.com, LKML,
	Linux Crypto Mailing List, Jean-Philippe Aumasson,
	Daniel J . Bernstein, Eric Biggers, David Laight
In-Reply-To: <CAHmME9rpvf4tyDjZcJAJxMAW1LcqNm7DiquiYX0uQhRzDLbwqw@mail.gmail.com>

On Wed, Dec 14, 2016 at 2:56 PM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> So actually jhash_Nwords makes no sense, since it takes dwords
> (32-bits) not words (16-bits). The siphash analog should be called
> siphash24_Nqwords.

No. The bug is talking about "words" in the first place.

Depending on your background, a "word" can generally be either 16
bits or 32 bits (or, in some cases, 18 bits).

In theory, a 64-bit entity can be a "word" too, but pretty much nobody
uses that. Even architectures that started out with a 64-bit register
size and never had any smaller historical baggage (eg alpha) tend to
call 32-bit entities "words".

So 16 bits can be a word, but some people/architectures will call it a
"half-word".

To make matters even more confusing, a "quadword" is generally always
64 bits, regardless of the size of "word".

So please try to avoid the use of "word" entirely. It's too ambiguous,
and it's not even helpful as a "size of the native register". It's
almost purely random.

For the kernel, we tend to use

 - uX for types that have specific sizes (X being the number of bits)

 - "[unsigned] long" for native register size

But never "word".
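
As a rough userspace illustration of that convention (the typedefs
below are stand-ins for the kernel's own u16/u32/u64, and
native_register_bits() is a made-up helper; "long" matching the
register size is an assumption that holds for Linux ABIs but not
everywhere):

```c
#include <limits.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's fixed-size types. */
typedef uint16_t u16;
typedef uint32_t u32;
typedef uint64_t u64;

/* The size is part of the name, so there is nothing left to guess. */
_Static_assert(sizeof(u16) * CHAR_BIT == 16, "u16 must be 16 bits");
_Static_assert(sizeof(u32) * CHAR_BIT == 32, "u32 must be 32 bits");
_Static_assert(sizeof(u64) * CHAR_BIT == 64, "u64 must be 64 bits");

/* Width of "unsigned long", used as the native register size on Linux. */
static inline unsigned int native_register_bits(void)
{
	return (unsigned int)(sizeof(unsigned long) * CHAR_BIT);
}
```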

           Linus


* Re: [PATCH v2 1/4] siphash: add cryptographically secure hashtable function
From: Jason A. Donenfeld @ 2016-12-14 23:29 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: David Laight, Netdev, kernel-hardening, Jean-Philippe Aumasson,
	LKML, Linux Crypto Mailing List, Daniel J . Bernstein,
	Linus Torvalds, Eric Biggers
In-Reply-To: <8ea3fdff-23c4-b81d-2588-44549bd2d8c1@stressinduktion.org>

Hi Hannes,

On Wed, Dec 14, 2016 at 11:03 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> I fear that the alignment requirement will be a source of bugs on 32 bit
> machines, where you cannot even simply take a well aligned struct on a
> stack and put it into the normal siphash(aligned) function without
> adding alignment annotations everywhere. Even blocks returned from
> kmalloc on 32 bit are not aligned to 64 bit.

That's what the "__aligned(SIPHASH24_ALIGNMENT)" attribute is for. The
aligned siphash function will be for structs explicitly made for
siphash consumption. For everything else there's siphash_unaligned.

> Can we do this as a runtime check and just have one function (siphash)
> dealing with that?

Seems like the runtime branching on the aligned function would be bad
for performance, when we likely know at compile time if it's going to
be aligned or not. I suppose we could add that check just to the
unaligned version, and rename it to "maybe_unaligned"? Is this what
you have in mind?
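
Roughly, such a "maybe_unaligned" entry point could look like the
sketch below. This is only an illustration of the runtime check, not
the proposed patch: HASH_ALIGNMENT stands in for SIPHASH24_ALIGNMENT,
and struct aligned_input, is_hash_aligned() and
load64_maybe_unaligned() are hypothetical names:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define HASH_ALIGNMENT 8	/* stand-in for SIPHASH24_ALIGNMENT */

/* A struct built for hashing: the annotation guarantees alignment. */
struct aligned_input {
	_Alignas(HASH_ALIGNMENT) uint64_t words[2];
};

/* The runtime check: is the address a multiple of the alignment? */
static bool is_hash_aligned(const void *p)
{
	return ((uintptr_t)p % HASH_ALIGNMENT) == 0;
}

/* One entry point: branch once, then use the cheapest safe load. */
static uint64_t load64_maybe_unaligned(const void *p)
{
	uint64_t v;

	if (is_hash_aligned(p))
		return *(const uint64_t *)p;	/* aligned fast path */

	memcpy(&v, p, sizeof(v));		/* safe for any address */
	return v;
}
```

The branch is the cost being discussed: callers that already know
their data is aligned would pay for a check they do not need.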

Jason


* Re: [PATCH v3 1/3] siphash: add cryptographically secure hashtable function
From: Jason A. Donenfeld @ 2016-12-14 23:17 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Netdev, kernel-hardening, LKML, Linux Crypto Mailing List,
	Jean-Philippe Aumasson, Daniel J . Bernstein, Linus Torvalds,
	Eric Biggers, David Laight
In-Reply-To: <CALx6S349hOFhnMgM_TgKXC1O7bmOvR87Nm=5B7_sNLEWiZU8Zg@mail.gmail.com>

Hey Tom,

On Thu, Dec 15, 2016 at 12:14 AM, Tom Herbert <tom@herbertland.com> wrote:
> I'm confused, doesn't 2dword == 1qword? Anyway, I think the qword
> functions are good enough. If someone needs to hash over some odd
> length they can either put them in a structure padded to 64 bits or
> call the hash function that takes a byte length.

Yes. Here's an example:

static inline u64 siphash24_2dwords(const u32 a, const u32 b,
				    const u8 key[SIPHASH24_KEY_LEN])
{
       return siphash24_1qword(((u64)b << 32) | a, key);
}

This winds up being extremely useful and syntactically convenient in a
few places. Check out my git branch in about 10 minutes or wait for v4
to be posted tomorrow; these are nice helpers.
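
The packing step on its own, as a standalone userspace sketch
(pack_2u32() is a hypothetical name used only here):

```c
#include <stdint.h>

/*
 * Combine two 32-bit values into one 64-bit value: a ends up in the
 * low half, b in the high half. The cast must happen before the
 * shift, since shifting a 32-bit value by 32 is undefined in C.
 */
static inline uint64_t pack_2u32(uint32_t a, uint32_t b)
{
	return ((uint64_t)b << 32) | a;
}
```

For example, pack_2u32(1, 2) gives 0x0000000200000001, which is then
handed to the single 64-bit variant together with the key.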

> I'd still drop the "24" unless you really think we're going to have
> multiple variants coming into the kernel.

Okay. I don't have a problem with this, unless anybody has some reason
to the contrary.

Jason


* Re: [PATCH v3 1/3] siphash: add cryptographically secure hashtable function
From: Tom Herbert @ 2016-12-14 23:14 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Netdev, kernel-hardening, LKML, Linux Crypto Mailing List,
	Jean-Philippe Aumasson, Daniel J . Bernstein, Linus Torvalds,
	Eric Biggers, David Laight
In-Reply-To: <CAHmME9rpvf4tyDjZcJAJxMAW1LcqNm7DiquiYX0uQhRzDLbwqw@mail.gmail.com>

On Wed, Dec 14, 2016 at 2:56 PM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> Hey Tom,
>
> On Wed, Dec 14, 2016 at 10:35 PM, Tom Herbert <tom@herbertland.com> wrote:
>> Those look good, although I would probably just do 1,2,3 words and
>> then have a function that takes n words like jhash. Might want to call
>> these dword to distinguish from 32 bit words in jhash.
>
> So actually jhash_Nwords makes no sense, since it takes dwords
> (32-bits) not words (16-bits). The siphash analog should be called
> siphash24_Nqwords.
>
Yeah, that's a "bug" with jhash function names.

> I think what I'll do is change what I already have to:
> siphash24_1qword
> siphash24_2qword
> siphash24_3qword
> siphash24_4qword
>
> And then add some static inline helpers to assist with smaller u32s
> like ipv4 addresses called:
>
> siphash24_2dword
> siphash24_4dword
> siphash24_6dword
> siphash24_8dword
>
> While we're having something new, might as well call it the right thing.
>
I'm confused, doesn't 2dword == 1qword? Anyway, I think the qword
functions are good enough. If someone needs to hash over some odd
length they can either put them in a structure padded to 64 bits or
call the hash function that takes a byte length.

>
>> Also, what is the significance of "24" in the function and constant
>> names? Can we just drop that and call this siphash?
>
> SipHash is actually a family of PRFs, differentiated by the number of
> SIPROUNDs after each 64-bit input is processed and the number of
> SIPROUNDs at the very end of the function. The best trade-off of speed
> and security for kernel usage is 2 rounds after each 64-bit input and
> 4 rounds at the end of the function. This doesn't fall to any known
> cryptanalysis and it's very fast.

I'd still drop the "24" unless you really think we're going to have
multiple variants coming into the kernel.

Tom


* [PATCH] net: sfc: use new api ethtool_{get|set}_link_ksettings
From: Philippe Reynes @ 2016-12-14 23:12 UTC (permalink / raw)
  To: linux-net-drivers, ecree, bkenward; +Cc: netdev, linux-kernel, Philippe Reynes

The ethtool API {get|set}_settings is deprecated.
Move this driver to the new API {get|set}_link_ksettings.
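
For readers unfamiliar with the conversion helpers, the round trip
used in the diff (bitmap to legacy u32, OR in some flags, back to
bitmap) can be modelled in userspace as follows. This is a sketch
under assumptions, not the kernel implementation: struct
link_mode_mask, LINK_MODE_WORDS, the DEMO_* bit positions and all
three function names are invented for illustration:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINK_MODE_WORDS	2		/* hypothetical: 64 mode bits */
#define DEMO_PAUSE	(1u << 13)	/* illustrative bit positions */
#define DEMO_ASYM_PAUSE	(1u << 14)

struct link_mode_mask {
	uint32_t bits[LINK_MODE_WORDS];
};

/* Legacy masks only cover the first 32 modes; the rest is cleared. */
static void legacy_u32_to_link_mode(struct link_mode_mask *dst,
				    uint32_t legacy)
{
	memset(dst, 0, sizeof(*dst));
	dst->bits[0] = legacy;
}

/* Returns false if modes beyond bit 31 had to be dropped. */
static bool link_mode_to_legacy_u32(uint32_t *legacy,
				    const struct link_mode_mask *src)
{
	bool fits = true;

	for (int i = 1; i < LINK_MODE_WORDS; i++)
		if (src->bits[i])
			fits = false;

	*legacy = src->bits[0];
	return fits;
}

/* The read-modify-write pattern: convert, OR in flags, convert back. */
static bool add_pause_modes(struct link_mode_mask *mask)
{
	uint32_t legacy;
	bool fits = link_mode_to_legacy_u32(&legacy, mask);

	legacy |= DEMO_PAUSE | DEMO_ASYM_PAUSE;
	legacy_u32_to_link_mode(mask, legacy);
	return fits;
}
```

In this model, any mode above bit 31 is lost by the round trip, which
is worth keeping in mind wherever such a pattern is applied to
hardware that reports newer link modes.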

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
---
 drivers/net/ethernet/sfc/ethtool.c    |   35 ++++++++++++-------
 drivers/net/ethernet/sfc/mcdi_port.c  |   60 ++++++++++++++++++++------------
 drivers/net/ethernet/sfc/net_driver.h |   12 +++---
 3 files changed, 65 insertions(+), 42 deletions(-)

diff --git a/drivers/net/ethernet/sfc/ethtool.c b/drivers/net/ethernet/sfc/ethtool.c
index f644216..87bdc56 100644
--- a/drivers/net/ethernet/sfc/ethtool.c
+++ b/drivers/net/ethernet/sfc/ethtool.c
@@ -120,44 +120,53 @@ static int efx_ethtool_phys_id(struct net_device *net_dev,
 }
 
 /* This must be called with rtnl_lock held. */
-static int efx_ethtool_get_settings(struct net_device *net_dev,
-				    struct ethtool_cmd *ecmd)
+static int
+efx_ethtool_get_link_ksettings(struct net_device *net_dev,
+			       struct ethtool_link_ksettings *cmd)
 {
 	struct efx_nic *efx = netdev_priv(net_dev);
 	struct efx_link_state *link_state = &efx->link_state;
+	u32 supported;
 
 	mutex_lock(&efx->mac_lock);
-	efx->phy_op->get_settings(efx, ecmd);
+	efx->phy_op->get_link_ksettings(efx, cmd);
 	mutex_unlock(&efx->mac_lock);
 
 	/* Both MACs support pause frames (bidirectional and respond-only) */
-	ecmd->supported |= SUPPORTED_Pause | SUPPORTED_Asym_Pause;
+	ethtool_convert_link_mode_to_legacy_u32(&supported,
+						cmd->link_modes.supported);
+
+	supported |= SUPPORTED_Pause | SUPPORTED_Asym_Pause;
+
+	ethtool_convert_legacy_u32_to_link_mode(cmd->link_modes.supported,
+						supported);
 
 	if (LOOPBACK_INTERNAL(efx)) {
-		ethtool_cmd_speed_set(ecmd, link_state->speed);
-		ecmd->duplex = link_state->fd ? DUPLEX_FULL : DUPLEX_HALF;
+		cmd->base.speed = link_state->speed;
+		cmd->base.duplex = link_state->fd ? DUPLEX_FULL : DUPLEX_HALF;
 	}
 
 	return 0;
 }
 
 /* This must be called with rtnl_lock held. */
-static int efx_ethtool_set_settings(struct net_device *net_dev,
-				    struct ethtool_cmd *ecmd)
+static int
+efx_ethtool_set_link_ksettings(struct net_device *net_dev,
+			       const struct ethtool_link_ksettings *cmd)
 {
 	struct efx_nic *efx = netdev_priv(net_dev);
 	int rc;
 
 	/* GMAC does not support 1000Mbps HD */
-	if ((ethtool_cmd_speed(ecmd) == SPEED_1000) &&
-	    (ecmd->duplex != DUPLEX_FULL)) {
+	if ((cmd->base.speed == SPEED_1000) &&
+	    (cmd->base.duplex != DUPLEX_FULL)) {
 		netif_dbg(efx, drv, efx->net_dev,
 			  "rejecting unsupported 1000Mbps HD setting\n");
 		return -EINVAL;
 	}
 
 	mutex_lock(&efx->mac_lock);
-	rc = efx->phy_op->set_settings(efx, ecmd);
+	rc = efx->phy_op->set_link_ksettings(efx, cmd);
 	mutex_unlock(&efx->mac_lock);
 	return rc;
 }
@@ -1342,8 +1351,6 @@ static int efx_ethtool_get_module_info(struct net_device *net_dev,
 }
 
 const struct ethtool_ops efx_ethtool_ops = {
-	.get_settings		= efx_ethtool_get_settings,
-	.set_settings		= efx_ethtool_set_settings,
 	.get_drvinfo		= efx_ethtool_get_drvinfo,
 	.get_regs_len		= efx_ethtool_get_regs_len,
 	.get_regs		= efx_ethtool_get_regs,
@@ -1373,4 +1380,6 @@ static int efx_ethtool_get_module_info(struct net_device *net_dev,
 	.get_ts_info		= efx_ethtool_get_ts_info,
 	.get_module_info	= efx_ethtool_get_module_info,
 	.get_module_eeprom	= efx_ethtool_get_module_eeprom,
+	.get_link_ksettings	= efx_ethtool_get_link_ksettings,
+	.set_link_ksettings	= efx_ethtool_set_link_ksettings,
 };
diff --git a/drivers/net/ethernet/sfc/mcdi_port.c b/drivers/net/ethernet/sfc/mcdi_port.c
index 9dcd396..c905971 100644
--- a/drivers/net/ethernet/sfc/mcdi_port.c
+++ b/drivers/net/ethernet/sfc/mcdi_port.c
@@ -503,45 +503,59 @@ static void efx_mcdi_phy_remove(struct efx_nic *efx)
 	kfree(phy_data);
 }
 
-static void efx_mcdi_phy_get_settings(struct efx_nic *efx, struct ethtool_cmd *ecmd)
+static void efx_mcdi_phy_get_link_ksettings(struct efx_nic *efx,
+					    struct ethtool_link_ksettings *cmd)
 {
 	struct efx_mcdi_phy_data *phy_cfg = efx->phy_data;
 	MCDI_DECLARE_BUF(outbuf, MC_CMD_GET_LINK_OUT_LEN);
 	int rc;
-
-	ecmd->supported =
-		mcdi_to_ethtool_cap(phy_cfg->media, phy_cfg->supported_cap);
-	ecmd->advertising = efx->link_advertising;
-	ethtool_cmd_speed_set(ecmd, efx->link_state.speed);
-	ecmd->duplex = efx->link_state.fd;
-	ecmd->port = mcdi_to_ethtool_media(phy_cfg->media);
-	ecmd->phy_address = phy_cfg->port;
-	ecmd->transceiver = XCVR_INTERNAL;
-	ecmd->autoneg = !!(efx->link_advertising & ADVERTISED_Autoneg);
-	ecmd->mdio_support = (efx->mdio.mode_support &
+	u32 supported, advertising, lp_advertising;
+
+	supported = mcdi_to_ethtool_cap(phy_cfg->media, phy_cfg->supported_cap);
+	advertising = efx->link_advertising;
+	cmd->base.speed = efx->link_state.speed;
+	cmd->base.duplex = efx->link_state.fd;
+	cmd->base.port = mcdi_to_ethtool_media(phy_cfg->media);
+	cmd->base.phy_address = phy_cfg->port;
+	cmd->base.autoneg = !!(efx->link_advertising & ADVERTISED_Autoneg);
+	cmd->base.mdio_support = (efx->mdio.mode_support &
 			      (MDIO_SUPPORTS_C45 | MDIO_SUPPORTS_C22));
 
+	ethtool_convert_legacy_u32_to_link_mode(cmd->link_modes.supported,
+						supported);
+	ethtool_convert_legacy_u32_to_link_mode(cmd->link_modes.advertising,
+						advertising);
+
 	BUILD_BUG_ON(MC_CMD_GET_LINK_IN_LEN != 0);
 	rc = efx_mcdi_rpc(efx, MC_CMD_GET_LINK, NULL, 0,
 			  outbuf, sizeof(outbuf), NULL);
 	if (rc)
 		return;
-	ecmd->lp_advertising =
+	lp_advertising =
 		mcdi_to_ethtool_cap(phy_cfg->media,
 				    MCDI_DWORD(outbuf, GET_LINK_OUT_LP_CAP));
+
+	ethtool_convert_legacy_u32_to_link_mode(cmd->link_modes.lp_advertising,
+						lp_advertising);
 }
 
-static int efx_mcdi_phy_set_settings(struct efx_nic *efx, struct ethtool_cmd *ecmd)
+static int
+efx_mcdi_phy_set_link_ksettings(struct efx_nic *efx,
+				const struct ethtool_link_ksettings *cmd)
 {
 	struct efx_mcdi_phy_data *phy_cfg = efx->phy_data;
 	u32 caps;
 	int rc;
+	u32 advertising;
+
+	ethtool_convert_link_mode_to_legacy_u32(&advertising,
+						cmd->link_modes.advertising);
 
-	if (ecmd->autoneg) {
-		caps = (ethtool_to_mcdi_cap(ecmd->advertising) |
+	if (cmd->base.autoneg) {
+		caps = (ethtool_to_mcdi_cap(advertising) |
 			 1 << MC_CMD_PHY_CAP_AN_LBN);
-	} else if (ecmd->duplex) {
-		switch (ethtool_cmd_speed(ecmd)) {
+	} else if (cmd->base.duplex) {
+		switch (cmd->base.speed) {
 		case 10:    caps = 1 << MC_CMD_PHY_CAP_10FDX_LBN;    break;
 		case 100:   caps = 1 << MC_CMD_PHY_CAP_100FDX_LBN;   break;
 		case 1000:  caps = 1 << MC_CMD_PHY_CAP_1000FDX_LBN;  break;
@@ -550,7 +564,7 @@ static int efx_mcdi_phy_set_settings(struct efx_nic *efx, struct ethtool_cmd *ec
 		default:    return -EINVAL;
 		}
 	} else {
-		switch (ethtool_cmd_speed(ecmd)) {
+		switch (cmd->base.speed) {
 		case 10:    caps = 1 << MC_CMD_PHY_CAP_10HDX_LBN;    break;
 		case 100:   caps = 1 << MC_CMD_PHY_CAP_100HDX_LBN;   break;
 		case 1000:  caps = 1 << MC_CMD_PHY_CAP_1000HDX_LBN;  break;
@@ -563,9 +577,9 @@ static int efx_mcdi_phy_set_settings(struct efx_nic *efx, struct ethtool_cmd *ec
 	if (rc)
 		return rc;
 
-	if (ecmd->autoneg) {
+	if (cmd->base.autoneg) {
 		efx_link_set_advertising(
-			efx, ecmd->advertising | ADVERTISED_Autoneg);
+			efx, advertising | ADVERTISED_Autoneg);
 		phy_cfg->forced_cap = 0;
 	} else {
 		efx_link_set_advertising(efx, 0);
@@ -812,8 +826,8 @@ static int efx_mcdi_phy_get_module_info(struct efx_nic *efx,
 	.poll		= efx_mcdi_phy_poll,
 	.fini		= efx_port_dummy_op_void,
 	.remove		= efx_mcdi_phy_remove,
-	.get_settings	= efx_mcdi_phy_get_settings,
-	.set_settings	= efx_mcdi_phy_set_settings,
+	.get_link_ksettings = efx_mcdi_phy_get_link_ksettings,
+	.set_link_ksettings = efx_mcdi_phy_set_link_ksettings,
 	.test_alive	= efx_mcdi_phy_test_alive,
 	.run_tests	= efx_mcdi_phy_run_tests,
 	.test_name	= efx_mcdi_phy_test_name,
diff --git a/drivers/net/ethernet/sfc/net_driver.h b/drivers/net/ethernet/sfc/net_driver.h
index 8692e82..1a635ce 100644
--- a/drivers/net/ethernet/sfc/net_driver.h
+++ b/drivers/net/ethernet/sfc/net_driver.h
@@ -720,8 +720,8 @@ static inline bool efx_link_state_equal(const struct efx_link_state *left,
  * @reconfigure: Reconfigure PHY (e.g. for new link parameters)
  * @poll: Update @link_state and report whether it changed.
  *	Serialised by the mac_lock.
- * @get_settings: Get ethtool settings. Serialised by the mac_lock.
- * @set_settings: Set ethtool settings. Serialised by the mac_lock.
+ * @get_link_ksettings: Get ethtool settings. Serialised by the mac_lock.
+ * @set_link_ksettings: Set ethtool settings. Serialised by the mac_lock.
  * @set_npage_adv: Set abilities advertised in (Extended) Next Page
  *	(only needed where AN bit is set in mmds)
  * @test_alive: Test that PHY is 'alive' (online)
@@ -736,10 +736,10 @@ struct efx_phy_operations {
 	void (*remove) (struct efx_nic *efx);
 	int (*reconfigure) (struct efx_nic *efx);
 	bool (*poll) (struct efx_nic *efx);
-	void (*get_settings) (struct efx_nic *efx,
-			      struct ethtool_cmd *ecmd);
-	int (*set_settings) (struct efx_nic *efx,
-			     struct ethtool_cmd *ecmd);
+	void (*get_link_ksettings)(struct efx_nic *efx,
+				   struct ethtool_link_ksettings *cmd);
+	int (*set_link_ksettings)(struct efx_nic *efx,
+				  const struct ethtool_link_ksettings *cmd);
 	void (*set_npage_adv) (struct efx_nic *efx, u32);
 	int (*test_alive) (struct efx_nic *efx);
 	const char *(*test_name) (struct efx_nic *efx, unsigned int index);
-- 
1.7.4.4


* Re: [PATCH v3 1/3] siphash: add cryptographically secure hashtable function
From: Jason A. Donenfeld @ 2016-12-14 22:56 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Netdev, kernel-hardening, LKML, Linux Crypto Mailing List,
	Jean-Philippe Aumasson, Daniel J . Bernstein, Linus Torvalds,
	Eric Biggers, David Laight
In-Reply-To: <CALx6S35VBjw42G6rHPrNfVaBfLMz3YZVjs3D3hBG=4gp5+g5tA@mail.gmail.com>

Hey Tom,

On Wed, Dec 14, 2016 at 10:35 PM, Tom Herbert <tom@herbertland.com> wrote:
> Those look good, although I would probably just do 1,2,3 words and
> then have a function that takes n words like jhash. Might want to call
> these dword to distinguish from 32 bit words in jhash.

So actually jhash_Nwords makes no sense, since it takes dwords
(32-bits) not words (16-bits). The siphash analog should be called
siphash24_Nqwords.

I think what I'll do is change what I already have to:
siphash24_1qword
siphash24_2qword
siphash24_3qword
siphash24_4qword

And then add some static inline helpers to assist with smaller u32s
like ipv4 addresses called:

siphash24_2dword
siphash24_4dword
siphash24_6dword
siphash24_8dword

While we're having something new, might as well call it the right thing.

> Also, what is the significance of "24" in the function and constant
> names? Can we just drop that and call this siphash?

SipHash is actually a family of PRFs, differentiated by the number of
SIPROUNDs after each 64-bit input is processed and the number of
SIPROUNDs at the very end of the function. The best trade-off of speed
and security for kernel usage is 2 rounds after each 64-bit input and
4 rounds at the end of the function. This doesn't fall to any known
cryptanalysis and it's very fast.
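
To make the two round counts concrete, here is a standalone sketch of
the construction with c and d as parameters (c = 2, d = 4 gives
SipHash-2-4). It only handles whole 64-bit inputs and is not the code
from the patch; siphash_u64s(), struct sip_state and the other helper
names are made up for this example:

```c
#include <stddef.h>
#include <stdint.h>

struct sip_state {
	uint64_t v0, v1, v2, v3;
};

static inline uint64_t rotl64(uint64_t x, unsigned int b)
{
	return (x << b) | (x >> (64 - b));
}

/* One SIPROUND: add, rotate and xor across the four state words. */
static void sipround(struct sip_state *s)
{
	s->v0 += s->v1; s->v1 = rotl64(s->v1, 13); s->v1 ^= s->v0;
	s->v0 = rotl64(s->v0, 32);
	s->v2 += s->v3; s->v3 = rotl64(s->v3, 16); s->v3 ^= s->v2;
	s->v0 += s->v3; s->v3 = rotl64(s->v3, 21); s->v3 ^= s->v0;
	s->v2 += s->v1; s->v1 = rotl64(s->v1, 17); s->v1 ^= s->v2;
	s->v2 = rotl64(s->v2, 32);
}

/* Hash n 64-bit inputs: c rounds per input, d rounds at the end. */
static uint64_t siphash_u64s(const uint64_t *words, size_t n,
			     uint64_t k0, uint64_t k1,
			     unsigned int c, unsigned int d)
{
	struct sip_state s = {
		k0 ^ 0x736f6d6570736575ULL,
		k1 ^ 0x646f72616e646f6dULL,
		k0 ^ 0x6c7967656e657261ULL,
		k1 ^ 0x7465646279746573ULL,
	};
	/* Final block: input length in bytes (mod 256) in the top byte. */
	uint64_t last = (uint64_t)(n * 8) << 56;
	unsigned int r;
	size_t i;

	for (i = 0; i < n; i++) {
		s.v3 ^= words[i];
		for (r = 0; r < c; r++)
			sipround(&s);
		s.v0 ^= words[i];
	}

	s.v3 ^= last;
	for (r = 0; r < c; r++)
		sipround(&s);
	s.v0 ^= last;

	s.v2 ^= 0xff;
	for (r = 0; r < d; r++)
		sipround(&s);

	return s.v0 ^ s.v1 ^ s.v2 ^ s.v3;
}
```

Calling it with c = 2 and d = 4 corresponds to the "24" in the
function names; other pairs select other members of the family.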


* [PATCH perf/core REBASE 2/5] samples/bpf: Switch over to libbpf
From: Joe Stringer @ 2016-12-14 22:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: netdev, wangnan0, ast, daniel, acme, Arnaldo Carvalho de Melo
In-Reply-To: <20161214224342.12858-1-joe@ovn.org>

Now that libbpf under tools/lib/bpf/* is synced with the version from
samples/bpf, we can get rid of most of the libbpf library here.

Signed-off-by: Joe Stringer <joe@ovn.org>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/r/20161209024620.31660-6-joe@ovn.org
[ Use -I$(srctree)/tools/lib/ to support out of source code tree builds, as noticed by Wang Nan ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 samples/bpf/Makefile   |  67 +++++++++++++++--------------
 samples/bpf/README.rst |   4 +-
 samples/bpf/libbpf.c   | 111 -------------------------------------------------
 samples/bpf/libbpf.h   |  19 +--------
 4 files changed, 39 insertions(+), 162 deletions(-)

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index f2219c1489e5..add514e2984a 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -35,40 +35,43 @@ hostprogs-y += tc_l2_redirect
 hostprogs-y += lwt_len_hist
 hostprogs-y += xdp_tx_iptunnel
 
-test_lru_dist-objs := test_lru_dist.o libbpf.o
-sock_example-objs := sock_example.o libbpf.o
-fds_example-objs := bpf_load.o libbpf.o fds_example.o
-sockex1-objs := bpf_load.o libbpf.o sockex1_user.o
-sockex2-objs := bpf_load.o libbpf.o sockex2_user.o
-sockex3-objs := bpf_load.o libbpf.o sockex3_user.o
-tracex1-objs := bpf_load.o libbpf.o tracex1_user.o
-tracex2-objs := bpf_load.o libbpf.o tracex2_user.o
-tracex3-objs := bpf_load.o libbpf.o tracex3_user.o
-tracex4-objs := bpf_load.o libbpf.o tracex4_user.o
-tracex5-objs := bpf_load.o libbpf.o tracex5_user.o
-tracex6-objs := bpf_load.o libbpf.o tracex6_user.o
-test_probe_write_user-objs := bpf_load.o libbpf.o test_probe_write_user_user.o
-trace_output-objs := bpf_load.o libbpf.o trace_output_user.o
-lathist-objs := bpf_load.o libbpf.o lathist_user.o
-offwaketime-objs := bpf_load.o libbpf.o offwaketime_user.o
-spintest-objs := bpf_load.o libbpf.o spintest_user.o
-map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
-test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
-test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
-test_cgrp2_attach-objs := libbpf.o test_cgrp2_attach.o
-test_cgrp2_attach2-objs := libbpf.o test_cgrp2_attach2.o cgroup_helpers.o
-test_cgrp2_sock-objs := libbpf.o test_cgrp2_sock.o
-test_cgrp2_sock2-objs := bpf_load.o libbpf.o test_cgrp2_sock2.o
-xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
+# Libbpf dependencies
+LIBBPF := libbpf.o ../../tools/lib/bpf/bpf.o
+
+test_lru_dist-objs := test_lru_dist.o $(LIBBPF)
+sock_example-objs := sock_example.o $(LIBBPF)
+fds_example-objs := bpf_load.o $(LIBBPF) fds_example.o
+sockex1-objs := bpf_load.o $(LIBBPF) sockex1_user.o
+sockex2-objs := bpf_load.o $(LIBBPF) sockex2_user.o
+sockex3-objs := bpf_load.o $(LIBBPF) sockex3_user.o
+tracex1-objs := bpf_load.o $(LIBBPF) tracex1_user.o
+tracex2-objs := bpf_load.o $(LIBBPF) tracex2_user.o
+tracex3-objs := bpf_load.o $(LIBBPF) tracex3_user.o
+tracex4-objs := bpf_load.o $(LIBBPF) tracex4_user.o
+tracex5-objs := bpf_load.o $(LIBBPF) tracex5_user.o
+tracex6-objs := bpf_load.o $(LIBBPF) tracex6_user.o
+test_probe_write_user-objs := bpf_load.o $(LIBBPF) test_probe_write_user_user.o
+trace_output-objs := bpf_load.o $(LIBBPF) trace_output_user.o
+lathist-objs := bpf_load.o $(LIBBPF) lathist_user.o
+offwaketime-objs := bpf_load.o $(LIBBPF) offwaketime_user.o
+spintest-objs := bpf_load.o $(LIBBPF) spintest_user.o
+map_perf_test-objs := bpf_load.o $(LIBBPF) map_perf_test_user.o
+test_overhead-objs := bpf_load.o $(LIBBPF) test_overhead_user.o
+test_cgrp2_array_pin-objs := $(LIBBPF) test_cgrp2_array_pin.o
+test_cgrp2_attach-objs := $(LIBBPF) test_cgrp2_attach.o
+test_cgrp2_attach2-objs := $(LIBBPF) test_cgrp2_attach2.o cgroup_helpers.o
+test_cgrp2_sock-objs := $(LIBBPF) test_cgrp2_sock.o
+test_cgrp2_sock2-objs := bpf_load.o $(LIBBPF) test_cgrp2_sock2.o
+xdp1-objs := bpf_load.o $(LIBBPF) xdp1_user.o
 # reuse xdp1 source intentionally
-xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
-test_current_task_under_cgroup-objs := bpf_load.o libbpf.o cgroup_helpers.o \
+xdp2-objs := bpf_load.o $(LIBBPF) xdp1_user.o
+test_current_task_under_cgroup-objs := bpf_load.o $(LIBBPF) cgroup_helpers.o \
 				       test_current_task_under_cgroup_user.o
-trace_event-objs := bpf_load.o libbpf.o trace_event_user.o
-sampleip-objs := bpf_load.o libbpf.o sampleip_user.o
-tc_l2_redirect-objs := bpf_load.o libbpf.o tc_l2_redirect_user.o
-lwt_len_hist-objs := bpf_load.o libbpf.o lwt_len_hist_user.o
-xdp_tx_iptunnel-objs := bpf_load.o libbpf.o xdp_tx_iptunnel_user.o
+trace_event-objs := bpf_load.o $(LIBBPF) trace_event_user.o
+sampleip-objs := bpf_load.o $(LIBBPF) sampleip_user.o
+tc_l2_redirect-objs := bpf_load.o $(LIBBPF) tc_l2_redirect_user.o
+lwt_len_hist-objs := bpf_load.o $(LIBBPF) lwt_len_hist_user.o
+xdp_tx_iptunnel-objs := bpf_load.o $(LIBBPF) xdp_tx_iptunnel_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
diff --git a/samples/bpf/README.rst b/samples/bpf/README.rst
index a43eae3f0551..79f9a58f1872 100644
--- a/samples/bpf/README.rst
+++ b/samples/bpf/README.rst
@@ -1,8 +1,8 @@
 eBPF sample programs
 ====================
 
-This directory contains a mini eBPF library, test stubs, verifier
-test-suite and examples for using eBPF.
+This directory contains a test stubs, verifier test-suite and examples
+for using eBPF. The examples use libbpf from tools/lib/bpf.
 
 Build dependencies
 ==================
diff --git a/samples/bpf/libbpf.c b/samples/bpf/libbpf.c
index 6f076abdca35..3391225ad7e9 100644
--- a/samples/bpf/libbpf.c
+++ b/samples/bpf/libbpf.c
@@ -4,8 +4,6 @@
 #include <linux/unistd.h>
 #include <unistd.h>
 #include <string.h>
-#include <linux/netlink.h>
-#include <linux/bpf.h>
 #include <errno.h>
 #include <net/ethernet.h>
 #include <net/if.h>
@@ -13,96 +11,6 @@
 #include <arpa/inet.h>
 #include "libbpf.h"
 
-static __u64 ptr_to_u64(void *ptr)
-{
-	return (__u64) (unsigned long) ptr;
-}
-
-int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size,
-		   int max_entries, int map_flags)
-{
-	union bpf_attr attr = {
-		.map_type = map_type,
-		.key_size = key_size,
-		.value_size = value_size,
-		.max_entries = max_entries,
-		.map_flags = map_flags,
-	};
-
-	return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
-}
-
-int bpf_map_update_elem(int fd, void *key, void *value, unsigned long long flags)
-{
-	union bpf_attr attr = {
-		.map_fd = fd,
-		.key = ptr_to_u64(key),
-		.value = ptr_to_u64(value),
-		.flags = flags,
-	};
-
-	return syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
-}
-
-int bpf_map_lookup_elem(int fd, void *key, void *value)
-{
-	union bpf_attr attr = {
-		.map_fd = fd,
-		.key = ptr_to_u64(key),
-		.value = ptr_to_u64(value),
-	};
-
-	return syscall(__NR_bpf, BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
-}
-
-int bpf_map_delete_elem(int fd, void *key)
-{
-	union bpf_attr attr = {
-		.map_fd = fd,
-		.key = ptr_to_u64(key),
-	};
-
-	return syscall(__NR_bpf, BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
-}
-
-int bpf_map_get_next_key(int fd, void *key, void *next_key)
-{
-	union bpf_attr attr = {
-		.map_fd = fd,
-		.key = ptr_to_u64(key),
-		.next_key = ptr_to_u64(next_key),
-	};
-
-	return syscall(__NR_bpf, BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr));
-}
-
-#define ROUND_UP(x, n) (((x) + (n) - 1u) & ~((n) - 1u))
-
-int bpf_load_program(enum bpf_prog_type prog_type,
-		     const struct bpf_insn *insns, int prog_len,
-		     const char *license, int kern_version,
-		     char *log_buf, size_t log_buf_sz)
-{
-	union bpf_attr attr = {
-		.prog_type = prog_type,
-		.insns = ptr_to_u64((void *) insns),
-		.insn_cnt = prog_len / sizeof(struct bpf_insn),
-		.license = ptr_to_u64((void *) license),
-		.log_buf = ptr_to_u64(log_buf),
-		.log_size = log_buf_sz,
-		.log_level = 1,
-	};
-
-	/* assign one field outside of struct init to make sure any
-	 * padding is zero initialized
-	 */
-	attr.kern_version = kern_version;
-
-	log_buf[0] = 0;
-
-	return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
-}
-
 int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type)
 {
 	union bpf_attr attr = {
@@ -124,25 +32,6 @@ int bpf_prog_detach(int target_fd, enum bpf_attach_type type)
 	return syscall(__NR_bpf, BPF_PROG_DETACH, &attr, sizeof(attr));
 }
 
-int bpf_obj_pin(int fd, const char *pathname)
-{
-	union bpf_attr attr = {
-		.pathname	= ptr_to_u64((void *)pathname),
-		.bpf_fd		= fd,
-	};
-
-	return syscall(__NR_bpf, BPF_OBJ_PIN, &attr, sizeof(attr));
-}
-
-int bpf_obj_get(const char *pathname)
-{
-	union bpf_attr attr = {
-		.pathname	= ptr_to_u64((void *)pathname),
-	};
-
-	return syscall(__NR_bpf, BPF_OBJ_GET, &attr, sizeof(attr));
-}
-
 int open_raw_sock(const char *name)
 {
 	struct sockaddr_ll sll;
diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
index 20e3457857ca..cf7d2386d1f9 100644
--- a/samples/bpf/libbpf.h
+++ b/samples/bpf/libbpf.h
@@ -2,28 +2,13 @@
 #ifndef __LIBBPF_H
 #define __LIBBPF_H
 
-struct bpf_insn;
-
-int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size,
-		   int max_entries, int map_flags);
-int bpf_map_update_elem(int fd, void *key, void *value, unsigned long long flags);
-int bpf_map_lookup_elem(int fd, void *key, void *value);
-int bpf_map_delete_elem(int fd, void *key);
-int bpf_map_get_next_key(int fd, void *key, void *next_key);
+#include <bpf/bpf.h>
 
-int bpf_load_program(enum bpf_prog_type prog_type,
-		     const struct bpf_insn *insns, int insn_len,
-		     const char *license, int kern_version,
-		     char *log_buf, size_t log_buf_sz);
+struct bpf_insn;
 
 int bpf_prog_attach(int prog_fd, int attachable_fd, enum bpf_attach_type type);
 int bpf_prog_detach(int attachable_fd, enum bpf_attach_type type);
 
-int bpf_obj_pin(int fd, const char *pathname);
-int bpf_obj_get(const char *pathname);
-
-#define BPF_LOG_BUF_SIZE (256 * 1024)
-
 /* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */
 
 #define BPF_ALU64_REG(OP, DST, SRC)				\
-- 
2.10.2

* [PATCH perf/core REBASE 1/5] samples/bpf: Make samples more libbpf-centric
From: Joe Stringer @ 2016-12-14 22:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: netdev, wangnan0, ast, daniel, acme, Arnaldo Carvalho de Melo
In-Reply-To: <20161214224342.12858-1-joe@ovn.org>

Switch all of the sample code to use the function names from
tools/lib/bpf so that the two are consistent, and make each sample
declare its own log buffer. This allows the next commit to be purely
devoted to getting rid of the duplicate library in samples/bpf.
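
For illustration only, the change in calling convention can be sketched
as below. This is not code from the patch: example_load_program(),
example_log_buf, EXAMPLE_LOG_BUF_SIZE and example_demo() are hypothetical
names, and no bpf() syscall is made. The stand-in only mimics how the
caller now owns the log buffer and passes it in with its size, instead of
the loader writing to a library-global buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Same size the patch assigns to BPF_LOG_BUF_SIZE; the name is hypothetical. */
#define EXAMPLE_LOG_BUF_SIZE (256 * 1024)

/* Caller-owned log buffer, as each sample now declares for itself. */
char example_log_buf[EXAMPLE_LOG_BUF_SIZE];

/*
 * Hypothetical stand-in for a program loader. It performs no syscall and
 * only models the interface: the log buffer and its size are parameters.
 * An empty program (insn_cnt == 0) is treated as rejected and returns -1;
 * anything else returns a fake descriptor.
 */
int example_load_program(int insn_cnt, char *log_buf, size_t log_buf_sz)
{
	/* Start with an empty log, as the real helper does before loading. */
	log_buf[0] = 0;

	if (insn_cnt == 0) {
		snprintf(log_buf, log_buf_sz, "example: empty program rejected\n");
		return -1;
	}

	return 3; /* fake file descriptor */
}

/* Exercises the failing path first, then the succeeding path. */
void example_demo(void)
{
	int fd;

	fd = example_load_program(0, example_log_buf, sizeof(example_log_buf));
	assert(fd < 0);
	assert(strlen(example_log_buf) > 0);
	printf("load failed, log: %s", example_log_buf);

	fd = example_load_program(2, example_log_buf, sizeof(example_log_buf));
	assert(fd >= 0);
	assert(example_log_buf[0] == 0); /* log was reset on the second call */
}
```

Calling example_demo() from a main() exercises both paths. Passing the
buffer explicitly is what lets several samples link against one shared
library without also sharing a single global log.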

Signed-off-by: Joe Stringer <joe@ovn.org>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/r/20161209024620.31660-5-joe@ovn.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 samples/bpf/bpf_load.c                            | 17 +++++++++---
 samples/bpf/bpf_load.h                            |  3 +++
 samples/bpf/fds_example.c                         |  9 ++++---
 samples/bpf/lathist_user.c                        |  2 +-
 samples/bpf/libbpf.c                              | 23 ++++++++--------
 samples/bpf/libbpf.h                              | 18 ++++++-------
 samples/bpf/lwt_len_hist_user.c                   |  6 +++--
 samples/bpf/offwaketime_user.c                    |  8 +++---
 samples/bpf/sampleip_user.c                       |  4 +--
 samples/bpf/sock_example.c                        | 12 +++++----
 samples/bpf/sockex1_user.c                        |  6 ++---
 samples/bpf/sockex2_user.c                        |  4 +--
 samples/bpf/sockex3_user.c                        |  4 +--
 samples/bpf/spintest_user.c                       |  8 +++---
 samples/bpf/tc_l2_redirect_user.c                 |  4 +--
 samples/bpf/test_cgrp2_array_pin.c                |  4 +--
 samples/bpf/test_cgrp2_attach.c                   | 11 +++++---
 samples/bpf/test_cgrp2_attach2.c                  |  7 +++--
 samples/bpf/test_cgrp2_sock.c                     |  6 +++--
 samples/bpf/test_current_task_under_cgroup_user.c |  8 +++---
 samples/bpf/test_lru_dist.c                       | 32 +++++++++++------------
 samples/bpf/test_probe_write_user_user.c          |  2 +-
 samples/bpf/trace_event_user.c                    | 14 +++++-----
 samples/bpf/trace_output_user.c                   |  2 +-
 samples/bpf/tracex2_user.c                        | 10 +++----
 samples/bpf/tracex3_user.c                        |  4 +--
 samples/bpf/tracex4_user.c                        |  4 +--
 samples/bpf/tracex6_user.c                        |  2 +-
 samples/bpf/xdp1_user.c                           |  2 +-
 samples/bpf/xdp_tx_iptunnel_user.c                |  6 ++---
 30 files changed, 133 insertions(+), 109 deletions(-)

diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index e30b6de94f2e..f5b186c46b7c 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -22,7 +22,6 @@
 #include <poll.h>
 #include <ctype.h>
 #include "libbpf.h"
-#include "bpf_helpers.h"
 #include "bpf_load.h"
 
 #define DEBUGFS "/sys/kernel/debug/tracing/"
@@ -30,17 +29,26 @@
 static char license[128];
 static int kern_version;
 static bool processed_sec[128];
+char bpf_log_buf[BPF_LOG_BUF_SIZE];
 int map_fd[MAX_MAPS];
 int prog_fd[MAX_PROGS];
 int event_fd[MAX_PROGS];
 int prog_cnt;
 int prog_array_fd = -1;
 
+struct bpf_map_def {
+	unsigned int type;
+	unsigned int key_size;
+	unsigned int value_size;
+	unsigned int max_entries;
+	unsigned int map_flags;
+};
+
 static int populate_prog_array(const char *event, int prog_fd)
 {
 	int ind = atoi(event), err;
 
-	err = bpf_update_elem(prog_array_fd, &ind, &prog_fd, BPF_ANY);
+	err = bpf_map_update_elem(prog_array_fd, &ind, &prog_fd, BPF_ANY);
 	if (err < 0) {
 		printf("failed to store prog_fd in prog_array\n");
 		return -1;
@@ -87,9 +95,10 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 		return -1;
 	}
 
-	fd = bpf_prog_load(prog_type, prog, size, license, kern_version);
+	fd = bpf_load_program(prog_type, prog, size, license, kern_version,
+			      bpf_log_buf, BPF_LOG_BUF_SIZE);
 	if (fd < 0) {
-		printf("bpf_prog_load() err=%d\n%s", errno, bpf_log_buf);
+		printf("bpf_load_program() err=%d\n%s", errno, bpf_log_buf);
 		return -1;
 	}
 
diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h
index fb46a421ab41..c827827299b3 100644
--- a/samples/bpf/bpf_load.h
+++ b/samples/bpf/bpf_load.h
@@ -1,12 +1,15 @@
 #ifndef __BPF_LOAD_H
 #define __BPF_LOAD_H
 
+#include "libbpf.h"
+
 #define MAX_MAPS 32
 #define MAX_PROGS 32
 
 extern int map_fd[MAX_MAPS];
 extern int prog_fd[MAX_PROGS];
 extern int event_fd[MAX_PROGS];
+extern char bpf_log_buf[BPF_LOG_BUF_SIZE];
 extern int prog_cnt;
 
 /* parses elf file compiled by llvm .c->.o
diff --git a/samples/bpf/fds_example.c b/samples/bpf/fds_example.c
index 625e797be6ef..8a4fc4ef3993 100644
--- a/samples/bpf/fds_example.c
+++ b/samples/bpf/fds_example.c
@@ -58,8 +58,9 @@ static int bpf_prog_create(const char *object)
 		assert(!load_bpf_file((char *)object));
 		return prog_fd[0];
 	} else {
-		return bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER,
-				     insns, sizeof(insns), "GPL", 0);
+		return bpf_load_program(BPF_PROG_TYPE_SOCKET_FILTER,
+					insns, sizeof(insns), "GPL", 0,
+					bpf_log_buf, BPF_LOG_BUF_SIZE);
 	}
 }
 
@@ -83,12 +84,12 @@ static int bpf_do_map(const char *file, uint32_t flags, uint32_t key,
 	}
 
 	if ((flags & BPF_F_KEY_VAL) == BPF_F_KEY_VAL) {
-		ret = bpf_update_elem(fd, &key, &value, 0);
+		ret = bpf_map_update_elem(fd, &key, &value, 0);
 		printf("bpf: fd:%d u->(%u:%u) ret:(%d,%s)\n", fd, key, value,
 		       ret, strerror(errno));
 		assert(ret == 0);
 	} else if (flags & BPF_F_KEY) {
-		ret = bpf_lookup_elem(fd, &key, &value);
+		ret = bpf_map_lookup_elem(fd, &key, &value);
 		printf("bpf: fd:%d l->(%u):%u ret:(%d,%s)\n", fd, key, value,
 		       ret, strerror(errno));
 		assert(ret == 0);
diff --git a/samples/bpf/lathist_user.c b/samples/bpf/lathist_user.c
index 65da8c1576de..6477bad5b4e2 100644
--- a/samples/bpf/lathist_user.c
+++ b/samples/bpf/lathist_user.c
@@ -73,7 +73,7 @@ static void get_data(int fd)
 	for (c = 0; c < MAX_CPU; c++) {
 		for (i = 0; i < MAX_ENTRIES; i++) {
 			key = c * MAX_ENTRIES + i;
-			bpf_lookup_elem(fd, &key, &value);
+			bpf_map_lookup_elem(fd, &key, &value);
 
 			cpu_hist[c].data[i] = value;
 			if (value > cpu_hist[c].max)
diff --git a/samples/bpf/libbpf.c b/samples/bpf/libbpf.c
index 9ce707bf02a7..6f076abdca35 100644
--- a/samples/bpf/libbpf.c
+++ b/samples/bpf/libbpf.c
@@ -32,7 +32,7 @@ int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size,
 	return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
 }
 
-int bpf_update_elem(int fd, void *key, void *value, unsigned long long flags)
+int bpf_map_update_elem(int fd, void *key, void *value, unsigned long long flags)
 {
 	union bpf_attr attr = {
 		.map_fd = fd,
@@ -44,7 +44,7 @@ int bpf_update_elem(int fd, void *key, void *value, unsigned long long flags)
 	return syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
 }
 
-int bpf_lookup_elem(int fd, void *key, void *value)
+int bpf_map_lookup_elem(int fd, void *key, void *value)
 {
 	union bpf_attr attr = {
 		.map_fd = fd,
@@ -55,7 +55,7 @@ int bpf_lookup_elem(int fd, void *key, void *value)
 	return syscall(__NR_bpf, BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
 }
 
-int bpf_delete_elem(int fd, void *key)
+int bpf_map_delete_elem(int fd, void *key)
 {
 	union bpf_attr attr = {
 		.map_fd = fd,
@@ -65,7 +65,7 @@ int bpf_delete_elem(int fd, void *key)
 	return syscall(__NR_bpf, BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
 }
 
-int bpf_get_next_key(int fd, void *key, void *next_key)
+int bpf_map_get_next_key(int fd, void *key, void *next_key)
 {
 	union bpf_attr attr = {
 		.map_fd = fd,
@@ -78,19 +78,18 @@ int bpf_get_next_key(int fd, void *key, void *next_key)
 
 #define ROUND_UP(x, n) (((x) + (n) - 1u) & ~((n) - 1u))
 
-char bpf_log_buf[LOG_BUF_SIZE];
-
-int bpf_prog_load(enum bpf_prog_type prog_type,
-		  const struct bpf_insn *insns, int prog_len,
-		  const char *license, int kern_version)
+int bpf_load_program(enum bpf_prog_type prog_type,
+		     const struct bpf_insn *insns, int prog_len,
+		     const char *license, int kern_version,
+		     char *log_buf, size_t log_buf_sz)
 {
 	union bpf_attr attr = {
 		.prog_type = prog_type,
 		.insns = ptr_to_u64((void *) insns),
 		.insn_cnt = prog_len / sizeof(struct bpf_insn),
 		.license = ptr_to_u64((void *) license),
-		.log_buf = ptr_to_u64(bpf_log_buf),
-		.log_size = LOG_BUF_SIZE,
+		.log_buf = ptr_to_u64(log_buf),
+		.log_size = log_buf_sz,
 		.log_level = 1,
 	};
 
@@ -99,7 +98,7 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
 	 */
 	attr.kern_version = kern_version;
 
-	bpf_log_buf[0] = 0;
+	log_buf[0] = 0;
 
 	return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
 }
diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
index 94a901d86fc2..20e3457857ca 100644
--- a/samples/bpf/libbpf.h
+++ b/samples/bpf/libbpf.h
@@ -6,14 +6,15 @@ struct bpf_insn;
 
 int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size,
 		   int max_entries, int map_flags);
-int bpf_update_elem(int fd, void *key, void *value, unsigned long long flags);
-int bpf_lookup_elem(int fd, void *key, void *value);
-int bpf_delete_elem(int fd, void *key);
-int bpf_get_next_key(int fd, void *key, void *next_key);
+int bpf_map_update_elem(int fd, void *key, void *value, unsigned long long flags);
+int bpf_map_lookup_elem(int fd, void *key, void *value);
+int bpf_map_delete_elem(int fd, void *key);
+int bpf_map_get_next_key(int fd, void *key, void *next_key);
 
-int bpf_prog_load(enum bpf_prog_type prog_type,
-		  const struct bpf_insn *insns, int insn_len,
-		  const char *license, int kern_version);
+int bpf_load_program(enum bpf_prog_type prog_type,
+		     const struct bpf_insn *insns, int insn_len,
+		     const char *license, int kern_version,
+		     char *log_buf, size_t log_buf_sz);
 
 int bpf_prog_attach(int prog_fd, int attachable_fd, enum bpf_attach_type type);
 int bpf_prog_detach(int attachable_fd, enum bpf_attach_type type);
@@ -21,8 +22,7 @@ int bpf_prog_detach(int attachable_fd, enum bpf_attach_type type);
 int bpf_obj_pin(int fd, const char *pathname);
 int bpf_obj_get(const char *pathname);
 
-#define LOG_BUF_SIZE (256 * 1024)
-extern char bpf_log_buf[LOG_BUF_SIZE];
+#define BPF_LOG_BUF_SIZE (256 * 1024)
 
 /* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */
 
diff --git a/samples/bpf/lwt_len_hist_user.c b/samples/bpf/lwt_len_hist_user.c
index 05d783fc5daf..ec8f3bbcbef3 100644
--- a/samples/bpf/lwt_len_hist_user.c
+++ b/samples/bpf/lwt_len_hist_user.c
@@ -14,6 +14,8 @@
 #define MAX_INDEX 64
 #define MAX_STARS 38
 
+char bpf_log_buf[BPF_LOG_BUF_SIZE];
+
 static void stars(char *str, long val, long max, int width)
 {
 	int i;
@@ -41,13 +43,13 @@ int main(int argc, char **argv)
 		return -1;
 	}
 
-	while (bpf_get_next_key(map_fd, &key, &next_key) == 0) {
+	while (bpf_map_get_next_key(map_fd, &key, &next_key) == 0) {
 		if (next_key >= MAX_INDEX) {
 			fprintf(stderr, "Key %lu out of bounds\n", next_key);
 			continue;
 		}
 
-		bpf_lookup_elem(map_fd, &next_key, values);
+		bpf_map_lookup_elem(map_fd, &next_key, values);
 
 		sum = 0;
 		for (i = 0; i < nr_cpus; i++)
diff --git a/samples/bpf/offwaketime_user.c b/samples/bpf/offwaketime_user.c
index 6f002a9c24fa..9cce2a66bd66 100644
--- a/samples/bpf/offwaketime_user.c
+++ b/samples/bpf/offwaketime_user.c
@@ -49,14 +49,14 @@ static void print_stack(struct key_t *key, __u64 count)
 	int i;
 
 	printf("%s;", key->target);
-	if (bpf_lookup_elem(map_fd[3], &key->tret, ip) != 0) {
+	if (bpf_map_lookup_elem(map_fd[3], &key->tret, ip) != 0) {
 		printf("---;");
 	} else {
 		for (i = PERF_MAX_STACK_DEPTH - 1; i >= 0; i--)
 			print_ksym(ip[i]);
 	}
 	printf("-;");
-	if (bpf_lookup_elem(map_fd[3], &key->wret, ip) != 0) {
+	if (bpf_map_lookup_elem(map_fd[3], &key->wret, ip) != 0) {
 		printf("---;");
 	} else {
 		for (i = 0; i < PERF_MAX_STACK_DEPTH; i++)
@@ -77,8 +77,8 @@ static void print_stacks(int fd)
 	struct key_t key = {}, next_key;
 	__u64 value;
 
-	while (bpf_get_next_key(fd, &key, &next_key) == 0) {
-		bpf_lookup_elem(fd, &next_key, &value);
+	while (bpf_map_get_next_key(fd, &key, &next_key) == 0) {
+		bpf_map_lookup_elem(fd, &next_key, &value);
 		print_stack(&next_key, value);
 		key = next_key;
 	}
diff --git a/samples/bpf/sampleip_user.c b/samples/bpf/sampleip_user.c
index 260a6bdd6413..5ac5adf75931 100644
--- a/samples/bpf/sampleip_user.c
+++ b/samples/bpf/sampleip_user.c
@@ -95,8 +95,8 @@ static void print_ip_map(int fd)
 
 	/* fetch IPs and counts */
 	key = 0, i = 0;
-	while (bpf_get_next_key(fd, &key, &next_key) == 0) {
-		bpf_lookup_elem(fd, &next_key, &value);
+	while (bpf_map_get_next_key(fd, &key, &next_key) == 0) {
+		bpf_map_lookup_elem(fd, &next_key, &value);
 		counts[i].ip = next_key;
 		counts[i++].count = value;
 		key = next_key;
diff --git a/samples/bpf/sock_example.c b/samples/bpf/sock_example.c
index 28b60baa9fa8..d6b91e9a38ad 100644
--- a/samples/bpf/sock_example.c
+++ b/samples/bpf/sock_example.c
@@ -28,6 +28,8 @@
 #include <stddef.h>
 #include "libbpf.h"
 
+char bpf_log_buf[BPF_LOG_BUF_SIZE];
+
 static int test_sock(void)
 {
 	int sock = -1, map_fd, prog_fd, i, key;
@@ -55,8 +57,8 @@ static int test_sock(void)
 		BPF_EXIT_INSN(),
 	};
 
-	prog_fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, prog, sizeof(prog),
-				"GPL", 0);
+	prog_fd = bpf_load_program(BPF_PROG_TYPE_SOCKET_FILTER, prog, sizeof(prog),
+				   "GPL", 0, bpf_log_buf, BPF_LOG_BUF_SIZE);
 	if (prog_fd < 0) {
 		printf("failed to load prog '%s'\n", strerror(errno));
 		goto cleanup;
@@ -72,13 +74,13 @@ static int test_sock(void)
 
 	for (i = 0; i < 10; i++) {
 		key = IPPROTO_TCP;
-		assert(bpf_lookup_elem(map_fd, &key, &tcp_cnt) == 0);
+		assert(bpf_map_lookup_elem(map_fd, &key, &tcp_cnt) == 0);
 
 		key = IPPROTO_UDP;
-		assert(bpf_lookup_elem(map_fd, &key, &udp_cnt) == 0);
+		assert(bpf_map_lookup_elem(map_fd, &key, &udp_cnt) == 0);
 
 		key = IPPROTO_ICMP;
-		assert(bpf_lookup_elem(map_fd, &key, &icmp_cnt) == 0);
+		assert(bpf_map_lookup_elem(map_fd, &key, &icmp_cnt) == 0);
 
 		printf("TCP %lld UDP %lld ICMP %lld packets\n",
 		       tcp_cnt, udp_cnt, icmp_cnt);
diff --git a/samples/bpf/sockex1_user.c b/samples/bpf/sockex1_user.c
index 678ce4693551..9454448bf198 100644
--- a/samples/bpf/sockex1_user.c
+++ b/samples/bpf/sockex1_user.c
@@ -32,13 +32,13 @@ int main(int ac, char **argv)
 		int key;
 
 		key = IPPROTO_TCP;
-		assert(bpf_lookup_elem(map_fd[0], &key, &tcp_cnt) == 0);
+		assert(bpf_map_lookup_elem(map_fd[0], &key, &tcp_cnt) == 0);
 
 		key = IPPROTO_UDP;
-		assert(bpf_lookup_elem(map_fd[0], &key, &udp_cnt) == 0);
+		assert(bpf_map_lookup_elem(map_fd[0], &key, &udp_cnt) == 0);
 
 		key = IPPROTO_ICMP;
-		assert(bpf_lookup_elem(map_fd[0], &key, &icmp_cnt) == 0);
+		assert(bpf_map_lookup_elem(map_fd[0], &key, &icmp_cnt) == 0);
 
 		printf("TCP %lld UDP %lld ICMP %lld bytes\n",
 		       tcp_cnt, udp_cnt, icmp_cnt);
diff --git a/samples/bpf/sockex2_user.c b/samples/bpf/sockex2_user.c
index 8a4085c2d117..6a40600d5a83 100644
--- a/samples/bpf/sockex2_user.c
+++ b/samples/bpf/sockex2_user.c
@@ -39,8 +39,8 @@ int main(int ac, char **argv)
 		int key = 0, next_key;
 		struct pair value;
 
-		while (bpf_get_next_key(map_fd[0], &key, &next_key) == 0) {
-			bpf_lookup_elem(map_fd[0], &next_key, &value);
+		while (bpf_map_get_next_key(map_fd[0], &key, &next_key) == 0) {
+			bpf_map_lookup_elem(map_fd[0], &next_key, &value);
 			printf("ip %s bytes %lld packets %lld\n",
 			       inet_ntoa((struct in_addr){htonl(next_key)}),
 			       value.bytes, value.packets);
diff --git a/samples/bpf/sockex3_user.c b/samples/bpf/sockex3_user.c
index 3fcfd8c4b2a3..9099c4255f23 100644
--- a/samples/bpf/sockex3_user.c
+++ b/samples/bpf/sockex3_user.c
@@ -54,8 +54,8 @@ int main(int argc, char **argv)
 
 		sleep(1);
 		printf("IP     src.port -> dst.port               bytes      packets\n");
-		while (bpf_get_next_key(map_fd[2], &key, &next_key) == 0) {
-			bpf_lookup_elem(map_fd[2], &next_key, &value);
+		while (bpf_map_get_next_key(map_fd[2], &key, &next_key) == 0) {
+			bpf_map_lookup_elem(map_fd[2], &next_key, &value);
 			printf("%s.%05d -> %s.%05d %12lld %12lld\n",
 			       inet_ntoa((struct in_addr){htonl(next_key.src)}),
 			       next_key.port16[0],
diff --git a/samples/bpf/spintest_user.c b/samples/bpf/spintest_user.c
index 311ede532230..80676c25fa50 100644
--- a/samples/bpf/spintest_user.c
+++ b/samples/bpf/spintest_user.c
@@ -31,8 +31,8 @@ int main(int ac, char **argv)
 	for (i = 0; i < 5; i++) {
 		key = 0;
 		printf("kprobing funcs:");
-		while (bpf_get_next_key(map_fd[0], &key, &next_key) == 0) {
-			bpf_lookup_elem(map_fd[0], &next_key, &value);
+		while (bpf_map_get_next_key(map_fd[0], &key, &next_key) == 0) {
+			bpf_map_lookup_elem(map_fd[0], &next_key, &value);
 			assert(next_key == value);
 			sym = ksym_search(value);
 			printf(" %s", sym->name);
@@ -41,8 +41,8 @@ int main(int ac, char **argv)
 		if (key)
 			printf("\n");
 		key = 0;
-		while (bpf_get_next_key(map_fd[0], &key, &next_key) == 0)
-			bpf_delete_elem(map_fd[0], &next_key);
+		while (bpf_map_get_next_key(map_fd[0], &key, &next_key) == 0)
+			bpf_map_delete_elem(map_fd[0], &next_key);
 		sleep(1);
 	}
 
diff --git a/samples/bpf/tc_l2_redirect_user.c b/samples/bpf/tc_l2_redirect_user.c
index 4013c5337b91..28995a776560 100644
--- a/samples/bpf/tc_l2_redirect_user.c
+++ b/samples/bpf/tc_l2_redirect_user.c
@@ -60,9 +60,9 @@ int main(int argc, char **argv)
 	}
 
 	/* bpf_tunnel_key.remote_ipv4 expects host byte orders */
-	ret = bpf_update_elem(array_fd, &array_key, &ifindex, 0);
+	ret = bpf_map_update_elem(array_fd, &array_key, &ifindex, 0);
 	if (ret) {
-		perror("bpf_update_elem");
+		perror("bpf_map_update_elem");
 		goto out;
 	}
 
diff --git a/samples/bpf/test_cgrp2_array_pin.c b/samples/bpf/test_cgrp2_array_pin.c
index 70e86f7be69d..8a1b8b5d8def 100644
--- a/samples/bpf/test_cgrp2_array_pin.c
+++ b/samples/bpf/test_cgrp2_array_pin.c
@@ -85,9 +85,9 @@ int main(int argc, char **argv)
 		}
 	}
 
-	ret = bpf_update_elem(array_fd, &array_key, &cg2_fd, 0);
+	ret = bpf_map_update_elem(array_fd, &array_key, &cg2_fd, 0);
 	if (ret) {
-		perror("bpf_update_elem");
+		perror("bpf_map_update_elem");
 		goto out;
 	}
 
diff --git a/samples/bpf/test_cgrp2_attach.c b/samples/bpf/test_cgrp2_attach.c
index a19484c45b79..8283ef86d392 100644
--- a/samples/bpf/test_cgrp2_attach.c
+++ b/samples/bpf/test_cgrp2_attach.c
@@ -36,6 +36,8 @@ enum {
 	MAP_KEY_BYTES,
 };
 
+char bpf_log_buf[BPF_LOG_BUF_SIZE];
+
 static int prog_load(int map_fd, int verdict)
 {
 	struct bpf_insn prog[] = {
@@ -67,8 +69,9 @@ static int prog_load(int map_fd, int verdict)
 		BPF_EXIT_INSN(),
 	};
 
-	return bpf_prog_load(BPF_PROG_TYPE_CGROUP_SKB,
-			     prog, sizeof(prog), "GPL", 0);
+	return bpf_load_program(BPF_PROG_TYPE_CGROUP_SKB,
+				prog, sizeof(prog), "GPL", 0,
+				bpf_log_buf, BPF_LOG_BUF_SIZE);
 }
 
 static int usage(const char *argv0)
@@ -108,10 +111,10 @@ static int attach_filter(int cg_fd, int type, int verdict)
 	}
 	while (1) {
 		key = MAP_KEY_PACKETS;
-		assert(bpf_lookup_elem(map_fd, &key, &pkt_cnt) == 0);
+		assert(bpf_map_lookup_elem(map_fd, &key, &pkt_cnt) == 0);
 
 		key = MAP_KEY_BYTES;
-		assert(bpf_lookup_elem(map_fd, &key, &byte_cnt) == 0);
+		assert(bpf_map_lookup_elem(map_fd, &key, &byte_cnt) == 0);
 
 		printf("cgroup received %lld packets, %lld bytes\n",
 		       pkt_cnt, byte_cnt);
diff --git a/samples/bpf/test_cgrp2_attach2.c b/samples/bpf/test_cgrp2_attach2.c
index ddfac42ed4df..fc6092fdc3b0 100644
--- a/samples/bpf/test_cgrp2_attach2.c
+++ b/samples/bpf/test_cgrp2_attach2.c
@@ -32,6 +32,8 @@
 #define BAR		"/foo/bar/"
 #define PING_CMD	"ping -c1 -w1 127.0.0.1"
 
+char bpf_log_buf[BPF_LOG_BUF_SIZE];
+
 static int prog_load(int verdict)
 {
 	int ret;
@@ -40,8 +42,9 @@ static int prog_load(int verdict)
 		BPF_EXIT_INSN(),
 	};
 
-	ret = bpf_prog_load(BPF_PROG_TYPE_CGROUP_SKB,
-			     prog, sizeof(prog), "GPL", 0);
+	ret = bpf_load_program(BPF_PROG_TYPE_CGROUP_SKB,
+			       prog, sizeof(prog), "GPL", 0,
+			       bpf_log_buf, BPF_LOG_BUF_SIZE);
 
 	if (ret < 0) {
 		log_err("Loading program");
diff --git a/samples/bpf/test_cgrp2_sock.c b/samples/bpf/test_cgrp2_sock.c
index d467b3c1c55c..43b4bde5d05c 100644
--- a/samples/bpf/test_cgrp2_sock.c
+++ b/samples/bpf/test_cgrp2_sock.c
@@ -23,6 +23,8 @@
 
 #include "libbpf.h"
 
+char bpf_log_buf[BPF_LOG_BUF_SIZE];
+
 static int prog_load(int idx)
 {
 	struct bpf_insn prog[] = {
@@ -34,8 +36,8 @@ static int prog_load(int idx)
 		BPF_EXIT_INSN(),
 	};
 
-	return bpf_prog_load(BPF_PROG_TYPE_CGROUP_SOCK, prog, sizeof(prog),
-			     "GPL", 0);
+	return bpf_load_program(BPF_PROG_TYPE_CGROUP_SOCK, prog, sizeof(prog),
+				"GPL", 0, bpf_log_buf, BPF_LOG_BUF_SIZE);
 }
 
 static int usage(const char *argv0)
diff --git a/samples/bpf/test_current_task_under_cgroup_user.c b/samples/bpf/test_current_task_under_cgroup_user.c
index 95aaaa846130..65b5fb51c1db 100644
--- a/samples/bpf/test_current_task_under_cgroup_user.c
+++ b/samples/bpf/test_current_task_under_cgroup_user.c
@@ -36,7 +36,7 @@ int main(int argc, char **argv)
 	if (!cg2)
 		goto err;
 
-	if (bpf_update_elem(map_fd[0], &idx, &cg2, BPF_ANY)) {
+	if (bpf_map_update_elem(map_fd[0], &idx, &cg2, BPF_ANY)) {
 		log_err("Adding target cgroup to map");
 		goto err;
 	}
@@ -50,7 +50,7 @@ int main(int argc, char **argv)
 	 */
 
 	sync();
-	bpf_lookup_elem(map_fd[1], &idx, &remote_pid);
+	bpf_map_lookup_elem(map_fd[1], &idx, &remote_pid);
 
 	if (local_pid != remote_pid) {
 		fprintf(stderr,
@@ -64,10 +64,10 @@ int main(int argc, char **argv)
 		goto err;
 
 	remote_pid = 0;
-	bpf_update_elem(map_fd[1], &idx, &remote_pid, BPF_ANY);
+	bpf_map_update_elem(map_fd[1], &idx, &remote_pid, BPF_ANY);
 
 	sync();
-	bpf_lookup_elem(map_fd[1], &idx, &remote_pid);
+	bpf_map_lookup_elem(map_fd[1], &idx, &remote_pid);
 
 	if (local_pid == remote_pid) {
 		fprintf(stderr, "BPF cgroup negative test did not work\n");
diff --git a/samples/bpf/test_lru_dist.c b/samples/bpf/test_lru_dist.c
index 316230a0ed23..d96dc88d3b04 100644
--- a/samples/bpf/test_lru_dist.c
+++ b/samples/bpf/test_lru_dist.c
@@ -134,7 +134,7 @@ static int pfect_lru_lookup_or_insert(struct pfect_lru *lru,
 	int seen = 0;
 
 	lru->total++;
-	if (!bpf_lookup_elem(lru->map_fd, &key, &node)) {
+	if (!bpf_map_lookup_elem(lru->map_fd, &key, &node)) {
 		if (node) {
 			list_move(&node->list, &lru->list);
 			return 1;
@@ -151,7 +151,7 @@ static int pfect_lru_lookup_or_insert(struct pfect_lru *lru,
 		node = list_last_entry(&lru->list,
 				       struct pfect_lru_node,
 				       list);
-		bpf_update_elem(lru->map_fd, &node->key, &null_node, BPF_EXIST);
+		bpf_map_update_elem(lru->map_fd, &node->key, &null_node, BPF_EXIST);
 	}
 
 	node->key = key;
@@ -159,10 +159,10 @@ static int pfect_lru_lookup_or_insert(struct pfect_lru *lru,
 
 	lru->nr_misses++;
 	if (seen) {
-		assert(!bpf_update_elem(lru->map_fd, &key, &node, BPF_EXIST));
+		assert(!bpf_map_update_elem(lru->map_fd, &key, &node, BPF_EXIST));
 	} else {
 		lru->nr_unique++;
-		assert(!bpf_update_elem(lru->map_fd, &key, &node, BPF_NOEXIST));
+		assert(!bpf_map_update_elem(lru->map_fd, &key, &node, BPF_NOEXIST));
 	}
 
 	return seen;
@@ -285,11 +285,11 @@ static void do_test_lru_dist(int task, void *data)
 
 		pfect_lru_lookup_or_insert(&pfect_lru, key);
 
-		if (!bpf_lookup_elem(lru_map_fd, &key, &value))
+		if (!bpf_map_lookup_elem(lru_map_fd, &key, &value))
 			continue;
 
-		if (bpf_update_elem(lru_map_fd, &key, &value, BPF_NOEXIST)) {
-			printf("bpf_update_elem(lru_map_fd, %llu): errno:%d\n",
+		if (bpf_map_update_elem(lru_map_fd, &key, &value, BPF_NOEXIST)) {
+			printf("bpf_map_update_elem(lru_map_fd, %llu): errno:%d\n",
 			       key, errno);
 			assert(0);
 		}
@@ -358,19 +358,19 @@ static void test_lru_loss0(int map_type, int map_flags)
 	for (key = 1; key <= 1000; key++) {
 		int start_key, end_key;
 
-		assert(bpf_update_elem(map_fd, &key, value, BPF_NOEXIST) == 0);
+		assert(bpf_map_update_elem(map_fd, &key, value, BPF_NOEXIST) == 0);
 
 		start_key = 101;
 		end_key = min(key, 900);
 
 		while (start_key <= end_key) {
-			bpf_lookup_elem(map_fd, &start_key, value);
+			bpf_map_lookup_elem(map_fd, &start_key, value);
 			start_key++;
 		}
 	}
 
 	for (key = 1; key <= 1000; key++) {
-		if (bpf_lookup_elem(map_fd, &key, value)) {
+		if (bpf_map_lookup_elem(map_fd, &key, value)) {
 			if (key <= 100)
 				old_unused_losses++;
 			else if (key <= 900)
@@ -408,10 +408,10 @@ static void test_lru_loss1(int map_type, int map_flags)
 	value[0] = 1234;
 
 	for (key = 1; key <= 1000; key++)
-		assert(!bpf_update_elem(map_fd, &key, value, BPF_NOEXIST));
+		assert(!bpf_map_update_elem(map_fd, &key, value, BPF_NOEXIST));
 
 	for (key = 1; key <= 1000; key++) {
-		if (bpf_lookup_elem(map_fd, &key, value))
+		if (bpf_map_lookup_elem(map_fd, &key, value))
 			nr_losses++;
 	}
 
@@ -436,7 +436,7 @@ static void do_test_parallel_lru_loss(int task, void *data)
 	next_ins_key = stable_base;
 	value[0] = 1234;
 	for (i = 0; i < nr_stable_elems; i++) {
-		assert(bpf_update_elem(map_fd, &next_ins_key, value,
+		assert(bpf_map_update_elem(map_fd, &next_ins_key, value,
 				       BPF_NOEXIST) == 0);
 		next_ins_key++;
 	}
@@ -448,9 +448,9 @@ static void do_test_parallel_lru_loss(int task, void *data)
 
 		if (rn % 10) {
 			key = rn % nr_stable_elems + stable_base;
-			bpf_lookup_elem(map_fd, &key, value);
+			bpf_map_lookup_elem(map_fd, &key, value);
 		} else {
-			bpf_update_elem(map_fd, &next_ins_key, value,
+			bpf_map_update_elem(map_fd, &next_ins_key, value,
 					BPF_NOEXIST);
 			next_ins_key++;
 		}
@@ -458,7 +458,7 @@ static void do_test_parallel_lru_loss(int task, void *data)
 
 	key = stable_base;
 	for (i = 0; i < nr_stable_elems; i++) {
-		if (bpf_lookup_elem(map_fd, &key, value))
+		if (bpf_map_lookup_elem(map_fd, &key, value))
 			nr_losses++;
 		key++;
 	}
diff --git a/samples/bpf/test_probe_write_user_user.c b/samples/bpf/test_probe_write_user_user.c
index a44bf347bedd..b5bf178a6ecc 100644
--- a/samples/bpf/test_probe_write_user_user.c
+++ b/samples/bpf/test_probe_write_user_user.c
@@ -50,7 +50,7 @@ int main(int ac, char **argv)
 	mapped_addr_in->sin_port = htons(5555);
 	mapped_addr_in->sin_addr.s_addr = inet_addr("255.255.255.255");
 
-	assert(!bpf_update_elem(map_fd[0], &mapped_addr, &serv_addr, BPF_ANY));
+	assert(!bpf_map_update_elem(map_fd[0], &mapped_addr, &serv_addr, BPF_ANY));
 
 	assert(listen(serverfd, 5) == 0);
 
diff --git a/samples/bpf/trace_event_user.c b/samples/bpf/trace_event_user.c
index 9a130d31ecf2..704fe9fa77b2 100644
--- a/samples/bpf/trace_event_user.c
+++ b/samples/bpf/trace_event_user.c
@@ -61,14 +61,14 @@ static void print_stack(struct key_t *key, __u64 count)
 	int i;
 
 	printf("%3lld %s;", count, key->comm);
-	if (bpf_lookup_elem(map_fd[1], &key->kernstack, ip) != 0) {
+	if (bpf_map_lookup_elem(map_fd[1], &key->kernstack, ip) != 0) {
 		printf("---;");
 	} else {
 		for (i = PERF_MAX_STACK_DEPTH - 1; i >= 0; i--)
 			print_ksym(ip[i]);
 	}
 	printf("-;");
-	if (bpf_lookup_elem(map_fd[1], &key->userstack, ip) != 0) {
+	if (bpf_map_lookup_elem(map_fd[1], &key->userstack, ip) != 0) {
 		printf("---;");
 	} else {
 		for (i = PERF_MAX_STACK_DEPTH - 1; i >= 0; i--)
@@ -98,10 +98,10 @@ static void print_stacks(void)
 	int fd = map_fd[0], stack_map = map_fd[1];
 
 	sys_read_seen = sys_write_seen = false;
-	while (bpf_get_next_key(fd, &key, &next_key) == 0) {
-		bpf_lookup_elem(fd, &next_key, &value);
+	while (bpf_map_get_next_key(fd, &key, &next_key) == 0) {
+		bpf_map_lookup_elem(fd, &next_key, &value);
 		print_stack(&next_key, value);
-		bpf_delete_elem(fd, &next_key);
+		bpf_map_delete_elem(fd, &next_key);
 		key = next_key;
 	}
 
@@ -111,8 +111,8 @@ static void print_stacks(void)
 	}
 
 	/* clear stack map */
-	while (bpf_get_next_key(stack_map, &stackid, &next_id) == 0) {
-		bpf_delete_elem(stack_map, &next_id);
+	while (bpf_map_get_next_key(stack_map, &stackid, &next_id) == 0) {
+		bpf_map_delete_elem(stack_map, &next_id);
 		stackid = next_id;
 	}
 }
diff --git a/samples/bpf/trace_output_user.c b/samples/bpf/trace_output_user.c
index 661a7d052f2c..3bedd945def1 100644
--- a/samples/bpf/trace_output_user.c
+++ b/samples/bpf/trace_output_user.c
@@ -162,7 +162,7 @@ static void test_bpf_perf_event(void)
 	pmu_fd = perf_event_open(&attr, -1/*pid*/, 0/*cpu*/, -1/*group_fd*/, 0);
 
 	assert(pmu_fd >= 0);
-	assert(bpf_update_elem(map_fd[0], &key, &pmu_fd, BPF_ANY) == 0);
+	assert(bpf_map_update_elem(map_fd[0], &key, &pmu_fd, BPF_ANY) == 0);
 	ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
 }
 
diff --git a/samples/bpf/tracex2_user.c b/samples/bpf/tracex2_user.c
index 3e225e331f66..ded9804c5034 100644
--- a/samples/bpf/tracex2_user.c
+++ b/samples/bpf/tracex2_user.c
@@ -48,12 +48,12 @@ static void print_hist_for_pid(int fd, void *task)
 	long max_value = 0;
 	int i, ind;
 
-	while (bpf_get_next_key(fd, &key, &next_key) == 0) {
+	while (bpf_map_get_next_key(fd, &key, &next_key) == 0) {
 		if (memcmp(&next_key, task, SIZE)) {
 			key = next_key;
 			continue;
 		}
-		bpf_lookup_elem(fd, &next_key, values);
+		bpf_map_lookup_elem(fd, &next_key, values);
 		value = 0;
 		for (i = 0; i < nr_cpus; i++)
 			value += values[i];
@@ -83,7 +83,7 @@ static void print_hist(int fd)
 	int task_cnt = 0;
 	int i;
 
-	while (bpf_get_next_key(fd, &key, &next_key) == 0) {
+	while (bpf_map_get_next_key(fd, &key, &next_key) == 0) {
 		int found = 0;
 
 		for (i = 0; i < task_cnt; i++)
@@ -136,8 +136,8 @@ int main(int ac, char **argv)
 
 	for (i = 0; i < 5; i++) {
 		key = 0;
-		while (bpf_get_next_key(map_fd[0], &key, &next_key) == 0) {
-			bpf_lookup_elem(map_fd[0], &next_key, &value);
+		while (bpf_map_get_next_key(map_fd[0], &key, &next_key) == 0) {
+			bpf_map_lookup_elem(map_fd[0], &next_key, &value);
 			printf("location 0x%lx count %ld\n", next_key, value);
 			key = next_key;
 		}
diff --git a/samples/bpf/tracex3_user.c b/samples/bpf/tracex3_user.c
index d0851cb4fa8d..8f7d199d5945 100644
--- a/samples/bpf/tracex3_user.c
+++ b/samples/bpf/tracex3_user.c
@@ -28,7 +28,7 @@ static void clear_stats(int fd)
 
 	memset(values, 0, sizeof(values));
 	for (key = 0; key < SLOTS; key++)
-		bpf_update_elem(fd, &key, values, BPF_ANY);
+		bpf_map_update_elem(fd, &key, values, BPF_ANY);
 }
 
 const char *color[] = {
@@ -89,7 +89,7 @@ static void print_hist(int fd)
 	int i;
 
 	for (key = 0; key < SLOTS; key++) {
-		bpf_lookup_elem(fd, &key, values);
+		bpf_map_lookup_elem(fd, &key, values);
 		value = 0;
 		for (i = 0; i < nr_cpus; i++)
 			value += values[i];
diff --git a/samples/bpf/tracex4_user.c b/samples/bpf/tracex4_user.c
index bc4a3bdea6ed..03449f773cb1 100644
--- a/samples/bpf/tracex4_user.c
+++ b/samples/bpf/tracex4_user.c
@@ -37,8 +37,8 @@ static void print_old_objects(int fd)
 	key = write(1, "\e[1;1H\e[2J", 12); /* clear screen */
 
 	key = -1;
-	while (bpf_get_next_key(map_fd[0], &key, &next_key) == 0) {
-		bpf_lookup_elem(map_fd[0], &next_key, &v);
+	while (bpf_map_get_next_key(map_fd[0], &key, &next_key) == 0) {
+		bpf_map_lookup_elem(map_fd[0], &next_key, &v);
 		key = next_key;
 		if (val - v.val < 1000000000ll)
 			/* object was allocated more then 1 sec ago */
diff --git a/samples/bpf/tracex6_user.c b/samples/bpf/tracex6_user.c
index 8ea4976cfcf1..179297cb4d35 100644
--- a/samples/bpf/tracex6_user.c
+++ b/samples/bpf/tracex6_user.c
@@ -36,7 +36,7 @@ static void test_bpf_perf_event(void)
 			goto exit;
 		}
 
-		bpf_update_elem(map_fd[0], &i, &pmu_fd[i], BPF_ANY);
+		bpf_map_update_elem(map_fd[0], &i, &pmu_fd[i], BPF_ANY);
 		ioctl(pmu_fd[i], PERF_EVENT_IOC_ENABLE, 0);
 	}
 
diff --git a/samples/bpf/xdp1_user.c b/samples/bpf/xdp1_user.c
index 5f040a0d7712..d2be65d1fd86 100644
--- a/samples/bpf/xdp1_user.c
+++ b/samples/bpf/xdp1_user.c
@@ -43,7 +43,7 @@ static void poll_stats(int interval)
 		for (key = 0; key < nr_keys; key++) {
 			__u64 sum = 0;
 
-			assert(bpf_lookup_elem(map_fd[0], &key, values) == 0);
+			assert(bpf_map_lookup_elem(map_fd[0], &key, values) == 0);
 			for (i = 0; i < nr_cpus; i++)
 				sum += (values[i] - prev[key][i]);
 			if (sum)
diff --git a/samples/bpf/xdp_tx_iptunnel_user.c b/samples/bpf/xdp_tx_iptunnel_user.c
index 7a71f5c74684..70e192fc61aa 100644
--- a/samples/bpf/xdp_tx_iptunnel_user.c
+++ b/samples/bpf/xdp_tx_iptunnel_user.c
@@ -51,7 +51,7 @@ static void poll_stats(unsigned int kill_after_s)
 		for (proto = 0; proto < nr_protos; proto++) {
 			__u64 sum = 0;
 
-			assert(bpf_lookup_elem(map_fd[0], &proto, values) == 0);
+			assert(bpf_map_lookup_elem(map_fd[0], &proto, values) == 0);
 			for (i = 0; i < nr_cpus; i++)
 				sum += (values[i] - prev[proto][i]);
 
@@ -237,8 +237,8 @@ int main(int argc, char **argv)
 
 	while (min_port <= max_port) {
 		vip.dport = htons(min_port++);
-		if (bpf_update_elem(map_fd[1], &vip, &tnl, BPF_NOEXIST)) {
-			perror("bpf_update_elem(&vip2tnl)");
+		if (bpf_map_update_elem(map_fd[1], &vip, &tnl, BPF_NOEXIST)) {
+			perror("bpf_map_update_elem(&vip2tnl)");
 			return 1;
 		}
 	}
-- 
2.10.2

^ permalink raw reply related

* [PATCH perf/core REBASE 3/5] tools lib bpf: Add bpf_prog_{attach,detach}
From: Joe Stringer @ 2016-12-14 22:43 UTC (permalink / raw)
  To: linux-kernel; +Cc: netdev, wangnan0, ast, daniel, acme
In-Reply-To: <20161214224342.12858-1-joe@ovn.org>

Commit d8c5b17f2bc0 ("samples: bpf: add userspace example for attaching
eBPF programs to cgroups") added these functions to samples/libbpf, but
during this merge all of the samples libbpf functionality is shifting to
tools/lib/bpf. Shift these functions there.

Signed-off-by: Joe Stringer <joe@ovn.org>
---
Arnaldo, this is a new patch you didn't previously review which I've
prepared due to the conflict with net-next. I figured it's better to try
to get samples/bpf properly switched over this window rather than defer the
problem and end up having to deal with another merge problem next time
around. I hope that is fine for you. If not, this patch onwards will need
to be dropped.

It's a simple copy/paste/delete with a minor change for sys_bpf() vs
syscall().
---
 samples/bpf/libbpf.c | 21 ---------------------
 samples/bpf/libbpf.h |  3 ---
 tools/lib/bpf/bpf.c  | 21 +++++++++++++++++++++
 tools/lib/bpf/bpf.h  |  3 +++
 4 files changed, 24 insertions(+), 24 deletions(-)

diff --git a/samples/bpf/libbpf.c b/samples/bpf/libbpf.c
index 3391225ad7e9..d9af876b4a2c 100644
--- a/samples/bpf/libbpf.c
+++ b/samples/bpf/libbpf.c
@@ -11,27 +11,6 @@
 #include <arpa/inet.h>
 #include "libbpf.h"
 
-int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type)
-{
-	union bpf_attr attr = {
-		.target_fd = target_fd,
-		.attach_bpf_fd = prog_fd,
-		.attach_type = type,
-	};
-
-	return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
-}
-
-int bpf_prog_detach(int target_fd, enum bpf_attach_type type)
-{
-	union bpf_attr attr = {
-		.target_fd = target_fd,
-		.attach_type = type,
-	};
-
-	return syscall(__NR_bpf, BPF_PROG_DETACH, &attr, sizeof(attr));
-}
-
 int open_raw_sock(const char *name)
 {
 	struct sockaddr_ll sll;
diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
index cf7d2386d1f9..cc815624aacf 100644
--- a/samples/bpf/libbpf.h
+++ b/samples/bpf/libbpf.h
@@ -6,9 +6,6 @@
 
 struct bpf_insn;
 
-int bpf_prog_attach(int prog_fd, int attachable_fd, enum bpf_attach_type type);
-int bpf_prog_detach(int attachable_fd, enum bpf_attach_type type);
-
 /* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */
 
 #define BPF_ALU64_REG(OP, DST, SRC)				\
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index d0afb26c2e0f..e19335df0d3a 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -167,3 +167,24 @@ int bpf_obj_get(const char *pathname)
 
 	return sys_bpf(BPF_OBJ_GET, &attr, sizeof(attr));
 }
+
+int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type)
+{
+	union bpf_attr attr = {
+		.target_fd = target_fd,
+		.attach_bpf_fd = prog_fd,
+		.attach_type = type,
+	};
+
+	return sys_bpf(BPF_PROG_ATTACH, &attr, sizeof(attr));
+}
+
+int bpf_prog_detach(int target_fd, enum bpf_attach_type type)
+{
+	union bpf_attr attr = {
+		.target_fd = target_fd,
+		.attach_type = type,
+	};
+
+	return sys_bpf(BPF_PROG_DETACH, &attr, sizeof(attr));
+}
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 7fcdce16fd62..a2f9853dd882 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -41,5 +41,8 @@ int bpf_map_delete_elem(int fd, void *key);
 int bpf_map_get_next_key(int fd, void *key, void *next_key);
 int bpf_obj_pin(int fd, const char *pathname);
 int bpf_obj_get(const char *pathname);
+int bpf_prog_attach(int prog_fd, int attachable_fd, enum bpf_attach_type type);
+int bpf_prog_detach(int attachable_fd, enum bpf_attach_type type);
+
 
 #endif
-- 
2.10.2

^ permalink raw reply related

* [PATCH perf/core REBASE 0/5] Reuse libbpf from samples/bpf
From: Joe Stringer @ 2016-12-14 22:43 UTC (permalink / raw)
  To: linux-kernel, acme; +Cc: netdev, wangnan0, ast, daniel

Arnaldo, here's the refresh of this series that you requested after the merge
with net-next. It is based on commit 1f125a4aa4d8 ("tools lib bpf: Add flags
to bpf_create_map()") from perf/core today.

Patch #3 is new, but trivial. It has the biggest changes compared to the
version that you previously applied to perf/core.

---

Update tools/lib/bpf to provide the remaining bpf wrapper pieces needed by the
samples/bpf/ code, then get rid of all of the duplicate BPF libraries in
samples/bpf/libbpf.[ch].

---
REBASE: Rebased v3 that was applied to perf/core.
        Resolved merge conflict with net-next.
        New patch shifts bpf_prog_{attach,detach}() to libbpf.
        Drop unnecessary build targets
        Drop extra unneeded log buffers

v3: Add ack for first patch.
    Split out second patch from v2 into separate changes for remaining diff.
    Add patches to switch samples/bpf over to using tools/lib/.

(Was "libbpf: Synchronize implementations")
v2: Don't shift non-bpf code into libbpf.
    Drop the patch to synchronize ELF definitions with tc.

v1: https://www.mail-archive.com/netdev@vger.kernel.org/msg135088.html
    First post.


Joe Stringer (5):
  samples/bpf: Make samples more libbpf-centric
  samples/bpf: Switch over to libbpf
  tools lib bpf: Add bpf_prog_{attach,detach}
  samples/bpf: Remove perf_event_open() declaration
  samples/bpf: Move open_raw_sock to separate header

 samples/bpf/Makefile                              |  69 +++++----
 samples/bpf/README.rst                            |   4 +-
 samples/bpf/bpf_load.c                            |  20 ++-
 samples/bpf/bpf_load.h                            |   3 +
 samples/bpf/fds_example.c                         |  10 +-
 samples/bpf/lathist_user.c                        |   2 +-
 samples/bpf/libbpf.c                              | 176 ----------------------
 samples/bpf/libbpf.h                              |  28 +---
 samples/bpf/lwt_len_hist_user.c                   |   6 +-
 samples/bpf/offwaketime_user.c                    |   8 +-
 samples/bpf/sampleip_user.c                       |   7 +-
 samples/bpf/sock_example.c                        |  13 +-
 samples/bpf/sock_example.h                        |  35 +++++
 samples/bpf/sockex1_user.c                        |   7 +-
 samples/bpf/sockex2_user.c                        |   5 +-
 samples/bpf/sockex3_user.c                        |   5 +-
 samples/bpf/spintest_user.c                       |   8 +-
 samples/bpf/tc_l2_redirect_user.c                 |   4 +-
 samples/bpf/test_cgrp2_array_pin.c                |   4 +-
 samples/bpf/test_cgrp2_attach.c                   |  11 +-
 samples/bpf/test_cgrp2_attach2.c                  |   7 +-
 samples/bpf/test_cgrp2_sock.c                     |   6 +-
 samples/bpf/test_current_task_under_cgroup_user.c |   8 +-
 samples/bpf/test_lru_dist.c                       |  32 ++--
 samples/bpf/test_probe_write_user_user.c          |   2 +-
 samples/bpf/trace_event_user.c                    |  23 +--
 samples/bpf/trace_output_user.c                   |   5 +-
 samples/bpf/tracex2_user.c                        |  10 +-
 samples/bpf/tracex3_user.c                        |   4 +-
 samples/bpf/tracex4_user.c                        |   4 +-
 samples/bpf/tracex6_user.c                        |   5 +-
 samples/bpf/xdp1_user.c                           |   2 +-
 samples/bpf/xdp_tx_iptunnel_user.c                |   6 +-
 tools/lib/bpf/bpf.c                               |  21 +++
 tools/lib/bpf/bpf.h                               |   3 +
 35 files changed, 231 insertions(+), 332 deletions(-)
 delete mode 100644 samples/bpf/libbpf.c
 create mode 100644 samples/bpf/sock_example.h

-- 
2.10.2

^ permalink raw reply

* Re: [PATCHv3 perf/core 0/7] Reuse libbpf from samples/bpf
From: Joe Stringer @ 2016-12-14 22:46 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo; +Cc: Daniel Borkmann, LKML, netdev, Wang Nan, ast
In-Reply-To: <20161214145512.GQ5482@kernel.org>

On 14 December 2016 at 06:55, Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> Em Wed, Dec 14, 2016 at 10:25:01AM -0300, Arnaldo Carvalho de Melo escreveu:
>> Em Fri, Dec 09, 2016 at 04:30:54PM +0100, Daniel Borkmann escreveu:
>> > On 12/09/2016 04:09 PM, Arnaldo Carvalho de Melo wrote:
>> > Please note that this might result in hopefully just a minor merge issue
>> > with net-next. Looks like patch 4/7 touches test_maps.c and test_verifier.c,
>> > which moved to a new bpf selftest suite [1] this net-next cycle. Seems it's
>> > just log buffer and some renames there, which can be discarded for both
>> > files sitting in selftests.
>>
>> Yeah, I've got to this point, and the merge has a little bit more than
>> that, including BPF_PROG_ATTACH/BPF_PROG_DETACH, etc, working on it...
>
> So, Joe, can you try refreshing this work, starting from what I have in
> perf/core? It has the changes coming from net-next that Daniel warned us about
> and some more.

Hi Arnaldo,

I've just respun this series based on the version you previously
applied to perf/core. Since bpf_prog_{attach,detach}() were added to
samples/libbpf, a new patch will shift these over to tools/lib/bpf.
Other than that, I folded "samples/bpf: Drop unnecessary build
targets." back into "samples/bpf: Switch over to libbpf", and I
noticed that there were a couple of unnecessary log buffers with the
latest changes. For any new sample programs, those were fixed up to
use libbpf as well.

Don't forget to do a "make headers_install" before attempting to build
the samples; access to the latest headers is required (as per the
readme in samples/bpf).

Thanks,
Joe

^ permalink raw reply

* Re: Designing a safe RX-zero-copy Memory Model for Networking
From: Alexander Duyck @ 2016-12-14 22:45 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: John Fastabend, David Miller, Christoph Lameter, rppt, Netdev,
	linux-mm, willemdebruijn.kernel, Björn Töpel,
	magnus.karlsson, Mel Gorman, Tom Herbert, Brenden Blanco,
	Tariq Toukan, Saeed Mahameed, Brandeburg, Jesse, METH,
	Vlad Yasevich
In-Reply-To: <20161214222927.587a8ac4@redhat.com>

On Wed, Dec 14, 2016 at 1:29 PM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Wed, 14 Dec 2016 08:45:08 -0800
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> I agree.  This is a no-go from the performance perspective as well.
>> At a minimum you would have to be zeroing out the page between uses to
>> avoid leaking data, and that assumes that the program we are sending
>> the pages to is slightly well behaved.  If we think zeroing out an
>> sk_buff is expensive wait until we are trying to do an entire 4K page.
>
> Again, yes the page will be zero'ed out, but only when entering the
> page_pool. Because they are recycled they are not cleared on every use.
> Thus, performance does not suffer.

So you are talking about recycling, but not clearing the page when it
is recycled.  That right there is my problem with this.  It is fine if
you assume the pages are used by the application only, but you are
talking about using them for both the application and for the regular
network path.  You can't do that.  If you are recycling you will have
to clear the page every time you put it back onto the Rx ring,
otherwise you can leak the recycled memory into user space and end up
with a user space program being able to snoop data out of the skb.
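
To make the requirement concrete, here is a minimal userspace model of
"clear on every recycle".  It is only a sketch: model_page,
model_rx_ring and model_ring_recycle() are made-up names for
illustration, not kernel or page_pool APIs, and a real driver would
also have to deal with DMA mapping and sync.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096	/* assumed 4K page, as in the discussion */
#define MODEL_RING_SIZE 4	/* arbitrary small ring for the sketch */

/* Hypothetical stand-in for a page that is also mapped to userspace. */
struct model_page {
	unsigned char data[MODEL_PAGE_SIZE];
};

/* Hypothetical stand-in for an Rx ring holding recycled pages. */
struct model_rx_ring {
	struct model_page *slots[MODEL_RING_SIZE];
	size_t count;
};

/*
 * Put a page back onto the ring.  The page is zeroed first, so nothing
 * the network stack wrote into it is visible to the next consumer.
 * Returns 0 on success, -1 if the ring is already full.
 */
int model_ring_recycle(struct model_rx_ring *ring, struct model_page *page)
{
	assert(ring && page);

	if (ring->count == MODEL_RING_SIZE)
		return -1;

	memset(page->data, 0, sizeof(page->data));
	ring->slots[ring->count++] = page;
	return 0;
}
```

The memset() sits on the recycle path itself, which is why the cost is
paid per use rather than once when the page enters the pool.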

> Besides clearing large mem area is not as bad as clearing small.
> Clearing an entire page does cost something, as mentioned before 143
> cycles, which is 28 bytes-per-cycle (4096/143).  And clearing 256 bytes
> cost 36 cycles which is only 7 bytes-per-cycle (256/36).

What I am saying is that you are going to be clearing the 4K blocks
each time they are recycled.  You can't have the pages shared between
user-space and the network stack unless you have true isolation.  If
you are allowing network stack pages to be recycled back into the
user-space application you open up all sorts of leaks where the
application can snoop into data it shouldn't have access to.
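
For reference, the throughput figures quoted above work out as
follows.  This is just the arithmetic from the quoted numbers (143
cycles per 4096 bytes, 36 cycles per 256 bytes), not a new
measurement, and bytes_per_cycle() is a hypothetical helper name.

```c
#include <assert.h>

/*
 * Bytes cleared per CPU cycle, rounded down to a whole number.
 * 'cycles' is the measured cost of clearing 'bytes' bytes.
 */
unsigned int bytes_per_cycle(unsigned int bytes, unsigned int cycles)
{
	assert(cycles != 0);
	return bytes / cycles;
}
```

bytes_per_cycle(4096, 143) gives 28 and bytes_per_cycle(256, 36) gives
7, matching the quoted figures.  The better per-byte rate does not
change the fact that a full page clear is still 143 cycles each time
it has to be done.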

>> I think we are stuck with having to use a HW filter to split off
>> application traffic to a specific ring, and then having to share the
>> memory between the application and the kernel on that ring only.  Any
>> other approach just opens us up to all sorts of security concerns
>> since it would be possible for the application to try to read and
>> possibly write any data it wants into the buffers.
>
> This is why I wrote a document[1], trying to outline how this is possible,
> going through all the combinations, and asking the community to find
> faults in my idea.  Inlining it again, as nobody really replied on the
> content of the doc.
>
> -
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
>
> [1] https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html
>
> ===========================
> Memory Model for Networking
> ===========================
>
> This design describes how the page_pool changes the memory model for
> networking in the NIC (Network Interface Card) drivers.
>
> .. Note:: The catch for driver developers is that, once an application
>           request zero-copy RX, then the driver must use a specific
>           SKB allocation mode and might have to reconfigure the
>           RX-ring.
>
>
> Design target
> =============
>
> Allow the NIC to function as a normal Linux NIC and be shared in a
> safe manner, between the kernel network stack and an accelerated
> userspace application using RX zero-copy delivery.
>
> Target is to provide the basis for building RX zero-copy solutions in
> a memory safe manner.  An efficient communication channel for userspace
> delivery is out of scope for this document, but OOM considerations are
> discussed below (`Userspace delivery and OOM`_).
>
> Background
> ==========
>
> The SKB or ``struct sk_buff`` is the fundamental meta-data structure
> for network packets in the Linux Kernel network stack.  It is a fairly
> complex object and can be constructed in several ways.
>
> From a memory perspective there are two ways depending on
> RX-buffer/page state:
>
> 1) Writable packet page
> 2) Read-only packet page
>
> To take full potential of the page_pool, the drivers must actually
> support handling both options depending on the configuration state of
> the page_pool.
>
> Writable packet page
> --------------------
>
> When the RX packet page is writable, the SKB setup is fairly straight
> forward.  The SKB->data (and skb->head) can point directly to the page
> data, adjusting the offset according to drivers headroom (for adding
> headers) and setting the length according to the DMA descriptor info.
>
> The page/data need to be writable, because the network stack need to
> adjust headers (like TimeToLive and checksum) or even add or remove
> headers for encapsulation purposes.
>
> A subtle catch, which also requires a writable page, is that the SKB
> also have an accompanying "shared info" data-structure ``struct
> skb_shared_info``.  This "skb_shared_info" is written into the
> skb->data memory area at the end (skb->end) of the (header) data.  The
> skb_shared_info contains semi-sensitive information, like kernel
> memory pointers to other pages (which might be pointers to more packet
> data).  This would be bad from a zero-copy point of view to leak this
> kind of information.

This should be the default once we get things moved over to using the
DMA_ATTR_SKIP_CPU_SYNC DMA attribute.  It will be a little while more
before it gets fully into Linus's tree.  It looks like the swiotlb
bits have been accepted, just waiting on the ability to map a page w/
attributes and the remainder of the patches that are floating around
in mmotm and linux-next.

BTW, any ETA on when we might expect to start seeing code related to
the page_pool?  It is much easier to review code versus these kinds of
blueprints.

> Read-only packet page
> ---------------------
>
> When the RX packet page is read-only, the construction of the SKB is
> significantly more complicated and even involves one more memory
> allocation.
>
> 1) Allocate a new separate writable memory area, and point skb->data
>    here.  This is needed due to (above described) skb_shared_info.
>
> 2) Memcpy packet headers into this (skb->data) area.
>
> 3) Clear part of skb_shared_info struct in writable-area.
>
> 4) Setup pointer to packet-data in the page (in skb_shared_info->frags)
>    and adjust the page_offset to be past the headers just copied.
>
> It is useful (later) that the network stack have this notion that part
> of the packet and a page can be read-only.  This implies that the
> kernel will not "pollute" this memory with any sensitive information.
> This is good from a zero-copy point of view, but bad from a
> performance perspective.

This will hopefully become a legacy approach.

>
> NIC RX Zero-Copy
> ================
>
> Doing NIC RX zero-copy involves mapping RX pages into userspace.  This
> involves costly mapping and unmapping operations in the address space
> of the userspace process.  Plus for doing this safely, the page memory
> need to be cleared before using it, to avoid leaking kernel
> information to userspace, also a costly operation.  The page_pool base
> "class" of optimization is moving these kind of operations out of the
> fastpath, by recycling and lifetime control.
>
> Once a NIC RX-queue's page_pool have been configured for zero-copy
> into userspace, then can packets still be allowed to travel the normal
> stack?
>
> Yes, this should be possible, because the driver can use the
> SKB-read-only mode, which avoids polluting the page data with
> kernel-side sensitive data.  This implies, when a driver RX-queue
> switch page_pool to RX-zero-copy mode it MUST also switch to
> SKB-read-only mode (for normal stack delivery for this RXq).

This is the part that is wrong.  Once userspace has access to the
pages in an Rx ring that ring cannot be used for regular kernel-side
networking.  If it is, then sensitive kernel data may be leaked
because the application has full access to any page on the ring so it
could read the data at any time regardless of where the data is meant
to be delivered.

> XDP can be used for controlling which pages that gets RX zero-copied
> to userspace.  The page is still writable for the XDP program, but
> read-only for normal stack delivery.

Making the page read-only doesn't get you anything.  You still have a
conflict since user-space can read any packet directly out of the
page.

> Kernel safety
> -------------
>
> For the paranoid, how do we protect the kernel from a malicious
> userspace program.  Sure there will be a communication interface
> between kernel and userspace, that synchronize ownership of pages.
> But a userspace program can violate this interface, given pages are
> kept VMA mapped, the program can in principle access all the memory
> pages in the given page_pool.  This opens up for a malicious (or
> defect) program modifying memory pages concurrently with the kernel
> and DMA engine using them.
>
> An easy way to get around userspace modifying page data contents is
> simply to map pages read-only into userspace.
>
> .. Note:: The first implementation target is read-only zero-copy RX
>           page to userspace and require driver to use SKB-read-only
>           mode.

This allows for Rx but what do we do about Tx?  It sounds like
Christoph's RDMA approach might be the way to go.

> Advanced: Allowing userspace write access?
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> What if userspace need write access? Flipping the page permissions per
> transfer will likely kill performance (as this likely affects the
> TLB-cache).
>
> I will argue that giving userspace write access is still possible,
> without risking a kernel crash.  This is related to the SKB-read-only
> mode that copies the packet headers (in to another memory area,
> inaccessible to userspace).  The attack angle is to modify packet
> headers after they passed some kernel network stack validation step
> (as once headers are copied they are out of "reach").
>
> Situation classes where memory page can be modified concurrently:
>
> 1) When DMA engine owns the page.  Not a problem, as DMA engine will
>    simply overwrite data.
>
> 2) Just after DMA engine finish writing.  Not a problem, the packet
>    will go through netstack validation and be rejected.
>
> 3) While XDP reads data. This can lead to XDP/eBPF program goes into a
>    wrong code branch, but the eBPF virtual machine should not be able
>    to crash the kernel. The worst outcome is a wrong or invalid XDP
>    return code.
>
> 4) Before SKB with read-only page is constructed. Not a problem, the
>    packet will go through netstack validation and be rejected.
>
> 5) After SKB with read-only page has been constructed.  Remember the
>    packet headers were copied into a separate memory area, and the
>    page data is pointed to with an offset past the copied headers.
>    Thus, userspace cannot modify the headers used for netstack
>    validation.  It can only modify packet data contents, which is less
>    critical as it cannot crash the kernel, and eventually this will be
>    caught by packet checksum validation.
>
> 6) After netstack delivered packet to another userspace process. Not a
>    problem, as it cannot crash the kernel.  It might corrupt
>    packet-data being read by another userspace process, which is one
>    argument for requiring elevated privileges to get write access
>    (like CAP_NET_ADMIN).

If userspace has access to a ring we shouldn't be using SKBs on it
really anyway.  We should probably expect XDP to be handling all the
packaging so items 4-6 can probably be dropped.

>
> Userspace delivery and OOM
> --------------------------
>
> These RX pages are likely mapped to userspace via mmap(), so-far so
> good.  It is key to performance to get an efficient way of signaling
> between kernel and userspace, e.g what page are ready for consumption,
> and when userspace are done with the page.
>
> It is outside the scope of page_pool to provide such a queuing
> structure, but the page_pool can offer some means of protecting the
> system resource usage.  It is a classical problem that resources
> (e.g. the page) must be returned in a timely manner, else the system,
> in this case, will run out of memory.  Any system/design with
> unbounded memory allocation can lead to Out-Of-Memory (OOM)
> situations.
>
> Communication between kernel and userspace is likely going to be some
> kind of queue.  Given transferring packets individually will have too
> much scheduling overhead.  A queue can implicitly function as a
> bulking interface, and offers a natural way to split the workload
> across CPU cores.
>
> This essentially boils down-to a two queue system, with the RX-ring
> queue and the userspace delivery queue.
>
> Two bad situations exists for the userspace queue:
>
> 1) Userspace is not consuming objects fast-enough. This should simply
>    result in packets getting dropped when enqueueing to a full
>    userspace queue (as queue *must* implement some limit). Open
>    question is; should this be reported or communicated to userspace.
>
> 2) Userspace is consuming objects fast, but not returning them in a
>    timely manner.  This is a bad situation, because it threatens the
>    system stability as it can lead to OOM.
>
> The page_pool should somehow protect the system in case 2.  The
> page_pool can detect the situation as it is able to track the number
> of outstanding pages, due to the recycle feedback loop.  Thus, the
> page_pool can have some configurable limit of allowed outstanding
> pages, which can protect the system against OOM.
>
> Note, the `Fbufs paper`_ propose to solve case 2 by allowing these
> pages to be "pageable", i.e. swap-able, but that is not an option for
> the page_pool as these pages are DMA mapped.
>
> .. _`Fbufs paper`:
>    http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.52.9688
>
> Effect of blocking allocation
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> The effect of page_pool, in case 2, that denies more allocations
> essentially result-in the RX-ring queue cannot be refilled and HW
> starts dropping packets due to "out-of-buffers".  For NICs with
> several HW RX-queues, this can be limited to a subset of queues (and
> admin can control which RX queue with HW filters).
>
> The question is if the page_pool can do something smarter in this
> case, to signal the consumers of these pages, before the maximum limit
> is hit (of allowed outstanding packets).  The MM-subsystem already
> have a concept of emergency PFMEMALLOC reserves and associate
> page-flags (e.g. page_is_pfmemalloc).  And the network stack already
> handle and react to this.  Could the same PFMEMALLOC system be used
> for marking pages when limit is close?
>
> This requires further analysis. One can imagine; this could be used at
> RX by XDP to mitigate the situation by dropping less-important frames.
> Given XDP chooses which pages are being sent to userspace it might have
> appropriate knowledge of what is relevant to drop(?).
>
> .. Note:: An alternative idea is using a data-structure that blocks
>           userspace from getting new pages before returning some.
>           (out of scope for the page_pool)
>
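
The configurable limit on outstanding pages described in the quoted
design (case 2 under "Userspace delivery and OOM") could look roughly
like the sketch below.  No page_pool code has been posted yet, so
model_pool, model_pool_get() and model_pool_put() are hypothetical
names, and the model ignores locking and per-CPU details.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical accounting state for one pool. */
struct model_pool {
	unsigned int outstanding;	/* pages handed out, not yet returned */
	unsigned int max_outstanding;	/* configurable limit */
};

/*
 * Try to hand out one page.  Returns false once the limit is reached,
 * which in the quoted design means the RX ring cannot be refilled and
 * the HW starts dropping packets on that queue.
 */
bool model_pool_get(struct model_pool *pool)
{
	assert(pool);

	if (pool->outstanding >= pool->max_outstanding)
		return false;

	pool->outstanding++;
	return true;
}

/* Account for one page coming back through the recycle path. */
void model_pool_put(struct model_pool *pool)
{
	assert(pool && pool->outstanding > 0);
	pool->outstanding--;
}
```

With max_outstanding set to 2, a third model_pool_get() fails until
one page has been returned with model_pool_put().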


^ permalink raw reply
