Netdev List

Netdev List
 help / color / mirror / Atom feed

* RE: [PATCH net-next 01/10] net/smc: add missing dev_put
From: Parav Pandit @ 2017-10-02 20:39 UTC (permalink / raw)
  To: Parav Pandit, Ursula Braun, davem@davemloft.net
  Cc: netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	linux-s390@vger.kernel.org, jwi@linux.vnet.ibm.com,
	schwidefsky@de.ibm.com, heiko.carstens@de.ibm.com,
	raspl@linux.vnet.ibm.com
In-Reply-To: <VI1PR0502MB30089D06B9C95D682E21511BD17D0@VI1PR0502MB3008.eurprd05.prod.outlook.com>

> -----Original Message-----
> From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-
> owner@vger.kernel.org] On Behalf Of Parav Pandit
> Sent: Monday, October 02, 2017 3:36 PM
> To: Ursula Braun <ubraun@linux.vnet.ibm.com>; davem@davemloft.net
> Cc: netdev@vger.kernel.org; linux-rdma@vger.kernel.org; linux-
> s390@vger.kernel.org; jwi@linux.vnet.ibm.com; schwidefsky@de.ibm.com;
> heiko.carstens@de.ibm.com; raspl@linux.vnet.ibm.com
> Subject: RE: [PATCH net-next 01/10] net/smc: add missing dev_put
> 
> Hi Ursula, Dave, Hans,
> 
> > -----Original Message-----
> > From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-
> > owner@vger.kernel.org] On Behalf Of Ursula Braun
> > Sent: Wednesday, September 20, 2017 6:58 AM
> > To: davem@davemloft.net
> > Cc: netdev@vger.kernel.org; linux-rdma@vger.kernel.org; linux-
> > s390@vger.kernel.org; jwi@linux.vnet.ibm.com; schwidefsky@de.ibm.com;
> > heiko.carstens@de.ibm.com; raspl@linux.vnet.ibm.com;
> > ubraun@linux.vnet.ibm.com
> > Subject: [PATCH net-next 01/10] net/smc: add missing dev_put
> >
> > From: Hans Wippel <hwippel@linux.vnet.ibm.com>
> >
> > In the infiniband part, SMC currently uses get_netdev which calls
> > dev_hold on the returned net device. However, the SMC code never calls
> > dev_put on that net device resulting in a wrong reference count.
> >
> > This patch adds a dev_put after the usage of the net device to fix the issue.
> >
> > Signed-off-by: Hans Wippel <hwippel@linux.vnet.ibm.com>
> > Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
> > ---
> >  net/smc/smc_ib.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c index
> > 547e0e113b17..0b5852299158 100644
> > --- a/net/smc/smc_ib.c
> > +++ b/net/smc/smc_ib.c
> > @@ -380,6 +380,7 @@ static int smc_ib_fill_gid_and_mac(struct
> > smc_ib_device *smcibdev, u8 ibport)
> >  	ndev = smcibdev->ibdev->get_netdev(smcibdev->ibdev, ibport);
> >  	if (ndev) {
> >  		memcpy(&smcibdev->mac, ndev->dev_addr, ETH_ALEN);
> > +		dev_put(ndev);
> 
> I am sorry for providing late comments. smc_ib_fill_gid_and_mac() is not coded
> correctly.
> Few fixes are needed.
> 1. ULP such as SMC should not open code/deference any function pointer like
> get_netdev() of the IB device.
> 2. Replace ib_query_gid(..., NULL)
> With
> ib_query_gid(..., gid_attr);
> 
> Use gid_attr.ndev to get the MAC address.
> Do dev_put(gid_attr.ndev);
> 
> Code should look like below,
> 
> struct ib_gid_attr gid_attr;
> 
> rc = ib_query_gid(..., &gid_attr);
> if (rc || !gid_addr.ndev)
> 	return -ENODEV;
> else
> 	memcpy(smcibdev->mac, ndev->dev_addr, ETH_ALEN);
> dev_put(gid_addr.ndev);
> --

Also,
smc_link_determine_gid() doesn't do dev_put(gattr.ndev) in for loop.
Please fix it as well.

^ permalink raw reply

* RE: [net-next 11/15] i40evf: Enable VF to request an alternate queue allocation
From: Yuval Mintz @ 2017-10-02 20:50 UTC (permalink / raw)
  To: Jeff Kirsher, davem@davemloft.net
  Cc: Alan Brady, netdev@vger.kernel.org, nhorman@redhat.com,
	sassmann@redhat.com, jogreene@redhat.com
In-Reply-To: <20171002194852.71970-12-jeffrey.t.kirsher@intel.com>

> +	case VIRTCHNL_OP_REQUEST_QUEUES: {
> +		struct virtchnl_vf_res_request *vfres =
> +			(struct virtchnl_vf_res_request *)msg;
> +		if (vfres->num_queue_pairs == adapter->num_req_queues)
> {
> +			adapter->flags |=
> I40EVF_FLAG_REINIT_ITR_NEEDED;
> +			i40evf_schedule_reset(adapter);
> +		} else {
> +			dev_info(&adapter->pdev->dev,
> +				 "Requested %d queues, PF can support
> %d\n",
> +				 adapter->num_req_queues,
> +				 vfres->num_queue_pairs);
> +			adapter->num_req_queues = 0;
> +		}
> +		}
> +		break;

Something is odd about your parenthesis.

>  	default:
>  		if (adapter->current_op && (v_opcode != adapter-
> >current_op))
>  			dev_warn(&adapter->pdev->dev, "Expected
> response %d from PF, received %d\n",
> --
> 2.14.2

^ permalink raw reply

* [PATCH v2 net-next] net: core: decouple ifalias get/set from rtnl lock
From: Florian Westphal @ 2017-10-02 21:29 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal

Device alias can be set by either rtnetlink (rtnl is held) or sysfs.

rtnetlink hold the rtnl mutex, sysfs acquires it for this purpose.
Add an extra mutex for it and use rcu to protect concurrent accesses.

This allows the sysfs path to not take rtnl and would later allow
to not hold it when dumping ifalias.

Based on suggestion from Eric Dumazet.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 Change since v1:
  - don't use a sequence lock, instead use kmalloc+kfree_rcu.

 include/linux/netdevice.h |  8 ++++++-
 net/core/dev.c            | 58 ++++++++++++++++++++++++++++++++++++-----------
 net/core/net-sysfs.c      | 17 +++++++-------
 net/core/rtnetlink.c      | 13 +++++++++--
 4 files changed, 71 insertions(+), 25 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e1d6ef130611..d04424cfffba 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -826,6 +826,11 @@ struct xfrmdev_ops {
 };
 #endif
 
+struct dev_ifalias {
+	struct rcu_head rcuhead;
+	char ifalias[];
+};
+
 /*
  * This structure defines the management hooks for network devices.
  * The following hooks can be defined; unless noted otherwise, they are
@@ -1632,7 +1637,7 @@ enum netdev_priv_flags {
 struct net_device {
 	char			name[IFNAMSIZ];
 	struct hlist_node	name_hlist;
-	char 			*ifalias;
+	struct dev_ifalias	__rcu *ifalias;
 	/*
 	 *	I/O specific fields
 	 *	FIXME: Merge these and struct ifmap into one
@@ -3275,6 +3280,7 @@ void __dev_notify_flags(struct net_device *, unsigned int old_flags,
 			unsigned int gchanges);
 int dev_change_name(struct net_device *, const char *);
 int dev_set_alias(struct net_device *, const char *, size_t);
+int dev_get_alias(const struct net_device *, char *, size_t);
 int dev_change_net_namespace(struct net_device *, struct net *, const char *);
 int __dev_set_mtu(struct net_device *, int);
 int dev_set_mtu(struct net_device *, int);
diff --git a/net/core/dev.c b/net/core/dev.c
index e350c768d4b5..cadd7218a6ae 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -188,6 +188,8 @@ static struct napi_struct *napi_by_id(unsigned int napi_id);
 DEFINE_RWLOCK(dev_base_lock);
 EXPORT_SYMBOL(dev_base_lock);
 
+static DEFINE_MUTEX(ifalias_mutex);
+
 /* protects napi_hash addition/deletion and napi_gen_id */
 static DEFINE_SPINLOCK(napi_hash_lock);
 
@@ -1265,29 +1267,59 @@ int dev_change_name(struct net_device *dev, const char *newname)
  */
 int dev_set_alias(struct net_device *dev, const char *alias, size_t len)
 {
-	char *new_ifalias;
-
-	ASSERT_RTNL();
+	struct dev_ifalias *new_alias, *old;
 
 	if (len >= IFALIASZ)
 		return -EINVAL;
 
-	if (!len) {
-		kfree(dev->ifalias);
-		dev->ifalias = NULL;
-		return 0;
+	if (len) {
+		new_alias = kmalloc(sizeof(*new_alias) + len + 1, GFP_KERNEL);
+		if (!new_alias)
+			return -ENOMEM;
+	} else {
+		new_alias = NULL;
 	}
 
-	new_ifalias = krealloc(dev->ifalias, len + 1, GFP_KERNEL);
-	if (!new_ifalias)
-		return -ENOMEM;
-	dev->ifalias = new_ifalias;
-	memcpy(dev->ifalias, alias, len);
-	dev->ifalias[len] = 0;
+	mutex_lock(&ifalias_mutex);
+
+	old = rcu_dereference_protected(dev->ifalias,
+					mutex_is_locked(&ifalias_mutex));
+	if (len) {
+		memcpy(new_alias->ifalias, alias, len);
+		new_alias->ifalias[len] = 0;
+	}
+
+	rcu_assign_pointer(dev->ifalias, new_alias);
+	mutex_unlock(&ifalias_mutex);
+
+	if (old)
+		kfree_rcu(old, rcuhead);
 
 	return len;
 }
 
+/**
+ *	dev_get_alias - get ifalias of a device
+ *	@dev: device
+ *	@alias: buffer to store name of ifalias
+ *	@len: size of buffer
+ *
+ *	get ifalias for a device.  Caller must make sure dev cannot go
+ *	away,  e.g. rcu read lock or own a reference count to device.
+ */
+int dev_get_alias(const struct net_device *dev, char *name, size_t len)
+{
+	const struct dev_ifalias *alias;
+	int ret = 0;
+
+	rcu_read_lock();
+	alias = rcu_dereference(dev->ifalias);
+	if (alias)
+		ret = snprintf(name, len, "%s", alias->ifalias);
+	rcu_read_unlock();
+
+	return ret;
+}
 
 /**
  *	netdev_features_change - device changes features
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 927a6dcbad96..51d5836d8fb9 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -391,10 +391,7 @@ static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr,
 	if (len >  0 && buf[len - 1] == '\n')
 		--count;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
 	ret = dev_set_alias(netdev, buf, count);
-	rtnl_unlock();
 
 	return ret < 0 ? ret : len;
 }
@@ -403,13 +400,12 @@ static ssize_t ifalias_show(struct device *dev,
 			    struct device_attribute *attr, char *buf)
 {
 	const struct net_device *netdev = to_net_dev(dev);
+	char tmp[IFALIASZ];
 	ssize_t ret = 0;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
-	if (netdev->ifalias)
-		ret = sprintf(buf, "%s\n", netdev->ifalias);
-	rtnl_unlock();
+	ret = dev_get_alias(netdev, tmp, sizeof(tmp));
+	if (ret > 0)
+		ret = sprintf(buf, "%s\n", tmp);
 	return ret;
 }
 static DEVICE_ATTR_RW(ifalias);
@@ -1488,7 +1484,10 @@ static void netdev_release(struct device *d)
 
 	BUG_ON(dev->reg_state != NETREG_RELEASED);
 
-	kfree(dev->ifalias);
+	/* no need to wait for rcu grace period:
+	 * device is dead and about to be freed.
+	 */
+	kfree(rcu_access_pointer(dev->ifalias));
 	netdev_freemem(dev);
 }
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index e6955da0d58d..3961f87cdc76 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1366,6 +1366,16 @@ static int nla_put_iflink(struct sk_buff *skb, const struct net_device *dev)
 	return nla_put_u32(skb, IFLA_LINK, ifindex);
 }
 
+static noinline_for_stack int nla_put_ifalias(struct sk_buff *skb,
+					      struct net_device *dev)
+{
+	char buf[IFALIASZ];
+	int ret;
+
+	ret = dev_get_alias(dev, buf, sizeof(buf));
+	return ret > 0 ? nla_put_string(skb, IFLA_IFALIAS, buf) : 0;
+}
+
 static int rtnl_fill_link_netnsid(struct sk_buff *skb,
 				  const struct net_device *dev)
 {
@@ -1425,8 +1435,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 	    nla_put_u8(skb, IFLA_CARRIER, netif_carrier_ok(dev)) ||
 	    (dev->qdisc &&
 	     nla_put_string(skb, IFLA_QDISC, dev->qdisc->ops->id)) ||
-	    (dev->ifalias &&
-	     nla_put_string(skb, IFLA_IFALIAS, dev->ifalias)) ||
+	    nla_put_ifalias(skb, dev) ||
 	    nla_put_u32(skb, IFLA_CARRIER_CHANGES,
 			atomic_read(&dev->carrier_changes)) ||
 	    nla_put_u8(skb, IFLA_PROTO_DOWN, dev->proto_down))
-- 
2.13.6

^ permalink raw reply related

* Question about "prevent dst uses after free" and WARNING in nf_xfrm_me_harder / refcnt / 4.13.3
From: Denys Fedoryshchenko @ 2017-10-02 21:33 UTC (permalink / raw)
  To: Eric Dumazet, Linux Kernel Network Developers

Hi,

I'm running now 4.13.3, is this patch required for 4.13 as well?
(it doesnt apply cleanly, as in 4.13 tcp_prequeue use 
skb_dst_force_safe, so i just renamed it there to skb_dst_force )

This is what i get on PPPoE BRAS on this kernel, patch applied
(no idea if its related to patch, but just mentioning i applied it, as 
it's not vanilla 4.13.3)

[ 7858.579600] ------------[ cut here ]------------
[ 7858.579818] WARNING: CPU: 2 PID: 0 at ./include/net/dst.h:254 
nf_xfrm_me_harder+0x61/0xec [nf_nat]
[ 7858.580160] Modules linked in: cls_fw act_police cls_u32 sch_ingress 
sch_htb pppoe pppox ppp_generic slhc netconsole configfs coretemp 
nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_conntrack_proto_gre 
tun xt_REDIRECT nf_nat_redirect xt_nat xt_TCPMSS ipt_REJECT 
nf_reject_ipv4 xt_set ts_bm xt_string xt_connmark xt_DSCP xt_mark 
xt_tcpudp ip_set_hash_net ip_set_hash_ip ip_set nfnetlink iptable_mangle 
iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 
nf_nat nf_conntrack ip_tables x_tables 8021q garp mrp stp llc ixgbe dca
[ 7858.581255] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 
4.13.3-build-0133 #27
[ 7858.581456] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80 
04/02/2015
[ 7858.581659] task: ffff880434e6a700 task.stack: ffffc90001904000
[ 7858.581862] RIP: 0010:nf_xfrm_me_harder+0x61/0xec [nf_nat]
[ 7858.582061] RSP: 0018:ffff880436483bc0 EFLAGS: 00010246
[ 7858.582259] RAX: 0000000000000000 RBX: ffffffff822df000 RCX: 
ffff8803ee9028ce
[ 7858.582461] RDX: 0000000000000014 RSI: ffff88041cd82900 RDI: 
ffff880436483bf8
[ 7858.582661] RBP: ffff880436483c20 R08: ffffffff81e0b400 R09: 
00000000b9160000
[ 7858.582865] R10: ffff8803ee9028e8 R11: 0000000000000000 R12: 
ffff880401e92100
[ 7858.583068] R13: 0000000000000001 R14: ffffffff822df000 R15: 
ffff88042e280078
[ 7858.583269] FS:  0000000000000000(0000) GS:ffff880436480000(0000) 
knlGS:0000000000000000
[ 7858.583608] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7858.583809] CR2: 00007f9b2886fc9c CR3: 0000000429223000 CR4: 
00000000001406e0
[ 7858.584013] Call Trace:
[ 7858.584209]  <IRQ>
[ 7858.584408]  ? nf_nat_ipv4_fn+0x12e/0x189 [nf_nat_ipv4]
[ 7858.584605]  nf_nat_ipv4_out+0xb6/0xd3 [nf_nat_ipv4]
[ 7858.584807]  iptable_nat_ipv4_out+0x15/0x17 [iptable_nat]
[ 7858.585010]  nf_hook_slow+0x2a/0x9a
[ 7858.585209]  ip_output+0x96/0xb4
[ 7858.585410]  ? ip_fragment.constprop.5+0x7c/0x7c
[ 7858.585610]  ip_forward_finish+0x5b/0x60
[ 7858.585811]  ip_forward+0x36d/0x37a
[ 7858.586010]  ? ip_frag_mem+0x11/0x11
[ 7858.586207]  ip_rcv_finish+0x2f9/0x304
[ 7858.586406]  ip_rcv+0x32a/0x337
[ 7858.586604]  ? ip_local_deliver_finish+0x1bb/0x1bb
[ 7858.586808]  __netif_receive_skb_core+0x4f0/0x847
[ 7858.587009]  __netif_receive_skb+0x18/0x5a
[ 7858.587208]  ? __netif_receive_skb+0x18/0x5a
[ 7858.587407]  process_backlog+0xa4/0x127
[ 7858.587606]  net_rx_action+0x11e/0x2d8
[ 7858.587811]  ? sched_clock_cpu+0x15/0x9b
[ 7858.588013]  __do_softirq+0xe7/0x23a
[ 7858.588210]  irq_exit+0x52/0x93
[ 7858.588408]  smp_call_function_single_interrupt+0x33/0x35
[ 7858.588610]  call_function_single_interrupt+0x83/0x90
[ 7858.588811] RIP: 0010:mwait_idle+0x93/0x13c
[ 7858.589007] RSP: 0018:ffffc90001907eb0 EFLAGS: 00000246 ORIG_RAX: 
ffffffffffffff04
[ 7858.589347] RAX: 0000000000000000 RBX: ffff880434e6a700 RCX: 
0000000000000000
[ 7858.589548] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 
0000000000000000
[ 7858.589750] RBP: ffffc90001907ec0 R08: 0000000000000000 R09: 
0000000000000001
[ 7858.589952] R10: ffffc90001907e58 R11: 000000000000024d R12: 
0000000000000002
[ 7858.590149] R13: 0000000000000000 R14: ffff880434e6a700 R15: 
ffff880434e6a700
[ 7858.590347]  </IRQ>
[ 7858.590541]  arch_cpu_idle+0xf/0x11
[ 7858.590738]  default_idle_call+0x25/0x27
[ 7858.590938]  do_idle+0xb8/0x150
[ 7858.591133]  cpu_startup_entry+0x1f/0x21
[ 7858.591332]  start_secondary+0xe8/0xeb
[ 7858.591531]  secondary_startup_64+0x9f/0x9f
[ 7858.591729] Code: 83 7e 48 00 74 07 48 8b b6 80 01 00 00 8b 86 80 00 
00 00 85 c0 74 14 8d 50 01 f0 0f b1 96 80 00 00 00 0f 94 c2 84 d2 75 04 
eb e8 <0f> ff 49 8b 4c 24 18 48 8d 55 a0 45 31 c0 48 89 df e8 d9 de 95
[ 7858.592239] ---[ end trace c089174999ff4fc3 ]---
[ 7858.592448] dst_release: dst:ffff88041cd82900 refcnt:-1
[ 8139.130003] igb 0000:07:00.0 eth0: igb: eth0 NIC Link is Down
[ 8139.130309] igb 0000:07:00.0 eth0: Reset adapter
[ 8164.431523] igb 0000:07:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps 
Full Duplex, Flow Control: RX/TX
[ 9149.190518] perf: interrupt took too long (3132 > 3128), lowering 
kernel.perf_event_max_sample_rate to 63000
[17205.528640] ------------[ cut here ]------------
[17205.528855] WARNING: CPU: 0 PID: 0 at ./include/net/dst.h:254 
nf_xfrm_me_harder+0x61/0xec [nf_nat]
[17205.529197] Modules linked in: cls_fw act_police cls_u32 sch_ingress 
sch_htb pppoe pppox ppp_generic slhc netconsole configfs coretemp 
nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_conntrack_proto_gre 
tun xt_REDIRECT nf_nat_redirect xt_nat xt_TCPMSS ipt_REJECT 
nf_reject_ipv4 xt_set ts_bm xt_string xt_connmark xt_DSCP xt_mark 
xt_tcpudp ip_set_hash_net ip_set_hash_ip ip_set nfnetlink iptable_mangle 
iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 
nf_nat nf_conntrack ip_tables x_tables 8021q garp mrp stp llc ixgbe dca
[17205.530294] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W       
4.13.3-build-0133 #27
[17205.530632] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80 
04/02/2015
[17205.530834] task: ffffffff8220e480 task.stack: ffffffff82200000
[17205.531033] RIP: 0010:nf_xfrm_me_harder+0x61/0xec [nf_nat]
[17205.531232] RSP: 0018:ffff880436403bc0 EFLAGS: 00010246
[17205.531434] RAX: 0000000000000000 RBX: ffffffff822df000 RCX: 
ffff8803f5fba0ce
[17205.531636] RDX: 0000000000000014 RSI: ffff8804041ae100 RDI: 
ffff880436403bf8
[17205.531836] RBP: ffff880436403c20 R08: ffffffff81e0b400 R09: 
0000000033d10000
[17205.532035] R10: ffff8803f5fba0e8 R11: 0000000000000000 R12: 
ffff88041e7a3500
[17205.532235] R13: 0000000000000001 R14: ffffffff822df000 R15: 
ffff88042e280078
[17205.532435] FS:  0000000000000000(0000) GS:ffff880436400000(0000) 
knlGS:0000000000000000
[17205.532775] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[17205.532974] CR2: 00007f9b2c6b52b8 CR3: 0000000429223000 CR4: 
00000000001406f0
[17205.533170] Call Trace:
[17205.533361]  <IRQ>
[17205.533555]  ? nf_nat_ipv4_fn+0x12e/0x189 [nf_nat_ipv4]
[17205.533754]  nf_nat_ipv4_out+0xb6/0xd3 [nf_nat_ipv4]
[17205.533953]  iptable_nat_ipv4_out+0x15/0x17 [iptable_nat]
[17205.534151]  nf_hook_slow+0x2a/0x9a
[17205.534344]  ip_output+0x96/0xb4
[17205.534539]  ? ip_fragment.constprop.5+0x7c/0x7c
[17205.534738]  ip_forward_finish+0x5b/0x60
[17205.534939]  ip_forward+0x36d/0x37a
[17205.535137]  ? ip_frag_mem+0x11/0x11
[17205.535337]  ip_rcv_finish+0x2f9/0x304
[17205.535537]  ip_rcv+0x32a/0x337
[17205.535732]  ? ip_local_deliver_finish+0x1bb/0x1bb
[17205.535935]  __netif_receive_skb_core+0x4f0/0x847
[17205.536135]  __netif_receive_skb+0x18/0x5a
[17205.536332]  ? __netif_receive_skb+0x18/0x5a
[17205.536533]  process_backlog+0xa4/0x127
[17205.536731]  net_rx_action+0x11e/0x2d8
[17205.536934]  ? sched_clock_cpu+0x15/0x9b
[17205.537134]  __do_softirq+0xe7/0x23a
[17205.537331]  irq_exit+0x52/0x93
[17205.537530]  smp_call_function_single_interrupt+0x33/0x35
[17205.537730]  call_function_single_interrupt+0x83/0x90
[17205.537934] RIP: 0010:mwait_idle+0x93/0x13c
[17205.538131] RSP: 0018:ffffffff82203e28 EFLAGS: 00000246 ORIG_RAX: 
ffffffffffffff04
[17205.538469] RAX: 0000000000000000 RBX: ffffffff8220e480 RCX: 
0000000000000000
[17205.538668] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 
0000000000000000
[17205.538871] RBP: ffffffff82203e38 R08: 0000000000000000 R09: 
0000000000000001
[17205.539071] R10: ffffffff82203dd0 R11: 000000000000002a R12: 
0000000000000000
[17205.539271] R13: 0000000000000000 R14: ffffffff8220e480 R15: 
ffffffff8220e480
[17205.539472]  </IRQ>
[17205.539670]  arch_cpu_idle+0xf/0x11
[17205.539869]  default_idle_call+0x25/0x27
[17205.540068]  do_idle+0xb8/0x150
[17205.540266]  cpu_startup_entry+0x1f/0x21
[17205.540465]  rest_init+0xb5/0xb7
[17205.540665]  start_kernel+0x3b0/0x3bd
[17205.540864]  x86_64_start_reservations+0x2a/0x2c
[17205.541063]  x86_64_start_kernel+0x16a/0x178
[17205.541262]  secondary_startup_64+0x9f/0x9f
[17205.541458] Code: 83 7e 48 00 74 07 48 8b b6 80 01 00 00 8b 86 80 00 
00 00 85 c0 74 14 8d 50 01 f0 0f b1 96 80 00 00 00 0f 94 c2 84 d2 75 04 
eb e8 <0f> ff 49 8b 4c 24 18 48 8d 55 a0 45 31 c0 48 89 df e8 d9 de 95
[17205.541964] ---[ end trace c089174999ff4fc4 ]---
[17205.542165] dst_release: dst:ffff8804041ae100 refcnt:-1

^ permalink raw reply

* Re: [PATCH net-next] vhost_net: do not stall on zerocopy depletion
From: Willem de Bruijn @ 2017-10-02 21:34 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Network Development, David Miller, Jason Wang, Koichiro Den,
	virtualization, Willem de Bruijn
In-Reply-To: <CAF=yD-KotdpHs96GomMKR-BqG3Gyrvo+to0sk2=a6E5BKjgpkg@mail.gmail.com>

On Fri, Sep 29, 2017 at 9:25 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> On Fri, Sep 29, 2017 at 3:38 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> On Wed, Sep 27, 2017 at 08:25:56PM -0400, Willem de Bruijn wrote:
>>> From: Willem de Bruijn <willemb@google.com>
>>>
>>> Vhost-net has a hard limit on the number of zerocopy skbs in flight.
>>> When reached, transmission stalls. Stalls cause latency, as well as
>>> head-of-line blocking of other flows that do not use zerocopy.
>>>
>>> Instead of stalling, revert to copy-based transmission.
>>>
>>> Tested by sending two udp flows from guest to host, one with payload
>>> of VHOST_GOODCOPY_LEN, the other too small for zerocopy (1B). The
>>> large flow is redirected to a netem instance with 1MBps rate limit
>>> and deep 1000 entry queue.
>>>
>>>   modprobe ifb
>>>   ip link set dev ifb0 up
>>>   tc qdisc add dev ifb0 root netem limit 1000 rate 1MBit
>>>
>>>   tc qdisc add dev tap0 ingress
>>>   tc filter add dev tap0 parent ffff: protocol ip \
>>>       u32 match ip dport 8000 0xffff \
>>>       action mirred egress redirect dev ifb0
>>>
>>> Before the delay, both flows process around 80K pps. With the delay,
>>> before this patch, both process around 400. After this patch, the
>>> large flow is still rate limited, while the small reverts to its
>>> original rate. See also discussion in the first link, below.
>>>
>>> The limit in vhost_exceeds_maxpend must be carefully chosen. When
>>> vq->num >> 1, the flows remain correlated. This value happens to
>>> correspond to VHOST_MAX_PENDING for vq->num == 256. Allow smaller
>>> fractions and ensure correctness also for much smaller values of
>>> vq->num, by testing the min() of both explicitly. See also the
>>> discussion in the second link below.
>>>
>>> Link:http://lkml.kernel.org/r/CAF=yD-+Wk9sc9dXMUq1+x_hh=3ThTXa6BnZkygP3tgVpjbp93g@mail.gmail.com
>>> Link:http://lkml.kernel.org/r/20170819064129.27272-1-den@klaipeden.com
>>> Signed-off-by: Willem de Bruijn <willemb@google.com>
>>
>> I'd like to see the effect on the non rate limited case though.
>> If guest is quick won't we have lots of copies then?
>
> Yes, but not significantly more than without this patch.
>
> I ran 1, 10 and 100 flow tcp_stream throughput tests from a sender
> in the guest to a receiver in the host.
>
> To answer the other benchmark question first, I did not see anything
> noteworthy when increasing vq->num from 256 to 1024.
>
> With 1 and 10 flows without this patch all packets use zerocopy.
> With the patch, less than 1% eschews zerocopy.
>
> With 100 flows, even without this patch, 90+% of packets are copied.
> Some zerocopy packets from vhost_net fail this test in tun.c
>
>     if (iov_iter_npages(&i, INT_MAX) <= MAX_SKB_FRAGS)
>
> Generating packets with up to 21 frags. I'm not sure yet why or
> what the fraction of these packets is.

This seems to be a mix of page alignment and compound pages.
The iov_len is always well below the maximum, but frags exceed
page size and can start high in the initial page.

 tun_get_user: num_pages=21 max=17 iov_len=6 len=65226
   0: p_off=3264 len=6888
   1: p_off=1960 len=16384
   2: p_off=1960 len=6232
   3: p_off=0 len=10152
   4: p_off=1960 len=16384
   5: p_off=1960 len=9120

> But this in turn can
> disable zcopy_used in vhost_net_tx_select_zcopy for a
> larger share of packets:
>
>         return !net->tx_flush &&
>                 net->tx_packets / 64 >= net->tx_zcopy_err;
>

Testing iov_iter_npages() in handle_tx to inform zcopy_used allows
skipping these without turning off zerocopy for all other packets.

After implementing that, tx_zcopy_err drops to zero, but only around
40% of packets use zerocopy.

^ permalink raw reply

* RE: [net-next 11/15] i40evf: Enable VF to request an alternate queue allocation
From: Brady, Alan @ 2017-10-02 21:35 UTC (permalink / raw)
  To: Yuval Mintz, Kirsher, Jeffrey T, davem@davemloft.net
  Cc: netdev@vger.kernel.org, nhorman@redhat.com, sassmann@redhat.com,
	jogreene@redhat.com
In-Reply-To: <AM0PR0502MB368318470E6010AB9D6C07E1BF7D0@AM0PR0502MB3683.eurprd05.prod.outlook.com>

> -----Original Message-----
> From: Yuval Mintz [mailto:yuvalm@mellanox.com]
> Sent: Monday, October 2, 2017 1:51 PM
> To: Kirsher, Jeffrey T <jeffrey.t.kirsher@intel.com>; davem@davemloft.net
> Cc: Brady, Alan <alan.brady@intel.com>; netdev@vger.kernel.org;
> nhorman@redhat.com; sassmann@redhat.com; jogreene@redhat.com
> Subject: RE: [net-next 11/15] i40evf: Enable VF to request an alternate queue
> allocation
> 
> > +	case VIRTCHNL_OP_REQUEST_QUEUES: {
> > +		struct virtchnl_vf_res_request *vfres =
> > +			(struct virtchnl_vf_res_request *)msg;
> > +		if (vfres->num_queue_pairs == adapter->num_req_queues)
> > {
> > +			adapter->flags |=
> > I40EVF_FLAG_REINIT_ITR_NEEDED;
> > +			i40evf_schedule_reset(adapter);
> > +		} else {
> > +			dev_info(&adapter->pdev->dev,
> > +				 "Requested %d queues, PF can support
> > %d\n",
> > +				 adapter->num_req_queues,
> > +				 vfres->num_queue_pairs);
> > +			adapter->num_req_queues = 0;
> > +		}
> > +		}
> > +		break;
> 
> Something is odd about your parenthesis.
> 
> >  	default:
> >  		if (adapter->current_op && (v_opcode != adapter-
> > >current_op))
> >  			dev_warn(&adapter->pdev->dev, "Expected
> response %d from PF,
> > received %d\n",
> > --
> > 2.14.2

I'm not sure which parentheses you are referring to.  Perhaps you mean the double curly braces?  The outer set of braces are creating a block in the case statement so we can declare a variable specific to the case statement.  The inner brace is an if/else statement block.  They are both consistent with brace usage/alignment in the rest of the file and adding a new line between them generates a GCC warning.

-Alan

^ permalink raw reply

* Re: [PATCH v2 net-next] net: core: decouple ifalias get/set from rtnl lock
From: Florian Westphal @ 2017-10-02 21:47 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev
In-Reply-To: <20171002212942.14521-1-fw@strlen.de>

Florian Westphal <fw@strlen.de> wrote:
> +	mutex_lock(&ifalias_mutex);
> +
> +	old = rcu_dereference_protected(dev->ifalias,
> +					mutex_is_locked(&ifalias_mutex));
> +	if (len) {
> +		memcpy(new_alias->ifalias, alias, len);
> +		new_alias->ifalias[len] = 0;
> +	}

This can be done outside of the lock, so this can be
> +
> +	rcu_assign_pointer(dev->ifalias, new_alias);

rcu_swap_protected().

I'll send a v3.

^ permalink raw reply

* Re: [RFC net-next 0/5] TSN: Add qdisc-based config interfaces for traffic shapers
From: Levi Pearson @ 2017-10-02 21:48 UTC (permalink / raw)
  To: Rodney Cummings
  Cc: Linux Kernel Network Developers, Vinicius Costa Gomes,
	Henrik Austad, richardcochran@gmail.com,
	jesus.sanchez-palencia@intel.com, andre.guedes@intel.com
In-Reply-To: <DM2PR0401MB138982C6B22DF8154A82DB39927D0@DM2PR0401MB1389.namprd04.prod.outlook.com>

Hi Rodney,

On Mon, Oct 2, 2017 at 1:40 PM, Rodney Cummings <rodney.cummings@ni.com> wrote:

> It's a shame that someone built such hardware. Speaking as a manufacturer
> of daisy-chainable products for industrial/automotive applications, I
> wouldn't use that hardware in my products. The whole point of scheduling
> is to obtain close-to-optimal network determinism. If the hardware
> doesn't schedule properly, the product is probably better off using CBS.

I should note that the hardware I'm using was not designed to be a TSN
endpoint. It just happens that the hardware, designed for other
reasons, happens to have hardware support for Qbv and thus makes it an
interesting and inexpensive platform on which to do Qbv experiments.
Nevertheless, I believe the right combination of driver support for
the hardware shaping and appropriate software in the network stack
would allow it to schedule things fairly precisely in the
application-centric manner you've described. I will ultimately defer
to your judgement on that, though, and spend some time with the Qcc
draft to see where I may be wrong.

> It would be great if all of the user-level application code was scheduled
> as tightly as the network, but the reality is that we aren't there yet.
> Also, even if the end-station has a separately timed RT Linux thread for
> each stream, the order of those threads can vary. Therefore, when the
> end-station (SoC) has more than one talker, the frames for each talker
> can be written into the Qbv queue at any time, in any order.
> There is no scheduling of frames (i.e. streams)... only the queue.

Frames are presented to *the kernel* in any order. The qdisc
infrastructure provides mechanisms by which other scheduling policy
can be added between presentation to the kernel and presentation to
hardware queues.

Even with i210-style LaunchTime functionality, this must also be
provided. The hardware cannot re-order time-scheduled frames, and
reviewers of the SO_TXTIME proposal rejected the idea of having the
driver try to re-order scheduled frames already submitted to the
hardware-visible descriptor chains, and rightly so I think.

So far, I have thought of 3 places this ordering constraint might be provided:

1. Built into each driver with offload support for SO_TXTIME
2. A qdisc module that will re-order scheduled skbuffs within a certain window
3. Left to userspace to coordinate

I think 1 is particularly bad due to duplication of code. 3 is the
easy default from the point of view of the kernel, but not attractive
from the point of view of precise scheduling and low latency to
egress. Unless someone has a better idea, I think 2 would be the most
appropriate choice. I can think of a number of ways one might be
designed, but I have not thought them through fully yet.

> From the perspective of the CNC, that means that a Qbv-only end-station is
> sloppy. In contrast, an end-station using per-stream scheduling
> (e.g. i210 with LaunchTime) is precise, because each frame has a known time.
> It's true that P802.1Qcc supports both, so the CNC might support both.
> Nevertheless, the CNC is likely to compute a network-wide schedule
> that optimizes for the per-stream end-stations, and locates the windows
> for the Qbv-only end-stations in a slower interval.

Because out-of-order submission of frames to LaunchTime hardware would
result in the frame that is enqueued later but scheduled earlier only
transmitting once the later-scheduled frame completes, enqueuing
frames immediately with LaunchTime information results in even *worse*
sloppiness if two applications don't always submit their frames in the
correct order to the kernel during a cycle. If your queue is simply
gated at the time of the first scheduled launch, the frame that should
have been first will only be late by the length of the queued-first
frame. If they both have precise launch times, the frame that should
have been first will have to wait until the normal launch time of the
later-scheduled frame plus the time it takes for it to egress!

In either case, re-ordering is required if it is not controlled for in
userspace. In other words, I am not disagreeing with you about the
insufficiency of a Qbv shaper for an endpoint! I am just pointing out
that neither hardware LaunchTime support nor hardware Qbv window
support are sufficient by themselves. Nor, for that case, is the CBS
offload support sufficient to meet multi-stream requirements from SRP.
All of these provide a low-layer mechanism by which appropriate
scheduling of a multi-stream or multi-application endpoint can be
enhanced, but none of them provide the complete story by themselves.

> I realize that we are only doing CBS for now. I guess my main point is that
> if/when we add scheduling, it cannot be limited to Qbv. Per-stream scheduling
> is essential, because that is the assumption for most CNC designs that
> I'm aware of.

I understand and completely agree. I have believed that per-stream
shaping/scheduling in some form is essential from the beginning.
Likewise, CBS is not sufficient in itself for multi-stream systems in
which multiple applications provide streams scheduled for the same
traffic class. But CBS and SO_TXTIME are good first steps; they
provides hooks for drivers to finally support the mechanisms provided
by hardware, and also in the case of SO_TXTIME a hook on which further
time-sensitive enhancements to the network stack can be hung, such as
qdiscs that can store and release frames based on scheduled launch
times. When these are in place, I think stream-based scheduling via
qdisc would be a reasonable next step; perhaps a common design could
be found to work both for media streams over CBS and industrial
streams scheduled by launch time.

Levi

^ permalink raw reply

* [PATCH v3 net-next] net: core: decouple ifalias get/set from rtnl lock
From: Florian Westphal @ 2017-10-02 21:50 UTC (permalink / raw)
  To: netdev; +Cc: edumazet, Florian Westphal

Device alias can be set by either rtnetlink (rtnl is held) or sysfs.

rtnetlink hold the rtnl mutex, sysfs acquires it for this purpose.
Add an extra mutex for it and use rcu to protect concurrent accesses.

This allows the sysfs path to not take rtnl and would later allow
to not hold it when dumping ifalias.

Based on suggestion from Eric Dumazet.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 Changes since v2:
  - use rcu_swap_protected
 Change since v1:
  - don't use a sequence lock, instead use kmalloc+kfree_rcu
  - add comment for dev_get_alias
  - add comment why we use kfree and not kfree_rcu
  to free dev->alias during netdevice destruction

 include/linux/netdevice.h |  8 +++++++-
 net/core/dev.c            | 52 +++++++++++++++++++++++++++++++++++------------
 net/core/net-sysfs.c      | 17 ++++++++--------
 net/core/rtnetlink.c      | 13 ++++++++++--
 4 files changed, 65 insertions(+), 25 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e1d6ef130611..d04424cfffba 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -826,6 +826,11 @@ struct xfrmdev_ops {
 };
 #endif
 
+struct dev_ifalias {
+	struct rcu_head rcuhead;
+	char ifalias[];
+};
+
 /*
  * This structure defines the management hooks for network devices.
  * The following hooks can be defined; unless noted otherwise, they are
@@ -1632,7 +1637,7 @@ enum netdev_priv_flags {
 struct net_device {
 	char			name[IFNAMSIZ];
 	struct hlist_node	name_hlist;
-	char 			*ifalias;
+	struct dev_ifalias	__rcu *ifalias;
 	/*
 	 *	I/O specific fields
 	 *	FIXME: Merge these and struct ifmap into one
@@ -3275,6 +3280,7 @@ void __dev_notify_flags(struct net_device *, unsigned int old_flags,
 			unsigned int gchanges);
 int dev_change_name(struct net_device *, const char *);
 int dev_set_alias(struct net_device *, const char *, size_t);
+int dev_get_alias(const struct net_device *, char *, size_t);
 int dev_change_net_namespace(struct net_device *, struct net *, const char *);
 int __dev_set_mtu(struct net_device *, int);
 int dev_set_mtu(struct net_device *, int);
diff --git a/net/core/dev.c b/net/core/dev.c
index e350c768d4b5..1770097cfd86 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -188,6 +188,8 @@ static struct napi_struct *napi_by_id(unsigned int napi_id);
 DEFINE_RWLOCK(dev_base_lock);
 EXPORT_SYMBOL(dev_base_lock);
 
+static DEFINE_MUTEX(ifalias_mutex);
+
 /* protects napi_hash addition/deletion and napi_gen_id */
 static DEFINE_SPINLOCK(napi_hash_lock);
 
@@ -1265,29 +1267,53 @@ int dev_change_name(struct net_device *dev, const char *newname)
  */
 int dev_set_alias(struct net_device *dev, const char *alias, size_t len)
 {
-	char *new_ifalias;
-
-	ASSERT_RTNL();
+	struct dev_ifalias *new_alias = NULL;
 
 	if (len >= IFALIASZ)
 		return -EINVAL;
 
-	if (!len) {
-		kfree(dev->ifalias);
-		dev->ifalias = NULL;
-		return 0;
+	if (len) {
+		new_alias = kmalloc(sizeof(*new_alias) + len + 1, GFP_KERNEL);
+		if (!new_alias)
+			return -ENOMEM;
+
+		memcpy(new_alias->ifalias, alias, len);
+		new_alias->ifalias[len] = 0;
 	}
 
-	new_ifalias = krealloc(dev->ifalias, len + 1, GFP_KERNEL);
-	if (!new_ifalias)
-		return -ENOMEM;
-	dev->ifalias = new_ifalias;
-	memcpy(dev->ifalias, alias, len);
-	dev->ifalias[len] = 0;
+	mutex_lock(&ifalias_mutex);
+	rcu_swap_protected(dev->ifalias, new_alias,
+			   mutex_is_locked(&ifalias_mutex));
+	mutex_unlock(&ifalias_mutex);
+
+	if (new_alias)
+		kfree_rcu(new_alias, rcuhead);
 
 	return len;
 }
 
+/**
+ *	dev_get_alias - get ifalias of a device
+ *	@dev: device
+ *	@alias: buffer to store name of ifalias
+ *	@len: size of buffer
+ *
+ *	get ifalias for a device.  Caller must make sure dev cannot go
+ *	away,  e.g. rcu read lock or own a reference count to device.
+ */
+int dev_get_alias(const struct net_device *dev, char *name, size_t len)
+{
+	const struct dev_ifalias *alias;
+	int ret = 0;
+
+	rcu_read_lock();
+	alias = rcu_dereference(dev->ifalias);
+	if (alias)
+		ret = snprintf(name, len, "%s", alias->ifalias);
+	rcu_read_unlock();
+
+	return ret;
+}
 
 /**
  *	netdev_features_change - device changes features
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 927a6dcbad96..51d5836d8fb9 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -391,10 +391,7 @@ static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr,
 	if (len >  0 && buf[len - 1] == '\n')
 		--count;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
 	ret = dev_set_alias(netdev, buf, count);
-	rtnl_unlock();
 
 	return ret < 0 ? ret : len;
 }
@@ -403,13 +400,12 @@ static ssize_t ifalias_show(struct device *dev,
 			    struct device_attribute *attr, char *buf)
 {
 	const struct net_device *netdev = to_net_dev(dev);
+	char tmp[IFALIASZ];
 	ssize_t ret = 0;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
-	if (netdev->ifalias)
-		ret = sprintf(buf, "%s\n", netdev->ifalias);
-	rtnl_unlock();
+	ret = dev_get_alias(netdev, tmp, sizeof(tmp));
+	if (ret > 0)
+		ret = sprintf(buf, "%s\n", tmp);
 	return ret;
 }
 static DEVICE_ATTR_RW(ifalias);
@@ -1488,7 +1484,10 @@ static void netdev_release(struct device *d)
 
 	BUG_ON(dev->reg_state != NETREG_RELEASED);
 
-	kfree(dev->ifalias);
+	/* no need to wait for rcu grace period:
+	 * device is dead and about to be freed.
+	 */
+	kfree(rcu_access_pointer(dev->ifalias));
 	netdev_freemem(dev);
 }
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index e6955da0d58d..3961f87cdc76 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1366,6 +1366,16 @@ static int nla_put_iflink(struct sk_buff *skb, const struct net_device *dev)
 	return nla_put_u32(skb, IFLA_LINK, ifindex);
 }
 
+static noinline_for_stack int nla_put_ifalias(struct sk_buff *skb,
+					      struct net_device *dev)
+{
+	char buf[IFALIASZ];
+	int ret;
+
+	ret = dev_get_alias(dev, buf, sizeof(buf));
+	return ret > 0 ? nla_put_string(skb, IFLA_IFALIAS, buf) : 0;
+}
+
 static int rtnl_fill_link_netnsid(struct sk_buff *skb,
 				  const struct net_device *dev)
 {
@@ -1425,8 +1435,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 	    nla_put_u8(skb, IFLA_CARRIER, netif_carrier_ok(dev)) ||
 	    (dev->qdisc &&
 	     nla_put_string(skb, IFLA_QDISC, dev->qdisc->ops->id)) ||
-	    (dev->ifalias &&
-	     nla_put_string(skb, IFLA_IFALIAS, dev->ifalias)) ||
+	    nla_put_ifalias(skb, dev) ||
 	    nla_put_u32(skb, IFLA_CARRIER_CHANGES,
 			atomic_read(&dev->carrier_changes)) ||
 	    nla_put_u8(skb, IFLA_PROTO_DOWN, dev->proto_down))
-- 
2.13.6

^ permalink raw reply related

* Re: [PATCH net-next] ibmvnic: Improve output for unsupported stats
From: Andrew Lunn @ 2017-10-02 21:57 UTC (permalink / raw)
  To: John Allen; +Cc: netdev, Thomas Falcon
In-Reply-To: <6ce14d1a-11cc-37be-e6ca-b9ec4afb01a8@linux.vnet.ibm.com>

On Mon, Oct 02, 2017 at 03:31:39PM -0500, John Allen wrote:
> The vnic server can report -1 in the event that a given statistic is not
> supported. Currently, the -1 value is implicitly cast to an unsigned
> integer and appears through the ethtool -S output as a very large number.
> This patch improves this behavior by reporting 0 in the event that a
> given statistic is not supported.

Hi John

If it does not exist, why not skip it altogether?
ibmvnic_get_sset_count() could walk the list and count the valid ones.
ibmvnic_get_strings() could only return the name of the valid onces.

      Andrew

^ permalink raw reply

* Re: [net-next 00/15][pull request] 40GbE Intel Wired LAN Driver Updates 2017-10-02
From: David Miller @ 2017-10-02 22:17 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, nhorman, sassmann, jogreene
In-Reply-To: <20171002194852.71970-1-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Mon,  2 Oct 2017 12:48:37 -0700

> This series contains updates to i40e and i40evf.

Pulled, thanks Jeff.

I saw the feedback about "odd parenthesis" in patch #11 from Yuval
Mintz, but I simply don't see it at all.

^ permalink raw reply

* [PATCH net-next v6 3/4] bpf: add helper bpf_perf_prog_read_value
From: Yonghong Song @ 2017-10-02 22:42 UTC (permalink / raw)
  To: peterz, rostedt, ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20171002224218.3181418-1-yhs@fb.com>

This patch adds helper bpf_perf_prog_read_cvalue for perf event based bpf
programs, to read event counter and enabled/running time.
The enabled/running time is accumulated since the perf event open.

The typical use case for perf event based bpf program is to attach itself
to a single event. In such cases, if it is desirable to get scaling factor
between two bpf invocations, users can can save the time values in a map,
and use the value from the map and the current value to calculate
the scaling factor.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/linux/perf_event.h |  1 +
 include/uapi/linux/bpf.h   | 10 +++++++++-
 kernel/events/core.c       |  1 +
 kernel/trace/bpf_trace.c   | 28 ++++++++++++++++++++++++++++
 4 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 21d8c12..79b18a2 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -806,6 +806,7 @@ struct perf_output_handle {
 struct bpf_perf_event_data_kern {
 	struct pt_regs *regs;
 	struct perf_sample_data *data;
+	struct perf_event *event;
 };
 
 #ifdef CONFIG_CGROUP_PERF
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 4bdd4b2..6f6a236 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -599,6 +599,13 @@ union bpf_attr {
  *     @buf: buf to fill
  *     @buf_size: size of the buf
  *     Return: 0 on success or negative error code
+ *
+ * int bpf_perf_prog_read_value(ctx, buf, buf_size)
+ *     read perf prog attached perf event counter and enabled/running time
+ *     @ctx: pointer to ctx
+ *     @buf: buf to fill
+ *     @buf_size: size of the buf
+ *     Return : 0 on success or negative error code
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -656,7 +663,8 @@ union bpf_attr {
 	FN(sk_redirect_map),		\
 	FN(sock_map_update),		\
 	FN(xdp_adjust_meta),		\
-	FN(perf_event_read_value),
+	FN(perf_event_read_value),	\
+	FN(perf_prog_read_value),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/events/core.c b/kernel/events/core.c
index c761ef4..d1efac2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8081,6 +8081,7 @@ static void bpf_overflow_handler(struct perf_event *event,
 	struct bpf_perf_event_data_kern ctx = {
 		.data = data,
 		.regs = regs,
+		.event = event,
 	};
 	int ret = 0;
 
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 686dfa1..c4d617a 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -612,6 +612,32 @@ static const struct bpf_func_proto bpf_get_stackid_proto_tp = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_3(bpf_perf_prog_read_value_tp, struct bpf_perf_event_data_kern *, ctx,
+	   struct bpf_perf_event_value *, buf, u32, size)
+{
+	int err;
+
+	if (unlikely(size != sizeof(struct bpf_perf_event_value)))
+		return -EINVAL;
+
+	err = perf_event_read_local(ctx->event, &buf->counter, &buf->enabled,
+				    &buf->running);
+	if (unlikely(err)) {
+		memset(buf, 0, size);
+		return err;
+	}
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_perf_prog_read_value_proto_tp = {
+         .func           = bpf_perf_prog_read_value_tp,
+         .gpl_only       = true,
+         .ret_type       = RET_INTEGER,
+         .arg1_type      = ARG_PTR_TO_CTX,
+         .arg2_type      = ARG_PTR_TO_UNINIT_MEM,
+         .arg3_type      = ARG_CONST_SIZE,
+};
+
 static const struct bpf_func_proto *tp_prog_func_proto(enum bpf_func_id func_id)
 {
 	switch (func_id) {
@@ -619,6 +645,8 @@ static const struct bpf_func_proto *tp_prog_func_proto(enum bpf_func_id func_id)
 		return &bpf_perf_event_output_proto_tp;
 	case BPF_FUNC_get_stackid:
 		return &bpf_get_stackid_proto_tp;
+	case BPF_FUNC_perf_prog_read_value:
+		return &bpf_perf_prog_read_value_proto_tp;
 	default:
 		return tracing_func_proto(func_id);
 	}
-- 
2.9.5

^ permalink raw reply related

* [PATCH net-next v6 4/4] bpf: add a test case for helper bpf_perf_prog_read_value
From: Yonghong Song @ 2017-10-02 22:42 UTC (permalink / raw)
  To: peterz, rostedt, ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20171002224218.3181418-1-yhs@fb.com>

The bpf sample program trace_event is enhanced to use the new
helper to print out enabled/running time.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 samples/bpf/trace_event_kern.c            | 10 ++++++++++
 samples/bpf/trace_event_user.c            | 13 ++++++++-----
 tools/include/uapi/linux/bpf.h            |  3 ++-
 tools/testing/selftests/bpf/bpf_helpers.h |  3 +++
 4 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/samples/bpf/trace_event_kern.c b/samples/bpf/trace_event_kern.c
index 41b6115..a77a583d 100644
--- a/samples/bpf/trace_event_kern.c
+++ b/samples/bpf/trace_event_kern.c
@@ -37,10 +37,14 @@ struct bpf_map_def SEC("maps") stackmap = {
 SEC("perf_event")
 int bpf_prog1(struct bpf_perf_event_data *ctx)
 {
+	char time_fmt1[] = "Time Enabled: %llu, Time Running: %llu";
+	char time_fmt2[] = "Get Time Failed, ErrCode: %d";
 	char fmt[] = "CPU-%d period %lld ip %llx";
 	u32 cpu = bpf_get_smp_processor_id();
+	struct bpf_perf_event_value value_buf;
 	struct key_t key;
 	u64 *val, one = 1;
+	int ret;
 
 	if (ctx->sample_period < 10000)
 		/* ignore warmup */
@@ -54,6 +58,12 @@ int bpf_prog1(struct bpf_perf_event_data *ctx)
 		return 0;
 	}
 
+	ret = bpf_perf_prog_read_value(ctx, (void *)&value_buf, sizeof(struct bpf_perf_event_value));
+	if (!ret)
+	  bpf_trace_printk(time_fmt1, sizeof(time_fmt1), value_buf.enabled, value_buf.running);
+	else
+	  bpf_trace_printk(time_fmt2, sizeof(time_fmt2), ret);
+
 	val = bpf_map_lookup_elem(&counts, &key);
 	if (val)
 		(*val)++;
diff --git a/samples/bpf/trace_event_user.c b/samples/bpf/trace_event_user.c
index 7bd827b..bf4f1b6 100644
--- a/samples/bpf/trace_event_user.c
+++ b/samples/bpf/trace_event_user.c
@@ -127,6 +127,9 @@ static void test_perf_event_all_cpu(struct perf_event_attr *attr)
 	int *pmu_fd = malloc(nr_cpus * sizeof(int));
 	int i, error = 0;
 
+	/* system wide perf event, no need to inherit */
+	attr->inherit = 0;
+
 	/* open perf_event on all cpus */
 	for (i = 0; i < nr_cpus; i++) {
 		pmu_fd[i] = sys_perf_event_open(attr, -1, i, -1, 0);
@@ -154,6 +157,11 @@ static void test_perf_event_task(struct perf_event_attr *attr)
 {
 	int pmu_fd;
 
+	/* per task perf event, enable inherit so the "dd ..." command can be traced properly.
+	 * Enabling inherit will cause bpf_perf_prog_read_time helper failure.
+	 */
+	attr->inherit = 1;
+
 	/* open task bound event */
 	pmu_fd = sys_perf_event_open(attr, 0, -1, -1, 0);
 	if (pmu_fd < 0) {
@@ -175,14 +183,12 @@ static void test_bpf_perf_event(void)
 		.freq = 1,
 		.type = PERF_TYPE_HARDWARE,
 		.config = PERF_COUNT_HW_CPU_CYCLES,
-		.inherit = 1,
 	};
 	struct perf_event_attr attr_type_sw = {
 		.sample_freq = SAMPLE_FREQ,
 		.freq = 1,
 		.type = PERF_TYPE_SOFTWARE,
 		.config = PERF_COUNT_SW_CPU_CLOCK,
-		.inherit = 1,
 	};
 	struct perf_event_attr attr_hw_cache_l1d = {
 		.sample_freq = SAMPLE_FREQ,
@@ -192,7 +198,6 @@ static void test_bpf_perf_event(void)
 			PERF_COUNT_HW_CACHE_L1D |
 			(PERF_COUNT_HW_CACHE_OP_READ << 8) |
 			(PERF_COUNT_HW_CACHE_RESULT_ACCESS << 16),
-		.inherit = 1,
 	};
 	struct perf_event_attr attr_hw_cache_branch_miss = {
 		.sample_freq = SAMPLE_FREQ,
@@ -202,7 +207,6 @@ static void test_bpf_perf_event(void)
 			PERF_COUNT_HW_CACHE_BPU |
 			(PERF_COUNT_HW_CACHE_OP_READ << 8) |
 			(PERF_COUNT_HW_CACHE_RESULT_MISS << 16),
-		.inherit = 1,
 	};
 	struct perf_event_attr attr_type_raw = {
 		.sample_freq = SAMPLE_FREQ,
@@ -210,7 +214,6 @@ static void test_bpf_perf_event(void)
 		.type = PERF_TYPE_RAW,
 		/* Intel Instruction Retired */
 		.config = 0xc0,
-		.inherit = 1,
 	};
 
 	printf("Test HW_CPU_CYCLES\n");
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index db51cbc..cb7bec9 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -649,7 +649,8 @@ union bpf_attr {
 	FN(sk_redirect_map),		\
 	FN(sock_map_update),		\
 	FN(xdp_adjust_meta),		\
-	FN(perf_event_read_value),
+	FN(perf_event_read_value),	\
+	FN(perf_prog_read_value),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index c15ca83..e25dbf6 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -75,6 +75,9 @@ static int (*bpf_sock_map_update)(void *map, void *key, void *value,
 static int (*bpf_perf_event_read_value)(void *map, unsigned long long flags,
 					void *buf, unsigned int buf_size) =
 	(void *) BPF_FUNC_perf_event_read_value;
+static int (*bpf_perf_prog_read_value)(void *ctx, void *buf,
+				       unsigned int buf_size) =
+	(void *) BPF_FUNC_perf_prog_read_value;
 
 
 /* llvm builtin functions that eBPF C program may use to
-- 
2.9.5

^ permalink raw reply related

* [PATCH net-next v6 2/4] bpf: add a test case for helper bpf_perf_event_read_value
From: Yonghong Song @ 2017-10-02 22:42 UTC (permalink / raw)
  To: peterz, rostedt, ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20171002224218.3181418-1-yhs@fb.com>

The bpf sample program tracex6 is enhanced to use the new
helper to read enabled/running time as well.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 samples/bpf/tracex6_kern.c                | 26 ++++++++++++++++++++++++++
 samples/bpf/tracex6_user.c                | 13 ++++++++++++-
 tools/include/uapi/linux/bpf.h            |  3 ++-
 tools/testing/selftests/bpf/bpf_helpers.h |  3 +++
 4 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/samples/bpf/tracex6_kern.c b/samples/bpf/tracex6_kern.c
index e7d1803..46c557a 100644
--- a/samples/bpf/tracex6_kern.c
+++ b/samples/bpf/tracex6_kern.c
@@ -15,6 +15,12 @@ struct bpf_map_def SEC("maps") values = {
 	.value_size = sizeof(u64),
 	.max_entries = 64,
 };
+struct bpf_map_def SEC("maps") values2 = {
+	.type = BPF_MAP_TYPE_HASH,
+	.key_size = sizeof(int),
+	.value_size = sizeof(struct bpf_perf_event_value),
+	.max_entries = 64,
+};
 
 SEC("kprobe/htab_map_get_next_key")
 int bpf_prog1(struct pt_regs *ctx)
@@ -37,5 +43,25 @@ int bpf_prog1(struct pt_regs *ctx)
 	return 0;
 }
 
+SEC("kprobe/htab_map_lookup_elem")
+int bpf_prog2(struct pt_regs *ctx)
+{
+	u32 key = bpf_get_smp_processor_id();
+	struct bpf_perf_event_value *val, buf;
+	int error;
+
+	error = bpf_perf_event_read_value(&counters, key, &buf, sizeof(buf));
+	if (error)
+		return 0;
+
+	val = bpf_map_lookup_elem(&values2, &key);
+	if (val)
+		*val = buf;
+	else
+		bpf_map_update_elem(&values2, &key, &buf, BPF_NOEXIST);
+
+	return 0;
+}
+
 char _license[] SEC("license") = "GPL";
 u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/tracex6_user.c b/samples/bpf/tracex6_user.c
index a05a99a..3341a96 100644
--- a/samples/bpf/tracex6_user.c
+++ b/samples/bpf/tracex6_user.c
@@ -22,6 +22,7 @@
 
 static void check_on_cpu(int cpu, struct perf_event_attr *attr)
 {
+	struct bpf_perf_event_value value2;
 	int pmu_fd, error = 0;
 	cpu_set_t set;
 	__u64 value;
@@ -46,8 +47,18 @@ static void check_on_cpu(int cpu, struct perf_event_attr *attr)
 		fprintf(stderr, "Value missing for CPU %d\n", cpu);
 		error = 1;
 		goto on_exit;
+	} else {
+		fprintf(stderr, "CPU %d: %llu\n", cpu, value);
+	}
+	/* The above bpf_map_lookup_elem should trigger the second kprobe */
+	if (bpf_map_lookup_elem(map_fd[2], &cpu, &value2)) {
+		fprintf(stderr, "Value2 missing for CPU %d\n", cpu);
+		error = 1;
+		goto on_exit;
+	} else {
+		fprintf(stderr, "CPU %d: counter: %llu, enabled: %llu, running: %llu\n", cpu,
+			value2.counter, value2.enabled, value2.running);
 	}
-	fprintf(stderr, "CPU %d: %llu\n", cpu, value);
 
 on_exit:
 	assert(bpf_map_delete_elem(map_fd[0], &cpu) == 0 || error);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 6d2137b..db51cbc 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -648,7 +648,8 @@ union bpf_attr {
 	FN(redirect_map),		\
 	FN(sk_redirect_map),		\
 	FN(sock_map_update),		\
-	FN(xdp_adjust_meta),
+	FN(xdp_adjust_meta),		\
+	FN(perf_event_read_value),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index a56053d..c15ca83 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -72,6 +72,9 @@ static int (*bpf_sk_redirect_map)(void *map, int key, int flags) =
 static int (*bpf_sock_map_update)(void *map, void *key, void *value,
 				  unsigned long long flags) =
 	(void *) BPF_FUNC_sock_map_update;
+static int (*bpf_perf_event_read_value)(void *map, unsigned long long flags,
+					void *buf, unsigned int buf_size) =
+	(void *) BPF_FUNC_perf_event_read_value;
 
 
 /* llvm builtin functions that eBPF C program may use to
-- 
2.9.5

^ permalink raw reply related

* [PATCH net-next v6 1/4] bpf: add helper bpf_perf_event_read_value for perf event array map
From: Yonghong Song @ 2017-10-02 22:42 UTC (permalink / raw)
  To: peterz, rostedt, ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20171002224218.3181418-1-yhs@fb.com>

Hardware pmu counters are limited resources. When there are more
pmu based perf events opened than available counters, kernel will
multiplex these events so each event gets certain percentage
(but not 100%) of the pmu time. In case that multiplexing happens,
the number of samples or counter value will not reflect the
case compared to no multiplexing. This makes comparison between
different runs difficult.

Typically, the number of samples or counter value should be
normalized before comparing to other experiments. The typical
normalization is done like:
  normalized_num_samples = num_samples * time_enabled / time_running
  normalized_counter_value = counter_value * time_enabled / time_running
where time_enabled is the time enabled for event and time_running is
the time running for event since last normalization.

This patch adds helper bpf_perf_event_read_value for kprobed based perf
event array map, to read perf counter and enabled/running time.
The enabled/running time is accumulated since the perf event open.
To achieve scaling factor between two bpf invocations, users
can can use cpu_id as the key (which is typical for perf array usage model)
to remember the previous value and do the calculation inside the
bpf program.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/linux/perf_event.h |  6 ++++--
 include/uapi/linux/bpf.h   | 20 ++++++++++++++++++--
 kernel/bpf/arraymap.c      |  2 +-
 kernel/bpf/verifier.c      |  4 +++-
 kernel/events/core.c       | 15 ++++++++++++---
 kernel/trace/bpf_trace.c   | 46 +++++++++++++++++++++++++++++++++++++++++-----
 6 files changed, 79 insertions(+), 14 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 8e22f24..21d8c12 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -884,7 +884,8 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr,
 				void *context);
 extern void perf_pmu_migrate_context(struct pmu *pmu,
 				int src_cpu, int dst_cpu);
-int perf_event_read_local(struct perf_event *event, u64 *value);
+int perf_event_read_local(struct perf_event *event, u64 *value,
+			  u64 *enabled, u64 *running);
 extern u64 perf_event_read_value(struct perf_event *event,
 				 u64 *enabled, u64 *running);
 
@@ -1286,7 +1287,8 @@ static inline const struct perf_event_attr *perf_event_attrs(struct perf_event *
 {
 	return ERR_PTR(-EINVAL);
 }
-static inline int perf_event_read_local(struct perf_event *event, u64 *value)
+static inline int perf_event_read_local(struct perf_event *event, u64 *value,
+					u64 *enabled, u64 *running)
 {
 	return -EINVAL;
 }
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6d2137b..4bdd4b2 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -592,6 +592,13 @@ union bpf_attr {
  *     @xdp_md: pointer to xdp_md
  *     @delta: An positive/negative integer to be added to xdp_md.data_meta
  *     Return: 0 on success or negative on error
+ * int bpf_perf_event_read_value(map, flags, buf, buf_size)
+ *     read perf event counter value and perf event enabled/running time
+ *     @map: pointer to perf_event_array map
+ *     @flags: index of event in the map or bitmask flags
+ *     @buf: buf to fill
+ *     @buf_size: size of the buf
+ *     Return: 0 on success or negative error code
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -648,7 +655,8 @@ union bpf_attr {
 	FN(redirect_map),		\
 	FN(sk_redirect_map),		\
 	FN(sock_map_update),		\
-	FN(xdp_adjust_meta),
+	FN(xdp_adjust_meta),		\
+	FN(perf_event_read_value),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -692,7 +700,9 @@ enum bpf_func_id {
 #define BPF_F_ZERO_CSUM_TX		(1ULL << 1)
 #define BPF_F_DONT_FRAGMENT		(1ULL << 2)
 
-/* BPF_FUNC_perf_event_output and BPF_FUNC_perf_event_read flags. */
+/* BPF_FUNC_perf_event_output, BPF_FUNC_perf_event_read and
+ * BPF_FUNC_perf_event_read_value flags.
+ */
 #define BPF_F_INDEX_MASK		0xffffffffULL
 #define BPF_F_CURRENT_CPU		BPF_F_INDEX_MASK
 /* BPF_FUNC_perf_event_output for sk_buff input context. */
@@ -885,4 +895,10 @@ enum {
 #define TCP_BPF_IW		1001	/* Set TCP initial congestion window */
 #define TCP_BPF_SNDCWND_CLAMP	1002	/* Set sndcwnd_clamp */
 
+struct bpf_perf_event_value {
+	__u64 counter;
+	__u64 enabled;
+	__u64 running;
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 98c0f00..68d8666 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -492,7 +492,7 @@ static void *perf_event_fd_array_get_ptr(struct bpf_map *map,
 
 	ee = ERR_PTR(-EOPNOTSUPP);
 	event = perf_file->private_data;
-	if (perf_event_read_local(event, &value) == -EOPNOTSUPP)
+	if (perf_event_read_local(event, &value, NULL, NULL) == -EOPNOTSUPP)
 		goto err_out;
 
 	ee = bpf_event_entry_gen(perf_file, map_file);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 4cf9b72..b0cbc2d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1552,7 +1552,8 @@ static int check_map_func_compatibility(struct bpf_map *map, int func_id)
 		break;
 	case BPF_MAP_TYPE_PERF_EVENT_ARRAY:
 		if (func_id != BPF_FUNC_perf_event_read &&
-		    func_id != BPF_FUNC_perf_event_output)
+		    func_id != BPF_FUNC_perf_event_output &&
+		    func_id != BPF_FUNC_perf_event_read_value)
 			goto error;
 		break;
 	case BPF_MAP_TYPE_STACK_TRACE:
@@ -1595,6 +1596,7 @@ static int check_map_func_compatibility(struct bpf_map *map, int func_id)
 		break;
 	case BPF_FUNC_perf_event_read:
 	case BPF_FUNC_perf_event_output:
+	case BPF_FUNC_perf_event_read_value:
 		if (map->map_type != BPF_MAP_TYPE_PERF_EVENT_ARRAY)
 			goto error;
 		break;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6bc21e2..c761ef4 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3684,10 +3684,12 @@ static inline u64 perf_event_count(struct perf_event *event)
  *     will not be local and we cannot read them atomically
  *   - must not have a pmu::count method
  */
-int perf_event_read_local(struct perf_event *event, u64 *value)
+int perf_event_read_local(struct perf_event *event, u64 *value,
+			  u64 *enabled, u64 *running)
 {
 	unsigned long flags;
 	int ret = 0;
+	u64 now;
 
 	/*
 	 * Disabling interrupts avoids all counter scheduling (context
@@ -3718,14 +3720,21 @@ int perf_event_read_local(struct perf_event *event, u64 *value)
 		goto out;
 	}
 
+	now = event->shadow_ctx_time + perf_clock();
+	if (enabled)
+		*enabled = now - event->tstamp_enabled;
 	/*
 	 * If the event is currently on this CPU, its either a per-task event,
 	 * or local to this CPU. Furthermore it means its ACTIVE (otherwise
 	 * oncpu == -1).
 	 */
-	if (event->oncpu == smp_processor_id())
+	if (event->oncpu == smp_processor_id()) {
 		event->pmu->read(event);
-
+		if (running)
+			*running = now - event->tstamp_running;
+	} else if (running) {
+		*running = event->total_time_running;
+	}
 	*value = local64_read(&event->count);
 out:
 	local_irq_restore(flags);
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index dc498b6..686dfa1 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -255,14 +255,13 @@ const struct bpf_func_proto *bpf_get_trace_printk_proto(void)
 	return &bpf_trace_printk_proto;
 }
 
-BPF_CALL_2(bpf_perf_event_read, struct bpf_map *, map, u64, flags)
-{
+static __always_inline int
+get_map_perf_counter(struct bpf_map *map, u64 flags,
+		     u64 *value, u64 *enabled, u64 *running) {
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
 	unsigned int cpu = smp_processor_id();
 	u64 index = flags & BPF_F_INDEX_MASK;
 	struct bpf_event_entry *ee;
-	u64 value = 0;
-	int err;
 
 	if (unlikely(flags & ~(BPF_F_INDEX_MASK)))
 		return -EINVAL;
@@ -275,7 +274,15 @@ BPF_CALL_2(bpf_perf_event_read, struct bpf_map *, map, u64, flags)
 	if (!ee)
 		return -ENOENT;
 
-	err = perf_event_read_local(ee->event, &value);
+	return perf_event_read_local(ee->event, value, enabled, running);
+}
+
+BPF_CALL_2(bpf_perf_event_read, struct bpf_map *, map, u64, flags)
+{
+	u64 value = 0;
+	int err;
+
+	err = get_map_perf_counter(map, flags, &value, NULL, NULL);
 	/*
 	 * this api is ugly since we miss [-22..-2] range of valid
 	 * counter values, but that's uapi
@@ -293,6 +300,33 @@ static const struct bpf_func_proto bpf_perf_event_read_proto = {
 	.arg2_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_4(bpf_perf_event_read_value, struct bpf_map *, map, u64, flags,
+	   struct bpf_perf_event_value *, buf, u32, size)
+{
+	int err;
+
+	if (unlikely(size != sizeof(struct bpf_perf_event_value)))
+		return -EINVAL;
+
+	err = get_map_perf_counter(map, flags, &buf->counter, &buf->enabled,
+				   &buf->running);
+	if (unlikely(err)) {
+		memset(buf, 0, size);
+		return err;
+	}
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_perf_event_read_value_proto = {
+	.func		= bpf_perf_event_read_value,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_CONST_MAP_PTR,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg4_type	= ARG_CONST_SIZE,
+};
+
 static DEFINE_PER_CPU(struct perf_sample_data, bpf_sd);
 
 static __always_inline u64
@@ -499,6 +533,8 @@ static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func
 		return &bpf_perf_event_output_proto;
 	case BPF_FUNC_get_stackid:
 		return &bpf_get_stackid_proto;
+	case BPF_FUNC_perf_event_read_value:
+		return &bpf_perf_event_read_value_proto;
 	default:
 		return tracing_func_proto(func_id);
 	}
-- 
2.9.5

^ permalink raw reply related

* [PATCH net-next v6 0/4] bpf: add two helpers to read perf event enabled/running time
From: Yonghong Song @ 2017-10-02 22:42 UTC (permalink / raw)
  To: peterz, rostedt, ast, daniel, netdev; +Cc: kernel-team

Hardware pmu counters are limited resources. When there are more
pmu based perf events opened than available counters, kernel will
multiplex these events so each event gets certain percentage
(but not 100%) of the pmu time. In case that multiplexing happens,
the number of samples or counter value will not reflect the
case compared to no multiplexing. This makes comparison between
different runs difficult.

Typically, the number of samples or counter value should be
normalized before comparing to other experiments. The typical
normalization is done like:
  normalized_num_samples = num_samples * time_enabled / time_running
  normalized_counter_value = counter_value * time_enabled / time_running
where time_enabled is the time enabled for event and time_running is
the time running for event since last normalization.

This patch set implements two helper functions.
The helper bpf_perf_event_read_value reads counter/time_enabled/time_running
for perf event array map. The helper bpf_perf_prog_read_value read
counter/time_enabled/time_running for bpf prog with type BPF_PROG_TYPE_PERF_EVENT.

[Dave, Peter,

 Previous communcation shows that this patch may potentially have
 merge conflict with upcoming tip changes in the next merge window.

 Could you advise how this patch should proceed?

 Thanks!
]

Changelogs:
v5->v6:
  . rebase on top of net-next
v4->v5:
  . fix some coding style issues
  . memset the input buffer in case of error for ARG_PTR_TO_UNINIT_MEM
    type of argument.
v3->v4:
  . fix a build failure
v2->v3:
  . counters should be read in order to read enabled/running time. This is to
    prevent that counters and enabled/running time may be read separately.
v1->v2:
  . reading enabled/running time should be together with reading counters
    which contains the logic to ensure the result is valid.

Yonghong Song (4):
  bpf: add helper bpf_perf_event_read_value for perf event array map
  bpf: add a test case for helper bpf_perf_event_read_value
  bpf: add helper bpf_perf_prog_read_value
  bpf: add a test case for helper bpf_perf_prog_read_value

 include/linux/perf_event.h                |  7 ++-
 include/uapi/linux/bpf.h                  | 28 +++++++++++-
 kernel/bpf/arraymap.c                     |  2 +-
 kernel/bpf/verifier.c                     |  4 +-
 kernel/events/core.c                      | 16 +++++--
 kernel/trace/bpf_trace.c                  | 74 ++++++++++++++++++++++++++++---
 samples/bpf/trace_event_kern.c            | 10 +++++
 samples/bpf/trace_event_user.c            | 13 +++---
 samples/bpf/tracex6_kern.c                | 26 +++++++++++
 samples/bpf/tracex6_user.c                | 13 +++++-
 tools/include/uapi/linux/bpf.h            |  4 +-
 tools/testing/selftests/bpf/bpf_helpers.h |  6 +++
 12 files changed, 182 insertions(+), 21 deletions(-)

-- 
2.9.5

^ permalink raw reply

* RE: [RFC net-next 0/5] TSN: Add qdisc-based config interfaces for traffic shapers
From: Rodney Cummings @ 2017-10-02 22:52 UTC (permalink / raw)
  To: Levi Pearson
  Cc: Linux Kernel Network Developers, Vinicius Costa Gomes,
	Henrik Austad, richardcochran@gmail.com,
	jesus.sanchez-palencia@intel.com, andre.guedes@intel.com
In-Reply-To: <CAEYbN3TWwPjoHW-wQ9DC=ubXELoZ4jbMUTX_+D3Rn_h=6LHvDA@mail.gmail.com>

Thanks Levi,

Great discussion. It sounds like we're aligned.

Qbv is essential in switches. My concern was the end-station side only.

You're absolutely right that i210 LaunchTime doesn't do everything that is needed 
for scheduling. It still requires quite a bit of frame-juggling to ensure that
transmit descriptors are ordered by time, and that the control system's cyclic
traffic is repeated as expected. Like you, I'm not sure where that code is best
located in the long run.

For a hardware design that connects an FPGA to one of the internal ports of a
Qbv-capable switch, the preferable scheduling design can be done there.
The FPGA IP essentially implements a cyclic schedule of transmit descriptors,
each with its own buffer that can be written atomically by application code.
For a control system (i.e. industrial field-level or automotive control),
each buffer (stream) can hold a single frame. If you take a look at the
"Network Interface" specs for FlexRay in automotive, they went so far as to
mandate that design for a NIC. The IP for that sort of thing
doesn't take alot of gates or memory, so I can see why FlexRay went that way.

That sort of design wasn't destined to happen as part of Qbv, because the switch
vendors are worried about things like 64-port switches. Also, if we treat Qbv
as applying to switches only, FlexRay-style IP isn't needed. For the switch,
we only need it to "get out of the way", and Qbv is fine for that.

I may be naive and idealistic, but in the long run, I'm still hopeful that an
Ethernet NIC vendor will implement Flexray-style scheduling for end-station. 
We need a NIC trend-setter to do it as a differentiator, and the rest will 
probably follow along. That completely eliminates code in Linux for the 
frame-juggling. We'll need to provide a way to open a socket for a specific 
stream, but that doesn't seem too challenging.

Rodney




> -----Original Message-----
> From: Levi Pearson [mailto:levipearson@gmail.com]
> Sent: Monday, October 2, 2017 4:49 PM
> To: Rodney Cummings <rodney.cummings@ni.com>
> Cc: Linux Kernel Network Developers <netdev@vger.kernel.org>; Vinicius
> Costa Gomes <vinicius.gomes@intel.com>; Henrik Austad <henrik@austad.us>;
> richardcochran@gmail.com; jesus.sanchez-palencia@intel.com;
> andre.guedes@intel.com
> Subject: Re: [RFC net-next 0/5] TSN: Add qdisc-based config interfaces for
> traffic shapers
> 
> Hi Rodney,
> 
> On Mon, Oct 2, 2017 at 1:40 PM, Rodney Cummings <rodney.cummings@ni.com>
> wrote:
> 
> > It's a shame that someone built such hardware. Speaking as a
> manufacturer
> > of daisy-chainable products for industrial/automotive applications, I
> > wouldn't use that hardware in my products. The whole point of scheduling
> > is to obtain close-to-optimal network determinism. If the hardware
> > doesn't schedule properly, the product is probably better off using CBS.
> 
> I should note that the hardware I'm using was not designed to be a TSN
> endpoint. It just happens that the hardware, designed for other
> reasons, happens to have hardware support for Qbv and thus makes it an
> interesting and inexpensive platform on which to do Qbv experiments.
> Nevertheless, I believe the right combination of driver support for
> the hardware shaping and appropriate software in the network stack
> would allow it to schedule things fairly precisely in the
> application-centric manner you've described. I will ultimately defer
> to your judgement on that, though, and spend some time with the Qcc
> draft to see where I may be wrong.
> 
> > It would be great if all of the user-level application code was
> scheduled
> > as tightly as the network, but the reality is that we aren't there yet.
> > Also, even if the end-station has a separately timed RT Linux thread for
> > each stream, the order of those threads can vary. Therefore, when the
> > end-station (SoC) has more than one talker, the frames for each talker
> > can be written into the Qbv queue at any time, in any order.
> > There is no scheduling of frames (i.e. streams)... only the queue.
> 
> Frames are presented to *the kernel* in any order. The qdisc
> infrastructure provides mechanisms by which other scheduling policy
> can be added between presentation to the kernel and presentation to
> hardware queues.
> 
> Even with i210-style LaunchTime functionality, this must also be
> provided. The hardware cannot re-order time-scheduled frames, and
> reviewers of the SO_TXTIME proposal rejected the idea of having the
> driver try to re-order scheduled frames already submitted to the
> hardware-visible descriptor chains, and rightly so I think.
> 
> So far, I have thought of 3 places this ordering constraint might be
> provided:
> 
> 1. Built into each driver with offload support for SO_TXTIME
> 2. A qdisc module that will re-order scheduled skbuffs within a certain
> window
> 3. Left to userspace to coordinate
> 
> I think 1 is particularly bad due to duplication of code. 3 is the
> easy default from the point of view of the kernel, but not attractive
> from the point of view of precise scheduling and low latency to
> egress. Unless someone has a better idea, I think 2 would be the most
> appropriate choice. I can think of a number of ways one might be
> designed, but I have not thought them through fully yet.
> 
> > From the perspective of the CNC, that means that a Qbv-only end-station
> is
> > sloppy. In contrast, an end-station using per-stream scheduling
> > (e.g. i210 with LaunchTime) is precise, because each frame has a known
> time.
> > It's true that P802.1Qcc supports both, so the CNC might support both.
> > Nevertheless, the CNC is likely to compute a network-wide schedule
> > that optimizes for the per-stream end-stations, and locates the windows
> > for the Qbv-only end-stations in a slower interval.
> 
> Because out-of-order submission of frames to LaunchTime hardware would
> result in the frame that is enqueued later but scheduled earlier only
> transmitting once the later-scheduled frame completes, enqueuing
> frames immediately with LaunchTime information results in even *worse*
> sloppiness if two applications don't always submit their frames in the
> correct order to the kernel during a cycle. If your queue is simply
> gated at the time of the first scheduled launch, the frame that should
> have been first will only be late by the length of the queued-first
> frame. If they both have precise launch times, the frame that should
> have been first will have to wait until the normal launch time of the
> later-scheduled frame plus the time it takes for it to egress!
> 
> In either case, re-ordering is required if it is not controlled for in
> userspace. In other words, I am not disagreeing with you about the
> insufficiency of a Qbv shaper for an endpoint! I am just pointing out
> that neither hardware LaunchTime support nor hardware Qbv window
> support are sufficient by themselves. Nor, for that case, is the CBS
> offload support sufficient to meet multi-stream requirements from SRP.
> All of these provide a low-layer mechanism by which appropriate
> scheduling of a multi-stream or multi-application endpoint can be
> enhanced, but none of them provide the complete story by themselves.
> 
> > I realize that we are only doing CBS for now. I guess my main point is
> that
> > if/when we add scheduling, it cannot be limited to Qbv. Per-stream
> scheduling
> > is essential, because that is the assumption for most CNC designs that
> > I'm aware of.
> 
> I understand and completely agree. I have believed that per-stream
> shaping/scheduling in some form is essential from the beginning.
> Likewise, CBS is not sufficient in itself for multi-stream systems in
> which multiple applications provide streams scheduled for the same
> traffic class. But CBS and SO_TXTIME are good first steps; they
> provides hooks for drivers to finally support the mechanisms provided
> by hardware, and also in the case of SO_TXTIME a hook on which further
> time-sensitive enhancements to the network stack can be hung, such as
> qdiscs that can store and release frames based on scheduled launch
> times. When these are in place, I think stream-based scheduling via
> qdisc would be a reasonable next step; perhaps a common design could
> be found to work both for media streams over CBS and industrial
> streams scheduled by launch time.
> 
> 
> Levi

^ permalink raw reply

* Re: [PATCH net-next] ibmvnic: Improve output for unsupported stats
From: David Miller @ 2017-10-02 23:03 UTC (permalink / raw)
  To: andrew; +Cc: jallen, netdev, tlfalcon
In-Reply-To: <20171002215723.GN17713@lunn.ch>

From: Andrew Lunn <andrew@lunn.ch>
Date: Mon, 2 Oct 2017 23:57:23 +0200

> On Mon, Oct 02, 2017 at 03:31:39PM -0500, John Allen wrote:
>> The vnic server can report -1 in the event that a given statistic is not
>> supported. Currently, the -1 value is implicitly cast to an unsigned
>> integer and appears through the ethtool -S output as a very large number.
>> This patch improves this behavior by reporting 0 in the event that a
>> given statistic is not supported.
> 
> Hi John
> 
> If it does not exist, why not skip it altogether?
> ibmvnic_get_sset_count() could walk the list and count the valid ones.
> ibmvnic_get_strings() could only return the name of the valid onces.

Agreed, skipping would be preferable to printing zero values.

^ permalink raw reply

* Re: [PATCH] fsl/fman: remove of_node
From: David Miller @ 2017-10-02 23:04 UTC (permalink / raw)
  To: madalin.bucur; +Cc: netdev, f.fainelli, linux-kernel
In-Reply-To: <1506940297-20442-1-git-send-email-madalin.bucur@nxp.com>

From: Madalin Bucur <madalin.bucur@nxp.com>
Date: Mon, 2 Oct 2017 13:31:37 +0300

> The FMan MAC driver allocates a platform device for the Ethernet
> driver to probe on. Setting pdev->dev.of_node with the MAC node
> triggers the MAC driver probing of the new platform device. While
> this fails quickly and does not affect the functionality of the
> drivers, it is incorrect and must be removed. This was added to
> address a report that DSA code using of_find_net_device_by_node()
> is unable to use the DPAA interfaces. Error message seen before
> this fix:
> 
> fsl_mac dpaa-ethernet.0: __devm_request_mem_region(mac) failed
> fsl_mac: probe of dpaa-ethernet.0 failed with error -16
> 
> Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com>

Is the DSA issue no longer something we need to be concerned
about?  If not, why?  You have to explain this.

^ permalink raw reply

* Re: [RFC net-next 0/5] TSN: Add qdisc-based config interfaces for traffic shapers
From: Guedes, Andre @ 2017-10-02 23:06 UTC (permalink / raw)
  To: levipearson@gmail.com, rodney.cummings@ni.com
  Cc: Sanchez-Palencia, Jesus, netdev@vger.kernel.org, Gomes, Vinicius,
	Briano, Ivan, richardcochran@gmail.com, henrik@austad.us
In-Reply-To: <CAEYbN3STY+7ZOodQLaP4MfbvwCovWsav52PXBmjHND7QOO=srg@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 10530 bytes --]

Hi all,

On Mon, 2017-10-02 at 12:45 -0600, Levi Pearson wrote:
> Hi Rodney,
> 
> Some archives seem to have threaded it, but I have CC'd the
> participants I saw in the original discussion thread since they may
> not otherwise notice it amongst the normal traffic.
> 
> On Fri, Sep 29, 2017 at 2:44 PM, Rodney Cummings <rodney.cummings@ni.com>
> wrote:

[...]

> > 1. Question: From an 802.1 perspective, is this RFC intended to support
> > end-station (e.g. NIC in host), bridges (i.e. DSA), or both?
> > 
> > This is very important to clarify, because the usage of this interface
> > will be very different for one or the other.
> > 
> > For a bridge, the user code typically represents a remote management
> > protocol (e.g. SNMP, NETCONF, RESTCONF), and this interface is
> > expected to align with the specifications of 802.1Q clause 12,
> > which serves as the information model for management. Historically,
> > a standard kernel interface for management hasn't been viewed as
> > essential, but I suppose it wouldn't hurt.
> 
> I don't think the proposal was meant to cover the case of non-local
> switch hardware, but in addition to dsa and switchdev switch ICs
> managed by embedded Linux-running SoCs, there are SoCs with embedded
> small port count switches or even plain multiple NICs with software
> bridging. Many of these embedded small port count switches have FQTSS
> hardware that could potentially be configured by the proposed cbs
> qdisc. This blurs the line somewhat between what is a "bridge" and
> what is an "end-station" in 802.1Q terminology, but nevertheless these
> devices exist, sometimes acting as an endpoint + a real bridge and
> sometimes as just a system with multiple network interfaces.

During the development of this proposal, we were most focused on end-station
use-cases. We considered some bridge use-cases as well just to verify that the
proposed design wouldn't be an issue if someone else goes for it.

We agree that the line between end-station and bridge can be a bit blurred (in
this case). Even though we designed this interface with end-station use-cases
in mind, if the proposed infrastructure could be used as is in bridge use-
cases, good.

> > For an end station, the user code can be an implementation of SRP
> > (802.1Q clause 35), or it can be an application-specific
> > protocol (e.g. industrial fieldbus) that exchanges data according
> > to P802.1Qcc clause 46. Either way, the top-level user interface
> > is designed for individual streams, not queues and shapers. That
> > implies some translation code between that top-level interface
> > and this sort of kernel interface.

Yes, you're right. Our understanding is that the top-level interfaces should be
implemented at user space as well as any stream management functionality. The
idea here is to keep the kernel-side as simple as possible. The kernel handles
hardware configuration (via Traffic Control interface) while the user space
handles TSN streams i.e. the kernel provides the mechanism and the user space
provides the policy.

> > As a specific end-station example, for CBS, 802.1Q-2014 subclause
> > 34.6.1 requires "per-stream queues" in the Talker end-station.
> > I don't see 34.6.1 represented in the proposed RFC, but that's
> > okay... maybe per-stream queues are implemented in user code.
> > Nevertheless, if that is the assumption, I think we need to
> > clarify, especially in examples.
> 
> You're correct that the FQTSS credit-based shaping algorithm requires
> per-stream shaping by Talker endpoints as well, but this is in
> addition to the per-class shaping provided by most hardware shaping
> implementations that I'm aware of in endpoint network hardware. I
> agree that we need to document the need to provide this, but it can
> definitely be built on top of the current proposal.
> 
> I believe the per-stream shaping could be managed either by a user
> space application that manages all use of a streaming traffic class,
> or through an additional qdisc module that performs per-stream
> management on top of the proposed cbs qdisc, ensuring that the
> frames-per-observation interval aspect of each stream's reservation is
> obeyed. This becomes a fairly simple qdisc to implement on top of a
> per-traffic class shaper, and could even be implemented with the help
> of the timestamp that the SO_TXTIME proposal adds to skbuffs, but I
> think keeping the layers separate provides more flexibility to
> implementations and keeps management of various kinds of hardware
> offload support simpler as well.

Indeed, 'per-stream queue' is not covered in this RFC. For now, we expect it to
be implemented in user code. We believe the proposed CBS qdisc could be
extended to support a full software-based implementation which would be used to
implement 'per-stream queue' support. This functionality should be addressed by
a separated series.

Anyways, we're about to send the v3 patchset implementing this proposal and
we'll make it clear.

> > 2. Suggestion: Do not assume that a time-aware (i.e. scheduled)
> > end-station will always use 802.1Qbv.
> > 
> > For those who are subscribed to the 802.1 mailing list,
> > I'd suggest a read of draft P802.1Qcc/D1.6, subclause U.1
> > of Annex U. Subclause U.1 assumes that bridges in the network use
> > 802.1Qbv, and then it poses the question of what an end-station
> > Talker should do. If the end-station also uses 802.1Qbv,
> > and that end-station transmits multiple streams, 802.1Qbv is
> > a bad implementation. The reason is that the scheduling
> > (i.e. order in time) of each stream cannot be controlled, which
> > in turn means that the CNC (network manager) cannot optimize
> > the 802.1Qbv schedules in bridges. The preferred technique
> > is to use "per-stream scheduling" in each Talker, so that
> > the CNC can create an optimal schedules (i.e. best determinism).
> > 
> > I'm aware of a small number of proprietary CNC implementations for
> > 802.1Qbv in bridges, and they are generally assuming per-stream
> > scheduling in end-stations (Talkers).
> > 
> > The i210 NIC's LaunchTime can be used to implement per-stream
> > scheduling. I haven't looked at SO_TXTIME in detail, but it sounds
> > like per-stream scheduling. If so, then we already have the
> > fundamental building blocks for a complete implementation
> > of a time-aware end-station.
> > 
> > If we answer the preceding question #1 as "end-station only",
> > I would recommend avoiding 802.1Qbv in this interface. There
> > isn't really anything wrong with it per-se, but it would lead
> > developers down the wrong path.
> 
> In some situations, such as device nodes that each incorporate a small
> port count switch for the purpose of daisy-chaining a segment of the
> network, "end stations" must do a limited subset of local bridge
> management as well. I'm not sure how common this is going to be for
> industrial control applications, but I know there are audio and
> automotive applications built this way.
> 
> One particular device I am working with now provides all network
> access through a DSA switch chip with hardware Qbv support in addtion
> to hardware Qav support. The SoC attached to it has no hardware timed
> launch (SO_TXTIME) support. In this case, although the proposed
> interface for Qbv is not *sufficient* to make a working time-aware end
> station, it does provide a usable building block to provide one. As
> with the credit-based shaping system, Talkers must provide an
> additional level of per-stream shaping as well, but this is largely
> (absent the jitter calculations, which are sort of a middle-level
> concern) independent of what sort of hardware offload of the
> scheduling is provided.
> 
> Both Qbv windows and timed launch support do roughly the same thing;
> they *delay* the launch of a hardware-queued frame so it can egress at
> a precisely specified time, and at least with the i210 and Qbv, ensure
> that no other traffic will be in-progress when that time arrives. For
> either to be used effectively, the application still has to prepare
> the frame slightly ahead-of-time and thus must have the same level of
> time-awareness. This is, again, largely independent of what kind of
> hardware offloading support is provided and is also largely
> independent of the network stack itself. Neither queue window
> management nor SO_TXTIME help the application present its
> time-sensitive traffic at the right time; that's a matter to be worked
> out with the application taking advantage of PTP and the OS scheduler.
> Whether you rely on managed windows or hardware launch time to provide
> the precisely correct amount of delay beyond that is immaterial to the
> application. In the absence of SO_TXTIME offloading (or even with it,
> and in the presence of sufficient OS scheduling jitter), an additional
> layer may need to be provided to ensure different applications' frames
> are queued in the correct order for egress during the window. Again,
> this could be a purely user-space application multiplexer or a
> separate qdisc module.
> 
> I wholeheartedly agree with you and Richard that we ought to
> eventually provide application-level APIs that don't require users to
> have deep knowledge of various 802.1Q intricacies. But I believe that
> the hardware offloading capability being provided now, and the variety
> of the way things are hooked up in real hardware, suggests that we
> ought to also build the support for the underlying protocols in layers
> so that we don't create unnecessary mismatches between offloading
> capability (which can be essential to overall network performance) and
> APIs, such that one configuration of offload support is privileged
> above others even when comparable scheduling accuracy could be
> provided by either.
> 
> In any case, only the cbs qdisc has been included in the post-RFC
> patch cover page for its last couple of iterations, so there is plenty
> of time to discuss how time-aware shaping, preemption, etc. management
> should occur beyond the cbs and SO_TXTIME proposals.

Yes, based on the previous feedback about the Qbv offloading interface
('taprio'), we've decided to postpone its proposal until we have NICs
supporting Qbv and more realistic use-cases. The current proposal covers only
FQTSS.

Thanks for your feedback!

Best regards,

Andre

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 3262 bytes --]

^ permalink raw reply

* Re: [PATCH net-next v2 1/2] libbpf: parse maps sections of varying size
From: Alexei Starovoitov @ 2017-10-02 23:07 UTC (permalink / raw)
  To: Craig Gallek, Daniel Borkmann, Jesper Dangaard Brouer,
	David S . Miller
  Cc: Chonggang Li, netdev
In-Reply-To: <20171002164129.47986-2-kraigatgoog@gmail.com>

On 10/2/17 9:41 AM, Craig Gallek wrote:
> +	/* Assume equally sized map definitions */
> +	map_def_sz = data->d_size / nr_maps;
> +	if (!data->d_size || (data->d_size % nr_maps) != 0) {
> +		pr_warning("unable to determine map definition size "
> +			   "section %s, %d maps in %zd bytes\n",
> +			   obj->path, nr_maps, data->d_size);
> +		return -EINVAL;
> +	}

this approach is not as flexible as done by samples/bpf/bpf_load.c
where it looks at every map independently by walking symtab,
but I guess it's ok.
I'd like to hear what Daniel and Jesper say,
since we really want to move to libbpf.a in samples/bpf/
and loader has to get to parity with the one in samples.

^ permalink raw reply

* [PATCH net-next v2 0/3] tools: add bpftool
From: Jakub Kicinski @ 2017-10-02 23:11 UTC (permalink / raw)
  To: netdev; +Cc: daniel, alexei.starovoitov, oss-drivers, Jakub Kicinski

Hi!

This set adds bpftool to the tools/ directory.  The first 
patch renames tools/net to tools/bpf, the second one adds 
the new code, while the third adds simple documentation.

v2:
 - report names, map ids, load time, uid;
 - add docs/man pages;
 - general cleanups & fixes.

Thanks to David Beckett for help with docs and testing.

Jakub Kicinski (3):
  tools: rename tools/net directory to tools/bpf
  tools: bpf: add bpftool
  tools: bpftool: add documentation

 MAINTAINERS                                      |   3 +-
 tools/Makefile                                   |  14 +-
 tools/{net => bpf}/Makefile                      |  18 +-
 tools/{net => bpf}/bpf_asm.c                     |   0
 tools/{net => bpf}/bpf_dbg.c                     |   0
 tools/{net => bpf}/bpf_exp.l                     |   0
 tools/{net => bpf}/bpf_exp.y                     |   0
 tools/{net => bpf}/bpf_jit_disasm.c              |   0
 tools/bpf/bpftool/Documentation/Makefile         |  34 ++
 tools/bpf/bpftool/Documentation/bpftool-map.txt  | 110 ++++
 tools/bpf/bpftool/Documentation/bpftool-prog.txt |  81 +++
 tools/bpf/bpftool/Documentation/bpftool.txt      |  34 ++
 tools/bpf/bpftool/Makefile                       |  86 +++
 tools/bpf/bpftool/common.c                       | 215 +++++++
 tools/bpf/bpftool/jit_disasm.c                   |  87 +++
 tools/bpf/bpftool/main.c                         | 212 +++++++
 tools/bpf/bpftool/main.h                         |  99 +++
 tools/bpf/bpftool/map.c                          | 744 +++++++++++++++++++++++
 tools/bpf/bpftool/prog.c                         | 427 +++++++++++++
 19 files changed, 2152 insertions(+), 12 deletions(-)
 rename tools/{net => bpf}/Makefile (74%)
 rename tools/{net => bpf}/bpf_asm.c (100%)
 rename tools/{net => bpf}/bpf_dbg.c (100%)
 rename tools/{net => bpf}/bpf_exp.l (100%)
 rename tools/{net => bpf}/bpf_exp.y (100%)
 rename tools/{net => bpf}/bpf_jit_disasm.c (100%)
 create mode 100644 tools/bpf/bpftool/Documentation/Makefile
 create mode 100644 tools/bpf/bpftool/Documentation/bpftool-map.txt
 create mode 100644 tools/bpf/bpftool/Documentation/bpftool-prog.txt
 create mode 100644 tools/bpf/bpftool/Documentation/bpftool.txt
 create mode 100644 tools/bpf/bpftool/Makefile
 create mode 100644 tools/bpf/bpftool/common.c
 create mode 100644 tools/bpf/bpftool/jit_disasm.c
 create mode 100644 tools/bpf/bpftool/main.c
 create mode 100644 tools/bpf/bpftool/main.h
 create mode 100644 tools/bpf/bpftool/map.c
 create mode 100644 tools/bpf/bpftool/prog.c

-- 
2.14.1

^ permalink raw reply

* [PATCH net-next v2 1/3] tools: rename tools/net directory to tools/bpf
From: Jakub Kicinski @ 2017-10-02 23:11 UTC (permalink / raw)
  To: netdev; +Cc: daniel, alexei.starovoitov, oss-drivers, Jakub Kicinski
In-Reply-To: <20171002231130.12406-1-jakub.kicinski@netronome.com>

We currently only have BPF tools in the tools/net directory.
We are about to add more BPF tools there, not necessarily
networking related, rename the directory and related Makefile
targets to bpf.

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 MAINTAINERS                         |  3 +--
 tools/Makefile                      | 14 +++++++-------
 tools/{net => bpf}/Makefile         |  0
 tools/{net => bpf}/bpf_asm.c        |  0
 tools/{net => bpf}/bpf_dbg.c        |  0
 tools/{net => bpf}/bpf_exp.l        |  0
 tools/{net => bpf}/bpf_exp.y        |  0
 tools/{net => bpf}/bpf_jit_disasm.c |  0
 8 files changed, 8 insertions(+), 9 deletions(-)
 rename tools/{net => bpf}/Makefile (100%)
 rename tools/{net => bpf}/bpf_asm.c (100%)
 rename tools/{net => bpf}/bpf_dbg.c (100%)
 rename tools/{net => bpf}/bpf_exp.l (100%)
 rename tools/{net => bpf}/bpf_exp.y (100%)
 rename tools/{net => bpf}/bpf_jit_disasm.c (100%)

diff --git a/MAINTAINERS b/MAINTAINERS
index 6671f375f7fc..2f79b94a41ec 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2725,7 +2725,7 @@ F:	net/core/filter.c
 F:	net/sched/act_bpf.c
 F:	net/sched/cls_bpf.c
 F:	samples/bpf/
-F:	tools/net/bpf*
+F:	tools/bpf/
 F:	tools/testing/selftests/bpf/
 
 BROADCOM B44 10/100 ETHERNET DRIVER
@@ -9416,7 +9416,6 @@ F:	include/uapi/linux/in.h
 F:	include/uapi/linux/net.h
 F:	include/uapi/linux/netdevice.h
 F:	include/uapi/linux/net_namespace.h
-F:	tools/net/
 F:	tools/testing/selftests/net/
 F:	lib/random32.c
 
diff --git a/tools/Makefile b/tools/Makefile
index 9dfede37c8ff..df6fcb293fbc 100644
--- a/tools/Makefile
+++ b/tools/Makefile
@@ -19,7 +19,7 @@ include scripts/Makefile.include
 	@echo '  kvm_stat               - top-like utility for displaying kvm statistics'
 	@echo '  leds                   - LEDs  tools'
 	@echo '  liblockdep             - user-space wrapper for kernel locking-validator'
-	@echo '  net                    - misc networking tools'
+	@echo '  bpf                    - misc BPF tools'
 	@echo '  perf                   - Linux performance measurement and analysis tool'
 	@echo '  selftests              - various kernel selftests'
 	@echo '  spi                    - spi tools'
@@ -57,7 +57,7 @@ acpi: FORCE
 cpupower: FORCE
 	$(call descend,power/$@)
 
-cgroup firewire hv guest spi usb virtio vm net iio gpio objtool leds: FORCE
+cgroup firewire hv guest spi usb virtio vm bpf iio gpio objtool leds: FORCE
 	$(call descend,$@)
 
 liblockdep: FORCE
@@ -91,7 +91,7 @@ kvm_stat: FORCE
 
 all: acpi cgroup cpupower gpio hv firewire liblockdep \
 		perf selftests spi turbostat usb \
-		virtio vm net x86_energy_perf_policy \
+		virtio vm bpf x86_energy_perf_policy \
 		tmon freefall iio objtool kvm_stat
 
 acpi_install:
@@ -100,7 +100,7 @@ all: acpi cgroup cpupower gpio hv firewire liblockdep \
 cpupower_install:
 	$(call descend,power/$(@:_install=),install)
 
-cgroup_install firewire_install gpio_install hv_install iio_install perf_install spi_install usb_install virtio_install vm_install net_install objtool_install:
+cgroup_install firewire_install gpio_install hv_install iio_install perf_install spi_install usb_install virtio_install vm_install bpf_install objtool_install:
 	$(call descend,$(@:_install=),install)
 
 liblockdep_install:
@@ -124,7 +124,7 @@ all: acpi cgroup cpupower gpio hv firewire liblockdep \
 install: acpi_install cgroup_install cpupower_install gpio_install \
 		hv_install firewire_install iio_install liblockdep_install \
 		perf_install selftests_install turbostat_install usb_install \
-		virtio_install vm_install net_install x86_energy_perf_policy_install \
+		virtio_install vm_install bpf_install x86_energy_perf_policy_install \
 		tmon_install freefall_install objtool_install kvm_stat_install
 
 acpi_clean:
@@ -133,7 +133,7 @@ install: acpi_install cgroup_install cpupower_install gpio_install \
 cpupower_clean:
 	$(call descend,power/cpupower,clean)
 
-cgroup_clean hv_clean firewire_clean spi_clean usb_clean virtio_clean vm_clean net_clean iio_clean gpio_clean objtool_clean leds_clean:
+cgroup_clean hv_clean firewire_clean spi_clean usb_clean virtio_clean vm_clean bpf_clean iio_clean gpio_clean objtool_clean leds_clean:
 	$(call descend,$(@:_clean=),clean)
 
 liblockdep_clean:
@@ -169,7 +169,7 @@ install: acpi_install cgroup_install cpupower_install gpio_install \
 
 clean: acpi_clean cgroup_clean cpupower_clean hv_clean firewire_clean \
 		perf_clean selftests_clean turbostat_clean spi_clean usb_clean virtio_clean \
-		vm_clean net_clean iio_clean x86_energy_perf_policy_clean tmon_clean \
+		vm_clean bpf_clean iio_clean x86_energy_perf_policy_clean tmon_clean \
 		freefall_clean build_clean libbpf_clean libsubcmd_clean liblockdep_clean \
 		gpio_clean objtool_clean leds_clean
 
diff --git a/tools/net/Makefile b/tools/bpf/Makefile
similarity index 100%
rename from tools/net/Makefile
rename to tools/bpf/Makefile
diff --git a/tools/net/bpf_asm.c b/tools/bpf/bpf_asm.c
similarity index 100%
rename from tools/net/bpf_asm.c
rename to tools/bpf/bpf_asm.c
diff --git a/tools/net/bpf_dbg.c b/tools/bpf/bpf_dbg.c
similarity index 100%
rename from tools/net/bpf_dbg.c
rename to tools/bpf/bpf_dbg.c
diff --git a/tools/net/bpf_exp.l b/tools/bpf/bpf_exp.l
similarity index 100%
rename from tools/net/bpf_exp.l
rename to tools/bpf/bpf_exp.l
diff --git a/tools/net/bpf_exp.y b/tools/bpf/bpf_exp.y
similarity index 100%
rename from tools/net/bpf_exp.y
rename to tools/bpf/bpf_exp.y
diff --git a/tools/net/bpf_jit_disasm.c b/tools/bpf/bpf_jit_disasm.c
similarity index 100%
rename from tools/net/bpf_jit_disasm.c
rename to tools/bpf/bpf_jit_disasm.c
-- 
2.14.1

^ permalink raw reply related

* [PATCH net-next v2 2/3] tools: bpf: add bpftool
From: Jakub Kicinski @ 2017-10-02 23:11 UTC (permalink / raw)
  To: netdev; +Cc: daniel, alexei.starovoitov, oss-drivers, Jakub Kicinski
In-Reply-To: <20171002231130.12406-1-jakub.kicinski@netronome.com>

Add a simple tool for querying and updating BPF objects on the system.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/bpf/Makefile             |  18 +-
 tools/bpf/bpftool/Makefile     |  80 +++++
 tools/bpf/bpftool/common.c     | 216 ++++++++++++
 tools/bpf/bpftool/jit_disasm.c |  87 +++++
 tools/bpf/bpftool/main.c       | 212 ++++++++++++
 tools/bpf/bpftool/main.h       |  99 ++++++
 tools/bpf/bpftool/map.c        | 744 +++++++++++++++++++++++++++++++++++++++++
 tools/bpf/bpftool/prog.c       | 456 +++++++++++++++++++++++++
 8 files changed, 1909 insertions(+), 3 deletions(-)
 create mode 100644 tools/bpf/bpftool/Makefile
 create mode 100644 tools/bpf/bpftool/common.c
 create mode 100644 tools/bpf/bpftool/jit_disasm.c
 create mode 100644 tools/bpf/bpftool/main.c
 create mode 100644 tools/bpf/bpftool/main.h
 create mode 100644 tools/bpf/bpftool/map.c
 create mode 100644 tools/bpf/bpftool/prog.c

diff --git a/tools/bpf/Makefile b/tools/bpf/Makefile
index ddf888010652..325a35e1c28e 100644
--- a/tools/bpf/Makefile
+++ b/tools/bpf/Makefile
@@ -3,6 +3,7 @@ prefix = /usr
 CC = gcc
 LEX = flex
 YACC = bison
+MAKE = make
 
 CFLAGS += -Wall -O2
 CFLAGS += -D__EXPORTED_HEADERS__ -I../../include/uapi -I../../include
@@ -13,7 +14,7 @@ CFLAGS += -D__EXPORTED_HEADERS__ -I../../include/uapi -I../../include
 %.lex.c: %.l
 	$(LEX) -o $@ $<
 
-all : bpf_jit_disasm bpf_dbg bpf_asm
+all: bpf_jit_disasm bpf_dbg bpf_asm bpftool
 
 bpf_jit_disasm : CFLAGS += -DPACKAGE='bpf_jit_disasm'
 bpf_jit_disasm : LDLIBS = -lopcodes -lbfd -ldl
@@ -26,10 +27,21 @@ bpf_asm : LDLIBS =
 bpf_asm : bpf_asm.o bpf_exp.yacc.o bpf_exp.lex.o
 bpf_exp.lex.o : bpf_exp.yacc.c
 
-clean :
+clean: bpftool_clean
 	rm -rf *.o bpf_jit_disasm bpf_dbg bpf_asm bpf_exp.yacc.* bpf_exp.lex.*
 
-install :
+install: bpftool_install
 	install bpf_jit_disasm $(prefix)/bin/bpf_jit_disasm
 	install bpf_dbg $(prefix)/bin/bpf_dbg
 	install bpf_asm $(prefix)/bin/bpf_asm
+
+bpftool:
+	$(MAKE) -C bpftool
+
+bpftool_install:
+	$(MAKE) -C bpftool install
+
+bpftool_clean:
+	$(MAKE) -C bpftool clean
+
+.PHONY: bpftool FORCE
diff --git a/tools/bpf/bpftool/Makefile b/tools/bpf/bpftool/Makefile
new file mode 100644
index 000000000000..a7151f47fb40
--- /dev/null
+++ b/tools/bpf/bpftool/Makefile
@@ -0,0 +1,80 @@
+include ../../scripts/Makefile.include
+
+include ../../scripts/utilities.mak
+
+ifeq ($(srctree),)
+srctree := $(patsubst %/,%,$(dir $(CURDIR)))
+srctree := $(patsubst %/,%,$(dir $(srctree)))
+srctree := $(patsubst %/,%,$(dir $(srctree)))
+#$(info Determined 'srctree' to be $(srctree))
+endif
+
+ifneq ($(objtree),)
+#$(info Determined 'objtree' to be $(objtree))
+endif
+
+ifneq ($(OUTPUT),)
+#$(info Determined 'OUTPUT' to be $(OUTPUT))
+# Adding $(OUTPUT) as a directory to look for source files,
+# because use generated output files as sources dependency
+# for flex/bison parsers.
+VPATH += $(OUTPUT)
+export VPATH
+endif
+
+ifeq ($(V),1)
+  Q =
+else
+  Q = @
+endif
+
+BPF_DIR	= $(srctree)/tools/lib/bpf/
+
+ifneq ($(OUTPUT),)
+  BPF_PATH=$(OUTPUT)
+else
+  BPF_PATH=$(BPF_DIR)
+endif
+
+LIBBPF = $(BPF_PATH)libbpf.a
+
+$(LIBBPF): FORCE
+	$(Q)$(MAKE) -C $(BPF_DIR) OUTPUT=$(OUTPUT) $(OUTPUT)libbpf.a FEATURES_DUMP=$(FEATURE_DUMP_EXPORT)
+
+$(LIBBPF)-clean:
+	$(call QUIET_CLEAN, libbpf)
+	$(Q)$(MAKE) -C $(BPF_DIR) OUTPUT=$(OUTPUT) clean >/dev/null
+
+prefix = /usr
+
+CC = gcc
+
+CFLAGS += -O2
+CFLAGS += -W -Wall -Wextra -Wno-unused-parameter -Wshadow
+CFLAGS += -D__EXPORTED_HEADERS__ -I$(srctree)/tools/include/uapi -I$(srctree)/tools/include -I$(srctree)/tools/lib/bpf
+LIBS = -lelf -lbfd -lopcodes $(LIBBPF)
+
+include $(wildcard *.d)
+
+all: $(OUTPUT)bpftool
+
+SRCS=$(wildcard *.c)
+OBJS=$(patsubst %.c,$(OUTPUT)%.o,$(SRCS))
+
+$(OUTPUT)bpftool: $(OBJS) $(LIBBPF)
+	$(QUIET_LINK)$(CC) $(CFLAGS) -o $@ $^ $(LIBS)
+
+$(OUTPUT)%.o: %.c
+	$(QUIET_CC)$(COMPILE.c) -MMD -o $@ $<
+
+clean: $(LIBBPF)-clean
+	$(call QUIET_CLEAN, bpftool)
+	$(Q)rm -rf $(OUTPUT)bpftool $(OUTPUT)*.o $(OUTPUT)*.d
+
+install:
+	install $(OUTPUT)bpftool $(prefix)/sbin/bpftool
+
+FORCE:
+
+.PHONY: all clean FORCE
+.DEFAULT_GOAL := all
diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
new file mode 100644
index 000000000000..df8396a0c400
--- /dev/null
+++ b/tools/bpf/bpftool/common.c
@@ -0,0 +1,216 @@
+/*
+ * Copyright (C) 2017 Netronome Systems, Inc.
+ *
+ * This software is dual licensed under the GNU General License Version 2,
+ * June 1991 as shown in the file COPYING in the top-level directory of this
+ * source tree or the BSD 2-Clause License provided below.  You have the
+ * option to license this software under the complete terms of either license.
+ *
+ * The BSD 2-Clause License:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      1. Redistributions of source code must retain the above
+ *         copyright notice, this list of conditions and the following
+ *         disclaimer.
+ *
+ *      2. Redistributions in binary form must reproduce the above
+ *         copyright notice, this list of conditions and the following
+ *         disclaimer in the documentation and/or other materials
+ *         provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+/* Author: Jakub Kicinski <kubakici@wp.pl> */
+
+#include <errno.h>
+#include <libgen.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <linux/limits.h>
+#include <linux/magic.h>
+#include <sys/types.h>
+#include <sys/vfs.h>
+
+#include <bpf.h>
+
+#include "main.h"
+
+static bool is_bpffs(char *path)
+{
+	struct statfs st_fs;
+
+	if (statfs(path, &st_fs) < 0)
+		return false;
+
+	return (unsigned long)st_fs.f_type == BPF_FS_MAGIC;
+}
+
+int open_obj_pinned_any(char *path, enum bpf_obj_type exp_type)
+{
+	enum bpf_obj_type type;
+	int fd;
+
+	fd = bpf_obj_get(path);
+	if (fd < 0) {
+		err("bpf obj get (%s): %s\n", path,
+		    errno == EACCES && !is_bpffs(dirname(path)) ?
+		    "directory not in bpf file system (bpffs)" :
+		    strerror(errno));
+		return -1;
+	}
+
+	type = get_fd_type(fd);
+	if (type < 0) {
+		close(fd);
+		return type;
+	}
+	if (type != exp_type) {
+		err("incorrect object type: %s\n", get_fd_type_name(type));
+		close(fd);
+		return -1;
+	}
+
+	return fd;
+}
+
+int do_pin_any(int argc, char **argv, int (*get_fd_by_id)(__u32))
+{
+	unsigned int id;
+	char *endptr;
+	int err;
+	int fd;
+
+	if (!is_prefix(*argv, "id")) {
+		err("expected 'id' got %s\n", *argv);
+		return -1;
+	}
+	NEXT_ARG();
+
+	id = strtoul(*argv, &endptr, 0);
+	if (*endptr) {
+		err("can't parse %s as ID\n", *argv);
+		return -1;
+	}
+	NEXT_ARG();
+
+	if (argc != 1)
+		usage();
+
+	fd = get_fd_by_id(id);
+	if (fd < 0) {
+		err("can't get prog by id (%u): %s\n", id, strerror(errno));
+		return -1;
+	}
+
+	err = bpf_obj_pin(fd, *argv);
+	close(fd);
+	if (err) {
+		err("can't pin the object (%s): %s\n", *argv,
+		    errno == EACCES && !is_bpffs(dirname(*argv)) ?
+		    "directory not in bpf file system (bpffs)" :
+		    strerror(errno));
+		return -1;
+	}
+
+	return 0;
+}
+
+const char *get_fd_type_name(enum bpf_obj_type type)
+{
+	static const char * const names[] = {
+		[BPF_OBJ_UNKNOWN]	= "unknown",
+		[BPF_OBJ_PROG]		= "prog",
+		[BPF_OBJ_MAP]		= "map",
+	};
+
+	if (type < 0 || type >= ARRAY_SIZE(names) || !names[type])
+		return names[BPF_OBJ_UNKNOWN];
+
+	return names[type];
+}
+
+int get_fd_type(int fd)
+{
+	char path[PATH_MAX];
+	char buf[512];
+	ssize_t n;
+
+	snprintf(path, sizeof(path), "/proc/%d/fd/%d", getpid(), fd);
+
+	n = readlink(path, buf, sizeof(buf));
+	if (n < 0) {
+		err("can't read link type: %s\n", strerror(errno));
+		return -1;
+	}
+	if (n == sizeof(path)) {
+		err("can't read link type: path too long!\n");
+		return -1;
+	}
+
+	if (strstr(buf, "bpf-map"))
+		return BPF_OBJ_MAP;
+	else if (strstr(buf, "bpf-prog"))
+		return BPF_OBJ_PROG;
+
+	return BPF_OBJ_UNKNOWN;
+}
+
+char *get_fdinfo(int fd, const char *key)
+{
+	char path[PATH_MAX];
+	char *line = NULL;
+	size_t line_n = 0;
+	ssize_t n;
+	FILE *fdi;
+
+	snprintf(path, sizeof(path), "/proc/%d/fdinfo/%d", getpid(), fd);
+
+	fdi = fopen(path, "r");
+	if (!fdi) {
+		err("can't open fdinfo: %s\n", strerror(errno));
+		return NULL;
+	}
+
+	while ((n = getline(&line, &line_n, fdi))) {
+		char *value;
+		int len;
+
+		if (!strstr(line, key))
+			continue;
+
+		fclose(fdi);
+
+		value = strchr(line, '\t');
+		if (!value || !value[1]) {
+			err("malformed fdinfo!?\n");
+			free(line);
+			return NULL;
+		}
+		value++;
+
+		len = strlen(value);
+		memmove(line, value, len);
+		line[len - 1] = '\0';
+
+		return line;
+	}
+
+	err("key '%s' not found in fdinfo\n", key);
+	free(line);
+	fclose(fdi);
+	return NULL;
+}
diff --git a/tools/bpf/bpftool/jit_disasm.c b/tools/bpf/bpftool/jit_disasm.c
new file mode 100644
index 000000000000..70e480b59e9d
--- /dev/null
+++ b/tools/bpf/bpftool/jit_disasm.c
@@ -0,0 +1,87 @@
+/*
+ * Based on:
+ *
+ * Minimal BPF JIT image disassembler
+ *
+ * Disassembles BPF JIT compiler emitted opcodes back to asm insn's for
+ * debugging or verification purposes.
+ *
+ * Copyright 2013 Daniel Borkmann <daniel@iogearbox.net>
+ * Licensed under the GNU General Public License, version 2.0 (GPLv2)
+ */
+
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <assert.h>
+#include <unistd.h>
+#include <string.h>
+#include <bfd.h>
+#include <dis-asm.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+
+static void get_exec_path(char *tpath, size_t size)
+{
+	ssize_t len;
+	char *path;
+
+	snprintf(tpath, size, "/proc/%d/exe", (int) getpid());
+	tpath[size - 1] = 0;
+
+	path = strdup(tpath);
+	assert(path);
+
+	len = readlink(path, tpath, size - 1);
+	assert(len > 0);
+	tpath[len] = 0;
+
+	free(path);
+}
+
+void disasm_print_insn(unsigned char *image, ssize_t len, int opcodes)
+{
+	disassembler_ftype disassemble;
+	struct disassemble_info info;
+	int count, i, pc = 0;
+	char tpath[256];
+	bfd *bfdf;
+
+	if (!len)
+		return;
+
+	memset(tpath, 0, sizeof(tpath));
+	get_exec_path(tpath, sizeof(tpath));
+
+	bfdf = bfd_openr(tpath, NULL);
+	assert(bfdf);
+	assert(bfd_check_format(bfdf, bfd_object));
+
+	init_disassemble_info(&info, stdout, (fprintf_ftype) fprintf);
+	info.arch = bfd_get_arch(bfdf);
+	info.mach = bfd_get_mach(bfdf);
+	info.buffer = image;
+	info.buffer_length = len;
+
+	disassemble_init_for_target(&info);
+
+	disassemble = disassembler(bfdf);
+	assert(disassemble);
+
+	do {
+		printf("%4x:\t", pc);
+
+		count = disassemble(pc, &info);
+
+		if (opcodes) {
+			printf("\n\t");
+			for (i = 0; i < count; ++i)
+				printf("%02x ", (uint8_t) image[pc + i]);
+		}
+		printf("\n");
+
+		pc += count;
+	} while (count > 0 && pc < len);
+
+	bfd_close(bfdf);
+}
diff --git a/tools/bpf/bpftool/main.c b/tools/bpf/bpftool/main.c
new file mode 100644
index 000000000000..e02d00d6e00b
--- /dev/null
+++ b/tools/bpf/bpftool/main.c
@@ -0,0 +1,212 @@
+/*
+ * Copyright (C) 2017 Netronome Systems, Inc.
+ *
+ * This software is dual licensed under the GNU General License Version 2,
+ * June 1991 as shown in the file COPYING in the top-level directory of this
+ * source tree or the BSD 2-Clause License provided below.  You have the
+ * option to license this software under the complete terms of either license.
+ *
+ * The BSD 2-Clause License:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      1. Redistributions of source code must retain the above
+ *         copyright notice, this list of conditions and the following
+ *         disclaimer.
+ *
+ *      2. Redistributions in binary form must reproduce the above
+ *         copyright notice, this list of conditions and the following
+ *         disclaimer in the documentation and/or other materials
+ *         provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+/* Author: Jakub Kicinski <kubakici@wp.pl> */
+
+#include <bfd.h>
+#include <ctype.h>
+#include <errno.h>
+#include <linux/bpf.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <bpf.h>
+
+#include "main.h"
+
+const char *bin_name;
+static int last_argc;
+static char **last_argv;
+static int (*last_do_help)(int argc, char **argv);
+
+void usage(void)
+{
+	last_do_help(last_argc - 1, last_argv + 1);
+
+	exit(-1);
+}
+
+static int do_help(int argc, char **argv)
+{
+	fprintf(stderr,
+		"Usage: %s OBJECT { COMMAND | help }\n"
+		"       %s batch file FILE\n"
+		"\n"
+		"       OBJECT := { prog | map }\n",
+		bin_name, bin_name);
+
+	return 0;
+}
+
+int cmd_select(const struct cmd *cmds, int argc, char **argv,
+	       int (*help)(int argc, char **argv))
+{
+	unsigned int i;
+
+	last_argc = argc;
+	last_argv = argv;
+	last_do_help = help;
+
+	if (argc < 1 && cmds[0].func)
+		return cmds[0].func(argc, argv);
+
+	for (i = 0; cmds[i].func; i++)
+		if (is_prefix(*argv, cmds[i].cmd))
+			return cmds[i].func(argc - 1, argv + 1);
+
+	help(argc - 1, argv + 1);
+
+	return -1;
+}
+
+bool is_prefix(const char *pfx, const char *str)
+{
+	if (!pfx)
+		return false;
+	if (strlen(str) < strlen(pfx))
+		return false;
+
+	return !memcmp(str, pfx, strlen(pfx));
+}
+
+void print_hex(void *arg, unsigned int n, const char *sep)
+{
+	unsigned char *data = arg;
+	unsigned int i;
+
+	for (i = 0; i < n; i++) {
+		const char *pfx = "";
+
+		if (!i)
+			/* nothing */;
+		else if (!(i % 16))
+			printf("\n");
+		else if (!(i % 8))
+			printf("  ");
+		else
+			pfx = sep;
+
+		printf("%s%02hhx", i ? pfx : "", data[i]);
+	}
+}
+
+static int do_batch(int argc, char **argv);
+
+static const struct cmd cmds[] = {
+	{ "help",	do_help },
+	{ "batch",	do_batch },
+	{ "prog",	do_prog },
+	{ "map",	do_map },
+	{ 0 }
+};
+
+static int do_batch(int argc, char **argv)
+{
+	unsigned int lines = 0;
+	char *n_argv[4096];
+	char buf[65536];
+	int n_argc;
+	FILE *fp;
+	int err;
+
+	if (argc < 2) {
+		err("too few parameters for batch\n");
+		return -1;
+	} else if (!is_prefix(*argv, "file")) {
+		err("expected 'file', got: %s\n", *argv);
+		return -1;
+	} else if (argc > 2) {
+		err("too many parameters for batch\n");
+		return -1;
+	}
+	NEXT_ARG();
+
+	fp = fopen(*argv, "r");
+	if (!fp) {
+		err("Can't open file (%s): %s\n", *argv, strerror(errno));
+		return -1;
+	}
+
+	while (fgets(buf, sizeof(buf), fp)) {
+		if (strlen(buf) == sizeof(buf) - 1) {
+			errno = E2BIG;
+			break;
+		}
+
+		n_argc = 0;
+		n_argv[n_argc] = strtok(buf, " \t\n");
+
+		while (n_argv[n_argc]) {
+			n_argc++;
+			if (n_argc == ARRAY_SIZE(n_argv)) {
+				err("line %d has too many arguments, skip\n",
+				    lines);
+				n_argc = 0;
+				break;
+			}
+			n_argv[n_argc] = strtok(NULL, " \t\n");
+		}
+
+		if (!n_argc)
+			continue;
+
+		err = cmd_select(cmds, n_argc, n_argv, do_help);
+		if (err)
+			goto err_close;
+
+		lines++;
+	}
+
+	if (errno && errno != ENOENT) {
+		perror("reading batch file failed");
+		err = -1;
+	} else {
+		info("processed %d lines\n", lines);
+		err = 0;
+	}
+err_close:
+	fclose(fp);
+
+	return err;
+}
+
+int main(int argc, char **argv)
+{
+	bin_name = argv[0];
+	NEXT_ARG();
+
+	bfd_init();
+
+	return cmd_select(cmds, argc, argv, do_help);
+}
diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h
new file mode 100644
index 000000000000..85d2d7870a58
--- /dev/null
+++ b/tools/bpf/bpftool/main.h
@@ -0,0 +1,99 @@
+/*
+ * Copyright (C) 2017 Netronome Systems, Inc.
+ *
+ * This software is dual licensed under the GNU General License Version 2,
+ * June 1991 as shown in the file COPYING in the top-level directory of this
+ * source tree or the BSD 2-Clause License provided below.  You have the
+ * option to license this software under the complete terms of either license.
+ *
+ * The BSD 2-Clause License:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      1. Redistributions of source code must retain the above
+ *         copyright notice, this list of conditions and the following
+ *         disclaimer.
+ *
+ *      2. Redistributions in binary form must reproduce the above
+ *         copyright notice, this list of conditions and the following
+ *         disclaimer in the documentation and/or other materials
+ *         provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+/* Author: Jakub Kicinski <kubakici@wp.pl> */
+
+#ifndef __BPF_TOOL_H
+#define __BPF_TOOL_H
+
+#include <stdbool.h>
+#include <stdio.h>
+#include <linux/bpf.h>
+
+#define ARRAY_SIZE(a)	(sizeof(a) / sizeof(a[0]))
+
+#define err(msg...)	fprintf(stderr, "Error: " msg)
+#define warn(msg...)	fprintf(stderr, "Warning: " msg)
+#define info(msg...)	fprintf(stderr, msg)
+
+#define ptr_to_u64(ptr)	((__u64)(unsigned long)(ptr))
+
+#define min(a, b)							\
+	({ typeof(a) _a = (a); typeof(b) _b = (b); _a > _b ? _b : _a; })
+#define max(a, b)							\
+	({ typeof(a) _a = (a); typeof(b) _b = (b); _a < _b ? _b : _a; })
+
+#define NEXT_ARG()	({ argc--; argv++; if (argc < 0) usage(); })
+#define NEXT_ARGP()	({ (*argc)--; (*argv)++; if (*argc < 0) usage(); })
+#define BAD_ARG()	({ err("what is '%s'?\n", *argv); -1; })
+
+#define BPF_TAG_FMT	"%02hhx:%02hhx:%02hhx:%02hhx:"	\
+			"%02hhx:%02hhx:%02hhx:%02hhx"
+
+#define HELP_SPEC_PROGRAM						\
+	"PROG := { id PROG_ID | pinned FILE | tag PROG_TAG }"
+
+enum bpf_obj_type {
+	BPF_OBJ_UNKNOWN,
+	BPF_OBJ_PROG,
+	BPF_OBJ_MAP,
+};
+
+extern const char *bin_name;
+
+bool is_prefix(const char *pfx, const char *str);
+void print_hex(void *arg, unsigned int n, const char *sep);
+void usage(void) __attribute__((noreturn));
+
+struct cmd {
+	const char *cmd;
+	int (*func)(int argc, char **argv);
+};
+
+int cmd_select(const struct cmd *cmds, int argc, char **argv,
+	       int (*help)(int argc, char **argv));
+
+int get_fd_type(int fd);
+const char *get_fd_type_name(enum bpf_obj_type type);
+char *get_fdinfo(int fd, const char *key);
+int open_obj_pinned_any(char *path, enum bpf_obj_type exp_type);
+int do_pin_any(int argc, char **argv, int (*get_fd_by_id)(__u32));
+
+int do_prog(int argc, char **arg);
+int do_map(int argc, char **arg);
+
+int prog_parse_fd(int *argc, char ***argv);
+
+void disasm_print_insn(unsigned char *image, ssize_t len, int opcodes);
+
+#endif
diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
new file mode 100644
index 000000000000..660b9cfbc068
--- /dev/null
+++ b/tools/bpf/bpftool/map.c
@@ -0,0 +1,744 @@
+/*
+ * Copyright (C) 2017 Netronome Systems, Inc.
+ *
+ * This software is dual licensed under the GNU General License Version 2,
+ * June 1991 as shown in the file COPYING in the top-level directory of this
+ * source tree or the BSD 2-Clause License provided below.  You have the
+ * option to license this software under the complete terms of either license.
+ *
+ * The BSD 2-Clause License:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      1. Redistributions of source code must retain the above
+ *         copyright notice, this list of conditions and the following
+ *         disclaimer.
+ *
+ *      2. Redistributions in binary form must reproduce the above
+ *         copyright notice, this list of conditions and the following
+ *         disclaimer in the documentation and/or other materials
+ *         provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+/* Author: Jakub Kicinski <kubakici@wp.pl> */
+
+#include <assert.h>
+#include <ctype.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+
+#include <bpf.h>
+
+#include "main.h"
+
+static const char * const map_type_name[] = {
+	[BPF_MAP_TYPE_UNSPEC]		= "unspec",
+	[BPF_MAP_TYPE_HASH]		= "hash",
+	[BPF_MAP_TYPE_ARRAY]		= "array",
+	[BPF_MAP_TYPE_PROG_ARRAY]	= "prog_array",
+	[BPF_MAP_TYPE_PERF_EVENT_ARRAY]	= "perf_event_array",
+	[BPF_MAP_TYPE_PERCPU_HASH]	= "percpu_hash",
+	[BPF_MAP_TYPE_PERCPU_ARRAY]	= "percpu_array",
+	[BPF_MAP_TYPE_STACK_TRACE]	= "stack_trace",
+	[BPF_MAP_TYPE_CGROUP_ARRAY]	= "cgroup_array",
+	[BPF_MAP_TYPE_LRU_HASH]		= "lru_hash",
+	[BPF_MAP_TYPE_LRU_PERCPU_HASH]	= "lru_percpu_hash",
+	[BPF_MAP_TYPE_LPM_TRIE]		= "lpm_trie",
+	[BPF_MAP_TYPE_ARRAY_OF_MAPS]	= "array_of_maps",
+	[BPF_MAP_TYPE_HASH_OF_MAPS]	= "hash_of_maps",
+	[BPF_MAP_TYPE_DEVMAP]		= "devmap",
+	[BPF_MAP_TYPE_SOCKMAP]		= "sockmap",
+};
+
+static unsigned int get_possible_cpus(void)
+{
+	static unsigned int result;
+	char buf[128];
+	long int n;
+	char *ptr;
+	int fd;
+
+	if (result)
+		return result;
+
+	fd = open("/sys/devices/system/cpu/possible", O_RDONLY);
+	if (fd < 0) {
+		err("can't open sysfs possible cpus\n");
+		exit(-1);
+	}
+
+	n = read(fd, buf, sizeof(buf));
+	if (n < 2) {
+		err("can't read sysfs possible cpus\n");
+		exit(-1);
+	}
+	close(fd);
+
+	if (n == sizeof(buf)) {
+		err("read sysfs possible cpus overflow\n");
+		exit(-1);
+	}
+
+	ptr = buf;
+	n = 0;
+	while (*ptr && *ptr != '\n') {
+		unsigned int a, b;
+
+		if (sscanf(ptr, "%u-%u", &a, &b) == 2) {
+			n += b - a + 1;
+
+			ptr = strchr(ptr, '-') + 1;
+		} else if (sscanf(ptr, "%u", &a) == 1) {
+			n++;
+		} else {
+			assert(0);
+		}
+
+		while (isdigit(*ptr))
+			ptr++;
+		if (*ptr == ',')
+			ptr++;
+	}
+
+	result = n;
+
+	return result;
+}
+
+static bool map_is_per_cpu(__u32 type)
+{
+	return type == BPF_MAP_TYPE_PERCPU_HASH ||
+	       type == BPF_MAP_TYPE_PERCPU_ARRAY ||
+	       type == BPF_MAP_TYPE_LRU_PERCPU_HASH;
+}
+
+static bool map_is_map_of_maps(__u32 type)
+{
+	return type == BPF_MAP_TYPE_ARRAY_OF_MAPS ||
+	       type == BPF_MAP_TYPE_HASH_OF_MAPS;
+}
+
+static bool map_is_map_of_progs(__u32 type)
+{
+	return type == BPF_MAP_TYPE_PROG_ARRAY;
+}
+
+static void *alloc_value(struct bpf_map_info *info)
+{
+	if (map_is_per_cpu(info->type))
+		return malloc(info->value_size * get_possible_cpus());
+	else
+		return malloc(info->value_size);
+}
+
+static int map_parse_fd(int *argc, char ***argv)
+{
+	int fd;
+
+	if (is_prefix(**argv, "id")) {
+		unsigned int id;
+		char *endptr;
+
+		NEXT_ARGP();
+
+		id = strtoul(**argv, &endptr, 0);
+		if (*endptr) {
+			err("can't parse %s as ID\n", **argv);
+			return -1;
+		}
+		NEXT_ARGP();
+
+		fd = bpf_map_get_fd_by_id(id);
+		if (fd < 0)
+			err("get map by id (%u): %s\n", id, strerror(errno));
+		return fd;
+	} else if (is_prefix(**argv, "pinned")) {
+		char *path;
+
+		NEXT_ARGP();
+
+		path = **argv;
+		NEXT_ARGP();
+
+		return open_obj_pinned_any(path, BPF_OBJ_MAP);
+	}
+
+	err("expected 'id' or 'pinned', got: '%s'?\n", **argv);
+	return -1;
+}
+
+static int
+map_parse_fd_and_info(int *argc, char ***argv, void *info, __u32 *info_len)
+{
+	int err;
+	int fd;
+
+	fd = map_parse_fd(argc, argv);
+	if (fd < 0)
+		return -1;
+
+	err = bpf_obj_get_info_by_fd(fd, info, info_len);
+	if (err) {
+		err("can't get map info: %s\n", strerror(errno));
+		close(fd);
+		return err;
+	}
+
+	return fd;
+}
+
+static void print_entry(struct bpf_map_info *info, unsigned char *key,
+			unsigned char *value)
+{
+	if (!map_is_per_cpu(info->type)) {
+		bool single_line, break_names;
+
+		break_names = info->key_size > 16 || info->value_size > 16;
+		single_line = info->key_size + info->value_size <= 24 &&
+			!break_names;
+
+		printf("key:%c", break_names ? '\n' : ' ');
+		print_hex(key, info->key_size, " ");
+
+		printf(single_line ? "  " : "\n");
+
+		printf("value:%c", break_names ? '\n' : ' ');
+		print_hex(value, info->value_size, " ");
+
+		printf("\n");
+	} else {
+		unsigned int i, n;
+
+		n = get_possible_cpus();
+
+		printf("key:\n");
+		print_hex(key, info->key_size, " ");
+		printf("\n");
+		for (i = 0; i < n; i++) {
+			printf("value (CPU %02d):%c",
+			       i, info->value_size > 16 ? '\n' : ' ');
+			print_hex(value + i * info->value_size,
+				  info->value_size, " ");
+			printf("\n");
+		}
+	}
+}
+
+static char **parse_bytes(char **argv, const char *name, unsigned char *val,
+			  unsigned int n)
+{
+	unsigned int i = 0;
+	char *endptr;
+
+	while (i < n && argv[i]) {
+		val[i] = strtoul(argv[i], &endptr, 0);
+		if (*endptr) {
+			err("error parsing byte: %s\n", argv[i]);
+			break;
+		}
+		i++;
+	}
+
+	if (i != n) {
+		err("%s expected %d bytes got %d\n", name, n, i);
+		return NULL;
+	}
+
+	return argv + i;
+}
+
+static int parse_elem(char **argv, struct bpf_map_info *info,
+		      void *key, void *value, __u32 key_size, __u32 value_size,
+		      __u32 *flags, __u32 **value_fd)
+{
+	if (!*argv) {
+		if (!key && !value)
+			return 0;
+		err("did not find %s\n", key ? "key" : "value");
+		return -1;
+	}
+
+	if (is_prefix(*argv, "key")) {
+		if (!key) {
+			if (key_size)
+				err("duplicate key\n");
+			else
+				err("unnecessary key\n");
+			return -1;
+		}
+
+		argv = parse_bytes(argv + 1, "key", key, key_size);
+		if (!argv)
+			return -1;
+
+		return parse_elem(argv, info, NULL, value, key_size, value_size,
+				  flags, value_fd);
+	} else if (is_prefix(*argv, "value")) {
+		int fd;
+
+		if (!value) {
+			if (value_size)
+				err("duplicate value\n");
+			else
+				err("unnecessary value\n");
+			return -1;
+		}
+
+		argv++;
+
+		if (map_is_map_of_maps(info->type)) {
+			int argc = 2;
+
+			if (value_size != 4) {
+				err("value smaller than 4B for map in map?\n");
+				return -1;
+			}
+			if (!argv[0] || !argv[1]) {
+				err("not enough value arguments for map in map\n");
+				return -1;
+			}
+
+			fd = map_parse_fd(&argc, &argv);
+			if (fd < 0)
+				return -1;
+
+			*value_fd = value;
+			**value_fd = fd;
+		} else if (map_is_map_of_progs(info->type)) {
+			int argc = 2;
+
+			if (value_size != 4) {
+				err("value smaller than 4B for map of progs?\n");
+				return -1;
+			}
+			if (!argv[0] || !argv[1]) {
+				err("not enough value arguments for map of progs\n");
+				return -1;
+			}
+
+			fd = prog_parse_fd(&argc, &argv);
+			if (fd < 0)
+				return -1;
+
+			*value_fd = value;
+			**value_fd = fd;
+		} else {
+			argv = parse_bytes(argv, "value", value, value_size);
+			if (!argv)
+				return -1;
+		}
+
+		return parse_elem(argv, info, key, NULL, key_size, value_size,
+				  flags, NULL);
+	} else if (is_prefix(*argv, "any") || is_prefix(*argv, "noexist") ||
+		   is_prefix(*argv, "exist")) {
+		if (!flags) {
+			err("flags specified multiple times: %s\n", *argv);
+			return -1;
+		}
+
+		if (is_prefix(*argv, "any"))
+			*flags = BPF_ANY;
+		else if (is_prefix(*argv, "noexist"))
+			*flags = BPF_NOEXIST;
+		else if (is_prefix(*argv, "exist"))
+			*flags = BPF_EXIST;
+
+		return parse_elem(argv + 1, info, key, value, key_size,
+				  value_size, NULL, value_fd);
+	}
+
+	err("expected key or value, got: %s\n", *argv);
+	return -1;
+}
+
+static int show_map_close(int fd, struct bpf_map_info *info)
+{
+	char *memlock;
+
+	memlock = get_fdinfo(fd, "memlock");
+	close(fd);
+
+	printf("   %u: ", info->id);
+	if (info->type < ARRAY_SIZE(map_type_name))
+		printf("%s  ", map_type_name[info->type]);
+	else
+		printf("type:%u  ", info->type);
+
+	if (*info->name)
+		printf("name:%s  ", info->name);
+
+	printf("flags:0x%x\n", info->map_flags);
+	printf("\tkey:%uB  value:%uB  max_entries:%u",
+	       info->key_size, info->value_size, info->max_entries);
+
+	if (memlock)
+		printf("  memlock:%sB", memlock);
+	free(memlock);
+
+	printf("\n");
+
+	return 0;
+}
+
+static int do_show(int argc, char **argv)
+{
+	struct bpf_map_info info = {};
+	__u32 len = sizeof(info);
+	__u32 id = 0;
+	int err;
+	int fd;
+
+	if (argc == 2) {
+		fd = map_parse_fd_and_info(&argc, &argv, &info, &len);
+		if (fd < 0)
+			return -1;
+
+		return show_map_close(fd, &info);
+	}
+
+	if (argc)
+		return BAD_ARG();
+
+	while (true) {
+		err = bpf_map_get_next_id(id, &id);
+		if (err) {
+			if (errno == ENOENT)
+				break;
+			err("can't get next map: %s\n", strerror(errno));
+			if (errno == EINVAL)
+				err("kernel too old?\n");
+			return -1;
+		}
+
+		fd = bpf_map_get_fd_by_id(id);
+		if (fd < 0) {
+			err("can't get map by id (%u): %s\n",
+			    id, strerror(errno));
+			return -1;
+		}
+
+		err = bpf_obj_get_info_by_fd(fd, &info, &len);
+		if (err) {
+			err("can't get map info: %s\n", strerror(errno));
+			close(fd);
+			return -1;
+		}
+
+		show_map_close(fd, &info);
+	}
+
+	return errno == ENOENT ? 0 : -1;
+}
+
+static int do_dump(int argc, char **argv)
+{
+	void *key, *value, *prev_key;
+	unsigned int num_elems = 0;
+	struct bpf_map_info info = {};
+	__u32 len = sizeof(info);
+	int err;
+	int fd;
+
+	if (argc != 2)
+		usage();
+
+	fd = map_parse_fd_and_info(&argc, &argv, &info, &len);
+	if (fd < 0)
+		return -1;
+
+	if (map_is_map_of_maps(info.type) || map_is_map_of_progs(info.type)) {
+		err("Dumping maps of maps and program maps not supported\n");
+		close(fd);
+		return -1;
+	}
+
+	key = malloc(info.key_size);
+	value = alloc_value(&info);
+	if (!key || !value) {
+		err("mem alloc failed\n");
+		err = -1;
+		goto exit_free;
+	}
+
+	prev_key = NULL;
+	while (true) {
+		err = bpf_map_get_next_key(fd, prev_key, key);
+		if (err) {
+			if (errno == ENOENT)
+				err = 0;
+			break;
+		}
+
+		if (!bpf_map_lookup_elem(fd, key, value)) {
+			print_entry(&info, key, value);
+		} else {
+			info("can't lookup element with key: ");
+			print_hex(key, info.key_size, " ");
+			printf("\n");
+		}
+
+		prev_key = key;
+		num_elems++;
+	}
+
+	printf("Found %u element%s\n", num_elems, num_elems != 1 ? "s" : "");
+
+exit_free:
+	free(key);
+	free(value);
+	close(fd);
+
+	return err;
+}
+
+static int do_update(int argc, char **argv)
+{
+	struct bpf_map_info info = {};
+	__u32 len = sizeof(info);
+	__u32 *value_fd = NULL;
+	__u32 flags = BPF_ANY;
+	void *key, *value;
+	int fd, err;
+
+	if (argc < 2)
+		usage();
+
+	fd = map_parse_fd_and_info(&argc, &argv, &info, &len);
+	if (fd < 0)
+		return -1;
+
+	key = malloc(info.key_size);
+	value = alloc_value(&info);
+	if (!key || !value) {
+		err("mem alloc failed");
+		err = -1;
+		goto exit_free;
+	}
+
+	err = parse_elem(argv, &info, key, value, info.key_size,
+			 info.value_size, &flags, &value_fd);
+	if (err)
+		goto exit_free;
+
+	err = bpf_map_update_elem(fd, key, value, flags);
+	if (err) {
+		err("update failed: %s\n", strerror(errno));
+		goto exit_free;
+	}
+
+exit_free:
+	if (value_fd)
+		close(*value_fd);
+	free(key);
+	free(value);
+	close(fd);
+
+	return err;
+}
+
+static int do_lookup(int argc, char **argv)
+{
+	struct bpf_map_info info = {};
+	__u32 len = sizeof(info);
+	void *key, *value;
+	int err;
+	int fd;
+
+	if (argc < 2)
+		usage();
+
+	fd = map_parse_fd_and_info(&argc, &argv, &info, &len);
+	if (fd < 0)
+		return -1;
+
+	key = malloc(info.key_size);
+	value = alloc_value(&info);
+	if (!key || !value) {
+		err("mem alloc failed");
+		err = -1;
+		goto exit_free;
+	}
+
+	err = parse_elem(argv, &info, key, NULL, info.key_size, 0, NULL, NULL);
+	if (err)
+		goto exit_free;
+
+	err = bpf_map_lookup_elem(fd, key, value);
+	if (!err) {
+		print_entry(&info, key, value);
+	} else if (errno == ENOENT) {
+		printf("key:\n");
+		print_hex(key, info.key_size, " ");
+		printf("\n\nNot found\n");
+	} else {
+		err("lookup failed: %s\n", strerror(errno));
+	}
+
+exit_free:
+	free(key);
+	free(value);
+	close(fd);
+
+	return err;
+}
+
+static int do_getnext(int argc, char **argv)
+{
+	struct bpf_map_info info = {};
+	__u32 len = sizeof(info);
+	void *key, *nextkey;
+	int err;
+	int fd;
+
+	if (argc < 2)
+		usage();
+
+	fd = map_parse_fd_and_info(&argc, &argv, &info, &len);
+	if (fd < 0)
+		return -1;
+
+	key = malloc(info.key_size);
+	nextkey = malloc(info.key_size);
+	if (!key || !nextkey) {
+		err("mem alloc failed");
+		err = -1;
+		goto exit_free;
+	}
+
+	if (argc) {
+		err = parse_elem(argv, &info, key, NULL, info.key_size, 0,
+				 NULL, NULL);
+		if (err)
+			goto exit_free;
+	} else {
+		free(key);
+		key = NULL;
+	}
+
+	err = bpf_map_get_next_key(fd, key, nextkey);
+	if (err) {
+		err("can't get next key: %s\n", strerror(errno));
+		goto exit_free;
+	}
+
+	if (key) {
+		printf("key:\n");
+		print_hex(key, info.key_size, " ");
+		printf("\n");
+	} else {
+		printf("key: None\n");
+	}
+
+	printf("next key:\n");
+	print_hex(nextkey, info.key_size, " ");
+	printf("\n");
+
+exit_free:
+	free(nextkey);
+	free(key);
+	close(fd);
+
+	return err;
+}
+
+static int do_delete(int argc, char **argv)
+{
+	struct bpf_map_info info = {};
+	__u32 len = sizeof(info);
+	void *key;
+	int err;
+	int fd;
+
+	if (argc < 2)
+		usage();
+
+	fd = map_parse_fd_and_info(&argc, &argv, &info, &len);
+	if (fd < 0)
+		return -1;
+
+	key = malloc(info.key_size);
+	if (!key) {
+		err("mem alloc failed");
+		err = -1;
+		goto exit_free;
+	}
+
+	err = parse_elem(argv, &info, key, NULL, info.key_size, 0, NULL, NULL);
+	if (err)
+		goto exit_free;
+
+	err = bpf_map_delete_elem(fd, key);
+	if (err)
+		err("delete failed: %s\n", strerror(errno));
+
+exit_free:
+	free(key);
+	close(fd);
+
+	return err;
+}
+
+static int do_pin(int argc, char **argv)
+{
+	return do_pin_any(argc, argv, bpf_map_get_fd_by_id);
+}
+
+static int do_help(int argc, char **argv)
+{
+	fprintf(stderr,
+		"Usage: %s %s show   [MAP]\n"
+		"       %s %s dump    MAP\n"
+		"       %s %s update  MAP  key BYTES value VALUE [UPDATE_FLAGS]\n"
+		"       %s %s lookup  MAP  key BYTES\n"
+		"       %s %s getnext MAP [key BYTES]\n"
+		"       %s %s delete  MAP  key BYTES\n"
+		"       %s %s pin     MAP  FILE\n"
+		"       %s %s help\n"
+		"\n"
+		"       MAP := { id MAP_ID | pinned FILE }\n"
+		"       " HELP_SPEC_PROGRAM "\n"
+		"       VALUE := { BYTES | MAP | PROG }\n"
+		"       UPDATE_FLAGS := { any | exist | noexist }\n"
+		"",
+		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
+		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
+		bin_name, argv[-2], bin_name, argv[-2]);
+
+	return 0;
+}
+
+static const struct cmd cmds[] = {
+	{ "show",	do_show },
+	{ "help",	do_help },
+	{ "dump",	do_dump },
+	{ "update",	do_update },
+	{ "lookup",	do_lookup },
+	{ "getnext",	do_getnext },
+	{ "delete",	do_delete },
+	{ "pin",	do_pin },
+	{ 0 }
+};
+
+int do_map(int argc, char **argv)
+{
+	return cmd_select(cmds, argc, argv, do_help);
+}
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
new file mode 100644
index 000000000000..45033dec41dd
--- /dev/null
+++ b/tools/bpf/bpftool/prog.c
@@ -0,0 +1,456 @@
+/*
+ * Copyright (C) 2017 Netronome Systems, Inc.
+ *
+ * This software is dual licensed under the GNU General License Version 2,
+ * June 1991 as shown in the file COPYING in the top-level directory of this
+ * source tree or the BSD 2-Clause License provided below.  You have the
+ * option to license this software under the complete terms of either license.
+ *
+ * The BSD 2-Clause License:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      1. Redistributions of source code must retain the above
+ *         copyright notice, this list of conditions and the following
+ *         disclaimer.
+ *
+ *      2. Redistributions in binary form must reproduce the above
+ *         copyright notice, this list of conditions and the following
+ *         disclaimer in the documentation and/or other materials
+ *         provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+/* Author: Jakub Kicinski <kubakici@wp.pl> */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <time.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+
+#include <bpf.h>
+
+#include "main.h"
+
+static const char * const prog_type_name[] = {
+	[BPF_PROG_TYPE_UNSPEC]		= "unspec",
+	[BPF_PROG_TYPE_SOCKET_FILTER]	= "socket_filter",
+	[BPF_PROG_TYPE_KPROBE]		= "kprobe",
+	[BPF_PROG_TYPE_SCHED_CLS]	= "sched_cls",
+	[BPF_PROG_TYPE_SCHED_ACT]	= "sched_act",
+	[BPF_PROG_TYPE_TRACEPOINT]	= "tracepoint",
+	[BPF_PROG_TYPE_XDP]		= "xdp",
+	[BPF_PROG_TYPE_PERF_EVENT]	= "perf_event",
+	[BPF_PROG_TYPE_CGROUP_SKB]	= "cgroup_skb",
+	[BPF_PROG_TYPE_CGROUP_SOCK]	= "cgroup_sock",
+	[BPF_PROG_TYPE_LWT_IN]		= "lwt_in",
+	[BPF_PROG_TYPE_LWT_OUT]		= "lwt_out",
+	[BPF_PROG_TYPE_LWT_XMIT]	= "lwt_xmit",
+	[BPF_PROG_TYPE_SOCK_OPS]	= "sock_ops",
+	[BPF_PROG_TYPE_SK_SKB]		= "sk_skb",
+};
+
+static void print_boot_time(__u64 nsecs, char *buf, unsigned int size)
+{
+	struct timespec real_time_ts, boot_time_ts;
+	time_t wallclock_secs;
+	struct tm load_tm;
+
+	buf[--size] = '\0';
+
+	if (clock_gettime(CLOCK_REALTIME, &real_time_ts) ||
+	    clock_gettime(CLOCK_BOOTTIME, &boot_time_ts)) {
+		perror("Can't read clocks");
+		snprintf(buf, size, "%llu", nsecs / 1000000000);
+		return;
+	}
+
+	wallclock_secs = (real_time_ts.tv_sec - boot_time_ts.tv_sec) +
+		nsecs / 1000000000;
+
+	if (!localtime_r(&wallclock_secs, &load_tm)) {
+		snprintf(buf, size, "%llu", nsecs / 1000000000);
+		return;
+	}
+
+	strftime(buf, size, "%b %d/%H:%M", &load_tm);
+}
+
+static int prog_fd_by_tag(unsigned char *tag)
+{
+	struct bpf_prog_info info = {};
+	__u32 len = sizeof(info);
+	unsigned int id = 0;
+	int err;
+	int fd;
+
+	while (true) {
+		err = bpf_prog_get_next_id(id, &id);
+		if (err) {
+			err("%s\n", strerror(errno));
+			return -1;
+		}
+
+		fd = bpf_prog_get_fd_by_id(id);
+		if (fd < 0) {
+			err("can't get prog by id (%u): %s\n",
+			    id, strerror(errno));
+			return -1;
+		}
+
+		err = bpf_obj_get_info_by_fd(fd, &info, &len);
+		if (err) {
+			err("can't get prog info (%u): %s\n",
+			    id, strerror(errno));
+			close(fd);
+			return -1;
+		}
+
+		if (!memcmp(tag, info.tag, BPF_TAG_SIZE))
+			return fd;
+
+		close(fd);
+	}
+}
+
+int prog_parse_fd(int *argc, char ***argv)
+{
+	int fd;
+
+	if (is_prefix(**argv, "id")) {
+		unsigned int id;
+		char *endptr;
+
+		NEXT_ARGP();
+
+		id = strtoul(**argv, &endptr, 0);
+		if (*endptr) {
+			err("can't parse %s as ID\n", **argv);
+			return -1;
+		}
+		NEXT_ARGP();
+
+		fd = bpf_prog_get_fd_by_id(id);
+		if (fd < 0)
+			err("get by id (%u): %s\n", id, strerror(errno));
+		return fd;
+	} else if (is_prefix(**argv, "tag")) {
+		unsigned char tag[BPF_TAG_SIZE];
+
+		NEXT_ARGP();
+
+		if (sscanf(**argv, BPF_TAG_FMT, tag, tag + 1, tag + 2,
+			   tag + 3, tag + 4, tag + 5, tag + 6, tag + 7)
+		    != BPF_TAG_SIZE) {
+			err("can't parse tag\n");
+			return -1;
+		}
+		NEXT_ARGP();
+
+		return prog_fd_by_tag(tag);
+	} else if (is_prefix(**argv, "pinned")) {
+		char *path;
+
+		NEXT_ARGP();
+
+		path = **argv;
+		NEXT_ARGP();
+
+		return open_obj_pinned_any(path, BPF_OBJ_PROG);
+	}
+
+	err("expected 'id', 'tag' or 'pinned', got: '%s'?\n", **argv);
+	return -1;
+}
+
+static void show_prog_maps(int fd, u32 num_maps)
+{
+	struct bpf_prog_info info = {};
+	__u32 len = sizeof(info);
+	__u32 map_ids[num_maps];
+	unsigned int i;
+	int err;
+
+	info.nr_map_ids = num_maps;
+	info.map_ids = ptr_to_u64(map_ids);
+
+	err = bpf_obj_get_info_by_fd(fd, &info, &len);
+	if (err || !info.nr_map_ids)
+		return;
+
+	printf("  map_ids:");
+	for (i = 0; i < info.nr_map_ids; i++)
+		printf("%u%s", map_ids[i],
+		       i == info.nr_map_ids - 1 ? "" : ",");
+}
+
+static int show_prog(int fd)
+{
+	struct bpf_prog_info info = {};
+	__u32 len = sizeof(info);
+	char *memlock;
+	int err;
+
+	err = bpf_obj_get_info_by_fd(fd, &info, &len);
+	if (err) {
+		err("can't get prog info: %s\n", strerror(errno));
+		return -1;
+	}
+
+	printf("   %u: ", info.id);
+	if (info.type < ARRAY_SIZE(prog_type_name))
+		printf("%s  ", prog_type_name[info.type]);
+	else
+		printf("type:%u  ", info.type);
+
+	if (*info.name)
+		printf("name:%s  ", info.name);
+
+	printf("tag ");
+	print_hex(info.tag, BPF_TAG_SIZE, ":");
+	printf("\n");
+
+	if (info.load_time) {
+		char buf[32];
+
+		print_boot_time(info.load_time, buf, sizeof(buf));
+
+		/* Piggy back on load_time, since 0 uid is a valid one */
+		printf("\tloaded_at: %s  uid:%u\n", buf, info.created_by_uid);
+	}
+
+	printf("\txlated:%uB", info.xlated_prog_len);
+
+	if (info.jited_prog_len)
+		printf("  jited:%uB", info.jited_prog_len);
+	else
+		printf("  jited:no");
+
+	memlock = get_fdinfo(fd, "memlock");
+	if (memlock)
+		printf("  memlock:%sB", memlock);
+	free(memlock);
+
+	if (info.nr_map_ids)
+		show_prog_maps(fd, info.nr_map_ids);
+
+	printf("\n");
+
+	return 0;
+}
+
+static int do_show(int argc, char **argv)
+{	__u32 id = 0;
+	int err;
+	int fd;
+
+	if (argc == 2) {
+		fd = prog_parse_fd(&argc, &argv);
+		if (fd < 0)
+			return -1;
+
+		return show_prog(fd);
+	}
+
+	if (argc)
+		return BAD_ARG();
+
+	while (true) {
+		err = bpf_prog_get_next_id(id, &id);
+		if (err) {
+			if (errno == ENOENT)
+				break;
+			err("can't get next program: %s\n", strerror(errno));
+			if (errno == EINVAL)
+				err("kernel too old?\n");
+			return -1;
+		}
+
+		fd = bpf_prog_get_fd_by_id(id);
+		if (fd < 0) {
+			err("can't get prog by id (%u): %s\n",
+			    id, strerror(errno));
+			return -1;
+		}
+
+		err = show_prog(fd);
+		close(fd);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
+static int do_dump(int argc, char **argv)
+{
+	struct bpf_prog_info info = {};
+	__u32 len = sizeof(info);
+	bool can_disasm = false;
+	unsigned int buf_size;
+	char *filepath = NULL;
+	bool opcodes = false;
+	unsigned char *buf;
+	__u32 *member_len;
+	__u64 *member_ptr;
+	ssize_t n;
+	int err;
+	int fd;
+
+	if (is_prefix(*argv, "jited")) {
+		member_len = &info.jited_prog_len;
+		member_ptr = &info.jited_prog_insns;
+		can_disasm = true;
+	} else if (is_prefix(*argv, "xlated")) {
+		member_len = &info.xlated_prog_len;
+		member_ptr = &info.xlated_prog_insns;
+	} else {
+		err("expected 'xlated' or 'jited', got: %s\n", *argv);
+		return -1;
+	}
+	NEXT_ARG();
+
+	if (argc < 2)
+		usage();
+
+	fd = prog_parse_fd(&argc, &argv);
+	if (fd < 0)
+		return -1;
+
+	if (is_prefix(*argv, "file")) {
+		NEXT_ARG();
+		if (!argc) {
+			err("expected file path\n");
+			return -1;
+		}
+
+		filepath = *argv;
+		NEXT_ARG();
+	} else if (is_prefix(*argv, "opcodes")) {
+		opcodes = true;
+		NEXT_ARG();
+	}
+
+	if (!filepath && !can_disasm) {
+		err("expected 'file' got %s\n", *argv);
+		return -1;
+	}
+	if (argc) {
+		usage();
+		return -1;
+	}
+
+	err = bpf_obj_get_info_by_fd(fd, &info, &len);
+	if (err) {
+		err("can't get prog info: %s\n", strerror(errno));
+		return -1;
+	}
+
+	if (!*member_len) {
+		info("no instructions returned\n");
+		close(fd);
+		return 0;
+	}
+
+	buf_size = *member_len;
+
+	buf = malloc(buf_size);
+	if (!buf) {
+		err("mem alloc failed\n");
+		close(fd);
+		return -1;
+	}
+
+	memset(&info, 0, sizeof(info));
+
+	*member_ptr = ptr_to_u64(buf);
+	*member_len = buf_size;
+
+	err = bpf_obj_get_info_by_fd(fd, &info, &len);
+	close(fd);
+	if (err) {
+		err("can't get prog info: %s\n", strerror(errno));
+		goto err_free;
+	}
+
+	if (*member_len > buf_size) {
+		info("too many instructions returned\n");
+		goto err_free;
+	}
+
+	if (filepath) {
+		fd = open(filepath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
+		if (fd < 0) {
+			err("can't open file %s: %s\n", filepath,
+			    strerror(errno));
+			goto err_free;
+		}
+
+		n = write(fd, buf, *member_len);
+		close(fd);
+		if (n != *member_len) {
+			err("error writing output file: %s\n",
+			    n < 0 ? strerror(errno) : "short write");
+			goto err_free;
+		}
+	} else {
+		disasm_print_insn(buf, *member_len, opcodes);
+	}
+
+	free(buf);
+
+	return 0;
+
+err_free:
+	free(buf);
+	return -1;
+}
+
+static int do_pin(int argc, char **argv)
+{
+	return do_pin_any(argc, argv, bpf_prog_get_fd_by_id);
+}
+
+static int do_help(int argc, char **argv)
+{
+	fprintf(stderr,
+		"Usage: %s %s show [PROG]\n"
+		"       %s %s dump xlated PROG  file FILE\n"
+		"       %s %s dump jited  PROG [file FILE] [opcodes]\n"
+		"       %s %s pin   PROG FILE\n"
+		"       %s %s help\n"
+		"\n"
+		"       " HELP_SPEC_PROGRAM "\n"
+		"",
+		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
+		bin_name, argv[-2], bin_name, argv[-2]);
+
+	return 0;
+}
+
+static const struct cmd cmds[] = {
+	{ "show",	do_show },
+	{ "dump",	do_dump },
+	{ "pin",	do_pin },
+	{ 0 }
+};
+
+int do_prog(int argc, char **argv)
+{
+	return cmd_select(cmds, argc, argv, do_help);
+}
-- 
2.14.1

^ permalink raw reply related

* [PATCH net-next v2 3/3] tools: bpftool: add documentation
From: Jakub Kicinski @ 2017-10-02 23:11 UTC (permalink / raw)
  To: netdev
  Cc: daniel, alexei.starovoitov, oss-drivers, Jakub Kicinski,
	David Beckett
In-Reply-To: <20171002231130.12406-1-jakub.kicinski@netronome.com>

Add documentation for bpftool.  Separate files for each subcommand.
Use rst format.  Documentation is compiled into man pages using
rst2man.

Signed-off-by: David Beckett <david.beckett@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 tools/bpf/bpftool/Documentation/Makefile         |  34 +++++++
 tools/bpf/bpftool/Documentation/bpftool-map.txt  | 110 +++++++++++++++++++++++
 tools/bpf/bpftool/Documentation/bpftool-prog.txt |  81 +++++++++++++++++
 tools/bpf/bpftool/Documentation/bpftool.txt      |  34 +++++++
 tools/bpf/bpftool/Makefile                       |   6 ++
 5 files changed, 265 insertions(+)
 create mode 100644 tools/bpf/bpftool/Documentation/Makefile
 create mode 100644 tools/bpf/bpftool/Documentation/bpftool-map.txt
 create mode 100644 tools/bpf/bpftool/Documentation/bpftool-prog.txt
 create mode 100644 tools/bpf/bpftool/Documentation/bpftool.txt

diff --git a/tools/bpf/bpftool/Documentation/Makefile b/tools/bpf/bpftool/Documentation/Makefile
new file mode 100644
index 000000000000..ebd21ab2c6df
--- /dev/null
+++ b/tools/bpf/bpftool/Documentation/Makefile
@@ -0,0 +1,34 @@
+include ../../../scripts/Makefile.include
+include ../../../scripts/utilities.mak
+
+INSTALL ?= install
+RM ?= rm -f
+
+# Make the path relative to DESTDIR, not prefix
+ifndef DESTDIR
+prefix?=$(HOME)
+endif
+mandir ?= $(prefix)/share/man
+man8dir = $(mandir)/man8
+
+MAN8_TXT = $(wildcard *.txt)
+
+_DOC_MAN8 = $(patsubst %.txt,%.8,$(MAN8_TXT))
+DOC_MAN8 = $(addprefix $(OUTPUT),$(_DOC_MAN8))
+
+man: man8
+man8: $(DOC_MAN8)
+
+$(OUTPUT)%.8: %.txt
+	rst2man $< > $@
+
+clean:
+	$(call QUIET_CLEAN, Documentation) $(RM) $(DOC_MAN8)
+
+install: man
+	$(call QUIET_INSTALL, Documentation-man) \
+		$(INSTALL) -d -m 755 $(DESTDIR)$(man8dir); \
+		$(INSTALL) -m 644 $(DOC_MAN8) $(DESTDIR)$(man8dir);
+
+.PHONY: man man8 clean install
+.DEFAULT_GOAL := man
diff --git a/tools/bpf/bpftool/Documentation/bpftool-map.txt b/tools/bpf/bpftool/Documentation/bpftool-map.txt
new file mode 100644
index 000000000000..ea82b9059a0a
--- /dev/null
+++ b/tools/bpf/bpftool/Documentation/bpftool-map.txt
@@ -0,0 +1,110 @@
+================
+bpftool-map
+================
+-------------------------------------------------------------------------------
+tool for inspection and simple manipulation of eBPF maps
+-------------------------------------------------------------------------------
+
+:Manual section: 8
+
+SYNOPSIS
+========
+
+	**bpftool** **map** *COMMAND*
+
+	*COMMANDS* :=
+	{ show | dump | update | lookup | getnext | delete | pin | help }
+
+MAP COMMANDS
+=============
+
+|	**bpftool** map show   [*MAP*]
+|	**bpftool** map dump    *MAP*
+|	**bpftool** map update  *MAP*  key *BYTES*   value *VALUE* [*UPDATE_FLAGS*]
+|	**bpftool** map lookup  *MAP*  key *BYTES*
+|	**bpftool** map getnext *MAP* [key *BYTES*]
+|	**bpftool** map delete  *MAP*  key *BYTES*
+|	**bpftool** map pin     *MAP*  *FILE*
+|	**bpftool** map help
+|
+|	*MAP* := { id MAP_ID | pinned FILE }
+|	*VALUE* := { BYTES | MAP | PROGRAM }
+|	*UPDATE_FLAGS* := { any | exist | noexist }
+
+DESCRIPTION
+===========
+	**bpftool map show**   [*MAP*]
+		  Show information about loaded maps.  If *MAP* is specified
+		  show information only about given map, otherwise list all
+		  maps currently loaded on the system.
+
+		  Output will start with map ID followed by map type and
+		  zero or more named attributes (depending on kernel version).
+
+	**bpftool map dump**    *MAP*
+		  Dump all entries in a given *MAP*.
+
+	**bpftool map update**  *MAP*  **key** *BYTES*   **value** *VALUE* [*UPDATE_FLAGS*]
+		  Update map entry for a given *KEY*.
+
+		  *UPDATE_FLAGS* can be one of: **any** update existing entry
+		  or add if doesn't exit; **exist** update only if entry already
+		  exists; **noexist** update only if entry doesn't exist.
+
+	**bpftool map lookup**  *MAP*  **key** *BYTES*
+		  Lookup **key** in the map.
+
+	**bpftool map getnext** *MAP* [**key** *BYTES*]
+		  Get next key.  If *key* is not specified, get first key.
+
+	**bpftool map delete**  *MAP*  **key** *BYTES*
+		  Remove entry from the map.
+
+	**bpftool map pin**     *MAP*  *FILE*
+		  Pin map *MAP* as *FILE*.
+
+		  Note: *FILE* must be located in *bpffs* mount.
+
+	**bpftool map help**
+		  Print short help message.
+
+EXAMPLES
+========
+**# bpftool map show**
+::
+
+  10: hash  name:some_map  flags:0x0
+	key:4B  value:8B  max_entries:2048  memlock:167936B
+
+**# bpftool map update id 10 key 13 00 07 00 value 02 00 00 00 01 02 03 04**
+
+**# bpftool map lookup id 10 key 0 1 2 3**
+
+::
+
+  key: 00 01 02 03 value: 00 01 02 03 04 05 06 07
+
+
+**# bpftool map dump id 10**
+::
+
+  key: 00 01 02 03  value: 00 01 02 03 04 05 06 07
+  key: 0d 00 07 00  value: 02 00 00 00 01 02 03 04
+  Found 2 elements
+
+**# bpftool map getnext id 10 key 0 1 2 3**
+::
+
+  key:
+  00 01 02 03
+  next key:
+  0d 00 07 00
+
+|  
+| **# mount -t bpf none /sys/fs/bpf/**
+| **# bpftool map pin id 10 /sys/fs/bpf/map**
+| **# bpftool map del pinned /sys/fs/bpf/map key 13 00 07 00**
+
+SEE ALSO
+========
+	**bpftool**\ (8), **bpftool-prog**\ (8)
diff --git a/tools/bpf/bpftool/Documentation/bpftool-prog.txt b/tools/bpf/bpftool/Documentation/bpftool-prog.txt
new file mode 100644
index 000000000000..d632de0e0212
--- /dev/null
+++ b/tools/bpf/bpftool/Documentation/bpftool-prog.txt
@@ -0,0 +1,81 @@
+================
+bpftool-prog
+================
+-------------------------------------------------------------------------------
+tool for inspection and simple manipulation of eBPF progs
+-------------------------------------------------------------------------------
+
+:Manual section: 8
+
+SYNOPSIS
+========
+
+|	**bpftool** prog show [*PROG*]
+|	**bpftool** prog dump xlated *PROG*  file *FILE*
+|	**bpftool** prog dump jited  *PROG* [file *FILE*] [opcodes]
+|	**bpftool** prog pin *PROG* *FILE*
+|	**bpftool** prog help
+|
+|	*PROG* := { id *PROG_ID* | pinned *FILE* | tag *PROG_TAG* }
+
+DESCRIPTION
+===========
+	**bpftool prog show** [*PROG*]
+		  Show information about loaded programs.  If *PROG* is
+		  specified show information only about given program, otherwise
+		  list all programs currently loaded on the system.
+
+		  Output will start with program ID followed by program type and
+		  zero or more named attributes (depending on kernel version).
+
+	**bpftool prog dump xlated** *PROG*  **file** *FILE*
+		  Dump eBPF instructions of the program from the kernel to a
+		  file.
+
+	**bpftool prog dump jited**  *PROG* [**file** *FILE*] [**opcodes**]
+		  Dump jited image (host machine code) of the program.
+		  If *FILE* is specified image will be written to a file,
+		  otherwise it will be disassembled and printed to stdout.
+
+		  **opcodes** controls if raw opcodes will be printed.
+
+	**bpftool prog pin** *PROG* *FILE*
+		  Pin program *PROG* as *FILE*.
+
+		  Note: *FILE* must be located in *bpffs* mount.
+
+	**bpftool prog help**
+		  Print short help message.
+
+EXAMPLES
+========
+**# bpftool prog show**
+::
+
+  10: xdp  name:some_prog  tag 00:5a:3d:21:23:62:0c:8b
+	loaded_at:2024.771  uid:0
+	xlated:528B  jited:370B  memlock:4096B  map_ids:10
+
+| 
+| **# bpftool prog dump xlated id 10 file /tmp/t**
+| **# ls -l /tmp/t**
+|   -rw------- 1 root root 560 Jul 22 01:42 /tmp/t
+
+| 
+| **# mount -t bpf none /sys/fs/bpf/**
+| **# bpftool prog pin id 10 /sys/fs/bpf/prog**
+| **# bpftool prog dum jited pinned /sys/fs/bpf/prog**
+
+::
+
+    push   %rbp
+    mov    %rsp,%rbp
+    sub    $0x228,%rsp
+    sub    $0x28,%rbp
+    mov    %rbx,0x0(%rbp)
+
+
+
+SEE ALSO
+========
+	**bpftool**\ (8), **bpftool-map**\ (8)
diff --git a/tools/bpf/bpftool/Documentation/bpftool.txt b/tools/bpf/bpftool/Documentation/bpftool.txt
new file mode 100644
index 000000000000..f1df1893fb54
--- /dev/null
+++ b/tools/bpf/bpftool/Documentation/bpftool.txt
@@ -0,0 +1,34 @@
+================
+BPFTOOL
+================
+-------------------------------------------------------------------------------
+tool for inspection and simple manipulation of eBPF programs and maps
+-------------------------------------------------------------------------------
+
+:Manual section: 8
+
+SYNOPSIS
+========
+
+	**bpftool** *OBJECT* { *COMMAND* | help }
+
+	**bpftool** batch file *FILE*
+
+	*OBJECT* := { **map** | **program** }
+
+	*MAP-COMMANDS* :=
+	{ show | dump | update | lookup | getnext | delete | pin | help }
+
+	*PROG-COMMANDS* := { show | dump jited | dump xlated | pin | help }
+
+DESCRIPTION
+===========
+	*bpftool* allows for inspection and simple modification of BPF objects
+	on the system.
+
+	Note that format of the output of all tools is not guaranteed to be
+	stable and should not be depended upon.
+
+SEE ALSO
+========
+	**bpftool-map**\ (8), **bpftool-prog**\ (8)
diff --git a/tools/bpf/bpftool/Makefile b/tools/bpf/bpftool/Makefile
index a7151f47fb40..8705ee44664d 100644
--- a/tools/bpf/bpftool/Makefile
+++ b/tools/bpf/bpftool/Makefile
@@ -74,6 +74,12 @@ clean: $(LIBBPF)-clean
 install:
 	install $(OUTPUT)bpftool $(prefix)/sbin/bpftool
 
+doc:
+	$(Q)$(MAKE) -C Documentation/
+
+doc-install:
+	$(Q)$(MAKE) -C Documentation/ install
+
 FORCE:
 
 .PHONY: all clean FORCE
-- 
2.14.1

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox