Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Jiri Pirko @ 2018-04-11 14:51 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Samudrala, Sridhar, stephen, davem, netdev, virtualization,
	virtio-dev, jesse.brandeburg, alexander.h.duyck, kubakici,
	jasowang, loseweigh
In-Reply-To: <20180411174157-mutt-send-email-mst@kernel.org>

Wed, Apr 11, 2018 at 04:45:07PM CEST, mst@redhat.com wrote:
>On Wed, Apr 11, 2018 at 10:03:32AM +0200, Jiri Pirko wrote:
>> Wed, Apr 11, 2018 at 08:24:43AM CEST, sridhar.samudrala@intel.com wrote:
>> >On 4/10/2018 11:03 PM, Jiri Pirko wrote:
>> >> Tue, Apr 10, 2018 at 05:59:02PM CEST, sridhar.samudrala@intel.com wrote:
>> >> > On 4/10/2018 8:43 AM, Jiri Pirko wrote:
>> >> > > Tue, Apr 10, 2018 at 05:27:48PM CEST, sridhar.samudrala@intel.com wrote:
>> >> > > > On 4/10/2018 8:22 AM, Jiri Pirko wrote:
>> >> > > > > Tue, Apr 10, 2018 at 05:13:40PM CEST, sridhar.samudrala@intel.com wrote:
>> >> > > > > > On 4/10/2018 3:55 AM, Jiri Pirko wrote:
>> >> > > > > > > Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudrala@intel.com wrote:
>> >> > > > > > > > On 4/9/2018 1:07 AM, Jiri Pirko wrote:
>> >> > > > > > > > > Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudrala@intel.com wrote:
>> >> > > > > > > > > > On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>> >> > > > > > > > > > > Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudrala@intel.com wrote:
>> >> > > > > > > > > [...]
>> >> > > > > > > > > 
>> >> > > > > > > > > > > > +static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
>> >> > > > > > > > > > > > +				     struct net_device *child_netdev)
>> >> > > > > > > > > > > > +{
>> >> > > > > > > > > > > > +	struct virtnet_bypass_info *vbi;
>> >> > > > > > > > > > > > +	bool backup;
>> >> > > > > > > > > > > > +
>> >> > > > > > > > > > > > +	vbi = netdev_priv(bypass_netdev);
>> >> > > > > > > > > > > > +	backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
>> >> > > > > > > > > > > > +	if (backup ? rtnl_dereference(vbi->backup_netdev) :
>> >> > > > > > > > > > > > +			rtnl_dereference(vbi->active_netdev)) {
>> >> > > > > > > > > > > > +		netdev_info(bypass_netdev,
>> >> > > > > > > > > > > > +			    "%s attempting to join bypass dev when %s already present\n",
>> >> > > > > > > > > > > > +			    child_netdev->name, backup ? "backup" : "active");
>> >> > > > > > > > > > > Bypass module should check if there is already some other netdev
>> >> > > > > > > > > > > enslaved and refuse right there.
>> >> > > > > > > > > > This will work for virtio-net with 3 netdev model, but this check has to be done by netvsc
>> >> > > > > > > > > > as its bypass_netdev is same as the backup_netdev.
>> >> > > > > > > > > > Will add a flag while registering with the bypass module to indicate if the driver is doing
>> >> > > > > > > > > > a 2 netdev or 3 netdev model and based on that flag this check can be done in bypass module
>> >> > > > > > > > > > for 3 netdev scenario.
>> >> > > > > > > > > Just let me undestand it clearly. What I expect the difference would be
>> >> > > > > > > > > between 2netdev and3 netdev model is this:
>> >> > > > > > > > > 2netdev:
>> >> > > > > > > > >        bypass_master
>> >> > > > > > > > >           /
>> >> > > > > > > > >          /
>> >> > > > > > > > > VF_slave
>> >> > > > > > > > > 
>> >> > > > > > > > > 3netdev:
>> >> > > > > > > > >        bypass_master
>> >> > > > > > > > >           /     \
>> >> > > > > > > > >          /       \
>> >> > > > > > > > > VF_slave   backup_slave
>> >> > > > > > > > > 
>> >> > > > > > > > > Is that correct? If not, how does it look like?
>> >> > > > > > > > > 
>> >> > > > > > > > > 
>> >> > > > > > > > Looks correct.
>> >> > > > > > > > VF_slave and backup_slave are the original netdevs and are present in both the models.
>> >> > > > > > > > In the 3 netdev model, bypass_master netdev is created and VF_slave and backup_slave are
>> >> > > > > > > > marked as the 2 slaves of this new netdev.
>> >> > > > > > > You say it looks correct and in another sentence you provide completely
>> >> > > > > > > different description. Could you please look again?
>> >> > > > > > > 
>> >> > > > > > To be exact, 2 netdev model with netvsc looks like this.
>> >> > > > > > 
>> >> > > > > >       netvsc_netdev
>> >> > > > > >         /
>> >> > > > > >        /
>> >> > > > > > VF_slave
>> >> > > > > > 
>> >> > > > > > With virtio_net, 3 netdev model
>> >> > > > > > 
>> >> > > > > >     bypass_netdev
>> >> > > > > >         /     \
>> >> > > > > >        /       \
>> >> > > > > > VF_slave   virtio_net netdev
>> >> > > > > Could you also mark the original netdev which is there now? is it
>> >> > > > > bypass_netdev or virtio_net_netdev ?
>> >> > > > bypass_netdev
>> >> > > >       /     \
>> >> > > >      /       \
>> >> > > > VF_slave   virtio_net netdev (original)
>> >> > > That does not make sense.
>> >> > > 1) You diverge from the behaviour of the netvsc, where the original
>> >> > >      netdev is a master of the VF
>> >> > > 2) If the original netdev is a slave, you cannot have any IP address
>> >> > >      configured on it (well you could, but the rx_handler would eat every
>> >> > >      incoming packet). So you will break the user bacause he would have to
>> >> > >      move the configuration to the new master device.
>> >> > > This only makes sense that the original netdev becomes the master for both
>> >> > > netvsc and virtio_net.
>> >> > Forgot to mention that bypass_netdev takes over the name of the original
>> >> > netdev and
>> >> > virtio_net netdev will get the backup name.
>> >> What do you mean by "name"?
>> >
>> >bypass_netdev also is associated with the same pci device as the original virtio_net
>> >netdev via SET_NETDEV_DEV().  Also, we added ndo_get_phys_port_name() to virtio_net
>> >that will return _bkup when BACKUP feature is enabled.
>> 
>> Okay.
>> 
>> >
>> >So for ex: if virtio_net inteface was getting 'ens12' as the name assigned by udev
>> >without BACKUP feature,  when BACKUP feature is enabled,  the  bypass_netdev will be
>> >named 'ens12' and the original virtio_net will get named as ens12n_bkup.
>> 
>> Got it.
>> 
>> I don't like the bypass_master to look differently in netvsc and
>> virtio_net :/ The best would be to convert netvsc to 3 netdev model and
>> treat them the same. The more I think about it, the more the 2 netdev
>> model feels wrong.
>
>If you believe that, then this patchset is a step in the right
>direction.
>
>With something like this patchset applied, converting netvsc to a 3
>device model will presumably be just a flag flip away. Afterwards

If done properly. Yes.

>we'll be able to drop dead code handling the bypass_master flag.
>
>> 
>> >
>> >
>> >> 
>> >> > So the userspace network configuration doesn't need to change.
>> >> > 
>> >> > 
>> >

^ permalink raw reply

* RE: [PATCH v1 net 0/3] lan78xx: Fixes and enhancements
From: RaghuramChary.Jallipalli @ 2018-04-11 14:54 UTC (permalink / raw)
  To: davem; +Cc: netdev, UNGLinuxDriver, Woojung.Huh
In-Reply-To: <20180411.102009.742541645869806513.davem@davemloft.net>

> 
> > These series of patches have fix and enhancements for lan78xx driver.
> 
> Two problems with this series:
> 
> 1) Only bug fixes are appropriate at this time.  Features and "enhancements"
>    belong in net-next which is not open right now.
> 
> 2) Patch #3 doesn't even apply cleanly.  Always base your patches on the
>    current tree.
> 
Sure, will send  #2,#3 patches in net-next when it opens.
Also will make sure they will be based on current tree.

Thanks,
Raghu

^ permalink raw reply

* [PATCH net 0/2] ibmvnic: Fix parameter change request handling
From: Nathan Fontenot @ 2018-04-11 14:37 UTC (permalink / raw)
  To: netdev; +Cc: jallen, tlfalcon

When updating parameters for the ibmvnic driver there is a possibility
of entering an infinite loop if a return value other that a partial
success is received from sending the login CRQ.

Also, a deadlock can occur on the rtnl lock if netdev_notify_peers()
is called during driver reset for a parameter change reset.

This patch set corrects both of these issues by updating the return
code handling in ibmvnic_login() nand gaurding against calling
netdev_notify_peers() for parameter change requests.

-Nathan
---

Nathan Fontenot (2):
      ibmvnic: Handle all login error conditions
      ibmvnic: Do not notify peers on parameter change resets

 drivers/net/ethernet/ibm/ibmvnic.c |   58 +++++++++++++++++++++++-------------
 drivers/net/ethernet/ibm/ibmvnic.h |    1 -
 2 files changed, 37 insertions(+), 22 deletions(-)

^ permalink raw reply

* [PATCH net 1/2] ibmvnic: Handle all login error conditions
From: Nathan Fontenot @ 2018-04-11 14:37 UTC (permalink / raw)
  To: netdev; +Cc: jallen, tlfalcon
In-Reply-To: <152345733402.41721.15717591824692008270.stgit@ltcalpine2-lp14.aus.stglabs.ibm.com>

There is a bug in handling the possible return codes from sending the
login CRQ. The current code treats any non-success return value,
minus failure to send the crq and a timeout waiting for a login response,
as a need to re-send the login CRQ. This can put the drive in an
inifinite loop of trying to login when getting return values other
that a partial success such as a return code of aborted. For these
scenarios the login will not ever succeed at this point and the
driver would need to be reset again.

To resolve this loop trying to login is updated to only retry the
login if the driver gets a return code of a partial success. Other
return coes are treated as an error and the driver returns an error
from ibmvnic_login().

To avoid infinite looping in the partial success return cases, the
number of retries is capped at the maximum number of supported
queues. This value was chosen because the driver does a renegatiation
of capabilities which sets the number of queus possible and allows
the driver to attempt a login for possible value for the number
of queues supported.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
---
 drivers/net/ethernet/ibm/ibmvnic.c |   55 +++++++++++++++++++++++-------------
 drivers/net/ethernet/ibm/ibmvnic.h |    1 -
 2 files changed, 35 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index aad5658d79d5..c6d5e9448eef 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -794,46 +794,61 @@ static int ibmvnic_login(struct net_device *netdev)
 {
 	struct ibmvnic_adapter *adapter = netdev_priv(netdev);
 	unsigned long timeout = msecs_to_jiffies(30000);
-	struct device *dev = &adapter->vdev->dev;
+	int retry_count = 0;
 	int rc;
 
 	do {
-		if (adapter->renegotiate) {
-			adapter->renegotiate = false;
+		if (retry_count > IBMVNIC_MAX_QUEUES) {
+			netdev_warn(netdev, "Login attempts exceeded\n");
+			return -1;
+		}
+
+		adapter->init_done_rc = 0;
+		reinit_completion(&adapter->init_done);
+		rc = send_login(adapter);
+		if (rc) {
+			netdev_warn(netdev, "Unable to login\n");
+			return rc;
+		}
+
+		if (!wait_for_completion_timeout(&adapter->init_done,
+						 timeout)) {
+			netdev_warn(netdev, "Login timed out\n");
+			return -1;
+		}
+
+		if (adapter->init_done_rc == PARTIALSUCCESS) {
+			retry_count++;
 			release_sub_crqs(adapter, 1);
 
+			adapter->init_done_rc = 0;
 			reinit_completion(&adapter->init_done);
 			send_cap_queries(adapter);
 			if (!wait_for_completion_timeout(&adapter->init_done,
 							 timeout)) {
-				dev_err(dev, "Capabilities query timeout\n");
+				netdev_warn(netdev,
+					    "Capabilities query timed out\n");
 				return -1;
 			}
+
 			rc = init_sub_crqs(adapter);
 			if (rc) {
-				dev_err(dev,
-					"Initialization of SCRQ's failed\n");
+				netdev_warn(netdev,
+					    "SCRQ initialization failed\n");
 				return -1;
 			}
+
 			rc = init_sub_crq_irqs(adapter);
 			if (rc) {
-				dev_err(dev,
-					"Initialization of SCRQ's irqs failed\n");
+				netdev_warn(netdev,
+					    "SCRQ irq initialization failed\n");
 				return -1;
 			}
-		}
-
-		reinit_completion(&adapter->init_done);
-		rc = send_login(adapter);
-		if (rc) {
-			dev_err(dev, "Unable to attempt device login\n");
-			return rc;
-		} else if (!wait_for_completion_timeout(&adapter->init_done,
-						 timeout)) {
-			dev_err(dev, "Login timeout\n");
+		} else if (adapter->init_done_rc) {
+			netdev_warn(netdev, "Adapter login failed\n");
 			return -1;
 		}
-	} while (adapter->renegotiate);
+	} while (adapter->init_done_rc == PARTIALSUCCESS);
 
 	/* handle pending MAC address changes after successful login */
 	if (adapter->mac_change_pending) {
@@ -3942,7 +3957,7 @@ static int handle_login_rsp(union ibmvnic_crq *login_rsp_crq,
 	 * to resend the login buffer with fewer queues requested.
 	 */
 	if (login_rsp_crq->generic.rc.code) {
-		adapter->renegotiate = true;
+		adapter->init_done_rc = login_rsp_crq->generic.rc.code;
 		complete(&adapter->init_done);
 		return 0;
 	}
diff --git a/drivers/net/ethernet/ibm/ibmvnic.h b/drivers/net/ethernet/ibm/ibmvnic.h
index 99c0b58c2c39..22391e8805f6 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.h
+++ b/drivers/net/ethernet/ibm/ibmvnic.h
@@ -1035,7 +1035,6 @@ struct ibmvnic_adapter {
 
 	struct ibmvnic_sub_crq_queue **tx_scrq;
 	struct ibmvnic_sub_crq_queue **rx_scrq;
-	bool renegotiate;
 
 	/* rx structs */
 	struct napi_struct *napi;

^ permalink raw reply related

* Re: [PATCH v3 0/2] vhost: fix vhost_vq_access_ok() log check
From: David Miller @ 2018-04-11 14:55 UTC (permalink / raw)
  To: mst
  Cc: stefanha, virtualization, torvalds, jasowang, netdev,
	syzkaller-bugs, kvm, linux-kernel
In-Reply-To: <20180411162345-mutt-send-email-mst@kernel.org>

From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Wed, 11 Apr 2018 16:24:02 +0300

> On Wed, Apr 11, 2018 at 10:35:39AM +0800, Stefan Hajnoczi wrote:
>> v3:
>>  * Rebased onto net/master and resolved conflict [DaveM]
>> 
>> v2:
>>  * Rewrote the conditional to make the vq access check clearer [Linus]
>>  * Added Patch 2 to make the return type consistent and harder to misuse [Linus]
>> 
>> The first patch fixes the vhost virtqueue access check which was recently
>> broken.  The second patch replaces the int return type with bool to prevent
>> future bugs.
> 
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> 
> We need the 1st one on stable I think.

Series applied and patch #1 queued up for -stable.

^ permalink raw reply

* [PATCH net 2/2] ibmvnic: Do not notify peers on parameter change resets
From: Nathan Fontenot @ 2018-04-11 14:37 UTC (permalink / raw)
  To: netdev; +Cc: jallen, tlfalcon
In-Reply-To: <152345733402.41721.15717591824692008270.stgit@ltcalpine2-lp14.aus.stglabs.ibm.com>

When attempting to change the driver parameters, such as the MTU
value or number of queues, do not call netdev_notify_peers().
Doing so will deadlock on the rtnl_lock.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
---
 drivers/net/ethernet/ibm/ibmvnic.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index c6d5e9448eef..fdca1ea5bbf9 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1843,7 +1843,8 @@ static int do_reset(struct ibmvnic_adapter *adapter,
 	for (i = 0; i < adapter->req_rx_queues; i++)
 		napi_schedule(&adapter->napi[i]);
 
-	if (adapter->reset_reason != VNIC_RESET_FAILOVER)
+	if (adapter->reset_reason != VNIC_RESET_FAILOVER &&
+	    adapter->reset_reason != VNIC_RESET_CHANGE_PARAM)
 		netdev_notify_peers(netdev);
 
 	netif_carrier_on(netdev);

^ permalink raw reply related

* Re: WARNING in kobject_add_internal
From: Dmitry Vyukov @ 2018-04-11 14:57 UTC (permalink / raw)
  To: syzbot
  Cc: bridge, David Miller, Greg Kroah-Hartman, LKML, netdev,
	stephen hemminger, syzkaller-bugs
In-Reply-To: <001a113f8036fab00405620e4d72@google.com>

On Fri, Jan 5, 2018 at 10:41 PM, syzbot
<syzbot+e204ced820ef739d71ef5438f5e1976a874abc8d@syzkaller.appspotmail.com>
wrote:
> syzkaller has found reproducer for the following crash on
> 89876f275e8d562912d9c238cd888b52065cf25c
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
> C reproducer is attached
> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
> for information about syzkaller reproducers
>
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by:
> syzbot+e204ced820ef739d71ef5438f5e1976a874abc8d@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed.

#syz dup: WARNING: kobject bug in device_add

> ------------[ cut here ]------------
> kobject_add_internal failed for   (error: -12 parent: net)
> WARNING: CPU: 1 PID: 3494 at lib/kobject.c:244
> kobject_add_internal+0x3f6/0xbc0 lib/kobject.c:242
> Kernel panic - not syncing: panic_on_warn set ...
>
> CPU: 1 PID: 3494 Comm: syzkaller425998 Not tainted 4.15.0-rc6+ #249
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>  panic+0x1e4/0x41c kernel/panic.c:183
>  __warn+0x1dc/0x200 kernel/panic.c:547
>  report_bug+0x211/0x2d0 lib/bug.c:184
>  fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178
>  fixup_bug arch/x86/kernel/traps.c:247 [inline]
>  do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
>  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
>  invalid_op+0x22/0x40 arch/x86/entry/entry_64.S:1079
> RIP: 0010:kobject_add_internal+0x3f6/0xbc0 lib/kobject.c:242
> RSP: 0018:ffff8801c53c76f0 EFLAGS: 00010286
> RAX: dffffc0000000008 RBX: ffff8801bf5a88d8 RCX: ffffffff8159da9e
> RDX: 0000000000000000 RSI: 1ffff10038a78e99 RDI: ffff8801c53c73f8
> RBP: ffff8801c53c77e8 R08: 1ffff10038a78e5b R09: 0000000000000000
> R10: ffff8801c53c74b0 R11: 0000000000000000 R12: 1ffff10038a78ee4
> R13: 00000000fffffff4 R14: ffff8801d8359a80 R15: ffffffff86201980
>  kobject_add_varg lib/kobject.c:366 [inline]
>  kobject_add+0x132/0x1f0 lib/kobject.c:411
>  device_add+0x35d/0x1650 drivers/base/core.c:1787
>  netdev_register_kobject+0x183/0x360 net/core/net-sysfs.c:1604
>  register_netdevice+0xb2b/0x1010 net/core/dev.c:7698
>  tun_set_iff drivers/net/tun.c:2319 [inline]
>  __tun_chr_ioctl+0x1d89/0x3dd0 drivers/net/tun.c:2524
>  tun_chr_ioctl+0x2a/0x40 drivers/net/tun.c:2773
>  vfs_ioctl fs/ioctl.c:46 [inline]
>  do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686
>  SYSC_ioctl fs/ioctl.c:701 [inline]
>  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
>  entry_SYSCALL_64_fastpath+0x23/0x9a
> RIP: 0033:0x444fc9
> RSP: 002b:00007fff53389dc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 0000000000444fc9
> RDX: 0000000020533000 RSI: 00000000400454ca RDI: 0000000000000004
> RBP: 0000000000000005 R08: 0000000000000002 R09: 0000006f00003131
> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000402500
> R13: 0000000000402590 R14: 0000000000000000 R15: 0000000000000000
>
> Dumping ftrace buffer:
>    (ftrace buffer empty)
> Kernel Offset: disabled
> Rebooting in 86400 seconds..
>

^ permalink raw reply

* Re: [PATCH] dec: tulip: de4x5: Replace mdelay with usleep_range in de4x5_hw_init
From: kbuild test robot @ 2018-04-11 14:58 UTC (permalink / raw)
  To: Jia-Ju Bai
  Cc: kbuild-all, davem, dhowells, stephen, johannes.berg,
	arvind.yadav.cs, netdev, linux-parisc, linux-kernel, Jia-Ju Bai
In-Reply-To: <1523412054-2108-1-git-send-email-baijiaju1990@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 4811 bytes --]

Hi Jia-Ju,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net-next/master]
[also build test ERROR on v4.16 next-20180411]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Jia-Ju-Bai/dec-tulip-de4x5-Replace-mdelay-with-usleep_range-in-de4x5_hw_init/20180411-222527
config: i386-randconfig-x015-201814 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   drivers/net//ethernet/dec/tulip/de4x5.c: In function 'de4x5_hw_init':
>> drivers/net//ethernet/dec/tulip/de4x5.c:1110:5: error: implicit declaration of function 'usleep'; did you mean 'ssleep'? [-Werror=implicit-function-declaration]
        usleep(10000, 11000);
        ^~~~~~
        ssleep
   cc1: some warnings being treated as errors

vim +1110 drivers/net//ethernet/dec/tulip/de4x5.c

  1091	
  1092	
  1093	static int
  1094	de4x5_hw_init(struct net_device *dev, u_long iobase, struct device *gendev)
  1095	{
  1096	    char name[DE4X5_NAME_LENGTH + 1];
  1097	    struct de4x5_private *lp = netdev_priv(dev);
  1098	    struct pci_dev *pdev = NULL;
  1099	    int i, status=0;
  1100	
  1101	    dev_set_drvdata(gendev, dev);
  1102	
  1103	    /* Ensure we're not sleeping */
  1104	    if (lp->bus == EISA) {
  1105		outb(WAKEUP, PCI_CFPM);
  1106	    } else {
  1107		pdev = to_pci_dev (gendev);
  1108		pci_write_config_byte(pdev, PCI_CFDA_PSM, WAKEUP);
  1109	    }
> 1110	    usleep(10000, 11000);
  1111	
  1112	    RESET_DE4X5;
  1113	
  1114	    if ((inl(DE4X5_STS) & (STS_TS | STS_RS)) != 0) {
  1115		return -ENXIO;                       /* Hardware could not reset */
  1116	    }
  1117	
  1118	    /*
  1119	    ** Now find out what kind of DC21040/DC21041/DC21140 board we have.
  1120	    */
  1121	    lp->useSROM = false;
  1122	    if (lp->bus == PCI) {
  1123		PCI_signature(name, lp);
  1124	    } else {
  1125		EISA_signature(name, gendev);
  1126	    }
  1127	
  1128	    if (*name == '\0') {                     /* Not found a board signature */
  1129		return -ENXIO;
  1130	    }
  1131	
  1132	    dev->base_addr = iobase;
  1133	    printk ("%s: %s at 0x%04lx", dev_name(gendev), name, iobase);
  1134	
  1135	    status = get_hw_addr(dev);
  1136	    printk(", h/w address %pM\n", dev->dev_addr);
  1137	
  1138	    if (status != 0) {
  1139		printk("      which has an Ethernet PROM CRC error.\n");
  1140		return -ENXIO;
  1141	    } else {
  1142		skb_queue_head_init(&lp->cache.queue);
  1143		lp->cache.gepc = GEP_INIT;
  1144		lp->asBit = GEP_SLNK;
  1145		lp->asPolarity = GEP_SLNK;
  1146		lp->asBitValid = ~0;
  1147		lp->timeout = -1;
  1148		lp->gendev = gendev;
  1149		spin_lock_init(&lp->lock);
  1150		timer_setup(&lp->timer, de4x5_ast, 0);
  1151		de4x5_parse_params(dev);
  1152	
  1153		/*
  1154		** Choose correct autosensing in case someone messed up
  1155		*/
  1156	        lp->autosense = lp->params.autosense;
  1157	        if (lp->chipset != DC21140) {
  1158	            if ((lp->chipset==DC21040) && (lp->params.autosense&TP_NW)) {
  1159	                lp->params.autosense = TP;
  1160	            }
  1161	            if ((lp->chipset==DC21041) && (lp->params.autosense&BNC_AUI)) {
  1162	                lp->params.autosense = BNC;
  1163	            }
  1164	        }
  1165		lp->fdx = lp->params.fdx;
  1166		sprintf(lp->adapter_name,"%s (%s)", name, dev_name(gendev));
  1167	
  1168		lp->dma_size = (NUM_RX_DESC + NUM_TX_DESC) * sizeof(struct de4x5_desc);
  1169	#if defined(__alpha__) || defined(__powerpc__) || defined(CONFIG_SPARC) || defined(DE4X5_DO_MEMCPY)
  1170		lp->dma_size += RX_BUFF_SZ * NUM_RX_DESC + DE4X5_ALIGN;
  1171	#endif
  1172		lp->rx_ring = dma_alloc_coherent(gendev, lp->dma_size,
  1173						 &lp->dma_rings, GFP_ATOMIC);
  1174		if (lp->rx_ring == NULL) {
  1175		    return -ENOMEM;
  1176		}
  1177	
  1178		lp->tx_ring = lp->rx_ring + NUM_RX_DESC;
  1179	
  1180		/*
  1181		** Set up the RX descriptor ring (Intels)
  1182		** Allocate contiguous receive buffers, long word aligned (Alphas)
  1183		*/
  1184	#if !defined(__alpha__) && !defined(__powerpc__) && !defined(CONFIG_SPARC) && !defined(DE4X5_DO_MEMCPY)
  1185		for (i=0; i<NUM_RX_DESC; i++) {
  1186		    lp->rx_ring[i].status = 0;
  1187		    lp->rx_ring[i].des1 = cpu_to_le32(RX_BUFF_SZ);
  1188		    lp->rx_ring[i].buf = 0;
  1189		    lp->rx_ring[i].next = 0;
  1190		    lp->rx_skb[i] = (struct sk_buff *) 1;     /* Dummy entry */
  1191		}
  1192	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 31227 bytes --]

^ permalink raw reply

* Re: [PATCH net] sctp: do not check port in sctp_inet6_cmp_addr
From: Marcelo Ricardo Leitner @ 2018-04-11 14:59 UTC (permalink / raw)
  To: Neil Horman; +Cc: Xin Long, network dev, linux-sctp, davem
In-Reply-To: <20180411143607.GA4141@hmswarspite.think-freely.org>

On Wed, Apr 11, 2018 at 10:36:07AM -0400, Neil Horman wrote:
> On Wed, Apr 11, 2018 at 08:58:05PM +0800, Xin Long wrote:
> > pf->cmp_addr() is called before binding a v6 address to the sock. It
> > should not check ports, like in sctp_inet_cmp_addr.
> >
> > But sctp_inet6_cmp_addr checks the addr by invoking af(6)->cmp_addr,
> > sctp_v6_cmp_addr where it also compares the ports.
> >
> > This would cause that setsockopt(SCTP_SOCKOPT_BINDX_ADD) could bind
> > multiple duplicated IPv6 addresses after Commit 40b4f0fd74e4 ("sctp:
> > lack the check for ports in sctp_v6_cmp_addr").
> >
> > This patch is to remove af->cmp_addr called in sctp_inet6_cmp_addr,
> > but do the proper check for both v6 addrs and v4mapped addrs.
> >
> > Fixes: 40b4f0fd74e4 ("sctp: lack the check for ports in sctp_v6_cmp_addr")
> > Reported-by: Jianwen Ji <jiji@redhat.com>
> > Signed-off-by: Xin Long <lucien.xin@gmail.com>
> > ---
> >  net/sctp/ipv6.c | 27 ++++++++++++++++++++++++---
> >  1 file changed, 24 insertions(+), 3 deletions(-)
> >
> > diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
> > index f1fc48e..be4b72c 100644
> > --- a/net/sctp/ipv6.c
> > +++ b/net/sctp/ipv6.c
> > @@ -846,8 +846,8 @@ static int sctp_inet6_cmp_addr(const union sctp_addr *addr1,
> >  			       const union sctp_addr *addr2,
> >  			       struct sctp_sock *opt)
> >  {
> > -	struct sctp_af *af1, *af2;
> >  	struct sock *sk = sctp_opt2sk(opt);
> > +	struct sctp_af *af1, *af2;
> >
> >  	af1 = sctp_get_af_specific(addr1->sa.sa_family);
> >  	af2 = sctp_get_af_specific(addr2->sa.sa_family);
> > @@ -863,10 +863,31 @@ static int sctp_inet6_cmp_addr(const union sctp_addr *addr1,
> >  	if (sctp_is_any(sk, addr1) || sctp_is_any(sk, addr2))
> >  		return 1;
> >
> > -	if (addr1->sa.sa_family != addr2->sa.sa_family)
> > +	if (addr1->sa.sa_family != addr2->sa.sa_family) {
> > +		if (addr1->sa.sa_family == AF_INET &&
> > +		    addr2->sa.sa_family == AF_INET6 &&
> > +		    ipv6_addr_v4mapped(&addr2->v6.sin6_addr))
> > +			if (addr2->v6.sin6_addr.s6_addr32[3] ==
> > +			    addr1->v4.sin_addr.s_addr)
> > +				return 1;
> > +		if (addr2->sa.sa_family == AF_INET &&
> > +		    addr1->sa.sa_family == AF_INET6 &&
> > +		    ipv6_addr_v4mapped(&addr1->v6.sin6_addr))
> > +			if (addr1->v6.sin6_addr.s6_addr32[3] ==
> > +			    addr2->v4.sin_addr.s_addr)
> > +				return 1;
> > +		return 0;
> > +	}
> > +
> > +	if (!ipv6_addr_equal(&addr1->v6.sin6_addr, &addr2->v6.sin6_addr))
> > +		return 0;
> > +
> > +	if ((ipv6_addr_type(&addr1->v6.sin6_addr) & IPV6_ADDR_LINKLOCAL) &&
> > +	    addr1->v6.sin6_scope_id && addr2->v6.sin6_scope_id &&
> > +	    addr1->v6.sin6_scope_id != addr2->v6.sin6_scope_id)
> >  		return 0;
> >
> > -	return af1->cmp_addr(addr1, addr2);
> > +	return 1;
> >  }
> >
> >  /* Verify that the provided sockaddr looks bindable.   Common verification,
> > --
> > 2.1.0
> >
> This looks correct to me, but is it worth duplicating the comparison code like
> this from the cmp_addr function?  It might be more worthwhile to add a flag to
> the cmp_addr method to direct weather it needs to check port values or not.
> That way you could continue to use the cmp_addr function here.

Adding a flag into sctp_v6_cmp_addr will get us a terrible code to
read. It's already not one of the best looking part of it. Maybe
still duplicate part of it it, but at 'af' level? As in:
- af->cmp_addr
- af->cmp_addr_port

  Marcelo

^ permalink raw reply

* KASAN: use-after-free Read in tipc_sub_unsubscribe (2)
From: syzbot @ 2018-04-11 15:02 UTC (permalink / raw)
  To: davem, jon.maloy, linux-kernel, netdev, syzkaller-bugs,
	tipc-discussion, ying.xue

Hello,

syzbot hit the following crash on upstream commit
b284d4d5a6785f8cd07eda2646a95782373cd01e (Tue Apr 10 19:25:30 2018 +0000)
Merge tag 'ceph-for-4.17-rc1' of git://github.com/ceph/ceph-client
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=aa245f26d42b8305d157

So far this crash happened 2 times on upstream.
C reproducer: https://syzkaller.appspot.com/x/repro.c?id=5881855630901248
syzkaller reproducer:  
https://syzkaller.appspot.com/x/repro.syz?id=5979790213382144
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=5808961445953536
Kernel config:  
https://syzkaller.appspot.com/x/.config?id=-1223000601505858474
compiler: gcc (GCC) 8.0.1 20180301 (experimental)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+aa245f26d42b8305d157@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for  
details.
If you forward the report, please keep this part and the footer.

R10: 0000000020b89fe4 R11: 0000000000000246 R12: 0000000000000005
R13: ffffffffffffffff R14: 0000000000000000 R15: 0000000000000000
Service creation failed, no memory
Failed to subscribe for {1906,0,4294967295}
==================================================================
BUG: KASAN: use-after-free in tipc_sub_unsubscribe+0x22d/0x305  
net/tipc/subscr.c:167
Read of size 4 at addr ffff8801b78718d8 by task syzkaller446011/4466

CPU: 1 PID: 4466 Comm: syzkaller446011 Not tainted 4.16.0+ #19
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
  print_address_description+0x6c/0x20b mm/kasan/report.c:256
  kasan_report_error mm/kasan/report.c:354 [inline]
  kasan_report.cold.7+0xac/0x2f5 mm/kasan/report.c:412
  __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:432
  tipc_sub_unsubscribe+0x22d/0x305 net/tipc/subscr.c:167
  tipc_conn_delete_sub+0x32d/0x530 net/tipc/topsrv.c:245
  tipc_topsrv_kern_unsubscr+0x280/0x3f0 net/tipc/topsrv.c:598
  tipc_group_delete+0x2dd/0x3f0 net/tipc/group.c:231
  tipc_sk_leave+0x10e/0x210 net/tipc/socket.c:2800
  tipc_release+0x146/0x1290 net/tipc/socket.c:576
  sock_release+0x96/0x1b0 net/socket.c:594
  sock_close+0x16/0x20 net/socket.c:1149
  __fput+0x34d/0x890 fs/file_table.c:209
  ____fput+0x15/0x20 fs/file_table.c:243
  task_work_run+0x1e4/0x290 kernel/task_work.c:113
  exit_task_work include/linux/task_work.h:22 [inline]
  do_exit+0x1aee/0x2730 kernel/exit.c:865
  do_group_exit+0x16f/0x430 kernel/exit.c:968
  SYSC_exit_group kernel/exit.c:979 [inline]
  SyS_exit_group+0x1d/0x20 kernel/exit.c:977
  do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x43f1f8
RSP: 002b:00007fff7867bac8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000043f1f8
RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
RBP: 00000000004bf2e8 R08: 00000000000000e7 R09: ffffffffffffffd0
R10: 0000000020b89fe4 R11: 0000000000000246 R12: 0000000000000001
R13: 00000000006d1180 R14: 0000000000000000 R15: 0000000000000000

Allocated by task 4466:
  save_stack+0x43/0xd0 mm/kasan/kasan.c:448
  set_track mm/kasan/kasan.c:460 [inline]
  kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553
  kmem_cache_alloc_trace+0x152/0x780 mm/slab.c:3620
  kmalloc include/linux/slab.h:512 [inline]
  tipc_sub_subscribe+0x25a/0x6b0 net/tipc/subscr.c:143
  tipc_conn_rcv_sub.isra.5+0x42c/0x7e0 net/tipc/topsrv.c:381
  tipc_topsrv_kern_subscr+0x72b/0xad0 net/tipc/topsrv.c:582
  tipc_group_create+0x72e/0xa50 net/tipc/group.c:194
  tipc_sk_join net/tipc/socket.c:2766 [inline]
  tipc_setsockopt+0x2c9/0xd70 net/tipc/socket.c:2881
  __sys_setsockopt+0x1bd/0x390 net/socket.c:1903
  SYSC_setsockopt net/socket.c:1914 [inline]
  SyS_setsockopt+0x34/0x50 net/socket.c:1911
  do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x42/0xb7

Freed by task 4466:
  save_stack+0x43/0xd0 mm/kasan/kasan.c:448
  set_track mm/kasan/kasan.c:460 [inline]
  __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521
  kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
  __cache_free mm/slab.c:3498 [inline]
  kfree+0xd9/0x260 mm/slab.c:3813
  tipc_sub_kref_release net/tipc/subscr.c:117 [inline]
  kref_put include/linux/kref.h:70 [inline]
  tipc_sub_put+0x33/0x40 net/tipc/subscr.c:122
  tipc_nametbl_unsubscribe+0x52c/0xaf0 net/tipc/name_table.c:709
  tipc_sub_unsubscribe+0x6d/0x305 net/tipc/subscr.c:166
  tipc_conn_delete_sub+0x32d/0x530 net/tipc/topsrv.c:245
  tipc_topsrv_kern_unsubscr+0x280/0x3f0 net/tipc/topsrv.c:598
  tipc_group_delete+0x2dd/0x3f0 net/tipc/group.c:231
  tipc_sk_leave+0x10e/0x210 net/tipc/socket.c:2800
  tipc_release+0x146/0x1290 net/tipc/socket.c:576
  sock_release+0x96/0x1b0 net/socket.c:594
  sock_close+0x16/0x20 net/socket.c:1149
  __fput+0x34d/0x890 fs/file_table.c:209
  ____fput+0x15/0x20 fs/file_table.c:243
  task_work_run+0x1e4/0x290 kernel/task_work.c:113
  exit_task_work include/linux/task_work.h:22 [inline]
  do_exit+0x1aee/0x2730 kernel/exit.c:865
  do_group_exit+0x16f/0x430 kernel/exit.c:968
  SYSC_exit_group kernel/exit.c:979 [inline]
  SyS_exit_group+0x1d/0x20 kernel/exit.c:977
  do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x42/0xb7

The buggy address belongs to the object at ffff8801b7871840
  which belongs to the cache kmalloc-256 of size 256
The buggy address is located 152 bytes inside of
  256-byte region [ffff8801b7871840, ffff8801b7871940)
The buggy address belongs to the page:
page:ffffea0006de1c40 count:1 mapcount:0 mapping:ffff8801b78710c0 index:0x0
flags: 0x2fffc0000000100(slab)
raw: 02fffc0000000100 ffff8801b78710c0 0000000000000000 000000010000000c
raw: ffffea00071ce1a0 ffffea0006db8fa0 ffff8801dac007c0 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
  ffff8801b7871780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff8801b7871800: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
> ffff8801b7871880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                     ^
  ffff8801b7871900: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
  ffff8801b7871980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================


---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to syzkaller@googlegroups.com.

syzbot will keep track of this bug report.
If you forgot to add the Reported-by tag, once the fix for this bug is  
merged
into any tree, please reply to this email with:
#syz fix: exact-commit-title
If you want to test a patch for this bug, please reply with:
#syz test: git://repo/address.git branch
and provide the patch inline or as an attachment.
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug  
report.
Note: all commands must start from beginning of the line in the email body.

^ permalink raw reply

* [PATCH v2 net] lan78xx: PHY DSP registers initialization to address EEE link drop issues with long cables
From: Raghuram Chary J @ 2018-04-11 15:06 UTC (permalink / raw)
  To: davem; +Cc: netdev, unglinuxdriver, woojung.huh, raghuramchary.jallipalli

The patch is to configure DSP registers of PHY device
to handle Gbe-EEE failures with >40m cable length.

Fixes: 55d7de9de6c3 ("Microchip's LAN7800 family USB 2/3 to 10/100/1000 Ethernet device driver")
Signed-off-by: Raghuram Chary J <raghuramchary.jallipalli@microchip.com>
---
v0->v1:
   * Use phy_save_page to save current page before switching to TR page.
   * Use phy_restore_page to restore saved page.
   * Add read_page and write_page callbacks.
   * __phy_read, __phy_write to read,write phy registers.
   * Handle error conditions.
v1->v2:
   * Sending the patch to net separately from enhancements.
---
 drivers/net/phy/microchip.c  | 178 ++++++++++++++++++++++++++++++++++++++++++-
 include/linux/microchipphy.h |   8 ++
 2 files changed, 185 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/microchip.c b/drivers/net/phy/microchip.c
index 0f293ef28935..4a8e91922eaa 100644
--- a/drivers/net/phy/microchip.c
+++ b/drivers/net/phy/microchip.c
@@ -20,6 +20,7 @@
 #include <linux/ethtool.h>
 #include <linux/phy.h>
 #include <linux/microchipphy.h>
+#include <linux/delay.h>
 
 #define DRIVER_AUTHOR	"WOOJUNG HUH <woojung.huh@microchip.com>"
 #define DRIVER_DESC	"Microchip LAN88XX PHY driver"
@@ -30,6 +31,16 @@ struct lan88xx_priv {
 	__u32	wolopts;
 };
 
+static int lan88xx_read_page(struct phy_device *phydev)
+{
+	return __phy_read(phydev, LAN88XX_EXT_PAGE_ACCESS);
+}
+
+static int lan88xx_write_page(struct phy_device *phydev, int page)
+{
+	return __phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS, page);
+}
+
 static int lan88xx_phy_config_intr(struct phy_device *phydev)
 {
 	int rc;
@@ -66,6 +77,150 @@ static int lan88xx_suspend(struct phy_device *phydev)
 	return 0;
 }
 
+static int lan88xx_TR_reg_set(struct phy_device *phydev, u16 regaddr,
+			      u32 data)
+{
+	int val, save_page, ret = 0;
+	u16 buf;
+
+	/* Save current page */
+	save_page = phy_save_page(phydev);
+	if (save_page < 0) {
+		pr_warn("Failed to get current page\n");
+		goto err;
+	}
+
+	/* Switch to TR page */
+	lan88xx_write_page(phydev, LAN88XX_EXT_PAGE_ACCESS_TR);
+
+	ret = __phy_write(phydev, LAN88XX_EXT_PAGE_TR_LOW_DATA,
+			  (data & 0xFFFF));
+	if (ret < 0) {
+		pr_warn("Failed to write TR low data\n");
+		goto err;
+	}
+
+	ret = __phy_write(phydev, LAN88XX_EXT_PAGE_TR_HIGH_DATA,
+			  (data & 0x00FF0000) >> 16);
+	if (ret < 0) {
+		pr_warn("Failed to write TR high data\n");
+		goto err;
+	}
+
+	/* Config control bits [15:13] of register */
+	buf = (regaddr & ~(0x3 << 13));/* Clr [14:13] to write data in reg */
+	buf |= 0x8000; /* Set [15] to Packet transmit */
+
+	ret = __phy_write(phydev, LAN88XX_EXT_PAGE_TR_CR, buf);
+	if (ret < 0) {
+		pr_warn("Failed to write data in reg\n");
+		goto err;
+	}
+
+	usleep_range(1000, 2000);/* Wait for Data to be written */
+	val = __phy_read(phydev, LAN88XX_EXT_PAGE_TR_CR);
+	if (!(val & 0x8000))
+		pr_warn("TR Register[0x%X] configuration failed\n", regaddr);
+err:
+	return phy_restore_page(phydev, save_page, ret);
+}
+
+static void lan88xx_config_TR_regs(struct phy_device *phydev)
+{
+	int err;
+
+	/* Get access to Channel 0x1, Node 0xF , Register 0x01.
+	 * Write 24-bit value 0x12B00A to register. Setting MrvlTrFix1000Kf,
+	 * MrvlTrFix1000Kp, MasterEnableTR bits.
+	 */
+	err = lan88xx_TR_reg_set(phydev, 0x0F82, 0x12B00A);
+	if (err < 0)
+		pr_warn("Failed to Set Register[0x0F82]\n");
+
+	/* Get access to Channel b'10, Node b'1101, Register 0x06.
+	 * Write 24-bit value 0xD2C46F to register. Setting SSTrKf1000Slv,
+	 * SSTrKp1000Mas bits.
+	 */
+	err = lan88xx_TR_reg_set(phydev, 0x168C, 0xD2C46F);
+	if (err < 0)
+		pr_warn("Failed to Set Register[0x168C]\n");
+
+	/* Get access to Channel b'10, Node b'1111, Register 0x11.
+	 * Write 24-bit value 0x620 to register. Setting rem_upd_done_thresh
+	 * bits
+	 */
+	err = lan88xx_TR_reg_set(phydev, 0x17A2, 0x620);
+	if (err < 0)
+		pr_warn("Failed to Set Register[0x17A2]\n");
+
+	/* Get access to Channel b'10, Node b'1101, Register 0x10.
+	 * Write 24-bit value 0xEEFFDD to register. Setting
+	 * eee_TrKp1Long_1000, eee_TrKp2Long_1000, eee_TrKp3Long_1000,
+	 * eee_TrKp1Short_1000,eee_TrKp2Short_1000, eee_TrKp3Short_1000 bits.
+	 */
+	err = lan88xx_TR_reg_set(phydev, 0x16A0, 0xEEFFDD);
+	if (err < 0)
+		pr_warn("Failed to Set Register[0x16A0]\n");
+
+	/* Get access to Channel b'10, Node b'1101, Register 0x13.
+	 * Write 24-bit value 0x071448 to register. Setting
+	 * slv_lpi_tr_tmr_val1, slv_lpi_tr_tmr_val2 bits.
+	 */
+	err = lan88xx_TR_reg_set(phydev, 0x16A6, 0x071448);
+	if (err < 0)
+		pr_warn("Failed to Set Register[0x16A6]\n");
+
+	/* Get access to Channel b'10, Node b'1101, Register 0x12.
+	 * Write 24-bit value 0x13132F to register. Setting
+	 * slv_sigdet_timer_val1, slv_sigdet_timer_val2 bits.
+	 */
+	err = lan88xx_TR_reg_set(phydev, 0x16A4, 0x13132F);
+	if (err < 0)
+		pr_warn("Failed to Set Register[0x16A4]\n");
+
+	/* Get access to Channel b'10, Node b'1101, Register 0x14.
+	 * Write 24-bit value 0x0 to register. Setting eee_3level_delay,
+	 * eee_TrKf_freeze_delay bits.
+	 */
+	err = lan88xx_TR_reg_set(phydev, 0x16A8, 0x0);
+	if (err < 0)
+		pr_warn("Failed to Set Register[0x16A8]\n");
+
+	/* Get access to Channel b'01, Node b'1111, Register 0x34.
+	 * Write 24-bit value 0x91B06C to register. Setting
+	 * FastMseSearchThreshLong1000, FastMseSearchThreshShort1000,
+	 * FastMseSearchUpdGain1000 bits.
+	 */
+	err = lan88xx_TR_reg_set(phydev, 0x0FE8, 0x91B06C);
+	if (err < 0)
+		pr_warn("Failed to Set Register[0x0FE8]\n");
+
+	/* Get access to Channel b'01, Node b'1111, Register 0x3E.
+	 * Write 24-bit value 0xC0A028 to register. Setting
+	 * FastMseKp2ThreshLong1000, FastMseKp2ThreshShort1000,
+	 * FastMseKp2UpdGain1000, FastMseKp2ExitEn1000 bits.
+	 */
+	err = lan88xx_TR_reg_set(phydev, 0x0FFC, 0xC0A028);
+	if (err < 0)
+		pr_warn("Failed to Set Register[0x0FFC]\n");
+
+	/* Get access to Channel b'01, Node b'1111, Register 0x35.
+	 * Write 24-bit value 0x041600 to register. Setting
+	 * FastMseSearchPhShNum1000, FastMseSearchClksPerPh1000,
+	 * FastMsePhChangeDelay1000 bits.
+	 */
+	err = lan88xx_TR_reg_set(phydev, 0x0FEA, 0x041600);
+	if (err < 0)
+		pr_warn("Failed to Set Register[0x0FEA]\n");
+
+	/* Get access to Channel b'10, Node b'1101, Register 0x03.
+	 * Write 24-bit value 0x000004 to register. Setting TrFreeze bits.
+	 */
+	err = lan88xx_TR_reg_set(phydev, 0x1686, 0x000004);
+	if (err < 0)
+		pr_warn("Failed to Set Register[0x1686]\n");
+}
+
 static int lan88xx_probe(struct phy_device *phydev)
 {
 	struct device *dev = &phydev->mdio.dev;
@@ -132,6 +287,25 @@ static void lan88xx_set_mdix(struct phy_device *phydev)
 	phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS, LAN88XX_EXT_PAGE_SPACE_0);
 }
 
+static int lan88xx_config_init(struct phy_device *phydev)
+{
+	int val;
+
+	genphy_config_init(phydev);
+	/*Zerodetect delay enable */
+	val = phy_read_mmd(phydev, MDIO_MMD_PCS,
+			   PHY_ARDENNES_MMD_DEV_3_PHY_CFG);
+	val |= PHY_ARDENNES_MMD_DEV_3_PHY_CFG_ZD_DLY_EN_;
+
+	phy_write_mmd(phydev, MDIO_MMD_PCS, PHY_ARDENNES_MMD_DEV_3_PHY_CFG,
+		      val);
+
+	/* Config DSP registers */
+	lan88xx_config_TR_regs(phydev);
+
+	return 0;
+}
+
 static int lan88xx_config_aneg(struct phy_device *phydev)
 {
 	lan88xx_set_mdix(phydev);
@@ -151,7 +325,7 @@ static struct phy_driver microchip_phy_driver[] = {
 	.probe		= lan88xx_probe,
 	.remove		= lan88xx_remove,
 
-	.config_init	= genphy_config_init,
+	.config_init	= lan88xx_config_init,
 	.config_aneg	= lan88xx_config_aneg,
 
 	.ack_interrupt	= lan88xx_phy_ack_interrupt,
@@ -160,6 +334,8 @@ static struct phy_driver microchip_phy_driver[] = {
 	.suspend	= lan88xx_suspend,
 	.resume		= genphy_resume,
 	.set_wol	= lan88xx_set_wol,
+	.read_page	= lan88xx_read_page,
+	.write_page	= lan88xx_write_page,
 } };
 
 module_phy_driver(microchip_phy_driver);
diff --git a/include/linux/microchipphy.h b/include/linux/microchipphy.h
index eb492d47f717..8f9c90379732 100644
--- a/include/linux/microchipphy.h
+++ b/include/linux/microchipphy.h
@@ -70,4 +70,12 @@
 #define	LAN88XX_MMD3_CHIP_ID			(32877)
 #define	LAN88XX_MMD3_CHIP_REV			(32878)
 
+/* DSP registers */
+#define PHY_ARDENNES_MMD_DEV_3_PHY_CFG		(0x806A)
+#define PHY_ARDENNES_MMD_DEV_3_PHY_CFG_ZD_DLY_EN_	(0x2000)
+#define LAN88XX_EXT_PAGE_ACCESS_TR		(0x52B5)
+#define LAN88XX_EXT_PAGE_TR_CR			16
+#define LAN88XX_EXT_PAGE_TR_LOW_DATA		17
+#define LAN88XX_EXT_PAGE_TR_HIGH_DATA		18
+
 #endif /* _MICROCHIPPHY_H */
-- 
2.16.2

^ permalink raw reply related

* Re: [PATCH net 1/2] ibmvnic: Handle all login error conditions
From: David Miller @ 2018-04-11 15:07 UTC (permalink / raw)
  To: nfont; +Cc: netdev, jallen, tlfalcon
In-Reply-To: <152345744074.41721.14779103174755605325.stgit@ltcalpine2-lp14.aus.stglabs.ibm.com>

From: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Date: Wed, 11 Apr 2018 09:37:21 -0500

> There is a bug in handling the possible return codes from sending the
> login CRQ. The current code treats any non-success return value,
> minus failure to send the crq and a timeout waiting for a login response,
> as a need to re-send the login CRQ. This can put the drive in an
                                                       ^^^^^

"driver"

> inifinite loop of trying to login when getting return values other
  ^^^^^^^^^

"infinite"

> that a partial success such as a return code of aborted. For these
> scenarios the login will not ever succeed at this point and the
> driver would need to be reset again.
> 
> To resolve this loop trying to login is updated to only retry the
> login if the driver gets a return code of a partial success. Other
> return coes are treated as an error and the driver returns an error
         ^^^^

"codes"

> from ibmvnic_login().
> 
> To avoid infinite looping in the partial success return cases, the
> number of retries is capped at the maximum number of supported
> queues. This value was chosen because the driver does a renegatiation
                                                          ^^^^^^^^^^^^^

"renegotiation"

> of capabilities which sets the number of queus possible and allows
                                           ^^^^^

"queues"

^ permalink raw reply

* [PATCH] net: stmmac: fix missing support for 802.1AD tag on reception
From: Elad Nachman @ 2018-04-11 15:07 UTC (permalink / raw)
  To: netdev@vger.kernel.org, David Miller

Stmmac reception handler calls stmmac_rx_vlan() to strip the vlan before calling napi_gro_receive().

The function assumes VLAN tagged frames are always tagged with 802.1Q protocol,
and assigns ETH_P_8021Q to the skb by hard-coding the parameter on call to __vlan_hwaccel_put_tag() .

This causes packets not to be passed to the VLAN slave if it was created with 802.1AD protocol
(ip link add link eth0 eth0.100 type vlan proto 802.1ad id 100).

This fix passes the protocol from the VLAN header into __vlan_hwaccel_put_tag()
instead of using the hard-coded value of ETH_P_8021Q.

Signed-off-by: Elad Nachman <eladn@gilat.com>

---

--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c2018-04-11 17:04:00.586057300 +0300
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c2018-04-11 17:05:33.601992400 +0300
@@ -3293,17 +3293,19 @@ dma_map_err:

 static void stmmac_rx_vlan(struct net_device *dev, struct sk_buff *skb)
 {
-struct ethhdr *ehdr;
+struct vlan_ethhdr *veth;
 u16 vlanid;
+__be16 vlan_proto;

 if ((dev->features & NETIF_F_HW_VLAN_CTAG_RX) ==
     NETIF_F_HW_VLAN_CTAG_RX &&
     !__vlan_get_tag(skb, &vlanid)) {
 /* pop the vlan tag */
-ehdr = (struct ethhdr *)skb->data;
-memmove(skb->data + VLAN_HLEN, ehdr, ETH_ALEN * 2);
+veth = (struct vlan_ethhdr *)skb->data;
+vlan_proto = veth->h_vlan_proto;
+memmove(skb->data + VLAN_HLEN, veth, ETH_ALEN * 2);
 skb_pull(skb, VLAN_HLEN);
-__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), vlanid);
+__vlan_hwaccel_put_tag(skb, vlan_proto, vlanid);
 }
 }
IMPORTANT - This email and any attachments is intended for the above named addressee(s), and may contain information which is confidential or privileged. If you are not the intended recipient, please inform the sender immediately and delete this email: you should not copy or use this e-mail for any purpose nor disclose its contents to any person.

^ permalink raw reply

* WARNING: kobject bug in br_add_if
From: syzbot @ 2018-04-11 15:15 UTC (permalink / raw)
  To: bridge, davem, linux-kernel, netdev, stephen, syzkaller-bugs

Hello,

syzbot hit the following crash on upstream commit
10b84daddbec72c6b440216a69de9a9605127f7a (Sat Mar 31 17:59:00 2018 +0000)
Merge branch 'perf-urgent-for-linus' of  
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=de73361ee4971b6e6f75

So far this crash happened 4 times on net-next, upstream.
Unfortunately, I don't have any reproducer for this crash yet.
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=5007286875455488
Kernel config:  
https://syzkaller.appspot.com/x/.config?id=-2760467897697295172
compiler: gcc (GCC) 7.1.1 20170620

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+de73361ee4971b6e6f75@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for  
details.
If you forward the report, please keep this part and the footer.

R13: 0000000000000369 R14: 00000000006f7278 R15: 0000000000000006
------------[ cut here ]------------
binder: 23650:23651 unknown command 1078223622
kobject_add_internal failed for brport (error: -12 parent: bond0)
binder: 23650:23651 ioctl c0306201 2000dfd0 returned -22
WARNING: CPU: 1 PID: 23647 at lib/kobject.c:242  
kobject_add_internal+0x3f6/0xbc0 lib/kobject.c:240
Kernel panic - not syncing: panic_on_warn set ...

CPU: 1 PID: 23647 Comm: syz-executor7 Not tainted 4.16.0-rc7+ #374
binder: BINDER_SET_CONTEXT_MGR already set
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:17 [inline]
  dump_stack+0x194/0x24d lib/dump_stack.c:53
  panic+0x1e4/0x41c kernel/panic.c:183
  __warn+0x1dc/0x200 kernel/panic.c:547
  report_bug+0x1f4/0x2b0 lib/bug.c:186
  fixup_bug.part.10+0x37/0x80 arch/x86/kernel/traps.c:178
  fixup_bug arch/x86/kernel/traps.c:247 [inline]
  do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
  invalid_op+0x1b/0x40 arch/x86/entry/entry_64.S:986
RIP: 0010:kobject_add_internal+0x3f6/0xbc0 lib/kobject.c:240
RSP: 0018:ffff8801d089f560 EFLAGS: 00010286
RAX: dffffc0000000008 RBX: ffff8801adbee178 RCX: ffffffff815b193e
RDX: 0000000000040000 RSI: ffffc900022aa000 RDI: 1ffff1003a113e31
RBP: ffff8801d089f658 R08: 1ffff1003a113df3 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 1ffff1003a113eb2
R13: 00000000fffffff4 R14: ffff8801abd88828 R15: ffff8801d75a1e00
  kobject_add_varg lib/kobject.c:364 [inline]
  kobject_init_and_add+0xf9/0x150 lib/kobject.c:436
  br_add_if+0x79a/0x1a70 net/bridge/br_if.c:533
  add_del_if+0xf4/0x140 net/bridge/br_ioctl.c:101
  br_dev_ioctl+0xa2/0xc0 net/bridge/br_ioctl.c:396
  dev_ifsioc+0x333/0x9b0 net/core/dev_ioctl.c:334
  dev_ioctl+0x176/0xbe0 net/core/dev_ioctl.c:500
  sock_do_ioctl+0x1ba/0x390 net/socket.c:981
  sock_ioctl+0x367/0x670 net/socket.c:1081
  vfs_ioctl fs/ioctl.c:46 [inline]
  do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686
  SYSC_ioctl fs/ioctl.c:701 [inline]
  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
  do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x454e79
RSP: 002b:00007eff7dab7c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007eff7dab86d4 RCX: 0000000000454e79
RDX: 0000000020000000 RSI: 00000000000089a2 RDI: 0000000000000014
RBP: 000000000072bea0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000015
R13: 0000000000000369 R14: 00000000006f7278 R15: 0000000000000006
Dumping ftrace buffer:
    (ftrace buffer empty)
Kernel Offset: disabled
Rebooting in 86400 seconds..


---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to syzkaller@googlegroups.com.

syzbot will keep track of this bug report.
If you forgot to add the Reported-by tag, once the fix for this bug is  
merged
into any tree, please reply to this email with:
#syz fix: exact-commit-title
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug  
report.
Note: all commands must start from beginning of the line in the email body.

^ permalink raw reply

* [PATCH] selftests: net: add in_netns.sh to TEST_PROGS
From: Anders Roxell @ 2018-04-11 15:17 UTC (permalink / raw)
  To: davem, shuah; +Cc: netdev, linux-kselftest, linux-kernel, Anders Roxell

Script in_netns.sh isn't installed.
--------------------
running psock_fanout test
--------------------
./run_afpackettests: line 12: ./in_netns.sh: No such file or directory
[FAIL]
--------------------
running psock_tpacket test
--------------------
./run_afpackettests: line 22: ./in_netns.sh: No such file or directory
[FAIL]

In current code added in_netns.sh to be installed.

Fixes: cc30c93fa020 ("selftests/net: ignore background traffic in psock_fanout")
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
---
 tools/testing/selftests/net/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 785fc18a16b4..8f1e13d2e547 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -5,7 +5,7 @@ CFLAGS =  -Wall -Wl,--no-as-needed -O2 -g
 CFLAGS += -I../../../../usr/include/
 
 TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh rtnetlink.sh
-TEST_PROGS += fib_tests.sh fib-onlink-tests.sh pmtu.sh
+TEST_PROGS += fib_tests.sh fib-onlink-tests.sh in_netns.sh pmtu.sh
 TEST_GEN_FILES =  socket
 TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy
 TEST_GEN_PROGS = reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa
-- 
2.16.3

^ permalink raw reply related

* Re: WARNING: kobject bug in br_add_if
From: Dmitry Vyukov @ 2018-04-11 15:18 UTC (permalink / raw)
  To: syzbot
  Cc: bridge, David Miller, LKML, netdev, stephen hemminger,
	syzkaller-bugs, Greg Kroah-Hartman
In-Reply-To: <001a113de2d878ade40569941a21@google.com>

On Wed, Apr 11, 2018 at 5:15 PM, syzbot
<syzbot+de73361ee4971b6e6f75@syzkaller.appspotmail.com> wrote:
> Hello,
>
> syzbot hit the following crash on upstream commit
> 10b84daddbec72c6b440216a69de9a9605127f7a (Sat Mar 31 17:59:00 2018 +0000)
> Merge branch 'perf-urgent-for-linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> syzbot dashboard link:
> https://syzkaller.appspot.com/bug?extid=de73361ee4971b6e6f75
>
> So far this crash happened 4 times on net-next, upstream.
> Unfortunately, I don't have any reproducer for this crash yet.
> Raw console output:
> https://syzkaller.appspot.com/x/log.txt?id=5007286875455488
> Kernel config:
> https://syzkaller.appspot.com/x/.config?id=-2760467897697295172
> compiler: gcc (GCC) 7.1.1 20170620
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+de73361ee4971b6e6f75@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.

+Greg

The plan is to remove this WARNING from kobject_add, if there are no objections.

> ------------[ cut here ]------------
> binder: 23650:23651 unknown command 1078223622
> kobject_add_internal failed for brport (error: -12 parent: bond0)
> binder: 23650:23651 ioctl c0306201 2000dfd0 returned -22
> WARNING: CPU: 1 PID: 23647 at lib/kobject.c:242
> kobject_add_internal+0x3f6/0xbc0 lib/kobject.c:240
> Kernel panic - not syncing: panic_on_warn set ...
>
> CPU: 1 PID: 23647 Comm: syz-executor7 Not tainted 4.16.0-rc7+ #374
> binder: BINDER_SET_CONTEXT_MGR already set
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x24d lib/dump_stack.c:53
>  panic+0x1e4/0x41c kernel/panic.c:183
>  __warn+0x1dc/0x200 kernel/panic.c:547
>  report_bug+0x1f4/0x2b0 lib/bug.c:186
>  fixup_bug.part.10+0x37/0x80 arch/x86/kernel/traps.c:178
>  fixup_bug arch/x86/kernel/traps.c:247 [inline]
>  do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
>  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
>  invalid_op+0x1b/0x40 arch/x86/entry/entry_64.S:986
> RIP: 0010:kobject_add_internal+0x3f6/0xbc0 lib/kobject.c:240
> RSP: 0018:ffff8801d089f560 EFLAGS: 00010286
> RAX: dffffc0000000008 RBX: ffff8801adbee178 RCX: ffffffff815b193e
> RDX: 0000000000040000 RSI: ffffc900022aa000 RDI: 1ffff1003a113e31
> RBP: ffff8801d089f658 R08: 1ffff1003a113df3 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 1ffff1003a113eb2
> R13: 00000000fffffff4 R14: ffff8801abd88828 R15: ffff8801d75a1e00
>  kobject_add_varg lib/kobject.c:364 [inline]
>  kobject_init_and_add+0xf9/0x150 lib/kobject.c:436
>  br_add_if+0x79a/0x1a70 net/bridge/br_if.c:533
>  add_del_if+0xf4/0x140 net/bridge/br_ioctl.c:101
>  br_dev_ioctl+0xa2/0xc0 net/bridge/br_ioctl.c:396
>  dev_ifsioc+0x333/0x9b0 net/core/dev_ioctl.c:334
>  dev_ioctl+0x176/0xbe0 net/core/dev_ioctl.c:500
>  sock_do_ioctl+0x1ba/0x390 net/socket.c:981
>  sock_ioctl+0x367/0x670 net/socket.c:1081
>  vfs_ioctl fs/ioctl.c:46 [inline]
>  do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686
>  SYSC_ioctl fs/ioctl.c:701 [inline]
>  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
>  do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
>  entry_SYSCALL_64_after_hwframe+0x42/0xb7
> RIP: 0033:0x454e79
> RSP: 002b:00007eff7dab7c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> RAX: ffffffffffffffda RBX: 00007eff7dab86d4 RCX: 0000000000454e79
> RDX: 0000000020000000 RSI: 00000000000089a2 RDI: 0000000000000014
> RBP: 000000000072bea0 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000015
> R13: 0000000000000369 R14: 00000000006f7278 R15: 0000000000000006
> Dumping ftrace buffer:
>    (ftrace buffer empty)
> Kernel Offset: disabled
> Rebooting in 86400 seconds..
>
>
> ---
> This bug is generated by a dumb bot. It may contain errors.
> See https://goo.gl/tpsmEJ for details.
> Direct all questions to syzkaller@googlegroups.com.
>
> syzbot will keep track of this bug report.
> If you forgot to add the Reported-by tag, once the fix for this bug is
> merged
> into any tree, please reply to this email with:
> #syz fix: exact-commit-title
> To mark this as a duplicate of another syzbot report, please reply with:
> #syz dup: exact-subject-of-another-report
> If it's a one-off invalid bug report, please reply with:
> #syz invalid
> Note: if the crash happens again, it will cause creation of a new bug
> report.
> Note: all commands must start from beginning of the line in the email body.
>
> --
> You received this message because you are subscribed to the Google Groups
> "syzkaller-bugs" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to syzkaller-bugs+unsubscribe@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/syzkaller-bugs/001a113de2d878ade40569941a21%40google.com.
> For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply

* Re: [PATCH net 1/2] ibmvnic: Handle all login error conditions
From: Nathan Fontenot @ 2018-04-11 15:24 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, jallen, tlfalcon
In-Reply-To: <20180411.110731.507616117263900808.davem@davemloft.net>

On 04/11/2018 10:07 AM, David Miller wrote:
> From: Nathan Fontenot <nfont@linux.vnet.ibm.com>
> Date: Wed, 11 Apr 2018 09:37:21 -0500
> 
>> There is a bug in handling the possible return codes from sending the
>> login CRQ. The current code treats any non-success return value,
>> minus failure to send the crq and a timeout waiting for a login response,
>> as a need to re-send the login CRQ. This can put the drive in an
>                                                        ^^^^^
> 
> "driver"
> 
>> inifinite loop of trying to login when getting return values other
>   ^^^^^^^^^
> 
> "infinite"
> 
>> that a partial success such as a return code of aborted. For these
>> scenarios the login will not ever succeed at this point and the
>> driver would need to be reset again.
>>
>> To resolve this loop trying to login is updated to only retry the
>> login if the driver gets a return code of a partial success. Other
>> return coes are treated as an error and the driver returns an error
>          ^^^^
> 
> "codes"
> 
>> from ibmvnic_login().
>>
>> To avoid infinite looping in the partial success return cases, the
>> number of retries is capped at the maximum number of supported
>> queues. This value was chosen because the driver does a renegatiation
>                                                           ^^^^^^^^^^^^^
> 
> "renegotiation"
> 
>> of capabilities which sets the number of queus possible and allows
>                                            ^^^^^
> 
> "queues"
> 

Oh man, complete spelling failure. resending.

-Nathan

^ permalink raw reply

* [PATCH v2 net 0/2] ibmvnic: Fix parameter change request handling
From: Nathan Fontenot @ 2018-04-11 15:09 UTC (permalink / raw)
  To: netdev; +Cc: jallen, tlfalcon

When updating parameters for the ibmvnic driver there is a possibility
of entering an infinite loop if a return value other that a partial
success is received from sending the login CRQ.

Also, a deadlock can occur on the rtnl lock if netdev_notify_peers()
is called during driver reset for a parameter change reset.

This patch set corrects both of these issues by updating the return
code handling in ibmvnic_login() nand gaurding against calling
netdev_notify_peers() for parameter change requests.

-Nathan

Updates for V2: Correct spelling mistakes in commit messages.
---

Nathan Fontenot (2):
      ibmvnic: Handle all login error conditions
      ibmvnic: Do not notify peers on parameter change resets

 drivers/net/ethernet/ibm/ibmvnic.c |   58 +++++++++++++++++++++++-------------
 drivers/net/ethernet/ibm/ibmvnic.h |    1 -
 2 files changed, 37 insertions(+), 22 deletions(-)

^ permalink raw reply

* [PATCH v2 net 1/2] ibmvnic: Handle all login error conditions
From: Nathan Fontenot @ 2018-04-11 15:09 UTC (permalink / raw)
  To: netdev; +Cc: jallen, tlfalcon
In-Reply-To: <152345930946.41899.8546547036273488202.stgit@ltcalpine2-lp14.aus.stglabs.ibm.com>

There is a bug in handling the possible return codes from sending the
login CRQ. The current code treats any non-success return value,
minus failure to send the crq and a timeout waiting for a login response,
as a need to re-send the login CRQ. This can put the drive in an
infinite loop of trying to login when getting return values other
that a partial success such as a return code of aborted. For these
scenarios the login will not ever succeed at this point and the
driver would need to be reset again.

To resolve this loop trying to login is updated to only retry the
login if the driver gets a return code of a partial success. Other
return codes are treated as an error and the driver returns an error
from ibmvnic_login().

To avoid infinite looping in the partial success return cases, the
number of retries is capped at the maximum number of supported
queues. This value was chosen because the driver does a renegotiation
of capabilities which sets the number of queues possible and allows
the driver to attempt a login for possible value for the number
of queues supported.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
---
 drivers/net/ethernet/ibm/ibmvnic.c |   55 +++++++++++++++++++++++-------------
 drivers/net/ethernet/ibm/ibmvnic.h |    1 -
 2 files changed, 35 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index aad5658d79d5..c6d5e9448eef 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -794,46 +794,61 @@ static int ibmvnic_login(struct net_device *netdev)
 {
 	struct ibmvnic_adapter *adapter = netdev_priv(netdev);
 	unsigned long timeout = msecs_to_jiffies(30000);
-	struct device *dev = &adapter->vdev->dev;
+	int retry_count = 0;
 	int rc;
 
 	do {
-		if (adapter->renegotiate) {
-			adapter->renegotiate = false;
+		if (retry_count > IBMVNIC_MAX_QUEUES) {
+			netdev_warn(netdev, "Login attempts exceeded\n");
+			return -1;
+		}
+
+		adapter->init_done_rc = 0;
+		reinit_completion(&adapter->init_done);
+		rc = send_login(adapter);
+		if (rc) {
+			netdev_warn(netdev, "Unable to login\n");
+			return rc;
+		}
+
+		if (!wait_for_completion_timeout(&adapter->init_done,
+						 timeout)) {
+			netdev_warn(netdev, "Login timed out\n");
+			return -1;
+		}
+
+		if (adapter->init_done_rc == PARTIALSUCCESS) {
+			retry_count++;
 			release_sub_crqs(adapter, 1);
 
+			adapter->init_done_rc = 0;
 			reinit_completion(&adapter->init_done);
 			send_cap_queries(adapter);
 			if (!wait_for_completion_timeout(&adapter->init_done,
 							 timeout)) {
-				dev_err(dev, "Capabilities query timeout\n");
+				netdev_warn(netdev,
+					    "Capabilities query timed out\n");
 				return -1;
 			}
+
 			rc = init_sub_crqs(adapter);
 			if (rc) {
-				dev_err(dev,
-					"Initialization of SCRQ's failed\n");
+				netdev_warn(netdev,
+					    "SCRQ initialization failed\n");
 				return -1;
 			}
+
 			rc = init_sub_crq_irqs(adapter);
 			if (rc) {
-				dev_err(dev,
-					"Initialization of SCRQ's irqs failed\n");
+				netdev_warn(netdev,
+					    "SCRQ irq initialization failed\n");
 				return -1;
 			}
-		}
-
-		reinit_completion(&adapter->init_done);
-		rc = send_login(adapter);
-		if (rc) {
-			dev_err(dev, "Unable to attempt device login\n");
-			return rc;
-		} else if (!wait_for_completion_timeout(&adapter->init_done,
-						 timeout)) {
-			dev_err(dev, "Login timeout\n");
+		} else if (adapter->init_done_rc) {
+			netdev_warn(netdev, "Adapter login failed\n");
 			return -1;
 		}
-	} while (adapter->renegotiate);
+	} while (adapter->init_done_rc == PARTIALSUCCESS);
 
 	/* handle pending MAC address changes after successful login */
 	if (adapter->mac_change_pending) {
@@ -3942,7 +3957,7 @@ static int handle_login_rsp(union ibmvnic_crq *login_rsp_crq,
 	 * to resend the login buffer with fewer queues requested.
 	 */
 	if (login_rsp_crq->generic.rc.code) {
-		adapter->renegotiate = true;
+		adapter->init_done_rc = login_rsp_crq->generic.rc.code;
 		complete(&adapter->init_done);
 		return 0;
 	}
diff --git a/drivers/net/ethernet/ibm/ibmvnic.h b/drivers/net/ethernet/ibm/ibmvnic.h
index 99c0b58c2c39..22391e8805f6 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.h
+++ b/drivers/net/ethernet/ibm/ibmvnic.h
@@ -1035,7 +1035,6 @@ struct ibmvnic_adapter {
 
 	struct ibmvnic_sub_crq_queue **tx_scrq;
 	struct ibmvnic_sub_crq_queue **rx_scrq;
-	bool renegotiate;
 
 	/* rx structs */
 	struct napi_struct *napi;

^ permalink raw reply related

* [PATCH v2 net 2/2] ibmvnic: Do not notify peers on parameter change resets
From: Nathan Fontenot @ 2018-04-11 15:09 UTC (permalink / raw)
  To: netdev; +Cc: jallen, tlfalcon
In-Reply-To: <152345930946.41899.8546547036273488202.stgit@ltcalpine2-lp14.aus.stglabs.ibm.com>

When attempting to change the driver parameters, such as the MTU
value or number of queues, do not call netdev_notify_peers().
Doing so will deadlock on the rtnl_lock.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
---
 drivers/net/ethernet/ibm/ibmvnic.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index c6d5e9448eef..fdca1ea5bbf9 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1843,7 +1843,8 @@ static int do_reset(struct ibmvnic_adapter *adapter,
 	for (i = 0; i < adapter->req_rx_queues; i++)
 		napi_schedule(&adapter->napi[i]);
 
-	if (adapter->reset_reason != VNIC_RESET_FAILOVER)
+	if (adapter->reset_reason != VNIC_RESET_FAILOVER &&
+	    adapter->reset_reason != VNIC_RESET_CHANGE_PARAM)
 		netdev_notify_peers(netdev);
 
 	netif_carrier_on(netdev);

^ permalink raw reply related

* [PATCH v2] dec: tulip: de4x5: Replace mdelay with usleep_range in de4x5_hw_init
From: Jia-Ju Bai @ 2018-04-11 15:39 UTC (permalink / raw)
  To: davem, stephen, johannes.berg, arvind.yadav.cs, dhowells
  Cc: netdev, linux-parisc, linux-kernel, Jia-Ju Bai

de4x5_hw_init() is never called in atomic context.

de4x5_hw_init() is only called by de4x5_pci_probe(), which is only 
set as ".probe" in struct pci_driver.

Despite never getting called from atomic context, de4x5_hw_init() 
calls mdelay() to busily wait.
This is not necessary and can be replaced with usleep_range() to 
avoid busy waiting.

This is found by a static analysis tool named DCNS written by myself.
And I also manually check it.

Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
---
v2:
* Use usleep_range() to correct usleep() in v1.

---
 drivers/net/ethernet/dec/tulip/de4x5.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/dec/tulip/de4x5.c b/drivers/net/ethernet/dec/tulip/de4x5.c
index 0affee9..3fb0119 100644
--- a/drivers/net/ethernet/dec/tulip/de4x5.c
+++ b/drivers/net/ethernet/dec/tulip/de4x5.c
@@ -1107,7 +1107,7 @@ static int (*dc_infoblock[])(struct net_device *dev, u_char, u_char *) = {
 	pdev = to_pci_dev (gendev);
 	pci_write_config_byte(pdev, PCI_CFDA_PSM, WAKEUP);
     }
-    mdelay(10);
+    usleep_range(10000, 11000);

     RESET_DE4X5;

-- 
1.9.1

^ permalink raw reply related

* [RFC bpf-next v2 1/8] bpf: add script and prepare bpf.h for new helpers documentation
From: Quentin Monnet @ 2018-04-11 15:41 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: daniel, ast, netdev, oss-drivers, linux-doc, linux-man
In-Reply-To: <20180410181620.axzpwh2iawg2kgh3@ast-mbp.dhcp.thefacebook.com>

2018-04-10 11:16 UTC-0700 ~ Alexei Starovoitov
<alexei.starovoitov@gmail.com>
> On Tue, Apr 10, 2018 at 03:41:50PM +0100, Quentin Monnet wrote:
>> Remove previous "overview" of eBPF helpers from user bpf.h header.
>> Replace it by a comment explaining how to process the new documentation
>> (to come in following patches) with a Python script to produce RST, then
>> man page documentation.
>>
>> Also add the aforementioned Python script under scripts/. It is used to
>> process include/uapi/linux/bpf.h and to extract helper descriptions, to
>> turn it into a RST document that can further be processed with rst2man
>> to produce a man page. The script takes one "--filename <path/to/file>"
>> option. If the script is launched from scripts/ in the kernel root
>> directory, it should be able to find the location of the header to
>> parse, and "--filename <path/to/file>" is then optional. If it cannot
>> find the file, then the option becomes mandatory. RST-formatted
>> documentation is printed to standard output.
>>
>> Typical workflow for producing the final man page would be:
>>
>>     $ ./scripts/bpf_helpers_doc.py \
>>             --filename include/uapi/linux/bpf.h > /tmp/bpf-helpers.rst
>>     $ rst2man /tmp/bpf-helpers.rst > /tmp/bpf-helpers.7
>>     $ man /tmp/bpf-helpers.7
>>
>> Note that the tool kernel-doc cannot be used to document eBPF helpers,
>> whose signatures are not available directly in the header files
>> (pre-processor directives are used to produce them at the beginning of
>> the compilation process).
>>
>> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
>> ---
>>  include/uapi/linux/bpf.h   | 406 ++------------------------------------------
>>  scripts/bpf_helpers_doc.py | 414 +++++++++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 430 insertions(+), 390 deletions(-)
>>  create mode 100755 scripts/bpf_helpers_doc.py
>>

[...]

>> diff --git a/scripts/bpf_helpers_doc.py b/scripts/bpf_helpers_doc.py
>> new file mode 100755
>> index 000000000000..3a15ba3f0a83
>> --- /dev/null
>> +++ b/scripts/bpf_helpers_doc.py
>> @@ -0,0 +1,414 @@
>> +#!/usr/bin/python3
>> +#
>> +# Copyright (C) 2018 Netronome Systems, Inc.
>> +#
>> +# This software is licensed under the GNU General License Version 2,
>> +# June 1991 as shown in the file COPYING in the top-level directory of this
>> +# source tree.
> 
> please use SPDX instead.
> 

Same as for bpftool, our layers remain a bit cautious about it. I'd be
happy to change it for SPDX as a follow-up when I get the green light.

>> +#
>> +# THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS"
>> +# WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING,
>> +# BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
>> +# FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE
>> +# OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME
>> +# THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
>> +

[...]

>> +class PrinterRST(Printer):
>> +    """
>> +    A printer for dumping collected information about helpers as a ReStructured
>> +    Text page compatible with the rst2man program, which can be used to
>> +    generate a manual page for the helpers.
>> +    @helpers: array of Helper objects to print to standard output
>> +    """
>> +    def print_header(self):
>> +        header = '''\
>> +.. Copyright (C) 2018 Netronome Systems, Inc.
> 
> I think would be good to capture copyrights of all authors that added
> the helpers being documented. Since a lot of text was copied from commit
> logs it's only fair to preserve the copyrights.
> Such man page file is automatically generated by the python script
> and script itself is copyrighted by Netronome. That's fine, but the text
> of man page is not netronome only.
> I'm not sure what would be the solution. May be something like:
> "
> Copyright (C) All BPF authors and contributors from 2011 to present
> See git log include/uapi/linux/bpf.h for details
> "
> ?

Seems fair indeed. I do not have a better suggestion myself, so I will
stick to yours.

Out of curiosity, why 2011 for the year? I thought you introduced eBPF
in the kernel in 2014 (bd4cf0ed331a), and I do not believe helpers have
any link with cBPF?

>> +.. 
>> +.. %%%LICENSE_START(VERBATIM)
>> +.. Permission is granted to make and distribute verbatim copies of this
>> +.. manual provided the copyright notice and this permission notice are
>> +.. preserved on all copies.
>> +.. 
>> +.. Permission is granted to copy and distribute modified versions of this
>> +.. manual under the conditions for verbatim copying, provided that the
>> +.. entire resulting derived work is distributed under the terms of a
>> +.. permission notice identical to this one.
>> +.. 
>> +.. Since the Linux kernel and libraries are constantly changing, this
>> +.. manual page may be incorrect or out-of-date.  The author(s) assume no
>> +.. responsibility for errors or omissions, or for damages resulting from
>> +.. the use of the information contained herein.  The author(s) may not
>> +.. have taken the same level of care in the production of this manual,
>> +.. which is licensed free of charge, as they might when working
>> +.. professionally.
>> +.. 
>> +.. Formatted or processed versions of this manual, if unaccompanied by
>> +.. the source, must acknowledge the copyright and authors of this work.
>> +.. %%%LICENSE_END
>> +.. 
>> +.. Please do not edit this file. It was generated from the documentation
>> +.. located in file include/uapi/linux/bpf.h of the Linux kernel sources
>> +.. (helpers description), and from scripts/bpf_helpers_doc.py in the same
>> +.. repository (header and footer).
>> +
>> +===========
>> +BPF-HELPERS
>> +===========
>> +-------------------------------------------------------------------------------
>> +list of eBPF helper functions
>> +-------------------------------------------------------------------------------
>> +
>> +:Manual section: 7
>> +
>> +DESCRIPTION
>> +===========
>> +
>> +The extended Berkeley Packet Filter (eBPF) subsystem consists in programs
>> +written in a pseudo-assembly language, then attached to one of the several
>> +kernel hooks and run in reaction of specific events. This framework differs
>> +from the older, "classic" BPF (or "cBPF") in several aspects, one of them being
>> +the ability to call special functions (or "helpers") from within a program. For
>> +security reasons, these functions are restricted to a white-list of helpers
>> +defined in the kernel.
> 
> 'for security reasons' sounds a bit odd. May be 'for safety reasons' ?
> Or drop that part.

I'll drop it and keep "These functions are restricted to a white-list of
helpers defined in the kernel."

>> +
>> +These helpers are used by eBPF programs to interact with the system, or with
>> +the context in which they work. For instance, they can be used to print
>> +debugging messages, to get the time since the system was booted, to interact
>> +with eBPF maps, or to manipulate network packets metadata. Since there are
> 
> s/packets metadata/packets/

Ok.

>> +several eBPF program types, and that they do not run in the same context, each
>> +program type can only call a subset of those helpers.
>> +
>> +Due to eBPF conventions, a helper can not have more than five arguments.
>> +
>> +This document is an attempt to list and document the helpers available to eBPF
>> +developers. They are sorted by chronological order (the oldest helpers in the
>> +kernel at the top).
>> +
>> +HELPERS
>> +=======
>> +'''
>> +        print(header)
>> +
>> +    def print_footer(self):
>> +        footer = '''
>> +NOTES
>> +=====
>> +
>> +On the performance side, eBPF programs move to the stack all arguments to pass
>> +to the helpers, and call directly into the compiled helper functions without
> 
> "move to the stack all arguments" ?! I'm not sure what you're trying to say.
> The arguments stay in registers for the call.
> 
>> +requiring any foreign-function interface. As a result, calling helpers
>> +introduce very little overhead.
> 
> not true. it's zero overhead. Literally. Very little is not the same as zero.

Not the same indeed :). I will fix with "no overhead".

I'm not too sure either what I meant when I wrote the thing about moving
arguments to the stack... In fact this "NOTES" section is short and not
really relevant. I will probably delete it and add a line in page header
about helpers being called with no overhead.

>> +
>> +EXAMPLES
>> +========
>> +
>> +Example usage for most of the eBPF helpers listed in this manual page are
>> +available within the Linux kernel sources, at the following locations:
>> +
>> +* *samples/bpf/*
>> +* *tools/testing/selftests/bpf/*
>> +
>> +IMPLEMENTATION
>> +==============
>> +
>> +This manual page is an effort to document the existing eBPF helper functions.
>> +But as of this writing, the BPF sub-system is under heavy development. New eBPF
>> +program or map types are added, along with new helper functions. Some helpers
>> +are occasionally made available for additional program types. So in spite of
>> +the efforts of the community, this page might not be up-to-date. If you want to
>> +check by yourself what helper functions exist in your kernel, or what types of
>> +programs they can support, here are some files among the kernel tree that you
>> +may be interested in:
>> +
>> +* *include/uapi/linux/bpf.h* contains the full list of all helper functions.
>> +* *net/core/filter.c* contains the definition of most network-related helper
>> +  functions, and the list of program types from which they can be used.
>> +* *kernel/trace/bpf_trace.c* is the equivalent for most tracing program-related
>> +  helpers.
>> +* *kernel/bpf/verifier.c* contains the functions used to check that valid types
>> +  of eBPF maps are used with a given helper function.
>> +* *kernel/bpf/* directory contains other files in which additional helpers are
>> +  defined (for cgroups, sockmaps, etc.).
>> +
>> +Compatibility between helper functions and program types can generally be found
>> +in the files where helper functions are defined. Look for the **struct
>> +bpf_func_proto** objects and for functions returning them: these functions
>> +contain a list of helpers that a given program type can call. Note that the
>> +**default:** label of the **switch ... case** used to filter helpers can call
>> +other functions, themselves allowing access to additional helpers. The
>> +requirement for GPL license is also in those **struct bpf_func_proto**.
> 
> I think here would be good to add that most networking helpers are non-GPL
> because they operate on packets which are abstract bytes on the wire,
> whereas most tracing helpers are GPL, since they inspect the guts of
> the linux kernel which is GPL itself.
> That's the main reason why adding extra 'gpl=yes/no' for each helper
> description is redundant.
> 

I removed information from the page header about licensing since v1, I
may reintroduce some of it and tell about the difference between
networking and tracing programs, as you suggest.

I understand this difference and I see that specifying GPL requirement
for each individual helper is redundant. And yet I still believe that
for newcomers, it remains easier to have the indication for the specific
helper they are reading about in the man page rather than to take a
guess ("Is this helper for networking only?"). But I do not intend to
add it back to this set anyway, so let's keep this for future
discussions :).

Thanks for the review!
Quentin

^ permalink raw reply

* [RFC bpf-next v2 2/8] bpf: add documentation for eBPF helpers (01-11)
From: Quentin Monnet @ 2018-04-11 15:42 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: daniel, ast, netdev, oss-drivers, linux-doc, linux-man
In-Reply-To: <20180410175605.2wqhaqx34a4o3gdi@ast-mbp.dhcp.thefacebook.com>

2018-04-10 10:56 UTC-0700 ~ Alexei Starovoitov
<alexei.starovoitov@gmail.com>
> On Tue, Apr 10, 2018 at 03:41:51PM +0100, Quentin Monnet wrote:
>> Add documentation for eBPF helper functions to bpf.h user header file.
>> This documentation can be parsed with the Python script provided in
>> another commit of the patch series, in order to provide a RST document
>> that can later be converted into a man page.
>>
>> The objective is to make the documentation easily understandable and
>> accessible to all eBPF developers, including beginners.
>>
>> This patch contains descriptions for the following helper functions, all
>> written by Alexei:
>>
>> - bpf_map_lookup_elem()
>> - bpf_map_update_elem()
>> - bpf_map_delete_elem()
>> - bpf_probe_read()
>> - bpf_ktime_get_ns()
>> - bpf_trace_printk()
>> - bpf_skb_store_bytes()
>> - bpf_l3_csum_replace()
>> - bpf_l4_csum_replace()
>> - bpf_tail_call()
>> - bpf_clone_redirect()
>>
>> Cc: Alexei Starovoitov <ast@kernel.org>
>> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
>> ---
>>  include/uapi/linux/bpf.h | 199 +++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 199 insertions(+)
>>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 45f77f01e672..2bc653a3a20f 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -381,6 +381,205 @@ union bpf_attr {
>>   * intentional, removing them would break paragraphs for rst2man.
>>   *
>>   * Start of BPF helper function descriptions:
>> + *
>> + * void *bpf_map_lookup_elem(struct bpf_map *map, void *key)
>> + * 	Description
>> + * 		Perform a lookup in *map* for an entry associated to *key*.
>> + * 	Return
>> + * 		Map value associated to *key*, or **NULL** if no entry was
>> + * 		found.
>> + *
>> + * int bpf_map_update_elem(struct bpf_map *map, void *key, void *value, u64 flags)
>> + * 	Description
>> + * 		Add or update the value of the entry associated to *key* in
>> + * 		*map* with *value*. *flags* is one of:
>> + *
>> + * 		**BPF_NOEXIST**
>> + * 			The entry for *key* must not exist in the map.
>> + * 		**BPF_EXIST**
>> + * 			The entry for *key* must already exist in the map.
>> + * 		**BPF_ANY**
>> + * 			No condition on the existence of the entry for *key*.
>> + *
>> + * 		These flags are only useful for maps of type
>> + * 		**BPF_MAP_TYPE_HASH**. For all other map types, **BPF_ANY**
>> + * 		should be used.
> 
> I think that's not entirely accurate.
> The flags work as expected for all other map types as well
> and for lru map, sockmap, map in map the flags have practical use cases.
> 

Ok, I missed that. I have to go back and check how the flags are used
for those maps. I will cook up something cleaner for the next version of
the set.

>> + * 	Return
>> + * 		0 on success, or a negative error in case of failure.
>> + *

[...]

>> + *
>> + * int bpf_trace_printk(const char *fmt, u32 fmt_size, ...)
>> + * 	Description
>> + * 		This helper is a "printk()-like" facility for debugging. It
>> + * 		prints a message defined by format *fmt* (of size *fmt_size*)
>> + * 		to file *\/sys/kernel/debug/tracing/trace* from DebugFS, if
>> + * 		available. It can take up to three additional **u64**
>> + * 		arguments (as an eBPF helpers, the total number of arguments is
>> + * 		limited to five). Each time the helper is called, it appends a
>> + * 		line that looks like the following:
>> + *
>> + * 		::
>> + *
>> + * 			telnet-470   [001] .N.. 419421.045894: 0x00000001: BPF command: 2
>> + *
>> + * 		In the above:
>> + *
>> + * 			* ``telnet`` is the name of the current task.
>> + * 			* ``470`` is the PID of the current task.
>> + * 			* ``001`` is the CPU number on which the task is
>> + * 			  running.
>> + * 			* In ``.N..``, each character refers to a set of
>> + * 			  options (whether irqs are enabled, scheduling
>> + * 			  options, whether hard/softirqs are running, level of
>> + * 			  preempt_disabled respectively). **N** means that
>> + * 			  **TIF_NEED_RESCHED** and **PREEMPT_NEED_RESCHED**
>> + * 			  are set.
>> + * 			* ``419421.045894`` is a timestamp.
>> + * 			* ``0x00000001`` is a fake value used by BPF for the
>> + * 			  instruction pointer register.
>> + * 			* ``BPF command: 2`` is the message formatted with
>> + * 			  *fmt*.
> 
> the above depends on how trace_pipe was configured. It's a default
> configuration for many, but would be good to explain this a bit better.
> 

I did not know about that. Would you have a pointer about how to
configure trace_pipe, please?

>> + *
>> + * 		The conversion specifiers supported by *fmt* are similar, but
>> + * 		more limited than for printk(). They are **%d**, **%i**,
>> + * 		**%u**, **%x**, **%ld**, **%li**, **%lu**, **%lx**, **%lld**,
>> + * 		**%lli**, **%llu**, **%llx**, **%p**, **%s**. No modifier (size
>> + * 		of field, padding with zeroes, etc.) is available, and the
>> + * 		helper will silently fail if it encounters an unknown
>> + * 		specifier.
> 
> This is not true. bpf_trace_printk will return -EINVAL for unknown specifier.
> 

Correct, sorry about that. I never check the return value of
bpf_trace_printk(), and it's hard to realise it failed without resorting
to another bpf_trace_printk() :). I'll fix it, what about:

"No modifier (size of field, padding with zeroes, etc.) is available,
and the helper will return **-EINVAL** (but print nothing) if it
encounters an unknown specifier."

(I would like to keep the "print nothing" idea, at the beginning I spent
some time myself trying to figure out why my bpf_trace_prink() seemed to
be never called--I was simply trying to print with "%#x".)

>> + *
>> + * 		Also, note that **bpf_trace_printk**\ () is slow, and should
>> + * 		only be used for debugging purposes. For passing values to user
>> + * 		space, perf events should be preferred.
> 
> please mention the giant dmesg warning that people will definitely
> notice when they try to use this helper.

This is a good idea, I will mention it.

>> + * 	Return
>> + * 		The number of bytes written to the buffer, or a negative error
>> + * 		in case of failure.
>> + *

[...]

>> + * int bpf_tail_call(void *ctx, struct bpf_map *prog_array_map, u32 index)
>> + * 	Description
>> + * 		This special helper is used to trigger a "tail call", or in
>> + * 		other words, to jump into another eBPF program. The contents of
>> + * 		eBPF registers and stack are not modified, the new program
>> + * 		"inherits" them from the caller. This mechanism allows for
> 
> "inherits" is a technically correct, but misleading statement,
> since callee program cannot access caller's registers and stack.
> 

I can replace this sentence by:

"The same stack frame is used (but values on stack and in registers for
the caller are not accessible to the callee)."

>> + * 		program chaining, either for raising the maximum number of
>> + * 		available eBPF instructions, or to execute given programs in
>> + * 		conditional blocks. For security reasons, there is an upper
>> + * 		limit to the number of successive tail calls that can be
>> + * 		performed.
>> + *
>> + * 		Upon call of this helper, the program attempts to jump into a
>> + * 		program referenced at index *index* in *prog_array_map*, a
>> + * 		special map of type **BPF_MAP_TYPE_PROG_ARRAY**, and passes
>> + * 		*ctx*, a pointer to the context.
>> + *
>> + * 		If the call succeeds, the kernel immediately runs the first
>> + * 		instruction of the new program. This is not a function call,
>> + * 		and it never goes back to the previous program. If the call
>> + * 		fails, then the helper has no effect, and the caller continues
>> + * 		to run its own instructions. A call can fail if the destination
>> + * 		program for the jump does not exist (i.e. *index* is superior
>> + * 		to the number of entries in *prog_array_map*), or if the
>> + * 		maximum number of tail calls has been reached for this chain of
>> + * 		programs. This limit is defined in the kernel by the macro
>> + * 		**MAX_TAIL_CALL_CNT** (not accessible to user space), which
>> + * 		is currently set to 32.
>> + * 	Return
>> + * 		0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_clone_redirect(struct sk_buff *skb, u32 ifindex, u64 flags)
>> + * 	Description
>> + * 		Clone and redirect the packet associated to *skb* to another
>> + * 		net device of index *ifindex*. The only flag supported for now
>> + * 		is **BPF_F_INGRESS**, which indicates the packet is to be
>> + * 		redirected to the ingress interface instead of (by default)
>> + * 		egress.
> 
> imo the above sentence is prone to misinterpretation.
> Can you rephrase it to say that both redirect to ingress and redirect to egress
> are supported and flag is used to indicate which path to take ?
> 

I could replace with the following:

"Clone and redirect the packet associated to *skb* to another net device
of index *ifindex*. Both ingress and egress interfaces can be used for
redirection. The **BPF_F_INGRESS** value in *flags* is used to make the
distinction (ingress path is selected if the flag is present, egress
path otherwise). This is the only flag supported for now."

I think I wrote similar things about other helpers using BPF_F_INGRESS
flag, I will also update them accordingly.

>> + *
>> + * 		A call to this helper is susceptible to change data from the
>> + * 		packet. Therefore, at load time, all checks on pointers
>> + * 		previously done by the verifier are invalidated and must be
>> + * 		performed again.
>> + * 	Return
>> + * 		0 on success, or a negative error in case of failure.
>>   */
>>  #define __BPF_FUNC_MAPPER(FN)		\
>>  	FN(unspec),			\
>> -- 
>> 2.14.1
>>

^ permalink raw reply

* [RFC bpf-next v2 3/8] bpf: add documentation for eBPF helpers (12-22)
From: Quentin Monnet @ 2018-04-11 15:43 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: daniel, ast, netdev, oss-drivers, linux-doc, linux-man
In-Reply-To: <20180410224259.v5p2t2dc5s27mw3z@ast-mbp.dhcp.thefacebook.com>

2018-04-10 15:43 UTC-0700 ~ Alexei Starovoitov
<alexei.starovoitov@gmail.com>
> On Tue, Apr 10, 2018 at 03:41:52PM +0100, Quentin Monnet wrote:
>> Add documentation for eBPF helper functions to bpf.h user header file.
>> This documentation can be parsed with the Python script provided in
>> another commit of the patch series, in order to provide a RST document
>> that can later be converted into a man page.
>>
>> The objective is to make the documentation easily understandable and
>> accessible to all eBPF developers, including beginners.
>>
>> This patch contains descriptions for the following helper functions, all
>> writter by Alexei:
>>
>> - bpf_get_current_pid_tgid()
>> - bpf_get_current_uid_gid()
>> - bpf_get_current_comm()
>> - bpf_skb_vlan_push()
>> - bpf_skb_vlan_pop()
>> - bpf_skb_get_tunnel_key()
>> - bpf_skb_set_tunnel_key()
>> - bpf_redirect()
>> - bpf_perf_event_output()
>> - bpf_get_stackid()
>> - bpf_get_current_task()
>>
>> Cc: Alexei Starovoitov <ast@kernel.org>
>> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
>> ---
>>  include/uapi/linux/bpf.h | 237 +++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 237 insertions(+)
>>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 2bc653a3a20f..f3ea8824efbc 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -580,6 +580,243 @@ union bpf_attr {
>>   * 		performed again.
>>   * 	Return
>>   * 		0 on success, or a negative error in case of failure.
>> + *
>> + * u64 bpf_get_current_pid_tgid(void)
>> + * 	Return
>> + * 		A 64-bit integer containing the current tgid and pid, and
>> + * 		created as such:
>> + * 		*current_task*\ **->tgid << 32 \|**
>> + * 		*current_task*\ **->pid**.
>> + *
>> + * u64 bpf_get_current_uid_gid(void)
>> + * 	Return
>> + * 		A 64-bit integer containing the current GID and UID, and
>> + * 		created as such: *current_gid* **<< 32 \|** *current_uid*.
>> + *
>> + * int bpf_get_current_comm(char *buf, u32 size_of_buf)
>> + * 	Description
>> + * 		Copy the **comm** attribute of the current task into *buf* of
>> + * 		*size_of_buf*. The **comm** attribute contains the name of
>> + * 		the executable (excluding the path) for the current task. The
>> + * 		*size_of_buf* must be strictly positive. On success, the
> 
> that reminds me that we probably should relax it to ARG_CONST_SIZE_OR_ZERO.
> The programs won't be passing an actual zero into it, but it helps
> a lot to tell verifier that zero is also valid, since programs
> become much simpler.
> 

Ok. No change to helper description for now, we will update here when
your patch lands.

>> + * 		helper makes sure that the *buf* is NUL-terminated. On failure,
>> + * 		it is filled with zeroes.
>> + * 	Return
>> + * 		0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_skb_vlan_push(struct sk_buff *skb, __be16 vlan_proto, u16 vlan_tci)
>> + * 	Description
>> + * 		Push a *vlan_tci* (VLAN tag control information) of protocol
>> + * 		*vlan_proto* to the packet associated to *skb*, then update
>> + * 		the checksum. Note that if *vlan_proto* is different from
>> + * 		**ETH_P_8021Q** and **ETH_P_8021AD**, it is considered to
>> + * 		be **ETH_P_8021Q**.
>> + *
>> + * 		A call to this helper is susceptible to change data from the
>> + * 		packet. Therefore, at load time, all checks on pointers
>> + * 		previously done by the verifier are invalidated and must be
>> + * 		performed again.
>> + * 	Return
>> + * 		0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_skb_vlan_pop(struct sk_buff *skb)
>> + * 	Description
>> + * 		Pop a VLAN header from the packet associated to *skb*.
>> + *
>> + * 		A call to this helper is susceptible to change data from the
>> + * 		packet. Therefore, at load time, all checks on pointers
>> + * 		previously done by the verifier are invalidated and must be
>> + * 		performed again.
>> + * 	Return
>> + * 		0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_skb_get_tunnel_key(struct sk_buff *skb, struct bpf_tunnel_key *key, u32 size, u64 flags)
>> + * 	Description
>> + * 		Get tunnel metadata. This helper takes a pointer *key* to an
>> + * 		empty **struct bpf_tunnel_key** of **size**, that will be
>> + * 		filled with tunnel metadata for the packet associated to *skb*.
>> + * 		The *flags* can be set to **BPF_F_TUNINFO_IPV6**, which
>> + * 		indicates that the tunnel is based on IPv6 protocol instead of
>> + * 		IPv4.
>> + *
>> + * 		This is typically used on the receive path to perform a lookup
>> + * 		or a packet redirection based on the value of *key*:
> 
> above is correct, but feels a bit cryptic.
> May be give more concrete example for particular tunneling protocol like gre
> and say that tunnel_key.remote_ip[46] is essential part of the encap and
> bpf prog will make decisions based on the contents of the encap header
> where bpf_tunnel_key is a single structure that generalizes parameters of
> various tunneling protocols into one struct.
> 

I will try to do this.

>> + *
>> + * 		::
>> + *
>> + * 			struct bpf_tunnel_key key = {};
>> + * 			bpf_skb_get_tunnel_key(skb, &key, sizeof(key), 0);
>> + * 			     lookup or redirect based on key ...
>> + *
>> + * 	Return
>> + * 		0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_skb_set_tunnel_key(struct sk_buff *skb, struct bpf_tunnel_key *key, u32 size, u64 flags)
>> + * 	Description
>> + * 		Populate tunnel metadata for packet associated to *skb.* The
>> + * 		tunnel metadata is set to the contents of *key*, of *size*. The
>> + * 		*flags* can be set to a combination of the following values:
>> + *
>> + * 		**BPF_F_TUNINFO_IPV6**
>> + * 			Indicate that the tunnel is based on IPv6 protocol
>> + * 			instead of IPv4.
>> + * 		**BPF_F_ZERO_CSUM_TX**
>> + * 			For IPv4 packets, add a flag to tunnel metadata
>> + * 			indicating that checksum computation should be skipped
>> + * 			and checksum set to zeroes.
>> + * 		**BPF_F_DONT_FRAGMENT**
>> + * 			Add a flag to tunnel metadata indicating that the
>> + * 			packet should not be fragmented.
>> + * 		**BPF_F_SEQ_NUMBER**
>> + * 			Add a flag to tunnel metadata indicating that a
>> + * 			sequence number should be added to tunnel header before
>> + * 			sending the packet. This flag was added for GRE
>> + * 			encapsulation, but might be used with other protocols
>> + * 			as well in the future.
>> + *
>> + * 		Here is a typical usage on the transmit path:
>> + *
>> + * 		::
>> + *
>> + * 			struct bpf_tunnel_key key;
>> + * 			     populate key ...
>> + * 			bpf_skb_set_tunnel_key(skb, &key, sizeof(key), 0);
>> + * 			bpf_clone_redirect(skb, vxlan_dev_ifindex, 0);
>> + *
>> + * 	Return
>> + * 		0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_redirect(u32 ifindex, u64 flags)
>> + * 	Description
>> + * 		Redirect the packet to another net device of index *ifindex*.
>> + * 		This helper is somewhat similar to **bpf_clone_redirect**\
>> + * 		(), except that the packet is not cloned, which provides
>> + * 		increased performance.
>> + *
>> + * 		For hooks other than XDP, *flags* can be set to
>> + * 		**BPF_F_INGRESS**, which indicates the packet is to be
>> + * 		redirected to the ingress interface instead of (by default)
>> + * 		egress. Currently, XDP does not support any flag.
>> + * 	Return
>> + * 		For XDP, the helper returns **XDP_REDIRECT** on success or
>> + * 		**XDP_ABORT** on error. For other program types, the values
>> + * 		are **TC_ACT_REDIRECT** on success or **TC_ACT_SHOT** on
>> + * 		error.
>> + *
>> + * int bpf_perf_event_output(struct pt_reg *ctx, struct bpf_map *map, u64 flags, void *data, u64 size)
>> + * 	Description
>> + * 		Write perf raw sample into a perf event held by *map* of type
> 
> I'd say:
> Write raw *data* blob into special bpf perf event held by ...
> 

Yes it sounds better, I will follow the suggestion.

>> + * 		**BPF_MAP_TYPE_PERF_EVENT_ARRAY**. This perf event must
>> + * 		have the following attributes: **PERF_SAMPLE_RAW** as
>> + * 		**sample_type**, **PERF_TYPE_SOFTWARE** as **type**, and
>> + * 		**PERF_COUNT_SW_BPF_OUTPUT** as **config**.
>> + *
>> + * 		The *flags* are used to indicate the index in *map* for which
>> + * 		the value must be put, masked with **BPF_F_INDEX_MASK**.
>> + * 		Alternatively, *flags* can be set to **BPF_F_CURRENT_CPU**
>> + * 		to indicate that the index of the current CPU core should be
>> + * 		used.
>> + *
>> + * 		The value to write, of *size*, is passed through eBPF stack and
>> + * 		pointed by *data*.
>> + *
>> + * 		The context of the program *ctx* needs also be passed to the
>> + * 		helper, and will get interpreted as a pointer to a **struct
>> + * 		pt_reg**.
> 
> Not quite correct.
> Initially bpf_perf_event_output() was only used with 'struct pt_reg *ctx',
> but then later it was generalized for all other tracing prog types,
> for clsact and even for XDP.
> So 'ctx' can be any of the context used by these program types.
> 

Right, I suppose I only looked at bpf_perf_event_output_tp() for this
one :(. I can simply trim it to:

"The context of the program *ctx* needs also be passed to the helper."

>> + *
>> + * 		On user space, a program willing to read the values needs to
>> + * 		call **perf_event_open**\ () on the perf event (either for
>> + * 		one or for all CPUs) and to store the file descriptor into the
>> + * 		*map*. This must be done before the eBPF program can send data
>> + * 		into it. An example is available in file
>> + * 		*samples/bpf/trace_output_user.c* in the Linux kernel source
>> + * 		tree (the eBPF program counterpart is in
>> + * 		*samples/bpf/trace_output_kern.c*). It looks like the
>> + * 		following snippet:
>> + *
>> + * 		::
>> + *
>> + * 			volatile struct perf_event_mmap_page *header;
>> + * 			struct perf_event_attr attr = {
>> + * 			        .sample_type = PERF_SAMPLE_RAW,
>> + * 			        .type = PERF_TYPE_SOFTWARE,
>> + * 			        .config = PERF_COUNT_SW_BPF_OUTPUT,
>> + * 			};
>> + * 			int page_size;
>> + * 			int mmap_size;
>> + * 			int key = 0;
>> + * 			int pmu_fd;
>> + * 			void *base;
>> + * 			
>> + * 			if (load_bpf_file(filename))
>> + * 			        return -1;
>> + * 			
>> + * 			pmu_fd = sys_perf_event_open(&attr,
>> + * 			                             -1, // pid
>> + * 			                              0, // cpu
>> + * 			                             -1, // group_fd
>> + * 			                              0);
>> + * 			
>> + * 			assert(pmu_fd >= 0);
>> + * 			assert(bpf_map_update_elem(map_fd[0], &key,
>> + * 			                           &pmu_fd, BPF_ANY) == 0);
>> + * 			assert(ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0) == 0);
>> + * 			
>> + * 			page_size = getpagesize();
>> + * 			mmap_size = page_size * (page_cnt + 1);
>> + * 			
>> + * 			base = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
>> + * 			            MAP_SHARED, fd, 0);
>> + * 			if (base == MAP_FAILED)
>> + * 			        return -1;
>> + * 			
>> + * 			header = base;
> 
> I think that is too much for the man page, especially above is far from
> complete example.
> 

Yeah, I was unsure about keeping it. I will remove the snippet.

>> + *
>> + * 		**bpf_perf_event_output**\ () achieves better performance
>> + * 		than **bpf_trace_printk**\ () for sharing data with user
>> + * 		space, and is much better suitable for streaming data from eBPF
>> + * 		programs.
>> + * 	Return
>> + * 		0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_get_stackid(struct pt_reg *ctx, struct bpf_map *map, u64 flags)
>> + * 	Description
>> + * 		Walk a user or a kernel stack and return its id. To achieve
>> + * 		this, the helper needs *ctx*, which is a pointer to the context
>> + * 		on which the tracing program is executed, and a pointer to a
>> + * 		*map* of type **BPF_MAP_TYPE_STACK_TRACE**.
>> + *
>> + * 		The last argument, *flags*, holds the number of stack frames to
>> + * 		skip (from 0 to 255), masked with
>> + * 		**BPF_F_SKIP_FIELD_MASK**. The next bits can be used to set
>> + * 		a combination of the following flags:
>> + *
>> + * 		**BPF_F_USER_STACK**
>> + * 			Collect a user space stack instead of a kernel stack.
>> + * 		**BPF_F_FAST_STACK_CMP**
>> + * 			Compare stacks by hash only.
>> + * 		**BPF_F_REUSE_STACKID**
>> + * 			If two different stacks hash into the same *stackid*,
>> + * 			discard the old one.
> 
> we have an annoying bug here that we will be sending a patch to fix soon,
> since right now there is no way for the program to know that stackid
> got replaced.
> 

Understood. Same as for bpf_get_current_comm(), I will leave the
description untouched until the patch lands.

>> + *
>> + * 		The stack id retrieved is a 32 bit long integer handle which
>> + * 		can be further combined with other data (including other stack
>> + * 		ids) and used as a key into maps. This can be useful for
>> + * 		generating a variety of graphs (such as flame graphs or off-cpu
>> + * 		graphs).
>> + *
>> + * 		For walking a stack, this helper is an improvement over
>> + * 		**bpf_probe_read**\ (), which can be used with unrolled loops
>> + * 		but is not efficient and consumes a lot of eBPF instructions.
>> + * 		Instead, **bpf_get_stackid**\ () can collect up to
>> + * 		**PERF_MAX_STACK_DEPTH** both kernel and user frames.
> 
> PERF_MAX_STACK_DEPTH is now controlled by sysctl knob.
> Would be good to mention that this limit can and should be increased
> for profiling long user stacks like java.
> 

Good idea, I will add it.

Thanks a lot Alexei for the thorough reviews!
Quentin

^ permalink raw reply

* [RFC bpf-next v2 7/8] bpf: add documentation for eBPF helpers (51-57)
From: Quentin Monnet @ 2018-04-11 15:44 UTC (permalink / raw)
  To: Yonghong Song, daniel, ast
  Cc: netdev, oss-drivers, linux-doc, linux-man, Lawrence Brakmo,
	Josef Bacik, Andrey Ignatov
In-Reply-To: <cc54b41e-3f2f-e87f-042f-842c96308626@fb.com>

2018-04-10 09:58 UTC-0700 ~ Yonghong Song <yhs@fb.com>
> On 4/10/18 7:41 AM, Quentin Monnet wrote:
>> Add documentation for eBPF helper functions to bpf.h user header file.
>> This documentation can be parsed with the Python script provided in
>> another commit of the patch series, in order to provide a RST document
>> that can later be converted into a man page.
>>
>> The objective is to make the documentation easily understandable and
>> accessible to all eBPF developers, including beginners.
>>
>> This patch contains descriptions for the following helper functions:
>>
>> Helpers from Lawrence:
>> - bpf_setsockopt()
>> - bpf_getsockopt()
>> - bpf_sock_ops_cb_flags_set()
>>
>> Helpers from Yonghong:
>> - bpf_perf_event_read_value()
>> - bpf_perf_prog_read_value()
>>
>> Helper from Josef:
>> - bpf_override_return()
>>
>> Helper from Andrey:
>> - bpf_bind()
>>
>> Cc: Lawrence Brakmo <brakmo@fb.com>
>> Cc: Yonghong Song <yhs@fb.com>
>> Cc: Josef Bacik <jbacik@fb.com>
>> Cc: Andrey Ignatov <rdna@fb.com>
>> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
>> ---
>>   include/uapi/linux/bpf.h | 184
>> +++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 184 insertions(+)
>>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 15d9ccafebbe..7343af4196c8 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h

[...]

>> @@ -1255,6 +1277,168 @@ union bpf_attr {
>>    *         performed again.
>>    *     Return
>>    *         0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_perf_event_read_value(struct bpf_map *map, u64 flags,
>> struct bpf_perf_event_value *buf, u32 buf_size)
>> + *     Description
>> + *         Read the value of a perf event counter, and store it into
>> *buf*
>> + *         of size *buf_size*. This helper relies on a *map* of type
>> + *         **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. The nature of the perf
>> + *         event counter is selected at the creation of the *map*. The
> 
> The nature of the perf event counter is selected when *map* is updated
> with perf_event fd's.
> 

Thanks, I will fix it.

>> + *         *map* is an array whose size is the number of available CPU
>> + *         cores, and each cell contains a value relative to one
>> core. The
> 
> It is confusing to mix core/cpu here. Maybe just use perf_event
> convention, always using cpu?
> 

Right, I'll remove occurrences of "core".

>> + *         value to retrieve is indicated by *flags*, that contains the
>> + *         index of the core to look up, masked with
>> + *         **BPF_F_INDEX_MASK**. Alternatively, *flags* can be set to
>> + *         **BPF_F_CURRENT_CPU** to indicate that the value for the
>> + *         current CPU core should be retrieved.
>> + *
>> + *         This helper behaves in a way close to
>> + *         **bpf_perf_event_read**\ () helper, save that instead of
>> + *         just returning the value observed, it fills the *buf*
>> + *         structure. This allows for additional data to be
>> retrieved: in
>> + *         particular, the enabled and running times (in *buf*\
>> + *         **->enabled** and *buf*\ **->running**, respectively) are
>> + *         copied.
>> + *
>> + *         These values are interesting, because hardware PMU
>> (Performance
>> + *         Monitoring Unit) counters are limited resources. When
>> there are
>> + *         more PMU based perf events opened than available counters,
>> + *         kernel will multiplex these events so each event gets certain
>> + *         percentage (but not all) of the PMU time. In case that
>> + *         multiplexing happens, the number of samples or counter value
>> + *         will not reflect the case compared to when no multiplexing
>> + *         occurs. This makes comparison between different runs
>> difficult.
>> + *         Typically, the counter value should be normalized before
>> + *         comparing to other experiments. The usual normalization is
>> done
>> + *         as follows.
>> + *
>> + *         ::
>> + *
>> + *             normalized_counter = counter * t_enabled / t_running
>> + *
>> + *         Where t_enabled is the time enabled for event and
>> t_running is
>> + *         the time running for event since last normalization. The
>> + *         enabled and running times are accumulated since the perf
>> event
>> + *         open. To achieve scaling factor between two invocations of an
>> + *         eBPF program, users can can use CPU id as the key (which is
>> + *         typical for perf array usage model) to remember the previous
>> + *         value and do the calculation inside the eBPF program.
>> + *     Return
>> + *         0 on success, or a negative error in case of failure.
>> + *

[...]

Thanks Yonghong for the review!

I have a favor to ask of you. I got a bounce for Kaixu Xia's email
address, and I don't know what alternative email address I could use. I
CC-ed to have a review for helper bpf_perf_event_read() (in patch 6 of
this series), which is rather close to bpf_perf_event_read_value().
Would you mind having a look at that one too, please? The description is
not long.

Quentin

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox