Netdev List


* [PATCH v7 1/4] vdpa: Add suspend operation
From: Eugenio Pérez @ 2022-08-10 17:15 UTC (permalink / raw)
  To: kvm, Michael S. Tsirkin, linux-kernel, Jason Wang, virtualization,
	netdev
  Cc: dinang, martinpo, Wu Zongyong, Piotr.Uminski, gautam.dawar,
	ecree.xilinx, martinh, Stefano Garzarella, pabloc, habetsm.xilinx,
	lvivier, Zhu Lingshan, tanuj.kamde, Longpeng, lulu, hanand,
	Parav Pandit, Si-Wei Liu, Eli Cohen, Xie Yongji, Zhang Min,
	Dan Carpenter, Christophe JAILLET
In-Reply-To: <20220810171512.2343333-1-eperezma@redhat.com>

This operation is optional: if it's not implemented, the backend feature
bit will not be exposed.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Message-Id: <20220623160738.632852-2-eperezma@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/vdpa.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index 7b4a13d3bd91..d282f464d2f1 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -218,6 +218,9 @@ struct vdpa_map_file {
  * @reset:			Reset device
  *				@vdev: vdpa device
  *				Returns integer: success (0) or error (< 0)
+ * @suspend:			Suspend or resume the device (optional)
+ *				@vdev: vdpa device
+ *				Returns integer: success (0) or error (< 0)
  * @get_config_size:		Get the size of the configuration space includes
  *				fields that are conditional on feature bits.
  *				@vdev: vdpa device
@@ -319,6 +322,7 @@ struct vdpa_config_ops {
 	u8 (*get_status)(struct vdpa_device *vdev);
 	void (*set_status)(struct vdpa_device *vdev, u8 status);
 	int (*reset)(struct vdpa_device *vdev);
+	int (*suspend)(struct vdpa_device *vdev);
 	size_t (*get_config_size)(struct vdpa_device *vdev);
 	void (*get_config)(struct vdpa_device *vdev, unsigned int offset,
 			   void *buf, unsigned int len);
-- 
2.31.1


^ permalink raw reply related

* [PATCH v7 0/4] Implement vdpasim suspend operation
From: Eugenio Pérez @ 2022-08-10 17:15 UTC (permalink / raw)
  To: kvm, Michael S. Tsirkin, linux-kernel, Jason Wang, virtualization,
	netdev
  Cc: dinang, martinpo, Wu Zongyong, Piotr.Uminski, gautam.dawar,
	ecree.xilinx, martinh, Stefano Garzarella, pabloc, habetsm.xilinx,
	lvivier, Zhu Lingshan, tanuj.kamde, Longpeng, lulu, hanand,
	Parav Pandit, Si-Wei Liu, Eli Cohen, Xie Yongji, Zhang Min,
	Dan Carpenter, Christophe JAILLET

Implement suspend operation for vdpa_sim devices, so vhost-vdpa will offer
that backend feature and userspace can effectively suspend the device.

This is a must before getting virtqueue indexes (base) for live migration,
since the device could modify them after userland gets them. There are
individual ways to perform that action for some devices
(VHOST_NET_SET_BACKEND, VHOST_VSOCK_SET_RUNNING, ...) but there was no
way to perform it for any vhost device (and, in particular, vhost-vdpa).

After a successful return of the ioctl, the device must not process more
virtqueue descriptors. The device can still answer reads or writes of config
fields as if it were not suspended. In particular, writing to "queue_enable"
with a value of 1 will not make the device start processing virtqueue buffers.
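
The semantics described above can be modelled with a minimal sketch. It is not
the code from this series: every sketch_* name and the bit position are
hypothetical, and the device is reduced to a few flags. It only illustrates
that the feature bit is offered when the optional op exists, and that a
suspended device accepts config writes without processing descriptors.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit position, chosen only for this sketch. */
#define SKETCH_BACKEND_F_SUSPEND 4

struct sketch_dev;

/* Simplified stand-in for a config-ops table: suspend is optional. */
struct sketch_ops {
	int (*suspend)(struct sketch_dev *dev);
};

struct sketch_dev {
	const struct sketch_ops *ops;
	bool suspended;
	bool queue_enable;
	unsigned int processed;	/* descriptors handled so far */
};

/* Advertise the suspend bit only when the op is implemented. */
static uint64_t sketch_backend_features(const struct sketch_dev *dev)
{
	uint64_t features = 0;

	if (dev->ops->suspend)
		features |= 1ULL << SKETCH_BACKEND_F_SUSPEND;
	return features;
}

/* Example suspend op for a simulated device. */
static int sketch_suspend(struct sketch_dev *dev)
{
	dev->suspended = true;
	return 0;
}

/* Config writes are still accepted while suspended. */
static void sketch_write_queue_enable(struct sketch_dev *dev, bool val)
{
	dev->queue_enable = val;
}

/* Process one descriptor unless suspended or the queue is disabled. */
static bool sketch_process_one(struct sketch_dev *dev)
{
	if (dev->suspended || !dev->queue_enable)
		return false;
	dev->processed++;
	return true;
}

/* ioctl-like entry point: fail with a negative value if the op is missing. */
static int sketch_ioctl_suspend(struct sketch_dev *dev)
{
	if (!dev->ops->suspend)
		return -1;
	return dev->ops->suspend(dev);
}
```

In this model, a caller would check sketch_backend_features() first and only
then call sketch_ioctl_suspend(); once that returns 0, the virtqueue indexes
can no longer change underneath it.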

In the future, we will provide features similar to
VHOST_USER_GET_INFLIGHT_FD so the device can save pending operations.

Applied on top of [1] branch after removing the old commits.

Comments are welcome.

v7:
* Remove ioctl leftover argument and update doc accordingly.

v6:
* Remove the resume operation, making the ioctl simpler. We can always add
  another ioctl for VM_STOP/VM_RESUME operation later.
* s/stop/suspend/ to differentiate more from reset.
* Clarify scope of the suspend operation.

v5:
* s/not stop/resume/ in doc.

v4:
* Replace VHOST_STOP with VHOST_VDPA_STOP in vhost ioctl switch case too.

v3:
* s/VHOST_STOP/VHOST_VDPA_STOP/
* Add documentation and requirements of the ioctl above its definition.

v2:
* Replace raw _F_STOP with BIT_ULL(_F_STOP).
* Fix obtaining of stop ioctl arg (it was not obtained but written).
* Add stop to vdpa_sim_blk.

[1] git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git

Eugenio Pérez (4):
  vdpa: Add suspend operation
  vhost-vdpa: introduce SUSPEND backend feature bit
  vhost-vdpa: uAPI to suspend the device
  vdpa_sim: Implement suspend vdpa op

 drivers/vdpa/vdpa_sim/vdpa_sim.c     | 14 +++++++++++
 drivers/vdpa/vdpa_sim/vdpa_sim.h     |  1 +
 drivers/vdpa/vdpa_sim/vdpa_sim_blk.c |  3 +++
 drivers/vdpa/vdpa_sim/vdpa_sim_net.c |  3 +++
 drivers/vhost/vdpa.c                 | 35 +++++++++++++++++++++++++++-
 include/linux/vdpa.h                 |  4 ++++
 include/uapi/linux/vhost.h           |  9 +++++++
 include/uapi/linux/vhost_types.h     |  2 ++
 8 files changed, 70 insertions(+), 1 deletion(-)

-- 
2.31.1

^ permalink raw reply

* Re: [PATCH net-next] net: ngbe: Add build support for ngbe
From: kernel test robot @ 2022-08-10 17:08 UTC (permalink / raw)
  To: Mengyuan Lou, netdev; +Cc: kbuild-all, jiawenwu, Mengyuan Lou
In-Reply-To: <20220808094113.9434-1-mengyuanlou@net-swift.com>

Hi Mengyuan,

I love your patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Mengyuan-Lou/net-ngbe-Add-build-support-for-ngbe/20220808-174431
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git f86d1fbbe7858884d6754534a0afbb74fc30bc26
config: i386-allyesconfig (https://download.01.org/0day-ci/archive/20220811/202208110135.9PK79CPj-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-3) 11.3.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/b813046e2626a39496a064fb85ed44916289a4ee
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Mengyuan-Lou/net-ngbe-Add-build-support-for-ngbe/20220808-174431
        git checkout b813046e2626a39496a064fb85ed44916289a4ee
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add the following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   drivers/net/ethernet/wangxun/ngbe/ngbe_main.c: In function 'ngbe_probe':
>> drivers/net/ethernet/wangxun/ngbe/ngbe_main.c:105:42: error: 'ngbe_MAX_TX_QUEUES' undeclared (first use in this function); did you mean 'NGBE_MAX_TX_QUEUES'?
     105 |                                          ngbe_MAX_TX_QUEUES,
         |                                          ^~~~~~~~~~~~~~~~~~
         |                                          NGBE_MAX_TX_QUEUES
   drivers/net/ethernet/wangxun/ngbe/ngbe_main.c:105:42: note: each undeclared identifier is reported only once for each function it appears in
>> drivers/net/ethernet/wangxun/ngbe/ngbe_main.c:106:42: error: 'ngbe_MAX_RX_QUEUES' undeclared (first use in this function); did you mean 'NGBE_MAX_RX_QUEUES'?
     106 |                                          ngbe_MAX_RX_QUEUES);
         |                                          ^~~~~~~~~~~~~~~~~~
         |                                          NGBE_MAX_RX_QUEUES


vim +105 drivers/net/ethernet/wangxun/ngbe/ngbe_main.c

    61	
    62	/**
    63	 * ngbe_probe - Device Initialization Routine
    64	 * @pdev: PCI device information struct
    65	 * @ent: entry in ngbe_pci_tbl
    66	 *
    67	 * Returns 0 on success, negative on failure
    68	 *
    69	 * ngbe_probe initializes an adapter identified by a pci_dev structure.
    70	 * The OS initialization, configuring of the adapter private structure,
    71	 * and a hardware reset occur.
    72	 **/
    73	static int ngbe_probe(struct pci_dev *pdev,
    74			      const struct pci_device_id __always_unused *ent)
    75	{
    76		struct ngbe_adapter *adapter = NULL;
    77		struct net_device *netdev;
    78		int err;
    79	
    80		err = pci_enable_device_mem(pdev);
    81		if (err)
    82			return err;
    83	
    84		err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
    85		if (err) {
    86			dev_err(&pdev->dev,
    87				"No usable DMA configuration, aborting\n");
    88			goto err_pci_disable_dev;
    89		}
    90	
    91		err = pci_request_selected_regions(pdev,
    92						   pci_select_bars(pdev, IORESOURCE_MEM),
    93						   ngbe_driver_name);
    94		if (err) {
    95			dev_err(&pdev->dev,
    96				"pci_request_selected_regions failed 0x%x\n", err);
    97			goto err_pci_disable_dev;
    98		}
    99	
   100		pci_enable_pcie_error_reporting(pdev);
   101		pci_set_master(pdev);
   102	
   103		netdev = devm_alloc_etherdev_mqs(&pdev->dev,
   104						 sizeof(struct ngbe_adapter),
 > 105						 ngbe_MAX_TX_QUEUES,
 > 106						 ngbe_MAX_RX_QUEUES);
   107		if (!netdev) {
   108			err = -ENOMEM;
   109			goto err_pci_release_regions;
   110		}
   111	
   112		SET_NETDEV_DEV(netdev, &pdev->dev);
   113	
   114		adapter = netdev_priv(netdev);
   115		adapter->netdev = netdev;
   116		adapter->pdev = pdev;
   117	
   118		adapter->io_addr = devm_ioremap(&pdev->dev,
   119						pci_resource_start(pdev, 0),
   120						pci_resource_len(pdev, 0));
   121		if (!adapter->io_addr) {
   122			err = -EIO;
   123			goto err_pci_release_regions;
   124		}
   125	
   126		netdev->features |= NETIF_F_HIGHDMA;
   127	
   128		pci_set_drvdata(pdev, adapter);
   129	
   130		return 0;
   131	
   132	err_pci_release_regions:
   133		pci_disable_pcie_error_reporting(pdev);
   134		pci_release_selected_regions(pdev,
   135					     pci_select_bars(pdev, IORESOURCE_MEM));
   136	err_pci_disable_dev:
   137		pci_disable_device(pdev);
   138		return err;
   139	}
   140	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply

* Re: [PATCH bpf-next 05/15] bpf: Fix incorrect mem_cgroup_put
From: Shakeel Butt @ 2022-08-10 17:07 UTC (permalink / raw)
  To: Yafang Shao
  Cc: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, hannes, mhocko, roman.gushchin,
	songmuchun, akpm, netdev, bpf, linux-mm
In-Reply-To: <20220810151840.16394-6-laoar.shao@gmail.com>

On Wed, Aug 10, 2022 at 03:18:30PM +0000, Yafang Shao wrote:
> The memcg may be the root_mem_cgroup, in which case we shouldn't put it.

No, it is ok to put root_mem_cgroup. css_put already handles the root
cgroups.
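
A minimal sketch of why the put is harmless, assuming the root css is marked
as not reference-counted so that get and put both become no-ops for it. The
sketch_* names and the struct layout are hypothetical and much simpler than
the kernel's.

```c
#include <stdbool.h>

/* Hypothetical, simplified model of a css with a "no refcount" marker. */
struct sketch_css {
	bool no_ref;	/* set for the root: references are not counted */
	long refcnt;
};

/* Take a reference; a no-op when the css is not reference-counted. */
static void sketch_css_get(struct sketch_css *css)
{
	if (!css->no_ref)
		css->refcnt++;
}

/* Drop a reference; a no-op when the css is not reference-counted. */
static void sketch_css_put(struct sketch_css *css)
{
	if (!css->no_ref)
		css->refcnt--;
}
```

Under that assumption, callers do not need a special case for the root: an
unbalanced-looking put on it changes nothing.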


^ permalink raw reply

* Re: [syzbot] WARNING in ieee80211_ibss_csa_beacon
From: syzbot @ 2022-08-10 16:47 UTC (permalink / raw)
  To: code, davem, johannes, kuba, linux-kernel, linux-wireless, netdev,
	syzkaller-bugs
In-Reply-To: <20220810113551.344792-1-code@siddh.me>

Hello,

syzbot tried to test the proposed patch but the build/boot failed:

tered promiscuous mode
[   49.294465][ T3636] bond0: (slave bond_slave_0): Enslaving as an active interface with an up link
[   49.305282][ T3636] bond0: (slave bond_slave_1): Enslaving as an active interface with an up link
[   49.325908][ T3636] team0: Port device team_slave_0 added
[   49.333047][ T3636] team0: Port device team_slave_1 added
[   49.350306][ T3636] batman_adv: batadv0: Adding interface: batadv_slave_0
[   49.357336][ T3636] batman_adv: batadv0: The MTU of interface batadv_slave_0 is too small (1500) to handle the transport of batman-adv packets. Packets going over this interface will be fragmented on layer2 which could impact the performance. Setting the MTU to 1560 would solve the problem.
[   49.383401][ T3636] batman_adv: batadv0: Not using interface batadv_slave_0 (retrying later): interface not active
[   49.395845][ T3636] batman_adv: batadv0: Adding interface: batadv_slave_1
[   49.402957][ T3636] batman_adv: batadv0: The MTU of interface batadv_slave_1 is too small (1500) to handle the transport of batman-adv packets. Packets going over this interface will be fragmented on layer2 which could impact the performance. Setting the MTU to 1560 would solve the problem.
[   49.430471][ T3636] batman_adv: batadv0: Not using interface batadv_slave_1 (retrying later): interface not active
[   49.455720][ T3636] device hsr_slave_0 entered promiscuous mode
[   49.463006][ T3636] device hsr_slave_1 entered promiscuous mode
[   49.538340][ T3636] netdevsim netdevsim0 netdevsim0: renamed from eth0
[   49.549079][ T3636] netdevsim netdevsim0 netdevsim1: renamed from eth1
[   49.558155][ T3636] netdevsim netdevsim0 netdevsim2: renamed from eth2
[   49.569133][ T3636] netdevsim netdevsim0 netdevsim3: renamed from eth3
[   49.590785][ T3636] bridge0: port 2(bridge_slave_1) entered blocking state
[   49.597986][ T3636] bridge0: port 2(bridge_slave_1) entered forwarding state
[   49.605904][ T3636] bridge0: port 1(bridge_slave_0) entered blocking state
[   49.613050][ T3636] bridge0: port 1(bridge_slave_0) entered forwarding state
[   49.657283][ T3636] 8021q: adding VLAN 0 to HW filter on device bond0
[   49.669522][   T14] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
[   49.679945][   T14] bridge0: port 1(bridge_slave_0) entered disabled state
[   49.688892][   T14] bridge0: port 2(bridge_slave_1) entered disabled state
[   49.697602][   T14] IPv6: ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
[   49.710449][ T3636] 8021q: adding VLAN 0 to HW filter on device team0
[   49.722894][ T3647] IPv6: ADDRCONF(NETDEV_CHANGE): bridge_slave_0: link becomes ready
[   49.732572][ T3647] bridge0: port 1(bridge_slave_0) entered blocking state
[   49.739646][ T3647] bridge0: port 1(bridge_slave_0) entered forwarding state
[   49.750696][  T923] IPv6: ADDRCONF(NETDEV_CHANGE): bridge_slave_1: link becomes ready
[   49.759168][  T923] bridge0: port 2(bridge_slave_1) entered blocking state
[   49.766347][  T923] bridge0: port 2(bridge_slave_1) entered forwarding state
[   49.783139][ T3647] IPv6: ADDRCONF(NETDEV_CHANGE): team_slave_0: link becomes ready
[   49.798118][ T3647] IPv6: ADDRCONF(NETDEV_CHANGE): team0: link becomes ready
[   49.807101][ T3647] IPv6: ADDRCONF(NETDEV_CHANGE): team_slave_1: link becomes ready
[   49.816367][ T3647] IPv6: ADDRCONF(NETDEV_CHANGE): hsr_slave_0: link becomes ready
[   49.828659][ T3636] hsr0: Slave B (hsr_slave_1) is not up; please bring it up to get a fully working HSR network
[   49.841622][ T3636] IPv6: ADDRCONF(NETDEV_CHANGE): hsr0: link becomes ready
[   49.849961][   T14] IPv6: ADDRCONF(NETDEV_CHANGE): hsr_slave_1: link becomes ready
[   49.867463][  T923] IPv6: ADDRCONF(NETDEV_CHANGE): vxcan0: link becomes ready
[   49.875057][  T923] IPv6: ADDRCONF(NETDEV_CHANGE): vxcan1: link becomes ready
[   49.887724][ T3636] 8021q: adding VLAN 0 to HW filter on device batadv0
[   49.991352][   T14] IPv6: ADDRCONF(NETDEV_CHANGE): veth0_virt_wifi: link becomes ready
[   50.007687][   T14] IPv6: ADDRCONF(NETDEV_CHANGE): veth0_vlan: link becomes ready
[   50.016485][   T14] IPv6: ADDRCONF(NETDEV_CHANGE): vlan0: link becomes ready
[   50.024664][   T14] IPv6: ADDRCONF(NETDEV_CHANGE): vlan1: link becomes ready
[   50.034755][ T3636] device veth0_vlan entered promiscuous mode
[   50.047971][ T3636] device veth1_vlan entered promiscuous mode
[   50.067469][   T14] IPv6: ADDRCONF(NETDEV_CHANGE): macvlan0: link becomes ready
[   50.075584][   T14] IPv6: ADDRCONF(NETDEV_CHANGE): macvlan1: link becomes ready
[   50.084115][   T14] IPv6: ADDRCONF(NETDEV_CHANGE): veth0_macvtap: link becomes ready
[   50.095890][ T3636] device veth0_macvtap entered promiscuous mode
[   50.105744][ T3636] device veth1_macvtap entered promiscuous mode
[   50.120925][ T3636] batman_adv: batadv0: Interface activated: batadv_slave_0
[   50.129807][   T14] IPv6: ADDRCONF(NETDEV_CHANGE): veth0_to_batadv: link becomes ready
[   50.139778][   T14] IPv6: ADDRCONF(NETDEV_CHANGE): macvtap0: link becomes ready
[   50.152837][ T3636] batman_adv: batadv0: Interface activated: batadv_slave_1
[   50.161478][   T14] IPv6: ADDRCONF(NETDEV_CHANGE): veth1_to_batadv: link becomes ready
[   50.172240][ T3636] netdevsim netdevsim0 netdevsim0: set [1, 0] type 2 family 0 port 6081 - 0
[   50.182764][ T3636] netdevsim netdevsim0 netdevsim1: set [1, 0] type 2 family 0 port 6081 - 0
[   50.192635][ T3636] netdevsim netdevsim0 netdevsim2: set [1, 0] type 2 family 0 port 6081 - 0
[   50.202479][ T3636] netdevsim netdevsim0 netdevsim3: set [1, 0] type 2 family 0 port 6081 - 0
[   50.258761][   T33] wlan0: Created IBSS using preconfigured BSSID 50:50:50:50:50:50
[   50.276234][   T33] wlan0: Creating new IBSS network, BSSID 50:50:50:50:50:50
[   50.292455][   T22] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
[   50.306188][   T11] wlan1: Created IBSS using preconfigured BSSID 50:50:50:50:50:50
[   50.315505][   T11] wlan1: Creating new IBSS network, BSSID 50:50:50:50:50:50
[   50.325576][   T14] IPv6: ADDRCONF(NETDEV_CHANGE): wlan1: link becomes ready
2022/08/10 16:46:13 building call list...
[   50.505046][ T3636] ------------[ cut here ]------------
[   50.510773][ T3636] ODEBUG: assert_init not available (active state 0) object type: timer_list hint: 0x0
[   50.520732][ T3636] WARNING: CPU: 1 PID: 3636 at lib/debugobjects.c:505 debug_object_assert_init+0x1fa/0x250
[   50.530739][ T3636] Modules linked in:
[   50.534652][ T3636] CPU: 1 PID: 3636 Comm: syz-executor.0 Not tainted 5.19.0-syzkaller #0
[   50.542991][ T3636] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/22/2022
[   50.553063][ T3636] RIP: 0010:debug_object_assert_init+0x1fa/0x250
[   50.559406][ T3636] Code: e8 bb d2 d1 fd 4c 8b 45 00 48 c7 c7 20 96 6a 8a 48 c7 c6 20 93 6a 8a 48 c7 c2 c0 97 6a 8a 31 c9 49 89 d9 31 c0 e8 86 cd 4e fd <0f> 0b ff 05 da 58 8a 09 48 83 c5 38 48 89 e8 48 c1 e8 03 42 80 3c
[   50.579117][ T3636] RSP: 0018:ffffc9000392f8c8 EFLAGS: 00010046
[   50.585300][ T3636] RAX: 8bc764758f9d2d00 RBX: 0000000000000000 RCX: ffff88807f27ba80
[   50.593296][ T3636] RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
[   50.601277][ T3636] RBP: ffffffff8a0fc700 R08: ffffffff8165ed3d R09: ffffed10173a4f14
[   50.609266][ T3636] R10: ffffed10173a4f14 R11: 1ffff110173a4f13 R12: dffffc0000000000
[   50.617255][ T3636] R13: ffff88801bea49d0 R14: 0000000000000015 R15: ffffffff900beb38
[   50.625245][ T3636] FS:  0000000000000000(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
[   50.634196][ T3636] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   50.641255][ T3636] CR2: 00007fe56a2e1200 CR3: 0000000011c4e000 CR4: 00000000003506e0
[   50.649282][ T3636] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   50.657280][ T3636] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   50.665286][ T3636] Call Trace:
[   50.668606][ T3636]  <TASK>
[   50.671567][ T3636]  del_timer+0x3d/0x2d0
[   50.675770][ T3636]  ? try_to_grab_pending+0xb1/0x700
[   50.681004][ T3636]  try_to_grab_pending+0xbf/0x700
[   50.686321][ T3636]  __cancel_work_timer+0x81/0x5b0
[   50.691373][ T3636]  ? mgmt_send_event_skb+0x2ee/0x4e0
[   50.696805][ T3636]  ? kmem_cache_free+0x95/0x1d0
[   50.701675][ T3636]  ? mgmt_send_event_skb+0x2ee/0x4e0
[   50.706989][ T3636]  mgmt_index_removed+0x244/0x330
[   50.712032][ T3636]  hci_unregister_dev+0x28e/0x460
[   50.718115][ T3636]  ? vhci_open+0x360/0x360
[   50.722542][ T3636]  vhci_release+0x7f/0xd0
[   50.726883][ T3636]  __fput+0x3b9/0x820
[   50.730896][ T3636]  task_work_run+0x146/0x1c0
[   50.735510][ T3636]  do_exit+0x4ed/0x1f30
[   50.739669][ T3636]  ? rcu_read_lock_sched_held+0x41/0xb0
[   50.745233][ T3636]  do_group_exit+0x23b/0x2f0
[   50.749828][ T3636]  ? _raw_spin_unlock_irq+0x1f/0x40
[   50.755023][ T3636]  ? lockdep_hardirqs_on+0x8d/0x130
[   50.760218][ T3636]  get_signal+0x16a3/0x1700
[   50.766302][ T3636]  arch_do_signal_or_restart+0x29/0x5d0
[   50.771852][ T3636]  exit_to_user_mode_loop+0x74/0x150
[   50.777133][ T3636]  exit_to_user_mode_prepare+0xb2/0x140
[   50.782695][ T3636]  syscall_exit_to_user_mode+0x26/0x60
[   50.788737][ T3636]  do_syscall_64+0x49/0x90
[   50.793176][ T3636]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[   50.799177][ T3636] RIP: 0033:0x4191dc
[   50.803063][ T3636] Code: Unable to access opcode bytes at RIP 0x4191b2.
[   50.809916][ T3636] RSP: 002b:00007ffe6c6d7830 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[   50.818354][ T3636] RAX: fffffffffffffe00 RBX: 00007ffe6c6d78f0 RCX: 00000000004191dc
[   50.826326][ T3636] RDX: 0000000000000050 RSI: 0000000000568020 RDI: 00000000000000f9
[   50.834295][ T3636] RBP: 0000000000000003 R08: 0000000000000000 R09: 0079746972756365
[   50.842269][ T3636] R10: 00000000005436a0 R11: 0000000000000246 R12: 0000000000000032
[   50.850229][ T3636] R13: 000000000000c4c0 R14: 0000000000000000 R15: 00007ffe6c6d7930
[   50.858211][ T3636]  </TASK>
[   50.861256][ T3636] Kernel panic - not syncing: panic_on_warn set ...
[   50.867835][ T3636] CPU: 1 PID: 3636 Comm: syz-executor.0 Not tainted 5.19.0-syzkaller #0
[   50.876158][ T3636] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/22/2022
[   50.886289][ T3636] Call Trace:
[   50.890527][ T3636]  <TASK>
[   50.893448][ T3636]  dump_stack_lvl+0x131/0x1c8
[   50.898221][ T3636]  panic+0x26b/0x693
[   50.902113][ T3636]  ? __warn+0x131/0x220
[   50.906266][ T3636]  ? debug_object_assert_init+0x1fa/0x250
[   50.912064][ T3636]  __warn+0x1fa/0x220
[   50.916054][ T3636]  ? debug_object_assert_init+0x1fa/0x250
[   50.921777][ T3636]  report_bug+0x1b3/0x2d0
[   50.926103][ T3636]  handle_bug+0x3d/0x70
[   50.930513][ T3636]  exc_invalid_op+0x16/0x40
[   50.935009][ T3636]  asm_exc_invalid_op+0x16/0x20
[   50.939919][ T3636] RIP: 0010:debug_object_assert_init+0x1fa/0x250
[   50.946259][ T3636] Code: e8 bb d2 d1 fd 4c 8b 45 00 48 c7 c7 20 96 6a 8a 48 c7 c6 20 93 6a 8a 48 c7 c2 c0 97 6a 8a 31 c9 49 89 d9 31 c0 e8 86 cd 4e fd <0f> 0b ff 05 da 58 8a 09 48 83 c5 38 48 89 e8 48 c1 e8 03 42 80 3c
[   50.965962][ T3636] RSP: 0018:ffffc9000392f8c8 EFLAGS: 00010046
[   50.972034][ T3636] RAX: 8bc764758f9d2d00 RBX: 0000000000000000 RCX: ffff88807f27ba80
[   50.980009][ T3636] RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
[   50.988148][ T3636] RBP: ffffffff8a0fc700 R08: ffffffff8165ed3d R09: ffffed10173a4f14
[   50.996123][ T3636] R10: ffffed10173a4f14 R11: 1ffff110173a4f13 R12: dffffc0000000000
[   51.004180][ T3636] R13: ffff88801bea49d0 R14: 0000000000000015 R15: ffffffff900beb38
[   51.012153][ T3636]  ? __wake_up_klogd+0xcd/0x100
[   51.017277][ T3636]  ? debug_object_assert_init+0x1fa/0x250
[   51.023040][ T3636]  del_timer+0x3d/0x2d0
[   51.027291][ T3636]  ? try_to_grab_pending+0xb1/0x700
[   51.032705][ T3636]  try_to_grab_pending+0xbf/0x700
[   51.037752][ T3636]  __cancel_work_timer+0x81/0x5b0
[   51.042785][ T3636]  ? mgmt_send_event_skb+0x2ee/0x4e0
[   51.048148][ T3636]  ? kmem_cache_free+0x95/0x1d0
[   51.053085][ T3636]  ? mgmt_send_event_skb+0x2ee/0x4e0
[   51.058748][ T3636]  mgmt_index_removed+0x244/0x330
[   51.063855][ T3636]  hci_unregister_dev+0x28e/0x460
[   51.069135][ T3636]  ? vhci_open+0x360/0x360
[   51.073542][ T3636]  vhci_release+0x7f/0xd0
[   51.077997][ T3636]  __fput+0x3b9/0x820
[   51.082080][ T3636]  task_work_run+0x146/0x1c0
[   51.086847][ T3636]  do_exit+0x4ed/0x1f30
[   51.091007][ T3636]  ? rcu_read_lock_sched_held+0x41/0xb0
[   51.096557][ T3636]  do_group_exit+0x23b/0x2f0
[   51.101224][ T3636]  ? _raw_spin_unlock_irq+0x1f/0x40
[   51.106439][ T3636]  ? lockdep_hardirqs_on+0x8d/0x130
[   51.111641][ T3636]  get_signal+0x16a3/0x1700
[   51.116151][ T3636]  arch_do_signal_or_restart+0x29/0x5d0
[   51.121725][ T3636]  exit_to_user_mode_loop+0x74/0x150
[   51.127039][ T3636]  exit_to_user_mode_prepare+0xb2/0x140
[   51.132611][ T3636]  syscall_exit_to_user_mode+0x26/0x60
[   51.138271][ T3636]  do_syscall_64+0x49/0x90
[   51.143033][ T3636]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[   51.149376][ T3636] RIP: 0033:0x4191dc
[   51.153293][ T3636] Code: Unable to access opcode bytes at RIP 0x4191b2.
[   51.160326][ T3636] RSP: 002b:00007ffe6c6d7830 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[   51.170079][ T3636] RAX: fffffffffffffe00 RBX: 00007ffe6c6d78f0 RCX: 00000000004191dc
[   51.178052][ T3636] RDX: 0000000000000050 RSI: 0000000000568020 RDI: 00000000000000f9
[   51.186026][ T3636] RBP: 0000000000000003 R08: 0000000000000000 R09: 0079746972756365
[   51.194096][ T3636] R10: 00000000005436a0 R11: 0000000000000246 R12: 0000000000000032
[   51.202061][ T3636] R13: 000000000000c4c0 R14: 0000000000000000 R15: 00007ffe6c6d7930
[   51.210491][ T3636]  </TASK>
[   51.213889][ T3636] Kernel Offset: disabled
[   51.218357][ T3636] Rebooting in 86400 seconds..


syzkaller build log:
go env (err=<nil>)
GO111MODULE="auto"
GOARCH="amd64"
GOBIN=""
GOCACHE="/syzkaller/.cache/go-build"
GOENV="/syzkaller/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/syzkaller/jobs/linux/gopath/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/syzkaller/jobs/linux/gopath"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.17"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/syzkaller/jobs/linux/gopath/src/github.com/google/syzkaller/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build478383173=/tmp/go-build -gno-record-gcc-switches"

git status (err=<nil>)
HEAD detached at 607e3baf1
nothing to commit, working tree clean


go list -f '{{.Stale}}' ./sys/syz-sysgen | grep -q false || go install ./sys/syz-sysgen
make .descriptions
bin/syz-sysgen
touch .descriptions
GOOS=linux GOARCH=amd64 go build "-ldflags=-s -w -X github.com/google/syzkaller/prog.GitRevision=607e3baf1c25928040d05fc22eff6fce7edd709e -X 'github.com/google/syzkaller/prog.gitRevisionDate=20210324-183421'" "-tags=syz_target syz_os_linux syz_arch_amd64 " -o ./bin/linux_amd64/syz-fuzzer github.com/google/syzkaller/syz-fuzzer
GOOS=linux GOARCH=amd64 go build "-ldflags=-s -w -X github.com/google/syzkaller/prog.GitRevision=607e3baf1c25928040d05fc22eff6fce7edd709e -X 'github.com/google/syzkaller/prog.gitRevisionDate=20210324-183421'" "-tags=syz_target syz_os_linux syz_arch_amd64 " -o ./bin/linux_amd64/syz-execprog github.com/google/syzkaller/tools/syz-execprog
GOOS=linux GOARCH=amd64 go build "-ldflags=-s -w -X github.com/google/syzkaller/prog.GitRevision=607e3baf1c25928040d05fc22eff6fce7edd709e -X 'github.com/google/syzkaller/prog.gitRevisionDate=20210324-183421'" "-tags=syz_target syz_os_linux syz_arch_amd64 " -o ./bin/linux_amd64/syz-stress github.com/google/syzkaller/tools/syz-stress
mkdir -p ./bin/linux_amd64
gcc -o ./bin/linux_amd64/syz-executor executor/executor.cc \
	-m64 -O2 -pthread -Wall -Werror -Wparentheses -Wunused-const-variable -Wframe-larger-than=16384 -static -fpermissive -w -DGOOS_linux=1 -DGOARCH_amd64=1 \
	-DHOSTGOOS_linux=1 -DGIT_REVISION=\"607e3baf1c25928040d05fc22eff6fce7edd709e\"


Error text is too large and was truncated, full error text is at:
https://syzkaller.appspot.com/x/error.txt?x=149def63080000


Tested on:

commit:         d4252071 add barriers to buffer_uptodate and set_buffe..
git tree:       https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
kernel config:  https://syzkaller.appspot.com/x/.config?x=aac0e3f739de465e
dashboard link: https://syzkaller.appspot.com/bug?extid=b6c9fe29aefe68e4ad34
compiler:       Debian clang version 13.0.1-++20220126092033+75e33f71c2da-1~exp1~20220126212112.63, GNU ld (GNU Binutils for Debian) 2.35.2
patch:          https://syzkaller.appspot.com/x/patch.diff?x=12593366080000


^ permalink raw reply

* [PATCH net] net: atm: bring back zatm uAPI
From: Jakub Kicinski @ 2022-08-10 16:45 UTC (permalink / raw)
  To: davem; +Cc: netdev, edumazet, pabeni, jirislaby, arnd, Jakub Kicinski

Jiri reports that linux-atm does not build without this header.
Bring it back. It's completely dead code but we can't break
the build for user space :(

Reported-by: Jiri Slaby <jirislaby@kernel.org>
Fixes: 052e1f01bfae ("net: atm: remove support for ZeitNet ZN122x ATM devices")
Link: https://lore.kernel.org/all/8576aef3-37e4-8bae-bab5-08f82a78efd3@kernel.org/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 include/uapi/linux/atm_zatm.h | 47 +++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)
 create mode 100644 include/uapi/linux/atm_zatm.h

diff --git a/include/uapi/linux/atm_zatm.h b/include/uapi/linux/atm_zatm.h
new file mode 100644
index 000000000000..5135027b93c1
--- /dev/null
+++ b/include/uapi/linux/atm_zatm.h
@@ -0,0 +1,47 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/* atm_zatm.h - Driver-specific declarations of the ZATM driver (for use by
+		driver-specific utilities) */
+
+/* Written 1995-1999 by Werner Almesberger, EPFL LRC/ICA */
+
+
+#ifndef LINUX_ATM_ZATM_H
+#define LINUX_ATM_ZATM_H
+
+/*
+ * Note: non-kernel programs including this file must also include
+ * sys/types.h for struct timeval
+ */
+
+#include <linux/atmapi.h>
+#include <linux/atmioc.h>
+
+#define ZATM_GETPOOL	_IOW('a',ATMIOC_SARPRV+1,struct atmif_sioc)
+						/* get pool statistics */
+#define ZATM_GETPOOLZ	_IOW('a',ATMIOC_SARPRV+2,struct atmif_sioc)
+						/* get statistics and zero */
+#define ZATM_SETPOOL	_IOW('a',ATMIOC_SARPRV+3,struct atmif_sioc)
+						/* set pool parameters */
+
+struct zatm_pool_info {
+	int ref_count;			/* free buffer pool usage counters */
+	int low_water,high_water;	/* refill parameters */
+	int rqa_count,rqu_count;	/* queue condition counters */
+	int offset,next_off;		/* alignment optimizations: offset */
+	int next_cnt,next_thres;	/* repetition counter and threshold */
+};
+
+struct zatm_pool_req {
+	int pool_num;			/* pool number */
+	struct zatm_pool_info info;	/* actual information */
+};
+
+#define ZATM_OAM_POOL		0	/* free buffer pool for OAM cells */
+#define ZATM_AAL0_POOL		1	/* free buffer pool for AAL0 cells */
+#define ZATM_AAL5_POOL_BASE	2	/* first AAL5 free buffer pool */
+#define ZATM_LAST_POOL	ZATM_AAL5_POOL_BASE+10 /* max. 64 kB */
+
+#define ZATM_TIMER_HISTORY_SIZE	16	/* number of timer adjustments to
+					   record; must be 2^n */
+
+#endif
-- 
2.37.1


^ permalink raw reply related

* Re: [PATCH net-next 3/6] net: atm: remove support for ZeitNet ZN122x ATM devices
From: Jakub Kicinski @ 2022-08-10 16:42 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Jiri Slaby, davem, pabeni, netdev, Chas Williams,
	linux-atm-general, Thomas Bogendoerfer, linux-mips
In-Reply-To: <CAK8P3a01yfeg-3QO=MeDG7JzXEsTGxK+vMpFJ83SGwPto4AOxw@mail.gmail.com>

On Wed, 10 Aug 2022 11:11:32 +0200 Arnd Bergmann wrote:
> > This unfortunately breaks linux-atm:
> > zntune.c:18:10: fatal error: linux/atm_zatm.h: No such file or directory
> >
> > The source does also:
> > ioctl(s,ZATM_SETPOOL,&sioc)
> > ioctl(s,zero ? ZATM_GETPOOLZ : ZATM_GETPOOL,&sioc)
> > etc.
> >
> > So we should likely revert the below:  
> 
> I suppose there is no chance of also getting the linux-atm package updated
> to not include those source files, right? The last release I found on
> sourceforge is 12 years old, but maybe I was looking in the wrong place.

Is linux-atm used for something remotely modern? PPPoA? Maybe it's 
time to ditch it completely? I'll send the revert in any case.

^ permalink raw reply

* Re: [PATCH bpf-next v5 0/3] destructive bpf_kfuncs
From: patchwork-bot+netdevbpf @ 2022-08-10 16:30 UTC (permalink / raw)
  To: Artem Savkov
  Cc: ast, daniel, andrii, bpf, netdev, linux-kernel, aarcange, dvacek,
	olsajiri, song, dxu, memxor
In-Reply-To: <20220810065905.475418-1-asavkov@redhat.com>

Hello:

This series was applied to bpf/bpf-next.git (master)
by Alexei Starovoitov <ast@kernel.org>:

On Wed, 10 Aug 2022 08:59:02 +0200 you wrote:
> eBPF is often used for kernel debugging, and one of the widely used and
> powerful debugging techniques is post-mortem debugging with a full memory dump.
> Triggering a panic at exactly the right moment allows the user to get such a
> dump and thus a better view at the system's state. Right now the only way to
> do this in BPF is to signal userspace to trigger kexec/panic. This is
> suboptimal as going through userspace requires context changes and adds
> significant delays taking system further away from "the right moment". On a
> single-cpu system the situation is even worse because BPF program won't even be
> able to block the thread of interest.
> 
> [...]

Here is the summary with links:
  - [bpf-next,v5,1/3] bpf: add destructive kfunc flag
    https://git.kernel.org/bpf/bpf-next/c/4dd48c6f1f83
  - [bpf-next,v5,2/3] bpf: export crash_kexec() as destructive kfunc
    https://git.kernel.org/bpf/bpf-next/c/133790596406
  - [bpf-next,v5,3/3] selftests/bpf: add destructive kfunc test
    https://git.kernel.org/bpf/bpf-next/c/e33894581675

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* [syzbot] WARNING: suspicious RCU usage in add_v4_addrs
From: syzbot @ 2022-08-10 16:28 UTC (permalink / raw)
  To: davem, dsahern, edumazet, kuba, linux-kernel, netdev, pabeni,
	syzkaller-bugs, yoshfuji

Hello,

syzbot found the following issue on:

HEAD commit:    0966d385830d riscv: Fix auipc+jalr relocation range checks
git tree:       git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux.git fixes
console output: https://syzkaller.appspot.com/x/log.txt?x=17b4436a080000
kernel config:  https://syzkaller.appspot.com/x/.config?x=6295d67591064921
dashboard link: https://syzkaller.appspot.com/bug?extid=27aad254a5e7479997ed
compiler:       riscv64-linux-gnu-gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
userspace arch: riscv64

Unfortunately, I don't have any reproducer for this issue yet.

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+27aad254a5e7479997ed@syzkaller.appspotmail.com

=============================
WARNING: suspicious RCU usage
5.17.0-rc1-syzkaller-00002-g0966d385830d #0 Not tainted
-----------------------------
net/ipv6/addrconf.c:3140 suspicious rcu_dereference_protected() usage!

other info that might help us debug this:


rcu_scheduler_active = 2, debug_locks = 1
1 lock held by syz-executor.1/2048:
 #0: 000000c00135d600 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock net/core/rtnetlink.c:72 [inline]
 #0: 000000c00135d600 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x2fe/0x9a0 net/core/rtnetlink.c:5589

stack backtrace:
CPU: 0 PID: 2048 Comm: syz-executor.1 Not tainted 5.17.0-rc1-syzkaller-00002-g0966d385830d #0
Hardware name: riscv-virtio,qemu (DT)
Call Trace:
[<ffffffff8000a228>] dump_backtrace+0x2e/0x3c arch/riscv/kernel/stacktrace.c:113
[<ffffffff831668cc>] show_stack+0x34/0x40 arch/riscv/kernel/stacktrace.c:119
[<ffffffff831756ba>] __dump_stack lib/dump_stack.c:88 [inline]
[<ffffffff831756ba>] dump_stack_lvl+0xe4/0x150 lib/dump_stack.c:106
[<ffffffff83175742>] dump_stack+0x1c/0x24 lib/dump_stack.c:113
[<ffffffff831689d2>] lockdep_rcu_suspicious+0x106/0x118 kernel/locking/lockdep.c:6563
[<ffffffff82d414d0>] add_v4_addrs+0x566/0x640 net/ipv6/addrconf.c:3140
[<ffffffff82d4e322>] addrconf_gre_config net/ipv6/addrconf.c:3425 [inline]
[<ffffffff82d4e322>] addrconf_notify+0x784/0x1360 net/ipv6/addrconf.c:3605
[<ffffffff800aac84>] notifier_call_chain+0xb8/0x188 kernel/notifier.c:84
[<ffffffff800aad7e>] raw_notifier_call_chain+0x2a/0x38 kernel/notifier.c:392
[<ffffffff8271d086>] call_netdevice_notifiers_info+0x9e/0x10c net/core/dev.c:1919
[<ffffffff827422c8>] call_netdevice_notifiers_extack net/core/dev.c:1931 [inline]
[<ffffffff827422c8>] call_netdevice_notifiers net/core/dev.c:1945 [inline]
[<ffffffff827422c8>] __dev_notify_flags+0x108/0x1fa net/core/dev.c:8179
[<ffffffff827436f6>] dev_change_flags+0x9c/0xba net/core/dev.c:8215
[<ffffffff82767e16>] do_setlink+0x5d6/0x21c4 net/core/rtnetlink.c:2729
[<ffffffff8276a6a2>] __rtnl_newlink+0x99e/0xfa0 net/core/rtnetlink.c:3412
[<ffffffff8276ad04>] rtnl_newlink+0x60/0x8c net/core/rtnetlink.c:3527
[<ffffffff8276b46c>] rtnetlink_rcv_msg+0x338/0x9a0 net/core/rtnetlink.c:5592
[<ffffffff8296ded2>] netlink_rcv_skb+0xf8/0x2be net/netlink/af_netlink.c:2494
[<ffffffff827624f4>] rtnetlink_rcv+0x26/0x30 net/core/rtnetlink.c:5610
[<ffffffff8296cbcc>] netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
[<ffffffff8296cbcc>] netlink_unicast+0x40e/0x5fe net/netlink/af_netlink.c:1343
[<ffffffff8296d29c>] netlink_sendmsg+0x4e0/0x994 net/netlink/af_netlink.c:1919
[<ffffffff826d264e>] sock_sendmsg_nosec net/socket.c:705 [inline]
[<ffffffff826d264e>] sock_sendmsg+0xa0/0xc4 net/socket.c:725
[<ffffffff826d7026>] __sys_sendto+0x1f2/0x2e0 net/socket.c:2040
[<ffffffff826d7152>] __do_sys_sendto net/socket.c:2052 [inline]
[<ffffffff826d7152>] sys_sendto+0x3e/0x52 net/socket.c:2048
[<ffffffff80005716>] ret_from_syscall+0x0/0x2

=============================
WARNING: suspicious RCU usage
5.17.0-rc1-syzkaller-00002-g0966d385830d #0 Not tainted
-----------------------------
include/linux/inetdevice.h:249 suspicious rcu_dereference_protected() usage!

other info that might help us debug this:


rcu_scheduler_active = 2, debug_locks = 1
1 lock held by syz-executor.1/2048:
 #0: 000000c00135d600 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock net/core/rtnetlink.c:72 [inline]
 #0: 000000c00135d600 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x2fe/0x9a0 net/core/rtnetlink.c:5589

stack backtrace:
CPU: 0 PID: 2048 Comm: syz-executor.1 Not tainted 5.17.0-rc1-syzkaller-00002-g0966d385830d #0
Hardware name: riscv-virtio,qemu (DT)
Call Trace:
[<ffffffff8000a228>] dump_backtrace+0x2e/0x3c arch/riscv/kernel/stacktrace.c:113
[<ffffffff831668cc>] show_stack+0x34/0x40 arch/riscv/kernel/stacktrace.c:119
[<ffffffff831756ba>] __dump_stack lib/dump_stack.c:88 [inline]
[<ffffffff831756ba>] dump_stack_lvl+0xe4/0x150 lib/dump_stack.c:106
[<ffffffff83175742>] dump_stack+0x1c/0x24 lib/dump_stack.c:113
[<ffffffff831689d2>] lockdep_rcu_suspicious+0x106/0x118 kernel/locking/lockdep.c:6563
[<ffffffff82d412fe>] __in_dev_get_rtnl include/linux/inetdevice.h:249 [inline]
[<ffffffff82d412fe>] add_v4_addrs+0x394/0x640 net/ipv6/addrconf.c:3135
[<ffffffff82d4e322>] addrconf_gre_config net/ipv6/addrconf.c:3425 [inline]
[<ffffffff82d4e322>] addrconf_notify+0x784/0x1360 net/ipv6/addrconf.c:3605
[<ffffffff800aac84>] notifier_call_chain+0xb8/0x188 kernel/notifier.c:84
[<ffffffff800aad7e>] raw_notifier_call_chain+0x2a/0x38 kernel/notifier.c:392
[<ffffffff8271d086>] call_netdevice_notifiers_info+0x9e/0x10c net/core/dev.c:1919
[<ffffffff827422c8>] call_netdevice_notifiers_extack net/core/dev.c:1931 [inline]
[<ffffffff827422c8>] call_netdevice_notifiers net/core/dev.c:1945 [inline]
[<ffffffff827422c8>] __dev_notify_flags+0x108/0x1fa net/core/dev.c:8179
[<ffffffff827436f6>] dev_change_flags+0x9c/0xba net/core/dev.c:8215
[<ffffffff82767e16>] do_setlink+0x5d6/0x21c4 net/core/rtnetlink.c:2729
[<ffffffff8276a6a2>] __rtnl_newlink+0x99e/0xfa0 net/core/rtnetlink.c:3412
[<ffffffff8276ad04>] rtnl_newlink+0x60/0x8c net/core/rtnetlink.c:3527
[<ffffffff8276b46c>] rtnetlink_rcv_msg+0x338/0x9a0 net/core/rtnetlink.c:5592
[<ffffffff8296ded2>] netlink_rcv_skb+0xf8/0x2be net/netlink/af_netlink.c:2494
[<ffffffff827624f4>] rtnetlink_rcv+0x26/0x30 net/core/rtnetlink.c:5610
[<ffffffff8296cbcc>] netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
[<ffffffff8296cbcc>] netlink_unicast+0x40e/0x5fe net/netlink/af_netlink.c:1343
[<ffffffff8296d29c>] netlink_sendmsg+0x4e0/0x994 net/netlink/af_netlink.c:1919
[<ffffffff826d264e>] sock_sendmsg_nosec net/socket.c:705 [inline]
[<ffffffff826d264e>] sock_sendmsg+0xa0/0xc4 net/socket.c:725
[<ffffffff826d7026>] __sys_sendto+0x1f2/0x2e0 net/socket.c:2040
[<ffffffff826d7152>] __do_sys_sendto net/socket.c:2052 [inline]
[<ffffffff826d7152>] sys_sendto+0x3e/0x52 net/socket.c:2048
[<ffffffff80005716>] ret_from_syscall+0x0/0x2

=============================
WARNING: suspicious RCU usage
5.17.0-rc1-syzkaller-00002-g0966d385830d #0 Not tainted
-----------------------------
net/ipv6/addrconf.c:3140 suspicious rcu_dereference_protected() usage!

other info that might help us debug this:


rcu_scheduler_active = 2, debug_locks = 1
1 lock held by syz-executor.1/2048:
 #0: 000000c00135d600 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock net/core/rtnetlink.c:72 [inline]
 #0: 000000c00135d600 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x2fe/0x9a0 net/core/rtnetlink.c:5589

stack backtrace:
CPU: 0 PID: 2048 Comm: syz-executor.1 Not tainted 5.17.0-rc1-syzkaller-00002-g0966d385830d #0
Hardware name: riscv-virtio,qemu (DT)
Call Trace:
[<ffffffff8000a228>] dump_backtrace+0x2e/0x3c arch/riscv/kernel/stacktrace.c:113
[<ffffffff831668cc>] show_stack+0x34/0x40 arch/riscv/kernel/stacktrace.c:119
[<ffffffff831756ba>] __dump_stack lib/dump_stack.c:88 [inline]
[<ffffffff831756ba>] dump_stack_lvl+0xe4/0x150 lib/dump_stack.c:106
[<ffffffff83175742>] dump_stack+0x1c/0x24 lib/dump_stack.c:113
[<ffffffff831689d2>] lockdep_rcu_suspicious+0x106/0x118 kernel/locking/lockdep.c:6563
[<ffffffff82d4154c>] add_v4_addrs+0x5e2/0x640 net/ipv6/addrconf.c:3140
[<ffffffff82d4e322>] addrconf_gre_config net/ipv6/addrconf.c:3425 [inline]
[<ffffffff82d4e322>] addrconf_notify+0x784/0x1360 net/ipv6/addrconf.c:3605
[<ffffffff800aac84>] notifier_call_chain+0xb8/0x188 kernel/notifier.c:84
[<ffffffff800aad7e>] raw_notifier_call_chain+0x2a/0x38 kernel/notifier.c:392
[<ffffffff8271d086>] call_netdevice_notifiers_info+0x9e/0x10c net/core/dev.c:1919
[<ffffffff827422c8>] call_netdevice_notifiers_extack net/core/dev.c:1931 [inline]
[<ffffffff827422c8>] call_netdevice_notifiers net/core/dev.c:1945 [inline]
[<ffffffff827422c8>] __dev_notify_flags+0x108/0x1fa net/core/dev.c:8179
[<ffffffff827436f6>] dev_change_flags+0x9c/0xba net/core/dev.c:8215
[<ffffffff82767e16>] do_setlink+0x5d6/0x21c4 net/core/rtnetlink.c:2729
[<ffffffff8276a6a2>] __rtnl_newlink+0x99e/0xfa0 net/core/rtnetlink.c:3412
[<ffffffff8276ad04>] rtnl_newlink+0x60/0x8c net/core/rtnetlink.c:3527
[<ffffffff8276b46c>] rtnetlink_rcv_msg+0x338/0x9a0 net/core/rtnetlink.c:5592
[<ffffffff8296ded2>] netlink_rcv_skb+0xf8/0x2be net/netlink/af_netlink.c:2494
[<ffffffff827624f4>] rtnetlink_rcv+0x26/0x30 net/core/rtnetlink.c:5610
[<ffffffff8296cbcc>] netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
[<ffffffff8296cbcc>] netlink_unicast+0x40e/0x5fe net/netlink/af_netlink.c:1343
[<ffffffff8296d29c>] netlink_sendmsg+0x4e0/0x994 net/netlink/af_netlink.c:1919
[<ffffffff826d264e>] sock_sendmsg_nosec net/socket.c:705 [inline]
[<ffffffff826d264e>] sock_sendmsg+0xa0/0xc4 net/socket.c:725
[<ffffffff826d7026>] __sys_sendto+0x1f2/0x2e0 net/socket.c:2040
[<ffffffff826d7152>] __do_sys_sendto net/socket.c:2052 [inline]
[<ffffffff826d7152>] sys_sendto+0x3e/0x52 net/socket.c:2048
[<ffffffff80005716>] ret_from_syscall+0x0/0x2

=============================
WARNING: suspicious RCU usage
5.17.0-rc1-syzkaller-00002-g0966d385830d #0 Not tainted
-----------------------------
include/net/addrconf.h:313 suspicious rcu_dereference_check() usage!

other info that might help us debug this:


rcu_scheduler_active = 2, debug_locks = 1
1 lock held by syz-executor.1/2048:
 #0: 000000c00135d600 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock net/core/rtnetlink.c:72 [inline]
 #0: 000000c00135d600 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x2fe/0x9a0 net/core/rtnetlink.c:5589

stack backtrace:
CPU: 0 PID: 2048 Comm: syz-executor.1 Not tainted 5.17.0-rc1-syzkaller-00002-g0966d385830d #0
Hardware name: riscv-virtio,qemu (DT)
Call Trace:
[<ffffffff8000a228>] dump_backtrace+0x2e/0x3c arch/riscv/kernel/stacktrace.c:113
[<ffffffff831668cc>] show_stack+0x34/0x40 arch/riscv/kernel/stacktrace.c:119
[<ffffffff831756ba>] __dump_stack lib/dump_stack.c:88 [inline]
[<ffffffff831756ba>] dump_stack_lvl+0xe4/0x150 lib/dump_stack.c:106
[<ffffffff83175742>] dump_stack+0x1c/0x24 lib/dump_stack.c:113
[<ffffffff831689d2>] lockdep_rcu_suspicious+0x106/0x118 kernel/locking/lockdep.c:6563
[<ffffffff82db8b04>] __in6_dev_get include/net/addrconf.h:313 [inline]
[<ffffffff82db8b04>] __in6_dev_get include/net/addrconf.h:311 [inline]
[<ffffffff82db8b04>] ipv6_mc_netdev_event+0x29c/0x4a8 net/ipv6/mcast.c:2842
[<ffffffff800aac84>] notifier_call_chain+0xb8/0x188 kernel/notifier.c:84
[<ffffffff800aad7e>] raw_notifier_call_chain+0x2a/0x38 kernel/notifier.c:392
[<ffffffff8271d086>] call_netdevice_notifiers_info+0x9e/0x10c net/core/dev.c:1919
[<ffffffff827422c8>] call_netdevice_notifiers_extack net/core/dev.c:1931 [inline]
[<ffffffff827422c8>] call_netdevice_notifiers net/core/dev.c:1945 [inline]
[<ffffffff827422c8>] __dev_notify_flags+0x108/0x1fa net/core/dev.c:8179
[<ffffffff827436f6>] dev_change_flags+0x9c/0xba net/core/dev.c:8215
[<ffffffff82767e16>] do_setlink+0x5d6/0x21c4 net/core/rtnetlink.c:2729
[<ffffffff8276a6a2>] __rtnl_newlink+0x99e/0xfa0 net/core/rtnetlink.c:3412
[<ffffffff8276ad04>] rtnl_newlink+0x60/0x8c net/core/rtnetlink.c:3527
[<ffffffff8276b46c>] rtnetlink_rcv_msg+0x338/0x9a0 net/core/rtnetlink.c:5592
[<ffffffff8296ded2>] netlink_rcv_skb+0xf8/0x2be net/netlink/af_netlink.c:2494
[<ffffffff827624f4>] rtnetlink_rcv+0x26/0x30 net/core/rtnetlink.c:5610
[<ffffffff8296cbcc>] netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
[<ffffffff8296cbcc>] netlink_unicast+0x40e/0x5fe net/netlink/af_netlink.c:1343
[<ffffffff8296d29c>] netlink_sendmsg+0x4e0/0x994 net/netlink/af_netlink.c:1919
[<ffffffff826d264e>] sock_sendmsg_nosec net/socket.c:705 [inline]
[<ffffffff826d264e>] sock_sendmsg+0xa0/0xc4 net/socket.c:725
[<ffffffff826d7026>] __sys_sendto+0x1f2/0x2e0 net/socket.c:2040
[<ffffffff826d7152>] __do_sys_sendto net/socket.c:2052 [inline]
[<ffffffff826d7152>] sys_sendto+0x3e/0x52 net/socket.c:2048
[<ffffffff80005716>] ret_from_syscall+0x0/0x2

=============================
WARNING: suspicious RCU usage
5.17.0-rc1-syzkaller-00002-g0966d385830d #0 Not tainted
-----------------------------
net/8021q/vlan.c:392 suspicious rcu_dereference_protected() usage!

other info that might help us debug this:


rcu_scheduler_active = 2, debug_locks = 1
1 lock held by syz-executor.1/2048:
 #0: 000000c00135d600 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock net/core/rtnetlink.c:72 [inline]
 #0: 000000c00135d600 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x2fe/0x9a0 net/core/rtnetlink.c:5589

stack backtrace:
CPU: 0 PID: 2048 Comm: syz-executor.1 Not tainted 5.17.0-rc1-syzkaller-00002-g0966d385830d #0
Hardware name: riscv-virtio,qemu (DT)
Call Trace:
[<ffffffff8000a228>] dump_backtrace+0x2e/0x3c arch/riscv/kernel/stacktrace.c:113
[<ffffffff831668cc>] show_stack+0x34/0x40 arch/riscv/kernel/stacktrace.c:119
[<ffffffff831756ba>] __dump_stack lib/dump_stack.c:88 [inline]
[<ffffffff831756ba>] dump_stack_lvl+0xe4/0x150 lib/dump_stack.c:106
[<ffffffff83175742>] dump_stack+0x1c/0x24 lib/dump_stack.c:113
[<ffffffff831689d2>] lockdep_rcu_suspicious+0x106/0x118 kernel/locking/lockdep.c:6563
[<ffffffff82f0e32e>] vlan_device_event+0x364/0x1434 net/8021q/vlan.c:392
[<ffffffff800aac84>] notifier_call_chain+0xb8/0x188 kernel/notifier.c:84
[<ffffffff800aad7e>] raw_notifier_call_chain+0x2a/0x38 kernel/notifier.c:392
[<ffffffff8271d086>] call_netdevice_notifiers_info+0x9e/0x10c net/core/dev.c:1919
[<ffffffff827422c8>] call_netdevice_notifiers_extack net/core/dev.c:1931 [inline]
[<ffffffff827422c8>] call_netdevice_notifiers net/core/dev.c:1945 [inline]
[<ffffffff827422c8>] __dev_notify_flags+0x108/0x1fa net/core/dev.c:8179
[<ffffffff827436f6>] dev_change_flags+0x9c/0xba net/core/dev.c:8215
[<ffffffff82767e16>] do_setlink+0x5d6/0x21c4 net/core/rtnetlink.c:2729
[<ffffffff8276a6a2>] __rtnl_newlink+0x99e/0xfa0 net/core/rtnetlink.c:3412
[<ffffffff8276ad04>] rtnl_newlink+0x60/0x8c net/core/rtnetlink.c:3527
[<ffffffff8276b46c>] rtnetlink_rcv_msg+0x338/0x9a0 net/core/rtnetlink.c:5592
[<ffffffff8296ded2>] netlink_rcv_skb+0xf8/0x2be net/netlink/af_netlink.c:2494
[<ffffffff827624f4>] rtnetlink_rcv+0x26/0x30 net/core/rtnetlink.c:5610
[<ffffffff8296cbcc>] netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
[<ffffffff8296cbcc>] netlink_unicast+0x40e/0x5fe net/netlink/af_netlink.c:1343
[<ffffffff8296d29c>] netlink_sendmsg+0x4e0/0x994 net/netlink/af_netlink.c:1919
[<ffffffff826d264e>] sock_sendmsg_nosec net/socket.c:705 [inline]
[<ffffffff826d264e>] sock_sendmsg+0xa0/0xc4 net/socket.c:725
[<ffffffff826d7026>] __sys_sendto+0x1f2/0x2e0 net/socket.c:2040
[<ffffffff826d7152>] __do_sys_sendto net/socket.c:2052 [inline]
[<ffffffff826d7152>] sys_sendto+0x3e/0x52 net/socket.c:2048
[<ffffffff80005716>] ret_from_syscall+0x0/0x2

=====================================
WARNING: bad unlock balance detected!
5.17.0-rc1-syzkaller-00002-g0966d385830d #0 Not tainted
-------------------------------------
syz-executor.1/2048 is trying to release lock (rtnl_mutex) at:
[<ffffffff827745dc>] __rtnl_unlock+0x34/0x80 net/core/rtnetlink.c:98
but there are no more locks to release!

other info that might help us debug this:
1 lock held by syz-executor.1/2048:
 #0: 000000c00135d600 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock net/core/rtnetlink.c:72 [inline]
 #0: 000000c00135d600 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x2fe/0x9a0 net/core/rtnetlink.c:5589

stack backtrace:
CPU: 0 PID: 2048 Comm: syz-executor.1 Not tainted 5.17.0-rc1-syzkaller-00002-g0966d385830d #0
Hardware name: riscv-virtio,qemu (DT)
Call Trace:
[<ffffffff8000a228>] dump_backtrace+0x2e/0x3c arch/riscv/kernel/stacktrace.c:113
[<ffffffff831668cc>] show_stack+0x34/0x40 arch/riscv/kernel/stacktrace.c:119
[<ffffffff831756ba>] __dump_stack lib/dump_stack.c:88 [inline]
[<ffffffff831756ba>] dump_stack_lvl+0xe4/0x150 lib/dump_stack.c:106
[<ffffffff83175742>] dump_stack+0x1c/0x24 lib/dump_stack.c:113
[<ffffffff8316887a>] print_unlock_imbalance_bug.part.0+0xc4/0xd2 kernel/locking/lockdep.c:5080
[<ffffffff80115d78>] print_unlock_imbalance_bug kernel/locking/lockdep.c:5062 [inline]
[<ffffffff80115d78>] __lock_release kernel/locking/lockdep.c:5316 [inline]
[<ffffffff80115d78>] lock_release+0x4fe/0x614 kernel/locking/lockdep.c:5659
[<ffffffff831a7d4c>] __mutex_unlock_slowpath+0xa4/0x3a2 kernel/locking/mutex.c:893
[<ffffffff831a8058>] mutex_unlock+0xe/0x16 kernel/locking/mutex.c:540
[<ffffffff827745dc>] __rtnl_unlock+0x34/0x80 net/core/rtnetlink.c:98
[<ffffffff82746ef4>] netdev_run_todo+0x1ee/0x752 net/core/dev.c:9929
[<ffffffff8276b47a>] rtnl_unlock net/core/rtnetlink.c:112 [inline]
[<ffffffff8276b47a>] rtnetlink_rcv_msg+0x346/0x9a0 net/core/rtnetlink.c:5593
[<ffffffff8296ded2>] netlink_rcv_skb+0xf8/0x2be net/netlink/af_netlink.c:2494
[<ffffffff827624f4>] rtnetlink_rcv+0x26/0x30 net/core/rtnetlink.c:5610
[<ffffffff8296cbcc>] netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
[<ffffffff8296cbcc>] netlink_unicast+0x40e/0x5fe net/netlink/af_netlink.c:1343
[<ffffffff8296d29c>] netlink_sendmsg+0x4e0/0x994 net/netlink/af_netlink.c:1919
[<ffffffff826d264e>] sock_sendmsg_nosec net/socket.c:705 [inline]
[<ffffffff826d264e>] sock_sendmsg+0xa0/0xc4 net/socket.c:725
[<ffffffff826d7026>] __sys_sendto+0x1f2/0x2e0 net/socket.c:2040
[<ffffffff826d7152>] __do_sys_sendto net/socket.c:2052 [inline]
[<ffffffff826d7152>] sys_sendto+0x3e/0x52 net/socket.c:2048
[<ffffffff80005716>] ret_from_syscall+0x0/0x2
8021q: adding VLAN 0 to HW filter on device bond0


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.


* Re: [RFC 1/1] net: move IFF_LIVE_ADDR_CHANGE to public flag
From: James Prestwood @ 2022-08-10 16:26 UTC (permalink / raw)
  To: Johannes Berg, Jakub Kicinski; +Cc: netdev
In-Reply-To: <0fc27b144ca3adb4ff6b3057f2654040392ef2d8.camel@sipsolutions.net>

Hi Johannes,

On Tue, 2022-08-09 at 21:04 +0200, Johannes Berg wrote:
> On Thu, 2022-08-04 at 12:49 -0700, James Prestwood wrote:
> > > > 
> > > > The semantics in wireless are also a little stretched because
> > > > normally
> > > > if the flag is not set the netdev will _refuse_ (-EBUSY) to
> > > > change
> > > > the
> > > > address while running, not do some crazy fw reset.
> > > 
> > > Sorry if I wasn't clear, but its not nl80211 doing the fw reset
> > > automatically. The wireless subsystem actually completely disallows
> > > a
> > > MAC change if the device is running, this flag isn't even checked.
> > > This
> > > means userspace has to bring the device down itself, then change
> > > the
> > > MAC.
> > > 
> > > I plan on also modifying mac80211 to first check this flag and
> > > allow
> > > a
> > > live MAC change if possible. But ultimately userspace still needs
> > > to
> > > be
> > > aware of the support.
> > > 
> 
> I'm not sure this is the right approach.
> 
> For the stated purpose (not powering down the NIC), with most mac80211
> drivers the following would work:
> 
>  - add a new virtual interface of any supported type, and bring it up
>  - bring down the other interface, change MAC address, bring it up
> again
>  - remove the interface added in step 1
> 
> though obviously that's not a good way to do it!
> 
> But internally in mac80211, there's a distinction between
> 
>  ->stop() to turn off the NIC, and
>  ->remove_interface() to remove the interface.
> 
> Changing the MAC address should always be possible when the interface
> doesn't exist in the driver (remove_interface), but without stop()ing
> the NIC.
> 
> However, obviously remove_interface() implies that you break the
> connection first, and obviously you cannot change the MAC address
> without breaking the connection (stopping AP, etc.)

> Therefore, the semantics of this flag don't make sense - you cannot
> change the MAC address in a "live" way while there's a connection, and
> at least internally you need not stop the NIC to change it. Since
> ethernet has no concept of a "connection" in the same way, things are
> different there.

There is no need to change the MAC while connected or scanning, as
that doesn't make much sense. I guess "live" can be interpreted
differently, but my interpretation is simply changing the MAC while the
device isn't powered off. Perhaps IFF_POWERED_ADDR_CHANGE would be a
better-suited name.

> 
> Not sure how to really solve this - perhaps a wireless-specific way of
> changing the MAC address could be added, though that's quite ugly, or
> we
> might be able to permit changing the MAC address while not active in
> any
> way (connected, scanning etc.) by removing from/re-adding to the driver
> at least as far as mac80211 is concerned.

Ok, so this is how I originally did it in those old patches:

https://lore.kernel.org/linux-wireless/20190913195908.7871-2-prestwoj@gmail.com/

i.e. remove_interface, change the mac, add_interface. 

But before I revive those I want to make sure a flag can be advertised
to userspace, e.g. NL80211_EXT_FEATURE_LIVE_ADDRESS_CHANGE (or
POWERED), since this was the reason the patches got dropped in the
first place.

> 
> johannes




* Re: [PATCH v5 3/5] Bluetooth: Add support for hci devcoredump
From: Greg Kroah-Hartman @ 2022-08-10 16:19 UTC (permalink / raw)
  To: Manish Mandlik
  Cc: Arend van Spriel, marcel, luiz.dentz, Johannes Berg, Dan Williams,
	Jason Gunthorpe, linux-bluetooth, Thomas Gleixner,
	Rafael J . Wysocki, chromeos-bluetooth-upstreaming, Won Chung,
	Abhishek Pandit-Subedi, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Johan Hedberg, Paolo Abeni, linux-kernel, netdev
In-Reply-To: <20220810085753.v5.3.Iaf638bb9f885f5880ab1b4e7ae2f73dd53a54661@changeid>

On Wed, Aug 10, 2022 at 09:00:36AM -0700, Manish Mandlik wrote:
> --- a/net/bluetooth/hci_core.c
> +++ b/net/bluetooth/hci_core.c
> @@ -2510,14 +2510,23 @@ struct hci_dev *hci_alloc_dev_priv(int sizeof_priv)
>  	INIT_WORK(&hdev->tx_work, hci_tx_work);
>  	INIT_WORK(&hdev->power_on, hci_power_on);
>  	INIT_WORK(&hdev->error_reset, hci_error_reset);
> +#ifdef CONFIG_DEV_COREDUMP
> +	INIT_WORK(&hdev->dump.dump_rx, hci_devcoredump_rx);
> +#endif
>  
>  	hci_cmd_sync_init(hdev);
>  
>  	INIT_DELAYED_WORK(&hdev->power_off, hci_power_off);
> +#ifdef CONFIG_DEV_COREDUMP
> +	INIT_DELAYED_WORK(&hdev->dump.dump_timeout, hci_devcoredump_timeout);
> +#endif
>  
>  	skb_queue_head_init(&hdev->rx_q);
>  	skb_queue_head_init(&hdev->cmd_q);
>  	skb_queue_head_init(&hdev->raw_q);
> +#ifdef CONFIG_DEV_COREDUMP
> +	skb_queue_head_init(&hdev->dump.dump_q);
> +#endif

Putting #ifdef in .c files is messy, why not put all of this behind a
function that you properly handle in a .h file instead?
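
For illustration only, a minimal sketch of that pattern. CONFIG_EXAMPLE_DUMP,
struct example_dev and the example_*() helpers are made-up names for this
example, not the ones the Bluetooth patch has to use; the point is only that
the #ifdef lives next to the declaration and the call site stays clean:

```c
/*
 * Header part (would normally sit in a .h file).
 * CONFIG_EXAMPLE_DUMP is a hypothetical config symbol.
 */
struct example_dev {
	int rx_queue_ready;
#ifdef CONFIG_EXAMPLE_DUMP
	int dump_queue_ready;
#endif
};

#ifdef CONFIG_EXAMPLE_DUMP
/* Real initialisation when the feature is built in. */
static inline void example_dump_init(struct example_dev *dev)
{
	dev->dump_queue_ready = 1;
}
#else
/* Empty stub: compiles away when the feature is disabled. */
static inline void example_dump_init(struct example_dev *dev)
{
	(void)dev;
}
#endif

/* .c part: the caller needs no #ifdef at all. */
static void example_alloc_init(struct example_dev *dev)
{
	dev->rx_queue_ready = 1;
	example_dump_init(dev);
}
```

With the option disabled the stub is an empty inline function, so the
compiler drops the call and the .c file reads the same in both
configurations.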

thanks,

greg k-h


* [PATCH v2 2/2] neighbour: make proxy_queue.qlen limit per-device
From: Alexander Mikhalitsyn @ 2022-08-10 16:08 UTC (permalink / raw)
  To: netdev
  Cc: Alexander Mikhalitsyn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Daniel Borkmann, David Ahern,
	Yajun Deng, Roopa Prabhu, Christian Brauner, linux-kernel,
	Alexey Kuznetsov, Konstantin Khorenko, kernel, devel,
	Denis V . Lunev
In-Reply-To: <20220810160840.311628-1-alexander.mikhalitsyn@virtuozzo.com>

Right now we have a neigh_parms setting, PROXY_QLEN, which specifies the
maximum length of neigh_table->proxy_queue. In practice this limit doesn't
work well, because the check looks like:
tbl->proxy_queue.qlen > NEIGH_VAR(p, PROXY_QLEN)

The problem is that p (struct neigh_parms) is a per-device thing,
but tbl (struct neigh_table) is a system-wide global thing.

It seems reasonable to make the proxy_queue limit per-device.
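
To illustrate the mismatch, here is a small user-space model (a sketch, not
kernel code). struct dev_parms, struct proxy_table and the enqueue/dequeue
helpers are hypothetical names invented for this example; they only mimic
the comparison quoted above and a per-device counter:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct neigh_parms: per-device settings. */
struct dev_parms {
	int proxy_qlen_limit;	/* models NEIGH_VAR(p, PROXY_QLEN) */
	int qlen;		/* models a per-device queue counter */
};

/* Hypothetical stand-in for struct neigh_table: one shared queue. */
struct proxy_table {
	int total_qlen;		/* models tbl->proxy_queue.qlen */
};

/* Old check: per-device limit compared with the shared queue length. */
static bool enqueue_global(struct proxy_table *tbl, struct dev_parms *p)
{
	if (tbl->total_qlen > p->proxy_qlen_limit)
		return false;	/* packet dropped */
	tbl->total_qlen++;
	p->qlen++;
	return true;
}

/* New check: per-device limit compared with the per-device counter. */
static bool enqueue_per_dev(struct proxy_table *tbl, struct dev_parms *p)
{
	if (p->qlen > p->proxy_qlen_limit)
		return false;	/* packet dropped */
	tbl->total_qlen++;
	p->qlen++;
	return true;
}

/* A packet leaving the queue decrements both counters. */
static void dequeue(struct proxy_table *tbl, struct dev_parms *p)
{
	assert(p->qlen > 0 && tbl->total_qlen > 0);
	tbl->total_qlen--;
	p->qlen--;
}
```

In this model, one busy device that fills the shared queue makes
enqueue_global() drop packets for an idle device whose own counter is
still zero, while enqueue_per_dev() keeps accepting them.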

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Ahern <dsahern@kernel.org>
Cc: Yajun Deng <yajun.deng@linux.dev>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Cc: Konstantin Khorenko <khorenko@virtuozzo.com>
Cc: kernel@openvz.org
Cc: devel@openvz.org
Suggested-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
---
 include/net/neighbour.h |  1 +
 net/core/neighbour.c    | 25 ++++++++++++++++++++++---
 2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index 87419f7f5421..bc3fbec70d10 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -82,6 +82,7 @@ struct neigh_parms {
 	struct rcu_head rcu_head;
 
 	int	reachable_time;
+	int	qlen;
 	int	data[NEIGH_VAR_DATA_MAX];
 	DECLARE_BITMAP(data_state, NEIGH_VAR_DATA_MAX);
 };
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 19d99d1eff53..0469fafffd5d 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -316,9 +316,18 @@ static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net)
 	skb = skb_peek(list);
 	while (skb != NULL) {
 		struct sk_buff *skb_next = skb_peek_next(skb, list);
-		if (net == NULL || net_eq(dev_net(skb->dev), net)) {
+		struct net_device *dev = skb->dev;
+		if (net == NULL || net_eq(dev_net(dev), net)) {
+			struct in_device *in_dev;
+
+			rcu_read_lock();
+			in_dev = __in_dev_get_rcu(dev);
+			if (in_dev)
+				in_dev->arp_parms->qlen--;
+			rcu_read_unlock();
 			__skb_unlink(skb, list);
-			dev_put(skb->dev);
+
+			dev_put(dev);
 			kfree_skb(skb);
 		}
 		skb = skb_next;
@@ -1606,8 +1615,15 @@ static void neigh_proxy_process(struct timer_list *t)
 
 		if (tdif <= 0) {
 			struct net_device *dev = skb->dev;
+			struct in_device *in_dev;
 
+			rcu_read_lock();
+			in_dev = __in_dev_get_rcu(dev);
+			if (in_dev)
+				in_dev->arp_parms->qlen--;
+			rcu_read_unlock();
 			__skb_unlink(skb, &tbl->proxy_queue);
+
 			if (tbl->proxy_redo && netif_running(dev)) {
 				rcu_read_lock();
 				tbl->proxy_redo(skb);
@@ -1632,7 +1648,7 @@ void pneigh_enqueue(struct neigh_table *tbl, struct neigh_parms *p,
 	unsigned long sched_next = jiffies +
 			prandom_u32_max(NEIGH_VAR(p, PROXY_DELAY));
 
-	if (tbl->proxy_queue.qlen > NEIGH_VAR(p, PROXY_QLEN)) {
+	if (p->qlen > NEIGH_VAR(p, PROXY_QLEN)) {
 		kfree_skb(skb);
 		return;
 	}
@@ -1648,6 +1664,7 @@ void pneigh_enqueue(struct neigh_table *tbl, struct neigh_parms *p,
 	skb_dst_drop(skb);
 	dev_hold(skb->dev);
 	__skb_queue_tail(&tbl->proxy_queue, skb);
+	p->qlen++;
 	mod_timer(&tbl->proxy_timer, sched_next);
 	spin_unlock(&tbl->proxy_queue.lock);
 }
@@ -1680,6 +1697,7 @@ struct neigh_parms *neigh_parms_alloc(struct net_device *dev,
 		refcount_set(&p->refcnt, 1);
 		p->reachable_time =
 				neigh_rand_reach_time(NEIGH_VAR(p, BASE_REACHABLE_TIME));
+		p->qlen = 0;
 		dev_hold_track(dev, &p->dev_tracker, GFP_KERNEL);
 		p->dev = dev;
 		write_pnet(&p->net, net);
@@ -1745,6 +1763,7 @@ void neigh_table_init(int index, struct neigh_table *tbl)
 	refcount_set(&tbl->parms.refcnt, 1);
 	tbl->parms.reachable_time =
 			  neigh_rand_reach_time(NEIGH_VAR(&tbl->parms, BASE_REACHABLE_TIME));
+	tbl->parms.qlen = 0;
 
 	tbl->stats = alloc_percpu(struct neigh_statistics);
 	if (!tbl->stats)
-- 
2.36.1



* [PATCH v2 0/2] neighbour: fix possible DoS due to net iface start/stop loop
From: Alexander Mikhalitsyn @ 2022-08-10 16:08 UTC (permalink / raw)
  To: netdev
  Cc: Alexander Mikhalitsyn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Daniel Borkmann, David Ahern,
	Yajun Deng, Roopa Prabhu, Christian Brauner, linux-kernel,
	Denis V . Lunev, Alexey Kuznetsov, Konstantin Khorenko,
	Pavel Tikhomirov, Andrey Zhadchenko, Alexander Mikhalitsyn,
	kernel, devel
In-Reply-To: <20220729103559.215140-1-alexander.mikhalitsyn@virtuozzo.com>

Dear friends,

Recently one of the OpenVZ users reported issues with the network
availability of some containers. The cause turned out to be missing
ARP replies from the Host Node to requests for container IPs.

Of course, we started with tcpdump analysis and noticed that ARP requests
successfully reach the external interface of the problematic node. So
something was wrong on the kernel side.

I've played a lot with arping and perf in an attempt to understand what's
happening. The key observation was that we experience issues only
with ARP requests received as link-layer broadcasts (skb->pkt_type == PACKET_BROADCAST).
For packets with skb->pkt_type == PACKET_HOST everything works flawlessly.

Let me show a small piece of code:

static int arp_process(struct sock *sk, struct sk_buff *skb)
...
				if (NEIGH_CB(skb)->flags & LOCALLY_ENQUEUED ||
				    skb->pkt_type == PACKET_HOST ||
				    NEIGH_VAR(in_dev->arp_parms, PROXY_DELAY) == 0) { // reply instantly
					arp_send_dst(ARPOP_REPLY, ETH_P_ARP,
						     sip, dev, tip, sha,
						     dev->dev_addr, sha,
						     reply_dst);
				} else {
					pneigh_enqueue(&arp_tbl,                     // reply with delay
						       in_dev->arp_parms, skb);
					goto out_free_dst;
				}

The problem was that for PACKET_BROADCAST packets we delay the replies and use the pneigh_enqueue() function.
For some reason, queued packets were lost almost all the time! The cause of this behaviour is the
pneigh_queue_purge() function, which purges the whole queue and is called every time some network device
in the system goes link down.

neigh_ifdown -> pneigh_queue_purge

Now imagine that we have a node with 500+ containers running microservices, and some of those microservices
are buggy and constantly restarting. In this case, the pneigh_queue_purge() function will be called very frequently.

This problem is reproducible only with a so-called "host routed" setup. The classical bridge + veth scheme
is not affected.

Minimal reproducer

Suppose that we have the network 172.29.1.1/16 brd 172.29.255.255
and a free-to-use IP, say 172.29.128.3.

1. Network configuration. I'm showing a minimal configuration; it makes little sense
because both veth devices stay in the same net namespace, but it's fine for the sake of demonstration and simplicity.

ip l a veth31427 type veth peer name veth314271
ip l s veth31427 up
ip l s veth314271 up

# setup static arp entry and publish it
arp -Ds -i br0 172.29.128.3 veth31427 pub
# setup static route for this address
route add 172.29.128.3/32 dev veth31427

2. "attacker" side (kubernetes pod with buggy microservice :) )

unshare -n
ip l a type veth
ip l s veth0 up
ip l s veth1 up
for i in {1..100000}; do ip link set veth0 down; sleep 0.01; ip link set veth0 up; done

This will totally block ARP replies for the 172.29.128.3 address. Just try:
# arping -I eth0 172.29.128.3 -c 4

Our proposal is simple:
1. Let's clean up the queue partially: remove only the skbs that belong to the net namespace
of the adapter whose link went down.

2. Let's account the proxy_queue limit properly per device. The current limit check does not look
fully correct because we compare the per-device configurable limit with the
"global" qlen of proxy_queue.
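
To make point 2 concrete, here is a small userspace model of the two
accounting schemes. It is only a sketch: struct dev_parms, struct
proxy_queue and both enqueue helpers are hypothetical names invented for
this illustration, and the exact comparison used by the kernel may differ.

```c
#include <stdbool.h>

/* Hypothetical userspace model; these names do not come from the kernel. */
struct dev_parms {
	unsigned int proxy_qlen_limit;	/* per-device configurable limit */
	unsigned int qlen;		/* requests queued on behalf of this device */
};

struct proxy_queue {
	unsigned int total_qlen;	/* length of the single shared queue */
};

/* Scheme described as the current one: the per-device limit is compared
 * with the length of the shared queue, so one busy device can exhaust
 * the budget of every other device.
 */
static bool enqueue_global_check(struct proxy_queue *q, struct dev_parms *p)
{
	if (q->total_qlen >= p->proxy_qlen_limit)
		return false;		/* request is dropped */
	q->total_qlen++;
	p->qlen++;
	return true;
}

/* Proposed scheme: the per-device limit is compared with the number of
 * requests queued for that device only.
 */
static bool enqueue_per_device_check(struct proxy_queue *q, struct dev_parms *p)
{
	if (p->qlen >= p->proxy_qlen_limit)
		return false;		/* request is dropped */
	q->total_qlen++;
	p->qlen++;
	return true;
}
```

With a limit of 2 on each device, two requests queued for a noisy device
make enqueue_global_check() reject the first request of a quiet device,
while enqueue_per_device_check() still accepts it.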

Thanks,
Alex

v2:
	- only ("neigh: fix possible DoS due to net iface start/stop") is changed
		do del_timer_sync() if queue is empty after pneigh_queue_purge()

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Ahern <dsahern@kernel.org>
Cc: Yajun Deng <yajun.deng@linux.dev>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Denis V. Lunev <den@openvz.org>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Konstantin Khorenko <khorenko@virtuozzo.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Andrey Zhadchenko <andrey.zhadchenko@virtuozzo.com>
Cc: Alexander Mikhalitsyn <alexander@mihalicyn.com>
Cc: kernel@openvz.org
Cc: devel@openvz.org
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>

Alexander Mikhalitsyn (1):
  neighbour: make proxy_queue.qlen limit per-device

Denis V. Lunev (1):
  neigh: fix possible DoS due to net iface start/stop loop

 include/net/neighbour.h |  1 +
 net/core/neighbour.c    | 46 +++++++++++++++++++++++++++++++++--------
 2 files changed, 38 insertions(+), 9 deletions(-)

-- 
2.36.1

* [PATCH v2 1/2] neigh: fix possible DoS due to net iface start/stop loop
From: Alexander Mikhalitsyn @ 2022-08-10 16:08 UTC (permalink / raw)
  To: netdev
  Cc: Denis V. Lunev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Daniel Borkmann, David Ahern, Yajun Deng,
	Roopa Prabhu, Christian Brauner, linux-kernel, Alexey Kuznetsov,
	Alexander Mikhalitsyn, Konstantin Khorenko, kernel, devel
In-Reply-To: <20220810160840.311628-1-alexander.mikhalitsyn@virtuozzo.com>

From: "Denis V. Lunev" <den@openvz.org>

Normal processing of an ARP request (usually an Ethernet broadcast
packet) arriving at the host looks like this:
* the packet reaches arp_process() and is passed through the routing
  procedure
* the request is put into the queue using pneigh_enqueue() if the
  corresponding ARP record is not local (the common case for container
  records on the host)
* the request is processed by a timer (within 80 jiffies by default) and
  the ARP reply is sent from the same arp_process() using the
  NEIGH_CB(skb)->flags & LOCALLY_ENQUEUED condition (the flag is set
  inside pneigh_enqueue())

And here comes the problem. The Linux kernel calls pneigh_queue_purge(),
which destroys the whole queue of ARP requests, on ANY network interface
start/stop event through __neigh_ifdown().

This was not a problem in the original world, where network interface
start/stop was accessible only to the host 'root', which could do far
more destructive things anyway. But the world has changed and Linux
containers are now available. Here the container 'root' has access to
this API and should be considered an untrusted user from the point of
view of the hosting (container) environment.

Thus there is an attack vector against other containers on the node: a
container's root can endlessly start/stop interfaces. We have observed
a similar situation on a real production node, where a docker container
was doing exactly that and other containers on the node became
inaccessible.

The proposed patch does a very simple thing: in pneigh_queue_purge() it
drops only the packets from the namespace in which the network interface
state change was detected. This is enough to prevent the problem for the
whole node while preserving the original semantics of the code.
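
The filtering idea can be modelled in userspace roughly as follows. This
is only a sketch under stated assumptions: struct ns, struct pkt, struct
pkt_queue, queue_add() and queue_purge() are hypothetical stand-ins for
struct net, struct sk_buff and the skb queue helpers, and the model does
no locking and no device refcounting, unlike the kernel code.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct net and struct sk_buff. */
struct ns {
	int id;
};

struct pkt {
	struct pkt *next;
	const struct ns *ns;	/* namespace of the receiving device */
};

struct pkt_queue {
	struct pkt *head;
	size_t len;
};

/* Append one queued request that belongs to namespace ns. */
static int queue_add(struct pkt_queue *q, const struct ns *ns)
{
	struct pkt *p = malloc(sizeof(*p));
	struct pkt **link = &q->head;

	if (!p)
		return -1;
	p->next = NULL;
	p->ns = ns;
	while (*link)
		link = &(*link)->next;
	*link = p;
	q->len++;
	return 0;
}

/* Drop only the entries that belong to ns; a NULL ns drops everything,
 * which mirrors the table teardown case. Returns the number dropped.
 */
static size_t queue_purge(struct pkt_queue *q, const struct ns *ns)
{
	struct pkt **link = &q->head;
	size_t dropped = 0;

	while (*link) {
		struct pkt *p = *link;

		if (ns == NULL || p->ns == ns) {
			*link = p->next;	/* unlink; link now points at the successor */
			free(p);
			q->len--;
			dropped++;
		} else {
			link = &p->next;	/* keep this entry and move on */
		}
	}
	return dropped;
}
```

Purging with the namespace of a flapping container leaves the requests
queued for other namespaces untouched, so their delayed replies can
still be sent.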

v2:
	- do del_timer_sync() if queue is empty after pneigh_queue_purge()

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Ahern <dsahern@kernel.org>
Cc: Yajun Deng <yajun.deng@linux.dev>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Cc: Konstantin Khorenko <khorenko@virtuozzo.com>
Cc: kernel@openvz.org
Cc: devel@openvz.org
Investigated-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
---
 net/core/neighbour.c | 25 +++++++++++++++++--------
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 54625287ee5b..19d99d1eff53 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -307,14 +307,23 @@ static int neigh_del_timer(struct neighbour *n)
 	return 0;
 }
 
-static void pneigh_queue_purge(struct sk_buff_head *list)
+static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net)
 {
+	unsigned long flags;
 	struct sk_buff *skb;
 
-	while ((skb = skb_dequeue(list)) != NULL) {
-		dev_put(skb->dev);
-		kfree_skb(skb);
+	spin_lock_irqsave(&list->lock, flags);
+	skb = skb_peek(list);
+	while (skb != NULL) {
+		struct sk_buff *skb_next = skb_peek_next(skb, list);
+		if (net == NULL || net_eq(dev_net(skb->dev), net)) {
+			__skb_unlink(skb, list);
+			dev_put(skb->dev);
+			kfree_skb(skb);
+		}
+		skb = skb_next;
 	}
+	spin_unlock_irqrestore(&list->lock, flags);
 }
 
 static void neigh_flush_dev(struct neigh_table *tbl, struct net_device *dev,
@@ -385,9 +394,9 @@ static int __neigh_ifdown(struct neigh_table *tbl, struct net_device *dev,
 	write_lock_bh(&tbl->lock);
 	neigh_flush_dev(tbl, dev, skip_perm);
 	pneigh_ifdown_and_unlock(tbl, dev);
-
-	del_timer_sync(&tbl->proxy_timer);
-	pneigh_queue_purge(&tbl->proxy_queue);
+	pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev));
+	if (skb_queue_empty_lockless(&tbl->proxy_queue))
+		del_timer_sync(&tbl->proxy_timer);
 	return 0;
 }
 
@@ -1787,7 +1796,7 @@ int neigh_table_clear(int index, struct neigh_table *tbl)
 	cancel_delayed_work_sync(&tbl->managed_work);
 	cancel_delayed_work_sync(&tbl->gc_work);
 	del_timer_sync(&tbl->proxy_timer);
-	pneigh_queue_purge(&tbl->proxy_queue);
+	pneigh_queue_purge(&tbl->proxy_queue, NULL);
 	neigh_ifdown(tbl, NULL);
 	if (atomic_read(&tbl->entries))
 		pr_crit("neighbour leakage\n");
-- 
2.36.1


* Re: [RFC PATCH net-next] docs: net: add an explanation of VF (and other) Representors
From: Edward Cree @ 2022-08-10 16:02 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: ecree, netdev, davem, pabeni, edumazet, corbet, linux-doc,
	linux-net-drivers, Jacob Keller, Jesse Brandeburg, Michael Chan,
	Andy Gospodarek, Saeed Mahameed, Jiri Pirko, Shannon Nelson,
	Simon Horman, Alexander Duyck
In-Reply-To: <20220808204135.040a4516@kernel.org>

On 09/08/2022 04:41, Jakub Kicinski wrote:
>>> AFAIK there's no "management PF" in the Linux model.  
>>
>> Maybe a bad word choice.  I'm referring to whichever PF (which likely
>>  also has an ordinary netdevice) has administrative rights over the NIC /
>>  internal switch at a firmware level.  Other names I've seen tossed
>>  around include "primary PF", "admin PF".
> 
> I believe someone (mellanox?) used the term eswitch manager.
> I'd use "host PF", somehow that makes most sense to me.

Not sure about that, I've seen "host" used as antonym of "SoC", so
 if the device is configured with the SoC as the admin this could
 confuse people.
I think whatever term we settle on, this document might need to
 have a 'Definitions' section to make it clear :S

>>> What is "the PCIe controller" here? I presume you've seen the
>>> devlink-port doc.  
>>
>> Yes, that's where I got this terminology from.
>> "the" PCIe controller here is the one on which the mgmt PF lives.  For
>>  instance you might have a NIC where you run OVS on a SoC inside the
>>  chip, that has its own PCIe controller including a PF it uses to drive
>>  the hardware v-switch (so it can offload OVS rules), in addition to
>>  the PCIe controller that exposes PFs & VFs to the host you plug it
>>  into through the physical PCIe socket / edge connector.
>> In that case this bullet would refer to any additional PFs the SoC has
>>  besides the management one...
> 
> IMO the model where there's an overall controller for the entire device
> is also a mellanox limitation, due to lack of support for nested
> switches

Instead of "the PCIe controller" I should probably say "the local PCIe
 controller", since that's the wording the devlink-port doc uses.

> Say I pay for a bare metal instance in my favorite public cloud.
> Why would the forwarding between VFs I spawn be controlled by the cloud
> provider and not me?!
> 
> But perhaps Netronome was the only vendor capable of nested switching.

Quite possibly.  Current EF100 NICs can't do nested switching either.

>>>> + - PFs and VFs with other personalities, including network block devices (such
>>>> +   as a vDPA virtio-blk PF backed by remote/distributed storage).  
>>>
>>> IDK how you can configure block forwarding (which is DMAs of command
>>> + data blocks, not packets AFAIU) with the networking concepts..
>>> I've not used the storage functions tho, so I could be wrong.  
>>
>> Maybe I'm way off the beam here, but my understanding is that this
>>  sort of thing involves a block interface between the host and the
>>  NIC, but then something internal to the NIC converts those
>>  operations into network operations (e.g. RDMA traffic or Ceph TCP
>>  packets), which then go out on the network to access the actual
>>  data.  In that case the back-end has to have network connectivity,
>>  and the obvious™ way to do that is give it a v-port on the v-switch
>>  just like anyone else.
> 
> I see. I don't think this covers all implementations. 

Right, I should probably make it more clear that this isn't the only
 way it could be done.
I'm merely trying to make clear that things that don't look like
 netdevices might still have a v-port and hence need a repr.

> "TX queue attached to" made me think of a netdev Tx queue with a qdisc
> rather than just a HW queue. No better ideas tho.

Would adding the word "hardware" before "TX queue" help?  Have to
 admit the netdev-queue interpretation hadn't occurred to me.

>> (And it looks like the core uses `c<N>` for my `if<N>` that you were
>>  so horrified by.  Devlink-port documentation doesn't make it super
>>  clear whether controller 0 is "the controller that's in charge" or
>>  "the controller from which we're viewing things", though I think in
>>  practice it comes to the same thing.)
> 
> I think we had a bit. Perhaps @external? The controller which doesn't
> have @external == true should be the local one IIRC. And by extension
> presumably in charge.

Yes, and that should work fine per se.  It's just not reflected in the
 phys_port_name string in any way, so legacy userland that relies on
 that won't have that piece of info (but it never did) and probably
 assumes that c0 is local.

-ed

* [PATCH v5 3/5] Bluetooth: Add support for hci devcoredump
From: Manish Mandlik @ 2022-08-10 16:00 UTC (permalink / raw)
  To: Arend van Spriel, Greg Kroah-Hartman, marcel, luiz.dentz
  Cc: Johannes Berg, Dan Williams, Jason Gunthorpe,
	Signed-off-by : Manish Mandlik, linux-bluetooth, Thomas Gleixner,
	Rafael J . Wysocki, chromeos-bluetooth-upstreaming, Won Chung,
	Abhishek Pandit-Subedi, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Johan Hedberg, Paolo Abeni, linux-kernel, netdev
In-Reply-To: <20220810085753.v5.1.I5622b2a92dca4d2703a0f747e24f3ef19303e6df@changeid>

From: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>

Add devcoredump APIs to hci core so that drivers only have to provide
the dump skbs instead of managing the synchronization and timeouts.

The devcoredump APIs should be used in the following manner:
 - hci_devcoredump_init is called to allocate the dump.
 - hci_devcoredump_append is called to append any skbs with dump data
   OR hci_devcoredump_append_pattern is called to insert a pattern.
 - hci_devcoredump_complete is called when all dump packets have been
   sent OR hci_devcoredump_abort is called to indicate an error and
   cancel an ongoing dump collection.

The high level APIs just prepare some skbs with the appropriate data and
queue them for the dump to process. Packets that are part of the
crashdump can be intercepted in the driver in interrupt context and
forwarded directly to the devcoredump APIs.

Internally, there are 5 states for the dump: idle, active, complete,
abort and timeout. A devcoredump will only be in active state after it
has been initialized. Once active, it accepts data to be appended,
patterns to be inserted (i.e. memset) and a completion event or an abort
event to generate a devcoredump. The timeout is initialized at the same
time the dump is initialized (defaulting to 10s) and will be cleared
either when the timeout occurs or the dump is complete or aborted.
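
The state transitions described above can be sketched as a small
userspace model. This is not the kernel code: struct dump_model,
dump_handle() and the DUMP_*/EV_* names are hypothetical and exist only
for this illustration. The model keeps just the accept/reject rules and
the reset to idle, and leaves out buffers, locking and the timer itself.

```c
#include <stdbool.h>

/* Hypothetical model of the five dump states and the events that drive them. */
enum dump_state {
	DUMP_IDLE, DUMP_ACTIVE, DUMP_DONE, DUMP_ABORT, DUMP_TIMEOUT
};

enum dump_event {
	EV_INIT, EV_APPEND, EV_PATTERN, EV_COMPLETE, EV_ABORT, EV_TIMEOUT
};

struct dump_model {
	enum dump_state state;		/* current state */
	enum dump_state last_result;	/* terminal state of the last dump */
	unsigned int emitted;		/* number of devcoredumps generated */
};

/* Finish an active dump: record the terminal state, count the generated
 * devcoredump and reset to idle.
 */
static bool dump_finish(struct dump_model *d, enum dump_state result)
{
	if (d->state != DUMP_ACTIVE)
		return false;
	d->last_result = result;
	d->emitted++;
	d->state = DUMP_IDLE;
	return true;
}

/* Returns true if the event is accepted in the current state. */
static bool dump_handle(struct dump_model *d, enum dump_event ev)
{
	switch (ev) {
	case EV_INIT:
		if (d->state != DUMP_IDLE)
			return false;
		d->state = DUMP_ACTIVE;	/* the timeout would be armed here */
		return true;
	case EV_APPEND:
	case EV_PATTERN:
		return d->state == DUMP_ACTIVE;	/* data only while active */
	case EV_COMPLETE:
		return dump_finish(d, DUMP_DONE);
	case EV_ABORT:
		return dump_finish(d, DUMP_ABORT);
	case EV_TIMEOUT:
		return dump_finish(d, DUMP_TIMEOUT);
	}
	return false;
}
```

In this model an append before init is rejected, a second init during an
active dump is rejected, and complete, abort and timeout each generate
one dump and return the machine to idle.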

Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Manish Mandlik <mmandlik@google.com>
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
---

(no changes since v4)

Changes in v4:
- Add .enabled() and .coredump() to hci_devcoredump struct

Changes in v3:
- Add attribute to enable/disable and set default state to disabled

Changes in v2:
- Move hci devcoredump implementation to new files
- Move dump queue and dump work to hci_devcoredump struct
- Add CONFIG_DEV_COREDUMP conditional compile

 include/net/bluetooth/coredump.h | 119 +++++++
 include/net/bluetooth/hci_core.h |   5 +
 net/bluetooth/Makefile           |   2 +
 net/bluetooth/coredump.c         | 524 +++++++++++++++++++++++++++++++
 net/bluetooth/hci_core.c         |   9 +
 net/bluetooth/hci_sync.c         |   2 +
 6 files changed, 661 insertions(+)
 create mode 100644 include/net/bluetooth/coredump.h
 create mode 100644 net/bluetooth/coredump.c

diff --git a/include/net/bluetooth/coredump.h b/include/net/bluetooth/coredump.h
new file mode 100644
index 000000000000..be09290927c0
--- /dev/null
+++ b/include/net/bluetooth/coredump.h
@@ -0,0 +1,119 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2022 Google Corporation
+ */
+
+#ifndef __COREDUMP_H
+#define __COREDUMP_H
+
+#define DEVCOREDUMP_TIMEOUT	msecs_to_jiffies(10000)	/* 10 sec */
+
+typedef bool (*coredump_enabled_t)(struct hci_dev *hdev);
+typedef void (*coredump_t)(struct hci_dev *hdev);
+typedef int  (*dmp_hdr_t)(struct hci_dev *hdev, char *buf, size_t size);
+typedef void (*notify_change_t)(struct hci_dev *hdev, int state);
+
+/* struct hci_devcoredump - Devcoredump state
+ *
+ * @supported: Indicates if FW dump collection is supported by driver
+ * @state: Current state of dump collection
+ * @alloc_size: Total size of the dump
+ * @head: Start of the dump
+ * @tail: Pointer to current end of dump
+ * @end: head + alloc_size for easy comparisons
+ *
+ * @dump_q: Dump queue for state machine to process
+ * @dump_rx: Devcoredump state machine work
+ * @dump_timeout: Devcoredump timeout work
+ *
+ * @enabled: Checks if the devcoredump is enabled for the device
+ *
+ * @coredump: Called from the driver's .coredump() function.
+ * @dmp_hdr: Create a dump header to identify controller/fw/driver info
+ * @notify_change: Notify driver when devcoredump state has changed
+ */
+struct hci_devcoredump {
+	bool		supported;
+
+	enum devcoredump_state {
+		HCI_DEVCOREDUMP_IDLE,
+		HCI_DEVCOREDUMP_ACTIVE,
+		HCI_DEVCOREDUMP_DONE,
+		HCI_DEVCOREDUMP_ABORT,
+		HCI_DEVCOREDUMP_TIMEOUT
+	} state;
+
+	size_t		alloc_size;
+	char		*head;
+	char		*tail;
+	char		*end;
+
+	struct sk_buff_head	dump_q;
+	struct work_struct	dump_rx;
+	struct delayed_work	dump_timeout;
+
+	coredump_enabled_t	enabled;
+
+	coredump_t		coredump;
+	dmp_hdr_t		dmp_hdr;
+	notify_change_t		notify_change;
+};
+
+#ifdef CONFIG_DEV_COREDUMP
+
+void hci_devcoredump_reset(struct hci_dev *hdev);
+void hci_devcoredump_rx(struct work_struct *work);
+void hci_devcoredump_timeout(struct work_struct *work);
+
+int hci_devcoredump_register(struct hci_dev *hdev, coredump_t coredump,
+			     dmp_hdr_t dmp_hdr, notify_change_t notify_change);
+int hci_devcoredump_init(struct hci_dev *hdev, u32 dmp_size);
+int hci_devcoredump_append(struct hci_dev *hdev, struct sk_buff *skb);
+int hci_devcoredump_append_pattern(struct hci_dev *hdev, u8 pattern, u32 len);
+int hci_devcoredump_complete(struct hci_dev *hdev);
+int hci_devcoredump_abort(struct hci_dev *hdev);
+
+#else
+
+static inline void hci_devcoredump_reset(struct hci_dev *hdev) {}
+static inline void hci_devcoredump_rx(struct work_struct *work) {}
+static inline void hci_devcoredump_timeout(struct work_struct *work) {}
+
+static inline int hci_devcoredump_register(struct hci_dev *hdev,
+					   coredump_t coredump,
+					   dmp_hdr_t dmp_hdr,
+					   notify_change_t notify_change)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int hci_devcoredump_init(struct hci_dev *hdev, u32 dmp_size)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int hci_devcoredump_append(struct hci_dev *hdev,
+					 struct sk_buff *skb)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int hci_devcoredump_append_pattern(struct hci_dev *hdev,
+						 u8 pattern, u32 len)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int hci_devcoredump_complete(struct hci_dev *hdev)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int hci_devcoredump_abort(struct hci_dev *hdev)
+{
+	return -EOPNOTSUPP;
+}
+
+#endif /* CONFIG_DEV_COREDUMP */
+
+#endif /* __COREDUMP_H */
diff --git a/include/net/bluetooth/hci_core.h b/include/net/bluetooth/hci_core.h
index e7862903187d..fb5ef1c6dd10 100644
--- a/include/net/bluetooth/hci_core.h
+++ b/include/net/bluetooth/hci_core.h
@@ -32,6 +32,7 @@
 #include <net/bluetooth/hci.h>
 #include <net/bluetooth/hci_sync.h>
 #include <net/bluetooth/hci_sock.h>
+#include <net/bluetooth/coredump.h>
 
 /* HCI priority */
 #define HCI_PRIO_MAX	7
@@ -585,6 +586,10 @@ struct hci_dev {
 	const char		*fw_info;
 	struct dentry		*debugfs;
 
+#ifdef CONFIG_DEV_COREDUMP
+	struct hci_devcoredump	dump;
+#endif
+
 	struct device		dev;
 
 	struct rfkill		*rfkill;
diff --git a/net/bluetooth/Makefile b/net/bluetooth/Makefile
index 0e7b7db42750..141ac1fda0bf 100644
--- a/net/bluetooth/Makefile
+++ b/net/bluetooth/Makefile
@@ -17,6 +17,8 @@ bluetooth-y := af_bluetooth.o hci_core.o hci_conn.o hci_event.o mgmt.o \
 	ecdh_helper.o hci_request.o mgmt_util.o mgmt_config.o hci_codec.o \
 	eir.o hci_sync.o
 
+bluetooth-$(CONFIG_DEV_COREDUMP) += coredump.o
+
 bluetooth-$(CONFIG_BT_BREDR) += sco.o
 bluetooth-$(CONFIG_BT_LE) += iso.o
 bluetooth-$(CONFIG_BT_HS) += a2mp.o amp.o
diff --git a/net/bluetooth/coredump.c b/net/bluetooth/coredump.c
new file mode 100644
index 000000000000..b412056457c8
--- /dev/null
+++ b/net/bluetooth/coredump.c
@@ -0,0 +1,524 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2022 Google Corporation
+ */
+
+#include <linux/devcoredump.h>
+
+#include <net/bluetooth/bluetooth.h>
+#include <net/bluetooth/hci_core.h>
+
+enum hci_devcoredump_pkt_type {
+	HCI_DEVCOREDUMP_PKT_INIT,
+	HCI_DEVCOREDUMP_PKT_SKB,
+	HCI_DEVCOREDUMP_PKT_PATTERN,
+	HCI_DEVCOREDUMP_PKT_COMPLETE,
+	HCI_DEVCOREDUMP_PKT_ABORT,
+};
+
+struct hci_devcoredump_skb_cb {
+	u16 pkt_type;
+};
+
+struct hci_devcoredump_skb_pattern {
+	u8 pattern;
+	u32 len;
+} __packed;
+
+#define hci_dmp_cb(skb)	((struct hci_devcoredump_skb_cb *)((skb)->cb))
+
+#define MAX_DEVCOREDUMP_HDR_SIZE	512	/* bytes */
+
+static int hci_devcoredump_update_hdr_state(char *buf, size_t size, int state)
+{
+	if (!buf)
+		return 0;
+
+	return snprintf(buf, size, "Bluetooth devcoredump\nState: %d\n", state);
+}
+
+/* Call with hci_dev_lock only. */
+static int hci_devcoredump_update_state(struct hci_dev *hdev, int state)
+{
+	hdev->dump.state = state;
+
+	return hci_devcoredump_update_hdr_state(hdev->dump.head,
+						hdev->dump.alloc_size, state);
+}
+
+static int hci_devcoredump_mkheader(struct hci_dev *hdev, char *buf,
+				    size_t buf_size)
+{
+	char *ptr = buf;
+	size_t rem = buf_size;
+	size_t read = 0;
+
+	read = hci_devcoredump_update_hdr_state(ptr, rem, HCI_DEVCOREDUMP_IDLE);
+	read += 1; /* update_hdr_state adds \0 at the end upon state rewrite */
+	rem -= read;
+	ptr += read;
+
+	if (hdev->dump.dmp_hdr) {
+		/* dmp_hdr() should return number of bytes written */
+		read = hdev->dump.dmp_hdr(hdev, ptr, rem);
+		rem -= read;
+		ptr += read;
+	}
+
+	read = snprintf(ptr, rem, "--- Start dump ---\n");
+	rem -= read;
+	ptr += read;
+
+	return buf_size - rem;
+}
+
+/* Do not call with hci_dev_lock since this calls driver code. */
+static void hci_devcoredump_notify(struct hci_dev *hdev, int state)
+{
+	if (hdev->dump.notify_change)
+		hdev->dump.notify_change(hdev, state);
+}
+
+/* Call with hci_dev_lock only. */
+void hci_devcoredump_reset(struct hci_dev *hdev)
+{
+	hdev->dump.head = NULL;
+	hdev->dump.tail = NULL;
+	hdev->dump.alloc_size = 0;
+
+	hci_devcoredump_update_state(hdev, HCI_DEVCOREDUMP_IDLE);
+
+	cancel_delayed_work(&hdev->dump.dump_timeout);
+	skb_queue_purge(&hdev->dump.dump_q);
+}
+
+/* Call with hci_dev_lock only. */
+static void hci_devcoredump_free(struct hci_dev *hdev)
+{
+	if (hdev->dump.head)
+		vfree(hdev->dump.head);
+
+	hci_devcoredump_reset(hdev);
+}
+
+/* Call with hci_dev_lock only. */
+static int hci_devcoredump_alloc(struct hci_dev *hdev, u32 size)
+{
+	hdev->dump.head = vmalloc(size);
+	if (!hdev->dump.head)
+		return -ENOMEM;
+
+	hdev->dump.alloc_size = size;
+	hdev->dump.tail = hdev->dump.head;
+	hdev->dump.end = hdev->dump.head + size;
+
+	hci_devcoredump_update_state(hdev, HCI_DEVCOREDUMP_IDLE);
+
+	return 0;
+}
+
+/* Call with hci_dev_lock only. */
+static bool hci_devcoredump_copy(struct hci_dev *hdev, char *buf, u32 size)
+{
+	if (hdev->dump.tail + size > hdev->dump.end)
+		return false;
+
+	memcpy(hdev->dump.tail, buf, size);
+	hdev->dump.tail += size;
+
+	return true;
+}
+
+/* Call with hci_dev_lock only. */
+static bool hci_devcoredump_memset(struct hci_dev *hdev, u8 pattern, u32 len)
+{
+	if (hdev->dump.tail + len > hdev->dump.end)
+		return false;
+
+	memset(hdev->dump.tail, pattern, len);
+	hdev->dump.tail += len;
+
+	return true;
+}
+
+/* Call with hci_dev_lock only. */
+static int hci_devcoredump_prepare(struct hci_dev *hdev, u32 dump_size)
+{
+	char *dump_hdr;
+	int dump_hdr_size;
+	u32 size;
+	int err = 0;
+
+	dump_hdr = vmalloc(MAX_DEVCOREDUMP_HDR_SIZE);
+	if (!dump_hdr) {
+		err = -ENOMEM;
+		goto hdr_free;
+	}
+
+	dump_hdr_size = hci_devcoredump_mkheader(hdev, dump_hdr,
+						 MAX_DEVCOREDUMP_HDR_SIZE);
+	size = dump_hdr_size + dump_size;
+
+	if (hci_devcoredump_alloc(hdev, size)) {
+		err = -ENOMEM;
+		goto hdr_free;
+	}
+
+	/* Insert the device header */
+	if (!hci_devcoredump_copy(hdev, dump_hdr, dump_hdr_size)) {
+		bt_dev_err(hdev, "Failed to insert header");
+		hci_devcoredump_free(hdev);
+
+		err = -ENOMEM;
+		goto hdr_free;
+	}
+
+hdr_free:
+	if (dump_hdr)
+		vfree(dump_hdr);
+
+	return err;
+}
+
+/* Bluetooth devcoredump state machine.
+ *
+ * Devcoredump states:
+ *
+ *      HCI_DEVCOREDUMP_IDLE: The default state.
+ *
+ *      HCI_DEVCOREDUMP_ACTIVE: A devcoredump will be in this state once it has
+ *              been initialized using hci_devcoredump_init(). Once active, the
+ *              driver can append data using hci_devcoredump_append() or insert
+ *              a pattern using hci_devcoredump_append_pattern().
+ *
+ *      HCI_DEVCOREDUMP_DONE: Once the dump collection is complete, the driver
+ *              can signal the completion using hci_devcoredump_complete(). A
+ *              devcoredump is generated indicating the completion event and
+ *              then the state machine is reset to the default state.
+ *
+ *      HCI_DEVCOREDUMP_ABORT: The driver can cancel ongoing dump collection in
+ *              case of any error using hci_devcoredump_abort(). A devcoredump
+ *              is still generated with the available data indicating the abort
+ *              event and then the state machine is reset to the default state.
+ *
+ *      HCI_DEVCOREDUMP_TIMEOUT: A timeout timer for HCI_DEVCOREDUMP_TIMEOUT sec
+ *              is started during devcoredump initialization. Once the timeout
+ *              occurs, the driver is notified, a devcoredump is generated with
+ *              the available data indicating the timeout event and then the
+ *              state machine is reset to the default state.
+ *
+ * The driver must register using hci_devcoredump_register() before using the
+ * hci devcoredump APIs.
+ */
+void hci_devcoredump_rx(struct work_struct *work)
+{
+	struct hci_dev *hdev = container_of(work, struct hci_dev, dump.dump_rx);
+	struct sk_buff *skb;
+	struct hci_devcoredump_skb_pattern *pattern;
+	u32 dump_size;
+	int start_state;
+
+#define DBG_UNEXPECTED_STATE() \
+		bt_dev_dbg(hdev, \
+			   "Unexpected packet (%d) for state (%d). ", \
+			   hci_dmp_cb(skb)->pkt_type, hdev->dump.state)
+
+	while ((skb = skb_dequeue(&hdev->dump.dump_q))) {
+		hci_dev_lock(hdev);
+		start_state = hdev->dump.state;
+
+		switch (hci_dmp_cb(skb)->pkt_type) {
+		case HCI_DEVCOREDUMP_PKT_INIT:
+			if (hdev->dump.state != HCI_DEVCOREDUMP_IDLE) {
+				DBG_UNEXPECTED_STATE();
+				goto loop_continue;
+			}
+
+			if (skb->len != sizeof(dump_size)) {
+				bt_dev_dbg(hdev, "Invalid dump init pkt");
+				goto loop_continue;
+			}
+
+			dump_size = *((u32 *)skb->data);
+			if (!dump_size) {
+				bt_dev_err(hdev, "Zero size dump init pkt");
+				goto loop_continue;
+			}
+
+			if (hci_devcoredump_prepare(hdev, dump_size)) {
+				bt_dev_err(hdev, "Failed to prepare for dump");
+				goto loop_continue;
+			}
+
+			hci_devcoredump_update_state(hdev,
+						     HCI_DEVCOREDUMP_ACTIVE);
+			queue_delayed_work(hdev->workqueue,
+					   &hdev->dump.dump_timeout,
+					   DEVCOREDUMP_TIMEOUT);
+			break;
+
+		case HCI_DEVCOREDUMP_PKT_SKB:
+			if (hdev->dump.state != HCI_DEVCOREDUMP_ACTIVE) {
+				DBG_UNEXPECTED_STATE();
+				goto loop_continue;
+			}
+
+			if (!hci_devcoredump_copy(hdev, skb->data, skb->len))
+				bt_dev_dbg(hdev, "Failed to insert skb");
+			break;
+
+		case HCI_DEVCOREDUMP_PKT_PATTERN:
+			if (hdev->dump.state != HCI_DEVCOREDUMP_ACTIVE) {
+				DBG_UNEXPECTED_STATE();
+				goto loop_continue;
+			}
+
+			if (skb->len != sizeof(*pattern)) {
+				bt_dev_dbg(hdev, "Invalid pattern skb");
+				goto loop_continue;
+			}
+
+			pattern = (void *)skb->data;
+
+			if (!hci_devcoredump_memset(hdev, pattern->pattern,
+						    pattern->len))
+				bt_dev_dbg(hdev, "Failed to set pattern");
+			break;
+
+		case HCI_DEVCOREDUMP_PKT_COMPLETE:
+			if (hdev->dump.state != HCI_DEVCOREDUMP_ACTIVE) {
+				DBG_UNEXPECTED_STATE();
+				goto loop_continue;
+			}
+
+			hci_devcoredump_update_state(hdev,
+						     HCI_DEVCOREDUMP_DONE);
+			dump_size = hdev->dump.tail - hdev->dump.head;
+
+			bt_dev_info(hdev,
+				    "Devcoredump complete with size %u "
+				    "(expect %u)",
+				    dump_size, hdev->dump.alloc_size);
+
+			dev_coredumpv(&hdev->dev, hdev->dump.head, dump_size,
+				      GFP_KERNEL);
+			break;
+
+		case HCI_DEVCOREDUMP_PKT_ABORT:
+			if (hdev->dump.state != HCI_DEVCOREDUMP_ACTIVE) {
+				DBG_UNEXPECTED_STATE();
+				goto loop_continue;
+			}
+
+			hci_devcoredump_update_state(hdev,
+						     HCI_DEVCOREDUMP_ABORT);
+			dump_size = hdev->dump.tail - hdev->dump.head;
+
+			bt_dev_info(hdev,
+				    "Devcoredump aborted with size %u "
+				    "(expect %u)",
+				    dump_size, hdev->dump.alloc_size);
+
+			/* Emit a devcoredump with the available data */
+			dev_coredumpv(&hdev->dev, hdev->dump.head, dump_size,
+				      GFP_KERNEL);
+			break;
+
+		default:
+			bt_dev_dbg(hdev,
+				   "Unknown packet (%d) for state (%d). ",
+				   hci_dmp_cb(skb)->pkt_type, hdev->dump.state);
+			break;
+		}
+
+loop_continue:
+		kfree_skb(skb);
+		hci_dev_unlock(hdev);
+
+		if (start_state != hdev->dump.state)
+			hci_devcoredump_notify(hdev, hdev->dump.state);
+
+		hci_dev_lock(hdev);
+		if (hdev->dump.state == HCI_DEVCOREDUMP_DONE ||
+		    hdev->dump.state == HCI_DEVCOREDUMP_ABORT)
+			hci_devcoredump_reset(hdev);
+		hci_dev_unlock(hdev);
+	}
+}
+EXPORT_SYMBOL(hci_devcoredump_rx);
+
+void hci_devcoredump_timeout(struct work_struct *work)
+{
+	struct hci_dev *hdev = container_of(work, struct hci_dev,
+					    dump.dump_timeout.work);
+	u32 dump_size;
+
+	hci_devcoredump_notify(hdev, HCI_DEVCOREDUMP_TIMEOUT);
+
+	hci_dev_lock(hdev);
+
+	cancel_work_sync(&hdev->dump.dump_rx);
+
+	hci_devcoredump_update_state(hdev, HCI_DEVCOREDUMP_TIMEOUT);
+	dump_size = hdev->dump.tail - hdev->dump.head;
+	bt_dev_info(hdev, "Devcoredump timeout with size %u (expect %u)",
+		    dump_size, hdev->dump.alloc_size);
+
+	/* Emit a devcoredump with the available data */
+	dev_coredumpv(&hdev->dev, hdev->dump.head, dump_size, GFP_KERNEL);
+
+	hci_devcoredump_reset(hdev);
+
+	hci_dev_unlock(hdev);
+}
+EXPORT_SYMBOL(hci_devcoredump_timeout);
+
+int hci_devcoredump_register(struct hci_dev *hdev, coredump_t coredump,
+			     dmp_hdr_t dmp_hdr, notify_change_t notify_change)
+{
+	/* Driver must implement coredump() and dmp_hdr() functions for
+	 * bluetooth devcoredump. The coredump() should trigger a coredump
+	 * event on the controller when the device's coredump sysfs entry is
+	 * written to. The dmp_hdr() should create a dump header to identify
+	 * the controller/fw/driver info.
+	 */
+	if (!coredump || !dmp_hdr)
+		return -EINVAL;
+
+	hci_dev_lock(hdev);
+	hdev->dump.coredump = coredump;
+	hdev->dump.dmp_hdr = dmp_hdr;
+	hdev->dump.notify_change = notify_change;
+	hdev->dump.supported = true;
+	hci_dev_unlock(hdev);
+
+	return 0;
+}
+EXPORT_SYMBOL(hci_devcoredump_register);
+
+static inline bool hci_devcoredump_enabled(struct hci_dev *hdev)
+{
+	/* The 'supported' flag is true when the driver registers with the HCI
+	 * devcoredump API, whereas, the 'enabled' is controlled via a sysfs
+	 * entry. For drivers like btusb which supports multiple vendor drivers,
+	 * it is possible that the vendor driver does not support but the
+	 * interface is provided by the base btusb driver. So, check both.
+	 */
+	if (hdev->dump.supported && hdev->dump.enabled)
+		return hdev->dump.enabled(hdev);
+
+	return false;
+}
+
+int hci_devcoredump_init(struct hci_dev *hdev, u32 dmp_size)
+{
+	struct sk_buff *skb = NULL;
+
+	if (!hci_devcoredump_enabled(hdev))
+		return -EOPNOTSUPP;
+
+	skb = alloc_skb(sizeof(dmp_size), GFP_ATOMIC);
+	if (!skb) {
+		bt_dev_err(hdev, "Failed to allocate devcoredump init");
+		return -ENOMEM;
+	}
+
+	hci_dmp_cb(skb)->pkt_type = HCI_DEVCOREDUMP_PKT_INIT;
+	skb_put_data(skb, &dmp_size, sizeof(dmp_size));
+
+	skb_queue_tail(&hdev->dump.dump_q, skb);
+	queue_work(hdev->workqueue, &hdev->dump.dump_rx);
+
+	return 0;
+}
+EXPORT_SYMBOL(hci_devcoredump_init);
+
+int hci_devcoredump_append(struct hci_dev *hdev, struct sk_buff *skb)
+{
+	if (!skb)
+		return -ENOMEM;
+
+	if (!hci_devcoredump_enabled(hdev)) {
+		kfree_skb(skb);
+		return -EOPNOTSUPP;
+	}
+
+	hci_dmp_cb(skb)->pkt_type = HCI_DEVCOREDUMP_PKT_SKB;
+
+	skb_queue_tail(&hdev->dump.dump_q, skb);
+	queue_work(hdev->workqueue, &hdev->dump.dump_rx);
+
+	return 0;
+}
+EXPORT_SYMBOL(hci_devcoredump_append);
+
+int hci_devcoredump_append_pattern(struct hci_dev *hdev, u8 pattern, u32 len)
+{
+	struct hci_devcoredump_skb_pattern p;
+	struct sk_buff *skb = NULL;
+
+	if (!hci_devcoredump_enabled(hdev))
+		return -EOPNOTSUPP;
+
+	skb = alloc_skb(sizeof(p), GFP_ATOMIC);
+	if (!skb) {
+		bt_dev_err(hdev, "Failed to allocate devcoredump pattern");
+		return -ENOMEM;
+	}
+
+	p.pattern = pattern;
+	p.len = len;
+
+	hci_dmp_cb(skb)->pkt_type = HCI_DEVCOREDUMP_PKT_PATTERN;
+	skb_put_data(skb, &p, sizeof(p));
+
+	skb_queue_tail(&hdev->dump.dump_q, skb);
+	queue_work(hdev->workqueue, &hdev->dump.dump_rx);
+
+	return 0;
+}
+EXPORT_SYMBOL(hci_devcoredump_append_pattern);
+
+int hci_devcoredump_complete(struct hci_dev *hdev)
+{
+	struct sk_buff *skb = NULL;
+
+	if (!hci_devcoredump_enabled(hdev))
+		return -EOPNOTSUPP;
+
+	skb = alloc_skb(0, GFP_ATOMIC);
+	if (!skb) {
+		bt_dev_err(hdev, "Failed to allocate devcoredump complete");
+		return -ENOMEM;
+	}
+
+	hci_dmp_cb(skb)->pkt_type = HCI_DEVCOREDUMP_PKT_COMPLETE;
+
+	skb_queue_tail(&hdev->dump.dump_q, skb);
+	queue_work(hdev->workqueue, &hdev->dump.dump_rx);
+
+	return 0;
+}
+EXPORT_SYMBOL(hci_devcoredump_complete);
+
+int hci_devcoredump_abort(struct hci_dev *hdev)
+{
+	struct sk_buff *skb = NULL;
+
+	if (!hci_devcoredump_enabled(hdev))
+		return -EOPNOTSUPP;
+
+	skb = alloc_skb(0, GFP_ATOMIC);
+	if (!skb) {
+		bt_dev_err(hdev, "Failed to allocate devcoredump abort");
+		return -ENOMEM;
+	}
+
+	hci_dmp_cb(skb)->pkt_type = HCI_DEVCOREDUMP_PKT_ABORT;
+
+	skb_queue_tail(&hdev->dump.dump_q, skb);
+	queue_work(hdev->workqueue, &hdev->dump.dump_rx);
+
+	return 0;
+}
+EXPORT_SYMBOL(hci_devcoredump_abort);
diff --git a/net/bluetooth/hci_core.c b/net/bluetooth/hci_core.c
index b3a5a3cc9372..9a697190c7a8 100644
--- a/net/bluetooth/hci_core.c
+++ b/net/bluetooth/hci_core.c
@@ -2510,14 +2510,23 @@ struct hci_dev *hci_alloc_dev_priv(int sizeof_priv)
 	INIT_WORK(&hdev->tx_work, hci_tx_work);
 	INIT_WORK(&hdev->power_on, hci_power_on);
 	INIT_WORK(&hdev->error_reset, hci_error_reset);
+#ifdef CONFIG_DEV_COREDUMP
+	INIT_WORK(&hdev->dump.dump_rx, hci_devcoredump_rx);
+#endif
 
 	hci_cmd_sync_init(hdev);
 
 	INIT_DELAYED_WORK(&hdev->power_off, hci_power_off);
+#ifdef CONFIG_DEV_COREDUMP
+	INIT_DELAYED_WORK(&hdev->dump.dump_timeout, hci_devcoredump_timeout);
+#endif
 
 	skb_queue_head_init(&hdev->rx_q);
 	skb_queue_head_init(&hdev->cmd_q);
 	skb_queue_head_init(&hdev->raw_q);
+#ifdef CONFIG_DEV_COREDUMP
+	skb_queue_head_init(&hdev->dump.dump_q);
+#endif
 
 	init_waitqueue_head(&hdev->req_wait_q);
 
diff --git a/net/bluetooth/hci_sync.c b/net/bluetooth/hci_sync.c
index e6d804b82b67..09d74ae2b81c 100644
--- a/net/bluetooth/hci_sync.c
+++ b/net/bluetooth/hci_sync.c
@@ -4337,6 +4337,8 @@ int hci_dev_open_sync(struct hci_dev *hdev)
 		goto done;
 	}
 
+	hci_devcoredump_reset(hdev);
+
 	set_bit(HCI_RUNNING, &hdev->flags);
 	hci_sock_dev_event(hdev, HCI_DEV_OPEN);
 
-- 
2.37.1.559.g78731f0fdb-goog


^ permalink raw reply related

* Re: [PATCH] net: dsa: mv88e6060: report max mtu 1536
From: Sergei Antonov @ 2022-08-10 15:56 UTC (permalink / raw)
  To: Vladimir Oltean; +Cc: netdev, Florian Fainelli
In-Reply-To: <20220810133531.wia2oznylkjrgje2@skbuf>

On Wed, 10 Aug 2022 at 16:35, Vladimir Oltean <olteanv@gmail.com> wrote:
>
> On Wed, Aug 10, 2022 at 03:00:20PM +0300, Sergei Antonov wrote:
> > > >       val = addr[0] << 8 | addr[1];
> > > >
> > > >       /* The multicast bit is always transmitted as a zero, so the switch uses
> > > > @@ -212,6 +211,11 @@ static int mv88e6060_setup(struct dsa_switch *ds)
> > > >       return 0;
> > > >  }
> > > >
> > > > +static int mv88e6060_port_max_mtu(struct dsa_switch *ds, int port)
> > > > +{
> > > > +     return MV88E6060_MAX_MTU;
> > > > +}
> > >
> > > Does this solve any problem? It's ok for the hardware MTU to be higher
> > > than advertised. The problem is when the hardware doesn't accept what
> > > the stack thinks it should.
> >
> > I need some time to reconstruct the problem. IIRC there was an attempt
> > to set MTU 1504 (1500 + a switch overhead), but can not reproduce it
> > at the moment.
>
> What kernel are you using? According to Documentation/process/maintainer-netdev.rst,
> you should test the patches you submit against the master branch from one of
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git
> or
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git
> depending on whether it's a new feature or if it fixes a problem.
>
> Currently, both net and net-next contain the same thing (we are in a
> merge window so net-next will not progress until kernel 6.0-rc1 is cut),
> which is that dsa_slave_change_mtu() will not do anything because of
> this:
>
>         if (!ds->ops->port_change_mtu)
>                 return -EOPNOTSUPP;
>
> (which mv88e6060 does not implement)
>
> So I am slightly doubtful that anyone attempts an MTU change for this
> switch, as you say.
>
> The DSA master (host port, not switch), on the other hand, is a
> different story. Its MTU is updated to 1504 by dsa_master_setup().
>
> > > You're the first person to submit a patch on mv88e6060 that I see.
> > > Is there a board with this switch available somewhere? Does the driver
> > > still work?
> >
> > Very nice to get your feedback. Because, yes, I am working with a
> > device which has mv88e6060, it is called MOXA NPort 6610.
> >
> > The driver works now. There was one problem which I had to workaround.
> > Inside my device only ports 2 and 5 are used, so I initially wrote in
> > .dts:
> >         switch@0 {
> >                 compatible = "marvell,mv88e6060";
> >                 reg = <16>;
>
> reg = <16> for switch@0? Something is wrong, probably switch@0.

Thanks for noticing it.
In my case the device addresses are:
  PHY Registers - 0x10-0x14
  Switch Core Registers - 0x18-0x1D
  Switch Global Registers - 0x1F
I renamed switch@0 to switch@10 and wrote reg in hexadecimal for clarity:
"reg = <0x10>". It works; see below for details on testing.
Should I leave it like that?

> > 2. Insert this code at the beginning of mv88e6060_setup_port():
> > if(!dsa_is_cpu_port(priv->ds, p) && !dsa_to_port(priv->ds, p)->cpu_dp)
> >     return 0;
> > 'cpu_dp' was the null pointer the driver crashed at.
>
> You mean here:
>
>                         (dsa_is_cpu_port(priv->ds, p) ?
>                          dsa_user_ports(priv->ds) :
>                          BIT(dsa_to_port(priv->ds, p)->cpu_dp->index)));

Yes.

> Yes, this is a limitation that has been made worse by blind code
> conversions (nobody seems to have the hardware or to know someone who
> does; I've been tempted to delete the driver a few times or at least to
> move it to staging, because of the unrealistically long delays until
> someone chirps that something is broken for it, even when it obviously is).
> The driver assumes that if the port isn't a CPU port, it's a user port.
> That's clearly false.
>
> You can probably put this at the beginning of mv88e6060_setup_port():
>
>         if (dsa_is_unused_port(priv->ds, p))
>                 return 0;
>
> The bug seems to have been introduced by commit 0abfd494deef ("net: dsa:
> use dedicated CPU port"), because, although before we'd be uselessly
> programming the port VLAN for a disabled port, now in doing so, we
> dereference a NULL pointer.

The suggested fix with dsa_is_unused_port() works. I tested it on the
'netdev/net.git' repo, see below. Should I submit it as a patch
(Fixes: 0abfd494deef)?
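
For reference, here is a minimal user-space sketch of why the guard helps.
The types and function names are simplified stand-ins that I made up for
illustration, not the kernel's DSA structures; only the control flow
mirrors mv88e6060_setup_port():

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel's DSA types (hypothetical). */
enum port_type { PORT_UNUSED, PORT_CPU, PORT_USER };

struct port {
	enum port_type type;
	int index;
	const struct port *cpu_dp;	/* NULL for unused ports */
};

static bool port_is_unused(const struct port *p)
{
	return p->type == PORT_UNUSED;
}

/*
 * Returns the port-based VLAN mask to program, or 0 if the port is
 * skipped. Without the early return, an unused port would reach the
 * cpu_dp dereference below with cpu_dp == NULL.
 */
static unsigned int setup_port_vlan_mask(const struct port *p,
					 unsigned int user_ports_mask)
{
	if (port_is_unused(p))
		return 0;

	if (p->type == PORT_CPU)
		return user_ports_mask;

	return 1U << p->cpu_dp->index;
}
```

In this model an unused port never reaches the cpu_dp dereference, which
matches the crash I saw going away.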

I tested both the dsa_is_unused_port() and the switch@10 fixes on
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git
This is what I did after system boot-up:

~ # dmesg | grep mv88
[    7.187296] mv88e6060 92000090.mdio--1-mii:10: switch Marvell 88E6060 (B0) detected
[    8.325712] mv88e6060 92000090.mdio--1-mii:10: switch Marvell 88E6060 (B0) detected
[    9.190299] mv88e6060 92000090.mdio--1-mii:10 lan2 (uninitialized): PHY [dsa-0.0:02] driver [Generic PHY] (irq=POLL)

~ # ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST> mtu 1504 qdisc noop qlen 1000
    link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
3: lan2@eth0: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop qlen 1000
    link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff

~ # ip link set dev eth0 address 00:90:e8:00:10:03 up

~ # ip a add 192.168.127.254/24 dev lan2

~ # ip link set dev lan2 address 00:90:e8:00:10:03 up
[   56.383801] DSA: failed to set STP state 3 (-95)
[   56.385491] mv88e6060 92000090.mdio--1-mii:10 lan2: configuring for phy/gmii link mode
[   58.694319] mv88e6060 92000090.mdio--1-mii:10 lan2: Link is Up - 100Mbps/Full - flow control off
[   58.699244] IPv6: ADDRCONF(NETDEV_CHANGE): lan2: link becomes ready

~ # ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1504 qdisc pfifo_fast qlen 1000
    link/ether 00:90:e8:00:10:03 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::290:e8ff:fe00:1003/64 scope link
       valid_lft forever preferred_lft forever
3: lan2@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue qlen 1000
    link/ether 00:90:e8:00:10:03 brd ff:ff:ff:ff:ff:ff
    inet 192.168.127.254/24 scope global lan2
       valid_lft forever preferred_lft forever
    inet6 fe80::290:e8ff:fe00:1003/64 scope link
       valid_lft forever preferred_lft forever

Ping, ssh, scp work.

Is it correct for eth0 and lan2@eth0 to have the same MAC? I could not
make it work with different MACs.

> FWIW, in case there is ever a need to backport, the vintage-correct fix
> would be to use something like this:
>
>         if (!dsa_port_is_valid(priv->ds->ports[p]))
>                 return 0;
>
> but in that case the process is:
> - send patch against current "net" tree
> - wait until patch is queued up for "linux-stable" and backported as far
>   as possible
> - email will be sent that patch failed to apply to the still-maintained
>   LTS branches as far as the Fixes: tag required (this is why it is
>   important to populate the Fixes: tag correctly)
> - reply to that email with a manually backported patch, just for that
>   stable tree (linux-4.14.y etc)
>
> >
> > One more observation. Generating and setting a random MAC in
> > mv88e6060_setup_addr() is not necessary - the switch works without it
> > (at least in my case).
>
> The GLOBAL_MAC address that the switch uses there will be used as MAC SA
> in PAUSE frames (802.3 flow control). Not clear if you were aware of
> that fact when saying that the switch "works without it". In other words,
> if you make a change in that area, I expect that flow control is what
> you test, and not, say, ping.
>
> It's true that some other switches use a MAC SA of 00:00:00:00:00:00 for
> PAUSE frames (ocelot_init_port) and this hasn't caused a problem for them.
> I don't know if the 6060 supports this mode. If it does, it's worth a shot.

I don't know how to test flow control. Ping, ssh and scp work even with
the mv88e6060_setup_addr() code removed. Of course, if the MAC SA plays
a role in other scenarios, it should stay :).

^ permalink raw reply

* [RFC net-next io_uring 11/11] io_uring/notif: add ubuf_info ref caching
From: Pavel Begunkov @ 2022-08-10 15:49 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, David S . Miller, Jakub Kicinski, kernel-team,
	linux-kernel, xen-devel, Wei Liu, Paul Durrant, kvm,
	virtualization, Michael S . Tsirkin, Jason Wang, Pavel Begunkov
In-Reply-To: <cover.1660124059.git.asml.silence@gmail.com>

Cache some active notifier references on the io_uring side and get them
in batches, so the amortised cost is low. These references can then be
given away to the network layer using UARGFL_GIFT_REF.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
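The bookkeeping can be sketched in user space as below. This is only a
model under stated assumptions: a plain int stands in for refcount_t, and
struct notif_model with its helpers are hypothetical names; only the
arithmetic follows the patch.

```c
#include <stdbool.h>

#define REF_CACHE_NR 64	/* mirrors IO_NOTIF_REF_CACHE_NR in the patch */

/* Hypothetical user-space model: a plain int stands in for refcount_t. */
struct notif_model {
	int refcnt;		/* shared count (atomic in the kernel) */
	int cached_refs;	/* refs pre-acquired, owned by the submitter */
};

static void notif_init(struct notif_model *nd)
{
	nd->cached_refs = REF_CACHE_NR;
	nd->refcnt = REF_CACHE_NR + 1;	/* cache + master ref */
}

/* Called when a send took the gifted ref: hand one cached ref away. */
static void notif_consume_ref(struct notif_model *nd)
{
	nd->cached_refs--;
	if (nd->cached_refs == 0) {
		/* refill in one batch so at least one ref is always cached */
		nd->refcnt += REF_CACHE_NR;
		nd->cached_refs += REF_CACHE_NR;
	}
}

/* Flush: drop the master ref and all unused cached refs at once. */
static bool notif_drop_refs(struct notif_model *nd)
{
	int refs = nd->cached_refs + 1;

	nd->cached_refs = 0;
	nd->refcnt -= refs;
	return nd->refcnt == 0;
}
```

In this model the shared counter is touched once per 64 sends plus once at
flush, instead of once per send.
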
 io_uring/net.c   |  8 +++++++-
 io_uring/notif.c |  6 ++++--
 io_uring/notif.h | 22 +++++++++++++++++++++-
 3 files changed, 32 insertions(+), 4 deletions(-)

diff --git a/io_uring/net.c b/io_uring/net.c
index e6fc9748fbd2..bdaf9b10bd1b 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -949,6 +949,7 @@ int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
 	struct io_sendzc *zc = io_kiocb_to_cmd(req);
 	struct io_notif_slot *notif_slot;
 	struct io_kiocb *notif;
+	struct ubuf_info *ubuf;
 	struct msghdr msg;
 	struct iovec iov;
 	struct socket *sock;
@@ -1007,10 +1008,15 @@ int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
 		min_ret = iov_iter_count(&msg.msg_iter);
 
 	msg.msg_flags = msg_flags;
-	msg.msg_ubuf = &io_notif_to_data(notif)->uarg;
 	msg.sg_from_iter = io_sg_from_iter;
+	msg.msg_ubuf = ubuf = &io_notif_to_data(notif)->uarg;
+	ubuf->flags |= UARGFL_GIFT_REF;
 	ret = sock_sendmsg(sock, &msg);
 
+	/* check if the send consumed an additional ref */
+	if (likely(!(ubuf->flags & UARGFL_GIFT_REF)))
+		io_notif_consume_ref(notif);
+
 	if (unlikely(ret < min_ret)) {
 		if (ret == -EAGAIN && (issue_flags & IO_URING_F_NONBLOCK))
 			return -EAGAIN;
diff --git a/io_uring/notif.c b/io_uring/notif.c
index dd346ea67580..73bbda5de07d 100644
--- a/io_uring/notif.c
+++ b/io_uring/notif.c
@@ -68,15 +68,17 @@ struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx,
 	nd->uarg.skb_flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
 	nd->uarg.flags = UARGFL_CALLER_PINNED;
 	nd->uarg.callback = io_uring_tx_zerocopy_callback;
+	nd->cached_refs = IO_NOTIF_REF_CACHE_NR;
 	/* master ref owned by io_notif_slot, will be dropped on flush */
-	refcount_set(&nd->uarg.refcnt, 1);
+	refcount_set(&nd->uarg.refcnt, IO_NOTIF_REF_CACHE_NR + 1);
 	return notif;
 }
 
 static inline bool io_notif_drop_refs(struct io_notif_data *nd)
 {
-	int refs = 1;
+	int refs = nd->cached_refs + 1;
 
+	nd->cached_refs = 0;
 	return refcount_sub_and_test(refs, &nd->uarg.refcnt);
 }
 
diff --git a/io_uring/notif.h b/io_uring/notif.h
index 0819304d7e00..2a263055a53b 100644
--- a/io_uring/notif.h
+++ b/io_uring/notif.h
@@ -9,11 +9,14 @@
 
 #define IO_NOTIF_SPLICE_BATCH	32
 #define IORING_MAX_NOTIF_SLOTS (1U << 10)
+#define IO_NOTIF_REF_CACHE_NR	64
 
 struct io_notif_data {
 	struct file		*file;
-	struct ubuf_info	uarg;
 	unsigned long		account_pages;
+	/* extra uarg->refcnt refs */
+	int			cached_refs;
+	struct ubuf_info	uarg;
 };
 
 struct io_notif_slot {
@@ -88,3 +91,20 @@ static inline int io_notif_account_mem(struct io_kiocb *notif, unsigned len)
 	}
 	return 0;
 }
+
+static inline void io_notif_consume_ref(struct io_kiocb *notif)
+	__must_hold(&ctx->uring_lock)
+{
+	struct io_notif_data *nd = io_notif_to_data(notif);
+
+	nd->cached_refs--;
+
+	/*
+	* Issue sends without looking at notif->cached_refs first, so we
+	* always have to have at least one ref cached
+	*/
+	if (unlikely(!nd->cached_refs)) {
+		refcount_add(IO_NOTIF_REF_CACHE_NR, &nd->uarg.refcnt);
+		nd->cached_refs += IO_NOTIF_REF_CACHE_NR;
+	}
+}
-- 
2.37.0


^ permalink raw reply related

* [RFC net-next io_uring 09/11] io_uring/notif: add helper for flushing refs
From: Pavel Begunkov @ 2022-08-10 15:49 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, David S . Miller, Jakub Kicinski, kernel-team,
	linux-kernel, xen-devel, Wei Liu, Paul Durrant, kvm,
	virtualization, Michael S . Tsirkin, Jason Wang, Pavel Begunkov
In-Reply-To: <cover.1660124059.git.asml.silence@gmail.com>

Add a helper for dropping notification references during flush. It's a
preparation patch: currently there is only one master ref, but we're
going to add ref caching.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/notif.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/io_uring/notif.c b/io_uring/notif.c
index a2ba1e35a59f..5661681b3b44 100644
--- a/io_uring/notif.c
+++ b/io_uring/notif.c
@@ -73,6 +73,13 @@ struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx,
 	return notif;
 }
 
+static inline bool io_notif_drop_refs(struct io_notif_data *nd)
+{
+	int refs = 1;
+
+	return refcount_sub_and_test(refs, &nd->uarg.refcnt);
+}
+
 void io_notif_slot_flush(struct io_notif_slot *slot)
 	__must_hold(&ctx->uring_lock)
 {
@@ -81,8 +88,7 @@ void io_notif_slot_flush(struct io_notif_slot *slot)
 
 	slot->notif = NULL;
 
-	/* drop slot's master ref */
-	if (refcount_dec_and_test(&nd->uarg.refcnt))
+	if (io_notif_drop_refs(nd))
 		io_notif_complete(notif);
 }
 
@@ -97,13 +103,11 @@ __cold int io_notif_unregister(struct io_ring_ctx *ctx)
 	for (i = 0; i < ctx->nr_notif_slots; i++) {
 		struct io_notif_slot *slot = &ctx->notif_slots[i];
 		struct io_kiocb *notif = slot->notif;
-		struct io_notif_data *nd;
 
 		if (!notif)
 			continue;
-		nd = io_kiocb_to_cmd(notif);
 		slot->notif = NULL;
-		if (!refcount_dec_and_test(&nd->uarg.refcnt))
+		if (!io_notif_drop_refs(io_kiocb_to_cmd(notif)))
 			continue;
 		notif->io_task_work.func = __io_notif_complete_tw;
 		io_req_task_work_add(notif);
-- 
2.37.0


^ permalink raw reply related

* [RFC net-next io_uring 08/11] net: let callers provide ->msg_ubuf refs
From: Pavel Begunkov @ 2022-08-10 15:49 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, David S . Miller, Jakub Kicinski, kernel-team,
	linux-kernel, xen-devel, Wei Liu, Paul Durrant, kvm,
	virtualization, Michael S . Tsirkin, Jason Wang, Pavel Begunkov
In-Reply-To: <cover.1660124059.git.asml.silence@gmail.com>

Some msg_ubuf providers like io_uring can do elaborate ubuf_info
reference batching and caching, so it is beneficial to let the network
layer optionally take some of the cached refs.

Add UARGFL_GIFT_REF: if set, the caller has at least one extra reference
that it can gift away. If the network layer decides to take the ref, it
should clear the flag.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
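A minimal user-space sketch of the handshake, assuming a stand-in struct
(struct ubuf_model and both helper names are hypothetical; only the flag
values and the test-and-clear logic follow the patch):

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values mirror the series; the struct is a hypothetical stand-in. */
#define UARGFL_CALLER_PINNED	(1U << 0)
#define UARGFL_GIFT_REF		(1U << 1)

struct ubuf_model {
	uint8_t flags;
};

/* Test-and-clear: true if a gifted ref was on offer and is now taken. */
static bool zcopy_get_gift_ref(struct ubuf_model *uarg)
{
	bool has_ref = uarg->flags & UARGFL_GIFT_REF;

	uarg->flags &= (uint8_t)~UARGFL_GIFT_REF;
	return has_ref;
}

/* Caller side: after the send, a cleared flag means the ref was taken. */
static bool caller_ref_was_taken(const struct ubuf_model *uarg)
{
	return !(uarg->flags & UARGFL_GIFT_REF);
}
```

The flag is not atomic in this sketch; it models the offer and the take
happening in the same submission context.
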
 include/linux/skbuff.h | 14 ++++++++++++++
 net/ipv4/ip_output.c   |  1 +
 net/ipv6/ip6_output.c  |  1 +
 3 files changed, 16 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 45fe7f0648d0..972ec676e222 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -527,6 +527,11 @@ enum {
 	 * be freed until we return.
 	 */
 	UARGFL_CALLER_PINNED = BIT(0),
+
+	/* The caller can gift one ubuf reference. The flag should be cleared
+	 * when the reference is taken.
+	 */
+	UARGFL_GIFT_REF = BIT(1),
 };
 
 /*
@@ -1709,6 +1714,15 @@ static inline void net_zcopy_put(struct ubuf_info *uarg)
 		uarg->callback(NULL, uarg, true);
 }
 
+static inline bool net_zcopy_get_gift_ref(struct ubuf_info *uarg)
+{
+	bool has_ref;
+
+	has_ref = uarg->flags & UARGFL_GIFT_REF;
+	uarg->flags &= ~UARGFL_GIFT_REF;
+	return has_ref;
+}
+
 static inline void net_zcopy_put_abort(struct ubuf_info *uarg, bool have_uref)
 {
 	if (uarg) {
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 546897a4b4fa..9d42b6dd6b78 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1032,6 +1032,7 @@ static int __ip_append_data(struct sock *sk,
 				paged = true;
 				zc = true;
 				uarg = msg->msg_ubuf;
+				extra_uref = net_zcopy_get_gift_ref(uarg);
 			}
 		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
 			uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 6d4f01a0cf6e..8d8a8bbdb8df 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1557,6 +1557,7 @@ static int __ip6_append_data(struct sock *sk,
 				paged = true;
 				zc = true;
 				uarg = msg->msg_ubuf;
+				extra_uref = net_zcopy_get_gift_ref(uarg);
 			}
 		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
 			uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
-- 
2.37.0


^ permalink raw reply related

* [RFC net-next io_uring 06/11] net: add flags for controlling ubuf_info
From: Pavel Begunkov @ 2022-08-10 15:49 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, David S . Miller, Jakub Kicinski, kernel-team,
	linux-kernel, xen-devel, Wei Liu, Paul Durrant, kvm,
	virtualization, Michael S . Tsirkin, Jason Wang, Pavel Begunkov
In-Reply-To: <cover.1660124059.git.asml.silence@gmail.com>

ubuf_info already has skb_flags, which are applied to skbs. Also add
flags controlling ubuf_info itself, mainly to hint at various refcounting
aspects of it; these will be introduced in later patches.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/linux/skbuff.h | 1 +
 io_uring/notif.c       | 1 +
 net/core/skbuff.c      | 1 +
 3 files changed, 3 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index e749b5d3868d..2b2e0020030b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -535,6 +535,7 @@ struct ubuf_info {
 			 bool zerocopy_success);
 	refcount_t refcnt;
 	u8 skb_flags;
+	u8 flags;
 };
 
 struct ubuf_info_msgzc {
diff --git a/io_uring/notif.c b/io_uring/notif.c
index 97cb4a7e8849..a2ba1e35a59f 100644
--- a/io_uring/notif.c
+++ b/io_uring/notif.c
@@ -66,6 +66,7 @@ struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx,
 	nd = io_notif_to_data(notif);
 	nd->account_pages = 0;
 	nd->uarg.skb_flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
+	nd->uarg.flags = 0;
 	nd->uarg.callback = io_uring_tx_zerocopy_callback;
 	/* master ref owned by io_notif_slot, will be dropped on flush */
 	refcount_set(&nd->uarg.refcnt, 1);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 40bb84986800..7e102373482c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1207,6 +1207,7 @@ static struct ubuf_info *msg_zerocopy_alloc(struct sock *sk, size_t size)
 	uarg->bytelen = size;
 	uarg->zerocopy = 1;
 	uarg->ubuf.skb_flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
+	uarg->ubuf.flags = 0;
 	refcount_set(&uarg->ubuf.refcnt, 1);
 	sock_hold(sk);
 
-- 
2.37.0


^ permalink raw reply related

* [RFC net-next io_uring 10/11] io_uring/notif: mark notifs with UARGFL_CALLER_PINNED
From: Pavel Begunkov @ 2022-08-10 15:49 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, David S . Miller, Jakub Kicinski, kernel-team,
	linux-kernel, xen-devel, Wei Liu, Paul Durrant, kvm,
	virtualization, Michael S . Tsirkin, Jason Wang, Pavel Begunkov
In-Reply-To: <cover.1660124059.git.asml.silence@gmail.com>

We always keep references to active notifications and drop them only
when we flush, so they're always pinned during sock_sendmsg() and we can
add UARGFL_CALLER_PINNED.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/notif.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/io_uring/notif.c b/io_uring/notif.c
index 5661681b3b44..dd346ea67580 100644
--- a/io_uring/notif.c
+++ b/io_uring/notif.c
@@ -66,7 +66,7 @@ struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx,
 	nd = io_notif_to_data(notif);
 	nd->account_pages = 0;
 	nd->uarg.skb_flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
-	nd->uarg.flags = 0;
+	nd->uarg.flags = UARGFL_CALLER_PINNED;
 	nd->uarg.callback = io_uring_tx_zerocopy_callback;
 	/* master ref owned by io_notif_slot, will be dropped on flush */
 	refcount_set(&nd->uarg.refcnt, 1);
-- 
2.37.0


^ permalink raw reply related

* [RFC net-next io_uring 07/11] net/tcp: optimise tcp ubuf refcounting
From: Pavel Begunkov @ 2022-08-10 15:49 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, David S . Miller, Jakub Kicinski, kernel-team,
	linux-kernel, xen-devel, Wei Liu, Paul Durrant, kvm,
	virtualization, Michael S . Tsirkin, Jason Wang, Pavel Begunkov
In-Reply-To: <cover.1660124059.git.asml.silence@gmail.com>

Add UARGFL_CALLER_PINNED to let protocols know that the caller holds a
reference to the ubuf_info, so they don't need additional refcounting
just to keep it alive. With that, TCP can save a refcount get/put pair
per send when used with ->msg_ubuf.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
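The saving can be illustrated with a small user-space model. Everything
here is a hypothetical stand-in (struct pinned_ubuf, model_send() and the
op counter); only the "skip get/put when pinned" logic follows the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define UARGFL_CALLER_PINNED	(1U << 0)

/* Hypothetical stand-in for ubuf_info with an op counter for illustration. */
struct pinned_ubuf {
	int refcnt;
	uint8_t flags;
	unsigned int refcnt_ops;	/* number of get/put calls performed */
};

static void ubuf_get(struct pinned_ubuf *u)
{
	u->refcnt++;
	u->refcnt_ops++;
}

static void ubuf_put(struct pinned_ubuf *u)
{
	u->refcnt--;
	u->refcnt_ops++;
}

/* Models one send: the temporary ref is taken only for unpinned ubufs. */
static void model_send(struct pinned_ubuf *u)
{
	bool pinned = u->flags & UARGFL_CALLER_PINNED;

	if (!pinned)
		ubuf_get(u);
	/* ... data would be queued to the socket here ... */
	if (!pinned)
		ubuf_put(u);
}
```

With the flag set, the model performs no refcount operations per send;
without it, there are two.
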
 include/linux/skbuff.h | 7 +++++++
 net/ipv4/tcp.c         | 9 ++++++---
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 2b2e0020030b..45fe7f0648d0 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -522,6 +522,13 @@ enum {
 #define SKBFL_ALL_ZEROCOPY	(SKBFL_ZEROCOPY_FRAG | SKBFL_PURE_ZEROCOPY | \
 				 SKBFL_DONT_ORPHAN | SKBFL_MANAGED_FRAG_REFS)
 
+enum {
+	/* The caller holds a reference during the submission so the ubuf won't
+	 * be freed until we return.
+	 */
+	UARGFL_CALLER_PINNED = BIT(0),
+};
+
 /*
  * The callback notifies userspace to release buffers when skb DMA is done in
  * lower device, the skb last reference should be 0 when calling this.
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 3152da8f4763..4925107de57d 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1229,7 +1229,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 
 		if (msg->msg_ubuf) {
 			uarg = msg->msg_ubuf;
-			net_zcopy_get(uarg);
+			if (!(uarg->flags & UARGFL_CALLER_PINNED))
+				net_zcopy_get(uarg);
 			zc = sk->sk_route_caps & NETIF_F_SG;
 		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
 			uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
@@ -1455,7 +1456,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 		tcp_push(sk, flags, mss_now, tp->nonagle, size_goal);
 	}
 out_nopush:
-	net_zcopy_put(uarg);
+	if (uarg && !(uarg->flags & UARGFL_CALLER_PINNED))
+		net_zcopy_put(uarg);
 	return copied + copied_syn;
 
 do_error:
@@ -1464,7 +1466,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 	if (copied + copied_syn)
 		goto out;
 out_err:
-	net_zcopy_put_abort(uarg, true);
+	if (uarg && !(uarg->flags & UARGFL_CALLER_PINNED))
+		net_zcopy_put_abort(uarg, true);
 	err = sk_stream_error(sk, flags, err);
 	/* make sure we wake any epoll edge trigger waiter */
 	if (unlikely(tcp_rtx_and_write_queues_empty(sk) && err == -EAGAIN)) {
-- 
2.37.0


^ permalink raw reply related

* [RFC net-next io_uring 04/11] net: shrink struct ubuf_info
From: Pavel Begunkov @ 2022-08-10 15:49 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, David S . Miller, Jakub Kicinski, kernel-team,
	linux-kernel, xen-devel, Wei Liu, Paul Durrant, kvm,
	virtualization, Michael S . Tsirkin, Jason Wang, Pavel Begunkov
In-Reply-To: <cover.1660124059.git.asml.silence@gmail.com>

We can benefit from a smaller struct ubuf_info, so leave only mandatory
fields and let users decide how they want to extend it. Convert
MSG_ZEROCOPY to struct ubuf_info_msgzc and remove duplicated fields.
This reduces the size from 48 bytes to just 16.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
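The embedding pattern can be sketched in user space as follows. The struct
layouts are trimmed, hypothetical stand-ins, and to_msgzc() is only an
assumed shape for the uarg_to_msgzc() helper used in the diff below.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; field sets are trimmed for illustration. */
struct ubuf_base {
	void (*callback)(struct ubuf_base *uarg, int success);
	int refcnt;
	uint8_t flags;
};

struct ubuf_msgzc {
	struct ubuf_base ubuf;	/* generic part, embedded */
	uint32_t id;
	uint16_t len;
	uint32_t bytelen;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Recover the extended struct from a pointer to its embedded base. */
static struct ubuf_msgzc *to_msgzc(struct ubuf_base *uarg)
{
	return container_of(uarg, struct ubuf_msgzc, ubuf);
}
```

Generic code only ever sees the base struct, so each user pays just for
the fields it adds.
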
 include/linux/skbuff.h | 22 ++++------------------
 net/core/skbuff.c      | 38 +++++++++++++++++++++-----------------
 net/ipv4/ip_output.c   |  2 +-
 net/ipv4/tcp.c         |  2 +-
 net/ipv6/ip6_output.c  |  2 +-
 5 files changed, 28 insertions(+), 38 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f8ac3678dab8..afd7400d7f62 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -533,25 +533,8 @@ enum {
 struct ubuf_info {
 	void (*callback)(struct sk_buff *, struct ubuf_info *,
 			 bool zerocopy_success);
-	union {
-		struct {
-			unsigned long desc;
-			void *ctx;
-		};
-		struct {
-			u32 id;
-			u16 len;
-			u16 zerocopy:1;
-			u32 bytelen;
-		};
-	};
 	refcount_t refcnt;
 	u8 flags;
-
-	struct mmpin {
-		struct user_struct *user;
-		unsigned int num_pg;
-	} mmp;
 };
 
 struct ubuf_info_msgzc {
@@ -570,7 +553,10 @@ struct ubuf_info_msgzc {
 		};
 	};
 
-	struct mmpin mmp;
+	struct mmpin {
+		struct user_struct *user;
+		unsigned int num_pg;
+	} mmp;
 };
 
 #define skb_uarg(SKB)	((struct ubuf_info *)(skb_shinfo(SKB)->destructor_arg))
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 974bbbbe7138..b047a773acd7 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1183,7 +1183,7 @@ EXPORT_SYMBOL_GPL(mm_unaccount_pinned_pages);
 
 static struct ubuf_info *msg_zerocopy_alloc(struct sock *sk, size_t size)
 {
-	struct ubuf_info *uarg;
+	struct ubuf_info_msgzc *uarg;
 	struct sk_buff *skb;
 
 	WARN_ON_ONCE(!in_task());
@@ -1201,19 +1201,19 @@ static struct ubuf_info *msg_zerocopy_alloc(struct sock *sk, size_t size)
 		return NULL;
 	}
 
-	uarg->callback = msg_zerocopy_callback;
+	uarg->ubuf.callback = msg_zerocopy_callback;
 	uarg->id = ((u32)atomic_inc_return(&sk->sk_zckey)) - 1;
 	uarg->len = 1;
 	uarg->bytelen = size;
 	uarg->zerocopy = 1;
-	uarg->flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
-	refcount_set(&uarg->refcnt, 1);
+	uarg->ubuf.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
+	refcount_set(&uarg->ubuf.refcnt, 1);
 	sock_hold(sk);
 
-	return uarg;
+	return &uarg->ubuf;
 }
 
-static inline struct sk_buff *skb_from_uarg(struct ubuf_info *uarg)
+static inline struct sk_buff *skb_from_uarg(struct ubuf_info_msgzc *uarg)
 {
 	return container_of((void *)uarg, struct sk_buff, cb);
 }
@@ -1222,6 +1222,7 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
 				       struct ubuf_info *uarg)
 {
 	if (uarg) {
+		struct ubuf_info_msgzc *uarg_zc;
 		const u32 byte_limit = 1 << 19;		/* limit to a few TSO */
 		u32 bytelen, next;
 
@@ -1237,8 +1238,9 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
 			return NULL;
 		}
 
-		bytelen = uarg->bytelen + size;
-		if (uarg->len == USHRT_MAX - 1 || bytelen > byte_limit) {
+		uarg_zc = uarg_to_msgzc(uarg);
+		bytelen = uarg_zc->bytelen + size;
+		if (uarg_zc->len == USHRT_MAX - 1 || bytelen > byte_limit) {
 			/* TCP can create new skb to attach new uarg */
 			if (sk->sk_type == SOCK_STREAM)
 				goto new_alloc;
@@ -1246,11 +1248,11 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
 		}
 
 		next = (u32)atomic_read(&sk->sk_zckey);
-		if ((u32)(uarg->id + uarg->len) == next) {
-			if (mm_account_pinned_pages(&uarg->mmp, size))
+		if ((u32)(uarg_zc->id + uarg_zc->len) == next) {
+			if (mm_account_pinned_pages(&uarg_zc->mmp, size))
 				return NULL;
-			uarg->len++;
-			uarg->bytelen = bytelen;
+			uarg_zc->len++;
+			uarg_zc->bytelen = bytelen;
 			atomic_set(&sk->sk_zckey, ++next);
 
 			/* no extra ref when appending to datagram (MSG_MORE) */
@@ -1286,7 +1288,7 @@ static bool skb_zerocopy_notify_extend(struct sk_buff *skb, u32 lo, u16 len)
 	return true;
 }
 
-static void __msg_zerocopy_callback(struct ubuf_info *uarg)
+static void __msg_zerocopy_callback(struct ubuf_info_msgzc *uarg)
 {
 	struct sk_buff *tail, *skb = skb_from_uarg(uarg);
 	struct sock_exterr_skb *serr;
@@ -1339,19 +1341,21 @@ static void __msg_zerocopy_callback(struct ubuf_info *uarg)
 void msg_zerocopy_callback(struct sk_buff *skb, struct ubuf_info *uarg,
 			   bool success)
 {
-	uarg->zerocopy = uarg->zerocopy & success;
+	struct ubuf_info_msgzc *uarg_zc = uarg_to_msgzc(uarg);
+
+	uarg_zc->zerocopy = uarg_zc->zerocopy & success;
 
 	if (refcount_dec_and_test(&uarg->refcnt))
-		__msg_zerocopy_callback(uarg);
+		__msg_zerocopy_callback(uarg_zc);
 }
 EXPORT_SYMBOL_GPL(msg_zerocopy_callback);
 
 void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref)
 {
-	struct sock *sk = skb_from_uarg(uarg)->sk;
+	struct sock *sk = skb_from_uarg(uarg_to_msgzc(uarg))->sk;
 
 	atomic_dec(&sk->sk_zckey);
-	uarg->len--;
+	uarg_to_msgzc(uarg)->len--;
 
 	if (have_uref)
 		msg_zerocopy_callback(NULL, uarg, true);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index d7bd1daf022b..546897a4b4fa 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1043,7 +1043,7 @@ static int __ip_append_data(struct sock *sk,
 				paged = true;
 				zc = true;
 			} else {
-				uarg->zerocopy = 0;
+				uarg_to_msgzc(uarg)->zerocopy = 0;
 				skb_zcopy_set(skb, uarg, &extra_uref);
 			}
 		}
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 970e9a2cca4a..3152da8f4763 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1239,7 +1239,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 			}
 			zc = sk->sk_route_caps & NETIF_F_SG;
 			if (!zc)
-				uarg->zerocopy = 0;
+				uarg_to_msgzc(uarg)->zerocopy = 0;
 		}
 	}
 
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 897ca4f9b791..6d4f01a0cf6e 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1568,7 +1568,7 @@ static int __ip6_append_data(struct sock *sk,
 				paged = true;
 				zc = true;
 			} else {
-				uarg->zerocopy = 0;
+				uarg_to_msgzc(uarg)->zerocopy = 0;
 				skb_zcopy_set(skb, uarg, &extra_uref);
 			}
 		}
-- 
2.37.0
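
[Editor's sketch, not part of the patch] The hunks above replace direct
field accesses on struct ubuf_info with uarg_to_msgzc(uarg)->field. The
helper's definition is not shown in this part of the thread; the sketch
below assumes it is a container_of() wrapper around an embedded member
named ubuf (the name that appears in msg_zerocopy_alloc() as
uarg->ubuf). The struct layouts are cut down to a few fields, and
msgzc_disable_zerocopy() and msgzc_drop_one() are hypothetical names
used only for illustration.

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified stand-ins: the real structs carry more fields. */
struct ubuf_info {
	unsigned int refcnt;
	unsigned char flags;
};

struct ubuf_info_msgzc {
	struct ubuf_info ubuf;	/* generic part, embedded in the specific one */
	unsigned int id;
	unsigned short len;
	unsigned char zerocopy;
};

/* Assumed shape of the helper used throughout the diff. */
#define uarg_to_msgzc(p) container_of((p), struct ubuf_info_msgzc, ubuf)

/* Mirrors the "uarg_to_msgzc(uarg)->zerocopy = 0" hunks. */
static void msgzc_disable_zerocopy(struct ubuf_info *uarg)
{
	uarg_to_msgzc(uarg)->zerocopy = 0;
}

/* Mirrors the "uarg_to_msgzc(uarg)->len--" hunk in msg_zerocopy_put_abort(). */
static void msgzc_drop_one(struct ubuf_info *uarg)
{
	uarg_to_msgzc(uarg)->len--;
}
```

Generic code keeps passing struct ubuf_info pointers around, and only
the MSG_ZEROCOPY paths recover the enclosing struct. The conversion is
valid only when the pointer really is embedded in a struct
ubuf_info_msgzc.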



* [RFC net-next io_uring 05/11] net: rename ubuf_info's flags
From: Pavel Begunkov @ 2022-08-10 15:49 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, David S . Miller, Jakub Kicinski, kernel-team,
	linux-kernel, xen-devel, Wei Liu, Paul Durrant, kvm,
	virtualization, Michael S . Tsirkin, Jason Wang, Pavel Begunkov
In-Reply-To: <cover.1660124059.git.asml.silence@gmail.com>

ubuf_info::flags contains SKBFL_* flags that we copy into skbs. Rename
the field to skb_flags to stress that it holds skb flags.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/linux/skbuff.h | 4 ++--
 io_uring/notif.c       | 2 +-
 net/core/skbuff.c      | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index afd7400d7f62..e749b5d3868d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -534,7 +534,7 @@ struct ubuf_info {
 	void (*callback)(struct sk_buff *, struct ubuf_info *,
 			 bool zerocopy_success);
 	refcount_t refcnt;
-	u8 flags;
+	u8 skb_flags;
 };
 
 struct ubuf_info_msgzc {
@@ -1664,7 +1664,7 @@ static inline void net_zcopy_get(struct ubuf_info *uarg)
 static inline void skb_zcopy_init(struct sk_buff *skb, struct ubuf_info *uarg)
 {
 	skb_shinfo(skb)->destructor_arg = uarg;
-	skb_shinfo(skb)->flags |= uarg->flags;
+	skb_shinfo(skb)->flags |= uarg->skb_flags;
 }
 
 static inline void skb_zcopy_set(struct sk_buff *skb, struct ubuf_info *uarg,
diff --git a/io_uring/notif.c b/io_uring/notif.c
index b5f989dff9de..97cb4a7e8849 100644
--- a/io_uring/notif.c
+++ b/io_uring/notif.c
@@ -65,7 +65,7 @@ struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx,
 
 	nd = io_notif_to_data(notif);
 	nd->account_pages = 0;
-	nd->uarg.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
+	nd->uarg.skb_flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
 	nd->uarg.callback = io_uring_tx_zerocopy_callback;
 	/* master ref owned by io_notif_slot, will be dropped on flush */
 	refcount_set(&nd->uarg.refcnt, 1);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b047a773acd7..40bb84986800 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1206,7 +1206,7 @@ static struct ubuf_info *msg_zerocopy_alloc(struct sock *sk, size_t size)
 	uarg->len = 1;
 	uarg->bytelen = size;
 	uarg->zerocopy = 1;
-	uarg->ubuf.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
+	uarg->ubuf.skb_flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
 	refcount_set(&uarg->ubuf.refcnt, 1);
 	sock_hold(sk);
 
-- 
2.37.0
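
[Editor's sketch, not part of the patch] The rename is easier to follow
next to the one place the field is consumed: skb_zcopy_init() ORs
uarg->skb_flags into the skb's shared-info flags. The userspace sketch
below reproduces that step with stand-in types. All demo_ and DEMO_
names are hypothetical, and the flag values are placeholders rather
than the kernel's definitions.

```c
#include <stdint.h>

/* Placeholder bit values for illustration only. */
#define DEMO_SKBFL_ZEROCOPY_FRAG 0x03u
#define DEMO_SKBFL_DONT_ORPHAN   0x08u

/* Stand-in for struct ubuf_info: only the renamed field is kept. */
struct demo_ubuf_info {
	uint8_t skb_flags;	/* SKBFL_* bits to be copied into each skb */
};

/* Stand-in for the skb shared info that receives the flags. */
struct demo_shinfo {
	uint8_t flags;
	void *destructor_arg;
};

/* Same two steps as skb_zcopy_init() in the diff above. */
static void demo_zcopy_init(struct demo_shinfo *sh, struct demo_ubuf_info *uarg)
{
	sh->destructor_arg = uarg;
	sh->flags |= uarg->skb_flags;	/* OR in, keeping bits already set */
}
```

Because the flags are ORed in, bits already set on the skb survive, and
the new field name makes it plain at the call site that these are skb
flags rather than flags describing the ubuf_info itself.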


