Netdev List
 help / color / mirror / Atom feed
* [PATCH net 0/4][pull request] Intel Wired LAN Driver Updates 2022-04-26
From: Tony Nguyen @ 2022-04-26 20:30 UTC (permalink / raw)
  To: davem, kuba, pabeni; +Cc: Tony Nguyen, netdev

This series contains updates to ice driver only.

Ivan Vecera removes races related to VF message processing by changing
mutex_trylock() call to mutex_lock() and moving additional operations
to occur under mutex.

Petr Oros increases wait time after firmware flash as current time is
not sufficient.

Jake resolves a use-after-free issue for mailbox snapshot.

The following are changes since commit acb16b395c3f3d7502443e0c799c2b42df645642:
  virtio_net: fix wrong buf address calculation when using xdp
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue 100GbE

Ivan Vecera (2):
  ice: Fix incorrect locking in ice_vc_process_vf_msg()
  ice: Protect vf_state check by cfg_lock in ice_vc_process_vf_msg()

Jacob Keller (1):
  ice: fix use-after-free when deinitializing mailbox snapshot

Petr Oros (1):
  ice: wait 5 s for EMP reset after firmware flash

 drivers/net/ethernet/intel/ice/ice_main.c     |  3 +++
 drivers/net/ethernet/intel/ice/ice_sriov.c    |  2 +-
 drivers/net/ethernet/intel/ice/ice_virtchnl.c | 27 +++++++------------
 3 files changed, 13 insertions(+), 19 deletions(-)

-- 
2.31.1


^ permalink raw reply

* [PATCH net 2/4] ice: Protect vf_state check by cfg_lock in ice_vc_process_vf_msg()
From: Tony Nguyen @ 2022-04-26 20:30 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: Ivan Vecera, netdev, anthony.l.nguyen, Fei Liu, Jacob Keller,
	Konrad Jankowski
In-Reply-To: <20220426203018.3790378-1-anthony.l.nguyen@intel.com>

From: Ivan Vecera <ivecera@redhat.com>

Previous patch labelled "ice: Fix incorrect locking in
ice_vc_process_vf_msg()"  fixed an issue with ignored messages
sent by VF driver but a small race window still left.

Recently caught trace during 'ip link set ... vf 0 vlan ...' operation:

[ 7332.995625] ice 0000:3b:00.0: Clearing port VLAN on VF 0
[ 7333.001023] iavf 0000:3b:01.0: Reset indication received from the PF
[ 7333.007391] iavf 0000:3b:01.0: Scheduling reset task
[ 7333.059575] iavf 0000:3b:01.0: PF returned error -5 (IAVF_ERR_PARAM) to our request 3
[ 7333.059626] ice 0000:3b:00.0: Invalid message from VF 0, opcode 3, len 4, error -1

Setting of VLAN for VF causes a reset of the affected VF using
ice_reset_vf() function that runs with cfg_lock taken:

1. ice_notify_vf_reset() informs IAVF driver that reset is needed and
   IAVF schedules its own reset procedure
2. Bit ICE_VF_STATE_DIS is set in vf->vf_state
3. Misc initialization steps
4. ice_sriov_post_vsi_rebuild() -> ice_vf_set_initialized() and that
   clears ICE_VF_STATE_DIS in vf->vf_state

Step 3 is mentioned race window because IAVF reset procedure runs in
parallel and one of its step is sending of VIRTCHNL_OP_GET_VF_RESOURCES
message (opcode==3). This message is handled in ice_vc_process_vf_msg()
and if it is received during the mentioned race window then it's
marked as invalid and error is returned to VF driver.

Protect vf_state check in ice_vc_process_vf_msg() by cfg_lock to avoid
this race condition.

Fixes: e6ba5273d4ed ("ice: Fix race conditions between virtchnl handling and VF ndo ops")
Tested-by: Fei Liu <feliu@redhat.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_virtchnl.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_virtchnl.c b/drivers/net/ethernet/intel/ice/ice_virtchnl.c
index 5612c032f15a..b72606c9e6d0 100644
--- a/drivers/net/ethernet/intel/ice/ice_virtchnl.c
+++ b/drivers/net/ethernet/intel/ice/ice_virtchnl.c
@@ -3625,6 +3625,8 @@ void ice_vc_process_vf_msg(struct ice_pf *pf, struct ice_rq_event_info *event)
 		return;
 	}
 
+	mutex_lock(&vf->cfg_lock);
+
 	/* Check if VF is disabled. */
 	if (test_bit(ICE_VF_STATE_DIS, vf->vf_states)) {
 		err = -EPERM;
@@ -3648,19 +3650,14 @@ void ice_vc_process_vf_msg(struct ice_pf *pf, struct ice_rq_event_info *event)
 				      NULL, 0);
 		dev_err(dev, "Invalid message from VF %d, opcode %d, len %d, error %d\n",
 			vf_id, v_opcode, msglen, err);
-		ice_put_vf(vf);
-		return;
+		goto finish;
 	}
 
-	mutex_lock(&vf->cfg_lock);
-
 	if (!ice_vc_is_opcode_allowed(vf, v_opcode)) {
 		ice_vc_send_msg_to_vf(vf, v_opcode,
 				      VIRTCHNL_STATUS_ERR_NOT_SUPPORTED, NULL,
 				      0);
-		mutex_unlock(&vf->cfg_lock);
-		ice_put_vf(vf);
-		return;
+		goto finish;
 	}
 
 	switch (v_opcode) {
@@ -3773,6 +3770,7 @@ void ice_vc_process_vf_msg(struct ice_pf *pf, struct ice_rq_event_info *event)
 			 vf_id, v_opcode, err);
 	}
 
+finish:
 	mutex_unlock(&vf->cfg_lock);
 	ice_put_vf(vf);
 }
-- 
2.31.1


^ permalink raw reply related

* [PATCH net 1/4] ice: Fix incorrect locking in ice_vc_process_vf_msg()
From: Tony Nguyen @ 2022-04-26 20:30 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: Ivan Vecera, netdev, anthony.l.nguyen, Konrad Jankowski
In-Reply-To: <20220426203018.3790378-1-anthony.l.nguyen@intel.com>

From: Ivan Vecera <ivecera@redhat.com>

Usage of mutex_trylock() in ice_vc_process_vf_msg() is incorrect
because message sent from VF is ignored and never processed.

Use mutex_lock() instead to fix the issue. It is safe because this
mutex is used to prevent races between VF related NDOs and
handlers processing request messages from VF and these handlers
are running in ice_service_task() context. Additionally move this
mutex lock prior ice_vc_is_opcode_allowed() call to avoid potential
races during allowlist access.

Fixes: e6ba5273d4ed ("ice: Fix race conditions between virtchnl handling and VF ndo ops")
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_virtchnl.c | 21 +++++++------------
 1 file changed, 7 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_virtchnl.c b/drivers/net/ethernet/intel/ice/ice_virtchnl.c
index 69ff4b929772..5612c032f15a 100644
--- a/drivers/net/ethernet/intel/ice/ice_virtchnl.c
+++ b/drivers/net/ethernet/intel/ice/ice_virtchnl.c
@@ -3642,14 +3642,6 @@ void ice_vc_process_vf_msg(struct ice_pf *pf, struct ice_rq_event_info *event)
 			err = -EINVAL;
 	}
 
-	if (!ice_vc_is_opcode_allowed(vf, v_opcode)) {
-		ice_vc_send_msg_to_vf(vf, v_opcode,
-				      VIRTCHNL_STATUS_ERR_NOT_SUPPORTED, NULL,
-				      0);
-		ice_put_vf(vf);
-		return;
-	}
-
 error_handler:
 	if (err) {
 		ice_vc_send_msg_to_vf(vf, v_opcode, VIRTCHNL_STATUS_ERR_PARAM,
@@ -3660,12 +3652,13 @@ void ice_vc_process_vf_msg(struct ice_pf *pf, struct ice_rq_event_info *event)
 		return;
 	}
 
-	/* VF is being configured in another context that triggers a VFR, so no
-	 * need to process this message
-	 */
-	if (!mutex_trylock(&vf->cfg_lock)) {
-		dev_info(dev, "VF %u is being configured in another context that will trigger a VFR, so there is no need to handle this message\n",
-			 vf->vf_id);
+	mutex_lock(&vf->cfg_lock);
+
+	if (!ice_vc_is_opcode_allowed(vf, v_opcode)) {
+		ice_vc_send_msg_to_vf(vf, v_opcode,
+				      VIRTCHNL_STATUS_ERR_NOT_SUPPORTED, NULL,
+				      0);
+		mutex_unlock(&vf->cfg_lock);
 		ice_put_vf(vf);
 		return;
 	}
-- 
2.31.1


^ permalink raw reply related

* [PATCH net 3/4] ice: wait 5 s for EMP reset after firmware flash
From: Tony Nguyen @ 2022-04-26 20:30 UTC (permalink / raw)
  To: davem, kuba, pabeni; +Cc: Petr Oros, netdev, anthony.l.nguyen, Gurucharan
In-Reply-To: <20220426203018.3790378-1-anthony.l.nguyen@intel.com>

From: Petr Oros <poros@redhat.com>

We need to wait 5 s for EMP reset after firmware flash. Code was extracted
from OOT driver (ice v1.8.3 downloaded from sourceforge). Without this
wait, fw_activate let card in inconsistent state and recoverable only
by second flash/activate. Flash was tested on these fw's:
From -> To
 3.00 -> 3.10/3.20
 3.10 -> 3.00/3.20
 3.20 -> 3.00/3.10

Reproducer:
[root@host ~]# devlink dev flash pci/0000:ca:00.0 file E810_XXVDA4_FH_O_SEC_FW_1p6p1p9_NVM_3p10_PLDMoMCTP_0.11_8000AD7B.bin
Preparing to flash
[fw.mgmt] Erasing
[fw.mgmt] Erasing done
[fw.mgmt] Flashing 100%
[fw.mgmt] Flashing done 100%
[fw.undi] Erasing
[fw.undi] Erasing done
[fw.undi] Flashing 100%
[fw.undi] Flashing done 100%
[fw.netlist] Erasing
[fw.netlist] Erasing done
[fw.netlist] Flashing 100%
[fw.netlist] Flashing done 100%
Activate new firmware by devlink reload
[root@host ~]# devlink dev reload pci/0000:ca:00.0 action fw_activate
reload_actions_performed:
    fw_activate
[root@host ~]# ip link show ens7f0
71: ens7f0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
    link/ether b4:96:91:dc:72:e0 brd ff:ff:ff:ff:ff:ff
    altname enp202s0f0

dmesg after flash:
[   55.120788] ice: Copyright (c) 2018, Intel Corporation.
[   55.274734] ice 0000:ca:00.0: Get PHY capabilities failed status = -5, continuing anyway
[   55.569797] ice 0000:ca:00.0: The DDP package was successfully loaded: ICE OS Default Package version 1.3.28.0
[   55.603629] ice 0000:ca:00.0: Get PHY capability failed.
[   55.608951] ice 0000:ca:00.0: ice_init_nvm_phy_type failed: -5
[   55.647348] ice 0000:ca:00.0: PTP init successful
[   55.675536] ice 0000:ca:00.0: DCB is enabled in the hardware, max number of TCs supported on this port are 8
[   55.685365] ice 0000:ca:00.0: FW LLDP is disabled, DCBx/LLDP in SW mode.
[   55.692179] ice 0000:ca:00.0: Commit DCB Configuration to the hardware
[   55.701382] ice 0000:ca:00.0: 126.024 Gb/s available PCIe bandwidth, limited by 16.0 GT/s PCIe x8 link at 0000:c9:02.0 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
Reboot doesn’t help, only second flash/activate with OOT or patched
driver put card back in consistent state.

After patch:
[root@host ~]# devlink dev flash pci/0000:ca:00.0 file E810_XXVDA4_FH_O_SEC_FW_1p6p1p9_NVM_3p10_PLDMoMCTP_0.11_8000AD7B.bin
Preparing to flash
[fw.mgmt] Erasing
[fw.mgmt] Erasing done
[fw.mgmt] Flashing 100%
[fw.mgmt] Flashing done 100%
[fw.undi] Erasing
[fw.undi] Erasing done
[fw.undi] Flashing 100%
[fw.undi] Flashing done 100%
[fw.netlist] Erasing
[fw.netlist] Erasing done
[fw.netlist] Flashing 100%
[fw.netlist] Flashing done 100%
Activate new firmware by devlink reload
[root@host ~]# devlink dev reload pci/0000:ca:00.0 action fw_activate
reload_actions_performed:
    fw_activate
[root@host ~]# ip link show ens7f0
19: ens7f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether b4:96:91:dc:72:e0 brd ff:ff:ff:ff:ff:ff
    altname enp202s0f0

Fixes: 399e27dbbd9e94 ("ice: support immediate firmware activation via devlink reload")
Signed-off-by: Petr Oros <poros@redhat.com>
Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index 5b1198859da7..9a0a358a15c2 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -6929,12 +6929,15 @@ static void ice_rebuild(struct ice_pf *pf, enum ice_reset_req reset_type)
 
 	dev_dbg(dev, "rebuilding PF after reset_type=%d\n", reset_type);
 
+#define ICE_EMP_RESET_SLEEP_MS 5000
 	if (reset_type == ICE_RESET_EMPR) {
 		/* If an EMP reset has occurred, any previously pending flash
 		 * update will have completed. We no longer know whether or
 		 * not the NVM update EMP reset is restricted.
 		 */
 		pf->fw_emp_reset_disabled = false;
+
+		msleep(ICE_EMP_RESET_SLEEP_MS);
 	}
 
 	err = ice_init_all_ctrlq(hw);
-- 
2.31.1


^ permalink raw reply related

* [PATCH net 4/4] ice: fix use-after-free when deinitializing mailbox snapshot
From: Tony Nguyen @ 2022-04-26 20:30 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: Jacob Keller, netdev, anthony.l.nguyen, Slawomir Laba,
	Konrad Jankowski
In-Reply-To: <20220426203018.3790378-1-anthony.l.nguyen@intel.com>

From: Jacob Keller <jacob.e.keller@intel.com>

During ice_sriov_configure, if num_vfs is 0, we are being asked by the
kernel to remove all VFs.

The driver first de-initializes the snapshot before freeing all the VFs.
This results in a use-after-free BUG detected by KASAN. The bug occurs
because the snapshot can still be accessed until all VFs are removed.

Fix this by freeing all the VFs first before calling
ice_mbx_deinit_snapshot.

[  +0.032591] ==================================================================
[  +0.000021] BUG: KASAN: use-after-free in ice_mbx_vf_state_handler+0x1c3/0x410 [ice]
[  +0.000315] Write of size 28 at addr ffff889908eb6f28 by task kworker/55:2/1530996

[  +0.000029] CPU: 55 PID: 1530996 Comm: kworker/55:2 Kdump: loaded Tainted: G S        I       5.17.0-dirty #1
[  +0.000022] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 1.6.13 12/17/2018
[  +0.000013] Workqueue: ice ice_service_task [ice]
[  +0.000279] Call Trace:
[  +0.000012]  <TASK>
[  +0.000011]  dump_stack_lvl+0x33/0x42
[  +0.000030]  print_report.cold.13+0xb2/0x6b3
[  +0.000028]  ? ice_mbx_vf_state_handler+0x1c3/0x410 [ice]
[  +0.000295]  kasan_report+0xa5/0x120
[  +0.000026]  ? __switch_to_asm+0x21/0x70
[  +0.000024]  ? ice_mbx_vf_state_handler+0x1c3/0x410 [ice]
[  +0.000298]  kasan_check_range+0x183/0x1e0
[  +0.000019]  memset+0x1f/0x40
[  +0.000018]  ice_mbx_vf_state_handler+0x1c3/0x410 [ice]
[  +0.000304]  ? ice_conv_link_speed_to_virtchnl+0x160/0x160 [ice]
[  +0.000297]  ? ice_vsi_dis_spoofchk+0x40/0x40 [ice]
[  +0.000305]  ice_is_malicious_vf+0x1aa/0x250 [ice]
[  +0.000303]  ? ice_restore_all_vfs_msi_state+0x160/0x160 [ice]
[  +0.000297]  ? __mutex_unlock_slowpath.isra.15+0x410/0x410
[  +0.000022]  ? ice_debug_cq+0xb7/0x230 [ice]
[  +0.000273]  ? __kasan_slab_alloc+0x2f/0x90
[  +0.000022]  ? memset+0x1f/0x40
[  +0.000017]  ? do_raw_spin_lock+0x119/0x1d0
[  +0.000022]  ? rwlock_bug.part.2+0x60/0x60
[  +0.000024]  __ice_clean_ctrlq+0x3a6/0xd60 [ice]
[  +0.000273]  ? newidle_balance+0x5b1/0x700
[  +0.000026]  ? ice_print_link_msg+0x2f0/0x2f0 [ice]
[  +0.000271]  ? update_cfs_group+0x1b/0x140
[  +0.000018]  ? load_balance+0x1260/0x1260
[  +0.000022]  ? ice_process_vflr_event+0x27/0x130 [ice]
[  +0.000301]  ice_service_task+0x136e/0x1470 [ice]
[  +0.000281]  process_one_work+0x3b4/0x6c0
[  +0.000030]  worker_thread+0x65/0x660
[  +0.000023]  ? __kthread_parkme+0xe4/0x100
[  +0.000021]  ? process_one_work+0x6c0/0x6c0
[  +0.000020]  kthread+0x179/0x1b0
[  +0.000018]  ? kthread_complete_and_exit+0x20/0x20
[  +0.000022]  ret_from_fork+0x22/0x30
[  +0.000026]  </TASK>

[  +0.000018] Allocated by task 10742:
[  +0.000013]  kasan_save_stack+0x1c/0x40
[  +0.000018]  __kasan_kmalloc+0x84/0xa0
[  +0.000016]  kmem_cache_alloc_trace+0x16c/0x2e0
[  +0.000015]  intel_iommu_probe_device+0xeb/0x860
[  +0.000015]  __iommu_probe_device+0x9a/0x2f0
[  +0.000016]  iommu_probe_device+0x43/0x270
[  +0.000015]  iommu_bus_notifier+0xa7/0xd0
[  +0.000015]  blocking_notifier_call_chain+0x90/0xc0
[  +0.000017]  device_add+0x5f3/0xd70
[  +0.000014]  pci_device_add+0x404/0xa40
[  +0.000015]  pci_iov_add_virtfn+0x3b0/0x550
[  +0.000016]  sriov_enable+0x3bb/0x600
[  +0.000013]  ice_ena_vfs+0x113/0xa79 [ice]
[  +0.000293]  ice_sriov_configure.cold.17+0x21/0xe0 [ice]
[  +0.000291]  sriov_numvfs_store+0x160/0x200
[  +0.000015]  kernfs_fop_write_iter+0x1db/0x270
[  +0.000018]  new_sync_write+0x21d/0x330
[  +0.000013]  vfs_write+0x376/0x410
[  +0.000013]  ksys_write+0xba/0x150
[  +0.000012]  do_syscall_64+0x3a/0x80
[  +0.000012]  entry_SYSCALL_64_after_hwframe+0x44/0xae

[  +0.000028] Freed by task 10742:
[  +0.000011]  kasan_save_stack+0x1c/0x40
[  +0.000015]  kasan_set_track+0x21/0x30
[  +0.000016]  kasan_set_free_info+0x20/0x30
[  +0.000012]  __kasan_slab_free+0x104/0x170
[  +0.000016]  kfree+0x9b/0x470
[  +0.000013]  devres_destroy+0x1c/0x20
[  +0.000015]  devm_kfree+0x33/0x40
[  +0.000012]  ice_mbx_deinit_snapshot+0x39/0x70 [ice]
[  +0.000295]  ice_sriov_configure+0xb0/0x260 [ice]
[  +0.000295]  sriov_numvfs_store+0x1bc/0x200
[  +0.000015]  kernfs_fop_write_iter+0x1db/0x270
[  +0.000016]  new_sync_write+0x21d/0x330
[  +0.000012]  vfs_write+0x376/0x410
[  +0.000012]  ksys_write+0xba/0x150
[  +0.000012]  do_syscall_64+0x3a/0x80
[  +0.000012]  entry_SYSCALL_64_after_hwframe+0x44/0xae

[  +0.000024] Last potentially related work creation:
[  +0.000010]  kasan_save_stack+0x1c/0x40
[  +0.000016]  __kasan_record_aux_stack+0x98/0xa0
[  +0.000013]  insert_work+0x34/0x160
[  +0.000015]  __queue_work+0x20e/0x650
[  +0.000016]  queue_work_on+0x4c/0x60
[  +0.000015]  nf_nat_masq_schedule+0x297/0x2e0 [nf_nat]
[  +0.000034]  masq_device_event+0x5a/0x60 [nf_nat]
[  +0.000031]  raw_notifier_call_chain+0x5f/0x80
[  +0.000017]  dev_close_many+0x1d6/0x2c0
[  +0.000015]  unregister_netdevice_many+0x4e3/0xa30
[  +0.000015]  unregister_netdevice_queue+0x192/0x1d0
[  +0.000014]  iavf_remove+0x8f9/0x930 [iavf]
[  +0.000058]  pci_device_remove+0x65/0x110
[  +0.000015]  device_release_driver_internal+0xf8/0x190
[  +0.000017]  pci_stop_bus_device+0xb5/0xf0
[  +0.000014]  pci_stop_and_remove_bus_device+0xe/0x20
[  +0.000016]  pci_iov_remove_virtfn+0x19c/0x230
[  +0.000015]  sriov_disable+0x4f/0x170
[  +0.000014]  ice_free_vfs+0x9a/0x490 [ice]
[  +0.000306]  ice_sriov_configure+0xb8/0x260 [ice]
[  +0.000294]  sriov_numvfs_store+0x1bc/0x200
[  +0.000015]  kernfs_fop_write_iter+0x1db/0x270
[  +0.000016]  new_sync_write+0x21d/0x330
[  +0.000012]  vfs_write+0x376/0x410
[  +0.000012]  ksys_write+0xba/0x150
[  +0.000012]  do_syscall_64+0x3a/0x80
[  +0.000012]  entry_SYSCALL_64_after_hwframe+0x44/0xae

[  +0.000025] The buggy address belongs to the object at ffff889908eb6f00
               which belongs to the cache kmalloc-96 of size 96
[  +0.000016] The buggy address is located 40 bytes inside of
               96-byte region [ffff889908eb6f00, ffff889908eb6f60)

[  +0.000026] The buggy address belongs to the physical page:
[  +0.000010] page:00000000b7e99a2e refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1908eb6
[  +0.000016] flags: 0x57ffffc0000200(slab|node=1|zone=2|lastcpupid=0x1fffff)
[  +0.000024] raw: 0057ffffc0000200 ffffea0069d9fd80 dead000000000002 ffff88810004c780
[  +0.000015] raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000
[  +0.000009] page dumped because: kasan: bad access detected

[  +0.000016] Memory state around the buggy address:
[  +0.000012]  ffff889908eb6e00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
[  +0.000014]  ffff889908eb6e80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
[  +0.000014] >ffff889908eb6f00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
[  +0.000011]                                   ^
[  +0.000013]  ffff889908eb6f80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
[  +0.000013]  ffff889908eb7000: fa fb fb fb fb fb fb fb fc fc fc fc fa fb fb fb
[  +0.000012] ==================================================================

Fixes: 0891c89674e8 ("ice: warn about potentially malicious VFs")
Reported-by: Slawomir Laba <slawomirx.laba@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_sriov.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_sriov.c b/drivers/net/ethernet/intel/ice/ice_sriov.c
index 8915a9d39e36..0c438219f7a3 100644
--- a/drivers/net/ethernet/intel/ice/ice_sriov.c
+++ b/drivers/net/ethernet/intel/ice/ice_sriov.c
@@ -1046,8 +1046,8 @@ int ice_sriov_configure(struct pci_dev *pdev, int num_vfs)
 
 	if (!num_vfs) {
 		if (!pci_vfs_assigned(pdev)) {
-			ice_mbx_deinit_snapshot(&pf->hw);
 			ice_free_vfs(pf);
+			ice_mbx_deinit_snapshot(&pf->hw);
 			if (pf->lag)
 				ice_enable_lag(pf->lag);
 			return 0;
-- 
2.31.1


^ permalink raw reply related

* Re: [PATCH net-next] net: mark tulip obsolete
From: Helge Deller @ 2022-04-26 20:53 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: davem, netdev, linux-parisc, linux-alpha@vger.kernel.org
In-Reply-To: <20220426055311.53dd8c31@kernel.org>

Hi Jakub,

On 4/26/22 14:53, Jakub Kicinski wrote:
> On Tue, 15 Mar 2022 23:18:38 +0100 Helge Deller wrote:
>> On 3/15/22 20:04, Jakub Kicinski wrote:
>>> On Tue, 15 Mar 2022 19:44:24 +0100 Helge Deller wrote:
>>>> On 3/15/22 19:43, Jakub Kicinski wrote:
>>>>> It's ancient, an likely completely unused at this point.
>>>>> Let's mark it obsolete to prevent refactoring.
>>>>
>>>> NAK.
>>>>
>>>> This driver is needed by nearly all PA-RISC machines.
>>>
>>> I was just trying to steer newcomers to code that's more relevant today.
>>
>> That intention is ok, but "obsolete" means it's not used any more,
>> and that's not true.
>
> Hi Helge! Which incarnation of tulip do you need for PA-RISC, exactly?

For parisc I have:

CONFIG_NET_TULIP=y
# CONFIG_DE2104X is not set
CONFIG_TULIP=y
# CONFIG_TULIP_MWI is not set
# CONFIG_TULIP_MMIO is not set
# CONFIG_TULIP_NAPI is not set
# CONFIG_DE4X5 is not set
# CONFIG_WINBOND_840 is not set
# CONFIG_DM9102 is not set
# CONFIG_ULI526X is not set
# CONFIG_PCMCIA_XIRCOM is not set
# CONFIG_NET_VENDOR_DLINK is not set
# CONFIG_NET_VENDOR_EMULEX is not set

So not the DE4X5.

> I'd like to try to remove DE4X5, if that's not the one you need
> (getting rid of virt_to_bus()-using drivers).

I've CC'ed the linux-alpha mailing list, as the DE4X5 driver might be
needed there, so removing it completely might not be the best idea.

But since you want to remove virt_to_bus()....
It seems this virt_to_bus() call is used for really old x86 machines/cards,
which probably aren't supported any longer.

See drivers/net/ethernet/dec/tulip/de4x5.c:
...
#if !defined(__alpha__) && !defined(__powerpc__) && !defined(CONFIG_SPARC) && !defined(DE4X5_DO_MEMCPY)
...
    tmp = virt_to_bus(p->data);
...

Maybe you could simply remove the part inside #if...#else
and insert a pr_err() instead (and return NULL)?

Helge

^ permalink raw reply

* LPC 2022 Networking and BPF Track CFP
From: Daniel Borkmann @ 2022-04-26 20:59 UTC (permalink / raw)
  To: netdev, bpf
  Cc: xdp-newbies, iovisor-dev, linux-wireless, netfilter-devel, lwn

We are pleased to announce the Call for Proposals (CFP) for the Networking and
BPF track at the 2022 edition of the Linux Plumbers Conference (LPC), which is
planned to be held in Dublin, Ireland, on September 12th - 14th, 2022.

Note that the conference is planned to be both in person and remote (hybrid).
CFP submitters should ideally be able to give their presentation in person to
minimize technical issues if circumstances permit, although presenting remotely
will also be possible.

This year's Networking and BPF track technical committee is comprised of:

    David S. Miller <davem@davemloft.net>
    Jakub Kicinski <kuba@kernel.org>
    Paolo Abeni <pabeni@redhat.com>
    Eric Dumazet <edumazet@google.com>
    Alexei Starovoitov <ast@kernel.org>
    Daniel Borkmann <daniel@iogearbox.net>
    Andrii Nakryiko <andrii@kernel.org>

We are seeking proposals of 40 minutes in length (including Q&A discussion).

Any kind of advanced Linux networking and/or BPF related topic will be considered.

Please submit your proposals through the official LPC website at:

    https://lpc.events/event/16/abstracts/

Make sure to select "eBPF & Networking" in the track pull-down menu.

Proposals must be submitted by August 10th, and submitters will be notified of
acceptance by August 12th.

Final slides (as PDF) are due on the first day of the conference.

We are very much looking forward to a great conference and seeing you all!

^ permalink raw reply

* Re: [PATCH net 1/2] ptp: ptp_clockmatrix: Add PTP_CLK_REQ_EXTTS support
From: Jakub Kicinski @ 2022-04-26 21:21 UTC (permalink / raw)
  To: Min Li; +Cc: richardcochran, lee.jones, linux-kernel, netdev
In-Reply-To: <1651001574-32457-1-git-send-email-min.li.xe@renesas.com>

On Tue, 26 Apr 2022 15:32:53 -0400 Min Li wrote:
> Use TOD_READ_SECONDARY for extts to keep TOD_READ_PRIMARY
> for gettime and settime exclusively

Does not build on 32 bit.

> -/*
> +/**
>   * Maximum absolute value for write phase offset in picoseconds
>   *
> + * @channel:  channel
> + * @delta_ns: delta in nanoseconds
> + *

Not a proper kdoc, first line must include struct name or function name.

Please wait 24h with reposting fixed version so other feedback has a
chance to come in.

^ permalink raw reply

* Re: [PATCH net 1/2] ptp: ptp_clockmatrix: Add PTP_CLK_REQ_EXTTS support
From: Richard Cochran @ 2022-04-26 21:25 UTC (permalink / raw)
  To: Min Li; +Cc: lee.jones, linux-kernel, netdev
In-Reply-To: <1651001574-32457-1-git-send-email-min.li.xe@renesas.com>

On Tue, Apr 26, 2022 at 03:32:53PM -0400, Min Li wrote:
> Use TOD_READ_SECONDARY for extts to keep TOD_READ_PRIMARY
> for gettime and settime exclusively
> 
> Signed-off-by: Min Li <min.li.xe@renesas.com>
> ---
>  drivers/ptp/ptp_clockmatrix.c    | 303 +++++++++++++++++++++++++--------------
>  drivers/ptp/ptp_clockmatrix.h    |   5 +
>  include/linux/mfd/idt8a340_reg.h |  12 +-
>  3 files changed, 209 insertions(+), 111 deletions(-)

Acked-by: Richard Cochran <richardcochran@gmail.com>

^ permalink raw reply

* Re: [PATCH net 2/2] ptp: ptp_clockmatrix: return -EBUSY if phase pull-in is in progress
From: Richard Cochran @ 2022-04-26 21:26 UTC (permalink / raw)
  To: Min Li; +Cc: lee.jones, linux-kernel, netdev
In-Reply-To: <1651001574-32457-2-git-send-email-min.li.xe@renesas.com>

On Tue, Apr 26, 2022 at 03:32:54PM -0400, Min Li wrote:
> Also removes PEROUT_ENABLE_OUTPUT_MASK
> 
> Signed-off-by: Min Li <min.li.xe@renesas.com>
> ---
>  drivers/ptp/ptp_clockmatrix.c | 32 ++------------------------------
>  drivers/ptp/ptp_clockmatrix.h |  2 --
>  2 files changed, 2 insertions(+), 32 deletions(-)

Acked-by: Richard Cochran <richardcochran@gmail.com>

^ permalink raw reply

* Re: [PATCH net-next] net: mark tulip obsolete
From: Jakub Kicinski @ 2022-04-26 21:31 UTC (permalink / raw)
  To: Helge Deller; +Cc: davem, netdev, linux-parisc, linux-alpha@vger.kernel.org
In-Reply-To: <87650cea-d190-9642-edf7-7dea42802dab@gmx.de>

On Tue, 26 Apr 2022 22:53:00 +0200 Helge Deller wrote:
> >> That intention is ok, but "obsolete" means it's not used any more,
> >> and that's not true.  
> >
> > Hi Helge! Which incarnation of tulip do you need for PA-RISC, exactly?  
> 
> For parisc I have:
> 
> CONFIG_NET_TULIP=y
> # CONFIG_DE2104X is not set
> CONFIG_TULIP=y
> # CONFIG_TULIP_MWI is not set
> # CONFIG_TULIP_MMIO is not set
> # CONFIG_TULIP_NAPI is not set
> # CONFIG_DE4X5 is not set
> # CONFIG_WINBOND_840 is not set
> # CONFIG_DM9102 is not set
> # CONFIG_ULI526X is not set
> # CONFIG_PCMCIA_XIRCOM is not set
> # CONFIG_NET_VENDOR_DLINK is not set
> # CONFIG_NET_VENDOR_EMULEX is not set
> 
> So not the DE4X5.
> 
> > I'd like to try to remove DE4X5, if that's not the one you need
> > (getting rid of virt_to_bus()-using drivers).  
> 
> I've CC'ed the linux-alpha mailing list, as the DE4X5 driver might be
> needed there, so removing it completely might not be the best idea.
> 
> But since you want to remove virt_to_bus()....
> It seems this virt_to_bus() call is used for really old x86 machines/cards,
> which probably aren't supported any longer.
> 
> See drivers/net/ethernet/dec/tulip/de4x5.c:
> ...
> #if !defined(__alpha__) && !defined(__powerpc__) && !defined(CONFIG_SPARC) && !defined(DE4X5_DO_MEMCPY)
> ...
>     tmp = virt_to_bus(p->data);
> ...
> 
> Maybe you could simply remove the part inside #if...#else
> and insert a pr_err() instead (and return NULL)?

Ah, good find, thanks for taking a look! I'll look into dropping just
sections of the code.

^ permalink raw reply

* [PATCH net-next 0/7] mptcp: Timeout for MP_FAIL response
From: Mat Martineau @ 2022-04-26 21:57 UTC (permalink / raw)
  To: netdev; +Cc: Mat Martineau, davem, kuba, pabeni, matthieu.baerts, mptcp

When one peer sends an infinite mapping to coordinate fallback from
MPTCP to regular TCP, the other peer is expected to send a packet with
the MPTCP MP_FAIL option to acknowledge the infinite mapping. Rather
than leave the connection in some half-fallback state, this series adds
a timeout after which the infinite mapping sender will reset the
connection.

Patch 1 adds a fallback self test.

Patches 2-5 make use of the MPTCP socket's retransmit timer to reset the
MPTCP connection if no MP_FAIL was received.

Patches 6 and 7 extends the self test to check MP_FAIL-related MIBs.

Geliang Tang (7):
  selftests: mptcp: add infinite map testcase
  mptcp: use mptcp_stop_timer
  mptcp: add data lock for sk timers
  mptcp: add MP_FAIL response support
  mptcp: reset subflow when MP_FAIL doesn't respond
  selftests: mptcp: check MP_FAIL response mibs
  selftests: mptcp: print extra msg in chk_csum_nr

 net/mptcp/pm.c                                |  18 ++-
 net/mptcp/protocol.c                          |  64 +++++++++-
 net/mptcp/protocol.h                          |   2 +
 net/mptcp/subflow.c                           |  13 ++
 tools/testing/selftests/net/mptcp/config      |   8 ++
 .../testing/selftests/net/mptcp/mptcp_join.sh | 119 +++++++++++++++++-
 6 files changed, 216 insertions(+), 8 deletions(-)


base-commit: 561215482cc69d1c758944d4463b3d5d96d37bd1
-- 
2.36.0


^ permalink raw reply

* [PATCH net-next 1/7] selftests: mptcp: add infinite map testcase
From: Mat Martineau @ 2022-04-26 21:57 UTC (permalink / raw)
  To: netdev
  Cc: Geliang Tang, davem, kuba, pabeni, matthieu.baerts, mptcp,
	Davide Caratti, Mat Martineau
In-Reply-To: <20220426215717.129506-1-mathew.j.martineau@linux.intel.com>

From: Geliang Tang <geliang.tang@suse.com>

Add the single subflow test case for MP_FAIL, to test the infinite
mapping case. Use the test_linkfail value to make 128KB test files.

Add a new function reset_with_fail(), in it use 'iptables' and 'tc
action pedit' rules to produce the bit flips to trigger the checksum
failures. Set validate_checksum to enable checksums for the MP_FAIL
tests without passing the '-C' argument. Set check_invert flag to
enable the invert bytes check for the output data in check_transfer().
Instead of the file mismatch error, this test prints out the inverted
bytes.

Add a new function pedit_action_pkts() to get the numbers of the packets
edited by the tc pedit actions. Print this numbers to the output.

Also add the needed kernel configures in the selftests config file.

Suggested-by: Davide Caratti <dcaratti@redhat.com>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 tools/testing/selftests/net/mptcp/config      |  8 +++
 .../testing/selftests/net/mptcp/mptcp_join.sh | 70 ++++++++++++++++++-
 2 files changed, 77 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/mptcp/config b/tools/testing/selftests/net/mptcp/config
index d36b7da5082a..38021a0dd527 100644
--- a/tools/testing/selftests/net/mptcp/config
+++ b/tools/testing/selftests/net/mptcp/config
@@ -12,6 +12,9 @@ CONFIG_NF_TABLES=m
 CONFIG_NFT_COMPAT=m
 CONFIG_NETFILTER_XTABLES=m
 CONFIG_NETFILTER_XT_MATCH_BPF=m
+CONFIG_NETFILTER_XT_MATCH_LENGTH=m
+CONFIG_NETFILTER_XT_MATCH_STATISTIC=m
+CONFIG_NETFILTER_XT_TARGET_MARK=m
 CONFIG_NF_TABLES_INET=y
 CONFIG_NFT_TPROXY=m
 CONFIG_NFT_SOCKET=m
@@ -19,3 +22,8 @@ CONFIG_IP_ADVANCED_ROUTER=y
 CONFIG_IP_MULTIPLE_TABLES=y
 CONFIG_IP_NF_TARGET_REJECT=m
 CONFIG_IPV6_MULTIPLE_TABLES=y
+CONFIG_NET_ACT_CSUM=m
+CONFIG_NET_ACT_PEDIT=m
+CONFIG_NET_CLS_ACT=y
+CONFIG_NET_CLS_FW=m
+CONFIG_NET_SCH_INGRESS=m
diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh
index 9eb4d889a24a..34152a50a3e2 100755
--- a/tools/testing/selftests/net/mptcp/mptcp_join.sh
+++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh
@@ -266,6 +266,58 @@ reset_with_allow_join_id0()
 	ip netns exec $ns2 sysctl -q net.mptcp.allow_join_initial_addr_port=$ns2_enable
 }
 
+# Modify TCP payload without corrupting the TCP packet
+#
+# This rule inverts a 8-bit word at byte offset 148 for the 2nd TCP ACK packets
+# carrying enough data.
+# Once it is done, the TCP Checksum field is updated so the packet is still
+# considered as valid at the TCP level.
+# Because the MPTCP checksum, covering the TCP options and data, has not been
+# updated, the modification will be detected and an MP_FAIL will be emitted:
+# what we want to validate here without corrupting "random" MPTCP options.
+#
+# To avoid having tc producing this pr_info() message for each TCP ACK packets
+# not carrying enough data:
+#
+#     tc action pedit offset 162 out of bounds
+#
+# Netfilter is used to mark packets with enough data.
+reset_with_fail()
+{
+	reset "${1}" || return 1
+
+	ip netns exec $ns1 sysctl -q net.mptcp.checksum_enabled=1
+	ip netns exec $ns2 sysctl -q net.mptcp.checksum_enabled=1
+
+	check_invert=1
+	validate_checksum=1
+	local i="$2"
+	local ip="${3:-4}"
+	local tables
+
+	tables="iptables"
+	if [ $ip -eq 6 ]; then
+		tables="ip6tables"
+	fi
+
+	ip netns exec $ns2 $tables \
+		-t mangle \
+		-A OUTPUT \
+		-o ns2eth$i \
+		-p tcp \
+		-m length --length 150:9999 \
+		-m statistic --mode nth --packet 1 --every 99999 \
+		-j MARK --set-mark 42 || exit 1
+
+	tc -n $ns2 qdisc add dev ns2eth$i clsact || exit 1
+	tc -n $ns2 filter add dev ns2eth$i egress \
+		protocol ip prio 1000 \
+		handle 42 fw \
+		action pedit munge offset 148 u8 invert \
+		pipe csum tcp \
+		index 100 || exit 1
+}
+
 fail_test()
 {
 	ret=1
@@ -1199,7 +1251,7 @@ chk_join_nr()
 		echo "[ ok ]"
 	fi
 	[ "${dump_stats}" = 1 ] && dump_stats
-	if [ $checksum -eq 1 ]; then
+	if [ $validate_checksum -eq 1 ]; then
 		chk_csum_nr $csum_ns1 $csum_ns2
 		chk_fail_nr $fail_nr $fail_nr
 		chk_rst_nr $rst_nr $rst_nr
@@ -2590,6 +2642,21 @@ fastclose_tests()
 	fi
 }
 
+pedit_action_pkts()
+{
+	tc -n $ns2 -j -s action show action pedit index 100 | \
+		sed 's/.*"packets":\([0-9]\+\),.*/\1/'
+}
+
+fail_tests()
+{
+	# single subflow
+	if reset_with_fail "Infinite map" 1; then
+		run_tests $ns1 $ns2 10.0.1.1 128
+		chk_join_nr 0 0 0 +1 +0 1 0 1 "$(pedit_action_pkts)"
+	fi
+}
+
 implicit_tests()
 {
 	# userspace pm type prevents add_addr
@@ -2658,6 +2725,7 @@ all_tests_sorted=(
 	d@deny_join_id0_tests
 	m@fullmesh_tests
 	z@fastclose_tests
+	F@fail_tests
 	I@implicit_tests
 )
 
-- 
2.36.0


^ permalink raw reply related

* [PATCH net-next 2/7] mptcp: use mptcp_stop_timer
From: Mat Martineau @ 2022-04-26 21:57 UTC (permalink / raw)
  To: netdev
  Cc: Geliang Tang, davem, kuba, pabeni, matthieu.baerts, mptcp,
	Mat Martineau
In-Reply-To: <20220426215717.129506-1-mathew.j.martineau@linux.intel.com>

From: Geliang Tang <geliang.tang@suse.com>

Use the helper mptcp_stop_timer() instead of using sk_stop_timer() to
stop icsk_retransmit_timer directly.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 net/mptcp/protocol.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 4581c570ef68..e3db319ce92e 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2753,7 +2753,7 @@ static void __mptcp_destroy_sock(struct sock *sk)
 	/* join list will be eventually flushed (with rst) at sock lock release time*/
 	list_splice_init(&msk->conn_list, &conn_list);
 
-	sk_stop_timer(sk, &msk->sk.icsk_retransmit_timer);
+	mptcp_stop_timer(sk);
 	sk_stop_timer(sk, &sk->sk_timer);
 	msk->pm.status = 0;
 
@@ -2861,7 +2861,7 @@ static int mptcp_disconnect(struct sock *sk, int flags)
 		__mptcp_close_ssk(sk, ssk, subflow, MPTCP_CF_FASTCLOSE);
 	}
 
-	sk_stop_timer(sk, &msk->sk.icsk_retransmit_timer);
+	mptcp_stop_timer(sk);
 	sk_stop_timer(sk, &sk->sk_timer);
 
 	if (mptcp_sk(sk)->token)
-- 
2.36.0


^ permalink raw reply related

* [PATCH net-next 3/7] mptcp: add data lock for sk timers
From: Mat Martineau @ 2022-04-26 21:57 UTC (permalink / raw)
  To: netdev
  Cc: Geliang Tang, davem, kuba, pabeni, matthieu.baerts, mptcp,
	Mat Martineau
In-Reply-To: <20220426215717.129506-1-mathew.j.martineau@linux.intel.com>

From: Geliang Tang <geliang.tang@suse.com>

mptcp_data_lock() needs to be held when manipulating the msk
retransmit_timer or the sk sk_timer. This patch adds the data
lock for the both timers.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 net/mptcp/protocol.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index e3db319ce92e..ea74122065f1 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1605,8 +1605,10 @@ void __mptcp_push_pending(struct sock *sk, unsigned int flags)
 
 out:
 	/* ensure the rtx timer is running */
+	mptcp_data_lock(sk);
 	if (!mptcp_timer_pending(sk))
 		mptcp_reset_timer(sk);
+	mptcp_data_unlock(sk);
 	if (copied)
 		__mptcp_check_send_data_fin(sk);
 }
@@ -2491,8 +2493,10 @@ static void __mptcp_retrans(struct sock *sk)
 reset_timer:
 	mptcp_check_and_set_pending(sk);
 
+	mptcp_data_lock(sk);
 	if (!mptcp_timer_pending(sk))
 		mptcp_reset_timer(sk);
+	mptcp_data_unlock(sk);
 }
 
 static void mptcp_worker(struct work_struct *work)
@@ -2651,8 +2655,10 @@ void mptcp_subflow_shutdown(struct sock *sk, struct sock *ssk, int how)
 		} else {
 			pr_debug("Sending DATA_FIN on subflow %p", ssk);
 			tcp_send_ack(ssk);
+			mptcp_data_lock(sk);
 			if (!mptcp_timer_pending(sk))
 				mptcp_reset_timer(sk);
+			mptcp_data_unlock(sk);
 		}
 		break;
 	}
@@ -2753,8 +2759,10 @@ static void __mptcp_destroy_sock(struct sock *sk)
 	/* join list will be eventually flushed (with rst) at sock lock release time*/
 	list_splice_init(&msk->conn_list, &conn_list);
 
+	mptcp_data_lock(sk);
 	mptcp_stop_timer(sk);
 	sk_stop_timer(sk, &sk->sk_timer);
+	mptcp_data_unlock(sk);
 	msk->pm.status = 0;
 
 	/* clears msk->subflow, allowing the following loop to close
@@ -2816,7 +2824,9 @@ static void mptcp_close(struct sock *sk, long timeout)
 		__mptcp_destroy_sock(sk);
 		do_cancel_work = true;
 	} else {
+		mptcp_data_lock(sk);
 		sk_reset_timer(sk, &sk->sk_timer, jiffies + TCP_TIMEWAIT_LEN);
+		mptcp_data_unlock(sk);
 	}
 	release_sock(sk);
 	if (do_cancel_work)
@@ -2861,8 +2871,10 @@ static int mptcp_disconnect(struct sock *sk, int flags)
 		__mptcp_close_ssk(sk, ssk, subflow, MPTCP_CF_FASTCLOSE);
 	}
 
+	mptcp_data_lock(sk);
 	mptcp_stop_timer(sk);
 	sk_stop_timer(sk, &sk->sk_timer);
+	mptcp_data_unlock(sk);
 
 	if (mptcp_sk(sk)->token)
 		mptcp_event(MPTCP_EVENT_CLOSED, mptcp_sk(sk), NULL, GFP_KERNEL);
-- 
2.36.0


^ permalink raw reply related

* [PATCH net-next 4/7] mptcp: add MP_FAIL response support
From: Mat Martineau @ 2022-04-26 21:57 UTC (permalink / raw)
  To: netdev
  Cc: Geliang Tang, davem, kuba, pabeni, matthieu.baerts, mptcp,
	Mat Martineau
In-Reply-To: <20220426215717.129506-1-mathew.j.martineau@linux.intel.com>

From: Geliang Tang <geliang.tang@suse.com>

This patch adds a new struct member mp_fail_response_expect in struct
mptcp_subflow_context to support MP_FAIL response. In the single subflow
with checksum error and contiguous data special case, a MP_FAIL is sent
in response to another MP_FAIL.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 net/mptcp/pm.c       | 10 +++++++++-
 net/mptcp/protocol.h |  1 +
 net/mptcp/subflow.c  |  2 ++
 3 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c
index 5c36870d3420..971e843a304c 100644
--- a/net/mptcp/pm.c
+++ b/net/mptcp/pm.c
@@ -290,8 +290,16 @@ void mptcp_pm_mp_fail_received(struct sock *sk, u64 fail_seq)
 
 	pr_debug("fail_seq=%llu", fail_seq);
 
-	if (!mptcp_has_another_subflow(sk) && READ_ONCE(msk->allow_infinite_fallback))
+	if (mptcp_has_another_subflow(sk) || !READ_ONCE(msk->allow_infinite_fallback))
+		return;
+
+	if (!READ_ONCE(subflow->mp_fail_response_expect)) {
+		pr_debug("send MP_FAIL response and infinite map");
+
+		subflow->send_mp_fail = 1;
+		MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_MPFAILTX);
 		subflow->send_infinite_map = 1;
+	}
 }
 
 /* path manager helpers */
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 61d600693ffd..cc66c81a8fab 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -448,6 +448,7 @@ struct mptcp_subflow_context {
 		stale : 1,	    /* unable to snd/rcv data, do not use for xmit */
 		local_id_valid : 1; /* local_id is correctly initialized */
 	enum mptcp_data_avail data_avail;
+	bool	mp_fail_response_expect;
 	u32	remote_nonce;
 	u64	thmac;
 	u32	local_nonce;
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 30ffb00661bb..ca2352ad20d4 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1217,6 +1217,8 @@ static bool subflow_check_data_avail(struct sock *ssk)
 				tcp_send_active_reset(ssk, GFP_ATOMIC);
 				while ((skb = skb_peek(&ssk->sk_receive_queue)))
 					sk_eat_skb(ssk, skb);
+			} else {
+				WRITE_ONCE(subflow->mp_fail_response_expect, true);
 			}
 			WRITE_ONCE(subflow->data_avail, MPTCP_SUBFLOW_NODATA);
 			return true;
-- 
2.36.0


^ permalink raw reply related

* [PATCH net-next 7/7] selftests: mptcp: print extra msg in chk_csum_nr
From: Mat Martineau @ 2022-04-26 21:57 UTC (permalink / raw)
  To: netdev
  Cc: Geliang Tang, davem, kuba, pabeni, matthieu.baerts, mptcp,
	Mat Martineau
In-Reply-To: <20220426215717.129506-1-mathew.j.martineau@linux.intel.com>

From: Geliang Tang <geliang.tang@suse.com>

When the multiple checksum errors occur in chk_csum_nr(), print the
numbers of the errors as an extra message.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 tools/testing/selftests/net/mptcp/mptcp_join.sh | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh
index 8023c0773d95..e5c8fc2816fb 100755
--- a/tools/testing/selftests/net/mptcp/mptcp_join.sh
+++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh
@@ -1013,6 +1013,7 @@ chk_csum_nr()
 	local csum_ns2=${2:-0}
 	local count
 	local dump_stats
+	local extra_msg=""
 	local allow_multi_errors_ns1=0
 	local allow_multi_errors_ns2=0
 
@@ -1028,6 +1029,9 @@ chk_csum_nr()
 	printf "%-${nr_blank}s %s" " " "sum"
 	count=$(ip netns exec $ns1 nstat -as | grep MPTcpExtDataCsumErr | awk '{print $2}')
 	[ -z "$count" ] && count=0
+	if [ "$count" != "$csum_ns1" ]; then
+		extra_msg="$extra_msg ns1=$count"
+	fi
 	if { [ "$count" != $csum_ns1 ] && [ $allow_multi_errors_ns1 -eq 0 ]; } ||
 	   { [ "$count" -lt $csum_ns1 ] && [ $allow_multi_errors_ns1 -eq 1 ]; }; then
 		echo "[fail] got $count data checksum error[s] expected $csum_ns1"
@@ -1039,15 +1043,20 @@ chk_csum_nr()
 	echo -n " - csum  "
 	count=$(ip netns exec $ns2 nstat -as | grep MPTcpExtDataCsumErr | awk '{print $2}')
 	[ -z "$count" ] && count=0
+	if [ "$count" != "$csum_ns2" ]; then
+		extra_msg="$extra_msg ns2=$count"
+	fi
 	if { [ "$count" != $csum_ns2 ] && [ $allow_multi_errors_ns2 -eq 0 ]; } ||
 	   { [ "$count" -lt $csum_ns2 ] && [ $allow_multi_errors_ns2 -eq 1 ]; }; then
 		echo "[fail] got $count data checksum error[s] expected $csum_ns2"
 		fail_test
 		dump_stats=1
 	else
-		echo "[ ok ]"
+		echo -n "[ ok ]"
 	fi
 	[ "${dump_stats}" = 1 ] && dump_stats
+
+	echo "$extra_msg"
 }
 
 chk_fail_nr()
-- 
2.36.0


^ permalink raw reply related

* [PATCH net-next 6/7] selftests: mptcp: check MP_FAIL response mibs
From: Mat Martineau @ 2022-04-26 21:57 UTC (permalink / raw)
  To: netdev
  Cc: Geliang Tang, davem, kuba, pabeni, matthieu.baerts, mptcp,
	Mat Martineau
In-Reply-To: <20220426215717.129506-1-mathew.j.martineau@linux.intel.com>

From: Geliang Tang <geliang.tang@suse.com>

This patch extends chk_fail_nr to check the MP_FAIL response mibs.

Add a new argument invert for chk_fail_nr to allow it can check the
MP_FAIL TX and RX mibs from the opposite direction.

When the infinite map is received before the MP_FAIL response, the
response will be lost. A '-' can be added into fail_tx or fail_rx to
represent that MP_FAIL response TX or RX can be lost when doing the
checks.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 .../testing/selftests/net/mptcp/mptcp_join.sh | 38 +++++++++++++++++--
 1 file changed, 35 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh
index 34152a50a3e2..8023c0773d95 100755
--- a/tools/testing/selftests/net/mptcp/mptcp_join.sh
+++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh
@@ -1054,13 +1054,38 @@ chk_fail_nr()
 {
 	local fail_tx=$1
 	local fail_rx=$2
+	local ns_invert=${3:-""}
 	local count
 	local dump_stats
+	local ns_tx=$ns1
+	local ns_rx=$ns2
+	local extra_msg=""
+	local allow_tx_lost=0
+	local allow_rx_lost=0
+
+	if [[ $ns_invert = "invert" ]]; then
+		ns_tx=$ns2
+		ns_rx=$ns1
+		extra_msg=" invert"
+	fi
+
+	if [[ "${fail_tx}" = "-"* ]]; then
+		allow_tx_lost=1
+		fail_tx=${fail_tx:1}
+	fi
+	if [[ "${fail_rx}" = "-"* ]]; then
+		allow_rx_lost=1
+		fail_rx=${fail_rx:1}
+	fi
 
 	printf "%-${nr_blank}s %s" " " "ftx"
-	count=$(ip netns exec $ns1 nstat -as | grep MPTcpExtMPFailTx | awk '{print $2}')
+	count=$(ip netns exec $ns_tx nstat -as | grep MPTcpExtMPFailTx | awk '{print $2}')
 	[ -z "$count" ] && count=0
 	if [ "$count" != "$fail_tx" ]; then
+		extra_msg="$extra_msg,tx=$count"
+	fi
+	if { [ "$count" != "$fail_tx" ] && [ $allow_tx_lost -eq 0 ]; } ||
+	   { [ "$count" -gt "$fail_tx" ] && [ $allow_tx_lost -eq 1 ]; }; then
 		echo "[fail] got $count MP_FAIL[s] TX expected $fail_tx"
 		fail_test
 		dump_stats=1
@@ -1069,17 +1094,23 @@ chk_fail_nr()
 	fi
 
 	echo -n " - failrx"
-	count=$(ip netns exec $ns2 nstat -as | grep MPTcpExtMPFailRx | awk '{print $2}')
+	count=$(ip netns exec $ns_rx nstat -as | grep MPTcpExtMPFailRx | awk '{print $2}')
 	[ -z "$count" ] && count=0
 	if [ "$count" != "$fail_rx" ]; then
+		extra_msg="$extra_msg,rx=$count"
+	fi
+	if { [ "$count" != "$fail_rx" ] && [ $allow_rx_lost -eq 0 ]; } ||
+	   { [ "$count" -gt "$fail_rx" ] && [ $allow_rx_lost -eq 1 ]; }; then
 		echo "[fail] got $count MP_FAIL[s] RX expected $fail_rx"
 		fail_test
 		dump_stats=1
 	else
-		echo "[ ok ]"
+		echo -n "[ ok ]"
 	fi
 
 	[ "${dump_stats}" = 1 ] && dump_stats
+
+	echo "$extra_msg"
 }
 
 chk_fclose_nr()
@@ -2654,6 +2685,7 @@ fail_tests()
 	if reset_with_fail "Infinite map" 1; then
 		run_tests $ns1 $ns2 10.0.1.1 128
 		chk_join_nr 0 0 0 +1 +0 1 0 1 "$(pedit_action_pkts)"
+		chk_fail_nr 1 -1 invert
 	fi
 }
 
-- 
2.36.0


^ permalink raw reply related

* [PATCH net-next 5/7] mptcp: reset subflow when MP_FAIL doesn't respond
From: Mat Martineau @ 2022-04-26 21:57 UTC (permalink / raw)
  To: netdev
  Cc: Geliang Tang, davem, kuba, pabeni, matthieu.baerts, mptcp,
	Mat Martineau
In-Reply-To: <20220426215717.129506-1-mathew.j.martineau@linux.intel.com>

From: Geliang Tang <geliang.tang@suse.com>

This patch adds a new msk->flags bit MPTCP_FAIL_NO_RESPONSE, then reuses
sk_timer to trigger a check if we have not received a response from the
peer after sending MP_FAIL. If the peer doesn't respond properly, reset
the subflow.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 net/mptcp/pm.c       |  8 ++++++++
 net/mptcp/protocol.c | 48 ++++++++++++++++++++++++++++++++++++++++++++
 net/mptcp/protocol.h |  1 +
 net/mptcp/subflow.c  | 11 ++++++++++
 4 files changed, 68 insertions(+)

diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c
index 971e843a304c..14f448d82bb2 100644
--- a/net/mptcp/pm.c
+++ b/net/mptcp/pm.c
@@ -287,6 +287,7 @@ void mptcp_pm_mp_fail_received(struct sock *sk, u64 fail_seq)
 {
 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
 	struct mptcp_sock *msk = mptcp_sk(subflow->conn);
+	struct sock *s = (struct sock *)msk;
 
 	pr_debug("fail_seq=%llu", fail_seq);
 
@@ -299,6 +300,13 @@ void mptcp_pm_mp_fail_received(struct sock *sk, u64 fail_seq)
 		subflow->send_mp_fail = 1;
 		MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_MPFAILTX);
 		subflow->send_infinite_map = 1;
+	} else if (s && inet_sk_state_load(s) != TCP_CLOSE) {
+		pr_debug("MP_FAIL response received");
+
+		mptcp_data_lock(s);
+		if (inet_sk_state_load(s) != TCP_CLOSE)
+			sk_stop_timer(s, &s->sk_timer);
+		mptcp_data_unlock(s);
 	}
 }
 
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index ea74122065f1..a5d466e6b538 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2169,10 +2169,38 @@ static void mptcp_retransmit_timer(struct timer_list *t)
 	sock_put(sk);
 }
 
+static struct mptcp_subflow_context *
+mp_fail_response_expect_subflow(struct mptcp_sock *msk)
+{
+	struct mptcp_subflow_context *subflow, *ret = NULL;
+
+	mptcp_for_each_subflow(msk, subflow) {
+		if (READ_ONCE(subflow->mp_fail_response_expect)) {
+			ret = subflow;
+			break;
+		}
+	}
+
+	return ret;
+}
+
+static void mptcp_check_mp_fail_response(struct mptcp_sock *msk)
+{
+	struct mptcp_subflow_context *subflow;
+	struct sock *sk = (struct sock *)msk;
+
+	bh_lock_sock(sk);
+	subflow = mp_fail_response_expect_subflow(msk);
+	if (subflow)
+		__set_bit(MPTCP_FAIL_NO_RESPONSE, &msk->flags);
+	bh_unlock_sock(sk);
+}
+
 static void mptcp_timeout_timer(struct timer_list *t)
 {
 	struct sock *sk = from_timer(sk, t, sk_timer);
 
+	mptcp_check_mp_fail_response(mptcp_sk(sk));
 	mptcp_schedule_work(sk);
 	sock_put(sk);
 }
@@ -2499,6 +2527,23 @@ static void __mptcp_retrans(struct sock *sk)
 	mptcp_data_unlock(sk);
 }
 
+static void mptcp_mp_fail_no_response(struct mptcp_sock *msk)
+{
+	struct mptcp_subflow_context *subflow;
+	struct sock *ssk;
+	bool slow;
+
+	subflow = mp_fail_response_expect_subflow(msk);
+	if (subflow) {
+		pr_debug("MP_FAIL doesn't respond, reset the subflow");
+
+		ssk = mptcp_subflow_tcp_sock(subflow);
+		slow = lock_sock_fast(ssk);
+		mptcp_subflow_reset(ssk);
+		unlock_sock_fast(ssk, slow);
+	}
+}
+
 static void mptcp_worker(struct work_struct *work)
 {
 	struct mptcp_sock *msk = container_of(work, struct mptcp_sock, work);
@@ -2539,6 +2584,9 @@ static void mptcp_worker(struct work_struct *work)
 	if (test_and_clear_bit(MPTCP_WORK_RTX, &msk->flags))
 		__mptcp_retrans(sk);
 
+	if (test_and_clear_bit(MPTCP_FAIL_NO_RESPONSE, &msk->flags))
+		mptcp_mp_fail_no_response(msk);
+
 unlock:
 	release_sock(sk);
 	sock_put(sk);
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index cc66c81a8fab..3a8740fef918 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -116,6 +116,7 @@
 #define MPTCP_WORK_EOF		3
 #define MPTCP_FALLBACK_DONE	4
 #define MPTCP_WORK_CLOSE_SUBFLOW 5
+#define MPTCP_FAIL_NO_RESPONSE	6
 
 /* MPTCP socket release cb flags */
 #define MPTCP_PUSH_PENDING	1
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index ca2352ad20d4..75c824b67ca9 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -968,6 +968,7 @@ static enum mapping_status get_mapping_status(struct sock *ssk,
 {
 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
 	bool csum_reqd = READ_ONCE(msk->csum_enabled);
+	struct sock *sk = (struct sock *)msk;
 	struct mptcp_ext *mpext;
 	struct sk_buff *skb;
 	u16 data_len;
@@ -1009,6 +1010,12 @@ static enum mapping_status get_mapping_status(struct sock *ssk,
 		pr_debug("infinite mapping received");
 		MPTCP_INC_STATS(sock_net(ssk), MPTCP_MIB_INFINITEMAPRX);
 		subflow->map_data_len = 0;
+		if (sk && inet_sk_state_load(sk) != TCP_CLOSE) {
+			mptcp_data_lock(sk);
+			if (inet_sk_state_load(sk) != TCP_CLOSE)
+				sk_stop_timer(sk, &sk->sk_timer);
+			mptcp_data_unlock(sk);
+		}
 		return MAPPING_INVALID;
 	}
 
@@ -1219,6 +1226,10 @@ static bool subflow_check_data_avail(struct sock *ssk)
 					sk_eat_skb(ssk, skb);
 			} else {
 				WRITE_ONCE(subflow->mp_fail_response_expect, true);
+				/* The data lock is acquired in __mptcp_move_skbs() */
+				sk_reset_timer((struct sock *)msk,
+					       &((struct sock *)msk)->sk_timer,
+					       jiffies + TCP_RTO_MAX);
 			}
 			WRITE_ONCE(subflow->data_avail, MPTCP_SUBFLOW_NODATA);
 			return true;
-- 
2.36.0


^ permalink raw reply related

* Re: [PATCH bpf-next v6 5/6] bpf: Add selftests for raw syncookie helpers
From: Andrii Nakryiko @ 2022-04-26 22:11 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: Alexei Starovoitov, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Networking, Tariq Toukan, Martin KaFai Lau,
	Song Liu, Yonghong Song, John Fastabend, KP Singh,
	David S. Miller, Jakub Kicinski, Petar Penkov, Lorenz Bauer,
	Eric Dumazet, Hideaki YOSHIFUJI, David Ahern, Shuah Khan,
	Jesper Dangaard Brouer, Nathan Chancellor, Nick Desaulniers,
	Joe Stringer, Florent Revest, open list:KERNEL SELFTEST FRAMEWORK,
	Toke Høiland-Jørgensen, Kumar Kartikeya Dwivedi,
	Florian Westphal, pabeni
In-Reply-To: <92e9eaf6-4d72-3173-3271-88e3b8637c7a@nvidia.com>

On Tue, Apr 26, 2022 at 11:29 AM Maxim Mikityanskiy <maximmi@nvidia.com> wrote:
>
> On 2022-04-26 09:26, Andrii Nakryiko wrote:
> > On Mon, Apr 25, 2022 at 5:12 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> >>
> >> On Fri, Apr 22, 2022 at 08:24:21PM +0300, Maxim Mikityanskiy wrote:
> >>> +void test_xdp_synproxy(void)
> >>> +{
> >>> +     int server_fd = -1, client_fd = -1, accept_fd = -1;
> >>> +     struct nstoken *ns = NULL;
> >>> +     FILE *ctrl_file = NULL;
> >>> +     char buf[1024];
> >>> +     size_t size;
> >>> +
> >>> +     SYS("ip netns add synproxy");
> >>> +
> >>> +     SYS("ip link add tmp0 type veth peer name tmp1");
> >>> +     SYS("ip link set tmp1 netns synproxy");
> >>> +     SYS("ip link set tmp0 up");
> >>> +     SYS("ip addr replace 198.18.0.1/24 dev tmp0");
> >>> +
> >>> +     // When checksum offload is enabled, the XDP program sees wrong
> >>> +     // checksums and drops packets.
> >>> +     SYS("ethtool -K tmp0 tx off");
> >>
> >> BPF CI image doesn't have ethtool installed.
> >> It will take some time to get it updated. Until then we cannot land the patch set.
> >> Can you think of a way to run this test without shelling to ethtool?
> >
> > Good news: we got updated CI image with ethtool, so that shouldn't be
> > a problem anymore.
> >
> > Bad news: this selftest still fails, but in different place:
> >
> > test_synproxy:FAIL:iptables -t raw -I PREROUTING -i tmp1 -p tcp -m tcp
> > --syn --dport 8080 -j CT --notrack unexpected error: 512 (errno 2)
>
> That's simply a matter of missing kernel config options:
>
> CONFIG_NETFILTER_SYNPROXY=y
> CONFIG_NETFILTER_XT_TARGET_CT=y
> CONFIG_NETFILTER_XT_MATCH_STATE=y
> CONFIG_IP_NF_FILTER=y
> CONFIG_IP_NF_TARGET_SYNPROXY=y
> CONFIG_IP_NF_RAW=y
>
> Shall I create a pull request on github to add these options to
> https://github.com/libbpf/libbpf/tree/master/travis-ci/vmtest/configs?
>

Yes, please. But also for [0], that's the one that tests all the
not-yet-applied patches

  [0] https://github.com/kernel-patches/vmtest/

> > See [0].
> >
> >    [0] https://github.com/kernel-patches/bpf/runs/6169439612?check_suite_focus=true
>

^ permalink raw reply

* Re: [PATCH memcg v4] net: set proper memcg for net_init hooks allocations
From: Roman Gushchin @ 2022-04-26 22:13 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Vlastimil Babka, Shakeel Butt, kernel, Florian Westphal,
	linux-kernel, Michal Hocko, cgroups, netdev, David S. Miller,
	Jakub Kicinski, Paolo Abeni
In-Reply-To: <33085523-a8b9-1bf6-2726-f456f59015ef@openvz.org>

On Tue, Apr 26, 2022 at 09:43:43AM +0300, Vasily Averin wrote:
> __register_pernet_operations() executes init hook of registered
> pernet_operation structure in all existing net namespaces.
> 
> Typically, these hooks are called by a process associated with
> the specified net namespace, and all __GFP_ACCOUNT marked
> allocation are accounted for corresponding container/memcg.
> 
> However __register_pernet_operations() calls the hooks in the same
> context, and as a result all marked allocations are accounted
> to one memcg for all processed net namespaces.
> 
> This patch adjusts active memcg for each net namespace and helps
> to account memory allocated inside ops_init() into the proper memcg.
> 
> Signed-off-by: Vasily Averin <vvs@openvz.org>
> ---
> v4: get_mem_cgroup_from_kmem() renamed to get_mem_cgroup_from_obj(),
>     get_net_memcg() replaced by mem_cgroup_or_root(), suggested by Roman.
> 
> v3: put_net_memcg() replaced by an alreay existing mem_cgroup_put()
>     It checks memcg before accessing it, this is required for
>     __register_pernet_operations() called before memcg initialization.
>     Additionally fixed leading whitespaces in non-memcg_kmem version
>     of mem_cgroup_from_obj().
> 
> v2: introduced get/put_net_memcg(),
>     new functions are moved under CONFIG_MEMCG_KMEM
>     to fix compilation issues reported by Intel's kernel test robot
> 
> v1: introduced get_mem_cgroup_from_kmem(), which takes the refcount
>     for the found memcg, suggested by Shakeel
> ---
>  include/linux/memcontrol.h | 27 ++++++++++++++++++++++++++-
>  net/core/net_namespace.c   |  7 +++++++
>  2 files changed, 33 insertions(+), 1 deletion(-)

Acked-by: Roman Gushchin <roman.gushchin@linux.dev>

Thanks!

^ permalink raw reply

* [PATCH v2] ath10k: skip ath10k_halt during suspend for driver state RESTARTING
From: Abhishek Kumar @ 2022-04-26 22:19 UTC (permalink / raw)
  To: kvalo
  Cc: quic_wgong, kuabhs, briannorris, linux-kernel, linux-wireless,
	ath10k, netdev, David S. Miller, Jakub Kicinski, Paolo Abeni

Double free crash is observed when FW recovery(caused by wmi
timeout/crash) is followed by immediate suspend event. The FW recovery
is triggered by ath10k_core_restart() which calls driver clean up via
ath10k_halt(). When the suspend event occurs between the FW recovery,
the restart worker thread is put into frozen state until suspend completes.
The suspend event triggers ath10k_stop() which again triggers ath10k_halt()
The double invocation of ath10k_halt() causes ath10k_htt_rx_free() to be
called twice(Note: ath10k_htt_rx_alloc was not called by restart worker
thread because of its frozen state), causing the crash.

To fix this, during the suspend flow, skip call to ath10k_halt() in
ath10k_stop() when the current driver state is ATH10K_STATE_RESTARTING.
Also, for driver state ATH10K_STATE_RESTARTING, call
ath10k_wait_for_suspend() in ath10k_stop(). This is because call to
ath10k_wait_for_suspend() is skipped later in
[ath10k_halt() > ath10k_core_stop()] for the driver state
ATH10K_STATE_RESTARTING.

The frozen restart worker thread will be cancelled during resume when the
device comes out of suspend.

Below is the crash stack for reference:

[  428.469167] ------------[ cut here ]------------
[  428.469180] kernel BUG at mm/slub.c:4150!
[  428.469193] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[  428.469219] Workqueue: events_unbound async_run_entry_fn
[  428.469230] RIP: 0010:kfree+0x319/0x31b
[  428.469241] RSP: 0018:ffffa1fac015fc30 EFLAGS: 00010246
[  428.469247] RAX: ffffedb10419d108 RBX: ffff8c05262b0000
[  428.469252] RDX: ffff8c04a8c07000 RSI: 0000000000000000
[  428.469256] RBP: ffffa1fac015fc78 R08: 0000000000000000
[  428.469276] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  428.469285] Call Trace:
[  428.469295]  ? dma_free_attrs+0x5f/0x7d
[  428.469320]  ath10k_core_stop+0x5b/0x6f
[  428.469336]  ath10k_halt+0x126/0x177
[  428.469352]  ath10k_stop+0x41/0x7e
[  428.469387]  drv_stop+0x88/0x10e
[  428.469410]  __ieee80211_suspend+0x297/0x411
[  428.469441]  rdev_suspend+0x6e/0xd0
[  428.469462]  wiphy_suspend+0xb1/0x105
[  428.469483]  ? name_show+0x2d/0x2d
[  428.469490]  dpm_run_callback+0x8c/0x126
[  428.469511]  ? name_show+0x2d/0x2d
[  428.469517]  __device_suspend+0x2e7/0x41b
[  428.469523]  async_suspend+0x1f/0x93
[  428.469529]  async_run_entry_fn+0x3d/0xd1
[  428.469535]  process_one_work+0x1b1/0x329
[  428.469541]  worker_thread+0x213/0x372
[  428.469547]  kthread+0x150/0x15f
[  428.469552]  ? pr_cont_work+0x58/0x58
[  428.469558]  ? kthread_blkcg+0x31/0x31

Tested-on: QCA6174 hw3.2 PCI WLAN.RM.4.4.1-00288-QCARMSWPZ-1
Co-developed-by: Wen Gong <quic_wgong@quicinc.com>
Signed-off-by: Wen Gong <quic_wgong@quicinc.com>
Signed-off-by: Abhishek Kumar <kuabhs@chromium.org>
---

Changes in v2:
- Fixed typo, replaced ath11k by ath10k in the comments.
- Adjusted the position of my S-O-B tag.
- Added the Tested-on tag.

 drivers/net/wireless/ath/ath10k/mac.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c
index d804e19a742a..e9c1f11fef0a 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -5345,8 +5345,22 @@ static void ath10k_stop(struct ieee80211_hw *hw)
 
 	mutex_lock(&ar->conf_mutex);
 	if (ar->state != ATH10K_STATE_OFF) {
-		if (!ar->hw_rfkill_on)
-			ath10k_halt(ar);
+		if (!ar->hw_rfkill_on) {
+			/* If the current driver state is RESTARTING but not yet
+			 * fully RESTARTED because of incoming suspend event,
+			 * then ath10k_halt is already called via
+			 * ath10k_core_restart and should not be called here.
+			 */
+			if (ar->state != ATH10K_STATE_RESTARTING)
+				ath10k_halt(ar);
+			else
+				/* Suspending here, because when in RESTARTING
+				 * state, ath10k_core_stop skips
+				 * ath10k_wait_for_suspend.
+				 */
+				ath10k_wait_for_suspend(ar,
+							WMI_PDEV_SUSPEND_AND_DISABLE_INTR);
+		}
 		ar->state = ATH10K_STATE_OFF;
 	}
 	mutex_unlock(&ar->conf_mutex);
-- 
2.36.0.464.gb9c8b46e94-goog


^ permalink raw reply related

* Re: [PATCH] ath10k: skip ath10k_halt during suspend for driver state RESTARTING
From: Abhishek Kumar @ 2022-04-26 22:26 UTC (permalink / raw)
  To: Jeff Johnson
  Cc: kvalo, linux-kernel, linux-wireless, briannorris, ath10k, netdev,
	Wen Gong, David S. Miller, Jakub Kicinski, Paolo Abeni
In-Reply-To: <71858a31-4667-b358-194c-95a2ffc0c593@quicinc.com>

Addressed all the above comments and pushed out a v2 patch. Fixes in V2:
- Fixed typo, replaced ath11k by ath10k in comments.
- Adjusted the position of the S-O-B tag.
- Added tested on tag.

Thanks
Abhishek

On Tue, Apr 26, 2022 at 9:23 AM Jeff Johnson <quic_jjohnson@quicinc.com> wrote:
>
> (sorry for the 2nd message with no content -- operator error)

^ permalink raw reply

* Re: [PATCH v2] ath10k: skip ath10k_halt during suspend for driver state RESTARTING
From: Brian Norris @ 2022-04-26 22:34 UTC (permalink / raw)
  To: Abhishek Kumar
  Cc: kvalo, quic_wgong, Linux Kernel, linux-wireless, ath10k,
	<netdev@vger.kernel.org>, David S. Miller, Jakub Kicinski,
	Paolo Abeni
In-Reply-To: <20220426221859.v2.1.I650b809482e1af8d0156ed88b5dc2677a0711d46@changeid>

On Tue, Apr 26, 2022 at 3:20 PM Abhishek Kumar <kuabhs@chromium.org> wrote:
>
> Double free crash is observed when FW recovery(caused by wmi
> timeout/crash) is followed by immediate suspend event. The FW recovery
> is triggered by ath10k_core_restart() which calls driver clean up via
> ath10k_halt(). When the suspend event occurs between the FW recovery,
> the restart worker thread is put into frozen state until suspend completes.
> The suspend event triggers ath10k_stop() which again triggers ath10k_halt()
> The double invocation of ath10k_halt() causes ath10k_htt_rx_free() to be
> called twice(Note: ath10k_htt_rx_alloc was not called by restart worker
> thread because of its frozen state), causing the crash.
...
> Tested-on: QCA6174 hw3.2 PCI WLAN.RM.4.4.1-00288-QCARMSWPZ-1
> Co-developed-by: Wen Gong <quic_wgong@quicinc.com>
> Signed-off-by: Wen Gong <quic_wgong@quicinc.com>
> Signed-off-by: Abhishek Kumar <kuabhs@chromium.org>
> ---
>
> Changes in v2:
> - Fixed typo, replaced ath11k by ath10k in the comments.
> - Adjusted the position of my S-O-B tag.
> - Added the Tested-on tag.

You could have retained my:

Reviewed-by: Brian Norris <briannorris@chromium.org>

but no worries; it's just a few characters ;)

^ permalink raw reply

* Re: [PATCH bpf-next v5 3/8] bpf: per-cgroup lsm flavor
From: Stanislav Fomichev @ 2022-04-26 22:44 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: netdev, bpf, ast, daniel, andrii
In-Reply-To: <20220426062705.siikmhdj5vhsffjq@kafai-mbp.dhcp.thefacebook.com>

On Mon, Apr 25, 2022 at 11:27 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Tue, Apr 19, 2022 at 12:00:48PM -0700, Stanislav Fomichev wrote:
> > Allow attaching to lsm hooks in the cgroup context.
> >
> > Attaching to per-cgroup LSM works exactly like attaching
> > to other per-cgroup hooks. New BPF_LSM_CGROUP is added
> > to trigger new mode; the actual lsm hook we attach to is
> > signaled via existing attach_btf_id.
> >
> > For the hooks that have 'struct socket' as its first argument,
> > we use the cgroup associated with that socket. For the rest,
> > we use 'current' cgroup (this is all on default hierarchy == v2 only).
> > Note that for the hooks that work on 'struct sock' we still
> > take the cgroup from 'current' because most of the time,
> > the 'sock' argument is not properly initialized.
> This paragraph is out-dated.

Ack, will update that last part about sock, thanks!

> > Behind the scenes, we allocate a shim program that is attached
> > to the trampoline and runs cgroup effective BPF programs array.
> > This shim has some rudimentary ref counting and can be shared
> > between several programs attaching to the same per-cgroup lsm hook.
> >
> > Note that this patch bloats cgroup size because we add 211
> > cgroup_bpf_attach_type(s) for simplicity sake. This will be
> > addressed in the subsequent patch.
> >
> > Also note that we only add non-sleepable flavor for now. To enable
> > sleepable use-cases, BPF_PROG_RUN_ARRAY_CG has to grab trace rcu,
> s/BPF_PROG_RUN_ARRAY_CG/bpf_prog_run_array_cg/

Sure, thanks!

> > shim programs have to be freed via trace rcu, cgroup_bpf.effective
> > should be also trace-rcu-managed + maybe some other changes that
> > I'm not aware of.
> >
>
> [ ... ]
>
> > diff --git a/kernel/bpf/bpf_lsm.c b/kernel/bpf/bpf_lsm.c
> > index 064eccba641d..2161cba1fe0c 100644
> > --- a/kernel/bpf/bpf_lsm.c
> > +++ b/kernel/bpf/bpf_lsm.c
> > @@ -16,6 +16,7 @@
> >  #include <linux/bpf_local_storage.h>
> >  #include <linux/btf_ids.h>
> >  #include <linux/ima.h>
> > +#include <linux/bpf-cgroup.h>
> >
> >  /* For every LSM hook that allows attachment of BPF programs, declare a nop
> >   * function where a BPF program can be attached.
> > @@ -35,6 +36,68 @@ BTF_SET_START(bpf_lsm_hooks)
> >  #undef LSM_HOOK
> >  BTF_SET_END(bpf_lsm_hooks)
> >
> > +/* List of LSM hooks that should operate on 'current' cgroup regardless
> > + * of function signature.
> > + */
> > +BTF_SET_START(bpf_lsm_current_hooks)
> > +/* operate on freshly allocated sk without any cgroup association */
> > +BTF_ID(func, bpf_lsm_sk_alloc_security)
> > +BTF_ID(func, bpf_lsm_sk_free_security)
> > +BTF_SET_END(bpf_lsm_current_hooks)
> > +
> > +int bpf_lsm_find_cgroup_shim(const struct bpf_prog *prog,
> > +                          bpf_func_t *bpf_func)
> > +{
> > +     const struct btf_type *first_arg_type;
> > +     const struct btf_type *sock_type;
> > +     const struct btf_type *sk_type;
> > +     const struct btf *btf_vmlinux;
> > +     const struct btf_param *args;
> > +     s32 type_id;
> > +
> > +     if (!prog->aux->attach_func_proto ||
> > +         !btf_type_is_func_proto(prog->aux->attach_func_proto))
> > +             return -EINVAL;
> > +
> > +     if (btf_type_vlen(prog->aux->attach_func_proto) < 1 ||
> > +         btf_id_set_contains(&bpf_lsm_current_hooks,
> > +                             prog->aux->attach_btf_id)) {
> > +             *bpf_func = __cgroup_bpf_run_lsm_current;
> > +             return 0;
> > +     }
> > +
> > +     args = btf_params(prog->aux->attach_func_proto);
> > +
> > +     btf_vmlinux = bpf_get_btf_vmlinux();
> > +     if (!btf_vmlinux)
> Remove this check and other similar checks because the btf_vmlinux has
> been successfully parsed during the prog load time.

Good point, will remove.

> > +             return -EINVAL;
> > +
> > +     type_id = btf_find_by_name_kind(btf_vmlinux, "socket", BTF_KIND_STRUCT);
> > +     if (type_id < 0)
> > +             return -EINVAL;
> > +     sock_type = btf_type_by_id(btf_vmlinux, type_id);
> > +
> > +     type_id = btf_find_by_name_kind(btf_vmlinux, "sock", BTF_KIND_STRUCT);
> > +     if (type_id < 0)
> > +             return -EINVAL;
> > +     sk_type = btf_type_by_id(btf_vmlinux, type_id);
> > +
> > +     first_arg_type = btf_type_resolve_ptr(btf_vmlinux, args[0].type, NULL);
> > +     if (first_arg_type == sock_type)
> > +             *bpf_func = __cgroup_bpf_run_lsm_socket;
> > +     else if (first_arg_type == sk_type)
> > +             *bpf_func = __cgroup_bpf_run_lsm_sock;
> > +     else
> > +             *bpf_func = __cgroup_bpf_run_lsm_current;
> > +
> > +     return 0;
> > +}
> > +
> > +int bpf_lsm_hook_idx(u32 btf_id)
> > +{
> > +     return btf_id_set_index(&bpf_lsm_hooks, btf_id);
> > +}
> > +
> >  int bpf_lsm_verify_prog(struct bpf_verifier_log *vlog,
> >                       const struct bpf_prog *prog)
> >  {
> > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > index 0918a39279f6..4199de31f49c 100644
> > --- a/kernel/bpf/btf.c
> > +++ b/kernel/bpf/btf.c
> > @@ -4971,6 +4971,7 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
> >
> >       if (arg == nr_args) {
> >               switch (prog->expected_attach_type) {
> > +             case BPF_LSM_CGROUP:
> >               case BPF_LSM_MAC:
> >               case BPF_TRACE_FEXIT:
> >                       /* When LSM programs are attached to void LSM hooks
> > @@ -6396,6 +6397,16 @@ static int btf_id_cmp_func(const void *a, const void *b)
> >       return *pa - *pb;
> >  }
> >
> > +int btf_id_set_index(const struct btf_id_set *set, u32 id)
> > +{
> > +     const u32 *p;
> > +
> > +     p = bsearch(&id, set->ids, set->cnt, sizeof(u32), btf_id_cmp_func);
> > +     if (!p)
> > +             return -1;
> > +     return p - set->ids;
> > +}
> > +
> >  bool btf_id_set_contains(const struct btf_id_set *set, u32 id)
> >  {
> >       return bsearch(&id, set->ids, set->cnt, sizeof(u32), btf_id_cmp_func) != NULL;
> > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> > index aaf9e36f2736..ba0e0c7a661d 100644
> > --- a/kernel/bpf/cgroup.c
> > +++ b/kernel/bpf/cgroup.c
> > @@ -14,6 +14,9 @@
> >  #include <linux/string.h>
> >  #include <linux/bpf.h>
> >  #include <linux/bpf-cgroup.h>
> > +#include <linux/btf_ids.h>
> > +#include <linux/bpf_lsm.h>
> > +#include <linux/bpf_verifier.h>
> >  #include <net/sock.h>
> >  #include <net/bpf_sk_storage.h>
> >
> > @@ -88,6 +91,85 @@ bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp,
> >       return run_ctx.retval;
> >  }
> >
> > +unsigned int __cgroup_bpf_run_lsm_sock(const void *ctx,
> > +                                    const struct bpf_insn *insn)
> > +{
> > +     const struct bpf_prog *shim_prog;
> > +     struct sock *sk;
> > +     struct cgroup *cgrp;
> > +     int ret = 0;
> > +     u64 *regs;
> > +
> > +     regs = (u64 *)ctx;
> > +     sk = (void *)(unsigned long)regs[BPF_REG_0];
> > +     /*shim_prog = container_of(insn, struct bpf_prog, insnsi);*/
> > +     shim_prog = (const struct bpf_prog *)((void *)insn - offsetof(struct bpf_prog, insnsi));
> > +
> > +     cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> > +     if (likely(cgrp))
> > +             ret = bpf_prog_run_array_cg(&cgrp->bpf,
> > +                                         shim_prog->aux->cgroup_atype,
> > +                                         ctx, bpf_prog_run, 0);
> > +     return ret;
> > +}
> > +
> > +unsigned int __cgroup_bpf_run_lsm_socket(const void *ctx,
> > +                                      const struct bpf_insn *insn)
> > +{
> > +     const struct bpf_prog *shim_prog;
> > +     struct socket *sock;
> > +     struct cgroup *cgrp;
> > +     int ret = 0;
> > +     u64 *regs;
> > +
> > +     regs = (u64 *)ctx;
> > +     sock = (void *)(unsigned long)regs[BPF_REG_0];
> > +     /*shim_prog = container_of(insn, struct bpf_prog, insnsi);*/
> > +     shim_prog = (const struct bpf_prog *)((void *)insn - offsetof(struct bpf_prog, insnsi));
> > +
> > +     cgrp = sock_cgroup_ptr(&sock->sk->sk_cgrp_data);
> > +     if (likely(cgrp))
> > +             ret = bpf_prog_run_array_cg(&cgrp->bpf,
> > +                                         shim_prog->aux->cgroup_atype,
> > +                                         ctx, bpf_prog_run, 0);
> > +     return ret;
> > +}
> > +
> > +unsigned int __cgroup_bpf_run_lsm_current(const void *ctx,
> > +                                       const struct bpf_insn *insn)
> > +{
> > +     const struct bpf_prog *shim_prog;
> > +     struct cgroup *cgrp;
> > +     int ret = 0;
> > +
> > +     if (unlikely(!current))
> > +             return 0;
> > +
> > +     /*shim_prog = container_of(insn, struct bpf_prog, insnsi);*/
> > +     shim_prog = (const struct bpf_prog *)((void *)insn - offsetof(struct bpf_prog, insnsi));
> > +
> > +     rcu_read_lock();
> > +     cgrp = task_dfl_cgroup(current);
> > +     if (likely(cgrp))
> > +             ret = bpf_prog_run_array_cg(&cgrp->bpf,
> > +                                         shim_prog->aux->cgroup_atype,
> > +                                         ctx, bpf_prog_run, 0);
> > +     rcu_read_unlock();
> > +     return ret;
> > +}
> > +
> > +#ifdef CONFIG_BPF_LSM
> > +static enum cgroup_bpf_attach_type bpf_lsm_attach_type_get(u32 attach_btf_id)
> > +{
> > +     return CGROUP_LSM_START + bpf_lsm_hook_idx(attach_btf_id);
> > +}
> > +#else
> > +static enum cgroup_bpf_attach_type bpf_lsm_attach_type_get(u32 attach_btf_id)
> > +{
> > +     return -EOPNOTSUPP;
> > +}
> > +#endif /* CONFIG_BPF_LSM */
> > +
> >  void cgroup_bpf_offline(struct cgroup *cgrp)
> >  {
> >       cgroup_get(cgrp);
> > @@ -155,6 +237,14 @@ static void bpf_cgroup_storages_link(struct bpf_cgroup_storage *storages[],
> >               bpf_cgroup_storage_link(storages[stype], cgrp, attach_type);
> >  }
> >
> > +static void bpf_cgroup_storages_unlink(struct bpf_cgroup_storage *storages[])
> > +{
> > +     enum bpf_cgroup_storage_type stype;
> > +
> > +     for_each_cgroup_storage_type(stype)
> > +             bpf_cgroup_storage_unlink(storages[stype]);
> > +}
> > +
> >  /* Called when bpf_cgroup_link is auto-detached from dying cgroup.
> >   * It drops cgroup and bpf_prog refcounts, and marks bpf_link as defunct. It
> >   * doesn't free link memory, which will eventually be done by bpf_link's
> > @@ -166,6 +256,16 @@ static void bpf_cgroup_link_auto_detach(struct bpf_cgroup_link *link)
> >       link->cgroup = NULL;
> >  }
> >
> > +static void bpf_cgroup_lsm_shim_release(struct bpf_prog *prog,
> > +                                     enum cgroup_bpf_attach_type atype)
> > +{
> > +     if (prog->aux->cgroup_atype < CGROUP_LSM_START ||
> > +         prog->aux->cgroup_atype > CGROUP_LSM_END)
> These checks are unnecessary.  cgroup_atype was set by the kernel
> during attach and it could not be invalid during detach.
>
> Remove this helper and directly call bpf_trampoline_unlink_cgroup_shim(prog)
> instead.

True. I'll remove the checks, I'll still leave
bpf_cgroup_lsm_shim_release around because in the next patch I add
bpf_lsm_attach_type_put here (seems than copy-pasting
bpf_lsm_attach_type_put elsewhere?)

> > +             return;
> > +
> > +     bpf_trampoline_unlink_cgroup_shim(prog);
> > +}
> > +
> >  /**
> >   * cgroup_bpf_release() - put references of all bpf programs and
> >   *                        release all cgroup bpf data
> > @@ -190,10 +290,18 @@ static void cgroup_bpf_release(struct work_struct *work)
> >
> >               hlist_for_each_entry_safe(pl, pltmp, progs, node) {
> >                       hlist_del(&pl->node);
> > -                     if (pl->prog)
> > +                     if (pl->prog) {
> > +                             if (atype == BPF_LSM_CGROUP)
> > +                                     bpf_cgroup_lsm_shim_release(pl->prog,
> > +                                                                 atype);
> >                               bpf_prog_put(pl->prog);
> > -                     if (pl->link)
> > +                     }
> > +                     if (pl->link) {
> > +                             if (atype == BPF_LSM_CGROUP)
> > +                                     bpf_cgroup_lsm_shim_release(pl->link->link.prog,
> > +                                                                 atype);
> >                               bpf_cgroup_link_auto_detach(pl->link);
> > +                     }
> >                       kfree(pl);
> >                       static_branch_dec(&cgroup_bpf_enabled_key[atype]);
> >               }
> > @@ -506,6 +614,7 @@ static int __cgroup_bpf_attach(struct cgroup *cgrp,
> >       struct bpf_prog *old_prog = NULL;
> >       struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE] = {};
> >       struct bpf_cgroup_storage *new_storage[MAX_BPF_CGROUP_STORAGE_TYPE] = {};
> > +     struct bpf_attach_target_info tgt_info = {};
> >       enum cgroup_bpf_attach_type atype;
> >       struct bpf_prog_list *pl;
> >       struct hlist_head *progs;
> > @@ -522,9 +631,35 @@ static int __cgroup_bpf_attach(struct cgroup *cgrp,
> >               /* replace_prog implies BPF_F_REPLACE, and vice versa */
> >               return -EINVAL;
> >
> > -     atype = to_cgroup_bpf_attach_type(type);
> > -     if (atype < 0)
> > -             return -EINVAL;
> > +     if (type == BPF_LSM_CGROUP) {
> > +             struct bpf_prog *p = prog ? : link->link.prog;
> > +
> > +             if (replace_prog) {
> > +                     /* Reusing shim from the original program.
> > +                      */
> > +                     if (replace_prog->aux->attach_btf_id !=
> > +                         p->aux->attach_btf_id)
> > +                             return -EINVAL;
> > +
> > +                     atype = replace_prog->aux->cgroup_atype;
> > +             } else {
> > +                     err = bpf_check_attach_target(NULL, p, NULL,
> > +                                                   p->aux->attach_btf_id,
> > +                                                   &tgt_info);
> > +                     if (err)
> > +                             return -EINVAL;
> > +
> > +                     atype = bpf_lsm_attach_type_get(p->aux->attach_btf_id);
> > +                     if (atype < 0)
> > +                             return atype;
> > +             }
> > +
> > +             p->aux->cgroup_atype = atype;
> > +     } else {
> > +             atype = to_cgroup_bpf_attach_type(type);
> > +             if (atype < 0)
> > +                     return -EINVAL;
> > +     }
> >
> >       progs = &cgrp->bpf.progs[atype];
> >
> > @@ -580,13 +715,26 @@ static int __cgroup_bpf_attach(struct cgroup *cgrp,
> >       if (err)
> >               goto cleanup;
> >
> > +     bpf_cgroup_storages_link(new_storage, cgrp, type);
> It looks everything is ready for the cgrp local_storage.
> How about also allowing bpf_get_local_storage() for BPF_LSM_CGROUP?

Definitely, we want local storage to work, let me actually exercise it
in the selftest and add the missing kernel bits to enable it.

> > +
> > +     if (type == BPF_LSM_CGROUP && !old_prog) {
> > +             struct bpf_prog *p = prog ? : link->link.prog;
> > +
> > +             err = bpf_trampoline_link_cgroup_shim(p, &tgt_info);
> > +             if (err)
> > +                     goto cleanup_trampoline;
> > +     }
> > +
> >       if (old_prog)
> >               bpf_prog_put(old_prog);
> >       else
> >               static_branch_inc(&cgroup_bpf_enabled_key[atype]);
> > -     bpf_cgroup_storages_link(new_storage, cgrp, type);
> > +
> >       return 0;
> >
> > +cleanup_trampoline:
> > +     bpf_cgroup_storages_unlink(new_storage);
> > +
> >  cleanup:
> >       if (old_prog) {
> >               pl->prog = old_prog;
> > @@ -678,9 +826,13 @@ static int __cgroup_bpf_replace(struct cgroup *cgrp,
> >       struct hlist_head *progs;
> >       bool found = false;
> >
> > -     atype = to_cgroup_bpf_attach_type(link->type);
> > -     if (atype < 0)
> > -             return -EINVAL;
> > +     if (link->type == BPF_LSM_CGROUP) {
> > +             atype = link->link.prog->aux->cgroup_atype;
> > +     } else {
> > +             atype = to_cgroup_bpf_attach_type(link->type);
> > +             if (atype < 0)
> > +                     return -EINVAL;
> > +     }
> >
> >       progs = &cgrp->bpf.progs[atype];
> >
> > @@ -696,6 +848,9 @@ static int __cgroup_bpf_replace(struct cgroup *cgrp,
> >       if (!found)
> >               return -ENOENT;
> >
> > +     if (link->type == BPF_LSM_CGROUP)
> > +             new_prog->aux->cgroup_atype = atype;
> Does it also need to check attach_btf_id between the
> new_prog and the old prog?

Ah, forgot this one, good catch, thanks!

> > +
> >       old_prog = xchg(&link->link.prog, new_prog);
> >       replace_effective_prog(cgrp, atype, link);
> >       bpf_prog_put(old_prog);
> > @@ -779,9 +934,15 @@ static int __cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog,
> >       u32 flags;
> >       int err;
> >
> > -     atype = to_cgroup_bpf_attach_type(type);
> > -     if (atype < 0)
> > -             return -EINVAL;
> > +     if (type == BPF_LSM_CGROUP) {
> > +             struct bpf_prog *p = prog ? : link->link.prog;
> > +
> > +             atype = p->aux->cgroup_atype;
> > +     } else {
> > +             atype = to_cgroup_bpf_attach_type(type);
> > +             if (atype < 0)
> > +                     return -EINVAL;
> > +     }
> >
> >       progs = &cgrp->bpf.progs[atype];
> >       flags = cgrp->bpf.flags[atype];
> > @@ -803,6 +964,10 @@ static int __cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog,
> >       if (err)
> >               goto cleanup;
> >
> > +     if (type == BPF_LSM_CGROUP)
> > +             bpf_cgroup_lsm_shim_release(prog ? : link->link.prog,
> > +                                         atype);
> After looking at find_detach_entry(),
> the pl->prog may not be the same as the prog or link->link.prog here.

I'm assuming you're talking about "allow detaching with invalid FD
(prog==NULL) in legacy mode", right? I did miss that, so I will use
pl->prog instead, thanks!

> > +
> >       /* now can actually delete it from this cgroup list */
> >       hlist_del(&pl->node);
> >
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index e9621cfa09f2..d94f4951154e 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -3139,6 +3139,11 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
> >               return prog->enforce_expected_attach_type &&
> >                       prog->expected_attach_type != attach_type ?
> >                       -EINVAL : 0;
> > +     case BPF_PROG_TYPE_LSM:
> > +             if (prog->expected_attach_type != BPF_LSM_CGROUP)
> > +                     return -EINVAL;
> > +             return 0;
> > +
> >       default:
> >               return 0;
> >       }
> > @@ -3194,6 +3199,8 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
> >               return BPF_PROG_TYPE_SK_LOOKUP;
> >       case BPF_XDP:
> >               return BPF_PROG_TYPE_XDP;
> > +     case BPF_LSM_CGROUP:
> > +             return BPF_PROG_TYPE_LSM;
> >       default:
> >               return BPF_PROG_TYPE_UNSPEC;
> >       }
> > @@ -3247,6 +3254,7 @@ static int bpf_prog_attach(const union bpf_attr *attr)
> >       case BPF_PROG_TYPE_CGROUP_SOCKOPT:
> >       case BPF_PROG_TYPE_CGROUP_SYSCTL:
> >       case BPF_PROG_TYPE_SOCK_OPS:
> > +     case BPF_PROG_TYPE_LSM:
> >               ret = cgroup_bpf_prog_attach(attr, ptype, prog);
> >               break;
> >       default:
> > @@ -3284,6 +3292,7 @@ static int bpf_prog_detach(const union bpf_attr *attr)
> >       case BPF_PROG_TYPE_CGROUP_SOCKOPT:
> >       case BPF_PROG_TYPE_CGROUP_SYSCTL:
> >       case BPF_PROG_TYPE_SOCK_OPS:
> > +     case BPF_PROG_TYPE_LSM:
> >               return cgroup_bpf_prog_detach(attr, ptype);
> >       default:
> >               return -EINVAL;
> > @@ -4317,6 +4326,7 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
> >       case BPF_PROG_TYPE_CGROUP_DEVICE:
> >       case BPF_PROG_TYPE_CGROUP_SYSCTL:
> >       case BPF_PROG_TYPE_CGROUP_SOCKOPT:
> > +     case BPF_PROG_TYPE_LSM:
> >               ret = cgroup_bpf_link_attach(attr, prog);
> >               break;
> >       case BPF_PROG_TYPE_TRACING:
> > diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
> > index 0c4fd194e801..c76dfa4ea2d9 100644
> > --- a/kernel/bpf/trampoline.c
> > +++ b/kernel/bpf/trampoline.c
> > @@ -11,6 +11,8 @@
> >  #include <linux/rcupdate_wait.h>
> >  #include <linux/module.h>
> >  #include <linux/static_call.h>
> > +#include <linux/bpf_verifier.h>
> > +#include <linux/bpf_lsm.h>
> >
> >  /* dummy _ops. The verifier will operate on target program's ops. */
> >  const struct bpf_verifier_ops bpf_extension_verifier_ops = {
> > @@ -485,6 +487,149 @@ int bpf_trampoline_unlink_prog(struct bpf_prog *prog, struct bpf_trampoline *tr)
> >       return err;
> >  }
> >
> > +#if defined(CONFIG_BPF_JIT) && defined(CONFIG_BPF_SYSCALL)
> > +static struct bpf_prog *cgroup_shim_alloc(const struct bpf_prog *prog,
> > +                                       bpf_func_t bpf_func)
> > +{
> > +     struct bpf_prog *p;
> > +
> > +     p = bpf_prog_alloc(1, 0);
> > +     if (!p)
> > +             return NULL;
> > +
> > +     p->jited = false;
> > +     p->bpf_func = bpf_func;
> > +
> > +     p->aux->cgroup_atype = prog->aux->cgroup_atype;
> > +     p->aux->attach_func_proto = prog->aux->attach_func_proto;
> > +     p->aux->attach_btf_id = prog->aux->attach_btf_id;
> > +     p->aux->attach_btf = prog->aux->attach_btf;
> > +     btf_get(p->aux->attach_btf);
> > +     p->type = BPF_PROG_TYPE_LSM;
> > +     p->expected_attach_type = BPF_LSM_MAC;
> > +     bpf_prog_inc(p);
> > +
> > +     return p;
> > +}
> > +
> > +static struct bpf_prog *cgroup_shim_find(struct bpf_trampoline *tr,
> > +                                      bpf_func_t bpf_func)
> > +{
> > +     const struct bpf_prog_aux *aux;
> > +     int kind;
> > +
> > +     for (kind = 0; kind < BPF_TRAMP_MAX; kind++) {
> > +             hlist_for_each_entry(aux, &tr->progs_hlist[kind], tramp_hlist) {
> > +                     struct bpf_prog *p = aux->prog;
> > +
> > +                     if (p->bpf_func == bpf_func)
> > +                             return p;
> > +             }
> > +     }
> > +
> > +     return NULL;
> > +}
> > +
> > +int bpf_trampoline_link_cgroup_shim(struct bpf_prog *prog,
> > +                                 struct bpf_attach_target_info *tgt_info)
> > +{
> > +     struct bpf_prog *shim_prog = NULL;
> > +     struct bpf_trampoline *tr;
> > +     bpf_func_t bpf_func;
> > +     u64 key;
> > +     int err;
> > +
> > +     key = bpf_trampoline_compute_key(NULL, prog->aux->attach_btf,
> > +                                      prog->aux->attach_btf_id);
> > +
> > +     err = bpf_lsm_find_cgroup_shim(prog, &bpf_func);
> > +     if (err)
> > +             return err;
> > +
> > +     tr = bpf_trampoline_get(key, tgt_info);
> > +     if (!tr)
> > +             return  -ENOMEM;
> > +
> > +     mutex_lock(&tr->mutex);
> > +
> > +     shim_prog = cgroup_shim_find(tr, bpf_func);
> > +     if (shim_prog) {
> > +             /* Reusing existing shim attached by the other program.
> > +              */
> nit.  Avoid extra line in '*/' for one liner comment.  Didn't
> see this convention in other bpf codes.

Ack, will do!

> Also, how about the earlier comment regarding to another __bpf_prog_enter
> and __bpf_prog_exit for BPF_LSM_CGROUP that don't do the active counts
> and stats collection ?

Ugh, I totally forgot that we've agreed to remove those active counts
for the lsm-cgroup case. Let me actually add that part..

> > +             bpf_prog_inc(shim_prog);
> > +             mutex_unlock(&tr->mutex);
> > +             return 0;
> > +     }
> > +
> > +     /* Allocate and install new shim.
> > +      */
> Same here.

Ack. Thank you for another round of review!

> > +
> > +     shim_prog = cgroup_shim_alloc(prog, bpf_func);
> > +     if (!shim_prog) {
> > +             err = -ENOMEM;
> > +             goto out;
> > +     }
> > +
> > +     err = __bpf_trampoline_link_prog(shim_prog, tr);
> > +     if (err)
> > +             goto out;
> > +
> > +     mutex_unlock(&tr->mutex);
> > +
> > +     return 0;
> > +out:
> > +     if (shim_prog)
> > +             bpf_prog_put(shim_prog);
> > +
> > +     mutex_unlock(&tr->mutex);
> > +     return err;
> > +}
> > +
> > +void bpf_trampoline_unlink_cgroup_shim(struct bpf_prog *prog)
> > +{
> > +     struct bpf_prog *shim_prog;
> > +     struct bpf_trampoline *tr;
> > +     bpf_func_t bpf_func;
> > +     u64 key;
> > +     int err;
> > +
> > +     key = bpf_trampoline_compute_key(NULL, prog->aux->attach_btf,
> > +                                      prog->aux->attach_btf_id);
> > +
> > +     err = bpf_lsm_find_cgroup_shim(prog, &bpf_func);
> > +     if (err)
> > +             return;
> > +
> > +     tr = bpf_trampoline_lookup(key);
> > +     if (!tr)
> > +             return;
> > +
> > +     mutex_lock(&tr->mutex);
> > +
> > +     shim_prog = cgroup_shim_find(tr, bpf_func);
> > +     if (shim_prog) {
> > +             /* We use shim_prog refcnt for tracking whether to
> > +              * remove the shim program from the trampoline.
> > +              * Trampoline's mutex is held while refcnt is
> > +              * added/subtracted so we don't need to care about
> > +              * potential races.
> > +              */
> > +
> > +             if (atomic64_read(&shim_prog->aux->refcnt) == 1)
> > +                     WARN_ON_ONCE(__bpf_trampoline_unlink_prog(shim_prog, tr));
> > +
> > +             bpf_prog_put(shim_prog);
> > +     }
> > +
> > +     mutex_unlock(&tr->mutex);
> > +
> > +     bpf_trampoline_put(tr); /* bpf_trampoline_lookup */
> > +
> > +     if (shim_prog)
> > +             bpf_trampoline_put(tr);
> > +}
> > +#endif
> > +
> >  struct bpf_trampoline *bpf_trampoline_get(u64 key,
> >                                         struct bpf_attach_target_info *tgt_info)
> >  {
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index 9c1a02b82ecd..cc84954846d7 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -14197,6 +14197,7 @@ int bpf_check_attach_target(struct bpf_verifier_log *log,
> >               fallthrough;
> >       case BPF_MODIFY_RETURN:
> >       case BPF_LSM_MAC:
> > +     case BPF_LSM_CGROUP:
> >       case BPF_TRACE_FENTRY:
> >       case BPF_TRACE_FEXIT:
> >               if (!btf_type_is_func(t)) {
> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index d14b10b85e51..bbe48a2dd852 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> > @@ -998,6 +998,7 @@ enum bpf_attach_type {
> >       BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
> >       BPF_PERF_EVENT,
> >       BPF_TRACE_KPROBE_MULTI,
> > +     BPF_LSM_CGROUP,
> >       __MAX_BPF_ATTACH_TYPE
> >  };
> >
> > --
> > 2.36.0.rc0.470.gd361397f0d-goog
> >

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox