Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH 2/2] vDPA: conditionally read fields in virtio-net dev
From: Zhu, Lingshan @ 2022-08-16  9:08 UTC (permalink / raw)
  To: Si-Wei Liu, jasowang, mst
  Cc: virtualization, netdev, kvm, parav, xieyongji, gautam.dawar
In-Reply-To: <9b6292f3-9bd5-ecd8-5e42-cd5d12f036e7@oracle.com>



On 8/16/2022 3:58 PM, Si-Wei Liu wrote:
>
>
> On 8/15/2022 6:58 PM, Zhu, Lingshan wrote:
>>
>>
>> On 8/16/2022 7:32 AM, Si-Wei Liu wrote:
>>>
>>>
>>> On 8/15/2022 2:26 AM, Zhu Lingshan wrote:
>>>> Some fields of virtio-net device config space are
>>>> conditional on the feature bits, the spec says:
>>>>
>>>> "The mac address field always exists
>>>> (though is only valid if VIRTIO_NET_F_MAC is set)"
>>>>
>>>> "max_virtqueue_pairs only exists if VIRTIO_NET_F_MQ
>>>> or VIRTIO_NET_F_RSS is set"
>>>>
>>>> "mtu only exists if VIRTIO_NET_F_MTU is set"
>>>>
>>>> so we should read MTU, MAC and MQ in the device config
>>>> space only when these feature bits are offered.
>>>>
>>>> For MQ, if both VIRTIO_NET_F_MQ and VIRTIO_NET_F_RSS are
>>>> not set, the virtio device should have
>>>> one queue pair as default value, so when userspace querying queue 
>>>> pair numbers,
>>>> it should return mq=1 than zero.
>>>>
>>>> For MTU, if VIRTIO_NET_F_MTU is not set, we should not read
>>>> MTU from the device config sapce.
>>>> RFC894 <A Standard for the Transmission of IP Datagrams over 
>>>> Ethernet Networks>
>>>> says:"The minimum length of the data field of a packet sent over an
>>>> Ethernet is 1500 octets, thus the maximum length of an IP datagram
>>>> sent over an Ethernet is 1500 octets.  Implementations are encouraged
>>>> to support full-length packets"
>>> Noted there's a typo in the above "The *maximum* length of the data 
>>> field of a packet sent over an Ethernet is 1500 octets ..." and the 
>>> RFC was written 1984.
>> the spec RFC894 says it is 1500, see <a 
>> href="https://urldefense.com/v3/__https://www.rfc-editor.org/rfc/rfc894.txt__;!!ACWV5N9M2RV99hQ!MdgxZjw5sp5Qz-GKfwT1IWcw_L4Jo1-UekuJPFz1UrG3YuqirKz7P9ksdJFh1vB6zHJ7z8Q04fpT0-9jWXCtlWM$">https://www.rfc-editor.org/rfc/rfc894.txt</a>
>>>
>>> Apparently that is no longer true with the introduction of Jumbo 
>>> size frame later in the 2000s. I'm not sure what is the point of 
>>> mention this ancient RFC. It doesn't say default MTU of any Ethernet 
>>> NIC/switch should be 1500 in either  case.
>> This could be a larger number for sure, we are trying to find out the 
>> min value for Ethernet here, to support 1500 octets, MTU should be 
>> 1500 at least, so I assume 1500 should be the default value for MTU
>>>
>>>>
>>>> virtio spec says:"The virtio network device is a virtual ethernet 
>>>> card",
>>> Right,
>>>> so the default MTU value should be 1500 for virtio-net.
>>> ... but it doesn't say the default is 1500. At least, not in 
>>> explicit way. Why it can't be 1492 or even lower? In practice, if 
>>> the network backend has a MTU higher than 1500, there's nothing 
>>> wrong for guest to configure default MTU more than 1500.
>> same as above
>>>
>>>>
>>>> For MAC, the spec says:"If the VIRTIO_NET_F_MAC feature bit is set,
>>>> the configuration space mac entry indicates the “physical” address
>>>> of the network card, otherwise the driver would typically
>>>> generate a random local MAC address." So there is no
>>>> default MAC address if VIRTIO_NET_F_MAC not set.
>>>>
>>>> This commits introduces functions vdpa_dev_net_mtu_config_fill()
>>>> and vdpa_dev_net_mac_config_fill() to fill MTU and MAC.
>>>> It also fixes vdpa_dev_net_mq_config_fill() to report correct
>>>> MQ when _F_MQ is not present.
>>>>
>>>> These functions should check devices features than driver
>>>> features, and struct vdpa_device is not needed as a parameter
>>>>
>>>> The test & userspace tool output:
>>>>
>>>> Feature bit VIRTIO_NET_F_MTU, VIRTIO_NET_F_RSS, VIRTIO_NET_F_MQ
>>>> and VIRTIO_NET_F_MAC can be mask out by hardcode.
>>>>
>>>> However, it is challenging to "disable" the related fields
>>>> in the HW device config space, so let's just assume the values
>>>> are meaningless if the feature bits are not set.
>>>>
>>>> Before this change, when feature bits for RSS, MQ, MTU and MAC
>>>> are not set, iproute2 output:
>>>> $vdpa vdpa0: mac 00:e8:ca:11:be:05 link up link_announce false mtu 
>>>> 1500
>>>>    negotiated_features
>>>>
>>>> without this commit, function vdpa_dev_net_config_fill()
>>>> reads all config space fields unconditionally, so let's
>>>> assume the MAC and MTU are meaningless, and it checks
>>>> MQ with driver_features, so we don't see max_vq_pairs.
>>>>
>>>> After applying this commit, when feature bits for
>>>> MQ, RSS, MAC and MTU are not set,iproute2 output:
>>>> $vdpa dev config show vdpa0
>>>> vdpa0: link up link_announce false max_vq_pairs 1 mtu 1500
>>>>    negotiated_features
>>>>
>>>> As explained above:
>>>> Here is no MAC, because VIRTIO_NET_F_MAC is not set,
>>>> and there is no default value for MAC. It shows
>>>> max_vq_paris = 1 because even without MQ feature,
>>>> a functional virtio-net must have one queue pair.
>>>> mtu = 1500 is the default value as ethernet
>>>> required.
>>>>
>>>> This commit also add supplementary comments for
>>>> __virtio16_to_cpu(true, xxx) operations in
>>>> vdpa_dev_net_config_fill() and vdpa_fill_stats_rec()
>>>>
>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>>>> ---
>>>>   drivers/vdpa/vdpa.c | 60 
>>>> +++++++++++++++++++++++++++++++++++----------
>>>>   1 file changed, 47 insertions(+), 13 deletions(-)
>>>>
>>>> diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
>>>> index efb55a06e961..a74660b98979 100644
>>>> --- a/drivers/vdpa/vdpa.c
>>>> +++ b/drivers/vdpa/vdpa.c
>>>> @@ -801,19 +801,44 @@ static int vdpa_nl_cmd_dev_get_dumpit(struct 
>>>> sk_buff *msg, struct netlink_callba
>>>>       return msg->len;
>>>>   }
>>>>   -static int vdpa_dev_net_mq_config_fill(struct vdpa_device *vdev,
>>>> -                       struct sk_buff *msg, u64 features,
>>>> +static int vdpa_dev_net_mq_config_fill(struct sk_buff *msg, u64 
>>>> features,
>>>>                          const struct virtio_net_config *config)
>>>>   {
>>>>       u16 val_u16;
>>>>   -    if ((features & BIT_ULL(VIRTIO_NET_F_MQ)) == 0)
>>>> -        return 0;
>>>> +    if ((features & BIT_ULL(VIRTIO_NET_F_MQ)) == 0 &&
>>>> +        (features & BIT_ULL(VIRTIO_NET_F_RSS)) == 0)
>>>> +        val_u16 = 1;
>>>> +    else
>>>> +        val_u16 = __virtio16_to_cpu(true, 
>>>> config->max_virtqueue_pairs);
>>>>   -    val_u16 = le16_to_cpu(config->max_virtqueue_pairs);
>>>>       return nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MAX_VQP, val_u16);
>>>>   }
>>>>   +static int vdpa_dev_net_mtu_config_fill(struct sk_buff *msg, u64 
>>>> features,
>>>> +                    const struct virtio_net_config *config)
>>>> +{
>>>> +    u16 val_u16;
>>>> +
>>>> +    if ((features & BIT_ULL(VIRTIO_NET_F_MTU)) == 0)
>>>> +        val_u16 = 1500;
>>> As said, there's no virtio spec defined value for MTU. Please leave 
>>> this field out if feature VIRTIO_NET_F_MTU is not negotiated.
>> same as above
>>>> +    else
>>>> +        val_u16 = __virtio16_to_cpu(true, config->mtu);
>>>> +
>>>> +    return nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MTU, val_u16);
>>>> +}
>>>> +
>>>> +static int vdpa_dev_net_mac_config_fill(struct sk_buff *msg, u64 
>>>> features,
>>>> +                    const struct virtio_net_config *config)
>>>> +{
>>>> +    if ((features & BIT_ULL(VIRTIO_NET_F_MAC)) == 0)
>>>> +        return 0;
>>>> +    else
>>>> +        return  nla_put(msg, VDPA_ATTR_DEV_NET_CFG_MACADDR,
>>>> +                sizeof(config->mac), config->mac);
>>>> +}
>>>> +
>>>> +
>>>>   static int vdpa_dev_net_config_fill(struct vdpa_device *vdev, 
>>>> struct sk_buff *msg)
>>>>   {
>>>>       struct virtio_net_config config = {};
>>>> @@ -822,18 +847,16 @@ static int vdpa_dev_net_config_fill(struct 
>>>> vdpa_device *vdev, struct sk_buff *ms
>>>>         vdpa_get_config_unlocked(vdev, 0, &config, sizeof(config));
>>>>   -    if (nla_put(msg, VDPA_ATTR_DEV_NET_CFG_MACADDR, 
>>>> sizeof(config.mac),
>>>> -            config.mac))
>>>> -        return -EMSGSIZE;
>>>> +    /*
>>>> +     * Assume little endian for now, userspace can tweak this for
>>>> +     * legacy guest support.
>>> You can leave it as a TODO for kernel (vdpa core limitation), but 
>>> AFAIK there's nothing userspace needs to do to infer the endianness. 
>>> IMHO it's the kernel's job to provide an abstraction rather than 
>>> rely on userspace guessing it.
>> we have discussed it in another thread, and this comment is suggested 
>> by MST.
> Can you provide the context or link? It shouldn't work like this, 
> otherwise it is breaking uABI. E.g. how will a legacy/BE supporting 
> kernel/device be backward compatible with older vdpa tool (which has 
> knowledge of this endianness implication/assumption from day one)?
https://www.spinics.net/lists/netdev/msg837114.html

The challenge is that the status filed is virtio16, not le16, so 
le16_to_cpu(xxx) is wrong anyway. However we can not tell whether it is 
a LE or BE device from struct vdpa_device, so for most cases, we assume 
it is LE, and leave this comment.

Thanks
>
> -Siwei
>
>>>
>>>> +     */
>>>> +    val_u16 = __virtio16_to_cpu(true, config.status);
>>>>         val_u16 = __virtio16_to_cpu(true, config.status);
>>>>       if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_STATUS, val_u16))
>>>>           return -EMSGSIZE;
>>>>   -    val_u16 = __virtio16_to_cpu(true, config.mtu);
>>>> -    if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MTU, val_u16))
>>>> -        return -EMSGSIZE;
>>>> -
>>>>       features_driver = vdev->config->get_driver_features(vdev);
>>>>       if (nla_put_u64_64bit(msg, VDPA_ATTR_DEV_NEGOTIATED_FEATURES, 
>>>> features_driver,
>>>>                     VDPA_ATTR_PAD))
>>>> @@ -846,7 +869,13 @@ static int vdpa_dev_net_config_fill(struct 
>>>> vdpa_device *vdev, struct sk_buff *ms
>>>>                     VDPA_ATTR_PAD))
>>>>           return -EMSGSIZE;
>>>>   -    return vdpa_dev_net_mq_config_fill(vdev, msg, 
>>>> features_driver, &config);
>>>> +    if (vdpa_dev_net_mac_config_fill(msg, features_device, &config))
>>>> +        return -EMSGSIZE;
>>>> +
>>>> +    if (vdpa_dev_net_mtu_config_fill(msg, features_device, &config))
>>>> +        return -EMSGSIZE;
>>>> +
>>>> +    return vdpa_dev_net_mq_config_fill(msg, features_device, 
>>>> &config);
>>>>   }
>>>>     static int
>>>> @@ -914,6 +943,11 @@ static int vdpa_fill_stats_rec(struct 
>>>> vdpa_device *vdev, struct sk_buff *msg,
>>>>       }
>>>>       vdpa_get_config_unlocked(vdev, 0, &config, sizeof(config));
>>>>   +    /*
>>>> +     * Assume little endian for now, userspace can tweak this for
>>>> +     * legacy guest support.
>>>> +     */
>>>> +
>>> Ditto.
>> same as above
>>
>> Thanks
>>>
>>> Thanks,
>>> -Siwei
>>>>       max_vqp = __virtio16_to_cpu(true, config.max_virtqueue_pairs);
>>>>       if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MAX_VQP, max_vqp))
>>>>           return -EMSGSIZE;
>>>
>>
>


^ permalink raw reply

* Re: [net v2 1/1] ice: Fix crash by keep old cfg when update TCs more than queues
From: Anatolii Gerasymenko @ 2022-08-16  9:13 UTC (permalink / raw)
  To: Ding Hui, jesse.brandeburg, anthony.l.nguyen, davem, edumazet,
	kuba, pabeni, keescook, intel-wired-lan
  Cc: netdev, linux-kernel, linux-hardening
In-Reply-To: <20220815011844.22193-1-dinghui@sangfor.com.cn>

On 15.08.2022 03:18, Ding Hui wrote:
> There are problems if allocated queues less than Traffic Classes.
> 
> Commit a632b2a4c920 ("ice: ethtool: Prohibit improper channel config
> for DCB") already disallow setting less queues than TCs.
> 
> Another case is if we first set less queues, and later update more TCs
> config due to LLDP, ice_vsi_cfg_tc() will failed but left dirty
> num_txq/rxq and tc_cfg in vsi, that will cause invalid porinter access.

Nice catch. Looks good to me.
 
> [   95.968089] ice 0000:3b:00.1: More TCs defined than queues/rings allocated.
> [   95.968092] ice 0000:3b:00.1: Trying to use more Rx queues (8), than were allocated (1)!
> [   95.968093] ice 0000:3b:00.1: Failed to config TC for VSI index: 0
> [   95.969621] general protection fault: 0000 [#1] SMP NOPTI
> [   95.969705] CPU: 1 PID: 58405 Comm: lldpad Kdump: loaded Tainted: G     U  W  O     --------- -t - 4.18.0 #1
> [   95.969867] Hardware name: O.E.M/BC11SPSCB10, BIOS 8.23 12/30/2021
> [   95.969992] RIP: 0010:devm_kmalloc+0xa/0x60
> [   95.970052] Code: 5c ff ff ff 31 c0 5b 5d 41 5c c3 b8 f4 ff ff ff eb f4 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 89 d1 <8b> 97 60 02 00 00 48 8d 7e 18 48 39 f7 72 3f 55 89 ce 53 48 8b 4c
> [   95.970344] RSP: 0018:ffffc9003f553888 EFLAGS: 00010206
> [   95.970425] RAX: dead000000000200 RBX: ffffea003c425b00 RCX: 00000000006080c0
> [   95.970536] RDX: 00000000006080c0 RSI: 0000000000000200 RDI: dead000000000200
> [   95.970648] RBP: dead000000000200 R08: 00000000000463c0 R09: ffff888ffa900000
> [   95.970760] R10: 0000000000000000 R11: 0000000000000002 R12: ffff888ff6b40100
> [   95.970870] R13: ffff888ff6a55018 R14: 0000000000000000 R15: ffff888ff6a55460
> [   95.970981] FS:  00007f51b7d24700(0000) GS:ffff88903ee80000(0000) knlGS:0000000000000000
> [   95.971108] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   95.971197] CR2: 00007fac5410d710 CR3: 0000000f2c1de002 CR4: 00000000007606e0
> [   95.971309] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   95.971419] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [   95.971530] PKRU: 55555554
> [   95.971573] Call Trace:
> [   95.971622]  ice_setup_rx_ring+0x39/0x110 [ice]
> [   95.971695]  ice_vsi_setup_rx_rings+0x54/0x90 [ice]
> [   95.971774]  ice_vsi_open+0x25/0x120 [ice]
> [   95.971843]  ice_open_internal+0xb8/0x1f0 [ice]
> [   95.971919]  ice_ena_vsi+0x4f/0xd0 [ice]
> [   95.971987]  ice_dcb_ena_dis_vsi.constprop.5+0x29/0x90 [ice]
> [   95.972082]  ice_pf_dcb_cfg+0x29a/0x380 [ice]
> [   95.972154]  ice_dcbnl_setets+0x174/0x1b0 [ice]
> [   95.972220]  dcbnl_ieee_set+0x89/0x230
> [   95.972279]  ? dcbnl_ieee_del+0x150/0x150
> [   95.972341]  dcb_doit+0x124/0x1b0
> [   95.972392]  rtnetlink_rcv_msg+0x243/0x2f0
> [   95.972457]  ? dcb_doit+0x14d/0x1b0
> [   95.972510]  ? __kmalloc_node_track_caller+0x1d3/0x280
> [   95.972591]  ? rtnl_calcit.isra.31+0x100/0x100
> [   95.972661]  netlink_rcv_skb+0xcf/0xf0
> [   95.972720]  netlink_unicast+0x16d/0x220
> [   95.972781]  netlink_sendmsg+0x2ba/0x3a0
> [   95.975891]  sock_sendmsg+0x4c/0x50
> [   95.979032]  ___sys_sendmsg+0x2e4/0x300
> [   95.982147]  ? kmem_cache_alloc+0x13e/0x190
> [   95.985242]  ? __wake_up_common_lock+0x79/0x90
> [   95.988338]  ? __check_object_size+0xac/0x1b0
> [   95.991440]  ? _copy_to_user+0x22/0x30
> [   95.994539]  ? move_addr_to_user+0xbb/0xd0
> [   95.997619]  ? __sys_sendmsg+0x53/0x80
> [   96.000664]  __sys_sendmsg+0x53/0x80
> [   96.003747]  do_syscall_64+0x5b/0x1d0
> [   96.006862]  entry_SYSCALL_64_after_hwframe+0x65/0xca
> 
> Only update num_txq/rxq when passed check, and restore tc_cfg if setup
> queue map failed.
> 
> Signed-off-by: Ding Hui <dinghui@sangfor.com.cn>

Please, also add Fixes tag.

> ---
>  drivers/net/ethernet/intel/ice/ice_lib.c | 42 +++++++++++++++---------
>  1 file changed, 26 insertions(+), 16 deletions(-)
> 
> ---
> v1:
> https://patchwork.kernel.org/project/netdevbpf/patch/20220812123933.5481-1-dinghui@sangfor.com.cn/
> 
> v2:
>   rewrite subject
>   rebase to net
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c
> index a830f7f9aed0..6e64cca30351 100644
> --- a/drivers/net/ethernet/intel/ice/ice_lib.c
> +++ b/drivers/net/ethernet/intel/ice/ice_lib.c
> @@ -914,7 +914,7 @@ static void ice_set_dflt_vsi_ctx(struct ice_hw *hw, struct ice_vsi_ctx *ctxt)
>   */
>  static int ice_vsi_setup_q_map(struct ice_vsi *vsi, struct ice_vsi_ctx *ctxt)
>  {
> -	u16 offset = 0, qmap = 0, tx_count = 0, pow = 0;
> +	u16 offset = 0, qmap = 0, tx_count = 0, rx_count = 0, pow = 0;
>  	u16 num_txq_per_tc, num_rxq_per_tc;
>  	u16 qcount_tx = vsi->alloc_txq;
>  	u16 qcount_rx = vsi->alloc_rxq;
> @@ -981,23 +981,25 @@ static int ice_vsi_setup_q_map(struct ice_vsi *vsi, struct ice_vsi_ctx *ctxt)
>  	 * at least 1)
>  	 */
>  	if (offset)
> -		vsi->num_rxq = offset;
> +		rx_count = offset;
>  	else
> -		vsi->num_rxq = num_rxq_per_tc;
> +		rx_count = num_rxq_per_tc;
>  
> -	if (vsi->num_rxq > vsi->alloc_rxq) {
> +	if (rx_count > vsi->alloc_rxq) {
>  		dev_err(ice_pf_to_dev(vsi->back), "Trying to use more Rx queues (%u), than were allocated (%u)!\n",
> -			vsi->num_rxq, vsi->alloc_rxq);
> +			rx_count, vsi->alloc_rxq);
>  		return -EINVAL;
>  	}
>  
> -	vsi->num_txq = tx_count;
> -	if (vsi->num_txq > vsi->alloc_txq) {
> +	if (tx_count > vsi->alloc_txq) {
>  		dev_err(ice_pf_to_dev(vsi->back), "Trying to use more Tx queues (%u), than were allocated (%u)!\n",
> -			vsi->num_txq, vsi->alloc_txq);
> +			tx_count, vsi->alloc_txq);
>  		return -EINVAL;
>  	}
>  
> +	vsi->num_txq = tx_count;
> +	vsi->num_rxq = rx_count;
> +
>  	if (vsi->type == ICE_VSI_VF && vsi->num_txq != vsi->num_rxq) {
>  		dev_dbg(ice_pf_to_dev(vsi->back), "VF VSI should have same number of Tx and Rx queues. Hence making them equal\n");
>  		/* since there is a chance that num_rxq could have been changed
> @@ -3492,6 +3494,7 @@ ice_vsi_setup_q_map_mqprio(struct ice_vsi *vsi, struct ice_vsi_ctx *ctxt,
>  	int tc0_qcount = vsi->mqprio_qopt.qopt.count[0];
>  	u8 netdev_tc = 0;
>  	int i;
> +	u16 new_txq, new_rxq;

Please follow the Reverse Christmas Tree (RCT) convention.

>  	vsi->tc_cfg.ena_tc = ena_tc ? ena_tc : 1;
>  
> @@ -3530,21 +3533,24 @@ ice_vsi_setup_q_map_mqprio(struct ice_vsi *vsi, struct ice_vsi_ctx *ctxt,
>  		}
>  	}
>  
> -	/* Set actual Tx/Rx queue pairs */
> -	vsi->num_txq = offset + qcount_tx;
> -	if (vsi->num_txq > vsi->alloc_txq) {
> +	new_txq = offset + qcount_tx;
> +	if (new_txq > vsi->alloc_txq) {
>  		dev_err(ice_pf_to_dev(vsi->back), "Trying to use more Tx queues (%u), than were allocated (%u)!\n",
> -			vsi->num_txq, vsi->alloc_txq);
> +			new_txq, vsi->alloc_txq);
>  		return -EINVAL;
>  	}
>  
> -	vsi->num_rxq = offset + qcount_rx;
> -	if (vsi->num_rxq > vsi->alloc_rxq) {
> +	new_rxq = offset + qcount_rx;
> +	if (new_rxq > vsi->alloc_rxq) {
>  		dev_err(ice_pf_to_dev(vsi->back), "Trying to use more Rx queues (%u), than were allocated (%u)!\n",
> -			vsi->num_rxq, vsi->alloc_rxq);
> +			new_rxq, vsi->alloc_rxq);
>  		return -EINVAL;
>  	}
>  
> +	/* Set actual Tx/Rx queue pairs */
> +	vsi->num_txq = new_txq;
> +	vsi->num_rxq = new_rxq;
> +
>  	/* Setup queue TC[0].qmap for given VSI context */
>  	ctxt->info.tc_mapping[0] = cpu_to_le16(qmap);
>  	ctxt->info.q_mapping[0] = cpu_to_le16(vsi->rxq_map[0]);
> @@ -3580,6 +3586,7 @@ int ice_vsi_cfg_tc(struct ice_vsi *vsi, u8 ena_tc)
>  	struct device *dev;
>  	int i, ret = 0;
>  	u8 num_tc = 0;
> +	struct ice_tc_cfg old_tc_cfg;

RCT here also.
  
>  	dev = ice_pf_to_dev(pf);
>  	if (vsi->tc_cfg.ena_tc == ena_tc &&
> @@ -3600,6 +3607,7 @@ int ice_vsi_cfg_tc(struct ice_vsi *vsi, u8 ena_tc)
>  			max_txqs[i] = vsi->num_txq;
>  	}
>  
> +	memcpy(&old_tc_cfg, &vsi->tc_cfg, sizeof(old_tc_cfg));
>  	vsi->tc_cfg.ena_tc = ena_tc;
>  	vsi->tc_cfg.numtc = num_tc;
>  
> @@ -3616,8 +3624,10 @@ int ice_vsi_cfg_tc(struct ice_vsi *vsi, u8 ena_tc)
>  	else
>  		ret = ice_vsi_setup_q_map(vsi, ctx);
>  
> -	if (ret)
> +	if (ret) {
> +		memcpy(&vsi->tc_cfg, &old_tc_cfg, sizeof(vsi->tc_cfg));
>  		goto out;
> +	}
>  
>  	/* must to indicate which section of VSI context are being modified */
>  	ctx->info.valid_sections = cpu_to_le16(ICE_AQ_VSI_PROP_RXQ_MAP_VALID);

^ permalink raw reply

* [PATCH net] net/mlx5e: Allocate flow steering storage during uplink initialization
From: Leon Romanovsky @ 2022-08-16  8:47 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski
  Cc: Leon Romanovsky, Eric Dumazet, Lama Kayal, netdev, Paolo Abeni,
	Saeed Mahameed, Tariq Toukan

From: Leon Romanovsky <leonro@nvidia.com>

IPsec code relies on valid priv->fs pointer that is the case in NIC
flow, but not correct in uplink. Before commit that mentioned in the
Fixes line, that pointer was valid in all flows as it was allocated
together with priv struct.

In addition, the cleanup representors routine called to that
not-initialized priv->fs pointer and its internals which caused NULL
deference.

So, move FS allocation to be as early as possible.

Fixes: af8bbf730068 ("net/mlx5e: Convert mlx5e_flow_steering member of mlx5e_priv to pointer")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/en_rep.c  | 25 +++++++++++++------
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index 4c1599de652c..0c66774a1720 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -696,6 +696,13 @@ static int mlx5e_init_rep(struct mlx5_core_dev *mdev,
 {
 	struct mlx5e_priv *priv = netdev_priv(netdev);
 
+	priv->fs = mlx5e_fs_init(priv->profile, mdev,
+				 !test_bit(MLX5E_STATE_DESTROYING, &priv->state));
+	if (!priv->fs) {
+		netdev_err(priv->netdev, "FS allocation failed\n");
+		return -ENOMEM;
+	}
+
 	mlx5e_build_rep_params(netdev);
 	mlx5e_timestamp_init(priv);
 
@@ -708,12 +715,21 @@ static int mlx5e_init_ul_rep(struct mlx5_core_dev *mdev,
 	struct mlx5e_priv *priv = netdev_priv(netdev);
 	int err;
 
+	priv->fs = mlx5e_fs_init(priv->profile, mdev,
+				 !test_bit(MLX5E_STATE_DESTROYING, &priv->state));
+	if (!priv->fs) {
+		netdev_err(priv->netdev, "FS allocation failed\n");
+		return -ENOMEM;
+	}
+
 	err = mlx5e_ipsec_init(priv);
 	if (err)
 		mlx5_core_err(mdev, "Uplink rep IPsec initialization failed, %d\n", err);
 
 	mlx5e_vxlan_set_netdev_info(priv);
-	return mlx5e_init_rep(mdev, netdev);
+	mlx5e_build_rep_params(netdev);
+	mlx5e_timestamp_init(priv);
+	return 0;
 }
 
 static void mlx5e_cleanup_rep(struct mlx5e_priv *priv)
@@ -836,13 +852,6 @@ static int mlx5e_init_rep_rx(struct mlx5e_priv *priv)
 	struct mlx5_core_dev *mdev = priv->mdev;
 	int err;
 
-	priv->fs = mlx5e_fs_init(priv->profile, mdev,
-				 !test_bit(MLX5E_STATE_DESTROYING, &priv->state));
-	if (!priv->fs) {
-		netdev_err(priv->netdev, "FS allocation failed\n");
-		return -ENOMEM;
-	}
-
 	priv->rx_res = mlx5e_rx_res_alloc();
 	if (!priv->rx_res) {
 		err = -ENOMEM;
-- 
2.37.2


^ permalink raw reply related

* Re: [PATCH net-next 07/10] net/smc: reduce unnecessary blocking in smcr_lgr_reg_rmbs()
From: Tony Lu @ 2022-08-16  8:24 UTC (permalink / raw)
  To: D. Wythe; +Cc: kgraul, wenjia, kuba, davem, netdev, linux-s390, linux-rdma
In-Reply-To: <46f364ce7878b740e58bf44d3bed5fe23c64a260.1660152975.git.alibuda@linux.alibaba.com>

On Thu, Aug 11, 2022 at 01:47:38AM +0800, D. Wythe wrote:
> From: "D. Wythe" <alibuda@linux.alibaba.com>
> 
> Unlike smc_buf_create() and smcr_buf_unuse(), smcr_lgr_reg_rmbs() is
> exclusive when assigned rmb_desc was not registered, although it can be
> executed in parallel when assigned rmb_desc was registered already
> and only performs read semtamics on it. Hence, we can not simply replace
> it with read semaphore.
> 
> The idea here is that if the assigned rmb_desc was registered already,
> use read semaphore to protect the critical section, once the assigned
> rmb_desc was not registered, keep using keep write semaphore still
> to keep its exclusivity.
> 
> Thanks to the reusable features of rmb_desc, which allows us to execute
> in parallel in most cases.
> 
> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> ---
>  net/smc/af_smc.c | 19 +++++++++++++++++--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
> index 51b90e2..39dbf39 100644
> --- a/net/smc/af_smc.c
> +++ b/net/smc/af_smc.c
> @@ -516,10 +516,25 @@ static int smcr_lgr_reg_rmbs(struct smc_link *link,
>  {
>  	struct smc_link_group *lgr = link->lgr;
>  	int i, rc = 0;
> +	bool slow = false;

Consider do_slow?

Reverse Christmas tree.

>  
>  	rc = smc_llc_flow_initiate(lgr, SMC_LLC_FLOW_RKEY);
>  	if (rc)
>  		return rc;
> +
> +	down_read(&lgr->llc_conf_mutex);
> +	for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
> +		if (!smc_link_active(&lgr->lnk[i]))
> +			continue;
> +		if (!rmb_desc->is_reg_mr[link->link_idx]) {
> +			up_read(&lgr->llc_conf_mutex);
> +			goto slow_path;
> +		}
> +	}
> +	/* mr register already */
> +	goto fast_path;
> +slow_path:
> +	slow = true;
>  	/* protect against parallel smc_llc_cli_rkey_exchange() and
>  	 * parallel smcr_link_reg_buf()
>  	 */
> @@ -531,7 +546,7 @@ static int smcr_lgr_reg_rmbs(struct smc_link *link,
>  		if (rc)
>  			goto out;
>  	}
> -
> +fast_path:
>  	/* exchange confirm_rkey msg with peer */
>  	rc = smc_llc_do_confirm_rkey(link, rmb_desc);
>  	if (rc) {
> @@ -540,7 +555,7 @@ static int smcr_lgr_reg_rmbs(struct smc_link *link,
>  	}
>  	rmb_desc->is_conf_rkey = true;
>  out:
> -	up_write(&lgr->llc_conf_mutex);
> +	slow ? up_write(&lgr->llc_conf_mutex) : up_read(&lgr->llc_conf_mutex);
>  	smc_llc_flow_stop(lgr, &lgr->llc_flow_lcl);
>  	return rc;
>  }
> -- 
> 1.8.3.1

^ permalink raw reply

* Re: igc: missing HW timestamps at TX
From: Vladimir Oltean @ 2022-08-16  9:33 UTC (permalink / raw)
  To: Ferenc Fejes
  Cc: vinicius.gomes@intel.com, marton12050@gmail.com,
	netdev@vger.kernel.org, peti.antal99@gmail.com
In-Reply-To: <1016fb1e514ff38ebfd22c52e2d848a7e8bc8d1a.camel@ericsson.com>

Hi Ferenc,

On Mon, Aug 15, 2022 at 06:47:31AM +0000, Ferenc Fejes wrote:
> I just played with those a little. Looks like the --cpu-mask the one it
> helps in my case. For example I checked the CPU core of the
> igc_ptp_tx_work:
> 
> # bpftrace -e 'kprobe:igc_ptp_tx_work { printf("%d\n", cpu); exit(); }'
> Attaching 1 probe...
> 0

I think this print is slightly irrelevant in the grand scheme, or at
least not very stable. Because schedule_work() is implemented as
"queue_work(system_wq, work)", and queue_work() is implemented as
"queue_work_on(WORK_CPU_UNBOUND, wq, work)", it means that the work item
associated with igc_ptp_tx_work() is not bound to any requested CPU.
So unless the prints are taken from the actual test rather than just
done once before it, which percpu kthread worker executes it from within
the pool might vary.  In turn, __queue_work() selects the CPU based on
raw_smp_processor_id() on which the caller is located (in this case, the
IRQ handler). So it will depend upon the tsync interrupt affinity,
basically.

> 
> Looks like its running on core 0, so I run the isochro:
> taskset -c 0 isochron ... --cpu-mask $((1 << 0)) - no lost timestamps
> taskset -c 1 isochron ... --cpu-mask $((1 << 0)) - no lost timestamps
> taskset -c 0 isochron ... --cpu-mask $((1 << 1)) - losing timestamps
> taskset -c 1 isochron ... --cpu-mask $((1 << 1)) - losing timestamps
(...)
> Maybe this is what helps in my case? With funccount tracer I checked
> that when the sender thread and igc_ptp_tx_work running on the same
> core, the worker called exactly as many times as many packets I sent.
> 
> However if the worker running on different core, funccount show some
> random number (less than the packets sent) and in that case I also lost
> timestamps.

Thanks.

Note that if igc_ptp_tx_work runs well on the same CPU (0) as the
isochron sender thread, but *not* that well on the other CPU,
I think a simple explanation (for now) might have to do with dynamic
frequency scaling of the CPUs (CONFIG_CPU_FREQ). If the CPU is kept busy
by the sender thread, the governor will increase the CPU frequency and
the tsync interrupt will be processed quicker, and this will unclog the
"single skb in flight" limitation quicker. If the CPU is mostly idle and
woken up only from time to time by a tsync interrupt, then the "single
skb in flight" limitation will kick in more often, and the isochron
thread will have its TX timestamp requests silently dropped in that
meantime until the idle CPU ramps up to execute its scheduled work item.

To prove my point you can try to compile a kernel with CONFIG_CPU_FREQ=n.
Makes sense?

^ permalink raw reply

* Re: [PATCH net-next 09/10] net/smc: fix potential panic dues to unprotected smc_llc_srv_add_link()
From: Tony Lu @ 2022-08-16  8:28 UTC (permalink / raw)
  To: D. Wythe; +Cc: kgraul, wenjia, kuba, davem, netdev, linux-s390, linux-rdma
In-Reply-To: <f4c5b1ba19c926e8b3d1def2ff685f29b2631b24.1660152975.git.alibuda@linux.alibaba.com>

On Thu, Aug 11, 2022 at 01:47:40AM +0800, D. Wythe wrote:
> From: "D. Wythe" <alibuda@linux.alibaba.com>
> 
> After we optimize the parallel capability of SMC-R connection
> establishment, there is a certain chance to trigger the
> following panic:
> 
> PID: 5900   TASK: ffff88c1c8af4100  CPU: 1   COMMAND: "kworker/1:48"
>  #0 [ffff9456c1cc79a0] machine_kexec at ffffffff870665b7
>  #1 [ffff9456c1cc79f0] __crash_kexec at ffffffff871b4c7a
>  #2 [ffff9456c1cc7ab0] crash_kexec at ffffffff871b5b60
>  #3 [ffff9456c1cc7ac0] oops_end at ffffffff87026ce7
>  #4 [ffff9456c1cc7ae0] page_fault_oops at ffffffff87075715
>  #5 [ffff9456c1cc7b58] exc_page_fault at ffffffff87ad0654
>  #6 [ffff9456c1cc7b80] asm_exc_page_fault at ffffffff87c00b62
>     [exception RIP: ib_alloc_mr+19]
>     RIP: ffffffffc0c9cce3  RSP: ffff9456c1cc7c38  RFLAGS: 00010202
>     RAX: 0000000000000000  RBX: 0000000000000002  RCX: 0000000000000004
>     RDX: 0000000000000010  RSI: 0000000000000000  RDI: 0000000000000000
>     RBP: ffff88c1ea281d00   R8: 000000020a34ffff   R9: ffff88c1350bbb20
>     R10: 0000000000000000  R11: 0000000000000001  R12: 0000000000000000
>     R13: 0000000000000010  R14: ffff88c1ab040a50  R15: ffff88c1ea281d00
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>  #7 [ffff9456c1cc7c60] smc_ib_get_memory_region at ffffffffc0aff6df [smc]
>  #8 [ffff9456c1cc7c88] smcr_buf_map_link at ffffffffc0b0278c [smc]
>  #9 [ffff9456c1cc7ce0] __smc_buf_create at ffffffffc0b03586 [smc]
> 
> The reason here is that when the server tries to create a second link,
> smc_llc_srv_add_link() has no protection and may add a new link to
> link group. This breaks the security environment protected by
> llc_conf_mutex.
> 
> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>

I am curious if this patch can be merged with the previous one? It seems
that this panic is introduced by previous one?

> ---
>  net/smc/af_smc.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
> index 39dbf39..0b0c53a 100644
> --- a/net/smc/af_smc.c
> +++ b/net/smc/af_smc.c
> @@ -1834,8 +1834,10 @@ static int smcr_serv_conf_first_link(struct smc_sock *smc)
>  	smc_llc_link_active(link);
>  	smcr_lgr_set_type(link->lgr, SMC_LGR_SINGLE);
>  
> +	down_write(&link->lgr->llc_conf_mutex);
>  	/* initial contact - try to establish second link */
>  	smc_llc_srv_add_link(link, NULL);
> +	up_write(&link->lgr->llc_conf_mutex);
>  	return 0;
>  }
>  
> -- 
> 1.8.3.1

^ permalink raw reply

* [ RFC  net-next 0/3] net: flow_offload: add support for per action hw stats
From: Oz Shlomo @ 2022-08-16  9:23 UTC (permalink / raw)
  To: netdev
  Cc: Jiri Pirko, Jamal Hadi Salim, Simon Horman, Baowen Zheng,
	Vlad Buslov, Ido Schimmel, Roi Dayan, Oz Shlomo

There are currently two mechanisms for populating hardware stats:
1. Using flow_offload api to query the flow's statistics.
   The api assumes that the same stats values apply to all
   the flow's actions.
   This assumption breaks when action drops or jumps over following
   actions.
2. Using hw_action api to query specific action stats via a driver
   callback method. This api assures the correct action stats for
   the offloaded action, however, it does not apply to the rest of the
   actions in the flow's actions array, as elaborated below.

The current hw_action api does not apply to the following use cases:
1. Actions that are implicitly created by filters (aka bind actions).
   In the following example only one counter will apply to the rule:
   tc filter add dev $DEV prio 2 protocol ip parent ffff: \
        flower ip_proto tcp dst_ip $IP2 \
        action police rate 1mbit burst 100k conform-exceed drop/pipe \
        action mirred egress redirect dev $DEV2
  
2. Action preceding a hw action.
   In the following example the same flow stats will apply to the sample and
   mirred actions:
    tc action add police rate 1mbit burst 100k conform-exceed drop / pipe
    tc filter add dev $DEV prio 2 protocol ip parent ffff: \
        flower ip_proto tcp dst_ip $IP2 \
        action sample rate 1 group 10 trunc 60 pipe \
        action police index 1 \
        action mirred egress redirect dev $DEV2
        
3. Meter action using jump control.
   In the following example the same flow stats will apply to both
   mirred actions:
    tc action add police rate 1mbit burst 100k conform-exceed jump 2 / pipe
    tc filter add dev $DEV prio 2 protocol ip parent ffff: \
        flower ip_proto tcp dst_ip $IP2 \
        action police index 1 \
        action mirred egress redirect dev $DEV2
        action mirred egress redirect dev $DEV3

This series provides the platform to query per action stats for in_hw flows.

The first patch is a preparation patch

The second patch extends the flow_offload api to return stats array corresponding
to the flow's actions list.
The api populates all the actions' stats in a single callback invocation.
It also allows drivers to avoid per-action lookups by maintain pre-processed
array of the flow's action counters.

The third patch refreshes the hardware action stats from the userspace tc action utility.
It uses the existing hardware action api to query stats per action.
The api has lower performance, compared to the filter refresh stats, as it requires
a driver callback invocation per action, while requiring the driver to lookup the stats
for a specific action id.

Note that this series does not change the existing functionality, thus preserving
the current stats per flow design.

Mellanox driver implementation of the proposed api will follow the rfc discussion.

Oz Shlomo (2):
  net: flow_offload: add action stats api
  net/sched: act_api: update hw stats for tc action list

Roi Dayan (1):
  net: sched: Pass flow_stats instead of multiple stats args

 include/net/flow_offload.h |  6 ++++++
 include/net/pkt_cls.h      | 27 ++++++++++++++++-----------
 net/sched/act_api.c        | 15 +++++++++++----
 net/sched/cls_flower.c     |  9 +++------
 net/sched/cls_matchall.c   |  6 +-----
 5 files changed, 37 insertions(+), 26 deletions(-)

-- 
1.8.3.1


^ permalink raw reply

* Re: igc: missing HW timestamps at TX
From: Vladimir Oltean @ 2022-08-16  9:10 UTC (permalink / raw)
  To: Vinicius Costa Gomes
  Cc: Ferenc Fejes, netdev@vger.kernel.org, marton12050@gmail.com,
	peti.antal99@gmail.com
In-Reply-To: <87k079hzry.fsf@intel.com>

Hi Vinicius,

On Mon, Aug 15, 2022 at 04:07:13PM -0700, Vinicius Costa Gomes wrote:
> The interrupt that is generated is a general/misc interrupt, we have to
> check on the interrupt cause bit that it's something TimeSync related,
> and only then, we have to check that it's indeed a TX Timestamp that is
> ready. And then, there's another register with some bits saying which
> one of the 4 registers for timestamps that is ready. There are a few
> levels of indirection, but no polling.

I used the word "poll" after being inspired by the code comments:

/**
 * igc_ptp_tx_work
 * @work: pointer to work struct
 *
 * This work function polls the TSYNCTXCTL valid bit to determine when a
 * timestamp has been taken for the current stored skb.
 */

> I think your question is more "why there's that workqueue on igc?"/"why
> don't you retrieve the TX timestamp 'inline' with the interrupt?", if I
> got that right, then, I don't have a good reason, apart from the feeling
> that reading all those (5-6?) registers may take too long for a
> interrupt handler. And it's something that's being done the same for
> most (all?) Intel drivers.

Ok, so basically it is an attempt of making the interrupt handler threaded,
which doesn't run in hardirq context?

Note that this decision makes the igc limitation of "single timestampable
skb in flight" even much worse than it needs to be, because it prolongs
the "in flight" period until the system_wq actually gets to run the work
item we create.

> I have a TODO to experiment with removing the workqueue, and retrieving
> the TX timestamp in the same context as the interrupt handler, but other
> things always come up.
> 
> 
> Cheers,
> -- 
> Vinicius

^ permalink raw reply

* Re: [PATCH v16 mfd 8/8] mfd: ocelot: add support for the vsc7512 chip via spi
From: Andy Shevchenko @ 2022-08-16  8:52 UTC (permalink / raw)
  To: Ido Schimmel, Uwe Kleine-König
  Cc: Colin Foster, linux-arm Mailing List, open list:GPIO SUBSYSTEM,
	netdev, Linux Kernel Mailing List, devicetree, Terry Bowman,
	Vladimir Oltean, Wolfram Sang, Microchip Linux Driver Support,
	Steen Hegelund, Lars Povlsen, Linus Walleij, Paolo Abeni,
	Jakub Kicinski, Eric Dumazet, David S. Miller, Russell King,
	Heiner Kallweit, Andrew Lunn, Krzysztof Kozlowski, Rob Herring,
	Lee Jones, katie.morris, Jonathan Cameron, Dan Williams,
	Lee Jones
In-Reply-To: <YvpZoIN+5htY9Z1o@shredder>

On Mon, Aug 15, 2022 at 5:35 PM Ido Schimmel <idosch@idosch.org> wrote:
>
> On Mon, Aug 15, 2022 at 07:19:13AM -0700, Colin Foster wrote:
> > Something is going on that I don't fully understand with <asm/byteorder.h>.
> > I don't quite see how ocelot-core is throwing all sorts of errors in x86
> > builds now:
> >
> > https://patchwork.hopto.org/static/nipa/667471/12942993/build_allmodconfig_warn/stderr
> >
> > Snippet from there:
> >
> > /home/nipa/nipa/tests/patch/build_32bit/build_32bit.sh: line 21: ccache gcc: command not found
> > ../drivers/mfd/ocelot-spi.c: note: in included file (through ../include/linux/bitops.h, ../include/linux/kernel.h, ../arch/x86/include/asm/percpu.h, ../arch/x86/include/asm/current.h, ../include/linux/sched.h, ...):
> > ../arch/x86/include/asm/bitops.h:66:1: warning: unreplaced symbol 'return'
> > ../drivers/mfd/ocelot-spi.c: note: in included file (through ../include/linux/bitops.h, ../include/linux/kernel.h, ../arch/x86/include/asm/percpu.h, ../arch/x86/include/asm/current.h, ../include/linux/sched.h, ...):
> > ../include/asm-generic/bitops/generic-non-atomic.h:29:9: warning: unreplaced symbol 'mask'
> > ../include/asm-generic/bitops/generic-non-atomic.h:30:9: warning: unreplaced symbol 'p'
> > ../include/asm-generic/bitops/generic-non-atomic.h:32:10: warning: unreplaced symbol 'p'
> > ../include/asm-generic/bitops/generic-non-atomic.h:32:16: warning: unreplaced symbol 'mask'
> > ../include/asm-generic/bitops/generic-non-atomic.h:27:1: warning: unreplaced symbol 'return'
> > ../drivers/mfd/ocelot-spi.c: note: in included file (through ../arch/x86/include/asm/bitops.h, ../include/linux/bitops.h, ../include/linux/kernel.h, ../arch/x86/include/asm/percpu.h, ../arch/x86/include/asm/current.h, ...):
> > ../include/asm-generic/bitops/instrumented-non-atomic.h:26:1: warning: unreplaced symbol 'return'
> >
> >
> > <asm/byteorder.h> was included in both drivers/mfd/ocelot-spi.c and
> > drivers/mfd/ocelot.h previously, though Andy pointed out there didn't
> > seem to be any users... and I didn't either. I'm sure there's something
> > I must be missing.
>
> I got similar errors in our internal CI yesterday. Fixed by compiling
> sparse from git:
> https://git.kernel.org/pub/scm/devel/sparse/sparse.git/commit/?id=0e1aae55e49cad7ea43848af5b58ff0f57e7af99
>
> The update is also available in the "testing" repo in case you are
> running Fedora 35 / 36:
> https://bodhi.fedoraproject.org/updates/FEDORA-2022-c58b53730f
> https://bodhi.fedoraproject.org/updates/FEDORA-2022-2bc333ccac

Debian still produces the same errors which makes sparse useless.

-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply

* Re: [PATCH net-next 00/10] net/smc: optimize the parallelism of SMC-R connections
From: Jan Karcher @ 2022-08-16  9:35 UTC (permalink / raw)
  To: D. Wythe, kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma
In-Reply-To: <cover.1660152975.git.alibuda@linux.alibaba.com>



On 10.08.2022 19:47, D. Wythe wrote:
> From: "D. Wythe" <alibuda@linux.alibaba.com>
> 
> This patch set attempts to optimize the parallelism of SMC-R connections,
> mainly to reduce unnecessary blocking on locks, and to fix exceptions that
> occur after thoses optimization.
>

Thank you again for your submission!
Let me give you a quick update from our side:
We tested your patches on top of the net-next kernel on our s390 
systems. They did crash our systems. After verifying our environment we 
pulled console logs and now we can tell that there is indeed a problem 
with your patches regarding SMC-D. So please do not integrate this 
change as of right now. I'm going to do more in depth reviews of your 
patches but i need some time for them so here is a quick a description 
of the problem:

It is a SMC-D problem, that occurs while building up the connection. In 
smc_conn_create you set struct smc_lnk_cluster *lnkc = NULL. For the 
SMC-R path you do grab the pointer, for SMC-D that never happens. Still 
you are using this refernce for SMC-D => Crash. This problem can be 
reproduced using the SMC-D path. Here is an example console output:

[  779.516382] Unable to handle kernel pointer dereference in virtual 
kernel address space
[  779.516389] Failing address: 0000000000000000 TEID: 0000000000000483
[  779.516391] Fault in home space mode while using kernel ASCE.
[  779.516395] AS:0000000069628007 R3:00000000ffbf0007 
S:00000000ffbef800 P:000000000000003d
[  779.516431] Oops: 0004 ilc:2 [#1] SMP
[  779.516436] Modules linked in: tcp_diag inet_diag ism mlx5_ib 
ib_uverbs mlx5_core smc_diag smc ib_core nft_fib_inet nft_fib_ipv4
nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 
nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv
6 nf_defrag_ipv4 ip_set nf_tables n
[  779.516470] CPU: 0 PID: 24 Comm: kworker/0:1 Not tainted 
5.19.0-13940-g22a46254655a #3
[  779.516476] Hardware name: IBM 8561 T01 701 (z/VM 7.2.0)

[  779.522738] Workqueue: smc_hs_wq smc_listen_work [smc]
[  779.522755] Krnl PSW : 0704c00180000000 000003ff803da89c 
(smc_conn_create+0x174/0x968 [smc])
[  779.522766]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 
PM:0 RI:0 EA:3
[  779.522770] Krnl GPRS: 0000000000000002 0000000000000000 
0000000000000001 0000000000000000
[  779.522773]            000000008a4128a0 000003ff803f21aa 
000000008e30d640 0000000086d72000
[  779.522776]            0000000086d72000 000000008a412803 
000000008a412800 000000008e30d650
[  779.522779]            0000000080934200 0000000000000000 
000003ff803cb954 00000380002dfa88
[  779.522789] Krnl Code: 000003ff803da88e: e310f0e80024        stg 
%r1,232(%r15)
[  779.522789]            000003ff803da894: a7180000            lhi 
%r1,0
[  779.522789]           #000003ff803da898: 582003ac            l 
%r2,940
[  779.522789]           >000003ff803da89c: ba123020            cs 
%r1,%r2,32(%r3)
[  779.522789]            000003ff803da8a0: ec1603be007e        cij 
%r1,0,6,000003ff803db01c

[  779.522789]            000003ff803da8a6: 4110b002            la 
%r1,2(%r11)
[  779.522789]            000003ff803da8aa: e310f0f00024        stg 
%r1,240(%r15)
[  779.522789]            000003ff803da8b0: e310f0c00004        lg 
%r1,192(%r15)
[  779.522870] Call Trace:
[  779.522873]  [<000003ff803da89c>] smc_conn_create+0x174/0x968 [smc]
[  779.522884]  [<000003ff803cb954>] 
smc_find_ism_v2_device_serv+0x1b4/0x300 [smc]
01: HCPGSP2629I The virtual machine is placed in CP mode due to a SIGP 
stop from CPU 01.
01: HCPGSP2629I The virtual machine is placed in CP mode due to a SIGP 
stop from CPU 00.
[  779.522894]  [<000003ff803cbace>] smc_listen_find_device+0x2e/0x370 [smc]


I'm going to send the review for the first patch right away (which is 
the one causing the crash), so far I'm done with it. The others are 
going to follow. Maybe you can look over the problem and come up with a 
solution, otherwise we are going to decide if we want to look into it as 
soon as I'm done with the reviews. Thank you for your patience.

^ permalink raw reply

* [PATCH] net: Fix suspicious RCU usage in bpf_sk_reuseport_detach()
From: David Howells @ 2022-08-16  9:34 UTC (permalink / raw)
  To: yin31149; +Cc: Jakub Kicinski, netdev, dhowells, linux-kernel

bpf_sk_reuseport_detach() calls __rcu_dereference_sk_user_data_with_flags()
to obtain the value of sk->sk_user_data, but that function is only usable
if the RCU read lock is held, and neither that function nor any of its
callers hold it.

Fix this by adding a new helper, __locked_read_sk_user_data_with_flags()
that checks to see if sk->sk_callback_lock() is held and use that here
instead.

Alternatively, making __rcu_dereference_sk_user_data_with_flags() use
rcu_dereference_checked() might suffice.

Without this, the following warning can be occasionally observed:

=============================
WARNING: suspicious RCU usage
6.0.0-rc1-build2+ #563 Not tainted
-----------------------------
include/net/sock.h:592 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
5 locks held by locktest/29873:
 #0: ffff88812734b550 (&sb->s_type->i_mutex_key#9){+.+.}-{3:3}, at: __sock_release+0x77/0x121
 #1: ffff88812f5621b0 (sk_lock-AF_INET){+.+.}-{0:0}, at: tcp_close+0x1c/0x70
 #2: ffff88810312f5c8 (&h->lhash2[i].lock){+.+.}-{2:2}, at: inet_unhash+0x76/0x1c0
 #3: ffffffff83768bb8 (reuseport_lock){+...}-{2:2}, at: reuseport_detach_sock+0x18/0xdd
 #4: ffff88812f562438 (clock-AF_INET){++..}-{2:2}, at: bpf_sk_reuseport_detach+0x24/0xa4

stack backtrace:
CPU: 1 PID: 29873 Comm: locktest Not tainted 6.0.0-rc1-build2+ #563
Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x4c/0x5f
 bpf_sk_reuseport_detach+0x6d/0xa4
 reuseport_detach_sock+0x75/0xdd
 inet_unhash+0xa5/0x1c0
 tcp_set_state+0x169/0x20f
 ? lockdep_sock_is_held+0x3a/0x3a
 ? __lock_release.isra.0+0x13e/0x220
 ? reacquire_held_locks+0x1bb/0x1bb
 ? hlock_class+0x31/0x96
 ? mark_lock+0x9e/0x1af
 __tcp_close+0x50/0x4b6
 tcp_close+0x28/0x70
 inet_release+0x8e/0xa7
 __sock_release+0x95/0x121
 sock_close+0x14/0x17
 __fput+0x20f/0x36a
 task_work_run+0xa3/0xcc
 exit_to_user_mode_prepare+0x9c/0x14d
 syscall_exit_to_user_mode+0x18/0x44
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fixes: cf8c1e967224 ("net: refactor bpf_sk_reuseport_detach()")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Hawkins Jiawei <yin31149@gmail.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: netdev@vger.kernel.org
---

 include/net/sock.h           |   25 +++++++++++++++++++++++++
 kernel/bpf/reuseport_array.c |    2 +-
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 05a1bbdf5805..d08cfe190a78 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -577,6 +577,31 @@ static inline bool sk_user_data_is_nocopy(const struct sock *sk)
 
 #define __sk_user_data(sk) ((*((void __rcu **)&(sk)->sk_user_data)))
 
+/**
+ * __locked_read_sk_user_data_with_flags - return the pointer
+ * only if argument flags all has been set in sk_user_data. Otherwise
+ * return NULL
+ *
+ * @sk: socket
+ * @flags: flag bits
+ *
+ * The caller must be holding sk->sk_callback_lock.
+ */
+static inline void *
+__locked_read_sk_user_data_with_flags(const struct sock *sk,
+				      uintptr_t flags)
+{
+	uintptr_t sk_user_data =
+		(uintptr_t)rcu_dereference_check(__sk_user_data(sk),
+						 lockdep_is_held(&sk->sk_callback_lock));
+
+	WARN_ON_ONCE(flags & SK_USER_DATA_PTRMASK);
+
+	if ((sk_user_data & flags) == flags)
+		return (void *)(sk_user_data & SK_USER_DATA_PTRMASK);
+	return NULL;
+}
+
 /**
  * __rcu_dereference_sk_user_data_with_flags - return the pointer
  * only if argument flags all has been set in sk_user_data. Otherwise
diff --git a/kernel/bpf/reuseport_array.c b/kernel/bpf/reuseport_array.c
index 85fa9dbfa8bf..82c61612f382 100644
--- a/kernel/bpf/reuseport_array.c
+++ b/kernel/bpf/reuseport_array.c
@@ -24,7 +24,7 @@ void bpf_sk_reuseport_detach(struct sock *sk)
 	struct sock __rcu **socks;
 
 	write_lock_bh(&sk->sk_callback_lock);
-	socks = __rcu_dereference_sk_user_data_with_flags(sk, SK_USER_DATA_BPF);
+	socks = __locked_read_sk_user_data_with_flags(sk, SK_USER_DATA_BPF);
 	if (socks) {
 		WRITE_ONCE(sk->sk_user_data, NULL);
 		/*



^ permalink raw reply related

* Re: [PATCH ipsec-next] xfrm: update x->lastused for every packet
From: Steffen Klassert @ 2022-08-16  9:27 UTC (permalink / raw)
  To: Antony Antony; +Cc: Herbert Xu, netdev, Tobias Brunner
In-Reply-To: <a24754505073ab8f832aa34cd38d3ee68d36bc5e.1659603877.git.antony.antony@secunet.com>

On Thu, Aug 04, 2022 at 11:17:31AM +0200, Antony Antony wrote:
> x->lastused was only updated for outgoing mobile IPv6 packet.
> With this fix update it for every, in and out, packet.
> 
> This is useful to check if the a SA is still in use, or when was
> the last time an SA was used.  lastused time of in SA can used
> to check IPsec path is functional.
> 
> Signed-off-by: Antony Antony <antony.antony@secunet.com>

Your patch does not apply to current ipsec-next, please
rebase.

Thanks!

^ permalink raw reply

* Re: [PATCH V5 4/6] vDPA: !FEATURES_OK should not block querying device config space
From: Michael S. Tsirkin @ 2022-08-16  8:41 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Si-Wei Liu, virtualization, netdev, kvm, parav, xieyongji,
	gautam.dawar, jasowang
In-Reply-To: <f2864c96-cddd-129e-7dd8-a3743fe7e0d0@intel.com>

On Tue, Aug 16, 2022 at 04:29:04PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 8/16/2022 3:41 PM, Si-Wei Liu wrote:
> 
>     Hi Michael,
> 
>     I just noticed this patch got pulled to linux-next prematurely without
>     getting consensus on code review, am not sure why. Hope it was just an
>     oversight.
> 
>     Unfortunately this introduced functionality regression to at least two
>     cases so far as I see:
> 
>     1. (bogus) VDPA_ATTR_DEV_NEGOTIATED_FEATURES are inadvertently exposed and
>     displayed in "vdpa dev config show" before feature negotiation is done.
>     Noted the corresponding features name shown in vdpa tool is called
>     "negotiated_features" rather than "driver_features". I see in no way the
>     intended change of the patch should break this user level expectation
>     regardless of any spec requirement. Do you agree on this point?
> 
> I will post a patch for iptour2, doing:
> 1) if iprout2 does not get driver_features from the kernel, then don't show
> negotiated features in the command output
> 2) process and decoding the device features.
> 
> 
>     2. There was also another implicit assumption that is broken by this patch.
>     There could be a vdpa tool query of config via vdpa_dev_net_config_fill()->
>     vdpa_get_config_unlocked() that races with the first vdpa_set_features()
>     call from VMM e.g. QEMU. Since the S_FEATURES_OK blocking condition is
>     removed, if the vdpa tool query occurs earlier than the first
>     set_driver_features() call from VMM, the following code will treat the
>     guest as legacy and then trigger an erroneous vdpa_set_features_unlocked
>     (... , 0) call to the vdpa driver:
> 
>      374         /*
>      375          * Config accesses aren't supposed to trigger before features
>     are set.
>      376          * If it does happen we assume a legacy guest.
>      377          */
>      378         if (!vdev->features_valid)
>      379                 vdpa_set_features_unlocked(vdev, 0);
>      380         ops->get_config(vdev, offset, buf, len);
> 
>     Depending on vendor driver's implementation, L380 may either return invalid
>     config data (or invalid endianness if on BE) or only config fields that are
>     valid in legacy layout. What's more severe is that, vdpa tool query in
>     theory shouldn't affect feature negotiation at all by making confusing
>     calls to the device, but now it is possible with the patch. Fixing this
>     would require more delicate work on the other paths involving the cf_lock
>     reader/write semaphore.
> 
>     Not sure what you plan to do next, post the fixes for both issues and get
>     the community review? Or simply revert the patch in question? Let us know.
> 
> The spec says:
> The device MUST allow reading of any device-specific configuration field before
> FEATURES_OK is set by
> the driver. This includes fields which are conditional on feature bits, as long
> as those feature bits are offered
> by the device.
> 
> so whether FEATURES_OK should not block reading the device config space. 
> vdpa_get_config_unlocked() will read the features, I don't know why it has a
> comment:
>         /*
>          * Config accesses aren't supposed to trigger before features are set.
>          * If it does happen we assume a legacy guest.
>          */
> 
> This conflicts with the spec.

Yea well. On the other hand the spec also calls for features to be
used to detect legacy versus modern driver.
This part of the spec needs work generally.


> vdpa_get_config_unlocked() checks vdev->features_valid, if not valid, it will
> set the drivers_features 0, I think this intends to prevent reading random
> driver_features. This function does not hold any locks, and didn't change
> anything.
> 
> So what is the race?
> 
> Thanks
> 
> 
> 
>     Thanks,
>     -Siwei
> 
> 
>     On 8/12/2022 3:44 AM, Zhu Lingshan wrote:
> 
>         Users may want to query the config space of a vDPA device,
>         to choose a appropriate one for a certain guest. This means the
>         users need to read the config space before FEATURES_OK, and
>         the existence of config space contents does not depend on
>         FEATURES_OK.
> 
>         The spec says:
>         The device MUST allow reading of any device-specific configuration
>         field before FEATURES_OK is set by the driver. This includes
>         fields which are conditional on feature bits, as long as those
>         feature bits are offered by the device.
> 
>         Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>         ---
>           drivers/vdpa/vdpa.c | 8 --------
>           1 file changed, 8 deletions(-)
> 
>         diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
>         index 6eb3d972d802..bf312d9c59ab 100644
>         --- a/drivers/vdpa/vdpa.c
>         +++ b/drivers/vdpa/vdpa.c
>         @@ -855,17 +855,9 @@ vdpa_dev_config_fill(struct vdpa_device *vdev,
>         struct sk_buff *msg, u32 portid,
>           {
>               u32 device_id;
>               void *hdr;
>         -    u8 status;
>               int err;
>                 down_read(&vdev->cf_lock);
>         -    status = vdev->config->get_status(vdev);
>         -    if (!(status & VIRTIO_CONFIG_S_FEATURES_OK)) {
>         -        NL_SET_ERR_MSG_MOD(extack, "Features negotiation not
>         completed");
>         -        err = -EAGAIN;
>         -        goto out;
>         -    }
>         -
>               hdr = genlmsg_put(msg, portid, seq, &vdpa_nl_family, flags,
>                         VDPA_CMD_DEV_CONFIG_GET);
>               if (!hdr) {
> 
> 
> 
> 


^ permalink raw reply

* Re: [PATCH ipsec 1/2] xfrm: Check policy for nested XFRM packets in xfrm_input
From: Steffen Klassert @ 2022-08-16  8:53 UTC (permalink / raw)
  To: Benedict Wong; +Cc: netdev, nharold, lorenzo
In-Reply-To: <CANrj0baLB5a5QpdmmcNYZLyxe1r0gySLhT3krXVFXKOzBb8aww@mail.gmail.com>

On Mon, Aug 15, 2022 at 02:25:42PM -0700, Benedict Wong wrote:
> >
> > Hm, shouldn't the xfrm_policy_check called along the
> > packet path for each round after decapsulation?
> >
> > Do you use ESP transformation offload (INET_ESP_OFFLOAD/
> > INET6_ESP_OFFLOAD)?
> 
> Been a while since I've gotten a chance to look through the
> code, but when I previously looked through the stack, it looked
> like we have policy checks in the following places:
> - IPv4/IPv6 deliver to host
> - UDP/TCP/ICMP/L2TP/SCTP/VTI/raw in direct rcv methods
> 
> Additionally, we have a conditional check in XFRM-I, but
> *only if the packet is crossing network namespaces* (which
> in the Android case, it isn't)

Yes, this is because the secpath is cleared when crossing
network namespaces. The inbound policy check in the packet
path would fail in this case. That's why we do the policy
check there.

> Notably, it appears that the missing case is when the outer
> tunnel is an unencap'd ESP packet, which simply calls xfrm_input
> via xfrm(4|6)_rcv_spi. This changes adds that call to ensure
> that the verification is always performed in each packet path.

Please note that all policy checks are done for the traffic
selector of the inner packets. The inbound policy check makes
sure that the inner packets are allowed to pass and really
came through the SA that is recorded in the secpath.

When receiving an ESP packet, the packets IPsec ID (daddr/
SPI/proto) is mached against the SADB. If a matching SA is
there, it is used to decapsulate. The TS of the decapsulated
packet is used to do the policy lookup then.

If the decapsulated packet is not dropped by the policy lookup
and is again an ESP packet, we start with the SADB lookup as
described above.

So I think the behaviour is correct as it is implemented.


^ permalink raw reply

* [ RFC  net-next 2/3] net: flow_offload: add action stats api
From: Oz Shlomo @ 2022-08-16  9:23 UTC (permalink / raw)
  To: netdev
  Cc: Jiri Pirko, Jamal Hadi Salim, Simon Horman, Baowen Zheng,
	Vlad Buslov, Ido Schimmel, Roi Dayan, Oz Shlomo
In-Reply-To: <20220816092338.12613-1-ozsh@nvidia.com>

The current offload api provides visibility to flow hw stats.
This works as long as the flow stats values apply to all the flow's
actions. However, this assumption breaks when an action, such as police,
decides to drop or jump over other actions.

Extend the flow_offload api to return stat record per action instance.
Use the per action stats value, if available, when updating the action
instance counters.

Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
---
 include/net/flow_offload.h |  6 ++++++
 include/net/pkt_cls.h      | 26 ++++++++++++++++----------
 net/sched/cls_flower.c     |  4 +++-
 net/sched/cls_matchall.c   |  2 +-
 4 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
index 2a9a9e42e7fd..5e1a34a76772 100644
--- a/include/net/flow_offload.h
+++ b/include/net/flow_offload.h
@@ -436,6 +436,11 @@ struct flow_stats {
 	bool used_hw_stats_valid;
 };
 
+struct flow_act_stats {
+	unsigned int		num_actions;
+	struct flow_stats	stats[];
+};
+
 static inline void flow_stats_update(struct flow_stats *flow_stats,
 				     u64 bytes, u64 pkts,
 				     u64 drops, u64 lastused,
@@ -583,6 +588,7 @@ struct flow_cls_offload {
 	struct flow_rule *rule;
 	struct flow_stats stats;
 	u32 classid;
+	struct flow_act_stats *act_stats;
 };
 
 enum offload_act_command  {
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 27eac9e73c61..f5e5582aef17 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -269,24 +269,30 @@ static inline void tcf_exts_put_net(struct tcf_exts *exts)
 
 static inline void
 tcf_exts_hw_stats_update(const struct tcf_exts *exts,
-			 struct flow_stats *stats)
+			 struct flow_stats *flow_stats,
+			 struct flow_act_stats *act_stats)
 {
 #ifdef CONFIG_NET_CLS_ACT
 	int i;
 
 	for (i = 0; i < exts->nr_actions; i++) {
 		struct tc_action *a = exts->actions[i];
+		struct flow_stats *stats = flow_stats;
 
 		/* if stats from hw, just skip */
-		if (tcf_action_update_hw_stats(a)) {
-			preempt_disable();
-			tcf_action_stats_update(a, stats->bytes, stats->pkts, stats->drops,
-						stats->lastused, true);
-			preempt_enable();
-
-			a->used_hw_stats = stats->used_hw_stats;
-			a->used_hw_stats_valid = stats->used_hw_stats_valid;
-		}
+		if (!tcf_action_update_hw_stats(a))
+			continue;
+
+		if (act_stats)
+			stats = &act_stats->stats[i];
+
+		preempt_disable();
+		tcf_action_stats_update(a, stats->bytes, stats->pkts, stats->drops,
+					stats->lastused, true);
+		preempt_enable();
+
+		a->used_hw_stats = stats->used_hw_stats;
+		a->used_hw_stats_valid = stats->used_hw_stats_valid;
 	}
 #endif
 }
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 7da3337c4356..7dc8a62796b5 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -499,7 +499,9 @@ static void fl_hw_update_stats(struct tcf_proto *tp, struct cls_fl_filter *f,
 	tc_setup_cb_call(block, TC_SETUP_CLSFLOWER, &cls_flower, false,
 			 rtnl_held);
 
-	tcf_exts_hw_stats_update(&f->exts, &cls_flower.stats);
+	tcf_exts_hw_stats_update(&f->exts, &cls_flower.stats, cls_flower.act_stats);
+
+	kfree(cls_flower.act_stats);
 }
 
 static void __fl_put(struct cls_fl_filter *f)
diff --git a/net/sched/cls_matchall.c b/net/sched/cls_matchall.c
index b5520a9c35e6..0ba4392b93de 100644
--- a/net/sched/cls_matchall.c
+++ b/net/sched/cls_matchall.c
@@ -332,7 +332,7 @@ static void mall_stats_hw_filter(struct tcf_proto *tp,
 
 	tc_setup_cb_call(block, TC_SETUP_CLSMATCHALL, &cls_mall, false, true);
 
-	tcf_exts_hw_stats_update(&head->exts, &cls_mall.stats);
+	tcf_exts_hw_stats_update(&head->exts, &cls_mall.stats, NULL);
 }
 
 static int mall_dump(struct net *net, struct tcf_proto *tp, void *fh,
-- 
1.8.3.1


^ permalink raw reply related

* Re: [PATCH V5 4/6] vDPA: !FEATURES_OK should not block querying device config space
From: Michael S. Tsirkin @ 2022-08-16  8:23 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: Zhu Lingshan, virtualization, netdev, kvm, parav, xieyongji,
	gautam.dawar, jasowang
In-Reply-To: <e99e6d81-d7d5-e1ff-08e0-c22581c1329a@oracle.com>

On Tue, Aug 16, 2022 at 12:41:21AM -0700, Si-Wei Liu wrote:
> Hi Michael,
> 
> I just noticed this patch got pulled to linux-next prematurely without
> getting consensus on code review, am not sure why. Hope it was just an
> oversight.
> 
> Unfortunately this introduced functionality regression to at least two cases
> so far as I see:
> 
> 1. (bogus) VDPA_ATTR_DEV_NEGOTIATED_FEATURES are inadvertently exposed and
> displayed in "vdpa dev config show" before feature negotiation is done.
> Noted the corresponding features name shown in vdpa tool is called
> "negotiated_features" rather than "driver_features". I see in no way the
> intended change of the patch should break this user level expectation
> regardless of any spec requirement. Do you agree on this point?
> 
> 2. There was also another implicit assumption that is broken by this patch.
> There could be a vdpa tool query of config via
> vdpa_dev_net_config_fill()->vdpa_get_config_unlocked() that races with the
> first vdpa_set_features() call from VMM e.g. QEMU. Since the S_FEATURES_OK
> blocking condition is removed, if the vdpa tool query occurs earlier than
> the first set_driver_features() call from VMM, the following code will treat
> the guest as legacy and then trigger an erroneous
> vdpa_set_features_unlocked(... , 0) call to the vdpa driver:
> 
>  374         /*
>  375          * Config accesses aren't supposed to trigger before features
> are set.
>  376          * If it does happen we assume a legacy guest.
>  377          */
>  378         if (!vdev->features_valid)
>  379                 vdpa_set_features_unlocked(vdev, 0);
>  380         ops->get_config(vdev, offset, buf, len);
> Depending on vendor driver's implementation, L380 may either return invalid
> config data (or invalid endianness if on BE) or only config fields that are
> valid in legacy layout. What's more severe is that, vdpa tool query in
> theory shouldn't affect feature negotiation at all by making confusing calls
> to the device, but now it is possible with the patch. Fixing this would
> require more delicate work on the other paths involving the cf_lock
> reader/write semaphore.
> 
> Not sure what you plan to do next, post the fixes for both issues and get
> the community review? Or simply revert the patch in question? Let us know.
> 
> Thanks,
> -Siwei

If we can get fixes that's good. If not I can apply a revert.
I'm on vacation next week, you guys will have the time
to figure out the best plan of action.

-- 
MST


^ permalink raw reply

* [ RFC  net-next 1/3] net: sched: Pass flow_stats instead of multiple stats args
From: Oz Shlomo @ 2022-08-16  9:23 UTC (permalink / raw)
  To: netdev
  Cc: Jiri Pirko, Jamal Hadi Salim, Simon Horman, Baowen Zheng,
	Vlad Buslov, Ido Schimmel, Roi Dayan, Oz Shlomo
In-Reply-To: <20220816092338.12613-1-ozsh@nvidia.com>

From: Roi Dayan <roid@nvidia.com>

Instead of passing 6 stats related args, pass the flow_stats.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
---
 include/net/pkt_cls.h    | 11 +++++------
 net/sched/cls_flower.c   |  7 +------
 net/sched/cls_matchall.c |  6 +-----
 3 files changed, 7 insertions(+), 17 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index d9d90e6925e1..27eac9e73c61 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -269,8 +269,7 @@ static inline void tcf_exts_put_net(struct tcf_exts *exts)
 
 static inline void
 tcf_exts_hw_stats_update(const struct tcf_exts *exts,
-			 u64 bytes, u64 packets, u64 drops, u64 lastuse,
-			 u8 used_hw_stats, bool used_hw_stats_valid)
+			 struct flow_stats *stats)
 {
 #ifdef CONFIG_NET_CLS_ACT
 	int i;
@@ -281,12 +280,12 @@ static inline void tcf_exts_put_net(struct tcf_exts *exts)
 		/* if stats from hw, just skip */
 		if (tcf_action_update_hw_stats(a)) {
 			preempt_disable();
-			tcf_action_stats_update(a, bytes, packets, drops,
-						lastuse, true);
+			tcf_action_stats_update(a, stats->bytes, stats->pkts, stats->drops,
+						stats->lastused, true);
 			preempt_enable();
 
-			a->used_hw_stats = used_hw_stats;
-			a->used_hw_stats_valid = used_hw_stats_valid;
+			a->used_hw_stats = stats->used_hw_stats;
+			a->used_hw_stats_valid = stats->used_hw_stats_valid;
 		}
 	}
 #endif
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 041d63ff809a..7da3337c4356 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -499,12 +499,7 @@ static void fl_hw_update_stats(struct tcf_proto *tp, struct cls_fl_filter *f,
 	tc_setup_cb_call(block, TC_SETUP_CLSFLOWER, &cls_flower, false,
 			 rtnl_held);
 
-	tcf_exts_hw_stats_update(&f->exts, cls_flower.stats.bytes,
-				 cls_flower.stats.pkts,
-				 cls_flower.stats.drops,
-				 cls_flower.stats.lastused,
-				 cls_flower.stats.used_hw_stats,
-				 cls_flower.stats.used_hw_stats_valid);
+	tcf_exts_hw_stats_update(&f->exts, &cls_flower.stats);
 }
 
 static void __fl_put(struct cls_fl_filter *f)
diff --git a/net/sched/cls_matchall.c b/net/sched/cls_matchall.c
index 06cf22adbab7..b5520a9c35e6 100644
--- a/net/sched/cls_matchall.c
+++ b/net/sched/cls_matchall.c
@@ -332,11 +332,7 @@ static void mall_stats_hw_filter(struct tcf_proto *tp,
 
 	tc_setup_cb_call(block, TC_SETUP_CLSMATCHALL, &cls_mall, false, true);
 
-	tcf_exts_hw_stats_update(&head->exts, cls_mall.stats.bytes,
-				 cls_mall.stats.pkts, cls_mall.stats.drops,
-				 cls_mall.stats.lastused,
-				 cls_mall.stats.used_hw_stats,
-				 cls_mall.stats.used_hw_stats_valid);
+	tcf_exts_hw_stats_update(&head->exts, &cls_mall.stats);
 }
 
 static int mall_dump(struct net *net, struct tcf_proto *tp, void *fh,
-- 
1.8.3.1


^ permalink raw reply related

* [PATCH xfrm-next v2 4/6] xfrm: add TX datapath support for IPsec full offload mode
From: Leon Romanovsky @ 2022-08-16  8:59 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Leon Romanovsky, David S . Miller, Herbert Xu, netdev, Raed Salem,
	ipsec-devel
In-Reply-To: <cover.1660639789.git.leonro@nvidia.com>

From: Leon Romanovsky <leonro@nvidia.com>

In IPsec full mode, the device is going to encrypt and encapsulate
packets that are associated with offloaded policy. After successful
policy lookup to indicate if packets should be offloaded or not,
the stack forwards packets to the device to do the magic.

Signed-off-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Huy Nguyen <huyn@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 net/xfrm/xfrm_device.c | 13 +++++++++++++
 net/xfrm/xfrm_output.c | 20 ++++++++++++++++++++
 2 files changed, 33 insertions(+)

diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 1cc482e9c87d..db5ebd36f68c 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -120,6 +120,16 @@ struct sk_buff *validate_xmit_xfrm(struct sk_buff *skb, netdev_features_t featur
 	if (xo->flags & XFRM_GRO || x->xso.dir == XFRM_DEV_OFFLOAD_IN)
 		return skb;
 
+	/* The packet was sent to HW IPsec full offload engine,
+	 * but to wrong device. Drop the packet, so it won't skip
+	 * XFRM stack.
+	 */
+	if (x->xso.type == XFRM_DEV_OFFLOAD_FULL && x->xso.dev != dev) {
+		kfree_skb(skb);
+		dev_core_stats_tx_dropped_inc(dev);
+		return NULL;
+	}
+
 	/* This skb was already validated on the upper/virtual dev */
 	if ((x->xso.dev != dev) && (x->xso.real_dev == dev))
 		return skb;
@@ -366,6 +376,9 @@ bool xfrm_dev_offload_ok(struct sk_buff *skb, struct xfrm_state *x)
 	struct xfrm_dst *xdst = (struct xfrm_dst *)dst;
 	struct net_device *dev = x->xso.dev;
 
+	if (x->xso.type == XFRM_DEV_OFFLOAD_FULL)
+		goto ok;
+
 	if (!x->type_offload || x->encap)
 		return false;
 
diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index 555ab35cd119..27a8dac9ca7d 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -719,6 +719,26 @@ int xfrm_output(struct sock *sk, struct sk_buff *skb)
 		break;
 	}
 
+	if (x->xso.type == XFRM_DEV_OFFLOAD_FULL) {
+		struct dst_entry *dst = skb_dst_pop(skb);
+
+		if (!dst || !xfrm_dev_offload_ok(skb, x)) {
+			XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTERROR);
+			kfree_skb(skb);
+			return -EHOSTUNREACH;
+		}
+
+		skb_dst_set(skb, dst);
+		err = skb_dst(skb)->ops->local_out(net, skb->sk, skb);
+		if (unlikely(err != 1))
+			return err;
+
+		if (!skb_dst(skb)->xfrm)
+			return dst_output(net, skb->sk, skb);
+
+		return 0;
+	}
+
 	secpath_reset(skb);
 
 	if (xfrm_dev_offload_ok(skb, x)) {
-- 
2.37.2


^ permalink raw reply related

* [PATCH xfrm-next v2 6/6] xfrm: enforce separation between priorities of HW/SW policies
From: Leon Romanovsky @ 2022-08-16  8:59 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Leon Romanovsky, David S . Miller, Herbert Xu, netdev, Raed Salem,
	ipsec-devel
In-Reply-To: <cover.1660639789.git.leonro@nvidia.com>

From: Leon Romanovsky <leonro@nvidia.com>

Devices that implement IPsec full offload mode offload policies too.
In RX path, it causes to the situation that HW can't effectively handle
mixed SW and HW priorities unless users make sure that HW offloaded
policies have higher priorities.

In order to make sure that users have coherent picture, let's require
that HW offloaded policies have always (both RX and TX) higher priorities
than SW ones.

To do not over engineer the code, HW policies are treated as SW ones and
don't take into account netdev to allow reuse of same priorities for
different devices.

Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/net/netns/xfrm.h |   8 ++-
 net/xfrm/xfrm_policy.c   | 113 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 120 insertions(+), 1 deletion(-)

diff --git a/include/net/netns/xfrm.h b/include/net/netns/xfrm.h
index bd7c3be4af5d..f0cfa0faf611 100644
--- a/include/net/netns/xfrm.h
+++ b/include/net/netns/xfrm.h
@@ -29,6 +29,11 @@ struct xfrm_policy_hthresh {
 	u8			rbits6;
 };
 
+struct xfrm_policy_prio {
+	u32 max_sw_prio;
+	u32 min_hw_prio;
+};
+
 struct netns_xfrm {
 	struct list_head	state_all;
 	/*
@@ -52,7 +57,7 @@ struct netns_xfrm {
 	unsigned int		policy_idx_hmask;
 	struct hlist_head	policy_inexact[XFRM_POLICY_MAX];
 	struct xfrm_policy_hash	policy_bydst[XFRM_POLICY_MAX];
-	unsigned int		policy_count[XFRM_POLICY_MAX * 2];
+	unsigned int		policy_count[XFRM_POLICY_MAX * 3];
 	struct work_struct	policy_hash_work;
 	struct xfrm_policy_hthresh policy_hthresh;
 	struct list_head	inexact_bins;
@@ -67,6 +72,7 @@ struct netns_xfrm {
 	u32			sysctl_acq_expires;
 
 	u8			policy_default[XFRM_POLICY_MAX];
+	struct xfrm_policy_prio	policy_prio[XFRM_POLICY_MAX];
 
 #ifdef CONFIG_SYSCTL
 	struct ctl_table_header	*sysctl_hdr;
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 3049fdcf8411..a8a7d5fe4cd8 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1570,13 +1570,70 @@ static struct xfrm_policy *xfrm_policy_insert_list(struct hlist_head *chain,
 	return delpol;
 }
 
+static int __xfrm_policy_check_hw_priority(struct net *net,
+					   struct xfrm_policy *policy, int dir)
+{
+	int left, right;
+
+	lockdep_assert_held(&net->xfrm.xfrm_policy_lock);
+
+	if (!net->xfrm.policy_count[dir])
+		/* Adding first policy */
+		return 0;
+
+	if (policy->xdo.type != XFRM_DEV_OFFLOAD_FULL) {
+		/* SW priority */
+		if (!net->xfrm.policy_count[2 * XFRM_POLICY_MAX + dir])
+			/* Special case to allow reuse maximum priority
+			 * (U32_MAX) for SW policies, when no HW policy exist.
+			 */
+			return 0;
+
+		left = policy->priority;
+		right = net->xfrm.policy_prio[dir].min_hw_prio;
+	} else {
+		/* HW priority */
+		left = net->xfrm.policy_prio[dir].max_sw_prio;
+		right = policy->priority;
+	}
+	if (left >= right)
+		return -EINVAL;
+
+	return 0;
+}
+
+static void __xfrm_policy_update_hw_priority(struct net *net,
+					     struct xfrm_policy *policy,
+					     int dir)
+{
+	u32 *hw_prio, *sw_prio;
+
+	lockdep_assert_held(&net->xfrm.xfrm_policy_lock);
+
+	if (policy->xdo.type != XFRM_DEV_OFFLOAD_FULL) {
+		sw_prio = &net->xfrm.policy_prio[dir].max_sw_prio;
+		*sw_prio = max(*sw_prio, policy->priority);
+		return;
+	}
+
+	hw_prio = &net->xfrm.policy_prio[dir].min_hw_prio;
+	*hw_prio = min(*hw_prio, policy->priority);
+}
+
 int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl)
 {
 	struct net *net = xp_net(policy);
 	struct xfrm_policy *delpol;
 	struct hlist_head *chain;
+	int ret;
 
 	spin_lock_bh(&net->xfrm.xfrm_policy_lock);
+	ret = __xfrm_policy_check_hw_priority(net, policy, dir);
+	if (ret) {
+		spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
+		return ret;
+	}
+
 	chain = policy_hash_bysel(net, &policy->selector, policy->family, dir);
 	if (chain)
 		delpol = xfrm_policy_insert_list(chain, policy, excl);
@@ -1606,6 +1663,7 @@ int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl)
 	policy->curlft.use_time = 0;
 	if (!mod_timer(&policy->timer, jiffies + HZ))
 		xfrm_pol_hold(policy);
+	__xfrm_policy_update_hw_priority(net, policy, dir);
 	spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
 
 	if (delpol)
@@ -2270,6 +2328,8 @@ static void __xfrm_policy_link(struct xfrm_policy *pol, int dir)
 
 	list_add(&pol->walk.all, &net->xfrm.policy_all);
 	net->xfrm.policy_count[dir]++;
+	if (pol->xdo.type == XFRM_DEV_OFFLOAD_FULL)
+		net->xfrm.policy_count[2 * XFRM_POLICY_MAX + dir]++;
 	xfrm_pol_hold(pol);
 }
 
@@ -2289,6 +2349,8 @@ static struct xfrm_policy *__xfrm_policy_unlink(struct xfrm_policy *pol,
 	}
 
 	list_del_init(&pol->walk.all);
+	if (pol->xdo.type == XFRM_DEV_OFFLOAD_FULL)
+		net->xfrm.policy_count[2 * XFRM_POLICY_MAX + dir]--;
 	net->xfrm.policy_count[dir]--;
 
 	return pol;
@@ -2304,12 +2366,58 @@ static void xfrm_sk_policy_unlink(struct xfrm_policy *pol, int dir)
 	__xfrm_policy_unlink(pol, XFRM_POLICY_MAX + dir);
 }
 
+static void __xfrm_policy_delete_prio(struct net *net,
+				      struct xfrm_policy *policy, int dir)
+{
+	struct xfrm_policy *pol;
+	u32 sw_prio = 0;
+
+	lockdep_assert_held(&net->xfrm.xfrm_policy_lock);
+
+	if (!net->xfrm.policy_count[dir]) {
+		net->xfrm.policy_prio[dir].max_sw_prio = sw_prio;
+		net->xfrm.policy_prio[dir].min_hw_prio = U32_MAX;
+		return;
+	}
+
+	if (policy->xdo.type == XFRM_DEV_OFFLOAD_FULL &&
+	    !net->xfrm.policy_count[2 * XFRM_POLICY_MAX + dir]) {
+		net->xfrm.policy_prio[dir].min_hw_prio = U32_MAX;
+		return;
+	}
+
+	list_for_each_entry(pol, &net->xfrm.policy_all, walk.all) {
+		if (pol->walk.dead)
+			continue;
+
+		if (policy->xdo.type != XFRM_DEV_OFFLOAD_FULL) {
+			/* SW priority */
+			if (pol->xdo.type == XFRM_DEV_OFFLOAD_FULL) {
+				net->xfrm.policy_prio[dir].max_sw_prio = sw_prio;
+				return;
+			}
+			sw_prio = pol->priority;
+			continue;
+		}
+		/* HW priority */
+		if (pol->xdo.type != XFRM_DEV_OFFLOAD_FULL)
+			continue;
+
+		net->xfrm.policy_prio[dir].min_hw_prio = pol->priority;
+		return;
+	}
+
+	net->xfrm.policy_prio[dir].max_sw_prio = sw_prio;
+}
+
 int xfrm_policy_delete(struct xfrm_policy *pol, int dir)
 {
 	struct net *net = xp_net(pol);
 
 	spin_lock_bh(&net->xfrm.xfrm_policy_lock);
 	pol = __xfrm_policy_unlink(pol, dir);
+	if (pol)
+		__xfrm_policy_delete_prio(net, pol, dir);
 	spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
 	if (pol) {
 		xfrm_dev_policy_delete(pol);
@@ -4110,6 +4218,7 @@ static int __net_init xfrm_policy_init(struct net *net)
 
 		net->xfrm.policy_count[dir] = 0;
 		net->xfrm.policy_count[XFRM_POLICY_MAX + dir] = 0;
+		net->xfrm.policy_count[2 * XFRM_POLICY_MAX + dir] = 0;
 		INIT_HLIST_HEAD(&net->xfrm.policy_inexact[dir]);
 
 		htab = &net->xfrm.policy_bydst[dir];
@@ -4195,6 +4304,10 @@ static int __net_init xfrm_net_init(struct net *net)
 	net->xfrm.policy_default[XFRM_POLICY_FWD] = XFRM_USERPOLICY_ACCEPT;
 	net->xfrm.policy_default[XFRM_POLICY_OUT] = XFRM_USERPOLICY_ACCEPT;
 
+	net->xfrm.policy_prio[XFRM_POLICY_IN].min_hw_prio = U32_MAX;
+	net->xfrm.policy_prio[XFRM_POLICY_FWD].min_hw_prio = U32_MAX;
+	net->xfrm.policy_prio[XFRM_POLICY_OUT].min_hw_prio = U32_MAX;
+
 	rv = xfrm_statistics_init(net);
 	if (rv < 0)
 		goto out_statistics;
-- 
2.37.2


^ permalink raw reply related

* [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration
From: Leon Romanovsky @ 2022-08-16  8:59 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Leon Romanovsky, David S . Miller, Herbert Xu, netdev, Raed Salem,
	ipsec-devel

From: Leon Romanovsky <leonro@nvidia.com>

Changelog:
v2:
 * Rebased to latest 6.0-rc1
 * Add an extra check in TX datapath patch to validate packets before
   forwarding to HW.
 * Added policy cleanup logic in case of netdev down event 
v1: https://lore.kernel.org/all/cover.1652851393.git.leonro@nvidia.com 
 * Moved comment to be before if (...) in third patch.
v0: https://lore.kernel.org/all/cover.1652176932.git.leonro@nvidia.com
-----------------------------------------------------------------------

The following series extends XFRM core code to handle new type of IPsec
offload - full offload.

In this mode, the HW is going to be responsible for whole data path, so
both policy and state should be offloaded.

Thanks

Leon Romanovsky (6):
  xfrm: add new full offload flag
  xfrm: allow state full offload mode
  xfrm: add an interface to offload policy
  xfrm: add TX datapath support for IPsec full offload mode
  xfrm: add RX datapath protection for IPsec full offload mode
  xfrm: enforce separation between priorities of HW/SW policies

 .../inline_crypto/ch_ipsec/chcr_ipsec.c       |   4 +
 .../net/ethernet/intel/ixgbe/ixgbe_ipsec.c    |   5 +
 drivers/net/ethernet/intel/ixgbevf/ipsec.c    |   5 +
 .../mellanox/mlx5/core/en_accel/ipsec.c       |   4 +
 drivers/net/netdevsim/ipsec.c                 |   5 +
 include/linux/netdevice.h                     |   3 +
 include/net/netns/xfrm.h                      |   8 +-
 include/net/xfrm.h                            | 104 +++++++---
 include/uapi/linux/xfrm.h                     |   6 +
 net/xfrm/xfrm_device.c                        | 101 +++++++++-
 net/xfrm/xfrm_output.c                        |  20 ++
 net/xfrm/xfrm_policy.c                        | 180 ++++++++++++++++++
 net/xfrm/xfrm_user.c                          |  19 ++
 13 files changed, 434 insertions(+), 30 deletions(-)

-- 
2.37.2


^ permalink raw reply

* Re: [PATCH v2] wifi: cfg80211: Fix UAF in ieee80211_scan_rx()
From: Siddh Raman Pant @ 2022-08-16  8:52 UTC (permalink / raw)
  To: Johannes Berg
  Cc: jakub kicinski, greg kh, david s. miller, eric dumazet,
	paolo abeni, netdev, linux-kernel-mentees
In-Reply-To: <18fd9b89d45aedc1504d0cbd299ffb289ae96438.camel@sipsolutions.net>

On Tue, 16 Aug 2022 01:28:24 +0530  Johannes Berg  wrote:
> Sorry everyone, I always thought "this looks odd" and then never got
> around to taking a closer look.
> 
> So yeah, I still think this looks odd - cfg80211 doesn't really know
> anything about how mac80211 might be doing something with RCU to protect
> the pointer.
> 
> The patch also leaves the NULL-ing in mac80211 (that is how we reach it)
> broken wrt. the kfree_rcu() since it doesn't happen _before_, and the
> pointer in rdev isn't how this is reached through RCU (it's not even
> __rcu annotated).
> 

Thanks for the critical review.

> I think it might be conceptually better, though not faster, to do
> something like https://p.sipsolutions.net/1d23837f455dc4c2.txt which
> ensures that from mac80211's POV it can no longer be reached before we
> call cfg80211_scan_done().
> 
> Yeah, that's slower, but scanning is still a relatively infrequent (and
> slow anyway) operation, and this way we can stick to "this is not used
> once you call cfg80211_scan_done()" which just makes much more sense?
> 
> johannes
> 

Agreed, that looks like a good way to go about.

Thanks,
Siddh

^ permalink raw reply

* Re: [PATCH net-next v4] net: skb: prevent the split of kfree_skb_reason() by gcc
From: Miguel Ojeda @ 2022-08-16  9:02 UTC (permalink / raw)
  To: menglong8.dong
  Cc: kuba, ojeda, ndesaulniers, davem, edumazet, pabeni, asml.silence,
	imagedong, luiz.von.dentz, vasily.averin, jk, linux-kernel,
	netdev, kernel test robot
In-Reply-To: <20220816032846.2579217-1-imagedong@tencent.com>

On Tue, Aug 16, 2022 at 5:29 AM <menglong8.dong@gmail.com> wrote:
>
> Reported-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>

Hmm... Why did you add me as a reporter?

> + * Optional: not supported by clang.
> + * Optional: not supported by icc.

Much better, thank you! Please add the links to GCC's docs, like in
most attributes (some newer attributes may have been added without
them -- I will fix that).

In any case, no need to send a new version just for this for the
moment, I would recommend waiting until others comment on whether they
want `__optimize__` used here as the workaround.

Cheers,
Miguel

^ permalink raw reply

* Re: [PATCH net-next 08/10] net/smc: replace mutex rmbs_lock and sndbufs_lock with rw_semaphore
From: Tony Lu @ 2022-08-16  8:37 UTC (permalink / raw)
  To: D. Wythe; +Cc: kgraul, wenjia, kuba, davem, netdev, linux-s390, linux-rdma
In-Reply-To: <b4e23c1ef29d567661de46a79c00e48a01344366.1660152975.git.alibuda@linux.alibaba.com>

On Thu, Aug 11, 2022 at 01:47:39AM +0800, D. Wythe wrote:
> From: "D. Wythe" <alibuda@linux.alibaba.com>
> 
> It's clear that rmbs_lock and sndbufs_lock are aims to protect the
> rmbs list or the sndbufs list.
> 
> During conenction establieshment, smc_buf_get_slot() will always

conenction -> connection

> be invoke, and it only performs read semantics in rmbs list and

invoke -> invoked.

> sndbufs list.
> 
> Based on the above considerations, we replace mutex with rw_semaphore.
> Only smc_buf_get_slot() use down_read() to allow smc_buf_get_slot()
> run concurrently, other part use down_write() to keep exclusive
> semantics.
> 
> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> ---
>  net/smc/smc_core.c | 55 +++++++++++++++++++++++++++---------------------------
>  net/smc/smc_core.h |  4 ++--
>  net/smc/smc_llc.c  | 16 ++++++++--------
>  3 files changed, 38 insertions(+), 37 deletions(-)
> 
> diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
> index 113804d..b90970a 100644
> --- a/net/smc/smc_core.c
> +++ b/net/smc/smc_core.c
> @@ -1138,8 +1138,8 @@ static int smc_lgr_create(struct smc_sock *smc, struct smc_init_info *ini)
>  	lgr->freeing = 0;
>  	lgr->vlan_id = ini->vlan_id;
>  	refcount_set(&lgr->refcnt, 1); /* set lgr refcnt to 1 */
> -	mutex_init(&lgr->sndbufs_lock);
> -	mutex_init(&lgr->rmbs_lock);
> +	init_rwsem(&lgr->sndbufs_lock);
> +	init_rwsem(&lgr->rmbs_lock);
>  	rwlock_init(&lgr->conns_lock);
>  	for (i = 0; i < SMC_RMBE_SIZES; i++) {
>  		INIT_LIST_HEAD(&lgr->sndbufs[i]);
> @@ -1380,7 +1380,7 @@ struct smc_link *smc_switch_conns(struct smc_link_group *lgr,
>  static void smcr_buf_unuse(struct smc_buf_desc *buf_desc, bool is_rmb,
>  			   struct smc_link_group *lgr)
>  {
> -	struct mutex *lock;	/* lock buffer list */
> +	struct rw_semaphore *lock;	/* lock buffer list */
>  	int rc;
>  
>  	if (is_rmb && buf_desc->is_conf_rkey && !list_empty(&lgr->list)) {
> @@ -1400,9 +1400,9 @@ static void smcr_buf_unuse(struct smc_buf_desc *buf_desc, bool is_rmb,
>  		/* buf registration failed, reuse not possible */
>  		lock = is_rmb ? &lgr->rmbs_lock :
>  				&lgr->sndbufs_lock;
> -		mutex_lock(lock);
> +		down_write(lock);
>  		list_del(&buf_desc->list);
> -		mutex_unlock(lock);
> +		up_write(lock);
>  
>  		smc_buf_free(lgr, is_rmb, buf_desc);
>  	} else {
> @@ -1506,15 +1506,16 @@ static void smcr_buf_unmap_lgr(struct smc_link *lnk)
>  	int i;
>  
>  	for (i = 0; i < SMC_RMBE_SIZES; i++) {
> -		mutex_lock(&lgr->rmbs_lock);
> +		down_write(&lgr->rmbs_lock);
>  		list_for_each_entry_safe(buf_desc, bf, &lgr->rmbs[i], list)
>  			smcr_buf_unmap_link(buf_desc, true, lnk);
> -		mutex_unlock(&lgr->rmbs_lock);
> -		mutex_lock(&lgr->sndbufs_lock);
> +		up_write(&lgr->rmbs_lock);
> +
> +		down_write(&lgr->sndbufs_lock);
>  		list_for_each_entry_safe(buf_desc, bf, &lgr->sndbufs[i],
>  					 list)
>  			smcr_buf_unmap_link(buf_desc, false, lnk);
> -		mutex_unlock(&lgr->sndbufs_lock);
> +		up_write(&lgr->sndbufs_lock);
>  	}
>  }
>  
> @@ -2324,19 +2325,19 @@ int smc_uncompress_bufsize(u8 compressed)
>   * buffer size; if not available, return NULL
>   */
>  static struct smc_buf_desc *smc_buf_get_slot(int compressed_bufsize,
> -					     struct mutex *lock,
> +					     struct rw_semaphore *lock,
>  					     struct list_head *buf_list)
>  {
>  	struct smc_buf_desc *buf_slot;
>  
> -	mutex_lock(lock);
> +	down_read(lock);
>  	list_for_each_entry(buf_slot, buf_list, list) {
>  		if (cmpxchg(&buf_slot->used, 0, 1) == 0) {
> -			mutex_unlock(lock);
> +			up_read(lock);
>  			return buf_slot;
>  		}
>  	}
> -	mutex_unlock(lock);
> +	up_read(lock);
>  	return NULL;
>  }
>  
> @@ -2445,13 +2446,13 @@ int smcr_link_reg_buf(struct smc_link *link, struct smc_buf_desc *buf_desc)
>  	return 0;
>  }
>  
> -static int _smcr_buf_map_lgr(struct smc_link *lnk, struct mutex *lock,
> +static int _smcr_buf_map_lgr(struct smc_link *lnk, struct rw_semaphore *lock,
>  			     struct list_head *lst, bool is_rmb)
>  {
>  	struct smc_buf_desc *buf_desc, *bf;
>  	int rc = 0;
>  
> -	mutex_lock(lock);
> +	down_write(lock);
>  	list_for_each_entry_safe(buf_desc, bf, lst, list) {
>  		if (!buf_desc->used)
>  			continue;
> @@ -2460,7 +2461,7 @@ static int _smcr_buf_map_lgr(struct smc_link *lnk, struct mutex *lock,
>  			goto out;
>  	}
>  out:
> -	mutex_unlock(lock);
> +	up_write(lock);
>  	return rc;
>  }
>  
> @@ -2493,37 +2494,37 @@ int smcr_buf_reg_lgr(struct smc_link *lnk)
>  	int i, rc = 0;
>  
>  	/* reg all RMBs for a new link */
> -	mutex_lock(&lgr->rmbs_lock);
> +	down_write(&lgr->rmbs_lock);
>  	for (i = 0; i < SMC_RMBE_SIZES; i++) {
>  		list_for_each_entry_safe(buf_desc, bf, &lgr->rmbs[i], list) {
>  			if (!buf_desc->used)
>  				continue;
>  			rc = smcr_link_reg_buf(lnk, buf_desc);
>  			if (rc) {
> -				mutex_unlock(&lgr->rmbs_lock);
> +				up_write(&lgr->rmbs_lock);
>  				return rc;
>  			}
>  		}
>  	}
> -	mutex_unlock(&lgr->rmbs_lock);
> +	up_write(&lgr->rmbs_lock);
>  
>  	if (lgr->buf_type == SMCR_PHYS_CONT_BUFS)
>  		return rc;
>  
>  	/* reg all vzalloced sndbufs for a new link */
> -	mutex_lock(&lgr->sndbufs_lock);
> +	down_write(&lgr->sndbufs_lock);
>  	for (i = 0; i < SMC_RMBE_SIZES; i++) {
>  		list_for_each_entry_safe(buf_desc, bf, &lgr->sndbufs[i], list) {
>  			if (!buf_desc->used || !buf_desc->is_vm)
>  				continue;
>  			rc = smcr_link_reg_buf(lnk, buf_desc);
>  			if (rc) {
> -				mutex_unlock(&lgr->sndbufs_lock);
> +				up_write(&lgr->sndbufs_lock);
>  				return rc;
>  			}
>  		}
>  	}
> -	mutex_unlock(&lgr->sndbufs_lock);
> +	up_write(&lgr->sndbufs_lock);
>  	return rc;
>  }
>  
> @@ -2641,7 +2642,7 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb)
>  	struct list_head *buf_list;
>  	int bufsize, bufsize_short;
>  	bool is_dgraded = false;
> -	struct mutex *lock;	/* lock buffer list */
> +	struct rw_semaphore *lock;	/* lock buffer list */
>  	int sk_buf_size;
>  
>  	if (is_rmb)
> @@ -2689,9 +2690,9 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb)
>  		SMC_STAT_RMB_ALLOC(smc, is_smcd, is_rmb);
>  		SMC_STAT_RMB_SIZE(smc, is_smcd, is_rmb, bufsize);
>  		buf_desc->used = 1;
> -		mutex_lock(lock);
> +		down_write(lock);
>  		list_add(&buf_desc->list, buf_list);
> -		mutex_unlock(lock);
> +		up_write(lock);
>  		break; /* found */
>  	}
>  
> @@ -2765,9 +2766,9 @@ int smc_buf_create(struct smc_sock *smc, bool is_smcd)
>  	/* create rmb */
>  	rc = __smc_buf_create(smc, is_smcd, true);
>  	if (rc) {
> -		mutex_lock(&smc->conn.lgr->sndbufs_lock);
> +		down_write(&smc->conn.lgr->sndbufs_lock);
>  		list_del(&smc->conn.sndbuf_desc->list);
> -		mutex_unlock(&smc->conn.lgr->sndbufs_lock);
> +		up_write(&smc->conn.lgr->sndbufs_lock);
>  		smc_buf_free(smc->conn.lgr, false, smc->conn.sndbuf_desc);
>  		smc->conn.sndbuf_desc = NULL;
>  	}
> diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
> index 559d330..008148c 100644
> --- a/net/smc/smc_core.h
> +++ b/net/smc/smc_core.h
> @@ -300,9 +300,9 @@ struct smc_link_group {
>  	unsigned short		vlan_id;	/* vlan id of link group */
>  
>  	struct list_head	sndbufs[SMC_RMBE_SIZES];/* tx buffers */
> -	struct mutex		sndbufs_lock;	/* protects tx buffers */
> +	struct rw_semaphore	sndbufs_lock;	/* protects tx buffers */
>  	struct list_head	rmbs[SMC_RMBE_SIZES];	/* rx buffers */
> -	struct mutex		rmbs_lock;	/* protects rx buffers */
> +	struct rw_semaphore	rmbs_lock;	/* protects rx buffers */
>  
>  	u8			id[SMC_LGR_ID_SIZE];	/* unique lgr id */
>  	struct delayed_work	free_work;	/* delayed freeing of an lgr */
> diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
> index d744937..76f9906 100644
> --- a/net/smc/smc_llc.c
> +++ b/net/smc/smc_llc.c
> @@ -642,7 +642,7 @@ static int smc_llc_fill_ext_v2(struct smc_llc_msg_add_link_v2_ext *ext,
>  
>  	prim_lnk_idx = link->link_idx;
>  	lnk_idx = link_new->link_idx;
> -	mutex_lock(&lgr->rmbs_lock);
> +	down_write(&lgr->rmbs_lock);
>  	ext->num_rkeys = lgr->conns_num;
>  	if (!ext->num_rkeys)
>  		goto out;
> @@ -662,7 +662,7 @@ static int smc_llc_fill_ext_v2(struct smc_llc_msg_add_link_v2_ext *ext,
>  	}
>  	len += i * sizeof(ext->rt[0]);
>  out:
> -	mutex_unlock(&lgr->rmbs_lock);
> +	up_write(&lgr->rmbs_lock);
>  	return len;
>  }
>  
> @@ -923,7 +923,7 @@ static int smc_llc_cli_rkey_exchange(struct smc_link *link,
>  	int rc = 0;
>  	int i;
>  
> -	mutex_lock(&lgr->rmbs_lock);
> +	down_write(&lgr->rmbs_lock);
>  	num_rkeys_send = lgr->conns_num;
>  	buf_pos = smc_llc_get_first_rmb(lgr, &buf_lst);
>  	do {
> @@ -950,7 +950,7 @@ static int smc_llc_cli_rkey_exchange(struct smc_link *link,
>  			break;
>  	} while (num_rkeys_send || num_rkeys_recv);
>  
> -	mutex_unlock(&lgr->rmbs_lock);
> +	up_write(&lgr->rmbs_lock);
>  	return rc;
>  }
>  
> @@ -1033,14 +1033,14 @@ static void smc_llc_save_add_link_rkeys(struct smc_link *link,
>  	ext = (struct smc_llc_msg_add_link_v2_ext *)((u8 *)lgr->wr_rx_buf_v2 +
>  						     SMC_WR_TX_SIZE);
>  	max = min_t(u8, ext->num_rkeys, SMC_LLC_RKEYS_PER_MSG_V2);
> -	mutex_lock(&lgr->rmbs_lock);
> +	down_write(&lgr->rmbs_lock);
>  	for (i = 0; i < max; i++) {
>  		smc_rtoken_set(lgr, link->link_idx, link_new->link_idx,
>  			       ext->rt[i].rmb_key,
>  			       ext->rt[i].rmb_vaddr_new,
>  			       ext->rt[i].rmb_key_new);
>  	}
> -	mutex_unlock(&lgr->rmbs_lock);
> +	up_write(&lgr->rmbs_lock);
>  }
>  
>  static void smc_llc_save_add_link_info(struct smc_link *link,
> @@ -1349,7 +1349,7 @@ static int smc_llc_srv_rkey_exchange(struct smc_link *link,
>  	int rc = 0;
>  	int i;
>  
> -	mutex_lock(&lgr->rmbs_lock);
> +	down_write(&lgr->rmbs_lock);
>  	num_rkeys_send = lgr->conns_num;
>  	buf_pos = smc_llc_get_first_rmb(lgr, &buf_lst);
>  	do {
> @@ -1374,7 +1374,7 @@ static int smc_llc_srv_rkey_exchange(struct smc_link *link,
>  		smc_llc_flow_qentry_del(&lgr->llc_flow_lcl);
>  	} while (num_rkeys_send || num_rkeys_recv);
>  out:
> -	mutex_unlock(&lgr->rmbs_lock);
> +	up_write(&lgr->rmbs_lock);
>  	return rc;
>  }
>  
> -- 
> 1.8.3.1

^ permalink raw reply

* Re: igc: missing HW timestamps at TX
From: Kurt Kanzenbach @ 2022-08-16  8:51 UTC (permalink / raw)
  To: Vinicius Costa Gomes, Vladimir Oltean
  Cc: Ferenc Fejes, netdev@vger.kernel.org, marton12050@gmail.com,
	peti.antal99@gmail.com
In-Reply-To: <87k079hzry.fsf@intel.com>

[-- Attachment #1: Type: text/plain, Size: 1192 bytes --]

Hi Vinicius,

On Mon Aug 15 2022, Vinicius Costa Gomes wrote:
> I think your question is more "why there's that workqueue on igc?"/"why
> don't you retrieve the TX timestamp 'inline' with the interrupt?", if I
> got that right, then, I don't have a good reason, apart from the feeling
> that reading all those (5-6?) registers may take too long for a
> interrupt handler. And it's something that's being done the same for
> most (all?) Intel drivers.

We do have one optimization for igb which attempts to read the Tx
timestamp directly from the ISR. If that's not ready *only* then we
schedule the worker. I do assume igb and igc have the same logic for
retrieving the timestamps here.

The problem with workqueues is that under heavy system load, it might be
deferred and timestamps will be lost. I guess that workqueue was added
because of something like this: 1f6e8178d685 ("igb: Prevent dropped Tx
timestamps via work items and interrupts.").

>
> I have a TODO to experiment with removing the workqueue, and retrieving
> the TX timestamp in the same context as the interrupt handler, but other
> things always come up.

Let me know if you have interest in that igb patch.

Thanks,
Kurt

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 861 bytes --]

^ permalink raw reply

* RE: [PATCH v2 1/2] fs: Export __receive_fd()
From: David Laight @ 2022-08-16  8:03 UTC (permalink / raw)
  To: 'Kirill Tkhai', Linux Kernel Network Developers
  Cc: davem@davemloft.net, edumazet@google.com, viro@zeniv.linux.org.uk
In-Reply-To: <3a8da760-d58b-04fe-e251-e0d143493df1@ya.ru>

From: Kirill Tkhai
> Sent: 15 August 2022 22:15
> 
> This is needed to make receive_fd_user() available in modules, and it will be used in next patch.
> 
> Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
> ---
> v2: New
>  fs/file.c |    1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/file.c b/fs/file.c
> index 3bcc1ecc314a..e45d45f1dd45 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -1181,6 +1181,7 @@ int __receive_fd(struct file *file, int __user *ufd, unsigned int o_flags)
>  	__receive_sock(file);
>  	return new_fd;
>  }
> +EXPORT_SYMBOL_GPL(__receive_fd);

It doesn't seem right (to me) to be exporting a function
with a __ prefix.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox