Netdev List
 help / color / mirror / Atom feed
* Re: [RFC net-next 1/4] ynl: add intro docs for the concept
From: Jakub Kicinski @ 2022-08-16  0:32 UTC (permalink / raw)
  To: Johannes Berg
  Cc: netdev, davem, edumazet, pabeni, sdf, jacob.e.keller, vadfed,
	jiri, dsahern, stephen, fw, linux-doc
In-Reply-To: <273db0bc09c0e074a8875679e5e07ea047b61c27.camel@sipsolutions.net>

On Mon, 15 Aug 2022 22:09:29 +0200 Johannes Berg wrote:
> On Wed, 2022-08-10 at 19:23 -0700, Jakub Kicinski wrote:
> > 
> > +Note that attribute spaces do not themselves nest, nested attributes refer to their internal
> > +space via a ``nested-attributes`` property, so the YAML spec does not resemble the format
> > +of the netlink messages directly.  
> 
> I find this a bit ... confusing.
> 
> I think reading the other patch I know what you mean, but if I think of
> this I think more of the policy declarations than the message itself,
> and there we do refer to another policy?
> 
> Maybe reword a bit and say
> 
>    Note that attribute spaces do not themselves nest, nested attributes
>    refer to their internal space via a ``nested-attributes`` property
>    (the name of another or the same attribute space).
> 
> or something?

I think I put the cart before the horse in this looong sentence. How
about:

  Note that the YAML spec is "flattened" and is not meant to visually
  resemble the format of the netlink messages (unlike certain ad-hoc documentation
  formats seen in kernel comments). In the YAML spec subordinate attribute sets
  are not defined inline as a nest, but defined in a separate attribute set
  referred to with a ``nested-attributes`` property of the container.

^ permalink raw reply

* [PATCH net] tls: rx: react to strparser initialization errors
From: Jakub Kicinski @ 2022-08-16  0:23 UTC (permalink / raw)
  To: davem
  Cc: netdev, edumazet, pabeni, Jakub Kicinski,
	syzbot+abd45eb849b05194b1b6, borisp, john.fastabend

Even though the normal strparser's init function has a return
value we got away with ignoring errors until now, as it only
validates the parameters and we were passing correct parameters.

tls_strp can fail to init on memory allocation errors, which
syzbot duly induced and reported.

Reported-by: syzbot+abd45eb849b05194b1b6@syzkaller.appspotmail.com
Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
CC: borisp@nvidia.com
CC: john.fastabend@gmail.com
---
 net/tls/tls_sw.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index f76119f62f1b..fe27241cd13f 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -2702,7 +2702,9 @@ int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx)
 			crypto_info->version != TLS_1_3_VERSION &&
 			!!(tfm->__crt_alg->cra_flags & CRYPTO_ALG_ASYNC);
 
-		tls_strp_init(&sw_ctx_rx->strp, sk);
+		rc = tls_strp_init(&sw_ctx_rx->strp, sk);
+		if (rc)
+			goto free_aead;
 	}
 
 	goto out;
-- 
2.37.2


^ permalink raw reply related

* Re: [PATCH v3 0/5] virtio: drop sizing vqs during init
From: Michael S. Tsirkin @ 2022-08-15 23:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Xuan Zhuo, Jason Wang, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, virtualization, netdev,
	Jens Axboe, James Bottomley, Martin K. Petersen, Guenter Roeck,
	Greg KH
In-Reply-To: <CAHk-=wj=Ju_jhbww7WmpgmHHebMSdd1U5WBjh925yLB_F1j9Ng@mail.gmail.com>

On Mon, Aug 15, 2022 at 03:24:28PM -0700, Linus Torvalds wrote:
> On Mon, Aug 15, 2022 at 3:00 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > But the benefit is unclear in any case, so let's revert for now.
> 
> Should I take this patch series directly, or will you be sending a
> pull request (preferred)?
> 
>              Linus

I'll be sending a pull request, just not today - I try not to do
this at strange hours of night.

-- 
MST


^ permalink raw reply

* Re: [PATCH 2/2] vDPA: conditionally read fields in virtio-net dev
From: Si-Wei Liu @ 2022-08-15 23:32 UTC (permalink / raw)
  To: Zhu Lingshan, jasowang, mst
  Cc: virtualization, netdev, kvm, parav, xieyongji, gautam.dawar
In-Reply-To: <20220815092638.504528-3-lingshan.zhu@intel.com>



On 8/15/2022 2:26 AM, Zhu Lingshan wrote:
> Some fields of virtio-net device config space are
> conditional on the feature bits, the spec says:
>
> "The mac address field always exists
> (though is only valid if VIRTIO_NET_F_MAC is set)"
>
> "max_virtqueue_pairs only exists if VIRTIO_NET_F_MQ
> or VIRTIO_NET_F_RSS is set"
>
> "mtu only exists if VIRTIO_NET_F_MTU is set"
>
> so we should read MTU, MAC and MQ in the device config
> space only when these feature bits are offered.
>
> For MQ, if both VIRTIO_NET_F_MQ and VIRTIO_NET_F_RSS are
> not set, the virtio device should have
> one queue pair as default value, so when userspace querying queue pair numbers,
> it should return mq=1 than zero.
>
> For MTU, if VIRTIO_NET_F_MTU is not set, we should not read
> MTU from the device config sapce.
> RFC894 <A Standard for the Transmission of IP Datagrams over Ethernet Networks>
> says:"The minimum length of the data field of a packet sent over an
> Ethernet is 1500 octets, thus the maximum length of an IP datagram
> sent over an Ethernet is 1500 octets.  Implementations are encouraged
> to support full-length packets"
Noted there's a typo in the above "The *maximum* length of the data 
field of a packet sent over an Ethernet is 1500 octets ..." and the RFC 
was written 1984.
Apparently that is no longer true with the introduction of Jumbo size 
frame later in the 2000s. I'm not sure what is the point of mention this 
ancient RFC. It doesn't say default MTU of any Ethernet NIC/switch 
should be 1500 in either  case.

>
> virtio spec says:"The virtio network device is a virtual ethernet card",
Right,
> so the default MTU value should be 1500 for virtio-net.
... but it doesn't say the default is 1500. At least, not in explicit 
way. Why it can't be 1492 or even lower? In practice, if the network 
backend has a MTU higher than 1500, there's nothing wrong for guest to 
configure default MTU more than 1500.

>
> For MAC, the spec says:"If the VIRTIO_NET_F_MAC feature bit is set,
> the configuration space mac entry indicates the “physical” address
> of the network card, otherwise the driver would typically
> generate a random local MAC address." So there is no
> default MAC address if VIRTIO_NET_F_MAC not set.
>
> This commits introduces functions vdpa_dev_net_mtu_config_fill()
> and vdpa_dev_net_mac_config_fill() to fill MTU and MAC.
> It also fixes vdpa_dev_net_mq_config_fill() to report correct
> MQ when _F_MQ is not present.
>
> These functions should check devices features than driver
> features, and struct vdpa_device is not needed as a parameter
>
> The test & userspace tool output:
>
> Feature bit VIRTIO_NET_F_MTU, VIRTIO_NET_F_RSS, VIRTIO_NET_F_MQ
> and VIRTIO_NET_F_MAC can be mask out by hardcode.
>
> However, it is challenging to "disable" the related fields
> in the HW device config space, so let's just assume the values
> are meaningless if the feature bits are not set.
>
> Before this change, when feature bits for RSS, MQ, MTU and MAC
> are not set, iproute2 output:
> $vdpa vdpa0: mac 00:e8:ca:11:be:05 link up link_announce false mtu 1500
>    negotiated_features
>
> without this commit, function vdpa_dev_net_config_fill()
> reads all config space fields unconditionally, so let's
> assume the MAC and MTU are meaningless, and it checks
> MQ with driver_features, so we don't see max_vq_pairs.
>
> After applying this commit, when feature bits for
> MQ, RSS, MAC and MTU are not set,iproute2 output:
> $vdpa dev config show vdpa0
> vdpa0: link up link_announce false max_vq_pairs 1 mtu 1500
>    negotiated_features
>
> As explained above:
> Here is no MAC, because VIRTIO_NET_F_MAC is not set,
> and there is no default value for MAC. It shows
> max_vq_paris = 1 because even without MQ feature,
> a functional virtio-net must have one queue pair.
> mtu = 1500 is the default value as ethernet
> required.
>
> This commit also add supplementary comments for
> __virtio16_to_cpu(true, xxx) operations in
> vdpa_dev_net_config_fill() and vdpa_fill_stats_rec()
>
> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> ---
>   drivers/vdpa/vdpa.c | 60 +++++++++++++++++++++++++++++++++++----------
>   1 file changed, 47 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
> index efb55a06e961..a74660b98979 100644
> --- a/drivers/vdpa/vdpa.c
> +++ b/drivers/vdpa/vdpa.c
> @@ -801,19 +801,44 @@ static int vdpa_nl_cmd_dev_get_dumpit(struct sk_buff *msg, struct netlink_callba
>   	return msg->len;
>   }
>   
> -static int vdpa_dev_net_mq_config_fill(struct vdpa_device *vdev,
> -				       struct sk_buff *msg, u64 features,
> +static int vdpa_dev_net_mq_config_fill(struct sk_buff *msg, u64 features,
>   				       const struct virtio_net_config *config)
>   {
>   	u16 val_u16;
>   
> -	if ((features & BIT_ULL(VIRTIO_NET_F_MQ)) == 0)
> -		return 0;
> +	if ((features & BIT_ULL(VIRTIO_NET_F_MQ)) == 0 &&
> +	    (features & BIT_ULL(VIRTIO_NET_F_RSS)) == 0)
> +		val_u16 = 1;
> +	else
> +		val_u16 = __virtio16_to_cpu(true, config->max_virtqueue_pairs);
>   
> -	val_u16 = le16_to_cpu(config->max_virtqueue_pairs);
>   	return nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MAX_VQP, val_u16);
>   }
>   
> +static int vdpa_dev_net_mtu_config_fill(struct sk_buff *msg, u64 features,
> +					const struct virtio_net_config *config)
> +{
> +	u16 val_u16;
> +
> +	if ((features & BIT_ULL(VIRTIO_NET_F_MTU)) == 0)
> +		val_u16 = 1500;
As said, there's no virtio spec defined value for MTU. Please leave this 
field out if feature VIRTIO_NET_F_MTU is not negotiated.
> +	else
> +		val_u16 = __virtio16_to_cpu(true, config->mtu);
> +
> +	return nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MTU, val_u16);
> +}
> +
> +static int vdpa_dev_net_mac_config_fill(struct sk_buff *msg, u64 features,
> +					const struct virtio_net_config *config)
> +{
> +	if ((features & BIT_ULL(VIRTIO_NET_F_MAC)) == 0)
> +		return 0;
> +	else
> +		return  nla_put(msg, VDPA_ATTR_DEV_NET_CFG_MACADDR,
> +				sizeof(config->mac), config->mac);
> +}
> +
> +
>   static int vdpa_dev_net_config_fill(struct vdpa_device *vdev, struct sk_buff *msg)
>   {
>   	struct virtio_net_config config = {};
> @@ -822,18 +847,16 @@ static int vdpa_dev_net_config_fill(struct vdpa_device *vdev, struct sk_buff *ms
>   
>   	vdpa_get_config_unlocked(vdev, 0, &config, sizeof(config));
>   
> -	if (nla_put(msg, VDPA_ATTR_DEV_NET_CFG_MACADDR, sizeof(config.mac),
> -		    config.mac))
> -		return -EMSGSIZE;
> +	/*
> +	 * Assume little endian for now, userspace can tweak this for
> +	 * legacy guest support.
You can leave it as a TODO for kernel (vdpa core limitation), but AFAIK 
there's nothing userspace needs to do to infer the endianness. IMHO it's 
the kernel's job to provide an abstraction rather than rely on userspace 
guessing it.

> +	 */
> +	val_u16 = __virtio16_to_cpu(true, config.status);
>   
>   	val_u16 = __virtio16_to_cpu(true, config.status);
>   	if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_STATUS, val_u16))
>   		return -EMSGSIZE;
>   
> -	val_u16 = __virtio16_to_cpu(true, config.mtu);
> -	if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MTU, val_u16))
> -		return -EMSGSIZE;
> -
>   	features_driver = vdev->config->get_driver_features(vdev);
>   	if (nla_put_u64_64bit(msg, VDPA_ATTR_DEV_NEGOTIATED_FEATURES, features_driver,
>   			      VDPA_ATTR_PAD))
> @@ -846,7 +869,13 @@ static int vdpa_dev_net_config_fill(struct vdpa_device *vdev, struct sk_buff *ms
>   			      VDPA_ATTR_PAD))
>   		return -EMSGSIZE;
>   
> -	return vdpa_dev_net_mq_config_fill(vdev, msg, features_driver, &config);
> +	if (vdpa_dev_net_mac_config_fill(msg, features_device, &config))
> +		return -EMSGSIZE;
> +
> +	if (vdpa_dev_net_mtu_config_fill(msg, features_device, &config))
> +		return -EMSGSIZE;
> +
> +	return vdpa_dev_net_mq_config_fill(msg, features_device, &config);
>   }
>   
>   static int
> @@ -914,6 +943,11 @@ static int vdpa_fill_stats_rec(struct vdpa_device *vdev, struct sk_buff *msg,
>   	}
>   	vdpa_get_config_unlocked(vdev, 0, &config, sizeof(config));
>   
> +	/*
> +	 * Assume little endian for now, userspace can tweak this for
> +	 * legacy guest support.
> +	 */
> +
Ditto.

Thanks,
-Siwei
>   	max_vqp = __virtio16_to_cpu(true, config.max_virtqueue_pairs);
>   	if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MAX_VQP, max_vqp))
>   		return -EMSGSIZE;


^ permalink raw reply

* Re: [PATCH 2/6] vsock: return errors other than -ENOMEM to socket
From: kernel test robot @ 2022-08-15 23:13 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: llvm, kbuild-all, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Dexuan Cui, kvm, virtualization, netdev, linux-kernel,
	linux-hyperv
In-Reply-To: <d81818b868216c774613dd03641fcfe63cc55a45.1660362668.git.bobby.eshleman@bytedance.com>

Hi Bobby,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on mst-vhost/linux-next]
[also build test WARNING on linus/master v6.0-rc1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Bobby-Eshleman/virtio-vsock-introduce-dgrams-sk_buff-and-qdisc/20220816-015812
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git linux-next
config: i386-randconfig-a014-20220815 (https://download.01.org/0day-ci/archive/20220816/202208160737.gXXFmPbY-lkp@intel.com/config)
compiler: clang version 16.0.0 (https://github.com/llvm/llvm-project 6afcc4a459ead8809a0d6d9b4bf7b64bcc13582b)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/68c9c8216a573cdfe2170cad677854e2f4a34634
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Bobby-Eshleman/virtio-vsock-introduce-dgrams-sk_buff-and-qdisc/20220816-015812
        git checkout 68c9c8216a573cdfe2170cad677854e2f4a34634
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 SHELL=/bin/bash net/vmw_vsock/

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> net/vmw_vsock/virtio_transport.c:178: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
    * Merge the two most recent skbs together if possible.


vim +178 net/vmw_vsock/virtio_transport.c

93afaf2cdefaa9 Bobby Eshleman 2022-08-15  176  
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  177  /**
93afaf2cdefaa9 Bobby Eshleman 2022-08-15 @178   * Merge the two most recent skbs together if possible.
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  179   *
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  180   * Caller must hold the queue lock.
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  181   */
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  182  static void
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  183  virtio_transport_add_to_queue(struct sk_buff_head *queue, struct sk_buff *new)
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  184  {
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  185  	struct sk_buff *old;
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  186  
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  187  	spin_lock_bh(&queue->lock);
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  188  	/* In order to reduce skb memory overhead, we merge new packets with
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  189  	 * older packets if they pass virtio_transport_skbs_can_merge().
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  190  	 */
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  191  	if (skb_queue_empty_lockless(queue)) {
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  192  		__skb_queue_tail(queue, new);
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  193  		goto out;
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  194  	}
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  195  
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  196  	old = skb_peek_tail(queue);
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  197  
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  198  	if (!virtio_transport_skbs_can_merge(old, new)) {
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  199  		__skb_queue_tail(queue, new);
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  200  		goto out;
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  201  	}
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  202  
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  203  	memcpy(skb_put(old, new->len), new->data, new->len);
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  204  	vsock_hdr(old)->len = cpu_to_le32(old->len);
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  205  	vsock_hdr(old)->buf_alloc = vsock_hdr(new)->buf_alloc;
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  206  	vsock_hdr(old)->fwd_cnt = vsock_hdr(new)->fwd_cnt;
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  207  	dev_kfree_skb_any(new);
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  208  
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  209  out:
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  210  	spin_unlock_bh(&queue->lock);
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  211  }
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  212  

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply

* Re: igc: missing HW timestamps at TX
From: Vinicius Costa Gomes @ 2022-08-15 23:07 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Ferenc Fejes, netdev@vger.kernel.org, marton12050@gmail.com,
	peti.antal99@gmail.com
In-Reply-To: <20220815222639.346wachaaq5zjwue@skbuf>

Hi Vladimir,

Vladimir Oltean <vladimir.oltean@nxp.com> writes:

> Hi Vinicius,
>
> On Mon, Aug 15, 2022 at 02:39:33PM -0700, Vinicius Costa Gomes wrote:
>> Just some aditional information (note that I know very little about
>> interrupt internal workings), igc_intr_msi() is called when MSI-X is not
>> enabled (i.e. "MSI only" system), igc_msix_other() is called when MSI-X
>> is available. When MSI-X is available, i225/i226 sets up a separate
>> interrupt handler for "general" events, the TX timestamp being available
>> to be read from the registers is one those events.
>
> Thanks for the extra information.
>
> Why is the i225/i226 emitting an interrupt about the availability of a
> new TX timestamp, if the igc driver polls for its availability anyway?
> In other words, when IGC_TSICR_TXTS is found set, is a TX timestamp
> available or is it not? Why does the driver schedule a deferred work
> item to retrieve it?

The interrupt that is generated is a general/misc interrupt, we have to
check on the interrupt cause bit that it's something TimeSync related,
and only then, we have to check that it's indeed a TX Timestamp that is
ready. And then, there's another register with some bits saying which
one of the 4 registers for timestamps that is ready. There are a few
levels of indirection, but no polling.

I think your question is more "why there's that workqueue on igc?"/"why
don't you retrieve the TX timestamp 'inline' with the interrupt?", if I
got that right, then, I don't have a good reason, apart from the feeling
that reading all those (5-6?) registers may take too long for a
interrupt handler. And it's something that's being done the same for
most (all?) Intel drivers.

I have a TODO to experiment with removing the workqueue, and retrieving
the TX timestamp in the same context as the interrupt handler, but other
things always come up.


Cheers,
-- 
Vinicius

^ permalink raw reply

* Re: [PATCH bpf-next 2/3] bpf: Add support for writing to nf_conn:mark
From: Alexei Starovoitov @ 2022-08-15 22:47 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Toke Høiland-Jørgensen, Daniel Xu, bpf,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Kumar Kartikeya Dwivedi, Pablo Neira Ayuso, netfilter-devel,
	Network Development, LKML
In-Reply-To: <20220815224011.GA9821@breakpoint.cc>

On Mon, Aug 15, 2022 at 3:40 PM Florian Westphal <fw@strlen.de> wrote:
>
> Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> > > Support direct writes to nf_conn:mark from TC and XDP prog types. This
> > > is useful when applications want to store per-connection metadata. This
> > > is also particularly useful for applications that run both bpf and
> > > iptables/nftables because the latter can trivially access this metadata.
> > >
> > > One example use case would be if a bpf prog is responsible for advanced
> > > packet classification and iptables/nftables is later used for routing
> > > due to pre-existing/legacy code.
> > >
> > > Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
> >
> > Didn't we agree the last time around that all field access should be
> > using helper kfuncs instead of allowing direct writes to struct nf_conn?
>
> I don't see why ct->mark needs special handling.
>
> It might be possible we need to change accesses on nf/tc side to use
> READ/WRITE_ONCE though.

+1
I don't think we need to have a hard rule.
If fields is safe to access directly than it's faster
to let bpf prog read/write it.
There are no backward compat concerns. If conntrack side decides
to make that field special we can disallow direct writes in
the same kernel version. These accesses, just like kfuncs, are unstable.

^ permalink raw reply

* Re: [PATCH bpf-next 2/3] bpf: Add support for writing to nf_conn:mark
From: Daniel Xu @ 2022-08-15 22:41 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, bpf@vger.kernel.org,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Kumar Kartikeya Dwivedi
  Cc: pablo, fw, netfilter-devel, netdev, linux-kernel
In-Reply-To: <871qth87r1.fsf@toke.dk>

Hi Toke,

On Mon, Aug 15, 2022, at 4:25 PM, Toke Høiland-Jørgensen wrote:
> Daniel Xu <dxu@dxuuu.xyz> writes:
>
>> Support direct writes to nf_conn:mark from TC and XDP prog types. This
>> is useful when applications want to store per-connection metadata. This
>> is also particularly useful for applications that run both bpf and
>> iptables/nftables because the latter can trivially access this metadata.
>>
>> One example use case would be if a bpf prog is responsible for advanced
>> packet classification and iptables/nftables is later used for routing
>> due to pre-existing/legacy code.
>>
>> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
>
> Didn't we agree the last time around that all field access should be
> using helper kfuncs instead of allowing direct writes to struct nf_conn?

Sorry, I was not aware of those discussions. Do you have a link handy?

I received the suggestion to enable direct writes here:
https://lore.kernel.org/bpf/CAP01T74aWUW-iyPCV_VfASO6YqfAZmnkYQMN2B4L8ngMMgnAcw@mail.gmail.com/ .

Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH bpf-next 2/3] bpf: Add support for writing to nf_conn:mark
From: Florian Westphal @ 2022-08-15 22:40 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Daniel Xu, bpf, ast, daniel, andrii, memxor, pablo, fw,
	netfilter-devel, netdev, linux-kernel
In-Reply-To: <871qth87r1.fsf@toke.dk>

Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> > Support direct writes to nf_conn:mark from TC and XDP prog types. This
> > is useful when applications want to store per-connection metadata. This
> > is also particularly useful for applications that run both bpf and
> > iptables/nftables because the latter can trivially access this metadata.
> >
> > One example use case would be if a bpf prog is responsible for advanced
> > packet classification and iptables/nftables is later used for routing
> > due to pre-existing/legacy code.
> >
> > Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
> 
> Didn't we agree the last time around that all field access should be
> using helper kfuncs instead of allowing direct writes to struct nf_conn?

I don't see why ct->mark needs special handling.

It might be possible we need to change accesses on nf/tc side to use
READ/WRITE_ONCE though.

^ permalink raw reply

* [syzbot] inconsistent lock state in p9_req_put
From: syzbot @ 2022-08-15 22:24 UTC (permalink / raw)
  To: asmadeus, davem, edumazet, ericvh, kuba, linux-kernel, linux_oss,
	lucho, netdev, pabeni, syzkaller-bugs, v9fs-developer

Hello,

syzbot found the following issue on:

HEAD commit:    f6eb0fed6a39 Merge tag 'timers-urgent-2022-08-13' of git:/..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=1184fec3080000
kernel config:  https://syzkaller.appspot.com/x/.config?x=ffbab52ef9fab60
dashboard link: https://syzkaller.appspot.com/bug?extid=2f20b523930c32c160cc
compiler:       gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2

Unfortunately, I don't have any reproducer for this issue yet.

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+2f20b523930c32c160cc@syzkaller.appspotmail.com

================================
WARNING: inconsistent lock state
5.19.0-syzkaller-14264-gf6eb0fed6a39 #0 Not tainted
--------------------------------
inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
syz-executor.3/10062 [HC1[1]:SC1[1]:HE0:SE0] takes:
ffff88801ceb8c18 (&clnt->lock){?.+.}-{2:2}, at: p9_tag_remove net/9p/client.c:367 [inline]
ffff88801ceb8c18 (&clnt->lock){?.+.}-{2:2}, at: p9_req_put net/9p/client.c:375 [inline]
ffff88801ceb8c18 (&clnt->lock){?.+.}-{2:2}, at: p9_req_put+0xc6/0x250 net/9p/client.c:372
{HARDIRQ-ON-W} state was registered at:
  lock_acquire kernel/locking/lockdep.c:5666 [inline]
  lock_acquire+0x1ab/0x570 kernel/locking/lockdep.c:5631
  __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
  _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:154
  spin_lock include/linux/spinlock.h:349 [inline]
  p9_fd_request+0x85/0x330 net/9p/trans_fd.c:672
  p9_client_rpc+0x2f0/0xce0 net/9p/client.c:660
  p9_client_version net/9p/client.c:880 [inline]
  p9_client_create+0xaec/0x1070 net/9p/client.c:985
  v9fs_session_init+0x1e2/0x1810 fs/9p/v9fs.c:408
  v9fs_mount+0xba/0xc90 fs/9p/vfs_super.c:126
  legacy_get_tree+0x105/0x220 fs/fs_context.c:610
  vfs_get_tree+0x89/0x2f0 fs/super.c:1530
  do_new_mount fs/namespace.c:3040 [inline]
  path_mount+0x1326/0x1e20 fs/namespace.c:3370
  do_mount fs/namespace.c:3383 [inline]
  __do_sys_mount fs/namespace.c:3591 [inline]
  __se_sys_mount fs/namespace.c:3568 [inline]
  __x64_sys_mount+0x27f/0x300 fs/namespace.c:3568
  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
  do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
  entry_SYSCALL_64_after_hwframe+0x63/0xcd
irq event stamp: 1139
hardirqs last  enabled at (1138): [<ffffffff89c00190>] __do_softirq+0x190/0x9c6 kernel/softirq.c:555
hardirqs last disabled at (1139): [<ffffffff897eaf31>] common_interrupt+0x11/0xc0 arch/x86/kernel/irq.c:240
softirqs last  enabled at (62): [<ffffffff81483e73>] invoke_softirq kernel/softirq.c:445 [inline]
softirqs last  enabled at (62): [<ffffffff81483e73>] __irq_exit_rcu+0x123/0x180 kernel/softirq.c:650
softirqs last disabled at (1137): [<ffffffff81483e73>] invoke_softirq kernel/softirq.c:445 [inline]
softirqs last disabled at (1137): [<ffffffff81483e73>] __irq_exit_rcu+0x123/0x180 kernel/softirq.c:650

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&clnt->lock);
  <Interrupt>
    lock(&clnt->lock);

 *** DEADLOCK ***

3 locks held by syz-executor.3/10062:
 #0: ffff888066ee2ab0 (&p->alloc_lock){+.+.}-{2:2}, at: spin_lock include/linux/spinlock.h:349 [inline]
 #0: ffff888066ee2ab0 (&p->alloc_lock){+.+.}-{2:2}, at: task_lock include/linux/sched/task.h:174 [inline]
 #0: ffff888066ee2ab0 (&p->alloc_lock){+.+.}-{2:2}, at: exit_fs+0x5a/0x170 fs/fs_struct.c:101
 #1: ffff88801b98ca20 (&fs->lock){+.+.}-{2:2}, at: spin_lock include/linux/spinlock.h:349 [inline]
 #1: ffff88801b98ca20 (&fs->lock){+.+.}-{2:2}, at: exit_fs+0x62/0x170 fs/fs_struct.c:102
 #2: ffff888022720020 (&chan->lock#2){-.-.}-{2:2}, at: req_done+0xcf/0x2e0 net/9p/trans_virtio.c:139

stack backtrace:
CPU: 0 PID: 10062 Comm: syz-executor.3 Not tainted 5.19.0-syzkaller-14264-gf6eb0fed6a39 #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
 print_usage_bug kernel/locking/lockdep.c:3961 [inline]
 valid_state kernel/locking/lockdep.c:3973 [inline]
 mark_lock_irq kernel/locking/lockdep.c:4176 [inline]
 mark_lock.part.0.cold+0x18/0xd8 kernel/locking/lockdep.c:4632
 mark_lock kernel/locking/lockdep.c:4596 [inline]
 mark_usage kernel/locking/lockdep.c:4524 [inline]
 __lock_acquire+0x14a2/0x56d0 kernel/locking/lockdep.c:5007
 lock_acquire kernel/locking/lockdep.c:5666 [inline]
 lock_acquire+0x1ab/0x570 kernel/locking/lockdep.c:5631
 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
 _raw_spin_lock_irqsave+0x39/0x50 kernel/locking/spinlock.c:162
 p9_tag_remove net/9p/client.c:367 [inline]
 p9_req_put net/9p/client.c:375 [inline]
 p9_req_put+0xc6/0x250 net/9p/client.c:372
 req_done+0x1de/0x2e0 net/9p/trans_virtio.c:148
 vring_interrupt drivers/virtio/virtio_ring.c:2454 [inline]
 vring_interrupt+0x29d/0x3d0 drivers/virtio/virtio_ring.c:2429
 __handle_irq_event_percpu+0x227/0x870 kernel/irq/handle.c:158
 handle_irq_event_percpu kernel/irq/handle.c:193 [inline]
 handle_irq_event+0xa7/0x1e0 kernel/irq/handle.c:210
 handle_edge_irq+0x25f/0xd00 kernel/irq/chip.c:819
 generic_handle_irq_desc include/linux/irqdesc.h:158 [inline]
 handle_irq arch/x86/kernel/irq.c:231 [inline]
 __common_interrupt+0x9d/0x210 arch/x86/kernel/irq.c:250
 common_interrupt+0x4d/0xc0 arch/x86/kernel/irq.c:240
 asm_common_interrupt+0x22/0x40 arch/x86/include/asm/idtentry.h:640
RIP: 0010:__do_softirq+0x196/0x9c6 kernel/softirq.c:557
Code: 00 48 01 f0 48 89 44 24 18 48 c7 c7 c0 41 eb 89 e8 5f ff be ff 65 66 c7 05 35 94 43 76 00 00 e8 70 ab c1 f7 fb b8 ff ff ff ff <48> c7 c3 c0 a0 c0 8b 41 0f bc c5 41 89 c7 41 83 c7 01 0f 85 ad 00
RSP: 0018:ffffc90000007f70 EFLAGS: 00000206
RAX: 00000000ffffffff RBX: ffff888066ee2140 RCX: 1ffffffff211cc2e
RDX: 0000000000000000 RSI: 0000000000000102 RDI: 0000000000000000
RBP: ffff8880125241c0 R08: 0000000000000001 R09: ffffffff908d9967
R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
R13: 0000000000000080 R14: 0000000000000000 R15: 0000000000000000
 invoke_softirq kernel/softirq.c:445 [inline]
 __irq_exit_rcu+0x123/0x180 kernel/softirq.c:650
 irq_exit_rcu+0x5/0x20 kernel/softirq.c:662
 sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1106
 </IRQ>
 <TASK>
 asm_sysvec_apic_timer_interrupt+0x16/0x20 arch/x86/include/asm/idtentry.h:649
RIP: 0010:lock_acquire+0x1ef/0x570 kernel/locking/lockdep.c:5634
Code: d2 a3 7e 83 f8 01 0f 85 e8 02 00 00 9c 58 f6 c4 02 0f 85 fb 02 00 00 48 83 7c 24 08 00 74 01 fb 48 b8 00 00 00 00 00 fc ff df <48> 01 c3 48 c7 03 00 00 00 00 48 c7 43 08 00 00 00 00 48 8b 84 24
RSP: 0018:ffffc9000357fa48 EFLAGS: 00000206
RAX: dffffc0000000000 RBX: 1ffff920006aff4b RCX: ffffffff815e513e
RDX: 1ffff1100cddc576 RSI: 0000000000000002 RDI: 0000000000000000
RBP: 0000000000000001 R08: 0000000000000000 R09: ffffffff908d9967
R10: fffffbfff211b32c R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: ffff88801b98ca20 R15: 0000000000000000
 __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
 _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:154
 spin_lock include/linux/spinlock.h:349 [inline]
 exit_fs+0x62/0x170 fs/fs_struct.c:102
 do_exit+0xaa6/0x29b0 kernel/exit.c:791
 do_group_exit+0xd2/0x2f0 kernel/exit.c:925
 get_signal+0x238c/0x2610 kernel/signal.c:2857
 arch_do_signal_or_restart+0x82/0x2300 arch/x86/kernel/signal.c:869
 exit_to_user_mode_loop kernel/entry/common.c:166 [inline]
 exit_to_user_mode_prepare+0x15f/0x250 kernel/entry/common.c:201
 __syscall_exit_to_user_mode_work kernel/entry/common.c:283 [inline]
 syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:294
 do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f0b07489279
Code: Unable to access opcode bytes at RIP 0x7f0b0748924f.
RSP: 002b:00007f0b085a3218 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: fffffffffffffe00 RBX: 00007f0b0759bf88 RCX: 00007f0b07489279
RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007f0b0759bf88
RBP: 00007f0b0759bf80 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f0b0759bf8c
R13: 00007ffdfa1bfd4f R14: 00007f0b085a3300 R15: 0000000000022000
 </TASK>
----------------
Code disassembly (best guess):
   0:	00 48 01             	add    %cl,0x1(%rax)
   3:	f0 48 89 44 24 18    	lock mov %rax,0x18(%rsp)
   9:	48 c7 c7 c0 41 eb 89 	mov    $0xffffffff89eb41c0,%rdi
  10:	e8 5f ff be ff       	callq  0xffbeff74
  15:	65 66 c7 05 35 94 43 	movw   $0x0,%gs:0x76439435(%rip)        # 0x76439454
  1c:	76 00 00
  1f:	e8 70 ab c1 f7       	callq  0xf7c1ab94
  24:	fb                   	sti
  25:	b8 ff ff ff ff       	mov    $0xffffffff,%eax
* 2a:	48 c7 c3 c0 a0 c0 8b 	mov    $0xffffffff8bc0a0c0,%rbx <-- trapping instruction
  31:	41 0f bc c5          	bsf    %r13d,%eax
  35:	41 89 c7             	mov    %eax,%r15d
  38:	41 83 c7 01          	add    $0x1,%r15d
  3c:	0f                   	.byte 0xf
  3d:	85                   	.byte 0x85
  3e:	ad                   	lods   %ds:(%rsi),%eax


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

^ permalink raw reply

* Re: [PATCH bpf-next 2/3] bpf: Add support for writing to nf_conn:mark
From: Toke Høiland-Jørgensen @ 2022-08-15 22:25 UTC (permalink / raw)
  To: Daniel Xu, bpf, ast, daniel, andrii, memxor
  Cc: Daniel Xu, pablo, fw, netfilter-devel, netdev, linux-kernel
In-Reply-To: <f850bb7e20950736d9175c61d7e0691098e06182.1660592020.git.dxu@dxuuu.xyz>

Daniel Xu <dxu@dxuuu.xyz> writes:

> Support direct writes to nf_conn:mark from TC and XDP prog types. This
> is useful when applications want to store per-connection metadata. This
> is also particularly useful for applications that run both bpf and
> iptables/nftables because the latter can trivially access this metadata.
>
> One example use case would be if a bpf prog is responsible for advanced
> packet classification and iptables/nftables is later used for routing
> due to pre-existing/legacy code.
>
> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>

Didn't we agree the last time around that all field access should be
using helper kfuncs instead of allowing direct writes to struct nf_conn?

-Toke

^ permalink raw reply

* Re: [PATCH v3 bpf-next 00/15] bpf: net: Remove duplicated code from bpf_setsockopt()
From: Daniel Borkmann @ 2022-08-15 22:04 UTC (permalink / raw)
  To: sdf, Martin KaFai Lau
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko, David Miller,
	Eric Dumazet, Jakub Kicinski, kernel-team, Paolo Abeni
In-Reply-To: <YvU2md/W4YSlnkBH@google.com>

On 8/11/22 7:04 PM, sdf@google.com wrote:
> On 08/10, Martin KaFai Lau wrote:
>> The code in bpf_setsockopt() is mostly a copy-and-paste from
>> the sock_setsockopt(), do_tcp_setsockopt(), do_ipv6_setsockopt(),
>> and do_ip_setsockopt().  As the allowed optnames in bpf_setsockopt()
>> grows, so are the duplicated code.  The code between the copies
>> also slowly drifted.
> 
>> This set is an effort to clean this up and reuse the existing
>> {sock,do_tcp,do_ipv6,do_ip}_setsockopt() as much as possible.
> 
>> After the clean up, this set also adds a few allowed optnames
>> that we need to the bpf_setsockopt().
> 
>> The initial attempt was to clean up both bpf_setsockopt() and
>> bpf_getsockopt() together.  However, the patch set was getting
>> too long.  It is beneficial to leave the bpf_getsockopt()
>> out for another patch set.  Thus, this set is focusing
>> on the bpf_setsockopt().
> 
>> v3:
>> - s/in_bpf/has_current_bpf_ctx/ (Andrii)
>> - Add comments to has_current_bpf_ctx() and sockopt_lock_sock()
>>    (Stanislav)
>> - Use vmlinux.h in selftest and add defines to bpf_tracing_net.h
>>    (Stanislav)
>> - Use bpf_getsockopt(SO_MARK) in selftest (Stanislav)
>> - Use BPF_CORE_READ_BITFIELD in selftest (Yonghong)
> 
> Reviewed-by: Stanislav Fomichev <sdf@google.com>
> 
> (I didn't go super deep on the selftest)

Looks like that one throws a build error, fwiw:

https://github.com/kernel-patches/bpf/runs/7844497492?check_suite_focus=true

   [...]
     CLNG-BPF [test_maps] kfunc_call_test_subprog.o
     CLNG-BPF [test_maps] bpf_iter_test_kern6.o
   progs/setget_sockopt.c:39:33: error: implicit truncation from 'int' to bit-field changes value from 1 to -1 [-Werror,-Wbitfield-constant-conversion]
           { .opt = SO_REUSEADDR, .flip = 1, },
                                          ^
   progs/setget_sockopt.c:42:33: error: implicit truncation from 'int' to bit-field changes value from 1 to -1 [-Werror,-Wbitfield-constant-conversion]
           { .opt = SO_KEEPALIVE, .flip = 1, },
                                          ^
   progs/setget_sockopt.c:44:33: error: implicit truncation from 'int' to bit-field changes value from 1 to -1 [-Werror,-Wbitfield-constant-conversion]
           { .opt = SO_REUSEPORT, .flip = 1, },
                                          ^
     CLNG-BPF [test_maps] btf__core_reloc_type_id.o
   progs/setget_sockopt.c:48:32: error: implicit truncation from 'int' to bit-field changes value from 1 to -1 [-Werror,-Wbitfield-constant-conversion]
           { .opt = SO_TXREHASH, .flip = 1, },
                                         ^
   progs/setget_sockopt.c:53:32: error: implicit truncation from 'int' to bit-field changes value from 1 to -1 [-Werror,-Wbitfield-constant-conversion]
           { .opt = TCP_NODELAY, .flip = 1, },
                                         ^
   progs/setget_sockopt.c:61:45: error: implicit truncation from 'int' to bit-field changes value from 1 to -1 [-Werror,-Wbitfield-constant-conversion]
           { .opt = TCP_THIN_LINEAR_TIMEOUTS, .flip = 1, },
                                                      ^
   progs/setget_sockopt.c:75:39: error: implicit truncation from 'int' to bit-field changes value from 1 to -1 [-Werror,-Wbitfield-constant-conversion]
           { .opt = IPV6_AUTOFLOWLABEL, .flip = 1, },
                                                ^
   7 errors generated.
   make: *** [Makefile:521: /tmp/runner/work/bpf/bpf/tools/testing/selftests/bpf/setget_sockopt.o] Error 1
   make: *** Waiting for unfinished jobs....
   make: Leaving directory '/tmp/runner/work/bpf/bpf/tools/testing/selftests/bpf'
   Error: Process completed with exit code 2.

^ permalink raw reply

* Re: igc: missing HW timestamps at TX
From: Vinicius Costa Gomes @ 2022-08-15 22:04 UTC (permalink / raw)
  To: Ferenc Fejes, vladimir.oltean@nxp.com
  Cc: marton12050@gmail.com, netdev@vger.kernel.org,
	peti.antal99@gmail.com
In-Reply-To: <1016fb1e514ff38ebfd22c52e2d848a7e8bc8d1a.camel@ericsson.com>

Hi Ferenc,

Ferenc Fejes <ferenc.fejes@ericsson.com> writes:

> Hi Vladimir!
>
> Thank you for the reply!
>
> On Fri, 2022-08-12 at 20:16 +0000, Vladimir Oltean wrote:
>> Hi Ferenc,
>> 
>> On Fri, Aug 12, 2022 at 02:13:52PM +0000, Ferenc Fejes wrote:
>> > Ethtool after the measurement:
>> > ethtool -S enp3s0 | grep hwtstamp
>> >      tx_hwtstamp_timeouts: 1
>> >      tx_hwtstamp_skipped: 419
>> >      rx_hwtstamp_cleared: 0
>> > 
>> > Which is inline with what the isochron see.
>> > 
>> > But thats only happens if I forcingly put the affinity of the
>> > sender
>> > different CPU core than the ptp worker of the igc. If those running
>> > on
>> > the same core I doesnt lost any HW timestam even for 10 million
>> > packets. Worth to mention actually I see many lost timestamp which
>> > confused me a little bit but those are lost because of the small
>> > MSG_ERRQUEUE. When I increased that from few kbytes to 20 mbytes I
>> > got
>> > every timestamp successfully.
>> 
>> I have zero knowledge of Intel hardware. That being said, I've looked
>> at
>> the driver for about 5 minutes, and the design seems to be that where
>> the timestamp is not available in band from the TX completion NAPI as
>> part of BD ring metadata, but rather, a TX timestamp complete is
>> raised,
>> and this results in igc_tsync_interrupt() being called. However there
>> are 2 paths in the driver which call this, one is igc_msix_other()
>> and
>> the other is igc_intr_msi() - this latter one is also the interrupt
>> that
>> triggers the napi_schedule(). It would be interesting to see exactly
>> which MSI-X interrupt is the one that triggers igc_tsync_interrupt().
>> 
>> It's also interesting to understand what you mean precisely by
>> affinity
>> of isochron. It has a main thread (used for PTP monitoring and for TX
>> timestamps) and a pthread for the sending process. The main thread's
>> affinity is controlled via taskset; the sender thread via --cpu-mask.
>
> I just played with those a little. Looks like the --cpu-mask the one it
> helps in my case. For example I checked the CPU core of the
> igc_ptp_tx_work:
>
> # bpftrace -e 'kprobe:igc_ptp_tx_work { printf("%d\n", cpu); exit(); }'
> Attaching 1 probe...
> 0
>
> Looks like its running on core 0, so I run the isochro:
> taskset -c 0 isochron ... --cpu-mask $((1 << 0)) - no lost timestamps
> taskset -c 1 isochron ... --cpu-mask $((1 << 0)) - no lost timestamps
> taskset -c 0 isochron ... --cpu-mask $((1 << 1)) - losing timestamps
> taskset -c 1 isochron ... --cpu-mask $((1 << 1)) - losing timestamps
>
>> Is it the *sender* thread the one who makes the TX timestamps be
>> available quicker to user space, rather than the main thread, who
>> actually dequeues them from the error queue? If so, it might be
>> because
>> the TX packets will trigger the TX completion interrupt, and this
>> will
>> accelerate the processing of the TX timestamps. I'm unclear what
>> happens
>> when the sender thread runs on a different CPU core than the TX
>> timestamp thread.
>
> Well I have no clue unfortunately but your theory makes sense. Vinicius
> might help us out here.
>

I think it's more related to the synchronization (and communication)
overhead of having two CPUs involved. And the fact that the workqueue
runs with low priority (as Vladimir pointed out below) on a different
CPU doesn't help.

>> 
>> Your need to increase the SO_RCVBUF is also interesting. Keep in mind
>> that isochron at that scheduling priority and policy is a CPU hog,
>> and
>> that igc_tsync_interrupt() calls schedule_work() - which uses the
>> system
>> workqueue that runs at a very low priority (this begs the question,
>> how
>> do you know how to match the CPU on which isochron runs with the CPU
>> of
>> the system workqueue?). So isochron, high priority, competes for CPU
>> time with igc_ptp_tx_work(), low priority. One produces data, one
>> consumes it; queues are bound to get full at some point.
>
> Maybe this is what helps in my case? With funccount tracer I checked
> that when the sender thread and igc_ptp_tx_work running on the same
> core, the worker called exactly as many times as many packets I sent.
>
> However if the worker running on different core, funccount show some
> random number (less than the packets sent) and in that case I also lost
> timestamps.
>
> I'm not sure what happening here, maybe the "deferred" scheduling of
> the worker sometimes too slow to enqueue every timestamp into the error
> queue? And because I force both the sender and worker to the same core,
> they executed in order (my system pretty much idle other than these
> processes) introducing some sort of throtthling to the timestamp
> processing?
>
>> On the other hand, other drivers use the ptp_aux_kworker() that the
>> PTP
>> core creates specifically for this purpose. It is a dedicated kthread
>> whose scheduling policy and priority can be adjusted using chrt. I
>> think
>> it would be interesting to see how things behave when you replace
>> schedule_work() with ptp_schedule_worker().
>
> I will try to take a look into that. Anyway, thank you for the
> insights, I'm happy with the way how it works now (at least I can do my
> experiments with that).
>
> Best,
> Ferenc
>

-- 
Vinicius

^ permalink raw reply

* Re: [PATCHv2 bpf-next] libbpf: making bpf_prog_load() ignore name if kernel doesn't support
From: Andrii Nakryiko @ 2022-08-15 21:59 UTC (permalink / raw)
  To: Quentin Monnet
  Cc: Hangbin Liu, netdev, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	bpf
In-Reply-To: <a3c23cfe-061a-1722-8521-26e57b4b2cf4@isovalent.com>

On Mon, Aug 15, 2022 at 8:36 AM Quentin Monnet <quentin@isovalent.com> wrote:
>
> On 13/08/2022 01:09, Hangbin Liu wrote:
> > Similar with commit 10b62d6a38f7 ("libbpf: Add names for auxiliary maps"),
> > let's make bpf_prog_load() also ignore name if kernel doesn't support
> > program name.
> >
> > To achieve this, we need to call sys_bpf_prog_load() directly in
> > probe_kern_prog_name() to avoid circular dependency. sys_bpf_prog_load()
> > also need to be exported in the libbpf_internal.h file.
> >
> > Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> > ---
> > v2: move sys_bpf_prog_load definition to libbpf_internal.h. memset attr
> >     to 0 specifically to aviod padding.
> > ---
> >  tools/lib/bpf/bpf.c             |  6 ++----
> >  tools/lib/bpf/libbpf.c          | 12 ++++++++++--
> >  tools/lib/bpf/libbpf_internal.h |  3 +++
> >  3 files changed, 15 insertions(+), 6 deletions(-)
> >
> > diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> > index 6a96e665dc5d..575867d69496 100644
> > --- a/tools/lib/bpf/bpf.c
> > +++ b/tools/lib/bpf/bpf.c
> > @@ -84,9 +84,7 @@ static inline int sys_bpf_fd(enum bpf_cmd cmd, union bpf_attr *attr,
> >       return ensure_good_fd(fd);
> >  }
> >
> > -#define PROG_LOAD_ATTEMPTS 5
> > -
> > -static inline int sys_bpf_prog_load(union bpf_attr *attr, unsigned int size, int attempts)
> > +int sys_bpf_prog_load(union bpf_attr *attr, unsigned int size, int attempts)
> >  {
> >       int fd;
> >
> > @@ -263,7 +261,7 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
> >       attr.prog_ifindex = OPTS_GET(opts, prog_ifindex, 0);
> >       attr.kern_version = OPTS_GET(opts, kern_version, 0);
> >
> > -     if (prog_name)
> > +     if (prog_name && kernel_supports(NULL, FEAT_PROG_NAME))
> >               libbpf_strlcpy(attr.prog_name, prog_name, sizeof(attr.prog_name));
> >       attr.license = ptr_to_u64(license);
> >
> > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> > index 3f01f5cd8a4c..4a351897bdcc 100644
> > --- a/tools/lib/bpf/libbpf.c
> > +++ b/tools/lib/bpf/libbpf.c
> > @@ -4419,10 +4419,18 @@ static int probe_kern_prog_name(void)
> >               BPF_MOV64_IMM(BPF_REG_0, 0),
> >               BPF_EXIT_INSN(),
> >       };
> > -     int ret, insn_cnt = ARRAY_SIZE(insns);
> > +     union bpf_attr attr;
> > +     int ret;
> > +
> > +     memset(&attr, 0, sizeof(attr));
> > +     attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
> > +     attr.license = ptr_to_u64("GPL");
> > +     attr.insns = ptr_to_u64(insns);
> > +     attr.insn_cnt = (__u32)ARRAY_SIZE(insns);
> > +     libbpf_strlcpy(attr.prog_name, "test", sizeof(attr.prog_name));
> >
> >       /* make sure loading with name works */
> > -     ret = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, "test", "GPL", insns, insn_cnt, NULL);
> > +     ret = sys_bpf_prog_load(&attr, sizeof(attr), PROG_LOAD_ATTEMPTS);
> >       return probe_fd(ret);
> >  }
> >
> > diff --git a/tools/lib/bpf/libbpf_internal.h b/tools/lib/bpf/libbpf_internal.h
> > index 4135ae0a2bc3..377642ff51fc 100644
> > --- a/tools/lib/bpf/libbpf_internal.h
> > +++ b/tools/lib/bpf/libbpf_internal.h
> > @@ -573,4 +573,7 @@ static inline bool is_pow_of_2(size_t x)
> >       return x && (x & (x - 1)) == 0;
> >  }
> >
> > +#define PROG_LOAD_ATTEMPTS 5
> > +int sys_bpf_prog_load(union bpf_attr *attr, unsigned int size, int attempts);
> > +
> >  #endif /* __LIBBPF_LIBBPF_INTERNAL_H */
>
> Looks good to me, thanks!
>
> Acked-by: Quentin Monnet <quentin@isovalent.com>

I did a small adjustment to not fill out entire big bpf_attr union
completely (and added a bit more meaningful "libbpf_nametest" prog
name):

$ git diff --staged
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 4a351897bdcc..f05dd61a8a7f 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -4415,6 +4415,7 @@ static int probe_fd(int fd)

 static int probe_kern_prog_name(void)
 {
+       const size_t attr_sz = offsetofend(union bpf_attr, prog_name);
        struct bpf_insn insns[] = {
                BPF_MOV64_IMM(BPF_REG_0, 0),
                BPF_EXIT_INSN(),
@@ -4422,12 +4423,12 @@ static int probe_kern_prog_name(void)
        union bpf_attr attr;
        int ret;

-       memset(&attr, 0, sizeof(attr));
+       memset(&attr, 0, attr_sz);
        attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
        attr.license = ptr_to_u64("GPL");
        attr.insns = ptr_to_u64(insns);
        attr.insn_cnt = (__u32)ARRAY_SIZE(insns);
-       libbpf_strlcpy(attr.prog_name, "test", sizeof(attr.prog_name));
+       libbpf_strlcpy(attr.prog_name, "libbpf_nametest",
sizeof(attr.prog_name));

Pushed to bpf-next, thanks!

^ permalink raw reply related

* [PATCH v2 5/5] virtio: Revert "virtio: find_vqs() add arg sizes"
From: Michael S. Tsirkin @ 2022-08-15 21:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: Xuan Zhuo, Jason Wang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, virtualization, netdev,
	Linus Torvalds, Jens Axboe, James Bottomley, Martin K. Petersen,
	Guenter Roeck, Greg KH, Richard Weinberger, Anton Ivanov,
	Johannes Berg, Hans de Goede, Mark Gross, Vadim Pasternak,
	Bjorn Andersson, Mathieu Poirier, Cornelia Huck, Halil Pasic,
	Eric Farman, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, linux-um,
	platform-driver-x86, linux-remoteproc, linux-s390, kvm
In-Reply-To: <20220815215251.154451-1-mst@redhat.com>

This reverts commit a10fba0377145fccefea4dc4dd5915b7ed87e546: the
proposed API isn't supported on all transports but no
effort was made to address this.

It might not be hard to fix if we want to: maybe just
rename size to size_hint and make sure legacy
transports ignore the hint.

But it's not sure what the benefit is in any case, so
let's drop it.

Fixes: a10fba037714 ("virtio: find_vqs() add arg sizes")
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 arch/um/drivers/virtio_uml.c             |  2 +-
 drivers/platform/mellanox/mlxbf-tmfifo.c |  1 -
 drivers/remoteproc/remoteproc_virtio.c   |  1 -
 drivers/s390/virtio/virtio_ccw.c         |  1 -
 drivers/virtio/virtio_mmio.c             |  1 -
 drivers/virtio/virtio_pci_common.c       |  2 +-
 drivers/virtio/virtio_pci_common.h       |  2 +-
 drivers/virtio/virtio_pci_modern.c       |  7 ++-----
 drivers/virtio/virtio_vdpa.c             |  1 -
 include/linux/virtio_config.h            | 14 +++++---------
 10 files changed, 10 insertions(+), 22 deletions(-)

diff --git a/arch/um/drivers/virtio_uml.c b/arch/um/drivers/virtio_uml.c
index 79e38afd4b91..e719af8bdf56 100644
--- a/arch/um/drivers/virtio_uml.c
+++ b/arch/um/drivers/virtio_uml.c
@@ -1011,7 +1011,7 @@ static struct virtqueue *vu_setup_vq(struct virtio_device *vdev,
 
 static int vu_find_vqs(struct virtio_device *vdev, unsigned nvqs,
 		       struct virtqueue *vqs[], vq_callback_t *callbacks[],
-		       const char * const names[], u32 sizes[], const bool *ctx,
+		       const char * const names[], const bool *ctx,
 		       struct irq_affinity *desc)
 {
 	struct virtio_uml_device *vu_dev = to_virtio_uml_device(vdev);
diff --git a/drivers/platform/mellanox/mlxbf-tmfifo.c b/drivers/platform/mellanox/mlxbf-tmfifo.c
index 8be13d416f48..1ae3c56b66b0 100644
--- a/drivers/platform/mellanox/mlxbf-tmfifo.c
+++ b/drivers/platform/mellanox/mlxbf-tmfifo.c
@@ -928,7 +928,6 @@ static int mlxbf_tmfifo_virtio_find_vqs(struct virtio_device *vdev,
 					struct virtqueue *vqs[],
 					vq_callback_t *callbacks[],
 					const char * const names[],
-					u32 sizes[],
 					const bool *ctx,
 					struct irq_affinity *desc)
 {
diff --git a/drivers/remoteproc/remoteproc_virtio.c b/drivers/remoteproc/remoteproc_virtio.c
index 81c4f5776109..0f7706e23eb9 100644
--- a/drivers/remoteproc/remoteproc_virtio.c
+++ b/drivers/remoteproc/remoteproc_virtio.c
@@ -158,7 +158,6 @@ static int rproc_virtio_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
 				 struct virtqueue *vqs[],
 				 vq_callback_t *callbacks[],
 				 const char * const names[],
-				 u32 sizes[],
 				 const bool * ctx,
 				 struct irq_affinity *desc)
 {
diff --git a/drivers/s390/virtio/virtio_ccw.c b/drivers/s390/virtio/virtio_ccw.c
index 896896e32664..a10dbe632ef9 100644
--- a/drivers/s390/virtio/virtio_ccw.c
+++ b/drivers/s390/virtio/virtio_ccw.c
@@ -637,7 +637,6 @@ static int virtio_ccw_find_vqs(struct virtio_device *vdev, unsigned nvqs,
 			       struct virtqueue *vqs[],
 			       vq_callback_t *callbacks[],
 			       const char * const names[],
-			       u32 sizes[],
 			       const bool *ctx,
 			       struct irq_affinity *desc)
 {
diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index dfcecfd7aba1..3ff746e3f24a 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -474,7 +474,6 @@ static int vm_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
 		       struct virtqueue *vqs[],
 		       vq_callback_t *callbacks[],
 		       const char * const names[],
-		       u32 sizes[],
 		       const bool *ctx,
 		       struct irq_affinity *desc)
 {
diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c
index 7ad734584823..ad258a9d3b9f 100644
--- a/drivers/virtio/virtio_pci_common.c
+++ b/drivers/virtio/virtio_pci_common.c
@@ -396,7 +396,7 @@ static int vp_find_vqs_intx(struct virtio_device *vdev, unsigned int nvqs,
 /* the config->find_vqs() implementation */
 int vp_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
 		struct virtqueue *vqs[], vq_callback_t *callbacks[],
-		const char * const names[], u32 sizes[], const bool *ctx,
+		const char * const names[], const bool *ctx,
 		struct irq_affinity *desc)
 {
 	int err;
diff --git a/drivers/virtio/virtio_pci_common.h b/drivers/virtio/virtio_pci_common.h
index a5ff838b85a5..23112d84218f 100644
--- a/drivers/virtio/virtio_pci_common.h
+++ b/drivers/virtio/virtio_pci_common.h
@@ -110,7 +110,7 @@ void vp_del_vqs(struct virtio_device *vdev);
 /* the config->find_vqs() implementation */
 int vp_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
 		struct virtqueue *vqs[], vq_callback_t *callbacks[],
-		const char * const names[], u32 sizes[], const bool *ctx,
+		const char * const names[], const bool *ctx,
 		struct irq_affinity *desc);
 const char *vp_bus_name(struct virtio_device *vdev);
 
diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index be51ec849252..c3b9f2761849 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -347,15 +347,12 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
 static int vp_modern_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
 			      struct virtqueue *vqs[],
 			      vq_callback_t *callbacks[],
-			      const char * const names[],
-			      u32 sizes[],
-			      const bool *ctx,
+			      const char * const names[], const bool *ctx,
 			      struct irq_affinity *desc)
 {
 	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
 	struct virtqueue *vq;
-	int rc = vp_find_vqs(vdev, nvqs, vqs, callbacks, names, sizes, ctx,
-			     desc);
+	int rc = vp_find_vqs(vdev, nvqs, vqs, callbacks, names, ctx, desc);
 
 	if (rc)
 		return rc;
diff --git a/drivers/virtio/virtio_vdpa.c b/drivers/virtio/virtio_vdpa.c
index 9bc4d110b800..c6b9b5062043 100644
--- a/drivers/virtio/virtio_vdpa.c
+++ b/drivers/virtio/virtio_vdpa.c
@@ -272,7 +272,6 @@ static int virtio_vdpa_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
 				struct virtqueue *vqs[],
 				vq_callback_t *callbacks[],
 				const char * const names[],
-				u32 sizes[],
 				const bool *ctx,
 				struct irq_affinity *desc)
 {
diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index 888f7e96f0c7..36ec7be1f480 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -55,7 +55,6 @@ struct virtio_shm_region {
  *		include a NULL entry for vqs that do not need a callback
  *	names: array of virtqueue names (mainly for debugging)
  *		include a NULL entry for vqs unused by driver
- *	sizes: array of virtqueue sizes
  *	Returns 0 on success or error status
  * @del_vqs: free virtqueues found by find_vqs().
  * @synchronize_cbs: synchronize with the virtqueue callbacks (optional)
@@ -104,9 +103,7 @@ struct virtio_config_ops {
 	void (*reset)(struct virtio_device *vdev);
 	int (*find_vqs)(struct virtio_device *, unsigned nvqs,
 			struct virtqueue *vqs[], vq_callback_t *callbacks[],
-			const char * const names[],
-			u32 sizes[],
-			const bool *ctx,
+			const char * const names[], const bool *ctx,
 			struct irq_affinity *desc);
 	void (*del_vqs)(struct virtio_device *);
 	void (*synchronize_cbs)(struct virtio_device *);
@@ -215,7 +212,7 @@ struct virtqueue *virtio_find_single_vq(struct virtio_device *vdev,
 	const char *names[] = { n };
 	struct virtqueue *vq;
 	int err = vdev->config->find_vqs(vdev, 1, &vq, callbacks, names, NULL,
-					 NULL, NULL);
+					 NULL);
 	if (err < 0)
 		return ERR_PTR(err);
 	return vq;
@@ -227,8 +224,7 @@ int virtio_find_vqs(struct virtio_device *vdev, unsigned nvqs,
 			const char * const names[],
 			struct irq_affinity *desc)
 {
-	return vdev->config->find_vqs(vdev, nvqs, vqs, callbacks, names, NULL,
-				      NULL, desc);
+	return vdev->config->find_vqs(vdev, nvqs, vqs, callbacks, names, NULL, desc);
 }
 
 static inline
@@ -237,8 +233,8 @@ int virtio_find_vqs_ctx(struct virtio_device *vdev, unsigned nvqs,
 			const char * const names[], const bool *ctx,
 			struct irq_affinity *desc)
 {
-	return vdev->config->find_vqs(vdev, nvqs, vqs, callbacks, names, NULL,
-				      ctx, desc);
+	return vdev->config->find_vqs(vdev, nvqs, vqs, callbacks, names, ctx,
+				      desc);
 }
 
 /**
-- 
MST


^ permalink raw reply related

* [PATCH v2 3/5] virtio-mmio: Revert "virtio_mmio: support the arg sizes of find_vqs()"
From: Michael S. Tsirkin @ 2022-08-15 21:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: Xuan Zhuo, Jason Wang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, virtualization, netdev,
	Linus Torvalds, Jens Axboe, James Bottomley, Martin K. Petersen,
	Guenter Roeck, Greg KH
In-Reply-To: <20220815215251.154451-1-mst@redhat.com>

This reverts commit fbed86abba6e0472d98079790e58060e4332608a.
The API is now unused, let's not carry dead code around.

Fixes: fbed86abba6e ("virtio_mmio: support the arg sizes of find_vqs()")
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/virtio/virtio_mmio.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index c492a57531c6..dfcecfd7aba1 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -360,7 +360,7 @@ static void vm_synchronize_cbs(struct virtio_device *vdev)
 
 static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned int index,
 				  void (*callback)(struct virtqueue *vq),
-				  const char *name, u32 size, bool ctx)
+				  const char *name, bool ctx)
 {
 	struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev);
 	struct virtio_mmio_vq_info *info;
@@ -395,11 +395,8 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned int in
 		goto error_new_virtqueue;
 	}
 
-	if (!size || size > num)
-		size = num;
-
 	/* Create the vring */
-	vq = vring_create_virtqueue(index, size, VIRTIO_MMIO_VRING_ALIGN, vdev,
+	vq = vring_create_virtqueue(index, num, VIRTIO_MMIO_VRING_ALIGN, vdev,
 				 true, true, ctx, vm_notify, callback, name);
 	if (!vq) {
 		err = -ENOMEM;
@@ -503,7 +500,6 @@ static int vm_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
 		}
 
 		vqs[i] = vm_setup_vq(vdev, queue_idx++, callbacks[i], names[i],
-				     sizes ? sizes[i] : 0,
 				     ctx ? ctx[i] : false);
 		if (IS_ERR(vqs[i])) {
 			vm_del_vqs(vdev);
-- 
MST


^ permalink raw reply related

* [PATCH v2 4/5] virtio_pci: Revert "virtio_pci: support the arg sizes of find_vqs()"
From: Michael S. Tsirkin @ 2022-08-15 21:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: Xuan Zhuo, Jason Wang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, virtualization, netdev,
	Linus Torvalds, Jens Axboe, James Bottomley, Martin K. Petersen,
	Guenter Roeck, Greg KH, Andres Freund
In-Reply-To: <20220815215251.154451-1-mst@redhat.com>

This reverts commit cdb44806fca2d0ad29ca644cbf1505433902ee0c: the legacy
path is wrong and in fact can not support the proposed API since for a
legacy device we never communicate the vq size to the hypervisor.

Reported-by: Andres Freund <andres@anarazel.de>
Fixes: cdb44806fca2 ("virtio_pci: support the arg sizes of find_vqs()")
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/virtio/virtio_pci_common.c | 18 ++++++++----------
 drivers/virtio/virtio_pci_common.h |  1 -
 drivers/virtio/virtio_pci_legacy.c |  6 +-----
 drivers/virtio/virtio_pci_modern.c | 10 +++-------
 4 files changed, 12 insertions(+), 23 deletions(-)

diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c
index 00ad476a815d..7ad734584823 100644
--- a/drivers/virtio/virtio_pci_common.c
+++ b/drivers/virtio/virtio_pci_common.c
@@ -174,7 +174,6 @@ static int vp_request_msix_vectors(struct virtio_device *vdev, int nvectors,
 static struct virtqueue *vp_setup_vq(struct virtio_device *vdev, unsigned int index,
 				     void (*callback)(struct virtqueue *vq),
 				     const char *name,
-				     u32 size,
 				     bool ctx,
 				     u16 msix_vec)
 {
@@ -187,7 +186,7 @@ static struct virtqueue *vp_setup_vq(struct virtio_device *vdev, unsigned int in
 	if (!info)
 		return ERR_PTR(-ENOMEM);
 
-	vq = vp_dev->setup_vq(vp_dev, info, index, callback, name, size, ctx,
+	vq = vp_dev->setup_vq(vp_dev, info, index, callback, name, ctx,
 			      msix_vec);
 	if (IS_ERR(vq))
 		goto out_info;
@@ -284,7 +283,7 @@ void vp_del_vqs(struct virtio_device *vdev)
 
 static int vp_find_vqs_msix(struct virtio_device *vdev, unsigned int nvqs,
 		struct virtqueue *vqs[], vq_callback_t *callbacks[],
-		const char * const names[], u32 sizes[], bool per_vq_vectors,
+		const char * const names[], bool per_vq_vectors,
 		const bool *ctx,
 		struct irq_affinity *desc)
 {
@@ -327,8 +326,8 @@ static int vp_find_vqs_msix(struct virtio_device *vdev, unsigned int nvqs,
 		else
 			msix_vec = VP_MSIX_VQ_VECTOR;
 		vqs[i] = vp_setup_vq(vdev, queue_idx++, callbacks[i], names[i],
-				     sizes ? sizes[i] : 0,
-				     ctx ? ctx[i] : false, msix_vec);
+				     ctx ? ctx[i] : false,
+				     msix_vec);
 		if (IS_ERR(vqs[i])) {
 			err = PTR_ERR(vqs[i]);
 			goto error_find;
@@ -358,7 +357,7 @@ static int vp_find_vqs_msix(struct virtio_device *vdev, unsigned int nvqs,
 
 static int vp_find_vqs_intx(struct virtio_device *vdev, unsigned int nvqs,
 		struct virtqueue *vqs[], vq_callback_t *callbacks[],
-		const char * const names[], u32 sizes[], const bool *ctx)
+		const char * const names[], const bool *ctx)
 {
 	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
 	int i, err, queue_idx = 0;
@@ -380,7 +379,6 @@ static int vp_find_vqs_intx(struct virtio_device *vdev, unsigned int nvqs,
 			continue;
 		}
 		vqs[i] = vp_setup_vq(vdev, queue_idx++, callbacks[i], names[i],
-				     sizes ? sizes[i] : 0,
 				     ctx ? ctx[i] : false,
 				     VIRTIO_MSI_NO_VECTOR);
 		if (IS_ERR(vqs[i])) {
@@ -404,15 +402,15 @@ int vp_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
 	int err;
 
 	/* Try MSI-X with one vector per queue. */
-	err = vp_find_vqs_msix(vdev, nvqs, vqs, callbacks, names, sizes, true, ctx, desc);
+	err = vp_find_vqs_msix(vdev, nvqs, vqs, callbacks, names, true, ctx, desc);
 	if (!err)
 		return 0;
 	/* Fallback: MSI-X with one vector for config, one shared for queues. */
-	err = vp_find_vqs_msix(vdev, nvqs, vqs, callbacks, names, sizes, false, ctx, desc);
+	err = vp_find_vqs_msix(vdev, nvqs, vqs, callbacks, names, false, ctx, desc);
 	if (!err)
 		return 0;
 	/* Finally fall back to regular interrupts. */
-	return vp_find_vqs_intx(vdev, nvqs, vqs, callbacks, names, sizes, ctx);
+	return vp_find_vqs_intx(vdev, nvqs, vqs, callbacks, names, ctx);
 }
 
 const char *vp_bus_name(struct virtio_device *vdev)
diff --git a/drivers/virtio/virtio_pci_common.h b/drivers/virtio/virtio_pci_common.h
index c0448378b698..a5ff838b85a5 100644
--- a/drivers/virtio/virtio_pci_common.h
+++ b/drivers/virtio/virtio_pci_common.h
@@ -80,7 +80,6 @@ struct virtio_pci_device {
 				      unsigned int idx,
 				      void (*callback)(struct virtqueue *vq),
 				      const char *name,
-				      u32 size,
 				      bool ctx,
 				      u16 msix_vec);
 	void (*del_vq)(struct virtio_pci_vq_info *info);
diff --git a/drivers/virtio/virtio_pci_legacy.c b/drivers/virtio/virtio_pci_legacy.c
index d75e5c4e637f..2257f1b3d8ae 100644
--- a/drivers/virtio/virtio_pci_legacy.c
+++ b/drivers/virtio/virtio_pci_legacy.c
@@ -112,7 +112,6 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
 				  unsigned int index,
 				  void (*callback)(struct virtqueue *vq),
 				  const char *name,
-				  u32 size,
 				  bool ctx,
 				  u16 msix_vec)
 {
@@ -126,13 +125,10 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
 	if (!num || vp_legacy_get_queue_enable(&vp_dev->ldev, index))
 		return ERR_PTR(-ENOENT);
 
-	if (!size || size > num)
-		size = num;
-
 	info->msix_vector = msix_vec;
 
 	/* create the vring */
-	vq = vring_create_virtqueue(index, size,
+	vq = vring_create_virtqueue(index, num,
 				    VIRTIO_PCI_VRING_ALIGN, &vp_dev->vdev,
 				    true, false, ctx,
 				    vp_notify, callback, name);
diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index f7965c5dd36b..be51ec849252 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -293,7 +293,6 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
 				  unsigned int index,
 				  void (*callback)(struct virtqueue *vq),
 				  const char *name,
-				  u32 size,
 				  bool ctx,
 				  u16 msix_vec)
 {
@@ -311,18 +310,15 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
 	if (!num || vp_modern_get_queue_enable(mdev, index))
 		return ERR_PTR(-ENOENT);
 
-	if (!size || size > num)
-		size = num;
-
-	if (size & (size - 1)) {
-		dev_warn(&vp_dev->pci_dev->dev, "bad queue size %u", size);
+	if (num & (num - 1)) {
+		dev_warn(&vp_dev->pci_dev->dev, "bad queue size %u", num);
 		return ERR_PTR(-EINVAL);
 	}
 
 	info->msix_vector = msix_vec;
 
 	/* create the vring */
-	vq = vring_create_virtqueue(index, size,
+	vq = vring_create_virtqueue(index, num,
 				    SMP_CACHE_BYTES, &vp_dev->vdev,
 				    true, true, ctx,
 				    vp_notify, callback, name);
-- 
MST


^ permalink raw reply related

* [PATCH v2 2/5] virtio: Revert "virtio: add helper virtio_find_vqs_ctx_size()"
From: Michael S. Tsirkin @ 2022-08-15 21:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: Xuan Zhuo, Jason Wang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, virtualization, netdev,
	Linus Torvalds, Jens Axboe, James Bottomley, Martin K. Petersen,
	Guenter Roeck, Greg KH
In-Reply-To: <20220815215251.154451-1-mst@redhat.com>

This reverts commit fe3dc04e31aa51f91dc7f741a5f76cc4817eb5b4: the
API is now unused and in fact can't be implemented on top of a legacy
device.

Fixes: fe3dc04e31aa ("virtio: add helper virtio_find_vqs_ctx_size()")
Cc: "Xuan Zhuo" <xuanzhuo@linux.alibaba.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/virtio_config.h | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index 6adff09f7170..888f7e96f0c7 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -241,18 +241,6 @@ int virtio_find_vqs_ctx(struct virtio_device *vdev, unsigned nvqs,
 				      ctx, desc);
 }
 
-static inline
-int virtio_find_vqs_ctx_size(struct virtio_device *vdev, u32 nvqs,
-			     struct virtqueue *vqs[],
-			     vq_callback_t *callbacks[],
-			     const char * const names[],
-			     u32 sizes[],
-			     const bool *ctx, struct irq_affinity *desc)
-{
-	return vdev->config->find_vqs(vdev, nvqs, vqs, callbacks, names, sizes,
-				      ctx, desc);
-}
-
 /**
  * virtio_synchronize_cbs - synchronize with virtqueue callbacks
  * @vdev: the device
-- 
MST


^ permalink raw reply related

* [PATCH v2 1/1] virtio: Revert "virtio: find_vqs() add arg sizes"
From: Michael S. Tsirkin @ 2022-08-15 21:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: Xuan Zhuo, Jason Wang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, virtualization, netdev,
	Linus Torvalds, Jens Axboe, James Bottomley, Martin K. Petersen,
	Guenter Roeck, Greg KH, Richard Weinberger, Anton Ivanov,
	Johannes Berg, Hans de Goede, Mark Gross, Vadim Pasternak,
	Bjorn Andersson, Mathieu Poirier, Cornelia Huck, Halil Pasic,
	Eric Farman, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, linux-um,
	platform-driver-x86, linux-remoteproc, linux-s390, kvm
In-Reply-To: <20220815215251.154451-1-mst@redhat.com>

This reverts commit a10fba0377145fccefea4dc4dd5915b7ed87e546: the
proposed API isn't supported on all transports but no
effort was made to address this.

It might not be hard to fix if we want to: maybe just
rename size to size_hint and make sure legacy
transports ignore the hint.

But it's not sure what the benefit is in any case, so
let's drop it.

Fixes: a10fba037714 ("virtio: find_vqs() add arg sizes")
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

diff --git a/arch/um/drivers/virtio_uml.c b/arch/um/drivers/virtio_uml.c
index 79e38afd4b91..e719af8bdf56 100644
--- a/arch/um/drivers/virtio_uml.c
+++ b/arch/um/drivers/virtio_uml.c
@@ -1011,7 +1011,7 @@ static struct virtqueue *vu_setup_vq(struct virtio_device *vdev,
 
 static int vu_find_vqs(struct virtio_device *vdev, unsigned nvqs,
 		       struct virtqueue *vqs[], vq_callback_t *callbacks[],
-		       const char * const names[], u32 sizes[], const bool *ctx,
+		       const char * const names[], const bool *ctx,
 		       struct irq_affinity *desc)
 {
 	struct virtio_uml_device *vu_dev = to_virtio_uml_device(vdev);
diff --git a/drivers/platform/mellanox/mlxbf-tmfifo.c b/drivers/platform/mellanox/mlxbf-tmfifo.c
index 8be13d416f48..1ae3c56b66b0 100644
--- a/drivers/platform/mellanox/mlxbf-tmfifo.c
+++ b/drivers/platform/mellanox/mlxbf-tmfifo.c
@@ -928,7 +928,6 @@ static int mlxbf_tmfifo_virtio_find_vqs(struct virtio_device *vdev,
 					struct virtqueue *vqs[],
 					vq_callback_t *callbacks[],
 					const char * const names[],
-					u32 sizes[],
 					const bool *ctx,
 					struct irq_affinity *desc)
 {
diff --git a/drivers/remoteproc/remoteproc_virtio.c b/drivers/remoteproc/remoteproc_virtio.c
index 81c4f5776109..0f7706e23eb9 100644
--- a/drivers/remoteproc/remoteproc_virtio.c
+++ b/drivers/remoteproc/remoteproc_virtio.c
@@ -158,7 +158,6 @@ static int rproc_virtio_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
 				 struct virtqueue *vqs[],
 				 vq_callback_t *callbacks[],
 				 const char * const names[],
-				 u32 sizes[],
 				 const bool * ctx,
 				 struct irq_affinity *desc)
 {
diff --git a/drivers/s390/virtio/virtio_ccw.c b/drivers/s390/virtio/virtio_ccw.c
index 896896e32664..a10dbe632ef9 100644
--- a/drivers/s390/virtio/virtio_ccw.c
+++ b/drivers/s390/virtio/virtio_ccw.c
@@ -637,7 +637,6 @@ static int virtio_ccw_find_vqs(struct virtio_device *vdev, unsigned nvqs,
 			       struct virtqueue *vqs[],
 			       vq_callback_t *callbacks[],
 			       const char * const names[],
-			       u32 sizes[],
 			       const bool *ctx,
 			       struct irq_affinity *desc)
 {
diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index dfcecfd7aba1..3ff746e3f24a 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -474,7 +474,6 @@ static int vm_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
 		       struct virtqueue *vqs[],
 		       vq_callback_t *callbacks[],
 		       const char * const names[],
-		       u32 sizes[],
 		       const bool *ctx,
 		       struct irq_affinity *desc)
 {
diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c
index 7ad734584823..ad258a9d3b9f 100644
--- a/drivers/virtio/virtio_pci_common.c
+++ b/drivers/virtio/virtio_pci_common.c
@@ -396,7 +396,7 @@ static int vp_find_vqs_intx(struct virtio_device *vdev, unsigned int nvqs,
 /* the config->find_vqs() implementation */
 int vp_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
 		struct virtqueue *vqs[], vq_callback_t *callbacks[],
-		const char * const names[], u32 sizes[], const bool *ctx,
+		const char * const names[], const bool *ctx,
 		struct irq_affinity *desc)
 {
 	int err;
diff --git a/drivers/virtio/virtio_pci_common.h b/drivers/virtio/virtio_pci_common.h
index a5ff838b85a5..23112d84218f 100644
--- a/drivers/virtio/virtio_pci_common.h
+++ b/drivers/virtio/virtio_pci_common.h
@@ -110,7 +110,7 @@ void vp_del_vqs(struct virtio_device *vdev);
 /* the config->find_vqs() implementation */
 int vp_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
 		struct virtqueue *vqs[], vq_callback_t *callbacks[],
-		const char * const names[], u32 sizes[], const bool *ctx,
+		const char * const names[], const bool *ctx,
 		struct irq_affinity *desc);
 const char *vp_bus_name(struct virtio_device *vdev);
 
diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index be51ec849252..c3b9f2761849 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -347,15 +347,12 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
 static int vp_modern_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
 			      struct virtqueue *vqs[],
 			      vq_callback_t *callbacks[],
-			      const char * const names[],
-			      u32 sizes[],
-			      const bool *ctx,
+			      const char * const names[], const bool *ctx,
 			      struct irq_affinity *desc)
 {
 	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
 	struct virtqueue *vq;
-	int rc = vp_find_vqs(vdev, nvqs, vqs, callbacks, names, sizes, ctx,
-			     desc);
+	int rc = vp_find_vqs(vdev, nvqs, vqs, callbacks, names, ctx, desc);
 
 	if (rc)
 		return rc;
diff --git a/drivers/virtio/virtio_vdpa.c b/drivers/virtio/virtio_vdpa.c
index 9bc4d110b800..c6b9b5062043 100644
--- a/drivers/virtio/virtio_vdpa.c
+++ b/drivers/virtio/virtio_vdpa.c
@@ -272,7 +272,6 @@ static int virtio_vdpa_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
 				struct virtqueue *vqs[],
 				vq_callback_t *callbacks[],
 				const char * const names[],
-				u32 sizes[],
 				const bool *ctx,
 				struct irq_affinity *desc)
 {
diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index 888f7e96f0c7..36ec7be1f480 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -55,7 +55,6 @@ struct virtio_shm_region {
  *		include a NULL entry for vqs that do not need a callback
  *	names: array of virtqueue names (mainly for debugging)
  *		include a NULL entry for vqs unused by driver
- *	sizes: array of virtqueue sizes
  *	Returns 0 on success or error status
  * @del_vqs: free virtqueues found by find_vqs().
  * @synchronize_cbs: synchronize with the virtqueue callbacks (optional)
@@ -104,9 +103,7 @@ struct virtio_config_ops {
 	void (*reset)(struct virtio_device *vdev);
 	int (*find_vqs)(struct virtio_device *, unsigned nvqs,
 			struct virtqueue *vqs[], vq_callback_t *callbacks[],
-			const char * const names[],
-			u32 sizes[],
-			const bool *ctx,
+			const char * const names[], const bool *ctx,
 			struct irq_affinity *desc);
 	void (*del_vqs)(struct virtio_device *);
 	void (*synchronize_cbs)(struct virtio_device *);
@@ -215,7 +212,7 @@ struct virtqueue *virtio_find_single_vq(struct virtio_device *vdev,
 	const char *names[] = { n };
 	struct virtqueue *vq;
 	int err = vdev->config->find_vqs(vdev, 1, &vq, callbacks, names, NULL,
-					 NULL, NULL);
+					 NULL);
 	if (err < 0)
 		return ERR_PTR(err);
 	return vq;
@@ -227,8 +224,7 @@ int virtio_find_vqs(struct virtio_device *vdev, unsigned nvqs,
 			const char * const names[],
 			struct irq_affinity *desc)
 {
-	return vdev->config->find_vqs(vdev, nvqs, vqs, callbacks, names, NULL,
-				      NULL, desc);
+	return vdev->config->find_vqs(vdev, nvqs, vqs, callbacks, names, NULL, desc);
 }
 
 static inline
@@ -237,8 +233,8 @@ int virtio_find_vqs_ctx(struct virtio_device *vdev, unsigned nvqs,
 			const char * const names[], const bool *ctx,
 			struct irq_affinity *desc)
 {
-	return vdev->config->find_vqs(vdev, nvqs, vqs, callbacks, names, NULL,
-				      ctx, desc);
+	return vdev->config->find_vqs(vdev, nvqs, vqs, callbacks, names, ctx,
+				      desc);
 }
 
 /**
-- 
MST


^ permalink raw reply related

* [PATCH v2 1/5] virtio_net: Revert "virtio_net: set the default max ring size by find_vqs()"
From: Michael S. Tsirkin @ 2022-08-15 21:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: Xuan Zhuo, Jason Wang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, virtualization, netdev,
	Linus Torvalds, Jens Axboe, James Bottomley, Martin K. Petersen,
	Guenter Roeck, Greg KH, Andres Freund
In-Reply-To: <20220815215251.154451-1-mst@redhat.com>

This reverts commit 762faee5a2678559d3dc09d95f8f2c54cd0466a7.

This has been reported to trip up guests on GCP (Google Cloud).
The reason is that virtio_find_vqs_ctx_size is broken on legacy
devices. We can in theory fix virtio_find_vqs_ctx_size but
in fact the patch itself has several other issues:

- It treats unknown speed as < 10G
- It leaves userspace no way to find out the ring size set by hypervisor
- It tests speed when link is down
- It ignores the virtio spec advice:
        Both \field{speed} and \field{duplex} can change, thus the driver
        is expected to re-read these values after receiving a
        configuration change notification.
- It is not clear the performance impact has been tested properly

Revert the patch for now.

Reported-by: Andres Freund <andres@anarazel.de>
Link: https://lore.kernel.org/r/20220814212610.GA3690074%40roeck-us.net
Link: https://lore.kernel.org/r/20220815070203.plwjx7b3cyugpdt7%40awork3.anarazel.de
Link: https://lore.kernel.org/r/3df6bb82-1951-455d-a768-e9e1513eb667%40www.fastmail.com
Link: https://lore.kernel.org/r/FCDC5DDE-3CDD-4B8A-916F-CA7D87B547CE%40anarazel.de
Fixes: 762faee5a267 ("virtio_net: set the default max ring size by find_vqs()")
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Andres Freund <andres@anarazel.de>
Tested-by: From: Guenter Roeck <linux@roeck-us.net>
---
 drivers/net/virtio_net.c | 42 ++++------------------------------------
 1 file changed, 4 insertions(+), 38 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index d934774e9733..ece00b84e3a7 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3432,29 +3432,6 @@ static unsigned int mergeable_min_buf_len(struct virtnet_info *vi, struct virtqu
 		   (unsigned int)GOOD_PACKET_LEN);
 }
 
-static void virtnet_config_sizes(struct virtnet_info *vi, u32 *sizes)
-{
-	u32 i, rx_size, tx_size;
-
-	if (vi->speed == SPEED_UNKNOWN || vi->speed < SPEED_10000) {
-		rx_size = 1024;
-		tx_size = 1024;
-
-	} else if (vi->speed < SPEED_40000) {
-		rx_size = 1024 * 4;
-		tx_size = 1024 * 4;
-
-	} else {
-		rx_size = 1024 * 8;
-		tx_size = 1024 * 8;
-	}
-
-	for (i = 0; i < vi->max_queue_pairs; i++) {
-		sizes[rxq2vq(i)] = rx_size;
-		sizes[txq2vq(i)] = tx_size;
-	}
-}
-
 static int virtnet_find_vqs(struct virtnet_info *vi)
 {
 	vq_callback_t **callbacks;
@@ -3462,7 +3439,6 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
 	int ret = -ENOMEM;
 	int i, total_vqs;
 	const char **names;
-	u32 *sizes;
 	bool *ctx;
 
 	/* We expect 1 RX virtqueue followed by 1 TX virtqueue, followed by
@@ -3490,15 +3466,10 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
 		ctx = NULL;
 	}
 
-	sizes = kmalloc_array(total_vqs, sizeof(*sizes), GFP_KERNEL);
-	if (!sizes)
-		goto err_sizes;
-
 	/* Parameters for control virtqueue, if any */
 	if (vi->has_cvq) {
 		callbacks[total_vqs - 1] = NULL;
 		names[total_vqs - 1] = "control";
-		sizes[total_vqs - 1] = 64;
 	}
 
 	/* Allocate/initialize parameters for send/receive virtqueues */
@@ -3513,10 +3484,8 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
 			ctx[rxq2vq(i)] = true;
 	}
 
-	virtnet_config_sizes(vi, sizes);
-
-	ret = virtio_find_vqs_ctx_size(vi->vdev, total_vqs, vqs, callbacks,
-				       names, sizes, ctx, NULL);
+	ret = virtio_find_vqs_ctx(vi->vdev, total_vqs, vqs, callbacks,
+				  names, ctx, NULL);
 	if (ret)
 		goto err_find;
 
@@ -3536,8 +3505,6 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
 
 
 err_find:
-	kfree(sizes);
-err_sizes:
 	kfree(ctx);
 err_ctx:
 	kfree(names);
@@ -3897,9 +3864,6 @@ static int virtnet_probe(struct virtio_device *vdev)
 		vi->curr_queue_pairs = num_online_cpus();
 	vi->max_queue_pairs = max_queue_pairs;
 
-	virtnet_init_settings(dev);
-	virtnet_update_settings(vi);
-
 	/* Allocate/initialize the rx/tx queues, and invoke find_vqs */
 	err = init_vqs(vi);
 	if (err)
@@ -3912,6 +3876,8 @@ static int virtnet_probe(struct virtio_device *vdev)
 	netif_set_real_num_tx_queues(dev, vi->curr_queue_pairs);
 	netif_set_real_num_rx_queues(dev, vi->curr_queue_pairs);
 
+	virtnet_init_settings(dev);
+
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_STANDBY)) {
 		vi->failover = net_failover_create(vi->dev);
 		if (IS_ERR(vi->failover)) {
-- 
MST


^ permalink raw reply related

* [PATCH v2 1/1] virtio: kerneldocs fixes and enhancements
From: Michael S. Tsirkin @ 2022-08-15 21:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: Xuan Zhuo, Jason Wang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, virtualization, netdev,
	Linus Torvalds, Jens Axboe, James Bottomley, Martin K. Petersen,
	Guenter Roeck, Greg KH, Ricardo Cañuelo, Cornelia Huck
In-Reply-To: <20220815215251.154451-1-mst@redhat.com>

From: Ricardo Cañuelo <ricardo.canuelo@collabora.com>

Fix variable names in some kerneldocs, naming in others.
Add kerneldocs for struct vring_desc and vring_interrupt.

Signed-off-by: Ricardo Cañuelo <ricardo.canuelo@collabora.com>
Message-Id: <20220810094004.1250-2-ricardo.canuelo@collabora.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index d66c8e6d0ef3..4620e9d79dde 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -2426,6 +2426,14 @@ static inline bool more_used(const struct vring_virtqueue *vq)
 	return vq->packed_ring ? more_used_packed(vq) : more_used_split(vq);
 }
 
+/**
+ * vring_interrupt - notify a virtqueue on an interrupt
+ * @irq: the IRQ number (ignored)
+ * @_vq: the struct virtqueue to notify
+ *
+ * Calls the callback function of @_vq to process the virtqueue
+ * notification.
+ */
 irqreturn_t vring_interrupt(int irq, void *_vq)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index a3f73bb6733e..dcab9c7e8784 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -11,7 +11,7 @@
 #include <linux/gfp.h>
 
 /**
- * virtqueue - a queue to register buffers for sending or receiving.
+ * struct virtqueue - a queue to register buffers for sending or receiving.
  * @list: the chain of virtqueues for this device
  * @callback: the function to call when buffers are consumed (can be NULL).
  * @name: the name of this virtqueue (mainly for debugging)
@@ -97,7 +97,7 @@ int virtqueue_resize(struct virtqueue *vq, u32 num,
 		     void (*recycle)(struct virtqueue *vq, void *buf));
 
 /**
- * virtio_device - representation of a device using virtio
+ * struct virtio_device - representation of a device using virtio
  * @index: unique position on the virtio bus
  * @failed: saved value for VIRTIO_CONFIG_S_FAILED bit (for restore)
  * @config_enabled: configuration change reporting enabled
@@ -156,7 +156,7 @@ size_t virtio_max_dma_size(struct virtio_device *vdev);
 	list_for_each_entry(vq, &vdev->vqs, list)
 
 /**
- * virtio_driver - operations for a virtio I/O driver
+ * struct virtio_driver - operations for a virtio I/O driver
  * @driver: underlying device driver (populate name and owner).
  * @id_table: the ids serviced by this driver.
  * @feature_table: an array of feature numbers supported by this driver.
diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index 36ec7be1f480..4b517649cfe8 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -239,7 +239,7 @@ int virtio_find_vqs_ctx(struct virtio_device *vdev, unsigned nvqs,
 
 /**
  * virtio_synchronize_cbs - synchronize with virtqueue callbacks
- * @vdev: the device
+ * @dev: the virtio device
  */
 static inline
 void virtio_synchronize_cbs(struct virtio_device *dev)
@@ -258,7 +258,7 @@ void virtio_synchronize_cbs(struct virtio_device *dev)
 
 /**
  * virtio_device_ready - enable vq use in probe function
- * @vdev: the device
+ * @dev: the virtio device
  *
  * Driver must call this to use vqs in the probe function.
  *
@@ -306,7 +306,7 @@ const char *virtio_bus_name(struct virtio_device *vdev)
 /**
  * virtqueue_set_affinity - setting affinity for a virtqueue
  * @vq: the virtqueue
- * @cpu: the cpu no.
+ * @cpu_mask: the cpu mask
  *
  * Pay attention the function are best-effort: the affinity hint may not be set
  * due to config support, irq type and sharing.
diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index 476d3e5c0fe7..f8c20d3de8da 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -93,15 +93,21 @@
 #define VRING_USED_ALIGN_SIZE 4
 #define VRING_DESC_ALIGN_SIZE 16
 
-/* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
+/**
+ * struct vring_desc - Virtio ring descriptors,
+ * 16 bytes long. These can chain together via @next.
+ *
+ * @addr: buffer address (guest-physical)
+ * @len: buffer length
+ * @flags: descriptor flags
+ * @next: index of the next descriptor in the chain,
+ *        if the VRING_DESC_F_NEXT flag is set. We chain unused
+ *        descriptors via this, too.
+ */
 struct vring_desc {
-	/* Address (guest-physical). */
 	__virtio64 addr;
-	/* Length. */
 	__virtio32 len;
-	/* The flags as indicated above. */
 	__virtio16 flags;
-	/* We chain unused descriptors via this, too */
 	__virtio16 next;
 };
 
-- 
MST


^ permalink raw reply related

* [PATCH v2 0/5] virtio: drop sizing vqs during init
From: Michael S. Tsirkin @ 2022-08-15 21:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: Xuan Zhuo, Jason Wang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, virtualization, netdev,
	Linus Torvalds, Jens Axboe, James Bottomley, Martin K. Petersen,
	Guenter Roeck, Greg KH

Supplying size during init does not work for all transports.
In fact for legacy pci doing that causes a memory
corruption which was reported on Google Cloud.

We might get away with changing size to size_hint so it's
safe to ignore and then fixing legacy to ignore the hint.

But the benefit is unclear in any case, so let's revert for now.
Any new version will have to come with
- documentation of performance gains
- performance testing showing existing workflows
  are not harmed materially. especially ones with
  bursty traffic
- report of testing on legacy devices


Huge shout out to Andres Freund for the effort spent reproducing and
debugging!  Thanks to Guenter Roeck for help with testing!

Michael S. Tsirkin (5):
  virtio_net: Revert "virtio_net: set the default max ring size by
    find_vqs()"
  virtio: Revert "virtio: add helper virtio_find_vqs_ctx_size()"
  virtio-mmio: Revert "virtio_mmio: support the arg sizes of find_vqs()"
  virtio_pci: Revert "virtio_pci: support the arg sizes of find_vqs()"
  virtio: Revert "virtio: find_vqs() add arg sizes"

 arch/um/drivers/virtio_uml.c             |  2 +-
 drivers/net/virtio_net.c                 | 42 +++---------------------
 drivers/platform/mellanox/mlxbf-tmfifo.c |  1 -
 drivers/remoteproc/remoteproc_virtio.c   |  1 -
 drivers/s390/virtio/virtio_ccw.c         |  1 -
 drivers/virtio/virtio_mmio.c             |  9 ++---
 drivers/virtio/virtio_pci_common.c       | 20 +++++------
 drivers/virtio/virtio_pci_common.h       |  3 +-
 drivers/virtio/virtio_pci_legacy.c       |  6 +---
 drivers/virtio/virtio_pci_modern.c       | 17 +++-------
 drivers/virtio/virtio_vdpa.c             |  1 -
 include/linux/virtio_config.h            | 26 +++------------
 12 files changed, 28 insertions(+), 101 deletions(-)

-- 
MST


^ permalink raw reply

* Re: [PATCH] virtio_net: Revert "virtio_net: set the default max ring size by find_vqs()"
From: Andres Freund @ 2022-08-15 21:46 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Guenter Roeck, linux-kernel, Xuan Zhuo, Jason Wang,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	virtualization, netdev, Linus Torvalds, Jens Axboe,
	James Bottomley, Martin K. Petersen, Greg KH, c
In-Reply-To: <20220815173256-mutt-send-email-mst@kernel.org>

Hi,

On 2022-08-15 17:39:08 -0400, Michael S. Tsirkin wrote:
> On Mon, Aug 15, 2022 at 02:28:39PM -0700, Andres Freund wrote:
> > On 2022-08-15 17:04:10 -0400, Michael S. Tsirkin wrote:
> > > So virtio has a queue_size register. When read, it will give you
> > > originally the maximum queue size. Normally we just read it and
> > > use it as queue size.
> > > 
> > > However, when queue memory allocation fails, and unconditionally with a
> > > network device with the problematic patch, driver is asking the
> > > hypervisor to make the ring smaller by writing a smaller value into this
> > > register.
> > > 
> > > I suspect that what happens is hypervisor still uses the original value
> > > somewhere.
> > 
> > It looks more like the host is never told about the changed size for legacy
> > devices...
> > 
> > Indeed, adding a vp_legacy_set_queue_size() & call to it to setup_vq(), makes
> > 5.19 + restricting queue sizes to 1024 boot again.
> 
> Interesting, the register is RO in the legacy interface.
> And to be frank I can't find where is vp_legacy_set_queue_size
> even implemented. It's midnight here too ...

Yea, I meant that added both vp_legacy_set_queue_size() and a call to it. I
was just quickly experimenting around.


> Yes I figured this out too. And I was able to reproduce on qemu now.

Cool.


> I'm posting a new patchset reverting all the handing of resize
> restrictions, I think we should rethink it for the next release.

Makes sense.

Greetings,

Andres Freund

^ permalink raw reply

* [PATCH v2 2/2] af_unix: Add ioctl(SIOCUNIXGRABFDS) to grab files of receive queue skbs
From: Kirill Tkhai @ 2022-08-15 21:45 UTC (permalink / raw)
  To: Linux Kernel Network Developers; +Cc: davem, edumazet, viro
In-Reply-To: <0b07a55f-0713-7ba4-9b6b-88bc8cc6f1f5@ya.ru>

When a fd owning a counter of some critical resource, say, of a mount,
it's impossible to umount that mount and disconnect related block device.
That fd may be contained in some unix socket receive queue skb.

Despite we have an interface for detecting such the sockets queues
(/proc/[PID]/fdinfo/[fd] shows non-zero scm_fds count if so) and
it's possible to kill that process to release the counter, the problem is
that there may be several processes, and it's not a good thing to kill
each of them.

This patch adds a simple interface to grab files from receive queue,
so the caller may analyze them, and even do that recursively, if grabbed
file is unix socket itself. So, the described above problem may be solved
by this ioctl() in pair with pidfd_getfd().

Note, that the existing recvmsg(,,MSG_PEEK) is not suitable for that
purpose, since it modifies peek offset inside socket, and this results
in a problem in case of examined process uses peek offset itself.
Additional ptrace freezing of that task plus ioctl(SO_PEEK_OFF) won't help
too, since that socket may relate to several tasks, and there is no
reliable and non-racy way to detect that. Also, if the caller of such
trick will die, the examined task will remain frozen forever. The new
suggested ioctl(SIOCUNIXGRABFDS) does not have such problems.

The realization of ioctl(SIOCUNIXGRABFDS) is pretty simple. The only
interesting thing is protocol with userspace. Firstly, we let userspace
to know the number of all files in receive queue skbs. Then we receive
fds one by one starting from requested offset. We return number of
received fds if there is a successfully received fd, and this number
may be less in case of error or desired fds number lack. Userspace
may detect that situations by comparison of returned value and
out.nr_all minus in.nr_skip. Looking over different variant this one
looks the best for me (I considered returning error in case of error
and there is a received fd. Also I considered returning number of
received files as one more member in struct unix_ioc_grab_fds).

Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
---
 include/uapi/linux/un.h |   12 ++++++++
 net/unix/af_unix.c      |   70 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 82 insertions(+)

diff --git a/include/uapi/linux/un.h b/include/uapi/linux/un.h
index 0ad59dc8b686..995b358263dd 100644
--- a/include/uapi/linux/un.h
+++ b/include/uapi/linux/un.h
@@ -11,6 +11,18 @@ struct sockaddr_un {
 	char sun_path[UNIX_PATH_MAX];	/* pathname */
 };
 
+struct unix_ioc_grab_fds {
+	struct {
+		int nr_grab;
+		int nr_skip;
+		int *fds;
+	} in;
+	struct {
+		int nr_all;
+	} out;
+};
+
 #define SIOCUNIXFILE (SIOCPROTOPRIVATE + 0) /* open a socket file with O_PATH */
+#define SIOCUNIXGRABFDS (SIOCPROTOPRIVATE + 1) /* grab files from recv queue */
 
 #endif /* _LINUX_UN_H */
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index bf338b782fc4..3c7e8049eba1 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -3079,6 +3079,73 @@ static int unix_open_file(struct sock *sk)
 	return fd;
 }
 
+static int unix_ioc_grab_fds(struct sock *sk, struct unix_ioc_grab_fds __user *uarg)
+{
+	int i, todo, skip, count, all, err, done = 0;
+	struct unix_sock *u = unix_sk(sk);
+	struct unix_ioc_grab_fds arg;
+	struct sk_buff *skb = NULL;
+	struct scm_fp_list *fp;
+
+	if (copy_from_user(&arg, uarg, sizeof(arg)))
+		return -EFAULT;
+
+	skip = arg.in.nr_skip;
+	todo = arg.in.nr_grab;
+
+	if (skip < 0 || todo <= 0)
+		return -EINVAL;
+	if (mutex_lock_interruptible(&u->iolock))
+		return -EINTR;
+
+	all = atomic_read(&u->scm_stat.nr_fds);
+	err = -EFAULT;
+	/* Set uarg->out.nr_all before the first file is received. */
+	if (put_user(all, &uarg->out.nr_all))
+		goto unlock;
+	err = 0;
+	if (all <= skip)
+		goto unlock;
+	if (all - skip < todo)
+		todo = all - skip;
+	while (todo) {
+		spin_lock(&sk->sk_receive_queue.lock);
+		if (!skb)
+			skb = skb_peek(&sk->sk_receive_queue);
+		else
+			skb = skb_peek_next(skb, &sk->sk_receive_queue);
+		spin_unlock(&sk->sk_receive_queue.lock);
+
+		if (!skb)
+			goto unlock;
+
+		fp = UNIXCB(skb).fp;
+		count = fp->count;
+		if (skip >= count) {
+			skip -= count;
+			continue;
+		}
+
+		for (i = skip; i < count && todo; i++) {
+			err = receive_fd_user(fp->fp[i], &arg.in.fds[done], 0);
+			if (err < 0)
+				goto unlock;
+			done++;
+			todo--;
+		}
+		skip = 0;
+	}
+unlock:
+	mutex_unlock(&u->iolock);
+
+	/* Return number of fds (non-error) if there is a received file. */
+	if (done)
+		return done;
+	if (err < 0)
+		return err;
+	return 0;
+}
+
 static int unix_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
 {
 	struct sock *sk = sock->sk;
@@ -3113,6 +3180,9 @@ static int unix_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
 		}
 		break;
 #endif
+	case SIOCUNIXGRABFDS:
+		err = unix_ioc_grab_fds(sk, (struct unix_ioc_grab_fds __user *)arg);
+		break;
 	default:
 		err = -ENOIOCTLCMD;
 		break;




^ permalink raw reply related

* [PATCH v2 1/2] fs: Export __receive_fd()
From: Kirill Tkhai @ 2022-08-15 21:42 UTC (permalink / raw)
  To: Linux Kernel Network Developers; +Cc: davem, edumazet, viro
In-Reply-To: <0b07a55f-0713-7ba4-9b6b-88bc8cc6f1f5@ya.ru>

This is needed to make receive_fd_user() available in modules, and it will be used in next patch.

Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
---
v2: New
 fs/file.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/file.c b/fs/file.c
index 3bcc1ecc314a..e45d45f1dd45 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -1181,6 +1181,7 @@ int __receive_fd(struct file *file, int __user *ufd, unsigned int o_flags)
 	__receive_sock(file);
 	return new_fd;
 }
+EXPORT_SYMBOL_GPL(__receive_fd);
 
 int receive_fd_replace(int new_fd, struct file *file, unsigned int o_flags)
 {




^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox