Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [RFC feedback] AF_XDP and non-Intel hardware
From: Mykyta Iziumtsev @ 2018-05-22  7:45 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Netdev, Björn Töpel, Karlsson, Magnus, Zhang, Qi Z,
	Francois Ozog, Ilias Apalodimas, Brian Brooks,
	Jesper Dangaard Brouer, andy, michael.chan, luke
In-Reply-To: <CAJ+HfNhB_E6RDn-K4Wp0h8194bqqicVzWJ0Kn18HAG3B7K=hXA@mail.gmail.com>

On 21 May 2018 at 20:55, Björn Töpel <bjorn.topel@gmail.com> wrote:
> 2018-05-21 14:34 GMT+02:00 Mykyta Iziumtsev <mykyta.iziumtsev@linaro.org>:
>> Hi Björn and Magnus,
>>
>> (This thread is a follow up to private dialogue. The intention is to
>> let community know that AF_XDP can be enhanced further to make it
>> compatible with wider range of NIC vendors).
>>
>
> Mykyta, thanks for doing the write-up and sending it to the netdev
> list! The timing could not be better -- we need to settle on an uapi
> that works for all vendors prior enabling it in the kernel.
>
>> There are two NIC variations which don't fit well with current AF_XDP proposal.
>>
>> The first variation is represented by some NXP and Cavium NICs. AF_XDP
>> expects NIC to put incoming frames into slots defined by UMEM area.
>> Here slot size is set in XDP_UMEM_REG xdp_umem.reg.frame_size and
>> slots available to NIC are communicated to the kernel via UMEM fill
>> queue. While Intel NICs support only one slot size, NXP and Cavium
>> support multiple slot sizes to optimize memory usage (e.g. { 128, 512,
>> 2048 }, please refer to [1] for rationale). On frame reception the NIC
>> pulls a slot from best fit pool based on frame size.
>>
>> The second variation is represented by e.g. Chelsio T5/T6 and Netcope
>> NICs. As shown above, AF_XDP expects NIC to put incoming frames at
>> predefined addresses. This is not the case here as the NIC is in
>> charge of placing frames in memory based on it's own algorithm. For
>> example, Chelsio T5/T6 is expecting to get whole pages from the driver
>> and puts incoming frames on the page in a 'tape recorder' fashion.
>> Assuming 'base' is page address and flen[N] is an array of frame
>> lengths, the frame placement in memory will look like that:
>>   base + 0
>>   base + frame_len[0]
>>   base + frame_len[0] + frame_len[1]
>>   base + frame_len[0] + frame_len[1] + frame_len[2]
>>   ...
>>
>> To better support these two NIC variations I suggest to abandon 'frame
>> size' structuring in UMEM and stick with 'pages' instead. The NIC
>> kernel driver is then responsible for splitting provided pages into
>> slots expected by underlying HW (or not splitting at all in case of
>> Chelsio/Netcope).
>>
>
> Let's call the first variation "multi-pool" and the second
> "type-writer" for simplicity. The type-writer model is very
> interesting, and makes a lot of sense when the PCIe bus is a
> constraint.
>
>> On XDP_UMEM_REG the application needs to specify page_size. Then the
>> application can pass empty pages to the kernel driver using UMEM
>> 'fill' queue by specifying page offset within the UMEM area. xdp_desc
>> format needs to be changed as well: frame location will be defined by
>> offset from the beginning of UMEM area instead of frame index. As
>> payload headroom can vary with AF_XDP we'll need to specify it in
>> xdp_desc as well. Beside that it could be worth to consider changing
>> payload length to u16 as 64k+ frames aren't very common in networking.
>> The resulting xdp_desc would look like that:
>>
>> struct xdp_desc {
>>         __u64 offset;
>>         __u16 headroom;
>>         __u16 len;
>>         __u8 flags;
>>         __u8 padding[3];
>> };
>>
>
> Let's expand a bit here; Instead of passing indicies to fixed sized
> frames to the fill ring, we now pass a memory region. For simplicity,
> let's say a page. The fill ring descriptor requires offset and
> len. The offset is a global offset from an UMEM perspective, and len
> is the size of the region.
>

I would rather stick with region equal to page (regular or huge page,
defined by application). The page size can be extracted from
vm_area_struct in XDP_UMEM_REG (preferred) or configured by
application.

> On the Rx ring the descriptor, as you wrote, must be changed as well
> to your suggestion above. Note, that headroom is still needed, since
> XDP can change the size of a packet, so the fixed headroom stated in
> UMEM registration is not sufficient.
>
> This model is obviously more flexible, but then also a bit more
> complex. E.g. a fixed-frame NIC (like ixgbe), what should the
> behaviour be? Should the fill ring entry be used only for *one* frame,
> or chopped up to multiple receive frames? Should it be configurable?
> Driver specific?

I think driver-specific makes most sense here. In case of fixed-frame
NIC the driver shall chop the ring entry into multiple receive frames.

>
> Also, validating the entries in the fill queue require more work
> (compared to the current index variant). Currently, we only skip
> invalid indicies. What should we do when say, you pass a memory window
> that is too narrow (say 128B) but correct from a UMEM
> perspective. Going this path, we need to add pretty hard constraints
> so we don't end up it too complex code -- because then it'll be too
> slow.

If we stick with pages -- the only possible erroneous input will be
'page out of UMEM boundaries'. The validation will be essentially:

if ((offset > umem->size) || (offset & (umem->page_size - 1))
    fail

The question is what shall be done if validation fails ? Would
SEGFAULT be reasonable ? This is more or less equivalent to
dereferencing invalid pointer.

>
>> In current proposal you have a notion of 'frame ownership': 'owned by
>> kernel' or 'owned by application'. The ownership is transferred by
>> means of enqueueing frame index in UMEM 'fill' queue (from application
>> to kernel) or in UMEM 'tx completion' queue (from kernel to
>> application). If you decide to adopt 'page' approach this notion needs
>> to be changed a bit. This is because in case of packet forwarding one
>> and the same page can be used for RX (parts of it enqueued in HW 'free
>> lists') and TX (forwarding of previously RXed packets).
>>
>> I propose to define 'ownership' as a right to manipulate the
>> partitioning of the page into frames. Whenever application passes a
>> page to the kernel via UMEM 'fill' queue -- the ownership is
>> transferred to the kernel. The application can't allocate packets on
>> this page until kernel is done with it, but it can change payload of
>> RXed packets before forwarding them. The kernel can pass ownership
>> back by means of 'end-of-page' in xdp_desc.flags.
>>
>
> I like the end-of-page mechanism.
>
>> The pages are definitely supposed to be recycled sooner or later. Even
>> if it's not part of kernel API and the housekeeping implementation
>> resided completely in application I still would like to propose
>> possible (hopefully, cost efficient) solution to that. The recycling
>> could be achieved by keeping refcount on pages and recycling the page
>> only when it's owned by application and refcount reaches 0.
>>
>> Whenever application transfers page ownership to the kernel the
>> refcount shall be initialized to 0. With each incoming RX xdp_desc the
>> corresponding page needs to be identified (xdp_desc.offset >>
>> PAGE_SHIFT) and refcount incremented. When the packet gets freed the
>> refcount shall be decremented. If packet is forwarded in TX xdp_desc
>> -- the refcount gets decremented only on TX completion (again,
>> tx_completion.desc >> PAGE_SHIFT). For packets originating from the
>> application itself the payload buffers needs to be allocated from
>> empty page owned by the application and refcount needs to be
>> incremented as well.
>>
>
> Strictly speaking, we're not enforcing correctness in the current
> solution. If the userspace application passes index 1 mulitple times
> to the fill ring, and at the same time send index 1, things will
> break. So, with the existing solution the userspace application
> *still* needs to track the frames. With this new model, the
> tracking/refcounting will be more complex. That might be a concern.
>
> For the multi-pool NICs I think we can still just have one UMEM, and
> let the driver decide where in which pool to place this certain chunk
> of memory. Thoughts?

Definitely agree with that. This is HW specific and exposing it to the
application would only harm portability.

>
> Now, how do we go forward? I think this is very important, and I will
> hack a copy-mode version for this new API. I'm a bit worried that the
> complexity/configuration space will grow and impact performance, but
> let's see.
>
> To prove that the API works for the NICs you mentioned, we need an
> actual zero-copy implementation for them. Do you think Linaro could
> work on a zero-copy variant for any of the NICs above?
>

Linaro will definitely contribute zero-copy implementation for some
ARM-based NICs with 'multi-pool' variation. Implementation of
'type-writer' variation is up to Chelsio/Netcope, we only try to come
up with API which (most probably) will fit them as well.

>
> Again thanks for bringing this up!
> Björn
>
>
>
>> [1] https://en.wikipedia.org/wiki/Internet_Mix
>>
>> With best regards,
>> Mykyta

^ permalink raw reply

* Re: [PATCH] ipvs: drop templates for never established TCP connections
From: Julian Anastasov @ 2018-05-22  7:51 UTC (permalink / raw)
  To: Michal Koutný; +Cc: wensong, horms, mkubecek, netdev, lvs-devel
In-Reply-To: <20180521172707.14106-1-mkoutny@suse.com>

[-- Attachment #1: Type: text/plain, Size: 6458 bytes --]


	Hello,

On Mon, 21 May 2018, Michal Koutný wrote:

> IPVS includes protection against filling the ip_vs_conn_tab by dropping 1/32 of
> feasible entries every second. The template entries (for persistent services)
> are never directly deleted by this mechanism but when a picked TCP connection
> entry is being dropped (1), the respective template entry is dropped too
> (realized by expiring 60 seconds after the connection entry being dropped).

	We try to drop the template in ip_vs_random_dropentry()
but I guess kernel/time/timer.c:enqueue_timer() puts both timers
in reverse order for expiration by using hlist_add_head().

> There is another mechanism that removes connection entries when they
> time out (2), in this case the associated template entry is not deleted.
> Under SYN flood template entries would accumulate (due to their entry
> longer timeout).

	There is also ip_vs_todrop() called in tcp_conn_schedule().
It just drops specific part from the SYNs on memory pressure.

> The accumulation takes place also with drop_entry being enabled. Roughly
> 15% ((31/32)^60) of SYN_RECV connections survive the dropping mechanism
> (1) and are removed by the timeout mechanism (2)(defaults to 60 seconds
> for SYN_RECV), thus template entries would still accumulate.
> 
> The patch ensures that when a connection entry times out, we also remove the
> template entry from the table. To prevent breaking persistent services (since
> the connection may time out in already established state) we add a new entry
> flag to protect templates what spawned at least one established TCP connection.
> 
> Cc: Michal Kubeček <mkubecek@suse.com>
> Signed-off-by: Michal Koutný <mkoutny@suse.com>
> ---
>  include/uapi/linux/ip_vs.h           | 33 +++++++++++++++++----------------
>  net/netfilter/ipvs/ip_vs_conn.c      | 10 +++++++++-
>  net/netfilter/ipvs/ip_vs_core.c      | 15 ++++++++++++++-
>  net/netfilter/ipvs/ip_vs_proto_tcp.c |  6 ++++++
>  4 files changed, 46 insertions(+), 18 deletions(-)
> 
> diff --git a/include/uapi/linux/ip_vs.h b/include/uapi/linux/ip_vs.h
> index 1c916b2f89dc..ef3bbc001fcd 100644
> --- a/include/uapi/linux/ip_vs.h
> +++ b/include/uapi/linux/ip_vs.h
> @@ -79,22 +79,23 @@
>   *      IPVS Connection Flags
>   *      Only flags 0..15 are sent to backup server
>   */
> -#define IP_VS_CONN_F_FWD_MASK	0x0007		/* mask for the fwd methods */
> -#define IP_VS_CONN_F_MASQ	0x0000		/* masquerading/NAT */
> -#define IP_VS_CONN_F_LOCALNODE	0x0001		/* local node */
> -#define IP_VS_CONN_F_TUNNEL	0x0002		/* tunneling */
> -#define IP_VS_CONN_F_DROUTE	0x0003		/* direct routing */
> -#define IP_VS_CONN_F_BYPASS	0x0004		/* cache bypass */
> -#define IP_VS_CONN_F_SYNC	0x0020		/* entry created by sync */
> -#define IP_VS_CONN_F_HASHED	0x0040		/* hashed entry */
> -#define IP_VS_CONN_F_NOOUTPUT	0x0080		/* no output packets */
> -#define IP_VS_CONN_F_INACTIVE	0x0100		/* not established */
> -#define IP_VS_CONN_F_OUT_SEQ	0x0200		/* must do output seq adjust */
> -#define IP_VS_CONN_F_IN_SEQ	0x0400		/* must do input seq adjust */
> -#define IP_VS_CONN_F_SEQ_MASK	0x0600		/* in/out sequence mask */
> -#define IP_VS_CONN_F_NO_CPORT	0x0800		/* no client port set yet */
> -#define IP_VS_CONN_F_TEMPLATE	0x1000		/* template, not connection */
> -#define IP_VS_CONN_F_ONE_PACKET	0x2000		/* forward only one packet */
> +#define IP_VS_CONN_F_FWD_MASK		0x0007		/* mask for the fwd methods */
> +#define IP_VS_CONN_F_MASQ		0x0000		/* masquerading/NAT */
> +#define IP_VS_CONN_F_LOCALNODE		0x0001		/* local node */
> +#define IP_VS_CONN_F_TUNNEL		0x0002		/* tunneling */
> +#define IP_VS_CONN_F_DROUTE		0x0003		/* direct routing */
> +#define IP_VS_CONN_F_BYPASS		0x0004		/* cache bypass */
> +#define IP_VS_CONN_F_SYNC		0x0020		/* entry created by sync */
> +#define IP_VS_CONN_F_HASHED		0x0040		/* hashed entry */
> +#define IP_VS_CONN_F_NOOUTPUT		0x0080		/* no output packets */
> +#define IP_VS_CONN_F_INACTIVE		0x0100		/* not established */
> +#define IP_VS_CONN_F_OUT_SEQ		0x0200		/* must do output seq adjust */
> +#define IP_VS_CONN_F_IN_SEQ		0x0400		/* must do input seq adjust */
> +#define IP_VS_CONN_F_SEQ_MASK		0x0600		/* in/out sequence mask */
> +#define IP_VS_CONN_F_NO_CPORT		0x0800		/* no client port set yet */
> +#define IP_VS_CONN_F_TEMPLATE		0x1000		/* template, not connection */
> +#define IP_VS_CONN_F_ONE_PACKET		0x2000		/* forward only one packet */
> +#define IP_VS_CONN_F_TMPL_PERSISTED	0x4000		/* template, confirmed persistent */
>  
>  /* Initial bits allowed in backup server */
>  #define IP_VS_CONN_F_BACKUP_MASK (IP_VS_CONN_F_FWD_MASK | \
> diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
> index 370abbf6f421..6afc606a388c 100644
> --- a/net/netfilter/ipvs/ip_vs_conn.c
> +++ b/net/netfilter/ipvs/ip_vs_conn.c
> @@ -820,6 +820,7 @@ static void ip_vs_conn_rcu_free(struct rcu_head *head)
>  static void ip_vs_conn_expire(struct timer_list *t)
>  {
>  	struct ip_vs_conn *cp = from_timer(cp, t, timer);
> +	struct ip_vs_conn *cp_c;
>  	struct netns_ipvs *ipvs = cp->ipvs;
>  
>  	/*
> @@ -834,8 +835,15 @@ static void ip_vs_conn_expire(struct timer_list *t)
>  		del_timer(&cp->timer);
>  
>  		/* does anybody control me? */
> -		if (cp->control)
> +		cp_c = cp->control;
> +		if (cp_c) {
>  			ip_vs_control_del(cp);
> +			if (cp_c->flags & IP_VS_CONN_F_TEMPLATE &&
> +			    !(cp_c->flags & IP_VS_CONN_F_TMPL_PERSISTED)) {
> +				IP_VS_DBG(4, "del conn template\n");
> +				ip_vs_conn_expire_now(cp_c);

	So, we have current conn expired after 60 seconds
in IP_VS_TCP_S_SYN_RECV state and possibly other conns
in same state that are not expired yet.

	Another option is just to use something like:

	if (cp_c) {
		ip_vs_control_del(cp);
		/* Restart cp_c timer only for last conn */
		if (!atomic_read(&cp_c->n_control) &&
		    (cp_c->flags & IP_VS_CONN_F_TEMPLATE) &&
		    /* Some func to decide when to drop cp_c:
		     * - it can be for SYN state
		     * - it can be when cp was dropped on load
		     */
		    cp->state == IP_VS_TCP_S_SYN_RECV) {
			IP_VS_DBG(4, "del conn template\n");
			ip_vs_conn_expire_now(cp_c);
		}
	}

	It is not perfect, i.e. it does not know if there was
some conn that was established in the past:

- CONN1: SYN, SYN+ACK, ESTABLISH, FIN, FIN+ACK, expire
- CONN2: expire in SYN state, drop tpl before persistent timeout

	But it should work in the general case.
Anyways, give me some days to think more on this issue.

Regards

^ permalink raw reply

* Re: [PATCH net-next 00/13] nfp: abm: add basic support for advanced buffering NIC
From: Jakub Kicinski @ 2018-05-22  7:56 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: David Miller, Linux Netdev List, oss-drivers, Andy Gospodarek,
	linux-internal
In-Reply-To: <CAJ3xEMh1uoLAJbdXRLPSPHKxd5ZmPN=nf0ZOgj8TUm8j2gp6fg@mail.gmail.com>

On Mon, May 21, 2018 at 11:32 PM, Or Gerlitz wrote:
> On Tue, May 22, 2018 at 8:12 AM, Jakub Kicinski wrote:
>> Hi!
>>
>> This series lays groundwork for advanced buffer management NIC feature.
>> It makes necessary NFP core changes, spawns representors and adds devlink
>> glue.  Following series will add the actual buffering configuration (patch
>> series size limit).
>>
>> First three patches add support for configuring NFP buffer pools via a
>> mailbox.  The existing devlink APIs are used for the purpose.
>>
>> Third patch allows us to perform small reads from the NFP memory.
>>
>> The rest of the patch set adds eswitch mode change support and makes
>> the driver spawn appropriate representors.
>
> Hi Jakub,
>
> Could you provide more higher level description on the abm use-case
> and nature of these representors? I understand that under abm you are
> modeling the nic as switch with vNIC ports, does vNIC port and vNIC
> port rep have the same characteristics as VF and VF rep (xmit on one side
> <--> send on 2nd side),  does traffic is to be offloaded using TC, etc.
> What one would be doing with vNIC instance, hand it to container ala the Intel
> VMDQ concept?
> can this be seen as veth HW offload? etc

Yes, the reprs can be used like VF reprs but that's not the main use
case. We are targeting container world with ABM, so no VFs and no
SR-IOV.  There is only one vNIC per port and no veth offload etc. In
the most basic scenario with 1 PF corresponding to 1 port there is no
real use for switching.

The main purpose here is that we want to setup the buffering and QoS
inside the NIC (both for TX and RX) and then use eBPF to perform
filtering, queue assignment and per-application RSS. That's pretty
much it at this point.

Switching if any will be a basic bridge offload.  QoS configuration
will all be done using TC qdisc offload, RED etc. exactly like mlxsw
:)

^ permalink raw reply

* Re: CONTACT MONEY GRAM OFFICE FOR YOUR PAYMENT OF $5,000.00
From: Janet Keucker @ 2018-05-22  8:09 UTC (permalink / raw)


Attention: Beneficiary,

Re:CONTACT MONEY GRAM OFFICE FOR YOUR PAYMENT OF $5,000.00

This is the second time we are notifying you about the statue of your 
compensation payment of ( $1,500,000.00).Be inform that We have been authorized 
by the United Nations Compensation Commission (UNCC) to release your 
compensation payment totaling the sum of US$1,500,000.00 through Money Gram 
transfer.You will be receiving the sum of US$5,000.00 daily until the total sum 
of US $1,500,000.00 is completely transferred to you.

Besides, we have already credited your first payment of US$5,000.00 today,
therefore, you are advice to contact our director of foreign Remittance 
Department MR.MARCEL OWURAH and request him to give you the details of your 
first payment such as MoneyGram reference Numbers and senders name to enable 
you pick up your first payment of US$5,000.00 at any Money Gram office located 
around you. Kindly contact him through the information stated below for faster 
process.

Contact Person: MR.MARCEL OWURAH
Telephone Number: +229 94 06 77 68
E-MAIL: ( transfermoney242@gmail.com )

REMEMBER TO FORWARD HIM YOUR FULL INFORMATION AS REQUIRED BELOW TO ENABLE HIM 
LOCATE YOUR PAYMENT FILE AND ATTEND TO YOU IMMEDIATELY.

1. Your Full Name:
2. Address:
3. Age:
4. Occupation:
5. Telephone
6. Country:

NOTE: that the amount to be paid to you is (USD1.500, 000.00), we expect your 
urgent response to this email to enable us monitor the transaction effectively.

Best Regards
Mrs.Janet Keucker. 

^ permalink raw reply

* [RFC v5 0/5] virtio: support packed ring
From: Tiwei Bie @ 2018-05-22  8:16 UTC (permalink / raw)
  To: mst, jasowang, virtualization, linux-kernel, netdev
  Cc: wexu, jfreimann, tiwei.bie

Hello everyone,

This RFC implements packed ring support in virtio driver.

Some simple functional tests have been done with Jason's
packed ring implementation in vhost (RFC v4):

https://lkml.org/lkml/2018/5/16/501

Both of ping and netperf worked as expected w/ EVENT_IDX
disabled. Ping worked as expected w/ EVENT_IDX enabled,
but netperf didn't (A hack has been added in the driver
to make netperf test pass in this case. The hack can be
found by `grep -rw XXX` in the code).

TODO:
- Refinements (for code and commit log);
- More tests;
- Bug fixes;

RFC v4 -> RFC v5:
- Save DMA addr, etc in desc state (Jason);
- Track used wrap counter;

RFC v3 -> RFC v4:
- Make ID allocation support out-of-order (Jason);
- Various fixes for EVENT_IDX support;

RFC v2 -> RFC v3:
- Split into small patches (Jason);
- Add helper virtqueue_use_indirect() (Jason);
- Just set id for the last descriptor of a list (Jason);
- Calculate the prev in virtqueue_add_packed() (Jason);
- Fix/improve desc suppression code (Jason/MST);
- Refine the code layout for XXX_split/packed and wrappers (MST);
- Fix the comments and API in uapi (MST);
- Remove the BUG_ON() for indirect (Jason);
- Some other refinements and bug fixes;

RFC v1 -> RFC v2:
- Add indirect descriptor support - compile test only;
- Add event suppression supprt - compile test only;
- Move vring_packed_init() out of uapi (Jason, MST);
- Merge two loops into one in virtqueue_add_packed() (Jason);
- Split vring_unmap_one() for packed ring and split ring (Jason);
- Avoid using '%' operator (Jason);
- Rename free_head -> next_avail_idx (Jason);
- Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
- Some other refinements and bug fixes;

Thanks!

Tiwei Bie (5):
  virtio: add packed ring definitions
  virtio_ring: support creating packed ring
  virtio_ring: add packed ring support
  virtio_ring: add event idx support in packed ring
  virtio_ring: enable packed ring

 drivers/virtio/virtio_ring.c       | 1354 ++++++++++++++++++++++------
 include/linux/virtio_ring.h        |    8 +-
 include/uapi/linux/virtio_config.h |    5 +-
 include/uapi/linux/virtio_ring.h   |   36 +
 4 files changed, 1125 insertions(+), 278 deletions(-)

-- 
2.17.0

^ permalink raw reply

* [RFC v5 1/5] virtio: add packed ring definitions
From: Tiwei Bie @ 2018-05-22  8:16 UTC (permalink / raw)
  To: mst, jasowang, virtualization, linux-kernel, netdev
  Cc: wexu, jfreimann, tiwei.bie
In-Reply-To: <20180522081648.14768-1-tiwei.bie@intel.com>

Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
---
 include/uapi/linux/virtio_config.h |  5 ++++-
 include/uapi/linux/virtio_ring.h   | 36 ++++++++++++++++++++++++++++++
 2 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/virtio_config.h b/include/uapi/linux/virtio_config.h
index 308e2096291f..932a6ecc8e46 100644
--- a/include/uapi/linux/virtio_config.h
+++ b/include/uapi/linux/virtio_config.h
@@ -49,7 +49,7 @@
  * transport being used (eg. virtio_ring), the rest are per-device feature
  * bits. */
 #define VIRTIO_TRANSPORT_F_START	28
-#define VIRTIO_TRANSPORT_F_END		34
+#define VIRTIO_TRANSPORT_F_END		35
 
 #ifndef VIRTIO_CONFIG_NO_LEGACY
 /* Do we get callbacks when the ring is completely used, even if we've
@@ -71,4 +71,7 @@
  * this is for compatibility with legacy systems.
  */
 #define VIRTIO_F_IOMMU_PLATFORM		33
+
+/* This feature indicates support for the packed virtqueue layout. */
+#define VIRTIO_F_RING_PACKED		34
 #endif /* _UAPI_LINUX_VIRTIO_CONFIG_H */
diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index 6d5d5faa989b..3932cb80c347 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -44,6 +44,9 @@
 /* This means the buffer contains a list of buffer descriptors. */
 #define VRING_DESC_F_INDIRECT	4
 
+#define VRING_DESC_F_AVAIL(b)	((b) << 7)
+#define VRING_DESC_F_USED(b)	((b) << 15)
+
 /* The Host uses this in used->flags to advise the Guest: don't kick me when
  * you add a buffer.  It's unreliable, so it's simply an optimization.  Guest
  * will still kick if it's out of buffers. */
@@ -53,6 +56,10 @@
  * optimization.  */
 #define VRING_AVAIL_F_NO_INTERRUPT	1
 
+#define VRING_EVENT_F_ENABLE	0x0
+#define VRING_EVENT_F_DISABLE	0x1
+#define VRING_EVENT_F_DESC	0x2
+
 /* We support indirect buffer descriptors */
 #define VIRTIO_RING_F_INDIRECT_DESC	28
 
@@ -171,4 +178,33 @@ static inline int vring_need_event(__u16 event_idx, __u16 new_idx, __u16 old)
 	return (__u16)(new_idx - event_idx - 1) < (__u16)(new_idx - old);
 }
 
+struct vring_packed_desc_event {
+	/* __virtio16 off  : 15; // Descriptor Event Offset
+	 * __virtio16 wrap : 1;  // Descriptor Event Wrap Counter */
+	__virtio16 off_wrap;
+	/* __virtio16 flags : 2; // Descriptor Event Flags */
+	__virtio16 flags;
+};
+
+struct vring_packed_desc {
+	/* Buffer Address. */
+	__virtio64 addr;
+	/* Buffer Length. */
+	__virtio32 len;
+	/* Buffer ID. */
+	__virtio16 id;
+	/* The flags depending on descriptor type. */
+	__virtio16 flags;
+};
+
+struct vring_packed {
+	unsigned int num;
+
+	struct vring_packed_desc *desc;
+
+	struct vring_packed_desc_event *driver;
+
+	struct vring_packed_desc_event *device;
+};
+
 #endif /* _UAPI_LINUX_VIRTIO_RING_H */
-- 
2.17.0

^ permalink raw reply related

* [RFC v5 3/5] virtio_ring: add packed ring support
From: Tiwei Bie @ 2018-05-22  8:16 UTC (permalink / raw)
  To: mst, jasowang, virtualization, linux-kernel, netdev
  Cc: wexu, jfreimann, tiwei.bie
In-Reply-To: <20180522081648.14768-1-tiwei.bie@intel.com>

This commit introduces the support (without EVENT_IDX) for
packed ring.

Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
---
 drivers/virtio/virtio_ring.c | 474 ++++++++++++++++++++++++++++++++++-
 1 file changed, 468 insertions(+), 6 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index f5ef5f42a7cf..eb9fd5207a68 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -62,6 +62,12 @@ struct vring_desc_state {
 };
 
 struct vring_desc_state_packed {
+	void *data;			/* Data for callback. */
+	struct vring_packed_desc *indir_desc; /* Indirect descriptor, if any. */
+	int num;			/* Descriptor list length. */
+	dma_addr_t addr;		/* Buffer DMA addr. */
+	u32 len;			/* Buffer length. */
+	u16 flags;			/* Descriptor flags. */
 	int next;			/* The next desc state. */
 };
 
@@ -758,6 +764,72 @@ static inline unsigned vring_size_packed(unsigned int num, unsigned long align)
 		& ~(align - 1)) + sizeof(struct vring_packed_desc_event) * 2;
 }
 
+static void vring_unmap_state_packed(const struct vring_virtqueue *vq,
+				     struct vring_desc_state_packed *state)
+{
+	u16 flags;
+
+	if (!vring_use_dma_api(vq->vq.vdev))
+		return;
+
+	flags = state->flags;
+
+	if (flags & VRING_DESC_F_INDIRECT) {
+		dma_unmap_single(vring_dma_dev(vq),
+				 state->addr, state->len,
+				 (flags & VRING_DESC_F_WRITE) ?
+				 DMA_FROM_DEVICE : DMA_TO_DEVICE);
+	} else {
+		dma_unmap_page(vring_dma_dev(vq),
+			       state->addr, state->len,
+			       (flags & VRING_DESC_F_WRITE) ?
+			       DMA_FROM_DEVICE : DMA_TO_DEVICE);
+	}
+}
+
+static void vring_unmap_desc_packed(const struct vring_virtqueue *vq,
+				   struct vring_packed_desc *desc)
+{
+	u16 flags;
+
+	if (!vring_use_dma_api(vq->vq.vdev))
+		return;
+
+	flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
+
+	if (flags & VRING_DESC_F_INDIRECT) {
+		dma_unmap_single(vring_dma_dev(vq),
+				 virtio64_to_cpu(vq->vq.vdev, desc->addr),
+				 virtio32_to_cpu(vq->vq.vdev, desc->len),
+				 (flags & VRING_DESC_F_WRITE) ?
+				 DMA_FROM_DEVICE : DMA_TO_DEVICE);
+	} else {
+		dma_unmap_page(vring_dma_dev(vq),
+			       virtio64_to_cpu(vq->vq.vdev, desc->addr),
+			       virtio32_to_cpu(vq->vq.vdev, desc->len),
+			       (flags & VRING_DESC_F_WRITE) ?
+			       DMA_FROM_DEVICE : DMA_TO_DEVICE);
+	}
+}
+
+static struct vring_packed_desc *alloc_indirect_packed(struct virtqueue *_vq,
+						       unsigned int total_sg,
+						       gfp_t gfp)
+{
+	struct vring_packed_desc *desc;
+
+	/*
+	 * We require lowmem mappings for the descriptors because
+	 * otherwise virt_to_phys will give us bogus addresses in the
+	 * virtqueue.
+	 */
+	gfp &= ~__GFP_HIGHMEM;
+
+	desc = kmalloc(total_sg * sizeof(struct vring_packed_desc), gfp);
+
+	return desc;
+}
+
 static inline int virtqueue_add_packed(struct virtqueue *_vq,
 				       struct scatterlist *sgs[],
 				       unsigned int total_sg,
@@ -767,47 +839,437 @@ static inline int virtqueue_add_packed(struct virtqueue *_vq,
 				       void *ctx,
 				       gfp_t gfp)
 {
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	struct vring_packed_desc *desc;
+	struct scatterlist *sg;
+	unsigned int i, n, descs_used, uninitialized_var(prev), err_idx;
+	__virtio16 uninitialized_var(head_flags), flags;
+	u16 head, avail_wrap_counter, id, cur;
+	bool indirect;
+
+	START_USE(vq);
+
+	BUG_ON(data == NULL);
+	BUG_ON(ctx && vq->indirect);
+
+	if (unlikely(vq->broken)) {
+		END_USE(vq);
+		return -EIO;
+	}
+
+#ifdef DEBUG
+	{
+		ktime_t now = ktime_get();
+
+		/* No kick or get, with .1 second between?  Warn. */
+		if (vq->last_add_time_valid)
+			WARN_ON(ktime_to_ms(ktime_sub(now, vq->last_add_time))
+					    > 100);
+		vq->last_add_time = now;
+		vq->last_add_time_valid = true;
+	}
+#endif
+
+	BUG_ON(total_sg == 0);
+
+	head = vq->next_avail_idx;
+	avail_wrap_counter = vq->avail_wrap_counter;
+
+	if (virtqueue_use_indirect(_vq, total_sg))
+		desc = alloc_indirect_packed(_vq, total_sg, gfp);
+	else {
+		desc = NULL;
+		WARN_ON_ONCE(total_sg > vq->vring_packed.num && !vq->indirect);
+	}
+
+	if (desc) {
+		/* Use a single buffer which doesn't continue */
+		indirect = true;
+		/* Set up rest to use this indirect table. */
+		i = 0;
+		descs_used = 1;
+	} else {
+		indirect = false;
+		desc = vq->vring_packed.desc;
+		i = head;
+		descs_used = total_sg;
+	}
+
+	if (vq->vq.num_free < descs_used) {
+		pr_debug("Can't add buf len %i - avail = %i\n",
+			 descs_used, vq->vq.num_free);
+		/* FIXME: for historical reasons, we force a notify here if
+		 * there are outgoing parts to the buffer.  Presumably the
+		 * host should service the ring ASAP. */
+		if (out_sgs)
+			vq->notify(&vq->vq);
+		if (indirect)
+			kfree(desc);
+		END_USE(vq);
+		return -ENOSPC;
+	}
+
+	id = vq->free_head;
+	BUG_ON(id == vq->vring_packed.num);
+
+	cur = id;
+	for (n = 0; n < out_sgs + in_sgs; n++) {
+		for (sg = sgs[n]; sg; sg = sg_next(sg)) {
+			dma_addr_t addr = vring_map_one_sg(vq, sg, n < out_sgs ?
+					       DMA_TO_DEVICE : DMA_FROM_DEVICE);
+			if (vring_mapping_error(vq, addr))
+				goto unmap_release;
+
+			flags = cpu_to_virtio16(_vq->vdev, VRING_DESC_F_NEXT |
+				  (n < out_sgs ? 0 : VRING_DESC_F_WRITE) |
+				  VRING_DESC_F_AVAIL(vq->avail_wrap_counter) |
+				  VRING_DESC_F_USED(!vq->avail_wrap_counter));
+			if (!indirect && i == head)
+				head_flags = flags;
+			else
+				desc[i].flags = flags;
+
+			desc[i].addr = cpu_to_virtio64(_vq->vdev, addr);
+			desc[i].len = cpu_to_virtio32(_vq->vdev, sg->length);
+			i++;
+			if (!indirect) {
+				vq->desc_state_packed[cur].addr = addr;
+				vq->desc_state_packed[cur].len = sg->length;
+				vq->desc_state_packed[cur].flags =
+					virtio16_to_cpu(_vq->vdev, flags);
+				cur = vq->desc_state_packed[cur].next;
+
+				if (i >= vq->vring_packed.num) {
+					i = 0;
+					vq->avail_wrap_counter ^= 1;
+				}
+			}
+		}
+	}
+
+	prev = (i > 0 ? i : vq->vring_packed.num) - 1;
+	desc[prev].id = cpu_to_virtio16(_vq->vdev, id);
+
+	/* Last one doesn't continue. */
+	if (total_sg == 1)
+		head_flags &= cpu_to_virtio16(_vq->vdev, ~VRING_DESC_F_NEXT);
+	else
+		desc[prev].flags &= cpu_to_virtio16(_vq->vdev,
+						~VRING_DESC_F_NEXT);
+
+	if (indirect) {
+		/* Now that the indirect table is filled in, map it. */
+		dma_addr_t addr = vring_map_single(
+			vq, desc, total_sg * sizeof(struct vring_packed_desc),
+			DMA_TO_DEVICE);
+		if (vring_mapping_error(vq, addr))
+			goto unmap_release;
+
+		head_flags = cpu_to_virtio16(_vq->vdev, VRING_DESC_F_INDIRECT |
+				       VRING_DESC_F_AVAIL(avail_wrap_counter) |
+				       VRING_DESC_F_USED(!avail_wrap_counter));
+		vq->vring_packed.desc[head].addr = cpu_to_virtio64(_vq->vdev,
+								   addr);
+		vq->vring_packed.desc[head].len = cpu_to_virtio32(_vq->vdev,
+				total_sg * sizeof(struct vring_packed_desc));
+		vq->vring_packed.desc[head].id = cpu_to_virtio16(_vq->vdev, id);
+
+		vq->desc_state_packed[id].addr = addr;
+		vq->desc_state_packed[id].len = total_sg *
+				sizeof(struct vring_packed_desc);
+		vq->desc_state_packed[id].flags =
+				virtio16_to_cpu(_vq->vdev, head_flags);
+	}
+
+	/* We're using some buffers from the free list. */
+	vq->vq.num_free -= descs_used;
+
+	/* Update free pointer */
+	if (indirect) {
+		n = head + 1;
+		if (n >= vq->vring_packed.num) {
+			n = 0;
+			vq->avail_wrap_counter ^= 1;
+		}
+		vq->next_avail_idx = n;
+		vq->free_head = vq->desc_state_packed[id].next;
+	} else {
+		vq->next_avail_idx = i;
+		vq->free_head = cur;
+	}
+
+	/* Store token and indirect buffer state. */
+	vq->desc_state_packed[id].num = descs_used;
+	vq->desc_state_packed[id].data = data;
+	if (indirect)
+		vq->desc_state_packed[id].indir_desc = desc;
+	else
+		vq->desc_state_packed[id].indir_desc = ctx;
+
+	/* A driver MUST NOT make the first descriptor in the list
+	 * available before all subsequent descriptors comprising
+	 * the list are made available. */
+	virtio_wmb(vq->weak_barriers);
+	vq->vring_packed.desc[head].flags = head_flags;
+	vq->num_added += descs_used;
+
+	pr_debug("Added buffer head %i to %p\n", head, vq);
+	END_USE(vq);
+
+	return 0;
+
+unmap_release:
+	err_idx = i;
+	i = head;
+
+	for (n = 0; n < total_sg; n++) {
+		if (i == err_idx)
+			break;
+		vring_unmap_desc_packed(vq, &desc[i]);
+		i++;
+		if (!indirect && i >= vq->vring_packed.num)
+			i = 0;
+	}
+
+	vq->avail_wrap_counter = avail_wrap_counter;
+
+	if (indirect)
+		kfree(desc);
+
+	END_USE(vq);
 	return -EIO;
 }
 
 static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
 {
-	return false;
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	u16 flags;
+	bool needs_kick;
+	u32 snapshot;
+
+	START_USE(vq);
+	/* We need to expose the new flags value before checking notification
+	 * suppressions. */
+	virtio_mb(vq->weak_barriers);
+
+	snapshot = *(u32 *)vq->vring_packed.device;
+	flags = virtio16_to_cpu(_vq->vdev, (__virtio16)(snapshot >> 16)) & 0x3;
+
+#ifdef DEBUG
+	if (vq->last_add_time_valid) {
+		WARN_ON(ktime_to_ms(ktime_sub(ktime_get(),
+					      vq->last_add_time)) > 100);
+	}
+	vq->last_add_time_valid = false;
+#endif
+
+	needs_kick = (flags != VRING_EVENT_F_DISABLE);
+	END_USE(vq);
+	return needs_kick;
+}
+
+static void detach_buf_packed(struct vring_virtqueue *vq,
+			      unsigned int id, void **ctx)
+{
+	struct vring_desc_state_packed *state;
+	struct vring_packed_desc *desc;
+	unsigned int i;
+	int *next;
+
+	/* Clear data ptr. */
+	vq->desc_state_packed[id].data = NULL;
+
+	next = &id;
+	for (i = 0; i < vq->desc_state_packed[id].num; i++) {
+		state = &vq->desc_state_packed[*next];
+		vring_unmap_state_packed(vq, state);
+		next = &vq->desc_state_packed[*next].next;
+	}
+
+	vq->vq.num_free += vq->desc_state_packed[id].num;
+	*next = vq->free_head;
+	vq->free_head = id;
+
+	if (vq->indirect) {
+		u32 len;
+
+		/* Free the indirect table, if any, now that it's unmapped. */
+		desc = vq->desc_state_packed[id].indir_desc;
+		if (!desc)
+			return;
+
+		len = vq->desc_state_packed[id].len;
+		for (i = 0; i < len / sizeof(struct vring_packed_desc); i++)
+			vring_unmap_desc_packed(vq, &desc[i]);
+
+		kfree(desc);
+		vq->desc_state_packed[id].indir_desc = NULL;
+	} else if (ctx) {
+		*ctx = vq->desc_state_packed[id].indir_desc;
+	}
 }
 
 static inline bool more_used_packed(const struct vring_virtqueue *vq)
 {
-	return false;
+	u16 last_used, flags;
+	u8 avail, used;
+
+	last_used = vq->last_used_idx;
+	flags = virtio16_to_cpu(vq->vq.vdev,
+				vq->vring_packed.desc[last_used].flags);
+	avail = !!(flags & VRING_DESC_F_AVAIL(1));
+	used = !!(flags & VRING_DESC_F_USED(1));
+
+	return avail == used && used == vq->used_wrap_counter;
 }
 
 static void *virtqueue_get_buf_ctx_packed(struct virtqueue *_vq,
 					  unsigned int *len,
 					  void **ctx)
 {
-	return NULL;
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	u16 last_used, id;
+	void *ret;
+
+	START_USE(vq);
+
+	if (unlikely(vq->broken)) {
+		END_USE(vq);
+		return NULL;
+	}
+
+	if (!more_used_packed(vq)) {
+		pr_debug("No more buffers in queue\n");
+		END_USE(vq);
+		return NULL;
+	}
+
+	/* Only get used elements after they have been exposed by host. */
+	virtio_rmb(vq->weak_barriers);
+
+	last_used = vq->last_used_idx;
+	id = virtio16_to_cpu(_vq->vdev, vq->vring_packed.desc[last_used].id);
+	*len = virtio32_to_cpu(_vq->vdev, vq->vring_packed.desc[last_used].len);
+
+	if (unlikely(id >= vq->vring_packed.num)) {
+		BAD_RING(vq, "id %u out of range\n", id);
+		return NULL;
+	}
+	if (unlikely(!vq->desc_state_packed[id].data)) {
+		BAD_RING(vq, "id %u is not a head!\n", id);
+		return NULL;
+	}
+
+	vq->last_used_idx += vq->desc_state_packed[id].num;
+	if (vq->last_used_idx >= vq->vring_packed.num) {
+		vq->last_used_idx -= vq->vring_packed.num;
+		vq->used_wrap_counter ^= 1;
+	}
+
+	/* detach_buf_packed clears data, so grab it now. */
+	ret = vq->desc_state_packed[id].data;
+	detach_buf_packed(vq, id, ctx);
+
+#ifdef DEBUG
+	vq->last_add_time_valid = false;
+#endif
+
+	END_USE(vq);
+	return ret;
 }
 
 static void virtqueue_disable_cb_packed(struct virtqueue *_vq)
 {
+	struct vring_virtqueue *vq = to_vvq(_vq);
+
+	if (vq->event_flags_shadow != VRING_EVENT_F_DISABLE) {
+		vq->event_flags_shadow = VRING_EVENT_F_DISABLE;
+		vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
+							vq->event_flags_shadow);
+	}
 }
 
 static unsigned virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
 {
-	return 0;
+	struct vring_virtqueue *vq = to_vvq(_vq);
+
+	START_USE(vq);
+
+	/* We optimistically turn back on interrupts, then check if there was
+	 * more to do. */
+
+	if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
+		virtio_wmb(vq->weak_barriers);
+		vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
+		vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
+							vq->event_flags_shadow);
+	}
+
+	END_USE(vq);
+	return vq->last_used_idx;
 }
 
 static bool virtqueue_poll_packed(struct virtqueue *_vq, unsigned last_used_idx)
 {
-	return false;
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	u8 avail, used;
+	u16 flags;
+
+	virtio_mb(vq->weak_barriers);
+	flags = virtio16_to_cpu(vq->vq.vdev,
+			vq->vring_packed.desc[last_used_idx].flags);
+	avail = !!(flags & VRING_DESC_F_AVAIL(1));
+	used = !!(flags & VRING_DESC_F_USED(1));
+
+	return avail == used && used == vq->used_wrap_counter;
 }
 
 static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
 {
-	return false;
+	struct vring_virtqueue *vq = to_vvq(_vq);
+
+	START_USE(vq);
+
+	/* We optimistically turn back on interrupts, then check if there was
+	 * more to do. */
+
+	if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
+		virtio_wmb(vq->weak_barriers);
+		vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
+		vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
+							vq->event_flags_shadow);
+	}
+
+	if (more_used_packed(vq)) {
+		END_USE(vq);
+		return false;
+	}
+
+	END_USE(vq);
+	return true;
 }
 
 static void *virtqueue_detach_unused_buf_packed(struct virtqueue *_vq)
 {
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	unsigned int i;
+	void *buf;
+
+	START_USE(vq);
+
+	for (i = 0; i < vq->vring_packed.num; i++) {
+		if (!vq->desc_state_packed[i].data)
+			continue;
+		/* detach_buf clears data, so grab it now. */
+		buf = vq->desc_state_packed[i].data;
+		detach_buf_packed(vq, i, NULL);
+		END_USE(vq);
+		return buf;
+	}
+	/* That should have freed everything. */
+	BUG_ON(vq->vq.num_free != vq->vring_packed.num);
+
+	END_USE(vq);
 	return NULL;
 }
 
-- 
2.17.0

^ permalink raw reply related

* [RFC v5 5/5] virtio_ring: enable packed ring
From: Tiwei Bie @ 2018-05-22  8:16 UTC (permalink / raw)
  To: mst, jasowang, virtualization, linux-kernel, netdev
  Cc: wexu, jfreimann, tiwei.bie
In-Reply-To: <20180522081648.14768-1-tiwei.bie@intel.com>

Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
---
 drivers/virtio/virtio_ring.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 1ee52a89cb04..355dfef4f1eb 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -1956,6 +1956,8 @@ void vring_transport_features(struct virtio_device *vdev)
 			break;
 		case VIRTIO_F_IOMMU_PLATFORM:
 			break;
+		case VIRTIO_F_RING_PACKED:
+			break;
 		default:
 			/* We don't understand this bit. */
 			__virtio_clear_bit(vdev, i);
-- 
2.17.0

^ permalink raw reply related

* [RFC v5 2/5] virtio_ring: support creating packed ring
From: Tiwei Bie @ 2018-05-22  8:16 UTC (permalink / raw)
  To: mst, jasowang, virtualization, linux-kernel, netdev
  Cc: wexu, jfreimann, tiwei.bie
In-Reply-To: <20180522081648.14768-1-tiwei.bie@intel.com>

This commit introduces the support for creating packed ring.
All split ring specific functions are added _split suffix.
Some necessary stubs for packed ring are also added.

Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
---
 drivers/virtio/virtio_ring.c | 801 +++++++++++++++++++++++------------
 include/linux/virtio_ring.h  |   8 +-
 2 files changed, 546 insertions(+), 263 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 71458f493cf8..f5ef5f42a7cf 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -61,11 +61,15 @@ struct vring_desc_state {
 	struct vring_desc *indir_desc;	/* Indirect descriptor, if any. */
 };
 
+struct vring_desc_state_packed {
+	int next;			/* The next desc state. */
+};
+
 struct vring_virtqueue {
 	struct virtqueue vq;
 
-	/* Actual memory layout for this queue */
-	struct vring vring;
+	/* Is this a packed ring? */
+	bool packed;
 
 	/* Can we use weak barriers? */
 	bool weak_barriers;
@@ -87,11 +91,39 @@ struct vring_virtqueue {
 	/* Last used index we've seen. */
 	u16 last_used_idx;
 
-	/* Last written value to avail->flags */
-	u16 avail_flags_shadow;
+	union {
+		/* Available for split ring */
+		struct {
+			/* Actual memory layout for this queue. */
+			struct vring vring;
 
-	/* Last written value to avail->idx in guest byte order */
-	u16 avail_idx_shadow;
+			/* Last written value to avail->flags */
+			u16 avail_flags_shadow;
+
+			/* Last written value to avail->idx in
+			 * guest byte order. */
+			u16 avail_idx_shadow;
+		};
+
+		/* Available for packed ring */
+		struct {
+			/* Actual memory layout for this queue. */
+			struct vring_packed vring_packed;
+
+			/* Driver ring wrap counter. */
+			u8 avail_wrap_counter;
+
+			/* Device ring wrap counter. */
+			u8 used_wrap_counter;
+
+			/* Index of the next avail descriptor. */
+			u16 next_avail_idx;
+
+			/* Last written value to driver->flags in
+			 * guest byte order. */
+			u16 event_flags_shadow;
+		};
+	};
 
 	/* How to notify other side. FIXME: commonalize hcalls! */
 	bool (*notify)(struct virtqueue *vq);
@@ -111,11 +143,24 @@ struct vring_virtqueue {
 #endif
 
 	/* Per-descriptor state. */
-	struct vring_desc_state desc_state[];
+	union {
+		struct vring_desc_state desc_state[1];
+		struct vring_desc_state_packed desc_state_packed[1];
+	};
 };
 
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
 
+static inline bool virtqueue_use_indirect(struct virtqueue *_vq,
+					  unsigned int total_sg)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+
+	/* If the host supports indirect descriptor tables, and we have multiple
+	 * buffers, then go indirect. FIXME: tune this threshold */
+	return (vq->indirect && total_sg > 1 && vq->vq.num_free);
+}
+
 /*
  * Modern virtio devices have feature bits to specify whether they need a
  * quirk and bypass the IOMMU. If not there, just use the DMA API.
@@ -201,8 +246,17 @@ static dma_addr_t vring_map_single(const struct vring_virtqueue *vq,
 			      cpu_addr, size, direction);
 }
 
-static void vring_unmap_one(const struct vring_virtqueue *vq,
-			    struct vring_desc *desc)
+static int vring_mapping_error(const struct vring_virtqueue *vq,
+			       dma_addr_t addr)
+{
+	if (!vring_use_dma_api(vq->vq.vdev))
+		return 0;
+
+	return dma_mapping_error(vring_dma_dev(vq), addr);
+}
+
+static void vring_unmap_one_split(const struct vring_virtqueue *vq,
+				  struct vring_desc *desc)
 {
 	u16 flags;
 
@@ -226,17 +280,9 @@ static void vring_unmap_one(const struct vring_virtqueue *vq,
 	}
 }
 
-static int vring_mapping_error(const struct vring_virtqueue *vq,
-			       dma_addr_t addr)
-{
-	if (!vring_use_dma_api(vq->vq.vdev))
-		return 0;
-
-	return dma_mapping_error(vring_dma_dev(vq), addr);
-}
-
-static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
-					 unsigned int total_sg, gfp_t gfp)
+static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
+					       unsigned int total_sg,
+					       gfp_t gfp)
 {
 	struct vring_desc *desc;
 	unsigned int i;
@@ -257,14 +303,14 @@ static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
 	return desc;
 }
 
-static inline int virtqueue_add(struct virtqueue *_vq,
-				struct scatterlist *sgs[],
-				unsigned int total_sg,
-				unsigned int out_sgs,
-				unsigned int in_sgs,
-				void *data,
-				void *ctx,
-				gfp_t gfp)
+static inline int virtqueue_add_split(struct virtqueue *_vq,
+				      struct scatterlist *sgs[],
+				      unsigned int total_sg,
+				      unsigned int out_sgs,
+				      unsigned int in_sgs,
+				      void *data,
+				      void *ctx,
+				      gfp_t gfp)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
 	struct scatterlist *sg;
@@ -300,10 +346,8 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 
 	head = vq->free_head;
 
-	/* If the host supports indirect descriptor tables, and we have multiple
-	 * buffers, then go indirect. FIXME: tune this threshold */
-	if (vq->indirect && total_sg > 1 && vq->vq.num_free)
-		desc = alloc_indirect(_vq, total_sg, gfp);
+	if (virtqueue_use_indirect(_vq, total_sg))
+		desc = alloc_indirect_split(_vq, total_sg, gfp);
 	else {
 		desc = NULL;
 		WARN_ON_ONCE(total_sg > vq->vring.num && !vq->indirect);
@@ -424,7 +468,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	for (n = 0; n < total_sg; n++) {
 		if (i == err_idx)
 			break;
-		vring_unmap_one(vq, &desc[i]);
+		vring_unmap_one_split(vq, &desc[i]);
 		i = virtio16_to_cpu(_vq->vdev, vq->vring.desc[i].next);
 	}
 
@@ -435,6 +479,355 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	return -EIO;
 }
 
+static bool virtqueue_kick_prepare_split(struct virtqueue *_vq)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	u16 new, old;
+	bool needs_kick;
+
+	START_USE(vq);
+	/* We need to expose available array entries before checking avail
+	 * event. */
+	virtio_mb(vq->weak_barriers);
+
+	old = vq->avail_idx_shadow - vq->num_added;
+	new = vq->avail_idx_shadow;
+	vq->num_added = 0;
+
+#ifdef DEBUG
+	if (vq->last_add_time_valid) {
+		WARN_ON(ktime_to_ms(ktime_sub(ktime_get(),
+					      vq->last_add_time)) > 100);
+	}
+	vq->last_add_time_valid = false;
+#endif
+
+	if (vq->event) {
+		needs_kick = vring_need_event(virtio16_to_cpu(_vq->vdev, vring_avail_event(&vq->vring)),
+					      new, old);
+	} else {
+		needs_kick = !(vq->vring.used->flags & cpu_to_virtio16(_vq->vdev, VRING_USED_F_NO_NOTIFY));
+	}
+	END_USE(vq);
+	return needs_kick;
+}
+
+static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
+			     void **ctx)
+{
+	unsigned int i, j;
+	__virtio16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT);
+
+	/* Clear data ptr. */
+	vq->desc_state[head].data = NULL;
+
+	/* Put back on free list: unmap first-level descriptors and find end */
+	i = head;
+
+	while (vq->vring.desc[i].flags & nextflag) {
+		vring_unmap_one_split(vq, &vq->vring.desc[i]);
+		i = virtio16_to_cpu(vq->vq.vdev, vq->vring.desc[i].next);
+		vq->vq.num_free++;
+	}
+
+	vring_unmap_one_split(vq, &vq->vring.desc[i]);
+	vq->vring.desc[i].next = cpu_to_virtio16(vq->vq.vdev, vq->free_head);
+	vq->free_head = head;
+
+	/* Plus final descriptor */
+	vq->vq.num_free++;
+
+	if (vq->indirect) {
+		struct vring_desc *indir_desc = vq->desc_state[head].indir_desc;
+		u32 len;
+
+		/* Free the indirect table, if any, now that it's unmapped. */
+		if (!indir_desc)
+			return;
+
+		len = virtio32_to_cpu(vq->vq.vdev, vq->vring.desc[head].len);
+
+		BUG_ON(!(vq->vring.desc[head].flags &
+			 cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
+		BUG_ON(len == 0 || len % sizeof(struct vring_desc));
+
+		for (j = 0; j < len / sizeof(struct vring_desc); j++)
+			vring_unmap_one_split(vq, &indir_desc[j]);
+
+		kfree(indir_desc);
+		vq->desc_state[head].indir_desc = NULL;
+	} else if (ctx) {
+		*ctx = vq->desc_state[head].indir_desc;
+	}
+}
+
+static inline bool more_used_split(const struct vring_virtqueue *vq)
+{
+	return vq->last_used_idx != virtio16_to_cpu(vq->vq.vdev, vq->vring.used->idx);
+}
+
+static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq,
+					 unsigned int *len,
+					 void **ctx)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	void *ret;
+	unsigned int i;
+	u16 last_used;
+
+	START_USE(vq);
+
+	if (unlikely(vq->broken)) {
+		END_USE(vq);
+		return NULL;
+	}
+
+	if (!more_used_split(vq)) {
+		pr_debug("No more buffers in queue\n");
+		END_USE(vq);
+		return NULL;
+	}
+
+	/* Only get used array entries after they have been exposed by host. */
+	virtio_rmb(vq->weak_barriers);
+
+	last_used = (vq->last_used_idx & (vq->vring.num - 1));
+	i = virtio32_to_cpu(_vq->vdev, vq->vring.used->ring[last_used].id);
+	*len = virtio32_to_cpu(_vq->vdev, vq->vring.used->ring[last_used].len);
+
+	if (unlikely(i >= vq->vring.num)) {
+		BAD_RING(vq, "id %u out of range\n", i);
+		return NULL;
+	}
+	if (unlikely(!vq->desc_state[i].data)) {
+		BAD_RING(vq, "id %u is not a head!\n", i);
+		return NULL;
+	}
+
+	/* detach_buf_split clears data, so grab it now. */
+	ret = vq->desc_state[i].data;
+	detach_buf_split(vq, i, ctx);
+	vq->last_used_idx++;
+	/* If we expect an interrupt for the next entry, tell host
+	 * by writing event index and flush out the write before
+	 * the read in the next get_buf call. */
+	if (!(vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT))
+		virtio_store_mb(vq->weak_barriers,
+				&vring_used_event(&vq->vring),
+				cpu_to_virtio16(_vq->vdev, vq->last_used_idx));
+
+#ifdef DEBUG
+	vq->last_add_time_valid = false;
+#endif
+
+	END_USE(vq);
+	return ret;
+}
+
+static void virtqueue_disable_cb_split(struct virtqueue *_vq)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+
+	if (!(vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT)) {
+		vq->avail_flags_shadow |= VRING_AVAIL_F_NO_INTERRUPT;
+		if (!vq->event)
+			vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
+	}
+}
+
+static unsigned virtqueue_enable_cb_prepare_split(struct virtqueue *_vq)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	u16 last_used_idx;
+
+	START_USE(vq);
+
+	/* We optimistically turn back on interrupts, then check if there was
+	 * more to do. */
+	/* Depending on the VIRTIO_RING_F_EVENT_IDX feature, we need to
+	 * either clear the flags bit or point the event index at the next
+	 * entry. Always do both to keep code simple. */
+	if (vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT) {
+		vq->avail_flags_shadow &= ~VRING_AVAIL_F_NO_INTERRUPT;
+		if (!vq->event)
+			vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
+	}
+	vring_used_event(&vq->vring) = cpu_to_virtio16(_vq->vdev, last_used_idx = vq->last_used_idx);
+	END_USE(vq);
+	return last_used_idx;
+}
+
+static bool virtqueue_poll_split(struct virtqueue *_vq, unsigned last_used_idx)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+
+	virtio_mb(vq->weak_barriers);
+	return (u16)last_used_idx != virtio16_to_cpu(_vq->vdev, vq->vring.used->idx);
+}
+
+static bool virtqueue_enable_cb_delayed_split(struct virtqueue *_vq)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	u16 bufs;
+
+	START_USE(vq);
+
+	/* We optimistically turn back on interrupts, then check if there was
+	 * more to do. */
+	/* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
+	 * either clear the flags bit or point the event index at the next
+	 * entry. Always update the event index to keep code simple. */
+	if (vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT) {
+		vq->avail_flags_shadow &= ~VRING_AVAIL_F_NO_INTERRUPT;
+		if (!vq->event)
+			vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
+	}
+	/* TODO: tune this threshold */
+	bufs = (u16)(vq->avail_idx_shadow - vq->last_used_idx) * 3 / 4;
+
+	virtio_store_mb(vq->weak_barriers,
+			&vring_used_event(&vq->vring),
+			cpu_to_virtio16(_vq->vdev, vq->last_used_idx + bufs));
+
+	if (unlikely((u16)(virtio16_to_cpu(_vq->vdev, vq->vring.used->idx) - vq->last_used_idx) > bufs)) {
+		END_USE(vq);
+		return false;
+	}
+
+	END_USE(vq);
+	return true;
+}
+
+static void *virtqueue_detach_unused_buf_split(struct virtqueue *_vq)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	unsigned int i;
+	void *buf;
+
+	START_USE(vq);
+
+	for (i = 0; i < vq->vring.num; i++) {
+		if (!vq->desc_state[i].data)
+			continue;
+		/* detach_buf clears data, so grab it now. */
+		buf = vq->desc_state[i].data;
+		detach_buf_split(vq, i, NULL);
+		vq->avail_idx_shadow--;
+		vq->vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->avail_idx_shadow);
+		END_USE(vq);
+		return buf;
+	}
+	/* That should have freed everything. */
+	BUG_ON(vq->vq.num_free != vq->vring.num);
+
+	END_USE(vq);
+	return NULL;
+}
+
+/*
+ * The layout for the packed ring is a continuous chunk of memory
+ * which looks like this.
+ *
+ * struct vring_packed {
+ *	// The actual descriptors (16 bytes each)
+ *	struct vring_packed_desc desc[num];
+ *
+ *	// Padding to the next align boundary.
+ *	char pad[];
+ *
+ *	// Driver Event Suppression
+ *	struct vring_packed_desc_event driver;
+ *
+ *	// Device Event Suppression
+ *	struct vring_packed_desc_event device;
+ * };
+ */
+static inline void vring_init_packed(struct vring_packed *vr, unsigned int num,
+				     void *p, unsigned long align)
+{
+	vr->num = num;
+	vr->desc = p;
+	vr->driver = (void *)(((uintptr_t)p + sizeof(struct vring_packed_desc)
+		* num + align - 1) & ~(align - 1));
+	vr->device = vr->driver + 1;
+}
+
+static inline unsigned vring_size_packed(unsigned int num, unsigned long align)
+{
+	return ((sizeof(struct vring_packed_desc) * num + align - 1)
+		& ~(align - 1)) + sizeof(struct vring_packed_desc_event) * 2;
+}
+
+static inline int virtqueue_add_packed(struct virtqueue *_vq,
+				       struct scatterlist *sgs[],
+				       unsigned int total_sg,
+				       unsigned int out_sgs,
+				       unsigned int in_sgs,
+				       void *data,
+				       void *ctx,
+				       gfp_t gfp)
+{
+	return -EIO;
+}
+
+static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
+{
+	return false;
+}
+
+static inline bool more_used_packed(const struct vring_virtqueue *vq)
+{
+	return false;
+}
+
+static void *virtqueue_get_buf_ctx_packed(struct virtqueue *_vq,
+					  unsigned int *len,
+					  void **ctx)
+{
+	return NULL;
+}
+
+static void virtqueue_disable_cb_packed(struct virtqueue *_vq)
+{
+}
+
+static unsigned virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
+{
+	return 0;
+}
+
+static bool virtqueue_poll_packed(struct virtqueue *_vq, unsigned last_used_idx)
+{
+	return false;
+}
+
+static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
+{
+	return false;
+}
+
+static void *virtqueue_detach_unused_buf_packed(struct virtqueue *_vq)
+{
+	return NULL;
+}
+
+static inline int virtqueue_add(struct virtqueue *_vq,
+				struct scatterlist *sgs[],
+				unsigned int total_sg,
+				unsigned int out_sgs,
+				unsigned int in_sgs,
+				void *data,
+				void *ctx,
+				gfp_t gfp)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+
+	return vq->packed ? virtqueue_add_packed(_vq, sgs, total_sg, out_sgs,
+						 in_sgs, data, ctx, gfp) :
+			    virtqueue_add_split(_vq, sgs, total_sg, out_sgs,
+						in_sgs, data, ctx, gfp);
+}
+
 /**
  * virtqueue_add_sgs - expose buffers to other end
  * @vq: the struct virtqueue we're talking about.
@@ -551,34 +944,9 @@ EXPORT_SYMBOL_GPL(virtqueue_add_inbuf_ctx);
 bool virtqueue_kick_prepare(struct virtqueue *_vq)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
-	u16 new, old;
-	bool needs_kick;
 
-	START_USE(vq);
-	/* We need to expose available array entries before checking avail
-	 * event. */
-	virtio_mb(vq->weak_barriers);
-
-	old = vq->avail_idx_shadow - vq->num_added;
-	new = vq->avail_idx_shadow;
-	vq->num_added = 0;
-
-#ifdef DEBUG
-	if (vq->last_add_time_valid) {
-		WARN_ON(ktime_to_ms(ktime_sub(ktime_get(),
-					      vq->last_add_time)) > 100);
-	}
-	vq->last_add_time_valid = false;
-#endif
-
-	if (vq->event) {
-		needs_kick = vring_need_event(virtio16_to_cpu(_vq->vdev, vring_avail_event(&vq->vring)),
-					      new, old);
-	} else {
-		needs_kick = !(vq->vring.used->flags & cpu_to_virtio16(_vq->vdev, VRING_USED_F_NO_NOTIFY));
-	}
-	END_USE(vq);
-	return needs_kick;
+	return vq->packed ? virtqueue_kick_prepare_packed(_vq) :
+			    virtqueue_kick_prepare_split(_vq);
 }
 EXPORT_SYMBOL_GPL(virtqueue_kick_prepare);
 
@@ -626,58 +994,9 @@ bool virtqueue_kick(struct virtqueue *vq)
 }
 EXPORT_SYMBOL_GPL(virtqueue_kick);
 
-static void detach_buf(struct vring_virtqueue *vq, unsigned int head,
-		       void **ctx)
-{
-	unsigned int i, j;
-	__virtio16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT);
-
-	/* Clear data ptr. */
-	vq->desc_state[head].data = NULL;
-
-	/* Put back on free list: unmap first-level descriptors and find end */
-	i = head;
-
-	while (vq->vring.desc[i].flags & nextflag) {
-		vring_unmap_one(vq, &vq->vring.desc[i]);
-		i = virtio16_to_cpu(vq->vq.vdev, vq->vring.desc[i].next);
-		vq->vq.num_free++;
-	}
-
-	vring_unmap_one(vq, &vq->vring.desc[i]);
-	vq->vring.desc[i].next = cpu_to_virtio16(vq->vq.vdev, vq->free_head);
-	vq->free_head = head;
-
-	/* Plus final descriptor */
-	vq->vq.num_free++;
-
-	if (vq->indirect) {
-		struct vring_desc *indir_desc = vq->desc_state[head].indir_desc;
-		u32 len;
-
-		/* Free the indirect table, if any, now that it's unmapped. */
-		if (!indir_desc)
-			return;
-
-		len = virtio32_to_cpu(vq->vq.vdev, vq->vring.desc[head].len);
-
-		BUG_ON(!(vq->vring.desc[head].flags &
-			 cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
-		BUG_ON(len == 0 || len % sizeof(struct vring_desc));
-
-		for (j = 0; j < len / sizeof(struct vring_desc); j++)
-			vring_unmap_one(vq, &indir_desc[j]);
-
-		kfree(indir_desc);
-		vq->desc_state[head].indir_desc = NULL;
-	} else if (ctx) {
-		*ctx = vq->desc_state[head].indir_desc;
-	}
-}
-
 static inline bool more_used(const struct vring_virtqueue *vq)
 {
-	return vq->last_used_idx != virtio16_to_cpu(vq->vq.vdev, vq->vring.used->idx);
+	return vq->packed ? more_used_packed(vq) : more_used_split(vq);
 }
 
 /**
@@ -700,57 +1019,9 @@ void *virtqueue_get_buf_ctx(struct virtqueue *_vq, unsigned int *len,
 			    void **ctx)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
-	void *ret;
-	unsigned int i;
-	u16 last_used;
 
-	START_USE(vq);
-
-	if (unlikely(vq->broken)) {
-		END_USE(vq);
-		return NULL;
-	}
-
-	if (!more_used(vq)) {
-		pr_debug("No more buffers in queue\n");
-		END_USE(vq);
-		return NULL;
-	}
-
-	/* Only get used array entries after they have been exposed by host. */
-	virtio_rmb(vq->weak_barriers);
-
-	last_used = (vq->last_used_idx & (vq->vring.num - 1));
-	i = virtio32_to_cpu(_vq->vdev, vq->vring.used->ring[last_used].id);
-	*len = virtio32_to_cpu(_vq->vdev, vq->vring.used->ring[last_used].len);
-
-	if (unlikely(i >= vq->vring.num)) {
-		BAD_RING(vq, "id %u out of range\n", i);
-		return NULL;
-	}
-	if (unlikely(!vq->desc_state[i].data)) {
-		BAD_RING(vq, "id %u is not a head!\n", i);
-		return NULL;
-	}
-
-	/* detach_buf clears data, so grab it now. */
-	ret = vq->desc_state[i].data;
-	detach_buf(vq, i, ctx);
-	vq->last_used_idx++;
-	/* If we expect an interrupt for the next entry, tell host
-	 * by writing event index and flush out the write before
-	 * the read in the next get_buf call. */
-	if (!(vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT))
-		virtio_store_mb(vq->weak_barriers,
-				&vring_used_event(&vq->vring),
-				cpu_to_virtio16(_vq->vdev, vq->last_used_idx));
-
-#ifdef DEBUG
-	vq->last_add_time_valid = false;
-#endif
-
-	END_USE(vq);
-	return ret;
+	return vq->packed ? virtqueue_get_buf_ctx_packed(_vq, len, ctx) :
+			    virtqueue_get_buf_ctx_split(_vq, len, ctx);
 }
 EXPORT_SYMBOL_GPL(virtqueue_get_buf_ctx);
 
@@ -772,12 +1043,10 @@ void virtqueue_disable_cb(struct virtqueue *_vq)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
 
-	if (!(vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT)) {
-		vq->avail_flags_shadow |= VRING_AVAIL_F_NO_INTERRUPT;
-		if (!vq->event)
-			vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
-	}
-
+	if (vq->packed)
+		virtqueue_disable_cb_packed(_vq);
+	else
+		virtqueue_disable_cb_split(_vq);
 }
 EXPORT_SYMBOL_GPL(virtqueue_disable_cb);
 
@@ -796,23 +1065,9 @@ EXPORT_SYMBOL_GPL(virtqueue_disable_cb);
 unsigned virtqueue_enable_cb_prepare(struct virtqueue *_vq)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
-	u16 last_used_idx;
 
-	START_USE(vq);
-
-	/* We optimistically turn back on interrupts, then check if there was
-	 * more to do. */
-	/* Depending on the VIRTIO_RING_F_EVENT_IDX feature, we need to
-	 * either clear the flags bit or point the event index at the next
-	 * entry. Always do both to keep code simple. */
-	if (vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT) {
-		vq->avail_flags_shadow &= ~VRING_AVAIL_F_NO_INTERRUPT;
-		if (!vq->event)
-			vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
-	}
-	vring_used_event(&vq->vring) = cpu_to_virtio16(_vq->vdev, last_used_idx = vq->last_used_idx);
-	END_USE(vq);
-	return last_used_idx;
+	return vq->packed ? virtqueue_enable_cb_prepare_packed(_vq) :
+			    virtqueue_enable_cb_prepare_split(_vq);
 }
 EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare);
 
@@ -829,8 +1084,8 @@ bool virtqueue_poll(struct virtqueue *_vq, unsigned last_used_idx)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
 
-	virtio_mb(vq->weak_barriers);
-	return (u16)last_used_idx != virtio16_to_cpu(_vq->vdev, vq->vring.used->idx);
+	return vq->packed ? virtqueue_poll_packed(_vq, last_used_idx) :
+			    virtqueue_poll_split(_vq, last_used_idx);
 }
 EXPORT_SYMBOL_GPL(virtqueue_poll);
 
@@ -868,34 +1123,9 @@ EXPORT_SYMBOL_GPL(virtqueue_enable_cb);
 bool virtqueue_enable_cb_delayed(struct virtqueue *_vq)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
-	u16 bufs;
 
-	START_USE(vq);
-
-	/* We optimistically turn back on interrupts, then check if there was
-	 * more to do. */
-	/* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
-	 * either clear the flags bit or point the event index at the next
-	 * entry. Always update the event index to keep code simple. */
-	if (vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT) {
-		vq->avail_flags_shadow &= ~VRING_AVAIL_F_NO_INTERRUPT;
-		if (!vq->event)
-			vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
-	}
-	/* TODO: tune this threshold */
-	bufs = (u16)(vq->avail_idx_shadow - vq->last_used_idx) * 3 / 4;
-
-	virtio_store_mb(vq->weak_barriers,
-			&vring_used_event(&vq->vring),
-			cpu_to_virtio16(_vq->vdev, vq->last_used_idx + bufs));
-
-	if (unlikely((u16)(virtio16_to_cpu(_vq->vdev, vq->vring.used->idx) - vq->last_used_idx) > bufs)) {
-		END_USE(vq);
-		return false;
-	}
-
-	END_USE(vq);
-	return true;
+	return vq->packed ? virtqueue_enable_cb_delayed_packed(_vq) :
+			    virtqueue_enable_cb_delayed_split(_vq);
 }
 EXPORT_SYMBOL_GPL(virtqueue_enable_cb_delayed);
 
@@ -910,27 +1140,9 @@ EXPORT_SYMBOL_GPL(virtqueue_enable_cb_delayed);
 void *virtqueue_detach_unused_buf(struct virtqueue *_vq)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
-	unsigned int i;
-	void *buf;
 
-	START_USE(vq);
-
-	for (i = 0; i < vq->vring.num; i++) {
-		if (!vq->desc_state[i].data)
-			continue;
-		/* detach_buf clears data, so grab it now. */
-		buf = vq->desc_state[i].data;
-		detach_buf(vq, i, NULL);
-		vq->avail_idx_shadow--;
-		vq->vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->avail_idx_shadow);
-		END_USE(vq);
-		return buf;
-	}
-	/* That should have freed everything. */
-	BUG_ON(vq->vq.num_free != vq->vring.num);
-
-	END_USE(vq);
-	return NULL;
+	return vq->packed ? virtqueue_detach_unused_buf_packed(_vq) :
+			    virtqueue_detach_unused_buf_split(_vq);
 }
 EXPORT_SYMBOL_GPL(virtqueue_detach_unused_buf);
 
@@ -955,7 +1167,8 @@ irqreturn_t vring_interrupt(int irq, void *_vq)
 EXPORT_SYMBOL_GPL(vring_interrupt);
 
 struct virtqueue *__vring_new_virtqueue(unsigned int index,
-					struct vring vring,
+					union vring_union vring,
+					bool packed,
 					struct virtio_device *vdev,
 					bool weak_barriers,
 					bool context,
@@ -963,19 +1176,22 @@ struct virtqueue *__vring_new_virtqueue(unsigned int index,
 					void (*callback)(struct virtqueue *),
 					const char *name)
 {
-	unsigned int i;
 	struct vring_virtqueue *vq;
+	unsigned int num, i;
+	size_t size;
 
-	vq = kmalloc(sizeof(*vq) + vring.num * sizeof(struct vring_desc_state),
-		     GFP_KERNEL);
+	num = packed ? vring.vring_packed.num : vring.vring_split.num;
+	size = packed ? num * sizeof(struct vring_desc_state_packed) :
+			num * sizeof(struct vring_desc_state);
+
+	vq = kmalloc(sizeof(*vq) + size, GFP_KERNEL);
 	if (!vq)
 		return NULL;
 
-	vq->vring = vring;
 	vq->vq.callback = callback;
 	vq->vq.vdev = vdev;
 	vq->vq.name = name;
-	vq->vq.num_free = vring.num;
+	vq->vq.num_free = num;
 	vq->vq.index = index;
 	vq->we_own_ring = false;
 	vq->queue_dma_addr = 0;
@@ -984,9 +1200,8 @@ struct virtqueue *__vring_new_virtqueue(unsigned int index,
 	vq->weak_barriers = weak_barriers;
 	vq->broken = false;
 	vq->last_used_idx = 0;
-	vq->avail_flags_shadow = 0;
-	vq->avail_idx_shadow = 0;
 	vq->num_added = 0;
+	vq->packed = packed;
 	list_add_tail(&vq->vq.list, &vdev->vqs);
 #ifdef DEBUG
 	vq->in_use = false;
@@ -997,19 +1212,48 @@ struct virtqueue *__vring_new_virtqueue(unsigned int index,
 		!context;
 	vq->event = virtio_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX);
 
+	if (vq->packed) {
+		vq->vring_packed = vring.vring_packed;
+		vq->next_avail_idx = 0;
+		vq->avail_wrap_counter = 1;
+		vq->used_wrap_counter = 1;
+		vq->event_flags_shadow = 0;
+
+		memset(vq->desc_state_packed, 0,
+			num * sizeof(struct vring_desc_state_packed));
+
+		/* Put everything in free lists. */
+		vq->free_head = 0;
+		for (i = 0; i < num-1; i++)
+			vq->desc_state_packed[i].next = i + 1;
+	} else {
+		vq->vring = vring.vring_split;
+		vq->avail_flags_shadow = 0;
+		vq->avail_idx_shadow = 0;
+
+		/* Put everything in free lists. */
+		vq->free_head = 0;
+		for (i = 0; i < num-1; i++)
+			vq->vring.desc[i].next = cpu_to_virtio16(vdev, i + 1);
+
+		memset(vq->desc_state, 0,
+			num * sizeof(struct vring_desc_state));
+	}
+
 	/* No callback?  Tell other side not to bother us. */
 	if (!callback) {
-		vq->avail_flags_shadow |= VRING_AVAIL_F_NO_INTERRUPT;
-		if (!vq->event)
-			vq->vring.avail->flags = cpu_to_virtio16(vdev, vq->avail_flags_shadow);
+		if (packed) {
+			vq->event_flags_shadow = VRING_EVENT_F_DISABLE;
+			vq->vring_packed.driver->flags = cpu_to_virtio16(vdev,
+						vq->event_flags_shadow);
+		} else {
+			vq->avail_flags_shadow |= VRING_AVAIL_F_NO_INTERRUPT;
+			if (!vq->event)
+				vq->vring.avail->flags = cpu_to_virtio16(vdev,
+						vq->avail_flags_shadow);
+		}
 	}
 
-	/* Put everything in free lists. */
-	vq->free_head = 0;
-	for (i = 0; i < vring.num-1; i++)
-		vq->vring.desc[i].next = cpu_to_virtio16(vdev, i + 1);
-	memset(vq->desc_state, 0, vring.num * sizeof(struct vring_desc_state));
-
 	return &vq->vq;
 }
 EXPORT_SYMBOL_GPL(__vring_new_virtqueue);
@@ -1056,6 +1300,12 @@ static void vring_free_queue(struct virtio_device *vdev, size_t size,
 	}
 }
 
+static inline int
+__vring_size(unsigned int num, unsigned long align, bool packed)
+{
+	return packed ? vring_size_packed(num, align) : vring_size(num, align);
+}
+
 struct virtqueue *vring_create_virtqueue(
 	unsigned int index,
 	unsigned int num,
@@ -1072,7 +1322,8 @@ struct virtqueue *vring_create_virtqueue(
 	void *queue = NULL;
 	dma_addr_t dma_addr;
 	size_t queue_size_in_bytes;
-	struct vring vring;
+	union vring_union vring;
+	bool packed;
 
 	/* We assume num is a power of 2. */
 	if (num & (num - 1)) {
@@ -1080,9 +1331,13 @@ struct virtqueue *vring_create_virtqueue(
 		return NULL;
 	}
 
+	packed = virtio_has_feature(vdev, VIRTIO_F_RING_PACKED);
+
 	/* TODO: allocate each queue chunk individually */
-	for (; num && vring_size(num, vring_align) > PAGE_SIZE; num /= 2) {
-		queue = vring_alloc_queue(vdev, vring_size(num, vring_align),
+	for (; num && __vring_size(num, vring_align, packed) > PAGE_SIZE;
+			num /= 2) {
+		queue = vring_alloc_queue(vdev, __vring_size(num, vring_align,
+							     packed),
 					  &dma_addr,
 					  GFP_KERNEL|__GFP_NOWARN|__GFP_ZERO);
 		if (queue)
@@ -1094,17 +1349,21 @@ struct virtqueue *vring_create_virtqueue(
 
 	if (!queue) {
 		/* Try to get a single page. You are my only hope! */
-		queue = vring_alloc_queue(vdev, vring_size(num, vring_align),
+		queue = vring_alloc_queue(vdev, __vring_size(num, vring_align,
+							     packed),
 					  &dma_addr, GFP_KERNEL|__GFP_ZERO);
 	}
 	if (!queue)
 		return NULL;
 
-	queue_size_in_bytes = vring_size(num, vring_align);
-	vring_init(&vring, num, queue, vring_align);
+	queue_size_in_bytes = __vring_size(num, vring_align, packed);
+	if (packed)
+		vring_init_packed(&vring.vring_packed, num, queue, vring_align);
+	else
+		vring_init(&vring.vring_split, num, queue, vring_align);
 
-	vq = __vring_new_virtqueue(index, vring, vdev, weak_barriers, context,
-				   notify, callback, name);
+	vq = __vring_new_virtqueue(index, vring, packed, vdev, weak_barriers,
+				   context, notify, callback, name);
 	if (!vq) {
 		vring_free_queue(vdev, queue_size_in_bytes, queue,
 				 dma_addr);
@@ -1130,10 +1389,17 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
 				      void (*callback)(struct virtqueue *vq),
 				      const char *name)
 {
-	struct vring vring;
-	vring_init(&vring, num, pages, vring_align);
-	return __vring_new_virtqueue(index, vring, vdev, weak_barriers, context,
-				     notify, callback, name);
+	union vring_union vring;
+	bool packed;
+
+	packed = virtio_has_feature(vdev, VIRTIO_F_RING_PACKED);
+	if (packed)
+		vring_init_packed(&vring.vring_packed, num, pages, vring_align);
+	else
+		vring_init(&vring.vring_split, num, pages, vring_align);
+
+	return __vring_new_virtqueue(index, vring, packed, vdev, weak_barriers,
+				     context, notify, callback, name);
 }
 EXPORT_SYMBOL_GPL(vring_new_virtqueue);
 
@@ -1143,7 +1409,9 @@ void vring_del_virtqueue(struct virtqueue *_vq)
 
 	if (vq->we_own_ring) {
 		vring_free_queue(vq->vq.vdev, vq->queue_size_in_bytes,
-				 vq->vring.desc, vq->queue_dma_addr);
+				 vq->packed ? (void *)vq->vring_packed.desc :
+					      (void *)vq->vring.desc,
+				 vq->queue_dma_addr);
 	}
 	list_del(&_vq->list);
 	kfree(vq);
@@ -1185,7 +1453,7 @@ unsigned int virtqueue_get_vring_size(struct virtqueue *_vq)
 
 	struct vring_virtqueue *vq = to_vvq(_vq);
 
-	return vq->vring.num;
+	return vq->packed ? vq->vring_packed.num : vq->vring.num;
 }
 EXPORT_SYMBOL_GPL(virtqueue_get_vring_size);
 
@@ -1228,6 +1496,10 @@ dma_addr_t virtqueue_get_avail_addr(struct virtqueue *_vq)
 
 	BUG_ON(!vq->we_own_ring);
 
+	if (vq->packed)
+		return vq->queue_dma_addr + ((char *)vq->vring_packed.driver -
+				(char *)vq->vring_packed.desc);
+
 	return vq->queue_dma_addr +
 		((char *)vq->vring.avail - (char *)vq->vring.desc);
 }
@@ -1239,11 +1511,16 @@ dma_addr_t virtqueue_get_used_addr(struct virtqueue *_vq)
 
 	BUG_ON(!vq->we_own_ring);
 
+	if (vq->packed)
+		return vq->queue_dma_addr + ((char *)vq->vring_packed.device -
+				(char *)vq->vring_packed.desc);
+
 	return vq->queue_dma_addr +
 		((char *)vq->vring.used - (char *)vq->vring.desc);
 }
 EXPORT_SYMBOL_GPL(virtqueue_get_used_addr);
 
+/* Only available for split ring */
 const struct vring *virtqueue_get_vring(struct virtqueue *vq)
 {
 	return &to_vvq(vq)->vring;
diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
index bbf32524ab27..a0075894ad16 100644
--- a/include/linux/virtio_ring.h
+++ b/include/linux/virtio_ring.h
@@ -60,6 +60,11 @@ static inline void virtio_store_mb(bool weak_barriers,
 struct virtio_device;
 struct virtqueue;
 
+union vring_union {
+	struct vring vring_split;
+	struct vring_packed vring_packed;
+};
+
 /*
  * Creates a virtqueue and allocates the descriptor ring.  If
  * may_reduce_num is set, then this may allocate a smaller ring than
@@ -79,7 +84,8 @@ struct virtqueue *vring_create_virtqueue(unsigned int index,
 
 /* Creates a virtqueue with a custom layout. */
 struct virtqueue *__vring_new_virtqueue(unsigned int index,
-					struct vring vring,
+					union vring_union vring,
+					bool packed,
 					struct virtio_device *vdev,
 					bool weak_barriers,
 					bool ctx,
-- 
2.17.0

^ permalink raw reply related

* [RFC v5 4/5] virtio_ring: add event idx support in packed ring
From: Tiwei Bie @ 2018-05-22  8:16 UTC (permalink / raw)
  To: mst, jasowang, virtualization, linux-kernel, netdev
  Cc: wexu, jfreimann, tiwei.bie
In-Reply-To: <20180522081648.14768-1-tiwei.bie@intel.com>

This commit introduces the EVENT_IDX support in packed ring.

Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
---
 drivers/virtio/virtio_ring.c | 67 ++++++++++++++++++++++++++++++++++--
 1 file changed, 64 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index eb9fd5207a68..1ee52a89cb04 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -1043,7 +1043,7 @@ static inline int virtqueue_add_packed(struct virtqueue *_vq,
 static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
-	u16 flags;
+	u16 new, old, off_wrap, flags, wrap_counter, event_idx;
 	bool needs_kick;
 	u32 snapshot;
 
@@ -1052,9 +1052,19 @@ static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
 	 * suppressions. */
 	virtio_mb(vq->weak_barriers);
 
+	old = vq->next_avail_idx - vq->num_added;
+	new = vq->next_avail_idx;
+	vq->num_added = 0;
+
 	snapshot = *(u32 *)vq->vring_packed.device;
+	off_wrap = virtio16_to_cpu(_vq->vdev, (__virtio16)(snapshot & 0xffff));
 	flags = virtio16_to_cpu(_vq->vdev, (__virtio16)(snapshot >> 16)) & 0x3;
 
+	wrap_counter = off_wrap >> 15;
+	event_idx = off_wrap & ~(1<<15);
+	if (wrap_counter != vq->avail_wrap_counter)
+		event_idx -= vq->vring_packed.num;
+
 #ifdef DEBUG
 	if (vq->last_add_time_valid) {
 		WARN_ON(ktime_to_ms(ktime_sub(ktime_get(),
@@ -1063,7 +1073,10 @@ static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
 	vq->last_add_time_valid = false;
 #endif
 
-	needs_kick = (flags != VRING_EVENT_F_DISABLE);
+	if (flags == VRING_EVENT_F_DESC)
+		needs_kick = vring_need_event(event_idx, new, old);
+	else
+		needs_kick = (flags != VRING_EVENT_F_DISABLE);
 	END_USE(vq);
 	return needs_kick;
 }
@@ -1170,6 +1183,15 @@ static void *virtqueue_get_buf_ctx_packed(struct virtqueue *_vq,
 	ret = vq->desc_state_packed[id].data;
 	detach_buf_packed(vq, id, ctx);
 
+	/* If we expect an interrupt for the next entry, tell host
+	 * by writing event index and flush out the write before
+	 * the read in the next get_buf call. */
+	if (vq->event_flags_shadow == VRING_EVENT_F_DESC)
+		virtio_store_mb(vq->weak_barriers,
+				&vq->vring_packed.driver->off_wrap,
+				cpu_to_virtio16(_vq->vdev, vq->last_used_idx |
+					((u16)vq->used_wrap_counter << 15)));
+
 #ifdef DEBUG
 	vq->last_add_time_valid = false;
 #endif
@@ -1197,10 +1219,26 @@ static unsigned virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
 
 	/* We optimistically turn back on interrupts, then check if there was
 	 * more to do. */
+	/* Depending on the VIRTIO_RING_F_EVENT_IDX feature, we need to
+	 * either clear the flags bit or point the event index at the next
+	 * entry. Always update the event index to keep code simple. */
+
+	vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
+			vq->last_used_idx |
+			((u16)vq->used_wrap_counter << 15));
 
 	if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
 		virtio_wmb(vq->weak_barriers);
+		// XXX XXX XXX
+		// Setting VRING_EVENT_F_DESC in this function
+		// will break netperf test for now.
+		// Will need to look into this.
+#if 0
+		vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC :
+						     VRING_EVENT_F_ENABLE;
+#else
 		vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
+#endif
 		vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
 							vq->event_flags_shadow);
 	}
@@ -1227,15 +1265,38 @@ static bool virtqueue_poll_packed(struct virtqueue *_vq, unsigned last_used_idx)
 static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
+	u16 bufs, used_idx, wrap_counter;
 
 	START_USE(vq);
 
 	/* We optimistically turn back on interrupts, then check if there was
 	 * more to do. */
+	/* Depending on the VIRTIO_RING_F_EVENT_IDX feature, we need to
+	 * either clear the flags bit or point the event index at the next
+	 * entry. Always update the event index to keep code simple. */
+
+	/* TODO: tune this threshold */
+	if (vq->next_avail_idx < vq->last_used_idx)
+		bufs = (vq->vring_packed.num + vq->next_avail_idx -
+				vq->last_used_idx) * 3 / 4;
+	else
+		bufs = (vq->next_avail_idx - vq->last_used_idx) * 3 / 4;
+
+	wrap_counter = vq->used_wrap_counter;
+
+	used_idx = vq->last_used_idx + bufs;
+	if (used_idx >= vq->vring_packed.num) {
+		used_idx -= vq->vring_packed.num;
+		wrap_counter ^= 1;
+	}
+
+	vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
+			used_idx | (wrap_counter << 15));
 
 	if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
 		virtio_wmb(vq->weak_barriers);
-		vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
+		vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC :
+						     VRING_EVENT_F_ENABLE;
 		vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
 							vq->event_flags_shadow);
 	}
-- 
2.17.0

^ permalink raw reply related

* [PATCH] net/mlx4: fix spelling mistake: "Inrerface" -> "Interface"
From: Colin King @ 2018-05-22  8:37 UTC (permalink / raw)
  To: Tariq Toukan, David S . Miller, netdev, linux-rdma
  Cc: kernel-janitors, linux-kernel

From: Colin Ian King <colin.king@canonical.com>

Trivial fix to spelling mistake in mlx4_dbg debug message

Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
 drivers/net/ethernet/mellanox/mlx4/intf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/intf.c b/drivers/net/ethernet/mellanox/mlx4/intf.c
index 2edcce98ab2d..6bd4103265d2 100644
--- a/drivers/net/ethernet/mellanox/mlx4/intf.c
+++ b/drivers/net/ethernet/mellanox/mlx4/intf.c
@@ -172,7 +172,7 @@ int mlx4_do_bond(struct mlx4_dev *dev, bool enable)
 		list_add_tail(&dev_ctx->list, &priv->ctx_list);
 		spin_unlock_irqrestore(&priv->ctx_lock, flags);
 
-		mlx4_dbg(dev, "Inrerface for protocol %d restarted with when bonded mode is %s\n",
+		mlx4_dbg(dev, "Interface for protocol %d restarted with when bonded mode is %s\n",
 			 dev_ctx->intf->protocol, enable ?
 			 "enabled" : "disabled");
 	}
-- 
2.17.0

^ permalink raw reply related

* Re: KASAN: use-after-free Read in vhost_chr_write_iter
From: DaeRyong Jeong @ 2018-05-22  8:38 UTC (permalink / raw)
  To: Jason Wang
  Cc: mst, kvm, virtualization, netdev, linux-kernel, byoungyoung,
	kt0755, bammanag
In-Reply-To: <fb27c1fd-5172-252a-cb8f-b53927a26d06@redhat.com>

On Mon, May 21, 2018 at 10:38:10AM +0800, Jason Wang wrote:
> 
> 
> On 2018年05月18日 17:24, Jason Wang wrote:
> > 
> > 
> > On 2018年05月17日 21:45, DaeRyong Jeong wrote:
> > > We report the crash: KASAN: use-after-free Read in vhost_chr_write_iter
> > > 
> > > This crash has been found in v4.17-rc1 using RaceFuzzer (a modified
> > > version of Syzkaller), which we describe more at the end of this
> > > report. Our analysis shows that the race occurs when invoking two
> > > syscalls concurrently, write$vnet and ioctl$VHOST_RESET_OWNER.
> > > 
> > > 
> > > Analysis:
> > > We think the concurrent execution of vhost_process_iotlb_msg() and
> > > vhost_dev_cleanup() causes the crash.
> > > Both of functions can run concurrently (please see call sequence below),
> > > and possibly, there is a race on dev->iotlb.
> > > If the switch occurs right after vhost_dev_cleanup() frees
> > > dev->iotlb, vhost_process_iotlb_msg() still sees the non-null value
> > > and it
> > > keep executing without returning -EFAULT. Consequently, use-after-free
> > > occures
> > > 
> > > 
> > > Thread interleaving:
> > > CPU0 (vhost_process_iotlb_msg)                CPU1 (vhost_dev_cleanup)
> > > (In the case of both VHOST_IOTLB_UPDATE and
> > > VHOST_IOTLB_INVALIDATE)
> > > =====                            =====
> > >                             vhost_umem_clean(dev->iotlb);
> > > if (!dev->iotlb) {
> > >             ret = -EFAULT;
> > >                 break;
> > > }
> > >                             dev->iotlb = NULL;
> > > 
> > > 
> > > Call Sequence:
> > > CPU0
> > > =====
> > > vhost_net_chr_write_iter
> > >     vhost_chr_write_iter
> > >         vhost_process_iotlb_msg
> > > 
> > > CPU1
> > > =====
> > > vhost_net_ioctl
> > >     vhost_net_reset_owner
> > >         vhost_dev_reset_owner
> > >             vhost_dev_cleanup
> > 
> > Thanks a lot for the analysis.
> > 
> > This could be addressed by simply protect it with dev mutex.
> > 
> > Will post a patch.
> > 
> 
> Could you please help to test the attached patch? I've done some smoking
> test.
> 
> Thanks

Sorry to say this, but we don't have a reproducer for this bug since our
reproducer is being implemented.

This crash had occrued a few times in our fuzzer, so I inspected the code
manually.

It seems the patch is good for me, but we can't test the patch for now.
Sorry.

> From 88328386f3f652e684ee33dc4cf63dcaed871aea Mon Sep 17 00:00:00 2001
> From: Jason Wang <jasowang@redhat.com>
> Date: Fri, 18 May 2018 17:33:27 +0800
> Subject: [PATCH] vhost: synchronize IOTLB message with dev cleanup
> 
> DaeRyong Jeong reports a race between vhost_dev_cleanup() and
> vhost_process_iotlb_msg():
> 
> Thread interleaving:
> CPU0 (vhost_process_iotlb_msg)			CPU1 (vhost_dev_cleanup)
> (In the case of both VHOST_IOTLB_UPDATE and
> VHOST_IOTLB_INVALIDATE)
> =====						=====
> 						vhost_umem_clean(dev->iotlb);
> if (!dev->iotlb) {
> 	        ret = -EFAULT;
> 		        break;
> }
> 						dev->iotlb = NULL;
> 
> The reason is we don't synchronize between them, fixing by protecting
> vhost_process_iotlb_msg() with dev mutex.
> 
> Reported-by: DaeRyong Jeong <threeearcat@gmail.com>
> Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
> Reported-by: DaeRyong Jeong <threeearcat@gmail.com>
> ---
>  drivers/vhost/vhost.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index f3bd8e9..f0be5f3 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -981,6 +981,7 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
>  {
>  	int ret = 0;
>  
> +	mutex_lock(&dev->mutex);
>  	vhost_dev_lock_vqs(dev);
>  	switch (msg->type) {
>  	case VHOST_IOTLB_UPDATE:
> @@ -1016,6 +1017,8 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
>  	}
>  
>  	vhost_dev_unlock_vqs(dev);
> +	mutex_unlock(&dev->mutex);
> +
>  	return ret;
>  }
>  ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
> -- 
> 2.7.4
> 

^ permalink raw reply

* Re: [PATCH net-next 08/13] devlink: don't take instance lock around eswitch mode set
From: Jiri Pirko @ 2018-05-22  8:41 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: davem, netdev, oss-drivers
In-Reply-To: <20180522051255.9438-9-jakub.kicinski@netronome.com>

Tue, May 22, 2018 at 07:12:50AM CEST, jakub.kicinski@netronome.com wrote:
>Changing switch mode may want to register and unregister devlink
>ports.  Therefore similarly to DEVLINK_CMD_PORT_SPLIT/UNSPLIT it
>should not take the instance lock.  Drivers don't depend on existing
>locking since it's a very recent addition.
>
>Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>

Acked-by: Jiri Pirko <jiri@mellanox.com>

^ permalink raw reply

* Re: KASAN: use-after-free Read in vhost_chr_write_iter
From: Jason Wang @ 2018-05-22  8:42 UTC (permalink / raw)
  To: DaeRyong Jeong
  Cc: mst, kvm, virtualization, netdev, linux-kernel, byoungyoung,
	kt0755, bammanag
In-Reply-To: <20180522083842.GA10604@dragonet.kaist.ac.kr>



On 2018年05月22日 16:38, DaeRyong Jeong wrote:
> On Mon, May 21, 2018 at 10:38:10AM +0800, Jason Wang wrote:
>> On 2018年05月18日 17:24, Jason Wang wrote:
>>> On 2018年05月17日 21:45, DaeRyong Jeong wrote:
>>>> We report the crash: KASAN: use-after-free Read in vhost_chr_write_iter
>>>>
>>>> This crash has been found in v4.17-rc1 using RaceFuzzer (a modified
>>>> version of Syzkaller), which we describe more at the end of this
>>>> report. Our analysis shows that the race occurs when invoking two
>>>> syscalls concurrently, write$vnet and ioctl$VHOST_RESET_OWNER.
>>>>
>>>>
>>>> Analysis:
>>>> We think the concurrent execution of vhost_process_iotlb_msg() and
>>>> vhost_dev_cleanup() causes the crash.
>>>> Both of functions can run concurrently (please see call sequence below),
>>>> and possibly, there is a race on dev->iotlb.
>>>> If the switch occurs right after vhost_dev_cleanup() frees
>>>> dev->iotlb, vhost_process_iotlb_msg() still sees the non-null value
>>>> and it
>>>> keep executing without returning -EFAULT. Consequently, use-after-free
>>>> occures
>>>>
>>>>
>>>> Thread interleaving:
>>>> CPU0 (vhost_process_iotlb_msg)                CPU1 (vhost_dev_cleanup)
>>>> (In the case of both VHOST_IOTLB_UPDATE and
>>>> VHOST_IOTLB_INVALIDATE)
>>>> =====                            =====
>>>>                              vhost_umem_clean(dev->iotlb);
>>>> if (!dev->iotlb) {
>>>>              ret = -EFAULT;
>>>>                  break;
>>>> }
>>>>                              dev->iotlb = NULL;
>>>>
>>>>
>>>> Call Sequence:
>>>> CPU0
>>>> =====
>>>> vhost_net_chr_write_iter
>>>>      vhost_chr_write_iter
>>>>          vhost_process_iotlb_msg
>>>>
>>>> CPU1
>>>> =====
>>>> vhost_net_ioctl
>>>>      vhost_net_reset_owner
>>>>          vhost_dev_reset_owner
>>>>              vhost_dev_cleanup
>>> Thanks a lot for the analysis.
>>>
>>> This could be addressed by simply protect it with dev mutex.
>>>
>>> Will post a patch.
>>>
>> Could you please help to test the attached patch? I've done some smoking
>> test.
>>
>> Thanks
> Sorry to say this, but we don't have a reproducer for this bug since our
> reproducer is being implemented.
>
> This crash had occrued a few times in our fuzzer, so I inspected the code
> manually.
>
> It seems the patch is good for me, but we can't test the patch for now.
> Sorry.
>

No problem.

I'm trying to craft a reproducer, looks not hard.

Thanks

^ permalink raw reply

* Re: [Cake] [PATCH net-next v14 6/7] sch_cake: Add overhead compensation support to the rate shaper
From: Toke Høiland-Jørgensen @ 2018-05-22  8:44 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner; +Cc: netdev, cake
In-Reply-To: <20180521234513.GH26212@localhost.localdomain>



On 22 May 2018 01:45:13 CEST, Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> wrote:
>On Mon, May 21, 2018 at 10:35:58PM +0200, Toke Høiland-Jørgensen wrote:
>> +static u32 cake_overhead(struct cake_sched_data *q, const struct
>sk_buff *skb)
>> +{
>> +	const struct skb_shared_info *shinfo = skb_shinfo(skb);
>> +	unsigned int hdr_len, last_len = 0;
>> +	u32 off = skb_network_offset(skb);
>> +	u32 len = qdisc_pkt_len(skb);
>> +	u16 segs = 1;
>> +
>> +	q->avg_netoff = cake_ewma(q->avg_netoff, off << 16, 8);
>> +
>> +	if (!shinfo->gso_size)
>> +		return cake_calc_overhead(q, len, off);
>> +
>> +	/* borrowed from qdisc_pkt_len_init() */
>> +	hdr_len = skb_transport_header(skb) - skb_mac_header(skb);
>> +
>> +	/* + transport layer */
>> +	if (likely(shinfo->gso_type & (SKB_GSO_TCPV4 |
>> +						SKB_GSO_TCPV6))) {
>> +		const struct tcphdr *th;
>> +		struct tcphdr _tcphdr;
>> +
>> +		th = skb_header_pointer(skb, skb_transport_offset(skb),
>> +					sizeof(_tcphdr), &_tcphdr);
>> +		if (likely(th))
>> +			hdr_len += __tcp_hdrlen(th);
>> +	} else {
>
>I didn't see some code limiting GSO packets to just TCP or UDP. Is it
>safe to assume that this packet is an UDP one, and not SCTP or ESP,
>for example?

As the comment says, I nicked this from the qdisc init code.
So I assume it's safe? :)

>> +		struct udphdr _udphdr;
>> +
>> +		if (skb_header_pointer(skb, skb_transport_offset(skb),
>> +				       sizeof(_udphdr), &_udphdr))
>> +			hdr_len += sizeof(struct udphdr);
>> +	}
>> +
>> +	if (unlikely(shinfo->gso_type & SKB_GSO_DODGY))
>> +		segs = DIV_ROUND_UP(skb->len - hdr_len,
>> +				    shinfo->gso_size);
>> +	else
>> +		segs = shinfo->gso_segs;
>> +
>> +	len = shinfo->gso_size + hdr_len;
>> +	last_len = skb->len - shinfo->gso_size * (segs - 1);
>> +
>> +	return (cake_calc_overhead(q, len, off) * (segs - 1) +
>> +		cake_calc_overhead(q, last_len, off));
>> +}
>> +

^ permalink raw reply

* Re: [Cake] [PATCH net-next v14 4/7] sch_cake: Add NAT awareness to packet classifier
From: Toke Høiland-Jørgensen @ 2018-05-22  8:45 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner; +Cc: netdev, cake, netfilter-devel
In-Reply-To: <20180521233406.GG26212@localhost.localdomain>



On 22 May 2018 01:34:06 CEST, Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> wrote:
>[Cc'ing netfilter-devel@ for awareness]

Thanks! I'll add a Cc in the next version.

>On Mon, May 21, 2018 at 10:35:58PM +0200, Toke Høiland-Jørgensen wrote:
>> When CAKE is deployed on a gateway that also performs NAT (which is a
>> common deployment mode), the host fairness mechanism cannot
>distinguish
>> internal hosts from each other, and so fails to work correctly.
>> 
>> To fix this, we add an optional NAT awareness mode, which will query
>the
>> kernel conntrack mechanism to obtain the pre-NAT addresses for each
>packet
>> and use that in the flow and host hashing.
>> 
>> When the shaper is enabled and the host is already performing NAT,
>the cost
>> of this lookup is negligible. However, in unlimited mode with no NAT
>being
>> performed, there is a significant CPU cost at higher bandwidths. For
>this
>> reason, the feature is turned off by default.
>> 
>> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
>> ---
>>  net/sched/sch_cake.c |   79
>++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 79 insertions(+)
>> 
>> diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
>> index 92623160d43e..04364993ce19 100644
>> --- a/net/sched/sch_cake.c
>> +++ b/net/sched/sch_cake.c
>> @@ -71,6 +71,12 @@
>>  #include <net/tcp.h>
>>  #include <net/flow_dissector.h>
>>  
>> +#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
>> +#include <net/netfilter/nf_conntrack_core.h>
>> +#include <net/netfilter/nf_conntrack_zones.h>
>> +#include <net/netfilter/nf_conntrack.h>
>> +#endif
>> +
>>  #define CAKE_SET_WAYS (8)
>>  #define CAKE_MAX_TINS (8)
>>  #define CAKE_QUEUES (1024)
>> @@ -516,6 +522,60 @@ static bool cobalt_should_drop(struct
>cobalt_vars *vars,
>>  	return drop;
>>  }
>>  
>> +#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
>> +
>> +static void cake_update_flowkeys(struct flow_keys *keys,
>> +				 const struct sk_buff *skb)
>> +{
>> +	const struct nf_conntrack_tuple *tuple;
>> +	enum ip_conntrack_info ctinfo;
>> +	struct nf_conn *ct;
>> +	bool rev = false;
>> +
>> +	if (tc_skb_protocol(skb) != htons(ETH_P_IP))
>> +		return;
>> +
>> +	ct = nf_ct_get(skb, &ctinfo);
>> +	if (ct) {
>> +		tuple = nf_ct_tuple(ct, CTINFO2DIR(ctinfo));
>> +	} else {
>> +		const struct nf_conntrack_tuple_hash *hash;
>> +		struct nf_conntrack_tuple srctuple;
>> +
>> +		if (!nf_ct_get_tuplepr(skb, skb_network_offset(skb),
>> +				       NFPROTO_IPV4, dev_net(skb->dev),
>> +				       &srctuple))
>> +			return;
>> +
>> +		hash = nf_conntrack_find_get(dev_net(skb->dev),
>> +					     &nf_ct_zone_dflt,
>> +					     &srctuple);
>> +		if (!hash)
>> +			return;
>> +
>> +		rev = true;
>> +		ct = nf_ct_tuplehash_to_ctrack(hash);
>> +		tuple = nf_ct_tuple(ct, !hash->tuple.dst.dir);
>> +	}
>> +
>> +	keys->addrs.v4addrs.src = rev ? tuple->dst.u3.ip :
>tuple->src.u3.ip;
>> +	keys->addrs.v4addrs.dst = rev ? tuple->src.u3.ip :
>tuple->dst.u3.ip;
>> +
>> +	if (keys->ports.ports) {
>> +		keys->ports.src = rev ? tuple->dst.u.all : tuple->src.u.all;
>> +		keys->ports.dst = rev ? tuple->src.u.all : tuple->dst.u.all;
>> +	}
>> +	if (rev)
>> +		nf_ct_put(ct);
>> +}
>> +#else
>> +static void cake_update_flowkeys(struct flow_keys *keys,
>> +				 const struct sk_buff *skb)
>> +{
>> +	/* There is nothing we can do here without CONNTRACK */
>> +}
>> +#endif
>> +
>>  /* Cake has several subtle multiple bit settings. In these cases you
>>   *  would be matching triple isolate mode as well.
>>   */
>> @@ -543,6 +603,9 @@ static u32 cake_hash(struct cake_tin_data *q,
>const struct sk_buff *skb,
>>  	skb_flow_dissect_flow_keys(skb, &keys,
>>  				   FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
>>  
>> +	if (flow_mode & CAKE_FLOW_NAT_FLAG)
>> +		cake_update_flowkeys(&keys, skb);
>> +
>>  	/* flow_hash_from_keys() sorts the addresses by value, so we have
>>  	 * to preserve their order in a separate data structure to treat
>>  	 * src and dst host addresses as independently selectable.
>> @@ -1894,6 +1957,18 @@ static int cake_change(struct Qdisc *sch,
>struct nlattr *opt,
>>  	if (err < 0)
>>  		return err;
>>  
>> +	if (tb[TCA_CAKE_NAT]) {
>> +#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
>> +		q->flow_mode &= ~CAKE_FLOW_NAT_FLAG;
>> +		q->flow_mode |= CAKE_FLOW_NAT_FLAG *
>> +			!!nla_get_u32(tb[TCA_CAKE_NAT]);
>> +#else
>> +		NL_SET_ERR_MSG_ATTR(extack, "No conntrack support in kernel",
>> +				    tb[TCA_CAKE_NAT]);
>> +		return -EOPNOTSUPP;
>> +#endif
>> +	}
>> +
>>  	if (tb[TCA_CAKE_BASE_RATE64])
>>  		q->rate_bps = nla_get_u64(tb[TCA_CAKE_BASE_RATE64]);
>>  
>> @@ -2066,6 +2141,10 @@ static int cake_dump(struct Qdisc *sch, struct
>sk_buff *skb)
>>  	if (nla_put_u32(skb, TCA_CAKE_ACK_FILTER, q->ack_filter))
>>  		goto nla_put_failure;
>>  
>> +	if (nla_put_u32(skb, TCA_CAKE_NAT,
>> +			!!(q->flow_mode & CAKE_FLOW_NAT_FLAG)))
>> +		goto nla_put_failure;
>> +
>>  	return nla_nest_end(skb, opts);
>>  
>>  nla_put_failure:
>> 
>> _______________________________________________
>> Cake mailing list
>> Cake@lists.bufferbloat.net
>> https://lists.bufferbloat.net/listinfo/cake

^ permalink raw reply

* Re: [PATCH bpf v2 6/6] bpf: fix JITed dump for multi-function programs via syscall
From: Daniel Borkmann @ 2018-05-22  8:54 UTC (permalink / raw)
  To: Sandipan Das; +Cc: ast, netdev, linuxppc-dev, naveen.n.rao, mpe, jakub.kicinski
In-Reply-To: <57a3df63-f533-0dee-3fb8-621c0ba51968@linux.vnet.ibm.com>

On 05/21/2018 09:42 PM, Sandipan Das wrote:
> On 05/18/2018 09:21 PM, Daniel Borkmann wrote:
>> On 05/18/2018 02:50 PM, Sandipan Das wrote:
>>> Currently, for multi-function programs, we cannot get the JITed
>>> instructions using the bpf system call's BPF_OBJ_GET_INFO_BY_FD
>>> command. Because of this, userspace tools such as bpftool fail
>>> to identify a multi-function program as being JITed or not.
>>>
>>> With the JIT enabled and the test program running, this can be
>>> verified as follows:
>>>
>>>   # cat /proc/sys/net/core/bpf_jit_enable
>>>   1
>>>
>>> Before applying this patch:
>>>
>>>   # bpftool prog list
>>>   1: kprobe  name foo  tag b811aab41a39ad3d  gpl
>>>           loaded_at 2018-05-16T11:43:38+0530  uid 0
>>>           xlated 216B  not jited  memlock 65536B
>>>   ...
>>>
>>>   # bpftool prog dump jited id 1
>>>   no instructions returned
>>>
>>> After applying this patch:
>>>
>>>   # bpftool prog list
>>>   1: kprobe  name foo  tag b811aab41a39ad3d  gpl
>>>           loaded_at 2018-05-16T12:13:01+0530  uid 0
>>>           xlated 216B  jited 308B  memlock 65536B
>>>   ...
>>
>> That's really nice! One comment inline below:
>>
>>>   # bpftool prog dump jited id 1
>>>      0:   nop
>>>      4:   nop
>>>      8:   mflr    r0
>>>      c:   std     r0,16(r1)
>>>     10:   stdu    r1,-112(r1)
>>>     14:   std     r31,104(r1)
>>>     18:   addi    r31,r1,48
>>>     1c:   li      r3,10
>>>   ...
>>>
>>> Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
>>> ---
>>>  kernel/bpf/syscall.c | 38 ++++++++++++++++++++++++++++++++------
>>>  1 file changed, 32 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>> index 54a72fafe57c..2430d159078c 100644
>>> --- a/kernel/bpf/syscall.c
>>> +++ b/kernel/bpf/syscall.c
>>> @@ -1896,7 +1896,7 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
>>>  	struct bpf_prog_info info = {};
>>>  	u32 info_len = attr->info.info_len;
>>>  	char __user *uinsns;
>>> -	u32 ulen;
>>> +	u32 ulen, i;
>>>  	int err;
>>>  
>>>  	err = check_uarg_tail_zero(uinfo, sizeof(info), info_len);
>>> @@ -1922,7 +1922,6 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
>>>  	ulen = min_t(u32, info.nr_map_ids, ulen);
>>>  	if (ulen) {
>>>  		u32 __user *user_map_ids = u64_to_user_ptr(info.map_ids);
>>> -		u32 i;
>>>  
>>>  		for (i = 0; i < ulen; i++)
>>>  			if (put_user(prog->aux->used_maps[i]->id,
>>> @@ -1970,13 +1969,41 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
>>>  	 * for offload.
>>>  	 */
>>>  	ulen = info.jited_prog_len;
>>> -	info.jited_prog_len = prog->jited_len;
>>> +	if (prog->aux->func_cnt) {
>>> +		info.jited_prog_len = 0;
>>> +		for (i = 0; i < prog->aux->func_cnt; i++)
>>> +			info.jited_prog_len += prog->aux->func[i]->jited_len;
>>> +	} else {
>>> +		info.jited_prog_len = prog->jited_len;
>>> +	}
>>> +
>>>  	if (info.jited_prog_len && ulen) {
>>>  		if (bpf_dump_raw_ok()) {
>>>  			uinsns = u64_to_user_ptr(info.jited_prog_insns);
>>>  			ulen = min_t(u32, info.jited_prog_len, ulen);
>>> -			if (copy_to_user(uinsns, prog->bpf_func, ulen))
>>> -				return -EFAULT;
>>> +
>>> +			/* for multi-function programs, copy the JITed
>>> +			 * instructions for all the functions
>>> +			 */
>>> +			if (prog->aux->func_cnt) {
>>> +				u32 len, free;
>>> +				u8 *img;
>>> +
>>> +				free = ulen;
>>> +				for (i = 0; i < prog->aux->func_cnt; i++) {
>>> +					len = prog->aux->func[i]->jited_len;
>>> +					img = (u8 *) prog->aux->func[i]->bpf_func;
>>> +					if (len > free)
>>> +						break;
>>> +					if (copy_to_user(uinsns, img, len))
>>> +						return -EFAULT;
>>> +					uinsns += len;
>>> +					free -= len;
>>
>> Is there any way we can introduce a delimiter between the different
>> images such that they could be more easily correlated with the call
>> from the main (or other sub-)program instead of having one contiguous
>> dump blob?
> 
> Can we have another member in bpf_prog_info that points to a list of the lengths of the
> JITed images for each subprogram? We can use this information to split up the dump.

Seems okay to me.

Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH v2 net] stmmac: strip vlan tag on reception only for 8021q tagged frames
From: Jose Abreu @ 2018-05-22  8:56 UTC (permalink / raw)
  To: Florian Fainelli, David Miller, eladv6, Jose.Abreu
  Cc: makita.toshiaki, netdev, peppe.cavallaro, alexandre.torgue
In-Reply-To: <0812781e-2660-914b-0e3b-4d84e58afebe@gmail.com>

On 21-05-2018 17:42, Florian Fainelli wrote:
> On 05/21/2018 08:48 AM, David Miller wrote:
>> From: David Miller <davem@davemloft.net>
>> Date: Thu, 17 May 2018 12:43:56 -0400 (EDT)
>>
>>> Giuseppe and Alexandre, please review this patch.
>> If nobody thinks this patch is important enough to actually
>> review, I'm tossing it.
>>
>> Sorry.
>>
> How about looping in Jose?

Thanks for the cc Florian!

Elad,

Looking at this patch I have a couple of questions and suggestions:

I see that most drivers use this pattern so, are they all broken?
or is this a special case?

You can also get the inner/outer vlan tag by reading desc0 of
receive descriptor. Which can make this completely agnostic of
VLAN tag.

Thanks and Best Regards,
Jose Miguel Abreu

^ permalink raw reply

* Re: [PATCH v2 bpf-next 0/3] bpf: Add MTU check to fib lookup helper
From: Daniel Borkmann @ 2018-05-22  8:59 UTC (permalink / raw)
  To: dsahern, netdev, borkmann, ast; +Cc: davem, David Ahern
In-Reply-To: <20180521160816.7060-1-dsahern@kernel.org>

On 05/21/2018 06:08 PM, dsahern@kernel.org wrote:
> From: David Ahern <dsahern@gmail.com>
> 
> Packets that exceed the egress MTU can not be forwarded in the fast path.
> Add IPv4 and IPv6 MTU helpers that take a FIB lookup result (versus the
> typical dst path) and add the calls to bpf_ipv{4,6}_fib_lookup.
> 
> v2
> - add ip6_mtu_from_fib6 to ipv6_stub
> - only call the new MTU helpers for fib lookups in XDP path; skb
>   path uses is_skb_forwardable to determine if the packet can be
>   sent via the egress device from the FIB lookup
> 
> David Ahern (3):
>   net/ipv4: Add helper to return path MTU based on fib result
>   net/ipv6: Add helper to return path MTU based on fib result
>   bpf: Add mtu checking to FIB forwarding helper
> 
>  include/net/addrconf.h   |  2 ++
>  include/net/ip6_fib.h    |  6 ++++++
>  include/net/ip6_route.h  |  3 +++
>  include/net/ip_fib.h     |  2 ++
>  net/core/filter.c        | 42 +++++++++++++++++++++++++++++++++++-------
>  net/ipv4/route.c         | 31 +++++++++++++++++++++++++++++++
>  net/ipv6/addrconf_core.c |  8 ++++++++
>  net/ipv6/af_inet6.c      |  1 +
>  net/ipv6/route.c         | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>  9 files changed, 136 insertions(+), 7 deletions(-)

Applied to bpf-next, thanks David!

^ permalink raw reply

* Re: [PATCH net-next v11 3/5] net: Introduce net_failover driver
From: Jiri Pirko @ 2018-05-22  8:59 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
	loseweigh, aaron.f.brown, anjali.singhai
In-Reply-To: <1526954781-35359-4-git-send-email-sridhar.samudrala@intel.com>

Tue, May 22, 2018 at 04:06:19AM CEST, sridhar.samudrala@intel.com wrote:
>The net_failover driver provides an automated failover mechanism via APIs
>to create and destroy a failover master netdev and mananges a primary and
>standby slave netdevs that get registered via the generic failover
>infrastructure.
>
>The failover netdev acts a master device and controls 2 slave devices. The
>original paravirtual interface gets registered as 'standby' slave netdev and
>a passthru/vf device with the same MAC gets registered as 'primary' slave
>netdev. Both 'standby' and 'failover' netdevs are associated with the same
>'pci' device. The user accesses the network interface via 'failover' netdev.
>The 'failover' netdev chooses 'primary' netdev as default for transmits when
>it is available with link up and running.
>
>This can be used by paravirtual drivers to enable an alternate low latency
>datapath. It also enables hypervisor controlled live migration of a VM with
>direct attached VF by failing over to the paravirtual datapath when the VF
>is unplugged.
>
>Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>

[...]

>+
>+The net_failover driver provides an automated failover mechanism via APIs
>+to create and destroy a failover master netdev and mananges a primary and

s/mananges/manages/

^ permalink raw reply

* Re: [PATCH bpf-next 0/8] AF_XDP follow-up patches, uapi and cleanups
From: Daniel Borkmann @ 2018-05-22  8:59 UTC (permalink / raw)
  To: Björn Töpel, magnus.karlsson, magnus.karlsson, ast,
	netdev
  Cc: Björn Töpel
In-Reply-To: <20180522073503.2199-1-bjorn.topel@gmail.com>

On 05/22/2018 09:34 AM, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> This the second follow-up set. The first four patches are uapi
> changes:
> 
> * Removing rebind support
> * Getting rid of structure hole
> * Removing explicit cache line alignment
> * Stricter bind checks
> 
> The last patches do some cleanups, where the umem and refcount_t
> changes were suggested by Daniel.
> 
> * Add a missing write-barrier and use READ_ONCE for data-dependencies
> * Clean up umem and do proper locking
> * Convert atomic_t to refcount_t
> 
> Björn Töpel (7):
>   xsk: remove rebind support
>   xsk: fill hole in struct sockaddr_xdp
>   xsk: remove explicit ring structure from uapi
>   samples/bpf: adapt xdpsock to the new uapi
>   xsk: add missing write- and data-dependency barrier
>   xsk: simplified umem setup
>   xsk: convert atomic_t to refcount_t
> 
> Magnus Karlsson (1):
>   xsk: proper queue id check at bind
> 
>  include/uapi/linux/if_xdp.h |  46 ++++++++---------
>  net/xdp/xdp_umem.c          |  85 +++++++++++++++---------------
>  net/xdp/xdp_umem.h          |   5 +-
>  net/xdp/xsk.c               | 105 +++++++++++++++++++++++--------------
>  net/xdp/xsk_queue.h         |  17 ++++++
>  samples/bpf/xdpsock_user.c  | 123 +++++++++++++++++++++++++++-----------------
>  6 files changed, 225 insertions(+), 156 deletions(-)

LGTM, applied to bpf-next, thanks guys!

^ permalink raw reply

* Re: [PATCH net-next v11 2/5] netvsc: refactor notifier/event handling code to use the failover framework
From: Jiri Pirko @ 2018-05-22  9:06 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
	loseweigh, aaron.f.brown, anjali.singhai
In-Reply-To: <1526954781-35359-3-git-send-email-sridhar.samudrala@intel.com>

Tue, May 22, 2018 at 04:06:18AM CEST, sridhar.samudrala@intel.com wrote:
>Use the registration/notification framework supported by the generic
>failover infrastructure.
>
>Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>

In previous patchset versions, the common code did
netdev_rx_handler_register() and netdev_upper_dev_link() etc
(netvsc_vf_join()). Now, this is still done in netvsc. Why?

This should be part of the common "failover" code.

^ permalink raw reply

* Re: [PATCH net-next v11 2/5] netvsc: refactor notifier/event handling code to use the failover framework
From: Jiri Pirko @ 2018-05-22  9:08 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
	loseweigh, aaron.f.brown, anjali.singhai
In-Reply-To: <20180522090637.GE2149@nanopsycho>

Tue, May 22, 2018 at 11:06:37AM CEST, jiri@resnulli.us wrote:
>Tue, May 22, 2018 at 04:06:18AM CEST, sridhar.samudrala@intel.com wrote:
>>Use the registration/notification framework supported by the generic
>>failover infrastructure.
>>
>>Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>
>In previous patchset versions, the common code did
>netdev_rx_handler_register() and netdev_upper_dev_link() etc
>(netvsc_vf_join()). Now, this is still done in netvsc. Why?
>
>This should be part of the common "failover" code.
>

Also note that in the current patchset you use IFF_FAILOVER flag for
master, yet for the slave you use IFF_SLAVE. That is wrong.
IFF_FAILOVER_SLAVE should be used.

^ permalink raw reply

* Re: [RFC feedback] AF_XDP and non-Intel hardware
From: Luke Gorrie @ 2018-05-22  9:15 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Mykyta Iziumtsev, Netdev, Björn Töpel, Karlsson, Magnus,
	Zhang, Qi Z, Francois Ozog, Ilias Apalodimas, Brian Brooks,
	Jesper Dangaard Brouer, andy, michael.chan
In-Reply-To: <CAJ+HfNhB_E6RDn-K4Wp0h8194bqqicVzWJ0Kn18HAG3B7K=hXA@mail.gmail.com>

On 21 May 2018 at 20:55, Björn Töpel <bjorn.topel@gmail.com> wrote:
>
> 2018-05-21 14:34 GMT+02:00 Mykyta Iziumtsev <mykyta.iziumtsev@linaro.org>:
> > Hi Björn and Magnus,
> >
> > (This thread is a follow up to private dialogue. The intention is to
> > let community know that AF_XDP can be enhanced further to make it
> > compatible with wider range of NIC vendors).
> >
>
> Mykyta, thanks for doing the write-up and sending it to the netdev
> list! The timing could not be better -- we need to settle on an uapi
> that works for all vendors prior enabling it in the kernel.

[Resending with vger-compatible formatting.]

So! The discussion here seems to be about how to make the XDP uapi
accommodate all hardware vendors but I wanted to chime in with a
userspace application developer perspective (remember us? ;-))

These days more and more people understand the weird and wonderful
ways that NICs want to deal with packet memory. Scatter-gather lists;
typewriter buffers; payload inline in descriptors; metadata inline in
payload; constraints on buffer size; constraints on buffer alignment;
etc; etc; etc.

How about userspace applications though? We also have our own ideas
about the ways that things should be done. I think there is a
fundamental tension here: the more flexibility you provide to
hardware, the more constraints you impose on applications, and vice
versa.

To be concrete let me explain the peculiar way that we handle packet
memory in the Snabb application.  Snabb uses a simple representation
of packets in memory:

  struct packet {
    uint16_t length;
    unsigned char data[10 * 1024];
  }

and a special allocator so that the virtual address of each packet:

- Is identical in every process that can share traffic;
- Can be mapped on demand (via SIGSEGV fault handler);
- Can be used to calculate the DMA (physical) address;
- Can be used to calculate how much headroom is available.

So our scheme is fairly nuanced. Just now this seems to fit well with
most NICs, which allow scatter-gather operation from memory allocated
independently by the application, but we have to resolve an impedence
mismatch (copy) for e.g. typewriter model. Overall this situation is
quite acceptable.

How would this fit with the XDP uapi though? Can we preserve these
properties of our packets and make them XDP-compatible? The ideal for
us would probably be to replace the code that allocates a HugeTLB for
packet data with an equivalent that allocates a chunk of
XDP-compatible memory that we can slice up and mremap to suit our taste.

If that is not possible then I see a couple of alternatives:

One would be to drop all of our invariants on packet addresses and
switch to a more middle-of-the-road design that puts everything inline
into the packet (an "sk_buff-alike.") Then we would outsource all the
allocation to the kernel, which would do it specially to suit the
hardware from $VENDOR. (And hopefully deal somehow with mixing traffic
from $OTHERVENDOR too, etc.)

The other alternative would be to preserve our current packet
structure and introduce a copy into separate XDP memory on
transmit/receive. This is the approach that we take today with
vhost-user and is the approach we would take if we supported a
"typewriter" style NIC too.

I'm not immediately wild about either of those options though, and I
am not sure how keen the next wave of application developers turning up
over the next 5-10 years and "doing it our way" will be either.

So, anyway, that is my braindump on trying to understand how suitable
XDP would be for us as application developers, and how much of this
depends on the fine details that are being discussed on this thread.
I hope this perspective is a useful complement to the feedback from
hardware makers.

Cheers,
-Luke

^ permalink raw reply

* Re: [PATCH net-next 0/2] net: sfp: small improvements
From: Antoine Tenart @ 2018-05-22  9:24 UTC (permalink / raw)
  To: David Miller
  Cc: antoine.tenart, linux, netdev, linux-kernel, thomas.petazzoni,
	maxime.chevallier, gregory.clement, miquel.raynal, nadavh,
	stefanc, ymarkman, mw
In-Reply-To: <20180521.115115.102542572797022222.davem@davemloft.net>

Hi David,

On Mon, May 21, 2018 at 11:51:15AM -0400, David Miller wrote:
> From: Antoine Tenart <antoine.tenart@bootlin.com>
> Date: Thu, 17 May 2018 10:29:05 +0200
> 
> > This series was part of the mvpp2 phylink one but as we reworked it to
> > use fixed-link on the DB boards, the SFP commits weren't needed
> > anymore for our use case. Two of the three patches still are needed I
> > believe (I ditched the one about non-wired SFP cages), so they are sent
> > here in a separate series.
> 
> Based upon the discussion of patch #1, it seems there is a desire to make
> the i2c-bus property mandatory since it isn't clear if access to the SFP
> module without it really all that doable.

Thanks for the clarification, I was about to ask for it. I'll make a v2
making i2c-bus mandatory then.

Thanks!
Antoine

-- 
Antoine Ténart, Bootlin (formerly Free Electrons)
Embedded Linux and Kernel engineering
https://bootlin.com

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox