* linux-next: manual merge of the net-next tree with the s390 tree
From: Stephen Rothwell @ 2018-05-29 3:00 UTC
To: David Miller, Networking, Martin Schwidefsky, Heiko Carstens
Cc: Linux-Next Mailing List, Linux Kernel Mailing List, Ursula Braun,
Daniel Borkmann, Alexei Starovoitov
Hi all,
Today's linux-next merge of the net-next tree got a conflict in:
arch/s390/net/Makefile
between commit:
866f4c8e0e26 ("s390/net: add pnetid support")
from the s390 tree and commit:
e1cf4befa297 ("bpf, s390x: remove ld_abs/ld_ind")
from the net-next tree.
I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non-trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging. You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.
--
Cheers,
Stephen Rothwell
diff --cc arch/s390/net/Makefile
index e2b85ffdbb0c,d4663b4bf509..000000000000
--- a/arch/s390/net/Makefile
+++ b/arch/s390/net/Makefile
@@@ -2,5 -2,4 +2,5 @@@
#
# Arch-specific network modules
#
- obj-$(CONFIG_BPF_JIT) += bpf_jit.o bpf_jit_comp.o
+ obj-$(CONFIG_BPF_JIT) += bpf_jit_comp.o
+obj-$(CONFIG_HAVE_PNETID) += pnet.o
* Re: [PATCH net-next] vrf: add CRC32c offload to device features
From: David Miller @ 2018-05-29 2:55 UTC
To: dcaratti; +Cc: dsa, vyasevich, marcelo.leitner, linux-sctp, netdev
In-Reply-To: <bb3aa69eaef613f033f8f52674740286ba67dc31.1527175921.git.dcaratti@redhat.com>
From: Davide Caratti <dcaratti@redhat.com>
Date: Thu, 24 May 2018 17:49:35 +0200
> SCTP sockets originated in a VRF can improve their performance if CRC32c
> computation is delegated to underlying devices: update device features,
> setting NETIF_F_SCTP_CRC. Iterating the following command in the topology
> proposed with [1],
>
> # ip vrf exec vrf-h2 netperf -H 192.0.2.1 -t SCTP_STREAM -- -m 10K
>
> the measured throughput in Mbit/s improved from 2395 ± 1% to 2720 ± 1%.
>
> [1] https://www.spinics.net/lists/netdev/msg486007.html
>
> Signed-off-by: Davide Caratti <dcaratti@redhat.com>
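For readers unfamiliar with the checksum being offloaded: SCTP protects each packet with CRC32c (the Castagnoli polynomial). The following is only a minimal bitwise software sketch of that checksum, not the kernel's implementation; the function name crc32c_sw is made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32c (Castagnoli), reflected polynomial 0x82F63B78.
 * Illustrative only: this is the per-byte work that a device
 * advertising NETIF_F_SCTP_CRC takes over from the CPU. */
static uint32_t crc32c_sw(const uint8_t *buf, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;

	for (size_t i = 0; i < len; i++) {
		crc ^= buf[i];
		for (int bit = 0; bit < 8; bit++) {
			/* Shift right; XOR in the polynomial if the low bit was set. */
			uint32_t mask = 0u - (crc & 1u);
			crc = (crc >> 1) ^ (0x82F63B78u & mask);
		}
	}
	return ~crc;
}
```

Every payload byte costs eight shift/XOR steps here, which gives a rough idea of why moving the computation to hardware can show up in the throughput numbers quoted above.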
Applied, thank you.
* Re: [PATCH] net: stmmac: Use mutex instead of spinlock
From: David Miller @ 2018-05-29 2:54 UTC
To: thierry.reding
Cc: peppe.cavallaro, alexandre.torgue, jonathanh, netdev,
linux-kernel
In-Reply-To: <20180524140907.24197-1-thierry.reding@gmail.com>
From: Thierry Reding <thierry.reding@gmail.com>
Date: Thu, 24 May 2018 16:09:07 +0200
> From: Thierry Reding <treding@nvidia.com>
>
> Some drivers, such as DWC EQOS on Tegra, need to perform operations that
> can sleep under this lock (clk_set_rate() in tegra_eqos_fix_speed()) for
> proper operation. Since there is no need for this lock to be a spinlock,
> convert it to a mutex instead.
>
> Fixes: e6ea2d16fc61 ("net: stmmac: dwc-qos: Add Tegra186 support")
> Reported-by: Jon Hunter <jonathanh@nvidia.com>
> Signed-off-by: Thierry Reding <treding@nvidia.com>
> ---
> This applies on top of net-next.
Applied to net-next.
* Re: [PATCH net-next v2 1/1] bnx2x: Collect the device debug information during Tx timeout.
From: David Miller @ 2018-05-29 2:53 UTC
To: sudarsana.kalluru; +Cc: netdev
In-Reply-To: <20180524115751.5284-1-sudarsana.kalluru@cavium.com>
From: Sudarsana Reddy Kalluru <sudarsana.kalluru@cavium.com>
Date: Thu, 24 May 2018 04:57:51 -0700
> Tx-timeout mostly happens due to some issue in the device. In such cases,
> debug dump would be helpful for identifying the cause of the issue.
> This patch adds support to spill debug data during the Tx timeout. Here
> bnx2x_panic_dump() API is used instead of bnx2x_panic(), since we still
> want to allow the Tx-timeout recovery a chance to succeed.
>
>
> Changes from previous version:
> -------------------------------
> v2: Fixed a coding error.
>
> Please consider applying this to "net-next".
>
> Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com>
Applied, thank you.
* [PATCH] drivers/net: Fix various unnecessary characters after logging newlines
From: Joe Perches @ 2018-05-29 2:51 UTC
To: netdev; +Cc: brcm80211-dev-list
Remove and coalesce formats when there is an unnecessary
character after a logging newline. These extra characters
cause logging defects.
Miscellanea:
o Coalesce formats
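
The defect pattern described above, where a format string carries a character after its newline, can be checked mechanically. Below is a minimal user-space sketch; the helper name has_text_after_newline is hypothetical and was not used to produce this patch.

```c
#include <stdbool.h>
#include <string.h>

/* Returns true when fmt contains a '\n' that is followed by at least
 * one more character, e.g. "...%d.\n." or "...%08x\n]".
 * Hypothetical helper, for illustration only. */
static bool has_text_after_newline(const char *fmt)
{
	const char *nl = strchr(fmt, '\n');

	return nl != NULL && nl[1] != '\0';
}
```

Applied to the strings touched below, the old formats such as "cnsmr = %d.\n." are flagged and the corrected ones are not.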
Signed-off-by: Joe Perches <joe@perches.com>
---
drivers/net/ethernet/cavium/liquidio/lio_main.c | 2 +-
drivers/net/ethernet/qlogic/netxen/netxen_nic_ctx.c | 6 ++----
drivers/net/ethernet/qlogic/qed/qed_dev.c | 2 +-
drivers/net/ethernet/qlogic/qlge/qlge_main.c | 4 ++--
drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c | 2 +-
drivers/net/wireless/intel/ipw2x00/ipw2200.c | 3 +--
drivers/net/wireless/intersil/prism54/islpci_eth.c | 6 +++---
drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_8192e.c | 4 ++--
drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_8723b.c | 4 ++--
drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c | 4 ++--
drivers/net/wireless/realtek/rtlwifi/rtl8821ae/dm.c | 2 +-
11 files changed, 18 insertions(+), 21 deletions(-)
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index e500528ad751..8a815bb57177 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -1766,7 +1766,7 @@ static int load_firmware(struct octeon_device *oct)
ret = request_firmware(&fw, fw_name, &oct->pci_dev->dev);
if (ret) {
- dev_err(&oct->pci_dev->dev, "Request firmware failed. Could not find file %s.\n.",
+ dev_err(&oct->pci_dev->dev, "Request firmware failed. Could not find file %s.\n",
fw_name);
release_firmware(fw);
return ret;
diff --git a/drivers/net/ethernet/qlogic/netxen/netxen_nic_ctx.c b/drivers/net/ethernet/qlogic/netxen/netxen_nic_ctx.c
index 6cec2a6a3dcc..7503aa222392 100644
--- a/drivers/net/ethernet/qlogic/netxen/netxen_nic_ctx.c
+++ b/drivers/net/ethernet/qlogic/netxen/netxen_nic_ctx.c
@@ -146,8 +146,7 @@ netxen_get_minidump_template(struct netxen_adapter *adapter)
if ((cmd.rsp.cmd == NX_RCODE_SUCCESS) && (size == cmd.rsp.arg2)) {
memcpy(adapter->mdump.md_template, addr, size);
} else {
- dev_err(&adapter->pdev->dev, "Failed to get minidump template, "
- "err_code : %d, requested_size : %d, actual_size : %d\n ",
+ dev_err(&adapter->pdev->dev, "Failed to get minidump template, err_code : %d, requested_size : %d, actual_size : %d\n",
cmd.rsp.cmd, size, cmd.rsp.arg2);
}
pci_free_consistent(adapter->pdev, size, addr, md_template_addr);
@@ -180,8 +179,7 @@ netxen_setup_minidump(struct netxen_adapter *adapter)
if ((err == NX_RCODE_CMD_INVALID) ||
(err == NX_RCODE_CMD_NOT_IMPL)) {
dev_info(&adapter->pdev->dev,
- "Flashed firmware version does not support minidump, "
- "minimum version required is [ %u.%u.%u ].\n ",
+ "Flashed firmware version does not support minidump, minimum version required is [ %u.%u.%u ]\n",
NX_MD_SUPPORT_MAJOR, NX_MD_SUPPORT_MINOR,
NX_MD_SUPPORT_SUBVERSION);
}
diff --git a/drivers/net/ethernet/qlogic/qed/qed_dev.c b/drivers/net/ethernet/qlogic/qed/qed_dev.c
index 560528962658..fde20fd9942c 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_dev.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_dev.c
@@ -1098,7 +1098,7 @@ int qed_final_cleanup(struct qed_hwfn *p_hwfn,
}
DP_VERBOSE(p_hwfn, QED_MSG_IOV,
- "Sending final cleanup for PFVF[%d] [Command %08x\n]",
+ "Sending final cleanup for PFVF[%d] [Command %08x]\n",
id, command);
qed_wr(p_hwfn, p_ptt, XSDM_REG_OPERATION_GEN, command);
diff --git a/drivers/net/ethernet/qlogic/qlge/qlge_main.c b/drivers/net/ethernet/qlogic/qlge/qlge_main.c
index 8293c2028002..70de062b72a1 100644
--- a/drivers/net/ethernet/qlogic/qlge/qlge_main.c
+++ b/drivers/net/ethernet/qlogic/qlge/qlge_main.c
@@ -2211,7 +2211,7 @@ static int ql_clean_outbound_rx_ring(struct rx_ring *rx_ring)
while (prod != rx_ring->cnsmr_idx) {
netif_printk(qdev, rx_status, KERN_DEBUG, qdev->ndev,
- "cq_id = %d, prod = %d, cnsmr = %d.\n.",
+ "cq_id = %d, prod = %d, cnsmr = %d\n",
rx_ring->cq_id, prod, rx_ring->cnsmr_idx);
net_rsp = (struct ob_mac_iocb_rsp *)rx_ring->curr_entry;
@@ -2258,7 +2258,7 @@ static int ql_clean_inbound_rx_ring(struct rx_ring *rx_ring, int budget)
while (prod != rx_ring->cnsmr_idx) {
netif_printk(qdev, rx_status, KERN_DEBUG, qdev->ndev,
- "cq_id = %d, prod = %d, cnsmr = %d.\n.",
+ "cq_id = %d, prod = %d, cnsmr = %d\n",
rx_ring->cq_id, prod, rx_ring->cnsmr_idx);
net_rsp = rx_ring->curr_entry;
diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c
index f5b405c98047..b6122aad639e 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c
@@ -1264,7 +1264,7 @@ static void brcmf_link_down(struct brcmf_cfg80211_vif *vif, u16 reason)
brcmf_dbg(TRACE, "Enter\n");
if (test_and_clear_bit(BRCMF_VIF_STATUS_CONNECTED, &vif->sme_state)) {
- brcmf_dbg(INFO, "Call WLC_DISASSOC to stop excess roaming\n ");
+ brcmf_dbg(INFO, "Call WLC_DISASSOC to stop excess roaming\n");
err = brcmf_fil_cmd_data_set(vif->ifp,
BRCMF_C_DISASSOC, NULL, 0);
if (err) {
diff --git a/drivers/net/wireless/intel/ipw2x00/ipw2200.c b/drivers/net/wireless/intel/ipw2x00/ipw2200.c
index ba3fb1d2ddb4..f26beeb6c5ff 100644
--- a/drivers/net/wireless/intel/ipw2x00/ipw2200.c
+++ b/drivers/net/wireless/intel/ipw2x00/ipw2200.c
@@ -7557,8 +7557,7 @@ static int ipw_associate(void *data)
}
if (priv->status & STATUS_DISASSOCIATING) {
- IPW_DEBUG_ASSOC("Not attempting association (in "
- "disassociating)\n ");
+ IPW_DEBUG_ASSOC("Not attempting association (in disassociating)\n");
schedule_work(&priv->associate);
return 0;
}
diff --git a/drivers/net/wireless/intersil/prism54/islpci_eth.c b/drivers/net/wireless/intersil/prism54/islpci_eth.c
index 9b0ded733294..b277113b33d3 100644
--- a/drivers/net/wireless/intersil/prism54/islpci_eth.c
+++ b/drivers/net/wireless/intersil/prism54/islpci_eth.c
@@ -57,7 +57,7 @@ islpci_eth_cleanup_transmit(islpci_private *priv,
#if VERBOSE > SHOW_ERROR_MESSAGES
DEBUG(SHOW_TRACING,
- "cleanup skb %p skb->data %p skb->len %u truesize %u\n ",
+ "cleanup skb %p skb->data %p skb->len %u truesize %u\n",
skb, skb->data, skb->len, skb->truesize);
#endif
@@ -328,7 +328,7 @@ islpci_eth_receive(islpci_private *priv)
#if VERBOSE > SHOW_ERROR_MESSAGES
DEBUG(SHOW_TRACING,
- "frq->addr %x skb->data %p skb->len %u offset %u truesize %u\n ",
+ "frq->addr %x skb->data %p skb->len %u offset %u truesize %u\n",
control_block->rx_data_low[priv->free_data_rx].address, skb->data,
skb->len, offset, skb->truesize);
#endif
@@ -436,7 +436,7 @@ islpci_eth_receive(islpci_private *priv)
#if VERBOSE > SHOW_ERROR_MESSAGES
DEBUG(SHOW_TRACING,
- "new alloc skb %p skb->data %p skb->len %u index %u truesize %u\n ",
+ "new alloc skb %p skb->data %p skb->len %u index %u truesize %u\n",
skb, skb->data, skb->len, index, skb->truesize);
#endif
diff --git a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_8192e.c b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_8192e.c
index 38b2ba1ac6f8..380e86f9e00b 100644
--- a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_8192e.c
+++ b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_8192e.c
@@ -1267,8 +1267,8 @@ static void rtl8192eu_phy_iq_calibrate(struct rtl8xxxu_priv *priv)
reg_ecc = result[candidate][7];
dev_dbg(dev, "%s: candidate is %x\n", __func__, candidate);
dev_dbg(dev,
- "%s: e94 =%x e9c=%x ea4=%x eac=%x eb4=%x ebc=%x ec4=%x "
- "ecc=%x\n ", __func__, reg_e94, reg_e9c,
+ "%s: e94 =%x e9c=%x ea4=%x eac=%x eb4=%x ebc=%x ec4=%x ecc=%x\n",
+ __func__, reg_e94, reg_e9c,
reg_ea4, reg_eac, reg_eb4, reg_ebc, reg_ec4, reg_ecc);
path_a_ok = true;
path_b_ok = true;
diff --git a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_8723b.c b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_8723b.c
index c4b86a84a721..26b674aca125 100644
--- a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_8723b.c
+++ b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_8723b.c
@@ -1175,8 +1175,8 @@ static void rtl8723bu_phy_iq_calibrate(struct rtl8xxxu_priv *priv)
reg_ecc = result[candidate][7];
dev_dbg(dev, "%s: candidate is %x\n", __func__, candidate);
dev_dbg(dev,
- "%s: e94 =%x e9c=%x ea4=%x eac=%x eb4=%x ebc=%x ec4=%x "
- "ecc=%x\n ", __func__, reg_e94, reg_e9c,
+ "%s: e94 =%x e9c=%x ea4=%x eac=%x eb4=%x ebc=%x ec4=%x ecc=%x\n",
+ __func__, reg_e94, reg_e9c,
reg_ea4, reg_eac, reg_eb4, reg_ebc, reg_ec4, reg_ecc);
path_a_ok = true;
path_b_ok = true;
diff --git a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c
index 718a73c623a7..505ab1b055ff 100644
--- a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c
+++ b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c
@@ -3406,8 +3406,8 @@ void rtl8xxxu_gen1_phy_iq_calibrate(struct rtl8xxxu_priv *priv)
reg_ecc = result[candidate][7];
dev_dbg(dev, "%s: candidate is %x\n", __func__, candidate);
dev_dbg(dev,
- "%s: e94 =%x e9c=%x ea4=%x eac=%x eb4=%x ebc=%x ec4=%x "
- "ecc=%x\n ", __func__, reg_e94, reg_e9c,
+ "%s: e94 =%x e9c=%x ea4=%x eac=%x eb4=%x ebc=%x ec4=%x ecc=%x\n",
+ __func__, reg_e94, reg_e9c,
reg_ea4, reg_eac, reg_eb4, reg_ebc, reg_ec4, reg_ecc);
path_a_ok = true;
path_b_ok = true;
diff --git a/drivers/net/wireless/realtek/rtlwifi/rtl8821ae/dm.c b/drivers/net/wireless/realtek/rtlwifi/rtl8821ae/dm.c
index 9111ba7ff0a1..3be8c88971e2 100644
--- a/drivers/net/wireless/realtek/rtlwifi/rtl8821ae/dm.c
+++ b/drivers/net/wireless/realtek/rtlwifi/rtl8821ae/dm.c
@@ -2642,7 +2642,7 @@ static void rtl8821ae_dm_edca_choose_traffic_idx(
if (cur_tx_bytes > (cur_rx_bytes*4)) {
*pb_is_cur_rdl_state = false;
RT_TRACE(rtlpriv, COMP_TURBO, DBG_LOUD,
- "Uplink Traffic\n ");
+ "Uplink Traffic\n");
} else {
*pb_is_cur_rdl_state = true;
RT_TRACE(rtlpriv, COMP_TURBO, DBG_LOUD,
* Re: [RFC v5 2/5] virtio_ring: support creating packed ring
From: Jason Wang @ 2018-05-29 2:49 UTC
To: Tiwei Bie, mst, virtualization, linux-kernel, netdev; +Cc: wexu, Cornelia Huck
In-Reply-To: <20180522081648.14768-3-tiwei.bie@intel.com>
On 2018-05-22 16:16, Tiwei Bie wrote:
> This commit introduces the support for creating packed ring.
> All split ring specific functions are added _split suffix.
> Some necessary stubs for packed ring are also added.
>
> Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> ---
> drivers/virtio/virtio_ring.c | 801 +++++++++++++++++++++++------------
> include/linux/virtio_ring.h | 8 +-
> 2 files changed, 546 insertions(+), 263 deletions(-)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 71458f493cf8..f5ef5f42a7cf 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -61,11 +61,15 @@ struct vring_desc_state {
> struct vring_desc *indir_desc; /* Indirect descriptor, if any. */
> };
>
> +struct vring_desc_state_packed {
> + int next; /* The next desc state. */
> +};
> +
> struct vring_virtqueue {
> struct virtqueue vq;
>
> - /* Actual memory layout for this queue */
> - struct vring vring;
> + /* Is this a packed ring? */
> + bool packed;
>
> /* Can we use weak barriers? */
> bool weak_barriers;
> @@ -87,11 +91,39 @@ struct vring_virtqueue {
> /* Last used index we've seen. */
> u16 last_used_idx;
>
> - /* Last written value to avail->flags */
> - u16 avail_flags_shadow;
> + union {
> + /* Available for split ring */
> + struct {
> + /* Actual memory layout for this queue. */
> + struct vring vring;
>
> - /* Last written value to avail->idx in guest byte order */
> - u16 avail_idx_shadow;
> + /* Last written value to avail->flags */
> + u16 avail_flags_shadow;
> +
> + /* Last written value to avail->idx in
> + * guest byte order. */
> + u16 avail_idx_shadow;
> + };
> +
> + /* Available for packed ring */
> + struct {
> + /* Actual memory layout for this queue. */
> + struct vring_packed vring_packed;
> +
> + /* Driver ring wrap counter. */
> + u8 avail_wrap_counter;
> +
> + /* Device ring wrap counter. */
> + u8 used_wrap_counter;
How about just using a boolean?
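
A wrap counter only ever holds 0 or 1 and flips each time the index wraps past the ring size, so a bool states the intent directly. A minimal sketch of that behaviour follows; struct ring_pos and ring_advance() are hypothetical names, not part of the patch, and the sketch assumes n <= num.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, stripped-down stand-in for the packed-ring fields. */
struct ring_pos {
	uint16_t next_avail_idx;	/* index of the next avail descriptor */
	bool avail_wrap_counter;	/* flips on every wrap of the index */
};

/* Advance the index by n descriptors in a ring of num entries and
 * toggle the wrap counter when the index wraps. Assumes n <= num. */
static void ring_advance(struct ring_pos *pos, uint16_t n, uint16_t num)
{
	unsigned int idx = (unsigned int)pos->next_avail_idx + n;

	if (idx >= num) {
		idx -= num;
		pos->avail_wrap_counter = !pos->avail_wrap_counter;
	}
	pos->next_avail_idx = (uint16_t)idx;
}
```

With a bool the toggle is a plain negation, and nothing suggests the field could count beyond one bit.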
> +
> + /* Index of the next avail descriptor. */
> + u16 next_avail_idx;
> +
> + /* Last written value to driver->flags in
> + * guest byte order. */
> + u16 event_flags_shadow;
> + };
> + };
>
> /* How to notify other side. FIXME: commonalize hcalls! */
> bool (*notify)(struct virtqueue *vq);
> @@ -111,11 +143,24 @@ struct vring_virtqueue {
> #endif
>
> /* Per-descriptor state. */
> - struct vring_desc_state desc_state[];
> + union {
> + struct vring_desc_state desc_state[1];
> + struct vring_desc_state_packed desc_state_packed[1];
> + };
> };
>
> #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
>
> +static inline bool virtqueue_use_indirect(struct virtqueue *_vq,
> + unsigned int total_sg)
> +{
> + struct vring_virtqueue *vq = to_vvq(_vq);
> +
> + /* If the host supports indirect descriptor tables, and we have multiple
> + * buffers, then go indirect. FIXME: tune this threshold */
> + return (vq->indirect && total_sg > 1 && vq->vq.num_free);
> +}
> +
> /*
> * Modern virtio devices have feature bits to specify whether they need a
> * quirk and bypass the IOMMU. If not there, just use the DMA API.
> @@ -201,8 +246,17 @@ static dma_addr_t vring_map_single(const struct vring_virtqueue *vq,
> cpu_addr, size, direction);
> }
>
> -static void vring_unmap_one(const struct vring_virtqueue *vq,
> - struct vring_desc *desc)
> +static int vring_mapping_error(const struct vring_virtqueue *vq,
> + dma_addr_t addr)
> +{
> + if (!vring_use_dma_api(vq->vq.vdev))
> + return 0;
> +
> + return dma_mapping_error(vring_dma_dev(vq), addr);
> +}
> +
> +static void vring_unmap_one_split(const struct vring_virtqueue *vq,
> + struct vring_desc *desc)
> {
> u16 flags;
>
> @@ -226,17 +280,9 @@ static void vring_unmap_one(const struct vring_virtqueue *vq,
> }
> }
>
> -static int vring_mapping_error(const struct vring_virtqueue *vq,
> - dma_addr_t addr)
> -{
> - if (!vring_use_dma_api(vq->vq.vdev))
> - return 0;
> -
> - return dma_mapping_error(vring_dma_dev(vq), addr);
> -}
It looks to me that if you keep vring_mapping_error() after
vring_unmap_one_split(), a lot of these changes would be unnecessary.
> -
> -static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
> - unsigned int total_sg, gfp_t gfp)
> +static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
> + unsigned int total_sg,
> + gfp_t gfp)
> {
> struct vring_desc *desc;
> unsigned int i;
> @@ -257,14 +303,14 @@ static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
> return desc;
> }
>
> -static inline int virtqueue_add(struct virtqueue *_vq,
> - struct scatterlist *sgs[],
> - unsigned int total_sg,
> - unsigned int out_sgs,
> - unsigned int in_sgs,
> - void *data,
> - void *ctx,
> - gfp_t gfp)
> +static inline int virtqueue_add_split(struct virtqueue *_vq,
> + struct scatterlist *sgs[],
> + unsigned int total_sg,
> + unsigned int out_sgs,
> + unsigned int in_sgs,
> + void *data,
> + void *ctx,
> + gfp_t gfp)
> {
> struct vring_virtqueue *vq = to_vvq(_vq);
> struct scatterlist *sg;
> @@ -300,10 +346,8 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>
> head = vq->free_head;
>
> - /* If the host supports indirect descriptor tables, and we have multiple
> - * buffers, then go indirect. FIXME: tune this threshold */
> - if (vq->indirect && total_sg > 1 && vq->vq.num_free)
> - desc = alloc_indirect(_vq, total_sg, gfp);
> + if (virtqueue_use_indirect(_vq, total_sg))
> + desc = alloc_indirect_split(_vq, total_sg, gfp);
> else {
> desc = NULL;
> WARN_ON_ONCE(total_sg > vq->vring.num && !vq->indirect);
> @@ -424,7 +468,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
> for (n = 0; n < total_sg; n++) {
> if (i == err_idx)
> break;
> - vring_unmap_one(vq, &desc[i]);
> + vring_unmap_one_split(vq, &desc[i]);
> i = virtio16_to_cpu(_vq->vdev, vq->vring.desc[i].next);
> }
>
> @@ -435,6 +479,355 @@ static inline int virtqueue_add(struct virtqueue *_vq,
> return -EIO;
> }
>
> +static bool virtqueue_kick_prepare_split(struct virtqueue *_vq)
> +{
> + struct vring_virtqueue *vq = to_vvq(_vq);
> + u16 new, old;
> + bool needs_kick;
> +
> + START_USE(vq);
> + /* We need to expose available array entries before checking avail
> + * event. */
> + virtio_mb(vq->weak_barriers);
> +
> + old = vq->avail_idx_shadow - vq->num_added;
> + new = vq->avail_idx_shadow;
> + vq->num_added = 0;
> +
> +#ifdef DEBUG
> + if (vq->last_add_time_valid) {
> + WARN_ON(ktime_to_ms(ktime_sub(ktime_get(),
> + vq->last_add_time)) > 100);
> + }
> + vq->last_add_time_valid = false;
> +#endif
> +
> + if (vq->event) {
> + needs_kick = vring_need_event(virtio16_to_cpu(_vq->vdev, vring_avail_event(&vq->vring)),
> + new, old);
> + } else {
> + needs_kick = !(vq->vring.used->flags & cpu_to_virtio16(_vq->vdev, VRING_USED_F_NO_NOTIFY));
> + }
> + END_USE(vq);
> + return needs_kick;
> +}
> +
> +static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
> + void **ctx)
> +{
> + unsigned int i, j;
> + __virtio16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT);
> +
> + /* Clear data ptr. */
> + vq->desc_state[head].data = NULL;
> +
> + /* Put back on free list: unmap first-level descriptors and find end */
> + i = head;
> +
> + while (vq->vring.desc[i].flags & nextflag) {
> + vring_unmap_one_split(vq, &vq->vring.desc[i]);
> + i = virtio16_to_cpu(vq->vq.vdev, vq->vring.desc[i].next);
> + vq->vq.num_free++;
> + }
> +
> + vring_unmap_one_split(vq, &vq->vring.desc[i]);
> + vq->vring.desc[i].next = cpu_to_virtio16(vq->vq.vdev, vq->free_head);
> + vq->free_head = head;
> +
> + /* Plus final descriptor */
> + vq->vq.num_free++;
> +
> + if (vq->indirect) {
> + struct vring_desc *indir_desc = vq->desc_state[head].indir_desc;
> + u32 len;
> +
> + /* Free the indirect table, if any, now that it's unmapped. */
> + if (!indir_desc)
> + return;
> +
> + len = virtio32_to_cpu(vq->vq.vdev, vq->vring.desc[head].len);
> +
> + BUG_ON(!(vq->vring.desc[head].flags &
> + cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
> + BUG_ON(len == 0 || len % sizeof(struct vring_desc));
> +
> + for (j = 0; j < len / sizeof(struct vring_desc); j++)
> + vring_unmap_one_split(vq, &indir_desc[j]);
> +
> + kfree(indir_desc);
> + vq->desc_state[head].indir_desc = NULL;
> + } else if (ctx) {
> + *ctx = vq->desc_state[head].indir_desc;
> + }
> +}
> +
> +static inline bool more_used_split(const struct vring_virtqueue *vq)
> +{
> + return vq->last_used_idx != virtio16_to_cpu(vq->vq.vdev, vq->vring.used->idx);
> +}
> +
> +static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq,
> + unsigned int *len,
> + void **ctx)
> +{
> + struct vring_virtqueue *vq = to_vvq(_vq);
> + void *ret;
> + unsigned int i;
> + u16 last_used;
> +
> + START_USE(vq);
> +
> + if (unlikely(vq->broken)) {
> + END_USE(vq);
> + return NULL;
> + }
> +
> + if (!more_used_split(vq)) {
> + pr_debug("No more buffers in queue\n");
> + END_USE(vq);
> + return NULL;
> + }
> +
> + /* Only get used array entries after they have been exposed by host. */
> + virtio_rmb(vq->weak_barriers);
> +
> + last_used = (vq->last_used_idx & (vq->vring.num - 1));
> + i = virtio32_to_cpu(_vq->vdev, vq->vring.used->ring[last_used].id);
> + *len = virtio32_to_cpu(_vq->vdev, vq->vring.used->ring[last_used].len);
> +
> + if (unlikely(i >= vq->vring.num)) {
> + BAD_RING(vq, "id %u out of range\n", i);
> + return NULL;
> + }
> + if (unlikely(!vq->desc_state[i].data)) {
> + BAD_RING(vq, "id %u is not a head!\n", i);
> + return NULL;
> + }
> +
> + /* detach_buf_split clears data, so grab it now. */
> + ret = vq->desc_state[i].data;
> + detach_buf_split(vq, i, ctx);
> + vq->last_used_idx++;
> + /* If we expect an interrupt for the next entry, tell host
> + * by writing event index and flush out the write before
> + * the read in the next get_buf call. */
> + if (!(vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT))
> + virtio_store_mb(vq->weak_barriers,
> + &vring_used_event(&vq->vring),
> + cpu_to_virtio16(_vq->vdev, vq->last_used_idx));
> +
> +#ifdef DEBUG
> + vq->last_add_time_valid = false;
> +#endif
> +
> + END_USE(vq);
> + return ret;
> +}
> +
> +static void virtqueue_disable_cb_split(struct virtqueue *_vq)
> +{
> + struct vring_virtqueue *vq = to_vvq(_vq);
> +
> + if (!(vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT)) {
> + vq->avail_flags_shadow |= VRING_AVAIL_F_NO_INTERRUPT;
> + if (!vq->event)
> + vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
> + }
> +}
> +
> +static unsigned virtqueue_enable_cb_prepare_split(struct virtqueue *_vq)
> +{
> + struct vring_virtqueue *vq = to_vvq(_vq);
> + u16 last_used_idx;
> +
> + START_USE(vq);
> +
> + /* We optimistically turn back on interrupts, then check if there was
> + * more to do. */
> + /* Depending on the VIRTIO_RING_F_EVENT_IDX feature, we need to
> + * either clear the flags bit or point the event index at the next
> + * entry. Always do both to keep code simple. */
> + if (vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT) {
> + vq->avail_flags_shadow &= ~VRING_AVAIL_F_NO_INTERRUPT;
> + if (!vq->event)
> + vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
> + }
> + vring_used_event(&vq->vring) = cpu_to_virtio16(_vq->vdev, last_used_idx = vq->last_used_idx);
> + END_USE(vq);
> + return last_used_idx;
> +}
> +
> +static bool virtqueue_poll_split(struct virtqueue *_vq, unsigned last_used_idx)
> +{
> + struct vring_virtqueue *vq = to_vvq(_vq);
> +
> + virtio_mb(vq->weak_barriers);
> + return (u16)last_used_idx != virtio16_to_cpu(_vq->vdev, vq->vring.used->idx);
> +}
> +
> +static bool virtqueue_enable_cb_delayed_split(struct virtqueue *_vq)
> +{
> + struct vring_virtqueue *vq = to_vvq(_vq);
> + u16 bufs;
> +
> + START_USE(vq);
> +
> + /* We optimistically turn back on interrupts, then check if there was
> + * more to do. */
> + /* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
> + * either clear the flags bit or point the event index at the next
> + * entry. Always update the event index to keep code simple. */
> + if (vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT) {
> + vq->avail_flags_shadow &= ~VRING_AVAIL_F_NO_INTERRUPT;
> + if (!vq->event)
> + vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
> + }
> + /* TODO: tune this threshold */
> + bufs = (u16)(vq->avail_idx_shadow - vq->last_used_idx) * 3 / 4;
> +
> + virtio_store_mb(vq->weak_barriers,
> + &vring_used_event(&vq->vring),
> + cpu_to_virtio16(_vq->vdev, vq->last_used_idx + bufs));
> +
> + if (unlikely((u16)(virtio16_to_cpu(_vq->vdev, vq->vring.used->idx) - vq->last_used_idx) > bufs)) {
> + END_USE(vq);
> + return false;
> + }
> +
> + END_USE(vq);
> + return true;
> +}
> +
> +static void *virtqueue_detach_unused_buf_split(struct virtqueue *_vq)
> +{
> + struct vring_virtqueue *vq = to_vvq(_vq);
> + unsigned int i;
> + void *buf;
> +
> + START_USE(vq);
> +
> + for (i = 0; i < vq->vring.num; i++) {
> + if (!vq->desc_state[i].data)
> + continue;
> + /* detach_buf clears data, so grab it now. */
> + buf = vq->desc_state[i].data;
> + detach_buf_split(vq, i, NULL);
> + vq->avail_idx_shadow--;
> + vq->vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->avail_idx_shadow);
> + END_USE(vq);
> + return buf;
> + }
> + /* That should have freed everything. */
> + BUG_ON(vq->vq.num_free != vq->vring.num);
> +
> + END_USE(vq);
> + return NULL;
> +}
I think those copy-and-paste hunks could be avoided, so that the diff
contains only the renaming of the functions. If so, that would be very
welcome, since otherwise the moved code has to be compared line by line.
> +
> +/*
> + * The layout for the packed ring is a continuous chunk of memory
> + * which looks like this.
> + *
> + * struct vring_packed {
> + * // The actual descriptors (16 bytes each)
> + * struct vring_packed_desc desc[num];
> + *
> + * // Padding to the next align boundary.
> + * char pad[];
> + *
> + * // Driver Event Suppression
> + * struct vring_packed_desc_event driver;
> + *
> + * // Device Event Suppression
> + * struct vring_packed_desc_event device;
> + * };
> + */
> +static inline void vring_init_packed(struct vring_packed *vr, unsigned int num,
> + void *p, unsigned long align)
> +{
> + vr->num = num;
> + vr->desc = p;
> + vr->driver = (void *)(((uintptr_t)p + sizeof(struct vring_packed_desc)
> + * num + align - 1) & ~(align - 1));
If we choose not to put this in uapi, maybe we can use the ALIGN() macro here?
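
To illustrate: for a power-of-two alignment, the open-coded rounding and an ALIGN()-style macro compute the same address. This is a minimal user-space sketch; ALIGN_UP is a local stand-in defined here, not the kernel's ALIGN(), and the 16-byte descriptor size is taken from the comment in the quoted patch.

```c
#include <stdint.h>

/* Descriptor size in bytes, per the "16 bytes each" comment above. */
#define DESC_SIZE 16u

/* Local stand-in for an ALIGN()-style helper; only valid when a is a
 * power of two. Hypothetical name, for illustration. */
#define ALIGN_UP(x, a) (((x) + ((a) - 1)) & ~((a) - 1))

/* Open-coded rounding, in the style of vring_init_packed(). */
static uintptr_t driver_area_open_coded(uintptr_t p, unsigned int num,
					uintptr_t align)
{
	return (p + (uintptr_t)DESC_SIZE * num + align - 1) & ~(align - 1);
}

/* The same computation expressed through the macro. */
static uintptr_t driver_area_aligned(uintptr_t p, unsigned int num,
				     uintptr_t align)
{
	return ALIGN_UP(p + (uintptr_t)DESC_SIZE * num, align);
}
```

Both helpers return the first align-sized boundary at or after the end of the descriptor array; the macro form just makes the rounding step easier to read.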
> + vr->device = vr->driver + 1;
> +}
> +
> +static inline unsigned vring_size_packed(unsigned int num, unsigned long align)
> +{
> + return ((sizeof(struct vring_packed_desc) * num + align - 1)
> + & ~(align - 1)) + sizeof(struct vring_packed_desc_event) * 2;
> +}
> +
> +static inline int virtqueue_add_packed(struct virtqueue *_vq,
> + struct scatterlist *sgs[],
> + unsigned int total_sg,
> + unsigned int out_sgs,
> + unsigned int in_sgs,
> + void *data,
> + void *ctx,
> + gfp_t gfp)
> +{
> + return -EIO;
> +}
> +
> +static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> +{
> + return false;
> +}
> +
> +static inline bool more_used_packed(const struct vring_virtqueue *vq)
> +{
> + return false;
> +}
> +
> +static void *virtqueue_get_buf_ctx_packed(struct virtqueue *_vq,
> + unsigned int *len,
> + void **ctx)
> +{
> + return NULL;
> +}
> +
> +static void virtqueue_disable_cb_packed(struct virtqueue *_vq)
> +{
> +}
> +
> +static unsigned virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
> +{
> + return 0;
> +}
> +
> +static bool virtqueue_poll_packed(struct virtqueue *_vq, unsigned last_used_idx)
> +{
> + return false;
> +}
> +
> +static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
> +{
> + return false;
> +}
> +
> +static void *virtqueue_detach_unused_buf_packed(struct virtqueue *_vq)
> +{
> + return NULL;
> +}
> +
> +static inline int virtqueue_add(struct virtqueue *_vq,
> + struct scatterlist *sgs[],
> + unsigned int total_sg,
> + unsigned int out_sgs,
> + unsigned int in_sgs,
> + void *data,
> + void *ctx,
> + gfp_t gfp)
> +{
> + struct vring_virtqueue *vq = to_vvq(_vq);
> +
> + return vq->packed ? virtqueue_add_packed(_vq, sgs, total_sg, out_sgs,
> + in_sgs, data, ctx, gfp) :
> + virtqueue_add_split(_vq, sgs, total_sg, out_sgs,
> + in_sgs, data, ctx, gfp);
> +}
> +
> /**
> * virtqueue_add_sgs - expose buffers to other end
> * @vq: the struct virtqueue we're talking about.
> @@ -551,34 +944,9 @@ EXPORT_SYMBOL_GPL(virtqueue_add_inbuf_ctx);
> bool virtqueue_kick_prepare(struct virtqueue *_vq)
> {
> struct vring_virtqueue *vq = to_vvq(_vq);
> - u16 new, old;
> - bool needs_kick;
>
> - START_USE(vq);
> - /* We need to expose available array entries before checking avail
> - * event. */
> - virtio_mb(vq->weak_barriers);
> -
> - old = vq->avail_idx_shadow - vq->num_added;
> - new = vq->avail_idx_shadow;
> - vq->num_added = 0;
> -
> -#ifdef DEBUG
> - if (vq->last_add_time_valid) {
> - WARN_ON(ktime_to_ms(ktime_sub(ktime_get(),
> - vq->last_add_time)) > 100);
> - }
> - vq->last_add_time_valid = false;
> -#endif
> -
> - if (vq->event) {
> - needs_kick = vring_need_event(virtio16_to_cpu(_vq->vdev, vring_avail_event(&vq->vring)),
> - new, old);
> - } else {
> - needs_kick = !(vq->vring.used->flags & cpu_to_virtio16(_vq->vdev, VRING_USED_F_NO_NOTIFY));
> - }
> - END_USE(vq);
> - return needs_kick;
> + return vq->packed ? virtqueue_kick_prepare_packed(_vq) :
> + virtqueue_kick_prepare_split(_vq);
> }
> EXPORT_SYMBOL_GPL(virtqueue_kick_prepare);
>
> @@ -626,58 +994,9 @@ bool virtqueue_kick(struct virtqueue *vq)
> }
> EXPORT_SYMBOL_GPL(virtqueue_kick);
>
> -static void detach_buf(struct vring_virtqueue *vq, unsigned int head,
> - void **ctx)
> -{
> - unsigned int i, j;
> - __virtio16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT);
> -
> - /* Clear data ptr. */
> - vq->desc_state[head].data = NULL;
> -
> - /* Put back on free list: unmap first-level descriptors and find end */
> - i = head;
> -
> - while (vq->vring.desc[i].flags & nextflag) {
> - vring_unmap_one(vq, &vq->vring.desc[i]);
> - i = virtio16_to_cpu(vq->vq.vdev, vq->vring.desc[i].next);
> - vq->vq.num_free++;
> - }
> -
> - vring_unmap_one(vq, &vq->vring.desc[i]);
> - vq->vring.desc[i].next = cpu_to_virtio16(vq->vq.vdev, vq->free_head);
> - vq->free_head = head;
> -
> - /* Plus final descriptor */
> - vq->vq.num_free++;
> -
> - if (vq->indirect) {
> - struct vring_desc *indir_desc = vq->desc_state[head].indir_desc;
> - u32 len;
> -
> - /* Free the indirect table, if any, now that it's unmapped. */
> - if (!indir_desc)
> - return;
> -
> - len = virtio32_to_cpu(vq->vq.vdev, vq->vring.desc[head].len);
> -
> - BUG_ON(!(vq->vring.desc[head].flags &
> - cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
> - BUG_ON(len == 0 || len % sizeof(struct vring_desc));
> -
> - for (j = 0; j < len / sizeof(struct vring_desc); j++)
> - vring_unmap_one(vq, &indir_desc[j]);
> -
> - kfree(indir_desc);
> - vq->desc_state[head].indir_desc = NULL;
> - } else if (ctx) {
> - *ctx = vq->desc_state[head].indir_desc;
> - }
> -}
> -
> static inline bool more_used(const struct vring_virtqueue *vq)
> {
> - return vq->last_used_idx != virtio16_to_cpu(vq->vq.vdev, vq->vring.used->idx);
> + return vq->packed ? more_used_packed(vq) : more_used_split(vq);
> }
>
> /**
> @@ -700,57 +1019,9 @@ void *virtqueue_get_buf_ctx(struct virtqueue *_vq, unsigned int *len,
> void **ctx)
> {
> struct vring_virtqueue *vq = to_vvq(_vq);
> - void *ret;
> - unsigned int i;
> - u16 last_used;
>
> - START_USE(vq);
> -
> - if (unlikely(vq->broken)) {
> - END_USE(vq);
> - return NULL;
> - }
> -
> - if (!more_used(vq)) {
> - pr_debug("No more buffers in queue\n");
> - END_USE(vq);
> - return NULL;
> - }
> -
> - /* Only get used array entries after they have been exposed by host. */
> - virtio_rmb(vq->weak_barriers);
> -
> - last_used = (vq->last_used_idx & (vq->vring.num - 1));
> - i = virtio32_to_cpu(_vq->vdev, vq->vring.used->ring[last_used].id);
> - *len = virtio32_to_cpu(_vq->vdev, vq->vring.used->ring[last_used].len);
> -
> - if (unlikely(i >= vq->vring.num)) {
> - BAD_RING(vq, "id %u out of range\n", i);
> - return NULL;
> - }
> - if (unlikely(!vq->desc_state[i].data)) {
> - BAD_RING(vq, "id %u is not a head!\n", i);
> - return NULL;
> - }
> -
> - /* detach_buf clears data, so grab it now. */
> - ret = vq->desc_state[i].data;
> - detach_buf(vq, i, ctx);
> - vq->last_used_idx++;
> - /* If we expect an interrupt for the next entry, tell host
> - * by writing event index and flush out the write before
> - * the read in the next get_buf call. */
> - if (!(vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT))
> - virtio_store_mb(vq->weak_barriers,
> - &vring_used_event(&vq->vring),
> - cpu_to_virtio16(_vq->vdev, vq->last_used_idx));
> -
> -#ifdef DEBUG
> - vq->last_add_time_valid = false;
> -#endif
> -
> - END_USE(vq);
> - return ret;
> + return vq->packed ? virtqueue_get_buf_ctx_packed(_vq, len, ctx) :
> + virtqueue_get_buf_ctx_split(_vq, len, ctx);
> }
> EXPORT_SYMBOL_GPL(virtqueue_get_buf_ctx);
>
> @@ -772,12 +1043,10 @@ void virtqueue_disable_cb(struct virtqueue *_vq)
> {
> struct vring_virtqueue *vq = to_vvq(_vq);
>
> - if (!(vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT)) {
> - vq->avail_flags_shadow |= VRING_AVAIL_F_NO_INTERRUPT;
> - if (!vq->event)
> - vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
> - }
> -
> + if (vq->packed)
> + virtqueue_disable_cb_packed(_vq);
> + else
> + virtqueue_disable_cb_split(_vq);
> }
> EXPORT_SYMBOL_GPL(virtqueue_disable_cb);
>
> @@ -796,23 +1065,9 @@ EXPORT_SYMBOL_GPL(virtqueue_disable_cb);
> unsigned virtqueue_enable_cb_prepare(struct virtqueue *_vq)
> {
> struct vring_virtqueue *vq = to_vvq(_vq);
> - u16 last_used_idx;
>
> - START_USE(vq);
> -
> - /* We optimistically turn back on interrupts, then check if there was
> - * more to do. */
> - /* Depending on the VIRTIO_RING_F_EVENT_IDX feature, we need to
> - * either clear the flags bit or point the event index at the next
> - * entry. Always do both to keep code simple. */
> - if (vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT) {
> - vq->avail_flags_shadow &= ~VRING_AVAIL_F_NO_INTERRUPT;
> - if (!vq->event)
> - vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
> - }
> - vring_used_event(&vq->vring) = cpu_to_virtio16(_vq->vdev, last_used_idx = vq->last_used_idx);
> - END_USE(vq);
> - return last_used_idx;
> + return vq->packed ? virtqueue_enable_cb_prepare_packed(_vq) :
> + virtqueue_enable_cb_prepare_split(_vq);
> }
> EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare);
>
> @@ -829,8 +1084,8 @@ bool virtqueue_poll(struct virtqueue *_vq, unsigned last_used_idx)
> {
> struct vring_virtqueue *vq = to_vvq(_vq);
>
> - virtio_mb(vq->weak_barriers);
> - return (u16)last_used_idx != virtio16_to_cpu(_vq->vdev, vq->vring.used->idx);
> + return vq->packed ? virtqueue_poll_packed(_vq, last_used_idx) :
> + virtqueue_poll_split(_vq, last_used_idx);
> }
> EXPORT_SYMBOL_GPL(virtqueue_poll);
>
> @@ -868,34 +1123,9 @@ EXPORT_SYMBOL_GPL(virtqueue_enable_cb);
> bool virtqueue_enable_cb_delayed(struct virtqueue *_vq)
> {
> struct vring_virtqueue *vq = to_vvq(_vq);
> - u16 bufs;
>
> - START_USE(vq);
> -
> - /* We optimistically turn back on interrupts, then check if there was
> - * more to do. */
> - /* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
> - * either clear the flags bit or point the event index at the next
> - * entry. Always update the event index to keep code simple. */
> - if (vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT) {
> - vq->avail_flags_shadow &= ~VRING_AVAIL_F_NO_INTERRUPT;
> - if (!vq->event)
> - vq->vring.avail->flags = cpu_to_virtio16(_vq->vdev, vq->avail_flags_shadow);
> - }
> - /* TODO: tune this threshold */
> - bufs = (u16)(vq->avail_idx_shadow - vq->last_used_idx) * 3 / 4;
> -
> - virtio_store_mb(vq->weak_barriers,
> - &vring_used_event(&vq->vring),
> - cpu_to_virtio16(_vq->vdev, vq->last_used_idx + bufs));
> -
> - if (unlikely((u16)(virtio16_to_cpu(_vq->vdev, vq->vring.used->idx) - vq->last_used_idx) > bufs)) {
> - END_USE(vq);
> - return false;
> - }
> -
> - END_USE(vq);
> - return true;
> + return vq->packed ? virtqueue_enable_cb_delayed_packed(_vq) :
> + virtqueue_enable_cb_delayed_split(_vq);
> }
> EXPORT_SYMBOL_GPL(virtqueue_enable_cb_delayed);
>
> @@ -910,27 +1140,9 @@ EXPORT_SYMBOL_GPL(virtqueue_enable_cb_delayed);
> void *virtqueue_detach_unused_buf(struct virtqueue *_vq)
> {
> struct vring_virtqueue *vq = to_vvq(_vq);
> - unsigned int i;
> - void *buf;
>
> - START_USE(vq);
> -
> - for (i = 0; i < vq->vring.num; i++) {
> - if (!vq->desc_state[i].data)
> - continue;
> - /* detach_buf clears data, so grab it now. */
> - buf = vq->desc_state[i].data;
> - detach_buf(vq, i, NULL);
> - vq->avail_idx_shadow--;
> - vq->vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->avail_idx_shadow);
> - END_USE(vq);
> - return buf;
> - }
> - /* That should have freed everything. */
> - BUG_ON(vq->vq.num_free != vq->vring.num);
> -
> - END_USE(vq);
> - return NULL;
> + return vq->packed ? virtqueue_detach_unused_buf_packed(_vq) :
> + virtqueue_detach_unused_buf_split(_vq);
> }
> EXPORT_SYMBOL_GPL(virtqueue_detach_unused_buf);
>
> @@ -955,7 +1167,8 @@ irqreturn_t vring_interrupt(int irq, void *_vq)
> EXPORT_SYMBOL_GPL(vring_interrupt);
>
> struct virtqueue *__vring_new_virtqueue(unsigned int index,
> - struct vring vring,
> + union vring_union vring,
> + bool packed,
> struct virtio_device *vdev,
> bool weak_barriers,
> bool context,
> @@ -963,19 +1176,22 @@ struct virtqueue *__vring_new_virtqueue(unsigned int index,
> void (*callback)(struct virtqueue *),
> const char *name)
> {
> - unsigned int i;
> struct vring_virtqueue *vq;
> + unsigned int num, i;
> + size_t size;
>
> - vq = kmalloc(sizeof(*vq) + vring.num * sizeof(struct vring_desc_state),
> - GFP_KERNEL);
> + num = packed ? vring.vring_packed.num : vring.vring_split.num;
> + size = packed ? num * sizeof(struct vring_desc_state_packed) :
> + num * sizeof(struct vring_desc_state);
> +
> + vq = kmalloc(sizeof(*vq) + size, GFP_KERNEL);
> if (!vq)
> return NULL;
>
> - vq->vring = vring;
> vq->vq.callback = callback;
> vq->vq.vdev = vdev;
> vq->vq.name = name;
> - vq->vq.num_free = vring.num;
> + vq->vq.num_free = num;
> vq->vq.index = index;
> vq->we_own_ring = false;
> vq->queue_dma_addr = 0;
> @@ -984,9 +1200,8 @@ struct virtqueue *__vring_new_virtqueue(unsigned int index,
> vq->weak_barriers = weak_barriers;
> vq->broken = false;
> vq->last_used_idx = 0;
> - vq->avail_flags_shadow = 0;
> - vq->avail_idx_shadow = 0;
> vq->num_added = 0;
> + vq->packed = packed;
> list_add_tail(&vq->vq.list, &vdev->vqs);
> #ifdef DEBUG
> vq->in_use = false;
> @@ -997,19 +1212,48 @@ struct virtqueue *__vring_new_virtqueue(unsigned int index,
> !context;
> vq->event = virtio_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX);
>
> + if (vq->packed) {
> + vq->vring_packed = vring.vring_packed;
> + vq->next_avail_idx = 0;
> + vq->avail_wrap_counter = 1;
> + vq->used_wrap_counter = 1;
> + vq->event_flags_shadow = 0;
> +
> + memset(vq->desc_state_packed, 0,
> + num * sizeof(struct vring_desc_state_packed));
> +
> + /* Put everything in free lists. */
> + vq->free_head = 0;
> + for (i = 0; i < num-1; i++)
> + vq->desc_state_packed[i].next = i + 1;
> + } else {
> + vq->vring = vring.vring_split;
> + vq->avail_flags_shadow = 0;
> + vq->avail_idx_shadow = 0;
> +
> + /* Put everything in free lists. */
> + vq->free_head = 0;
> + for (i = 0; i < num-1; i++)
> + vq->vring.desc[i].next = cpu_to_virtio16(vdev, i + 1);
> +
> + memset(vq->desc_state, 0,
> + num * sizeof(struct vring_desc_state));
> + }
> +
> /* No callback? Tell other side not to bother us. */
> if (!callback) {
> - vq->avail_flags_shadow |= VRING_AVAIL_F_NO_INTERRUPT;
> - if (!vq->event)
> - vq->vring.avail->flags = cpu_to_virtio16(vdev, vq->avail_flags_shadow);
> + if (packed) {
> + vq->event_flags_shadow = VRING_EVENT_F_DISABLE;
> + vq->vring_packed.driver->flags = cpu_to_virtio16(vdev,
> + vq->event_flags_shadow);
> + } else {
> + vq->avail_flags_shadow |= VRING_AVAIL_F_NO_INTERRUPT;
> + if (!vq->event)
> + vq->vring.avail->flags = cpu_to_virtio16(vdev,
> + vq->avail_flags_shadow);
> + }
> }
>
> - /* Put everything in free lists. */
> - vq->free_head = 0;
> - for (i = 0; i < vring.num-1; i++)
> - vq->vring.desc[i].next = cpu_to_virtio16(vdev, i + 1);
> - memset(vq->desc_state, 0, vring.num * sizeof(struct vring_desc_state));
> -
> return &vq->vq;
> }
> EXPORT_SYMBOL_GPL(__vring_new_virtqueue);
> @@ -1056,6 +1300,12 @@ static void vring_free_queue(struct virtio_device *vdev, size_t size,
> }
> }
>
> +static inline int
> +__vring_size(unsigned int num, unsigned long align, bool packed)
> +{
> + return packed ? vring_size_packed(num, align) : vring_size(num, align);
> +}
> +
> struct virtqueue *vring_create_virtqueue(
> unsigned int index,
> unsigned int num,
> @@ -1072,7 +1322,8 @@ struct virtqueue *vring_create_virtqueue(
> void *queue = NULL;
> dma_addr_t dma_addr;
> size_t queue_size_in_bytes;
> - struct vring vring;
> + union vring_union vring;
> + bool packed;
>
> /* We assume num is a power of 2. */
> if (num & (num - 1)) {
> @@ -1080,9 +1331,13 @@ struct virtqueue *vring_create_virtqueue(
> return NULL;
> }
>
> + packed = virtio_has_feature(vdev, VIRTIO_F_RING_PACKED);
> +
> /* TODO: allocate each queue chunk individually */
> - for (; num && vring_size(num, vring_align) > PAGE_SIZE; num /= 2) {
> - queue = vring_alloc_queue(vdev, vring_size(num, vring_align),
> + for (; num && __vring_size(num, vring_align, packed) > PAGE_SIZE;
> + num /= 2) {
> + queue = vring_alloc_queue(vdev, __vring_size(num, vring_align,
> + packed),
> &dma_addr,
> GFP_KERNEL|__GFP_NOWARN|__GFP_ZERO);
> if (queue)
> @@ -1094,17 +1349,21 @@ struct virtqueue *vring_create_virtqueue(
>
> if (!queue) {
> /* Try to get a single page. You are my only hope! */
> - queue = vring_alloc_queue(vdev, vring_size(num, vring_align),
> + queue = vring_alloc_queue(vdev, __vring_size(num, vring_align,
> + packed),
> &dma_addr, GFP_KERNEL|__GFP_ZERO);
> }
> if (!queue)
> return NULL;
>
> - queue_size_in_bytes = vring_size(num, vring_align);
> - vring_init(&vring, num, queue, vring_align);
> + queue_size_in_bytes = __vring_size(num, vring_align, packed);
> + if (packed)
> + vring_init_packed(&vring.vring_packed, num, queue, vring_align);
> + else
> + vring_init(&vring.vring_split, num, queue, vring_align);
>
> - vq = __vring_new_virtqueue(index, vring, vdev, weak_barriers, context,
> - notify, callback, name);
> + vq = __vring_new_virtqueue(index, vring, packed, vdev, weak_barriers,
> + context, notify, callback, name);
> if (!vq) {
> vring_free_queue(vdev, queue_size_in_bytes, queue,
> dma_addr);
> @@ -1130,10 +1389,17 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
> void (*callback)(struct virtqueue *vq),
> const char *name)
> {
> - struct vring vring;
> - vring_init(&vring, num, pages, vring_align);
> - return __vring_new_virtqueue(index, vring, vdev, weak_barriers, context,
> - notify, callback, name);
> + union vring_union vring;
> + bool packed;
> +
> + packed = virtio_has_feature(vdev, VIRTIO_F_RING_PACKED);
> + if (packed)
> + vring_init_packed(&vring.vring_packed, num, pages, vring_align);
> + else
> + vring_init(&vring.vring_split, num, pages, vring_align);
> +
> + return __vring_new_virtqueue(index, vring, packed, vdev, weak_barriers,
> + context, notify, callback, name);
> }
> EXPORT_SYMBOL_GPL(vring_new_virtqueue);
>
> @@ -1143,7 +1409,9 @@ void vring_del_virtqueue(struct virtqueue *_vq)
>
> if (vq->we_own_ring) {
> vring_free_queue(vq->vq.vdev, vq->queue_size_in_bytes,
> - vq->vring.desc, vq->queue_dma_addr);
> + vq->packed ? (void *)vq->vring_packed.desc :
> + (void *)vq->vring.desc,
> + vq->queue_dma_addr);
> }
> list_del(&_vq->list);
> kfree(vq);
> @@ -1185,7 +1453,7 @@ unsigned int virtqueue_get_vring_size(struct virtqueue *_vq)
>
> struct vring_virtqueue *vq = to_vvq(_vq);
>
> - return vq->vring.num;
> + return vq->packed ? vq->vring_packed.num : vq->vring.num;
> }
> EXPORT_SYMBOL_GPL(virtqueue_get_vring_size);
>
> @@ -1228,6 +1496,10 @@ dma_addr_t virtqueue_get_avail_addr(struct virtqueue *_vq)
>
> BUG_ON(!vq->we_own_ring);
>
> + if (vq->packed)
> + return vq->queue_dma_addr + ((char *)vq->vring_packed.driver -
> + (char *)vq->vring_packed.desc);
> +
> return vq->queue_dma_addr +
> ((char *)vq->vring.avail - (char *)vq->vring.desc);
> }
> @@ -1239,11 +1511,16 @@ dma_addr_t virtqueue_get_used_addr(struct virtqueue *_vq)
>
> BUG_ON(!vq->we_own_ring);
>
> + if (vq->packed)
> + return vq->queue_dma_addr + ((char *)vq->vring_packed.device -
> + (char *)vq->vring_packed.desc);
> +
> return vq->queue_dma_addr +
> ((char *)vq->vring.used - (char *)vq->vring.desc);
> }
> EXPORT_SYMBOL_GPL(virtqueue_get_used_addr);
>
> +/* Only available for split ring */
> const struct vring *virtqueue_get_vring(struct virtqueue *vq)
> {
A possible issue with this:
Since commit d4674240f31f8c4289abba07d64291c6ddce51bc ("KVM: s390:
virtio-ccw revision 1 SET_VQ"), CCW uses
virtqueue_get_avail()/virtqueue_get_used(). This looks like a bug either
here or in the ccw code.
Thanks
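To make the concern concrete, here is a minimal userspace sketch, not kernel code. It assumes the two ring layouts share storage and are told apart by a flag; every demo_* name is hypothetical. A split-only accessor can refuse a packed queue, while a layout-independent accessor dispatches on the flag:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two ring layouts; not the kernel structs. */
struct demo_split_ring  { unsigned int num; void *desc; void *avail; void *used; };
struct demo_packed_ring { unsigned int num; void *desc; void *driver; void *device; };

struct demo_vq {
	bool packed;			/* tag: which union member is valid */
	union {
		struct demo_split_ring  split;
		struct demo_packed_ring packed_ring;
	} ring;
};

/* Split-only accessor: refuses to hand out a split view of a packed queue. */
const struct demo_split_ring *demo_get_split_ring(const struct demo_vq *vq)
{
	if (vq->packed)
		return NULL;		/* caller must handle "not available" */
	return &vq->ring.split;
}

/* Layout-independent accessor: valid for both formats. */
unsigned int demo_get_ring_size(const struct demo_vq *vq)
{
	return vq->packed ? vq->ring.packed_ring.num : vq->ring.split.num;
}
```

With a guard like this, a caller that wants avail/used pointers would have to check for NULL or avoid the split-only accessor. Whether that is the right fix for the real code is left open here.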
> return &to_vvq(vq)->vring;
> diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
> index bbf32524ab27..a0075894ad16 100644
> --- a/include/linux/virtio_ring.h
> +++ b/include/linux/virtio_ring.h
> @@ -60,6 +60,11 @@ static inline void virtio_store_mb(bool weak_barriers,
> struct virtio_device;
> struct virtqueue;
>
> +union vring_union {
> + struct vring vring_split;
> + struct vring_packed vring_packed;
> +};
> +
> /*
> * Creates a virtqueue and allocates the descriptor ring. If
> * may_reduce_num is set, then this may allocate a smaller ring than
> @@ -79,7 +84,8 @@ struct virtqueue *vring_create_virtqueue(unsigned int index,
>
> /* Creates a virtqueue with a custom layout. */
> struct virtqueue *__vring_new_virtqueue(unsigned int index,
> - struct vring vring,
> + union vring_union vring,
> + bool packed,
> struct virtio_device *vdev,
> bool weak_barriers,
> bool ctx,
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* [PATCH bpf-next] bpf: hide the unused 'off' variable
From: YueHaibing @ 2018-05-29 2:40 UTC (permalink / raw)
To: davem, ast, daniel; +Cc: netdev, linux-kernel, YueHaibing
The local variable 'off' is only used when CONFIG_IPV6 is enabled, so
builds without it warn:
net/core/filter.c: In function ‘sk_msg_convert_ctx_access’:
net/core/filter.c:6489:6: warning: unused variable ‘off’ [-Wunused-variable]
int off;
^
Wrap the declaration in #if IS_ENABLED(CONFIG_IPV6) to silence the warning.
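The same pattern in a standalone sketch, for illustration only. DEMO_HAVE_IPV6 is a hypothetical stand-in for IS_ENABLED(CONFIG_IPV6), and demo_convert_field() is not the kernel function; the point is that the declaration sits under the same guard as its only user, so neither configuration sees an unused variable:

```c
/* DEMO_HAVE_IPV6 is a hypothetical stand-in for IS_ENABLED(CONFIG_IPV6). */
#ifndef DEMO_HAVE_IPV6
#define DEMO_HAVE_IPV6 0
#endif

/* Returns a made-up offset for a field id, or -1 if the field is unknown. */
int demo_convert_field(int field)
{
#if DEMO_HAVE_IPV6
	int off;	/* declared under the same guard as its only user */
#endif

	switch (field) {
	case 0:
		return 4;
#if DEMO_HAVE_IPV6
	case 1:
		off = field * 8;
		return off + 16;
#endif
	default:
		return -1;
	}
}
```

Building with -Wunused-variable should stay quiet both with the default (feature off) and with -DDEMO_HAVE_IPV6=1.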
Fixes: 303def35f64e ("bpf: allow sk_msg programs to read sock fields")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
---
net/core/filter.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/net/core/filter.c b/net/core/filter.c
index 24e6ce8..0ce93ed 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6486,7 +6486,9 @@ static u32 sk_msg_convert_ctx_access(enum bpf_access_type type,
struct bpf_prog *prog, u32 *target_size)
{
struct bpf_insn *insn = insn_buf;
+#if IS_ENABLED(CONFIG_IPV6)
int off;
+#endif
switch (si->off) {
case offsetof(struct sk_msg_md, data):
--
2.7.0
^ permalink raw reply related
* Re: [PATCH 0/9] Netfilter/IPVS fixes for net
From: David Miller @ 2018-05-29 2:39 UTC (permalink / raw)
To: pablo; +Cc: netfilter-devel, netdev
In-Reply-To: <20180528234221.31254-1-pablo@netfilter.org>
From: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Tue, 29 May 2018 01:42:12 +0200
> The following patchset contains Netfilter/IPVS fixes for your net tree:
>
> 1) Null pointer dereference when dumping conntrack helper configuration,
> from Taehee Yoo.
>
> 2) Missing sanitization in ebtables extension name through compat,
> from Paolo Abeni.
>
> 3) Broken fetch of tracing value, from Taehee Yoo.
>
> 4) Incorrect arithmetics in packet ratelimiting.
>
> 5) Buffer overflow in IPVS sync daemon, from Julian Anastasov.
>
> 6) Wrong argument to nla_strlcpy() in nfnetlink_{acct,cthelper},
> from Eric Dumazet.
>
> 7) Fix splat in nft_update_chain_stats().
>
> 8) Null pointer dereference from object netlink dump path, from
> Taehee Yoo.
>
> 9) Missing static_branch_inc() when enabling counters in existing
> chain, from Taehee Yoo.
>
> You can pull these changes from:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git
Pulled, thanks.
^ permalink raw reply
* Re: [PATCH v2 net] tun: Fix NULL pointer dereference in XDP redirect
From: Jason Wang @ 2018-05-29 2:14 UTC (permalink / raw)
To: Toshiaki Makita, David S. Miller; +Cc: netdev
In-Reply-To: <1527503869-2412-1-git-send-email-makita.toshiaki@lab.ntt.co.jp>
On 2018年05月28日 18:37, Toshiaki Makita wrote:
> Calling XDP redirection requires bh disabled. Softirq can call another
> XDP function and redirection functions, then the percpu static variable
> ri->map can be overwritten to NULL.
>
> This is a generic XDP case called from tun.
>
> [ 3535.736058] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
> [ 3535.743974] PGD 0 P4D 0
> [ 3535.746530] Oops: 0000 [#1] SMP PTI
> [ 3535.750049] Modules linked in: vhost_net vhost tap tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sunrpc vfat fat ext4 mbcache jbd2 intel_rapl skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc ses aesni_intel crypto_simd cryptd enclosure hpwdt hpilo glue_helper ipmi_si pcspkr wmi mei_me ioatdma mei ipmi_devintf shpchp dca ipmi_msghandler lpc_ich acpi_power_meter sch_fq_codel ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm smartpqi i40e crc32c_intel scsi_transport_sas tg3 i2c_core ptp pps_core
> [ 3535.813456] CPU: 5 PID: 1630 Comm: vhost-1614 Not tainted 4.17.0-rc4 #2
> [ 3535.820127] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 11/14/2017
> [ 3535.828732] RIP: 0010:__xdp_map_lookup_elem+0x5/0x30
> [ 3535.833740] RSP: 0018:ffffb4bc47bf7c58 EFLAGS: 00010246
> [ 3535.839009] RAX: ffff9fdfcfea1c40 RBX: 0000000000000000 RCX: ffff9fdf27fe3100
> [ 3535.846205] RDX: ffff9fdfca769200 RSI: 0000000000000000 RDI: 0000000000000000
> [ 3535.853402] RBP: ffffb4bc491d9000 R08: 00000000000045ad R09: 0000000000000ec0
> [ 3535.860597] R10: 0000000000000001 R11: ffff9fdf26c3ce4e R12: ffff9fdf9e72c000
> [ 3535.867794] R13: 0000000000000000 R14: fffffffffffffff2 R15: ffff9fdfc82cdd00
> [ 3535.874990] FS: 0000000000000000(0000) GS:ffff9fdfcfe80000(0000) knlGS:0000000000000000
> [ 3535.883152] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 3535.888948] CR2: 0000000000000018 CR3: 0000000bde724004 CR4: 00000000007626e0
> [ 3535.896145] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 3535.903342] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 3535.910538] PKRU: 55555554
> [ 3535.913267] Call Trace:
> [ 3535.915736] xdp_do_generic_redirect+0x7a/0x310
> [ 3535.920310] do_xdp_generic.part.117+0x285/0x370
> [ 3535.924970] tun_get_user+0x5b9/0x1260 [tun]
> [ 3535.929279] tun_sendmsg+0x52/0x70 [tun]
> [ 3535.933237] handle_tx+0x2ad/0x5f0 [vhost_net]
> [ 3535.937721] vhost_worker+0xa5/0x100 [vhost]
> [ 3535.942030] kthread+0xf5/0x130
> [ 3535.945198] ? vhost_dev_ioctl+0x3b0/0x3b0 [vhost]
> [ 3535.950031] ? kthread_bind+0x10/0x10
> [ 3535.953727] ret_from_fork+0x35/0x40
> [ 3535.957334] Code: 0e 74 15 83 f8 10 75 05 e9 49 aa b3 ff f3 c3 0f 1f 80 00 00 00 00 f3 c3 e9 29 9d b3 ff 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <8b> 47 18 83 f8 0e 74 0d 83 f8 10 75 05 e9 49 a9 b3 ff 31 c0 c3
> [ 3535.976387] RIP: __xdp_map_lookup_elem+0x5/0x30 RSP: ffffb4bc47bf7c58
> [ 3535.982883] CR2: 0000000000000018
> [ 3535.987096] ---[ end trace 383b299dd1430240 ]---
> [ 3536.131325] Kernel panic - not syncing: Fatal exception
> [ 3536.137484] Kernel Offset: 0x26a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [ 3536.281406] ---[ end Kernel panic - not syncing: Fatal exception ]---
>
> And a kernel with generic case fixed still panics in tun driver XDP
> redirect, because it disabled only preemption, but not bh.
>
> [ 2055.128746] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
> [ 2055.136662] PGD 0 P4D 0
> [ 2055.139219] Oops: 0000 [#1] SMP PTI
> [ 2055.142736] Modules linked in: vhost_net vhost tap tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sunrpc vfat fat ext4 mbcache jbd2 intel_rapl skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc ses aesni_intel ipmi_ssif crypto_simd enclosure cryptd hpwdt glue_helper ioatdma hpilo wmi dca pcspkr ipmi_si acpi_power_meter ipmi_devintf shpchp mei_me ipmi_msghandler mei lpc_ich sch_fq_codel ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm i40e smartpqi tg3 scsi_transport_sas crc32c_intel i2c_core ptp pps_core
> [ 2055.206142] CPU: 6 PID: 1693 Comm: vhost-1683 Tainted: G W 4.17.0-rc5-fix-tun+ #1
> [ 2055.215011] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 11/14/2017
> [ 2055.223617] RIP: 0010:__xdp_map_lookup_elem+0x5/0x30
> [ 2055.228624] RSP: 0018:ffff998b07607cc0 EFLAGS: 00010246
> [ 2055.233892] RAX: ffff8dbd8e235700 RBX: ffff8dbd8ff21c40 RCX: 0000000000000004
> [ 2055.241089] RDX: ffff998b097a9000 RSI: 0000000000000000 RDI: 0000000000000000
> [ 2055.248286] RBP: 0000000000000000 R08: 00000000000065a8 R09: 0000000000005d80
> [ 2055.255483] R10: 0000000000000040 R11: ffff8dbcf0100000 R12: ffff998b097a9000
> [ 2055.262681] R13: ffff8dbd8c98c000 R14: 0000000000000000 R15: ffff998b07607d78
> [ 2055.269879] FS: 0000000000000000(0000) GS:ffff8dbd8ff00000(0000) knlGS:0000000000000000
> [ 2055.278039] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2055.283834] CR2: 0000000000000018 CR3: 0000000c0c8cc005 CR4: 00000000007626e0
> [ 2055.291030] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 2055.298227] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 2055.305424] PKRU: 55555554
> [ 2055.308153] Call Trace:
> [ 2055.310624] xdp_do_redirect+0x7b/0x380
> [ 2055.314499] tun_get_user+0x10fe/0x12a0 [tun]
> [ 2055.318895] tun_sendmsg+0x52/0x70 [tun]
> [ 2055.322852] handle_tx+0x2ad/0x5f0 [vhost_net]
> [ 2055.327337] vhost_worker+0xa5/0x100 [vhost]
> [ 2055.331646] kthread+0xf5/0x130
> [ 2055.334813] ? vhost_dev_ioctl+0x3b0/0x3b0 [vhost]
> [ 2055.339646] ? kthread_bind+0x10/0x10
> [ 2055.343343] ret_from_fork+0x35/0x40
> [ 2055.346950] Code: 0e 74 15 83 f8 10 75 05 e9 e9 aa b3 ff f3 c3 0f 1f 80 00 00 00 00 f3 c3 e9 c9 9d b3 ff 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <8b> 47 18 83 f8 0e 74 0d 83 f8 10 75 05 e9 e9 a9 b3 ff 31 c0 c3
> [ 2055.366004] RIP: __xdp_map_lookup_elem+0x5/0x30 RSP: ffff998b07607cc0
> [ 2055.372500] CR2: 0000000000000018
> [ 2055.375856] ---[ end trace 2a2dcc5e9e174268 ]---
> [ 2055.523626] Kernel panic - not syncing: Fatal exception
> [ 2055.529796] Kernel Offset: 0x2e000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [ 2055.677539] ---[ end Kernel panic - not syncing: Fatal exception ]---
>
> v2:
> - Removed preempt_disable/enable since local_bh_disable will prevent
> preemption as well, feedback from Jason Wang.
>
> Fixes: 761876c857cb ("tap: XDP support")
> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
> ---
> drivers/net/tun.c | 15 +++++++++------
> 1 file changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 45d8077..23e9eb6 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1650,7 +1650,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
> else
> *skb_xdp = 0;
>
> - preempt_disable();
> + local_bh_disable();
> rcu_read_lock();
> xdp_prog = rcu_dereference(tun->xdp_prog);
> if (xdp_prog && !*skb_xdp) {
> @@ -1675,7 +1675,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
> if (err)
> goto err_redirect;
> rcu_read_unlock();
> - preempt_enable();
> + local_bh_enable();
> return NULL;
> case XDP_TX:
> get_page(alloc_frag->page);
> @@ -1684,7 +1684,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
> goto err_redirect;
> tun_xdp_flush(tun->dev);
> rcu_read_unlock();
> - preempt_enable();
> + local_bh_enable();
> return NULL;
> case XDP_PASS:
> delta = orig_data - xdp.data;
> @@ -1703,7 +1703,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
> skb = build_skb(buf, buflen);
> if (!skb) {
> rcu_read_unlock();
> - preempt_enable();
> + local_bh_enable();
> return ERR_PTR(-ENOMEM);
> }
>
> @@ -1713,7 +1713,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
> alloc_frag->offset += buflen;
>
> rcu_read_unlock();
> - preempt_enable();
> + local_bh_enable();
>
> return skb;
>
> @@ -1721,7 +1721,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
> put_page(alloc_frag->page);
> err_xdp:
> rcu_read_unlock();
> - preempt_enable();
> + local_bh_enable();
> this_cpu_inc(tun->pcpu_stats->rx_dropped);
> return NULL;
> }
> @@ -1917,16 +1917,19 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
> struct bpf_prog *xdp_prog;
> int ret;
>
> + local_bh_disable();
> rcu_read_lock();
> xdp_prog = rcu_dereference(tun->xdp_prog);
> if (xdp_prog) {
> ret = do_xdp_generic(xdp_prog, skb);
> if (ret != XDP_PASS) {
> rcu_read_unlock();
> + local_bh_enable();
> return total_len;
> }
> }
> rcu_read_unlock();
> + local_bh_enable();
> }
>
> rcu_read_lock();
Acked-by: Jason Wang <jasowang@redhat.com>
Thanks.
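For readers following along, the race described in the commit message can be modelled in a few lines of userspace C. This is a toy, single-threaded sketch under stated assumptions: demo_ri_map stands in for the per-CPU ri->map, the "softirq" is just a function that clears it, and none of the demo_* names are kernel API. It does not model preemption.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model; every name here is hypothetical, not kernel API. */
struct demo_map { int id; };

static struct demo_map *demo_ri_map;	/* stands in for the per-CPU ri->map */
static int demo_bh_disabled;		/* nesting depth of "bh disabled" */
static bool demo_softirq_pending;

/* A "softirq" that runs its own redirect and leaves the map cleared. */
static void demo_run_softirq(void)
{
	demo_ri_map = NULL;
	demo_softirq_pending = false;
}

/* Point where an interrupt could arrive: softirq runs only if bh is enabled. */
static void demo_interrupt_point(void)
{
	if (demo_softirq_pending && demo_bh_disabled == 0)
		demo_run_softirq();
}

static void demo_bh_disable(void)
{
	demo_bh_disabled++;
}

static void demo_bh_enable(void)
{
	/* Deferred work runs once the outermost section ends. */
	if (--demo_bh_disabled == 0 && demo_softirq_pending)
		demo_run_softirq();
}

/* Returns true if the map chosen by the "program" survives until the lookup. */
bool demo_redirect(struct demo_map *map, bool protect)
{
	bool ok;

	demo_softirq_pending = true;	/* arm a softirq to fire mid-operation */
	if (protect)
		demo_bh_disable();

	demo_ri_map = map;		/* program selects the redirect target */
	demo_interrupt_point();		/* softirq may run here if unprotected */
	ok = (demo_ri_map == map);	/* redirect step reads the shared state */

	if (protect)
		demo_bh_enable();
	return ok;
}
```

Calling demo_redirect(&m, false) models the preempt_disable()-only case, where the softirq clears the map before the lookup; demo_redirect(&m, true) models the local_bh_disable() case, where the softirq is deferred until after the lookup.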
^ permalink raw reply
* [RFC V5 PATCH 7/8] vhost: packed ring support
From: Jason Wang @ 2018-05-29 2:10 UTC (permalink / raw)
To: mst, jasowang
Cc: kvm, virtualization, netdev, linux-kernel, jfreimann, wexu,
tiwei.bie
In-Reply-To: <1527559830-8133-1-git-send-email-jasowang@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
drivers/vhost/net.c | 13 +-
drivers/vhost/vhost.c | 585 ++++++++++++++++++++++++++++++++++++++++++++++----
drivers/vhost/vhost.h | 13 +-
3 files changed, 566 insertions(+), 45 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 30273ad..4991aa4 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -71,7 +71,8 @@ enum {
VHOST_NET_FEATURES = VHOST_FEATURES |
(1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
(1ULL << VIRTIO_NET_F_MRG_RXBUF) |
- (1ULL << VIRTIO_F_IOMMU_PLATFORM)
+ (1ULL << VIRTIO_F_IOMMU_PLATFORM) |
+ (1ULL << VIRTIO_F_RING_PACKED)
};
enum {
@@ -576,7 +577,7 @@ static void handle_tx(struct vhost_net *net)
nvq->upend_idx = ((unsigned)nvq->upend_idx - 1)
% UIO_MAXIOV;
}
- vhost_discard_vq_desc(vq, 1);
+ vhost_discard_vq_desc(vq, &used, 1);
vhost_net_enable_vq(net, vq);
break;
}
@@ -714,9 +715,11 @@ static void handle_rx(struct vhost_net *net)
mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);
while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk))) {
+ struct vhost_used_elem *used = vq->heads + nheads;
+
sock_len += sock_hlen;
vhost_len = sock_len + vhost_hlen;
- err = vhost_get_bufs(vq, vq->heads + nheads, vhost_len,
+ err = vhost_get_bufs(vq, used, vhost_len,
&in, vq_log, &log,
likely(mergeable) ? UIO_MAXIOV : 1,
&headcount);
@@ -762,7 +765,7 @@ static void handle_rx(struct vhost_net *net)
if (unlikely(err != sock_len)) {
pr_debug("Discarded rx packet: "
" len %d, expected %zd\n", err, sock_len);
- vhost_discard_vq_desc(vq, headcount);
+ vhost_discard_vq_desc(vq, used, 1);
continue;
}
/* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
@@ -786,7 +789,7 @@ static void handle_rx(struct vhost_net *net)
copy_to_iter(&num_buffers, sizeof num_buffers,
&fixup) != sizeof num_buffers) {
vq_err(vq, "Failed num_buffers write");
- vhost_discard_vq_desc(vq, headcount);
+ vhost_discard_vq_desc(vq, used, 1);
goto out;
}
nheads += headcount;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 4031a8f..a36e5ad2 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -323,6 +323,9 @@ static void vhost_vq_reset(struct vhost_dev *dev,
vhost_reset_is_le(vq);
vhost_disable_cross_endian(vq);
vq->busyloop_timeout = 0;
+ vq->used_wrap_counter = true;
+ vq->last_avail_wrap_counter = true;
+ vq->avail_wrap_counter = true;
vq->umem = NULL;
vq->iotlb = NULL;
__vhost_vq_meta_reset(vq);
@@ -1103,11 +1106,22 @@ static int vhost_iotlb_miss(struct vhost_virtqueue *vq, u64 iova, int access)
return 0;
}
-static bool vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
- struct vring_desc __user *desc,
- struct vring_avail __user *avail,
- struct vring_used __user *used)
+static int vq_access_ok_packed(struct vhost_virtqueue *vq, unsigned int num,
+ struct vring_desc __user *desc,
+ struct vring_avail __user *avail,
+ struct vring_used __user *used)
+{
+ struct vring_desc_packed *packed = (struct vring_desc_packed *)desc;
+
+ /* FIXME: check device area and driver area */
+ return access_ok(VERIFY_READ, packed, num * sizeof(*packed)) &&
+ access_ok(VERIFY_WRITE, packed, num * sizeof(*packed));
+}
+static int vq_access_ok_split(struct vhost_virtqueue *vq, unsigned int num,
+ struct vring_desc __user *desc,
+ struct vring_avail __user *avail,
+ struct vring_used __user *used)
{
size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
@@ -1118,6 +1132,17 @@ static bool vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
sizeof *used + num * sizeof *used->ring + s);
}
+static int vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
+ struct vring_desc __user *desc,
+ struct vring_avail __user *avail,
+ struct vring_used __user *used)
+{
+ if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
+ return vq_access_ok_packed(vq, num, desc, avail, used);
+ else
+ return vq_access_ok_split(vq, num, desc, avail, used);
+}
+
static void vhost_vq_meta_update(struct vhost_virtqueue *vq,
const struct vhost_umem_node *node,
int type)
@@ -1361,6 +1386,10 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
break;
}
vq->last_avail_idx = s.num;
+ if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED)) {
+ vq->last_avail_wrap_counter = s.num >> 31;
+ vq->avail_wrap_counter = vq->last_avail_wrap_counter;
+ }
/* Forget the cached index value. */
vq->avail_idx = vq->last_avail_idx;
break;
@@ -1369,6 +1398,8 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
s.num = vq->last_avail_idx;
+ if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
+ s.num |= vq->last_avail_wrap_counter << 31;
if (copy_to_user(argp, &s, sizeof s))
r = -EFAULT;
break;
case VHOST_SET_VRING_ADDR:
if (copy_from_user(&a, argp, sizeof a)) {
@@ -1730,6 +1761,9 @@ int vhost_vq_init_access(struct vhost_virtqueue *vq)
vhost_init_is_le(vq);
+ if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
+ return 0;
+
r = vhost_update_used_flags(vq);
if (r)
goto err;
@@ -1803,7 +1837,8 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
/* Each buffer in the virtqueues is actually a chain of descriptors. This
* function returns the next descriptor in the chain,
* or -1U if we're at the end. */
-static unsigned next_desc(struct vhost_virtqueue *vq, struct vring_desc *desc)
+static unsigned next_desc_split(struct vhost_virtqueue *vq,
+ struct vring_desc *desc)
{
unsigned int next;
@@ -1816,11 +1851,17 @@ static unsigned next_desc(struct vhost_virtqueue *vq, struct vring_desc *desc)
return next;
}
-static int get_indirect(struct vhost_virtqueue *vq,
- struct iovec iov[], unsigned int iov_size,
- unsigned int *out_num, unsigned int *in_num,
- struct vhost_log *log, unsigned int *log_num,
- struct vring_desc *indirect)
+static unsigned next_desc_packed(struct vhost_virtqueue *vq,
+ struct vring_desc_packed *desc)
+{
+ return desc->flags & cpu_to_vhost16(vq, VRING_DESC_F_NEXT);
+}
+
+static int get_indirect_split(struct vhost_virtqueue *vq,
+ struct iovec iov[], unsigned int iov_size,
+ unsigned int *out_num, unsigned int *in_num,
+ struct vhost_log *log, unsigned int *log_num,
+ struct vring_desc *indirect)
{
struct vring_desc desc;
unsigned int i = 0, count, found = 0;
@@ -1910,23 +1951,301 @@ static int get_indirect(struct vhost_virtqueue *vq,
}
*out_num += ret;
}
- } while ((i = next_desc(vq, &desc)) != -1);
+ } while ((i = next_desc_split(vq, &desc)) != -1);
return 0;
}
-/* This looks in the virtqueue and for the first available buffer, and converts
- * it to an iovec for convenient access. Since descriptors consist of some
- * number of output then some number of input descriptors, it's actually two
- * iovecs, but we pack them into one and note how many of each there were.
- *
- * This function returns the descriptor number found, or vq->num (which is
- * never a valid descriptor number) if none was found. A negative code is
- * returned on error. */
-int vhost_get_vq_desc(struct vhost_virtqueue *vq,
- struct vhost_used_elem *used,
- struct iovec iov[], unsigned int iov_size,
- unsigned int *out_num, unsigned int *in_num,
- struct vhost_log *log, unsigned int *log_num)
+static int get_indirect_packed(struct vhost_virtqueue *vq,
+ struct iovec iov[], unsigned int iov_size,
+ unsigned int *out_num, unsigned int *in_num,
+ struct vhost_log *log, unsigned int *log_num,
+ struct vring_desc_packed *indirect)
+{
+ struct vring_desc_packed desc;
+ unsigned int i = 0, count, found = 0;
+ u32 len = vhost32_to_cpu(vq, indirect->len);
+ struct iov_iter from;
+ int ret, access;
+
+ /* Sanity check */
+ if (unlikely(len % sizeof(desc))) {
+ vq_err(vq, "Invalid length in indirect descriptor: "
+ "len 0x%llx not multiple of 0x%zx\n",
+ (unsigned long long)len,
+ sizeof desc);
+ return -EINVAL;
+ }
+
+ ret = translate_desc(vq, vhost64_to_cpu(vq, indirect->addr),
+ len, vq->indirect,
+ UIO_MAXIOV, VHOST_ACCESS_RO);
+ if (unlikely(ret < 0)) {
+ if (ret != -EAGAIN)
+ vq_err(vq, "Translation failure %d in indirect.\n",
+ ret);
+ return ret;
+ }
+ iov_iter_init(&from, READ, vq->indirect, ret, len);
+
+ /* We will use the result as an address to read from, so most
+ * architectures only need a compiler barrier here. */
+ read_barrier_depends();
+
+ count = len / sizeof desc;
+ /* Buffers are chained via a 16 bit next field, so
+ * we can have at most 2^16 of these. */
+ if (unlikely(count > USHRT_MAX + 1)) {
+ vq_err(vq, "Indirect buffer length too big: %d\n",
+ indirect->len);
+ return -E2BIG;
+ }
+
+ do {
+ unsigned iov_count = *in_num + *out_num;
+ if (unlikely(++found > count)) {
+ vq_err(vq, "Loop detected: last one at %u "
+ "indirect size %u\n",
+ i, count);
+ return -EINVAL;
+ }
+ if (unlikely(!copy_from_iter_full(&desc, sizeof(desc),
+ &from))) {
+ vq_err(vq, "Failed indirect descriptor: idx %d, %zx\n",
+ i, (size_t)vhost64_to_cpu(vq, indirect->addr)
+ + i * sizeof desc);
+ return -EINVAL;
+ }
+ if (unlikely(desc.flags &
+ cpu_to_vhost16(vq, VRING_DESC_F_INDIRECT))) {
+ vq_err(vq, "Nested indirect descriptor: idx %d, %zx\n",
+ i, (size_t)vhost64_to_cpu(vq, indirect->addr)
+ + i * sizeof desc);
+ return -EINVAL;
+ }
+
+ if (desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_WRITE))
+ access = VHOST_ACCESS_WO;
+ else
+ access = VHOST_ACCESS_RO;
+
+ ret = translate_desc(vq, vhost64_to_cpu(vq, desc.addr),
+ vhost32_to_cpu(vq, desc.len),
+ iov + iov_count,
+ iov_size - iov_count, access);
+ if (unlikely(ret < 0)) {
+ if (ret != -EAGAIN)
+ vq_err(vq, "Translation failure %d "
+ "indirect idx %d\n",
+ ret, i);
+ return ret;
+ }
+ /* If this is an input descriptor, increment that count. */
+ if (access == VHOST_ACCESS_WO) {
+ *in_num += ret;
+ if (unlikely(log)) {
+ log[*log_num].addr =
+ vhost64_to_cpu(vq, desc.addr);
+ log[*log_num].len =
+ vhost32_to_cpu(vq, desc.len);
+ ++*log_num;
+ }
+ } else {
+ /* If it's an output descriptor, they're all supposed
+ * to come before any input descriptors. */
+ if (unlikely(*in_num)) {
+ vq_err(vq, "Indirect descriptor "
+ "has out after in: idx %d\n", i);
+ return -EINVAL;
+ }
+ *out_num += ret;
+ }
+ i++;
+ } while (next_desc_packed(vq, &desc));
+ return 0;
+}
+
+#define DESC_AVAIL (1 << VRING_DESC_F_AVAIL)
+#define DESC_USED (1 << VRING_DESC_F_USED)
+static bool desc_is_avail(struct vhost_virtqueue *vq, bool wrap_counter,
+ __virtio16 flags)
+{
+ bool avail = flags & cpu_to_vhost16(vq, DESC_AVAIL);
+
+ return avail == wrap_counter;
+}
+
+static __virtio16 get_desc_flags(struct vhost_virtqueue *vq, bool write)
+{
+ __virtio16 flags = 0;
+
+ if (vq->used_wrap_counter) {
+ flags |= cpu_to_vhost16(vq, DESC_AVAIL);
+ flags |= cpu_to_vhost16(vq, DESC_USED);
+ } else {
+ flags &= ~cpu_to_vhost16(vq, DESC_AVAIL);
+ flags &= ~cpu_to_vhost16(vq, DESC_USED);
+ }
+
+ if (write)
+ flags |= cpu_to_vhost16(vq, VRING_DESC_F_WRITE);
+
+ return flags;
+}
+
+static bool vhost_vring_packed_need_event(struct vhost_virtqueue *vq,
+ bool wrap, __u16 off_wrap, __u16 new,
+ __u16 old)
+{
+ int off = off_wrap & ~(1 << 15);
+
+ if (new < old) {
+ new += vq->num;
+ wrap ^= 1;
+ }
+
+ if (wrap != off_wrap >> 15)
+ off += vq->num;
+
+ return vring_need_event(off, new, old);
+}
+
+static int vhost_get_vq_desc_packed(struct vhost_virtqueue *vq,
+ struct vhost_used_elem *used,
+ struct iovec iov[], unsigned int iov_size,
+ unsigned int *out_num, unsigned int *in_num,
+ struct vhost_log *log,
+ unsigned int *log_num)
+{
+ struct vring_desc_packed desc;
+ int ret, access, i;
+ u16 last_avail_idx = vq->last_avail_idx;
+ u16 off_wrap = vq->avail_idx | (vq->avail_wrap_counter << 15);
+
+ /* When we start there are none of either input nor output. */
+ *out_num = *in_num = 0;
+ if (unlikely(log))
+ *log_num = 0;
+
+ used->count = 0;
+
+ do {
+ struct vring_desc_packed *d = vq->desc_packed +
+ vq->last_avail_idx;
+ unsigned int iov_count = *in_num + *out_num;
+
+ ret = vhost_get_user(vq, desc.flags, &d->flags,
+ VHOST_ADDR_DESC);
+ if (unlikely(ret)) {
+ vq_err(vq, "Failed to get flags: idx %d addr %p\n",
+ vq->last_avail_idx, &d->flags);
+ return -EFAULT;
+ }
+
+ if (!desc_is_avail(vq, vq->last_avail_wrap_counter, desc.flags)) {
+ /* If there's nothing new since last we looked, return
+ * invalid.
+ */
+ if (!used->count)
+ return -ENOSPC;
+ vq_err(vq, "Unexpected unavail descriptor: idx %d\n",
+ vq->last_avail_idx);
+ return -EFAULT;
+ }
+
+ /* Read desc content after we're sure it was available. */
+ smp_rmb();
+
+ ret = vhost_copy_from_user(vq, &desc, d, sizeof(desc));
+ if (unlikely(ret)) {
+ vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
+ vq->last_avail_idx, d);
+ return -EFAULT;
+ }
+
+ used->elem.id = desc.id;
+
+ if (desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_INDIRECT)) {
+ ret = get_indirect_packed(vq, iov, iov_size,
+ out_num, in_num, log,
+ log_num, &desc);
+ if (unlikely(ret < 0)) {
+ if (ret != -EAGAIN)
+ vq_err(vq, "Failure detected "
+ "in indirect descriptor "
+ "at idx %d\n", i);
+ return ret;
+ }
+ goto next;
+ }
+
+ if (desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_WRITE))
+ access = VHOST_ACCESS_WO;
+ else
+ access = VHOST_ACCESS_RO;
+ ret = translate_desc(vq, vhost64_to_cpu(vq, desc.addr),
+ vhost32_to_cpu(vq, desc.len),
+ iov + iov_count, iov_size - iov_count,
+ access);
+ if (unlikely(ret < 0)) {
+ if (ret != -EAGAIN)
+ vq_err(vq, "Translation failure %d idx %d\n",
+ ret, i);
+ return ret;
+ }
+
+ if (access == VHOST_ACCESS_WO) {
+ /* If this is an input descriptor,
+ * increment that count.
+ */
+ *in_num += ret;
+ if (unlikely(log)) {
+ log[*log_num].addr =
+ vhost64_to_cpu(vq, desc.addr);
+ log[*log_num].len =
+ vhost32_to_cpu(vq, desc.len);
+ ++*log_num;
+ }
+ } else {
+ /* If it's an output descriptor, they're all supposed
+ * to come before any input descriptors.
+ */
+ if (unlikely(*in_num)) {
+ vq_err(vq, "Desc out after in: idx %d\n",
+ i);
+ return -EINVAL;
+ }
+ *out_num += ret;
+ }
+
+next:
+ if (unlikely(++used->count > vq->num)) {
+ vq_err(vq, "Loop detected: last one at %u "
+ "vq size %u head %u\n",
+ i, vq->num, used->elem.id);
+ return -EINVAL;
+ }
+ if (++vq->last_avail_idx >= vq->num) {
+ vq->last_avail_idx = 0;
+ vq->last_avail_wrap_counter ^= 1;
+ }
+ /* If this descriptor says it doesn't chain, we're done. */
+ } while (next_desc_packed(vq, &desc));
+
+ if (vhost_vring_packed_need_event(vq, vq->last_avail_wrap_counter,
+ off_wrap, vq->last_avail_idx,
+ last_avail_idx)) {
+ vq->avail_idx = vq->last_avail_idx;
+ vq->avail_wrap_counter = vq->last_avail_wrap_counter;
+ }
+
+ return 0;
+}
+
+static int vhost_get_vq_desc_split(struct vhost_virtqueue *vq,
+ struct vhost_used_elem *used,
+ struct iovec iov[], unsigned int iov_size,
+ unsigned int *out_num, unsigned int *in_num,
+ struct vhost_log *log, unsigned int *log_num)
{
struct vring_desc desc;
unsigned int i, head, found = 0;
@@ -2011,9 +2330,9 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
return -EFAULT;
}
if (desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_INDIRECT)) {
- ret = get_indirect(vq, iov, iov_size,
- out_num, in_num,
- log, log_num, &desc);
+ ret = get_indirect_split(vq, iov, iov_size,
+ out_num, in_num,
+ log, log_num, &desc);
if (unlikely(ret < 0)) {
if (ret != -EAGAIN)
vq_err(vq, "Failure detected "
@@ -2055,7 +2374,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
}
*out_num += ret;
}
- } while ((i = next_desc(vq, &desc)) != -1);
+ } while ((i = next_desc_split(vq, &desc)) != -1);
/* On success, increment avail index. */
vq->last_avail_idx++;
@@ -2065,6 +2384,31 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
BUG_ON(!(vq->used_flags & VRING_USED_F_NO_NOTIFY));
return 0;
}
+
+/* This looks in the virtqueue and for the first available buffer, and converts
+ * it to an iovec for convenient access. Since descriptors consist of some
+ * number of output then some number of input descriptors, it's actually two
+ * iovecs, but we pack them into one and note how many of each there were.
+ *
+ * This function returns the descriptor number found, or vq->num (which is
+ * never a valid descriptor number) if none was found. A negative code is
+ * returned on error.
+ */
+int vhost_get_vq_desc(struct vhost_virtqueue *vq,
+ struct vhost_used_elem *used,
+ struct iovec iov[], unsigned int iov_size,
+ unsigned int *out_num, unsigned int *in_num,
+ struct vhost_log *log, unsigned int *log_num)
+{
+ if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
+ return vhost_get_vq_desc_packed(vq, used, iov, iov_size,
+ out_num, in_num,
+ log, log_num);
+ else
+ return vhost_get_vq_desc_split(vq, used, iov, iov_size,
+ out_num, in_num,
+ log, log_num);
+}
EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
void vhost_set_used_len(struct vhost_virtqueue *vq,
@@ -2151,15 +2495,30 @@ int vhost_get_bufs(struct vhost_virtqueue *vq,
*count = headcount;
return 0;
err:
- vhost_discard_vq_desc(vq, headcount);
+ vhost_discard_vq_desc(vq, heads, headcount);
return r;
}
EXPORT_SYMBOL_GPL(vhost_get_bufs);
/* Reverse the effect of vhost_get_vq_desc. Useful for error handling. */
-void vhost_discard_vq_desc(struct vhost_virtqueue *vq, int n)
+void vhost_discard_vq_desc(struct vhost_virtqueue *vq,
+ struct vhost_used_elem *heads,
+ int headcount)
{
- vq->last_avail_idx -= n;
+ int i;
+
+ if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED)) {
+ for (i = 0; i < headcount; i++) {
+ vq->last_avail_idx -= heads[i].count;
+ if (vq->last_avail_idx >= vq->num) {
+ vq->last_avail_wrap_counter ^= 1;
+ vq->last_avail_idx += vq->num;
+ }
+ }
+ } else {
+ vq->last_avail_idx -= headcount;
+ }
+
}
EXPORT_SYMBOL_GPL(vhost_discard_vq_desc);
@@ -2215,10 +2574,69 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
return 0;
}
+static int vhost_add_used_n_packed(struct vhost_virtqueue *vq,
+ struct vhost_used_elem *heads,
+ unsigned int count)
+{
+ struct vring_desc_packed __user *desc;
+ int i, ret;
+
+ for (i = 0; i < count; i++) {
+ desc = vq->desc_packed + vq->last_used_idx;
+
+ ret = vhost_put_user(vq, heads[i].elem.id, &desc->id,
+ VHOST_ADDR_DESC);
+ if (unlikely(ret)) {
+ vq_err(vq, "Failed to update id: idx %d addr %p\n",
+ vq->last_used_idx, desc);
+ return -EFAULT;
+ }
+ ret = vhost_put_user(vq, heads[i].elem.len, &desc->len,
+ VHOST_ADDR_DESC);
+ if (unlikely(ret)) {
+ vq_err(vq, "Failed to update len: idx %d addr %p\n",
+ vq->last_used_idx, desc);
+ return -EFAULT;
+ }
+
+ /* Update flags after descriptor id and len are written.
+ * TODO: Update head flags last to save barriers. */
+ smp_wmb();
+
+ ret = vhost_put_user(vq, get_desc_flags(vq, heads[i].elem.len),
+ &desc->flags, VHOST_ADDR_DESC);
+ if (unlikely(ret)) {
+ vq_err(vq, "Failed to update flags: idx %d addr %p\n",
+ vq->last_used_idx, desc);
+ return -EFAULT;
+ }
+
+ if (unlikely(vq->log_used)) {
+ /* Make sure desc is written before update log. */
+ smp_wmb();
+ log_write(vq->log_base, vq->log_addr +
+ vq->last_used_idx * sizeof(*desc),
+ sizeof(*desc));
+ if (vq->log_ctx)
+ eventfd_signal(vq->log_ctx, 1);
+ }
+
+ vq->last_used_idx += heads[i].count;
+ if (vq->last_used_idx >= vq->num) {
+ vq->used_wrap_counter ^= 1;
+ vq->last_used_idx -= vq->num;
+ }
+ }
+
+ return 0;
+}
+
/* After we've used one of their buffers, we tell them about it. We'll then
* want to notify the guest, using eventfd. */
-int vhost_add_used_n(struct vhost_virtqueue *vq, struct vhost_used_elem *heads,
- unsigned count)
+static int vhost_add_used_n_split(struct vhost_virtqueue *vq,
+ struct vhost_used_elem *heads,
+ unsigned count)
+
{
int start, n, r;
@@ -2250,6 +2668,19 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vhost_used_elem *heads,
}
return r;
}
+
+/* After we've used one of their buffers, we tell them about it. We'll then
+ * want to notify the guest, using eventfd.
+ */
+int vhost_add_used_n(struct vhost_virtqueue *vq,
+ struct vhost_used_elem *heads,
+ unsigned int count)
+{
+ if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
+ return vhost_add_used_n_packed(vq, heads, count);
+ else
+ return vhost_add_used_n_split(vq, heads, count);
+}
EXPORT_SYMBOL_GPL(vhost_add_used_n);
static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
@@ -2257,6 +2688,11 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
__u16 old, new;
__virtio16 event;
bool v;
+
+ /* FIXME: check driver area */
+ if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
+ return true;
+
/* Flush out used index updates. This is paired
* with the barrier that the Guest executes when enabling
* interrupts. */
@@ -2319,7 +2755,8 @@ void vhost_add_used_and_signal_n(struct vhost_dev *dev,
EXPORT_SYMBOL_GPL(vhost_add_used_and_signal_n);
/* return true if we're sure that avaiable ring is empty */
-bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
+static bool vhost_vq_avail_empty_split(struct vhost_dev *dev,
+ struct vhost_virtqueue *vq)
{
__virtio16 avail_idx;
int r;
@@ -2334,10 +2771,58 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
return vq->avail_idx == vq->last_avail_idx;
}
+
+static bool vhost_vq_avail_empty_packed(struct vhost_dev *dev,
+ struct vhost_virtqueue *vq)
+{
+ struct vring_desc_packed *d = vq->desc_packed + vq->avail_idx;
+ __virtio16 flags;
+ int ret;
+
+ ret = vhost_get_user(vq, flags, &d->flags, VHOST_ADDR_DESC);
+ if (unlikely(ret)) {
+ vq_err(vq, "Failed to get flags: idx %d addr %p\n",
+ vq->last_avail_idx, d);
+ return -EFAULT;
+ }
+
+ return !desc_is_avail(vq, vq->avail_wrap_counter, flags);
+}
+
+bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
+{
+ if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
+ return vhost_vq_avail_empty_packed(dev, vq);
+ else
+ return vhost_vq_avail_empty_split(dev, vq);
+}
EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);
-/* OK, now we need to know about added descriptors. */
-bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
+static bool vhost_enable_notify_packed(struct vhost_dev *dev,
+ struct vhost_virtqueue *vq)
+{
+ struct vring_desc_packed *d = vq->desc_packed + vq->avail_idx;
+ __virtio16 flags;
+ int ret;
+
+ /* FIXME: disable notification through device area */
+
+ /* They could have slipped one in as we were doing that: make
+ * sure it's written, then check again. */
+ smp_mb();
+
+ ret = vhost_get_user(vq, flags, &d->flags, VHOST_ADDR_DESC);
+ if (unlikely(ret)) {
+ vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
+ vq->last_avail_idx, &d->flags);
+ return -EFAULT;
+ }
+
+ return desc_is_avail(vq, vq->avail_wrap_counter, flags);
+}
+
+static bool vhost_enable_notify_split(struct vhost_dev *dev,
+ struct vhost_virtqueue *vq)
{
__virtio16 avail_idx;
int r;
@@ -2372,10 +2857,25 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
return vhost16_to_cpu(vq, avail_idx) != vq->avail_idx;
}
+
+/* OK, now we need to know about added descriptors. */
+bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
+{
+ if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
+ return vhost_enable_notify_packed(dev, vq);
+ else
+ return vhost_enable_notify_split(dev, vq);
+}
EXPORT_SYMBOL_GPL(vhost_enable_notify);
-/* We don't need to be notified again. */
-void vhost_disable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
+static void vhost_disable_notify_packed(struct vhost_dev *dev,
+ struct vhost_virtqueue *vq)
+{
+ /* FIXME: disable notification through device area */
+}
+
+static void vhost_disable_notify_split(struct vhost_dev *dev,
+ struct vhost_virtqueue *vq)
{
int r;
@@ -2389,6 +2889,15 @@ void vhost_disable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
&vq->used->flags, r);
}
}
+
+/* We don't need to be notified again. */
+void vhost_disable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
+{
+ if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
+ return vhost_disable_notify_packed(dev, vq);
+ else
+ return vhost_disable_notify_split(dev, vq);
+}
EXPORT_SYMBOL_GPL(vhost_disable_notify);
/* Create a new message. */
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 604821b..7543a46 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -36,6 +36,7 @@ struct vhost_poll {
struct vhost_used_elem {
struct vring_used_elem elem;
+ int count;
};
void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
@@ -91,7 +92,10 @@ struct vhost_virtqueue {
/* The actual ring of buffers. */
struct mutex mutex;
unsigned int num;
- struct vring_desc __user *desc;
+ union {
+ struct vring_desc __user *desc;
+ struct vring_desc_packed __user *desc_packed;
+ };
struct vring_avail __user *avail;
struct vring_used __user *used;
const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
@@ -148,6 +152,9 @@ struct vhost_virtqueue {
bool user_be;
#endif
u32 busyloop_timeout;
+ bool used_wrap_counter;
+ bool avail_wrap_counter;
+ bool last_avail_wrap_counter;
};
struct vhost_msg_node {
@@ -203,7 +210,9 @@ void vhost_set_used_len(struct vhost_virtqueue *vq,
int len);
int vhost_get_used_len(struct vhost_virtqueue *vq,
struct vhost_used_elem *used);
-void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
+void vhost_discard_vq_desc(struct vhost_virtqueue *,
+ struct vhost_used_elem *,
+ int n);
int vhost_vq_init_access(struct vhost_virtqueue *);
int vhost_add_used(struct vhost_virtqueue *vq,
--
2.7.4
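A minimal user-space sketch of the wrap counter bookkeeping in the patch
above, assuming host-endian flags and a ring size of at most 32768. All
sketch_* names are hypothetical and not part of vhost; the bit positions
are the ones defined in patch 6/8. It models the AVAIL-bit test done by
desc_is_avail(), the per-descriptor advance at the "next:" label, and the
rewind done by vhost_discard_vq_desc(), which relies on 16-bit unsigned
wrap-around when the index underflows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions from patch 6/8 of this series. */
#define VRING_DESC_F_AVAIL 7
#define VRING_DESC_F_USED 15
#define DESC_AVAIL (1u << VRING_DESC_F_AVAIL)
#define DESC_USED (1u << VRING_DESC_F_USED)

/* Hypothetical host-endian model of the device-side ring position. */
struct sketch_ring {
	uint16_t num;            /* ring size */
	uint16_t last_avail_idx; /* next descriptor slot to examine */
	bool last_avail_wrap;    /* wrap counter paired with last_avail_idx */
};

/* Same test as desc_is_avail(): the AVAIL bit must equal the wrap counter. */
static bool sketch_desc_is_avail(bool wrap_counter, uint16_t flags)
{
	bool avail = (flags & DESC_AVAIL) != 0;

	return avail == wrap_counter;
}

/* Consume one descriptor; flip the wrap counter when the index wraps. */
static void sketch_advance(struct sketch_ring *r)
{
	assert(r->num > 0 && r->num <= 32768);
	if (++r->last_avail_idx >= r->num) {
		r->last_avail_idx = 0;
		r->last_avail_wrap ^= 1;
	}
}

/*
 * Give back 'count' descriptors. If the subtraction underflows, the
 * uint16_t result is >= num, so the wrap counter is flipped back and
 * adding num wraps the index into range again.
 */
static void sketch_rewind(struct sketch_ring *r, uint16_t count)
{
	assert(r->num > 0 && r->num <= 32768);
	assert(count <= r->num);
	r->last_avail_idx -= count;
	if (r->last_avail_idx >= r->num) {
		r->last_avail_wrap ^= 1;
		r->last_avail_idx += r->num;
	}
}
```

With num = 256, starting at index 254 with the wrap counter set, four
calls to sketch_advance() leave the index at 2 with the counter cleared,
and sketch_rewind(&r, 4) restores index 254 with the counter set.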
* [RFC V5 PATCH 8/8] vhost: event suppression for packed ring
From: Jason Wang @ 2018-05-29 2:10 UTC (permalink / raw)
To: mst, jasowang
Cc: kvm, virtualization, netdev, linux-kernel, jfreimann, wexu,
tiwei.bie
In-Reply-To: <1527559830-8133-1-git-send-email-jasowang@redhat.com>
This patch introduces basic support for event suppression on the packed
ring, i.e. the driver and device event suppression areas.
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
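For readers new to the two event suppression areas, here is a rough
user-space sketch of what the device writes into its own area when it
enables or disables notifications. It assumes host-endian fields; the
sketch_* names and the event_idx parameter (standing in for the
VIRTIO_RING_F_EVENT_IDX feature check) are hypothetical. The flag values
and the off_wrap encoding (offset in bits 0-14, wrap counter in bit 15)
are the ones this patch uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values as added to virtio_ring.h by this patch. */
#define RING_EVENT_FLAGS_ENABLE 0x0
#define RING_EVENT_FLAGS_DISABLE 0x1
#define RING_EVENT_FLAGS_DESC 0x2

/* Hypothetical host-endian mirror of struct vring_packed_desc_event. */
struct sketch_desc_event {
	uint16_t off_wrap; /* bits 0-14: ring offset, bit 15: wrap counter */
	uint16_t flags;
};

/* Combine a ring offset and a wrap counter into one 16-bit field. */
static uint16_t sketch_pack_off_wrap(uint16_t idx, bool wrap)
{
	assert(idx < (1u << 15));
	return (uint16_t)(idx | ((uint16_t)wrap << 15));
}

/* Model of the device-area writes in vhost_enable_notify_packed(). */
static void sketch_enable_notify(struct sketch_desc_event *ev, bool event_idx,
				 uint16_t avail_idx, bool avail_wrap)
{
	if (event_idx) {
		ev->off_wrap = sketch_pack_off_wrap(avail_idx, avail_wrap);
		/* The patch issues smp_wmb() here so off_wrap lands first. */
		ev->flags = RING_EVENT_FLAGS_DESC;
	} else {
		ev->flags = RING_EVENT_FLAGS_ENABLE;
	}
}

/* Model of the device-area write in vhost_disable_notify_packed(). */
static void sketch_disable_notify(struct sketch_desc_event *ev)
{
	ev->flags = RING_EVENT_FLAGS_DISABLE;
}
```

For example, enabling with event_idx set, avail_idx = 5 and the wrap
counter set stores off_wrap = 0x8005 and flags = RING_EVENT_FLAGS_DESC.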
drivers/vhost/vhost.c | 191 ++++++++++++++++++++++++++++++++++++---
drivers/vhost/vhost.h | 10 +-
include/uapi/linux/virtio_ring.h | 19 ++++
3 files changed, 204 insertions(+), 16 deletions(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a36e5ad2..112f680 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1112,10 +1112,15 @@ static int vq_access_ok_packed(struct vhost_virtqueue *vq, unsigned int num,
struct vring_used __user *used)
{
struct vring_desc_packed *packed = (struct vring_desc_packed *)desc;
+ struct vring_packed_desc_event *driver_event =
+ (struct vring_packed_desc_event *)avail;
+ struct vring_packed_desc_event *device_event =
+ (struct vring_packed_desc_event *)used;
- /* FIXME: check device area and driver area */
return access_ok(VERIFY_READ, packed, num * sizeof(*packed)) &&
- access_ok(VERIFY_WRITE, packed, num * sizeof(*packed));
+ access_ok(VERIFY_WRITE, packed, num * sizeof(*packed)) &&
+ access_ok(VERIFY_READ, driver_event, sizeof(*driver_event)) &&
+ access_ok(VERIFY_WRITE, device_event, sizeof(*device_event));
}
static int vq_access_ok_split(struct vhost_virtqueue *vq, unsigned int num,
@@ -1190,14 +1195,27 @@ static bool iotlb_access_ok(struct vhost_virtqueue *vq,
return true;
}
-int vq_iotlb_prefetch(struct vhost_virtqueue *vq)
+int vq_iotlb_prefetch_packed(struct vhost_virtqueue *vq)
+{
+ int num = vq->num;
+
+ return iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->desc,
+ num * sizeof(*vq->desc), VHOST_ADDR_DESC) &&
+ iotlb_access_ok(vq, VHOST_ACCESS_WO, (u64)(uintptr_t)vq->desc,
+ num * sizeof(*vq->desc), VHOST_ADDR_DESC) &&
+ iotlb_access_ok(vq, VHOST_ACCESS_RO,
+ (u64)(uintptr_t)vq->driver_event,
+ sizeof(*vq->driver_event), VHOST_ADDR_AVAIL) &&
+ iotlb_access_ok(vq, VHOST_ACCESS_WO,
+ (u64)(uintptr_t)vq->device_event,
+ sizeof(*vq->device_event), VHOST_ADDR_USED);
+}
+
+int vq_iotlb_prefetch_split(struct vhost_virtqueue *vq)
{
size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
unsigned int num = vq->num;
- if (!vq->iotlb)
- return 1;
-
return iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->desc,
num * sizeof(*vq->desc), VHOST_ADDR_DESC) &&
iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->avail,
@@ -1209,6 +1227,17 @@ int vq_iotlb_prefetch(struct vhost_virtqueue *vq)
num * sizeof(*vq->used->ring) + s,
VHOST_ADDR_USED);
}
+
+int vq_iotlb_prefetch(struct vhost_virtqueue *vq)
+{
+ if (!vq->iotlb)
+ return 1;
+
+ if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
+ return vq_iotlb_prefetch_packed(vq);
+ else
+ return vq_iotlb_prefetch_split(vq);
+}
EXPORT_SYMBOL_GPL(vq_iotlb_prefetch);
/* Can we log writes? */
@@ -1730,6 +1759,50 @@ static int vhost_update_used_flags(struct vhost_virtqueue *vq)
return 0;
}
+static int vhost_update_device_flags(struct vhost_virtqueue *vq,
+ __virtio16 device_flags)
+{
+ void __user *flags;
+
+ if (vhost_put_user(vq, device_flags, &vq->device_event->flags,
+ VHOST_ADDR_USED) < 0)
+ return -EFAULT;
+ if (unlikely(vq->log_used)) {
+ /* Make sure the flag is seen before log. */
+ smp_wmb();
+ /* Log used flag write. */
+ flags = &vq->device_event->flags;
+ log_write(vq->log_base, vq->log_addr +
+ (flags - (void __user *)vq->device_event),
+ sizeof(vq->device_event->flags));
+ if (vq->log_ctx)
+ eventfd_signal(vq->log_ctx, 1);
+ }
+ return 0;
+}
+
+static int vhost_update_device_off_wrap(struct vhost_virtqueue *vq,
+ __virtio16 device_off_wrap)
+{
+ void __user *off_wrap;
+
+ if (vhost_put_user(vq, device_off_wrap, &vq->device_event->off_wrap,
+ VHOST_ADDR_USED) < 0)
+ return -EFAULT;
+ if (unlikely(vq->log_used)) {
+ /* Make sure off_wrap is seen before log. */
+ smp_wmb();
+ /* Log off_wrap write. */
+ off_wrap = &vq->device_event->off_wrap;
+ log_write(vq->log_base, vq->log_addr +
+ (off_wrap - (void __user *)vq->device_event),
+ sizeof(vq->device_event->off_wrap));
+ if (vq->log_ctx)
+ eventfd_signal(vq->log_ctx, 1);
+ }
+ return 0;
+}
+
static int vhost_update_avail_event(struct vhost_virtqueue *vq, u16 avail_event)
{
if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
@@ -2683,16 +2756,13 @@ int vhost_add_used_n(struct vhost_virtqueue *vq,
}
EXPORT_SYMBOL_GPL(vhost_add_used_n);
-static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
+static bool vhost_notify_split(struct vhost_dev *dev,
+ struct vhost_virtqueue *vq)
{
__u16 old, new;
__virtio16 event;
bool v;
- /* FIXME: check driver area */
- if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
- return true;
-
/* Flush out used index updates. This is paired
* with the barrier that the Guest executes when enabling
* interrupts. */
@@ -2725,6 +2795,64 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
return vring_need_event(vhost16_to_cpu(vq, event), new, old);
}
+static bool vhost_notify_packed(struct vhost_dev *dev,
+ struct vhost_virtqueue *vq)
+{
+ __virtio16 event_off_wrap, event_flags;
+ __u16 old, new, off_wrap;
+ bool v;
+
+ /* Flush out used descriptors updates. This is paired
+ * with the barrier that the Guest executes when enabling
+ * interrupts.
+ */
+ smp_mb();
+
+ if (vhost_get_avail(vq, event_flags,
+ &vq->driver_event->flags) < 0) {
+ vq_err(vq, "Failed to get driver desc_event_flags");
+ return true;
+ }
+
+ if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX))
+ return event_flags !=
+ cpu_to_vhost16(vq, RING_EVENT_FLAGS_DISABLE);
+
+ old = vq->signalled_used;
+ v = vq->signalled_used_valid;
+ new = vq->signalled_used = vq->last_used_idx;
+ vq->signalled_used_valid = true;
+
+ if (event_flags != cpu_to_vhost16(vq, RING_EVENT_FLAGS_DESC))
+ return event_flags !=
+ cpu_to_vhost16(vq, RING_EVENT_FLAGS_DISABLE);
+
+ /* Read desc event flags before event_off and event_wrap */
+ smp_rmb();
+
+ if (vhost_get_avail(vq, event_off_wrap,
+ &vq->driver_event->off_wrap) < 0) {
+ vq_err(vq, "Failed to get driver desc_event_off/wrap");
+ return true;
+ }
+
+ off_wrap = vhost16_to_cpu(vq, event_off_wrap);
+
+ if (unlikely(!v))
+ return true;
+
+ return vhost_vring_packed_need_event(vq, vq->used_wrap_counter,
+ off_wrap, new, old);
+}
+
+static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
+{
+ if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
+ return vhost_notify_packed(dev, vq);
+ else
+ return vhost_notify_split(dev, vq);
+}
+
/* This actually signals the guest, using eventfd. */
void vhost_signal(struct vhost_dev *dev, struct vhost_virtqueue *vq)
{
@@ -2802,10 +2930,34 @@ static bool vhost_enable_notify_packed(struct vhost_dev *dev,
struct vhost_virtqueue *vq)
{
struct vring_desc_packed *d = vq->desc_packed + vq->avail_idx;
- __virtio16 flags;
+ __virtio16 flags = RING_EVENT_FLAGS_ENABLE;
int ret;
- /* FIXME: disable notification through device area */
+ if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
+ return false;
+ vq->used_flags &= ~VRING_USED_F_NO_NOTIFY;
+
+ if (vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {
+ __virtio16 off_wrap = cpu_to_vhost16(vq, vq->avail_idx |
+ vq->avail_wrap_counter << 15);
+
+ ret = vhost_update_device_off_wrap(vq, off_wrap);
+ if (ret) {
+ vq_err(vq, "Failed to write to off_wrap at %p: %d\n",
+ &vq->device_event->off_wrap, ret);
+ return false;
+ }
+ /* Make sure off_wrap is written before flags */
+ smp_wmb();
+ flags = RING_EVENT_FLAGS_DESC;
+ }
+
+ ret = vhost_update_device_flags(vq, flags);
+ if (ret) {
+ vq_err(vq, "Failed to enable notification at %p: %d\n",
+ &vq->device_event->flags, ret);
+ return false;
+ }
/* They could have slipped one in as we were doing that: make
* sure it's written, then check again. */
@@ -2871,7 +3023,18 @@ EXPORT_SYMBOL_GPL(vhost_enable_notify);
static void vhost_disable_notify_packed(struct vhost_dev *dev,
struct vhost_virtqueue *vq)
{
- /* FIXME: disable notification through device area */
+ __virtio16 flags;
+ int r;
+
+ if (vq->used_flags & VRING_USED_F_NO_NOTIFY)
+ return;
+ vq->used_flags |= VRING_USED_F_NO_NOTIFY;
+
+ flags = cpu_to_vhost16(vq, RING_EVENT_FLAGS_DISABLE);
+ r = vhost_update_device_flags(vq, flags);
+ if (r)
+ vq_err(vq, "Failed to disable notification at %p: %d\n",
+ &vq->device_event->flags, r);
}
static void vhost_disable_notify_split(struct vhost_dev *dev,
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 7543a46..b920582 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -96,8 +96,14 @@ struct vhost_virtqueue {
struct vring_desc __user *desc;
struct vring_desc_packed __user *desc_packed;
};
- struct vring_avail __user *avail;
- struct vring_used __user *used;
+ union {
+ struct vring_avail __user *avail;
+ struct vring_packed_desc_event __user *driver_event;
+ };
+ union {
+ struct vring_used __user *used;
+ struct vring_packed_desc_event __user *device_event;
+ };
const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
struct file *kick;
struct eventfd_ctx *call_ctx;
diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index e297580..71c7a46 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -75,6 +75,25 @@ struct vring_desc_packed {
__virtio16 flags;
};
+/* Enable events */
+#define RING_EVENT_FLAGS_ENABLE 0x0
+/* Disable events */
+#define RING_EVENT_FLAGS_DISABLE 0x1
+/*
+ * Enable events for a specific descriptor
+ * (as specified by Descriptor Ring Change Event Offset/Wrap Counter).
+ * Only valid if VIRTIO_RING_F_EVENT_IDX has been negotiated.
+ */
+#define RING_EVENT_FLAGS_DESC 0x2
+/* The value 0x3 is reserved */
+
+struct vring_packed_desc_event {
+ /* Descriptor Ring Change Event Offset and Wrap Counter */
+ __virtio16 off_wrap;
+ /* Descriptor Ring Change Event Flags */
+ __virtio16 flags;
+};
+
/* Virtio ring descriptors: 16 bytes. These can chain together via "next". */
struct vring_desc {
/* Address (guest-physical). */
--
2.7.4
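vhost_notify_packed() above leaves the final decision to
vhost_vring_packed_need_event() from the previous patch. The sketch
below reproduces that arithmetic in user space so it can be tried with
concrete numbers. It assumes host-endian values and a ring size of at
most 32768; the sketch_* names and the num parameter (standing in for
vq->num) are hypothetical. sketch_need_event() uses the same comparison
as vring_need_event() in include/uapi/linux/virtio_ring.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same comparison as vring_need_event(). */
static bool sketch_need_event(uint16_t event_idx, uint16_t new_idx,
			      uint16_t old_idx)
{
	return (uint16_t)(new_idx - event_idx - 1) <
	       (uint16_t)(new_idx - old_idx);
}

/* Hypothetical stand-in for vhost_vring_packed_need_event(). */
static bool sketch_packed_need_event(uint16_t num, bool wrap,
				     uint16_t off_wrap, uint16_t new_idx,
				     uint16_t old_idx)
{
	uint16_t off = off_wrap & 0x7fff;        /* bits 0-14: offset */
	bool event_wrap = (off_wrap >> 15) != 0; /* bit 15: wrap counter */

	assert(num > 0 && num <= 32768);

	/* The index wrapped since the last signal: unfold it. */
	if (new_idx < old_idx) {
		new_idx += num;
		wrap = !wrap;
	}

	/* The event offset lives in the other lap: unfold it too. */
	if (wrap != event_wrap)
		off += num;

	return sketch_need_event(off, new_idx, old_idx);
}
```

With num = 256: moving from 10 to 20 in the same lap as an event offset
of 15 asks for a notification, while an event offset of 25 does not.
Moving from 250 to 5 across the wrap, with the event placed at offset 2
of the new lap, also asks for one.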
* [RFC V5 PATCH 6/8] virtio: introduce packed ring defines
From: Jason Wang @ 2018-05-29 2:10 UTC (permalink / raw)
To: mst, jasowang
Cc: kvm, virtualization, netdev, linux-kernel, jfreimann, wexu,
tiwei.bie
In-Reply-To: <1527559830-8133-1-git-send-email-jasowang@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
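This patch carries no description, so here is a hedged sketch of how
the two new flag values are meant to be read. Unlike VRING_DESC_F_WRITE
and VRING_DESC_F_INDIRECT, which are masks, VRING_DESC_F_AVAIL (7) and
VRING_DESC_F_USED (15) are bit positions. The sketch assumes the
wrap-counter rule from the packed ring proposal; the helper names are
hypothetical and flags is taken as a host-endian value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions, matching the values added by the patch. */
#define DESC_F_AVAIL_BIT	7
#define DESC_F_USED_BIT		15

/*
 * Hypothetical helper: a descriptor is available to the device when
 * its AVAIL bit matches the wrap counter and its USED bit does not.
 */
static bool desc_is_avail(uint16_t flags, bool wrap_counter)
{
	bool avail = (flags >> DESC_F_AVAIL_BIT) & 1;
	bool used = (flags >> DESC_F_USED_BIT) & 1;

	return avail != used && avail == wrap_counter;
}

/*
 * Hypothetical helper: a descriptor has been used by the device when
 * both bits are equal and match the wrap counter.
 */
static bool desc_is_used(uint16_t flags, bool wrap_counter)
{
	bool avail = (flags >> DESC_F_AVAIL_BIT) & 1;
	bool used = (flags >> DESC_F_USED_BIT) & 1;

	return avail == used && used == wrap_counter;
}
```

With a zeroed ring and an initial wrap counter of 1, neither helper
returns true, so an untouched descriptor is neither available nor used.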
include/uapi/linux/virtio_config.h | 9 +++++++++
include/uapi/linux/virtio_ring.h | 13 +++++++++++++
2 files changed, 22 insertions(+)
diff --git a/include/uapi/linux/virtio_config.h b/include/uapi/linux/virtio_config.h
index 308e209..5903d51 100644
--- a/include/uapi/linux/virtio_config.h
+++ b/include/uapi/linux/virtio_config.h
@@ -71,4 +71,13 @@
* this is for compatibility with legacy systems.
*/
#define VIRTIO_F_IOMMU_PLATFORM 33
+
+#define VIRTIO_F_RING_PACKED 34
+
+/*
+ * This feature indicates that all buffers are used by the device in
+ * the same order in which they have been made available.
+ */
+#define VIRTIO_F_IN_ORDER 35
+
#endif /* _UAPI_LINUX_VIRTIO_CONFIG_H */
diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index 6d5d5fa..e297580 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -43,6 +43,8 @@
#define VRING_DESC_F_WRITE 2
/* This means the buffer contains a list of buffer descriptors. */
#define VRING_DESC_F_INDIRECT 4
+#define VRING_DESC_F_AVAIL 7
+#define VRING_DESC_F_USED 15
/* The Host uses this in used->flags to advise the Guest: don't kick me when
* you add a buffer. It's unreliable, so it's simply an optimization. Guest
@@ -62,6 +64,17 @@
* at the end of the used ring. Guest should ignore the used->flags field. */
#define VIRTIO_RING_F_EVENT_IDX 29
+struct vring_desc_packed {
+ /* Buffer Address. */
+ __virtio64 addr;
+ /* Buffer Length. */
+ __virtio32 len;
+ /* Buffer ID. */
+ __virtio16 id;
+ /* The flags depending on descriptor type. */
+ __virtio16 flags;
+};
+
/* Virtio ring descriptors: 16 bytes. These can chain together via "next". */
struct vring_desc {
/* Address (guest-physical). */
--
2.7.4
^ permalink raw reply related
* [RFC V5 PATCH 5/8] vhost: vhost_put_user() can accept metadata type
From: Jason Wang @ 2018-05-29 2:10 UTC (permalink / raw)
To: mst, jasowang
Cc: kvm, virtualization, netdev, linux-kernel, jfreimann, wexu,
tiwei.bie
In-Reply-To: <1527559830-8133-1-git-send-email-jasowang@redhat.com>
We used to assume that the used ring update was the only user of
vhost_put_user(). This may not hold for the upcoming packed ring, which
may update the descriptor ring to mark buffers as used. So introduce a
new type parameter.
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
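To illustrate why the metadata type has to be a parameter, here is a
small userspace model, not the kernel code: each metadata area has its
own translation, so a write into the descriptor ring must not be
resolved through the used ring mapping. All names below (vq_model,
meta_translate, put_u16) are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins for the VHOST_ADDR_* metadata types. */
enum meta_type { META_DESC, META_AVAIL, META_USED, META_NUM };

/* One translated region per metadata type. */
struct region {
	uint8_t *base;
	size_t len;
};

struct vq_model {
	struct region meta[META_NUM];
};

/* Resolve an offset inside the region selected by the caller. */
static void *meta_translate(struct vq_model *vq, enum meta_type type,
			    size_t off, size_t size)
{
	struct region *r = &vq->meta[type];

	if (off > r->len || size > r->len - off)
		return NULL;
	return r->base + off;
}

/* The caller, not the helper, decides which area is written. */
static int put_u16(struct vq_model *vq, enum meta_type type,
		   size_t off, uint16_t val)
{
	void *to = meta_translate(vq, type, off, sizeof(val));

	if (!to)
		return -EFAULT;
	memcpy(to, &val, sizeof(val));
	return 0;
}
```

A packed ring caller would pass the descriptor type here, while the
existing split ring callers keep passing the used type, as the diff
below does with VHOST_ADDR_USED.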
drivers/vhost/vhost.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index e0fcfec..4031a8f 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -811,7 +811,7 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
return __vhost_get_user_slow(vq, addr, size, type);
}
-#define vhost_put_user(vq, x, ptr) \
+#define vhost_put_user(vq, x, ptr, type) \
({ \
int ret = -EFAULT; \
if (!vq->iotlb) { \
@@ -819,7 +819,7 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
} else { \
__typeof__(ptr) to = \
(__typeof__(ptr)) __vhost_get_user(vq, ptr, \
- sizeof(*ptr), VHOST_ADDR_USED); \
+ sizeof(*ptr), type); \
if (to != NULL) \
ret = __put_user(x, to); \
else \
@@ -1683,7 +1683,7 @@ static int vhost_update_used_flags(struct vhost_virtqueue *vq)
{
void __user *used;
if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
- &vq->used->flags) < 0)
+ &vq->used->flags, VHOST_ADDR_USED) < 0)
return -EFAULT;
if (unlikely(vq->log_used)) {
/* Make sure the flag is seen before log. */
@@ -1702,7 +1702,7 @@ static int vhost_update_used_flags(struct vhost_virtqueue *vq)
static int vhost_update_avail_event(struct vhost_virtqueue *vq, u16 avail_event)
{
if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
- vhost_avail_event(vq)))
+ vhost_avail_event(vq), VHOST_ADDR_USED))
return -EFAULT;
if (unlikely(vq->log_used)) {
void __user *used;
@@ -2185,12 +2185,12 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
used = vq->used->ring + start;
for (i = 0; i < count; i++) {
if (unlikely(vhost_put_user(vq, heads[i].elem.id,
- &used[i].id))) {
+ &used[i].id, VHOST_ADDR_USED))) {
vq_err(vq, "Failed to write used id");
return -EFAULT;
}
if (unlikely(vhost_put_user(vq, heads[i].elem.len,
- &used[i].len))) {
+ &used[i].len, VHOST_ADDR_USED))) {
vq_err(vq, "Failed to write used len");
return -EFAULT;
}
@@ -2236,7 +2236,7 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vhost_used_elem *heads,
/* Make sure buffer is written before we update index. */
smp_wmb();
if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
- &vq->used->idx)) {
+ &vq->used->idx, VHOST_ADDR_USED)) {
vq_err(vq, "Failed to increment used idx");
return -EFAULT;
}
--
2.7.4
^ permalink raw reply related
* [RFC V5 PATCH 4/8] vhost_net: do not explicitly manipulate vhost_used_elem
From: Jason Wang @ 2018-05-29 2:10 UTC (permalink / raw)
To: mst, jasowang
Cc: kvm, virtualization, netdev, linux-kernel, jfreimann, wexu,
tiwei.bie
In-Reply-To: <1527559830-8133-1-git-send-email-jasowang@redhat.com>
Introduce two helpers for setting and getting the used len, to avoid
explicitly manipulating vhost_used_elem in the zerocopy code. This will
be used to hide the used_elem internals and simplify the packed ring
implementation.
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
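A minimal userspace sketch of the accessor pair, for illustration
only. It assumes the len field is always stored little-endian, whereas
the real helpers go through cpu_to_vhost32()/vhost32_to_cpu(), which
depend on the virtqueue's endianness. The struct and function names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Ring-visible element; fields are stored little-endian (assumption). */
struct used_elem {
	uint32_t id;
	uint32_t len;
};

/* Internal wrapper, modelled on vhost_used_elem. */
struct wrapped_elem {
	struct used_elem elem;
};

/* Convert a host value to its little-endian storage form. */
static uint32_t to_le32(uint32_t v)
{
	uint8_t b[4] = { (uint8_t)v, (uint8_t)(v >> 8),
			 (uint8_t)(v >> 16), (uint8_t)(v >> 24) };
	uint32_t out;

	memcpy(&out, b, sizeof(out));
	return out;
}

/* Convert a little-endian stored value back to host order. */
static uint32_t from_le32(uint32_t v)
{
	uint8_t b[4];

	memcpy(b, &v, sizeof(b));
	return (uint32_t)b[0] | (uint32_t)b[1] << 8 |
	       (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24;
}

/* Callers never touch elem.len directly. */
static void wrapped_set_len(struct wrapped_elem *used, uint32_t len)
{
	used->elem.len = to_le32(len);
}

static uint32_t wrapped_get_len(const struct wrapped_elem *used)
{
	return from_le32(used->elem.len);
}
```

Keeping both directions behind helpers means the storage format of the
wrapper can change later without touching the zerocopy callers.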
drivers/vhost/net.c | 11 +++++------
drivers/vhost/vhost.c | 12 ++++++++++--
drivers/vhost/vhost.h | 5 +++++
3 files changed, 20 insertions(+), 8 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 3826f1f..30273ad 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -341,9 +341,10 @@ static void vhost_zerocopy_signal_used(struct vhost_net *net,
int j = 0;
for (i = nvq->done_idx; i != nvq->upend_idx; i = (i + 1) % UIO_MAXIOV) {
- if (vq->heads[i].elem.len == VHOST_DMA_FAILED_LEN)
+ if (vhost_get_used_len(vq, &vq->heads[i]) ==
+ VHOST_DMA_FAILED_LEN)
vhost_net_tx_err(net);
- if (VHOST_DMA_IS_DONE(vq->heads[i].elem.len)) {
+ if (VHOST_DMA_IS_DONE(vhost_get_used_len(vq, &vq->heads[i]))) {
vq->heads[i].elem.len = VHOST_DMA_CLEAR_LEN;
++j;
} else
@@ -542,10 +543,8 @@ static void handle_tx(struct vhost_net *net)
struct ubuf_info *ubuf;
ubuf = nvq->ubuf_info + nvq->upend_idx;
- vq->heads[nvq->upend_idx].elem.id =
- cpu_to_vhost32(vq, used.elem.id);
- vq->heads[nvq->upend_idx].elem.len =
- VHOST_DMA_IN_PROGRESS;
+ vhost_set_used_len(vq, &used, VHOST_DMA_IN_PROGRESS);
+ vq->heads[nvq->upend_idx] = used;
ubuf->callback = vhost_zerocopy_callback;
ubuf->ctx = nvq->ubufs;
ubuf->desc = nvq->upend_idx;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2b2a776..e0fcfec 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2067,11 +2067,19 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
}
EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
-static void vhost_set_used_len(struct vhost_virtqueue *vq,
- struct vhost_used_elem *used, int len)
+void vhost_set_used_len(struct vhost_virtqueue *vq,
+ struct vhost_used_elem *used, int len)
{
used->elem.len = cpu_to_vhost32(vq, len);
}
+EXPORT_SYMBOL_GPL(vhost_set_used_len);
+
+int vhost_get_used_len(struct vhost_virtqueue *vq,
+ struct vhost_used_elem *used)
+{
+ return vhost32_to_cpu(vq, used->elem.len);
+}
+EXPORT_SYMBOL_GPL(vhost_get_used_len);
/* This is a multi-buffer version of vhost_get_desc, that works if
* vq has read descriptors only.
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 8dea44b..604821b 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -198,6 +198,11 @@ int vhost_get_bufs(struct vhost_virtqueue *vq,
unsigned *log_num,
unsigned int quota,
s16 *count);
+void vhost_set_used_len(struct vhost_virtqueue *vq,
+ struct vhost_used_elem *used,
+ int len);
+int vhost_get_used_len(struct vhost_virtqueue *vq,
+ struct vhost_used_elem *used);
void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
int vhost_vq_init_access(struct vhost_virtqueue *);
--
2.7.4
^ permalink raw reply related
* [RFC V5 PATCH 3/8] vhost: do not use vring_used_elem
From: Jason Wang @ 2018-05-29 2:10 UTC (permalink / raw)
To: mst, jasowang
Cc: kvm, virtualization, netdev, linux-kernel, jfreimann, wexu,
tiwei.bie
In-Reply-To: <1527559830-8133-1-git-send-email-jasowang@redhat.com>
Instead of depending on the exported vring_used_elem, this patch
switches to a new internal structure, vhost_used_elem, which embeds
vring_used_elem. This lets vhost record extra metadata for the
upcoming packed ring layout.
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
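One consequence of the wrapper is visible in __vhost_add_used_n()
below: the bulk vhost_copy_to_user() of the heads array is replaced by
a per-element loop. A sketch of why, under the assumption that the
wrapper later grows extra fields: an array of wrappers then no longer
has the same stride as the ring, so each element has to be written
field by field. The extra member and all names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ring-visible layout, modelled on vring_used_elem. */
struct used_elem {
	uint32_t id;
	uint32_t len;
};

/* Internal wrapper; "extra" stands for future packed ring metadata. */
struct wrapped_elem {
	struct used_elem elem;
	uint16_t extra;		/* hypothetical, not in this patch */
};

/*
 * Publish count heads into the ring starting at start.
 * The caller guarantees start + count does not run past the ring end.
 */
static void publish_used(struct used_elem *ring, size_t start,
			 const struct wrapped_elem *heads, size_t count)
{
	size_t i;

	for (i = 0; i < count; i++) {
		ring[start + i].id = heads[i].elem.id;
		ring[start + i].len = heads[i].elem.len;
	}
}
```

The price is one put per field instead of a single copy, which may
matter for large batches.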
drivers/vhost/net.c | 19 +++++++-------
drivers/vhost/scsi.c | 10 ++++----
drivers/vhost/vhost.c | 68 ++++++++++++---------------------------------------
drivers/vhost/vhost.h | 18 ++++++++------
drivers/vhost/vsock.c | 6 ++---
5 files changed, 45 insertions(+), 76 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 826489c..3826f1f 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -341,10 +341,10 @@ static void vhost_zerocopy_signal_used(struct vhost_net *net,
int j = 0;
for (i = nvq->done_idx; i != nvq->upend_idx; i = (i + 1) % UIO_MAXIOV) {
- if (vq->heads[i].len == VHOST_DMA_FAILED_LEN)
+ if (vq->heads[i].elem.len == VHOST_DMA_FAILED_LEN)
vhost_net_tx_err(net);
- if (VHOST_DMA_IS_DONE(vq->heads[i].len)) {
- vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
+ if (VHOST_DMA_IS_DONE(vq->heads[i].elem.len)) {
+ vq->heads[i].elem.len = VHOST_DMA_CLEAR_LEN;
++j;
} else
break;
@@ -367,7 +367,7 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
rcu_read_lock_bh();
/* set len to mark this desc buffers done DMA */
- vq->heads[ubuf->desc].len = success ?
+ vq->heads[ubuf->desc].elem.len = success ?
VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN;
cnt = vhost_net_ubuf_put(ubufs);
@@ -426,7 +426,7 @@ static int vhost_net_enable_vq(struct vhost_net *n,
static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
struct vhost_virtqueue *vq,
- struct vring_used_elem *used_elem,
+ struct vhost_used_elem *used_elem,
struct iovec iov[], unsigned int iov_size,
unsigned int *out_num, unsigned int *in_num)
{
@@ -477,7 +477,7 @@ static void handle_tx(struct vhost_net *net)
size_t hdr_size;
struct socket *sock;
struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
- struct vring_used_elem used;
+ struct vhost_used_elem used;
bool zcopy, zcopy_used;
int sent_pkts = 0;
@@ -542,9 +542,10 @@ static void handle_tx(struct vhost_net *net)
struct ubuf_info *ubuf;
ubuf = nvq->ubuf_info + nvq->upend_idx;
- vq->heads[nvq->upend_idx].id =
- cpu_to_vhost32(vq, used.id);
- vq->heads[nvq->upend_idx].len = VHOST_DMA_IN_PROGRESS;
+ vq->heads[nvq->upend_idx].elem.id =
+ cpu_to_vhost32(vq, used.elem.id);
+ vq->heads[nvq->upend_idx].elem.len =
+ VHOST_DMA_IN_PROGRESS;
ubuf->callback = vhost_zerocopy_callback;
ubuf->ctx = nvq->ubufs;
ubuf->desc = nvq->upend_idx;
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 654c71f..ac11412 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -67,7 +67,7 @@ struct vhost_scsi_inflight {
struct vhost_scsi_cmd {
/* Descriptor from vhost_get_vq_desc() for virt_queue segment */
- struct vring_used_elem tvc_vq_used;
+ struct vhost_used_elem tvc_vq_used;
/* virtio-scsi initiator task attribute */
int tvc_task_attr;
/* virtio-scsi response incoming iovecs */
@@ -441,7 +441,7 @@ vhost_scsi_do_evt_work(struct vhost_scsi *vs, struct vhost_scsi_evt *evt)
struct vhost_virtqueue *vq = &vs->vqs[VHOST_SCSI_VQ_EVT].vq;
struct virtio_scsi_event *event = &evt->event;
struct virtio_scsi_event __user *eventp;
- struct vring_used_elem used;
+ struct vhost_used_elem used;
unsigned out, in;
int ret;
@@ -785,7 +785,7 @@ static void vhost_scsi_submission_work(struct work_struct *work)
static void
vhost_scsi_send_bad_target(struct vhost_scsi *vs,
struct vhost_virtqueue *vq,
- struct vring_used_elem *used, unsigned out)
+ struct vhost_used_elem *used, unsigned out)
{
struct virtio_scsi_cmd_resp __user *resp;
struct virtio_scsi_cmd_resp rsp;
@@ -808,7 +808,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
struct virtio_scsi_cmd_req v_req;
struct virtio_scsi_cmd_req_pi v_req_pi;
struct vhost_scsi_cmd *cmd;
- struct vring_used_elem used;
+ struct vhost_used_elem used;
struct iov_iter out_iter, in_iter, prot_iter, data_iter;
u64 tag;
u32 exp_data_len, data_direction;
@@ -837,7 +837,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
ARRAY_SIZE(vq->iov), &out, &in,
NULL, NULL);
pr_debug("vhost_get_vq_desc: head: %d, out: %u in: %u\n",
- used.id, out, in);
+ used.elem.id, out, in);
/* Nothing new? Wait for eventfd to tell us they refilled. */
if (ret == -ENOSPC) {
if (unlikely(vhost_enable_notify(&vs->dev, vq))) {
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 296bd5e..2b2a776 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -728,41 +728,6 @@ static bool memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
struct iovec iov[], int iov_size, int access);
-static int vhost_copy_to_user(struct vhost_virtqueue *vq, void __user *to,
- const void *from, unsigned size)
-{
- int ret;
-
- if (!vq->iotlb)
- return __copy_to_user(to, from, size);
- else {
- /* This function should be called after iotlb
- * prefetch, which means we're sure that all vq
- * could be access through iotlb. So -EAGAIN should
- * not happen in this case.
- */
- struct iov_iter t;
- void __user *uaddr = vhost_vq_meta_fetch(vq,
- (u64)(uintptr_t)to, size,
- VHOST_ADDR_USED);
-
- if (uaddr)
- return __copy_to_user(uaddr, from, size);
-
- ret = translate_desc(vq, (u64)(uintptr_t)to, size, vq->iotlb_iov,
- ARRAY_SIZE(vq->iotlb_iov),
- VHOST_ACCESS_WO);
- if (ret < 0)
- goto out;
- iov_iter_init(&t, WRITE, vq->iotlb_iov, ret, size);
- ret = copy_to_iter(from, size, &t);
- if (ret == size)
- ret = 0;
- }
-out:
- return ret;
-}
-
static int vhost_copy_from_user(struct vhost_virtqueue *vq, void *to,
void __user *from, unsigned size)
{
@@ -1958,7 +1923,7 @@ static int get_indirect(struct vhost_virtqueue *vq,
* never a valid descriptor number) if none was found. A negative code is
* returned on error. */
int vhost_get_vq_desc(struct vhost_virtqueue *vq,
- struct vring_used_elem *used,
+ struct vhost_used_elem *used,
struct iovec iov[], unsigned int iov_size,
unsigned int *out_num, unsigned int *in_num,
struct vhost_log *log, unsigned int *log_num)
@@ -2009,7 +1974,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
return -EFAULT;
}
- used->id = ring_head;
+ used->elem.id = ring_head;
head = vhost16_to_cpu(vq, ring_head);
/* If their number is silly, that's an error. */
@@ -2103,9 +2068,9 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
static void vhost_set_used_len(struct vhost_virtqueue *vq,
- struct vring_used_elem *used, int len)
+ struct vhost_used_elem *used, int len)
{
- used->len = cpu_to_vhost32(vq, len);
+ used->elem.len = cpu_to_vhost32(vq, len);
}
/* This is a multi-buffer version of vhost_get_desc, that works if
@@ -2119,7 +2084,7 @@ static void vhost_set_used_len(struct vhost_virtqueue *vq,
* returns number of buffer heads allocated, negative on error
*/
int vhost_get_bufs(struct vhost_virtqueue *vq,
- struct vring_used_elem *heads,
+ struct vhost_used_elem *heads,
int datalen,
unsigned *iovcount,
struct vhost_log *log,
@@ -2192,7 +2157,7 @@ EXPORT_SYMBOL_GPL(vhost_discard_vq_desc);
/* After we've used one of their buffers, we tell them about it. We'll then
* want to notify the guest, using eventfd. */
-int vhost_add_used(struct vhost_virtqueue *vq, struct vring_used_elem *used,
+int vhost_add_used(struct vhost_virtqueue *vq, struct vhost_used_elem *used,
int len)
{
vhost_set_used_len(vq, used, len);
@@ -2201,27 +2166,26 @@ int vhost_add_used(struct vhost_virtqueue *vq, struct vring_used_elem *used,
EXPORT_SYMBOL_GPL(vhost_add_used);
static int __vhost_add_used_n(struct vhost_virtqueue *vq,
- struct vring_used_elem *heads,
+ struct vhost_used_elem *heads,
unsigned count)
{
struct vring_used_elem __user *used;
u16 old, new;
- int start;
+ int start, i;
start = vq->last_used_idx & (vq->num - 1);
used = vq->used->ring + start;
- if (count == 1) {
- if (vhost_put_user(vq, heads[0].id, &used->id)) {
+ for (i = 0; i < count; i++) {
+ if (unlikely(vhost_put_user(vq, heads[i].elem.id,
+ &used[i].id))) {
vq_err(vq, "Failed to write used id");
return -EFAULT;
}
- if (vhost_put_user(vq, heads[0].len, &used->len)) {
+ if (unlikely(vhost_put_user(vq, heads[i].elem.len,
+ &used[i].len))) {
vq_err(vq, "Failed to write used len");
return -EFAULT;
}
- } else if (vhost_copy_to_user(vq, used, heads, count * sizeof *used)) {
- vq_err(vq, "Failed to write used");
- return -EFAULT;
}
if (unlikely(vq->log_used)) {
/* Make sure data is seen before log. */
@@ -2245,7 +2209,7 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
/* After we've used one of their buffers, we tell them about it. We'll then
* want to notify the guest, using eventfd. */
-int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
+int vhost_add_used_n(struct vhost_virtqueue *vq, struct vhost_used_elem *heads,
unsigned count)
{
int start, n, r;
@@ -2329,7 +2293,7 @@ EXPORT_SYMBOL_GPL(vhost_signal);
/* And here's the combo meal deal. Supersize me! */
void vhost_add_used_and_signal(struct vhost_dev *dev,
struct vhost_virtqueue *vq,
- struct vring_used_elem *used, int len)
+ struct vhost_used_elem *used, int len)
{
vhost_add_used(vq, used, len);
vhost_signal(dev, vq);
@@ -2339,7 +2303,7 @@ EXPORT_SYMBOL_GPL(vhost_add_used_and_signal);
/* multi-buffer version of vhost_add_used_and_signal */
void vhost_add_used_and_signal_n(struct vhost_dev *dev,
struct vhost_virtqueue *vq,
- struct vring_used_elem *heads, unsigned count)
+ struct vhost_used_elem *heads, unsigned count)
{
vhost_add_used_n(vq, heads, count);
vhost_signal(dev, vq);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index a7cc7e7..8dea44b 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -34,6 +34,10 @@ struct vhost_poll {
struct vhost_dev *dev;
};
+struct vhost_used_elem {
+ struct vring_used_elem elem;
+};
+
void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
bool vhost_has_work(struct vhost_dev *dev);
@@ -126,7 +130,7 @@ struct vhost_virtqueue {
struct iovec iov[UIO_MAXIOV];
struct iovec iotlb_iov[64];
struct iovec *indirect;
- struct vring_used_elem *heads;
+ struct vhost_used_elem *heads;
/* Protected by virtqueue mutex. */
struct vhost_umem *umem;
struct vhost_umem *iotlb;
@@ -182,12 +186,12 @@ bool vhost_vq_access_ok(struct vhost_virtqueue *vq);
bool vhost_log_access_ok(struct vhost_dev *);
int vhost_get_vq_desc(struct vhost_virtqueue *,
- struct vring_used_elem *used_elem,
+ struct vhost_used_elem *used_elem,
struct iovec iov[], unsigned int iov_count,
unsigned int *out_num, unsigned int *in_num,
struct vhost_log *log, unsigned int *log_num);
int vhost_get_bufs(struct vhost_virtqueue *vq,
- struct vring_used_elem *heads,
+ struct vhost_used_elem *heads,
int datalen,
unsigned *iovcount,
struct vhost_log *log,
@@ -198,13 +202,13 @@ void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
int vhost_vq_init_access(struct vhost_virtqueue *);
int vhost_add_used(struct vhost_virtqueue *vq,
- struct vring_used_elem *elem, int len);
-int vhost_add_used_n(struct vhost_virtqueue *, struct vring_used_elem *heads,
+ struct vhost_used_elem *elem, int len);
+int vhost_add_used_n(struct vhost_virtqueue *vq, struct vhost_used_elem *heads,
unsigned count);
void vhost_add_used_and_signal(struct vhost_dev *, struct vhost_virtqueue *,
- struct vring_used_elem *, int len);
+ struct vhost_used_elem *, int len);
void vhost_add_used_and_signal_n(struct vhost_dev *, struct vhost_virtqueue *,
- struct vring_used_elem *heads, unsigned count);
+ struct vhost_used_elem *heads, unsigned count);
void vhost_signal(struct vhost_dev *, struct vhost_virtqueue *);
void vhost_disable_notify(struct vhost_dev *, struct vhost_virtqueue *);
bool vhost_vq_avail_empty(struct vhost_dev *, struct vhost_virtqueue *);
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 59a01cd..695694f 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -98,7 +98,7 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
for (;;) {
struct virtio_vsock_pkt *pkt;
- struct vring_used_elem used;
+ struct vhost_used_elem used;
struct iov_iter iov_iter;
unsigned out, in;
size_t nbytes;
@@ -146,7 +146,7 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
break;
}
- len = vhost32_to_cpu(vq, used.len);
+ len = vhost32_to_cpu(vq, used.elem.len);
iov_iter_init(&iov_iter, READ, &vq->iov[out], in, len);
nbytes = copy_to_iter(&pkt->hdr, sizeof(pkt->hdr), &iov_iter);
@@ -346,7 +346,7 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
struct vhost_vsock *vsock = container_of(vq->dev, struct vhost_vsock,
dev);
struct virtio_vsock_pkt *pkt;
- struct vring_used_elem used;
+ struct vhost_used_elem used;
int ret;
unsigned int out, in;
bool added = false;
--
2.7.4
^ permalink raw reply related
* [RFC V5 PATCH 2/8] vhost: hide used ring layout from device
From: Jason Wang @ 2018-05-29 2:10 UTC (permalink / raw)
To: mst, jasowang
Cc: kvm, virtualization, netdev, linux-kernel, jfreimann, wexu,
tiwei.bie
In-Reply-To: <1527559830-8133-1-git-send-email-jasowang@redhat.com>
We used to return the descriptor head from vhost_get_vq_desc() to the
device and pass it back to vhost_add_used() and its friends. This exposes
the internal used ring layout to the device, which makes it hard to
extend for e.g. the packed ring layout.
So this patch tries to hide the used ring layout by
- letting vhost_get_vq_desc() fill in a struct vring_used_elem passed
by pointer
- accepting a pointer to struct vring_used_elem in vhost_add_used() and
vhost_add_used_and_signal()
This makes it easier to implement the packed ring on top.
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
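The diff below also changes the calling convention: instead of
comparing the returned head against vq->num, callers now check for
-ENOSPC (ring empty) first and treat any other negative value as an
error. A self-contained model of that pattern follows; the queue model
and function names are hypothetical and only mimic the shape of the
callers.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define QUEUE_SIZE 8

struct used_elem {
	uint32_t id;
	uint32_t len;
};

struct queue_model {
	uint16_t avail[QUEUE_SIZE];	/* heads published by the driver */
	unsigned int avail_idx;		/* number of heads published */
	unsigned int last_avail;	/* number of heads consumed */
	unsigned int num;		/* number of valid descriptors */
};

/* Returns 0 and fills *used, -ENOSPC if empty, -EINVAL on a bad head. */
static int queue_get(struct queue_model *q, struct used_elem *used)
{
	uint16_t head;

	if (q->last_avail == q->avail_idx)
		return -ENOSPC;
	head = q->avail[q->last_avail % QUEUE_SIZE];
	if (head >= q->num)
		return -EINVAL;
	used->id = head;
	used->len = 0;
	q->last_avail++;
	return 0;
}

/* Caller loop: empty is not an error, anything else negative is. */
static int queue_drain(struct queue_model *q, struct used_elem *out, int max)
{
	int n = 0;

	while (n < max) {
		int err = queue_get(q, &out[n]);

		if (err == -ENOSPC)
			break;		/* wait for the next kick */
		if (err < 0)
			return err;	/* stop handling on error */
		n++;
	}
	return n;
}
```

Note the ordering: the -ENOSPC check has to come before the generic
err < 0 check, which is why the hunks below move the "On error" test
after the "Nothing new?" test.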
drivers/vhost/net.c | 46 +++++++++++++++++++++-----------------
drivers/vhost/scsi.c | 62 +++++++++++++++++++++++++++------------------------
drivers/vhost/vhost.c | 52 +++++++++++++++++++++---------------------
drivers/vhost/vhost.h | 9 +++++---
drivers/vhost/vsock.c | 42 +++++++++++++++++-----------------
5 files changed, 112 insertions(+), 99 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 762aa81..826489c 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -426,22 +426,24 @@ static int vhost_net_enable_vq(struct vhost_net *n,
static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
struct vhost_virtqueue *vq,
+ struct vring_used_elem *used_elem,
struct iovec iov[], unsigned int iov_size,
unsigned int *out_num, unsigned int *in_num)
{
unsigned long uninitialized_var(endtime);
- int r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
+ int r = vhost_get_vq_desc(vq, used_elem, vq->iov, ARRAY_SIZE(vq->iov),
out_num, in_num, NULL, NULL);
- if (r == vq->num && vq->busyloop_timeout) {
+ if (r == -ENOSPC && vq->busyloop_timeout) {
preempt_disable();
endtime = busy_clock() + vq->busyloop_timeout;
while (vhost_can_busy_poll(vq->dev, endtime) &&
vhost_vq_avail_empty(vq->dev, vq))
cpu_relax();
preempt_enable();
- r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
- out_num, in_num, NULL, NULL);
+ r = vhost_get_vq_desc(vq, used_elem, vq->iov,
+ ARRAY_SIZE(vq->iov), out_num, in_num,
+ NULL, NULL);
}
return r;
@@ -463,7 +465,6 @@ static void handle_tx(struct vhost_net *net)
struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
struct vhost_virtqueue *vq = &nvq->vq;
unsigned out, in;
- int head;
struct msghdr msg = {
.msg_name = NULL,
.msg_namelen = 0,
@@ -476,6 +477,7 @@ static void handle_tx(struct vhost_net *net)
size_t hdr_size;
struct socket *sock;
struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
+ struct vring_used_elem used;
bool zcopy, zcopy_used;
int sent_pkts = 0;
@@ -499,20 +501,20 @@ static void handle_tx(struct vhost_net *net)
vhost_zerocopy_signal_used(net, vq);
- head = vhost_net_tx_get_vq_desc(net, vq, vq->iov,
- ARRAY_SIZE(vq->iov),
- &out, &in);
- /* On error, stop handling until the next kick. */
- if (unlikely(head < 0))
- break;
+ err = vhost_net_tx_get_vq_desc(net, vq, &used, vq->iov,
+ ARRAY_SIZE(vq->iov),
+ &out, &in);
/* Nothing new? Wait for eventfd to tell us they refilled. */
- if (head == vq->num) {
+ if (err == -ENOSPC) {
if (unlikely(vhost_enable_notify(&net->dev, vq))) {
vhost_disable_notify(&net->dev, vq);
continue;
}
break;
}
+ /* On error, stop handling until the next kick. */
+ if (unlikely(err < 0))
+ break;
if (in) {
vq_err(vq, "Unexpected descriptor format for TX: "
"out %d, int %d\n", out, in);
@@ -540,7 +542,8 @@ static void handle_tx(struct vhost_net *net)
struct ubuf_info *ubuf;
ubuf = nvq->ubuf_info + nvq->upend_idx;
- vq->heads[nvq->upend_idx].id = cpu_to_vhost32(vq, head);
+ vq->heads[nvq->upend_idx].id =
+ cpu_to_vhost32(vq, used.id);
vq->heads[nvq->upend_idx].len = VHOST_DMA_IN_PROGRESS;
ubuf->callback = vhost_zerocopy_callback;
ubuf->ctx = nvq->ubufs;
@@ -581,7 +584,7 @@ static void handle_tx(struct vhost_net *net)
pr_debug("Truncated TX packet: "
" len %d != %zd\n", err, len);
if (!zcopy_used)
- vhost_add_used_and_signal(&net->dev, vq, head, 0);
+ vhost_add_used_and_signal(&net->dev, vq, &used, 0);
else
vhost_zerocopy_signal_used(net, vq);
vhost_net_tx_packet(net);
@@ -713,14 +716,12 @@ static void handle_rx(struct vhost_net *net)
while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk))) {
sock_len += sock_hlen;
vhost_len = sock_len + vhost_hlen;
- headcount = vhost_get_bufs(vq, vq->heads + nheads, vhost_len,
- &in, vq_log, &log,
- likely(mergeable) ? UIO_MAXIOV : 1);
- /* On error, stop handling until the next kick. */
- if (unlikely(headcount < 0))
- goto out;
+ err = vhost_get_bufs(vq, vq->heads + nheads, vhost_len,
+ &in, vq_log, &log,
+ likely(mergeable) ? UIO_MAXIOV : 1,
+ &headcount);
/* OK, now we need to know about added descriptors. */
- if (!headcount) {
+ if (err == -ENOSPC) {
if (unlikely(vhost_enable_notify(&net->dev, vq))) {
/* They have slipped one in as we were
* doing that: check again. */
@@ -731,6 +732,9 @@ static void handle_rx(struct vhost_net *net)
* they refilled. */
goto out;
}
+ /* On error, stop handling until the next kick. */
+ if (unlikely(err < 0))
+ goto out;
if (nvq->rx_ring)
msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
/* On overrun, truncate and discard */
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 7ad5709..654c71f 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -67,7 +67,7 @@ struct vhost_scsi_inflight {
struct vhost_scsi_cmd {
/* Descriptor from vhost_get_vq_desc() for virt_queue segment */
- int tvc_vq_desc;
+ struct vring_used_elem tvc_vq_used;
/* virtio-scsi initiator task attribute */
int tvc_task_attr;
/* virtio-scsi response incoming iovecs */
@@ -441,8 +441,9 @@ vhost_scsi_do_evt_work(struct vhost_scsi *vs, struct vhost_scsi_evt *evt)
struct vhost_virtqueue *vq = &vs->vqs[VHOST_SCSI_VQ_EVT].vq;
struct virtio_scsi_event *event = &evt->event;
struct virtio_scsi_event __user *eventp;
+ struct vring_used_elem used;
unsigned out, in;
- int head, ret;
+ int ret;
if (!vq->private_data) {
vs->vs_events_missed = true;
@@ -451,16 +452,16 @@ vhost_scsi_do_evt_work(struct vhost_scsi *vs, struct vhost_scsi_evt *evt)
again:
vhost_disable_notify(&vs->dev, vq);
- head = vhost_get_vq_desc(vq, vq->iov,
+ ret = vhost_get_vq_desc(vq, &used, vq->iov,
ARRAY_SIZE(vq->iov), &out, &in,
NULL, NULL);
- if (head < 0) {
+ if (ret == -ENOSPC) {
+ if (vhost_enable_notify(&vs->dev, vq))
+ goto again;
vs->vs_events_missed = true;
return;
}
- if (head == vq->num) {
- if (vhost_enable_notify(&vs->dev, vq))
- goto again;
+ if (ret < 0) {
vs->vs_events_missed = true;
return;
}
@@ -480,7 +481,7 @@ vhost_scsi_do_evt_work(struct vhost_scsi *vs, struct vhost_scsi_evt *evt)
eventp = vq->iov[out].iov_base;
ret = __copy_to_user(eventp, event, sizeof(*event));
if (!ret)
- vhost_add_used_and_signal(&vs->dev, vq, head, 0);
+ vhost_add_used_and_signal(&vs->dev, vq, &used, 0);
else
vq_err(vq, "Faulted on vhost_scsi_send_event\n");
}
@@ -541,7 +542,7 @@ static void vhost_scsi_complete_cmd_work(struct vhost_work *work)
ret = copy_to_iter(&v_rsp, sizeof(v_rsp), &iov_iter);
if (likely(ret == sizeof(v_rsp))) {
struct vhost_scsi_virtqueue *q;
- vhost_add_used(cmd->tvc_vq, cmd->tvc_vq_desc, 0);
+ vhost_add_used(cmd->tvc_vq, &cmd->tvc_vq_used, 0);
q = container_of(cmd->tvc_vq, struct vhost_scsi_virtqueue, vq);
vq = q - vs->vqs;
__set_bit(vq, signal);
@@ -784,7 +785,7 @@ static void vhost_scsi_submission_work(struct work_struct *work)
static void
vhost_scsi_send_bad_target(struct vhost_scsi *vs,
struct vhost_virtqueue *vq,
- int head, unsigned out)
+ struct vring_used_elem *used, unsigned out)
{
struct virtio_scsi_cmd_resp __user *resp;
struct virtio_scsi_cmd_resp rsp;
@@ -795,7 +796,7 @@ vhost_scsi_send_bad_target(struct vhost_scsi *vs,
resp = vq->iov[out].iov_base;
ret = __copy_to_user(resp, &rsp, sizeof(rsp));
if (!ret)
- vhost_add_used_and_signal(&vs->dev, vq, head, 0);
+ vhost_add_used_and_signal(&vs->dev, vq, used, 0);
else
pr_err("Faulted on virtio_scsi_cmd_resp\n");
}
@@ -807,11 +808,12 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
struct virtio_scsi_cmd_req v_req;
struct virtio_scsi_cmd_req_pi v_req_pi;
struct vhost_scsi_cmd *cmd;
+ struct vring_used_elem used;
struct iov_iter out_iter, in_iter, prot_iter, data_iter;
u64 tag;
u32 exp_data_len, data_direction;
unsigned int out = 0, in = 0;
- int head, ret, prot_bytes;
+ int ret, prot_bytes;
size_t req_size, rsp_size = sizeof(struct virtio_scsi_cmd_resp);
size_t out_size, in_size;
u16 lun;
@@ -831,22 +833,22 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
vhost_disable_notify(&vs->dev, vq);
for (;;) {
- head = vhost_get_vq_desc(vq, vq->iov,
- ARRAY_SIZE(vq->iov), &out, &in,
- NULL, NULL);
+ ret = vhost_get_vq_desc(vq, &used, vq->iov,
+ ARRAY_SIZE(vq->iov), &out, &in,
+ NULL, NULL);
pr_debug("vhost_get_vq_desc: head: %d, out: %u in: %u\n",
- head, out, in);
- /* On error, stop handling until the next kick. */
- if (unlikely(head < 0))
- break;
+ used.id, out, in);
/* Nothing new? Wait for eventfd to tell us they refilled. */
- if (head == vq->num) {
+ if (ret == -ENOSPC) {
if (unlikely(vhost_enable_notify(&vs->dev, vq))) {
vhost_disable_notify(&vs->dev, vq);
continue;
}
break;
}
+ /* On error, stop handling until the next kick. */
+ if (unlikely(ret < 0))
+ break;
/*
* Check for a sane response buffer so we can report early
* errors back to the guest.
@@ -891,20 +893,20 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
if (unlikely(!copy_from_iter_full(req, req_size, &out_iter))) {
vq_err(vq, "Faulted on copy_from_iter\n");
- vhost_scsi_send_bad_target(vs, vq, head, out);
+ vhost_scsi_send_bad_target(vs, vq, &used, out);
continue;
}
/* virtio-scsi spec requires byte 0 of the lun to be 1 */
if (unlikely(*lunp != 1)) {
vq_err(vq, "Illegal virtio-scsi lun: %u\n", *lunp);
- vhost_scsi_send_bad_target(vs, vq, head, out);
+ vhost_scsi_send_bad_target(vs, vq, &used, out);
continue;
}
tpg = READ_ONCE(vs_tpg[*target]);
if (unlikely(!tpg)) {
/* Target does not exist, fail the request */
- vhost_scsi_send_bad_target(vs, vq, head, out);
+ vhost_scsi_send_bad_target(vs, vq, &used, out);
continue;
}
/*
@@ -950,7 +952,8 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
if (data_direction != DMA_TO_DEVICE) {
vq_err(vq, "Received non zero pi_bytesout,"
" but wrong data_direction\n");
- vhost_scsi_send_bad_target(vs, vq, head, out);
+ vhost_scsi_send_bad_target(vs, vq,
+ &used, out);
continue;
}
prot_bytes = vhost32_to_cpu(vq, v_req_pi.pi_bytesout);
@@ -958,7 +961,8 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
if (data_direction != DMA_FROM_DEVICE) {
vq_err(vq, "Received non zero pi_bytesin,"
" but wrong data_direction\n");
- vhost_scsi_send_bad_target(vs, vq, head, out);
+ vhost_scsi_send_bad_target(vs, vq,
+ &used, out);
continue;
}
prot_bytes = vhost32_to_cpu(vq, v_req_pi.pi_bytesin);
@@ -996,7 +1000,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
vq_err(vq, "Received SCSI CDB with command_size: %d that"
" exceeds SCSI_MAX_VARLEN_CDB_SIZE: %d\n",
scsi_command_size(cdb), VHOST_SCSI_MAX_CDB_SIZE);
- vhost_scsi_send_bad_target(vs, vq, head, out);
+ vhost_scsi_send_bad_target(vs, vq, &used, out);
continue;
}
cmd = vhost_scsi_get_tag(vq, tpg, cdb, tag, lun, task_attr,
@@ -1005,7 +1009,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
if (IS_ERR(cmd)) {
vq_err(vq, "vhost_scsi_get_tag failed %ld\n",
PTR_ERR(cmd));
- vhost_scsi_send_bad_target(vs, vq, head, out);
+ vhost_scsi_send_bad_target(vs, vq, &used, out);
continue;
}
cmd->tvc_vhost = vs;
@@ -1025,7 +1029,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
if (unlikely(ret)) {
vq_err(vq, "Failed to map iov to sgl\n");
vhost_scsi_release_cmd(&cmd->tvc_se_cmd);
- vhost_scsi_send_bad_target(vs, vq, head, out);
+ vhost_scsi_send_bad_target(vs, vq, &used, out);
continue;
}
}
@@ -1034,7 +1038,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
* complete the virtio-scsi request in TCM callback context via
* vhost_scsi_queue_data_in() and vhost_scsi_queue_status()
*/
- cmd->tvc_vq_desc = head;
+ cmd->tvc_vq_used = used;
/*
* Dispatch cmd descriptor for cmwq execution in process
* context provided by vhost_scsi_workqueue. This also ensures
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 096a688..296bd5e 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1958,6 +1958,7 @@ static int get_indirect(struct vhost_virtqueue *vq,
* never a valid descriptor number) if none was found. A negative code is
* returned on error. */
int vhost_get_vq_desc(struct vhost_virtqueue *vq,
+ struct vring_used_elem *used,
struct iovec iov[], unsigned int iov_size,
unsigned int *out_num, unsigned int *in_num,
struct vhost_log *log, unsigned int *log_num)
@@ -1990,7 +1991,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
* invalid.
*/
if (vq->avail_idx == last_avail_idx)
- return vq->num;
+ return -ENOSPC;
/* Only get avail ring entries after they have been
* exposed by guest.
@@ -2008,6 +2009,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
return -EFAULT;
}
+ used->id = ring_head;
head = vhost16_to_cpu(vq, ring_head);
/* If their number is silly, that's an error. */
@@ -2096,10 +2098,16 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
/* Assume notifications from guest are disabled at this point,
* if they aren't we would need to update avail_event index. */
BUG_ON(!(vq->used_flags & VRING_USED_F_NO_NOTIFY));
- return head;
+ return 0;
}
EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
+static void vhost_set_used_len(struct vhost_virtqueue *vq,
+ struct vring_used_elem *used, int len)
+{
+ used->len = cpu_to_vhost32(vq, len);
+}
+
/* This is a multi-buffer version of vhost_get_desc, that works if
* vq has read descriptors only.
* @vq - the relevant virtqueue
@@ -2116,13 +2124,13 @@ int vhost_get_bufs(struct vhost_virtqueue *vq,
unsigned *iovcount,
struct vhost_log *log,
unsigned *log_num,
- unsigned int quota)
+ unsigned int quota,
+ s16 *count)
{
unsigned int out, in;
int seg = 0;
int headcount = 0;
- unsigned d;
- int r, nlogs = 0;
+ int r = 0, nlogs = 0;
/* len is always initialized before use since we are always called with
* datalen > 0.
*/
@@ -2133,17 +2141,12 @@ int vhost_get_bufs(struct vhost_virtqueue *vq,
r = -ENOBUFS;
goto err;
}
- r = vhost_get_vq_desc(vq, vq->iov + seg,
+ r = vhost_get_vq_desc(vq, &heads[headcount], vq->iov + seg,
ARRAY_SIZE(vq->iov) - seg, &out,
&in, log, log_num);
if (unlikely(r < 0))
goto err;
- d = r;
- if (d == vq->num) {
- r = 0;
- goto err;
- }
if (unlikely(out || in <= 0)) {
vq_err(vq, "unexpected descriptor format for RX: "
"out %d, in %d\n", out, in);
@@ -2154,24 +2157,26 @@ int vhost_get_bufs(struct vhost_virtqueue *vq,
nlogs += *log_num;
log += *log_num;
}
- heads[headcount].id = cpu_to_vhost32(vq, d);
+
len = iov_length(vq->iov + seg, in);
- heads[headcount].len = cpu_to_vhost32(vq, len);
+ vhost_set_used_len(vq, &heads[headcount], len);
datalen -= len;
++headcount;
seg += in;
}
- heads[headcount - 1].len = cpu_to_vhost32(vq, len + datalen);
+ vhost_set_used_len(vq, &heads[headcount - 1], len + datalen);
*iovcount = seg;
if (unlikely(log))
*log_num = nlogs;
/* Detect overrun */
if (unlikely(datalen > 0)) {
- r = UIO_MAXIOV + 1;
+ headcount = UIO_MAXIOV + 1;
goto err;
}
- return headcount;
+
+ *count = headcount;
+ return 0;
err:
vhost_discard_vq_desc(vq, headcount);
return r;
@@ -2187,14 +2192,11 @@ EXPORT_SYMBOL_GPL(vhost_discard_vq_desc);
/* After we've used one of their buffers, we tell them about it. We'll then
* want to notify the guest, using eventfd. */
-int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int len)
+int vhost_add_used(struct vhost_virtqueue *vq, struct vring_used_elem *used,
+ int len)
{
- struct vring_used_elem heads = {
- cpu_to_vhost32(vq, head),
- cpu_to_vhost32(vq, len)
- };
-
- return vhost_add_used_n(vq, &heads, 1);
+ vhost_set_used_len(vq, used, len);
+ return vhost_add_used_n(vq, used, 1);
}
EXPORT_SYMBOL_GPL(vhost_add_used);
@@ -2327,9 +2329,9 @@ EXPORT_SYMBOL_GPL(vhost_signal);
/* And here's the combo meal deal. Supersize me! */
void vhost_add_used_and_signal(struct vhost_dev *dev,
struct vhost_virtqueue *vq,
- unsigned int head, int len)
+ struct vring_used_elem *used, int len)
{
- vhost_add_used(vq, head, len);
+ vhost_add_used(vq, used, len);
vhost_signal(dev, vq);
}
EXPORT_SYMBOL_GPL(vhost_add_used_and_signal);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 52edd242..a7cc7e7 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -182,6 +182,7 @@ bool vhost_vq_access_ok(struct vhost_virtqueue *vq);
bool vhost_log_access_ok(struct vhost_dev *);
int vhost_get_vq_desc(struct vhost_virtqueue *,
+ struct vring_used_elem *used_elem,
struct iovec iov[], unsigned int iov_count,
unsigned int *out_num, unsigned int *in_num,
struct vhost_log *log, unsigned int *log_num);
@@ -191,15 +192,17 @@ int vhost_get_bufs(struct vhost_virtqueue *vq,
unsigned *iovcount,
struct vhost_log *log,
unsigned *log_num,
- unsigned int quota);
+ unsigned int quota,
+ s16 *count);
void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
int vhost_vq_init_access(struct vhost_virtqueue *);
-int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
+int vhost_add_used(struct vhost_virtqueue *vq,
+ struct vring_used_elem *elem, int len);
int vhost_add_used_n(struct vhost_virtqueue *, struct vring_used_elem *heads,
unsigned count);
void vhost_add_used_and_signal(struct vhost_dev *, struct vhost_virtqueue *,
- unsigned int id, int len);
+ struct vring_used_elem *, int len);
void vhost_add_used_and_signal_n(struct vhost_dev *, struct vhost_virtqueue *,
struct vring_used_elem *heads, unsigned count);
void vhost_signal(struct vhost_dev *, struct vhost_virtqueue *);
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 34bc3ab..59a01cd 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -98,11 +98,12 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
for (;;) {
struct virtio_vsock_pkt *pkt;
+ struct vring_used_elem used;
struct iov_iter iov_iter;
unsigned out, in;
size_t nbytes;
size_t len;
- int head;
+ int ret;
spin_lock_bh(&vsock->send_pkt_list_lock);
if (list_empty(&vsock->send_pkt_list)) {
@@ -116,16 +117,9 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
list_del_init(&pkt->list);
spin_unlock_bh(&vsock->send_pkt_list_lock);
- head = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
- &out, &in, NULL, NULL);
- if (head < 0) {
- spin_lock_bh(&vsock->send_pkt_list_lock);
- list_add(&pkt->list, &vsock->send_pkt_list);
- spin_unlock_bh(&vsock->send_pkt_list_lock);
- break;
- }
-
- if (head == vq->num) {
+ ret = vhost_get_vq_desc(vq, &used, vq->iov, ARRAY_SIZE(vq->iov),
+ &out, &in, NULL, NULL);
+ if (ret == -ENOSPC) {
spin_lock_bh(&vsock->send_pkt_list_lock);
list_add(&pkt->list, &vsock->send_pkt_list);
spin_unlock_bh(&vsock->send_pkt_list_lock);
@@ -139,6 +133,12 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
}
break;
}
+ if (ret < 0) {
+ spin_lock_bh(&vsock->send_pkt_list_lock);
+ list_add(&pkt->list, &vsock->send_pkt_list);
+ spin_unlock_bh(&vsock->send_pkt_list_lock);
+ break;
+ }
if (out) {
virtio_transport_free_pkt(pkt);
@@ -146,7 +146,7 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
break;
}
- len = iov_length(&vq->iov[out], in);
+ len = vhost32_to_cpu(vq, used.len);
iov_iter_init(&iov_iter, READ, &vq->iov[out], in, len);
nbytes = copy_to_iter(&pkt->hdr, sizeof(pkt->hdr), &iov_iter);
@@ -163,7 +163,7 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
break;
}
- vhost_add_used(vq, head, sizeof(pkt->hdr) + pkt->len);
+ vhost_add_used(vq, &used, sizeof(pkt->hdr) + pkt->len);
added = true;
if (pkt->reply) {
@@ -346,7 +346,8 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
struct vhost_vsock *vsock = container_of(vq->dev, struct vhost_vsock,
dev);
struct virtio_vsock_pkt *pkt;
- int head;
+ struct vring_used_elem used;
+ int ret;
unsigned int out, in;
bool added = false;
@@ -367,18 +368,17 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
goto no_more_replies;
}
- head = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
- &out, &in, NULL, NULL);
- if (head < 0)
- break;
-
- if (head == vq->num) {
+ ret = vhost_get_vq_desc(vq, &used, vq->iov, ARRAY_SIZE(vq->iov),
+ &out, &in, NULL, NULL);
+ if (ret == -ENOSPC) {
if (unlikely(vhost_enable_notify(&vsock->dev, vq))) {
vhost_disable_notify(&vsock->dev, vq);
continue;
}
break;
}
+ if (ret < 0)
+ break;
pkt = vhost_vsock_alloc_pkt(vq, out, in);
if (!pkt) {
@@ -397,7 +397,7 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
else
virtio_transport_free_pkt(pkt);
- vhost_add_used(vq, head, sizeof(pkt->hdr) + len);
+ vhost_add_used(vq, &used, sizeof(pkt->hdr) + len);
added = true;
}
--
2.7.4
^ permalink raw reply related
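
The patch above changes vhost_get_vq_desc() so that the descriptor head is handed back through a used element, and the return value only reports status: 0 on success, -ENOSPC when the ring is empty, another negative value on error. A minimal user-space sketch of that calling convention follows. All toy_* names, the 8-entry ring and the struct layouts are hypothetical stand-ins rather than the kernel definitions, and the byte-order conversions done by the real code are left out.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for struct vring_used_elem. */
struct used_elem {
	uint32_t id;
	uint32_t len;
};

/* Hypothetical toy ring, not the real vhost_virtqueue. */
struct toy_ring {
	uint16_t avail[8];    /* heads published by the "guest" */
	uint16_t avail_idx;   /* next slot the guest will fill */
	uint16_t last_avail;  /* next slot the "host" will consume */
};

/*
 * Models the reworked convention: 0 on success with the head stored
 * in used->id, -ENOSPC when nothing is available, -EINVAL on a bad head.
 */
static int toy_get_desc(struct toy_ring *r, struct used_elem *used)
{
	uint16_t head;

	if (r->avail_idx == r->last_avail)
		return -ENOSPC;

	head = r->avail[r->last_avail % 8];
	if (head >= 8)
		return -EINVAL;   /* head number out of range */

	used->id = head;
	used->len = 0;
	r->last_avail++;
	return 0;
}

/* Caller loop: returns the number of heads fetched, or a negative error. */
static int toy_drain(struct toy_ring *r, struct used_elem *out, int max)
{
	int n = 0;

	while (n < max) {
		int ret = toy_get_desc(r, &out[n]);

		if (ret == -ENOSPC)
			break;        /* nothing new: wait for the next kick */
		if (ret < 0)
			return ret;   /* real error: stop handling */
		n++;
	}
	return n;
}
```

Checking for -ENOSPC before the generic "ret < 0" test matters in this sketch for the same reason as in the converted callers: the empty-ring case is now itself a negative value.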
* [RFC V5 PATCH 1/8] vhost: move get_rx_bufs to vhost.c
From: Jason Wang @ 2018-05-29 2:10 UTC (permalink / raw)
To: mst, jasowang
Cc: kvm, virtualization, netdev, linux-kernel, jfreimann, wexu,
tiwei.bie
In-Reply-To: <1527559830-8133-1-git-send-email-jasowang@redhat.com>
Move get_rx_bufs() to vhost.c and rename it to
vhost_get_bufs(). This helps to hide vring internal layout from
specific device implementation. Packed ring implementation will
benefit from this.
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
drivers/vhost/net.c | 83 ++-------------------------------------------------
drivers/vhost/vhost.c | 78 +++++++++++++++++++++++++++++++++++++++++++++++
drivers/vhost/vhost.h | 7 +++++
3 files changed, 88 insertions(+), 80 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 986058a..762aa81 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -664,83 +664,6 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
return len;
}
-/* This is a multi-buffer version of vhost_get_desc, that works if
- * vq has read descriptors only.
- * @vq - the relevant virtqueue
- * @datalen - data length we'll be reading
- * @iovcount - returned count of io vectors we fill
- * @log - vhost log
- * @log_num - log offset
- * @quota - headcount quota, 1 for big buffer
- * returns number of buffer heads allocated, negative on error
- */
-static int get_rx_bufs(struct vhost_virtqueue *vq,
- struct vring_used_elem *heads,
- int datalen,
- unsigned *iovcount,
- struct vhost_log *log,
- unsigned *log_num,
- unsigned int quota)
-{
- unsigned int out, in;
- int seg = 0;
- int headcount = 0;
- unsigned d;
- int r, nlogs = 0;
- /* len is always initialized before use since we are always called with
- * datalen > 0.
- */
- u32 uninitialized_var(len);
-
- while (datalen > 0 && headcount < quota) {
- if (unlikely(seg >= UIO_MAXIOV)) {
- r = -ENOBUFS;
- goto err;
- }
- r = vhost_get_vq_desc(vq, vq->iov + seg,
- ARRAY_SIZE(vq->iov) - seg, &out,
- &in, log, log_num);
- if (unlikely(r < 0))
- goto err;
-
- d = r;
- if (d == vq->num) {
- r = 0;
- goto err;
- }
- if (unlikely(out || in <= 0)) {
- vq_err(vq, "unexpected descriptor format for RX: "
- "out %d, in %d\n", out, in);
- r = -EINVAL;
- goto err;
- }
- if (unlikely(log)) {
- nlogs += *log_num;
- log += *log_num;
- }
- heads[headcount].id = cpu_to_vhost32(vq, d);
- len = iov_length(vq->iov + seg, in);
- heads[headcount].len = cpu_to_vhost32(vq, len);
- datalen -= len;
- ++headcount;
- seg += in;
- }
- heads[headcount - 1].len = cpu_to_vhost32(vq, len + datalen);
- *iovcount = seg;
- if (unlikely(log))
- *log_num = nlogs;
-
- /* Detect overrun */
- if (unlikely(datalen > 0)) {
- r = UIO_MAXIOV + 1;
- goto err;
- }
- return headcount;
-err:
- vhost_discard_vq_desc(vq, headcount);
- return r;
-}
-
/* Expects to be always run from workqueue - which acts as
* read-size critical section for our kind of RCU. */
static void handle_rx(struct vhost_net *net)
@@ -790,9 +713,9 @@ static void handle_rx(struct vhost_net *net)
while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk))) {
sock_len += sock_hlen;
vhost_len = sock_len + vhost_hlen;
- headcount = get_rx_bufs(vq, vq->heads + nheads, vhost_len,
- &in, vq_log, &log,
- likely(mergeable) ? UIO_MAXIOV : 1);
+ headcount = vhost_get_bufs(vq, vq->heads + nheads, vhost_len,
+ &in, vq_log, &log,
+ likely(mergeable) ? UIO_MAXIOV : 1);
/* On error, stop handling until the next kick. */
if (unlikely(headcount < 0))
goto out;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index f0be5f3..096a688 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2100,6 +2100,84 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
}
EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
+/* This is a multi-buffer version of vhost_get_desc, that works if
+ * vq has read descriptors only.
+ * @vq - the relevant virtqueue
+ * @datalen - data length we'll be reading
+ * @iovcount - returned count of io vectors we fill
+ * @log - vhost log
+ * @log_num - log offset
+ * @quota - headcount quota, 1 for big buffer
+ * returns number of buffer heads allocated, negative on error
+ */
+int vhost_get_bufs(struct vhost_virtqueue *vq,
+ struct vring_used_elem *heads,
+ int datalen,
+ unsigned *iovcount,
+ struct vhost_log *log,
+ unsigned *log_num,
+ unsigned int quota)
+{
+ unsigned int out, in;
+ int seg = 0;
+ int headcount = 0;
+ unsigned d;
+ int r, nlogs = 0;
+ /* len is always initialized before use since we are always called with
+ * datalen > 0.
+ */
+ u32 uninitialized_var(len);
+
+ while (datalen > 0 && headcount < quota) {
+ if (unlikely(seg >= UIO_MAXIOV)) {
+ r = -ENOBUFS;
+ goto err;
+ }
+ r = vhost_get_vq_desc(vq, vq->iov + seg,
+ ARRAY_SIZE(vq->iov) - seg, &out,
+ &in, log, log_num);
+ if (unlikely(r < 0))
+ goto err;
+
+ d = r;
+ if (d == vq->num) {
+ r = 0;
+ goto err;
+ }
+ if (unlikely(out || in <= 0)) {
+ vq_err(vq, "unexpected descriptor format for RX: "
+ "out %d, in %d\n", out, in);
+ r = -EINVAL;
+ goto err;
+ }
+ if (unlikely(log)) {
+ nlogs += *log_num;
+ log += *log_num;
+ }
+ heads[headcount].id = cpu_to_vhost32(vq, d);
+ len = iov_length(vq->iov + seg, in);
+ heads[headcount].len = cpu_to_vhost32(vq, len);
+ datalen -= len;
+ ++headcount;
+ seg += in;
+ }
+ heads[headcount - 1].len = cpu_to_vhost32(vq, len + datalen);
+ *iovcount = seg;
+ if (unlikely(log))
+ *log_num = nlogs;
+
+ /* Detect overrun */
+ if (unlikely(datalen > 0)) {
+ r = UIO_MAXIOV + 1;
+ goto err;
+ }
+ return headcount;
+err:
+ vhost_discard_vq_desc(vq, headcount);
+ return r;
+}
+EXPORT_SYMBOL_GPL(vhost_get_bufs);
+
/* Reverse the effect of vhost_get_vq_desc. Useful for error handling. */
void vhost_discard_vq_desc(struct vhost_virtqueue *vq, int n)
{
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 6c844b9..52edd242 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -185,6 +185,13 @@ int vhost_get_vq_desc(struct vhost_virtqueue *,
struct iovec iov[], unsigned int iov_count,
unsigned int *out_num, unsigned int *in_num,
struct vhost_log *log, unsigned int *log_num);
+int vhost_get_bufs(struct vhost_virtqueue *vq,
+ struct vring_used_elem *heads,
+ int datalen,
+ unsigned *iovcount,
+ struct vhost_log *log,
+ unsigned *log_num,
+ unsigned int quota);
void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
int vhost_vq_init_access(struct vhost_virtqueue *);
--
2.7.4
^ permalink raw reply related
* [RFC V5 PATCH 0/8] Packed ring layout for vhost
From: Jason Wang @ 2018-05-29 2:10 UTC (permalink / raw)
To: mst, jasowang
Cc: kvm, virtualization, netdev, linux-kernel, jfreimann, wexu,
tiwei.bie
Hi all:
This RFC implements the packed ring layout. The code was tested with
Tiwei's RFC V5 at https://lkml.org/lkml/2018/5/22/138. Some fixups and
tweaks were needed on top of Tiwei's code to make it run with event
index.
Pktgen reports about 20% improvement on TX PPS when doing pktgen from
guest to host. No obvious improvement on RX PPS. Many optimizations
can be done on top, but for simplicity and correctness first, this
version does not do much.
This version was tested with:
- Zerocopy (Out of Order) support
- vIOMMU support
- mergeable buffer on/off
- busy polling on/off
Notes for testers:
- Starting from this version, vhost needs qemu co-operation to work
correctly. Or you can comment out the packed-specific code for
GET/SET_VRING_BASE.
Changes from V4:
- fix signalled_used index recording
- track avail index correctly
- various minor fixes
Changes from V3:
- Fix math on event idx checking
- Sync last avail wrap counter through GET/SET_VRING_BASE
- remove desc_event prefix in the driver/device structure
Changes from V2:
- do not use & in checking desc_event_flags
- off should be most significant bit
- remove the workaround of mergeable buffer for dpdk prototype
- id should be in the last descriptor in the chain
- keep _F_WRITE for write descriptor when adding used
- device flags updating should use ADDR_USED type
- return error on unexpected unavail descriptor in a chain
- return false in vhost_ve_avail_empty if descriptor is available
- track last seen avail_wrap_counter
- correctly examine available descriptor in get_indirect_packed()
- vhost_idx_diff should return u16 instead of bool
Changes from V1:
- Refactor vhost used elem code to avoid open coding on used elem
- Event suppression support (compile test only).
- Indirect descriptor support (compile test only).
- Zerocopy support.
- vIOMMU support.
- SCSI/VSOCK support (compile test only).
- Fix several bugs
Jason Wang (8):
vhost: move get_rx_bufs to vhost.c
vhost: hide used ring layout from device
vhost: do not use vring_used_elem
vhost_net: do not explicitly manipulate vhost_used_elem
vhost: vhost_put_user() can accept metadata type
virtio: introduce packed ring defines
vhost: packed ring support
vhost: event suppression for packed ring
drivers/vhost/net.c | 144 ++----
drivers/vhost/scsi.c | 62 +--
drivers/vhost/vhost.c | 926 +++++++++++++++++++++++++++++++++----
drivers/vhost/vhost.h | 52 ++-
drivers/vhost/vsock.c | 42 +-
include/uapi/linux/virtio_config.h | 9 +
include/uapi/linux/virtio_ring.h | 32 ++
7 files changed, 1000 insertions(+), 267 deletions(-)
--
2.7.4
^ permalink raw reply
* RE: [PATCH, net-next] net: ethernet: freescale: fix false-positive string overflow warning
From: Andy Duan @ 2018-05-29 1:10 UTC (permalink / raw)
To: Arnd Bergmann, David S. Miller
Cc: Fabio Estevam, Andrew Lunn, Troy Kisky, Florian Fainelli,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20180528154958.2684086-1-arnd@arndb.de>
From: Arnd Bergmann <arnd@arndb.de> Sent: May 28, 2018 23:50
> While compile-testing on arm64 with gcc-8.1, I ran into a build diagnostic:
>
> drivers/net/ethernet/freescale/fec_main.c: In function 'fec_probe':
> drivers/net/ethernet/freescale/fec_main.c:3517:25: error: '%d' directive
> writing between 1 and 10 bytes into a region of size 5
> [-Werror=format-overflow=]
> sprintf(irq_name, "int%d", i);
> ^~
> drivers/net/ethernet/freescale/fec_main.c:3517:21: note: directive
> argument in the range [0, 2147483646]
> sprintf(irq_name, "int%d", i);
> ^~~~~~~
> drivers/net/ethernet/freescale/fec_main.c:3517:3: note: 'sprintf' output
> between 5 and 14 bytes into a destination of size 8
> sprintf(irq_name, "int%d", i);
> ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> It appears this has never shown on ppc32 or arm32 for an unknown
> reason, but now gcc fails to identify that the 'irq_cnt' loop index has an
> upper bound of 3, and instead uses a bogus range.
>
> To work around the warning, this changes the sprintf to snprintf with the
> correct buffer length.
>
> Fixes: 78cc6e7ef957 ("net: ethernet: freescale: Allow FEC with
> COMPILE_TEST")
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Fugang Duan <fugang.duan@nxp.com>
> ---
> drivers/net/ethernet/freescale/fec_main.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/freescale/fec_main.c
> b/drivers/net/ethernet/freescale/fec_main.c
> index ab7521c04eb2..c729665107f5 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -3514,7 +3514,7 @@ fec_probe(struct platform_device *pdev)
> goto failed_init;
>
> for (i = 0; i < irq_cnt; i++) {
> - sprintf(irq_name, "int%d", i);
> + snprintf(irq_name, sizeof(irq_name), "int%d", i);
> irq = platform_get_irq_byname(pdev, irq_name);
> if (irq < 0)
> irq = platform_get_irq(pdev, i);
> --
> 2.9.0
^ permalink raw reply
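
The fix quoted above swaps sprintf() for snprintf() with the destination size. As a small user-space illustration of why that bounds the write, here is a sketch; format_irq_name() is a hypothetical helper, not part of the fec driver, and the truncation check is an extra that the patch itself does not add.

```c
#include <stdio.h>

/*
 * Formats "int<i>" into buf without ever writing more than size bytes.
 * Returns 0 if the whole name fit, -1 if it was truncated or if
 * formatting failed.
 */
static int format_irq_name(char *buf, size_t size, int i)
{
	int n = snprintf(buf, size, "int%d", i);

	if (n < 0 || (size_t)n >= size)
		return -1;
	return 0;
}
```

With an 8-byte buffer, as in the gcc diagnostic, any index that needs more than four digits is truncated instead of overflowing the buffer.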
* Re: [PATCH v3 net-next 2/2] tcp: minor optimization around tcp_hdr() usage in tcp receive path
From: Yafang Shao @ 2018-05-29 0:41 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Song Liu, David Miller, netdev, LKML
In-Reply-To: <CANn89iKU49yBRq4x8xHGXiWZ9h0PNAmyWnMoNDFmSm9oKXsbqw@mail.gmail.com>
On Tue, May 29, 2018 at 12:36 AM, Eric Dumazet <edumazet@google.com> wrote:
> On Mon, May 28, 2018 at 8:36 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
>> This is additional to the commit ea1627c20c34 ("tcp: minor optimizations
> around tcp_hdr() usage").
>> At this point, skb->data is same with tcp_hdr() as tcp header has not
>> been pulled yet.
>
>> Cc: Eric Dumazet <edumazet@google.com>
>> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
>> ---
>> net/ipv4/tcp_ipv4.c | 2 +-
>> net/ipv6/tcp_ipv6.c | 2 +-
>> 2 files changed, 2 insertions(+), 2 deletions(-)
>
>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
>> index adbdb50..d179386 100644
>> --- a/net/ipv4/tcp_ipv4.c
>> +++ b/net/ipv4/tcp_ipv4.c
>> @@ -1486,7 +1486,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff
> *skb)
>> sk->sk_rx_dst = NULL;
>> }
>> }
>> - tcp_rcv_established(sk, skb, tcp_hdr(skb));
>> + tcp_rcv_established(sk, skb, (const struct tcphdr
> *)skb->data);
>> return 0;
>> }
>
>> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
>> index 7d47c2b..1c633ff 100644
>> --- a/net/ipv6/tcp_ipv6.c
>> +++ b/net/ipv6/tcp_ipv6.c
>> @@ -1322,7 +1322,7 @@ static int tcp_v6_do_rcv(struct sock *sk, struct
> sk_buff *skb)
>> }
>> }
>
>> - tcp_rcv_established(sk, skb, tcp_hdr(skb));
>> + tcp_rcv_established(sk, skb, (const struct tcphdr
> *)skb->data);
>> if (opt_skb)
>> goto ipv6_pktoptions;
>> return 0;
>> --
>> 1.8.3.1
>
>
> I would rather remove the third parameter of tcp_rcv_established() instead
> of duplicating the cast.
OK.
And what about introducing a new helper tcp_hdr_fast() ?
/* use it when tcp header has not been pulled yet */
static inline const struct tcphdr *tcp_hdr_fast(const struct sk_buff *skb)
{
return (const struct tcphdr *)skb->data;
}
That could help us to use this optimized one instead of the original
one where possible.
Thanks
Yafang
^ permalink raw reply
* [PATCH 5/9] ipvs: fix buffer overflow with sync daemon and service
From: Pablo Neira Ayuso @ 2018-05-28 23:42 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20180528234221.31254-1-pablo@netfilter.org>
From: Julian Anastasov <ja@ssi.bg>
syzkaller reports a buffer overflow for the interface name
when starting sync daemons [1].
We copy the user structure into a larger stack buffer, but
later search for NUL past the end of the stack buffer.
The same happens for sched_name when adding/editing virtual server.
We are restricted by IP_VS_SCHEDNAME_MAXLEN and IP_VS_IFNAME_MAXLEN
being used as size in include/uapi/linux/ip_vs.h, so they
include the space for NUL.
As using strlcpy is wrong for unsafe source, replace it with
strscpy and add checks to return EINVAL if source string is not
NUL-terminated. The incomplete strlcpy fix comes from 2.6.13.
For the netlink interface reduce the len parameter for
IPVS_DAEMON_ATTR_MCAST_IFN and IPVS_SVC_ATTR_SCHED_NAME,
so that we get proper EINVAL.
[1]
kernel BUG at lib/string.c:1052!
invalid opcode: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 373 Comm: syz-executor936 Not tainted 4.17.0-rc4+ #45
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:fortify_panic+0x13/0x20 lib/string.c:1051
RSP: 0018:ffff8801c976f800 EFLAGS: 00010282
RAX: 0000000000000022 RBX: 0000000000000040 RCX: 0000000000000000
RDX: 0000000000000022 RSI: ffffffff8160f6f1 RDI: ffffed00392edef6
RBP: ffff8801c976f800 R08: ffff8801cf4c62c0 R09: ffffed003b5e4fb0
R10: ffffed003b5e4fb0 R11: ffff8801daf27d87 R12: ffff8801c976fa20
R13: ffff8801c976fae4 R14: ffff8801c976fae0 R15: 000000000000048b
FS: 00007fd99f75e700(0000) GS:ffff8801daf00000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000200001c0 CR3: 00000001d6843000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
strlen include/linux/string.h:270 [inline]
strlcpy include/linux/string.h:293 [inline]
do_ip_vs_set_ctl+0x31c/0x1d00 net/netfilter/ipvs/ip_vs_ctl.c:2388
nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
nf_setsockopt+0x7d/0xd0 net/netfilter/nf_sockopt.c:115
ip_setsockopt+0xd8/0xf0 net/ipv4/ip_sockglue.c:1253
udp_setsockopt+0x62/0xa0 net/ipv4/udp.c:2487
ipv6_setsockopt+0x149/0x170 net/ipv6/ipv6_sockglue.c:917
tcp_setsockopt+0x93/0xe0 net/ipv4/tcp.c:3057
sock_common_setsockopt+0x9a/0xe0 net/core/sock.c:3046
__sys_setsockopt+0x1bd/0x390 net/socket.c:1903
__do_sys_setsockopt net/socket.c:1914 [inline]
__se_sys_setsockopt net/socket.c:1911 [inline]
__x64_sys_setsockopt+0xbe/0x150 net/socket.c:1911
do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x447369
RSP: 002b:00007fd99f75dda8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 00000000006e39e4 RCX: 0000000000447369
RDX: 000000000000048b RSI: 0000000000000000 RDI: 0000000000000003
RBP: 0000000000000000 R08: 0000000000000018 R09: 0000000000000000
R10: 00000000200001c0 R11: 0000000000000246 R12: 00000000006e39e0
R13: 75a1ff93f0896195 R14: 6f745f3168746576 R15: 0000000000000001
Code: 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 0b 48 89 df e8 d2 8f 48 fa eb
de 55 48 89 fe 48 c7 c7 60 65 64 88 48 89 e5 e8 91 dd f3 f9 <0f> 0b 90 90
90 90 90 90 90 90 90 90 90 55 48 89 e5 41 57 41 56
RIP: fortify_panic+0x13/0x20 lib/string.c:1051 RSP: ffff8801c976f800
Reported-and-tested-by: syzbot+aac887f77319868646df@syzkaller.appspotmail.com
Fixes: e4ff67513096 ("ipvs: add sync_maxlen parameter for the sync daemon")
Fixes: 4da62fc70d7c ("[IPVS]: Fix for overflows")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/netfilter/ipvs/ip_vs_ctl.c | 21 +++++++++++++++------
1 file changed, 15 insertions(+), 6 deletions(-)
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index f36098887ad0..3ecca0616d8c 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -2381,8 +2381,10 @@ do_ip_vs_set_ctl(struct sock *sk, int cmd, void __user *user, unsigned int len)
struct ipvs_sync_daemon_cfg cfg;
memset(&cfg, 0, sizeof(cfg));
- strlcpy(cfg.mcast_ifn, dm->mcast_ifn,
- sizeof(cfg.mcast_ifn));
+ ret = -EINVAL;
+ if (strscpy(cfg.mcast_ifn, dm->mcast_ifn,
+ sizeof(cfg.mcast_ifn)) <= 0)
+ goto out_dec;
cfg.syncid = dm->syncid;
ret = start_sync_thread(ipvs, &cfg, dm->state);
} else {
@@ -2420,12 +2422,19 @@ do_ip_vs_set_ctl(struct sock *sk, int cmd, void __user *user, unsigned int len)
}
}
+ if ((cmd == IP_VS_SO_SET_ADD || cmd == IP_VS_SO_SET_EDIT) &&
+ strnlen(usvc.sched_name, IP_VS_SCHEDNAME_MAXLEN) ==
+ IP_VS_SCHEDNAME_MAXLEN) {
+ ret = -EINVAL;
+ goto out_unlock;
+ }
+
/* Check for valid protocol: TCP or UDP or SCTP, even for fwmark!=0 */
if (usvc.protocol != IPPROTO_TCP && usvc.protocol != IPPROTO_UDP &&
usvc.protocol != IPPROTO_SCTP) {
- pr_err("set_ctl: invalid protocol: %d %pI4:%d %s\n",
+ pr_err("set_ctl: invalid protocol: %d %pI4:%d\n",
usvc.protocol, &usvc.addr.ip,
- ntohs(usvc.port), usvc.sched_name);
+ ntohs(usvc.port));
ret = -EFAULT;
goto out_unlock;
}
@@ -2847,7 +2856,7 @@ static const struct nla_policy ip_vs_cmd_policy[IPVS_CMD_ATTR_MAX + 1] = {
static const struct nla_policy ip_vs_daemon_policy[IPVS_DAEMON_ATTR_MAX + 1] = {
[IPVS_DAEMON_ATTR_STATE] = { .type = NLA_U32 },
[IPVS_DAEMON_ATTR_MCAST_IFN] = { .type = NLA_NUL_STRING,
- .len = IP_VS_IFNAME_MAXLEN },
+ .len = IP_VS_IFNAME_MAXLEN - 1 },
[IPVS_DAEMON_ATTR_SYNC_ID] = { .type = NLA_U32 },
[IPVS_DAEMON_ATTR_SYNC_MAXLEN] = { .type = NLA_U16 },
[IPVS_DAEMON_ATTR_MCAST_GROUP] = { .type = NLA_U32 },
@@ -2865,7 +2874,7 @@ static const struct nla_policy ip_vs_svc_policy[IPVS_SVC_ATTR_MAX + 1] = {
[IPVS_SVC_ATTR_PORT] = { .type = NLA_U16 },
[IPVS_SVC_ATTR_FWMARK] = { .type = NLA_U32 },
[IPVS_SVC_ATTR_SCHED_NAME] = { .type = NLA_NUL_STRING,
- .len = IP_VS_SCHEDNAME_MAXLEN },
+ .len = IP_VS_SCHEDNAME_MAXLEN - 1 },
[IPVS_SVC_ATTR_PE_NAME] = { .type = NLA_NUL_STRING,
.len = IP_VS_PENAME_MAXLEN },
[IPVS_SVC_ATTR_FLAGS] = { .type = NLA_BINARY,
--
2.11.0
^ permalink raw reply related
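
The ipvs fix above relies on a copy that never scans the source beyond the given size and that reports a missing NUL, where strlcpy() calls strlen() on the source first. A user-space approximation of that behaviour is sketched below. bounded_copy() and copy_ifname() are hypothetical names; the return convention (copied length, or -E2BIG on truncation) is modelled on the kernel's strscpy(), but this is not the kernel implementation.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Copies at most size - 1 bytes, always NUL-terminates dst when
 * size > 0, and never reads src beyond its first size bytes.
 * Returns the copied length, or -E2BIG if src has no NUL within
 * the first size bytes.
 */
static long bounded_copy(char *dst, const char *src, size_t size)
{
	const char *nul;
	size_t len;

	if (size == 0)
		return -E2BIG;

	nul = memchr(src, '\0', size);   /* bounded scan, unlike strlen() */
	if (nul == NULL) {
		memcpy(dst, src, size - 1);
		dst[size - 1] = '\0';
		return -E2BIG;
	}
	len = (size_t)(nul - src);
	memcpy(dst, src, len + 1);
	return (long)len;
}

/* Mirrors the "<= 0" check in the patch: empty or unterminated is invalid. */
static int copy_ifname(char *dst, const char *src, size_t size)
{
	return bounded_copy(dst, src, size) <= 0 ? -EINVAL : 0;
}
```

A source field filled with non-NUL bytes is rejected with -EINVAL instead of being scanned past its end.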
* [PATCH 9/9] netfilter: nf_tables: increase nft_counters_enabled in nft_chain_stats_replace()
From: Pablo Neira Ayuso @ 2018-05-28 23:42 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20180528234221.31254-1-pablo@netfilter.org>
From: Taehee Yoo <ap420073@gmail.com>
When a chain is updated, a counter can be attached. If so,
nft_counters_enabled should be incremented.
test commands:
%nft add table ip filter
%nft add chain ip filter input { type filter hook input priority 4\; }
%iptables-compat -Z input
%nft delete chain ip filter input
We can see the messages below:
[ 286.443720] jump label: negative count!
[ 286.448278] WARNING: CPU: 0 PID: 1459 at kernel/jump_label.c:197 __static_key_slow_dec_cpuslocked+0x6f/0xf0
[ 286.449144] Modules linked in: nf_tables nfnetlink ip_tables x_tables
[ 286.449144] CPU: 0 PID: 1459 Comm: nft Tainted: G W 4.17.0-rc2+ #12
[ 286.449144] RIP: 0010:__static_key_slow_dec_cpuslocked+0x6f/0xf0
[ 286.449144] RSP: 0018:ffff88010e5176f0 EFLAGS: 00010286
[ 286.449144] RAX: 000000000000001b RBX: ffffffffc0179500 RCX: ffffffffb8a82522
[ 286.449144] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff88011b7e5eac
[ 286.449144] RBP: 0000000000000000 R08: ffffed00236fce5c R09: ffffed00236fce5b
[ 286.449144] R10: ffffffffc0179503 R11: ffffed00236fce5c R12: 0000000000000000
[ 286.449144] R13: ffff88011a28e448 R14: ffff88011a28e470 R15: dffffc0000000000
[ 286.449144] FS: 00007f0384328700(0000) GS:ffff88011b600000(0000) knlGS:0000000000000000
[ 286.449144] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 286.449144] CR2: 00007f038394bf10 CR3: 0000000104a86000 CR4: 00000000001006f0
[ 286.449144] Call Trace:
[ 286.449144] static_key_slow_dec+0x6a/0x70
[ 286.449144] nf_tables_chain_destroy+0x19d/0x210 [nf_tables]
[ 286.449144] nf_tables_commit+0x1891/0x1c50 [nf_tables]
[ 286.449144] nfnetlink_rcv+0x1148/0x13d0 [nfnetlink]
[ ... ]
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/netfilter/nf_tables_api.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 2bdc8767aa40..501e48a7965b 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -1298,8 +1298,10 @@ static void nft_chain_stats_replace(struct nft_base_chain *chain,
rcu_assign_pointer(chain->stats, newstats);
synchronize_rcu();
free_percpu(oldstats);
- } else
+ } else {
rcu_assign_pointer(chain->stats, newstats);
+ static_branch_inc(&nft_counters_enabled);
+ }
}
static void nf_tables_chain_destroy(struct nft_ctx *ctx)
--
2.11.0
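To illustrate the imbalance described in the commit message, here is a minimal user-space sketch. It is not the kernel code: counters_enabled, struct chain, chain_stats_replace() and chain_destroy() are hypothetical stand-ins for the static key and the nf_tables helpers. It assumes, as the trace suggests, that destroying a chain with stats attached decrements the counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the nft_counters_enabled static key. */
static int counters_enabled;

/* Hypothetical chain: stats == NULL means no counter is attached. */
struct chain {
    const int *stats;
};

/*
 * Attach or replace stats. Only the NULL -> non-NULL transition adds a
 * user of the counter; this is the increment the patch introduces.
 */
static void chain_stats_replace(struct chain *c, const int *newstats)
{
    if (c->stats == NULL)
        counters_enabled++;
    c->stats = newstats;
}

/* Destroying a chain that has stats drops one user of the counter. */
static void chain_destroy(struct chain *c)
{
    if (c->stats != NULL) {
        counters_enabled--;
        c->stats = NULL;
    }
}
```

Without the increment on first attach, the decrement at destroy time would take the count below zero, which is what the "jump label: negative count!" warning reports.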
* [PATCH 6/9] netfilter: provide correct argument to nla_strlcpy()
From: Pablo Neira Ayuso @ 2018-05-28 23:42 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20180528234221.31254-1-pablo@netfilter.org>
From: Eric Dumazet <edumazet@google.com>
A recent patch forgot to remove nla_data(), upsetting syzkaller a bit.
BUG: KASAN: slab-out-of-bounds in nla_strlcpy+0x13d/0x150 lib/nlattr.c:314
Read of size 1 at addr ffff8801ad1f4fdd by task syz-executor189/4509
CPU: 1 PID: 4509 Comm: syz-executor189 Not tainted 4.17.0-rc6+ #62
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1b9/0x294 lib/dump_stack.c:113
print_address_description+0x6c/0x20b mm/kasan/report.c:256
kasan_report_error mm/kasan/report.c:354 [inline]
kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
__asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430
nla_strlcpy+0x13d/0x150 lib/nlattr.c:314
nfnl_acct_new+0x574/0xc50 net/netfilter/nfnetlink_acct.c:118
nfnetlink_rcv_msg+0xdb5/0xff0 net/netfilter/nfnetlink.c:212
netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2448
nfnetlink_rcv+0x1fe/0x1ba0 net/netfilter/nfnetlink.c:513
netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
netlink_unicast+0x58b/0x740 net/netlink/af_netlink.c:1336
netlink_sendmsg+0x9f0/0xfa0 net/netlink/af_netlink.c:1901
sock_sendmsg_nosec net/socket.c:629 [inline]
sock_sendmsg+0xd5/0x120 net/socket.c:639
sock_write_iter+0x35a/0x5a0 net/socket.c:908
call_write_iter include/linux/fs.h:1784 [inline]
new_sync_write fs/read_write.c:474 [inline]
__vfs_write+0x64d/0x960 fs/read_write.c:487
vfs_write+0x1f8/0x560 fs/read_write.c:549
ksys_write+0xf9/0x250 fs/read_write.c:598
__do_sys_write fs/read_write.c:610 [inline]
__se_sys_write fs/read_write.c:607 [inline]
__x64_sys_write+0x73/0xb0 fs/read_write.c:607
Fixes: 4e09fc873d92 ("netfilter: prefer nla_strlcpy for dealing with NLA_STRING attributes")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Florian Westphal <fw@strlen.de>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/netfilter/nfnetlink_acct.c | 2 +-
net/netfilter/nfnetlink_cthelper.c | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/net/netfilter/nfnetlink_acct.c b/net/netfilter/nfnetlink_acct.c
index 6ddf89183e7b..a0e5adf0b3b6 100644
--- a/net/netfilter/nfnetlink_acct.c
+++ b/net/netfilter/nfnetlink_acct.c
@@ -115,7 +115,7 @@ static int nfnl_acct_new(struct net *net, struct sock *nfnl,
nfacct->flags = flags;
}
- nla_strlcpy(nfacct->name, nla_data(tb[NFACCT_NAME]), NFACCT_NAME_MAX);
+ nla_strlcpy(nfacct->name, tb[NFACCT_NAME], NFACCT_NAME_MAX);
if (tb[NFACCT_BYTES]) {
atomic64_set(&nfacct->bytes,
diff --git a/net/netfilter/nfnetlink_cthelper.c b/net/netfilter/nfnetlink_cthelper.c
index fa026b269b36..cb5b5f207777 100644
--- a/net/netfilter/nfnetlink_cthelper.c
+++ b/net/netfilter/nfnetlink_cthelper.c
@@ -150,7 +150,7 @@ nfnl_cthelper_expect_policy(struct nf_conntrack_expect_policy *expect_policy,
return -EINVAL;
nla_strlcpy(expect_policy->name,
- nla_data(tb[NFCTH_POLICY_NAME]), NF_CT_HELPER_NAME_LEN);
+ tb[NFCTH_POLICY_NAME], NF_CT_HELPER_NAME_LEN);
expect_policy->max_expected =
ntohl(nla_get_be32(tb[NFCTH_POLICY_EXPECT_MAX]));
if (expect_policy->max_expected > NF_CT_EXPECT_MAX_CNT)
@@ -235,7 +235,7 @@ nfnl_cthelper_create(const struct nlattr * const tb[],
goto err1;
nla_strlcpy(helper->name,
- nla_data(tb[NFCTH_NAME]), NF_CT_HELPER_NAME_LEN);
+ tb[NFCTH_NAME], NF_CT_HELPER_NAME_LEN);
size = ntohl(nla_get_be32(tb[NFCTH_PRIV_DATA_LEN]));
if (size > FIELD_SIZEOF(struct nf_conn_help, data)) {
ret = -ENOMEM;
--
2.11.0
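For readers unfamiliar with why the extra nla_data() matters, the sketch below models the calling convention in user space. struct toy_attr, TOY_HDRLEN and toy_strlcpy() are hypothetical, simplified stand-ins, not the kernel definitions. The point is only that the copy helper expects a pointer to the attribute header and reads the length from it, so handing it a pointer to the payload makes it interpret string bytes as a header.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified attribute: a small header followed by payload. */
struct toy_attr {
    unsigned short len;   /* header size plus payload size */
    unsigned short type;
    char payload[16];
};

#define TOY_HDRLEN offsetof(struct toy_attr, payload)

/*
 * Copy the attribute payload into dst, always NUL-terminating when
 * dstsize > 0. The length comes from the attribute header, so attr must
 * point at the header, never at the payload. Returns the source length.
 */
static size_t toy_strlcpy(char *dst, const struct toy_attr *attr,
                          size_t dstsize)
{
    size_t srclen = attr->len - TOY_HDRLEN;
    const char *src = attr->payload;

    /* Ignore a trailing NUL in the payload, if present. */
    if (srclen > 0 && src[srclen - 1] == '\0')
        srclen--;

    if (dstsize > 0) {
        size_t n = srclen >= dstsize ? dstsize - 1 : srclen;

        memset(dst, 0, dstsize);
        memcpy(dst, src, n);
    }
    return srclen;
}
```

A correct call passes the attribute itself, as in toy_strlcpy(name, &attr, sizeof(name)), which mirrors the tb[NFACCT_NAME] form used in the fix.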
* [PATCH 8/9] netfilter: nf_tables: fix NULL-ptr in nf_tables_dump_obj()
From: Pablo Neira Ayuso @ 2018-05-28 23:42 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20180528234221.31254-1-pablo@netfilter.org>
From: Taehee Yoo <ap420073@gmail.com>
The table field in nft_obj_filter is a pointer, not an array, so
filter->table[0] dereferences NULL when no table name is given. To check
the table name, we should test whether the pointer is set.
Test commands:
%nft add table ip filter
%nft add counter ip filter ct1
%nft reset counters
Splat looks like:
[ 306.510504] kasan: CONFIG_KASAN_INLINE enabled
[ 306.516184] kasan: GPF could be caused by NULL-ptr deref or user memory access
[ 306.524775] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[ 306.528284] Modules linked in: nft_objref nft_counter nf_tables nfnetlink ip_tables x_tables
[ 306.528284] CPU: 0 PID: 1488 Comm: nft Not tainted 4.17.0-rc4+ #17
[ 306.528284] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 07/08/2015
[ 306.528284] RIP: 0010:nf_tables_dump_obj+0x52c/0xa70 [nf_tables]
[ 306.528284] RSP: 0018:ffff8800b6cb7520 EFLAGS: 00010246
[ 306.528284] RAX: 0000000000000000 RBX: ffff8800b6c49820 RCX: 0000000000000000
[ 306.528284] RDX: 0000000000000000 RSI: dffffc0000000000 RDI: ffffed0016d96e9a
[ 306.528284] RBP: ffff8800b6cb75c0 R08: ffffed00236fce7c R09: ffffed00236fce7b
[ 306.528284] R10: ffffffff9f6241e8 R11: ffffed00236fce7c R12: ffff880111365108
[ 306.528284] R13: 0000000000000000 R14: ffff8800b6c49860 R15: ffff8800b6c49860
[ 306.528284] FS: 00007f838b007700(0000) GS:ffff88011b600000(0000) knlGS:0000000000000000
[ 306.528284] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 306.528284] CR2: 00007ffeafabcf78 CR3: 00000000b6cbe000 CR4: 00000000001006f0
[ 306.528284] Call Trace:
[ 306.528284] netlink_dump+0x470/0xa20
[ 306.528284] __netlink_dump_start+0x5ae/0x690
[ 306.528284] ? nf_tables_getobj+0x1b3/0x740 [nf_tables]
[ 306.528284] nf_tables_getobj+0x2f5/0x740 [nf_tables]
[ 306.528284] ? nft_obj_notify+0x100/0x100 [nf_tables]
[ 306.528284] ? nf_tables_getobj+0x740/0x740 [nf_tables]
[ 306.528284] ? nf_tables_dump_flowtable_done+0x70/0x70 [nf_tables]
[ 306.528284] ? nft_obj_notify+0x100/0x100 [nf_tables]
[ 306.528284] nfnetlink_rcv_msg+0x8ff/0x932 [nfnetlink]
[ 306.528284] ? nfnetlink_rcv_msg+0x216/0x932 [nfnetlink]
[ 306.528284] netlink_rcv_skb+0x1c9/0x2f0
[ 306.528284] ? nfnetlink_bind+0x1d0/0x1d0 [nfnetlink]
[ 306.528284] ? debug_check_no_locks_freed+0x270/0x270
[ 306.528284] ? netlink_ack+0x7a0/0x7a0
[ 306.528284] ? ns_capable_common+0x6e/0x110
[ ... ]
Fixes: e46abbcc05aa8 ("netfilter: nf_tables: Allow table names of up to 255 chars")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/netfilter/nf_tables_api.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 91e80aa852d6..2bdc8767aa40 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -4706,7 +4706,7 @@ static int nf_tables_dump_obj(struct sk_buff *skb, struct netlink_callback *cb)
if (idx > s_idx)
memset(&cb->args[1], 0,
sizeof(cb->args) - sizeof(cb->args[0]));
- if (filter && filter->table[0] &&
+ if (filter && filter->table &&
strcmp(filter->table, table->name))
goto cont;
if (filter &&
@@ -5380,7 +5380,7 @@ static int nf_tables_dump_flowtable(struct sk_buff *skb,
if (idx > s_idx)
memset(&cb->args[1], 0,
sizeof(cb->args) - sizeof(cb->args[0]));
- if (filter && filter->table[0] &&
+ if (filter && filter->table &&
strcmp(filter->table, table->name))
goto cont;
--
2.11.0
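The corrected condition can be sketched in user space as follows. struct obj_filter and skip_table() are hypothetical names used only for illustration; the real filter structure has more fields.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the dump filter; table may be NULL. */
struct obj_filter {
    const char *table;
};

/*
 * Return 1 if an entry in table `name` should be skipped, 0 otherwise.
 * The pointer itself is tested before strcmp(); testing table[0] instead
 * would dereference NULL when no table name was supplied.
 */
static int skip_table(const struct obj_filter *filter, const char *name)
{
    return filter && filter->table && strcmp(filter->table, name) != 0;
}
```

With no filter, or a filter whose table pointer is NULL, nothing is skipped, which corresponds to the "nft reset counters" case from the test commands.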
* [PATCH 7/9] netfilter: nf_tables: disable preemption in nft_update_chain_stats()
From: Pablo Neira Ayuso @ 2018-05-28 23:42 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20180528234221.31254-1-pablo@netfilter.org>
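This patch fixes the following splat by calling local_bh_disable() before
this_cpu_ptr(), so that the per-CPU stats pointer is no longer taken in
preemptible context.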
[118709.054937] BUG: using smp_processor_id() in preemptible [00000000] code: test/1571
[118709.054970] caller is nft_update_chain_stats.isra.4+0x53/0x97 [nf_tables]
[118709.054980] CPU: 2 PID: 1571 Comm: test Not tainted 4.17.0-rc6+ #335
[...]
[118709.054992] Call Trace:
[118709.055011] dump_stack+0x5f/0x86
[118709.055026] check_preemption_disabled+0xd4/0xe4
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/netfilter/nf_tables_core.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/netfilter/nf_tables_core.c b/net/netfilter/nf_tables_core.c
index 942702a2776f..40e744572283 100644
--- a/net/netfilter/nf_tables_core.c
+++ b/net/netfilter/nf_tables_core.c
@@ -126,15 +126,15 @@ static noinline void nft_update_chain_stats(const struct nft_chain *chain,
if (!base_chain->stats)
return;
+ local_bh_disable();
stats = this_cpu_ptr(rcu_dereference(base_chain->stats));
if (stats) {
- local_bh_disable();
u64_stats_update_begin(&stats->syncp);
stats->pkts++;
stats->bytes += pkt->skb->len;
u64_stats_update_end(&stats->syncp);
- local_bh_enable();
}
+ local_bh_enable();
}
struct nft_jumpstack {
--
2.11.0