Netdev List

Netdev List
 help / color / mirror / Atom feed

* [net-next 2/4] e1000: remove workaround for Errata 23 from jumbo alloc
From: Jeff Kirsher @ 2012-05-17 11:27 UTC (permalink / raw)
  To: davem; +Cc: Sebastian Andrzej Siewior, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1337254070-32500-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

According to the comment, errata 23 says that the memory we allocate
can't cross a 64KiB boundary. In case of jumbo frames we allocate
complete pages which can never cross the 64KiB boundary because
PAGE_SIZE should be a multiple of 64KiB so we stop either before the
boundary or start after it but never cross it. Furthermore the check
seems bogus because it looks at skb->data which is not seen by the HW
at all because we only pass the DMA address of the page we allocated. So
I *think* the workaround is not required here.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/e1000/e1000_main.c |   24 ------------------------
 1 files changed, 0 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index f1aef68..fefbf4d 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -4391,30 +4391,6 @@ e1000_alloc_jumbo_rx_buffers(struct e1000_adapter *adapter,
 			break;
 		}
 
-		/* Fix for errata 23, can't cross 64kB boundary */
-		if (!e1000_check_64k_bound(adapter, skb->data, bufsz)) {
-			struct sk_buff *oldskb = skb;
-			e_err(rx_err, "skb align check failed: %u bytes at "
-			      "%p\n", bufsz, skb->data);
-			/* Try again, without freeing the previous */
-			skb = netdev_alloc_skb_ip_align(netdev, bufsz);
-			/* Failed allocation, critical failure */
-			if (!skb) {
-				dev_kfree_skb(oldskb);
-				adapter->alloc_rx_buff_failed++;
-				break;
-			}
-
-			if (!e1000_check_64k_bound(adapter, skb->data, bufsz)) {
-				/* give up */
-				dev_kfree_skb(skb);
-				dev_kfree_skb(oldskb);
-				break; /* while (cleaned_count--) */
-			}
-
-			/* Use new allocation */
-			dev_kfree_skb(oldskb);
-		}
 		buffer_info->skb = skb;
 		buffer_info->length = adapter->rx_buffer_len;
 check_page:
-- 
1.7.7.6

^ permalink raw reply related

* [net-next 3/4] e1000: look in the page and not in skb->data for the last byte
From: Jeff Kirsher @ 2012-05-17 11:27 UTC (permalink / raw)
  To: davem; +Cc: Sebastian Andrzej Siewior, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1337254070-32500-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

The code seems to want to look at the last byte where the HW puts some
information. Since the skb->data area is never seen by the HW I guess it
does not work as expected. We pass the page address to the HW so I
*think* in order to get to the last byte where the information might be
one should use the page buffer and take a look.
This is of course not more than just compile tested.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/e1000/e1000_main.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index fefbf4d..6ac80c8 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -4066,7 +4066,11 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
 		/* errors is only valid for DD + EOP descriptors */
 		if (unlikely((status & E1000_RXD_STAT_EOP) &&
 		    (rx_desc->errors & E1000_RXD_ERR_FRAME_ERR_MASK))) {
-			u8 last_byte = *(skb->data + length - 1);
+			u8 *mapped;
+			u8 last_byte;
+
+			mapped = kmap_atomic(buffer_info->page);
+			last_byte = *(mapped + length - 1);
 			if (TBI_ACCEPT(hw, status, rx_desc->errors, length,
 				       last_byte)) {
 				spin_lock_irqsave(&adapter->stats_lock,
-- 
1.7.7.6

^ permalink raw reply related

* [net-next 4/4] igb: Disable the BMC-to-OS Watchdog Enable bit for DMAC.
From: Jeff Kirsher @ 2012-05-17 11:27 UTC (permalink / raw)
  To: davem; +Cc: Matthew Vick, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1337254070-32500-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Matthew Vick <matthew.vick@intel.com>

Under certain scenarios, it's possible that bursty manageability traffic
over the BMC-to-OS path may overrun the internal manageability receive
buffer causing dropped manageability packets. Clearing this bit prevents
this situation by interrupting coalescing to allow manageability traffic
through.

Signed-off-by: Matthew Vick <matthew.vick@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/igb/e1000_defines.h |    2 ++
 drivers/net/ethernet/intel/igb/igb_main.c      |    3 +++
 2 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_defines.h b/drivers/net/ethernet/intel/igb/e1000_defines.h
index 6409f85..ec7e4fe 100644
--- a/drivers/net/ethernet/intel/igb/e1000_defines.h
+++ b/drivers/net/ethernet/intel/igb/e1000_defines.h
@@ -301,6 +301,8 @@
 							* transactions */
 #define E1000_DMACR_DMAC_LX_SHIFT       28
 #define E1000_DMACR_DMAC_EN             0x80000000 /* Enable DMA Coalescing */
+/* DMA Coalescing BMC-to-OS Watchdog Enable */
+#define E1000_DMACR_DC_BMC2OSW_EN	0x00008000
 
 #define E1000_DMCTXTH_DMCTTHR_MASK      0x00000FFF /* DMA Coalescing Transmit
 							* Threshold */
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 9bbf1a2..dd3bfe8 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -7147,6 +7147,9 @@ static void igb_init_dmac(struct igb_adapter *adapter, u32 pba)
 
 			/* watchdog timer= +-1000 usec in 32usec intervals */
 			reg |= (1000 >> 5);
+
+			/* Disable BMC-to-OS Watchdog Enable */
+			reg &= ~E1000_DMACR_DC_BMC2OSW_EN;
 			wr32(E1000_DMACR, reg);
 
 			/*
-- 
1.7.7.6

^ permalink raw reply related

* [net-next 0/4][pull request] Intel Wired LAN Driver Updates
From: Jeff Kirsher @ 2012-05-17 11:27 UTC (permalink / raw)
  To: davem; +Cc: Jeff Kirsher, netdev, gospo, sassmann

This series of patches contains updates for e1000, e1000e and igb.

The following are changes since commit dc6b9b78234fecdc6d2ca5e1629185718202bcf5:
  net: include/net/sock.h cleanup
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next master

Bruce Allan (1):
  e1000e: fix typo in definition of E1000_CTRL_EXT_FORCE_SMBUS

Matthew Vick (1):
  igb: Disable the BMC-to-OS Watchdog Enable bit for DMAC.

Sebastian Andrzej Siewior (2):
  e1000: remove workaround for Errata 23 from jumbo alloc
  e1000: look in the page and not in skb->data for the last byte

 drivers/net/ethernet/intel/e1000/e1000_main.c  |   30 ++++--------------------
 drivers/net/ethernet/intel/e1000e/defines.h    |    2 +-
 drivers/net/ethernet/intel/igb/e1000_defines.h |    2 +
 drivers/net/ethernet/intel/igb/igb_main.c      |    3 ++
 4 files changed, 11 insertions(+), 26 deletions(-)

-- 
1.7.7.6

^ permalink raw reply

* [net-next 1/4] e1000e: fix typo in definition of E1000_CTRL_EXT_FORCE_SMBUS
From: Jeff Kirsher @ 2012-05-17 11:27 UTC (permalink / raw)
  To: davem; +Cc: Bruce Allan, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1337254070-32500-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Bruce Allan <bruce.w.allan@intel.com>

This define is needed by i217.

Reported-by: Bjorn Mork <bjorn@mork.no>
Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/e1000e/defines.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/defines.h b/drivers/net/ethernet/intel/e1000e/defines.h
index 11c4666..351a409 100644
--- a/drivers/net/ethernet/intel/e1000e/defines.h
+++ b/drivers/net/ethernet/intel/e1000e/defines.h
@@ -76,7 +76,7 @@
 /* Extended Device Control */
 #define E1000_CTRL_EXT_LPCD  0x00000004     /* LCD Power Cycle Done */
 #define E1000_CTRL_EXT_SDP3_DATA 0x00000080 /* Value of SW Definable Pin 3 */
-#define E1000_CTRL_EXT_FORCE_SMBUS 0x00000004 /* Force SMBus mode*/
+#define E1000_CTRL_EXT_FORCE_SMBUS 0x00000800 /* Force SMBus mode */
 #define E1000_CTRL_EXT_EE_RST    0x00002000 /* Reinitialize from EEPROM */
 #define E1000_CTRL_EXT_SPD_BYPS  0x00008000 /* Speed Select Bypass */
 #define E1000_CTRL_EXT_RO_DIS    0x00020000 /* Relaxed Ordering disable */
-- 
1.7.7.6

^ permalink raw reply related

* [net] e1000: Prevent reset task killing itself.
From: Jeff Kirsher @ 2012-05-17 11:04 UTC (permalink / raw)
  To: davem; +Cc: Tushar Dave, netdev, gospo, sassmann, stable, Jeff Kirsher

From: Tushar Dave <tushar.n.dave@intel.com>

Killing reset task while adapter is resetting causes deadlock.
Only kill reset task if adapter is not resetting.
Ref bug #43132 on bugzilla.kernel.org

CC: stable@vger.kernel.org
Signed-off-by: Tushar Dave <tushar.n.dave@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

@stable - this patch is applicable back to 3.1 kernels

---
 drivers/net/ethernet/intel/e1000/e1000_main.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index 37caa88..8d8908d 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -493,7 +493,11 @@ out:
 static void e1000_down_and_stop(struct e1000_adapter *adapter)
 {
 	set_bit(__E1000_DOWN, &adapter->flags);
-	cancel_work_sync(&adapter->reset_task);
+
+	/* Only kill reset task if adapter is not resetting */
+	if (!test_bit(__E1000_RESETTING, &adapter->flags))
+		cancel_work_sync(&adapter->reset_task);
+
 	cancel_delayed_work_sync(&adapter->watchdog_task);
 	cancel_delayed_work_sync(&adapter->phy_info_task);
 	cancel_delayed_work_sync(&adapter->fifo_stall_task);
-- 
1.7.7.6

^ permalink raw reply related

* [PATCH net-next] net/mlx4_en: num cores tx rings for every UP
From: Amir Vadai @ 2012-05-17 10:58 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Oren Duer, Amir Vadai, John Fastabend, Liran Liss

Change the TX ring scheme such that the number of rings for untagged packets
and for tagged packets (per each of the vlan priorities) is the same, unlike
the current situation where for tagged traffic there's one ring per priority
and for untagged rings as the number of core.

Queue selection is done as follows:

If the mqprio qdisc is operates on the interface, such that the core networking
code invoked the device setup_tc ndo callback, a mapping of skb->priority =>
queue set is forced - for both, tagged and untagged traffic.

Else, the egress map skb->priority =>  User priority is used for tagged traffic, and
all untagged traffic is sent through tx rings of UP 0.

The patch follows the convergence of discussing that issue with John Fastabend
over this thread http://comments.gmane.org/gmane.linux.network/229877

Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Liran Liss <liranl@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_main.c   |    6 ++-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |   41 +++++++++++++++++------
 drivers/net/ethernet/mellanox/mlx4/en_tx.c     |   15 +++++---
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |    9 ++---
 4 files changed, 47 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_main.c b/drivers/net/ethernet/mellanox/mlx4/en_main.c
index 346fdb2..988b242 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_main.c
@@ -101,6 +101,8 @@ static int mlx4_en_get_profile(struct mlx4_en_dev *mdev)
 	int i;
 
 	params->udp_rss = udp_rss;
+	params->num_tx_rings_p_up = min_t(int, num_online_cpus(),
+			MLX4_EN_MAX_TX_RING_P_UP);
 	if (params->udp_rss && !(mdev->dev->caps.flags
 					& MLX4_DEV_CAP_FLAG_UDP_RSS)) {
 		mlx4_warn(mdev, "UDP RSS is not supported on this device.\n");
@@ -113,8 +115,8 @@ static int mlx4_en_get_profile(struct mlx4_en_dev *mdev)
 		params->prof[i].tx_ppp = pfctx;
 		params->prof[i].tx_ring_size = MLX4_EN_DEF_TX_RING_SIZE;
 		params->prof[i].rx_ring_size = MLX4_EN_DEF_RX_RING_SIZE;
-		params->prof[i].tx_ring_num = MLX4_EN_NUM_TX_RINGS +
-			MLX4_EN_NUM_PPP_RINGS;
+		params->prof[i].tx_ring_num = params->num_tx_rings_p_up *
+			MLX4_EN_NUM_UP;
 		params->prof[i].rss_rings = 0;
 	}
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index eaa8fad..926d8aa 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -47,9 +47,22 @@
 
 static int mlx4_en_setup_tc(struct net_device *dev, u8 up)
 {
-	if (up != MLX4_EN_NUM_UP)
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+	int i;
+	unsigned int q, offset = 0;
+
+	if (up && up != MLX4_EN_NUM_UP)
 		return -EINVAL;
 
+	netdev_set_num_tc(dev, up);
+
+	/* Partition Tx queues evenly amongst UP's */
+	q = priv->tx_ring_num / up;
+	for (i = 0; i < up; i++) {
+		netdev_set_tc_queue(dev, i, q, offset);
+		offset += q;
+	}
+
 	return 0;
 }
 
@@ -661,7 +674,7 @@ int mlx4_en_start_port(struct net_device *dev)
 		/* Configure ring */
 		tx_ring = &priv->tx_ring[i];
 		err = mlx4_en_activate_tx_ring(priv, tx_ring, cq->mcq.cqn,
-				max(0, i - MLX4_EN_NUM_TX_RINGS));
+			i / priv->mdev->profile.num_tx_rings_p_up);
 		if (err) {
 			en_err(priv, "Failed allocating Tx ring\n");
 			mlx4_en_deactivate_cq(priv, cq);
@@ -986,6 +999,9 @@ void mlx4_en_destroy_netdev(struct net_device *dev)
 
 	mlx4_en_free_resources(priv);
 
+	kfree(priv->tx_ring);
+	kfree(priv->tx_cq);
+
 	free_netdev(dev);
 }
 
@@ -1091,6 +1107,18 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 	priv->ctrl_flags = cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE |
 			MLX4_WQE_CTRL_SOLICITED);
 	priv->tx_ring_num = prof->tx_ring_num;
+	priv->tx_ring = kzalloc(sizeof(struct mlx4_en_tx_ring) *
+			priv->tx_ring_num, GFP_KERNEL);
+	if (!priv->tx_ring) {
+		err = -ENOMEM;
+		goto out;
+	}
+	priv->tx_cq = kzalloc(sizeof(struct mlx4_en_cq) * priv->tx_ring_num,
+			GFP_KERNEL);
+	if (!priv->tx_cq) {
+		err = -ENOMEM;
+		goto out;
+	}
 	priv->rx_ring_num = prof->rx_ring_num;
 	priv->mac_index = -1;
 	priv->msg_enable = MLX4_EN_MSG_LEVEL;
@@ -1138,15 +1166,6 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 	netif_set_real_num_tx_queues(dev, priv->tx_ring_num);
 	netif_set_real_num_rx_queues(dev, priv->rx_ring_num);
 
-	netdev_set_num_tc(dev, MLX4_EN_NUM_UP);
-
-	/* First 9 rings are for UP 0 */
-	netdev_set_tc_queue(dev, 0, MLX4_EN_NUM_TX_RINGS + 1, 0);
-
-	/* Partition Tx queues evenly amongst UP's 1-7 */
-	for (i = 1; i < MLX4_EN_NUM_UP; i++)
-		netdev_set_tc_queue(dev, i, 1, MLX4_EN_NUM_TX_RINGS + i);
-
 	SET_ETHTOOL_OPS(dev, &mlx4_en_ethtool_ops);
 
 	/* Set defualt MAC */
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 9a38483..019d856 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -525,14 +525,17 @@ static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc, struct sk_buff *sk
 
 u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb)
 {
-	u16 vlan_tag = 0;
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+	u16 rings_p_up = priv->mdev->profile.num_tx_rings_p_up;
+	u8 up = 0;
 
-	if (vlan_tx_tag_present(skb)) {
-		vlan_tag = vlan_tx_tag_get(skb);
-		return MLX4_EN_NUM_TX_RINGS + (vlan_tag >> 13);
-	}
+	if (dev->num_tc)
+		return skb_tx_hash(dev, skb);
+
+	if (vlan_tx_tag_present(skb))
+		up = vlan_tx_tag_get(skb) >> VLAN_PRIO_SHIFT;
 
-	return skb_tx_hash(dev, skb);
+	return __skb_tx_hash(dev, skb, rings_p_up) + up * rings_p_up;
 }
 
 static void mlx4_bf_copy(void __iomem *dst, unsigned long *src, unsigned bytecnt)
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 5d87637..6ae3509 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -111,9 +111,7 @@ enum {
 #define MLX4_EN_MIN_TX_SIZE	(4096 / TXBB_SIZE)
 
 #define MLX4_EN_SMALL_PKT_SIZE		64
-#define MLX4_EN_NUM_TX_RINGS		8
-#define MLX4_EN_NUM_PPP_RINGS		8
-#define MAX_TX_RINGS			(MLX4_EN_NUM_TX_RINGS + MLX4_EN_NUM_PPP_RINGS)
+#define MLX4_EN_MAX_TX_RING_P_UP	32
 #define MLX4_EN_NUM_UP			8
 #define MLX4_EN_DEF_TX_RING_SIZE	512
 #define MLX4_EN_DEF_RX_RING_SIZE  	1024
@@ -339,6 +337,7 @@ struct mlx4_en_profile {
 	u32 active_ports;
 	u32 small_pkt_int;
 	u8 no_reset;
+	u8 num_tx_rings_p_up;
 	struct mlx4_en_port_profile prof[MLX4_MAX_PORTS + 1];
 };
 
@@ -477,9 +476,9 @@ struct mlx4_en_priv {
 	u16 num_frags;
 	u16 log_rx_info;
 
-	struct mlx4_en_tx_ring tx_ring[MAX_TX_RINGS];
+	struct mlx4_en_tx_ring *tx_ring;
 	struct mlx4_en_rx_ring rx_ring[MAX_RX_RINGS];
-	struct mlx4_en_cq tx_cq[MAX_TX_RINGS];
+	struct mlx4_en_cq *tx_cq;
 	struct mlx4_en_cq rx_cq[MAX_RX_RINGS];
 	struct work_struct mcast_task;
 	struct work_struct mac_task;
-- 
1.7.8.2

^ permalink raw reply related

* Re: [PATCH v4 6/6] net: sh_eth: use NAPI
From: Francois Romieu @ 2012-05-17 10:33 UTC (permalink / raw)
  To: Shimoda, Yoshihiro; +Cc: netdev, SH-Linux
In-Reply-To: <4FB32D17.30404@renesas.com>

Shimoda, Yoshihiro <yoshihiro.shimoda.uh@renesas.com> :
[...]
> diff --git a/drivers/net/ethernet/renesas/sh_eth.c b/drivers/net/ethernet/renesas/sh_eth.c
> index c64a31c..edc7dfe 100644
> --- a/drivers/net/ethernet/renesas/sh_eth.c
> +++ b/drivers/net/ethernet/renesas/sh_eth.c
[...]
> +static int sh_eth_poll(struct napi_struct *napi, int budget)
> +{
> +	struct sh_eth_private *mdp = container_of(napi, struct sh_eth_private,
> +						  napi);
> +	struct net_device *ndev = mdp->ndev;
> +	struct sh_eth_cpu_data *cd = mdp->cd;
> +	int work_done = 0, txfree_num;
> +	u32 intr_status = sh_eth_read(ndev, EESR);
> +
> +	/* Clear interrupt flags */
> +	sh_eth_write(ndev, intr_status, EESR);
> +
> +	/* check txdesc */
> +	txfree_num = sh_eth_txfree(ndev);

[...]
> @@ -1678,19 +1710,15 @@ static int sh_eth_start_xmit(struct sk_buff *skb, struct net_device *ndev)
>  	struct sh_eth_private *mdp = netdev_priv(ndev);
>  	struct sh_eth_txdesc *txdesc;
>  	u32 entry;
> -	unsigned long flags;
> 
> -	spin_lock_irqsave(&mdp->lock, flags);
>  	if ((mdp->cur_tx - mdp->dirty_tx) >= (mdp->num_tx_ring - 4)) {
>  		if (!sh_eth_txfree(ndev)) {

There are now two racing sh_eth_txfree and there is no [PATCH v4 7/6].

If I may suggest a slightly different approach, I would apply the patch
below before anything NAPI related:

diff --git a/drivers/net/ethernet/renesas/sh_eth.c b/drivers/net/ethernet/renesas/sh_eth.c
index d63e09b..6d77462 100644
--- a/drivers/net/ethernet/renesas/sh_eth.c
+++ b/drivers/net/ethernet/renesas/sh_eth.c
@@ -1495,18 +1495,6 @@ static int sh_eth_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 	u32 entry;
 	unsigned long flags;
 
-	spin_lock_irqsave(&mdp->lock, flags);
-	if ((mdp->cur_tx - mdp->dirty_tx) >= (TX_RING_SIZE - 4)) {
-		if (!sh_eth_txfree(ndev)) {
-			if (netif_msg_tx_queued(mdp))
-				dev_warn(&ndev->dev, "TxFD exhausted.\n");
-			netif_stop_queue(ndev);
-			spin_unlock_irqrestore(&mdp->lock, flags);
-			return NETDEV_TX_BUSY;
-		}
-	}
-	spin_unlock_irqrestore(&mdp->lock, flags);
-
 	entry = mdp->cur_tx % TX_RING_SIZE;
 	mdp->tx_skbuff[entry] = skb;
 	txdesc = &mdp->tx_ring[entry];
@@ -1531,6 +1519,15 @@ static int sh_eth_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 	if (!(sh_eth_read(ndev, EDTRR) & sh_eth_get_edtrr_trns(mdp)))
 		sh_eth_write(ndev, sh_eth_get_edtrr_trns(mdp), EDTRR);
 
+	spin_lock_irqsave(&mdp->lock, flags);
+	if ((mdp->cur_tx - mdp->dirty_tx) >= (TX_RING_SIZE - 4)) {
+		if (netif_msg_tx_queued(mdp)) {
+			dev_warn(&ndev->dev, "TxFD exhausted.\n");
+			netif_stop_queue(ndev);
+		}
+	}
+	spin_unlock_irqrestore(&mdp->lock, flags);
+
 	return NETDEV_TX_OK;
 }
 

Rationale: the driver does not need to return NETDEV_TX_BUSY when it
should signal that it will not handle more packets after the current
one. You may add an extra assertion at the start of sh_eth_start_xmit()
and return NETDEV_TX_BUSY but it should be understood as a debug / bug
helper only.

Then you can convert to a {start/stop} queue race free NAPI with adequate
barriers.

-- 
Ueimor

^ permalink raw reply related

* Re: [PATCH v5 2/2] decrement static keys on real destroy time
From: KAMEZAWA Hiroyuki @ 2012-05-17 10:27 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Andrew Morton, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, devel-GEFAQzZX7r8dnm+yROfE0A,
	netdev-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Li Zefan,
	Johannes Weiner, Michal Hocko
In-Reply-To: <4FB4D14D.4020303-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>

(2012/05/17 19:22), Glauber Costa wrote:

> On 05/17/2012 02:18 PM, KAMEZAWA Hiroyuki wrote:
>> (2012/05/17 18:52), Glauber Costa wrote:
>>
>>> On 05/17/2012 09:37 AM, Andrew Morton wrote:
>>>>>>   If that happens, locking in static_key_slow_inc will prevent any damage.
>>>>>>   My previous version had explicit code to prevent that, but we were
>>>>>>   pointed out that this is already part of the static_key expectations, so
>>>>>>   that was dropped.
>>>> This makes no sense.  If two threads run that code concurrently,
>>>> key->enabled gets incremented twice.  Nobody anywhere has a record that
>>>> this happened so it cannot be undone.  key->enabled is now in an
>>>> unknown state.
>>>
>>> Kame, Tejun,
>>>
>>> Andrew is right. It seems we will need that mutex after all. Just this
>>> is not a race, and neither something that should belong in the
>>> static_branch interface.
>>>
>>
>>
>> Hmm....how about having
>>
>> res_counter_xchg_limit(res,&old_limit, new_limit);
>>
>> if (!cg_proto->updated&&  old_limit == RESOURCE_MAX)
>> 	....update labels...
>>
>> Then, no mutex overhead maybe and activated will be updated only once.
>> Ah, but please fix in a way you like. Above is an example.
> 
> I think a mutex is a lot cleaner than adding a new function to the 
> res_counter interface.
> 
> We could do a counter, and then later decrement the key until the 
> counter reaches zero, but between those two, I still think a mutex here 
> is preferable.
> 
> Only that, instead of coming up with a mutex of ours, we could export 
> and reuse set_limit_mutex from memcontrol.c
> 


ok, please.

thx,
-Kame

> 
>> Thanks,
>> -Kame
>> (*) I'm sorry I won't be able to read e-mails, tomorrow.
>>
> Ok Kame. I am not in a terrible hurry to fix this, it doesn't seem to be 
> hurting any real workload.
> 
> 

^ permalink raw reply

* Re: [PATCH v5 2/2] decrement static keys on real destroy time
From: Glauber Costa @ 2012-05-17 10:22 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, devel-GEFAQzZX7r8dnm+yROfE0A,
	netdev-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Li Zefan,
	Johannes Weiner, Michal Hocko
In-Reply-To: <4FB4D061.10406-+CUm20s59erQFUHtdCDX3A@public.gmane.org>

On 05/17/2012 02:18 PM, KAMEZAWA Hiroyuki wrote:
> (2012/05/17 18:52), Glauber Costa wrote:
>
>> On 05/17/2012 09:37 AM, Andrew Morton wrote:
>>>>>   If that happens, locking in static_key_slow_inc will prevent any damage.
>>>>>   My previous version had explicit code to prevent that, but we were
>>>>>   pointed out that this is already part of the static_key expectations, so
>>>>>   that was dropped.
>>> This makes no sense.  If two threads run that code concurrently,
>>> key->enabled gets incremented twice.  Nobody anywhere has a record that
>>> this happened so it cannot be undone.  key->enabled is now in an
>>> unknown state.
>>
>> Kame, Tejun,
>>
>> Andrew is right. It seems we will need that mutex after all. Just this
>> is not a race, and neither something that should belong in the
>> static_branch interface.
>>
>
>
> Hmm....how about having
>
> res_counter_xchg_limit(res,&old_limit, new_limit);
>
> if (!cg_proto->updated&&  old_limit == RESOURCE_MAX)
> 	....update labels...
>
> Then, no mutex overhead maybe and activated will be updated only once.
> Ah, but please fix in a way you like. Above is an example.

I think a mutex is a lot cleaner than adding a new function to the 
res_counter interface.

We could do a counter, and then later decrement the key until the 
counter reaches zero, but between those two, I still think a mutex here 
is preferable.

Only that, instead of coming up with a mutex of ours, we could export 
and reuse set_limit_mutex from memcontrol.c


> Thanks,
> -Kame
> (*) I'm sorry I won't be able to read e-mails, tomorrow.
>
Ok Kame. I am not in a terrible hurry to fix this, it doesn't seem to be 
hurting any real workload.

^ permalink raw reply

* Re: [PATCH v5 2/2] decrement static keys on real destroy time
From: KAMEZAWA Hiroyuki @ 2012-05-17 10:18 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Andrew Morton, cgroups, linux-mm, devel, netdev, Tejun Heo,
	Li Zefan, Johannes Weiner, Michal Hocko
In-Reply-To: <4FB4CA4D.50608@parallels.com>

(2012/05/17 18:52), Glauber Costa wrote:

> On 05/17/2012 09:37 AM, Andrew Morton wrote:
>>>>  If that happens, locking in static_key_slow_inc will prevent any damage.
>>>>  My previous version had explicit code to prevent that, but we were
>>>>  pointed out that this is already part of the static_key expectations, so
>>>>  that was dropped.
>> This makes no sense.  If two threads run that code concurrently,
>> key->enabled gets incremented twice.  Nobody anywhere has a record that
>> this happened so it cannot be undone.  key->enabled is now in an
>> unknown state.
> 
> Kame, Tejun,
> 
> Andrew is right. It seems we will need that mutex after all. Just this 
> is not a race, and neither something that should belong in the 
> static_branch interface.
> 


Hmm....how about having

res_counter_xchg_limit(res, &old_limit, new_limit);

if (!cg_proto->updated && old_limit == RESOURCE_MAX)
	....update labels...

Then, no mutex overhead maybe and activated will be updated only once.
Ah, but please fix in a way you like. Above is an example.

Thanks,
-Kame
(*) I'm sorry I won't be able to read e-mails, tomorrow.



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH v5 2/2] decrement static keys on real destroy time
From: Glauber Costa @ 2012-05-17  9:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	devel-GEFAQzZX7r8dnm+yROfE0A,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	netdev-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Li Zefan,
	Johannes Weiner, Michal Hocko
In-Reply-To: <20120516223715.5d1b4385.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>

On 05/17/2012 09:37 AM, Andrew Morton wrote:
>> >  If that happens, locking in static_key_slow_inc will prevent any damage.
>> >  My previous version had explicit code to prevent that, but we were
>> >  pointed out that this is already part of the static_key expectations, so
>> >  that was dropped.
> This makes no sense.  If two threads run that code concurrently,
> key->enabled gets incremented twice.  Nobody anywhere has a record that
> this happened so it cannot be undone.  key->enabled is now in an
> unknown state.

Kame, Tejun,

Andrew is right. It seems we will need that mutex after all. Just this 
is not a race, and neither something that should belong in the 
static_branch interface.

We want to make sure that enabled is not updated before the jump label 
update, because we need a specific ordering guarantee at the patched 
sites. And *that*, the interface guarantees, and we were wrong to 
believe it did not. That is a correction issue for the accounting, and 
that part is right.

But when we disarm it, we'll need to make sure that happened only once, 
otherwise we may never unpatch it. That, or we'd need that to be a 
counter. The jump label interface does not - and should not - keep track 
of how many updates happened to a key. That's the role of whoever is 
using it.

If you agree with the above, I'll send this patch again with the correction.

Andrew, thank you very much. Do you spot anything else here?

^ permalink raw reply

* Re: [PATCH 1/1] smsc95xx: add FLAG_POINTTOPOINT flag for driver_info
From: Xiao Jiang @ 2012-05-17  9:51 UTC (permalink / raw)
  To: Ming Lei; +Cc: steve.glendinning, gregkh, netdev, linux-usb, linux-kernel
In-Reply-To: <CACVXFVPLf9+8qQKgkikexq3ao=b9fM4jOCasMWVJrbZEVSj_Tg@mail.gmail.com>

Ming Lei wrote:
> On Thu, May 17, 2012 at 10:23 AM, Xiao Jiang <jgq516@gmail.com> wrote:
>   
>> Ming Lei wrote:
>>     
>>> On Wed, May 16, 2012 at 4:01 PM,  <jgq516@gmail.com> wrote:
>>>
>>>       
>>>> From: Xiao Jiang <jgq516@gmail.com>
>>>>
>>>> commit c26134 introduced FLAG_POINTTOPOINT flag for USB ethernet devices
>>>> which possibly use "usb%d" names, add this flag to make sure pandaboard
>>>> can mount nfs with smsc95xx NIC.
>>>>
>>>>         
>>> Without the flag, I also can mount nfs successfully on my Pandaboard...
>>>       
>
> I always mount nfs in console, and not tried to mount nfs as root fs.
>
>   
>>>       
>> I have pulled latest tree
>> (git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>> commit 0e93b4b304ae052ba1bc73f6d34a68556fe93429), and enable related options
>> (USB_NET_SMSC95XX,
>> USB_EHCI_HCD and USB_EHCI_HCD_OMAP) with omap2plus_config, However the
>> kernel still can't mount
>> nfs, pls see below infos.
>>
>> [    3.114105] smsc95xx v1.0.4
>> [    4.533752] smsc95xx 1-1.1:1.0: *eth0*: register 'smsc95xx' at
>> usb-ehci-omap.0-1.1, smsc95xx USB 2.0 Ethernet, fe:b9:1b:07:8e:d1
>> [  108.854217] VFS: Unable to mount root fs via NFS, trying floppy.
>> [  108.861114] VFS: Cannot open root device "nfs" or unknown-block(2,0):
>> error -6
>> [  108.868713] Please append a correct "root=" boot option; here are the
>> available partitions:
>> [  108.877655] b300         7761920 mmcblk0  driver: mmcblk
>> [  108.883239]   b301           40131 mmcblk0p1
>> 00000000-0000-0000-0000-000000000mmcblk0p1
>> [  108.891662]   b302         7719232 mmcblk0p2
>> 00000000-0000-0000-0000-000000000mmcblk0p2
>> [  108.900146] Kernel panic - not syncing: VFS: Unable to mount root fs on
>> unknown-block(2,0)
>>
>> BTW: I tested it with OMAP4430 ES2.2 pandaboard, the issue can be solved
>> with apply the patch.
>>
>> Is there something which I missed? thanks.
>>     
>
> What is your kernel parameter? Maybe you use 'usb%d' in kernel parameter for
> mounting nfs as root fs. If so, could you try 'eth%d' in kernel cmd?
>
> In fact, smsc95xx is a real LAN interface, and 'eth%d' should be prefered name
> as described in changelog of commit
> c261344d3ce3edac781f9d3c7eabe2e96d8e8fe8(usbnet:use eth%d name for
> known ethernet devices)
>
>   
Thanks for your notice, I used wrong kernel parameter.

Regards,
Xiao
> Thanks,
>   

^ permalink raw reply

* tcp timestamp issues with google servers
From: Miklos Szeredi @ 2012-05-17  9:39 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel

Sometimes connection to google.com, gmail.com and other google servers
doesn't work or takes ages to connect.  When this hits it hits all
google servers at the same time and it's persistent.  It never happens
to anything other than google.  Rebooting helps.  Rarely it goes away
spontaneously.

Apparently google is sometimes replying with an invalid TSecr timestamp
value (smaller than the one sent in the last packet) and this confuses
the Linux TCP stack which either discards the packet or sends a Reset.

Network dump attached.

I found only a couple of references to this issue:

http://gotchas.livejournal.com/3028.html

http://groups.google.com/group/comp.os.linux.networking/browse_thread/thread/29f56feded11b42a

Turning tcp timestamps fixes the issue:

  sysctl -w net.ipv4.tcp_timestamps=0

Not sure why this happens only to me and a very few others.

It appears to be an issue with google TCP stack (is it a modified
stack?) but I thought about issues in my network switch (restarting it
doesn't help) or something in the ISP, but those look unlikely.

Any ideas?

Thanks,
Miklos



  1   0.000000 192.168.28.100 -> 74.125.232.226 TCP 51303 > http [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSV=35355050 TSER=0 WS=5
  2   0.002730 74.125.232.226 -> 192.168.28.100 TCP http > 51303 [SYN, ACK] Seq=0 Ack=1 Win=14180 Len=0 MSS=1430 SACK_PERM=1 TSV=1184565067 TSER=35325344 WS=6
  3   0.002776 192.168.28.100 -> 74.125.232.226 TCP 51303 > http [RST] Seq=1 Win=0 Len=0
  4   1.001408 192.168.28.100 -> 74.125.232.226 TCP 51303 > http [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSV=35356052 TSER=0 WS=5
  5   1.004136 74.125.232.226 -> 192.168.28.100 TCP [TCP Previous segment lost] http > 51303 [SYN, ACK] Seq=15638919 Ack=1 Win=14180 Len=0 MSS=1430 SACK_PERM=1 TSV=1184566068 TSER=35325344 WS=6
  6   1.411915 74.125.232.226 -> 192.168.28.100 TCP http > 51303 [SYN, ACK] Seq=15638919 Ack=1 Win=14180 Len=0 MSS=1430 SACK_PERM=1 TSV=1184566476 TSER=35325344 WS=6
  7   2.011568 74.125.232.226 -> 192.168.28.100 TCP http > 51303 [SYN, ACK] Seq=15638919 Ack=1 Win=14180 Len=0 MSS=1430 SACK_PERM=1 TSV=1184567076 TSER=35325344 WS=6
  8   3.005400 192.168.28.100 -> 74.125.232.226 TCP 51303 > http [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSV=35358056 TSER=0 WS=5
  9   3.007972 74.125.232.226 -> 192.168.28.100 TCP http > 51303 [SYN, ACK] Seq=15638919 Ack=1 Win=14180 Len=0 MSS=1430 SACK_PERM=1 TSV=1184568072 TSER=35325344 WS=6
 10   3.212862 74.125.232.226 -> 192.168.28.100 TCP http > 51303 [SYN, ACK] Seq=15638919 Ack=1 Win=14180 Len=0 MSS=1430 SACK_PERM=1 TSV=1184568277 TSER=35325344 WS=6
 11   5.612449 74.125.232.226 -> 192.168.28.100 TCP http > 51303 [SYN, ACK] Seq=15638919 Ack=1 Win=14180 Len=0 MSS=1430 SACK_PERM=1 TSV=1184570677 TSER=35325344 WS=6
 12   7.013405 192.168.28.100 -> 74.125.232.226 TCP 51303 > http [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSV=35362064 TSER=0 WS=5
 13   7.016627 74.125.232.226 -> 192.168.28.100 TCP http > 51303 [SYN, ACK] Seq=15638919 Ack=1 Win=14180 Len=0 MSS=1430 SACK_PERM=1 TSV=1184572080 TSER=35325344 WS=6
 14  10.412642 74.125.232.226 -> 192.168.28.100 TCP http > 51303 [SYN, ACK] Seq=15638919 Ack=1 Win=14180 Len=0 MSS=1430 SACK_PERM=1 TSV=1184575477 TSER=35325344 WS=6
 15  15.029547 192.168.28.100 -> 74.125.232.226 TCP 51303 > http [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSV=35370080 TSER=0 WS=5
 16  15.032931 74.125.232.226 -> 192.168.28.100 TCP http > 51303 [SYN, ACK] Seq=15638919 Ack=1 Win=14180 Len=0 MSS=1430 SACK_PERM=1 TSV=1184580097 TSER=35325344 WS=6
 17  31.061400 192.168.28.100 -> 74.125.232.226 TCP 51303 > http [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSV=35386112 TSER=0 WS=5
 18  31.064538 74.125.232.226 -> 192.168.28.100 TCP [TCP Previous segment lost] http > 51303 [SYN, ACK] Seq=485350292 Ack=1 Win=14180 Len=0 MSS=1430 SACK_PERM=1 TSV=1184596129 TSER=35325344 WS=6
 19  31.416339 74.125.232.226 -> 192.168.28.100 TCP http > 51303 [SYN, ACK] Seq=485350292 Ack=1 Win=14180 Len=0 MSS=1430 SACK_PERM=1 TSV=1184596480 TSER=35325344 WS=6
 20  32.015998 74.125.232.226 -> 192.168.28.100 TCP http > 51303 [SYN, ACK] Seq=485350292 Ack=1 Win=14180 Len=0 MSS=1430 SACK_PERM=1 TSV=1184597081 TSER=35325344 WS=6
 21  33.216276 74.125.232.226 -> 192.168.28.100 TCP http > 51303 [SYN, ACK] Seq=485350292 Ack=1 Win=14180 Len=0 MSS=1430 SACK_PERM=1 TSV=1184598281 TSER=35325344 WS=6
 22  35.616879 74.125.232.226 -> 192.168.28.100 TCP http > 51303 [SYN, ACK] Seq=485350292 Ack=1 Win=14180 Len=0 MSS=1430 SACK_PERM=1 TSV=1184600681 TSER=35325344 WS=6
 23  40.417065 74.125.232.226 -> 192.168.28.100 TCP http > 51303 [SYN, ACK] Seq=485350292 Ack=1 Win=14180 Len=0 MSS=1430 SACK_PERM=1 TSV=1184605482 TSER=35325344 WS=6

^ permalink raw reply

* [PULL] virtio: last minute fixes for 3.4
From: Michael S. Tsirkin @ 2012-05-17  9:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kvm, mst, netdev, linux-kernel, virtualization, uobergfe,
	amit.shah, David Miller

The following changes since commit 0e93b4b304ae052ba1bc73f6d34a68556fe93429:

  Merge git://git.kernel.org/pub/scm/virt/kvm/kvm (2012-05-16 14:30:51 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git for_linus

for you to fetch changes up to ec13ee80145ccb95b00e6e610044bbd94a170051:

  virtio_net: invoke softirqs after __napi_schedule (2012-05-17 12:16:38 +0300)

----------------------------------------------------------------
virtio: last minute fixes for 3.4

Here are a couple of last minute virtio fixes for 3.4.
Hope it's not too late yes - I might have tried too hard
to make sure the fix is well tested.

Fixes are by Amit and myself. One fixes module removal
and one suspend of a VM, the last one the handling of out
of memory condition.
They are thus very low risk as most people never hit these paths, but do fix
very annoying problems for people that do use the feature.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

----------------------------------------------------------------
Amit Shah (2):
      virtio: console: tell host of open ports after resume from s3/s4
      virtio: balloon: let host know of updated balloon size before module removal

Michael S. Tsirkin (1):
      virtio_net: invoke softirqs after __napi_schedule

 drivers/char/virtio_console.c   |    7 +++++++
 drivers/net/virtio_net.c        |    2 ++
 drivers/virtio/virtio_balloon.c |    1 +
 3 files changed, 10 insertions(+), 0 deletions(-)

^ permalink raw reply

* [PATCH 2/2] [net/virtio_net]: make virtio_net support NUMA info
From: Liu Ping Fan @ 2012-05-17  9:20 UTC (permalink / raw)
  To: kvm, netdev
  Cc: linux-kernel, qemu-devel, Avi Kivity, Michael S. Tsirkin,
	Srivatsa Vaddagiri, Rusty Russell, Anthony Liguori, Ryan Harper,
	Shirley Ma, Krishna Kumar, Tom Lendacky
In-Reply-To: <1337246456-30909-1-git-send-email-kernelfans@gmail.com>

From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>

Vhost net uses separate transfer logic unit in different node.
Virtio net must determine which logic unit it will talk with,
so we can improve the performance.

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
 drivers/net/virtio_net.c |  425 ++++++++++++++++++++++++++++++++++------------
 1 files changed, 314 insertions(+), 111 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index af8acc8..31abafa 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -50,16 +50,32 @@ struct virtnet_stats {
 	u64 rx_packets;
 };
 
+struct napi_info {
+	struct napi_struct napi;
+	struct work_struct enable_napi;
+};
+
+struct vnet_virtio_node {
+	struct virtio_node vnode;
+	int demo_cpu;
+	struct napi_info info;
+	struct delayed_work refill;
+	struct virtnet_info *owner;
+};
+
 struct virtnet_info {
 	struct virtio_device *vdev;
-	struct virtqueue *rvq, *svq, *cvq;
+	/* we want to scatter in different host nodes */
+	struct virtqueue **vqs, **rvqs, **svqs;
+	struct virtqueue *cvq;
+	/* we want to scatter in different host nodes */
+	struct vnet_virtio_node **vnet_nodes;
 	struct net_device *dev;
-	struct napi_struct napi;
+
 	unsigned int status;
 
 	/* Number of input buffers, and max we've ever had. */
 	unsigned int num, max;
-
 	/* I like... big packets and I cannot lie! */
 	bool big_packets;
 
@@ -69,9 +85,6 @@ struct virtnet_info {
 	/* Active statistics */
 	struct virtnet_stats __percpu *stats;
 
-	/* Work struct for refilling if we run low on memory. */
-	struct delayed_work refill;
-
 	/* Chain pages by the private ptr. */
 	struct page *pages;
 
@@ -136,7 +149,6 @@ static void skb_xmit_done(struct virtqueue *svq)
 
 	/* Suppress further interrupts. */
 	virtqueue_disable_cb(svq);
-
 	/* We were probably waiting for more output buffers. */
 	netif_wake_queue(vi->dev);
 }
@@ -220,7 +232,8 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
 	return skb;
 }
 
-static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
+static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb,
+	struct virtqueue *rvq)
 {
 	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
 	struct page *page;
@@ -234,7 +247,7 @@ static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
 			skb->dev->stats.rx_length_errors++;
 			return -EINVAL;
 		}
-		page = virtqueue_get_buf(vi->rvq, &len);
+		page = virtqueue_get_buf(rvq, &len);
 		if (!page) {
 			pr_debug("%s: rx error: %d buffers missing\n",
 				 skb->dev->name, hdr->mhdr.num_buffers);
@@ -252,7 +265,8 @@ static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
 	return 0;
 }
 
-static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
+static void receive_buf(struct net_device *dev, void *buf, unsigned int len,
+	struct virtqueue *rvq)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
@@ -283,7 +297,7 @@ static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
 			return;
 		}
 		if (vi->mergeable_rx_bufs)
-			if (receive_mergeable(vi, skb)) {
+			if (receive_mergeable(vi, skb, rvq)) {
 				dev_kfree_skb(skb);
 				return;
 			}
@@ -353,7 +367,67 @@ frame_err:
 	dev_kfree_skb(skb);
 }
 
-static int add_recvbuf_small(struct virtnet_info *vi, gfp_t gfp)
+/* todo, this will be redesign, and as a part of exporting host numa info to
+  * guest scheduler  */
+/* fix me, host numa node id directly exposed to guest? */
+
+/* fill in by host */
+static s16 __vapicid_to_vnode[MAX_LOCAL_APIC];
+/* fix me, HOST_NUMNODES is defined by host */
+#define  HOST_NUMNODES  128
+static struct cpumask vnode_to_vcpumask_map[HOST_NUMNODES];
+DECLARE_PER_CPU(int, vcpu_to_vnode_map);
+
+void init_vnode_map(void)
+{
+	int cpu, apicid, vnode;
+	for_each_possible_cpu(cpu) {
+		apicid = cpu_physical_id(cpu);
+		vnode = __vapicid_to_vnode[apicid];
+		per_cpu(vcpu_to_vnode_map, cpu) = vnode;
+	}
+}
+
+struct cpumask *vnode_to_vcpumask(int virtio_node)
+{
+	struct cpumask *msk = &vnode_to_vcpumask_map[virtio_node];
+	return msk;
+}
+
+static int first_vcpu_on_virtio_node(int virtio_node)
+{
+	 struct cpumask *msk = vnode_to_vcpumask(virtio_node);
+	 return cpumask_first(msk);
+}
+
+static int vcpu_to_virtio_node(void)
+{
+	int vnode = __get_cpu_var(vcpu_to_vnode_map);
+	return vnode;
+}
+/* end of todo */
+
+static int virtqueue_pickup(struct virtnet_info *vi, struct virtqueue **vq, int rx)
+{
+	int node;
+	int i;
+	struct vnet_virtio_node *vnnode;
+	node = vcpu_to_virtio_node();
+	for (i = 0; i < vi->vdev->node_cnt; i++) {
+		vnnode = vi->vnet_nodes[i];
+		if (vnnode->vnode.node_id == node) {
+			if (rx == 0)
+				*vq = vnnode->vnode.svq;
+			else
+				*vq = vnnode->vnode.rvq;
+			return 0;
+		}
+	}
+	*vq = NULL;
+	return -1;
+}
+
+static int add_recvbuf_small(struct virtnet_info *vi, struct virtqueue *vq, gfp_t gfp)
 {
 	struct sk_buff *skb;
 	struct skb_vnet_hdr *hdr;
@@ -369,15 +443,14 @@ static int add_recvbuf_small(struct virtnet_info *vi, gfp_t gfp)
 	sg_set_buf(vi->rx_sg, &hdr->hdr, sizeof hdr->hdr);
 
 	skb_to_sgvec(skb, vi->rx_sg + 1, 0, skb->len);
-
-	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, 2, skb, gfp);
+	err = virtqueue_add_buf(vq, vi->rx_sg, 0, 2, skb, gfp);
 	if (err < 0)
 		dev_kfree_skb(skb);
 
 	return err;
 }
 
-static int add_recvbuf_big(struct virtnet_info *vi, gfp_t gfp)
+static int add_recvbuf_big(struct virtnet_info *vi, struct virtqueue *vq, gfp_t gfp)
 {
 	struct page *first, *list = NULL;
 	char *p;
@@ -415,7 +488,8 @@ static int add_recvbuf_big(struct virtnet_info *vi, gfp_t gfp)
 
 	/* chain first in list head */
 	first->private = (unsigned long)list;
-	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, MAX_SKB_FRAGS + 2,
+
+	err = virtqueue_add_buf(vq, vi->rx_sg, 0, MAX_SKB_FRAGS + 2,
 				first, gfp);
 	if (err < 0)
 		give_pages(vi, first);
@@ -423,7 +497,7 @@ static int add_recvbuf_big(struct virtnet_info *vi, gfp_t gfp)
 	return err;
 }
 
-static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
+static int add_recvbuf_mergeable(struct virtnet_info *vi, struct virtqueue *vq, gfp_t gfp)
 {
 	struct page *page;
 	int err;
@@ -433,8 +507,7 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
 		return -ENOMEM;
 
 	sg_init_one(vi->rx_sg, page_address(page), PAGE_SIZE);
-
-	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, 1, page, gfp);
+	err = virtqueue_add_buf(vq, vi->rx_sg, 0, 1, page, gfp);
 	if (err < 0)
 		give_pages(vi, page);
 
@@ -448,18 +521,17 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
  * before we're receiving packets, or from refill_work which is
  * careful to disable receiving (using napi_disable).
  */
-static bool try_fill_recv(struct virtnet_info *vi, gfp_t gfp)
+static bool try_fill_recv(struct virtnet_info *vi, struct virtqueue *rvq, gfp_t gfp)
 {
 	int err;
 	bool oom;
-
 	do {
 		if (vi->mergeable_rx_bufs)
-			err = add_recvbuf_mergeable(vi, gfp);
+			err = add_recvbuf_mergeable(vi, rvq, gfp);
 		else if (vi->big_packets)
-			err = add_recvbuf_big(vi, gfp);
+			err = add_recvbuf_big(vi, rvq, gfp);
 		else
-			err = add_recvbuf_small(vi, gfp);
+			err = add_recvbuf_small(vi, rvq, gfp);
 
 		oom = err == -ENOMEM;
 		if (err < 0)
@@ -468,31 +540,79 @@ static bool try_fill_recv(struct virtnet_info *vi, gfp_t gfp)
 	} while (err > 0);
 	if (unlikely(vi->num > vi->max))
 		vi->max = vi->num;
-	virtqueue_kick(vi->rvq);
+
+	virtqueue_kick(rvq);
 	return !oom;
 }
 
+static void try_fill_all_recv(struct virtnet_info *vi, gfp_t gfp)
+{
+	int i, cpu, err;
+	struct vnet_virtio_node *vnnode;
+	for (i = 0; i < vi->vdev->node_cnt; i++) {
+		vnnode = vi->vnet_nodes[i];
+		err = try_fill_recv(vi, vnnode->vnode.rvq, gfp);
+		if (err) {
+			cpu = first_vcpu_on_virtio_node(vnnode->vnode.node_id);
+			queue_delayed_work_on(cpu, system_nrt_wq, &vnnode->refill, 0);
+		}
+	}
+	return;
+}
+
 static void skb_recv_done(struct virtqueue *rvq)
 {
-	struct virtnet_info *vi = rvq->vdev->priv;
+	struct vnet_virtio_node *vnet_node = container_of(rvq->node, struct vnet_virtio_node, vnode);
+	struct napi_struct *napi = &vnet_node->info.napi;
+
 	/* Schedule NAPI, Suppress further interrupts if successful. */
-	if (napi_schedule_prep(&vi->napi)) {
+	if (napi_schedule_prep(napi)) {
 		virtqueue_disable_cb(rvq);
-		__napi_schedule(&vi->napi);
+		__napi_schedule(napi);
 	}
 }
 
-static void virtnet_napi_enable(struct virtnet_info *vi)
+static void virtnet_napi_enable(struct napi_struct *napi, struct virtqueue *rvq)
 {
-	napi_enable(&vi->napi);
+	napi_enable(napi);
 
 	/* If all buffers were filled by other side before we napi_enabled, we
 	 * won't get another interrupt, so process any outstanding packets
 	 * now.  virtnet_poll wants re-enable the queue, so we disable here.
 	 * We synchronize against interrupts via NAPI_STATE_SCHED */
-	if (napi_schedule_prep(&vi->napi)) {
-		virtqueue_disable_cb(vi->rvq);
-		__napi_schedule(&vi->napi);
+	if (napi_schedule_prep(napi)) {
+		virtqueue_disable_cb(rvq);
+		__napi_schedule(napi);
+	}
+}
+
+static void virtnet_napis_disable(struct virtnet_info *vi)
+{
+	int i;
+	struct vnet_virtio_node *vnnode;
+	for (i = 0; i < vi->vdev->node_cnt; i++) {
+		vnnode = vi->vnet_nodes[i];
+		napi_disable(&vnnode->info.napi);
+	}
+}
+
+static void napi_enable_worker(struct work_struct *work)
+{
+	struct vnet_virtio_node *vnnode = container_of(work,
+		struct vnet_virtio_node, refill.work);
+	struct virtqueue *rvq = vnnode->vnode.rvq;
+	virtnet_napi_enable(&vnnode->info.napi, rvq);
+}
+
+static void virtnet_napis_enable(struct virtnet_info *vi)
+{
+	int i;
+	struct work_struct *work;
+	struct vnet_virtio_node *vnnode;
+	for (i = 0; i < vi->vdev->node_cnt; i++) {
+		vnnode = vi->vnet_nodes[i];
+		work = &vnnode->info.enable_napi;
+		queue_work_on(vnnode->demo_cpu, system_wq, work);
 	}
 }
 
@@ -500,43 +620,52 @@ static void refill_work(struct work_struct *work)
 {
 	struct virtnet_info *vi;
 	bool still_empty;
+	struct napi_struct *napi;
+	struct virtqueue *rvq;
+	struct vnet_virtio_node *vnnode = container_of(work,
+		struct vnet_virtio_node, refill.work);
 
-	vi = container_of(work, struct virtnet_info, refill.work);
-	napi_disable(&vi->napi);
-	still_empty = !try_fill_recv(vi, GFP_KERNEL);
-	virtnet_napi_enable(vi);
+	vi = vnnode->owner;
+	napi = &vnnode->info.napi;
+	rvq = vnnode->vnode.rvq;
+	napi_disable(napi);
+
+	still_empty = !try_fill_recv(vi, rvq, GFP_KERNEL);
+	virtnet_napi_enable(napi, rvq);
 
 	/* In theory, this can happen: if we don't get any buffers in
 	 * we will *never* try to fill again. */
 	if (still_empty)
-		queue_delayed_work(system_nrt_wq, &vi->refill, HZ/2);
+		queue_delayed_work_on(vnnode->demo_cpu, system_nrt_wq, &vnnode->refill, HZ/2);
 }
 
 static int virtnet_poll(struct napi_struct *napi, int budget)
 {
-	struct virtnet_info *vi = container_of(napi, struct virtnet_info, napi);
+	struct virtnet_info *vi;
 	void *buf;
 	unsigned int len, received = 0;
-
+	struct vnet_virtio_node *vnnode = container_of(napi, struct vnet_virtio_node, info.napi);
+	struct virtqueue *rvq = vnnode->vnode.rvq;
+	vi = vnnode->owner;
 again:
 	while (received < budget &&
-	       (buf = virtqueue_get_buf(vi->rvq, &len)) != NULL) {
-		receive_buf(vi->dev, buf, len);
+	       (buf = virtqueue_get_buf(rvq, &len)) != NULL) {
+		receive_buf(vi->dev, buf, len, rvq);
 		--vi->num;
 		received++;
 	}
 
 	if (vi->num < vi->max / 2) {
-		if (!try_fill_recv(vi, GFP_ATOMIC))
-			queue_delayed_work(system_nrt_wq, &vi->refill, 0);
+		if (!try_fill_recv(vi, rvq, GFP_ATOMIC))
+			queue_delayed_work(system_nrt_wq, &vnnode->refill, 0);
 	}
 
 	/* Out of packets? */
 	if (received < budget) {
 		napi_complete(napi);
-		if (unlikely(!virtqueue_enable_cb(vi->rvq)) &&
+		if (unlikely(!virtqueue_enable_cb(rvq)) &&
 		    napi_schedule_prep(napi)) {
-			virtqueue_disable_cb(vi->rvq);
+			virtqueue_disable_cb(rvq);
 			__napi_schedule(napi);
 			goto again;
 		}
@@ -545,13 +674,13 @@ again:
 	return received;
 }
 
-static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
+static unsigned int free_old_xmit_skbs(struct virtnet_info *vi, struct virtqueue *svq)
 {
 	struct sk_buff *skb;
 	unsigned int len, tot_sgs = 0;
 	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
 
-	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
+	while ((skb = virtqueue_get_buf(svq, &len)) != NULL) {
 		pr_debug("Sent skb %p\n", skb);
 
 		u64_stats_update_begin(&stats->syncp);
@@ -565,7 +694,7 @@ static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
 	return tot_sgs;
 }
 
-static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
+static int xmit_skb(struct virtnet_info *vi, struct virtqueue *svq, struct sk_buff *skb)
 {
 	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
 	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
@@ -608,7 +737,8 @@ static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
 		sg_set_buf(vi->tx_sg, &hdr->hdr, sizeof hdr->hdr);
 
 	hdr->num_sg = skb_to_sgvec(skb, vi->tx_sg + 1, 0, skb->len) + 1;
-	return virtqueue_add_buf(vi->svq, vi->tx_sg, hdr->num_sg,
+
+	return virtqueue_add_buf(svq, vi->tx_sg, hdr->num_sg,
 				 0, skb, GFP_ATOMIC);
 }
 
@@ -616,12 +746,14 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 	int capacity;
+	struct virtqueue *svq;
+	virtqueue_pickup(vi, &svq, 0);
 
 	/* Free up any pending old buffers before queueing new ones. */
-	free_old_xmit_skbs(vi);
+	free_old_xmit_skbs(vi, svq);
 
 	/* Try to transmit */
-	capacity = xmit_skb(vi, skb);
+	capacity = xmit_skb(vi, svq, skb);
 
 	/* This can happen with OOM and indirect buffers. */
 	if (unlikely(capacity < 0)) {
@@ -640,7 +772,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 		kfree_skb(skb);
 		return NETDEV_TX_OK;
 	}
-	virtqueue_kick(vi->svq);
+	virtqueue_kick(svq);
 
 	/* Don't wait up for transmitted skbs to be freed. */
 	skb_orphan(skb);
@@ -650,12 +782,12 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	 * before it gets out of hand.  Naturally, this wastes entries. */
 	if (capacity < 2+MAX_SKB_FRAGS) {
 		netif_stop_queue(dev);
-		if (unlikely(!virtqueue_enable_cb_delayed(vi->svq))) {
+		if (unlikely(!virtqueue_enable_cb_delayed(svq))) {
 			/* More just got used, free them then recheck. */
-			capacity += free_old_xmit_skbs(vi);
+			capacity += free_old_xmit_skbs(vi, svq);
 			if (capacity >= 2+MAX_SKB_FRAGS) {
 				netif_start_queue(dev);
-				virtqueue_disable_cb(vi->svq);
+				virtqueue_disable_cb(svq);
 			}
 		}
 	}
@@ -718,20 +850,15 @@ static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev,
 static void virtnet_netpoll(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
-
-	napi_schedule(&vi->napi);
+	virtnet_napis_enable(vi);
 }
 #endif
 
 static int virtnet_open(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
-
-	/* Make sure we have some buffers: if oom use wq. */
-	if (!try_fill_recv(vi, GFP_KERNEL))
-		queue_delayed_work(system_nrt_wq, &vi->refill, 0);
-
-	virtnet_napi_enable(vi);
+	try_fill_all_recv(vi, GFP_KERNEL);
+	virtnet_napis_enable(vi);
 	return 0;
 }
 
@@ -783,11 +910,10 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
 static int virtnet_close(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
-
-	/* Make sure refill_work doesn't re-enable napi! */
-	cancel_delayed_work_sync(&vi->refill);
-	napi_disable(&vi->napi);
-
+	int i;
+	for (i = 0; i < vi->vdev->node_cnt; i++)
+		cancel_delayed_work_sync(&vi->vnet_nodes[i]->refill);
+	virtnet_napis_disable(vi);
 	return 0;
 }
 
@@ -897,9 +1023,10 @@ static void virtnet_get_ringparam(struct net_device *dev,
 				struct ethtool_ringparam *ring)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
+	struct vnet_virtio_node *vnnode =  vi->vnet_nodes[0];
 
-	ring->rx_max_pending = virtqueue_get_vring_size(vi->rvq);
-	ring->tx_max_pending = virtqueue_get_vring_size(vi->svq);
+	ring->rx_max_pending = virtqueue_get_vring_size(vnnode->vnode.rvq);
+	ring->tx_max_pending = virtqueue_get_vring_size(vnnode->vnode.svq);
 	ring->rx_pending = ring->rx_max_pending;
 	ring->tx_pending = ring->tx_max_pending;
 
@@ -986,29 +1113,61 @@ static void virtnet_config_changed(struct virtio_device *vdev)
 
 static int init_vqs(struct virtnet_info *vi)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { skb_recv_done, skb_xmit_done, NULL};
+	struct virtqueue **vqs;
 	const char *names[] = { "input", "output", "control" };
-	int nvqs, err;
-
+	const char **name_array;
+	vq_callback_t **callbacks;
+	int node_cnt, nvqs, err =  -ENOMEM;
+	int i;
 	/* We expect two virtqueues, receive then send,
 	 * and optionally control. */
-	nvqs = virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ) ? 3 : 2;
+	node_cnt = vi->vdev->node_cnt;
+	nvqs = virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)? node_cnt*2+1 :
+		node_cnt*2;
+	callbacks = kzalloc(sizeof(void *)*nvqs, GFP_KERNEL);
+	for (i = 0; i < node_cnt; i++)
+		callbacks[i] = skb_recv_done;
+	for (; i < node_cnt*2; i++)
+		callbacks[i] = skb_xmit_done;
+
+	name_array = kmalloc(sizeof(void *)*nvqs, GFP_KERNEL);
+	if ( name_array == NULL)
+		goto free_callbacks;
+
+	for (i = 0; i < node_cnt; i++)
+		name_array[i] = names[0];
+	for (; i <  node_cnt*2; i++)
+		name_array[i] = names[1];
+	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ))
+		name_array[i] = names[2];
+
+	vqs = kmalloc(sizeof(void *)*nvqs, GFP_KERNEL);
+	if (vqs == NULL)
+		goto free_name;
 
 	err = vi->vdev->config->find_vqs(vi->vdev, nvqs, vqs, callbacks, names);
 	if (err)
-		return err;
+		goto free_vqs;
 
-	vi->rvq = vqs[0];
-	vi->svq = vqs[1];
+	vi->vqs = vqs;
+	vi->rvqs = vi->vqs;
+	vi->svqs = vi->vqs + vi->vdev->node_cnt;
 
 	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
-		vi->cvq = vqs[2];
+		vi->cvq = vi->vqs[vi->vdev->node_cnt*2];
 
 		if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
 			vi->dev->features |= NETIF_F_HW_VLAN_FILTER;
 	}
-	return 0;
+	err = 0;
+free_vqs:
+	if (err)
+		kfree(vqs);
+free_name:
+	kfree(name_array);
+free_callbacks:
+	kfree(callbacks);
+	return err;
 }
 
 static int virtnet_probe(struct virtio_device *vdev)
@@ -1016,6 +1175,8 @@ static int virtnet_probe(struct virtio_device *vdev)
 	int err;
 	struct net_device *dev;
 	struct virtnet_info *vi;
+	int i, size, cur, prev = 0;
+	struct vnet_virtio_node *vnnode;
 
 	/* Allocate ourselves a network device with room for our info */
 	dev = alloc_etherdev(sizeof(struct virtnet_info));
@@ -1064,7 +1225,7 @@ static int virtnet_probe(struct virtio_device *vdev)
 
 	/* Set up our device-specific information */
 	vi = netdev_priv(dev);
-	netif_napi_add(dev, &vi->napi, virtnet_poll, napi_weight);
+
 	vi->dev = dev;
 	vi->vdev = vdev;
 	vdev->priv = vi;
@@ -1074,7 +1235,6 @@ static int virtnet_probe(struct virtio_device *vdev)
 	if (vi->stats == NULL)
 		goto free;
 
-	INIT_DELAYED_WORK(&vi->refill, refill_work);
 	sg_init_table(vi->rx_sg, ARRAY_SIZE(vi->rx_sg));
 	sg_init_table(vi->tx_sg, ARRAY_SIZE(vi->tx_sg));
 
@@ -1086,19 +1246,46 @@ static int virtnet_probe(struct virtio_device *vdev)
 
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
 		vi->mergeable_rx_bufs = true;
-
 	err = init_vqs(vi);
 	if (err)
 		goto free_stats;
 
+	/* Which host node napi_struct will be on, determined by page fault handled by KVM.
+	  * So allocate them seperately!
+	 */
+	vi->vnet_nodes = kmalloc(sizeof(void *) * vi->vdev->node_cnt, GFP_KERNEL);
+	size = PAGE_ALIGN(sizeof(struct vnet_virtio_node));
+	for (i = 0; i < vi->vdev->node_cnt; i++) {
+		vnnode = kmalloc(size, GFP_KERNEL);
+		if (vnnode == NULL) {
+			err = -ENOMEM;
+			goto free_napi;
+		}
+		cur = find_next_bit(&vi->vdev->allow_map, 64, prev);
+		prev = cur;
+		vnnode->vnode.node_id = cur;
+		vnnode->owner = vi;
+		vnnode->vnode.rvq = vi->rvqs[i];
+		vnnode->vnode.svq = vi->svqs[i];
+		vnnode->demo_cpu = first_vcpu_on_virtio_node(cur);
+
+		vi->rvqs[i]->node = &vnnode->vnode;
+		vi->svqs[i]->node = &vnnode->vnode;
+
+		INIT_WORK(&vnnode->info.enable_napi, napi_enable_worker);
+		netif_napi_add(dev, &vnnode->info.napi, virtnet_poll, napi_weight);
+		INIT_DELAYED_WORK(&vnnode->refill, refill_work);
+		vi->vnet_nodes[i] = vnnode;
+	}
+
 	err = register_netdev(dev);
 	if (err) {
 		pr_debug("virtio_net: registering device failed\n");
 		goto free_vqs;
 	}
 
-	/* Last of all, set up some receive buffers. */
-	try_fill_recv(vi, GFP_KERNEL);
+	try_fill_all_recv(vi, GFP_KERNEL);
+
 
 	/* If we didn't even get one input buffer, we're useless. */
 	if (vi->num == 0) {
@@ -1121,6 +1308,12 @@ static int virtnet_probe(struct virtio_device *vdev)
 
 unregister:
 	unregister_netdev(dev);
+free_napi:
+	for (; i  >  0; --i) {
+		vnnode = vi->vnet_nodes[i];
+		netif_napi_del(&vnnode->info.napi);
+		kfree(vnnode);
+	}
 free_vqs:
 	vdev->config->del_vqs(vdev);
 free_stats:
@@ -1133,32 +1326,39 @@ free:
 static void free_unused_bufs(struct virtnet_info *vi)
 {
 	void *buf;
-	while (1) {
-		buf = virtqueue_detach_unused_buf(vi->svq);
-		if (!buf)
-			break;
-		dev_kfree_skb(buf);
-	}
-	while (1) {
-		buf = virtqueue_detach_unused_buf(vi->rvq);
-		if (!buf)
-			break;
-		if (vi->mergeable_rx_bufs || vi->big_packets)
-			give_pages(vi, buf);
-		else
+	int i;
+	struct virtqueue *svq, *rvq;
+	for (i = 0; i < vi->vdev->node_cnt; i++) {
+		svq = vi->svqs[i];
+		rvq = vi->rvqs[i];
+
+		while (1) {
+			buf = virtqueue_detach_unused_buf(svq);
+			if (!buf)
+				break;
 			dev_kfree_skb(buf);
-		--vi->num;
+		}
+		while (1) {
+			buf = virtqueue_detach_unused_buf(rvq);
+			if (!buf)
+				break;
+			if (vi->mergeable_rx_bufs || vi->big_packets)
+				give_pages(vi, buf);
+			else
+				dev_kfree_skb(buf);
+			--vi->num;
+		}
 	}
 	BUG_ON(vi->num != 0);
 }
 
+
 static void remove_vq_common(struct virtnet_info *vi)
 {
 	vi->vdev->config->reset(vi->vdev);
 
 	/* Free unused buffers in both send and recv, if any. */
 	free_unused_bufs(vi);
-
 	vi->vdev->config->del_vqs(vi->vdev);
 
 	while (vi->pages)
@@ -1172,7 +1372,8 @@ static void __devexit virtnet_remove(struct virtio_device *vdev)
 	unregister_netdev(vi->dev);
 
 	remove_vq_common(vi);
-
+	kfree(vi->vqs);
+	kfree(vi->vnet_nodes);
 	free_percpu(vi->stats);
 	free_netdev(vi->dev);
 }
@@ -1181,17 +1382,22 @@ static void __devexit virtnet_remove(struct virtio_device *vdev)
 static int virtnet_freeze(struct virtio_device *vdev)
 {
 	struct virtnet_info *vi = vdev->priv;
+	int i;
 
-	virtqueue_disable_cb(vi->rvq);
-	virtqueue_disable_cb(vi->svq);
+	for (i = 0; i < vdev->node_cnt; i++) {
+		virtqueue_disable_cb(vi->rvqs[i]);
+		virtqueue_disable_cb(vi->svqs[i]);
+	}
 	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ))
 		virtqueue_disable_cb(vi->cvq);
 
 	netif_device_detach(vi->dev);
-	cancel_delayed_work_sync(&vi->refill);
+
+	for (i = 0; i < vdev->node_cnt; i++)
+		cancel_delayed_work_sync(&vi->vnet_nodes[i]->refill);
 
 	if (netif_running(vi->dev))
-		napi_disable(&vi->napi);
+		virtnet_napis_disable(vi);
 
 	remove_vq_common(vi);
 
@@ -1208,13 +1414,10 @@ static int virtnet_restore(struct virtio_device *vdev)
 		return err;
 
 	if (netif_running(vi->dev))
-		virtnet_napi_enable(vi);
+		virtnet_napis_enable(vi);
 
 	netif_device_attach(vi->dev);
-
-	if (!try_fill_recv(vi, GFP_KERNEL))
-		queue_delayed_work(system_nrt_wq, &vi->refill, 0);
-
+	try_fill_all_recv(vi, GFP_KERNEL);
 	return 0;
 }
 #endif
-- 
1.7.4.4


^ permalink raw reply related

* [PATCH 1/2] [kvm/virtio]: make virtio support NUMA attr
From: Liu Ping Fan @ 2012-05-17  9:20 UTC (permalink / raw)
  To: kvm, netdev
  Cc: linux-kernel, qemu-devel, Avi Kivity, Michael S. Tsirkin,
	Srivatsa Vaddagiri, Rusty Russell, Anthony Liguori, Ryan Harper,
	Shirley Ma, Krishna Kumar, Tom Lendacky
In-Reply-To: <1337246456-30909-1-git-send-email-kernelfans@gmail.com>

From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>

For each numa node reported by vhost, we alloc a pair of i/o vq,
and assign them msix IRQ, and set irq affinity to a set of vcpu
in the same node.
Also we alloc vqs on PAGE_SIZE align, so they will be allocated by
host when pg fault happen on different node.

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
 drivers/virtio/virtio.c       |    2 +-
 drivers/virtio/virtio_pci.c   |   35 +++++++++++++++++++++++++++++++++--
 drivers/virtio/virtio_ring.c  |    9 ++++++---
 include/linux/virtio.h        |    9 +++++++++
 include/linux/virtio_config.h |    1 +
 include/linux/virtio_pci.h    |    9 +++++++++
 6 files changed, 59 insertions(+), 6 deletions(-)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index 984c501..79e873f 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -136,7 +136,7 @@ static int virtio_dev_probe(struct device *_d)
 			set_bit(i, dev->features);
 
 	dev->config->finalize_features(dev);
-
+	dev->config->get_numa_map(dev);
 	err = drv->probe(dev);
 	if (err)
 		add_status(dev, VIRTIO_CONFIG_S_FAILED);
diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index 2e03d41..5bb8a97 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -129,6 +129,24 @@ static void vp_finalize_features(struct virtio_device *vdev)
 	iowrite32(vdev->features[0], vp_dev->ioaddr+VIRTIO_PCI_GUEST_FEATURES);
 }
 
+static void vp_get_numa_map(struct virtio_device *vdev)
+{
+	int i, cnt,  sz = 32;
+	int cur, prev = 0;
+	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+
+	/* We only support 32 numa bits. */
+	vdev->allow_map = ioread32(vp_dev->ioaddr+VIRTIO_PCI_NUMA_MAP);
+	for (i = 0; i < sz; i++) {
+		cur = find_next_bit(&vdev->allow_map, sz, prev);
+		prev = cur;
+		if (cur >= sz)
+			break;
+		cnt++;
+	}
+	vdev->node_cnt = cnt;
+}
+
 /* virtio config->get() implementation */
 static void vp_get(struct virtio_device *vdev, unsigned offset,
 		   void *buf, unsigned len)
@@ -516,6 +534,8 @@ static int vp_try_to_find_vqs(struct virtio_device *vdev, unsigned nvqs,
 	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
 	u16 msix_vec;
 	int i, err, nvectors, allocated_vectors;
+	int irq, next, prev = 0;
+	struct cpumask *mask;
 
 	if (!use_msix) {
 		/* Old style: one normal interrupt for change and all vqs. */
@@ -562,14 +582,24 @@ static int vp_try_to_find_vqs(struct virtio_device *vdev, unsigned nvqs,
 			 sizeof *vp_dev->msix_names,
 			 "%s-%s",
 			 dev_name(&vp_dev->vdev.dev), names[i]);
-		err = request_irq(vp_dev->msix_entries[msix_vec].vector,
-				  vring_interrupt, 0,
+		irq = vp_dev->msix_entries[msix_vec].vector;
+		err = request_irq(irq, vring_interrupt, 0,
 				  vp_dev->msix_names[msix_vec],
 				  vqs[i]);
 		if (err) {
 			vp_del_vq(vqs[i]);
 			goto error_find;
 		}
+		if (i == vdev->node_cnt)
+			prev = 0;
+		/* fix me the @size */
+		next = find_next_bit(vdev->allow_map, 64, prev);
+		prev = next;
+		if (next < 64) {
+			mask = vnode_to_vcpumask(next);
+			mask = cpumask_and(mask, cpu_online_mask, mask);
+			irq_set_affinity(irq, mask);
+		}
 	}
 	return 0;
 
@@ -619,6 +649,7 @@ static struct virtio_config_ops virtio_pci_config_ops = {
 	.del_vqs	= vp_del_vqs,
 	.get_features	= vp_get_features,
 	.finalize_features = vp_finalize_features,
+	.get_numa_map = vp_get_numa_map,
 	.bus_name	= vp_bus_name,
 };
 
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 5aa43c3..5baa949 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -626,15 +626,18 @@ struct virtqueue *vring_new_virtqueue(unsigned int num,
 				      const char *name)
 {
 	struct vring_virtqueue *vq;
-	unsigned int i;
+	unsigned int i, size, max;
 
 	/* We assume num is a power of 2. */
 	if (num & (num - 1)) {
 		dev_warn(&vdev->dev, "Bad virtqueue length %u\n", num);
 		return NULL;
 	}
-
-	vq = kmalloc(sizeof(*vq) + sizeof(void *)*num, GFP_KERNEL);
+	size = PAGE_ALIGN (sizeof(*vq) + sizeof(void *)*num);
+	/* Allocate on PAGE boundary, so host can locate them at proper
+	 * node
+	 */
+	vq = kmalloc(size, GFP_KERNEL);
 	if (!vq)
 		return NULL;
 
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 8efd28a..ec992c9 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -9,6 +9,12 @@
 #include <linux/mod_devicetable.h>
 #include <linux/gfp.h>
 
+struct virtio_node {
+	int node_id;
+	struct virtqueue *rvq;
+	struct virtqueue *svq;
+};
+
 /**
  * virtqueue - a queue to register buffers for sending or receiving.
  * @list: the chain of virtqueues for this device
@@ -22,6 +28,7 @@ struct virtqueue {
 	void (*callback)(struct virtqueue *vq);
 	const char *name;
 	struct virtio_device *vdev;
+	struct virtio_node *node;
 	void *priv;
 };
 
@@ -66,6 +73,8 @@ struct virtio_device {
 	struct virtio_device_id id;
 	struct virtio_config_ops *config;
 	struct list_head vqs;
+	int node_cnt;
+	unsigned long allow_map;
 	/* Note that this is a Linux set_bit-style bitmap. */
 	unsigned long features[1];
 	void *priv;
diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index 7323a33..5e2fd77 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -124,6 +124,7 @@ struct virtio_config_ops {
 	void (*del_vqs)(struct virtio_device *);
 	u32 (*get_features)(struct virtio_device *vdev);
 	void (*finalize_features)(struct virtio_device *vdev);
+	void (*get_numa_map)(struct virtio_device *vdev);
 	const char *(*bus_name)(struct virtio_device *vdev);
 };
 
diff --git a/include/linux/virtio_pci.h b/include/linux/virtio_pci.h
index ea66f3f..1426717 100644
--- a/include/linux/virtio_pci.h
+++ b/include/linux/virtio_pci.h
@@ -78,9 +78,18 @@
 /* Vector value used to disable MSI for queue */
 #define VIRTIO_MSI_NO_VECTOR            0xffff
 
+#ifdef VIRTIO_NUMA
+/* 32bits to show allowed numa */
+#define VIRTIO_PCI_NUMA_MAP         24
+
+/* The remaining space is defined by each driver as the per-driver
+ * configuration space */
+#define VIRTIO_PCI_CONFIG(dev)		28
+#else
 /* The remaining space is defined by each driver as the per-driver
  * configuration space */
 #define VIRTIO_PCI_CONFIG(dev)		((dev)->msix_enabled ? 24 : 20)
+#endif
 
 /* Virtio ABI version, this must match exactly */
 #define VIRTIO_PCI_ABI_VERSION		0
-- 
1.7.4.4

^ permalink raw reply related

* [PATCH 2/2] [kvm/vhost-net]: make vhost net own NUMA attribute
From: Liu Ping Fan @ 2012-05-17  9:20 UTC (permalink / raw)
  To: kvm, netdev
  Cc: Krishna Kumar, Shirley Ma, Tom Lendacky, Michael S. Tsirkin,
	qemu-devel, Rusty Russell, Srivatsa Vaddagiri, linux-kernel,
	Ryan Harper, Avi Kivity, Anthony Liguori
In-Reply-To: <1337246456-30909-1-git-send-email-kernelfans@gmail.com>

From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>

Make vhost net support to spread on host node according the command.
And consider the whole vhost_net componsed of lots of logic net units.
for each node, there is a unit, which includes a vhost_worker thread,
rx/tx vhost_virtqueue.

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
 drivers/vhost/net.c |  388 ++++++++++++++++++++++++++++++++++-----------------
 1 files changed, 258 insertions(+), 130 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 1f21d2a..770933e 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -55,8 +55,19 @@ enum vhost_net_poll_state {
 
 struct vhost_net {
 	struct vhost_dev dev;
-	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
+	int numa_init;
+	int vqcnt;
+	struct vhost_virtqueue **vqs;
+	/* one for tx, one for rx */
 	struct vhost_poll poll[VHOST_NET_VQ_MAX];
+	int token[VHOST_NET_VQ_MAX];
+	/* fix me, Although tun.socket.sock can be parrell, but _maybe_, we need to record
+	 * wmem_alloc independly for each subdev.
+	 */
+	struct mutex mutex;
+	struct socket __rcu *tx_sock;
+	struct socket __rcu *rx_sock;
+
 	/* Tells us whether we are polling a socket for TX.
 	 * We only do this when socket buffer fills up.
 	 * Protected by tx vq lock. */
@@ -112,7 +123,9 @@ static void tx_poll_stop(struct vhost_net *net)
 {
 	if (likely(net->tx_poll_state != VHOST_NET_POLL_STARTED))
 		return;
+
 	vhost_poll_stop(net->poll + VHOST_NET_VQ_TX);
+
 	net->tx_poll_state = VHOST_NET_POLL_STOPPED;
 }
 
@@ -121,15 +134,15 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
 {
 	if (unlikely(net->tx_poll_state != VHOST_NET_POLL_STOPPED))
 		return;
+
 	vhost_poll_start(net->poll + VHOST_NET_VQ_TX, sock->file);
 	net->tx_poll_state = VHOST_NET_POLL_STARTED;
 }
 
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
-static void handle_tx(struct vhost_net *net)
+static void handle_tx(struct vhost_net *net, struct vhost_virtqueue *vq)
 {
-	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
 	unsigned out, in, s;
 	int head;
 	struct msghdr msg = {
@@ -148,15 +161,15 @@ static void handle_tx(struct vhost_net *net)
 	bool zcopy;
 
 	/* TODO: check that we are running from vhost_worker? */
-	sock = rcu_dereference_check(vq->private_data, 1);
+	sock = rcu_dereference_check(net->tx_sock, 1);
 	if (!sock)
 		return;
 
 	wmem = atomic_read(&sock->sk->sk_wmem_alloc);
 	if (wmem >= sock->sk->sk_sndbuf) {
-		mutex_lock(&vq->mutex);
+		mutex_lock(&net->mutex);
 		tx_poll_start(net, sock);
-		mutex_unlock(&vq->mutex);
+		mutex_unlock(&net->mutex);
 		return;
 	}
 
@@ -165,6 +178,7 @@ static void handle_tx(struct vhost_net *net)
 
 	if (wmem < sock->sk->sk_sndbuf / 2)
 		tx_poll_stop(net);
+
 	hdr_size = vq->vhost_hlen;
 	zcopy = vhost_sock_zcopy(sock);
 
@@ -186,8 +200,10 @@ static void handle_tx(struct vhost_net *net)
 
 			wmem = atomic_read(&sock->sk->sk_wmem_alloc);
 			if (wmem >= sock->sk->sk_sndbuf * 3 / 4) {
+				mutex_lock(&net->mutex);
 				tx_poll_start(net, sock);
 				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+				mutex_unlock(&net->mutex);
 				break;
 			}
 			/* If more outstanding DMAs, queue the work.
@@ -197,8 +213,10 @@ static void handle_tx(struct vhost_net *net)
 				    (vq->upend_idx - vq->done_idx) :
 				    (vq->upend_idx + UIO_MAXIOV - vq->done_idx);
 			if (unlikely(num_pends > VHOST_MAX_PEND)) {
+				mutex_lock(&net->mutex);
 				tx_poll_start(net, sock);
 				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+				mutex_unlock(&net->mutex);
 				break;
 			}
 			if (unlikely(vhost_enable_notify(&net->dev, vq))) {
@@ -353,9 +371,8 @@ err:
 
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
-static void handle_rx(struct vhost_net *net)
+static void handle_rx(struct vhost_net *net, struct vhost_virtqueue *vq)
 {
-	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
 	unsigned uninitialized_var(in), log;
 	struct vhost_log *vq_log;
 	struct msghdr msg = {
@@ -375,11 +392,10 @@ static void handle_rx(struct vhost_net *net)
 	size_t vhost_hlen, sock_hlen;
 	size_t vhost_len, sock_len;
 	/* TODO: check that we are running from vhost_worker? */
-	struct socket *sock = rcu_dereference_check(vq->private_data, 1);
+	struct socket *sock = rcu_dereference_check(net->tx_sock, 1);
 
 	if (!sock)
 		return;
-
 	mutex_lock(&vq->mutex);
 	vhost_disable_notify(&net->dev, vq);
 	vhost_hlen = vq->vhost_hlen;
@@ -465,8 +481,7 @@ static void handle_tx_kick(struct vhost_work *work)
 	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
 						  poll.work);
 	struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
-
-	handle_tx(net);
+	handle_tx(net, vq);
 }
 
 static void handle_rx_kick(struct vhost_work *work)
@@ -475,103 +490,115 @@ static void handle_rx_kick(struct vhost_work *work)
 						  poll.work);
 	struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
 
-	handle_rx(net);
+	handle_rx(net, vq);
 }
 
-static void handle_tx_net(struct vhost_work *work)
+/* Get sock->file event, then pick up a vhost_worker to wake up.
+ * Currently ,we are round robin, maybe in future, we know which
+ * numa-node the skb from tap want to go.
+ */
+static int deliver_worker(struct vhost_net *net, int rx)
 {
-	struct vhost_net *net = container_of(work, struct vhost_net,
-					     poll[VHOST_NET_VQ_TX].work);
-	handle_tx(net);
+	int i = rx ? VHOST_NET_VQ_RX : VHOST_NET_VQ_TX;
+	int idx = ((net->token[i]++<<1)+i)%net->vqcnt;
+	vhost_poll_queue(&net->vqs[idx]->poll);
+	return 0;
 }
 
-static void handle_rx_net(struct vhost_work *work)
+static int net_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
+			     void *key)
 {
-	struct vhost_net *net = container_of(work, struct vhost_net,
-					     poll[VHOST_NET_VQ_RX].work);
-	handle_rx(net);
+	struct vhost_poll *poll = container_of(wait, struct vhost_poll, wait);
+	struct vhost_poll *head = (poll->mask == POLLIN) ? poll : poll-1;
+	struct vhost_net *net = container_of(head, struct vhost_net, poll[0]);
+
+	if (!((unsigned long)key & poll->mask))
+		return 0;
+
+	if (poll->mask == POLLIN)
+		deliver_worker(net, 1);
+	else
+		deliver_worker(net, 0);
+	return 0;
+}
+
+static void net_poll_init(struct vhost_poll *poll, unsigned long mask)
+{
+	init_waitqueue_func_entry(&poll->wait, net_poll_wakeup);
+	init_poll_funcptr(&poll->table, vhost_poll_func);
+	poll->mask = mask;
+	poll->subdev = NULL;
 }
 
 static int vhost_net_open(struct inode *inode, struct file *f)
 {
 	struct vhost_net *n = kmalloc(sizeof *n, GFP_KERNEL);
-	struct vhost_dev *dev;
-	int r;
-
 	if (!n)
 		return -ENOMEM;
-
-	dev = &n->dev;
-	n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
-	n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;
-	r = vhost_dev_init(dev, n->vqs, VHOST_NET_VQ_MAX);
-	if (r < 0) {
-		kfree(n);
-		return r;
-	}
-
-	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev);
-	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev);
-	n->tx_poll_state = VHOST_NET_POLL_DISABLED;
-
 	f->private_data = n;
-
 	return 0;
 }
 
-static void vhost_net_disable_vq(struct vhost_net *n,
-				 struct vhost_virtqueue *vq)
+static void vhost_net_disable_xmit(struct vhost_net *n, int rx)
 {
-	if (!vq->private_data)
-		return;
-	if (vq == n->vqs + VHOST_NET_VQ_TX) {
+	if (rx  == 0) {
 		tx_poll_stop(n);
 		n->tx_poll_state = VHOST_NET_POLL_DISABLED;
 	} else
-		vhost_poll_stop(n->poll + VHOST_NET_VQ_RX);
+		vhost_poll_stop(n->poll+VHOST_NET_VQ_RX);
 }
 
-static void vhost_net_enable_vq(struct vhost_net *n,
-				struct vhost_virtqueue *vq)
+static void vhost_net_enable_xmit(struct vhost_net *n, int rx)
 {
 	struct socket *sock;
 
-	sock = rcu_dereference_protected(vq->private_data,
-					 lockdep_is_held(&vq->mutex));
-	if (!sock)
-		return;
-	if (vq == n->vqs + VHOST_NET_VQ_TX) {
+	if (rx == 0) {
+		sock = rcu_dereference_protected(n->tx_sock,
+					 lockdep_is_held(&n->mutex));
+		if (!sock)
+			return;
 		n->tx_poll_state = VHOST_NET_POLL_STOPPED;
 		tx_poll_start(n, sock);
-	} else
+	} else {
+		sock = rcu_dereference_protected(n->rx_sock,
+					 lockdep_is_held(&n->mutex));
+		if (!sock)
+			return;
 		vhost_poll_start(n->poll + VHOST_NET_VQ_RX, sock->file);
+	}
 }
 
-static struct socket *vhost_net_stop_vq(struct vhost_net *n,
-					struct vhost_virtqueue *vq)
+static int vhost_net_stop_xmit(struct vhost_net *n, int rx)
 {
-	struct socket *sock;
-
-	mutex_lock(&vq->mutex);
-	sock = rcu_dereference_protected(vq->private_data,
-					 lockdep_is_held(&vq->mutex));
-	vhost_net_disable_vq(n, vq);
-	rcu_assign_pointer(vq->private_data, NULL);
-	mutex_unlock(&vq->mutex);
-	return sock;
+	mutex_lock(&n->mutex);
+	vhost_net_disable_xmit(n, rx);
+	mutex_unlock(&n->mutex);
+	return 0;
 }
 
-static void vhost_net_stop(struct vhost_net *n, struct socket **tx_sock,
-			   struct socket **rx_sock)
+static void vhost_net_stop(struct vhost_net *n)
 {
-	*tx_sock = vhost_net_stop_vq(n, n->vqs + VHOST_NET_VQ_TX);
-	*rx_sock = vhost_net_stop_vq(n, n->vqs + VHOST_NET_VQ_RX);
+	vhost_net_stop_xmit(n, 0);
+	vhost_net_stop_xmit(n, 1);
 }
 
-static void vhost_net_flush_vq(struct vhost_net *n, int index)
+/* We wait for vhost_work on all vqs to finish gp. And n->poll[] 
+ * are not vhost_work any longer
+ */
+static void vhost_net_flush_vq(struct vhost_net *n, int rx)
 {
-	vhost_poll_flush(n->poll + index);
-	vhost_poll_flush(&n->dev.vqs[index].poll);
+	int i, idx;
+	if (rx == 0) {
+		for (i = 0; i < n->dev.node_cnt; i++) {
+			idx = (i<<1) + VHOST_NET_VQ_TX;
+			vhost_poll_flush(&n->dev.vqs[idx]->poll);
+		}
+	} else {
+		for (i = 0; i < n->dev.node_cnt; i++) {
+			idx = (i<<1) + VHOST_NET_VQ_RX;
+			vhost_poll_flush(&n->dev.vqs[idx]->poll);
+		}
+	}
 }
 
 static void vhost_net_flush(struct vhost_net *n)
@@ -583,16 +610,16 @@ static void vhost_net_flush(struct vhost_net *n)
 static int vhost_net_release(struct inode *inode, struct file *f)
 {
 	struct vhost_net *n = f->private_data;
-	struct socket *tx_sock;
-	struct socket *rx_sock;
 
-	vhost_net_stop(n, &tx_sock, &rx_sock);
+	vhost_net_stop(n);
 	vhost_net_flush(n);
 	vhost_dev_cleanup(&n->dev, false);
-	if (tx_sock)
-		fput(tx_sock->file);
-	if (rx_sock)
-		fput(rx_sock->file);
+
+	if (n->tx_sock)
+		fput(n->tx_sock->file);
+	if (n->rx_sock)
+		fput(n->rx_sock->file);
+
 	/* We do an extra flush before freeing memory,
 	 * since jobs can re-queue themselves. */
 	vhost_net_flush(n);
@@ -665,30 +692,27 @@ static struct socket *get_socket(int fd)
 	return ERR_PTR(-ENOTSOCK);
 }
 
-static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
+static long vhost_net_set_backend(struct vhost_net *n, unsigned rx, int fd)
 {
 	struct socket *sock, *oldsock;
 	struct vhost_virtqueue *vq;
-	struct vhost_ubuf_ref *ubufs, *oldubufs = NULL;
-	int r;
+	struct vhost_ubuf_ref *ubufs, *old, **oldubufs = NULL;
+	int r, i;
+	struct vhost_poll *poll;
+	struct socket **target;
 
+	oldubufs = kmalloc(sizeof(void *)*n->dev.node_cnt, GFP_KERNEL);
+	if (oldubufs == NULL)
+		return -ENOMEM;
 	mutex_lock(&n->dev.mutex);
 	r = vhost_dev_check_owner(&n->dev);
 	if (r)
 		goto err;
+	if (rx)
+		target = &n->rx_sock;
+	else
+		target = &n->tx_sock;
 
-	if (index >= VHOST_NET_VQ_MAX) {
-		r = -ENOBUFS;
-		goto err;
-	}
-	vq = n->vqs + index;
-	mutex_lock(&vq->mutex);
-
-	/* Verify that ring has been setup correctly. */
-	if (!vhost_vq_access_ok(vq)) {
-		r = -EFAULT;
-		goto err_vq;
-	}
 	sock = get_socket(fd);
 	if (IS_ERR(sock)) {
 		r = PTR_ERR(sock);
@@ -696,70 +720,106 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 	}
 
 	/* start polling new socket */
-	oldsock = rcu_dereference_protected(vq->private_data,
-					    lockdep_is_held(&vq->mutex));
+	if (rx == 1)
+		/* todo, consider about protection, hold net->mutex? */
+		oldsock = rcu_dereference_protected(n->rx_sock, 1);
+	else
+		oldsock = rcu_dereference_protected(n->tx_sock, 1);
+
 	if (sock != oldsock) {
-		ubufs = vhost_ubuf_alloc(vq, sock && vhost_sock_zcopy(sock));
-		if (IS_ERR(ubufs)) {
-			r = PTR_ERR(ubufs);
-			goto err_ubufs;
+		if (rx == 1)
+			poll = &n->poll[0];
+		else
+			poll = &n->poll[1];
+
+		/* todo, consider about protection, hold net->mutex? */
+		vhost_poll_stop(poll);
+
+		for (i = 0; i < n->dev.node_cnt; i++) {
+			if (rx == 0)
+				vq = n->vqs[(i<<1)+VHOST_NET_VQ_TX];
+			else
+				vq = n->vqs[(i<<1)+VHOST_NET_VQ_RX];
+
+			mutex_lock(&vq->mutex);
+			ubufs = vhost_ubuf_alloc(vq, sock && vhost_sock_zcopy(sock));
+			if (IS_ERR(ubufs)) {
+				r = PTR_ERR(ubufs);
+				mutex_unlock(&vq->mutex);
+				goto err_ubufs;
+			}
+			oldubufs[i] = vq->ubufs;
+			vq->ubufs = ubufs;
+			r = vhost_init_used(vq);
+			mutex_unlock(&vq->mutex);
+			if (r)
+				goto err_vq;
 		}
-		oldubufs = vq->ubufs;
-		vq->ubufs = ubufs;
-		vhost_net_disable_vq(n, vq);
-		rcu_assign_pointer(vq->private_data, sock);
-		vhost_net_enable_vq(n, vq);
-
-		r = vhost_init_used(vq);
-		if (r)
-			goto err_vq;
+
+		mutex_lock(&n->mutex);
+		vhost_net_disable_xmit(n, rx);
+		if (rx == 1)
+			rcu_assign_pointer(n->rx_sock, sock);
+		else
+			rcu_assign_pointer(n->tx_sock, sock);
+		vhost_net_enable_xmit(n, rx);
+		mutex_unlock(&n->mutex);
+
+		/* todo, consider about protection, hold net->mutex? */
+		vhost_poll_start(poll, sock->file);
 	}
 
-	mutex_unlock(&vq->mutex);
+	for (i = 0; i < n->dev.node_cnt; i++) {
+		old = oldubufs[i];
+		if (rx == 0)
+			vq = n->vqs[(i<<1)+VHOST_NET_VQ_TX];
+		else
+			vq = n->vqs[(i<<1)+VHOST_NET_VQ_RX];
 
-	if (oldubufs) {
-		vhost_ubuf_put_and_wait(oldubufs);
-		mutex_lock(&vq->mutex);
-		vhost_zerocopy_signal_used(vq);
-		mutex_unlock(&vq->mutex);
+		if (old) {
+			vhost_ubuf_put_and_wait(old);
+			mutex_lock(&vq->mutex);
+			vhost_zerocopy_signal_used(vq);
+			mutex_unlock(&vq->mutex);
+		}
 	}
 
 	if (oldsock) {
-		vhost_net_flush_vq(n, index);
+		vhost_net_flush_vq(n, rx);
 		fput(oldsock->file);
 	}
 
 	mutex_unlock(&n->dev.mutex);
+	kfree(oldubufs);
 	return 0;
 
 err_ubufs:
 	fput(sock->file);
 err_vq:
-	mutex_unlock(&vq->mutex);
+	mutex_unlock(&n->mutex);
 err:
 	mutex_unlock(&n->dev.mutex);
+	kfree(oldubufs);
 	return r;
 }
 
 static long vhost_net_reset_owner(struct vhost_net *n)
 {
-	struct socket *tx_sock = NULL;
-	struct socket *rx_sock = NULL;
 	long err;
 
 	mutex_lock(&n->dev.mutex);
 	err = vhost_dev_check_owner(&n->dev);
 	if (err)
 		goto done;
-	vhost_net_stop(n, &tx_sock, &rx_sock);
+	vhost_net_stop(n);
 	vhost_net_flush(n);
 	err = vhost_dev_reset_owner(&n->dev);
 done:
 	mutex_unlock(&n->dev.mutex);
-	if (tx_sock)
-		fput(tx_sock->file);
-	if (rx_sock)
-		fput(rx_sock->file);
+	if (n->tx_sock)
+		fput(n->tx_sock->file);
+	if (n->rx_sock)
+		fput(n->rx_sock->file);
 	return err;
 }
 
@@ -788,17 +848,72 @@ static int vhost_net_set_features(struct vhost_net *n, u64 features)
 	}
 	n->dev.acked_features = features;
 	smp_wmb();
-	for (i = 0; i < VHOST_NET_VQ_MAX; ++i) {
-		mutex_lock(&n->vqs[i].mutex);
-		n->vqs[i].vhost_hlen = vhost_hlen;
-		n->vqs[i].sock_hlen = sock_hlen;
-		mutex_unlock(&n->vqs[i].mutex);
+	for (i = 0; i < n->vqcnt; ++i) {
+		mutex_lock(&n->vqs[i]->mutex);
+		n->vqs[i]->vhost_hlen = vhost_hlen;
+		n->vqs[i]->sock_hlen = sock_hlen;
+		mutex_unlock(&n->vqs[i]->mutex);
 	}
 	vhost_net_flush(n);
 	mutex_unlock(&n->dev.mutex);
 	return 0;
 }
 
+static int vhost_netdev_init(struct vhost_net *n)
+{
+	struct vhost_dev *dev;
+	vhost_work_fn_t *handle_kicks;
+	int r, i;
+	int cur, prev = 0;
+	int sz = 64;
+	int vqcnt;
+	int *vqs_map;
+	dev = &n->dev;
+	vqcnt = dev->node_cnt * 2;
+	n->vqs =  kmalloc(vqcnt*sizeof(void *), GFP_KERNEL);
+	handle_kicks = kmalloc(vqcnt*sizeof(void *), GFP_KERNEL);
+	vqs_map = kmalloc(vqcnt*sizeof(int), GFP_KERNEL);
+	for (i = 0; i < vqcnt;) {
+		cur = find_next_bit(&n->dev.allow_map, sz, prev);
+		prev = cur;
+		handle_kicks[i++] = handle_rx_kick;
+		vqs_map[i] = cur;
+		handle_kicks[i++] = handle_tx_kick;
+		vqs_map[i] = cur;
+
+	}
+
+	r = vhost_dev_alloc_subdevs(dev, &n->dev.allow_map, sz);
+	if (r < 0) {
+		/* todo, err handling */
+		return r;
+	}
+	r = vhost_dev_alloc_vqs(dev, n->vqs, vqcnt, vqs_map, vqcnt, handle_kicks);
+	if (r < 0) {
+		/* todo, err handling */
+		return r;
+	}
+	r = vhost_dev_init(dev, n->vqs, vqcnt);
+	if (r < 0)
+		goto exit;
+	if (experimental_zcopytx)
+		vhost_enable_zcopy(dev, 0);
+
+	net_poll_init(n->poll+VHOST_NET_VQ_TX, POLLOUT);
+	net_poll_init(n->poll+VHOST_NET_VQ_RX, POLLIN);
+	n->tx_poll_state = VHOST_NET_POLL_DISABLED;
+	n->numa_init = 1;
+	r = 0;
+exit:
+	kfree(handle_kicks);
+	kfree(vqs_map);
+	if (r == 0)
+		return 0;
+	kfree(n->vqs);
+	kfree(n);
+	return r;
+}
+
 static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
 			    unsigned long arg)
 {
@@ -808,8 +923,23 @@ static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
 	struct vhost_vring_file backend;
 	u64 features;
 	int r;
+	/* todo, dynamic allocated */
+	unsigned long bmp, sz = 64;
+
+	if (!n->numa_init && ioctl != VHOST_NET_SET_NUMA)
+		return -EOPNOTSUPP;
 
 	switch (ioctl) {
+	case VHOST_NET_SET_NUMA:
+		/* 4 must be extended. */
+		if (copy_from_user(&bmp, argp, 4))
+			return -EFAULT;
+		r = check_numa_bmp(&bmp, sz);
+		if (r < 0)
+			return -EINVAL;
+		n->dev.allow_map = bmp;
+		r = vhost_netdev_init(n);
+		return r;
 	case VHOST_NET_SET_BACKEND:
 		if (copy_from_user(&backend, argp, sizeof backend))
 			return -EFAULT;
@@ -863,8 +993,6 @@ static struct miscdevice vhost_net_misc = {
 
 static int vhost_net_init(void)
 {
-	if (experimental_zcopytx)
-		vhost_enable_zcopy(VHOST_NET_VQ_TX);
 	return misc_register(&vhost_net_misc);
 }
 module_init(vhost_net_init);
-- 
1.7.4.4

^ permalink raw reply related

* [PATCH 1/2] [kvm/vhost]: make vhost support NUMA model.
From: Liu Ping Fan @ 2012-05-17  9:20 UTC (permalink / raw)
  To: kvm, netdev
  Cc: linux-kernel, qemu-devel, Avi Kivity, Michael S. Tsirkin,
	Srivatsa Vaddagiri, Rusty Russell, Anthony Liguori, Ryan Harper,
	Shirley Ma, Krishna Kumar, Tom Lendacky
In-Reply-To: <1337246456-30909-1-git-send-email-kernelfans@gmail.com>

From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>

Make vhost allocate vhost_virtqueue on different host nodes as required.

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
 drivers/vhost/vhost.c |  380 +++++++++++++++++++++++++++++++++++--------------
 drivers/vhost/vhost.h |   41 ++++--
 include/linux/vhost.h |    2 +-
 3 files changed, 304 insertions(+), 119 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 51e4c1e..b0d2855 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -23,6 +23,7 @@
 #include <linux/file.h>
 #include <linux/highmem.h>
 #include <linux/slab.h>
+#include <linux/sched.h>
 #include <linux/kthread.h>
 #include <linux/cgroup.h>
 
@@ -37,12 +38,11 @@ enum {
 	VHOST_MEMORY_F_LOG = 0x1,
 };
 
-static unsigned vhost_zcopy_mask __read_mostly;
 
 #define vhost_used_event(vq) ((u16 __user *)&vq->avail->ring[vq->num])
 #define vhost_avail_event(vq) ((u16 __user *)&vq->used->ring[vq->num])
 
-static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
+void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
 			    poll_table *pt)
 {
 	struct vhost_poll *poll;
@@ -75,12 +75,12 @@ static void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
 
 /* Init poll structure */
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-		     unsigned long mask, struct vhost_dev *dev)
+		     unsigned long mask, struct vhost_sub_dev *dev)
 {
 	init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
 	init_poll_funcptr(&poll->table, vhost_poll_func);
 	poll->mask = mask;
-	poll->dev = dev;
+	poll->subdev = dev;
 
 	vhost_work_init(&poll->work, fn);
 }
@@ -103,7 +103,7 @@ void vhost_poll_stop(struct vhost_poll *poll)
 	remove_wait_queue(poll->wqh, &poll->wait);
 }
 
-static bool vhost_work_seq_done(struct vhost_dev *dev, struct vhost_work *work,
+static bool vhost_work_seq_done(struct vhost_sub_dev *dev, struct vhost_work *work,
 				unsigned seq)
 {
 	int left;
@@ -114,19 +114,19 @@ static bool vhost_work_seq_done(struct vhost_dev *dev, struct vhost_work *work,
 	return left <= 0;
 }
 
-static void vhost_work_flush(struct vhost_dev *dev, struct vhost_work *work)
+static void vhost_work_flush(struct vhost_sub_dev *sub, struct vhost_work *work)
 {
 	unsigned seq;
 	int flushing;
 
-	spin_lock_irq(&dev->work_lock);
+	spin_lock_irq(&sub->work_lock);
 	seq = work->queue_seq;
 	work->flushing++;
-	spin_unlock_irq(&dev->work_lock);
-	wait_event(work->done, vhost_work_seq_done(dev, work, seq));
-	spin_lock_irq(&dev->work_lock);
+	spin_unlock_irq(&sub->work_lock);
+	wait_event(work->done, vhost_work_seq_done(sub, work, seq));
+	spin_lock_irq(&sub->work_lock);
 	flushing = --work->flushing;
-	spin_unlock_irq(&dev->work_lock);
+	spin_unlock_irq(&sub->work_lock);
 	BUG_ON(flushing < 0);
 }
 
@@ -134,26 +134,26 @@ static void vhost_work_flush(struct vhost_dev *dev, struct vhost_work *work)
  * locks that are also used by the callback. */
 void vhost_poll_flush(struct vhost_poll *poll)
 {
-	vhost_work_flush(poll->dev, &poll->work);
+	vhost_work_flush(poll->subdev, &poll->work);
 }
 
-static inline void vhost_work_queue(struct vhost_dev *dev,
+static inline void vhost_work_queue(struct vhost_sub_dev *sub,
 				    struct vhost_work *work)
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&dev->work_lock, flags);
+	spin_lock_irqsave(&sub->work_lock, flags);
 	if (list_empty(&work->node)) {
-		list_add_tail(&work->node, &dev->work_list);
+		list_add_tail(&work->node, &sub->work_list);
 		work->queue_seq++;
-		wake_up_process(dev->worker);
+		wake_up_process(sub->worker);
 	}
-	spin_unlock_irqrestore(&dev->work_lock, flags);
+	spin_unlock_irqrestore(&sub->work_lock, flags);
 }
 
 void vhost_poll_queue(struct vhost_poll *poll)
 {
-	vhost_work_queue(poll->dev, &poll->work);
+	vhost_work_queue(poll->subdev, &poll->work);
 }
 
 static void vhost_vq_reset(struct vhost_dev *dev,
@@ -188,7 +188,8 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 
 static int vhost_worker(void *data)
 {
-	struct vhost_dev *dev = data;
+	struct vhost_sub_dev *sub = data;
+	struct vhost_dev *dev = sub->owner;
 	struct vhost_work *work = NULL;
 	unsigned uninitialized_var(seq);
 
@@ -198,7 +199,7 @@ static int vhost_worker(void *data)
 		/* mb paired w/ kthread_stop */
 		set_current_state(TASK_INTERRUPTIBLE);
 
-		spin_lock_irq(&dev->work_lock);
+		spin_lock_irq(&sub->work_lock);
 		if (work) {
 			work->done_seq = seq;
 			if (work->flushing)
@@ -206,18 +207,18 @@ static int vhost_worker(void *data)
 		}
 
 		if (kthread_should_stop()) {
-			spin_unlock_irq(&dev->work_lock);
+			spin_unlock_irq(&sub->work_lock);
 			__set_current_state(TASK_RUNNING);
 			break;
 		}
-		if (!list_empty(&dev->work_list)) {
-			work = list_first_entry(&dev->work_list,
+		if (!list_empty(&sub->work_list)) {
+			work = list_first_entry(&sub->work_list,
 						struct vhost_work, node);
 			list_del_init(&work->node);
 			seq = work->queue_seq;
 		} else
 			work = NULL;
-		spin_unlock_irq(&dev->work_lock);
+		spin_unlock_irq(&sub->work_lock);
 
 		if (work) {
 			__set_current_state(TASK_RUNNING);
@@ -244,54 +245,189 @@ static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq)
 	vq->ubuf_info = NULL;
 }
 
-void vhost_enable_zcopy(int vq)
+void vhost_enable_zcopy(struct vhost_dev *dev, int rx)
 {
-	vhost_zcopy_mask |= 0x1 << vq;
+	int i;
+	if (rx == 0)
+		for (i = 0; i < dev->node_cnt; i++)
+			dev->zcopy_mask |= 0x1<<(2*i+1);
 }
 
-/* Helper to allocate iovec buffers for all vqs. */
-static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
+/* Need for vq dynamicly allocator, which is important to migrate among NUMA */
+static int vhost_vq_alloc_iovecs(struct vhost_virtqueue *vq)
 {
-	int i;
 	bool zcopy;
+	int i;
+	struct vhost_dev *dev = vq->dev;
+	int node = vq->node_id;
+	vq->indirect = kmalloc_node(sizeof *vq->indirect  *
+					   UIO_MAXIOV, GFP_KERNEL, node);
+	vq->log = kmalloc_node(sizeof *vq->log * UIO_MAXIOV,
+				  GFP_KERNEL, node);
+	vq->heads = kmalloc_node(sizeof *vq->heads *
+					UIO_MAXIOV, GFP_KERNEL, node);
+	for (i = 0; i < dev->node_cnt*2; i++) {
+		if (dev->vqs[i] == vq) {
+			zcopy = dev->zcopy_mask & (0x1 << i);
+			break;
+		}
+	}
+	if (zcopy)
+		vq->ubuf_info =
+			kmalloc_node(sizeof *vq->ubuf_info *
+				UIO_MAXIOV, GFP_KERNEL, node);
+	if (!vq->indirect || !vq->log || !vq->heads ||
+		(zcopy && !vq->ubuf_info)) {
+		kfree(vq->indirect);
+		kfree(vq->log);
+		kfree(vq->heads);
+		kfree(vq->ubuf_info);
 
-	for (i = 0; i < dev->nvqs; ++i) {
-		dev->vqs[i].indirect = kmalloc(sizeof *dev->vqs[i].indirect *
-					       UIO_MAXIOV, GFP_KERNEL);
-		dev->vqs[i].log = kmalloc(sizeof *dev->vqs[i].log * UIO_MAXIOV,
-					  GFP_KERNEL);
-		dev->vqs[i].heads = kmalloc(sizeof *dev->vqs[i].heads *
-					    UIO_MAXIOV, GFP_KERNEL);
-		zcopy = vhost_zcopy_mask & (0x1 << i);
-		if (zcopy)
-			dev->vqs[i].ubuf_info =
-				kmalloc(sizeof *dev->vqs[i].ubuf_info *
-					UIO_MAXIOV, GFP_KERNEL);
-		if (!dev->vqs[i].indirect || !dev->vqs[i].log ||
-			!dev->vqs[i].heads ||
-			(zcopy && !dev->vqs[i].ubuf_info))
+		return -ENOMEM;
+	} else
+		return 0;
+}
+
+/* Helper to allocate iovec buffers for all vqs. */
+static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
+{
+	int i, ret;
+	for (i = 0; i < dev->nvqs; i++) {
+		ret = vhost_vq_alloc_iovecs(dev->vqs[i]);
+		if (ret < 0) {
+			i -= 1;
 			goto err_nomem;
+		}
 	}
 	return 0;
-
 err_nomem:
 	for (; i >= 0; --i)
-		vhost_vq_free_iovecs(&dev->vqs[i]);
+		vhost_vq_free_iovecs(dev->vqs[i]);
 	return -ENOMEM;
 }
 
 static void vhost_dev_free_iovecs(struct vhost_dev *dev)
 {
 	int i;
-
 	for (i = 0; i < dev->nvqs; ++i)
-		vhost_vq_free_iovecs(&dev->vqs[i]);
+		vhost_vq_free_iovecs(dev->vqs[i]);
 }
 
-long vhost_dev_init(struct vhost_dev *dev,
-		    struct vhost_virtqueue *vqs, int nvqs)
+int vhost_dev_alloc_subdevs(struct vhost_dev *dev, unsigned long *numa_map,
+	int sz)
+{
+	int i, j = 0;
+	int cur, prev = 0;
+	struct vhost_sub_dev *sub;
+	/* Todo,replace allow_map with dynamic allocated */
+	dev->allow_map = *numa_map;
+	dev->sub_devs = kmalloc(dev->node_cnt*sizeof(void *), GFP_KERNEL);
+
+	while (1) {
+		cur = find_next_bit(numa_map, sz, prev);
+		if (cur >= sz)
+			break;
+		prev = cur;
+		sub =  kmalloc_node(sizeof(struct vhost_sub_dev), GFP_KERNEL, cur);
+		if (sub == NULL)
+			goto err;
+		j++;
+		sub->node_id = cur;
+		sub->owner = dev;
+		spin_lock_init(&sub->work_lock);
+		INIT_LIST_HEAD(&sub->work_list);
+		dev->sub_devs[i] = sub;
+	}
+
+	dev->node_cnt = j;
+	return 0;
+err:
+	for (i = 0; i < j; i++) {
+		kfree(dev->sub_devs[i]);
+		dev->sub_devs[i] = NULL;
+	}
+	return -ENOMEM;
+
+}
+
+void vhost_dev_free_subdevs(struct vhost_dev *dev)
 {
 	int i;
+	for (i = 0; i < dev->node_cnt; i++)
+		kfree(dev->sub_devs[i]);
+	return;
+}
+
+static int check_numa(int *vqs_map, int sz)
+{
+	int i, node;
+
+	for (i = 0; i < sz; i++) {
+		for_each_online_node(node)
+			if (vqs_map[i] == node)
+				break;
+		if (vqs_map[i] != node)
+			return -1;
+	}
+	return 0;
+}
+
+int check_numa_bmp(unsigned long *numa_bmp, int sz)
+{
+	int i, node, cur, prev = 0;
+
+	for (i = 0; i < sz; i++) {
+		cur = find_next_bit(numa_bmp, sz, prev);
+		prev = cur;
+		if (cur >= sz)
+			return 0;
+		for_each_online_node(node)
+			if (cur == node)
+				break;
+		if (cur != node)
+			return -1;
+	}
+	return 0;
+}
+
+/* allocate vqs in node according to request map */
+int vhost_dev_alloc_vqs(struct vhost_dev *dev, struct vhost_virtqueue **vqs, int cnt,
+	int *vqs_map, int sz, vhost_work_fn_t *handle_kick)
+{
+	int r, i, j = 0;
+	r = check_numa(vqs_map, sz);
+	if (r < 0)
+		return -EINVAL;
+	for (i = 0; i < cnt ; i++) {
+		vqs[i] = kmalloc_node(sizeof(struct vhost_virtqueue),
+			GFP_KERNEL, vqs_map[i]);
+		if (vqs[i] == NULL)
+			goto err;
+		vqs[i]->handle_kick = handle_kick[i];
+		j = i;
+	}
+	return 0;
+err:
+	for (i = 0; i < j; i++)
+		kfree(vqs[i]);
+	return -ENOMEM;
+
+}
+
+void vhost_dev_free_vqs(struct vhost_dev *dev, struct vhost_virtqueue **vqs,
+	int cnt)
+{
+	int i;
+	for (i = 0; i < cnt ; i++)
+		kfree(vqs[i]);
+	return;
+}
+
+long vhost_dev_init(struct vhost_dev *dev, struct vhost_virtqueue **vqs, int nvqs)
+{
+	int i, j, ret = 0;
+	struct vhost_sub_dev *subdev;
+	struct vhost_virtqueue *vq;
 
 	dev->vqs = vqs;
 	dev->nvqs = nvqs;
@@ -300,24 +436,32 @@ long vhost_dev_init(struct vhost_dev *dev,
 	dev->log_file = NULL;
 	dev->memory = NULL;
 	dev->mm = NULL;
-	spin_lock_init(&dev->work_lock);
-	INIT_LIST_HEAD(&dev->work_list);
-	dev->worker = NULL;
 
 	for (i = 0; i < dev->nvqs; ++i) {
-		dev->vqs[i].log = NULL;
-		dev->vqs[i].indirect = NULL;
-		dev->vqs[i].heads = NULL;
-		dev->vqs[i].ubuf_info = NULL;
-		dev->vqs[i].dev = dev;
-		mutex_init(&dev->vqs[i].mutex);
-		vhost_vq_reset(dev, dev->vqs + i);
-		if (dev->vqs[i].handle_kick)
-			vhost_poll_init(&dev->vqs[i].poll,
-					dev->vqs[i].handle_kick, POLLIN, dev);
-	}
+		vq = dev->vqs[i];
+		/* for each numa node, in-vq/out-vq */
+		vq->log = NULL;
+		vq->indirect = NULL;
+		vq->heads = NULL;
+		vq->ubuf_info = NULL;
+		vq->dev = dev;
+		mutex_init(&vq->mutex);
+		vhost_vq_reset(dev, vq);
+
+		if (vq->handle_kick) {
+			for (j = 0; j < i; j++) {
+				subdev =  dev->sub_devs[j];
+				if (vq->node_id == subdev->node_id)
+					vhost_poll_init(&vq->poll, vq->handle_kick, POLLIN, subdev);
+				else {
+					vhost_poll_init(&vq->poll, vq->handle_kick, POLLIN, dev->sub_devs[0]);
+					ret = 1;
+				}
+			}
+		}
 
-	return 0;
+	}
+	return ret;
 }
 
 /* Caller should have device mutex */
@@ -344,19 +488,26 @@ static void vhost_attach_cgroups_work(struct vhost_work *work)
 static int vhost_attach_cgroups(struct vhost_dev *dev)
 {
 	struct vhost_attach_cgroups_struct attach;
-
+	int i, ret = 0;
+	struct vhost_sub_dev *sub;
 	attach.owner = current;
-	vhost_work_init(&attach.work, vhost_attach_cgroups_work);
-	vhost_work_queue(dev, &attach.work);
-	vhost_work_flush(dev, &attach.work);
-	return attach.ret;
+	for (i = 0; i < dev->node_cnt; i++) {
+		sub = dev->sub_devs[i];
+		vhost_work_init(&attach.work, vhost_attach_cgroups_work);
+		vhost_work_queue(sub, &attach.work);
+		vhost_work_flush(sub, &attach.work);
+		ret |= attach.ret;
+	}
+	return ret;
 }
 
 /* Caller should have device mutex */
 static long vhost_dev_set_owner(struct vhost_dev *dev)
 {
 	struct task_struct *worker;
-	int err;
+	int err, i, j, cur, prev = 0;
+	int sz = sizeof(unsigned long);
+	const struct cpumask *mask;
 
 	/* Is there an owner already? */
 	if (dev->mm) {
@@ -366,14 +517,19 @@ static long vhost_dev_set_owner(struct vhost_dev *dev)
 
 	/* No owner, become one */
 	dev->mm = get_task_mm(current);
-	worker = kthread_create(vhost_worker, dev, "vhost-%d", current->pid);
-	if (IS_ERR(worker)) {
-		err = PTR_ERR(worker);
-		goto err_worker;
+
+	for (i = 0, j = 0; i < dev->node_cnt; i++, j++) {
+		cur = find_next_bit(&dev->allow_map, sz, prev);
+		dev->sub_devs[i]->worker = kthread_create_on_node(vhost_worker,
+			dev->sub_devs[i], cur, "vhost-%d-node-%d", current->pid, cur);
+		if (dev->sub_devs[i]->worker == NULL)
+			goto err_cgroup;
+		mask = cpumask_of_node(cur);
+		do_set_cpus_allowed(worker, mask);
 	}
 
-	dev->worker = worker;
-	wake_up_process(worker);	/* avoid contributing to loadavg */
+	for (i = 0; i < dev->node_cnt; i++)
+		wake_up_process(dev->sub_devs[i]->worker);
 
 	err = vhost_attach_cgroups(dev);
 	if (err)
@@ -385,9 +541,12 @@ static long vhost_dev_set_owner(struct vhost_dev *dev)
 
 	return 0;
 err_cgroup:
-	kthread_stop(worker);
-	dev->worker = NULL;
-err_worker:
+
+	for (i = 0; i < j; i++) {
+		kthread_stop(dev->sub_devs[i]->worker);
+		dev->sub_devs[i]->worker = NULL;
+	}
+
 	if (dev->mm)
 		mmput(dev->mm);
 	dev->mm = NULL;
@@ -442,28 +601,28 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 	int i;
 
 	for (i = 0; i < dev->nvqs; ++i) {
-		if (dev->vqs[i].kick && dev->vqs[i].handle_kick) {
-			vhost_poll_stop(&dev->vqs[i].poll);
-			vhost_poll_flush(&dev->vqs[i].poll);
+		if (dev->vqs[i]->kick && dev->vqs[i]->handle_kick) {
+			vhost_poll_stop(&dev->vqs[i]->poll);
+			vhost_poll_flush(&dev->vqs[i]->poll);
 		}
 		/* Wait for all lower device DMAs done. */
-		if (dev->vqs[i].ubufs)
-			vhost_ubuf_put_and_wait(dev->vqs[i].ubufs);
+		if (dev->vqs[i]->ubufs)
+			vhost_ubuf_put_and_wait(dev->vqs[i]->ubufs);
 
 		/* Signal guest as appropriate. */
-		vhost_zerocopy_signal_used(&dev->vqs[i]);
-
-		if (dev->vqs[i].error_ctx)
-			eventfd_ctx_put(dev->vqs[i].error_ctx);
-		if (dev->vqs[i].error)
-			fput(dev->vqs[i].error);
-		if (dev->vqs[i].kick)
-			fput(dev->vqs[i].kick);
-		if (dev->vqs[i].call_ctx)
-			eventfd_ctx_put(dev->vqs[i].call_ctx);
-		if (dev->vqs[i].call)
-			fput(dev->vqs[i].call);
-		vhost_vq_reset(dev, dev->vqs + i);
+		vhost_zerocopy_signal_used(dev->vqs[i]);
+
+		if (dev->vqs[i]->error_ctx)
+			eventfd_ctx_put(dev->vqs[i]->error_ctx);
+		if (dev->vqs[i]->error)
+			fput(dev->vqs[i]->error);
+		if (dev->vqs[i]->kick)
+			fput(dev->vqs[i]->kick);
+		if (dev->vqs[i]->call_ctx)
+			eventfd_ctx_put(dev->vqs[i]->call_ctx);
+		if (dev->vqs[i]->call)
+			fput(dev->vqs[i]->call);
+		vhost_vq_reset(dev, dev->vqs[i]);
 	}
 	vhost_dev_free_iovecs(dev);
 	if (dev->log_ctx)
@@ -477,11 +636,15 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 					locked ==
 						lockdep_is_held(&dev->mutex)));
 	RCU_INIT_POINTER(dev->memory, NULL);
+
+	/* fixme,It will be considered and fixed in next verion */
 	WARN_ON(!list_empty(&dev->work_list));
 	if (dev->worker) {
 		kthread_stop(dev->worker);
 		dev->worker = NULL;
 	}
+	/* end*/
+
 	if (dev->mm)
 		mmput(dev->mm);
 	dev->mm = NULL;
@@ -534,14 +697,14 @@ static int memory_access_ok(struct vhost_dev *d, struct vhost_memory *mem,
 
 	for (i = 0; i < d->nvqs; ++i) {
 		int ok;
-		mutex_lock(&d->vqs[i].mutex);
+		mutex_lock(&d->vqs[i]->mutex);
 		/* If ring is inactive, will check when it's enabled. */
-		if (d->vqs[i].private_data)
-			ok = vq_memory_access_ok(d->vqs[i].log_base, mem,
+		if (d->vqs[i]->private_data)
+			ok = vq_memory_access_ok(d->vqs[i]->log_base, mem,
 						 log_all);
 		else
 			ok = 1;
-		mutex_unlock(&d->vqs[i].mutex);
+		mutex_unlock(&d->vqs[i]->mutex);
 		if (!ok)
 			return 0;
 	}
@@ -650,8 +813,7 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
 		return r;
 	if (idx >= d->nvqs)
 		return -ENOBUFS;
-
-	vq = d->vqs + idx;
+	vq = d->vqs[idx];
 
 	mutex_lock(&vq->mutex);
 
@@ -750,6 +912,7 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
 		vq->log_addr = a.log_guest_addr;
 		vq->used = (void __user *)(unsigned long)a.used_user_addr;
 		break;
+
 	case VHOST_SET_VRING_KICK:
 		if (copy_from_user(&f, argp, sizeof f)) {
 			r = -EFAULT;
@@ -766,6 +929,7 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
 		} else
 			filep = eventfp;
 		break;
+
 	case VHOST_SET_VRING_CALL:
 		if (copy_from_user(&f, argp, sizeof f)) {
 			r = -EFAULT;
@@ -863,7 +1027,7 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, unsigned long arg)
 		for (i = 0; i < d->nvqs; ++i) {
 			struct vhost_virtqueue *vq;
 			void __user *base = (void __user *)(unsigned long)p;
-			vq = d->vqs + i;
+			vq = d->vqs[i];
 			mutex_lock(&vq->mutex);
 			/* If ring is inactive, will check when it's enabled. */
 			if (vq->private_data && !vq_log_access_ok(d, vq, base))
@@ -890,9 +1054,9 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, unsigned long arg)
 		} else
 			filep = eventfp;
 		for (i = 0; i < d->nvqs; ++i) {
-			mutex_lock(&d->vqs[i].mutex);
-			d->vqs[i].log_ctx = d->log_ctx;
-			mutex_unlock(&d->vqs[i].mutex);
+			mutex_lock(&d->vqs[i]->mutex);
+			d->vqs[i]->log_ctx = d->log_ctx;
+			mutex_unlock(&d->vqs[i]->mutex);
 		}
 		if (ctx)
 			eventfd_ctx_put(ctx);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 8de1fd5..12d4237 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -13,12 +13,13 @@
 #include <linux/virtio_ring.h>
 #include <linux/atomic.h>
 
+#define VHOST_NUMA
 /* This is for zerocopy, used buffer len is set to 1 when lower device DMA
  * done */
 #define VHOST_DMA_DONE_LEN	1
 #define VHOST_DMA_CLEAR_LEN	0
 
-struct vhost_device;
+struct vhost_dev;
 
 struct vhost_work;
 typedef void (*vhost_work_fn_t)(struct vhost_work *work);
@@ -32,6 +33,8 @@ struct vhost_work {
 	unsigned		  done_seq;
 };
 
+struct vhost_sub_dev;
+
 /* Poll a file (eventfd or socket) */
 /* Note: there's nothing vhost specific about this structure. */
 struct vhost_poll {
@@ -40,11 +43,13 @@ struct vhost_poll {
 	wait_queue_t              wait;
 	struct vhost_work	  work;
 	unsigned long		  mask;
-	struct vhost_dev	 *dev;
+	struct vhost_sub_dev *subdev;
 };
 
+void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
+			    poll_table *pt);
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-		     unsigned long mask, struct vhost_dev *dev);
+		     unsigned long mask, struct vhost_sub_dev *dev);
 void vhost_poll_start(struct vhost_poll *poll, struct file *file);
 void vhost_poll_stop(struct vhost_poll *poll);
 void vhost_poll_flush(struct vhost_poll *poll);
@@ -70,7 +75,7 @@ void vhost_ubuf_put_and_wait(struct vhost_ubuf_ref *);
 /* The virtqueue structure describes a queue attached to a device. */
 struct vhost_virtqueue {
 	struct vhost_dev *dev;
-
+	int node_id;
 	/* The actual ring of buffers. */
 	struct mutex mutex;
 	unsigned int num;
@@ -143,6 +148,14 @@ struct vhost_virtqueue {
 	struct vhost_ubuf_ref *ubufs;
 };
 
+struct vhost_sub_dev {
+	struct vhost_dev *owner;
+	int node_id;
+	spinlock_t work_lock;
+	struct list_head work_list;
+	struct task_struct *worker;
+};
+
 struct vhost_dev {
 	/* Readers use RCU to access memory table pointer
 	 * log base pointer and features.
@@ -151,16 +164,24 @@ struct vhost_dev {
 	struct mm_struct *mm;
 	struct mutex mutex;
 	unsigned acked_features;
-	struct vhost_virtqueue *vqs;
+	struct vhost_virtqueue **vqs;
 	int nvqs;
 	struct file *log_file;
 	struct eventfd_ctx *log_ctx;
-	spinlock_t work_lock;
-	struct list_head work_list;
-	struct task_struct *worker;
+	/* todo, change it to bitmap */
+	unsigned long allow_map;
+	unsigned long node_cnt;
+	unsigned long zcopy_mask;
+	struct vhost_sub_dev **sub_devs;
 };
 
-long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue *vqs, int nvqs);
+int check_numa_bmp(unsigned long *numa_bmp, int sz);
+int vhost_dev_alloc_subdevs(struct vhost_dev *dev, unsigned long *numa_map,
+	int sz);
+void vhost_dev_free_subdevs(struct vhost_dev *dev);
+int vhost_dev_alloc_vqs(struct vhost_dev *dev, struct vhost_virtqueue **vqs,
+	int cnt, int *vqs_map, int sz, vhost_work_fn_t *handle_kick);
+long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs);
 long vhost_dev_check_owner(struct vhost_dev *);
 long vhost_dev_reset_owner(struct vhost_dev *);
 void vhost_dev_cleanup(struct vhost_dev *, bool locked);
@@ -216,6 +237,6 @@ static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
 	return acked_features & (1 << bit);
 }
 
-void vhost_enable_zcopy(int vq);
+void vhost_enable_zcopy(struct vhost_dev *dev, int rx);
 
 #endif
diff --git a/include/linux/vhost.h b/include/linux/vhost.h
index e847f1e..d8c76f1 100644
--- a/include/linux/vhost.h
+++ b/include/linux/vhost.h
@@ -120,7 +120,7 @@ struct vhost_memory {
  * used for transmit.  Pass fd -1 to unbind from the socket and the transmit
  * device.  This can be used to stop the ring (e.g. for migration). */
 #define VHOST_NET_SET_BACKEND _IOW(VHOST_VIRTIO, 0x30, struct vhost_vring_file)
-
+#define VHOST_NET_SET_NUMA  _IOW(VHOST_VIRTIO, 0x31, unsigned long)
 /* Feature bits */
 /* Log all write descriptors. Can be changed while device is active. */
 #define VHOST_F_LOG_ALL 26
-- 
1.7.4.4

^ permalink raw reply related

* [RFC:kvm] export host NUMA info to guest & make emulated device NUMA attr
From: Liu Ping Fan @ 2012-05-17  9:20 UTC (permalink / raw)
  To: kvm, netdev
  Cc: Krishna Kumar, Shirley Ma, Tom Lendacky, Michael S. Tsirkin,
	qemu-devel, Rusty Russell, Srivatsa Vaddagiri, linux-kernel,
	Ryan Harper, Avi Kivity, Anthony Liguori

Currently, the guest can not know the NUMA info of the vcpu, which will
result in performance drawback.

This is the discovered and experiment by
        Shirley Ma <xma@us.ibm.com>
        Krishna Kumar <krkumar2@in.ibm.com>
        Tom Lendacky <toml@us.ibm.com>
Refer to - http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html
we can see the big perfermance gap between NUMA aware and unaware.

Enlightened by their discovery, I think, we can do more work -- that is to
export NUMA info of host to guest.

So here comes the idea:
1. export host numa info through guest's sched domain to its scheduler
  Export vcpu's NUMA info to guest scheduler(I think mem NUMA problem
  has been handled by host).  So the guest's lb will consider the cost.
  I am still working on this, and my original idea is to export these info
  through "static struct sched_domain_topology_level *sched_domain_topology"
  to guest.

2. Do a better emulation of virt mach exported to guest.
  In real world, the devices are limited by kinds of reasons to own the NUMA
  property. But as to Qemu, the device is emulated by thread, which inherit
  the NUMA attr in nature.  We can implement the device as components of many
  logic units, each of the unit is backed by a thread in different host node.
  Currently, I want to start the work on vhost. But I think, maybe in
  future, the iothread in Qemu can also has such attr.

Forgive me, for the limited time, I can not have more better understand of
vhost/virtio_net drivers. These patches are just draft, _FAR_, _FAR_ from work.
I will do more detail work for them in future.

To easy the review, the following is the sum up of the 2nd point of the idea.
As for the 1st point of the idea, it is not reflected in the patches.

--spread/shrink the vhost_workers over the host nodes as demanded from Qemu.
  And we can consider each vhost_worker as an independent net logic device
  embeded in physical device "vhost_net".  At the meanwhile, we spread vcpu
  threads over the host node. 
  The vrings on guest are allocated PAGE_SIZE align separately, so they can 
  will only be mapped into different host node, so vhost_worker in the same
  node can access it with the least cost. So does the vq on guest.

--virtio_net driver will changes and talk with the logic device. And which
  logic device it will talk to is determined by on which vcpu it is scheduled.

--the binding of vcpus and vhost_worker is implemented by: 
  for call direction, vq-a in the node-A will have a dedicated irq-a. And 
  we set the irq-a's affinity to vcpus in node-A.
  for kick direction, kick register-b trigger different eventfd-b which wake up
  vhost_worker-b.

Please give some comments and suggestion.

Thanks and regards,
pingfan

^ permalink raw reply

* [PATCH net-next] tcp: bool conversions
From: Eric Dumazet @ 2012-05-17  9:15 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

From: Eric Dumazet <edumazet@google.com>

bool conversions where possible.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/tcp.h        |   56 ++++-----
 net/ipv4/tcp.c           |   20 +--
 net/ipv4/tcp_cong.c      |    6 -
 net/ipv4/tcp_hybla.c     |   10 -
 net/ipv4/tcp_input.c     |  214 ++++++++++++++++++-------------------
 net/ipv4/tcp_ipv4.c      |   26 ++--
 net/ipv4/tcp_minisocks.c |   24 ++--
 net/ipv4/tcp_output.c    |   75 ++++++------
 net/ipv6/tcp_ipv6.c      |    4 
 9 files changed, 219 insertions(+), 216 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index aaf5de9..e79aa48 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -263,14 +263,14 @@ extern int tcp_memory_pressure;
  * and worry about wraparound (automatic with unsigned arithmetic).
  */
 
-static inline int before(__u32 seq1, __u32 seq2)
+static inline bool before(__u32 seq1, __u32 seq2)
 {
         return (__s32)(seq1-seq2) < 0;
 }
 #define after(seq2, seq1) 	before(seq1, seq2)
 
 /* is s2<=s1<=s3 ? */
-static inline int between(__u32 seq1, __u32 seq2, __u32 seq3)
+static inline bool between(__u32 seq1, __u32 seq2, __u32 seq3)
 {
 	return seq3 - seq2 >= seq1 - seq2;
 }
@@ -305,7 +305,7 @@ static inline void tcp_synq_overflow(struct sock *sk)
 }
 
 /* syncookies: no recent synqueue overflow on this listening socket? */
-static inline int tcp_synq_no_recent_overflow(const struct sock *sk)
+static inline bool tcp_synq_no_recent_overflow(const struct sock *sk)
 {
 	unsigned long last_overflow = tcp_sk(sk)->rx_opt.ts_recent_stamp;
 	return time_after(jiffies, last_overflow + TCP_TIMEOUT_FALLBACK);
@@ -383,7 +383,7 @@ extern struct sock * tcp_check_req(struct sock *sk,struct sk_buff *skb,
 				   struct request_sock **prev);
 extern int tcp_child_process(struct sock *parent, struct sock *child,
 			     struct sk_buff *skb);
-extern int tcp_use_frto(struct sock *sk);
+extern bool tcp_use_frto(struct sock *sk);
 extern void tcp_enter_frto(struct sock *sk);
 extern void tcp_enter_loss(struct sock *sk, int how);
 extern void tcp_clear_retrans(struct tcp_sock *tp);
@@ -470,7 +470,7 @@ static inline __u32 cookie_v6_init_sequence(struct sock *sk,
 
 extern void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
 				      int nonagle);
-extern int tcp_may_send_now(struct sock *sk);
+extern bool tcp_may_send_now(struct sock *sk);
 extern int tcp_retransmit_skb(struct sock *, struct sk_buff *);
 extern void tcp_retransmit_timer(struct sock *sk);
 extern void tcp_xmit_retransmit_queue(struct sock *);
@@ -484,9 +484,9 @@ extern int tcp_write_wakeup(struct sock *);
 extern void tcp_send_fin(struct sock *sk);
 extern void tcp_send_active_reset(struct sock *sk, gfp_t priority);
 extern int tcp_send_synack(struct sock *);
-extern int tcp_syn_flood_action(struct sock *sk,
-				const struct sk_buff *skb,
-				const char *proto);
+extern bool tcp_syn_flood_action(struct sock *sk,
+				 const struct sk_buff *skb,
+				 const char *proto);
 extern void tcp_push_one(struct sock *, unsigned int mss_now);
 extern void tcp_send_ack(struct sock *sk);
 extern void tcp_send_delayed_ack(struct sock *sk);
@@ -794,12 +794,12 @@ static inline int tcp_is_sack(const struct tcp_sock *tp)
 	return tp->rx_opt.sack_ok;
 }
 
-static inline int tcp_is_reno(const struct tcp_sock *tp)
+static inline bool tcp_is_reno(const struct tcp_sock *tp)
 {
 	return !tcp_is_sack(tp);
 }
 
-static inline int tcp_is_fack(const struct tcp_sock *tp)
+static inline bool tcp_is_fack(const struct tcp_sock *tp)
 {
 	return tp->rx_opt.sack_ok & TCP_FACK_ENABLED;
 }
@@ -901,7 +901,7 @@ static inline u32 tcp_wnd_end(const struct tcp_sock *tp)
 {
 	return tp->snd_una + tp->snd_wnd;
 }
-extern int tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight);
+extern bool tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight);
 
 static inline void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss,
 				       const struct sk_buff *skb)
@@ -944,7 +944,7 @@ static inline __sum16 __tcp_checksum_complete(struct sk_buff *skb)
 	return __skb_checksum_complete(skb);
 }
 
-static inline int tcp_checksum_complete(struct sk_buff *skb)
+static inline bool tcp_checksum_complete(struct sk_buff *skb)
 {
 	return !skb_csum_unnecessary(skb) &&
 		__tcp_checksum_complete(skb);
@@ -974,12 +974,12 @@ static inline void tcp_prequeue_init(struct tcp_sock *tp)
  *
  * NOTE: is this not too big to inline?
  */
-static inline int tcp_prequeue(struct sock *sk, struct sk_buff *skb)
+static inline bool tcp_prequeue(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
 	if (sysctl_tcp_low_latency || !tp->ucopy.task)
-		return 0;
+		return false;
 
 	__skb_queue_tail(&tp->ucopy.prequeue, skb);
 	tp->ucopy.memory += skb->truesize;
@@ -1003,7 +1003,7 @@ static inline int tcp_prequeue(struct sock *sk, struct sk_buff *skb)
 						  (3 * tcp_rto_min(sk)) / 4,
 						  TCP_RTO_MAX);
 	}
-	return 1;
+	return true;
 }
 
 
@@ -1108,28 +1108,28 @@ static inline int tcp_fin_time(const struct sock *sk)
 	return fin_timeout;
 }
 
-static inline int tcp_paws_check(const struct tcp_options_received *rx_opt,
-				 int paws_win)
+static inline bool tcp_paws_check(const struct tcp_options_received *rx_opt,
+				  int paws_win)
 {
 	if ((s32)(rx_opt->ts_recent - rx_opt->rcv_tsval) <= paws_win)
-		return 1;
+		return true;
 	if (unlikely(get_seconds() >= rx_opt->ts_recent_stamp + TCP_PAWS_24DAYS))
-		return 1;
+		return true;
 	/*
 	 * Some OSes send SYN and SYNACK messages with tsval=0 tsecr=0,
 	 * then following tcp messages have valid values. Ignore 0 value,
 	 * or else 'negative' tsval might forbid us to accept their packets.
 	 */
 	if (!rx_opt->ts_recent)
-		return 1;
-	return 0;
+		return true;
+	return false;
 }
 
-static inline int tcp_paws_reject(const struct tcp_options_received *rx_opt,
-				  int rst)
+static inline bool tcp_paws_reject(const struct tcp_options_received *rx_opt,
+				   int rst)
 {
 	if (tcp_paws_check(rx_opt, 0))
-		return 0;
+		return false;
 
 	/* RST segments are not recommended to carry timestamp,
 	   and, if they do, it is recommended to ignore PAWS because
@@ -1144,8 +1144,8 @@ static inline int tcp_paws_reject(const struct tcp_options_received *rx_opt,
 	   However, we can relax time bounds for RST segments to MSL.
 	 */
 	if (rst && get_seconds() >= rx_opt->ts_recent_stamp + TCP_PAWS_MSL)
-		return 0;
-	return 1;
+		return false;
+	return true;
 }
 
 static inline void tcp_mib_init(struct net *net)
@@ -1383,7 +1383,7 @@ static inline void tcp_unlink_write_queue(struct sk_buff *skb, struct sock *sk)
 	__skb_unlink(skb, &sk->sk_write_queue);
 }
 
-static inline int tcp_write_queue_empty(struct sock *sk)
+static inline bool tcp_write_queue_empty(struct sock *sk)
 {
 	return skb_queue_empty(&sk->sk_write_queue);
 }
@@ -1440,7 +1440,7 @@ static inline void tcp_highest_sack_combine(struct sock *sk,
 /* Determines whether this is a thin stream (which may suffer from
  * increased latency). Used to trigger latency-reducing mechanisms.
  */
-static inline unsigned int tcp_stream_is_thin(struct tcp_sock *tp)
+static inline bool tcp_stream_is_thin(struct tcp_sock *tp)
 {
 	return tp->packets_out < 4 && !tcp_in_initial_slowstart(tp);
 }
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e8a80d0..63ddaee 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -593,7 +593,7 @@ static inline void tcp_mark_push(struct tcp_sock *tp, struct sk_buff *skb)
 	tp->pushed_seq = tp->write_seq;
 }
 
-static inline int forced_push(const struct tcp_sock *tp)
+static inline bool forced_push(const struct tcp_sock *tp)
 {
 	return after(tp->write_seq, tp->pushed_seq + (tp->max_window >> 1));
 }
@@ -1082,7 +1082,7 @@ new_segment:
 				if (err)
 					goto do_fault;
 			} else {
-				int merge = 0;
+				bool merge = false;
 				int i = skb_shinfo(skb)->nr_frags;
 				struct page *page = sk->sk_sndmsg_page;
 				int off;
@@ -1096,7 +1096,7 @@ new_segment:
 				    off != PAGE_SIZE) {
 					/* We can extend the last page
 					 * fragment. */
-					merge = 1;
+					merge = true;
 				} else if (i == MAX_SKB_FRAGS || !sg) {
 					/* Need to add new fragment and cannot
 					 * do this because interface is non-SG,
@@ -1293,7 +1293,7 @@ static int tcp_peek_sndq(struct sock *sk, struct msghdr *msg, int len)
 void tcp_cleanup_rbuf(struct sock *sk, int copied)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	int time_to_ack = 0;
+	bool time_to_ack = false;
 
 	struct sk_buff *skb = skb_peek(&sk->sk_receive_queue);
 
@@ -1319,7 +1319,7 @@ void tcp_cleanup_rbuf(struct sock *sk, int copied)
 		      ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
 		       !icsk->icsk_ack.pingpong)) &&
 		      !atomic_read(&sk->sk_rmem_alloc)))
-			time_to_ack = 1;
+			time_to_ack = true;
 	}
 
 	/* We send an ACK if we can now advertise a non-zero window
@@ -1341,7 +1341,7 @@ void tcp_cleanup_rbuf(struct sock *sk, int copied)
 			 * "Lots" means "at least twice" here.
 			 */
 			if (new_window && new_window >= 2 * rcv_window_now)
-				time_to_ack = 1;
+				time_to_ack = true;
 		}
 	}
 	if (time_to_ack)
@@ -2171,7 +2171,7 @@ EXPORT_SYMBOL(tcp_close);
 
 /* These states need RST on ABORT according to RFC793 */
 
-static inline int tcp_need_reset(int state)
+static inline bool tcp_need_reset(int state)
 {
 	return (1 << state) &
 	       (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT | TCPF_FIN_WAIT1 |
@@ -2245,7 +2245,7 @@ int tcp_disconnect(struct sock *sk, int flags)
 }
 EXPORT_SYMBOL(tcp_disconnect);
 
-static inline int tcp_can_repair_sock(struct sock *sk)
+static inline bool tcp_can_repair_sock(const struct sock *sk)
 {
 	return capable(CAP_NET_ADMIN) &&
 		((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_ESTABLISHED));
@@ -3172,13 +3172,13 @@ out_free:
 struct tcp_md5sig_pool __percpu *tcp_alloc_md5sig_pool(struct sock *sk)
 {
 	struct tcp_md5sig_pool __percpu *pool;
-	int alloc = 0;
+	bool alloc = false;
 
 retry:
 	spin_lock_bh(&tcp_md5sig_pool_lock);
 	pool = tcp_md5sig_pool;
 	if (tcp_md5sig_users++ == 0) {
-		alloc = 1;
+		alloc = true;
 		spin_unlock_bh(&tcp_md5sig_pool_lock);
 	} else if (!pool) {
 		tcp_md5sig_users--;
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 272a845..04dbd7a 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -280,19 +280,19 @@ int tcp_set_congestion_control(struct sock *sk, const char *name)
 /* RFC2861 Check whether we are limited by application or congestion window
  * This is the inverse of cwnd check in tcp_tso_should_defer
  */
-int tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight)
+bool tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
 	u32 left;
 
 	if (in_flight >= tp->snd_cwnd)
-		return 1;
+		return true;
 
 	left = tp->snd_cwnd - in_flight;
 	if (sk_can_gso(sk) &&
 	    left * sysctl_tcp_tso_win_divisor < tp->snd_cwnd &&
 	    left * tp->mss_cache < sk->sk_gso_max_size)
-		return 1;
+		return true;
 	return left <= tcp_max_tso_deferred_mss(tp);
 }
 EXPORT_SYMBOL_GPL(tcp_is_cwnd_limited);
diff --git a/net/ipv4/tcp_hybla.c b/net/ipv4/tcp_hybla.c
index fe3ecf4..57bdd17 100644
--- a/net/ipv4/tcp_hybla.c
+++ b/net/ipv4/tcp_hybla.c
@@ -15,7 +15,7 @@
 
 /* Tcp Hybla structure. */
 struct hybla {
-	u8    hybla_en;
+	bool  hybla_en;
 	u32   snd_cwnd_cents; /* Keeps increment values when it is <1, <<7 */
 	u32   rho;	      /* Rho parameter, integer part  */
 	u32   rho2;	      /* Rho * Rho, integer part */
@@ -24,8 +24,7 @@ struct hybla {
 	u32   minrtt;	      /* Minimum smoothed round trip time value seen */
 };
 
-/* Hybla reference round trip time (default= 1/40 sec = 25 ms),
-   expressed in jiffies */
+/* Hybla reference round trip time (default= 1/40 sec = 25 ms), in ms */
 static int rtt0 = 25;
 module_param(rtt0, int, 0644);
 MODULE_PARM_DESC(rtt0, "reference rout trip time (ms)");
@@ -39,7 +38,7 @@ static inline void hybla_recalc_param (struct sock *sk)
 	ca->rho_3ls = max_t(u32, tcp_sk(sk)->srtt / msecs_to_jiffies(rtt0), 8);
 	ca->rho = ca->rho_3ls >> 3;
 	ca->rho2_7ls = (ca->rho_3ls * ca->rho_3ls) << 1;
-	ca->rho2 = ca->rho2_7ls >>7;
+	ca->rho2 = ca->rho2_7ls >> 7;
 }
 
 static void hybla_init(struct sock *sk)
@@ -52,7 +51,7 @@ static void hybla_init(struct sock *sk)
 	ca->rho_3ls = 0;
 	ca->rho2_7ls = 0;
 	ca->snd_cwnd_cents = 0;
-	ca->hybla_en = 1;
+	ca->hybla_en = true;
 	tp->snd_cwnd = 2;
 	tp->snd_cwnd_clamp = 65535;
 
@@ -67,6 +66,7 @@ static void hybla_init(struct sock *sk)
 static void hybla_state(struct sock *sk, u8 ca_state)
 {
 	struct hybla *ca = inet_csk_ca(sk);
+
 	ca->hybla_en = (ca_state == TCP_CA_Open);
 }
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index eb97787..b961ef5 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -196,9 +196,10 @@ static void tcp_enter_quickack_mode(struct sock *sk)
  * and the session is not interactive.
  */
 
-static inline int tcp_in_quickack_mode(const struct sock *sk)
+static inline bool tcp_in_quickack_mode(const struct sock *sk)
 {
 	const struct inet_connection_sock *icsk = inet_csk(sk);
+
 	return icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong;
 }
 
@@ -253,11 +254,11 @@ static inline void TCP_ECN_rcv_syn(struct tcp_sock *tp, const struct tcphdr *th)
 		tp->ecn_flags &= ~TCP_ECN_OK;
 }
 
-static inline int TCP_ECN_rcv_ecn_echo(const struct tcp_sock *tp, const struct tcphdr *th)
+static bool TCP_ECN_rcv_ecn_echo(const struct tcp_sock *tp, const struct tcphdr *th)
 {
 	if (th->ece && !th->syn && (tp->ecn_flags & TCP_ECN_OK))
-		return 1;
-	return 0;
+		return true;
+	return false;
 }
 
 /* Buffer size and advertised window tuning.
@@ -1123,36 +1124,36 @@ static void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp,
  * the exact amount is rather hard to quantify. However, tp->max_window can
  * be used as an exaggerated estimate.
  */
-static int tcp_is_sackblock_valid(struct tcp_sock *tp, int is_dsack,
-				  u32 start_seq, u32 end_seq)
+static bool tcp_is_sackblock_valid(struct tcp_sock *tp, bool is_dsack,
+				   u32 start_seq, u32 end_seq)
 {
 	/* Too far in future, or reversed (interpretation is ambiguous) */
 	if (after(end_seq, tp->snd_nxt) || !before(start_seq, end_seq))
-		return 0;
+		return false;
 
 	/* Nasty start_seq wrap-around check (see comments above) */
 	if (!before(start_seq, tp->snd_nxt))
-		return 0;
+		return false;
 
 	/* In outstanding window? ...This is valid exit for D-SACKs too.
 	 * start_seq == snd_una is non-sensical (see comments above)
 	 */
 	if (after(start_seq, tp->snd_una))
-		return 1;
+		return true;
 
 	if (!is_dsack || !tp->undo_marker)
-		return 0;
+		return false;
 
 	/* ...Then it's D-SACK, and must reside below snd_una completely */
 	if (after(end_seq, tp->snd_una))
-		return 0;
+		return false;
 
 	if (!before(start_seq, tp->undo_marker))
-		return 1;
+		return true;
 
 	/* Too old */
 	if (!after(end_seq, tp->undo_marker))
-		return 0;
+		return false;
 
 	/* Undo_marker boundary crossing (overestimates a lot). Known already:
 	 *   start_seq < undo_marker and end_seq >= undo_marker.
@@ -1224,17 +1225,17 @@ static void tcp_mark_lost_retrans(struct sock *sk)
 		tp->lost_retrans_low = new_low_seq;
 }
 
-static int tcp_check_dsack(struct sock *sk, const struct sk_buff *ack_skb,
-			   struct tcp_sack_block_wire *sp, int num_sacks,
-			   u32 prior_snd_una)
+static bool tcp_check_dsack(struct sock *sk, const struct sk_buff *ack_skb,
+			    struct tcp_sack_block_wire *sp, int num_sacks,
+			    u32 prior_snd_una)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	u32 start_seq_0 = get_unaligned_be32(&sp[0].start_seq);
 	u32 end_seq_0 = get_unaligned_be32(&sp[0].end_seq);
-	int dup_sack = 0;
+	bool dup_sack = false;
 
 	if (before(start_seq_0, TCP_SKB_CB(ack_skb)->ack_seq)) {
-		dup_sack = 1;
+		dup_sack = true;
 		tcp_dsack_seen(tp);
 		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPDSACKRECV);
 	} else if (num_sacks > 1) {
@@ -1243,7 +1244,7 @@ static int tcp_check_dsack(struct sock *sk, const struct sk_buff *ack_skb,
 
 		if (!after(end_seq_0, end_seq_1) &&
 		    !before(start_seq_0, start_seq_1)) {
-			dup_sack = 1;
+			dup_sack = true;
 			tcp_dsack_seen(tp);
 			NET_INC_STATS_BH(sock_net(sk),
 					LINUX_MIB_TCPDSACKOFORECV);
@@ -1274,9 +1275,10 @@ struct tcp_sacktag_state {
  * FIXME: this could be merged to shift decision code
  */
 static int tcp_match_skb_to_sack(struct sock *sk, struct sk_buff *skb,
-				 u32 start_seq, u32 end_seq)
+				  u32 start_seq, u32 end_seq)
 {
-	int in_sack, err;
+	int err;
+	bool in_sack;
 	unsigned int pkt_len;
 	unsigned int mss;
 
@@ -1322,7 +1324,7 @@ static int tcp_match_skb_to_sack(struct sock *sk, struct sk_buff *skb,
 static u8 tcp_sacktag_one(struct sock *sk,
 			  struct tcp_sacktag_state *state, u8 sacked,
 			  u32 start_seq, u32 end_seq,
-			  int dup_sack, int pcount)
+			  bool dup_sack, int pcount)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	int fack_count = state->fack_count;
@@ -1402,10 +1404,10 @@ static u8 tcp_sacktag_one(struct sock *sk,
 /* Shift newly-SACKed bytes from this skb to the immediately previous
  * already-SACKed sk_buff. Mark the newly-SACKed bytes as such.
  */
-static int tcp_shifted_skb(struct sock *sk, struct sk_buff *skb,
-			   struct tcp_sacktag_state *state,
-			   unsigned int pcount, int shifted, int mss,
-			   int dup_sack)
+static bool tcp_shifted_skb(struct sock *sk, struct sk_buff *skb,
+			    struct tcp_sacktag_state *state,
+			    unsigned int pcount, int shifted, int mss,
+			    bool dup_sack)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *prev = tcp_write_queue_prev(sk, skb);
@@ -1455,7 +1457,7 @@ static int tcp_shifted_skb(struct sock *sk, struct sk_buff *skb,
 	if (skb->len > 0) {
 		BUG_ON(!tcp_skb_pcount(skb));
 		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SACKSHIFTED);
-		return 0;
+		return false;
 	}
 
 	/* Whole SKB was eaten :-) */
@@ -1478,7 +1480,7 @@ static int tcp_shifted_skb(struct sock *sk, struct sk_buff *skb,
 
 	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SACKMERGED);
 
-	return 1;
+	return true;
 }
 
 /* I wish gso_size would have a bit more sane initialization than
@@ -1501,7 +1503,7 @@ static int skb_can_shift(const struct sk_buff *skb)
 static struct sk_buff *tcp_shift_skb_data(struct sock *sk, struct sk_buff *skb,
 					  struct tcp_sacktag_state *state,
 					  u32 start_seq, u32 end_seq,
-					  int dup_sack)
+					  bool dup_sack)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *prev;
@@ -1640,14 +1642,14 @@ static struct sk_buff *tcp_sacktag_walk(struct sk_buff *skb, struct sock *sk,
 					struct tcp_sack_block *next_dup,
 					struct tcp_sacktag_state *state,
 					u32 start_seq, u32 end_seq,
-					int dup_sack_in)
+					bool dup_sack_in)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *tmp;
 
 	tcp_for_write_queue_from(skb, sk) {
 		int in_sack = 0;
-		int dup_sack = dup_sack_in;
+		bool dup_sack = dup_sack_in;
 
 		if (skb == tcp_send_head(sk))
 			break;
@@ -1662,7 +1664,7 @@ static struct sk_buff *tcp_sacktag_walk(struct sk_buff *skb, struct sock *sk,
 							next_dup->start_seq,
 							next_dup->end_seq);
 			if (in_sack > 0)
-				dup_sack = 1;
+				dup_sack = true;
 		}
 
 		/* skb reference here is a bit tricky to get right, since
@@ -1767,7 +1769,7 @@ tcp_sacktag_write_queue(struct sock *sk, const struct sk_buff *ack_skb,
 	struct sk_buff *skb;
 	int num_sacks = min(TCP_NUM_SACKS, (ptr[1] - TCPOLEN_SACK_BASE) >> 3);
 	int used_sacks;
-	int found_dup_sack = 0;
+	bool found_dup_sack = false;
 	int i, j;
 	int first_sack_index;
 
@@ -1798,7 +1800,7 @@ tcp_sacktag_write_queue(struct sock *sk, const struct sk_buff *ack_skb,
 	used_sacks = 0;
 	first_sack_index = 0;
 	for (i = 0; i < num_sacks; i++) {
-		int dup_sack = !i && found_dup_sack;
+		bool dup_sack = !i && found_dup_sack;
 
 		sp[used_sacks].start_seq = get_unaligned_be32(&sp_wire[i].start_seq);
 		sp[used_sacks].end_seq = get_unaligned_be32(&sp_wire[i].end_seq);
@@ -1865,7 +1867,7 @@ tcp_sacktag_write_queue(struct sock *sk, const struct sk_buff *ack_skb,
 	while (i < used_sacks) {
 		u32 start_seq = sp[i].start_seq;
 		u32 end_seq = sp[i].end_seq;
-		int dup_sack = (found_dup_sack && (i == first_sack_index));
+		bool dup_sack = (found_dup_sack && (i == first_sack_index));
 		struct tcp_sack_block *next_dup = NULL;
 
 		if (found_dup_sack && ((i + 1) == first_sack_index))
@@ -1967,9 +1969,9 @@ out:
 }
 
 /* Limits sacked_out so that sum with lost_out isn't ever larger than
- * packets_out. Returns zero if sacked_out adjustement wasn't necessary.
+ * packets_out. Returns false if sacked_out adjustement wasn't necessary.
  */
-static int tcp_limit_reno_sacked(struct tcp_sock *tp)
+static bool tcp_limit_reno_sacked(struct tcp_sock *tp)
 {
 	u32 holes;
 
@@ -1978,9 +1980,9 @@ static int tcp_limit_reno_sacked(struct tcp_sock *tp)
 
 	if ((tp->sacked_out + holes) > tp->packets_out) {
 		tp->sacked_out = tp->packets_out - holes;
-		return 1;
+		return true;
 	}
-	return 0;
+	return false;
 }
 
 /* If we receive more dupacks than we expected counting segments
@@ -2034,40 +2036,40 @@ static int tcp_is_sackfrto(const struct tcp_sock *tp)
 /* F-RTO can only be used if TCP has never retransmitted anything other than
  * head (SACK enhanced variant from Appendix B of RFC4138 is more robust here)
  */
-int tcp_use_frto(struct sock *sk)
+bool tcp_use_frto(struct sock *sk)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
 	const struct inet_connection_sock *icsk = inet_csk(sk);
 	struct sk_buff *skb;
 
 	if (!sysctl_tcp_frto)
-		return 0;
+		return false;
 
 	/* MTU probe and F-RTO won't really play nicely along currently */
 	if (icsk->icsk_mtup.probe_size)
-		return 0;
+		return false;
 
 	if (tcp_is_sackfrto(tp))
-		return 1;
+		return true;
 
 	/* Avoid expensive walking of rexmit queue if possible */
 	if (tp->retrans_out > 1)
-		return 0;
+		return false;
 
 	skb = tcp_write_queue_head(sk);
 	if (tcp_skb_is_last(sk, skb))
-		return 1;
+		return true;
 	skb = tcp_write_queue_next(sk, skb);	/* Skips head */
 	tcp_for_write_queue_from(skb, sk) {
 		if (skb == tcp_send_head(sk))
 			break;
 		if (TCP_SKB_CB(skb)->sacked & TCPCB_RETRANS)
-			return 0;
+			return false;
 		/* Short-circuit when first non-SACKed skb has been checked */
 		if (!(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED))
 			break;
 	}
-	return 1;
+	return true;
 }
 
 /* RTO occurred, but do not yet enter Loss state. Instead, defer RTO
@@ -2303,7 +2305,7 @@ void tcp_enter_loss(struct sock *sk, int how)
  *
  * Do processing similar to RTO timeout.
  */
-static int tcp_check_sack_reneging(struct sock *sk, int flag)
+static bool tcp_check_sack_reneging(struct sock *sk, int flag)
 {
 	if (flag & FLAG_SACK_RENEGING) {
 		struct inet_connection_sock *icsk = inet_csk(sk);
@@ -2314,9 +2316,9 @@ static int tcp_check_sack_reneging(struct sock *sk, int flag)
 		tcp_retransmit_skb(sk, tcp_write_queue_head(sk));
 		inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
 					  icsk->icsk_rto, TCP_RTO_MAX);
-		return 1;
+		return true;
 	}
-	return 0;
+	return false;
 }
 
 static inline int tcp_fackets_out(const struct tcp_sock *tp)
@@ -2472,28 +2474,28 @@ static inline int tcp_head_timedout(const struct sock *sk)
  * Main question: may we further continue forward transmission
  * with the same cwnd?
  */
-static int tcp_time_to_recover(struct sock *sk, int flag)
+static bool tcp_time_to_recover(struct sock *sk, int flag)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	__u32 packets_out;
 
 	/* Do not perform any recovery during F-RTO algorithm */
 	if (tp->frto_counter)
-		return 0;
+		return false;
 
 	/* Trick#1: The loss is proven. */
 	if (tp->lost_out)
-		return 1;
+		return true;
 
 	/* Not-A-Trick#2 : Classic rule... */
 	if (tcp_dupack_heuristics(tp) > tp->reordering)
-		return 1;
+		return true;
 
 	/* Trick#3 : when we use RFC2988 timer restart, fast
 	 * retransmit can be triggered by timeout of queue head.
 	 */
 	if (tcp_is_fack(tp) && tcp_head_timedout(sk))
-		return 1;
+		return true;
 
 	/* Trick#4: It is still not OK... But will it be useful to delay
 	 * recovery more?
@@ -2505,7 +2507,7 @@ static int tcp_time_to_recover(struct sock *sk, int flag)
 		/* We have nothing to send. This connection is limited
 		 * either by receiver window or by application.
 		 */
-		return 1;
+		return true;
 	}
 
 	/* If a thin stream is detected, retransmit after first
@@ -2516,7 +2518,7 @@ static int tcp_time_to_recover(struct sock *sk, int flag)
 	if ((tp->thin_dupack || sysctl_tcp_thin_dupack) &&
 	    tcp_stream_is_thin(tp) && tcp_dupack_heuristics(tp) > 1 &&
 	    tcp_is_sack(tp) && !tcp_send_head(sk))
-		return 1;
+		return true;
 
 	/* Trick#6: TCP early retransmit, per RFC5827.  To avoid spurious
 	 * retransmissions due to small network reorderings, we implement
@@ -2528,7 +2530,7 @@ static int tcp_time_to_recover(struct sock *sk, int flag)
 	    !tcp_may_send_now(sk))
 		return !tcp_pause_early_retransmit(sk, flag);
 
-	return 0;
+	return false;
 }
 
 /* New heuristics: it is possible only after we switched to restart timer
@@ -2767,7 +2769,7 @@ static inline int tcp_may_undo(const struct tcp_sock *tp)
 }
 
 /* People celebrate: "We love our President!" */
-static int tcp_try_undo_recovery(struct sock *sk)
+static bool tcp_try_undo_recovery(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
@@ -2792,10 +2794,10 @@ static int tcp_try_undo_recovery(struct sock *sk)
 		 * is ACKed. For Reno it is MUST to prevent false
 		 * fast retransmits (RFC2582). SACK TCP is safe. */
 		tcp_moderate_cwnd(tp);
-		return 1;
+		return true;
 	}
 	tcp_set_ca_state(sk, TCP_CA_Open);
-	return 0;
+	return false;
 }
 
 /* Try to undo cwnd reduction, because D-SACKs acked all retransmitted data */
@@ -2825,19 +2827,19 @@ static void tcp_try_undo_dsack(struct sock *sk)
  * that successive retransmissions of a segment must not advance
  * retrans_stamp under any conditions.
  */
-static int tcp_any_retrans_done(const struct sock *sk)
+static bool tcp_any_retrans_done(const struct sock *sk)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *skb;
 
 	if (tp->retrans_out)
-		return 1;
+		return true;
 
 	skb = tcp_write_queue_head(sk);
 	if (unlikely(skb && TCP_SKB_CB(skb)->sacked & TCPCB_EVER_RETRANS))
-		return 1;
+		return true;
 
-	return 0;
+	return false;
 }
 
 /* Undo during fast recovery after partial ACK. */
@@ -2871,7 +2873,7 @@ static int tcp_try_undo_partial(struct sock *sk, int acked)
 }
 
 /* Undo during loss recovery after partial ACK. */
-static int tcp_try_undo_loss(struct sock *sk)
+static bool tcp_try_undo_loss(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
@@ -2893,9 +2895,9 @@ static int tcp_try_undo_loss(struct sock *sk)
 		tp->undo_marker = 0;
 		if (tcp_is_sack(tp))
 			tcp_set_ca_state(sk, TCP_CA_Open);
-		return 1;
+		return true;
 	}
-	return 0;
+	return false;
 }
 
 static inline void tcp_complete_cwr(struct sock *sk)
@@ -3370,7 +3372,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
 	const struct inet_connection_sock *icsk = inet_csk(sk);
 	struct sk_buff *skb;
 	u32 now = tcp_time_stamp;
-	int fully_acked = 1;
+	int fully_acked = true;
 	int flag = 0;
 	u32 pkts_acked = 0;
 	u32 reord = tp->packets_out;
@@ -3394,7 +3396,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
 			if (!acked_pcount)
 				break;
 
-			fully_acked = 0;
+			fully_acked = false;
 		} else {
 			acked_pcount = tcp_skb_pcount(skb);
 		}
@@ -3673,7 +3675,7 @@ static void tcp_undo_spur_to_response(struct sock *sk, int flag)
  *     to prove that the RTO is indeed spurious. It transfers the control
  *     from F-RTO to the conventional RTO recovery
  */
-static int tcp_process_frto(struct sock *sk, int flag)
+static bool tcp_process_frto(struct sock *sk, int flag)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
@@ -3689,7 +3691,7 @@ static int tcp_process_frto(struct sock *sk, int flag)
 
 	if (!before(tp->snd_una, tp->frto_highmark)) {
 		tcp_enter_frto_loss(sk, (tp->frto_counter == 1 ? 2 : 3), flag);
-		return 1;
+		return true;
 	}
 
 	if (!tcp_is_sackfrto(tp)) {
@@ -3698,19 +3700,19 @@ static int tcp_process_frto(struct sock *sk, int flag)
 		 * data, winupdate
 		 */
 		if (!(flag & FLAG_ANY_PROGRESS) && (flag & FLAG_NOT_DUP))
-			return 1;
+			return true;
 
 		if (!(flag & FLAG_DATA_ACKED)) {
 			tcp_enter_frto_loss(sk, (tp->frto_counter == 1 ? 0 : 3),
 					    flag);
-			return 1;
+			return true;
 		}
 	} else {
 		if (!(flag & FLAG_DATA_ACKED) && (tp->frto_counter == 1)) {
 			/* Prevent sending of new data. */
 			tp->snd_cwnd = min(tp->snd_cwnd,
 					   tcp_packets_in_flight(tp));
-			return 1;
+			return true;
 		}
 
 		if ((tp->frto_counter >= 2) &&
@@ -3720,10 +3722,10 @@ static int tcp_process_frto(struct sock *sk, int flag)
 			/* RFC4138 shortcoming (see comment above) */
 			if (!(flag & FLAG_FORWARD_PROGRESS) &&
 			    (flag & FLAG_NOT_DUP))
-				return 1;
+				return true;
 
 			tcp_enter_frto_loss(sk, 3, flag);
-			return 1;
+			return true;
 		}
 	}
 
@@ -3735,7 +3737,7 @@ static int tcp_process_frto(struct sock *sk, int flag)
 		if (!tcp_may_send_now(sk))
 			tcp_enter_frto_loss(sk, 2, flag);
 
-		return 1;
+		return true;
 	} else {
 		switch (sysctl_tcp_frto_response) {
 		case 2:
@@ -3752,7 +3754,7 @@ static int tcp_process_frto(struct sock *sk, int flag)
 		tp->undo_marker = 0;
 		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSPURIOUSRTOS);
 	}
-	return 0;
+	return false;
 }
 
 /* This routine deals with incoming acks, but not outgoing ones. */
@@ -3770,7 +3772,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	int prior_sacked = tp->sacked_out;
 	int pkts_acked = 0;
 	int newly_acked_sacked = 0;
-	int frto_cwnd = 0;
+	bool frto_cwnd = false;
 
 	/* If the ack is older than previous acks
 	 * then we can probably ignore it.
@@ -4025,7 +4027,7 @@ void tcp_parse_options(const struct sk_buff *skb, struct tcp_options_received *o
 }
 EXPORT_SYMBOL(tcp_parse_options);
 
-static int tcp_parse_aligned_timestamp(struct tcp_sock *tp, const struct tcphdr *th)
+static bool tcp_parse_aligned_timestamp(struct tcp_sock *tp, const struct tcphdr *th)
 {
 	const __be32 *ptr = (const __be32 *)(th + 1);
 
@@ -4036,31 +4038,31 @@ static int tcp_parse_aligned_timestamp(struct tcp_sock *tp, const struct tcphdr
 		tp->rx_opt.rcv_tsval = ntohl(*ptr);
 		++ptr;
 		tp->rx_opt.rcv_tsecr = ntohl(*ptr);
-		return 1;
+		return true;
 	}
-	return 0;
+	return false;
 }
 
 /* Fast parse options. This hopes to only see timestamps.
  * If it is wrong it falls back on tcp_parse_options().
  */
-static int tcp_fast_parse_options(const struct sk_buff *skb,
-				  const struct tcphdr *th,
-				  struct tcp_sock *tp, const u8 **hvpp)
+static bool tcp_fast_parse_options(const struct sk_buff *skb,
+				   const struct tcphdr *th,
+				   struct tcp_sock *tp, const u8 **hvpp)
 {
 	/* In the spirit of fast parsing, compare doff directly to constant
 	 * values.  Because equality is used, short doff can be ignored here.
 	 */
 	if (th->doff == (sizeof(*th) / 4)) {
 		tp->rx_opt.saw_tstamp = 0;
-		return 0;
+		return false;
 	} else if (tp->rx_opt.tstamp_ok &&
 		   th->doff == ((sizeof(*th) + TCPOLEN_TSTAMP_ALIGNED) / 4)) {
 		if (tcp_parse_aligned_timestamp(tp, th))
-			return 1;
+			return true;
 	}
 	tcp_parse_options(skb, &tp->rx_opt, hvpp, 1);
-	return 1;
+	return true;
 }
 
 #ifdef CONFIG_TCP_MD5SIG
@@ -4301,7 +4303,7 @@ static void tcp_fin(struct sock *sk)
 	}
 }
 
-static inline int tcp_sack_extend(struct tcp_sack_block *sp, u32 seq,
+static inline bool tcp_sack_extend(struct tcp_sack_block *sp, u32 seq,
 				  u32 end_seq)
 {
 	if (!after(seq, sp->end_seq) && !after(sp->start_seq, end_seq)) {
@@ -4309,9 +4311,9 @@ static inline int tcp_sack_extend(struct tcp_sack_block *sp, u32 seq,
 			sp->start_seq = seq;
 		if (after(end_seq, sp->end_seq))
 			sp->end_seq = end_seq;
-		return 1;
+		return true;
 	}
-	return 0;
+	return false;
 }
 
 static void tcp_dsack_set(struct sock *sk, u32 seq, u32 end_seq)
@@ -4507,7 +4509,7 @@ static void tcp_ofo_queue(struct sock *sk)
 	}
 }
 
-static int tcp_prune_ofo_queue(struct sock *sk);
+static bool tcp_prune_ofo_queue(struct sock *sk);
 static int tcp_prune_queue(struct sock *sk);
 
 static int tcp_try_rmem_schedule(struct sock *sk, unsigned int size)
@@ -5092,10 +5094,10 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
  * Purge the out-of-order queue.
  * Return true if queue was pruned.
  */
-static int tcp_prune_ofo_queue(struct sock *sk)
+static bool tcp_prune_ofo_queue(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	int res = 0;
+	bool res = false;
 
 	if (!skb_queue_empty(&tp->out_of_order_queue)) {
 		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_OFOPRUNED);
@@ -5109,7 +5111,7 @@ static int tcp_prune_ofo_queue(struct sock *sk)
 		if (tp->rx_opt.sack_ok)
 			tcp_sack_reset(&tp->rx_opt);
 		sk_mem_reclaim(sk);
-		res = 1;
+		res = true;
 	}
 	return res;
 }
@@ -5186,7 +5188,7 @@ void tcp_cwnd_application_limited(struct sock *sk)
 	tp->snd_cwnd_stamp = tcp_time_stamp;
 }
 
-static int tcp_should_expand_sndbuf(const struct sock *sk)
+static bool tcp_should_expand_sndbuf(const struct sock *sk)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
 
@@ -5194,21 +5196,21 @@ static int tcp_should_expand_sndbuf(const struct sock *sk)
 	 * not modify it.
 	 */
 	if (sk->sk_userlocks & SOCK_SNDBUF_LOCK)
-		return 0;
+		return false;
 
 	/* If we are under global TCP memory pressure, do not expand.  */
 	if (sk_under_memory_pressure(sk))
-		return 0;
+		return false;
 
 	/* If we are under soft global TCP memory pressure, do not expand.  */
 	if (sk_memory_allocated(sk) >= sk_prot_mem_limits(sk, 0))
-		return 0;
+		return false;
 
 	/* If we filled the congestion window, do not expand.  */
 	if (tp->packets_out >= tp->snd_cwnd)
-		return 0;
+		return false;
 
-	return 1;
+	return true;
 }
 
 /* When incoming ACK allowed to free some skb from write_queue,
@@ -5434,16 +5436,16 @@ static inline int tcp_checksum_complete_user(struct sock *sk,
 }
 
 #ifdef CONFIG_NET_DMA
-static int tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb,
+static bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb,
 				  int hlen)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	int chunk = skb->len - hlen;
 	int dma_cookie;
-	int copied_early = 0;
+	bool copied_early = false;
 
 	if (tp->ucopy.wakeup)
-		return 0;
+		return false;
 
 	if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
 		tp->ucopy.dma_chan = net_dma_find_channel();
@@ -5459,7 +5461,7 @@ static int tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb,
 			goto out;
 
 		tp->ucopy.dma_cookie = dma_cookie;
-		copied_early = 1;
+		copied_early = true;
 
 		tp->ucopy.len -= chunk;
 		tp->copied_seq += chunk;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 2e76ffb..a43b87d 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -866,14 +866,14 @@ static void tcp_v4_reqsk_destructor(struct request_sock *req)
 }
 
 /*
- * Return 1 if a syncookie should be sent
+ * Return true if a syncookie should be sent
  */
-int tcp_syn_flood_action(struct sock *sk,
+bool tcp_syn_flood_action(struct sock *sk,
 			 const struct sk_buff *skb,
 			 const char *proto)
 {
 	const char *msg = "Dropping request";
-	int want_cookie = 0;
+	bool want_cookie = false;
 	struct listen_sock *lopt;
 
 
@@ -881,7 +881,7 @@ int tcp_syn_flood_action(struct sock *sk,
 #ifdef CONFIG_SYN_COOKIES
 	if (sysctl_tcp_syncookies) {
 		msg = "Sending cookies";
-		want_cookie = 1;
+		want_cookie = true;
 		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPREQQFULLDOCOOKIES);
 	} else
 #endif
@@ -1196,7 +1196,7 @@ clear_hash_noput:
 }
 EXPORT_SYMBOL(tcp_v4_md5_hash_skb);
 
-static int tcp_v4_inbound_md5_hash(struct sock *sk, const struct sk_buff *skb)
+static bool tcp_v4_inbound_md5_hash(struct sock *sk, const struct sk_buff *skb)
 {
 	/*
 	 * This gets called for each TCP segment that arrives
@@ -1219,16 +1219,16 @@ static int tcp_v4_inbound_md5_hash(struct sock *sk, const struct sk_buff *skb)
 
 	/* We've parsed the options - do we have a hash? */
 	if (!hash_expected && !hash_location)
-		return 0;
+		return false;
 
 	if (hash_expected && !hash_location) {
 		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPMD5NOTFOUND);
-		return 1;
+		return true;
 	}
 
 	if (!hash_expected && hash_location) {
 		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPMD5UNEXPECTED);
-		return 1;
+		return true;
 	}
 
 	/* Okay, so this is hash_expected and hash_location -
@@ -1244,9 +1244,9 @@ static int tcp_v4_inbound_md5_hash(struct sock *sk, const struct sk_buff *skb)
 				     &iph->daddr, ntohs(th->dest),
 				     genhash ? " tcp_v4_calc_md5_hash failed"
 				     : "");
-		return 1;
+		return true;
 	}
-	return 0;
+	return false;
 }
 
 #endif
@@ -1280,7 +1280,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	__be32 saddr = ip_hdr(skb)->saddr;
 	__be32 daddr = ip_hdr(skb)->daddr;
 	__u32 isn = TCP_SKB_CB(skb)->when;
-	int want_cookie = 0;
+	bool want_cookie = false;
 
 	/* Never answer to SYNs send to broadcast or multicast */
 	if (skb_rtable(skb)->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
@@ -1339,7 +1339,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 		while (l-- > 0)
 			*c++ ^= *hash_location++;
 
-		want_cookie = 0;	/* not our kind of cookie */
+		want_cookie = false;	/* not our kind of cookie */
 		tmp_ext.cookie_out_never = 0; /* false */
 		tmp_ext.cookie_plus = tmp_opt.cookie_plus;
 	} else if (!tp->rx_opt.cookie_in_always) {
@@ -2073,7 +2073,7 @@ static void *listening_get_idx(struct seq_file *seq, loff_t *pos)
 	return rc;
 }
 
-static inline int empty_bucket(struct tcp_iter_state *st)
+static inline bool empty_bucket(struct tcp_iter_state *st)
 {
 	return hlist_nulls_empty(&tcp_hashinfo.ehash[st->bucket].chain) &&
 		hlist_nulls_empty(&tcp_hashinfo.ehash[st->bucket].twchain);
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 6f6a918..b85d9fe 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -55,7 +55,7 @@ EXPORT_SYMBOL_GPL(tcp_death_row);
  * state.
  */
 
-static int tcp_remember_stamp(struct sock *sk)
+static bool tcp_remember_stamp(struct sock *sk)
 {
 	const struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -72,13 +72,13 @@ static int tcp_remember_stamp(struct sock *sk)
 		}
 		if (release_it)
 			inet_putpeer(peer);
-		return 1;
+		return true;
 	}
 
-	return 0;
+	return false;
 }
 
-static int tcp_tw_remember_stamp(struct inet_timewait_sock *tw)
+static bool tcp_tw_remember_stamp(struct inet_timewait_sock *tw)
 {
 	struct sock *sk = (struct sock *) tw;
 	struct inet_peer *peer;
@@ -94,17 +94,17 @@ static int tcp_tw_remember_stamp(struct inet_timewait_sock *tw)
 			peer->tcp_ts	   = tcptw->tw_ts_recent;
 		}
 		inet_putpeer(peer);
-		return 1;
+		return true;
 	}
-	return 0;
+	return false;
 }
 
-static __inline__ int tcp_in_window(u32 seq, u32 end_seq, u32 s_win, u32 e_win)
+static bool tcp_in_window(u32 seq, u32 end_seq, u32 s_win, u32 e_win)
 {
 	if (seq == s_win)
-		return 1;
+		return true;
 	if (after(end_seq, s_win) && before(seq, e_win))
-		return 1;
+		return true;
 	return seq == e_win && seq == end_seq;
 }
 
@@ -143,7 +143,7 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
 	struct tcp_options_received tmp_opt;
 	const u8 *hash_location;
 	struct tcp_timewait_sock *tcptw = tcp_twsk((struct sock *)tw);
-	int paws_reject = 0;
+	bool paws_reject = false;
 
 	tmp_opt.saw_tstamp = 0;
 	if (th->doff > (sizeof(*th) >> 2) && tcptw->tw_ts_recent_stamp) {
@@ -316,7 +316,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
 	struct inet_timewait_sock *tw = NULL;
 	const struct inet_connection_sock *icsk = inet_csk(sk);
 	const struct tcp_sock *tp = tcp_sk(sk);
-	int recycle_ok = 0;
+	bool recycle_ok = false;
 
 	if (tcp_death_row.sysctl_tw_recycle && tp->rx_opt.ts_recent_stamp)
 		recycle_ok = tcp_remember_stamp(sk);
@@ -575,7 +575,7 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 	struct sock *child;
 	const struct tcphdr *th = tcp_hdr(skb);
 	__be32 flg = tcp_flag_word(th) & (TCP_FLAG_RST|TCP_FLAG_SYN|TCP_FLAG_ACK);
-	int paws_reject = 0;
+	bool paws_reject = false;
 
 	tmp_opt.saw_tstamp = 0;
 	if (th->doff > (sizeof(struct tcphdr)>>2)) {
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 1a63082..803cbfe 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -370,7 +370,7 @@ static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
 	TCP_SKB_CB(skb)->end_seq = seq;
 }
 
-static inline int tcp_urg_mode(const struct tcp_sock *tp)
+static inline bool tcp_urg_mode(const struct tcp_sock *tp)
 {
 	return tp->snd_una != tp->snd_up;
 }
@@ -1391,20 +1391,20 @@ static int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
 }
 
 /* Minshall's variant of the Nagle send check. */
-static inline int tcp_minshall_check(const struct tcp_sock *tp)
+static inline bool tcp_minshall_check(const struct tcp_sock *tp)
 {
 	return after(tp->snd_sml, tp->snd_una) &&
 		!after(tp->snd_sml, tp->snd_nxt);
 }
 
-/* Return 0, if packet can be sent now without violation Nagle's rules:
+/* Return false, if packet can be sent now without violation Nagle's rules:
  * 1. It is full sized.
  * 2. Or it contains FIN. (already checked by caller)
  * 3. Or TCP_CORK is not set, and TCP_NODELAY is set.
  * 4. Or TCP_CORK is not set, and all sent packets are ACKed.
  *    With Minshall's modification: all sent small packets are ACKed.
  */
-static inline int tcp_nagle_check(const struct tcp_sock *tp,
+static inline bool tcp_nagle_check(const struct tcp_sock *tp,
 				  const struct sk_buff *skb,
 				  unsigned int mss_now, int nonagle)
 {
@@ -1413,11 +1413,11 @@ static inline int tcp_nagle_check(const struct tcp_sock *tp,
 		 (!nonagle && tp->packets_out && tcp_minshall_check(tp)));
 }
 
-/* Return non-zero if the Nagle test allows this packet to be
+/* Return true if the Nagle test allows this packet to be
  * sent now.
  */
-static inline int tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buff *skb,
-				 unsigned int cur_mss, int nonagle)
+static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buff *skb,
+				  unsigned int cur_mss, int nonagle)
 {
 	/* Nagle rule does not apply to frames, which sit in the middle of the
 	 * write_queue (they have no chances to get new data).
@@ -1426,24 +1426,25 @@ static inline int tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buff
 	 * argument based upon the location of SKB in the send queue.
 	 */
 	if (nonagle & TCP_NAGLE_PUSH)
-		return 1;
+		return true;
 
 	/* Don't use the nagle rule for urgent data (or for the final FIN).
 	 * Nagle can be ignored during F-RTO too (see RFC4138).
 	 */
 	if (tcp_urg_mode(tp) || (tp->frto_counter == 2) ||
 	    (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN))
-		return 1;
+		return true;
 
 	if (!tcp_nagle_check(tp, skb, cur_mss, nonagle))
-		return 1;
+		return true;
 
-	return 0;
+	return false;
 }
 
 /* Does at least the first segment of SKB fit into the send window? */
-static inline int tcp_snd_wnd_test(const struct tcp_sock *tp, const struct sk_buff *skb,
-				   unsigned int cur_mss)
+static bool tcp_snd_wnd_test(const struct tcp_sock *tp,
+			     const struct sk_buff *skb,
+			     unsigned int cur_mss)
 {
 	u32 end_seq = TCP_SKB_CB(skb)->end_seq;
 
@@ -1476,7 +1477,7 @@ static unsigned int tcp_snd_test(const struct sock *sk, struct sk_buff *skb,
 }
 
 /* Test if sending is allowed right now. */
-int tcp_may_send_now(struct sock *sk)
+bool tcp_may_send_now(struct sock *sk)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *skb = tcp_send_head(sk);
@@ -1546,7 +1547,7 @@ static int tso_fragment(struct sock *sk, struct sk_buff *skb, unsigned int len,
  *
  * This algorithm is from John Heffner.
  */
-static int tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
+static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	const struct inet_connection_sock *icsk = inet_csk(sk);
@@ -1606,11 +1607,11 @@ static int tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
 	/* Ok, it looks like it is advisable to defer.  */
 	tp->tso_deferred = 1 | (jiffies << 1);
 
-	return 1;
+	return true;
 
 send_now:
 	tp->tso_deferred = 0;
-	return 0;
+	return false;
 }
 
 /* Create a new MTU probe if we are ready.
@@ -1752,11 +1753,11 @@ static int tcp_mtu_probe(struct sock *sk)
  * snd_up-64k-mss .. snd_up cannot be large. However, taking into
  * account rare use of URG, this is not a big flaw.
  *
- * Returns 1, if no segments are in flight and we have queued segments, but
- * cannot send anything now because of SWS or another problem.
+ * Returns true, if no segments are in flight and we have queued segments,
+ * but cannot send anything now because of SWS or another problem.
  */
-static int tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
-			  int push_one, gfp_t gfp)
+static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
+			   int push_one, gfp_t gfp)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *skb;
@@ -1770,7 +1771,7 @@ static int tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		/* Do MTU probing. */
 		result = tcp_mtu_probe(sk);
 		if (!result) {
-			return 0;
+			return false;
 		} else if (result > 0) {
 			sent_pkts = 1;
 		}
@@ -1829,7 +1830,7 @@ static int tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 
 	if (likely(sent_pkts)) {
 		tcp_cwnd_validate(sk);
-		return 0;
+		return false;
 	}
 	return !tp->packets_out && tcp_send_head(sk);
 }
@@ -2028,22 +2029,22 @@ static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
 }
 
 /* Check if coalescing SKBs is legal. */
-static int tcp_can_collapse(const struct sock *sk, const struct sk_buff *skb)
+static bool tcp_can_collapse(const struct sock *sk, const struct sk_buff *skb)
 {
 	if (tcp_skb_pcount(skb) > 1)
-		return 0;
+		return false;
 	/* TODO: SACK collapsing could be used to remove this condition */
 	if (skb_shinfo(skb)->nr_frags != 0)
-		return 0;
+		return false;
 	if (skb_cloned(skb))
-		return 0;
+		return false;
 	if (skb == tcp_send_head(sk))
-		return 0;
+		return false;
 	/* Some heurestics for collapsing over SACK'd could be invented */
 	if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)
-		return 0;
+		return false;
 
-	return 1;
+	return true;
 }
 
 /* Collapse packets in the retransmit queue to make to create
@@ -2054,7 +2055,7 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *skb = to, *tmp;
-	int first = 1;
+	bool first = true;
 
 	if (!sysctl_tcp_retrans_collapse)
 		return;
@@ -2068,7 +2069,7 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
 		space -= skb->len;
 
 		if (first) {
-			first = 0;
+			first = false;
 			continue;
 		}
 
@@ -2208,18 +2209,18 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
 /* Check if we forward retransmits are possible in the current
  * window/congestion state.
  */
-static int tcp_can_forward_retransmit(struct sock *sk)
+static bool tcp_can_forward_retransmit(struct sock *sk)
 {
 	const struct inet_connection_sock *icsk = inet_csk(sk);
 	const struct tcp_sock *tp = tcp_sk(sk);
 
 	/* Forward retransmissions are possible only during Recovery. */
 	if (icsk->icsk_ca_state != TCP_CA_Recovery)
-		return 0;
+		return false;
 
 	/* No forward retransmissions in Reno are possible. */
 	if (tcp_is_reno(tp))
-		return 0;
+		return false;
 
 	/* Yeah, we have to make difficult choice between forward transmission
 	 * and retransmission... Both ways have their merits...
@@ -2230,9 +2231,9 @@ static int tcp_can_forward_retransmit(struct sock *sk)
 	 */
 
 	if (tcp_may_send_now(sk))
-		return 0;
+		return false;
 
-	return 1;
+	return true;
 }
 
 /* This gets called after a retransmit timeout, and the initially
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 4cf55ae..554d599 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1055,7 +1055,7 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 	struct tcp_sock *tp = tcp_sk(sk);
 	__u32 isn = TCP_SKB_CB(skb)->when;
 	struct dst_entry *dst = NULL;
-	int want_cookie = 0;
+	bool want_cookie = false;
 
 	if (skb->protocol == htons(ETH_P_IP))
 		return tcp_v4_conn_request(sk, skb);
@@ -1116,7 +1116,7 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 		while (l-- > 0)
 			*c++ ^= *hash_location++;
 
-		want_cookie = 0;	/* not our kind of cookie */
+		want_cookie = false;	/* not our kind of cookie */
 		tmp_ext.cookie_out_never = 0; /* false */
 		tmp_ext.cookie_plus = tmp_opt.cookie_plus;
 	} else if (!tp->rx_opt.cookie_in_always) {

^ permalink raw reply related

* Re: [PATCH net-next] net: core: Use pr_<level>
From: David Miller @ 2012-05-17  9:00 UTC (permalink / raw)
  To: joe; +Cc: nhorman, netdev, linux-kernel
In-Reply-To: <1337234320.17726.26.camel@joe2Laptop>

From: Joe Perches <joe@perches.com>
Date: Wed, 16 May 2012 22:58:40 -0700

> Use the current logging style.
> 
> This enables use of dynamic debugging as well.
> 
> Convert printk(KERN_<LEVEL> to pr_<level>.
> Add pr_fmt. Remove embedded prefixes, use
> %s, __func__ instead.
> 
> Signed-off-by: Joe Perches <joe@perches.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] net: ipv6: ndisc: Neaten ND_PRINTx macros
From: David Miller @ 2012-05-17  9:00 UTC (permalink / raw)
  To: joe; +Cc: netdev, linux-kernel
In-Reply-To: <1337232518.17726.21.camel@joe2Laptop>

From: Joe Perches <joe@perches.com>
Date: Wed, 16 May 2012 22:28:38 -0700

> Why use several macros when one will do?
> 
> Convert the multiple ND_PRINTKx macros to a single
> ND_PRINTK macro.  Use the new net_<level>_ratelimited
> mechanism too.
> 
> Add pr_fmt with "ICMPv6: " as prefix.
> Remove embedded ICMPv6 prefixes from messages.
> 
> Signed-off-by: Joe Perches <joe@perches.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] pktgen: Use pr_debug
From: David Miller @ 2012-05-17  9:00 UTC (permalink / raw)
  To: joe; +Cc: netdev, linux-kernel
In-Reply-To: <1337226641.17726.8.camel@joe2Laptop>

From: Joe Perches <joe@perches.com>
Date: Wed, 16 May 2012 20:50:41 -0700

> Convert printk(KERN_DEBUG to pr_debug which can
> enable dynamic debugging.
> 
> Remove embedded prefixes from the conversions as
> pr_fmt adds them.
> 
> Align arguments.
> 
> Signed-off-by: Joe Perches <joe@perches.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] net: include/net/sock.h cleanup
From: David Miller @ 2012-05-17  8:50 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1337244495.29313.2.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 17 May 2012 10:48:15 +0200

> From: Eric Dumazet <edumazet@google.com>
> 
> bool/const conversions where possible
> 
> __inline__ -> inline
> 
> space cleanups
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox