Netdev List
 help / color / mirror / Atom feed
* [PATCH V3 2/8] net: mediatek: mtk_cal_txd_req() returns bad value
From: John Crispin @ 2016-04-07 22:54 UTC (permalink / raw)
  To: David S. Miller
  Cc: Felix Fietkau, netdev-u79uwXL29TY76Z2rM5mHXA,
	Sean Wang (王志亘),
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mediatek-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Matthias Brugger,
	John Crispin
In-Reply-To: <1460069651-1234-1-git-send-email-blogic-p3rKhJxN3npAfugRpC6u6w@public.gmane.org>

The code used to also support the PDMA engine, which had 2 packet pointers
per descriptor. Because of this we had to divide the result by 2 and round
it up. This is no longer needed as the code only supports QDMA.

Signed-off-by: John Crispin <blogic-p3rKhJxN3npAfugRpC6u6w@public.gmane.org>
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index bb10d57..94cceb8 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -681,7 +681,7 @@ static inline int mtk_cal_txd_req(struct sk_buff *skb)
 		nfrags += skb_shinfo(skb)->nr_frags;
 	}
 
-	return DIV_ROUND_UP(nfrags, 2);
+	return nfrags;
 }
 
 static int mtk_start_xmit(struct sk_buff *skb, struct net_device *dev)
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH V3 4/8] net: mediatek: fix stop and wakeup of queue
From: John Crispin @ 2016-04-07 22:54 UTC (permalink / raw)
  To: David S. Miller
  Cc: Felix Fietkau, netdev-u79uwXL29TY76Z2rM5mHXA,
	Sean Wang (王志亘),
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mediatek-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Matthias Brugger,
	John Crispin
In-Reply-To: <1460069651-1234-1-git-send-email-blogic-p3rKhJxN3npAfugRpC6u6w@public.gmane.org>

The driver supports 2 MACs. Both run on the same DMA ring. If we go
above/below the TX rings threshold value, we always need to wake/stop
the queue of both devices. Not doing to can cause TX stalls and packet
drops on one of the devices.

Signed-off-by: John Crispin <blogic-p3rKhJxN3npAfugRpC6u6w@public.gmane.org>
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c |   37 +++++++++++++++++++--------
 1 file changed, 27 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index a4982e4..4ebc42e 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -684,6 +684,28 @@ static inline int mtk_cal_txd_req(struct sk_buff *skb)
 	return nfrags;
 }
 
+static void mtk_wake_queue(struct mtk_eth *eth)
+{
+	int i;
+
+	for (i = 0; i < MTK_MAC_COUNT; i++) {
+		if (!eth->netdev[i])
+			continue;
+		netif_wake_queue(eth->netdev[i]);
+	}
+}
+
+static void mtk_stop_queue(struct mtk_eth *eth)
+{
+	int i;
+
+	for (i = 0; i < MTK_MAC_COUNT; i++) {
+		if (!eth->netdev[i])
+			continue;
+		netif_stop_queue(eth->netdev[i]);
+	}
+}
+
 static int mtk_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct mtk_mac *mac = netdev_priv(dev);
@@ -695,7 +717,7 @@ static int mtk_start_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	tx_num = mtk_cal_txd_req(skb);
 	if (unlikely(atomic_read(&ring->free_count) <= tx_num)) {
-		netif_stop_queue(dev);
+		mtk_stop_queue(eth);
 		netif_err(eth, tx_queued, dev,
 			  "Tx Ring full when queue awake!\n");
 		return NETDEV_TX_BUSY;
@@ -720,10 +742,10 @@ static int mtk_start_xmit(struct sk_buff *skb, struct net_device *dev)
 		goto drop;
 
 	if (unlikely(atomic_read(&ring->free_count) <= ring->thresh)) {
-		netif_stop_queue(dev);
+		mtk_stop_queue(eth);
 		if (unlikely(atomic_read(&ring->free_count) >
 			     ring->thresh))
-			netif_wake_queue(dev);
+			mtk_wake_queue(eth);
 	}
 
 	return NETDEV_TX_OK;
@@ -897,13 +919,8 @@ static int mtk_poll_tx(struct mtk_eth *eth, int budget, bool *tx_again)
 	if (!total)
 		return 0;
 
-	for (i = 0; i < MTK_MAC_COUNT; i++) {
-		if (!eth->netdev[i] ||
-		    unlikely(!netif_queue_stopped(eth->netdev[i])))
-			continue;
-		if (atomic_read(&ring->free_count) > ring->thresh)
-			netif_wake_queue(eth->netdev[i]);
-	}
+	if (atomic_read(&ring->free_count) > ring->thresh)
+		mtk_wake_queue(eth);
 
 	return total;
 }
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH V3 5/8] net: mediatek: fix TX locking
From: John Crispin @ 2016-04-07 22:54 UTC (permalink / raw)
  To: David S. Miller
  Cc: Felix Fietkau, Matthias Brugger,
	Sean Wang (王志亘), netdev, linux-mediatek,
	linux-kernel, John Crispin
In-Reply-To: <1460069651-1234-1-git-send-email-blogic@openwrt.org>

Inside the TX path there is a lock inside the tx_map function. This is
however too late. The patch moves the lock to the start of the xmit
function right before the free count check of the DMA ring happens.
If we do not do this, the code becomes racy leading to TX stalls and
dropped packets. This happens as there are 2 netdevs running on the
same physical DMA ring.

Signed-off-by: John Crispin <blogic@openwrt.org>

---
Changes in V3
* fix a typo in the comment

 drivers/net/ethernet/mediatek/mtk_eth_soc.c |   20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 4ebc42e..7b76075 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -536,7 +536,6 @@ static int mtk_tx_map(struct sk_buff *skb, struct net_device *dev,
 	struct mtk_eth *eth = mac->hw;
 	struct mtk_tx_dma *itxd, *txd;
 	struct mtk_tx_buf *tx_buf;
-	unsigned long flags;
 	dma_addr_t mapped_addr;
 	unsigned int nr_frags;
 	int i, n_desc = 1;
@@ -568,11 +567,6 @@ static int mtk_tx_map(struct sk_buff *skb, struct net_device *dev,
 	if (unlikely(dma_mapping_error(&dev->dev, mapped_addr)))
 		return -ENOMEM;
 
-	/* normally we can rely on the stack not calling this more than once,
-	 * however we have 2 queues running ont he same ring so we need to lock
-	 * the ring access
-	 */
-	spin_lock_irqsave(&eth->page_lock, flags);
 	WRITE_ONCE(itxd->txd1, mapped_addr);
 	tx_buf->flags |= MTK_TX_FLAGS_SINGLE0;
 	dma_unmap_addr_set(tx_buf, dma_addr0, mapped_addr);
@@ -632,8 +626,6 @@ static int mtk_tx_map(struct sk_buff *skb, struct net_device *dev,
 	WRITE_ONCE(itxd->txd3, (TX_DMA_SWC | TX_DMA_PLEN0(skb_headlen(skb)) |
 				(!nr_frags * TX_DMA_LS0)));
 
-	spin_unlock_irqrestore(&eth->page_lock, flags);
-
 	netdev_sent_queue(dev, skb->len);
 	skb_tx_timestamp(skb);
 
@@ -661,8 +653,6 @@ err_dma:
 		itxd = mtk_qdma_phys_to_virt(ring, itxd->txd2);
 	} while (itxd != txd);
 
-	spin_unlock_irqrestore(&eth->page_lock, flags);
-
 	return -ENOMEM;
 }
 
@@ -712,14 +702,22 @@ static int mtk_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct mtk_eth *eth = mac->hw;
 	struct mtk_tx_ring *ring = &eth->tx_ring;
 	struct net_device_stats *stats = &dev->stats;
+	unsigned long flags;
 	bool gso = false;
 	int tx_num;
 
+	/* normally we can rely on the stack not calling this more than once,
+	 * however we have 2 queues running on the same ring so we need to lock
+	 * the ring access
+	 */
+	spin_lock_irqsave(&eth->page_lock, flags);
+
 	tx_num = mtk_cal_txd_req(skb);
 	if (unlikely(atomic_read(&ring->free_count) <= tx_num)) {
 		mtk_stop_queue(eth);
 		netif_err(eth, tx_queued, dev,
 			  "Tx Ring full when queue awake!\n");
+		spin_unlock_irqrestore(&eth->page_lock, flags);
 		return NETDEV_TX_BUSY;
 	}
 
@@ -747,10 +745,12 @@ static int mtk_start_xmit(struct sk_buff *skb, struct net_device *dev)
 			     ring->thresh))
 			mtk_wake_queue(eth);
 	}
+	spin_unlock_irqrestore(&eth->page_lock, flags);
 
 	return NETDEV_TX_OK;
 
 drop:
+	spin_unlock_irqrestore(&eth->page_lock, flags);
 	stats->tx_dropped++;
 	dev_kfree_skb(skb);
 	return NETDEV_TX_OK;
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH V3 6/8] net: mediatek: fix mtk_pending_work
From: John Crispin @ 2016-04-07 22:54 UTC (permalink / raw)
  To: David S. Miller
  Cc: Felix Fietkau, netdev-u79uwXL29TY76Z2rM5mHXA,
	Sean Wang (王志亘),
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mediatek-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Matthias Brugger,
	John Crispin
In-Reply-To: <1460069651-1234-1-git-send-email-blogic-p3rKhJxN3npAfugRpC6u6w@public.gmane.org>

The driver supports 2 MACs. Both run on the same DMA ring. If we hit a TX
timeout we need to stop both netdevs before restarting them again. If we
don't do this, mtk_stop() wont shutdown DMA and the consecutive call to
mtk_open() wont restart DMA and enable IRQs.

Signed-off-by: John Crispin <blogic-p3rKhJxN3npAfugRpC6u6w@public.gmane.org>

---
Changes in V3
* make the patch bisectable

 drivers/net/ethernet/mediatek/mtk_eth_soc.c |   28 +++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 7b76075..cd5d0c9 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -1432,17 +1432,29 @@ static void mtk_pending_work(struct work_struct *work)
 {
 	struct mtk_mac *mac = container_of(work, struct mtk_mac, pending_work);
 	struct mtk_eth *eth = mac->hw;
-	struct net_device *dev = eth->netdev[mac->id];
-	int err;
+	int err, i;
+	unsigned long restart = 0;
 
 	rtnl_lock();
-	mtk_stop(dev);
 
-	err = mtk_open(dev);
-	if (err) {
-		netif_alert(eth, ifup, dev,
-			    "Driver up/down cycle failed, closing device.\n");
-		dev_close(dev);
+	/* stop all devices to make sure that dma is properly shut down */
+	for (i = 0; i < MTK_MAC_COUNT; i++) {
+		if (!netif_oper_up(eth->netdev[i]))
+			continue;
+		mtk_stop(eth->netdev[i]);
+		__set_bit(i, &restart);
+	}
+
+	/* restart DMA and enable IRQs */
+	for (i = 0; i < MTK_MAC_COUNT; i++) {
+		if (!test_bit(i, &restart))
+			continue;
+		err = mtk_open(eth->netdev[i]);
+		if (err) {
+			netif_alert(eth, ifup, eth->netdev[i],
+			      "Driver up/down cycle failed, closing device.\n");
+			dev_close(eth->netdev[i]);
+		}
 	}
 	rtnl_unlock();
 }
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH V3 7/8] net: mediatek: move the pending_work struct to the device generic struct
From: John Crispin @ 2016-04-07 22:54 UTC (permalink / raw)
  To: David S. Miller
  Cc: Felix Fietkau, Matthias Brugger,
	Sean Wang (王志亘), netdev, linux-mediatek,
	linux-kernel, John Crispin
In-Reply-To: <1460069651-1234-1-git-send-email-blogic@openwrt.org>

The worker always touches both netdevs. It is ethernet core and not MAC
specific. We only need one worker, which belongs into the ethernets core
struct.

Signed-off-by: John Crispin <blogic@openwrt.org>

---
Changes in V3
* make the patch bisectable

 drivers/net/ethernet/mediatek/mtk_eth_soc.c |   13 +++++--------
 drivers/net/ethernet/mediatek/mtk_eth_soc.h |    4 ++--
 2 files changed, 7 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index cd5d0c9..eb0d554 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -1193,7 +1193,7 @@ static void mtk_tx_timeout(struct net_device *dev)
 	eth->netdev[mac->id]->stats.tx_errors++;
 	netif_err(eth, tx_err, dev,
 		  "transmit timed out\n");
-	schedule_work(&mac->pending_work);
+	schedule_work(&eth->pending_work);
 }
 
 static irqreturn_t mtk_handle_irq(int irq, void *_eth)
@@ -1430,8 +1430,7 @@ static int mtk_do_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
 
 static void mtk_pending_work(struct work_struct *work)
 {
-	struct mtk_mac *mac = container_of(work, struct mtk_mac, pending_work);
-	struct mtk_eth *eth = mac->hw;
+	struct mtk_eth *eth = container_of(work, struct mtk_eth, pending_work);
 	int err, i;
 	unsigned long restart = 0;
 
@@ -1439,7 +1438,7 @@ static void mtk_pending_work(struct work_struct *work)
 
 	/* stop all devices to make sure that dma is properly shut down */
 	for (i = 0; i < MTK_MAC_COUNT; i++) {
-		if (!netif_oper_up(eth->netdev[i]))
+		if (!eth->netdev[i])
 			continue;
 		mtk_stop(eth->netdev[i]);
 		__set_bit(i, &restart);
@@ -1464,15 +1463,13 @@ static int mtk_cleanup(struct mtk_eth *eth)
 	int i;
 
 	for (i = 0; i < MTK_MAC_COUNT; i++) {
-		struct mtk_mac *mac = netdev_priv(eth->netdev[i]);
-
 		if (!eth->netdev[i])
 			continue;
 
 		unregister_netdev(eth->netdev[i]);
 		free_netdev(eth->netdev[i]);
-		cancel_work_sync(&mac->pending_work);
 	}
+	cancel_work_sync(&eth->pending_work);
 
 	return 0;
 }
@@ -1660,7 +1657,6 @@ static int mtk_add_mac(struct mtk_eth *eth, struct device_node *np)
 	mac->id = id;
 	mac->hw = eth;
 	mac->of_node = np;
-	INIT_WORK(&mac->pending_work, mtk_pending_work);
 
 	mac->hw_stats = devm_kzalloc(eth->dev,
 				     sizeof(*mac->hw_stats),
@@ -1762,6 +1758,7 @@ static int mtk_probe(struct platform_device *pdev)
 
 	eth->dev = &pdev->dev;
 	eth->msg_enable = netif_msg_init(mtk_msg_level, MTK_DEFAULT_MSG_ENABLE);
+	INIT_WORK(&eth->pending_work, mtk_pending_work);
 
 	err = mtk_hw_init(eth);
 	if (err)
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.h b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
index 48a5292..eed626d 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.h
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
@@ -363,6 +363,7 @@ struct mtk_rx_ring {
  * @clk_gp1:		The gmac1 clock
  * @clk_gp2:		The gmac2 clock
  * @mii_bus:		If there is a bus we need to create an instance for it
+ * @pending_work:	The workqueue used to reset the dma ring
  */
 
 struct mtk_eth {
@@ -389,6 +390,7 @@ struct mtk_eth {
 	struct clk			*clk_gp1;
 	struct clk			*clk_gp2;
 	struct mii_bus			*mii_bus;
+	struct work_struct		pending_work;
 };
 
 /* struct mtk_mac -	the structure that holds the info about the MACs of the
@@ -398,7 +400,6 @@ struct mtk_eth {
  * @hw:			Backpointer to our main datastruture
  * @hw_stats:		Packet statistics counter
  * @phy_dev:		The attached PHY if available
- * @pending_work:	The workqueue used to reset the dma ring
  */
 struct mtk_mac {
 	int				id;
@@ -406,7 +407,6 @@ struct mtk_mac {
 	struct mtk_eth			*hw;
 	struct mtk_hw_stats		*hw_stats;
 	struct phy_device		*phy_dev;
-	struct work_struct		pending_work;
 };
 
 /* the struct describing the SoC. these are declared in the soc_xyz.c files */
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH V3 8/8] net: mediatek: do not set the QID field in the TX DMA descriptors
From: John Crispin @ 2016-04-07 22:54 UTC (permalink / raw)
  To: David S. Miller
  Cc: Felix Fietkau, netdev-u79uwXL29TY76Z2rM5mHXA,
	Sean Wang (王志亘),
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mediatek-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Matthias Brugger,
	John Crispin
In-Reply-To: <1460069651-1234-1-git-send-email-blogic-p3rKhJxN3npAfugRpC6u6w@public.gmane.org>

The QID field gets set to the mac id. This made the DMA linked list queue
the traffic of each MAC on a different internal queue. However during long
term testing we found that this will cause traffic stalls as the multi
queue setup requires a more complete initialisation which is not part of
the upstream driver yet.

This patch removes the code setting the QID field, resulting in all
traffic ending up in queue 0 which works without any special setup.

Signed-off-by: John Crispin <blogic-p3rKhJxN3npAfugRpC6u6w@public.gmane.org>
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index eb0d554..c984462 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -603,8 +603,7 @@ static int mtk_tx_map(struct sk_buff *skb, struct net_device *dev,
 			WRITE_ONCE(txd->txd1, mapped_addr);
 			WRITE_ONCE(txd->txd3, (TX_DMA_SWC |
 					       TX_DMA_PLEN0(frag_map_size) |
-					       last_frag * TX_DMA_LS0) |
-					       mac->id);
+					       last_frag * TX_DMA_LS0));
 			WRITE_ONCE(txd->txd4, 0);
 
 			tx_buf->skb = (struct sk_buff *)MTK_DMA_DUMMY_DESC;
-- 
1.7.10.4

^ permalink raw reply related

* Re: [PATCH net-next] sock: make lockdep_sock_is_held static inline
From: Eric Dumazet @ 2016-04-07 23:08 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: netdev
In-Reply-To: <5706E39D.3060206@stressinduktion.org>

On Fri, 2016-04-08 at 00:47 +0200, Hannes Frederic Sowa wrote:

> I see... hmpf.
> 
> Wouldn't it be nicer if I include a helper a la:
> 
> #define lockdep_is_held(lock)  1
> 
> in lockdep.h in case lockdep is globally not enabled? I do actually have 
> already another user for this outside of PROVE_RCU.

It probably had been discussed on lkml a long time ago.

My guess is that you can not do that, but am too lazy to tell you why ;)

^ permalink raw reply

* [PATCH ipoute2] iplink: display number of rx/tx queues
From: Eric Dumazet @ 2016-04-07 23:11 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev

From: Eric Dumazet <edumazet@google.com>

We can set the attributes, so would be nice to display them when
provided by the kernel.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index 3998d8c..f7bd1c7 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -894,6 +894,12 @@ int print_linkinfo(const struct sockaddr_nl *who,
 	if (do_link && tb[IFLA_AF_SPEC] && show_details)
 		print_af_spec(fp, tb[IFLA_AF_SPEC]);
 
+	if (tb[IFLA_NUM_TX_QUEUES] && show_details)
+		fprintf(fp, "numtxqueues %u ", rta_getattr_u32(tb[IFLA_NUM_TX_QUEUES]));
+
+	if (tb[IFLA_NUM_RX_QUEUES] && show_details)
+		fprintf(fp, "numrxqueues %u ", rta_getattr_u32(tb[IFLA_NUM_RX_QUEUES]));
+
 	if ((do_link || show_details) && tb[IFLA_IFALIAS]) {
 		fprintf(fp, "%s    alias %s", _SL_,
 			rta_getattr_str(tb[IFLA_IFALIAS]));

^ permalink raw reply related

* [PATCH net] lockdep: provide always true lockdep_is_held stub if lockdep disabled
From: Hannes Frederic Sowa @ 2016-04-07 23:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: netdev, Peter Zijlstra, Ingo Molnar, Eric Dumazet, David Miller

I need this to provide a generic lockdep_sock_is_held function which can
be easily used in the kernel without using ifdef PROVEN macros.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
Hello Peter and Ingo,

if it is possible coud this go in via the net-tree, as this problem is
visible there already? Would be happy to get a review.

Thanks,
Hannes

 include/linux/lockdep.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index d026b190c53066..dc8d447cb3ab1c 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -428,6 +428,8 @@ struct lock_class_key { };
 #define lockdep_pin_lock(l)				do { (void)(l); } while (0)
 #define lockdep_unpin_lock(l)			do { (void)(l); } while (0)
 
+#define lockdep_is_held(l)			({ (void)(l); (1); })
+
 #endif /* !LOCKDEP */
 
 #ifdef CONFIG_LOCK_STAT
-- 
2.5.5

^ permalink raw reply related

* Re: [PATCH net-next] sock: make lockdep_sock_is_held static inline
From: Hannes Frederic Sowa @ 2016-04-07 23:13 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1460070528.6473.425.camel@edumazet-glaptop3.roam.corp.google.com>

On 08.04.2016 01:08, Eric Dumazet wrote:
> On Fri, 2016-04-08 at 00:47 +0200, Hannes Frederic Sowa wrote:
>
>> I see... hmpf.
>>
>> Wouldn't it be nicer if I include a helper a la:
>>
>> #define lockdep_is_held(lock)  1
>>
>> in lockdep.h in case lockdep is globally not enabled? I do actually have
>> already another user for this outside of PROVE_RCU.
>
> It probably had been discussed on lkml a long time ago.
>
> My guess is that you can not do that, but am too lazy to tell you why ;)

I will simply try it and otherwise go with your solution. My other cases 
would need to be ifdefed then as well, but also possible.

Patch is send out and I will now research after your tip that it was 
already tried.

Thanks,
Hannes

^ permalink raw reply

* Re: [PATCH net] lockdep: provide always true lockdep_is_held stub if lockdep disabled
From: Hannes Frederic Sowa @ 2016-04-07 23:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: netdev, Peter Zijlstra, Ingo Molnar, Eric Dumazet, David Miller
In-Reply-To: <1460070722-18652-1-git-send-email-hannes@stressinduktion.org>

On 08.04.2016 01:12, Hannes Frederic Sowa wrote:
> I need this to provide a generic lockdep_sock_is_held function which can
> be easily used in the kernel without using ifdef PROVEN macros.
>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: David Miller <davem@davemloft.net>
> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> ---
> Hello Peter and Ingo,
>
> if it is possible coud this go in via the net-tree, as this problem is
> visible there already? Would be happy to get a review.

I take this patch back, as some call sites test if the lock is 
definitely not held. I come up with a better approach.

^ permalink raw reply

* Re: [RFC PATCH 07/11] GENEVE: Add option to mangle IP IDs on inner headers when using TSO
From: Jesse Gross @ 2016-04-07 23:22 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: herbert, Tom Herbert, Alexander Duyck, edumazet,
	Linux Kernel Network Developers, David Miller
In-Reply-To: <20160407223237.11142.33072.stgit@ahduyck-xeon-server>

On Thu, Apr 7, 2016 at 7:32 PM, Alexander Duyck <aduyck@mirantis.com> wrote:
> This patch adds support for a feature I am calling IP ID mangling.  It is
> basically just another way of saying the IP IDs that are transmitted by the
> tunnel may not match up with what would normally be expected.  Specifically
> what will happen is in the case of TSO the IP IDs on the headers will be a
> fixed value so a given TSO will repeat the same inner IP ID value gso_segs
> number of times.
>
> Signed-off-by: Alexander Duyck <aduyck@mirantis.com>

If I'm understanding this correctly, enabling IP ID mangling will help
performance on ixgbe since it will allow it to do GSO partial instead
of plain GSO but it will hurt performance on i40e since it will drop
from TSO to plain GSO.

Assuming that's right, it seems like it will make it hard to chose the
right setting without knowledge of which hardware is in use. I guess
what we really want is "I care about nicely incrementing IP IDs" vs.
"I don't care as long as the DF bit is set". That second case is
really what this flag is trying to say but it seems like it is
enforcing too much in the i40e case - I don't think anyone wants to go
out of their way to make IP IDs jump around if incrementing is faster.

^ permalink raw reply

* [PATCH net-next v2] sock: make lockdep_sock_is_held inline and conditional on LOCKDEP
From: Hannes Frederic Sowa @ 2016-04-07 23:35 UTC (permalink / raw)
  To: netdev; +Cc: Eric Dumazet, David Miller
In-Reply-To: <1460066015-22105-1-git-send-email-hannes@stressinduktion.org>

lockdep_is_held is only specified if CONFIG_LOCKDEP is defined, so make
it depending on it.

Also add the missing inline keyword, so no warnings about unused functions
show up during complilation.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 include/net/sock.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index eb2d7c3e120b25..81d6fecec0a2c0 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1360,13 +1360,15 @@ do {									\
 	lockdep_init_map(&(sk)->sk_lock.dep_map, (name), (key), 0);	\
 } while (0)
 
-static bool lockdep_sock_is_held(const struct sock *csk)
+#ifdef CONFIG_LOCKDEP
+static inline bool lockdep_sock_is_held(const struct sock *csk)
 {
 	struct sock *sk = (struct sock *)csk;
 
 	return lockdep_is_held(&sk->sk_lock) ||
 	       lockdep_is_held(&sk->sk_lock.slock);
 }
+#endif
 
 void lock_sock_nested(struct sock *sk, int subclass);
 
-- 
2.5.5

^ permalink raw reply related

* Re: [RFC PATCH 07/11] GENEVE: Add option to mangle IP IDs on inner headers when using TSO
From: Alexander Duyck @ 2016-04-07 23:52 UTC (permalink / raw)
  To: Jesse Gross
  Cc: Alexander Duyck, Herbert Xu, Tom Herbert, Eric Dumazet,
	Linux Kernel Network Developers, David Miller
In-Reply-To: <CAEh+42jyh0EWwaOAu3KNSQKEv3NRgjPT_4QRYUmLRSXtsMgWJw@mail.gmail.com>

On Thu, Apr 7, 2016 at 4:22 PM, Jesse Gross <jesse@kernel.org> wrote:
> On Thu, Apr 7, 2016 at 7:32 PM, Alexander Duyck <aduyck@mirantis.com> wrote:
>> This patch adds support for a feature I am calling IP ID mangling.  It is
>> basically just another way of saying the IP IDs that are transmitted by the
>> tunnel may not match up with what would normally be expected.  Specifically
>> what will happen is in the case of TSO the IP IDs on the headers will be a
>> fixed value so a given TSO will repeat the same inner IP ID value gso_segs
>> number of times.
>>
>> Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
>
> If I'm understanding this correctly, enabling IP ID mangling will help
> performance on ixgbe since it will allow it to do GSO partial instead
> of plain GSO but it will hurt performance on i40e since it will drop
> from TSO to plain GSO.

Right.  However the option is currently defaulted to off, and can be
enabled per tunnel endpoint.  So if you had an ixgbe to i40e link you
could enable it on the end with the ixgbe and you should see good
performance in both directions.

> Assuming that's right, it seems like it will make it hard to chose the
> right setting without knowledge of which hardware is in use. I guess
> what we really want is "I care about nicely incrementing IP IDs" vs.
> "I don't care as long as the DF bit is set". That second case is
> really what this flag is trying to say but it seems like it is
> enforcing too much in the i40e case - I don't think anyone wants to go
> out of their way to make IP IDs jump around if incrementing is faster.

Right.  The problem is trying to sort out all the GRO/GSO bits.  I was
probably being a bit too conservative after the last few iterations
for the GRO fixes.

Just a thought.  What if I replaced NETIF_F_TSO_FIXEDID with something
that meant we could mange the IP ID like a NETIF_F_TSO_IPID_MANGLE
(advice for better name welcome).  Instead of the feature flag meaning
we are going to transmit packets with a fixed ID it would mean we
don't care about the ID and are free to mangle it as we see fit.  The
GSO type can retain the same meaning as far as that requiring the same
ID for all, but the feature would mean we will take fixed and convert
it to incrementing, or incrementing and convert it to fixed.

- Alex

^ permalink raw reply

* Re: [PATCH net-next] sock: make lockdep_sock_is_held static inline
From: David Miller @ 2016-04-07 22:01 UTC (permalink / raw)
  To: hannes; +Cc: netdev
In-Reply-To: <1460066015-22105-1-git-send-email-hannes@stressinduktion.org>

From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Thu,  7 Apr 2016 23:53:35 +0200

> I forgot to add inline to lockdep_sock_is_held, so it generated all
> kinds of build warnings if not build with lockdep support.
> 
> Reported-by: kbuild test robot <fengguang.wu@intel.com>
> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next v2] sock: make lockdep_sock_is_held inline and conditional on LOCKDEP
From: David Miller @ 2016-04-08  0:40 UTC (permalink / raw)
  To: hannes; +Cc: netdev, eric.dumazet
In-Reply-To: <1460072118-10006-1-git-send-email-hannes@stressinduktion.org>

From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Fri,  8 Apr 2016 01:35:18 +0200

> lockdep_is_held is only specified if CONFIG_LOCKDEP is defined, so make
> it depending on it.
> 
> Also add the missing inline keyword, so no warnings about unused functions
> show up during complilation.
> 
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: David Miller <davem@davemloft.net>
> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>

I already applied your inline patch, I applied the following:

====================
[PATCH] net: Fix build failure due to lockdep_sock_is_held().

Needs to be protected with CONFIG_LOCKDEP.

Based upon a patch by Hannes Frederic Sowa.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/sock.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index 46b2937..81d6fec 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1360,6 +1360,7 @@ do {									\
 	lockdep_init_map(&(sk)->sk_lock.dep_map, (name), (key), 0);	\
 } while (0)
 
+#ifdef CONFIG_LOCKDEP
 static inline bool lockdep_sock_is_held(const struct sock *csk)
 {
 	struct sock *sk = (struct sock *)csk;
@@ -1367,6 +1368,7 @@ static inline bool lockdep_sock_is_held(const struct sock *csk)
 	return lockdep_is_held(&sk->sk_lock) ||
 	       lockdep_is_held(&sk->sk_lock.slock);
 }
+#endif
 
 void lock_sock_nested(struct sock *sk, int subclass);
 
-- 
2.1.0

^ permalink raw reply related

* Re: [PATCH V3] net: emac: emac gigabit ethernet controller driver
From: Andrew Lunn @ 2016-04-08  0:53 UTC (permalink / raw)
  To: Timur Tabi
  Cc: Rob Herring, Gilad Avidov, netdev, linux-kernel@vger.kernel.org,
	devicetree@vger.kernel.org, linux-arm-msm, Sagar Dharia, shankerd,
	Greg Kroah-Hartman, vikrams, Christopher Covington
In-Reply-To: <5706D493.8020700@codeaurora.org>

On Thu, Apr 07, 2016 at 04:43:47PM -0500, Timur Tabi wrote:
> On my platform, firmware (UEFI) configures all of the GPIOs.  I need
> to get confirmation, but it appears that we don't actually make any
> GPIO calls at all.  I see code that looks like this:
> 
> 	for (i = 0; (!adpt->no_mdio_gpio) && i < EMAC_NUM_GPIO; i++) {
> 		gpio_info = &adpt->gpio_info[i];
> 		retval = of_get_named_gpio(node, gpio_info->name, 0);
> 		if (retval < 0)
> 			return retval;
> 
> And on our ACPI system, adpt->no_mdio_gpio is always true:
> 
> 	/* Assume GPIOs required for MDC/MDIO are enabled in firmware */
> 	adpt->no_mdio_gpio = true;

There are two different things here. One is configuring the pin to be
a GPIO. The second is using the GPIO as a GPIO. In this case,
bit-banging the MDIO bus.

The firmware could be doing the configuration, setting the pin as a
GPIO. However, the firmware cannot be doing the MDIO bit-banging to
make an MDIO bus available. Linux has to do that.

Or it could be we have all completely misunderstood the hardware, and
we are not doing bit-banging GPIO MDIO. There is a real MDIO
controller there, we don't use these pins as GPIOs, etc....

	   Andrew

^ permalink raw reply

* Re: [PATCH net-next v2] macvlan: Support interface operstate properly
From: Banerjee, Debabrata @ 2016-04-08  1:03 UTC (permalink / raw)
  To: Nikolay Aleksandrov, Patrick McHardy, netdev@vger.kernel.org
In-Reply-To: <57063EDD.3040406@cumulusnetworks.com>

On 4/7/16, 7:05 AM, "Nikolay Aleksandrov" <nikolay@cumulusnetworks.com> wrote:



>On 04/07/2016 12:36 AM, Debabrata Banerjee wrote:
>> Set appropriate macvlan interface status based on lower device and our
>> status. Can be up, down, or lowerlayerdown.
>What about dormant ?
>
>That being said I understand the need to switch to lowerlayerdown when the lower
>device is in "down", which is basically the most important change of this patch.
>The rest is already handled by link watch based on carrier state. By now people
>are used to having lowerlayerdown when there's no carrier, now it can also mean
>that the lower device has been brought admin down.
>
>Here's another interesting state:
>6: mac1@eth2: <NO-CARRIER,BROADCAST,MULTICAST,UP,LOWER_UP,DORMANT,M-DOWN> mtu 1500 qdisc noqueue state LOWERLAYERDOWN mode DEFAULT group default qlen 1000
>Prior to this patch the macvlan would stay in dormant state and it will also propagate
>to devices stacked on top of it.

The no carrier issue gave me an idea. What if we just set no carrier on the macvlan device?
This seems appropriate, but it doesn't directly address the dormant issue, although that could
be a separate patch. Will this cause any new issues? It's very simple, new patch incoming.



^ permalink raw reply

* Re: [PATCH v2 net-next 00/10] allow bpf attach to tracepoints
From: David Miller @ 2016-04-08  1:04 UTC (permalink / raw)
  To: ast
  Cc: rostedt, peterz, mingo, daniel, acme, wangnan0, jbacik,
	brendan.d.gregg, netdev, linux-kernel, kernel-team
In-Reply-To: <1459993411-2754735-1-git-send-email-ast@fb.com>


Series applied, thanks Alexei.

^ permalink raw reply

* [PATCH net-next] macvlan: Set nocarrier when lowerdev admin down
From: Debabrata Banerjee @ 2016-04-08  1:06 UTC (permalink / raw)
  To: Nikolay Aleksandrov, Patrick McHardy, netdev; +Cc: Debabrata Banerjee
In-Reply-To: <57063EDD.3040406@cumulusnetworks.com>

When the lowerdev is set administratively down disable carrier on the
macvlan interface. This means operstate gets set properly instead of
still being "up".

Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
---
 drivers/net/macvlan.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 2bcf1f3..16d0e56 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -1525,10 +1525,14 @@ static int macvlan_device_event(struct notifier_block *unused,
 
 	switch (event) {
 	case NETDEV_UP:
+	case NETDEV_DOWN:
 	case NETDEV_CHANGE:
-		list_for_each_entry(vlan, &port->vlans, list)
+		list_for_each_entry(vlan, &port->vlans, list) {
 			netif_stacked_transfer_operstate(vlan->lowerdev,
 							 vlan->dev);
+			if (!(vlan->lowerdev->flags & IFF_UP))
+				netif_carrier_off(vlan->dev);
+		}
 		break;
 	case NETDEV_FEAT_CHANGE:
 		list_for_each_entry(vlan, &port->vlans, list) {
-- 
2.8.0

^ permalink raw reply related

* Re: [PATCH v8 net-next 1/1] hv_sock: introduce Hyper-V Sockets
From: Joe Perches @ 2016-04-08  1:15 UTC (permalink / raw)
  To: Dexuan Cui, gregkh, davem, netdev, linux-kernel, devel, olaf, apw,
	jasowang, cavery, kys, haiyangz
  Cc: vkuznets
In-Reply-To: <1460079411-31982-1-git-send-email-decui@microsoft.com>

On Thu, 2016-04-07 at 18:36 -0700, Dexuan Cui wrote:
> Hyper-V Sockets (hv_sock) supplies a byte-stream based communication
> mechanism between the host and the guest. It's somewhat like TCP over
> VMBus, but the transportation layer (VMBus) is much simpler than IP.
[]
> diff --git a/include/net/af_hvsock.h b/include/net/af_hvsock.h
[]
> +#define VMBUS_RINGBUFFER_SIZE_HVSOCK_RECV (5 * PAGE_SIZE)
> +#define VMBUS_RINGBUFFER_SIZE_HVSOCK_SEND (5 * PAGE_SIZE)
> +
> +#define HVSOCK_RCV_BUF_SZ	VMBUS_RINGBUFFER_SIZE_HVSOCK_RECV
> +#define HVSOCK_SND_BUF_SZ	PAGE_SIZE
[]
> +struct hvsock_sock {
[]
> +	struct {
> +		struct vmpipe_proto_header hdr;
> +		char buf[HVSOCK_SND_BUF_SZ];
> +	} __packed send;
> +
> +	struct {
> +		struct vmpipe_proto_header hdr;
> +		char buf[HVSOCK_RCV_BUF_SZ];
> +		unsigned int data_len;
> +		unsigned int data_offset;
> +	} __packed recv;
> +};

These bufs are not page aligned and so can span pages.

Is there any value in allocating these bufs separately
as pages instead of as a kmalloc?

^ permalink raw reply

* [PATCH v8 net-next 0/1] introduce Hyper-V VM Sockets(hv_sock)
From: Dexuan Cui @ 2016-04-08  1:36 UTC (permalink / raw)
  To: gregkh, davem, netdev, linux-kernel, devel, olaf, apw, jasowang,
	cavery, kys, haiyangz
  Cc: joe

Hyper-V Sockets (hv_sock) supplies a byte-stream based communication
mechanism between the host and the guest. It's somewhat like TCP over
VMBus, but the transportation layer (VMBus) is much simpler than IP.

With Hyper-V Sockets, applications between the host and the guest can talk
to each other directly by the traditional BSD-style socket APIs.

Hyper-V Sockets is only available on new Windows hosts, like Windows Server
2016. More info is in this article "Make your own integration services":
https://msdn.microsoft.com/en-us/virtualization/hyperv_on_windows/develop/make_mgmt_service

The patch implements the necessary support in the guest side by
introducing a new socket address family AF_HYPERV.

Note: the VMBus driver side's supporting patches have been in the mainline
tree.

I know the kernel has already had a VM Sockets driver (AF_VSOCK) based
on VMware VMCI (net/vmw_vsock/, drivers/misc/vmw_vmci), and KVM is
proposing AF_VSOCK of virtio version:
http://marc.info/?l=linux-netdev&m=145952064004765&w=2

However, though Hyper-V Sockets may seem conceptually similar to
AF_VOSCK, there are differences in the transportation layer, and IMO these
make the direct code reusing impractical:

1. In AF_VSOCK, the endpoint type is: <u32 ContextID, u32 Port>, but in
AF_HYPERV, the endpoint type is: <GUID VM_ID, GUID ServiceID>. Here GUID
is 128-bit.

2. AF_VSOCK supports SOCK_DGRAM, while AF_HYPERV doesn't.

3. AF_VSOCK supports some special sock opts, like SO_VM_SOCKETS_BUFFER_SIZE,
SO_VM_SOCKETS_BUFFER_MIN/MAX_SIZE and SO_VM_SOCKETS_CONNECT_TIMEOUT.
These are meaningless to AF_HYPERV.

4. Some AF_VSOCK's VMCI transportation ops are meanless to AF_HYPERV/VMBus,
like .notify_recv_init
.notify_recv_pre_block
.notify_recv_pre_dequeue
.notify_recv_post_dequeue
.notify_send_init
.notify_send_pre_block
.notify_send_pre_enqueue
.notify_send_post_enqueue
etc.

So I think we'd better introduce a new address family: AF_HYPERV.

Please review the patch.

Looking forward to your comments!

Changes since v1:
- updated "[PATCH 6/7] hvsock: introduce Hyper-V VM Sockets feature"
- added __init and __exit for the module init/exit functions
- net/hv_sock/Kconfig: "default m" -> "default m if HYPERV"
- MODULE_LICENSE: "Dual MIT/GPL" -> "Dual BSD/GPL"

Changes since v2:
- fixed various coding issue pointed out by David Miller
- fixed indentation issues
- removed pr_debug in net/hv_sock/af_hvsock.c
- used reverse-Chrismas-tree style for local variables.
- EXPORT_SYMBOL -> EXPORT_SYMBOL_GPL

Changes since v3:
- fixed a few coding issue pointed by Vitaly Kuznetsov and Dan Carpenter
- fixed the ret value in vmbus_recvpacket_hvsock on error
- fixed the style of multi-line comment: vmbus_get_hvsock_rw_status()

Changes since v4 (https://lkml.org/lkml/2015/7/28/404):
- addressed all the comments about V4.
- treat the hvsock offers/channels as special VMBus devices
- add a mechanism to pass hvsock events to the hvsock driver
- fixed some corner cases with proper locking when a connection is closed
- rebased to the latest Greg's tree

Changes since v5 (https://lkml.org/lkml/2015/12/24/103):
- addressed the coding style issues (Vitaly Kuznetsov & David Miller, thanks!)
- used a better coding for the per-channel rescind callback (Thank Vitaly!)
- avoided the introduction of new VMBUS driver APIs vmbus_sendpacket_hvsock()
and vmbus_recvpacket_hvsock() and used vmbus_sendpacket()/vmbus_recvpacket()
in the higher level (i.e., the vmsock driver). Thank Vitaly!

Changes since v6 (http://lkml.iu.edu/hypermail/linux/kernel/1601.3/01813.html)
- only a few minor changes of coding style and comments

Changes since v7
- a few minor changes of coding style: thanks, Joe Perches!
- added some lines of comments about GUID/UUID before the struct sockaddr_hv.

Dexuan Cui (1):
  hv_sock: introduce Hyper-V Sockets

 MAINTAINERS                 |    2 +
 include/linux/hyperv.h      |   16 +
 include/linux/socket.h      |    5 +-
 include/net/af_hvsock.h     |   51 ++
 include/uapi/linux/hyperv.h |   25 +
 net/Kconfig                 |    1 +
 net/Makefile                |    1 +
 net/hv_sock/Kconfig         |   10 +
 net/hv_sock/Makefile        |    3 +
 net/hv_sock/af_hvsock.c     | 1483 +++++++++++++++++++++++++++++++++++++++++++
 10 files changed, 1595 insertions(+), 2 deletions(-)
 create mode 100644 include/net/af_hvsock.h
 create mode 100644 net/hv_sock/Kconfig
 create mode 100644 net/hv_sock/Makefile
 create mode 100644 net/hv_sock/af_hvsock.c

-- 
2.1.0

^ permalink raw reply

* [PATCH v8 net-next 1/1] hv_sock: introduce Hyper-V Sockets
From: Dexuan Cui @ 2016-04-08  1:36 UTC (permalink / raw)
  To: gregkh, davem, netdev, linux-kernel, devel, olaf, apw, jasowang,
	cavery, kys, haiyangz
  Cc: joe

Hyper-V Sockets (hv_sock) supplies a byte-stream based communication
mechanism between the host and the guest. It's somewhat like TCP over
VMBus, but the transportation layer (VMBus) is much simpler than IP.

With Hyper-V Sockets, applications between the host and the guest can talk
to each other directly by the traditional BSD-style socket APIs.

Hyper-V Sockets is only available on new Windows hosts, like Windows Server
2016. More info is in this article "Make your own integration services":
https://msdn.microsoft.com/en-us/virtualization/hyperv_on_windows/develop/make_mgmt_service

The patch implements the necessary support in the guest side by introducing
a new socket address family AF_HYPERV.

Signed-off-by: Dexuan Cui <decui@microsoft.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 MAINTAINERS                 |    2 +
 include/linux/hyperv.h      |   16 +
 include/linux/socket.h      |    5 +-
 include/net/af_hvsock.h     |   51 ++
 include/uapi/linux/hyperv.h |   25 +
 net/Kconfig                 |    1 +
 net/Makefile                |    1 +
 net/hv_sock/Kconfig         |   10 +
 net/hv_sock/Makefile        |    3 +
 net/hv_sock/af_hvsock.c     | 1483 +++++++++++++++++++++++++++++++++++++++++++
 10 files changed, 1595 insertions(+), 2 deletions(-)
 create mode 100644 include/net/af_hvsock.h
 create mode 100644 net/hv_sock/Kconfig
 create mode 100644 net/hv_sock/Makefile
 create mode 100644 net/hv_sock/af_hvsock.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 67d99dd..7b6f203 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5267,7 +5267,9 @@ F:	drivers/pci/host/pci-hyperv.c
 F:	drivers/net/hyperv/
 F:	drivers/scsi/storvsc_drv.c
 F:	drivers/video/fbdev/hyperv_fb.c
+F:	net/hv_sock/
 F:	include/linux/hyperv.h
+F:	include/net/af_hvsock.h
 F:	tools/hv/
 F:	Documentation/ABI/stable/sysfs-bus-vmbus
 
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index aa0fadc..b92439d 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1338,4 +1338,20 @@ extern __u32 vmbus_proto_version;
 
 int vmbus_send_tl_connect_request(const uuid_le *shv_guest_servie_id,
 				  const uuid_le *shv_host_servie_id);
+struct vmpipe_proto_header {
+	u32 pkt_type;
+	u32 data_size;
+} __packed;
+
+#define HVSOCK_HEADER_LEN	(sizeof(struct vmpacket_descriptor) + \
+				 sizeof(struct vmpipe_proto_header))
+
+/* See 'prev_indices' in hv_ringbuffer_read(), hv_ringbuffer_write() */
+#define PREV_INDICES_LEN	(sizeof(u64))
+
+#define HVSOCK_PKT_LEN(payload_len)	(HVSOCK_HEADER_LEN + \
+					ALIGN((payload_len), 8) + \
+					PREV_INDICES_LEN)
+#define HVSOCK_MIN_PKT_LEN	HVSOCK_PKT_LEN(1)
+
 #endif /* _HYPERV_H */
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 73bf6c6..88b1ccd 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -201,8 +201,8 @@ struct ucred {
 #define AF_NFC		39	/* NFC sockets			*/
 #define AF_VSOCK	40	/* vSockets			*/
 #define AF_KCM		41	/* Kernel Connection Multiplexor*/
-
-#define AF_MAX		42	/* For now.. */
+#define AF_HYPERV	42	/* Hyper-V Sockets		*/
+#define AF_MAX		43	/* For now.. */
 
 /* Protocol families, same as address families. */
 #define PF_UNSPEC	AF_UNSPEC
@@ -249,6 +249,7 @@ struct ucred {
 #define PF_NFC		AF_NFC
 #define PF_VSOCK	AF_VSOCK
 #define PF_KCM		AF_KCM
+#define PF_HYPERV	AF_HYPERV
 #define PF_MAX		AF_MAX
 
 /* Maximum queue length specifiable by listen.  */
diff --git a/include/net/af_hvsock.h b/include/net/af_hvsock.h
new file mode 100644
index 0000000..a5aa28d
--- /dev/null
+++ b/include/net/af_hvsock.h
@@ -0,0 +1,51 @@
+#ifndef __AF_HVSOCK_H__
+#define __AF_HVSOCK_H__
+
+#include <linux/kernel.h>
+#include <linux/hyperv.h>
+#include <net/sock.h>
+
+#define VMBUS_RINGBUFFER_SIZE_HVSOCK_RECV (5 * PAGE_SIZE)
+#define VMBUS_RINGBUFFER_SIZE_HVSOCK_SEND (5 * PAGE_SIZE)
+
+#define HVSOCK_RCV_BUF_SZ	VMBUS_RINGBUFFER_SIZE_HVSOCK_RECV
+#define HVSOCK_SND_BUF_SZ	PAGE_SIZE
+
+#define sk_to_hvsock(__sk)    ((struct hvsock_sock *)(__sk))
+#define hvsock_to_sk(__hvsk)   ((struct sock *)(__hvsk))
+
+struct hvsock_sock {
+	/* sk must be the first member. */
+	struct sock sk;
+
+	struct sockaddr_hv local_addr;
+	struct sockaddr_hv remote_addr;
+
+	/* protected by the global hvsock_mutex */
+	struct list_head bound_list;
+	struct list_head connected_list;
+
+	struct list_head accept_queue;
+	/* used by enqueue and dequeue */
+	struct mutex accept_queue_mutex;
+
+	struct delayed_work dwork;
+
+	u32 peer_shutdown;
+
+	struct vmbus_channel *channel;
+
+	struct {
+		struct vmpipe_proto_header hdr;
+		char buf[HVSOCK_SND_BUF_SZ];
+	} __packed send;
+
+	struct {
+		struct vmpipe_proto_header hdr;
+		char buf[HVSOCK_RCV_BUF_SZ];
+		unsigned int data_len;
+		unsigned int data_offset;
+	} __packed recv;
+};
+
+#endif /* __AF_HVSOCK_H__ */
diff --git a/include/uapi/linux/hyperv.h b/include/uapi/linux/hyperv.h
index e347b24..f1d0bca 100644
--- a/include/uapi/linux/hyperv.h
+++ b/include/uapi/linux/hyperv.h
@@ -26,6 +26,7 @@
 #define _UAPI_HYPERV_H
 
 #include <linux/uuid.h>
+#include <linux/socket.h>
 
 /*
  * Framework version for util services.
@@ -396,4 +397,28 @@ struct hv_kvp_ip_msg {
 	struct hv_kvp_ipaddr_value      kvp_ip_val;
 } __attribute__((packed));
 
+/*
+ * This is the address fromat of Hyper-V Sockets.
+ * Note: here we just borrow the kernel's built-in type uuid_le. When
+ * an application calls bind() or connect(), the 2 members of struct
+ * sockaddr_hv must be of GUID.
+ * The GUID format differs from the UUID format only in the byte order of
+ * the first 3 fields. Refer to:
+ * https://en.wikipedia.org/wiki/Globally_unique_identifier
+ */
+#define guid_t uuid_le
+struct sockaddr_hv {
+	__kernel_sa_family_t	shv_family;  /* Address family		*/
+	__le16		reserved;	     /* Must be Zero		*/
+	guid_t		shv_vm_id;	     /* Not used. Must be Zero. */
+	guid_t		shv_service_id;	     /* Service ID		*/
+};
+
+#define SHV_VMID_GUEST	NULL_UUID_LE
+#define SHV_VMID_HOST	NULL_UUID_LE
+
+#define SHV_SERVICE_ID_ANY	NULL_UUID_LE
+
+#define SHV_PROTO_RAW		1
+
 #endif /* _UAPI_HYPERV_H */
diff --git a/net/Kconfig b/net/Kconfig
index a8934d8..68d13d7 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -231,6 +231,7 @@ source "net/dns_resolver/Kconfig"
 source "net/batman-adv/Kconfig"
 source "net/openvswitch/Kconfig"
 source "net/vmw_vsock/Kconfig"
+source "net/hv_sock/Kconfig"
 source "net/netlink/Kconfig"
 source "net/mpls/Kconfig"
 source "net/hsr/Kconfig"
diff --git a/net/Makefile b/net/Makefile
index 81d1411..d115c31 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -70,6 +70,7 @@ obj-$(CONFIG_BATMAN_ADV)	+= batman-adv/
 obj-$(CONFIG_NFC)		+= nfc/
 obj-$(CONFIG_OPENVSWITCH)	+= openvswitch/
 obj-$(CONFIG_VSOCKETS)	+= vmw_vsock/
+obj-$(CONFIG_HYPERV_SOCK)	+= hv_sock/
 obj-$(CONFIG_MPLS)		+= mpls/
 obj-$(CONFIG_HSR)		+= hsr/
 ifneq ($(CONFIG_NET_SWITCHDEV),)
diff --git a/net/hv_sock/Kconfig b/net/hv_sock/Kconfig
new file mode 100644
index 0000000..1f41848
--- /dev/null
+++ b/net/hv_sock/Kconfig
@@ -0,0 +1,10 @@
+config HYPERV_SOCK
+	tristate "Hyper-V Sockets"
+	depends on HYPERV
+	default m if HYPERV
+	help
+	  Hyper-V Sockets is somewhat like TCP over VMBus, allowing
+	  communication between Linux guest and Hyper-V host without TCP/IP.
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called hv_sock.
diff --git a/net/hv_sock/Makefile b/net/hv_sock/Makefile
new file mode 100644
index 0000000..716c012
--- /dev/null
+++ b/net/hv_sock/Makefile
@@ -0,0 +1,3 @@
+obj-$(CONFIG_HYPERV_SOCK) += hv_sock.o
+
+hv_sock-y += af_hvsock.o
diff --git a/net/hv_sock/af_hvsock.c b/net/hv_sock/af_hvsock.c
new file mode 100644
index 0000000..185a382
--- /dev/null
+++ b/net/hv_sock/af_hvsock.c
@@ -0,0 +1,1483 @@
+/*
+ * Hyper-V Sockets -- a socket-based communication channel between the
+ * Hyper-V host and the virtual machines running on it.
+ *
+ * Copyright(c) 2016, Microsoft Corporation. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. The name of the author may not be used to endorse or promote
+ *    products derived from this software without specific prior written
+ *    permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
+ * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+ * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT,
+ * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+ * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
+ * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <net/af_hvsock.h>
+
+static struct proto hvsock_proto = {
+	.name = "HV_SOCK",
+	.owner = THIS_MODULE,
+	.obj_size = sizeof(struct hvsock_sock),
+};
+
+#define SS_LISTEN 255
+
+static LIST_HEAD(hvsock_bound_list);
+static LIST_HEAD(hvsock_connected_list);
+static DEFINE_MUTEX(hvsock_mutex);
+
+static bool uuid_equals(uuid_le u1, uuid_le u2)
+{
+	return !uuid_le_cmp(u1, u2);
+}
+
+/* NOTE: hvsock_mutex must be held when the below helper functions, whose
+ * names begin with __ hvsock, are invoked.
+ */
+static void __hvsock_insert_bound(struct list_head *list,
+				  struct hvsock_sock *hvsk)
+{
+	sock_hold(&hvsk->sk);
+	list_add(&hvsk->bound_list, list);
+}
+
+static void __hvsock_insert_connected(struct list_head *list,
+				      struct hvsock_sock *hvsk)
+{
+	sock_hold(&hvsk->sk);
+	list_add(&hvsk->connected_list, list);
+}
+
+static void __hvsock_remove_bound(struct hvsock_sock *hvsk)
+{
+	list_del_init(&hvsk->bound_list);
+	sock_put(&hvsk->sk);
+}
+
+static void __hvsock_remove_connected(struct hvsock_sock *hvsk)
+{
+	list_del_init(&hvsk->connected_list);
+	sock_put(&hvsk->sk);
+}
+
+static struct sock *__hvsock_find_bound_socket(const struct sockaddr_hv *addr)
+{
+	struct hvsock_sock *hvsk;
+
+	list_for_each_entry(hvsk, &hvsock_bound_list, bound_list) {
+		if (uuid_equals(addr->shv_service_id,
+				hvsk->local_addr.shv_service_id))
+			return hvsock_to_sk(hvsk);
+	}
+	return NULL;
+}
+
+static struct sock *__hvsock_find_connected_socket_by_channel(
+	const struct vmbus_channel *channel)
+{
+	struct hvsock_sock *hvsk;
+
+	list_for_each_entry(hvsk, &hvsock_connected_list, connected_list) {
+		if (hvsk->channel == channel)
+			return hvsock_to_sk(hvsk);
+	}
+	return NULL;
+}
+
+static bool __hvsock_in_bound_list(struct hvsock_sock *hvsk)
+{
+	return !list_empty(&hvsk->bound_list);
+}
+
+static bool __hvsock_in_connected_list(struct hvsock_sock *hvsk)
+{
+	return !list_empty(&hvsk->connected_list);
+}
+
+static void hvsock_insert_connected(struct hvsock_sock *hvsk)
+{
+	__hvsock_insert_connected(&hvsock_connected_list, hvsk);
+}
+
+static
+void hvsock_enqueue_accept(struct sock *listener, struct sock *connected)
+{
+	struct hvsock_sock *hvlistener;
+	struct hvsock_sock *hvconnected;
+
+	hvlistener = sk_to_hvsock(listener);
+	hvconnected = sk_to_hvsock(connected);
+
+	sock_hold(connected);
+	sock_hold(listener);
+
+	mutex_lock(&hvlistener->accept_queue_mutex);
+	list_add_tail(&hvconnected->accept_queue, &hvlistener->accept_queue);
+	listener->sk_ack_backlog++;
+	mutex_unlock(&hvlistener->accept_queue_mutex);
+}
+
+static struct sock *hvsock_dequeue_accept(struct sock *listener)
+{
+	struct hvsock_sock *hvlistener;
+	struct hvsock_sock *hvconnected;
+
+	hvlistener = sk_to_hvsock(listener);
+
+	mutex_lock(&hvlistener->accept_queue_mutex);
+
+	if (list_empty(&hvlistener->accept_queue)) {
+		mutex_unlock(&hvlistener->accept_queue_mutex);
+		return NULL;
+	}
+
+	hvconnected = list_entry(hvlistener->accept_queue.next,
+				 struct hvsock_sock, accept_queue);
+
+	list_del_init(&hvconnected->accept_queue);
+	listener->sk_ack_backlog--;
+
+	mutex_unlock(&hvlistener->accept_queue_mutex);
+
+	sock_put(listener);
+	/* The caller will need a reference on the connected socket so we let
+	 * it call sock_put().
+	 */
+
+	return hvsock_to_sk(hvconnected);
+}
+
+static bool hvsock_is_accept_queue_empty(struct sock *sk)
+{
+	struct hvsock_sock *hvsk = sk_to_hvsock(sk);
+	int ret;
+
+	mutex_lock(&hvsk->accept_queue_mutex);
+	ret = list_empty(&hvsk->accept_queue);
+	mutex_unlock(&hvsk->accept_queue_mutex);
+
+	return ret;
+}
+
+static void hvsock_addr_init(struct sockaddr_hv *addr, uuid_le service_id)
+{
+	memset(addr, 0, sizeof(*addr));
+	addr->shv_family = AF_HYPERV;
+	addr->shv_service_id = service_id;
+}
+
+static int hvsock_addr_validate(const struct sockaddr_hv *addr)
+{
+	if (!addr)
+		return -EFAULT;
+
+	if (addr->shv_family != AF_HYPERV)
+		return -EAFNOSUPPORT;
+
+	if (addr->reserved != 0)
+		return -EINVAL;
+
+	if (!uuid_equals(addr->shv_vm_id, NULL_UUID_LE))
+		return -EINVAL;
+
+	return 0;
+}
+
+static bool hvsock_addr_bound(const struct sockaddr_hv *addr)
+{
+	return !uuid_equals(addr->shv_service_id, SHV_SERVICE_ID_ANY);
+}
+
+static int hvsock_addr_cast(const struct sockaddr *addr, size_t len,
+			    struct sockaddr_hv **out_addr)
+{
+	if (len < sizeof(**out_addr))
+		return -EFAULT;
+
+	*out_addr = (struct sockaddr_hv *)addr;
+	return hvsock_addr_validate(*out_addr);
+}
+
+static int __hvsock_do_bind(struct hvsock_sock *hvsk,
+			    struct sockaddr_hv *addr)
+{
+	struct sockaddr_hv hv_addr;
+	int ret = 0;
+
+	hvsock_addr_init(&hv_addr, addr->shv_service_id);
+
+	mutex_lock(&hvsock_mutex);
+
+	if (uuid_equals(addr->shv_service_id, SHV_SERVICE_ID_ANY)) {
+		do {
+			uuid_le_gen(&hv_addr.shv_service_id);
+		} while (__hvsock_find_bound_socket(&hv_addr));
+	} else {
+		if (__hvsock_find_bound_socket(&hv_addr)) {
+			ret = -EADDRINUSE;
+			goto out;
+		}
+	}
+
+	hvsock_addr_init(&hvsk->local_addr, hv_addr.shv_service_id);
+	__hvsock_insert_bound(&hvsock_bound_list, hvsk);
+
+out:
+	mutex_unlock(&hvsock_mutex);
+
+	return ret;
+}
+
+static int __hvsock_bind(struct sock *sk, struct sockaddr_hv *addr)
+{
+	struct hvsock_sock *hvsk = sk_to_hvsock(sk);
+	int ret;
+
+	if (hvsock_addr_bound(&hvsk->local_addr))
+		return -EINVAL;
+
+	switch (sk->sk_socket->type) {
+	case SOCK_STREAM:
+		ret = __hvsock_do_bind(hvsk, addr);
+		break;
+
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
+/* Autobind this socket to the local address if necessary. */
+static int hvsock_auto_bind(struct hvsock_sock *hvsk)
+{
+	struct sock *sk = hvsock_to_sk(hvsk);
+	struct sockaddr_hv local_addr;
+
+	if (hvsock_addr_bound(&hvsk->local_addr))
+		return 0;
+	hvsock_addr_init(&local_addr, SHV_SERVICE_ID_ANY);
+	return __hvsock_bind(sk, &local_addr);
+}
+
+static void hvsock_sk_destruct(struct sock *sk)
+{
+	struct hvsock_sock *hvsk = sk_to_hvsock(sk);
+	struct vmbus_channel *channel = hvsk->channel;
+
+	if (!channel)
+		return;
+
+	vmbus_hvsock_device_unregister(channel);
+}
+
+static void __hvsock_release(struct sock *sk)
+{
+	struct hvsock_sock *hvsk;
+	struct sock *pending;
+
+	hvsk = sk_to_hvsock(sk);
+
+	mutex_lock(&hvsock_mutex);
+	if (__hvsock_in_bound_list(hvsk))
+		__hvsock_remove_bound(hvsk);
+
+	if (__hvsock_in_connected_list(hvsk))
+		__hvsock_remove_connected(hvsk);
+	mutex_unlock(&hvsock_mutex);
+
+	lock_sock(sk);
+	sock_orphan(sk);
+	sk->sk_shutdown = SHUTDOWN_MASK;
+
+	/* Clean up any sockets that never were accepted. */
+	while ((pending = hvsock_dequeue_accept(sk)) != NULL) {
+		__hvsock_release(pending);
+		sock_put(pending);
+	}
+
+	release_sock(sk);
+	sock_put(sk);
+}
+
+static int hvsock_release(struct socket *sock)
+{
+	/* If accept() is interrupted by a signal, the temporary socket
+	 * struct's sock->sk is NULL.
+	 */
+	if (sock->sk) {
+		__hvsock_release(sock->sk);
+		sock->sk = NULL;
+	}
+
+	sock->state = SS_FREE;
+	return 0;
+}
+
+static struct sock *__hvsock_create(struct net *net, struct socket *sock,
+				    gfp_t priority, unsigned short type)
+{
+	struct hvsock_sock *hvsk;
+	struct sock *sk;
+
+	sk = sk_alloc(net, AF_HYPERV, priority, &hvsock_proto, 0);
+	if (!sk)
+		return NULL;
+
+	sock_init_data(sock, sk);
+
+	/* sk->sk_type is normally set in sock_init_data, but only if sock is
+	 * non-NULL. We make sure that our sockets always have a type by
+	 * setting it here if needed.
+	 */
+	if (!sock)
+		sk->sk_type = type;
+
+	hvsk = sk_to_hvsock(sk);
+	hvsock_addr_init(&hvsk->local_addr, SHV_SERVICE_ID_ANY);
+	hvsock_addr_init(&hvsk->remote_addr, SHV_SERVICE_ID_ANY);
+
+	sk->sk_destruct = hvsock_sk_destruct;
+
+	/* Looks stream-based socket doesn't need this. */
+	sk->sk_backlog_rcv = NULL;
+
+	sk->sk_state = 0;
+	sock_reset_flag(sk, SOCK_DONE);
+
+	INIT_LIST_HEAD(&hvsk->bound_list);
+	INIT_LIST_HEAD(&hvsk->connected_list);
+
+	INIT_LIST_HEAD(&hvsk->accept_queue);
+	mutex_init(&hvsk->accept_queue_mutex);
+
+	hvsk->peer_shutdown = 0;
+
+	hvsk->recv.data_len = 0;
+	hvsk->recv.data_offset = 0;
+
+	return sk;
+}
+
+static int hvsock_bind(struct socket *sock, struct sockaddr *addr,
+		       int addr_len)
+{
+	struct sockaddr_hv *hv_addr;
+	struct sock *sk;
+	int ret;
+
+	sk = sock->sk;
+
+	if (hvsock_addr_cast(addr, addr_len, &hv_addr) != 0)
+		return -EINVAL;
+
+	lock_sock(sk);
+	ret = __hvsock_bind(sk, hv_addr);
+	release_sock(sk);
+
+	return ret;
+}
+
+static int hvsock_getname(struct socket *sock,
+			  struct sockaddr *addr, int *addr_len, int peer)
+{
+	struct sockaddr_hv *hv_addr;
+	struct hvsock_sock *hvsk;
+	struct sock *sk;
+	int ret;
+
+	sk = sock->sk;
+	hvsk = sk_to_hvsock(sk);
+	ret = 0;
+
+	lock_sock(sk);
+
+	if (peer) {
+		if (sock->state != SS_CONNECTED) {
+			ret = -ENOTCONN;
+			goto out;
+		}
+		hv_addr = &hvsk->remote_addr;
+	} else {
+		hv_addr = &hvsk->local_addr;
+	}
+
+	__sockaddr_check_size(sizeof(*hv_addr));
+
+	memcpy(addr, hv_addr, sizeof(*hv_addr));
+	*addr_len = sizeof(*hv_addr);
+
+out:
+	release_sock(sk);
+	return ret;
+}
+
+static int hvsock_shutdown(struct socket *sock, int mode)
+{
+	struct sock *sk;
+
+	if (mode < SHUT_RD || mode > SHUT_RDWR)
+		return -EINVAL;
+	/* This maps:
+	 * SHUT_RD   (0) -> RCV_SHUTDOWN  (1)
+	 * SHUT_WR   (1) -> SEND_SHUTDOWN (2)
+	 * SHUT_RDWR (2) -> SHUTDOWN_MASK (3)
+	 */
+	++mode;
+
+	if (sock->state == SS_UNCONNECTED)
+		return -ENOTCONN;
+
+	sock->state = SS_DISCONNECTING;
+
+	sk = sock->sk;
+
+	lock_sock(sk);
+
+	sk->sk_shutdown |= mode;
+	sk->sk_state_change(sk);
+
+	/* TODO: how to send a FIN if we haven't done that? */
+	if (mode & SEND_SHUTDOWN)
+		;
+
+	release_sock(sk);
+
+	return 0;
+}
+
+static void get_ringbuffer_rw_status(struct vmbus_channel *channel,
+				     bool *can_read, bool *can_write)
+{
+	u32 avl_read_bytes, avl_write_bytes, dummy;
+
+	if (can_read) {
+		hv_get_ringbuffer_availbytes(&channel->inbound,
+					     &avl_read_bytes,
+					     &dummy);
+		*can_read = avl_read_bytes >= HVSOCK_MIN_PKT_LEN;
+	}
+
+	/* We write into the ringbuffer only when we're able to write a
+	 * a payload of 4096 bytes (the actual written payload's length may be
+	 * less than 4096).
+	 */
+	if (can_write) {
+		hv_get_ringbuffer_availbytes(&channel->outbound,
+					     &dummy,
+					     &avl_write_bytes);
+		*can_write = avl_write_bytes > HVSOCK_PKT_LEN(PAGE_SIZE);
+	}
+}
+
+static unsigned int hvsock_poll(struct file *file, struct socket *sock,
+				poll_table *wait)
+{
+	struct vmbus_channel *channel;
+	bool can_read, can_write;
+	struct hvsock_sock *hvsk;
+	struct sock *sk;
+	unsigned int mask;
+
+	sk = sock->sk;
+	hvsk = sk_to_hvsock(sk);
+
+	poll_wait(file, sk_sleep(sk), wait);
+	mask = 0;
+
+	if (sk->sk_err)
+		/* Signify that there has been an error on this socket. */
+		mask |= POLLERR;
+
+	/* INET sockets treat local write shutdown and peer write shutdown as a
+	 * case of POLLHUP set.
+	 */
+	if ((sk->sk_shutdown == SHUTDOWN_MASK) ||
+	    ((sk->sk_shutdown & SEND_SHUTDOWN) &&
+	     (hvsk->peer_shutdown & SEND_SHUTDOWN))) {
+		mask |= POLLHUP;
+	}
+
+	if (sk->sk_shutdown & RCV_SHUTDOWN ||
+	    hvsk->peer_shutdown & SEND_SHUTDOWN) {
+		mask |= POLLRDHUP;
+	}
+
+	lock_sock(sk);
+
+	/* Listening sockets that have connections in their accept
+	 * queue can be read.
+	 */
+	if (sk->sk_state == SS_LISTEN && !hvsock_is_accept_queue_empty(sk))
+		mask |= POLLIN | POLLRDNORM;
+
+	/* The mutex is to against hvsock_open_connection() */
+	mutex_lock(&hvsock_mutex);
+
+	channel = hvsk->channel;
+	if (channel) {
+		/* If there is something in the queue then we can read */
+		get_ringbuffer_rw_status(channel, &can_read, &can_write);
+
+		if (!can_read && hvsk->recv.data_len > 0)
+			can_read = true;
+
+		if (!(sk->sk_shutdown & RCV_SHUTDOWN) && can_read)
+			mask |= POLLIN | POLLRDNORM;
+	} else {
+		can_read = false;
+		can_write = false;
+	}
+
+	mutex_unlock(&hvsock_mutex);
+
+	/* Sockets whose connections have been closed terminated should
+	 * also be considered read, and we check the shutdown flag for that.
+	 */
+	if (sk->sk_shutdown & RCV_SHUTDOWN ||
+	    hvsk->peer_shutdown & SEND_SHUTDOWN) {
+		mask |= POLLIN | POLLRDNORM;
+	}
+
+	/* Connected sockets that can produce data can be written. */
+	if (sk->sk_state == SS_CONNECTED && can_write &&
+	    !(sk->sk_shutdown & SEND_SHUTDOWN)) {
+		/* Remove POLLWRBAND since INET sockets are not setting it.
+		 */
+		mask |= POLLOUT | POLLWRNORM;
+	}
+
+	/* Simulate INET socket poll behaviors, which sets
+	 * POLLOUT|POLLWRNORM when peer is closed and nothing to read,
+	 * but local send is not shutdown.
+	 */
+	if (sk->sk_state == SS_UNCONNECTED &&
+	    !(sk->sk_shutdown & SEND_SHUTDOWN))
+		mask |= POLLOUT | POLLWRNORM;
+
+	release_sock(sk);
+
+	return mask;
+}
+
+/* This function runs in the tasklet context of process_chn_event() */
+static void hvsock_on_channel_cb(void *ctx)
+{
+	struct sock *sk = (struct sock *)ctx;
+	struct hvsock_sock *hvsk = sk_to_hvsock(sk);
+	struct vmbus_channel *channel = hvsk->channel;
+	bool can_read, can_write;
+
+	if (!channel) {
+		WARN_ONCE(1, "NULL channel! There is a programming bug.\n");
+		return;
+	}
+
+	get_ringbuffer_rw_status(channel, &can_read, &can_write);
+
+	if (can_read)
+		sk->sk_data_ready(sk);
+
+	if (can_write)
+		sk->sk_write_space(sk);
+}
+
+static void hvsock_close_connection(struct vmbus_channel *channel)
+{
+	struct hvsock_sock *hvsk;
+	struct sock *sk;
+
+	mutex_lock(&hvsock_mutex);
+
+	sk = __hvsock_find_connected_socket_by_channel(channel);
+
+	/* The guest has already closed the connection? */
+	if (!sk)
+		goto out;
+
+	sk->sk_socket->state = SS_UNCONNECTED;
+	sk->sk_state = SS_UNCONNECTED;
+	sock_set_flag(sk, SOCK_DONE);
+
+	hvsk = sk_to_hvsock(sk);
+	hvsk->peer_shutdown |= SEND_SHUTDOWN | RCV_SHUTDOWN;
+
+	sk->sk_state_change(sk);
+out:
+	mutex_unlock(&hvsock_mutex);
+}
+
+static int hvsock_open_connection(struct vmbus_channel *channel)
+{
+	struct hvsock_sock *hvsk, *new_hvsk;
+	struct sockaddr_hv hv_addr;
+	struct sock *sk, *new_sk;
+
+	uuid_le *instance, *service_id;
+	int ret;
+
+	instance = &channel->offermsg.offer.if_instance;
+	service_id = &channel->offermsg.offer.if_type;
+
+	hvsock_addr_init(&hv_addr, *instance);
+
+	mutex_lock(&hvsock_mutex);
+
+	sk = __hvsock_find_bound_socket(&hv_addr);
+
+	if (sk) {
+		/* It is from the guest client's connect() */
+		if (sk->sk_state != SS_CONNECTING) {
+			ret = -ENXIO;
+			goto out;
+		}
+
+		hvsk = sk_to_hvsock(sk);
+		hvsk->channel = channel;
+		set_channel_read_state(channel, false);
+		vmbus_set_chn_rescind_callback(channel,
+					       hvsock_close_connection);
+		ret = vmbus_open(channel, VMBUS_RINGBUFFER_SIZE_HVSOCK_SEND,
+				 VMBUS_RINGBUFFER_SIZE_HVSOCK_RECV, NULL, 0,
+				 hvsock_on_channel_cb, sk);
+		if (ret != 0) {
+			hvsk->channel = NULL;
+			goto out;
+		}
+
+		set_channel_pending_send_size(channel,
+					      HVSOCK_PKT_LEN(PAGE_SIZE));
+		sk->sk_state = SS_CONNECTED;
+		sk->sk_socket->state = SS_CONNECTED;
+		hvsock_insert_connected(hvsk);
+		sk->sk_state_change(sk);
+		goto out;
+	}
+
+	/* Now we suppose it is from a host client's connect() */
+	hvsock_addr_init(&hv_addr, *service_id);
+	sk = __hvsock_find_bound_socket(&hv_addr);
+
+	/* No guest server listening? Well, let's ignore the offer */
+	if (!sk || sk->sk_state != SS_LISTEN) {
+		ret = -ENXIO;
+		goto out;
+	}
+
+	if (sk->sk_ack_backlog >= sk->sk_max_ack_backlog) {
+		ret = -EMFILE;
+		goto out;
+	}
+
+	new_sk = __hvsock_create(sock_net(sk), NULL, GFP_KERNEL, sk->sk_type);
+	if (!new_sk) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	new_hvsk = sk_to_hvsock(new_sk);
+	new_sk->sk_state = SS_CONNECTING;
+	hvsock_addr_init(&new_hvsk->local_addr, *service_id);
+	hvsock_addr_init(&new_hvsk->remote_addr, *instance);
+
+	set_channel_read_state(channel, false);
+	new_hvsk->channel = channel;
+	vmbus_set_chn_rescind_callback(channel, hvsock_close_connection);
+	ret = vmbus_open(channel, VMBUS_RINGBUFFER_SIZE_HVSOCK_SEND,
+			 VMBUS_RINGBUFFER_SIZE_HVSOCK_RECV, NULL, 0,
+			 hvsock_on_channel_cb, new_sk);
+	if (ret != 0) {
+		new_hvsk->channel = NULL;
+		sock_put(new_sk);
+		goto out;
+	}
+	set_channel_pending_send_size(channel, HVSOCK_PKT_LEN(PAGE_SIZE));
+
+	new_sk->sk_state = SS_CONNECTED;
+	hvsock_insert_connected(new_hvsk);
+	hvsock_enqueue_accept(sk, new_sk);
+	sk->sk_state_change(sk);
+out:
+	mutex_unlock(&hvsock_mutex);
+	return ret;
+}
+
+static void hvsock_connect_timeout(struct work_struct *work)
+{
+	struct hvsock_sock *hvsk;
+	struct sock *sk;
+
+	hvsk = container_of(work, struct hvsock_sock, dwork.work);
+	sk = hvsock_to_sk(hvsk);
+
+	lock_sock(sk);
+	if ((sk->sk_state == SS_CONNECTING) &&
+	    (sk->sk_shutdown != SHUTDOWN_MASK)) {
+		sk->sk_state = SS_UNCONNECTED;
+		sk->sk_err = ETIMEDOUT;
+		sk->sk_error_report(sk);
+	}
+	release_sock(sk);
+
+	sock_put(sk);
+}
+
+static int hvsock_connect(struct socket *sock, struct sockaddr *addr,
+			  int addr_len, int flags)
+{
+	struct sockaddr_hv *remote_addr;
+	struct hvsock_sock *hvsk;
+	struct sock *sk;
+
+	DEFINE_WAIT(wait);
+	long timeout;
+
+	int ret = 0;
+
+	sk = sock->sk;
+	hvsk = sk_to_hvsock(sk);
+
+	lock_sock(sk);
+
+	switch (sock->state) {
+	case SS_CONNECTED:
+		ret = -EISCONN;
+		goto out;
+	case SS_DISCONNECTING:
+		ret = -EINVAL;
+		goto out;
+	case SS_CONNECTING:
+		/* This continues on so we can move sock into the SS_CONNECTED
+		 * state once the connection has completed (at which point err
+		 * will be set to zero also).  Otherwise, we will either wait
+		 * for the connection or return -EALREADY should this be a
+		 * non-blocking call.
+		 */
+		ret = -EALREADY;
+		break;
+	default:
+		if ((sk->sk_state == SS_LISTEN) ||
+		    hvsock_addr_cast(addr, addr_len, &remote_addr) != 0) {
+			ret = -EINVAL;
+			goto out;
+		}
+
+		/* Set the remote address that we are connecting to. */
+		memcpy(&hvsk->remote_addr, remote_addr,
+		       sizeof(hvsk->remote_addr));
+
+		ret = hvsock_auto_bind(hvsk);
+		if (ret)
+			goto out;
+
+		sk->sk_state = SS_CONNECTING;
+
+		ret = vmbus_send_tl_connect_request(
+					&hvsk->local_addr.shv_service_id,
+					&hvsk->remote_addr.shv_service_id);
+		if (ret < 0)
+			goto out;
+
+		/* Mark sock as connecting and set the error code to in
+		 * progress in case this is a non-blocking connect.
+		 */
+		sock->state = SS_CONNECTING;
+		ret = -EINPROGRESS;
+	}
+
+	/* The receive path will handle all communication until we are able to
+	 * enter the connected state.  Here we wait for the connection to be
+	 * completed or a notification of an error.
+	 */
+	timeout = 30 * HZ;
+	prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+
+	while (sk->sk_state != SS_CONNECTED && sk->sk_err == 0) {
+		if (flags & O_NONBLOCK) {
+			/* If we're not going to block, we schedule a timeout
+			 * function to generate a timeout on the connection
+			 * attempt, in case the peer doesn't respond in a
+			 * timely manner. We hold on to the socket until the
+			 * timeout fires.
+			 */
+			sock_hold(sk);
+			INIT_DELAYED_WORK(&hvsk->dwork,
+					  hvsock_connect_timeout);
+			schedule_delayed_work(&hvsk->dwork, timeout);
+
+			/* Skip ahead to preserve error code set above. */
+			goto out_wait;
+		}
+
+		release_sock(sk);
+		timeout = schedule_timeout(timeout);
+		lock_sock(sk);
+
+		if (signal_pending(current)) {
+			ret = sock_intr_errno(timeout);
+			goto out_wait_error;
+		} else if (timeout == 0) {
+			ret = -ETIMEDOUT;
+			goto out_wait_error;
+		}
+
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+	}
+
+	ret = sk->sk_err ? -sk->sk_err : 0;
+
+out_wait_error:
+	if (ret < 0) {
+		sk->sk_state = SS_UNCONNECTED;
+		sock->state = SS_UNCONNECTED;
+	}
+out_wait:
+	finish_wait(sk_sleep(sk), &wait);
+out:
+	release_sock(sk);
+	return ret;
+}
+
+static
+int hvsock_accept(struct socket *sock, struct socket *newsock, int flags)
+{
+	struct hvsock_sock *hvconnected;
+	struct sock *connected;
+	struct sock *listener;
+
+	DEFINE_WAIT(wait);
+	long timeout;
+
+	int ret = 0;
+
+	listener = sock->sk;
+
+	lock_sock(listener);
+
+	if (sock->type != SOCK_STREAM) {
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+
+	if (listener->sk_state != SS_LISTEN) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* Wait for children sockets to appear; these are the new sockets
+	 * created upon connection establishment.
+	 */
+	timeout = sock_sndtimeo(listener, flags & O_NONBLOCK);
+	prepare_to_wait(sk_sleep(listener), &wait, TASK_INTERRUPTIBLE);
+
+	while ((connected = hvsock_dequeue_accept(listener)) == NULL &&
+	       listener->sk_err == 0) {
+		release_sock(listener);
+		timeout = schedule_timeout(timeout);
+		lock_sock(listener);
+
+		if (signal_pending(current)) {
+			ret = sock_intr_errno(timeout);
+			goto out_wait;
+		} else if (timeout == 0) {
+			ret = -EAGAIN;
+			goto out_wait;
+		}
+
+		prepare_to_wait(sk_sleep(listener), &wait, TASK_INTERRUPTIBLE);
+	}
+
+	if (listener->sk_err)
+		ret = -listener->sk_err;
+
+	if (connected) {
+		lock_sock(connected);
+		hvconnected = sk_to_hvsock(connected);
+
+		/* If the listener socket has received an error, then we should
+		 * reject this socket and return.  Note that we simply mark the
+		 * socket rejected, drop our reference, and let the cleanup
+		 * function handle the cleanup; the fact that we found it in
+		 * the listener's accept queue guarantees that the cleanup
+		 * function hasn't run yet.
+		 */
+		if (ret) {
+			release_sock(connected);
+			sock_put(connected);
+			goto out_wait;
+		}
+
+		newsock->state = SS_CONNECTED;
+		sock_graft(connected, newsock);
+		release_sock(connected);
+		sock_put(connected);
+	}
+
+out_wait:
+	finish_wait(sk_sleep(listener), &wait);
+out:
+	release_sock(listener);
+	return ret;
+}
+
+static int hvsock_listen(struct socket *sock, int backlog)
+{
+	struct hvsock_sock *hvsk;
+	struct sock *sk;
+	int ret = 0;
+
+	sk = sock->sk;
+	lock_sock(sk);
+
+	if (sock->type != SOCK_STREAM) {
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+
+	if (sock->state != SS_UNCONNECTED) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (backlog <= 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+	/* This is an artificial limit */
+	if (backlog > 128)
+		backlog = 128;
+
+	hvsk = sk_to_hvsock(sk);
+	if (!hvsock_addr_bound(&hvsk->local_addr)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	sk->sk_ack_backlog = 0;
+	sk->sk_max_ack_backlog = backlog;
+	sk->sk_state = SS_LISTEN;
+out:
+	release_sock(sk);
+	return ret;
+}
+
+static int hvsock_setsockopt(struct socket *sock,
+			     int level,
+			     int optname,
+			     char __user *optval, unsigned int optlen)
+{
+	return -ENOPROTOOPT;
+}
+
+static int hvsock_getsockopt(struct socket *sock,
+			     int level,
+			     int optname,
+			     char __user *optval, int __user *optlen)
+{
+	return -ENOPROTOOPT;
+}
+
+static int hvsock_send_data(struct vmbus_channel *channel,
+			    struct hvsock_sock *hvsk,
+			    size_t to_write)
+{
+	hvsk->send.hdr.pkt_type = 1;
+	hvsk->send.hdr.data_size = to_write;
+	return vmbus_sendpacket(channel, &hvsk->send.hdr,
+				sizeof(hvsk->send.hdr) + to_write,
+				0, VM_PKT_DATA_INBAND, 0);
+}
+
+static int hvsock_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
+{
+	struct vmbus_channel *channel;
+	struct hvsock_sock *hvsk;
+	struct sock *sk;
+
+	size_t total_to_write = len;
+	size_t total_written = 0;
+
+	bool can_write;
+	long timeout;
+	int ret = 0;
+
+	DEFINE_WAIT(wait);
+
+	if (len == 0)
+		return -EINVAL;
+
+	if (msg->msg_flags & ~MSG_DONTWAIT) {
+		pr_err("%s: unsupported flags=0x%x\n", __func__,
+		       msg->msg_flags);
+		return -EOPNOTSUPP;
+	}
+
+	sk = sock->sk;
+	hvsk = sk_to_hvsock(sk);
+	channel = hvsk->channel;
+
+	lock_sock(sk);
+
+	/* Callers should not provide a destination with stream sockets. */
+	if (msg->msg_namelen) {
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+
+	/* Send data only if both sides are not shutdown in the direction. */
+	if (sk->sk_shutdown & SEND_SHUTDOWN ||
+	    hvsk->peer_shutdown & RCV_SHUTDOWN) {
+		ret = -EPIPE;
+		goto out;
+	}
+
+	if (sk->sk_state != SS_CONNECTED ||
+	    !hvsock_addr_bound(&hvsk->local_addr)) {
+		ret = -ENOTCONN;
+		goto out;
+	}
+
+	if (!hvsock_addr_bound(&hvsk->remote_addr)) {
+		ret = -EDESTADDRREQ;
+		goto out;
+	}
+
+	timeout = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
+
+	prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+
+	while (total_to_write > 0) {
+		size_t to_write;
+
+		while (1) {
+			get_ringbuffer_rw_status(channel, NULL, &can_write);
+
+			if (can_write || sk->sk_err != 0 ||
+			    (sk->sk_shutdown & SEND_SHUTDOWN) ||
+			    (hvsk->peer_shutdown & RCV_SHUTDOWN))
+				break;
+
+			/* Don't wait for non-blocking sockets. */
+			if (timeout == 0) {
+				ret = -EAGAIN;
+				goto out_wait;
+			}
+
+			release_sock(sk);
+
+			timeout = schedule_timeout(timeout);
+
+			lock_sock(sk);
+			if (signal_pending(current)) {
+				ret = sock_intr_errno(timeout);
+				goto out_wait;
+			} else if (timeout == 0) {
+				ret = -EAGAIN;
+				goto out_wait;
+			}
+
+			prepare_to_wait(sk_sleep(sk), &wait,
+					TASK_INTERRUPTIBLE);
+		}
+
+		/* These checks occur both as part of and after the loop
+		 * conditional since we need to check before and after
+		 * sleeping.
+		 */
+		if (sk->sk_err) {
+			ret = -sk->sk_err;
+			goto out_wait;
+		} else if ((sk->sk_shutdown & SEND_SHUTDOWN) ||
+			   (hvsk->peer_shutdown & RCV_SHUTDOWN)) {
+			ret = -EPIPE;
+			goto out_wait;
+		}
+
+		/* Note: that write will only write as many bytes as possible
+		 * in the ringbuffer. It is the caller's responsibility to
+		 * check how many bytes we actually wrote.
+		 */
+		do {
+			to_write = min_t(size_t, HVSOCK_SND_BUF_SZ,
+					 total_to_write);
+			ret = memcpy_from_msg(hvsk->send.buf, msg, to_write);
+			if (ret != 0)
+				goto out_wait;
+
+			ret = hvsock_send_data(channel, hvsk, to_write);
+			if (ret != 0)
+				goto out_wait;
+
+			total_written += to_write;
+			total_to_write -= to_write;
+		} while (total_to_write > 0);
+	}
+out_wait:
+	if (total_written > 0)
+		ret = total_written;
+
+	finish_wait(sk_sleep(sk), &wait);
+out:
+	release_sock(sk);
+
+	/* ret is a bigger-than-0 total_written or a negative err code. */
+	if (ret == 0) {
+		WARN(1, "unexpected return value of 0\n");
+		ret = -EIO;
+	}
+
+	return ret;
+}
+
+static int hvsock_recv_data(struct vmbus_channel *channel,
+			    struct hvsock_sock *hvsk,
+			    size_t *payload_len)
+{
+	u32 buffer_actual_len;
+	u64 dummy_req_id;
+	int ret;
+
+	ret = vmbus_recvpacket(channel, &hvsk->recv.hdr,
+			       sizeof(hvsk->recv.hdr) + sizeof(hvsk->recv.buf),
+			       &buffer_actual_len, &dummy_req_id);
+	if (ret != 0 || buffer_actual_len <= sizeof(hvsk->recv.hdr))
+		*payload_len = 0;
+	else
+		*payload_len = hvsk->recv.hdr.data_size;
+
+	return ret;
+}
+
+static int hvsock_recvmsg(struct socket *sock, struct msghdr *msg,
+			  size_t len, int flags)
+{
+	struct vmbus_channel *channel;
+	struct hvsock_sock *hvsk;
+	struct sock *sk;
+
+	size_t total_to_read = len;
+	size_t copied;
+
+	bool can_read;
+	long timeout;
+
+	int ret = 0;
+
+	DEFINE_WAIT(wait);
+
+	sk = sock->sk;
+	hvsk = sk_to_hvsock(sk);
+	channel = hvsk->channel;
+
+	lock_sock(sk);
+
+	if (sk->sk_state != SS_CONNECTED) {
+		/* Recvmsg is supposed to return 0 if a peer performs an
+		 * orderly shutdown. Differentiate between that case and when a
+		 * peer has not connected or a local shutdown occurred with the
+		 * SOCK_DONE flag.
+		 */
+		if (sock_flag(sk, SOCK_DONE))
+			ret = 0;
+		else
+			ret = -ENOTCONN;
+
+		goto out;
+	}
+
+	/* We ignore msg->addr_name/len. */
+	if (flags & ~MSG_DONTWAIT) {
+		pr_err("%s: unsupported flags=0x%x\n", __func__, flags);
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+
+	/* We don't check peer_shutdown flag here since peer may actually shut
+	 * down, but there can be data in the queue that a local socket can
+	 * receive.
+	 */
+	if (sk->sk_shutdown & RCV_SHUTDOWN) {
+		ret = 0;
+		goto out;
+	}
+
+	/* It is valid on Linux to pass in a zero-length receive buffer.  This
+	 * is not an error.  We may as well bail out now.
+	 */
+	if (!len) {
+		ret = 0;
+		goto out;
+	}
+
+	timeout = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
+	copied = 0;
+
+	prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+
+	while (1) {
+		bool need_refill = hvsk->recv.data_len == 0;
+
+		if (need_refill)
+			get_ringbuffer_rw_status(channel, &can_read, NULL);
+		else
+			can_read = true;
+
+		if (can_read) {
+			size_t payload_len;
+
+			if (need_refill) {
+				ret = hvsock_recv_data(channel, hvsk,
+						       &payload_len);
+				if (ret != 0 || payload_len == 0 ||
+				    payload_len > HVSOCK_RCV_BUF_SZ) {
+					ret = -EIO;
+					goto out_wait;
+				}
+
+				hvsk->recv.data_len = payload_len;
+				hvsk->recv.data_offset = 0;
+			}
+
+			if (hvsk->recv.data_len <= total_to_read) {
+				ret = memcpy_to_msg(msg, hvsk->recv.buf +
+						    hvsk->recv.data_offset,
+						    hvsk->recv.data_len);
+				if (ret != 0)
+					break;
+
+				copied += hvsk->recv.data_len;
+				total_to_read -= hvsk->recv.data_len;
+				hvsk->recv.data_len = 0;
+				hvsk->recv.data_offset = 0;
+
+				if (total_to_read == 0)
+					break;
+			} else {
+				ret = memcpy_to_msg(msg, hvsk->recv.buf +
+						    hvsk->recv.data_offset,
+						    total_to_read);
+				if (ret != 0)
+					break;
+
+				copied += total_to_read;
+				hvsk->recv.data_len -= total_to_read;
+				hvsk->recv.data_offset += total_to_read;
+				total_to_read = 0;
+				break;
+			}
+		} else {
+			if (sk->sk_err || (sk->sk_shutdown & RCV_SHUTDOWN) ||
+			    (hvsk->peer_shutdown & SEND_SHUTDOWN))
+				break;
+
+			/* Don't wait for non-blocking sockets. */
+			if (timeout == 0) {
+				ret = -EAGAIN;
+				break;
+			}
+
+			if (copied > 0)
+				break;
+
+			release_sock(sk);
+			timeout = schedule_timeout(timeout);
+			lock_sock(sk);
+
+			if (signal_pending(current)) {
+				ret = sock_intr_errno(timeout);
+				break;
+			} else if (timeout == 0) {
+				ret = -EAGAIN;
+				break;
+			}
+
+			prepare_to_wait(sk_sleep(sk), &wait,
+					TASK_INTERRUPTIBLE);
+		}
+	}
+
+	if (sk->sk_err)
+		ret = -sk->sk_err;
+	else if (sk->sk_shutdown & RCV_SHUTDOWN)
+		ret = 0;
+
+	if (copied > 0) {
+		ret = copied;
+
+		/* If the other side has shutdown for sending and there
+		 * is nothing more to read, then we modify the socket
+		 * state.
+		 */
+		if ((hvsk->peer_shutdown & SEND_SHUTDOWN) &&
+		    hvsk->recv.data_len == 0) {
+			get_ringbuffer_rw_status(channel, &can_read, NULL);
+			if (!can_read) {
+				sk->sk_state = SS_UNCONNECTED;
+				sock_set_flag(sk, SOCK_DONE);
+				sk->sk_state_change(sk);
+			}
+		}
+	}
+out_wait:
+	finish_wait(sk_sleep(sk), &wait);
+out:
+	release_sock(sk);
+	return ret;
+}
+
+static const struct proto_ops hvsock_ops = {
+	.family = PF_HYPERV,
+	.owner = THIS_MODULE,
+	.release = hvsock_release,
+	.bind = hvsock_bind,
+	.connect = hvsock_connect,
+	.socketpair = sock_no_socketpair,
+	.accept = hvsock_accept,
+	.getname = hvsock_getname,
+	.poll = hvsock_poll,
+	.ioctl = sock_no_ioctl,
+	.listen = hvsock_listen,
+	.shutdown = hvsock_shutdown,
+	.setsockopt = hvsock_setsockopt,
+	.getsockopt = hvsock_getsockopt,
+	.sendmsg = hvsock_sendmsg,
+	.recvmsg = hvsock_recvmsg,
+	.mmap = sock_no_mmap,
+	.sendpage = sock_no_sendpage,
+};
+
+static int hvsock_create(struct net *net, struct socket *sock,
+			 int protocol, int kern)
+{
+	if (!capable(CAP_SYS_ADMIN) && !capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	if (protocol != 0 && protocol != SHV_PROTO_RAW)
+		return -EPROTONOSUPPORT;
+
+	switch (sock->type) {
+	case SOCK_STREAM:
+		sock->ops = &hvsock_ops;
+		break;
+	default:
+		return -ESOCKTNOSUPPORT;
+	}
+
+	sock->state = SS_UNCONNECTED;
+
+	return __hvsock_create(net, sock, GFP_KERNEL, 0) ? 0 : -ENOMEM;
+}
+
+static const struct net_proto_family hvsock_family_ops = {
+	.family = AF_HYPERV,
+	.create = hvsock_create,
+	.owner = THIS_MODULE,
+};
+
+static int hvsock_probe(struct hv_device *hdev,
+			const struct hv_vmbus_device_id *dev_id)
+{
+	struct vmbus_channel *channel = hdev->channel;
+
+	/* We ignore the error return code to suppress the unnecessary
+	 * error message in vmbus_probe(): on error the host will rescind
+	 * the offer in 30 seconds and we can do cleanup at that time.
+	 */
+	(void)hvsock_open_connection(channel);
+
+	return 0;
+}
+
+static int hvsock_remove(struct hv_device *hdev)
+{
+	struct vmbus_channel *channel = hdev->channel;
+
+	vmbus_close(channel);
+
+	return 0;
+}
+
+/* It's not really used. See vmbus_match() and vmbus_probe(). */
+static const struct hv_vmbus_device_id id_table[] = {
+	{},
+};
+
+static struct hv_driver hvsock_drv = {
+	.name		= "hv_sock",
+	.hvsock		= true,
+	.id_table	= id_table,
+	.probe		= hvsock_probe,
+	.remove		= hvsock_remove,
+};
+
+static int __init hvsock_init(void)
+{
+	int ret;
+
+	/* Hyper-V Sockets requires at least VMBus 4.0 */
+	if ((vmbus_proto_version >> 16) < 4) {
+		pr_err("failed to load: VMBus 4 or later is required\n");
+		return -ENODEV;
+	}
+
+	ret = vmbus_driver_register(&hvsock_drv);
+	if (ret) {
+		pr_err("failed to register hv_sock driver\n");
+		return ret;
+	}
+
+	ret = proto_register(&hvsock_proto, 0);
+	if (ret) {
+		pr_err("failed to register protocol\n");
+		goto unreg_hvsock_drv;
+	}
+
+	ret = sock_register(&hvsock_family_ops);
+	if (ret) {
+		pr_err("failed to register address family\n");
+		goto unreg_proto;
+	}
+
+	return 0;
+
+unreg_proto:
+	proto_unregister(&hvsock_proto);
+unreg_hvsock_drv:
+	vmbus_driver_unregister(&hvsock_drv);
+	return ret;
+}
+
+static void __exit hvsock_exit(void)
+{
+	sock_unregister(AF_HYPERV);
+	proto_unregister(&hvsock_proto);
+	vmbus_driver_unregister(&hvsock_drv);
+}
+
+module_init(hvsock_init);
+module_exit(hvsock_exit);
+
+MODULE_DESCRIPTION("Hyper-V Sockets");
+MODULE_LICENSE("Dual BSD/GPL");
-- 
2.1.0

^ permalink raw reply related

* Re: [PATCH 2/2] net-ath9k_htc: Replace a variable initialisation by an assignment in ath9k_htc_set_channel()
From: Julian Calaby @ 2016-04-08  1:40 UTC (permalink / raw)
  To: Kalle Valo
  Cc: ath9k-devel, linux-wireless, netdev, QCA ath9k Development, LKML,
	SF Markus Elfring, kernel-janitors, Julia Lawall
In-Reply-To: <5686C47E.1040906@users.sourceforge.net>

Hi Kalle,

On Sat, Jan 2, 2016 at 5:25 AM, SF Markus Elfring
<elfring@users.sourceforge.net> wrote:
> From: Markus Elfring <elfring@users.sourceforge.net>
> Date: Fri, 1 Jan 2016 19:09:32 +0100
>
> Replace an explicit initialisation for one local variable at the beginning
> by a conditional assignment.
>
> Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>

This looks sane to me.

Reviewed-by: Julian Calaby <julian.calaby@gmail.com>

Thanks,

Julian Calaby

> ---
>  drivers/net/wireless/ath/ath9k/htc_drv_main.c | 7 ++-----
>  1 file changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/net/wireless/ath/ath9k/htc_drv_main.c b/drivers/net/wireless/ath/ath9k/htc_drv_main.c
> index a680a97..30bd59e 100644
> --- a/drivers/net/wireless/ath/ath9k/htc_drv_main.c
> +++ b/drivers/net/wireless/ath/ath9k/htc_drv_main.c
> @@ -246,7 +246,7 @@ static int ath9k_htc_set_channel(struct ath9k_htc_priv *priv,
>         struct ieee80211_conf *conf = &common->hw->conf;
>         bool fastcc;
>         struct ieee80211_channel *channel = hw->conf.chandef.chan;
> -       struct ath9k_hw_cal_data *caldata = NULL;
> +       struct ath9k_hw_cal_data *caldata;
>         enum htc_phymode mode;
>         __be16 htc_mode;
>         u8 cmd_rsp;
> @@ -274,10 +274,7 @@ static int ath9k_htc_set_channel(struct ath9k_htc_priv *priv,
>                 priv->ah->curchan->channel,
>                 channel->center_freq, conf_is_ht(conf), conf_is_ht40(conf),
>                 fastcc);
> -
> -       if (!fastcc)
> -               caldata = &priv->caldata;
> -
> +       caldata = fastcc ? NULL : &priv->caldata;
>         ret = ath9k_hw_reset(ah, hchan, caldata, fastcc);
>         if (ret) {
>                 ath_err(common,
> --
> 2.6.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Julian Calaby

Email: julian.calaby@gmail.com
Profile: http://www.google.com/profiles/julian.calaby/

^ permalink raw reply

* mpls's find_outdev()
From: David Miller @ 2016-04-08  1:41 UTC (permalink / raw)
  To: roopa; +Cc: netdev


While auditing something unrelated, I noticed that this function seems
to potentially dev_put() on an error pointer.

I guess the problem comes from the fact that there are two methods
by which the device pointer is obtained.

First, inet{,6}_fib_lookup_dev() which uses error pointers.

Second, dev_get_by_index() which returns a valid device or NULL,
and therefore does not use error pointers.

If inet{,6}_fib_lookup_dev() returns an error pointer, the !dev check
will not pass and dev_put() will operate on an error pointer and
crash.

If you agree with my analysis, could you please cook up and test a
fix?

Thank you!

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox